Wednesday, April 20, 2005

Summarizer ranks sentences

Summarizer ranks sentences
April 20/27, 2005
By Kimberly Patch, Technology Research News

Because computers don't understand the meanings of words and sentences,
automating the seemingly simple task of summarizing a news story using
several sources is a major computer science challenge.

Key to meeting the challenge is finding a way to identify the most
important sentences from a set of documents on the same subject.

Researchers from the University of Michigan have developed a
multi-document summarization technique that compares sentences and has
the effect of sentences voting for the most important among them.

The method, dubbed LexRank, combines the content-sorting concepts of
prestige and lexical similarity to find the most important sentences in
a group of documents on the same subject.

Algorithms that use prestige to sort information have been around since
the '90s. It is possible to find the most prestigious, or popular member
of a network by analyzing the relationships among network members. In a
social network, for example, the most prestigious individual can be
identified by analyzing the social relations among all pairs of members
of the group.

The PageRank algorithm that powers Google takes advantage of this
concept. It assigns a Web site a prestige score based on the number of
other sites that point to it and the prestige of those sites. This works
to rank Web pages on a network because the pages are connected. In the
case of multi-document summarization, however, such hyperlinks are not
available, said Dragomir Erkan, an assistant professor of information,
electrical engineering and computer science and linguistics at the
University of Michigan.

Instead, the algorithm uses the similarities among sentences.

The researchers' lexical centrality algorithm compares the lexical
similarity of sentences. "Lexical similarity can be thought of as a
measure of the word overlap between two sentences," said Erkan. "For
example, 'Bush went to China' and 'George Bush visited China' are fairly
similar in a lexical way [but] 'Bush visited China' and 'Blair is the
prime minister of the United Kingdom' have no overlap at all," he said.

The algorithm allows the researcher to pick a threshold to indicate the
point at which two sentences start to become similar.

There are many possible factors that can be used to assess the lexical
similarity of a pair of sentences, said Radev. "We chose to weigh the
contribution of each word... by its relative informativeness," he said.
"Rare words like 'Igor', 'Taha' and 'disarmament' are more informative
than common ones like 'today', 'between', and 'November'."

The researchers' system considers a sentence important if it is similar
to many other sentences and if those other sentences are themselves
important. "In a sense, sentences vote for each other just by virtue of
being similar to each other," said Radev. "The sentences with the
highest scores... are considered to contain the gist of the document and
are presented as the multi-document summary," he said.

In contrast, the state-of-the-art method -- dubbed Centroid --
calculates a pseudo sentence that is the average of all the sentences in
a set of documents, and calculates how similar each sentence is to this
"centroid" sentence.

The researchers have applied the method to a prototype of their news
clustering Web site. "For each cluster of related stories, we compute
the pairwise similarity between all sentence pairs, then apply the
[lexical] centrality algorithm," to parse out the important sentences,
which become the summary of the document cluster, said Radev.

The most important realization in doing the research was that the
patterns of language in the multi-document summarization task are
similar to seemingly unrelated natural phenomena such as the patterns of
links among Web pages, social interactions and electrical components,
said Radev.

The researchers are planning to incorporate the method into their
NewsInEssence Web site, which crawls the Web for news stories, clusters
them into topical groups, and summarizes each group.

The researchers are also looking for other uses of the lexical
centrality algorithm. Possibilities include automatic translation and
question answering, said Radev. The method could potentially find
sentences that are likeliest to contain the answer to a given natural
language question, or, in the biomedical domain, sentences that are most
likely to contain important facts like particular protein interactions,
said Radev.

The researchers' experiments show that the method has the potential to
yield summaries as good as those of state-of-the-art summarization
systems, said Lillian Lee, an associate professor of computer science at
Cornell University. "This general field of investigation is one that
seems very promising," she said.

Radev's research colleague was Günes Erkan. The researchers presented
the work at the Conference on Empirical Methods in Natural Language
Processing (EMNLP 2004), held July 25 to 26, 2004 in Barcelona, Spain.
The research was funded by the National Science Foundation (NSF).

0 Comments:

Post a Comment

<< Home