Summarizing Encyclopedic Term Descriptions on the Web
from Coling 2004
Atsushi Fujii and Tetsuya Ishikawa
Graduate School of Library, Information and Media Studies, University of Tsukuba
motivation
Existing encyclopedias often lack new terms and new definitions for existing terms
The Web contains an enormous volume of up-to-date information and is a source for obtaining new term descriptions
Using an existing search engine has many problems
search engine??
Search engines often retrieve extraneous pages that do not describe the submitted term
A user has to identify page fragments describing the term
Descriptions in multiple pages are independent
Word senses are not distinguished for ambiguous terms
They propose a summarization method that produces a concise and condensed term description from multiple paragraphs
In this paper, they focus on Japanese technical terms in the computer domain
Overview of CYCLONE
Summarization Method
Given a set of paragraph-style descriptions for a single term in a specific domain, their summarization method produces a concise text describing the term from different viewpoints
12 viewpoints in the computer domain: definition, abbreviation, exemplification, purpose, synonym, reference, product, advantage, drawback, history, component, function
Four steps
Identification: Recognize the language unit associated with a viewpoint
Classification: Merge units with the same viewpoint into a single group
Selection: Determine one or more representative units for each group
Presentation: Produce a summary in a format
Identification
A sentence is often associated with multiple viewpoints
e.g. XML is an abbreviation for eXtensible Markup Language, and is a markup language
Segment Japanese sentences into simple sentences; zero pronoun detection and anaphora resolution can be used
XML is an abbreviation for eXtensible Markup Language (abbreviation viewpoint)
XML is a markup language (definition viewpoint)
Classification
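12 viewpoints
36 linguistic patterns typically used to describe terms from a specific viewpoint
A simple sentence that matches one of these patterns is classified into the corresponding viewpoint group
A simple sentence that matches patterns for multiple viewpoints is classified into the corresponding viewpoint groups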
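The pattern-based step can be sketched roughly as follows. This is only an illustration: the slides do not list the 36 patterns (which target Japanese), so the three English regular expressions, the `PATTERNS` table, and the `classify` function are hypothetical stand-ins.

```python
import re
from collections import defaultdict

# Hypothetical English stand-ins for viewpoint patterns.
# The actual 36 patterns are for Japanese and are not given in the slides.
PATTERNS = {
    "abbreviation": [re.compile(r"\bis an abbreviation (for|of)\b", re.I)],
    "definition": [re.compile(r"\bis an? (?!abbreviation)\w+", re.I)],
    "purpose": [re.compile(r"\bis used (to|for)\b", re.I)],
}

def classify(sentences):
    """Put each simple sentence into every viewpoint group whose pattern it matches.

    Returns (groups, unclassified), where unclassified holds the sentences
    that matched no pattern.
    """
    groups = defaultdict(list)
    unclassified = []
    for s in sentences:
        matched = [vp for vp, pats in PATTERNS.items()
                   if any(p.search(s) for p in pats)]
        for vp in matched:
            groups[vp].append(s)
        if not matched:
            unclassified.append(s)
    return dict(groups), unclassified
```

A sentence such as "XML is a markup language that is used to exchange data" would land in both the definition and the purpose group under these assumed patterns.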
Classification (cont)
What about sentences that do not match any pattern?
Classify each remaining sentence into the group to which its most similar sentence belongs
Compute the similarity between an unclassified sentence and each of the classified sentences (Dice coefficient)
“miscellaneous” group
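A minimal sketch of the similarity-based fallback, assuming the Dice coefficient is computed over sets of whitespace-separated words (the slides do not say which units are compared, and Japanese text would need a morphological analyzer instead). The names `dice` and `assign_unclassified` are illustrative.

```python
def dice(a, b):
    """Dice coefficient between two token collections: 2|A n B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def assign_unclassified(sentence, groups, tokenize=str.split):
    """Return (viewpoint, similarity) of the most similar classified sentence.

    groups maps a viewpoint name to the sentences already classified into it.
    """
    tokens = tokenize(sentence.lower())
    best_vp, best_sim = None, -1.0
    for vp, members in groups.items():
        for m in members:
            sim = dice(tokens, tokenize(m.lower()))
            if sim > best_sim:
                best_vp, best_sim = vp, sim
    return best_vp, best_sim
```

For example, with `groups = {"definition": ["xml is a markup language"], "history": ["xml was published in 1998"]}`, the sentence "XML is a language for documents" shares four words with the definition sentence and only one with the history sentence, so it is assigned to the definition group.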
example
Selection
The number of sentences selected from each group depends on the desired size of the resultant summary
Compute a score for each sentence and select the sentences with the highest scores in each group
Number of common words included (W): sentences including frequent words are preferred
Rank order in CYCLONE (R)
Number of characters included (C): short sentences are preferred
Normalize each factor and compute the final score as a weighted average of the three factors above (W>R>C)
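The scoring could look something like the sketch below. Several details are assumptions, since the slides give only the ordering W>R>C: the weights (0.5, 0.3, 0.2), the min-max normalization within a group, and the treatment of a smaller rank and fewer characters as better. The input format (dicts with `text`, `common_words`, `rank`) is also hypothetical.

```python
def min_max(values, higher_is_better=True):
    """Scale values to [0, 1]; flip the scale when smaller values are preferred."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]

def select(group, n=1, weights=(0.5, 0.3, 0.2)):
    """Return the texts of the n best sentences in one viewpoint group.

    group: list of dicts with 'text', 'common_words' (W) and 'rank' (R);
    the character count (C) is taken from len(text).
    weights: assumed values that only respect the ordering W > R > C.
    """
    w = min_max([g["common_words"] for g in group])
    r = min_max([g["rank"] for g in group], higher_is_better=False)
    c = min_max([len(g["text"]) for g in group], higher_is_better=False)
    ww, wr, wc = weights
    scored = sorted(
        zip(group, w, r, c),
        key=lambda t: ww * t[1] + wr * t[2] + wc * t[3],
        reverse=True,
    )
    return [g["text"] for g, *_ in scored[:n]]
```

Raising `n` corresponds to asking for a longer summary, as noted above.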
Selection (cont)
For the miscellaneous group, they select the sentence most dissimilar to the representative sentences selected from the regular groups
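One possible reading of "most dissimilar" is sketched below: rank each miscellaneous sentence by its highest Dice similarity to any representative sentence and keep the lowest. Whether the method uses the maximum, the average, or some other aggregate is not stated in the slides, so this choice and the name `pick_miscellaneous` are assumptions.

```python
def dice(a, b):
    """Dice coefficient between two token collections."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

def pick_miscellaneous(misc, representatives, n=1):
    """Return the n miscellaneous sentences least similar to the representatives."""
    rep_tokens = [r.lower().split() for r in representatives]

    def closeness(sentence):
        # Highest similarity to any representative (assumed aggregate).
        toks = sentence.lower().split()
        return max((dice(toks, r) for r in rep_tokens), default=0.0)

    return sorted(misc, key=closeness)[:n]
```

The intent is that the miscellaneous sentences add information the viewpoint groups do not already cover.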
Presentation
Top 50 paragraphs for the term “XML”
Only one sentence was selected from each group
Each viewpoint label or sentence is hyperlinked to the associated group or the source paragraph
Presentation (cont)
Evaluation
Summarization evaluation can be classified into intrinsic and extrinsic approaches
Intrinsic: the quality of a text, informativeness
Extrinsic: whether a summary improves the efficiency of a specific task
Evaluation (cont)
15 Japanese terms are used as test inputs
In order to calculate the coverage, for each of the 15 terms, two students annotated each simple sentence in the top 50 paragraphs of the CYCLONE results with one or more viewpoints
They defined 28 viewpoints, including the 12 viewpoints
Compression ratio and coverage were calculated from the top 50 paragraphs
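The slides do not spell out the two measures. A plausible sketch, assuming compression ratio is the character count of the summary divided by that of the source paragraphs, and coverage is the fraction of annotated viewpoints in the source paragraphs that also appear in the summary:

```python
def compression_ratio(summary, paragraphs):
    """Characters in the summary divided by characters in the source paragraphs."""
    return len(summary) / sum(len(p) for p in paragraphs)

def coverage(summary_viewpoints, source_viewpoints):
    """Fraction of the viewpoints found in the source that the summary retains."""
    source = set(source_viewpoints)
    return len(set(summary_viewpoints) & source) / len(source)
```

Under these assumed definitions, a lower compression ratio means a shorter summary and a higher coverage means fewer viewpoints were lost.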
Results
#Reps: the number of representative sentences selected from each viewpoint group
#Chars: the number of characters in a summary
They select five sentences from the miscellaneous group
VBS: viewpoint-based summarization method
Lead: systematically extracts the top N characters from the CYCLONE results
Conclusion
To compile encyclopedic term descriptions from the Web, they introduced a summarization method
They identify simple sentences, classify those sentences into viewpoint groups, select representative sentences from each group, and present them
VBS achieved a good compression ratio, and its coverage score is better than the baseline
Future work includes generating a coherent text and performing an extrinsic evaluation