Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics

Page 1: Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics

Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics

By Chin-Yew Lin and Eduard Hovy

Page 2: Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics

The Document Understanding Conference

In 2002 there were two main tasks:
• Summarization of single documents
• Summarization of multiple documents

Page 3: Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics

DUC Single Document Summarization

Summarization of single documents
• Generate a 100-word summary
• Training: 30 sets of 10 docs, each with 100-word summaries
• Test against 30 unseen documents

Page 4: Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics

DUC Multi-Document Summarization

Summarization of multiple documents about a single subject
• Generate 50-, 100-, 200-, and 400-word summaries
• Four types: single natural disaster, single event, multiple instances of a type of event, information about an individual
• Training: 30 sets of 10 documents with their 50-, 100-, 200-, and 400-word summaries
• Test: 30 unseen documents

Page 5: Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics

DUC Evaluation Material

For each document set, one human summary was created to be the ‘Ideal’ summary at each length.
Two additional human summaries were created at each length.
Baseline summaries were created automatically for each length as reference points:
• The lead baseline took the first n words of the last document for the multi-doc task
• The coverage baseline used the first sentence of each doc until it reached its length

Page 6: Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics

SEE: Summary Evaluation Environment

A tool to allow assessors to compare system text (peer) with ideal text (model). It can rank quality and content.
The assessor marks all system units sharing content with the model as {all, most, some, hardly any}.
The assessor rates the quality of grammaticality, cohesion, and coherence as {all, most, some, hardly any, none}.

Page 7: Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics

SEE interface

Page 8: Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics

Making a Judgement

From Chin-Yew Lin / MT Summit IX, 2003-09-27

Page 9: Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics

Evaluation Metrics

One idea is simple sentence recall, but it cannot differentiate system performance (it pays to be over-productive).
Recall is measured relative to the model text.
E is the average of the coverage scores.
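The coverage score E above can be sketched in Python. The numeric mapping from the assessor judgments {all, most, some, hardly any, none} to scores is an assumed example here, not the official DUC values:

```python
# Hypothetical sketch of the DUC coverage metric: each model unit gets a
# judgment from the assessor, and E averages the numeric coverage scores.
# The mapping below is an illustrative assumption, not the official values.
COVERAGE = {"all": 1.0, "most": 0.75, "some": 0.5, "hardly any": 0.25, "none": 0.0}

def coverage_score_E(judgments):
    """E = average of per-model-unit coverage scores."""
    if not judgments:
        return 0.0
    return sum(COVERAGE[j] for j in judgments) / len(judgments)

# Example: four model units judged by the assessor
print(coverage_score_E(["all", "most", "some", "none"]))  # 0.5625
```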

Page 10: Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics

Machine Translation and Summarization Evaluation

Machine Translation
• Inputs: reference translation, candidate translation
• Methods:
  Manually compare the two translations in accuracy, fluency, and informativeness
  Automatic evaluation using BLEU/NIST scores

Auto Summarization
• Inputs: reference summary, candidate summary
• Methods:
  Manually compare the two summaries in content overlap and linguistic qualities
  Automatic evaluation: ?

Page 11: Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics

NIST BLEU
Goal: measure the translation closeness between a candidate translation and a set of reference translations with a numeric metric.
Method: use a weighted average of variable-length n-gram matches between the system translation and the set of human reference translations.
BLEU correlates highly with human assessments.
We would like to make the same assumption for summaries: the closer a summary is to a professional summary, the better it is.

Page 12: Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics

BLEU
• Is a promising automatic scoring metric for summary evaluation
• Is basically a precision metric
• Measures how well a source overlaps a model using n-gram co-occurrence statistics
• Uses a brevity penalty (BP) to prevent short translations that try to maximize their precision score
• In the formulas, c = candidate length and r = reference length
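A minimal single-reference sketch of this in Python, using the standard BLEU definitions: BP = 1 if c > r and e^(1 - r/c) otherwise, with clipped n-gram precisions combined by a uniform-weight geometric mean. The real metric uses multiple references and smoothing; this sketch simply returns 0 when any n-gram order has no match.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counts of all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Illustrative single-reference BLEU with the brevity penalty (BP)."""
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # penalize short candidates
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())  # clipped matches
        total = max(sum(cand.values()), 1)
        if overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # uniform-weight geometric mean of the n-gram precisions
    return bp * math.exp(sum(log_precisions) / max_n)

candidate = "the cat sat on the mat".split()
reference = "the cat sat on a mat".split()
print(bleu(candidate, reference))
```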

Page 13: Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics

Anatomy of the BLEU Matching Score

From Chin-Yew Lin / MT Summit IX, 2003-09-27

Page 14: Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics

ROUGE: Recall-Oriented Understudy for Gisting Evaluation

From Chin-Yew Lin / MT Summit IX, 2003-09-27

Page 15: Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics

What makes a good metric?

An automatic evaluation should correlate highly, positively, and consistently with human assessments.
• If a human recognizes a good system, so will the metric.
The statistical significance of automatic evaluations should be a good predictor of the statistical significance of human assessments, with high reliability.
• The system can then be used to assist in system development in place of humans.

Page 16: Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics

ROUGE vs. BLEU
ROUGE: recall based
• Separately evaluates 1-, 2-, 3-, and 4-grams
• No length penalty
• Verified for extraction summaries
• Focuses on content overlap

BLEU: precision based
• Mixed n-grams
• Uses a brevity penalty to penalize system translations that are shorter than the average reference length
• Favors longer n-grams for grammaticality or word order
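The recall-based n-gram matching that ROUGE uses can be sketched as follows. This is a simplified single-reference version: clipped n-gram matches are divided by the total n-gram count of the reference summary, rather than of the candidate as in BLEU's precision.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Counts of all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n):
    """Recall: matched n-grams / total n-grams in the reference summary."""
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    total_ref = sum(ref.values())
    if total_ref == 0:
        return 0.0
    return sum((cand & ref).values()) / total_ref  # clipped matches

candidate = "the cat was on the mat".split()
reference = "the cat sat on the mat".split()
print(rouge_n(candidate, reference, 1))  # 5 of 6 reference unigrams matched
```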

Page 17: Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics

By all measures

Page 18: Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics

Findings
• Ngram(1,4) is a weighted variable-length n-gram match score similar to BLEU.
• Simple unigrams, Ngram(1,1), and bigrams, Ngram(2,2), consistently outperformed Ngram(1,4) in single- and multiple-document tasks when stopwords are ignored.
• Weighted-average n-gram scores fall between bigram and trigram scores, suggesting summaries are over-penalized by the weighted average due to a lack of longer n-gram matches.
• Excluding stopwords when computing n-gram statistics generally achieves better correlation than including them.
• Ngram(1,1) and Ngram(2,2) are good automatic scoring metrics based on statistical predictive power.
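Under the assumption that Ngram(1,4) combines the per-order recall scores with a BLEU-style uniform geometric mean (the paper's exact weighting may differ), a variable-length score like the one in the findings can be sketched as:

```python
import math
from collections import Counter

def ngram_recall(candidate, reference, n):
    """Clipped n-gram matches divided by reference n-gram count."""
    grams = lambda t: Counter(tuple(t[i:i + n]) for i in range(len(t) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    total = sum(ref.values())
    return sum((cand & ref).values()) / total if total else 0.0

def ngram_score(candidate, reference, lo=1, hi=4):
    """Ngram(lo,hi): assumed here to be a uniform geometric mean of the
    n-gram recall scores for n in [lo, hi], by analogy with BLEU."""
    scores = [ngram_recall(candidate, reference, n) for n in range(lo, hi + 1)]
    if any(s == 0.0 for s in scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))
```

Setting lo = hi recovers the single-order scores: ngram_score(c, r, 1, 1) is Ngram(1,1) and ngram_score(c, r, 2, 2) is Ngram(2,2), which the findings report as the best predictors.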