
Corpora and Statistical Methods


Albert Gatt

In this lecture:
- Introduction to Natural Language Generation (NLG)
- The use of corpora & statistical models in NLG
- Summarisation: single-document and multi-document
- Evaluation using corpora: BLEU/NIST/ROUGE and related metrics

Part 1: Natural Language Generation

What is NLG?
NLG systems aim to produce understandable texts (in English or other languages), typically from non-linguistic input.

Examples:

- Automatic generation of weather reports. Input: data in the form of numbers (from Numerical Weather Prediction models). Output: a short text representing a weather forecast. Many systems have been developed in this domain.
- STOP: generates smoking-cessation letters based on a user-input questionnaire (http://www.csd.abdn.ac.uk/research/stop/)

Weather report example

S 8-13 increasing 13-18 by early morning, then backing NNE 18-23 by morning, and veering S 13-18 by midday, then easing 8-13 by midnight.

S 8-13 increasing 18-23 by morning, then easing 8-13 by midnight.

SUMTIME: http://cgi.csd.abdn.ac.uk/~ssripada/cgi_bin/startSMT.cgi

NLG in dialogue systems
Dialogue fragment:
System1: Welcome... What airport would you like to fly out of?
User2: I need to go to Dallas.
System3: Flying to Dallas. What departure airport was that?
User4: From Newark on September the 1st.

What should the system say next? Plan for the next utterance (after analysis of User4):
- implicit-confirm(orig-city:NEWARK)
- implicit-confirm(dest-city:DALLAS)
- implicit-confirm(month:9)
- implicit-confirm(day-number:1)
- request(depart-time)

Output of the next utterance: "What time would you like to travel on September the 1st to Dallas from Newark?"

Walker et al. (2001). SPoT: A trainable sentence planner. Proc. NAACL.

Types of input to an NLG system
- Raw data (e.g. weather report systems): typical of data-to-text systems; these systems need to pre-analyse the data.

- Knowledge base: symbolic information (e.g. a database of available flights).
- Content plan: a representation of what to communicate (usually in some canonical representation), e.g. a complete story plan (STORYBOOK).
- Other sources: discourse/dialogue history, keeping track of what has been said in order to inform planning.

The architecture of NLG systems
A pipeline architecture:
- represents a consensus view of what NLG systems actually do
- very modular
- not all implemented systems conform 100% to this architecture

The pipeline: a communicative goal is passed to the Document Planner (content selection), which produces a document plan; the Microplanner (text planner) turns this into a text specification; and the Surface Realiser produces the final text.

Concrete example: the BabyTalk systems (Portet et al. 2009)
- summarise data about a patient in a Neonatal Intensive Care Unit
- main purpose: generate a summary that can be used by a doctor/nurse to make a clinical decision

F. Portet et al. (2009). Automatic generation of textual summaries from neonatal intensive care data. Artificial Intelligence.

A micro example

Example output: "There were 3 successive bradycardias down to 69."
Input data: an unstructured raw numeric signal from the patient's heart rate monitor (ECG).

A micro example: pre-NLG steps

(1) Signal analysis (pre-NLG): identify interesting patterns in the data; remove noise.
(2) Data interpretation (pre-NLG): estimate the importance of events; perform linking & abstraction.

Document planning / content selection
Main tasks:
- Content selection
- Information ordering

Typical output is a document plan:
- a tree whose leaves are messages
- nonterminals indicate rhetorical relations between messages (Mann & Thompson 1988), e.g. justify, part-of, cause, sequence

A micro example: document planning

(1) Signal analysis (pre-NLG): identify interesting patterns in the data; remove noise.
(2) Data interpretation (pre-NLG): estimate the importance of events; perform linking & abstraction.

(3) Document planning: select content based on importance; structure the document using rhetorical relations; communicative goals (here: assert something).

A micro example: microplanning
Lexicalisation: there are many ways to express the same thing, and many ways to express a relationship, e.g. SEQUENCE(x,y,z):
- x happened, then y, then z
- x happened, followed by y and z
- x, y, z happened
- there was a sequence of x, y, z
Many systems make use of a lexical database.

A micro example: microplanning
Aggregation: given 2 or more messages, identify ways in which they could be merged into one, more concise message, e.g. be(HR, stable) + be(HR, normal):
- (No aggregation) HR is currently stable. HR is within the normal range.
- (Conjunction) HR is currently stable and HR is within the normal range.
- (Adjunction) HR is currently stable within the normal range.

A micro example: microplanning
Referring expressions: given an entity, identify the best way to refer to it, e.g. for BRADYCARDIA:
- "bradycardia"
- "it"
- "the previous one"
This depends on discourse context! (Pronouns only make sense if the entity has been referred to before.)

A micro example

(4) Microplanning: map events to a semantic representation
- lexicalise: "bradycardia" vs "sudden drop in HR"
- aggregate multiple messages (3 bradycardias = one sequence)
- decide how to refer ("bradycardia" vs "it")

A micro example: realisation
Subtasks:
- map the output of microplanning to a syntactic structure; this means identifying the best form given the input representation (typically there are many alternatives: which is the best one?)
- apply inflectional morphology (plural, past tense, etc.)
- linearise as a text string

A micro example

(4) Microplanning: map events to a semantic representation
- lexicalise: "bradycardia" vs "sudden drop in HR"
- aggregate multiple messages (3 bradycardias = one sequence)
- decide how to refer ("bradycardia" vs "it")
- choose the sentence form ("there were ...")

[Syntax tree on the slide: "there" (PRO) + VP (+past) headed by V "be" + NP (+pl) "three successive bradycardias" + PP "down to 69".]

(5) Realisation: map semantic representations to syntactic structures; apply word-formation rules.

Rules vs statistics
Many NLG systems are rule-based, but there is a growing trend to use statistical methods.

Main aims:
- increase linguistic coverage (e.g. of a realiser) cheaply
- develop techniques for fast building of a complete system

Using statistical methods: language models and realisation
Advantages of using statistics: construction of NLG systems is extremely laborious! (e.g. the BabyTalk system took ca. 4 years with 3-4 developers)

Many statistical approaches focus on specific modules. The best-studied is statistical realisation: realisers that take input in some canonical form and rely on language models to generate output. Advantages: they are easily ported to new domains/applications, and coverage can be increased with more data/training examples.

Overgeneration and ranking
The approaches we will consider rely on an overgenerate-and-rank approach:

Given: input specification (semantics or canonical form)

1. Use a simple rule-based generator to produce many alternative realisations.
2. Rank them using a language model.
3. Output the best (= most probable) realisation.

Advantages of overgeneration + ranking
There are usually many ways to say the same thing, e.g. for ORDER(eat(you,chicken)):
- Eat chicken!
- It is required that you eat chicken!
- It is required that you eat poulet!
- Poulet should be eaten by you.
- You should eat chicken/chickens.
- Chicken/Chickens should be eaten by you.
A language model can rank such alternatives; a minimal sketch is given below.
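To make the ranking step concrete, here is a minimal, self-contained sketch (not any particular published system): a bigram language model with add-one smoothing is estimated from a tiny invented corpus and used to pick the most probable candidate realisation. The corpus, the candidates and the smoothing choice are all assumptions made for illustration.

```python
# Overgenerate-and-rank sketch: score candidate realisations with a toy
# add-one-smoothed bigram language model and keep the most probable one.
import math
from collections import Counter

corpus = [
    "you should eat chicken",
    "eat chicken",
    "chicken should be eaten",
    "you should eat",
]

def bigrams(tokens):
    padded = ["<s>"] + tokens + ["</s>"]
    return list(zip(padded, padded[1:]))

unigram_counts = Counter()
bigram_counts = Counter()
for sent in corpus:
    toks = sent.split()
    unigram_counts.update(["<s>"] + toks + ["</s>"])
    bigram_counts.update(bigrams(toks))

vocab_size = len(unigram_counts)

def log_prob(sentence):
    """Add-one smoothed bigram log-probability of a sentence."""
    score = 0.0
    for w1, w2 in bigrams(sentence.split()):
        score += math.log((bigram_counts[(w1, w2)] + 1) /
                          (unigram_counts[w1] + vocab_size))
    return score

candidates = [
    "eat chicken",
    "poulet should be eaten by you",
    "you should eat chicken",
]
best = max(candidates, key=log_prob)
print(best)  # the candidate the toy bigram model prefers
```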

Where does the data come from?
Some statistical NLG systems were built from parallel data/text corpora:
- this allows direct learning of correspondences between content and output
- but such corpora are rarely available

Some work relies on the Penn Treebank:
- extract input: process the treebank to extract canonical specifications from parsed sentences
- train a language model
- re-generate using a realiser and evaluate against the original treebank

Extracting input from the treebank
Penn Treebank input:

C. Callaway (2003). Evaluating coverage for large, symbolic NLG grammars. Proc. IJCAI.

Extracting input from the treebank
Converted into the required input representation:

C. Callaway (2003). Evaluating coverage for large, symbolic NLG grammars. Proc. IJCAI.

Nitrogen and HALogen
Pioneering realisation systems with wide coverage (i.e. they handle many phenomena of English grammar), based on overgeneration/ranking. HALogen (Langkilde-Geary 2002) is a successor to Nitrogen (Langkilde 1998); the main differences are the data structure used to represent the possible realisation alternatives, and that HALogen handles more grammatical features.

Structure of HALogen
The symbolic generator contains rules to map the input representation to syntactic structures, plus a lexicon and morphology.

It produces multiple outputs, represented in a forest; a statistical ranker (an n-gram model trained on the Penn Treebank) then selects the best sentence.

HALogen input
Grammatical specification:
(e1 / eat
  :subject (d1 / dog)
  :object (b1 / bone :premod (m1 / meaty))
  :adjunct (t1 / today))

Semantic specification:
(e1 / eat
  :agent (d1 / dog)
  :patient (b1 / bone :premod (m1 / meaty))
  :temp-loc (t1 / today))

This is a labelled feature-value representation specifying properties and relations of domain objects (e1, d1, etc.). It is recursively structured and order-independent, and can be either grammatical or semantic (or a mixture of both); a recasting mechanism maps from one to the other.

HALogen base generator
Consists of about 255 hand-written rules, which map an input representation into a packed set of possible output expressions. Each part of the input is recursively processed by the rules, until only a string is left. Types of rules:
- recasting
- ordering
- filling
- morphing

Recasting
Map the semantic input representation to one that is closer to surface syntax.

The semantic specification
(e1 / eat
  :patient (b1 / bone :premod (m1 / meaty))
  :temp-loc (t1 / today)
  :agent (d1 / dog))
becomes the grammatical specification
(e1 / eat
  :object (b1 / bone :premod (m1 / meaty))
  :adjunct (t1 / today)
  :subject (d1 / dog))
Example rule: IF relation = :agent AND the sentence is not passive, THEN map the relation to :subject.

Ordering
Assign a linear order to the values in the input, e.g. put the subject first unless the sentence is passive; put adjuncts sentence-finally:
(e1 / eat
  :subject (d1 / dog)
  :object (b1 / bone :premod (m1 / meaty))
  :adjunct (t1 / today))

Filling
If the input is under-specified for some features, add all the possible values for them. NB: this allows for different degrees of specification, from minimally to maximally specified input. Filling can create multiple copies of the same input, e.g. one copy with :TENSE (past) and one with :TENSE (present).

Morphing
Given the properties of parts of the input, add the correct inflectional features, e.g. (e1 / eat :tense (past) ...) becomes (e1 / ate ...).

A toy sketch of the recasting step is given below.
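As an illustration of what a recasting rule does, here is a toy sketch. The nested-dictionary representation and the role-to-relation mapping are assumptions made for this example; HALogen's actual rule engine and data structures differ.

```python
# Toy recasting: map semantic roles to grammatical relations unless the
# sentence is marked passive, recursing into nested frames.
SEMANTIC_TO_GRAMMATICAL = {
    ":agent": ":subject",
    ":patient": ":object",
    ":temp-loc": ":adjunct",
}

def recast(frame, passive=False):
    """Return a copy of the frame with semantic relations recast as grammatical ones."""
    recasted = {}
    for relation, value in frame.items():
        if not passive and relation in SEMANTIC_TO_GRAMMATICAL:
            relation = SEMANTIC_TO_GRAMMATICAL[relation]
        # Recurse into nested frames (e.g. the bone/meaty sub-frame).
        recasted[relation] = recast(value, passive) if isinstance(value, dict) else value
    return recasted

semantic = {
    "head": "eat",
    ":agent": {"head": "dog"},
    ":patient": {"head": "bone", ":premod": {"head": "meaty"}},
    ":temp-loc": {"head": "today"},
}
print(recast(semantic))
# {'head': 'eat', ':subject': {...}, ':object': {...}, ':adjunct': {...}}
```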

The output of the base generator
Problem: a single input may have literally hundreds of possible realisations after base generation; these need to be represented efficiently to facilitate the search for the best output. Options: a word lattice, or a forest of trees.

Option 1: lattice structure (Langkilde 2000)

"You may have to eat chicken": 576 possibilities!

Properties of lattices
In a lattice, a complete left-right path represents a possible sentence.

Lots of duplication! E.g. the same word ("chicken") occurs multiple times, so the ranker will score the same substring more than once.

In a lattice path, every word is dependent on all other words; we can't model local dependencies.

Option 2: forests (Langkilde 2000, 2002)
[Forest diagram on the slide: an OR node over alternative S structures (S.328, S.358) sharing constituents such as PRP.3 "you", VP "to be eaten by", and NP.318 "the chicken".]

Properties of forests
Efficient representation:
- each individual constituent is represented only once, with pointers
- the ranker only computes a partial score for a subtree once
- several alternatives are represented by disjunctive (OR) nodes

This is equivalent to a non-recursive context-free grammar, with rules such as S.469 -> S.328 and S.469 -> S.358.

Statistical ranking
The statistical ranker uses an n-gram language model to choose the best realisation r*: roughly, r* = argmax_r P(r), where P(r) is the language model probability of the candidate string r.

Performance of HALogen
Minimally specified input frame (bigram model): "It would sell its fleet age of Boeing Co. 707s because of maintenance costs increase the company announced earlier."

Minimally specified input frame (trigram model): "The company earlier announced it would sell its fleet age of Boeing Co. 707s because of the increase maintenance costs."

Almost fully specified input frame: "Earlier the company announced it would sell its aging fleet of Boeing Co. 707s because of increased maintenance costs."

Observations
The usual issues with n-gram models apply: bigger n gives better output, but more data sparseness.

The models are domain dependent, but relatively easy to train, assuming a corpus in the right format.

Evaluation
How should an NLG system/module be evaluated?

Evaluation in NLG
Types of evaluation:
- Intrinsic: evaluate the output in its own right (linguistic quality, etc.)
- Extrinsic: evaluate the output in the context of a task with target users

Intrinsic evaluation of realisation output often relies on metrics like BLEU and NIST.

BLEU: modified n-gram precision
- Let t be a translation/generated text.
- Let {r1, ..., rm} be a set of reference translations/texts.
- Let N be the maximum n-gram length (usually 4).

For each n from 1 to N:
  for each n-gram in t:
    max_ref_count := the maximum number of times the n-gram occurs in any single reference r
    clipped_count := min(count, max_ref_count)
  score_n := total clipped counts / total unclipped counts

The scores for the different n-gram lengths are combined using a geometric mean, and a brevity penalty is applied to avoid favouring very short outputs.

BLEU example (unigram)
t  = the the the the the the
r1 = the dog ate the meat pie
r2 = the dog ate a meat pie

There is only one unigram type ("the") in t:
- max_ref_count = 2 (the maximum number of times "the" occurs in a single reference)
- clipped_count = min(count, max_ref_count) = min(6, 2) = 2
- score = clipped_count / count = 2/6
A sketch of this computation follows.
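A minimal sketch of the modified (clipped) n-gram precision just described, reproducing the worked example; a full BLEU implementation would also combine several n-gram lengths with a geometric mean and apply the brevity penalty.

```python
# Modified (clipped) n-gram precision of a candidate against a set of references.
from collections import Counter

def clipped_precision(candidate, references, n=1):
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand_counts = ngrams(candidate.split())
    ref_counts = [ngrams(r.split()) for r in references]
    clipped = 0
    for gram, count in cand_counts.items():
        max_ref = max(rc[gram] for rc in ref_counts)   # max count in any single reference
        clipped += min(count, max_ref)                 # clip the candidate count
    return clipped / sum(cand_counts.values())

t = "the the the the the the"
refs = ["the dog ate the meat pie", "the dog ate a meat pie"]
print(clipped_precision(t, refs))  # 2/6 = 0.333...
```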

NIST: a modified version of BLEU
A version of BLEU developed by the US National Institute of Standards and Technology. Instead of just counting matching n-grams, it weights counts by their informativeness: for any n-gram matching between t and the reference corpus, the rarer the n-gram in the reference corpus, the better.

Alternative metrics
Some version of edit (Levenshtein) distance is often used: a score reflecting the number of insertions (I), deletions (D) and substitutions (S) required to transform one string into another.

NIST simple string accuracy (SSA) is essentially based on edit distance:
SSA = 1 - (I + D + S) / (length of the sentence)
A small sketch is given below.
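A small sketch of SSA computed from token-level Levenshtein operations. Treating tokens (rather than characters) as the edit unit and normalising by the reference length are assumptions made for illustration.

```python
# Token-level edit distance and simple string accuracy (SSA).
def edit_ops(reference, candidate):
    """Minimum number of insertions + deletions + substitutions (token level)."""
    ref, cand = reference.split(), candidate.split()
    # Standard dynamic-programming edit-distance table.
    dist = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(cand) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(cand) + 1):
            cost = 0 if ref[i - 1] == cand[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + cost)   # substitution or match
    return dist[len(ref)][len(cand)]

def ssa(reference, candidate):
    """SSA = 1 - (I + D + S) / length of the reference sentence."""
    return 1 - edit_ops(reference, candidate) / len(reference.split())

print(ssa("the dog ate the meat pie", "the dog ate a meat pie"))  # 1 - 1/6
```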

BLEU/NIST in NLG
HALogen's output was compared to the reference Treebank sentences using BLEU and SSA.

Fully specified input: output produced for ca. 83% of inputs; SSA = 94.5; BLEU = 0.92.

Minimally specified input: output produced for ca. 79.3% of inputs; SSA = 55.3; BLEU = 0.51.

How adequate are these measures?
An important question for NLG: is matching a gold-standard corpus all that matters? (As with MT, a complete mismatch is possible, yet the output could still be perfectly OK.)

Some recent work suggests that corpus-based metrics give very different results from task-based experiments. It is therefore difficult to identify a relationship between a measure like BLEU and a system's adequacy in a task.

Part 2: Automatic summarisation

The task
Given a single document or a collection of documents, return an abridged version that distils the most important information (possibly for a particular task/user).

Summarisation systems perform:
- Content selection: choosing the relevant information in the source document(s), typically in the form of sentences/clauses.
- Information ordering (especially if the source is more than one document).
- Sentence realisation: cleaning up the sentences to make them fluent.

Note the similarity to NLG architectures. The main difference: summarisation input is text, whereas NLG input is non-linguistic data.

Types of summaries: extractive vs. abstractive
- Extractive: select informative sentences/clauses in the source document and reproduce them. Most current systems (and our focus today).
- Abstractive: summarise the subject matter (usually using new sentences). Much harder, as it involves deeper analysis & generation.

Dimensions
- Single-document vs. multi-document
- Context: query-specific vs. query-independent

Extracts vs abstracts: Lincoln's Gettysburg Address

Source: Jurafsky & Martin (2009), p. 823

Extracts vs abstracts: a "summarization machine"
[Diagram: a summarization machine takes one or more documents (and possibly a query) and produces extracts or abstracts, varying along dimensions such as indicative vs informative, generic vs query-oriented, background vs "just the news", and length (headline, very brief, brief, long: roughly 10% to 100% of the source). Adapted from: Hovy & Marcu (1998). Automated text summarization. COLING-ACL Tutorial. http://www.isi.edu/~marcu/]

The modules of the summarization machine
[Diagram: an extraction module produces extracts from the document(s); an interpretation module maps these to representations such as case frames, templates, core concepts, core events, relationships, clause fragments and index terms; generation and filtering modules then produce abstracts or filtered extracts.]

Unsupervised extractive summarisation: bag-of-words approaches

Unsupervised content selection I: topic signatures
The simplest unsupervised algorithm:
- Split the document into sentences.
- Select those sentences which contain the most salient/informative words.
- A salient term is a term in the topic signature (words that are crucial to identifying the topic of the document).

Topic signature detection:
- Represent sentences (documents) as word vectors.
- Compute the weight of each word.
- Weight sentences by the average weight of their (non-stop) words.

Common term-weighting schemes
TF/IDF (see the lecture on IR): terms are weighted more if they are frequent in the input document(s) but not so frequent in other documents. Topic signature = terms whose tf/idf score is above some threshold.

Log likelihood: topic signature = terms whose LLR indicates they are significantly more likely to occur in the input document(s) than in a background corpus.

Term weighting: log likelihood ratio
Requirements: a background corpus.

For a term w, the LLR compares:
- the probability of observing w in the input corpus, with
- the probability of observing w in the background corpus.
The corresponding log-likelihood ratio statistic is asymptotically chi-square distributed; if it is significant, we treat the term as a key term. Chi-square values tend to be significant at p = .001 if they are greater than 10.8. A sketch of this procedure follows.
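A sketch of topic-signature extraction with a Dunning-style log-likelihood ratio statistic, followed by sentence scoring. The toy corpora and stoplist are invented; sentences are scored here by the proportion of signature words (a simple variant of the average-weight idea), and with corpora this small only the repeated term clears the 10.8 cut-off.

```python
# Topic signature via a log-likelihood ratio test, then sentence scoring.
import math
from collections import Counter

def ll(k, n, p):
    """Binomial log-likelihood of k successes in n trials, clamped away from log(0)."""
    p = min(max(p, 1e-12), 1 - 1e-12)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def llr(word, input_counts, bg_counts):
    """Dunning's G^2 statistic comparing a word's rate in input vs background."""
    k1, n1 = input_counts[word], sum(input_counts.values())
    k2, n2 = bg_counts[word], sum(bg_counts.values())
    p1, p2, p = k1 / n1, k2 / n2, (k1 + k2) / (n1 + n2)
    return 2 * (ll(k1, n1, p1) + ll(k2, n2, p2) - ll(k1, n1, p) - ll(k2, n2, p))

input_sentences = [
    "the patient had three episodes of bradycardia overnight",
    "heart rate fell sharply during each bradycardia",
    "each bradycardia resolved without intervention",
    "the nurse adjusted the ventilator settings",
]
background = "the cat sat on the mat and the dog ate the bone " * 50

stop = {"the", "a", "of", "on", "and", "had", "each", "during", "without"}
input_counts = Counter(w for s in input_sentences for w in s.split())
bg_counts = Counter(background.split())

# Topic signature: non-stop input terms whose statistic exceeds the chi-square cut-off.
signature = {w for w in input_counts
             if w not in stop and llr(w, input_counts, bg_counts) > 10.8}

def sentence_score(sentence):
    words = [w for w in sentence.split() if w not in stop]
    return sum(w in signature for w in words) / max(len(words), 1)

for s in sorted(input_sentences, key=sentence_score, reverse=True):
    print(round(sentence_score(s), 2), s)
```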

A note about these term-weighting approaches
The set of significant terms in a document can be thought of as a pseudo-sentence: the document's centroid.

These approaches rank sentences by their closeness to this topical centre of the document.

An alternative is sentence centrality: instead of measuring closeness to a single centroid pseudo-sentence, we measure the closeness of a sentence to all other sentences.

Alternative: sentence centrality
Basic idea: a sentence which is similar to many other sentences is more central/salient to the document(s).

We therefore need:
- a definition of similarity between two sentences
- a definition of centrality: given the similarity of a sentence to all other sentences, how do we compute how central it is to the document(s)?

Similarity
Cosine between vectors of tf/idf weights. This effectively yields an undirected, weighted graph; alternatively, a similarity threshold can be applied to create a discretised graph.

After Erkan & Radev (2004).

Centrality
Assume we have set a cosine similarity threshold.
- Cosine average: centrality as the average pairwise similarity to all other K sentences.

- Degree centrality: centrality as the number of nodes to which sentence node x is linked (= the number of sentences whose similarity to x is above the threshold). NB: using a threshold assumes we have a discretised graph (i.e. we are only counting the number of connected nodes, not their edge weights).

Centrality as prestige
Degree centrality is very democratic. Problem: if x is connected to many sentences, none of which is very important, this boosts x artificially.

LexRank algorithm: Consider not only how many nodes are connected to x, but also how central these nodes themselves are.

Roughly, centrality(x) = the sum, over all y in adj[x], of centrality(y) / degree(y), where adj[x] = the nodes connected to sentence x and degree(y) = the degree centrality of node y. A sketch of degree centrality and this LexRank-style iteration follows.
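A sketch of degree centrality and a LexRank-style iteration on a thresholded similarity graph. Plain bag-of-words cosine stands in for tf/idf cosine, the threshold and sentences are invented, and a small damping term (as in the PageRank-style formulation of LexRank) is added so the iteration converges.

```python
# Degree centrality and LexRank-style centrality on a thresholded sentence graph.
import math
from collections import Counter

sentences = [
    "mars experiences frigid weather conditions",
    "martian weather involves blowing dust and carbon dioxide",
    "surface temperatures on mars average minus seventy degrees",
    "the exhibition got very good reviews",
]

def cosine(s1, s2):
    v1, v2 = Counter(s1.split()), Counter(s2.split())
    dot = sum(v1[w] * v2[w] for w in v1)
    norm = (math.sqrt(sum(c * c for c in v1.values())) *
            math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

THRESHOLD = 0.1
n = len(sentences)
adj = {i: [j for j in range(n)
           if j != i and cosine(sentences[i], sentences[j]) > THRESHOLD]
       for i in range(n)}

degree = {i: len(adj[i]) for i in range(n)}   # degree centrality

# centrality(x) = damping/n + (1 - damping) * sum over neighbours y of centrality(y)/degree(y)
centrality = {i: 1.0 / n for i in range(n)}
for _ in range(50):
    centrality = {i: 0.15 / n + 0.85 * sum(centrality[j] / degree[j]
                                           for j in adj[i] if degree[j])
                  for i in range(n)}

for i in sorted(range(n), key=centrality.get, reverse=True):
    print(round(centrality[i], 3), degree[i], sentences[i])
```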

Unsupervised extraction II: using rhetorical structure

Rhetorical Structure Theory
RST (Mann and Thompson 1988) is a theory of text structure. It is not concerned with the topic of a text, but with how bits of the underlying content of a text are structured so as to hang together in a coherent way.

The main claim of RST:
- Parts of a text are related to each other in predetermined ways.
- There is a finite set of such relations.
- Relations hold between two spans of text: a nucleus and a satellite.

An aside and a reminder
We encountered RST earlier when talking about NLG.

Many document planning approaches choose content and structure messages based on RST relations. Subsequently these form the input to microplanning and realisation.

A small example
"You should visit the new exhibition. It's excellent. It got very good reviews. It's completely free."
[RST diagram on the slide: MOTIVATION links "You should visit ..." (nucleus) to "It's excellent ...", which is itself supported by "It got ..." via EVIDENCE; "It's completely free" is linked by ENABLEMENT.]

An RST relation definition: MOTIVATION
- Nucleus: represents an action which the hearer is meant to do at some point in the future ("You should go to the exhibition").
- Satellite: represents something which is meant to make the hearer want to carry out the nucleus action ("It's excellent. It got a good review."). Note: the satellite need not be a single clause; in our example the satellite has 2 clauses, which are themselves related to each other by the EVIDENCE relation.
- Effect: to increase the hearer's desire to perform the nucleus action.

RST relations more generally
An RST relation is defined in terms of:
- the nucleus + constraints on the nucleus: the nucleus is the core content of the discourse unit (e.g. the nucleus of MOTIVATION is some action to be performed by the hearer)
- the satellite + constraints on the satellite: the satellite is additional information, related to the nucleus in a specific manner
- a desired effect on the reader/listener, arising as a result of the relation.

Some further RST examples
CAUSE: the nucleus is the result; the satellite is the cause. "...any liquid water would evaporate because of the low atmospheric pressure"

ELABORATION: the satellite gives more information about the nucleus. "With its distant orbit [...] and slim atmospheric blanket, Mars experiences frigid weather conditions."

CONCESSION: the satellite expresses possible exceptions or apparent counter-examples to the rule expressed by the nucleus. "Although the atmosphere holds a small amount of water [...] most Martian weather involves blowing dust..."

With its distant orbit 50 percent farther from the sun than Earth and slim atmospheric blanket, Mars experiences frigid weather conditions. Surface temperatures typically average about -70 degrees Fahrenheit at the equator, and can dip to -123 degrees C near the poles.

Only the midday sun at tropical latitudes is warm enough to thaw ice on occasion, but any liquid water formed in this way would evaporate almost instantly because of the low atmospheric pressure. Although the atmosphere holds a small amount of water, and water-ice clouds sometimes develop, most Martian weather involves blowing dust or carbon dioxide.

Some more on RST
RST relations are neutral with respect to their realisation, e.g. you can express ELABORATION in lots of different ways.

ELABORATION relates a satellite ("With its distant orbit 50 percent further from the sun than the Earth and slim atmospheric blanket, ...") to a nucleus ("Mars experiences frigid weather conditions"). Possible realisations include:
- With its distant orbit [...] and slim atmospheric blanket, Mars experiences frigid weather conditions.
- Mars experiences frigid weather conditions given its distant orbit [...] and slim atmospheric blanket.
- Given that it has a distant orbit [...] and slim atmospheric blanket, Mars experiences frigid weather conditions.

RST for unsupervised content selection (in single-document summarisation)
Basic intuition: the nucleus in an RST relation is the more contentful part; it's what the text chunk in question is about.

Assume we have a discourse parser, i.e. one that can:
- identify the discourse units within a text, and
- analyse the text into an RST graph.

Example discourse parser output
With its distant orbit 50 percent farther from the sun than Earth and slim atmospheric blanket, Mars experiences frigid weather conditions. Surface temperatures typically average about -70 degrees Fahrenheit at the equator, and can dip to -123 degrees C near the poles.

Only the midday sun at tropical latitudes is warm enough to thaw ice on occasion, but any liquid water formed in this way would evaporate almost instantly because of the low atmospheric pressure. Although the atmosphere holds a small amount of water, and water-ice clouds sometimes develop, most Martian weather involves blowing dust or carbon dioxide.

RST for unsupervised content selection
Traverse the graph nodes. For each node n, we identify the set of salient units related to that node, Sal(n).

Base case: If n is a leaf node, then Sal(n) = {n}

Recursive case: if n is a non-leaf node, look at all the immediate nuclear children c of n and take Sal(n) to be the union of their salient sets Sal(c).

Rank the units: the higher the node to which a unit is promoted as a nucleus, the more salient it is. A sketch of this promotion procedure follows.
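A sketch of this nucleus-promotion idea over a hand-built toy RST tree. A real system would get the tree from a discourse parser; the tree, the relations and the score-by-depth convention here are assumptions made for illustration.

```python
# Nucleus promotion over a toy RST tree: leaves are elementary discourse units
# (EDUs); each internal node lists its children with their nuclearity.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    label: str                                                        # EDU id or relation name
    children: List[Tuple["Node", str]] = field(default_factory=list)  # (child, "N" or "S")

def salient(node):
    """Sal(n): the leaf itself, or the union of Sal(c) over nuclear children c."""
    if not node.children:
        return {node.label}
    units = set()
    for child, nuclearity in node.children:
        if nuclearity == "N":
            units |= salient(child)
    return units

def rank_units(node, depth=0, scores=None):
    """Score each EDU by the shallowest node that promotes it (lower = more salient)."""
    if scores is None:
        scores = {}
    for unit in salient(node):
        scores.setdefault(unit, depth)        # the first (shallowest) promotion wins
    for child, _ in node.children:
        rank_units(child, depth + 1, scores)
    return scores

# Toy tree: ELABORATION(N = BACKGROUND(S = edu1, N = edu2), S = CAUSE(N = edu3, S = edu4))
edu = {i: Node(f"edu{i}") for i in range(1, 5)}
tree = Node("ELABORATION", [
    (Node("BACKGROUND", [(edu[1], "S"), (edu[2], "N")]), "N"),
    (Node("CAUSE", [(edu[3], "N"), (edu[4], "S")]), "S"),
])

scores = rank_units(tree)
print(sorted(scores, key=scores.get))  # most salient first: edu2, then edu3, ...
```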

Rhetorical structure: example
- Unit 2 is a leaf node: promote unit 2.
- Unit 2 is the nucleus of a background/justification relation: promote unit 2.
- The sequence 1-2 is the nucleus of an ELABORATION spanning units 1-6: promote the nucleus's children, i.e. promote unit 2 again.
- The sequence 1-6 is the nucleus of the ELABORATION at the root, spanning units 1-8: promote the nucleus's children, i.e. promote unit 2 again.

Ranking of the nodes: 2 > 8 > 3 ...

Rhetorical structure: example
Resulting summary:
- Mars experiences frigid weather conditions.
- Most Martian weather involves blowing dust or carbon dioxide.
- Surface temperatures typically average about -60 degrees ...
- Yet even on the summer pole...

Ranking of the nodes: 2 > 8 > 3 ...

Supervised content selection
Basic idea. Input: a training set consisting of documents + human-produced (extractive) summaries, so that the sentences in each document can be marked with a binary label (1 = included in the summary; 0 = not included).

Train a machine learner to classify sentences as 1 (extract-worthy) or 0, based on features.

Edmundson (1969)
A highly influential early approach that set the trend for much contemporary work.

Edmundson's corpus: 200 scientific (chemistry) articles (100-3900 words long).

Features considered:
- Cue words (derived from training data; the same for all documents)
- Title words
- Keywords
- Sentence location

(Title words, keywords and sentence location are derived from the input document; cue words come from the training data.)

Cue words (from training data)
From a frequency list in descending order (word1: f1, word2: f2, ..., wordn: fn, wordn+1: fn+1, ...):
- Bonus words: their presence in a sentence makes it more summary-worthy.
- Stigma words: their presence in a sentence makes it less summary-worthy.

Keywords
Keywords come from the document that needs to be summarised. From the document's frequency list in descending order:
- Keywords: any word above a frequency threshold which is not also a cue word.
- Non-keywords: everything else.

Sentence location
Basically, a heuristic that determines sentence importance based on where the sentence occurs.

First component: does the sentence occur under one of a small list of predefined headings (e.g. introduction, conclusions, ...)?

Second component: ordinal position in the document. Sentences towards the beginning or end are assigned a positive weight.

Combining the features
Edmundson needed a way to quantify each feature for each sentence in a document and combine them into one single score.

The combined score is a weighted linear combination of the cue-word score, the keyword score, the location score and the title-word score, with one tunable weight per feature; a small sketch is given below.
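A sketch of an Edmundson-style linear combination. The feature extractors below are crude stand-ins for the cue-word, keyword, title-word and location scores described above, and the weights, word lists and toy document are invented for illustration.

```python
# Edmundson-style sentence scoring: a weighted linear combination of features.
CUE_BONUS = {"significant", "conclude", "results"}
CUE_STIGMA = {"perhaps", "impossible"}
TITLE = "summarisation of chemistry articles"

def cue_score(sentence):
    words = sentence.lower().split()
    return sum(w in CUE_BONUS for w in words) - sum(w in CUE_STIGMA for w in words)

def title_score(sentence):
    return sum(w in TITLE.split() for w in sentence.lower().split())

def keyword_score(sentence, keywords):
    return sum(w in keywords for w in sentence.lower().split())

def location_score(index, n_sentences):
    # Reward sentences near the beginning or end of the document.
    return 1.0 if index < 2 or index >= n_sentences - 2 else 0.0

def edmundson_score(sentence, index, n_sentences, keywords,
                    alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """score = alpha*Cue + beta*Keyword + gamma*Title + delta*Location."""
    return (alpha * cue_score(sentence)
            + beta * keyword_score(sentence, keywords)
            + gamma * title_score(sentence)
            + delta * location_score(index, n_sentences))

doc = [
    "we conclude that the results are significant",
    "perhaps other factors matter",
    "the experiments used three chemistry articles",
]
keywords = {"chemistry", "articles", "experiments"}
ranked = sorted(range(len(doc)),
                key=lambda i: edmundson_score(doc[i], i, len(doc), keywords),
                reverse=True)
print([doc[i] for i in ranked])
```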

Things to note
- The feature weights reflect the relative importance of each feature.
- The scoring function is linear (later work has moved beyond such simplistic functions).
- Tweaking the parameters by hand to obtain the best feature combination, Edmundson observed that keywords were the least useful feature (could this have to do with his corpus? recall that all the documents were of the same kind), and that of the other three, the location feature was the most useful. Would you expect this to change with documents of a different type (say, short stories)?

Contemporary machine-learning approaches
Given a feature set F for a sentence, we want to compute the probability that the sentence is extract-worthy given F.

Many methods we've discussed will do, e.g. Naive Bayes or Maximum Entropy. A small sketch follows.
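A minimal sketch of the supervised set-up: binary sentence features and a Naive Bayes classifier. The features, the tiny training set and the use of scikit-learn's BernoulliNB are all assumptions made for illustration.

```python
# Supervised sentence classification: binary features + Naive Bayes.
from sklearn.naive_bayes import BernoulliNB

# Features per sentence: [early_position, has_cue_phrase, has_signature_word, long_enough]
X_train = [
    [1, 1, 1, 1],   # extract-worthy sentences...
    [1, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],   # ...and non-extract-worthy ones
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
y_train = [1, 1, 1, 0, 0, 0]   # 1 = included in the human summary

clf = BernoulliNB()
clf.fit(X_train, y_train)

# Probability that a new sentence (early, no cue phrase, signature word, long) is extract-worthy:
print(clf.predict_proba([[1, 0, 1, 1]])[0][1])
```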

Features considered in contemporary approaches
- Position: important sentences tend to occur early in a document (but this is genre-dependent); e.g. in news articles the most important sentence is the title.
- Cue phrases: sentences with phrases like "to summarise" give important summary information (again genre-dependent: different genres have different cue phrases).
- Word informativeness: words in the sentence which belong to the document's topic signature.
- Sentence length: we usually want to avoid very short sentences.
- Cohesion: we can use lexical chains to compute how many words in a sentence are also in the document's lexical chains. (A lexical chain is a series of words that are indicative of the document's topic.)

Which corpus?
There are some corpora with extractive summaries: the DUC evaluation campaigns have produced a lot of data since 2001. Many types of text themselves contain summaries, e.g. scientific articles have abstracts, but these are not purely extractive (though people tend to include sentences in abstracts that are very similar to sentences in their text).

DIY method: align sentences in an abstract with sentences in the document, by computing their overlap (e.g. using n-grams)

Issues with multi-document summarisation

Issue 1: redundancy
The methods we've looked at can be applied to both single- and multi-document extraction.

One important issue arises in the multi-document case: redundancy, i.e. repeated information across several documents (overlapping words, sentences, phrases...).

The usual solution: modify the sentence-scoring methods to penalise redundancy, by comparing a candidate sentence to the sentences already selected.

Issue 1: redundancy
Maximum marginal relevance (MMR): compare a candidate sentence to the sentences already selected; the more similar it is, the higher the penalty.

Roughly, MMR(s) = λ * score(s) - (1 - λ) * max over already-selected s' of Sim(s, s'), where λ is a weight to be tuned and Sim is some similarity function (Dice, Jaccard, cosine, edit distance...). A sketch follows.
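A sketch of MMR-style greedy selection. The relevance scores, the choice of Jaccard similarity and the value of λ are illustrative; any scoring or similarity function could be plugged in.

```python
# Greedy MMR selection: trade off relevance against redundancy with what is already chosen.
def jaccard(s1, s2):
    a, b = set(s1.split()), set(s2.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def mmr_select(candidates, relevance, k, lam=0.7):
    """Pick k sentences, penalising similarity to those already selected."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def mmr(s):
            redundancy = max((jaccard(s, t) for t in selected), default=0.0)
            return lam * relevance[s] - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected

candidates = [
    "mars experiences frigid weather conditions",
    "the weather on mars is frigid",                 # redundant with the first
    "martian weather involves blowing dust",
]
relevance = {s: 1.0 - 0.1 * i for i, s in enumerate(candidates)}  # toy relevance scores
print(mmr_select(candidates, relevance, k=2))  # skips the redundant paraphrase
```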

Issue 1: redundancy
Clustering: like MMR, based on similarity. Apply a clustering algorithm to group similar sentences from multiple documents, then select a single centroid sentence from each cluster to include in the summary.

Issue 2: information ordering
If sentences are selected from multiple documents, we risk creating an incoherent document.

Rhetorical structure:
*Therefore, I slept. I was tired.
I was tired. Therefore, I slept.

Lexical cohesion:
*We had chicken for dinner. Paul was late. It was roasted.
We had chicken for dinner. It was roasted. Paul was late.

Referring expressions:
*He said that ... . George W. Bush was speaking at a meeting.
George W. Bush said that ... . He was speaking at a meeting.

These heuristics can be combined. We can also do information ordering during the content selection process itself.

Information ordering based on reference
Referring expressions (NPs that identify objects) include pronouns, names, definite NPs... Centering Theory (Grosz et al. 1995): every discourse segment has a focus (what the segment is about).

Entities are salient in discourse depending on their position in the sentence: SUBJECT >> OBJECT >> OTHER

A coherent discourse is one which, as far as possible, maintains smooth transitions between sentences.

Information ordering based on lexical cohesion
Sentences which are about the same things tend to occur together in a document.

A possible method (a sketch follows):
- use tf-idf cosine to compute pairwise similarity between the selected sentences
- attempt to order the sentences so as to maximise the similarity between adjacent pairs.
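A sketch of this cohesion-based ordering: greedily chain the selected sentences so that each next sentence is maximally similar to the previous one. Plain bag-of-words cosine stands in for tf-idf cosine, and the greedy strategy (rather than searching all orderings) is an assumption made for illustration.

```python
# Greedy cohesion-based ordering of selected sentences.
import math
from collections import Counter

def cosine(s1, s2):
    v1, v2 = Counter(s1.split()), Counter(s2.split())
    dot = sum(v1[w] * v2[w] for w in v1)
    norm = (math.sqrt(sum(c * c for c in v1.values())) *
            math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

def order_by_cohesion(sentences):
    remaining = list(sentences)
    ordered = [remaining.pop(0)]           # start from the first selected sentence
    while remaining:
        nxt = max(remaining, key=lambda s: cosine(ordered[-1], s))
        ordered.append(nxt)
        remaining.remove(nxt)
    return ordered

selected = [
    "mars experiences frigid weather conditions",
    "the exhibition got very good reviews",
    "martian weather involves blowing dust",
]
print(order_by_cohesion(selected))  # groups the two Mars sentences next to each other
```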

Realisation, simplification and revision

Realisation
With single-document summarisation, realisation isn't a big problem (we are reproducing sentences from the source document), but we may want to simplify (or compress) the sentences. The simplest method is to use heuristics that drop material, e.g.:
- Appositives: "Rajam, 28, an artist who lives in Philadelphia, found inspiration in the back of city magazines."
- Sentential adverbs: "As a matter of fact, this policy will be ruinous."

There is a lot of current research on simplification/compression, often using parsers to identify dependencies that can be omitted with little loss of information.

Realisation is much more of an issue in multi-document summarisation.

Multi-document realisation and revision
Compare:

Source: Jurafsky & Martin (2009), p. 835.

Uses of realisation
Since sentences come from different documents, we may end up with infelicitous NP orderings (e.g. a pronoun before a definite description). One possible solution: run a coreference resolver on the extracted summary, identify reference chains (NPs referring to the same entity), and replace or reorder NPs if they violate coherence (e.g. use the full name before a pronoun).

Another interesting problem is sentence aggregation or fusion, where different phrases (from different sources) are combined into a single phrase.

Evaluating summarisation

Evaluation baselines
- Random sentences: if we are producing summaries of length N, we use as a baseline a random extractor that pulls out N sentences. Not too difficult to beat.

- Leading sentences: choose the first N sentences. Much more difficult to beat! A lot of informative sentences are at the beginning of documents.

- Evaluation against a commercial summariser (many approaches were compared to Microsoft's AutoSummarize).

- Evaluation against human extractive summaries: many systems are now evaluated using ROUGE; precision/recall and the PYRAMID method can also be used.

BLEU vs ROUGE
- BLEU: precision-oriented; looks at n-gram overlap for different values of n up to some maximum; measures the average n-gram overlap between an output text and a set of reference texts.
- ROUGE: recall-oriented; the n-gram length is fixed per variant (ROUGE-1, ROUGE-2, etc.); measures how many of the reference summary's n-grams the output summary contains.

ROUGE
Generalises easily to any n-gram length (a sketch of ROUGE-N recall follows). Other versions:
- ROUGE-L: measures the longest common subsequence between the reference summary and the output.
- ROUGE-SU: uses skip bigrams.
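A sketch of ROUGE-N recall for a single reference: the fraction of the reference's n-grams that also appear (with clipping) in the system summary. Real ROUGE additionally handles multiple references, stemming and stopword options.

```python
# ROUGE-N recall against a single reference summary.
from collections import Counter

def rouge_n(system, reference, n=1):
    def ngrams(text):
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    sys_counts, ref_counts = ngrams(system), ngrams(reference)
    # Clipped overlap: a reference n-gram is only credited as often as it appears in the system output.
    overlap = sum(min(count, sys_counts[gram]) for gram, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

reference = "mars has frigid weather and frequent dust storms"
system = "mars experiences frigid weather with blowing dust"
print(rouge_n(system, reference, n=1))  # unigram recall
print(rouge_n(system, reference, n=2))  # bigram recall
```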

The Pyramid method (Nenkova et al.)
Also intrinsic, but relies on semantic content units (SCUs) instead of n-grams.

Human annotators label SCUs in sentences from human summaries. This is based on identifying the content of the different sentences, and grouping together sentences in different summaries that talk about the same thing; it goes beyond surface wording!

Find SCUs in the automatic summaries.

Weight the SCUs (typically by the number of human summaries in which they appear).

Compute the ratio of the sum of the weights of the SCUs in the automatic summary to the weight of an optimal summary of roughly the same length. A sketch follows.
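A sketch of the pyramid score under the common convention that an SCU's weight is the number of human summaries it appears in. The SCU annotations below are invented; in practice they come from manual annotation of the human summaries.

```python
# Pyramid score: achieved SCU weight relative to an optimal summary of the same size.
human_summaries_scus = [
    {"mars_is_cold", "dust_storms", "thin_atmosphere"},
    {"mars_is_cold", "dust_storms"},
    {"mars_is_cold", "water_ice_clouds"},
]

# SCU weight = number of human summaries containing the SCU.
weights = {}
for scus in human_summaries_scus:
    for scu in scus:
        weights[scu] = weights.get(scu, 0) + 1

def pyramid_score(system_scus, size):
    """Ratio of the system's SCU weight to the weight of an optimal summary of `size` SCUs."""
    achieved = sum(weights.get(scu, 0) for scu in system_scus)
    optimal = sum(sorted(weights.values(), reverse=True)[:size])
    return achieved / optimal

system_scus = {"mars_is_cold", "water_ice_clouds"}
print(pyramid_score(system_scus, size=len(system_scus)))  # (3 + 1) / (3 + 2) = 0.8
```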

Intrinsic vs. extrinsic again
Problem: ROUGE assumes that reference summaries are gold standards, but people often disagree about summaries, including their wording. The same questions arise as for NLG (and MT): to what extent does this metric actually tell us about the effectiveness of a summary? Some recent work has shown that the correlation between ROUGE and a measure of relevance given by humans is quite low.

See: Dorr et al. (2005). A Methodology for Extrinsic Evaluation of Text Summarization: Does ROUGE Correlate? Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 1-8, Ann Arbor, June 2005.