Measuring Text Re-use June 2000

Page 1 of 43

Identifying re-use between the Press Association and newspapers of the British Press

Paul Clough
Department of Computer Science, University of Sheffield

[email protected]

Abstract

This report identifies some common transformations applied to Press Association (PA) texts when re-used in British newspapers. This analysis will provide a better understanding of re-use in newspapers and allow classification of the most common transformations found. The Journalism Department at Sheffield University has aided the work by providing newspaper and PA text to study. The report discusses a framework for analysing re-use, some example techniques being used to study re-use, and the indicators within the newspaper text that are used to determine whether the newspaper text uses PA as a source, as the only source, or whether the shared text is simply due to chance.

Keywords

Text re-use, plagiarism, paraphrasing.

1 Introduction

This report comes from work undertaken by the Departments of Computer Science and Journalism at Sheffield University on a 2-year project called METER (MEasuring TExt Re-use). The project is run in co-operation with the Press Association (PA), which is supplying the source material for the analysis.

The PA is the national news agency for the UK and Ireland. It provides news copy every hour of every day throughout the year to paying media customers. The aim of METER is to find an approach to measure and evaluate how much of the text published in British newspapers is derived from the PA copy. The derived text can be used verbatim (the same words in the same order) or re-written (different words or order, but same meaning) from the PA material. METER looks at how the PA and British newspapers report the same news events and whether newspapers have drawn on the PA copy.

There are variations in the way in which a newspaper reports a news story. Even if the newspaper uses the PA as a source, detecting this is not always easy. Newspapers may re-write PA copy to comply with ‘house style’, fit available space in a particular edition, provide additional detail or suit a particular ‘line’ or ‘angle’ for the story. Therefore, text from the PA may get deleted, transformed or added to.

Text re-use can be thought of as a form of recycling. The PA source text is melted down into a pot and then recycled to form a new text. When recycled, it is very unlikely that the new text will appear the same as the original. Text can be copied and used in a verbatim form, or can be changed in some manner to suit a particular situation (e.g. newspaper, type of story, space available etc.).

This report describes and illustrates (using examples) transformations that can occur when text is not used directly verbatim (a re-write) and also “issues” involved in deciding whether PA is the source (or one of a number of sources) of the newspaper even with shared text between the two articles.

2 The PA and newspaper texts

Examples for this analysis are from the METER corpus. This corpus of newspaper and PA texts covers a wide range of stories in court reporting (classified as COURTS in the PA Archive) compiled by the Department of Journalism. Newspaper stories were chosen from those appearing the day after release by the PA. Newspapers used for the corpus are The Sun, Daily Mirror, Daily Star, Daily Mail, Daily Express, The Times, The Daily Telegraph, The Guardian and The Independent (all southern editions). Currently there are 219 PA stories consisting of 99,890 words and a corresponding 389 newspaper stories consisting of 128,179 words. All newspaper texts are hand annotated to indicate which parts of the story are copied verbatim or re-written from the PA text.

3 A framework of text re-use

Text re-use in this project is defined between two sources: 1) the newspaper and 2) the PA. The PA is considered the source and the newspaper either derived or not derived from that source. Re-use of text in the newspaper is considered at word level and can be one of three categories:

1. Verbatim: text shared between the PA and the newspaper with no changes (word-for-word).

2. Re-write: text in the newspaper that can be associated with some part of the PA source but is not copied word-for-word.

3. New: text that appears in the newspaper but cannot be associated with any part of the PA source.

Currently, text re-use is classified at word level, but some words may form part of a higher-level re-write. A journalist is unlikely to re-use isolated words and more likely to re-use phrases (e.g. noun phrases such as the name of a person or company).

Text classified as new is not technically re-used. It does not appear in the PA, but may be due to newspaper style or other factors. Any text that is not verbatim is therefore initially treated as re-written; if it cannot be matched to the PA source using a number of transformation tests, it is considered new. One complication with verbatim material is that text appearing in both the PA and the newspaper may be re-used in a different context. A verbatim match should therefore take into account the context in which the newspaper version appears, and use this to help decide whether the text is re-used.
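The three-way decision described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the METER implementation: the function names, the substring-based notion of "verbatim" and the single transformation test `yesterday_for_today` are all hypothetical, and the PA sentence is taken from the conjunction example later in this report.

```python
def classify(span, pa_text, transformation_tests):
    """Label a newspaper span as 'verbatim', 're-write' or 'new'."""
    span_l, pa_l = span.lower(), pa_text.lower()
    # Verbatim: the span occurs word-for-word in the PA text.
    # (Plain substring matching ignores word boundaries; a fuller
    # version would compare token sequences instead.)
    if span_l in pa_l:
        return "verbatim"
    # Re-write: at least one transformation test links the span to the PA.
    if any(test(span_l, pa_l) for test in transformation_tests):
        return "re-write"
    # New: nothing associates the span with the PA source.
    return "new"


def yesterday_for_today(span, pa_text):
    """Hypothetical test: the newspaper's 'yesterday' replaces the PA's 'today'."""
    return "yesterday" in span and span.replace("yesterday", "today") in pa_text


PA = "A woman jailed for life for a double murder was freed today by the Court of Appeal."
TESTS = [yesterday_for_today]

for span in ("was freed", "was freed yesterday by the Court of Appeal",
             "her grieving family"):
    print(span, "->", classify(span, PA, TESTS))
```

Further transformation tests (synonym substitution, abbreviation, number format and so on) would simply be added to the list passed to `classify`.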

4 Factors influencing re-use

Newspapers do not all use the PA source in the same way. Understanding why helps to identify the factors that influence how text re-appears in a newspaper version that uses the PA as source. This is useful when trying to determine whether text that appears in the newspaper but not in the PA is simply due to the style or language of the newspaper.

Common factors that influence the re-use include:

1. Journalistic style – all journalists are taught to report in a particular manner. This includes the 5 W’s (Who, What, When, Where, Why), layout of the story, writing quoted material etc.

2. In-house style – each newspaper has a particular style and language (5). The PA source may not be compatible with that style and therefore must be changed. Changes can include exaggeration, additional detail, dramatics etc.

3. Type of article – a short News in Brief article may use PA as a source for the basic news story, but it is unlikely that the PA will be the only source (if a source at all) in an exclusive.

4. Space available – the number of words that can fit into the physical space available to the journalist or copy editor may be less than that of the PA source. Therefore the newspaper will have to summarise the PA source, which will involve deletion of PA text.

5. Production constraints – may include the number of journalists on the story, the turn-around time of the story (may have hours rather than days) etc.

6. Bias of the newspaper – most newspapers are biased in some way and this may affect how the basic facts given in the PA source are re-used and the language and style used to re-write the story. An example bias is a political bias.

7. The audience – each newspaper will write in a manner that suits its audience and this will affect the language and vocabulary used in the stories. For example, there are assumptions that Sun readers are Tory supporters, working class and younger; that Guardian readers are left-wing, middle-class Labour supporters; and that Telegraph readers are older (12).

5 Examples of verbatim text re-use

The easiest place to start is with examples of verbatim re-use. In this case, the newspaper uses the PA word-for-word. Moreover, the text is re-used in the same order. This is an important concept, as all the text of a small newspaper story could be found in a large PA source, but may be taken from such varying places that although the text is the same, the context is completely different. In this framework, verbatim is classed as the longest group of matching words in both the newspaper and PA, regardless of context or order.
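One way to find "the longest group of matching words ... regardless of context or order" is to repeatedly take the longest common run of words and mask it out, in the spirit of greedy string tiling. The sketch below is an assumption about how this could be done, not the tool used to produce Figure 1; the function names and the two example sentences are invented for illustration.

```python
import re


def tokenize(text):
    """Lower-case word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9']+", text.lower())


def longest_common_run(a, b):
    """Return (start_a, start_b, length) of the longest run of equal tokens."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Masked tokens (None) never match anything.
            if a[i - 1] is not None and a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def verbatim_runs(pa_text, paper_text, min_len=1):
    """Greedily collect shared word runs, longest first, wherever they occur."""
    pa, paper = tokenize(pa_text), tokenize(paper_text)
    runs = []
    while True:
        i, j, n = longest_common_run(paper, pa)
        if n == 0 or n < min_len:
            break
        runs.append((i, " ".join(paper[i:i + n])))
        paper[i:i + n] = [None] * n  # each token is used in one run only
        pa[j:j + n] = [None] * n
    return sorted(runs)  # ordered by position in the newspaper text


# Invented sentences, for illustration only.
PA = "The chauffeur was banned from driving for two years today."
PAPER = "Banned from driving for two years, the chauffeur left court."
print(verbatim_runs(PA, PAPER))
```

Raising `min_len` discards short matches, which is the crudest possible form of the significance weighting discussed next.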

Each match will also be given a weighting value that provides an indication as to whether a verbatim match is “significant” or not. The “significance” value will be based upon:

1. The length of the match (against the frequency of longest matches in the whole corpus, to determine what length of shared text is likely in matches that occur simply by chance).

2. The type of match (e.g. whether name of person, location etc. – text entities that are likely to be used by independent sources reporting the same events).

3. The words within the match – domain-specific words that would be commonly used by either PA or newspaper (e.g. ‘a court heard’), or function words that simply form part of English language structure (such as ‘of’, ‘and’, ‘the’ etc.).
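A toy weighting along these lines might look as follows. Everything in it is an assumption made for illustration: the word lists, the `chance_length` threshold and the way the factors are combined are hypothetical, and the second factor (the type of match, e.g. names and locations) is left out because it would need an entity recogniser.

```python
# Hypothetical resources; real lists would be derived from the corpus.
FUNCTION_WORDS = {"of", "and", "the", "a", "in", "to", "for", "was", "at", "by"}
DOMAIN_PHRASES = {"a court heard", "the court was told", "the jury was told"}


def match_weight(match, chance_length=3):
    """Score a verbatim match between 0 (uninformative) and 1."""
    words = match.lower().split()
    # Factor 3: stock domain phrases say little about the source.
    if " ".join(words) in DOMAIN_PHRASES:
        return 0.0
    content = [w for w in words if w not in FUNCTION_WORDS]
    if not content:
        return 0.0  # function words only
    # Factor 1: matches no longer than chance_length are discounted.
    length_factor = min(1.0, len(words) / chance_length)
    # Factor 3 again: the share of content words in the match.
    return length_factor * len(content) / len(words)


for phrase in ("of the", "a court heard", "two years",
               "banned from driving for two years"):
    print(phrase, "->", round(match_weight(phrase), 2))
```

In practice `chance_length` would be estimated from the distribution of match lengths across the whole corpus, as the first factor above suggests.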

Figure 1 illustrates the output of finding the longest common strings between the PA source on the left of the screen and the newspaper (The Times) on the right. The green (light grey) highlights represent the verbatim text and the red (dark grey) highlights in the newspaper story represent new words (these are simply words that are in the newspaper story but not in the PA story). The blue (black) highlights indicate words or phrases that, although they appear in both texts, appear more times in the newspaper than in the PA, i.e. they indicate duplicated text. The newspaper text is composed of verbatim strings of 1 to 11 words that come from a number of different locations in the PA text.

Figure 1 – verbatim text

In this example, there are phrases such as “Eamon Reidy”, “The Queen Mother”, “banned from driving”, “for two years”, “July 4” and “Surrey” that would probably appear in any story reporting this same event, even if not using the PA source. This highlights the need for some form of weighting for each match. Phrases such as names of people, names of places, dates and verbs describing an event (such as bombing) could be found using Information Extraction techniques (13). Domain-specific phrases such as “the court was told” or “was fined” could be extracted using statistics based upon previous newspaper and PA articles.

More information about the detection and classification of verbatim text can be found in (1).

6 Transformations of re-written text

Identifying verbatim text is just one part of re-use. Text can also be re-used and changed. This is classed as re-written text in this framework. Re-written text is text that has been paraphrased: it means the same as the original but is expressed using different words. The hard part is identifying the re-written part of the newspaper text and where this has come from in the PA source.

Some of the most common types of re-write that appear within the newspaper are:

(1) Substitution

Generally a single word, but can be extended to a phrase. Single-word substitution uses synonyms that could be found in a dictionary or thesaurus. Words and phrases that have the same meaning as the corresponding PA source are commonly used in a newspaper to alter the original text (also a very popular form of plagiarism). Not only are words substituted, but also numbers for words and vice-versa.

• Scalding water (Boiling water)

• Passed classified information (Passed sensitive information) (transferred top-secret data).

• A mother-of-four (a mum-of-four)

• Appeared in court (was in court)

• The Queen Mother (the Queen Mum)

• A third accused (a third defendant)

• In debt (had financial difficulties) (was broke) (owed money) (was on the breadline) (poverty stricken) (cash-strapped)

• was in court (made a brief appearance in court) (appeared before X magistrates)

• accused of four charges of indecent assault (charged with four sex attacks)

• A top cop (a Chief Constable)

• from Malvern (of Malvern), from Hove (of Hove)

• United States citizens (citizens of the USA) (people from America)

Approaches that could be used to deal with this form of re-write include using a thesaurus or dictionary (for example WordNet) or using synonymous expressions found manually over a body of similar text.
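The second approach, a manually collected list of synonymous expressions, can be sketched with a small lookup table. The table below is a hypothetical hand-built stand-in (it is not WordNet and covers single words only), filled with pairs taken from the examples above; the function names are likewise invented.

```python
import re

# Hypothetical synonym groups collected by hand from the examples above.
SYNONYM_GROUPS = [
    {"scalding", "boiling"},
    {"mum", "mother"},
    {"accused", "defendant"},
    {"classified", "sensitive"},
]
# Map every word to one representative of its group.
CANONICAL = {word: sorted(group)[0] for group in SYNONYM_GROUPS for word in group}


def tokens(phrase):
    return re.findall(r"[a-z0-9]+", phrase.lower())


def is_substitution(phrase_a, phrase_b):
    """True if the phrases differ only by words from the same synonym group."""
    a, b = tokens(phrase_a), tokens(phrase_b)
    if a == b:
        return False  # identical wording is verbatim, not a substitution
    return [CANONICAL.get(t, t) for t in a] == [CANONICAL.get(t, t) for t in b]


print(is_substitution("A mother-of-four", "a mum-of-four"))
print(is_substitution("Scalding water", "Boiling water"))
print(is_substitution("Scalding water", "Cold water"))
```

Phrase-level substitutions such as "in debt" and "had financial difficulties" change the number of words and would need a phrase table rather than this word-by-word comparison.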

(2) Abbreviations

The PA generally gives the full name of an organisation or company; the newspaper is more likely to abbreviate the PA source and assume the audience knows what the abbreviation stands for.

• British Broadcasting Corporation (BBC)

• British Airways (BA)

• United Kingdom (UK)

An important point to mention is that the abbreviation does not always go from the larger to the smaller form. Sometimes an abbreviation given in the PA is expanded into its full form.

• DJ (Disc Jockey)

• USA (America)

• 29 (29-year-old)

Abbreviations can also take the form of a shortened version of the original word.

• Photographs (photos)

• Hants (Hampshire)

Abbreviations are similar to substitutions in many ways, as a word or words can be used to express the same meaning. The substitution produces either a smaller or larger string in the re-written text than in the original; re-sizing a newspaper article to fit an available space is a very common process. Abbreviations could be identified using some form of previously derived list or using some form of dynamic programming to identify “similar” strings.
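As a rough illustration of the dynamic-programming idea, the sketch below checks for initial-letter abbreviations and scores shortened forms by the longest common subsequence of characters. Both functions and the choice of normalising by the shorter string are assumptions made for this example, not a method taken from the project.

```python
def is_acronym(short, full):
    """True if `short` is made of the initial letters of the words in `full`."""
    initials = "".join(word[0] for word in full.split())
    return short.upper() == initials.upper()


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, 1):
            if ch == other:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Share of the shorter string's characters found, in order, in the other."""
    a, b = a.lower(), b.lower()
    shorter = min(len(a), len(b))
    return lcs_length(a, b) / shorter if shorter else 0.0


print(is_acronym("BBC", "British Broadcasting Corporation"))
print(similarity("photos", "Photographs"))
print(similarity("Hants", "Hampshire"))
```

Conventional forms such as "Hants" score noticeably lower than simple truncations like "photos", which is one reason a previously derived list remains useful alongside string similarity.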

(3) Re-ordering and movement of PA source

Very often, the order of the PA source is changed: sometimes to fit the flow of the newspaper story, other times to fit a change in tense or a change from active to passive voice etc. This is central in the treatment of PA material and may involve not just re-ordering at a local level, but also re-ordering words/phrases from different sentences of the PA report to create a coherent newspaper article.

• In a nightclub punch-up (In a punch-up at a nightclub)

• Son of a policeman (a policeman’s son)

• The Chief Constable of Sussex (the Sussex Chief Constable)

• BBC1 Controller (Controller of BBC1)

• Sentence 1 of Mirror Report On Ostrich Case:
A routine vehicle check (Sentence 4 - PA) uncovered a cattle lorry that was transporting a cargo of (S1 - PA) 26 (S2 - PA) ostriches (S1 - PA), two of them injured (S1 - PA) and covered in blood (S2 - PA), an animal welfare expert told (S1 - PA) Birmingham magistrates (S2 - PA).

If verbatim text is re-ordered, this can be identified using an approach such as the Dotplot (6), but generally this will be combined with many other transformations that make re-writes much harder to identify.
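A dotplot places one text along each axis and marks every cell where the two words are equal; shared phrases then appear as diagonal runs, and re-ordered phrases as diagonals displaced from the main diagonal. The following is a minimal word-level sketch written for this report's example, with invented function names, and is not the implementation described in (6).

```python
def dotplot(tokens_a, tokens_b):
    """grid[i][j] is True where word i of text A equals word j of text B."""
    return [[a == b for b in tokens_b] for a in tokens_a]


def render(grid):
    """Draw the grid with '#' for a match and '.' otherwise."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in grid)


def diagonals(grid, min_len=2):
    """Return (start_a, start_b, length) for each diagonal run of matches."""
    rows = len(grid)
    cols = len(grid[0]) if grid else 0
    runs = []
    for i in range(rows):
        for j in range(cols):
            starts_run = grid[i][j] and not (i > 0 and j > 0 and grid[i - 1][j - 1])
            if starts_run:
                n = 0
                while i + n < rows and j + n < cols and grid[i + n][j + n]:
                    n += 1
                if n >= min_len:
                    runs.append((i, j, n))
    return runs


text_a = "the chief constable of sussex".split()
text_b = "the sussex chief constable".split()
grid = dotplot(text_a, text_b)
print(render(grid))
print(diagonals(grid))
```

Here the phrase "chief constable" shows up as a diagonal of length two that starts at word 1 in one text and word 2 in the other, which is the signature of local re-ordering.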

(4) Assumed/World knowledge

In some cases, the newspaper may rely on knowledge that its audience is assumed to have, or on general world knowledge. The newspaper re-write may not even use the same words as the PA source, but the overall result will be the same. This is often found in newspapers that address their audiences directly, for example in The Sun, where terms and expressions are used that the reader will understand but that would be hard to identify automatically. For example “the British” may be re-written as “us Brits” or “our family”. This form of re-write makes analysing the text at surface level very hard.

• For subjecting … a regime of torture (for torturing)

• Hobbs who had been taking drugs … was high on amphetamines (Hobbs was high on drugs)

• The Argentines (the Argies)

• Sheffield Wednesday Football Club (The Owls)

• Pushed and swore at one officer and kicked another after they intervened during a row with her boyfriend (swore at one and kicked another during a row with her boyfriend)

• An official at the crown prosecution service (an official at the heart of Britain’s justice system)

• London Court (Horseferry Road Magistrates)

• Damien and Leanne gassed themselves (They were found dead in his Ford Escort car. …. an hosepipe led from the exhaust)

• Police Sgt Ian Froggett told the inquest (Sgt Ian Froggett told the inquest)

• Roger Pond, who worked for Radio Kent (Roger Pond, the former radio DJ)

• A new inquest was ordered (the judge rejected a bid by South Wales Police to block a new inquest)

• Had drunk X pints of cider (was drunk)

• The injured PC (the injured officer)

• Jailed for life for stabbing (murdering)

• Assaulting two officers in her father’s force (the chief constable ... assaulting two of his officers)

• ‘A gamekeeper ... responsible for game management’ (‘The head gamekeeper’)

• ‘Hobbs calmly left the flat with a Nintendo game, a watch and Peter Smith’s Abbey National Cashcard. He twice withdrew sums of £250 from a cashpoint’ (‘He stole a video game, a watch and a cashcard which he used to take £500 from Mr Smith’s account’)

• ‘two witnesses pointed the finger at her as the person responsible’ (‘men at the squat accused her’)

• ‘she mixed mainly with fellow drinkers, living rough or in squats’ (‘at a squat she shared with other alcoholics’)

• Melanie C (Sporty Spice)

(5) Domain-specific knowledge

There are particular phrases and words that are used many more times than others simply because they are specific to a domain. In the examples from the METER corpus, many references to courts and law are made, with the most common phrases being the result of verdicts, the location of the court for a trial etc. It is useful to identify these common phrases from the PA and the newspapers, as in many cases the common phrases are synonymous. It is worth noting that a newspaper can be held liable for libel in what it writes. Therefore it cannot make assumptions about the outcome of a case or use the entire PA source until the verdict has been announced publicly. A newspaper is more likely to use phrases such as ‘allegedly’ or ‘Mr X prosecuting’ to make a statement about what has happened.

• the court was told (the court heard)

• appeared at Liverpool magistrates (was brought before the magistrates at Liverpool)

• ‘the jury was told’

• ‘the case was adjourned’

• ‘the trial continues’

• ‘allegedly’

(6) Temporal issues

Due to the difference between the times at which the newspaper reports the story and the PA releases the source, there are changes that can be categorised as temporal. The most common is the newspaper’s use of “yesterday” as opposed to the PA’s “today”. Other temporal changes are due to different expressions of dates, reference to times relative to the published newspaper date etc.

• today (yesterday)

• in 1998 (last year)

• earlier this year (in March)

• offences spanning 12 years (committed robberies from 1984 to 1996)

• During an incident in July 1998 (incident on 4 July, 1998)

• In 1996 (on Christmas Day, 1996)

• On 14th February (On Valentine’s Day)
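The simplest of these cases, “today” against “yesterday”, can be handled by resolving each relative expression against the date of the document it appears in. The sketch below assumes that the PA release date and the newspaper publication date are known; the function names and the dates used are hypothetical.

```python
from datetime import date, timedelta

# Offsets in days for a few relative expressions (an illustrative subset).
RELATIVE_DAYS = {"today": 0, "yesterday": -1, "tomorrow": 1}


def resolve(expression, document_date):
    """Turn a relative day expression into a calendar date, or None if unknown."""
    offset = RELATIVE_DAYS.get(expression.lower())
    if offset is None:
        return None
    return document_date + timedelta(days=offset)


def same_day(pa_expr, pa_date, paper_expr, paper_date):
    """True if both expressions resolve to the same calendar day."""
    pa_day = resolve(pa_expr, pa_date)
    paper_day = resolve(paper_expr, paper_date)
    return pa_day is not None and pa_day == paper_day


# Hypothetical dates: PA copy released one day before the newspaper appears.
pa_released = date(1999, 7, 4)
paper_published = date(1999, 7, 5)
print(same_day("today", pa_released, "yesterday", paper_published))
```

Expressions such as “last year”, “in March” or “on Valentine’s Day” would need a fuller date grammar, and differ in granularity from the dates they replace, so exact equality would no longer be the right comparison.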

(7) Exaggeration in the newspaper

Sometimes, the newspaper may exaggerate what the PA has supplied to make the story more eye-catching or “exciting”. This can help sensationalise an event, and the relatively “neutral” terms supplied by PA become more dramatic. The style of the newspaper may also have a big influence over the expression of events.

• believed they would part (feared they would part)

• leaving a note saying (leaving a note begging)

• after the inquest, the couple’s family (after the inquest, the couple’s grieving family)

• attacked (butchered, hacked, slaughtered)

• the defendant said (the defendant insisted)

• the man who killed (the killer) (evil killer) (cruel psychopath)

• killed (ruthlessly destroyed) (viciously attacked)

• the car (the £30,000 sports car) (the dream machine)

• ran into (rammed into) (smashed into)

• BA immediately announced that it would appeal against the ruling which it branded ‘wrong in fact and in law’ (BA said it would appeal and blasted the ruling as ‘wrong in fact and in law’)

• executed his lover’s husband (‘blasted his lover’s husband to death’)

• Oasis guitarist Paul Arthurs has quit the hugely successful rock band, it was announced today (Oasis guitarist Paul Arthurs shocked fans last night by announcing he is quitting the top rock band.)

Exaggeration is generally in the form of additional adjectives or adverbs to give greater effect.

(8) Use of pronouns

Another common form of re-write is the use of pronouns to shorten the PA source or to remove repetition.

• Johnnie Walker (He, Mr Walker)

• Leanne and Damien (they, the pair/ the couple/ the lovers etc)

• Smith, Jones and Brown (the trio, the defendants, the gang, the accused)

Knowledge of these synonymous phrases would help to resolve more cases where the texts do not match verbatim.

(9) Use of prepositions, conjunctions, adjectives and adverbs

Conjunctions such as ‘and’, ‘because’ and ‘after’, adverbs such as ‘when’ and prepositions such as ‘of’, together with the article ‘the’, are used most often to link and bring together separate phrases or words that are distant in the PA. They help to facilitate the movement of material.

• A woman jailed for life for a double murder was freed today by the Court of Appeal. Three judges ruled that the convictions for Mary Druhan were ‘unsafe’ (A woman jailed for life for a double murder was freed yesterday by The Court of Appeal after judges ruled that her conviction was unsafe)

In this example, two sentences have been reduced to one but still kept coherent by using the conjunction ‘after’.

(10) Use of the definite and indefinite article

The indefinite article ‘a’ and the definite article ‘the’ can be used to make information in a newspaper report more or less specific.

• Deputy Chief Constable of Surrey Police, Ian Beckett (Ian Beckett, the Deputy Chief Constable of Surrey).

Using ‘the’ in the above example helps to bring ‘Ian Beckett’ to the reader’s attention much more quickly than the PA version.

• Boxing champion Garry Delany was cleared (Garry Delany, the boxing champion was cleared)

• Hendricks said outside Horseferry Road Magistrates Court (Hendricks said outside the court)

(11) Naming forms

Different contexts bring about different forms of names (12), and nowhere more so than between different newspapers. There are a large number of different naming conventions that could be used to address an individual or group of individuals, and all can be applied to a name or names given in the PA source.

• First name only: Elizabeth, Robert, Michael.

• Short form: Liz, Bob, Mike.

• First name + last name: Elizabeth Clough

• Title + last name: Mr Smith, Dr Roberts, Miss Partridge.

• Title only: Sir, Madam, Miss

• Last name only: Darby, Smith.

• Profession or trade: Doctor, Constable, Engineer.

• Formal name + title: Lord Archer, The Right Honourable Tony Blair.

• Formal title: Her Royal Highness, Madam Speaker.

• Anonymous address: boy, girl, you.

• Assumed name: Maggie, Sporty Spice, Tarzan, Squiggle.

• Assumed name taken by the individual: Snoop Doggie Dogg, Terrence Trent D’arby

• Other group: dearly beloved, comrades, ladies and gentlemen.

(12) Expression of numbers

Numbers and amounts given in the PA source are often quoted in words and vice-versa.

• Was two-and-a-half times over the drink-drive limit (was 2½ times over the limit/was 2.5 times over the limit)

• 20% (one-in-five)

• 18.3% (almost one-in-five, one fifth)

• 2000 cases a year (nearly 40 a week)

More often than not, the PA expresses amounts such as 2.5 as words. This may be because the PA produces the source in plain ASCII text to be compatible with as many customers as possible; the use of symbols may cause compatibility problems. Generally ages seem to be given in number form, but amounts are expressed in words.
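Differences of this kind can be neutralised by mapping each expression to a number before comparing. The sketch below covers only the patterns in the examples above (digits, the ½ sign, small number words, “-and-a-half”, percentages and “X-in-Y”); the function names and the limited vocabulary are assumptions made for illustration.

```python
import math

UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}


def to_number(text):
    """Parse '2.5', '2½', 'two' or 'two-and-a-half' into a float, else None."""
    t = text.lower().strip().replace("\u00bd", ".5")  # '2½' becomes '2.5'
    try:
        return float(t)
    except ValueError:
        pass
    words = t.replace("-", " ").split()
    if not words or words[0] not in UNITS:
        return None
    value = float(UNITS[words[0]])
    rest = words[1:]
    if rest == ["and", "a", "half"]:
        return value + 0.5
    return value if not rest else None


def to_fraction(text):
    """Parse '20%' or 'one-in-five' into a proportion, else None."""
    t = text.lower().strip()
    if t.endswith("%"):
        number = to_number(t[:-1])
        return None if number is None else number / 100
    parts = t.split("-in-")
    if len(parts) == 2:
        numerator, denominator = to_number(parts[0]), to_number(parts[1])
        if numerator is not None and denominator:
            return numerator / denominator
    return None


print(to_number("two-and-a-half"), to_number("2\u00bd"), to_number("2.5"))
print(math.isclose(to_fraction("20%"), to_fraction("one-in-five")))
```

Approximations such as “18.3%” against “almost one-in-five” would need a tolerance rather than equality, and rescaled figures such as “2000 cases a year” against “nearly 40 a week” would need unit conversion as well.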

(13) Changes of active and passive voice

Often, the newspaper changes the voice of the PA source from active to passive or vice-versa. This generally results in a change of word order, as the recipient of some action is typically expressed as the subject of a passive clause, and the doer of an action as the subject of an active clause.

• He shot the man (the man was shot by him)

• Mr A shot Mr B, a gardener (A horticultural worker, B was shot by Mr A) (The agricultural labourer, Mr B was blasted by Mr A)

• ‘was moonlighting at a night club’ (‘moonlighted as a nightclub bouncer’)

• Mr Justice Carnwarth refused permission to appeal (Permission to appeal was refused)

(14) Change of verbs to nouns

Changes in vocabulary are very common in the newspaper. The PA source can be changed to create a shorter or longer story that requires a change in sentence structure to form a coherent story. Changing a verb to a noun or noun to verb may be a result of the style used in journalistic reporting. Changing a verb or adverb to a noun is also called nominalisation (5) and the derived nouns are called derived nominals. Nominalisation and the use of nouns for actions is very common in official, bureaucratic and formal modes.

• Charged with dealing in cocaine (a cocaine dealing charge)

Here, the verb ‘charged’ used by the PA is changed to the noun ‘charge’ in the re-written newspaper version.

• in her battle with anorexia (as she battled with anorexia)

In this example, the noun ‘battle’ has been re-written as a verb ‘battled’. In this case, there is no movement of text, just a syntactic change. Further examples of verb/noun changes include:

• is charged with possession of cocaine and offering to supply cocaine (is charged with possessing and offering to supply cocaine)

• his defection (defected)

• she was quizzed by Simon Mayo (Simon Mayo, who was quizzing)

• is thought to have had an amicable split from the group (is thought to have split amicably from the group)

• There are no plans for a replacement, said the spokesman (There are no plans to replace him, a band spokesman said) (A spokesman said there were no plans to replace Arthurs) (There are no plans to replace him)

• Victor Temple, prosecuting (Victor Temple, for the prosecution) (Prosecutor, Victor Temple)

• The coroner recorded a verdict of suicide ... (Recording a verdict of suicide, the coroner .....) (A suicide verdict was recorded by the coroner)

• ‘promises to recreate prehistoric life’ (‘a recreation of prehistoric life’)

(15) Change of tense

Together with a change in part-of-speech, it is common to change the tense of the PA source. This is because most events in the PA source are reported in the present tense, whereas the newspaper will report the events in the past tense. The newspaper will obtain the PA source, and by the time the newspaper reaches the reader it will be the following day. This results in the use of verbs with an –ing form of ending in the newspaper as opposed to a different form in the source.

• Charged with dealing in cocaine (a cocaine dealing charge)

• After admitting liability (he admitted liability)

• An inquest jury's finding that a cell death was contributed to by police neglect was overturned by the High Court today and a fresh inquest ordered. (The High Court has overturned an inquest jury’s finding that police neglect contributed to the death of Jason Tristram)

• One of Britain's most senior policeman made a brief appearance in court today accused of indecently assaulting two women (The Deputy Chief Constable of Surrey was in court yesterday charged with four sex attacks)

• ‘his defection’ (‘defected’)

• she was quizzed by Simon Mayo (Simon Mayo, who was quizzing)

• ‘is thought to have had an amicable split from the group’ (‘is thought to have split amicably from the group’)

• ‘There are no plans for a replacement, said the spokesman’ (‘There are no plans to replace him, a band spokesman said’) (‘A spokesman said there were no plans to replace Arthurs’) (Report: ‘There are no plans to replace him’)

(16) Change in treatment of direct speechThe newspaper may change the way in which direct speech is presented and forexample re-write a quote given in first person to one given in third person. Forexample, the PA source may have used a quote directly from Mr Smith: ‘I thoughtthat .......’, but the newspaper changes this to the third person: ‘Smith recalled that...........’

• PA: Recorder McLaren said: ‘I accept the evidence that this woman ... was terrified by this incident and sought to escape’ (Express: Mrs Field said she was terrified and tried to escape)

• ‘A delighted Miss Hendricks said outside Horseferry Road magistrates court, central London: ‘I am very, very relieved’. (‘Outside the court, PC Hendricks said she was relieved)

• A Scotland Yard spokeswoman said: “If PC Hendricks logs a formal complaint, it will be investigated. We have not received any complaints in relation to any other officers in this incident” (A Scotland Yard spokeswoman said the matter would only be investigated further if PC Hendricks logged a formal complaint)

• Former police officer, Lisa Conneely, told the court: ‘she’s totally 100% dedicated to the police service’ (‘PC Hendricks - described by a friend as ‘100% dedicated to the police force’)

• ‘I wish everyone in the band every success ... see you at the next show’ (The guitarist promised to attend future shows and wished the group well)

• A spokesman for the show said: “It is nothing to do with that incident on the train. The decision was taken before that”. (Street bosses said the decision was taken two days before the rail incident.)

An interesting finding is that quotes from the PA source are sometimes altered, so that by the time they appear in the newspaper they differ from the original article.

• Mr Carman asked: ‘You are a goalkeeper who keeps moving the goalposts, don’t you? In the sense that you are playing with the truth’. Grobbelaar responded: ‘That’s simply not true’. (Bruce Grobbelaar was told in court he was ‘a goalkeeper who keeps moving the goalposts’. Carman made the claim, accusing Grobbelaar ‘of playing with the truth’. ‘That’s simply not true’ said Grobbelaar.)

• Peter Salmon said: ‘what this autumn on BBC1 represents is something the commercial sector can’t or won’t produce’. (Peter Salmon said that the programming represented ‘something the commercial sector can’t or won’t produce’.)

• ‘We have a deal. We have agreed terms. The BBC are obviously having trouble coming to terms with their loss.’ (‘We have a deal. We’ve agreed terms. The BBC is obviously having difficulty coming to terms with their loss’.)

7 Example techniques being used to help detect and measure re-use

Some possible techniques involved in detecting and measuring re-use are presented here. The techniques are used to help identify domain-specific words and phrases, extract names and dates, find “similar” words and phrases and detect differences in vocabulary between the PA and British Press.

7.1 Information Extraction (IE)

This fairly new Language Engineering technology uses the structure of texts as well as the content to help extract information and fill pre-determined templates (templates: entities and their relationships). Applications for this type of technology are vast and IE has already been used in areas such as finance, business, media etc. However, IE is domain-specific and must be customised for specific tasks.

Currently I have only used one small part, known as named entity recognition, of a complete IE system called LaSIE (Large Scale Information Extraction) developed by the University of Sheffield. This technology is able to extract names, dates, company names etc. and has been run on a couple of small example newspaper texts.

A driver was almost three times over the limit when he crashed into <ENAMEX TYPE="PERSON">Queen Elizabeth</ENAMEX> the Queen Mother’s <ENAMEX TYPE="ORGANIZATION">Daimler</ENAMEX> then fled, a court was told yesterday.
Eamon Reidy, 32, reversed away but crashed his <ENAMEX TYPE="ORGANIZATION">Citroen</ENAMEX> BX into a wall at Egham, near <ENAMEX TYPE="ORGANIZATION">Windsor Great</ENAMEX> Park, Surrey. He then ran off and was caught after a mile-and-a-half chase.
Reidy of Langley, Berks, was banned for two years and fined £700 with £50 costs by magistrates at Woking, Surrey, after admitting drink-driving and failing to stop after an accident.

Figure 2 – output of the named entity recogniser

Identifying the names etc. within the text can help to determine a weighting value for the matches found. An example of the named entity recogniser output is shown in Figure 2. The output is not accurate. This is because the system has been tuned for American newswire texts and not British newspaper text. English geographical locations and combinations specific to newspaper and PA style, such as first name + surname followed by age, are not part of the current ruleset. Therefore these entities are not recognised. In this story, many correctly identified entities were classified as unknown. Therefore, if the “unknown” list were also taken into account, most names, locations and company names would have been extracted.
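To make use of this output, the marked-up entities first have to be pulled out of the text. The following is a minimal sketch in Python, assuming the recogniser emits SGML-style `<ENAMEX TYPE="...">` tags as in Figure 2; the function names `extract_entities` and `strip_markup` are hypothetical and are not part of LaSIE.

```python
import re

# Assumption: entities are marked up as <ENAMEX TYPE="PERSON">...</ENAMEX>,
# as in the Figure 2 sample. This is not LaSIE's own API.
ENAMEX_RE = re.compile(r'<ENAMEX\s+TYPE="([A-Z]+)">(.*?)</ENAMEX>', re.DOTALL)


def extract_entities(marked_up_text):
    """Return (entity_type, entity_text) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in ENAMEX_RE.finditer(marked_up_text)]


def strip_markup(marked_up_text):
    """Remove the ENAMEX tags, leaving only the plain text."""
    return ENAMEX_RE.sub(lambda m: m.group(2), marked_up_text)


sample = (
    'crashed into <ENAMEX TYPE="PERSON">Queen Elizabeth</ENAMEX> the Queen '
    'Mother\'s <ENAMEX TYPE="ORGANIZATION">Daimler</ENAMEX> then fled'
)
entities = extract_entities(sample)
plain = strip_markup(sample)
```

The extracted pairs could then be used to give a lower weighting to matches that consist only of names, since such matches are expected in any two reports of the same event.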

7.2 Statistics

Having the METER corpus allows application of a number of statistical techniques to analyse the language and vocabulary used in both the newspapers and the PA. This can be used for four purposes:

(1) To identify differences in the vocabulary between the PA and British newspapers. (Kilgarriff (7) used the chi-square statistic to compare differences in vocabulary between broadsheet and tabloid newspapers).

(2) To identify words and phrases specific to a domain, such as courts and law.

(3) To identify vocabulary differences between derived and non-derived newspaper stories.

(4) To determine lengths of co-occurring verbatim strings that are “significant” and representative of derived texts.

Overviews of statistics for Natural Language Processing (NLP) can be found in (9) and (10).

One approach for vocabulary comparison is to use frequency. Raw frequency is not appropriate for corpora of differing sizes, therefore the frequency counts must be normalised. One common use of statistics in NLP is to highlight differences in vocabulary usage between two corpora. In (11), Paul Rayson et al. use the chi-square (X2) statistic to find differences between the spoken vocabulary of men and women using the British National Corpus (BNC). The chi-square statistic is a significance test that shows whether findings are the result of genuine differences or due to chance. The problem with chi-square is that low frequency events give erroneous results and therefore the statistic cannot be used on counts less than 5. However, sometimes low frequency word occurrences are very significant and it may be necessary to find these words. Therefore another significance test can be used called log-likelihood (G2) which deals accurately with low counts (8). The two tests mentioned are known as non-parametric, which means they do not assume words follow any underlying distribution. Other statistical tests that do assume a known underlying distribution are called parametric. An example is the t-test (Wilcoxon’s rank sum test, by contrast, is non-parametric). Consult (9) for more details. In my experiments I use the G2 statistic because it is non-parametric and can deal with low counts.

The procedure for measuring vocabulary usage using statistical methods is to first construct a contingency table. The contingency table used in my experiments has 2 rows and 2 columns allowing comparison of word usage between two corpora. The table is illustrated in Table 1.

Corpus A and Word X            Corpus B and Word X
Corpus A and any other word    Corpus B and any other word

Table 1 – 2 x 2 contingency for measuring vocabulary differences

The cells of the contingency table contain the observed and expected outcomes of word frequency and these are used in calculation of the chi-square and log-likelihood statistics.

Currently only initial tests have been made on the data, but results of the vocabulary experiments are given in Appendices C and D. More information about the framework used for these experiments can be found in (3).
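The calculation from a 2 x 2 table can be sketched as follows. This is a minimal illustration in Python and not the tooling used for these experiments; it assumes the standard textbook forms of the two statistics, with expected values derived from the row and column totals, and the function names are hypothetical.

```python
import math


def _expected(a, b, c, d):
    """Expected cell values for a 2 x 2 table under independence.

    a: word X in corpus A        b: word X in corpus B
    c: other words in corpus A   d: other words in corpus B
    """
    total = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [[rows[i] * cols[j] / total for j in range(2)] for i in range(2)]


def chi_square(a, b, c, d):
    """Pearson chi-square: sum of (O - E)^2 / E over the four cells."""
    if a + b + c + d == 0:
        return 0.0
    observed = [[a, b], [c, d]]
    expected = _expected(a, b, c, d)
    return sum(
        (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2) for j in range(2)
        if expected[i][j] > 0
    )


def log_likelihood(a, b, c, d):
    """G2 = 2 * sum of O * ln(O / E); empty cells contribute nothing."""
    if a + b + c + d == 0:
        return 0.0
    observed = [[a, b], [c, d]]
    expected = _expected(a, b, c, d)
    return 2.0 * sum(
        observed[i][j] * math.log(observed[i][j] / expected[i][j])
        for i in range(2) for j in range(2)
        if observed[i][j] > 0
    )
```

A word used at the same rate in both corpora scores zero on both statistics; the scores grow as usage diverges, and G2 remains defined when the word is absent from one corpus.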

Identifying differences in the vocabulary between the PA and British newspapers

The PA and newspapers are different organisations, therefore the first experiment finds differences in the vocabulary used. A contingency table was constructed for each bi-gram (pair of words) of the corpora (only for court reporting) and the bi-grams were sorted in descending order on the G2 value. This puts bi-grams most descriptive of either source nearer the top of the list. The higher G2 values result when one of the sources uses words or phrases that are not found at all in the other. The frequency is also given for each bi-gram and results given in Appendix C.
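The ranking procedure can be sketched in Python as below. This is an illustrative reconstruction under stated assumptions, not the program used for Appendix C: text is assumed to be already tokenised and lower-cased, bi-grams are counted across sentence boundaries, and the names `rank_bigrams` and `g2` are hypothetical.

```python
import math
from collections import Counter


def g2(a, b, c, d):
    """Log-likelihood for a 2 x 2 table (a, b: bi-gram counts; c, d: all other bi-grams)."""
    total = a + b + c + d
    cells = [(a, (a + b) * (a + c)), (b, (a + b) * (b + d)),
             (c, (c + d) * (a + c)), (d, (c + d) * (b + d))]
    return 2.0 * sum(o * math.log(o * total / rc) for o, rc in cells if o > 0)


def bigrams(tokens):
    """Adjacent word pairs, in order."""
    return list(zip(tokens, tokens[1:]))


def rank_bigrams(tokens_a, tokens_b):
    """Return (G2, bi-gram, count in A, count in B) rows, highest G2 first."""
    counts_a = Counter(bigrams(tokens_a))
    counts_b = Counter(bigrams(tokens_b))
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    rows = []
    for bg in set(counts_a) | set(counts_b):
        a, b = counts_a[bg], counts_b[bg]
        rows.append((g2(a, b, total_a - a, total_b - b), bg, a, b))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows


pa = "the court heard today that the case was adjourned until today".split()
news = "the court heard yesterday that the trial continues yesterday".split()
ranking = rank_bigrams(pa, news)
scores = {bg: score for score, bg, _, _ in ranking}
```

On this toy input a bi-gram found in only one source, such as ('adjourned', 'until'), scores higher than one shared by both, such as ('the', 'court'), which is the behaviour the experiment relies on.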

In many cases, the bi-grams are specific to either the PA or the newspapers (in this case, all newspapers are grouped together to form a domain). The top three PA bi-grams are elements of PA formatting which occur on every PA article, but whose words are not used much in newspapers. Bi-grams such as these are not useful and all PA headers should be removed from the texts. There are five formatting bi-grams in the PA and none in the newspapers. The PA uses the word ‘today’ in many bi-grams, which corresponds to the newspapers’ use of ‘yesterday’. This highlights the different times at which the PA and newspapers report the stories (e.g. ‘heard yesterday’ and ‘heard today’). Twelve of the newspaper bi-grams refer to temporal expressions, as opposed to five in the PA. A further cause of this could be that newspapers will generally start the story with “X appeared in court yesterday on charges of ….” or something along these lines. With many different newspaper stories corresponding to the same PA article, this repetition of ‘yesterday’ is not unusual. It may be interesting to look at where in the newspaper article ‘yesterday’ appears and the context in which the term is used.

Nine of the PA bi-grams contain a proper name (e.g. ‘mr boyd’, ‘mrs hillier’), as opposed to three of the newspaper bi-grams. This could indicate the PA uses fewer pronouns and is more specific in the manner in which it reports people involved in events, but could also be simply due to repetition in the PA text.

Noticeable is the PA’s use of possessives (e.g. x’s death, x’s murder, robert’s x’s home, x’s family). There are only two bi-grams in the newspapers that contain possessives. I think this is simply due to the style of the PA reporting and again repetition in the PA domain.

A final interesting bi-gram is the PA use of ‘adjourned until’. The newspaper has a bi-gram ‘trial continues’ and this fits in with how the newspaper and PA report events. The newspaper tends to summarise a court report with “the case continues” (particularly the tabloids) whereas the PA will report something like “the case is adjourned until next Wednesday”.

The two domains share common bi-grams, with court-specific terminology appearing in both sets of results as expected. Personally I would say that 11 of the PA bi-grams and 15 of the newspaper bi-grams are representative of the law-reporting domain (the newspaper figure mainly due to the large number of occurrences including ‘yesterday’).

Identifying words and phrases specific to a domain

The same procedure was repeated as before, but this time with just PA data from two different domains: 1) law-reporting and 2) showbusiness. The aim of this experiment was to see whether the G2 measure could pick out specific words and phrases for a given domain. The results are given in Appendix D.

As before, there are some bi-grams that have come up that are not so much style of a domain, but simply names used within the domain. The usage of proper names is more interesting and a better indicator of style.

The results are again representative of each domain with the exception of names of people, organisations and function words (such as ‘that the’, ‘he had’). With these bi-grams removed, the remaining bi-grams would be a reasonable summary of each domain. It is also interesting that the results for the showbusiness domain were representative, as the number of PA stories for this domain was approximately 1/5 of that for the law-reporting domain.

There may be more words that are representative of the corpus that have a lower G2 value and do not appear on the list. The list, however, is not exhaustive and only supposed to be a sample of terms. Personally I would say that 23 of the showbiz bi-grams are representative of the domain (including names of celebrities such as “sir elton” and names of television programmes such as “coronation street” and “top gear”) but only 9 terms are representative of law-reporting. Of the representative law-reporting bi-grams, most are indicative of courts (such as “mr judge”, “crown court” and “court heard”). I think there are many more terms used within the showbiz domain that are very specific to that domain and would not commonly be used in law reporting. Also, there may be fewer showbiz terms in the vocabulary of the domain. Therefore repetition will be greater and the G2 value higher.

Identifying differences between derived and non-derived newspaper stories

Currently this experiment has not been performed.

Identifying lengths of co-occurring verbatim strings that are “significant” and representative of derived text

The aim of this experiment would be to evaluate the results of finding the longest common strings between PA and newspaper texts to gain a probability figure for the numbers and lengths of longest verbatim strings. This would indicate what lengths of strings in common could be due to chance and those more likely due to derivation. It may also be necessary to evaluate the ratios of the length of PA and newspaper texts to determine the lengths of verbatim strings that could be expected between texts of certain sizes (i.e. a small newspaper text from the PA may well be derived but have fewer shorter verbatim matches in common). Currently this experiment has not been performed.
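Although the experiment itself has not been run, the raw material for it, the set of verbatim word strings shared by a PA text and a newspaper text, is straightforward to collect. The sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in for the longest-common-string program; it is an assumption for illustration only. Note that `get_matching_blocks` returns non-overlapping matches in text order, so a string that has been moved relative to a longer match may be missed.

```python
from difflib import SequenceMatcher


def verbatim_matches(pa_text, news_text, min_words=3):
    """Return (length_in_words, string) for shared word sequences, longest first.

    Tokenisation is naive (lower-cased, split on whitespace); punctuation
    attached to words is therefore significant.
    """
    pa = pa_text.lower().split()
    news = news_text.lower().split()
    matcher = SequenceMatcher(None, pa, news, autojunk=False)
    matches = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_words:
            words = pa[block.a:block.a + block.size]
            matches.append((block.size, " ".join(words)))
    return sorted(matches, reverse=True)


pa_text = ("A driver was almost three times over the limit when he "
           "crashed into the Queen Mother's Daimler")
news_text = ("A boozy driver was almost three times over the limit when he "
             "hit a Daimler")
matches = verbatim_matches(pa_text, news_text)
```

Collecting the match lengths over many derived and non-derived pairs would give the distributions from which a "significant" length could be estimated.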

A problem with using these statistical measures is that they do not take into account the distribution or dispersion of the terms within the text. In other words, all occurrences of a word may appear in just one story and therefore occur not as a domain-specific word but as a single story-specific term. This is part of the problem with names. They may occur many times in one story and be specific to one domain, therefore giving a high G2 value. By combining G2 with a dispersion measure, a more accurate list of terms may be extracted. More about dispersion measures can be found in (8).

7.3 Dotplot

The Dotplot (6) is a method for visually comparing two texts enabling similarity to be seen in the form of diagonal lines. Two texts exactly the same will produce a line along the main diagonal with further diagonal lines indicating duplication within the text.
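The idea can be illustrated with a very simple dispersion measure: the fraction of stories in which a term appears. This is a hedged sketch only; it is not the measure described in (8), and the function names and the 0.5 threshold are assumptions chosen for the example.

```python
from collections import Counter


def term_dispersion(stories):
    """Map each term to (total occurrences, fraction of stories containing it).

    `stories` is a list of token lists, one per story.
    """
    total = Counter()
    story_count = Counter()
    for tokens in stories:
        total.update(tokens)
        story_count.update(set(tokens))  # count each story at most once
    n = len(stories)
    return {term: (total[term], story_count[term] / n) for term in total}


def well_dispersed(stories, min_fraction=0.5):
    """Terms appearing in at least `min_fraction` of the stories."""
    return sorted(term for term, (_, fraction) in term_dispersion(stories).items()
                  if fraction >= min_fraction)


stories = [
    ["court", "heard", "grobbelaar", "grobbelaar", "grobbelaar"],
    ["court", "heard", "jury"],
    ["court", "adjourned"],
]
dispersion = term_dispersion(stories)
```

Here 'grobbelaar' and 'court' both occur three times, but only 'court' is spread across all the stories, so a filter on dispersion would keep 'court' and drop the story-specific name.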

A sample of four Dotplots for the first 4 texts from Appendix A is shown in Appendix E. The input to each Dotplot is the PA text and the newspaper text merged together. This creates the square plot. The Dotplot has been created from tri-graphs (groups of three characters) and should be examined in quarters. I have manually divided the Dotplots into quarters to indicate the locations of each text. The top-left and bottom-right quadrants are the same text matched against itself. These are not interesting. The quadrants that show the similarity between two texts are the top-right and the bottom-left. The diagonal lines show groups of tri-graphs in common between the two texts. The longer the line, the more likely the groups of tri-graphs come from the other text. There will be a certain number of repeated tri-graphs that will occur because some words in a text are more common than others (e.g. ‘the’, ‘and’). This is called noise and can be removed by filtering out these words or using signal processing to remove the noise from the image.

It is possible to see traces of diagonal lines in the top-right and bottom-left quadrants of the Dotplots in Appendix E. The most easily seen are those in The Sun and The Times. These both use longer verbatim strings from the PA than the other two and it is possible to see that the diagonals covering The Times are in the first part of the PA text. There are also traces of a diagonal in the Daily Star, but more noise makes identifying this harder.

The Dotplot is useful as it can provide a visual illustration of the similarity between two texts and highlight positions from within the PA where verbatim newspaper matches are found. A problem with this approach, however, is the simplicity of the matching (either matches exactly or not). This makes it currently unsuitable for detecting re-writes. The Dotplot could be improved using simple approaches such as lemmatising and normalising the text (such as converting dates to a “common” form) and experimenting with the most suitable unit of comparison (i.e. different lengths of character or word n-grams). The advantage of the Dotplot is its immunity to movement of text.

Appendix F shows Dotplots for three scenarios: 1) related subjects and derived from the PA source (as classified by the Journalism Department), 2) related subjects and not derived from the PA source and 3) unrelated subjects. The Dotplots have been constructed using tri-graphs and the texts used include bombings at a US embassy and a case about Bruce Grobbelaar.

The Dotplots quite clearly show distinction between unrelated subjects (no diagonal lines are visible) and related subjects. Even the derived and non-derived Dotplots show differences, with the derived story sharing many more consecutive n-grams than the non-derived.
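The matching that underlies a Dotplot can be sketched in a few lines of Python. This is an illustrative reconstruction and not the Dotplot program used here: it computes only the cross-comparison of two texts (the top-right quadrant of the merged plot) as a set of dot coordinates, and the function names are hypothetical. The unit of comparison is a parameter, so different n-gram lengths can be tried.

```python
from collections import defaultdict


def char_ngrams(text, n=3):
    """All overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def dotplot(text_a, text_b, n=3):
    """Set of (i, j) such that n-gram i of text_a equals n-gram j of text_b."""
    positions = defaultdict(list)
    for j, gram in enumerate(char_ngrams(text_b, n)):
        positions[gram].append(j)
    return {(i, j)
            for i, gram in enumerate(char_ngrams(text_a, n))
            for j in positions.get(gram, [])}


def longest_diagonal(dots):
    """Length, in n-grams, of the longest unbroken diagonal run of dots."""
    best = 0
    for i, j in dots:
        if (i - 1, j - 1) in dots:
            continue  # not the start of a run
        length = 1
        while (i + length, j + length) in dots:
            length += 1
        best = max(best, length)
    return best


dots = dotplot("the case continues", "he said the case was adjourned")
run = longest_diagonal(dots)
```

Because each dot is placed independently of where the match occurs, a shared string produces a diagonal wherever it appears in either text, which is why the method is unaffected by movement of text; isolated dots from common tri-graphs are the noise described above.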

8 An example of how different newspapers report the same event

Appendix A contains the PA version of a news event and five examples of the same event reported by different newspapers. It is interesting to see the differences between the style of reporting and the parts of the PA deleted to create a shorter summary.

In these examples, the following observations have been made:

• The Times version – PA is the only source, as there is no new material and the similarity between the articles is very high.

• The Sun version – PA is again the only source, as there is no extra detail in this article. The similarity between events and the order in which they are described also indicates derivation.

• The Mirror version – the story has been exaggerated to fit in-house style (e.g. “a boozy driver” rather than PA “a drink-driver”). The Mirror also adds detail to the story that has not come from the PA, such as John Horton being a Grandad and chasing Eamon Reidy in his slippers. Therefore PA is not the only source, although parts are still derived from the PA.

• The Daily Star version – as with The Mirror, the Daily Star has covered most of the text on PA, but PA is not the only source as the Daily Star has added more detail such as information about a blood test and the registration of Reidy’s car. It is much harder to tell whether this has been derived from the PA. The Department of Journalism assure me that this is derived.

• Another version – this is derived from the PA source, but again contains additional detail about Reidy admitting drink-driving.

9 Determining whether PA is the only source

So far, it has been assumed that the PA has been the only source used to create the newspaper story and that any verbatim or re-written text from the newspaper can be associated with some part of the PA source. Any text that cannot is assumed to be new information. One problem that has not been addressed is how the journalist analysing the texts can be sure that the newspaper story is derived from the PA and that the words and phrases found in the newspaper are not simply due to chance.

One of the most important indicators of this is new material that appears in the newspaper article but cannot be associated with any part of the PA source. This generally takes the form of more detail or completely new information. For example, the newspaper may have used another source to find out the ages of people, the number of people involved in an incident, obtain more information on amounts etc.

An interesting observation is this: some phrases in common between the PA and newspaper have a low significance weighting (for example names, ages, dates, amounts etc.) because independent stories would also share these terms. However, the same terms have a very high significance weighting if they appear in the newspaper but not in the PA. For example, finding a new name or amount in the newspaper article makes the journalist think that the PA is not the only source. Finding function words, words of language structure or words of genre in the newspaper but not in the PA is insignificant, as they are expected to appear.

Another observation is that it only takes one minor difference in the newspaper not on PA to show PA is not the only source.

The whole PA and newspaper source needs to be analysed to determine whether more detail is given, or new material exists in the newspaper against the PA, but some indicators that the PA is not the only source include:
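This asymmetry suggests a simple check that could be automated: list the numeric details (ages, amounts, counts) that appear in the newspaper text but nowhere in the PA text. The sketch below is a rough illustration in Python; the regular expression and the function name `new_amounts` are assumptions, and names would need a named entity recogniser rather than a pattern.

```python
import re

# Assumption: amounts look like 14, 9,000, 22.5 or £500,000.
AMOUNT_RE = re.compile(r"£?\d+(?:,\d{3})*(?:\.\d+)?")


def new_amounts(pa_text, news_text):
    """Numeric details in the newspaper text that never occur in the PA text."""
    pa_amounts = set(AMOUNT_RE.findall(pa_text))
    return [amount for amount in AMOUNT_RE.findall(news_text)
            if amount not in pa_amounts]


pa = "A New Age traveller today lost a claim for compensation"
news = "A New Age traveller yesterday lost his £500,000 damages claim"
extra = new_amounts(pa, news)
```

Any value returned is a candidate indicator that the PA was not the only source, and would carry a high weighting; a value found in both texts would carry a low one.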

(1) Differences in the newspaper version

Differences in spelling:

• Mr Abramson ran off into Vicar’s Close, Stratford. (Mr Colligan gave himself up, but Mr Abrahamson ran away)

Differences in names:

• Bramble, Rogers and Rogers also deny perjury. (Bramble, Fenton and Rogers also deny perjury).

Differences in ages:

• A 15-year-old-boy, whom the coroner ruled cannot be named … (And the hearing was told that a 14-year-old boy, who cannot be named …).

(2) Additional information and facts in the newspaper

If a newspaper has used an additional source, or a completely different source from the PA, there is always much greater detail or additional facts that the PA has not supplied.

• A New Age traveller today lost a claim for compensation (A New Age traveller yesterday lost his £500,000 damages claim).

• Frank Walter and Sandra reached an out-of-court settlement. (Frank and Sandra Walters reached an out-of-court settlement – believed to be a ‘no fault’ payment of £9,000).

• Mr and Mrs Walters (Mr and Mrs Walters, who have five children and 19 grandchildren)

• Their car was surrounded by Mr Gedge and several friends who had been drinking strong lager and cider for much of the day. (The court heard how Mrs Fields and her husband, who have since divorced, were in her Ford Mondeo when it was surrounded by a gang of drunken travellers with mohican haircuts).

• “I used to be 22 stone when I was 12” (A son who weighed 22 stone at the age of 12 and was so large he had to sleep on a double bed).

• Denied three counts of cruelty (denied three charges of cruelty from 1994 to 1997).

• A man was beaten up by Territorial Support Group police officers who kicked him and struck him with a retractable baton. (Abrahams … was said to have been repeatedly kicked and beaten with 2ft metal batons by four officers from Scotland Yard’s territorial support group).

(3) Interpretation of facts

Sometimes, the newspaper will explain what a fact means:

• She heard him say: “There’s a lot of dodgy people around here” (She heard him say “There’s a lot of dodgy people around here” – police slang for corruption).

Deciding whether the newspaper story is derived from the PA is hard and in most cases is very subjective. Some indicators that can be used to help with this include:

• The same reporter’s name may appear as the author of both PA and the newspaper story (share the same by-line).

• Although there may be many verbatim sections between the two texts, the number of verbatim matches, length of the matches, dispersion of related matches within the PA text might give a good indication of derivation.

• All facts in the newspaper are a subset of the facts in the PA.

• Certain PA-specific phrases (or n-grams) may appear in the newspaper article (by PA-specific I mean phrases that commonly appear many more times in the PA than any of the newspapers).

• The order of the verbatim and possible re-written phrases corresponds to the order in which they appear on the PA (independently written texts are less likely to share the same text in the same order).

• Matching phrases of the PA and newspaper share similar contexts (by similar I mean some overlap of the same or re-written words).

The hardest case is determining when PA is not the source. The newspaper article could share many verbatim strings with the PA copy, but still not come from the PA: the verbatim material could be due to chance. Because all newspaper stories are written in a certain way (e.g. with introduction and summary, then more detail until finally a conclusion), the order could be similar for derived and non-derived articles (e.g. the example given in Appendix B). By looking at the whole text of the newspaper and PA articles, a better idea can be formed as to whether the newspaper is derived or not.

10 An example of news reporting on the same event where PA is not the source

The classification of texts in the METER corpus as derived from PA or not cannot be guaranteed to be correct. The only way to do this would be to ask individual authors of the reports. This is not possible. Appendix B is an illustration of how a newspaper report can appear similar to the PA source even when it can be guaranteed that the journalist has not used the PA. A member of the Journalism Department wrote the article for the Independent but did not use the PA as a source.

Appendix G shows the output of finding the longest common strings between the texts. The green highlights indicate the verbatim text and the blue verbatim but duplicated text. The red highlights show new text in the newspaper that does not occur in the PA version.

The output quite clearly shows the large density of longer green highlights nearer the beginning of the story, but more red nearer the end. The verbatim text at the end of the story is very much shorter and more distributed. The green highlights in the PA text are spread relatively evenly throughout the text. Currently the first occurrence of a word not used in a previous longer highlight is used for association with the longest string found in the newspaper. This is not accurate as no consideration is given to context (i.e. in many cases the wrong occurrence of a matching verbatim string from the newspaper text is chosen in the PA version).

One interesting feature of derivation this example points out is that even texts written independently will share relatively long verbatim strings. In this case the 14-word string “based at Abermarle, near Newcastle upon Tyne, have already pleaded guilty to various drug” is shared between the texts. This may seem unlikely, or even impossible, between two independently written texts, but both the Independent and the PA journalist would have been at the same Press Conference and heard the same events. Therefore, it is not so unlikely to have long verbatim strings in common. If the whole newspaper text were like this, however, then the independence assumption would be less believable. Therefore in this case, although the texts are similar and in some places share long verbatim strings, the texts are still independent of each other but are both derived from a third source. Simply asking whether the newspaper text was derived from the PA version, without considering another source, may in this case lead to a wrong conclusion. Statistics based upon the longest common verbatim strings would be necessary to determine lengths of verbatim strings that would be expected from derived and non-derived texts.

The Dotplot for this example is given in Appendix H. Looking at the bottom-left and top-right quadrants, there are no obvious traces of lines in the plots, indicating that the texts are not very similar (or rather they share many words, but not in long strings across the whole text).

11 Conclusions

In this report I have presented a framework for dealing with text re-use between a PA source and a newspaper. The framework categorises text as being verbatim, re-written or new. These classifications apply at a word level, although words can be part of other classifications at a higher level. For example some words may be classified as verbatim at a word level, but at a higher level form part of a re-write. The newspaper may be derived from the PA or not, and the expectation of verbatim string lengths between derived and non-derived texts would help determine this.

The re-written text can be passed through a number of transformations before appearing in the newspaper. I have given 16, but there may be more. The types of transformation range from simple to complex. Some transformations are syntactic, others semantic. Some changes imply the use of audience or domain knowledge. Extracting this knowledge from a single PA and newspaper text is not possible. It is necessary to analyse many examples to build up a list of possible substitutions and synonymous expressions. In most cases the substitutions used in a newspaper would not be found in a “standard” dictionary or thesaurus (by “standard” I mean general and not domain-specific). Although I have presented the transformations individually, they will almost always appear together and some examples given to me by the Department of Journalism are very complex to analyse. This makes detecting and associating re-written text very hard. In many ways, re-writing newspaper text from a PA source is like plagiarism and paraphrasing, and the similarity can be seen in (2).

I have presented some techniques that can help to detect and measure re-use. I have only begun investigation with these tools and all approaches require further investigation. The identification of longest common strings and the Dotplot visually help to identify matches between the two texts and patterns of usage. I have included examples of a newspaper story being written by different newspapers to show how the same event can be reported in various ways. Also included is an example of a story that I can be sure has been written independently of the PA. This shows that long verbatim matches are still possible and a method for ignoring these in a re-use measurement is essential.

Finally I presented ways in which it is possible to distinguish between a newspaper using the PA as a single source and as only one of a number of sources. This again is very subjective, but the indicators mentioned are valid and can help to decide whether it is likely that the PA has been used. Deciding derivation is hard. It can depend upon many factors external to the newspaper text and upon the experience of the analyser making the decision.

Another problem that I have not addressed in this report is associating the PA text with the re-written and verbatim newspaper text. Although it may at first seem easy, automating this process involves several complexities and my current research has shown that in most cases a simple program aligning the texts will get the association wrong. The context of a verbatim or re-written match is very important and is the best indicator of where in the PA the newspaper text may have been taken from. Problems with aligning the verbatim text can be found in (1).

Analysing the newspaper and PA texts is very hard and only so much can be done at a surface level. In most cases, it is important to work at a semantic level. However, the syntax, the patterns of re-use of verbatim material and the statistics based upon a collection of newspaper and PA documents can help in finding re-written text. A rule-based approach using the transformations listed for the re-written text is possible, but will not cover all possible cases. I aim to explore a number of techniques to find the most reliable and robust (but not necessarily correct) approach.

Weighting the matches found with a “significance” value helps to identify matches that are common across the domain and matches that could occur in any newspaper article about the same topic whether using the PA or not. This means giving low weighting to names of places, people, companies, events etc. that are likely to be shared. What is less likely to be shared are partial phrases or longer phrases. Where a verbatim match starts and ends may be significant: if the match also contains structure or function words then perhaps it implies the PA was used as a source to lift the verbatim text from.

There are two main problems to solve in identifying re-use between newspapers and the PA.The first is identifying the similarity between the texts including re-written text. The second isdetermining whether PA is the source, part of a source or not a source. This will determinewhether the identified similarities between the texts are due to derivation (i.e. the newspaperre-written from PA) or chance.

12 Future analysisGiven more time, I would like to analyse more newspapers. I would particularly like toanalyse more sets of newspapers and PA articles on the same story to highlight differences inreporting language. I would like to also gather a greater number of stories from a wide rangeof newspapers to find individual styles that could be used to transform a PA story into a storyin the style of e.g. The Daily Express or The Guardian.

The re-use framework could also be used to monitor on-line web sites that use the PA as a source and highlight re-used stories. For a large-scale analysis in real time it would be necessary to use on-line news sources. This could be useful in finding which stories are re-used across most customers and could be taken further to find which parts of the PA copy are re-used by news providers.

A further study could be to analyse which page the re-used PA story appeared on, to find which stories are considered “important” enough to be on the front page or near to the front (assuming the front page carries the most important news).

A final study could be to determine whether the PA covers all news events that are reported by the British media, and if not, to identify which stories are not covered. This would help the PA to determine how complete its coverage is as a comprehensive news service.

Further work must be undertaken on the METER corpus. Currently, most examples of re-use in the corpus are newspaper stories derived from the PA. To tackle the issue of whether newspaper text is derived from the PA, examples of both classifications (derived and not derived) are necessary. Analysis of the corpus using n-gram statistics and probability models may well determine which lengths of match are due to chance and which are due to derivation.

13 Acknowledgements

I would like to thank John Arundel for help with analysing the newspaper and PA texts, Scott Piao for providing me with a number of statistical tools and Mark Hepple for the use of his Dotplot program.

14 References

(1) P D Clough, A framework for detecting and classifying verbatim text, Internal Report, Department of Computer Science, University of Sheffield, June 2000.

(2) P D Clough, Plagiarism in natural and programming languages: an overview of current tools and technologies, Internal Report, Department of Computer Science, University of Sheffield, January 2000.


(3) P Clough, S Piao, Extracting macro statistical information from PA and newspaper collection for measuring text re-use, Internal Report, Department of Computer Science, University of Sheffield, June 2000.

(4) T Dunning, Accurate methods for the statistics of surprise and coincidence, Computational Linguistics, 19(1), pp(61-67), 1993.

(5) R Fowler, Language in the News – Discourse and Ideology in the Press, Routledge, ISBN 0-415-01419-0, 1999.

(6) J Helfman, Dotplot: a program for exploring self-similarity in millions of lines of text and code, Journal of Computational and Graphical Statistics, 2(2), pp(153-174), June 1993.

(7) A Kilgarriff, Comparing corpora, ITRI, University of Brighton, November 9th 1999.

(8) A Lynn, In Praise of Juilland’s ‘D’: a contribution to the empirical evaluation of various measures of dispersion applied to word frequencies, Methodes Quantitatives et Informatiques dans l’etude des Textes, Geneve-Paris, pp(588-595), 1986.

(9) C Manning and H Schuetze, Foundations of Statistical Natural Language Processing, MIT Press, ISBN: 0262133601, 1998.

(10) M P Oakes, Statistics for Corpus Linguistics, Edinburgh books in empirical linguistics, Edinburgh University Press, ISBN 0 7486 0817 6, 1998.

(11) P Rayson, G Leech, M Hodges, Social Differentiation in the Use of English Vocabulary: Some analyses of the Conversational Component of the British National Corpus, International Journal of Corpus Linguistics, Vol. 2(1), pp(133-152), 1997.

(12) D Reah, The Language of Newspapers, Routledge, ISBN 0-415-146000-3, 1998.

(13) Y Wilks and R Gaizauskas, Information Extraction: Beyond Document Retrieval, Journal of Documentation, vol. 54, no. 1, pp(70-105), January 1998.


Appendix


Appendix A – An Example of PA re-use

The following example highlights differences between newspapers and the PA reporting the same story. It is not known whether the PA is the source for all stories, but some newspapers have more detail than the PA on parts of the story. These highlight facts and exaggeration that the PA has not reported.

PA version

DRINK-DRIVER FLED AFTER COLLISION WITH ROYAL DAIMLER
By John Sheehan, PA News

(1) A drink-driver who ran into the Queen Mother's official Daimler was fined £700 and banned from driving for two years today.

(2) Eamon Reidy, 32, was two-and-a-half times over the drink-drive limit when he rammed the royal car, magistrates in Woking, Surrey, were told.

(3) The 99-year-old Queen Mother was not in the vehicle when the accident happened on July 4 in Bishopsgate, Egham, Surrey.

(4) Magistrates were told that Reidy sped off before abandoning his car, running across fields and hiding in undergrowth until he was spotted by the police helicopter.

(5) Prosecuting Robin Bowen said: ``At 8pm the defendant was driving towards Englefield Green in a black Citroen BX and collided with a Daimler limousine, a vehicle which was used on a daily basis by the Queen Mother. She was not in it at the time. It was being driven by a chauffeur.

(6) ``He then reversed before driving forward at high speed, turned immediately right and collided with a wall and some bushes.''

(7) Mr Bowen said Reidy jumped out of his car and was chased for one-and-a-half miles by 56-year-old John Horton who had witnessed the accident.

(8) Mr Bowen said: ``Mr Horton pursued him for one-and-a-half miles before catching up with him and taking hold of his arm. Mr Reidy threatened him and told him to leave him alone. He let him go and the defendant ran off.''

(9) The court was told that police later discovered Reidy lying in undergrowth.

(10) Defending Ms Lesley Barry said Reidy had drunk two glasses of champagne at his parents wedding anniversary party at lunchtime before drinking three pints of strong lager in The Sun pub in Egham.

(11) She said Reidy and his partner were looking to buy a house and that the stress of viewing properties and trying to arrange a mortgage had got to him.

(12) She said: ``He is not a man who normally goes to the pub to drink on his own but he felt he wanted a bit of space to himself to mull over the finances of purchasing the house.''

(13) She said he decided to walk home afterwards but returned a few minutes later to pick up his car.

(14) She described how the accident happened, saying: ``He was driving along Wick Lane and suddenly he was aware that there was a big black saloon car virtually stationary in the road. He does not know where it came from. He was too close to be able to stop.''


(15) She said that Reidy has been ``shaken up enormously'' by the crash and that he was in a state of shock and panic before fleeing. Ms Barry said that Reidy had become scared by Mr Horton and that he had run away across the fields.

(16) Airport worker Reidy, of Langley, near Slough, Berks, was banned from driving for two years, fined #700 and ordered to pay #50 costs.

The Times version

Eamon Reidy, 32, a drink-driver who rammed into Queen Elizabeth the Queen Mother’s Daimler, was fined £700 and banned from driving for two years.

The Queen Mother was not in car when the accident happened on July 4 in Surrey.

The Sun version

A DRUNK driver who ploughed into the Queen Mother’s limo was fined £700 and banned for two years yesterday.

Eamon Reidy, 32, was 2½ times over the legal limit when he rammed the parked Daimler in a country lane.

The Queen Mum - 99 last week - was not in the car at the time but her chauffeur was.

Airport worker Reidy sped off after the smash near Egham, Surrey, on July 4.

He glanced off a wall and flattened some bushes before abandoning his Citroen.

Chased

Then he ran 1½ miles across fields chased by crash witness John Horton. Mr Horton finally cornered him - but Reidy threatened him and fled.

Reidy, of Langley, Berks, tried to hide in some undergrowth.

But he was spotted by a police helicopter and arrested, magistrates in Woking, Surrey, heard.

Defending, Lesley Barry said Reidy was trying to buy a house and had money worries.

He had drunk two glasses of champagne at his parents’ wedding anniversary party before drinking three pints of strong lager at a pub.

The Mirror version

A BOOZY driver who smashed into the Queen Mum’s chauffeur-driven Daimler minutes after she had been dropped off was banned for two years and fined £700 yesterday.

Eamon Reidy, 32, fled across fields in Windsor Great Park after the crash, the court heard.

Grandad John Horton, 56, head gardener on the royal estate, chased him in his slippers for one and a half miles as armed cops, dogs and helicopter joined in the pursuit.

John caught up with the fugitive and grabbed his arm. But when Reidy threatened him - "he decided discretion was the better part of valour and let him go," Woking magistrates were told.


Police discovered airport worker Reidy lying in undergrowth near the Queen Mum’s Royal Lodge on the Crown estate.

He was found to be two-and-a-half times over the legal limit.

Reidy, of Langley, Berks, admitted drink-driving and failing to stop.

The Daily Star version

A DRUNK driver who crashed into the back of the Queen Mum’s limo was banned for two years yesterday.

Airport worker Eamon Reidy, 32, was nearly three times the drink-drive limit when he hit the royal Daimler after a two-and-a-half hour session in the pub.

He reversed his black E-reg Citroen BX after the crash and hit a wall before fleeing the crash scene.

But he was chased for a mile-and-a-half by a passer-by who gave police a description of the Citroen driver.

A helicopter and armed police were drafted into the search and Reidy was found hiding in bushes.

The Queen Mother who uses the Daimler daily, was not in the car when it was hit.

Reidy refused to comment after the case at Woking magistrates’ court.

He hit the chauffeur driven car, registration NLT 2, in Bishopsgate Road, Egham, Surrey, last month.

Head gardener John Horton, 56, chased Reidy, who told his pursuer to leave him alone or he would "have him".

Reidy was found in bushes by police, but ran off again before he was finally arrested.

A blood test showed a reading of 93 microgrammes of alcohol - the legal limit is 35.

Another version

A driver was almost three times over the limit when he crashed into Queen Elizabeth the Queen Mother’s Daimler then fled, a court was told yesterday.

Eamon Reidy, 32, reversed away but crashed his Citroen BX into a wall at Egham, near Windsor Great Park, Surrey. He then ran off and was caught after a mile-and-a-half chase.

Reidy of Langley, Berks, was banned for two years and fined £700 with £50 costs by magistrates at Woking, Surrey, after admitting drink-driving and failing to stop after an accident.


Appendix B – An Example where PA is not the source

The following PA and corresponding newspaper article show the same story reported by independent sources. I know that the newspaper article did NOT use the PA as a source because the journalist – Jonathan Foster of the Department of Journalism – has assured me that no use of PA was made.

PA version

15:19 17/07/98: Page 1 (HHH) COURTS Soldiers

SOLDIER CONVICTED OF DRUGS SMUGGLING
By Melanie Harvey, PA News

A soldier who served in one of Britain's most famous regiments was today convicted of taking part in a major drugs smuggling plot.

Dale Mills, 26, of Southvale, King's Heath, Northampton, was found guilty of importing heroin, ecstasy and cocaine during cross-Channel trips.

He was remanded in custody until sentencing next week.

The jury of nine women and three men took 12 hours and 36 minutes to return their verdict on Mills at Liverpool Crown Court in a majority verdict.

Mills, wearing a grey suit and green shirt, hung his head as the verdict was announced.

Members of his family in the public gallery and two women jurors began to weep.

mf

15:19 17/07/98: Page 2 (HHH) COURTS Soldiers

Yesterday, the same jury unanimously convicted Bombadier Kevin Jones and ex-Gunner James Bull for being part of the same drugs smuggling plot.

Six other men, four of them either serving or ex-members of the 39th Regiment Royal Artillery, along with Mills, Bull and Jones, based at Abermarle, near Newcastle upon Tyne, have already pleaded guilty to various drug important charges in relation to the two-year smuggling ring.

Jason Foster, 25, a lance bombadier in the same regiment, from Balmoral Road, Wigan, Greater Manchester, was cleared yesterday by the jury of smuggling charges.

During the three-week trial the court heard that couriers travelled to the Continent using a number of different types of transport.

They would then return with ecstasy, heroin and cocaine, bound for the north west of England.

The conspiracy ended in June 1997 when a Honda Civic car was stopped and searched by Customs officers at Coquelles, at the French entrance to the Channel Tunnel.

Inside the vehicle, driven by serving soldier Peter Jackson, accompanied by passenger Billy Gee Stott, a gunner with the regiment, were drugs with a street value of more than #1 million.

As part of the same plot, narcotics worth in excess of #1 million were recovered by Customs officers in the back of two taxi cabs in Liverpool.

In total, more than #2.5 million worth of drugs were seized during the probe codenamed Operation Cruiser, but it is believed that the total imported by the gang could have been worth more than #12 million.

mf

15:36 17/07/98: Page 3 (HHH) COURTS Soldiers

Mills will be sentenced next week, along with Jones, 31, of Pader Close, Hazelrigg, Newcastle upon Tyne, and Bull, 29, of Inskip, Skelmersdale, west Lancashire.

Also appearing for sentence at Liverpool Crown Court will be the six men who pleaded guilty to their part in the smuggling plot.

They are Peter Jackson, 29, of Linden Park, Burnage, Manchester; Paul Bromiley, 30, of Station Road, Bamber Bridge, Preston; Billy Gee Stott, of Abermarle Barracks _ all serving soldiers with the regiment.

They will be joined by Paul Wright, 29, of Blaydon Close, Netherton, Merseyside, a former gunner; Peter O'Toole, 26, of Lowell Road, Walton, Liverpool, and Darren Williams, 27, of Highfield Road, Ellesmere Port, who all admitted importing drugs.

Following the hearing, a spokesman for Her Majesty's Customs and Excise who worked with the Army to uncover the plot, said: ``There is no doubt that this joint operation has smashed a complete team that was responsible for the importation and onward distribution of high priority hard drugs that were destined for the north west.''

end

16:09 17/07/98: Page 2 (HHH) COURTS Soldiers

(corrected repetition _ correcting spelling of Bombardier)

Yesterday, the same jury unanimously convicted Bombardier Kevin Jones and ex-Gunner James Bull for being part of the same drugs smuggling plot.

Six other men, four of them either serving or ex-members of the 39th Regiment Royal Artillery, along with Mills, Bull and Jones, based at Abermarle, near Newcastle upon Tyne, have already pleaded guilty to various drug important charges in relation to the two-year smuggling ring.

Jason Foster, 25, a lance bombardier in the same regiment, from Balmoral Road, Wigan, Greater Manchester, was cleared yesterday by the jury of smuggling charges.

During the three-week trial the court heard that couriers travelled to the Continent using a number of different types of transport.

They would then return with ecstasy, heroin and cocaine, bound for the north west of England.

The conspiracy ended in June 1997 when a Honda Civic car was stopped and searched by Customs officers at Coquelles, at the French entrance to the Channel Tunnel.

Inside the vehicle, driven by serving soldier Peter Jackson, accompanied by passenger Billy Gee Stott, a gunner with the regiment, were drugs with a street value of more than #1 million.

As part of the same plot, narcotics worth in excess of #1 million were recovered by Customs officers in the back of two taxi cabs in Liverpool.

In total, more than #2.5 million worth of drugs were seized during the probe codenamed Operation Cruiser, but it is believed that the total imported by the gang could have been worth more than #12 million.

mf

16:29 17/07/98: Page 1 (HHH) COURTS Soldiers Background

DRUG SMUGGLING SOLDIERS WHO ABUSED PRIVILEGED STATUS
By Maria Breslin, PA News

A multi-million pound drug smuggling ring smashed after an 18-month undercover operation had its roots in one of Britain's most distinguished regiments.

Under the motto Whither Right And Glory Lead, men from the 39 Regiment of the Royal Artillery fought for their country in major conflicts across the globe.

But while off-duty serving and former soldiers were abusing their privileged status to ship casements of cocaine, heroin, ecstasy and amphetamines from Amsterdam to Britain.

Their actions said David Turner QC, prosecuting, had forever changed the perception of police and customs officials waging war against the importation of illegal drugs.

``Because of their status and the frequency of the traffic, Army officers and soldiers occupy a fairly privileged position as travellers. Until this case those who man the borders at customs were entitled to feel that the least likely drug smugglers were those who had sworn an oath to protect the realm.

``This case changed that perception,'' he added.

The Army's Special Investigation Branch of the Royal Military Police attached officers to specialist undercover Customs and Excise Teams in order to net the gang and a surveillance team was provided to cover the suspects while they were on duty.

``The thought of letting members of my Regiment continue for some time with these activities in order to gather compelling evidence was against my natural military instincts,'' said Lt Col Nick Clissitt, Commanding Officer of 39 Regiment Royal Artillery at the time of the investigation.

``However it did reinforce my determination that they should be brought to book and face the full weight of the law. I am gratified that this has happened. Also I am very proud of the remainder of the Regiment who are decent and honest soldiers who did their bit to root out these criminals,'' he added.

The conspiracy was cracked in June last year when serving soldiers Peter Jackson and Billy Gee Stott were stopped in Coquelles by customs carrying heroin, ecstasy and amphetamines with a street value of more than #1 million.

The total value of drugs brought in by the Cross-Channel couriers could have been as much as #12 million.

Their drop-off destination in Liverpool suggests they were destined for towns and cities in the north west of England.

The swoop and subsequent arrest of Jackson and Stott, both Gunners in the 39 Regiment who pleaded guilty to their part in the drug smuggling racket, was the result of an on-going operation.

It was the culmination of an investigation centred on Jackson, fellow Royal Artillery soldier Paul Bromiley and Peter O'Toole during which narcotics worth more than #1 million were recovered from two taxi cabs in Liverpool.

``As the investigation proceeded suspicions grew that serving soldiers in that Regiment were being used to bring drugs into this country,'' said David Turner QC.

``Of course there is no special dispensation for members of the armed forces in passing customs control. However, because soldiers frequently are travelling in and out of the country their movements do not attract attention.

``Showing a British Army identity card at Customs might be expected to reassure Customs officers that its bearer would not be a drugs smuggler,'' he added.

Lieutenant Colonel John Lanham, head of the Royal Military Police Special Investigation Branch, said the Army goes to great lengths to root out the drug smugglers in its midst.

``The measures in place resulted in these criminals appearing in Liverpool Crown Court. Whilst, for obvious reasons, I do not wish to go into detail, we have a number of Drug Investigation Teams working at home and abroad alongside civilian agencies.

``They are dedicated to finding, gathering evidence and prosecuting any military personnel tempted by this kind of activity,'' he added.

An army spokesman said the three-week trial at Liverpool Crown Court reinforced its policy of zero tolerance and paid testament to the determination of both military and civilian organisations to stamp out drugs.

end

Independent’s version – Jonathan Foster

1 18 Jul 1998 Soldiers ran pounds 12m drugs import ring: The Independent (Q1:100) By JONATHAN FOSTER


A DRUGS TRIAL which has badly damaged the reputation of one of Britain's most famous regiments ended yesterday.

It followed an 18-month investigation by Customs officials who believe the racket they uncovered involved the smuggling into Britain of up to pounds 12m of heroin, ecstasy, amphetamines and cocaine by soldiers and former servicemen with the 39th Regiment Royal Artillery.

During the trial, it emerged that more than pounds 1m of drugs had been found in two taxis in Liverpool. In all, pounds 2.5m of drugs were seized.

One of the accused, Dale Mills, 26, was found guilty of importing narcotics at Liverpool Crown Court yesterday. On Thursday two others - Bombadier Kevin Jones, 31, and former gunner James Bull, 29 - were convicted of taking part in the same two-year plot.

Six other men, four of them either serving or former members of the regiment, based at Abermarle, near Newcastle upon Tyne, have already pleaded guilty to various drug charges. They are serving soldiers Peter Jackson, 29, Paul Bromiley, 30, and Billy Gee Stott; Paul Wright, 29, a former gunner; and Peter O'Toole, 26, and 27-year-old Darren Williams. All nine will be sentenced next week. A 10th man, Jason Foster, 25, a lance bombardier, was cleared by the court.

The trial brings to an end one of the most extraordinary and embarrassing cases ever to involve the military in Britain. Customs officers hope it will also cut off one of the major drugs supply lines to the North-west.

The ring was exposed two years ago after Customs officers grew suspicious of a foot passenger who arrived at Dover in Kent on a ferry from Calais. They found that the man had receipts for pounds 4,500 cash deposited during the previous month. He claimed he was 'buying property in Dusseldorf'.

Officers were further alerted when he walked over to a red Nissan waiting to leave the docks. He climbed in, the car was pulled over and, on a cloudy night in January 1996, the British Army fell under suspicion of drug running.

In the Nissan were two off-duty gunners from 39th Regiment Royal Artillery back from Calais on the same sailing as their passenger. They carried passports and authentic military identification.

There was no contrabrand in the car, and Customs officers were used to soldiers travelling frequently to and from continental postings. But why had Paul Bromiley and Peter Jackson picked-up Peter O'Toole in the car park? Why would soldiers from barracks in the North-east travel out from Hull and return two days later through Calais? And why had two storage spaces been created in the car, concealed behind the rear seat?

Customs let the three men go, but an investigation was launched which revealed that the soldiers spent time off-duty using their private cars and Army identification to run a 13-trip drugs 'caravan' from the Netherlands to Liverpool.

'The Army were shaken,' a Customs investigator said. 'It was the first time military personnel had been involved at this level. There have been cases discovered of small quantities of drugs for personal use by soldiers, but nothing on this scale.'

The gang's trial heard that soldiers enjoyed a 'privileged position as travellers'. But any privilege has now ended, according to Customs and the Ministry of Defence (MoD).

'Customs officers haven't known since the abolition of British forces number plates if cars entering the country belonged to squaddies,' the Customs investigator said. 'But an officer may still have been swayed - he gives a car a pull, the driver shows his passport and then flashes a warrant card. The officer doesn't associate a soldier with drugs-smuggling.'

Military Police seconded an investigator to work with the 20-member Customs team, and co-operation has subsequently become routine, including regular sharing of intelligence.

'We didn't think smuggling by soldiers happened before,' a MoD spokesman said. 'Other men in 39th Regiment had noticed that something wasn't quite right with these men - extra money in their pockets, car loans being paid off, that sort of thing.'

The soldiers were being paid between pounds 2,000-pounds 5,000 a trip, a cheap rate for loading a hatchback with a typical payload of eight kilograms of drugs plus 48,000 tablets. But it was good money for men such as Bromiley and Jones, gunners in their thirties taking home about pounds 550 a month. Bromiley paid pounds 27,525 into his TSB accounts during the 18 months.

When 39th Regiment took its multiple rocket launchers off on a tour of duty in Cyprus in June 1996, regular runners were decommissioned. But Jones remained in Britain and readily assumed the drug courier duties. He bought a Honda Civic, made three runs to the continent, and banked pounds 22,800.

Suspicions at barracks of new-found wealth identified many of the soldiers to the investigation team. But command was probably vested in O'Toole, the foot passenger who first aroused suspicion. He is a 26-year-old Liverpudlian who variously described himself as a Merchant Navy cook or a painter and decorator, and whose mobile phone and pager were busy. He also handled distribution of the drugs in Liverpool and banked pounds 81,000 during the 18 months.

The tale has severely damaged the reputation of the regiment. Its motto - 'Whither Right and Glory Lead' - has been left tainted.

INDNews 4 The Independent
Copyright (C) Newspaper Publishing Plc, 1988-1997


Appendix C – Results for G2 between PA and newspapers for court reporting domain

Given are the top 30 bi-grams sorted on log-likelihood.

PA-specific bi-grams                           Newspaper-specific bi-grams
Bi-gram             Frequency  Log-likelihood  Bi-gram             Frequency  Log-likelihood
page 1              174        421.997         court yesterday*    46         61.914
pa news             159        385.595         yesterday for*      33         53.819
page 2              65         157.575         heard yesterday*    32         52.188
thomas 's           36         87.262          ’ he                29         47.295
was today*          35         84.838          case continues*     26         42.402
page 3              35         84.838          last night          38         40.95
it 's               23         55.748          ’ mr                25         40.771
's death*           27         52.31           ’ the               24         39.14
heard today*        21         50.9            told yesterday*     22         35.878
today 's*           20         48.476          wpc hendricks       21         34.247
page 4              19         46.052          was yesterday*      21         34.247
said he             189        45.497          life yesterday*     20         32.616
roberts 's          18         43.628          yesterday after*    19         30.985
courts care         18         43.628          was jailed*         102        27.487
's murder*          18         43.628          said yesterday*     16         26.092
court today*        23         43.086          set up              22         24.933
mr henriques        17         41.204          mr fayed            22         24.933
adjourned until*    26         41.088          trial continues*    15         24.462
he said             369        40.516          ’ she               14         22.831
mr grobbelaar       74         38.909          years yesterday*    14         22.831
mr boyd             21         38.506          hand clothes        13         21.2
mrs gee             15         36.356          her car             24         20.861
's family*          15         36.356          the boy’s           12         19.569
walker 's           14         33.932          mr hagland’s        12         19.569
until tomorrow*     14         33.932          I didn’t            12         19.569
courts grobbelaar   14         33.932          yesterday that*     18         19.003
's home*            14         33.932          of cruelty          18         19.003
's body*            14         33.932          yesterday of*       11         17.938
that the            403        31.923          yesterday accused*  11         17.938
mrs hillier         18         31.691          the family’s        11         17.938

* - bi-grams I think are representative of the domain
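
For readers unfamiliar with G2, the following Python sketch shows one common way of computing a log-likelihood score for a bi-gram from a 2x2 contingency table, in the spirit of reference (4). It is an assumption that this matches the exact formulation used to produce the figures listed above; the function names and the toy token lists are invented for illustration, so the scores will not reproduce those figures.

```python
import math
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


def g2(a, b, n1, n2):
    """Log-likelihood for an item seen a times in n1 items and b times in n2 items."""
    observed = [a, b, n1 - a, n2 - b]
    total = n1 + n2
    hits = a + b
    misses = total - hits
    expected = [hits * n1 / total, hits * n2 / total,
                misses * n1 / total, misses * n2 / total]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def rank_bigrams(tokens1, tokens2):
    """Return (bigram, count1, count2, G2) tuples, highest G2 first."""
    c1, c2 = bigram_counts(tokens1), bigram_counts(tokens2)
    n1, n2 = sum(c1.values()), sum(c2.values())
    rows = [(bg, c1[bg], c2[bg], g2(c1[bg], c2[bg], n1, n2))
            for bg in set(c1) | set(c2)]
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Toy data: the two "corpora" differ only in "today" versus "yesterday".
pa = "the court heard today that the man was today fined".split()
news = "the court heard yesterday that the man was yesterday fined".split()
for bigram, f1, f2, score in rank_bigrams(pa, news)[:4]:
    print(" ".join(bigram), f1, f2, round(score, 3))
```

Bi-grams that occur equally often in both collections score zero, while those concentrated in one collection (here the ones containing "today" or "yesterday") rise to the top, which mirrors the pattern of starred entries above.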


Appendix D – Results for G2 between court reporting and showbiz domains in the PA

Given are the top 30 bi-grams sorted on log-likelihood.

Showbiz-specific bi-grams                               Law and courts-specific bi-grams
Bi-gram                      Frequency  Log-likelihood  Bi-gram         Frequency  Log-likelihood
the film*                    34         201.8           the court*      183        60.006
the band*                    31         183.985         had been        209        42.225
the bbc*                     32         157.249         it 's           42         38.369
will be                      73         131.099         told the        144        37.6
showbusiness correspondent*  21         124.614         said that       119        35.923
by anthony                   13         77.132          that the        229        34.741
anthony barnes               13         77.132          the jury*       85         33.603
who plays*                   12         71.197          a new           35         32.807
sir elton*                   12         71.197          the judge*      80         31.626
the show*                    11         65.263          crown court*    77         30.439
played by*                   11         65.263          court heard*    73         28.857
star wars*                   10         59.329          the first       61         27.08
band 's*                     10         59.329          he had          228        27.079
coronation street*           9          53.395          I think         26         24.947
the special                  11         50.055          nursing care    61         24.112
the prince*                  8          47.462          he has          27         23.602
the premiere*                8          47.462          mr hamilton     54         21.344
jackie burdon                8          47.462          mr grobbelaar   54         21.344
by jackie                    8          47.462          she was         113        21.328
be screened*                 8          47.462          was told        53         20.949
spokeswoman said*            14         45.459          was found*      53         20.949
top gear*                    7          41.528          the nhs         53         20.949
this autumn*                 7          41.528          that he         167        20.937
the series*                  7          41.528          for the         168        19.698
showbiz starwars*            7          41.528          mr carman       49         19.368
salmon said                  7          41.528          which is        23         19.029
be shown*                    7          41.528          the police*     47         18.577
announced today*             7          41.528          mr justice*     47         18.577
a bbc*                       7          41.528          part of         36         18.273
the movie*                   6          35.595          jailed for*     46         18.182

* - bi-grams I think are representative of the domain


Appendix E – Dotplot results

Figure 1 – Daily Star vs PA

Figure 2 – Mirror vs PA

Figure 3 – The Sun vs PA

Figure 4 – The Times vs PA
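
The idea behind these plots can be sketched in a few lines of Python. This is not Mark Hepple's Dotplot program, and the function names are invented for illustration: each word of one text is compared with each word of the other, and a mark is placed wherever the two words match, so that runs of shared text show up as diagonal lines.

```python
def dotplot(x_tokens, y_tokens):
    """Return a grid with True wherever a word of one text equals a word of the other."""
    return [[x == y for x in x_tokens] for y in y_tokens]


def render(grid):
    """Draw the grid as text: '#' for a match, '.' otherwise."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in grid)


# Lower-cased fragments of the PA and Times sentences from Appendix A.
pa = "the queen mother was not in the vehicle".split()
times = "the queen mother was not in car".split()
print(render(dotplot(pa, times)))
```

In the printed grid the PA words run across and the Times words run down; the shared run "the queen mother was not in" appears as a diagonal, and the repeated "the" adds an isolated mark off the diagonal.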


Appendix F – Further Dotplot results

Related subjects and derived

AP = American Press, PA = Press Association

Dotplot of PA bombings vs Independent version

Panels: PA vs PA; Independent vs Independent; Independent vs PA; PA vs Independent


Related subjects and not derived

AP = American Press, PA = Press Association

Dotplot of AP bombings vs PA version

Panels: AP vs AP; PA vs PA; AP vs PA; PA vs AP


Un-related subjects

AP = American Press, PA = Press Association

Dotplot of AP bombings vs PA version

Panels: PA bombings vs PA bombings; PA Grobelaar vs PA ...; PA Grobelaar vs PA ...; PA Grobelaar vs ...


Appendix G – Result of finding the longest common verbatim strings between two independent texts
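
A minimal sketch of how such a search can be carried out is given below in Python. It is not the tool used to produce the results in this appendix: it assumes simple whitespace tokenisation, finds only the single longest run, and the function name is invented. For illustration it is run on the PA and Times sentences from Appendix A rather than on the independent texts.

```python
def longest_common_run(a, b):
    """Return the longest run of consecutive tokens that appears in both a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous token of each text.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


pa = ("The 99-year-old Queen Mother was not in the vehicle when the "
      "accident happened on July 4 in Bishopsgate, Egham, Surrey.").split()
times = ("The Queen Mother was not in car when the accident happened "
         "on July 4 in Surrey.").split()
print(" ".join(longest_common_run(pa, times)))
# prints: when the accident happened on July 4 in
```

Extending the sketch to report every shared run above a minimum length, rather than only the longest, would be closer to the plural "strings" in the title of this appendix.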


Appendix H – Dotplot of previous independent texts