Corpus analysis (1)

Corpus Linguistics

Richard Xiao

lancsxiaoz@googlemail.com

Outline of the session
• Lecture
  – Concordance
  – Patterning
  – Semantic prosody
  – Wordlist
  – Cluster (lexical bundle, MWU, n-gram)
• Lab
  – WST Concord and Wordlist
  – AntConc
  – Online concordancers

Who reads a corpus?
• A corpus is usually too large for anyone to read, e.g. the BNC is very, very large…
  – It took 4 years to build
  – It contains over 100 million (100,106,008) words of modern English
  – It comprises 4,124 texts
  – There are six and a quarter million sentence units in the whole corpus
  – Each word is automatically assigned a part-of-speech code; 65 parts of speech are identified
  – It occupies 1.5 gigabytes of disk space, the equivalent of more than 1,000 high-capacity floppy disks
  – The whole corpus printed in small type on thin paper would take up 10 metres of shelf space
  – Reading the whole corpus aloud at a rate of 150 words a minute, eight hours a day, 365 days a year, would take nearly 4 years
• A computer can scan in a few seconds more text than you can read in your whole life…
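The reading-time estimate can be checked with a few lines of arithmetic. This is a minimal sketch in Python that uses only the figures quoted on this slide; the variable names are illustrative.

```python
# Check the "nearly 4 years" reading-time estimate for the BNC.
WORDS_IN_BNC = 100_106_008   # word count quoted on the slide
WORDS_PER_MINUTE = 150       # reading-aloud rate quoted on the slide
HOURS_PER_DAY = 8
DAYS_PER_YEAR = 365

# Words that can be read aloud in one year at this rate
words_per_year = WORDS_PER_MINUTE * 60 * HOURS_PER_DAY * DAYS_PER_YEAR
years_needed = WORDS_IN_BNC / words_per_year

print(f"{words_per_year:,} words read per year")
print(f"{years_needed:.2f} years to read the whole corpus")
```

The result is a little over 3.8 years, which agrees with "nearly 4 years".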

Concordance
• A comprehensive index of the words used in a text or a corpus
• A set of concordance lines
• The most common concordance format is the KWIC concordance (Key Word in Context)
  – In a KWIC concordance, your search word, i.e. the node word, sits in a central position, with all lines vertically aligned around the node
• Can be sorted to reveal patterns of usage
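The KWIC idea can be sketched in a few lines of Python. This is an illustration only, not how WordSmith or AntConc work internally; the function names `kwic` and `show`, the character-based context width, and the sample text (pieced together from "on the stroke of" examples) are all assumptions made for the sketch.

```python
import re

def kwic(text, node, width=30):
    """Return (left, node, right) tuples for each occurrence of `node`."""
    hits = []
    pattern = rf"\b{re.escape(node)}\b"   # whole-word match only
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits

def show(hits, width=30):
    """Print the hits with the node vertically aligned."""
    for left, node, right in hits:
        # Right-align the left context so every node starts in the same column
        print(f"{left:>{width}}{node}{right}")

sample = (
    "They took the lead on the stroke of half-time with a goal. "
    "Promptly on the stroke of six the chooks went to roost. "
    "On the stroke of seven, a gong summons the guests."
)

hits = kwic(sample, "stroke")
# Sorting on the right context groups similar continuations together
by_right = sorted(hits, key=lambda h: h[2].lower())
show(by_right)
```

Sorting on the left context instead (for example `key=lambda h: h[0].lower()`) gives a different view of the same lines, which is how sorting helps reveal patterns.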

Concordancer
• A concordancer is the software that displays concordances (Unicode compliant)
  – Concord, WordSmith Tools (GBP50)
    • www.lexically.net/wordsmith/
  – MonoConc (USD85)
    • www.athel.com/mono.html
  – AntConc (free)
    • www.antlab.sci.waseda.ac.jp/software/antconc3.2.4w.exe
  – Xaira (free)
    • www.oucs.ox.ac.uk/rts/xaira/
  – Multilingual Corpus Tool (MLCT) (free)
    • www.lancs.ac.uk/fass/projects/corpus/cbls/resources.asp

KWIC concordance (WST)

KWIC concordance (MonoConc)

KWIC concordance (AntConc)

KWIC concordance (Xaira)

Online concordancers
• English (free)
  – http://corpus.byu.edu/bnc/
  – http://bncweb.lancs.ac.uk/bncwebSignup/user/login.php
  – http://www.americancorpus.org/ (COCA)
• Chinese (free)
  – www.lancs.ac.uk/fass/projects/corpus/LCMC/
  – www.lancs.ac.uk/fass/projects/corpus/UCLA/
  – www.lancs.ac.uk/fass/projects/corpus/babel/babel.htm
• Sketch Engine: corpus query system for multilingual data, incorporating word sketches, grammatical relations, and a distributional thesaurus (30-day free trial)
  – http://www.sketchengine.co.uk/

Syntagmatic vs. paradigmatic

Collocation is syntagmatic

     famous boots. On the stroke of full time the Stoke
          the lead on the stroke of half-time with a goal
  Smith sin-binned on the stroke of half-time, added a
clinched their win on the stroke of lunch after resuming
chase by declaring on the stroke of lunch. <p> With a lead
  expectant crowd, on the stroke of midday. The bird
  hour began not upon the stroke of midnight but upon the
 of midnight but upon the stroke of noon. There was,
booked in advance. On the stroke of seven, a gong summons
          Promptly on the stroke of six 'clock, the chooks
    from Edinburgh on the stroke of the Millennium.

Parole (utterance): syntagmatic
Langue (language system): paradigmatic

Example of pattern meaning
• “on the stroke of X”
  – X = a temporal point
• “It is/was adj. that…” (construction grammar?)
  – certain, likely, possible, probable, etc.
  – apparent, clear, evident, obvious, plain, etc.
  – fantastic, marvellous, appropriate, logical, encouraging, exciting, reassuring, etc.
  – appalling, unjust, annoying, etc.
  – critical, important, necessary, vital, etc.
  – amazing, funny, interesting, intriguing, etc.
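A pattern such as “It is/was adj. that…” can be searched for with a regular expression. The sketch below is a rough approximation in Python: without part-of-speech tags it simply captures whatever single word stands between is/was and that, so the output still needs manual checking. The name `pattern_fillers` is illustrative, and the sample text is abridged from the New Scientist and The Sun quotations used in this deck.

```python
import re
from collections import Counter

# "It is/was X that": capture the single word between is/was and "that".
# No POS tagging here, so X is not guaranteed to be an adjective.
PATTERN = re.compile(r"\bit\s+(?:is|was)\s+(\w+)\s+that\b", re.IGNORECASE)

def pattern_fillers(text):
    """Count the words that fill the slot in 'It is/was ___ that'."""
    return Counter(word.lower() for word in PATTERN.findall(text))

sample = (
    "It was important to establish this because it was possible that "
    "strontium and calcium in fossils might have reacted chemically. "
    "It is scandalous that the rich can buy the drugs privately."
)

fillers = pattern_fillers(sample)
print(fillers.most_common())
```

Note that "It was important to establish" is correctly skipped, because the slot word is not followed by "that".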

Pattern meaning
• A large number of different adjectives occur in the pattern between is/was and that
  – Probability
    • “It was important to establish this because it was possible that strontium and calcium in fossils might have reacted chemically with the rock in which the fossils were buried.” (New Scientist)
  – Evaluation: used to evaluate propositions (statements) rather than things or people
    • “But a lot of health authorities say they will not allow these drugs on NHS prescription as they cannot afford them at around £90 a month. It is scandalous that the rich can buy the drugs privately, but tough luck if you are poor.” (The Sun)

Meaning arising from collocation
• “There are always semantic relations between node and collocates, and among the collocates themselves.” (Stubbs 2002: 225)
  – Collocational meaning arising from the semantic relations between node and collocates: semantic prosody (also called “discourse prosody”)
  – Collocational meaning arising from the semantic relations among collocates of a node: semantic preference

What is semantic prosody?

• “consistent aura of meaning with which a form is imbued by its collocates” (Louw 1993: 157)

• “a form of meaning which is established through the proximity of a consistent series of collocates.” (Louw 2000: 57)

• “the spreading of connotational colouring beyond single word boundaries” (Partington 1998: 68)

• “When the usage of a word gives an impression of an attitudinal or pragmatic meaning, this is called a semantic prosody” (Sinclair 1999)

• This kind of meaning is “prosody” in the sense that it stretches over more than one unit (word)

Semantic prosody
• The primary function of SP is to express speaker/writer attitude or evaluation (Louw 2000: 58)
  – Attitudinal, affective, evaluative and pragmatic meaning
• Semantic prosodies are typically negative, with relatively few bearing an affectively positive meaning
  – Unsurprising: contented human beings utter much less than discontented ones
  – It is unrequited love, not requited love, that forms most of the subject matter for the greatest love poetry in English!

Semantic prosody
• SET IN: occurs primarily with subjects which refer to unpleasant states of affairs
  – …before bad weather sets in…
  – …the fact that misery can set in…
  – …desperation can set in…
  – …stagnation seemed to have set in…
  – …before rigor mortis sets in…
• BREAK OUT: it is bad things that break out
  – …violence broke out…
  – …riots broke out…
  – …war broke out…
  – …real disagreements have broken out…
  – …a storm of protest broke out…

Semantic prosody
• Collocates of CAUSE
  – damage, problems, pain, disease, distress, trouble, concern, degradation, harm, pollution, suffering, anxiety, death, fear, stress, symptoms
  – These examples of ‘bad company’ collocate with cause so frequently that the central and typical use of cause shows a negative affective meaning (cf. the Chinese proverb 近墨者黑, “one who stays near ink is stained black”?)
• Collocates of consequences
  – In the sense of result
    • serious, disastrous, adverse, dire, damaging, negative, unintended, unfortunate, tragic, fatal, severe
  – In the sense of importance
    • important, significant, far-reaching, profound
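Collocate lists like these come from counting the words that occur within a fixed span of the node. The Python sketch below shows the bare counting step only. The function name `collocates`, the span of 4 tokens, the explicit list of word forms standing in for the lemma CAUSE, and the made-up sample sentences are all assumptions for illustration.

```python
import re
from collections import Counter

def collocates(text, node_forms, span=4):
    """Count words within `span` tokens to the left or right of any node form."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok in node_forms:
            # Left window plus right window, excluding the node itself
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# The lemma CAUSE is approximated here by listing its word forms by hand
CAUSE = {"cause", "causes", "caused", "causing"}

sample = (
    "Smoking can cause serious disease. "
    "The storm caused damage to the roof. "
    "Stress causes problems."
)

counts = collocates(sample, CAUSE)
print(counts.most_common(5))
```

Raw counts like these are dominated by function words such as "the"; concordancers usually offer an association statistic to rank collocates instead, which this sketch does not attempt.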

Semantic prosody
• PROVIDE: a positive semantic prosody
  – facilities, information, services; aid, assistance, help, support; care, food, money, nourishment, protection, security
• CREATE: “prosodically mixed or incomplete”
  – [Negative] illusion, problems
  – [Neutral] atmosphere, conditions, environment, image, impression, situation, space
  – [Positive] jobs, opportunities, order, wealth

Semantic prosody
• The negative (or less frequently positive) prosody that belongs to a lexical item is the result of the interplay between the item and its typical collocates
  – The item does not appear to have an affective meaning until it appears in the context of its typical collocates
  – If a word has typical collocates with an affective meaning, it may take on that affective meaning even when it is used with other atypical collocates
• The consequence of a word frequently keeping ‘bad company’ is that the use of the word alone may become enough to indicate something unfavourable (cf. Partington 1998: 67)

Semantic prosody

• Is semantic prosody a type of connotative meaning?

• “Semantic prosodies are not merely connotational” as the force behind semantic prosodies is “more strongly collocational than the schematic aspects of connotation.” (Louw 2000: 49-50)

• In my view, connotation can be collocational or non-collocational; semantic prosody can only be collocational

Semantic prosody
• Semantic prosody is strongly collocational in that it operates beyond the meanings of individual words
• Both personal and price are quite neutral, but when they co-occur, a negative prosody may result: personal price most frequently refers to something undesirable
  – In the BoE (Bank of English), with over 550 million words of written and spoken texts, all 20 instances of “personal price” are evaluatively negative

“Personal price”

Barclays’ slogan to promote their personal financial services in 2003: “The personal loan with the personal price”

typically negative and high
something undesirable

Semantic preference
• ‘a lexical set of frequently occurring collocates [sharing] some semantic feature’ (Stubbs 2002: 449)
  – large typically collocates with items from the same semantic set indicating ‘quantities and sizes’
    • number(s), scale, part, quantities, amount(s)
  – ‘absence/change of state’ is a common feature of the collocates of maximizers such as utterly, totally, completely and entirely

Semantic preference
• Semantic preference and semantic prosody are two distinct yet interdependent collocational meanings
  – Semantic prosody is a further level of abstraction of the relationship between lexical units (Sinclair 1996, 1998; Stubbs 2001)
    • Collocation (the relationship between a node and individual words)
    • Colligation (the relationship between a node and grammatical categories, e.g. “very” tends to co-occur with adjectives and adverbs)
    • Semantic preference (semantic sets/fields of collocates)
    • Semantic prosody (affective meanings of a given node with its typical collocates)

Semantic preference
• Semantic preference and semantic prosody have different operating scopes (Partington 2004: 151)
  – Semantic preference can be viewed as a feature of the collocates while semantic prosody is a feature of the node word
• The two also interact (Partington 2004: 151)
  – Semantic prosody ‘dictates the general environment which constrains the preferential choices of the node item’
  – Semantic preference ‘contributes powerfully’ to building semantic prosody

End of concordance versus patterning, collocation and collocational meaning

Wordlist
• A list of words in a corpus and their frequency
  – Can become very meaningful when compared with other lists: “keyword analysis”
• “A type is not a token.”
  – Token: an occurrence of any given word form (6 tokens)
  – Type: a (unique) word form (5 types; “a” is repeated)
• Type-token ratio (TTR): the number of types divided by the number of tokens, multiplied by 100
  – lexical density: a low TTR indicates a text is not very lexically rich
  – useful when comparing samples of roughly equal length
• Standardized type-token ratio (STTR)
  – It is difficult to compare the TTR of a smaller corpus against a larger one
    • As a corpus gets bigger, the number of new word types being counted declines
  – To remedy the issue of comparing TTRs of corpora of different sizes, WordSmith can calculate TTR for every 1,000 words (the default setting can be adjusted) and produce an average TTR
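The two ratios can be worked through on the example sentence. The Python sketch below is a minimal illustration under stated assumptions: the tokenizer is a simple regular expression, the function names are made up, and leftover tokens that do not fill a whole chunk are ignored when averaging (a choice made here for simplicity, not a claim about how WordSmith treats them). Both functions expect a non-empty token list.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into word forms."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def ttr(tokens):
    """Type-token ratio as a percentage: types / tokens * 100."""
    return len(set(tokens)) / len(tokens) * 100

def sttr(tokens, chunk=1000):
    """Mean TTR over consecutive full chunks of `chunk` tokens."""
    ratios = [ttr(tokens[i:i + chunk])
              for i in range(0, len(tokens) - chunk + 1, chunk)]
    if not ratios:               # text shorter than one chunk
        return ttr(tokens)
    return sum(ratios) / len(ratios)

tokens = tokenize("A type is not a token.")
print(len(tokens), "tokens,", len(set(tokens)), "types")
print(f"TTR = {ttr(tokens):.2f}")

# Tiny demonstration of why STTR differs from TTR as a text grows
repeated = ["a", "b", "a", "b"]
print(ttr(repeated), sttr(repeated, chunk=2))
```

For the example sentence this gives 6 tokens, 5 types and a TTR of 83.33. In the second demonstration the overall TTR is 50, while the chunk-by-chunk average is 100, because each 2-token chunk contains no repetition.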

Wordlist

AntConc wordlist

Practice
• Make a wordlist of the following text using the wordlist function in WST or AntConc
  – The Stephen text (local copy available)
    • http://www.cch.kcl.ac.uk/legacy/teaching/av1000/textanalysis/gaskin/stephen.txt
    • A book written by the hippie guru Stephen Gaskin
• Browse through the frequency list. Can you see any pattern in the list?
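For comparison with the WST and AntConc output, a frequency wordlist can also be sketched in Python. The tokenizer, the function names and the short sample string are assumptions made for illustration; to try it on the practice text, read your local copy into a string first (the file name in the comment is hypothetical).

```python
import re
from collections import Counter

def wordlist(text):
    """Return a Counter mapping each word form to its frequency."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)

def print_wordlist(freq, top=10):
    """Print rank, word, frequency and percentage of all tokens."""
    total = sum(freq.values())
    for rank, (word, n) in enumerate(freq.most_common(top), start=1):
        print(f"{rank:>4}  {word:<15} {n:>6}  {n / total * 100:5.2f}%")

# To use a local copy of the practice text (hypothetical file name):
#   with open("stephen.txt", encoding="utf-8") as fh:
#       freq = wordlist(fh.read())
sample = "the cat sat on the mat and the dog sat too"
freq = wordlist(sample)
print_wordlist(freq, top=3)
```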

Cluster
• Also called lexical bundle, n-gram, multi-word unit (MWU)
• Groups of N words which appear in sequence in the text
• Presented using frequency lists
• Good way to identify recurrent/specific expressions for a corpus
• Tools
  – WordSmith
    • Concord
    • Wordlist (Index)
  – AntConc
    • N-gram
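Counting clusters amounts to sliding a window of N words across the text and tallying each sequence. The Python sketch below shows the idea; the function name `ngrams`, the optional `containing` filter (to keep only clusters that include a given search word) and the made-up sample are assumptions for illustration, not a description of either tool.

```python
from collections import Counter

def ngrams(tokens, n=3, containing=None):
    """Count every sequence of `n` consecutive tokens.

    If `containing` is given, keep only sequences that include that word.
    """
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if containing is None or containing in gram:
            counts[gram] += 1
    return counts

sample = "you know what i mean you know what i mean is you know"
tokens = sample.split()

all_trigrams = ngrams(tokens, n=3)
know_trigrams = ngrams(tokens, n=3, containing="know")

for gram, freq in know_trigrams.most_common():
    print(freq, " ".join(gram))
```

Counting over the whole text and counting only around a search word are the same operation with and without the filter.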

Cluster/lexical bundle/n-gram

Concord(3-gram)

Wordlist

Clusters in WordSmith
• The Stephen text
• Clusters with WST Concord
  – The search term
• Clusters with WST Wordlist (Index)
  – The whole corpus
• Questions
  – What are the most frequent 3-word clusters with “know” in the Stephen text?
  – What are the most frequent 3-word clusters in the whole text? Are they all “expected” phrases?

Clusters in WordSmith
• Make adjustments here

Concord: “know”

3-word clusters of “know”

recompute n-word clusters

Clusters in Wordlist (Index)

An error may occur if you specify a folder for which you do not have write permission

Clusters in Wordlist (Index)

The index is created and saved in the specified file location

Warning: Your file location may be different!

Resulting index

Clusters in Wordlist (Index)

• OR: Wordlist – File – Open

Clusters in Wordlist (Index)

Clusters in Wordlist (Index)

N-gram in AntConc

Difference from WST: Can a word contain the apostrophe?
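The effect of this setting is easy to see by tokenizing the same sentence both ways. The Python sketch below compares two simple regular-expression tokenizers; it illustrates the general issue only and makes no claim about the default behaviour of WordSmith or AntConc. The sample sentence is made up.

```python
import re

text = "I don't know what's happening, but it's six o'clock."

# Option A: an apostrophe is allowed inside a word
with_apostrophe = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

# Option B: an apostrophe is treated as a word boundary
without_apostrophe = re.findall(r"[a-z]+", text.lower())

print(len(with_apostrophe), with_apostrophe)
print(len(without_apostrophe), without_apostrophe)
```

Option A yields 9 tokens, including "don't" and "o'clock"; option B yields 13, with fragments such as "don", "t" and "s". Wordlists and n-gram counts therefore differ depending on the token definition in use.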