Quantitative Individuated Corpus Linguistics

A Speaker-Centric Approach to Variation

Cornelius Puschmann, Universität Osnabrück, 5 June 2007

any sentence is as good as any other sentence (the data is flat)

a corpus should be a well-balanced mix of different genres, modes and sources (representativeness)

textual and compositional coherence cannot be taken into account

contextual information (who said what, when, where, why, how and to whom) is largely unavailable

corpora largely consist of well-established genres

the material they contain is produced by language professionals (journalists, writers, politicians)

texts are long and stylistically distant from everyday communication in their level of formality, complexity and elaborateness

compositional integrity (text structure) is very important but largely ignored

the text (=collection of words) takes precedence over the speaker

language data and sources of variation...

... vs. speakers and their natural attributes

estimated 100 million active bloggers in 2007

split evenly among genders

all age groups are represented

many bloggers provide personal information (age, gender, location)

use web feeds (Atom and RSS formats) to syndicate blog entries in XML (ideal for building modern corpora)

clean data with minimal interference

very large bodies of data can be automatically assembled

data is naturally segmented by

speaker (+gender, +age, +location, ...)

length and time of writing

often include additional meta-data

produced by a large and growing variety of individuals using it for a wide spectrum of purposes
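A minimal sketch of how a web feed yields speaker-segmented data, assuming an RSS 2.0 feed. The sample feed, the author names and the function name parse_feed are hypothetical; a real corpus builder would download the XML first and would also need to handle Atom.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 snippet standing in for a downloaded blog feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <author>alice@example.com (Alice)</author>
      <pubDate>Tue, 05 Jun 2007 10:00:00 GMT</pubDate>
      <description>Hello world, this is my first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <author>alice@example.com (Alice)</author>
      <pubDate>Wed, 06 Jun 2007 09:30:00 GMT</pubDate>
      <description>Feeds keep speaker metadata.</description>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return one record per post, keeping speaker and time metadata."""
    root = ET.fromstring(xml_text)
    blog = root.findtext("channel/title", default="")
    posts = []
    for item in root.iter("item"):
        text = item.findtext("description", default="")
        posts.append({
            "blog": blog,
            "author": item.findtext("author", default=""),
            "date": item.findtext("pubDate", default=""),
            "text": text,
            # Whitespace tokenization only; a tagger would tokenize properly.
            "n_words": len(text.split()),
        })
    return posts


posts = parse_feed(SAMPLE_FEED)
for post in posts:
    print(post["author"], post["date"], post["n_words"])
```

Each record carries its own author, date and length, so the data is segmented by speaker and time without any manual annotation.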

only one genre (?)

CMC as a singular mode (?)

sampling of speakers not representative (?)

self-built corpus for my research project on corporate blogging

web feeds (RSS and Atom formats) used to retrieve, store and analyze language data

used TreeTagger for automated part-of-speech tagging

156 sources

25,769 posts

6.6 million words

Heather Hamilton (Microsoft)

Irving Wladawsky-Berger (IBM)

Heather Hamilton (Microsoft)

rank  token  tag   frequency
1     the    DT    2787
2     I      PP    2723
3     to     TO    2088
4     a      DT    1440
5     of     IN    1324
6     and    CC    1254
7     It     PP    1097
8     you    PP    854
9     in     IN    818
10    that   IN    776
11    my     PP$   757
12    is     VBZ   739
13    For    IN    580
14    n't    RB    540
15    's     VBZ   530
16    on     IN    498
17    are    VBP   475
18    me     PP    450
19    with   IN    431
20    this   DT    424

Irving Wladawsky-Berger (IBM)

rank  token  tag   frequency
1     the    DT    2788
2     and    CC    1931
3     of     IN    1571
4     to     TO    1562
5     in     IN    1291
6     a      DT    1047
7     is     VBZ   695
8     I      PP    560
9     that   IN    439
10    For    IN    434
11    It     PP    417
12    with   IN    401
13    as     IN    390
14    are    VBP   380
15    we     PP    359
16    on     IN    331
17    our    PP$   259
18    have   VHP   253
19    that   WDT   248
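Ranked lists of this kind can be derived from tagger output. A minimal sketch, assuming TreeTagger's usual one-token-per-line output with tab-separated token, tag and lemma. The sample text and the function name frequency_list are hypothetical, and the actual scripts used for the corpus are not shown in the slides.

```python
from collections import Counter

# Hypothetical tagger output: token <TAB> tag <TAB> lemma, one token per line.
SAMPLE_TAGGED = "\n".join([
    "I\tPP\tI",
    "read\tVVP\tread",
    "the\tDT\tthe",
    "blog\tNN\tblog",
    "and\tCC\tand",
    "the\tDT\tthe",
    "feed\tNN\tfeed",
])


def frequency_list(tagged_text, top_n=20):
    """Count (token, tag) pairs and return rows of (rank, token, tag, freq)."""
    counts = Counter()
    for line in tagged_text.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        counts[(parts[0], parts[1])] += 1
    ranked = counts.most_common(top_n)
    return [
        (rank, token, tag, freq)
        for rank, ((token, tag), freq) in enumerate(ranked, start=1)
    ]


rows = frequency_list(SAMPLE_TAGGED)
for rank, token, tag, freq in rows:
    print(rank, token, tag, freq)
```

Counting the pair rather than the bare token keeps homographs apart, which is why "that" can appear twice in one list (IN and WDT).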

article: Effects of Age and Gender on Blogging (AAAI 2006)

all blogs accessible from blogger.com one day in August 2004

downloaded each blog that included author-provided indication of gender and at least 200 appearances of common English words

the full corpus thus obtained included over 71,000 blogs and over 300 million tokens

used to predict age and gender of bloggers

token        male        female
linux        0.53±0.04   0.03±0.01
microsoft    0.63±0.05   0.08±0.01
gaming       0.25±0.02   0.04±0.00
server       0.76±0.05   0.13±0.01
software     0.99±0.05   0.17±0.02
gb           0.27±0.02   0.05±0.01
programming  0.36±0.02   0.08±0.01
google       0.90±0.04   0.19±0.02
data         0.62±0.03   0.14±0.01
graphics     0.27±0.02   0.06±0.01
india        0.62±0.04   0.15±0.01
nations      0.25±0.01   0.06±0.01
democracy    0.23±0.01   0.06±0.01
users        0.45±0.02   0.11±0.01
economic     0.26±0.01   0.07±0.01

token        male        female
shopping     0.66±0.02   1.48±0.03
mom          2.07±0.05   4.69±0.08
cried        0.31±0.01   0.72±0.02
freaked      0.08±0.01   0.21±0.01
pink         0.33±0.02   0.85±0.03
cute         0.83±0.03   2.32±0.04
gosh         0.17±0.01   0.47±0.02
kisses       0.08±0.01   0.28±0.01
yummy        0.10±0.01   0.36±0.01
mommy        0.08±0.01   0.31±0.02
boyfriend    0.41±0.02   1.73±0.04
skirt        0.06±0.01   0.26±0.01
adorable     0.05±0.00   0.23±0.01
husband      0.28±0.01   1.38±0.04
hubby        0.01±0.00   0.30±0.02

token        teens       twenties    thirties
maths        1.05±0.06   0.03±0.00   0.02±0.01
homework     1.37±0.06   0.18±0.01   0.15±0.02
bored        3.84±0.27   1.11±0.14   0.47±0.04
sis          0.74±0.04   0.26±0.03   0.10±0.02
boring       3.69±0.10   1.02±0.04   0.63±0.05
awesome      2.92±0.08   1.28±0.04   0.57±0.04
mum          1.25±0.06   0.41±0.04   0.23±0.04
mad          2.16±0.07   0.80±0.03   0.53±0.04
dumb         0.89±0.04   0.45±0.03   0.22±0.03
semester     0.22±0.02   0.44±0.03   0.18±0.04
apartment    0.18±0.02   1.23±0.05   0.55±0.05
drunk        0.77±0.04   0.88±0.03   0.41±0.05
beer         0.32±0.02   1.15±0.05   0.70±0.05
student      0.65±0.04   0.98±0.05   0.61±0.06
album        0.64±0.05   0.84±0.06   0.56±0.08
college      1.51±0.07   1.92±0.07   1.31±0.09
someday      0.35±0.02   0.40±0.02   0.28±0.03
dating       0.31±0.02   0.52±0.03   0.37±0.04

token        teens       twenties    thirties
marriage     0.27±0.03   0.83±0.05   1.41±0.13
development  0.16±0.02   0.50±0.03   0.82±0.10
campaign     0.14±0.02   0.38±0.03   0.70±0.07
tax          0.14±0.02   0.38±0.03   0.72±0.11
local        0.38±0.02   1.18±0.04   1.85±0.10
democratic   0.13±0.02   0.29±0.02   0.59±0.05
son          0.51±0.03   0.92±0.05   2.37±0.16
systems      0.12±0.01   0.36±0.03   0.55±0.06
provide      0.15±0.01   0.54±0.03   0.69±0.05
workers      0.10±0.01   0.35±0.02   0.46±0.04
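A minimal sketch of how group figures of the form "mean with an error margin" can be computed from per-blog counts. Everything here is an assumption for illustration: the per-10,000-word normalization, the use of the standard error of the mean, the toy counts and the function names. The slides do not state how the cited study computed its figures.

```python
import math
from statistics import mean, stdev


def rate_per_10k(count, n_words):
    """Occurrences of a token per 10,000 words in one blog (assumed unit)."""
    return 10000 * count / n_words


def mean_and_se(values):
    """Mean and standard error of the mean across blogs in one group."""
    return mean(values), stdev(values) / math.sqrt(len(values))


def summarize(blogs):
    """Group per-blog rates by speaker attribute, then summarize each group."""
    by_group = {}
    for blog in blogs:
        rate = rate_per_10k(blog["count"], blog["n_words"])
        by_group.setdefault(blog["group"], []).append(rate)
    return {group: mean_and_se(rates) for group, rates in by_group.items()}


# Toy counts for a single token; not taken from any real corpus.
blogs = [
    {"group": "teens", "n_words": 5000, "count": 2},
    {"group": "teens", "n_words": 10000, "count": 3},
    {"group": "teens", "n_words": 20000, "count": 10},
    {"group": "thirties", "n_words": 10000, "count": 1},
    {"group": "thirties", "n_words": 10000, "count": 0},
    {"group": "thirties", "n_words": 5000, "count": 1},
]

summary = summarize(blogs)
for group, (m, se) in summary.items():
    print(f"{group}: {m:.2f} ± {se:.2f}")
```

Averaging per-blog rates, rather than pooling all tokens, treats each speaker as one observation, so a single prolific blogger cannot dominate a group's figure.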

allow us to take into account individual stylistic preference as a source of variation when making generalizations (syntax, semantics, ...)

allow us to observe specificities of individual production before making blanket statements about groups (based on gender, social standing, etc.)

inverts the idea of system and variation (how much overlap is there in language use? vs. how much variation can our theories account for?)
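One rough way to put a number on "how much overlap is there in language use?" is the Jaccard index over the two bloggers' most frequent tokens. This is a sketch only: the token lists are copied from the two frequency lists in this deck, while the choice of Jaccard, the lowercasing and the function name are assumptions, not the method of the talk.

```python
# Most frequent tokens of the two bloggers, as listed in the frequency tables.
hamilton = [
    "the", "I", "to", "a", "of", "and", "It", "you", "in", "that",
    "my", "is", "For", "n't", "'s", "on", "are", "me", "with", "this",
]
wladawsky_berger = [
    "the", "and", "of", "to", "in", "a", "is", "I", "that", "For",
    "It", "with", "as", "are", "we", "on", "our", "have", "that",
]


def jaccard(tokens_a, tokens_b):
    """Shared types divided by all types, ignoring case and tags."""
    set_a = {t.lower() for t in tokens_a}
    set_b = {t.lower() for t in tokens_b}
    return len(set_a & set_b) / len(set_a | set_b)


overlap = jaccard(hamilton, wladawsky_berger)
only_hamilton = {t.lower() for t in hamilton} - {t.lower() for t in wladawsky_berger}
only_wb = {t.lower() for t in wladawsky_berger} - {t.lower() for t in hamilton}

print(f"overlap: {overlap:.2f}")
print("only Hamilton:", sorted(only_hamilton))
print("only Wladawsky-Berger:", sorted(only_wb))
```

The tokens outside the shared core are the interesting part: one list keeps "you", "my" and "me", the other "we" and "our".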

personal grammar?, personal semantics?

Construction Grammar (to what degree are constructions individual?)

variation over the lifetime

weighing genre, mode and individual variation

practical applications for forensic linguistics / language profiling

