Quantitative Individuated Corpus Linguistics

Held at the Linguistic Colloquium, University of Osnabrueck, June 5, 2007.

1. Quantitative Individuated Corpus Linguistics: A Speaker-Centric Approach to Variation

  • Cornelius Puschmann
  • Universität Osnabrück
  • 5 June 2007

2. Preliminaries

3. A totalizing view of language

  • diagram labels: competence, performance, production, investigation, social function, learning, cognitive basis, cultural transmission
  • but... whose competence? whose performance?

4. Variation or overlap?

  • Speaker A: competence, performance
  • Speaker B: competence, performance
  • Speaker C: competence, performance
  • Observations:
    • A contrastive comparison of performance should give us some insight into shared competence.
    • Speaker-level granularity is preferable to higher levels of segmentation (by gender, social class etc).
    • Instead of generalizing from the outset, we can reach general conclusions after observing the degree of variation or overlap in language production.
  • So how do we do this?

5. How corpora treat language data

  • any sentence is as good as any other sentence (the data is flat)
  • a corpus should be a well-balanced mix of different genres, modes and sources (representativeness)
  • textual and compositional coherence cannot be taken into account
  • contextual information (who said what, when, where, why, how and to whom) is largely unavailable

6. Corpora and traditions of text production

  • corpora largely consist of well-established genres
  • the material they contain is produced by language professionals (journalists, writers, politicians)
  • texts are long and stylistically distant from everyday communication in their level of formality, complexity and elaborateness
  • compositional integrity (text structure) is very important but largely ignored
  • the text (=collection of words) takes precedence over the speaker

7. A different view of language data

  • language data and sources of variation...
  • ... vs. speakers and their natural attributes

8. Blogs as data sources

9. A new kind of resource

  • estimated 100 million active bloggers in 2007
  • split evenly among genders
  • all age groups are represented
  • many bloggers provide personal information (age, gender, location)
  • blogs use web feeds (Atom and RSS formats) to syndicate entries in XML (ideal for building modern corpora)
  • clean data with minimal interference
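
A minimal sketch of how a web feed could be turned into per-post records for a corpus, using only Python's standard library. The feed below is a hand-written, hypothetical RSS 2.0 snippet, and the field names in the returned dictionaries are illustrative assumptions, not the format used in the project described in this talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 snippet standing in for a real blog feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>What I had for breakfast this morning</title>
      <author>Jane Smith</author>
      <pubDate>Mon, 01 Jan 2007 08:00:00 GMT</pubDate>
      <description>Toast and coffee, as usual.</description>
    </item>
    <item>
      <title>A second post</title>
      <author>Jane Smith</author>
      <pubDate>Tue, 02 Jan 2007 08:00:00 GMT</pubDate>
      <description>More thoughts on breakfast.</description>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Return one record per <item>, keeping author and date with the text."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "author": item.findtext("author", default=""),
            "published": item.findtext("pubDate", default=""),
            "text": item.findtext("description", default=""),
        })
    return posts


posts = parse_rss(SAMPLE_RSS)
for post in posts:
    print(post["author"], "|", post["published"], "|", post["title"])
```

Because every record carries its author and timestamp, the speaker-level segmentation comes with the data rather than being reconstructed afterwards. Atom feeds use different element names and XML namespaces, so they would need their own variant of this function.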

10. Blogs as corpus data: Pros

  • very large bodies of data can be automatically assembled
  • data is naturally segmented by
    • speaker (+gender, +age, +location, ...)
    • length and time of writing
  • often include additional meta-data
  • produced by a large and growing variety of individuals using it for a wide spectrum of purposes

11. Blogs as corpus data: Cons

  • only one genre (?)
  • CMC as a singular mode (?)
  • sampling of speakers not representative (?)

12. Granularity and natural segmentation of data in a blog-based corpus

  • Modes of investigation:
    • Degree of internal variation among all posts by the same blogger
    • Variation between bloggers
    • Variation between groups (gender, age etc)
  • Example post: "What I had for breakfast this morning", posted 01/01/2007 by Jane Smith (body text shown as placeholder)
  • post 1, post 2, post 3, post 4, ...

13. An example of a blog-based corpus

  • self-built corpus for my research project on corporate blogging
  • web feeds (RSS and Atom formats) used to retrieve, store and analyze language data
  • used TreeTagger for automated part-of-speech tagging
  • 156 sources
  • 25,769 posts
  • 6.6 million words
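
With posts stored per source, two of the modes of investigation named earlier (variation among one blogger's own posts, and variation between bloggers) could be approximated roughly as follows. This is only a sketch: the toy corpus, the choice of measure (share of first-person singular pronouns per post) and all names are assumptions made for illustration, not the measures used in the project.

```python
from statistics import mean, pstdev

# Hypothetical toy corpus: blogger -> list of post texts.
corpus = {
    "blogger_a": ["I think I like my job", "I was late and I missed my bus"],
    "blogger_b": ["We shipped our product", "The market is growing and we are hiring"],
}

FIRST_PERSON_SINGULAR = {"i", "my", "me"}


def fps_rate(text):
    """Share of a post's tokens that are first-person singular pronouns."""
    tokens = text.lower().split()
    return sum(t in FIRST_PERSON_SINGULAR for t in tokens) / len(tokens)


# Variation within a blogger: spread of the measure across that blogger's posts.
within = {
    blogger: pstdev([fps_rate(p) for p in posts])
    for blogger, posts in corpus.items()
}

# Variation between bloggers: spread of the per-blogger means.
per_blogger_mean = {
    blogger: mean([fps_rate(p) for p in posts])
    for blogger, posts in corpus.items()
}
between = pstdev(list(per_blogger_mean.values()))

print(within)
print(per_blogger_mean)
print(between)
```

Variation between groups would follow the same pattern, with bloggers first grouped by a metadata field such as gender or age.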

14. Application

15. Individual variation: word class distribution

    • Heather Hamilton (Microsoft)

16. Individual variation: word class distribution

    • Irving Wladawsky-Berger (IBM)
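
A word class distribution of this kind can be derived from part-of-speech tagger output. The sketch below assumes TreeTagger-style lines of the form token, tag, lemma separated by tabs; the sample sentence and the function name are made up for illustration.

```python
from collections import Counter

# Hypothetical tagger output: one "token<TAB>tag<TAB>lemma" line per token.
TAGGED = "\n".join([
    "I\tPP\tI",
    "write\tVVP\twrite",
    "a\tDT\ta",
    "blog\tNN\tblog",
    "and\tCC\tand",
    "I\tPP\tI",
    "like\tVVP\tlike",
    "it\tPP\tit",
    ".\tSENT\t.",
])


def tag_distribution(tagged_text):
    """Relative frequency of each part-of-speech tag in one speaker's text."""
    counts = Counter(
        line.split("\t")[1]
        for line in tagged_text.splitlines()
        if line.strip()
    )
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


dist = tag_distribution(TAGGED)
for tag, share in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(f"{tag}\t{share:.2f}")
```

Running the same function over each blogger's tagged posts gives one distribution per speaker, which can then be compared side by side.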

17. Individual variation: pronoun use

  • Heather Hamilton (Microsoft)
    • rank token tag frequency
    • 1 the DT 2787
    • 2 I PP 2723
    • 3 to TO 2088
    • 4 a DT 1440
    • 5 of IN 1324
    • 6 and CC 1254
    • 7 It PP 1097
    • 8 you PP 854
    • 9 in IN 818
    • 10 that IN 776
    • 11 my PP$ 757
    • 12 is VBZ 739
    • 13 For IN 580
    • 14 n't RB 540
    • 15 's VBZ 530
    • 16 on IN 498
    • 17 are VBP 475
    • 18 me PP 450
    • 19 with IN 431
    • 20 this DT 424
  • Irving Wladawsky-Berger (IBM)
    • rank token tag frequency
    • 1 the DT 2788
    • 2 and CC 1931
    • 3 of IN 1571
    • 4 to TO 1562
    • 5 in IN 1291
    • 6 a DT 1047
    • 7 is VBZ 695
    • 8 I PP 560
    • 9 that IN 439
    • 10 For IN 434
    • 11 It PP 417
    • 12 with IN 401
    • 13 as IN 390
    • 14 are VBP 380
    • 15 we PP 359
    • 16 on IN 331
    • 17 our PP$ 259
    • 18 have VHP 253
    • 19 that WDT 248

18. Individual variation: collocates preceding instances of believe

19. Gender, age and variation: Schler et al.

  • article: Effects of Age and Gender on Blogging (AAAI 2006)
  • all blogs accessible from blogger.com one day in August 2004
  • downloaded each blog that included author-provided indication of gender and at least 200 appearances of common English words
  • the full corpus thus obtained included over 71,000 blogs and over 300 million tokens
  • used to predict age and gender of bloggers
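
The selection criteria listed above (author-provided gender plus at least 200 occurrences of common English words) amount to a simple filter. The sketch below illustrates the idea only: the short word list, the record layout and the function name are assumptions, since the slide does not say which word list or tokenization the study used.

```python
# Hypothetical stand-in for a list of common English words.
COMMON_WORDS = {"the", "and", "of", "to", "a", "in", "is", "that"}


def keep_blog(blog, min_common=200):
    """Keep a blog only if gender is given and enough common words occur."""
    if not blog.get("gender"):
        return False
    tokens = blog["text"].lower().split()
    return sum(t in COMMON_WORDS for t in tokens) >= min_common


blogs = [
    {"gender": "female", "text": "the cat " * 200},  # kept
    {"gender": "", "text": "the cat " * 200},        # dropped: no gender given
    {"gender": "male", "text": "the cat " * 199},    # dropped: too few common words
]
selected = [b for b in blogs if keep_blog(b)]
print(len(selected))
```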

20. Gender, age and variation: common words males

  • token male female
  • linux 0.53±0.04 0.03±0.01
  • microsoft 0.63±0.05 0.08±0.01
  • gaming 0.25±0.02 0.04±0.00
  • server 0.76±0.05 0.13±0.01
  • software 0.99±0.05 0.17±0.02
  • gb 0.27±0.02 0.05±0.01
  • programming 0.36±0.02 0.08±0.01
  • google 0.90±0.04 0.19±0.02
  • data 0.62±0.03 0.14±0.01
  • graphics 0.27±0.02 0.06±0.01
  • india 0.62±0.04 0.15±0.01
  • nations 0.25±0.01 0.06±0.01
  • democracy 0.23±0.01 0.06±0.01
  • users 0.45±0.02 0.11±0.01
  • economic 0.26±0.01 0.07±0.01

21. Gender, age and variation: common words females

  • token male female
  • shopping 0.66±0.02 1.48±0.03
  • mom 2.07±0.05 4.69±0.08
  • cried 0.31±0.01 0.72±0.02
  • freaked 0.08±0.01 0.21±0.01
  • pink 0.33±0.02 0.85±0.03
  • cute 0.83±0.03 2.32±0.04
  • gosh 0.17±0.01 0.47±0.02
  • kisses 0.08±0.01 0.28±0.01
  • yummy 0.10±0.01 0.36±0.01
  • mommy 0.08±0.01 0.31±0.02
  • boyfriend 0.41±0.02 1.73±0.04
  • skirt 0.06±0.01 0.26±0.01
  • adorable 0.05±0.00 0.23±0.01
  • husband 0.28±0.01 1.38±0.04
  • hubby 0.01±0.00 0.30±0.02

22. Gender, age and variation: common words by age

  • token teens twenties thirties
  • maths 1.05±0.06 0.03±0.00 0.02±0.01
  • homework 1.37±0.06 0.18±0.01 0.15±0.02
  • bored 3.84±0.27 1.11±0.14 0.47±0.04
  • sis 0.74±0.04 0.26±0.03 0.10±0.02
  • boring 3.69±0.10 1.02±0.04 0.63±0.05
  • awesome 2.92±0.08 1.28±0.04 0.57±0.04
  • mum 1.25±0.06 0.41±0.04 0.23±0.04
  • mad 2.16±0.07 0.80±0.03 0.53±0.04
  • dumb 0.89±0.04 0.45±0.03 0.22±0.03
  • semester 0.22±0.02 0.44±0.03 0.18±0.04
  • apartment 0.18±0.02 1.23±0.05 0.55±0.05
  • drunk 0.77±0.04 0.88±0.03 0.41±0.05
  • beer 0.32±0.02 1.15±0.05 0.70±0.05
  • student 0.65±0.04 0.98±0.05 0.61±0.06
  • album 0.64±0.05 0.84±0.06 0.56±0.08
  • college 1.51±0.07 1.92±0.07 1.31±0.09
  • someday 0.35±0.02 0.40±0.02 0.28±0.03
  • dating 0.31±0.02 0.52±0.03 0.37±0.04
23. Gender, age and variation: common words by age (ii)

  • token teens twenties thirties
  • marriage 0.27±0.03 0.83±0.05 1.41±0.13
  • development 0.16±0.02 0.50±0.03 0.82±0.10
  • campaign 0.14±0.02 0.38±0.03 0.70±0.07
  • tax 0.14±0.02 0.38±0.03 0.72±0.11
  • local 0.38±0.02 1.18±0.04 1.85±0.10
  • democratic 0.13±0.02 0.29±0.02 0.59±0.05
  • son 0.51±0.03 0.92±0.05 2.37±0.16
  • systems 0.12±0.01 0.36±0.03 0.55±0.06
  • provide 0.15±0.01 0.54±0.03 0.69±0.05
  • workers 0.10±0.01 0.35±0.02 0.46±0.04
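
Group-level word frequencies like those in the tables above could be computed along the following lines. The sketch makes several assumptions that the slides do not confirm: frequencies are normalised per 10,000 words, each blogger contributes one rate, and the group figure is the mean of those rates with a standard error of the mean. The toy data and function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def rate_per_10k(tokens, word):
    """Occurrences of `word` per 10,000 tokens in one blogger's text."""
    return 10_000 * tokens.count(word) / len(tokens)


def group_summary(blogs, word):
    """Mean per-blogger rate and its standard error for one group."""
    rates = [rate_per_10k(tokens, word) for tokens in blogs]
    return mean(rates), stdev(rates) / sqrt(len(rates))


# Hypothetical group of two bloggers, 1,000 tokens each.
teens = [
    ["homework"] * 1 + ["x"] * 999,
    ["homework"] * 3 + ["x"] * 997,
]
m, se = group_summary(teens, "homework")
print(f"homework: {m:.2f} (standard error {se:.2f})")
```

Averaging per-blogger rates, rather than pooling all tokens of a group, keeps a few very prolific writers from dominating the group figure, which fits the speaker-level view argued for in this talk.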

24. Observations

25. How an individuated approach to corpus linguistics can benefit the field

  • allows us to take into account individual stylistic preference as a source of variation when making generalizations (syntax, semantics, ...)
  • allows us to observe the specifics of individual production before making blanket statements about groups (based on gender, social standing etc)
  • inverts the idea of system and variation (how much overlap is there in language use? vs. how much variation can our theories account for?)
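
The question "how much overlap is there in language use?" can be given a rough numerical form. One simple option, chosen here for illustration and not taken from the talk, is the Jaccard overlap between two speakers' vocabularies: shared word types divided by all word types used by either speaker. The sample texts are hypothetical.

```python
def vocabulary(texts):
    """Set of lower-cased word types across all of a speaker's texts."""
    return {tok for text in texts for tok in text.lower().split()}


def jaccard(a, b):
    """Shared types divided by all types; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


speaker_a = ["the market is growing"]
speaker_b = ["the team is hiring"]

overlap = jaccard(vocabulary(speaker_a), vocabulary(speaker_b))
print(f"{overlap:.2f}")
```

A value near 1 indicates largely shared vocabulary and a value near 0 indicates little overlap. The measure is sensitive to how much text each speaker has produced, so comparisons are more meaningful between samples of similar size.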

26. Research possibilities?

  • personal grammar?, personal semantics?
  • Construction Grammar (to what degree are constructions individual?)
  • variation over the lifetime
  • weighing genre, mode and individual variation
  • practical applications for forensic linguistics / language profiling

27. Thank you for listening!