Tracking the Emergence of New Words across Time and Space Jack Grieve Aston University Research conducted with Diansheng Guo & Alice Kasakoff, University of South Carolina Andrea Nini, Aston University Funded as part of the Digging into Data Challenge


Page 1: Tracking the Emergence of New Words across Time and Space

Tracking the Emergence of New Words across Time and Space

Jack Grieve

Aston University

Research conducted with

Diansheng Guo & Alice Kasakoff, University of South Carolina

Andrea Nini, Aston University

Funded as part of the Digging into Data Challenge

Page 2: Tracking the Emergence of New Words across Time and Space

Approaches to Historical Linguistics

There are several different approaches to the analysis of language change:

Reconstruction through comparison of known languages (comparative method)

Analysis of previous linguistic research (e.g. lexicographic research)

Analysis of historical texts (corpus-based)

Apparent time studies with interview data (sociolinguistics)

Computer simulations

Page 3: Tracking the Emergence of New Words across Time and Space

Lexical Change

Research in historical linguistics and etymology has analysed how the usage of certain words has changed over relatively long periods of time (primarily based on historical corpora and lexicographic research), but overall there are large gaps in our knowledge of lexical change, including how newly emerging words enter a language and spread across its speakers.

Page 4: Tracking the Emergence of New Words across Time and Space

Words are Rare Events

The main problem with studying lexical variation and change is that most words are incredibly rare, thus requiring incredibly large corpora of natural language.

This is why most research on lexical variation and change has focused on relatively high frequency words, primarily function words (e.g. pronouns, prepositions, auxiliary verbs).

Page 5: Tracking the Emergence of New Words across Time and Space

Word Frequency Distribution (Zipf 1935, 1945)

Page 7: Tracking the Emergence of New Words across Time and Space

Word Frequency Distribution (Zipf 1935, 1945)

The majority of the 67,000 most frequent words in our corpus occur less than once per 25 million words.

Page 8: Tracking the Emergence of New Words across Time and Space

New Words are Incredibly Rare Events

The analysis of new words requires even more data, because emerging words are by definition especially rare.

In addition, to analyse the temporal and spatial spread of new words, large corpora must be compiled for a large number of points in time and locations.

Page 9: Tracking the Emergence of New Words across Time and Space

Big Data

Suitable data has recently become available with the rise of social media and smartphones, which provide massive amounts of time-stamped and geocoded natural language data.

Page 10: Tracking the Emergence of New Words across Time and Space

Goals of Today’s Talk

Identify emerging words from 2014 based on a multi-billion word corpus of American tweets.

Chart their usage over time and identify common temporal patterns of lexical spread.

Map their geographical diffusion and identify common spatial patterns of lexical spread.

Page 11: Tracking the Emergence of New Words across Time and Space

The Corpus

Since 2013, the team at USC have been compiling two multi-billion word geocoded corpora for the US and the UK using the Twitter API.

Twitter is a particularly rich source of geocoded data and is also very popular, informal, and youthful, making it ideal for tracking the emergence of new words.

Approximately 2% of tweets are geocoded.

Page 12: Tracking the Emergence of New Words across Time and Space

The Corpus

The analysis today is based on an 8.9 billion word corpus of American tweets from October 2013 to November 2014, which totals approximately 980 million tweets from 7 million users.

Every tweet is geocoded with the precise longitude and latitude of the user when posting, which were then used to identify the county where each tweet was produced.

Page 13: Tracking the Emergence of New Words across Time and Space

-87.684555,42.074043

Just posted a photo @ Baha'i House of Worship


Page 21: Tracking the Emergence of New Words across Time and Space

Corpus Examples

username,fips,time,tweet
-,48439,Sun Jul 27 23:59:59 EDT 2014,don't follow the right ppl lol
-,42007,Sun Jul 27 23:59:59 EDT 2014,yesss moody judy
-,36005,Sun Jul 27 23:59:59 EDT 2014,Man i was just thinking shexx be lurking but won't hmu
-,25021,Sun Jul 27 23:59:59 EDT 2014,no seeing u on tv is reel but not seeing u on twitter is real for me...so pls visit us here everyday.
-,26163,Sun Jul 27 23:59:59 EDT 2014,Hate seeing my friends sad
-,12093,Sun Jul 27 23:59:59 EDT 2014,this is the shirt i won that i got to sign btw!!:)
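As an illustration only, rows in the username,fips,time,tweet layout shown above could be read as follows. This is a sketch, not the project's own loader: the function name `parse_row` is hypothetical, and it assumes that only the first three commas separate fields and that the time-zone abbreviation can be ignored.

```python
from datetime import datetime

def parse_row(line):
    """Split one corpus line into (username, fips, timestamp, tweet)."""
    # Assumption: only the first three commas are field separators, since
    # the tweet text itself may contain commas.
    username, fips, time_str, tweet = line.rstrip("\n").split(",", 3)
    # Drop the time-zone token (e.g. "EDT") before parsing, because
    # strptime's handling of such abbreviations varies by platform.
    parts = time_str.split()
    stamp = datetime.strptime(
        " ".join(parts[:4] + parts[5:]), "%a %b %d %H:%M:%S %Y"
    )
    # The FIPS county code stays a string so leading zeros are preserved.
    return username, fips, stamp, tweet
```

Keeping the county code as text rather than an integer avoids silently shortening codes that begin with zero.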

Page 22: Tracking the Emergence of New Words across Time and Space

Graveyard/Cemetery

Page 23: Tracking the Emergence of New Words across Time and Space

Graveyard/Cemetery

Page 24: Tracking the Emergence of New Words across Time and Space

Graveyard/Cemetery Percent

Page 25: Tracking the Emergence of New Words across Time and Space

Graveyard/Cemetery Smoothed (Getis-Ord Gi)

Page 26: Tracking the Emergence of New Words across Time and Space

Identifying Rising Words

To find newly emerging words, we first measured the degree to which the usage of each word in the corpus had been rising over the 13-month period.

To identify these rising words, we extracted the 67,000 words that occur at least 1,000 times in the corpus and correlated each word's relative frequency per day with the day of the year using Spearman's rank correlation coefficient.
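The rising-word measure described above can be sketched as follows, assuming that a word's daily counts and the daily corpus sizes are already available. The function names are illustrative, and the slides do not say which software was actually used; this version computes Spearman's coefficient as the Pearson correlation of average ranks.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def rising_score(daily_counts, daily_totals):
    """Correlate a word's daily relative frequency with the day index."""
    rel_freq = [c / t for c, t in zip(daily_counts, daily_totals)]
    days = list(range(len(rel_freq)))
    return spearman_rho(days, rel_freq)
```

A score near 1 means the word's relative frequency rose almost monotonically over the period, whatever the shape of the rise, which is why a rank-based coefficient suits this task.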

Page 27: Tracking the Emergence of New Words across Time and Space

ρ = .044

ρ = .116

Page 28: Tracking the Emergence of New Words across Time and Space

ρ = .044

Page 29: Tracking the Emergence of New Words across Time and Space

ρ = .044

ρ = -.028

Page 30: Tracking the Emergence of New Words across Time and Space

The Top 10 Rising Words on Twitter 2014

Word        ρ      Definition
fuckboy     0.947  Asshole, Jerk, Poser, Tool, etc.
rn          0.938  Right Now (Top Riser 2013)
hbd         0.928  Happy Birthday
fw          0.927  Fuck with
unbothered  0.926  Unconcerned & Disengaged
ft          0.925  Face time
gmfu        0.924  Get me fucked up
sm          0.919  So Much
squad       0.919  Squad
asf         0.918  As fuck

Page 36: Tracking the Emergence of New Words across Time and Space

Identifying Emerging Words

Although measuring correlations allows for rising words to be identified, most are far too common by 2014 to show patterns of regional spread.

To identify emerging words we cross-referenced the list of rising words against a list of rare words, defined as words with low overall frequencies in the fourth quarter of 2013 (excluding proper nouns).
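The cross-referencing step can be sketched as a simple filter. Everything here is an assumption made for illustration: the function name, the two thresholds (the slides give no cut-off values), and the toy words and numbers. Proper-noun exclusion is not attempted.

```python
def emerging_words(rho_by_word, q4_2013_rel_freq, min_rho, max_rel_freq):
    """Rising words (high rho) that were still rare in the fourth quarter of 2013.

    rho_by_word:       {word: Spearman rho over the study period}
    q4_2013_rel_freq:  {word: relative frequency in Q4 2013}
    min_rho, max_rel_freq: hypothetical thresholds chosen by the analyst
    """
    selected = [
        (word, rho)
        for word, rho in rho_by_word.items()
        # A word missing from the Q4 2013 table is treated as frequency 0.
        if rho >= min_rho and q4_2013_rel_freq.get(word, 0.0) <= max_rel_freq
    ]
    # Strongest risers first.
    return sorted(selected, key=lambda pair: pair[1], reverse=True)

# Toy input with made-up values, not figures from the corpus.
toy_rho = {"word_a": 0.93, "word_b": 0.91, "word_c": 0.40}
toy_freq = {"word_a": 50.0, "word_b": 0.2, "word_c": 0.1}
toy_result = emerging_words(toy_rho, toy_freq, min_rho=0.8, max_rel_freq=1.0)
```

In the toy input, `word_a` rises strongly but is already common, and `word_c` is rare but not rising, so only `word_b` passes both conditions.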

Page 47: Tracking the Emergence of New Words across Time and Space

Top 10 Emerging Words on Twitter 2014

Word        ρ      Definition
unbothered  0.926  Unconcerned & Disengaged
gmfu        0.924  Get Me Fucked Up
joggers     0.908  Jogging pants
fuckboys    0.902  Losers, wimps, posers, etc.
rekt        0.900  Wrecked
tfw         0.879  That feel when
xans        0.878  Benzodiazepine pills
baeless     0.875  To be without a bae
boolin      0.857  Hanging out, esp. young men
lordt       0.854  Lord, as exclamation

Page 48: Tracking the Emergence of New Words across Time and Space

Top 11-20 Emerging Words on Twitter 2014

Word        ρ      Definition
celfie      0.852  selfie
slays       0.843  impresses, succeeds at, etc.
famo        0.840  family and friends
fuckboi     0.838  fuckboy
(on) fleek  0.838  on point, esp. eyebrows
faved       0.836  to favorite something
gainz       0.828  earnings
bruuh       0.817  bro
amirite     0.816  am I right
notifs      0.808  notifications, especially online

Page 61: Tracking the Emergence of New Words across Time and Space

S-shaped Curves

In the time charts for many of the rising and emerging words we see clear s-curves or what look like the start of s-curves.

Page 62: Tracking the Emergence of New Words across Time and Space

S-shaped Curves

Similar results have also been found repeatedly in sociolinguistic apparent time studies (see Labov, 2001), as well as in corpus-based research in historical linguistics (e.g. Nevalainen & Raumolin-Brunberg, 2003).

Similar results have also been obtained in research on the diffusion of innovations (see Rogers, 2003), where it is referred to as an S-shaped Curve of Diffusion.
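For readers unfamiliar with the shape, one common mathematical form of an s-shaped curve is the logistic function. The sketch below only illustrates that shape; the slides do not state that a logistic curve was fitted, and the parameter names `ceiling`, `rate` and `midpoint` are assumptions.

```python
import math

def logistic(t, ceiling, rate, midpoint):
    """Logistic curve: slow start, rapid rise around `midpoint`, then levelling off.

    t         time (e.g. day index)
    ceiling   level the curve approaches as t grows
    rate      steepness of the rise (positive for a rising word)
    midpoint  time at which the curve reaches half of `ceiling`
    """
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))

# Hypothetical usage: a curve over 400 days that is halfway up at day 200.
curve = [logistic(day, ceiling=1.0, rate=0.03, midpoint=200) for day in range(400)]
```

A word observed only during the early part of such a curve would show what looks like the start of an s-curve, with usage still accelerating at the end of the period.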

Page 65: Tracking the Emergence of New Words across Time and Space

Summary: Time Patterns

New words rise (and fall) very quickly in Modern English, with numerous new words entering the language and quickly rising in usage every year.

The usage of emerging words over time tends to follow an s-shaped curve, echoing results found in sociolinguistic apparent time studies and diffusion of innovation research.

Page 66: Tracking the Emergence of New Words across Time and Space

Goals of Today’s Talk

Identify emerging words from 2014 based on a multi-billion word corpus of American tweets.

Chart their usage over time and identify common temporal patterns of lexical spread.

Map their geographical diffusion and identify common spatial patterns of lexical spread.

Page 67: Tracking the Emergence of New Words across Time and Space

Mapping the Spread of New Words

An important technical problem is how to map the spread of a new word across a region.

One approach is to map the relative frequency (e.g. occurrences per million words) of the word across a series of regional corpora (e.g. all the tweets from a particular county) over a series of time points.
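The relative-frequency approach can be sketched as follows, assuming each tweet has already been reduced to a county code, a time period label and its text. The function name, the tuple layout and the whitespace tokenisation are all assumptions made for this illustration.

```python
from collections import Counter

def relative_frequency(rows, target):
    """Occurrences of `target` per million words for each (county, period).

    rows: iterable of (fips, period, tweet_text) tuples, where `period`
          is any label for the time point (for example "2014-01").
    """
    hits = Counter()
    totals = Counter()
    for fips, period, text in rows:
        # Assumption: lower-cased whitespace tokens stand in for words.
        tokens = text.lower().split()
        totals[(fips, period)] += len(tokens)
        hits[(fips, period)] += tokens.count(target)
    return {
        key: 1_000_000 * hits[key] / total
        for key, total in totals.items()
        if total > 0
    }
```

Each (county, period) value can then be drawn on one map per time point. Counties with very small regional corpora will give unstable values, which is one reason smoothed maps are useful.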

Page 81: Tracking the Emergence of New Words across Time and Space

Geographical Diffusion of Linguistic Forms

Two major theories have been proposed to explain how new linguistic forms generally spread in language:

The Wave Model states that new forms spread out radially from their source.

The Gravity Model states that new forms spread out from one urban area to the next, based on distance and population size, only later filling in less populated areas in between.
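The difference between the two models can be made concrete with a toy comparison. The scoring used here for the gravity model (product of the two populations divided by squared distance) is a simplified assumption for illustration and is not taken from the slides; the place names and figures are invented.

```python
def gravity_score(pop_source, pop_target, distance_km):
    """Simplified gravity interaction: larger and closer places score higher."""
    return pop_source * pop_target / distance_km ** 2

def predicted_order(source_pop, places, model):
    """Order in which places are predicted to adopt a form from one source.

    places: {name: (population, distance_km_from_source)}
    model:  "wave" (nearest first) or "gravity" (highest score first)
    """
    if model == "wave":
        return sorted(places, key=lambda name: places[name][1])
    if model == "gravity":
        return sorted(
            places,
            key=lambda name: gravity_score(source_pop, *places[name]),
            reverse=True,
        )
    raise ValueError("model must be 'wave' or 'gravity'")

# Invented example: a nearby small town and a distant large city.
toy_places = {"small_near": (10_000, 50), "big_far": (500_000, 200)}
wave_order = predicted_order(1_000_000, toy_places, "wave")
gravity_order = predicted_order(1_000_000, toy_places, "gravity")
```

In this toy case the wave model reaches the nearby small town first, while the gravity model reaches the distant large city first, which is the kind of contrast a sequence of maps can help to assess.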

Page 82: Tracking the Emergence of New Words across Time and Space

Assessing the Wave and Gravity Models

We can begin to assess the validity of the wave and gravity models for lexical spread by examining the spread of unbothered.

This analysis can be facilitated by focusing on one state where the form eventually becomes relatively common, for example Georgia.

Page 83: Tracking the Emergence of New Words across Time and Space

Population Density of Georgia

[Map with labelled cities: Atlanta, Columbus, Macon, Augusta, Savannah]

[Pages 84-96: series of monthly maps of Georgia with the same labelled cities, dated 01 November 2013 to 01 November 2014]

Page 97: Tracking the Emergence of New Words across Time and Space

Assessing the Wave and Gravity Models

The geographical spread of unbothered in Georgia appears to be more complex than predicted by the Wave or Gravity Model, although both appear to offer a partial explanation for this pattern of spread.

The percentage of African Americans, however, also appears to be an important predictor.

Page 98: Tracking the Emergence of New Words across Time and Space

African Americans in Georgia

[Map with labelled cities: Atlanta, Columbus, Macon, Augusta, Savannah]

[Pages 99-100: maps of Georgia with the same labelled cities, dated 01 November 2014]

Page 101: Tracking the Emergence of New Words across Time and Space

Mapping the Spread of New Words on One Map

Presenting a time series of maps is an effective way to map lexical spread, but another technical issue is how to map emerging words on one map:

Relative frequency

Date of first (or second...) occurrence

Number of words until first (or second...) occurrence
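The second and third single-map options listed on this slide can be sketched as follows, for the first occurrence only. The function name, the tuple layout and the whitespace tokenisation are assumptions made for this illustration; timestamps may be any values that sort chronologically.

```python
def first_occurrences(rows, target):
    """For each county, find when `target` first appears.

    rows: iterable of (fips, timestamp, tweet_text), in any order.
    Returns {fips: (timestamp_of_first_use, words_before_first_use)}.
    Counties where the word never occurs are left out.
    """
    by_county = {}
    for fips, stamp, text in rows:
        by_county.setdefault(fips, []).append((stamp, text))

    result = {}
    for fips, tweets in by_county.items():
        # Process each county's tweets in chronological order.
        tweets.sort(key=lambda pair: pair[0])
        words_seen = 0
        for stamp, text in tweets:
            tokens = text.lower().split()
            if target in tokens:
                # Count the words produced in the county before the first hit.
                result[fips] = (stamp, words_seen + tokens.index(target))
                break
            words_seen += len(tokens)
    return result
```

Counting words rather than days adjusts for the fact that heavily populated counties produce far more text, so a word has more chances to appear there early.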

Page 120: Tracking the Emergence of New Words across Time and Space

Summary: Regional Patterns

New words originate from across the US, including the Southeast (e.g. unbothered, baeless, boolin), the North (e.g. fuckboy, gainz), and the West (e.g. rekt), and tend to spread within these regions first.

Otherwise, the spread of new words appears to be highly complex, affected by numerous factors, including proximity, population density, and demographic patterns.

Page 121: Tracking the Emergence of New Words across Time and Space

Traditional Approaches to Historical Linguistics

The empirical analysis of language change is generally based on historical corpora, which tend to span centuries, or collections of linguistic interviews, which tend to span generations (i.e. based on apparent time).

Both sources of data tend to provide a broad temporal scope but limited temporal resolution and amounts of data (<1 million words).

Page 122: Tracking the Emergence of New Words across Time and Space

The Uniformitarian Principle

“Knowledge of processes that operated in the past can be inferred by observing ongoing processes in the present” (Christy, 1983: ix).

This Uniformitarian Principle is cited in Labov (2001) to justify the use of apparent time interview data in place of historical corpora, but it also justifies the use of extremely large and dense contemporary corpora in place of both of these more common approaches.

Page 123: Tracking the Emergence of New Words across Time and Space

A Modern Approach to Historical Linguistics

Analysing modern language data mined from online sources allows unprecedentedly large, rich and dense natural language corpora to be compiled.

Although historical scope is lost, this approach allows for language change to be analysed in far greater detail than would otherwise be possible.

Page 124: Tracking the Emergence of New Words across Time and Space

Tracking the Emergence of New Words across Time and Space

Jack Grieve

Centre for Forensic Linguistics

Aston University

Email: [email protected]

https://sites.google.com/site/jackgrieveaston

Twitter: @JWGrieve