A Latent Variable Model for Geographic Lexical Variation

Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, Eric P. Xing
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213, USA
{jacobeis,brendano,nasmith,epxing}@cs.cmu.edu

Abstract

The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models.

1 Introduction

Sociolinguistics and dialectology study how language varies across social and regional contexts. Quantitative research in these fields generally proceeds by counting the frequency of a handful of previously-identified linguistic variables: pairs of phonological, lexical, or morphosyntactic features that are semantically equivalent, but whose frequency depends on social, geographical, or other factors (Paolillo, 2002; Chambers, 2009). It is left to the experimenter to determine which variables will be considered, and there is no obvious procedure for drawing inferences from the distribution of multiple variables. In this paper, we present a method for identifying geographically-aligned lexical variation directly from raw text. Our approach takes the form of a probabilistic graphical model capable of identifying both geographically-salient terms and coherent linguistic communities.

One challenge in the study of lexical variation is that term frequencies are influenced by a variety of factors, such as the topic of discourse. We address this issue by adding latent variables that allow us to model topical variation explicitly. We hypothesize that geography and topic interact, as “pure” topical lexical distributions are corrupted by geographical factors; for example, a sports-related topic will be rendered differently in New York and California. Each author is imbued with a latent “region” indicator, which both selects the regional variant of each topic, and generates the author’s observed geographical location. The regional corruption of topics is modeled through a cascade of logistic normal priors, a general modeling approach which we call cascading topic models. The resulting system has multiple capabilities, including: (i) analyzing lexical variation by both topic and geography; (ii) segmenting geographical space into coherent linguistic communities; (iii) predicting author location based on text alone.

This research is only possible due to the rapid growth of social media. Our dataset is derived from the microblogging website Twitter,[1] which permits users to post short messages to the public. Many users of Twitter also supply exact geographical coordinates from GPS-enabled devices (e.g., mobile phones),[2] yielding geotagged text data. Text in computer-mediated communication is often more vernacular (Tagliamonte and Denis, 2008), and as such it is more likely to reveal the influence of geographic factors than text written in a more formal genre, such as news text (Labov, 1966).

We evaluate our approach both qualitatively and quantitatively. We investigate the topics and regions

[1] http://www.twitter.com
[2] User profiles also contain self-reported location names, but we do not use that information in this work.
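The generative story outlined in the introduction (a latent region per author that selects regional topic variants and generates the author's location) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the model as specified later in the paper: the uniform region prior, the isotropic Gaussian over locations, the symmetric Dirichlet over topic proportions, and the names `make_regional_topics`, `generate_author`, `variant_sd`, `location_sd`, and `alpha` are all hypothetical choices made for illustration. The "cascade" is represented by drawing base topics as Gaussian log-weights, perturbing them with Gaussian noise once per region, and mapping the result through a softmax (the logistic normal construction).

```python
import math
import random


def softmax(scores):
    """Map real-valued scores to a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_dirichlet(rng, alpha, k):
    """Symmetric Dirichlet draw via normalized Gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def make_regional_topics(rng, base_topics, n_regions, variant_sd):
    """Perturb each base topic's log-weights with Gaussian noise, once per
    region, then normalize. Returns word distributions indexed [region][topic]."""
    return [
        [softmax([rng.gauss(w, variant_sd) for w in topic]) for topic in base_topics]
        for _ in range(n_regions)
    ]


def generate_author(rng, regional_topics, region_centers, n_words,
                    location_sd=1.0, alpha=1.0):
    """Simulate one author: latent region, observed location, observed words."""
    # Latent region indicator (uniform prior is an assumption of this sketch).
    region = rng.randrange(len(region_centers))
    # The region generates the author's observed location (assumed Gaussian).
    lat0, lon0 = region_centers[region]
    location = (rng.gauss(lat0, location_sd), rng.gauss(lon0, location_sd))
    # The region also selects which variant of each topic the author uses.
    topics = regional_topics[region]
    n_topics, vocab_size = len(topics), len(topics[0])
    theta = sample_dirichlet(rng, alpha, n_topics)  # author's topic proportions
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(n_topics), weights=theta)[0]
        words.append(rng.choices(range(vocab_size), weights=topics[topic])[0])
    return region, location, words


rng = random.Random(0)
# Two base topics over a six-word vocabulary, stored as Gaussian log-weights.
base_topics = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(2)]
# Hypothetical region centers as (latitude, longitude) pairs.
region_centers = [(40.7, -74.0), (34.1, -118.2)]
regional_topics = make_regional_topics(rng, base_topics, len(region_centers), 0.5)
region, location, words = generate_author(rng, regional_topics, region_centers, 10)
```

Because the regional variants are drawn once and shared by every author in a region, two authors in the same region writing about the same topic draw words from the same distribution, while authors in different regions use perturbed versions of it.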

emnlp2010


8/8/2019 emnlp2010

http://slidepdf.com/reader/full/emnlp2010 1/11
