HAL Id: hal-02194763
https://hal.archives-ouvertes.fr/hal-02194763
Submitted on 25 Jul 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

To cite this version: Ganesh Jawahar, Djamé Seddah. Contextualized Diachronic Word Representations. 1st International Workshop on Computational Approaches to Historical Language Change 2019 (colocated with ACL 2019), Aug 2019, Florence, Italy. hal-02194763

Contextualized Diachronic Word Representations



Contextualized Diachronic Word Representations

Ganesh Jawahar, Djamé Seddah
Inria

{firstname.lastname}@inria.fr

Abstract

Diachronic word embeddings play a key role in capturing interesting patterns about how language evolves over time. Most of the existing work focuses on studying corpora spanning several decades, which is understandably still not a possibility when working on social media-based user-generated content. In this work, we address the problem of studying semantic changes in a large Twitter corpus collected over five years, a much shorter period than what is usually the norm in diachronic studies.

We devise a novel attentional model, based on Bernoulli word embeddings, that is conditioned on contextual extra-linguistic (social) features such as network, spatial and socioeconomic variables, which are associated with Twitter users, as well as topic-based features. We posit that these social features provide an inductive bias that helps our model to overcome the narrow time-span regime problem. Our extensive experiments reveal that our proposed model is able to capture subtle semantic shifts without being biased towards frequency cues, and also works well when certain contextual features are absent. Our model fits the data better than current state-of-the-art dynamic word embedding models and is therefore a promising tool to study diachronic semantic changes over short time periods.

1 Introduction

Natural language changes over time due to a wide range of linguistic, psychological, sociocultural and encyclopedic causes (Blank and Koch, 1999; Grzega and Schoener, 2007). Studying the semantic change of a word helps us understand more about human language and build temporally aware models that are especially complementary to the work done in the digital humanities and historical linguistics. Recently, diachronic word embeddings based on the distributional hypothesis (Harris, 1954) have been used to automatically study semantic changes in a data-driven fashion from large corpora (Kim et al., 2014; Hamilton et al., 2016; Rudolph and Blei, 2018). We refer the reader to Kutuzov et al. (2018), who survey the recent methods in this field and establish the challenges that lie ahead.

Figure 1: The diachronic embedding computed by our proposed model for the word 'BATACLAN' reveals how the term's usage changed over the years. We list the five most similar words (with English translations in parentheses) in each year by cosine similarity. The y-axis corresponds to "meaning", a one-dimensional PCA projection of the embeddings.

Currently, we find the literature on this problem to be focused on English corpora spanning several decades. This has not only created a gap in extending diachronic word embeddings to a wider scope of languages, but also to datasets spanning a few successive years, which are common in the digital humanities and social sciences. In this work, we study French text from Twitter collected over just five years, which provides a challenging platform to build models that can capture semantic drifts in a noisy, subtly evolving language corpus.

Figure 1 shows an instance of the evolution of the word 'Bataclan' (a theatre in Paris that was attacked by terrorists in November 2015) from the French corpus. It also shows that such embedding representations mostly capture the dominant sense of a word when used in synchrony and can therefore only reflect the evolution of the dominant sense when used diachronically, leaving open the question of whether small, subtle changes can be captured (Tahmasebi et al., 2018).

We hypothesize that the current state-of-the-art models lack inductive biases to fit data accurately in this setting. We build on the observation by Jurafsky (2018) that "it's important to consider who produced the language, in what context, for what purpose, and make sure that the models are fit to the data". Hence, we propose a novel model extending Dynamic Bernoulli word Embeddings (DBE; Rudolph and Blei, 2018) which exploits this inductive bias by conditioning on a number of contextualized features such as network, spatial and socioeconomic variables, which are associated with Twitter users, as well as topic-based features.

We perform qualitative studies and show that our model can: (i) accurately capture the subtle changes caused by cultural drifts, (ii) learn a smooth trajectory of word evolution despite exploiting various inductive biases. Our quantitative studies illustrate that our model can: (i) capture better semantic properties, (ii) be less sensitive to frequency cues compared to the DBE model, (iii) act as better features for 2 out of 4 tweet classification tasks. Through an ablation study, we find in addition that our model can: (iv) work with a reduced set of contextualized features, (v) pass the test of the law of prototypicality (Dubossarsky et al., 2015). In sum, we believe our model is a promising tool to study diachronic semantic changes over short time periods.1

Our main contributions are as follows:

  • Our work is, to the best of our knowledge, the first to study diachronic word embeddings for tweets in the French language. Unlike previous works, we consider a dataset from a narrow time horizon (five years).

  • We propose a novel, attentional, diachronic word embedding model that derives inductive biases from several contextualized, socio-demographic features to fit the data accurately.

  • Our work is also the first to estimate the usefulness of diachronic word embeddings for a downstream task like tweet classification.

1 Code to reproduce our experiments is publicly accessible at https://github.com/ganeshjawahar/social_word_emb

2 Related Work

Kim et al. (2014) introduced prediction-based word embedding models to track semantic shifts across time. They extended the SkipGram model with Negative Sampling (SGNS) (Mikolov et al., 2013) by training a model on the current year after initializing the word embeddings from the trained model of the previous year. This initialization ensures the word vectors across time slices are grounded in the same semantic space. Kulkarni et al. (2015) and Hamilton et al. (2016) utilize ad hoc alignment techniques like orthogonal Procrustes transformations to map successive model pairs together. These approaches have the impractical demand of having enough data in each time slice to learn high quality embeddings.

Bamler and Mandt (2017), Yao et al. (2018) and Rudolph and Blei (2018) proposed to learn word embeddings across all time periods jointly, along with their alignment, in a single step. Rudolph and Blei (2018) represent word embeddings as sequential latent variables, naturally accommodating time slices with sparse data and ensuring word embeddings are grounded across time. Our proposed model builds upon this work to condition on several inductive biases, using contextual extra-linguistic (social) and topic-based features, to accurately fit a dataset from a narrow time horizon.

3 Contextualized Features

Natural language text is inherently contextual, depending on the author, the period and the intended purpose (Jurafsky, 2018). For instance, features based on authors' demography, although incomplete, can explain some of the variance in the text (Garten et al., 2019). While diachronic word embeddings' ability to capture semantic shifts is interesting because of its flexibility, we postulate that there is a need to capture contextualized information about tweets such as the characteristics of their authors (including spatial, network, socioeconomic and topical interest attributes) and meta-information such as their topic. To extract features, we make use of the largest French Twitter corpus to date, proposed in Abitbol et al. (2018). In this section we describe the set of contextualized features we propose to inject into our diachronic word embedding model (see Section 4).

3.1 Spatial

Users from similar geographical areas tend to share similar properties in terms of word usage and language idiosyncrasies. Among others, Hovy and Purschke (2018) for German and Abitbol et al. (2018) for French confirmed regional variations in geolocated users' content in social media. The latter work found the southern part of France to use a more standard language than the northern part. To exploit these geographic variations, we identify geolocated users (∼100K) and associate each of them to their respective region (out of 22 regions) and department (out of 96 departments) within the French territory. We learn a latent embedding for each region and department, which captures the spatial information at different levels of granularity.

3.2 Socioeconomic

Users of similar socioeconomic status tend to share similar online behavior in terms of circadian cycles. Specifically, Abitbol et al. (2018) found that people of higher socioeconomic status are active to a greater degree during the daytime and also use a more standard language. The National Institute of Statistics and Economic Studies (INSEE) of France provides the population-level salary for each 4-hectare square patch across the whole French territory, estimated from the 2010 tax returns in France. We also use the IRIS dataset provided by the French government, which has a more coarse-grained annotation of socioeconomic status. This information is mapped to the geographical coordinates of users' home locations from Twitter, so we can roughly ascertain the economic status of every geolocated user. We create 9 socioeconomic classes by binning the income and ensuring that the sum of income is the same for each class. We learn a latent embedding for each such class, which thus captures the variation caused by status homophily.2
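The equal-income-mass binning can be sketched as follows; the helper name and the toy income figures are ours for illustration, not taken from the paper's released code.

```python
import numpy as np

def socioeconomic_classes(incomes, n_classes=9):
    """Assign each user to an income class so that the total income
    (not the number of users) is roughly equal in every class."""
    order = np.argsort(incomes)
    cum = np.cumsum(np.asarray(incomes, dtype=float)[order])
    # Cumulative-income targets at each class boundary.
    targets = cum[-1] * np.arange(1, n_classes + 1) / n_classes
    boundaries = np.searchsorted(cum, targets)
    boundaries[-1] = len(incomes)  # make sure the last class is closed
    labels = np.empty(len(incomes), dtype=int)
    prev = 0
    for cls, b in enumerate(boundaries):
        labels[order[prev:b]] = cls
        prev = b
    return labels

# Toy example: three classes over nine fictional yearly incomes.
incomes = [12000, 15000, 18000, 22000, 30000, 45000, 60000, 90000, 200000]
labels = socioeconomic_classes(incomes, n_classes=3)
```

With three classes here, the single highest earner ends up alone in the top class, since its income alone roughly matches the combined mass of each lower class.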

3.3 Network

Users who are connected to each other in social networks are usually believed to share similar interests. We construct a co-mention network with the set of geolocated users as nodes and edges connecting those users who have mentioned each other at least once. We run the LINE model (Tang et al., 2015) to embed the nodes of the graph using the connectivity information and use the resulting node embeddings as fixed features.

2 Some statistical pretreatments were applied to the data by INSEE before its public release to uphold current privacy laws and due to the highly sensitive nature of the disclosed data.
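A minimal sketch of the co-mention graph construction (the user ids and helper name are illustrative); the resulting edge list is what would then be handed to a node-embedding method such as LINE.

```python
def comention_edges(tweets):
    """Undirected co-mention edge list from (author, mentioned_users)
    pairs: an edge links two users as soon as one mentions the other."""
    edges = set()
    for author, mentioned in tweets:
        for m in mentioned:
            if m != author:  # ignore self-mentions
                edges.add(tuple(sorted((author, m))))
    return sorted(edges)

tweets = [("u1", ["u2", "u3"]), ("u2", ["u1"]), ("u3", [])]
edges = comention_edges(tweets)
```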

3.4 Interest

The interest feature corresponds to the set of important topics a user cares about. We obtain this information by composing a user document capturing all the words used in their posts, ranking the words in the document by their tf-idf score and selecting the top 50 of them. We then construct the user vector by summing the vectors (obtained by running word2vec on the entire corpus or on the geolocated tweets) corresponding to the top 50 words. We use the user vectors as fixed features.
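The interest feature can be sketched as below; the function and toy data are ours, and the tf-idf weighting shown (raw term frequency times log inverse document frequency) is one standard variant, not necessarily the exact one used by the authors.

```python
import math
from collections import Counter

def interest_vector(user_docs, user, word_vectors, top_k=50):
    """Rank a user's words by tf-idf over all user documents, then sum
    the embeddings of the top-k words to get a fixed user feature.

    user_docs: dict user -> list of tokens (all their tweets concatenated)
    word_vectors: dict word -> list[float] (e.g. from word2vec)
    """
    n_users = len(user_docs)
    df = Counter()  # document frequency: in how many user docs a word appears
    for doc in user_docs.values():
        df.update(set(doc))
    tf = Counter(user_docs[user])
    scores = {w: tf[w] * math.log(n_users / df[w]) for w in tf}
    top = sorted(scores, key=scores.get, reverse=True)[:top_k]
    dim = len(next(iter(word_vectors.values())))
    vec = [0.0] * dim
    for w in top:
        if w in word_vectors:
            vec = [a + b for a, b in zip(vec, word_vectors[w])]
    return vec

user_docs = {"u1": ["foot", "foot", "paris"], "u2": ["paris", "cinema"]}
word_vectors = {"foot": [1.0, 0.0], "paris": [0.0, 1.0], "cinema": [0.5, 0.5]}
v = interest_vector(user_docs, "u1", word_vectors, top_k=1)
```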

3.5 Knowledge

Knowledge features keep track of the way a user writes and, as such, are also a summary of their content on Twitter. We learn a latent embedding for each geolocated user.

3.6 Topic

This feature corresponds to the topic a tweet belongs to. Since the available corpus does not have any annotation of tweet topics, we exploit the distant supervision-based idea proposed by Magdy et al. (2015) to filter geolocated tweets with an accompanying YouTube video link. We then use the YouTube public API to obtain the category of the video, which is then associated with the topic of the tweet. We learn a latent embedding for each YouTube category.
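Extracting the video ids that anchor this distant supervision might look as follows; the regular expression covers only the two most common YouTube URL shapes, and each extracted id would then be resolved to a video category through the public YouTube Data API.

```python
import re

# Matches youtube.com/watch?v=<id> and youtu.be/<id> links; video ids
# are 11 characters drawn from this alphabet.
YOUTUBE_ID = re.compile(
    r"(?:youtube\.com/watch\?v=|youtu\.be/)([A-Za-z0-9_-]{11})")

def youtube_video_ids(text):
    """Extract YouTube video ids from a tweet's text."""
    return YOUTUBE_ID.findall(text)

ids = youtube_video_ids("regarde ça https://youtu.be/dQw4w9WgXcQ !")
```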

4 Proposed model

In this section we first briefly discuss the 'Dynamic Bernoulli Embeddings' model (DBE) and then provide the details of our proposal, which uses the DBE model as its backbone.

4.1 Dynamic Bernoulli Embeddings (DBE)

The DBE model is an extension of the 'Exponential Family Embeddings' model (EFE; Rudolph et al., 2016) for incorporating sequential changes into the data representation. Let the sequence of words from a corpus of text be represented by (x_1, . . . , x_N) from a vocabulary V. Each word x_i ∈ {0, 1}^{|V|} corresponds to a one-hot vector, having 1 in the position corresponding to the vocabulary term and 0 elsewhere. The context c_i represents the set of words surrounding a given word at position i.3 DBE builds on Bernoulli embeddings, which provide a conditional model for each entry x_{iv} ∈ {0, 1} of the indicator vector, whose conditional distribution is

x_{iv} \mid x_{c_i} \sim \mathrm{Bern}(\rho_{iv}),    (1)

where ρ_{iv} ∈ (0, 1) is the Bernoulli probability and x_{c_i} is the collection of data points indexed by the context positions. Each index (i, v) in the data is associated with two parameter vectors, the embedding vector ρ_v ∈ R^K and the context vector α_v ∈ R^K. The natural parameter of the Bernoulli is given by

\eta_{iv} = \rho_v^\top \Big( \sum_{j \in c_i} \sum_{v'} \alpha_{v'} x_{jv'} \Big).    (2)

Since each observation x_{iv} is associated with a time slice t_i (which is a year, in our case4), DBE learns a per-time-slice embedding vector ρ_v^{(t_i)} for every word in the vocabulary. Thus, Equation 2 becomes

\eta_{iv} = \rho_v^{(t_i)\top} \Big( \sum_{j \in c_i} \sum_{v'} \alpha_{v'} x_{jv'} \Big).    (3)

DBE shares the context vectors across the time slices to ground the successive embedding vectors in the same semantic space. DBE assumes a Gaussian random walk as a prior on the embedding vectors to encourage smooth change in the estimates of each term's embedding:

\alpha_v, \rho_v^{(0)} \sim \mathcal{N}(0, \lambda_0^{-1} I), \qquad \rho_v^{(t)} \sim \mathcal{N}(\rho_v^{(t-1)}, \lambda^{-1} I).    (4)
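The two ingredients above, the natural parameter of Equation 3 and the random-walk prior of Equation 4, can be sketched in NumPy; the function names and array shapes are our own choices, not the paper's implementation.

```python
import numpy as np

def natural_parameter(rho_t, alpha, context_ids, v):
    """Natural parameter eta_iv (Equation 3): dot product of word v's
    embedding at one time slice with the sum of the context vectors of
    the surrounding words.

    rho_t: (V, K) embeddings for one time slice; alpha: (V, K) shared
    context vectors; context_ids: indices of words around position i.
    """
    return float(rho_t[v] @ alpha[context_ids].sum(axis=0))

def random_walk_log_prior(rho, alpha, lambda_0=1.0, lambda_=10.0):
    """Gaussian random-walk log prior (Equation 4), up to a constant.

    rho: (T, V, K) per-time-slice embeddings; alpha: (V, K).
    """
    drift = rho[1:] - rho[:-1]  # successive differences across time
    return (-0.5 * lambda_0 * (alpha ** 2).sum()
            - 0.5 * lambda_0 * (rho[0] ** 2).sum()
            - 0.5 * lambda_ * (drift ** 2).sum())

eta = natural_parameter(np.array([[1.0, 0.0], [0.0, 1.0]]),
                        np.array([[1.0, 1.0], [2.0, 0.0]]),
                        context_ids=[1], v=0)
prior = random_walk_log_prior(np.zeros((3, 2, 2)), np.zeros((2, 2)))
```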

4.2 Proposed model

In this work, we argue that the DBE model fails to accurately fit data spanning only a few years, as it discards explanatory variables (besides time) about the complicated processes of evolution and construction in the language. These variables, which we defined in Section 3 as contextualized features, carry useful signals to understand subtle changes such as cultural drifts. Our proposed model extends DBE by utilizing these contextualized features as inductive biases.

3 We use 2 words before and after the focal word to determine the context in all our experiments.

4 Our preliminary investigation with different time span units can be found in Appendix A.6.

In our setting, we represent a tweet as t_k = (x_1, . . . , x_N) belonging to user u_l. Each tuple (i, c) is associated with a set of contextualized features based on either u_l or t_k, f_{i,m} ∈ R^{d_m} (m = 1, . . . , |F|), where |F| corresponds to the number of contextualized features. Each contextualized feature not only follows a different distribution but also has a different degree of noise (e.g., sparsity of the co-mention network, geolocation inaccuracy). Hence, it is hard to unify them in a single model. We propose three ways to introduce inductive bias into the DBE model.

Unweighted sum: The simplest approach is to project all the feature embeddings to a common space and sum them up. This approach is agnostic to the embedding vector x_i in question and considers all the contextualized features equally. Incorporating this approach, Equation 3 now becomes:

\eta_{iv} = \Big( \rho_v^{(t_i)} + \sum_{m=1}^{|F|} w_m f_{i,m} \Big)^{\!\top} \Big( \sum_{j \in c_i} \sum_{v'} \alpha_{v'} x_{jv'} \Big),    (5)

where w_m corresponds to the learnable weights of the linear projection of f_{i,m}, with size K × d_m. Note that K denotes the dimension of both the context and target embeddings.

Self-attention: Considering all the features equally would be wasteful for certain embedding vectors x_i. Hence, we propose to let the network decide the important contextualized features based on self-attention. This approach gives our model a provision to handle the effect of spurious contextual signals by paying them no attention. Incorporating this approach, Equation 5 becomes:

\eta_{iv} = \Big( \rho_v^{(t_i)} + \sum_{m=1}^{|F|} \alpha_m w_m f_{i,m} \Big)^{\!\top} \Big( \sum_{j \in c_i} \sum_{v'} \alpha_{v'} x_{jv'} \Big),    (6)

where α_m are the scalar weights corresponding to the self-attention mechanism:

\alpha_m = g(f_{i,m}) = \phi(a^\top w_m f_{i,m} + b),    (7)

where a ∈ R^K and b ∈ R are learnable parameters, while φ is a softmax.

Contextual attention: We can also make the attention mechanism context-dependent, that is, dependent on the embedding vector. Equation 7 then becomes:

\alpha_m = g(\rho_i) = \phi(a_m^\top \rho_i + b),    (8)

where a_m ∈ R^K corresponds to the learnable attention parameter specific to a contextualized feature f_m.

Table 1: Embedding neighborhood of 'EMMANUEL' obtained by finding the closest words in each time period, sorted by decreasing similarity. All named entities are italicized in the original; interesting words identified by the proposed model are bolded.

DBE:
| 2014     | 2015      | 2016     | 2017     | 2018       |
|----------|-----------|----------|----------|------------|
| jdd      | rocard    | macron   | macron   | macron     |
| brunet   | levy      | matignon | rugy     | elysee     |
| frederic | attali    | lejdd.fr | hollande | matignon   |
| elysee   | montel    | medef    | elysee   | petain     |
| dupont   | monarchie | elysee   | bayrou   | interpelle |

Context Attn.:
| 2014     | 2015      | 2016        | 2017           | 2018           |
|----------|-----------|-------------|----------------|----------------|
| jdd      | attali    | macron      | macron         | macron         |
| brunet   | levy      | matignon    | hollande       | elysee         |
| dupont   | cnrs      | fustige     | rugy           | elysee         |
| frederic | monarchie | renoncement | melenchon      | election       |
| revelee  | rocard    | medef       | presidentielle | emmanuelmacron |

Table 2: Embedding neighborhood of 'EQUIPEDEFRANCE' ('French Team') obtained by finding the closest words in each time period, sorted by decreasing similarity. All named entities are italicized in the original; interesting words identified by the model are bolded.

DBE:
| 2014    | 2015        | 2016        | 2017      | 2018       |
|---------|-------------|-------------|-----------|------------|
| genesio | huitiemes   | estac       | ogcnice   | asnl       |
| genesio | lafont      | pyeongchang | amical    | tricolore  |
| raggi   | genesio     | tgvmax      | slovaquie | pariez     |
| zambo   | pyeongchang | u20         | asnl      | affrontera |
| malcuit | sampaoli    | lrem        | bleuets   | carrasso   |

Context Attn.:
| 2014    | 2015      | 2016        | 2017      | 2018     |
|---------|-----------|-------------|-----------|----------|
| malcuit | sampaoli  | pyeongchang | ogcnice   | asnl     |
| seri    | huitiemes | estac       | asnl      | eswc     |
| tousart | donnarumma| cu.e        | bleuets   | carrasso |
| raggi   | lafont    | ndombele    | slovaquie | tricolore|
| asensio | sertic    | auproux     | mennel    | euro2016 |
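The three variants (Equations 5-7) differ only in how the projected feature embeddings are weighted before being added to the word embedding. A NumPy sketch under our own naming, with the context-vector dot product of Equation 3 omitted:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def contextualized_embedding(rho_tv, feats, W, a, b=0.0, attention="self"):
    """Project each contextualized feature into the embedding space and
    add it to the word's embedding, either with uniform weights
    (unweighted sum, Eq. 5) or with self-attention weights (Eq. 6-7).

    rho_tv: (K,) word embedding at the tweet's time slice
    feats:  list of per-feature vectors f_{i,m} of shape (d_m,)
    W:      list of (K, d_m) projection matrices
    a, b:   attention parameters, a of shape (K,), b scalar
    """
    projected = np.stack([W[m] @ feats[m] for m in range(len(feats))])
    if attention == "none":
        weights = np.ones(len(feats))          # Equation 5
    else:
        weights = softmax(projected @ a + b)   # Equations 6-7
    return rho_tv + (weights[:, None] * projected).sum(axis=0)

rho = np.zeros(2)
feats = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
W = [np.eye(2), np.eye(2)]
emb_attn = contextualized_embedding(rho, feats, W, a=np.zeros(2))
emb_sum = contextualized_embedding(rho, feats, W, a=np.zeros(2),
                                   attention="none")
```

With zero attention parameters the softmax spreads the weight uniformly, so the two variants differ only by the normalization of the weights.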

We fit the diachronic embeddings with the pseudo log likelihood, the sum of the log conditionals. In particular, we regularize the pseudo log likelihood with the log priors, followed by maximization to obtain a pseudo MAP estimate. Our objective function can be summarized as

\mathcal{L}(\rho, \alpha) = \mathcal{L}_{pos} + \mathcal{L}_{neg} + \mathcal{L}_{prior}.    (9)

The likelihoods are given by:

\mathcal{L}_{pos} = \sum_{k=1}^{|T|} \sum_{i=1}^{N} \sum_{v=1}^{V} x_{iv} \log \sigma(\eta_{iv}), \qquad \mathcal{L}_{neg} = \sum_{k=1}^{|T|} \sum_{i=1}^{N} \sum_{v \in S_i} \log(1 - \sigma(\eta_{iv})),    (10)

where S_i corresponds to the negative samples drawn at random (Mikolov et al., 2013) and σ(·) denotes the sigmoid function, which maps natural parameters to probabilities. The prior is given by

\mathcal{L}_{prior} = -\frac{\lambda_0}{2} \sum_v \|\alpha_v\|^2 - \frac{\lambda_0}{2} \sum_v \|\rho_v^{(0)}\|^2 - \frac{\lambda}{2} \sum_{v,t} \|\rho_v^{(t)} - \rho_v^{(t-1)}\|^2.    (11)

Language evolution is a gradual process, and the random walk prior prevents successive embedding vectors ρ_v^{(t-1)} and ρ_v^{(t)} from drifting far apart.

The objective function established in Equation 9 is optimized using stochastic gradients (Robbins and Monro, 1985) with the Adam optimizer (Kingma and Ba, 2014). Negative samples are resampled at each gradient step. Pseudocode for training our model can be found in Appendix A.1.
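The data term of the objective (Equations 9-10) is just a sum of log-sigmoid terms over observed positions and negative samples. A minimal sketch, with the natural parameters passed in precomputed (the prior of Equation 11 would be added separately):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pseudo_log_likelihood(eta_pos, eta_neg):
    """L_pos + L_neg: log-probability of the observed words plus the
    log-probability of NOT observing the resampled negative samples."""
    l_pos = np.log(sigmoid(np.asarray(eta_pos))).sum()
    l_neg = np.log(1.0 - sigmoid(np.asarray(eta_neg))).sum()
    return l_pos + l_neg

# One observed position and one negative sample, both with eta = 0.
val = pseudo_log_likelihood([0.0], [0.0])
```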

5 Experiments and Results

In this section we discuss the experimental protocol and the qualitative and quantitative evaluations used to understand the performance of our model.

5.1 Protocol

Data: We use the French Twitter dataset proposed in Abitbol et al. (2018), which is the largest collection of French tweets to date. The original dataset consists of 190M French tweets posted by 2.5M users between June 2014 and March 2018. To be able to use socio-geographic features and assess the validity of our model, we only considered tweets from users whose home location could be identified to be in Metropolitan France. This filtering step resulted in a dataset of 18M tweets from 110K users spread across 5 years. This dataset was then enriched using output from the constituency-based Stanford parser in its off-the-shelf French settings (Green et al., 2011) and from the dependency-based parser of Jawahar et al. (2018). We lowercased all the tweets and removed hashtags, mentions, URLs, emoticons and punctuation. We used 80% of the tweets from each year to train our model and split the rest equally to create the validation (10%) and test (10%) sets. Finally, we pick the most frequent 50K words from the train set to create our vocabulary.

Baseline models: We compare our proposed model with three baseline models: (i) Word2vec (Mikolov et al., 2013)5 - we use the SGNS version of Word2vec trained independently for each year with an embedding size of 100, a window of 2 and the rest kept at default; (ii) HistWords (Hamilton et al., 2016)6 - we use the SGNS version, which is effective for datasets of different sizes, and employ similar settings as the previous baseline; (iii) DBE (Rudolph and Blei, 2018)7 - we use the dynamic Bernoulli embedding model (the backbone of our model) with the recommended settings. We have three variants of our proposed model: the no attention model (unweighted sum), the self attention model and the contextual attention model. Hyperparameter settings to reproduce our results can be found in Appendix A.2.
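The per-year 80/10/10 split and vocabulary construction described above can be sketched as follows; the helper name and the seed are ours, and tweets are represented as whitespace-tokenizable strings.

```python
import random
from collections import Counter

def split_and_vocab(tweets_by_year, vocab_size=50000, seed=0):
    """Per-year 80/10/10 split, then a vocabulary of the most frequent
    words in the combined training portion."""
    rng = random.Random(seed)
    train, valid, test = [], [], []
    for year, year_tweets in tweets_by_year.items():
        shuffled = list(year_tweets)
        rng.shuffle(shuffled)
        n = len(shuffled)
        n_train, n_valid = int(0.8 * n), int(0.1 * n)
        train += shuffled[:n_train]
        valid += shuffled[n_train:n_train + n_valid]
        test += shuffled[n_train + n_valid:]
    counts = Counter(w for tweet in train for w in tweet.split())
    vocab = [w for w, _ in counts.most_common(vocab_size)]
    return train, valid, test, vocab

data = {2014: [f"mot{i}" for i in range(10)]}
train, valid, test, vocab = split_and_vocab(data)
```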

5.2 Qualitative Study

Embedding neighborhood: The goal of a diachronic word embedding model is to automatically discover the changes in the usage of a word. The current usage at time t of a word w can be obtained by inspecting the nearby words of the word, represented by ρ_w^{(t)}. From Table 1, we can observe that 'EMMANUEL' (first name of the current French president) is associated with his last name ('macron') and office location ('elysee') by both DBE and the proposed model. However, the proposed model is able to capture an interesting neighborhood by bringing words such as 'election', 'presidentielle' and 'melenchon' closer to 'EMMANUEL'.8 Table 2 presents words of interest associated by our proposed model to the French football team, like 'euro2016'.

Smoothness of the embedding trajectories: Since language evolution is a gradual process, the trajectory of a word tracked by a model should change smoothly. There are exceptions for words undergoing cultural shifts, where the changes can be subtle and rapid. We plot the trajectory by computing the cosine similarity between a word (e.g., MACRON) and its known, changed usage (e.g., PRESIDENT). Figure 2 shows that models relying on Bernoulli embeddings have smooth trajectories for known relations compared to other models. Despite fusing different, possibly noisy contextualized features, the trajectories tracked by our proposed model and DBE are comparably smooth.

t-SNE: Alternatively, we can overlay the embeddings from all the time slices and visualize them using a dimensionality reduction technique like t-SNE (Maaten and Hinton, 2008). From Figure 3, we see a similar result, where most of the words modeled by our proposed model have experienced consistent change with time.

5 https://radimrehurek.com/gensim/models/word2vec.html

6 https://nlp.stanford.edu/projects/histwords/

7 https://github.com/mariru/dynamic_bernoulli_embeddings

8 Emmanuel Macron became the president of France in May 2017. Jean-Luc Melenchon finished fourth.
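The embedding-neighborhood probe reduces to a cosine nearest-neighbor query inside one time slice. A sketch under our own naming, with `rho` mapping each year to its (V, K) embedding matrix:

```python
import numpy as np

def neighbors(word, year, rho, vocab, k=5):
    """Nearest neighbors of `word` in time slice `year` by cosine
    similarity; `vocab` lists the V words in matrix row order."""
    emb = rho[year]
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    idx = vocab.index(word)
    sims = normed @ normed[idx]
    order = np.argsort(-sims)  # most similar first
    return [vocab[j] for j in order if j != idx][:k]

vocab = ["a", "b", "c"]
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
nn = neighbors("a", 2014, {2014: emb}, vocab, k=1)
```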

5.3 Quantitative Study

Log Likelihood: We can evaluate models by held-out Bernoulli probability (Rudolph and Blei, 2018). Given a held-out position, a better model assigns higher probability to the observed word and lower probability to the rest. We report L_eval = L_pos + L_neg in Table 3. The contextual attention based model, which smartly utilizes the contextualized features, provides a better fit to the data than the rest. Interestingly, the other variants of our proposed model perform poorly compared to the DBE model, which suggests the importance of utilizing attention appropriately. All the competing methods produce Bernoulli conditional likelihoods (Equation 1), where n is the number of negative samples; we keep n at 20 for all the methods to perform a fair comparison.

Semantic Similarity: Certain tweets are tagged with a 'category' to which they belong (as discussed in Section 3.6). Similar to Yao et al. (2018), we create a ground truth of word categories based on the identification of words in years that are exceptionally numerous in one particular category. In other words, if a word is most frequent in a category, we tag the word with that category and add it to our ground truth. For each category c and each word w in year t, we find the percentage of occurrences p in each category. We collect such word-time-category ⟨w,t,c⟩ triplets, avoid duplication by keeping the year of largest strength for each word and category combination, and remove triplets where p is less than 35%. Finally, we pick the top 200 words by strength from each category and create a dataset of 3036 triplets across 15 categories, where each word-year pair is strongly linked to its true category. We evaluate the purity of the clustering results using the Normalized Mutual Information (NMI) metric. From Table 3, we find a similar trend in the performance of our proposed model. As we will see in Section 6.3, the reason our contextual attention based model excels in this task is its superiority in capturing the semantic properties of a word.

Synthetic Linguistic Change: We can synthetically introduce linguistic shift by introducing changes to the corpus and then evaluate whether the diachronic word embedding model is able to detect those artificial drifts accurately. Following Kulkarni et al. (2015), we duplicate our data belonging to the year 2018 six times (along with the extra-linguistic information), perturb the last 3 snapshots and use the diachronic embedding model to rank all the words according to their p-values. We then calculate the Mean Reciprocal Rank (MRR) for the perturbed words and expect it to be higher for models that can identify the words that have changed. To perturb the data, we sample a pair of words from the vocabulary excluding stop words, replace one of the words with the other with a replacement probability p_replacement, and repeat this step 100 times. We employ two types of perturbation: syntactic (where both words sampled in each step have the same most frequent part-of-speech tag) and frequent (where there is no restriction on the words sampled at each step). From Figure 4, we find that the DBE model is sensitive to frequency cues from the data and fails to model subtle semantic shifts (e.g., for words whose meaning has evolved without substantial change in their syntactic functionality).

Figure 2: Smoothness of word embedding trajectories vs. baseline models (Word2vec, HistWords, DBE, NoAttn., SelfAttn., ContextAttn.) over 2014-2018 for the pairs (a) equipedefrance and euro2016, (b) macron and president, (c) bataclan and assailant, (d) trump and president. High values correspond to similarity. Notice that for the Word2vec model, we do not plot results for time periods where at least one of the words of interest occurs below the minimum frequency threshold.

Table 3: Quantitative results based on log likelihood, semantic similarity (SS) and tweet classification. Higher numbers are better for all tasks. Statistically significant differences to the best baseline for each task, based on a bootstrap test, are marked with an asterisk. Note that we could not perform statistical significance studies for the log likelihood experiment due to the large size of the test set, nor for the semantic similarity experiment due to the nature of clustering evaluation.

| Model         | log lik. | SS    | Senti | Htag   | Topic | Conv.  |
|---------------|----------|-------|-------|--------|-------|--------|
| Word2vec      | Nil      | 0.034 | 71.54 | 37.32  | 34.98 | 70.04  |
| HistWords     | Nil      | 0.042 | 73.69 | 36.75  | 36.85 | 70.17  |
| DBE           | -7.708   | 0.065 | 73.00 | 41.83  | 40.01 | 70.98  |
| No Attn.      | -8.059   | 0.058 | 73.22 | 42.11* | 39.61 | 71.21* |
| Self Attn.    | -7.840   | 0.061 | 73.18 | 42.19* | 39.67 | 71.10  |
| Context Attn. | -7.425   | 0.068 | 73.19 | 41.88  | 39.65 | 71.15  |

Figure 3: t-SNE visualization of mid-frequency (between 2000-2500) words for our contextual attention model.

Figure 4: Synthetic evaluation: MRR vs. p_replacement for the (a) frequent and (b) syntactic perturbations (Word2vec, HistWords, DBE, NoAttn., SelfAttn., ContextAttn.).

Figure 5: Change in a word's usage correlated with distance for different numbers of clusters between the 2014 and 2018 years (No attn. / Self attn. / Context attn., each against the actual and prototypical centroid).
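The MRR used in the synthetic evaluation can be computed as below; the helper name and toy ranking are ours.

```python
def mean_reciprocal_rank(ranked_words, perturbed):
    """MRR of the artificially perturbed words in a model's ranking
    (higher means the model surfaces the changed words earlier)."""
    ranks = {w: r for r, w in enumerate(ranked_words, start=1)}
    return sum(1.0 / ranks[w] for w in perturbed) / len(perturbed)

# Perturbed words at ranks 1 and 3 out of four -> MRR = (1 + 1/3) / 2.
mrr = mean_reciprocal_rank(["a", "b", "c", "d"], ["a", "c"])
```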

Tweet Classification: We find that existing work skips evaluating diachronic word embeddings on a downstream NLP task. In this work, we propose to test whether diachronic word embeddings can be used as features to build a temporally-aware tweet classifier.9 We obtain a representation for a tweet by summing the embeddings of the words present in the tweet (belonging to the year in which the tweet was posted). We then train a logistic regression model and compute the F-score on the held-out instances. We establish four tweet classification tasks through distant supervision methods: Sentiment Analysis, Hashtag Prediction, Topic Categorization and Conversation Prediction (predicting whether a tweet will receive a reply or not). Details of the tasks and dataset collection can be found in Appendix A.3. From Table 3, we find that our proposed model provides competitive performance with the baseline models for sentiment analysis and topic categorization, while it outperforms them for the hashtag and conversation prediction tasks by a statistically significant margin (computed using a bootstrap test (Efron and Tibshirani, 1994)). Note that there is no single best model that works for every tweet classification task.
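The temporally-aware tweet representation is a sum of year-specific word embeddings. A sketch with hypothetical inputs; the resulting vectors would then be fed to an off-the-shelf logistic regression classifier such as scikit-learn's.

```python
import numpy as np

def tweet_features(tweet_tokens, year, rho, vocab_index):
    """Sum the embeddings of a tweet's words, taken from the time slice
    of the year the tweet was posted; out-of-vocabulary words are
    skipped.

    rho: dict year -> (V, K) embedding matrix
    vocab_index: dict word -> row index into the matrices
    """
    K = next(iter(rho.values())).shape[1]
    vec = np.zeros(K)
    for w in tweet_tokens:
        if w in vocab_index:
            vec += rho[year][vocab_index[w]]
    return vec

rho = {2014: np.array([[1.0, 0.0], [0.0, 2.0]])}
vec = tweet_features(["a", "b", "zz"], 2014, rho, {"a": 0, "b": 1})
```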

6 Analysis

In this section we perform an extended analysis of our proposed model to gain more insights about its functionality.

9 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

[Figure 6 appears here: a heatmap of importance scores (scale 1.0 to 2.2) for the contextualized features interest (all), interest (geo), dept., income (insee), income (iris), network, knowledge, region, and topic, across the naive, self, and context variants.]

Figure 6: Importance score for each contextualized feature.

6.1 Ablation Study

We perform ablation studies of the proposed model by considering different sets of contextualized features as inductive biases, as illustrated in Table 4. It is interesting to find that our model can work with a limited set of contextualized features in practice.

6.2 Law of Prototypicality

Dubossarsky et al. (2015) state that the likelihood of change in a word's meaning correlates with its position within its cluster. They define a prototypicality measure based on the word's distance from its cluster centroid (e.g., sword is a more prototypical exemplar than spear or dagger), and the prototypicality score decreases when the word undergoes a change in its meaning. For all our models, we correlate the distance between the word vectors corresponding to the years 2014 and 2018 with the distance the 2014 (2018) year vector moved from its cluster center. We then check whether there is a positive correlation (r > .3). From Figure 5, we observe that a positive correlation exists for all the variants of our model, whether compared to a prototypical or an actual cluster centroid. Interestingly, when the number of clusters is small (< 250), the word's meaning change correlates more with a prototypical exemplar than with an actual exemplar. This correlation direction is reversed when the number of clusters is greater than 250 and there are more semantic areas.
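The prototypicality check described above can be sketched as follows. The vectors and the centroid below are toy values, not the paper's embeddings; the real experiment uses the learned word vectors and their cluster centroids.

```python
import math

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy word vectors for two years, and their (shared) cluster centroid.
vec_2014 = {"sword": [1.0, 1.0], "spear": [2.0, 0.0], "dagger": [0.0, 2.5]}
vec_2018 = {"sword": [1.1, 1.0], "spear": [2.6, 0.1], "dagger": [1.0, 3.6]}
centroid = [1.0, 1.2]

words = sorted(vec_2014)
moved = [dist(vec_2014[w], vec_2018[w]) for w in words]   # how far each word moved
proto = [dist(vec_2014[w], centroid) for w in words]      # distance from centroid
r = pearson(moved, proto)
print(round(r, 2))
# A positive r (> .3 in the paper's test) means less prototypical words
# (farther from the centroid) tend to change more.
```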

6.3 Interpretation via Probing Tasks

Our tweet classification experiments (Section 5.3) demonstrated the usefulness of diachronic word embeddings as features in building a diachronic tweet classifier. Understanding the underlying properties of the tweet embeddings that enable them to outperform competing models is hard. This is why, following Conneau et al. (2018), we investigate that question by setting up a diagnostic classifier that probes for important linguistic features on the parsed output we mentioned earlier. Those probes are based on various prediction tasks (word content, sentence length, subject or object number detection, etc.) described in Conneau et al. (2018) and, succinctly, in our Appendix A.5. In 7 out of 9 tasks the use of contextual features seems to be detrimental, but the relative performance difference between our proposed models and the baseline is negligible for 5 of them. This suggests that the addition of contextualized features does not hurt the syntactic and semantic information captured by our models. Interestingly, all dynamic embedding models perform twice as well as a Word2vec baseline on the word content prediction task, but it is unclear whether those models capture language usage or actual topic prediction within a degraded language modeling task.

Features                                           log lik.  SS     Senti  Htag   Topic  Conv.
spatial                                            -7.610    0.059  73.10  42.19  39.61  71.14
income                                             -7.600    0.067  72.98  42.18  39.64  71.10
interest                                           -7.724    0.061  73.06  42.21  39.75  71.41
spatial & income                                   -7.510    0.059  73.21  42.08  39.67  71.22
spatial & interest                                 -7.396    0.059  73.11  42.14  39.77  71.23
income & interest                                  -7.410    0.059  73.27  42.30  39.75  71.11
spatial & income & network                         -7.447    0.062  73.35  42.19  39.66  71.06
spatial & interest & network                       -7.429    0.064  73.16  42.15  39.64  71.17
interest & income & network                        -7.522    0.061  73.11  42.16  39.82  71.15
interest & income & network & spatial              -7.489    0.060  73.10  41.95  39.62  71.22
interest & income & network & spatial & knowledge  -7.438    0.059  73.21  41.90  39.70  71.28
interest & income & network & spatial & topic      -7.426    0.064  73.16  41.94  39.65  71.22

Table 4: Ablation results for the contextual attention model, based on log likelihood, semantic similarity, and tweet classification.

Model           SentLen  WC      TreeDepth  TopConst  BShift  Tense   SubjNum  ObjNum  SOMO    CoordInv
(Task type)     (Surf.)  (Surf.) (Synt.)    (Synt.)   (Synt.) (Sem.)  (Sem.)   (Sem.)  (Sem.)  (Sem.)
Non-diachronic:
Word2vec        84.07    22.65   50.34      37.27     50.69   75.99   84.40    82.88   64.40   49.79
HistWords       83.40    34.08   47.51      40.43     49.92   77.00   84.99    83.31   64.29   50.46
Diachronic:
DBE             73.48    46.97   43.64      31.41     50.46   73.34   82.57    82.02   64.85   50.05
No Attn.        75.51    46.82   48.28      32.78     49.15   73.45   82.39    82.07   65.65   49.17
Self Attn.      74.82    47.37   47.77      32.49     50.19   73.16   82.38    82.18   64.51   50.18
Context Attn.   75.47    46.03   47.31      33.08     49.98   73.05   82.10    81.81   65.76   49.59

Table 5: Probing task accuracies. See Conneau et al. (2018) for details of the probing tasks and the classifier used.
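The diagnostic-classifier idea can be illustrated with a deliberately simple probe. The sketch below uses a nearest-centroid classifier on toy embeddings, not the paper's actual probing classifier: if even a simple classifier can predict a linguistic property from the tweet embedding alone, the embedding encodes that property.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def probe(train, test_vec):
    """train: {label: [embedding, ...]}; predict by the closest label centroid."""
    cents = {lab: centroid(vecs) for lab, vecs in train.items()}
    return min(cents, key=lambda lab: math.dist(cents[lab], test_vec))

# Toy probe for a SubjNum-like property: do the embeddings separate
# singular from plural subjects?
train = {"sing": [[1.0, 0.1], [0.9, 0.0]], "plur": [[0.0, 1.0], [0.1, 0.9]]}
print(probe(train, [0.8, 0.2]))  # sing
print(probe(train, [0.1, 1.1]))  # plur
```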

6.4 Interpretation via Erasure

Alternatively, we can directly compute the importance of a contextualized feature by observing the effect on the model of erasing (setting the weights to 0) that particular feature (Li et al., 2016). By subtracting the erased model's performance on the test set from the original model's performance, and after normalization, we can establish an importance score for each feature against each variant of our proposed model. Figure 6 emphasizes our finding that all contextualized features (except interest) are equally important to the performance of each variant of our proposed model.
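The erasure procedure can be sketched as follows. The evaluate function below is a hypothetical stand-in for the real test-set metric; the normalization step mirrors the description above.

```python
def erasure_importance(weights, evaluate):
    """Zero out each feature in turn, measure the performance drop,
    and normalize the drops into importance scores."""
    base = evaluate(weights)
    drops = {}
    for name in weights:
        erased = dict(weights)
        erased[name] = 0.0          # "erase" the feature by zeroing it
        drops[name] = base - evaluate(erased)
    total = sum(drops.values()) or 1.0
    return {name: d / total for name, d in drops.items()}

# Toy model: performance is just a weighted sum, so importance mirrors weights.
weights = {"spatial": 0.5, "income": 0.3, "network": 0.2}
scores = erasure_importance(weights, lambda w: sum(w.values()))
print({k: round(v, 2) for k, v in scores.items()})
# {'spatial': 0.5, 'income': 0.3, 'network': 0.2}
```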

7 Conclusion

In this work, we proposed a new family of diachronic word embedding models that utilize various contextualized features as inductive biases to provide better fits to a social media corpus. Our wide range of quantitative and qualitative studies highlights the competitive performance of our models in detecting semantic changes over a short time range. In the future, we will consider the temporal nature of some of our contextualized features when incorporating them into our models. For example, the static social network we built could instead evolve dynamically, which may model the underlying phenomenon more accurately.

Acknowledgments

We thank our anonymous reviewers for providing insightful comments and suggestions. This work was funded by the ANR projects ParSiTi (ANR-16-CE33-0021) and SoSweet (ANR15-CE38-0011-01), and by the Programme Hubert Curien Maimonide project, which is part of a French-Israeli cooperation program.


References

Jacob Levy Abitbol, Marton Karsai, Jean-Philippe Mague, Jean-Pierre Chevrot, and Eric Fleury. 2018. Socioeconomic dependencies of linguistic patterns in Twitter: a multivariate analysis. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018, pages 1125–1134.

Robert Bamler and Stephan Mandt. 2017. Dynamic word embeddings. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 380–389.

Andreas Blank and Peter Koch. 1999. Historical semantics and cognition. Walter de Gruyter.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single \$&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 2126–2136.

Haim Dubossarsky, Yulia Tsvetkov, Chris Dyer, and Eitan Grossman. 2015. A bottom up approach to category mapping and meaning change. In Proceedings of the NetWordS Final Conference on Word Knowledge and Word Usage: Representations and Processes in the Mental Lexicon, Pisa, Italy, March 30 - April 1, 2015, pages 66–70.

Bradley Efron and Robert J. Tibshirani. 1994. An introduction to the bootstrap. CRC Press.

Yanai Elazar and Yoav Goldberg. 2018. Adversarial removal of demographic attributes from text data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 11–21. Association for Computational Linguistics.

Justin Garten, Brendan Kennedy, Joe Hoover, Kenji Sagae, and Morteza Dehghani. 2019. Incorporating demographic embeddings into language understanding. Cognitive Science, 43(1).

Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision.

Spence Green, Marie-Catherine de Marneffe, John Bauer, and Christopher D. Manning. 2011. Multiword expression identification with tree substitution grammars: A parsing tour de force with French. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 725–735. Association for Computational Linguistics.

Joachim Grzega and Marion Schoener. 2007. English and general historical lexicology.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.

Zellig Harris. 1954. Distributional structure. Word, 10(23):146–162.

Dirk Hovy and Christoph Purschke. 2018. Capturing regional variation with distributed place representations and geographic retrofitting. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4383–4394, Brussels, Belgium. Association for Computational Linguistics.

Ganesh Jawahar, Benjamin Muller, Amal Fethi, Louis Martin, Éric Villemonte de la Clergerie, Benoît Sagot, and Djamé Seddah. 2018. ELMoLex: Connecting ELMo and lexicon features for dependency parsing. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 223–237, Brussels, Belgium. Association for Computational Linguistics.

Dan Jurafsky. 2018. Speech & language processing, 3rd edition. Currently in draft.

Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. 2014. Temporal analysis of language through neural language models. In Proceedings of the Workshop on Language Technologies and Computational Social Science @ ACL 2014, Baltimore, MD, USA, June 26, 2014, pages 61–65.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2015. Statistically significant detection of linguistic change. In Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Florence, Italy, May 18-22, 2015, pages 625–635.

Andrey Kutuzov, Lilja Øvrelid, Terrence Szymanski, and Erik Velldal. 2018. Diachronic word embeddings and semantic shifts: a survey. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 1384–1397.

Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. Understanding neural networks through representation erasure. CoRR, abs/1612.08220.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.

Walid Magdy, Hassan Sajjad, Tarek El-Ganainy, and Fabrizio Sebastiani. 2015. Distant supervision for tweet classification using YouTube labels. In Proceedings of the Ninth International Conference on Web and Social Media, ICWSM 2015, University of Oxford, Oxford, UK, May 26-29, 2015, pages 638–641.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, United States, pages 3111–3119.

Herbert Robbins and Sutton Monro. 1985. A stochastic approximation method. In Herbert Robbins Selected Papers, pages 102–109. Springer.

Maja R. Rudolph and David M. Blei. 2018. Dynamic embeddings for language evolution. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018, pages 1003–1011.

Maja R. Rudolph, Francisco J. R. Ruiz, Stephan Mandt, and David M. Blei. 2016. Exponential family embeddings. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 478–486.

Nina Tahmasebi, Lars Borin, and Adam Jatowt. 2018. Survey of computational approaches to diachronic conceptual change. CoRR, abs/1811.06278.

Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Florence, Italy, May 18-22, 2015, pages 1067–1077.

Jason Weston, Sumit Chopra, and Keith Adams. 2014. #TagSpace: Semantic embeddings from hashtags. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, pages 1822–1827.

Zijun Yao, Yifan Sun, Weicong Ding, Nikhil Rao, and Hui Xiong. 2018. Dynamic word embeddings for evolving semantic discovery. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM 2018, Marina Del Rey, CA, USA, February 5-9, 2018, pages 673–681.

A Appendices

A.1 Pseudo code for the training algorithm

Pseudo code for training our proposed model is presented in Algorithm 1.

Algorithm 1: Training algorithm for the proposed diachronic word embedding model

Input: tweets X^(t) of size m_t from T time slices, contextual features f_m, context size c, embedding size K, number of negative samples n, number of minibatch fractions m, initial learning rate η, precision λ, vocabulary size V, smoothed unigram distribution.

for v from 1 to V do
    Initialize α_v and ρ_v^(t) with N(0, 0.01)
end for
for m from 1 to |F| do
    if f_m is learnable then
        Initialize f_m with U(0, 1)
    end if
end for
for each pass over the data do
    for each of the m minibatch fractions do
        for t from 1 to T do
            for i from 1 to m_t / m do
                Sample c + 1 consecutive words from a random tweet X^(t) and
                construct the context vector C_i^(t) = Σ_{j ∈ c_i} Σ_{v'} α_{v'} x_{jv'}
                Compute the contextualized features F_i^(t) = Σ_{m=1}^{|F|} a_m w_m f_{i,m}
                Draw a set S_i^(t) of n negative samples from the smoothed unigram distribution
            end for
        end for
        Update the parameters θ = {α, ρ, f_m, w, a, b} by ascending the stochastic gradient
        ∇_θ { Σ_{t=1}^{T} m Σ_{i=1}^{m_t/m} [ Σ_{v=1}^{V} x_{iv}^(t) log σ((ρ_v^(t) + F_i^(t))^T C_i^(t))
              + Σ_{x_j ∈ S_i^(t)} Σ_{v=1}^{V} log(1 − σ((ρ_v^(t) + F_i^(t))^T C_i^(t))) ]
              − (λ_0 / 2) Σ_v ‖α_v‖² − (λ_0 / 2) Σ_v ‖ρ_v^(0)‖²
              − (λ / 2) Σ_{v,t} ‖ρ_v^(t) − ρ_v^(t−1)‖² }
    end for
end for

We utilize Adam (Kingma and Ba, 2014) to set the learning rate η.
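The per-datapoint term inside the objective can be checked numerically with a small sketch. The vectors below are toy values: rho_target, rho_negs, F, and C stand in for ρ_v^(t), the negative samples' embeddings, F_i^(t), and C_i^(t) respectively.

```python
import math

def sigma(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def point_log_lik(rho_target, rho_negatives, F, C):
    """log sigma((rho + F) . C) for the observed word, plus
    sum of log(1 - sigma((rho' + F) . C)) over the negative samples."""
    pos = math.log(sigma(dot([r + f for r, f in zip(rho_target, F)], C)))
    neg = sum(
        math.log(1.0 - sigma(dot([r + f for r, f in zip(rho_n, F)], C)))
        for rho_n in rho_negatives
    )
    return pos + neg

C = [0.5, -0.2]                         # context vector C_i^(t)
F = [0.1, 0.0]                          # contextualized feature vector F_i^(t)
rho_target = [1.0, 0.3]                 # dynamic embedding of the observed word
rho_negs = [[-0.8, 0.2], [0.0, -1.0]]   # embeddings of the negative samples
print(round(point_log_lik(rho_target, rho_negs, F, C), 3))
```

Ascending the gradient of this quantity (plus the regularization terms) with respect to θ is what one minibatch update performs.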

A.2 Hyperparameter settings

We follow the hyperparameter search space provided by Rudolph and Blei (2018) to find the best configuration of our model. Before training our model, we initialize the parameters with a one-epoch fit of the non-diachronic Bernoulli embedding model (as defined in Equation 2 in the paper). We then train our model for 9 more epochs. We fix the embedding dimension to 100, the context size to 2, and the number of negative samples to 20. We select the initial learning rate η ∈ [0.01, 0.1, 1, 10], the minibatch size m ∈ [0.001N, 0.0001N, 0.00001N] (where N is the number of training records), and the precision on context vectors and initial dynamic embeddings λ ∈ [1, 10] (λ0 = λ/1000). We use the conditional likelihood metric (as discussed in Section 5.3) to sweep over the search space and select the best hyperparameters.
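The sweep itself reduces to a small grid search. In the sketch below, held_out_log_lik is a hypothetical stand-in for training a model and scoring it with the conditional likelihood metric; only the grid values come from the search space above.

```python
import itertools

etas = [0.01, 0.1, 1, 10]                    # initial learning rates
minibatch_fracs = [0.001, 0.0001, 0.00001]   # as fractions of the N records
lambdas = [1, 10]                            # precision; lambda_0 = lambda / 1000

def held_out_log_lik(eta, frac, lam):
    # Hypothetical stand-in for training + evaluation; the real score is the
    # held-out conditional likelihood of Section 5.3.
    return -(abs(eta - 0.1) + frac * 100 + lam * 0.01)

# Keep the configuration with the highest held-out likelihood.
best = max(itertools.product(etas, minibatch_fracs, lambdas),
           key=lambda cfg: held_out_log_lik(*cfg))
print(best)  # (0.1, 1e-05, 1)
```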


A.3 Tweet Classification Details

We list the details of the tweet classification tasks, where the data comes from our corpus.

• Sentiment Analysis - This is a binary task to classify the sentiment of the tweet. Following Go et al. (2009), we create a balanced dataset by tagging a tweet as positive (negative) if it contains only positive (negative) emoticons. We remove the emoticons from the tweets to avoid bias.

• Hashtag Prediction - This multiclass classification task is to identify the hashtag present in the tweet. Following Weston et al. (2014), we identify the 100 most frequent hashtags in the corpus, keep the tweets that contain exactly one occurrence of a frequent hashtag, remove the hashtag from the tweet, and predict it.

• Topic Categorization - This multiclass classification task is to identify the topical category to which a tweet belongs. Following Magdy et al. (2015), we filter the tweets that have a YouTube video associated with them, query the video category using the public YouTube API, and use it as the topical category of the tweet.

• Conversation Prediction - This binary task is to classify whether a tweet will receive a reply or not. Following Elazar and Goldberg (2018), we tag a tweet as conversational if it contains at least one mention ('@'); otherwise it is non-conversational. We remove the mentions from the tweets to avoid bias.
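The distant-supervision rules above can be sketched as simple labeling functions. The emoticon sets and the mention regex below are toy stand-ins; the real pipeline also balances the sentiment data and strips the label-bearing tokens from the tweets, as described above.

```python
import re

POS, NEG = {":)", ":-)", ":D"}, {":(", ":-("}

def sentiment_label(tweet):
    """Positive/negative if the tweet contains only one kind of emoticon."""
    toks = set(tweet.split())
    if toks & POS and not toks & NEG:
        return "positive"
    if toks & NEG and not toks & POS:
        return "negative"
    return None  # discarded: no emoticon, or mixed emoticons

def conversation_label(tweet):
    """A tweet with at least one @-mention counts as conversational."""
    return "conversational" if re.search(r"@\w+", tweet) else "non-conversational"

print(sentiment_label("great match today :)"))        # positive
print(sentiment_label("terrible service :("))         # negative
print(conversation_label("@alice are you coming?"))   # conversational
```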

A.4 Ablation Results

We perform ablation studies of the no attention and self attention variants of the proposed model by considering different sets of contextualized features as inductive biases, as illustrated in Table 6.

A.5 Probing Task Description

In this section we briefly describe the set of probing tasks (proposed in Conneau et al. (2018)) used in our study.

• SentLen - The goal of this classification task is to predict the tweet length, binned into 6 categories with lengths in the following intervals: (5-8), (9-12), (13-16), (17-20), (21-25), (26-28).

• WC - This classification task is about predicting which of the target words appears in the given tweet.

• TreeDepth - In this classification task the goal is to predict the maximum depth of the tweet's syntactic tree (with values ranging from 5 to 12).

• TopConst - The goal of this classification task is to predict the sequence of top constituents immediately below the sentence (S) node. The classes are given by the 19 most common top-constituent sequences in the corpus, plus a 20th category for all other structures.

• BShift - In this binary classification task the goal is to predict whether two consecutive tokens within the tweet have been inverted or not.

• Tense - The goal of this task is to identify the tense of the main verb of the tweet.

• SubjNum - The goal of this task is to identify the number of the subject of the main clause.

• ObjNum - The goal of this task is to identify the number of the direct object of the main clause.

• SOMO - This task classifies whether a tweet occurs as-is in the source corpus, or whether a randomly picked noun or verb was replaced with another form with the same part of speech.

• CoordInv - This task distinguishes between an original tweet and a tweet where the order of two coordinated clausal conjoints has been purposely inverted.
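For instance, a BShift training instance can be constructed by swapping one random pair of consecutive tokens. This is our sketch of the construction, not the paper's code; the actual datasets follow Conneau et al. (2018).

```python
import random

def make_bshift(tweet, rng):
    """Return (sentence, label): label 1 if two consecutive tokens were
    swapped, 0 if the original order is kept."""
    toks = tweet.split()
    if rng.random() < 0.5 or len(toks) < 2:
        return " ".join(toks), 0          # original order
    i = rng.randrange(len(toks) - 1)
    toks[i], toks[i + 1] = toks[i + 1], toks[i]
    return " ".join(toks), 1              # two consecutive tokens inverted

rng = random.Random(0)
for _ in range(3):
    print(make_bshift("the cat sat on the mat", rng))
```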

A.6 Selection of time span unit

We performed preliminary experiments with the DBE model to identify the time span unit that best fits the data. As shown in Table 7, the DBE model fits the data best in terms of the log likelihood metric when the time span unit is a year.

Time span unit   Yearly    Monthly   Quarterly   Half-yearly
Log lik.         -5.7323   -7.1055   -6.4004     -6.0768

Table 7: Log likelihood scores of the DBE model with varying time span units.


Features                                           log lik.   SS      Senti  Htag   Topic  Conv.

No Attention:
spatial                                            -7.8481    0.0583  73.11  42.10  39.76  71.16
income                                             -7.8407    0.0616  73.17  41.99  39.80  71.22
interest                                           -7.9704    0.0596  73.24  42.11  39.72  71.17
spatial & income                                   -7.8407    0.0718  73.18  42.07  39.67  71.17
spatial & interest                                 -7.9774    0.0581  73.27  42.05  39.69  71.14
income & interest                                  -7.9601    0.0620  73.30  42.08  39.68  71.14
spatial & income & network                         -7.7735    0.0614  73.20  42.13  39.78  71.17
spatial & interest & network                       -8.0061    0.0613  73.27  42.17  39.61  71.14
interest & income & network                        -8.0170    0.0605  73.22  42.10  39.71  71.17
interest & income & network & spatial              -8.0561    0.0587  73.29  42.18  39.67  71.15
interest & income & network & spatial & knowledge  -8.0734    0.0620  73.30  42.19  39.70  71.24
interest & income & network & spatial & topic      -8.0739    0.0639  73.28  42.13  39.62  71.15

Self Attention:
spatial                                            -7.8260    0.0624  73.11  41.95  39.75  71.09
income                                             -7.8248    0.0577  73.13  41.98  39.78  71.17
interest                                           -7.7986    0.0602  73.21  42.14  39.83  71.13
spatial & income                                   -7.8383    0.0641  73.08  42.02  39.83  71.14
spatial & interest                                 -7.7874    0.0625  73.20  42.11  39.71  71.12
income & interest                                  -7.7796    0.0635  73.20  42.18  39.75  71.11
spatial & income & network                         -7.8613    0.0609  73.09  42.07  39.77  71.19
spatial & interest & network                       -7.7558    0.0611  73.13  42.10  39.68  71.07
interest & income & network                        -7.8432    0.0607  73.11  42.13  39.48  71.09
interest & income & network & spatial              -7.8414    0.0609  73.15  42.16  39.58  71.04
interest & income & network & spatial & knowledge  -7.8554    0.0618  73.20  42.15  39.58  71.06
interest & income & network & spatial & topic      -7.8208    0.0575  73.22  42.13  39.58  71.07

Table 6: Ablation results based on log likelihood, semantic similarity, and tweet classification.