Social media fingerprints of unemployment

Alejandro Llorente 1,2, Manuel García-Herranz 3, Manuel Cebrian 4,5, Esteban Moro 1,2 *

Abstract
Recent widespread adoption of electronic and pervasive technologies has enabled the study of human behavior at an unprecedented level, uncovering universal patterns underlying human activity, mobility, and interpersonal communication. In the present work, we investigate whether deviations from these universal patterns may reveal information about the socioeconomic status of geographical regions. We quantify the extent to which deviations in diurnal rhythm, mobility patterns, and communication styles across regions relate to their unemployment incidence. For this we examine a country-scale publicly articulated social media dataset, where we quantify individual behavioral features from over 145 million geo-located messages distributed among more than 340 different Spanish economic regions, inferred by computing communities of cohesive mobility fluxes. We find that regions exhibiting more diverse mobility fluxes, earlier diurnal rhythms, and more correct grammatical styles display lower unemployment rates. As a result, we provide a simple model able to produce an accurate, easily interpretable reconstruction of regional unemployment incidence from social media digital fingerprints alone. Our results show that cost-effective economic indicators can be built from publicly available social media datasets.

Keywords
Human mobility, Social networks, Communication patterns, Unemployment

1 Instituto de Ingeniería del Conocimiento, Universidad Autónoma de Madrid, Madrid 28049, Spain
2 Departamento de Matemáticas & GISC, Universidad Carlos III de Madrid, Leganés 28911, Spain
3 UNICEF Innovation Unit, New York, NY 10017, USA
4 Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA 92093, USA
5 National Information and Communications Technology Australia, Melbourne, Victoria 3003, Australia
*Corresponding author: emoro@math.uc3m.es

arXiv:1411.3140v2 [physics.soc-ph] 19 Nov 2014

Human behavior is closely intertwined with socioeconomic status, as many of our daily routines are driven by activities that maintain, improve, or are afforded by such status [1–3]. From our movements around the city, to our daily schedules, to our communication with others, humans perform different actions throughout the day that reflect and impact their economic situation. The distribution of different individual behaviors across neighborhoods, municipalities, or cities impacts the economic development of those geographical areas, and in turn that of the whole country [4–9]. Detecting patterns and quantifying relevant metrics to unveil the complex relationship between geography and collective behavior is thus of paramount importance for understanding the economic heartbeat of cities and the structure of inter-city networks, and thus for economic planning, educational policy, urban planning, transportation design, and other large-scale societal problems [10–14].

Much knowledge about how mobility, social communication and education affect the economic development of cities has been obtained through complex and costly surveys, with an update rate ranging from fortnights (unemployment) to decades (census) [15–17]. At the same time, the recent availability of vast and rich datasets of individual digital fingerprints has increased the scale and granularity at which we can measure these behavioral features, reduced the cost and update interval of these measurements, and provided new opportunities to combine them with more traditional socioeconomic surveys [14, 18–22].

In this work we provide a proof of concept for the use of individual digital fingerprints from social media to infer city-level behavioral measures, and then uncover their relationship with socioeconomic output. We present a comprehensive study of the different behavioral traces that can be extracted from social media: (i) technology adoption from (social media) user demographics, (ii) mobility patterns from geo-located messages, (iii) communication patterns from exchanged messages, and (iv) content analysis from the published messages. To this end, we use a country-scale publicly articulated social media dataset in Spain, where we infer behavioral patterns from almost 146 million geo-located messages. We match this dataset with granular unemployment data at the level of municipality, measured at the peak of the Spanish financial crisis (2012–2013). We consider unemployment to be the most important signal of the socioeconomic status of a region, since the crisis has had a very large impact on unemployment in the country (around 9.2% in 2005, more than 26% in 2013).

Our extensive investigation of this large variety of traces in a large social media dataset allows us not only to build an accurate model of unemployment impact across geographical areas, but also to compare globally metrics previously reported in diverse works and datasets, as well as to assess their relevance and uniqueness for understanding economic development [14, 19, 20, 22–27]. As we will show, technology adoption, mobility, diurnal activity, and communication style metrics carry different weights in explaining unemployment in different geographical areas. Our goal is not to establish causality between unemployment and the extracted metrics, but to uncover the relationship that emerges when we observe the economic metrics of cities and their social behavior at the same time.




Figure 1. A) Map of the mobility fluxes $T_{ij}$ between municipalities based on Twitter-inferred trips (white). Infomap communities detected on the network $T_{ij}$ are colored under the mobility fluxes (blue colors). B) Mobility fluxes $T_{ij}$ between municipalities i and j are constructed by aggregating the number of trips between them. C) Correspondence between the observed fluxes $T_{ij}$ and the fitted gravity model fluxes. The dashed line is $T_{ij} = T^{\mathrm{grav}}_{ij}$, while the (blue) solid line is a conditional average of $T^{\mathrm{grav}}_{ij}$ for fixed values of $T_{ij}$.

1 Social media dataset and functional partition of cities

Twitter is a microblogging online application where users can express their opinions, share content, and receive information from other users in text messages of up to 140 characters, commonly known as tweets. Users can interact with other users by mentioning them or by retweeting their content (sharing someone's tweet with one's own followers). Some of these tweets contain information about the geographical location of the user when the tweet was published; we refer to them as geo-located tweets.

To perform our analysis we consider 19.6 million geo-located Twitter messages (tweets) from continental Spain, collected through the public API provided by Twitter and ranging from 29th November 2012 to 30th June 2013. Tweets were posted by 0.57 million unique (properly anonymized) users and geo-positioned in 7683 different municipalities. We observed a large correlation (Pearson's coefficient ρ = 0.951 [0.949, 0.953]) between the number of geo-positioned tweets per municipality and the municipality's population. On average, we find around 50 tweets per month and per 1000 persons in each municipality.

Despite this high level of social media activity within municipalities, we find their official administrative areas unsuitable for studying socioeconomic activity: administrative boundaries between municipalities reflect political and historical decisions, while economic trade and activity often happen across those boundaries. The result is that municipalities in Spain are artificially diverse, ranging from a municipality with only 7 inhabitants to another with a population of 3.2 million. Although there exist natural aggregations of municipalities into provinces (regions) or statistical/metropolitan areas (NUTS areas), we have used our own procedure to detect economic areas. In particular, we have used daily trips by users between pairs of municipalities as a measure of the economic relatedness between those municipalities. We say that there is a daily trip between municipalities i and j if a user has tweeted in places i and j consecutively within the same day. In our dataset we find 1.9 million trips by 0.22 million users. With those trips we construct the daily mobility flux network $T_{ij}$ between municipalities as the number of trips between places i and j (see figure 1B). Remarkably, the statistical properties of the trips and of the mobility matrix $T_{ij}$ coincide with those of other mobility datasets (see SI section 2): for example, trip distance r and elapsed time δt are power-law distributed, $P(r) \sim r^{-1.67}$ and $P(\delta t) \sim \delta t^{-0.62}$, with exponents very similar to those found in the literature [9, 23]. Moreover, the mobility fluxes $T_{ij}$ are well described by the Gravity Law ($R^2 = 0.80$) [28]:

$$T_{ij} \simeq T^{\mathrm{grav}}_{ij} = \frac{P_i^{\alpha_i} P_j^{\alpha_j}}{d_{ij}^{\beta}} \qquad (1)$$

where $P_i$ and $P_j$ are the populations of municipalities i and j, and $d_{ij}$ is the distance between them. The exponents in (1) are also very similar to those reported in other works: $\alpha_i \simeq \alpha_j = 0.48$ and $\beta \simeq 1.05$ [23, 29]. These results suggest that the mobility detected from geo-located tweets is a good proxy for human mobility within and between municipalities [30].
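The trip construction and the gravity-law comparison described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout `(user_id, timestamp, municipality)`, the function names, and the choice to count trips as directed pairs are all assumptions made for the example, and equation (1) is evaluated as written, without any prefactor.

```python
from collections import Counter, defaultdict
from datetime import datetime


def daily_trip_fluxes(tweets):
    """Count trips between municipalities from geo-located tweets.

    `tweets` is an iterable of (user_id, timestamp, municipality) tuples
    (a hypothetical layout). A trip i -> j is counted when two consecutive
    tweets of the same user fall on the same calendar day but in different
    municipalities. Counting directed pairs is an assumption of this sketch.
    """
    by_user = defaultdict(list)
    for user, timestamp, place in tweets:
        by_user[user].append((timestamp, place))

    fluxes = Counter()
    for events in by_user.values():
        events.sort()  # chronological order per user
        for (t0, p0), (t1, p1) in zip(events, events[1:]):
            if p0 != p1 and t0.date() == t1.date():
                fluxes[(p0, p1)] += 1
    return fluxes


def gravity_flux(pop_i, pop_j, dist_ij, alpha_i=0.48, alpha_j=0.48, beta=1.05):
    """Evaluate equation (1) with the exponents quoted in the text."""
    return (pop_i ** alpha_i) * (pop_j ** alpha_j) / (dist_ij ** beta)


# Toy data: two users moving between municipalities "A" and "B".
tweets = [
    ("u1", datetime(2013, 1, 7, 8, 30), "A"),
    ("u1", datetime(2013, 1, 7, 9, 15), "B"),   # same day: trip A -> B
    ("u1", datetime(2013, 1, 8, 10, 0), "A"),   # next day: not a daily trip
    ("u2", datetime(2013, 1, 7, 12, 0), "B"),
    ("u2", datetime(2013, 1, 7, 18, 0), "A"),   # same day: trip B -> A
]
fluxes = daily_trip_fluxes(tweets)
```

Since the model is linear in logarithms, $\log T^{\mathrm{grav}}_{ij} = \alpha_i \log P_i + \alpha_j \log P_j - \beta \log d_{ij}$, the exponents could in principle be estimated by ordinary least squares on the logged fluxes; the text does not say which fitting procedure was actually used.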

We use the network of daily fluxes between municipalities $T_{ij}$ to detect the geographical communities of economic activity. To this end, we partition the mobility network $T_{ij}$ using standard graph community-finding algorithms. This technique has been applied extensively, especially with mobile phone data, to unveil the effective maps of countries based on the mobility and/or social interactions of people [31–33]. In our case we have used the Infomap algorithm [34] and found 340 different communities within Spain. For further details about the comparison among different state-of-the-art community detection algorithms executed on the inter-city graph, see SI section 3. The average number of municipalities per community is 21, and the largest community contains 142 municipalities. The detected communities have very interesting features (see SI section 3): (i) they are geographically cohesive (see figure 1), (ii) they are statistically robust against random removal of trips from our database (SI table S2), and (iii) the modularity of the partition is very high ($\simeq 0.76$, see SI table S3). Finally, (iv) the partition found has some overlap (77% Normalized Mutual Information, NMI, see [35]) with coarser administrative boundaries like provinces (regions) (see SI section 3 for details). Interestingly, it shows a larger overlap (83% NMI) with comarcas (counties), areas in Spain that reflect geographical and economic relations between municipalities. This result shows that the mobility detected from geo-located tweets and the communities obtained from it are a good description of economic areas.
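To make the overlap measure concrete, the sketch below computes a Normalized Mutual Information between two partitions of the same set of municipalities. It assumes the common normalization $2I(A;B)/(H(A)+H(B))$; the text only cites [35] and does not spell out the exact variant, so this choice, like the function name, is an assumption.

```python
from collections import Counter
from math import log


def normalized_mutual_information(labels_a, labels_b):
    """NMI between two partitions given as parallel lists of labels.

    Element k of each list is the label of municipality k in that
    partition. Uses 2 * I(A;B) / (H(A) + H(B)), an assumed normalization.
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("partitions must be non-empty and of equal length")

    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum((c / n) * log(c / n) for c in counts.values())

    mutual = sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    h_a, h_b = entropy(count_a), entropy(count_b)
    if h_a + h_b == 0:
        return 1.0  # both partitions put everything in a single block
    return 2 * mutual / (h_a + h_b)


# Identical groupings (labels may differ) give 1; independent ones give 0.
same = normalized_mutual_information([0, 0, 1, 1], ["x", "x", "y", "y"])
independent = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

With this convention, the 77% and 83% figures quoted above would correspond to values of 0.77 and 0.83.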

In the rest of the paper we restrict our analysis to the geographical areas defined by the communities detected with Infomap (see figure 1). For statistical reasons, we discard communities that are formed by fewer than 5 municipalities. Despite this sampling, 96% of the total country population is considered in our analysis. Our results in the rest of the paper also hold for municipalities, counties or provinces, though with lower statistical power (see SI section 9).

2 Social media behavioral fingerprints

The goal of this work is to quantify how and which behavioral features can be extracted from social media and then related back to the economic level of cities. To this end, we define four groups of measures that have been widely explored in other fields like economics or the social sciences. These four types of measures rely on identifying the place where users live. Instead of using the information in the user profile, we analyze the places where the user has tweeted and set as the hometown of the user the municipality where he/she has tweeted with the highest frequency, a method commonly employed with mobile phone and social media data [11, 23]. Specifically, we select those users with more than 5 geo-located tweets in our period who have posted at least 40% of their tweets in a given municipality, which we consider their hometown. After this filtering we end up with 0.32 million users, and we can then define the Twitter population $\pi_i$ in area i as the number of users with their hometown within area i. We obtain a very high correlation between $\pi_i$ and the population $P_i$ of the cities in the national census, ρ = 0.977 [0.976, 0.978], which provides an indirect validation of our approach with the present data. However, not all demographic groups are equally represented in our Twitter database. As shown in SI section 4, Twitter user demographics in Spain obtained from surveys [36] show that age groups above 44 years old are under-represented. Thus our results mainly describe the socioeconomic status of people below 44 years old. The employment analysis is therefore performed for different age groups: unemployment for people below 25 years old, between 25 and 44 years old, and older than 44 years old. Finally, we have chosen the unemployment officially reported at the end of our observation time window (June 2013), but our results are not affected by the month selected; see SI section 7.
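The hometown rule described above (more than 5 geo-located tweets, at least 40% of them in the most frequent municipality) can be sketched as below. The input layout `(user_id, municipality)` and the function names are assumptions for illustration; the sketch counts users per municipality, and aggregating municipalities into the detected areas would be an additional lookup step not shown here.

```python
from collections import Counter, defaultdict


def assign_hometowns(tweets, min_tweets=6, min_share=0.40):
    """Map each qualifying user to the municipality they tweet from most.

    `tweets` is an iterable of (user_id, municipality) pairs (hypothetical
    layout). `min_tweets=6` encodes "more than 5 geo-located tweets" and
    `min_share=0.40` encodes "at least 40% of their tweets".
    """
    places = defaultdict(Counter)
    for user, municipality in tweets:
        places[user][municipality] += 1

    hometowns = {}
    for user, counts in places.items():
        total = sum(counts.values())
        top_place, top_count = counts.most_common(1)[0]
        if total >= min_tweets and top_count / total >= min_share:
            hometowns[user] = top_place
    return hometowns


def twitter_population(hometowns):
    """Number of users whose hometown is each municipality."""
    return Counter(hometowns.values())


tweets = (
    [("u1", "A")] * 3 + [("u1", "B")] * 2 + [("u1", "C")]      # 6 tweets, 50% in A
    + [("u2", "A")] * 5                                         # only 5 tweets
    + [("u3", "A")] * 3 + [("u3", "B")] * 3
    + [("u3", "C")] * 2 + [("u3", "D")] * 2                     # top share is 30%
)
hometowns = assign_hometowns(tweets)
population = twitter_population(hometowns)
```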

Figure 2. Examples of different behavior in the observed variables and the unemployment. In A we observe that two cities with different unemployment levels have different temporal activity patterns. Panel C shows how communities (red) with distinct entropy levels of social communication with other communities (blue) may hold different unemployment intensity: the left map shows a highly focused communication pattern (low entropy), while the right map corresponds to a community with a diverse communication pattern (high entropy). Finally, panel B shows some examples of misspellings detected in our database using 618 incorrect expressions (see SI section 6), such as "Con migo", "Aver" or "llendo".

For every considered region we investigate the officially reported unemployment for different age groups and a number of metrics related to social media activity. Some of those metrics are already reported in the literature, while others are introduced in this work. Specifically, we consider:

• Social media technology adoption: we can use the Twitter penetration rate $\tau_i = \pi_i / P_i$ in each area i as a proxy for technology adoption. Recent works have shown that there is indeed a correlation between country GDP and Twitter penetration: specifically, a positive correlation was found between $\tau_i$ and GDP at the country level [23]. However, in our data we find the opposite correlation (see figure 3), namely that the larger the penetration rate, the higher the unemployment. This suggests that the impact of technology adoption at country scale differs from what happens within an (industrialized) country, where the technology needed to access social media is commoditized.

• Social media activity: regions with very different economic situations should exhibit different patterns of activity during the day. Since work, leisure, family, shopping and other activities happen at different times of the day, we might observe different daily patterns in regions with different socioeconomic status. For example, we hypothesize that communities with low levels of unemployment will tend to have higher activity levels at the beginning of a typical weekday. This is indeed what we find: figure 2A shows the hourly fraction of tweets during workdays for two communities with very different rates of unemployment. As we can observe, the two profiles are quite different, and in the case of low unemployment we find a strong peak of activity between 8 and 11am (morning) and lower activity during the afternoons and nights. We encode this finding in $\nu^{\mathrm{mrng}}_i$, $\nu^{\mathrm{aftn}}_i$ and $\nu^{\mathrm{ngt}}_i$, the total fraction of tweets happening in geographical area i between 8am and 10am, 3pm and 5pm, and 12am and 3am, respectively. Figure 3 shows a strong negative correlation between $\nu^{\mathrm{mrng}}_i$ and the unemployment for the communities in our database, and a positive correlation for $\nu^{\mathrm{aftn}}_i$ and $\nu^{\mathrm{ngt}}_i$.

• Social media content: some works have observed a correlation between the economic situation of countries and the frequency of words related to work conditions [22] or of forward-looking searches [21]. In our case, we also find a moderate positive correlation between the fraction of tweets $\mu_i$ mentioning job or unemployment terms and the observed unemployment, while the correlation is negative for the number of mentions of employment or the economy. However, we have also tried a different approach by measuring the relation between the way of writing and the educational level [37]. To this end, we build a list of 618 misspelled Spanish expressions and extract the tweets of the dataset containing at least one of these expressions (see SI section 6 for further details about how these expressions were collected). We only consider tweets in Spanish, detected with an N-gram based algorithm, and we only consider misspellings that cannot be justified as abbreviations. Finally, we compute for every region the proportion $\varepsilon_i$ of misspellers among the Twitter population. If the fraction of misspellers per geographical area is a proxy for the educational level of that region, we expect a positive correlation between $\varepsilon_i$ and unemployment. Indeed, we find (see figure 3) a strong correlation between the fraction of misspellers and unemployment.

• Social media interactions and geographical flow diversity: following the ideas in [14], which correlated the economic development of an area with the diversity of its communications with other areas, we consider all tweets mentioning another user and take them as a proxy for communication between users. We then compute the number of communications $w_{ij}$ between areas i and j as the number of mentions between users in those areas. To measure the diversity we use, as in [14], the informational normalized entropy (Entropy 1) $S^u_i = -\sum_j p_{ij} \log p_{ij} / S^r_i$, where $p_{ij} = w_{ij} / \sum_j w_{ij}$, and (Entropy 2) $S^r_i = \log k_i$, with $k_i$ the number of different areas with which users in area i have interacted. As in [14], we find that areas with large unemployment have less diverse communication patterns than areas with low unemployment. This translates into a strong negative correlation between $S_i$ and the unemployment; see figure 3. Similar ideas are applied to the flows of people between areas, to investigate the diversity of the geographical flows through the entropy $S_i = -\sum_j p_{ij} \log p_{ij} / S^r_i$, where $p_{ij} = T_{ij} / \sum_j T_{ij}$ and $S^r_i = \log(k_i)$, with $k_i$ the number of different areas that have been visited by users who live in area i. Figure 3 shows that, as in [19], the correlation of these geographical entropies with economic development is low.
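The normalized entropy used in the last item of the list above can be sketched directly from its definition. The dictionary input and the convention of returning 0 when an area interacts with fewer than two other areas (where $\log k_i = 0$) are assumptions of this example, not something the text specifies.

```python
from math import log


def normalized_entropy(weights):
    """Normalized diversity of the links of one area i.

    `weights` maps each other area j to w_ij (mentions) or T_ij (trips).
    Returns -sum_j p_ij * log(p_ij) / log(k_i), where k_i is the number of
    areas with a positive weight. For k_i < 2 the normalizer is zero, and
    this sketch returns 0.0 by convention (an assumption).
    """
    positive = [w for w in weights.values() if w > 0]
    k = len(positive)
    if k < 2:
        return 0.0
    total = sum(positive)
    shannon = -sum((w / total) * log(w / total) for w in positive)
    return shannon / log(k)


# Evenly spread communication is maximally diverse; focused is not.
diverse = normalized_entropy({"a": 25, "b": 25, "c": 25, "d": 25})
focused = normalized_entropy({"a": 97, "b": 1, "c": 1, "d": 1})
```

The same function applies to the social case (pass the mention counts $w_{ij}$) and to the geographical case (pass the trip counts $T_{ij}$).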

Normalization of the variables is discussed in SI section 5. We have also studied the correlation between the variables considered. As expected, variables in each group show moderate correlations between them. However, inspection of the correlation matrix and a Principal Component Analysis of the variables considered show that there is information (as a percentage of the variance in the data) in each of the groups of variables; see SI section 5. Because of these two facts, we restrict our analysis to the variables within each group that have the highest correlation with the unemployment, namely the penetration rate $\tau_i$, the social and mobility diversity variables (Entropy 1, $S^u_i$, computed on the social and geographical flows respectively), the morning activity $\nu^{\mathrm{mrng}}_i$, the fraction of misspellers $\varepsilon_i$, and the fraction of employment-related tweets $\mu^{\mathrm{emp}}_i$.
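Two of the selected variables, the activity fractions and the fraction of misspellers, can be computed along the following lines. This is only a sketch: restricting to Monday through Friday, treating each hour window as half-open, and matching expressions on word boundaries are assumptions, and the three expressions used are just the examples quoted in the Figure 2 caption rather than the full list of 618.

```python
import re
from datetime import datetime

# Hour windows from the text, treated as [start, end) (an assumption).
WINDOWS = {"morning": (8, 10), "afternoon": (15, 17), "night": (0, 3)}


def activity_fractions(timestamps):
    """Fraction of weekday tweets falling in each time window."""
    weekday = [t for t in timestamps if t.weekday() < 5]  # Mon-Fri only
    if not weekday:
        return {name: 0.0 for name in WINDOWS}
    return {
        name: sum(start <= t.hour < end for t in weekday) / len(weekday)
        for name, (start, end) in WINDOWS.items()
    }


def misspeller_rate(tweets_by_user, misspellings):
    """Share of users with at least one tweet containing a listed expression."""
    if not tweets_by_user or not misspellings:
        return 0.0
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(m) for m in misspellings) + r")\b",
        re.IGNORECASE,
    )
    flagged = sum(
        any(pattern.search(text) for text in texts)
        for texts in tweets_by_user.values()
    )
    return flagged / len(tweets_by_user)


stamps = [
    datetime(2013, 1, 7, 8, 30),   # Monday morning
    datetime(2013, 1, 7, 9, 59),   # Monday morning
    datetime(2013, 1, 7, 15, 0),   # Monday afternoon
    datetime(2013, 1, 7, 23, 0),   # Monday, outside all windows
    datetime(2013, 1, 12, 9, 0),   # Saturday: ignored
]
fractions = activity_fractions(stamps)

users = {
    "u1": ["Voy llendo a casa"],
    "u2": ["Vamos a ver"],
    "u3": ["Aver si llueve", "hola"],
}
rate = misspeller_rate(users, ["aver", "llendo", "con migo"])
```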

3 Explanatory power of social media in unemployment

The four previous groups of variables are fingerprints of human behavior reflected in Twitter usage habits. As we observed in figure 3, all of them exhibit statistically strong correlations with unemployment. The question we address in this section is whether those variables suffice to explain the observed unemployment (their explanatory power), and which of them are the most important (that is, which contribute more explanatory power than others). Note that we are not claiming a causal arrow between the measures built in the previous section and the unemployment rate, but only exploring whether they can be used as alternative indicators with a real translation in the economy.

Figure 4 shows the result of a simple linear regression model for the observed unemployment for ages below 25 years as a function of the variables that have the highest correlation with the unemployment. The model has a significant $R^2 = 0.62$, showing that a large part of the variation in unemployment is explained by the behavioral variables extracted from Twitter. However, not all the variables weigh equally in the model: specifically, the penetration rate, geographical diversity, morning activity and fraction of misspellers account for up to 92% of the explained variance, while social diversity and the number of employment-related tweets are not statistically significant (see SI section 10 for the methods used to determine the relative importance of the variables). It is interesting to note that while social diversity obtained from mobile phone communications was a key variable in the explanation of deprivation indexes in [14, 19], the communication diversity of Twitter users seems to play a minor role in explaining the heterogeneity of unemployment in Spain.
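For readers who want to see how the two summary quantities in this paragraph are defined, the sketch below computes a coefficient of determination and the relative weight of each variable from the absolute values of the regression coefficients, as described in the Figure 4 caption. It is not the authors' procedure (the methods are in SI section 10), and the coefficient values in the example are made up purely for illustration.

```python
def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def relative_weights(coefficients):
    """Percentage weight of each variable from absolute coefficient values.

    `coefficients` maps variable name -> regression coefficient (intercept
    excluded). This only makes sense if the variables are on comparable
    scales, e.g. standardized beforehand (an assumption of this sketch).
    """
    total = sum(abs(c) for c in coefficients.values())
    return {name: 100.0 * abs(c) / total for name, c in coefficients.items()}


# Hypothetical coefficients, NOT values from the paper.
weights = relative_weights({"penetration": 0.3, "morning": -0.1})
perfect = r_squared([10.0, 20.0, 30.0], [10.0, 20.0, 30.0])
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

A model that always predicts the mean scores 0 and a perfect fit scores 1, which gives a scale for reading the reported value of 0.62.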

Similar explanatory power is found for other age groups: $R^2 = 0.44$ for all ages and $R^2 = 0.52$ for ages between 25 and 44 years. However, the model degrades for ages above 44 years ($R^2 = 0.26$), suggesting that our variables mainly describe the behavior of the age groups best represented on Twitter, namely those below 44 years old. On the other hand, since our Twitter variables seem to describe the behavior of young people, we have investigated whether the variables constructed from Twitter have a similar explanatory value (in terms of $R^2$) as simple census demographic variables for young people. Regression models that include the young population rate yield only a minor improvement ($R^2 = 0.65$), while the young population rate alone gives $R^2 = 0.24$, a result which shows that the Twitter variables do indeed possess genuine explanatory power beyond their simple demographic representation. Finally, our model has the largest explanatory power for the detected communities, but large $R^2$ values are also found for other geographical areas like counties ($R^2 = 0.54$) and provinces ($R^2 = 0.65$); see SI section 9.

4 Discussion

This work serves as a proof of concept for how a wide rangeof behavioral features linked to socioeconomic behavior can

Social media fingerprints of unemployment mdash 619

eco

unemp

emp

job

fmiss

madrugada

tarde

manana

siorsocial

siosocial

sior

sio

rtwpen

minus05 00 05corre

ff500

1000

10 20parofa

ctor

[i]

tt[ v

aria

bles

_sel

[i]]

20

40

60

80

10 20parofa

ctor

[i]

tt[ v

aria

bles

_sel

[i]]

20

40

60

80

1020paro fa

ctor[i] tt[ variables_sel[i]]

20

40

60

80

1020paro fa

ctor[i] tt[ variables_sel[i]]

0

5

10

15

20

10 20parofa

ctor

[i]

tt[ v

aria

bles

_sel

[i]]

0

5

10

15

20

1020paro fa

ctor[i] tt[ variables_sel[i]]

0

5

10

15

20

1020paro fa

ctor[i] tt[ variables_sel[i]]

4

5

6

7

10 20parofa

ctor

[i]

tt[ v

aria

bles

_sel

[i]] 5

10

10 20parofa

ctor

[i]

tt[ v

aria

bles

_sel

[i]]

40

50

60

70

10 20parofa

ctor

[i]

tt[ v

aria

bles

_sel

[i]]

02

04

06

08

10 20parofa

ctor

[i]

tt[ v

aria

bles

_sel

[i]]

0

50

100

150

200

10 20parofa

ctor

[i]

tt[ v

aria

bles

_sel

[i]]

0

50

100

150

200

10 20parofa

ctor

[i]

tt[ v

aria

bles

_sel

[i]]

0

50

100

150

200

10 20parofa

ctor

[i]

tt[ v

aria

bles

_sel

[i]]

Penetration rate Entropy 1 (geo) Entropy 2 (geo)

Entropy 1 (social) Entropy 2 (social) Activity (morning)

Activity (afternoon) Activity (night)

Misspellers rate Job tweets

Employment tweets Unemployment tweets

Economy tweets

A B C

D E

Unemployment UnemploymentCorrelation

Entropy1 (social)

Misspellers rate

Pen

etra

tion

rate

Act

ivity

(mor

ning

)Figure 3 A) Correlation coefficient of all the extracted Twitter metrics grouped by technology adoption (black) geographicaldiversity (orange) social diversity (light blue) temporal activity (green) and content analysis (dark blue) Error bars correspond to95 confidence intervals of the correlation coefficient Gray area correspond the statistical significance thresholds Panels B C Dand E show the values of 4 selected variables in each geographical community against its percentage of unemployment Size of thepoints is proportional to the population in each geographical community Solid lines correspond to linear fits to the data

10

15

20

25

10 15 20 25x

y

5

10

15

2025

5 10 15 20 25x

y

0 10 20 30per

order

col0000000072B2009E7356B4E9E69F00

Penetration rate

Entropy 1 (geo)

Entropy 1 (social)

Activity (morning)

Misspellers rate

Employment tweetsR2 = 062

Pred

icte

d un

empl

oym

ent

Observed unemployment Weight

A B

R2 = 052

Observed unemployment

CAge lt 25 25 lt Age lt 44

Figure 4 A) and B) Performance of the model showing the predicted unemployment rate for ages below 25 versus the observedone R2 = 062 and with ages between 25 and 44 Dashed lines correspond to the equality line and plusmn20 error C) Percentageof weight for each of the variables in the regression model using the relative weight of the absolute values of coefficients in theregression model (see SI section 10) Variables marked with lowast are not statistical significant in the model

be inferred from the digital traces that are left by publicly-available social media In particular we demonstrate thatbehavioral features related to unemployment can be recoveredfrom the digital exhaust left by the microblogging networkTwitter First of all Twitter geolocalized traces together withoff-the-shelve community detection algorithms render an op-timal partition of a country for economical activity showingthe remarkable power of social media to understand and unveileconomical behavior at a country-scale This insight is likelyto apply to other administrative definitions in other countriesspecially when considering large cities with an inherent dy-namical nature and evolution of mobility fluxes and citiescomposed of small satellite cities with arbitrary agglomera-tions or division among them (eg London NYC Singapore)

This result is unsurprising: it should be natural to recompute city clusters/communities of activity based on their real-time mobility, which may vary considerably faster than the update rates of mobility and travel surveys [31–33].

Our main result demonstrates that several key indicators (different penetration rates among regions, fingerprints of the temporal patterns of activity, content lexical correctness, and geo-social connectivities among regions) can be extracted from social media and then used to infer unemployment levels. These findings shed light in two directions: first, on how individuals' extensive use of their social channels allows us to characterize cities based on their activity in a meaningful fashion; and second, on how this information can be used to build economic indicators that are directly related to the economy. Regarding the latter, our work is important for understanding how country-scale analysis of social media should consider the demographic but also the economical differences between users. As we have shown, users in areas with large unemployment have different mobility, different social interactions, and different daily activity than those in low unemployment areas. This intertwined relationship between user behavior and employment should be considered not only in economical analysis derived from social media, but also in other applications like marketing, communication, social mobilization, etc.

It is particularly remarkable that Twitter data can provide these accurate results. Twitter is, among the many currently popular social networking platforms, perhaps the noisiest, sparsest, and most 'sabotaged' medium: very few users send out messages at a regular rate; most users do not have geolocated information; the social relationships (followers/followees) contain many unused or unimportant links; it is plagued by spam-bots; and, last but not least, we have no way to identify the motive, goal, or functionality of the mobility fluxes we are able to extract. These limitations are not particular to our sample, but general to the sample Twitter data being employed in the computational social science community. Despite all these caveats, we are able to show that even some simple filtering techniques, together with basic statistical regressions, yield predictive power about a variable as important as unemployment. Other social media platforms such as Facebook, Google+, Sina Weibo, Instagram, Orkut, or Flickr, with more granular and consistent individual data, are likely to provide similar or better results by themselves or in combination. Further improvements can be obtained by the use of more sophisticated statistical machine learning techniques, some of them even tailored to the peculiarities of social media data. Our work serves to illustrate the tremendous potential of these new digital datasets to improve the understanding of society's functioning at the finer scales of granularity.

The usefulness of our approach must be considered against the cost and update rate of performing detailed surveys of mobility, social structure, and economic performance. Our database is publicly articulated, which means that our analysis could be replicated easily in other countries, other time periods, and with different scopes. Naturally, survey results provide more accurate results, but they also consume considerably higher financial and human resources, employing hundreds of people and taking months, even years, to complete and be released; they are so costly that countries going through economic recession have considered discontinuing them or altering their update rate in recent times. A particularly problematic aspect of these surveys is that they are "out-of-sync", i.e. a census may be up to date whereas those same individuals' travel surveys may not be, and therefore drawing inferences between both may be particularly difficult. This is a challenging problem that the immediateness of social media can help ameliorate.

A few questions remain open for further investigation. How can traditional surveys and social media digital traces be best combined to maximize their predictive ability? Can social media provide a reliable leading indicator to unemployment and, in general, economic surveys? How much reliable lead is possible, if at all? As we have found, Twitter penetration and educational levels are correlated with unemployment, but these levels are unlikely to change rapidly enough to describe or anticipate changes in the economy or unemployment. However, other indicators like daily activity, social interactions, and geographical mobility are more connected with our daily activity, and perhaps they have more predictive power to show and/or anticipate sudden changes in employment. The relationship between unemployment and individual and group behavior may help contextualize the multiple factors affecting the socioeconomic well-being of a region: while penetration, content, daily activity, and mobility diversity seem to be highly correlated with unemployment in Spain, different weights for each group of traces might be expected in other countries [14]. Finally, digital traces could serve as an alternative (sometimes the only one available) to the lack of surveys in poor or remote areas [20, 27]. Another interesting avenue of research involves the use of social media to detect mismatches between the real (hidden, underground) economy and the officially reported one [38].

Most importantly, the immediacy of social media may also allow governments to better measure and understand the effect of policies, social changes, and natural or man-made disasters on the economical status of cities in almost real time [18, 39]. These new avenues for research provide great opportunities at the intersection of the economic, social, and computational sciences that originate from these new, widespread, inexpensive datasets.

Acknowledgments

We would like to thank Kristina Lerman, Lada Adamic, James Fowler, Daniel Villatoro, and Ricardo Herranz for stimulating discussions, and Yuri Kryvasheyeu and Thomas Bochynek for their critical reading of the manuscript. This work was partially supported by Spanish Ministry of Science and Technology Grant FIS2013-47532-C3-3-P (to A.L., M.G.H., and E.M.). Manuel Cebrian is funded by the Australian Government as represented by the Department of Broadband, Communications and Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

References

[1] Becker, G. S. (1976) The economic approach to human behavior. (University of Chicago Press).

[2] Granovetter, M. (1985) Economic action and social structure: the problem of embeddedness. American journal of sociology pp. 481–510.

[3] Camerer, C. F., Loewenstein, G., & Rabin, M. (2011) Advances in behavioral economics. (Princeton University Press).

[4] Glaeser, E. L., Kallal, H. D., Scheinkman, J. A., & Shleifer, A. (1991) Growth in cities. (National Bureau of Economic Research), Technical report.

[5] Bettencourt, L. M., Lobo, J., Helbing, D., Kuhnert, C., & West, G. B. (2007) Growth, innovation, scaling, and the pace of life in cities. Proceedings of the National Academy of Sciences 104, 7301–7306.

[6] Batty, M. (2008) The size, scale, and shape of cities. Science 319, 769–771.

[7] Milgram, S. (1974) The experience of living in cities. Crowding and behavior 167, 41.


[8] Pan, W., Ghoshal, G., Krumme, C., Cebrian, M., & Pentland, A. (2013) Urban characteristics attributable to density-driven tie formation. Nature communications 4.

[9] Gonzalez, M. C., Hidalgo, C. A., & Barabasi, A.-L. (2008) Understanding individual human mobility patterns. Nature 453, 779–782.

[10] Calabrese, F., Diao, M., Lorenzo, G. D., Jr., J. F., & Ratti, C. (2013) Understanding individual mobility patterns from urban sensing data: A mobile phone trace example. Transportation Research Part C: Emerging Technologies 26, 301–313.

[11] Cheng, Z., Caverlee, J., Lee, K., & Sui, D. Z. (2011) Exploring Millions of Footprints in Location Sharing Services. (AAAI, Menlo Park, CA, USA).

[12] Cho, E., Myers, S. A., & Leskovec, J. (2011) Friendship and mobility: user movement in location-based social networks. KDD '11 (ACM, New York, NY, USA), pp. 1082–1090.

[13] Sun, L., Jin, J. G., Axhausen, K. W., Lee, D.-H., & Cebrian, M. (2014) Quantifying long-term evolution of intra-urban spatial interactions. arXiv preprint arXiv:1407.0145.

[14] Eagle, N., Macy, M., & Claxton, R. (2010) Network diversity and economic development. Science 328, 1029–1031.

[15] Henrich, J., Boyd, R., Bowles, S., Camerer, C., Fehr, E., Gintis, H., & McElreath, R. (2001) In search of homo economicus: behavioral experiments in 15 small-scale societies. American Economic Review pp. 73–78.

[16] Krieger, N., Williams, D. R., & Moss, N. E. (1997) Measuring social class in US public health research: concepts, methodologies, and guidelines. Annual review of public health 18, 341–378.

[17] Groves, R. M., Fowler Jr., F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2013) Survey methodology. (John Wiley & Sons).

[18] Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M., et al. (2009) Life in the network: the coming age of computational social science. Science (New York, NY) 323, 721.

[19] Smith, C., Quercia, D., & Capra, L. (2013) Finger on the pulse: identifying deprivation using transit flow analysis. (ACM), pp. 683–692.

[20] Soto, V., Frias-Martinez, V., Virseda, J., & Frias-Martinez, E. (2011) Prediction of socioeconomic levels using cell phone records. UMAP'11 (Springer-Verlag, Berlin, Heidelberg), pp. 377–388.

[21] Preis, T., Moat, H. S., Stanley, H. E., & Bishop, S. R. (2012) Quantifying the advantage of looking forward. Scientific reports 2.

[22] Antenucci, D., Cafarella, M., Levenstein, M. C., Re, C., & Shapiro, M. D. (2014) Using social media to measure labor market flows. (National Bureau of Economic Research), Technical report.

[23] Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P., & Ratti, C. (2013) Geo-located twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680.

[24] Lathia, N., Quercia, D., & Crowcroft, J. (2012) in Pervasive computing. (Springer), pp. 91–98.

[25] Frias-Martinez, V., Virseda, J., & Frias-Martinez, E. (2010) Socio-economic levels and human mobility.

[26] Gutierrez, T., Krings, G., & Blondel, V. D. (2013) Evaluating socio-economic state of a country analyzing airtime credit and mobile phone datasets. arXiv preprint arXiv:1309.4496.

[27] Smith, C., Mashhadi, A., & Capra, L. (2013) Ubiquitous sensing for mapping poverty in developing countries. Paper submitted to the Orange D4D Challenge.

[28] Erlander, S. & Stewart, N. F. (1990) The gravity model in transportation analysis: theory and extensions. (Vsp) Vol. 3.

[29] Simini, F., Gonzalez, M. C., Maritan, A., & Barabasi, A.-L. (2012) A universal model for mobility and migration patterns. Nature 484, 96–100.

[30] Lenormand, M., Picornell, M., Cantu-Ros, O. G., Tugores, A., Louail, T., Herranz, R., Barthelemy, M., Frias-Martinez, E., & Ramasco, J. J. (2014) Cross-checking different sources of mobility information. arXiv preprint arXiv:1404.0333.

[31] Barthelemy, M. (2011) Spatial networks. Physics Reports 499, 1–101.

[32] Expert, P., Evans, T. S., Blondel, V. D., & Lambiotte, R. (2011) Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences 108, 7663–7668.

[33] Sobolevsky, S., Szell, M., Campari, R., Couronne, T., Smoreda, Z., & Ratti, C. (2013) Delineating geographical regions with networks of human interactions in an extensive set of countries. PloS one 8, e81707.

[34] Rosvall, M. & Bergstrom, C. T. (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105, 1118–1123.

[35] Danon, L., Diaz-Guilera, A., Duch, J., & Arenas, A. (2005) Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment 2005, P09008.

[36] ADigital (2013) Uso de Twitter en España 2012. [Online; accessed 1-November-2014].

[37] Davenport, J. R. & DeLine, R. (2014) The readability of tweets and their geographic correlation with education. arXiv preprint arXiv:1401.6058.

[38] Schneider, F., Buehn, A., & Montenegro, C. E. (2011) Shadow economies all over the world: New estimates for 162 countries from 1999 to 2007. Handbook on the shadow economy pp. 9–77.

[39] Rutherford, A., Cebrian, M., Dsouza, S., Moro, E., Pentland, A., & Rahwan, I. (2013) Limits of social mobilization. Proceedings of the National Academy of Sciences 110, 6281–6286.


Supporting Information for
Social media fingerprints of unemployment
Alejandro Llorente, Manuel García-Herranz, Manuel Cebrian, and Esteban Moro

S1 The dataset

Twitter provides an extremely rich and publicly available data set of user interactions, information flows and, thanks to the geo-location of tweets, user movements. Nevertheless, the representativeness of geo-located Twitter as a global source of mobility data has still received sparse attention. In this sense, while [13] presents a promising and extensive study regarding global country-to-country movements (mostly driven by tourism), within-country human flows (comprising not only internal tourism but also, to a greater extent than country-to-country travel, visiting and commuting) still need further investigation. Therefore, throughout this work we will compare our findings using geo-located Twitter with similar studies using commuting surveys.

For the Twitter analysis we consider almost 146 million geo-located Twitter messages (tweets), collected through the public API provided by Twitter for the continental part of Spain from 29th November 2012 to 10th April 2013. In this dataset, we consider that there has been a trip from place l to place k if a user has tweeted in place l and place k consecutively. We only keep those transitions in which the first tweet and the second one are dated on the same day. We filter the trips database to avoid unrealistic transitions and keep only trips with a geographical displacement larger than 1 km (see Methods section). By this method, 138 million trips from 167,376 different users are considered in our work.

From those trips we construct the mobility flow $T_{ij}$ between municipalities, which measures the number of trips in our database in which the origin is within the boundaries of city $i$ and the destination lies within those of city $j$.
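The trip definition above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the tuple layout `(user, timestamp, lat, lon, municipality)`, the function names, and the use of the haversine formula for the displacement are all assumptions made for illustration, and the additional filtering of unrealistic transitions mentioned in the Methods section is not reproduced.

```python
from collections import Counter
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
MIN_DISPLACEMENT_KM = 1.0  # displacement threshold stated in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def extract_trips(tweets):
    """Return (origin, destination) municipality pairs.

    tweets: iterable of (user, timestamp, lat, lon, municipality), where
    timestamp is a datetime (hypothetical layout). A trip is a pair of
    consecutive tweets by the same user, posted on the same calendar day
    and more than MIN_DISPLACEMENT_KM apart.
    """
    by_user = {}
    for tw in sorted(tweets, key=lambda t: (t[0], t[1])):
        by_user.setdefault(tw[0], []).append(tw)
    trips = []
    for seq in by_user.values():
        for (_, t1, la1, lo1, m1), (_, t2, la2, lo2, m2) in zip(seq, seq[1:]):
            same_day = t1.date() == t2.date()
            if same_day and haversine_km(la1, lo1, la2, lo2) > MIN_DISPLACEMENT_KM:
                trips.append((m1, m2))
    return trips


def mobility_flows(trips):
    """Count trips per ordered pair (i, j), i.e. the entries of T_ij."""
    return Counter(trips)
```

Because the pairs are ordered, the resulting counts are not forced to be symmetric, which matches the remark later in the text that $T_{ij}$ is not necessarily symmetric.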

We also consider population and economical information about the municipalities from the Spanish Census (2011) [8] and unemployment figures from the Public Service of Employment (Servicio Público de Empleo Estatal, SEPE) [7]. In the latter case, registered unemployment (in number of persons) is given for each Spanish municipality by gender, age, and month. To get unemployment rates, we divide registered unemployment by the total workforce in the municipality, estimated as the number of people aged between 16 and 65 years.

S2 Twitter as mobility proxy

Considering all of the available transitions in our database, one can compute the distance between origin and destination, the elapsed time of the transition, and the number of trips per user, among many other statistics. All of them seem to follow a power-law distribution with a cutoff due to the finite spatial size of Spain and the constraint of considering only transitions where the origin and destination check-ins are made on the same day. Focusing on the log-linear part of the distributions, self-similar behaviors arise when Twitter-based mobility is analyzed (see figure 5).

Twitter-based inter-city flows can be well modelled by means of the Gravity Law, which is one of the most widespread methods to represent human mobility [1, 19], with applications in many fields like urban planning [23], traffic engineering [4], or transportation problems [9]. The Gravity Law is also the solution to the problem of maximizing the entropy of the particle distribution among all the possible trips using statistical mechanics techniques [2, 22]. Recently it has also been used as a model for human mobility based on cell phone traces [10, 20, 21] and social media data at a global scale [13] and at the inter-city level [14].

The Gravity Model for human mobility assumes that the flows between cities can be explained by the expression

$$T^{\mathrm{grav}}_{ij} = \frac{P_i^{\alpha_1} P_j^{\alpha_2}}{d_{ij}^{\beta}} \qquad (2)$$

where $T^{\mathrm{grav}}_{ij}$ is the flow, in terms of number of people, between cities $i$ and $j$, $d_{ij}$ is the geographical distance, and $P_i$ and $P_j$ are the populations of the two cities, respectively.

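Given the data, we can obtain the parameters of the model by Weighted Least Squares Minimization:

$$\alpha_1^{*}, \alpha_2^{*}, \beta^{*} = \operatorname*{argmin}_{\alpha_1, \alpha_2, \beta} \frac{1}{N} \sum_{ij} w_{ij} \left( T_{ij} - T^{\mathrm{grav}}_{ij} \right)^2 \qquad (3)$$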

where $N$ is the total number of connections in the mobility graph and $w_{ij}$ is a weight proportional to the number of observed transitions between $i$ and $j$. In particular, we find that taking $w_{ij} = T_{ij}^{13}$ gives the best performance in the model.

In our case, this model fits the inter-city mobility based on Twitter GPS check-ins quite accurately (see table 1). Even though we consider $T_{ij}$ not necessarily symmetric, the exponents of the populations are similar, indicating that we observe similar flows in both directions between $i$ and $j$.
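To make the fitting objective concrete, the following Python sketch evaluates the weighted squared error of the gravity model for a given parameter triple and picks the best triple from a small candidate grid. It is only an illustration under stated assumptions: the function names, the observation layout `(P_i, P_j, d_ij, T_ij)`, the default weight `w = T_ij`, and the brute-force grid search are hypothetical choices, not the optimizer or weighting used in the paper.

```python
from itertools import product


def gravity_flow(pop_i, pop_j, dist_ij, alpha1, alpha2, beta):
    """Gravity-model flow P_i^alpha1 * P_j^alpha2 / d_ij^beta."""
    return (pop_i ** alpha1) * (pop_j ** alpha2) / (dist_ij ** beta)


def weighted_loss(params, observations, weight=lambda t: t):
    """Mean weighted squared error between observed and modelled flows.

    observations: list of (pop_i, pop_j, dist_ij, t_ij) tuples.
    weight: maps the observed flow t_ij to w_ij (assumed default: w = t_ij).
    """
    alpha1, alpha2, beta = params
    total = 0.0
    for pop_i, pop_j, dist_ij, t_ij in observations:
        residual = t_ij - gravity_flow(pop_i, pop_j, dist_ij, alpha1, alpha2, beta)
        total += weight(t_ij) * residual ** 2
    return total / len(observations)


def grid_search(observations, alphas, betas, weight=lambda t: t):
    """Return the (alpha1, alpha2, beta) candidate with the smallest loss."""
    candidates = product(alphas, alphas, betas)
    return min(candidates, key=lambda p: weighted_loss(p, observations, weight))
```

A grid search is used here only because it needs no external library; in practice a numerical optimizer would be a more natural choice for a continuous three-parameter problem.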

S3 Community structures in inter-city mobility graph

Typically, complex networks exhibit community structure, that is, there are subsets of nodes that are more densely connected among themselves than to the rest of the nodes. In mobility networks, whose nodes correspond to geographical areas, these communities are interpreted as zones with high common activity and tend to be constrained by geographical and political barriers. We check whether this is also observed in our dataset by running 6 state-of-the-art community detection algorithms: FastGreedy [5], Walktrap [16], Infomap [18], MultiLevel [3], Label Propagation [17], and Leading Eigenvector [15]. These six algorithms yield different community structures in terms of number of communities, average community size, or modularity (see table 3). Members (municipalities) of the resulting communities are spatially connected except in a few cases, as figure 7 shows. We test the statistical robustness of the obtained communities by randomly removing a proportion $p$ of the original links and running the algorithms on this new graph $G_p$. We consider that communities are robust when the communities obtained for the original network $G$ and for $G_p$ are highly similar. In order to compare two arbitrary community memberships, we use the Normalized Mutual Information (NMI) method described in [6], which returns 0 when two memberships are totally different and 1 when we compare two equal memberships. We compute the NMI for each chosen algorithm run on $G$ and $G_p$ for $p$ between 1% and 10%, concluding that the obtained community structures are robust because they are not broken when some randomly chosen links are removed (see table 2).

[Figure 5: three log-log density plots, for trip distance (km), number of trips, and elapsed time (secs).]

Figure 5. Probability distributions for the different properties of daily trips in the Twitter dataset. Dashed lines correspond to power-law fits with exponents −1.67, −2.43, and −0.62, respectively.

[Figure 6: penetration rate versus population on logarithmic axes, for communities and for cities.]

Figure 6. Penetration rates for both cities and detected communities.

As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether the mobility-based communities are related to any of these barriers. In Spain there are different territorial divisions for administration purposes. In this work we consider two of them: provinces, defined in the 1978 Constitution, which are 48 different heterogeneous aggregations of municipalities; and counties (comarcas in Spanish terminology), which are traditional aggregations of municipalities mainly based on Spanish orography (rivers, valleys, ridges, etc.), some of which are composed of municipalities from different provinces. We again use the NMI method to compare the community structure given by the algorithms to the administrative limits. Except for the Leading Eigenvector algorithm, the methods return communities that are quite related to provinces (NMI ≈ 0.7), whereas for the county administration limits higher variability is observed. In this last case, the algorithm most closely related to county limits is Infomap (NMI ≈ 0.83). Therefore, Twitter-based mobility summarizes the inter-city flows, showing that these flows are influenced by geographical and political barriers.
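The NMI comparisons used throughout this section can be sketched in a few lines of Python. The version below uses the common normalization 2·I(A;B) / (H(A) + H(B)), which is 1 for identical memberships (up to relabeling) and 0 for independent ones; that this matches the exact variant used in the paper is an assumption, and the function names are illustrative.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a list of community labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(a, b):
    """NMI between two memberships given as equal-length label lists.

    a[k] and b[k] are the communities assigned to node k by each partition.
    """
    if len(a) != len(b):
        raise ValueError("memberships must cover the same nodes")
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    mutual = sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )
    denom = entropy(a) + entropy(b)
    # Two trivial one-community partitions are treated as identical.
    return 1.0 if denom == 0 else 2 * mutual / denom
```

For example, a membership obtained on the full graph and one obtained after removing a fraction of links could be passed as `a` and `b`, with each list indexed by municipality.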

S4 Twitter demographics and unemployment rates

Different age groups are not equally represented in Twitter. Recent surveys (2012) in Spain suggest that most (86%) of the users in Twitter are 16 to 44 years old. Comparison of the percentage of users per age group with the total population within the same groups (see figure 8) reveals that age groups above 35 years old are under-represented in Twitter. Thus our Twitter data will be more revealing when trying to describe unemployment in age groups below 44 years old. This is indeed what we find when we try to build a linear model for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 24 can be fitted to a linear model with R² = 0.62, regression models for unemployment rates for ages between 25 and 44 have R² = 0.52, and for ages above 44 we get only R² = 0.26. Table 4 summarizes the results for the regression models of unemployment rates in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, in figure 8 we can see the performance of the model for different age groups; once again, the poor explanatory power of the Twitter variables for the unemployment rate in ages above 44 years old is evident.

S5 Properties of Twitter variables

Normalization and distributions

Heterogeneity between the values of variables constructed from Twitter is large but moderate, as the histograms in figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration rate $\tau_i$ and the misspellers rate $\varepsilon_i$ are defined as the number of users or misspellers per 100,000 persons (population); activity variables $\nu_i$ are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, $\mu_i$, are also given per 100,000 tweets published in the geographical area.

Figure 7. From left to right and from top to bottom: Fastgreedy, Walktrap, Infomap, Multilevel, Label Propagation, and Leading Eigenvector communities on Twitter-based mobility transitions.

Correlation between variables

Variables are constructed to reflect the behavior of areas in the different dimensions of Twitter: penetration, social or geographical diversity, activity through the day, and content. Correlation between variables does indeed show that variables within each dimension hold strong correlations between them. As we can see in figure 10, social and geographical diversities are highly correlated, an expected fact given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate $\tau_i$ and the fraction of misspellers $\varepsilon_i$ have a strong correlation with most of the variables.

High correlation between variables might lead to collinearity effects [24] in the linear regression models, that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value, but its p-value is too high to consider it significant. To test this hypothesis we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 shows the loadings of the different variables on the first principal components. The block structure shown in figure 10 results in similar directions of the variables in the first components of the PCA. We observe some groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we take one variable from each of them for our regression models.

S6 Misspellers detection

In this work we consider only tweets in Spanish. Since several languages coexist in Spain, depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet has a misspelling or not, we need to establish some patterns to select from our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we do not consider the following cases:

• Lack of written accents. People tend to avoid writing accents when talking in a colloquial way.

• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters qu by k. We understand that these mistakes can be motivated by the length limitation of tweets and not by a real misspelling.

• In the same line, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.

• We do not consider either mistakes related to features of specific areas in Spain. For example, in the south the pronunciation of ce and se is the same, which produces a large number of mistakes when writing. However, since we want to extract objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

[Figure 9: histograms P(x) for tweets (unemployment), penetration rate, entropy 2 (social), entropy 1 (social), entropy 2 (geo), entropy 1 (geo), misspellers rate, tweets (employment), tweets (job), activity (night), activity (morning), and activity (afternoon).]

Figure 9. Frequency plots for each variable constructed from Twitter.

Likewise, we consider the following mistakes as real misspellings:

• Adding letters. For example, writing an h at the beginning of a word that starts with a vowel.

• Replacing the special cases mp, mb by the wrong spellings np, nb.

• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish because the pairs have the same or a very close pronunciation.

• Confusing the verb haber with the periphrasis a ver.

• Separating a word into two, for instance writing the word conmigo as con migo.

[Figure 8: top, percentage of population per age group (16−24, 25−34, 35−44, 45−54, 55−64) for Census and Twitter; bottom, predicted versus observed unemployment (%) for all ages (R² = 0.47), < 24 (R² = 0.62), 25-44 (R² = 0.52), and > 44 (R² = 0.26).]

Figure 8. Top: Percentage of population in each age group from the Spanish Census (dark bars) and surveys about users in Twitter (light bars). Bottom: performance of the linear models for each of the age groups.

This way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus, one can expect this selection to provide an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27,055 (5.6% of the whole population).
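As a rough illustration of how a fixed list of misspellings can be turned into a misspeller count, consider the Python sketch below. The five entries are hypothetical examples of the categories described above (added h, np/nb for mp/mb, ll for y, haber for a ver, split words); the authors' actual list of 617 mistakes is not reproduced, and the function names and the `(user, text)` layout are assumptions.

```python
import re

# Hypothetical sample entries, one per category named in the text.
MISSPELLINGS = [
    "habrir",    # added h (abrir)
    "canbio",    # nb instead of mb (cambio)
    "tanpoco",   # np instead of mp (tampoco)
    "llendo",    # ll instead of y (yendo)
    "haber si",  # haber instead of "a ver"
    "con migo",  # conmigo split into two words
]

# One case-insensitive pattern matching any listed form as a whole word/phrase.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(m) for m in MISSPELLINGS) + r")\b",
    re.IGNORECASE,
)


def has_misspelling(text):
    """True if the tweet text contains at least one listed misspelling."""
    return _PATTERN.search(text) is not None


def find_misspellers(tweets):
    """Set of users with at least one misspelled tweet.

    tweets: iterable of (user, text) pairs, assumed already filtered to Spanish.
    """
    return {user for user, text in tweets if has_misspelling(text)}


def rate_per_100k(count, population):
    """Normalize a count as a rate per 100,000 persons."""
    return 100_000 * count / population
```

Matching against an explicit list, rather than running a general spell checker, mirrors the conservative intent of the text: forms that are merely colloquial or regional never enter the list, so they are never flagged.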

We analyze whether misspellers have different Twitter usage behavior from those people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, it can be observed that misspellers tend to publish a larger number of tweets than those who did not make mistakes (14471 against 2372). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period, the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).

[Figure 10: top, correlation matrix of the Twitter variables; bottom, projection of the variables on the first and second principal components.]

Figure 10. Top: Correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations at 95% confidence. Bottom: Variables projection on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.

Since we have observed a segmentation of the Twitter population based on how accurately they write, we consider the misspeller rate as a proxy of the educational level of the cities. A large number of previous works in the literature have revealed the relationship between the economical status and the educational level of geographical areas, and therefore it is natural to ask whether the observed misspellers rate is related to the economy, as measured by the unemployment rate. To test this hypothesis, we consider cities with more than 5,000 inhabitants to avoid subsampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).
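If the pair of values reported above is read as an interval for the correlation coefficient, one standard way to obtain such an interval is the Fisher z-transformation. The sketch below shows that calculation in Python; the text does not state which method was used, so this choice, the function names, and the default 95% critical value are assumptions.

```python
from math import atanh, sqrt, tanh


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def fisher_interval(r, n, z_crit=1.959963984540054):
    """Approximate confidence interval for r from n samples (n > 3).

    Uses the Fisher z-transformation; z_crit defaults to the two-sided
    95% critical value of the standard normal distribution.
    """
    z = atanh(r)
    half_width = z_crit / sqrt(n - 3)
    return tanh(z - half_width), tanh(z + half_width)
```

Here `x` would hold the misspellers rate per city and `y` the corresponding unemployment rate; the interval narrows as the number of cities grows.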


Figure 11. Number (red) and probability (blue) of observed misspellings given the number of tweets.

Figure 12. Explanatory power ($R^2$) of the linear regression model when fitted against the unemployment data for different months (October 2012 to June 2014). The gray (orange) area corresponds to the time window in which Twitter data is collected and variables are constructed.

S7 Time window and unemployment

In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment at different months within the time window in which the Twitter data was collected, and whether the variables collected in that window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory value of the model when the linear regression is done for values of unemployment in different months before, during, and after the Twitter data time window. Although there is a small seasonal effect along the year, the explanatory power remains around $R^2 = 0.6$, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that $R^2$ decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.

S8 Demographics do not explain unemployment

Since unemployment rates are very large among young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found across geographical areas. To test this we have built four linear models: the first one (named Youth model in Table 5) has the rate of young population as its only explanatory variable; the second and third are built only on the Twitter variables, either all those considered in the main text (Twitter model (I)) or just those whose regression coefficients are statistically significant (Twitter model (II)); the fourth is fitted with all the variables (named All variables model in Table 5). Table 5 shows the summary of the regression for each model. In terms of the variance explained ($R^2$), considering all Twitter variables is almost three times more explanatory than considering only the proportion of young people. On the other hand, comparing the $R^2$ of the Twitter model with those of the All variables and Youth models shows that the rate of young population does not provide significant additional explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.
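The comparison above reduces to differences between the $R^2$ values reported in Table 5. As a reading aid, the following sketch reproduces that arithmetic from the rounded values printed in the table, so the differences carry the same rounding error. The dictionary keys and variable names are hypothetical labels, not part of the original analysis.

```python
# R^2 values as printed in Table 5 (rounded to two decimals in the source).
r2 = {
    "all_variables": 0.65,
    "youth": 0.24,
    "twitter_I": 0.64,
    "twitter_II": 0.63,
}

# Extra variance explained when the young population rate is added
# on top of the Twitter variables.
gain_from_youth = round(r2["all_variables"] - r2["twitter_I"], 2)

# Extra variance explained when the Twitter variables are added
# on top of the young population rate.
gain_from_twitter = round(r2["all_variables"] - r2["youth"], 2)

# How many times more variance the Twitter model explains than the Youth model.
ratio = r2["twitter_I"] / r2["youth"]

print(gain_from_youth, gain_from_twitter, round(ratio, 2))
```

With these inputs the young population rate adds about 0.01 to $R^2$, whereas the Twitter variables add about 0.41 on top of the Youth model.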

S9 Unemployment models for other geographical areas

While municipalities are demographically very heterogeneous, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we considered is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment when the variables are defined in those administrative areas, and relate it to the geographical communities detected and used in the main paper (see section 4). Not all the areas in each administrative division are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1738 municipalities which have a Twitter population $\pi > 10$. Similarly, we only consider the 198 counties with $\pi > 100$. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, $R^2$ increases as the number of areas in the model gets smaller, but the level of geographical detail of the model is very low for provinces, for example. The best performance (high $R^2$ and high level of geographical detail) is attained at the level of the detected communities.

S10 Relative importance of the variables

To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:

1. (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.

2. (lmg) Averaging over orderings, as proposed by Lindeman, Merenda and Gold.

3. (pmvd) The PMVD metric introduced by Feldman, which is an average over orderings as well, but with data-dependent weights.

4. (first) The univariate $R^2$ values from regression models with one variable only.

Figure 13. Relative importance of the variables (in percentage) in the unemployment model for different ways to calculate it (weight, first, lmg, pmvd).

All these metrics are obtained using the relaimpo R package [12]. The results for the youth unemployment model are shown in figure 13, where we can see that the different methods yield similar relative importances of the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.
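To make the "first" and "lmg" metrics concrete, here is a minimal sketch for the special case of two predictors, where both can be written in closed form from pairwise Pearson correlations. This is an illustration only: the original analysis uses the relaimpo R package on six variables, and the function names and toy data below are hypothetical.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def importance_two_predictors(x1, x2, y):
    """Return (full R^2, 'first' shares, 'lmg' shares) for y ~ x1 + x2."""
    r1, r2, r12 = pearson(x1, y), pearson(x2, y), pearson(x1, x2)
    # R^2 of the two-predictor linear model in terms of correlations
    # (requires x1 and x2 not to be perfectly collinear).
    r2_full = (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)
    # 'first': R^2 of each one-variable regression.
    first = (r1 ** 2, r2 ** 2)
    # 'lmg': average the R^2 gain of each variable over both entry orders.
    lmg = (0.5 * (r1 ** 2 + (r2_full - r2 ** 2)),
           0.5 * (r2 ** 2 + (r2_full - r1 ** 2)))
    return r2_full, first, lmg

# Toy data with uncorrelated predictors and y = 2*x1 + x2.
x1 = [-1, 1, -1, 1]
x2 = [-1, -1, 1, 1]
y = [-3, 1, -1, 3]
r2_full, first, lmg = importance_two_predictors(x1, x2, y)
```

By construction the lmg shares always add up to the full $R^2$, which is why they can be reported as percentages of explained variance; the "first" values only do so when the predictors are uncorrelated, as in this toy example.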

References

[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.

[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.

[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.

[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.

[6] Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.

[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html

[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm

[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.

[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.

[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.

[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.

[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.

[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.

[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.

[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences-ISCIS 2005, pages 284–293. Springer, 2005.

[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.

[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.

[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.

[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.

[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.

[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.

[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.

[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.


Gravity Model
Parameter   Description                                       Spain
α1          Origin exponent                                   0.477*** (0.002)
α2          Destination exponent                              0.478*** (0.002)
β           Distance exponent                                 1.05*** (0.0035)
$R^2$       Goodness of fit                                   0.797
φ           Correlation between $T_{ij}$ and $T^{\mathrm{grav}}_{ij}$   0.826

Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.

NMI between $G$ and $G_p$ for different $p$
Algorithm   p = 0.01   0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09    0.1
FG          0.995      0.992   0.989   0.983   0.981   0.977   0.983   0.969   0.980   0.959
WT          0.954      0.959   0.950   0.954   0.945   0.948   0.947   0.935   0.926   0.931
IM          0.988      0.981   0.980   0.981   0.978   0.974   0.975   0.970   0.969   0.966
ML          0.994      0.978   0.979   0.983   0.948   0.934   0.972   0.952   0.973   0.947
LP          0.906      0.908   0.911   0.915   0.895   0.907   0.907   0.893   0.905   0.904
LE          0.960      0.957   0.956   0.859   0.910   0.892   0.908   0.858   0.885   0.884

Table 2. NMI measure comparing $G$ and $G_p$.

Communities Stats
Algorithm   ⟨|Ni|⟩_i    max|Ni|   |{Ni}|   Modularity   NMI P   NMI C
FG          309.696     1385      23       0.726        0.712   0.590
WT          9.262       433       769      0.417        0.744   0.757
IM          21.011      143       339      0.758        0.770   0.831
ML          323.772     1132      22       0.800        0.717   0.599
LP          22.052      750       323      0.732        0.749   0.761
LE          1017.571    5344      7        0.381        0.264   0.205

Table 3. Statistics of the communities Ni returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.


                         All ages           < 24               25−44              > 44
(Intercept)              0.11*** (0.02)     0.10*** (0.03)     0.20*** (0.03)     0.20*** (0.035)
Penetration rate         3.23* (1.41)       8.57*** (2.22)     6.28** (2.17)      2.40 (2.77)
Geographical diversity   0.03 (0.02)        0.15*** (0.04)     0.08* (0.04)       0.06 (0.05)
Social diversity         −0.03* (0.01)      −0.03 (0.02)       −0.05* (0.02)      −0.06* (0.03)
Morning activity         −0.69* (0.26)      −1.30** (0.42)     −1.53*** (0.41)    −1.19* (0.52)
Misspellers rate         11.56 (8.13)       31.51* (12.78)     15.46 (12.48)      23.60 (15.94)
Employment mentions      −1.80 (6.27)       3.17 (9.86)        −9.94 (9.64)       2.71 (12.3)
$R^2$                    0.47               0.64               0.55               0.29
Adj. $R^2$               0.44               0.62               0.52               0.26
***p < 0.001, **p < 0.01, *p < 0.05

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in groups of less than 24 years, between 25 and 44 years, and above 44 years.

                         All variables      Youth model        Twitter model (I)  Twitter model (II)
(Intercept)              0.06 (0.03)        −0.02 (0.03)       0.10*** (0.03)     0.09*** (0.027)
Young pop. rate          0.66* (0.30)       2.20*** (0.35)
Penetration rate         8.20*** (2.25)                        8.57*** (2.22)     8.62*** (2.21)
Geographical diversity   0.14*** (0.04)                        0.15*** (0.04)     0.12*** (0.03)
Social diversity         −0.02 (0.02)                          −0.03 (0.02)
Morning activity         −1.42*** (0.41)                       −1.30** (0.42)     −1.28** (0.41)
Misspellers rate         23.95 (13.09)                         31.51* (12.78)     32.28* (12.71)
Employment mentions      0.34 (9.81)                           3.17 (9.86)
$R^2$                    0.65               0.24               0.64               0.63
Adj. $R^2$               0.63               0.24               0.62               0.62
***p < 0.001, **p < 0.01, *p < 0.05

Table 5. Regression table for the different statistical models. The All variables model includes both the Twitter variables and the rate of young population. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).


                         Communities        Municipalities     Counties           Provinces
(Intercept)              0.10*** (0.03)     0.16*** (0.01)     0.11*** (0.03)     0.11* (0.05)
Penetration rate         8.57*** (2.22)     4.01*** (0.59)     9.12*** (1.81)     10.47*** (1.97)
Geographical diversity   0.15*** (0.04)     0.02 (0.01)        0.12*** (0.03)     0.08 (0.07)
Social diversity         −0.03 (0.02)       −0.01 (0.01)       −0.01 (0.02)       −0.03 (0.07)
Morning activity         −1.30** (0.42)     −1.16*** (0.14)    −1.49*** (0.39)    −1.03 (0.88)
Misspellers rate         31.51* (12.78)     14.40*** (2.51)    14.09 (10.02)
Employment mentions      3.17 (9.86)        −0.71 (0.89)       2.41 (8.86)        −3.17 (12.29)
Number of points         128                1738               198                50
$R^2$                    0.64               0.22               0.55               0.65
Adj. $R^2$               0.62               0.21               0.54               0.61
***p < 0.001, **p < 0.01, *p < 0.05

Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.


Figure 1. A) Map of the mobility fluxes $T_{ij}$ between municipalities based on Twitter inferred trips (white). Infomap communities detected on the network $T_{ij}$ are colored under the mobility fluxes (blue colors). B) Mobility fluxes $T_{ij}$ between municipalities $i$ and $j$ are constructed by aggregating the number of trips between them. C) Correspondence between the observed fluxes $T_{ij}$ and the fitted gravity model fluxes. The dashed line is $T_{ij} = T^{\mathrm{grav}}_{ij}$, while the (blue) solid line is a conditional average of $T^{\mathrm{grav}}_{ij}$ for fixed values of $T_{ij}$.

1 Social media dataset and functional partition of cities

Twitter is a microblogging online application where users can express their opinions, share content, and receive information from other users in text messages up to 140 characters long, commonly known as tweets. Users can interact with other users by mentioning them or by retweeting their content (sharing someone's tweet with their own followers). Some of these tweets contain information about the geographical location where the user was when the tweet was published; we refer to them as geo-located tweets.

To perform our analysis we consider 19.6 million geo-located Twitter messages (tweets), collected through the public API provided by Twitter from continental Spain, ranging from 29th November 2012 to 30th June 2013. Tweets were posted by 0.57 million unique (properly anonymized) users and geo-positioned in 7683 different municipalities. We observed a large correlation (Pearson's coefficient $\rho = 0.951$ [0.949, 0.953]) between the number of geo-positioned tweets per municipality and the municipality's population. On average we find around 50 tweets per month and per 1000 persons in each municipality.
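The bracketed values that follow each correlation coefficient read as confidence intervals. The text does not state how they were computed; as a minimal sketch, assuming the common Fisher z-transform approximation at the 95% level, a Pearson coefficient and its interval could be obtained as follows. The function name and the toy data are hypothetical.

```python
import math

def pearson_with_ci(x, y, z_crit=1.96):
    """Pearson correlation and an approximate 95% CI (Fisher z-transform).

    Assumes len(x) == len(y) > 3 and that the data are not perfectly
    correlated (atanh diverges at r = +/-1).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    z = math.atanh(r)                     # Fisher transform of r
    half = z_crit / math.sqrt(n - 3)      # standard error of z is 1/sqrt(n-3)
    return r, (math.tanh(z - half), math.tanh(z + half))

# Toy example: eight "municipalities" with (population rank, tweet rank).
pop_rank = [1, 2, 3, 4, 5, 6, 7, 8]
tweet_rank = [2, 1, 4, 3, 6, 5, 8, 7]
r, (lo, hi) = pearson_with_ci(pop_rank, tweet_rank)
```

With thousands of municipalities the interval becomes very narrow, which is consistent with the tight brackets quoted in the text.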

Despite this high level of social media activity within municipalities, we find their official administrative areas unsuitable for studying socio-economic activity: administrative boundaries between municipalities reflect political and historical decisions, while economic trade and activity often happen across those boundaries. The result is that municipalities in Spain are artificially diverse, ranging from a municipality with only 7 inhabitants to another with a population of 3.2 million. Although there exist natural aggregations of municipalities in provinces (regions) or statistical/metropolitan areas (NUTS areas), we have used our own procedure to detect economic areas. In particular, we have used users' daily trips between pairs of municipalities as a measure of the economic relatedness between said municipalities. We say that there is a daily trip between municipalities $i$ and $j$ if a user has tweeted in places $i$ and $j$ consecutively within the same day. In our dataset we find 1.9 million trips by 0.22 million users. With those trips we construct the daily mobility flux network $T_{ij}$ between municipalities as the number of trips between places $i$ and $j$ (see figure 1B). Remarkably, the statistical properties of trips and of the mobility matrix $T_{ij}$ coincide with those of other mobility datasets (see SI section 2): for example, trip distance $r$ and elapsed time $\delta t$ are power-law distributed, $P(r) \sim r^{-1.67}$ and $P(\delta t) \sim \delta t^{-0.62}$, with exponents very similar to those found in the literature [9, 23]. Moreover, the mobility fluxes $T_{ij}$ are well described by the Gravity Law ($R^2 = 0.80$) [28]

$$T_{ij} \simeq T^{\mathrm{grav}}_{ij} = \frac{P_i^{\alpha_i} P_j^{\alpha_j}}{d_{ij}^{\beta}} \qquad (1)$$

where $P_i$ and $P_j$ are the populations of municipalities $i$ and $j$, and $d_{ij}$ is the distance between them. The exponents in (1) are also very similar to those reported in other works: $\alpha_i \simeq \alpha_j = 0.48$ and $\beta \simeq 1.05$ [23, 29]. These results suggest that the mobility detected from geo-located tweets is a good proxy of human mobility within and between municipalities [30].
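One common way to estimate the exponents of a gravity law is ordinary least squares on the logarithm of the fluxes. The fitting procedure actually used is not described here, so the following is only a sketch under that assumption; it also adds a free multiplicative constant that does not appear in (1). All function names and the synthetic data are hypothetical.

```python
import math

def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_gravity(flows):
    """Fit log T = c + a_i log P_i + a_j log P_j - beta log d by least squares.

    flows: iterable of (T_ij, P_i, P_j, d_ij), all strictly positive
    (pairs with zero flux must be dropped before taking logarithms).
    """
    rows, y = [], []
    for t, p_i, p_j, d in flows:
        rows.append([1.0, math.log(p_i), math.log(p_j), math.log(d)])
        y.append(math.log(t))
    k = 4
    # Normal equations (X^T X) coef = X^T y.
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    log_c, alpha_i, alpha_j, neg_beta = solve_linear(xtx, xty)
    return {"log_c": log_c, "alpha_i": alpha_i,
            "alpha_j": alpha_j, "beta": -neg_beta}

# Synthetic, noise-free fluxes generated with the exponents quoted in the text.
populations = [1e3, 1e4, 1e5]
distances = [10.0, 50.0, 200.0]
toy_flows = [
    (p_i ** 0.48 * p_j ** 0.48 / d ** 1.05, p_i, p_j, d)
    for p_i in populations for p_j in populations for d in distances
]
fit = fit_gravity(toy_flows)
```

On this noise-free toy input the fit recovers the generating exponents; on real fluxes the residual scatter is what the reported $R^2$ summarizes.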

We use the network of daily fluxes between municipalities $T_{ij}$ to detect the geographical communities of economic activity. To this end we partition the mobility network $T_{ij}$ using standard graph community-finding algorithms. This technique has been applied extensively, especially with mobile phone data, to unveil the effective maps of countries based on mobility and/or social interactions of people [31–33]. In our case we have used the Infomap algorithm [34] and found 340 different communities within Spain. For further details about the comparison among different state-of-the-art community detection algorithms executed on the inter-city graph, see SI section 3. The average number of municipalities per community is 21, and the largest community contains 142 municipalities. The communities detected have very interesting features (see SI section 3): (i) they are geographically cohesive (see figure 1), (ii) they are statistically robust against random removal of trips in our database (SI table S2), and (iii) the modularity of the partition is very high ($\simeq 0.76$, see SI table S3). Finally, (iv) the partition found has some overlap (77% of Normalized Mutual Information, NMI, see [35]) with coarser administrative boundaries like provinces (regions) (see SI section 3 for details). Interestingly, it shows a larger overlap (83% of NMI) with comarcas (counties), areas in Spain that reflect geographical and economic relations between municipalities. This result shows that the mobility detected from geo-located tweets and the communities obtained are a good description of economic areas.
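The overlap figures above compare two partitions of the same set of municipalities. A minimal sketch of one standard NMI definition, $2 I(A;B) / (H(A) + H(B))$, is given below; whether this exact normalization matches the one used in [35] is an assumption, and the function name and toy labels are hypothetical.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitions of the same items.

    labels_a[k] and labels_b[k] are the group labels of item k in each
    partition. Returns a value in [0, 1]; 1 means identical partitions.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mutual = sum(
        c / n * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    if h_a + h_b == 0:
        return 1.0  # both partitions put every item in a single group
    return 2 * mutual / (h_a + h_b)

# Toy check: four municipalities, community labels vs. two candidate groupings.
communities = [0, 0, 1, 1]
same_grouping = ["x", "x", "y", "y"]      # same split, different label names
unrelated_grouping = ["x", "y", "x", "y"]  # split carries no shared information
```

Here `nmi(communities, same_grouping)` evaluates to 1 and `nmi(communities, unrelated_grouping)` to 0, the two extremes between which the 77% and 83% overlaps fall.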

In the rest of the paper we restrict our analysis to the geographical areas defined by the communities detected by Infomap (see figure 1). For statistical reasons we discard communities formed by fewer than 5 municipalities. Despite this sampling, 96% of the total country population is considered in our analysis. Our results in the rest of the paper also hold for municipalities, counties, or provinces, though with lower statistical power (see SI section 9).

2 Social media behavioral fingerprints

The goal of this work is to quantify how and what behavioral features can be extracted from social media and then related back to the economic level of cities. To this end we define four groups of measures that have been widely explored in other fields like economics or the social sciences. These four types of measures rely on the identification of the place where users live. Instead of using information in the user profile, we analyze the places where the user has tweeted and set as the user's hometown the municipality where he/she has tweeted with the highest frequency, a method usually employed with mobile phone and social media data [11, 23]. To this end, we select those users with more than 5 geo-located tweets in our period who have posted at least 40% of their tweets in a given municipality, which we consider their hometown. After this filtering we end up with 0.32 million users, and we can then define the Twitter population $\pi_i$ in area $i$ as the number of users with their hometown within area $i$. We obtain a very high correlation between $\pi_i$ and the population of the cities $P_i$ in the national census, $\rho = 0.977$ [0.976, 0.978], which provides an indirect validation of our approach with the present data. However, not all demographic groups are equally represented in our Twitter database. As shown in SI section 4, Twitter user demographics in Spain obtained from surveys [36] show that age groups above 44 years old are under-represented. Thus our results mainly describe the socio-economic status of people below 44 years old. The employment analysis is then performed for different age groups: unemployment for people below 25 years old, between 25 and 44 years old, and older than 44 years old. Finally, we have chosen the unemployment reported officially at the end of our observation time window (June 2013), but our results are not affected by the month selected; see SI section 7.
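The hometown rule described above (more than 5 geo-located tweets, at least 40% of them in the most frequent municipality) can be sketched as follows. The input format, function name, and toy records are hypothetical; only the two thresholds come from the text.

```python
from collections import Counter

def assign_hometowns(tweets, min_tweets_exclusive=5, min_share=0.4):
    """Map each qualifying user to the municipality where they tweet most.

    tweets: iterable of (user_id, municipality) pairs, one per geo-located tweet.
    A user qualifies with more than `min_tweets_exclusive` tweets and at least
    `min_share` of them posted in their most frequent municipality.
    """
    per_user = {}
    for user, municipality in tweets:
        per_user.setdefault(user, Counter())[municipality] += 1

    hometowns = {}
    for user, counts in per_user.items():
        total = sum(counts.values())
        municipality, top = counts.most_common(1)[0]
        if total > min_tweets_exclusive and top / total >= min_share:
            hometowns[user] = municipality
    return hometowns

toy_tweets = (
    [("u1", "Madrid")] * 5 + [("u1", "Getafe")] * 2      # 7 tweets, 71% Madrid
    + [("u2", "Sevilla")] * 3                              # too few tweets
    + [("u3", m) for m in ["A", "B", "C", "A", "B", "C"]]  # 6 tweets, top share 33%
)
hometowns = assign_hometowns(toy_tweets)

# Twitter population per area: number of users whose hometown is that area.
twitter_population = Counter(hometowns.values())
```

Dividing `twitter_population[i]` by the census population of area $i$ then gives the penetration rate used later in the text.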

Figure 2. Examples of different behavior in the observed variables and the unemployment. In A we observe that two cities with different unemployment levels have different temporal activity patterns. Panel C shows how communities (red) with distinct entropy levels of social communication with other communities (blue) may hold different unemployment intensity: the left map shows a highly focused communication pattern (low entropy), while the right map corresponds to a community with a diverse communication pattern (high entropy). Finally, panel B shows some examples of detected misspellings in our database, using 618 incorrect expressions (see SI section 6) such as “Con migo”, “Aver” or “llendo”.

For every considered region we investigate the officially reported unemployment for different age groups and a number of metrics related to social media activity. Some of those metrics are already reported in the literature, while others are introduced in this work. Specifically, we consider:

• Social media technology adoption: we can use the Twitter penetration rate $\tau_i = \pi_i / P_i$ in each area $i$ as a proxy of technology adoption. Recent works have found a positive correlation between $\tau_i$ and GDP at the country level [23]. However, in our data we find the opposite correlation (see figure 3), namely that the larger the penetration rate, the higher the unemployment, which suggests that the impact of technology adoption at country scale is different from what happens within an (industrialized) country, where technology to access social media is commoditized.

• Social media activity: regions with very different economic situations should exhibit different patterns of activity during the day. Since working, leisure, family, shopping, etc. activities happen at different times of the day, we might observe different daily patterns in regions with different socio-economic status. For example, we hypothesize that communities with low levels of unemployment will tend to have higher activity levels at the beginning of a typical weekday. This is indeed what we find: figure 2A shows the hourly fraction of tweets during workdays for two communities with very different unemployment rates. As we can observe, both profiles are quite different, and in the case of low unemployment we find a strong peak of activity between 8 and 11am (morning) and lower activity during the afternoons and nights. We encode this finding in $\nu^{\mathrm{mrng}}_i$, $\nu^{\mathrm{aftn}}_i$ and $\nu^{\mathrm{ngt}}_i$, the total fraction of tweets happening in geographical area $i$ between 8am and 10am, 3pm and 5pm, and 12am and 3am, respectively. Figure 3 shows a strong negative correlation between $\nu^{\mathrm{mrng}}_i$ and the unemployment for the communities in our database, and a positive correlation with $\nu^{\mathrm{aftn}}_i$ and $\nu^{\mathrm{ngt}}_i$.

• Social media content: some works have observed a correlation between the frequency of words related to work conditions [22] or of forward-looking searches [21] and the economic situation of countries. In our case we also find a moderate positive correlation between the fraction of tweets $\mu_i$ mentioning job or unemployment terms and the observed unemployment, while the correlation is negative for the number of mentions of employment or the economy. However, we have also tried a different approach, measuring the relation between the way of writing and the educational level [37]. To this end we build a list of 618 misspelled Spanish expressions and extract the tweets of the dataset containing at least one of these expressions (see SI section 6 for further details about how they were collected). We only consider tweets in Spanish, detected with an n-gram based algorithm, and only misspellings that cannot be justified as abbreviations. Finally, we compute for every region the proportion $\varepsilon_i$ of misspellers among the Twitter population. If the fraction of misspellers per geographical area is a proxy for the educational level of that region, we expect a positive correlation between $\varepsilon_i$ and unemployment. Indeed, we find (see figure 3) a strong correlation between the fraction of misspellers and unemployment.

• Social media interactions and geographical flow diversity: following the ideas in [14], which correlated the economic development of an area with the diversity of its communications with other areas, we consider all tweets mentioning another user and take them as a proxy for communication between users. We then compute the number of communications $w_{ij}$ between areas $i$ and $j$ as the number of mentions between users in those areas. To measure the diversity we use, as in [14], the informational normalized entropy (Entropy 1) $S^u_i = -\sum_j p_{ij} \log p_{ij} / S^r_i$, where $p_{ij} = w_{ij} / \sum_j w_{ij}$ and (Entropy 2) $S^r_i = \log k_i$, with $k_i$ the number of different areas with which users in area $i$ have interacted. As in [14], we find that areas with large unemployment have less diverse communication patterns than areas with low unemployment. This translates into a strong negative correlation between $S^u_i$ and the unemployment, see figure 3. Similar ideas are applied to the flows of people between areas, to investigate the diversity of the geographical flows through the entropy $\tilde{S}^u_i = -\sum_j p_{ij} \log p_{ij} / \tilde{S}^r_i$, where now $p_{ij} = T_{ij} / \sum_j T_{ij}$ and $\tilde{S}^r_i = \log k_i$, with $k_i$ the number of different areas visited by users that live in area $i$. Figure 3 shows that, as in [19], the correlation of these geographical entropies with economic development is low.
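The normalized entropy used for both the social and the geographical diversity can be sketched in a few lines. The input is the list of weights from one area to every other area (mention counts, or trip counts for the geographical version); the function name, the handling of areas with fewer than two contacts, and the toy numbers are assumptions made for illustration.

```python
import math

def normalized_entropy(weights):
    """Shannon entropy of the weight shares divided by log(k).

    weights: non-negative interaction counts from one area to other areas.
    k is the number of areas with a strictly positive count. Areas with
    fewer than two contacts get 0.0 here, since log(1) = 0 would make the
    ratio undefined (this convention is an assumption).
    """
    positive = [w for w in weights if w > 0]
    k = len(positive)
    if k < 2:
        return 0.0
    total = sum(positive)
    entropy = -sum(w / total * math.log(w / total) for w in positive)
    return entropy / math.log(k)

diverse_area = [25, 25, 25, 25]   # mentions spread evenly over four areas
focused_area = [97, 1, 1, 1]      # almost all mentions go to a single area

s_diverse = normalized_entropy(diverse_area)
s_focused = normalized_entropy(focused_area)
```

An evenly spread pattern gives a value of 1, while the focused pattern gives a value close to 0, matching the low-entropy and high-entropy maps described for figure 2C.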

Normalization of variables is discussed in SI section 5. We have also studied the correlation between the variables considered. As expected, variables in each group show moderate correlations between them. However, the inspection of the correlation matrix and a Principal Component Analysis of the variables show that there is information (as percentage of variance in the data) in each of the groups of variables, see SI section 5. Because of these two facts, we restrict our analysis to the variables within each group with the highest correlation with the unemployment, namely the penetration rate $\tau_i$, the social and mobility diversity variables $S^u_i$ and $\tilde{S}^u_i$, the morning activity $\nu^{\mathrm{mrng}}_i$, the fraction of misspellers $\varepsilon_i$, and the fraction of employment-related tweets $\mu^{\mathrm{emp}}_i$.

3 Explanatory power of social media in unemployment

The four previous groups of variables are fingerprints of human behavior reflected in Twitter usage habits. As we observed in figure 3, all of them exhibit statistically strong correlations with unemployment. The question we address in this section is whether those variables suffice to explain the observed unemployment (their explanatory power), and which of them are the most important (which give more explanatory power than others). Note that we are not claiming a causal link between the measures built in the previous section and the unemployment rate, but only exploring whether they can be used as alternative indicators with a real translation in the economy.

Figure 4 shows the result of a simple linear regression model for the observed unemployment for ages below 25 years as a function of the variables most correlated with the unemployment. The model has a significant $R^2 = 0.62$, showing that a large part of the variation in unemployment can be explained by the behavioral variables extracted from Twitter. However, not all the variables weigh equally in the model: specifically, the penetration rate, geographical diversity, morning activity, and fraction of misspellers account for up to 92% of the explained variance, while social diversity and the number of employment-related tweets are not statistically significant (see SI section 10 for the methods used to determine the relative importance of the variables). It is interesting to note that while social diversity obtained from mobile phone communications was a key variable in the explanation of deprivation indexes in [14, 19], the communication diversity of Twitter users seems to have a minor role in the explanation of the heterogeneity of unemployment in Spain.

Similar explanatory power is found for other age groups: $R^2 = 0.44$ for all ages and $R^2 = 0.52$ for ages between 25 and 44 years. However, the model degrades for ages above 44 years ($R^2 = 0.26$), which suggests that our variables mainly describe the behavior of the most represented age groups in Twitter, namely those below 44 years old. On the other hand, since our Twitter variables seem to describe the behavior of young people, we have investigated whether Twitter-constructed variables have similar explanatory value (in terms of $R^2$) to simple census demographic variables for young people. Regression models that add the young population rate yield only a minor improvement, $R^2 = 0.65$, while the young population rate alone gives $R^2 = 0.24$, a result which shows that Twitter variables do indeed possess genuine explanatory power beyond their simple demographic representation. Finally, our model has the largest explanatory power for the detected communities, but large $R^2$ values are also found for other geographical areas like counties ($R^2 = 0.54$) and provinces ($R^2 = 0.65$), see SI section 9.
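The SI regression tables list both $R^2$ and adjusted $R^2$. As a reading aid, the sketch below applies the standard adjustment, $1 - (1 - R^2)(n - 1)/(n - k - 1)$, to the rounded $R^2$ values and numbers of points in SI Table 6. That this is the exact formula behind the tables is an assumption, and the predictor counts $k$ are read off the non-empty rows of each column.

```python
def adjusted_r2(r2, n_points, n_predictors):
    """Adjusted R^2 for a linear model with an intercept.

    r2: ordinary coefficient of determination.
    n_points: number of observations (geographical areas).
    n_predictors: number of explanatory variables, excluding the intercept.
    """
    return 1.0 - (1.0 - r2) * (n_points - 1) / (n_points - n_predictors - 1)

# (R^2, number of areas, number of predictors) per column of SI Table 6.
communities = adjusted_r2(0.64, 128, 6)
counties = adjusted_r2(0.55, 198, 6)
provinces = adjusted_r2(0.65, 50, 5)   # misspellers rate removed in this model

print(round(communities, 2), round(counties, 2), round(provinces, 2))
```

The penalty grows as the number of areas shrinks, which is why the provinces model, with only 50 points, loses more from the adjustment than the communities or counties models.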

4 Discussion

Figure 3. A) Correlation coefficient of all the extracted Twitter metrics, grouped by technology adoption (black), geographical diversity (orange), social diversity (light blue), temporal activity (green), and content analysis (dark blue). Error bars correspond to 95% confidence intervals of the correlation coefficient. The gray area corresponds to the statistical significance thresholds. Panels B, C, D, and E show the values of four selected variables (penetration rate, entropy 1 (social), activity (morning), and misspellers rate) in each geographical community against its percentage of unemployment. The size of the points is proportional to the population of each geographical community. Solid lines correspond to linear fits to the data.

Figure 4. A) and B) Performance of the model, showing the predicted versus the observed unemployment rate for ages below 25 ($R^2 = 0.62$) and for ages between 25 and 44 ($R^2 = 0.52$). Dashed lines correspond to the equality line and ±20% error. C) Percentage of weight for each of the variables in the regression model (penetration rate, entropy 1 (geo), entropy 1 (social), activity (morning), misspellers rate, and employment tweets), using the relative weight of the absolute values of the coefficients in the regression model (see SI section 10). Variables marked with ∗ are not statistically significant in the model.

This work serves as a proof of concept for how a wide range of behavioral features linked to socioeconomic behavior can be inferred from the digital traces left by publicly available social media. In particular, we demonstrate that behavioral features related to unemployment can be recovered from the digital exhaust left by the microblogging network Twitter. First of all, Twitter geolocalized traces, together with off-the-shelf community detection algorithms, render an optimal partition of a country for economical activity, showing the remarkable power of social media to understand and unveil economical behavior at a country scale. This insight is likely to apply to other administrative definitions in other countries, especially when considering large cities with an inherently dynamic nature and evolving mobility fluxes, and cities composed of small satellite cities with arbitrary agglomerations or divisions among them (e.g., London, NYC, Singapore).

This result is unsurprising: it should be natural to recompute city clusters/communities of activity based on their real-time mobility, which may vary considerably faster than the update rates of mobility and travel surveys [31–33].

Our main result demonstrates that several key indicators (different penetration rates among regions, fingerprints of the temporal patterns of activity, content lexical correctness, and geo-social connectivities among regions) can be extracted from social media and then used to infer unemployment levels. These findings shed light in two directions: first, on how individuals' extensive use of their social channels allows us to characterize cities based on their activity in a meaningful fashion; and second, on how this information can be used to build economic indicators that are directly related to the economy. Regarding the latter, our work is important for understanding how country-scale analysis of social media should consider not only the demographic but also the economical differences between users. As we have shown, users in areas with large unemployment have different mobility, different social interactions, and different daily activity than those in low-unemployment areas. This intertwined relationship between user behavior and employment should be considered not only in economical analyses derived from social media, but also in other applications such as marketing, communication, social mobilization, etc.

It is particularly remarkable that Twitter data can provide such accurate results. Twitter is, among the many currently popular social networking platforms, perhaps the noisiest, sparsest, and most 'sabotaged' medium: very few users send out messages at a regular rate; most users do not have geolocated information; the social relationships (followers/followers) contain many unused or unimportant links; it is plagued by spam bots; and, last but not least, we have no way to identify the motive, goal, or functionality of the mobility fluxes we are able to extract. These limitations are not particular to our sample, but general to the Twitter data samples employed in the computational social science community. Despite all these caveats, we are able to show that even simple filtering techniques, together with basic statistical regressions, yield predictive power about a variable as important as unemployment. Other social media platforms, such as Facebook, Google+, Sina Weibo, Instagram, Orkut, or Flickr, with more granular and consistent individual data, are likely to provide similar or better results by themselves or in combination. Further improvements can be obtained by the use of more sophisticated statistical machine learning techniques, some of them even tailored to the peculiarities of social media data. Our work serves to illustrate the tremendous potential of these new digital datasets to improve the understanding of society's functioning at the finer scales of granularity.

The usefulness of our approach must be considered against the cost and update rate of performing detailed surveys of mobility, social structure, and economic performance. Our database is publicly articulated, which means that our analysis could be replicated easily in other countries, other time periods, and with different scopes. Naturally, surveys provide more accurate results, but they also consume considerably more financial and human resources, employing hundreds of people and taking months, even years, to complete and be released; they are so costly that countries going through economic recession have considered discontinuing them or altering their update rate in recent times. A particularly problematic aspect of these surveys is that they are "out of sync", i.e., the census may be up to date whereas those same individuals' travel surveys may not be, and therefore drawing inferences between both may be particularly difficult. This is a challenging problem that the immediacy of social media can help ameliorate.

A few questions remain open for further investigation. How can traditional surveys and social media digital traces be best combined to maximize their predictive ability? Can social media provide a reliable leading indicator for unemployment and, in general, for economic surveys? How much reliable lead is possible, if any at all? As we have found, Twitter penetration and educational levels are correlated with unemployment, but these levels are unlikely to change rapidly enough to describe or anticipate changes in the economy or unemployment. However, other indicators, like daily activity, social interactions, and geographical mobility, are more connected with our daily activity and perhaps have more predictive power to show and/or anticipate sudden changes in employment. The relationship between unemployment and individual and group behavior may help contextualize the multiple factors affecting the socioeconomic well-being of a region: while penetration, content, daily activity, and mobility diversity seem to be highly correlated with unemployment in Spain, different weights for each group of traces might be expected in other countries [14]. Finally, digital traces could serve as an alternative (sometimes the only one available) to the lack of surveys in poor or remote areas [20, 27]. Another interesting avenue of research involves the use of social media to detect mismatches between the real (hidden, underground) economy and the officially reported one [38].

Most importantly, the immediacy of social media may also allow governments to better measure and understand the effect of policies, social changes, and natural or man-made disasters on the economical status of cities in almost real time [18, 39]. These new avenues for research provide great opportunities at the intersection of the economic, social, and computational sciences that originate from these new, widespread, inexpensive datasets.

Acknowledgments

We would like to thank Kristina Lerman, Lada Adamic, James Fowler, Daniel Villatoro, and Ricardo Herranz for stimulating discussions, and Yuri Kryvasheyeu and Thomas Bochynek for their critical reading of the manuscript. This work was partially supported by Spanish Ministry of Science and Technology Grant FIS2013-47532-C3-3-P (to A. L., M. G. H., and E. M.). Manuel Cebrian is funded by the Australian Government as represented by the Department of Broadband, Communications and Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

References

[1] Becker, G. S. (1976) The economic approach to human behavior (University of Chicago Press).

[2] Granovetter, M. (1985) Economic action and social structure: the problem of embeddedness. American Journal of Sociology pp. 481–510.

[3] Camerer, C. F., Loewenstein, G. & Rabin, M. (2011) Advances in behavioral economics (Princeton University Press).

[4] Glaeser, E. L., Kallal, H. D., Scheinkman, J. A. & Shleifer, A. (1991) Growth in cities (National Bureau of Economic Research), Technical report.

[5] Bettencourt, L. M., Lobo, J., Helbing, D., Kuhnert, C. & West, G. B. (2007) Growth, innovation, scaling, and the pace of life in cities. Proceedings of the National Academy of Sciences 104, 7301–7306.

[6] Batty, M. (2008) The size, scale, and shape of cities. Science 319, 769–771.

[7] Milgram, S. (1974) The experience of living in cities. Crowding and behavior 167, 41.

[8] Pan, W., Ghoshal, G., Krumme, C., Cebrian, M. & Pentland, A. (2013) Urban characteristics attributable to density-driven tie formation. Nature Communications 4.

[9] Gonzalez, M. C., Hidalgo, C. A. & Barabasi, A.-L. (2008) Understanding individual human mobility patterns. Nature 453, 779–782.

[10] Calabrese, F., Diao, M., Lorenzo, G. D., Jr., J. F. & Ratti, C. (2013) Understanding individual mobility patterns from urban sensing data: A mobile phone trace example. Transportation Research Part C: Emerging Technologies 26, 301–313.

[11] Cheng, Z., Caverlee, J., Lee, K. & Sui, D. Z. (2011) Exploring Millions of Footprints in Location Sharing Services (AAAI, Menlo Park, CA, USA).

[12] Cho, E., Myers, S. A. & Leskovec, J. (2011) Friendship and mobility: user movement in location-based social networks, KDD '11 (ACM, New York, NY, USA), pp. 1082–1090.

[13] Sun, L., Jin, J. G., Axhausen, K. W., Lee, D.-H. & Cebrian, M. (2014) Quantifying long-term evolution of intra-urban spatial interactions. arXiv preprint arXiv:1407.0145.

[14] Eagle, N., Macy, M. & Claxton, R. (2010) Network diversity and economic development. Science 328, 1029–1031.

[15] Henrich, J., Boyd, R., Bowles, S., Camerer, C., Fehr, E., Gintis, H. & McElreath, R. (2001) In search of homo economicus: behavioral experiments in 15 small-scale societies. American Economic Review pp. 73–78.

[16] Krieger, N., Williams, D. R. & Moss, N. E. (1997) Measuring social class in US public health research: concepts, methodologies, and guidelines. Annual Review of Public Health 18, 341–378.

[17] Groves, R. M., Fowler Jr., F. J., Couper, M. P., Lepkowski, J. M., Singer, E. & Tourangeau, R. (2013) Survey methodology (John Wiley & Sons).

[18] Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M. et al. (2009) Life in the network: the coming age of computational social science. Science (New York, NY) 323, 721.

[19] Smith, C., Quercia, D. & Capra, L. (2013) Finger on the pulse: identifying deprivation using transit flow analysis (ACM), pp. 683–692.

[20] Soto, V., Frias-Martinez, V., Virseda, J. & Frias-Martinez, E. (2011) Prediction of socioeconomic levels using cell phone records, UMAP'11 (Springer-Verlag, Berlin, Heidelberg), pp. 377–388.

[21] Preis, T., Moat, H. S., Stanley, H. E. & Bishop, S. R. (2012) Quantifying the advantage of looking forward. Scientific Reports 2.

[22] Antenucci, D., Cafarella, M., Levenstein, M. C., Re, C. & Shapiro, M. D. (2014) Using social media to measure labor market flows (National Bureau of Economic Research), Technical report.

[23] Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P. & Ratti, C. (2013) Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680.

[24] Lathia, N., Quercia, D. & Crowcroft, J. (2012) in Pervasive computing (Springer), pp. 91–98.

[25] Frias-Martinez, V., Virseda, J. & Frias-Martinez, E. (2010) Socio-economic levels and human mobility.

[26] Gutierrez, T., Krings, G. & Blondel, V. D. (2013) Evaluating socio-economic state of a country analyzing airtime credit and mobile phone datasets. arXiv preprint arXiv:1309.4496.

[27] Smith, C., Mashhadi, A. & Capra, L. (2013) Ubiquitous sensing for mapping poverty in developing countries. Paper submitted to the Orange D4D Challenge.

[28] Erlander, S. & Stewart, N. F. (1990) The gravity model in transportation analysis: theory and extensions (Vsp) Vol. 3.

[29] Simini, F., Gonzalez, M. C., Maritan, A. & Barabasi, A.-L. (2012) A universal model for mobility and migration patterns. Nature 484, 96–100.

[30] Lenormand, M., Picornell, M., Cantu-Ros, O. G., Tugores, A., Louail, T., Herranz, R., Barthelemy, M., Frias-Martinez, E. & Ramasco, J. J. (2014) Cross-checking different sources of mobility information. arXiv preprint arXiv:1404.0333.

[31] Barthelemy, M. (2011) Spatial networks. Physics Reports 499, 1–101.

[32] Expert, P., Evans, T. S., Blondel, V. D. & Lambiotte, R. (2011) Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences 108, 7663–7668.

[33] Sobolevsky, S., Szell, M., Campari, R., Couronne, T., Smoreda, Z. & Ratti, C. (2013) Delineating geographical regions with networks of human interactions in an extensive set of countries. PloS one 8, e81707.

[34] Rosvall, M. & Bergstrom, C. T. (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105, 1118–1123.

[35] Danon, L., Diaz-Guilera, A., Duch, J. & Arenas, A. (2005) Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment 2005, P09008.

[36] ADigital (2013) Uso de Twitter en España 2012. [Online; accessed 1-November-2014].

[37] Davenport, J. R. & DeLine, R. (2014) The readability of tweets and their geographic correlation with education. arXiv preprint arXiv:1401.6058.

[38] Schneider, F., Buehn, A. & Montenegro, C. E. (2011) Shadow economies all over the world: New estimates for 162 countries from 1999 to 2007. Handbook on the shadow economy pp. 9–77.

[39] Rutherford, A., Cebrian, M., Dsouza, S., Moro, E., Pentland, A. & Rahwan, I. (2013) Limits of social mobilization. Proceedings of the National Academy of Sciences 110, 6281–6286.


Supporting Information for
Social media fingerprints of unemployment
Alejandro Llorente, Manuel García-Herranz, Manuel Cebrian, and Esteban Moro

S1 The dataset

Twitter provides an extremely rich and publicly available data set of user interactions, information flows, and, thanks to the geo-location of tweets, user movements. Nevertheless, the representativeness of geo-located Twitter as a global source of mobility data has so far received sparse attention. In this sense, while [13] present a promising and extensive study of global country-to-country movements (mostly driven by tourism), within-country human flows (comprising not only internal tourism but also, to a greater extent than country-to-country travel, visiting and commuting) still need further investigation. Therefore, throughout this work we compare our findings using geo-located Twitter with similar studies using commuting surveys.

For the Twitter analysis, we consider almost 146 million geo-located Twitter messages (tweets) collected through the public API provided by Twitter for the continental part of Spain, from 29th November 2012 to 10th April 2013. In this dataset, we consider that there has been a trip from place l to place k if a user has tweeted in place l and place k consecutively. We only keep those transitions in which the first tweet and the second one are dated on the same day. We filter the trips database to avoid unrealistic transitions and keep only trips with a geographical displacement larger than 1 km (see Methods section). By this method, 138 million of trips from 167376 different users are considered in our work.
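The trip definition above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout `(user_id, timestamp, lat, lon)`, the function names, and the use of the haversine distance are assumptions, and only the two filters stated explicitly (same day, displacement above 1 km) are applied; the additional filtering of unrealistic transitions mentioned in the Methods is not reproduced.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    p1, p2 = radians(lat1), radians(lat2)
    a = sin((p2 - p1) / 2) ** 2 + cos(p1) * cos(p2) * sin(radians(lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def extract_trips(tweets, min_km=1.0):
    """Return (user, origin, destination) for consecutive same-day tweets more than min_km apart."""
    by_user = {}
    for user, ts, lat, lon in tweets:
        by_user.setdefault(user, []).append((ts, lat, lon))
    trips = []
    for user, rows in by_user.items():
        rows.sort(key=lambda r: r[0])  # chronological order per user
        for (t0, la0, lo0), (t1, la1, lo1) in zip(rows, rows[1:]):
            if t0.date() == t1.date() and haversine_km(la0, lo0, la1, lo1) > min_km:
                trips.append((user, (la0, lo0), (la1, lo1)))
    return trips

# Toy records (hypothetical users and coordinates).
tweets = [
    ("u1", datetime(2013, 1, 15, 9, 0), 40.4168, -3.7038),   # Madrid
    ("u1", datetime(2013, 1, 15, 18, 0), 40.3272, -3.7635),  # Leganes, same day
    ("u1", datetime(2013, 1, 16, 10, 0), 39.8628, -4.0273),  # Toledo, next day: no trip
    ("u2", datetime(2013, 1, 15, 9, 0), 40.4168, -3.7038),
    ("u2", datetime(2013, 1, 15, 9, 5), 40.4170, -3.7040),   # a few metres away: no trip
]
trips = extract_trips(tweets)
```

Only the Madrid to Leganés transition of user `u1` survives both filters in this toy input.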

From those trips we construct the mobility flow $T_{ij}$ between municipalities, which measures the number of trips in our database whose origin lies within the boundaries of city i and whose destination lies within those of city j.
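Counting the flows amounts to a tally over (origin city, destination city) pairs. The sketch below assumes a hypothetical `city_of` function that maps a location to a municipality identifier (in practice a point-in-polygon lookup against municipal boundaries); here a dictionary lookup stands in for it.

```python
from collections import Counter

def mobility_flows(trips, city_of):
    """Count trips per ordered (origin city, destination city) pair.

    trips   -- iterable of (origin, destination) locations
    city_of -- callable returning a municipality id, or None if the location is unassigned
    """
    flows = Counter()
    for origin, dest in trips:
        i, j = city_of(origin), city_of(dest)
        if i is not None and j is not None:
            flows[(i, j)] += 1
    return flows

# Toy example: place labels stand in for coordinates (hypothetical data).
lookup = {"p1": "Madrid", "p2": "Leganes", "p3": "Madrid"}
trips = [("p1", "p2"), ("p3", "p2"), ("p2", "p1"), ("p1", "p9")]
flows = mobility_flows(trips, lookup.get)
```

Because the pairs are ordered, `flows[("Madrid", "Leganes")]` and `flows[("Leganes", "Madrid")]` are tallied separately, matching a flow matrix that need not be symmetric.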

We also consider population and economical information about the municipalities from the Spanish Census (2011) [8] and unemployment figures from the Public Service of Employment (Servicio Público de Empleo Estatal, SEPE) [7]. In the latter case, registered unemployment (in number of persons) is given for each Spanish municipality by gender, age, and month. To get unemployment rates, we divide registered unemployment by the total workforce in the municipality, estimated as the number of people aged between 16 and 65 years.

S2 Twitter as mobility proxy

Considering all of the available transitions in our database, one can compute the distance between origin and destination, the elapsed time of the transition, and the number of trips per user, among many other statistics. All of them seem to follow a power-law distribution with a cutoff due to the finite spatial size of Spain and the constraint of considering only transitions where the origin and destination check-ins are made on the same day. Focusing on the log-linear part of the distributions, self-similar behaviors arise when Twitter-based mobility is analyzed (see figure 5).

Twitter-based inter-city flows can be well modelled by means of the Gravity Law, which is one of the most widespread methods to represent human mobility [1, 19], with applications in many fields like urban planning [23], traffic engineering [4], or transportation problems [9]. The Gravity Law is also the solution to the problem of maximizing the entropy of the particle distribution among all the possible trips using statistical mechanics techniques [2, 22]. Recently, it has also been used as a model for human mobility based on cell phone traces [10, 20, 21] and social media data at a global scale [13] and at the inter-city level [14].

The Gravity Model for human mobility assumes that the flows between cities can be explained by the expression

$$T^{\mathrm{grav}}_{ij} = \frac{P_i^{\alpha_1} P_j^{\alpha_2}}{d_{ij}^{\beta}} \tag{2}$$

where $T^{\mathrm{grav}}_{ij}$ is the flow, in terms of number of people, between cities i and j, $d_{ij}$ is the geographical distance, and $P_i$ and $P_j$ are the populations of the respective cities.

Given the data, we can obtain the parameters of the model by Weighted Least Squares minimization

$$\alpha_1^{*}, \alpha_2^{*}, \beta^{*} = \arg\min_{\alpha_1, \alpha_2, \beta} \frac{1}{N} \sum_{ij} w_{ij} \left( T_{ij} - T^{\mathrm{grav}}_{ij} \right)^2 \tag{3}$$

where N is the total number of connections in the mobility graph andwi j is a weight proportional to the number of observed transitionsbetween i and j In particular we find that taking wi j = T 13

i j givesthe best performance in the model
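As a rough illustration of fitting a gravity law, the sketch below uses a log-linearized weighted least-squares fit, which is a simplification swapped in for brevity: it minimizes weighted squared errors in $\log T_{ij}$ rather than in $T_{ij}$ itself, so it is not the minimization of equation (3). The function name, the synthetic data, and the default uniform weights are assumptions made for the example.

```python
import numpy as np

def fit_gravity_loglinear(T, P_i, P_j, d, w=None):
    """Fit log T = a1*log(P_i) + a2*log(P_j) - b*log(d) by weighted least squares.

    Returns (alpha1, alpha2, beta). All flows must be strictly positive.
    NOTE: this is a log-space simplification, not the direct fit of equation (3).
    """
    T, P_i, P_j, d = (np.asarray(x, dtype=float) for x in (T, P_i, P_j, d))
    w = np.ones_like(T) if w is None else np.asarray(w, dtype=float)
    A = np.column_stack([np.log(P_i), np.log(P_j), -np.log(d)])
    sw = np.sqrt(w)  # scaling rows by sqrt(w) turns weighted LS into ordinary LS
    params, *_ = np.linalg.lstsq(A * sw[:, None], np.log(T) * sw, rcond=None)
    return params

# Synthetic flows generated from known exponents (NOT the paper's data).
rng = np.random.default_rng(1)
n = 500
P_i = rng.uniform(1e3, 1e6, n)
P_j = rng.uniform(1e3, 1e6, n)
d = rng.uniform(5.0, 500.0, n)
T = P_i**0.5 * P_j**0.4 / d**1.2 * np.exp(rng.normal(0.0, 0.1, n))

alpha1, alpha2, beta = fit_gravity_loglinear(T, P_i, P_j, d)
```

Any non-negative per-link weights can be passed through `w`; a direct fit of equation (3) would instead require a nonlinear optimizer.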

In our case, this model fits the inter-city mobility based on Twitter GPS check-ins quite accurately (see table 1). Even though the $T_{ij}$ we consider are not necessarily symmetric, the exponents of the populations are similar, indicating that we observe similar flows in both directions between i and j.

S3 Community structures in inter-city mobility graph

Figure 5. Probability distributions for the different properties of daily trips in the Twitter dataset: trip distance (km), number of trips, and elapsed time (secs). Dashed lines correspond to power-law fits with exponents −1.67, −2.43, and −0.62, respectively.

Typically, complex networks exhibit community structure, that is, there are subsets of nodes that are more densely connected among themselves than to the rest of the nodes. In mobility networks, whose nodes correspond to geographical areas, these communities are interpreted as zones with high common activity and tend to be constrained by geographical and political barriers. We check whether this is also observed in our dataset by applying six state-of-the-art community detection algorithms: FastGreedy [5], Walktrap [16], Infomap [18], MultiLevel [3], Label Propagation [17], and Leading Eigenvector [15]. These six algorithms yield different community structures in terms of number of communities, average community size, or modularity (see table 3). Members (municipalities) of the resulting communities are spatially connected, except in a few cases, as figure 7 shows. We test the statistical robustness of the obtained communities by randomly removing a proportion p of the original links and running the algorithms on the new graph $G_p$. We consider communities to be robust when the communities obtained for the original network G and for $G_p$ are highly similar. To compare two arbitrary community memberships we use the Normalized Mutual Information (NMI) method described in [6], which returns 0 when two memberships are totally different and 1 when they are equal. We compute the NMI for each chosen algorithm applied to G and $G_p$ for p between 1% and 10%, concluding that the obtained community structures are robust, because they are not broken when some randomly chosen links are removed (see table 2).
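The two ingredients of the robustness test, random link removal and NMI between memberships, can be sketched in pure Python. The normalization used here, $2I(A;B)/(H(A)+H(B))$, is one common choice and is assumed rather than taken from [6]; the function names are hypothetical, and the community detection step itself (which would be run on G and $G_p$ with a library of choice) is left out.

```python
import random
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (natural log) of a membership vector."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def nmi(a, b):
    """Normalized mutual information 2*I(a;b) / (H(a) + H(b)) between two memberships."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum((nab / n) * log(nab * n / (ca[x] * cb[y])) for (x, y), nab in cab.items())
    h = entropy(a) + entropy(b)
    return 1.0 if h == 0 else 2.0 * mi / h

def remove_random_links(edges, p, seed=0):
    """Return the edge list with a proportion p of links removed uniformly at random."""
    edges = list(edges)
    keep = len(edges) - round(p * len(edges))
    return random.Random(seed).sample(edges, keep)

same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # identical partitions, relabelled
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # partitions sharing no information
pruned = remove_random_links([(i, i + 1) for i in range(100)], p=0.10)
```

In a full test, the memberships returned by a detection algorithm on G and on the pruned graph would be passed to `nmi`.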

Figure 6. Penetration rates for both cities and detected communities (penetration rate versus population).

As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether the mobility-based communities are related to any of these barriers. In Spain there are different territorial divisions for administrative purposes. In this work we consider two of them: provinces, defined in the 1978 Constitution, which are 48 different heterogeneous aggregations of municipalities; and counties (comarca in Spanish terminology), which are traditional aggregations of municipalities mainly based on Spanish orography (rivers, valleys, ridges, etc.), some of which are composed of municipalities from different provinces. We again use the NMI method to compare the community structure given by the algorithms to the administrative limits. Except for the Leading Eigenvector algorithm, the methods return communities that are closely related to provinces (NMI ≈ 0.7), whereas for the county limits higher variability is observed. In this last case, the algorithm most closely related to county limits is Infomap (NMI ≈ 0.83). Therefore, Twitter-based mobility summarizes the inter-city flows, showing that these flows are influenced by geographical and political barriers.

S4 Twitter demographics and unemployment rates

Different age groups are not equally represented on Twitter. Recent surveys (2012) in Spain suggest that most (86%) Twitter users are 16 to 44 years old. Comparison of the percentage of users per age group with the total population within the same groups (see figure 8) reveals that age groups above 35 years old are under-represented on Twitter. Thus our Twitter data will be more revealing when trying to describe unemployment in age groups below 44 years old. This is indeed what we find when we build a linear model for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 24 can be fitted to a linear model with $R^2 = 0.62$, regression models for unemployment rates for ages between 25 and 44 have $R^2 = 0.52$, while for ages above 44 we get only $R^2 = 0.26$. Table 4 summarizes the results for the regression models of unemployment rates in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, in figure 8 we can see the performance of the model for the different age groups; once again, the poor explanatory power of the Twitter variables for the unemployment rate at ages above 44 years old is evident.

S5 Properties of Twitter variables

Normalization and distributions

Heterogeneity between the values of the variables constructed from Twitter is appreciable but moderate, as the histograms in figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration rate $\tau_i$ and the misspellers rate $\varepsilon_i$ are defined as the number of users or misspellers per 100000 persons (population); activity variables $\nu_i$ are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, $\mu_i$, are also given per 100000 tweets published in the geographical area.

Figure 7. From left to right and from top to bottom: Fastgreedy, Walktrap, Infomap, Multilevel, Label Propagation, and Leading Eigenvector communities on Twitter-based mobility transitions.

Correlation between variables

Variables are constructed to reflect the behavior of areas in the different dimensions of Twitter penetration, social or geographical diversity, activity through the day, and content. The correlation between variables does indeed show that variables within each dimension are strongly correlated with one another. As we can see in figure 10, social and geographical diversities are highly correlated with each other, an expected fact given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate $\tau_i$ and the fraction of misspellers $\varepsilon_i$ have a strong correlation with most of the variables.

High correlation between variables might lead to collinearity effects [24] in the linear regression models, that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value, but its p-value is too high to consider it significant. To test this hypothesis, we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 exhibits the loadings of the different variables. The block structure shown in figure 10 results in similar directions of the variables in the first components of the PCA. We observe some groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we have taken one variable from each of them for our regression models.
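The collinearity check described above can be illustrated with a PCA computed from the singular value decomposition of the standardized variables. The data below are synthetic (two nearly collinear columns and one independent column), and the function name `pca_loadings` is an assumption for the example, not part of the original analysis.

```python
import numpy as np

def pca_loadings(X):
    """PCA of standardized columns via SVD.

    Returns (loadings, explained): loadings[:, k] is the unit vector of component k,
    explained[k] is the fraction of total variance it carries.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return Vt.T, explained

# Synthetic example: columns 0 and 1 are nearly collinear, column 2 is independent.
rng = np.random.default_rng(2)
z = rng.normal(size=340)
X = np.column_stack([z, z + 0.1 * rng.normal(size=340), rng.normal(size=340)])

loadings, explained = pca_loadings(X)
```

Collinear variables show up as near-equal loadings (in absolute value) on the same leading component, which is the pattern the text reports for penetration rate and misspellers fraction.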

S6 Misspellers detection

In this work we consider only tweets in Spanish: since several languages coexist in Spain, depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet has a misspelling or not, we need to establish some patterns to select from our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we will not consider the following cases:

• Lack of written accents. People tend to avoid writing accents when writing colloquially.

• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters qu by k. We understand that these mistakes can be motivated by the length limit of tweets and not by a real misspelling.

• Along the same lines, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.

• We also do not consider mistakes related to features of specific areas in Spain. For example, in the south the pronunciation of ce and se is the same, which produces a large number of writing mistakes. However, since we want to extract objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

Figure 9. Frequency plots for each variable constructed from Twitter: penetration rate, entropy 1 and 2 (social), entropy 1 and 2 (geo), misspellers rate, activity (morning, afternoon, night), and tweets mentioning job, employment, and unemployment terms.

Likewise we will consider as real misspellings the followingmistakes

bull Adding letters For example writing a h at the beginning of aword that starts with a vowel

bull Changing the special cases mp mb by the wrong writings npnb

bull Mixing up b with v g with j ll with y and ex with es Theseare typical mistakes in Spanish because they have the sameor a very close pronunciation

Figure 8. Top: percentage of the population in each age group (16-24, 25-34, 35-44, 45-54, 55-64) from the Spanish Census (dark bars) and from surveys about Twitter users (light bars). Bottom: performance of the linear models for each of the age groups, plotting predicted against observed unemployment (%). Panels: all ages ($R^2 = 0.47$), < 24 ($R^2 = 0.62$), 25-44 ($R^2 = 0.52$) and > 44 ($R^2 = 0.26$).

• Confusing the verb haber with the periphrasis a ver.

• Splitting a word in two, for instance writing the word conmigo as con migo.

In this way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. One can therefore expect this selection to provide an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27055 (5.6% of the whole population).
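The detection step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the three expressions are the examples quoted in the paper ("Con migo", "Aver", "llendo"), standing in for the full list, and the function names and the `(user_id, text)` input format are assumptions.

```python
import re

# Hypothetical excerpt of the misspelling list; the real list is much longer.
MISSPELLINGS = {"con migo", "aver", "llendo"}


def find_misspellers(tweets, misspellings=MISSPELLINGS):
    """Return the set of users who wrote at least one listed expression.

    `tweets` is an iterable of (user_id, text) pairs (assumed format).
    """
    # Word boundaries keep "aver" from matching inside "averiguar".
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(m) for m in sorted(misspellings)) + r")\b",
        re.IGNORECASE,
    )
    return {user for user, text in tweets if pattern.search(text)}


def misspeller_rate(tweets, misspellings=MISSPELLINGS):
    """Fraction of distinct users who are misspellers."""
    tweets = list(tweets)  # the data is traversed twice
    users = {user for user, _ in tweets}
    if not users:
        return 0.0
    return len(find_misspellers(tweets, misspellings)) / len(users)
```

Computing the rate per region would only require grouping the tweets by the hometown of each user before calling `misspeller_rate`.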

We analyze whether misspellers use Twitter differently from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, misspellers tend to publish more tweets than users who did not make mistakes (14471 against 2372). This also emerges when the mean number of misspellings is considered as a function of the total number of tweets. For users with fewer than approximately 30 published tweets in the observation period the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent $\approx 0.33$).

Figure 10. Top: correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between the variables, colored blue (red) for positive (negative) correlations. Blank entries correspond to correlations that are not statistically significant at the 95% confidence level. Bottom: projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.

Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspeller rate as a proxy of the educational level of the cities. A large number of previous works have revealed the relationship between the economical status and the educational level of geographical areas, and therefore it is natural to ask whether the observed misspellers rate is related to the economy, as measured by the unemployment rate. To test this hypothesis we consider cities with more than 5000 inhabitants to avoid subsampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).
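Correlations in this work are reported together with an interval. The text does not state how the intervals were obtained; one common choice, assumed here purely for illustration, is the Fisher z-transform. The function name and the default critical value (95% two-sided) are assumptions.

```python
import math


def pearson_with_ci(x, y, z_crit=1.959964):
    """Pearson correlation and an approximate confidence interval.

    The interval uses the Fisher z-transform (an assumption; the paper does
    not say which method was used). Requires len(x) > 3 and |r| < 1.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    z = math.atanh(r)                      # Fisher transform of r
    half_width = z_crit / math.sqrt(n - 3)  # standard error is 1/sqrt(n-3)
    return r, (math.tanh(z - half_width), math.tanh(z + half_width))
```

With one value per city (misspeller probability and unemployment rate), the function returns the coefficient and the bounds in the same form as the pairs quoted in the text.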

Figure 11. Number N(miss|tweets) (red) and probability P(miss|tweets) (blue) of observed misspellings given the number of tweets.

Figure 12. Explanatory power ($R^2$) of the linear regression model when fitted against the unemployment data for different months, from 2012-10 to 2014-06. The gray (orange) area corresponds to the time window in which the Twitter data were collected and the variables constructed.

S7 Time window and unemployment

In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment determined at different months within the same time window in which the Twitter data were collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory value of the model when the linear regression is done for values of unemployment of different months before, during and after the Twitter data time window. Although there is a small seasonal effect along the year, the explanatory power remains around $R^2 = 0.6$, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that $R^2$ decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.

S8 Demographics do not explain unemployment

Since unemployment rates are very large for the group of young people, a natural question is whether demographic variables alone could explain the heterogeneity of young unemployment rates found in the geographical areas. To test this we have built four linear models: the first one (named Youth model in Table 5) has the rate of young population as its only explanatory variable; the second ones are built with only the Twitter variables considered in the main text (named Twitter model (I)) or just with those whose regression coefficients are statistically significant (Twitter model (II)); the third one is fitted with all the variables (named All variables model in Table 5). Table 5 shows the summary of the regression for each model. Focusing on the variance explained by the model in terms of $R^2$, considering all Twitter variables is roughly three times more explanatory than considering only the proportion of young people. On the other hand, the comparison of $R^2$ for the Twitter model with the ones for the All variables and Youth models shows that the rate of young population does not provide significant explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.

S9 Unemployment models for other geographical areas

While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we considered is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment for the variables defined in those administrative areas and relate it to the geographical communities detected and used in the main paper (see section 4). Not all the areas at the different administrative divisions are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1738 municipalities which have a Twitter population $\pi > 10$. Similarly, we only consider the 198 counties with $\pi > 100$. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, $R^2$ increases as the number of areas in the model gets smaller, but the description level of the model is very low for provinces, for example. The best performance (high $R^2$ and high geographical description level) is attained at the level of the detected communities.

S10 Relative importance of the variables

To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:

1. (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when the variables are scaled to have mean zero and variance one.

2. (lmg) Averaging over orderings, as proposed by Lindeman, Merenda and Gold.

3. (pmvd) The PMVD metric introduced by Feldman, which is also an average over orderings but with data-dependent weights.

4. (first) The univariate $R^2$ values from regression models with one variable only.

Figure 13. Relative importance of the variables (in percentage) in the unemployment model for the different ways of calculating it (weight, first, lmg, pmvd). Variables on the horizontal axis: emp, fmiss, manana, rtwpen, sio and siosocial.

All these metrics are obtained using the relaimpo R package [12]. The results for the young unemployment model are shown in figure 13, where we can see that the different methods yield similar relative importance of the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.
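To make the "averaging over orderings" idea behind the lmg metric concrete, the following is a small self-contained sketch. It is not the relaimpo implementation used in the paper: all function names are hypothetical, the least-squares fit is a plain normal-equations solve, and predictors are assumed not to be collinear.

```python
from itertools import permutations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def r_squared(columns, y):
    """R^2 of an ordinary least-squares fit of y on `columns` plus an intercept."""
    n = len(y)
    if not columns:
        return 0.0
    design = [[1.0] * n] + [list(c) for c in columns]  # stored column-wise
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in design] for ci in design]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in design]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, design)) for i in range(n)]
    mean_y = sum(y) / n
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    ss_res = sum((v - f) ** 2 for v, f in zip(y, fitted))
    return 1.0 - ss_res / ss_tot


def lmg(predictors, y):
    """Average, over all orderings, of the R^2 gained when each variable enters.

    `predictors` maps a variable name to its list of values.
    """
    names = list(predictors)
    shares = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        included, previous = [], 0.0
        for name in order:
            included.append(name)
            current = r_squared([predictors[v] for v in included], y)
            shares[name] += current - previous
            previous = current
    return {name: total / len(orderings) for name, total in shares.items()}
```

By construction the shares add up to the $R^2$ of the full model. The cost grows with the factorial of the number of variables, which is acceptable for the six variables considered here but not for large models.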

References

[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.

[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.

[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.

[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.

[6] Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.

[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html

[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm

[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.

[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.

[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.

[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.

[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.

[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.

[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.

[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences - ISCIS 2005, pages 284–293. Springer, 2005.

[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.

[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.

[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.

[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.

[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.

[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.

[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.

[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.

Gravity Model
Parameter   Description                                  Spain
α1          Origin exponent                              0.477*** (0.002)
α2          Destination exponent                         0.478*** (0.002)
β           Distance exponent                            1.05*** (0.0035)
R^2         Goodness of fit                              0.797
φ           Correlation between T_ij and T^grav_ij       0.826

Table 1. Description of the parameters for the Gravity Law model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.

NMI between G and G_p for different p
Algorithm   p = 0.01   0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09    0.1
FG          0.995      0.992   0.989   0.983   0.981   0.977   0.983   0.969   0.980   0.959
WT          0.954      0.959   0.950   0.954   0.945   0.948   0.947   0.935   0.926   0.931
IM          0.988      0.981   0.980   0.981   0.978   0.974   0.975   0.970   0.969   0.966
ML          0.994      0.978   0.979   0.983   0.948   0.934   0.972   0.952   0.973   0.947
LP          0.906      0.908   0.911   0.915   0.895   0.907   0.907   0.893   0.905   0.904
LE          0.960      0.957   0.956   0.859   0.910   0.892   0.908   0.858   0.885   0.884

Table 2. NMI measure comparing G and G_p.

Communities Stats
Algorithm   ⟨|Ni|⟩i     max|Ni|   |Ni|   Modularity   NMI P   NMI C
FG          309.696     1385      23     0.726        0.712   0.590
WT          9.262       433       769    0.417        0.744   0.757
IM          21.011      143       339    0.758        0.770   0.831
ML          323.772     1132      22     0.800        0.717   0.599
LP          22.052      750       323    0.732        0.749   0.761
LE          1017.571    5344      7      0.381        0.264   0.205

Table 3. Statistics of the communities Ni returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.

                          All ages    < 24        25-44       > 44
(Intercept)               0.11***     0.10***     0.20***     0.20***
                          (0.02)      (0.03)      (0.03)      (0.035)
Penetration rate          3.23*       8.57***     6.28**      2.40
                          (1.41)      (2.22)      (2.17)      (2.77)
Geographical diversity    0.03        0.15***     0.08*       0.06
                          (0.02)      (0.04)      (0.04)      (0.05)
Social diversity          -0.03*      -0.03       -0.05*      -0.06*
                          (0.01)      (0.02)      (0.02)      (0.03)
Morning activity          -0.69*      -1.30**     -1.53***    -1.19*
                          (0.26)      (0.42)      (0.41)      (0.52)
Misspellers rate          11.56       31.51*      15.46       23.60
                          (8.13)      (12.78)     (12.48)     (15.94)
Employment mentions       -1.80       3.17        -9.94       2.71
                          (6.27)      (9.86)      (9.64)      (12.3)
R^2                       0.47        0.64        0.55        0.29
Adj. R^2                  0.44        0.62        0.52        0.26
*** p < 0.001, ** p < 0.01, * p < 0.05

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in the groups of less than 24 years, between 25 and 44 years, and above 44 years.

                          All variables   Youth model   Twitter model (I)   Twitter model (II)
(Intercept)               0.06            -0.02         0.10***             0.09***
                          (0.03)          (0.03)        (0.03)              (0.027)
Young pop. rate           0.66*           2.20***
                          (0.30)          (0.35)
Penetration rate          8.20***                       8.57***             8.62***
                          (2.25)                        (2.22)              (2.21)
Geographical diversity    0.14***                       0.15***             0.12***
                          (0.04)                        (0.04)              (0.03)
Social diversity          -0.02                         -0.03
                          (0.02)                        (0.02)
Morning activity          -1.42***                      -1.30**             -1.28**
                          (0.41)                        (0.42)              (0.41)
Misspellers rate          23.95                         31.51*              32.28*
                          (13.09)                       (12.78)             (12.71)
Employment mentions       0.34                          3.17
                          (9.81)                        (9.86)
R^2                       0.65            0.24          0.64                0.63
Adj. R^2                  0.63            0.24          0.62                0.62
*** p < 0.001, ** p < 0.01, * p < 0.05

Table 5. Regression table for the different statistical models. The All variables model includes both the Twitter variables and the rate of young population. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).

                          Communities   Municipalities   Counties    Provinces
(Intercept)               0.10***       0.16***          0.11***     0.11*
                          (0.03)        (0.01)           (0.03)      (0.05)
Penetration rate          8.57***       4.01***          9.12***     10.47***
                          (2.22)        (0.59)           (1.81)      (1.97)
Geographical diversity    0.15***       0.02             0.12***     0.08
                          (0.04)        (0.01)           (0.03)      (0.07)
Social diversity          -0.03         -0.01            -0.01       -0.03
                          (0.02)        (0.01)           (0.02)      (0.07)
Morning activity          -1.30**       -1.16***         -1.49***    -1.03
                          (0.42)        (0.14)           (0.39)      (0.88)
Misspellers rate          31.51*        14.40***         14.09
                          (12.78)       (2.51)           (10.02)
Employment mentions       3.17          -0.71            2.41        -3.17
                          (9.86)        (0.89)           (8.86)      (12.29)
Number of points          128           1738             198         50
R^2                       0.64          0.22             0.55        0.65
Adj. R^2                  0.62          0.21             0.54        0.61
*** p < 0.001, ** p < 0.01, * p < 0.05

Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.
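As a quick consistency check of the regression tables, the adjusted $R^2$ can be recomputed from $R^2$, the number of points and the number of predictors. The sketch below assumes the standard definition, $1 - (1 - R^2)(n - 1)/(n - p - 1)$; the tables do not state which definition the fitting software used, and the function name is hypothetical.

```python
def adjusted_r2(r2, n_points, n_predictors):
    """Standard adjusted R^2 for a linear model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_points - 1) / (n_points - n_predictors - 1)


# Values taken from the Communities, Counties and Provinces columns of Table 6.
print(round(adjusted_r2(0.64, 128, 6), 2))  # 0.62
print(round(adjusted_r2(0.55, 198, 6), 2))  # 0.54
print(round(adjusted_r2(0.65, 50, 5), 2))   # 0.61
```

The Provinces model has five predictors because the misspellers rate was removed from it. Since the inputs are themselves rounded to two decimals, small discrepancies in the last digit are possible for other columns.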


…public API provided by Twitter from continental Spain, ranging from 29th November 2012 to 30th June 2013. Tweets were posted by 0.57 million unique (properly anonymized) users and geo-positioned in 7683 different municipalities. We observed a large correlation (Pearson's coefficient $\rho = 0.951$ [0.949, 0.953]) between the number of geo-positioned tweets per municipality and the municipality's population. On average we find around 50 tweets per month and per 1000 persons in each municipality.

Despite this high level of social media activity within municipalities, we find their official administrative areas unsuitable for studying socio-economical activity: administrative boundaries between municipalities reflect political and historical decisions, while economical trade and activity often happen across those boundaries. The result is that municipalities in Spain are artificially diverse, ranging from a municipality with only 7 inhabitants to another with a population of 3.2 million. Although there exist natural aggregations of municipalities into provinces (regions) or statistical/metropolitan areas (NUTS areas), we have used our own procedure to detect economical areas. In particular, we have used users' daily trips between pairs of municipalities as a measure of the economic relatedness between said municipalities. We say that there is a daily trip between municipalities $i$ and $j$ if a user has tweeted in place $i$ and in place $j$ consecutively within the same day. In our dataset we find 19 million trips by 0.22 million users. With those trips we construct the daily mobility flux network $T_{ij}$ between municipalities as the number of trips between places $i$ and $j$ (see figure 1B). Remarkably, the statistical properties of the trips and of the mobility matrix $T_{ij}$ coincide with those of other mobility datasets (see SI section 2): for example, trip distance $r$ and elapsed time $\delta t$ are power-law distributed, $P(r) \sim r^{-1.67}$ and $P(\delta t) \sim \delta t^{-0.62}$, with exponents very similar to those found in the literature [9, 23]. The mobility fluxes $T_{ij}$ are also well described by the Gravity Law ($R^2 = 0.80$) [28]:

$$
T_{ij} \approx T^{\mathrm{grav}}_{ij} = \frac{P_i^{\alpha_i} P_j^{\alpha_j}}{d_{ij}^{\beta}} \qquad (1)
$$

where $P_i$ and $P_j$ are the populations of municipalities $i$ and $j$, and $d_{ij}$ is the distance between them. Similarly, the exponents in (1) are very similar to those reported in other works, $\alpha_i \approx \alpha_j = 0.48$ and $\beta \approx 1.05$ [23, 29]. These results suggest that the mobility detected from geo-located tweets is a good proxy of human mobility within and between municipalities [30].
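One simple way to estimate the exponents in (1) is a least-squares fit in log space, $\log T_{ij} = c + \alpha_i \log P_i + \alpha_j \log P_j - \beta \log d_{ij}$. The sketch below takes that route as an assumption; the text does not specify the fitting procedure, and the function names and input format are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def fit_gravity(flows):
    """Fit the gravity-law exponents by ordinary least squares on logarithms.

    `flows` is an iterable of (T_ij, P_i, P_j, d_ij) tuples with positive
    entries (assumed format); pairs with zero flux must be dropped beforehand.
    """
    flows = list(flows)
    rows = [(1.0, math.log(pi), math.log(pj), math.log(d)) for _, pi, pj, d in flows]
    target = [math.log(t) for t, _, _, _ in flows]
    # Normal equations (X^T X) b = X^T y for the four coefficients.
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(4)] for a in range(4)]
    xty = [sum(r[a] * t for r, t in zip(rows, target)) for a in range(4)]
    const, alpha_i, alpha_j, slope = solve(xtx, xty)
    return {"log_const": const, "alpha_i": alpha_i, "alpha_j": alpha_j, "beta": -slope}
```

Fitting in log space weights all pairs equally regardless of flux size and discards pairs with no trips, so the estimates can differ from those of a fit on the raw counts.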

We use the network of daily fluxes between municipalities $T_{ij}$ to detect the geographical communities of economical activity. To this end we partition the mobility network $T_{ij}$ using standard graph community-finding algorithms. This technique has been applied extensively, especially with mobile phone data, to unveil the effective maps of countries based on the mobility and/or social interactions of people [31–33]. In our case we have used the Infomap algorithm [34] and found 340 different communities within Spain. For further details about the comparison among different state-of-the-art community detection algorithms executed on the inter-city graph, see SI section 3. The average number of municipalities per community is 21, and the largest community contains 142 municipalities. The communities detected have very interesting features (see SI section 3): (i) they are geographically cohesive (see figure 1), (ii) they are statistically robust against the random removal of trips in our database (SI table S2), and (iii) the modularity of the partition is very high ($\approx 0.76$, see SI table S3). Finally, (iv) the partition found has some overlap (77% of Normalized Mutual Information, NMI, see [35]) with coarser administrative boundaries like provinces (regions) (see SI section 3 for details). Interestingly, it shows a larger overlap (83% of NMI) with comarcas (counties), areas in Spain that reflect geographical and economical relations between municipalities. This result shows that the mobility detected from geo-located tweets and the communities obtained are a good description of economical areas.
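The overlap between two partitions of the same set of municipalities can be quantified with the Normalized Mutual Information. The sketch below uses the normalization $2I(A;B)/(H(A)+H(B))$; this particular variant is an assumption, since the text only refers to [35] for the definition, and the function name is hypothetical.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized Mutual Information between two labelings of the same items.

    labels_a[k] and labels_b[k] are the groups (e.g. detected community and
    province) of item k. Returns a value between 0 and 1.
    """
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual = sum(
        (nab / n) * math.log(nab * n / (count_a[a] * count_b[b]))
        for (a, b), nab in joint.items()
    )
    entropy_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    if entropy_a + entropy_b == 0.0:
        return 1.0  # both partitions put everything in a single group
    return 2.0 * mutual / (entropy_a + entropy_b)
```

The measure is invariant under relabeling of the groups, so community identifiers do not need to match province or county codes.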

In the rest of the paper we restrict our analysis to the geographical areas defined by the communities detected with Infomap (see figure 1). For statistical reasons we discard communities formed by fewer than 5 municipalities. Despite this sampling, 96% of the total country population is considered in our analysis. Our results in the rest of the paper also hold for municipalities, counties or provinces, though with lower statistical power (see SI section 9).

2 Social media behavioral fingerprints

The goal of this work is to quantify how and which behavioral features can be extracted from social media and then related back to the economical level of cities. To this end we define four groups of measures that have been widely explored in other fields like economics or the social sciences. These four types of measures rely on the identification of the place where users live. Instead of using the information in the user profile, we analyze the places where the user has tweeted and set as the hometown of the user the municipality where he/she has tweeted with the highest frequency, a method usually employed with mobile phone and social media data [11, 23]. Specifically, we select those users with more than 5 geo-located tweets in our period who have posted at least 40% of their tweets in a given municipality, which we consider their hometown. After this filtering we end up with 0.32 million users, and we can then define the Twitter population $\pi_i$ in area $i$ as the number of users with their hometown within area $i$. We obtain a very high correlation between $\pi_i$ and the population of the cities $P_i$ in the national census, $\rho = 0.977$ [0.976, 0.978], which provides an indirect validation of our approach with the present data. However, not all demographic groups are equally represented in our Twitter database. As shown in SI section 4, Twitter user demographics in Spain obtained from surveys [36] show that age groups above 44 years old are under-represented. Thus our results would mainly describe the socio-economical status of people below 44 years old. The unemployment analysis is therefore performed for different age groups: unemployment for people below 25 years old, between 25 and 44 years old, and older than 44 years old. Finally, we have chosen the unemployment officially reported at the end of our observation time window (June 2013), but our results are not affected by the month selected, see SI section 7.
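The hometown rule described above (more than 5 geo-located tweets, at least 40% of them in the most frequent municipality) can be sketched as follows. The function names and the `(user_id, municipality_id)` input format are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def assign_hometowns(tweets, min_tweets=6, min_share=0.40):
    """Map each qualifying user to the municipality where they tweet most.

    `tweets` is an iterable of (user_id, municipality_id) pairs (assumed
    format). Users with fewer than `min_tweets` geo-located tweets, or with
    less than `min_share` of them in their top municipality, are dropped.
    """
    per_user = defaultdict(Counter)
    for user, municipality in tweets:
        per_user[user][municipality] += 1

    hometowns = {}
    for user, counts in per_user.items():
        total = sum(counts.values())
        municipality, top = counts.most_common(1)[0]
        if total >= min_tweets and top / total >= min_share:
            hometowns[user] = municipality
    return hometowns


def twitter_population(hometowns):
    """Number of users whose hometown is each municipality (pi_i in the text)."""
    return Counter(hometowns.values())
```

Ties between equally frequent municipalities are broken by first appearance here; the text does not say how ties were handled.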

Figure 2. Examples of different behaviour in the observed variables and the unemployment. In A we observe that two cities with different unemployment levels have different temporal activity patterns. Panel C shows how communities (red) with distinct entropy levels of social communication with other communities (blue) may have different unemployment intensity: the left map shows a highly focused communication pattern (low entropy), while the right map corresponds to a community with a diverse communication pattern (high entropy). Finally, panel B shows some examples of detected misspellings in our database using 618 incorrect expressions (see SI section 6), such as "Con migo", "Aver" or "llendo".

For every region considered we investigate the officially reported unemployment for different age groups and a number of metrics related to social media activity. Some of those metrics have already been reported in the literature, while others are introduced in this work. Specifically, we consider:

• Social media technology adoption: we can use the Twitter penetration rate $\tau_i = \pi_i / P_i$ in each area $i$ as a proxy of technology adoption. Recent works have shown that there is indeed a positive correlation between $\tau_i$ and GDP at the country level [23]. However, in our data we find the opposite correlation (see figure 3), namely that the larger the penetration rate, the bigger the unemployment. This suggests that the impact of technology adoption at country scale is different from what happens within an (industrialized) country, where the technology to access social media is commoditized.

• Social media activity: regions with very different economical situations should exhibit different patterns of activity during the day. Since working, leisure, family, shopping, etc. activities happen at different times of the day, we might observe different daily patterns in regions with different socio-economical status. For example, we hypothesize that communities with low levels of unemployment will tend to have higher activity levels at the beginning of a typical weekday. This is indeed what we find: figure 2A shows the hourly fraction of tweets during workdays for two communities with very different rates of unemployment. As we can observe, both profiles are quite different, and in the case of low unemployment we find a strong peak of activity between 8 and 11am (morning) and lower periods of activity during the afternoons and nights. We encode this finding in $\nu_{\mathrm{mrng},i}$, $\nu_{\mathrm{aftn},i}$ and $\nu_{\mathrm{ngt},i}$, the total fraction of tweets happening in geographical area $i$ between 8am and 10am, 3pm and 5pm, and 12am and 3am, respectively. Figure 3 shows a strong negative correlation between $\nu_{\mathrm{mrng},i}$ and the unemployment for the communities in our database, and a positive correlation with $\nu_{\mathrm{aftn},i}$ and $\nu_{\mathrm{ngt},i}$.

• Social media content: some works have observed a correlation between the frequency of words related to work conditions [22] or of forward-looking searches [21] and the economical situation of countries. In our case we also find a moderate positive correlation between the fraction of tweets $\mu_i$ mentioning job or unemployment terms and the observed unemployment, while the correlation is negative for the number of mentions of employment or the economy. However, we have also tried a different approach by measuring the relation between the way of writing and the educational level [37]. To this end we build a list of 618 misspelled Spanish expressions and extract the tweets of the dataset containing at least one of them (see SI section 6 for further details about how these expressions were collected). We only consider tweets in Spanish, detected with an N-gram based algorithm, and only misspellings that cannot be justified as abbreviations. Finally, we compute for every region the proportion $\varepsilon_i$ of misspellers among the Twitter population. If the fraction of misspellers per geographical area is a proxy for the educational level of that region, we expect a positive correlation between $\varepsilon_i$ and unemployment. Indeed, we find (see figure 3) that there is a strong correlation between the fraction of misspellers and unemployment.

• Social media interactions and geographical flow diversity: following the ideas in [14], which correlated the economical development of an area with the diversity of its communications with other areas, we consider all tweets mentioning another user and take them as a proxy for communication between users. We then compute the number of communications $w_{ij}$ between areas $i$ and $j$ as the number of mentions between users in those areas. To measure the diversity we use, as in [14], the normalized informational entropy (Entropy 1) $S^u_i = -\frac{1}{S^r_i}\sum_j p_{ij} \log p_{ij}$, where $p_{ij} = w_{ij} / \sum_j w_{ij}$, and (Entropy 2) $S^r_i = \log k_i$, with $k_i$ the number of different areas with which users in area $i$ have interacted. As in [14], we find that areas with large unemployment have less diverse communication patterns than areas with low unemployment. This translates into a strong negative correlation between $S^u_i$ and the unemployment, see figure 3. Similar ideas are applied to the flows of people between areas, to investigate the diversity of the geographical flows through the entropy $\tilde{S}^u_i = -\frac{1}{\tilde{S}^r_i}\sum_j p_{ij} \log p_{ij}$, where now $p_{ij} = T_{ij} / \sum_j T_{ij}$ and $\tilde{S}^r_i = \log k_i$, with $k_i$ the number of different areas visited by users who live in area $i$. Figure 3 shows that, as in [19], the correlation of these geographical entropies with economical development is low.
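The two diversity measures defined in the last item can be computed directly from the outgoing weights of an area, whether these are mention counts $w_{ij}$ or trip counts $T_{ij}$. The sketch below is illustrative: the function name is hypothetical, and returning 0 for the normalized entropy when an area has fewer than two partners is a convention assumed here, since the ratio is undefined in that case.

```python
import math


def diversity(weights):
    """Return (normalized entropy, log k) for one area.

    `weights` maps each partner area j to a non-negative weight (w_ij or
    T_ij). The first value is -sum_j p_ij log p_ij divided by log k_i, the
    second is log k_i, where k_i counts partners with positive weight.
    """
    positive = [w for w in weights.values() if w > 0]
    k = len(positive)
    log_k = math.log(k) if k > 0 else 0.0
    if k < 2:
        return 0.0, log_k  # assumed convention: no diversity with < 2 partners
    total = sum(positive)
    shannon = -sum((w / total) * math.log(w / total) for w in positive)
    return shannon / log_k, log_k
```

An area that spreads its communications evenly over its partners gets a normalized entropy of 1, while an area that concentrates almost all of them on a single partner gets a value close to 0.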

Normalization of the variables is discussed in SI section 5. We have also studied the correlation between the variables considered. As expected, variables in each group show moderate correlations between them. However, the inspection of the correlation matrix and a Principal Component Analysis of the variables considered show that there is information (as percentage of variance in the data) in each of the groups of variables, see SI section 5. Because of these two facts, we restrict our analysis to the variables within each group with the highest correlation with the unemployment, namely the penetration rate $\tau_i$, the social and mobility diversity variables $S^u_i$ and $\tilde{S}^u_i$, the morning activity $\nu_{\mathrm{mrng},i}$, the fraction of misspellers $\varepsilon_i$, and the fraction of employment-related tweets $\mu_{\mathrm{emp},i}$.

3 Explanatory power of social media in unemployment

The four previous groups of variables are fingerprints of human behavior reflected in Twitter usage habits. As we observed in figure 3, all of them exhibit statistically strong correlations with unemployment. The question we address in this section is whether those variables suffice to explain the observed unemployment (their explanatory power), and which of them are the most important (that is, which contribute more explanatory power than others). Note that we are not claiming a causal arrow between the measures built in the previous section and the unemployment rate; we only explore whether they can be used as alternative indicators with a real translation in the economy.

Figure 4 shows the result of a simple linear regression model for the observed unemployment for ages below 25 years as a function of the variables most correlated with unemployment. The model has a significant $R^2 = 0.62$, showing that a large part of the variation in unemployment can be explained by the behavioral variables extracted from Twitter. However, not all the variables weigh equally in the model: the penetration rate, geographical diversity, morning activity, and fraction of misspellers account for up to 92% of the explained variance, while social diversity and the number of employment-related tweets are not statistically significant (see SI section 10 for the methods used to determine the relative importance of the variables). It is interesting to note that while social diversity obtained from mobile phone communications was a key variable in the explanation of deprivation indexes in [14, 19], the communication diversity of Twitter users seems to play a minor role in explaining the heterogeneity of unemployment in Spain.
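As a rough illustration of the two quantities reported above, the sketch below computes the coefficient of determination of a set of predictions and the share of each variable in the sum of absolute regression coefficients. The function names, the variable names in the dictionary, and all numbers are hypothetical. The exact procedure of SI section 10 is not reproduced here; the sketch only assumes that coefficients come from a regression on standardized variables.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def relative_weights(coefficients):
    """Percentage share of each variable in the sum of |coefficient|.

    Assumes the coefficients were fitted on standardized variables, so that
    their absolute values are comparable across variables.
    """
    total = sum(abs(c) for c in coefficients.values())
    return {name: 100.0 * abs(c) / total for name, c in coefficients.items()}

# Hypothetical observed and predicted unemployment rates (%), not paper data.
observed = [12.0, 18.5, 22.0, 15.0, 25.5]
predicted = [13.1, 17.0, 21.2, 16.4, 24.0]
fit_quality = r_squared(observed, predicted)

# Hypothetical standardized coefficients, one per Twitter variable.
weights = relative_weights({
    "penetration_rate": -0.40,
    "geo_entropy": -0.25,
    "morning_activity": -0.20,
    "misspellers_rate": 0.15,
})
```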

Similar explanatory power is found for other age groups: $R^2 = 0.44$ for all ages and $R^2 = 0.52$ for ages between 25 and 44 years. However, the model degrades for ages above 44 years ($R^2 = 0.26$), showing that our variables mainly describe the behavior of the most represented age groups in Twitter, namely those below 44 years old. On the other hand, since our Twitter variables seem to describe the behavior of young people, we have investigated whether Twitter-constructed variables have similar explanatory value (in terms of $R^2$) to simple census demographic variables for young people. Regression models that add the young population rate to the Twitter variables yield only a minor improvement ($R^2 = 0.65$), while the young population rate alone gives only $R^2 = 0.24$, a result which shows that Twitter variables do indeed possess genuine explanatory power beyond their simple demographic representation. Finally, our model has the largest explanatory power for detected communities, but large $R^2$ values are also found for other geographical areas like counties ($R^2 = 0.54$) and provinces ($R^2 = 0.65$), see SI section 9.

4 Discussion

This work serves as a proof of concept for how a wide range of behavioral features linked to socioeconomic behavior can


Figure 3. A) Correlation coefficient with unemployment of all the extracted Twitter metrics, grouped by technology adoption (black), geographical diversity (orange), social diversity (light blue), temporal activity (green), and content analysis (dark blue). The metrics shown are: penetration rate, Entropy 1 (geo), Entropy 2 (geo), Entropy 1 (social), Entropy 2 (social), Activity (morning), Activity (afternoon), Activity (night), misspellers rate, job tweets, employment tweets, unemployment tweets, and economy tweets. Error bars correspond to 95% confidence intervals of the correlation coefficient. The gray area corresponds to the statistical significance thresholds. Panels B, C, D, and E show the values of 4 selected variables (penetration rate, Entropy 1 (social), Activity (morning), and misspellers rate) in each geographical community against its percentage of unemployment. Size of the points is proportional to the population in each geographical community. Solid lines correspond to linear fits to the data.

Figure 4. A) and B) Performance of the model, showing the predicted unemployment rate versus the observed one for ages below 25 ($R^2 = 0.62$) and for ages between 25 and 44 ($R^2 = 0.52$). Dashed lines correspond to the equality line and ±20% error. C) Percentage of weight for each of the variables in the regression model (penetration rate, Entropy 1 (geo), Entropy 1 (social), Activity (morning), misspellers rate, and employment tweets), using the relative weight of the absolute values of coefficients in the regression model (see SI section 10). Variables marked with ∗ are not statistically significant in the model.

be inferred from the digital traces that are left by publicly-available social media. In particular, we demonstrate that behavioral features related to unemployment can be recovered from the digital exhaust left by the microblogging network Twitter. First of all, Twitter geolocalized traces, together with off-the-shelf community detection algorithms, render an optimal partition of a country for economical activity, showing the remarkable power of social media to understand and unveil economical behavior at a country scale. This insight is likely to apply to other administrative definitions in other countries, especially when considering large cities with an inherently dynamical nature and evolution of mobility fluxes, and cities composed of small satellite cities with arbitrary agglomerations or divisions among them (e.g., London, NYC, Singapore).

This result is unsurprising: it should be natural to recompute city clusters/communities of activity based on their real-time mobility, which may vary considerably faster than the update rates of mobility and travel surveys [31–33].

Our main result demonstrates that several key indicators (different penetration rates among regions, fingerprints of the temporal patterns of activity, content lexical correctness, and geo-social connectivities among regions) can be extracted from social media and then used to infer unemployment levels. These findings shed light in two directions: first, on how individuals' extensive use of their social channels allows us to characterize cities based on their activity in a meaningful fashion, and secondly, on how this information can be used to build economic indicators that are directly related to the economy. Regarding the latter, our work is important for understanding how country-scale analysis of social media should consider not only the demographic but also the economical differences between users. As we have shown, users in areas with large unemployment have different mobility, different social interactions, and different daily activity than those in low unemployment areas. This intertwined relationship between user behavior and employment should be considered not only in economical analysis derived from social media, but also in other applications like marketing, communication, social mobilization, etc.

It is particularly remarkable that Twitter data can provide these accurate results. Twitter is, among the many currently popular social networking platforms, perhaps the noisiest, sparsest, most 'sabotaged' medium: very few users send out messages at a regular rate, most of the users do not have geolocated information, the social relationships (followers/followees) contain a lot of unused/unimportant links, it is plagued by spam-bots, and, last but not least, we have no way to identify the motive/goal/functionality of the mobility fluxes we are able to extract. These limitations are not particular to our sample, but general to the sample Twitter data being employed in the computational social science community. Despite all these caveats, we are able to show that even some simple filtering techniques, together with basic statistical regressions, yield predictive power about a variable as important as unemployment. Other social media platforms such as Facebook, Google+, Sina Weibo, Instagram, Orkut, or Flickr, with more granular and consistent individual data, are likely to provide similar or better results by themselves or in combination. Further improvements can be obtained by the use of more sophisticated statistical machine learning techniques, some of them even tailored to the peculiarities of social media data. Our work serves to illustrate the tremendous potential of these new digital datasets to improve the understanding of society's functioning at the finer scales of granularity.

The usefulness of our approach must be considered against the cost and update rate of performing detailed surveys of mobility, social structure, and economic performance. Our database is publicly articulated, which means that our analysis could be replicated easily in other countries, other time periods, and with different scopes. Naturally, surveys provide more accurate results, but they also consume considerably higher financial and human resources, employing hundreds of people and taking months, even years, to complete and be released. They are so costly that countries going through economic recession have considered discontinuing them or altering their update rate in recent times. A particularly problematic aspect of these surveys is that they are "out-of-sync", i.e., the census may be up to date whereas those same individuals' travel surveys may not be, and therefore drawing inferences between both may be particularly difficult. This is a particularly challenging problem that the immediateness of social media can help ameliorate.

A few questions remain open for further investigation. How can traditional surveys and social media digital traces be best combined to maximize their predictive ability? Can social media provide a reliable leading indicator of unemployment and, in general, of economic surveys? How much reliable lead is possible, if any at all? As we have found, Twitter penetration and educational levels are correlated with unemployment, but these levels are unlikely to change rapidly enough to describe or anticipate changes in the economy or unemployment. However, other indicators like daily activity, social interactions, and geographical mobility are more connected with our daily activity, and perhaps they have more predictive power to show and/or anticipate sudden changes in employment. The relationship between unemployment and individual and group behavior may help contextualize the multiple factors affecting the socioeconomic well-being of a region: while penetration, content, daily activity, and mobility diversity seem to be highly correlated with unemployment in Spain, different weights for each group of traces might be expected in other countries [14]. Finally, digital traces could serve as an alternative (sometimes the only one available) to the lack of surveys in poor or remote areas [20, 27]. Another interesting avenue of research involves the use of social media to detect mismatches between the real (hidden, underground) economy and the officially reported one [38].

Most importantly, the immediacy of social media may also allow governments to better measure and understand the effect of policies, social changes, and natural or man-made disasters on the economical status of cities in almost real time [18, 39]. These new avenues for research provide great opportunities at the intersection of the economic, social, and computational sciences that originate from these new widespread, inexpensive datasets.

Acknowledgments

We would like to thank Kristina Lerman, Lada Adamic, James Fowler, Daniel Villatoro, and Ricardo Herranz for stimulating discussions, and Yuri Kryvasheyeu and Thomas Bochynek for their critical reading of the manuscript. This work was partially supported by Spanish Ministry of Science and Technology Grant FIS2013-47532-C3-3-P (to A. L., M. G. H., and E. M.). Manuel Cebrian is funded by the Australian Government, as represented by the Department of Broadband, Communications and Digital Economy, and the Australian Research Council through the ICT Centre of Excellence program.

References

[1] Becker, G. S. (1976) The economic approach to human behavior (University of Chicago Press).

[2] Granovetter, M. (1985) Economic action and social structure: the problem of embeddedness. American Journal of Sociology pp. 481–510.

[3] Camerer, C. F., Loewenstein, G., & Rabin, M. (2011) Advances in behavioral economics (Princeton University Press).

[4] Glaeser, E. L., Kallal, H. D., Scheinkman, J. A., & Shleifer, A. (1991) Growth in cities (National Bureau of Economic Research), Technical report.

[5] Bettencourt, L. M., Lobo, J., Helbing, D., Kuhnert, C., & West, G. B. (2007) Growth, innovation, scaling, and the pace of life in cities. Proceedings of the National Academy of Sciences 104, 7301–7306.

[6] Batty, M. (2008) The size, scale, and shape of cities. Science 319, 769–771.

[7] Milgram, S. (1974) The experience of living in cities. Crowding and behavior 167, 41.


[8] Pan, W., Ghoshal, G., Krumme, C., Cebrian, M., & Pentland, A. (2013) Urban characteristics attributable to density-driven tie formation. Nature Communications 4.

[9] Gonzalez, M. C., Hidalgo, C. A., & Barabasi, A.-L. (2008) Understanding individual human mobility patterns. Nature 453, 779–782.

[10] Calabrese, F., Diao, M., Lorenzo, G. D., Jr., J. F., & Ratti, C. (2013) Understanding individual mobility patterns from urban sensing data: A mobile phone trace example. Transportation Research Part C: Emerging Technologies 26, 301–313.

[11] Cheng, Z., Caverlee, J., Lee, K., & Sui, D. Z. (2011) Exploring Millions of Footprints in Location Sharing Services (AAAI, Menlo Park, CA, USA).

[12] Cho, E., Myers, S. A., & Leskovec, J. (2011) Friendship and mobility: user movement in location-based social networks, KDD '11 (ACM, New York, NY, USA), pp. 1082–1090.

[13] Sun, L., Jin, J. G., Axhausen, K. W., Lee, D.-H., & Cebrian, M. (2014) Quantifying long-term evolution of intra-urban spatial interactions. arXiv preprint arXiv:1407.0145.

[14] Eagle, N., Macy, M., & Claxton, R. (2010) Network diversity and economic development. Science 328, 1029–1031.

[15] Henrich, J., Boyd, R., Bowles, S., Camerer, C., Fehr, E., Gintis, H., & McElreath, R. (2001) In search of homo economicus: behavioral experiments in 15 small-scale societies. American Economic Review pp. 73–78.

[16] Krieger, N., Williams, D. R., & Moss, N. E. (1997) Measuring social class in US public health research: concepts, methodologies, and guidelines. Annual Review of Public Health 18, 341–378.

[17] Groves, R. M., Fowler Jr., F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2013) Survey methodology (John Wiley & Sons).

[18] Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M., et al. (2009) Life in the network: the coming age of computational social science. Science (New York, NY) 323, 721.

[19] Smith, C., Quercia, D., & Capra, L. (2013) Finger on the pulse: identifying deprivation using transit flow analysis (ACM), pp. 683–692.

[20] Soto, V., Frias-Martinez, V., Virseda, J., & Frias-Martinez, E. (2011) Prediction of socioeconomic levels using cell phone records, UMAP'11 (Springer-Verlag, Berlin, Heidelberg), pp. 377–388.

[21] Preis, T., Moat, H. S., Stanley, H. E., & Bishop, S. R. (2012) Quantifying the advantage of looking forward. Scientific Reports 2.

[22] Antenucci, D., Cafarella, M., Levenstein, M. C., Re, C., & Shapiro, M. D. (2014) Using social media to measure labor market flows (National Bureau of Economic Research), Technical report.

[23] Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P., & Ratti, C. (2013) Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680.

[24] Lathia, N., Quercia, D., & Crowcroft, J. (2012) in Pervasive computing (Springer), pp. 91–98.

[25] Frias-Martinez, V., Virseda, J., & Frias-Martinez, E. (2010) Socio-economic levels and human mobility.

[26] Gutierrez, T., Krings, G., & Blondel, V. D. (2013) Evaluating socio-economic state of a country analyzing airtime credit and mobile phone datasets. arXiv preprint arXiv:1309.4496.

[27] Smith, C., Mashhadi, A., & Capra, L. (2013) Ubiquitous sensing for mapping poverty in developing countries. Paper submitted to the Orange D4D Challenge.

[28] Erlander, S. & Stewart, N. F. (1990) The gravity model in transportation analysis: theory and extensions (VSP) Vol. 3.

[29] Simini, F., Gonzalez, M. C., Maritan, A., & Barabasi, A.-L. (2012) A universal model for mobility and migration patterns. Nature 484, 96–100.

[30] Lenormand, M., Picornell, M., Cantu-Ros, O. G., Tugores, A., Louail, T., Herranz, R., Barthelemy, M., Frias-Martinez, E., & Ramasco, J. J. (2014) Cross-checking different sources of mobility information. arXiv preprint arXiv:1404.0333.

[31] Barthelemy, M. (2011) Spatial networks. Physics Reports 499, 1–101.

[32] Expert, P., Evans, T. S., Blondel, V. D., & Lambiotte, R. (2011) Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences 108, 7663–7668.

[33] Sobolevsky, S., Szell, M., Campari, R., Couronne, T., Smoreda, Z., & Ratti, C. (2013) Delineating geographical regions with networks of human interactions in an extensive set of countries. PloS one 8, e81707.

[34] Rosvall, M. & Bergstrom, C. T. (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105, 1118–1123.

[35] Danon, L., Diaz-Guilera, A., Duch, J., & Arenas, A. (2005) Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment 2005, P09008.

[36] ADigital (2013) Uso de Twitter en España 2012. [Online; accessed 1-November-2014].

[37] Davenport, J. R. & DeLine, R. (2014) The readability of tweets and their geographic correlation with education. arXiv preprint arXiv:1401.6058.

[38] Schneider, F., Buehn, A., & Montenegro, C. E. (2011) Shadow economies all over the world: New estimates for 162 countries from 1999 to 2007. Handbook on the shadow economy pp. 9–77.

[39] Rutherford, A., Cebrian, M., Dsouza, S., Moro, E., Pentland, A., & Rahwan, I. (2013) Limits of social mobilization. Proceedings of the National Academy of Sciences 110, 6281–6286.


Supporting Information for
Social media fingerprints of unemployment
Alejandro Llorente, Manuel García-Herranz, Manuel Cebrian, and Esteban Moro

S1 The dataset

Twitter provides an extremely rich and publicly available data set of user interactions, information flows and, thanks to the geo-location of tweets, user movements. Nevertheless, the representativeness of geo-located Twitter as a global source of mobility data has still received sparse attention. In this sense, while [13] present a promising and extensive study regarding global country-to-country movements (mostly driven by tourism), within-country human flows (comprising not only internal tourism but also, to a greater extent than country-to-country travel, visiting and commuting) still need further investigation. Therefore, throughout this work we compare our findings using geo-located Twitter with similar studies using commuting surveys.

For the Twitter analysis, we consider almost 146 million geo-located Twitter messages (tweets), collected through the public API provided by Twitter for the continental part of Spain from 29th November 2012 to 10th April 2013. In this dataset, we consider that there has been a trip from place l to place k if a user has tweeted in place l and place k consecutively. We only keep those transitions in which the first tweet and the second one are dated on the same day. We filter the trips database to avoid unrealistic transitions and keep only trips with a geographical displacement larger than 1 km (see Methods section). By this method, 1.38 million trips from 167,376 different users are considered in our work.
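The trip definition above can be sketched as follows. This is an illustrative reading of the text, not the authors' pipeline: the tuple layout, the function names `haversine_km` and `extract_trips`, and the sample coordinates are assumptions. The additional filter for unrealistic transitions mentioned in the text is not implemented, since its criteria are not given here.

```python
import math
from datetime import datetime

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def extract_trips(tweets, min_km=1.0):
    """Return (user, origin, destination) for consecutive same-day tweets.

    `tweets` is a list of (user, timestamp, lat, lon) tuples. A trip is kept
    only if both tweets fall on the same calendar day and the displacement
    is larger than `min_km`.
    """
    trips = []
    last_seen = {}
    for user, when, lat, lon in sorted(tweets, key=lambda t: (t[0], t[1])):
        previous = last_seen.get(user)
        if previous is not None:
            prev_when, prev_lat, prev_lon = previous
            same_day = prev_when.date() == when.date()
            if same_day and haversine_km(prev_lat, prev_lon, lat, lon) > min_km:
                trips.append((user, (prev_lat, prev_lon), (lat, lon)))
        last_seen[user] = (when, lat, lon)
    return trips

# Made-up sample: one valid trip, one next-day tweet, one tiny displacement.
sample = [
    ("u1", datetime(2013, 1, 5, 9, 0), 40.4168, -3.7038),
    ("u1", datetime(2013, 1, 5, 18, 30), 40.3281, -3.7635),  # same day, several km away
    ("u1", datetime(2013, 1, 6, 8, 0), 40.4168, -3.7038),    # next day: not a trip
    ("u2", datetime(2013, 1, 5, 10, 0), 40.4168, -3.7038),
    ("u2", datetime(2013, 1, 5, 10, 5), 40.4170, -3.7040),   # a few metres: filtered out
]
trips = extract_trips(sample)
```

Counting the resulting trips by the municipality of origin and destination would then give the flow matrix; that step needs a mapping from coordinates to municipality boundaries, which is outside this sketch.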

From those trips we construct the mobility flow $T_{ij}$ between municipalities, which counts the trips in our database whose origin lies within the boundaries of city $i$ and whose destination lies within those of city $j$.

We also consider population and economical information about the municipalities from the Spanish Census (2011) [8] and unemployment figures from the Public Service of Employment (Servicio Público de Empleo Estatal, SEPE) [7]. In the latter case, registered unemployment (in number of persons) is given for each Spanish municipality by gender, age, and month. To get unemployment rates, we divide registered unemployment by the total workforce in the municipality, estimated as the number of people aged between 16 and 65 years.

S2 Twitter as mobility proxy

Considering all of the available transitions in our database, one can compute the distance between origin and destination, the elapsed time of the transition, and the number of trips per user, among many other statistics. All of them seem to follow a power-law distribution with a cutoff due to the finite spatial size of Spain and the constraint of considering only transitions whose origin and destination check-ins are made on the same day. Focusing on the log-linear part of the distributions, self-similar behaviors arise when Twitter-based mobility is analyzed (see figure 5).

Twitter-based inter-city flows can be well modelled by means of the Gravity Law, which is one of the most widespread methods to represent human mobility [1, 19], with applications in many fields like urban planning [23], traffic engineering [4], or transportation problems [9]. The Gravity Law is also the solution to the problem of maximizing the entropy of the particle distribution among all the possible trips using statistical mechanics techniques [2, 22]. Recently, it has also been used as a model for human mobility based on cell phone traces [10, 20, 21] and social media data at a global scale [13] and at the inter-city level [14].

The Gravity Model for human mobility assumes that the flows between cities can be explained by the expression

$$T^{grav}_{ij} = \frac{P_i^{\alpha_1} P_j^{\alpha_2}}{d_{ij}^{\beta}} \qquad (2)$$

where $T^{grav}_{ij}$ is the flow, in terms of number of people, between cities $i$ and $j$, $d_{ij}$ is the geographical distance between them, and $P_i$ and $P_j$ are the populations of the two cities.

Given the data, we can obtain the parameters of the model by Weighted Least Squares Minimization:

$$\alpha_1^*, \alpha_2^*, \beta^* = \arg\min_{\alpha_1, \alpha_2, \beta} \frac{1}{N} \sum_{ij} w_{ij} \left( T_{ij} - T^{grav}_{ij} \right)^2 \qquad (3)$$

where N is the total number of connections in the mobility graph andwi j is a weight proportional to the number of observed transitionsbetween i and j In particular we find that taking wi j = T 13

i j givesthe best performance in the model
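The weighted least-squares objective of equation (3) can be written down directly. The sketch below is only illustrative: it replaces a proper numerical optimizer with a brute-force grid search, the names `gravity_flow`, `weighted_loss`, and `grid_search` are hypothetical, and the weight is taken as $w_{ij} = T_{ij}^{\gamma}$ with a free exponent `gamma` that is an assumption of this sketch, not a value taken from the text. The synthetic flows are generated from known parameters so that the search can be checked against them.

```python
import itertools

def gravity_flow(p_i, p_j, d_ij, alpha1, alpha2, beta):
    """Gravity model prediction: P_i^alpha1 * P_j^alpha2 / d_ij^beta."""
    return (p_i ** alpha1) * (p_j ** alpha2) / (d_ij ** beta)

def weighted_loss(flows, params, gamma=1.0):
    """Objective of equation (3) with weights w_ij = T_ij ** gamma.

    `flows` is a list of (T_ij, P_i, P_j, d_ij) tuples; `gamma` is a
    hypothetical free parameter of this sketch.
    """
    alpha1, alpha2, beta = params
    total = 0.0
    for t_ij, p_i, p_j, d_ij in flows:
        w_ij = t_ij ** gamma
        total += w_ij * (t_ij - gravity_flow(p_i, p_j, d_ij, alpha1, alpha2, beta)) ** 2
    return total / len(flows)

def grid_search(flows, grid, gamma=1.0):
    """Brute-force minimizer over all (alpha1, alpha2, beta) on `grid`."""
    return min(itertools.product(grid, repeat=3),
               key=lambda params: weighted_loss(flows, params, gamma))

# Synthetic flows generated from known parameters (0.5, 0.5, 1.0).
pairs = [(1000, 5000, 12.0), (5000, 20000, 40.0), (1000, 20000, 95.0), (20000, 1000, 95.0)]
synthetic = [(gravity_flow(p_i, p_j, d, 0.5, 0.5, 1.0), p_i, p_j, d) for p_i, p_j, d in pairs]
best = grid_search(synthetic, [0.0, 0.5, 1.0, 1.5])
```

On real data a gradient-based or simplex optimizer over continuous parameters would be the natural replacement for the grid search.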

In our case, this model fits the inter-city mobility based on Twitter GPS check-ins quite accurately (see table 1). Even though $T_{ij}$ is not necessarily symmetric, the exponents of the two populations are similar, indicating that we observe similar flows in both directions between $i$ and $j$.

S3 Community structures in inter-city mobility graph

Typically, complex networks exhibit community structure, that is, there are subsets of nodes that are more densely connected among themselves than to the rest of the nodes. In mobility networks, whose nodes correspond to geographical areas, these communities are interpreted as zones with high common activity and tend to be constrained by geographical and political barriers. We check whether this is also observed in our dataset by running six state-of-the-art community detection algorithms: FastGreedy [5], Walktrap [16], Infomap [18], MultiLevel [3], Label Propagation [17], and Leading Eigenvector [15]. These six algorithms yield different community structures in terms of number of communities, average community size, or modularity (see table 3). Members (municipalities) of the resulting communities are spatially connected, except in a few cases, as figure 7 shows. We test the statistical robustness of the obtained communities by randomly removing a proportion $p$ of the original links and running the algorithms on this new graph $G_p$. We consider communities to be robust when the communities found for the original network $G$ and for $G_p$ are highly similar. In order to compare two arbitrary community memberships, we use the Normalized Mutual Information (NMI) method described in [6], which returns 0 when two memberships are totally different and 1 when they are equal. We compute the NMI for


Figure 5. Probability distributions (density) for the different properties of daily trips in the Twitter dataset: trip distance (km), number of trips, and elapsed time (secs). Dashed lines correspond to power-law fits with exponents −1.67, −2.43, and −0.62, respectively.

each chosen algorithm performed on $G$ and $G_p$, for $p$ between 1% and 10%, concluding that the obtained community structures are robust because they are not broken when some randomly chosen links are removed (see table 2).
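To make the comparison step concrete, the sketch below computes a normalized mutual information between two community memberships, using the common form $2 I(A;B) / (H(A) + H(B))$. The function name and the toy memberships are assumptions, and returning 1 when both partitions consist of a single community is a convention chosen here. The link-removal and community-detection steps themselves require a graph library and are not shown.

```python
import math
from collections import Counter

def normalized_mutual_information(labels_a, labels_b):
    """NMI = 2 I(A;B) / (H(A) + H(B)) for two membership lists.

    `labels_a[n]` and `labels_b[n]` are the community labels of node n in
    the two partitions being compared; label values themselves are arbitrary.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual = sum(
        (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )
    h_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    if h_a + h_b == 0:
        return 1.0  # both partitions are a single community (assumed convention)
    return 2 * mutual / (h_a + h_b)

# Toy memberships for six municipalities (illustrative only).
original = [0, 0, 0, 1, 1, 1]
relabelled = ["b", "b", "b", "a", "a", "a"]   # same partition, different labels
perturbed = [0, 0, 1, 1, 1, 1]                # one municipality changes community
same = normalized_mutual_information(original, relabelled)
close = normalized_mutual_information(original, perturbed)
```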

Figure 6. Penetration rate versus population for both cities and detected communities.

As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether the mobility-based communities are related to any of these barriers. In Spain there are different territorial divisions for administration purposes. In this work we consider two of them: provinces, defined in the 1978 Constitution, which are 48 different heterogeneous aggregations of municipalities; and counties (comarca in Spanish terminology), which are traditional aggregations of municipalities mainly based on Spanish orography (rivers, valleys, ridges, etc.), some of which are composed of municipalities of different provinces. We again use the NMI method to compare the community structure given by the algorithms to the administrative limits. Except for the Leading Eigenvector algorithm, the methods return communities that are quite related to provinces (NMI ≈ 0.7), whereas higher variability is observed for the county administration limits. In this last case, the algorithm whose communities are most related to county limits is Infomap (NMI ≈ 0.83). Therefore, Twitter-based mobility summarizes the inter-city flows, showing that these flows are influenced by geographical and political barriers.

S4 Twitter demographics and unemployment rates

Different age groups are not equally represented in Twitter. Recent surveys (2012) in Spain suggest that most (86%) Twitter users are 16 to 44 years old. Comparison of the percentage of users per age group with the total population within the same groups (see figure 8) reveals that age groups above 35 years old are under-represented in Twitter. Thus, our Twitter data will be more revealing when trying to describe unemployment in age groups below 44 years old. This is indeed what we find when we build a linear model for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 25 can be fitted to a linear model with $R^2 = 0.62$, regression models for unemployment rates for ages between 25 and 44 have $R^2 = 0.52$, and for ages above 44 we get only $R^2 = 0.26$. Table 4 summarizes the results of the regression models of unemployment rates in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, figure 8 shows the performance of the model for the different age groups, and once again the poor explanatory power of the Twitter variables for the unemployment rate in ages above 44 years old is evident.

S5 Properties of Twitter variables

Normalization and distributions

Heterogeneity between the values of variables constructed from Twitter is large but not extreme, as the histograms in figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration rate $\tau_i$ and the misspellers rate $\varepsilon_i$ are defined as the number of users or misspellers per 100,000 persons (population); activity variables $\nu_i$ are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, $\mu_i$, are also given per 100,000 tweets published in the geographical area.

Figure 7. From left to right and from top to bottom: Fastgreedy, Walktrap, Infomap, Multilevel, Label Propagation, and Leading Eigenvector communities on Twitter-based mobility transitions.

Correlation between variables

Variables are constructed to reflect the behavior of areas in the different dimensions of Twitter penetration, social or geographical diversity, activity through the day, and content. The correlation between variables does indeed show that variables within each dimension are strongly correlated with each other. As we can see in figure 10, social and geographical diversities are highly correlated with each other, an expected fact given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate $\tau_i$ and the fraction of misspellers $\varepsilon_i$ have a strong correlation with most of the variables.

High correlation between variables might lead to collinearity effects [24] in the linear regression models, that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value, but its p-value is too high to consider it significant. To test this hypothesis, we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 exhibits the loadings of the different variables considered. The block structure shown in figure 10 results in similar directions of the variables in the first components of the PCA. We observe some groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we take one variable from each of them for our regression models.

S6 Misspellers detection

In this work we consider only tweets in Spanish. Since several languages coexist in Spain, depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet contains a misspelling or not, we need to establish some patterns to select from our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we will not consider the following cases:

• Lack of written accents. People tend to avoid writing accents when talking in a colloquial way.

• Mistakes derived from removing unnecessary letters. The most common cases are removing an "h" at the beginning of a word (in Spanish the letter "h" is not pronounced) or replacing the letters "qu" by "k". We understand that these mistakes can be motivated by the length limit of tweets and not by a real misspelling.

• Along the same lines, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.

• Nor do we consider mistakes related to features of specific areas in Spain. For example, in the south the pronunciation of "ce" and "se" is the same, which produces a large number of mistakes when writing. However, since we want to extract objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

Likewise, we will consider as real misspellings the following mistakes:

• Adding letters. For example, writing an "h" at the beginning of a word that starts with a vowel.

• Replacing the special cases "mp", "mb" with the wrong spellings "np", "nb".

• Mixing up "b" with "v", "g" with "j", "ll" with "y", and "ex" with "es". These are typical mistakes in Spanish because they have the same or a very close pronunciation.

• Confusing the verb "haber" with the periphrasis "a ver".

• Separating a word into two, for instance writing the word "conmigo" as "con migo".

Figure 9. Frequency plots, P(x), for each variable constructed from Twitter: penetration rate, Entropy 1 (geo), Entropy 2 (geo), Entropy 1 (social), Entropy 2 (social), activity (morning), activity (afternoon), activity (night), misspellers rate, and tweets mentioning job, employment, and unemployment.

Figure 8. Top: Percentage of population in each age group (16-24, 25-34, 35-44, 45-54, 55-64) from the Spanish Census (dark bars) and from surveys about Twitter users (light bars). Bottom: performance of the linear models (predicted versus observed unemployment, in %) for each of the age groups: all ages ($R^2 = 0.47$), < 24 ($R^2 = 0.62$), 25-44 ($R^2 = 0.52$), and > 44 ($R^2 = 0.26$).

This way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus, one can expect this selection to provide an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27055 (5.6% of the whole population).
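The detection rule described above, flagging a user as a misspeller if at least one tweet contains a listed expression, can be sketched as follows. This is an illustrative sketch only: the three-entry list stands in for the full list of expressions, which is not reproduced here, and the names `MISSPELLINGS`, `misspellers`, and `misspeller_rate` are hypothetical. Language filtering (done with textcat in the text) is assumed to have happened already.

```python
import re

# Hypothetical stand-in for the full list of misspelled expressions.
MISSPELLINGS = ["con migo", "aver", "llendo"]

# Match each expression only as a whole word or phrase, ignoring case.
PATTERN = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(m) for m in MISSPELLINGS) + r")(?!\w)",
    re.IGNORECASE,
)

def misspellers(tweets):
    """Return the set of users with at least one tweet containing a listed expression.

    tweets is a list of (user_id, text) pairs, assumed to be Spanish already.
    """
    return {user for user, text in tweets if PATTERN.search(text)}

def misspeller_rate(tweets):
    """Fraction of distinct users who are flagged as misspellers."""
    users = {user for user, _ in tweets}
    return len(misspellers(tweets)) / len(users) if users else 0.0

sample = [
    ("u1", "Aver si llueve hoy"),
    ("u1", "hola"),
    ("u2", "Ven conmigo"),
    ("u3", "voy llendo a casa"),
    ("u4", "a ver qué pasa"),
]
flagged = misspellers(sample)
rate = misspeller_rate(sample)
```

Word-boundary matching keeps the correct forms "conmigo" and "a ver" from being flagged. Mistakes that produce a valid word, such as writing "haber" for "a ver", would need context-aware rules that this sketch does not attempt.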

Figure 10. Top: Correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations at 95% confidence. Bottom: Projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.

We analyze whether the Twitter usage of misspellers differs from that of people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, it can be observed that misspellers tend to publish a larger number of tweets than those who did not make mistakes (144.71 against 23.72). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent $\approx 0.33$).
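A scaling exponent of this kind is commonly estimated as the slope of a least-squares line in log-log space. The sketch below shows that estimate on synthetic points generated with exponent 0.33; whether the authors used this estimator is not stated, and the function name and the data are assumptions for illustration.

```python
import math

def power_law_exponent(xs, ys):
    """Least-squares slope of log(y) on log(x), i.e. b in y ~ a * x**b."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx

# Synthetic data (not the paper's): mean misspellings growing as tweets**0.33.
n_tweets = [30, 50, 100, 200, 500]
mean_misspellings = [0.5 * n ** 0.33 for n in n_tweets]
exponent = power_law_exponent(n_tweets, mean_misspellings)
```

An exponent below 1 means that doubling the number of tweets less than doubles the expected number of misspellings.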

Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspeller rate as a proxy for the educational level of the cities. A large number of previous works have revealed the relationship between the economical status and the educational level of geographical areas, and it is therefore natural to ask whether the observed misspellers rate is related to the economy, as measured by the unemployment rate. To test this hypothesis, we consider cities with more than 5000 inhabitants to avoid subsampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).

Figure 11. Number N(miss|tweets) (red) and probability P(miss|tweets) (blue) of observed misspellings given the number of tweets.

Figure 12. Explanatory power ($R^2$) of the linear regression model when fitted against the unemployment data for different months, from October 2012 to June 2014. The gray (orange) area corresponds to the time window in which the Twitter data was collected and the variables were constructed.

S7 Time window and unemployment

In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment at different months within the time window in which the Twitter data was collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory power of the model when the linear regression is done for values of unemployment of different months before, during, and after the Twitter data time window. Although there is a small seasonal effect along the year, we see that the explanatory power remains around $R^2 = 0.6$, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that $R^2$ decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.

S8 Demographics do not explain unemployment

Since unemployment rates are very large among young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found in the geographical areas. To test this, we have built four linear models: the first one (named Youth model in Table 5) has the rate of young population as its only explanatory variable; the second and third are built from the Twitter variables only, using either all those considered in the main text (named Twitter model (I)) or just those whose regression coefficients are statistically significant (Twitter model (II)); the fourth is fitted with all the variables (named All variables model in Table 5). Table 5 shows the summary of the regression for each model. In terms of the variance explained by the model ($R^2$), considering all Twitter variables is nearly three times more explanatory than considering only the proportion of young people ($R^2 = 0.64$ against $R^2 = 0.24$). On the other hand, comparing the $R^2$ of the Twitter model with those of the All variables and Youth models shows that the rate of young population does not provide significant additional explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.

S9 Unemployment models for other geographical areas

While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we considered is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment when the variables are defined on those administrative areas and relate it to that obtained for the geographical communities detected and used in the main paper (see section 4). Not all the areas in each administrative division are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1738 municipalities which have a Twitter population $\pi > 10$. Similarly, we only consider the 198 counties with $\pi > 100$. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, $R^2$ increases as the number of areas in the model decreases, but the geographical description level of the model is very low for provinces, for example. The best performance (high $R^2$ and high geographical description level) is attained at the level of the detected communities.

S10 Relative importance of the variables

To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:

1. (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.

2. (lmg) Averaging over orderings, as proposed by Lindeman, Merenda and Gold.

3. (pmvd) The PMVD metric introduced by Feldman, which is an average over orderings as well, but with data-dependent weights.

4. (first) The univariate $R^2$ values from regression models with one variable only.

All these metrics are obtained using the relaimpo R package [12]. The results for the young unemployment model are shown in figure 13, where we can see that the different methods yield similar relative importance of the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.

Figure 13. Relative importance of the variables (in percentage) in the unemployment model for different ways to calculate it (weight, first, lmg, pmvd).
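The two simplest metrics in the list, "weight" and "first", can be reproduced without the relaimpo package. The sketch below is an illustration under assumptions: it uses NumPy and synthetic data, the function name is hypothetical, and the "first" values are returned as raw univariate $R^2$ rather than rescaled to percentages. The lmg and pmvd metrics, which average over variable orderings, are not implemented here.

```python
import numpy as np

def relative_importance(X, y):
    """Return (weight, first) importance metrics for the columns of X.

    weight: share of each |coefficient| when predictors are standardized.
    first:  R^2 of the regression of y on each predictor alone.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yc = y - y.mean()
    # Predictors and response are centered, so no intercept column is needed.
    beta, *_ = np.linalg.lstsq(Z, yc, rcond=None)
    weight = np.abs(beta) / np.abs(beta).sum()
    # For a single predictor, R^2 equals the squared Pearson correlation.
    first = np.array(
        [np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(X.shape[1])]
    )
    return weight, first

# Synthetic example (not the paper's data): the first predictor dominates.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * rng.normal(size=500)
weight, first = relative_importance(X, y)
```

When predictors are strongly correlated the two metrics can disagree, which is one reason to compare several of them, as the text does.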

References

[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.

[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.

[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.

[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.

[6] Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.

[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html

[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm

[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.

[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.

[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.

[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.

[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.

[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.

[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.

[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences - ISCIS 2005, pages 284–293. Springer, 2005.

[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.

[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.

[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.

[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.

[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.

[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.

[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.

[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.

Gravity Model

Parameter     Description                                              Spain
$\alpha_1$    Origin exponent                                          0.477*** (0.002)
$\alpha_2$    Destination exponent                                     0.478*** (0.002)
$\beta$       Distance exponent                                        1.05*** (0.0035)
$R^2$         Goodness of fit                                          0.797
$\phi$        Correlation between $T_{ij}$ and $T^{gra}_{ij}$          0.826

Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.
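As a rough illustration of what the exponents in Table 1 imply, the sketch below evaluates a gravity-law flux of the form $T^{gra}_{ij} \propto P_i^{\alpha_1} P_j^{\alpha_2} / d_{ij}^{\beta}$. The functional form, the use of populations as the masses, the proportionality constant `k`, and the sample inputs are assumptions made for illustration; only the three exponents are taken from Table 1.

```python
# Exponents from Table 1; everything else in this sketch is illustrative.
ALPHA1 = 0.477   # origin exponent
ALPHA2 = 0.478   # destination exponent
BETA = 1.05      # distance exponent

def gravity_flux(pop_origin, pop_dest, distance, k=1.0):
    """Flux predicted by an assumed gravity law k * P_i^a1 * P_j^a2 / d^b."""
    return k * pop_origin ** ALPHA1 * pop_dest ** ALPHA2 / distance ** BETA

# Effect of doubling the distance between two hypothetical areas.
near = gravity_flux(100_000, 50_000, 20.0)
far = gravity_flux(100_000, 50_000, 40.0)
ratio = far / near   # equals 2 ** (-BETA), independent of k and populations
```

With a distance exponent slightly above 1, doubling the distance cuts the predicted flux to a little under half, whatever the populations involved.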

NMI between $G$ and $G_p$ for different $p$

Algorithm   p = 0.01   0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09    0.1
FG          0.995      0.992   0.989   0.983   0.981   0.977   0.983   0.969   0.980   0.959
WT          0.954      0.959   0.950   0.954   0.945   0.948   0.947   0.935   0.926   0.931
IM          0.988      0.981   0.980   0.981   0.978   0.974   0.975   0.970   0.969   0.966
ML          0.994      0.978   0.979   0.983   0.948   0.934   0.972   0.952   0.973   0.947
LP          0.906      0.908   0.911   0.915   0.895   0.907   0.907   0.893   0.905   0.904
LE          0.960      0.957   0.956   0.859   0.910   0.892   0.908   0.858   0.885   0.884

Table 2. NMI measure comparing $G$ and $G_p$.
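For readers who want to reproduce this kind of comparison, the sketch below computes a normalized mutual information between two partitions given as label lists. It assumes the normalization $2I(A;B)/(H(A)+H(B))$, a common choice for comparing community structures; the exact variant used for Table 2 is not stated here, and the function name is hypothetical.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information 2*I(A;B) / (H(A) + H(B)) of two partitions.

    labels_a[i] and labels_b[i] are the community labels of node i.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mutual = sum(
        c / n * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    if h_a + h_b == 0:
        return 1.0   # both partitions put every node in a single community
    return 2 * mutual / (h_a + h_b)

same_up_to_relabeling = nmi([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The measure is 1 for identical partitions, regardless of how communities are labeled, and 0 when knowing one partition says nothing about the other.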

Communities Stats

Algorithm   $\langle |N_i| \rangle_i$   $\max |N_i|$   Number of $N_i$   Modularity   NMI P   NMI C
FG          309.696                     1385           23                0.726        0.712   0.590
WT          9.262                       433            769               0.417        0.744   0.757
IM          21.011                      143            339               0.758        0.770   0.831
ML          323.772                     1132           22                0.800        0.717   0.599
LP          22.052                      750            323               0.732        0.749   0.761
LE          1017.571                    5344           7                 0.381        0.264   0.205

Table 3. Statistics of the communities $N_i$ returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.

                          All ages    < 24        25-44       > 44
(Intercept)               0.11***     0.10***     0.20***     0.20***
                          (0.02)      (0.03)      (0.03)      (0.035)
Penetration rate          3.23*       8.57***     6.28**      2.40
                          (1.41)      (2.22)      (2.17)      (2.77)
Geographical diversity    0.03        0.15***     0.08*       0.06
                          (0.02)      (0.04)      (0.04)      (0.05)
Social diversity          -0.03*      -0.03       -0.05*      -0.06*
                          (0.01)      (0.02)      (0.02)      (0.03)
Morning activity          -0.69*      -1.30**     -1.53***    -1.19*
                          (0.26)      (0.42)      (0.41)      (0.52)
Misspellers rate          11.56       31.51*      15.46       23.60
                          (8.13)      (12.78)     (12.48)     (15.94)
Employment mentions       -1.80       3.17        -9.94       2.71
                          (6.27)      (9.86)      (9.64)      (12.3)
R2                        0.47        0.64        0.55        0.29
Adj. R2                   0.44        0.62        0.52        0.26
*** p < 0.001, ** p < 0.01, * p < 0.05

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in groups of less than 24 years, between 25 and 44 years, and above 44 years.
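The relation between the two goodness-of-fit rows of Table 4 can be checked with the standard adjusted $R^2$ formula, $1 - (1 - R^2)(n-1)/(n-p-1)$. The sketch below assumes $n = 128$ observations, the number of points listed for the communities in Table 6, and $p = 6$ predictors; both are assumptions about how the table was produced. Because the inputs are already rounded to two decimals, the last digit may differ for some columns.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjusted R^2 for a linear model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Assumed sample size (Table 6, communities) and number of predictors.
N_OBS, N_PREDICTORS = 128, 6

adj_under_24 = adjusted_r2(0.64, N_OBS, N_PREDICTORS)
adj_all_ages = adjusted_r2(0.47, N_OBS, N_PREDICTORS)
```

Under these assumptions the "< 24" and "All ages" columns are recovered to two decimals, which supports reading the $R^2 = 0.62$ quoted in the main text as the adjusted value.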

                          All variables   Youth model   Twitter model (I)   Twitter model (II)
(Intercept)               0.06            -0.02         0.10***             0.09***
                          (0.03)          (0.03)        (0.03)              (0.027)
Young pop. rate           0.66*           2.20***
                          (0.30)          (0.35)
Penetration rate          8.20***                       8.57***             8.62***
                          (2.25)                        (2.22)              (2.21)
Geographical diversity    0.14***                       0.15***             0.12***
                          (0.04)                        (0.04)              (0.03)
Social diversity          -0.02                         -0.03
                          (0.02)                        (0.02)
Morning activity          -1.42***                      -1.30**             -1.28**
                          (0.41)                        (0.42)              (0.41)
Misspellers rate          23.95                         31.51*              32.28*
                          (13.09)                       (12.78)             (12.71)
Employment mentions       0.34                          3.17
                          (9.81)                        (9.86)
R2                        0.65            0.24          0.64                0.63
Adj. R2                   0.63            0.24          0.62                0.62
*** p < 0.001, ** p < 0.01, * p < 0.05

Table 5. Regression table for the different statistical models. The All variables model includes both Twitter and rate of young population variables. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).

                          Communities   Municipalities   Counties    Provinces
(Intercept)               0.10***       0.16***          0.11***     0.11*
                          (0.03)        (0.01)           (0.03)      (0.05)
Penetration rate          8.57***       4.01***          9.12***     10.47***
                          (2.22)        (0.59)           (1.81)      (1.97)
Geographical diversity    0.15***       0.02             0.12***     0.08
                          (0.04)        (0.01)           (0.03)      (0.07)
Social diversity          -0.03         -0.01            -0.01       -0.03
                          (0.02)        (0.01)           (0.02)      (0.07)
Morning activity          -1.30**       -1.16***         -1.49***    -1.03
                          (0.42)        (0.14)           (0.39)      (0.88)
Misspellers rate          31.51*        14.40***         14.09
                          (12.78)       (2.51)           (10.02)
Employment mentions       3.17          -0.71            2.41        -3.17
                          (9.86)        (0.89)           (8.86)      (12.29)
Number of points          128           1738             198         50
R2                        0.64          0.22             0.55        0.65
Adj. R2                   0.62          0.21             0.54        0.61
*** p < 0.001, ** p < 0.01, * p < 0.05

Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.


Figure 2. Examples of different behaviour in the observed variables and the unemployment. In A we observe that two cities with different unemployment levels have different temporal activity patterns. Panel C shows how communities (red) with distinct entropy levels of social communication with other communities (blue) may hold different unemployment intensity: the left map shows a highly focused communication pattern (low entropy), while the right map corresponds to a community with a diverse communication pattern (high entropy). Finally, panel B shows some examples of detected misspellings in our database using 618 incorrect expressions (see SI Section 6), such as “Con migo”, “Aver” or “llendo”.

of metrics related to social media activity. Some of those metrics are already reported in the literature, but some others are introduced in this work. Specifically, we consider:

• Social media technology adoption: we can use the Twitter penetration rate $\tau_i = \pi_i / P_i$ in each area $i$ as a proxy of technology adoption. Recent works have shown that there is indeed a correlation between country GDP and Twitter penetration; specifically, a positive correlation was found between $\tau_i$ and GDP at the country level [23]. However, in our data we find the opposite correlation (see figure 3), namely that the larger the penetration rate, the larger the unemployment, which suggests that the impact of technology adoption at country scale is different from what happens within an (industrialized) country, where the technology to access social media is commoditized.

• Social media activity: regions with very different economical situations should exhibit different patterns of activity during the day. Since working, leisure, family, shopping, etc. activities happen at different times of the day, we might observe different daily patterns in regions with different socio-economical status. For example, we hypothesize that communities with low levels of unemployment will tend to have higher activity levels at the beginning of a typical weekday. This is indeed what we find: figure 2A shows the hourly fraction of tweets during workdays for two communities with very different rates of unemployment. As we can observe, both profiles are quite different, and in the case of low unemployment we find a strong peak of activity between 8 and 11am (morning) and lower activity during the afternoons and nights. We encode this finding in $\nu^{mrng}_i$, $\nu^{aftn}_i$ and $\nu^{ngt}_i$, the total fraction of tweets happening in geographical area $i$ between 8am and 10am, 3pm and 5pm, and 12am and 3am, respectively. Figure 3 shows a strong negative correlation between $\nu^{mrng}_i$ and the unemployment for the communities in our database, and a positive correlation with $\nu^{aftn}_i$ and $\nu^{ngt}_i$.

• Social media content: some works have observed a correlation between the frequency of words related to work conditions [22] or forward-looking searches [21] and the economical situation of countries. In our case, we also find that there is a moderate positive correlation between the fraction of tweets $\mu_i$ mentioning job or unemployment terms and the observed unemployment, while the correlation is negative for the number of mentions of employment or the economy. However, we have also tried a different approach by measuring the relation between the way of writing and the educational level [37]. To this end, we build a list of 618 misspelled Spanish expressions and extract the tweets of the dataset containing at least one of them (see SI section 6 for further details about how these expressions were collected). We only consider tweets in Spanish, detected with an n-gram based algorithm, and only misspellings that cannot be justified as abbreviations. Finally, we compute for every region the proportion $\varepsilon_i$ of misspellers among the Twitter population. If the fraction of misspellers per geographical area is a proxy for the educational level of that region, we expect a positive correlation between $\varepsilon_i$ and unemployment. Indeed, we find (see figure 3) that there is a strong correlation between the fraction of misspellers and unemployment.

• Social media interactions and geographical flow diversity: following the ideas in [14], which correlated the economical development of an area with the diversity of its communications with other areas, we consider all tweets mentioning another user and take them as a proxy for communication between users. Then we compute the number of communications $w_{ij}$ between areas $i$ and $j$ as the number of mentions between users in those areas. To measure the diversity we use, as in [14], the informational normalized entropy (Entropy 1) $S^u_i = -\frac{1}{S^r_i} \sum_j p_{ij} \log p_{ij}$, where $p_{ij} = w_{ij} / \sum_j w_{ij}$, and (Entropy 2) $S^r_i = \log k_i$, with $k_i$ the number of different areas with which users in area $i$ have interacted. As in [14], we find that areas with large unemployment have less diverse communication patterns than areas with low unemployment. This translates into a strong negative correlation between $S^u_i$ and the unemployment, see figure 3. Similar ideas are applied to the flows of people between areas to investigate the diversity of the geographical flows through the entropy $\tilde{S}^u_i = -\frac{1}{\tilde{S}^r_i} \sum_j p_{ij} \log p_{ij}$, where $p_{ij} = T_{ij} / \sum_j T_{ij}$ and $\tilde{S}^r_i = \log k_i$, with $k_i$ the number of different areas visited by users that live in area $i$. Figure 3 shows that, as in [19], the correlation of these geographical entropies with economical development is low.
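The two entropy measures defined in the last item can be computed directly from the counts of mentions (or trips) leaving one area. The sketch below is illustrative: the function name and the sample counts are hypothetical, and returning zero for an area connected to fewer than two other areas is a convention chosen here, since the normalization $\log k_i$ vanishes in that case.

```python
import math

def diversity_entropies(weights):
    """Return (Entropy 1, Entropy 2) for one area.

    weights maps each destination area j to w_ij (mentions) or T_ij (trips).
    Entropy 2 is log(k_i); Entropy 1 is the Shannon entropy of p_ij divided by it.
    """
    positive = [w for w in weights.values() if w > 0]
    k = len(positive)
    if k < 2:
        return 0.0, 0.0   # convention for this sketch: log(k) is zero or undefined
    total = sum(positive)
    s_r = math.log(k)
    s_u = -sum(w / total * math.log(w / total) for w in positive) / s_r
    return s_u, s_r

# Hypothetical counts: one area with diverse links, one with focused links.
diverse = diversity_entropies({"a": 25, "b": 25, "c": 25, "d": 25})
focused = diversity_entropies({"a": 97, "b": 1, "c": 1, "d": 1})
```

Both areas reach the same number of destinations, so Entropy 2 is identical, while Entropy 1 separates the evenly spread pattern (value 1) from the focused one (value close to 0).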

Normalization of variables is discussed in SI section 5. We have also studied the correlation between the variables considered. As expected, variables in each group show moderate correlations between them. However, inspection of the correlation matrix and a Principal Component Analysis of the variables considered show that there is information (as percentage of variance in the data) in each of the groups of variables, see SI section 5. Because of these two facts, we restrict our analysis to the variables within each group with the highest correlation with the unemployment, namely the penetration rate $\tau_i$, the social and mobility diversity variables $S^u_i$ and $\tilde{S}^u_i$, the morning activity $\nu^{mrng}_i$, the fraction of misspellers $\varepsilon_i$, and the fraction of employment-related tweets $\mu^{emp}_i$.

3 Explanatory power of social media in unemployment

The four previous groups of variables are fingerprints of human behavior reflected in Twitter usage habits. As we observed in figure 3, all of them exhibit statistically strong correlations with unemployment. The question we address in this section is whether those variables suffice to explain the observed unemployment (their explanatory power) and which of them are the most important (which give more explanatory power than others). Note that we are not claiming a causal link between the measures built in the previous section and the unemployment rate, but only exploring whether they can be used as alternative indicators with a real translation in the economy.

Figure 4 shows the result of a simple linear regression model for the observed unemployment for ages below 25 years as a function of the variables which have more correlation with the unemployment. The model has a significant $R^2 = 0.62$, showing that a large part of the unemployment can be explained by the behavioral variables extracted from Twitter. However, not all the variables weigh equally in the model: specifically, the penetration rate, geographical diversity, morning activity, and fraction of misspellers account for up to 92% of the explained variance, while social diversity and the number of employment-related tweets are not statistically significant (see SI section 10 for the methods used to determine the relative importance of the variables). It is interesting to note that while social diversity obtained from mobile phone communications was a key variable in the explanation of deprivation indexes in [14, 19], the communication diversity of Twitter users seems to have a minor role in explaining the heterogeneity of unemployment in Spain.
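A regression of this kind, one response fitted on six behavioral variables plus an intercept, can be sketched with ordinary least squares. The code below is an illustration on synthetic data and not the authors' pipeline: it assumes NumPy, and the function name, coefficients, and noise level are invented for the example.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns (coefficients, R^2).

    coefficients[0] is the intercept, coefficients[1:] follow the columns of X.
    """
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    centered = y - y.mean()
    r2 = 1.0 - (residuals @ residuals) / (centered @ centered)
    return coef, r2

# Synthetic data (not the paper's): 128 areas, 6 standardized variables.
rng = np.random.default_rng(42)
X = rng.normal(size=(128, 6))
true_coef = np.array([0.5, -0.3, 0.2, 0.0, 0.1, -0.1])
y = 0.1 + X @ true_coef + 0.01 * rng.normal(size=128)
coef, r2 = fit_ols(X, y)
```

Significance levels such as those reported in the regression tables additionally require standard errors of the coefficients, which this sketch leaves out.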

Similar explanatory power is found for other age groups: $R^2 = 0.44$ for all ages and $R^2 = 0.52$ for ages between 25 and 44 years. However, the model degrades for ages above 44 years ($R^2 = 0.26$), suggesting that our variables mainly describe the behavior of the most represented age groups in Twitter, namely those below 44 years old. On the other hand, since our Twitter variables seem to describe the behavior of young people, we have investigated whether Twitter-constructed variables have similar explanatory value (in terms of $R^2$) to simple census demographic variables for young people. However, regression models that also include the young population rate yield only a minor improvement, $R^2 = 0.65$, while the young population rate alone gives $R^2 = 0.24$, a result which shows that Twitter variables do indeed possess genuine explanatory power beyond their simple demographic representation. Finally, our model has the largest explanatory power for detected communities, but large $R^2$ are also found for other geographical areas like counties ($R^2 = 0.54$) and provinces ($R^2 = 0.65$), see SI section 9.

4 Discussion

This work serves as a proof of concept for how a wide rangeof behavioral features linked to socioeconomic behavior can

Figure 3. A) Correlation coefficient of all the extracted Twitter metrics, grouped by technology adoption (black), geographical diversity (orange), social diversity (light blue), temporal activity (green) and content analysis (dark blue). Error bars correspond to 95% confidence intervals of the correlation coefficient. The gray area corresponds to the statistical significance thresholds. Panels B, C, D and E show the values of 4 selected variables (penetration rate, Entropy 1 (social), morning activity and misspellers rate) in each geographical community against its percentage of unemployment. The size of the points is proportional to the population in each geographical community. Solid lines correspond to linear fits to the data.

10

15

20

25

10 15 20 25x

y

5

10

15

2025

5 10 15 20 25x

y

0 10 20 30per

order

col0000000072B2009E7356B4E9E69F00

Penetration rate

Entropy 1 (geo)

Entropy 1 (social)

Activity (morning)

Misspellers rate

Employment tweetsR2 = 062

Pred

icte

d un

empl

oym

ent

Observed unemployment Weight

A B

R2 = 052

Observed unemployment

CAge lt 25 25 lt Age lt 44

Figure 4 A) and B) Performance of the model showing the predicted unemployment rate for ages below 25 versus the observedone R2 = 062 and with ages between 25 and 44 Dashed lines correspond to the equality line and plusmn20 error C) Percentageof weight for each of the variables in the regression model using the relative weight of the absolute values of coefficients in theregression model (see SI section 10) Variables marked with lowast are not statistical significant in the model

be inferred from the digital traces that are left by publicly-available social media In particular we demonstrate thatbehavioral features related to unemployment can be recoveredfrom the digital exhaust left by the microblogging networkTwitter First of all Twitter geolocalized traces together withoff-the-shelve community detection algorithms render an op-timal partition of a country for economical activity showingthe remarkable power of social media to understand and unveileconomical behavior at a country-scale This insight is likelyto apply to other administrative definitions in other countriesspecially when considering large cities with an inherent dy-namical nature and evolution of mobility fluxes and citiescomposed of small satellite cities with arbitrary agglomera-tions or division among them (eg London NYC Singapore)

This result is unsurprising: it should be natural to recompute city clusters/communities of activity based on their real-time mobility, which may vary considerably faster than the update rates of mobility and travel surveys [31–33].

Our main result demonstrates that several key indicators (different penetration rates among regions, fingerprints of the temporal patterns of activity, content lexical correctness, and geo-social connectivities among regions) can be extracted from social media and then used to infer unemployment levels. These findings shed light in two directions: first, on how individuals' extensive use of their social channels allows us to characterize cities based on their activity in a meaningful fashion, and second, on how this information can be used to build economic indicators that are directly related to the economy. Regarding the latter, our work is important for understanding how country-scale analysis of social media should consider not only the demographic but also the economical differences between users. As we have shown, users in areas with large unemployment have different mobility, different social interactions, and different daily activity than those in low-unemployment areas. This intertwined relationship between user behavior and employment should be considered not only in economical analyses derived from social media, but also in other applications such as marketing, communication, social mobilization, etc.

It is particularly remarkable that Twitter data can provide these accurate results. Twitter is, among the many currently popular social networking platforms, perhaps the noisiest, sparsest, most 'sabotaged' medium: very few users send out messages at a regular rate, most users do not have geolocated information, the social relationships (followers/followees) contain a lot of unused or unimportant links, it is plagued by spam bots, and, last but not least, we have no way to identify the motive, goal, or functionality of the mobility fluxes we are able to extract. These limitations are not particular to our sample but general to the Twitter data samples employed in the computational social science community. Despite all these caveats, we are able to show that even some simple filtering techniques, together with basic statistical regressions, yield predictive power about a variable as important as unemployment. Other social media platforms such as Facebook, Google+, Sina Weibo, Instagram, Orkut, or Flickr, with more granular and consistent individual data, are likely to provide similar or better results, by themselves or in combination. Further improvements can be obtained by the use of more sophisticated statistical machine learning techniques, some of them even tailored to the peculiarities of social media data. Our work serves to illustrate the tremendous potential of these new digital datasets to improve the understanding of society's functioning at the finest scales of granularity.

The usefulness of our approach must be weighed against the cost and update rate of performing detailed surveys of mobility, social structure, and economic performance. Our database is publicly articulated, which means that our analysis could be replicated easily in other countries, in other time periods, and with different scopes. Naturally, surveys provide more accurate results, but they also consume considerably more financial and human resources, employing hundreds of people and taking months or even years to complete and be released. They are so costly that countries going through economic recession have recently considered discontinuing them or altering their update rate. A particularly problematic aspect of these surveys is that they are "out of sync", i.e., a census may be up to date whereas those same individuals' travel surveys may not be, and therefore drawing inferences between both may be particularly difficult. This is a challenging problem that the immediacy of social media can help ameliorate.

A few questions remain open for further investigation. How can traditional surveys and social media digital traces be best combined to maximize their predictive ability? Can social media provide a reliable leading indicator of unemployment and, in general, of economic surveys? How much reliable lead is possible, if any? As we have found, Twitter penetration and educational levels are correlated with unemployment, but these levels are unlikely to change rapidly enough to describe or anticipate changes in the economy or unemployment. However, other indicators such as daily activity, social interactions, and geographical mobility are more closely connected with our daily routines, and perhaps they have more predictive power to show and/or anticipate sudden changes in employment. The relationship between unemployment and individual and group behavior may help contextualize the multiple factors affecting the socioeconomic well-being of a region: while penetration, content, daily activity, and mobility diversity seem to be highly correlated with unemployment in Spain, different weights for each group of traces might be expected in other countries [14]. Finally, digital traces could serve as an alternative (sometimes the only one available) where surveys are lacking, as in poor or remote areas [20, 27]. Another interesting avenue of research involves the use of social media to detect mismatches between the real (hidden, underground) economy and the officially reported one [38].

Most importantly, the immediacy of social media may also allow governments to better measure and understand the effect of policies, social changes, and natural or man-made disasters on the economical status of cities in almost real time [18, 39]. These new avenues for research provide great opportunities at the intersection of the economic, social, and computational sciences that originate from these new, widespread, inexpensive datasets.

Acknowledgments

We would like to thank Kristina Lerman, Lada Adamic, James Fowler, Daniel Villatoro, and Ricardo Herranz for stimulating discussions, and Yuri Kryvasheyeu and Thomas Bochynek for their critical reading of the manuscript. This work was partially supported by Spanish Ministry of Science and Technology Grant FIS2013-47532-C3-3-P (to A.L., M.G.H., and E.M.). Manuel Cebrian is funded by the Australian Government as represented by the Department of Broadband, Communications and Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

References

[1] Becker, G. S. (1976) The economic approach to human behavior (University of Chicago Press).

[2] Granovetter, M. (1985) Economic action and social structure: the problem of embeddedness. American Journal of Sociology, pp. 481–510.

[3] Camerer, C. F., Loewenstein, G., & Rabin, M. (2011) Advances in behavioral economics (Princeton University Press).

[4] Glaeser, E. L., Kallal, H. D., Scheinkman, J. A., & Shleifer, A. (1991) Growth in cities (National Bureau of Economic Research), Technical report.

[5] Bettencourt, L. M., Lobo, J., Helbing, D., Kuhnert, C., & West, G. B. (2007) Growth, innovation, scaling, and the pace of life in cities. Proceedings of the National Academy of Sciences 104, 7301–7306.

[6] Batty, M. (2008) The size, scale, and shape of cities. Science 319, 769–771.

[7] Milgram, S. (1974) The experience of living in cities. Crowding and behavior 167, 41.


[8] Pan, W., Ghoshal, G., Krumme, C., Cebrian, M., & Pentland, A. (2013) Urban characteristics attributable to density-driven tie formation. Nature Communications 4.

[9] Gonzalez, M. C., Hidalgo, C. A., & Barabasi, A.-L. (2008) Understanding individual human mobility patterns. Nature 453, 779–782.

[10] Calabrese, F., Diao, M., Lorenzo, G. D., Jr., J. F., & Ratti, C. (2013) Understanding individual mobility patterns from urban sensing data: A mobile phone trace example. Transportation Research Part C: Emerging Technologies 26, 301–313.

[11] Cheng, Z., Caverlee, J., Lee, K., & Sui, D. Z. (2011) Exploring Millions of Footprints in Location Sharing Services (AAAI, Menlo Park, CA, USA).

[12] Cho, E., Myers, S. A., & Leskovec, J. (2011) Friendship and mobility: user movement in location-based social networks. KDD '11 (ACM, New York, NY, USA), pp. 1082–1090.

[13] Sun, L., Jin, J. G., Axhausen, K. W., Lee, D.-H., & Cebrian, M. (2014) Quantifying long-term evolution of intra-urban spatial interactions. arXiv preprint arXiv:1407.0145.

[14] Eagle, N., Macy, M., & Claxton, R. (2010) Network diversity and economic development. Science 328, 1029–1031.

[15] Henrich, J., Boyd, R., Bowles, S., Camerer, C., Fehr, E., Gintis, H., & McElreath, R. (2001) In search of homo economicus: behavioral experiments in 15 small-scale societies. American Economic Review, pp. 73–78.

[16] Krieger, N., Williams, D. R., & Moss, N. E. (1997) Measuring social class in US public health research: concepts, methodologies, and guidelines. Annual Review of Public Health 18, 341–378.

[17] Groves, R. M., Fowler Jr., F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2013) Survey methodology (John Wiley & Sons).

[18] Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M., et al. (2009) Life in the network: the coming age of computational social science. Science (New York, NY) 323, 721.

[19] Smith, C., Quercia, D., & Capra, L. (2013) Finger on the pulse: identifying deprivation using transit flow analysis (ACM), pp. 683–692.

[20] Soto, V., Frias-Martinez, V., Virseda, J., & Frias-Martinez, E. (2011) Prediction of socioeconomic levels using cell phone records. UMAP'11 (Springer-Verlag, Berlin, Heidelberg), pp. 377–388.

[21] Preis, T., Moat, H. S., Stanley, H. E., & Bishop, S. R. (2012) Quantifying the advantage of looking forward. Scientific Reports 2.

[22] Antenucci, D., Cafarella, M., Levenstein, M. C., Re, C., & Shapiro, M. D. (2014) Using social media to measure labor market flows (National Bureau of Economic Research), Technical report.

[23] Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P., & Ratti, C. (2013) Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680.

[24] Lathia, N., Quercia, D., & Crowcroft, J. (2012) in Pervasive computing (Springer), pp. 91–98.

[25] Frias-Martinez, V., Virseda, J., & Frias-Martinez, E. (2010) Socio-economic levels and human mobility.

[26] Gutierrez, T., Krings, G., & Blondel, V. D. (2013) Evaluating socio-economic state of a country analyzing airtime credit and mobile phone datasets. arXiv preprint arXiv:1309.4496.

[27] Smith, C., Mashhadi, A., & Capra, L. (2013) Ubiquitous sensing for mapping poverty in developing countries. Paper submitted to the Orange D4D Challenge.

[28] Erlander, S. & Stewart, N. F. (1990) The gravity model in transportation analysis: theory and extensions (VSP), Vol. 3.

[29] Simini, F., Gonzalez, M. C., Maritan, A., & Barabasi, A.-L. (2012) A universal model for mobility and migration patterns. Nature 484, 96–100.

[30] Lenormand, M., Picornell, M., Cantu-Ros, O. G., Tugores, A., Louail, T., Herranz, R., Barthelemy, M., Frias-Martinez, E., & Ramasco, J. J. (2014) Cross-checking different sources of mobility information. arXiv preprint arXiv:1404.0333.

[31] Barthelemy, M. (2011) Spatial networks. Physics Reports 499, 1–101.

[32] Expert, P., Evans, T. S., Blondel, V. D., & Lambiotte, R. (2011) Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences 108, 7663–7668.

[33] Sobolevsky, S., Szell, M., Campari, R., Couronne, T., Smoreda, Z., & Ratti, C. (2013) Delineating geographical regions with networks of human interactions in an extensive set of countries. PloS one 8, e81707.

[34] Rosvall, M. & Bergstrom, C. T. (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105, 1118–1123.

[35] Danon, L., Diaz-Guilera, A., Duch, J., & Arenas, A. (2005) Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment 2005, P09008.

[36] ADigital (2013) Uso de Twitter en España 2012. [Online; accessed 1-November-2014].

[37] Davenport, J. R. & DeLine, R. (2014) The readability of tweets and their geographic correlation with education. arXiv preprint arXiv:1401.6058.

[38] Schneider, F., Buehn, A., & Montenegro, C. E. (2011) Shadow economies all over the world: New estimates for 162 countries from 1999 to 2007. Handbook on the shadow economy, pp. 9–77.

[39] Rutherford, A., Cebrian, M., Dsouza, S., Moro, E., Pentland, A., & Rahwan, I. (2013) Limits of social mobilization. Proceedings of the National Academy of Sciences 110, 6281–6286.


Supporting Information for
Social media fingerprints of unemployment
Alejandro Llorente, Manuel García-Herranz, Manuel Cebrian, and Esteban Moro

S1 The dataset

Twitter provides an extremely rich and publicly available data set of user interactions, information flows and, thanks to the geo-location of tweets, user movements. Nevertheless, the representativeness of geo-located Twitter as a global source of mobility data has so far received little attention. In this sense, while [13] presents a promising and extensive study of global country-to-country movements (mostly driven by tourism), within-country human flows (comprising not only internal tourism but also, to a greater extent than country-to-country travel, visiting and commuting) still need further investigation. Therefore, throughout this work we compare our findings from geo-located Twitter with a similar analysis based on commuting surveys.

For the Twitter analysis we consider almost 146 million geo-located Twitter messages (tweets) collected through the public API provided by Twitter for the continental part of Spain, from 29th November 2012 to 10th April 2013. In this dataset we consider that there has been a trip from place l to place k if a user has tweeted in place l and place k consecutively. We only keep those transitions in which the first tweet and the second one are dated on the same day. We filter the trips database to avoid unrealistic transitions and keep only trips with a geographical displacement larger than 1 km (see Methods section). By this method, 1.38 million trips from 167,376 different users are considered in our work.

From those trips we construct the mobility flow $T_{ij}$ between municipalities, which measures the number of trips in our database whose origin lies within the boundaries of city $i$ and whose destination lies within those of city $j$.
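The trip definition and the flow matrix can be sketched as follows. This is an illustrative reading of the description above, not the authors' pipeline: the record fields (`user`, `time`, `lat`, `lon`, `city`) are hypothetical names, the haversine formula is an assumed choice of distance, and the additional filter for unrealistic transitions is not reproduced.

```python
from collections import Counter
from datetime import datetime
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def extract_trips(tweets, min_km=1.0):
    """Yield (origin_city, destination_city) for consecutive same-day tweets
    of one user that are more than `min_km` apart."""
    last_seen = {}
    for t in sorted(tweets, key=lambda rec: (rec["user"], rec["time"])):
        prev = last_seen.get(t["user"])
        last_seen[t["user"]] = t
        if prev is None or prev["time"].date() != t["time"].date():
            continue
        if haversine_km(prev["lat"], prev["lon"], t["lat"], t["lon"]) > min_km:
            yield prev["city"], t["city"]


def mobility_flows(tweets):
    """Count trips per ordered (origin, destination) pair, i.e. T_ij."""
    return Counter(extract_trips(tweets))


# Made-up example records (coordinates roughly Madrid and Getafe).
tweets = [
    {"user": "u1", "time": datetime(2013, 1, 5, 9, 0), "lat": 40.4168, "lon": -3.7038, "city": "Madrid"},
    {"user": "u1", "time": datetime(2013, 1, 5, 18, 0), "lat": 40.3083, "lon": -3.7327, "city": "Getafe"},
    {"user": "u1", "time": datetime(2013, 1, 6, 8, 0), "lat": 40.4168, "lon": -3.7038, "city": "Madrid"},
    {"user": "u2", "time": datetime(2013, 1, 5, 10, 0), "lat": 40.4168, "lon": -3.7038, "city": "Madrid"},
    {"user": "u2", "time": datetime(2013, 1, 5, 10, 5), "lat": 40.4170, "lon": -3.7040, "city": "Madrid"},
]
flows = mobility_flows(tweets)
```

In this toy input only the first pair of tweets by `u1` counts as a trip: the third tweet falls on a different day, and the two tweets by `u2` are less than 1 km apart.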

We also consider population and economical information about the municipalities from the Spanish Census (2011) [8] and unemployment figures from the Public Service of Employment (Servicio Público de Empleo Estatal, SEPE) [7]. In the latter case, registered unemployment (in number of persons) is given for each Spanish municipality by gender, age, and month. To obtain unemployment rates, we divide registered unemployment by the total workforce in the municipality, estimated as the number of people aged between 16 and 65 years.
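The rate defined above is a simple ratio; a minimal sketch follows. Pooling municipalities into a larger area by summing numerators and denominators is an assumption made here for illustration, and the field names and figures are made up.

```python
def unemployment_rate(registered_unemployed, workforce):
    """Registered unemployment as a percentage of the estimated workforce."""
    if workforce <= 0:
        raise ValueError("workforce estimate must be positive")
    return 100.0 * registered_unemployed / workforce


def area_rate(municipalities):
    """Pooled rate for a group of municipalities (assumed aggregation rule):
    total registered unemployed over total population aged 16 to 65."""
    unemployed = sum(m["unemployed"] for m in municipalities)
    workforce = sum(m["pop_16_65"] for m in municipalities)
    return unemployment_rate(unemployed, workforce)


# Made-up figures for two hypothetical municipalities.
towns = [
    {"unemployed": 1200, "pop_16_65": 8000},
    {"unemployed": 300, "pop_16_65": 2000},
]
rate = area_rate(towns)
```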

S2 Twitter as mobility proxy

Considering all of the available transitions in our database, one can compute the distance between origin and destination, the elapsed time of the transition, and the number of trips per user, among many other statistics. All of them seem to show a power-law distribution with a cutoff due to the finite spatial size of Spain and the constraint of considering only transitions whose origin and destination check-ins occur on the same day. Focusing on the log-linear part of the distributions, self-similar behaviors arise when Twitter-based mobility is analyzed (see figure 5).

Twitter-based inter-city flows can be well modelled by means of the Gravity Law, which is one of the most widely used methods to represent human mobility [1, 19], with applications in many fields such as urban planning [23], traffic engineering [4], and transportation problems [9]. The Gravity Law is also the solution to the problem of maximizing the entropy of the particle distribution among all the possible trips, using statistical mechanics techniques [2, 22]. Recently it has also been used as a model for human mobility based on cell phone traces [10, 20, 21] and social media data at a global scale [13] and at the inter-city level [14].

The Gravity Model for human mobility assumes that the flows between cities can be explained by the expression

$$T^{\mathrm{grav}}_{ij} = \frac{P_i^{\alpha_1} P_j^{\alpha_2}}{d_{ij}^{\beta}} \qquad (2)$$

where $T^{\mathrm{grav}}_{ij}$ is the flow, in number of people, between cities $i$ and $j$, $d_{ij}$ is the geographical distance between them, and $P_i$ and $P_j$ are the populations of the two cities.

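Given the data, we can obtain the parameters of the model by weighted least squares minimization:

$$\alpha_1^{*}, \alpha_2^{*}, \beta^{*} = \operatorname*{arg\,min}_{\alpha_1, \alpha_2, \beta} \frac{1}{N} \sum_{i,j} w_{ij} \left( T_{ij} - T^{\mathrm{grav}}_{ij} \right)^2 \qquad (3)$$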

where $N$ is the total number of connections in the mobility graph and $w_{ij}$ is a weight proportional to the number of observed transitions between $i$ and $j$. In particular, we find that taking wi j = T 13 i j gives the best performance in the model.

In our case this model fits the inter-city mobility based on Twitter GPS check-ins quite accurately (see table 1). Even though we do not assume $T_{ij}$ to be symmetric, the exponents of the populations are similar, indicating that we observe similar flows in both directions between $i$ and $j$.

S3 Community structures in inter-city mobility graph

Figure 5. Probability distributions of the different properties of daily trips in the Twitter dataset: trip distance (km), number of trips, and elapsed time (secs). Dashed lines correspond to power-law fits with exponents −1.67, −2.43, and −0.62, respectively.

Typically, complex networks exhibit community structure; that is, there are subsets of nodes that are more densely connected among themselves than to the rest of the nodes. In mobility networks, whose nodes correspond to geographical areas, these communities are interpreted as zones with high common activity and tend to be constrained by geographical and political barriers. We check whether this is also observed in our dataset by running six state-of-the-art community detection algorithms: FastGreedy [5], Walktrap [16], Infomap [18], MultiLevel [3], Label Propagation [17], and Leading Eigenvector [15]. These six algorithms yield different community structures in terms of number of communities, average community size, or modularity (see table 3). Members (municipalities) of the resulting communities are spatially connected, with few exceptions, as figure 7 shows. We test the statistical robustness of the obtained communities by randomly removing a proportion p of the original links and running the algorithms on the new graph $G_p$. We consider communities robust when those found for the original network $G$ and for $G_p$ are highly similar. To compare two arbitrary community memberships we use the Normalized Mutual Information (NMI) method described in [6], which returns 0 when two memberships are totally different and 1 when they are equal. We compute the NMI for each chosen algorithm run on $G$ and $G_p$, for p between 1% and 10%, concluding that the obtained community structures are robust because they are not broken when some randomly chosen links are removed (see table 2).
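The two ingredients of this robustness test, random link removal and the NMI between two community memberships, can be sketched in a few lines. The normalization used below, $2 I(A;B) / (H(A) + H(B))$, is one common form of NMI and is assumed here; the community detection step itself is left out, and the function names are illustrative.

```python
import random
from collections import Counter
from math import log


def nmi(labels_a, labels_b):
    """Normalized mutual information 2*I(A;B) / (H(A) + H(B)) of two partitions,
    given as equally long lists of community labels (one label per node)."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    if h_a + h_b == 0:
        return 1.0  # both partitions put every node in a single community
    mutual = sum(
        c / n * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    return 2.0 * mutual / (h_a + h_b)


def remove_links(edges, p, seed=0):
    """Return a random subset of `edges` with a fraction p of the links removed."""
    rng = random.Random(seed)
    keep = len(edges) - round(p * len(edges))
    return rng.sample(edges, keep)


# Two labelings of four nodes that describe the same split under other names.
same_split = nmi([0, 0, 1, 1], [5, 5, 7, 7])
# Two labelings that share no information.
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

In a full test, a community detection algorithm would be run on the original edge list and on `remove_links(edges, p)` for each p, and the two resulting label lists passed to `nmi`.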

Figure 6. Penetration rates for both cities and detected communities (horizontal axis: population; vertical axis: penetration rate).

As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether the mobility-based communities are related to any of these barriers. In Spain there are different territorial divisions for administrative purposes. In this work we consider two of them: provinces, defined in the 1978 Constitution, which are 48 different heterogeneous aggregations of municipalities; and counties (comarcas in Spanish terminology), which are traditional aggregations of municipalities mainly based on Spanish orography (rivers, valleys, ridges, etc.), some of which are composed of municipalities from different provinces. We again use the NMI method to compare the community structure given by the algorithms with the administrative limits. Except for the Leading Eigenvector algorithm, the methods return communities that are closely related to provinces (NMI ≈ 0.7), whereas higher variability is observed for the county limits. In this last case, the algorithm most closely related to county limits is Infomap (NMI ≈ 0.83). Therefore, Twitter-based mobility summarizes the inter-city flows, showing that these flows are influenced by geographical and political barriers.

S4 Twitter demographics and unemployment rates

Different age groups are not equally represented on Twitter. Recent surveys (2012) in Spain suggest that most Twitter users (86%) are 16 to 44 years old. Comparison of the percentage of users per age group with the total population within the same groups (see figure 8) reveals that age groups above 35 years old are under-represented on Twitter. Thus our Twitter data will be more revealing when trying to describe unemployment in age groups below 44 years old. This is indeed what we find when we build a linear model for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 24 can be fitted to a linear model with R² = 0.62, regression models for unemployment rates for ages between 25 and 44 have R² = 0.52, and for ages above 44 we get only R² = 0.26. Table 4 summarizes the results of the regression models of unemployment rates in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, figure 8 shows the performance of the model for the different age groups; once again, the Twitter variables have poor explanatory power for the unemployment rate at ages above 44 years old.

S5 Properties of Twitter variables

Normalization and distributions

Figure 7. From left to right and from top to bottom: FastGreedy, Walktrap, Infomap, MultiLevel, Label Propagation, and Leading Eigenvector communities on Twitter-based mobility transitions.

Heterogeneity between the values of variables constructed from Twitter is sizable but moderate, as the histograms in figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration $\tau_i$ and the misspellers rate $\varepsilon_i$ are defined as the number of users or misspellers per 100,000 persons (population); activity variables $\nu_i$ are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, $\mu_i$, are given per 100,000 tweets published in the geographical area.
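The three normalizations described in this subsection amount to the following arithmetic. The sketch is only illustrative: the function names, field names, and counts are made up and do not correspond to any region in the data.

```python
def per_100k(count, base):
    """Scale a raw count to a rate per 100,000 units of `base`."""
    return 100_000 * count / base


def activity_share(tweets_in_interval, total_tweets):
    """Percentage of an area's tweets that fall in a given time interval."""
    return 100.0 * tweets_in_interval / total_tweets


# Made-up counts for one hypothetical geographical area.
area = {
    "population": 20_000,
    "twitter_users": 50,
    "misspellers": 4,
    "tweets": 40_000,
    "morning_tweets": 2_000,
    "job_tweets": 120,
}

penetration = per_100k(area["twitter_users"], area["population"])    # tau_i
misspellers_rate = per_100k(area["misspellers"], area["population"])  # epsilon_i
morning_activity = activity_share(area["morning_tweets"], area["tweets"])  # nu_i
job_mentions = per_100k(area["job_tweets"], area["tweets"])           # mu_i
```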

Correlation between variables

Variables are constructed to reflect the behavior of areas in the different dimensions of Twitter penetration, social or geographical diversity, activity through the day, and content. The correlations between variables do indeed show that variables within each dimension are strongly correlated with each other. As we can see in figure 10, social and geographical diversities are highly correlated, an expected fact given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate $\tau_i$ and the fraction of misspellers $\varepsilon_i$ are strongly correlated with most of the variables.

High correlation between variables might lead to collinearity effects [24] in the linear regression models; that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value, but its p-value is too high to consider it significant. To test this hypothesis we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 shows the loadings of the different variables. The block structure shown in figure 10 results in similar directions of the variables in the first components of the PCA. We observe some groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we have taken one variable from each group for our regression models.
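A collinearity check of this kind can be sketched with a correlation matrix and its leading principal direction. The code below is a simplified stand-in for a full PCA: it uses power iteration to obtain only the first component of the correlation matrix, and the three variables are made-up numbers in which the first two are nearly collinear.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Matrix of pairwise Pearson correlations between the given columns."""
    return [[pearson(a, b) for b in columns] for a in columns]


def first_component(matrix, iterations=200):
    """Leading eigenvector (unit length) and eigenvalue by power iteration."""
    v = [1.0 + 0.1 * i for i in range(len(matrix))]
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        vi * sum(m * x for m, x in zip(row, v)) for vi, row in zip(v, matrix)
    )
    return v, eigenvalue


# Made-up values: two nearly collinear diversity measures and one activity measure.
geo_diversity = [1.0, 2.0, 3.0, 4.0, 5.0]
social_diversity = [1.1, 1.9, 3.2, 3.9, 5.1]
morning_activity = [2.0, 1.0, 2.0, 1.0, 2.0]

corr = correlation_matrix([geo_diversity, social_diversity, morning_activity])
loadings, eigenvalue = first_component(corr)
```

With these inputs the first component loads almost equally on the two diversity measures and barely on the activity measure, which is the pattern one would expect when two regressors carry the same information.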

Figure 8. Top: percentage of population in each age group (16–24, 25–34, 35–44, 45–54, 55–64) from the Spanish Census (dark bars) and from surveys about users in Twitter (light bars). Bottom: performance of the linear models (predicted versus observed unemployment, in %) for each of the age groups: all ages (R² = 0.47), < 24 (R² = 0.62), 25–44 (R² = 0.52), and > 44 (R² = 0.26).

Figure 9. Frequency plots for each variable constructed from Twitter.

S6 Misspellers detection

In this work we consider only tweets in Spanish. Since several languages coexist in Spain depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet contains a misspelling, we need to establish which patterns to look for in our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we do not consider the following cases:

• Lack of written accents. People tend to avoid writing accents when writing colloquially.

• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters qu by k. We understand that these mistakes can be motivated by the length limit of tweets rather than by a real misspelling.

• Along the same lines, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.

• We also do not consider mistakes related to features of specific areas of Spain. For example, in the south the pronunciation of ce and se is the same, which produces a large number of mistakes when writing. However, since we want to extract objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

Likewise, we consider the following mistakes as real misspellings:

• Adding letters. For example, writing an h at the beginning of a word that starts with a vowel.

• Replacing the special cases mp, mb by the wrong spellings np, nb.

• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish because the pairs have the same or a very close pronunciation.

• Confusing the verb haber with the periphrasis a ver.

• Separating a word into two, for instance writing the word conmigo as con migo.

This way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus, one can expect this selection to provide an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27055 (56 over the whole population).

Figure 10. Top: Correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables, and whose color is blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations at 95% confidence. Bottom: Projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.

We analyze whether misspellers use Twitter differently from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, we observe that misspellers tend to publish more tweets than those who made no mistakes (14471 against 2372). This also emerges when the mean number of misspellings is considered as a function of the total number of tweets. For users with fewer than approximately 30 published tweets in the observation period, the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).
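A sub-linear scaling exponent of this kind is commonly estimated as the slope of a least-squares line in log-log space. The following is a minimal sketch of that idea, not the authors' procedure: the function name `loglog_slope` is hypothetical, and the data are synthetic values generated from an exact power law with exponent 0.33 purely for illustration.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x).

    For data following y ~ c * x**a, the slope estimates the exponent a.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    return sxy / sxx

# Synthetic illustration only (NOT data from the paper): an exact power law.
tweets = [30, 50, 100, 200, 500]
misspellings = [0.5 * t ** 0.33 for t in tweets]
exponent = loglog_slope(tweets, misspellings)
```

With real counts the fit would typically be restricted to the range where the scaling holds (here, users with more than roughly 30 tweets), since the text notes that the number of misspellings is almost zero below that threshold.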

Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspeller rate as a proxy for the educational level of the cities. A large number of previous works have revealed the relationship between the economic status and the educational level of geographical areas, and therefore it is natural to ask whether the observed misspeller rate is related to the economy, as measured by the unemployment rate. To test this hypothesis, we consider cities with more than 5000 inhabitants to avoid undersampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).


Figure 11. Number N(miss|tweets) (red) and probability P(miss|tweets) (blue) of observed misspellings given the number of tweets.

Figure 12. Explanatory power ($R^2$) of the linear regression model when fitted against the unemployment data for different months, from October 2012 to June 2014. The gray (orange) area corresponds to the time window in which the Twitter data was collected and the variables were constructed.

S7 Time window and unemployment

In the definition of the variables we aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment measured at different months within the time window in which the Twitter data was collected, and whether the variables collected in that window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory power of the model when the linear regression is done for the unemployment values of different months before, during, and after the Twitter data time window. Although there is a small seasonal effect along the year, the explanatory power remains around $R^2 = 0.6$, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that $R^2$ decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.

S8 Demographics do not explain unemployment

Since unemployment rates are very high among young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found across the geographical areas. To test this, we have built four linear models: the first one (named Youth model in Table 5) has the rate of young population as its only explanatory variable; the second and third are built on the Twitter variables only, using either all the variables considered in the main text (Twitter model (I)) or just those whose regression coefficients are statistically significant (Twitter model (II)); the fourth is fitted with all the variables (named All variables model in Table 5). Table 5 shows the summary of the regression for each model. Focusing on the variance explained by each model in terms of $R^2$, considering all Twitter variables is roughly three times more explanatory than considering only the proportion of young people ($R^2 = 0.64$ against $R^2 = 0.24$). On the other hand, comparing $R^2$ for the Twitter model with that for the All variables and Youth models shows that the rate of young population does not add significant explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.

S9 Unemployment models for other geographical areas

While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we considered is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment with the variables defined in those administrative areas, and relate it to the geographical communities detected and used in the main paper (see section 4). Not all the areas at the different administrative divisions are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1738 municipalities which have a Twitter population π > 10. Similarly, we only consider the 198 counties with π > 100. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, $R^2$ increases as the number of areas in the model gets smaller, but the description level of the model is very low for provinces, for example. The best performance (high $R^2$ and high geographical description level) is attained at the level of the detected communities.

S10 Relative importance of the variables

To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:

1. (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.

2. (lmg) Averaging over orderings, as proposed by Lindeman, Merenda and Gold.


Figure 13. Relative importance of the variables (in percentage) in the unemployment model for different ways to calculate it (weight, first, lmg, pmvd).

3. (pmvd) The PMVD metric introduced by Feldman, which is an average over orderings as well, but with data-dependent weights.

4. (first) The univariate $R^2$ values from regression models with one variable only.

All these metrics are obtained using the relaimpo R package [12]. The results for the youth unemployment model are shown in figure 13, where we can see that the different methods yield similar relative importance for the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.
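To make the "averaging over orderings" idea behind the lmg metric concrete, the sketch below adds predictors one at a time in every possible order and averages each predictor's increment in $R^2$. It is an illustrative pure-Python reimplementation under stated assumptions, not the relaimpo code used in the paper: the function names `r2` and `lmg` are hypothetical, the data are synthetic, and the brute-force loop over all orderings is only practical for a handful of variables. The pmvd metric is not sketched here.

```python
from itertools import permutations

def r2(columns, y):
    """R^2 of an OLS fit of y on an intercept plus the given predictor columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    m = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(p):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [a - f * b for a, b in zip(m[r], m[c])]
    beta = [m[i][p] / m[i][i] for i in range(p)]
    mean_y = sum(y) / n
    ss_res = sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
                 for r, t in zip(X, y))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

def lmg(predictors, y):
    """Average, over all orderings, of each predictor's sequential gain in R^2."""
    names = list(predictors)
    shares = dict.fromkeys(names, 0.0)
    orders = list(permutations(names))
    for order in orders:
        included, previous = [], 0.0  # the intercept-only model has R^2 = 0
        for name in order:
            included.append(name)
            current = r2([predictors[k] for k in included], y)
            shares[name] += current - previous
            previous = current
    return {k: v / len(orders) for k, v in shares.items()}

# Synthetic illustration only (NOT data from the paper).
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [3.1, 4.9, 9.2, 10.8, 15.1, 16.9]
shares = lmg({"x1": x1, "x2": x2}, y)
full_r2 = r2([x1, x2], y)
first_x1 = r2([x1], y)  # the "first" metric: univariate R^2 of one variable
```

By construction the lmg shares add up to the $R^2$ of the full model, which is what allows them to be reported as percentages of the explained variance.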

References

[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.

[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.

[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.

[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.

[6] Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.

[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html

[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm

[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.

[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.

[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.

[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.

[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.

[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.

[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.

[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences - ISCIS 2005, pages 284–293. Springer, 2005.

[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.

[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.

[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.

[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.

[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.

[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.

[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.

[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.


Gravity Model

| Parameter | Description | Spain |
|---|---|---|
| α1 | Origin exponent | 0.477*** (0.002) |
| α2 | Destination exponent | 0.478*** (0.002) |
| β | Distance exponent | 1.05*** (0.0035) |
| $R^2$ | Goodness of fit | 0.797 |
| φ | Correlation between $T_{ij}$ and $T^{gra}_{ij}$ | 0.826 |

Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.

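NMI between G and Gp for different p

| Algorithm | p = 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 | 0.1 |
|---|---|---|---|---|---|---|---|---|---|---|
| FG | 0.995 | 0.992 | 0.989 | 0.983 | 0.981 | 0.977 | 0.983 | 0.969 | 0.980 | 0.959 |
| WT | 0.954 | 0.959 | 0.950 | 0.954 | 0.945 | 0.948 | 0.947 | 0.935 | 0.926 | 0.931 |
| IM | 0.988 | 0.981 | 0.980 | 0.981 | 0.978 | 0.974 | 0.975 | 0.970 | 0.969 | 0.966 |
| ML | 0.994 | 0.978 | 0.979 | 0.983 | 0.948 | 0.934 | 0.972 | 0.952 | 0.973 | 0.947 |
| LP | 0.906 | 0.908 | 0.911 | 0.915 | 0.895 | 0.907 | 0.907 | 0.893 | 0.905 | 0.904 |
| LE | 0.960 | 0.957 | 0.956 | 0.859 | 0.910 | 0.892 | 0.908 | 0.858 | 0.885 | 0.884 |

Table 2. NMI measure comparing G and Gp.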

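Communities Stats

| Algorithm | $\langle \lvert N_i \rvert \rangle_i$ | $\max \lvert N_i \rvert$ | $\lvert N_i \rvert$ | Modularity | NMI P | NMI C |
|---|---|---|---|---|---|---|
| FG | 309.696 | 1385 | 23 | 0.726 | 0.712 | 0.590 |
| WT | 9.262 | 433 | 769 | 0.417 | 0.744 | 0.757 |
| IM | 21.011 | 143 | 339 | 0.758 | 0.770 | 0.831 |
| ML | 323.772 | 1132 | 22 | 0.800 | 0.717 | 0.599 |
| LP | 22.052 | 750 | 323 | 0.732 | 0.749 | 0.761 |
| LE | 1017.571 | 5344 | 7 | 0.381 | 0.264 | 0.205 |

Table 3. Statistics of the communities $N_i$ returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.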
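For readers unfamiliar with the NMI scores reported in Tables 2 and 3, the sketch below computes a normalized mutual information between two partitions of the same set of areas. The text here does not state which normalization was used, so this is an assumption: the sketch uses the symmetric form $2 I(A;B) / (H(A) + H(B))$, which is a common choice when comparing community structures (cf. [6]). The function name `nmi` and the toy labels are hypothetical.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information 2*I(A;B) / (H(A) + H(B)).

    labels_a[i] and labels_b[i] are the groups assigned to item i by the
    two partitions being compared (e.g. detected community vs. province).
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mi = sum(c / n * math.log(c * n / (count_a[a] * count_b[b]))
             for (a, b), c in joint.items())
    if h_a + h_b == 0:
        return 1.0  # assumption: two single-group partitions count as identical
    return 2 * mi / (h_a + h_b)

# Toy illustration: identical partitions up to relabeling, and independent ones.
same = nmi([0, 0, 1, 1], ["x", "x", "y", "y"])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The score is 1 when the two partitions coincide up to a relabeling of the groups and 0 when knowing one partition says nothing about the other.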


| | All ages | < 24 | 25−44 | > 44 |
|---|---|---|---|---|
| (Intercept) | 0.11*** (0.02) | 0.10*** (0.03) | 0.20*** (0.03) | 0.20*** (0.035) |
| Penetration rate | 3.23* (1.41) | 8.57*** (2.22) | 6.28** (2.17) | 2.40 (2.77) |
| Geographical diversity | 0.03 (0.02) | 0.15*** (0.04) | 0.08* (0.04) | 0.06 (0.05) |
| Social diversity | −0.03* (0.01) | −0.03 (0.02) | −0.05* (0.02) | −0.06* (0.03) |
| Morning activity | −0.69* (0.26) | −1.30** (0.42) | −1.53*** (0.41) | −1.19* (0.52) |
| Misspellers rate | 11.56 (8.13) | 31.51* (12.78) | 15.46 (12.48) | 23.60 (15.94) |
| Employment mentions | −1.80 (6.27) | 3.17 (9.86) | −9.94 (9.64) | 2.71 (12.3) |
| $R^2$ | 0.47 | 0.64 | 0.55 | 0.29 |
| Adj. $R^2$ | 0.44 | 0.62 | 0.52 | 0.26 |

***p < 0.001, **p < 0.01, *p < 0.05. Standard errors in parentheses.

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in the groups of less than 24 years, between 25 and 44 years, and above 44 years.

| | All variables | Youth model | Twitter model (I) | Twitter model (II) |
|---|---|---|---|---|
| (Intercept) | 0.06 (0.03) | −0.02 (0.03) | 0.10*** (0.03) | 0.09*** (0.027) |
| Young pop. rate | 0.66* (0.30) | 2.20*** (0.35) | | |
| Penetration rate | 8.20*** (2.25) | | 8.57*** (2.22) | 8.62*** (2.21) |
| Geographical diversity | 0.14*** (0.04) | | 0.15*** (0.04) | 0.12*** (0.03) |
| Social diversity | −0.02 (0.02) | | −0.03 (0.02) | |
| Morning activity | −1.42*** (0.41) | | −1.30** (0.42) | −1.28** (0.41) |
| Misspellers rate | 23.95 (13.09) | | 31.51* (12.78) | 32.28* (12.71) |
| Employment mentions | 0.34 (9.81) | | 3.17 (9.86) | |
| $R^2$ | 0.65 | 0.24 | 0.64 | 0.63 |
| Adj. $R^2$ | 0.63 | 0.24 | 0.62 | 0.62 |

***p < 0.001, **p < 0.01, *p < 0.05. Standard errors in parentheses.

Table 5. Regression table for the different statistical models. The All variables model includes both the Twitter variables and the rate of young population. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).


| | Communities | Municipalities | Counties | Provinces |
|---|---|---|---|---|
| (Intercept) | 0.10*** (0.03) | 0.16*** (0.01) | 0.11*** (0.03) | 0.11* (0.05) |
| Penetration rate | 8.57*** (2.22) | 4.01*** (0.59) | 9.12*** (1.81) | 10.47*** (1.97) |
| Geographical diversity | 0.15*** (0.04) | 0.02 (0.01) | 0.12*** (0.03) | 0.08 (0.07) |
| Social diversity | −0.03 (0.02) | −0.01 (0.01) | −0.01 (0.02) | −0.03 (0.07) |
| Morning activity | −1.30** (0.42) | −1.16*** (0.14) | −1.49*** (0.39) | −1.03 (0.88) |
| Misspellers rate | 31.51* (12.78) | 14.40*** (2.51) | 14.09 (10.02) | |
| Employment mentions | 3.17 (9.86) | −0.71 (0.89) | 2.41 (8.86) | −3.17 (12.29) |
| Number of points | 128 | 1738 | 198 | 50 |
| $R^2$ | 0.64 | 0.22 | 0.55 | 0.65 |
| Adj. $R^2$ | 0.62 | 0.21 | 0.54 | 0.61 |

***p < 0.001, **p < 0.01, *p < 0.05. Standard errors in parentheses.

Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model, the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.


searches [21] to the economical situation of countries. In our case we also find a moderate positive correlation between the fraction of tweets $\mu_i$ mentioning job or unemployment terms and the observed unemployment, while the correlation is negative for the number of mentions of employment or the economy. However, we have also tried a different approach by measuring the relation between the way of writing and the educational level [37]. To this end, we build a list of 618 misspelled Spanish expressions and extract the tweets of the dataset containing at least one of these words (see SI section 6 for further details about how these expressions were collected). We only consider tweets in Spanish, detected with an N-gram based algorithm. Then, we only consider misspellings that cannot be justified as abbreviations. Finally, we compute for every region the proportion $\varepsilon_i$ of misspellers among the Twitter population. If the fraction of misspellers per geographical area is a proxy for the educational level of that region, we expect a positive correlation between $\varepsilon_i$ and unemployment. Indeed, we find (see figure 3) that there is a strong correlation between the fraction of misspellers and unemployment.

• Social media interactions and geographical flow diversity: following the ideas in [14], which correlated the economical development of an area with the diversity of communications with other areas, we consider all tweets mentioning another user and take them as a proxy for communication between users. Then we compute the number of communications $w_{ij}$ between areas $i$ and $j$ as the number of mentions between users in those areas. To measure the diversity we use, as in [14], the informational normalized entropy (Entropy 1) $S^u_i = -\sum_j p_{ij} \log p_{ij} / S^r_i$, where $p_{ij} = w_{ij} / \sum_j w_{ij}$, and (Entropy 2) $S^r_i = \log k_i$, with $k_i$ the number of different areas with which users in area $i$ have interacted. As in [14], we find that areas with large unemployment have less diverse communication patterns than areas with low unemployment. This translates into a strong negative correlation between $S^u_i$ and unemployment, see figure 3. Similar ideas are applied to the flows of people between areas to investigate the diversity of the geographical flows through the entropy $\tilde{S}^u_i = -\sum_j p_{ij} \log p_{ij} / \tilde{S}^r_i$, where now $p_{ij} = T_{ij} / \sum_j T_{ij}$ and $\tilde{S}^r_i = \log k_i$, with $k_i$ the number of different areas visited by users that live in area $i$. Figure 3 shows that, as in [19], the correlation of these geographical entropies with economical development is low.
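The normalized entropy defined above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' code: the function name `normalized_entropy` and the toy counts are hypothetical, and returning 0 for an area with fewer than two partners (where $\log k_i = 0$ makes the ratio undefined) is a convention chosen here, not one stated in the text.

```python
import math

def normalized_entropy(weights):
    """Shannon entropy of the flows out of one area, divided by log(k).

    weights maps each partner area j to w_ij (mentions) or T_ij (trips);
    k is the number of partner areas with a positive weight.
    """
    positive = [w for w in weights.values() if w > 0]
    k = len(positive)
    if k < 2:
        return 0.0  # assumption: log(k) = 0, so we define the diversity as 0
    total = sum(positive)
    entropy = -sum((w / total) * math.log(w / total) for w in positive)
    return entropy / math.log(k)

# Toy counts (NOT data from the paper): same number of partners, different spread.
even = normalized_entropy({"A": 5, "B": 5, "C": 5})
concentrated = normalized_entropy({"A": 98, "B": 1, "C": 1})
```

The value is 1 when interactions are spread evenly over all partner areas and approaches 0 when almost all of them go to a single area, so two areas with the same $k_i$ can still differ strongly in diversity.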

Normalization of variables is discussed in SI section 5. We have also studied the correlation between the variables considered. As expected, variables in each group show moderate correlations between them. However, inspection of the correlation matrix and a Principal Component Analysis of the variables show that there is information (as percentage of variance in the data) in each of the groups of variables, see SI section 5. Because of these two facts, we restrict our analysis to the variables within each group with the highest correlation with unemployment, namely the penetration rate $\tau_i$, the social and mobility diversity variables $S^u_i$ and $\tilde{S}^u_i$, the morning activity $\nu^{mrng}_i$, the fraction of misspellers $\varepsilon_i$, and the fraction of employment-related tweets $\mu^{emp}_i$.

3 Explanatory power of social media in unemployment

The four previous groups of variables are fingerprints of human behavior reflected in Twitter usage habits. As we observed in figure 3, all of them exhibit statistically strong correlations with unemployment. The question we address in this section is whether those variables suffice to explain the observed unemployment (their explanatory power), and which of them are the most important (which give more explanatory power than others). Note that we are not claiming a causal arrow between the measures built in the previous section and the unemployment rate, but only exploring whether they can be used as alternative indicators with a real translation in the economy.

Figure 4 shows the result of a simple linear regression model for the observed unemployment for ages below 25 years as a function of the variables most correlated with unemployment. The model has a significant $R^2 = 0.62$, showing that a large part of the variation in unemployment is encoded in the behavioral variables extracted from Twitter. However, not all the variables weigh equally in the model: specifically, the penetration rate, geographical diversity, morning activity, and fraction of misspellers account for up to 92% of the explained variance, while social diversity and the number of employment-related tweets are not statistically significant (see SI section 10 for the methods used to determine the relative importance of the variables). It is interesting to note that while social diversity obtained from mobile phone communications was a key variable in the explanation of deprivation indexes in [14, 19], the communication diversity of Twitter users seems to have a minor role in explaining the heterogeneity of unemployment in Spain.
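As a reminder of how the $R^2$ and adjusted $R^2$ figures of such a linear model are obtained, here is a minimal pure-Python sketch of an ordinary least squares fit. It is illustrative only and is not the authors' code: the function names `solve`, `ols_r2` and `adjusted_r2` are hypothetical, and the fitted data are synthetic. The last two lines apply the adjusted $R^2$ formula to values reported in Table 6 of the SI (communities: $R^2 = 0.64$ with 128 points and 6 predictors; provinces: $R^2 = 0.65$ with 50 points and 5 predictors).

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def adjusted_r2(r2, n_points, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n_points - 1) / (n_points - n_predictors - 1)

def ols_r2(rows, y):
    """Fit y ~ intercept + predictors; return (coefficients, R^2, adjusted R^2)."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * t for x, t in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - f) ** 2 for t, f in zip(y, fitted))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    r2 = 1.0 - ss_res / ss_tot
    return beta, r2, adjusted_r2(r2, len(y), p - 1)

# Synthetic illustration only (NOT data from the paper): y = 2 + 3*x1 - x2.
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
y = [2 + 3 * a - b for a, b in rows]
beta, r2, adj = ols_r2(rows, y)

# Consistency check against values reported in Table 6 of the SI.
adj_communities = adjusted_r2(0.64, 128, 6)
adj_provinces = adjusted_r2(0.65, 50, 5)
```

Rounded to two decimals, the two checks give 0.62 and 0.61, which agrees with the adjusted values listed for communities and provinces in Table 6.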

Similar explanatory power is found for other age groups: $R^2 = 0.44$ for all ages and $R^2 = 0.52$ for ages between 25 and 44 years. However, the model degrades for ages above 44 years ($R^2 = 0.26$), suggesting that our variables mainly describe the behavior of the most represented age groups in Twitter, namely those below 44 years old. On the other hand, since our Twitter variables seem to describe the behavior of young people, we have investigated whether Twitter-constructed variables have a similar explanatory value (in terms of $R^2$) to simple census demographic variables for young people. However, regression models that include the young population rate yield only a minor improvement, $R^2 = 0.65$, while the young population rate alone gives $R^2 = 0.24$, a result which shows that Twitter variables do indeed possess a genuine explanatory power beyond their simple demographic representation. Finally, our model has the largest explanatory power for detected communities, but large $R^2$ values are also found for other geographical areas like counties ($R^2 = 0.54$) and provinces ($R^2 = 0.65$), see SI section 9.

4 Discussion

This work serves as a proof of concept for how a wide range of behavioral features linked to socioeconomic behavior can be inferred from the digital traces left by publicly available social media. In particular, we demonstrate that behavioral features related to unemployment can be recovered from the digital exhaust left by the microblogging network Twitter. First of all, Twitter geolocalized traces, together with off-the-shelf community detection algorithms, render an optimal partition of a country for economical activity, showing the remarkable power of social media to understand and unveil economical behavior at a country scale. This insight is likely to apply to other administrative definitions in other countries, especially when considering large cities with an inherently dynamical nature and evolution of mobility fluxes, and cities composed of small satellite cities with arbitrary agglomerations or divisions among them (e.g. London, NYC, Singapore). This result is unsurprising: it should be natural to recompute city clusters/communities of activity based on their real-time mobility, which may vary considerably faster than the update rates of mobility and travel surveys [31–33].

Figure 3. A) Correlation coefficient of all the extracted Twitter metrics, grouped by technology adoption (black), geographical diversity (orange), social diversity (light blue), temporal activity (green), and content analysis (dark blue). Error bars correspond to 95% confidence intervals of the correlation coefficient. The gray area corresponds to the statistical significance thresholds. Panels B, C, D and E show the values of 4 selected variables (penetration rate, Entropy 1 (social), activity (morning), and misspellers rate) in each geographical community against its percentage of unemployment. The size of the points is proportional to the population in each geographical community. Solid lines correspond to linear fits to the data.

Figure 4. A) and B) Performance of the model, showing the predicted unemployment rate versus the observed one for ages below 25 ($R^2 = 0.62$) and for ages between 25 and 44 ($R^2 = 0.52$). Dashed lines correspond to the equality line and ±20% error. C) Percentage of weight for each of the variables in the regression model, using the relative weight of the absolute values of the coefficients in the regression model (see SI section 10). Variables marked with * are not statistically significant in the model.

Our main result demonstrates that several key indicators (different penetration rates among regions, fingerprints of the temporal patterns of activity, content lexical correctness, and geo-social connectivities among regions) can be extracted from social media and then used to infer unemployment levels. These findings shed light in two directions: first, on how individuals' extensive use of their social channels allows us to characterize cities based on their activity in a meaningful fashion, and secondly, on how this information can be used to build economic indicators that are directly related to the economy. Regarding the latter, our work is important for understanding how country-scale analysis of social media should consider not only the demographic but also the economical differences between users. As we have shown, users in areas with large unemployment have different mobility, different social interactions, and different daily activity than those in low-unemployment areas. This intertwined relationship between user behavior and employment should be considered not only in economical analyses derived from social media, but also in other applications like marketing, communication, social mobilization, etc.

It is particularly remarkable that Twitter data can provide these accurate results. Twitter is, among the many currently popular social networking platforms, perhaps the noisiest, sparsest, and most 'sabotaged' medium: very few users send out messages at a regular rate, most users do not have geolocated information, the social relationships (followers/followees) contain a lot of unused or unimportant links, it is plagued by spam-bots, and, last but not least, we have no way to identify the motive, goal, or functionality of the mobility fluxes we are able to extract. These limitations are not particular to our sample but general to the sample Twitter data being employed in the computational social science community. Despite all these caveats, we are able to show that even some simple filtering techniques together with basic statistical regressions yield predictive power about a variable as important as unemployment. Other social media platforms, such as Facebook, Google+, Sina Weibo, Instagram, Orkut, or Flickr, with more granular and consistent individual data, are likely to provide similar or better results by themselves or in combination. Further improvements can be obtained by the use of more sophisticated statistical machine learning techniques, some of them even tailored to the peculiarities of social media data. Our work serves to illustrate the tremendous potential of these new digital datasets to improve the understanding of society's functioning at the finest scales of granularity.

The usefulness of our approach must be considered against the cost and update rate of performing detailed surveys of mobility, social structure, and economic performance. Our database is publicly articulated, which means that our analysis could be replicated easily in other countries, in other time periods, and with different scopes. Naturally, surveys provide more accurate results, but they also consume considerably more financial and human resources, employing hundreds of people and taking months, even years, to complete and be released; they are so costly that countries going through economic recession have considered discontinuing them or altering their update rate in recent times. A particularly problematic aspect of these surveys is that they are "out-of-sync", i.e., the census may be up to date whereas those same individuals' travel surveys may not be, and therefore drawing inferences between both may be particularly difficult. This is a challenging problem that the immediateness of social media can help ameliorate.

A few questions remain open for further investigation. How can traditional surveys and social media digital traces be best combined to maximize their predictive ability? Can social media provide a reliable leading indicator for unemployment and, in general, for economic surveys? How much reliable lead is possible, if any at all? As we have found, Twitter penetration and educational levels are correlated with unemployment, but these levels are unlikely to change rapidly enough to describe or anticipate changes in the economy or unemployment. However, other indicators such as daily activity, social interactions, and geographical mobility are more connected with our daily routines, and perhaps they have more predictive power to show and/or anticipate sudden changes in employment. The relationship between unemployment and individual and group behavior may help contextualize the multiple factors affecting the socioeconomic well-being of a region: while penetration, content, daily activity, and mobility diversity seem to be highly correlated with unemployment in Spain, different weights for each group of traces might be expected in other countries [14]. Finally, digital traces could serve as an alternative (sometimes the only one available) to the lack of surveys in poor or remote areas [20, 27]. Another interesting avenue of research involves the use of social media to detect mismatches between the real (hidden, underground) economy and the officially reported one [38].

Most importantly, the immediacy of social media may also allow governments to better measure and understand the effect of policies, social changes, and natural or man-made disasters on the economical status of cities in almost real time [18, 39]. These new avenues for research provide great opportunities at the intersection of the economic, social, and computational sciences, opportunities that originate from these new, widespread, inexpensive datasets.

Acknowledgments

We would like to thank Kristina Lerman, Lada Adamic, James Fowler, Daniel Villatoro, and Ricardo Herranz for stimulating discussions, and Yuri Kryvasheyeu and Thomas Bochynek for their critical reading of the manuscript. This work was partially supported by Spanish Ministry of Science and Technology Grant FIS2013-47532-C3-3-P (to A.L., M.G.H., and E.M.). Manuel Cebrian is funded by the Australian Government as represented by the Department of Broadband, Communications and Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

References

[1] Becker, G. S. (1976) The economic approach to human behavior (University of Chicago Press).

[2] Granovetter, M. (1985) Economic action and social structure: the problem of embeddedness. American journal of sociology pp. 481–510.

[3] Camerer, C. F., Loewenstein, G., & Rabin, M. (2011) Advances in behavioral economics (Princeton University Press).

[4] Glaeser, E. L., Kallal, H. D., Scheinkman, J. A., & Shleifer, A. (1991) Growth in cities (National Bureau of Economic Research), Technical report.

[5] Bettencourt, L. M., Lobo, J., Helbing, D., Kuhnert, C., & West, G. B. (2007) Growth, innovation, scaling, and the pace of life in cities. Proceedings of the National Academy of Sciences 104, 7301–7306.

[6] Batty, M. (2008) The size, scale, and shape of cities. Science 319, 769–771.

[7] Milgram, S. (1974) The experience of living in cities. Crowding and behavior 167, 41.


[8] Pan, W., Ghoshal, G., Krumme, C., Cebrian, M., & Pentland, A. (2013) Urban characteristics attributable to density-driven tie formation. Nature communications 4.

[9] Gonzalez, M. C., Hidalgo, C. A., & Barabasi, A.-L. (2008) Understanding individual human mobility patterns. Nature 453, 779–782.

[10] Calabrese, F., Diao, M., Di Lorenzo, G., Ferreira Jr., J., & Ratti, C. (2013) Understanding individual mobility patterns from urban sensing data: A mobile phone trace example. Transportation Research Part C: Emerging Technologies 26, 301–313.

[11] Cheng, Z., Caverlee, J., Lee, K., & Sui, D. Z. (2011) Exploring Millions of Footprints in Location Sharing Services (AAAI, Menlo Park, CA, USA).

[12] Cho, E., Myers, S. A., & Leskovec, J. (2011) Friendship and mobility: user movement in location-based social networks, KDD '11 (ACM, New York, NY, USA), pp. 1082–1090.

[13] Sun, L., Jin, J. G., Axhausen, K. W., Lee, D.-H., & Cebrian, M. (2014) Quantifying long-term evolution of intra-urban spatial interactions. arXiv preprint arXiv:1407.0145.

[14] Eagle, N., Macy, M., & Claxton, R. (2010) Network diversity and economic development. Science 328, 1029–1031.

[15] Henrich, J., Boyd, R., Bowles, S., Camerer, C., Fehr, E., Gintis, H., & McElreath, R. (2001) In search of homo economicus: behavioral experiments in 15 small-scale societies. American Economic Review pp. 73–78.

[16] Krieger, N., Williams, D. R., & Moss, N. E. (1997) Measuring social class in US public health research: concepts, methodologies, and guidelines. Annual review of public health 18, 341–378.

[17] Groves, R. M., Fowler Jr., F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2013) Survey methodology (John Wiley & Sons).

[18] Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M., et al. (2009) Life in the network: the coming age of computational social science. Science (New York, NY) 323, 721.

[19] Smith, C., Quercia, D., & Capra, L. (2013) Finger on the pulse: identifying deprivation using transit flow analysis (ACM), pp. 683–692.

[20] Soto, V., Frias-Martinez, V., Virseda, J., & Frias-Martinez, E. (2011) Prediction of socioeconomic levels using cell phone records, UMAP'11 (Springer-Verlag, Berlin, Heidelberg), pp. 377–388.

[21] Preis, T., Moat, H. S., Stanley, H. E., & Bishop, S. R. (2012) Quantifying the advantage of looking forward. Scientific reports 2.

[22] Antenucci, D., Cafarella, M., Levenstein, M. C., Re, C., & Shapiro, M. D. (2014) Using social media to measure labor market flows (National Bureau of Economic Research), Technical report.

[23] Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P., & Ratti, C. (2013) Geo-located twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680.

[24] Lathia, N., Quercia, D., & Crowcroft, J. (2012) in Pervasive computing (Springer), pp. 91–98.

[25] Frias-Martinez, V., Virseda, J., & Frias-Martinez, E. (2010) Socio-economic levels and human mobility.

[26] Gutierrez, T., Krings, G., & Blondel, V. D. (2013) Evaluating socio-economic state of a country analyzing airtime credit and mobile phone datasets. arXiv preprint arXiv:1309.4496.

[27] Smith, C., Mashhadi, A., & Capra, L. (2013) Ubiquitous sensing for mapping poverty in developing countries. Paper submitted to the Orange D4D Challenge.

[28] Erlander, S. & Stewart, N. F. (1990) The gravity model in transportation analysis: theory and extensions (VSP) Vol. 3.

[29] Simini, F., Gonzalez, M. C., Maritan, A., & Barabasi, A.-L. (2012) A universal model for mobility and migration patterns. Nature 484, 96–100.

[30] Lenormand, M., Picornell, M., Cantu-Ros, O. G., Tugores, A., Louail, T., Herranz, R., Barthelemy, M., Frias-Martinez, E., & Ramasco, J. J. (2014) Cross-checking different sources of mobility information. arXiv preprint arXiv:1404.0333.

[31] Barthelemy, M. (2011) Spatial networks. Physics Reports 499, 1–101.

[32] Expert, P., Evans, T. S., Blondel, V. D., & Lambiotte, R. (2011) Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences 108, 7663–7668.

[33] Sobolevsky, S., Szell, M., Campari, R., Couronne, T., Smoreda, Z., & Ratti, C. (2013) Delineating geographical regions with networks of human interactions in an extensive set of countries. PloS one 8, e81707.

[34] Rosvall, M. & Bergstrom, C. T. (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105, 1118–1123.

[35] Danon, L., Diaz-Guilera, A., Duch, J., & Arenas, A. (2005) Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment 2005, P09008.

[36] ADigital (2013) Uso de twitter en España 2012. [Online; accessed 1-November-2014].

[37] Davenport, J. R. & DeLine, R. (2014) The readability of tweets and their geographic correlation with education. arXiv preprint arXiv:1401.6058.

[38] Schneider, F., Buehn, A., & Montenegro, C. E. (2011) Shadow economies all over the world: New estimates for 162 countries from 1999 to 2007. Handbook on the shadow economy pp. 9–77.

[39] Rutherford, A., Cebrian, M., Dsouza, S., Moro, E., Pentland, A., & Rahwan, I. (2013) Limits of social mobilization. Proceedings of the National Academy of Sciences 110, 6281–6286.


Supporting Information for
Social media fingerprints of unemployment
Alejandro Llorente, Manuel García-Herranz, Manuel Cebrian, and Esteban Moro

S1 The dataset

Twitter provides an extremely rich and publicly available data set of user interactions, information flows and, thanks to the geo-location of tweets, user movements. Nevertheless, the representativeness of geo-located Twitter as a global source of mobility data has so far received little attention. In this sense, while [13] presents a promising and extensive study of global country-to-country movements (mostly driven by tourism), within-country human flows (comprising not only internal tourism but also, to a greater extent than country-to-country travel, visiting and commuting) still need further investigation. Therefore, throughout this work we compare our findings using geo-located Twitter with similar studies using commuting surveys.

For the Twitter analysis we consider almost 146 million geo-located Twitter messages (tweets), collected through the public API provided by Twitter, for the continental part of Spain and from 29th November 2012 to 10th April 2013. In this dataset we consider that there has been a trip from place l to place k if a user has tweeted in place l and place k consecutively. We only keep those transitions in which the first tweet and the second one are dated on the same day. We filter the trips database to avoid unrealistic transitions and keep only trips with a geographical displacement larger than 1 km (see Methods section). By this method, 138 million trips from 167,376 different users are considered in our work.

From those trips we construct the mobility flow $T_{ij}$ between municipalities, which measures the number of trips in our database whose origin lies within the boundaries of city i and whose destination lies within those of city j.
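The trip definition and the flow $T_{ij}$ described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used for the paper: the record layout `(user, timestamp, lat, lon, city)`, the function names, and the sample coordinates are hypothetical, the distance is computed with the haversine formula, and the additional filter for unrealistic transitions mentioned in the text is not reproduced.

```python
import math
from collections import Counter
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius used by the haversine formula


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def extract_trips(tweets, min_km=1.0):
    """Return (origin_city, destination_city) pairs from consecutive same-day tweets.

    `tweets` is an iterable of (user, timestamp, lat, lon, city) tuples; this
    record layout is an assumption made for the example.
    """
    by_user = {}
    for tweet in tweets:
        by_user.setdefault(tweet[0], []).append(tweet)
    trips = []
    for user_tweets in by_user.values():
        user_tweets.sort(key=lambda t: t[1])  # chronological order per user
        for first, second in zip(user_tweets, user_tweets[1:]):
            same_day = first[1].date() == second[1].date()
            far_enough = haversine_km(first[2], first[3], second[2], second[3]) > min_km
            if same_day and far_enough:
                trips.append((first[4], second[4]))
    return trips


def flow_matrix(trips):
    """Count trips per ordered (origin, destination) pair, i.e. T_ij."""
    return Counter(trips)


# Hypothetical sample records.
tweets = [
    ("u1", datetime(2013, 1, 5, 9, 0), 40.4168, -3.7038, "Madrid"),
    ("u1", datetime(2013, 1, 5, 18, 0), 40.3223, -3.8650, "Mostoles"),
    ("u1", datetime(2013, 1, 6, 10, 0), 40.4168, -3.7038, "Madrid"),  # next day: no trip
    ("u2", datetime(2013, 1, 5, 12, 0), 40.4168, -3.7038, "Madrid"),
    ("u2", datetime(2013, 1, 5, 12, 5), 40.4170, -3.7040, "Madrid"),  # under 1 km: no trip
]
T = flow_matrix(extract_trips(tweets))
print(T)
```

Keeping the flows in a `Counter` keyed by ordered pairs preserves the direction of each trip, so $T_{ij}$ and $T_{ji}$ are stored separately.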

We also consider population and economical information about the municipalities from the Spanish Census (2011) [8] and unemployment figures from the Public Service of Employment (Servicio Público de Empleo Estatal, SEPE) [7]. In the latter case, registered unemployment (in number of persons) is given for each Spanish municipality by gender, age, and month. To get unemployment rates, we divide registered unemployment by the total workforce in the municipality, estimated as the number of people aged between 16 and 65 years.

S2 Twitter as mobility proxy

Considering all of the available transitions in our database, one can compute the distance between origin and destination, the elapsed time of the transition, and the number of trips per user, among many other statistics. All of them seem to follow a power-law distribution with a cutoff due to the finite spatial size of Spain and the constraint of considering only transitions where the origin and destination check-ins are made on the same day. Focusing on the log-linear part of the distributions, self-similar behaviors arise when Twitter-based mobility is analyzed (see figure 5).

Twitter-based inter-city flows can be well modelled by means of the Gravity Law, which is one of the most widely used methods to represent human mobility [1, 19], with applications in many fields such as urban planning [23], traffic engineering [4], or transportation problems [9]. The Gravity Law is also the solution to the problem of maximizing the entropy of the particle distribution among all the possible trips using statistical mechanics techniques [2, 22]. Recently it has also been used as a model for human mobility based on cell phone traces [10, 20, 21] and social media data at a global scale [13] and at the inter-city level [14].

The Gravity Model for human mobility assumes that the flows between cities can be explained by the expression

$$T^{\mathrm{grav}}_{ij} = \frac{P_i^{\alpha_1} P_j^{\alpha_2}}{d_{ij}^{\beta}} \qquad (2)$$

where $T^{\mathrm{grav}}_{ij}$ is the flow, in terms of number of people, between cities i and j, $d_{ij}$ is the geographical distance between them, and $P_i$ and $P_j$ are the populations of the two cities.

Given the data, we can obtain the parameters of the model by Weighted Least Squares minimization,

$$(\alpha_1^{*}, \alpha_2^{*}, \beta^{*}) = \arg\min_{\alpha_1, \alpha_2, \beta} \; \frac{1}{N} \sum_{i,j} w_{ij} \left( T_{ij} - T^{\mathrm{grav}}_{ij} \right)^2 \qquad (3)$$

where N is the total number of connections in the mobility graph and $w_{ij}$ is a weight proportional to the number of observed transitions between i and j. In particular, we find that taking $w_{ij} = T_{ij}^{1/3}$ gives the best performance in the model.
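To make the weighted least-squares objective concrete, the following Python sketch evaluates the weighted loss for a candidate $(\alpha_1, \alpha_2, \beta)$ and picks the best candidate by brute-force grid search. It is an illustration only: the grid search stands in for whatever optimizer was actually used, the weight exponent is left as a free parameter (`weight_exponent`, a hypothetical name), and the populations, distances, and flows are synthetic.

```python
import itertools


def gravity_flow(p_i, p_j, d_ij, alpha1, alpha2, beta):
    """Gravity-model flow P_i^alpha1 * P_j^alpha2 / d_ij^beta."""
    return (p_i ** alpha1) * (p_j ** alpha2) / (d_ij ** beta)


def weighted_loss(params, observations, weight_exponent):
    """Weighted mean squared error between observed and gravity flows.

    `observations` holds (P_i, P_j, d_ij, T_ij) tuples. The weight is
    T_ij ** weight_exponent; the exponent is a free parameter of this sketch.
    """
    alpha1, alpha2, beta = params
    total = 0.0
    for p_i, p_j, d_ij, t_ij in observations:
        w = t_ij ** weight_exponent
        residual = t_ij - gravity_flow(p_i, p_j, d_ij, alpha1, alpha2, beta)
        total += w * residual ** 2
    return total / len(observations)


def fit_gravity(observations, grid, weight_exponent=1.0):
    """Brute-force search of (alpha1, alpha2, beta) over a common grid."""
    return min(
        itertools.product(grid, repeat=3),
        key=lambda params: weighted_loss(params, observations, weight_exponent),
    )


# Synthetic flows generated from known exponents (illustrative numbers only).
TRUE_PARAMS = (0.5, 0.75, 1.5)
city_pairs = [(10_000, 250_000, 30.0), (50_000, 80_000, 120.0),
              (400_000, 20_000, 15.0), (1_000_000, 600_000, 300.0)]
observations = [(p_i, p_j, d, gravity_flow(p_i, p_j, d, *TRUE_PARAMS))
                for p_i, p_j, d in city_pairs]
grid = [k * 0.25 for k in range(9)]  # 0.0, 0.25, ..., 2.0
best = fit_gravity(observations, grid)
print(best)
```

On real data a continuous optimizer would replace the grid, but the loss function would stay the same.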

In our case this model fits the inter-city mobility based on Twitter GPS check-ins quite accurately (see table 1). Even though $T_{ij}$ is not necessarily symmetric, the exponents of the two populations are similar, indicating that we observe similar flows in both directions between i and j.

S3 Community structures in inter-city mobility graph

Figure 5. Probability distributions for the different properties of daily trips in the Twitter dataset: trip distance (km), number of trips, and elapsed time (secs); the vertical axes show density. Dashed lines correspond to power-law fits with exponents −1.67, −2.43, and −0.62, respectively.

Typically, complex networks exhibit community structure, that is, there are subsets of nodes that are more densely connected among themselves than to the rest of the nodes. In mobility networks, whose nodes correspond to geographical areas, these communities are interpreted as zones with high common activity, and they tend to be constrained by geographical and political barriers. We check whether this is also observed in our dataset by running six state-of-the-art community detection algorithms: FastGreedy [5], Walktrap [16], Infomap [18], MultiLevel [3], Label Propagation [17], and Leading Eigenvector [15]. These six algorithms yield different community structures in terms of number of communities, average community size, or modularity (see table 3). Members (municipalities) of the resulting communities are spatially connected, with few exceptions, as figure 7 shows. We test the statistical robustness of the obtained communities by randomly removing a proportion p of the original links and running the algorithms on this new graph $G_p$. We consider the communities robust when those obtained for the original network G and for $G_p$ are highly similar. To compare two arbitrary community memberships we use the Normalized Mutual Information (NMI) method described in [6], which returns 0 when two memberships are totally different and 1 when they are equal. We compute the NMI for each chosen algorithm run on G and $G_p$ for p between 1% and 10%, concluding that the obtained community structures are robust because they are not broken when some randomly chosen links are removed (see table 2).
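The comparison step of the robustness test in this section can be illustrated with a small NMI implementation. The sketch below assumes the common normalization $2\,I(A;B) / (H(A) + H(B))$ and represents each partition as a list of community labels, one per node; the function name and the toy partitions are hypothetical, and the community detection itself is not reproduced.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_a, labels_b):
    """NMI between two partitions given as equal-length lists of community labels.

    Uses 2 * I(A;B) / (H(A) + H(B)), which is 1 for identical partitions
    and 0 for statistically independent ones.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same nodes")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    mutual_info = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )
    denominator = entropy(count_a) + entropy(count_b)
    if denominator == 0.0:  # both partitions consist of a single community
        return 1.0
    return 2.0 * mutual_info / denominator


# Toy partitions of eight nodes (illustrative only).
original = [0, 0, 0, 1, 1, 1, 2, 2]
relabelled = [5, 5, 5, 7, 7, 7, 9, 9]  # same grouping, different label names
perturbed = [0, 0, 1, 1, 1, 1, 2, 2]   # one node moved to another community
print(normalized_mutual_information(original, relabelled))
print(normalized_mutual_information(original, perturbed))
```

Because NMI depends only on how nodes are grouped, relabelling the communities does not change the score, which is what makes it suitable for comparing the output of G and $G_p$.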

Figure 6. Penetration rates for both cities and detected communities (horizontal axis: population; vertical axis: penetration rate).

As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether the mobility-based communities are related to any of these barriers. In Spain there are different territorial divisions for administrative purposes. In this work we consider two of them: provinces, defined in the 1978 Constitution, which are 48 different heterogeneous aggregations of municipalities; and counties (comarcas in Spanish terminology), which are traditional aggregations of municipalities mainly based on Spanish orography (rivers, valleys, ridges, etc.), some of which are composed of municipalities from different provinces. We again use the NMI method to compare the community structure given by the algorithms with the administrative limits. Except for the Leading Eigenvector algorithm, the methods return communities that are closely related to provinces (NMI ≈ 0.7), whereas higher variability is observed for the county limits. In this last case, the algorithm whose communities agree best with county limits is Infomap (NMI ≈ 0.83). Therefore, Twitter-based mobility summarizes the inter-city flows, showing that these flows are influenced by geographical and political barriers.

S4 Twitter demographics and unemployment rates

Different age groups are not equally represented in Twitter. Recent surveys (2012) in Spain suggest that most users in Twitter (86%) are 16 to 44 years old. Comparison of the percentage of users per age group with the total population within the same groups (see figure 8) reveals that age groups above 35 years old are under-represented in Twitter. Thus our Twitter data will be more revealing when trying to describe unemployment in age groups below 44 years old. This is indeed what we find when we build a linear model for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 24 can be fitted to a linear model with $R^2 = 0.62$, regression models for unemployment rates for ages between 25 and 44 have $R^2 = 0.52$, and for ages above 44 we get only $R^2 = 0.26$. Table 4 summarizes the results of the regression models of unemployment rates in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, figure 8 shows the performance of the model for the different age groups, and once again the poor explanatory power of the Twitter variables for the unemployment rate at ages above 44 is evident.

S5 Properties of Twitter variables

Normalization and distributions

Figure 7. From left to right and from top to bottom: FastGreedy, Walktrap, Infomap, MultiLevel, Label Propagation, and Leading Eigenvector communities on Twitter-based mobility transitions.

Heterogeneity between the values of the variables constructed from Twitter is large but moderate, as the histograms in figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration $\tau_i$ and the misspellers rate $\varepsilon_i$ are defined as the number of users or misspellers per 100,000 persons (population); activity variables $\nu_i$ are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, $\mu_i$, are also given per 100,000 tweets published in the geographical area.
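The normalizations listed in this subsection amount to simple ratios. The sketch below applies them to one area; the dictionary keys and all counts are hypothetical and only serve to show which denominator goes with which variable.

```python
def per_100k(count, base):
    """Rate per 100,000 units of `base` (persons or tweets)."""
    return 100_000 * count / base


def percent_share(count, total):
    """Percentage of `total` represented by `count`."""
    return 100 * count / total


# Hypothetical raw counts for one geographical area.
area = {
    "population": 250_000,
    "twitter_users": 1_500,
    "misspellers": 200,
    "tweets": 80_000,
    "tweets_morning": 4_000,
    "tweets_mentioning_job": 320,
}

normalized = {
    # penetration and misspellers rate: per 100,000 persons
    "penetration": per_100k(area["twitter_users"], area["population"]),
    "misspellers_rate": per_100k(area["misspellers"], area["population"]),
    # activity: percentage of the area's tweets in a time interval
    "activity_morning": percent_share(area["tweets_morning"], area["tweets"]),
    # content: mentions of a term per 100,000 tweets in the area
    "mentions_job": per_100k(area["tweets_mentioning_job"], area["tweets"]),
}
print(normalized)
```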

Correlation between variables

Variables are constructed to reflect the behavior of areas in the different dimensions of Twitter penetration, social or geographical diversity, activity through the day, and content. Correlation between variables does indeed show that variables within each dimension are strongly correlated with each other. As we can see in figure 10, social and geographical diversities are highly correlated, an expected fact given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate $\tau_i$ and the fraction of misspellers $\varepsilon_i$ correlate strongly with most of the variables.
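A correlation matrix of the kind discussed here can be computed directly from the per-area values of each variable. The following sketch uses the Pearson coefficient; the variable names and the five-area sample values are hypothetical, and no significance test is included.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map every ordered pair of variable names to their Pearson correlation."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


# Hypothetical values of three variables over five geographical areas.
columns = {
    "penetration": [600, 450, 800, 300, 520],
    "geo_diversity": [5.1, 4.4, 6.0, 3.9, 4.8],
    "night_activity": [3.0, 4.5, 2.5, 5.5, 3.8],
}
matrix = correlation_matrix(columns)
for (a, b), r in matrix.items():
    if a < b:  # print each unordered pair once
        print(f"{a} vs {b}: {r:+.2f}")
```

Pairs with a coefficient close to ±1 are the ones that may cause collinearity when entered together in a linear regression.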

High correlation between variables might lead to collinearity effects [24] in the linear regression models, that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value, but its p-value is too high to consider it significant. To test this hypothesis, we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 shows the loadings of the different variables. The block structure shown in figure 10 results in similar directions of the variables in the first components of the PCA. We observe some groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we have taken one variable from each of them for our regression models.

S6 Misspellers detection

In this work we consider only tweets in Spanish. Since several languages coexist in Spain, depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet has a misspelling or not, we need to establish some patterns to search for in our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we do not consider the following cases:

• Lack of written accents. People tend to avoid writing accents when writing in a colloquial way.

• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters qu by k. We understand that these mistakes can be motivated by the length limit of tweets and not by a real misspelling.

• Along the same lines, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.

• We also do not consider mistakes related to features of specific areas in Spain. For example, in the south the pronunciation of ce and se is the same, which produces a large number of mistakes when writing. However, since we want to draw objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

Likewise, we consider the following mistakes as real misspellings:

• Adding letters. For example, writing an h at the beginning of a word that starts with a vowel.

• Replacing the special cases mp, mb by the wrong spellings np, nb.

• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish because the pairs have the same or a very close pronunciation.

• Confusing the verb haber with the periphrasis a ver.

• Splitting a word in two, for instance writing the word conmigo as con migo.

Figure 8. Top: percentage of population in each age group (16-24, 25-34, 35-44, 45-54, 55-64) from the Spanish Census (dark bars) and from surveys about users in Twitter (light bars). Bottom: performance of the linear models for each of the age groups, shown as predicted versus observed unemployment (%): all ages, $R^2 = 0.47$; ages < 24, $R^2 = 0.62$; ages 25-44, $R^2 = 0.52$; ages > 44, $R^2 = 0.26$.

Figure 9. Frequency plots P(x) for each variable constructed from Twitter (panels: tweets mentioning unemployment, penetration rate, the two social entropies, the two geographical entropies, misspellers rate, tweets mentioning employment, tweets mentioning job, and night, morning, and afternoon activity).

This way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus one can expect this selection to provide an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27,055 (56 over the whole population).
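A list-based detector of this kind can be sketched as a lookup of single words and two-word sequences against a set of known mistakes. The example below is an assumption-laden illustration: the six entries in `MISSPELLINGS` are a hypothetical excerpt chosen to match the rules above (they are not taken from the actual list of 617 mistakes), the tokenizer is a simple regular expression, and the language filtering step is not reproduced.

```python
import re

# Hypothetical excerpt standing in for the list of common mistakes:
# added h ("hiba"), nb for mb ("tanbien"), b/v and g/j confusion
# ("haver", "cojer"), "haber" for "a ver", and a split word ("con migo").
MISSPELLINGS = {"hiba", "tanbien", "haver", "cojer", "haber si", "con migo"}

WORD = re.compile(r"[a-záéíóúüñ]+")


def has_misspelling(text, misspellings=MISSPELLINGS):
    """True if any word or pair of consecutive words is a listed misspelling."""
    words = WORD.findall(text.lower())
    bigrams = (" ".join(pair) for pair in zip(words, words[1:]))
    return any(term in misspellings for term in (*words, *bigrams))


def misspellers(tweets):
    """Users with at least one tweet containing a listed misspelling.

    `tweets` is an iterable of (user, text) pairs (an assumed layout).
    """
    return {user for user, text in tweets if has_misspelling(text)}


# Hypothetical sample tweets.
tweets = [
    ("ana", "Haber si nos vemos mañana"),
    ("luis", "Voy a ver el partido"),
    ("eva", "Tanbien quiero ir"),
    ("eva", "Ven con migo"),
]
print(sorted(misspellers(tweets)))
```

Checking two-word sequences as well as single words is what allows split words and the haber / a ver confusion to be caught by a plain list lookup.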

Figure 10. Top: correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations at 95% confidence. Bottom: projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.

We analyze whether misspellers have a different Twitter usage behavior from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, it can be observed that misspellers tend to publish a larger number of tweets than those who did not make mistakes (14471 against 2372). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period, the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).
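A scaling exponent such as the one quoted above can be estimated as the slope of a straight line fitted in log-log space. The sketch below uses ordinary least squares on the logarithms, which is a simple choice and not necessarily the fitting procedure behind the reported value; the data are synthetic and follow an exact power law.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) versus log(x), i.e. the scaling exponent."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return sxy / sxx


# Synthetic data following y = 2 * x**0.33 exactly (illustrative only).
tweet_counts = [30, 50, 100, 200, 500]
mean_misspellings = [2 * n ** 0.33 for n in tweet_counts]
exponent = loglog_slope(tweet_counts, mean_misspellings)
print(round(exponent, 3))
```

An exponent below 1 means that doubling the number of tweets less than doubles the expected number of misspellings, which is what sub-linear scaling refers to.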

Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspeller rate as a proxy for the educational level of the cities. A large number of previous works in the literature have revealed the relationship between the economical status and the educational level of geographical areas, and therefore it is natural to ask whether the observed misspellers rate is related to the economy, as captured by the unemployment rate. To test this hypothesis we consider cities with more than 5000 inhabitants, to avoid subsampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).

Figure 11. Number (red) and probability (blue) of observed misspellings given the number of tweets.

Figure 12. Explanatory power ($R^2$) of the linear regression model when fitted against the unemployment data for different months (horizontal axis: months from 2012-10 to 2014-06). The gray (orange) area corresponds to the time window in which Twitter data were collected and the variables constructed.

S7 Time window and unemployment

In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment determined at different months throughout the time window in which Twitter data were collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory value of the model when the linear regression is done for values of unemployment of different months before, during, and after the Twitter data time window. Although there is a small seasonal effect along the year, we see that the explanatory power remains around $R^2 = 0.6$, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that $R^2$ decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e., the last month in the time window used to collect the data.

S8 Demographics do not explain unemployment

Since unemployment rates are very large for the group of young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found in the geographical areas. To test this, we have built four linear models: the first one (named Youth model in Table 5) has the rate of young population as its only explanatory variable; the second and third ones are built only on the Twitter variables considered in the main text (named Twitter model (I)) or only on those whose regression coefficients are statistically significant (Twitter model (II)); the fourth one is fitted with all the variables (named All variables model in Table 5). Table 5 shows the summary of the regression for each model. Focusing on the variance explained by the model in terms of $R^2$, it can be checked that considering all Twitter variables is three times more explanatory than considering only the proportion of young people. On the other hand, the comparison of $R^2$ for the Twitter model with those for the All variables and Youth models shows that the rate of young population does not provide significant explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.
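The comparison of nested models by $R^2$ described in this section can be illustrated with a small ordinary least-squares routine. The sketch below is not the analysis behind Table 5: it solves the normal equations with a plain Gauss-Jordan elimination, uses a single hypothetical Twitter variable, and all numbers are invented for the example.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(features, y):
    """R^2 of an ordinary least-squares fit of y on the given feature columns."""
    rows = [[1.0, *values] for values in zip(*features)]  # prepend intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - f) ** 2 for t, f in zip(y, fitted))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical values for six geographical areas.
youth_rate = [0.10, 0.12, 0.11, 0.15, 0.13, 0.09]
morning_activity = [5.1, 4.2, 6.0, 3.9, 4.8, 6.3]
unemployment = [18.0, 22.5, 15.0, 24.0, 19.5, 14.0]

r2_youth = r_squared([youth_rate], unemployment)
r2_twitter = r_squared([morning_activity], unemployment)
r2_all = r_squared([youth_rate, morning_activity], unemployment)
print(r2_youth, r2_twitter, r2_all)
```

The gain of `r2_all` over `r2_twitter` is the extra variance explained by the demographic variable once the Twitter variable is already in the model, which is the quantity the semi-partial comparison looks at.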

S9 Unemployment models for other geographical areas

While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we considered is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment using the variables defined in those administrative areas, and relate it to that of the geographical communities detected and used in the main paper (see section 4). Not all the areas at the different administrative divisions are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1738 municipalities which have a Twitter population π > 10. Similarly, we only consider the 198 counties with π > 100. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, $R^2$ increases as the number of areas in the model gets smaller, but the level of geographical detail of the model is then very low, for provinces for example. The best performance (high $R^2$ and high level of geographical detail) is attained at the level of the detected communities.

S10 Relative importance of the variables

To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:

1. (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.

2. (lmg) Averaging over orderings, as proposed by Lindeman, Merenda and Gold.

Figure 13. Relative importance of the variables (in percentage) in the unemployment model for different ways to calculate it (weight, first, lmg, pmvd).

3. (pmvd) The PMVD metric introduced by Feldman, which is an average over orderings as well, but with data-dependent weights.

4. (first) The univariate $R^2$ values from regression models with one variable only.

All these metrics are obtained using the relaimpo R package [12]. The results for the young unemployment model are shown in figure 13, where we can see that the different methods yield similar relative importance for the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.
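As an illustration of the simplest of the four metrics, the "first" metric can be sketched in a few lines of Python. This is a minimal sketch and not the relaimpo implementation: the function names are hypothetical, and rescaling the univariate $R^2$ values so that they sum to 100 is an assumption made here to mirror the percentage scale of figure 13.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def first_importance(predictors, response):
    """Univariate R^2 of each predictor, rescaled to percentages (assumed normalization).

    predictors: dict mapping a variable name to its list of values.
    response:   list of observed values (e.g. unemployment rates).
    """
    # For a one-variable linear regression, R^2 equals the squared correlation.
    r2 = {name: pearson_r(values, response) ** 2 for name, values in predictors.items()}
    total = sum(r2.values())
    if total == 0:
        return {name: 0.0 for name in r2}
    return {name: 100.0 * value / total for name, value in r2.items()}
```

The other three metrics need the full multivariate fit (and, for lmg and pmvd, an average over variable orderings), so they are not covered by this sketch.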

References

[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.

[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.

[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.

[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.

[6] Leon Danon, Albert Diaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.

[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html

[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm

[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.

[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.

[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.

[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.

[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.

[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.

[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.

[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences - ISCIS 2005, pages 284–293. Springer, 2005.

[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.

[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.

[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.

[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.

[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.

[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.

[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.

[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression: the partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.


Gravity Model

Parameter   Description                                        Spain
α1          Origin exponent                                    0.477*** (0.002)
α2          Destination exponent                               0.478*** (0.002)
β           Distance exponent                                  1.05*** (0.0035)
R²          Goodness of fit                                    0.797
φ           Correlation between $T_{ij}$ and $T^{grav}_{ij}$   0.826

Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.

NMI between G and Gp for different p

Algorithm   p = 0.01   0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09    0.1
FG          0.995      0.992   0.989   0.983   0.981   0.977   0.983   0.969   0.980   0.959
WT          0.954      0.959   0.950   0.954   0.945   0.948   0.947   0.935   0.926   0.931
IM          0.988      0.981   0.980   0.981   0.978   0.974   0.975   0.970   0.969   0.966
ML          0.994      0.978   0.979   0.983   0.948   0.934   0.972   0.952   0.973   0.947
LP          0.906      0.908   0.911   0.915   0.895   0.907   0.907   0.893   0.905   0.904
LE          0.960      0.957   0.956   0.859   0.910   0.892   0.908   0.858   0.885   0.884

Table 2. NMI measure comparing G and Gp.
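For reference, the NMI between two community assignments of the same nodes can be computed as in the following minimal Python sketch. It uses the standard normalization $2I(A;B)/(H(A)+H(B))$, which is the form given in [6]; the function name and the convention of returning 1 when both partitions consist of a single community are assumptions of this sketch, not taken from the paper.

```python
from collections import Counter
from math import log


def normalized_mutual_information(labels_a, labels_b):
    """NMI between two community assignments of the same nodes.

    labels_a[k] and labels_b[k] are the community labels of node k in each
    partition. Returns 0 for independent partitions and 1 for identical ones
    (up to a renaming of the labels).
    """
    n = len(labels_a)
    size_a = Counter(labels_a)               # community sizes in partition A
    size_b = Counter(labels_b)               # community sizes in partition B
    overlap = Counter(zip(labels_a, labels_b))  # confusion matrix entries N_ij

    # Numerator: sum_ij N_ij * log(N_ij * N / (N_i * N_j))
    mutual = sum(
        n_ij * log(n_ij * n / (size_a[a] * size_b[b]))
        for (a, b), n_ij in overlap.items()
    )
    # Denominator: sum_i N_i log(N_i/N) + sum_j N_j log(N_j/N)  (non-positive)
    norm = sum(c * log(c / n) for c in size_a.values()) + sum(
        c * log(c / n) for c in size_b.values()
    )
    if norm == 0:
        # Both partitions have a single community: treated as identical (assumption).
        return 1.0
    return -2.0 * mutual / norm
```

In the robustness test of Table 2, `labels_a` would hold the communities found on G and `labels_b` those found on Gp, listed in the same node order.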

Communities Stats

Algorithm   ⟨|Ni|⟩i     max|Ni|   |{Ni}| (no. of communities)   Modularity   NMI P   NMI C
FG          309.696     1385      23                            0.726        0.712   0.590
WT          9.262       433       769                           0.417        0.744   0.757
IM          21.011      143       339                           0.758        0.770   0.831
ML          323.772     1132      22                            0.800        0.717   0.599
LP          22.052      750       323                           0.732        0.749   0.761
LE          1017.571    5344      7                             0.381        0.264   0.205

Table 3. Statistics of the communities Ni returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.


                          All ages    < 24       25−44      > 44
(Intercept)               0.11****    0.10***    0.20***    0.20***
                          (0.02)      (0.03)     (0.03)     (0.035)
Penetration rate          3.23*       8.57***    6.28**     2.40
                          (1.41)      (2.22)     (2.17)     (2.77)
Geographical diversity    0.03        0.15***    0.08*      0.06
                          (0.02)      (0.04)     (0.04)     (0.05)
Social diversity          −0.03*      −0.03      −0.05*     −0.06*
                          (0.01)      (0.02)     (0.02)     (0.03)
Morning activity          −0.69*      −1.30**    −1.53***   −1.19*
                          (0.26)      (0.42)     (0.41)     (0.52)
Misspellers rate          11.56       31.51*     15.46      23.60
                          (8.13)      (12.78)    (12.48)    (15.94)
Employment mentions       −1.80       3.17       −9.94      2.71
                          (6.27)      (9.86)     (9.64)     (12.3)
R²                        0.47        0.64       0.55       0.29
Adj. R²                   0.44        0.62       0.52       0.26
***p < 0.001, **p < 0.01, *p < 0.05

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in the groups of less than 24 years, between 25 and 44 years, and above 44 years.

                          All variables   Youth model   Twitter model (I)   Twitter model (II)
(Intercept)               0.06            −0.02         0.10***             0.09***
                          (0.03)          (0.03)        (0.03)              (0.027)
Young pop. rate           0.66*           2.20***
                          (0.30)          (0.35)
Penetration rate          8.20***                       8.57***             8.62***
                          (2.25)                        (2.22)              (2.21)
Geographical diversity    0.14***                       0.15***             0.12***
                          (0.04)                        (0.04)              (0.03)
Social diversity          −0.02                         −0.03
                          (0.02)                        (0.02)
Morning activity          −1.42***                      −1.30**             −1.28**
                          (0.41)                        (0.42)              (0.41)
Misspellers rate          23.95                         31.51*              32.28*
                          (13.09)                       (12.78)             (12.71)
Employment mentions       0.34                          3.17
                          (9.81)                        (9.86)
R²                        0.65            0.24          0.64                0.63
Adj. R²                   0.63            0.24          0.62                0.62
***p < 0.001, **p < 0.01, *p < 0.05

Table 5. Regression table for the different statistical models. The All variables model includes both the Twitter variables and the rate of young population. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).


                          Communities   Municipalities   Counties   Provinces
(Intercept)               0.10***       0.16***          0.11***    0.11*
                          (0.03)        (0.01)           (0.03)     (0.05)
Penetration rate          8.57***       4.01***          9.12***    10.47***
                          (2.22)        (0.59)           (1.81)     (1.97)
Geographical diversity    0.15***       0.02             0.12***    0.08
                          (0.04)        (0.01)           (0.03)     (0.07)
Social diversity          −0.03         −0.01            −0.01      −0.03
                          (0.02)        (0.01)           (0.02)     (0.07)
Morning activity          −1.30**       −1.16***         −1.49***   −1.03
                          (0.42)        (0.14)           (0.39)     (0.88)
Misspellers rate          31.51*        14.40***         14.09
                          (12.78)       (2.51)           (10.02)
Employment mentions       3.17          −0.71            2.41       −3.17
                          (9.86)        (0.89)           (8.86)     (12.29)
Number of points          128           1738             198        50
R²                        0.64          0.22             0.55       0.65
Adj. R²                   0.62          0.21             0.54       0.61
***p < 0.001, **p < 0.01, *p < 0.05

Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.
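The relation between the R² and adjusted R² rows of the table can be checked with the textbook correction for the number of predictors. The short Python sketch below assumes that this standard formula is the one behind the reported values; the function name is hypothetical.

```python
def adjusted_r2(r2, n_points, n_predictors):
    """Textbook adjusted R^2 for a linear model with an intercept.

    r2:           coefficient of determination of the fit
    n_points:     number of observations (geographical areas)
    n_predictors: number of explanatory variables, intercept excluded
    """
    return 1.0 - (1.0 - r2) * (n_points - 1) / (n_points - n_predictors - 1)


# Communities model: 128 areas, 6 Twitter variables, R^2 = 0.64
print(round(adjusted_r2(0.64, 128, 6), 2))  # 0.62
# Provinces model: 50 areas, 5 variables (misspellers rate removed), R^2 = 0.65
print(round(adjusted_r2(0.65, 50, 5), 2))   # 0.61
```

With few areas the correction is larger, which is why the Provinces model loses more from the adjustment than the Communities or Counties models do.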


[Figure 3 image not recoverable. Panel A lists the metrics: Penetration rate; Entropy 1 (geo); Entropy 2 (geo); Entropy 1 (social); Entropy 2 (social); Activity (morning); Activity (afternoon); Activity (night); Misspellers rate; Job tweets; Employment tweets; Unemployment tweets; Economy tweets. Horizontal axis of panel A: Correlation. Panels B to E plot Penetration rate, Entropy 1 (social), Activity (morning) and Misspellers rate against Unemployment.]

Figure 3. A) Correlation coefficient of all the extracted Twitter metrics, grouped by technology adoption (black), geographical diversity (orange), social diversity (light blue), temporal activity (green) and content analysis (dark blue). Error bars correspond to 95% confidence intervals of the correlation coefficient. The gray area corresponds to the statistical significance thresholds. Panels B, C, D and E show the values of 4 selected variables in each geographical community against its percentage of unemployment. The size of the points is proportional to the population of each geographical community. Solid lines correspond to linear fits to the data.

[Figure 4 image not recoverable. Panels A (Age < 25, R² = 0.62) and B (25 < Age < 44, R² = 0.52) plot Predicted unemployment against Observed unemployment. Panel C shows the Weight of each variable: Penetration rate, Entropy 1 (geo), Entropy 1 (social), Activity (morning), Misspellers rate, Employment tweets.]

Figure 4. A) and B) Performance of the model, showing the predicted unemployment rate versus the observed one for ages below 25 (R² = 0.62) and for ages between 25 and 44 (R² = 0.52). Dashed lines correspond to the equality line and ±20% error. C) Percentage of weight for each of the variables in the regression model, using the relative weight of the absolute values of the coefficients in the regression model (see SI section 10). Variables marked with * are not statistically significant in the model.

be inferred from the digital traces that are left by publicly-available social media. In particular, we demonstrate that behavioral features related to unemployment can be recovered from the digital exhaust left by the microblogging network Twitter. First of all, Twitter geolocalized traces, together with off-the-shelf community detection algorithms, render an optimal partition of a country for economical activity, showing the remarkable power of social media to understand and unveil economical behavior at a country scale. This insight is likely to apply to other administrative definitions in other countries, especially when considering large cities with an inherent dynamical nature and evolution of mobility fluxes, and cities composed of small satellite cities with arbitrary agglomerations or divisions among them (e.g. London, NYC, Singapore).

This result is unsurprising: it should be natural to recompute city clusters/communities of activity based on their real-time mobility, which may vary considerably faster than the update rates of mobility and travel surveys [31–33].

Our main result demonstrates that several key indicators (different penetration rates among regions, fingerprints of the temporal patterns of activity, content lexical correctness, and geo-social connectivities among regions) can be extracted from social media and then used to infer unemployment levels. These findings shed light in two directions: first, on how individuals' extensive use of their social channels allows us to characterize cities based on their activity in a meaningful fashion; and secondly, on how this information can be used to build economic indicators that are directly related to the economy. Regarding the latter, our work is important for understanding how country-scale analysis of social media should consider the demographic but also the economical differences between users. As we have shown, users in areas with large unemployment have different mobility, different social interactions and different daily activity than those in low-unemployment areas. This intertwined relationship between user behavior and employment should be considered not only in economical analysis derived from social media, but also in other applications like marketing, communication, social mobilization, etc.

It is particularly remarkable that Twitter data can provide these accurate results. Twitter is, among the many currently popular social networking platforms, perhaps the noisiest, sparsest and most 'sabotaged' medium: very few users send out messages at a regular rate, most users do not have geolocated information, the social relationships (followers/followees) contain a lot of unused or unimportant links, it is plagued by spam-bots and, last but not least, we have no way to identify the motive, goal or functionality of the mobility fluxes we are able to extract. These limitations are not particular to our sample, but general to the Twitter data samples being employed in the computational social science community. Despite all these caveats, we are able to show that even some simple filtering techniques, together with basic statistical regressions, yield predictive power about a variable as important as unemployment. Other social media platforms such as Facebook, Google+, Sina Weibo, Instagram, Orkut or Flickr, with more granular and consistent individual data, are likely to provide similar or better results by themselves or in combination. Further improvements can be obtained by the use of more sophisticated statistical machine learning techniques, some of them even tailored to the peculiarities of social media data. Our work serves to illustrate the tremendous potential of these new digital datasets to improve the understanding of society's functioning at the finer scales of granularity.

The usefulness of our approach must be considered against the cost and update rate of performing detailed surveys of mobility, social structure and economic performance. Our database is publicly articulated, which means that our analysis could be replicated easily in other countries, other time periods and with different scopes. Naturally, surveys provide more accurate results, but they also consume considerably higher financial and human resources, employing hundreds of people and taking months, even years, to complete and be released; they are so costly that countries going through economic recession have considered discontinuing them or altering their update rate in recent times. A particularly problematic aspect of these surveys is that they are "out-of-sync", i.e. a census may be up to date whereas the travel surveys of those same individuals may not be, and therefore drawing inferences between both may be particularly difficult. This is a challenging problem that the immediateness of social media can help ameliorate.

A few questions remain open for further investigation. How can traditional surveys and social media digital traces be best combined to maximize their predictive ability? Can social media provide a reliable leading indicator of unemployment and, in general, of economic surveys? How much reliable lead time is possible, if any? As we have found, Twitter penetration and educational levels are correlated with unemployment, but these levels are unlikely to change rapidly enough to describe or anticipate changes in the economy or in unemployment. However, other indicators like daily activity, social interactions and geographical mobility are more connected with our daily activity, and perhaps they have more predictive power to show and/or anticipate sudden changes in employment. The relationship between unemployment and individual and group behavior may help contextualize the multiple factors affecting the socioeconomic well-being of a region: while penetration, content, daily activity and mobility diversity seem to be highly correlated with unemployment in Spain, different weights for each group of traces might be expected in other countries [14]. Finally, digital traces could serve as an alternative (sometimes the only one available) where surveys are lacking, as in poor or remote areas [20, 27]. Another interesting avenue of research involves the use of social media to detect mismatches between the real (hidden, underground) economy and the officially reported one [38].

Most importantly, the immediacy of social media may also allow governments to better measure and understand the effect of policies, social changes, and natural or man-made disasters on the economical status of cities in almost real time [18, 39]. These new avenues for research provide great opportunities at the intersection of the economic, social and computational sciences, which originate from these new, widespread and inexpensive datasets.

Acknowledgments

We would like to thank Kristina Lerman, Lada Adamic, James Fowler, Daniel Villatoro and Ricardo Herranz for stimulating discussions, and Yuri Kryvasheyeu and Thomas Bochynek for their critical reading of the manuscript. This work was partially supported by Spanish Ministry of Science and Technology Grant FIS2013-47532-C3-3-P (to A.L., M.G.H. and E.M.). Manuel Cebrian is funded by the Australian Government, as represented by the Department of Broadband, Communications and Digital Economy, and the Australian Research Council through the ICT Centre of Excellence program.

References

[1] Becker, G. S. (1976) The economic approach to human behavior. (University of Chicago Press).

[2] Granovetter, M. (1985) Economic action and social structure: the problem of embeddedness. American Journal of Sociology, pp. 481–510.

[3] Camerer, C. F., Loewenstein, G., & Rabin, M. (2011) Advances in behavioral economics. (Princeton University Press).

[4] Glaeser, E. L., Kallal, H. D., Scheinkman, J. A., & Shleifer, A. (1991) Growth in cities. (National Bureau of Economic Research), Technical report.

[5] Bettencourt, L. M., Lobo, J., Helbing, D., Kühnert, C., & West, G. B. (2007) Growth, innovation, scaling, and the pace of life in cities. Proceedings of the National Academy of Sciences 104, 7301–7306.

[6] Batty, M. (2008) The size, scale, and shape of cities. Science 319, 769–771.

[7] Milgram, S. (1974) The experience of living in cities. Crowding and behavior 167, 41.


[8] Pan, W., Ghoshal, G., Krumme, C., Cebrian, M., & Pentland, A. (2013) Urban characteristics attributable to density-driven tie formation. Nature Communications 4.

[9] González, M. C., Hidalgo, C. A., & Barabási, A.-L. (2008) Understanding individual human mobility patterns. Nature 453, 779–782.

[10] Calabrese, F., Diao, M., Di Lorenzo, G., Ferreira Jr., J., & Ratti, C. (2013) Understanding individual mobility patterns from urban sensing data: A mobile phone trace example. Transportation Research Part C: Emerging Technologies 26, 301–313.

[11] Cheng, Z., Caverlee, J., Lee, K., & Sui, D. Z. (2011) Exploring Millions of Footprints in Location Sharing Services. (AAAI, Menlo Park, CA, USA).

[12] Cho, E., Myers, S. A., & Leskovec, J. (2011) Friendship and mobility: user movement in location-based social networks. KDD '11 (ACM, New York, NY, USA), pp. 1082–1090.

[13] Sun, L., Jin, J. G., Axhausen, K. W., Lee, D.-H., & Cebrian, M. (2014) Quantifying long-term evolution of intra-urban spatial interactions. arXiv preprint arXiv:1407.0145.

[14] Eagle, N., Macy, M., & Claxton, R. (2010) Network diversity and economic development. Science 328, 1029–1031.

[15] Henrich, J., Boyd, R., Bowles, S., Camerer, C., Fehr, E., Gintis, H., & McElreath, R. (2001) In search of homo economicus: behavioral experiments in 15 small-scale societies. American Economic Review, pp. 73–78.

[16] Krieger, N., Williams, D. R., & Moss, N. E. (1997) Measuring social class in US public health research: concepts, methodologies, and guidelines. Annual Review of Public Health 18, 341–378.

[17] Groves, R. M., Fowler Jr., F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2013) Survey methodology. (John Wiley & Sons).

[18] Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabási, A. L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M., et al. (2009) Life in the network: the coming age of computational social science. Science (New York, NY) 323, 721.

[19] Smith, C., Quercia, D., & Capra, L. (2013) Finger on the pulse: identifying deprivation using transit flow analysis. (ACM), pp. 683–692.

[20] Soto, V., Frias-Martinez, V., Virseda, J., & Frias-Martinez, E. (2011) Prediction of socioeconomic levels using cell phone records. UMAP'11 (Springer-Verlag, Berlin, Heidelberg), pp. 377–388.

[21] Preis, T., Moat, H. S., Stanley, H. E., & Bishop, S. R. (2012) Quantifying the advantage of looking forward. Scientific Reports 2.

[22] Antenucci, D., Cafarella, M., Levenstein, M. C., Ré, C., & Shapiro, M. D. (2014) Using social media to measure labor market flows. (National Bureau of Economic Research), Technical report.

[23] Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P., & Ratti, C. (2013) Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680.

[24] Lathia, N., Quercia, D., & Crowcroft, J. (2012) in Pervasive computing. (Springer), pp. 91–98.

[25] Frias-Martinez, V., Virseda, J., & Frias-Martinez, E. (2010) Socio-economic levels and human mobility.

[26] Gutierrez, T., Krings, G., & Blondel, V. D. (2013) Evaluating socio-economic state of a country analyzing airtime credit and mobile phone datasets. arXiv preprint arXiv:1309.4496.

[27] Smith, C., Mashhadi, A., & Capra, L. (2013) Ubiquitous sensing for mapping poverty in developing countries. Paper submitted to the Orange D4D Challenge.

[28] Erlander, S. & Stewart, N. F. (1990) The gravity model in transportation analysis: theory and extensions. (VSP) Vol. 3.

[29] Simini, F., González, M. C., Maritan, A., & Barabási, A.-L. (2012) A universal model for mobility and migration patterns. Nature 484, 96–100.

[30] Lenormand, M., Picornell, M., Cantu-Ros, O. G., Tugores, A., Louail, T., Herranz, R., Barthelemy, M., Frias-Martinez, E., & Ramasco, J. J. (2014) Cross-checking different sources of mobility information. arXiv preprint arXiv:1404.0333.

[31] Barthélemy, M. (2011) Spatial networks. Physics Reports 499, 1–101.

[32] Expert, P., Evans, T. S., Blondel, V. D., & Lambiotte, R. (2011) Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences 108, 7663–7668.

[33] Sobolevsky, S., Szell, M., Campari, R., Couronné, T., Smoreda, Z., & Ratti, C. (2013) Delineating geographical regions with networks of human interactions in an extensive set of countries. PloS one 8, e81707.

[34] Rosvall, M. & Bergstrom, C. T. (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105, 1118–1123.

[35] Danon, L., Diaz-Guilera, A., Duch, J., & Arenas, A. (2005) Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment 2005, P09008.

[36] ADigital (2013) Uso de Twitter en España 2012. [Online; accessed 1-November-2014].

[37] Davenport, J. R. & DeLine, R. (2014) The readability of tweets and their geographic correlation with education. arXiv preprint arXiv:1401.6058.

[38] Schneider, F., Buehn, A., & Montenegro, C. E. (2011) Shadow economies all over the world: New estimates for 162 countries from 1999 to 2007. Handbook on the shadow economy, pp. 9–77.

[39] Rutherford, A., Cebrian, M., Dsouza, S., Moro, E., Pentland, A., & Rahwan, I. (2013) Limits of social mobilization. Proceedings of the National Academy of Sciences 110, 6281–6286.


Supporting Information for
Social media fingerprints of unemployment
Alejandro Llorente, Manuel García-Herranz, Manuel Cebrian and Esteban Moro

S1 The dataset

Twitter provides an extremely rich and publicly available data set of user interactions, information flows and, thanks to the geo-location of tweets, user movements. Nevertheless, the representativeness of geo-located Twitter as a global source of mobility data has so far received sparse attention. In this sense, while [13] presents a promising and extensive study of global country-to-country movements (mostly driven by tourism), within-country human flows (comprising not only internal tourism but also, to a greater extent than country-to-country travel, visiting and commuting) still need further investigation. Therefore, throughout this work we compare our findings using geo-located Twitter with similar studies using commuting surveys.

For the Twitter analysis we consider almost 146 million geo-located Twitter messages (tweets) collected through the public API provided by Twitter for the continental part of Spain from 29th November 2012 to 10th April 2013. In this dataset we consider that there has been a trip from place l to place k if a user has tweeted in place l and place k consecutively. We only keep those transitions in which the first tweet and the second one are dated on the same day. We filter the trips database to avoid unrealistic transitions, and keep only trips with a geographical displacement larger than 1 km (see Methods section). By this method, 138 million trips from 167376 different users are considered in our work.

From those trips we construct the mobility flow $T_{ij}$ between municipalities, which measures the number of trips in our database whose origin lies within the boundaries of city $i$ and whose destination lies within those of city $j$.
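The construction of trips and flows described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the pipeline used for the paper: the record layout `(user, timestamp, lat, lon, municipality)`, the function names and the use of the haversine great-circle distance are all hypothetical, and the additional filter for unrealistic transitions mentioned in the Methods section is not reproduced.

```python
from collections import Counter
from datetime import datetime
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    earth_radius_km = 6371.0
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))


def mobility_flows(tweets, min_km=1.0):
    """Count trips T[(i, j)] between municipalities from geo-located tweets.

    tweets: iterable of (user, timestamp, lat, lon, municipality) tuples,
            where timestamp is a datetime (assumed record layout).
    A trip is a pair of consecutive tweets by the same user, posted on the
    same day and more than `min_km` apart.
    """
    by_user = {}
    for record in tweets:
        by_user.setdefault(record[0], []).append(record)

    flows = Counter()
    for records in by_user.values():
        records.sort(key=lambda r: r[1])  # chronological order per user
        for prev, curr in zip(records, records[1:]):
            same_day = prev[1].date() == curr[1].date()
            displacement = haversine_km(prev[2], prev[3], curr[2], curr[3])
            if same_day and displacement > min_km:
                flows[(prev[4], curr[4])] += 1
    return flows
```

Note that the resulting flows are directed, so `flows[(i, j)]` and `flows[(j, i)]` are counted separately.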

We also consider population and economical information about the municipalities from the Spanish Census (2011) [8] and unemployment figures from the Public Service of Employment (Servicio Público de Empleo Estatal, SEPE) [7]. In the latter case, registered unemployment (in number of persons) is given for each Spanish municipality by gender, age and month. To get unemployment rates we divide registered unemployment by the total workforce in the municipality, estimated as the number of people aged between 16 and 65 years.

S2 Twitter as mobility proxy

Considering all of the available transitions in our database, one can compute the distance between origin and destination, the elapsed time of the transition and the number of trips per user, among many other statistics. All of them seem to show a power-law distribution with a cutoff due to the finite spatial size of Spain and to the constraint of considering only transitions whose origin and destination check-ins are made on the same day. Focusing on the log-linear part of the distributions, self-similar behaviors arise when Twitter-based mobility is analyzed (see figure 5).

Twitter-based inter-city flows can be well modelled by means of the Gravity Law, which is one of the most widespread methods to represent human mobility [1, 19], with applications in many fields like urban planning [23], traffic engineering [4] or transportation problems [9]. The Gravity Law is also the solution to the problem of maximizing the entropy of the particle distribution among all the possible trips, using statistical mechanics techniques [2, 22]. Recently it has also been used as a model for human mobility based on cell phone traces [10, 20, 21] and on social media data at a global scale [13] and at the inter-city level [14].

The Gravity Model for human mobility assumes that the flows between cities can be explained by the expression

$$T^{\mathrm{grav}}_{ij} = \frac{P_i^{\alpha_1} P_j^{\alpha_2}}{d_{ij}^{\beta}} \tag{2}$$

where $T^{\mathrm{grav}}_{ij}$ is the flow, in terms of number of people, between cities $i$ and $j$, $d_{ij}$ is the geographical distance between them, and $P_i$ and $P_j$ are the populations of the two cities.

Given the data, we can obtain the parameters of the model by weighted least-squares minimization:

$$\alpha_1^*, \alpha_2^*, \beta^* = \arg\min_{\alpha_1, \alpha_2, \beta} \frac{1}{N} \sum_{ij} w_{ij} \left( T_{ij} - T^{grav}_{ij} \right)^2 \qquad (3)$$

where $N$ is the total number of connections in the mobility graph and $w_{ij}$ is a weight proportional to the number of observed transitions between $i$ and $j$. In particular, we find that taking $w_{ij}$ as a power of $T_{ij}$ gives the best performance in the model (the exponent is printed as "13" in this copy, with its punctuation lost).
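As a rough illustration of expressions (2) and (3), the sketch below evaluates the gravity flow and the weighted squared-error objective, and recovers known exponents from synthetic flows with a brute-force grid search. The grid search is a stand-in for a proper numerical optimizer and is not the solver used in the paper; the function names, the synthetic city pairs, and the `weight_exponent` parameter (with $w_{ij} = T_{ij}^{\text{weight\_exponent}}$, default 1) are all assumptions made for the example.

```python
def gravity_flow(pop_origin, pop_dest, distance, alpha1, alpha2, beta):
    """Flow predicted by the gravity expression (2)."""
    return (pop_origin ** alpha1) * (pop_dest ** alpha2) / (distance ** beta)


def weighted_loss(observations, alpha1, alpha2, beta, weight_exponent=1.0):
    """Objective (3): mean of w_ij * (T_ij - T_grav_ij)^2 over all connections.

    observations: list of (P_i, P_j, d_ij, T_ij) tuples.
    weight_exponent: assumption of this sketch; w_ij = T_ij ** weight_exponent.
    """
    total = 0.0
    for pop_i, pop_j, dist, flow in observations:
        weight = flow ** weight_exponent
        residual = flow - gravity_flow(pop_i, pop_j, dist, alpha1, alpha2, beta)
        total += weight * residual ** 2
    return total / len(observations)


def grid_search(observations, alphas, betas, weight_exponent=1.0):
    """Brute-force minimization over a small parameter grid (illustrative only)."""
    best = None
    for a1 in alphas:
        for a2 in alphas:
            for b in betas:
                loss = weighted_loss(observations, a1, a2, b, weight_exponent)
                if best is None or loss < best[0]:
                    best = (loss, a1, a2, b)
    return best


# Synthetic check: flows generated with known exponents should be recovered.
pairs = [(1e5, 4e4, 50.0), (2.5e5, 9e4, 120.0), (9e4, 1e6, 300.0), (4e4, 2.5e5, 80.0)]
observations = [(p_i, p_j, d, gravity_flow(p_i, p_j, d, 0.5, 0.5, 1.0))
                for p_i, p_j, d in pairs]
best = grid_search(observations, alphas=[0.3, 0.4, 0.5, 0.6], betas=[0.5, 1.0, 1.5])
```

With real data, the origin and destination exponents are fitted separately, so a near-equal pair of fitted values is an empirical finding rather than a constraint of the model.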

In our case, this model fits the inter-city mobility based on Twitter GPS check-ins quite accurately (see table 1). Even though we do not require $T_{ij}$ to be symmetric, the exponents of the populations are similar, indicating that we observe similar flows in both directions between $i$ and $j$.

S3 Community structures in inter-city mobility graph

Typically, complex networks exhibit community structure, that is, there are subsets of nodes that are more densely connected among themselves than to the rest of the nodes. In mobility networks, whose nodes correspond to geographical areas, these communities are interpreted as zones with high common activity and tend to be constrained by geographical and political barriers. We check whether this is also observed in our dataset by running six state-of-the-art community detection algorithms: FastGreedy [5], Walktrap [16], Infomap [18], MultiLevel [3], Label Propagation [17], and Leading Eigenvector [15]. These six algorithms yield different community structures in terms of number of communities, average community size, or modularity (see table 3). Members (municipalities) of the resulting communities are spatially connected, except in a few cases, as figure 7 shows. We test the statistical robustness of the obtained communities by randomly removing a proportion p of the original links and running the algorithms on this new graph Gp. We consider the communities robust when those obtained for the original network G and for Gp are highly similar. In order to compare two arbitrary community memberships we use the Normalized Mutual Information (NMI) method described in [6], which returns 0 when the two memberships are totally different and 1 when they are identical. We compute the NMI for each chosen algorithm run on G and Gp for p between 1% and 10%, concluding that the obtained community structures are robust because they are not broken when some randomly chosen links are removed (see table 2).

Figure 5. Probability distributions for the different properties of daily trips in the Twitter dataset (panels, from left to right: trip distance in km, number of trips, elapsed time in seconds; vertical axes: density). Dashed lines correspond to power-law fits with exponents −1.67, −2.43, and −0.62, respectively.

Figure 6. Penetration rates for both cities and detected communities (horizontal axis: population; vertical axis: penetration rate).

As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether the mobility-based communities are related to any of these barriers. In Spain there are different territorial divisions for administrative purposes. In this work we consider two of them: provinces, defined in the 1978 Constitution, which are 48 different heterogeneous aggregations of municipalities; and counties (comarcas in Spanish terminology), which are traditional aggregations of municipalities mainly based on Spanish orography (rivers, valleys, ridges, etc.), some of which are composed of municipalities from different provinces. We again use the NMI method to compare the community structure given by the algorithms to the administrative limits. Except for the Leading Eigenvector algorithm, the methods return communities that are closely related to provinces (NMI ≈ 0.7), whereas higher variability is observed for the county limits. In this last case, the algorithm most closely related to county limits is Infomap (NMI ≈ 0.83). Therefore, Twitter-based mobility summarizes the inter-city flows, showing that these flows are influenced by geographical and political barriers.
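The NMI comparisons used in this section can be illustrated with a minimal sketch. It assumes the common normalization of the mutual information by the sum of the two partition entropies, which matches the 0-to-1 behavior described above; the function name and the toy partitions are hypothetical.

```python
from collections import Counter
from math import log


def normalized_mutual_information(labels_a, labels_b):
    """NMI between two community memberships given as equal-length label lists.

    Returns 1.0 for identical partitions and 0.0 for independent ones.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both memberships must cover the same nodes")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    mutual_info = sum((n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
                      for (a, b), n_ab in count_ab.items())
    entropy_a = -sum((c / n) * log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * log(c / n) for c in count_b.values())

    if entropy_a + entropy_b == 0:
        return 1.0  # both partitions put every node in a single community
    return 2 * mutual_info / (entropy_a + entropy_b)


# Toy memberships for four municipalities (labels are arbitrary identifiers).
same = normalized_mutual_information([0, 0, 1, 1], ["x", "x", "y", "y"])
unrelated = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

The measure depends only on how nodes are grouped, not on the label names, which is why detected communities can be compared directly with province or county codes.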

S4 Twitter demographics and unemployment rates

Different age groups are not equally represented in Twitter. Recent surveys (2012) in Spain suggest that most (86%) Twitter users are 16 to 44 years old. Comparing the percentage of users per age group with the total population within the same groups (see figure 8) reveals that age groups above 35 years old are under-represented in Twitter. Thus, our Twitter data will be more revealing when trying to describe unemployment in age groups below 44 years old. This is indeed what we find when we build a linear model for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 24 can be fitted to a linear model with R2 = 0.62, regression models for unemployment rates for ages between 25 and 44 have R2 = 0.52, and for ages above 44 we get only R2 = 0.26. Table 4 summarizes the results of the regression models of unemployment rates in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, figure 8 shows the performance of the model for the different age groups, and once again the poor explanatory power of the Twitter variables for the unemployment rate at ages above 44 years old is apparent.

S5 Properties of Twitter variables

Normalization and distributions

Heterogeneity between the values of the variables constructed from Twitter is substantial but not extreme, as the histograms in figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration $\tau_i$ and the misspellers rate $\varepsilon_i$ are defined as the number of users or misspellers per 100000 persons (population); activity variables $\nu_i$ are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, $\mu_i$, are given per 100000 tweets published in the geographical area.

Figure 7. From left to right and from top to bottom: FastGreedy, Walktrap, Infomap, MultiLevel, Label Propagation, and Leading Eigenvector communities on Twitter-based mobility transitions.

Correlation between variables

Variables are constructed to reflect the behavior of areas in the different dimensions of Twitter penetration, social or geographical diversity, activity through the day, and content. The correlations do indeed show that variables within each dimension are strongly correlated with one another. As we can see in figure 10, social and geographical diversities are highly correlated, an expected fact given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate $\tau_i$ and the fraction of misspellers $\varepsilon_i$ are strongly correlated with most of the variables.

High correlation between variables might lead to collinearity effects [24] in the linear regression models, that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value, but its p-value is too high to consider it significant. To test this hypothesis we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 shows the loadings of the different variables. The block structure shown in figure 10 results in similar directions of the variables in the first components of the PCA. We observe some groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we have taken one variable from each of them for our regression models.
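A simple way to screen for the collinearity discussed above is to flag pairs of variables whose Pearson correlation exceeds a threshold. The sketch below is only illustrative and is not the PCA carried out in the paper; the 0.8 threshold, the function names, and the five-area data values are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def collinear_pairs(columns, threshold=0.8):
    """Return (name_1, name_2, r) for every pair of variables with |r| above threshold."""
    names = list(columns)
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(columns[first], columns[second])
            if abs(r) > threshold:
                flagged.append((first, second, r))
    return flagged


# Hypothetical values for five geographical areas.
variables = {
    "penetration": [500, 800, 1100, 1400, 1700],
    "misspellers": [40, 70, 95, 130, 160],
    "morning": [5.5, 4.8, 6.1, 5.0, 5.6],
}
flagged = collinear_pairs(variables)
```

When a pair is flagged, keeping only one of the two variables in the regression (as done for the Provinces model in Table 6) is one way to avoid unstable, non-significant coefficients.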

S6 Misspellers detection

In this work we consider only tweets in Spanish. Since several languages coexist in Spain, depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet contains a misspelling or not, we need to establish which patterns to look for in our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we do not consider the following cases:

• Lack of written accents. People tend to omit accents when writing colloquially.

• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters qu by k. We understand that these mistakes can be motivated by the length limit of tweets rather than by a real misspelling.

• Along the same lines, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.

• We also do not consider mistakes related to features of specific areas in Spain. For example, in the south the pronunciation of ce and se is the same, which produces a large number of writing mistakes. However, since we want to draw objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

Likewise, we consider the following mistakes as real misspellings:

• Adding letters, for example writing an h at the beginning of a word that starts with a vowel.

• Replacing the special cases mp, mb by the wrong spellings np, nb.

• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish because the pairs have the same or a very close pronunciation.

• Confusing the verb haber with the periphrasis a ver.

• Separating a word into two, for instance writing the word conmigo as con migo.

Figure 9. Frequency plots P(x) for each variable constructed from Twitter: penetration rate, social and geographical entropies, misspellers rate, activity (morning, afternoon, night), and tweets mentioning job, employment, and unemployment.

Figure 8. Top: percentage of population in each age group (16−24, 25−34, 35−44, 45−54, 55−64) from the Spanish Census (dark bars) and surveys about users in Twitter (light bars). Bottom: performance of the linear models for each of the age groups, shown as predicted against observed unemployment (%): all ages (R2 = 0.47), < 24 (R2 = 0.62), 25−44 (R2 = 0.52), and > 44 (R2 = 0.26).

In this way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus, one can expect this selection to provide an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27055 (5.6% of the whole population).

We analyze whether misspellers use Twitter differently from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, misspellers tend to publish more tweets than those who did not make mistakes (144.71 against 23.72). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period, the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).

Figure 10. Top: correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations at 95% confidence. Bottom: projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.

Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspellers rate as a proxy for the educational level of the cities. A large number of previous works in the literature have revealed the relationship between the economic status and the educational level of geographical areas, and therefore it is natural to ask whether the observed misspellers rate is related to the economy, as measured by the unemployment rate. To test this hypothesis we consider cities with more than 5000 inhabitants, to avoid subsampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).
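The detection procedure of this section, matching tweets against a fixed list of misspellings while ignoring missing accents, and then counting misspellers per 100000 inhabitants, can be sketched as follows. The four-entry list is a hypothetical sample chosen to match the categories named above; it is not the 617-entry list used in the paper, and the function names are assumptions.

```python
import re
import unicodedata

# Hypothetical sample of misspellings: a split word, np for mp, an added h, ll for y.
MISSPELLINGS = ("con migo", "tanbien", "habrir", "llendo")
PATTERN = re.compile(r"\b(?:" + "|".join(re.escape(m) for m in MISSPELLINGS) + r")\b")


def strip_accents(text):
    """Remove combining marks so that a missing accent is never counted as a mistake.

    Note that this also maps the letter n with tilde to a plain n, which is
    acceptable only because no entry in the sample list depends on it.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def has_misspelling(tweet):
    """True if the tweet contains at least one entry of the misspelling list."""
    return PATTERN.search(strip_accents(tweet.lower())) is not None


def misspellers_per_100k(tweets_by_user, population):
    """Users with at least one misspelled tweet, per 100000 inhabitants."""
    misspellers = sum(1 for tweets in tweets_by_user.values()
                      if any(has_misspelling(t) for t in tweets))
    return 100000 * misspellers / population


rate = misspellers_per_100k({"u1": ["voy llendo"], "u2": ["hola"]}, population=50000)
```

A fixed list keeps the detector conservative: anything not on the list, including the colloquial shortenings excluded above, is never counted as a mistake.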

Figure 11. Number (red) and probability (blue) of observed misspellings given the number of tweets.

Figure 12. Explanatory power (R2) of the linear regression model when fitted against the unemployment data for different months, from October 2012 to June 2014. The gray (orange) area corresponds to the time window in which Twitter data is collected and variables are constructed.

S7 Time window and unemployment

In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment determined at different months within the same time window in which the Twitter data was collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory value of the model when the linear regression is done for values of unemployment of different months before, during, and after the Twitter data time window. Although there is a small seasonal effect along the year, we see that the explanatory power remains around R2 = 0.6, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that R2 decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.

S8 Demographics do not explain unemployment

Since unemployment rates are very high among young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found in the geographical areas. To test this we have built four linear models: the first one (named Youth model in Table 5) has the rate of young population as its only explanatory variable; the second and third are built only on the Twitter variables considered in the main text (Twitter model (I)) or only on those whose regression coefficients are statistically significant (Twitter model (II)); the fourth is fitted with all the variables (All variables model in Table 5). Table 5 shows the summary of the regression for each model. Focusing on the variance explained by each model in terms of R2, considering all Twitter variables is almost three times more explanatory than considering only the proportion of young people (R2 = 0.64 against 0.24). On the other hand, comparing R2 for the Twitter model with that of the All variables and Youth models shows that the rate of young population does not provide significant additional explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.

S9 Unemployment models for other geographical areas

While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we considered is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment for the variables defined in those administrative areas and relate it to the geographical communities detected and used in the main paper (see section 4). Not all the areas in the different administrative divisions are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1738 municipalities that have a Twitter population π > 10. Similarly, we only consider the 198 counties with π > 100. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, R2 increases as the number of areas in the model gets smaller, but the description level of the model is very low for provinces, for example. The best performance (high R2 and high geographical description level) is attained at the level of the detected communities.
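The relation between the R2 and adjusted R2 values reported in Table 6 can be checked with the standard adjustment formula. The sketch assumes that formula, six predictors for the communities and counties models, and five for the provinces model (where the misspellers rate is removed); inputs are the rounded values printed in the table, so agreement is only expected to two decimals.

```python
def adjusted_r2(r2, n_points, n_predictors):
    """Standard adjusted R2: penalizes R2 for the number of fitted predictors."""
    return 1 - (1 - r2) * (n_points - 1) / (n_points - n_predictors - 1)


# Rounded inputs taken from Table 6: (R2, number of points, number of predictors).
communities = adjusted_r2(0.64, 128, 6)
counties = adjusted_r2(0.55, 198, 6)
provinces = adjusted_r2(0.65, 50, 5)
```

The penalty is largest for provinces, where only 50 points are available, which is consistent with the larger gap between R2 and adjusted R2 in that column.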

S10 Relative importance of the variables

To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:

1. (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.

2. (lmg) Averaging over orderings, as proposed by Lindeman, Merenda and Gold.

3. (pmvd) The PMVD metric introduced by Feldman, which is an average over orderings as well, but with data-dependent weights.

4. (first) The univariate R2 values from regression models with one variable only.

Figure 13. Relative importance of the variables (in percentage) in the unemployment model for different ways of calculating it (weight, first, lmg, pmvd).

All these metrics are obtained using the relaimpo R package [12]. The results for the young unemployment model are shown in figure 13, where we can see that the different methods yield similar relative importance of the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.
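Two of the four metrics, weight and first, are simple enough to sketch without the relaimpo package. The code below is a plain-Python illustration, not a reimplementation of relaimpo; the lmg and pmvd metrics, which average over orderings of the regressors, are not covered. The function names and the standardized coefficients in the example are hypothetical.

```python
from math import sqrt


def relative_weights(standardized_coefs):
    """'weight' metric: share (in %) of each absolute standardized coefficient."""
    total = sum(abs(c) for c in standardized_coefs.values())
    return {name: 100 * abs(c) / total for name, c in standardized_coefs.items()}


def univariate_r2(x, y):
    """'first' metric for one variable: R2 of a one-variable linear regression,
    which equals the squared Pearson correlation between x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return (cov / sqrt(var_x * var_y)) ** 2


# Hypothetical standardized coefficients for four variables.
weights = relative_weights({"penetration": 0.4, "morning": -0.3,
                            "misspellers": 0.2, "geo_diversity": 0.1})
perfect_fit = univariate_r2([1, 2, 3, 4], [2, 4, 6, 8])
```

The sign of a coefficient is discarded by the weight metric, so a variable with a strong negative effect, such as morning activity, can still rank among the most important.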

References

[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.

[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.

[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.

[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.

[6] Leon Danon, Albert Diaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.

[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html

[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm

[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.

[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.

[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.

[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.

[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.

[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.

[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.

[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences - ISCIS 2005, pages 284–293. Springer, 2005.

[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.

[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.

[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.

[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.

[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.

[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.

[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.

[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.

Gravity Model
Parameter   Description                              Spain
α1          Origin exponent                          0.477*** (0.002)
α2          Destination exponent                     0.478*** (0.002)
β           Distance exponent                        1.05*** (0.0035)
R2          Goodness of fit                          0.797
φ           Correlation between Tij and Tgrav_ij     0.826

Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.

NMI between G and Gp for different p
Algorithm   p = 0.01   0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09    0.1
FG          0.995      0.992   0.989   0.983   0.981   0.977   0.983   0.969   0.980   0.959
WT          0.954      0.959   0.950   0.954   0.945   0.948   0.947   0.935   0.926   0.931
IM          0.988      0.981   0.980   0.981   0.978   0.974   0.975   0.970   0.969   0.966
ML          0.994      0.978   0.979   0.983   0.948   0.934   0.972   0.952   0.973   0.947
LP          0.906      0.908   0.911   0.915   0.895   0.907   0.907   0.893   0.905   0.904
LE          0.960      0.957   0.956   0.859   0.910   0.892   0.908   0.858   0.885   0.884

Table 2. NMI measure comparing G and Gp.

Communities Stats
Algorithm   〈|Ni|〉i     max|Ni|   |{Ni}|   Modularity   NMI P   NMI C
FG          309.696     1385      23       0.726        0.712   0.590
WT          9.262       433       769      0.417        0.744   0.757
IM          21.011      143       339      0.758        0.770   0.831
ML          323.772     1132      22       0.800        0.717   0.599
LP          22.052      750       323      0.732        0.749   0.761
LE          1017.571    5344      7        0.381        0.264   0.205

Table 3. Statistics of the communities Ni returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.

                          All ages     < 24        25−44       > 44
(Intercept)               0.11****     0.10***     0.20***     0.20***
                          (0.02)       (0.03)      (0.03)      (0.035)
Penetration rate          3.23*        8.57***     6.28**      2.40
                          (1.41)       (2.22)      (2.17)      (2.77)
Geographical diversity    0.03         0.15***     0.08*       0.06
                          (0.02)       (0.04)      (0.04)      (0.05)
Social diversity          −0.03*       −0.03       −0.05*      −0.06*
                          (0.01)       (0.02)      (0.02)      (0.03)
Morning activity          −0.69*       −1.30**     −1.53***    −1.19*
                          (0.26)       (0.42)      (0.41)      (0.52)
Misspellers rate          11.56        31.51*      15.46       23.60
                          (8.13)       (12.78)     (12.48)     (15.94)
Employment mentions       −1.80        3.17        −9.94       2.71
                          (6.27)       (9.86)      (9.64)      (12.3)
R2                        0.47         0.64        0.55        0.29
Adj. R2                   0.44         0.62        0.52        0.26
***p < 0.001, **p < 0.01, *p < 0.05

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in groups of less than 24 years, between 25 and 44 years, and above 44 years.

                          All variables   Youth model   Twitter model (I)   Twitter model (II)
(Intercept)               0.06            −0.02         0.10***             0.09***
                          (0.03)          (0.03)        (0.03)              (0.027)
Young pop. rate           0.66*           2.20***
                          (0.30)          (0.35)
Penetration rate          8.20***                       8.57***             8.62***
                          (2.25)                        (2.22)              (2.21)
Geographical diversity    0.14***                       0.15***             0.12***
                          (0.04)                        (0.04)              (0.03)
Social diversity          −0.02                         −0.03
                          (0.02)                        (0.02)
Morning activity          −1.42***                      −1.30**             −1.28**
                          (0.41)                        (0.42)              (0.41)
Misspellers rate          23.95                         31.51*              32.28*
                          (13.09)                       (12.78)             (12.71)
Employment mentions       0.34                          3.17
                          (9.81)                        (9.86)
R2                        0.65            0.24          0.64                0.63
Adj. R2                   0.63            0.24          0.62                0.62
***p < 0.001, **p < 0.01, *p < 0.05

Table 5. Regression table for the different statistical models. The All variables model includes both Twitter and rate of young population variables. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).

                          Communities   Municipalities   Counties    Provinces
(Intercept)               0.10***       0.16***          0.11***     0.11*
                          (0.03)        (0.01)           (0.03)      (0.05)
Penetration rate          8.57***       4.01***          9.12***     10.47***
                          (2.22)        (0.59)           (1.81)      (1.97)
Geographical diversity    0.15***       0.02             0.12***     0.08
                          (0.04)        (0.01)           (0.03)      (0.07)
Social diversity          −0.03         −0.01            −0.01       −0.03
                          (0.02)        (0.01)           (0.02)      (0.07)
Morning activity          −1.30**       −1.16***         −1.49***    −1.03
                          (0.42)        (0.14)           (0.39)      (0.88)
Misspellers rate          31.51*        14.40***         14.09
                          (12.78)       (2.51)           (10.02)
Employment mentions       3.17          −0.71            2.41        −3.17
                          (9.86)        (0.89)           (8.86)      (12.29)
Number of points          128           1738             198         50
R2                        0.64          0.22             0.55        0.65
Adj. R2                   0.62          0.21             0.54        0.61
***p < 0.001, **p < 0.01, *p < 0.05

Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.


economy Regarding the latter our work is important forunderstanding how country-scale analysis of Social Mediashould consider the demographic but also the economical dif-ference between users As we have shown users in areaswith large unemployment have different mobility different so-cial interactions and different daily activity than those in lowunemployment areas This intertwined relationship betweenuser behavior and employment should be considered not onlyin economical analysis derived from social media but alsoin other applications like marketing communication socialmobilization etc

It is particularly remarkable that Twitter data can provide these accurate results. Twitter is, among the many currently popular social networking platforms, perhaps the noisiest, sparsest, and most 'sabotaged' medium: very few users send out messages at a regular rate; most users do not have geolocated information; the social relationships (followers/followees) contain many unused or unimportant links; it is plagued by spam-bots; and, last but not least, we have no way to identify the motive, goal, or functionality of the mobility fluxes we are able to extract. These limitations are not particular to our sample, but general to the sample Twitter data being employed in the computational social science community. Despite all these caveats, we are able to show that even some simple filtering techniques, together with basic statistical regressions, yield predictive power about a variable as important as unemployment. Other social media platforms such as Facebook, Google+, Sina Weibo, Instagram, Orkut, or Flickr, with more granular and consistent individual data, are likely to provide similar or better results by themselves or in combination. Further improvements can be obtained by the use of more sophisticated statistical machine learning techniques, some of them even tailored to the peculiarities of social media data. Our work serves to illustrate the tremendous potential of these new digital datasets to improve the understanding of society's functioning at the finest scales of granularity.

The usefulness of our approach must be considered against the cost and update rate of performing detailed surveys of mobility, social structure, and economic performance. Our database is publicly articulated, which means that our analysis could be replicated easily in other countries, in other time periods, and with different scopes. Naturally, surveys provide more accurate results, but they also consume considerably more financial and human resources, employing hundreds of people and taking months, even years, to complete and be released. They are so costly that countries going through economic recession have considered discontinuing them or altering their update rate in recent times. A particularly problematic aspect of these surveys is that they are "out-of-sync", i.e., a census may be up to date whereas those same individuals' travel surveys may not be, and therefore drawing inferences between both may be particularly difficult. This is a challenging problem that the immediateness of social media can help ameliorate.

A few questions remain open for further investigation. How can traditional surveys and social media digital traces be best combined to maximize their predictive ability? Can social media provide a reliable leading indicator for unemployment and, in general, for economic surveys? How much reliable lead is possible, if any? As we have found, Twitter penetration and educational levels are correlated with unemployment, but these levels are unlikely to change rapidly enough to describe or anticipate changes in the economy or unemployment. However, other indicators like daily activity, social interactions, and geographical mobility are more connected with our daily activity, and perhaps they have more predictive power to show and/or anticipate sudden changes in employment. The relationship between unemployment and individual and group behavior may help contextualize the multiple factors affecting the socioeconomic well-being of a region: while penetration, content, daily activity, and mobility diversity seem to be highly correlated with unemployment in Spain, different weights for each group of traces might be expected in other countries [14]. Finally, digital traces could serve as an alternative (sometimes the only one available) to the lack of surveys in poor or remote areas [20, 27]. Another interesting avenue of research involves the use of social media to detect mismatches between the real (hidden, underground) economy and the officially reported one [38].

Most importantly, the immediacy of social media may also allow governments to better measure and understand the effect of policies, social changes, and natural or man-made disasters on the economic status of cities in almost real time [18, 39]. These new avenues for research provide great opportunities at the intersection of the economic, social, and computational sciences, originating from these new, widespread, and inexpensive datasets.

Acknowledgments

We would like to thank Kristina Lerman, Lada Adamic, James Fowler, Daniel Villatoro, and Ricardo Herranz for stimulating discussions, and Yuri Kryvasheyeu and Thomas Bochynek for their critical reading of the manuscript. This work was partially supported by Spanish Ministry of Science and Technology Grant FIS2013-47532-C3-3-P (to A. L., M. G. H., and E. M.). Manuel Cebrian is funded by the Australian Government as represented by the Department of Broadband, Communications and Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

References

[1] Becker, G. S. (1976) The economic approach to human behavior. (University of Chicago Press).

[2] Granovetter, M. (1985) Economic action and social structure: the problem of embeddedness. American journal of sociology pp. 481–510.

[3] Camerer, C. F., Loewenstein, G., & Rabin, M. (2011) Advances in behavioral economics. (Princeton University Press).

[4] Glaeser, E. L., Kallal, H. D., Scheinkman, J. A., & Shleifer, A. (1991) Growth in cities, (National Bureau of Economic Research), Technical report.

[5] Bettencourt, L. M., Lobo, J., Helbing, D., Kuhnert, C., & West, G. B. (2007) Growth, innovation, scaling, and the pace of life in cities. Proceedings of the National Academy of Sciences 104, 7301–7306.

[6] Batty, M. (2008) The size, scale, and shape of cities. Science 319, 769–771.

[7] Milgram, S. (1974) The experience of living in cities. Crowding and behavior 167, 41.


[8] Pan, W., Ghoshal, G., Krumme, C., Cebrian, M., & Pentland, A. (2013) Urban characteristics attributable to density-driven tie formation. Nature communications 4.

[9] Gonzalez, M. C., Hidalgo, C. A., & Barabasi, A.-L. (2008) Understanding individual human mobility patterns. Nature 453, 779–782.

[10] Calabrese, F., Diao, M., Lorenzo, G. D., Jr., J. F., & Ratti, C. (2013) Understanding individual mobility patterns from urban sensing data: A mobile phone trace example. Transportation Research Part C: Emerging Technologies 26, 301–313.

[11] Cheng, Z., Caverlee, J., Lee, K., & Sui, D. Z. (2011) Exploring Millions of Footprints in Location Sharing Services. (AAAI, Menlo Park, CA, USA).

[12] Cho, E., Myers, S. A., & Leskovec, J. (2011) Friendship and mobility: user movement in location-based social networks, KDD '11. (ACM, New York, NY, USA), pp. 1082–1090.

[13] Sun, L., Jin, J. G., Axhausen, K. W., Lee, D.-H., & Cebrian, M. (2014) Quantifying long-term evolution of intra-urban spatial interactions. arXiv preprint arXiv:1407.0145.

[14] Eagle, N., Macy, M., & Claxton, R. (2010) Network diversity and economic development. Science 328, 1029–1031.

[15] Henrich, J., Boyd, R., Bowles, S., Camerer, C., Fehr, E., Gintis, H., & McElreath, R. (2001) In search of homo economicus: behavioral experiments in 15 small-scale societies. American Economic Review pp. 73–78.

[16] Krieger, N., Williams, D. R., & Moss, N. E. (1997) Measuring social class in US public health research: concepts, methodologies, and guidelines. Annual review of public health 18, 341–378.

[17] Groves, R. M., Fowler Jr, F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2013) Survey methodology. (John Wiley & Sons).

[18] Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M., et al. (2009) Life in the network: the coming age of computational social science. Science (New York, NY) 323, 721.

[19] Smith, C., Quercia, D., & Capra, L. (2013) Finger on the pulse: identifying deprivation using transit flow analysis. (ACM), pp. 683–692.

[20] Soto, V., Frias-Martinez, V., Virseda, J., & Frias-Martinez, E. (2011) Prediction of socioeconomic levels using cell phone records, UMAP'11. (Springer-Verlag, Berlin, Heidelberg), pp. 377–388.

[21] Preis, T., Moat, H. S., Stanley, H. E., & Bishop, S. R. (2012) Quantifying the advantage of looking forward. Scientific reports 2.

[22] Antenucci, D., Cafarella, M., Levenstein, M. C., Re, C., & Shapiro, M. D. (2014) Using social media to measure labor market flows, (National Bureau of Economic Research), Technical report.

[23] Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P., & Ratti, C. (2013) Geo-located twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680.

[24] Lathia, N., Quercia, D., & Crowcroft, J. (2012) in Pervasive computing. (Springer), pp. 91–98.

[25] Frias-Martinez, V., Virseda, J., & Frias-Martinez, E. (2010) Socio-economic levels and human mobility.

[26] Gutierrez, T., Krings, G., & Blondel, V. D. (2013) Evaluating socio-economic state of a country analyzing airtime credit and mobile phone datasets. arXiv preprint arXiv:1309.4496.

[27] Smith, C., Mashhadi, A., & Capra, L. (2013) Ubiquitous sensing for mapping poverty in developing countries. Paper submitted to the Orange D4D Challenge.

[28] Erlander, S. & Stewart, N. F. (1990) The gravity model in transportation analysis: theory and extensions. (Vsp) Vol. 3.

[29] Simini, F., Gonzalez, M. C., Maritan, A., & Barabasi, A.-L. (2012) A universal model for mobility and migration patterns. Nature 484, 96–100.

[30] Lenormand, M., Picornell, M., Cantu-Ros, O. G., Tugores, A., Louail, T., Herranz, R., Barthelemy, M., Frias-Martinez, E., & Ramasco, J. J. (2014) Cross-checking different sources of mobility information. arXiv preprint arXiv:1404.0333.

[31] Barthelemy, M. (2011) Spatial networks. Physics Reports 499, 1–101.

[32] Expert, P., Evans, T. S., Blondel, V. D., & Lambiotte, R. (2011) Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences 108, 7663–7668.

[33] Sobolevsky, S., Szell, M., Campari, R., Couronne, T., Smoreda, Z., & Ratti, C. (2013) Delineating geographical regions with networks of human interactions in an extensive set of countries. PloS one 8, e81707.

[34] Rosvall, M. & Bergstrom, C. T. (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105, 1118–1123.

[35] Danon, L., Diaz-Guilera, A., Duch, J., & Arenas, A. (2005) Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment 2005, P09008.

[36] ADigital. (2013) Uso de twitter en espana 2012. [Online; accessed 1-November-2014].

[37] Davenport, J. R. & DeLine, R. (2014) The readability of tweets and their geographic correlation with education. arXiv preprint arXiv:1401.6058.

[38] Schneider, F., Buehn, A., & Montenegro, C. E. (2011) Shadow economies all over the world: New estimates for 162 countries from 1999 to 2007. Handbook on the shadow economy pp. 9–77.

[39] Rutherford, A., Cebrian, M., Dsouza, S., Moro, E., Pentland, A., & Rahwan, I. (2013) Limits of social mobilization. Proceedings of the National Academy of Sciences 110, 6281–6286.


Supporting Information for
Social media fingerprints of unemployment
Alejandro Llorente, Manuel García-Herranz, Manuel Cebrian, and Esteban Moro

S1 The dataset

Twitter provides an extremely rich and publicly available dataset of user interactions, information flows and, thanks to the geo-location of tweets, user movements. Nevertheless, the representativeness of geo-located Twitter as a global source of mobility data has so far received sparse attention. In this sense, while [13] presents a promising and extensive study of global country-to-country movements (mostly driven by tourism), within-country human flows (comprising not only internal tourism but also, to a greater extent than country-to-country travel, visiting and commuting) still need further investigation. Therefore, throughout this work we compare our findings using geo-located Twitter with similar studies using commuting surveys.

For the Twitter analysis we consider almost 146 million geo-located Twitter messages (tweets) collected through the public API provided by Twitter for the continental part of Spain from 29th November 2012 to 10th April 2013. In this dataset we consider that there has been a trip from place l to place k if a user has tweeted in place l and place k consecutively. We only keep those transitions in which the first tweet and the second one are dated on the same day. We filter the trips database to avoid unrealistic transitions and keep only trips with a geographical displacement larger than 1 km (see Methods section). By this method, 138 million trips from 167376 different users are considered in our work.

From those trips we construct the mobility flow $T_{ij}$ between municipalities, which measures the number of trips in our database whose origin is within the boundaries of city $i$ and whose destination lies within those of city $j$.
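The trip definition and the flow counts described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout `(user, timestamp, lat, lon, city)`, the function names, and the toy coordinates are assumptions, and the additional filter for unrealistic transitions mentioned in the text is not implemented because its rule is not given here.

```python
import math
from collections import Counter
from datetime import datetime

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def extract_trips(tweets, min_km=1.0):
    """Consecutive same-day tweets of a user, more than min_km apart, form a trip."""
    by_user = {}
    for tw in tweets:
        by_user.setdefault(tw[0], []).append(tw)
    trips = []
    for rows in by_user.values():
        rows.sort(key=lambda t: t[1])  # chronological order per user
        for a, b in zip(rows, rows[1:]):
            same_day = a[1].date() == b[1].date()
            far_enough = haversine_km(a[2], a[3], b[2], b[3]) > min_km
            if same_day and far_enough:
                trips.append((a[4], b[4]))  # (origin city, destination city)
    return trips

def flow_matrix(trips):
    """T_ij as a sparse counter keyed by (origin, destination)."""
    return Counter(trips)

# Toy records (hypothetical users; approximate city coordinates).
tweets = [
    ("u1", datetime(2013, 1, 10, 9, 0), 40.4168, -3.7038, "Madrid"),
    ("u1", datetime(2013, 1, 10, 18, 0), 40.3083, -3.7327, "Getafe"),
    ("u1", datetime(2013, 1, 11, 9, 0), 40.4168, -3.7038, "Madrid"),   # next day: no trip
    ("u2", datetime(2013, 1, 10, 12, 0), 40.4168, -3.7038, "Madrid"),
    ("u2", datetime(2013, 1, 10, 12, 5), 40.4170, -3.7040, "Madrid"),  # under 1 km: no trip
]
flows = flow_matrix(extract_trips(tweets))
print(dict(flows))
```

A `Counter` keyed by ordered pairs keeps the flow matrix sparse and does not force it to be symmetric, which matches the directed definition of the flows.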

We also consider population and economic information about the municipalities from the Spanish Census (2011) [8] and unemployment figures from the Public Service of Employment (Servicio Público de Empleo Estatal, SEPE) [7]. In the latter case, registered unemployment (in number of persons) is given for each Spanish municipality by gender, age, and month. To get unemployment rates, we divide registered unemployment by the total workforce in the municipality, estimated as the number of people aged between 16 and 65 years.

S2 Twitter as mobility proxy

Considering all of the available transitions in our database, one can compute the distance between origin and destination, the elapsed time of the transition, and the number of trips per user, among many other statistics. All of them seem to show a power-law distribution with a cutoff due to the finite spatial size of Spain and the constraint of considering only transitions in which the origin and destination check-ins are made on the same day. Focusing on the log-linear part of the distributions, self-similar behaviors arise when Twitter-based mobility is analyzed (see Figure 5).

Twitter-based inter-city flows can be well modelled by means of the Gravity Law, which is one of the most widely used methods to represent human mobility [1, 19], with applications in many fields like urban planning [23], traffic engineering [4], or transportation problems [9]. The Gravity Law is also the solution to the problem of maximizing the entropy of the particle distribution among all the possible trips using statistical mechanics techniques [2, 22]. Recently it has also been used as a model for human mobility based on cell phone traces [10, 20, 21] and social media data at a global scale [13] and at the inter-city level [14].

The Gravity Model for human mobility assumes that the flows between cities can be explained by the expression

$$T^{\mathrm{grav}}_{ij} = \frac{P_i^{\alpha_1} P_j^{\alpha_2}}{d_{ij}^{\beta}} \qquad (2)$$

where $T^{\mathrm{grav}}_{ij}$ is the flow, in terms of number of people, between cities $i$ and $j$, $d_{ij}$ is the geographical distance between them, and $P_i$ and $P_j$ are the populations of the two cities.

Given the data, we can obtain the parameters of the model by Weighted Least Squares minimization:

$$(\alpha_1^*, \alpha_2^*, \beta^*) = \arg\min_{\alpha_1, \alpha_2, \beta} \frac{1}{N} \sum_{ij} w_{ij} \left(T_{ij} - T^{\mathrm{grav}}_{ij}\right)^2 \qquad (3)$$

where $N$ is the total number of connections in the mobility graph and $w_{ij}$ is a weight proportional to the number of observed transitions between $i$ and $j$. In particular, we find that taking $w_{ij} = T_{ij}^{1/3}$ gives the best performance in the model.
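The weighted objective of the gravity fit can be sketched as follows. This is only an illustration under stated assumptions: the minimizer is a coarse grid search rather than whatever optimizer the authors used, the weight exponent is left as a free parameter (`weight_exp`, set to an arbitrary illustrative value below), and the populations, distances, and function names are invented for the example.

```python
from itertools import product

def gravity_flow(p_i, p_j, d_ij, a1, a2, beta):
    """Gravity-model flow: P_i^a1 * P_j^a2 / d_ij^beta."""
    return (p_i ** a1) * (p_j ** a2) / (d_ij ** beta)

def weighted_loss(observations, a1, a2, beta, weight_exp):
    """Mean of w_ij * (T_ij - T_grav_ij)^2 with w_ij = T_ij ** weight_exp."""
    total = 0.0
    for t_ij, p_i, p_j, d_ij in observations:
        w_ij = t_ij ** weight_exp
        total += w_ij * (t_ij - gravity_flow(p_i, p_j, d_ij, a1, a2, beta)) ** 2
    return total / len(observations)

def fit_by_grid(observations, grid, weight_exp):
    """Exhaustive search over (a1, a2, beta) on a shared grid of candidate values."""
    best_params, best_loss = None, float("inf")
    for a1, a2, beta in product(grid, repeat=3):
        loss = weighted_loss(observations, a1, a2, beta, weight_exp)
        if loss < best_loss:
            best_params, best_loss = (a1, a2, beta), loss
    return best_params, best_loss

# Synthetic flows generated from known exponents, so the fit should recover them.
true_params = (0.5, 0.5, 1.0)
sites = [(1000, 4000, 10.0), (2500, 900, 50.0), (10000, 400, 5.0), (3000, 3000, 20.0)]
observations = [(gravity_flow(p_i, p_j, d, *true_params), p_i, p_j, d) for p_i, p_j, d in sites]
best_params, best_loss = fit_by_grid(observations, [0.0, 0.5, 1.0, 1.5], weight_exp=1.0)
print(best_params, best_loss)
```

In practice a continuous optimizer would replace the grid; the grid keeps the example dependency-free and makes the objective itself easy to inspect.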

In our case, this model fits the inter-city mobility based on Twitter GPS check-ins quite accurately (see Table 1). Even though we do not require $T_{ij}$ to be symmetric, the exponents of the populations are similar, indicating that we observe similar flows in both directions between $i$ and $j$.

S3 Community structures in inter-city mobility graph

Typically, complex networks exhibit community structure, that is, there are subsets of nodes that are more densely connected among themselves than to the rest of the nodes. In mobility networks, whose nodes correspond to geographical areas, these communities are interpreted as zones with high common activity and tend to be constrained by geographical and political barriers. We check whether this is also observed in our dataset by running six state-of-the-art community detection algorithms: FastGreedy [5], Walktrap [16], Infomap [18], MultiLevel [3], Label Propagation [17], and Leading Eigenvector [15]. These six algorithms yield different community structures in terms of number of communities, average community size, or modularity (see Table 3). Members (municipalities) of the resulting communities are spatially connected, with few exceptions, as Figure 7 shows. We test the statistical robustness of the obtained communities by randomly removing a proportion $p$ of the original links and running the algorithms on this new graph $G_p$. We consider communities to be robust when those obtained for the original network $G$ and for $G_p$ are highly similar. To compare two arbitrary community memberships we use the Normalized Mutual Information (NMI) method described in [6], which returns 0 when two memberships are totally different and 1 when they are equal. We compute the NMI for each chosen algorithm run on $G$ and $G_p$ for $p$ between 1% and 10%, concluding that the obtained community structures are robust because they are not broken when some randomly chosen links are removed (see Table 2).

Figure 5. Probability distributions for the different properties of daily trips in the Twitter dataset: trip distance (km), number of trips, and elapsed time (secs). Dashed lines correspond to power-law fits with exponents -1.67, -2.43, and -0.62, respectively.

Figure 6. Penetration rates, as a function of population, for both cities and detected communities.

As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether the mobility-based communities are related to any of these barriers. In Spain there are different territorial divisions for administrative purposes. In this work we consider two of them: provinces, defined in the 1978 Constitution, which are 48 different heterogeneous aggregations of municipalities; and counties (comarcas in Spanish terminology), which are traditional aggregations of municipalities mainly based on Spanish orography (rivers, valleys, ridges, etc.), some of which are composed of municipalities from different provinces. We again use the NMI method to compare the community structure given by the algorithms with the administrative limits. Except for the Leading Eigenvector algorithm, the methods return communities that are quite related to provinces (NMI ≈ 0.7), whereas for the county limits higher variability is observed. In this last case, the algorithm most closely related to county limits is Infomap (NMI ≈ 0.83). Therefore, Twitter-based mobility summarizes the inter-city flows, showing that these flows are influenced by geographical and political barriers.
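The comparison between two community memberships can be illustrated with a small NMI computation. The sketch assumes the common normalization NMI = 2 I(A;B) / (H(A) + H(B)); the function name and the toy labels are hypothetical, and the exact variant used in the paper is the one described in its reference [6].

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitions of the same nodes."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mi = sum(
        c / n * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    if h_a + h_b == 0.0:
        return 1.0  # both partitions put every node in a single community
    return 2.0 * mi / (h_a + h_b)

# Toy memberships for four municipalities (labels are arbitrary identifiers).
detected = [0, 0, 1, 1]
provinces_same = ["x", "x", "y", "y"]   # same grouping, different label names
provinces_indep = [0, 1, 0, 1]          # grouping unrelated to the detected one
score_same = nmi(detected, provinces_same)
score_indep = nmi(detected, provinces_indep)
print(score_same, score_indep)
```

Because the measure depends only on how nodes are grouped, relabeling the communities does not change the score, which is what makes it suitable for comparing algorithm output with administrative boundaries.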

S4 Twitter demographics and unemployment rates

Different age groups are not equally represented in Twitter. Recent surveys (2012) in Spain suggest that most users of Twitter (86%) are 16 to 44 years old. Comparison of the percentage of users per age group with the total population within the same groups (see Figure 8) reveals that age groups above 35 years old are under-represented in Twitter. Thus, our Twitter data will be more revealing when trying to describe unemployment in age groups below 44 years old. This is indeed what we find when we build a linear model for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 24 can be fitted to a linear model with $R^2 = 0.62$, regression models for unemployment rates for ages between 25 and 44 have $R^2 = 0.52$, and for ages above 44 we get only $R^2 = 0.26$. Table 4 summarizes the results of the regression models of unemployment rates in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, Figure 8 shows the performance of the model for the different age groups; once again, the poor explanatory power of the Twitter variables for the unemployment rate at ages above 44 is evident.

S5 Properties of Twitter variables

Normalization and distributions

The values of the variables constructed from Twitter are heterogeneous, but only moderately so, as the histograms in Figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration $\tau_i$ and the misspellers rate $\varepsilon_i$ are defined as the number of users or misspellers per 100000 persons (population); activity variables $\nu_i$ are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, $\mu_i$, are given per 100000 tweets published in the geographical area.

Figure 7. From left to right and from top to bottom: FastGreedy, Walktrap, Infomap, MultiLevel, Label Propagation, and Leading Eigenvector communities on Twitter-based mobility transitions.

Correlation between variables

Variables are constructed to reflect the behavior of areas in the different dimensions: Twitter penetration, social or geographical diversity, activity through the day, and content. The correlations between variables do indeed show that variables within each dimension are strongly correlated with each other. As we can see in Figure 10, social and geographical diversities are highly correlated, an expected fact given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate $\tau_i$ and the fraction of misspellers $\varepsilon_i$ are strongly correlated with most of the variables.
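A simple way to screen for the strongly correlated pairs discussed above is to compute pairwise Pearson correlations and flag those above a cutoff. The sketch below is illustrative only: the variable names, the toy values, and the 0.8 threshold are assumptions, not quantities from the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def collinear_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    names = sorted(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged

# Toy values for five hypothetical areas.
columns = {
    "geo_diversity": [1.0, 2.0, 3.0, 4.0, 5.0],
    "social_diversity": [1.1, 2.0, 2.9, 4.2, 5.1],
    "morning_activity": [3.0, 1.0, 4.0, 1.0, 5.0],
}
pairs = collinear_pairs(columns)
print(pairs)
```

Pairs flagged this way are candidates for keeping only one representative per group in a regression, which is the strategy the text adopts.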

High correlation between variables might lead to collinearity effects [24] in the linear regression models, that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value, but its p-value is too high to consider it significant. To test this hypothesis, we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 shows the loadings of the different variables. The block structure shown in Figure 10 results in similar directions of the variables in the first components of the PCA. We observe several groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we take one variable from each of them for our regression models.

S6 Misspellers detection

In this work we consider only tweets in Spanish. Since several languages coexist in Spain, depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet contains a misspelling or not, we need to establish some patterns to search for in our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we do not consider the following cases:

• Lack of written accents. People tend to avoid writing accents when writing in a colloquial way.

• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish, the letter h is not pronounced) or replacing the letters qu by k. We understand that these mistakes can be motivated by the length limit of tweets and not by a real misspelling.

• In the same line, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.

• We do not consider either mistakes related to features of specific areas in Spain. For example, in the south the pronunciation of ce and se is the same, which produces a large number of mistakes when writing. However, since we want to extract objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

Likewise, we consider as real misspellings the following mistakes:

• Adding letters. For example, writing an h at the beginning of a word that starts with a vowel.

• Replacing the special cases mp, mb by the wrong spellings np, nb.

• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish because they have the same or a very close pronunciation.

• Confusing the verb haber with the periphrasis a ver.

• Separating a word into two, for instance writing the word conmigo as con migo.

Figure 8. Top: percentage of population in each age group (16-24, 25-34, 35-44, 45-54, 55-64) from the Spanish Census (dark bars) and from surveys about users in Twitter (light bars). Bottom: performance of the linear models (predicted versus observed unemployment, in %) for each of the age groups: all ages ($R^2 = 0.47$), < 24 ($R^2 = 0.62$), 25-44 ($R^2 = 0.52$), and > 44 ($R^2 = 0.26$).

Figure 9. Frequency plots P(x) for each variable constructed from Twitter: penetration rate, geographical and social diversity (two entropy measures each), activity (morning, afternoon, night), misspellers rate, and tweets mentioning job, employment, and unemployment.

This way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or a specific region of Spain. Thus, one can expect that this selection provides an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27055 (5.6% of the whole population).
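The selection rules above amount to matching tweets against a fixed list of accepted mistakes while ignoring everything else. A minimal sketch of that idea follows; the six entries are illustrative examples of the rule types only and are not taken from the authors' 617-item list, and the function names and sample tweets are hypothetical.

```python
import re

# Illustrative entries only (assumption): one or two examples per accepted rule.
MISSPELLINGS = {
    "hiba",      # added h (iba)
    "enpezar",   # np instead of mp (empezar)
    "canbio",    # nb instead of mb (cambio)
    "haver",     # v instead of b (haber)
    "llendo",    # ll instead of y (yendo)
    "con migo",  # word split in two (conmigo)
}

def normalize(text):
    """Lowercase, turn punctuation into spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(cleaned.split())

def has_misspelling(tweet):
    """True if any listed mistake appears as a whole word or phrase."""
    padded = " " + normalize(tweet) + " "
    return any(" " + m + " " in padded for m in MISSPELLINGS)

def count_misspellers(tweets_by_user):
    """Number of users with at least one tweet containing a listed mistake."""
    return sum(
        1 for tweets in tweets_by_user.values()
        if any(has_misspelling(t) for t in tweets)
    )

sample = {
    "user_a": ["Voy a enpezar ya!", "Buenos dias"],   # one listed mistake
    "user_b": ["Ven conmigo al cine"],                # correct spelling
    "user_c": ["no kiero ir", "ola que tal"],         # excluded cases: k for qu, dropped h
}
n_misspellers = count_misspellers(sample)
print(n_misspellers)
```

Matching on padded whole words avoids counting longer correct words that merely contain a listed string, and mistakes outside the list (missing accents, dropped letters) are ignored by construction.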

Figure 10. Top: correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations at 95% confidence. Bottom: projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.

We analyze whether misspellers have different Twitter usage behavior from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, it can be observed that misspellers tend to publish a larger number of tweets than those who did not make mistakes (14471 against 2372). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period, the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).
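A scaling exponent of this kind is commonly estimated as the slope of a straight-line fit in log-log space. The sketch below shows that estimate on synthetic data generated with a known exponent; it is an assumption that the authors used a comparable fit, and the function name and data are invented for illustration.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. the power-law exponent."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx

# Synthetic data: mean misspellings growing as tweets ** 0.33 (exponent chosen for the demo).
tweets = [50, 100, 200, 500]
misspellings = [2.0 * t ** 0.33 for t in tweets]
slope = loglog_slope(tweets, misspellings)
print(round(slope, 2))
```

A slope below 1 corresponds to sub-linear growth: doubling the number of tweets less than doubles the mean number of misspellings.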

Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspeller rate as a proxy for the educational level of the cities. A large number of previous works in the literature have revealed the relationship between the economic status and the educational level of geographical areas, and therefore it is natural to ask whether the observed misspellers rate is related to the economy, as measured by the unemployment rate. To test this hypothesis, we consider cities with more than 5000 inhabitants to avoid subsampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).

Figure 11. Number (red) and probability (blue) of observed misspellings given the number of tweets.

Figure 12. Explanatory power ($R^2$) of the linear regression model when fitted against the unemployment data for different months (October 2012 to June 2014). The gray (orange) area corresponds to the time window in which Twitter data is collected and variables are constructed.

S7 Time window and unemployment

In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment determined at different months within the time window in which the Twitter data was collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory value of the model when the linear regression is done for values of unemployment of different months before, during, and after the Twitter data time window. Although there is a small seasonal effect along the year, we see that the explanatory power remains around $R^2 = 0.6$, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that $R^2$ decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e., the last month in the time window used to collect the data.

S8 Demographics do not explain unemployment

Since unemployment rates are very large among young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found across the geographical areas. To test this we have built four linear models: the first (Youth model in Table 5) has the rate of young population as its only explanatory variable; the second and third are built from Twitter variables only, either all those considered in the main text (Twitter model (I)) or just those whose regression coefficients are statistically significant (Twitter model (II)); the fourth is fitted with all the variables (All variables model in Table 5). Table 5 summarizes the regression for each model. In terms of the variance explained, the Twitter variables alone ($R^2 = 0.64$) are almost three times more explanatory than the proportion of young people alone ($R^2 = 0.24$). Moreover, adding the rate of young population to the Twitter variables barely changes the fit ($R^2 = 0.65$ for the All variables model), so the rate of young population does not provide significant additional explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.

S9 Unemployment models for other geographical areas

While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we considered is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment when the variables are defined in those administrative areas, and relate it to the geographical communities detected and used in the main paper (see section 4). Not all the areas at the different administrative divisions are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1738 municipalities which have a Twitter population $\pi > 10$. Similarly, we only consider the 198 counties with $\pi > 100$. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, $R^2$ increases as the number of areas in the model gets smaller, but the description level of the model is very low for provinces, for example. The best performance (high $R^2$ and high geographical description level) is attained at the level of the detected communities.

S10 Relative importance of the variables

To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:

1. (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.

2. (lmg) Averaging over orderings, as proposed by Lindeman, Merenda and Gold.

3. (pmvd) The PMVD metric introduced by Feldman, which is also an average over orderings but with data-dependent weights.

4. (first) The univariate $R^2$ values from regression models with one variable only.

All these metrics are obtained using the relaimpo R package [12]. The results for the young unemployment model are shown in Figure 13, where we can see that the different methods yield similar relative importance for the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.

Figure 13. Relative importance of the variables (in percentage) in the unemployment model for different ways to calculate it (weight, first, lmg, pmvd).
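As an illustration of the "lmg" idea (averaging each variable's sequential $R^2$ gain over all orderings of the predictors), the following is a minimal sketch in plain Python. It is not the relaimpo implementation: the function names `solve`, `r_squared` and `lmg` are hypothetical, the regression is solved through the normal equations, and the brute-force loop over permutations is only practical for a handful of predictors.

```python
from itertools import permutations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[k]] for k, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def r_squared(columns, y):
    """R^2 of an ordinary least squares fit of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[k] for col in columns] for k in range(n)]
    p = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(p)] for a in range(p)]
    xty = [sum(row[a] * y[k] for k, row in enumerate(design)) for a in range(p)]
    beta = solve(xtx, xty)
    mean_y = sum(y) / n
    ss_res = sum(
        (y[k] - sum(bv * xv for bv, xv in zip(beta, design[k]))) ** 2 for k in range(n)
    )
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot


def lmg(predictors, y):
    """Average, over all orderings, the R^2 gained when each predictor enters the model."""
    names = list(predictors)
    shares = dict.fromkeys(names, 0.0)
    orderings = list(permutations(names))
    for order in orderings:
        used, previous = [], 0.0  # the intercept-only model has R^2 = 0
        for name in order:
            used.append(predictors[name])
            current = r_squared(used, y)
            shares[name] += current - previous
            previous = current
    return {name: total / len(orderings) for name, total in shares.items()}
```

By construction the shares returned by `lmg` add up to the $R^2$ of the full model, which is what allows them to be reported as percentages of the explained variance.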

References

[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.

[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.

[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.

[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.

[6] Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.

[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html

[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm

[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.

[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.

[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.

[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.

[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.

[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.

[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.

[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences - ISCIS 2005, pages 284–293. Springer, 2005.

[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.

[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.

[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.

[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.

[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.

[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.

[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.

[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.

Gravity Model

Parameter   Description                                 Spain
α1          Origin exponent                             0.477*** (0.002)
α2          Destination exponent                        0.478*** (0.002)
β           Distance exponent                           1.05*** (0.0035)
R²          Goodness of fit                             0.797
φ           Correlation between T_ij and T^grav_ij      0.826

Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.

NMI between G and Gp for different p

Algorithm   p = 0.01   0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09    0.10
FG          0.995      0.992   0.989   0.983   0.981   0.977   0.983   0.969   0.980   0.959
WT          0.954      0.959   0.950   0.954   0.945   0.948   0.947   0.935   0.926   0.931
IM          0.988      0.981   0.980   0.981   0.978   0.974   0.975   0.970   0.969   0.966
ML          0.994      0.978   0.979   0.983   0.948   0.934   0.972   0.952   0.973   0.947
LP          0.906      0.908   0.911   0.915   0.895   0.907   0.907   0.893   0.905   0.904
LE          0.960      0.957   0.956   0.859   0.910   0.892   0.908   0.858   0.885   0.884

Table 2. NMI measure comparing G and Gp (FG: FastGreedy, WT: Walktrap, IM: Infomap, ML: MultiLevel, LP: Label Propagation, LE: Leading Eigenvector).

Communities Stats

Algorithm   Mean size ⟨|N_i|⟩   Max size max|N_i|   Number of communities   Modularity   NMI P   NMI C
FG          309.696             1385                23                      0.726        0.712   0.590
WT          9.262               433                 769                     0.417        0.744   0.757
IM          21.011              143                 339                     0.758        0.770   0.831
ML          323.772             1132                22                      0.800        0.717   0.599
LP          22.052              750                 323                     0.732        0.749   0.761
LE          1017.571            5344                7                       0.381        0.264   0.205

Table 3. Statistics of the communities N_i returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.

                          All ages     < 24        25−44       > 44
(Intercept)               0.11****     0.10***     0.20***     0.20***
                          (0.02)       (0.03)      (0.03)      (0.035)
Penetration rate          3.23*        8.57***     6.28**      2.40
                          (1.41)       (2.22)      (2.17)      (2.77)
Geographical diversity    0.03         0.15***     0.08*       0.06
                          (0.02)       (0.04)      (0.04)      (0.05)
Social diversity          −0.03*       −0.03       −0.05*      −0.06*
                          (0.01)       (0.02)      (0.02)      (0.03)
Morning activity          −0.69*       −1.30**     −1.53***    −1.19*
                          (0.26)       (0.42)      (0.41)      (0.52)
Misspellers rate          11.56        31.51*      15.46       23.60
                          (8.13)       (12.78)     (12.48)     (15.94)
Employment mentions       −1.80        3.17        −9.94       2.71
                          (6.27)       (9.86)      (9.64)      (12.3)
R²                        0.47         0.64        0.55        0.29
Adj. R²                   0.44         0.62        0.52        0.26
***p < 0.001, **p < 0.01, *p < 0.05

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in the groups of less than 24 years, between 25 and 44 years, and above 44 years.

                          All variables   Youth model   Twitter model (I)   Twitter model (II)
(Intercept)               0.06            −0.02         0.10***             0.09***
                          (0.03)          (0.03)        (0.03)              (0.027)
Young pop. rate           0.66*           2.20***
                          (0.30)          (0.35)
Penetration rate          8.20***                       8.57***             8.62***
                          (2.25)                        (2.22)              (2.21)
Geographical diversity    0.14***                       0.15***             0.12***
                          (0.04)                        (0.04)              (0.03)
Social diversity          −0.02                         −0.03
                          (0.02)                        (0.02)
Morning activity          −1.42***                      −1.30**             −1.28**
                          (0.41)                        (0.42)              (0.41)
Misspellers rate          23.95                         31.51*              32.28*
                          (13.09)                       (12.78)             (12.71)
Employment mentions       0.34                          3.17
                          (9.81)                        (9.86)
R²                        0.65            0.24          0.64                0.63
Adj. R²                   0.63            0.24          0.62                0.62
***p < 0.001, **p < 0.01, *p < 0.05

Table 5. Regression table for the different statistical models. The All variables model includes both the Twitter variables and the rate of young population. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).

                          Communities   Municipalities   Counties    Provinces
(Intercept)               0.10***       0.16***          0.11***     0.11*
                          (0.03)        (0.01)           (0.03)      (0.05)
Penetration rate          8.57***       4.01***          9.12***     10.47***
                          (2.22)        (0.59)           (1.81)      (1.97)
Geographical diversity    0.15***       0.02             0.12***     0.08
                          (0.04)        (0.01)           (0.03)      (0.07)
Social diversity          −0.03         −0.01            −0.01       −0.03
                          (0.02)        (0.01)           (0.02)      (0.07)
Morning activity          −1.30**       −1.16***         −1.49***    −1.03
                          (0.42)        (0.14)           (0.39)      (0.88)
Misspellers rate          31.51*        14.40***         14.09
                          (12.78)       (2.51)           (10.02)
Employment mentions       3.17          −0.71            2.41        −3.17
                          (9.86)        (0.89)           (8.86)      (12.29)
Number of points          128           1738             198         50
R²                        0.64          0.22             0.55        0.65
Adj. R²                   0.62          0.21             0.54        0.61
***p < 0.001, **p < 0.01, *p < 0.05

Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.


[8] Pan, W., Ghoshal, G., Krumme, C., Cebrian, M., & Pentland, A. (2013). Urban characteristics attributable to density-driven tie formation. Nature Communications 4.

[9] Gonzalez, M. C., Hidalgo, C. A., & Barabasi, A.-L. (2008). Understanding individual human mobility patterns. Nature 453, 779–782.

[10] Calabrese, F., Diao, M., Lorenzo, G. D., Jr, J. F., & Ratti, C. (2013). Understanding individual mobility patterns from urban sensing data: A mobile phone trace example. Transportation Research Part C: Emerging Technologies 26, 301–313.

[11] Cheng, Z., Caverlee, J., Lee, K., & Sui, D. Z. (2011). Exploring Millions of Footprints in Location Sharing Services. (AAAI, Menlo Park, CA, USA).

[12] Cho, E., Myers, S. A., & Leskovec, J. (2011). Friendship and mobility: user movement in location-based social networks. KDD '11 (ACM, New York, NY, USA), pp. 1082–1090.

[13] Sun, L., Jin, J. G., Axhausen, K. W., Lee, D.-H., & Cebrian, M. (2014). Quantifying long-term evolution of intra-urban spatial interactions. arXiv preprint arXiv:1407.0145.

[14] Eagle, N., Macy, M., & Claxton, R. (2010). Network diversity and economic development. Science 328, 1029–1031.

[15] Henrich, J., Boyd, R., Bowles, S., Camerer, C., Fehr, E., Gintis, H., & McElreath, R. (2001). In search of homo economicus: behavioral experiments in 15 small-scale societies. American Economic Review, pp. 73–78.

[16] Krieger, N., Williams, D. R., & Moss, N. E. (1997). Measuring social class in US public health research: concepts, methodologies, and guidelines. Annual Review of Public Health 18, 341–378.

[17] Groves, R. M., Fowler Jr, F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2013). Survey methodology. (John Wiley & Sons).

[18] Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M., et al. (2009). Life in the network: the coming age of computational social science. Science (New York, N.Y.) 323, 721.

[19] Smith, C., Quercia, D., & Capra, L. (2013). Finger on the pulse: identifying deprivation using transit flow analysis. (ACM), pp. 683–692.

[20] Soto, V., Frias-Martinez, V., Virseda, J., & Frias-Martinez, E. (2011). Prediction of socioeconomic levels using cell phone records. UMAP'11 (Springer-Verlag, Berlin, Heidelberg), pp. 377–388.

[21] Preis, T., Moat, H. S., Stanley, H. E., & Bishop, S. R. (2012). Quantifying the advantage of looking forward. Scientific Reports 2.

[22] Antenucci, D., Cafarella, M., Levenstein, M. C., Re, C., & Shapiro, M. D. (2014). Using social media to measure labor market flows. (National Bureau of Economic Research), Technical report.

[23] Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P., & Ratti, C. (2013). Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680.

[24] Lathia, N., Quercia, D., & Crowcroft, J. (2012). In Pervasive Computing. (Springer), pp. 91–98.

[25] Frias-Martinez, V., Virseda, J., & Frias-Martinez, E. (2010). Socio-economic levels and human mobility.

[26] Gutierrez, T., Krings, G., & Blondel, V. D. (2013). Evaluating socio-economic state of a country analyzing airtime credit and mobile phone datasets. arXiv preprint arXiv:1309.4496.

[27] Smith, C., Mashhadi, A., & Capra, L. (2013). Ubiquitous sensing for mapping poverty in developing countries. Paper submitted to the Orange D4D Challenge.

[28] Erlander, S., & Stewart, N. F. (1990). The gravity model in transportation analysis: theory and extensions. (VSP), Vol. 3.

[29] Simini, F., Gonzalez, M. C., Maritan, A., & Barabasi, A.-L. (2012). A universal model for mobility and migration patterns. Nature 484, 96–100.

[30] Lenormand, M., Picornell, M., Cantu-Ros, O. G., Tugores, A., Louail, T., Herranz, R., Barthelemy, M., Frias-Martinez, E., & Ramasco, J. J. (2014). Cross-checking different sources of mobility information. arXiv preprint arXiv:1404.0333.

[31] Barthelemy, M. (2011). Spatial networks. Physics Reports 499, 1–101.

[32] Expert, P., Evans, T. S., Blondel, V. D., & Lambiotte, R. (2011). Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences 108, 7663–7668.

[33] Sobolevsky, S., Szell, M., Campari, R., Couronne, T., Smoreda, Z., & Ratti, C. (2013). Delineating geographical regions with networks of human interactions in an extensive set of countries. PloS one 8, e81707.

[34] Rosvall, M., & Bergstrom, C. T. (2008). Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105, 1118–1123.

[35] Danon, L., Diaz-Guilera, A., Duch, J., & Arenas, A. (2005). Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment 2005, P09008.

[36] ADigital (2013). Uso de Twitter en España 2012. [Online; accessed 1-November-2014].

[37] Davenport, J. R., & DeLine, R. (2014). The readability of tweets and their geographic correlation with education. arXiv preprint arXiv:1401.6058.

[38] Schneider, F., Buehn, A., & Montenegro, C. E. (2011). Shadow economies all over the world: New estimates for 162 countries from 1999 to 2007. Handbook on the shadow economy, pp. 9–77.

[39] Rutherford, A., Cebrian, M., Dsouza, S., Moro, E., Pentland, A., & Rahwan, I. (2013). Limits of social mobilization. Proceedings of the National Academy of Sciences 110, 6281–6286.

Supporting Information for
Social media fingerprints of unemployment
Alejandro Llorente, Manuel García-Herranz, Manuel Cebrian, and Esteban Moro

S1 The dataset

Twitter provides an extremely rich and publicly available data set of user interactions, information flows and, thanks to the geo-location of tweets, user movements. Nevertheless, the representativeness of geo-located Twitter as a global source of mobility data has so far received sparse attention. In this sense, while [13] presents a promising and extensive study of global country-to-country movements (mostly driven by tourism), within-country human flows (comprising not only internal tourism but also, to a greater extent than country-to-country travel, visiting and commuting) still need further investigation. Therefore, throughout this work we compare our findings using geo-located Twitter with similar studies using commuting surveys.

For the Twitter analysis we consider almost 146 million geo-located Twitter messages (tweets) collected through the public API provided by Twitter for the continental part of Spain from 29th November 2012 to 10th April 2013. In this dataset we consider that there has been a trip from place l to place k if a user has tweeted in place l and place k consecutively. We only keep those transitions in which the first and the second tweet are dated on the same day. We filter the trips database to avoid unrealistic transitions and keep only trips with a geographical displacement larger than 1 km (see Methods section). By this method, 138 million trips from 167,376 different users are considered in our work.

From those trips we construct the mobility flow $T_{ij}$ between municipalities, which measures the number of trips in our database whose origin lies within the boundaries of city $i$ and whose destination lies within those of city $j$.
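The trip definition and the flow counts can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the paper: each tweet is assumed to be a tuple (user, timestamp, latitude, longitude, municipality) with the municipality already assigned, displacement is measured with the haversine formula, and the names `haversine_km`, `extract_trips` and `mobility_flows` are hypothetical.

```python
from collections import Counter
from datetime import datetime
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def extract_trips(tweets, min_km=1.0):
    """Return (origin, destination) municipality pairs from consecutive tweets.

    A pair of consecutive tweets by the same user counts as a trip only if
    both tweets are dated on the same day and are more than min_km apart.
    """
    by_user = {}
    for tw in tweets:
        by_user.setdefault(tw[0], []).append(tw)
    trips = []
    for user_tweets in by_user.values():
        user_tweets.sort(key=lambda tw: tw[1])  # chronological order
        for a, b in zip(user_tweets, user_tweets[1:]):
            same_day = a[1].date() == b[1].date()
            far_enough = haversine_km(a[2], a[3], b[2], b[3]) > min_km
            if same_day and far_enough:
                trips.append((a[4], b[4]))
    return trips


def mobility_flows(trips):
    """Count trips per (origin, destination) pair, i.e. the flow matrix T_ij."""
    return Counter(trips)


# Illustrative tweets (made-up data): one same-day move of about 11 km.
sample = [
    ("u1", datetime(2013, 1, 5, 9, 0), 40.4168, -3.7038, "Madrid"),
    ("u1", datetime(2013, 1, 5, 18, 0), 40.3281, -3.7635, "Leganes"),
]
flows = mobility_flows(extract_trips(sample))
```

Storing the flows as a counter keyed by (origin, destination) keeps the matrix sparse, which matters when most pairs of municipalities exchange no trips at all.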

We also consider population and economic information about the municipalities from the Spanish Census (2011) [8] and unemployment figures from the Public Service of Employment (Servicio Público de Empleo Estatal, SEPE) [7]. In the latter case, registered unemployment (in number of persons) is given for each Spanish municipality by gender, age and month. To get unemployment rates we divide registered unemployment by the total workforce in the municipality, estimated as the number of people aged between 16 and 65 years.

S2 Twitter as mobility proxy

Considering all of the available transitions in our database, one can compute the distance between origin and destination, the elapsed time of the transition, and the number of trips per user, among many other statistics. All of them seem to follow a power-law distribution with a cutoff due to the finite spatial size of Spain and the constraint of considering only transitions where the origin and destination check-ins are made on the same day. Focusing on the log-linear part of the distributions, self-similar behaviors arise when Twitter-based mobility is analyzed (see Figure 5).

Twitter-based inter-city flows can be well modelled by means of the Gravity Law, which is one of the most widespread methods to represent human mobility [1, 19], with applications in many fields like urban planning [23], traffic engineering [4] or transportation problems [9]. The Gravity Law is also the solution to the problem of maximizing the entropy of the particle distribution among all the possible trips using statistical mechanics techniques [2, 22]. Recently it has also been used as a model for human mobility based on cell phone traces [10, 20, 21] and social media data at a global scale [13] and at the inter-city level [14].

The Gravity Model for human mobility assumes that the flows between cities can be explained by the expression

$$T^{\mathrm{grav}}_{ij} = \frac{P_i^{\alpha_1} P_j^{\alpha_2}}{d_{ij}^{\beta}} \qquad (2)$$

where $T^{\mathrm{grav}}_{ij}$ is the flow, in number of people, between cities $i$ and $j$, $d_{ij}$ is the geographical distance between them, and $P_i$ and $P_j$ are the populations of the two cities.

Given the data, we can obtain the parameters of the model by weighted least squares minimization,

$$\alpha_1^*, \alpha_2^*, \beta^* = \arg\min_{\alpha_1, \alpha_2, \beta} \frac{1}{N} \sum_{ij} w_{ij} \left( T_{ij} - T^{\mathrm{grav}}_{ij} \right)^2 \qquad (3)$$

where $N$ is the total number of connections in the mobility graph and $w_{ij}$ is a weight that increases with the number of observed transitions between $i$ and $j$. In particular, we find that taking $w_{ij} = T_{ij}^{1/3}$ gives the best performance in the model.
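To make the weighted objective concrete, the sketch below evaluates the gravity expression and the weighted squared error for a candidate parameter triple, and picks the best triple from a coarse grid. It is only an illustration: the grid search stands in for whatever optimizer was actually used, the weight exponent is exposed as a parameter whose default value is an assumption, and the names `gravity_flow`, `weighted_loss` and `grid_search` are hypothetical.

```python
from itertools import product


def gravity_flow(p_i, p_j, d_ij, a1, a2, beta):
    """Gravity-model flow: P_i^a1 * P_j^a2 / d_ij^beta."""
    return (p_i ** a1) * (p_j ** a2) / (d_ij ** beta)


def weighted_loss(flows, pop, dist, params, weight_exp=1 / 3):
    """Mean weighted squared error between observed and modelled flows.

    flows: {(i, j): observed flow T_ij}, pop: {i: population},
    dist: {(i, j): distance}. Each term is weighted by T_ij ** weight_exp;
    the default exponent is an assumption made for this example.
    """
    a1, a2, beta = params
    total = 0.0
    for (i, j), t in flows.items():
        pred = gravity_flow(pop[i], pop[j], dist[(i, j)], a1, a2, beta)
        total += (t ** weight_exp) * (t - pred) ** 2
    return total / len(flows)


def grid_search(flows, pop, dist, grid):
    """Return the (a1, a2, beta) triple on the grid with the smallest loss."""
    return min(
        product(grid, grid, grid),
        key=lambda params: weighted_loss(flows, pop, dist, params),
    )
```

A grid keeps the example dependency-free; with real data one would more likely refine the best grid point with a continuous optimizer, since the fitted exponents need not fall on a coarse grid.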

In our case this model fits the inter-city mobility based on Twitter GPS check-ins quite accurately (see Table 1). Even though $T_{ij}$ is not necessarily symmetric, the exponents of the two populations are similar, indicating that we observe similar flows in both directions between $i$ and $j$.

Figure 5. Probability distributions (density) of different properties of daily trips in the Twitter dataset: trip distance (km), number of trips, and elapsed time (secs). Dashed lines correspond to power-law fits with exponents −1.67, −2.43 and −0.62, respectively.

S3 Community structures in inter-city mobility graph

Typically, complex networks exhibit community structure, that is, there are subsets of nodes that are more densely connected among themselves than to the rest of the nodes. In mobility networks, whose nodes correspond to geographical areas, these communities are interpreted as zones with high common activity and tend to be constrained by geographical and political barriers. We check whether this is also observed in our dataset by running six state-of-the-art community detection algorithms: FastGreedy [5], Walktrap [16], Infomap [18], MultiLevel [3], Label Propagation [17] and Leading Eigenvector [15]. These six algorithms yield different community structures in terms of number of communities, average community size or modularity (see Table 3). The members (municipalities) of the resulting communities are spatially connected except in a few cases, as Figure 7 shows. We test the statistical robustness of the obtained communities by randomly removing a proportion $p$ of the original links and running the algorithms on the new graph $G_p$. We consider communities robust when those obtained for the original network $G$ and for $G_p$ are highly similar. To compare two arbitrary community memberships we use the Normalized Mutual Information (NMI) described in [6], which returns 0 when the two memberships are totally different and 1 when they are equal. We compute the NMI for each chosen algorithm run on $G$ and $G_p$ for $p$ between 1% and 10%, concluding that the obtained community structures are robust, because they are not broken when some randomly chosen links are removed (see Table 2).
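The comparison step of this robustness check can be sketched in a few lines. The function below computes the NMI between two membership vectors as twice the mutual information divided by the sum of the two entropies, which is the standard form of the measure; the function name `nmi` and the convention of returning 1.0 when both partitions consist of a single community are choices made for this example. The link removal and the community detection themselves are not shown.

```python
from collections import Counter
from math import log


def nmi(labels_a, labels_b):
    """Normalized mutual information between two community memberships.

    labels_a[k] and labels_b[k] are the community labels assigned to node k
    by the two partitions being compared.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # n times the mutual information between the two labelings
    mutual = sum(
        n_ab * log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )
    # minus n times the sum of the two entropies
    denom = sum(c * log(c / n) for c in count_a.values()) + sum(
        c * log(c / n) for c in count_b.values()
    )
    if denom == 0:
        return 1.0  # both partitions put every node in one community
    return -2.0 * mutual / denom
```

Because the measure depends only on how nodes are grouped, relabeling the communities of either partition leaves the result unchanged, which is what makes it suitable for comparing the outputs of different runs.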

Figure 6. Penetration rates, as a function of population, for both cities and detected communities.

As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether our mobility-based communities are related to any of these barriers. In Spain there are different territorial divisions for administrative purposes. In this work we consider two of them: provinces, defined in the 1978 Constitution, which are 48 heterogeneous aggregations of municipalities; and counties (comarcas in Spanish terminology), which are traditional aggregations of municipalities mainly based on Spanish orography (rivers, valleys, ridges, etc.), some of which are composed of municipalities from different provinces. We again use the NMI to compare the community structure given by the algorithms with the administrative limits. Except for the Leading Eigenvector algorithm, the methods return communities that are closely related to provinces (NMI ≈ 0.7), whereas higher variability is observed for the county limits. In this last case, the algorithm whose communities agree best with county limits is Infomap, with NMI ≈ 0.83. Therefore, Twitter-based mobility summarizes the inter-city flows, and these flows are influenced by geographical and political barriers.

S4 Twitter demographics and unemployment rates

Different age groups are not equally represented on Twitter. Recent surveys (2012) in Spain suggest that most (86%) Twitter users are 16 to 44 years old. Comparison of the percentage of users per age group with the total population within the same groups (see Figure 8) reveals that age groups above 35 years old are under-represented on Twitter. Thus our Twitter data will be more revealing when trying to describe unemployment in age groups below 44 years old. This is indeed what we find when we build a linear model for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 24 can be fitted to a linear model with adjusted $R^2 = 0.62$, regression models for unemployment rates for ages between 25 and 44 have adjusted $R^2 = 0.52$, and for ages above 44 we get only adjusted $R^2 = 0.26$. Table 4 summarizes the results for the regression models of unemployment rates in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, in Figure 8 we can see the performance of the model for the different age groups; once again, the poor explanatory power of the Twitter variables for the unemployment rate at ages above 44 years old is apparent.

S5 Properties of Twitter variables

Normalization and distributions

The heterogeneity of the values of the variables constructed from Twitter is noticeable but moderate, as the histograms in Figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration rate $\tau_i$ and the misspellers rate $\varepsilon_i$ are defined as the number of users or misspellers per 100,000 persons (population); activity variables $\nu_i$ are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, $\mu_i$, are given per 100,000 tweets published in the geographical area.

Figure 7. From left to right and from top to bottom: FastGreedy, Walktrap, Infomap, MultiLevel, Label Propagation and Leading Eigenvector communities on Twitter-based mobility transitions.

Correlation between variables

Variables are constructed to reflect the behavior of areas in the different dimensions of Twitter penetration, social or geographical diversity, activity through the day, and content. The correlation between variables does indeed show that variables within each dimension are strongly correlated with each other. As we can see in Figure 10, social and geographical diversities are highly correlated, an expected fact given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate $\tau_i$ and the fraction of misspellers $\varepsilon_i$ have a strong correlation with most of the variables.

High correlation between variables might lead to collinearity effects [24] in the linear regression models; that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value, but its p-value is too high to consider it significant. To test this hypothesis we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 shows the loadings of the different variables. The block structure shown in Figure 10 results in similar directions of the variables in the first components of the PCA. We observe several groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we have taken one variable from each of them for our regression models.
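A simple way to screen for the kind of collinearity discussed above is to list the pairs of variables whose Pearson correlation exceeds a threshold. The sketch below does only that; it is a correlation screen rather than the PCA used in this section, and the function names `pearson` and `collinear_pairs` as well as the default threshold of 0.8 are assumptions made for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def collinear_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair of columns with |r| >= threshold.

    columns maps a variable name to its list of values, one per geographical area.
    """
    names = sorted(columns)
    pairs = []
    for k, name_a in enumerate(names):
        for name_b in names[k + 1:]:
            r = pearson(columns[name_a], columns[name_b])
            if abs(r) >= threshold:
                pairs.append((name_a, name_b, r))
    return pairs
```

Pairs flagged in this way are candidates for sharing explained variance in a regression, which is the situation in which a variable with real predictive value can end up with a non-significant coefficient.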

Figure 9. Frequency plots for each variable constructed from Twitter: tweets mentioning unemployment, employment and job; penetration rate; social and geographical entropies; misspellers rate; and activity during the night, morning and afternoon.

S6 Misspellers detection

In this work we consider only tweets in Spanish. Since several languages coexist in Spain, depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet contains a misspelling or not, we need to establish some patterns to look for in our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we do not consider the following cases:

• Lack of written accents. People tend to avoid writing accents when talking in a colloquial way.

• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters qu by k. We understand that these mistakes can be motivated by the length limit of tweets and not by a real misspelling.

• Along the same lines, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.

• We do not consider either mistakes related to features of specific areas in Spain. For example, in the south the pronunciation of ce and se is the same, which produces a large number of mistakes when writing. However, since we want to draw objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

Likewise, we consider as real misspellings the following mistakes:

• Adding letters. For example, writing an h at the beginning of a word that starts with a vowel.

• Replacing the special cases mp, mb by the wrong spellings np, nb.

• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish because they have the same or a very close pronunciation.

Social media fingerprints of unemployment mdash 1419

0

10

20

30

16minus24 25minus34 35minus44 45minus54 55minus64r

f

000

025

050

075

100llena

Age group

Perc

enta

ge o

f pop

ulat

ion Census

Twitter

10

15

20

25

10 15 20 25x

y

10

15

20

25

10 15 20 25x

y

5

10

15

20

5 10 15 20x

y

5

10

15

2025

5 10 15 20 25x

y

All ages lt 24

25-44 gt 44

Observed Unemployment () Observed Unemployment ()

Observed Unemployment ()Observed Unemployment ()

Pred

icte

d Un

empl

oym

ent (

)

Pred

icte

d Un

empl

oym

ent (

)

Pred

icte

d Un

empl

oym

ent (

)

Pred

icte

d Un

empl

oym

ent (

)

R2 = 047 R2 = 062

R2 = 052 R2 = 026

Figure 8 Top Percentage of population in each age groupfrom the Spanish Census (dark bars) and surveys about usersin Twitter (light bars) Bottom performance of the linearmodels for each of the age groups

bull Confusing the verb haber with the periphrasis a ver

bull Separating a word into two ones for instance writing theword conmigo as con migo

In this way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus, one can expect this selection to provide an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27,055 (5.6% of the whole population).
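The detection rule described above can be sketched as a lookup against a list of known mistakes. This is a minimal illustration, not the authors' implementation: the four patterns below are hypothetical examples of the accepted categories (the 617-entry list itself is not reproduced here), and the function names are assumptions. Accents are stripped before matching, since missing accents are deliberately not counted as misspellings.

```python
import re

# Hypothetical excerpt of a misspelling list (assumption, not the paper's list):
# "tanbien" (nb for mb), "hiba" (added h), "haver" (b/v mix-up), "con migo" (split word).
PATTERNS = ["tanbien", "hiba", "haver", "con migo"]

# Map accented vowels to plain ones so that missing accents are never penalised.
ACCENTS = str.maketrans("áéíóúü", "aeiouu")


def normalize(text):
    """Lower-case the text and drop written accents (the letter ñ is kept)."""
    return text.lower().translate(ACCENTS)


def has_misspelling(tweet, patterns=PATTERNS):
    """True if the tweet contains any listed mistake as a whole word or phrase."""
    text = normalize(tweet)
    return any(re.search(r"\b" + re.escape(p) + r"\b", text) for p in patterns)


def misspellers(tweets_by_user, patterns=PATTERNS):
    """Set of users who wrote at least one tweet containing a listed mistake."""
    return {
        user
        for user, tweets in tweets_by_user.items()
        if any(has_misspelling(t, patterns) for t in tweets)
    }


# Toy input (hypothetical users and tweets).
tweets_by_user = {
    "ana": ["Yo tanbién voy"],
    "luis": ["Tambien voy, sin acento"],
    "eva": ["ven con migo"],
}
found = misspellers(tweets_by_user)
```

In this toy input only "ana" and "eva" are flagged; "luis" omits an accent, which the rules above do not count as a misspelling.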

Figure 10. Top: Correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to correlations that are not statistically significant at the 95% confidence level. Bottom: Projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.

We analyze whether misspellers use Twitter differently from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, we observe that misspellers tend to publish a larger number of tweets than those who did not make mistakes (144.71 against 23.72). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).
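A scaling exponent of this kind is commonly estimated as the slope of a least-squares line in log-log space. The sketch below illustrates that approach on synthetic data; it is an assumption about how such an exponent can be obtained, not a description of the authors' procedure, and the function name is hypothetical.

```python
from math import log


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. the power-law exponent."""
    lx = [log(x) for x in xs]
    ly = [log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return sxy / sxx


# Synthetic data following y = 2 * x^0.33 exactly (hypothetical, for illustration).
tweets = [30, 50, 100, 200, 500]
misspellings = [2 * t ** 0.33 for t in tweets]
exponent = loglog_slope(tweets, misspellings)
```

An exponent below 1 means that doubling the number of tweets less than doubles the mean number of misspellings.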

Since we have observed a segmentation of the Twitter population based on how accurately they write, we consider the misspellers rate as a proxy for the educational level of the cities. A large number of previous works have revealed the relationship between the economic status and the educational level of geographical areas, and therefore it is natural to ask whether the observed misspellers rate is related to the economy, as measured by the unemployment rate. To test this hypothesis we consider cities with more than 5,000 inhabitants to avoid subsampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).

Figure 11. Number N(miss|tweets) (red) and probability P(miss|tweets) (blue) of observed misspellings given the number of tweets.

Figure 12. Explanatory power (R2) of the linear regression model when fitted against the unemployment data for different months, from October 2012 to June 2014. The gray (orange) area corresponds to the time window in which the Twitter data was collected and the variables were constructed.

S7 Time window and unemployment

In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment measured at different months within the time window in which the Twitter data was collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory power of the model when the linear regression is done for values of unemployment of different months before, during, and after the Twitter data time window. Although there is a small seasonal effect along the year, we see that the explanatory power remains around R2 = 0.6, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that R2 decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.

S8 Demographics do not explain unemployment

Since unemployment rates are very large for the group of young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found in the geographical areas. To test this we have built four linear models: the first one (named Youth model in Table 5) has the rate of young population as its only explanatory variable; the second and third are built on Twitter variables only, either all those considered in the main text (named Twitter model (I)) or just those whose regression coefficients are statistically significant (Twitter model (II)); the fourth is fitted with all the variables (named All variables model in Table 5). Table 5 shows the summary of the regression for each model. Focusing on the variance explained by the models in terms of R2, considering all Twitter variables is almost three times more explanatory than considering only the proportion of young people (R2 = 0.64 against R2 = 0.24). On the other hand, comparing R2 for the Twitter model with that of the All variables and Youth models shows that the rate of young population does not provide significant additional explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.

S9 Unemployment models for other geographical areas

While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8,200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we considered is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment for the variables defined in those administrative areas and relate it to the geographical communities detected and used in the main paper (see section 4). Not all the areas at the different administrative divisions are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1,738 municipalities which have a Twitter population π > 10. Similarly, we only consider the 198 counties with π > 100. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, R2 increases as the number of areas in the model gets smaller, but the level of geographical detail of the model is very low for provinces, for example. The best performance (high R2 and high level of geographical detail) is attained at the level of the detected communities.

S10 Relative importance of the variables

To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:

1. (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.

2. (lmg) Averaging over orderings, as proposed by Lindeman, Merenda and Gold.

3. (pmvd) The PMVD metric introduced by Feldman, which is an average over orderings as well, but with data-dependent weights.

4. (first) The univariate R2 values from regression models with one variable only.

Figure 13. Relative importance of the variables (in percentage) in the unemployment model for the different ways to calculate it (weight, first, lmg, pmvd).

All these metrics are obtained using the relaimpo R package [12]. The results for the young unemployment model are shown in Figure 13, where we can see that the different methods yield similar relative importance of the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.
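Of the four metrics, the "first" one is the simplest to reproduce by hand: for a single predictor, the univariate R2 equals the squared Pearson correlation with the response. The sketch below computes it and rescales the values to percentages. It is a plain-Python illustration under that assumption, with hypothetical function names and toy data, and it does not call or replicate the relaimpo package.

```python
def univariate_r2(x, y):
    """R^2 of a one-variable linear regression of y on x (squared correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def first_importance(predictors, y):
    """Univariate R^2 of each predictor, rescaled so that the shares sum to 100."""
    r2 = {name: univariate_r2(x, y) for name, x in predictors.items()}
    total = sum(r2.values())
    return {name: 100 * value / total for name, value in r2.items()}


# Toy data (hypothetical): "a" is perfectly correlated with y, "b" is uncorrelated.
y = [1.0, 2.0, 3.0, 4.0]
predictors = {"a": [2.0, 4.0, 6.0, 8.0], "b": [1.0, -1.0, -1.0, 1.0]}
shares = first_importance(predictors, y)
```

The lmg and pmvd metrics additionally average over orderings of the predictors and are not covered by this sketch.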

References

[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.

[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.

[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.

[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.

[6] Leon Danon, Albert Diaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.

[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html

[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm

[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.

[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.

[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.

[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.

[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.

[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.

[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.

[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences-ISCIS 2005, pages 284–293. Springer, 2005.

[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.

[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.

[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.

[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.

[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.

[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.

[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.

[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.

Gravity Model
Parameter   Description                                Spain
α1          Origin exponent                            0.477*** (0.002)
α2          Destination exponent                       0.478*** (0.002)
β           Distance exponent                          1.05*** (0.0035)
R2          Goodness of fit                            0.797
φ           Correlation between Tij and Tgrav_ij       0.826

Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.

NMI between G and Gp for different p
Algorithm   p = 0.01   0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09    0.1
FG          0.995      0.992   0.989   0.983   0.981   0.977   0.983   0.969   0.980   0.959
WT          0.954      0.959   0.950   0.954   0.945   0.948   0.947   0.935   0.926   0.931
IM          0.988      0.981   0.980   0.981   0.978   0.974   0.975   0.970   0.969   0.966
ML          0.994      0.978   0.979   0.983   0.948   0.934   0.972   0.952   0.973   0.947
LP          0.906      0.908   0.911   0.915   0.895   0.907   0.907   0.893   0.905   0.904
LE          0.960      0.957   0.956   0.859   0.910   0.892   0.908   0.858   0.885   0.884

Table 2. NMI measure comparing G and Gp.

Communities Stats
Algorithm   ⟨|Ni|⟩i     max|Ni|   Number of communities   Modularity   NMI P   NMI C
FG          309.696     1385      23                      0.726        0.712   0.590
WT          9.262       433       769                     0.417        0.744   0.757
IM          21.011      143       339                     0.758        0.770   0.831
ML          323.772     1132      22                      0.800        0.717   0.599
LP          22.052      750       323                     0.732        0.749   0.761
LE          1017.571    5344      7                       0.381        0.264   0.205

Table 3. Statistics of the communities Ni returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.

                          All ages    < 24        25−44       > 44
(Intercept)               0.11***     0.10***     0.20***     0.20***
                          (0.02)      (0.03)      (0.03)      (0.035)
Penetration rate          3.23*       8.57***     6.28**      2.40
                          (1.41)      (2.22)      (2.17)      (2.77)
Geographical diversity    0.03        0.15***     0.08*       0.06
                          (0.02)      (0.04)      (0.04)      (0.05)
Social diversity          −0.03*      −0.03       −0.05*      −0.06*
                          (0.01)      (0.02)      (0.02)      (0.03)
Morning activity          −0.69*      −1.30**     −1.53***    −1.19*
                          (0.26)      (0.42)      (0.41)      (0.52)
Misspellers rate          11.56       31.51*      15.46       23.60
                          (8.13)      (12.78)     (12.48)     (15.94)
Employment mentions       −1.80       3.17        −9.94       2.71
                          (6.27)      (9.86)      (9.64)      (12.3)
R2                        0.47        0.64        0.55        0.29
Adj. R2                   0.44        0.62        0.52        0.26
***p < 0.001, **p < 0.01, *p < 0.05

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in the groups of less than 24 years, between 25 and 44 years, and above 44 years.

                          All variables   Youth model   Twitter model (I)   Twitter model (II)
(Intercept)               0.06            −0.02         0.10***             0.09***
                          (0.03)          (0.03)        (0.03)              (0.027)
Young pop. rate           0.66*           2.20***
                          (0.30)          (0.35)
Penetration rate          8.20***                       8.57***             8.62***
                          (2.25)                        (2.22)              (2.21)
Geographical diversity    0.14***                       0.15***             0.12***
                          (0.04)                        (0.04)              (0.03)
Social diversity          −0.02                         −0.03
                          (0.02)                        (0.02)
Morning activity          −1.42***                      −1.30**             −1.28**
                          (0.41)                        (0.42)              (0.41)
Misspellers rate          23.95                         31.51*              32.28*
                          (13.09)                       (12.78)             (12.71)
Employment mentions       0.34                          3.17
                          (9.81)                        (9.86)
R2                        0.65            0.24          0.64                0.63
Adj. R2                   0.63            0.24          0.62                0.62
***p < 0.001, **p < 0.01, *p < 0.05

Table 5. Regression table for the different statistical models. The All variables model includes both the Twitter variables and the rate of young population. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).

                          Communities   Municipalities   Counties    Provinces
(Intercept)               0.10***       0.16***          0.11***     0.11*
                          (0.03)        (0.01)           (0.03)      (0.05)
Penetration rate          8.57***       4.01***          9.12***     10.47***
                          (2.22)        (0.59)           (1.81)      (1.97)
Geographical diversity    0.15***       0.02             0.12***     0.08
                          (0.04)        (0.01)           (0.03)      (0.07)
Social diversity          −0.03         −0.01            −0.01       −0.03
                          (0.02)        (0.01)           (0.02)      (0.07)
Morning activity          −1.30**       −1.16***         −1.49***    −1.03
                          (0.42)        (0.14)           (0.39)      (0.88)
Misspellers rate          31.51*        14.40***         14.09
                          (12.78)       (2.51)           (10.02)
Employment mentions       3.17          −0.71            2.41        −3.17
                          (9.86)        (0.89)           (8.86)      (12.29)
Number of points          128           1738             198         50
R2                        0.64          0.22             0.55        0.65
Adj. R2                   0.62          0.21             0.54        0.61
***p < 0.001, **p < 0.01, *p < 0.05

Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.


Supporting Information for
Social media fingerprints of unemployment
Alejandro Llorente, Manuel García-Herranz, Manuel Cebrian and Esteban Moro

S1 The dataset

Twitter provides an extremely rich and publicly available data set of user interactions, information flows and, thanks to the geo-location of tweets, user movements. Nevertheless, the representativeness of geo-located Twitter as a global source of mobility data has still received sparse attention. In this sense, while [13] present a promising and extensive study regarding global country-to-country movements (mostly driven by tourism), within-country human flows (comprising not only internal tourism but also, to a greater extent than country-to-country travel, visiting and commuting) still need further investigation. Therefore, throughout this work we compare our findings using geo-located Twitter with similar studies using commuting surveys.

For the Twitter analysis we consider almost 146 million geo-located Twitter messages (tweets) collected through the public API provided by Twitter for the continental part of Spain, from 29th November 2012 to 10th April 2013. In this dataset we consider that there has been a trip from place l to place k if a user has tweeted in place l and place k consecutively. We only keep those transitions in which the first tweet and the second one are dated on the same day. We filter the trips database to avoid unrealistic transitions and keep only trips with a geographical displacement larger than 1 km (see Methods section). By this method, 138 million trips from 167,376 different users are considered in our work.
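The trip definition above (consecutive tweets by the same user, on the same day, more than 1 km apart) can be sketched as follows. This is an illustrative reading of the rule, not the authors' pipeline: the `Tweet` record, the function names and the use of the haversine distance are assumptions, and the additional filter for unrealistic transitions mentioned in the Methods section is not implemented.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt


@dataclass
class Tweet:
    user: str
    time: datetime
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))


def extract_trips(tweets, min_km=1.0):
    """Pairs of consecutive same-user, same-day tweets more than min_km apart."""
    by_user = {}
    for tweet in sorted(tweets, key=lambda t: (t.user, t.time)):
        by_user.setdefault(tweet.user, []).append(tweet)
    trips = []
    for sequence in by_user.values():
        for origin, dest in zip(sequence, sequence[1:]):
            same_day = origin.time.date() == dest.time.date()
            far_enough = haversine_km(origin.lat, origin.lon, dest.lat, dest.lon) > min_km
            if same_day and far_enough:
                trips.append((origin, dest))
    return trips


# Toy input (hypothetical user and coordinates).
tweets = [
    Tweet("u1", datetime(2013, 1, 5, 9, 0), 40.4168, -3.7038),   # Madrid
    Tweet("u1", datetime(2013, 1, 5, 12, 0), 40.3281, -3.7635),  # Leganés, same day
    Tweet("u1", datetime(2013, 1, 5, 13, 0), 40.3282, -3.7635),  # a few metres away
    Tweet("u1", datetime(2013, 1, 6, 9, 0), 40.4168, -3.7038),   # next day
]
trips = extract_trips(tweets)
```

Only the Madrid to Leganés transition qualifies here: the third tweet is too close to the second, and the fourth falls on a different day.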

From those trips we construct the mobility flow $T_{ij}$ between municipalities, which measures the number of trips in our database whose origin lies within the boundaries of city $i$ and whose destination lies within those of city $j$.

We also consider population and economic information about the municipalities from the Spanish Census (2011) [8] and unemployment figures from the Public Service of Employment (Servicio Público de Empleo Estatal, SEPE) [7]. In the latter case, registered unemployment (in number of persons) is given for each Spanish municipality by gender, age and month. To get unemployment rates we divide registered unemployment by the total workforce in the municipality, estimated as the number of people aged between 16 and 65 years.

S2 Twitter as mobility proxy

Considering all of the available transitions in our database, one can compute the distance between origin and destination, the elapsed time of the transition, and the number of trips per user, among many other statistics. All of them seem to follow a power-law distribution with a cutoff due to the finite spatial size of Spain and the constraint of considering only transitions where the origin and destination check-ins are made on the same day. Focusing on the log-linear part of the distributions, self-similar behaviors arise when Twitter-based mobility is analyzed (see Figure 5).

Twitter-based inter-city flows can be well modelled by means of the Gravity Law, which is one of the most widespread methods to represent human mobility [1, 19], with applications in many fields like urban planning [23], traffic engineering [4] or transportation problems [9]. The Gravity Law is also the solution to the problem of maximizing the entropy of the particle distribution among all the possible trips using statistical mechanics techniques [2, 22]. Recently it has also been used as a model for human mobility based on cell phone traces [10, 20, 21] and on social media data at a global scale [13] and at the inter-city level [14].

The Gravity Model for human mobility assumes that the flows between cities can be explained by the expression

$$T^{grav}_{ij} = \frac{P_i^{\alpha_1} P_j^{\alpha_2}}{d_{ij}^{\beta}} \qquad (2)$$

where $T^{grav}_{ij}$ is the flow, in terms of number of people, between cities $i$ and $j$, $d_{ij}$ is the geographical distance between them, and $P_i$ and $P_j$ are the populations of the two cities.

Given the data, we can obtain the parameters of the model by Weighted Least Squares minimization:

$$\alpha_1^{*}, \alpha_2^{*}, \beta^{*} = \arg\min_{\alpha_1, \alpha_2, \beta} \frac{1}{N} \sum_{i,j} w_{ij} \left( T_{ij} - T^{grav}_{ij} \right)^2 \qquad (3)$$

where $N$ is the total number of connections in the mobility graph and $w_{ij}$ is a weight proportional to the number of observed transitions between $i$ and $j$. In particular, we find that taking $w_{ij} = T_{ij}^{1/3}$ gives the best performance in the model.

In our case this model fits the inter-city mobility based on Twitter GPS check-ins quite accurately (see Table 1). Even though $T_{ij}$ is not necessarily symmetric, the exponents of the two populations are similar, indicating that we observe similar flows in both directions between $i$ and $j$.
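To make the fitting procedure concrete, the sketch below evaluates the gravity flow of equation (2) and the weighted squared error of equation (3), and minimizes the latter with a coarse grid search. It is a minimal sketch under stated assumptions: the grid search stands in for whatever optimizer the authors used, the three-city data are synthetic, the function names are hypothetical, and the weight exponent is left as an explicit parameter rather than fixed.

```python
from itertools import product


def gravity_flow(p_i, p_j, d_ij, alpha1, alpha2, beta):
    """Gravity-model flow: P_i^alpha1 * P_j^alpha2 / d_ij^beta."""
    return (p_i ** alpha1) * (p_j ** alpha2) / (d_ij ** beta)


def weighted_loss(flows, pop, dist, params, weight_exp):
    """Mean weighted squared error, with weights w_ij = T_ij ** weight_exp."""
    alpha1, alpha2, beta = params
    total = 0.0
    for (i, j), t_ij in flows.items():
        d_ij = dist[frozenset((i, j))]
        predicted = gravity_flow(pop[i], pop[j], d_ij, alpha1, alpha2, beta)
        total += (t_ij ** weight_exp) * (t_ij - predicted) ** 2
    return total / len(flows)


def grid_search(flows, pop, dist, grid, weight_exp):
    """(alpha1, alpha2, beta) on the grid with the lowest weighted loss."""
    return min(
        product(grid, repeat=3),
        key=lambda params: weighted_loss(flows, pop, dist, params, weight_exp),
    )


# Synthetic example (hypothetical): flows generated from the model itself.
pop = {"A": 1000, "B": 4000, "C": 9000}
dist = {
    frozenset(("A", "B")): 10.0,
    frozenset(("A", "C")): 50.0,
    frozenset(("B", "C")): 30.0,
}
true_params = (0.5, 0.5, 1.0)
flows = {
    (i, j): gravity_flow(pop[i], pop[j], dist[frozenset((i, j))], *true_params)
    for i in pop
    for j in pop
    if i != j
}
best = grid_search(flows, pop, dist, grid=[0.0, 0.5, 1.0, 1.5], weight_exp=1 / 3)
```

Because the synthetic flows are generated without noise, the grid search recovers the generating exponents exactly; with real data a finer grid or a numerical optimizer would be needed.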

S3 Community structures in inter-city mobility graph

Typically, complex networks exhibit community structure; that is, there are subsets of nodes that are more densely connected among themselves than to the rest of the nodes. In mobility networks whose nodes correspond to geographical areas, these communities are interpreted as zones with high common activity and tend to be constrained by geographical and political barriers. We check whether this is also observed in our dataset by running six state-of-the-art community detection algorithms: FastGreedy [5], Walktrap [16], Infomap [18], MultiLevel [3], Label Propagation [17] and Leading Eigenvector [15]. These six algorithms yield different community structures in terms of number of communities, average community size or modularity (see Table 3). Members (municipalities) of the resulting communities are spatially connected, except in a few cases, as Figure 7 shows. We test the statistical robustness of the obtained communities by randomly removing a proportion p of the original links and running the algorithms on this new graph Gp. We consider the communities robust when those obtained for the original network G and for Gp are highly similar. In order to compare two arbitrary community memberships we use the Normalized Mutual Information (NMI) described in [6], which returns 0 when two memberships are totally different and 1 when they are equal. We compute the NMI for each chosen algorithm run on G and Gp for p between 1% and 10%, concluding that the obtained community structures are robust, because they are not broken when some randomly chosen links are removed (see Table 2).

Figure 5. Probability distributions for the different properties of daily trips in the Twitter dataset: trip distance (km), number of trips, and elapsed time (secs). Dashed lines correspond to power-law fits with exponents −1.67, −2.43 and −0.62, respectively.
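The NMI used for these comparisons can be computed directly from two membership vectors. The sketch below uses the form NMI = 2 I(A;B) / (H(A) + H(B)), which is the normalization given in [6]; the function name and the toy labels are assumptions made for illustration.

```python
from collections import Counter
from math import log


def nmi(labels_a, labels_b):
    """Normalized Mutual Information between two community memberships."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information from the confusion counts of the two partitions.
    mutual = sum(
        n_ab / n * log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )
    entropy_a = -sum(c / n * log(c / n) for c in count_a.values())
    entropy_b = -sum(c / n * log(c / n) for c in count_b.values())
    if entropy_a + entropy_b == 0:
        return 1.0  # both partitions put every node in one single community
    return 2 * mutual / (entropy_a + entropy_b)


# Toy memberships for four nodes (hypothetical labels).
same_up_to_relabeling = nmi([0, 0, 1, 1], ["x", "x", "y", "y"])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The measure depends only on how nodes are grouped, not on the community labels themselves, which is why it can compare the output of different algorithms or a partition with administrative limits.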

Figure 6. Penetration rate as a function of population for both cities and detected communities.

As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether the mobility-based communities are related to any of these barriers. In Spain there are different territorial divisions for administrative purposes. In this work we consider two of them: provinces, defined in the 1978 Constitution, are 48 different heterogeneous aggregations of municipalities; and counties (comarca in Spanish terminology) are traditional aggregations of municipalities mainly based on Spanish orography (rivers, valleys, ridges, etc.), some of which are composed of municipalities from different provinces. We again use the NMI to compare the community structure given by the algorithms with the administrative limits. Except for the Leading Eigenvector algorithm, the methods return communities that are closely related to provinces (NMI ≈ 0.7), whereas for the county limits a higher variability is observed. In this last case, the algorithm most closely related to county limits is Infomap (NMI ≈ 0.83). Therefore, Twitter-based mobility summarizes the inter-city flows, showing that these flows are influenced by geographical and political barriers.

S4 Twitter demographics and unemployment rates

Different age groups are not equally represented in Twitter. Recent surveys (2012) in Spain suggest that most (86%) Twitter users are 16 to 44 years old. Comparison of the percentage of users per age group with the total population within the same groups (see Figure 8) reveals that age groups above 35 years old are under-represented in Twitter. Thus, our Twitter data will be more revealing when trying to describe unemployment in age groups below 44 years old. This is indeed what we find when we try to build a linear model for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 24 can be fitted to a linear model with R2 = 0.62, regression models for unemployment rates for ages between 25 and 44 have R2 = 0.52, and for ages above 44 we get only R2 = 0.26. Table 4 summarizes the results for the regression models of unemployment rates in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, in Figure 8 we can see the performance of the model for the different age groups; once again, the poor explanatory power of the Twitter variables for the unemployment rate at ages above 44 is evident.

S5 Properties of Twitter variables

Normalization and distributions

Heterogeneity between the values of variables constructed from Twitter is large but moderate, as the histograms in Figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration rate τi and the misspellers rate εi are defined as the number of users or misspellers per 100,000 persons (population); activity variables νi are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, μi, are given per 100,000 tweets published in the geographical area.

Figure 7. From left to right and from top to bottom: FastGreedy, Walktrap, Infomap, MultiLevel, Label Propagation and Leading Eigenvector communities on Twitter-based mobility transitions.

Correlation between variables

Variables are constructed to reflect the behavior of areas in the different dimensions of Twitter penetration, social or geographical diversity, activity through the day, and content. The correlation between variables does indeed show that variables within each dimension are strongly correlated with each other. As we can see in figure 10, social and geographical diversities are highly correlated, which is expected given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate $\tau_i$ and the fraction of misspellers $\varepsilon_i$ have a strong correlation with most of the variables.

High correlation between variables might lead to collinearity effects [24] in the linear regression models; that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value, but its p-value is too high to consider it significant. To test this hypothesis, we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 shows the loadings of the different variables on the first two principal components. The block structure shown in figure 10 results in similar directions of the variables in the first components of the PCA. We observe some groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we take one variable from each of them for our regression models.
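As a rough illustration of the collinearity check described above, the following sketch builds the correlation matrix of a few variables and extracts the first principal component by power iteration. The data values and the names `geo_div`, `soc_div`, and `morning` are invented for the example, and power iteration is used only to keep the sketch dependency-free; it is not necessarily the PCA routine used in this work.

```python
import math

def standardize(col):
    """Scale a column to zero mean and unit (sample) variance."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
    return [(v - mean) / sd for v in col]

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long columns."""
    z = [standardize(c) for c in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]

def first_component(corr, iters=500):
    """Dominant eigenvector and eigenvalue of a correlation matrix (power iteration)."""
    k = len(corr)
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iters):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(k)) for i in range(k))
    return v, eigenvalue

# Hypothetical values for six areas (illustration only).
geo_div = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
soc_div = [0.12, 0.18, 0.33, 0.41, 0.48, 0.63]  # nearly collinear with geo_div
morning = [5.0, 4.0, 6.0, 4.5, 5.5, 5.2]

corr = correlation_matrix([geo_div, soc_div, morning])
loadings, eigenvalue = first_component(corr)
explained = eigenvalue / len(corr)  # share of total variance on the first component
```

In this toy data the two diversity variables load almost equally on the first component, which is the kind of block structure that signals collinearity between them.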

S6 Misspellers detection

In this work we consider only tweets in Spanish. Since several languages coexist in Spain, depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet contains a misspelling, we need to establish which patterns to look for in our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we do not consider the following cases:

• Lack of written accents. People tend to omit accents when writing colloquially.

• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters qu by k. We understand that these mistakes can be motivated by the length limit of tweets rather than by a real misspelling.

• Along the same lines, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.

• We do not consider mistakes related to features of specific areas of Spain either. For example, in the south the pronunciation of ce and se is the same, which produces a large number of mistakes when writing. However, since we want to draw objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

Figure 9. Frequency plots P(x) for each variable constructed from Twitter. Panels: penetration rate, misspellers rate, Entropy1 and Entropy2 (social), Entropy1 and Entropy2 (geo), activity (morning, afternoon, night), and tweets mentioning job, employment, and unemployment.

Likewise, we consider the following mistakes to be real misspellings:

• Adding letters. For example, writing an h at the beginning of a word that starts with a vowel.

• Replacing the special cases mp and mb with the wrong spellings np and nb.

• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish because the pairs have the same or a very close pronunciation.

• Confusing the verb haber with the periphrasis a ver.

• Separating a word into two, for instance writing the word conmigo as con migo.

Figure 8. Top: Percentage of population in each age group (16−24, 25−34, 35−44, 45−54, 55−64) from the Spanish Census (dark bars) and from surveys about users in Twitter (light bars). Bottom: Performance of the linear models (predicted against observed unemployment, in %) for each of the age groups: all ages (R² = 0.47), < 24 (R² = 0.62), 25−44 (R² = 0.52), and > 44 (R² = 0.26).

In this way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus, one can expect this selection to provide an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27055 (5.6% of the whole population).
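The detection step described above can be sketched as a lookup of each tweet against a fixed list of misspelled forms. The five entries below are hypothetical examples chosen to match the rules in this section; they are not the 617-entry list used in this work, and the function name `misspellers` is likewise an assumption.

```python
import re

# Hypothetical mini-list of misspelled forms (illustration only).
MISSPELLINGS = {
    "haver": "b/v confusion",
    "tanbien": "np written instead of mp",
    "habrir": "h added before a vowel",
    "llendo": "ll written instead of y",
    "con migo": "word split in two",
}

# One case-insensitive pattern that matches any listed form as a whole word.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(m) for m in MISSPELLINGS) + r")\b",
    re.IGNORECASE,
)

def misspellers(tweets):
    """Return the set of users with at least one listed misspelling.

    tweets: iterable of (user_id, text) pairs.
    """
    found = set()
    for user, text in tweets:
        if PATTERN.search(text):
            found.add(user)
    return found

tweets = [
    ("u1", "Voy a haver si llego a tiempo"),
    ("u1", "Buenos dias"),
    ("u2", "Ven con migo al cine"),
    ("u3", "Tambien quiero ir"),  # missing accent only: not counted
]
bad = misspellers(tweets)
users = {u for u, _ in tweets}
fraction = len(bad) / len(users)
```

A missing accent, as in the last tweet, is deliberately not matched, in line with the exclusions listed earlier in this section.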

Figure 10. Top: Correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables, colored blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations at 95% confidence. Bottom: Projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.

We analyze whether the Twitter usage of misspellers differs from that of people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, we observe that misspellers tend to publish a larger number of tweets than those who did not make mistakes (14471 against 2372). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period, the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).
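A scaling exponent of this kind is commonly estimated as the slope of a least-squares line in log-log coordinates. The sketch below assumes that approach and uses synthetic data generated with a known exponent; the estimation procedure actually used here is not stated, and `loglog_slope` is a name introduced for illustration.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx

# Synthetic example: misspellings growing as tweets**(1/3).
tweet_counts = [50, 100, 200, 400, 800]
mean_misspellings = [2.0 * t ** (1 / 3) for t in tweet_counts]
exponent = loglog_slope(tweet_counts, mean_misspellings)
```

An exponent below 1 means that doubling the number of tweets less than doubles the expected number of misspellings.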

Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspellers rate as a proxy for the educational level of the cities. A large number of previous works in the literature have revealed the relationship between the economic status and the educational level of geographical areas, and it is therefore natural to ask whether the observed misspellers rate is related to the economy, as measured by the unemployment rate. To test this hypothesis, we consider cities with more than 5000 inhabitants to avoid undersampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).

Figure 11. Number (red) and probability (blue) of observed misspellings given the number of tweets.

Figure 12. Explanatory power (R²) of the linear regression model when fitted against the unemployment data for different months, from October 2012 to June 2014. The gray (orange) area corresponds to the time window in which Twitter data were collected and variables constructed.

S7 Time window and unemployment

In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment measured in different months within the same time window in which Twitter data was collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory value of the model when the linear regression is done for values of unemployment of different months before, during, and after the Twitter data time window. Although there is a small seasonal effect along the year, we see that the explanatory power remains around R² = 0.6, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that R² decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.

S8 Demographics do not explain unemployment

Since unemployment rates are very large for the group of young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found in the geographical areas. To test this, we have built four linear models: the first one (named Youth model in Table 5) has the rate of young population as its only explanatory variable; the second and third ones are built only on the Twitter variables considered in the main text (named Twitter model (I)) or only on those whose regression coefficients are statistically significant (Twitter model (II)); the fourth one is fitted with all the variables (named All variables model in Table 5). Table 5 shows the summary of the regression for each model. Focusing on the variance explained by each model in terms of R², considering all Twitter variables is almost three times more explanatory than considering only the proportion of young people (R² = 0.64 against 0.24). On the other hand, comparing R² for the Twitter model with those for the All variables and Youth models shows that the rate of young population does not provide significant additional explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.

S9 Unemployment models for other geographical areas

While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we considered is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment with the variables defined in those administrative areas and relate it to the geographical communities detected and used in the main paper (see section 4). Not all the areas at the different administrative divisions are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1738 municipalities which have a Twitter population π > 10. Similarly, we only consider the 198 counties with π > 100. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, R² increases as the number of areas in the model decreases, but the level of geographical detail of the model is then very low, for example for provinces. The best performance (high R² and high level of geographical detail) is attained at the level of the detected communities.

S10 Relative importance of the variables

To assess the relative importance of the variables in the unemployment model, we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:

1. (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.

2. (lmg) Averaging over orderings, as proposed by Lindeman, Merenda and Gold.

3. (pmvd) The PMVD metric introduced by Feldman, which is an average over orderings as well, but with data-dependent weights.

4. (first) The univariate R² values from regression models with one variable only.

Figure 13. Relative importance of the variables (in percentage) in the unemployment model for the different ways to calculate it (weight, first, lmg, pmvd).

All these metrics are obtained using the relaimpo R package [12]. The results for the youth unemployment model are shown in figure 13, where we can see that the different methods yield similar relative importance of the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.
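To make two of the metrics above concrete, the sketch below computes "first" (univariate R²) and "lmg" (R² increments averaged over all orderings of the regressors) for a tiny invented dataset. It is a plain re-implementation for illustration, not the relaimpo package; the data and the names `penetration`, `morning`, and `misspell` are assumptions.

```python
from itertools import permutations

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def r_squared(y, cols):
    """R^2 of an OLS fit of y on the given columns plus an intercept."""
    if not cols:
        return 0.0
    n, k = len(y), len(cols) + 1
    design = [[1.0] + [c[i] for c in cols] for i in range(n)]
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * y[i] for i, row in enumerate(design)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def first_and_lmg(y, predictors):
    """Return ('first', 'lmg') dictionaries keyed by predictor name."""
    names = list(predictors)
    first = {v: r_squared(y, [predictors[v]]) for v in names}
    lmg = {v: 0.0 for v in names}
    orders = list(permutations(names))
    for order in orders:
        used, prev = [], 0.0
        for v in order:
            used.append(predictors[v])
            cur = r_squared(y, used)
            lmg[v] += (cur - prev) / len(orders)  # average sequential gain
            prev = cur
    return first, lmg

# Invented data for eight areas (illustration only).
predictors = {
    "penetration": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "morning": [5.1, 4.8, 5.5, 4.9, 5.3, 4.6, 5.0, 4.7],
    "misspell": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0],
}
unemployment = [0.12, 0.15, 0.14, 0.18, 0.20, 0.24, 0.22, 0.27]

first, lmg = first_and_lmg(unemployment, predictors)
full_r2 = r_squared(unemployment, list(predictors.values()))
lmg_percent = {v: 100.0 * s / full_r2 for v, s in lmg.items()}
```

By construction the lmg shares add up to the R² of the full model, so they can be reported as percentages, whereas the "first" values ignore the other regressors and need not sum to anything meaningful.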

References

[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.

[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.

[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.

[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.

[6] Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.

[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html

[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm

[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.

[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.

[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.

[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.

[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.

[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.

[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.

[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences-ISCIS 2005, pages 284–293. Springer, 2005.

[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.

[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.

[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.

[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.

[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.

[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.

[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.

[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.

Gravity Model

Parameter   Description                                          Spain
α1          Origin exponent                                      0.477*** (0.002)
α2          Destination exponent                                 0.478*** (0.002)
β           Distance exponent                                    1.05*** (0.0035)
R²          Goodness of fit                                      0.797
φ           Correlation between $T_{ij}$ and $T^{grav}_{ij}$     0.826

Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.

NMI between G and Gp for different p

Algorithm   p = 0.01   0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09    0.10
FG          0.995      0.992   0.989   0.983   0.981   0.977   0.983   0.969   0.980   0.959
WT          0.954      0.959   0.950   0.954   0.945   0.948   0.947   0.935   0.926   0.931
IM          0.988      0.981   0.980   0.981   0.978   0.974   0.975   0.970   0.969   0.966
ML          0.994      0.978   0.979   0.983   0.948   0.934   0.972   0.952   0.973   0.947
LP          0.906      0.908   0.911   0.915   0.895   0.907   0.907   0.893   0.905   0.904
LE          0.960      0.957   0.956   0.859   0.910   0.892   0.908   0.858   0.885   0.884

Table 2. NMI measure comparing G and Gp.

Communities Stats

Algorithm   Mean size 〈|Ni|〉   Max size max|Ni|   Number of communities   Modularity   NMI P   NMI C
FG          309.696             1385               23                      0.726        0.712   0.590
WT          9.262               433                769                     0.417        0.744   0.757
IM          21.011              143                339                     0.758        0.770   0.831
ML          323.772             1132               22                      0.800        0.717   0.599
LP          22.052              750                323                     0.732        0.749   0.761
LE          1017.571            5344               7                       0.381        0.264   0.205

Table 3. Statistics of the communities Ni returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.

                         All ages     < 24        25−44       > 44
(Intercept)              0.11****     0.10***     0.20***     0.20***
                         (0.02)       (0.03)      (0.03)      (0.035)
Penetration rate         3.23*        8.57***     6.28**      2.40
                         (1.41)       (2.22)      (2.17)      (2.77)
Geographical diversity   0.03         0.15***     0.08*       0.06
                         (0.02)       (0.04)      (0.04)      (0.05)
Social diversity         −0.03*       −0.03       −0.05*      −0.06*
                         (0.01)       (0.02)      (0.02)      (0.03)
Morning activity         −0.69*       −1.30**     −1.53***    −1.19*
                         (0.26)       (0.42)      (0.41)      (0.52)
Misspellers rate         11.56        31.51*      15.46       23.60
                         (8.13)       (12.78)     (12.48)     (15.94)
Employment mentions      −1.80        3.17        −9.94       2.71
                         (6.27)       (9.86)      (9.64)      (12.3)
R²                       0.47         0.64        0.55        0.29
Adj. R²                  0.44         0.62        0.52        0.26
*** p < 0.001, ** p < 0.01, * p < 0.05

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in groups of less than 24 years, between 25 and 44 years, and above 44 years.

                         All variables   Youth model   Twitter model (I)   Twitter model (II)
(Intercept)              0.06            −0.02         0.10***             0.09***
                         (0.03)          (0.03)        (0.03)              (0.027)
Young pop. rate          0.66*           2.20***
                         (0.30)          (0.35)
Penetration rate         8.20***                       8.57***             8.62***
                         (2.25)                        (2.22)              (2.21)
Geographical diversity   0.14***                       0.15***             0.12***
                         (0.04)                        (0.04)              (0.03)
Social diversity         −0.02                         −0.03
                         (0.02)                        (0.02)
Morning activity         −1.42***                      −1.30**             −1.28**
                         (0.41)                        (0.42)              (0.41)
Misspellers rate         23.95                         31.51*              32.28*
                         (13.09)                       (12.78)             (12.71)
Employment mentions      0.34                          3.17
                         (9.81)                        (9.86)
R²                       0.65            0.24          0.64                0.63
Adj. R²                  0.63            0.24          0.62                0.62
*** p < 0.001, ** p < 0.01, * p < 0.05

Table 5. Regression table for the different statistical models. The All variables model includes both the Twitter variables and the rate of young population. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).

                         Communities   Municipalities   Counties    Provinces
(Intercept)              0.10***       0.16***          0.11***     0.11*
                         (0.03)        (0.01)           (0.03)      (0.05)
Penetration rate         8.57***       4.01***          9.12***     10.47***
                         (2.22)        (0.59)           (1.81)      (1.97)
Geographical diversity   0.15***       0.02             0.12***     0.08
                         (0.04)        (0.01)           (0.03)      (0.07)
Social diversity         −0.03         −0.01            −0.01       −0.03
                         (0.02)        (0.01)           (0.02)      (0.07)
Morning activity         −1.30**       −1.16***         −1.49***    −1.03
                         (0.42)        (0.14)           (0.39)      (0.88)
Misspellers rate         31.51*        14.40***         14.09
                         (12.78)       (2.51)           (10.02)
Employment mentions      3.17          −0.71            2.41        −3.17
                         (9.86)        (0.89)           (8.86)      (12.29)
Number of points         128           1738             198         50
R²                       0.64          0.22             0.55        0.65
Adj. R²                  0.62          0.21             0.54        0.61
*** p < 0.001, ** p < 0.01, * p < 0.05

Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model, the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.


Supporting Information for
Social media fingerprints of unemployment
Alejandro Llorente, Manuel García-Herranz, Manuel Cebrian, and Esteban Moro

S1 The dataset

Twitter provides an extremely rich and publicly available data set of user interactions, information flows and, thanks to the geo-location of tweets, user movements. Nevertheless, the representativeness of geo-located Twitter as a global source of mobility data has so far received little attention. In this sense, while [13] present a promising and extensive study of global country-to-country movements (mostly driven by tourism), within-country human flows (comprising not only internal tourism but also, to a greater extent than country-to-country travel, visiting and commuting) still need further investigation. Therefore, throughout this work we compare our findings using geo-located Twitter with similar studies using commuting surveys.

For the Twitter analysis we consider almost 146 million geo-located Twitter messages (tweets) collected through the public API provided by Twitter for the continental part of Spain, from 29th November 2012 to 10th April 2013. In this dataset we consider that there has been a trip from place l to place k if a user has tweeted in place l and place k consecutively. We only keep those transitions in which the first tweet and the second one are dated on the same day. We filter the trips database to avoid unrealistic transitions and keep only trips with a geographical displacement larger than 1 km (see Methods section). With this method, 138 million trips from 167376 different users are considered in our work.

From those trips we construct the mobility flow $T_{ij}$ between municipalities, which measures the number of trips in our database whose origin lies within the boundaries of city $i$ and whose destination lies within those of city $j$.
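The construction of trips and flows described above can be sketched as follows. The record layout `(user, timestamp, municipality, lat, lon)`, the function names, and the sample coordinates are assumptions made for the example; the additional filter for unrealistic transitions mentioned in the text is not specified here and is therefore left out.

```python
import math
from collections import Counter
from datetime import datetime

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def mobility_flows(tweets, min_km=1.0):
    """Count trips between municipalities from consecutive same-day tweets.

    tweets: list of (user, timestamp, municipality, lat, lon) records.
    Returns a Counter mapping (origin, destination) to number of trips.
    """
    flows = Counter()
    ordered = sorted(tweets, key=lambda t: (t[0], t[1]))
    for prev, cur in zip(ordered, ordered[1:]):
        same_user = prev[0] == cur[0]
        same_day = prev[1].date() == cur[1].date()
        if same_user and same_day:
            if haversine_km(prev[3], prev[4], cur[3], cur[4]) > min_km:
                flows[(prev[2], cur[2])] += 1
    return flows

tweets = [
    ("a", datetime(2013, 1, 5, 9, 0), "Madrid", 40.4168, -3.7038),
    ("a", datetime(2013, 1, 5, 18, 0), "Getafe", 40.3083, -3.7327),
    ("a", datetime(2013, 1, 6, 10, 0), "Madrid", 40.4168, -3.7038),  # next day: no trip
    ("b", datetime(2013, 1, 5, 8, 0), "Madrid", 40.4168, -3.7038),
    ("b", datetime(2013, 1, 5, 8, 30), "Madrid", 40.4170, -3.7040),  # under 1 km: dropped
    ("b", datetime(2013, 1, 5, 20, 0), "Getafe", 40.3083, -3.7327),
]
flows = mobility_flows(tweets)
```

Note that the resulting counts are directed, so the flow from one municipality to another need not equal the flow in the opposite direction.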

We also consider population and economic information about the municipalities from the Spanish Census (2011) [8] and unemployment figures from the Public Service of Employment (Servicio Público de Empleo Estatal, SEPE) [7]. In the latter case, registered unemployment (in number of persons) is given for each Spanish municipality by gender, age, and month. To get unemployment rates, we divide registered unemployment by the total workforce in the municipality, estimated as the number of people aged between 16 and 65 years.

S2 Twitter as mobility proxy

Considering all of the available transitions in our database, one can compute the distance between origin and destination, the elapsed time of the transition, and the number of trips per user, among many other statistics. All of them seem to follow a power-law distribution with a cutoff due to the finite spatial size of Spain and to the constraint of considering only transitions in which the origin and destination check-ins are made on the same day. Focusing on the linear part of the distributions in logarithmic scale, self-similar behaviors arise when Twitter-based mobility is analyzed (see figure 5).

Twitter-based inter-city flows can be well modelled by means of the Gravity Law, which is one of the most widespread methods to represent human mobility [1, 19], with applications in many fields like urban planning [23], traffic engineering [4], or transportation problems [9]. The Gravity Law is also the solution to the problem of maximizing the entropy of the particle distribution among all the possible trips using statistical mechanics techniques [2, 22]. Recently it has also been used as a model for human mobility based on cell phone traces [10, 20, 21] and social media data at a global scale [13] and at the inter-city level [14].

The Gravity Model for human mobility assumes that the flows between cities can be explained by the expression

\[ T^{grav}_{ij} = \frac{P_i^{\alpha_1} P_j^{\alpha_2}}{d_{ij}^{\beta}} \tag{2} \]

where $T^{grav}_{ij}$ is the flow, in number of people, between cities $i$ and $j$, $d_{ij}$ is the geographical distance between them, and $P_i$ and $P_j$ are the populations of the two cities.

Given the data, we can obtain the parameters of the model by Weighted Least Squares minimization:

\[ \alpha_1^*, \alpha_2^*, \beta^* = \arg\min_{\alpha_1, \alpha_2, \beta} \frac{1}{N} \sum_{ij} w_{ij} \left( T_{ij} - T^{grav}_{ij} \right)^2 \tag{3} \]

where $N$ is the total number of connections in the mobility graph and $w_{ij}$ is a weight proportional to the number of observed transitions between $i$ and $j$. In particular, we find that taking wi j = T 13 i j gives the best performance in the model.
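As a simplified illustration of fitting the gravity exponents, the sketch below uses a log-linearized weighted least squares: it regresses log T on log P_i, log P_j, and −log d with weights taken as a power of the observed flow. This is a swapped-in simplification, not the nonlinear minimization of the squared flow differences described above; the `weight_exponent` parameter, the function names, and the synthetic data are assumptions.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def fit_gravity_loglinear(flows, weight_exponent=1.0):
    """Estimate (alpha1, alpha2, beta) from (T_ij, P_i, P_j, d_ij) records.

    Fits log T = alpha1 log P_i + alpha2 log P_j - beta log d by weighted
    least squares with weights T_ij ** weight_exponent (assumed choice).
    All inputs must be strictly positive.
    """
    rows = [(math.log(pi), math.log(pj), -math.log(d)) for _, pi, pj, d in flows]
    ys = [math.log(t) for t, _, _, _ in flows]
    ws = [t ** weight_exponent for t, _, _, _ in flows]
    k = 3
    xtwx = [[sum(w * r[a] * r[b] for w, r in zip(ws, rows)) for b in range(k)]
            for a in range(k)]
    xtwy = [sum(w * r[a] * y for w, r, y in zip(ws, rows, ys)) for a in range(k)]
    return tuple(solve(xtwx, xtwy))

# Synthetic flows generated with alpha1 = alpha2 = 0.5 and beta = 1.
pairs = [(1000, 5000, 10), (5000, 1000, 10), (20000, 3000, 50),
         (3000, 800, 25), (100000, 20000, 300), (800, 100000, 120)]
flows = [(pi ** 0.5 * pj ** 0.5 / d, pi, pj, d) for pi, pj, d in pairs]
alpha1, alpha2, beta = fit_gravity_loglinear(flows)
```

Because the synthetic flows follow the model exactly, the fit recovers the generating exponents; with real, noisy flows the log-linear and nonlinear objectives generally give different estimates.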

In our case, this model fits the inter-city mobility based on Twitter GPS check-ins quite accurately (see table 1). Even though $T_{ij}$ is not necessarily symmetric, the exponents of the populations are similar, indicating that we observe similar flows in both directions between $i$ and $j$.

S3 Community structures in inter-city mobility graph

Figure 5. Probability distributions for the different properties of daily trips in the Twitter dataset: trip distance (km), number of trips, and elapsed time (secs). Dashed lines correspond to power-law fits with exponents −1.67, −2.43, and −0.62, respectively.

Typically, complex networks exhibit community structure; that is, there are subsets of nodes that are more densely connected among themselves than with the rest of the nodes. In mobility networks, whose nodes correspond to geographical areas, these communities are interpreted as zones with high common activity and tend to be constrained by geographical and political barriers. We check whether this is also observed in our dataset by applying six state-of-the-art community detection algorithms: FastGreedy [5], Walktrap [16], Infomap [18], MultiLevel [3], Label Propagation [17], and Leading Eigenvector [15]. These six algorithms yield different community structures in terms of number of communities, average community size, or modularity (see table 3). Members (municipalities) of the resulting communities are spatially connected, except in a few cases, as figure 7 shows. We test the statistical robustness of the obtained communities by randomly removing a proportion p of the original links and running the algorithms on this new graph Gp. We consider that communities are robust when the communities obtained for the original network G and for Gp are highly similar. In order to compare two arbitrary community memberships, we use the Normalized Mutual Information (NMI) method described in [6], which returns 0 when two memberships are totally different and 1 when we compare two equal memberships. We compute the NMI for each chosen algorithm applied to G and Gp for p between 1% and 10%, concluding that the obtained community structures are robust because they are not broken when some randomly chosen links are removed (see table 2).
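For reference, the NMI between two community assignments can be computed as twice their mutual information divided by the sum of their entropies, which is the normalization used in [6]. The sketch below is a minimal stand-alone version with invented label lists; the function name `nmi` is an assumption, and the community detection itself is not reproduced here.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized Mutual Information between two membership lists.

    labels_a[k] and labels_b[k] are the communities assigned to node k.
    Returns 1.0 for identical partitions and 0.0 for independent ones.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual = sum(
        (nab / n) * math.log(nab * n / (count_a[a] * count_b[b]))
        for (a, b), nab in joint.items()
    )
    h_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    if h_a + h_b == 0:
        return 1.0  # both partitions put every node in a single community
    return 2.0 * mutual / (h_a + h_b)

# Same partition with relabelled communities, and an unrelated partition.
same = nmi([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Since the measure depends only on how nodes are grouped, relabelling the communities does not change the score, which is what makes it suitable for comparing the output of different algorithms or of the same algorithm on G and Gp.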

Figure 6. Penetration rates for both cities and detected communities (horizontal axis: population; vertical axis: penetration rate).

As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether the mobility-based communities are related to any of these barriers. In Spain there are different territorial divisions for administrative purposes. In this work we consider two of them: provinces, defined in the 1978 Constitution, which are 48 different heterogeneous aggregations of municipalities; and counties (comarca in Spanish terminology), which are traditional aggregations of municipalities mainly based on Spanish orography (rivers, valleys, ridges, etc.), some of which are composed of municipalities of different provinces. We again use the NMI to compare the community structure given by the algorithms with the administrative limits. Except for the Leading Eigenvector algorithm, the methods return communities that are quite related to provinces (NMI ≈ 0.7), whereas for the county limits higher variability is observed. In this last case, the algorithm most closely related to county limits is Infomap (NMI ≈ 0.83). Therefore, Twitter-based mobility summarizes the inter-city flows, showing that these flows are influenced by geographical and political barriers.

S4 Twitter demographics and unemployment rates

Different age groups are not equally represented in Twitter. Recent surveys (2012) in Spain suggest that most (86%) Twitter users are 16 to 44 years old. Comparison of the percentage of users per age group with the total population within the same groups (see Figure 8) reveals that age groups above 35 years old are under-represented in Twitter. Thus our Twitter data will be more revealing when describing unemployment in age groups below 44 years old. This is indeed what we find when we build linear models for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 24 can be fitted to a linear model with R² = 0.62, regression models for ages between 25 and 44 have R² = 0.52, and for ages above 44 we get only R² = 0.26. Table 4 summarizes the results for the regression models of unemployment rates in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, Figure 8 shows the performance of the model for the different age groups, and once again the poor explanatory power of the Twitter variables for the unemployment rate at ages above 44 is evident.

S5 Properties of Twitter variables

Normalization and distributions

Heterogeneity between the values of the variables constructed from Twitter is substantial but moderate, as the histograms in Figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration τ_i and the misspellers rate ε_i are defined as the number of users or misspellers per 100,000 persons (population); activity variables ν_i are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, µ_i, are given per 100,000 tweets published in the geographical area.

Figure 7. From left to right and from top to bottom: FastGreedy, Walktrap, Infomap, MultiLevel, Label Propagation and Leading Eigenvector communities on Twitter-based mobility transitions.

Correlation between variables

Variables are constructed to reflect the behavior of areas in the different dimensions of Twitter: penetration, social or geographical diversity, activity through the day, and content. The correlation between variables does indeed show that variables within each dimension are strongly correlated with each other. As we can see in Figure 10, social and geographical diversities are highly correlated, an expected fact given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate τ_i and the fraction of misspellers ε_i are strongly correlated with most of the variables.

High correlation between variables might lead to collinearity effects [24] in the linear regression models; that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value, but its p-value is too high to consider it significant. To test this hypothesis we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 shows the loadings of the different variables on the first two principal components. The block structure of the correlation matrix shown in Figure 10 results in similar directions of the variables in the first components of the PCA. We observe several groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we have taken one variable from each group for our regression models.
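As a rough illustration of the collinearity screening discussed above, the following sketch computes pairwise Pearson correlations and flags strongly correlated pairs. The variable names, the toy values and the 0.8 threshold are assumptions made for the example; the paper itself relies on the correlation matrix and PCA of Figure 10.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def collinear_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Made-up values for five areas, only to exercise the functions.
toy = {
    "penetration": [1.0, 2.0, 3.0, 4.0, 5.0],
    "misspellers": [2.1, 3.9, 6.2, 8.0, 9.9],
    "morning": [5.0, 3.0, 6.0, 2.0, 4.0],
}
flagged = collinear_pairs(toy)
print(flagged)
```

A pair flagged this way is a candidate for the situation described in the text, where two predictors explain the same part of the variance and one of them ends up with a non-significant weight.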

S6 Misspellers detection

In this work we consider only tweets in Spanish. Since several languages coexist in Spain, depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, to decide whether a tweet contains a misspelling, we need to establish some patterns to look for in our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we do not consider the following cases:

• Lack of written accents. People tend to avoid writing accents when writing in a colloquial way.

• Mistakes derived from removing unnecessary letters. The most common cases are removing an "h" at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters "qu" by "k". We understand that these mistakes can be motivated by the length limit of tweets and not by a real misspelling.

• Along the same lines, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.

• We do not consider mistakes related to features of specific areas in Spain either. For example, in the south the pronunciation of "ce" and "se" is the same, which produces a large number of mistakes when writing. However, since we want to draw objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

Likewise, we consider the following mistakes as real misspellings:

• Adding letters. For example, writing an "h" at the beginning of a word that starts with a vowel.

• Replacing the special cases "mp", "mb" by the wrong spellings "np", "nb".

• Mixing up "b" with "v", "g" with "j", "ll" with "y", and "ex" with "es". These are typical mistakes in Spanish because they have the same or a very close pronunciation.

• Confusing the verb "haber" with the periphrasis "a ver".

• Separating a word into two, for instance writing the word "conmigo" as "con migo".

Figure 8. Top: Percentage of population in each age group (16-24, 25-34, 35-44, 45-54, 55-64) from the Spanish Census (dark bars) and surveys about users in Twitter (light bars). Bottom: performance of the linear models for each of the age groups, shown as predicted versus observed unemployment (%): all ages (R² = 0.47), < 24 (R² = 0.62), 25-44 (R² = 0.52) and > 44 (R² = 0.26).

Figure 9. Frequency plots P(x) for each variable constructed from Twitter: tweets mentioning unemployment, penetration rate, social and geographical entropies, misspellers rate, tweets mentioning employment, tweets mentioning job, and activity at night, in the morning and in the afternoon.

This way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus one can expect this selection to provide an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27,055 (5.6% of the whole population).
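A list-based detector of the kind described above could look like the following sketch. The five entries in `MISSPELLINGS` are hypothetical stand-ins chosen to match the stated rules (the actual 617-entry list is not reproduced in the text), and the function names are likewise assumptions. Acute accents are stripped before matching because missing accents are explicitly not counted as mistakes.

```python
import re
import unicodedata

# Hypothetical mini-list for illustration only; the list used in the paper
# contains 617 entries and is not given in the text.
MISSPELLINGS = {"tanbien", "habrir", "haver", "con migo", "llendo"}


def strip_acute_accents(text):
    """Remove acute accents (á -> a) but keep other marks such as the tilde in ñ."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace("\u0301", ""))


def find_misspellings(tweet, misspellings=MISSPELLINGS):
    """Return the sorted list entries (single words or two-word forms) found in a tweet."""
    words = re.findall(r"[^\W\d_]+", strip_acute_accents(tweet.lower()))
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return sorted({token for token in words + bigrams if token in misspellings})


def is_misspeller(tweets, misspellings=MISSPELLINGS):
    """A user counts as a misspeller if at least one tweet contains a listed mistake."""
    return any(find_misspellings(tweet, misspellings) for tweet in tweets)


print(find_misspellings("Ven con migo, tanbién voy"))
print(is_misspeller(["ke haces", "ola"]))  # shortenings are not in the list
```

Matching against a fixed list rather than a general spell checker is what keeps Twitter-specific shortenings and regional variants out of the count, at the cost of missing any mistake that is not listed.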

Figure 10. Top: Correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables, and whose color is blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations at 95% confidence. Bottom: Projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.

We analyze whether misspellers use Twitter differently from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, misspellers tend to publish more tweets than those who did not make mistakes (144.71 against 23.72). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).
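A scaling exponent like the one quoted above is commonly estimated by a straight-line fit in log-log space. The sketch below shows that procedure on synthetic data generated with exponent 0.33; the fitting method is an assumption, since the text does not state how the exponent was obtained, and the data points are invented.

```python
import math


def fit_power_law(x, y):
    """Least-squares fit of log y = log c + k log x; returns (k, c)."""
    log_x = [math.log(v) for v in x]
    log_y = [math.log(v) for v in y]
    n = len(log_x)
    mean_x, mean_y = sum(log_x) / n, sum(log_y) / n
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y)) / sum(
        (a - mean_x) ** 2 for a in log_x
    )
    prefactor = math.exp(mean_y - slope * mean_x)
    return slope, prefactor


# Synthetic example: mean misspellings growing as 0.5 * tweets**0.33.
tweets = [30, 60, 120, 240, 480]
misspellings = [0.5 * t ** 0.33 for t in tweets]
exponent, prefactor = fit_power_law(tweets, misspellings)
print(exponent, prefactor)
```

An exponent below 1 means that doubling the number of tweets less than doubles the mean number of misspellings, which is what "sub-linearly" refers to.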

Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspellers rate as a proxy for the educational level of the cities. A large number of previous works have revealed the relationship between the economic status and the educational level of geographical areas, and therefore it is natural to ask whether the observed misspellers rate is related to the economy, as measured by the unemployment rate. To test this hypothesis we consider cities with more than 5,000 inhabitants, to avoid undersampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).

Figure 11. Number (red) and probability (blue) of observed misspellings given the number of tweets.

Figure 12. Explanatory power (R²) of the linear regression model when fitted against the unemployment data for different months, from October 2012 to June 2014. The gray (orange) area corresponds to the time window in which Twitter data is collected and variables are constructed.

S7 Time window and unemployment

In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment determined at different months, both within the time window in which the Twitter data was collected and outside it, to see whether the variables collected in that window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory value of the model when the linear regression is done for values of unemployment of different months before, during and after the Twitter data time window. Although there is a small seasonal effect along the year, we see that the explanatory power remains around R² = 0.6, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that R² decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.

S8 Demographics do not explain unemployment

Since unemployment rates are very large for the group of young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found in the geographical areas. To test this we have built four linear models: the first one (named Youth model in Table 5) has the rate of young population as its only explanatory variable; the second and third are built from only the Twitter variables considered in the main text (named Twitter model (I)) or just from those whose regression coefficients are statistically significant (Twitter model (II)); the fourth one is fitted with all the variables (named All variables model in Table 5). Table 5 shows the summary of the regression for each model. Focusing on the variance explained by each model in terms of R², considering all Twitter variables is about three times more explanatory than considering only the proportion of young people. On the other hand, the comparison of R² for the Twitter model with that of the All variables and Youth models shows that the rate of young population does not provide significant additional explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.

S9 Unemployment models for other geographical areas

While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8,200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we considered is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment for the variables defined in those administrative areas and relate it to the geographical communities detected and used in the main paper (see section 4). Not all the areas in the different administrative divisions are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1,738 municipalities which have a Twitter population π > 10. Similarly, we only consider the 198 counties with π > 100. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, R² increases as the number of areas in the model gets smaller, but the geographical description level of the model is then very low, for provinces for example. The best performance (high R² and high geographical description level) is attained at the level of the detected communities.

S10 Relative importance of the variables

To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:

1. (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.

2. (lmg) Averaging over orderings, as proposed by Lindeman, Merenda and Gold.

3. (pmvd) The PMVD metric introduced by Feldman, which is an average over orderings as well, but with data-dependent weights.

4. (first) The univariate R² values from regression models with one variable only.

Figure 13. Relative importance of the variables (in percentage) in the unemployment model for different ways to calculate it (weight, first, lmg, pmvd).

All these metrics are obtained using the relaimpo R package [12]. The results for the youth unemployment model are shown in Figure 13, where we can see that the different methods yield similar relative importance of the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.
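Of the four metrics listed above, "first" is the simplest to reproduce by hand: for a single predictor, the R² of the univariate regression equals the squared Pearson correlation with the response. The sketch below computes it and rescales the values to percentages that sum to 100; that rescaling, the function names and the toy data are assumptions made for illustration, and this is not the relaimpo implementation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def first_importance(predictors, response):
    """Univariate R2 of each predictor, rescaled to percentages summing to 100."""
    r2 = {name: pearson(col, response) ** 2 for name, col in predictors.items()}
    total = sum(r2.values())
    return {name: 100.0 * value / total for name, value in r2.items()}


# Made-up data for five areas: unemployment (%) and two Twitter variables.
unemployment = [10.0, 12.0, 15.0, 19.0, 24.0]
predictors = {
    "morning": [6.0, 5.5, 5.0, 4.2, 3.5],
    "mentions": [1.0, 3.0, 2.0, 5.0, 4.0],
}
importance = first_importance(predictors, unemployment)
print(importance)
```

Unlike lmg and pmvd, this metric ignores correlations between predictors, so two collinear variables can both receive a large share.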

References

[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59-67, 1987.

[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86-86, 1997.

[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.

[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.

[6] Leon Danon, Albert Diaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.

[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html

[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm

[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39-61, 1973.

[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663-7668, 2011.

[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1-17, 2013.

[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1-27, 2006.

[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.

[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.

[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.

[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences - ISCIS 2005, pages 284-293. Springer, 2005.

[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.

[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118-1123, 2008.

[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51-56, 1959.

[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96-100, 2012.

[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818-823, 2010.

[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.

[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.

[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression: the partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735-743, 1984.


Gravity Model
Parameter   Description                               Spain
α1          Origin exponent                           0.477*** (0.002)
α2          Destination exponent                      0.478*** (0.002)
β           Distance exponent                         1.05*** (0.0035)
R²          Goodness of fit                           0.797
φ           Correlation between T_ij and T_ij^gra     0.826

Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.
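To make the roles of the three exponents concrete, the sketch below evaluates a gravity-law flow with the fitted values of Table 1. The functional form (origin and destination populations raised to α1 and α2, divided by distance raised to β), the free prefactor `scale` and the example populations are assumptions for illustration; the definition of the model is given elsewhere in the paper and is not restated in this section.

```python
# Fitted exponents as read from Table 1.
ALPHA_ORIGIN = 0.477
ALPHA_DESTINATION = 0.478
BETA_DISTANCE = 1.05


def gravity_flow(pop_origin, pop_destination, distance_km, scale=1.0):
    """Gravity-law estimate of the flow between two areas.

    Assumed form: scale * P_i**alpha1 * P_j**alpha2 / d_ij**beta.
    The prefactor `scale` is a free normalisation, not a value from the paper.
    """
    return (
        scale
        * pop_origin ** ALPHA_ORIGIN
        * pop_destination ** ALPHA_DESTINATION
        / distance_km ** BETA_DISTANCE
    )


# Hypothetical pair of areas: doubling the distance divides the flow by 2**beta.
near = gravity_flow(100_000, 200_000, 50.0)
far = gravity_flow(100_000, 200_000, 100.0)
print(near / far)
```

With β close to 1, the estimated flow decays roughly inversely with distance, while the sub-linear population exponents mean that doubling a population increases the flow by a factor of about 1.4.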

NMI between G and G_p for different p
Algorithm   p = 0.01   0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09    0.1
FG          0.995      0.992   0.989   0.983   0.981   0.977   0.983   0.969   0.980   0.959
WT          0.954      0.959   0.950   0.954   0.945   0.948   0.947   0.935   0.926   0.931
IM          0.988      0.981   0.980   0.981   0.978   0.974   0.975   0.970   0.969   0.966
ML          0.994      0.978   0.979   0.983   0.948   0.934   0.972   0.952   0.973   0.947
LP          0.906      0.908   0.911   0.915   0.895   0.907   0.907   0.893   0.905   0.904
LE          0.960      0.957   0.956   0.859   0.910   0.892   0.908   0.858   0.885   0.884

Table 2. NMI measure comparing G and G_p.

Communities Stats
Algorithm   Mean size ⟨|N_i|⟩   Max size max|N_i|   Number of communities   Modularity   NMI P   NMI C
FG          309.696             1385                23                      0.726        0.712   0.590
WT          9.262               433                 769                     0.417        0.744   0.757
IM          21.011              143                 339                     0.758        0.770   0.831
ML          323.772             1132                22                      0.800        0.717   0.599
LP          22.052              750                 323                     0.732        0.749   0.761
LE          1017.571            5344                7                       0.381        0.264   0.205

Table 3. Statistics of the communities N_i returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.


                         All ages    < 24        25-44       > 44
(Intercept)              0.11***     0.10***     0.20***     0.20***
                         (0.02)      (0.03)      (0.03)      (0.035)
Penetration rate         3.23*       8.57***     6.28**      2.40
                         (1.41)      (2.22)      (2.17)      (2.77)
Geographical diversity   0.03        0.15***     0.08*       0.06
                         (0.02)      (0.04)      (0.04)      (0.05)
Social diversity         −0.03*      −0.03       −0.05*      −0.06*
                         (0.01)      (0.02)      (0.02)      (0.03)
Morning activity         −0.69*      −1.30**     −1.53***    −1.19*
                         (0.26)      (0.42)      (0.41)      (0.52)
Misspellers rate         11.56       31.51*      15.46       23.60
                         (8.13)      (12.78)     (12.48)     (15.94)
Employment mentions      −1.80       3.17        −9.94       2.71
                         (6.27)      (9.86)      (9.64)      (12.3)
R²                       0.47        0.64        0.55        0.29
Adj. R²                  0.44        0.62        0.52        0.26
***p < 0.001, **p < 0.01, *p < 0.05

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in groups of less than 24 years, between 25 and 44 years, and above 44 years.

                         All variables   Youth model   Twitter model (I)   Twitter model (II)
(Intercept)              0.06            −0.02         0.10***             0.09***
                         (0.03)          (0.03)        (0.03)              (0.027)
Young pop. rate          0.66*           2.20***
                         (0.30)          (0.35)
Penetration rate         8.20***                       8.57***             8.62***
                         (2.25)                        (2.22)              (2.21)
Geographical diversity   0.14***                       0.15***             0.12***
                         (0.04)                        (0.04)              (0.03)
Social diversity         −0.02                         −0.03
                         (0.02)                        (0.02)
Morning activity         −1.42***                      −1.30**             −1.28**
                         (0.41)                        (0.42)              (0.41)
Misspellers rate         23.95                         31.51*              32.28*
                         (13.09)                       (12.78)             (12.71)
Employment mentions      0.34                          3.17
                         (9.81)                        (9.86)
R²                       0.65            0.24          0.64                0.63
Adj. R²                  0.63            0.24          0.62                0.62
***p < 0.001, **p < 0.01, *p < 0.05

Table 5. Regression table for the different statistical models. The All variables model includes both Twitter and rate of young population variables. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).


                         Communities   Municipalities   Counties    Provinces
(Intercept)              0.10***       0.16***          0.11***     0.11*
                         (0.03)        (0.01)           (0.03)      (0.05)
Penetration rate         8.57***       4.01***          9.12***     10.47***
                         (2.22)        (0.59)           (1.81)      (1.97)
Geographical diversity   0.15***       0.02             0.12***     0.08
                         (0.04)        (0.01)           (0.03)      (0.07)
Social diversity         −0.03         −0.01            −0.01       −0.03
                         (0.02)        (0.01)           (0.02)      (0.07)
Morning activity         −1.30**       −1.16***         −1.49***    −1.03
                         (0.42)        (0.14)           (0.39)      (0.88)
Misspellers rate         31.51*        14.40***         14.09
                         (12.78)       (2.51)           (10.02)
Employment mentions      3.17          −0.71            2.41        −3.17
                         (9.86)        (0.89)           (8.86)      (12.29)
Number of points         128           1738             198         50
R²                       0.64          0.22             0.55        0.65
Adj. R²                  0.62          0.21             0.54        0.61
***p < 0.001, **p < 0.01, *p < 0.05

Table 6. Regression table for the unemployment linear regression model in different levels of geographical areas. In the Provinces model the misspellers rate has been removed from the model due to the large collinearity with the penetration rate.
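The relation between the R² and adjusted R² rows of the regression tables can be checked with the standard formula, adj. R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of points and k the number of predictors. The sketch below applies it to three columns of Table 6; the function name is an assumption, and because the inputs are already rounded to two decimals the last digit of the output can differ from the table in other columns.

```python
def adjusted_r2(r2, n_points, n_predictors):
    """Adjusted R2 for a linear model with an intercept and n_predictors variables."""
    return 1.0 - (1.0 - r2) * (n_points - 1) / (n_points - n_predictors - 1)


# (R2, number of points, number of predictors) as read from Table 6.
# The Provinces model has five predictors because the misspellers rate was removed.
columns = {
    "communities": (0.64, 128, 6),
    "counties": (0.55, 198, 6),
    "provinces": (0.65, 50, 5),
}
for name, (r2, n, k) in columns.items():
    print(name, round(adjusted_r2(r2, n, k), 2))
```

The penalty grows as the number of points shrinks, which is why the gap between R² and adjusted R² is largest for the 50 provinces.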

  • 1 Social media dataset and functional partition of cities
  • 2 Social media behavioral fingerprints
  • 3 Explanatory power of social media in unemployment
  • 4 Discussion
Page 11: Social media ngerprints of unemployment · 2014. 11. 20. · Social media ngerprints of unemployment | 2/19 Figure 1. A) Map of the mobility fluxes Tij between municipalities based

Social media fingerprints of unemployment mdash 1119

10minus8

10minus6

10minus4

10minus2

100

100 1005 101 1015 102 1025 103x

dens

10minus6

10minus4

10minus2

100

1005 101 1015 102 1025 103x

dens

10minus7

10minus6

10minus5

10minus4

10minus3

102 1025 103 1035 104 1045 105x

dens

Den

sity

Den

sity

Den

sity

Trip distance (km) Number of trips Elapsed time (secs)

Figure 5 Probability distributions for the different properties of daily trips in the Twitter dataset Dashed lines corresponds to apower law fit with exponents minus167 minus243 and minus062 respectively

each chosen algorithm performed on G and Gp for p between 1and 10 concluding that obtained community structures are robustbecause they are not broken when some randomly chosen links areremoved (see table 2)

1e+01 1e+03 1e+05

2eminus04

2eminus03

2eminus02

2eminus01

cities2$total_population

cities2$twpen

Population

Pen

etra

tion

Rat

e CommunitiesCities

Figure 6 Penetration rates for both cities and detectedcommunities

As other works have shown mobility graph communities areusually interpreted in terms of geographical and political barriersand a natural question is whether the mobility based communitiesare related to any of these barriers In Spain there are differentterritorial divisions for administration purposes In this work weconsider two of them provinces defined in 1978 Constitution are 48different heterogeneous aggregations of municipalities and counties(comarca in Spanish terminology) which are traditional aggregationsof municipalities mainly based on Spanish holography (rivers val-leys ridges etc) and some of them are composed by municipalitiesof different provinces We use again the NMI method to compare thecommunities structure given by the algorithms to the administrativelimits Except Leading Eigenvector algorithm the rest of methodsreturn communities that are quite related to provinces (NMI asymp 07)whereas for the county administration limits higher variability is

observed In this last case the algorithm providing more relationshipwith county limits is Infomap NMI asymp 083 Therefore Twitter basedmobility summarizes the inter-city flows exhibiting that these flowsare influenced by geographical and political barriers

S4 Twitter demographics and unemploy-ment rates

Different age groups are not equally represented in Twitter Recentsurveys (2012) in Spain suggest that most (86) of users in Twitterare 16 to 44 years old Comparison of the percentage of users perage group with the total population within the same groups (seefigure 8) reveals that groups of ages above 35 years old are under-represented in Twitter Thus our Twitter data will be more revealingwhen trying to describe unemployment in age groups below 44 yearsold This is indeed what we find when we try to build a linear modelfor the rate unemployment in different age groups with the sameTwitter variables while unemployment rates for ages below 24 canbe fitted to a linear model with R2 = 062 we find that regressionmodels for unemployment rates for ages between 25 and 44 have aR2 = 052 while for ages above 44 we get only R2 = 026 Table 4summarizes the results for the regression models of unemploymentrates in each age group showing that our Twitter variables have moreexplanatory power for ages below 44 Finally in figure 8 we can seethe performance of the model at different age groups and once againit is obvious the poor explanatory power of the Twitter variables forthe unemployment rate in ages above 44 years old

S5 Properties of Twitter variables

Normalization and distributions

The heterogeneity of the values of the variables constructed from Twitter is noticeable but moderate, as the histograms in figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration τi and the misspellers rate εi are defined as the number of users or misspellers per 100000 persons (population); activity variables νi are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, μi, are also given per 100000 tweets published in the geographical area.

Figure 7. From left to right and from top to bottom: Fastgreedy, Walktrap, Infomap, Multilevel, Label Propagation and Leading Eigenvector communities on Twitter-based mobility transitions.
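The normalizations described in this subsection can be written down as a short sketch. The function names and the numbers in the example are hypothetical and only illustrate the arithmetic: rates per 100000 persons or tweets, and activity as a percentage of tweets per time interval.

```python
def per_100k(count, base):
    """Rate per 100000 units of `base` (persons for tau_i and eps_i, tweets for mu_i)."""
    return 100_000 * count / base


def activity_share(tweets_in_interval, total_tweets):
    """Percentage of an area's tweets published in a given time interval (nu_i)."""
    return 100 * tweets_in_interval / total_tweets


# Hypothetical area: 20000 inhabitants, 50 Twitter users,
# 600 tweets in total, 30 of them published in the morning interval.
penetration = per_100k(50, 20_000)
morning_activity = activity_share(30, 600)
```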

Correlation between variables

Variables are constructed to reflect the behavior of areas in the different dimensions of Twitter: penetration, social or geographical diversity, activity through the day, and content. The correlation between variables does indeed show that variables within each dimension are strongly correlated with each other. As we can see in figure 10, social and geographical diversities are highly correlated, an expected fact given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate τi and the fraction of misspellers εi have a strong correlation with most of the variables.

High correlation between variables might lead to collinearity effects [24] in the linear regression models; that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value, but its p-value is too high to consider it significant. To test this hypothesis we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 shows the loadings of the different variables on the first two principal components. The block structure shown in the correlation matrix of figure 10 results in similar directions of the variables in the first components of the PCA. We observe some groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we have taken a variable from each of them for our regression models.
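A simple way to see which variables are candidates for collinearity is to compute pairwise Pearson correlations and flag the pairs above a threshold. The following is a minimal sketch, not the analysis used in this work: the 0.8 threshold, the helper names and the toy values are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def collinear_pairs(columns, threshold=0.8):
    """Return the variable pairs whose absolute correlation reaches the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Hypothetical values for five areas (not real data).
variables = {
    "penetration": [1, 2, 3, 4, 5],
    "misspellers": [2, 4, 6, 8, 11],
    "morning": [5, 1, 4, 2, 3],
}
flagged = collinear_pairs(variables)
```

In this toy example only the penetration and misspellers columns are flagged, which mirrors the kind of pair that can end up with a non-significant weight when both enter the same regression.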

S6 Misspellers detection

In this work we consider only tweets in Spanish. Since several languages coexist in Spain, depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet contains a misspelling, we need to establish which patterns to look for in our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we do not consider the following cases:

• Lack of written accents. People tend to avoid writing accents when writing in a colloquial way.

• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters qu by k. We understand that these mistakes can be motivated by the length limit of tweets and not by a real misspelling.

• In the same vein, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.

• We do not consider mistakes related to features of specific areas in Spain either. For example, in the south the pronunciation of ce and se is the same, which produces a large number of mistakes when writing. However, since we want to draw objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

Figure 9. Frequency plots for each variable constructed from Twitter: tweets mentioning unemployment, employment and job, penetration rate, social and geographical entropies, misspellers rate, and night, morning and afternoon activity.

Likewise, we consider the following mistakes as real misspellings:

• Adding letters. For example, writing an h at the beginning of a word that starts with a vowel.

• Replacing the special cases mp, mb by the wrong spellings np, nb.

• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish because the letters have the same or a very close pronunciation.

Figure 8. Top: Percentage of population in each age group (16-24, 25-34, 35-44, 45-54, 55-64) from the Spanish Census (dark bars) and from surveys about users in Twitter (light bars). Bottom: Performance of the linear models (predicted versus observed unemployment, in %) for each of the age groups: all ages (R² = 0.47), < 24 (R² = 0.62), 25-44 (R² = 0.52) and > 44 (R² = 0.26).

• Confusing the verb haber with the periphrasis a ver.

• Separating a word into two, for instance writing the word conmigo as con migo.

In this way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus, one can expect that this selection provides an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27055 (5.6% of the whole population).
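The detection step can be sketched as a dictionary lookup over each user's tweets. The three entries below are a hypothetical stand-in for the 617-entry list, which is not reproduced here, and the function names are illustrative only. The language filter (textcat in R) is assumed to have been applied beforehand.

```python
import re

# Hypothetical stand-in for the list of common misspellings (wrong form -> correct form).
MISSPELLINGS = {
    "con migo": "conmigo",  # word split into two
    "canbio": "cambio",     # "nb" written instead of "mb"
    "haver": "haber",       # "v" written instead of "b"
}

# One case-insensitive pattern that matches any listed misspelling as a whole word.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(m) for m in MISSPELLINGS) + r")\b",
    re.IGNORECASE,
)


def has_misspelling(tweet):
    """True if the tweet contains at least one listed misspelling."""
    return _PATTERN.search(tweet) is not None


def misspellers_rate(tweets_by_user, population):
    """Users with at least one misspelled tweet, per 100000 inhabitants."""
    misspellers = sum(
        1 for tweets in tweets_by_user.values()
        if any(has_misspelling(t) for t in tweets)
    )
    return 100_000 * misspellers / population


# Hypothetical area with two users and 1000 inhabitants.
area = {"user_1": ["Hay un canbio de planes"], "user_2": ["Todo bien"]}
rate = misspellers_rate(area, population=1000)
```

Matching whole words keeps correctly spelled words that merely contain a listed string from being counted.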

Figure 10. Top: Correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations with 95% confidence. Bottom: Projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.

We analyze whether misspellers use Twitter differently from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, it can be observed that misspellers tend to publish a larger number of tweets than those who did not make mistakes (14471 against 2372). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period, the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).
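A sub-linear scaling exponent of this kind is commonly estimated as the slope of a least-squares line in log-log space. The sketch below assumes that approach, which may differ from the fitting procedure actually used here. The function name and the synthetic data are hypothetical, and bins with zero misspellings would have to be excluded before taking logarithms.

```python
from math import log


def scaling_exponent(tweets, misspellings):
    """Least-squares slope of log(misspellings) against log(tweets)."""
    xs = [log(t) for t in tweets]
    ys = [log(m) for m in misspellings]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs)
    )


# Synthetic data that follows an exact power law with exponent 0.33.
tweet_counts = [30, 60, 120, 240, 480]
mean_misspellings = [2 * t ** 0.33 for t in tweet_counts]
exponent = scaling_exponent(tweet_counts, mean_misspellings)
```

An exponent below 1 means that doubling the number of tweets less than doubles the mean number of misspellings.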

Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspellers rate as a proxy of the educational level of the cities. A large number of previous works have revealed the relationship between the economic status and the educational level of geographical areas, and therefore it is natural to ask whether the observed misspellers rate is related to the economy, as measured by the unemployment rate. To test this hypothesis we consider cities with more than 5000 inhabitants to avoid undersampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).

Figure 11. Number (red) and probability (blue) of observed misspellings given the number of tweets.

Figure 12. Explanatory power (R²) of the linear regression model when fitted against the unemployment data for different months (October 2012 to June 2014). The gray (orange) area corresponds to the time window in which Twitter data was collected and the variables were constructed.

S7 Time window and unemployment

In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment measured at different months within the time window in which Twitter data was collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory power of the model when the linear regression is done for values of unemployment at different months before, during and after the Twitter data time window. Although there is a small seasonal effect along the year, we see that the explanatory power remains around R² = 0.6, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that R² decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.

S8 Demographics does not explain unemployment

Since unemployment rates are very large for the group of young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found across the geographical areas. To test this, we have built four linear models: the first (named Youth model in Table 5) uses the rate of young population as the only explanatory variable; the second and third are built only on Twitter variables, either all those considered in the main text (named Twitter model (I)) or just those whose regression coefficients are statistically significant (Twitter model (II)); the fourth is fitted with all the variables (named All variables model in Table 5). Table 5 shows the summary of the regression for each model. Focusing on the variance explained by the model in terms of R², considering all Twitter variables is almost three times more explanatory than considering only the proportion of young people. On the other hand, comparing R² for the Twitter model with that of the All variables and Youth models shows that the rate of young population does not provide significant explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.
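The semi-partial comparison amounts to differences of R² between nested models. The short sketch below takes the R² values reported in Table 5 as inputs; the dictionary keys and the helper name are hypothetical.

```python
# R² values as reported in Table 5.
R2 = {"all_variables": 0.65, "youth": 0.24, "twitter_I": 0.64}


def r2_gain(r2_full, r2_reduced):
    """Extra variance explained by the full model over a nested reduced model."""
    return r2_full - r2_reduced


# Adding the young population rate to the Twitter variables.
gain_from_youth = r2_gain(R2["all_variables"], R2["twitter_I"])
# Adding the Twitter variables to the young population rate.
gain_from_twitter = r2_gain(R2["all_variables"], R2["youth"])
```

The first gain is about 0.01 and the second about 0.41, which is the asymmetry the paragraph above describes.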

S9 Unemployment models for other geographical areas

While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we considered is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment for the variables defined in those administrative areas and relate it to the geographical communities detected and used in the main paper (see section 4). Not all the areas in the different administrative divisions are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1738 municipalities which have a Twitter population π > 10. Similarly, we only consider the 198 counties with π > 100. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, R² increases as the number of areas in the model gets smaller, but the level of geographical detail of the model is then very low, for provinces for example. The best performance (high R² and high level of geographical detail) is attained at the level of the detected communities.
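The selection of areas by Twitter population is a plain threshold filter. The sketch below uses the thresholds stated above (π > 10 for municipalities, π > 100 for counties); the area names and populations are hypothetical.

```python
def select_areas(twitter_population, min_population):
    """Keep the areas whose Twitter population pi is strictly above the threshold."""
    return [name for name, pi in twitter_population.items() if pi > min_population]


# Hypothetical Twitter populations for three municipalities and two counties.
municipalities = {"A": 4, "B": 10, "C": 57}
counties = {"X": 150, "Y": 99}

kept_municipalities = select_areas(municipalities, min_population=10)
kept_counties = select_areas(counties, min_population=100)
```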

S10 Relative importance of the variables

To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:

1. (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.

2. (lmg) Averaging over orderings, as proposed by Lindeman, Merenda and Gold.

3. (pmvd) The PMVD metric introduced by Feldman, which is an average over orderings as well, but with data-dependent weights.

4. (first) The univariate R² values from regression models with one variable only.

All these metrics are obtained using the relaimpo R package [12]. The results for the young unemployment model are shown in figure 13, where we can see that the different methods yield similar relative importance of the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.

Figure 13. Relative importance of the variables (in percentage) in the unemployment model for the different ways of calculating it (weight, first, lmg, pmvd).
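The two simplest metrics in the list, weight and first, can be sketched without any regression library. This is only an illustration and does not reproduce the relaimpo implementation: the standardized coefficients are hypothetical inputs, and the univariate R² is computed as the squared Pearson correlation, which holds for a regression with a single predictor and an intercept.

```python
def relative_weights(std_coefficients):
    """'weight': share (in %) of each absolute standardized regression coefficient."""
    total = sum(abs(c) for c in std_coefficients.values())
    return {name: 100 * abs(c) / total for name, c in std_coefficients.items()}


def univariate_r2(x, y):
    """'first': R² of a one-variable linear regression (squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov ** 2 / (var_x * var_y)


# Hypothetical standardized coefficients for two variables.
weights = relative_weights({"penetration": 0.3, "morning": -0.1})
# Hypothetical predictor that is an exact linear function of the response.
r2_first = univariate_r2([1, 2, 3, 4], [3, 5, 7, 9])
```

The lmg and pmvd metrics additionally average R² contributions over orderings of the predictors, which is why a dedicated package is the practical choice for them.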

References

[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.

[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.

[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.

[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.

[6] Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.

[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html

[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm

[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.

[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.

[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.

[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.

[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.

[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.

[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.

[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences-ISCIS 2005, pages 284–293. Springer, 2005.

[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.

[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.

[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.

[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.

[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.

[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.

[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.

[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.

Gravity Model
Parameter   Description                            Spain
α1          Origin exponent                        0.477*** (0.002)
α2          Destination exponent                   0.478*** (0.002)
β           Distance exponent                      1.05*** (0.0035)
R²          Goodness of fit                        0.797
φ           Correlation between Tij and T^gra_ij   0.826

Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.

NMI between G and Gp for different p
Algorithm   p = 0.01   0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09    0.1
FG          0.995      0.992   0.989   0.983   0.981   0.977   0.983   0.969   0.980   0.959
WT          0.954      0.959   0.950   0.954   0.945   0.948   0.947   0.935   0.926   0.931
IM          0.988      0.981   0.980   0.981   0.978   0.974   0.975   0.970   0.969   0.966
ML          0.994      0.978   0.979   0.983   0.948   0.934   0.972   0.952   0.973   0.947
LP          0.906      0.908   0.911   0.915   0.895   0.907   0.907   0.893   0.905   0.904
LE          0.960      0.957   0.956   0.859   0.910   0.892   0.908   0.858   0.885   0.884

Table 2. NMI measure comparing G and Gp.

Communities Stats
Algorithm   〈|Ni|〉i     max|Ni|   No. of communities   Modularity   NMI P   NMI C
FG          309.696     1385      23                   0.726        0.712   0.590
WT          9.262       433       769                  0.417        0.744   0.757
IM          21.011      143       339                  0.758        0.770   0.831
ML          323.772     1132      22                   0.800        0.717   0.599
LP          22.052      750       323                  0.732        0.749   0.761
LE          1017.571    5344      7                    0.381        0.264   0.205

Table 3. Statistics of the communities Ni returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.

                         All ages    < 24        25-44       > 44
(Intercept)              0.11***     0.10***     0.20***     0.20***
                         (0.02)      (0.03)      (0.03)      (0.035)
Penetration rate         3.23*       8.57***     6.28**      2.40
                         (1.41)      (2.22)      (2.17)      (2.77)
Geographical diversity   0.03        0.15***     0.08*       0.06
                         (0.02)      (0.04)      (0.04)      (0.05)
Social diversity         −0.03*      −0.03       −0.05*      −0.06*
                         (0.01)      (0.02)      (0.02)      (0.03)
Morning activity         −0.69*      −1.30**     −1.53***    −1.19*
                         (0.26)      (0.42)      (0.41)      (0.52)
Misspellers rate         11.56       31.51*      15.46       23.60
                         (8.13)      (12.78)     (12.48)     (15.94)
Employment mentions      −1.80       3.17        −9.94       2.71
                         (6.27)      (9.86)      (9.64)      (12.3)
R²                       0.47        0.64        0.55        0.29
Adj. R²                  0.44        0.62        0.52        0.26
***p < 0.001, **p < 0.01, *p < 0.05

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in groups of less than 24 years, between 25 and 44 years, and above 44 years.

                         All variables   Youth model   Twitter model (I)   Twitter model (II)
(Intercept)              0.06            −0.02         0.10***             0.09***
                         (0.03)          (0.03)        (0.03)              (0.027)
Young pop. rate          0.66*           2.20***
                         (0.30)          (0.35)
Penetration rate         8.20***                       8.57***             8.62***
                         (2.25)                        (2.22)              (2.21)
Geographical diversity   0.14***                       0.15***             0.12***
                         (0.04)                        (0.04)              (0.03)
Social diversity         −0.02                         −0.03
                         (0.02)                        (0.02)
Morning activity         −1.42***                      −1.30**             −1.28**
                         (0.41)                        (0.42)              (0.41)
Misspellers rate         23.95                         31.51*              32.28*
                         (13.09)                       (12.78)             (12.71)
Employment mentions      0.34                          3.17
                         (9.81)                        (9.86)
R²                       0.65            0.24          0.64                0.63
Adj. R²                  0.63            0.24          0.62                0.62
***p < 0.001, **p < 0.01, *p < 0.05

Table 5. Regression table for the different statistical models. The All variables model includes both Twitter and rate of young population variables. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).

                         Communities   Municipalities   Counties    Provinces
(Intercept)              0.10***       0.16***          0.11***     0.11*
                         (0.03)        (0.01)           (0.03)      (0.05)
Penetration rate         8.57***       4.01***          9.12***     10.47***
                         (2.22)        (0.59)           (1.81)      (1.97)
Geographical diversity   0.15***       0.02             0.12***     0.08
                         (0.04)        (0.01)           (0.03)      (0.07)
Social diversity         −0.03         −0.01            −0.01       −0.03
                         (0.02)        (0.01)           (0.02)      (0.07)
Morning activity         −1.30**       −1.16***         −1.49***    −1.03
                         (0.42)        (0.14)           (0.39)      (0.88)
Misspellers rate         31.51*        14.40***         14.09
                         (12.78)       (2.51)           (10.02)
Employment mentions      3.17          −0.71            2.41        −3.17
                         (9.86)        (0.89)           (8.86)      (12.29)
Number of points         128           1738             198         50
R²                       0.64          0.22             0.55        0.65
Adj. R²                  0.62          0.21             0.54        0.61
***p < 0.001, **p < 0.01, *p < 0.05

Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model, the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.

  • 1 Social media dataset and functional partition of cities
  • 2 Social media behavioral fingerprints
  • 3 Explanatory power of social media in unemployment
  • 4 Discussion
Page 12: Social media ngerprints of unemployment · 2014. 11. 20. · Social media ngerprints of unemployment | 2/19 Figure 1. A) Map of the mobility fluxes Tij between municipalities based

Social media fingerprints of unemployment mdash 1219

Figure 7 From left to right and from top to bottom Fastgreedy Walktrap Infomap Multilevel Label Propagation and LeadingEigenvector communities on Twitter based mobility transitions

microi are also given per 100000 tweets published in the geographicalarea

Correlation between variables

Variables are constructed to reflect the behavior of areas in the dif-ferent dimensions of Twitter penetration social or geographicaldiversity activity through the day and content Correlation betweenvariables does indeed show that variables within each dimensionshold strong correlations between them As we can see in figure 10social and geographical diversities are highly correlated betweenthem an expected fact given the gravity law accurate descriptionof flows of people between geographical areas but also the amountof communication between them Same behavior is found for thegroup of variables in the activity group while content variables areless correlated Finally we find that both the penetration rate τi andfraction of misspellers εi have a strong correlation with most of thevariables

High correlation between variables might lead to collinearityeffects [24] in the linear regression models that is some variableswith predictive variable might have non-significant weights becausethey explain the same part of the variance For instance in Table5 misspellers rate has a very strong predictive value but its p-valueis too high to consider it significant To test this hypothesis weperform a principal component analysis (PCA) on the independentvariables of the regression Figure 10 exhibits the loadings of thedifferent variables for the considered variables The block structureshowed in 10 results in similar directions of the variables in the firstcomponentes of the PCA We observe some groups of variables onthe one hand geographical and social diversity seem to explain largepart of the variance on the other hand we find a perpendicular group

of variables formed by temporal activity finally penetration rateand misspellers fraction seem to represent a different independentdirection of data with high collinearity between them This mightexplain the low statistical significance in the models of section 4 Inany case the structure of the correlation matrix and the PCA resultsshow that there is indeed information in all groups of variables andthus we have take a variable in each of them for our regressionmodels

S6 Misspellers detection

In this work we will consider only tweets in Spanish that is sincein Spain several languages live at the same time depending on thepart of the country the first step is to reduce our Twitter dataset tothose tweets that are written in Spanish This task is carried out usingthe n-gram based text categorization R library textcat [11] Then inorder to decide whether a tweet has a misspelling or not we needto establish some patterns to select from our set of tweets Sincewe want to be sure that a detected mistake corresponds to a realmisspeller we will not consider the following cases

bull Lack of written accents People tend to avoid writing accentswhen talking in a colloquial way

bull Mistakes derived from removing unnecessary letters Themost common cases are removing a h at the beginning of aword (in Spanish the letter h is not pronounced) or replacingthe letters qu by k We understand that these mistakes can bemotivated for the limitation of length in tweets and not for areal misspelling

bull In the same line we neglect mistakes produced by removing

Social media fingerprints of unemployment mdash 1319

0

10

20

30

40

0 100 200 300x = Tweets (unemployment)

P(x)

0

5

10

15

500 1000 1500x = Penetration rate

P(x)

0

5

10

15

2 4 6 8x = Entropy2 (social)

P(x)

0

5

10

15

025 050 075x = Entropy1 (social)

P(x)

0

5

10

15

4 6 8x = Entropy2 (geo)

P(x)

0

5

10

15

01 02 03 04x = Entropy1 (geo)

P(x)

0

5

10

15

0 50 100 150 200x = Misspellers rate

P(x)

0

20

40

60

0 100 200x = Tweets (employment)

P(x)

0

5

10

15

20

250 500 750x = Tweets (job)

P(x)

0

5

10

15

20

2 3 4 5 6x = Activity (night)

P(x)

0

5

10

15

20

4 5 6 7x = Activity (morning)

P(x)

0

5

10

15

35 40 45 50x = Activity (afternoon)

P(x)

i SuiSri

Sri Suimrngi

aftni ngti

imicrojobi

microempi

microunempi

x x x

x x x

x x x

x x x

Figure 9 Frequency plots for each variable constructed from Twitter

letters in the middle of a word whose pronunciation can bededuced without them

bull We do not consider either mistakes related to features ofspecific areas in Spain For example in the south the pronun-ciation of ce and se is the same what produces a big amountof mistakes when writing However since we want to extractobjective and equitable conclusion over the whole Spanishgeography we neglect those misspellings that only appear ina specific area

Likewise, we will consider the following mistakes as real misspellings:

• Adding letters. For example, writing an "h" at the beginning of a word that starts with a vowel.

• Replacing the special cases "mp" and "mb" with the wrong spellings "np" and "nb".

• Mixing up "b" with "v", "g" with "j", "ll" with "y", and "ex" with "es". These are typical mistakes in Spanish because they have the same or a very close pronunciation.

• Confusing the verb "haber" with the periphrasis "a ver".

• Separating a word into two, for instance writing the word "conmigo" as "con migo".

Figure 8. Top: Percentage of population in each age group (16-24, 25-34, 35-44, 45-54, 55-64) from the Spanish Census (dark bars) and from surveys about Twitter users (light bars). Bottom: Performance of the linear models for each of the age groups, shown as predicted versus observed unemployment (%): All ages (R² = 0.47), < 24 (R² = 0.62), 25-44 (R² = 0.52), > 44 (R² = 0.26).

In this way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus, one can expect this selection to provide an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27055 (5.6% of the whole population).
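The selection procedure described above amounts to matching each user's tweets against a fixed list of misspelled forms and flagging every user with at least one hit. A minimal sketch of that idea follows. The three-entry word list, the sample tweets, and the function names are illustrative assumptions; they are not the 617-entry list or the code used in the paper.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the curated list of misspelled forms.
# The real list has 617 entries and is not reproduced here.
MISSPELLINGS = {"haber si", "con migo", "tanbien"}


def tokenize(text):
    """Lowercase a tweet and split it into word tokens."""
    return re.findall(r"[a-záéíóúüñ]+", text.lower())


def count_misspellings(text, misspellings=MISSPELLINGS):
    """Count listed one-word and two-word misspellings found in a tweet."""
    tokens = tokenize(text)
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return sum(1 for item in tokens + bigrams if item in misspellings)


def misspeller_rate(tweets):
    """tweets: iterable of (user_id, text) pairs.

    Returns the set of users with at least one misspelling and the
    fraction of all observed users that this set represents.
    """
    per_user = defaultdict(int)
    for user, text in tweets:
        per_user[user] += count_misspellings(text)
    misspellers = {user for user, hits in per_user.items() if hits > 0}
    rate = len(misspellers) / len(per_user) if per_user else 0.0
    return misspellers, rate


if __name__ == "__main__":
    sample = [
        ("u1", "Voy con migo mismo"),
        ("u1", "hola"),
        ("u2", "todo bien"),
        ("u3", "Yo tanbien voy"),
    ]
    print(misspeller_rate(sample))
```

Matching whole tokens and token pairs, rather than substrings, keeps a listed form such as "con migo" from firing inside longer words; a production version would also need to handle accents and Twitter-specific tokens, which this sketch ignores.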

We analyze whether misspellers show different Twitter usage behavior from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, it can be observed that misspellers tend to publish a larger number of tweets than those who did not make mistakes (14471 against 2372). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period, the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).

Figure 10. Top: Correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations at 95% confidence. Bottom: Projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.
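A sub-linear scaling exponent such as the one quoted for misspellings versus number of tweets can be estimated with a least-squares fit in log-log space. The sketch below shows one common way to do this; it is not necessarily the procedure the authors followed, and the synthetic data and the function name are assumptions made for illustration.

```python
import math


def fit_power_law_exponent(x, y):
    """Fit y ~ a * x**b by least squares on (log x, log y).

    Points with non-positive x or y are skipped because their logarithm
    is undefined. Returns (b, a): the exponent and the prefactor.
    """
    pts = [(math.log(xi), math.log(yi))
           for xi, yi in zip(x, y) if xi > 0 and yi > 0]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two points with positive x and y")
    mean_x = sum(p[0] for p in pts) / n
    mean_y = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in pts)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in pts)
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return b, a


# Synthetic example (not the paper's data): exact power law with exponent 0.33.
tweets = [30, 50, 100, 200, 500]
mean_misspellings = [0.5 * t ** 0.33 for t in tweets]
exponent, prefactor = fit_power_law_exponent(tweets, mean_misspellings)
```

An exponent below 1 means that doubling the number of tweets less than doubles the mean number of misspellings, which is what "sub-linear" refers to here.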

Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspeller rate as a proxy for the educational level of the cities. A large number of previous works in the literature have revealed the relationship between the economic status and the educational level of geographical areas, and it is therefore natural to ask whether the observed misspeller rate is related to the economy, as measured by the unemployment rate. To test this hypothesis, we consider cities with more than 5000 inhabitants to avoid undersampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).
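A correlation of this kind is often reported together with an interval. The sketch below computes a Pearson correlation and a Fisher z-transformation confidence interval using only the Python standard library. It is an assumption that the pair of values quoted above is such an interval, since the text does not say how it was obtained; the function names and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation r from n samples.

    Uses the Fisher z-transformation, atanh(r), whose standard error is
    roughly 1 / sqrt(n - 3). Requires n > 3 and |r| < 1.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    q = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.tanh(z - q * se), math.tanh(z + q * se)


# Hypothetical numbers for illustration only.
low, high = fisher_ci(0.43, 200)
```

The interval narrows as the number of cities grows, which is one reason to restrict the analysis to cities with enough inhabitants to be sampled reliably.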

Figure 11. Number N(miss|tweets) (red) and probability P(miss|tweets) (blue) of observed misspellings given the number of tweets.

Figure 12. Explanatory power (R²) of the linear regression model when fitted against the unemployment data for different months (October 2012 to June 2014). The gray (orange) area corresponds to the time window in which Twitter data is collected and variables are constructed.

S7 Time window and unemployment

In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment measured at different months within the same time window in which Twitter data was collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory power of the model when the linear regression is done for values of unemployment of different months before, during, and after the Twitter data time window. Although there is a small seasonal effect along the year, we see that the explanatory power remains around R² = 0.6, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that R² decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.

S8 Demographics do not explain unemployment

Since unemployment rates are very high among young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found across the geographical areas. To test this, we have built four linear models: the first one (named Youth model in Table 5) has the rate of young population as its only explanatory variable; the second and third are built only on the Twitter variables, using either all of those considered in the main text (named Twitter model (I)) or just those whose regression coefficients are statistically significant (Twitter model (II)); the fourth is fitted with all the variables (named All variables model in Table 5). Table 5 shows the summary of the regression for each model. Focusing on the variance explained by the model in terms of R², considering all Twitter variables is almost three times more explanatory than considering only the proportion of young people (R² = 0.64 versus 0.24). On the other hand, comparing the R² of the Twitter model with those of the All variables and Youth models shows that the rate of young population does not provide significant additional explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.

S9 Unemployment models for other geographical areas

While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we considered is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment for the variables defined in those administrative areas and relate it to that of the geographical communities detected and used in the main paper (see section 4). Not all the areas at the different administrative divisions are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1738 municipalities which have a Twitter population π > 10. Similarly, we only consider the 198 counties with π > 100. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, R² increases as the number of areas in the model gets smaller, but the geographical description level of the model then becomes very low, as is the case for provinces. The best performance (high R² and high geographical description level) is attained at the level of the detected communities.

S10 Relative importance of the variables

To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:

1. (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.

2. (lmg) Averaging over orderings, as proposed by Lindeman, Merenda and Gold.

3. (pmvd) The PMVD metric introduced by Feldman, which is an average over orderings as well, but with data-dependent weights.

4. (first) The univariate R² values from regression models with one variable only.

Figure 13. Relative importance of the variables (in percentage) in the unemployment model for different ways to calculate it (weight, first, lmg, pmvd).

All these metrics are obtained using the relaimpo R package [12]. The results for the young unemployment model are shown in Figure 13, where we can see that the different methods yield similar relative importance of the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.
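The source obtains these metrics from the relaimpo R package. To illustrate what the lmg metric measures, the sketch below averages, over all orderings of the variables, the increase in R² obtained when each variable is added to the model. It is a plain-Python illustration of that textbook definition, not the relaimpo implementation; the function names and toy data are assumptions, and the brute-force loop over orderings is only practical for a handful of variables.

```python
from itertools import permutations


def r_squared(x_cols, y):
    """R^2 of an ordinary least-squares fit of y on x_cols plus an intercept.

    Solves the normal equations by Gauss-Jordan elimination; this assumes
    the predictors are not perfectly collinear.
    """
    n = len(y)
    mean_y = sum(y) / n
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    if not x_cols:
        return 0.0  # intercept-only model explains no variance
    rows = [[1.0] + [col[i] for col in x_cols] for i in range(n)]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    aug = [[sum(r[a] * r[b] for r in rows) for b in range(k)]
           + [sum(r[a] * yi for r, yi in zip(rows, y))] for a in range(k)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda i: abs(aug[i][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for i in range(k):
            if i != c:
                f = aug[i][c] / aug[c][c]
                aug[i] = [u - f * v for u, v in zip(aug[i], aug[c])]
    beta = [aug[i][k] / aug[i][i] for i in range(k)]
    ss_res = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
                 for r, yi in zip(rows, y))
    return 1.0 - ss_res / ss_tot


def lmg(columns, y):
    """columns: dict mapping variable name -> list of values.

    Returns each variable's mean incremental R^2 over all orderings.
    """
    names = list(columns)
    cache = {}

    def r2(subset):
        key = frozenset(subset)
        if key not in cache:
            cache[key] = r_squared([columns[v] for v in sorted(key)], y)
        return cache[key]

    shares = dict.fromkeys(names, 0.0)
    orderings = list(permutations(names))
    for order in orderings:
        for i, name in enumerate(order):
            shares[name] += r2(order[:i + 1]) - r2(order[:i])
    return {name: total / len(orderings) for name, total in shares.items()}


# Toy data for illustration only (not the paper's variables).
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [8.1, 6.9, 18.2, 16.8, 28.1, 26.9]
shares = lmg({"x1": x1, "x2": x2}, y)
```

By construction the shares add up to the R² of the full model, which is why this family of metrics can be read as a decomposition of the explained variance and reported in percentages.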

References

[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59-67, 1987.

[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86-86, 1997.

[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.

[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.

[6] Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.

[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html

[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm

[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39-61, 1973.

[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663-7668, 2011.

[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1-17, 2013.

[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1-27, 2006.

[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.

[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.

[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.

[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences - ISCIS 2005, pages 284-293. Springer, 2005.

[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.

[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118-1123, 2008.

[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51-56, 1959.

[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96-100, 2012.

[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818-823, 2010.

[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.

[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.

[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735-743, 1984.

Gravity Model
Parameter   Description                            Spain
α1          Origin exponent                        0.477*** (0.002)
α2          Destination exponent                   0.478*** (0.002)
β           Distance exponent                      1.05*** (0.0035)
R²          Goodness of fit                        0.797
φ           Correlation between Tij and Tij^gra    0.826

Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.

NMI between G and Gp for different p
Algorithm   p = 0.01   0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09    0.1
FG          0.995      0.992   0.989   0.983   0.981   0.977   0.983   0.969   0.980   0.959
WT          0.954      0.959   0.950   0.954   0.945   0.948   0.947   0.935   0.926   0.931
IM          0.988      0.981   0.980   0.981   0.978   0.974   0.975   0.970   0.969   0.966
ML          0.994      0.978   0.979   0.983   0.948   0.934   0.972   0.952   0.973   0.947
LP          0.906      0.908   0.911   0.915   0.895   0.907   0.907   0.893   0.905   0.904
LE          0.960      0.957   0.956   0.859   0.910   0.892   0.908   0.858   0.885   0.884

Table 2. NMI measure comparing G and Gp.
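For readers who want to reproduce this kind of comparison, normalized mutual information (NMI) between two partitions can be computed from the label counts alone. The sketch below uses the common normalization 2·I(A;B) / (H(A) + H(B)); the source does not state which normalization was used for Tables 2 and 3, so this choice, the function name, and the toy labels are assumptions.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitions of the same nodes.

    labels_a[i] and labels_b[i] are the community labels of node i in each
    partition. Returns a value in [0, 1]; 1 means identical partitions.
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label sequences must be non-empty and equally long")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Entropies of each partition.
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    # Mutual information from the joint label counts.
    mi = sum(c / n * math.log(c * n / (count_a[a] * count_b[b]))
             for (a, b), c in joint.items())
    if h_a + h_b == 0:
        return 1.0  # both partitions put every node in one community
    return 2.0 * mi / (h_a + h_b)


# Toy partitions of four nodes (illustrative only).
same_up_to_relabeling = nmi([0, 0, 1, 1], ["x", "x", "y", "y"])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Because NMI depends only on which nodes are grouped together, relabeling the communities leaves the value unchanged, which makes it suitable for comparing the output of different detection algorithms.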

Communities Stats
Algorithm   〈|Ni|〉i     max|Ni|   |Ni|   Modularity   NMI P   NMI C
FG          309.696     1385      23     0.726        0.712   0.590
WT          9.262       433       769    0.417        0.744   0.757
IM          21.011      143       339    0.758        0.770   0.831
ML          323.772     1132      22     0.800        0.717   0.599
LP          22.052      750       323    0.732        0.749   0.761
LE          1017.571    5344      7      0.381        0.264   0.205

Table 3. Statistics of the communities Ni returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.

                         All ages          < 24              25-44             > 44
(Intercept)              0.11*** (0.02)    0.10*** (0.03)    0.20*** (0.03)    0.20*** (0.035)
Penetration rate         3.23* (1.41)      8.57*** (2.22)    6.28** (2.17)     2.40 (2.77)
Geographical diversity   0.03 (0.02)       0.15*** (0.04)    0.08* (0.04)      0.06 (0.05)
Social diversity         -0.03* (0.01)     -0.03 (0.02)      -0.05* (0.02)     -0.06* (0.03)
Morning activity         -0.69* (0.26)     -1.30** (0.42)    -1.53*** (0.41)   -1.19* (0.52)
Misspellers rate         11.56 (8.13)      31.51* (12.78)    15.46 (12.48)     23.60 (15.94)
Employment mentions      -1.80 (6.27)      3.17 (9.86)       -9.94 (9.64)      2.71 (12.3)
R²                       0.47              0.64              0.55              0.29
Adj. R²                  0.44              0.62              0.52              0.26
Standard errors in parentheses. ***p < 0.001, **p < 0.01, *p < 0.05

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in the groups of less than 24 years, between 25 and 44 years, and above 44 years.

                         All variables     Youth model       Twitter model (I)   Twitter model (II)
(Intercept)              0.06 (0.03)       -0.02 (0.03)      0.10*** (0.03)      0.09*** (0.027)
Young pop. rate          0.66* (0.30)      2.20*** (0.35)
Penetration rate         8.20*** (2.25)                      8.57*** (2.22)      8.62*** (2.21)
Geographical diversity   0.14*** (0.04)                      0.15*** (0.04)      0.12*** (0.03)
Social diversity         -0.02 (0.02)                        -0.03 (0.02)
Morning activity         -1.42*** (0.41)                     -1.30** (0.42)      -1.28** (0.41)
Misspellers rate         23.95 (13.09)                       31.51* (12.78)      32.28* (12.71)
Employment mentions      0.34 (9.81)                         3.17 (9.86)
R²                       0.65              0.24              0.64                0.63
Adj. R²                  0.63              0.24              0.62                0.62
Standard errors in parentheses. ***p < 0.001, **p < 0.01, *p < 0.05

Table 5. Regression table for the different statistical models. The All variables model includes both the Twitter variables and the rate of young population. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).

                         Communities       Municipalities    Counties          Provinces
(Intercept)              0.10*** (0.03)    0.16*** (0.01)    0.11*** (0.03)    0.11* (0.05)
Penetration rate         8.57*** (2.22)    4.01*** (0.59)    9.12*** (1.81)    10.47*** (1.97)
Geographical diversity   0.15*** (0.04)    0.02 (0.01)       0.12*** (0.03)    0.08 (0.07)
Social diversity         -0.03 (0.02)      -0.01 (0.01)      -0.01 (0.02)      -0.03 (0.07)
Morning activity         -1.30** (0.42)    -1.16*** (0.14)   -1.49*** (0.39)   -1.03 (0.88)
Misspellers rate         31.51* (12.78)    14.40*** (2.51)   14.09 (10.02)
Employment mentions      3.17 (9.86)       -0.71 (0.89)      2.41 (8.86)       -3.17 (12.29)
Number of points         128               1738              198               50
R²                       0.64              0.22              0.55              0.65
Adj. R²                  0.62              0.21              0.54              0.61
Standard errors in parentheses. ***p < 0.001, **p < 0.01, *p < 0.05

Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model, the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.


Social media fingerprints of unemployment mdash 1319

0

10

20

30

40

0 100 200 300x = Tweets (unemployment)

P(x)

0

5

10

15

500 1000 1500x = Penetration rate

P(x)

0

5

10

15

2 4 6 8x = Entropy2 (social)

P(x)

0

5

10

15

025 050 075x = Entropy1 (social)

P(x)

0

5

10

15

4 6 8x = Entropy2 (geo)

P(x)

0

5

10

15

01 02 03 04x = Entropy1 (geo)

P(x)

0

5

10

15

0 50 100 150 200x = Misspellers rate

P(x)

0

20

40

60

0 100 200x = Tweets (employment)

P(x)

0

5

10

15

20

250 500 750x = Tweets (job)

P(x)

0

5

10

15

20

2 3 4 5 6x = Activity (night)

P(x)

0

5

10

15

20

4 5 6 7x = Activity (morning)

P(x)

0

5

10

15

35 40 45 50x = Activity (afternoon)

P(x)

i SuiSri

Sri Suimrngi

aftni ngti

imicrojobi

microempi

microunempi

x x x

x x x

x x x

x x x

Figure 9 Frequency plots for each variable constructed from Twitter

letters in the middle of a word whose pronunciation can bededuced without them

bull We do not consider either mistakes related to features ofspecific areas in Spain For example in the south the pronun-ciation of ce and se is the same what produces a big amountof mistakes when writing However since we want to extractobjective and equitable conclusion over the whole Spanishgeography we neglect those misspellings that only appear ina specific area

Likewise we will consider as real misspellings the followingmistakes

bull Adding letters For example writing a h at the beginning of aword that starts with a vowel

bull Changing the special cases mp mb by the wrong writings npnb

bull Mixing up b with v g with j ll with y and ex with es Theseare typical mistakes in Spanish because they have the sameor a very close pronunciation

Social media fingerprints of unemployment mdash 1419

0

10

20

30

16minus24 25minus34 35minus44 45minus54 55minus64r

f

000

025

050

075

100llena

Age group

Perc

enta

ge o

f pop

ulat

ion Census

Twitter

10

15

20

25

10 15 20 25x

y

10

15

20

25

10 15 20 25x

y

5

10

15

20

5 10 15 20x

y

5

10

15

2025

5 10 15 20 25x

y

All ages lt 24

25-44 gt 44

Observed Unemployment () Observed Unemployment ()

Observed Unemployment ()Observed Unemployment ()

Pred

icte

d Un

empl

oym

ent (

)

Pred

icte

d Un

empl

oym

ent (

)

Pred

icte

d Un

empl

oym

ent (

)

Pred

icte

d Un

empl

oym

ent (

)

R2 = 047 R2 = 062

R2 = 052 R2 = 026

Figure 8 Top Percentage of population in each age groupfrom the Spanish Census (dark bars) and surveys about usersin Twitter (light bars) Bottom performance of the linearmodels for each of the age groups

bull Confusing the verb haber with the periphrasis a ver

bull Separating a word into two ones for instance writing theword conmigo as con migo

This way our list of mispellings is composed of 617 common mis-takes in Spanish that cannot be attributed to the special featuresof Twitter or a specific region of Spain Thus one can expect thatthis selection provides an accurate and equitable method of detect-ing misspellers Under these conditions the number of users whowrote at least one misspelled word is 27055 (56 over the wholepopulation)

We analyze whether misspellers have different Twitter usagebehavior from that people who do not make serious mistakes whenpublishing a tweet Comparing the average number of tweets itcan be observed that misspellers tend to publish a larger numberof tweets than those who did not made mistakes (14471 against2372) This also emerges when the mean number of misspellinggiven the total number of tweets is considered For users with lessthan approximately 30 published tweets in the observation period thenumber of misspellings is almost zero whereas for users who publishmore often the mean number of misspellings scales sub-linearly

minus1

minus08

minus06

minus04

minus02

0

02

04

06

08

1rtwpen

sio

sior

siosocial

siorsocial

manana

tarde

madrugada

fmiss

job

emp

unem

p

eco

rtwpensiosior

siosocialsiorsocialmanana

tardemadrugada

fmissjobemp

unempeco

i

Sui

Sri

Sri

Sui

i

microecoi

microunempi

microempi

microjobi

ngti

aftni

mrngi

minus3 minus2 minus1 0 1 2 3

minus3minus2

minus10

12

3

First Principal component

Seco

nd P

rinci

pal C

ompo

nent

6

26

47

70 92123

145

155185

209

213

251

257 279305318339

371

396411

423

455

466

490

507

546552

568

597

621

637672

681

710

718738

772

781

805828

849

876

900

919

925

966979

1002 1021

1047

1064

1085

1109

1126

1137

1164

11861200

12361249

1264

1300

1318

1336

13541386

1406

1425

1444

1457

14711504

15141554

1572

1584

1611

1622

16571667

1686

1721

1739

1800

1823

1837

1863

18831930

1949

1968

20062036

2042

2067

2181

2206

22442264

2305

2331

23322413 2435

2456

2512

2554

2569

25982617

26522670

26972721 2748

2790

2875

2917

2939

29492976

2989

3025

3132

3228

3245

3331

3347

3381

3451

3484

3519

3613

3837

38653987

4326

minus05 00 05

minus05

00

05

sio

sior

siosocial

siorsocial

rtwpen

manana

tarde

madrugada

fmiss

jobemp

unemp

eco

ngt

microunemp

aftn

microempmicroeco

microjobmrng

Si

Si

Sir

˜Sir

domingo 12 de octubre de 14

Figure 10 Top Correlation matrix between the vari-ables constructed from Twitter Each entry in the matrixis depicted as a circle whose size is proportional to thecorrelation between variables and the sign is bluered forpositivenegative correlations Blank entries correspond tostatistically insignificant correlations with 95 confidenceBottom Variables projection on the first two principal com-ponents given by PCA We observe different groups of vari-ables and collinearity between some of them

with the number of tweets (exponent asymp 033)

Since we have observed a segmentation of Twitter populationbased on how accurate they write we consider the misspeller rate as aproxy of the educational level of the cities Large number of previousworks in the literature have revealed the relationship between theeconomical status and the educational level of geographical areasand therefore it is natural to ask whether the observed misspellersrate is related to economy driven by the unemployment rate To testthis hypothesis we consider cities populated with more than 5000inhabitants to avoid subsampled cases We find a strong positivecorrelation between the probability of finding a misspeller in a cityand the unemployment rate (0372 0491)

Social media fingerprints of unemployment mdash 1519

2 5 10 20 50 200 500

10

15

20

30

40

tweets

N(m

iss|

tw

eets

)

2 5 10 20 50 200 500

000

50

050

050

0

tweets

P(m

iss|

twee

ts)

2 50 500

Figure 11 Number (red) and probability (blue) of ob-served misspellings given the number of tweets

050

055

060

065

070

201210201211201212201301201302201303201304201305201306201307201308201309201310201311201312201401201402201403201404201405201406

month

r2R2

month

Figure 12 Explanatory power of the linear regressionmodel when fitted against the unemployment data for dif-ferent months Gray (orange) area correspond to the timewindow in which Twitter data is collected and variables areconstructed

S7 Time window and unemployment

In the definition of the variables we have aggregated the Twitter ac-tivity within a 7 months time window (from December 2012 to June2013) Since unemployment has a significant variation along timewe investigate here what is the correlation and explanatory powerof the Twitter variables for the values of unemployment determinedat different months through the same time window in which Twitterdata was collected Or if the variables collected in that time windoware more correlated with past or future values of unemploymentFigure 12 shows the explanatory value of the model when the linearregression is done for values of unemployment of different monthsbefore during and after the Twitter data time window Althoughthere is a small seasonal effect along the year we see that the ex-planatory power remains around R2 = 06 which suggest that ourTwitter linear model retains its explanatory power even though unem-ployment changes considerably throughout the year It is interestingto note that R2 decays a little bit during the summer which meansthat our variables are less correlated with summer unemploymentFinally unemployment used in the main article is from June 2013ie the last month in the time window used to collect the data

S8 Demographics does not explain unem-ployment

Since unemployment rates are very large for the group of youngpeople a natural question is whether only demographic variablescould explain the heterogeneity of young unemployment rates foundin the geographical areas To test this end we have built four linearmodels the first one (named Youth model in Table 5) is composedby the rate of young population as the only explaining variable thesecond ones are built based on only the Twitter variables consideredin the main text (named Twitter model (I)) or just with those whoseregression coefficients are statistically significant (Twitter model(II)) the third one is fitted with all the variables (named All variablesmodel in Table 5) In table 5 we show the summary of the regressionfor each model Focusing on the explained variance by the model interms of R2 it can be checked that considering all Twitter variables isthree times more explanatory than considering only the young peopleproportion On the other hand the comparison of R2 for the Twittermodel with the one for All variables and Youth model shows that therate of young population does not provide a significant explanatorypower This semi-partial analysis shows that our Twitter variablesretain a high explanatory power when the effect of young populationrate is controlled

S9 Unemployment models for other geo-graphical areas

While municipalities are very heterogeneous demographically otheradministrative areas exist in Spain at large scales that could be usedfor our model of unemployment As mentioned in section 4 thesmallest administrative division of Spain we have considered is thatof the 8200 municipalities At larger scales we have the 326 coun-ties (comarcas in spanish) which are aggregations of municipalitiesFinally the largest geographical scale we considered is defined by50 provinces (provincias in Spanish) In this section we comparethe performance of our Twitter model for unemployment for thevariables defined in those administrative areas and relate it to thegeographical communities detected and used in the main paper (seesection 4) Not all the areas at different administrative divisions areconsidered in the model To minimize the effect of areas in whichthe number of geo-tagged tweets is very small we only consider the1738 municipalities which have a Twitter population π gt 10 Simi-larly we only consider the 198 counties with π gt 100 As we can seein Table 6 the model has a large explanatory power for areas equalor bigger than counties As expected R2 increases as the number ofareas in the model is smaller but the description level of the modelis very low for provinces for example The best performance (highR2 and high geographical description level) is attained at the level ofthe detected communities

S10 Relative importance of the variables

To asses the relative importance of the variables in the unemploymentmodel we have used several methods They all give qualitatively thesame results with some variations for the statistically insignificantvariables Specifically we have use

1 (weight) Relative weight of the absolute values of the coef-ficients obtained in the linear regression when variables arescaled to have mean zero and variance one

2 (lmg) averaging over orderings proposed by LindemanMerenda and Gold

Social media fingerprints of unemployment mdash 1619

0

10

20

30

40

emp fmiss manana rtwpen sio siosocialnames

values

indabscoefffirstlmgpmvd

microemp mrng SuSu

Rela

tive

impo

rtanc

e (

)

0

10

20

30

40

emp fmiss manana rtwpen sio siosocialnames

values

indabscoefffirstlmgpmvd

weight first lmg pvmd

Figure 13 Relative importance of the variables (in per-centage) in the unemployment model for different ways tocalculate it

3 (pmvd) The PMVD metric introduced by Feldman whichan average over orderings as well but with data-dependentweights

4 (first) The univariate R2-values from regression models withone variable only

All these metrics are obtained using the relaimpo R package [12]The results for the young unemployment model are shown in figure13 where we can see that different methods yield to similar rela-tive importance of the variables excepting perhaps for the diversityof mobility flows a variable with a non-significant weight in theregression model

References

[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.

[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.

[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.

[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.

[6] Leon Danon, Albert Diaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.

[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html

[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm

[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.

[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.

[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.

[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.

[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.

[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.

[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.

[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences - ISCIS 2005, pages 284–293. Springer, 2005.

[17] Usha Nandini Raghavan, Reka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.

[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.

[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.

[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.

[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.

[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.

[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.

[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.

Gravity Model

Parameter   Description                            Spain
α1          Origin exponent                        0.477*** (0.002)
α2          Destination exponent                   0.478*** (0.002)
β           Distance exponent                      1.05*** (0.0035)
R²          Goodness of fit                        0.797
φ           Correlation between T_ij and T_ij^gra  0.826

Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.
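To make the role of the three exponents in Table 1 concrete, here is a minimal sketch that evaluates a gravity-law flux. It assumes the standard power-law form T_ij^gra ∝ P_i^α1 · P_j^α2 / d_ij^β, which is not restated in this section; the function name, the `scale` constant, the use of kilometres and the example populations are all hypothetical.

```python
def gravity_flux(pop_origin, pop_destination, distance_km,
                 alpha1=0.477, alpha2=0.478, beta=1.05, scale=1.0):
    """Gravity-law flux estimate, up to the multiplicative constant `scale`.

    Default exponents are the fitted values reported in Table 1; the
    functional form itself is an assumption of this sketch.
    """
    return (scale * pop_origin ** alpha1 * pop_destination ** alpha2
            / distance_km ** beta)


# Hypothetical pair of municipalities observed at two different distances.
near = gravity_flux(100_000, 20_000, distance_km=10.0)
far = gravity_flux(100_000, 20_000, distance_km=20.0)
decay = near / far  # equals 2 ** beta under the assumed form
```

With β close to 1, doubling the distance roughly halves the expected flux, while the near-equal origin and destination exponents make the estimate almost symmetric in the two populations.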

NMI between G and Gp for different p

Algorithm  p = 0.01  0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09   0.1
FG         0.995     0.992  0.989  0.983  0.981  0.977  0.983  0.969  0.980  0.959
WT         0.954     0.959  0.950  0.954  0.945  0.948  0.947  0.935  0.926  0.931
IM         0.988     0.981  0.980  0.981  0.978  0.974  0.975  0.970  0.969  0.966
ML         0.994     0.978  0.979  0.983  0.948  0.934  0.972  0.952  0.973  0.947
LP         0.906     0.908  0.911  0.915  0.895  0.907  0.907  0.893  0.905  0.904
LE         0.960     0.957  0.956  0.859  0.910  0.892  0.908  0.858  0.885  0.884

Table 2. NMI measure comparing G and Gp.
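For readers unfamiliar with the measure reported in Table 2, the sketch below computes a normalized mutual information (NMI) between two partitions of the same set of nodes. It assumes the normalization 2·I(A;B) / (H(A) + H(B)) used by Danon et al. [6]; this section does not state which normalization the authors applied, and the function names and toy labels are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a community assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """NMI between two partitions given as per-node community labels."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    mutual = sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    if h_a + h_b == 0:
        return 1.0  # both partitions put every node in a single community
    return 2 * mutual / (h_a + h_b)


# Toy partitions of four nodes, invented for illustration.
same_up_to_relabelling = nmi([0, 0, 1, 1], [5, 5, 7, 7])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The measure depends only on how nodes are grouped, not on the labels themselves, so it can compare the communities found on G with those found on the perturbed network Gp.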

Communities stats

Algorithm  ⟨|Ni|⟩i    max|Ni|  No. of communities  Modularity  NMI P  NMI C
FG         309.696    1385     23                  0.726       0.712  0.590
WT         9.262      433      769                 0.417       0.744  0.757
IM         21.011     143      339                 0.758       0.770  0.831
ML         323.772    1132     22                  0.800       0.717  0.599
LP         22.052     750      323                 0.732       0.749  0.761
LE         1017.571   5344     7                   0.381       0.264  0.205

Table 3. Statistics of the communities Ni returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.

                         All ages    < 24        25-44       > 44
(Intercept)              0.11***     0.10***     0.20***     0.20***
                         (0.02)      (0.03)      (0.03)      (0.035)
Penetration rate         3.23*       8.57***     6.28**      2.40
                         (1.41)      (2.22)      (2.17)      (2.77)
Geographical diversity   0.03        0.15***     0.08*       0.06
                         (0.02)      (0.04)      (0.04)      (0.05)
Social diversity         -0.03*      -0.03       -0.05*      -0.06*
                         (0.01)      (0.02)      (0.02)      (0.03)
Morning activity         -0.69*      -1.30**     -1.53***    -1.19*
                         (0.26)      (0.42)      (0.41)      (0.52)
Misspellers rate         11.56       31.51*      15.46       23.60
                         (8.13)      (12.78)     (12.48)     (15.94)
Employment mentions      -1.80       3.17        -9.94       2.71
                         (6.27)      (9.86)      (9.64)      (12.3)
R²                       0.47        0.64        0.55        0.29
Adj. R²                  0.44        0.62        0.52        0.26
***p < 0.001, **p < 0.01, *p < 0.05

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in the groups of less than 24 years, between 25 and 44 years, and above 44 years.

                         All variables  Youth model  Twitter model (I)  Twitter model (II)
(Intercept)              0.06           -0.02        0.10***            0.09***
                         (0.03)         (0.03)       (0.03)             (0.027)
Young pop. rate          0.66*          2.20***
                         (0.30)         (0.35)
Penetration rate         8.20***                     8.57***            8.62***
                         (2.25)                      (2.22)             (2.21)
Geographical diversity   0.14***                     0.15***            0.12***
                         (0.04)                      (0.04)             (0.03)
Social diversity         -0.02                       -0.03
                         (0.02)                      (0.02)
Morning activity         -1.42***                    -1.30**            -1.28**
                         (0.41)                      (0.42)             (0.41)
Misspellers rate         23.95                       31.51*             32.28*
                         (13.09)                     (12.78)            (12.71)
Employment mentions      0.34                        3.17
                         (9.81)                      (9.86)
R²                       0.65           0.24         0.64               0.63
Adj. R²                  0.63           0.24         0.62               0.62
***p < 0.001, **p < 0.01, *p < 0.05

Table 5. Regression table for the different statistical models. The All variables model includes both the Twitter variables and the rate of young population. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).

                         Communities  Municipalities  Counties    Provinces
(Intercept)              0.10***      0.16***         0.11***     0.11*
                         (0.03)       (0.01)          (0.03)      (0.05)
Penetration rate         8.57***      4.01***         9.12***     10.47***
                         (2.22)       (0.59)          (1.81)      (1.97)
Geographical diversity   0.15***      0.02            0.12***     0.08
                         (0.04)       (0.01)          (0.03)      (0.07)
Social diversity         -0.03        -0.01           -0.01       -0.03
                         (0.02)       (0.01)          (0.02)      (0.07)
Morning activity         -1.30**      -1.16***        -1.49***    -1.03
                         (0.42)       (0.14)          (0.39)      (0.88)
Misspellers rate         31.51*       14.40***        14.09
                         (12.78)      (2.51)          (10.02)
Employment mentions      3.17         -0.71           2.41        -3.17
                         (9.86)       (0.89)          (8.86)      (12.29)
Number of points         128          1738            198         50
R²                       0.64         0.22            0.55        0.65
Adj. R²                  0.62         0.21            0.54        0.61
***p < 0.001, **p < 0.01, *p < 0.05

Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.
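The adjusted R² rows of Table 6 can be related to the R² rows through the usual correction for the number of fitted variables. The sketch below assumes the standard formula, adj. R² = 1 − (1 − R²)(n − 1)/(n − k − 1), with n the number of points and k the number of variables excluding the intercept; the source does not spell the formula out, and the function name is hypothetical.

```python
def adjusted_r2(r2, n_points, n_predictors):
    """Standard adjusted R^2 for a linear model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_points - 1) / (n_points - n_predictors - 1)


# (R^2, number of points, number of variables) as read from Table 6.
communities = adjusted_r2(0.64, 128, 6)
counties = adjusted_r2(0.55, 198, 6)
provinces = adjusted_r2(0.65, 50, 5)  # misspellers rate removed, so 5 variables
```

Rounded to two decimals these give 0.62, 0.54 and 0.61, matching the tabulated values. Because the tabulated R² is itself rounded, agreement in the last digit should not be expected for every column.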


Figure 8. Top: Percentage of population in each age group (16-24, 25-34, 35-44, 45-54 and 55-64) from the Spanish Census (dark bars) and from surveys about users in Twitter (light bars). Bottom: Performance of the linear models for each of the age groups, shown as predicted against observed unemployment (%): R² = 0.47 (All ages), R² = 0.62 (< 24), R² = 0.52 (25-44) and R² = 0.26 (> 44).

• Confusing the verb haber with the periphrasis a ver.

• Separating a word into two, for instance writing the word conmigo as con migo.

This way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus, one can expect this selection to provide an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27055 (5.6% of the whole population).

We analyze whether misspellers have a different Twitter usage behavior from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, it can be observed that misspellers tend to publish a larger number of tweets than those who did not make mistakes (144.71 against 23.72). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).

Figure 10. Top: Correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations with 95% confidence. Bottom: Projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.

Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspellers rate as a proxy for the educational level of the cities. A large number of previous works in the literature have revealed the relationship between the economic status and the educational level of geographical areas, and therefore it is natural to ask whether the observed misspellers rate is related to the economy, as captured by the unemployment rate. To test this hypothesis we consider cities with more than 5000 inhabitants to avoid subsampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).
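The sub-linear scaling of the mean number of misspellings with the number of tweets is the kind of exponent that can be estimated with a straight-line fit in log-log space. The sketch below is a minimal illustration of that technique on synthetic data; it is not the authors' fitting procedure, and the function name, the prefactor 1.5 and the sample points are assumptions.

```python
from math import log


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. a power-law exponent."""
    lx = [log(x) for x in xs]
    ly = [log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return sxy / sxx


# Synthetic curve built with exponent 0.33, for illustration only.
tweets = [30, 50, 100, 200, 500]
mean_misspellings = [1.5 * t ** 0.33 for t in tweets]
exponent = loglog_slope(tweets, mean_misspellings)
```

An exponent below 1 means that a user who tweets ten times more often makes fewer than ten times as many misspellings, which is what sub-linear scaling refers to here.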

Figure 11. Number (red) and probability (blue) of observed misspellings given the number of tweets.

Figure 12. Explanatory power (R²) of the linear regression model when fitted against the unemployment data for different months, from October 2012 to June 2014. The gray (orange) area corresponds to the time window in which Twitter data is collected and variables are constructed.

S7 Time window and unemployment

In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment recorded at different months, both within the time window in which Twitter data was collected and outside it, to check whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory power of the model when the linear regression is done for the unemployment values of different months before, during and after the Twitter data time window. Although there is a small seasonal effect along the year, the explanatory power remains around R² = 0.6, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that R² decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.

S8 Demographics does not explain unemployment

Since unemployment rates are very large for the group of young people, a natural question is whether demographic variables alone could explain the heterogeneity of young unemployment rates found in the geographical areas. To test this we have built four linear models: the first one (named Youth model in Table 5) has the rate of young population as the only explanatory variable; the second and third are built only on the Twitter variables, either all those considered in the main text (Twitter model (I)) or just those whose regression coefficients are statistically significant (Twitter model (II)); the fourth is fitted with all the variables (named All variables model in Table 5). Table 5 shows the summary of the regression for each model. Focusing on the variance explained by each model in terms of R², considering all Twitter variables is almost three times more explanatory than considering only the proportion of young people. On the other hand, comparing the R² of the Twitter model with those of the All variables and Youth models shows that the rate of young population does not provide significant additional explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.
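The comparison can be written out with the rounded R² values of Table 5. This is a worked illustration under the usual reading of a semi-partial analysis, namely the gain in R² when one block of variables is added to a model that already contains the other; the subscript notation is introduced here and does not appear in the source.

```latex
\begin{align*}
\Delta R^2_{\text{young}\,\mid\,\text{Twitter}}
  &= R^2_{\text{All variables}} - R^2_{\text{Twitter (I)}} = 0.65 - 0.64 = 0.01,\\
\Delta R^2_{\text{Twitter}\,\mid\,\text{young}}
  &= R^2_{\text{All variables}} - R^2_{\text{Youth}} = 0.65 - 0.24 = 0.41,\\
\frac{R^2_{\text{Twitter (I)}}}{R^2_{\text{Youth}}}
  &= \frac{0.64}{0.24} \approx 2.7 .
\end{align*}
```

Adding the young population rate to the Twitter variables gains about 0.01 in R², whereas adding the Twitter variables to the young population rate gains about 0.41, up to the rounding of the tabulated values.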



Social media fingerprints of unemployment mdash 1519

2 5 10 20 50 200 500

10

15

20

30

40

tweets

N(m

iss|

tw

eets

)

2 5 10 20 50 200 500

000

50

050

050

0

tweets

P(m

iss|

twee

ts)

2 50 500

Figure 11 Number (red) and probability (blue) of ob-served misspellings given the number of tweets

050

055

060

065

070

201210201211201212201301201302201303201304201305201306201307201308201309201310201311201312201401201402201403201404201405201406

month

r2R2

month

Figure 12 Explanatory power of the linear regressionmodel when fitted against the unemployment data for dif-ferent months Gray (orange) area correspond to the timewindow in which Twitter data is collected and variables areconstructed

S7 Time window and unemployment

In the definition of the variables we have aggregated the Twitter ac-tivity within a 7 months time window (from December 2012 to June2013) Since unemployment has a significant variation along timewe investigate here what is the correlation and explanatory powerof the Twitter variables for the values of unemployment determinedat different months through the same time window in which Twitterdata was collected Or if the variables collected in that time windoware more correlated with past or future values of unemploymentFigure 12 shows the explanatory value of the model when the linearregression is done for values of unemployment of different monthsbefore during and after the Twitter data time window Althoughthere is a small seasonal effect along the year we see that the ex-planatory power remains around R2 = 06 which suggest that ourTwitter linear model retains its explanatory power even though unem-ployment changes considerably throughout the year It is interestingto note that R2 decays a little bit during the summer which meansthat our variables are less correlated with summer unemploymentFinally unemployment used in the main article is from June 2013ie the last month in the time window used to collect the data

S8 Demographics does not explain unem-ployment

Since unemployment rates are very large for the group of youngpeople a natural question is whether only demographic variablescould explain the heterogeneity of young unemployment rates foundin the geographical areas To test this end we have built four linearmodels the first one (named Youth model in Table 5) is composedby the rate of young population as the only explaining variable thesecond ones are built based on only the Twitter variables consideredin the main text (named Twitter model (I)) or just with those whoseregression coefficients are statistically significant (Twitter model(II)) the third one is fitted with all the variables (named All variablesmodel in Table 5) In table 5 we show the summary of the regressionfor each model Focusing on the explained variance by the model interms of R2 it can be checked that considering all Twitter variables isthree times more explanatory than considering only the young peopleproportion On the other hand the comparison of R2 for the Twittermodel with the one for All variables and Youth model shows that therate of young population does not provide a significant explanatorypower This semi-partial analysis shows that our Twitter variablesretain a high explanatory power when the effect of young populationrate is controlled

S9 Unemployment models for other geo-graphical areas

While municipalities are very heterogeneous demographically otheradministrative areas exist in Spain at large scales that could be usedfor our model of unemployment As mentioned in section 4 thesmallest administrative division of Spain we have considered is thatof the 8200 municipalities At larger scales we have the 326 coun-ties (comarcas in spanish) which are aggregations of municipalitiesFinally the largest geographical scale we considered is defined by50 provinces (provincias in Spanish) In this section we comparethe performance of our Twitter model for unemployment for thevariables defined in those administrative areas and relate it to thegeographical communities detected and used in the main paper (seesection 4) Not all the areas at different administrative divisions areconsidered in the model To minimize the effect of areas in whichthe number of geo-tagged tweets is very small we only consider the1738 municipalities which have a Twitter population π gt 10 Simi-larly we only consider the 198 counties with π gt 100 As we can seein Table 6 the model has a large explanatory power for areas equalor bigger than counties As expected R2 increases as the number ofareas in the model is smaller but the description level of the modelis very low for provinces for example The best performance (highR2 and high geographical description level) is attained at the level ofthe detected communities

S10 Relative importance of the variables

To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:

1. (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.

2. (lmg) The R² contribution of each variable averaged over orderings of the variables, as proposed by Lindeman, Merenda and Gold.

3. (pmvd) The PMVD metric introduced by Feldman, which is an average over orderings as well, but with data-dependent weights.

4. (first) The univariate R² values from regression models with one variable only.

[Figure 13: bar chart of the relative importance (%), on a scale from 0 to 40, of each model variable under the four methods (weight, first, lmg, pmvd).]

Figure 13. Relative importance of the variables (in percentage) in the unemployment model for the different ways of calculating it.

All these metrics are obtained using the relaimpo R package [12]. The results for the young unemployment model are shown in Figure 13, where we can see that the different methods yield similar relative importances of the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.
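To make three of these metrics concrete, the sketch below computes weight, first and lmg on synthetic data. It is an illustrative reimplementation in Python, not the relaimpo package used for Figure 13. The data, the function names and the number of variables are assumptions, and pmvd is left out because its data-dependent weights are more involved.

```python
from itertools import permutations

import numpy as np


def r_squared(X, y, cols):
    """R^2 of an OLS fit of y on an intercept plus the columns in `cols`."""
    if not cols:
        return 0.0
    A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)


def lmg(X, y):
    """Average, over all orderings, of the R^2 gained when each variable enters."""
    k = X.shape[1]
    shares = np.zeros(k)
    orderings = list(permutations(range(k)))
    for order in orderings:
        included, r2_prev = [], 0.0
        for j in order:
            included.append(j)
            r2_now = r_squared(X, y, included)
            shares[j] += r2_now - r2_prev
            r2_prev = r2_now
    return shares / len(orderings)


def first(X, y):
    """Univariate R^2 of each variable taken alone."""
    return np.array([r_squared(X, y, [j]) for j in range(X.shape[1])])


def weight(X, y):
    """Normalized absolute coefficients after scaling variables to unit variance."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    A = np.column_stack([np.ones(len(y)), Z])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    b = np.abs(coef[1:])
    return b / b.sum()


# Synthetic example (assumed data): x0 matters most, x2 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 1] += 0.5 * X[:, 0]  # correlated regressors, where orderings matter
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)

lmg_shares = lmg(X, y)
first_shares = first(X, y)
weight_shares = weight(X, y)
print("lmg   :", np.round(lmg_shares, 3))
print("first :", np.round(first_shares, 3))
print("weight:", np.round(weight_shares, 3))
```

The lmg shares add up to the R² of the full model, which is what makes them a decomposition of the explained variance. The cost grows with the factorial of the number of variables, which is still cheap for the six variables considered here.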

References

[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.

[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.

[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.

[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.

[6] Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.

[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html

[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm

[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.

[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.

[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.

[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.

[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.

[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.

[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.

[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences-ISCIS 2005, pages 284–293. Springer, 2005.

[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.

[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.

[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.

[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.

[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.

[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.

[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.

[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression: the partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.

Gravity Model

| Parameter | Description | Spain |
|---|---|---|
| α1 | Origin exponent | 0.477*** (0.002) |
| α2 | Destination exponent | 0.478*** (0.002) |
| β | Distance exponent | 1.05*** (0.0035) |
| R² | Goodness of fit | 0.797 |
| φ | Correlation between $T_{ij}$ and $T^{gra}_{ij}$ | 0.826 |

Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.
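As an illustration of how exponents like those in Table 1 can be estimated, the sketch below fits a gravity law by least squares on logarithms. Two things are assumptions rather than statements from this section: the functional form $T^{gra}_{ij} \propto P_i^{\alpha_1} P_j^{\alpha_2} / d_{ij}^{\beta}$, which the parameter names suggest, and the log-linear fitting procedure. The populations, distances and flows are synthetic, generated with the exponents of Table 1 only so that the recovered values can be checked.

```python
import numpy as np

# Synthetic origin/destination pairs (assumed data, not the paper's fluxes).
rng = np.random.default_rng(1)
n_pairs = 5000
pop_i = rng.lognormal(mean=8.0, sigma=1.5, size=n_pairs)   # origin populations
pop_j = rng.lognormal(mean=8.0, sigma=1.5, size=n_pairs)   # destination populations
dist = rng.uniform(5.0, 800.0, size=n_pairs)               # distances in km

# Exponents used to generate the synthetic flows (taken from Table 1).
alpha1, alpha2, beta = 0.477, 0.478, 1.05
noise = rng.lognormal(mean=0.0, sigma=0.3, size=n_pairs)   # multiplicative noise
flows = pop_i**alpha1 * pop_j**alpha2 / dist**beta * noise

# Log-linear model: log T = c + a1 log P_i + a2 log P_j - b log d.
A = np.column_stack([
    np.ones(n_pairs),
    np.log(pop_i),
    np.log(pop_j),
    -np.log(dist),
])
coef, *_ = np.linalg.lstsq(A, np.log(flows), rcond=None)
log_c, alpha1_hat, alpha2_hat, beta_hat = coef

# Correlation between synthetic flows and the fitted gravity flows.
fitted = np.exp(A @ coef)
phi = np.corrcoef(flows, fitted)[0, 1]

print(f"alpha1 = {alpha1_hat:.3f}, alpha2 = {alpha2_hat:.3f}, beta = {beta_hat:.3f}")
print(f"correlation between flows and fitted flows = {phi:.3f}")
```

Fitting on logarithms discards pairs with zero flow and weights small flows as much as large ones, so other estimators may be preferable on real data.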

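NMI between G and G_p for different p

| Algorithm | p = 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 | 0.1 |
|---|---|---|---|---|---|---|---|---|---|---|
| FG | 0.995 | 0.992 | 0.989 | 0.983 | 0.981 | 0.977 | 0.983 | 0.969 | 0.980 | 0.959 |
| WT | 0.954 | 0.959 | 0.950 | 0.954 | 0.945 | 0.948 | 0.947 | 0.935 | 0.926 | 0.931 |
| IM | 0.988 | 0.981 | 0.980 | 0.981 | 0.978 | 0.974 | 0.975 | 0.970 | 0.969 | 0.966 |
| ML | 0.994 | 0.978 | 0.979 | 0.983 | 0.948 | 0.934 | 0.972 | 0.952 | 0.973 | 0.947 |
| LP | 0.906 | 0.908 | 0.911 | 0.915 | 0.895 | 0.907 | 0.907 | 0.893 | 0.905 | 0.904 |
| LE | 0.960 | 0.957 | 0.956 | 0.859 | 0.910 | 0.892 | 0.908 | 0.858 | 0.885 | 0.884 |

Table 2. NMI measure comparing G and G_p.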

Communities Stats

| Algorithm | Mean community size | Maximum community size | Number of communities | Modularity | NMI P | NMI C |
|---|---|---|---|---|---|---|
| FG | 309.696 | 1385 | 23 | 0.726 | 0.712 | 0.590 |
| WT | 9.262 | 433 | 769 | 0.417 | 0.744 | 0.757 |
| IM | 21.011 | 143 | 339 | 0.758 | 0.770 | 0.831 |
| ML | 323.772 | 1132 | 22 | 0.800 | 0.717 | 0.599 |
| LP | 22.052 | 750 | 323 | 0.732 | 0.749 | 0.761 |
| LE | 1017.571 | 5344 | 7 | 0.381 | 0.264 | 0.205 |

Table 3. Statistics of the communities N_i returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.
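The NMI columns of Tables 2 and 3 compare two partitions of the same set of municipalities. A minimal sketch of such a comparison is given below, assuming the normalization 2·I(A;B)/(H(A)+H(B)), which is common in the community-detection literature. This section does not state which normalization was used, so the formula, the function names and the toy labels are all assumptions.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a partition given as a list of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (n_ij / n) * log(n_ij * n / (count_a[i] * count_b[j]))
        for (i, j), n_ij in joint.items()
    )


def nmi(a, b):
    """Normalized mutual information, 2 I(A;B) / (H(A) + H(B))."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a + h_b == 0.0:
        return 1.0  # both partitions put every item in a single group
    return 2.0 * mutual_information(a, b) / (h_a + h_b)


# Toy example: six hypothetical municipalities, labeled by detected community
# and by an administrative unit.
communities = [0, 0, 1, 1, 2, 2]
provinces = ["A", "A", "B", "B", "B", "B"]
print(f"NMI = {nmi(communities, provinces):.3f}")
```

The measure equals 1 when the two partitions coincide up to relabeling and 0 when knowing one says nothing about the other, which is how the values in the tables can be read.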

Standard errors are given in parentheses.

| | All ages | < 24 | 25−44 | > 44 |
|---|---|---|---|---|
| (Intercept) | 0.11*** (0.02) | 0.10*** (0.03) | 0.20*** (0.03) | 0.20*** (0.035) |
| Penetration rate | 3.23* (1.41) | 8.57*** (2.22) | 6.28** (2.17) | 2.40 (2.77) |
| Geographical diversity | 0.03 (0.02) | 0.15*** (0.04) | 0.08* (0.04) | 0.06 (0.05) |
| Social diversity | −0.03* (0.01) | −0.03 (0.02) | −0.05* (0.02) | −0.06* (0.03) |
| Morning activity | −0.69* (0.26) | −1.30** (0.42) | −1.53*** (0.41) | −1.19* (0.52) |
| Misspellers rate | 11.56 (8.13) | 31.51* (12.78) | 15.46 (12.48) | 23.60 (15.94) |
| Employment mentions | −1.80 (6.27) | 3.17 (9.86) | −9.94 (9.64) | 2.71 (12.3) |
| R² | 0.47 | 0.64 | 0.55 | 0.29 |
| Adj. R² | 0.44 | 0.62 | 0.52 | 0.26 |

***p < 0.001, **p < 0.01, *p < 0.05

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in the groups of less than 24 years, between 25 and 44 years, and above 44 years.

Standard errors are given in parentheses; an empty cell means the variable is not included in that model.

| | All variables | Youth model | Twitter model (I) | Twitter model (II) |
|---|---|---|---|---|
| (Intercept) | 0.06 (0.03) | −0.02 (0.03) | 0.10*** (0.03) | 0.09*** (0.027) |
| Young pop. rate | 0.66* (0.30) | 2.20*** (0.35) | | |
| Penetration rate | 8.20*** (2.25) | | 8.57*** (2.22) | 8.62*** (2.21) |
| Geographical diversity | 0.14*** (0.04) | | 0.15*** (0.04) | 0.12*** (0.03) |
| Social diversity | −0.02 (0.02) | | −0.03 (0.02) | |
| Morning activity | −1.42*** (0.41) | | −1.30** (0.42) | −1.28** (0.41) |
| Misspellers rate | 23.95 (13.09) | | 31.51* (12.78) | 32.28* (12.71) |
| Employment mentions | 0.34 (9.81) | | 3.17 (9.86) | |
| R² | 0.65 | 0.24 | 0.64 | 0.63 |
| Adj. R² | 0.63 | 0.24 | 0.62 | 0.62 |

***p < 0.001, **p < 0.01, *p < 0.05

Table 5. Regression table for the different statistical models. The All variables model includes both the Twitter variables and the rate of young population. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).

Standard errors are given in parentheses; an empty cell means the variable is not included in that model.

| | Communities | Municipalities | Counties | Provinces |
|---|---|---|---|---|
| (Intercept) | 0.10*** (0.03) | 0.16*** (0.01) | 0.11*** (0.03) | 0.11* (0.05) |
| Penetration rate | 8.57*** (2.22) | 4.01*** (0.59) | 9.12*** (1.81) | 10.47*** (1.97) |
| Geographical diversity | 0.15*** (0.04) | 0.02 (0.01) | 0.12*** (0.03) | 0.08 (0.07) |
| Social diversity | −0.03 (0.02) | −0.01 (0.01) | −0.01 (0.02) | −0.03 (0.07) |
| Morning activity | −1.30** (0.42) | −1.16*** (0.14) | −1.49*** (0.39) | −1.03 (0.88) |
| Misspellers rate | 31.51* (12.78) | 14.40*** (2.51) | 14.09 (10.02) | |
| Employment mentions | 3.17 (9.86) | −0.71 (0.89) | 2.41 (8.86) | −3.17 (12.29) |
| Number of points | 128 | 1738 | 198 | 50 |
| R² | 0.64 | 0.22 | 0.55 | 0.65 |
| Adj. R² | 0.62 | 0.21 | 0.54 | 0.61 |

***p < 0.001, **p < 0.01, *p < 0.05

Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.
