


Twitter Watch: Leveraging Social Media to Monitor and Predict Collective-Efficacy of Neighborhoods

Moniba Keymanesh1, Saket Gurukar1, Bethany Boettner1, Christopher Browning1, Catherine Calder2, and Srinivasan Parthasarathy1

1The Ohio State University, 2University of Texas at Austin
{keymanesh.1, gurukar.1, boettner.6, browning.90, parthasarathy.2}@osu.edu,

[email protected]

Abstract

Sociologists associate the spatial variation of crime within an urban setting with the concept of collective efficacy. The collective efficacy of a neighborhood is defined as social cohesion among neighbors combined with their willingness to intervene on behalf of the common good. Sociologists measure collective efficacy by conducting survey studies designed to measure individuals' perception of their community. In this work, we employ the curated data from a survey study (ground truth) and examine the effectiveness of substituting costly survey questionnaires with proxies derived from social media. We enrich a corpus of tweets mentioning a local venue with several linguistic and topological features. We then propose a pairwise learning to rank model with the goal of identifying a ranking of neighborhoods that is similar to the ranking obtained from the ground truth collective efficacy values. In our experiments, we find that our generated ranking of neighborhoods achieves 0.77 Kendall tau-x ranking agreement with the ground truth ranking. Overall, our results are up to 37% better than traditional baselines.

1 INTRODUCTION
Understanding the occurrence of crime and disorder in cities is important for public health, policy, and governance. However, the occurrence of criminal violence is uneven across neighborhoods (Chainey and Ratcliffe 2013; Weisburd, Bruinsma, and Bernasco). Sociologists and policy-makers associate the spatial variation of disorder with the organizational characteristics of neighborhoods (Morenoff, Sampson, and Raudenbush 2001; Sampson, Raudenbush, and Earls 1997; Kornhauser 1978; Sampson and Groves 1989; Browning, Cagney, and Boettner 2016). An important measure of such disorder is collective efficacy (Sampson, Raudenbush, and Earls 1997). Collective efficacy is defined as "social cohesion and trust among neighbors combined with the joint willingness to intervene on behalf of the common good" (Sampson, Raudenbush, and Earls 1997). Collective efficacy is increasingly used by local governments to prioritize resources to both monitor and reduce disorder through targeted policies and neighborhood gentrification strategies. It is also used to measure the impact of said policies and strategies over time (Hipp 2016; Bandura 1997).

Under peer review.

The computation of neighborhoods'1 collective efficacy traditionally requires conducting expensive surveys, usually requiring funding on the order of hundreds of thousands of dollars (Couper 2017). Changes to collective efficacy over time (Hipp 2016), due to policy shifts (e.g., through neighborhood gentrification efforts), require additional surveys, further exacerbating this cost.

Sociologists and government agencies typically use collective efficacy to "order" neighborhoods with respect to neighborhood safety perception, social cohesion among residents, and their willingness to intervene on behalf of the common good. Neighborhoods with high collective efficacy tend to be safer, while lower collective efficacy values correspond to relatively less safe neighborhoods (Sampson, Raudenbush, and Earls 1997). Essentially, one may model this as a ranking problem. Concretely, the key question we seek to answer in this paper is: "Given the social media data about neighborhoods, can we rank the neighborhoods such that the ranked list is close to the ranked list of neighborhoods ordered by collective efficacy, thereby saving on the cost of expensive surveys?".

Our approach, a first-of-its-kind study at city scale, seeks to characterize neighborhood collective efficacy by levering spatially conditioned linguistic features extracted from social media. These features are related to the type of urban activity, language use, visible signs of crime and anti-social behavior reported on such media, familiarity of residents with one another, and the public mood of the neighborhood. We lever additional sociological and spatial features and develop a simple pairwise learning to rank model based on these features. We empirically show the effectiveness of our model on a real-world, city-scale dataset, with ground truth values of collective efficacy computed from a traditional survey-based study (details in Section 3). Additionally, we conduct a comprehensive analysis of the predictive power of specific features in the learning to rank task to better understand the relative importance of individual features. In terms of broader impacts, such ideas can be used as a cost-effective early warning mechanism to monitor the transformations of neighborhoods and prioritize resources.

1 In this paper we use the terms "neighborhood" and "block group" interchangeably. Block group refers to a census block group, which is the smallest geographical unit for which the United States Census Bureau publishes sample data.

arXiv:1911.06359v1 [cs.SI] 14 Nov 2019


2 Background and Related Work
Twitter has been used by researchers to make sense of human behavior. The behavior of users on this platform has been used in particular to assist the prediction of criminal violence (Wang, Gerber, and Brown 2012; Gerber 2014; Wang, Brown, and Gerber 2012; Wang and Gerber 2015; Aghababaei and Makrehchi 2016; Williams, Burnap, and Sloan 2017; Bendler et al. 2014). The link between the prediction of social unrest and users' online activity on Twitter has been studied by (Compton et al. 2013). Moreover, Twitter has been employed to study the online behavior of gang members (Patton 2015) and to measure the population at risk, considering violent crime (Malleson and Andresen 2015). Several studies have used Twitter to study trust relations (Vedula, Parthasarathy, and Shalin 2017) among online users. Researchers have also leveraged Twitter data to study social disorganization by evaluating the entropy of individuals' opinions about soccer teams (Pacheco, Oliveira, and Menezes 2017). Although the concepts of trust, crime, and social disorganization are related to collective efficacy, to the best of our knowledge, estimating individuals' perception of their social climate and expectation of intervention using social media data has not been addressed until now.

3 Data Collection
AHDC study: The Adolescent Health and Development in Context (AHDC) study is a longitudinal data collection effort in a representative and diverse urban setting that focuses on the contribution of social and spatial environments to the health and developmental outcomes of urban youth. The study area is a contiguous space in Columbus, Ohio. In the first wave of the study, 1403 Columbus residents participated. Participants were asked a series of questions about their neighborhood and routine activity locations. Questions specifically focused on informal social control items measuring the participant's perception of the social climate in the area at and around each location and in the neighborhood. Participants reported agreement with the following questions: 1) whether people on the streets can be trusted, 2) whether people are watching what is happening on the street, and 3) whether people would come to the defense of others being threatened. Responses ranged from 1 ("strongly disagree") to 5 ("strongly agree"). This step resulted in roughly 9000 location reports (4031 unique locations) nested within 567 block groups. In order to obtain the collective efficacy value of each neighborhood, we aggregated individual responses to the three social control items at the report level. Then, we aggregated report-level results for each block group. Finally, we normalized the scores to the range of 0 to 1 and used them as the ground truth for our study. This methodology is aligned with the traditional measurement approach employed to compute collective efficacy at the neighborhood level (Bandura and Wessels 1997; Paskevich et al. 1999; Sampson, Morenoff, and Earls 1999). Note that while we adopt a similar ground truth model (Sampson, Morenoff, and Earls 1999), we lever survey reports from individuals who both reside within and frequently visit a particular neighborhood (the original model focused just on residents).


Figure 1: Collective efficacy map of Columbus, OH

Concomitantly, the social media postings included in our study include postings from individuals who both reside within and frequently visit a particular neighborhood. In order to increase the reliability of the aggregation, we only include the neighborhoods having at least 5 reports. Figure 1 shows a collective efficacy map of Columbus (ground truth).
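To make this aggregation concrete, the following is a minimal sketch (not the authors' code) of the ground-truth construction described above, assuming a pandas DataFrame with hypothetical columns block_group and the three informal social control items q_trust, q_watching, and q_defend:

import pandas as pd

def collective_efficacy(reports: pd.DataFrame, min_reports: int = 5) -> pd.Series:
    items = ["q_trust", "q_watching", "q_defend"]
    # Aggregate the three items at the report level.
    reports = reports.assign(report_score=reports[items].mean(axis=1))
    # Aggregate report-level scores for each block group.
    grouped = reports.groupby("block_group")["report_score"]
    ce = grouped.mean()
    # Keep only block groups with at least `min_reports` reports.
    ce = ce[grouped.count() >= min_reports]
    # Min-max normalize to the 0-1 range.
    return (ce - ce.min()) / (ce.max() - ce.min())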

Twitter Data: Matching the offline survey instrument of the AHDC study, we collected a significant corpus of Twitter feeds from the Columbus area and its suburbs. Our goal was to capture the informal language of local citizenry focused on local venues and localities. We chose Twitter because of its national appeal2 and easy availability through the API. For the purpose of our study, collectively, more than 50 million publicly available tweets were collected from the accounts of 54k Twitter users who identified their location as Columbus. These users were identified through snowball sampling (Goodman 1961). Details of our data collection process can be found in Appendix A.1.

3.1 Associating Tweets to Neighborhoods

Following our data collection, we excluded the tweets that did not contain a mention of locations within our study area. For doing so, we used a state-of-the-art, publicly available location name extractor, LNEx (Al-Olimat et al. 2018). LNEx extracts location entries from tweets, handles abbreviations, and tackles appellation formation and metonymy disambiguation problems given gazetteer and region information. The OpenStreetMap gazetteer was used and the region was set to Columbus. However, there are cases in which ambiguous locations were reported by LNEx. In our study, we exclude tweets containing ambiguous location entities. For more details on the ambiguous cases and pruning steps, see Appendix A.1. This pruning step resulted in 4846 unique locations that were spotted in 545k tweets and were mapped to 424 neighborhoods.3

2 http://www.pewinternet.org/2018/03/01/social-media-use-in-2018/

3 It is our intent to open source the tweet IDs as well as the ground truth collective efficacy values of neighborhoods in Columbus once this work is published.


4 Methodology
In this section, we formalize our prediction task and the proposed ranking model. As mentioned earlier, the goal of our study is to rank the neighborhoods based on features extracted from tweets such that the ranked list is similar to the list of neighborhoods ordered by collective efficacy. Hence, we formulate our problem as a pairwise learning to rank task (Liu and others 2009).

4.1 Definitions and Problem Formulation
First, we define the terms related to ordinal ranking. Tied objects refer to a set of two or more objects that are interchangeable in ranking with respect to the quality under consideration (Kendall 1945). A ranking in which ties are allowed is called a weak ordering. In our study, neighborhoods with a significantly small difference in their ground truth value of collective efficacy are considered tied. As a result, they are interchangeable in the ranking. Ties are defined based on a threshold on the difference of collective efficacy values. Note that in this case, ties are intransitive by definition: a tie relationship between neighborhoods n_a and n_b, and between n_b and n_c, does not imply that neighborhoods n_a and n_c are tied. This constraint is reflected in the way we define the ranking matrix and is discussed in Section 4.5.

Next, we formalize our ranking task and our proposed approach. Our data consists of {t_1, t_2, ..., t_c}, where t_i is the set of tweets associated with neighborhood n_i. We denote the collective efficacy of neighborhood n_i by C(n_i). The goal of our framework is to automatically generate a permutation of neighborhoods (f(n_1), f(n_2), ..., f(n_m)), where f is the ordering function that maps each neighborhood to its position, such that the mapped position of each neighborhood is close to its true position based on collective efficacy values. Formally, the ranking task is defined as {f(n_i) < f(n_j) | ∀ n_i, n_j if C(n_i) ≤ C(n_j)}.

In order to generate a ranking of the neighborhoods, we first predict the local rank of all pairs of neighborhoods n_i and n_j. In this scenario, there can be three cases for any pair: n_i comes before n_j, n_i comes after n_j, or n_i and n_j are interchangeable in the ranking. Thus, we formulate our local ranking task as a 3-class classification task. We then use the local rankings to generate the global ranking. Details of this process are discussed in Section 4.4. Next, we explain the features used in this study to characterize the neighborhoods.
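As an illustration, a minimal sketch of the 3-class pairwise labelling (not the authors' code), assuming a hypothetical dict ce that maps a neighborhood id to its ground truth collective efficacy and a tie threshold defined as in Section 5.3; the sign convention follows the ordering of Section 4.1:

from itertools import permutations

def pairwise_labels(ce: dict, tie_threshold: float) -> dict:
    labels = {}
    for ni, nj in permutations(ce, 2):
        diff = ce[ni] - ce[nj]
        if abs(diff) <= tie_threshold:
            labels[(ni, nj)] = 0      # tied: interchangeable in the ranking
        elif diff < 0:
            labels[(ni, nj)] = +1     # n_i ranked ahead of n_j
        else:
            labels[(ni, nj)] = -1     # n_i ranked behind n_j
    return labels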

4.2 Features
We characterize each neighborhood with features extracted from the tweets associated with the neighborhood. We compute two types of features: I) features that are computed for each neighborhood and II) features that are computed for a pair of neighborhoods. To generate the feature vector of a pair of neighborhoods, we first concatenate the feature vectors of the two neighborhoods. Next, we add the pairwise features to the feature vector. These features are explained in detail in the following subsections.

TF-IDF of crime related words: "Broken Windows" (Wilson and Kelling 1982) is a well-known theory in criminology. The basic formulation of this theory is that visible signs of crime create an urban environment that encourages further crime and disorder (Skogan 2015; Welsh, Braga, and Bruinsma 2015). Under the broken windows theory, a disordered environment, with signs of broken windows, graffiti, prostitutes, and excessive litter, sends the signal that the area is not monitored and that criminal behavior has little risk of detection. Such a signal can potentially draw offenders from outside of the neighborhood. On the basis of this theory, we used a lexicon of crime4 as a proxy for visible signs of crime and disorder in neighborhoods. This lexicon contains words that people often use while talking about crime and disorder. TF-IDF captures the importance of a term in a document. With this in mind, we employed TF-IDF to capture the content surrounding the location entity in a tweet. For more details of the preprocessing, see Appendix A.2.
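A minimal sketch of this feature (not the authors' code), assuming scikit-learn, a hypothetical list crime_lexicon of crime-related terms, and a dict neighborhood_docs mapping each neighborhood to the concatenation of its tweets:

from sklearn.feature_extraction.text import TfidfVectorizer

def crime_tfidf(neighborhood_docs: dict, crime_lexicon: list):
    # Restricting the vocabulary to the lexicon yields one TF-IDF weight per
    # crime-related term; Appendix A.2 also adds bigrams, hence ngram_range.
    vectorizer = TfidfVectorizer(vocabulary=crime_lexicon, lowercase=True,
                                 ngram_range=(1, 2))
    ids = list(neighborhood_docs)
    matrix = vectorizer.fit_transform([neighborhood_docs[n] for n in ids])
    return ids, matrix  # rows follow the order of `ids`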

Distribution of spatio-temporal urban activities using topic modeling: Casual, superficial interaction and the resulting public familiarity engender place-based trust among residents and ultimately the expectation of response to deviant behaviour (Jacobs 1961). Identifying the activities that individuals conduct in a city is a non-trivial step toward understanding the ecological dynamics of a neighborhood, such as the potential for street activity and public contact. Following the same methodology as in (Fu et al. 2018), we applied Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003) to the tweets associated with a given neighborhood to identify the main activity types in each neighborhood. The number of topics is an important prior parameter of the LDA model. To evaluate the topic model and determine the optimal number of topics, perplexity (Blei, Ng, and Jordan 2003) is used by convention in language modeling. Perplexity is defined as:

\[
\mathrm{Perplexity}(T) = \exp\left\{ \frac{-\sum_{t=1}^{M} \log p(w_t)}{\sum_{t=1}^{M} N_t} \right\} \qquad (1)
\]

where T is the set of test tweets held out from the tweet set for building the LDA model, M is the size of T, N_t is the number of words in a tweet t from tweet set T, and p(w_t) is the probability of the word distribution in the tweet. (Zhao et al. 2015) highlight a few issues with using perplexity to find the appropriate number of topics and propose an additional metric called the rate of perplexity change (RPC) for this purpose. Formally, RPC is defined as:

\[
\mathrm{RPC}(i) = \left| \frac{P_i - P_{i-1}}{t_i - t_{i-1}} \right| \qquad (2)
\]

where t_i is the number of topics from an increasing sequence of candidate numbers and P_i is the corresponding perplexity. We varied the number of topics from 10 to 150 and observed that RPC is maximized at 70 topics. Thus, we trained the LDA model with 70 topics on a subset of 5M tweets collected from user profiles. For more details see Appendix A.3.
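A minimal sketch of this RPC-based selection (not the authors' code), assuming candidates is an increasing sequence of candidate topic counts and perplexities holds the corresponding held-out perplexities computed as in Equation 1:

def best_num_topics(candidates, perplexities):
    # RPC(i), Equation 2, is defined for i >= 1.
    rpc = [
        abs((perplexities[i] - perplexities[i - 1]) / (candidates[i] - candidates[i - 1]))
        for i in range(1, len(candidates))
    ]
    # Pick the candidate at which RPC is maximized.
    return candidates[1 + rpc.index(max(rpc))]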

4 This lexicon has been acquired from an open source repository: https://github.com/sefabey/fear_of_crime_paper


Document embeddings: In order to represent the variable-length tweets of neighborhoods with a fixed-length feature vector, we used Doc2vec (Le and Mikolov 2014), an unsupervised framework that learns continuous distributed vector representations for pieces of text. Details on training the doc2vec model can be found in Appendix A.4.
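A minimal sketch of this feature (not the authors' code), assuming gensim 4.x and pre-tokenized tweets; the vector size of 50 follows Appendix A.4, while the remaining hyperparameters are illustrative only:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_doc2vec(tweet_tokens, vector_size=50):
    # `tweet_tokens`: iterable of tokenized tweets (lists of words).
    corpus = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(tweet_tokens)]
    return Doc2Vec(corpus, vector_size=vector_size, min_count=2, epochs=20)

def neighborhood_embedding(model, neighborhood_tweet_tokens):
    # Concatenate all tweets of the neighborhood and infer a single vector,
    # mirroring the per-neighborhood aggregation described in Appendix A.4.
    words = [w for toks in neighborhood_tweet_tokens for w in toks]
    return model.infer_vector(words)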

Sentiment Distribution: Sentiment analysis has been used by researchers to quantify public moods in the context of unstructured short messages in online social networks (Bertrand et al. 2013). We also characterize the neighborhoods in our study using the mood of the tweets mentioning a venue located inside the neighborhood. As reported in (Ribeiro et al. 2016), the existing methods for sentiment analysis vary widely in their agreement, meaning that depending on the choice of sentiment analysis tool, the same content could be interpreted very differently. Thus, we use a combination of several methods to make our framework more robust to the limitations of each method. We used five of the best methods for sentiment analysis (Ribeiro et al. 2016), including Vader (Gilbert 2014), Umigon (Levallois 2013), SentiStrength (Thelwall et al. 2010), Opinion Lexicon (Hu and Liu 2004), and Sentiment140 (Go, Bhayani, and Huang 2009). We applied the methods to each tweet and normalized the values. Next, we categorized the observed sentiment values into 4 bins and report the distribution of tweet sentiment for each neighborhood. For more details see Appendix A.5.
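A minimal sketch of this feature (not the authors' code); the tool outputs are assumed to be already normalized to [-1, 1], and the bin edges below are illustrative since the paper does not specify them:

import numpy as np

def combine_tools(scores_per_tool):
    # Average only the non-zero outputs, since lexicon-based tools return 0
    # when no sentiment word is present (see Appendix A.5).
    nonzero = [s for s in scores_per_tool if s != 0]
    return float(np.mean(nonzero)) if nonzero else 0.0

def sentiment_distribution(tweet_scores):
    # Four bins: highly negative, negative, positive, highly positive.
    bins = [-1.0, -0.5, 0.0, 0.5, 1.0]  # assumed edges
    counts, _ = np.histogram(tweet_scores, bins=bins)
    return counts / max(len(tweet_scores), 1)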

Spatial Distance: In order to represent the spatial relationship of the neighborhoods, we computed the geodesic distance between the center points of each pair of neighborhoods. We then normalize the distance values using min-max normalization.
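A minimal sketch of this feature (not the authors' code), assuming geopy and a hypothetical dict centroids mapping each neighborhood to its (latitude, longitude) center point:

from itertools import combinations
from geopy.distance import geodesic

def pairwise_distances(centroids: dict) -> dict:
    # Geodesic distance (in km) between every pair of neighborhood centroids.
    dist = {
        (a, b): geodesic(centroids[a], centroids[b]).km
        for a, b in combinations(centroids, 2)
    }
    # Min-max normalize the distances as described above.
    lo, hi = min(dist.values()), max(dist.values())
    span = (hi - lo) or 1.0
    return {pair: (d - lo) / span for pair, d in dist.items()}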

Common Users: Frequent interaction and the resulting public familiarity engender place-based trust among residents and ultimately the expectation of response to deviant behaviour (Jacobs 1961). For a pair of neighborhoods, we assume that the greater the number of users that tweeted about both neighborhoods, the higher the level of public familiarity of the residents and the more similar the neighborhoods are in terms of their level of collective efficacy. Thus, for each pair of neighborhoods we computed the number of users that tweeted about both of the neighborhoods. Then we divided this value by the total number of users that tweeted about at least one of the neighborhoods in the pair.
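A minimal sketch of this feature (not the authors' code), assuming a hypothetical dict users mapping each neighborhood to the set of users that tweeted about a venue located in it:

def common_user_ratio(users: dict, ni, nj) -> float:
    # Users who tweeted about both neighborhoods, divided by users who
    # tweeted about at least one of them.
    both = users[ni] & users[nj]
    either = users[ni] | users[nj]
    return len(both) / len(either) if either else 0.0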

4.3 Model
In this section we discuss our ranking task and model architecture. We use a pairwise approach to automatically generate a ranking of neighborhoods with respect to their collective efficacy. Ranking the objects with a function is equivalent to projecting the objects into a vector and sorting the objects according to the projections. The goal here is to use the extracted features to generate a permutation which is close to the ranking of neighborhoods sorted by collective efficacy values. In the pairwise approach, the ranking task is transformed into a pairwise classification problem. In our case, given representations of a pair of neighborhoods <n_a, n_b>, the goal is to predict whether n_a should be ranked higher than n_b or n_a should come later in the ranking. In the first case, the value +1 is the label to be predicted, and in the latter case the value -1 is assigned as the true label. We consider a label value of 0 for a pair of tied neighborhoods since we do not want to move one of them higher or lower in the list with respect to the other one.

We then use this local ordering to generate a global ordering of the neighborhoods. We employed different classifiers for the local ordering task, including a neural ranker which is a feed-forward neural network. Extensive experiments were conducted to evaluate the effect of the model architecture as well as the predictive power of the features. Our experimental setup and the results are provided in Sections 5 and 6.

4.4 Ordering
We train the local ranker model on each pair of neighborhoods <n_i, n_j> and their corresponding local rank label r_ij, which can take the value -1, +1, or 0. For each pair, we also include another training instance with <n_j, n_i> as the input and −r_ij as the ground truth value. Given the set of tweets associated with the neighborhoods in our study, we rank the neighborhoods as follows: for every pair <n_i, n_j>, we first extract features from the tweets of neighborhood n_i and neighborhood n_j; then we compute the pairwise features, including the spatial distance and the normalized common users count, for each pair. We concatenate all the features mentioned in Section 4.2 for every pair. The model then predicts the local ranking for each pair of neighborhoods using the feature representation of the pair. Let R(n_i, n_j) be the local rank value of neighborhoods n_i and n_j predicted by our model. In order to get the global rank of the neighborhoods, we compute the final score C(n_i) for all neighborhoods by computing:

\[
C(n_i) = \sum_{n_j \in N,\; n_j \neq n_i} R(n_i, n_j)
\]

Then we rank the neighborhoods in decreasing order of these scores. The lower the score, the lower the degree of collective efficacy of the neighborhood. A similar ranking setup has been used for substitution ranking in (Glavas and Stajner 2015; Paetzold and Specia 2017; Maddela and Xu 2018).
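A minimal sketch of this global ordering step (not the authors' code), assuming a hypothetical dict local_rank holding the predicted R(n_i, n_j) in {-1, 0, +1} for every ordered pair:

def global_ranking(neighborhoods, local_rank: dict):
    # Sum the predicted local ranks against every other neighborhood.
    score = {
        ni: sum(local_rank[(ni, nj)] for nj in neighborhoods if nj != ni)
        for ni in neighborhoods
    }
    # Rank in decreasing order of the score, as described above.
    return sorted(neighborhoods, key=lambda n: score[n], reverse=True)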

4.5 Evaluation
We evaluate the accuracy of our model by measuring the agreement between the generated ranking and the ground truth ranking. As mentioned in Section 4.1, the ranked list of neighborhoods has non-transitive ties. The quality of the predicted ranking in this setting can be computed using the τ_x rank correlation coefficient (Emond and Mason 2002). τ_b is another metric used for measuring ranking consensus; however, (Emond and Mason 2002) uncover fundamental issues with the usage of the τ_b metric in the presence of ties.

The τ_x rank correlation coefficient: Let A be a ranking of n objects. Then (Emond and Mason 2002) defines a weak ordering A of n objects using an n × n score matrix. Element a_ij of this matrix is defined as follows:

Page 5: Twitter Watch: Leveraging Social Media to Monitor and ... · Twitter Watch: Leveraging Social Media to Monitor and Predict Collective-Efficacy of Neighborhoods Moniba Keymanesh1,

% of top tweeted    Neighborhood    Min.    Median    Mean      Std. dev. of
neighborhoods       count                                       collective efficacy
20                  78              490     1394.5    5903.5    0.1826
40                  157             110     473       3047.7    0.1821
60                  235             39      197       2058.8    0.1962
80                  314             14      110       1546.9    0.2038
100                 393             1       63        1237.2    0.2063

Table 1: The neighborhood count and the minimum, median, and mean number of tweets in each set of top k% neighborhoods of the list of block groups sorted by tweet count, along with the standard deviation of collective efficacy in each set. The maximum number of tweets in each set of top k% neighborhoods is 98,951.

\[
a_{ij} =
\begin{cases}
1 & \text{if object } i \text{ is ranked ahead of or tied with object } j \\
-1 & \text{if object } i \text{ is ranked behind object } j \\
0 & \text{if } i = j
\end{cases}
\]

The τ_x rank correlation coefficient between two weak orderings A and B is computed from the dot product of their score matrices:

\[
\tau_x(A, B) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} b_{ij}}{n(n-1)}
\]

We further evaluate our proposed framework by computing the τ_x ranking correlation between the generated ranking and the ground truth ranking. It is important to note that cumulated gain-based metrics (Jarvelin and Kekalainen 2002) such as Discounted Cumulated Gain (DCG) and its normalized version (NDCG), widely used in the information retrieval literature for examining retrieval results, are not appropriate for evaluating our framework. The main reason is that these measures penalize ranking mistakes more at the higher end of the ranking while devaluing late retrieved items. However, such an objective does not work in our context: mistakes in ranking at the higher end should be penalized the same as mistakes in the middle or at the end of the list. Thus, employing a measure of ranking agreement is a more appropriate way to evaluate our model.
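A minimal sketch of the τ_x computation for weak orderings represented by positions, where tied objects share a position (not the authors' code); the paper's intransitive-tie setting instead builds the score matrices directly from the tie threshold:

import numpy as np

def score_matrix(ranking: dict, objects: list) -> np.ndarray:
    # a_ij = 1 if i is ranked ahead of or tied with j, -1 if behind, 0 on the diagonal.
    n = len(objects)
    a = np.zeros((n, n))
    for i, oi in enumerate(objects):
        for j, oj in enumerate(objects):
            if i != j:
                a[i, j] = 1 if ranking[oi] <= ranking[oj] else -1
    return a

def tau_x(rank_a: dict, rank_b: dict, objects: list) -> float:
    n = len(objects)
    a, b = score_matrix(rank_a, objects), score_matrix(rank_b, objects)
    return float((a * b).sum() / (n * (n - 1)))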

5 EXPERIMENTS
In this section, we empirically evaluate our hypothesis that "one can leverage social media data to quantify the collective efficacy of neighborhoods".

5.1 Dataset and empirical setup
We lever the survey data obtained from the AHDC study. The details of the AHDC study and the computation of collective efficacy for each neighborhood are given in Section 3. We sorted the neighborhoods by the number of tweets collected for each of them and (if not stated otherwise) used the top 40% of neighborhoods in this list in our experiments. This list contains 157 neighborhoods that were mentioned in 3,047 tweets on average. The count of block groups in each set of top k% of this list, as well as the minimum, maximum, mean, and median number of tweets collected for each set, is reported in Table 1. For more information on the distribution of collective efficacy in each group of neighborhoods, see Appendix A.6. We learn our proposed learning to rank model based on a 90/10 train/test split of the collected tweets. The train/test split also maintains the temporal order, where the train split is treated as current tweets while the test split is treated as future tweets.

5.2 Baselines for the ranking task

Following are the baselines for the ranking task:
• Venue count: Neighborhoods were sorted by the number of venues located in them that were mentioned in the tweets.
• Population: Neighborhoods were sorted by their total population. The values are extracted from the 2013 report of the United States Census Bureau (https://www.census.gov).
• Tweet count: We sort the neighborhoods by the number of tweets that mentioned a venue located in them.
• User count: We sort the neighborhoods by the number of users that tweeted about a venue located in them.
• Random: We generate 100 random permutations of the neighborhoods and report the average τ_x.

5.3 The classifier for the local ranking task

Since we rely on pairwise learning to rank, we experiment with the classifiers below for our local ranking task (a configuration sketch follows the list). The parameters are tuned using grid search with the cross-validation parameter set to 5 and the scoring function set to 'f1'.
• Logistic Regression (LR): The estimator penalty is set to 'L1' and the inverse of regularization strength is set to 0.1.
• Support Vector Machine (SVM): The kernel is set to 'rbf', the penalty parameter C is set to 1, and the gamma kernel coefficient for rbf is set to 0.1.
• Random Forest (RF): The number of estimators is set to 200, the minimum number of samples required to be at a leaf node is set to 5, and the function to measure the quality of a split is set to 'gini'.
• Multi-layer Perceptron (MLP): We use a feed-forward neural network with 3 hidden layers of 100 units each, and a task-specific output layer. We use the cross-entropy loss and the Adam algorithm (Kingma and Ba 2015) for optimization.
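A minimal sketch of these four configurations, assuming scikit-learn (not the authors' code); the liblinear solver is added because scikit-learn's L1 penalty requires it:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

classifiers = {
    "LR": LogisticRegression(penalty="l1", C=0.1, solver="liblinear"),
    "SVM": SVC(kernel="rbf", C=1, gamma=0.1),
    "RF": RandomForestClassifier(n_estimators=200, min_samples_leaf=5, criterion="gini"),
    # Three hidden layers of 100 units each, trained with the Adam optimizer;
    # MLPClassifier minimizes the cross-entropy loss internally.
    "MLP": MLPClassifier(hidden_layer_sizes=(100, 100, 100), solver="adam"),
}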

As discussed in Section 4.1, we define tied neighborhoods as those having a significantly small difference in collective efficacy value. Tied neighborhoods are considered interchangeable in the ranking. We define ties based on a threshold on the collective efficacy difference. We compute the standard deviation of the collective efficacy values of the neighborhoods in our study and define our threshold based on different coefficients of this standard deviation. We vary the coefficient from 0 to 1 with 0.2 increments and evaluate the ranking consensus using the ranking correlation metric discussed in Section 4.5. More details on tied neighborhoods are shared in Appendix A.7. The results are discussed in Section 6.


ID  Model                                                            0        0.2      0.4      0.6      0.8      1        AUC-ERC
1   Random                                                           0        0.09     0.20     0.30     0.40     0.49     0.25
2   Coordinates                                                     -0.04     0.05     0.17     0.27     0.37     0.47     0.22
3   User count                                                       0.03     0.13     0.25     0.35     0.45     0.53     0.292
4   Tweet count                                                      0.04     0.14     0.25     0.35     0.45     0.53     0.295
5   Population                                                       0.05     0.15     0.26     0.36     0.46     0.55     0.306
6   Venue count                                                      0.09     0.19     0.3      0.4      0.49     0.58     0.342
7   Doc2vec + Sentiment                                              0.3539   0.4504   0.5256   0.6347   0.713    0.7635   0.5764
8   Doc2vec + Sentiment + Common Users                               0.3698   0.461    0.5497   0.6043   0.7033   0.7642   0.5770
9   Doc2vec + Sentiment + Common Users + Topics                      0.368    0.4367   0.5425   0.6412   0.6988   0.7647   0.5771
10  Doc2vec + Sentiment + Common Users + Topics + Distance           0.3748   0.4565   0.5388   0.6322   0.7207   0.7735   0.5844
11  Doc2vec + Sentiment + Common Users + Topics + Distance + Tfidf   0.3686   0.4597   0.5207   0.6294   0.6957   0.7666   0.5746

Table 2: The drilldown of our proposed model: comparison of the Kendall τ_x at each tie coefficient and the area under the collective efficacy ranking curve (AUC-ERC) for the baselines and our proposed framework. The top 40% of highly tweeted block groups were included in this experiment. Random forest was used for the local ordering task.


Figure 2: Ranking performance of our proposed model and the baselines. We used 4 classifiers for the local ranking module. The x axis indicates the coefficient that is multiplied by the standard deviation to produce the tie threshold. The standard deviation of the ground truth collective efficacy for the 157 block groups included in this experiment is 0.18.

6 Results
6.1 Ranking performance
In this section, we present the ranking agreement of the permutation generated by our framework using 4 different classifiers when the most informative combination of the features discussed in Section 4.2 was used. More specifically, we used doc2vec, the distribution of topics, the distribution of sentiment, the normalized common user count, and the spatial distance to characterize each pair of neighborhoods in our study. More details on the parameter settings of the classifiers and the feature analysis are presented in Section 5.3 and Section 6.2. As shown in Figure 2, our framework, even when used with a linear classifier such as logistic regression, outperforms the baselines by at least 20%. Also, it can be seen that random forest, closely followed by the multi-layer perceptron, consistently gives better ranking correlation results in comparison to the other classifiers.

6.2 Model drill down
In this section we discuss our experiments on the effect of each feature discussed in Section 4.2 on the ranking. To determine the best content feature, we experimented with the features in this group, namely TF-IDF of the crime lexicon, the topic distribution of urban activities, and doc2vec, on the top 40% highly tweeted neighborhoods. We performed experiments with all different combinations of our 3 content factors. Each content feature is enabled in 3 combinations and disabled in the 3 corresponding paired combinations. Each factorial experiment was conducted using 3 classifiers for the local ranking module, including random forest, multi-layer perceptron, and logistic regression. We repeated this process for 6 tie coefficients, varied from 0 to 1 with an interval of 0.2. This resulted in 54 experiments in which a content feature is enabled and 54 experiments in which it is disabled. We observed that adding doc2vec increases the ranking performance and that this boost is statistically significant (Wilcoxon signed-rank test with p-value < 0.001) (Wilcoxon, Katti, and Wilcox). However, this was not the case for the two other content features. The box plot of these experiments is shared in Appendix A.8. Next, we examine the impact of additional features along with doc2vec on the ranking performance. In each experiment we computed the ranking correlation of the generated ranking with the ground truth ranking, with tie coefficient values ranging from 0 to 1 with an interval of 0.2. The results are presented in Table 2. As can be seen from the table, regardless of the choice of feature combination or tie coefficient, our model consistently outperforms the baselines. To summarize the ranking correlation results, we rely on AUC-ERC, the area under the curve of the graph created by plotting tie coefficients against τ_x. From Table 2, we see that models 7 to 11 show better AUC-ERC scores than the baselines. Model number 10 achieves the highest AUC-ERC score. We see that model 11, which includes all the features, is not the best performing model. We conjecture that this behaviour is due to over-fitting of the model on the training set. The TF-IDF feature has a dimensionality of 100, and the classifier might learn a function to predict the local rank between a pair of neighborhoods based on a few crime lexicon words (e.g., gun, shooting) in the training set.



Figure 3: Ranking performance of our framework and the baselines on different sets of neighborhoods. The sets are defined based on the number of collected tweets. The x axis indicates the tie coefficient. The tie threshold is computed by multiplying the standard deviation of collective efficacy by the tie coefficient. The standard deviation of each set is reported in Table 1.

However, the test set might not contain the words on which the classifier learned the function, thereby resulting in wrong predictions. To summarize, using Model 10 as our proposed model, our generated ranking of neighborhoods achieves 0.77 Kendall tau-x ranking agreement with the ground truth ranking. Our results are between 20% and 37% better than the baselines depending on the choice of tie threshold.
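A minimal sketch of the AUC-ERC summary used above, computed with the trapezoidal rule over the tie-coefficient grid (not the authors' code):

import numpy as np

def auc_erc(tie_coefficients, tau_x_values) -> float:
    # Area under the curve of tau_x plotted against the tie coefficient.
    return float(np.trapz(tau_x_values, x=tie_coefficients))

For instance, applied to the Random row of Table 2 this gives approximately 0.25, consistent with the reported value.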

6.3 Effect of data availability
In this section, we explore to what extent the ranking consensus depends on the amount of data we have for each neighborhood. With this in mind, we solved the ranking task for the different sets of block groups introduced in Section 5.1. We used our ranking framework with MLP as the classifier to rank each set. As indicated in Figure 3, the more data we have for the neighborhoods in our study, the higher the ranking consensus between the generated ranking and the ground truth ranking.

7 CONCLUSION
In this paper, we focused on the problem of the costly computation of collective efficacy values for neighborhoods. With the help of extensive experiments, we showed that this problem can be addressed by leveraging social media data. Our proposed framework allows frequent and less costly access to the collective efficacy values of neighborhoods. In the future, we plan to leverage data from other sources (e.g., additional social forums and census data) to improve our model. Additionally, we plan to explore the ego-nets of users on social media and assign higher importance to the tweets of users who are more familiar with a particular neighbourhood. Our proposed framework can act as an early warning system to capture transformations in neighborhoods' composition. This can potentially assist regulators and policymakers in prioritizing resources and monitoring neighborhood safety and upkeep. It is our intent to release the tweet IDs as well as the ground truth collective efficacy values of the neighborhoods once this work is published.

8 Acknowledgements
This material is based upon work supported by the National Institutes of Health (NIH) under Grant No. NIH-1R01HD088545-01A1. Any opinions, findings, and conclusions in this material are those of the author(s) and may not reflect the views of the respective funding agency.

References

[Aghababaei and Makrehchi 2016] Aghababaei, S., and Makrehchi, M. 2016. Mining social media content for crime prediction. In WI.

[Al-Olimat et al. 2018] Al-Olimat, H. S.; Thirunarayan, K.; Shalin, V.; and Sheth, A. 2018. Location name extraction from targeted text streams using gazetteer-based statistical language models. COLING.

[Bandura and Wessels 1997] Bandura, A., and Wessels, S. 1997. Self-efficacy.

[Bandura 1997] Bandura, A. 1997. Editorial. American Journal of Health Promotion.

[Bendler et al. 2014] Bendler, J.; Brandt, T.; Wagner, S.; and Neumann, D. 2014. Investigating crime-to-twitter relationships in urban environments - facilitating a virtual neighborhood watch.

[Bertrand et al. 2013] Bertrand, K. Z.; Bialik, M.; Virdee, K.; Gros, A.; and Bar-Yam, Y. 2013. Sentiment in New York City: A high resolution spatial and temporal view. arXiv preprint arXiv:1308.5010.

[Blei, Ng, and Jordan 2003] Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent dirichlet allocation. JMLR.

[Browning, Cagney, and Boettner 2016] Browning, C. R.; Cagney, K. A.; and Boettner, B. 2016. Neighborhood, place, and the life course. In Handbook of the life course.

[Chainey and Ratcliffe 2013] Chainey, S., and Ratcliffe, J. 2013. GIS and crime mapping.

[Compton et al. 2013] Compton, R.; Lee, C.; Lu, T.-C.; De Silva, L.; and Macy, M. 2013. Detecting future social unrest in unprocessed twitter data: emerging phenomena and big data. In ISI.

[Couper 2017] Couper, M. P. 2017. New developments in survey data collection. Annual Review of Sociology.

[Emond and Mason 2002] Emond, E. J., and Mason, D. W. 2002. A new rank correlation coefficient with application to the consensus ranking problem. Journal of Multi-Criteria Decision Analysis.

[Fu et al. 2018] Fu, C.; McKenzie, G.; Frias-Martinez, V.; and Stewart, K. 2018. Identifying spatiotemporal urban activities through linguistic signatures. Computers, Environment and Urban Systems.

[Gerber 2014] Gerber, M. S. 2014. Predicting crime using twitter and kernel density estimation. DSS.


[Gilbert 2014] Gilbert, C. H. E. 2014. Vader: A parsimonious rule-based model for sentiment analysis of social media text. In ICWSM.

[Glavas and Stajner 2015] Glavas, G., and Stajner, S. 2015. Simplifying lexical simplification: do we need simplified corpora? In ACL.

[Go, Bhayani, and Huang 2009] Go, A.; Bhayani, R.; and Huang, L. 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford.

[Goodman 1961] Goodman, L. A. 1961. Snowball sampling. The Annals of Mathematical Statistics.

[Hipp 2016] Hipp, J. R. 2016. Collective efficacy: How is it conceptualized, how is it measured, and does it really matter for understanding perceived neighborhood crime and disorder? JCJ.

[Hu and Liu 2004] Hu, M., and Liu, B. 2004. Mining and summarizing customer reviews. In SIGKDD.

[Jacobs 1961] Jacobs, J. 1961. The death and life of great American cities. New York, NY: Vintage.

[Jarvelin and Kekalainen 2002] Jarvelin, K., and Kekalainen, J. 2002. Cumulated gain-based evaluation of IR techniques. TOIS.

[Kendall 1945] Kendall, M. G. 1945. The treatment of ties in ranking problems. Biometrika.

[Kingma and Ba 2015] Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. ICLR.

[Kornhauser 1978] Kornhauser, R. R. 1978. Social sources of delinquency: An appraisal of analytic models.

[Kwak et al. 2010] Kwak, H.; Lee, C.; Park, H.; and Moon, S. 2010. What is twitter, a social network or a news media? In WWW.

[Le and Mikolov 2014] Le, Q., and Mikolov, T. 2014. Distributed representations of sentences and documents. In ICML.

[Levallois 2013] Levallois, C. 2013. Umigon: sentiment analysis based on terms lists and heuristics. In SemEval.

[Liu and others 2009] Liu, T.-Y., et al. 2009. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval.

[Maddela and Xu 2018] Maddela, M., and Xu, W. 2018. A word-complexity lexicon and a neural readability ranking model for lexical simplification. In ACL.

[Malleson and Andresen 2015] Malleson, N., and Andresen, M. A. 2015. The impact of using social media data in crime rate calculations: shifting hot spots and changing spatial patterns. CaGIS.

[Morenoff, Sampson, and Raudenbush 2001] Morenoff, J. D.; Sampson, R. J.; and Raudenbush, S. W. 2001. Neighborhood inequality, collective efficacy, and the spatial dynamics of urban violence. Criminology.

[Pacheco, Oliveira, and Menezes 2017] Pacheco, D. F.; Oliveira, M.; and Menezes, R. 2017. Using social media to assess neighborhood social disorganization.

[Paetzold and Specia 2017] Paetzold, G., and Specia, L. 2017. Lexical simplification with neural ranking. In EACL.

[Paskevich et al. 1999] Paskevich, D. M.; Brawley, L. R.; Dorsch, K. D.; and Widmeyer, W. N. 1999. Relationship between collective efficacy and team cohesion: Conceptual and measurement issues. Group Dynamics: Theory, Research, and Practice.

[Patton 2015] Patton, D. 2015. Gang violence, crime, and substance use on twitter: A snapshot of gang communications in Detroit. In SSWR.

[Ribeiro et al. 2016] Ribeiro, F. N.; Araujo, M.; Goncalves, P.; Goncalves, M. A.; and Benevenuto, F. 2016. Sentibench - a benchmark comparison of sentiment analysis methods. EPJ Data Science.

[Sampson and Groves 1989] Sampson, R. J., and Groves, W. B. 1989. Community structure and crime: Testing social-disorganization theory. AJS.

[Sampson, Morenoff, and Earls 1999] Sampson, R. J.; Morenoff, J. D.; and Earls, F. 1999. Beyond social capital: Spatial dynamics of collective efficacy for children. American Sociological Review.

[Sampson, Raudenbush, and Earls 1997] Sampson, R. J.; Raudenbush, S. W.; and Earls, F. 1997. Neighborhoods and violent crime: A multilevel study of collective efficacy. Science.

[Skogan 2015] Skogan, W. 2015. Disorder and decline: The state of research. JRCD.

[Thelwall et al. 2010] Thelwall, M.; Buckley, K.; Paltoglou, G.; Cai, D.; and Kappas, A. 2010. Sentiment strength detection in short informal text. JASIST.

[Vedula, Parthasarathy, and Shalin 2017] Vedula, N.; Parthasarathy, S.; and Shalin, V. L. 2017. Predicting trust relations within a social network: A case study on emergency response. In WebSci.

[Wang and Gerber 2015] Wang, M., and Gerber, M. S. 2015. Using twitter for next-place prediction, with an application to crime prediction. In IEEE Symposium Series on Computational Intelligence.

[Wang, Brown, and Gerber 2012] Wang, X.; Brown, D. E.; and Gerber, M. S. 2012. Spatio-temporal modeling of criminal incidents using geographic, demographic, and twitter-derived information. In ISI.

[Wang, Gerber, and Brown 2012] Wang, X.; Gerber, M. S.; and Brown, D. E. 2012. Automatic crime prediction using events extracted from twitter posts. In SBP-BRiMS.

[Weisburd, Bruinsma, and Bernasco] Weisburd, D.; Bruinsma, G. J.; and Bernasco, W. Units of analysis in geographic criminology: historical development, critical issues, and open questions. In Putting crime in its place.

[Welsh, Braga, and Bruinsma 2015] Welsh, B. C.; Braga, A. A.; and Bruinsma, G. J. 2015. Reimagining broken windows: from theory to policy. Journal of Research in Crime and Delinquency.

[Wilcoxon, Katti, and Wilcox] Wilcoxon, F.; Katti, S.; and Wilcox, R. A. Critical values and probability levels for the Wilcoxon rank sum test and the Wilcoxon signed rank test. Selected tables in mathematical statistics.


[Williams, Burnap, and Sloan 2017] Williams, M. L.; Burnap, P.; and Sloan, L. 2017. Crime sensing with big data: The affordances and limitations of using open-source communications to estimate crime patterns. BJC.

[Wilson and Kelling 1982] Wilson, J. Q., and Kelling, G. L. 1982. Broken windows. Atlantic Monthly.

[Zhao et al. 2015] Zhao, W.; Chen, J. J.; Perkins, R.; Liu, Z.; Ge, W.; Ding, Y.; and Zou, W. 2015. A heuristic approach to determine an appropriate number of topics in topic modeling. In BMC Bioinformatics.


A Supplemental Material

A.1 Data Collection

We collected a significant amount of data from Twitter. We used snowball sampling to identify the Twitter accounts of local citizens. Crawling publicly available tweets from user profiles enables us to collect significantly more data in comparison to collecting streaming real-time tweets from the Columbus area. For the purpose of our study, 63 Twitter accounts that mostly posted news and information about the city of Columbus were identified and used as the seed users. Many local residents follow such accounts to stay informed about local events (Kwak et al. 2010). The seed set included the Twitter accounts of several organizations including major universities, recreational centers, medical centers, newspapers, local bloggers, local reporters, police, libraries, and restaurants, as well as the local sports teams. Using Twitter's streaming API, the followers of the seed accounts were collected. Following this step, we explored users' profiles and identified 54K public profiles that marked their locations as Columbus or one of the suburban areas included in the AHDC study. The AHDC study area included several populous suburbs. Collectively, 50 million publicly available tweets were collected from these accounts. In another wave of data collection, we collected publicly available geo-tagged tweets for the period of May-August 2018. This resulted in an additional 2.8 million tweets. Next, LNEx was used for location name extraction from tweets and for associating tweets to neighborhoods. There are cases in which ambiguous locations were reported by LNEx. In our study, we exclude tweets containing ambiguous location entities. The location ambiguities were observed in the following cases:
• A location entity may have several matches in the gazetteer. For example, Holiday Inn and Gamestop.
• A location entity having a single gazetteer entry can potentially refer to a huge area. For example, the gazetteer entry for the Trans-Siberian Highway in Russia spans from St. Petersburg to Vladivostok. Such entities cannot be mapped to a single neighborhood.
• Location entities extracted by LNEx having a gazetteer entry but not referring to a location in the context. For example, American Girl, Modern Male, etc.
Such mentions were identified manually and excluded from the study. This pruning step resulted in 4846 unique locations in the area that were spotted in 545k tweets and were mapped to 424 neighborhoods.

A.2 TF-IDF of crime related words

We tokenized each tweet in our train set, preserving the hashtags, handles, and emojis as separate words. We then removed the stopwords and lemmatized the tokens. Bigrams of the tweets were added to the token set. The top 100 crime-related terms that were most frequent across the tweets were chosen as our vocabulary set. For the test set, we concatenated all the tweets in each neighborhood to get a single corpus per neighborhood. We then transformed each corpus to get the corresponding term-document vector.


Figure 4: Distribution of the collective efficacy values in each set of neighborhoods. The collective efficacy values are computed from the survey study and are normalized to the 0 to 1 range. Refer to Section 3 for more details.


Figure 5: RPC was used to determine the appropriate number of topics. RPC is maximized at 70 topics.

A.3 Distribution of spatio-temporal urban activities

Prior to feeding the corpus to the LDA module, we tokenized the tweets using a tokenizer adapted for tweets6, removed stop words, lemmatized the tokens, and added the bigrams that appeared in more than 20 tweets to our set of tokens. Next, we removed the words that appeared in fewer than 20 tweets (rare words) or in more than 50% of the tweets. Employing RPC, we used an increment of 10, varied the number of topics from 10 to 150, and trained the LDA model on a corpus of 5M tweets collected from users' profiles. As depicted in Figure 5, RPC is maximized at 70 topics. Thus, we used 70 as the optimal number of topics for our model.
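A minimal sketch of this topic-model training (not the authors' code), assuming gensim and a list docs of preprocessed token lists; the frequency cutoffs mirror the preprocessing described above:

from gensim.corpora import Dictionary
from gensim.models import LdaModel

def train_lda(docs, num_topics=70):
    dictionary = Dictionary(docs)
    # Drop tokens that appear in fewer than 20 documents or in more than 50% of them.
    dictionary.filter_extremes(no_below=20, no_above=0.5)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    return LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)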

A.4 Document Embedding
We tokenized, lemmatized, and removed the stop words of 5M tweets collected from user profiles. Subsequently, we fit a Doc2vec model on this corpus. We set the vector size to 50.

6 We used the open source tokenizer presented in https://github.com/erikavaris/tokenizer



Figure 6: The box plot of τ_x values for the TF-IDF, distribution of topics, and doc2vec content features on the top 40% of tweeted neighborhoods, with tie coefficients from 0 to 1 with an interval of 0.2. Three classifiers, including random forest, multi-layer perceptron, and logistic regression, were used to conduct a total of 54 experiments per content feature.

For each neighborhood, we concatenate all of the associated tweets and generate the embedding using the trained model.

A.5 Sentiment Distribution
We applied the 5 sentiment analysis tools to each tweet and normalized the values to the range of -1 to 1. Most of these tools predict the sentiment value using a predefined lexicon. Thus, they cannot perform accurately in the absence of sentiment lexicon terms in the tweets. To account for this, for each tweet, we only consider the non-zero outputs and compute their average value. Subsequently, we use a binning step to put the tweets associated with a neighborhood into four bins - highly negative, negative, positive, and highly positive. We normalized the value of the bins by dividing the counts by the total number of tweets of the neighborhood. At the end of this step, for each neighborhood, we report the distribution of sentiment of all the tweets mentioning a venue located inside the boundaries of the neighborhood.

A.6 Distribution of Collective Efficacy
The distribution of collective efficacy is presented in Figure 4. As can be seen in the plot, in all of the neighborhood sets, the distribution of the collective efficacy ground truth values is approximately similar to that of the set of all neighborhoods. Also, it can be seen that most of the block groups have a collective efficacy value between 0.4 and 0.6.

A.7 Tied Neighborhoods
As discussed in Section 4.1, we define tied neighborhoods as those having a significantly small difference in collective efficacy value. Tied neighborhoods are considered interchangeable in the ranking. We define ties based on a threshold on the collective efficacy difference. We refer to this value as the "Tie Threshold". We compute the standard deviation of the collective efficacy values of the neighborhoods in our study and define our threshold based on different coefficients of the standard deviation of the collective efficacy.

Figure 7: Number of tied neighborhood pairs at each tie coefficient. Tied neighborhoods are not ranked against each other. For N neighborhoods in each set, the number of pairs is N × (N − 1). The total number of pairs at each tie threshold is shown with the label "infinity". By increasing the tie threshold, the number of tied neighborhood pairs increases.

We refer to these coefficients as "Tie Coefficients". We vary the coefficient from 0 to 1 with 0.2 increments and evaluate the ranking consensus using the ranking correlation metric discussed in Section 4.5. The number of neighborhood pairs that are considered "tied" at each tie threshold is shown in Figure 7. As indicated in the plot, by increasing the tie threshold, the number of tied pairs increases.

A.8 Results
In order to find the best content feature, we experimented with the features in this group, namely TF-IDF of the crime lexicon, the topic distribution of urban activities, and doc2vec, on the top 40% highly tweeted neighborhoods. We performed experiments with all different combinations of our 3 content factors. Each content feature is enabled in 3 combinations and disabled in the 3 corresponding paired combinations. We conducted each factorial experiment with 3 classifiers for the local ranking module, including random forest, multi-layer perceptron, and logistic regression. We repeated this process for 6 tie coefficients, varied from 0 to 1 with an interval of 0.2. The number of tied neighborhoods at each tie coefficient is shown in Figure 7. The cross product of these parameters resulted in 3 × 3 × 6 = 54 experiments in which a content feature is enabled. The box plot of the observed ranking consensus for these 54 experiments for each content feature is presented in Figure 6. As can be seen in the figure, by characterizing a neighborhood's tweets using doc2vec, we consistently generate better rankings in comparison to TF-IDF of the crime lexicon and the topic distribution of urban activities.