2
Unsupervised Hashtag Retrieval and Visualization for Crisis Informatics Yao Gu Department of Computer Science Los Angeles, California [email protected] Mayank Kejriwal Information Sciences Institute Marina Del Rey, California [email protected] ABSTRACT In social media like Twitter, hashtags carry a lot of semantic infor- mation and can be easily distinguished from the main text. Explor- ing and visualizing the space of hashtags in a meaningful way can offer important insights into a dataset, especially in crisis situations. In this demo, we present a functioning prototype, HashViz, that ingests a corpus of tweets collected in the aftermath of a crisis situation (such as the Las Vegas shootings) and uses the fastText bag-of-tricks semantic embedding algorithm (from Facebook Re- search) to embed words and hashtags into a vector space. Hashtag vectors obtained in this way can be visualized using the t-SNE di- mensionality reduction algorithm in 2D. Although multiple Twitter visualization platforms exist, HashViz is distinguished by being simple, scalable, interactive and portable enough to be deployed on a server for million-tweet corpora collected in the aftermath of arbitrary disasters, without special-purpose installation, technical expertise, manual supervision or costly software or infrastructure investment. Although simple, we show that HashViz offers an intu- itive way to summarize, and gain insight into, a developing crisis situation. HashViz is also completely unsupervised, requiring no manual inputs to go from a raw corpus to a visualization and search interface. Using the recent Las Vegas mass shooting massacre as a case study, we illustrate the potential of HashViz using only a web browser on the client side. KEYWORDS Information Retrieval, Visualization, Social Web, Twitter, Crisis Informatics, Text Embeddings, Data Preprocessing ACM Reference Format: Yao Gu and Mayank Kejriwal. 2018. Unsupervised Hashtag Retrieval and Visualization for Crisis Informatics. In Proceedings of ACM WSDM conference (Social Web in Emergency and Disaster Management). ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn 1 DEMO We provide a high-level overview of the demo via the screenshots in Figures 1, 2 and 3. Following pre-computation steps such as corpus preprocessing, embedding and t-SNE visualization, the user may commence the demo by entering a hashtag query such as ‘lasvegasmassacre’. Using the semantic text embeddings, the system, Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Social Web in Emergency and Disaster Management, 2018 © 2018 Copyright held by the owner/author(s). ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. https://doi.org/10.1145/nnnnnnn.nnnnnnn Figure 1: Display after searching for hashtag ‘lasvegasmas- sacre’. in real-time, computes the hundred nearest neighbors of the hashtag (if it exists 1 ) and returns an interactive scatter plot. Thus, in an unsupervised way, and without relying on storage of original text or tweet data, the system is able to limit the display to only a hundred relevant entries, from tens of thousands of hashtags in the corpus. Potential interactions (which may be interleaved) include drawing a bounding box and zooming into any region of the plot, and scrolling around the plot to explore the neighborhood. In the scatter plot, the hashtag is encircled in red, and one can intuitively note the similarities between that hashtag and other hashtags in the corpus (Figure 2). Similar hashtags, in the text em- bedding space (which uses the context to determine similarity) tend to be closer together, even when they share little in common in terms of the raw string. For example, when the user draws a bounding box around ‘lasvegasmassacre’ and zooms in, #lasveg- asmassacre and #mandalaybayattack appear relatively close, even though they share no terms in common. This has been achieved without any manual supervision or labeling. The user can also scroll around the ‘hashtag space’ to gain situa- tional awareness and insight that she might not fully be aware of otherwise. If we go a little bit further, in Figure 3 for example, we gain an additional piece of information (the name of the shooter i.e. Stephen Paddock). 1 Currently, the system does not work for non-existing hashtags, but we are working to extend its capabilities thereof by drawing on the generalization offered by the text embeddings. arXiv:1801.05906v1 [cs.IR] 18 Jan 2018

Unsupervised Hashtag Retrieval and Visualization for Crisis … · 2018-01-19 · [5] Thanaa M Ghanem, Amr Magdy, Mashaal Musleh, Sohaib Ghani, and Mohamed F Mokbel. 2014. VisCAT:

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Unsupervised Hashtag Retrieval and Visualization for Crisis … · 2018-01-19 · [5] Thanaa M Ghanem, Amr Magdy, Mashaal Musleh, Sohaib Ghani, and Mohamed F Mokbel. 2014. VisCAT:

Unsupervised Hashtag Retrieval and Visualization for CrisisInformatics

Yao GuDepartment of Computer Science

Los Angeles, [email protected]

Mayank KejriwalInformation Sciences InstituteMarina Del Rey, California

[email protected]

ABSTRACTIn social media like Twitter, hashtags carry a lot of semantic infor-mation and can be easily distinguished from the main text. Explor-ing and visualizing the space of hashtags in a meaningful way canoffer important insights into a dataset, especially in crisis situations.In this demo, we present a functioning prototype, HashViz, thatingests a corpus of tweets collected in the aftermath of a crisissituation (such as the Las Vegas shootings) and uses the fastTextbag-of-tricks semantic embedding algorithm (from Facebook Re-search) to embed words and hashtags into a vector space. Hashtagvectors obtained in this way can be visualized using the t-SNE di-mensionality reduction algorithm in 2D. Although multiple Twittervisualization platforms exist, HashViz is distinguished by beingsimple, scalable, interactive and portable enough to be deployedon a server for million-tweet corpora collected in the aftermath ofarbitrary disasters, without special-purpose installation, technicalexpertise, manual supervision or costly software or infrastructureinvestment. Although simple, we show that HashViz offers an intu-itive way to summarize, and gain insight into, a developing crisissituation. HashViz is also completely unsupervised, requiring nomanual inputs to go from a raw corpus to a visualization and searchinterface. Using the recent Las Vegas mass shooting massacre as acase study, we illustrate the potential of HashViz using only a webbrowser on the client side.

KEYWORDSInformation Retrieval, Visualization, Social Web, Twitter, CrisisInformatics, Text Embeddings, Data PreprocessingACM Reference Format:Yao Gu and Mayank Kejriwal. 2018. Unsupervised Hashtag Retrieval andVisualization for Crisis Informatics. In Proceedings of ACMWSDM conference(Social Web in Emergency and Disaster Management). ACM, New York, NY,USA, 2 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 DEMOWe provide a high-level overview of the demo via the screenshotsin Figures 1, 2 and 3. Following pre-computation steps such ascorpus preprocessing, embedding and t-SNE visualization, the usermay commence the demo by entering a hashtag query such as‘lasvegasmassacre’. Using the semantic text embeddings, the system,Permission to make digital or hard copies of part or all of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page. Copyrights for third-party components of this work must be honored.For all other uses, contact the owner/author(s).Social Web in Emergency and Disaster Management, 2018© 2018 Copyright held by the owner/author(s).ACM ISBN 978-x-xxxx-xxxx-x/YY/MM.https://doi.org/10.1145/nnnnnnn.nnnnnnn

Figure 1: Display after searching for hashtag ‘lasvegasmas-sacre’.

in real-time, computes the hundred nearest neighbors of the hashtag(if it exists1) and returns an interactive scatter plot. Thus, in anunsupervised way, and without relying on storage of original textor tweet data, the system is able to limit the display to only ahundred relevant entries, from tens of thousands of hashtags in thecorpus. Potential interactions (which may be interleaved) includedrawing a bounding box and zooming into any region of the plot,and scrolling around the plot to explore the neighborhood.

In the scatter plot, the hashtag is encircled in red, and one canintuitively note the similarities between that hashtag and otherhashtags in the corpus (Figure 2). Similar hashtags, in the text em-bedding space (which uses the context to determine similarity)tend to be closer together, even when they share little in commonin terms of the raw string. For example, when the user draws abounding box around ‘lasvegasmassacre’ and zooms in, #lasveg-asmassacre and #mandalaybayattack appear relatively close, eventhough they share no terms in common. This has been achievedwithout any manual supervision or labeling.

The user can also scroll around the ‘hashtag space’ to gain situa-tional awareness and insight that she might not fully be aware ofotherwise. If we go a little bit further, in Figure 3 for example, wegain an additional piece of information (the name of the shooter i.e.Stephen Paddock).

1Currently, the system does not work for non-existing hashtags, but we are workingto extend its capabilities thereof by drawing on the generalization offered by the textembeddings.

arX

iv:1

801.

0590

6v1

[cs

.IR

] 1

8 Ja

n 20

18

Page 2: Unsupervised Hashtag Retrieval and Visualization for Crisis … · 2018-01-19 · [5] Thanaa M Ghanem, Amr Magdy, Mashaal Musleh, Sohaib Ghani, and Mohamed F Mokbel. 2014. VisCAT:

Social Web in Emergency and Disaster Management, 2018 Gu and Kejriwal

Figure 2: Display after drawing a bounding box around‘lasvegasmassacre’ and zooming in.

Figure 3: Display allows a user to interactively scroll aroundthe plot.

The entire process can be restarted simply by entering anotherquery. The system is currently deployed on a cloud server and iscompletely unsupervised.

2 RELATEDWORKVisualization is an important part of any human-centric systemthat is attempting to make sense of a large amount of information.Several good crisis informatics platforms that provide visualiza-tions include Ushahidi [3], Twitris [7], Twitcident [2], AIDR [6],CrisisTracker[11], TweetTracker [8], and several others [4], [12].Our preliminary system is meant to complement, not compete with,

the capabilities offered by these platforms, since it provides succinct‘summarization’-based visualizations that are unsupervised, canrun on large and potentially noisy corpora, and that are specificallytuned to rapidly explore the space of Twitter hashtags in crisissituations.

In addition to this body of work, over the years, several sophisti-cated Twitter-specific visualization platforms have been proposed,but their complexity precludes them from rapid deployment andprototyping, especially in resource-constrained environments [10],[1], [9], [5]. Some are specifically tuned for spatio-temporal analy-ses or community detection, and require access to text and otherinformation. In contrast, our system is simple, scalable and portable,and relies on loosely coupled open-source components. Privacy con-cerns are also alleviated because hashtags, by themselves, do notcarry sensitive information about individual users. The case studyvisualizations in Figures 2 and 3 show that hashtags can (even bythemselves) provide useful situational and summarization-basedinsights for crisis informatics.

In ongoing work, we are setting this system up so that it can berapidly deployed on any Twitter corpus, possibly being updatedin a streaming fashion, with only a few commands. We may alsomake the hashtags clickable, to access more insights related to thathashtag, and extend HashViz beyond just hashtags.

REFERENCES[1] Youcef Abdelsadek, Kamel Chelghoum, Francine Herrmann, Imed Kacem, and

Benoît Otjacques. 2018. Community extraction and visualization in social net-works applied to Twitter. Information Sciences 424 (2018), 204–223.

[2] Fabian Abel, Claudia Hauff, Geert-Jan Houben, Richard Stronkman, and Ke Tao.2012. Twitcident: fighting fire with information from social web streams. InProceedings of the 21st International Conference on World Wide Web. ACM, 305–308.

[3] Ken Banks and Erik Hersman. 2009. FrontlineSMS and Ushahidi-a demo. InInformation and Communication Technologies and Development (ICTD), 2009 In-ternational Conference on. IEEE, 484–484.

[4] Seonhwa Choi and Byunggul Bae. 2015. The real-time monitoring system ofsocial big data for disaster management. In Computer Science and its Applications.Springer, 809–815.

[5] Thanaa M Ghanem, Amr Magdy, Mashaal Musleh, Sohaib Ghani, and Mohamed FMokbel. 2014. VisCAT: Spatio-Temporal Visualization and Aggregation of Cate-gorical Attributes in Twitter Data (Demo Paper). (2014).

[6] Muhammad Imran, Carlos Castillo, Ji Lucas, Patrick Meier, and Sarah Vieweg.2014. AIDR: Artificial intelligence for disaster response. In Proceedings of the 23rdInternational Conference on World Wide Web. ACM, 159–162.

[7] Ashutosh Sopan Jadhav, Hemant Purohit, Pavan Kapanipathi, Pramod Anan-tharam, Ajith H Ranabahu, Vinh Nguyen, Pablo N Mendes, Alan Gary Smith,Michael Cooney, and Amit P Sheth. 2010. Twitris 2.0: Semantically empoweredsystem for understanding perceptions from social data. (2010).

[8] Shamanth Kumar, Geoffrey Barbier, Mohammad Ali Abbasi, and Huan Liu.2011. TweetTracker: An Analysis Tool for Humanitarian and Disaster Relief.. InICWSM.

[9] Chen Liu, Dongxiang Zhang, and Yueguo Chen. 2015. Personalized KnowledgeVisualization in Twitter. In International Conference on Conceptual Modeling.Springer, 409–423.

[10] Fred Morstatter, Shamanth Kumar, Huan Liu, and Ross Maciejewski. 2013. Under-standing twitter data with tweetxplorer. In Proceedings of the 19th ACM SIGKDDinternational conference on Knowledge discovery and data mining. ACM, 1482–1485.

[11] Jakob Rogstadius, Maja Vukovic, CA Teixeira, Vassilis Kostakos, Evangelos Kara-panos, and Jim Alain Laredo. 2013. CrisisTracker: Crowdsourced social mediacuration for disaster awareness. IBM Journal of Research and Development 57, 5(2013), 4–1.

[12] Dennis Thom, Robert Krüger, Thomas Ertl, Ulrike Bechstedt, Axel Platz, JuliaZisgen, and Bernd Volland. 2015. Can twitter really save your life? A case study ofvisual social media analytics for situation awareness. In Visualization Symposium(PacificVis), 2015 IEEE Pacific. IEEE, 183–190.