Learning from Twitter Hashtags: Leveraging Proximate Tags to Enhance Graph-based Keyphrase Extraction

1. Learning from Twitter Hashtags: Leveraging Proximate Tags to Enhance Graph-based Keyphrase Extraction
Abdelghani Bellaachia & Mohammed Al-Dhelaan
Computer Science Department, George Washington University, Washington, DC, USA

2. Overview
- Twitter Introduction
- Why Extract Keyphrases in Twitter?
- Learning from Twitter Hashtags
- Twitter Lexical Graph Expansion
- Proposed Approach for Graph Expansion
- How to Choose Hashtags
  - Frequency Approach
  - Hybrid Approach
- How to Build Lexical Graph
- Topic Modeling
- Graph-based Ranking Scheme
- Experiments
- Experimental Results
- Conclusion

3. Twitter Introduction
- Twitter is a micro-blogging social network site.
- It enables users to blog or broadcast their thoughts and messages.
- It gained a lot of popularity because news spreads through it quickly.
- The main idea is that a user can follow the accounts of people or organizations that seem interesting to that user.
- Once a user follows an account, all the news and tweets issued by that account are shown in the user's timeline.

4. Tweets
- Tweets are the posts or messages broadcast by users.
- A tweet can include up to 140 characters.
- By nature, a tweet is meant to be broadcast to all followers of a user. However, it can be directed to a specific user with the @ mention feature.
- Tweets are generally public and anyone can view them, unless the user makes them private so that only followers can see them (rarely used).
- Tweets can include text, hashtags, mentions, or any combination of them.

5. Tweets
[Figure: example of a tweet containing a hashtag, text, and a link]

6. Hashtags
- Hashtags started as a user convention.
- They are used to index and organize tweets.
- Trend discovery.
- Every hashtag is generally about a specific topic: including a hashtag in a tweet directs that tweet to the topic, which has a specific audience.
- Multiple hashtags are accepted.
- A hashtag is a hyperlink to all tweets containing that hashtag.

7. Why Extract Keyphrases in Twitter?
- In 2011, Twitter had attracted over 200 million users, who published at least a billion tweets each week [2].
- With such a massive amount of user-generated text, summarizing the topics in tweets becomes important.
- However, tweets are short text documents, so normal summarization techniques are not applicable.
- Instead, extracting short keyphrases that represent the topics in tweets can be an insightful approach.

8. Definitions
- Topical tweets: the collection of tweets from which we extract keyphrases. Also called the target set.
- Auxiliary hashtag tweets: the collection of tweets gathered for a hashtag selected from the topical tweets.
- In this research, we investigate whether expanding the lexical graph of the topical tweets with auxiliary hashtag tweets can improve the ranking of keyphrases extracted from the target tweets.

9. Learning from Twitter Hashtags
- Tweets are short text documents.
- The scarcity of text in tweets can be an obstacle when trying to learn from text.
- However, tweets can contain an abundant number of links in the form of hashtags.
- Can we improve the ranking using an auxiliary set of hashtag tweets (external tweets)?
- How can we choose the hashtags that best fit the topic? Some hashtags are general; some are very specific.
- Can we expand the graph to include auxiliary hashtag tweets? How does that affect the ranking?

10. Twitter Lexical Graph Expansion
[Figure: the lexical graph of the target tweets set is linked through hashtags (H) to the tweets (t) of an auxiliary tweets set, giving the expanded lexical graph]

11. Proposed Approach
- From a random collection of tweets:
  - Identify topics
  - Cluster tweets based on the topics found
- For every cluster (topic):
  - Build a lexical graph to calculate word weights
  - Expand the graph with auxiliary hashtag tweets similar to the topic
  - Generate keyphrases using the top keywords
  - Rank the keyphrases
  - Show the top 10 keyphrases

12. Proposed Approach for Graph Expansion

13. How to Choose Hashtags?
- Hashtags are user generated and vary in scope.
- Expanding the graph with the wrong hashtags (irrelevant or overly general ones) can degrade the ranking.
- Two approaches to choosing hashtags for expanding the graph:
  - Frequency Approach: choose the most frequent hashtag in each topical cluster of tweets (target tweets).
  - Hybrid Approach: measure the similarity between the keywords of the tweets for each of the top-10 frequent hashtags and the keywords of the target tweets.

14. Frequency Approach
- The frequency approach is not always correct.
- Example topic: Sandusky

15. Hybrid Approach
[Figure: the keyword vector (k1, k2, k3, ..., kn) of the Target Tweets is compared by cosine similarity with the keyword vectors of Hashtag1 Tweets, Hashtag2 Tweets, ..., Hashtag10 Tweets]
- K: keywords extracted from all tweets in the set.
- Select the most similar hashtag to expand the lexical graph.

16. Hybrid Approach
- Let Target Tweets be a set of tweets {t1, t2, ..., tn}.
- From all tweets in the set, we build a vector of words TT_terms = {k1, k2, ..., kn}.
- The Target Tweets set also contains a set of hashtags occurring across its tweets. We call it HashtagTitles = {h1, h2, ..., hn}.
[Figure: the Target Tweets t1 ... tn alongside the TT_terms vector k1 ... kn]

17. Hybrid Approach
- For each hashtag in the HashtagTitles set {h1, h2, ..., hn}, we search Twitter for all tweets containing it that do not occur in the Target Tweets set.
- The search result for each hashtag is grouped in a vector of tweets called HT (Hashtag Tweets):
  h1: HT = {Ht1, Ht2, ..., Htn}
  h2: HT = {Ht1, Ht2, ..., Htn}
  ...
  hn: HT = {Ht1, Ht2, ..., Htn}

18. Hybrid Approach
- For each HT, we build a vector of words representing that hashtag separately, which we call HT_terms.
- We compute the cosine similarity between the two vectors TT_terms and HT_terms.
- Finally, we choose the most similar hashtag to expand the graph with.

19. Hybrid Approach
- Measures the similarity between the content of the tweets for the most frequent hashtags and the content of the target tweets using cosine similarity.
- The top-10 frequent hashtags are used because we assume that the most relevant hashtag is frequent.
- Selecting the most similar hashtag, by cosine similarity, among the top-10 frequent hashtags combines both approaches, which improves the accuracy of the selection.

20. Hybrid Approach
- After selecting an auxiliary hashtag tweet set, classify each of its tweets as either relevant or irrelevant by measuring the word overlap between the auxiliary tweet's terms and the top-10 tf-idf terms of the target tweets.
- If an auxiliary tweet contains at least two words from the top-10, we classify it as relevant.

21. How to Build Lexical Graph
- Let G = (V, E) be a weighted graph that represents the text.
- Vertices V denote words.
- We build an edge in E between every two words that co-occur within a specific window size.
- The weight of an edge between terms in the target tweets is the frequency of their co-occurrence.
- The co-occurrence frequency shows how strong the relationship between two nodes is.
- Edge_weight(Vi, Vj) = |co-occurrence(Vi, Vj)|

22. How to Build Lexical Graph

23. How to Build Lexical Graph

24. Topic Modeling
- Latent Dirichlet Allocation (LDA) (D. M. Blei, A. Y. Ng, and M. I. Jordan)
  - An unsupervised model that identifies topics in a collection of documents.
  - A statistical model that uses the bag-of-words assumption for each document.
  - Documents are represented as probability distributions over topics.
  - Topics are represented as probability distributions over the words of the collection.

25. Topic Modeling
- Latent Dirichlet Allocation (LDA)
[Figure: LDA plate diagram with labels Z, w, J, D; a Dirichlet prior and a multinomial distribution over topics, and a multinomial distribution over words]

26. Graph-based Ranking Scheme
- PageRank (Brin and Page, 1998)
- The voting idea: when a vertex links to another, it casts a vote for the other vertex.
- The algorithm is recursive.
- The importance of the vertex casting the vote determines the importance of the vote.
- Node ranks are recomputed iteratively until convergence.

27. Graph-based Ranking Scheme

28. Graph-based Ranking Scheme
- TextRank (Mihalcea & Tarau, 2004)
  - Create a graph for the text.
  - Words are represented as nodes (nouns and adjectives only).
  - Edges are the co-occurrences between words within a window.
  - The frequency of co-occurring words is represented by the edge weights.
  - TextRank uses the edge weights to influence the rank.

29. Graph-based Ranking Scheme

30. Graph-based Ranking Scheme
- NE-Rank (Node-Edge Rank) (Bellaachia & Al-Dhelaan)
  - Incorporates node weights into the formula.
  - Instead of using only node weights or only edge weights, we try to use both features.
  - In text, node weights are best represented by tf-idf, which reflects the content of the documents.
  - PageRank focuses only on the relations between objects, without the content.
  - TextRank uses only the co-occurrence relation to identify important words.
  - NE-Rank takes the content into consideration in the form of tf-idf.

31. Graph-based Ranking Scheme

32. Experiment
- Crawled Twitter from 1/19/2012 to 2/6/2012.
- The dataset has 31,227 tweets and 244,139 tokens.
- 40,674 hashtags in tweets (4,079 unique hashtags).
- Hashtags were segmented into word tokens in the tokenization step.
- We extracted 30 topics from the tweets.
- Let C be the collection of tweets and 1..k the topics.
- Aggregating the tweets for each topic i yields Ci, so that C = C1 ∪ C2 ∪ ... ∪ Ck.
- Build a graph and extract keyphrases from every Ci.

33. Experiment
- Preprocessing:
  - Removed non-English tweets
  - Removed URL links
  - Normalized tweets from conversational style to standard English; for example, "luv" became "love"
  - Part-of-speech tagging to extract nouns and adjectives only
  - Stemming and stopword removal

34. Experiment
- Since NE-Rank showed better results than other ranking methods in our previous research [8], we used it to compare the ranking of 3 approaches:
  - Single Approach: no graph expansion
  - Expanded with hashtags, Frequency Approach
  - Expanded with hashtags, Hybrid Approach
- We validated our results using an empirical evaluation approach, described in the next slides.

35. Experiment
- Since there are no gold labels to compare against, we empirically designed an evaluation approach that uses a search engine to generate labels.
- To generate such labels, we searched Google using the top-5 LDA terms for each topic.
- We focused on only two fields of the search-result snippets: title and description.
- If a keyphrase occurs in the search results, we consider it correct.

36. Experimental Results
Automatic approach using a search engine, top-10 keyphrases

Method                                        Precision   BPref
Single NE-Rank                                0.40        0.67
Expanded with Hashtags, Frequency Approach    0.45        0.52
Expanded with Hashtags, Hybrid Approach       0.55        0.73

37. Conclusion

38. References
[1] Liu et al., 2010. Automatic Keyphrase Extraction via Topic Decomposition.
[2] Lin, Snow, & Morgan. Smoothing Techniques for Adaptive Online Language Models: Topic Tracking in Tweet Streams.
[3] Liu et al., 2011. Why is SXSW Trending? Exploring Multiple Text Sources for Twitter Topic Summarization.
[4] X. Wan and J. Xiao. Single Document Keyphrase Extraction Using Neighborhood Knowledge.
[5] Weng et al., 2010. TwitterRank: Finding Topic-sensitive Influential Twitterers.
[6] Zhao et al., 2011. Topical Keyphrase Extraction from Twitter.
[7] Mihalcea & Tarau. TextRank: Bringing Order into Texts.
[8] Bellaachia & Al-Dhelaan. NE-Rank: A Novel Graph-based Keyphrase Extraction in Twitter (in press).

39. The End
Thank You!
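
The hybrid hashtag selection from slides 15 to 20 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (term_vector, cosine, select_hashtag, relevant_tweets) are hypothetical, raw term frequency is assumed as the vector weighting because the slides do not specify one, the tweets are assumed to be already tokenized, and the toy data (hashtags, tweets, top terms) is invented for the example.

```python
import math
from collections import Counter


def term_vector(tweets):
    """Term-frequency vector over tokenized tweets (weighting is an assumption)."""
    return Counter(token for tweet in tweets for token in tweet)


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_hashtag(target_tweets, hashtag_counts, hashtag_tweets, top_n=10):
    """Among the top_n most frequent hashtags, return the one whose auxiliary
    tweets (HT_terms) are most similar to the target tweets (TT_terms)."""
    tt_terms = term_vector(target_tweets)
    candidates = [h for h, _ in hashtag_counts.most_common(top_n)
                  if h in hashtag_tweets]
    return max(candidates,
               key=lambda h: cosine(tt_terms, term_vector(hashtag_tweets[h])),
               default=None)


def relevant_tweets(aux_tweets, top_terms, min_overlap=2):
    """Keep auxiliary tweets sharing at least min_overlap words with the
    top target terms (top-10 tf-idf terms in the slides)."""
    top = set(top_terms)
    return [tweet for tweet in aux_tweets if len(top & set(tweet)) >= min_overlap]


# Toy data, invented for illustration only.
target = [["sandusky", "trial", "penn", "state"],
          ["penn", "state", "coach", "sandusky"]]
counts = Counter({"#news": 5, "#pennstate": 3})  # hashtag frequencies in target
hashtag_tweets = {
    "#news": [["weather", "storm", "news"], ["election", "news", "state"]],
    "#pennstate": [["penn", "state", "trial"],
                   ["sandusky", "coach", "penn"],
                   ["state", "fair", "tickets"]],
}

by_frequency = counts.most_common(1)[0][0]             # frequency approach
best = select_hashtag(target, counts, hashtag_tweets)  # hybrid approach
# Stand-in for the top-10 tf-idf terms of the target tweets.
relevant = relevant_tweets(hashtag_tweets[best], ["penn", "state", "sandusky"])
print(by_frequency, best, len(relevant))  # prints: #news #pennstate 2
```

In this toy case the most frequent hashtag is a general one, while the cosine comparison picks the topic-specific hashtag, which mirrors the motivation given for the hybrid approach. The overlap filter then drops the one auxiliary tweet that shares only a single word with the top target terms.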