UPTEC IT 13 003
Examensarbete 30 hp
Februari 2013

Tweet Collect: short text message collection using automatic query expansion and classification

Erik Ward

Tweet Collect: short text message collection using automatic …uu.diva-portal.org/smash/get/diva2:606687/FULLTEXT01.pdf · 2013. 2. 20. · as the Tweet Collect system, in Java utilizing





Faculty of Science and Technology (Teknisk-naturvetenskaplig fakultet), UTH unit
Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Telephone: 018 – 471 30 03
Fax: 018 – 471 30 00
Website: http://www.teknat.uu.se/student

Abstract

Tweet Collect: short text message collection using automatic query expansion and classification

Erik Ward

The growing number of twitter users creates large amounts of messages that contain valuable information for market research. These messages, called tweets, are short, exhibit twitter-specific writing styles and are often idiosyncratic, which gives rise to a vocabulary mismatch between the keywords typically chosen for tweet collection and the words actually used to describe television shows. A method is presented that uses a new form of query expansion that generates pairs of search terms and takes the language usage of twitter into consideration, in order to access user data that would otherwise be missed. Supervised classification, without manually annotated data, is used to maintain precision by comparing collected tweets with external sources. The method is implemented, as the Tweet Collect system, in Java, utilizing many processing steps to improve performance.

The evaluation was carried out by collecting tweets about five different television shows during their time of airing. It indicates, on average, a 66.5% increase in the number of relevant tweets compared with using the title of the show as the search term, at 68.0% total precision. Classification gives a slightly lower average increase of 55.2% in the number of tweets, but a greatly increased total precision of 82.0%.

The utility of an automatic system for tracking topics that can find additional keywords is demonstrated. Implementation considerations are discussed, along with possible improvements that can lead to better performance.

Printed by: Reprocentralen ITC

Sponsor: KDDI R&D Laboratories, Göran Holmquist Foundation
ISSN: 1401-5749, UPTEC IT 13 003
Examiner: Lars-Åke Nordén
Subject reader: Tore Risch
Supervisor: Kazushi Ikeda


Summary

Social media such as Twitter are growing in popularity, and large numbers of messages, tweets, are written every day. These messages contain valuable information that can be used for market research, but they are very short, 140 characters, and in many cases exhibit an idiosyncratic mode of expression. To reach as many tweets as possible about a given product, for example a TV program, the right search terms must be available; one twitter user does not necessarily use the same words to describe the same thing as another. Different groups thus use different language and jargon. In the text of twitter messages this is evident: we can see how some users express themselves with certain so-called hashtags and other linguistic conventions. This leads to what is usually called the vocabulary mismatch problem.

To collect as many twitter messages as possible about different products, a system that can generate new search terms has been developed, here called Tweet Collect. By analyzing which words carry the most information, generating pairs of words that describe different things, and taking the language use on Twitter into account, new search terms are created from the original search terms, so-called query expansion. In addition to collecting the tweets that match the new search terms, a machine learning algorithm decides whether these tweets are relevant or not, thereby increasing precision.

After collecting tweets for five TV programs, the system was evaluated through a spot-check survey of the newly collected tweets. This survey shows that, on average, the number of relevant tweets increases by 66.5% compared with using only the title of the TV program. Of all collected tweets, only 68.0% are actually about the TV program, but with machine learning this can be increased to 82.0%, at the cost of reducing the increase in new, relevant tweets to 55.2%.

This report demonstrates the usefulness of an automatic system that can find new search terms and thereby counteract the vocabulary mismatch problem. By reaching tweets written with a different language use, it is argued that the systematic error in tweet collection decreases. The system's implementation in the Java programming language is discussed, and improvements are proposed that can lead to increased efficiency.


This thesis is dedicated to the wonderful country of Japan and all who come to experience her.

This thesis expands upon:

Erik Ward, Kazushi Ikeda [1], Maike Erdmann [2], Masami Nakazawa [2], Gen Hattori [2], and Chihiro Ono [2]. Automatic Query Expansion and Classification for Television-Related Tweet Collection. Proceedings of Information Processing Society of Japan (IPSJ) SIG Technical Reports, vol. 2012, no. 10, pp. 1–8, 2012.

Acknowledgment

I wish to thank the Göran Holmquist Foundation and the Sweden Japan Foundation for travel funding.

[1] Supervisor
[2] Proofreading


Glossary

AQE: Automatic Query Expansion; blind relevance feedback.
Corpus: A set of documents, typically in one domain.
Relevance feedback: Update a query based on documents that are known to be relevant for this query.

Table of Notations

Ω : The vocabulary: the set of all known terms.
t : Term: a word without spacing characters.
q : Query: a set of terms. q ∈ Q ⊂ D.
C : Corpus: a set of documents.
d : Document: a set of terms. d ∈ D, where D is the set of all possible documents.
tf(t, d) : Term frequency: an integer-valued function that gives the frequency of occurrence of t in d.
df(t) : Document frequency: the number of documents in a corpus that contain t.
idf(t) : lg(1/df(t)).
R : Set of related documents; used for automatic query expansion.
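The notation above can be made concrete with a small sketch in Java (the language of the Tweet Collect implementation). This is my own illustration, not code from the thesis; the toy corpus is invented, and idf follows the table's definition lg(1/df(t)) verbatim:

```java
import java.util.*;

// Toy illustration of tf(t, d), df(t) and idf(t) from the table of notations.
public class Notation {
    // tf(t, d): frequency of occurrence of term t in document d
    static int tf(String t, List<String> d) {
        int n = 0;
        for (String w : d) if (w.equals(t)) n++;
        return n;
    }

    // df(t): number of documents in the corpus that contain t
    static int df(String t, List<List<String>> corpus) {
        int n = 0;
        for (List<String> d : corpus) if (d.contains(t)) n++;
        return n;
    }

    // idf(t) = lg(1/df(t)), exactly as in the table above
    // (undefined when df(t) = 0, i.e. the term is unseen)
    static double idf(String t, List<List<String>> corpus) {
        return Math.log(1.0 / df(t, corpus)) / Math.log(2);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
            List.of("tweet", "about", "a", "tv", "show"),
            List.of("another", "tweet", "tweet"));
        System.out.println(tf("tweet", corpus.get(1))); // 2
        System.out.println(df("tweet", corpus));        // 2
    }
}
```

Note that with this definition a term occurring in more documents gets a more negative idf; the common alternative, lg(N/df(t)) with N the corpus size, shifts the scale by lg(N).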

Contents

1 Introduction

2 Background
  2.1 Twitter
    2.1.1 Structure of a tweet
    2.1.2 Accessing twitter data: Controlling sampling
    2.1.3 Stratification of tweet users and resulting language use
  2.2 Information retrieval
    2.2.1 Text data: Sparse vectors
    2.2.2 Term weights based on statistical methods
    2.2.3 The vocabulary mismatch problem
    2.2.4 Automatic query expansion
    2.2.5 Measuring performance
    2.2.6 Software systems for information retrieval
  2.3 Topic classification
  2.4 External data sources

3 Related work
  3.1 Relevant works in information retrieval
    3.1.1 Query expansion and pseudo relevance feedback
    3.1.2 An alternative representation using Wikipedia
  3.2 Classification
    3.2.1 Television ratings by classification
    3.2.2 Ambiguous tweets about television shows
    3.2.3 Other topics than television
  3.3 Tweet collection methodology
  3.4 Summary

4 Automatic query expansion and classification using auxiliary data
  4.1 Problem description and design goals
  4.2 New search terms from query expansion
    4.2.1 Co-occurrence heuristic
    4.2.2 Hashtag heuristic
    4.2.3 Algorithms
    4.2.4 Auxiliary data and pre-processing
    4.2.5 Twitter data quality issues
    4.2.6 Collection of new tweets for evaluation
  4.3 A classifier to improve precision
    4.3.1 Unsupervised system
    4.3.2 Data extraction
    4.3.3 Web scraping
    4.3.4 Classification of sparse vectors
    4.3.5 Features
    4.3.6 Classification
  4.4 Combined approach

5 Tweet Collect: Java implementation using No-SQL database
  5.1 System overview
  5.2 Components
    5.2.1 Statistics database
    5.2.2 Implementation of algorithms
    5.2.3 Twitter access
    5.2.4 Web scraping
    5.2.5 Classification
    5.2.6 Result storage and visualization
  5.3 Limitations
  5.4 Development methodology

6 Performance evaluation
  6.1 Collecting tweets about television programs
    6.1.1 Auxiliary data
    6.1.2 Experiment parameters
    6.1.3 Evaluation
  6.2 Results
    6.2.1 Ambiguity
    6.2.2 Classification
    6.2.3 System results

7 Analysis
  7.1 System results
  7.2 Generalizing the results
  7.3 Evaluation measures
  7.4 New search terms
  7.5 Classifier performance

8 Conclusions and future work
  8.1 Applicability
  8.2 Scalability
  8.3 Future work
    8.3.1 Other types of products and topics
    8.3.2 Parameter tuning
    8.3.3 Temporal aspects
    8.3.4 Understanding names
    8.3.5 Improved classification
    8.3.6 Ontology
    8.3.7 Improved scalability and performance profiling

Bibliography

Appendices

A Hashtag splitting

B SVM performance

List of Figures

2.1 The C4.5 classifier.

3.1 Approach to classifying tweets, here for television shows, but the same approach applies for other proper nouns.

4.1 Visualization of the fraction of tweets by keywords for the show Saturday night live; here different celebrities that have been on the show dominate the resulting twitter feed.

4.2 Conceptual view of collection and classification of new tweets.

5.1 Conceptual view of collection and classification of new tweets.

5.2 How the different components are used to evaluate system performance. This does not represent the intended use case, where collection, pre-processing and classification is an ongoing process.

6.1 Results of filtering auxiliary data to improve data quality. Note that the first filtering step is not included here and these tweets represent strings containing either the title of a show or the title words formed into a hashtag.

6.2 Fraction of tweets by search terms for How I met your mother.

6.3 Fraction of tweets by search terms for The X factor.

7.1 Ways to generate training data from auxiliary data. Here we have two data sets, A and B, that correspond to searching for the titles A and B, respectively. Either we can have a classifier for each title (the left case), or we can have just one classifier that is trained on the Cartesian product of data sets and titles (the right case). Regrettably, tests show that we cannot use the right case unless we include training data of the type R_title,title for all shows.

List of Tables

2.1 Examples of two vector representations of the same document. In this example the vocabulary is severely limited; readers should imagine a vocabulary of several thousand words and the resulting, sparse, vector representations. Note that capitalization is ignored, which is very common in practice.

2.2 Inverted index structure. We can look up which documents contain certain words by grouping document numbers by the words that are included in the document. If we wish to support frequency counts of the words, we store not only document numbers but tuples of (number, frequency).

3.1 Methods that I use from related works in my combined approach of query expansion and classification.

4.1 Expansion terms for the show “How I met your mother” using equation 2.7, and the resulting search terms by hashtag, mention and co-occurrence heuristics. Note that a space means conjunction and a comma means disjunction. This used data where tweets mentioning other shows have been removed.

4.2 Search terms generated for the television shows The vampire diaries and The secret circle using a moderately sized data set.

5.1 List of dependencies organized by (sub)component.

6.1 TV shows used for collecting tweets with new search terms. Shows marked with “*” are aired as reruns multiple times every day.

6.2 Text sources used for comparing with tweets.

6.3 Number of tweets collected for the different TV shows during 23 h 30 min.

6.4 Percentage of tweets containing the title that are related to the television show.

6.5 Classification results when using manually labeled test data as training data with 10-fold cross validation.

6.6 Classification results when using training data generated from the same external sources; training examples are from all five shows.

6.7 Class distribution of annotated data after classification by baseline (left) and C4.5 (right) classifiers. The baseline classifier is the naive classifier: c_baseline(tweet) = related.

6.8 System performance using automatic query expansion, before and after classification. The subscript c denotes results after classification.

7.1 95% confidence interval for accuracy with training data generated from the same external sources; training examples are from all five shows.

7.2 First 13 term pairs for AQE using the top 40 terms to form pairs and virtual documents of size 5. Also visible is a bug where I do not remove hashtags from consideration when forming pairs.

B.1 Results of classification of annotated test data with linear support vector machines. Text data is treated as sparse vectors.

List of Algorithms

1 Algorithm Top(K, R): produces an array of single search terms.
2 Algorithm Pairs: produces the pairs of search terms used.


Chapter 1

Introduction

“Ultimately, brands need to have a role in society. The best way to have a role in society is to understand how people are talking about things in real time.”
– Jean-Philippe Maheu, Chief Digital Officer, Ogilvy [19]

Adoption of social media has increased dramatically in the last years, and millions of users use social media services every day. There are, for example, 806 million Facebook users [11] and 140 million twitter users [5]. Since the creation of material is decentralized and requires no permission, enormous quantities of unstructured, uncategorized information are created by users every minute; 340 million twitter messages are authored every day [5].

This development has co-occurred with the explosion of data generation in general and presents IT practitioners and computer scientists with new unsolved problems, but also with opportunities for new business. Industry has quickly realized that there is value in all this unstructured data, coining the equivocal term big data. The market for managing big data has grown faster than the IT sector in general and showed a growth of 10% annually, to $100 billion in 2010 [3].

The multitude of data available yields unprecedented opportunities to gain insight into what people are thinking and what they want; to conduct automated market research. This knowledge is extremely valuable for public relations and for advertising. One maturing technology that attempts to analyze users' opinions in text is sentiment analysis; another is estimating the ratings of television programs using the number of twitter messages written [39][35]. But for these technologies to be truly useful, text about the topics of interest needs to be collected in a reliable and representative fashion.

Classification and information retrieval techniques can be used to improve the quality and reach of twitter message collection. However, human text, especially in a social media setting, is often very vague, and one problem is to find the many messages that do not explicitly mention the topic that one wishes to analyze: the vocabulary mismatch problem [12]. Mitigating these difficulties is the focus of this thesis.




A crucial part of the process of conducting market research on a topic, such as determining sentiment towards a certain product or estimating ratings, is to get a good sample of messages. When gathering messages in social media, keywords determined by an analyst are often used, such as in [38], [39] and [35]. I argue that this method ignores a large fraction of the messages relating to certain topics and thus detrimentally affects the validity of the results of later analysis. The idiosyncratic and novel language use on twitter, driven by the short message length, results in a vocabulary mismatch that can be mitigated by the use of a systematic method to find the messages not covered by using the title, or other manually selected terms, as search terms.

At its core the research problem addressed by this thesis work is:

Get as many tweets as possible about a specified product.

This goal is to be solved using the methods available for a running and scalable system. I use the term product instead of topic here since this is closer to most business goals of tweet analysis, and it reflects the experiments that I have carried out. In essence, I wish to optimize tweet collection.

To improve tweet collection I present and test the use of streaming retrieval with additional keywords determined using relevance feedback techniques and automatic query expansion (AQE), as seen in information retrieval, particularly in ad-hoc retrieval. By comparing term distributions in sets of messages about different topics, I determine descriptive terms for each topic that yield improved recall when included as search terms. By also classifying the retrieved tweets as either relevant or irrelevant to the topic, higher precision can be achieved. Classification also, in part, deals with the issue of ambiguity [40].

The proposed method is evaluated by collecting tweets about television shows using streaming retrieval for popularity estimation, but the method is not limited to this domain.

This thesis consists of the following chapters:

Chapter 2 An overview of general techniques.

Chapter 3 Related work in the area of tweet retrieval and classification.

Chapter 4 Methods that I have employed.

Chapter 5 Prototype system that I developed.

Chapter 6 Experiments and data.

Chapter 7 Analysis of results.

Chapter 8 Conclusions and future work.



Chapter 2

Background

This chapter presents the concepts and technologies involved, in particular: twitter, information retrieval (IR) and classification. The problem of conducting market research is first presented in terms of how to get data from twitter, then how to find which of this data to use, by IR and classification.

This chapter is intended to introduce the most common techniques used when finding relevant information, but these techniques are not used in the standard way in my proposed method. Instead they serve as the inspiration and implementation building blocks of the method. A holistic, conceptual view of the proposed method is introduced in chapter 4, and it could be useful to read that first if the reader is already familiar with the concepts presented here.

Subsection 2.1.2 is technical and subject to the changing implementation of the twitter API [1]. However, it is necessary to analyze the API to understand the limitations present when working with twitter data, since this represents the access point used by researchers and other third parties.

2.1 Twitter

Twitter is a growing social media site where users can share short text messages of 140 characters; tweets. A user base of 140 million users [5] makes it a very interesting source of data. The data that I am interested in collecting is the messages where users write about certain TV programs, to use for market research.

2.1.1 Structure of a tweet

Below is a hypothetical twitter message highlighting different features. The format of the message is very similar to what one can find in actual tweets, and it is evident that much of the information has been shortened to fit in 140 characters.

[1] https://dev.twitter.com/




User: Erik Ward
User tag: erikuppsala
Text: I am writing a master thesis http://bit.ly/KdMep5 @kddird #KDDI
Time-stamp: 17:22 PM - 12 Oct 2012 via web

The very short text message format has given rise to several conventions adopted by the community:

Retweet: The letters “RT” at the start of a message indicate that it is a copy of another message.

User tag: A unique string associated with each twitter account.

Reply and mentions: The @<uid> sign indicates that the message is directed towards a specific user with user tag <uid>, or refers to that user.

Hashtags: A ‘#’ sign followed by a keyword denotes the user-selected category of the message (one category for each unique keyword string). Hashtags are unorganized and work by gentleman’s agreement.

Short URLs: Several services provide a way to shorten URLs, such as transforming http://www.it.uu.se/edu/exjobb/helalistan to http://bit.ly/KdMep5 by redirecting through their site.
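The features listed above can be extracted with simple regular expressions. The following is a hypothetical sketch of mine, not the Tweet Collect extraction code; the patterns are deliberate simplifications of the real tweet syntax rules:

```java
import java.util.*;
import java.util.regex.*;

// Toy extraction of retweet marker, mentions, hashtags and (short) URLs.
public class TweetFeatures {
    static final Pattern MENTION = Pattern.compile("@(\\w+)");
    static final Pattern HASHTAG = Pattern.compile("#(\\w+)");
    static final Pattern URL = Pattern.compile("https?://\\S+");

    // "RT" at the start of a message indicates a copy of another message.
    static boolean isRetweet(String text) {
        return text.startsWith("RT ");
    }

    // Collect all matches of the given capture group.
    static List<String> find(Pattern p, String text, int group) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) out.add(m.group(group));
        return out;
    }

    public static void main(String[] args) {
        String text = "I am writing a master thesis http://bit.ly/KdMep5 @kddird #KDDI";
        System.out.println(isRetweet(text));          // false
        System.out.println(find(MENTION, text, 1));   // [kddird]
        System.out.println(find(HASHTAG, text, 1));   // [KDDI]
        System.out.println(find(URL, text, 0));       // [http://bit.ly/KdMep5]
    }
}
```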

2.1.2 Accessing twitter data: Controlling sampling

In essence, accessing twitter data is done by collecting tweets that contain certain keywords or are written by specific users. What the keywords are for finding tweets about television shows, and how they are obtained, are described in chapter 4. But even if these keywords are known, access to the data is limited because of the underlying medium.

The basic approach is sampling at different times using standard HTTP GET requests, the so-called REST [2] approach. Each sample has an upper limit on how many tweets are retrieved, and a user is allowed only a certain number of calls per hour.

Conceptually, the Twitter company maintains a buffer of tweets of a fixed size that is indexed by a full-text index for Boolean search. This FIFO cache is replaced with new tweets at different rates depending on the rate at which tweets are produced. Users are allowed to query this very large cache of tweets and thus gain access only to the fraction of results that was produced in a fairly recent time period. Furthermore, not all tweets that are produced are available through this method, and the complexity of a query is limited.

[2] REpresentational State Transfer




“Please note that Twitter’s search service and, by extension, the Search API is not meant to be an exhaustive source of Tweets. Not all Tweets will be indexed or made available via the search interface.”
– https://dev.twitter.com/docs/api/1.1/get/search/tweets [3]

Besides the fact that not all tweets are accessible through the REST approach, there are further complications. These limitations have to do both with the fact that the number of tweets per request is limited and with the fact that the number of requests is limited. If tweets are produced faster than the requests are issued, the surplus tweets are dropped without warning; this can happen if one wishes to track many different keyword sets. If they are produced slower, then each request will return many previously seen tweets (wasting bandwidth). The following is also stated in the API documentation, making long queries harder to use in this setting:

“Limit your searches to 10 keywords and operators.” – https://dev.twitter.com/docs/api/1.1/get/search/tweets
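The duplicate-heavy case described above (requests issued faster than tweets are produced) can be handled with a high-water mark on tweet ids. The sketch below is my own illustration, not the Tweet Collect code, and assumes that tweet ids increase over time:

```java
import java.util.*;

// Toy deduplication for repeated REST searches: keep the highest tweet id
// seen so far and discard anything at or below it on the next poll.
public class PollingDedup {
    private long lastSeenId = 0;

    // Returns only ids strictly greater than the high-water mark,
    // then advances the mark past everything in this page.
    public List<Long> filterNew(List<Long> pageIds) {
        List<Long> fresh = new ArrayList<>();
        long max = lastSeenId;
        for (long id : pageIds) {
            if (id > lastSeenId) fresh.add(id);
            if (id > max) max = id;
        }
        lastSeenId = max;
        return fresh;
    }
}
```

A second poll whose window overlaps the first then yields only the genuinely new tweets, so overlapping result pages do not inflate the collected sample.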

Twitter data can also be accessed in a streaming fashion in two ways:

1. Access all incoming tweets or a sample of all incoming tweets. Accessing a random sample of all tweets is not attractive for our application, and obtaining all tweets is a very data-intensive streaming service requiring a contract with retailers.

2. Access all messages that match a Boolean query, e.g. “My friend has a dog” and “My father drives a Volvo” will match q = (My ∧ dog) ∨ (My ∧ Volvo). This sample is limited to at most 1% of all tweets but represents the most exhaustive way of collecting tweets containing certain keywords.
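The matching rule in item 2 can be sketched as follows: a query is a disjunction of conjunctions of keywords, and a tweet matches if every term of at least one conjunction occurs in it. This is an illustrative re-implementation of mine, not Twitter's server-side matcher:

```java
import java.util.*;

// Toy Boolean matching: query = disjunction of conjunctions of terms.
public class BooleanMatch {
    static boolean matches(List<Set<String>> query, String tweet) {
        // Case-insensitive bag of words from the tweet text.
        Set<String> words = new HashSet<>(
            Arrays.asList(tweet.toLowerCase().split("\\W+")));
        for (Set<String> conj : query) {
            boolean ok = true;
            for (String t : conj)
                if (!words.contains(t.toLowerCase())) { ok = false; break; }
            if (ok) return true; // one satisfied conjunction suffices
        }
        return false;
    }

    public static void main(String[] args) {
        // q = (My AND dog) OR (My AND Volvo), as in the example above.
        List<Set<String>> q = List.of(Set.of("my", "dog"), Set.of("my", "volvo"));
        System.out.println(matches(q, "My friend has a dog"));      // true
        System.out.println(matches(q, "My father drives a Volvo")); // true
        System.out.println(matches(q, "I have a cat"));             // false
    }
}
```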

When researchers evaluate their twitter-related research, it is common to use a static data set composed of messages collected for a certain query over a period of time [10][26]. One important such data set is the TREC microblog corpus [4]. I will revisit various sampling issues in chapter 3.

In my project a combination of methods is used: the REST search method to acquire a large sample of tweets for many different topics over a long period of time, and the streaming method to track a specific topic in an exhaustive way.

2.1.3 Stratification of tweet users and resulting language use

It is safe to assume that different groups of twitter users use different language to describe their thoughts. Certain trends in e.g. hashtag use spread to different groups of users depending on their position in the social network and other factors such as what their interests are, and so on. There is support for this assumption

[3] Accessed Oct. 16, 2012
[4] https://sites.google.com/site/microblogtrack/2012-guidelines



CHAPTER 2. BACKGROUND

in work done here at KDDI R&D, where feature extraction was used to extract terms used by different demographic groups, showing that the terms used differ [23].

If we expand our assumption slightly, we can also assume that an analyst who selects keywords to use for tweet collection need not be aware of the language use of different strata. It is therefore possible to achieve an improvement in recall if we can catch other types of language use.

In the proposed method we start with the title of a television show as the basis for our analysis, see chapter 4, but it is not hard to imagine that the jargon of users is not an exact specification and that they will sometimes use the title combined with words that are more specific to their writing style, demographic and social context. These words could include slang expressions and hashtags.

2.2 Information retrieval

This section is based upon the book An Introduction to Information Retrieval by Manning, Raghavan and Schütze [25] and summarizes the key concepts of information retrieval that are used in this thesis.

The task of finding the correct content in a large collection of documents is often called information retrieval (IR). Most work in IR focuses on finding the text document that, according to the models employed, the user wants, although there are several applications in which IR is extended to other content such as images, video or audio recordings.

A typical task for a commercial IR system is ad hoc retrieval: find the best documents related to a set of user-supplied search terms, a query. This thesis is not concerned with ad hoc retrieval; the topics for which I want to retrieve documents are automatically generated or known beforehand. Nevertheless, a great deal of overlap exists between the more traditional techniques of IR and my proposed method, described in chapter 4. Specifically, my method uses queries, represents documents in a similar way and builds upon an existing IR system designed for ad hoc retrieval.

2.2.1 Text data: Sparse vectors

Textual data is composed of strings of characters where one can choose different scopes of analysis. Common strategies are to regard a document as one scope or to consider parts of documents as scopes on their own, e.g. 100-word segments or the different structural parts of the documents such as titles, captions and main text.

Delving further into the text, one can look at words as the smallest building block or look at sub-strings of length n, often called n-grams. The term n-gram is also used in the literature for a sequence of n words, but it should be clear from context which is meant. In this thesis we will look at words, often called terms, separated by spacing characters, as the atomic unit of strings and denote these by t.


Table 2.1: Examples of two vector representations of the same document. In this example the vocabulary is severely limited; readers should imagine a vocabulary of several thousands of words and the resulting sparse vector representations. Note that capitalization is ignored, which is very common in practice.

Vocabulary          'a', 'brown', 'dog', 'fox', 'i', 'is', 'jumped', 'lazy', 'over', 'quick', 'the', 'this'
Document            "The quick brown fox jumped over the lazy dog"
Words (lex. order)  'brown', 'dog', 'fox', 'jumped', 'lazy', 'over', 'quick', 'the', 'the'
Boolean vector      (0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0)
Frequency vector    (0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 2, 0)

A well-known concept is lexicographic ordering, and this definition of how strings are ordered can be used to transform text into a vector representation in a concise way. Assume that one knows all words that can appear in a language; call this set the vocabulary, Ω. Using the vocabulary one can transform any document into a vector representation by counting the number of occurrences of each word and outputting these counts in lexicographical order: create a feature vector for the string. Table 2.1 shows an example of the process. In practice we do not know all words that can appear in texts, but we can ignore words that have not been seen before, or add them to the vocabulary, since the vectors are never formed explicitly.
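The vector construction of table 2.1 can be sketched as follows. This is a minimal illustration under the stated assumptions (a fixed, lexicographically sorted vocabulary; lower-casing; whitespace tokenization); the class and method names are my own.

```java
import java.util.*;

// Build Boolean and frequency vectors for a document over a fixed
// vocabulary, as in table 2.1.
public class BowVectors {
    // vocabulary must be lexicographically sorted: the position of each
    // term in the list is its position in the output vector.
    static int[] frequencyVector(List<String> vocabulary, String document) {
        int[] vec = new int[vocabulary.size()];
        for (String token : document.toLowerCase().split("\\s+")) {
            int i = Collections.binarySearch(vocabulary, token);
            if (i >= 0) vec[i]++;          // unseen words are simply ignored
        }
        return vec;
    }

    static int[] booleanVector(List<String> vocabulary, String document) {
        int[] vec = frequencyVector(vocabulary, document);
        for (int i = 0; i < vec.length; i++) vec[i] = vec[i] > 0 ? 1 : 0;
        return vec;
    }
}
```

With the vocabulary of table 2.1 this reproduces the two vectors shown there.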

Conceptually, the Boolean model, used by the twitter search API, creates Boolean vectors as in table 2.1 for all documents and all queries. A bitwise matching is then performed and the documents with one or more matches are eligible results of a search. Since the data is sparse, many optimizations in storage and computation are possible, but they are omitted here.

For twitter messages the Boolean model makes a lot of sense: tweets are very short and each word can thus be assumed to be very important to the message. Furthermore, it is the method that requires the least computation and storage, so it can be implemented efficiently with an inverted index, see table 2.2.

If we study table 2.2 further we can see that the more common a word is, the more documents will be stored in the index entry for that word, possibly slowing down retrieval. For common data sets of documents, often called corpora and denoted by C, an important phenomenon called Zipf's law has been observed: the frequency of a word falls off roughly as a power law of its frequency rank. So, roughly, the second most common word is about half as common as the most common word, and so on. This empirical law means that a very small number of words are present in most documents. To reduce the space and time needed for look-up in an inverted index, these very common words are commonly ignored: they are often called stop words.

If we are not content with regarding all documents that match, in the Boolean matching sense, as equally relevant and want to rank⁵ documents in some way, there

⁵ Typically the k highest ranked documents, with the highest scores, are assumed the most important.


Table 2.2: Inverted index structure. We can look up which documents contain certain words by grouping document numbers by the words that are included in the document. If we wish to support frequency counts of the words, we store not only document numbers but tuples of (Number, frequency).

Documents
  Num.  Text
  1     "a dog"
  2     "a brown fox"

Index
  Key      Value
  'a'      1, 2
  'brown'  2
  'dog'    1
  'fox'    2
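A minimal sketch of such an index, storing (Number, frequency) pairs per term; the names are my own, and real systems add the compression and optimizations mentioned above.

```java
import java.util.*;

// Minimal inverted index as in table 2.2: for each term, a posting map
// from document number to term frequency in that document.
public class InvertedIndex {
    private final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();

    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(term, t -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);   // count occurrences
        }
    }

    // Documents containing the term; empty set if the term is unknown.
    public Set<Integer> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptyMap()).keySet();
    }
}
```

Indexing the two documents of table 2.2 and looking up 'a' returns documents 1 and 2.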

are many extensions and interpretations one can use. The basic building blocks for ranking are the following two assumptions:

• The more common a word is inside a document, the more relevant the document is to a query containing this word.

• The more documents that contain a word, the less important the word is.

This leads us to the very common representation of texts as tf · idf vectors (term frequency, inverse document frequency vectors). For each word in the vocabulary we also keep track of the number of documents that the term appears in and calculate idf(t) = lg(|C| / |{d ∈ C | t ∈ d}|). When a query, q = t1, t2, ..., is issued we sum up, for each document d, the tf · idf elements that match that document:

Score(q, d) = Σ_{t ∈ q ∧ t ∈ d} w(t, d) = Σ_{t ∈ q ∧ t ∈ d} tf(t, d) · idf(t)    (2.1)
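Equation 2.1 transcribes directly into code. The sketch below assumes the common idf(t) = log₂(|C| / df(t)) form, with df(t) the number of documents containing t; all names are illustrative.

```java
import java.util.*;

// Sum tf·idf over the query terms present in the document (equation 2.1).
public class TfIdfScore {
    static double score(List<String> query, Map<String, Integer> docTf,
                        Map<String, Integer> docFreq, int corpusSize) {
        double s = 0.0;
        for (String t : query) {
            Integer tf = docTf.get(t);      // term frequency in document d
            Integer df = docFreq.get(t);    // number of documents containing t
            if (tf != null && df != null && df > 0) {
                double idf = Math.log((double) corpusSize / df) / Math.log(2);
                s += tf * idf;              // terms absent from d contribute 0
            }
        }
        return s;
    }
}
```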

One can also compare two documents, or a query and a document, in the form of tf · idf vectors using any distance metric, most notably the cosine distance:

cos(di, dj) = (di · dj) / (||di|| ||dj||)    (2.2)
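Equation 2.2 can be sketched for sparse vectors stored as term-to-weight maps; the names are illustrative.

```java
import java.util.*;

// Cosine similarity (equation 2.2) between two sparse tf·idf vectors.
public class Cosine {
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());      // only shared terms contribute
            if (w != null) dot += e.getValue() * w;
        }
        for (double w : b.values()) nb += w * w;
        return dot == 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```

Because only terms present in both maps contribute to the dot product, this exploits the sparsity noted above.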

A keen reader will note that by using frequency vectors to represent text we have lost a lot of information, namely the ordering of words. This model may appear simplistic but has been shown to work well in practice. Because order is not accounted for, the model is often called the BOW (Bag⁶ Of Words) model.

⁶ Bag is synonymous to multiset here.


2.2.2 Terms weights based on statistical methods

One can formalize the two basic assumptions used in the tf · idf vector representation and instead use statistical methods. The intuition is the same but the models are more principled, and this allows IR systems to benefit from the vast knowledge in probability theory and statistics. For instance, it is possible to weigh terms in documents according to some function, f, defined on the two probabilities p(t|d), the probability of term t given a document d, and p(t|C), the probability of the term in the whole corpus:

w(t, d) ∼ f(p(t|d), p(t|C))    (2.3)

p(t|d) = tf / TF    (2.4)

p(t|C) = tfC / |C|    (2.5)

Here tf is the number of times we see t in document d, TF is the number of terms in d, tfC is the number of times we see t in the whole collection and |C| is the number of terms in the collection. Note that we are essentially estimating the probability distributions of terms in different sets using relative frequencies (maximum likelihood estimation).

These methods are usually similar to hypothesis testing in that we choose terms for which we reject the null hypothesis that p(t|d) has a good fit with p(t|C). Another way is to consider the information content of seeing a term in a document, for example using the Kullback-Leibler divergence. It is also possible to use an urn model that explicitly considers the size of documents, TF, instead of just p(t|d) = tf / TF, such as the divergence from randomness framework [7].

2.2.3 The vocabulary mismatch problem

When a user is searching for a set of relevant documents, it is typically the case that the user and the authors of those documents use different vocabularies (not to be confused with the vocabulary of all seen terms, Ω). This means that not all relevant documents are found, or that the ranking of documents is not in line with what the user finds important.

This problem can also be seen when one tries to conduct market research using twitter data. It is not hard to imagine that authors of tweets use a different vocabulary and jargon than an analyst who selects keywords to search for.

2.2.4 Automatic query expansion

Query expansion is a method where the original query is used to generate new queries to provide greater effectiveness of the IR task. There are many different ways to perform query expansion automatically; they can be very broadly


categorized into local and global methods. Local methods use the results obtained from the first query to find new search terms, while global methods often use global semantic information such as Wordnet, or use an auxiliary corpus. An excellent review of query expansion is available in the survey by Carpineto and Romano [12].

In the local case, query expansion is often pseudo relevance feedback, related to the concept of relevance feedback. In relevance feedback, an ad hoc IR technique, users are asked to grade results and a new search is carried out taking into account the best rated results of the first query. If instead the top k results of ranked ad hoc retrieval are assumed to be relevant, here denoted as R, the pseudo-relevant set, one can use this set to get new query terms. A common technique is Rocchio's algorithm, where a BOW query vector is moved towards relevant documents according to different weights:

qm = α·q0 + β·(1/|R|)·Σ_{dj ∈ R} dj    (2.6)

We see that the original query vector q0 of tf · idf elements is multiplied by the scalar α and added to the mean of the tf · idf vectors of the documents in R, multiplied by another scalar β. Note that we need to use a similarity-based score for a new ranking of results, as in equation 2.2, rather than a summation of all w(t), t ∈ qm as in equation 2.1, to benefit from query expansion with re-weighting.

For our purpose the re-weighting is not very interesting since we are only interested in obtaining new search terms. If we consider the vector

(1/|R|) Σ_{dj ∈ R} dj

as a list instead and sort the elements in descending order of weight, we can use the first L elements as additional terms.
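Selecting the L highest-weighted terms from the mean vector of R can be sketched as below; the names are illustrative, and no re-weighting is performed.

```java
import java.util.*;
import java.util.stream.Collectors;

// Average the sparse tf·idf vectors of the pseudo-relevant set R and
// return the L highest-weighted terms as expansion terms.
public class ExpansionTerms {
    static List<String> topTerms(List<Map<String, Double>> relevantDocs, int L) {
        Map<String, Double> mean = new HashMap<>();
        for (Map<String, Double> d : relevantDocs)
            for (Map.Entry<String, Double> e : d.entrySet())
                mean.merge(e.getKey(), e.getValue() / relevantDocs.size(),
                           Double::sum);
        return mean.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(L)                     // keep the L heaviest terms
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```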

The statistical method in section 2.2.2 can easily be extended to query expansion where instead of p(t|d) one considers p(t|R), where R is a set of known “relevant” documents. Using the χ² statistic we can perform AQE by ranking expansion terms according to highest score and using the top K ones:

score(t) = χ²(t) = (p(t|R) − p(t|C))² / p(t|C)    (2.7)

p(t|R) = tfR / |R|

p(t|C) = tfC / |C|
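The χ² scoring of equation 2.7 can be computed directly from raw counts, estimating both probabilities by maximum likelihood as above; the names are illustrative.

```java
// Equation 2.7: chi2(t) = (p(t|R) - p(t|C))^2 / p(t|C), where
// tfR/sizeR estimates p(t|R) and tfC/sizeC estimates p(t|C).
public class ChiSquareAqe {
    static double chi2(long tfR, long sizeR, long tfC, long sizeC) {
        double pR = (double) tfR / sizeR;   // p(t|R)
        double pC = (double) tfC / sizeC;   // p(t|C)
        return (pR - pC) * (pR - pC) / pC;
    }
}
```

Candidate terms are then sorted by this score and the top K are kept as expansion terms.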

Query expansion in this thesis is used only to acquire additional search terms, and no re-weighting of the search terms is done. In this sense AQE is very similar to association mining, especially the concept of confidence [12, p. 14]. In my


proposed method I calculate a metric similar in spirit to confidence for the rule: keyword → group of records.

Instead of χ², there are other ways of hypothesis testing that can be used to find good descriptors of sets, such as the AIC (Akaike information criterion), which compares different models, in this case rules of the form above [6].

2.2.5 Measuring performance

In ad hoc retrieval, performance is perhaps best measured by what users think of the results returned. This is a very time consuming and expensive process, so most IR systems are tested on annotated test collections such as the TREC [4] collections. Here a set of queries is supplied, together with a list of relevant documents for each; the IR system is then tested on how well it can retrieve the predetermined relevant documents. Often only the first couple of results are measured, but for the problem I am concerned with, maximum recall, this is not an option: I will instead sample the results to determine overall performance, see chapter 6.

For each query q and underlying information need, each document d is in one of the four categories:

True positive A document that is relevant and was retrieved by the IR system in response to q. Denoted by tp.

True negative A document that is not relevant and was not retrieved, tn.

False positive A non-relevant document wrongly assumed to be relevant and thus retrieved, fp.

False negative A document that is relevant and was not retrieved, fn.

From these simple definitions several metrics have been developed. The two most common measures are:

precision = tp / (tp + fp)    (2.8)

recall = tp / (tp + fn)    (2.9)

Precision reflects how many of the retrieved results are relevant, and recall how many of the relevant results are made available. It is desirable to maximize both metrics, but they are naturally opposed goals: returning all documents in the collection maximizes recall (1.0), while returning only a few documents that are certain to be relevant maximizes precision. In actual systems, increasing one often decreases the other. Therefore the harmonic mean of the two, called the F-measure, is often used to measure the IR system.

F1 = 2 · precision · recall / (precision + recall)    (2.10)
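Equations 2.8 to 2.10 transcribe directly into code; this is a plain sketch of the definitions with illustrative names.

```java
// Precision, recall and F1 (equations 2.8–2.10) from the outcome counts.
public class Metrics {
    static double precision(int tp, int fp) { return (double) tp / (tp + fp); }
    static double recall(int tp, int fn)    { return (double) tp / (tp + fn); }
    static double f1(double p, double r)    { return 2 * p * r / (p + r); }
}
```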


These metrics can also be used in other cases where an asymmetric importance is assigned to two different classes: related results are important and unrelated ones unimportant. A typical case is classification, where one wants to filter out all unrelated records but keep all related ones. The difference between filtering and IR is blurred when there are temporal relevance demands on search results. In the extreme case, used in this thesis, where no results are stored for a standing query on streaming data, they become inseparable.

2.2.6 Software systems for information retrieval

There are many software systems that are well suited for information retrieval, employing different versions of the inverted index idea presented in section 2.2.1. Relational database management systems, or other systems optimized for e.g. fast indexing instead of consistency, can be used. There are many specialized information retrieval systems; these are sometimes called No-SQL databases because they do not adhere to SQL specifications.

I have worked with one such specialized (No-SQL) information retrieval system, Terrier 3.5⁷ [31]. It is a relatively mature research system dedicated to information retrieval with open source code, good documentation and community support. Its design is focused on experimentation and configuration, and it is written entirely in Java, giving me a good trade-off between performance, scalability, stability, and ease of implementation and experimentation. One drawback is that there is no query language; if one wishes to do other things than document search, this must be done in source code. This is still preferable, since custom search operations are exactly the point of this software: Terrier is designed for easy modification of the open source code and easy configuration. In contrast, many SQL systems would require foreign functions or possibly even recompilation to do what I wanted to do.

2.3 Topic classification

The act of retrieving the top ranked documents for a query is itself a form of classification of all the documents in the corpus. But in the case of twitter no ranking is done of the results of streaming retrieval, so we can explicitly introduce a classification step here to improve precision. To clarify:

1. Obtain a query vector q

2. If desired, create a new query vector q∗ with AQE

3. Rank all documents according to q∗ by sorting their scores, for example scores obtained by using equation 2.2.

4. The k highest ranked documents are considered related.

⁷ www.terrier.org


In an ad-hoc IR system the users can themselves decide what value of k is acceptable, and in that way allow for maximum recall by looking at more and more results. In practice this will result in very low precision. By instead classifying all results (giving them a score of either 1 or 0) rather than ranking them, I suggest that one can achieve more favorable results in terms of overall F-measure.

A supervised classifier is a function or program that uses a training step to modify its behavior. In this step it is fed data that is similar to the data it will later be asked to classify, and it can generalize various properties of the data in order to make an informed decision later [18]. I will assume that the reader is familiar with supervised classification⁸ and instead focus on concepts that are important for my proposed method.

The sparse vector formats listed in table 2.1 are not necessarily optimal for classification, and if we can introduce some form of processing to include background knowledge in the features used, there is a possibility of improving the results. In chapters 3 and 4 I will elaborate on this idea further, but the basic scheme used is to focus as much as possible on transforming a sparse text representation into a more concise representation based on comparing it with other texts.

Our classifier will need to decide for each tweet whether or not it is relevant to our information need. The information need in IR is usually expressed as a query, but in supervised classification we describe it in the form of training examples and, in the proposed method, also as external sources.

Figure 2.1: The C4.5 classifier

C4.5 is a commercial decision tree classifier. It is an extension of a basic decision tree induction algorithm [32] and thus creates rules for partitioning data into classes based on purity measures. In C4.5, and in the open source Java implementation J48 that I used, additional measures are taken to reduce over-fitting, to handle missing values and to make other improvements [34].

2.4 External data sources

External data sources can be used both for query expansion, by looking for context of the original search terms [12], and for classification, by comparing them with our data. The key issue is whether we can find representative information about the information need. Many researchers have focused on the link structure of external resources, but due to time constraints I have not considered this angle of approach and have instead focused on using the text of the source itself.

⁸ An excellent book is [32], where most of the techniques used in this thesis are covered.


Wikipedia The on-line encyclopedia, Wikipedia⁹, is the largest resource of its kind. It has been used by many researchers in text mining and I will use it to provide context for my classifier.

EPG EPGs (Electronic program guides) contain the airtime of a show, the most prominent actors of the show and a short synopsis. Several companies provide API access to EPG data, such as Rovi Corporation¹⁰.

Web pages If we have web pages that are relevant to our information need we canuse them as additional background information.

⁹ wikipedia.org
¹⁰ rovicorp.com


Chapter 3

Related work

In this chapter I will investigate related work regarding the collection of tweets on certain topics. One topic that is of special interest is television, and thus much of the related work presented will be about the problem of finding and identifying television related tweets.

Perhaps the most straightforward interpretation of the problem of identifying television related tweets is as a classification task. The solution taken by most researchers is that of supervised classification of tweets by topic. A training set of labeled data is used to train a classifier such as a support vector machine, and the approach is tested on an unused part of the training set, typically using k-fold cross validation. But in reality a running system must deal not only with classifying tweets but also with retrieving them from a large source such as the twitter API, and therefore formulating the problem as an IR task is also attractive. I will also go through the approaches taken by different authors for collecting tweets.

My proposed method, see chapter 4, is a combination of query expansion for tweet retrieval and classification of tweets, and thus these two research areas are directly related to my work. However, I have not found any directly comparable results.

3.1 Relevant works in information retrieval

I believe that one does not have to look at the problem strictly as a classification problem and that many techniques used in ad hoc retrieval can be used for the vertical search task. These techniques could be viewed as pre-processing methods for the classification task. The status of information retrieval of tweets is reviewed in [16], where several interesting analogues to techniques for more typical documents (much longer than tweets) are assessed, among them a version of PageRank for twitter, which does not yield great results. From IR in general, the perhaps most interesting technique to improve recall is query expansion [12].


3.1.1 Query expansion and pseudo relevance feedback

One promising idea is to account for the change of language use over time. Massoudi et al. [26] use a retrieval model based on a language modeling approach, where query expansion terms are generated by looking at the recency of the tweets they appear in and how many tweets they appear in.

Another variant of temporal pseudo relevance feedback used for analyzing twitter messages is to build a fully connected graph of initial search results, where edges are weighted by the temporal correlation of their terms, similar to the approach above. PageRank is then used on this graph to find the most relevant documents. The temporal profiles were built with four hour intervals on a rather small corpus of twitter messages, and PageRank is not suited for working with this kind of graph, so it is not surprising that this TREC submission [2]¹ was unsuccessful.

A very interesting use of Wikipedia is an AQE approach where anchor texts are used as expansion terms. In [8], Wikipedia is indexed and searched for the same query terms as in an original query for a blog collection; the top Wikipedia documents returned are analyzed to find popular anchor texts that link to the highest ranked Wikipedia pages. These anchor texts are then used as expansion terms, resulting in an improvement over a baseline.

In [15], Efron uses hashtags to improve the effectiveness of twitter ad hoc retrieval. By analyzing a corpus of twitter messages he creates subsets where one hashtag is present and fits a probabilistic language model to each such subset. A language model is also fitted to each query, and the models that correspond best to the query model (according to Kullback-Leibler divergence) have their hashtags added as additional query terms. This approach provides modest improvement, but I think that creating a language model from just the query terms is risky since there is so little evidence present in a query of a few words.

Papers submitted for the TREC microblog track of 2011 represent the use of different IR techniques for twitter search, including topic modeling, different forms of query expansion, extensive use of hashtags and many other approaches. Many of the papers are not published in peer-reviewed journals but nevertheless represent the latest research in this area. The main evaluation measure was precision at 30 results, averaged over the different queries, and the best results were in the 40% range [2].

The approach taken by Bhattacharya et al. [2]² is particularly interesting since they report one of the best unofficial test scores, more than 60% P@30, and use an IR methodology perhaps best suited to structured (XML) retrieval. They use Indri³, which employs a combination of language modeling and inference in Bayesian networks. They create different regions from the tweet and external sources that can be treated using different language models, and combine the similarity of a query with each of these regions in a Bayesian network. The external sources are web

¹ Qatar Computing Research Institute submission to the TREC 2011 microblog track.
² Bhattacharya et al. University of Iowa (UIowaS) at TREC 2011 microblog track.
³ http://www.lemurproject.org/indri/


pages from URLs in the tweets and definitions of hashtags from a community based site, i.e. they expand tweets to include externally referenced information.

3.1.2 An alternative representation using Wikipedia

Since my goal is to achieve good recall while maintaining precision, I have looked at the work of Gabrilovich et al. with much interest. They use an alternative representation they call explicit semantic analysis, ESA, contrasting with LSA (latent semantic analysis), where Wikipedia serves as the basis for the representation of documents.

In [20] the basic idea of ESA is presented. Each word in a text is associated with a set of Wikipedia pages. They create this representation by building an inverted index of Wikipedia, and instead of using it for look-up they use the index structure itself as the representation, i.e. each word is represented by a sparse vector with all Wikipedia pages as elements. A word such as Sweden might appear in thousands of articles, but using a tf · idf scheme the word might have the greatest association with articles about Sweden. To make the system feasible, only the top k articles and the weights corresponding to the term frequency in those articles are kept. For a text, the vectors of the words in it are summed to create a document vector. This approach works best for small texts, so it should be ideal for use with twitter messages.
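The summation of per-word concept vectors can be sketched as below. The word-to-concept weights here are invented for illustration (in ESA they come from a tf · idf-weighted inverted index of Wikipedia), and all names are my own.

```java
import java.util.*;

// ESA-style document vector: each word maps to a sparse vector over
// Wikipedia concepts, and a text is the sum of its word vectors.
public class EsaSketch {
    static Map<String, Double> textVector(
            Map<String, Map<String, Double>> wordConcepts, String text) {
        Map<String, Double> v = new HashMap<>();
        for (String w : text.toLowerCase().split("\\s+")) {
            Map<String, Double> cv = wordConcepts.get(w);
            if (cv != null)   // words with no concept vector are skipped
                cv.forEach((concept, weight) -> v.merge(concept, weight, Double::sum));
        }
        return v;
    }
}
```

Two such vectors can then be compared with cosine similarity in concept space, which is how texts sharing no surface words can still be associated.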

The alternative representation is used to build an ad-hoc IR system where queries are also transformed into Wikipedia-space and compared, with e.g. cosine similarity, to the texts in a collection. Using only ESA results in poor performance but very impressive abilities to associate queries and texts that do not share a single word with each other, which highlights the possibility of greatly increased recall. In unison with a BOW IR system and automatic feature selection using the information measure, the method yields good results [17]. But this method can definitely cause a loss in precision for some queries, because unrelated Wikipedia pages may contain the same word.

3.2 Classification

In this section I list some works that only look at the subset of the problem where a set of possibly related tweets has been acquired and we want to classify them, skipping the issue of recall when retrieving tweets from twitter. In general, the related works mentioned in this section tackle the problem of classifying tweets containing ambiguous words, such as the company name Apple. Even though I have used classification to improve the precision of additional tweets gathered using other search terms, the justification for using classification is the same: it is needed to try and filter out tweets that superficially seem related but are not.


3.2.1 Television ratings by classification

Arguing that conventional TV ratings, the so-called Nielsen ratings, are outdated, Wakamiya et al. employ an alternative method for estimating the number of viewers [39]. In their paper they present a method that uses tweets to calculate the ratings, and they use a large data set from the Twitter Open API. The data was geotagged and filtered by keywords such as TV and watching and later filtered further.

Here the key problem of identifying which messages are related to a particular TV show is addressed. As seen in other works, additional information about the television programs is used, here in the form of an electronic program guide (EPG). Textual similarity is then computed between the set of collected tweets and EPG entries. As far as I know, Wakamiya and her colleagues are unique in also incorporating both temporal and spatial information to make the decision.

The textual similarity is based on the Jaccard similarity coefficient, and a morphological analysis function is used to only compare nouns, possibly due to the way the EPG is structured. In contrast, one could for instance imagine that verbs such as watching could be useful, an observation made in other related works. To facilitate the large number of text comparisons required, an inverted index was employed.
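The Jaccard coefficient on term sets can be sketched as follows; this is an illustration of the coefficient itself, with the morphological filtering to nouns omitted and the names my own.

```java
import java.util.*;

// Jaccard similarity |A ∩ B| / |A ∪ B| between two term sets, e.g. the
// terms of a tweet and of an EPG entry.
public class Jaccard {
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);                  // intersection
        Set<String> union = new HashSet<>(a);
        union.addAll(b);                     // union
        return (double) inter.size() / union.size();
    }
}
```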

The use of spatial relevance is motivated by the need to determine which TV station the author of a tweet was watching. Therefore it might be unnecessary in the general problem of determining whether or not a tweet is related to a particular TV program.

Hypothesizing that users write about TV shows in close temporal proximity to the broadcast, a temporal relevance score is used in the final relevance measure, a quotient of the three similarity scores, which is then used to match a tweet to the highest-rated EPG entry, corresponding to one television broadcast. The sought-after popularity measure of how many people watched the show can then be calculated.

Experimental results indicate high precision for the proposed method but possibly low recall. Regrettably, no discussion about the statistical significance of the ratings acquired was present.

3.2.2 Ambiguous tweets about television shows

In a series of papers [35] [13] [14], a group of researchers from AT&T Labs and Lehigh University, including Bernard Renger, Junlan Feng and Ovidiu Dan, present a method for classifying tweets and an application of their method, Voice enabled social TV. Their approach achieves the best performance in terms of F-measure on a labeled test set that I have seen in the literature. But since there are no standardized test sets, caution should be taken before accepting this as the unequivocally best approach. The basic scheme of classification considered is shown in figure 3.1.

Figure 3.1: Approach to classifying tweets, here for television shows, but the same approach applies to other proper nouns.

The key concept of their work is the use of a two-stage classifier approach. First a classifier is trained using a set of manually labeled data; then a second classifier is trained on the small data set, but with features derived from the large data set labeled by the first classifier, which is also used to extract additional information such as new search terms for the twitter streaming API. The second classifier is used to make a decision about a tweet m and a show i having the title s_i:

    f(i, m) = 1, if m is a reference to show s_i
              0, otherwise

The first classifier is a binary classifier that also models the function above; it does, however, use fewer features than the second classifier, as seen below. The training and testing data is generated using twitter's streaming API, where one searches for keywords and gets a statistically significant sample back. The search terms used include not only the name of the show but also alternative titles found at IMDB.com and TV.com. A set of labeled data was manually created for eight shows, and this data set is then used for training and validation of the first classifier.

The features used for the first classifier differ from the previous approach listed, since they are not directly related to textual similarity, where one uses a bag-of-words model. Instead a combination of features is used:

• Manually selected terms and language patterns of interest.

  – Television terms such as watching, episode.
  – Network terms such as cnn, bbc.
  – A regular expression capturing episode and season information, e.g. S[0-9]+E[0-9]+.

• Automatically captured language patterns.

  – From a large data set (10 million tweets), replace titles and hashtags with placeholders and extract sequences of three words where the placeholder is included. These then become rules that, if seen in unlabeled messages, are used to indicate that the message is TV related.


  – Use s_i to check for the presence of the uppercase string.
  – Check if there is more than one title that is not an ambiguous word (according to WordNet).

• Textual comparison with external sources using the cosine similarity measure and the bag-of-words model.

  – Characters of the show
  – Actors of the show
  – Words from the Wikipedia page

Most features are treated as binary values, 1 if a positive match was found and 0 otherwise; the rest are scaled to the unit interval.

After training on a few thousand twitter messages, the first classifier is used to classify the large unlabeled data set, which yields a label for each twitter message. This data is then used as training data, together with the original data set, for the second classifier, which uses all of the features of the first classifier and three additional feature types. Interestingly, new, more refined rules are captured from this new labeled data set, as well as new search terms.

Using the features listed above, different classifiers were tested for the two classifier stages; support vector machines and rotation forests [36] were deemed the best. An F-measure of 85.6% was the best result achieved by the latter classifier in 10-fold cross validation of the initial labeled data set. To summarize: several interesting features are combined with the textual similarity measures often used in information retrieval; the two-classifier approach slightly increases the F-measure and also generalizes quite well to unseen shows.

Parts of this approach can certainly be applied to classifying new tweets that are retrieved using query expansion, and in my method I use a similar approach with slightly different features; see chapter 4.

3.2.3 Other topics than television

Other authors tackle the related and very similar problem of identifying which tweets are about a certain company and which are not. Company names can be ambiguous in much the same way as television shows and programs. As stated in the task definition of the WePS-3 workshop of the National Distance Education University (UNED): "Nowadays, the ambiguity of names is an important bottleneck for these experts" [1], referring to experts in on-line reputation management. The task outlined in the workshop included data to be analyzed.

As a submission to the WePS-3 workshop, Yerva et al. devised a classification method for the problem using support vector machines (SVM) [41]. Here we also see the use of external data as a basis for comparison with tweets. For each company, a set of profiles is created, each a bag of pre-processed unigrams and trigrams. The different profiles capture e.g. the company website, Wordnet disambiguation and manually recorded related words. The features used were co-occurrence counts of words in the tweet with the different profiles. Experimental results were positive and indicate the need for high-quality external information.

One idea to improve recall in classification is to cluster messages; however, this method typically suffers from poor results if applied to just term occurrence vectors. Perez et al. find terms from the corpus of twitter messages they are working on to help clustering methods [33]. They call their method the self-term expansion methodology and achieve improvements in recall and precision by finding a set of additional terms for ambiguous company names. Words that co-occur with company names in tweets labeled true in the training set are added to each tweet containing the company name in the test set. Unfortunately the paper is very vague in its method description, and using unsupervised clustering with k-means with k = 2 does not seem like a promising idea for classification; however, the method could be used as a query expansion technique.

3.3 Tweet collection methodology

In chapters 1 and 2 I argued that the common approach of accessing twitter data for various research projects is lacking in reach, or recall, of the data that is considered for sampling. To see this we can consider the methodology used by some of the related work presented in this chapter.

Regarding ad-hoc information retrieval of tweets, such as [37], [16] and [10], which use the TREC microblog data set^4, and [26], which employed query expansion, it is not clear if we can compare effectiveness of tweet collection. In relation to market research, it is an open question whether results achieved on a small data set, sampled for a shorter period of time and annotated with a modest number of query–relevance judgment pairs, are applicable to the problem of obtaining as many related tweets as possible. We are most interested in evaluations done with the constraints of up-to-date, inclusive tweet collection in place. Nevertheless, many of the techniques used are certainly interesting.

In [28], Mitchell et al. evaluate a system they have set up for on-line television in which social media is integrated. Twitter is used to present tweets about the currently viewed program. Here the twitter API is used and a simple search for the program's title is employed to retrieve relevant messages. Their work represents the basic use of twitter for retrieving TV related tweets; unfortunately, recall and precision are not evaluated.

The work done on classifying TV-program-related tweets [39], [35] and works about classifying other ambiguous topics such as [40] use test sets collected using simple rules, such as using the title of the topic, or manually selected keywords. A limited form of query expansion is used in [9] to generate the data set: all hashtags found in the data set retrieved by searching for "#worldcup" are recursively used to search for new tweets. In [30] the streaming API is used and messages are classified in a streaming fashion; however, the search terms used are manually selected.

4. https://sites.google.com/site/microblogtrack/

Wakamiya et al., who employ an alternative method for estimating the number of viewers by counting certain tweets [39], do not use titles of TV programs directly. Instead, a large data set collected from the Twitter API during one month was used, where all available geotagged^5 data of Japanese origin was filtered for manually selected Japanese keywords equivalent to words such as TV and watching. Experimental results indicate high precision for the proposed method but possibly low recall. Regrettably, no discussion about the statistical significance of the ratings acquired was present.

Dan et al. [35] [13] [14] use an approach that achieves an F-measure of 89%. However, their results are only valid as a measure of an overall system if all the relevant tweets can be found using the title of the show as a search term.

3.4 Summary

Ad-hoc search of twitter messages typically uses text indexing, either the common tf ∗ idf scheme or a language model approach. Even though there are many differences between ad-hoc search of microblogs and web documents [37], techniques learned from established IR methods can certainly be applied.

Text classification typically represents texts as tf vectors and either does supervised training directly on the sparse vectors or extracts features to use for training. When dealing with short documents such as tweets, external sources are often used, and the best results [14] come from including hand-crafted features and mining very simple rules from a bootstrapped sample. Unsupervised approaches are less successful when dealing with short text messages, as described in [33], and a clustering can never be maintained for the incoming stream of data in a live application.

Table 3.1: Methods that I use from related works in my combined approach of query expansion and classification.

Method | Reference
Use hashtags as expansion terms | [15]
Look up URL contents in tweets | [2]^6
Use Wikipedia as a way to compare tweets | [20], [13]
Use EPGs as a way to compare tweets | [39]
Co-occurrence with name to get additional terms | [33]
Multiclass supervised classification of tweets using external sources | [13], [41]

Since we are treading in somewhat unknown territory, the streaming retrieval of television-related tweets, I have perhaps included some works that can be considered of peripheral importance. However, I will explore some concepts from each of these works in my own system, aiming to exploit the strengths of their approaches and avoid the weak points. In table 3.1 I have listed the different ideas from related work that I have used in my proposed method; see chapter 4.

5. Some users enable geotagging, so that the coordinates of the user at the time of posting are publicly available.


Chapter 4

Automatic query expansion and classification using auxiliary data

This chapter contains a conceptual overview of the methods used. I start by elaborating on the problem statement in chapter 1 and then continue to describe the two parts of my approach: AQE (automatic query expansion) and supervised classification. Some initial experiments are described, since they guided my development of various features of the method that was later used for larger experiments.

4.1 Problem description and design goals

To clarify the intuition behind which methods are used, I repeat and elaborate the research problem definition from chapter 1:

Get as many tweets as possible about a specified product.

With the following constraints:

Tweet availability The twitter API determines how tweets can be accessed.

Precision We do not want false positives.

Scalability The resources required in terms of CPU time, memory consumption and disk should increase sub-linearly, disregarding storage of collected tweets, with the number of tweets we collect.

Recency Results need to be made available in a timely fashion. We need to access the tweets about a product as soon as possible, ideally in real time.

Resolution We need to be able to tell when a specific tweet was created.

We can also formulate the problem in terms of a search problem:

Retrieve tweets about products by searching (Boolean match) for keywords about these products.

The problem here is that we need to know the keywords to use (or suffer from low recall). Often these keywords are chosen by an analyst, as seen in chapter 3. We can prove by example that this method does not retrieve all related tweets. Furthermore, many keywords that are good descriptors for products are also good descriptors for other classes of tweets. There is a problem of ambiguity.

Another interpretation is in terms of a supervised classification problem:

Classify all tweets as related to the product or not

Superficially this looks like an attractive approach; we have now gained a good method to weed out ambiguous tweets. But there are several problems:

Multiple-class problem If we need to track more than one product, it becomes a multiple-class problem. To solve this we need access to labeled data for each product.

Unbalanced classes The ratio of tweets that are about a specific product is vanishingly small; let us assume this ratio is 0.1% and that the false positive rate is also 0.1%. With perfect recall the results will be almost tied between 50.02% tp and 49.97% fp. A false positive rate this low is of course completely unrealistic, making the problem even harder.
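The near-tie above can be verified with a few lines of arithmetic; a minimal sketch (a 0.1% prior of related tweets together with a 0.1% false positive rate applied to the remaining 99.9% yields the stated split):

```java
public class ImbalanceExample {
    /** Fraction of collected tweets that are true positives, given the prior
     *  of related tweets, the recall, and the false positive rate. */
    public static double precision(double prior, double recall, double fpRate) {
        double tp = prior * recall;         // related tweets collected
        double fp = (1.0 - prior) * fpRate; // unrelated tweets collected
        return tp / (tp + fp);
    }

    public static void main(String[] args) {
        // 0.1% of tweets are related, perfect recall, 0.1% false positive rate
        double p = precision(0.001, 1.0, 0.001);
        System.out.printf("precision = %.4f%n", p); // about 0.5003
    }
}
```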

The class distribution problem is not insurmountable: we must give up recall to achieve reasonable precision. This is directly opposed to our goal but is unavoidable; we must instead focus on giving up as little recall as possible. But compared to treating the problem as a search problem, we are still ahead in achieving our goal.

There is an important caveat here: if we have high recall of the related class and consistent fp rates, we can still track the change in the number of tweets.

The worst problem is acquiring training data; it is not financially viable to manually annotate data for a given product unless that product is very important. When tracking television programs this manual labor is staggering: there are hundreds of large shows and programs, of very different genres, in the US alone.

4.2 New search terms from query expansion

The main idea of my approach to the problem described in section 4.1 is to combine the strengths of both search and classification by removing the analyst that selects search terms and replacing them with an automatic method.

As I will show in the following sections, the end goal of query expansion in this thesis is to find a disjunction of new search terms that describe the product that we are searching for. If we retrieve all tweets that match this logical expression, we can hopefully find more relevant tweets. In other words, these search terms can be used to find additional messages. As an example, consider a search for "hello":


Original search expression: "hello"
Expansion search expression: "hello" ∨ "hi" ∨ "greetings" ∨ ...

Here we assume that some query expansion method generates additional terms and that we can retrieve the tweets that match one or more of these terms. As described in section 4.2.1, I will refine this approach somewhat to search not only for a disjunction of single search terms but for a disjunction of pairs of search terms, where both of the terms in a pair must be present in the retrieved messages (logical AND). If we revisit the example above, it could look something like this:

Original search expression: "hello"
Expansion search expression: "hello" ∨ ("hi" ∧ "greetings") ∨ ...

Since I want the Tweet Collect system to need only a list of product names to work, the original search expression will be a conjunction of the words that make up the product name. For the television show "How I met your mother" the original search expression will be: "how" ∧ "I" ∧ "met" ∧ "your" ∧ "mother".
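Producing this original search expression is plain tokenization; a minimal sketch (the AND-joined string is only for display, since how the conjunction is actually encoded depends on the twitter API call used):

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class OriginalQuery {
    /** Build the conjunctive search expression for a product title:
     *  every word of the title must be present in a matching tweet. */
    public static String conjunction(String title) {
        return Arrays.stream(title.trim().split("\\s+"))
                     .map(w -> "\"" + w + "\"")
                     .collect(Collectors.joining(" AND "));
    }

    public static void main(String[] args) {
        System.out.println(conjunction("How I met your mother"));
        // "How" AND "I" AND "met" AND "your" AND "mother"
    }
}
```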

Taking inspiration from the automatic query expansion techniques listed in chapter 2, we can treat a set of tweets that contain the exact title of a product as pseudo-relevant tweets; they are generated by an original query. Given a larger population of tweets C, where R ⊂ C, we can calculate many different statistics about the terms present in R. From these statistics we can generate well-chosen additional search terms. This method is very simple and highly effective, given that our initial assumptions hold: the tweets in R are actually relevant, and we can approximate the true distributions by the statistics in our corpus.

I have performed some small-scale tests with different query expansion methods that use different ranking criteria for new search terms and found that χ2, equation 2.7, performed the best. Furthermore, it is not very important which method is used:

“However, several experiments suggest that the choice of the ranking function does not have a great impact on the overall system performance as long as it is used just to determine a set of terms to be used in the expanded query [Salton and Buckley 1990; Harman 1992; Carpineto et al. 2001]” [12]

I also did some small-scale tests where I tried the query expansion method from [26] directly, but this resulted in very poor expansion terms. The terms could be dismissed upon inspection and by test searches for them on twitter.
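Since equation 2.7 is not reproduced in this chapter, the sketch below assumes the common χ2-style scoring form from the AQE literature, score(t) = (p_R(t) − p_C(t))² / p_C(t), where p_R and p_C are the term's relative frequencies in the pseudo-relevant set and the whole collection; the thesis' exact formula may differ in detail:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class ChiSquareRanking {
    /** Chi-square-like score of a term: assumed form (pR - pC)^2 / pC. */
    public static double score(double pR, double pC) {
        double d = pR - pC;
        return d * d / pC;
    }

    /** Rank candidate terms by descending score, keeping only terms that
     *  are more frequent in R than in C (pR > pC), as in Algorithm 1. */
    public static List<String> topTerms(Map<String, Double> pR,
                                        Map<String, Double> pC, int k) {
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, Double> e : pR.entrySet()) {
            double c = pC.getOrDefault(e.getKey(), 0.0);
            if (c > 0 && e.getValue() > c) candidates.add(e.getKey());
        }
        candidates.sort(Comparator.comparingDouble(
                (String t) -> score(pR.get(t), pC.get(t))).reversed());
        return candidates.subList(0, Math.min(k, candidates.size()));
    }

    public static void main(String[] args) {
        Map<String, Double> pR = Map.of("barney", 0.05, "the", 0.10);
        Map<String, Double> pC = Map.of("barney", 0.001, "the", 0.09);
        System.out.println(topTerms(pR, pC, 2)); // [barney, the]
    }
}
```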

4.2.1 Co-occurrence heuristic

Instead of producing single expansion terms, I produce conjunctions of two terms as follows. Given a list of k terms, check the pairwise co-occurrence of these terms in virtual documents consisting of V tweets: a tweet, the ⌊V/2⌋ tweets collected just before it and the ⌊V/2⌋ collected just after the tweet containing the first term in the conjunction pair. Rank the pairs according to their modified Dice coefficient:

    D = 2 · df_{u∧v} / (df_u + df_v)    (4.1)

where df_{u∧v} represents the document frequency over the virtual documents in the pseudo-relevant set, and df_u and df_v the document frequencies in the collection as a whole.
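Once the document frequencies are counted, equation 4.1 reduces to a one-line computation; a minimal sketch with illustrative counts:

```java
public class DicePair {
    /** Modified Dice coefficient (equation 4.1): co-occurrence document
     *  frequency of the pair over the virtual documents, normalized by the
     *  collection-wide document frequencies of the two terms. */
    public static double modifiedDice(int dfPair, int dfU, int dfV) {
        return 2.0 * dfPair / (dfU + dfV);
    }

    public static void main(String[] args) {
        // e.g. a pair co-occurring in 30 virtual documents, with terms
        // occurring in 400 and 600 documents of the whole collection
        System.out.println(modifiedDice(30, 400, 600)); // 0.06
    }
}
```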

4.2.2 Hashtag heuristic

Given a list of k terms, all terms that are hashtags or mentions (starting with # or @, respectively) are considered related if the hashtag without the initial pound symbol is not found in a standard English dictionary.
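A minimal sketch of this heuristic (the dictionary set and the treatment of mentions as always kept are assumptions for illustration; the thesis' implementation may differ):

```java
import java.util.Set;

public class HashtagHeuristic {
    /** Decide whether an expansion term is kept under the hashtag heuristic:
     *  mentions are kept, and hashtags are kept when the tag minus the pound
     *  symbol is not an ordinary dictionary word. */
    public static boolean keep(String term, Set<String> dictionary) {
        if (term.startsWith("@")) return true;
        if (term.startsWith("#")) {
            return !dictionary.contains(term.substring(1).toLowerCase());
        }
        return false; // plain terms are handled by the co-occurrence heuristic
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("vote", "retweet", "people");
        System.out.println(keep("#himym", dict));    // true: not a dictionary word
        System.out.println(keep("#vote", dict));     // false: ordinary word
        System.out.println(keep("@himymcbs", dict)); // true
    }
}
```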


Algorithm 1 Top(K, R), produces an array of single search terms.
 1: R is an array of relevant tweets tw_l, 1 ≤ l ≤ N.
 2: for all terms t ∈ ∪_l tw_l do
 3:   if p_R > p_C then
 4:     use equation 2.7 to calculate score(t) and add ⟨t, score(t)⟩ to the list l
 5:   end if
 6: end for
 7: sort l in order of score(t)
 8: let top[K] be an array of terms t_i
 9: top ← the K terms with the largest score(t)
10: return top

Algorithm 2 Pairs, produces the pairs of search terms used.
 1: let R be an array of relevant tweets tw_l, 1 ≤ l ≤ N
 2: top ← Top(K, R)
 3: let pairs[K · (K − 1)/2] be an array of ⟨String, String, Integer⟩
 4: for all terms t_i in top do
 5:   T_u ← tweets tw | t_i ∈ tw
 6:   for all terms t_j ∈ top | j > i do
 7:     T_v ← tweets tw | t_j ∈ tw
 8:     for all tw_l ∈ T_v do
 9:       vd ← tw_{l−2} @ tw_{l−1} @ ... @ tw_{l+2}
10:       if t_i ∈ vd then
11:         ⟨t_i, t_j, count⟩ ← pairs[index(i, j)]
12:         pairs[index(i, j)] ← ⟨t_i, t_j, count + 1⟩
13:       end if
14:     end for
15:   end for
16: end for

4.2.3 Algorithms

Algorithm 2 describes how to get pairs of terms and their counts. Note that the nested loops on lines 6-14 correspond to doing a join between the tweets that contain t_i and the virtual documents, formed around the tweets that contain t_j. This can be implemented as a hash-join of search results, see listing 5.1. The final step, sorting the term pairs according to their modified Dice coefficient using equation 4.1, is omitted. The function index(i, j) returns the index at which to store the term pair in pairs.
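The hash-join can be sketched as follows: build a hash set of the positions of tweets containing t_i, then probe the window around every tweet containing t_j. The window size (V = 5), the method names and the substring matching are illustrative assumptions, not the system's actual code:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PairCount {
    /** Count how often termI occurs in the virtual documents (window of two
     *  tweets on each side) around tweets containing termJ; tweets is the
     *  pseudo-relevant set in collection order. */
    public static int countPair(List<String> tweets, String termI, String termJ) {
        // build phase: positions of tweets containing termI
        Set<Integer> positionsI = new HashSet<>();
        for (int l = 0; l < tweets.size(); l++) {
            if (tweets.get(l).contains(termI)) positionsI.add(l);
        }
        // probe phase: for each tweet with termJ, scan its virtual document
        int count = 0;
        for (int l = 0; l < tweets.size(); l++) {
            if (!tweets.get(l).contains(termJ)) continue;
            for (int m = l - 2; m <= l + 2; m++) {
                if (positionsI.contains(m)) { count++; break; }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> tweets = List.of(
            "watching himym", "barney is great", "ted and robin",
            "nothing here", "barney again");
        System.out.println(countPair(tweets, "ted", "barney")); // 2
    }
}
```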


Top 20 expansion terms for the show "How I met your mother":
met, mother, @realwizkhalifa, @himymcbs, barney, movie, ted, thats, assisti, #orangotag, #himym, tv, wiz, netflix, show, @fancite, fancite, lily, watch, robin

The final keywords:
@realwizkhalifa, @himymcbs, #orangotag, #himym, @fancite, @fancite fancite, barney #himym, mosby ted, barney ted, barney fancite, barney @fancite, fancite ted, @fancite ted, rt@realwizkhalifa ted

Table 4.1: Expansion terms for the show "How I met your mother" using equation 2.7, and the resulting search terms produced by the hashtag, mention and co-occurrence heuristics. Note that a space means conjunction and a comma means disjunction. This used data where tweets mentioning other shows have been removed.

4.2.4 Auxiliary data and pre-processing

To get accurate statistics about the term distributions of tweets about certain products, I use auxiliary data collected by KDDI R&D about television programs over several months. The data consists of tweets containing titles of television programs: one set R_j for each television program. This collection is carried out using the REST GET method described in chapter 2.

After an initial test of my approach, I discovered that data quality is lacking in many tweets in the sets R_j. To mitigate this problem I included extensive pre-processing methods to obtain good data to work with for algorithms 1 and 2.

• Because it is so common to mention several shows in one tweet, I chose to define such tweets as non-related and thus filtered out each tweet containing a title longer than one word other than its own.

• As mentioned in chapter 2, re-tweets are a way in which users share tweets that they think are important. From my own collecting I have discovered that about 25% of tweets are re-tweets. These contain almost no new information to increase recall, and all messages containing the sub-string "RT" or "rt" as a separate word are filtered out.

• Many tweets are either automatically generated or re-posted so that an identical duplicate is found; I use a hash table to remove these.

• Filter all tweets for language using a Bayesian classifier; non-English tweets are removed.
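The re-tweet and duplicate filters above can be sketched in a few lines (the language filter is omitted since it needs an external classifier; names are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

public class TweetPreprocessor {
    private final Set<String> seen = new HashSet<>();

    /** True if the tweet passes the re-tweet and exact-duplicate filters
     *  described above; the Bayesian language filter is left out. */
    public boolean accept(String tweet) {
        // drop re-tweets: "RT" or "rt" occurring as a separate word
        for (String word : tweet.split("\\s+")) {
            if (word.equals("RT") || word.equals("rt")) return false;
        }
        // drop identical duplicates using a hash table
        return seen.add(tweet);
    }

    public static void main(String[] args) {
        TweetPreprocessor p = new TweetPreprocessor();
        System.out.println(p.accept("watching himym tonight"));    // true
        System.out.println(p.accept("RT watching himym tonight")); // false (re-tweet)
        System.out.println(p.accept("watching himym tonight"));    // false (duplicate)
    }
}
```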

4.2.5 Twitter data quality issues

As a preliminary experiment I took up to 20 thousand tweets for each of the approximately 1000 shows that are tracked as my different pseudo-relevant sets R_j.


Performing the pre-processing steps above yields the search terms shown in table 4.1 when the proposed method is used. Using these terms as search terms for the twitter streaming API, tweets were collected for four hours or until 1,000 tweets were collected. These initial results are not very impressive, other than that they do contain some related tweets that do not contain the title.

If we take a look at table 4.1 we can see that there are many terms that seem completely irrelevant. This made me investigate the phenomenon further by looking at the expansion terms of three TV shows: "The Vampire Diaries", "How I Met Your Mother" and "The Secret Circle".

Table 4.2: Search terms generated for the television shows The Vampire Diaries and The Secret Circle using a moderately sized data set.

Show: The Vampire Diaries
New terms: vampire, diaries, #peopleschoice, @peopleschoice, voted, vote, retweet, #networktvdrama, #scififantasyshow, ordinary, #thevampirediaries, just, homecoming, @iansomerhalder, damon, #tvd, s3ep8, people, #orangotag

Show: The Secret Circle
New terms: circle, secret, #thesecretcircle, balcoin, #newtvdrama, #peopleschoice, @peopleschoice, voted, retweet, vote, assisti, #orangotag, s01e09, vou, s1ep8, assistir, cassie, 1x09, @chriszylka, ver

The terms that I consider good from tables 4.1 and 4.2 are: #thevampirediaries, @iansomerhalder, damon, @himymcbs, barney, ted, #himym, robin, lily, #thesecretcircle, cassie, @chriszylka, balcoin. They include relevant hashtags, actor names, character names and related concepts. The rest of the terms are most likely not relevant.

The analytics company Pear Analytics was cited in [27]; they determined that, out of a sample of two thousand twitter messages, 40% were "pointless babble". Some of the unattractive keywords found correspond to tweets in the following categories:

Non-English ver, vou and assistir are non-English words from tweets that slipped by the language detector. It is a very hard problem to detect language in just 140 characters; one improvement is to exclude any search terms from the analysis and make sure that the search terms are in English to begin with.

Chance Some terms just happen to be much more frequent in some pseudo-relevant sets. By increasing the corpus size one can hope to obtain a better sample of the distribution of terms and improve df estimations.


Automatically generated tweets Many websites produce tweets on the user's behalf; these messages are often a standard message where the user's name or a URL to a user page is the only distinguishing characteristic. Since these messages are also very frequent, they will shift the language model away from human writing patterns.

Compound hashtags By examination of the data I have discovered that it is very common to use hashtags that are composed of several standard words concatenated together, such as #networktvdrama seen in table 4.2. These seem to be used both for describing topics and sometimes for emphasis.

The most effective way to lessen these problems is to increase the corpus size. This will limit the effect of many of the problems listed above. Furthermore, one can devise many strategies to deal with different undesirable tweets. One could for instance use only tweets by verified users^1, that is, users where the twitter company has verified the identity. This is very problematic, however, since verification is a privilege granted only to users with a high public profile, and these users represent only a minute proportion of twitter users.

Some strategies to consider are to avoid comparing things that are used to distinguish automatically generated tweets, user names and URLs for example. Many spam tweets are very similar to other spam tweets; in fact, hashing tweets removes about 20.4% of collected tweets, as we can see in chapter 6 (due to spam but also other reasons). A very high word concurrence rate could perhaps also be used to remove spam tweets, but this is a far more expensive operation; in the worst case we have to compare each new tweet with all old ones, resulting in |C|(|C|+1)/2 = O(|C|^2) complexity. Perhaps one can devise a special hash function for this problem, but this is not explored in this thesis.

A developer can, if they choose to do so, indicate a source value for each tweet, such as web (default) or my_iphone_app.com. But this is not a good approach for removing automatically generated tweet content, since developers do not care to include a distinguishing source in all cases, and many users write tweets on a device which generates another source tag than the default, such as: iphone.

Compound hashtags are a phenomenon that I explored when building a classifier; they are treated in section 4.3.2.

4.2.6 Collection of new tweets for evaluation

Since there is no annotated corpus of tweets for this task to use for empirical evaluation, I decided that the best approach is to evaluate the method as it would be used in a deployed system and evaluate results manually. For each television show that is tracked, a sample of tweets will be annotated with the labels related and unrelated.

1. http://support.twitter.com/articles/119135-faqs-about-verified-accounts

By means of the streaming API, tweets are collected for a fixed time using the additional keywords found and the actual title. Firstly, the set of tweets containing the exact title is determined. These tweets represent the baseline and will not be considered for manual labeling. Secondly, the remaining tweets are filtered for English. By comparing the cardinality of the two sets we get the ratio of tweets collected by my method to the baseline, an indication of the increase in recall.

To determine the precision, the tweets that are in English and do not contain the exact title are sampled uniformly without replacement and manually inspected to see whether they are related or not. Results of the experiments are shown in chapter 6.

4.3 A classifier to improve precision

Initial experiments were partly successful, and I decided to create a data visualization tool to inspect how many tweets were collected for each of the generated search terms or search term pairs. I did further experiments where data was collected for 24 hours for the following television programs:

• Make it or break it

• Buffy the vampire slayer

• Saturday night live

Figure 4.1: Visualization of the fraction of tweets by keyword for the show Saturday night live; here, different celebrities that have been on the show dominate the resulting twitter feed.

The results for Saturday night live are shown in figure 4.1. These results are discouraging: firstly, the increase in recall does not seem to be that great, but more importantly, the resulting tweets are dominated by those found by searching for different celebrities. This illustrates the need for maintaining precision with a classifier that is able to prune these large numbers of only partially related tweets.


4.3.1 Unsupervised system

One can deploy a supervised algorithm relying on annotated training data if one finds a way to gather data considered relevant and not relevant. In this thesis I will use the sets Rj yet again as training data. This means that the data used for training is different in nature from the data we wish to classify, and I will use different techniques to bridge this gap.

4.3.2 Data extraction

I want to include the information contained in the external resources of the tweets as well as the hashtags and user mentions.

Hashtags I split hashtags into their constituent parts with an algorithm that checks all sub-strings, either "i" or of length greater than two, that make up the hashtag. If a sub-string is not found in a dictionary then it is not considered. Then the sub-strings, which when concatenated make up the original string, are selected according to the product of their frequencies in a corpus. See appendix A.

Mentions I will search for the user on twitter and use a description and name if available. These will be included as additional text in the tweet.

URLs I will scrape the website and use text extracted from it. This text, without markup, replaces the URL in the tweet, making the tweet a longer document.
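The hashtag-splitting step above can be sketched as a dynamic program over segmentations, where only "i" or words longer than two characters that appear in the dictionary are allowed, and the segmentation maximizing the product of word frequencies (computed in log space) is selected. The class name and the toy frequency values are mine; the real system loads corpus statistics (see appendix A):

```java
import java.util.*;

public class HashtagSplitter {
    // Word -> relative frequency in some corpus (toy placeholder values here).
    private final Map<String, Double> freq;

    public HashtagSplitter(Map<String, Double> freq) { this.freq = freq; }

    // Returns the segmentation of tag maximizing the product of word
    // frequencies, or an empty list if no valid segmentation exists.
    public List<String> split(String tag) {
        int n = tag.length();
        double[] best = new double[n + 1];   // best log-product ending at position i
        int[] cut = new int[n + 1];          // start index of the last word
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = 0; j < i; j++) {
                String w = tag.substring(j, i);
                // Allowed parts: "i" or sub-strings longer than two characters.
                boolean allowed = w.equals("i") || w.length() > 2;
                Double f = freq.get(w);
                if (allowed && f != null && best[j] + Math.log(f) > best[i]) {
                    best[i] = best[j] + Math.log(f);
                    cut[i] = j;
                }
            }
        }
        if (best[n] == Double.NEGATIVE_INFINITY) return Collections.emptyList();
        LinkedList<String> parts = new LinkedList<>();
        for (int i = n; i > 0; i = cut[i]) parts.addFirst(tag.substring(cut[i], i));
        return parts;
    }
}
```

For example, with "saturday", "night" and "live" in the dictionary, the hashtag #saturdaynightlive splits into its three constituent words.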

Initially I wanted to resolve abbreviations, but I found no reliable way of doing this. There are publicly available abbreviation databases, such as the STANDS4 API2, but the problem is that there are many different possibilities for abbreviations, and it is a very hard problem to determine what is an abbreviation in the first place.

4.3.3 Web scraping

A lot of the content on web pages is not relevant to the main focus of the page. This content could for instance be advertisements or a side menu that offers navigation of the web site, and so on. If this non-relevant text were included, either as an external source or as additional tweet text found by looking up URLs found in tweets, the proposed method would likely be much less effective. Therefore we have chosen to use a supervised boilerplate-removal method that has high accuracy when determining the informative text sections of web sites [24].

4.3.4 Classification of sparse vectors

To increase precision I have used different supervised classification schemes to classify tweets retrieved using additional keywords. When classifying sparse data directly, such as text data, we need to use classifier models that are adapted to

2http://www.abbreviations.com/api.php


thousands of dimensions. One popular such model is the linear SVM (support vector machine) [32].

A linear SVM has a non-linear, convex objective function with linear constraints that is optimized to find a maximum-margin separating hyperplane between two sets of points in Euclidean space, related points and unrelated points. If such a hyperplane exists then the problem is convex, and a global optimum can be found using gradient-based methods [32].

If the points in the two sets overlap so that it is impossible to fit a hyperplane between them, then slack variables are introduced for mislabeled training examples. The dual of the problem is still convex, however, so that a global optimum can be found. The SVM problem is a quadratic program, since the Euclidean distance is used, and there exist many efficient algorithms for finding the optimizer.

I tested using a linear SVM for my classification problem, and not surprisingly it performs very well when cross-validating a training set generated by searching for the title in tweets. However, the model performs poorly on tweets that come from a search for expansion terms. See appendix B for more details.

This behavior can be explained by the fact that twitter messages are very short; there simply are not many words in common between the very sparse vectors.

4.3.5 Features

Since working directly with the sparse tweet vectors is not fruitful, I take my inspiration from related works in tweet classification [13][40] and compare external sources with tweets. The supervised classifier, f, can be seen as a function of two input arguments, a tweet and a show title. If we use K different external sources:

c : R^K → {true, false}

f(tweet, title) = c(g(pp(tweet), ws(title)))

Here c denotes a supervised, binary, vector-based classifier, pp the pre-processing operations listed in section 4.3.2, ws the web scraping of external resources as described in section 4.3.3, and g the cosine distance of tf · idf vectors. The features used in c correspond to different external sources processed by ws. Each source corresponds to one feature in the feature vector that represents the tweet during classification. The feature value is calculated as the cosine distance between the tf · idf vectors of the tweet and the text source.
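The per-source feature computation can be sketched as a cosine similarity between sparse tf · idf vectors represented as term-to-weight maps. This is a simplified illustration with a class name of my choosing; in the actual system the term statistics come from a Terrier index:

```java
import java.util.Map;

public class CosineFeature {
    // Cosine similarity of two sparse tf*idf vectors, each represented
    // as a map from term to weight. Missing terms have weight zero.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;   // shared terms only
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) normB += w * w;
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

With K external sources, the feature vector for one tweet is simply the K cosine values against the K source vectors.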

4.3.6 Classification

Once I have a non-sparse vector for each tweet that is reasonably good at discriminating the two classes, my choice of classifier models is much greater than when dealing with sparse vectors. This is because many classifier algorithms are not


Figure 4.2: Conceptual view of collection and classification of new tweets.

adapted to this problem in terms of run-time and resulting complexity. By empirical evaluation, the C4.5 decision tree classifier turned out to give the best results and to have a reasonable run time.

Since the training data is of a different nature than the data that we intend to classify, I decided to add some noise to the training data by changing the class labels of 5% of the training examples. This turned out to give slightly improved results.
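The label-flipping step could be sketched as below. The class name, the boolean label representation and the explicit seed are my assumptions for illustration; the thesis only specifies flipping roughly 5% of labels:

```java
import java.util.Random;

public class LabelNoise {
    // Flips the binary class label of roughly the given fraction of training
    // examples, so the classifier does not overfit pseudo-relevant training data.
    public static boolean[] addNoise(boolean[] labels, double fraction, long seed) {
        Random rnd = new Random(seed);
        boolean[] noisy = labels.clone();
        for (int i = 0; i < noisy.length; i++) {
            if (rnd.nextDouble() < fraction) noisy[i] = !noisy[i];
        }
        return noisy;
    }
}
```

Calling addNoise(labels, 0.05, seed) before training reproduces the 5% setting described above.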

4.4 Combined approach

Figure 4.2 shows a conceptual view of a system that uses a combined approach of AQE to find new terms and supervised classification to improve precision.


Chapter 5

Tweet Collect: Java implementation using No-SQL database

"If I have seen further, it is by standing on the shoulders of giants"
- Isaac Newton, letter to Robert Hooke, 5 February 1676 [29, p. 416]

In this chapter I describe a component view of the implemented system and the many external libraries on which it depends. For each component, a short description of the functionality is provided and issues regarding it are addressed. The Java implementation of algorithm 2 is listed and analyzed.

5.1 System overview

The algorithms described in chapter 4 are not very complicated for smaller data sets; the main problem any implementation needs to solve is how to cope with large amounts of data so that the statistics gathered become accurate. In section 5.2.2 I describe a scalable implementation using various data structures for efficiency.

My goal was to build a prototype system, and thus minimizing development time was crucial. By dividing the system into two main components, query expansion and classification, I could focus on one part at a time during implementation. Future versions of the software could perhaps move towards a more unified architecture.

Since development time was crucial, but so was the ability to process large amounts of data, Java was a natural choice, offering a good trade-off between performance and safety. The JVM acts as the necessary glue for many different third-party libraries and gives a developer a lot of safety because of garbage collection, array bounds checking and exception handling. Some components of the system are not so tightly integrated, however, and instead communicate using files and are run by bash scripts.

Figure 5.1 describes the two main components and their most important sub-components. The external systems, the Twitter API and the Internet, are not accessed directly but instead through robust Java libraries.

Figure 5.1: Conceptual view of collection and classification of new tweets.

The typical usage of the system is first computation of expansion terms as a

batch job, then data collection from twitter as an ongoing process. Classification is not done as tweets arrive in this prototype version, but that is the intended use case. Instead, I only wanted to evaluate the feasibility of classification, and this is done as a batch job. Figure 5.2 describes how the components are used at this stage of prototype development and evaluation.

5.2 Components

Many of the components of the system are built as wrapper classes to encapsulate the needed functionality of external libraries. These external libraries, or dependencies, do most of the work, with the exception of two components that were custom built:

Get pairs of terms The implementation of algorithm 2 Pairs().

Pre-processing String processing and web scraping to transform tweets and web pages into a similar language.

The dependencies used are listed in table 5.1.


Figure 5.2: How the different components are used to evaluate system performance. This does not represent the intended use case, where collection, pre-processing and classification is an ongoing process.

5.2.1 Statistics database

This is the most important component in terms of using algorithms 1 and 2 from chapter 4 efficiently. What is required is very fast look-up of statistics for terms in different subsets of the total collection, as well as inspection of the terms in an individual document.

Furthermore, the system should be easily configurable and extensible so that one can e.g. decide which terms to index, how to treat different characters, and customize query expansion.

Almost all RDBMSs support full-text indexes, i.e. inverted indexes as described in chapter 2, but fall short in many other respects. Dedicated IR systems such as Lucene1, Indri and Terrier2 [31] have retrieval performance that is much faster for some operations, with much less hassle than in many RDBMS systems. Furthermore, I need to be able to go in and modify low-level behavior without too much development time.

After investigating Apache Lucene, Indri and Terrier (all frequently used in related works) I decided that Terrier was the best documented system, with the clearest source code and most supportive community. The basis of my project is therefore built around Terrier 3.5.

I had many doubts throughout the project about not using an RDBMS, but the key issue that settled the decision to use a dedicated text system was development time. Once I got up to speed on how Terrier worked, I could modify any internal behavior relatively easily. Changing low-level behavior in many RDBMS systems is

1http://lucene.apache.org/core/
2http://terrier.org/


Table 5.1: List of dependencies organized by (sub) component.

Software component   Dependency          Role
All                  Logger4j            Variable levels of debug and logging output.
Term statistics      Terrier 3.5         Fast access of term statistics, efficient storage of the inverted index structure, easy to customize indexing and retrieval behavior.
                     language-detection  Remove as many tweets that are not in English as possible, so as not to skew statistics.
Query expansion      Terrier 3.5         Generate single expansion terms using algorithm 1.
Twitter API access   Twitter4j           Reliable access to the twitter API, conversion of data into Java objects.
Web scraping         ExecutorService     Reliable implementation of a thread pool for I/O-intensive tasks.
                     Boilerpipe [24]     Remove boilerplate content from web pages.
Cosine distance      Terrier 3.5         Term statistics about collected tweets and external sources are used to form tf · idf vectors.
Classify             Weka                Stable and correct implementations of various classification algorithms.
Results storage      JFreeChart          Visualize results.
and visualization    JDBC                Database connectivity.
                     MySQL               Store results and select subsets of tweets for visualization.

a tedious and dangerous task.

Most of the work happens when we index documents and build an inverted

index structure. Terrier does not fully support incremental indexing, but one can merge indexes without too much overhead. Nevertheless, for a prototype system it is enough to build an index once and then use it. Perhaps the most time-consuming process is the pre-processing before indexing, where we wish to remove all tweets that contain two different show titles. Many optimizations are possible, but I chose to do a naive implementation comparing every show name with every tweet and leave it running over night.


5.2.2 Implementation of algorithms

Algorithm 1, which computes the most informative expansion terms from a set of pseudo-relevant documents, was already implemented in Terrier 3.5 as a query expansion class. I made some modifications to the source code to be able to specify large amounts of documents as relevant instead of doing an original query. Using the query expansion class directly is not a typical use case of Terrier, but the code base was well organized, so it was easy to use this as an entry point instead of the more typical use case of ad-hoc querying.

Listing 5.1 describes the implementation of algorithm 2. I perform searches of an inverted index with a particular term to create result sets of documents RS_k corresponding to term t_k. I do TopX * (TopX - 1)/2 hash-joins between the result sets RS_i, i = 0..TopX - 1 and RS_j, j = i + 1..TopX - 1 to see in how many documents the two terms t_i and t_j co-occur. I also exploit the fact that documents are assigned document ids in order of indexing to create virtual documents d_{j-2}@d_{j-1}@d_j@d_{j+1}@d_{j+2}. The complexity is (skipping some minor operations such as sorting the O(TopX^2)-size array on line 55):

• Algorithm 1 takes O(|PRS|) time.

• The outer loop is done TopX times

• Create and populate a hash table in O(|RS_i|), where RS_i is the set of documents that contain the term t_i.

• The inner loop is done TopX - i times

• Do a hash-join with the results of a query for t_j. This corresponds to 5 look-ups in O(1), done |RS_j| times.

The total complexity is (some abuse of notation):

O(|PRS|) + O(TopX * [ |RS_i| + (TopX - i) * |RS_j| ])

= O(|PRS|) + O(TopX * |RS_i|) + O(TopX^2 * |RS_j|)

Now the complexity analysis becomes a bit tricky if we want to express it in the total number of documents in the collection, |C|. The number of documents |RS| is potentially |C|, but this is almost guaranteed not to happen, since we are searching for the most descriptive terms of a particular relevant set. To guarantee a maximum run-time we can limit the search results RS to e.g. the first 1000 documents that contain the term and, in most cases, not affect the results.

Searching an inverted index for a specific term is assumed to take O(1) in this analysis, and this is the case if we can find the pointer to the list of documents that contain that term in constant time, which is exactly what an inverted index is designed to do.


Listing 5.1: Java code to perform Algorithm 2

 1  public String[][] getCoOcTerms(PRF type, List<Term> tl, int topX, String topicId) {
 2      int queryids = 10000;
 3      int vDocSize = Settings.getInt("vDocSize", 5);
 4      logger.info("Virtual document size is " + vDocSize);
 5      // get the topX terms
 6      List<Term> KL;
 7      if (tl == null)
 8          KL = getTopTerms(type, topX, topicId); // type is Chi Squared in experiments
 9      else
10          KL = tl;
11      // term pair
12      class TermPair {
13          Term termA, termB; int count = 0; double dice = 0.0;
14      }
15      // create term-pair array that we later sort
16      TermPair[] arr = new TermPair[(topX * (topX - 1)) / 2];
17      for (int i = 0; i < arr.length; i++)
18          arr[i] = new TermPair();
19
20      Hashtable<Integer, Boolean> ht;
21      int idx = 0;
22      double c = 1.0;
23      // Go through the top terms one by one
24      for (int i = 0; i < topX; i++) {
25          // Do a query for term_i -- get the tweets that contain this term.
26          ResultSet set_u = processQuery("" + queryids, KL.get(i).str, c);
27          int[] docids_u = set_u.getDocids();
28
29          // Should be faster to create the hashtable only in the outer loop even if u is much longer
30          ht = new Hashtable<Integer, Boolean>(docids_u.length);
31          for (int j = 0; j < docids_u.length; j++) {
32              ht.put(new Integer(docids_u[j]), true);
33          }
34          // Compare with all the other top terms (ignore symmetrical pairs)
35          for (int j = i + 1; j < topX; j++) {
36
37              arr[idx].termA = KL.get(i);
38              arr[idx].termB = KL.get(j);
39              // Do a query for term_j
40              ResultSet set_v = processQuery("" + queryids, KL.get(j).str, c);
41              int[] docids_v = set_v.getDocids();
42
43              for (int k = 0; k < docids_v.length; k++) {
44                  if (checkVirualDoc(ht, docids_v, vDocSize, k)) {
45                      // Does a check: ht.containsKey(docids_v[k]+i) for i=[-vDocSize/2, vDocSize/2]
46                      arr[idx].count++;
47                  }
48              }
49              idx += 1;
50          }
51      }
52      // Calculate dice
53      for (int i = 0; i < arr.length; i++)
54          arr[i].dice = 2 * (((double) arr[i].count) / (arr[i].termA.docFreq + arr[i].termB.docFreq));
55      // Sort arr
56      Arrays.sort(arr, new Comp());
57      return arr;
58  }


5.2.3 Twitter access

Access to twitter data is based on the HTTP protocol and uses the JSON format for the data. Managing connections, authentication, conversion to Java objects and asynchronous handling of newly arriving packages is done conveniently using the Twitter4j library.

5.2.4 Web scraping

This component's job description is very simple: fetch the HTML content of a web page linked in a tweet and pass it on to the boilerpipe library. Yet I spent major effort optimizing this component. Why?

The problem lies in the response times of HTTP requests, with Java typically anywhere between 0.5 and 5 seconds. Perhaps one in three tweets contains a URL, and with an average look-up time of 1.5 seconds per web page, just fetching the contents for 10,000 tweets would take in excess of 80 minutes. Combine this with the fact that the boilerpipe algorithm is not very fast either, perhaps 0.5 seconds to process one web page, and we have a serious problem.

To solve this problem of waiting for I/O, I used a thread pool implementation and the producer-consumer pattern. The many threads fetching web pages, sleeping while waiting for the reply, make up the producers, and the boilerpipe library the consumer. Using the java.util.concurrent.ExecutorService interface, I could with relative ease create an implementation capable of fetching and processing 1-2 MB/s of web content.
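The thread-pool pattern above can be sketched as follows. The class name is mine, and the HTTP fetch is stubbed out with a placeholder string; the real system performs the blocking request inside each worker and hands the fetched page to the boilerpipe consumer:

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class ScraperPool {
    // Fetches pages on a pool of I/O-bound worker threads (the producers);
    // fetched content lands on a queue from which a consumer (boilerpipe,
    // in the real system) would drain it.
    public static BlockingQueue<String> fetchAll(List<String> urls, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        BlockingQueue<String> pages = new LinkedBlockingQueue<>();
        for (String url : urls) {
            pool.submit(() -> {
                // Placeholder for the real HTTP fetch, which may block 0.5-5 s.
                String html = "<html>" + url + "</html>";
                pages.add(html);
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return pages;
    }
}
```

Because the workers spend most of their time blocked on network I/O, a pool much larger than the CPU count keeps total throughput high despite slow individual requests.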

5.2.5 Classification

The actual classification algorithms used are implemented in the open source Weka project. However, these implementations need data in a specific input format. To improve performance I also pre-processed the data before classification, and this is where I did most of my implementation work.

Since I wanted to use tf · idf weighting of the terms in tweets, I again used a Terrier index for the tweets to classify and the external documents to compare them with. After the index is created, it is easy to do the necessary conversions.

Here we run into the first obstacle with the prototype system: it does not support real-time indexing with the present software components. One could, however, update the index on a regular basis, e.g. once per hour, and still meet reasonable time demands.

After pre-processing, I use either the Weka GUI or a bash script to perform training and classification; for this initial prototype I saw no need to integrate this component further. This is not hard, however, since Weka is a Java library.


5.2.6 Result storage and visualization

After data is fetched from the twitter streaming API it is stored in a MySQL database so that it can be visualized. This is essential for understanding the performance of the method: we must de-multiplex the different keywords to know which ones were effective.

5.3 Limitations

At the present state the system is very much a prototype; it works well enough to evaluate a snapshot of the world, but not for continuous operation. To see this, I will elaborate on what happens if we wish to classify tweets in real time.

1. Build (or update) indexA of the current state of collected tweets containing titles (pseudo-relevant tweets).

2. Use indexA to get expansion terms using algorithm 2.

3. Search twitter using the streaming API using the keywords found.

a) A tweet arrives

b) Retrieve (or update) external sources so they are up to date.

c) Pre-process the tweet.

d) Build (or update) indexB with the external sources and the new tweet.

e) Convert the tweet to the cosine distances from relevant external sources using tf · idf vectors found in indexB.

f) Classify the tweet using a classifier trained previously.

The big issue with the current implementation and the limitations of Terrier 3.5 is that we have no really efficient way to update indexA and indexB. One must index new tweets and external sources (as an index of a single document in the extreme case) and merge with the old version of the index at reasonable intervals, meaning that we cannot get a real-time view of the world.

This is not the state of the implementation described here. Instead, indexA is first computed, and we search twitter using keywords generated from the statistics we can retrieve from this structure, storing these tweets as-is in a MySQL database. Then we fetch external sources and create indexB from the retrieved tweets and the pseudo-relevant tweets that will be used to train a classifier. Before training a classifier, pre-processing is done by retrieving statistics from indexB so that we can finally train and test it.


5.4 Development methodology

By using an agile approach I was able to quickly realize a working prototype that is still realistic in that it is scalable and works with real quantities of data (millions of tweets). This was only possible by using lightweight designs and heavy reliance on external libraries.

I used two major sprints:

1. Develop the AQE component and do small scale tests

2. Develop the classification component and do large scale tests

After the first sprint I evaluated the progress and realized the need for extensive processing of tweets (removing spam, resolving URLs and hashtags) to obtain good results.

It is near impossible to produce an architecture before prototyping that will hold up for anything more than a short amount of time; the human mind is simply not capable of imagining the complex interplay of different external libraries and standards before one has experience working with them. I actually lost perhaps 3-4 days by doing an initial architecture and design of the system. It helped me somewhat with knowing the different problems ahead, but it was far too low-level and did not play well with the way external libraries and other software components were designed.

The system is not robust, not optimized for performance, nor does it represent how a production system will look. But the goal of investigating the method proposed in chapter 4, within the time frame available and with a zero budget, was achieved.


Chapter 6

Performance evaluation

In this chapter, the results of experiments to test the performance of Tweet Collect are described, along with the auxiliary data and parameters used to collect tweets about television programs. As shown in chapter 4, a large corpus of pseudo-related tweets is used as a basis for query expansion, and since I am interested in collecting tweets about television programs, I used a corpus of television-related tweets.

To remove spam and other unwanted tweets, pre-processing operations are performed on this corpus. After these operations I index the data with a full-text index using Terrier 3.5. We are then ready to get expansion terms, which are used to query the Twitter streaming API. The resulting tweets are stored in a MySQL database so that I can visualize the results. To improve precision, the tweets are transformed into a vector representation by comparing them to external sources, and then classified.

As seen in chapter 5, Tweet Collect is essentially a two-stage system consisting of a query expansion stage and a classification stage. Each of these two stages is evaluated by comparing to a separate baseline. For query expansion I compare the number of tweets collected using AQE with the number collected without query expansion. Furthermore, I look at the precision of these collected tweets. For classification I sample and label 500 tweets from each show and compare classifier performance with the naive baseline: assume all collected tweets are related. Finally, I extrapolate AQE and classifier performance to estimate the system performance in terms of the additional number of tweets collected and overall precision. This system performance is compared with the baseline of using my AQE approach without classification.

6.1 Collecting tweets about television programs

One possible application of the proposed method is to improve the popularity estimation of television programs by increasing the recall of collecting tweets about them. The method is evaluated with this application in mind, but should work for other types of product tracking as well.


Figure 6.1: Results of filtering auxiliary data to improve data quality. Note that the first filtering step is not included here, and these tweets represent strings containing either the title of a show or the title words formed into a hashtag.

6.1.1 Auxiliary data

A large corpus of tweets is essential. This means that we need ongoing tweet collection for tweets that include titles of TV programs over a longer period of time. A TV-related tweet corpus made up of 133 million tweets was collected by KDDI R&D over several months. These tweets were collected by polling the REST API at the maximum allowed rate every day. The keywords used to poll the REST API were the titles of 1478 different American TV shows and the most common hashtags found in these tweets, always grouping by title. To improve data quality, strict filtering was employed:

1. Only keep tweets that contain the title words or a concatenated string of the title words prefixed with #.

2. Keep only alphanumeric characters and #, @. Remove URLs from consideration.

3. Remove all tweets containing any capitalization of RT as a stand-alone term.

4. Remove all tweets matching the exact same content as a previously seen tweet.

5. Remove all tweets that contain more than one show title. This second title must be longer than one word and comes from a list of known shows.

6. Remove all tweets that are determined not to be English by a naive Bayes classifier.
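A subset of the filtering steps above can be sketched as a small pipeline; steps 1, 5 and 6 (title matching, multi-title detection and language detection) are omitted for brevity, and the class name is my own:

```java
import java.util.HashSet;
import java.util.Set;

public class TweetFilter {
    // Cleaned texts seen so far, for exact-duplicate removal (step 4).
    private final Set<String> seen = new HashSet<>();

    // Returns the cleaned tweet text, or null if the tweet is filtered out.
    public String filter(String tweet) {
        // Step 2: drop URLs, then keep only alphanumerics, whitespace, # and @.
        String cleaned = tweet.replaceAll("https?://\\S+", " ")
                              .replaceAll("[^A-Za-z0-9#@\\s]", " ")
                              .replaceAll("\\s+", " ")
                              .trim();
        // Step 3: drop retweets, i.e. any capitalization of RT as a stand-alone term.
        if (cleaned.matches("(?i).*\\bRT\\b.*")) return null;
        // Step 4: drop exact duplicates of previously seen tweets.
        if (!seen.add(cleaned)) return null;
        return cleaned;
    }
}
```

In this sketch, a retweet marker anywhere in the cleaned text rejects the tweet, and two tweets that differ only in punctuation or URLs collapse to the same cleaned string and so count as duplicates.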

Figure 6.1 lists the results of this filtering, where we keep only the roughly 25% of tweets that do not match any of the filter criteria.


Table 6.1: TV shows used for collecting tweets with new search terms. Shows marked with "*" are aired as reruns multiple times every day.

TV show                 Genre                    Air times (UTC)
How I met your mother   Drama, Comedy            9/22/*
The big bang theory     Drama, Comedy            9/23/*
The vampire diaries     Drama, Science fiction   9/21/00:00
The X factor            Talent show              9/27/00:00
Wheel of fortune        Game show                9/18/23:30

6.1.2 Experiment parameters

To evaluate the proposed method, we collected data for 5 TV shows of different genres using AQE. Due to the limitations of a free twitter API account, we could only search for one of these shows at a time and did so for 23 h 30 min, starting 6 h before the airing of the show; see table 6.1.

To obtain search terms for the twitter streaming API, Top(k) was used with k = 25. Then the hashtag heuristic was applied to get hashtags and mentions as search terms. Finally, the 40 highest-ranked term pairs according to equation 4.1, out of the possible 300 generated by Pairs(), were used. For comparison, we also search for the actual title so that we can later filter out all tweets that contain the title to see the increase in the number of tweets.

As described in section 4.3.5, I use the distances from a tweet's tf · idf vector to the external documents' tf · idf vectors as the features for classification. The external documents that are used are listed in table 6.2. For each show, the external documents can be generated externally by looking up web content.

The TV words source is not gathered from the web but instead created manually and consists of the words episode, premiere, season, watch, watching and patterns of the form eX, e0X, sX, s0X, sXeX and s0Xe0X with X = 1..10. More accurate document frequencies are estimated using government documents from the American national corpus [22].
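Generating the TV words source as literally described can be sketched as below; the class name is invented, and note that reading X = 1..10 literally also produces zero-padded forms like e010, which the thesis may or may not have included.

```java
import java.util.ArrayList;
import java.util.List;

public class TvWords {

    // The fixed words plus the episode/season patterns for X = 1..10.
    static List<String> generate() {
        List<String> words = new ArrayList<>(
                List.of("episode", "premiere", "season", "watch", "watching"));
        for (int x = 1; x <= 10; x++) {
            words.add("e" + x);
            words.add("e0" + x);
            words.add("s" + x);
            words.add("s0" + x);
            words.add("s" + x + "e" + x);
            words.add("s0" + x + "e0" + x);
        }
        return words;
    }
}
```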

6.1.3 Evaluation

After obtaining AQE collection results for the different shows, a sample of 500 tweets that do not contain the title was labeled for each show. This allows us to see how well the system works without the classification step (table 6.7), to evaluate a classifier for the problem (table 6.6), and to measure complete system performance in terms of the increase in the number of tweets and precision (table 6.8).

Judging the topic of a message is something most humans are very good at; however, this problem is far from trivial. The decision is based upon the interpreter's experience and knowledge of the subject matter itself and the jargon used to talk about it. Consider the following two hypothetical messages:


CHAPTER 6. PERFORMANCE EVALUATION

Table 6.2: Text sources used for comparing with tweets.

Text source        Description
EPG                Description of show
TV.com             Description of show, character names
Wikipedia page     Main Wikipedia page, use of boilerplate algorithm
Top10 Google       The top 10 pages of Google search, use of boilerplate then concatenated
Collected tweets   Concatenation of originally collected tweets containing the title of the TV show
TV words           Television related terms

“When actorX and actorY kiss I get tears in my eyes every time”
“omg #MN is so good, @actorX is the best”

where #MN is a hypothetical hashtag used to denote Movie Name. For a person that has seen the movie in question it is obvious that the first message refers to a specific movie. If that person is also an avid twitter user she will understand the second message to be strongly related to the same movie. Much of twitter consists of even more idiosyncratic messages, but with the proper knowledge these can be understood and classified.

A strong definition of related to is not possible; however, we can at least conclude that a message that contains a title that is unique (or almost unequivocally used for one topic) is related. If this title has alternatives in the form of hashtags, messages containing these are also related. Furthermore we can collect messages containing other strongly related meta data terms and leave it up to an evaluator to determine if they are related.

Tweets that are not written in English are manually removed and replaced until we have 500 labeled English tweets for each show.

6.2 Results

The proposed method gives us a number of additional tweets; the results of the experiments when using AQE only are listed in table 6.3. We can observe that the TV show The X factor has an abnormal number of additional tweets compared to the number of tweets containing the title.

Figures 6.2-6.3 show a breakdown of how many tweets per keyword, or keyword pair, were found for the shows How I met your mother and The X factor respectively. A keyword must account for at least 0.1% to be included in the chart. These charts show what kind of keywords are generated and how large a fraction of the retrieved results they account for. Most of the keyword pairs do not give many new tweets but a few do. The most important new keywords are arguably different hashtags


Table 6.3: Number of tweets collected for the different TV shows during 23h30min.

TV show                 Containing title  Extra tweets
How I met your mother   6,271             11,002
The big bang theory     10,222            3,907
The vampire diaries     13,118            23,598
The X factor            62,539            253,376
Wheel of fortune        1,253             912

Table 6.4: Percentage of tweets containing the title that are related to the television show.

TV show                 Fraction related
How I met your mother   100%
The big bang theory     99%
The vampire diaries     100%
The X factor            100%
Wheel of fortune        81%

and mentions. Here we see the reason why The X factor has a disproportionate number of additional tweets: the popularity of the celebrity hosts overtakes that of the show itself.

6.2.1 Ambiguity

The issue of ambiguous titles is investigated in [40], [14] and other works. Here we have focused on titles that consist of at least three words and assumed that any tweet that contains all these words is actually about the TV show. To test this assumption we sampled 100 tweets from each show and assigned labels. We can see in table 6.4 that this assumption is not completely accurate, but good enough for our purposes except in the case of Wheel of fortune.

6.2.2 Classification

To increase precision we wish to remove as many of the unrelated additional tweets as possible. We also want to keep as many as possible of the related ones to achieve our goal of increasing recall. We do this by supervised classification; the chosen algorithm is the J48 implementation of the C4.5 decision tree algorithm from the machine learning toolkit Weka [21].

Best case classification results are listed in table 6.5, where one model is built for each show and the manually labeled data is used with 10-fold cross validation. The following abbreviations are used: Acc. denotes the accuracy, P1 the precision,


Figure 6.2: Fraction of tweets by search terms for How I met your mother.

Figure 6.3: Fraction of tweets by search terms for The X factor.

Table 6.5: Classification results when using manually labeled test data as training data with 10-fold cross validation.

TV show                 Acc.   P1     R1     F1
How I met your mother   0.892  0.846  0.856  0.851
The big bang theory     0.894  0.924  0.916  0.92
The vampire diaries     0.784  0.726  0.898  0.803
The X factor            0.876  0.822  0.731  0.774
Wheel of fortune        0.938  0.929  0.954  0.941
Average                 0.877  0.850  0.871  0.858

Table 6.6: Classification results when using training data generated from the same external sources, training examples are from all five shows.

TV show                 Acc.   P1     R1     F1
How I met your mother   0.874  0.820  0.833  0.826
The big bang theory     0.886  0.918  0.910  0.914
The vampire diaries     0.746  0.748  0.727  0.737
The X factor            0.508  0.356  0.862  0.504
Wheel of fortune        0.834  0.797  0.916  0.852
Average                 0.770  0.728  0.850  0.767

R1 the recall and F1 the F-measure. P1, R1 and F1 are calculated for the related class. These metrics are defined as follows, where tp denotes true positive, tn true


Table 6.7: Class distribution of annotated data after classification by the baseline (left) and C4.5 (right) classifiers. The baseline classifier is the naive classifier cbaseline(tweet) = related.

TV show                 tp   fp   acc. (new)   tpC  fpC  acc.C (new)
How I met your mother   180  320  36.0%        150  33   82.0%
The big bang theory     334  166  66.8%        304  27   91.8%
The vampire diaries     245  255  49.0%        178  60   74.8%
The X factor            145  355  29.0%        125  226  48.3%
Wheel of fortune        261  239  52.2%        239  61   79.7%

negative, fp false positive and fn false negative:

Acc. = (tp + tn) / (tp + tn + fp + fn)
P1 = tp / (tp + fp)
R1 = tp / (tp + fn)
F1 = 2 · P1 · R1 / (P1 + R1)
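In code these measures are direct one-liners; this is a sketch with an invented class name.

```java
public class Metrics {
    static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }
    static double precision(int tp, int fp) { return (double) tp / (tp + fp); }
    static double recall(int tp, int fn)    { return (double) tp / (tp + fn); }
    static double f1(double p, double r)    { return 2 * p * r / (p + r); }
}
```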

A feasible system, however, cannot rely on manually labeled data, and table 6.6 shows the results when we build one model using assumed labels. The training data is made up of up to 10,000 tweets containing the title for each show that were randomly sampled from a database of collected tweets. These tweets are used both as related and unrelated training examples depending on which of the 5 sets of external sources they were compared against. The test set is composed of the annotated data.

Table 6.7 shows the class distribution of the labeled sample of 500 tweets that do not contain the title for each show. The table also shows classification results for this sample, indicated by the subscript C. Our classifier is compared to a baseline classifier that assumes all tweets are relevant. In a live system one uses all tweets that are determined to be relevant by the classifier and these correspond to the two categories tpC and fpC.

6.2.3 System results

After classification we can estimate the performance of the complete system if we assume that the ratio of related to unrelated tweets is the same for all new tweets that are collected and that the classifier performance is also the same. The maximum likelihood estimates of precision and of the increase in the number of tweets are calculated with:

ˆTP = |t_title| + TP_rate · P_rate · |t_extra|
ˆFP = FP_rate · N_rate · |t_extra|

∆tweets = (ˆTP / |t_title|) − 1
prec = ˆTP / (ˆTP + ˆFP)
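A sketch of the estimate, with invented names. With the baseline classifier (TP_rate = FP_rate = 1) and the rates for How I met your mother from table 6.7 (P_rate = 180/500, N_rate = 320/500) it reproduces the 63.2% and 59.2% figures of table 6.8.

```java
public class SystemEstimate {
    // Returns {deltaTweets, precision}. Rates are fractions in [0,1];
    // nTitle = |t_title|, nExtra = |t_extra|.
    static double[] estimate(long nTitle, long nExtra,
                             double tpRate, double fpRate,
                             double pRate, double nRate) {
        double tpHat = nTitle + tpRate * pRate * nExtra;
        double fpHat = fpRate * nRate * nExtra;
        double deltaTweets = tpHat / nTitle - 1;   // relative increase in tweets
        double prec = tpHat / (tpHat + fpHat);     // estimated precision
        return new double[]{deltaTweets, prec};
    }
}
```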


Table 6.8: System performance using automatic query expansion, before and after classification. The subscript C denotes results after classification.

TV show                 ∆tweets  prec   ∆tweetsC  precC
How I met your mother   63.2%    59.2%  52.6%     88.5%
The big bang theory     25.5%    90.8%  23.2%     97.7%
The vampire diaries     88.1%    67.2%  64.1%     83.2%
The X factor            117.5%   43.1%  101.2%    48.3%
Wheel of fortune        38.0%    79.9%  34.8%     91.4%
Average                 66.5%    68.0%  55.2%     82.0%

Here t_title denotes the set of tweets containing the title and t_extra the set of additional tweets that are retrieved using AQE. The rates P_rate and N_rate are the estimated rates of positive and negative tweets in t_extra according to the assigned labels. From classification of the labeled data we estimate the classifier performance for all the retrieved tweets with the true positive rate TP_rate and the false positive rate FP_rate. The results can be seen in table 6.8, where ∆tweetsC and precC are the collection results after classification. Note that for the show Wheel of fortune the increase in tweets is actually greater and the total precision lower, since not all the tweets containing the title are related, see table 6.4.


Chapter 7

Analysis

In this chapter I analyze the results of the experiments described in chapter 6 with regard to system performance, automatic query expansion performance and classifier performance. There are some weak points in the proposed method and these are discussed.

7.1 System results

In chapter 6 the results of AQE and classification are listed. For four out of the five shows tested the improvement in recall, that is the increase in the number of related tweets, is reasonable at around 30% and the system precision is agreeable. With around 80% accuracy for the new tweets we see that the method is feasible and that it is worthwhile to try and improve the results further.

For one show, The X factor, the results are not satisfactory, with an accuracy of classification for the new tweets that do not contain the title of 48.3%. This high false positive rate comes from the fact that many users talk about the celebrities in the tweets that we base AQE on, so that we get query drift. Regrettably the external sources also feature a lot of celebrity names, causing the spurious tweets to get a good distance value from them.

Query drift is the primary risk with query expansion, as noted in [12]. This is the reason for using external sources to try and filter out the new spurious results.

An example of the problem: the system associates the mentions @TheXfactorUSA and @ddlovato with the show The X factor and tweets containing them are likely to be classified as related. When resolving these user ids using twitter we get a description that includes the sub-string “The X Factor”. What is needed is for the system to understand that the mention @TheXfactorUSA is not about a person but the twitter account associated with the show, whilst @ddlovato refers (mostly) to the host of this show as the user's idol.

The most effective operational characteristic of the system is the understanding of twitter language use with the help of heuristic methods. Splitting hashtags into their constituent words, looking up web content, resolving user tags used as a


Table 7.1: 95% confidence interval for accuracy with training data generated from the same external sources, training examples are from all five shows.

TV show                 Acc.   lower  upper
How I met your mother   0.874  0.842  0.902
The big bang theory     0.886  0.855  0.913
The vampire diaries     0.746  0.705  0.784
The X factor            0.508  0.463  0.553
Wheel of fortune        0.834  0.798  0.866
Average                 0.770

substitute for the title and assuming that some abbreviations stand for the show's name allows classification to be accurate. The tweets where this is applicable also correspond to the majority of related additional tweets. A second, much smaller, group of related tweets is not easy to classify correctly; they often refer to events in the shows or voice opinions about how characters or TV personalities behave in the TV program.

7.2 Generalizing the results

In table 6.7 I have assumed the following:

• The ratio of related tweets to unrelated tweets is the maximum likelihood estimate from the annotated sample of 500 tweets for each show.

• The classification performance is the same for all tweets for a specific show. That is, I generalize the true-positive and false-positive rates and apply them to the whole population.

Let's investigate these assumptions and their implications.

We can start by looking at the accuracy and assume that classification represents

a series of Bernoulli trials where success is assigning the correct label to a datum and failure is assigning the wrong label. The confidence interval is calculated using the Clopper-Pearson method and the results are listed in table 7.1. One can observe that the confidence intervals are rather tight, suggesting that a sample of 500 tweets is enough to gauge the accuracy of the classifier.
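As a sketch, the normal (Wald) approximation below gives nearly the same interval as the exact Clopper-Pearson computation at n = 500; it is an illustration, not the thesis's computation, which uses the exact method (requiring a beta quantile function).

```java
public class BinomialCI {
    // 95% Wald interval for a binomial proportion: p ± 1.96·sqrt(p(1-p)/n).
    static double[] wald95(int successes, int n) {
        double p = (double) successes / n;
        double half = 1.96 * Math.sqrt(p * (1 - p) / n);
        return new double[]{Math.max(0, p - half), Math.min(1, p + half)};
    }
}
```

For The X factor (accuracy 0.508, i.e. 254 of 500 correct) this gives roughly [0.464, 0.552], close to the exact [0.463, 0.553] in table 7.1.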

The big issue is that the accuracy differs depending on the show; this tells us that the classifier model cannot represent what an annotator thinks is not related, especially in the case of The X factor. This is because we compare to external sources before classification, so what we are really classifying is whether the data is similar to the external sources in the same way as the training data, and for one show this does not give results similar to the annotator's.


7.3 Evaluation measures

The goal of the system is to get as many tweets as possible about a certain product with a low false positive rate. It would be natural to measure recall to see how well this goal was achieved. However, this is not possible unless we have a representative test set where all the data has been annotated with labels. Since one cannot know the total number of related tweets in any feasible way, I have chosen to estimate the increase in recall: the fraction of new related tweets.

7.4 New search terms

The search term pairs generally match mostly related tweets, but often quite few of them. Many of the term pairs account for a very small fraction of the collected tweets. An exception is when the search term pair is the name of an actor or television personality; then we usually get a large number of tweets about this person. Here we see a clear example of query drift.

Worth mentioning is that the reason for using term pairs is twofold. Firstly, one needs to reduce the burden on the classifier by narrowing down the initial set of tweets to classify and the latent classes in this data set. Secondly, the twitter API does not allow an unlimited number of tweets to be retrieved for one query, at least without commercial access, so it is desirable to perform this pruning before actually retrieving tweets.

I employ a heuristic to select hashtags and mentions and this is in general a good idea. Many of the generated hashtags result in new tweets that are related but some do not. An example is the hashtag #suitup, which originally comes from a catch phrase of the show How I met your mother but has since become very popular to use on many different occasions.

The mention heuristic is perhaps the most dangerous, where the exact same problem of query drift exhibits itself towards celebrities and also towards avid twitter users that often discuss the topic. It is a very blunt weapon to try and get some highly desirable mentions such as “@thexfactorusa”.

For many of the tested television shows mostly names of people are returned as term pairs. This is especially the case for the show The X factor. Some examples are shown in table 7.2. This is most likely due to the fact that the two word combination given-name, surname is a very common writing pattern, casting some doubt on the type of search terms that is obtainable with the co-occurrence heuristic described in section 4.2.1. However, many of these names are names of characters in shows, and in this case almost all of the tweets one gets are related to the show.

7.5 Classifier performance

In [14] a random forest classifier is used that is reported to generalize well to unseen shows. This classifier is trained on manually labeled data about 7 of 8 shows and


Table 7.2: First 13 term pairs for AQE using top 40 terms to form pairs and virtual documents of size 5. Also visible is a bug where I do not remove hashtags from consideration when forming pairs.

The X factor         The big bang theory
yousxo inspiration   the caley cuoco
amaro melanie        rothman disintegration
xfactor #xfactor     ornithophobia diffusion
demi lovato          #bazinga bazinga
canty marcus         recombination hypothesis
amaro krajcik        initiation beta
amaro rene           hawking excitation
melanie rene         initiation siri
krajcik canty        sheldon #sheldon
britney spears       kitty pur
melanie krajcik      kitty purr
astro crow           hawking stephen
amaro cowell         siri beta

tested on the last. I tested this approach but it did not give good results. I believe that, in contrast to their approach, which centers very much around language patterns including the title, I have attempted a very different problem with new tweets that in many cases do not contain the same words at all.

Even though the classification scheme is similar in that it compares to external sources, in my approach it is necessary to have training data that is about the same show. We can however benefit from using a single classifier that is trained on the Cartesian product of relevant sets Rj and titles, see figure 7.1. This could appear to be a major limitation; however, since we never rely on manually labeled data it is only a question of increased training time.

Why did I choose the C4.5 algorithm? This question has two answers: scalability and empirical performance. Out of all the classifiers attempted (linear SVM, SVM with Gaussian kernel, neural network, C4.5, Random Tree, Random Forest, Naive Bayes, Nearest neighbor), only rotation forest gives comparable results in terms of F1 measure. But rotation forest is much more memory and computationally intensive, since it creates many decision trees and performs principal component analysis [36], so my machine ran out of memory unless I cut the training set down dramatically. Other classifier models such as neural networks are very expensive to train, especially considering optimizing the structure, and also give poor results, so these have not been tested to any large extent.

Not only the training data and the classifier algorithm are important. Perhaps most important are the features used to describe tweets. I have used distances to external documents computed from BOW representations. This ignores the ordering of terms and assigns a lot of importance to high frequency terms


Figure 7.1: Ways to generate training data from auxiliary data. Here we have two data sets, A and B, that correspond to searching for the titles A and B, respectively. Either we can have a classifier for each title (the left case) or we can have just one classifier that is trained on the Cartesian product of data sets and titles (the right case). Regrettably, tests show that we cannot use the right case unless we include training data of the type Rtitle,title for all shows.

in external documents. With this approach, certain patterns or language structures that cannot be described by this simple representation cannot be expressed. Consider an exact quote from the script of a television show: a tweet containing it is almost certainly related to the television show, but has very little chance of being classified correctly using the present approach. This is one of the reasons why we do not see recall values higher than what is observed for classification.


Chapter 8

Conclusions and future work

The proposed method in chapter 4 and its prototype implementation described in chapter 5 give an indication that the system indeed partially fulfills the research goal of increasing tweet collection about products, in this case: tweets about television programs.

Performing the combined approach of query expansion and classification for tweet collection is a new way of improving market research, and as such much additional testing is required before the true potential can be gauged accurately. The initial tests done using the prototype should be seen as a proof of concept more than anything else and indicate potential for increased effectiveness in tweet collection.

8.1 Applicability

One intended use case is market research, but the methods used can be employed as long as it is possible to collect a corpus of auxiliary data for an original set of queries. Further testing is required to see the performance of product tracking for areas such as reputation management of companies.

Improving the sampling for market research is clearly a desirable goal. The implemented system allows an analyst to see not only what users that use the same keywords as the analyst think about a product, but also what users that use other, in particular twitter specific, keywords think about a product. The error rate is slightly too high for really accurate aggregate analysis, but removing bias should be a desirable goal even if the precision of the survey is lowered slightly.

It is definitely possible to use the system to access individual tweets instead of just aggregate statistics such as the number of tweets during specific times. With this system users can access a broad sample of tweets about their favorite topic that they might not have known about before. Most likely further ranking methods could be applied, perhaps based on the social graph on twitter, to show a user a previously unseen selection of tweets and users that are also about their favorite topic.


8.2 Scalability

One big issue with the system is the need to collect large numbers of tweets about specific products. But in the intended use case of market research this is already being done and my method has the potential to improve the results. In terms of implementation a major rework is required that collects all data under the same roof with a common access point. The current implementation using many files is merely a prototype and was limited by strict time constraints.

The real problem is the need to recompute or update indexes. Even if this can be done rather quickly even for large amounts of data, it means added latency for classification of tweets. There are database management systems that support real time indexing, relational databases for example, that can be used to avoid this problem. But preferably one of the parallel No-SQL databases available, such as MongoDB^1, should be used to support large numbers of inserts.

8.3 Future work

The proposed method has some issues that need improvement, but there is also untapped potential for using the method in settings other than collecting TV related tweets that can be evaluated.

8.3.1 Other types of products and topics

Only five different television shows were tested because of the expensive process of labeling data for evaluation. Because of this limited possibility to evaluate the method I decided that it was best to stick to one type of product instead of evaluating other types. Future work should evaluate the applicability of the proposed method for other products and topics. Perhaps company related tweets are the most interesting, as these can be used for reputation management.

8.3.2 Parameter tuning

Due to the fact that annotation of results was done after collection, it is not possible to optimize parameters using e.g. a grid search where we find the best results on a training corpus. However, the tests done actually test the real world usefulness of the system.

A more thorough theoretical analysis of the system could give us models for e.g. how many search terms to use, or a threshold for the weights (χ2 scores) of hashtags and mentions that we choose to include as search terms. Providing a theoretical model of tweets and the occurrence of different terms could be of great interest.

There are two routes for parameter tuning by local search, where we compare system results, e.g. F-measure, to find the parameters that give the best results. If one has

1. http://www.mongodb.org/


sufficiently good theoretical models, then it is possible to simulate tweet generation, perhaps supported by real data. A more realistic, but expensive, approach is to create a product tracking corpus by collecting a sample of all tweets generated in which all interesting tweets are labeled.

8.3.3 Temporal aspects

In chapter 3 I mentioned some work in twitter information retrieval that uses temporal profiles to weigh terms. This has a strong intuitive appeal; it is safe to assume that language use changes over time and is temporary. In twitter this manifests itself in changing hashtags and other buzz-words. However, it is unclear how one should treat temporality in AQE; further study into how language use changes over time is needed to use this information for generating new search terms.

When it comes to classification it is certainly interesting to look into time weighting of tweets, e.g. that tweets closer to the broadcasting time of a show are more likely to be related. This is a very dangerous approach, however, since it is very close to circular reasoning:

Tweets about television that are authored close to the broadcast time of a show indicate ratings for the show
↔
Tweets that are authored close to the broadcast time of a show indicate that they are about the television show.

8.3.4 Understanding names

The role of names in language is that of identifying entities, not just people but products or even abstract concepts. In my research names represent the most common proper nouns used to access data, but understanding and exploiting the presence of named entities is crucial for effectiveness.

A lot of the problems of the proposed approach have to do with how to deal with names of persons associated with television shows, either fictional characters, actors, hosts or just twitter authorities on the television show. Here I suggest future research into the field of entity disambiguation: e.g. if we can associate a name with a Wikipedia page we could possibly conclude what relationship this name has to the original query and act accordingly.

8.3.5 Improved classification

Regarding hard to classify tweets, such as tweets quoting a television show or describing a scene of a television show: using a BOW based approach for features is not enough to express this type of information. However, matching word two-grams or three-grams from tweets against a full transcript of the show would most likely capture the hard to classify tweets about events that happened in the TV-show, but this requires access to more accurate external data as well as radically more


memory and computing resources. The bag of words assumption made when we compare tf · idf scores is not enough to handle these types of tweets.
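The suggested n-gram matching could be sketched as follows; the names are illustrative and the tokenization is deliberately naive. A tweet is flagged as related if it shares at least one word n-gram with the show's transcript.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NgramMatch {

    // All word n-grams of a text, joined by single spaces.
    static Set<String> ngrams(String text, int n) {
        String[] tok = text.toLowerCase().split("\\W+");
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= tok.length; i++)
            out.add(String.join(" ", Arrays.copyOfRange(tok, i, i + n)));
        return out;
    }

    // True if the tweet and the transcript share any word n-gram.
    static boolean sharesNgram(String tweet, String transcript, int n) {
        Set<String> t = ngrams(tweet, n);
        t.retainAll(ngrams(transcript, n));
        return !t.isEmpty();
    }
}
```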

The main future work in improving tweet classification lies in finding computationally efficient ways to associate tweets with concepts. Concretely what these concepts are is another question. I have used external sources such as Wikipedia, but also a collection of tweets that we have defined to be about the concept, the auxiliary data itself. The focus of this thesis has been on methods that are computationally feasible, and even though there are many techniques from computational linguistics, for instance those that consider sentence structure, e.g. non-deterministic grammars, these are often computationally expensive.

8.3.6 Ontology

In general, the external sources used leave much to be desired. Access to more specific descriptions of the products we wish to track would be desirable. One could also imagine using links between external sources to provide richer context. In related works in web search this approach is often taken, such as assigning specific importance to anchor texts. One could also build ontologies with statistical techniques such as latent semantic indexing.

I briefly looked into link oriented data mining of Wikipedia during the initial, literature review, stage of this thesis, but found the technologies involved to be immature for practical deployment. Another ontology based approach that is technically mature but lacks proven efficiency is the use of man-made ontologies such as WordNet or RDF descriptors.

Finding exactly what kind of link structure is usable for either improving AQE or classification is a daunting task but philosophically very interesting. Tweets are very short and thus any type of correctly inferred link between a tweet and a known concept has the potential to be very helpful.

8.3.7 Improved scalability and performance profiling

The underlying system, Terrier 3.5, works well for a research application and as a proof of concept, but for a commercial application it is very likely that other systems would work better, especially with regard to real-time indexing and horizontal scalability. As a first step, a thorough profile of the application, obtained by running a suitable evaluation suite, is required. This should reveal key characteristics such as which operations need to be fast.




Appendix A

Hashtag splitting

As I realized that Twitter users use hashtags in many different ways, I decided to process one common type of them, long compound hashtags, and convert them into machine-understandable text. Examples of compound hashtags are #bestshowever and #Ilovesports. I hypothesize that hashtags are often used not for topic marking but for emphasis, and in this case compound hashtags are very common. Since Twitter does not allow users to format their text, e.g. as bold or italic, users have come up with this way to express themselves.

The gist of this problem is to convert a concatenated string into its constituent parts, where each part is found in a dictionary. However, where to split is non-trivial. Often several splits can form a valid solution; consider “mynameis”. This string can be split as “my name is” or as “myna me is”1, which contains only words from a common dictionary.

This problem requires more than a greedy algorithm or brute-force search, since even moderately short strings such as “mynameis” will create an exponential number of alternatives to compare with the original string unless we are careful.

My approach is as follows: start by generating all the in-dictionary words that start with a letter at a certain position in the original word and are a substring of the original string, as seen here:

Pos. | Possible words
-----+---------------
m    | my, myna
y    | -
n    | na, name
a    | am
m    | me
e    | -
i    | i, is
s    | -

1 Mynas are a family of birds originating from southern and eastern Asia.



After generating possible substrings, do a search over all combinations to find which ones form a valid split of the string.

I do a depth-first search, starting from the left of the string and applying two constraints:

1. The solution's length must be the same as the original string's, with bounds consistency.

2. Any chosen new word must, together with the previously chosen words, make up the first letters of the original string.

The second constraint subsumes the first but is more expensive to check; therefore I apply the first constraint first.

To choose only one of the solutions found, I pick the one with the largest product of term frequencies of the constituent words. The term frequencies come from a Project Gutenberg corpus2. Regrettably, this means that the input data is slightly dated.

2 Project Gutenberg frequency lists are available at http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists
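The procedure above can be sketched in Java as follows. The tiny dictionary and its frequency values are assumptions for illustration (the real system uses Project Gutenberg frequency lists), and constraint 2 is enforced directly by only extending a partial solution with dictionary words that match the original string at the current position, which makes the length constraint hold automatically:

```java
import java.util.*;

// Sketch of the hashtag-splitting procedure: enumerate all full splits
// into dictionary words via depth-first search, then keep the split with
// the largest product of term frequencies.
public class HashtagSplitter {
    private final Map<String, Double> freq; // word -> relative term frequency

    public HashtagSplitter(Map<String, Double> freq) { this.freq = freq; }

    // Return the most frequent split, or null if no full split exists.
    public List<String> split(String s) {
        List<List<String>> solutions = new ArrayList<>();
        search(s, 0, new ArrayList<>(), solutions);
        List<String> best = null;
        double bestScore = -1;
        for (List<String> sol : solutions) {
            double score = 1.0;                       // product of frequencies
            for (String w : sol) score *= freq.get(w);
            if (score > bestScore) { bestScore = score; best = sol; }
        }
        return best;
    }

    // Depth-first search from position pos: a word may only be chosen if it
    // occurs in the original string exactly at pos (constraint 2 above).
    private void search(String s, int pos, List<String> chosen,
                        List<List<String>> out) {
        if (pos == s.length()) { out.add(new ArrayList<>(chosen)); return; }
        for (String w : freq.keySet()) {
            if (s.startsWith(w, pos)) {
                chosen.add(w);
                search(s, pos + w.length(), chosen, out);
                chosen.remove(chosen.size() - 1);     // backtrack
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Double> freq = new HashMap<>();   // toy dictionary
        freq.put("my", 1e-3);  freq.put("myna", 1e-7);
        freq.put("name", 1e-4); freq.put("na", 1e-6);
        freq.put("me", 1e-3);  freq.put("is", 1e-2);
        freq.put("i", 1e-2);   freq.put("am", 1e-3);
        HashtagSplitter sp = new HashtagSplitter(freq);
        System.out.println(String.join(" ", sp.split("mynameis")));
        // prints "my name is": the rare word "myna" makes "myna me is" score lower
    }
}
```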



Appendix B

SVM performance

Table B.1: Results of classification of annotated test data with linear support vector machines. Text data is treated as sparse vectors.

TV show               | Accuracy
----------------------+----------------
How I Met Your Mother | 57%   (285/500)
The Big Bang Theory   | 30.6% (153/500)
The Vampire Diaries   | 53.6% (268/500)
The X factor          | 69.8% (349/500)
Wheel of Fortune      | 74%   (370/500)
