
CS 572: Information Retrieval

Web 2.0+: Social Media/CGC

Status

• Final project: updated proposal due by today/tomorrow

• Questions? Discuss at end of class.

3/28/2016 CS 572: Information Retrieval. Spring 2016 2

Eugene Agichtein, Emory University, IR Lab

• Collaboratively Generated Content: CGC

– Creation models, quality

• Twitter/Weibo

• Searching:

– Indexing

– Ranking


Estimates of daily content creation [Ramakrishnan & Tomkins, IEEE Computer 2007]

Content type – amount produced per day:

• Published (books, magazines, newspapers): 3-4 GB
• Professional Web (paid creation, e.g., corporate Web site): 2 GB
• User-generated (reviews, blogs, personal Web sites): 8-10 GB
• Private text (instant messages, email): 3,000 GB (= 3 TB)

The value of CGC

• Extrinsic value – impact on AI, IR and NLP applications
• Intrinsic value – resources for the people
  – How to make the data more useful (e.g., increase user engagement)
• Synergistic view of CGC
  – AI techniques can be used to improve the quality of CGC repositories,
  – which can in turn improve the performance of other AI applications


Extrinsic value: sources of knowledge

Collaboratively Generated Content ("after"):
• 3,600,000 articles since 2001
• 2,500,000 entries since 2002
• 1,000,000,000 Q&A since 2005
• 4,000,000,000 images since 2004

Traditional resources ("before"):
• 65,000 articles since 1768
• WordNet: 150,000 entries since 1985
• CYC: 4,600,000 assertions since 1984

Extrinsic value: what we can do with all this knowledge

• Under the hood
  – CGC as an additional (huge!) corpus
    • Better term statistics; observe the document authoring process
  – Repositories of knowledge
    • Concepts as features (Bag of Words + Concepts)
    • Concepts as word senses (way beyond WordNet)
  – More comprehensive lexicons, gazetteers
  – Extending existing knowledge repositories (e.g., WordNet)
• Higher-level tasks
  – (Concept-based) information retrieval
  – Question answering

(Text) Social Media Today

Published: 4 GB/day. Social media: 10 GB/day. Yahoo! Answers: 120M users, 40M questions, 1B answers.

"Yes, we could read your blog. Or, you could tell us about your day."

Models for CGC generation: Overview

• Wikipedia: contribution, curation, and coordination

– http://en.wikipedia.org/wiki/Wikipedia:Modelling_Wikipedia%27s_growth

• Community forums: Yahoo! Answers, focused forums

• Sharing: Twitter

Models of CGC Generation: Link Creation – "Rich-Get-Richer" (for a while ...)

• Preferential attachment: more popular Wikipedia pages are more likely to acquire links (up to a point!) [Capocci et al. '06]
• Flickr and other sites also show strong evidence of preferential attachment [Mislove et al. '08]

(Figure: new links added daily vs. page popularity.)

Models of CGC Generation: Edits – "Rich-Get-Richer" (again, up to a point)

• Popular Wiki pages (> 1,000 edits): editing activity initially increases, then drops off
• Later on, arriving visitors are less able to contribute novel information [Das and Magdon-Ismail '10]

Models of CGC generation: Slowing Growth [Suh et al., WikiSym 2009]

• Article growth (# per month) extrapolated to slow down dramatically by 2013
• Increased coordination and overhead, resistance to new edits

Models for CGC generation: Wikipedia conflict and coordination [Kittur & Kraut, 2008]

• Explicit coordination: discussion page
• Implicit coordination: leading the effort
  – Example: "Music of Italy" page – the bulk of edits are made by TUF-KAT and Jeffmatt
• Page quality increases with the number of editors iff effort is concentrated; otherwise, more editors means higher overhead and lower quality

Models of CGC Generation: Content Visibility and Evolution

• Dynamics of votes of select Digg stories over a period of 4 days [Lerman '07]
• Dynamics of Y! Answers posts over a period of 4 days [Aji et al. '10]

Models for CGC Generation: Forums – info seekers vs. info providers

• In Yahoo! Answers, question and answer quality are strongly related: bad questions attract bad or mediocre answers [Agichtein et al. '08]
• Askers (Java forums) prefer help from users with just slightly more expertise [Zhang et al. '07]

Lots of content... Do we trust it?


On the Internet, nobody knows you are a dog.

CGC Quality Filtering (Overview)

• Multiple dimensions of CGC content to exploit:

– Content persistence (Wikipedia)

– Surface quality (language-wise): all CGC

– Metadata: Author reliability, geo-spatial edit patterns

– Citation/link-based estimation of authority

• Most effective systems use a combination of all of the above


Content-Driven Author Reputation [Adler & de Alfaro, WWW 2007]


Automatic Vandalism Detection

• Hoaxes

• Libelous content

• Malware distribution

• Errors picked up by mainstream media!

• Wikipedia “truthiness”

Vandalized Ben Franklin Wikipedia page


Assigning Trust to Wikipedia Text

• Estimating text trust from edit history information.

• Trust is computed as:

1) Insertion: trust value for newly inserted text (E)

2) Edge effect: the text at the edge of blocks has the same trust as newly inserted text.

3) Revision: text trust may increase, if author reputation gets higher

[Adler et al., WikiSym 2008]. Online demo + code available: http://wikitrust.soe.ucsc.edu/

(Figure: text trust coloring, before and after an edit.)

NLP-based Vandalism Detection

• Use language modeling to predict edit quality

– Lexical: vocabulary, regular expression/lists

– Syntactic: N-gram analysis using part-of-speech tags

– Semantic: N-gram analysis with hand-picked words

[Wang et al., COLING 2010]

Figure adapted from Andrew G. West, 2010
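The lexical feature class above can be sketched with a few regular-expression cues. This is a minimal illustration in the spirit of the slide, not the feature set of Wang et al.; the word list and patterns are invented for the example.

```python
import re

# Toy lexical vandalism cues computed over the text inserted by an edit.
# The word list and regexes are illustrative assumptions, not from the paper.
PROFANITY = {"stupid", "dumb", "lol"}       # stand-in offensive-word list
SHOUT_RE = re.compile(r"\b[A-Z]{4,}\b")     # long all-caps tokens
REPEAT_RE = re.compile(r"(.)\1{3,}")        # character runs like "loooool"

def lexical_features(inserted_text: str) -> dict:
    tokens = re.findall(r"\w+", inserted_text.lower())
    n = max(len(tokens), 1)
    return {
        "profanity_ratio": sum(t in PROFANITY for t in tokens) / n,
        "shouting": len(SHOUT_RE.findall(inserted_text)),
        "char_repeats": len(REPEAT_RE.findall(inserted_text)),
    }

print(lexical_features("This is STUPID loooool"))
```

A real detector would feed such features, together with syntactic and semantic n-gram features, into a classifier.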


Comparison of Vandalism Detection Approaches: Shared Tasks [Potthast et al. '10]

• Shared vandalism corpus, 32,000 edits
• Best single approach: NLP-based
• Meta-detector performs best – the task requires different classes of features
• CLEF 2010 Competition on Vandalism Detection:
  – Best result: AUC 0.92; Precision = 100% at Recall = 20%

Finding High Quality Content in SM

• Well-written
• Interesting
• Relevant (answer)
• Factually correct
• Popular?
• Provocative?
• Useful?

As judged by professional editors.

E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne, Finding High Quality Content in Social Media, WSDM 2008

How do Question and Answer Quality relate?

Community

Link Analysis for Authority Estimation

(Figure: bipartite graph of questions, answers, and users; edges connect askers to the users who answered their questions.)

HITS on the asker-answerer graph:

H(i) = Σ_{j=0..K} A(j)   (hub score of asker i, summed over the answerers j it links to)
A(j) = Σ_{i=0..M} H(i)   (authority score of answerer j, summed over the askers i linking to it)

Hub (asker), Authority (answerer)

[Jurczyk & Agichtein, CIKM 2007]
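The HITS computation on the asker-answerer graph can be sketched as follows; the toy edge list and iteration count are illustrative, not data from the paper.

```python
# Sketch of HITS over an asker->answerer graph: an edge (u, v) means
# user u asked a question that user v answered. Hubs correspond to
# askers, authorities to answerers. Edges and iterations are illustrative.
def hits(edges, iterations=50):
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # authority: total hub score of askers pointing at the node
        auth = {n: sum(hub[u] for u, v in edges if v == n) for n in nodes}
        # hub: total authority score of answerers the node points at
        hub = {n: sum(auth[v] for u, v in edges if u == n) for n in nodes}
        for scores in (auth, hub):              # L2-normalize each vector
            norm = sum(x * x for x in scores.values()) ** 0.5 or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

edges = [("u1", "u4"), ("u2", "u4"), ("u3", "u4"), ("u3", "u5")]
hub, auth = hits(edges)
# u4 answers the most questions (top authority); u3 asks questions
# answered by both answerers (top hub)
```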

Qualitative Observations

HITS effective

HITS ineffective


Random forest classifier

Result 1: Identifying High Quality Questions


Top Features for Question Classification

• Asker popularity (“stars”)

• Punctuation density

• Topical category

• Page views

• KL Divergence from reference corpus LM
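The KL-divergence feature listed above can be sketched by building smoothed unigram language models for a question and for a reference corpus, then computing the divergence between them. The toy texts and add-one smoothing are illustrative assumptions.

```python
import math
from collections import Counter

# Sketch of the question-quality feature: KL divergence of a question's
# unigram LM from a reference-corpus LM, with add-one smoothing.
def unigram_lm(text, vocab):
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def kl_divergence(p, q):
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

reference = "how do i install python on windows how do i update it"
question = "how do i install python"
vocab = set(reference.lower().split()) | set(question.lower().split())
p = unigram_lm(question, vocab)
q = unigram_lm(reference, vocab)
print(round(kl_divergence(p, q), 4))
```

A low divergence suggests the question is phrased like typical text in the reference corpus.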


Identifying High Quality Answers


Top Features for Answer Classification

• Answer length
• Community ratings
• Answerer reputation
• Word overlap
• Kincaid readability score

CGC as a corpus

• Better statistics on a larger or cleaner text collection

– Compute IDF on the entire Wikipedia corpus

• Beware of different style / topic distribution

– Cf. Web as a corpus (Mihalcea’04, Gonzalo et al.’ 03)

• Web corpus statistics, such as counts of search results

– Cf. using very large corpora in NLP (Banko and Brill ‘01a, ‘01b)
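Computing IDF over a CGC collection is straightforward; a minimal sketch on a toy stand-in for a Wikipedia dump (the documents are illustrative):

```python
import math

# Sketch of "CGC as a corpus": compute IDF from a (toy) collection
# standing in for the entire Wikipedia corpus.
def idf(corpus):
    n = len(corpus)
    df = {}
    for doc in corpus:
        for term in set(doc.lower().split()):   # document frequency
            df[term] = df.get(term, 0) + 1
    return {t: math.log(n / d) for t, d in df.items()}

corpus = [
    "topology is a branch of mathematics",
    "a leopard is a large cat",
    "a cat is a small animal",
]
weights = idf(corpus)
# "a" appears in every document, so its IDF is 0; rarer terms score higher
```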


CGC as a corpus (cont’d)

• Extend existing knowledge repositories (e.g., WordNet)

• Create new datasets, labeled by construction

– WSD datasets using OpenMind – Chklovski & Mihalcea ‘02, ‘03

– WSD datasets based on Wikipedia – Mihalcea’07; Bunescu & Pasca ‘06; Cucerzan ’07

– Text categorization datasets based on the ODP (Davidov et al. ‘04)

• Study the authoring process by observing multiple revisions of a document


CGC as corpus / Extending WordNet

Using corpus statistics for term weighting

• J. Elsas and S. Dumais. Leveraging temporal dynamics of document content in relevance ranking. WSDM 2010
  – Periodic crawls (starting from an unknown point in a document's life) vs. formally tracked revision history
  – LM-specific vs. general term frequency model, which can be plugged into any retrieval model
  – Aji et al. '10 also model bursts – "significant" events in the document's life
• M. Efron. Linear time series models for term weighting in information retrieval. JASIST 2010
  – Temporal behavior of terms as the entire collection changes over time

CGC as corpus / Document revisions

Extending term weighting models with Revision History Analysis (RHA) [Aji et al. ’10]

• RHA captures term importance from the document authoring process.

• Natural integration with state of the art retrieval models (BM25, LM)

• Consistent improvement over existing retrieval models

• Beyond Wikipedia

– Tracking changes in MS Word documents; version control

– Past snapshots of Web pages

(Elsas & Dumais ’10 constructed a new corpus)


Observable document generation process. Example: revisions of "Topology" in Wikipedia (figure: the 1st revision, the 250th revision, and the current revision).

Revision History Analysis redefines Term Frequency (TF)

• TF is a key indicator of document relevance
• TF can be naturally integrated into ranking models

BM25:
S(Q,D) = Σ_{t∈Q} IDF(t) · TF(t,D)·(k1+1) / ( TF(t,D) + k1·(1 − b + b·|D|/avgdl) )

Language Model (KL-divergence ranking):
S(Q,D) = −D(Q ‖ D) = −Σ_{t∈V} P(t|Q) · log( P(t|Q) / P(t|D) )

RHA global model

Define the term frequency based on the whole document generation process:
– a document grows steadily over time
– a term is relatively important if it appears in the early revisions

TF_global(t,d) = Σ_{j=1..n} c(t,v_j) / j^α

where c(t,v_j) is the frequency of term t in revision v_j and α is the decay factor; the sum iterates over revisions.
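A minimal sketch of the global model, directly following the formula above; the revision texts and the value of α are illustrative.

```python
# RHA global model: term frequency summed over all revisions with a
# j^alpha decay, so early-revision occurrences weigh more.
def tf_global(term, revisions, alpha=1.0):
    return sum(
        rev.lower().split().count(term) / (j ** alpha)
        for j, rev in enumerate(revisions, start=1)
    )

revisions = [
    "topology is a branch of mathematics",               # 1st revision
    "topology is a branch of mathematics about shapes",  # 2nd revision
    "topology studies shapes and spaces",                # current
]
print(tf_global("topology", revisions))
```

A term present since the first revision ("topology") accumulates more weight than one introduced later ("shapes").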

RHA global model: example (figure: per-revision term counts decayed by the factor j^α, giving the decayed term importance).

But some pages are different: "Avatar (2009 film)" has a very bursty editing process (figure: the 1st, 500th, and current revisions).

RHA burst model

• A burst resets the decay clock for a term
• The weight will decrease after a burst

TF_burst(t,d) = Σ_{j=1..m} Σ_{k in burst j} c(t,v_k) / (k − b_j + 1)^β

where c(t,v_k) is the frequency of term t in revision v_k, b_j is the first revision of the j-th burst, and β is the decay factor; the outer sum iterates over bursts, the inner sum over revisions within each burst.

Burst detection:
• Content-based (relative content change)
• Activity-based (intensive edit activity)
• Both combined
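A matching sketch of the burst model; here the burst boundaries are supplied directly rather than detected from content change or edit activity as in the paper, and the revisions and β are illustrative.

```python
# RHA burst model: each burst restarts the decay clock, so terms in the
# first revisions of a burst weigh more.
def tf_burst(term, revisions, burst_starts, beta=1.0):
    bounds = list(burst_starts) + [len(revisions)]
    total = 0.0
    for j in range(len(burst_starts)):             # iterate over bursts
        for k in range(bounds[j], bounds[j + 1]):  # revisions in burst j
            count = revisions[k].lower().split().count(term)
            total += count / ((k - bounds[j] + 1) ** beta)
    return total

revisions = ["avatar film", "avatar film by cameron",
             "avatar sequel announced", "avatar sequel delayed"]
print(tf_burst("avatar", revisions, burst_starts=[0, 2]))
```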

RHA burst model: example (figure: decayed term importance, with the decay clock restarting at each burst).

Putting it all together: RHA Term Frequency

Combining the global model with the burst model (global, burst, and original TF components):

TF_rha(t,D) = λ1·TF_global(t,D) + λ2·TF_burst(t,D) + λ3·TF(t,D),  where λ1 + λ2 + λ3 = 1

Integrating RHA into retrieval models

BM25 + RHA: replace TF(t,D) with TF_rha(t,D):
S(Q,D) = Σ_{t∈Q} IDF(t) · TF_rha(t,D)·(k1+1) / ( TF_rha(t,D) + k1·(1 − b + b·|D|/avgdl) )

Statistical Language Models + RHA: replace P(t|D) with the RHA term probability P_rha(t|D):
S(Q,D) = −Σ_{t∈V} P(t|Q) · log( P(t|Q) / P_rha(t|D) )

Performance improvements: +2-7% (INEX), +1-4% (TREC)
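Plugging a revision-aware TF into BM25 can be sketched as below. The tf values, document-frequency table, and document-length parameters are illustrative, and standard k1/b defaults are assumed.

```python
import math

# BM25 with the standard TF(t,D) replaced by a precomputed revision-aware
# TF (e.g., from the combined global+burst model). All inputs are toy values.
def bm25(query_terms, tf, doc_len, avgdl, n_docs, df, k1=1.2, b=0.75):
    score = 0.0
    for t in query_terms:
        if t not in tf:
            continue
        # Lucene-style smoothed IDF (always positive)
        idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
        denom = tf[t] + k1 * (1 - b + b * doc_len / avgdl)
        score += idf * tf[t] * (k1 + 1) / denom
    return score

# tf_rha would come from the combined global+burst model; here a stub dict
tf_rha = {"topology": 3.2, "shapes": 1.1}
print(bm25(["topology", "shapes"], tf_rha, doc_len=120, avgdl=100,
           n_docs=1000, df={"topology": 12, "shapes": 80}))
```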

Overview of Part 3

• Use CGC as an additional (huge!) corpus
  – Better term statistics
  – Observing the authoring process
• CGC as repositories of world knowledge
  – Distilling knowledge from CGC structure and content
  – Concepts as features (Bag of Words + Concepts)
    • Semantic relatedness, and later IR
  – Concepts as word senses
    • WSD

CGC as repositories of senses and concepts

• Articles / concepts as representation features

– From the bag of words to

the bag of words + concepts

– Semantic relatedness of words (and texts)

– Later: concept-based IR

• Articles / concepts as word senses

– Word sense disambiguation

CGC as knowledge repositories

Semantic relatedness of words and texts

• How related are

cat mouse

Augean stables golden apples of the Hesperides

慕田峪 (Mutianyu) 司马台 (Simatai)

• Used in many applications

– Information retrieval

– Word-sense disambiguation

– Error correction (“Web site” or “Web sight”)

CGC as knowledge / Semantic relatedness

Evaluation criterion: correlation with human judgments

Word pair – human score [0 .. 10]:
• journey / voyage: 9.29
• Jerusalem / Israel: 8.46
• money / property: 7.57
• football / tennis: 6.63
• alcohol / chemistry: 5.54
• coast / hill: 4.38
• professor / cucumber: 0.23
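The evaluation above ranks relatedness measures by correlation with these human scores. A from-scratch Spearman rank correlation (no tie handling, which the toy data does not need; the system scores are invented for the example):

```python
# Spearman rank correlation between system scores and human judgments.
def spearman(xs, ys):
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

human = [9.29, 8.46, 7.57, 6.63, 5.54, 4.38, 0.23]   # scores from the slide
system = [0.91, 0.85, 0.70, 0.72, 0.50, 0.41, 0.05]  # illustrative scores
print(round(spearman(human, system), 3))
```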


Previous approaches

• WordNet-based measures

– Using graph structure (path length, random walks), information content, glosses

– Comprehensive review: Budanitsky & Hirst ‘06

– Patwardhan et al. ‘03 – using semantic relatedness measures for WSD

• Jiang-Conrath (WN structure + info content) performs on par with Adapted Lesk (gloss overlap)

• All other WN-based measures are inferior to Adapted Lesk

• Co-occurrence based measures

– Latent Semantic Analysis (Deerwester et al. ‘90)


The role of background knowledge

When humans judge semantic relatedness, they use massive amounts of background knowledge. Can we endow computers with such a capability for knowledge-based reasoning?

Using Wikipedia for computing semantic relatedness

• Structure (links)

– Strube & Ponzetto ’06 (WikiRelate)

– Milne & Witten ’08 (WLM)

• Content (text)

– Gabrilovich & Markovitch ’07 (ESA)

• Structure + content

– Yeh et al. ’09 (WikiWalk)


Explicit Semantic Analysis (ESA) [Gabrilovich & Markovitch '06-'09]

• Wikipedia-based semantic representation
• Some ESA applications:
  – Computing semantic relatedness of words and texts
  – Text categorization
  – Information retrieval
    • Cross-lingual retrieval utilizing cross-language links in Wikipedia

The AI Dream (diagram: a problem solver maps problems directly to solutions).

CGC concepts as features

The circular dependency problem (diagram: the problem solver requires Natural Language Understanding, which is itself an unsolved problem).

Circumventing the need for deep Natural Language Understanding (diagram: Wikipedia-based semantics, built with statistical text processing, substitutes for full NLU in the problem solver).

Every Wikipedia article represents a concept (e.g., Leopard).

Wikipedia can be viewed as an ontology – a collection of concepts (e.g., Leopard, Isaac Newton, Ocean).

The semantics of a word is a vector of its associations with Wikipedia concepts. (Cf. Bengio's talk tomorrow on learning text and image representations.)

Word representation: 2D example (figure: a word is a point in the multi-dimensional space defined by the set of Wikipedia concepts; the axes are Wikipedia concepts A and B, with coordinates in [0.0, 1.0]).

Text representation: 2D example (figure: the semantics of a text fragment is the centroid of the vectors of its individual words).

"Cat" (figure: concepts associated with the word "cat": Leopard, Mouse, Veterinarian, Pets, Tennis ball, Elephant, House).

Concept and word representation in Explicit Semantic Analysis [Gabrilovich & Markovitch '09]

Each concept (article) is represented by its salient words:
• Cat: Cat [0.95], Felis [0.74], Claws [0.61]
• Panthera: Cat [0.92], Leopard [0.84], Roar [0.77]
• Tom & Jerry: Animation [0.85], Mouse [0.82], Cat [0.67]

The word "cat" is then represented by its weight in each concept: Cat 0.95, Panthera 0.92, Tom & Jerry 0.67.

Computing semantic relatedness as cosine between two vectors

Example concept vectors (from the figure):
• Cat 0.95, Panthera 0.92, Tom & Jerry 0.67
• Leopard 0.84, Mac OS X v10.5 0.61, Cat 0.52

Relatedness is the cosine between the two vectors.
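The cosine step can be sketched over sparse concept vectors; the weights below reuse the figure's illustrative values.

```python
import math

# ESA-style relatedness: each word is a sparse vector of Wikipedia-concept
# weights; relatedness is the cosine of the two vectors.
def cosine(u, v):
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

cat = {"Cat": 0.95, "Panthera": 0.92, "Tom & Jerry": 0.67}
other = {"Leopard": 0.84, "Mac OS X v10.5": 0.61, "Cat": 0.52}
print(round(cosine(cat, other), 3))
```

Only the shared concept ("Cat") contributes to the dot product, so the two vectors are related but far from identical.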

Experimental results (individual words)

Algorithm – correlation with human judgments:
• WordNet-based: 0.33–0.35
• Roget's Thesaurus: 0.55
• LSA: 0.56
• WikiRelate: 0.49
• WLM: 0.69
• ESA (ODP): 0.65
• ESA (Wikipedia): 0.75
• TSA [Radinsky et al. '11]: 0.80
• EWC (ESA/Wikipedia + WordNet + collocations) [Haralambous & Klyuev, forthcoming at IJCNLP '11]: 0.79–0.87

Using Wikipedia for Named Entity Disambiguation [Bunescu & Pasca '06]

• Both the method and the dataset are based on Wikipedia
• Dataset: 1.7M labeled examples
  – ''The [[Vatican City | Vatican]] is now an independent enclave ...''
  – the geographical location, not the Holy See

The basic approach: compute the cosine similarity between the word context and the target articles.

CGC for WSD

Problems with the basic approach

1. Short articles and stubs

2. Synonymy (lexical mismatch between the word context and the article)

Solution: use categories for generalization


Word-Category Correlations

"John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood."

Which John Williams?
• John Williams (composer): Film score composers → Composers → Musicians
• John Williams (wrestler): Professional wrestlers → Wrestlers → People known in connection with sports and hobbies → People by occupation

When a Wikipedia article is too short, get additional context from its categories.

The full solution [Bunescu & Pasca ’06]

• ML-based approach (SVM) to rank candidate senses

• Features:

– Cosine(word context, article text)

– Word-category affinities (binary features indicating if a word has occurred in any article of a given category)

• Dimensionality (|V| x |C|) can be reduced by thresholding word frequencies

• Accuracy improvement vs. cosine-only: 3%-25.5%


Results [Bunescu & Pasca '06]

• The classification model can be trained on Wikipedia, then applied to any input text
  – Senses can only be resolved to those defined in Wikipedia
• Another idea to overcome the lexical mismatch: augment Wikipedia articles with contexts of corresponding word senses
  – Cf. anchor text aggregation [Metzler et al. '09] and [Cucerzan '07]

Taxonomy Kernel

An alternative approach to named entity disambiguation [Cucerzan '07]

• Applicable to any text, not solely that from Wikipedia
  – Sense inventory still limited to Wikipedia
• The system performs both NE identification and disambiguation
• World knowledge used:
  – Known entities (Wikipedia articles)
  – Their class (Person, Location, Organization)
  – Known surface forms (redirection pages + anchor text)
  – Words that describe or co-occur with the entity
  – Categories, lists, enumerations, tables

Restricted document representation: aggregation of features of all identified surface forms, for all entity names (example: entities that can be referred to as "Columbia").

Key idea: joint disambiguation

• Maximize the agreement
  – between the entity vectors and the document, and
  – between the categories of the entities
• Example: {Columbia, Discovery}
  – Space Shuttle Columbia & Space Shuttle Discovery
  – Columbia Pictures & Discovery Investigations
  – LIST_astronomical_topics, CAT_manned_spacecraft, CAT_space_shuttles
• If the "one sense per discourse" assumption is violated – shrink the context iteratively

Results [Cucerzan '07]

• Baseline: most frequent Wikipedia sense
• Wikipedia dataset (labeled by construction): 86.2% → 88.3%
• MSNBC news stories (human labeling): 51.7% → 91.4%
  – Surface forms common in news are highly ambiguous (figure examples: Blackberry, China)
  – Popularity in Wikipedia ≠ popularity in news

Case Study: Searching Tweets


Twitter Statistics (2014)

• 284 million monthly active users

• 500 million tweets are sent per day

• More than 300 billion tweets have been sent since company founding in 2006

• Tweets-per-second record: one-second peak of 143,199 TPS.

• More than 2 billion search queries per day.


Twitter Search Architecture (figures)

Active Index Segments: Write Friendly



Twitter Lucene Extensions


Twitter Lucene Extensions (cont’d)


Lucene Extensions: Storage


In-Memory Real Time Index


Revisit Old Idea: Probabilistic Skip Lists
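The structure behind this slide can be sketched with a minimal generic probabilistic skip list (not Twitter's implementation): each inserted element is promoted to the next level with probability 1/2, giving expected O(log n) search over a sorted linked structure such as a posting list.

```python
import random

# Generic probabilistic skip list sketch (illustrative, not Twitter's code).
class Node:
    def __init__(self, value, level):
        self.value = value
        self.forward = [None] * level  # next node at each level

class SkipList:
    MAX_LEVEL = 8

    def __init__(self, seed=0):
        self.rng = random.Random(seed)          # fixed seed: repeatable
        self.head = Node(None, self.MAX_LEVEL)  # sentinel head node

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and self.rng.random() < 0.5:
            level += 1                          # promote with p = 1/2
        return level

    def insert(self, value):
        # remember the rightmost node visited at each level
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in reversed(range(self.MAX_LEVEL)):
            while node.forward[i] and node.forward[i].value < value:
                node = node.forward[i]
            update[i] = node
        new = Node(value, self._random_level())
        for i in range(len(new.forward)):       # splice in at each level
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def contains(self, value):
        node = self.head
        for i in reversed(range(self.MAX_LEVEL)):
            while node.forward[i] and node.forward[i].value < value:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.value == value

sl = SkipList()
for doc_id in [17, 3, 42, 8, 23]:   # toy posting-list document IDs
    sl.insert(doc_id)
print(sl.contains(23), sl.contains(24))
```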


Top Entities (cont’d)


Resources

• CGC tutorial at WSDM 2012: http://www.mathcs.emory.edu/~eugene/wsdm2012-cgc-tutorial/
• Real-time Search at Twitter: https://www.youtube.com/watch?v=cbksU0qkFm0
• http://www.umiacs.umd.edu/~jimmylin/publications/Busch_etal_ICDE2012.pdf
• http://www.slideshare.net/lucidworks/search-at-twitter-presented-by-michael-busch-twitter


Recommended