Retrieval and Feedback Models for Blog Feed Search


SIGIR 2008 Presentation


SIGIR 2008, Singapore

Jonathan Elsas, Jaime Arguello,

Jamie Callan & Jaime Carbonell

LTI/SCS/CMU



Outline

• The task
  – Overview of Blogs & Blog Search
  – Challenges in Blog Search
• Our approach
  – Retrieval Models
  – Query Expansion Models
• Conclusion

Background

What is a Blog?

What is a Feed?

<?xml …?>
<feed>
  <entry>
    <author>Peter …</author>
    <title>Good, Evil…</title>
    <content>I’ve said…</content>
  </entry>
  <entry>
    <author>Peter …</author>
    <title>Agreeing…</title>
    <content>Some peo…</content>
  </entry>
  …
</feed>

Blog-Feed Correspondence

Blog (HTML) corresponds to Feed (XML)
Post (HTML) corresponds to Entry (XML)

Why are Blogs important?

Technorati is currently tracking:
> 112.8 million blogs
> 175,000 new blogs per day
> 1.6 million posts per day

[http://www.technorati.com/about/]

The Task

Feed Search at TREC (a.k.a. Blog Distillation)

Ranking blogs/feeds (collections of posts) in response to a user’s query, [X].

“A relevant feed should have a principle and recurring interest in X” (TREC 2007 Blog Track)

Example queries:
[Gardening] [Apple iPod]
[Violence in Sudan] [Gun Control]
[Food] [Wine]

These queries:
• Represent ongoing information needs
• Are frequently very general

Challenges in Feed Search

1. A feed is a collection of documents
– How does relevance at the entry level correspond to relevance at the feed level?

[Figure: a feed drawn as a sequence of entries along a time axis]

2. Even a topical feed is topically diverse
– Can we favor entries close to the central topic of the feed?

[Figure: entries of a "Space Exploration" feed plotted by topic against time: NASA, China’s plans for the moon, shuttle launch, Mars rover, Boeing, and "My dog"]

3. Feeds are noisy
– Spam blogs, spam & off-topic comments

4. General & Ongoing Information Needs

[Mac]: … post regularly about new products, features, or application software of Apple Mac computers.
[Music]: … describing songs, biographies of musicians, musical styles and their influences of music on people are discussed.
[Food]: … describing experiences eating cuisines, culinary delights, recipes, nutrition plans.
[Wine]: … such as tastings, reviews, food matching or pairing, and oenophile news and events.

Our Approach

Challenges:
• Feeds: topically diverse, noisy collections
• Information needs: general & ongoing

Our approach:
• Retrieval Models
• Feedback Models

Retrieval Models

• Challenge: ranking topically diverse collections
• Representation: feed vs. entry
• Model topical relationship between entries

Large Document (Feed) Model

[Figure: each feed, with all of its entries, is treated as one document in the Feed Document Collection; the query [Q] is run against this collection to produce Ranked Feeds]

Rank by: Indri’s standard retrieval model [Metzler and Croft, 2004; 2005]

Advantages:

• A straightforward application of existing retrieval techniques

Potential Pitfalls:

• Large entries dominate a feed’s language model

• Ignores relationship among entries

[Figure: a feed made up of entries of differing sizes]
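The large document idea can be sketched in a few lines of Python. This is a simplified stand-in, not Indri itself: it assumes a unigram query-likelihood model with Dirichlet smoothing, and the function names, the `feeds` input shape, and the default `mu` are illustrative assumptions rather than details from the talk.

```python
import math
from collections import Counter

def dirichlet_query_log_likelihood(query_terms, doc_counts, doc_len,
                                   coll_counts, coll_len, mu=2500.0):
    """log P(Q|D) under a Dirichlet-smoothed unigram language model."""
    score = 0.0
    for term in query_terms:
        p_coll = coll_counts.get(term, 0) / coll_len
        p = (doc_counts.get(term, 0) + mu * p_coll) / (doc_len + mu)
        if p == 0.0:  # term unseen in the whole collection
            return float("-inf")
        score += math.log(p)
    return score

def rank_feeds_large_document(query_terms, feeds, mu=2500.0):
    """feeds: dict feed_id -> list of entries, each entry a list of tokens.

    Each feed is flattened into one large document before scoring, so a
    long entry contributes more to the feed's language model than a short one.
    """
    feed_docs = {fid: Counter(tok for entry in entries for tok in entry)
                 for fid, entries in feeds.items()}
    coll_counts = Counter()
    for counts in feed_docs.values():
        coll_counts.update(counts)
    coll_len = sum(coll_counts.values())
    scored = []
    for fid, counts in feed_docs.items():
        doc_len = sum(counts.values())
        score = dirichlet_query_log_likelihood(
            query_terms, counts, doc_len, coll_counts, coll_len, mu)
        scored.append((score, fid))
    scored.sort(reverse=True)
    return [fid for _, fid in scored]
```

Because entries are concatenated before counting, the sketch also shows both pitfalls listed above: entry boundaries vanish, and nothing normalizes for entry length.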

Small Document (Entry) Model

[Figure: each entry is one document (document = entry) in the Entry Document Collection; the query [Q] produces Ranked Entries, and a rank aggregation function turns them into Ranked Feeds]

Rank by a combination of:
• Query Likelihood
• Entry Centrality
• Feed Prior: favors longer feeds

ReDDE Federated Search Algorithm [Si & Callan, 2003]
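The three named components suggest a score of the following shape. This is a sketch under the assumption that the feed prior multiplies a sum, over the feed's entries, of each entry's query likelihood weighted by its centrality; the symbols F (feed), E (entry), and Q (query) are introduced here for illustration.

```latex
P(F \mid Q) \;\propto\;
  \underbrace{P(F)}_{\text{feed prior}}
  \sum_{E \in F}
  \underbrace{P(Q \mid E)}_{\text{query likelihood}}\,
  \underbrace{P(E \mid F)}_{\text{entry centrality}}
```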

Entry Centrality

• Uniform
• Geometric Mean

[Figure: entries of a feed plotted by topic against time]
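The two centrality options named here can be illustrated as follows. This is a minimal sketch, assuming the geometric-mean variant scores an entry by the geometric mean of the feed's term probabilities over the entry's tokens, and that the feed model is the average of its entries' maximum-likelihood models. All function names, and the way `small_document_score` combines the parts, are hypothetical.

```python
import math
from collections import Counter

def feed_term_probs(entries):
    """P(t|F): average of the entries' maximum-likelihood term models (an assumption)."""
    probs = Counter()
    for entry in entries:
        for term, count in Counter(entry).items():
            probs[term] += count / len(entry) / len(entries)
    return probs

def uniform_centrality(entries):
    """Every entry is treated as equally central to its feed."""
    return [1.0 / len(entries)] * len(entries)

def geometric_mean_centrality(entries):
    """Centrality proportional to the geometric mean of P(t|F) over the entry's tokens.

    Entries must be non-empty token lists.
    """
    p_feed = feed_term_probs(entries)
    raw = [math.exp(sum(math.log(p_feed[t]) for t in entry) / len(entry))
           for entry in entries]
    total = sum(raw)
    return [value / total for value in raw]

def small_document_score(entry_query_likelihoods, centrality, feed_prior=1.0):
    """Hypothetical aggregation: prior times centrality-weighted sum of entry scores."""
    return feed_prior * sum(q * c for q, c in zip(entry_query_likelihoods, centrality))
```

With entries such as `[["nasa", "shuttle"], ["nasa", "mars"], ["my", "dog"]]`, the off-topic third entry shares no terms with the others and receives the lowest geometric-mean centrality.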

Small Document (Entry) Model

Advantages:
• Controls for differing entry length
• Models topical relationship among entries

Disadvantages:
• Centrality computation is slow(er)

Not only improves speed, but also performance

Retrieval Model Results


• 45 Queries from the TREC 2007 Blog Distillation Task

• BLOG06 test collection, XML feeds only

• 5-Fold Cross Validation for all retrieval model smoothing parameters
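Results for this setup are reported as Mean Average Precision (MAP). For reference, a standard way to compute it is sketched below; the input shapes (`runs` as ranked lists per query, `qrels` as sets of relevant feed ids per query) are assumptions for illustration.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k over the ranks k at which a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)

def mean_average_precision(runs, qrels):
    """runs: dict query -> ranked list of ids; qrels: dict query -> relevant ids."""
    return sum(average_precision(ranking, qrels.get(query, ()))
               for query, ranking in runs.items()) / len(runs)
```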

[Figure: bar chart of Mean Average Precision (axis 0.245 to 0.325) comparing the Large Document (Feed) Model with the Small Document (Entry) Models. Bar values, in extraction order: 0.29, 0.277, 0.29, 0.298, 0.315. Other labels recovered from the chart: Uniform, Log(Feed Length), Uniform, Log Prior, "MAP 0.188", and "n/a".]

Feedback Models

• Challenge: Noisy collection with general & ongoing information needs

• Use a cleaner external collection for query expansion (Wikipedia)

• With an expansion technique designed to identify multiple query facets

Query Expansion (PRF)

[Diagram: the query [Q] is run against the BLOG06 collection; related terms from the top K documents are added to form [Q + Terms] [Lavrenko & Croft, 2001]]
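The pseudo-relevance feedback loop can be sketched as follows. This is a simplified illustration in the spirit of relevance-model feedback, not the exact method used in the talk: each candidate term is weighted by its relative frequency in a top-K document, scaled by that document's normalized retrieval score. The function name, parameters, and input shape are assumptions.

```python
from collections import Counter

def prf_expansion_terms(ranked_docs, k=10, num_terms=5, stopwords=frozenset()):
    """ranked_docs: list of (tokens, retrieval_score) pairs, best first.

    Returns the num_terms highest-weighted terms from the top k documents.
    """
    top = ranked_docs[:k]
    norm = sum(score for _, score in top) or 1.0
    weights = Counter()
    for tokens, score in top:
        if not tokens:
            continue
        doc_weight = score / norm
        for term, count in Counter(tokens).items():
            if term not in stopwords:
                weights[term] += doc_weight * count / len(tokens)
    return [term for term, _ in weights.most_common(num_terms)]
```

Because the terms come straight from the top-ranked documents of the target collection, any spam among those documents leaks directly into the expanded query.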

Query Expansion Example: [Photography]

Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography

PRF: photography, nude, erotic, art, girl, free, teen, fashion, women

Feedback Model Results

[Figure: bar chart of Mean Average Precision (axis 0.2 to 0.36) for BLOG LD and BLOG SD, comparing None and PRF]

Query Expansion (Wikipedia PRF)

[Diagram: the query [Q] is run against Wikipedia; related terms from the top K Wikipedia documents are added to form [Q + Terms], which is run against the BLOG06 collection [Lavrenko & Croft, 2001; Diaz & Metzler, 2006]]

Query Expansion Example: [Photography] (continued)

Wikipedia PRF: photography, director, special, film, art, camera, music, cinematographer, photographic

Feedback Model Results

[Figure: bar chart of Mean Average Precision (axis 0.2 to 0.36) for BLOG LD and BLOG SD, comparing None, PRF, and Wiki. PRF]

Query Expansion (Wikipedia Link)

[Diagram: the query [Q] is run against Wikipedia; related terms from the link structure are added to form [Q + Terms], which is run against the BLOG06 collection]

Wikipedia Link-Based Query Expansion

[Diagram, built up over several slides: the query Q is run against Wikipedia. The top W = 1000 ranked articles form the Working Set; the top R ranked articles form the Relevance Set (R = 100 in the first builds, R = 500 in the last).]

Extract anchor text from Working Set articles that link to the Relevance Set.

Combines relevance and popularity:
• Relevance: an anchor phrase that links to a high-ranked article gets a high score
• Popularity: an anchor phrase that links many times to a mid-ranked article also gets a high score
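One way to realize both properties is sketched below. The per-link weight `r - rank` is an assumption chosen so that a single link to a top-ranked article scores highly and repeated links to a mid-ranked article accumulate. The function name and input shapes are hypothetical.

```python
from collections import defaultdict

def link_based_expansion(ranked_articles, links, r=100, w=1000, num_terms=10):
    """ranked_articles: Wikipedia article ids for the query, best first.
    links: dict source_id -> list of (anchor_text, target_id) pairs.

    Scores anchor phrases found in the Working Set (top w articles) that
    point into the Relevance Set (top r articles).
    """
    relevance_rank = {article: i for i, article in enumerate(ranked_articles[:r])}
    scores = defaultdict(float)
    for source in ranked_articles[:w]:
        for anchor, target in links.get(source, ()):
            rank = relevance_rank.get(target)
            if rank is not None:
                # Each link adds more weight the higher its target is ranked.
                scores[anchor.lower()] += r - rank
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda a: (-scores[a], a))[:num_terms]
```

Because the phrases are anchor texts rather than single tokens, the expansion naturally yields multi-word terms.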

Query Expansion Example: [Photography] (continued)

Wikipedia Link-Based: photography, photographer, digital photography, photographic, depth of field, feature photography, film, photographic film, photojournalism

Feedback Model Results

[Figure: bar chart of Mean Average Precision (axis 0.2 to 0.4) for BLOG LD and BLOG SD, comparing None, PRF, Wiki. PRF, and Wiki. Link]

Conclusion

• Feed Search Challenges:
  – Feeds are topically diverse, noisy collections
  – Ranked against ongoing & general information needs
• Novel Retrieval Models:
  – Ranking collections, sensitive to topical relationship among entries
• Novel Feedback Models:
  – Discover multiple query facets & robust to collection noise

Thank You!

Student Travel Grant funding from: ACM SIGIR, Amit Singhal, Microsoft Research

Entry Centrality GM Derivation

[Equations not preserved; the surviving labels are "where", "Entry Generation Likelihood:", and the entry length |E|]
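One derivation consistent with these labels is sketched below, assuming the geometric-mean centrality is the entry generation likelihood normalized by entry length. The symbols L(E | F), tf(t, E), and φ_GM are notation introduced here, not taken from the slide.

```latex
\begin{align*}
L(E \mid F) &= \prod_{t \in E} P(t \mid F)^{\mathrm{tf}(t, E)}
  && \text{(entry generation likelihood)} \\
\phi_{GM}(E, F) &= L(E \mid F)^{1/|E|}
  = \prod_{t \in E} P(t \mid F)^{\mathrm{tf}(t, E)/|E|}
  && \text{(geometric mean)} \\
\text{where } |E| &= \sum_{t \in E} \mathrm{tf}(t, E)
\end{align*}
```

Taking the |E|-th root removes the bias of the raw likelihood toward short entries, since a product of more probabilities is always smaller.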

Query Expansion Examples

[Music]
Wikipedia Expansion: Music, Folk music, Electronic music, Folk, Music video, World music, Ambient, Electronic, Country music
PRF: Music, Country, Download, Free, MP3, Mp3andmore, Lyric, Listen, Song

[Scottish Independence]
Wikipedia Expansion: scotland, scottish parliament, scottish, scottish national party, wars of scottish independence, scottish independence, william wallace, glasgow, scottish socialist party
PRF: scotland, independence, party, convention, politics, snp, national, people, scot

[Machine Learning]
Wikipedia Expansion: machine learning, learning, artificial intelligence, turing machine, machine gun, neural network, support vector machine, supervised learning, artificial neural network
PRF: learn, machine, credit, card, karaoke, journal, sex, model, sew

Query Generality Characteristics

• Query Length:
  – BLOG: 1.9 words
  – TB04: 3.2 words
  – TB05: 3.0 words
• ODP Depth:
  – BLOG: 4.7 levels
  – TB04: 5.2 levels
  – TB05: 5.3 levels

Relevance Set Cohesiveness

Computed over the Wikipedia Relevance Set (top R = 100):

$\text{Cohesiveness} = \frac{|L_{in}|}{|L_{in} \cup L_{out}|}$
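The cohesiveness ratio can be computed as sketched below. The slide does not define L_in and L_out, so the sketch assumes L_in is the set of links from Relevance Set articles to other Relevance Set articles and L_out the set of links from Relevance Set articles to articles outside it. The function name and input shape are hypothetical.

```python
def relevance_set_cohesiveness(relevance_set, links):
    """relevance_set: iterable of article ids.
    links: iterable of (source_id, target_id) pairs.

    Returns |L_in| / |L_in U L_out| over links that originate in the set.
    """
    members = set(relevance_set)
    l_in, l_out = set(), set()
    for source, target in links:
        if source not in members:
            continue  # only links leaving Relevance Set articles are counted
        (l_in if target in members else l_out).add((source, target))
    total = len(l_in | l_out)
    return len(l_in) / total if total else 0.0
```

Under this reading, a value near 1 means the top-ranked articles mostly link to each other.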

Is it the Queries?

Feed Search Queries ≠ TB Adhoc Queries

But none of these measures predicts whether Wikipedia expansion helps…
