Searching the Temporal Web Challenges and Current Approaches Nattiya Kanhabua IR Group, University of Glasgow 4 February 2013

Searching the Temporal Web: Challenges and Current Approaches



Nattiya Kanhabua

IR Group, University of Glasgow

4 February 2013


Outline

• Evolution of the Web

• Temporal IR system

• Current Approaches

– Content Analysis

– Query Analysis

– Retrieval and Ranking

• Open Issues


Evolution of the Web

• The Web is changing over time in many aspects:
– Size: web pages are added/deleted all the time
– Content: web pages are edited/modified
– Query: users’ information needs change

[Ke et al., CN 2006; Risvik et al., CN 2002] [Dumais, SIAM-SDM 2012; WebDyn 2010]

Web and Index Sizes

Timeline, 1995 to 2012:
• 2000: First billion-URL index (the world’s largest), ≈5000 PCs in clusters
• 2004: Index grows to 4.2 billion pages
• 2008: Google counts 1 trillion unique URLs
• 2009: TBs or PBs of data/index, tens of thousands of PCs

http://www.worldwidewebsize.com/

Impacts: crawling, indexing, and caching

Content Dynamics

• WayBack Machine

– Web archive search by the Internet Archive

[Figure: archived page snapshots from 1998, 2006, and 2012]

Impacts: document representation and retrieval

Query Dynamics

• Search queries exhibit temporal patterns

– Spikes or seasonality

Impacts: search intent and query representation

http://www.google.com/insights/search/


Temporal IR System


Time-sensitive Queries

• Represent temporal information needs
– E.g., 2006 FIFA World Cup, Thailand tsunami
• Queries and their relevance depend on time
– Query popularity changes over time
– Documents are about events at a particular time

Time Distribution of Qrels

[Figure: time distributions of relevant documents for a recency query, a time-sensitive query, and a time-insensitive query]

[Li et al., CIKM 2003]

Query/Document Matching

[Diagram: a time-sensitive query goes through "Determining Search Intent"; documents from the temporal Web go through "Semantic Annotation"; the query is matched against the annotated documents to produce the retrieved results (e.g., D2006)]

• Query after determining search intent
– Term: {Germany, World, Cup}
– Time: {06/2006, 07/2006}
• Annotated documents
– Term: {w1, w2, …, wn}
– Time: {PubTime(di), ContentTime(di)}

Content Analysis


Two Time Aspects

Two time dimensions
1. Publication or modified time
2. Content or event time

[Figure: example document with its content time and publication time marked]

Document Dating

Problem Statements
• Difficult to find a trustworthy time for web documents
– Time gap between crawling and indexing
– Decentralization and relocation of web documents
– No standard metadata for time/date

“For a given document with uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence?”

[Cartoon: “I found a bible-like document, but I have no idea when it was created.” “Let me see… This document was probably written in 850 A.C. with 95% confidence.”]

Current Approaches

1. Content-based

2. Link-based

3. Hybrid


Content-based Approach

Temporal Language Models
• Based on the statistical usage of words over time
• Compare each word of a non-timestamped document with a reference corpus
• Tentative timestamp: the time partition whose word usage overlaps most with the document

Partition   Word         Freq
1999        tsunami      1
1999        Japan        1
1999        tidal wave   1
2004        tsunami      1
2004        Thailand     1
2004        earthquake   1

A non-timestamped document: {tsunami, Thailand}

[de Jong et al., AHC 2005; Kanhabua et al., ECDL 2008]


Content-based Approach: Similarity Scores

For the non-timestamped document {tsunami, Thailand}:
• Score(1999) = 1
• Score(2004) = 1 + 1 = 2
• Most likely timestamp is 2004

[de Jong et al., AHC 2005; Kanhabua et al., ECDL 2008]
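The overlap scoring on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the cited authors' implementation: the table `TEMPORAL_LM` copies the toy model from the slide, while the function names and the tie-breaking rule (earliest partition wins) are assumptions made for the example.

```python
from collections import defaultdict

# Toy temporal language model from the slide: (partition, word, frequency).
TEMPORAL_LM = [
    (1999, "tsunami", 1),
    (1999, "Japan", 1),
    (1999, "tidal wave", 1),
    (2004, "tsunami", 1),
    (2004, "Thailand", 1),
    (2004, "earthquake", 1),
]


def overlap_scores(doc_words, model):
    """Sum, per time partition, the frequencies of words shared with the document."""
    wanted = set(doc_words)
    scores = defaultdict(int)
    for partition, word, freq in model:
        if word in wanted:
            scores[partition] += freq
    return dict(scores)


def most_likely_partition(doc_words, model):
    """Return the best-scoring partition (assumed tie-break: earliest partition)."""
    scores = overlap_scores(doc_words, model)
    return max(sorted(scores), key=scores.get)


scores = overlap_scores(["tsunami", "Thailand"], TEMPORAL_LM)
best = most_likely_partition(["tsunami", "Thailand"], TEMPORAL_LM)
print(scores, best)
```

With the slide's document {tsunami, Thailand}, the sketch gives a score of 1 for 1999 and 2 for 2004, so 2004 is selected.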

Normalized Log-likelihood Ratio

• Variant of Kullback-Leibler divergence
• Similarity of a document and time partitions
• C is the background model estimated on the corpus
• Linear interpolation smoothing to avoid the zero probability of unseen words

[Kraaij, SIGIR Forum 2005]
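A commonly cited form of the normalized log-likelihood ratio sums, over the words w of a document d, the term P(w|d) · log(P(w|p) / P(w|C)) for a time partition p and background model C. The sketch below assumes that form and assumes linear interpolation between the partition model and the background model with a weight `lam`; the function name, the value of `lam`, and the toy partitions (reused from the content-based example) are illustrative choices, not taken from the slides.

```python
import math
from collections import Counter


def nllr(doc_words, partition_words, corpus_words, lam=0.9):
    """Normalized log-likelihood ratio of a document against one time partition.

    Assumes a non-empty document and corpus. P(w|p) is smoothed by linear
    interpolation with the background model P(w|C), so unseen words never
    produce log(0).
    """
    doc = Counter(doc_words)
    part = Counter(partition_words)
    corpus = Counter(corpus_words)
    doc_len = sum(doc.values())
    part_len = sum(part.values())
    corpus_len = sum(corpus.values())

    score = 0.0
    for word, count in doc.items():
        p_w_c = corpus[word] / corpus_len
        if p_w_c == 0.0:
            continue  # a word unseen in the whole corpus carries no evidence
        p_w_d = count / doc_len
        p_w_p = part[word] / part_len if part_len else 0.0
        smoothed = lam * p_w_p + (1 - lam) * p_w_c
        score += p_w_d * math.log(smoothed / p_w_c)
    return score


PARTITIONS = {
    1999: ["tsunami", "Japan", "tidal wave"],
    2004: ["tsunami", "Thailand", "earthquake"],
}
CORPUS = [w for words in PARTITIONS.values() for w in words]
document = ["tsunami", "Thailand"]

nllr_scores = {p: nllr(document, words, CORPUS) for p, words in PARTITIONS.items()}
best_partition = max(nllr_scores, key=nllr_scores.get)
print(nllr_scores, best_partition)
```

On this toy data the 2004 partition scores higher than 1999, in line with the simple overlap count.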

Link-based Approach

• Dating a document using its neighbors
1. Web pages linking to the document (incoming links)
2. Web pages pointed to by the document (outgoing links)
3. Media assets associated with the document (e.g., images)
• Averaging the last-modified dates of its neighbors as timestamps

[Nunes et al., WIDM 2007]
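The averaging step can be illustrated as follows. This is a sketch under simple assumptions: the neighbors' last-modified dates are already collected as timezone-aware `datetime` values, all neighbors are weighted equally, and the example dates are hypothetical.

```python
from datetime import datetime, timezone


def average_timestamp(neighbor_dates):
    """Average timezone-aware datetimes by averaging their POSIX timestamps."""
    if not neighbor_dates:
        return None  # no neighbors, no estimate
    mean_seconds = sum(d.timestamp() for d in neighbor_dates) / len(neighbor_dates)
    return datetime.fromtimestamp(mean_seconds, tz=timezone.utc)


# Hypothetical last-modified dates of incoming links, outgoing links, and media assets.
neighbors = [
    datetime(2006, 6, 1, tzinfo=timezone.utc),
    datetime(2006, 6, 3, tzinfo=timezone.utc),
    datetime(2006, 6, 8, tzinfo=timezone.utc),
]
estimate = average_timestamp(neighbors)
print(estimate)
```

A practical system would probably weight the three neighbor types differently or discard outliers; the plain mean is used here only because it is what the slide names.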

Hybrid Approach

• Inferring timestamps using machine learning
– Exploit links and the contents of a web page and its neighbors
– Features: linguistic, position, page formats, and tags

[Chen et al., SIGIR 2010]

Content Time Extraction

• Three types of temporal expressions
1. Explicit: time mentions that map directly to a time point or interval, e.g., “July 4, 2012”
2. Implicit: imprecise time point or interval, e.g., “Independence Day 2012”
3. Relative: resolved to a time point or interval using other types or the publication date, e.g., “next month”

[Alonso et al., SIGIR Forum 2007]
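To make the three types concrete, the sketch below resolves one example of each into a (start, end) date interval. It is not a temporal tagger: the lookup table `IMPLICIT_EVENTS`, the function names, and the handling of only these three patterns are assumptions for illustration, and `%B` parsing assumes an English locale.

```python
from datetime import date, datetime, timedelta

# Hypothetical lookup for implicit expressions: event name -> (month, day).
IMPLICIT_EVENTS = {"Independence Day": (7, 4)}


def resolve_explicit(text):
    """Parse an explicit expression such as 'July 4, 2012' into a one-day interval."""
    day = datetime.strptime(text, "%B %d, %Y").date()
    return (day, day)


def resolve_implicit(text):
    """Resolve '<event name> <year>' through the lookup table."""
    name, year = text.rsplit(" ", 1)
    month, day_of_month = IMPLICIT_EVENTS[name]
    day = date(int(year), month, day_of_month)
    return (day, day)


def resolve_next_month(publication_date):
    """Resolve the relative expression 'next month' against the publication date."""
    if publication_date.month == 12:
        start = date(publication_date.year + 1, 1, 1)
    else:
        start = date(publication_date.year, publication_date.month + 1, 1)
    # The interval ends one day before the first day of the month after `start`.
    if start.month == 12:
        following = date(start.year + 1, 1, 1)
    else:
        following = date(start.year, start.month + 1, 1)
    return (start, following - timedelta(days=1))


print(resolve_explicit("July 4, 2012"))
print(resolve_implicit("Independence Day 2012"))
print(resolve_next_month(date(2012, 7, 29)))
```

Note that the relative expression cannot be resolved without the publication date, which is why the two time dimensions of a document are tracked separately.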

Identifying Relevant Time

• How to determine relevant temporal expressions tagged in a document?
– Not all temporal expressions associated with an event are equally relevant

Example: “Reported by World Health Organization (WHO) on 29 July 2012 about an ongoing Ebola outbreak in Uganda since the beginning of July 2012”

Current Approaches

1. Ranking temporal expressions using different features
2. Identifying relevant time is regarded as a classification problem
– Two classes: (1) relevant and (2) irrelevant
– Definition: a relevant expression refers to the starting, ending, or ongoing time of the event

[Strötgen et al., TempWeb 2012; Kanhabua et al., TAIA 2012]

Query Analysis


Determining Time of Queries

• Two types of temporal queries:
1. Explicit: time is provided, e.g., "Presidential election 2012"
2. Implicit: time is not provided, e.g., "Germany World Cup"
• Temporal intent can be implicitly inferred
• Previous studies on temporal queries:
– 1.5% of web queries are explicit
– ~7% of web queries are implicit

[Nunes et al., ECIR 2008; Metzler et al., SIGIR 2009]

Current Approaches

1. Query log analysis

2. Analyzing top-k documents


Query Log Analysis

• Features from query logs
– Analyze query frequencies over time to identify the relevant time of queries
• Time-series analysis
– Leverage time-series decomposition for detecting seasonal queries

[Metzler et al., SIGIR 2009; Shokouhi, SIGIR 2011]

Time-series Decomposition

[Figure: time-series decompositions of query frequencies for the queries “Easter” and “World cup”]
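One way to see how decomposition exposes seasonal queries is a classical additive decomposition into trend, seasonal, and residual parts. The sketch below is a generic textbook procedure, not the method of the cited papers; the function names, the strength measure (one minus residual variance over seasonal-plus-residual variance), and the monthly toy series are assumptions. It needs at least two full periods of data.

```python
from statistics import mean, pvariance


def centered_moving_average(series, period):
    """Trend estimate; None where the averaging window does not fit."""
    half = period // 2
    trend = [None] * len(series)
    for i in range(half, len(series) - half):
        window = series[i - half:i + half + 1]
        if period % 2 == 0:
            # Even period: half-weight the two end points (a 2 x period average).
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[i] = total / period
    return trend


def decompose(series, period):
    """Additive decomposition: series = trend + seasonal + residual."""
    trend = centered_moving_average(series, period)
    detrended = [x - t if t is not None else None for x, t in zip(series, trend)]
    phase_means = []
    for phase in range(period):
        values = [d for i, d in enumerate(detrended)
                  if d is not None and i % period == phase]
        phase_means.append(mean(values))
    offset = mean(phase_means)  # center the seasonal component around zero
    seasonal = [phase_means[i % period] - offset for i in range(len(series))]
    residual = [d - s if d is not None else None
                for d, s in zip(detrended, seasonal)]
    return trend, seasonal, residual


def seasonal_strength(series, period):
    """Share of detrended variance explained by the seasonal part (0 to 1)."""
    _, seasonal, residual = decompose(series, period)
    kept = [(s, r) for s, r in zip(seasonal, residual) if r is not None]
    var_residual = pvariance([r for _, r in kept])
    var_both = pvariance([s + r for s, r in kept])
    return 0.0 if var_both == 0 else max(0.0, 1.0 - var_residual / var_both)


# Hypothetical monthly query counts over three years.
yearly_spike = [10] * 36
for month_index in (3, 15, 27):  # a spike in the same month every year
    yearly_spike[month_index] = 100
steady_growth = list(range(36))  # growing popularity, no seasonality

print(seasonal_strength(yearly_spike, 12), seasonal_strength(steady_growth, 12))
```

A query would then be flagged as seasonal when the strength exceeds some threshold, which would have to be tuned on real query logs.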


Analyzing Top-k Documents

• Using temporal language models
– Determine time of queries when no time is given explicitly
• Exploiting time from search snippets
– Extract temporal expressions (i.e., years) from the contents of top-k retrieved web snippets for a given query

[Kanhabua et al., ECDL 2010; Campos et al., CIKM 2012]
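The snippet-based idea can be illustrated with a plain frequency count of year mentions. This is a deliberately naive stand-in, not the GTE method of Campos et al.; the regular expression, the function name, and the example snippets are made up for the illustration.

```python
import re
from collections import Counter

# Four-digit years from 1000 to 2099; an assumption, real taggers are more careful.
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def candidate_years(snippets, top_n=3):
    """Rank years by the number of top-k snippets that mention them."""
    counts = Counter()
    for snippet in snippets:
        # Count each year once per snippet so one long snippet cannot dominate.
        counts.update(set(YEAR_PATTERN.findall(snippet)))
    return [(int(year), n) for year, n in counts.most_common(top_n)]


# Hypothetical top-k snippets for the implicit query "Germany World Cup".
snippets = [
    "Germany hosted the FIFA World Cup in 2006.",
    "2006 World Cup: Germany finished third.",
    "Germany won the World Cup in 1954, 1974 and 1990.",
    "World Cup 2006 Germany schedule and results.",
]
print(candidate_years(snippets, top_n=1))
```

Raw counts cannot tell a relevant year from an incidental one, which is the gap that the more elaborate approaches cited on the slide aim to close.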

Matching: Re-visited

[Diagram: the same pipeline as in Query/Document Matching, with the retrieved results (e.g., D2006) now ordered into ranked results]

• Query after determining search intent
– Term: {Germany, World, Cup}
– Time: {06/2006, 07/2006}
• Annotated documents
– Term: {w1, w2, …, wn}
– Time: {PubTime(di), ContentTime(di)}

Retrieval and Ranking


Searching the Past

• Searching documents created/edited over time
– E.g., web archives, news archives, blogs, or emails (“temporal document collections”)
– A journalist wants to write a timeline of a news article
– A Wikipedia contributor searches for historical information about an entity of interest

Example: retrieve documents about Pope Benedict XVI written before 2005

Term-based IR approaches may give unsatisfactory results

Challenges

• Time must be explicitly modeled in order to increase the effectiveness of ranking
– To order search results so that the most relevant ones are ranked higher
• Time uncertainty should be taken into account
– Two temporal expressions can refer to the same time period even though they are written differently
– E.g., the query “Independence Day 2011”
• A retrieval model relying on term matching only will fail to retrieve documents mentioning “July 4, 2011”

Query/Document Models

• A temporal query consists of:
– Query keywords
– Temporal expressions
• A document consists of:
– Terms, i.e., bag-of-words
– Publication time and temporal expressions
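The two models above map naturally onto small data structures. The sketch below is one possible representation; the class and field names are hypothetical, and the example values loosely follow the "Germany World Cup" query used earlier in the deck, with illustrative document contents.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class TimeInterval:
    """A temporal expression normalized to an inclusive date interval."""
    start: date
    end: date

    def overlaps(self, other):
        return self.start <= other.end and other.start <= self.end


@dataclass
class TemporalQuery:
    keywords: list
    temporal_expressions: list = field(default_factory=list)


@dataclass
class TemporalDocument:
    terms: dict  # bag-of-words: term -> frequency
    publication_time: date
    temporal_expressions: list = field(default_factory=list)


query = TemporalQuery(
    keywords=["Germany", "World", "Cup"],
    temporal_expressions=[TimeInterval(date(2006, 6, 1), date(2006, 7, 31))],
)
doc = TemporalDocument(
    terms={"germany": 2, "world": 1, "cup": 1},
    publication_time=date(2006, 7, 10),
    temporal_expressions=[TimeInterval(date(2006, 6, 9), date(2006, 7, 9))],
)
print(doc.temporal_expressions[0].overlaps(query.temporal_expressions[0]))
```

Keeping the publication time separate from the content-level expressions mirrors the two time dimensions introduced in the content analysis part.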

Temporal Query Examples

[Berberich et al., ECIR 2010]

Time-aware Ranking

• Two main approaches
1. Mixture model
• Linearly combining textual and temporal similarity
2. Probabilistic model
• Generating a query from the textual part and temporal part of a document independently

[Li et al., CIKM 2003; Baeza-Yates, SIGIR Forum 2005]
[Berberich et al., ECIR 2010; Kanhabua et al., ECDL 2010]

Mixture Model

• Linearly combine textual and temporal similarity
– α indicates the importance of similarity scores
• Both scores are normalized before combining
– Textual similarity can be determined using any term-based retrieval model
• E.g., tf.idf or a unigram language model

[Li et al., CIKM 2003; Baeza-Yates, SIGIR Forum 2005]
[Kanhabua et al., ECDL 2010]

How to determine temporal similarity?

Temporal Similarity

[Figure: similarity score plotted against time distance, with documents d1 and d2 marked]
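A mixture model of this kind can be sketched as follows. The assumptions are stated plainly: α is taken to weight the temporal score and (1 − α) the textual score, both scores are min-max normalized, and temporal similarity is modeled as an exponential decay in the time distance. The decay parameters, function names, and example scores are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale a dict of scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def temporal_similarity(query_year, doc_year, decay_rate=0.5, time_unit=1.0):
    """Exponential decay: 1.0 at time distance 0, smaller as the distance grows."""
    return decay_rate ** (abs(query_year - doc_year) / time_unit)


def mixture_rank(text_scores, doc_years, query_year, alpha=0.5):
    """Rank documents by (1 - alpha) * textual + alpha * temporal similarity."""
    text = min_max_normalize(text_scores)
    temporal = min_max_normalize(
        {doc: temporal_similarity(query_year, year) for doc, year in doc_years.items()}
    )
    combined = {doc: (1 - alpha) * text[doc] + alpha * temporal[doc] for doc in text}
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical textual scores (e.g., from tf.idf) and document years.
text_scores = {"d1": 2.0, "d2": 3.0}
doc_years = {"d1": 2006, "d2": 2010}

print(mixture_rank(text_scores, doc_years, query_year=2006, alpha=0.7))
print(mixture_rank(text_scores, doc_years, query_year=2006, alpha=0.0))
```

With α = 0.7 the temporally closer d1 ranks first; with α = 0 the ranking falls back to the textual scores and d2 ranks first.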


Probabilistic Model

• Assume that temporal expressions in the query are generated independently from a two-step generative model:
– P(tq|td) can be estimated as a likelihood for each time interval tq that td can refer to
– Linear interpolation smoothing is applied to eliminate zero probabilities
• I.e., an unseen temporal expression tq in d

[Berberich et al., ECIR 2010]
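The generative view can be sketched with a simplified estimate. The assumptions here are illustrative and not the exact estimates of Berberich et al.: intervals are inclusive year ranges, P(tq|td) is the chance that a year drawn uniformly from td falls inside tq, a document's temporal expressions are averaged, and smoothing interpolates with the whole collection using a weight `lam`.

```python
def interval_likelihood(tq, td):
    """P(tq | td): chance that a year drawn uniformly from td falls inside tq."""
    overlap = min(tq[1], td[1]) - max(tq[0], td[0]) + 1
    return max(0, overlap) / (td[1] - td[0] + 1)


def expression_likelihood(tq, time_expressions):
    """Average P(tq | td) over a set of temporal expressions."""
    if not time_expressions:
        return 0.0
    return sum(interval_likelihood(tq, td) for td in time_expressions) / len(time_expressions)


def query_time_likelihood(query_times, doc_times, collection_times, lam=0.1):
    """Product over query expressions, each smoothed with the collection."""
    prob = 1.0
    for tq in query_times:
        in_doc = expression_likelihood(tq, doc_times)
        in_collection = expression_likelihood(tq, collection_times)
        prob *= (1 - lam) * in_doc + lam * in_collection
    return prob


# Hypothetical temporal expressions as (start_year, end_year), inclusive.
doc_a = [(2006, 2006)]   # mentions exactly the queried year
doc_b = [(2000, 2009)]   # mentions a decade containing it
doc_c = [(1999, 1999)]   # mentions a different year
collection = doc_a + doc_b + doc_c
query_times = [(2006, 2006)]

p_a = query_time_likelihood(query_times, doc_a, collection)
p_b = query_time_likelihood(query_times, doc_b, collection)
p_c = query_time_likelihood(query_times, doc_c, collection)
print(p_a, p_b, p_c)
```

The exact mention scores highest, the vaguer decade still receives partial credit, and smoothing keeps the non-matching document above zero.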

Comparison of Time-aware Ranking

• Five time-aware ranking models
– LMT [Berberich et al., ECIR 2010]
– LMTU [Berberich et al., ECIR 2010]
– TS [Kanhabua et al., ECDL 2010]
– TSU [Kanhabua et al., ECDL 2010]
– FuzzySet [Kalczynski et al., Inf. Process. 2005]

[Kanhabua et al., SIGIR 2011a]

Open Issues

Named Entity Evolution

Problem Statements
• Queries of named entities (people, company, place)
– Highly dynamic in appearance, i.e., relationships between terms change over time
– E.g., changes of roles, name alterations, or semantic shift

Scenario 1
Query: “Pope Benedict XVI” and written before 2005
Documents about “Joseph Alois Ratzinger” are relevant

Scenario 2
Query: “Hillary R. Clinton” and written from 1997 to 2002
Documents about “New York Senator” and “First Lady of the United States” are relevant
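Both scenarios amount to expanding the query with names that were valid during the requested period. The sketch below assumes a small hand-built table of time-based synonyms at year granularity; the table layout, the validity years, and the function name are illustrative assumptions, whereas real systems mine such synonyms from sources like Wikipedia revisions.

```python
# Hypothetical table: (entity, synonym, valid_from_year, valid_to_year); None = open-ended.
TIME_BASED_SYNONYMS = [
    ("Pope Benedict XVI", "Joseph Alois Ratzinger", None, 2005),
    ("Hillary R. Clinton", "First Lady of the United States", 1993, 2001),
    ("Hillary R. Clinton", "New York Senator", 2001, 2009),
]


def expand_query(entity, start_year, end_year, table=TIME_BASED_SYNONYMS):
    """Return the entity plus every synonym valid at some point in [start_year, end_year]."""
    lo = float("-inf") if start_year is None else start_year
    hi = float("inf") if end_year is None else end_year
    terms = [entity]
    for name, synonym, valid_from, valid_to in table:
        syn_lo = float("-inf") if valid_from is None else valid_from
        syn_hi = float("inf") if valid_to is None else valid_to
        if name == entity and syn_lo <= hi and lo <= syn_hi:
            terms.append(synonym)
    return terms


# Scenario 1: written before 2005. Scenario 2: written from 1997 to 2002.
print(expand_query("Pope Benedict XVI", None, 2004))
print(expand_query("Hillary R. Clinton", 1997, 2002))
```

The expanded term list can then be passed to any term-based retrieval model, so documents that never mention the query name itself can still match.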

Examples of Name Changes

QUEST Demo: http://research.idi.ntnu.no/wislab/quest/

Current Approaches

• Temporal co-occurrence

• Mining Wikipedia revisions

• Temporal association rule mining

[Berberich et al., WebDB 2009; Kanhabua et al., JCDL 2010]
[Kaluarachchi et al., CIKM 2010; Tahmasebi et al., COLING 2012]

Query Prediction Problems

1. Performance prediction
– Predict the retrieval effectiveness with respect to a ranking model
– Given a query, predict precision, recall, and MAP

[Cronen-Townsend et al., SIGIR 2002; Diaz et al., SIGIR 2004]
[Hauff et al., ECIR 2010; Carmel et al., 2010; Kanhabua et al., SIGIR 2011b]

2. Retrieval model prediction
– Predict the retrieval model that is most suitable
– Given a query, predict the ranking that maximizes precision, recall, or MAP

[Peng et al., ECIR 2010; Kanhabua et al., SIGIR 2012]

References

• [Alonso et al., SIGIR Forum 2007] Omar Alonso, Michael Gertz, Ricardo A. Baeza-Yates: On the value of temporal information in information retrieval. SIGIR Forum 41(2): 35-41 (2007)
• [Baeza-Yates, SIGIR Forum 2005] Ricardo A. Baeza-Yates: Searching the future. SIGIR workshop MF/IR 2005
• [Berberich et al., WebDB 2009] Klaus Berberich, Srikanta J. Bedathur, Mauro Sozio, Gerhard Weikum: Bridging the Terminology Gap in Web Archive Search. WebDB 2009
• [Berberich et al., ECIR 2010] Klaus Berberich, Srikanta J. Bedathur, Omar Alonso, Gerhard Weikum: A Language Modeling Approach for Temporal Information Needs. ECIR 2010: 13-25
• [Campos et al., CIKM 2012] Ricardo Campos, Gaël Dias, Alípio Jorge, Celia Nunes: GTE: A Distributional Second-Order Co-Occurrence Approach to Improve the Identification of Top Relevant Dates in Web Snippets. CIKM 2012
• [Carmel et al., 2010] David Carmel, Elad Yom-Tov: Estimating the Query Difficulty for Information Retrieval. Morgan & Claypool Publishers 2010
• [Chen et al., SIGIR 2010] Zhumin Chen, Jun Ma, Chaoran Cui, Hongxing Rui, Shaomang Huang: Web page publication time detection and its application for page rank. SIGIR 2010: 859-860
• [Cronen-Townsend et al., SIGIR 2002] Stephen Cronen-Townsend, Yun Zhou, W. Bruce Croft: Predicting query performance. SIGIR 2002: 299-306
• [Diaz et al., SIGIR 2004] Fernando Diaz, Rosie Jones: Using temporal profiles of queries for precision prediction. SIGIR 2004: 18-24
• [Dumais, SIAM-SDM 2012] Susan T. Dumais: Temporal Dynamics and Information Retrieval. SIAM-SDM 2012

• [Hauff et al., ECIR 2010] Claudia Hauff, Leif Azzopardi, Djoerd Hiemstra, Franciska de Jong: Query Performance Prediction: Evaluation Contrasted with Effectiveness. ECIR 2010: 204-216
• [de Jong et al., AHC 2005] Franciska de Jong, Henning Rode, Djoerd Hiemstra: Temporal language models for the disclosure of historical text. AHC 2005: 161-168
• [Kaluarachchi et al., CIKM 2010] Amal Chaminda Kaluarachchi, Aparna S. Varde, Srikanta J. Bedathur, Gerhard Weikum, Jing Peng, Anna Feldman: Incorporating terminology evolution for query translation in text retrieval with association rules. CIKM 2010: 1789-1792
• [Kalczynski et al., Inf. Process. 2005] Pawel Jan Kalczynski, Amy Chou: Temporal Document Retrieval Model for business news archives. Inf. Process. Manage. 41(3): 635-650 (2005)
• [Kanhabua et al., JCDL 2010] Nattiya Kanhabua, Kjetil Nørvåg: Exploiting time-based synonyms in searching document archives. JCDL 2010: 79-88
• [Kanhabua et al., ECDL 2010] Nattiya Kanhabua, Kjetil Nørvåg: Determining Time of Queries for Re-ranking Search Results. ECDL 2010: 261-272
• [Kanhabua et al., SIGIR 2011a] Nattiya Kanhabua, Kjetil Nørvåg: A comparison of time-aware ranking methods. SIGIR 2011: 1257-1258
• [Kanhabua et al., SIGIR 2011b] Nattiya Kanhabua, Kjetil Nørvåg: Time-based query performance predictors. SIGIR 2011: 1181-1182
• [Kanhabua et al., SIGIR 2012] Nattiya Kanhabua, Klaus Berberich, Kjetil Nørvåg: Learning to select a time-aware retrieval model. SIGIR 2012
• [Kanhabua et al., TAIA 2012] Nattiya Kanhabua, Sara Romano, Avaré Stewart: Identifying Relevant Temporal Expressions for Real-World Events. Time-aware Information Access Workshop 2012

• [Ke et al., CN 2006] Yiping Ke, Lin Deng, Wilfred Ng, Dik Lun Lee: Web dynamics and their ramifications for the development of Web search engines. Computer Networks 50(10): 1430-1447 (2006)
• [Kraaij, SIGIR Forum 2005] Wessel Kraaij: Variations on language modeling for information retrieval. SIGIR Forum 39(1): 61 (2005)
• [Li et al., CIKM 2003] Xiaoyan Li, W. Bruce Croft: Time-based language models. CIKM 2003: 469-475
• [Nunes et al., WIDM 2007] Sérgio Nunes, Cristina Ribeiro, Gabriel David: Using neighbors to date web documents. WIDM 2007: 129-136
• [Metzler et al., SIGIR 2009] Donald Metzler, Rosie Jones, Fuchun Peng, Ruiqiang Zhang: Improving search relevance for implicitly temporal queries. SIGIR 2009: 700-701
• [Mazeika et al., CIKM 2011] Arturas Mazeika, Tomasz Tylenda, Gerhard Weikum: Entity timelines: visual analytics and named entity evolution. CIKM 2011: 2585-2588
• [Peng et al., ECIR 2010] Jie Peng, Craig Macdonald, Iadh Ounis: Learning to Select a Ranking Function. ECIR 2010: 114-126
• [Risvik et al., CN 2002] Knut Magne Risvik, Rolf Michelsen: Search engines and Web dynamics. Computer Networks 39(3): 289-302 (2002)
• [Shokouhi, SIGIR 2011] Milad Shokouhi: Detecting Seasonal Queries by Time-Series Analysis. SIGIR 2011: 1171-1172
• [Strötgen et al., SemEval 2010] Jannik Strötgen, Michael Gertz: Heideltime: High quality rule-based extraction and normalization of temporal expressions. SemEval 2010: 321-324

• [Strötgen et al., TempWeb 2012] Jannik Strötgen, Omar Alonso, Michael Gertz: Identification of top relevant temporal expressions in documents. Temporal Web Workshop 2012
• [Tahmasebi et al., COLING 2012] Nina Tahmasebi, Gerhard Gossen, Nattiya Kanhabua, Helge Holzmann, Thomas Risse: NEER: An Unsupervised Method for Named Entity Evolution Recognition. COLING 2012
• [Verhagen et al., ACL 2005] Marc Verhagen, Inderjeet Mani, Roser Sauri, Jessica Littman, Robert Knippen, Seok Bae Jang, Anna Rumshisky, John Phillips, James Pustejovsky: Automating Temporal Annotation with TARSQI. ACL 2005
• [WebDyn 2010] Web Dynamics course: http://www.mpi-inf.mpg.de/departments/d5/teaching/ss10/dyn/, Max-Planck Institute for Informatics, Saarbrücken, Germany, 2010
• [Zhang et al., EMNLP 2010] Ruiqiang Zhang, Yuki Konda, Anlei Dong, Pranam Kolari, Yi Chang, Zhaohui Zheng: Learning Recurrent Event Queries for Web Search. EMNLP 2010: 1129-1139

Thank you!
