On the Value of Temporal Anchor Texts in Wikipedia

Nattiya Kanhabua and Wolfgang Nejdl
L3S Research Center, Hannover, Germany

SIGIR 2014 Workshop on Temporal, Social and Spatially-aware Information Access (#TAIA2014), Gold Coast, Australia



Outline

• Motivation
• Entity Evolution: Implications for IR
• Our Approach: Model and Ranking
• Experimental Results
• Conclusions and Future Work

Web Science @ L3S

Computer Science and interdisciplinary research on all aspects of the Web

• Information Access: Knowledge on and through the Web
• Community: Supporting communities for research, education, production and entertainment
• Society: Requirements (technological, social, legal) for the Web
• Internet: Communication and Networks

Selected projects

• Temporal Retrieval, Exploration and Analytics in Web Archives
• LivingKnowledge: Diversity, opinion and bias on the Web
• CUbRIK: Searching by computers and humans
• Real-time data processing for finance predictions
• Privacy, Property and Internet Governance
• ForgetIT: Concise Preservation via Managed Forgetting
• MAPPING


ALEXANDRIA: Temporal Retrieval, Exploration and Analytics in Web Archives

• ERC Advanced Grant (2014-2018, 2.5 MEuro)
• Cooperation: Internet Archive, German National Library, British Library, Rutgers University, et al.

Q1: How to link web archive content against multiple entity and event collections evolving over time?
Ioannou, E., Nejdl, W., Niederée, C. and Velegrakis, Y. 2011. LinkDB: A Probabilistic Linkage Database System. SIGMOD’2011
Papadakis, G., Ioannou, E., Niederée, C., Palpanas, T. and Nejdl, W. 2012. Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data. WSDM’2012

Q2: How to support time-sensitive and entity-based queries?
Kanhabua, N. and Nørvåg, K. 2010. Exploiting time-based synonyms in searching document archives. JCDL’2010
Nguyen, T., and Kanhabua, N. 2014. Leveraging dynamic query subtopics for time-aware search result diversification. ECIR’2014

Entity Evolution and Implications for IR

Change of Subtopic Popularities over Time

Query: ncaa

• 14/03/2006: march madness began
• 18/03/2006: ncaa women tournament began
• 01/04/2006: final four began

S. R. Kairam, M. R. Morris, J. Teevan, D. J. Liebling, and S. T. Dumais. Towards supporting search over trending events with social media. ICWSM'2013
Nguyen, T., and Kanhabua, N. 2014. Leveraging dynamic query subtopics for time-aware search result diversification. ECIR’2014

Motivation

Anchor texts are complementary descriptions of target pages, widely used to improve search.

Characteristics:
• Short summary (a few words) of target pages
• Collective wisdom of people other than authors
• Similar behavior to real-world queries and titles
• Capturing aboutness, or what a document is about

Our idea:
• Temporal anchor texts mined from the edit history of Wikipedia as a hook for tracking entity evolution
• Large-scale analysis and a more robust discovery of evolving information using limited resources

N. Eiron and K. S. McCurley: Analysis of Anchor Text for Web Search. SIGIR’2003
http://googleresearch.blogspot.de/2013/03/learning-from-big-data-40-million.html

Mining Temporal Anchor Texts

1. Partition Wikipedia revisions at one-month granularity

2. For each Wikipedia snapshot, identify named entity articles/pages

3. Extract anchor texts from all articles linking to an entity page

4. Rank aggregated entity-anchor relationships at a particular time t

N. Kanhabua and K. Nørvåg: Exploiting time-based synonyms in searching document archives. JCDL 2010
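
The four mining steps can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `(source_page, timestamp, wikitext)` input shape, and the simplified wikilink regex are all assumptions. It also only considers pages revised within a given month, whereas a full snapshot would carry forward each page's latest earlier revision, and it ranks by raw link count as a stand-in for the weighting described later.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

# Simplified wikitext link pattern: [[Target]], [[Target|anchor]], [[Target#Section|anchor]].
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")

def month_key(timestamp: str) -> str:
    """Map an ISO timestamp to its one-month partition, e.g. '2008-11'."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").strftime("%Y-%m")

def extract_anchors(wikitext: str):
    """Yield (target_title, anchor_text) pairs found in one revision."""
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip().replace(" ", "_")
        anchor = (match.group(2) or match.group(1)).strip()
        yield target, anchor

def mine_temporal_anchors(revisions, entity_titles):
    """revisions: iterable of (source_page, timestamp, wikitext).
    Returns {month: {entity: Counter(anchor -> link count)}}."""
    # Step 1: keep only the latest revision of each page within each month.
    latest = {}
    for page, ts, text in revisions:
        key = (month_key(ts), page)
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, text)
    # Steps 2-3: collect anchors whose target is a known entity page.
    counts = defaultdict(lambda: defaultdict(Counter))
    for (month, _page), (_ts, text) in latest.items():
        for target, anchor in extract_anchors(text):
            if target in entity_titles:
                counts[month][target][anchor] += 1
    return counts

def rank_anchors(counts, month, entity, k=5):
    """Step 4 (simplified): rank anchors for an entity at time t by raw count."""
    return counts[month][entity].most_common(k)
```

Given a set of entity titles, `rank_anchors(mine_temporal_anchors(revisions, entities), "2008-11", "President_of_the_United_States")` would return the most frequent anchor texts pointing at that page in the November 2008 partition.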


Example: anchor texts linking to President_of_the_United_States

• George W. Bush (Time: 11/2004)
• President Bush (43) (Time: 10/2005)
• Barack Obama (Time: 11/2008)

Recognizing Named Entity Articles

1. Multi-word titles with all words capitalized, except prepositions, determiners, etc.
   E.g., President_of_the_United_States => entity
2. Single-word titles with multiple capital letters
   E.g., UNICEF and WHO => entities
3. At least 75% of the title's occurrences in the article text itself are capitalized (not counting the beginning of a sentence)

R. C. Bunescu and M. Pasca. Using encyclopedic knowledge for named entity disambiguation. EACL 2006
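
The three rules above could be approximated as follows. This is a sketch under stated assumptions: `is_named_entity`, the `FUNCTION_WORDS` list, and the crude sentence-start test (previous non-space character is `.`, `!` or `?`) are illustrative choices, not taken from the slides. It also assumes that a multi-word title failing rule 1 is rejected outright rather than falling through to rule 3.

```python
import re

# Words allowed to stay lowercase inside a multi-word title (assumed, not exhaustive).
FUNCTION_WORDS = {"of", "the", "and", "in", "on", "for", "to", "a", "an", "at", "by"}

def is_named_entity(title: str, article_text: str, threshold: float = 0.75) -> bool:
    """Heuristically decide whether a Wikipedia page title denotes a named entity."""
    words = [w for w in re.split(r"[_\s]+", title) if w]
    if not words:
        return False
    if len(words) > 1:
        # Rule 1: every content word of a multi-word title is capitalized.
        content = [w for w in words if w.lower() not in FUNCTION_WORDS]
        return bool(content) and all(w[0].isupper() for w in content)
    word = words[0]
    # Rule 2: a single-word title with at least two capital letters.
    if sum(ch.isupper() for ch in word) >= 2:
        return True
    # Rule 3: share of capitalized occurrences in the text, skipping sentence starts.
    capitalized = total = 0
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(article_text):
        before = article_text[:match.start()].rstrip()
        if not before or before[-1] in ".!?":
            continue  # sentence-initial occurrence carries no signal
        total += 1
        capitalized += match.group(0)[0].isupper()
    return total > 0 and capitalized / total >= threshold
```

With these assumptions, `President_of_the_United_States` passes rule 1 and `UNICEF` passes rule 2, while a title such as `Apple` is decided by how often it appears capitalized mid-sentence in its own article.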

Temporal Anchor Weighting

Weight anchor texts by importance with respect to a target entity at a particular time:

• Link-independent: inlink pages are independent and equally important to the target page
• Computed over the whole collection of Wikipedia entity pages at a particular time t
• Two variants: 1) article links, and 2) distinct pages

Z. Dou, R. Song, J.-Y. Nie, and J.-R. Wen. Using anchor texts with their hyperlink structure for web search. SIGIR'2009
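
The slides do not reproduce the weighting formula, so the following is only one plausible reading of the link-independent idea: the weight of an anchor is its relative frequency among all links pointing at the entity in a snapshot. The function name `anchor_weights` and the interpretation of the two variants (counting every link occurrence versus counting each source page once per anchor) are assumptions.

```python
from collections import Counter

def anchor_weights(links, distinct_pages=False):
    """links: iterable of (source_page, anchor_text) pairs that point at one
    entity page in one snapshot. Returns {anchor: relative weight}.

    distinct_pages=False counts every link occurrence (article links);
    distinct_pages=True counts each source page once per anchor."""
    if distinct_pages:
        links = set(links)  # collapse repeated links from the same page
    counts = Counter(anchor for _source, anchor in links)
    total = sum(counts.values())
    return {anchor: n / total for anchor, n in counts.items()} if total else {}

links = [
    ("P1", "Senator Barack Obama"),
    ("P1", "Senator Barack Obama"),
    ("P1", "Senator Barack Obama"),
    ("P2", "Barack Obama"),
]
print(anchor_weights(links))                       # article-links variant
print(anchor_weights(links, distinct_pages=True))  # distinct-pages variant
```

In this toy input a single page repeats the same anchor three times, so the article-links variant gives it weight 0.75 while the distinct-pages variant gives both anchors 0.5, which illustrates why the two variants can rank anchors differently.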


Experiments

Data collection:
• A dump of English Wikipedia edit history (2.8 TB)
• All pages and revisions 03/2001 to 03/2008
• 85 snapshots + 4 additional snapshots (24/05/2008, 27/07/2008, 08/10/2008, 06/03/2009)

Tools:
• Preprocess/store revisions using MWDumper http://www.mediawiki.org/wiki/Mwdumper
• Store anchor texts: MySQL databases
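
As a quick check of the snapshot count, monthly partitions from 03/2001 to 03/2008 inclusive can be enumerated as below. Placing each monthly snapshot on the first day of the month is an assumption made only for illustration; the helper name `monthly_snapshots` is likewise hypothetical.

```python
from datetime import date

def monthly_snapshots(start: date, end: date):
    """Return the first day of every month from start to end, inclusive."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(date(year, month, 1))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months

monthly = monthly_snapshots(date(2001, 3, 1), date(2008, 3, 1))
# The four additional snapshot dates listed on the slide.
extra = [date(2008, 5, 24), date(2008, 7, 27), date(2008, 10, 8), date(2009, 3, 6)]
snapshots = monthly + extra
print(len(monthly), len(snapshots))  # 85 89
```

The 85 monthly partitions match the figure on the slide, and the four later dumps bring the total to 89 snapshots.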

Top-100 Named Entities


Evolving Context

“Barack Obama” over time

05/2008
1. Senator Barack Obama
2. Senator Obama's legislative accomplishments
3. Illinois
4. U.S. Sen. Barack Obama

07/2008
1. Senator Barack Obama
2. Illinois Senator Barack Obama
3. Barack Hussein Obama II
4. Senator Obama's legislative accomplishments

10/2008
1. Senator Barack Obama
2. Illinois Senator Barack Obama
3. Barak Obama, U.S. Senator, Illinois, 2008 Democratic nominee for U.S. President
4. presidential candidacy announcement

03/2009
1. President Barack Obama
2. Senator Barack Obama
3. U.S. President Barack Obama
4. 44th President of the United States
5. Obama Administration
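
One simple way to surface such context changes is to compare the top-ranked anchor lists of consecutive snapshots and report the anchors that newly appear. The sketch below is illustrative only: `new_anchors` is a hypothetical helper, and the pairing of the four lists with the four snapshot dates (05/2008, 07/2008, 10/2008, 03/2009, in that order) is an assumption about the slide layout.

```python
def new_anchors(top_by_time):
    """Map each snapshot after the first to the anchors that were absent from
    the previous snapshot's top list, keeping rank order."""
    times = sorted(top_by_time)  # 'YYYY-MM' keys sort chronologically
    changes = {}
    for prev, curr in zip(times, times[1:]):
        seen = set(top_by_time[prev])
        changes[curr] = [a for a in top_by_time[curr] if a not in seen]
    return changes

# Anchor lists copied from the slide; the date assignment is assumed.
top_anchors = {
    "2008-05": ["Senator Barack Obama", "Senator Obama's legislative accomplishments",
                "Illinois", "U.S. Sen. Barack Obama"],
    "2008-07": ["Senator Barack Obama", "Illinois Senator Barack Obama",
                "Barack Hussein Obama II", "Senator Obama's legislative accomplishments"],
    "2008-10": ["Senator Barack Obama", "Illinois Senator Barack Obama",
                "Barak Obama, U.S. Senator, Illinois, 2008 Democratic nominee for U.S. President",
                "presidential candidacy announcement"],
    "2009-03": ["President Barack Obama", "Senator Barack Obama",
                "U.S. President Barack Obama", "44th President of the United States",
                "Obama Administration"],
}

for month, added in new_anchors(top_anchors).items():
    print(month, added)
```

Under that assumed pairing, the 03/2009 snapshot is the first in which presidential anchors enter the list, which is the kind of role change the slide highlights.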

Our Findings

Evolving information & context
• Role changes for political entities
• Geographic name changes for locations
• Trends or things in vogue for celebrities
• Products in demand for technology


Conclusions and Future Work

• Analysis of temporal anchor texts extracted from Wikipedia edits
• A temporal model for Wikipedia and a probabilistic ranking method
• A hook for tracing entity evolution: evolving information and context

Future work:
• A distinction between contemporaneous and historical events/entities
• Adding semantic information to anchor texts, e.g., temporal expressions


Thank you!

Questions or suggestions?

Evolution-Aware Entity-Based Enrichment and Indexing

Q1: How to link web archive content against multiple entity and event collections evolving over time?
Ioannou, E., Nejdl, W., Niederée, C. and Velegrakis, Y. 2011. LinkDB: A Probabilistic Linkage Database System. SIGMOD (New York, New York, USA, Jun. 2011)

Q2: How to maintain entity and event information and indexes for web-scale archives?
Papadakis, G., Ioannou, E., Niederée, C., Palpanas, T. and Nejdl, W. 2012. Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data. WSDM (New York, NY, USA, 2012), 53–62.
Papadakis, G., Ioannou, E., Palpanas, T., Niederée, C. and Nejdl, W. 2012. A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces. TKDE. (2012).

Huge and Heterogeneous Information Spaces

Voluminous, (semi-)structured datasets
• DBPedia 3.4: 36.5 million triples and 2.1 million entities
• BTC09: 1.15 billion triples and 182 million entities

Users are free to insert not only attribute values but also attribute names, which leads to high levels of heterogeneity
• DBPedia 3.4: 50,000 attribute names
• Google Base: 100,000 schemata and 10,000 entity types

A large portion of the data stems from automatic information extraction, which brings noise and tag-style values

… and this involves neither time nor entity evolution …

Temporal Retrieval and Ranking

Q3: How to support time-sensitive and entity-based query formulation?
Kanhabua, N. and Nørvåg, K. 2010. Exploiting time-based synonyms in searching document archives. JCDL (New York, New York, USA, Jun. 2010)
Nguyen, T., and Kanhabua, N. 2014. Leveraging dynamic query subtopics for time-aware search result diversification. ECIR (Amsterdam, April 2014)

Q4: How to improve result ranking and clustering for time-sensitive and entity-based queries?
Kanhabua, N., Blanco, R. and Matthews, M. 2011. Ranking related news predictions. SIGIR (New York, New York, USA, Jul. 2011)
G. Demartini, C. Firan, T. Iofciu, R. Krestel, W. Nejdl: Why finding entities in Wikipedia is difficult, sometimes. Inf. Retr. 13(5): 534-567 (2010)

ALEXANDRIA Visions and Research Questions
