34
Terms in Time and Times in Context: A Graph-based Term-Time Ranking Model Andreas Spitz , Jannik Str¨ otgen, Thomas B¨ ogel and Michael Gertz Heidelberg University Institute of Computer Science Database Systems Research Group http://dbs.ifi.uni-heidelberg.de [email protected] 5th Temporal Web Analytics Workshop Florence, May 18, 2015

Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Terms in Time and Times in Context:A Graph-based Term-Time Ranking Model

Andreas Spitz, Jannik Strotgen,Thomas Bogel and Michael Gertz

Heidelberg UniversityInstitute of Computer Science

Database Systems Research Grouphttp://dbs.ifi.uni-heidelberg.de

[email protected]

5th Temporal Web Analytics WorkshopFlorence, May 18, 2015

Page 2: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

What happened on June 15, 1215?

A simple question.How simple is the answer?

Terms in Time and Times in Context Andreas Spitz 1 of 24

Page 3: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

With structured data:quite simple

Based on unstructured text data:much more challenging

Terms in Time and Times in Context Andreas Spitz 2 of 24

Page 4: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Data Set and Approach

A corpus of all English Wikipedia articles:

• Only text is considered, no info-boxes

• 3, 079, 620 documents with time expressions

Problem statement, given such a corpus:

• Extract and normalize temporal expressions (dates)

• Find key terms that best summarize a given date

Terms in Time and Times in Context Andreas Spitz 3 of 24

Page 5: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Outline

Outline of the approach:

• Represent date-term co-occurrences efficiently

• Extract and normalize temporal expressions (dates)• Extract content words that co-occur with dates• Generate an efficient data structure

• Based on this representation

• Identify relevant terms for any given date• Identify similar dates for any given date

• Example applications

Terms in Time and Times in Context Andreas Spitz 4 of 24

Page 6: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Extraction of Temporal Expressions

• Normalization, e.g., May 18, 2015 → 2015-05-18

• Handling relative temporal expressions, e.g., in May

• Considering the document type

Source: Strotgen, Gertz Multilingual and Cross-domain Temporal Tagging (2013)

Terms in Time and Times in Context Andreas Spitz 5 of 24

Page 7: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Coverage of Dates

We use a combination of dates of three granularities:

• YYYY-MM-DD (day)

• YYYY-MM (month)

• YYYY (year)

Percentage of dates that are included in the data per year

●●●

●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●

●●●

●●

●●●●●

●●●●●

●●

●●

●●

●●

●●●●●●

●●●●●●●●●●●●

●●

●●●

●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●● ●

●●

●●

●●●●

●●●

●● ●

●●

●●

●●●●●

●●●

●●●●

●●

●●

●●

●●

●●

●●●

●●●

●●●●●

●●●

●●●

●●●

●●

●●●

●●●

●●

●●●

●●●

●●●●●●●

●●●●●●●●●

●●

●●

●●●●●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●●●●

●●

●●●

●●●

●●

●●●●

●●●

●●

●●●

●●

●●●

●●●●

●●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●●

●●

● ●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●● ●

●●●

●●

●●

●●●●●●

●● ●

●●

●●●●

●●●

●●

●●●

●●●

●●●

●●●● ●

●●●

●●●●●

●●

●●

●●

●●●●●●●● ●●●●●

●●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●●

●●

●●●

●●●

●●●●●●●●●●

●●

●●

●●●●

●●

●●

●●●●

●●

●●●●

●●●

●●●●●

●●

●●●

●●●●

●●

●●●

●●

●●

●●

●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●

●●●

●●

●●●●●●

●●

●●

●●●●●●

●●

●●

●●

●●●●●

●●

25

50

75

100

1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000

year

cove

rage

in %

Terms in Time and Times in Context Andreas Spitz 6 of 24

Page 8: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Extraction of Terms and Representation

For all sentences s in any Wikipedia document:

Terms in Time and Times in Context Andreas Spitz 7 of 24

Page 9: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Extraction of Terms and Representation

Identify/normalize dates and remove stop words

Terms in Time and Times in Context Andreas Spitz 7 of 24

Page 10: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Extraction of Terms and Representation

Create a bipartite graph Gs = (Ts ∪Ds, Es) with weights ωs

Terms in Time and Times in Context Andreas Spitz 7 of 24

Page 11: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Extraction of Terms and Representation

Satisfy the inclusion condition for dates

Terms in Time and Times in Context Andreas Spitz 7 of 24

Page 12: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Extraction of Terms and Representation

Satisfy the inclusion condition for dates

Terms in Time and Times in Context Andreas Spitz 7 of 24

Page 13: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Graph aggregation

Aggregate the sentence-graphs Gs:

• T :=⋃Ts

• D :=⋃Ds

• E :=⋃Es

• ω(e) :=∑

ωs(e)

We obtain G = (T ∪D,E, ω) with:

• |T | = 3, 748, 730 terms

• |D| = 210, 375 dates

• |E| = 110, 639, 525 edges

Terms in Time and Times in Context Andreas Spitz 8 of 24

Page 14: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Formalising the Question

What happened on June 15, 1215?

Which terms in the graph co-occurin a significant manner with the date 1215-06-15?

Terms in Time and Times in Context Andreas Spitz 9 of 24

Page 15: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Ranking

We need a ranking-function from dates D to a list of terms T

• r : D → R|T |

• r(d) := ranking of terms t ∈ T by their significance for d

Idea: adapt tf-idf to the bipartite graph

tf -idf := tf · log 1

df

• tf : frequency of term in document

• df : fraction of documents that contain the term

Terms in Time and Times in Context Andreas Spitz 10 of 24

Page 16: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Ranking

We need a ranking-function from dates D to a list of terms T

• r : D → R|T |

• r(d) := ranking of terms t ∈ T by their significance for d

Idea: adapt tf-idf to the bipartite graph

tf -idf := tf · log 1

df

• tf : frequency of term in document

• df : fraction of documents that contain the term

Terms in Time and Times in Context Andreas Spitz 10 of 24

Page 17: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Adapting tf-idf

How does this relate to the graph?

• Identify dates with documents, i.e., dates contain terms

• Term frequency given by edge weights:tf(d, t) ≈ ω(d, t)

• Inverse document frequency given by number of neighbours:idf(t) ≈ |D|

deg(t)

tf -idf := tf · log 1

df⇒ tf -idf(d, t) := ω(d, t) log

|D|deg(t)

Terms in Time and Times in Context Andreas Spitz 11 of 24

Page 18: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

June 15, 1215

Terms in Time and Times in Context Andreas Spitz 12 of 24

Page 19: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

June 15, 1215

Terms in Time and Times in Context Andreas Spitz 12 of 24

Page 20: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

A Ranking for Dates

Ranking dates by term works analogously:

tf -idf(t, d) := ω(t, d) log|T |

deg(d)

Terms in Time and Times in Context Andreas Spitz 13 of 24

Page 21: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

A Ranking for Dates

Ranking dates by term works analogously:

tf -idf(t, d) := ω(t, d) log|T |

deg(d)

Terms in Time and Times in Context Andreas Spitz 13 of 24

Page 22: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Ranking Nodes by Similarity Within a Set

Can we...

• ... create a ranking for dates by dates?

• ... or for terms by terms?

Formally this is a one-mode projection of the bipartite graph:

• Reduce graph to a single set of nodes T or D

• Connect nodes that share neighbours in the bipartite graph

• This results in a very dense graph

⇒ How can we identify relevant edges in the projection?

Terms in Time and Times in Context Andreas Spitz 14 of 24

Page 23: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Ranking Nodes by Similarity Within a Set

Can we...

• ... create a ranking for dates by dates?

• ... or for terms by terms?

Formally this is a one-mode projection of the bipartite graph:

• Reduce graph to a single set of nodes T or D

• Connect nodes that share neighbours in the bipartite graph

• This results in a very dense graph

⇒ How can we identify relevant edges in the projection?

Terms in Time and Times in Context Andreas Spitz 14 of 24

Page 24: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Cosine Similarity of Adjacency Vectors

In a lesson from Collaborative Filtering :use a cosine similarity of adjacency vectors

cos(ta, tb) :=

∑tai · tbi√∑

t2ai ·√∑

t2bi

Terms in Time and Times in Context Andreas Spitz 15 of 24

Page 25: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Evaluation

Ground truth: U.S. Election Days (1848 - 2013)

• Recurs annually

• Always on Tuesday after the first Monday in November(Nov 2 - Nov 8)

• Every four years: presidential election

Expectation:

• For a given election day, election days in other years areranked highly

• For presidential election days, other presidential election daysare ranked highly

Terms in Time and Times in Context Andreas Spitz 16 of 24

Page 26: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Evaluation

Ground truth: U.S. Election Days (1848 - 2013)

• Recurs annually

• Always on Tuesday after the first Monday in November(Nov 2 - Nov 8)

• Every four years: presidential election

Expectation:

• For a given election day, election days in other years areranked highly

• For presidential election days, other presidential election daysare ranked highly

Terms in Time and Times in Context Andreas Spitz 16 of 24

Page 27: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Precision at k

0.3

0.4

0.5

0.6

1 20 40 60 80 100

k

prec

isio

n

election day presidential general

preck := |Election days in top k ranks|k

Terms in Time and Times in Context Andreas Spitz 17 of 24

Page 28: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Area Under the ROC Curve

●●●●●

●●●

●●●●

●●●●●●

●●●●

●●●

●●●●

●●●●●●●

●●●

●●

●●

0.8

0.9

1.0

1850 1900 1950 2000

year

AU

C

election day ● general presidential

Terms in Time and Times in Context Andreas Spitz 18 of 24

Page 29: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Practical Application: Hot Spots & Key Players

Here: approximation of countries’ activity during given months

For each European country c,

• define its name, e.g. tn(c) = italy,

• define the countries adjectival form, e.g. ta(c) = italian,

• compute individual tf-idf scores for terms and combine.

act(c, d) :=tf -idf(d, tn(c)) + tf -idf(d, ta(c))

max[tf -idf(d, ·)]

Terms in Time and Times in Context Andreas Spitz 19 of 24

Page 30: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Activity by Country During World War II

German occupation

of Czechoslovakia

German invasion

of Poland

Soviet invasion

of Finland

German assault

on Norway

German invasion

of France

Battle of Britain and

Greco−Italian War

March 1939 September 1939 November 1939

April 1940 May 1940 October 1940

0.0

0.5

1.0

1.5activity

Terms in Time and Times in Context Andreas Spitz 20 of 24

Page 31: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Activity by Country During World War II (2)

Invasion of Yugoslavia

and Greece

German offensive

against the Soviets

Allied invasion

of Italy

Allied forces land in

Normandy (DDay)

Liberation of Paris

German Capitulation

April 1941 June 1941 September 1943

June 1944 August 1944 May 1945

0.5

1.0

1.5

activity

Terms in Time and Times in Context Andreas Spitz 21 of 24

Page 32: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Summary

Approach:

• Extract dates and terms from unstructured text

• Construct a bipartite date-term graph

• Allows ranking dates / terms according to co-occurrences

Benefits:

• Simple measures already yield good results

• Efficient: 4GB Memory and real-time queries

• Flexibility of ranking methods

Terms in Time and Times in Context Andreas Spitz 22 of 24

Page 33: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Ongoing Work

Terms in Time and Times in Context Andreas Spitz 23 of 24

Page 34: Terms in Time and Times in Context: A Graph-based Term ... · 5/18/2015  · Activity by Country During World War II German occupation of Czechoslovakia German invasion of Poland

Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary

Thank you!

Questions?

Terms in Time and Times in Context Andreas Spitz 24 of 24