Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Terms in Time and Times in Context:A Graph-based Term-Time Ranking Model
Andreas Spitz, Jannik Strotgen,Thomas Bogel and Michael Gertz
Heidelberg UniversityInstitute of Computer Science
Database Systems Research Grouphttp://dbs.ifi.uni-heidelberg.de
5th Temporal Web Analytics WorkshopFlorence, May 18, 2015
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
What happened on June 15, 1215?
A simple question.How simple is the answer?
Terms in Time and Times in Context Andreas Spitz 1 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
With structured data:quite simple
Based on unstructured text data:much more challenging
Terms in Time and Times in Context Andreas Spitz 2 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Data Set and Approach
A corpus of all English Wikipedia articles:
• Only text is considered, no info-boxes
• 3, 079, 620 documents with time expressions
Problem statement, given such a corpus:
• Extract and normalize temporal expressions (dates)
• Find key terms that best summarize a given date
Terms in Time and Times in Context Andreas Spitz 3 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Outline
Outline of the approach:
• Represent date-term co-occurrences efficiently
• Extract and normalize temporal expressions (dates)• Extract content words that co-occur with dates• Generate an efficient data structure
• Based on this representation
• Identify relevant terms for any given date• Identify similar dates for any given date
• Example applications
Terms in Time and Times in Context Andreas Spitz 4 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Extraction of Temporal Expressions
• Normalization, e.g., May 18, 2015 → 2015-05-18
• Handling relative temporal expressions, e.g., in May
• Considering the document type
Source: Strotgen, Gertz Multilingual and Cross-domain Temporal Tagging (2013)
Terms in Time and Times in Context Andreas Spitz 5 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Coverage of Dates
We use a combination of dates of three granularities:
• YYYY-MM-DD (day)
• YYYY-MM (month)
• YYYY (year)
Percentage of dates that are included in the data per year
●
●
●
●●●
●●●
●
●●●●●●●●●●
●
●●●●●●●●
●
●●●●●●●●●●
●●●
●●
●
●
●
●
●
●●●●●
●
●●●●●
●
●●
●●
●●
●●
●●●●●●
●
●
●●●●●●●●●●●●
●●
●●●
●
●●
●
●●
●
●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●● ●
●
●
●
●
●
●
●
●●
●●
●
●●●●
●
●●●
●● ●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●●●●●
●●●
●
●
●●●●
●
●
●
●●
●
●
●
●●
●●
●
●
●
●●
●●
●
●●●
●
●
●
●
●
●●●
●
●●●●●
●
●
●
●
●
●
●●●
●
●●●
●
●●●
●●
●●●
●●●
●●
●
●
●
●
●
●●●
●
●
●●●
●
●●●●●●●
●
●
●●●●●●●●●
●
●●
●●
●●●●●●
●
●●●
●●
●●
●
●
●
●
●
●
●●
●●
●●
●
●
●●●
●
●
●
●●
●●●
●
●
●
●●●
●
●
●●
●●
●
●
●●
●
●
●●
●●
●
●●●●●
●
●
●●
●
●●●
●
●
●
●●●
●●
●
●
●
●●●●
●
●
●
●
●
●●●
●
●
●●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●●
●●●
●
●
●
●
●●●●
●
●●●
●
●●
●
●●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●●
●●
●
●
●
●●
●
●
●
●●
●
●●
●
●●
●
●●●
●●
●
●
●●●
●
●
●●
●
●●
●●●
●●
●
●
● ●
●
●
●
●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●●●
●
●
●●
●
●●
●●
●
●●
●
●
●● ●
●●●
●
●
●
●
●
●●
●●
●
●●●●●●
●
●
●
●● ●
●
●
●
●●
●●●●
●
●
●
●●●
●
●●
●
●
●
●
●
●
●●●
●
●
●
●
●
●●●
●
●●●
●●●● ●
●
●●●
●
●
●●●●●
●
●●
●
●
●
●●
●●
●●●●●●●● ●●●●●
●
●
●●●
●
●
●●●
●●
●
●●●
●
●
●
●●
●
●●
●●
●
●
●
●
●●
●●
●
●
●
●
●●
●●
●
●
●●
●
●
●
●
●
●●
●
●
●
●●●●
●●
●●●
●
●●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●●●
●
●
●●●●●●●●●●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●●●●
●●
●
●●
●
●●●●
●
●
●●
●
●
●
●
●●●●
●
●
●
●
●
●●●
●
●
●
●
●●●●●
●
●
●
●
●
●
●●
●
●
●●●
●
●
●
●
●
●
●●●●
●
●
●
●●
●
●●●
●
●●
●
●●
●●
●
●●
●
●●
●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●
●
●
●
●
●●●
●
●
●
●
●●
●
●●●●●●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●●●●●●
●
●●
●
●
●
●
●
●
●●
●●
●
●
●●●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
25
50
75
100
1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000
year
cove
rage
in %
Terms in Time and Times in Context Andreas Spitz 6 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Extraction of Terms and Representation
For all sentences s in any Wikipedia document:
Terms in Time and Times in Context Andreas Spitz 7 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Extraction of Terms and Representation
Identify/normalize dates and remove stop words
Terms in Time and Times in Context Andreas Spitz 7 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Extraction of Terms and Representation
Create a bipartite graph Gs = (Ts ∪Ds, Es) with weights ωs
Terms in Time and Times in Context Andreas Spitz 7 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Extraction of Terms and Representation
Satisfy the inclusion condition for dates
Terms in Time and Times in Context Andreas Spitz 7 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Extraction of Terms and Representation
Satisfy the inclusion condition for dates
Terms in Time and Times in Context Andreas Spitz 7 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Graph aggregation
Aggregate the sentence-graphs Gs:
• T :=⋃Ts
• D :=⋃Ds
• E :=⋃Es
• ω(e) :=∑
ωs(e)
We obtain G = (T ∪D,E, ω) with:
• |T | = 3, 748, 730 terms
• |D| = 210, 375 dates
• |E| = 110, 639, 525 edges
Terms in Time and Times in Context Andreas Spitz 8 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Formalising the Question
What happened on June 15, 1215?
Which terms in the graph co-occurin a significant manner with the date 1215-06-15?
Terms in Time and Times in Context Andreas Spitz 9 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Ranking
We need a ranking-function from dates D to a list of terms T
• r : D → R|T |
• r(d) := ranking of terms t ∈ T by their significance for d
Idea: adapt tf-idf to the bipartite graph
tf -idf := tf · log 1
df
• tf : frequency of term in document
• df : fraction of documents that contain the term
Terms in Time and Times in Context Andreas Spitz 10 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Ranking
We need a ranking-function from dates D to a list of terms T
• r : D → R|T |
• r(d) := ranking of terms t ∈ T by their significance for d
Idea: adapt tf-idf to the bipartite graph
tf -idf := tf · log 1
df
• tf : frequency of term in document
• df : fraction of documents that contain the term
Terms in Time and Times in Context Andreas Spitz 10 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Adapting tf-idf
How does this relate to the graph?
• Identify dates with documents, i.e., dates contain terms
• Term frequency given by edge weights:tf(d, t) ≈ ω(d, t)
• Inverse document frequency given by number of neighbours:idf(t) ≈ |D|
deg(t)
tf -idf := tf · log 1
df⇒ tf -idf(d, t) := ω(d, t) log
|D|deg(t)
Terms in Time and Times in Context Andreas Spitz 11 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
June 15, 1215
Terms in Time and Times in Context Andreas Spitz 12 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
June 15, 1215
Terms in Time and Times in Context Andreas Spitz 12 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
A Ranking for Dates
Ranking dates by term works analogously:
tf -idf(t, d) := ω(t, d) log|T |
deg(d)
Terms in Time and Times in Context Andreas Spitz 13 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
A Ranking for Dates
Ranking dates by term works analogously:
tf -idf(t, d) := ω(t, d) log|T |
deg(d)
Terms in Time and Times in Context Andreas Spitz 13 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Ranking Nodes by Similarity Within a Set
Can we...
• ... create a ranking for dates by dates?
• ... or for terms by terms?
Formally this is a one-mode projection of the bipartite graph:
• Reduce graph to a single set of nodes T or D
• Connect nodes that share neighbours in the bipartite graph
• This results in a very dense graph
⇒ How can we identify relevant edges in the projection?
Terms in Time and Times in Context Andreas Spitz 14 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Ranking Nodes by Similarity Within a Set
Can we...
• ... create a ranking for dates by dates?
• ... or for terms by terms?
Formally this is a one-mode projection of the bipartite graph:
• Reduce graph to a single set of nodes T or D
• Connect nodes that share neighbours in the bipartite graph
• This results in a very dense graph
⇒ How can we identify relevant edges in the projection?
Terms in Time and Times in Context Andreas Spitz 14 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Cosine Similarity of Adjacency Vectors
In a lesson from Collaborative Filtering :use a cosine similarity of adjacency vectors
cos(ta, tb) :=
∑tai · tbi√∑
t2ai ·√∑
t2bi
Terms in Time and Times in Context Andreas Spitz 15 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Evaluation
Ground truth: U.S. Election Days (1848 - 2013)
• Recurs annually
• Always on Tuesday after the first Monday in November(Nov 2 - Nov 8)
• Every four years: presidential election
Expectation:
• For a given election day, election days in other years areranked highly
• For presidential election days, other presidential election daysare ranked highly
Terms in Time and Times in Context Andreas Spitz 16 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Evaluation
Ground truth: U.S. Election Days (1848 - 2013)
• Recurs annually
• Always on Tuesday after the first Monday in November(Nov 2 - Nov 8)
• Every four years: presidential election
Expectation:
• For a given election day, election days in other years areranked highly
• For presidential election days, other presidential election daysare ranked highly
Terms in Time and Times in Context Andreas Spitz 16 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Precision at k
0.3
0.4
0.5
0.6
1 20 40 60 80 100
k
prec
isio
n
election day presidential general
preck := |Election days in top k ranks|k
Terms in Time and Times in Context Andreas Spitz 17 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Area Under the ROC Curve
●●●●●
●●●
●
●●●●
●
●
●
●●●●●●
●●●●
●●●
●
●●●●
●
●●●●●●●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
0.8
0.9
1.0
1850 1900 1950 2000
year
AU
C
election day ● general presidential
Terms in Time and Times in Context Andreas Spitz 18 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Practical Application: Hot Spots & Key Players
Here: approximation of countries’ activity during given months
For each European country c,
• define its name, e.g. tn(c) = italy,
• define the countries adjectival form, e.g. ta(c) = italian,
• compute individual tf-idf scores for terms and combine.
act(c, d) :=tf -idf(d, tn(c)) + tf -idf(d, ta(c))
max[tf -idf(d, ·)]
Terms in Time and Times in Context Andreas Spitz 19 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Activity by Country During World War II
German occupation
of Czechoslovakia
German invasion
of Poland
Soviet invasion
of Finland
German assault
on Norway
German invasion
of France
Battle of Britain and
Greco−Italian War
March 1939 September 1939 November 1939
April 1940 May 1940 October 1940
0.0
0.5
1.0
1.5activity
Terms in Time and Times in Context Andreas Spitz 20 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Activity by Country During World War II (2)
Invasion of Yugoslavia
and Greece
German offensive
against the Soviets
Allied invasion
of Italy
Allied forces land in
Normandy (DDay)
Liberation of Paris
German Capitulation
April 1941 June 1941 September 1943
June 1944 August 1944 May 1945
0.5
1.0
1.5
activity
Terms in Time and Times in Context Andreas Spitz 21 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Summary
Approach:
• Extract dates and terms from unstructured text
• Construct a bipartite date-term graph
• Allows ranking dates / terms according to co-occurrences
Benefits:
• Simple measures already yield good results
• Efficient: 4GB Memory and real-time queries
• Flexibility of ranking methods
Terms in Time and Times in Context Andreas Spitz 22 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Ongoing Work
Terms in Time and Times in Context Andreas Spitz 23 of 24
Motivation Co-ooccurrence Graphs Term-Ranking Projection Application Summary
Thank you!
Questions?
Terms in Time and Times in Context Andreas Spitz 24 of 24