Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams Alexander Kotov, ChengXiang Zhai, Richard Sproat University of Illinois at Urbana-Champaign

Page 1: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Alexander Kotov, ChengXiang Zhai, Richard Sproat

University of Illinois at Urbana-Champaign

Page 2: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Roadmap

• Problem definition
• Previous work
• Approach
• Experiments
• Summary

Page 3: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Motivation

Web data is generated by a large number of textual streams (news, blogs, tweets, etc.)

Bursts of entity mentions (people, locations) correspond to a particular event

Bursts of entity mentions are influenced by bursts of other entities

Intuition: bursts of semantically related entities should be temporally correlated

Page 4: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Problem definition

[Figure: mention counts over time (t_0 to t_T) for two entities, entity 1 and entity 2. Annotations mark three obstacles to deciding whether the two streams are correlated: sparsity, magnitude, and time lag.]

Page 5: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Temporally correlated bursts

Problem: given a collection of textual streams, discover named entities with correlated bursts

Provide multilingual summaries of real life events

Estimate social impact of a particular event in different countries

Differentiate between local and global events

Discover transliterations of named entities

Page 6: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Roadmap

• Problem definition
• Previous work
• Approach
• Experiments
• Summary

Page 7: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Previous work

Burst detection:
• infinite-state automaton (Kleinberg ’02)
• factorial HMMs (Krause ’06)
• wavelet transformation (Zhu ’03)

Stream correlation:
• distance-based measures: Pearson coefficient (Chien ’05)
• singular spectrum transformation (Ide ’05)
• topic-based (PLSA, LDA) (Wang ’09)

Page 8: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Previous work

Smoothing is efficient for large amounts of data, but not precise

Existing methods do not abstract away from the raw data

Distance-based measures suffer from magnitude and sparsity problems

Temporal lags are not considered
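
The point about temporal lags can be illustrated with a small sketch. The two count streams below are made-up toy data, not taken from the dataset in this talk, and `pearson` is a plain textbook implementation written for this illustration. Both streams contain the same single burst, shifted by one time step.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy mention counts (hypothetical): one burst per stream.
entity_1 = [0, 0, 10, 0, 0, 0]
entity_2 = [0, 0, 0, 10, 0, 0]  # the same burst, one step later

same_day = pearson(entity_1, entity_1)  # burst aligned with itself
lagged = pearson(entity_1, entity_2)    # burst shifted by one step

print(f"aligned: {same_day:.2f}, lagged by one step: {lagged:.2f}")
```

A one-step lag is enough to turn a perfect correlation into a slightly negative one (about -0.2 here), even though the two bursts plausibly describe the same event.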

Page 9: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Roadmap

• Problem definition
• Previous work
• Approach
• Experiments
• Summary

Page 10: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Approach

Difference in magnitude: normalization with Markov Modulated Poisson Process

Temporal lag: flexible alignment of bursts using dynamic programming

Page 11: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Markov-Modulated Poisson Process

• Ergodic Markov chain over a finite number of states

• Each state is associated with a Poisson distribution

• “Burstiness” of a state is represented by the intensity parameter of its Poisson distribution

• States are labeled by the rank of the intensity parameter
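
One way to picture the state labeling is the sketch below. It is an illustration under several assumptions, not the authors' implementation: it uses a discrete-time hidden Markov model with Poisson emissions as a stand-in for the MMPP, the intensities in `rates` and the self-transition probability `stay` are hypothetical fixed values rather than fitted parameters, and the most likely state sequence is found with standard Viterbi decoding. The slides do not say how the model is estimated or decoded.

```python
import math


def poisson_logpmf(k, lam):
    """Log-probability of observing count k under Poisson(lam)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def decode_states(counts, rates, stay=0.9):
    """Viterbi decoding for a Poisson-emission HMM.

    rates must be sorted in ascending order and contain at least two
    values, so that state index + 1 is the rank of its intensity.
    Returns one rank (1 = least bursty) per time step.
    """
    n = len(rates)
    move = (1.0 - stay) / (n - 1)  # assumed: uniform switch probability
    log_trans = [[math.log(stay if i == j else move) for j in range(n)]
                 for i in range(n)]
    score = [math.log(1.0 / n) + poisson_logpmf(counts[0], r) for r in rates]
    back = []
    for c in counts[1:]:
        prev, score, ptr = score, [], []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + log_trans[i][j])
            ptr.append(best)
            score.append(prev[best] + log_trans[best][j]
                         + poisson_logpmf(c, rates[j]))
        back.append(ptr)
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for ptr in reversed(back):  # follow back-pointers to the start
        state = ptr[state]
        path.append(state)
    path.reverse()
    return [s + 1 for s in path]


# Toy counts (hypothetical) with one clear burst in the middle.
states = decode_states([1, 2, 1, 30, 35, 28, 2, 1], rates=[2.0, 30.0])
print(states)
```

Because the output is a sequence of ranks rather than raw counts, two streams with very different mention volumes end up on the same small scale, which is the normalization effect the approach relies on.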

Page 12: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Normalization

[Figure: two mention-count streams over time, each shown together with the sequence of MMPP states (1 to 3) assigned to its time points. Row labels: mention counts, MMPP states.]

Page 13: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Normalization

• MMPP consistently outperforms the baseline

• The optimal performance is achieved when the number of states is 3

Page 14: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Burst alignment

Input:
- a pair of normalized MC (mention count) streams of a given length;
- a threshold for “bursty” states;
- a reward constant;
- a penalty function.

Output: a table
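
The recurrence that fills the table did not survive on this slide, so the following is only a sketch of one plausible dynamic program built from the listed inputs, not the authors' algorithm. The function name `align_bursts`, the scoring rule (reward minus a penalty on the time gap for each matched pair of bursty points), and the default values are all assumptions made for illustration.

```python
def align_bursts(a, b, threshold=2, reward=2.0,
                 penalty=lambda gap: float(gap * gap)):
    """Fill an alignment table for two normalized state streams.

    Assumed scoring: a point is bursty if its state is >= threshold.
    Matching a bursty point at time i in `a` with a bursty point at
    time j in `b` earns reward - penalty(|i - j|). Matches must keep
    time order, and any point may be left unmatched at no cost.
    table[i][j] is the best score using the first i points of `a`
    and the first j points of `b`.
    """
    n, m = len(a), len(b)
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(table[i - 1][j], table[i][j - 1])  # skip a point
            if a[i - 1] >= threshold and b[j - 1] >= threshold:
                gain = reward - penalty(abs(i - j))
                best = max(best, table[i - 1][j - 1] + gain)
            table[i][j] = best
    return table


# Toy state streams (hypothetical): the same burst, one step apart.
stream_1 = [1, 1, 3, 3, 1, 1]
stream_2 = [1, 1, 1, 3, 3, 1]
score = align_bursts(stream_1, stream_2)[-1][-1]
print(score)
```

Under this assumed scoring, a reward of 2 combined with a quadratic penalty makes a match profitable only when the gap is at most one time step, so bursts that are far apart contribute nothing to the score.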

Page 15: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Burst alignment

perfect alignment

exponential penalty

logarithmic penalty

Page 16: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Burst alignment

• A quadratic penalty function in combination with a reward constant of 2 is optimal

• The maximum permitted temporal gap is 1 day

Page 17: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Roadmap

• Problem definition
• Previous work
• Approach
• Experiments
• Summary

Page 18: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Dataset

News data crawled from RSS feeds over 4 months

Basic named entity recognition

Basic stemming

Page 19: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Correlated Bursts

Real-life events:

• Pattern 1: World Economic Forum in Davos, Switzerland, and death of actor Heath Ledger
• Pattern 2: death of Bobby Fischer
• Pattern 3: assassination of Benazir Bhutto
• Pattern 4: major trading loss incident at a French bank and death of George Habash

Page 20: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Mining transliterations

Static aligned corpora:
+ identical or semantically related contents
+ temporal topical alignment
- limited coverage

Web:
+ covers almost any domain
- difference in burst magnitude
- temporal lag between bursts

Page 21: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Transliteration

• MMPP+DP outperforms one baseline (CS) in all entropy categories and the other baseline (PC) for low- and medium-entropy (more “bursty”) entities

• The combination MMPP+DP performs better than MMPP alone

Page 22: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Roadmap

• Problem definition
• Previous work
• Approach
• Experiments
• Summary

Page 23: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Summary

Novel multi-stream text mining problem

Our approach can effectively discover correlated bursts corresponding to major and minor real-life events

Effective for unsupervised discovery of transliterations

Method is data independent and not limited to textual domain

Page 24: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Contributions

First method to use MMPP for burst detection in textual streams

Algorithm for temporally flexible stream correlation based on bursts

Unsupervised method for language-independent transliteration without any linguistic knowledge

Page 25: Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams

Future work

Applying proposed method to non-textual data (e.g., sensor streams)

Burst correlations between entities in different types of Web 2.0 data (news and tweets, news and blogs, news and tags, etc.)