Semantic Filtering An example of Semantic technologies for real-time analysis Pavan Kapanipathi Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis) Wright State University, USA Tutorial @ Kno.e.sis Centre: Semantics Approach to Big Data and Event Processing, Oct 7-9, 2015

Knoesis-Semantic filtering-Tutorials

Page 1: Knoesis-Semantic filtering-Tutorials

Semantic Filtering
An example of Semantic technologies for real-time analysis

Pavan Kapanipathi
Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis)
Wright State University, USA

Tutorial @ Kno.e.sis Centre: Semantics Approach to Big Data and Event Processing, Oct 7-9, 2015

Page 2: Knoesis-Semantic filtering-Tutorials

Streams are everywhere

Social Data: Text, Images, Videos

Sensor Data Streams

Page 3: Knoesis-Semantic filtering-Tutorials

Information Overload

500M users generate 500M tweets per day

It’s not information overload. It’s filter failure

-- Clay Shirky

Page 4: Knoesis-Semantic filtering-Tutorials

Each of our projects faces Information Overload

• Disaster Management
  • Hazards SEES
• Healthcare Issues
  • Depression
• Societal Issues
  • Edrug Trends
  • Harassment

Page 5: Knoesis-Semantic filtering-Tutorials

• Filtering is necessary

• Understanding the requirements and utilizing semantics for filtering is important

Semantic Filtering

Page 6: Knoesis-Semantic filtering-Tutorials

Two Main Topics

• Twarql
  • Streaming annotation and flexible querying on Twitter
• Continuous Semantics
  • Tracking dynamic topics on Twitter

Page 7: Knoesis-Semantic filtering-Tutorials

Twarql

Tracking the health care debate in the United States on social media

Health Care Reform

Healthcare reform legislation in the United States

Patient Protection and Affordable Care Act (Obamacare)

Page 8: Knoesis-Semantic filtering-Tutorials

Twarql

Page 9: Knoesis-Semantic filtering-Tutorials

Extraction Pipeline - Tweet

I think it’s good deal Apple Ipad Tablet (3G, wifi, WiFi + 3G) Hard Nylon Cube Carrying Case for ipad ( iPad.. http://bit.ly/cry6LF)

Dbpedia:Ipad

Dbpedia:Tablet

URLs

http://penguinkang.com/tweetprobe/

Page 10: Knoesis-Semantic filtering-Tutorials

RDF

• RDF Annotation
• Common RDF/OWL Data formats
  • FOAF, SIOC, OPO, MOAT
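A minimal sketch of what one annotated tweet could look like as RDF, written with plain Python and serialized as N-Triples. The tweet URI, the helper names and the choice of properties (`sioc:content`, `sioc:links_to`, `moat:taggedWith`, the `sioct:MicroblogPost` type) are assumptions for illustration; the slides only name the vocabularies, not the exact schema Twarql emits.

```python
# Namespace prefixes for the vocabularies named on the slide (SIOC, MOAT)
# plus DBpedia for the extracted entities.
SIOC = "http://rdfs.org/sioc/ns#"
SIOCT = "http://rdfs.org/sioc/types#"
MOAT = "http://moat-project.org/ns#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DBPEDIA = "http://dbpedia.org/resource/"


def annotate_tweet(tweet_uri, text, entities, urls):
    """Return (subject, predicate, object, is_literal) tuples for one tweet.

    The property choices are illustrative assumptions, not Twarql's
    confirmed output.
    """
    triples = [
        (tweet_uri, RDF_TYPE, SIOCT + "MicroblogPost", False),
        (tweet_uri, SIOC + "content", text, True),
    ]
    for entity in entities:
        triples.append((tweet_uri, MOAT + "taggedWith", DBPEDIA + entity, False))
    for url in urls:
        triples.append((tweet_uri, SIOC + "links_to", url, False))
    return triples


def to_ntriples(triples):
    """Serialize the tuples as N-Triples, escaping string literals."""
    lines = []
    for s, p, o, is_literal in triples:
        if is_literal:
            escaped = (
                o.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
            )
            obj = f'"{escaped}"'
        else:
            obj = f"<{o}>"
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


if __name__ == "__main__":
    example = annotate_tweet(
        "http://example.org/tweet/1",  # hypothetical tweet URI
        "Hard Nylon Cube Carrying Case for ipad",
        ["Ipad", "Tablet"],
        ["http://bit.ly/cry6LF"],
    )
    print(to_ntriples(example))
```

Once tweets are in this form, a structured query over entities and links becomes possible, which is the flexible querying the Twarql slides refer to.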

Page 11: Knoesis-Semantic filtering-Tutorials

: Health_care_reform

Twarql – Use Case

Page 12: Knoesis-Semantic filtering-Tutorials

Demo

http://knoesis.wright.edu/library/tools/twarql/demo.swf

Page 13: Knoesis-Semantic filtering-Tutorials

Continuous Semantics

Page 14: Knoesis-Semantic filtering-Tutorials

Dynamic Topics

Continuously evolving on Twitter

Entity – Event relevance changes

Many entities are involved

Page 15: Knoesis-Semantic filtering-Tutorials

Dynamic Topics

Manually crawl using keywords

“indianelection” “jan25” “sandy” “swineflu” “ebola”

Page 16: Knoesis-Semantic filtering-Tutorials

Dynamic Topics

Manually updating keywords to get topic-relevant tweets is not feasible

“indianelection” “modi” “bjp” “congress”

“jan25” “egypt” “tunisia” “arabspring”

“sandy” “newyork” “redcross” “fema”

“swineflu” “ebola”

Page 17: Knoesis-Semantic filtering-Tutorials

Problem

How can we automatically update the filters to track a dynamically evolving topic on Twitter?

Page 18: Knoesis-Semantic filtering-Tutorials

Hashtags as Filters

• Identify a topic on Twitter
• Tweets with hashtags are more informative
• Users have a lot of freedom to create them
• Some get popular, most die

Page 19: Knoesis-Semantic filtering-Tutorials

Exploring Hashtags as Evolving Filters for Dynamic Topics

HASHTAG FILTERS    Colorado Shooting (CS)    Occupy Wall Street (OWS)
Tweets             122,062                   6,077,378
Tags               192,512                   15,963,209
Distinct           12,350                    191,602
100% Retrieval     7,763                     21,314

Page 20: Knoesis-Semantic filtering-Tutorials

Hashtag distributions

The top 1% of hashtags retrieves around 85% of the tweets
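A small sketch of how such a coverage figure could be measured, assuming each tweet is available as a collection of its hashtags. The function name and the input shape are hypothetical; the 85% figure comes from the slide, not from this code.

```python
import math
from collections import Counter


def coverage_of_top_hashtags(tweets, top_fraction=0.01):
    """Fraction of tweets containing at least one of the most frequent hashtags.

    tweets: iterable of hashtag collections, one per tweet.
    top_fraction: share of distinct hashtags to keep (0.01 = top 1%).
    """
    tweets = [set(tags) for tags in tweets]
    counts = Counter(tag for tags in tweets for tag in tags)
    if not counts:
        return 0.0
    # Keep at least one hashtag so tiny samples still give a result.
    k = max(1, math.ceil(len(counts) * top_fraction))
    top = {tag for tag, _ in counts.most_common(k)}
    covered = sum(1 for tags in tweets if tags & top)
    return covered / len(tweets)


if __name__ == "__main__":
    sample = [{"#ows"}, {"#ows", "#nyc"}, {"#ows"}, {"#p2"}]
    print(coverage_of_top_hashtags(sample))
```

A skewed distribution like the one on the slide means a small set of prominent hashtags is enough to retrieve most of the event's tweets.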

Page 21: Knoesis-Semantic filtering-Tutorials

Hashtag Filters Co-occurrence Graph

Panels: Colorado Shooting, Occupy Wall Street

Event-related hashtags co-occur with each other

Page 22: Knoesis-Semantic filtering-Tutorials

Summarizing Hashtag Analysis

Starting from one event-relevant hashtag, we can reach other relevant hashtags by co-occurrence
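The idea of reaching related hashtags from a seed can be sketched as a breadth-first walk over a co-occurrence graph. This is an illustrative sketch only: the function names, the hop limit and the sample tweets are assumptions, not part of the original system.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_cooccurrence(tweets):
    """Count how often each pair of hashtags appears in the same tweet."""
    graph = defaultdict(lambda: defaultdict(int))
    for tags in tweets:
        for a, b in combinations(sorted(set(tags)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def reachable_hashtags(graph, seed, max_hops=2):
    """Hashtags reachable from the seed within max_hops co-occurrence edges."""
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        tag, hops = queue.popleft()
        if hops == max_hops:
            continue
        for neighbor in graph.get(tag, {}):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    seen.discard(seed)
    return seen


if __name__ == "__main__":
    tweets = [["#sandy", "#fema"], ["#fema", "#redcross"], ["#nyc"]]
    graph = build_cooccurrence(tweets)
    print(reachable_hashtags(graph, "#sandy", max_hops=2))
```

Reachability alone says nothing about relevance, which is why the later slides add a relevance check on top of co-occurrence.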

Page 23: Knoesis-Semantic filtering-Tutorials

Determining Relevancy of Co-occurring Hashtags

#indianelection2015

#modikisarkar

Too many co-occurring hashtags

Page 24: Knoesis-Semantic filtering-Tutorials

Determining Relevancy of Co-occurring Hashtags

#indianelection2015

#modikisarkar

Co-occurring: Threshold δ

Preferably a prominent hashtag
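One way to read the threshold δ is as a minimum co-occurrence rate with the seed hashtag. The slides do not define δ precisely, so the sketch below assumes it is the fraction of the seed's tweets in which a candidate also appears; the function name and default value are hypothetical.

```python
from collections import Counter


def candidate_hashtags(tweets, seed, delta=0.1):
    """Hashtags co-occurring with the seed in at least a delta share of its tweets.

    Returns (hashtag, co-occurrence rate) pairs, most prominent first.
    Treating delta as a rate is an assumption made for this sketch.
    """
    seed_tweets = [set(tags) for tags in tweets if seed in tags]
    if not seed_tweets:
        return []
    counts = Counter(
        tag for tags in seed_tweets for tag in tags if tag != seed
    )
    total = len(seed_tweets)
    passing = [
        (tag, count / total)
        for tag, count in counts.items()
        if count / total >= delta
    ]
    # Sort by rate (descending), then by name for a stable order.
    return sorted(passing, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    tweets = [
        {"#indianelection2015", "#modikisarkar"},
        {"#indianelection2015", "#modikisarkar"},
        {"#indianelection2015", "#cricket"},
        {"#indianelection2015"},
    ]
    print(candidate_hashtags(tweets, "#indianelection2015", delta=0.5))
```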

Page 25: Knoesis-Semantic filtering-Tutorials

Does hashtag co-occurrence work?

o No. Co-occurrence alone does not work
o Many noisy or unrelated hashtags co-occur
o Determine the “dynamic” relevance of the top co-occurring hashtag with the dynamic topic

Page 26: Knoesis-Semantic filtering-Tutorials

Determining Relevancy of Co-occurring Hashtags (Vector Space Model)

#indianelection2015

#modikisarkar

Co-occurring: Threshold δ

Latest K (200, 500)

Entity Extraction and Scoring: Normalized Frequency Scoring

Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2
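A rough sketch of the entity extraction and normalized frequency scoring step over the latest K tweets of a hashtag. Two assumptions are made here: entity spotting is replaced by a toy dictionary lookup (`entity_lexicon` is hypothetical), and the normalization divides by the number of tweets considered, which the slide does not spell out.

```python
from collections import Counter


def entity_scores(tweets, entity_lexicon, k=200):
    """Score entities by the share of the latest k tweets that mention them.

    tweets: tweet texts, oldest first.
    entity_lexicon: lowercase surface form -> entity name (toy spotter).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    latest = tweets[-k:]
    if not latest:
        return {}
    counts = Counter()
    for text in latest:
        lowered = text.lower()
        # Count each entity at most once per tweet.
        mentioned = {
            entity
            for surface, entity in entity_lexicon.items()
            if surface in lowered
        }
        counts.update(mentioned)
    return {entity: count / len(latest) for entity, count in counts.items()}


if __name__ == "__main__":
    lexicon = {"modi": "Narendra Modi", "bjp": "BJP", "congress": "Congress"}
    tweets = ["Modi wins", "BJP and Modi", "Congress rally", "nothing here"]
    print(entity_scores(tweets, lexicon, k=4))
```

The resulting dictionary plays the role of the hashtag's entity vector in the vector space model.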

Page 27: Knoesis-Semantic filtering-Tutorials

Determining Relevancy of Co-occurring Hashtags (Vector Space Model)

#indianelection2015

#modikisarkar

Co-occurring: Threshold δ

Latest K (200, 500)

Entity Extraction and Scoring

Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2

Indian General Election, 2014

Dynamically Updated Background Knowledge

Page 28: Knoesis-Semantic filtering-Tutorials

Event Relevant Background Knowledge

o Wikipedia Event Pages

Page 29: Knoesis-Semantic filtering-Tutorials

Event Relevant Background Knowledge

o Wikipedia Event Pages

Page 30: Knoesis-Semantic filtering-Tutorials

Event Relevant Background Knowledge

o Entities mentioned on the Event page of Wikipedia are relevant to the Event

Page 31: Knoesis-Semantic filtering-Tutorials

Event Relevant Background Knowledge – Graph Structure

o Wikipedia’s Hyperlink structure is very rich
o Page-Page (Wikipedia) links

Indian General Election, 2014
Narendra Modi
Rahul Gandhi
NDA (India)
UPA (India)
BJP
Indian National Congress

Page 32: Knoesis-Semantic filtering-Tutorials

Determining Relevancy of Co-occurring Hashtags (Vector Space Model)

#indianelection2015

#modikisarkar

Co-occurring: Threshold δ

Latest K (200, 500)

Entity Extraction and Scoring

Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2

Indian General Election, 2014

Extract, Periodically Update Hyperlink structure

One hop from Event Page

Page 33: Knoesis-Semantic filtering-Tutorials

Event Relevant Background Knowledge

o Hyperlink structure is dynamically updated

Indian General Election, 2014
Narendra Modi
Rahul Gandhi
NDA (India)
UPA (India)
BJP
Indian National Congress

Date shown: 10 May 2010

Page 34: Knoesis-Semantic filtering-Tutorials

Event Relevant Background Knowledge

o Hyperlink structure is dynamically updated

Indian General Election, 2014
Narendra Modi
Rahul Gandhi
NDA (India)
UPA (India)
BJP
Indian National Congress

Dates shown: 10 May 2010, 29 March 2013

Page 35: Knoesis-Semantic filtering-Tutorials

Event Relevant Background Knowledge

o Hyperlink structure is dynamically updated

Indian General Election, 2014
Narendra Modi
Rahul Gandhi
NDA (India)
UPA (India)
BJP
Indian National Congress

Dates shown: 10 May 2010, 29 March 2013, 20 May 2013
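A sketch of how a dynamically updated, one-hop hyperlink structure could be represented: each link carries the date it was first seen, and a snapshot is taken as of a given day. Which link belongs to which date is not recoverable from the slides, so the assignments in the example are made up for illustration, as are the function names.

```python
from datetime import date


def snapshot(timed_links, as_of):
    """Links (source, target) that existed on or before the given date."""
    return {(src, dst) for src, dst, added in timed_links if added <= as_of}


def one_hop(links, event_page):
    """Pages that link to, or are linked from, the event page."""
    outgoing = {dst for src, dst in links if src == event_page}
    incoming = {src for src, dst in links if dst == event_page}
    return outgoing | incoming


if __name__ == "__main__":
    event = "Indian General Election, 2014"
    # Hypothetical link dates; only the dates themselves appear on the slides.
    timed_links = [
        (event, "Indian National Congress", date(2010, 5, 10)),
        (event, "BJP", date(2013, 3, 29)),
        ("Narendra Modi", event, date(2013, 5, 20)),
    ]
    early = snapshot(timed_links, date(2013, 4, 1))
    print(sorted(one_hop(early, event)))
    late = snapshot(timed_links, date(2013, 5, 20))
    print(sorted(one_hop(late, event)))
```

Re-running the snapshot periodically lets the set of event-relevant entities grow as the Wikipedia page is edited.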

Page 36: Knoesis-Semantic filtering-Tutorials

Determining Relevancy of Co-occurring Hashtags (Vector Space Model)

#indianelection2015

#modikisarkar

Co-occurring: Threshold δ

Latest K (200, 500)

Entity Extraction and Scoring

Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2

Indian General Election, 2014

Extract, Periodically Update Hyperlink structure

One hop from Event Page

Entity scoring based on relevance to the Event

Page 37: Knoesis-Semantic filtering-Tutorials

Hyperlink Entity Scoring

o Edge Based Measure
o Link Overlap Measure: Jaccard similarity
o Out(c) are the links in Wikipedia page “c”
o Final Score: r(c,E) = ed(c,E) + oco(c,E)

Page pairs shown:
India General Election, 2014 and Narendra Modi
India General Election, 2014 and India General Election, 2009

Mutually Important

ed(c,E) = 1

ed(c,E) = 2
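The final score r(c,E) = ed(c,E) + oco(c,E) can be sketched as follows. The slide gives only the values 1 and 2 for ed and the label "Mutually Important", so this sketch assumes ed is 2 when the two pages link to each other and 1 when the link goes one way; oco is taken as the Jaccard similarity of the two pages' outgoing links. The `out_links` dictionary and page names are toy data.

```python
def edge_score(out_links, c, event):
    """ed(c, E): 2 for mutual links, 1 for a one-way link, 0 for none.

    The mapping of 2 to mutual links is an assumption of this sketch.
    """
    c_to_event = event in out_links.get(c, set())
    event_to_c = c in out_links.get(event, set())
    if c_to_event and event_to_c:
        return 2
    if c_to_event or event_to_c:
        return 1
    return 0


def link_overlap(out_links, c, event):
    """oco(c, E): Jaccard similarity of Out(c) and Out(E)."""
    out_c = out_links.get(c, set())
    out_e = out_links.get(event, set())
    union = out_c | out_e
    return len(out_c & out_e) / len(union) if union else 0.0


def relevance(out_links, c, event):
    """r(c, E) = ed(c, E) + oco(c, E)."""
    return edge_score(out_links, c, event) + link_overlap(out_links, c, event)


if __name__ == "__main__":
    out_links = {
        "Election 2014": {"Narendra Modi", "BJP", "Election 2009"},
        "Narendra Modi": {"Election 2014", "BJP"},
        "Election 2009": {"BJP"},
    }
    for page in ("Narendra Modi", "Election 2009"):
        print(page, relevance(out_links, page, "Election 2014"))
```

In this toy graph the mutually linked page scores higher than the page that is only linked one way, which matches the intent of the edge-based measure.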

Page 38: Knoesis-Semantic filtering-Tutorials

Determining Relevancy of Co-occurring Hashtags (Vector Space Model)

#indianelection2015

#modikisarkar

Co-occurring: Threshold δ

Latest K (200, 500)

Entity Extraction and Scoring

Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2

Indian General Election, 2014

Extract, Periodically Update Hyperlink structure

One hop from Event Page

Entity scoring based on relevance to the Event

Indian General Elec: 1.0, India: 0.9, Elections: 0.7, UPA: 0.6, BJP: 0.3, NDA: 0.3, Narendra Modi: 0.3

Page 39: Knoesis-Semantic filtering-Tutorials

Determining Relevancy of Co-occurring Hashtags (Vector Space Model)

#indianelection2015

#modikisarkar

Co-occurring: Threshold δ

Latest K (200, 500)

Entity Extraction and Scoring

Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2

Indian General Election, 2014

Extract, Periodically Update Hyperlink structure

One hop from Event Page

Entity scoring based on relevance to the Event

Indian General Elec: 1.0, India: 0.9, Elections: 0.7, UPA: 0.6, BJP: 0.3, NDA: 0.3, Narendra Modi: 0.3

Similarity Check

Relevance Score: 0.6

Page 40: Knoesis-Semantic filtering-Tutorials

Similarity Check

o Set Based
  o Jaccard Similarity
  o Considers the entities without the scores
o Vector Based
  o Symmetric
    o Cosine Similarity
  o Asymmetric
    o Subsumption Similarity
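The three kinds of similarity check can be compared on small entity vectors. Jaccard and cosine follow their standard definitions. The slides do not give a formula for subsumption similarity, so `subsumption` below is an assumed stand-in: the hashtag vector's weighted average of the event scores, which ignores event entities the hashtag never mentions. All function names are hypothetical.

```python
import math


def jaccard(h, e):
    """Set-based: overlap of entity sets, scores ignored."""
    union = set(h) | set(e)
    return len(set(h) & set(e)) / len(union) if union else 0.0


def cosine(h, e):
    """Symmetric vector-based similarity."""
    dot = sum(weight * e.get(entity, 0.0) for entity, weight in h.items())
    norm_h = math.sqrt(sum(w * w for w in h.values()))
    norm_e = math.sqrt(sum(w * w for w in e.values()))
    return dot / (norm_h * norm_e) if norm_h and norm_e else 0.0


def subsumption(h, e):
    """Asymmetric stand-in: how well the event vector e covers the hashtag vector h.

    This formula is an illustrative assumption, not the measure from the talk.
    """
    mass = sum(h.values())
    if not mass:
        return 0.0
    return sum(weight * e.get(entity, 0.0) for entity, weight in h.items()) / mass


if __name__ == "__main__":
    hashtag = {"Narendra Modi": 1.0}
    event = {"Narendra Modi": 1.0, "Rahul Gandhi": 1.0}
    print(jaccard(hashtag, event), cosine(hashtag, event))
    print(subsumption(hashtag, event), subsumption(event, hashtag))
```

In the example the hashtag vector mentions only one of the event's two entities: the symmetric measures drop below 1, while the asymmetric stand-in stays at 1 in the hashtag-to-event direction.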

Page 41: Knoesis-Semantic filtering-Tutorials

Intuition behind Asymmetric Similarity

India General Election 2014
Narendra Modi

Symmetric similarity: Penalized
Asymmetric similarity: Ignored

Page 42: Knoesis-Semantic filtering-Tutorials

Determining Relevancy of Co-occurring Hashtags (Vector Space Model)

#indianelection2015

#modikisarkar

Co-occurring: Threshold δ

Latest K (200, 500)

Entity Extraction and Scoring

Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2

Indian General Election, 2014

Extract, Periodically Update Hyperlink structure

One hop from Event Page

Entity scoring based on relevance to the Event

Indian General Elec: 1.0, India: 0.9, Elections: 0.7, UPA: 0.6, BJP: 0.3, NDA: 0.3, Narendra Modi: 0.3

Similarity Check

Relevance Score: 0.6

Page 43: Knoesis-Semantic filtering-Tutorials

Evaluation – Dataset

o 2 events
  o US Presidential Elections (#election2012)
  o Hurricane Sandy (#sandy)
o Top 25 co-occurring hashtags

Page 44: Knoesis-Semantic filtering-Tutorials

Evaluation – Strategy

o Ranking Problem
  o Rank the Top 25 hashtags based on the relevancy of tweets to the event
  o Experiment with all the similarity metrics
  o Manually annotated the tweets of these hashtags as relevant/irrelevant (Gold Standard)
o Ranking Evaluation Metrics
  o Mean Average Precision
  o NDCG
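For reference, the two ranking metrics can be sketched in a few lines using their textbook definitions. The inputs are lists of relevance labels in ranked order; average precision here assumes every relevant item appears somewhere in the ranked list, which fits a setting where a fixed set of hashtags is ranked. The sample labels are invented.

```python
import math


def average_precision(relevance):
    """Average of precision values at the ranks of relevant items (binary labels)."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of average precision over several ranked lists."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


def dcg(gains, k):
    """Discounted cumulative gain over the top k positions (log2 discount)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg(gains, k):
    """DCG normalized by the DCG of the ideal ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal else 0.0


if __name__ == "__main__":
    ranked_labels = [1, 0, 1, 1, 0]  # invented relevance labels
    print(average_precision(ranked_labels))
    print(ndcg(ranked_labels, k=5))
```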

Page 45: Knoesis-Semantic filtering-Tutorials

Evaluation

Page 46: Knoesis-Semantic filtering-Tutorials

Evaluation

Evaluated tweets containing the top relevant hashtags detected for dynamic topics
• NDCG - 92% at top-5 Mean Average Precision

Page 47: Knoesis-Semantic filtering-Tutorials

Conclusions

• Semantic Technologies for Real-time filtering of Social Data
  – Wikipedia as a Dynamic Knowledge base for events
  – Determining relevant hashtags using Asymmetric similarity measure
  – More hashtags in turn increase the coverage of Tweets for events
• Hashtag Analysis
  – Co-occurrence technique can be used to detect event relevant hashtags
  – More popular hashtags are easier to detect via co-occurrence

Page 48: Knoesis-Semantic filtering-Tutorials

Thanks

Contact: @[email protected]