Semantic Filtering
An example of Semantic technologies for real-time analysis
Pavan Kapanipathi
Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis)
Wright State University, USA
Tutorial @ Kno.e.sis Centre: Semantics Approach to Big Data and Event Processing, Oct 7-9, 2015
Streams are everywhere
Social Data: Text, Images, Videos
Sensor Data Streams
Information Overload
500M users generate 500M tweets per day
It’s not information overload. It’s filter failure
-- Clay Shirky
Each of our projects faces Information Overload
• Disaster Management
  • Hazards SEES
• Healthcare Issues
  • Depression
• Societal Issues
  • Edrug Trends
  • Harassment
• Filtering is necessary
• Understanding the requirements and utilizing semantics for filtering is important
Semantic Filtering
Two Main Topics
• Twarql
  • Streaming annotation and flexible querying on Twitter
• Continuous Semantics
  • Tracking dynamic topics on Twitter
Twarql
Tracking health care debate in
the United States on Social
Media
Health Care Reform
Health Care Reform
Healthcare reform legislation in the
United States
Patient Protection and Affordable Care
Act (Obamacare)
Health Care Reform
Twarql
Extraction Pipeline - Tweet
I think it’s good deal Apple Ipad Tablet (3G, wifi, WiFi + 3G) Hard Nylon Cube Carrying Case for ipad ( iPad.. http://bit.ly/cry6LF)
Dbpedia:Ipad
Dbpedia:Tablet
URLs
http://penguinkang.com/tweetprobe/
RDF
• RDF Annotation
• Common RDF/OWL Data formats
  • FOAF, SIOC, OPO, MOAT
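A minimal sketch of what an RDF annotation of the example tweet could look like, serialized as N-Triples. The tweet URI, the DBpedia resource URIs, and the choice of properties (sioc:content, sioc:links_to, moat:taggedWith) are assumptions made for illustration; the slides only name the vocabularies, not the exact schema.

```python
# Illustrative only: the tweet URI and the choice of properties are assumptions,
# not a schema confirmed by the slides.
SIOC = "http://rdfs.org/sioc/ns#"
MOAT = "http://moat-project.org/ns#"
DBPEDIA = "http://dbpedia.org/resource/"


def annotate_tweet(tweet_uri, text, entities, urls):
    """Return N-Triples lines describing one tweet and its annotations."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    triples = [f'<{tweet_uri}> <{SIOC}content> "{escaped}" .']
    for entity in entities:
        # One triple per entity spotted in the tweet text.
        triples.append(f"<{tweet_uri}> <{MOAT}taggedWith> <{DBPEDIA}{entity}> .")
    for url in urls:
        # One triple per URL extracted from the tweet.
        triples.append(f"<{tweet_uri}> <{SIOC}links_to> <{url}> .")
    return triples


triples = annotate_tweet(
    "http://example.org/tweet/1",  # hypothetical identifier
    "I think it's good deal Apple Ipad Tablet",
    ["Ipad", "Tablet"],
    ["http://bit.ly/cry6LF"],
)
print("\n".join(triples))
```

Once tweets are expressed as triples like these, a streaming query can select them by entity rather than by keyword.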
: Health_care_reform
Twarql – Use Case
Demo
http://knoesis.wright.edu/library/tools/twarql/demo.swf
Continuous Semantics
Dynamic Topics
Continuously Evolving on
Entity – Event relevance changes
Many entities are involved
Dynamic Topics
Manually crawl using keywords
“indianelection” “jan25” “sandy” “swineflu” “ebola”
Dynamic Topics
Manually updating keywords to get topic relevant tweets is
not feasible
“indianelection”: “modi” “bjp” “congress”
“jan25”: “egypt” “tunisia” “arabspring”
“sandy”: “newyork” “redcross” “fema”
“swineflu” “ebola”
Problem
How can we automatically update the filters to track a dynamically evolving topic on Twitter?
Hashtags as Filters
• Identify a topic on Twitter
• Tweets with hashtags are more informative
• Users have a lot of freedom to create them
• Some get popular, most die
Exploring Hashtags as Evolving Filters for Dynamic Topics
Dataset                    Tweets      Tags         Distinct  100% Retrieval
Colorado Shooting (CS)     122,062     192,512      12,350    7,763
Occupy Wall Street (OWS)   6,077,378   15,963,209   191,602   21,314
HASHTAG FILTERS
The top 1% of hashtags retrieves around 85% of the tweets
Hashtag distributions
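A rough sketch of how such a coverage figure could be computed: keep the most frequent fraction of distinct hashtags as filters and count the tweets that contain at least one of them. The function name and the toy data are hypothetical and are not the Colorado Shooting or Occupy Wall Street datasets.

```python
from collections import Counter


def coverage_of_top_hashtags(tweets, top_fraction):
    """Fraction of tweets containing at least one of the most frequent hashtags.

    tweets: list of sets of hashtags, one set per tweet.
    top_fraction: share of distinct hashtags to keep as filters (e.g. 0.01).
    """
    counts = Counter(tag for tags in tweets for tag in tags)
    keep = max(1, int(len(counts) * top_fraction))
    filters = {tag for tag, _ in counts.most_common(keep)}
    retrieved = sum(1 for tags in tweets if tags & filters)
    return retrieved / len(tweets)


# Toy data, made up for illustration.
toy_tweets = [
    {"#ows"}, {"#ows", "#nyc"}, {"#ows", "#occupy"}, {"#occupy"}, {"#random"},
]
coverage = coverage_of_top_hashtags(toy_tweets, 0.25)
print(coverage)
```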
Colorado Shooting Occupy Wall Street
Event Related Hashtags co-occur
with each other
Hashtag Filters Co-occurrence Graph
Summarizing Hashtag Analysis
Starting with one of the event relevant hashtags, by co-occurrence we can
reach other relevant hashtags
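The idea of reaching further hashtags from one seed can be sketched as a simple co-occurrence count. This is a minimal illustration on made-up tweets, with hypothetical names; it does not yet judge whether a co-occurring hashtag is relevant.

```python
from collections import Counter


def cooccurring_hashtags(tweets, seed):
    """Count hashtags that appear in the same tweet as the seed hashtag."""
    counts = Counter()
    for tags in tweets:
        if seed in tags:
            counts.update(tag for tag in tags if tag != seed)
    return counts


# Toy tweets, each reduced to its set of hashtags (made up for illustration).
toy_tweets = [
    {"#sandy", "#fema"},
    {"#sandy", "#fema", "#redcross"},
    {"#sandy", "#newyork"},
    {"#fema", "#lunch"},
]
cooccurrence = cooccurring_hashtags(toy_tweets, "#sandy")
print(cooccurrence.most_common(1))
```

Repeating the count with a newly accepted hashtag as the seed extends the filter set one step at a time.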
Determining Relevancy of Co-occurring Hashtags
#indianelection2015
#modikisarkar
Too many co-occurring hashtags
Determining Relevancy of Co-occurring Hashtags
#indianelection2015
#modikisarkar
Co-occurring: Threshold δ
Preferably a prominent hashtag
Hashtag Co-occurrence works?
o No. Co-occurrence alone does not work
  o Many noisy or unrelated hashtags co-occur
o Determine the “dynamic” relevance of the top co-occurring hashtag to the dynamic topic
Determining Relevancy of Co-occurring Hashtags
#indianelection2015
#modikisarkar
Co-occurring: Threshold
Latest K (200,500)
Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2
Entity Extraction and Scoring
δ
Normalized Frequency Scoring (Vector Space Model)
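A minimal sketch of the scoring step, assuming that "normalized frequency" means the share of the latest K tweets that mention each entity. The slides do not state the normalization, so this definition is an assumption; entity extraction is taken as already done, and the function name is hypothetical.

```python
from collections import Counter


def hashtag_entity_vector(tweet_entities, k=200):
    """Score entities by the share of the latest k tweets that mention them.

    tweet_entities: list of entity sets, oldest first, one set per tweet
    carrying the hashtag. Entity extraction itself is assumed to be done.
    """
    latest = tweet_entities[-k:]
    if not latest:
        return {}
    counts = Counter(entity for entities in latest for entity in set(entities))
    return {entity: n / len(latest) for entity, n in counts.items()}


# Toy input (made up): entities found in four tweets of one hashtag.
toy = [
    {"Narendra Modi", "BJP"},
    {"Narendra Modi"},
    {"BJP", "India"},
    {"Narendra Modi", "India"},
]
vector = hashtag_entity_vector(toy, k=4)
print(vector)
```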
Determining Relevancy of Co-occurring Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring: Threshold
Latest K (200,500)
Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2
Entity Extraction and Scoring
Indian General Election, 2014
Dynamically Updated Background Knowledge
δ
Event Relevant Background Knowledge
o Wikipedia Event Pages
o Wikipedia Event Pages
Event Relevant Background Knowledge
o Entities mentioned on the Event page of Wikipedia are relevant to the Event
Event Relevant Background Knowledge
o Wikipedia’s Hyperlink structure is very rich
  o Page-Page (Wikipedia) links
Indian General Election, 2014
Narendra Modi
Rahul Gandhi
NDA (India)
UPA (India)
BJP
Indian National Congress
Event Relevant Background Knowledge – Graph Structure
Determining Relevancy of Co-occurring Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring: Threshold
Latest K (200,500)
Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2
Entity Extraction and Scoring
Indian General Election, 2014
Extract, Periodically Update Hyperlink structure
One hop from Event Page
δ
o Hyperlink structure is dynamically updated
Indian General Election, 2014
Narendra Modi
Rahul Gandhi
NDA (India)
UPA (India)
BJP
Indian National Congress
10 May 2010
Event Relevant Background Knowledge
o Hyperlink structure is dynamically updated
Indian General Election, 2014
Narendra Modi
Rahul Gandhi
NDA (India)
UPA (India)
BJP
Indian National Congress
10 May 2010
29 March 2013
29 March 2013 29 March 2013
29 March 2013
Event Relevant Background Knowledge
o Hyperlink structure is dynamically updated
Indian General Election, 2014
Narendra Modi
Rahul Gandhi
NDA (India)
UPA (India)
BJP
Indian National Congress
10 May 2010
29 March 2013
29 March 2013 29 March 2013
29 March 2013
20 May 2013
20 May 2013
Event Relevant Background Knowledge
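One way to picture the dynamically updated hyperlink structure is as a list of dated links on the event page, from which the one-hop neighbourhood at any point in time can be read off. The sketch below reuses the three dates shown on the slides, but the pairing of dates with pages is made up for illustration, and all names are hypothetical.

```python
from datetime import date

# Hypothetical edit history: (date a link appeared on the event page, linked page).
# The pairing of dates and pages is made up for illustration.
LINK_HISTORY = [
    (date(2010, 5, 10), "Indian National Congress"),
    (date(2013, 3, 29), "Narendra Modi"),
    (date(2013, 5, 20), "NDA (India)"),
]


def one_hop_entities(link_history, as_of):
    """Pages linked from the event page in the revision current on `as_of`."""
    return {page for added, page in link_history if added <= as_of}


snapshot = one_hop_entities(LINK_HISTORY, date(2013, 4, 1))
print(sorted(snapshot))
```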
Determining Relevancy of Co-occurring Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring: Threshold
Latest K (200,500)
Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2
Entity Extraction and Scoring
Indian General Election, 2014
Extract, Periodically Update Hyperlink structure
Entity scoring based on relevance to the Event
One hop from Event Page
δ
o Edge Based Measure
o Link Overlap Measure: Jaccard similarity
  o Out(c) are the links in Wikipedia page “c”
o Final Score: r(c,E) = ed(c,E) + oco(c,E)
Hyperlink Entity Scoring
India General Election, 2014 Narendra Modi
India General Election, 2014
India General Election, 2009 1
Mutually Important
ed (c,E) = 1
ed (c,E) = 2
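A sketch of the final score r(c,E) = ed(c,E) + oco(c,E) under two assumptions that the slides do not spell out: ed(c,E) is read as 2 when the two pages link to each other and 1 when only one links to the other, and oco(c,E) is taken as the Jaccard similarity of Out(c) and Out(E). The link structure below is a toy example, not data extracted from Wikipedia.

```python
def edge_score(c, event, out_links):
    """Assumed reading of ed(c, E): 2 for mutual links, 1 for a one-way link, else 0."""
    forward = c in out_links.get(event, set())
    backward = event in out_links.get(c, set())
    return int(forward) + int(backward)


def link_overlap(c, event, out_links):
    """Assumed oco(c, E): Jaccard similarity between Out(c) and Out(E)."""
    a, b = out_links.get(c, set()), out_links.get(event, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relevance(c, event, out_links):
    """r(c, E) = ed(c, E) + oco(c, E)."""
    return edge_score(c, event, out_links) + link_overlap(c, event, out_links)


# Toy link structure (hypothetical).
EVENT = "Indian General Election, 2014"
OUT = {
    EVENT: {"Narendra Modi", "BJP", "Indian General Election, 2009"},
    "Narendra Modi": {EVENT, "BJP"},
    "Indian General Election, 2009": {"BJP"},
}
print(relevance("Narendra Modi", EVENT, OUT))
print(relevance("Indian General Election, 2009", EVENT, OUT))
```

With this reading, a page that is linked in both directions always outranks a page linked in only one, and the overlap term breaks ties among pages with the same edge score.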
Determining Relevancy of Co-occurring Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring: Threshold
Latest K (200,500)
Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2
Entity Extraction and Scoring
Indian General Election, 2014
Extract, Periodically Update Hyperlink structure
Entity scoring based on relevance to the Event
One hop from Event Page
Indian General Elec: 1.0, India: 0.9, Elections: 0.7, UPA: 0.6, BJP: 0.3, NDA: 0.3, Narendra Modi: 0.3
δ
Determining Relevancy of Co-occurring Hashtags (Vector Space Model)
#indianelection2015
#modikisarkar
Co-occurring: Threshold
Latest K (200,500)
Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2
Entity Extraction and Scoring
Indian General Election, 2014
Extract, Periodically Update Hyperlink structure
Entity scoring based on relevance to the Event
One hop from Event Page
Indian General Elec: 1.0, India: 0.9, Elections: 0.7, UPA: 0.6, BJP: 0.3, NDA: 0.3, Narendra Modi: 0.3
Similarity Check
Relevance Score: 0.6
δ
o Set Based
  o Jaccard Similarity
  o Considers the entities without the scores
o Vector Based
  o Symmetric
    o Cosine Similarity
  o Asymmetric
    o Subsumption Similarity
Similarity Check
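The three kinds of check can be sketched on entity vectors stored as dictionaries. Jaccard and cosine follow their standard definitions. The asymmetric function is only a stand-in chosen to show the asymmetry; the slides do not give the formula for subsumption similarity, so it should not be read as the author's measure. The vectors reuse a few scores from the example slides.

```python
import math


def jaccard(a, b):
    """Set-based: compares which entities appear and ignores their scores."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 0.0


def cosine(a, b):
    """Symmetric vector similarity over entity scores."""
    dot = sum(score * b.get(entity, 0.0) for entity, score in a.items())
    norm_a = math.sqrt(sum(s * s for s in a.values()))
    norm_b = math.sqrt(sum(s * s for s in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def asymmetric_overlap(hashtag, event):
    """Stand-in asymmetric measure (an assumption, not the slides' formula).

    Share of the hashtag's weight that falls on entities known to the event;
    event entities missing from the hashtag vector do not lower the score.
    """
    total = sum(hashtag.values())
    covered = sum(s for entity, s in hashtag.items() if entity in event)
    return covered / total if total else 0.0


hashtag_vec = {"Narendra Modi": 0.9, "BJP": 0.7, "NDA": 0.6}
event_vec = {
    "Indian General Elec": 1.0, "India": 0.9, "Elections": 0.7, "UPA": 0.6,
    "BJP": 0.3, "NDA": 0.3, "Narendra Modi": 0.3,
}
print(jaccard(hashtag_vec, event_vec))
print(cosine(hashtag_vec, event_vec))
print(asymmetric_overlap(hashtag_vec, event_vec))
print(asymmetric_overlap(event_vec, hashtag_vec))
```

Swapping the arguments leaves the Jaccard and cosine values unchanged but changes the asymmetric value, which is the property the asymmetric option relies on.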
Intuition behind Asymmetric Similarity
[Figure: India General Election 2014 compared with Narendra Modi. Labels: Symmetric similarity: Penalized; Asymmetric similarity: Ignored]
o 2 events
  o US Presidential Elections (#election2012)
  o Hurricane Sandy (#sandy)
o Top 25 co-occurring hashtags
Evaluation – Dataset
o Ranking Problem
  o Rank the Top 25 hashtags based on the relevancy of tweets to the event
  o Experiment with all the similarity metrics
  o Manually annotated the tweets of these hashtags as relevant/irrelevant (Gold Standard)
o Ranking Evaluation Metrics
  o Mean Average Precision
  o NDCG
Evaluation – Strategy
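For reference, the two ranking metrics can be sketched with their textbook definitions. This is not the evaluation code behind the slides; the function names are hypothetical, and average precision here assumes every relevant item appears somewhere in the ranked list.

```python
import math


def average_precision(relevance):
    """relevance: list of 0/1 judgments in ranked order."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of average precision over several ranked lists."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


def ndcg_at_k(gains, k):
    """gains: graded relevance values in ranked order."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values[:k]))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0


print(average_precision([1, 0, 1]))
print(mean_average_precision([[1, 0, 1], [1, 1]]))
print(ndcg_at_k([0, 1], k=2))
```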
Evaluation
Evaluation
Evaluated tweets containing the top-relevant hashtags detected for dynamic topics
• NDCG - 92% at top-5 Mean Average Precision
Conclusions
• Semantic Technologies for Real-time filtering of Social Data
  – Wikipedia as a Dynamic Knowledge base for events
  – Determining relevant hashtags using Asymmetric similarity measure
  – More hashtags in turn increase the coverage of Tweets for events
• Hashtag Analysis
  – Co-occurrence technique can be used to detect event relevant hashtags
  – More popular hashtags are easier to detect via co-occurrence
Thanks
Contact: @[email protected]