2015 Hany SalahEldeen Dissertation Defense 1
Detecting, Modeling, & Predicting User Temporal Intention in Social Media
Hany M. SalahEldeen
Doctor of Philosophy Dissertation Defense
Old Dominion University, Department of Computer Science
Advisor: Dr. Michael L. Nelson
Committee: Dr. Michele C. Weigle, Dr. Hussein M. Abdel-Wahab, Dr. M’Hammed Abdous
May 5th, 2015
2015 Hany SalahEldeen Dissertation Defense 2
All tweets are equal…
…but some are more equal than others
2015 Hany SalahEldeen Dissertation Defense 3
It is imperative to know…
1. How long would these last?
2. And if lost, is there a backup somewhere?
3. Is this what the author intended?
2015 Hany SalahEldeen Dissertation Defense 4
To maintain historical integrity
Since tweets are considered the first draft of history… the historical integrity of the tweets could be compromised.
2015 Hany SalahEldeen Dissertation Defense 5
Motivation
Background
Related Research
Research Question
User-Time-Shared Resource
Conclusions
2015 Hany SalahEldeen Dissertation Defense 6
People rely on social media for the most up-to-date information
2015 Hany SalahEldeen Dissertation Defense 7
Social media is more than kitty photos
Marie Colvin (January 12, 1956 – February 22, 2012)
Rémi Ochlik (16 October 1983 – 22 February 2012)
Ahmed Assem (1987 – July 8, 2013)
2015 Hany SalahEldeen Dissertation Defense 8
For the web is dark, and full of missing content…
Accessed in July 2014
3 out of 8 external links on Rémi’s Wikipedia page return 404
2015 Hany SalahEldeen Dissertation Defense 9
even for content shared in social media
Accessed in July 2014
2015 Hany SalahEldeen Dissertation Defense 10
News sites are also prone to change
Accessed in July 2014
2015 Hany SalahEldeen Dissertation Defense 11
So are specialized sites
Accessed in July 2014
2015 Hany SalahEldeen Dissertation Defense 12
Research Problem: Author’s Intention ≠ Reader’s Experience
2015 Hany SalahEldeen Dissertation Defense 13
Research Implication: Author’s Intention ≠ Reader’s Experience
Broken, Inconsistent Web and Historical Records
2015 Hany SalahEldeen Dissertation Defense 15
Social Post
2015 Hany SalahEldeen Dissertation Defense 16
The anatomy of a tweet
Author’s username
Other user mention
Tweet Body
Hashtag
Shortened URL to resource
Publishing timestamp
Social Post
Shared Resource
Interaction options
2015 Hany SalahEldeen Dissertation Defense 17
3 URIs = 3 Chances to fail
2015 Hany SalahEldeen Dissertation Defense 18
URL shortening and aliasing
curl -L -I http://bit.ly/losing_revolution

HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Mon, 07 Jul 2014 18:19:48 GMT
Cache-Control: private; max-age=90
Location: http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
Mime-Version: 1.0
Set-Cookie: _bit=53bae4c4-00328-04f10-cb1cf10a;domain=.bit.ly;expires=Sat Jan 3 18:19:48 2015;path=/; HttpOnly
Content-Type: text/html;charset=utf-8
Content-Length: 167

HTTP/1.1 200 OK
Expires: Mon, 07 Jul 2014 18:19:52 GMT
Date: Mon, 07 Jul 2014 18:19:52 GMT
Cache-Control: private, max-age=0
Last-Modified: Mon, 07 Jul 2014 18:19:07 GMT
ETag: "e3555826-b103-4daa-a3f2-d0509ebab51f"
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Server: GSE
Alternate-Protocol: 80:quic
Content-Type: text/html;charset=UTF-8
Content-Length: 0
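The header dump above contains one response block per hop of the redirect chain. As a minimal sketch, assuming the output of `curl -L -I` has been captured as text, the chain could be recovered as follows. `parse_hops` is a hypothetical helper name used only for illustration; it is not part of curl or of any library.

```python
def parse_hops(raw_headers: str) -> list[dict]:
    """Split a `curl -L -I` header dump into one dict per HTTP response."""
    hops = []
    for line in raw_headers.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("HTTP/"):
            # Status line, e.g. "HTTP/1.1 301 Moved Permanently"
            parts = line.split(" ", 2)
            hops.append({"status": int(parts[1]), "headers": {}})
        elif hops and ":" in line:
            # Header line: split on the first colon only, so URLs stay intact
            name, value = line.split(":", 1)
            hops[-1]["headers"][name.strip().lower()] = value.strip()
    return hops


# Abbreviated copy of the two responses shown on the slide
sample = """HTTP/1.1 301 Moved Permanently
Location: http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html

HTTP/1.1 200 OK
Content-Type: text/html;charset=UTF-8
"""

hops = parse_hops(sample)
chain = [(h["status"], h["headers"].get("location")) for h in hops]
print(chain)
```

Each 3xx hop in the chain is a separate URI that has to keep resolving for the reader to reach the shared resource.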
2015 Hany SalahEldeen Dissertation Defense 19
Life cycle of a social post
The author tweets, and the tweet links to a resource.
Ideally, what the reader receives is in the same state the author intended.
After a period of time, however:
• The resource has disappeared
• The resource has changed
2015 Hany SalahEldeen Dissertation Defense 27
Memento framework
* http://mementoweb.org/guide/rfc/
2015 Hany SalahEldeen Dissertation Defense 29
Related Work
• Social media analysis:
  • Understanding Microblogging: Zhao 2009, Yang 2010, Newman 2003, Kwak 2010, Java 2007, Cha 2009
  • History Narration: Vieweg 2010, Starbird 2010-2012, Qu 2011, Neubig 2011, Lehman and Lalmas 2012-2013
• User’s Web Search Intention: Ashkan 2009, Lee 2005, Loser 2008, Azzopardi 2009, Baeza-Yates 2006, Dai 2011
• Commercial Intention: Guo 2010, Benczur 2007
• Sentiment Analysis: Mishne 2006, Bollen 2011
• Access to Archives: Van de Sompel 2009
• Persistence of shared resources: Nelson 2002, Sanderson 2011, McCown 2007
• URL Shortening: Antoniades 2011
• Tweeting, Micro-blogging and Popularity: Wu 2011, Java 2007, Kwak 2010
• Social Networks Growth and Evolution: Meeder 2011
Further details: refer to chapter 3
2015 Hany SalahEldeen Dissertation Defense 31
Research Question: Can we estimate the users’ intention at the time of posting and reading to predict and maintain temporal consistency?
2015 Hany SalahEldeen Dissertation Defense 32
Research Goals
• Detect the temporal intention of:
1. The author at sharing time
2. The reader at dereferencing time
• Model this intention as a function of time, nature of the resource, and its context.
• Predict how resources change with time and the intention behind sharing them to minimize inconsistency.
• Implement the prediction model to automatically preserve vulnerable social content that is prone to change or loss and provide a smooth temporal navigation of the social web.
Further details: refer to chapter 6
Further details: refer to chapter 7
Further details: refer to chapter 8
Further details: refer to chapter 9
2015 Hany SalahEldeen Dissertation Defense 34
Shared Resource Time User
Our analysis covers three angles
2015 Hany SalahEldeen Dissertation Defense 35
Shared Resource Time User
Loss and Persistence of Shared Resources
2015 Hany SalahEldeen Dissertation Defense 36
Shared Resource Time User
Alive
First: Estimate social media content loss
2015 Hany SalahEldeen Dissertation Defense 37
Six socially significant events
Event | Source | Year
Iranian Election | SNAP Dataset | 2009
H1N1 Virus Outbreak | SNAP Dataset | 2009
Michael Jackson’s Death | SNAP Dataset | 2009
Obama’s Nobel Peace Prize | SNAP Dataset | 2009
The Egyptian Revolution | Twitter, Websites, Books | 2011
The Syrian Uprising | Twitter API | 2012
2015 Hany SalahEldeen Dissertation Defense 38
Twitter tag expansion and filtration
2015 Hany SalahEldeen Dissertation Defense 39
Twitter tag expansion increases precision
2015 Hany SalahEldeen Dissertation Defense 40
What are people sharing?
2015 Hany SalahEldeen Dissertation Defense 41
Existence on the live web and in the archives
• For each unique URL we resolved the final HTTP response and considered 2 classes:
  • Success: 200 OK
  • Failure: 4XX and 50X families, 30X redirect loops, or soft 404s
• Utilize the Memento aggregator:
  • Archived: if it has at least one memento in the TimeMap
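The two checks above can be sketched as a pair of small predicates. This is only an illustration under stated assumptions: the function names are hypothetical, and redirect-loop and soft-404 detection are assumed to have been done beforehand and passed in as flags, since the slide does not say how they were detected.

```python
def classify_live(final_status: int, redirect_loop: bool = False,
                  soft_404: bool = False) -> str:
    """Classify a URL by its final HTTP response, following the slide's two classes."""
    # Redirect loops and soft 404s count as failures regardless of status code
    if redirect_loop or soft_404:
        return "failure"
    # Only a final 200 OK counts as success; 4XX, 50X, etc. are failures
    return "success" if final_status == 200 else "failure"


def is_archived(memento_count: int) -> bool:
    """A resource is archived if its TimeMap lists at least one memento."""
    return memento_count >= 1


print(classify_live(200), classify_live(404), classify_live(200, soft_404=True))
print(is_archived(0), is_archived(3))
```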
2015 Hany SalahEldeen Dissertation Defense 42
Resources Missing and Archived
Collection | Percentage Missing | Percentage Archived
H1N1 Outbreak | 23.49% | 41.65%
Michael Jackson | 36.24% | 39.45%
Iran | 26.98% | 43.08%
Obama | 24.59% | 47.87%
Egypt | 10.48% | 20.18%
Syria | 7.04% | 5.35%
(unlabeled) | 31.62% | 30.78%
(unlabeled) | 24.47% | 36.26%
(unlabeled) | 25.64% | 43.87%
(unlabeled) | 26.15% | 46.15%
2015 Hany SalahEldeen Dissertation Defense 43
Shared Resource Time User
Alive
Missing
Second: Can we measure existence and disappearance as a function of time?
2015 Hany SalahEldeen Dissertation Defense 45
Timeline of Events
2015 Hany SalahEldeen Dissertation Defense 47
Social Events Having a Bimodal Time Distribution
2015 Hany SalahEldeen Dissertation Defense 50
Existence as a function of time
2015 Hany SalahEldeen Dissertation Defense 52
Conclusion: Existence could be estimated as a function of time
• Results:
  • Measured 21,625 resources from 6 data sets in archives & live web.
  • A year after publishing, about 11% of content shared on social media will be gone.
  • After this we are losing roughly 0.02% daily.
• Publications and Articles:
  1. H. M. SalahEldeen. Losing My Revolution: A year after the Egyptian Revolution, 10% of the social media documentation is gone. http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html , 2012.
  2. H. M. SalahEldeen and M. L. Nelson. Losing my revolution: how many resources shared on social media have been lost? In Proceedings of the Second international conference on Theory and Practice of Digital Libraries, TPDL'12, 2012.
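The two rates quoted on this slide (about 11% gone after the first year, then roughly 0.02% per day) can be combined into a rough estimator. The sketch below assumes the daily loss adds linearly after the first year; that linear form and the function name are assumptions made for illustration, and the slide gives no rate for ages under one year, so the sketch refuses to guess there.

```python
def estimated_missing_percent(age_days: int) -> float:
    """Rough share of shared resources expected to be missing at a given age."""
    if age_days < 365:
        # The slide only states the loss at the one-year mark
        raise ValueError("no rate is given for the first year")
    # ~11% after one year, plus ~0.02% for every additional day (assumed linear)
    return 11.0 + 0.02 * (age_days - 365)


for years in (1, 2, 3):
    print(years, "year(s):", round(estimated_missing_percent(365 * years), 2), "%")
```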
2015 Hany SalahEldeen Dissertation Defense 53
Revisiting Existence after a year
Collections: MJ, Iran, H1N1, Obama, Egypt, Syria

Missing
Measured | 37.10% | 37.50% | 28.17% | 30.56% | 26.29% | 31.62% | 32.47% | 24.64% | 7.55% | 12.68%
Predicted | 31.72% | 31.42% | 31.96% | 30.98% | 30.16% | 29.68% | 29.60% | 28.36% | 19.80% | 11.54%
Error | 5.38% | 6.08% | 3.79% | 0.42% | 3.87% | 1.94% | 2.87% | 3.72% | 12.25% | 1.14%
Average Prediction Error = 4.15%
In all cases, our missing predictions were acceptable

Archived
Measured | 48.61% | 40.32% | 60.80% | 55.04% | 47.97% | 52.14% | 48.38% | 40.58% | 23.73% | 0.56%
Predicted | 61.78% | 61.18% | 62.26% | 60.30% | 58.66% | 57.70% | 57.54% | 55.06% | 37.94% | 21.42%
Error | 13.17% | 20.86% | 1.46% | 5.26% | 10.69% | 5.56% | 9.16% | 14.48% | 14.21% | 20.86%
Average Prediction Error = 11.57%
In all cases, our archival predictions were too optimistic
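The average prediction errors on this slide are the mean absolute differences between the measured and predicted rows. A small sketch that recomputes them from the slide's own numbers (the variable and function names are illustrative only):

```python
# Values copied from the slide, in slide order (percentages)
missing_measured = [37.10, 37.50, 28.17, 30.56, 26.29, 31.62, 32.47, 24.64, 7.55, 12.68]
missing_predicted = [31.72, 31.42, 31.96, 30.98, 30.16, 29.68, 29.60, 28.36, 19.80, 11.54]
archived_measured = [48.61, 40.32, 60.80, 55.04, 47.97, 52.14, 48.38, 40.58, 23.73, 0.56]
archived_predicted = [61.78, 61.18, 62.26, 60.30, 58.66, 57.70, 57.54, 55.06, 37.94, 21.42]


def mean_absolute_error(measured, predicted):
    """Average of |measured - predicted| over paired values."""
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)


missing_error = mean_absolute_error(missing_measured, missing_predicted)
archived_error = mean_absolute_error(archived_measured, archived_predicted)
print(round(missing_error, 2), round(archived_error, 2))

# Every archival prediction exceeds the measured value
print(all(p > m for m, p in zip(archived_measured, archived_predicted)))
```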
2015 Hany SalahEldeen Dissertation Defense 54
Shared Resource Time User
Alive
Missing
Replaced
Third: Can we use social context to find replacements of missing resources?
2015 Hany SalahEldeen Dissertation Defense 55
Context discovery and shared resource replacement
Problem:
The 140-character limit restricts the description of the linked resource. If the resource goes missing, can we get the next best thing?
Solution:
• Shared links typically have several tweets, responses, and retweets
• We can mine these traces for context and viable replacements
2015 Hany SalahEldeen Dissertation Defense 56
Context Discovery
Linking to: http://beta.18daysinegypt.com/
2015 Hany SalahEldeen Dissertation Defense 57
What if the resource disappeared?
Linking to: http://beta.18daysinegypt.com/
2015 Hany SalahEldeen Dissertation Defense 58
Use Topsy to discover tweets sharing the same link
2015 Hany SalahEldeen Dissertation Defense 59
Social Context Extraction
{
  "URI": "http://beta.18daysinegypt.com/",
  "Related Tweet Count": 500,
  "Related Hashtags": "#tran #citizensx #arabspring #visualstorytelling #collaborativerevolution #feb11http://t.co/qxusp70 ...",
  "Users who talked about this": "@petra_stienen: @waleedrashed: @omarsamra @ungormite: @dcisbusy @webdocumentario: ...",
  "All associated unique links:": "http://t.co/63X1f3f1 http://t.co/reBh6c4V http://t.co/B3GuhQN4 http://t.co/X2sjf4Rf http://t.co/P9iR28fH http://t.co/1C4EPh8h ...",
  "All other links associated:": "http://vimeo.com/35368376 http://mashable.com/2012/01/21/18daysinegypt-2/ ",
  "Most frequent link appearing:": "http://t.co/2ke0rEjP",
  "Number of times the Most frequent link appearing:": 49,
  "Most frequent tweet posted and reposted:": "Check out 18DaysInEgypt - A crowd sourced documentary project ================= via @18daysinegypt",
  "Number of times the Most frequent tweet appearing:": 46,
  "The longest common phrase appearing:": "RT 2ke0rEjP is an interactive documentary website that YOU can help create Get your Jan25 stories ready! Pl RT",
  "Number of times the Most common phrase appearing:": 18
}
2015 Hany SalahEldeen Dissertation Defense 60
Build a Tweet Document
A tweet document represents the concatenation of all extracted tweets:
“do you have a story to tell about your 18 days of revolution? share it or contact sara 18days brand new interactive storytelling project on egyptian revolution a very creative platform to tell your story daysinegypt marches heading to tahrir square now from all over cairoit's all over again use the website to document your revolutionary stories and share them with the world! check out awesome documentary project crowdsourcing a people's narrative of the egyptian revolution …”
2015 Hany SalahEldeen Dissertation Defense 61
Tweet Signature
Tweet Document:
“do you have a story to tell about your 18 days of revolution? share it or contact sara 18days brand new interactive storytelling project on egyptian revolution a very creative platform to tell your story daysinegypt marches heading to tahrir square now from all over cairoit's all over again use the website to document your revolutionary stories and share them with the world! check out awesome documentary project crowdsourcing a people's narrative of the egyptian revolution …”
Tweet Signature = top 5 most frequent terms from Tweet Document
documentary project daysinegypt check sourced
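A tweet signature as defined above (the top 5 most frequent terms of the tweet document) could be computed roughly as follows. The tokenizer, the short stopword list, and the minimum term length are assumptions made for this sketch; the slides do not specify how terms were normalized or filtered, so the output on the real tweet document may differ from the signature shown.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not the list used in the study)
STOPWORDS = {"the", "and", "you", "your", "for", "from", "with", "about",
             "have", "all", "out", "them", "over", "via", "use", "now"}


def tweet_signature(tweet_document: str, k: int = 5) -> list[str]:
    """Return the k most frequent non-stopword terms of a tweet document."""
    terms = re.findall(r"[a-z0-9']+", tweet_document.lower())
    counts = Counter(t for t in terms if t not in STOPWORDS and len(t) > 2)
    return [term for term, _ in counts.most_common(k)]


demo = ("documentary project documentary story "
        "check out the documentary project about the revolution")
print(tweet_signature(demo, k=2))
```

The resulting terms are then joined into a single search query, as on the next slide.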
2015 Hany SalahEldeen Dissertation Defense 62
Query Google with the Tweet Signature
2015 Hany SalahEldeen Dissertation Defense 63
Search Engine Results
The original resource
2015 Hany SalahEldeen Dissertation Defense 64
Search Engine Results
The original resource
The others are good replacement candidates
2015 Hany SalahEldeen Dissertation Defense 65
Recommendation Evaluation
We extract a dataset of resources that are currently available:
• Pretend these resources no longer exist (for a baseline)
• Each resource is text-based
• Each resource has at least 30 retrievable tweets
Extracted 731 unique resources
We use a boilerplate removal library to remove the template from:
• the linked resources
• the top 10 retrieved results from Google
We use cosine similarity to compare the documents
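Cosine similarity between two boilerplate-free documents can be sketched with plain term-frequency vectors. The weighting scheme is an assumption here: the slides do not say whether raw counts, TF-IDF, or another weighting was used, and the tokenizer below is deliberately simple.

```python
import math
import re
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two texts."""
    vec_a = Counter(re.findall(r"\w+", doc_a.lower()))
    vec_b = Counter(re.findall(r"\w+", doc_b.lower()))
    # Dot product over the terms the two documents share
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm = (math.sqrt(sum(c * c for c in vec_a.values()))
            * math.sqrt(sum(c * c for c in vec_b.values())))
    return dot / norm if norm else 0.0


original = "crowd sourced documentary about the egyptian revolution"
candidate = "a crowd sourced documentary project on the egyptian revolution"
score = cosine_similarity(original, candidate)
print(round(score, 2), "viable replacement" if score >= 0.70 else "too different")
```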
2015 Hany SalahEldeen Dissertation Defense 66
Similarity measures in resource replacement
Threshold: 70% similarity
In 41% of the cases we found a replacement with >= 70% similarity
2015 Hany SalahEldeen Dissertation Defense 67
Conclusion: We can find viable replacements for missing shared resources
• Results:
  • In 41% of the test cases we can find a replacement page with at least 70% similarity to the original missing resource
  • The search results provide a mean reciprocal rank of 0.43
• Publications:
  1. H. SalahEldeen and M. L. Nelson. Resurrecting my revolution: Using social link neighborhood in bringing context to the disappearing web. In Research and Advanced Technology for Digital Libraries - International Conference on Theory and Practice of Digital Libraries, TPDL 2013, 2013.
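For reference, mean reciprocal rank averages 1/rank of the first correct result over all queries. The sketch below assumes that a query whose original resource never appears in the results contributes 0; the slides do not state how such cases were handled, and the example ranks are made up.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over queries; None means the target was not retrieved."""
    return sum(1.0 / r if r else 0.0 for r in ranks) / len(ranks)


# Hypothetical ranks at which the original resource appeared for three queries
example_ranks = [1, 2, None]
print(mean_reciprocal_rank(example_ranks))
```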
2015 Hany SalahEldeen Dissertation Defense 68
Now that we have finished analyzing the shared resource… what’s next?
2015 Hany SalahEldeen Dissertation Defense 69
Shared Resource Time User
Alive
Missing
Replaced
Footprints on the web
2015 Hany SalahEldeen Dissertation Defense 70
The tweet, the resource…and time
Timeline: a tweet is posted, then another, and another, … and at some later time the tweet is read.
The relevancy of the resource to the tweet changes through time.
We need to measure tweet relevance through time.
2015 Hany SalahEldeen Dissertation Defense 71
Shared Resource Time User
Alive
Missing
Replaced
Rate of Change
Longitudinal Study: Rate of change of shared content
2015 Hany SalahEldeen Dissertation Defense 72
Pilot 1: Resource change in the first 80 hours after tweeting
2015 Hany SalahEldeen Dissertation Defense 73
Pilot 2: Delta days from Bitly creation for just tweeted content
Dataset size = 4,000
2015 Hany SalahEldeen Dissertation Defense 74
Pilot 3: Dataset of 1,000 freshly created Bitlys
http://www.cnn.com depth = 0
http://www.cnn.com/world depth = 1
http://www.cnn.com/2009/SHOWBIZ/Music/06/25/jackson depth = 6
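The depth values in the three examples above match the number of non-empty path segments in the URI. A minimal sketch of that reading (the function name is illustrative, and counting path segments is an inference from the examples, not a definition given on the slide):

```python
from urllib.parse import urlsplit


def uri_depth(uri: str) -> int:
    """Count the non-empty path segments of a URI."""
    path = urlsplit(uri).path
    return len([segment for segment in path.split("/") if segment])


for uri in ("http://www.cnn.com",
            "http://www.cnn.com/world",
            "http://www.cnn.com/2009/SHOWBIZ/Music/06/25/jackson"):
    print(uri, "depth =", uri_depth(uri))
```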
2015 Hany SalahEldeen Dissertation Defense 75
What domains do users link to?
2015 Hany SalahEldeen Dissertation Defense 76
What categories* do users link to?
* Extracted from Alexa.com
2015 Hany SalahEldeen Dissertation Defense 77
Summation of Intention in Social Content Through Time
Longitudinal study: We record the change over an extended period of time:
• Content: we download a snapshot of the resource every 45 minutes
• Metadata: we collect metadata about the resource
  • Facebook likes, posts
  • Tweets in the last hour
  • Bitly clicklogs and shares
• Average data size: ~1 TB per month
2015 Hany SalahEldeen Dissertation Defense 78
Hourly analysis over an extended period of time
2015 Hany SalahEldeen Dissertation Defense 79
There is a difference between t_tweet and t_click
• After just one hour, 4% of the resources have changed by 30%.
• After six hours, the percentage doubled: 8% of the resources have changed by 40%.
• After a day, the change rate slowed: 12% of the resources have changed by 40%.
• After that, it almost stabilizes at 17% of the resources having changed by 40%.
2015 Hany SalahEldeen Dissertation Defense 80
Shared Resource Time User
Alive
Missing
Replaced
Rate of Change
Archive & Creation
First: Resource – Time – Public Archives
2015 Hany SalahEldeen Dissertation Defense 81
Revisited: Resources Missing and Archived
Collection | Percentage Missing | Percentage Archived
H1N1 Outbreak | 23.49% | 41.65%
Michael Jackson | 36.24% | 39.45%
Iran | 26.98% | 43.08%
Obama | 24.59% | 47.87%
Egypt | 10.48% | 20.18%
Syria | 7.04% | 5.35%
(unlabeled) | 31.62% | 30.78%
(unlabeled) | 24.47% | 36.26%
(unlabeled) | 25.64% | 43.87%
(unlabeled) | 26.15% | 46.15%
2015 Hany SalahEldeen Dissertation Defense 82
But on a more general notion we want to know…
2015 Hany SalahEldeen Dissertation Defense 83
How much of the web is archived?
• Goal: Estimate how much of the public web is present in the public archives and how many copies are available.
• Action:
  • Getting 4 different datasets from 4 different sources:
    • Search Engine Indices
    • Bit.ly
    • DMOZ
    • Delicious
2015 Hany SalahEldeen Dissertation Defense 84
Conclusion: It depends on the source
• Results:
• Publication: S. G. Ainsworth, A. Alsum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. How much of the web is archived? In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL '11, pages 133-136, New York, NY, USA, 2011. ACM.
Changes since 2011: no more free SE APIs; greatly reduced IA quarantine period; 15 public web archives
2013: 95%, 92%, 23%, 26%
2015 Hany SalahEldeen Dissertation Defense 86
Side Experiment: Analyzing the quality of the archives and the archived content
• Goal:
  • Assessing the quality of the web archives
  • Better discussed in Justin Brunelle’s work
• Publications:
  1. J. F. Brunelle, M. Kelly, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources. In Proceedings of the 2014 IEEE/ACM Joint Conference on Digital Libraries (JCDL), 2014 (Best student paper award)
2015 Hany SalahEldeen Dissertation Defense 87
A question emerged: When did a certain resource first appear on the web?
2015 Hany SalahEldeen Dissertation Defense 88
Shared Resource Time User
Alive
Missing
Replaced
Rate of Change
Archive & Creation
Second: When was the resource created?
2015 Hany SalahEldeen Dissertation Defense 89
Idea
Web pages leave trails as well since the day they were created…
2015 Hany SalahEldeen Dissertation Defense 90
Web trails
A web resource can leave any of the following trails denoting its existence:
• References
• Links (anchors)
• Social media likes and interactions.
• URL shortening.
• Backlinks
• The creation date of any of the associated events/trails could be an estimate of the creation date.
2015 Hany SalahEldeen Dissertation Defense 91
Resource’s timeline
2015 Hany SalahEldeen Dissertation Defense 92
Observations Recorded
1. Last modified date from the response header.
2. First appearance of a backlink.
3. First tweet published.
4. First Bitly shortened URL created.
5. Timestamp of the first memento in the archives.
6. Date of the last crawl by the search engine.
2015 Hany SalahEldeen Dissertation Defense 93
Carbon Date service
2015 Hany SalahEldeen Dissertation Defense 94
Carbon Dating API
{
  "self": "http://cd.cs.odu.edu/cd?url=http://www.cnn.com",
  "URI": "http://www.cnn.com",
  "Estimated Creation Date": "1998-12-06T04:02:33",
  "Last Modified": "",
  "Bitly.com": "2008-06-08T12:00:00",
  "Topsy.com": "2015-01-25T23:31:42",
  "Backlinks": "2003-03-12T05:35:44",
  "Google.com": "2005-01-11T00:00:00",
  "Archives": [
    ["Earliest", "1998-12-06T04:02:33"],
    ["By_Archive", {
      "http://archive.today/20000815052826/http://www.cnn.com/": "2000-08-15T05:28:26",
      "http://arquivo.pt/wayback/wayback/20000815052826/http://www.cnn.com/": "2000-08-15T05:28:26",
      "http://wayback.vefsafn.is/wayback/20011106102722/http://www.cnn.com/": "1998-12-06T04:02:33",
      "http://web.archive.org/web/20131218180509/http://www.cnn.com/": "2013-12-18T18:05:09"
    }]
  ]
}
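In the sample response above, the estimated creation date equals the earliest timestamp among the individual sources. A minimal sketch of that selection step, assuming the estimate is simply the earliest non-empty observation (an inference from this one response, not a description of the service's actual code; the function name is hypothetical):

```python
from datetime import datetime
from typing import Optional

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"


def estimate_creation_date(observations: dict) -> Optional[str]:
    """Return the earliest non-empty timestamp among the per-source observations."""
    stamps = [datetime.strptime(value, TIMESTAMP_FORMAT)
              for value in observations.values() if value]
    return min(stamps).strftime(TIMESTAMP_FORMAT) if stamps else None


# Per-source values copied from the sample response for http://www.cnn.com
observations = {
    "Last Modified": "",
    "Bitly.com": "2008-06-08T12:00:00",
    "Topsy.com": "2015-01-25T23:31:42",
    "Backlinks": "2003-03-12T05:35:44",
    "Google.com": "2005-01-11T00:00:00",
    "Archives (Earliest)": "1998-12-06T04:02:33",
}
print(estimate_creation_date(observations))
```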
2015 Hany SalahEldeen Dissertation Defense 95
Evaluation Dataset
From each we randomly selected 100 unique URLs to create our gold standard dataset
2015 Hany SalahEldeen Dissertation Defense 96
Evaluation
• Applied our 6 methods to 1200 resources.
• Get the leftmost estimate.
Outcome | Number of Resources | Percentage
An estimate found | 910 | 76%
Exact matching estimate | 393 | 33%
No estimate found | 290 | 24%
Total Resources | 1200 | 100%
2015 Hany SalahEldeen Dissertation Defense 97
Actual Vs. Estimated Dates
2015 Hany SalahEldeen Dissertation Defense 98
Conclusion: We can estimate the creation date of resources correctly
• Results:
  • Succeeded in estimating the creation date accurately in 75.90% of the resources.
• Publications:
  1. H. M. SalahEldeen and M. L. Nelson. Carbon dating the web: Estimating the age of web resources. In Proceedings of the 22nd International Conference on World Wide Web Companion, TempWeb03, WWW '13, 2013
2015 Hany SalahEldeen Dissertation Defense 99
Alexander Nwala did an awesome job releasing the second version of Carbon Date, which is more reliable, multithreaded, and faster, can handle multiple requests, and has caching capabilities.
http://cd.cs.odu.edu/
Yes, it’s better than mine… I admit it
2015 Hany SalahEldeen Dissertation Defense 101
Shared Resource Time User
Alive
Missing
Replaced
Rate of Change
Archive & Creation
User’s Temporal Intention
2015 Hany SalahEldeen Dissertation Defense 102
Problem: There is an inconsistency between what the tweet’s author intended to share at time t_tweet and what the reader might actually read upon clicking on the link at time t_click.
2015 Hany SalahEldeen Dissertation Defense 103
Shared Resource Time User
Alive
Missing
Replaced
Rate of Change
Archive & Creation
Detecting
What is Intention and how to detect it?
2015 Hany SalahEldeen Dissertation Defense 104
Amazon’s Mechanical Turk
• Crowdsourcing Internet marketplace
• Co-ordinates the use of human intelligence to perform tasks that computers are currently unable to do.*
* http://en.wikipedia.org/wiki/Amazon_Mechanical_Turk
2015 Hany SalahEldeen Dissertation Defense 105
Goal: Understand and collect user intention data via MT
Pipeline: Tweets dataset → Intention Classification Tasks → User Intention Data → Train Classifier
• Problem: It is not as easy as it seems!
How NOT to classify temporal intention 101
• The tweet is presented along with the two snapshots: one at t_tweet and one at t_click
2015 Hany SalahEldeen Dissertation Defense 108
And compared MT results with Experts
• Experts: a version was manually assigned to each tweet in a face-to-face meeting with WS-DL members.
• For 9 MT assignments per tweet:
  • If we allowed 4-5 splits, we had a 58% match with WS-DL.
  • If we allowed 3-6 splits or better, we got a 31% match.
Which is worse than flipping a coin!
2015 Hany SalahEldeen Dissertation Defense 109
Idea: We need to transform the problem from intention to relevance.
2015 Hany SalahEldeen Dissertation Defense 110
Relevance tasks are simpler
• MT workers are more accustomed to classification tasks, which require a minimal amount of explanation
• Transform a hard problem to an easy one
Is that a cat?
- Yes
- No
2015 Hany SalahEldeen Dissertation Defense 111
Temporal Intention Relevancy Model (TIRM)
Between t_tweet and t_click:
The linked resource could have:
• Changed
• Not changed
The tweet and the linked resource could be:
• Still relevant
• No longer relevant
2015 Hany SalahEldeen Dissertation Defense 112
Resource is changed but relevant
• The resource changed
• But it is still relevant
Intention: need the current version of the resource at any time
2015 Hany SalahEldeen Dissertation Defense 113
Relevancy and Intention mapping
Current
2015 Hany SalahEldeen Dissertation Defense 114
Resource is changed and not relevant
Intention: need the past version of the resource at any time
• The resource changed
• But it is no longer relevant
2015 Hany SalahEldeen Dissertation Defense 115
Relevancy and Intention mapping
Changed and relevant: Current
Changed and not relevant: Past
2015 Hany SalahEldeen Dissertation Defense 116
Resource is not changed and relevant
Intention: need the past version of the resource at any time
• The resource is not changed
• And it is relevant
2015 Hany SalahEldeen Dissertation Defense 117
Relevancy and Intention mapping
Changed and relevant: Current
Changed and not relevant: Past
Not changed and relevant: Past
2015 Hany SalahEldeen Dissertation Defense 118
Resource is not changed and not relevant
Intention: I am not sure which version of the resource I need
• The resource is not changed
• But it is not relevant
2015 Hany SalahEldeen Dissertation Defense 119
Relevancy and Intention mapping
Resource | Relevant | Not relevant
Changed | Current | Past
Not changed | Past | Not Sure
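The four cases of the mapping, as stated on the preceding slides, can be written as a small lookup. The function name and the lowercase labels are illustrative choices, not names used in the dissertation.

```python
def temporal_intention(changed: bool, relevant: bool) -> str:
    """Map TIRM's (changed, relevant) observation to the intended version."""
    if changed:
        # Changed but still relevant: the reader needs the current version;
        # changed and no longer relevant: the reader needs the past version
        return "current" if relevant else "past"
    # Unchanged and relevant: past version; unchanged and not relevant: undetermined
    return "past" if relevant else "not sure"


for changed in (True, False):
    for relevant in (True, False):
        print(changed, relevant, "->", temporal_intention(changed, relevant))
```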
2015 Hany SalahEldeen Dissertation Defense 120
Validation: Update the MT experiment
• MT workers ≡ judgments of the experts (WS-DL members)
✓
Is the content still relevant to the tweet?
2015 Hany SalahEldeen Dissertation Defense 121
Mechanical Turk Workers Vs. Experts
• For 100 tweets, percentage of agreement with WS-DL members:
Agreement in 3-2 split or more votes | 93%
Agreement in 4-1 split or more votes | 80%
Agreement with 5-0 votes | 60%
• Cohen’s K = 0.854 (almost perfect agreement)
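For readers unfamiliar with Cohen's kappa, it compares the observed agreement between two raters with the agreement expected by chance. A minimal sketch of the standard two-rater formula follows; the example labels are made up and are not the study's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items both raters labelled identically
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies
    expected = sum(freq_a[c] * freq_b[c]
                   for c in freq_a.keys() | freq_b.keys()) / (n * n)
    # Undefined (division by zero) when chance agreement is exactly 1
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two raters for four tweets
workers = ["relevant", "relevant", "not relevant", "not relevant"]
experts = ["relevant", "relevant", "not relevant", "relevant"]
print(cohens_kappa(workers, experts))
```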
2015 Hany SalahEldeen Dissertation Defense 122
Shared Resource Time User
Alive
Missing
Replaced
Rate of Change
Archive & Creation
Detecting
Modeling
Can we model this temporal intention?
2015 Hany SalahEldeen Dissertation Defense 123
Data Collection
• From the SNAP dataset we extracted:
  • Tweets in English
  • Each has an embedded URI pointing to an external resource.
  • The embedded URI is shortened via Bit.ly
• The external resource:
  • Still persists.
  • Has at least 10 mementos.
  • Is unique.
We extracted 5,937 unique instances
2015 Hany SalahEldeen Dissertation Defense 124
Time delta between the tweet and the closest memento
Randomly selected 1,124 instances
Time delta range: 3.07 minutes to 56.04 hours
Average: 25.79 hours ~ 1 day
Tweet time
After Tweet time
Before Tweet time
2015 Hany SalahEldeen Dissertation Defense 125
Training Dataset
• R_current: The state of the resource at current time.
• R_click: The state of the resource at click time.
Assignment | Count | Percentage
Relevant Assignments | 929 | 82.65%
Non-Relevant Assignments | 195 | 17.35%
5 MT workers agreeing (5-0 split) | 589 | 52.40%
4 MT workers agreeing (4-1 split) | 309 | 27.49%
3 MT workers agreeing (3-2 close call split) | 226 | 20.11%
2015 Hany SalahEldeen Dissertation Defense 127
Intention modeling: Feature extraction
• For each tweet we perform:
  • Link analysis
  • Social media mining
  • Archival existence
  • Sentiment analysis
  • Content similarity
  • Entity identification
2015 Hany SalahEldeen Dissertation Defense 128
Training the classifier
• From the feature extraction phase we extracted 39 different features to train the classifier.
• Using 10-fold cross validation, the Cost Sensitive Classifier Based on Random Forests gave the highest success rate = 90.32%
2015 Hany SalahEldeen Dissertation Defense 129
Most significant features sorted by information gain
Rank | Feature | Gain Ratio
1 | Existence of celebrities in tweets | 0.149
2 | Number of mementos | 0.090
3 | Tweet similarity with current page | 0.071
4 | Similarity: Current & past page | 0.053
5 | Similarity: Tweet & past page | 0.044
6 | Original URI’s depth | 0.032
2015 Hany SalahEldeen Dissertation Defense 130
Testing the model
• We tested against:
  • The remaining 4,813 from the original 5,937 instances after extracting the 1,124 used in training.
  • The Tweet Collections based on historic events (MJ, Obama, Iran, Syria, & H1N1).
Dataset | Status 200 | Status 404 or other | Relevant % | Non-Relevant %
Extended 4,813 instances | 96.77% | 3.23% | 96.74% | 3.26%
MJ’s Death | 57.54% | 42.46% | 93.24% | 6.76%
H1N1 Outbreak | 8.96% | 91.04% | 97.48% | 2.52%
Iran Elections | 68.21% | 31.79% | 94.69% | 5.31%
Obama’s Nobel Prize | 62.86% | 37.14% | 93.89% | 6.11%
Syrian Uprising | 80.80% | 19.20% | 70.26% | 29.75%
2015 Hany SalahEldeen Dissertation Defense 131
Idea: We need to transform the problem from intention to relevance.
Now we need to transform it back!
Recap…
2015 Hany SalahEldeen Dissertation Defense 132
Recap: Relevancy and Intention mapping
Past
Reading the wrong history
2015 Hany SalahEldeen Dissertation Defense 133
Mapping TIRM
• We used 70% similarity as a threshold of relevancy.
Reading the wrong history in up to 25% of the cases
2015 Hany SalahEldeen Dissertation Defense 134
Conclusion: We can model users’ temporal intention accurately and efficiently
• Results:
  • We successfully transformed the complicated problem of intention to a simpler one of relevance.
  • We successfully collected a gold standard dataset of temporal user intention.
  • We found a temporal inconsistency in the shared resource in up to 25% of the cases according to the dataset.
• Publications:
  1. H. M. SalahEldeen and M. L. Nelson. Reading the correct history?: Modeling temporal intention in resource sharing. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, 2013.
2015 Hany SalahEldeen Dissertation Defense 135
So we modeled intention… can we make it better?
Most significant features sorted by information gain
Rank Feature Gain Ratio
1 Existence of celebrities in tweets 0.149
2 Number of mementos 0.090
3 Tweet similarity with current page 0.071
4 Similarity: Current & past page 0.0527
5 Similarity: Tweet & past page 0.04401
6 Original URI’s depth 0.0324
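The table column reports gain ratio, which is information gain divided by the split information of the feature. The slides do not say which toolkit produced these numbers, so the following is only a sketch of the standard definition for discrete features; the function names `entropy` and `gain_ratio` are illustrative.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(values).values()
    )


def gain_ratio(feature_values, labels):
    """Information gain of a discrete feature divided by its split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the labels after splitting on the feature.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    # Split information is the entropy of the feature's own value distribution.
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0
```

Ranking features then amounts to computing `gain_ratio` for each feature column against the class labels and sorting in descending order. Continuous features, such as the similarity scores, would need to be discretized first.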
Enhancing TIRM
• Extending and tuning the features:
  • Linguistic feature analysis
  • Semantic similarity analysis using latent topic modeling
  • Dataset balancing
  • Feature selection and minimization
A whole lot of features!
39 65 different features in extended TIRM
Further details: refer to chapter 7
TIRM enhancement and minimization results
From binary to probabilistic strength
[Figure labels: Point of Confusion: C; Point of Certainty: S; Strongest Current Intention]
Further details: refer to chapter 7
Intention strength formulation
Intention strength magnitude of the new resource:
Generalization with regard to class:
Intention strength across instances in dataset
[Figure: roadmap diagram. Recovered labels: Shared Resource, Time, User; Alive, Missing, Replaced; Rate of Change; Archive & Creation; Detecting, Modeling, Predicting]
Can we find a relation between the modeled intention and time
…to predict it?
Remember: Data Collection
• From the SNAP dataset we extracted:
  • Tweets in English
  • Each has an embedded URI pointing to an external resource.
  • The embedded URI is shortened via Bit.ly
• The external resource:
  • Still persists.
  • Has at least 10 mementos.
  • Is unique.
We extracted 5,937 unique instances
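The selection criteria above can be sketched as a simple filter. The record layout (`TweetRecord` and its fields) is hypothetical, and two readings are assumptions: "still persists" is taken to mean an HTTP 200 response, and "is unique" is taken to mean that no two selected tweets resolve to the same final URI.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

MIN_MEMENTOS = 10  # "Has at least 10 mementos"


@dataclass
class TweetRecord:
    """Hypothetical record layout; not the format of the SNAP dataset."""
    text: str
    language: str
    short_uri: str      # URI as embedded in the tweet
    final_uri: str      # URI after following redirects
    status_code: int    # HTTP status of the final resource
    memento_count: int  # number of archived copies found


def is_bitly(uri):
    """True when the URI's host is bit.ly."""
    return urlparse(uri).netloc.lower() == "bit.ly"


def select_instances(records):
    """Keep records that satisfy every criterion listed on the slide."""
    seen_uris = set()
    selected = []
    for record in records:
        if record.language != "en":
            continue
        if not is_bitly(record.short_uri):
            continue
        if record.status_code != 200:  # assumed meaning of "still persists"
            continue
        if record.memento_count < MIN_MEMENTOS:
            continue
        if record.final_uri in seen_uris:  # assumed meaning of "is unique"
            continue
        seen_uris.add(record.final_uri)
        selected.append(record)
    return selected
```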
Intention strength across time
[Figure: timeline with the labels "Resource = closest memento" and "Resource = current version"]
We have 10 mementos of the resource uniformly distributed
We can calculate intention strength at every point
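One way to obtain evenly spread snapshots is to place target times at equal intervals and take the archived copy closest to each. This is only a sketch under the assumption that the span runs from tweet time to the present; the function names are illustrative and the slides do not describe the actual selection procedure.

```python
from datetime import datetime


def evenly_spaced_times(start, end, count):
    """Return `count` datetimes at equal intervals from start to end inclusive."""
    if count < 2:
        raise ValueError("count must be at least 2")
    step = (end - start) / (count - 1)
    return [start + i * step for i in range(count)]


def closest_memento(target, memento_times):
    """Return the memento datetime nearest to the target time."""
    return min(memento_times, key=lambda t: abs(t - target))


def sample_mementos(tweet_time, now, memento_times, count=10):
    """Pick the memento closest to each of `count` evenly spaced target times."""
    targets = evenly_spaced_times(tweet_time, now, count)
    return [closest_memento(target, memento_times) for target in targets]
```

When the archive is sparse, the same memento can be the closest match for more than one target time, so a real pipeline would need to decide whether to deduplicate or to skip such resources.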
Intention strength across time
Dataset collection and calculation framework
Behavior of instances in different classes
[Figure: three plots of intention strength against time. Recovered panel titles: Steady Current Intention; Steady Past Intention]
Given the features we already collected can we classify tweets
according to their behavioral class?
Classifying intention behavior across time
If we can limit the features to the ones that exist before tweet time
can we perform a prediction?
Classifying intention behavior across time
We can perform a prediction!
Intention behavior prediction classifier
Conclusion: We can predict the author’s temporal intention
• Results:
  • We can predict for the author, with 77% accuracy, whether the intention conveyed to the readers will stay consistent or change.
• Publications:
  1. H. M. SalahEldeen and M. L. Nelson. Predicting Temporal Intention in Resource Sharing. In Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '15, 2015.
At this point, we have successfully detected, modeled, and predicted
users' temporal intention in shared content
[Figure: roadmap diagram. Recovered labels: Shared Resource, Time, User; Alive, Missing, Replaced; Rate of Change; Archive & Creation; Detecting, Modeling, Predicting; User Temporal Intention; Temporal Intention Model]
So we built an awesome prediction model for Temporal
Intention… what next?
A Framework of Temporal Intention
[Figure: timeline with the events "Posted a tweet" and "Read the tweet"]
• Tools for authors
  • Enrich the archives with current content for posterity
Prediction API
Tools for Authors
Temporal Intention Implementation
[Figure: timeline with the events "Posted a tweet" and "Read the tweet"]
• Tools for readers
  • Maintain the temporal consistency of content
Tools for readers
1. Temporal preservation of vulnerable content
2. Version recommendation based on temporal intention estimation
Target Publication: Utilizing Temporal Intention Prediction for Just-in-time Preservation and Recommendation of Vulnerable Social Media Content. WSDM 2016
Motivation
Background
Related Research
Research Question
User-Time-Shared Resource
Conclusions
Accomplished Goals
• Detect the temporal intention of:
  1. The author at sharing time
  2. The reader at dereferencing time
• Model this intention as a function of time, nature of the resource, and its context.
• Predict how resources change with time and the intention behind sharing them to minimize inconsistency.
• Implement the prediction model to automatically preserve vulnerable social content that is prone to change or loss and provide a smooth temporal navigation of the social web.
Further details: refer to chapter 6
Further details: refer to chapter 7
Further details: refer to chapter 8
Further details: refer to chapter 9
Also, our work reached fame…
The Virginian Pilot
http://www.bbc.com/future/story/20120927-the-decaying-web
BBC.com
Popular Mechanics
February 2014 issue, page 20
3 x MIT Technology Review
http://www.technologyreview.com/view/513996/how-to-carbon-date-a-web-page/
http://www.technologyreview.com/view/519391/internet-archaeologists-reconstruct-lost-web-pages/
http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/
Mashable
Yes I am Indiana Jones of the internet
Publications
Published Submitted In preparation Planned
JCDL 2011 TPDL 2015 WWW 2016 IJDL 2016
TPDL 2012 SIGIR 2016 WSDM 2016
JCDL 2013
TPDL 2013
WWW 2013
DL 2014
AAAI 2015
IJDL 2015
JCDL 2015
Remember Rémi Ochlik?
Rémi Ochlik
16 October 1983 – 22 February 2012
… and the missing content about him?
Accessed in July 2014
We can maintain the consistency of history
Our Temporal Intention Relevancy Model