Social Network Analysis Update
A Short Overview of the Problems and an Update of
Our Twitter Capture/Analysis System
Joshua White
CS644
Background
The Problem
• Social Networking Sites:
  – Provide a communication method thought by many to be at least somewhat private
    • Many users never change the default security settings associated with their accounts
  – Support linking of older accounts/sites to new sites through unified login, which often leads to a sort of “most-privileged” escalation
    • The site with the most permissive public-access settings can obtain private data from the restricted account on another site and re-display it, because the two sites share a login system.
Target for Current Work
• Twitter
  – A real-time social information network
  – Various parsable APIs
    • Search, Live, Some Historical (24 hrs)
  – Large userbase:
    • 65 million ‘tweets’ per day*
    • ~750 ‘tweets’ per second
  – International community
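As a quick sanity check, the per-second figure above follows from the daily total. This is a minimal sketch; the variable names are illustrative only.

```python
# Convert the quoted daily tweet volume into an average per-second rate.
TWEETS_PER_DAY = 65_000_000       # figure quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400 seconds

tweets_per_second = TWEETS_PER_DAY / SECONDS_PER_DAY
print(round(tweets_per_second))   # 752, consistent with "~750 per second"
```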
Who Uses Twitter
• People
  – Every Day People
  – Politicians
  – Celebrities
  – Professionals
  – Bad-Guys
• Objects
  – Tweeting gadgets (sensors, bots, computers, bot-masters, spammers)
• Labeled Nefarious Groups
  – Lulzsec
  – Anonymous
The Twitter API
• Twitter protocol fields:
  – Typically shown in XML or JSON
  – Provide:
    • Location (geo fields)
    • Username/Real Name
    • Threading
      » Track conversations and @ replies
      » Track retweets
    • Twitter client software data
    • Timestamping
    • And, of course, the text of the tweet.
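The threading fields make it possible to rebuild a conversation from captured tweets. The sketch below assumes each tweet has already been parsed into a dict carrying `id` and `in_reply_to_status_id` (both field names appear in the field table). The helper `build_thread` and the sample records are hypothetical and only illustrate the idea.

```python
from typing import Dict, List, Optional


def build_thread(tweets: List[dict], leaf_id: int) -> List[int]:
    """Walk in_reply_to_status_id links from a tweet back to the thread root.

    Returns tweet ids ordered from the root of the conversation to leaf_id.
    Stops early if a parent tweet was not captured (likely with sampled data).
    """
    by_id: Dict[int, dict] = {t["id"]: t for t in tweets}
    chain: List[int] = []
    seen = set()  # guards against malformed data that loops back on itself
    current: Optional[int] = leaf_id
    while current is not None and current in by_id and current not in seen:
        seen.add(current)
        chain.append(current)
        current = by_id[current].get("in_reply_to_status_id")
    return list(reversed(chain))


# Made-up sample data, for illustration only.
sample = [
    {"id": 1, "in_reply_to_status_id": None, "text": "original"},
    {"id": 2, "in_reply_to_status_id": 1, "text": "@a reply"},
    {"id": 3, "in_reply_to_status_id": 2, "text": "@b reply to reply"},
]
print(build_thread(sample, 3))  # [1, 2, 3]
```

Because the capture is only a sample of the full stream, many chains will end at a parent that was never collected; the function then returns the partial chain it could recover.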
Field Name | Description | Example Data
name | User's real name | Text: "Robert Scoble"
screen_name | User's Twitter username | Text: "scobleizer"
profile_image_url | Link to user's profile image | Link: "http://a1.twimg.com/profile_images/456562836/scoblebuilding43crop-fanatiguy_normal.jpg"
url | Link to user's non-Twitter site | Link: "http://www.google.com/profiles/scobleizer"
followers_count | Number of followers user has | Number: "185496"
friends_count | Number of people user follows | Number: "31971"
utc_offset | Offset from GMT (in seconds) | Number: "-28800"
geo_enabled | Whether user has enabled location | Boolean: "True"
statuses_count | Number of statuses user has posted | Number: "53522"

Tweet Specific Fields
created_at | Tweet timestamp | Text: "Tue Jun 14 18:30:13 +0000 2011"
id | Tweet id (useful for URL creation) | Number: "80703603437875201"
text | Contains the actual text + any embedded URLs | Whatever text the person chooses to enter (could be in any supported language)
source | Links to Twitter client URL (not important) | HTML code: "<a href="http://www.echofon.com/" rel="nofollow">Echofon</a>"
in_reply_to_status_id | ID of the status that the user replied to | Number: "80671170374025220"
in_reply_to_screen_name | Screen name of user the current status replies to | Text: "danharmon"
retweet_count | Number of times this status is retweeted | Number: "0"
retweeted | Whether or not the status has been retweeted | Boolean: "false"

'geo' flag specific:
georss:point | Lat. & Long. location | Number: "43.21227199 -75.39866939"
url | Points to a JSON or XML file with further geo info | Link: "http://api.twitter.com/1/geo/id/00228ed265b1139e.xml"
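To show how these fields might be pulled out of a JSON-formatted status, here is a minimal sketch. The sample record is stitched together from the example values in the table and is not a real tweet. The nesting (a `user` object, and `geo` holding a `coordinates` pair in latitude, longitude order) is an assumption about the JSON layout. The function `parse_status` and the output keys `geo_lat`/`geo_long` are hypothetical.

```python
import json
from datetime import datetime

# Sample record assembled from the table's example values (illustration only).
# The "user" nesting and the "geo" -> "coordinates" layout are assumptions.
raw = """
{
  "created_at": "Tue Jun 14 18:30:13 +0000 2011",
  "id": 80703603437875201,
  "text": "example tweet text",
  "in_reply_to_status_id": 80671170374025220,
  "in_reply_to_screen_name": "danharmon",
  "retweet_count": 0,
  "retweeted": false,
  "geo": {"type": "Point", "coordinates": [43.21227199, -75.39866939]},
  "user": {
    "name": "Robert Scoble",
    "screen_name": "scobleizer",
    "followers_count": 185496,
    "friends_count": 31971,
    "utc_offset": -28800,
    "geo_enabled": true,
    "statuses_count": 53522
  }
}
"""


def parse_status(raw_json: str) -> dict:
    """Flatten one status into the handful of fields used for analysis."""
    status = json.loads(raw_json)
    user = status.get("user") or {}
    geo = status.get("geo") or {}          # "geo" is null when no location is set
    coords = geo.get("coordinates") or [None, None]
    return {
        "id": status["id"],
        # created_at uses a fixed English day/month format with a UTC offset.
        "created_at": datetime.strptime(
            status["created_at"], "%a %b %d %H:%M:%S %z %Y"
        ),
        "text": status.get("text", ""),
        "screen_name": user.get("screen_name"),
        "in_reply_to_status_id": status.get("in_reply_to_status_id"),
        "geo_lat": coords[0],
        "geo_long": coords[1],
    }


parsed = parse_status(raw)
print(parsed["screen_name"], parsed["created_at"].isoformat())
```

Parsing `created_at` into a timezone-aware `datetime` keeps later comparisons and time-window queries independent of the user's `utc_offset`.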
Benefits of Social Media Awareness as the Gov. Agencies See It
• Track locations with reasonable accuracy
  – If enabled by the user
• Bad guys may have protected feeds
  – Others may ‘retweet’ them, and this can be tracked.
• Track trends
  – Who said what, who repeated it
• News before ‘official’ reports
Gov. View
• DHS identified categories of Social Media sites [2]:
  – Search
  – Video
  – Maps
  – Photos
  – Blog Aggregators
  – Twitter-related sites
  – Facebook-related sites
Gov. View Continued
• Among these categories are sites like:
  – RSSOwl
  – Hulu
  – YouTube
  – Google Flu
  – Flickr
  – Twitter
  – Facebook
  – ABCNews Blotter
  – Myspace
Interesting Facebook Privacy Facts:
• Percentage of Facebook users, by age, who change their account security settings to something other than the default (no security) [1]:
  – 18-29 years old = 71%
  – 30-39 years old = 67%
  – 50-64 years old = 55%
• 80% of all users (according to some websites) fall within that 18-64 age range.
  – That means that potentially 20+ million users have no security on their accounts.
Proliferation of Facebook
Current Work Update
Data Collection
• To date:
  – We have collected over 80 million tweets using *John's Java-based method/system.
    • Located at the GI (Griffiss Institute)
    • Each compressed .tcm capture file is
      – 10 days of capture
      – ~8.5 million tweets and associated data
        » Tweets are only a sampling of the total data being posted to Twitter, but we're rate limited by Twitter's API
  – Uses the Twitter streaming API
* John Stacy
Data Collection Update
• As of 6/21/2011 a new data collection method/system is in place:
  – Located at the GI as well
  – Uses John's JSON analysis method re-implemented in PHP with data storage in MySQL
  – Captured data:
    • ~160,000 tweets per hour so far
      – Estimated ~4 million per day
    • Uses the phirehose API [3]
    • DB consists of raw JSON data, parsed-out tweets, and a special stripped-down user section
      – The user section is there in preparation for adding crawled account data.
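The three-part layout described above (raw JSON, parsed tweets, stripped-down user section) could be sketched roughly as follows. This is not the actual schema. SQLite stands in for MySQL so the example runs without a server, and Python stands in for the PHP implementation. All table and column names are hypothetical apart from `tweets` and `geo_lat`, which appear in the deck's sample query, and `store_status` is an illustrative helper.

```python
import json
import sqlite3

# In-memory SQLite stands in for MySQL; names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE raw_json (
        tweet_id INTEGER PRIMARY KEY,
        body     TEXT NOT NULL          -- unmodified JSON as received
    );
    CREATE TABLE tweets (
        tweet_id    INTEGER PRIMARY KEY,
        user_id     INTEGER NOT NULL,
        created_at  TEXT,
        text        TEXT,
        geo_lat     REAL,               -- NULL when no location is attached
        geo_long    REAL
    );
    CREATE TABLE users (
        user_id     INTEGER PRIMARY KEY,
        screen_name TEXT                -- crawled account data can be added later
    );
    """
)


def store_status(conn: sqlite3.Connection, raw: str) -> None:
    """Store one status in all three sections: raw JSON, parsed tweet, user."""
    status = json.loads(raw)
    user = status["user"]
    coords = (status.get("geo") or {}).get("coordinates") or [None, None]
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO raw_json VALUES (?, ?)", (status["id"], raw)
        )
        conn.execute(
            "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?, ?, ?)",
            (status["id"], user["id"], status.get("created_at"),
             status.get("text"), coords[0], coords[1]),
        )
        conn.execute(
            "INSERT OR IGNORE INTO users VALUES (?, ?)",
            (user["id"], user.get("screen_name")),
        )


# Made-up status, for illustration only.
store_status(conn, json.dumps({
    "id": 1, "text": "hello", "created_at": "Tue Jun 21 12:00:00 +0000 2011",
    "geo": None, "user": {"id": 10, "screen_name": "example_user"},
}))
print(conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0])  # 1
```

Keeping the raw JSON alongside the parsed columns means fields that were not extracted at capture time can still be recovered later without re-collecting.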
DB Snapshot
Why Is This New DB Important?
• The previous method is well suited to long-term analysis, but we also need a way to gather stats and see what does and doesn't work without writing new code for each question.
• The new DB allows for simple SQL queries such as:
  – SELECT * FROM `tweets` WHERE `geo_lat` >0 LIMIT 0 , 30
    • This looks for any tweet that has a value greater than 0 in the latitude field.
  – Out of 1,593,922 tweets at the time of this query
  – 8,470 had a latitude/longitude associated with them
    » We'll need a more complex query to see how many of those are associated with individual users
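One possible shape for the follow-up query (how many individual users account for the geotagged tweets) is sketched below against made-up rows. SQLite stands in for MySQL, and the `user_id` column is an assumption about the schema. The second and third queries also assume that tweets without a location store NULL rather than 0. If that holds, `IS NOT NULL` would also catch southern-hemisphere latitudes, which `geo_lat > 0` excludes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, user_id INTEGER,"
    " geo_lat REAL, geo_long REAL)"
)
# Made-up rows: two geotagged tweets by user 10, one by user 11 (southern
# hemisphere), and one tweet by user 12 with no location.
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?, ?)",
    [
        (1, 10, 43.21, -75.40),
        (2, 10, 43.22, -75.41),
        (3, 11, -33.87, 151.21),
        (4, 12, None, None),
    ],
)

# The condition from the slide: rows with a positive latitude.
north = conn.execute(
    "SELECT COUNT(*) FROM tweets WHERE geo_lat > 0"
).fetchone()[0]

# Distinct users with any location attached (assumes missing = NULL).
geo_users = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM tweets WHERE geo_lat IS NOT NULL"
).fetchone()[0]

# Geotagged tweets per user, most active first.
per_user = conn.execute(
    "SELECT user_id, COUNT(*) AS n FROM tweets"
    " WHERE geo_lat IS NOT NULL GROUP BY user_id ORDER BY n DESC, user_id"
).fetchall()

print(north, geo_users, per_user)  # 2 2 [(10, 2), (11, 1)]
```

For scale, 8,470 of 1,593,922 is roughly 0.5% of the captured tweets, so per-user counts like these would show whether a small group of users produces most of the geotagged data.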
Conclusion
• There's a lot we can do with this data
  – I suggest we develop methods using the DB and then port them over to the Coalmine Query code for better scalability.
• The privacy implications of using this data are high.
  – I'm torn on its usage by the Gov.
    • I see the national security implications and also the privacy violations that may ensue.
Citations
• [1]
  – Vaidhyanathan, S., "Welcome to the surveillance society," IEEE Spectrum, vol. 48, no. 6, pp. 48-51, June 2011. doi: 10.1109/MSPEC.2011.5779791. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5779791&isnumber=5779759
• [2]
  – DHS, Office of Operations Coordination and Planning, “Publicly Available Social Media Monitoring and Situational Awareness Initiative,” June 22, 2010. http://www.dhs.gov/xlibrary/assets/privacy/privacy_pia_ops_publiclyavailablesocialmedia.pdf
• [3]
  – http://code.google.com/p/phirehose/