Here’s the data. Here’s what you can do with it.
Janani Kalyanam
Who is this talk for?
1. all backgrounds
2. main message: you need not be an ML expert to analyze and learn from data. Knowing the right resources is key!
Introduction
1. Data today is easily available
2. Several interesting machine learning tools to analyze and summarize data
3. (1) + (2) open the door to lots of interesting applications
Outline
1. Data Collection
   a. some pseudo code
2. Data Analysis and Learning
   a. show a concrete example
   b. will point to more resources
3. Some applications
   a. in the health space
Some specs
1. My data source of choice: Twitter
2. Primarily text. Analyze the text in several hundreds of thousands of tweets.
3. My language of choice: Python
   a. Python is G-R-E-A-T!
   b. If you are a MATLAB user, making the switch is not difficult.
Data Collection
- Streaming API
   - Public stream, user stream, site stream
- Public stream: approximately 1% of the live feed.
- Can track specific keywords
   - e.g., tweets containing “ebola”, “ebolavirus”, “sudanvirus”
- Python module for Twitter called tweepy
Code
https://github.com/kjanani/fent/blob/master/scripts_for_streaming/create_stream.py
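The linked script does the actual collection with tweepy. As a rough local illustration of what keyword tracking amounts to, the sketch below applies a simplified, case-insensitive token match to texts that are already in hand. The function name `matches_keywords` and the sample texts are made up for illustration, and the Streaming API does its filtering on the server side with its own matching rules, so treat this only as an approximation of the idea.

```python
import re

# Keywords taken from the slide.
KEYWORDS = ["ebola", "ebolavirus", "sudanvirus"]


def matches_keywords(text, keywords=KEYWORDS):
    """Return True if any keyword appears as a whole token in the text.

    Simplified stand-in for server-side tracking: lowercase the text,
    split it on non-alphanumeric characters, and test token membership.
    """
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return any(keyword.lower() in tokens for keyword in keywords)


# Hypothetical sample texts, not real tweets.
samples = [
    "Officials confirm first #Ebola patient in Dallas",
    "watching disney channel all day",
]
kept = [s for s in samples if matches_keywords(s)]
print(kept)
```

Matching on whole tokens rather than substrings is a design choice here: it avoids keeping a text just because a keyword happens to be embedded in a longer, unrelated word.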
Data Processing
1. JSON files
2. Python’s json module reads the data as a dict
3. Each tweet, i.e. the tweet text together with all its metadata, is a “tweet object”
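A minimal sketch of the points above, assuming the collected file holds one tweet object per line (an assumption about the storage layout, not something the slides state). The two records are hand-written, heavily trimmed stand-ins. The field names `text`, `created_at` and `user.screen_name` follow the v1.1 tweet object as I understand it, so check them against your own data.

```python
import io
import json

# Hypothetical, heavily trimmed tweet objects, one JSON document per line.
fake_file = io.StringIO(
    '{"id": 1, "created_at": "Wed Oct 01 12:00:00 +0000 2014", '
    '"text": "ebola patient in dallas", "user": {"screen_name": "user_a"}}\n'
    '{"id": 2, "created_at": "Wed Oct 01 12:05:00 +0000 2014", '
    '"text": "jfk airport screening", "user": {"screen_name": "user_b"}}\n'
)


def read_tweets(handle):
    """Yield one dict per non-empty line of a JSON-lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


tweets = list(read_tweets(fake_file))
first = tweets[0]
print(type(first).__name__)          # dict
print(first["text"])                 # the tweet text
print(first["user"]["screen_name"])  # metadata lives in nested dicts
```

With a real file, `open(path, encoding="utf-8")` would take the place of the `io.StringIO` object.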
Big Data
1. Several millions of tweets
2. Even to calculate simple statistics, you might need to loop through the json file(s)
   ○ NOT EFFICIENT!
3. Store the data in a database
   ○ SQLite3 has a Python module (sqlite3) that makes queries very quick
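As a sketch of the database step, the example below loads a few tweet dicts into SQLite with Python's built-in `sqlite3` module and computes simple statistics with queries instead of loops. The table layout, the column names and the three sample records are assumptions made for illustration. An in-memory database keeps the example self-contained; a file path would be used in practice.

```python
import sqlite3

# Hypothetical, trimmed tweet dicts (same shape as parsed tweet objects).
tweets = [
    {"id": 1, "created_at": "Wed Oct 01 12:00:00 +0000 2014",
     "text": "ebola patient in dallas", "user": {"screen_name": "user_a"}},
    {"id": 2, "created_at": "Wed Oct 01 12:05:00 +0000 2014",
     "text": "jfk airport screening", "user": {"screen_name": "user_b"}},
    {"id": 3, "created_at": "Wed Oct 01 12:09:00 +0000 2014",
     "text": "dallas hospital update", "user": {"screen_name": "user_a"}},
]

conn = sqlite3.connect(":memory:")  # use a filename to persist the data
conn.execute(
    "CREATE TABLE tweets ("
    "id INTEGER PRIMARY KEY, screen_name TEXT, created_at TEXT, text TEXT)"
)
# INSERT OR IGNORE skips tweets whose id is already stored.
conn.executemany(
    "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)",
    [(t["id"], t["user"]["screen_name"], t["created_at"], t["text"])
     for t in tweets],
)
conn.commit()

# Simple statistics as queries rather than loops over the raw files.
total = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
per_user = conn.execute(
    "SELECT screen_name, COUNT(*) FROM tweets "
    "GROUP BY screen_name ORDER BY COUNT(*) DESC, screen_name"
).fetchall()
dallas = conn.execute(
    "SELECT COUNT(*) FROM tweets WHERE text LIKE ?", ("%dallas%",)
).fetchone()[0]
print(total, per_user, dallas)
```

For a table with millions of rows, an index on the columns you filter by most often (for example `CREATE INDEX idx_user ON tweets(screen_name)`) is usually worth adding.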
Data Analysis and Learning
1. unsupervised: clustering/topic models
   ○ NMF-based
   ○ LDA-based
2. supervised:
   ○ Depending on the application, manually code a small subset of data, and extrapolate
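To make the supervised route concrete, here is a small sketch of hand-coding a few texts and extrapolating to unseen ones. The slides do not name a classifier; a multinomial naive Bayes with add-one smoothing is swapped in here purely as an assumption, and the labels, function names and four training texts are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled):
    """Count documents per label and words per label from (text, label) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled:
        doc_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        label_total = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][word] + 1) / (label_total + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-coded subset.
hand_coded = [
    ("patient in critical condition at dallas hospital", "relevant"),
    ("officials enhance screening at jfk airport", "relevant"),
    ("watching disney channel with ur friends", "irrelevant"),
    ("new episode on disney channel tonight", "irrelevant"),
]
model = train_naive_bayes(hand_coded)
print(predict(model, "dallas hospital patient"))
print(predict(model, "disney channel tonight"))
```

Four examples are far too few for real use; the point is only the workflow of labeling a subset and applying the learned model to the rest.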
Topic Models
- Summarize the content in a large repository
- Some parameters need to be set or tuned
- Once a summary is obtained, results need to be put in context
- Special topic models developed for analyzing tweets
   - Tweets are very short, noisy text
Example
- A biterm topic model for short texts
   - Xiaohui Yan et al., WWW 2013
- From Event Detection to Story Telling on Microblogs
   - Kalyanam et al., ASONAM 2016
- Graph based methods for summarizing events
   - Kalyanam et al. (under preparation)
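To give a feel for why a biterm model suits short texts, the sketch below shows only its input representation: every unordered pair of words that co-occur in one short text is a "biterm", and the model is fit on the biterms pooled over the whole corpus rather than on individual documents. Fitting the topic model itself is not shown. The function name `extract_biterms` and the two toy token lists are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair from one short text.

    Each pair is sorted so that (a, b) and (b, a) count as the same biterm.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


# Hypothetical tokenized tweets.
corpus = [
    ["patient", "hospital", "dallas"],
    ["patient", "dallas", "officials"],
]

biterm_counts = Counter()
for tokens in corpus:
    biterm_counts.update(extract_biterms(tokens))

print(extract_biterms(corpus[0]))
print(biterm_counts.most_common(1))
```

Pooling pairs across the corpus is what compensates for each tweet being too short to carry reliable word co-occurrence statistics on its own.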
Example: Ebola outbreak in 2014
ur, watching, disney, channel
patient, hospital, dallas, critical, condition
patient, dallas, homeless, man, officials
jfk, airport, new, york, screening, enhances
Applications
Once the summaries are obtained, they need to be put in the broader context of the domain of interest.
Exploring Non-medical use of prescription drug abuse
Aim: understand behaviors of drug abuse, specifically for “oxycontin”, “oxycodone”, “percocet”
- Filtered the Twitter Streaming API for “oxycontin”, “oxycodone”, “percocet”
- Detected themes, and discarded the irrelevant themes
- What constitutes a relevant theme?
   - Mentions identified verbs of substance abuse (e.g., overdose, injection, withdrawal)
   - Contains adjectives related to prescription drug abuse behavior (e.g., popping, high)
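The relevance criteria above can be approximated with a simple term lookup. The sketch below is an assumption about how such a screen might be coded, not the procedure used in the study: the term set contains only the examples named on the slide, the function names are made up, and the first theme in the sample list is invented.

```python
# Example terms copied from the slide; a real study would need a longer, vetted list.
ABUSE_TERMS = {"overdose", "injection", "withdrawal", "popping", "high"}


def is_relevant(theme_words, terms=ABUSE_TERMS):
    """Return True if any word of the theme is one of the abuse-related terms."""
    return any(word.strip().lower() in terms for word in theme_words)


def split_themes(themes):
    """Partition themes into (relevant, discarded) lists."""
    relevant, discarded = [], []
    for theme in themes:
        (relevant if is_relevant(theme) else discarded).append(theme)
    return relevant, discarded


themes = [
    ["percocet", "overdose", "hospital"],                   # invented theme
    ["canada", "monopoly", "rules", "oxycodone", "drugs"],
]
relevant, discarded = split_themes(themes)
print(relevant)
print(discarded)
```

An exact-match rule like this only narrows the candidates. Words such as "high" are ambiguous, so the surviving themes still need to be read and put in context by someone who knows the domain.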
canada, monopoly, rules, oxycodone, drugs
super, high, best, online, price, discount, percocet
Percocet, xanax, pop, strippers
Thank you.
If you have more questions, or want to discuss more, feel free to reach me at: [email protected]