Here’s the data. Here’s what you can do with it. Janani Kalyanam

Page 1: Here's the Data. Here's what you can do with it - Janani Kalyanam

Here’s the data. Here’s what you can do with it.

Janani Kalyanam

Page 2: Here's the Data. Here's what you can do with it - Janani Kalyanam

Who is this talk for?

1. All backgrounds

2. Main message: you need not be an ML expert to analyze and learn from data. You need to know the right resources; that’s key!

Page 3: Here's the Data. Here's what you can do with it - Janani Kalyanam

Introduction

1. Data today is easily available

2. Several interesting machine learning tools exist to analyze and summarize data

3. (1) + (2) open the door to lots of interesting applications

Page 4: Here's the Data. Here's what you can do with it - Janani Kalyanam

Outline

1. Data Collection

a. some pseudo code

2. Data Analysis and Learning

a. show a concrete example

b. will point to more resources

3. Some applications

a. in the health space

Page 5: Here's the Data. Here's what you can do with it - Janani Kalyanam

Some specs

1. My data source of choice: Twitter

2. Primarily text. Analyze the text in several hundreds of thousands of tweets.

3. My language of choice: Python

a. Python is G-R-E-A-T!

b. If you are a MATLAB user, making the switch is not difficult.

Page 6: Here's the Data. Here's what you can do with it - Janani Kalyanam

Data Collection

Page 7: Here's the Data. Here's what you can do with it - Janani Kalyanam

Twitter

- Streaming API

  - Public stream, user stream, site stream

- Public stream: approximately 1% of the live feed.

- Can track specific keywords

  - Tweets containing “ebola”, “ebolavirus”, “sudanvirus”

- Python module for Twitter called tweepy
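
The keyword tracking above happens on Twitter's side, so it cannot be reproduced here without credentials. As a rough offline stand-in, the sketch below applies a similar filter to text that is already in memory. The function name `matches_track`, the tokenization rule, and the sample strings are all assumptions made for illustration; the Streaming API's real matching rules may differ.

```python
import re

# Keywords from the slide.
TRACK_TERMS = ["ebola", "ebolavirus", "sudanvirus"]

def matches_track(text, terms=TRACK_TERMS):
    """Return True if any tracked term appears as a whole token in the text.

    Hypothetical local approximation of keyword tracking, not the API itself.
    """
    # Lowercase, then split on anything that is not a letter, digit, or underscore,
    # so "#ebolavirus" yields the token "ebolavirus".
    tokens = set(re.findall(r"[a-z0-9_]+", text.lower()))
    return any(term.lower() in tokens for term in terms)

# Made-up sample texts.
sample = [
    "Ebola screening begins at the airport",
    "Watching a movie tonight",
    "New #ebolavirus update from officials",
]
kept = [t for t in sample if matches_track(t)]
```

Here `kept` holds the first and third sample strings; the second mentions none of the tracked terms.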

Page 8: Here's the Data. Here's what you can do with it - Janani Kalyanam

Code

https://github.com/kjanani/fent/blob/master/scripts_for_streaming/create_stream.py

Page 9: Here's the Data. Here's what you can do with it - Janani Kalyanam

Data Processing

1. JSON files

2. Python’s json module reads the data as a dict

3. Each tweet, with the tweet text and all of its metadata, is a “tweet object”
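
To make the "tweet object as a dict" point concrete, here is a minimal sketch that parses one JSON document per line. The two sample records are invented and heavily trimmed; real tweet objects carry many more fields. The helper name `parse_tweets` is also an assumption.

```python
import json

# Hypothetical, trimmed-down tweet objects, one JSON document per line.
raw_lines = [
    '{"id": 1, "text": "Ebola screening at JFK", "user": {"screen_name": "alice"}}',
    '{"id": 2, "text": "Patient in Dallas hospital", "user": {"screen_name": "bob"}}',
]

def parse_tweets(lines):
    """Yield one dict per non-empty line, skipping lines that are not valid JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            # A stream dump can contain truncated records; skip them.
            continue

tweets = list(parse_tweets(raw_lines))
texts = [t["text"] for t in tweets]                     # the tweet text
authors = [t["user"]["screen_name"] for t in tweets]    # nested metadata
```

With a file on disk, the same function can take the open file handle in place of `raw_lines`, since iterating a file yields its lines.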

Page 10: Here's the Data. Here's what you can do with it - Janani Kalyanam

Data Processing

Page 11: Here's the Data. Here's what you can do with it - Janani Kalyanam

Big Data

1. Several millions of tweets

2. Even calculating simple statistics might require looping through the JSON file(s)

○ NOT EFFICIENT!

3. Store the data in a database

○ SQLite has a Python module (sqlite3) that makes queries very quick
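
A minimal sketch of the database idea, using Python's built-in `sqlite3` module. The table layout, column names, and rows are assumptions chosen for illustration; they are not the schema used in the talk.

```python
import sqlite3

# Made-up rows: (tweet id, author, text).
rows = [
    (1, "alice", "Ebola screening at JFK"),
    (2, "bob", "Patient in Dallas hospital"),
    (3, "alice", "Dallas patient in critical condition"),
]

conn = sqlite3.connect(":memory:")  # pass a file path instead to persist the data
conn.execute(
    "CREATE TABLE tweets (id INTEGER PRIMARY KEY, screen_name TEXT, text TEXT)"
)
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?)", rows)
# An index lets per-user lookups avoid a full table scan.
conn.execute("CREATE INDEX idx_user ON tweets (screen_name)")
conn.commit()

# Simple statistics as queries, instead of looping over JSON files.
total = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
per_user = conn.execute(
    "SELECT screen_name, COUNT(*) FROM tweets "
    "GROUP BY screen_name ORDER BY screen_name"
).fetchall()
# LIKE is case-insensitive for ASCII text in SQLite by default.
dallas = conn.execute(
    "SELECT COUNT(*) FROM tweets WHERE text LIKE ?", ("%dallas%",)
).fetchone()[0]
```

The parameterized `?` placeholders keep tweet text, which can contain quotes, from breaking the SQL.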

Page 12: Here's the Data. Here's what you can do with it - Janani Kalyanam

Data Analysis and Learning

Page 13: Here's the Data. Here's what you can do with it - Janani Kalyanam

Data Analysis and Learning

1. unsupervised: clustering/topic models

○ NMF-based

○ LDA-based

2. supervised:

○ Depending on the application, manually code a small subset of the data, and extrapolate
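
One way to picture "manually code a small subset, and extrapolate" is a small text classifier trained on the hand-labeled tweets. The slides do not name a classifier; multinomial Naive Bayes is used below only as an illustration, and the labeled examples, label names, and function names are all invented.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled):
    """Collect the counts needed for multinomial Naive Bayes."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of documents
    vocab = set()
    for tokens, label in labeled:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict_nb(model, tokens):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    n_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / n_docs)
        for w in tokens:
            if w in vocab:  # ignore words never seen during training
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical hand-coded subset: (tokens, label).
labeled = [
    (["popping", "percocet", "tonight"], "relevant"),
    (["overdose", "oxycodone", "withdrawal"], "relevant"),
    (["discount", "percocet", "online", "price"], "irrelevant"),
    (["canada", "oxycodone", "rules", "monopoly"], "irrelevant"),
]
model = train_nb(labeled)
guess_a = predict_nb(model, ["withdrawal", "from", "oxycodone"])
guess_b = predict_nb(model, ["online", "price", "percocet"])
```

Four labeled examples are far too few for real use; the point is only the workflow: label a sample by hand, fit a model, then apply it to the unlabeled remainder.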

Page 14: Here's the Data. Here's what you can do with it - Janani Kalyanam

Topic Models

- Summarize the content in a large repository

- Some parameters need to be set or tuned

- Once a summary is obtained, the results need to be put in context

- Special topic models have been developed for analyzing tweets

  - Tweets are very short, noisy text

Page 15: Here's the Data. Here's what you can do with it - Janani Kalyanam

Example

- A biterm topic model for short texts

  - Xiaohui Yan et al., WWW 2013

- From Event Detection to Story Telling on Microblogs

  - Kalyanam et al., ASONAM 2016

- Graph based methods for summarizing events

  - Kalyanam et al. (under preparation)
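
For the biterm topic model in the first reference, the basic unit is a "biterm": an unordered pair of distinct words that co-occur in the same short text. The sketch below covers only that extraction step, not the topic inference that the model then runs over the biterm collection. The function name and the sample documents are assumptions.

```python
from collections import Counter
from itertools import combinations

def biterms(tokens):
    """Return every unordered pair of distinct words in one short text."""
    unique = sorted(set(tokens))  # sorting gives each pair one canonical order
    return list(combinations(unique, 2))

# Made-up, already tokenized short texts.
docs = [
    ["patient", "hospital", "dallas"],
    ["patient", "dallas", "homeless"],
]

# Pool biterms over the whole corpus; the model works on this pooled collection.
counts = Counter(pair for doc in docs for pair in biterms(doc))
```

Pooling pairs across the corpus is what helps with short texts: a single tweet has too few words for reliable per-document statistics, but the corpus-wide biterm counts are much denser.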

Page 16: Here's the Data. Here's what you can do with it - Janani Kalyanam

Example: Ebola outbreak in 2014

ur, watching, disney, channel

patient, hospital, dallas, critical, condition

patient, dallas, homeless, man, officials

jfk, airport, new, york, screening, enhances

Page 17: Here's the Data. Here's what you can do with it - Janani Kalyanam

Applications

Once the summaries are obtained, they need to be put in the bigger context of the domain of interest.

Page 18: Here's the Data. Here's what you can do with it - Janani Kalyanam

Exploring Non-medical use of prescription drug abuse

Aim: explore behaviors of drug abuse, specifically involving “oxycontin”, “oxycodone”, “percocet”

Page 19: Here's the Data. Here's what you can do with it - Janani Kalyanam

Exploring Non-medical use of prescription drug abuse

- Filtered the Twitter Streaming API for “oxycontin”, “oxycodone”, “percocet”

- Detected themes, and discarded the irrelevant themes

- What constitutes a relevant theme?

  - Mentions identified verbs of substance abuse (e.g., overdose, injection, withdrawal)

  - Contains adjectives related to prescription drug abuse behavior (e.g., popping, high)

canada, monopoly, rules, oxycodone, drugs

super, high, best, online, price, discount, percocet

Percocet, xanax, pop, strippers
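
The relevance criteria above could be approximated with a simple lexicon lookup over each theme's words. This is only a sketch: the lexicon contains just the example terms from the slide, the function name `matched_terms` is an assumption, and the output says nothing about which themes were actually kept in the study.

```python
# Example terms from the slide; a real study would need a longer, curated list.
ABUSE_TERMS = {"overdose", "injection", "withdrawal", "popping", "high"}

# The three example themes shown on the slide, lowercased.
themes = [
    ["canada", "monopoly", "rules", "oxycodone", "drugs"],
    ["super", "high", "best", "online", "price", "discount", "percocet"],
    ["percocet", "xanax", "pop", "strippers"],
]

def matched_terms(theme, lexicon=ABUSE_TERMS):
    """Return the lexicon terms that appear in a theme (case-insensitive, exact match)."""
    return sorted({w.lower() for w in theme} & lexicon)

# One list of matched lexicon terms per theme; an empty list means no match.
flags = [matched_terms(t) for t in themes]
```

Exact matching does not catch variants such as "pop" versus "popping"; a stemmer or a longer lexicon would be needed for that.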

Page 20: Here's the Data. Here's what you can do with it - Janani Kalyanam

Thank you.

If you have more questions, or want to discuss more, feel free to reach me at: [email protected]