The Wild West of Data Wrangling
Sarah Guido PyCon 2017
@sarah_guido
This talk:
• A day in the life
• Three examples of dealing with uncooperative data
• Not ground truth!
Who am I?
• Senior data scientist at Mashable
• Mashable == internet culture media!
• Data sciencing in Python
• Twitter: @sarah_guido
Iris Dataset
Example 1: Predicting building sales
• The problem: can we predict if a building will sell the following year?
• The data: floors, location, square footage, price per sqft, etc.
• The goal: provide valuable insight to platform users
Example 1: Predicting building sales
• First thought: logistic regression using scikit-learn
• Binary classification: sale/no sale
Problem…
Data: 95% no sale, 5% sale
Logistic regression: 95% accurate
DONE!
Problem: Class imbalance
Class imbalance
When the classes you are trying to predict are not equally represented, classification models can become biased toward the majority class.
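To make the "95% accurate, DONE!" trap concrete, here is a minimal sketch. The labels are hypothetical and only mirror the 95% / 5% split from the slide; the "model" is a stand-in that always predicts the majority class, not the logistic regression from the talk.

```python
# Hypothetical labels mirroring the slide's split: 1 = sale, 0 = no sale.
labels = [1] * 5 + [0] * 95

# A stand-in "model" that ignores its input and always predicts "no sale".
predictions = [0] * len(labels)

# Accuracy: share of predictions that match the label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the minority class: of the buildings that sold, how many were caught?
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy: {accuracy:.2f}")       # accuracy: 0.95
print(f"recall on sales: {recall:.2f}")  # recall on sales: 0.00
```

Accuracy looks excellent while not a single sale is identified, which is why a metric that looks at the minority class is more informative here.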
Solution: Gradient boosting
Gradient boosting
Produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees, where each new model is fitted to correct the errors of the ensemble so far.
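The slides do not show the model code, so the following is only a sketch of the mechanics, not the talk's implementation. It assumes squared-error loss (where the negative gradient is simply the residual), one-split decision stumps as the weak learners, and hypothetical toy data. In practice a library implementation such as scikit-learn's GradientBoostingClassifier would be the more likely choice.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the last value would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted(xs, ys, n_rounds, learning_rate=0.1):
    base = sum(ys) / len(ys)          # start from the mean prediction
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # Residuals are the negative gradient of squared-error loss.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Take a small step in the direction the new weak learner suggests.
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def boosted_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((y - boosted_predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data: x could be a building's floor count, y = 1 means "sold".
floors = [1, 2, 3, 4, 5, 6, 7, 8]
sold = [0, 0, 1, 1, 0, 0, 0, 0]

one_round = fit_boosted(floors, sold, n_rounds=1)
many_rounds = fit_boosted(floors, sold, n_rounds=200)
print("training MSE after 1 round:   ", mean_squared_error(one_round, floors, sold))
print("training MSE after 200 rounds:", mean_squared_error(many_rounds, floors, sold))
```

No single stump can isolate the "sold" buildings in the middle of the range, but the sum of many stumps can, which is the point of combining weak learners.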
Example 2: Clustering user interactions
The problem: how can we identify similar patterns based on click data?
The data: time, geolocation, cookie, browser user-agent string, referrer
The goal: understand how people interact with content over time
Why Scala?
Problem: Clustering user interactions
K-means clustering
An unsupervised learning method that groups data points into k clusters, assigning each point to the nearest cluster center according to a distance metric.
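As a rough illustration of the idea, here is a minimal k-means sketch. It is written in Python to match the other examples, although the talk used Scala for this problem. The feature values are hypothetical, Euclidean distance is assumed, and the first k points are used as starting centroids purely to keep the sketch deterministic.

```python
import math


def kmeans(points, k, n_iter=20):
    # Deterministic start for this sketch: the first k points are the centroids.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Hypothetical per-user features: (number of interactions, average days between them).
users = [(5, 8.0), (6, 7.5), (5, 9.0), (20, 1.0), (22, 1.5), (19, 0.5)]

assignments, centroids = kmeans(users, k=2)
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

Note that every row has the same number of columns. K-means needs fixed-length vectors, which is why users with different numbers of interactions have to be transformed first.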
Problem: Clustering the data
• Only look at users with 5 or more interactions
• Each user has a different number of interactions
• Each data point ends up in a different cluster
Solution: Transform the data
date: 2017-04-09, 2017-04-13, 2017-04-30, 2017-05-01, 2017-05-12
Number of interactions: 5
Average time between interactions: ~8 days
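The summary above can be reproduced with a short sketch. The function and key names are hypothetical; the dates are the ones from the slide.

```python
from datetime import date

# Interaction dates for one user, as listed on the slide.
interaction_dates = [
    date(2017, 4, 9),
    date(2017, 4, 13),
    date(2017, 4, 30),
    date(2017, 5, 1),
    date(2017, 5, 12),
]


def summarize_interactions(dates):
    """Collapse a variable-length list of dates into two fixed features."""
    ordered = sorted(dates)
    # Days between each pair of consecutive interactions.
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return {
        "n_interactions": len(ordered),
        "avg_gap_days": sum(gaps) / len(gaps),
    }


features = summarize_interactions(interaction_dates)
print(features)  # {'n_interactions': 5, 'avg_gap_days': 8.25}
```

The gaps are 4, 17, 1 and 11 days, so the average is 8.25, matching the "~8 days" on the slide. The function assumes at least two dates, which holds when only users with 5 or more interactions are kept.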
Solution: Transform the data
referrer: facebook, twitter
One-hot encode and transform to matrix
• Facebook: [1, 0]
• Twitter: [0, 1]
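A minimal sketch of the encoding above, using plain Python lists. The helper name and the sample list of referrers are hypothetical; the category order is taken from the slide.

```python
def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1 if value == category else 0 for category in categories]


# Category order follows the slide: facebook first, then twitter.
categories = ["facebook", "twitter"]

# Hypothetical referrers for a handful of interactions.
referrers = ["facebook", "twitter", "facebook"]

# One row per interaction, one column per category.
matrix = [one_hot(referrer, categories) for referrer in referrers]
print(matrix)  # [[1, 0], [0, 1], [1, 0]]
```

A value that is not in `categories` encodes as all zeros in this sketch; whether that is acceptable depends on how unknown referrers should be treated.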
Example 3: Understand audience composition
The problem: how can we effectively describe our audience?
The data: anonymized demographic and psychographic data
The goal: audience segmentation and channel analysis
Problem: insufficient data
• Google Analytics data – 1/3 of URLs
• Finicky API
• Semi-useless psychographic data
Solution: accept defeat
Solution: accept defeat? Make it work!
Solution: make it work!
• Theory of highly performant links
• Segmentation through archetypal analysis
• Go get more data!
General strategy
• What problem are you trying to solve?
• What’s wrong with your data?
• What do you need that you don’t have?
Keep in mind…
• Data your company collects is complicated
• What you do to your data will affect the model
• Creativity is your friend
• Lots of ways to solve the problem
Thank you!
@sarah_guido