The Wild West of Data Wrangling Sarah Guido PyCon 2017 @sarah_guido

Page 1: The Wild West of Data Wrangling

The Wild West of Data Wrangling

Sarah Guido PyCon 2017

@sarah_guido

Page 2: The Wild West of Data Wrangling

This talk:

•  A day in the life

•  Three examples of dealing with uncooperative data

•  Not ground truth!

Page 3: The Wild West of Data Wrangling

Who am I?

•  Senior data scientist at Mashable

•  Mashable == internet culture media!

•  Data sciencing in Python

•  Twitter: @sarah_guido

Page 4: The Wild West of Data Wrangling

Iris Dataset

Page 5: The Wild West of Data Wrangling

Iris Dataset

Page 6: The Wild West of Data Wrangling
Page 7: The Wild West of Data Wrangling
Page 8: The Wild West of Data Wrangling

Example 1: Predicting building sales

•  The problem: can we predict if a building will sell the following year?

•  The data: floors, location, square footage, price per sqft, etc.

•  The goal: provide valuable insight to platform users

Page 9: The Wild West of Data Wrangling

Example 1: Predicting building sales

•  First thought: logistic regression using scikit-learn

•  Binary classification: sale/no sale

Page 10: The Wild West of Data Wrangling

Problem…

Data: 95% no sale, 5% sale

Logistic regression: 95% accurate

DONE!

Page 11: The Wild West of Data Wrangling
Page 12: The Wild West of Data Wrangling

Problem: Class imbalance

Class imbalance

When the classes you are trying to predict are not equally represented, classification models can become biased toward the majority class.
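
To make the imbalance concrete, here is a minimal sketch in plain Python. The 1,000-row label list and the always-predict-"no sale" baseline are made up for illustration; only the 95%/5% split comes from the slides.

```python
# Hypothetical labels mirroring the slide: 95% "no sale" (0), 5% "sale" (1).
y_true = [0] * 950 + [1] * 50

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the "sale" class: fraction of actual sales the model found.
actual_sales = sum(y_true)
found_sales = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = found_sales / actual_sales

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

The baseline scores 95% accuracy without ever identifying a sale, which is why accuracy alone is a misleading metric here.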

Page 13: The Wild West of Data Wrangling

Solution: Gradient boosting

Gradient boosting

Builds a prediction model as an ensemble of weak prediction models, typically decision trees, with each new model fit to correct the errors of the ensemble so far.
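
The slides do not show code or say which gradient boosting library was used. As a rough sketch only, this assumes scikit-learn's `GradientBoostingClassifier` and a synthetic dataset from `make_classification` standing in for the building data. The feature counts and hyperparameters are arbitrary choices, not values from the talk.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the building data: roughly 95% class 0, 5% class 1.
X, y = make_classification(
    n_samples=2000,
    n_features=8,
    n_informative=4,
    weights=[0.95, 0.05],
    random_state=0,
)

# Stratify so the train and test sets keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# An ensemble of shallow decision trees, each fit to the errors of the previous ones.
model = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# With imbalanced classes, check precision and recall on the rare class,
# not accuracy alone.
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
print(f"precision={precision:.2f} recall={recall:.2f}")
```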

Page 14: The Wild West of Data Wrangling

Example 2: Clustering user interactions

The problem: how can we identify similar patterns based on click data?

The data: time, geolocation, cookie, browser useragent string, referrer

The goal: understand how people interact with content over time

Page 15: The Wild West of Data Wrangling

Why Scala?

Page 16: The Wild West of Data Wrangling

Problem: Clustering user interactions

K-means clustering

An unsupervised learning method that groups data points into k clusters based on a distance metric.
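
A small illustration of the idea, assuming scikit-learn's `KMeans` with its default Euclidean distance. The six 2-D points are invented for the example and are not click data from the talk.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points forming two well-separated groups.
points = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # group near (1, 1)
    [8.0, 8.2], [7.9, 8.1], [8.3, 7.8],   # group near (8, 8)
])

# The number of clusters (k) must be chosen up front.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels)                   # one cluster id per point
print(kmeans.cluster_centers_)  # one centroid per cluster
```

K-means expects every row to be a fixed-length numeric vector.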

Page 17: The Wild West of Data Wrangling

Problem: Clustering the data

•  Only look at users with 5 or more interactions

•  Each user has a different number of interactions

•  Each of a user's data points can end up in a different cluster

Page 18: The Wild West of Data Wrangling
Page 19: The Wild West of Data Wrangling
Page 20: The Wild West of Data Wrangling
Page 21: The Wild West of Data Wrangling
Page 22: The Wild West of Data Wrangling

Solution: Transform the data

Page 23: The Wild West of Data Wrangling

Solution: Transform the data

date: 2017-04-09, 2017-04-13, 2017-04-30, 2017-05-01, 2017-05-12

Number of interactions: 5

Average time between interactions: ~8 days
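
The two summary features above can be derived from the dates with the standard library. This is a sketch of one way to compute them; the variable names are illustrative, and the talk does not show its own implementation.

```python
from datetime import date

# The five interaction dates from the slide.
dates = [
    date(2017, 4, 9),
    date(2017, 4, 13),
    date(2017, 4, 30),
    date(2017, 5, 1),
    date(2017, 5, 12),
]

# Feature 1: how many interactions the user had.
n_interactions = len(dates)

# Feature 2: mean gap in days between consecutive interactions.
gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
avg_gap_days = sum(gaps) / len(gaps)

print(n_interactions, gaps, avg_gap_days)  # 5 [4, 17, 1, 11] 8.25
```

Collapsing a variable-length list of dates into two numbers gives every user a row of the same width.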

Page 24: The Wild West of Data Wrangling

Solution: Transform the data

referrer: facebook, twitter

One-hot encode and transform to matrix

•  Facebook: [1, 0]

•  Twitter: [0, 1]
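
A minimal plain-Python sketch of the encoding above. The `one_hot` helper and the sample `referrers` list are hypothetical; in practice a library encoder such as pandas `get_dummies` or scikit-learn's `OneHotEncoder` would likely be used instead.

```python
# Fixed category order, matching the slide: facebook first, then twitter.
categories = ["facebook", "twitter"]


def one_hot(value, categories):
    """Return a list with a 1 in the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]


# Hypothetical referrers for one user's interactions.
referrers = ["facebook", "twitter", "facebook"]

# One row per interaction, one column per category.
matrix = [one_hot(r, categories) for r in referrers]
print(matrix)  # [[1, 0], [0, 1], [1, 0]]
```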

Page 25: The Wild West of Data Wrangling

Solution: Transform the data

Page 26: The Wild West of Data Wrangling

Example 3: Understand audience composition

The problem: how can we effectively describe our audience?

The data: anonymized demographic and psychographic data

The goal: audience segmentation and channel analysis

Page 27: The Wild West of Data Wrangling

Problem: insufficient data

•  Google Analytics data – only 1/3 of URLs

•  Finicky API

•  Semi-useless psychographic data

Page 28: The Wild West of Data Wrangling

Solution: accept defeat

Page 29: The Wild West of Data Wrangling

Solution: don't accept defeat, make it work!

Page 30: The Wild West of Data Wrangling

Solution: make it work!

•  Theory of highly-performant links

•  Segmentation through archetypal analysis

•  Go get more data!

Page 31: The Wild West of Data Wrangling

General strategy

•  What problem are you trying to solve?

•  What’s wrong with your data?

•  What do you need that you don’t have?

Page 32: The Wild West of Data Wrangling

Keep in mind…

•  Data your company collects is complicated

•  What you do to your data will affect the model

•  Creativity is your friend

•  Lots of ways to solve the problem

Page 33: The Wild West of Data Wrangling
Page 34: The Wild West of Data Wrangling

Thank you!

@sarah_guido