The Wild West of Data Wrangling

Sarah Guido PyCon 2017

@sarah_guido

This talk:

•  A day in the life

•  Three examples of dealing with uncooperative data

•  Not ground truth!

Who am I?

•  Senior data scientist at Mashable

•  Mashable == internet culture media!

•  Data sciencing in Python

•  Twitter: @sarah_guido

Iris Dataset

Example 1: Predicting building sales

•  The problem: can we predict if a building will sell the following year?

•  The data: floors, location, square footage, price per sqft, etc.

•  The goal: provide valuable insight to platform users

Example 1: Predicting building sales

•  First thought: logistic regression using scikit-learn

•  Binary classification: sale/no sale

Problem…

Data: 95% no sale, 5% sale

Logistic regression: 95% accurate

DONE!

Problem: Class imbalance

Class imbalance

When the classes you are trying to predict are not equally represented, classification models can become biased toward the majority class.
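
To make the "95% accurate, DONE!" trap concrete, here is a minimal sketch. The labels are hypothetical and only mirror the 95/5 split from the slide; the "model" simply predicts the majority class every time.

```python
# Hypothetical labels mirroring the slide: 95% "no sale" (0), 5% "sale" (1).
y_true = [0] * 950 + [1] * 50

# A degenerate "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: of the real sales, how many did we catch?
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_sale = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy: {accuracy:.2f}")          # accuracy: 0.95
print(f"recall (sale): {recall_sale:.2f}")  # recall (sale): 0.00
```

Accuracy looks great while the model never finds a single sale, which is why per-class metrics such as recall are worth checking on imbalanced data.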

Solution: Gradient boosting

Gradient boosting

Produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
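
A rough sketch of what this could look like in scikit-learn. The slides do not say which gradient boosting implementation was used, so `GradientBoostingClassifier`, the synthetic data from `make_classification`, and all parameter values here are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the building data: roughly 95% "no sale", 5% "sale".
X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_informative=5,
    weights=[0.95, 0.05],
    random_state=0,
)

# Stratify so the train and test splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# An ensemble of 100 shallow trees, each fit to the errors of the ones before it.
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision and recall say more than overall accuracy here.
print(classification_report(y_test, y_pred, target_names=["no sale", "sale"]))
```

Reading the report per class, rather than as a single accuracy number, shows how well the minority "sale" class is actually being predicted.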

Example 2: Clustering user interactions

The problem: how can we identify similar patterns based on click data?

The data: time, geolocation, cookie, browser useragent string, referrer

The goal: understand how people interact with content over time

Why Scala?

Problem: Clustering user interactions

K-means clustering

An unsupervised learning method of grouping data together based on a distance metric.
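
As a small illustration of the idea, here is a sketch using scikit-learn's `KMeans`. The talk's own clustering was done elsewhere (see "Why Scala?"), so the library choice and the six made-up 2-D points below are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two well-separated groups.
points = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # group near (1, 1)
    [8.0, 8.0], [8.2, 7.9], [7.8, 8.1],   # group near (8, 8)
])

# k-means assigns each point to the nearest centroid (Euclidean distance)
# and moves the centroids until the assignments stop changing.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels)                   # cluster index for each point
print(kmeans.cluster_centers_)  # mean position of each cluster
```

Because the distance is Euclidean, every data point has to be a numeric vector of the same length, and features on very different scales usually need rescaling first.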

Problem: Clustering the data

•  Only look at users with 5 or more interactions

•  Each user has a different number of interactions

•  Each data point ends up in a different cluster

Solution: Transform the data


date: 2017-04-09, 2017-04-13, 2017-04-30, 2017-05-01, 2017-05-12

Number of interactions: 5

Average time between interactions: ~8 days
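
The two summary features above can be derived from the raw dates with the standard library. This is only a sketch of the transformation; the variable names are made up for illustration.

```python
from datetime import date

# Interaction dates for one user, taken from the example above.
dates = [
    date(2017, 4, 9), date(2017, 4, 13), date(2017, 4, 30),
    date(2017, 5, 1), date(2017, 5, 12),
]

n_interactions = len(dates)

# Gaps in days between consecutive interactions.
gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
avg_gap_days = sum(gaps) / len(gaps)

print(n_interactions)  # 5
print(gaps)            # [4, 17, 1, 11]
print(avg_gap_days)    # 8.25
```

Whatever the number of raw interactions, each user collapses to the same fixed set of numbers, which is what a distance-based method like k-means needs.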


referrer: facebook, twitter

One-hot encode and transform to matrix

•  Facebook: [1, 0]

•  Twitter: [0, 1]
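
A minimal sketch of the encoding in plain Python. The `one_hot` helper and the sample `referrers` list are hypothetical; in practice a library encoder would do the same job.

```python
# Fixed category order so every row uses the same column positions.
categories = ["facebook", "twitter"]

def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    return [1 if value == c else 0 for c in categories]

referrers = ["facebook", "twitter", "facebook"]  # hypothetical click referrers
matrix = [one_hot(r, categories) for r in referrers]

print(matrix)  # [[1, 0], [0, 1], [1, 0]]
```

Each row of `matrix` is one interaction and each column is one referrer, so the categorical field becomes numeric input a clustering algorithm can use.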


Example 3: Understand audience composition

The problem: how can we effectively describe our audience?

The data: anonymized demographic and psychographic data

The goal: audience segmentation and channel analysis

Problem: insufficient data

•  Google Analytics data – 1/3 of URLs

•  Finicky API

•  Semi-useless psychographic data

Solution: accept defeat

Solution: make it work!

•  Theory of highly-performant links

•  Segmentation through archetypal analysis

•  Go get more data!

General strategy

•  What problem are you trying to solve?

•  What’s wrong with your data?

•  What do you need that you don’t have?

Keep in mind…

•  Data your company collects is complicated

•  What you do to your data will affect the model

•  Creativity is your friend

•  Lots of ways to solve the problem

Thank you!

@sarah_guido
