Data Science & the Data Cycle Girl Geeks Waterloo May 2015 Jennifer Nguyen

Page 1: Data Science & the Data Cycle

Data Science & the Data Cycle

Girl Geeks Waterloo
May 2015

Jennifer Nguyen

Page 2: Data Science & the Data Cycle

Why do we need data?

To inform our decisions

Page 3: Data Science & the Data Cycle

Why do we need data?

Everyday examples

• To decide what to wear in the morning

Page 4: Data Science & the Data Cycle

Why do we need data?

Everyday examples

• To book a flight

Page 5: Data Science & the Data Cycle

Why do we need data?

Everyday examples

• To negotiate a salary

Page 6: Data Science & the Data Cycle

Data Cycle

User Activity

Data Collection/Storage

Data Analytics

Improved User Experience

Page 7: Data Science & the Data Cycle

How do we create data?

Page 8: Data Science & the Data Cycle

How do we create data?

Source: DOMO

• Every interaction on any platform is collected

• Clicks, mouse hovers, scrolls

• User-generated data
  • photos, videos, “likes”
  • purchases

• Direct feedback

• Registration information

Page 9: Data Science & the Data Cycle

Data is collected and stored

• Data is collected and stored in large databases to be analyzed

Source: DOMO

Page 10: Data Science & the Data Cycle

What is it used for?

“Not only are we doing more with data, data is doing more with us” – Jer Thorp, DOMO

Page 11: Data Science & the Data Cycle

Data Analytics Spectrum

Source: Gartner

Business Intelligence

Page 12: Data Science & the Data Cycle

Business Intelligence

a.k.a. reactive analytics

• Reactive analytics answers questions related to past/current events

• E.g., “How are we doing?”, “What went well?”

• Answers are used to inform business decisions

Page 13: Data Science & the Data Cycle

Business Intelligence

a.k.a. reactive analytics

• Measuring web traffic
  • How did the volume of traffic change since yesterday/last week/last month/last year?
  • Where are users coming from?
  • Which topics resonated with readers?
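The traffic questions above come down to simple aggregations over logged visits. A minimal sketch, assuming made-up visit counts and referrer labels (none of these numbers or names come from the talk):

```python
from collections import Counter

def percent_change(previous, current):
    """Relative change between two traffic counts, in percent."""
    if previous == 0:
        raise ValueError("previous count must be non-zero")
    return (current - previous) / previous * 100

# Hypothetical daily visit counts (illustrative only)
visits_yesterday = 1200
visits_today = 1500
change = percent_change(visits_yesterday, visits_today)
print(f"Traffic change since yesterday: {change:.1f}%")

# Hypothetical referrer log: where are users coming from?
referrers = ["search", "social", "search", "email", "search", "social"]
top_source, top_count = Counter(referrers).most_common(1)[0]
print(f"Top traffic source: {top_source} ({top_count} visits)")
```

The same function answers the weekly, monthly, and yearly versions of the question once the two counts are swapped for the corresponding periods.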

Page 14: Data Science & the Data Cycle

Data Analytics Spectrum

Source: Gartner

Data Mining

Page 15: Data Science & the Data Cycle

Data Mining

Knowledge Discovery

• Used to find patterns, trends, and insights from data

• Techniques include
  • Anomaly detection
  • Association rule learning
  • Clustering

Page 16: Data Science & the Data Cycle

Data Science Technique: Clustering

• How to segment users?
  • K-Means Clustering

• Clustering can be used to discover communities of users

• Other applications:
  • Find similar items, movies

Source: Stanford
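To make the K-Means idea concrete, here is a minimal sketch of the usual assign-then-update loop (Lloyd's algorithm). The points, the starting centroids, and the function name `k_means` are illustrative assumptions, not material from the talk; real projects would normally use a library implementation.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Plain K-Means: assign each point to its nearest centroid, then
    move each centroid to the mean of its points, until nothing changes."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical users described by two made-up features
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = k_means(points, centroids=[points[0], points[3]])
print(labels)  # cluster index per user: [0, 0, 0, 1, 1, 1]
```

Each resulting cluster can be read as one user segment. The number of clusters is chosen up front by the caller; here it is set by the number of starting centroids.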

Page 17: Data Science & the Data Cycle

Data Analytics Spectrum

Source: Gartner

Predictive Analytics

Page 18: Data Science & the Data Cycle

Predictive Analytics

• Predictive analytics answers questions related to future events

• E.g.:
  • How likely is this student to drop out?
  • What would this reader like to read next?
  • Will a customer churn?

Page 19: Data Science & the Data Cycle

Data Science Technique: Classification

• How to predict customer churn?
  • Logistic Regression
  • Decision Trees
  • Random Forests

• Results can be used to “save” a customer

Source: Stanford

[Figure: decision tree with two questions, “Filed complaint?” and “<2 years of tenure?”, each with Yes/No branches]

Page 20: Data Science & the Data Cycle

What is a decision tree?

[Figure: decision tree with two questions, “Filed complaint?” and “<2 years of tenure?”, each with Yes/No branches]

Root/internal nodes: split the data based on an attribute

Leaf nodes: output the prediction

Branches: binary decisions
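The three parts named above map naturally onto a small data structure. A minimal sketch, assuming the class names `Split` and `Leaf` and the attribute names (all hypothetical); the leaf labels are also an assumption, taken as the majority class of the counts shown on the later "Building a Decision Tree" slides.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    """Leaf node: outputs the prediction ("Y" = churn, "N" = no churn)."""
    prediction: str

@dataclass
class Split:
    """Root/internal node: splits the data on one yes/no attribute."""
    attribute: str
    no: Union["Split", Leaf]   # branch followed when the answer is No
    yes: Union["Split", Leaf]  # branch followed when the answer is Yes

# The tree from the slides, with assumed leaf labels
tree = Split(
    attribute="filed_complaint",
    no=Split(attribute="tenure_under_2y", no=Leaf("N"), yes=Leaf("N")),
    yes=Leaf("Y"),
)

def count_leaves(node):
    """Count leaf nodes by walking both branches recursively."""
    if isinstance(node, Leaf):
        return 1
    return count_leaves(node.no) + count_leaves(node.yes)

print(count_leaves(tree))  # 3
```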

Page 21: Data Science & the Data Cycle

How to build a decision tree

• Given the following data set:

Example adapted from Michael S. Lewicki, Artificial Intelligence: Learning and Decision Trees, http://www.cs.cmu.edu/afs/cs/academic/class/15381-s07/www/slides/041007decisionTrees1.pdf

<2 years of tenure?   Filed complaint?   Churned?
N                     N                  N
Y                     N                  Y
N                     N                  N
N                     N                  N
N                     Y                  Y
Y                     N                  N
N                     Y                  N
N                     Y                  Y
Y                     N                  N
Y                     N                  N

Page 22: Data Science & the Data Cycle

How to build a decision tree

• Goal:
  • Split the data in such a way as to achieve high classification accuracy

• Requires knowing which attributes to use and in which order

• Use a “greedy” algorithm:
  • Choose the attribute that gives the best split at each level of the tree
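The greedy choice can be checked by hand on the ten-row data set. A minimal sketch, assuming "best split" means the split that classifies the most rows correctly when each branch predicts its majority class; this scoring rule and the names below are assumptions, and measures such as information gain or Gini impurity are common alternatives.

```python
from collections import Counter

# (tenure_under_2y, filed_complaint, churned), copied from the data set slide
ROWS = [
    ("N", "N", "N"), ("Y", "N", "Y"), ("N", "N", "N"), ("N", "N", "N"),
    ("N", "Y", "Y"), ("Y", "N", "N"), ("N", "Y", "N"), ("N", "Y", "Y"),
    ("Y", "N", "N"), ("Y", "N", "N"),
]
ATTRIBUTES = {"tenure_under_2y": 0, "filed_complaint": 1}  # column index per attribute

def correct_after_split(rows, column):
    """Rows classified correctly if each branch predicts its majority class."""
    correct = 0
    for answer in ("N", "Y"):
        labels = [row[2] for row in rows if row[column] == answer]
        if labels:
            correct += Counter(labels).most_common(1)[0][1]
    return correct

scores = {name: correct_after_split(ROWS, col) for name, col in ATTRIBUTES.items()}
best = max(scores, key=scores.get)
print(scores)  # {'tenure_under_2y': 7, 'filed_complaint': 8}
print(best)    # filed_complaint
```

Under this rule, splitting on the complaint attribute first gets 8 of 10 rows right versus 7 of 10 for tenure, so it becomes the root.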

Page 23: Data Science & the Data Cycle

Recursive Algorithm

1. Start with all data
2. Find the query that gives the best split
3. Create child nodes
4. Recurse until the stopping criterion is met:
  • The node consists of one dominant class (the share of that class is the node’s “purity”)
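The four steps can be sketched as a short recursive function over the ten-row churn data set. This is an illustration under stated assumptions, not the speaker's implementation: splits are scored by majority-class accuracy, and recursion also stops when no attributes remain or a split would leave one branch empty (the slide only names the dominant-class criterion).

```python
from collections import Counter

# (tenure_under_2y, filed_complaint, churned), copied from the data set slide
ROWS = [
    ("N", "N", "N"), ("Y", "N", "Y"), ("N", "N", "N"), ("N", "N", "N"),
    ("N", "Y", "Y"), ("Y", "N", "N"), ("N", "Y", "N"), ("N", "Y", "Y"),
    ("Y", "N", "N"), ("Y", "N", "N"),
]
COLUMNS = {"tenure_under_2y": 0, "filed_complaint": 1}

def majority(rows):
    """Return (dominant class, number of rows in it)."""
    return Counter(row[2] for row in rows).most_common(1)[0]

def split_score(rows, column):
    """Rows classified correctly if each branch predicts its majority class."""
    score = 0
    for answer in ("N", "Y"):
        subset = [row for row in rows if row[column] == answer]
        if subset:
            score += majority(subset)[1]
    return score

def build(rows, attributes):
    label, count = majority(rows)
    # Stop when the node is pure or there is nothing left to ask (assumed rule)
    if count == len(rows) or not attributes:
        return label
    best = max(attributes, key=lambda name: split_score(rows, COLUMNS[name]))
    column = COLUMNS[best]
    branches = {a: [row for row in rows if row[column] == a] for a in ("N", "Y")}
    if not branches["N"] or not branches["Y"]:
        return label  # the split separates nothing, so make this a leaf
    remaining = [name for name in attributes if name != best]
    return {"ask": best,
            "N": build(branches["N"], remaining),
            "Y": build(branches["Y"], remaining)}

tree = build(ROWS, list(COLUMNS))
print(tree)
```

On this data the result asks about the complaint first and about tenure only on the "no complaint" branch, which matches the shape of the tree in the slides.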

Page 24: Data Science & the Data Cycle

Building a Decision Tree

[Figure: data plotted by “Filed complaint?” (No/Yes) against “<2 years of tenure?” (No/Yes)]

Filed complaint?
  Root: 3 Y, 7 N
  • No: 1 Y, 6 N (purity 6/7)
  • Yes: 2 Y, 1 N (purity 2/3)

Page 25: Data Science & the Data Cycle

Building a Decision Tree

[Figure: plot with axes labelled “missed payment?” (No/Yes) and “<2 years at current job?” (No/Yes)]

Filed complaint?
  Root: 3 Y, 7 N
  • No: 1 Y, 6 N, split further on “<2 years of tenure?”
    • No: 0 Y, 3 N (purity 3/3)
    • Yes: 1 Y, 3 N (purity 3/4)
  • Yes: 2 Y, 1 N (purity 2/3)
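The purity figures on these slides can be reproduced directly from the ten-row data set. A minimal sketch, assuming purity means the dominant class's share of the rows in a node (the helper name `purity` is hypothetical):

```python
from collections import Counter
from fractions import Fraction

# (tenure_under_2y, filed_complaint, churned), copied from the data set slide
ROWS = [
    ("N", "N", "N"), ("Y", "N", "Y"), ("N", "N", "N"), ("N", "N", "N"),
    ("N", "Y", "Y"), ("Y", "N", "N"), ("N", "Y", "N"), ("N", "Y", "Y"),
    ("Y", "N", "N"), ("Y", "N", "N"),
]

def purity(rows):
    """Share of the node's rows that belong to its dominant class."""
    counts = Counter(row[2] for row in rows)
    return Fraction(max(counts.values()), len(rows))

# First split: filed complaint?
no_complaint = [row for row in ROWS if row[1] == "N"]
complaint = [row for row in ROWS if row[1] == "Y"]

# Second split on the "no complaint" branch: <2 years of tenure?
long_tenure = [row for row in no_complaint if row[0] == "N"]
short_tenure = [row for row in no_complaint if row[0] == "Y"]

print(purity(no_complaint), purity(complaint))    # 6/7 2/3
print(purity(long_tenure), purity(short_tenure))  # 1 3/4
```

`Fraction` prints 3/3 in lowest terms, which is why the third value appears as 1.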

Page 26: Data Science & the Data Cycle

How to use a decision tree

• Given new data samples, how to predict if the individual will churn?

• Use recursive tree traversal algorithm to find the corresponding leaf node

<2 years of tenure?   Filed complaint?   Churned?
N                     N                  ?
Y                     N                  ?
N                     Y                  ?

[Figure: decision tree splitting on “Filed complaint?”, then “<2 years of tenure?”]
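The traversal itself is only a few lines. A minimal sketch, assuming the tree is stored as nested dictionaries (a hypothetical representation) and that each leaf predicts the majority class of the counts shown on the "Building a Decision Tree" slides:

```python
# Tree from the slides; leaf labels are assumed to be the majority class
# of each leaf's counts ("Y" = churn, "N" = no churn).
TREE = {
    "ask": "filed_complaint",
    "N": {"ask": "tenure_under_2y", "N": "N", "Y": "N"},
    "Y": "Y",
}

def predict(node, sample):
    """Follow the branch matching the sample's answer until a leaf is reached."""
    if isinstance(node, str):  # leaf: return its prediction
        return node
    answer = sample[node["ask"]]
    return predict(node[answer], sample)

# The three new samples from the table above
samples = [
    {"tenure_under_2y": "N", "filed_complaint": "N"},
    {"tenure_under_2y": "Y", "filed_complaint": "N"},
    {"tenure_under_2y": "N", "filed_complaint": "Y"},
]
predictions = [predict(TREE, sample) for sample in samples]
print(predictions)  # ['N', 'N', 'Y']
```

Under these assumed leaf labels only the customer who filed a complaint is predicted to churn, because both tenure leaves are dominated by non-churners.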

Page 27: Data Science & the Data Cycle

Data Analytics Spectrum

Source: Gartner

Machine Learning

Page 28: Data Science & the Data Cycle

Data Science Technique: Collaborative Filtering

• Used in simple recommendation engines to recommend items to users

• Items can be news articles, movies, clothing, friends

Page 29: Data Science & the Data Cycle

Data Mining Technique: Collaborative Filtering

Page 30: Data Science & the Data Cycle

Data Mining Technique: Collaborative Filtering

• Goal is to fill in the missing empty cells with a prediction

• Items with positive predictions are recommended to users

• Predictions are influenced by ratings from other users

• The more similar two users are with respect to their ratings, the more they will influence the prediction
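The points above describe user-based collaborative filtering. A minimal sketch, assuming like/dislike ratings of +1/-1, cosine similarity over the items two users have both rated, and a similarity-weighted average as the prediction; the users, items, and ratings are invented for illustration, and the slides do not say which similarity measure was used.

```python
import math

# Hypothetical ratings: +1 = liked, -1 = disliked, missing key = empty cell
RATINGS = {
    "alice": {"A": 1, "B": -1},            # item C is the cell to fill in
    "bob":   {"A": 1, "B": -1, "C": 1},
    "carol": {"A": 1, "B": 1,  "C": -1},
}

def similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    shared = set(RATINGS[u]) & set(RATINGS[v])
    if not shared:
        return 0.0
    dot = sum(RATINGS[u][i] * RATINGS[v][i] for i in shared)
    norm_u = math.sqrt(sum(RATINGS[u][i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(RATINGS[v][i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    numerator = denominator = 0.0
    for other, their_ratings in RATINGS.items():
        if other == user or item not in their_ratings:
            continue
        weight = similarity(user, other)
        numerator += weight * their_ratings[item]
        denominator += abs(weight)
    return numerator / denominator if denominator else 0.0

score = predict("alice", "C")
print(f"Predicted rating: {score:.2f}, recommend: {score > 0}")
```

Here bob's ratings match alice's on the shared items, so his opinion of item C dominates the prediction, while carol's similarity to alice is zero and her rating has no influence.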

Page 31: Data Science & the Data Cycle

Data Mining Technique: Collaborative Filtering

Page 32: Data Science & the Data Cycle

Data Mining Technique: Collaborative Filtering

Page 33: Data Science & the Data Cycle

Data Mining Technique: Collaborative Filtering

• Disadvantages of this method:
  • Users do not understand why some recommendations are made by the engine
  • Users may not receive recommendations they like because other users have not liked them

• Advanced recommendation methods exist to address these shortcomings

Page 34: Data Science & the Data Cycle

How do users benefit from data science?

• Users get a more personalized experience that is tailored to their interests

• E.g., Nest Thermostat, PC Plus

• Users save time by not having to sift through the vast sea of options

• E.g., Netflix, online retailers, LinkedIn

Page 35: Data Science & the Data Cycle

Key Takeaways

• Data science is part of an iterative process to continually improve the user experience

User Activity

Data Collection/Storage

Data Analytics

Improved User Experience

Page 36: Data Science & the Data Cycle

Key Takeaways

• There are many flavours of data analytics

• Which one to use depends on the questions you want answered

Page 37: Data Science & the Data Cycle

Questions?