©2017 Firefly Consulting, All Rights Reserved
A Lean Six Sigma Practitioner’s Guide
Getting Started with Big Data
Kristine Bradley, Principal, Firefly Consulting
WHY CARE ABOUT BIG DATA?
“Data Scientist: The Sexiest Job of the 21st Century.” - Harvard Business
Review article
GE has bet big on the Internet of Things – committing $1B to put sensors on gas turbines, jet engines, and other
machines, connecting them to the cloud and analyzing the resulting flow of data to identify ways to improve machine
productivity and reliability – MIT Sloan Case Study
CNN recently stated that “the amount of data captured globally is estimated to reach 40 zettabytes by 2020.” That’s 40 followed by 21 zeros!
BIG DATA APPLICATIONS
Financial Services
▪ Reduced credit card fraud
▪ Decreased loan default rate
▪ Increased response rate with significantly reduced mailing costs
Manufacturing
▪ Supply chain analytics
▪ Improved process monitoring and control
▪ Reduced equipment downtime
Healthcare
▪ Cancer detection
▪ Hospital readmission
▪ Nonadherence to medication prescriptions
▪ Billing errors
Daily Life
▪ Airfare pricing optimization
▪ Personalized product recommendations
▪ Tax returns
▪ Casinos
FUN FACT
Airline customers who pre-order a vegetarian meal are more likely to make their flight on time.
- An airline study
BIG DATA AND LEAN SIX SIGMA
Lean Six Sigma Skills
Business Skills
Data Science
IT Skills
Expertise needed for Big Data solutions:
− Data Science
− IT Skills
− Business Skills
Expertise needed for Lean Six Sigma solutions:
− Business Skills
− Data Analysis Skills
− IT Partnership
SIMILARITIES AND DIFFERENCES
What’s New?
▪ More data – linkage to external
▪ More powerful analytics
▪ Stronger systems linkage
▪ More real time visualization
▪ Links with Artificial Intelligence and the Internet of Things
What’s the Same?
▪ Up to 80% of the work can be in the data preparation
▪ To get the value, you still have to do something with it!
▪ Analysis using statistical tools
▪ Understand the relationship between your inputs (x’s) and outputs (y’s)
▪ Correlation still does not equal causation
FUN FACT
Hungry judges rule negatively on parole decisions. Your chances of a favorable parole ruling right after a food break are about 65%, and they drop to nearly 0 before the next break.
– Columbia and Ben-Gurion Universities
DEFINING THE “BIG DATA” TERMS
Business Need
Internal Data
External Data
Modeling, “Machine Learning”
Predictive Model
Target, Prediction, Outcome, Response, Y
Business Insights
Individual Characteristics, Attributes, Factors, Variables, Predictors, X’s
• “Big Data”
• “Big Data Analytics”
• “Business Analytics”
• “Predictive Analytics”
• “Business Intelligence”
“Artificial Intelligence”
“Internet of Things”, “IoT”
“Data Mining”
CODING
Prediction
“BIG DATA” ANALYTICS PROCESS
▪ Business Question
− Cast the business problem or goal into one or more modeling problems
− Determine use scenario
▪ Acquire Data
− Identify data sources
− Understand data
− Evaluate cost/benefit of sources
− Extract data
▪ Prepare the Data
− Clean up the data – structure, missing values
− Visualize the data
− Dimensionality reduction and/or feature selection
− Validate the data
▪ Choose Algorithm
− Select the modeling technique(s) that will best solve your modeling problem and suit your use scenario
▪ Build Model
− Utilize statistical software to build model
▪ Test and Evaluate Model
− Use test data set to assess model accuracy and reliability
− Assess if model satisfies original business goal
− Beware of overfitting
▪ Deploy Model
− Code model into production systems
− Make near or real time decisions
▪ Extract Insights
− Use model to solve business problem
Lean Six Sigma phases: DEFINE, MEASURE, ANALYZE, IMPROVE, IMPROVE/CONTROL
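The "Test and Evaluate Model" step can be sketched in a few lines of Python. This is only an illustration: the data are made up, the helper names `train_test_split` and `mean_absolute_error` are hypothetical, and the "model" is a stand-in that always predicts the training mean. The point is that accuracy is measured on rows the model never saw.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and hold out a fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def mean_absolute_error(actual, predicted):
    """Average size of the prediction errors, ignoring their sign."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up data: (predictor, response) pairs.
data = [(x, 2.0 * x + 1.0) for x in range(20)]
train, test = train_test_split(data, test_fraction=0.3, seed=42)

# Stand-in "model": always predict the mean response seen in training.
baseline = sum(y for _, y in train) / len(train)
test_error = mean_absolute_error([y for _, y in test], [baseline] * len(test))
print(f"held-out rows: {len(test)}, baseline MAE: {test_error:.2f}")
```

A model whose error on the held-out rows is much worse than its error on the training rows is a candidate for overfitting.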
FUN FACT
Online loan applicants who complete the form using correct capitalization are the most likely to pay on time; those who use all lowercase are next most likely, and those who use all caps are least likely.
- Financial services startup
My Name, my name, MY NAME
DIFFERENT BUSINESS QUESTIONS REQUIRE DIFFERENT TOOLS
Specificity: from General (1) to Specific (5)
1 Who are the most profitable customers? (Data Retrieval and Visualization)
2 Is there a difference between profitable and average customers? (Statistical Hypothesis Testing)
3 What are common characteristics of profitable customers? (Similarity and Clustering)
4 Will a particular customer be profitable? (Classification)
5 How much potential revenue can I generate from this particular customer? (Prediction)
WHAT TOOLS DO YOU HAVE NOW AS AN LSS PROFESSIONAL?
▪ Data Retrieval and Visualization
− Basic statistics
− Graphical tools
− Measurement System Analysis
− Control Charts
▪ Statistical Hypothesis Testing
− T-Tests
− ANOVA
▪ Prediction
− Multiple Linear Regression
− General Linear Model
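As a reminder of what a t-test computes, here is a minimal sketch of the two-sample t statistic in its Welch form (which does not assume equal variances). The revenue figures and the function name `welch_t` are made up for illustration; statistical software would also report the p-value.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic and its approximate degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean.
    va, vb = variance(sample_a) / na, variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical monthly revenue per customer in two groups.
profitable = [120.0, 135.0, 150.0, 160.0, 145.0]
average = [100.0, 110.0, 95.0, 105.0, 115.0]
t_stat, dof = welch_t(profitable, average)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A large absolute t value relative to the degrees of freedom suggests the two groups differ by more than chance alone would explain.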
FAMILIAR PREDICTION TOOLS
Business Question: Prediction
Statistical Tools:
▪ Linear Regression: Models a straight-line relationship between continuous predictors and a single response variable
▪ Nonlinear Regression: Models a nonlinear curve – concave, convex, exponential, S-shaped, asymptotic, etc.
▪ General Linear Model: Uses ANOVA and regression to model the relationship between continuous or attribute predictors and a continuous response
Example Applications:
▪ Financial Services: Premium table development in property insurance
▪ Healthcare: Predict future healthcare costs using prior costs, demographics and diagnoses
▪ Manufacturing: Develop acceptable ranges for input materials to optimize pharmaceutical particle size
EXAMPLE: LINEAR REGRESSION IN FINANCIAL SERVICES
Business Need: Develop a premium table for property insurance
Predictors (“Big Data”): driver age, credit score, gender, auto attributes…
Modeling, “Machine Learning”: Linear Regression
Predictive Model, Prediction: Target = Predicted Claims
Business Insights: Use predicted claims to set better premiums and reduce risk
“Data Mining”
$$\text{Predicted Claims} = \beta_0 + \beta_1\,\text{Age} + \beta_2\,\text{Credit Score} + \beta_3\,\text{Gender} + \cdots + \beta_x$$
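To make the equation concrete, here is a minimal sketch of the simplest case: one predictor (driver age) fitted by ordinary least squares. The ages, claim amounts and the function name `fit_line` are made up for illustration; a real premium model would use many predictors and statistical software.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data: driver age and annual claims in dollars.
ages = [20, 30, 40, 50, 60]
claims = [900.0, 800.0, 700.0, 600.0, 500.0]
b0, b1 = fit_line(ages, claims)

# Predicted claims for a 35-year-old driver under this toy model.
predicted = b0 + b1 * 35
print(f"beta0 = {b0:.1f}, beta1 = {b1:.1f}, prediction = {predicted:.1f}")
```

Here `b0` and `b1` play the roles of β0 and β1 in the equation; each additional predictor adds one more coefficient.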
KEEP IN MIND
Nerdy Things
▪ Multicollinearity
− Lots of data and lots of variables bring the risk of double dipping
▪ Nonlinear responses
− Responses aren’t always straight lines
▪ Standardization
− You may need to standardize your data to eliminate differences in variable scale
▪ Homoskedasticity
− Important to linear regression models
− It’s also fun to say
Bigger Picture Things
▪ Many tools can solve the same types of problems in different ways
− Tool selection is sometimes more art than science
▪ Model validation is required
− Set aside 20-50% of your data points to assess model accuracy
▪ Overfitting
− Given enough data and variables, something will correlate
− Consider diminishing returns
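Two of the points above, standardization and multicollinearity, can be sketched together. The example below converts each variable to z-scores and then computes the Pearson correlation between two predictors; a correlation near +1 or -1 hints that the two variables carry much the same information. The data and the 0.8 cutoff are assumptions for illustration, not a rule from this deck.

```python
from statistics import mean, pstdev

def standardize(values):
    """Convert values to z-scores: mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def pearson_r(xs, ys):
    """Pearson correlation, computed as the mean product of z-scores."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

# Hypothetical predictors on very different scales.
income_dollars = [42000, 55000, 61000, 78000, 90000]
credit_limit = [4000, 5500, 6500, 8000, 9500]

r = pearson_r(income_dollars, credit_limit)
if abs(r) > 0.8:  # illustrative cutoff only
    print(f"r = {r:.2f}: these predictors may be double dipping")
```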
FUN FACT
Liking “curly fries” on Facebook is a predictor of high intelligence.
– University of Cambridge and Microsoft Research
A SELECTION OF NEXT LEVEL TOOLS
Supervised Tools: You have an output value you are trying to predict
▪ Regression
− Linear Regression
− Nonlinear Regression
− General Linear Model
− Regression Tree/Forest
− Gaussian Process
− Support Vector Machines (SVM)
− Neural Network Regression
− Nearest Neighbor Methods
▪ Classification
− Decision Trees, Forests
− Neural Network
− Naïve Bayes
− k-Nearest Neighbor
− Discriminant Analysis
− Logistic Regression
− Support Vector Machines
Unsupervised Tools: You do not have a specific output value
▪ Clustering
− Hard Clustering: Hierarchical, k-Means
− Soft Clustering: Fuzzy c-Means, Gaussian Mixture Model
▪ Anomaly Identification
− One Class SVM
− k-Nearest Neighbor
− Principal Component Analysis
▪ Data Reduction Methods
− Principal Component Analysis (PCA)
− Factor Analysis
Other
▪ Natural Language Processing
▪ Image Processing / Pattern Recognition
Examples follow
LOGISTIC REGRESSION – THE NEXT TOOL TO ADD TO YOUR KIT
Business Question: Classification
Statistical Tool: Logistic Regression
Description: Regression where the dependent (target) variable is binary or categorical
Applications:
▪ Financial Services: Predict the likelihood that a consumer will accept or reject a credit card offer
▪ Healthcare: Quantify the odds of developing a post-surgical site infection
▪ Manufacturing: Predict product pass or fail based on upstream sensor data
EXAMPLE: LOGISTIC REGRESSION IN HEALTHCARE
Business Need: Predict which patients are high risk for readmission within 30 days
Predictors (“Big Data”): underlying diagnosis, age, discharge day, days to follow-up visit post discharge, nurse call follow-up…
Modeling, “Machine Learning”: Logistic Regression
Predictive Model, Prediction: Target = Probability of Post Discharge Readmission (< 30 Days)
Business Insights: Improve patient outcomes and reduce costs by identifying and addressing readmission risk factors
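$$\ln\left(\frac{p_{\text{readmission}}}{1 - p_{\text{readmission}}}\right) = \beta_0 + \beta_1\,\text{Diagnosis} + \beta_2\,\text{DC Day} + \cdots + \beta_6\,\text{Follow Up}$$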
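The left side of the equation is the log-odds of readmission, so turning a fitted model into a probability means inverting it. The sketch below shows that step with made-up coefficients and hypothetical feature names (`friday_discharge`, `follow_up_call`); it does not fit a model and the numbers are not from any study.

```python
import math

def log_odds(intercept, coefficients, features):
    """Linear predictor: beta0 plus the sum of beta_i times x_i."""
    return intercept + sum(coefficients[name] * value
                           for name, value in features.items())

def probability(z):
    """Invert the logit: p = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted coefficients (illustration only).
beta0 = -2.0
betas = {"friday_discharge": 0.7, "follow_up_call": -0.5}

patient = {"friday_discharge": 1, "follow_up_call": 1}
z = log_odds(beta0, betas, patient)
p = probability(z)
print(f"log-odds = {z:.2f}, readmission probability = {p:.2f}")
```

In this form, `math.exp` of a coefficient is the odds ratio for a one-unit change in that predictor, which is often the number clinicians and managers ask about.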
“Data Mining”
FUN FACT
If you buy diapers from a pharmacy, you are more likely to also buy beer – NCR and Osco Drug study
SOME ADDITIONAL CLASSIFICATION METHODS
Business Question: Classification
Statistical Tools:
▪ k-Nearest Neighbor: Categorizes data based on where their nearest neighbors are in the data set.
▪ Classification or Decision Trees, Forests: Easy-to-use method that lets you predict responses to data by following a series of branching conditions leading to a binary or categorical response.
▪ Discriminant Analysis: Classifies data by finding linear combinations of features.
▪ Other Methods: Neural Network, Naïve Bayes, Support Vector Machines
Example Applications:
▪ Manufacturing: Using logged machine sensor data to predict equipment failures before they happen as part of a Total Productive Maintenance system
▪ Healthcare: Predicting pulmonary tuberculosis in hospitalized patients
▪ Retail: Consumer decision trees that classify shopper behavior and quantify shopper decision making
This is still only a partial list!
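A minimal sketch of k-Nearest Neighbor classification, loosely modeled on the equipment-failure application. The sensor readings, labels and the function name `knn_predict` are made up for illustration. In practice the features would usually be standardized first so that no single scale dominates the distance.

```python
import math
from collections import Counter

def knn_predict(training_rows, query, k=3):
    """Label a query point by majority vote among its k nearest training rows."""
    nearest = sorted(training_rows,
                     key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical logged readings: (temperature, vibration) -> outcome.
history = [
    ((60.0, 0.20), "ok"),
    ((62.0, 0.30), "ok"),
    ((61.0, 0.25), "ok"),
    ((90.0, 1.50), "fail"),
    ((95.0, 1.70), "fail"),
    ((92.0, 1.60), "fail"),
]

print(knn_predict(history, (91.0, 1.40)))
print(knn_predict(history, (63.0, 0.20)))
```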
CLUSTERING METHODS
Methods that assign data (or variables) into similar groups. Unlike classification, groups are not known beforehand.
Business Question: Similarity and Clustering
Statistical Tools:
▪ Hierarchical: Creates nested sets of clusters by measuring similarities between pairs of objects and grouping the objects into a tree. Produces a dendrogram graphic that shows the hierarchy.
▪ k-Means: Partitions data into k mutually exclusive clusters based on the distance between each data point and each cluster’s center.
▪ Fuzzy c-Means: Similar to k-Means, but allows the clusters to overlap. Clusters are not mutually exclusive.
Applications:
▪ Financial Services: Place securities into groups based on similarities found amongst returns and investment strategies
▪ Healthcare: Identifying subgroups of patients with similar condition patterns to drive targeted care management
▪ Manufacturing: Part family identification for cell design and optimization
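The k-Means description can be sketched as a short loop: assign each point to its nearest center, move each center to the mean of its points, and repeat. This is a bare-bones version under stated assumptions: the points are made up, the starting centers are simply the first k points (real tools use smarter initialization), and the function name `k_means` is hypothetical.

```python
import math

def k_means(points, k, iterations=20):
    """Plain k-means: returns the final centers and the last cluster assignment."""
    centers = [points[i] for i in range(k)]  # naive start: first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical two-feature data with two visible groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.8, 1.1), (8.2, 7.9), (7.9, 8.3)]
centers, clusters = k_means(points, k=2)
for center, members in zip(centers, clusters):
    print(center, len(members))
```

Because every point lands in exactly one cluster, this is the "hard" clustering case; Fuzzy c-Means would instead give each point a degree of membership in every cluster.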
EXAMPLE: CLUSTERING IN RETAIL
Business Need: Offer relevant similar items to customers during online shopping for whiskey
Predictors (“Big Data”): A historical database of review descriptions: Color, Nose, Body, Palate, Finish
Modeling, “Machine Learning”: k-Means Clustering
Predictive Model, Prediction: Offer a group of whiskeys as alternate choices to the customer’s first selection
Business Insights: Provides customer options and keeps them on your site
“Data Mining”
ADVANCED PREDICTION TOOLS
Business Question: Prediction
Statistical Tools:
▪ Regression Tree/Forest: Similar to decision trees for classification, but predicts a continuous response rather than a categorical one.
▪ Gaussian Process: Nonparametric models often used for spatial data.
▪ Support Vector Machines (SVM): Fits a “hyperplane” that deviates from measured data by no more than a small amount.
▪ Others: Neural Network Regression, Nearest Neighbor Methods
Example Applications:
▪ Financial Services: Predicting the likelihood a mortgage will go into default or be paid off early
▪ Healthcare: Predicting hospital average length of stay
▪ Manufacturing: Predicting wafer reject rates in semiconductor manufacturing
And more!
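A regression tree is built by repeatedly choosing the split that makes the response most uniform on each side. The sketch below finds one such split for a single predictor by minimizing the summed squared error; the data and the function name `best_split` are made up for illustration, and a full tree would apply the same search recursively to each branch.

```python
def best_split(xs, ys):
    """Find the threshold t (rule: x < t) that minimizes total squared error."""
    def squared_error(values):
        if not values:
            return 0.0
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values)

    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        total = squared_error(left) + squared_error(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Hypothetical data: number of diagnoses vs. length of stay in days.
diagnoses = [1, 2, 3, 10, 11, 12]
stay_days = [5.0, 5.0, 5.0, 9.0, 9.0, 9.0]
threshold, error = best_split(diagnoses, stay_days)
print(f"split at diagnoses < {threshold}, squared error = {error}")
```

Each leaf of the finished tree predicts the mean response of the training rows that reach it.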
EXAMPLE: REGRESSION TREES IN MORTGAGES
From the book “Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die” by Eric Siegel
Business Need: Predict likelihood that mortgage will be paid off early
Predictors (“Big Data”): Interest rate, income, payoff amount, property type, loan to value ratio…
Modeling, “Machine Learning”: Regression Tree
Predictive Model, Prediction: Target = probability of prepayment or default
Business Insights: Bank can screen refinance applicants and make better business decisions
“Data Mining”
REGRESSION TREE INTERPRETATION
▪ An application would be processed through the tree to calculate likelihood of pre-payment
▪ Greatest risk:
− Interest ≥ 7.94
− Mortgage ≥ $182,926
− Property not a condo or co-op
[Figure: regression tree diagram with Yes/No branches]
DATA REDUCTION METHODS
Methods that help you reduce the number of variables in your models to reduce collinearity, model noise and risk of overfit. Very helpful in regression models.
Business Question: Similarity and Clustering
Statistical Tools:
▪ Principal Component Analysis: Transforms the data so that most of the variance in your data is accounted for in the first few principal components.
▪ Factor Analysis: Identifies underlying correlations between variables in your data set so you can identify commonality amongst factors.
Applications:
▪ Model improvement
▪ Key factor identification
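The idea that the first few principal components account for most of the variance can be checked by hand in two dimensions, where the eigenvalues of the 2x2 covariance matrix have a closed form. This is a sketch with made-up data and a hypothetical function name, `explained_variance_2d`; real data sets have many more variables and call for statistical software.

```python
import math
from statistics import mean

def explained_variance_2d(xs, ys):
    """Share of total variance captured by each principal component of 2-D data."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    syy = sum((y - y_bar) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of the covariance matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eigenvalues = (half_trace + spread, half_trace - spread)
    total = sxx + syy
    return tuple(ev / total for ev in eigenvalues)

# Hypothetical pair of strongly related variables.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
arm_span_cm = [149.0, 162.0, 169.0, 181.0, 191.0]
first, second = explained_variance_2d(height_cm, arm_span_cm)
print(f"PC1: {first:.1%}, PC2: {second:.1%}")
```

When the first component carries nearly all of the variance, the two original variables can be replaced by a single derived one with little loss of information.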
FUN FACT
Your reliability as a debtor varies by your use of your credit card: at a pool hall (less reliable); at the dentist (more reliable); to buy felt pads for under your furniture legs (most reliable) – 2002 study by Canadian Tire
CONCLUSIONS
▪ Big Data is more than a buzzword
▪ We are in a new age of analytics
▪ There is opportunity for LSS practitioners to expand skills in this rapidly growing and complementary area
▪ What data exists in your organization today where you could start to apply these tools?
SOURCES AND RESOURCES
▪ Eric Siegel, Predictive Analytics, Wiley, 2016 (particularly the Fun Facts)
▪ Foster Provost and Tom Fawcett, Data Science for Business, O’Reilly, 2013
▪ Tools and Tutorials for Data Mining and Predictive Analytics Software: https://www.salford-systems.com/
▪ Contact me at
http://www.firefly-consulting.com
THANK YOU!