
Survey of Data Analytics

Peter Bruce, President

The Institute for Statistics Education at Statistics.com

About Statistics.com:

• 100+ courses, introductory and advanced

• Traditional statistics, data mining, machine learning, text mining, clinical trials, optimization, use of R

• All online

• Typically 4 weeks, scheduled dates

• No need to be online at particular times or days

• Private discussion forum with instructors - noted authors & experts

[email protected]


From the simple to the complex…

• 35 different reports tracking traffic daily

• Midday report “are we on track for visitors?”

• # visitors from key domains - .gov, .mil, .senate or .house

Which photos do best?

1. Monkeys

2. Dogs

3. Cats


Analytics as Erector Set

• Most real-world analytics jobs involve building a “machine” that produces outcomes (decisions)

• Different components, some simple in function, others complex


Complex or “black-box” components

• Misunderstanding what the component does and how it works compromises the “machine”

• Task of education typically focuses on the components, particularly complex ones

• Machine-building skill comes with practice and experience in the business context (hard to teach in school)


At the center lies prediction…


A man walks into a Target® store…


Predictive Analytics (supervised learning)

Cluster/Segment

Affinity/Recommend

Also part of Data Analytics:

• Outlier Detection

• Profiling

• Exploration

• Text Mining

• Social Network Analysis

Statistics:

• Controlled Experiments (A-B tests)

• Observational studies

• Estimation


Predictive Models – Supervised Learning

• Lots of predictor variables, train models with known outcome (target, dependent) variables

• Use multiple methods (statistical, machine learning)

• Goes beyond the obvious, capturing complexity

• Implemented for real-time behavior and decisions

Classification (categorical)

Prediction (continuous)


Pregnant?

• Obvious retail clues – maternity clothes, baby food, baby clothes, crib …

• These may be too late

• Earlier clues not so obvious – lotions, supplements, and, esp., combinations and changes in purchase patterns

• Data mining algorithms can capture these less obvious, more complex signals


Training the Model

• Each row is a customer

• Numerous predictor variables (mostly on purchase data), target variable “pregnant?” (0/1)

• Those on baby shower registry are 1’s

• Women of similar demographic not on registry are 0’s

• Together they constitute the training set

• Known outcome

• Purchase data over time


K-NN (hypothetical data)

Cust #  zinc10  zinc90  mag10  mag90  cotton10  cotton90  Registry?
1       1       1       1      1      0         1         0
2       1       1       1      1      0         1         0
3       1       1       0      1      0         1         0
4       0       1       1      0      1         1         0
5       1       1       0      1      1         0         1
6       0       0       1      0      1         0         1
NEW     1       0       1      1      0         1         ?


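Cust #  zinc10  zinc90  mag10  mag90  cotton10  cotton90  Registry?
1       1       1       1      1      0         1         0
NEW     1       0       1      1      0         1         ?
dif     0       1       0      0      0         0
sq dif  0       1       0      0      0         0         sum=1

Cust #  zinc10  zinc90  mag10  mag90  cotton10  cotton90  Registry?
6       0       0       1      0      1         0         1
NEW     1       0       1      1      0         1         ?
dif     -1      0       0      -1     1         -1
sq dif  1       0       0      1      1         1         sum=4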

• Calculating distance (illustration): The NEW customer is quite close to cust #1, not so close to cust #6.

• Classification, for k=3: The three closest records (see prior slide) are 1, 2 and 3. They are all 0’s (not pregnant), so we classify the NEW customer as “0.”
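
The distance calculation and the k=3 vote above can be reproduced in a few lines. This is a minimal sketch using the hypothetical purchase data from the slide; the function names `squared_distance` and `knn_classify` are illustrative and do not come from the presentation or from XLMiner.

```python
from collections import Counter

# Hypothetical purchase data from the slide: six 0/1 purchase indicators
# (zinc10, zinc90, mag10, mag90, cotton10, cotton90) and the registry outcome.
TRAINING = {
    1: ([1, 1, 1, 1, 0, 1], 0),
    2: ([1, 1, 1, 1, 0, 1], 0),
    3: ([1, 1, 0, 1, 0, 1], 0),
    4: ([0, 1, 1, 0, 1, 1], 0),
    5: ([1, 1, 0, 1, 1, 0], 1),
    6: ([0, 0, 1, 0, 1, 0], 1),
}
NEW = [1, 0, 1, 1, 0, 1]


def squared_distance(a, b):
    """Sum of squared differences, as in the 'sq dif' rows of the slide."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_classify(new, training, k):
    """Return the majority class among the k closest records, and their ids."""
    ranked = sorted(
        training.items(),
        key=lambda item: squared_distance(item[1][0], new),
    )
    nearest = ranked[:k]
    votes = Counter(label for _, (_, label) in nearest)
    return votes.most_common(1)[0][0], [cust for cust, _ in nearest]


distances = {cust: squared_distance(row, NEW) for cust, (row, _) in TRAINING.items()}
label, neighbors = knn_classify(NEW, TRAINING, k=3)
print(distances)         # customers 1 and 6 give 1 and 4, matching the slide
print(neighbors, label)  # the three nearest customers and the assigned class
```

Ties in distance are broken here by customer order, which is an arbitrary choice made for the sketch.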


Statistical Distance/Similarity

• In the above procedure, we calculated a numerical measure of the distance between two records.

• There are various measures of statistical distance and similarity

• Some are sensitive to scale, requiring normalization (standardization)

• Used in clustering, nearest neighbor calculations.
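
To illustrate why scale matters, the sketch below compares Euclidean distances before and after z-score standardization. The three records (income in dollars, age in years) are invented for the example and are not from the presentation.

```python
import math
import statistics

# Hypothetical records: (annual income in dollars, age in years).
records = [(50_000, 25), (52_000, 60), (90_000, 26)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardize(rows):
    """Convert each column to z-scores (subtract the mean, divide by the std dev)."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in rows]


raw_01 = euclidean(records[0], records[1])  # 35-year age gap, $2,000 income gap
raw_02 = euclidean(records[0], records[2])  # 1-year age gap, $40,000 income gap

z = standardize(records)
z_01 = euclidean(z[0], z[1])
z_02 = euclidean(z[0], z[2])

print(f"raw:          {raw_01:.1f} vs {raw_02:.1f}")
print(f"standardized: {z_01:.2f} vs {z_02:.2f}")
```

On the raw scale the income column dominates the distance because its units are so much larger; after standardization the large age gap and the large income gap carry comparable weight.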


K-Nearest Neighbors Classification

• Take a new record, find its closest neighbor (k=1)

• Assign that neighbor’s class to the new record

• Or… find the closest k records, find the majority* class, and assign that class to the new record

• High k = smooths over local information (too high and useful information is lost)

• Low k = fits local information (too low and you fit the noise, not the signal)

*A lower cutoff may be used when classifying rare events


Classification Algorithms, cont.

• Logistic Regression

• CART

• Discriminant Analysis

• Neural Network

• Naïve Bayes


The Overfit Problem

[Figure: two scatter plots of Revenue (y-axis, 0 to 1600) against Expenditure (x-axis, 0 to 1000), one per fitted model]

• Linear regression – fair fit (some error remains)

• Complex polynomial – perfect fit, but fits noise too well. Will have lots of error with new data. Complex models, esp. machine learning ones, are prone to this problem.


Assess Model Performance

• Assess the model with new data that were not used to train the model

• Measures of performance:

• Accuracy (categorical)

• Lift (categorical/continuous)

• RMSE (continuous)
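
As a small illustration of the continuous-outcome metric, the sketch below computes RMSE (root mean squared error) on holdout data. The actual and predicted values are made up for the example.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical holdout records that were not used to train the model.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

# Errors are 1, 0, -2, -1, so the mean squared error is 6/4 = 1.5.
print(rmse(actual, predicted))
```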


[Diagram: Data (sample) is split into a Training partition and a Validation partition; Model 1, Model 2, Model 3, …, Model n lead to the Best model]

Assess and validate the models

Performance metrics:

• RMSE (continuous)

• Accuracy (categorical)

• Lift (categorical)


Measuring Accuracy: Confusion Matrix (and Cutoff Control)

Training Data scoring - Summary Report

Cut off Prob.Val. for Success (Updatable): 0.5

Classification Confusion Matrix

              Predicted Class
Actual Class  1      0
1             9      3
0             11     247

The analyst sets the cutoff in the classification algorithm.

• Higher cutoff: fewer predicted 1’s (fewer false positives, more false negatives)

• Lower cutoff: more predicted 1’s (more false positives, fewer false negatives)

Classification accuracy = (9+247)/(9+3+11+247) = 256/270 = 0.948

But wait… classifying everyone as “0” yields accuracy of 0.956!
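
The two accuracy figures can be checked directly from the four cells of the confusion matrix. This is a minimal sketch; the variable names are illustrative.

```python
# Cell counts from the confusion matrix on the slide.
true_pos, false_neg = 9, 3     # actual 1: predicted 1, predicted 0
false_pos, true_neg = 11, 247  # actual 0: predicted 1, predicted 0

total = true_pos + false_neg + false_pos + true_neg

# Accuracy of the model: correct predictions on the diagonal.
model_accuracy = (true_pos + true_neg) / total

# Accuracy of the naive rule "classify everyone as 0":
# every actual 0 is right, every actual 1 is wrong.
naive_accuracy = (false_pos + true_neg) / total

print(round(model_accuracy, 3), round(naive_accuracy, 3))
```

The naive rule scores higher on accuracy yet identifies none of the 12 actual 1’s, which is why accuracy alone is a poor yardstick when the class of interest is rare.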


Lift

• Need metric that reflects greater importance of the “pregnant” category, which is rare

• Lift is the model’s improvement over average random selection

• First step: take the predictions and rank them in order of belonging to the class of interest

• Next, review accuracy of predictions by decile

[Figure: Decile-wise lift chart (validation dataset). x-axis: Deciles, 1 to 10; y-axis: Decile mean / Global mean, 0 to 6]

The top decile is 5.2 times more likely to be a “1” than the average record. We are using our model to “skim the cream” and the decile chart measures how much cream the model has captured in each decile.
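
The ranking-and-decile procedure described above can be sketched as follows. The scores and outcomes are invented, and the function `decile_lift` is a hypothetical helper that assumes the number of records is a multiple of 10.

```python
def decile_lift(scores, actuals):
    """Rank records by predicted score (highest first), cut them into ten
    equal groups, and return decile mean / global mean for each group.
    Assumes len(scores) is a multiple of 10."""
    ranked = [a for _, a in sorted(zip(scores, actuals), key=lambda p: p[0], reverse=True)]
    n = len(ranked)
    global_mean = sum(ranked) / n
    size = n // 10
    return [
        (sum(ranked[i * size:(i + 1) * size]) / size) / global_mean
        for i in range(10)
    ]


# Hypothetical validation data: 20 records, 4 of which are actual 1's.
scores = [i / 20 for i in range(20, 0, -1)]    # predicted probability of "1"
actuals = [1, 1, 0, 1, 0, 0, 0, 1] + [0] * 12  # known outcomes, in score order

lifts = decile_lift(scores, actuals)
print(lifts)
```

In this toy data the top decile contains only 1’s while the overall rate is 20%, so its lift is 5; averaged over all ten deciles the lift is 1, as it must be.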


[Diagram: Data (sample) is split into a Training partition, a Validation partition and a Test partition; Model 1, Model 2, Model 3, …, Model n lead to the Best model]

Add test partition for unbiased estimate of performance on new data


Software

• SAS Enterprise Miner $$$$

• IBM SPSS Modeler (Clementine) $$$$

• XLMiner (Excel add-in) $

• Statistica Data Miner $$

• Salford Systems $$

• RapidMiner $$ (open source free version)

• R (open source, free)


K-NN

• Model-less prediction – use where features of data are highly local, and without structure.


Linear classifier

• Lack of flexibility leads to more error

Source for figures: Hastie, Tibshirani and Friedman, The Elements of Statistical Learning


KNN in XLMiner

• Partition

• Normalize

• Find best k


Predictive Modeling via classical statistics

• Linear Regression

• Logistic Regression

• Discriminant Analysis


Predictive Modeling – Machine Learning

Classification & Regression Trees (CART)

Recursively partition data for homogeneity, derive rules:


Trees, cont.

Complete partitioning leads to 100% homogeneity (100% classification accuracy) but overfits:


Machine Learning (cont.)

Neural Networks – like regression on steroids:

Regression: CA = β1*fat + β2*salt + ε

NN: proliferation of coefficients & interactions, + iterative learning


Ensemble Methods

• Fit multiple different models

• Try additional models that are weighted average of the predictions from multiple “single” models

• “Bagging” – fit models to bootstrap samples of cases, take average prediction (or majority vote for classification)

• “Boosting” – iteratively fit models, each time adjusting case weights

• Overweight the hard to predict cases

• Underweight the easy to predict cases

• Average the models, giving most weight to the earlier ones


Crowdsourcing - Kaggle

• Publish the data for which you want a model

• Let the hacker community compete

• Overfit danger – lots of energy is spent building a perfect model of a static data set. The real world is dynamic, modeling is an ongoing process.

• Usually, most of the big gain comes from the simple, the obvious


Netflix Prize – Predicting Customer Ratings

Goal: 10% improvement in RMSE of predicted customer rating


PA - Courses at Statistics.com

• Predictive Modeling

• Trees

• Data Mining in R

• Logistic Regression


Clustering-Segmentation

• Established statistical technique used from Astronomy to Zoology

• Used in business to identify different customer segments to be targeted with different marketing approaches


Agglomerative Clustering

• Join two closest cases together in a cluster

• Now you have many single-case clusters, plus one cluster of two cases.

• Again, join the two closest clusters together (whether single-case clusters or the two-case cluster)

• As the process proceeds, you have fewer and fewer clusters


Return to Hypothetical Purchase Data

Cust #  zinc10  zinc90  mag10  mag90  cotton10  cotton90  Registry?
1       1       1       1      1      0         1         0
2       1       1       1      1      0         1         0
3       1       1       0      1      0         1         0
4       0       1       1      0      1         1         0
5       1       1       0      1      1         0         1
6       0       0       1      0      1         0         1

Step 1

Step 2


Final result is a dendrogram

[Figure: Dendrogram (Ward's method). x-axis: cases, in leaf order 1 18 14 19 3 9 6 2 22 4 20 10 13 5 8 16 11 7 12 21 15 17; y-axis: Distance, 0 to 40. Y value = distance between clusters]


Sliding horizontal line indicates # of clusters

[Figure: Dendrogram (Ward's method). x-axis: cases; y-axis: Distance, 0 to 40]


Measuring closeness of clusters

• Minimize minimum distance between two clusters

• Minimize maximum distance

• Minimize average distance

• Minimize distance between centroids

• Minimize loss of information (ESS) that comes from joining two clusters – Ward’s Method

Different metrics can yield very different results, and even random data exhibits apparent clustering.
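
To make the first four closeness measures concrete, the sketch below evaluates them for two small clusters taken from the hypothetical purchase data (customers 1 and 3 against customers 5 and 6). Squared Euclidean distance is used to match the earlier k-NN slides; that choice, and the grouping into these two clusters, are assumptions made for the example. Ward's method is not shown.

```python
from itertools import product


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(cluster):
    """Column-wise mean of the records in a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


# Two illustrative clusters from the hypothetical purchase data.
cluster_a = [[1, 1, 1, 1, 0, 1], [1, 1, 0, 1, 0, 1]]  # customers 1 and 3
cluster_b = [[1, 1, 0, 1, 1, 0], [0, 0, 1, 0, 1, 0]]  # customers 5 and 6

# Distances between every cross-cluster pair of records.
pairwise = [squared_distance(a, b) for a, b in product(cluster_a, cluster_b)]

minimum_distance = min(pairwise)                   # closest pair
maximum_distance = max(pairwise)                   # farthest pair
average_distance = sum(pairwise) / len(pairwise)   # mean over all pairs
centroid_distance = squared_distance(centroid(cluster_a), centroid(cluster_b))

print(minimum_distance, maximum_distance, average_distance, centroid_distance)
```

The same two clusters are 2, 6, 4 or 2.75 apart depending on the measure, which is one reason different measures can produce different cluster solutions.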



Recommender Systems


Association Rules, Affinity Analysis

Binary transaction matrix:

Cust #  zinc10  zinc90  mag10  mag90  cotton10  cotton90
1       1       1       1      1      0         1
2       1       1       1      1      0         1
3       1       1       0      1      0         1
4       0       1       1      0      1         1
5       1       1       0      1      1         0
6       0       0       1      0      1         0
7       1       0       1      1      0         1


What goes with what

• Apriori algorithm to generate lists of item-sets (not transactions)

• Antecedent and consequent item sets form rules

• Support: % of transactions with a given item-set

• Confidence: % of transactions w/ antecedent that also have consequent

• Lift: (rule confidence)/(consequent support). In other words: in looking for the consequent item-set, how much gain do you get from the rule, as opposed to randomly picking transactions?
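
Support, confidence and lift can be computed by hand for a single rule from the seven-customer transaction matrix. This sketch only evaluates one rule chosen for illustration, {zinc10} → {mag90}; it does not implement the Apriori search itself, and the helper names are hypothetical.

```python
ITEMS = ["zinc10", "zinc90", "mag10", "mag90", "cotton10", "cotton90"]
ROWS = [
    [1, 1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0, 1],
    [1, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 0],
    [0, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 1],
]

# Each transaction becomes the set of items purchased.
transactions = [{item for item, flag in zip(ITEMS, row) if flag} for row in ROWS]


def support(itemset):
    """Share of transactions that contain every item in the item-set."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    rule_support = support(antecedent | consequent)
    confidence = rule_support / support(antecedent)
    lift = confidence / support(consequent)
    return rule_support, confidence, lift


rule_support, confidence, lift = rule_metrics({"zinc10"}, {"mag90"})
print(rule_support, confidence, lift)
```

In this matrix every customer who bought zinc10 also bought mag90, so the confidence is 1; since 5 of 7 customers bought mag90 anyway, the lift is 1/(5/7) = 1.4.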


Text analytics (very briefly)

• Origins in linguistics and computer science

• “Natural” in NLP – methods of computer language processing were extended to natural languages

• Huge challenge – ambiguity & complexity pervade natural language

• Hierarchy of recognition tasks > tokenization

[Figure: the token “fox” picked out of the sentence “The fox jumped over the log”]


Ambiguity

White space and punctuation as delimiters, but consider…

Clairson International Corp. said it expects to report a net loss for its second quarter ended March 26 and doesn’t expect to meet analysts’ profit estimates of $3.9 to $4 million, or 76 cents a share to 79 cents a share, for its year ending Sept. 24.
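
A short experiment shows where simple delimiter rules go wrong on this kind of text. The sentence below is an abridged excerpt of the example above, and both tokenizers are deliberately naive sketches rather than anything a real NLP toolkit would use.

```python
import re

# Abridged excerpt of the example sentence.
text = (
    "Clairson International Corp. said it doesn’t expect to meet "
    "analysts’ profit estimates of $3.9 to $4 million"
)

# Rule 1: split on white space only. Punctuation stays glued to words.
whitespace_tokens = text.split()

# Rule 2: treat every non-alphanumeric character as a delimiter.
# Abbreviations, contractions and decimal numbers get broken apart.
punctuation_tokens = re.findall(r"[A-Za-z0-9]+", text)

print(whitespace_tokens)
print(punctuation_tokens)
```

With white space alone, “Corp.” keeps its period and “$3.9” stays in one piece; with punctuation as a delimiter, “doesn’t” becomes “doesn” and “t”, and “$3.9” becomes “3” and “9”. Neither rule is right everywhere.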


Ambiguity, cont.

Or…

A series of mono and di-N-2,3-epoxypropyl N-phenylhydrazones have been prepared on a large scale by reaction of the corresponding N-phenylhydrazones of 9-ethyl-3carbazolecarbaldehyde, 9-ethyl-3-6carbazoledicarbaldehyde, ….


“Bag of words” and sentiment

Common metric: counts of positive and negative words

I adore the hero of “I hate love stories”

Simple approach will misclassify “hate”
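
The word-count approach and its blind spot can be sketched in a few lines. The two word lists are tiny hypothetical lexicons invented for this example.

```python
import re

# Tiny hypothetical sentiment lexicons, for illustration only.
POSITIVE = {"adore", "love"}
NEGATIVE = {"hate"}


def sentiment_counts(text):
    """Bag-of-words scoring: count positive and negative words, ignoring
    word order, quotation marks and any other context."""
    words = re.findall(r"[a-z]+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives, negatives


sentence = "I adore the hero of 'I hate love stories'"
counts = sentiment_counts(sentence)
print(counts)
```

The counter registers “hate” as a negative word and “love” as a second positive word, even though both belong to a film title and say nothing about the writer’s opinion.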


Complexity counts

• Greater statistical power from more sophisticated systems

• Ben Bernanke and Aug. 30, 2012 Jackson Hole speech – digital version was rapidly analyzed and there was a stock sell-off


Google searches: Sparsity and Big Data


Resampling & classical statistics

• Hypothesis tests & confidence intervals

• What-if simulation (OR applications)

• Example (confidence interval): median of 100 incomes

1. Place all values in a box

2. Randomly pick one, record, replace

3. Repeat 99 more times, record median

4. Repeat steps 2+3, say, 1000 times

5. Review distribution of medians
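
The five-step box procedure above corresponds to a bootstrap of the median. The sketch below follows those steps with 100 made-up income values (in thousands); the 90% percentile interval at the end is one common way to review the distribution and is an assumption, since the slide does not state an interval level.

```python
import random
import statistics


def bootstrap_medians(values, n_resamples=1000, seed=0):
    """Steps 1-4: draw len(values) items with replacement, record the
    median of the resample, and repeat n_resamples times."""
    rng = random.Random(seed)
    n = len(values)
    return [statistics.median(rng.choices(values, k=n)) for _ in range(n_resamples)]


def percentile_interval(stats, level=0.90):
    """Step 5: cut an equal share off each tail of the sorted medians."""
    ordered = sorted(stats)
    tail = round(len(ordered) * (1 - level) / 2)
    return ordered[tail], ordered[-tail - 1]


# 100 hypothetical incomes (in thousands), spread between 20.0 and 39.8.
incomes = [20 + (i * 37 % 100) / 5 for i in range(100)]

medians = bootstrap_medians(incomes)
low, high = percentile_interval(medians)
print(statistics.median(incomes), low, high)
```

The seed is fixed only so the sketch is repeatable; a different seed gives slightly different interval endpoints.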


Resampled medians

[Figure: histogram of resampled medians. x-axis: median, 21.65 to 29.65; y-axis: Counts, 0 to 250]


Resampling Stats for Excel


Repeat & Score dialog


A-B test (of 2 web offers)

• Control: 220 views > 7 clicks = 0.0318

• Treatment: 195 views > 11 clicks = 0.0564

• 77% improvement: 0.0564/0.0318 (=1.77)

Resampling test:

1. Box with 18 1’s and 397 0’s (total 415)

2. Shuffle & draw 220, count 1’s

3. Count 1’s in remaining 195

4. Record ratio (the 195 group to the 220 group)

5. Repeat steps 2-4, say 1000 times

6. How often >= 1.77?
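
The six resampling steps translate directly into a permutation test. This is a minimal sketch using the view and click counts from the slide; the function name, the fixed seed and the handling of a zero-click control group are assumptions made for the example.

```python
import random


def ab_permutation_test(clicks_a, views_a, clicks_b, views_b, n_trials=1000, seed=0):
    """Shuffle all outcomes into one box, re-deal them into groups of the
    original sizes, and count how often the ratio of click rates (B over A)
    is at least as large as the observed ratio."""
    rng = random.Random(seed)
    total_clicks = clicks_a + clicks_b
    total_views = views_a + views_b
    box = [1] * total_clicks + [0] * (total_views - total_clicks)  # step 1
    observed = (clicks_b / views_b) / (clicks_a / views_a)

    at_least_as_large = 0
    for _ in range(n_trials):                  # step 5
        rng.shuffle(box)
        rate_a = sum(box[:views_a]) / views_a  # step 2: first 220 draws
        rate_b = sum(box[views_a:]) / views_b  # step 3: remaining 195
        # Step 4: ratio of the 195 group to the 220 group.
        ratio = float("inf") if rate_a == 0 else rate_b / rate_a
        if ratio >= observed:                  # step 6
            at_least_as_large += 1
    return observed, at_least_as_large / n_trials


observed_ratio, p_value = ab_permutation_test(7, 220, 11, 195)
print(round(observed_ratio, 2), p_value)
```

The estimated proportion varies from run to run and seed to seed, so it will not reproduce any particular published count exactly; what matters is whether a ratio of 1.77 or more turns up often by chance alone.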


Results (143 of 1000 trials >= 1.77)

[Figure: histogram of resampled ratios. y-axis: Counts, 0 to 300]


Material presented is partially drawn from

Text, includes XLMiner User Guide