Modeling and Analysis for the Non-Statistician. Presented by: Andrew Curtis, Vice President; Richard Pless, Consultant

Modeling for the Non-Statistician

DESCRIPTION

Step-by-step guide to prepare customer data for modeling.

Page 1: Modeling for the Non-Statistician

Modeling and Analysis for the Non-Statistician

Presented by:

Andrew Curtis, Vice President

Richard Pless, Consultant

Page 2: Modeling for the Non-Statistician

Models are developed using a six-step process.

Step                                     % Effort
1. Research Design                       10%
2. Data Checking and Variable Creation   30%
3. Create Analysis Files                 30%
4. Calibrate Scoring Model               10%
5. Model Evaluation                      10%
6. Model Implementation                  10%

1. Research Design

Page 3: Modeling for the Non-Statistician

Research design requires the input of both marketers and analysts.

Is the problem solvable through modeling?

Do we have representative promotions from which to develop a model?

Do we need to be concerned about selection bias?

Will we be able to pull all the information we need to score the model off of our database in a timely manner?

1. Research Design

Page 4: Modeling for the Non-Statistician

Research Design--Unsolvable Problems

Prospecting models for a niche marketer.

– Some lists work really well.

– All others are unprofitable, even in the first decile.

Finding all prospective buyers.

– Impossible to accurately predict all behavior.

– All models leave some revenue on the table.

1. Research Design

Page 5: Modeling for the Non-Statistician

Research Design--Unrepresentative Promotions.

Album promotion during a major tour.

Retail sale announcement during major clearance.

Veterans magazine solicitation during the Gulf War.

1. Research Design

Page 6: Modeling for the Non-Statistician

Research Design--Selection Bias.

The model is built off a series of mailings for business-appropriate suits, dresses, and accessories.

The mailings were mailed to women only.

If the resulting model is put into production without the gender pre-screen, then males will end up getting contacted, probably quite unprofitably.

1. Research Design

Page 7: Modeling for the Non-Statistician

Research Design--Timely Scoring Data.

The model uses the number of Web applicants from a given ZIP code in the prior week, but the data can only be pulled monthly.

At best, the model can only be scored accurately once a month.

The predictor which uses the information is ineffective.

1. Research Design

Page 8: Modeling for the Non-Statistician

Rule #1: Garbage In, Garbage Out! Bad Data In, Bad Models Out!

Analysis is only as good as the data being analyzed.

All input data must be checked for reasonableness, timeliness, and completeness.

When information is extracted from multiple sources, verify that all data are appended to the “master file” correctly.

You must engage in on-going quality control!

2. Data Checking

Page 9: Modeling for the Non-Statistician

Study and scrutinize the data dictionary!

Understand every field in the database.

Eliminate fields that are too new, poorly filled, or unreliable.

Look at distributions of values for each field.

– Know what every field means.

– Understand every value in the field. If there is a “Z”, find out what “Z” means.

Work with finance to define the business rules for properly counting orders, revenue, and other business drivers.

2. Data Checking

Page 10: Modeling for the Non-Statistician

Clean the data when appropriate.

Models are driven by underlying data patterns.

– Bad patterns lead to bad models.

Correct data/variables with:

– Anomalies

– Missing values

– Outliers

– Errors.

2. Data Checking

Page 11: Modeling for the Non-Statistician

Data Checking--Example of an anomaly.

Dollars per Contact Over Five Mailings

Recency          Avg.     February   Other Four
0 - 3 months     $3.50    $3.45      $3.52
4 - 6 months     2.20     1.36       2.42
7 - 9 months     2.31     2.40       2.29
10 - 12 months   1.40     1.50       1.36

2. Data Checking
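One way to spot a cell like the February figure for the 4 - 6 month group is to compare each mailing with the others and flag large gaps. The sketch below uses the figures from the table; the 25 percent threshold and the function name `flag_anomalies` are assumptions chosen for illustration, not part of the original deck.

```python
# Dollars per contact by recency band: (February mailing, average of other four).
# Figures are copied from the anomaly table.
dollars_per_contact = {
    "0 - 3 months": (3.45, 3.52),
    "4 - 6 months": (1.36, 2.42),
    "7 - 9 months": (2.40, 2.29),
    "10 - 12 months": (1.50, 1.36),
}

# Hypothetical rule: flag a cell when it differs from the comparison
# mailings by more than 25 percent. The threshold is an assumption.
THRESHOLD = 0.25


def flag_anomalies(table, threshold=THRESHOLD):
    """Return (band, relative gap) for every band whose gap exceeds the threshold."""
    flagged = []
    for band, (february, other_four) in table.items():
        relative_gap = abs(february - other_four) / other_four
        if relative_gap > threshold:
            flagged.append((band, round(relative_gap, 2)))
    return flagged


anomalies = flag_anomalies(dollars_per_contact)
print(anomalies)
```

A flag only says where to look. Whether the February result is a data error or a real event still has to be researched.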

Page 12: Modeling for the Non-Statistician

Data Checking--Missing Data Example.

Response Rates by Age

Age Range    Average Response %
18 - 35      1.3 %
36 - 49      0.8
50 and up    1.0
Missing      1.4      How to explain?

2. Data Checking

Page 13: Modeling for the Non-Statistician

Data Checking--Outliers Example

The “Michael Jordan” example.

Individual credit card holders with $200,000 lines of credit.

The department store employee with 100 shopping trips a year.

2. Data Checking

Page 14: Modeling for the Non-Statistician

Data Checking--Errors pose a tremendous risk for the modeler.

Commonly Occurring Errors:

Response data from a prior mailing incorrectly matched back to the customer file.

Changes in meaning or usage of a particular variable.

Alpha characters in supposedly numeric variable fields.

2. Data Checking
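The last error above, alpha characters in a supposedly numeric field, is easy to screen for before modeling begins. The following is a minimal sketch; the field name `order_amount`, the sample records, and the helper `find_non_numeric` are hypothetical.

```python
def find_non_numeric(records, field):
    """Return (row_index, raw_value) pairs whose field does not parse as a number.

    Note: float() also accepts strings such as 'nan' and 'inf', so a stricter
    check may be needed for production use.
    """
    bad = []
    for i, record in enumerate(records):
        raw = str(record.get(field, "")).strip()
        try:
            float(raw)
        except ValueError:
            bad.append((i, raw))
    return bad


# Hypothetical extract: 'order_amount' is supposed to be numeric.
sample = [
    {"customer_id": "A1", "order_amount": "40"},
    {"customer_id": "A2", "order_amount": "2O"},      # letter O instead of zero
    {"customer_id": "A3", "order_amount": ""},        # missing value
    {"customer_id": "A4", "order_amount": "300.00"},
]

problems = find_non_numeric(sample, "order_amount")
print(problems)
```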

Page 15: Modeling for the Non-Statistician

Variable creation captures the dynamics of the business.

Use creativity to create predictor variables.

Predictor variables typically come in three classes:

– Recency: the time elapsed since an action.

– Frequency: the number of times an event has happened, e.g. orders placed, web pages clicked.

– Monetary: the amount of money spent purchasing goods and services.

Use ratios and cross variables to identify meaningful interactions between variables.

2b. Variable Creation

Page 16: Modeling for the Non-Statistician

Predictor Variable Creation--Example

Order Date | Shipping to | Category | Product Description | QTY | Price
4/2/1999 | Andrew Curtis | Book | Markstrat3 : The Strategic | 1 | 40
5/2/1999 | Andrew Curtis | Book | The Service Profit Chain : How Leading Companies Link Profit and Growth to Loyalty, Satisfaction, and Value | 1 | 20
6/14/2000 | Stephen J. Curtis | Book | Tuesdays with Morrie: An Old Man, a Young Man and Life's Greatest Lesson | 1 | 20
6/1/2001 | Andrew Curtis | Book | Zapp! : The Lightning of Empowerment : How to Improve Quality, Productivity, and Employee Satisfaction | 1 | 20
7/1/2001 | Andrew Curtis | Electronics | Handspring Visor Platinum (Silver) - Special Offer | 1 | 300
8/17/2001 | Glenn Waldorf | Video | The Godfather DVD Collection | 1 | 100

Recency (11/14/01 – 8/17/01) = 89 Days or 3 Months!

Monetary: Sum of Revenue = $500

Frequency: Count of Order Dates = 6 Orders

2b. Variable Creation

Page 17: Modeling for the Non-Statistician

Predictor Variable Creation--Example

Using the same six orders as on the previous page:

Recency in Books (11/14/01 – 6/1/01) = 166 Days or 5.5 Months!

Total Books = 4
Total DVDs = 1
Total Electronics = 1

Average Order Size = $500 / 6 Orders = $83.33

Percent Gift Purchases = 2 / 6 = 33%

2b. Variable Creation
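The variables on these two pages can be derived mechanically from the order history. The sketch below transcribes the six orders and recomputes recency, frequency, monetary value, and the category and gift variables. Two points are assumptions rather than statements from the deck: the account holder is taken to be Andrew Curtis (with the ship-to name spelled consistently), and an order shipped to anyone else is counted as a gift.

```python
from datetime import date

# Order history transcribed from the example: (order date, ship-to, category, price in dollars).
orders = [
    (date(1999, 4, 2), "Andrew Curtis", "Book", 40),
    (date(1999, 5, 2), "Andrew Curtis", "Book", 20),
    (date(2000, 6, 14), "Stephen J. Curtis", "Book", 20),
    (date(2001, 6, 1), "Andrew Curtis", "Book", 20),
    (date(2001, 7, 1), "Andrew Curtis", "Electronics", 300),
    (date(2001, 8, 17), "Glenn Waldorf", "Video", 100),
]

AS_OF = date(2001, 11, 14)   # reference date used in the example
CUSTOMER = "Andrew Curtis"   # assumption: the account holder

# Recency, frequency, monetary.
recency_days = (AS_OF - max(o[0] for o in orders)).days
frequency = len(orders)
monetary = sum(o[3] for o in orders)

# Category-level recency and counts.
book_dates = [o[0] for o in orders if o[2] == "Book"]
recency_books_days = (AS_OF - max(book_dates)).days
total_books = len(book_dates)

# Ratio variables.
average_order = monetary / frequency
# Assumption: an order shipped to someone other than the customer is a gift.
gift_share = sum(1 for o in orders if o[1] != CUSTOMER) / frequency

print(recency_days, frequency, monetary, recency_books_days)
print(round(average_order, 2), round(gift_share, 2))
```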

Page 18: Modeling for the Non-Statistician

Selecting a Target Variable

Make sure your target variable will give you the type of results you want.

– Measuring response: you may get a lot of hand-raisers who are not profitable.

– Measuring profit: by focusing only on the dollars, you may miss a viable low-profit group.

Exclude all information gathered during the target period from the predictor variables.

2b. Variable Creation

Page 19: Modeling for the Non-Statistician

Analysis files have three time frames:

1. Predictor Period: The time before individuals are selected for a marketing contact. All predictor variables must contain only data from this period.

2. Gap Period: The time between the selection date and when the first response is recorded.

3. Target Period: The time between the first and last response date. All target variables must only contain information from this period.

3. Create Analysis Files

[Timeline: Predictor Period | Selection Date | Gap Period | First Response Date | Target Period | Last Response Date]
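The three time frames can be enforced in code by classifying every event by its date before any variable is built. This is a minimal sketch: the three dates, the sample events, and the treatment of the boundary days (selection date counted in the gap, last response date counted in the target period) are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical dates for one promotion; only their ordering matters here.
SELECTION_DATE = date(2001, 9, 1)
FIRST_RESPONSE_DATE = date(2001, 9, 15)
LAST_RESPONSE_DATE = date(2001, 11, 14)


def period_of(event_date):
    """Classify an event date into one of the three analysis-file time frames."""
    if event_date < SELECTION_DATE:
        return "predictor"
    if event_date < FIRST_RESPONSE_DATE:
        return "gap"
    if event_date <= LAST_RESPONSE_DATE:
        return "target"
    return "outside"


# Hypothetical order events for one customer: (order date, dollars).
events = [
    (date(2001, 6, 1), 20),
    (date(2001, 8, 17), 100),
    (date(2001, 9, 5), 35),    # falls in the gap; used for neither side
    (date(2001, 10, 2), 60),
]

# Predictors see only the predictor period; targets see only the target period.
predictor_revenue = sum(amt for d, amt in events if period_of(d) == "predictor")
target_revenue = sum(amt for d, amt in events if period_of(d) == "target")
print(predictor_revenue, target_revenue)
```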

Page 20: Modeling for the Non-Statistician

Good models are developed with modeling and validation samples.

Before modeling begins, split the analysis file into two random subsets: modeling and validation.

Develop the model using only the modeling subset.

Test the robustness and accuracy of the model using the validation subset.

Techniques exist for handling validation when the analysis sample is too small to split.

3. Create Analysis Files
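A random split of the analysis file might look like the sketch below. The 50/50 share, the fixed seed, and the function name `split_analysis_file` are assumptions; the deck does not specify a split ratio.

```python
import random


def split_analysis_file(records, validation_share=0.5, seed=2001):
    """Shuffle the records and split them into modeling and validation subsets."""
    rng = random.Random(seed)       # fixed seed so the split can be reproduced
    shuffled = list(records)        # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_share))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical analysis file of 1,000 customer IDs.
customer_ids = list(range(1000))
modeling, validation = split_analysis_file(customer_ids)
print(len(modeling), len(validation))
```

Every record lands in exactly one subset, so the validation sample stays untouched while the model is being developed.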

Page 21: Modeling for the Non-Statistician

The appropriate modeling technique is driven by several factors.

The nature of the target variable.

The software that is supported in the production environment.

The skills of the analytical team.

4. Model Calibration

Page 22: Modeling for the Non-Statistician

No modeling technique should operate on autopilot.

The analyst developing the model must:

– Know how to use the modeling technique.

– Know how to interpret the results.

– Know a “cringe variable” when they see one.

– Know how the model will be used by the marketers.

Without a pilot, even the most sophisticated plane will crash.

4. Model Calibration

Page 23: Modeling for the Non-Statistician

Scoring models can be built using many different techniques.

Linear regression

Logistic regression

Discriminant analysis

Neural networks

Many, many more...

All can be used as predictors of future behavior.

4. Model Calibration

Page 24: Modeling for the Non-Statistician

Model Calibration Rule #1

If you want to get famous, talk about technique.

If you want a great model, concentrate on “the other 90 percent.”

4. Model Calibration

Page 25: Modeling for the Non-Statistician

Corollary to Rule #1

Regardless of your technique of choice,

if you short-change “the other 90 percent,”

you will probably end up with a lousy model.

4. Model Calibration

Page 26: Modeling for the Non-Statistician

Construction analogy

Throw several power tools onto a pile of lumber, come back in a month, and -- presto -- you will NOT have a house.

4. Model Calibration

Page 27: Modeling for the Non-Statistician

Linear Regression is best suited for continuous outcomes, such as sales.

Output can be understood by non-statisticians.

Each name is assigned an estimated value.

Scored population is easily ranked with respect to the target variable (sales, profits, etc.).

Does not automatically identify interactions between predictor variables.

4. Model Calibration

Page 28: Modeling for the Non-Statistician

Linear Regression Example

Scoring Model for Predicting Monthly Revenue

Score = 0.08
        + 0.06 * House Value (Estimated in $Thousands)
        - 0.20 * Number of Children
        + 0.10 * Average Credit Card Limit (in $Thousands)
        - 0.30 * Number of Autos

                John        Jennifer    YOU
House Value?    $150,000    $125,000
No. of Kids?    2           0
Ave Limit?      $15,000     $8,000
No. of Cars?    2           1
Score           $9.58       $8.08

4. Model Calibration
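The scores for John and Jennifer can be reproduced by applying the scoring equation directly. In this sketch the dictionary keys and the function name `linear_score` are hypothetical labels; the coefficients and inputs come from the example, with dollar amounts expressed in thousands as the equation requires.

```python
# Coefficients from the example scoring model. Dollar inputs are in $ thousands.
INTERCEPT = 0.08
COEFFICIENTS = {
    "house_value_k": 0.06,
    "children": -0.20,
    "avg_card_limit_k": 0.10,
    "autos": -0.30,
}


def linear_score(person):
    """Predicted monthly revenue: intercept plus the weighted sum of the inputs."""
    return INTERCEPT + sum(weight * person[name] for name, weight in COEFFICIENTS.items())


john = {"house_value_k": 150, "children": 2, "avg_card_limit_k": 15, "autos": 2}
jennifer = {"house_value_k": 125, "children": 0, "avg_card_limit_k": 8, "autos": 1}

print(round(linear_score(john), 2), round(linear_score(jennifer), 2))
```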

Page 29: Modeling for the Non-Statistician

Logistic regression is best suited for binary outcomes, such as buy/no buy.

Output can be understood by non-statisticians.

Each name is assigned a probability of performing the expected outcome; this probability is NOT a prediction of future performance.

Scored population is easily ranked with respect to likelihood of displaying the targeted behavior.

Does not automatically identify interactions between predictor variables.

4. Model Calibration

Page 30: Modeling for the Non-Statistician

Logistic Regression Example

Scoring Model for Predicting Likelihood to Purchase (Yes/No)

Score = 0.01
        + 0.04 * Person Owns Home (1=yes, 0=no)
        - 0.05 * Number of Credit Cards
        + 0.01 * Income (Estimated in $Thousands)
        - 0.02 * Age

Probability Fix = 1 / [1 + Exponent(-Score)]

                 John                Jennifer            YOU
Owns Home?       No                  Yes
No. of Cards?    6                   3
Income?          $40,000             $25,000
Age?             45                  35
Score            -0.79 (prob=31%)    -0.55 (prob=37%)

4. Model Calibration
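The same arithmetic applies to the logistic example, with one extra step that converts the score into a probability. The function names below are hypothetical; the coefficients, inputs, and the conversion formula 1 / (1 + exp(-score)) are taken from the example.

```python
import math


def logistic_score(owns_home, cards, income_k, age):
    """Raw score from the example model. owns_home is 1 or 0; income is in $ thousands."""
    return 0.01 + 0.04 * owns_home - 0.05 * cards + 0.01 * income_k - 0.02 * age


def probability(score):
    """Convert a raw score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))


john_score = logistic_score(owns_home=0, cards=6, income_k=40, age=45)
jennifer_score = logistic_score(owns_home=1, cards=3, income_k=25, age=35)

print(round(john_score, 2), round(100 * probability(john_score)))
print(round(jennifer_score, 2), round(100 * probability(jennifer_score)))
```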

Page 31: Modeling for the Non-Statistician

Neural networks can be used with either binary or continuous targets.

No restrictions on the type or structure of either the target variable or the historical variables.

Can more easily capture interactions between predictor variables.

Output is very difficult to explain.

Implementation can be difficult.

Models don’t always outperform traditional regression.

4. Model Calibration

Page 32: Modeling for the Non-Statistician

When done well, scoring models are smooth with few, if any, clumps.

Target behaviors of the scored names distribute on a “Gains Table” smoothly from highest to lowest.

This makes it easier to target a precise number of names, or to select down to a precise threshold of response or profit.

5. Model Evaluation

Page 33: Modeling for the Non-Statistician

Understanding the Lift Table

Decile   No. of Cust.   No. of Resp.
1        10,000         7,000
2        10,000         5,280
3        10,000         4,600
4        10,000         3,710
5        10,000         3,300
6        10,000         2,400
7        10,000         2,020
8        10,000         1,590
9        10,000         650
10       10,000         450
Total    100,000        31,000

Start by ranking all customers by their descending scores and observing the number of responders in each “decile.”

5. Model Evaluation

Page 34: Modeling for the Non-Statistician

Next, calculate response rates for each decile.

Decile   No. of Cust.   No. of Resp.   Resp. Rate
1        10,000         7,000          0.700
2        10,000         5,280          0.528
3        10,000         4,600          0.460
4        10,000         3,710          0.371
5        10,000         3,300          0.330
6        10,000         2,400          0.240
7        10,000         2,020          0.202
8        10,000         1,590          0.159
9        10,000         650            0.065
10       10,000         450            0.045
Total    100,000        31,000

5. Model Evaluation

Page 35: Modeling for the Non-Statistician

Then, calculate the percent of all respondents that are in each decile.

Decile   No. of Cust.   No. of Resp.   Resp. Rate   Percent of Resp.
1        10,000         7,000          0.700        22.6%
2        10,000         5,280          0.528        17.0%
3        10,000         4,600          0.460        14.8%
4        10,000         3,710          0.371        12.0%
5        10,000         3,300          0.330        10.6%
6        10,000         2,400          0.240        7.7%
7        10,000         2,020          0.202        6.5%
8        10,000         1,590          0.159        5.1%
9        10,000         650            0.065        2.1%
10       10,000         450            0.045        1.5%
Total    100,000        31,000

5. Model Evaluation

Page 36: Modeling for the Non-Statistician

Sum down the columns to calculate cumulative totals.

Decile   No. of Cust.   No. of Resp.   Resp. Rate   Percent of Resp.   Cum. Cust.   Cum. Resp.
1        10,000         7,000          0.700        22.6%              10,000       7,000
2        10,000         5,280          0.528        17.0%              20,000       12,280
3        10,000         4,600          0.460        14.8%              30,000       16,880
4        10,000         3,710          0.371        12.0%              40,000       20,590
5        10,000         3,300          0.330        10.6%              50,000       23,890
6        10,000         2,400          0.240        7.7%               60,000       26,290
7        10,000         2,020          0.202        6.5%               70,000       28,310
8        10,000         1,590          0.159        5.1%               80,000       29,900
9        10,000         650            0.065        2.1%               90,000       30,550
10       10,000         450            0.045        1.5%               100,000      31,000
Total    100,000        31,000

5. Model Evaluation

Page 37: Modeling for the Non-Statistician

Calculate the cumulative response rate and the cumulative percent of responders.

Decile   No. of Cust.   No. of Resp.   Resp. Rate   Percent of Resp.   Cum. Cust.   Cum. Resp.   Cum. Resp. Rate   Cum. Percent of Resp.
1        10,000         7,000          0.700        22.6%              10,000       7,000        0.700             22.6%
2        10,000         5,280          0.528        17.0%              20,000       12,280       0.614             39.6%
3        10,000         4,600          0.460        14.8%              30,000       16,880       0.563             54.5%
4        10,000         3,710          0.371        12.0%              40,000       20,590       0.515             66.4%
5        10,000         3,300          0.330        10.6%              50,000       23,890       0.478             77.1%
6        10,000         2,400          0.240        7.7%               60,000       26,290       0.438             84.8%
7        10,000         2,020          0.202        6.5%               70,000       28,310       0.404             91.3%
8        10,000         1,590          0.159        5.1%               80,000       29,900       0.374             96.5%
9        10,000         650            0.065        2.1%               90,000       30,550       0.339             98.5%
10       10,000         450            0.045        1.5%               100,000      31,000       0.310             100.0%
Total    100,000        31,000

5. Model Evaluation

Page 38: Modeling for the Non-Statistician

Lift is the ratio of the cumulative response rate to the overall response rate (0.310), multiplied by 100.

Decile   No. of Cust.   No. of Resp.   Resp. Rate   Percent of Resp.   Cum. Cust.   Cum. Resp.   Cum. Resp. Rate   Cum. Percent of Resp.   Lift
1        10,000         7,000          0.700        22.6%              10,000       7,000        0.700             22.6%                   226
2        10,000         5,280          0.528        17.0%              20,000       12,280       0.614             39.6%                   198
3        10,000         4,600          0.460        14.8%              30,000       16,880       0.563             54.5%                   182
4        10,000         3,710          0.371        12.0%              40,000       20,590       0.515             66.4%                   166
5        10,000         3,300          0.330        10.6%              50,000       23,890       0.478             77.1%                   154
6        10,000         2,400          0.240        7.7%               60,000       26,290       0.438             84.8%                   141
7        10,000         2,020          0.202        6.5%               70,000       28,310       0.404             91.3%                   130
8        10,000         1,590          0.159        5.1%               80,000       29,900       0.374             96.5%                   121
9        10,000         650            0.065        2.1%               90,000       30,550       0.339             98.5%                   109
10       10,000         450            0.045        1.5%               100,000      31,000       0.310             100.0%                  100
Total    100,000        31,000

5. Model Evaluation
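The lift table built up over the preceding pages can be generated from just two inputs: the number of customers per decile and the responder count in each decile. The sketch below recomputes every derived column; the function name `gains_table` and the dictionary keys are hypothetical labels.

```python
# Inputs from the lift table: 10 deciles of 10,000 customers, ranked by descending score.
CUSTOMERS_PER_DECILE = 10_000
responders = [7000, 5280, 4600, 3710, 3300, 2400, 2020, 1590, 650, 450]


def gains_table(responders, customers_per_decile):
    """Build one row per decile with response, cumulative, and lift columns."""
    total_resp = sum(responders)
    overall_rate = total_resp / (customers_per_decile * len(responders))
    rows = []
    cum_resp = 0
    for decile, resp in enumerate(responders, start=1):
        cum_resp += resp
        cum_cust = decile * customers_per_decile
        cum_rate = cum_resp / cum_cust
        rows.append({
            "decile": decile,
            "resp_rate": resp / customers_per_decile,
            "pct_of_resp": 100 * resp / total_resp,
            "cum_cust": cum_cust,
            "cum_resp": cum_resp,
            "cum_resp_rate": cum_rate,
            "cum_pct_of_resp": 100 * cum_resp / total_resp,
            # Lift: cumulative response rate relative to the overall rate, times 100.
            "lift": round(100 * cum_rate / overall_rate),
        })
    return rows


table = gains_table(responders, CUSTOMERS_PER_DECILE)
for row in table:
    print(row["decile"], row["cum_resp"], round(row["cum_resp_rate"], 3), row["lift"])
```

A lift of 226 in the first decile means the top 10 percent of scored names respond at 2.26 times the overall rate.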

Page 39: Modeling for the Non-Statistician

Gains tables can show performance for both response and revenue.

Decile   Resp. Rate   Percent of Resp.   Cum. Resp. Rate   Cum. Percent of Resp.   Lift (Resp.)   Revenue Per Cust.   Cum Revenue Per Cust.   Cum Percent of Revenue   Lift (Revenue)
1        0.700        22.6%              0.700             22.6%                   226            $3.70               $3.70                   21.3%                    213
2        0.528        17.0%              0.614             39.6%                   198            $2.68               $3.19                   36.7%                    183
3        0.460        14.8%              0.563             54.5%                   182            $2.24               $2.87                   49.6%                    165
4        0.371        12.0%              0.515             66.4%                   166            $1.87               $2.62                   60.3%                    151
5        0.330        10.6%              0.478             77.1%                   154            $1.48               $2.39                   68.8%                    137
6        0.240        7.7%               0.438             84.8%                   141            $1.46               $2.24                   77.2%                    129
7        0.202        6.5%               0.404             91.3%                   130            $1.21               $2.09                   84.2%                    120
8        0.159        5.1%               0.374             96.5%                   121            $1.20               $1.98                   91.1%                    114
9        0.065        2.1%               0.339             98.5%                   109            $1.01               $1.87                   96.9%                    107
10       0.045        1.5%               0.310             100.0%                  100            $0.54               $1.74                   100.0%                   100

5. Model Evaluation

Page 40: Modeling for the Non-Statistician

Graphical displays of the lift table are easy to follow.

[Chart: cumulative percent captured (0% to 100%) by decile (0 to 10). Series: Response %, Revenue %, Random.]

5. Model Evaluation

Page 41: Modeling for the Non-Statistician

With cost figures, the gains table can be expanded to show profit.

Decile   No. of Cust.   Revenue   Revenue per Customer   Cost per Contact   Contact Cost   Cost of Goods   Total Cost   Marginal Profit   Cum Profit
1        10,000         $37,000   $3.70                  $0.45              $4,500         $24,050         $28,550      $8,450            $8,450
2        10,000         $26,800   $2.68                  $0.45              $4,500         $17,420         $21,920      $4,880            $13,330
3        10,000         $22,400   $2.24                  $0.45              $4,500         $14,560         $19,060      $3,340            $16,670
4        10,000         $18,700   $1.87                  $0.45              $4,500         $12,155         $16,655      $2,045            $18,715
5        10,000         $14,800   $1.48                  $0.45              $4,500         $9,620          $14,120      $680              $19,395
6        10,000         $14,600   $1.46                  $0.45              $4,500         $9,490          $13,990      $610              $20,005
7        10,000         $12,100   $1.21                  $0.45              $4,500         $7,865          $12,365      ($265)            $19,740
8        10,000         $12,000   $1.20                  $0.45              $4,500         $7,800          $12,300      ($300)            $19,440
9        10,000         $10,100   $1.01                  $0.45              $4,500         $6,565          $11,065      ($965)            $18,475
10       10,000         $5,400    $0.54                  $0.45              $4,500         $3,510          $8,010       ($2,610)          $15,865

5. Model Evaluation
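The profit columns follow from revenue and two cost inputs, and the decile where cumulative profit peaks can be found programmatically. In this sketch the $0.45 cost per contact comes from the table, while the 65 percent cost-of-goods rate is an assumption inferred from the table (every row's cost of goods equals 65 percent of its revenue). Integer arithmetic is used to avoid floating-point rounding.

```python
# Revenue per decile in dollars, from the profit table (10,000 customers each).
revenue = [37000, 26800, 22400, 18700, 14800, 14600, 12100, 12000, 10100, 5400]

CUSTOMERS_PER_DECILE = 10_000
COST_PER_CONTACT_CENTS = 45
# Assumption inferred from the table: cost of goods is 65% of revenue in every row.
COST_OF_GOODS_PCT = 65


def profit_by_decile(revenue):
    """Return (decile, marginal profit, cumulative profit) for each decile, in dollars."""
    contact_cost = CUSTOMERS_PER_DECILE * COST_PER_CONTACT_CENTS // 100
    rows = []
    cum_profit = 0
    for decile, rev in enumerate(revenue, start=1):
        cost_of_goods = rev * COST_OF_GOODS_PCT // 100
        marginal = rev - contact_cost - cost_of_goods
        cum_profit += marginal
        rows.append((decile, marginal, cum_profit))
    return rows


rows = profit_by_decile(revenue)
# The decile with the highest cumulative profit marks the most profitable mailing depth.
best_decile = max(rows, key=lambda r: r[2])[0]
print(best_decile, best_decile * CUSTOMERS_PER_DECILE)
```

Marginal profit turns negative from decile 7 onward, which is why mailing deeper than about 60,000 names reduces total profit.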

Page 42: Modeling for the Non-Statistician

In this example, profit peaks around a mail quantity of 60,000.

[Chart: cumulative revenue ($000s, $0 to $200) and cumulative profit ($000s, $0 to $25) by mail quantity (0 to 100,000). Series: Cum Revenue, Cum Profit.]

5. Model Evaluation

Page 43: Modeling for the Non-Statistician

The production algorithm translates the model into the production environment.

The model is worthless without proper implementation.

Goal: create identical production and model algorithms.

Involve the production people.

Involve the marketers.

6. Model Implementation

Page 44: Modeling for the Non-Statistician

Quality control procedures ensure the model is applied correctly every time.

Develop audit trail reports that highlight potential problems.

Look for model degradation over time.

Develop mini-profiles of each scoring decile and compare over time.

6. Model Implementation

Page 45: Modeling for the Non-Statistician

Testing should always be done to continually validate assumptions.

The success of a model used for direct marketing is determined by tracking the results of its use in-market.

– Each cell must be measured, as well as the overall result.

– For scoring models, this means that ‘cells’ must be created, usually deciles or percentiles.

Each group is marked and tracked. Performance can then be compared across groups and against expectations.

6. Model Implementation

Page 46: Modeling for the Non-Statistician

Focus not only on overall performance, but also on performance at the margin.

If you are losing money at the margin, too many unprofitable names are being contacted.

If you are making money at the margin, you may be leaving profits on the table.

Common sense and company policy will guide you to a target marginal ROI.

6. Model Implementation