Step-by-step guide to prepare customer data for modeling.
Modeling and Analysis for the Non-Statistician
Presented by:
Andrew Curtis, Vice President
Richard Pless, Consultant
Models are developed using a six-step process.
Step                                     % Effort
1. Research Design                       10%
2. Data Checking and Variable Creation   30%
3. Create Analysis Files                 30%
4. Calibrate Scoring Model               10%
5. Model Evaluation                      10%
6. Model Implementation                  10%
1. Research Design
Research design requires the input of both marketers and analysts.
Is the problem solvable through modeling?
Do we have representative promotions from which to develop a model?
Do we need to be concerned about selection bias?
Will we be able to pull all the information we need to score the model from our database in a timely manner?
Research Design--Unsolvable Problems
Prospecting models for a niche marketer.
– Some lists work really well.
– All others are unprofitable, even in the first decile.
Finding all prospective buyers.
– It is impossible to accurately predict all behavior.
– All models leave some revenue on the table.
Research Design--Unrepresentative Promotions.
Album promotion during a major tour.
Retail sale announcement during a major clearance.
Veterans magazine solicitation during the Gulf War.
Research Design--Selection Bias.
The model is built off a series of mailings for business-appropriate suits, dresses, and accessories.
The mailings were mailed to women only.
If the resulting model is put into production without the gender pre-screen, then males will end up getting contacted, probably quite unprofitably.
Research Design--Timely Scoring Data.
The model looks for the number of Web applicants from a given ZIP code in the prior week, but the data can only be pulled monthly.
At best, the model can be scored accurately only once a month.
The predictor that uses this information is ineffective.
Rule #1: Garbage In, Garbage Out! Bad Data In, Bad Models Out!
Analysis is only as good as the data being analyzed.
All input data must be checked for reasonableness, timeliness, and completeness.
When information is extracted from multiple sources, verify that all data are appended to the “master file” correctly.
You must engage in on-going quality control!
2. Data Checking
Study and scrutinize the data dictionary!
Understand every field in the database.
Eliminate fields that are too new, poorly filled, or unreliable.
Look at distributions of values for each field.
– Know what every field means.
– Understand every value in the field. If there is a “Z”, find out what “Z” means.
Work with finance to define the business rules for properly counting orders, revenue, and other business drivers.
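One way to look at the distribution of values in each field is sketched below. The records and field names are hypothetical and only illustrate the idea of reporting a fill rate and value counts so that unexplained codes such as “Z” stand out.

```python
from collections import Counter

# Hypothetical customer records; field names and values are illustrative only.
records = [
    {"status": "A", "state": "NY"},
    {"status": "A", "state": ""},
    {"status": "Z", "state": "CA"},
    {"status": "I", "state": "NY"},
]

def profile_field(rows, field):
    """Return the fill rate and the value counts for one field."""
    values = [row.get(field, "") for row in rows]
    filled = [v for v in values if v not in ("", None)]
    fill_rate = len(filled) / len(values) if values else 0.0
    return fill_rate, Counter(filled)

for field in ("status", "state"):
    fill_rate, counts = profile_field(records, field)
    print(field, f"fill rate {fill_rate:.0%}", counts.most_common())
```

A poorly filled field shows up as a low fill rate, and any value that is not in the data dictionary shows up in the counts.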
Clean the data when appropriate.
Models are driven by underlying data patterns.
– Bad patterns lead to bad models.
Correct data/variables with:
– Anomalies
– Missing values
– Outliers
– Errors.
Data Checking--Example of an anomaly.
Dollars per Contact Over Five Mailings
Recency          Avg.     February   Other Four
0 - 3 months     $3.50    $3.45      $3.52
4 - 6 months     $2.20    $1.36      $2.42
7 - 9 months     $2.31    $2.40      $2.29
10 - 12 months   $1.40    $1.50      $1.36
The anomaly is the 4 - 6 month group in February: $1.36 per contact against $2.42 in the other four mailings.
Data Checking--Missing Data Example.
Response Rates by Age
Age Range    Average Response %
18 - 35      1.3%
36 - 49      0.8%
50 and up    1.0%
Missing      1.4%
How to explain?
Data Checking--Outliers Example
The “Michael Jordan” example.
Individual credit card holders with $200,000 lines of credit.
The department store employee with 100 shopping trips a year.
Data Checking--Errors pose a tremendous risk for the modeler.
Commonly Occurring Errors:
Response data from a prior mailing incorrectly matched back to the customer file.
Changes in meaning or usage of a particular variable.
Alpha characters in supposedly numeric variable fields.
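A check for the last error in this list can be sketched in a few lines. The sample values are made up for illustration; the function simply reports every value that cannot be parsed as a number.

```python
def non_numeric_values(values):
    """Return the values that cannot be parsed as a number."""
    bad = []
    for value in values:
        try:
            float(value)
        except (TypeError, ValueError):
            bad.append(value)
    return bad

# Hypothetical contents of a supposedly numeric field.
order_amounts = ["40", "20.00", "N/A", "3O0", "100"]
print(non_numeric_values(order_amounts))
```

Note that "3O0" contains the letter O rather than a zero, which is the kind of error that is easy to miss by eye.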
Variable creation captures the dynamics of the business.
Use creativity to create predictor variables.
Predictor variables typically come in three classes:
– Recency: the time elapsed since an action.
– Frequency: the number of times an event has happened, e.g. orders placed or web pages clicked.
– Monetary: the amount of money spent purchasing goods and services.
Use ratios and cross variables to identify meaningful interactions between variables.
2b. Variable Creation
Predictor Variable Creation--Example
Order Date | Shipping to | Category | Product Description | QTY | Price
4/2/1999 | Andrew Curtis | Book | Markstrat3 : The Strategic | 1 | 40
5/2/1999 | Andrew Curtis | Book | The Service Profit Chain : How Leading Companies Link Profit and Growth to Loyalty, Satisfaction, and Value | 1 | 20
6/14/2000 | Stephen J. Curtis | Book | Tuesdays with Morrie: An Old Man, a Young Man and Life's Greatest Lesson | 1 | 20
6/1/2001 | Andrew Curtis | Book | Zapp! : The Lightning of Empowerment : How to Improve Quality, Productivity, and Employee Satisfaction | 1 | 20
7/1/2001 | Andrew Curtis | Electronics | Handspring Visor Platinum (Silver) - Special Offer | 1 | 300
8/17/2001 | Glenn Waldorf | Video | The Godfather DVD Collection | 1 | 100
Recency: 11/14/01 – 8/17/01 = 89 days, or about 3 months
Frequency: count of order dates = 6 orders
Monetary: sum of revenue = $500
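The three values above can be reproduced from the order table with a short sketch. The tuple layout is an assumption made for illustration; the dates and prices come from the example, and 11/14/01 is the as-of date used on the slide.

```python
from datetime import date

# (order date, ship-to name, category, price) from the example table.
orders = [
    (date(1999, 4, 2), "Andrew Curtis", "Book", 40),
    (date(1999, 5, 2), "Andrew Curtis", "Book", 20),
    (date(2000, 6, 14), "Stephen J. Curtis", "Book", 20),
    (date(2001, 6, 1), "Andrew Curtis", "Book", 20),
    (date(2001, 7, 1), "Andrew Curtis", "Electronics", 300),
    (date(2001, 8, 17), "Glenn Waldorf", "Video", 100),
]
as_of = date(2001, 11, 14)  # as-of date used in the example

recency_days = (as_of - max(o[0] for o in orders)).days  # days since last order
frequency = len({o[0] for o in orders})                  # count of distinct order dates
monetary = sum(o[3] for o in orders)                     # quantity is 1 on every line

print(recency_days, frequency, monetary)  # 89 6 500
```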
Predictor Variable Creation--Example (continued)
Using the same six orders:
Recency in Books: 11/14/01 – 6/1/01 = 166 days, or about 5.5 months
Total Books = 4, Total DVDs = 1, Total Electronics = 1
Average Order Size = $500 / 6 orders = $83.33
Percent Gift Purchases = 2 / 6 = 33%
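The category, ratio, and gift variables from the example can be sketched the same way. Two assumptions are made here for illustration: the buyer is taken to be Andrew Curtis, and any order shipped to a different name is counted as a gift.

```python
from datetime import date

# (order date, ship-to name, category, price) from the example table.
orders = [
    (date(1999, 4, 2), "Andrew Curtis", "Book", 40),
    (date(1999, 5, 2), "Andrew Curtis", "Book", 20),
    (date(2000, 6, 14), "Stephen J. Curtis", "Book", 20),
    (date(2001, 6, 1), "Andrew Curtis", "Book", 20),
    (date(2001, 7, 1), "Andrew Curtis", "Electronics", 300),
    (date(2001, 8, 17), "Glenn Waldorf", "Video", 100),
]
as_of = date(2001, 11, 14)
buyer = "Andrew Curtis"  # assumption: other ship-to names are gift recipients

# Recency within one category.
book_dates = [o[0] for o in orders if o[2] == "Book"]
book_recency_days = (as_of - max(book_dates)).days

# Frequency by category.
category_counts = {}
for o in orders:
    category_counts[o[2]] = category_counts.get(o[2], 0) + 1

# Ratio variables.
average_order = sum(o[3] for o in orders) / len(orders)
gift_share = sum(1 for o in orders if o[1] != buyer) / len(orders)

print(book_recency_days, category_counts)
print(f"average order ${average_order:.2f}, gifts {gift_share:.0%}")
```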
Selecting a Target Variable
Make sure your target variable will give you the type of results you want.
– Measuring response: you may get a lot of hand-raisers that are not profitable.
– Measuring profit: by focusing only on the dollars, you may miss a viable low-profit group.
Keep all information gathered during the target period out of the predictor variables.
Analysis files have three time frames:
1. Predictor Period: the time before individuals are selected for a marketing contact. All predictor variables must contain only data from this period.
2. Gap Period: the time between the selection date and when the first response is recorded.
3. Target Period: the time between the first and last response date. All target variables must contain only information from this period.
3. Create Analysis Files
Timeline: Predictor Period -> [Selection Date] -> Gap Period -> [First Response Date] -> Target Period -> [Last Response Date]
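A minimal sketch of how events might be assigned to the three time frames follows. The campaign dates are hypothetical, and the handling of events that fall exactly on a boundary date is an assumption, not something the slides specify.

```python
from datetime import date

# Hypothetical campaign dates, for illustration only.
SELECTION = date(2001, 9, 1)
FIRST_RESPONSE = date(2001, 9, 15)
LAST_RESPONSE = date(2001, 12, 15)

def period_of(event_date):
    """Classify an event date into one of the analysis-file time frames."""
    if event_date < SELECTION:
        return "predictor"      # may feed predictor variables
    if event_date < FIRST_RESPONSE:
        return "gap"            # used for neither predictors nor targets
    if event_date <= LAST_RESPONSE:
        return "target"         # may feed target variables only
    return "out of window"

print(period_of(date(2001, 8, 17)))  # predictor
```

Filtering every predictor calculation to events classified as "predictor" is one way to keep target-period information from leaking into the model.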
Good models are developed with modeling and validation samples.
Before modeling begins, split the analysis file into two random subsets: modeling and validation.
Develop the model using only the modeling subset.
Test the robustness and accuracy of the model using the validation subset.
Techniques exist for handling validation when the analysis sample is too small to split.
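The random split can be sketched as below. The 50/50 share, the seed, and the function name are assumptions chosen for illustration; fixing the seed only makes the split repeatable.

```python
import random

def split_sample(customer_ids, validation_share=0.5, seed=2001):
    """Randomly split IDs into a modeling subset and a validation subset."""
    rng = random.Random(seed)          # seeded so the split can be reproduced
    shuffled = list(customer_ids)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_share))
    return shuffled[:cut], shuffled[cut:]

modeling, validation = split_sample(range(1000))
print(len(modeling), len(validation))  # 500 500
```

The model is then fitted on `modeling` only, and `validation` is held back until the model is finished.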
The appropriate modeling technique is driven by several factors.
The nature of the target variable.
The software that is supported in the production environment.
The skills of the analytical team.
4. Model Calibration
No modeling technique should operate on autopilot.
The analyst developing the model must:
– Know how to use the modeling technique.
– Know how to interpret the results.
– Know a “cringe variable” when they see one.
– Know how the model will be used by the marketers.
Without a pilot, even the most sophisticated plane will crash.
Scoring models can be built using many different techniques.
Linear regression
Logistic regression
Discriminant analysis
Neural networks
Many, many more...
All can be used as predictors of future behavior.
Model Calibration Rule #1
If you want to get famous, talk about technique.
If you want a great model, concentrate on “the other 90 percent.”
Corollary to Rule #1
Regardless of your technique of choice,
if you short-change “the other 90 percent,”
you will probably end up with a lousy model.
Construction analogy
Throw several power tools onto a pile of lumber,
come back in a month, and -- presto –
you will NOT have a house.
Linear Regression is best suited for continuous outcomes, such as sales.
Output can be understood by non-statisticians.
Each name is assigned an estimated value.
Scored population is easily ranked with respect to the target variable (sales, profits, etc.).
Does not automatically identify interactions between predictor variables.
Linear Regression Example
Scoring Model for Predicting Monthly Revenue
Score = 0.08
  + 0.06 * House Value (Estimated in $Thousands)
  - 0.20 * Number of Children
  + 0.10 * Average Credit Card Limit (in $Thousands)
  - 0.30 * Number of Autos

                John       Jennifer   YOU
House Value?    $150,000   $125,000
No. of Kids?    2          0
Ave Limit?      $15,000    $8,000
No. of Cars?    2          1
Score           $9.58      $8.08
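The two scores in the example can be checked with a small function. The coefficients are the ones on the slide; the function and argument names are made up for illustration, and dollar inputs are expressed in thousands as the model requires.

```python
def revenue_score(house_value_k, children, card_limit_k, autos):
    """Predicted monthly revenue from the example linear model."""
    return (0.08
            + 0.06 * house_value_k   # house value in $ thousands
            - 0.20 * children
            + 0.10 * card_limit_k    # average credit card limit in $ thousands
            - 0.30 * autos)

john = revenue_score(150, 2, 15, 2)
jennifer = revenue_score(125, 0, 8, 1)
print(f"John ${john:.2f}, Jennifer ${jennifer:.2f}")  # John $9.58, Jennifer $8.08
```

Filling in your own four values gives the blank “YOU” column.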
Logistic regression is best suited for binary outcomes, such as buy/no buy.
Output can be understood by non-statisticians.
Each name is assigned a probability of performing the expected outcome that is NOT a prediction of future performance.
Scored population is easily ranked with respect to likelihood of displaying the targeted behavior.
Does not automatically identify interactions between predictor variables.
Logistic Regression Example
Scoring Model for Predicting Likelihood to Purchase (Yes/No)
Score = 0.01
  + 0.04 * Person Owns Home (1=yes, 0=no)
  - 0.05 * Number of Credit Cards
  + 0.01 * Income (Estimated in $Thousands)
  - 0.02 * Age
Probability Fix = 1 / [1 + exp(-Score)]

                John      Jennifer   YOU
Owns Home?      No        Yes
No. of Cards?   6         3
Income?         $40,000   $25,000
Age?            45        35
Score           -0.79     -0.55
Probability     31%       37%
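The score and the probability conversion in the example can be checked as follows. The coefficients come from the slide; the function and argument names are illustrative.

```python
import math

def purchase_score(owns_home, cards, income_k, age):
    """Logit score from the example model (owns_home is 1 or 0)."""
    return (0.01
            + 0.04 * owns_home
            - 0.05 * cards
            + 0.01 * income_k   # income in $ thousands
            - 0.02 * age)

def probability(score):
    """Convert a logit score to a probability: 1 / (1 + exp(-score))."""
    return 1.0 / (1.0 + math.exp(-score))

john = purchase_score(owns_home=0, cards=6, income_k=40, age=45)
jennifer = purchase_score(owns_home=1, cards=3, income_k=25, age=35)
print(f"John {john:.2f} ({probability(john):.0%})")              # John -0.79 (31%)
print(f"Jennifer {jennifer:.2f} ({probability(jennifer):.0%})")  # Jennifer -0.55 (37%)
```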
Neural networks can be used with either binary or continuous targets.
No restrictions on the type or structure of either the target variable or the historical variables.
Can more easily capture interactions between predictor variables.
Output is very difficult to explain.
Implementation can be difficult.
Models don’t always outperform traditional regression.
When done well, scoring models are smooth with few, if any, clumps.
Target behaviors of the scored names distribute on a “Gains Table” smoothly from highest to lowest.
This makes it easier to target a precise number of names, or to select down to a precise threshold of response or profit.
5. Model Evaluation
Understanding the Lift Table
Start by ranking all customers by their descending scores and observing the number of responders in each “decile.”
Decile  No. of Cust.  No. of Resp.
1       10,000        7,000
2       10,000        5,280
3       10,000        4,600
4       10,000        3,710
5       10,000        3,300
6       10,000        2,400
7       10,000        2,020
8       10,000        1,590
9       10,000        650
10      10,000        450
Total   100,000       31,000
Next, calculate response rates for each decile.
Decile  No. of Cust.  No. of Resp.  Resp. Rate
1       10,000        7,000         0.700
2       10,000        5,280         0.528
3       10,000        4,600         0.460
4       10,000        3,710         0.371
5       10,000        3,300         0.330
6       10,000        2,400         0.240
7       10,000        2,020         0.202
8       10,000        1,590         0.159
9       10,000        650           0.065
10      10,000        450           0.045
Total   100,000       31,000
Then, calculate the percent of all respondents that are in each decile.
Decile  No. of Cust.  No. of Resp.  Resp. Rate  Percent of Resp.
1       10,000        7,000         0.700       22.6%
2       10,000        5,280         0.528       17.0%
3       10,000        4,600         0.460       14.8%
4       10,000        3,710         0.371       12.0%
5       10,000        3,300         0.330       10.6%
6       10,000        2,400         0.240       7.7%
7       10,000        2,020         0.202       6.5%
8       10,000        1,590         0.159       5.1%
9       10,000        650           0.065       2.1%
10      10,000        450           0.045       1.5%
Total   100,000       31,000
Sum down the columns to calculate cumulative totals.
Decile  No. of Cust.  No. of Resp.  Resp. Rate  Percent of Resp.  Cum. Cust.  Cum. Resp.
1       10,000        7,000         0.700       22.6%             10,000      7,000
2       10,000        5,280         0.528       17.0%             20,000      12,280
3       10,000        4,600         0.460       14.8%             30,000      16,880
4       10,000        3,710         0.371       12.0%             40,000      20,590
5       10,000        3,300         0.330       10.6%             50,000      23,890
6       10,000        2,400         0.240       7.7%              60,000      26,290
7       10,000        2,020         0.202       6.5%              70,000      28,310
8       10,000        1,590         0.159       5.1%              80,000      29,900
9       10,000        650           0.065       2.1%              90,000      30,550
10      10,000        450           0.045       1.5%              100,000     31,000
Total   100,000       31,000
Calculate the cumulative response rate and the cumulative percent of respondents.
Decile  No. of Cust.  No. of Resp.  Resp. Rate  Percent of Resp.  Cum. Cust.  Cum. Resp.  Cum. Resp. Rate  Cum. Percent of Resp.
1       10,000        7,000         0.700       22.6%             10,000      7,000       0.700            22.6%
2       10,000        5,280         0.528       17.0%             20,000      12,280      0.614            39.6%
3       10,000        4,600         0.460       14.8%             30,000      16,880      0.563            54.5%
4       10,000        3,710         0.371       12.0%             40,000      20,590      0.515            66.4%
5       10,000        3,300         0.330       10.6%             50,000      23,890      0.478            77.1%
6       10,000        2,400         0.240       7.7%              60,000      26,290      0.438            84.8%
7       10,000        2,020         0.202       6.5%              70,000      28,310      0.404            91.3%
8       10,000        1,590         0.159       5.1%              80,000      29,900      0.374            96.5%
9       10,000        650           0.065       2.1%              90,000      30,550      0.339            98.5%
10      10,000        450           0.045       1.5%              100,000     31,000      0.310            100.0%
Total   100,000       31,000
Lift is the ratio of the cumulative response rate to the overall response rate (0.310), multiplied by 100.
Decile  No. of Cust.  No. of Resp.  Resp. Rate  Percent of Resp.  Cum. Cust.  Cum. Resp.  Cum. Resp. Rate  Cum. Percent of Resp.  Lift
1       10,000        7,000         0.700       22.6%             10,000      7,000       0.700            22.6%                  226
2       10,000        5,280         0.528       17.0%             20,000      12,280      0.614            39.6%                  198
3       10,000        4,600         0.460       14.8%             30,000      16,880      0.563            54.5%                  182
4       10,000        3,710         0.371       12.0%             40,000      20,590      0.515            66.4%                  166
5       10,000        3,300         0.330       10.6%             50,000      23,890      0.478            77.1%                  154
6       10,000        2,400         0.240       7.7%              60,000      26,290      0.438            84.8%                  141
7       10,000        2,020         0.202       6.5%              70,000      28,310      0.404            91.3%                  130
8       10,000        1,590         0.159       5.1%              80,000      29,900      0.374            96.5%                  121
9       10,000        650           0.065       2.1%              90,000      30,550      0.339            98.5%                  109
10      10,000        450           0.045       1.5%              100,000     31,000      0.310            100.0%                 100
Total   100,000       31,000
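The whole lift table can be rebuilt from the responder counts alone. The sketch below follows the steps in order: response rate, percent of respondents, cumulative totals, cumulative rates, and lift. The dictionary keys are illustrative names, not terms from the slides.

```python
# Responders per decile, from the example table (10,000 customers per decile).
responders = [7000, 5280, 4600, 3710, 3300, 2400, 2020, 1590, 650, 450]
customers_per_decile = 10_000

total_resp = sum(responders)
total_cust = customers_per_decile * len(responders)
overall_rate = total_resp / total_cust  # 0.310

rows = []
cum_resp = 0
for decile, resp in enumerate(responders, start=1):
    cum_resp += resp
    cum_cust = customers_per_decile * decile
    cum_rate = cum_resp / cum_cust
    rows.append({
        "decile": decile,
        "resp_rate": resp / customers_per_decile,
        "pct_of_resp": 100 * resp / total_resp,
        "cum_rate": cum_rate,
        "cum_pct_of_resp": 100 * cum_resp / total_resp,
        "lift": round(100 * cum_rate / overall_rate),
    })

for r in rows:
    print(r["decile"], f"{r['resp_rate']:.3f}", f"{r['cum_rate']:.3f}", r["lift"])
```

A lift of 226 in the first decile means that contacting only the top 10% of names yields 2.26 times the response rate of contacting everyone.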
Gains tables can show performance for both response and revenue.
Decile  Resp. Rate  Percent of Resp.  Cum. Resp. Rate  Cum. Percent of Resp.  Resp. Lift  Revenue Per Cust.  Cum Revenue Per Cust.  Cum Percent of Revenue  Rev. Lift
1       0.700       22.6%             0.700            22.6%                  226         $3.70              $3.70                  21.3%                   213
2       0.528       17.0%             0.614            39.6%                  198         $2.68              $3.19                  36.7%                   183
3       0.460       14.8%             0.563            54.5%                  182         $2.24              $2.87                  49.6%                   165
4       0.371       12.0%             0.515            66.4%                  166         $1.87              $2.62                  60.3%                   151
5       0.330       10.6%             0.478            77.1%                  154         $1.48              $2.39                  68.8%                   137
6       0.240       7.7%              0.438            84.8%                  141         $1.46              $2.24                  77.2%                   129
7       0.202       6.5%              0.404            91.3%                  130         $1.21              $2.09                  84.2%                   120
8       0.159       5.1%              0.374            96.5%                  121         $1.20              $1.98                  91.1%                   114
9       0.065       2.1%              0.339            98.5%                  109         $1.01              $1.87                  96.9%                   107
10      0.045       1.5%              0.310            100.0%                 100         $0.54              $1.74                  100.0%                  100
Graphical displays of the lift table are easy to follow.
[Chart: cumulative Response % and Revenue % (0% to 100%) by decile (0 to 10), compared with a Random line.]
With cost figures, the gains table can be expanded to show profit.
Decile  No. of Cust.  Revenue   Revenue per Customer  Cost per Contact  Contact Cost  Cost of Goods  Total Cost  Marginal Profit  Cum Profit
1       10,000        $37,000   $3.70                 $0.45             $4,500        $24,050        $28,550     $8,450           $8,450
2       10,000        $26,800   $2.68                 $0.45             $4,500        $17,420        $21,920     $4,880           $13,330
3       10,000        $22,400   $2.24                 $0.45             $4,500        $14,560        $19,060     $3,340           $16,670
4       10,000        $18,700   $1.87                 $0.45             $4,500        $12,155        $16,655     $2,045           $18,715
5       10,000        $14,800   $1.48                 $0.45             $4,500        $9,620         $14,120     $680             $19,395
6       10,000        $14,600   $1.46                 $0.45             $4,500        $9,490         $13,990     $610             $20,005
7       10,000        $12,100   $1.21                 $0.45             $4,500        $7,865         $12,365     ($265)           $19,740
8       10,000        $12,000   $1.20                 $0.45             $4,500        $7,800         $12,300     ($300)           $19,440
9       10,000        $10,100   $1.01                 $0.45             $4,500        $6,565         $11,065     ($965)           $18,475
10      10,000        $5,400    $0.54                 $0.45             $4,500        $3,510         $8,010      ($2,610)         $15,865
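The profit columns can be reproduced from revenue and the two cost figures. The 65% cost-of-goods rate is an assumption inferred from the table (for example, $24,050 / $37,000), not a figure stated on the slides; amounts are kept in whole dollars so the arithmetic is exact.

```python
# Revenue per decile from the example table (10,000 customers per decile).
revenue = [37000, 26800, 22400, 18700, 14800, 14600, 12100, 12000, 10100, 5400]
customers_per_decile = 10_000
contact_cost = customers_per_decile * 45 // 100  # $0.45 per contact = $4,500
cogs_pct = 65  # assumption: cost of goods as a percent of revenue, inferred from the table

profits = []  # (decile, marginal profit, cumulative profit)
cum_profit = 0
for decile, rev in enumerate(revenue, start=1):
    cost_of_goods = rev * cogs_pct // 100
    marginal = rev - contact_cost - cost_of_goods
    cum_profit += marginal
    profits.append((decile, marginal, cum_profit))

# Mail down to the decile where cumulative profit is highest.
best_decile = max(profits, key=lambda row: row[2])[0]
print(best_decile, best_decile * customers_per_decile)  # 6 60000
```

Cumulative profit is highest after decile 6 because decile 7 is the first whose marginal profit is negative.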
In this example, profit peaks around a mail quantity of 60,000.
[Chart: Cum Revenue ($000s, axis $0 to $200) and Cum Profit ($000s, axis $0 to $25) plotted against Mail Quantity (0 to 100,000).]
The production algorithm translates the model into the production environment.
The model is worthless without proper implementation.
Goal: create identical production and model algorithms.
Involve the production people.
Involve the marketers.
6. Model Implementation
Quality control procedures ensure the model is applied correctly every time.
Develop audit trail reports that highlight potential problems.
Look for model degradation over time.
Develop mini-profiles of each scoring decile and compare over time.
Testing should always be done to continually validate assumptions.
The key to determining the success of a direct marketing model is tracking the results of its use in-market.
– Each cell must be measured, as well as the overall result.
– For scoring models, this means that ‘cells’ must be created, usually deciles or percentiles.
Each group is marked and tracked. Performance can be compared across groups and against expectations.
Focus not only on overall performance, but also on performance at the margin.
If you are losing money at the margin, too many unprofitable names are being contacted.
If you are making money at the margin, you may be leaving profits on the table.
Common sense and company policy will guide you to a target marginal ROI.