Database Marketing - Dominick's stores in Chicago district

Marketing Database Analysis

Anna Andrusova
Nathan Bailey
James Ballard
Han Si
Demin Wang

MKT 6362. Database Marketing

Overview of Business Problem

• In the 1990s and early 2000s, Dominick’s was a chain of over 100 grocery stores in the Chicago metropolitan area

• For this evaluation, we are performing a corporate-level as well as a category-level data analysis

• Corporate Analysis – Relate store sales performance with known demographics to facilitate corporate planning activities and test potential locations

• Category Analysis – Relate category sales performance with known demographics to improve sales performance and expand product offerings

Data Description

Store-level historical sales data covering a period of more than seven years

Customer Count File

Daily sales of stores in 30 product categories: Bakery, Beer, Cosmetic, Dairy, Meat, Pharmacy, Grocery, Cheese, Wine, Health and Beauty, Deli, Fish, Floral, Jewelry, etc.

Store-Specific Demographics

Demographic profiles of stores: Age, Single / Retired / Unemployed, Mortgage, Poverty, Income, Education, Household size, Working woman, etc.

Data Preparation

Step 1. The latest year’s sales data from the Customer Count File was aggregated by store and summarized for the year

Step 2. Demographic variables were added from the Store Account File

Resulting data set:

• One record per store (94 stores) containing 12-month sales data and store demographic data

• Sales data on 30 product categories (the ‘Behavior’ variables)

• 43 demographic variables for residents living near the store
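The two preparation steps can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure actually used for the analysis: the row layout, the field names (`store_id`, `category`, `sales`), and the sample values are hypothetical.

```python
from collections import defaultdict

def build_store_records(daily_sales, demographics):
    """Aggregate daily category sales per store, then attach demographics.

    daily_sales: iterable of (store_id, category, sales) rows (hypothetical layout).
    demographics: dict mapping store_id -> dict of demographic variables.
    Returns one record per store that appears in both inputs.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for store_id, category, sales in daily_sales:
        totals[store_id][category] += sales  # Step 1: yearly total per category

    records = {}
    for store_id, by_category in totals.items():
        if store_id in demographics:  # Step 2: join on the store id
            records[store_id] = {**by_category, **demographics[store_id]}
    return records

# Made-up rows for illustration only
daily = [(101, "BEER", 120.0), (101, "BEER", 80.0),
         (101, "WINE", 50.0), (102, "BEER", 30.0)]
demo = {101: {"INCOME": 10.5, "EDUC": 0.2}}
records = build_store_records(daily, demo)
```

Stores without a demographic profile are dropped here, which is one possible way to end up with a single complete record per store.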

Approach

1. Segmentation: create groups of stores that perform similarly in a given group of product categories and differ from the other groups in those same categories
   Method: Non-hierarchical and hierarchical clustering

2. Response Analysis: find targetable characteristics of the identified groups of stores
   Method: Discriminant analysis

3. Model Validation: evaluate the performance of the models on a hold-out sample (20% of the stores)

4. Recommendations and conclusions
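As an illustration of the validation step, a random hold-out split might look like the sketch below. The store ids, the seed, and the rounding rule are assumptions made for the example. The slides do not say how the hold-out stores were chosen.

```python
import random

def split_holdout(store_ids, holdout_fraction=0.2, seed=0):
    """Randomly reserve a fraction of the stores for model validation."""
    ids = sorted(store_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_holdout = round(len(ids) * holdout_fraction)
    return ids[n_holdout:], ids[:n_holdout]  # (training, hold-out)

# 94 hypothetical store ids, numbered 1..94
train_ids, holdout_ids = split_holdout(range(1, 95))
```

The models are then fitted on `train_ids` only and scored on `holdout_ids`, so the hold-out error reflects stores the model has not seen.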

Flowchart of the Approach

[Figure: flowchart. Box labels: Dominick’s Data Set; General Data Set; Corporate Analysis; Category Analysis; Data Preparation; Clusters (Hierarchical Clustering and Non-Hierarchical Clustering); Response Analysis (Discriminant Analysis); Hold-Out Group (20%); Model Test; Conclusion and Recommendation; Corporate Analysis Results; Category Analysis Results]

Corporate Analysis
Step #1 – Hierarchical Clustering

Cluster History

Number of  Clusters Joined  Freq  New Cluster  Semipartial  R-Square  Centroid  Tie
Clusters                          RMS Std Dev  R-Square               Distance
…          …       …        …     …            …            …         …
11         CL21    311      3     255876       0.0013       .955      2.09E6
10         CL15    112      15    223435       0.0018       .953      2.09E6
9          CL18    CL11     9     293813       0.0044       .949      2.25E6
8          CL14    314      5     281264       0.0020       .947      2.37E6
7          CL10    CL17     43    329098       0.0346       .912      2.84E6
6          304     315      2     376122       0.0018       .910      2.86E6
5          CL8     CL9      14    451590       0.0209       .889      3.85E6
4          CL13    CL7      76    455327       0.1236       .766      3.88E6
3          CL12    CL5      16    567698       0.0270       .739      5.93E6
2          CL3     CL6      18    679121       0.0365       .702      6.84E6
1          CL4     CL2      94    918977       0.7022       .000      1.05E7

Conclusion: optimal number of clusters is between 3 and 6
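One common way to read a cluster history is to look at how much R-square is lost each time two clusters are merged. The sketch below applies that heuristic to the R-square column of the table. It is an illustration of the reasoning, not necessarily the rule the authors applied to arrive at the 3 to 6 range.

```python
# R-square by number of clusters, copied from the Cluster History table
r_square = {11: 0.955, 10: 0.953, 9: 0.949, 8: 0.947, 7: 0.912, 6: 0.910,
            5: 0.889, 4: 0.766, 3: 0.739, 2: 0.702, 1: 0.000}

def r_square_loss(r2_by_k):
    """Return {k: R-square lost by merging from k+1 clusters down to k}."""
    return {k: round(r2_by_k[k + 1] - r2_by_k[k], 3)
            for k in sorted(r2_by_k) if k + 1 in r2_by_k}

loss = r_square_loss(r_square)

# Ignore the final merge into a single cluster, which always loses everything
costliest = max((k for k in loss if k > 1), key=loss.get)

for k in sorted(loss, reverse=True):
    print(f"{k + 1} -> {k} clusters: R-square drops by {loss[k]:.3f}")
```

Apart from the final merge, the largest loss in this table comes from going from 5 to 4 clusters, which matches the Semipartial R-Square of 0.1236 on the 4-cluster row.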

Corporate Analysis (Cont.)
Step #2 – Non-Hierarchical Clustering

                                         3 clusters  4 clusters  5 clusters  6 clusters
Pseudo F Statistic                       256.19      245.65      246.97      260.81
Approximate Expected Over-All R-Squared  0.7364      0.77973     0.80166     0.8157
Cubic Clustering Criterion               5.517       6.505       7.813       16.200

Conclusion: based on the results of both hierarchical and non-hierarchical clustering, the 6-cluster solution is determined to be optimal
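For readers unfamiliar with the pseudo F statistic, the sketch below assumes it is the usual Calinski-Harabasz form, F = (R²/(k−1)) / ((1−R²)/(n−k)), with n = 94 stores and k clusters. Under that assumption the reported pseudo F values can be inverted to get the observed R-square each solution would imply. These implied values are derived for illustration and do not appear in the slides.

```python
def pseudo_f(r_square, n_obs, n_clusters):
    """Pseudo F: between-cluster variance per df over within-cluster variance per df."""
    between = r_square / (n_clusters - 1)
    within = (1.0 - r_square) / (n_obs - n_clusters)
    return between / within

def implied_r_square(f_stat, n_obs, n_clusters):
    """Invert pseudo_f to recover the R-square that yields a given pseudo F."""
    scaled = f_stat * (n_clusters - 1)
    return scaled / (scaled + (n_obs - n_clusters))

# Pseudo F values reported for the 3- to 6-cluster solutions
reported = {3: 256.19, 4: 245.65, 5: 246.97, 6: 260.81}
implied = {k: implied_r_square(f, 94, k) for k, f in reported.items()}
```

A higher pseudo F means tighter, better separated clusters for the number of clusters used, which is why the 6-cluster solution compares well here.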

Corporate Analysis – Clustering Results

Cluster Summary

Cluster  Freq  RMS Std    Max Distance from    Radius    Nearest  Distance Between
               Deviation  Seed to Observation  Exceeded  Cluster  Cluster Centroids
1        33    201245     2233427                        6        2948467
2        1     .          0                              3        4765417
3        6     336424     2455353                        4        4286141
4        9     293813     2207687                        3        4286141
5        16    274583     3058122                        6        3192018
6        29    213948     2063995                        1        2948467

% of stores vs. % of sales

Segment    % of stores  % of sales
Segment 1  35.1%        21.5%
Segment 2  1.1%         2.4%
Segment 3  6.4%         12.1%
Segment 4  9.6%         14.7%
Segment 5  17.0%        21.2%
Segment 6  30.9%        28.2%
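A simple way to read the chart is to divide each segment's share of sales by its share of stores. The sketch below does this with the chart values, read in segment order. The "index" is an illustrative measure introduced here, not one named in the slides. A value below 1 means the segment sells less than its store count would suggest.

```python
# Shares read from the chart, keyed by segment number
pct_stores = {1: 35.1, 2: 1.1, 3: 6.4, 4: 9.6, 5: 17.0, 6: 30.9}
pct_sales = {1: 21.5, 2: 2.4, 3: 12.1, 4: 14.7, 5: 21.2, 6: 28.2}

# Share of sales relative to share of stores
sales_index = {seg: pct_sales[seg] / pct_stores[seg] for seg in pct_stores}

# Segments whose sales share falls short of their store share
below_par = sorted(seg for seg, value in sales_index.items() if value < 1.0)

for seg in sorted(sales_index):
    print(f"Segment {seg}: index {sales_index[seg]:.2f}")
```

With these values only segments 1 and 6 fall below 1, which is consistent with the segments the deck later calls underperforming.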

Corporate Analysis - Discriminant Analysis

Confidence Level: 90%

Univariate Test Statistics

F Statistics, Num DF=5, Den DF=79

Variable  Total Standard  Pooled Standard  Between Standard  R-Square  R-Square/  F Value  Pr > F
          Deviation       Deviation        Deviation                   (1-RSq)
EDUC      0.1129          0.1102           0.0394            0.1029    0.1147     1.81     0.1200
NOCAR     0.1316          0.1287           0.0453            0.1000    0.1111     1.76     0.1318
INCSIGMA  2323            2264             824.9388          0.1064    0.1190     1.88     0.1070
HSIZE1    0.0829          0.0809           0.0292            0.1045    0.1167     1.84     0.1138
SINHOUSE  0.2173          0.2103           0.0817            0.1194    0.1355     2.14     0.0690
HVAL200   0.1853          0.1758           0.0792            0.1541    0.1822     2.88     0.0194
SINGLE    0.0703          0.0665           0.0306            0.1593    0.1895     2.99     0.0158
NWRKCH17  0.0199          0.0194           0.006933          0.1024    0.1141     1.80     0.1218
TELEPHN   0.0309          0.0293           0.0134            0.1581    0.1879     2.97     0.0166
SHPINDX   0.2482          0.2405           0.0924            0.1168    0.1323     2.09     0.0753

* 17 statistically significant variables in total
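The columns of the univariate table are linked: each F value is the R-Square/(1-RSq) ratio scaled by Den DF / Num DF. The sketch below recomputes three of the reported F values from their R-square entries as a consistency check. The function name is a hypothetical helper.

```python
def univariate_f(r_square, num_df, den_df):
    """F statistic for one variable from its R-square across the groups."""
    return (r_square / (1.0 - r_square)) * (den_df / num_df)

# (R-Square, reported F Value) pairs copied from the table
rows = {"EDUC": (0.1029, 1.81), "HVAL200": (0.1541, 2.88), "SINGLE": (0.1593, 2.99)}

for name, (r2, reported_f) in rows.items():
    recomputed = univariate_f(r2, num_df=5, den_df=79)
    print(f"{name}: recomputed F = {recomputed:.2f}, reported F = {reported_f}")
```

Small differences in the last digit can arise because the table shows R-square rounded to four decimals.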

Corporate Analysis - Discriminant Analysis (Cont.)

   Canonical    Adjusted Canonical  Approximate     Squared Canonical
   Correlation  Correlation         Standard Error  Correlation
1  0.847077     0.761387            0.030819        0.717540

Multivariate Statistics and F Approximations
S=5 M=15 N=21

Statistic               Value       F Value  Num DF  Den DF  Pr > F
Wilks' Lambda           0.02426163  1.39     180     223.58  0.0103
Pillai's Trace          2.50666011  1.34     180     240     0.0172
Hotelling-Lawley Trace  6.07753961  1.44     180     164.86  0.0093
Roy's Greatest Root     2.54031820  3.39     36      48      <.0001

• Means of the independent variables are statistically different among segments

• Only 2.4% of the variance in the discriminant scores is not explained by the differences among groups of the stores

• Ratio of between-group SS to the total SS => Good set of descriptors
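Two of the reported figures can be cross-checked against each other. Roy's greatest root, as reported here, equals r²/(1−r²) for the first canonical correlation r, and the 2.4% unexplained variance is Wilks' Lambda expressed as a percentage. The sketch below reproduces both from the values in the tables. The function name is a hypothetical helper.

```python
def greatest_root(canonical_corr):
    """Largest eigenvalue implied by the first canonical correlation."""
    r2 = canonical_corr ** 2
    return r2 / (1.0 - r2)

# First canonical correlation from the table above
root = greatest_root(0.847077)

# Wilks' Lambda read as the share of variance not explained by group differences
unexplained_pct = round(100 * 0.02426163, 1)

print(f"Greatest root from canonical correlation: {root:.4f}")
print(f"Unexplained variance from Wilks' Lambda: {unexplained_pct}%")
```

The recomputed root agrees with the reported 2.54031820 up to rounding of the canonical correlation.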

Corporate Analysis – Classification Results

Original Dataset

Error Count Estimates for CLUSTER
        1       2       3       4       5       6       Total
Rate    0.1818  0.0000  0.0000  0.1667  0.3571  0.1923  0.1497
Priors  0.1667  0.1667  0.1667  0.1667  0.1667  0.1667

Hold-out Sample

Error Count Estimates for CLUSTER
        1       3       4       5       6       Total
Rate    0.1429  0.0000  0.0000  0.3333  0.3333  0.1619
Priors  0.1667  0.1667  0.1667  0.1667  0.1667  0.8333

~ 85% of the stores are classified correctly
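The Total error rate in each table is the prior-weighted average of the per-cluster rates, with the priors renormalized when a cluster is absent (the five-cluster table's priors sum to 0.8333). The sketch below recomputes both totals and the resulting share of correctly classified stores. The function name is a hypothetical helper.

```python
def total_error(rates, priors):
    """Prior-weighted average error rate, with priors renormalized to sum to 1."""
    weight = sum(priors)
    return sum(rate * prior for rate, prior in zip(rates, priors)) / weight

# Six-cluster table (clusters 1-6) and five-cluster table (clusters 1, 3, 4, 5, 6)
six_cluster = total_error([0.1818, 0.0, 0.0, 0.1667, 0.3571, 0.1923], [0.1667] * 6)
five_cluster = total_error([0.1429, 0.0, 0.0, 0.3333, 0.3333], [0.1667] * 5)

print(f"Six-cluster table:  error {six_cluster:.4f}, correct {1 - six_cluster:.0%}")
print(f"Five-cluster table: error {five_cluster:.4f}, correct {1 - five_cluster:.0%}")
```

Because the priors are equal, every cluster counts the same in the total regardless of how many stores it contains, so a small cluster with a high error rate can move the total noticeably.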


Category Analysis: Beer and Wine
Step #1 – Hierarchical Clustering

Cluster History

Number of  Clusters Joined  Freq  New Cluster  Semipartial  R-Square  Centroid  Tie
Clusters                          RMS Std Dev  R-Square               Distance
9          CL16    309      8     72804.2      0.0031       .906      197203
8          CL23    CL13     10    93539.6      0.0091       .897      200748
7          CL10    CL31     11    95378.9      0.0085       .888      239550
6          CL7     CL8      21    145510       0.0459       .842      311263
5          CL87    CL11     61    112380       0.0639       .778      318702
4          CL5     CL6      82    170030       0.2099       .568      385452
3          CL4     CL15     85    185394       0.0973       .471      609748
2          CL3     CL9      93    226017       0.3212       .150      696877
1          CL2     304      94    243807       0.1499       .000      1.29E6

Conclusion: optimal number of clusters is between 4 and 6

Category Analysis: Beer and Wine (Cont.)
Step #2 – Non-Hierarchical Clustering

                                         4 clusters  5 clusters  6 clusters
Pseudo F Statistic                       87.53       116.85      131.08
Approximate Expected Over-All R-Squared  0.7692      0.81988     0.85358
Cubic Clustering Criterion               -1.336      1.458       2.489

Conclusion: based on the results of both hierarchical and non-hierarchical clustering, the 6-cluster solution is determined to be optimal

Category Analysis: Beer and Wine (Cont.)

Cluster Summary

Cluster  Frequency  RMS Std    Maximum Distance from  Radius    Nearest  Distance Between
                    Deviation  Seed to Observation    Exceeded  Cluster  Cluster Centroids
1        35         83267.8    194999                           2        268532
2        32         78629.9    206948                           1        268532
3        8          131663     250170                           2        374603
4        9          82174.1    159203                           2        333646
5        9          80329.2    180104                           4        377389
6        1          .          0                                3        924906

Cluster Means

Cluster  BEER        WINE
1        144128.421  101864.577
2        326776.212  298713.241
3        493651.738  634093.243
4        649465.774  213912.842
5        955669.947  434505.459
6        383045.800  1552362.060

• Cluster #5 is the top seller of Beer

• Cluster #6 is the top seller of Wine

• Cluster #1 has the lowest sales of both Beer & Wine

• One store in Cluster 6: outlier
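The observations about the cluster means can be reproduced by ranking the clusters on each category. The sketch below uses the BEER and WINE means from the Cluster Means table. Selecting the top two clusters per category is an illustrative choice, not a rule stated in the slides.

```python
# (BEER, WINE) mean sales per cluster, copied from the Cluster Means table
cluster_means = {
    1: (144128.421, 101864.577),
    2: (326776.212, 298713.241),
    3: (493651.738, 634093.243),
    4: (649465.774, 213912.842),
    5: (955669.947, 434505.459),
    6: (383045.800, 1552362.060),
}

def top_clusters(means, column, n=2):
    """Return the n cluster ids with the highest mean in the given column."""
    return sorted(means, key=lambda c: means[c][column], reverse=True)[:n]

top_beer = top_clusters(cluster_means, column=0)
top_wine = top_clusters(cluster_means, column=1)
lowest_beer = min(cluster_means, key=lambda c: cluster_means[c][0])
lowest_wine = min(cluster_means, key=lambda c: cluster_means[c][1])
```

Here the top Beer clusters are 5 and 4 and the top Wine clusters are 6 and 3, while cluster 1 is lowest on both.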

Discriminant Analysis: Beer and Wine

Confidence level: 95%

Univariate Test Statistics

F Statistics, Num DF=5, Den DF=79

Variable  Total Standard  Pooled Standard  Between Standard  R-Square  R-Square/  F Value  Pr > F
          Deviation       Deviation        Deviation                   (1-RSq)
AGE9      0.0272          0.0261           0.0109            0.1347    0.1557     2.46     0.0400
EDUC      0.1129          0.1051           0.0528            0.1843    0.2259     3.57     0.0058
INCOME    0.2921          0.2793           0.1192            0.1405    0.1635     2.58     0.0324
INCSIGMA  2323            2191             1021              0.1630    0.1948     3.08     0.0137
HSIZEAVG  0.2686          0.2480           0.1303            0.1985    0.2477     3.91     0.0032
HSIZE2    0.0322          0.0298           0.0154            0.1942    0.2410     3.81     0.0038
HSIZE567  0.0325          0.0277           0.0200            0.3176    0.4655     7.35     <.0001
HH3PLUS   0.0844          0.0796           0.0371            0.1628    0.1944     3.07     0.0138
HH4PLUS   0.0650          0.0606           0.0303            0.1833    0.2244     3.55     0.0061
DENSITY   0.001250        0.001192         0.000518          0.1447    0.1692     2.67     0.0277
HVAL150   0.2460          0.2260           0.1217            0.2064    0.2601     4.11     0.0023
HVAL200   0.1853          0.1664           0.0992            0.2417    0.3188     5.04     0.0005
HVALMEAN  47.3071         42.9341          24.4560           0.2254    0.2909     4.60     0.0010
SINGLE    0.0703          0.0664           0.0308            0.1616    0.1927     3.04     0.0145
UNEMP     0.0239          0.0226           0.0103            0.1576    0.1871     2.96     0.0169
WRKWNCH   0.0446          0.0424           0.0187            0.1483    0.1742     2.75     0.0241
TELEPHN   0.0309          0.0287           0.0148            0.1929    0.2389     3.78     0.0041
POVERTY   0.0457          0.0441           0.0175            0.1238    0.1413     2.23     0.0590

Statistically significant variables in discriminating observations among groups

Discriminant Analysis: Beer and Wine (Cont.)

   Canonical    Adjusted Canonical  Approximate     Squared Canonical
   Correlation  Correlation         Standard Error  Correlation
1  0.846814     0.751237            0.030868        0.717094

Multivariate Statistics and F Approximations
S=5 M=15 N=21

Statistic               Value       F Value  Num DF  Den DF  Pr > F
Wilks' Lambda           0.01346418  1.72     180     223.58  <.0001
Pillai's Trace          2.81504177  1.72     180     240     <.0001
Hotelling-Lawley Trace  7.26639429  1.72     180     164.86  0.0002
Roy's Greatest Root     2.53474655  3.38     36      48      <.0001

• Means of the independent variables are statistically different among segments

• Only 1.3% of the variance in the discriminant scores is not explained by the differences among groups of the stores

• Good set of descriptors

Beer & Wine Category Analysis – Classification Results

Original Dataset

Error Count Estimates for CLUSTER
        1       2       3       4       5       6       Total
Rate    0.5714  0.6207  0.7143  0.6000  0.8750  1.0000  0.7302
Priors  0.1667  0.1667  0.1667  0.1667  0.1667  0.1667

Hold-out Sample

Error Count Estimates for CLUSTER
        1       2       3       4       5       Total
Rate    0.1667  0.3333  0.5000  0.5000  0.5000  0.4000
Priors  0.1667  0.1667  0.1667  0.1667  0.1667  0.8333



Recommendations

Corporate Level:

• Resource allocation among the stores: perform additional analysis of the stores in underperforming segments (1 & 6)

• Evaluation of potential locations for a new store: deploy the discriminant function to predict the performance of stores in different product categories based on the demographic profiles of their locations

Category Level (Beer & Wine):

• Marketing strategy for a new brand of Beer or Wine: adjust the targeting strategy for a product based on the demographic profile of the location where it will be sold

• Choice of stores to test-market a new product: perform a market test for Beer in stores of segments 4 & 5 and for Wine in stores of segments 3 & 6

Limitations of the Analysis

Additional data

• Product-level data: assessment of specific product sales in new stores and prediction of the performance of a new product being considered for launch

• Customer-specific data: ability to build better predictive models tied to customer demographics (scanner data from loyalty program members’ transactions)

Higher-quality analysis at a more granular level