Session 9
Applied Regression -- Prof. Juran 2
Outline
Two Multivariate Methods
• Cluster Analysis
– Excel
– Minitab
• Discriminant Analysis
– Excel
– Minitab
Steam case
Cars
Cluster Analysis
• Concerned with grouping a large number of observations into reasonable sub-groups (clusters) on the basis of their similarities on multiple dimensions
• Similar to regression in terms of its basic method: finding a solution that minimizes a total sum of squared errors
• Not concerned with explaining variability or forecasting
• No dependent variable
Example: MBA Programs
School  Accept %  Enroll %  GMAT  GPA  Cost  Minority %  Non-U.S. %  Female %  Salary  Pop. Density  Student Body  Age  Rolling  Job
Harvard  12.0%  82.5%  705  3.58  $31,800  24.0  35.0  35.0  $127,338  14,618  1821  28  0  92%
Stanford  9.0%  78.0%  718  3.58  $31,002  24.0  24.0  38.0  $130,253  2,371  756  27  0  92%
Columbia  10.0%  72.0%  710  3.45  $32,154  20.0  28.0  35.0  $121,000  23,671  1225  27  1  94%
Penn  15.5%  70.2%  703  3.57  $31,218  19.0  42.0  31.0  $122,711  11,492  1542  28  0  88%
MIT  18.0%  69.9%  703  3.50  $31,200  19.0  35.0  26.0  $120,449  11,398  638  28  0  91%
Northwestern  16.0%  64.0%  700  3.45  $30,255  17.5  30.8  30.0  $120,500  12,185  1250  28  0  88%
Carn.-Mellon  28.2%  60.1%  660  3.25  $26,750  21.3  44.0  29.0  $113,448  6,598  440  28  0  94%
Chicago  28.4%  60.0%  695  3.33  $30,596  18.2  36.0  22.0  $117,893  12,185  1043  29  0  93%
Michigan  19.6%  58.5%  676  3.34  $30,686  20.5  29.0  27.6  $119,718  4,238  862  28  0  93%
Duke  20.1%  57.0%  688  3.43  $30,323  20.6  34.0  30.0  $117,899  2,503  682  27  0  93%
Yale  20.0%  56.7%  698  3.50  $29,720  24.0  34.0  33.0  $114,274  6,559  436  28  0  94%
Michigan State  22.3%  55.0%  641  3.36  $17,400  13.6  39.0  24.0  $90,733  3,738  221  28  0  95%
Texas  29.0%  52.4%  680  3.34  $24,252  12.3  28.0  22.7  $106,905  2,260  780  28  1  78%
NYU  15.2%  51.7%  700  3.40  $32,194  23.0  33.0  38.0  $112,900  23,671  585  27  1  86%
Cornell  25.8%  51.2%  669  3.32  $30,505  22.6  33.0  31.0  $117,825  1,802  551  29  0  92%
Virginia  16.6%  51.0%  681  3.40  $28,900  15.1  29.0  30.0  $115,272  3,938  516  28  1  93%
Dartmouth  18.0%  51.0%  690  3.40  $30,250  17.2  29.9  24.0  $124,235  1,800  436  28  0  92%
Berkeley  14.0%  50.3%  688  3.43  $21,208  19.4  33.0  29.0  $115,560  9,631  471  28  0  69%
Indiana  24.7%  49.8%  651  3.35  $20,694  14.3  31.0  22.0  $102,299  4,073  596  28  0  83%
UCLA  8.0%  48.0%  691  3.60  $22,490  21.6  28.0  30.0  $114,368  7,436  653  28  1  62%
Emory  31.5%  46.7%  651  3.30  $28,012  10.7  36.0  21.0  $108,959  2,996  387  28  1  88%
Georgetown  20.8%  45.0%  662  3.35  $28,440  10.8  35.0  25.0  $103,968  9,351  520  28  1  81%
North Carolina  23.8%  44.2%  674  3.30  $26,545  16.2  32.0  29.0  $111,104  2,619  554  27  0  92%
Rochester  32.0%  44.0%  637  3.33  $28,620  10.3  52.0  24.0  $103,633  6,541  453  29  1  72%
USC  29.0%  43.4%  676  3.30  $30,044  32.3  29.0  30.0  $100,537  7,436  589  28  0  81%
Ohio State  29.0%  41.4%  645  3.38  $21,792  15.5  35.0  41.0  $99,859  3,368  277  28  1  79%
Minnesota  34.4%  40.6%  645  3.33  $21,367  8.6  26.0  31.0  $93,080  6,606  236  28  1  94%
Cluster Analysis Questions
• Given a certain number of clusters, which schools are grouped together?
• How is the set of clusters affected if we change the number of clusters?
• For each cluster, which school is the most “typical”?
• How different are the clusters from each other?
• What is the best number of clusters?
Basic Method in Excel
• We will assume that all of these attributes deserve equal weighting in our analysis. We will
– name a school as the “typical” school in each cluster (called the centroid of the cluster),
– assign each of the non-centroid schools to the cluster whose centroid it is most similar to, and
– optimize the identities of the centroids and the cluster assignments so as to minimize the total squared Euclidean distance between each school and its cluster centroid.
• We define “most similar” to be the least sum of squared errors across all attributes between a cluster member and the centroid of the cluster.
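The assignment rule described above can be sketched in Python. This is only an illustration on made-up two-dimensional points, not the spreadsheet model itself; the function names and the toy data are assumptions introduced here.

```python
def squared_distance(a, b):
    """Sum of squared differences across all attributes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_to_centroids(points, centroid_indices):
    """Assign each point to its nearest centroid and total the squared distances.

    centroid_indices lists which observations act as centroids; each centroid
    is itself one of the observations, as in the method described above.
    """
    assignments = []
    total = 0.0
    for p in points:
        dists = [squared_distance(p, points[c]) for c in centroid_indices]
        nearest = min(range(len(dists)), key=lambda i: dists[i])
        assignments.append(centroid_indices[nearest])
        total += dists[nearest]
    return assignments, total


# Hypothetical toy data: two obvious groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
assignments, total = assign_to_centroids(points, [0, 2])
print(assignments, total)
```

With observations 0 and 2 as centroids, the first two points go to centroid 0 and the last two to centroid 2, for a total squared distance of 2.0. Choosing which observations serve as centroids is the optimization problem.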
Nonlinear Problems
Some nonlinear problems can be formulated in a linear fashion (e.g., some network problems).
Other nonlinear functions can be solved with our basic methods (i.e. smooth, continuous functions that are concave or convex, such as portfolio variances).
However, there are many types of nonlinear problems that pose significant difficulties.
Nonlinear Problems
The linear solution to a nonlinear (say, integer) problem may be infeasible.
The linear solution may be far away from the actual optimal solution.
Some functions have many local minima (or maxima), and Solver is not guaranteed to find the global minimum (or maximum).
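The local-minimum issue can be illustrated with a small Python sketch. It uses plain gradient descent as a simple stand-in for a local search method; it is not Solver's algorithm, and the function and step size are assumptions chosen only for illustration.

```python
def f(x):
    """A function with two minima: a global one near x = -1.30 and a local one near x = 1.13."""
    return x**4 - 3 * x**2 + x


def f_prime(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x, step=0.01, iterations=2000):
    """Repeatedly move downhill from the starting point x."""
    for _ in range(iterations):
        x -= step * f_prime(x)
    return x


left = gradient_descent(-2.0)   # ends near the global minimum
right = gradient_descent(2.0)   # ends near the local minimum only
print(left, f(left))
print(right, f(right))
```

Both runs stop at a point where the slope is zero, but the run started at 2.0 settles in the shallower valley. A local method has no way to know that a better solution exists elsewhere.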
3 Solvers
• Simplex LP Solver
• GRG Nonlinear Solver
• Evolutionary Solver
Solution Methodology
The standard simplex algorithm (Solver’s default method) won’t work on this problem. The GRG Nonlinear algorithm will make an honest effort, but is likely to give up without finding the optimal solution.
This can result from the use of MAX, IF, and SUMIF functions, which introduce discontinuities in our objective function and constraints as functions of the decision variables.
It can also be the result of using numerical decision variables that are in fact simply names (as in this example, where the names of the clusters happen to be numbers).
The Evolutionary Solver, a genetic algorithm, can do a good job with a problem like this, but is not guaranteed to find the optimal solution.
Solution Methodology
The Evolutionary Solver operates in a completely different way from the other types. Instead of searching in a structured way guaranteed to reach the optimal solution, genetic algorithms operate somewhat like biological evolutionary processes, with some degree of randomness in the steps taken from one solution to the next.
In a finite period of time, the Evolutionary Solver is not guaranteed to find the optimal solution, but it will find very good solutions and try to improve upon them.
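The flavor of this kind of search can be sketched in Python. The loop below is a much-simplified mutate-and-keep-if-no-worse search, not the Evolutionary Solver's actual algorithm (which maintains a population of solutions); the function names, toy data, and parameter values are assumptions for illustration.

```python
import random


def total_squared_distance(points, centroid_indices):
    """Objective: each point's squared distance to its nearest centroid, summed."""
    return sum(
        min(sum((x - y) ** 2 for x, y in zip(p, points[c])) for c in centroid_indices)
        for p in points
    )


def evolve_centroids(points, k, generations=500, seed=0):
    """Randomly change one centroid index at a time; keep the change if it is no worse."""
    rng = random.Random(seed)
    best = rng.sample(range(len(points)), k)
    best_cost = total_squared_distance(points, best)
    for _ in range(generations):
        candidate = list(best)
        candidate[rng.randrange(k)] = rng.randrange(len(points))
        cost = total_squared_distance(points, candidate)
        if cost <= best_cost:
            best, best_cost = candidate, cost
    return best, best_cost


# Hypothetical toy data: two tight groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best, best_cost = evolve_centroids(points, k=2)
print(best, best_cost)
```

As with the Evolutionary Solver, nothing here guarantees that the returned solution is optimal; the search simply never accepts a worse solution and stops when the iteration budget runs out.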
Standardization
In cluster analysis it is common to standardize the attribute data, so that those variables with large units (such as cost, salary and student body size) do not dominate the sum of squares over attributes with small units (such as % female, % admitted, and % with a job at graduation).
So we transform each attribute for each school into a z-value.
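The z-value transformation can be sketched in Python as follows. The sketch uses the sample standard deviation, which is what Excel's STDEV function computes; the function name and the three-value example are assumptions for illustration.

```python
from statistics import mean, stdev


def standardize(values):
    """Convert raw attribute values to z-values: (value - mean) / sample std. dev."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(v - m) / s for v in values]


# Hypothetical attribute with mean 2 and sample standard deviation 1.
z = standardize([1.0, 2.0, 3.0])
print(z)
```

After standardizing, every attribute has mean 0 and standard deviation 1, so no attribute dominates the squared distances merely because of its units.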
Standardized data (worksheet rows 1-28, columns A-L)
School  Accept %  Enroll %  GMAT  GPA  Cost  Minority pct  Non-U.S. pct  Female %  Salary (base+signing)  Pop. Den. In City (people per sq. mile)  Size: Student Body
Stanford  -1.63  2.02  1.67  1.82  0.78  1.09  -1.59  1.68  1.74  -0.88  0.18
Harvard  -1.22  2.42  1.11  1.82  0.97  1.09  0.28  1.11  1.45  1.18  2.94
Penn  -0.76  1.32  1.02  1.71  0.83  0.15  1.47  0.34  0.99  0.66  2.21
MIT  -0.42  1.30  1.02  1.00  0.82  0.15  0.28  -0.61  0.76  0.64  -0.12
Northwestern  -0.69  0.77  0.89  0.49  0.60  -0.13  -0.43  0.15  0.77  0.77  1.46
Duke  -0.14  0.15  0.38  0.28  0.62  0.45  0.11  0.15  0.51  -0.86  -0.01
Chicago  0.97  0.41  0.68  -0.74  0.68  0.00  0.45  -1.37  0.51  0.77  0.92
Columbia  -1.49  1.48  1.32  0.49  1.05  0.34  -0.91  1.11  0.82  2.71  1.39
Dartmouth  -0.42  -0.39  0.47  -0.03  0.60  -0.19  -0.59  -0.99  1.14  -0.98  -0.65
Berkeley  -0.96  -0.45  0.38  0.28  -1.54  0.22  -0.06  -0.04  0.27  0.34  -0.56
Michigan  -0.21  0.28  -0.14  -0.64  0.70  0.43  -0.74  -0.30  0.69  -0.57  0.46
Virginia  -0.61  -0.39  0.08  -0.03  0.28  -0.58  -0.74  0.15  0.24  -0.62  -0.44
NYU  -0.80  -0.33  0.89  -0.03  1.06  0.90  -0.06  1.68  0.01  2.71  -0.26
Yale  -0.15  0.12  0.81  1.00  0.47  1.09  0.11  0.73  0.14  -0.17  -0.65
UCLA  -1.76  -0.66  0.51  2.02  -1.24  0.64  -0.91  0.15  0.15  -0.03  -0.09
Cornell  0.62  -0.37  -0.44  -0.85  0.66  0.82  -0.06  0.34  0.50  -0.98  -0.35
North Carolina  0.36  -1.00  -0.22  -1.05  -0.28  -0.38  -0.23  -0.04  -0.17  -0.84  -0.34
Carnegie-Mellon  0.95  0.42  -0.82  -1.56  -0.23  0.58  1.80  -0.04  0.06  -0.17  -0.64
Texas  1.05  -0.27  0.04  -0.64  -0.82  -1.11  -0.91  -1.24  -0.59  -0.90  0.24
USC  1.05  -1.06  -0.14  -1.05  0.55  2.64  -0.74  0.15  -1.23  -0.03  -0.25
Indiana  0.48  -0.49  -1.21  -0.54  -1.66  -0.73  -0.40  -1.37  -1.06  -0.59  -0.23
Emory  1.39  -0.77  -1.21  -1.05  0.07  -1.41  0.45  -1.56  -0.39  -0.78  -0.77
Rochester  1.45  -1.01  -1.81  -0.74  0.21  -1.48  3.16  -0.99  -0.92  -0.18  -0.60
Georgetown  -0.05  -0.92  -0.74  -0.54  0.17  -1.39  0.28  -0.80  -0.89  0.30  -0.43
Michigan State  0.15  -0.04  -1.64  -0.44  -2.44  -0.86  0.96  -0.99  -2.21  -0.65  -1.20
Ohio State  1.05  -1.25  -1.47  -0.23  -1.40  -0.51  0.28  2.25  -1.30  -0.71  -1.06
Minnesota  1.78  -1.31  -1.47  -0.74  -1.50  -1.80  -1.25  0.34  -1.98  -0.17  -1.16
=((VLOOKUP($A9,'Raw Data'!$A$2:$O$28,3,0))-AVERAGE('Raw Data'!C$2:C$28))/STDEV('Raw Data'!C$2:C$28)
Standardized data (worksheet rows 1-28, columns A-P)
Index  School  Accept %  Enroll %  GMAT  GPA  Cost  Minority pct  Non-U.S. pct  Female %  Salary  Density  St Body  Age  Rolling  Job
1  Stanford  -1.63  2.02  1.67  1.82  0.78  1.09  -1.59  1.68  1.74  -0.88  0.18  -1.68  -0.75  0.58
2  Harvard  -1.22  2.42  1.11  1.82  0.97  1.09  0.28  1.11  1.45  1.18  2.94  0.13  -0.75  0.58
3  Penn  -0.76  1.32  1.02  1.71  0.83  0.15  1.47  0.34  0.99  0.66  2.21  0.13  -0.75  0.12
4  MIT  -0.42  1.30  1.02  1.00  0.82  0.15  0.28  -0.61  0.76  0.64  -0.12  0.13  -0.75  0.46
5  Northwestern  -0.69  0.77  0.89  0.49  0.60  -0.13  -0.43  0.15  0.77  0.77  1.46  0.13  -0.75  0.12
6  Duke  -0.14  0.15  0.38  0.28  0.62  0.45  0.11  0.15  0.51  -0.86  -0.01  -1.68  -0.75  0.69
7  Chicago  0.97  0.41  0.68  -0.74  0.68  0.00  0.45  -1.37  0.51  0.77  0.92  1.95  -0.75  0.69
8  Columbia  -1.49  1.48  1.32  0.49  1.05  0.34  -0.91  1.11  0.82  2.71  1.39  -1.68  1.28  0.81
9  Dartmouth  -0.42  -0.39  0.47  -0.03  0.60  -0.19  -0.59  -0.99  1.14  -0.98  -0.65  0.13  -0.75  0.58
10  Berkeley  -0.96  -0.45  0.38  0.28  -1.54  0.22  -0.06  -0.04  0.27  0.34  -0.56  0.13  -0.75  -2.08
11  Michigan  -0.21  0.28  -0.14  -0.64  0.70  0.43  -0.74  -0.30  0.69  -0.57  0.46  0.13  -0.75  0.69
12  Virginia  -0.61  -0.39  0.08  -0.03  0.28  -0.58  -0.74  0.15  0.24  -0.62  -0.44  0.13  1.28  0.69
13  NYU  -0.80  -0.33  0.89  -0.03  1.06  0.90  -0.06  1.68  0.01  2.71  -0.26  -1.68  1.28  -0.12
14  Yale  -0.15  0.12  0.81  1.00  0.47  1.09  0.11  0.73  0.14  -0.17  -0.65  0.13  -0.75  0.81
15  UCLA  -1.76  -0.66  0.51  2.02  -1.24  0.64  -0.91  0.15  0.15  -0.03  -0.09  0.13  1.28  -2.88
16  Cornell  0.62  -0.37  -0.44  -0.85  0.66  0.82  -0.06  0.34  0.50  -0.98  -0.35  1.95  -0.75  0.58
17  North Carolina  0.36  -1.00  -0.22  -1.05  -0.28  -0.38  -0.23  -0.04  -0.17  -0.84  -0.34  -1.68  -0.75  0.58
18  Carnegie-Mellon  0.95  0.42  -0.82  -1.56  -0.23  0.58  1.80  -0.04  0.06  -0.17  -0.64  0.13  -0.75  0.81
19  Texas  1.05  -0.27  0.04  -0.64  -0.82  -1.11  -0.91  -1.24  -0.59  -0.90  0.24  0.13  1.28  -1.04
20  USC  1.05  -1.06  -0.14  -1.05  0.55  2.64  -0.74  0.15  -1.23  -0.03  -0.25  0.13  -0.75  -0.69
21  Indiana  0.48  -0.49  -1.21  -0.54  -1.66  -0.73  -0.40  -1.37  -1.06  -0.59  -0.23  0.13  -0.75  -0.46
22  Emory  1.39  -0.77  -1.21  -1.05  0.07  -1.41  0.45  -1.56  -0.39  -0.78  -0.77  0.13  1.28  0.12
23  Rochester  1.45  -1.01  -1.81  -0.74  0.21  -1.48  3.16  -0.99  -0.92  -0.18  -0.60  1.95  1.28  -1.73
24  Georgetown  -0.05  -0.92  -0.74  -0.54  0.17  -1.39  0.28  -0.80  -0.89  0.30  -0.43  0.13  1.28  -0.69
25  Michigan State  0.15  -0.04  -1.64  -0.44  -2.44  -0.86  0.96  -0.99  -2.21  -0.65  -1.20  0.13  -0.75  0.92
26  Ohio State  1.05  -1.25  -1.47  -0.23  -1.40  -0.51  0.28  2.25  -1.30  -0.71  -1.06  0.13  1.28  -0.92
27  Minnesota  1.78  -1.31  -1.47  -0.74  -1.50  -1.80  -1.25  0.34  -1.98  -0.17  -1.16  0.13  1.28  0.81
Cluster centers and standardized values
Column offset:  2  3  4  5  6  7  8  9  10  11  12  13  14
Cluster  School  Index  Accept %  Enroll %  GMAT  GPA  Cost  Minority pct  Non-U.S. pct  Female %  Salary  Density  St Body  Age  Rolling
1  Stanford  1  -1.6267  2.0201  1.6671  1.8161  0.7775  1.0862  -1.5873  1.6804  1.7438  -0.8812  0.1812  -1.6849  -0.7526
2  Penn  3  -0.7560  1.3239  1.0235  1.7137  0.8286  0.1486  1.4654  0.3443  0.9884  0.6572  2.2142  0.1348  -0.7526
3  Michigan  11  -0.2069  0.2769  -0.1351  -0.6407  0.7026  0.4299  -0.7393  -0.3047  0.6887  -0.5663  0.4554  0.1348  -0.7526
4  Columbia  8  -1.4927  1.4849  1.3238  0.4853  1.0503  0.3361  -0.9089  1.1078  0.8171  2.7114  1.3943  -1.6849  1.2795
5  UCLA  15  -1.7606  -0.6559  0.5086  2.0208  -1.2384  0.6362  -0.9089  0.1534  0.1529  -0.0269  -0.0852  0.1348  1.2795
6  Carnegie-Mellon  18  0.9451  0.4242  -0.8216  -1.5620  -0.2295  0.5799  1.8046  -0.0375  0.0607  -0.1683  -0.6361  0.1348  -0.7526
7  USC  20  1.0522  -1.0633  -0.1351  -1.0502  0.5506  2.6425  -0.7393  0.1534  -1.2323  -0.0269  -0.2507  0.1348  -0.7526
8  Georgetown  24  -0.0461  -0.9235  -0.7358  -0.5384  0.1707  -1.3890  0.2783  -0.8010  -0.8887  0.2961  -0.4292  0.1348  1.2795
9  Michigan State  25  0.1548  -0.0359  -1.6369  -0.4360  -2.4439  -0.8639  0.9566  -0.9919  -2.2142  -0.6507  -1.2025  0.1348  -0.7526
10  Rochester  23  1.4541  -1.0127  -1.8085  -0.7431  0.2133  -1.4827  3.1614  -0.9919  -0.9222  -0.1779  -0.6025  1.9545  1.2795
SumSqDists 140.4160356
Squared distances to centers, and the cluster each school is assigned to
School  Index  To 1  To 2  To 3  To 4  To 5  To 6  To 7  To 8  To 9  To 10  Minimum  Cluster  School
Stanford  1  0.0  24.2  24.0  23.0  38.4  49.9  48.4  56.4  72.2  104.8  0.0  1  Stanford
Harvard  2  19.7  5.6  26.5  17.3  45.9  45.7  52.0  55.2  74.9  86.7  5.6  2  Penn
Penn  3  24.2  0.0  18.7  21.3  35.0  29.5  41.3  34.7  52.4  55.2  0.0  2  Penn
MIT  4  19.4  8.6  8.2  20.1  29.5  18.1  27.0  21.8  35.5  47.4  8.2  3  Michigan
Northwestern  5  18.8  6.2  6.6  14.6  26.4  22.4  25.8  21.5  40.3  51.7  6.2  2  Penn
Duke  6  17.0  17.4  5.7  25.7  31.9  14.1  19.5  20.4  30.1  50.3  5.7  3  Michigan
Chicago  7  49.6  19.6  10.2  38.9  46.5  14.6  24.0  20.8  34.6  33.7  10.2  3  Michigan
Columbia  8  23.0  21.3  27.8  0.0  41.0  46.7  48.1  39.6  70.9  86.3  0.0  4  Columbia
Dartmouth  9  26.3  23.6  3.7  36.0  29.8  16.3  21.7  16.1  28.9  43.1  3.7  3  Michigan
Berkeley  10  36.5  26.8  17.5  39.2  9.9  23.2  21.6  16.6  26.5  38.5  9.9  5  UCLA
Michigan  11  24.0  18.7  0.0  27.8  32.6  11.9  15.1  16.8  28.7  44.0  0.0  3  Michigan
Virginia  12  29.0  26.3  7.6  26.6  22.8  18.8  23.4  8.2  26.3  36.8  7.6  3  Michigan
NYU  13  33.4  29.3  26.7  9.8  32.3  33.8  29.0  26.4  56.8  62.4  9.8  4  Columbia
Yale  14  18.2  15.4  7.5  26.7  26.9  14.7  15.7  22.1  30.2  47.7  7.5  3  Michigan
UCLA  15  38.4  35.0  32.6  41.0  0.0  49.6  36.1  25.6  49.0  54.0  0.0  5  UCLA
Cornell  16  41.6  29.3  6.5  48.7  39.8  10.2  13.4  20.1  28.9  32.9  6.5  3  Michigan
North Carolina  17  36.8  34.0  8.8  39.1  37.2  12.0  17.3  13.6  19.0  41.5  8.8  3  Michigan
Carnegie-Mellon  18  49.9  29.5  11.9  46.7  49.6  0.0  18.4  18.3  17.1  26.0  0.0  6  Carnegie-Mellon
Texas  19  53.0  39.0  16.4  46.6  25.4  23.3  24.4  7.0  22.7  27.1  7.0  8  Georgetown
USC  20  48.4  41.3  15.1  48.1  36.1  18.4  0.0  24.6  34.8  45.5  0.0  7  USC
Indiana  21  57.0  40.4  15.3  56.3  31.0  15.9  21.2  10.3  7.3  27.9  7.3  9  Michigan State
Emory  22  67.9  46.3  18.9  57.2  43.3  15.4  28.7  5.4  18.1  15.6  5.4  8  Georgetown
Rochester  23  104.8  55.2  44.0  86.3  54.0  26.0  45.5  16.5  31.8  0.0  0.0  10  Rochester
Georgetown  24  56.4  34.7  16.8  39.6  25.6  18.3  24.6  0.0  19.2  16.5  0.0  8  Georgetown
=VLOOKUP(C42,$A$2:$B$28,2,0)
=VLOOKUP($B40,$B$2:$P$28,D$32,0)
=SUMXMY2($C5:$P5,$D$35:$Q$35)
=MIN(C51:L51)
=MATCH(M53,C53:L53,0)
=VLOOKUP(N55,$A$34:$B$43,2)
Optimization Procedure
We set up the model in a large spreadsheet, as shown here.
The upper section contains the standardized data, the middle section contains information about the 10 centroids, and the lower section evaluates the distances between each school and each of the centroids, and assigns schools to clusters on the basis of minimum distance.
Decision Variables
We begin by setting up cells C34:C43, where Solver can identify which schools are centroids. In this initial solution, all centroids have a value of 1 (the index for Stanford), and the corresponding standardized data for Stanford appear in D34:P43. These indices will be manipulated by Solver to find the best ten centroids.
Cluster centers and standardized values (worksheet rows 31-43, columns A-P)
Column offset:  2  3  4  5  6  7  8  9  10  11  12  13  14
Cluster  School  Index  Accept %  Enroll %  GMAT  GPA  Cost  Minority pct  Non-U.S. pct  Female %  Salary  Density  St Body  Age  Rolling
1  Stanford  1  -1.6267  2.0201  1.6671  1.8161  0.7775  1.0862  -1.5873  1.6804  1.7438  -0.8812  0.1812  -1.6849  -0.7526
2  Stanford  1  -1.6267  2.0201  1.6671  1.8161  0.7775  1.0862  -1.5873  1.6804  1.7438  -0.8812  0.1812  -1.6849  -0.7526
3  Stanford  1  -1.6267  2.0201  1.6671  1.8161  0.7775  1.0862  -1.5873  1.6804  1.7438  -0.8812  0.1812  -1.6849  -0.7526
4  Stanford  1  -1.6267  2.0201  1.6671  1.8161  0.7775  1.0862  -1.5873  1.6804  1.7438  -0.8812  0.1812  -1.6849  -0.7526
5  Stanford  1  -1.6267  2.0201  1.6671  1.8161  0.7775  1.0862  -1.5873  1.6804  1.7438  -0.8812  0.1812  -1.6849  -0.7526
6  Stanford  1  -1.6267  2.0201  1.6671  1.8161  0.7775  1.0862  -1.5873  1.6804  1.7438  -0.8812  0.1812  -1.6849  -0.7526
7  Stanford  1  -1.6267  2.0201  1.6671  1.8161  0.7775  1.0862  -1.5873  1.6804  1.7438  -0.8812  0.1812  -1.6849  -0.7526
8  Stanford  1  -1.6267  2.0201  1.6671  1.8161  0.7775  1.0862  -1.5873  1.6804  1.7438  -0.8812  0.1812  -1.6849  -0.7526
9  Stanford  1  -1.6267  2.0201  1.6671  1.8161  0.7775  1.0862  -1.5873  1.6804  1.7438  -0.8812  0.1812  -1.6849  -0.7526
10  Stanford  1  -1.6267  2.0201  1.6671  1.8161  0.7775  1.0862  -1.5873  1.6804  1.7438  -0.8812  0.1812  -1.6849  -0.7526
=VLOOKUP(C42,$A$2:$B$28,2,0)
=VLOOKUP($B40,$B$2:$P$28,D$32,0)
In the lower section of the worksheet, we calculate the total squared distance from each school to each centroid, and pick the minimum. Cell B45, the objective function in this problem, is the sum of M49:M75.
SumSqDists  1107.029838
Squared distances to centers, and the cluster each school is assigned to (worksheet rows 47-75, columns A-O)
School  Index  To 1  To 2  To 3  To 4  To 5  To 6  To 7  To 8  To 9  To 10  Minimum  Cluster  School
Stanford  1  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1  Stanford
Harvard  2  19.7  19.7  19.7  19.7  19.7  19.7  19.7  19.7  19.7  19.7  19.7  1  Stanford
Penn  3  24.2  24.2  24.2  24.2  24.2  24.2  24.2  24.2  24.2  24.2  24.2  1  Stanford
MIT  4  19.4  19.4  19.4  19.4  19.4  19.4  19.4  19.4  19.4  19.4  19.4  1  Stanford
Northwestern  5  18.8  18.8  18.8  18.8  18.8  18.8  18.8  18.8  18.8  18.8  18.8  1  Stanford
Duke  6  17.0  17.0  17.0  17.0  17.0  17.0  17.0  17.0  17.0  17.0  17.0  1  Stanford
Chicago  7  49.6  49.6  49.6  49.6  49.6  49.6  49.6  49.6  49.6  49.6  49.6  1  Stanford
Columbia  8  23.0  23.0  23.0  23.0  23.0  23.0  23.0  23.0  23.0  23.0  23.0  1  Stanford
Dartmouth  9  26.3  26.3  26.3  26.3  26.3  26.3  26.3  26.3  26.3  26.3  26.3  1  Stanford
Berkeley  10  36.5  36.5  36.5  36.5  36.5  36.5  36.5  36.5  36.5  36.5  36.5  1  Stanford
Michigan  11  24.0  24.0  24.0  24.0  24.0  24.0  24.0  24.0  24.0  24.0  24.0  1  Stanford
Virginia  12  29.0  29.0  29.0  29.0  29.0  29.0  29.0  29.0  29.0  29.0  29.0  1  Stanford
NYU  13  33.4  33.4  33.4  33.4  33.4  33.4  33.4  33.4  33.4  33.4  33.4  1  Stanford
Yale  14  18.2  18.2  18.2  18.2  18.2  18.2  18.2  18.2  18.2  18.2  18.2  1  Stanford
UCLA  15  38.4  38.4  38.4  38.4  38.4  38.4  38.4  38.4  38.4  38.4  38.4  1  Stanford
Cornell  16  41.6  41.6  41.6  41.6  41.6  41.6  41.6  41.6  41.6  41.6  41.6  1  Stanford
North Carolina  17  36.8  36.8  36.8  36.8  36.8  36.8  36.8  36.8  36.8  36.8  36.8  1  Stanford
Carnegie-Mellon  18  49.9  49.9  49.9  49.9  49.9  49.9  49.9  49.9  49.9  49.9  49.9  1  Stanford
Texas  19  53.0  53.0  53.0  53.0  53.0  53.0  53.0  53.0  53.0  53.0  53.0  1  Stanford
USC  20  48.4  48.4  48.4  48.4  48.4  48.4  48.4  48.4  48.4  48.4  48.4  1  Stanford
Indiana  21  57.0  57.0  57.0  57.0  57.0  57.0  57.0  57.0  57.0  57.0  57.0  1  Stanford
Emory  22  67.9  67.9  67.9  67.9  67.9  67.9  67.9  67.9  67.9  67.9  67.9  1  Stanford
Rochester  23  104.8  104.8  104.8  104.8  104.8  104.8  104.8  104.8  104.8  104.8  104.8  1  Stanford
Georgetown  24  56.4  56.4  56.4  56.4  56.4  56.4  56.4  56.4  56.4  56.4  56.4  1  Stanford
Michigan State  25  72.2  72.2  72.2  72.2  72.2  72.2  72.2  72.2  72.2  72.2  72.2  1  Stanford
Ohio State  26  63.5  63.5  63.5  63.5  63.5  63.5  63.5  63.5  63.5  63.5  63.5  1  Stanford
Minnesota  27  78.2  78.2  78.2  78.2  78.2  78.2  78.2  78.2  78.2  78.2  78.2  1  Stanford
=SUMXMY2($C5:$P5,$D$35:$Q$35)
Here’s a close-up view of the cells where the assignments are made:
Assignment cells (worksheet rows 48-55, columns M-O)
Minimum  Cluster  School
0.0  6  Stanford
5.6  10  Penn
0.0  10  Penn
8.3  7  Dartmouth
6.2  10  Penn
6.8  7  Dartmouth
10.6  1  Cornell
=MIN(C51:L51)
=MATCH(M53,C53:L53,0)
=VLOOKUP(N55,$A$34:$B$43,2)
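The worksheet formulas above can be mirrored in Python. This is a sketch, not the spreadsheet itself: `sumxmy2` is a hypothetical helper that imitates Excel's SUMXMY2, the two vectors are the rounded standardized values for Stanford and Harvard from the data table, and `harvard_row` and `centers` are copied from the ten-cluster solution shown earlier.

```python
def sumxmy2(xs, ys):
    """Sum of squared differences, like Excel's SUMXMY2."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys))


# Rounded z-values (14 attributes) taken from the standardized data table.
stanford = [-1.63, 2.02, 1.67, 1.82, 0.78, 1.09, -1.59, 1.68, 1.74, -0.88, 0.18, -1.68, -0.75, 0.58]
harvard = [-1.22, 2.42, 1.11, 1.82, 0.97, 1.09, 0.28, 1.11, 1.45, 1.18, 2.94, 0.13, -0.75, 0.58]

distance = sumxmy2(harvard, stanford)  # about 19.7, matching the worksheet

# Harvard's squared distances to the ten centroids in the ten-cluster solution.
harvard_row = [19.7, 5.6, 26.5, 17.3, 45.9, 45.7, 52.0, 55.2, 74.9, 86.7]
centers = ["Stanford", "Penn", "Michigan", "Columbia", "UCLA",
           "Carnegie-Mellon", "USC", "Georgetown", "Michigan State", "Rochester"]

minimum = min(harvard_row)                # like =MIN(...)
cluster = harvard_row.index(minimum) + 1  # like =MATCH(..., 0), which is 1-based
school = centers[cluster - 1]             # like =VLOOKUP(cluster, ..., 2)
print(round(distance, 1), minimum, cluster, school)
```

Harvard's smallest squared distance is 5.6, to cluster 2, whose centroid is Penn. That agrees with the assignment shown in the worksheet.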
Applied Regression -- Prof. Juran 23
Columbia MIT Texas
NYU Duke Emory
Chicago Georgetown
Indiana Dartmouth Ohio State
Michigan State Michigan Minnesota
Virginia
Carnegie-Mellon Yale USC
Cornell
Harvard North Carolina Stanford
Penn
Northwestern UCLA Rochester
Berkeley
Applied Regression -- Prof. Juran 24
Columbia MIT Texas
NYU Northwestern Emory
Duke Georgetown
Indiana Chicago Ohio State
Michigan State Dartmouth Minnesota
Michigan
Harvard Virginia USC
Penn Yale
Cornell Stanford
UCLA North Carolina
Berkeley Carnegie-Mellon Rochester
Applied Regression -- Prof. Juran 25
Columbia MIT Texas
NYU Duke Emory
Chicago Georgetown
Indiana Dartmouth Ohio State
Michigan State Michigan Minnesota
Virginia
Harvard Yale UCLA
Penn Cornell Berkeley
Northwestern North Carolina
Carnegie-Mellon Stanford
Rochester USC
Applied Regression -- Prof. Juran 26
Columbia MIT Texas
NYU Duke Emory
Chicago Georgetown
UCLA Dartmouth Rochester
Berkeley Michigan Ohio State
Virginia Minnesota
Harvard Yale
Penn Cornell Indiana
Northwestern North Carolina Michigan State
Carnegie-Mellon
USC Stanford
Applied Regression -- Prof. Juran 27
Columbia MIT Texas
NYU Duke Emory
Chicago Georgetown
UCLA Dartmouth Rochester
Berkeley Michigan Minnesota
Virginia Indiana
Harvard Yale Michigan State
Penn Cornell Ohio State
Northwestern North Carolina
Carnegie-Mellon Stanford
USC
Applied Regression -- Prof. Juran 28
Columbia MIT Texas
NYU Northwestern Emory
Duke Georgetown
UCLA Chicago Rochester
Berkeley Dartmouth Minnesota
Michigan Indiana
Harvard Virginia Michigan State
Penn Yale Ohio State
Stanford Cornell
North Carolina
Carnegie-Mellon
USC
Applied Regression -- Prof. Juran 29
Stanford Duke Texas
Harvard Chicago Emory
Penn Dartmouth Georgetown
MIT Michigan Rochester
Northwestern Virginia Minnesota
Columbia Yale Indiana
NYU Cornell Michigan State
North Carolina Ohio State
UCLA Carnegie-Mellon
Berkeley USC
Applied Regression -- Prof. Juran 30
Stanford Duke Texas
Harvard Chicago Emory
Penn Dartmouth Georgetown
MIT Michigan Rochester
Northwestern Virginia Minnesota
Columbia Yale Indiana
NYU Cornell Michigan State
Berkeley North Carolina Ohio State
Carnegie-Mellon UCLA
USC
Applied Regression -- Prof. Juran 31
Stanford Duke Dartmouth
Harvard Chicago Yale
Penn Columbia Cornell
MIT NYU Michigan
Northwestern Berkeley
Virginia Carnegie-Mellon Ohio State
UCLA USC Emory
North Carolina Texas Georgetown
Minnesota Indiana Rochester
Michigan State
Applied Regression -- Prof. Juran 32
Stanford Duke Dartmouth
Harvard Chicago Yale
Penn Columbia Cornell
MIT NYU Michigan
Northwestern Carnegie-Mellon Berkeley
Virginia USC Ohio State
UCLA Texas Emory
North Carolina Indiana Georgetown
Minnesota Michigan State Rochester
Given a certain number of clusters, which schools are grouped together?
• Columbia and NYU are always in the same cluster, as are Harvard-Penn and Indiana-Michigan State.
• Michigan-Cornell-Yale-Dartmouth-Chicago-Duke.
• Texas-Emory-Georgetown-Minnesota.
• What happens with UCLA-Berkeley?
How is the set of clusters affected if we change the number of clusters?
• Notice the behavior of Northwestern as we reduce the number of clusters.
• Stanford seems to be very different from all other schools; it is the last school to have its own cluster.
For each cluster, which school is the most “typical”?
• The centroid represents the school most typical in each cluster.
• We observe that Michigan is almost always the centroid of a large cluster.
How different are the clusters from each other?
• This is difficult to assess with this method; Minitab will provide more useful output.
What is the best number of clusters?
[Figure: TSS vs. Number of Clusters. x-axis: Clusters (1 to 10); y-axis: Total Squared Distances (0 to 500).]
Correlation issues?
Correlation matrix (lower triangle); columns in the same order as rows: Acc. %, Enr. %, GMAT, GPA, Cost, Min. %, Int. %, Fem. %, Salary, Pop. Den., Size, Age, Rolling, Job
Acc. %  1
Enr. %  -0.614  1
GMAT  -0.771  0.723  1
GPA  -0.810  0.637  0.703  1
Cost  -0.299  0.512  0.636  0.232  1
Min. %  -0.436  0.403  0.602  0.347  0.463  1
Int. %  0.336  -0.040  -0.383  -0.202  0.011  -0.218  1
Fem. %  -0.419  0.273  0.377  0.445  0.220  0.489  -0.250  1
Salary  -0.674  0.739  0.838  0.579  0.717  0.476  -0.167  0.255  1
Pop. Den.  -0.416  0.398  0.479  0.294  0.389  0.250  0.045  0.339  0.241  1
Size  -0.485  0.757  0.642  0.531  0.532  0.323  -0.026  0.188  0.632  0.506  1
Age  0.472  -0.231  -0.393  -0.269  -0.117  -0.214  0.436  -0.462  -0.197  -0.226  -0.069  1
Rolling  0.158  -0.423  -0.300  -0.117  -0.166  -0.501  -0.048  0.086  -0.379  0.183  -0.248  -0.037  1
Job  0.045  0.415  0.167  -0.122  0.384  0.082  -0.094  0.082  0.234  0.011  0.145  -0.210  -0.387  1
[Figure: Minorities vs. Enroll %. x-axis: Enroll % (standardized); y-axis: Minority % (standardized). Labeled points: Harvard, Minnesota, USC, Cornell, Columbia, Wharton, MIT, Stanford, NYU.]
[Figure: Women vs. Enroll %. x-axis: Enroll % (standardized); y-axis: Female % (standardized). Labeled points: Harvard, Minnesota, Ohio State, Emory, Cornell, Columbia, Wharton, MIT, Stanford, NYU.]
[Figure: International vs. Enroll %. x-axis: Enroll % (standardized); y-axis: Intl % (standardized). Labeled points: Harvard, Minnesota, Cornell, Columbia, Wharton, MIT, Stanford, Carnegie-Mellon, NYU, Rochester.]
[Figure: Population vs. Enroll %. x-axis: Enroll % (standardized); y-axis: Pop. Density (standardized). Labeled points: Harvard, Minnesota, Cornell, Dartmouth, NYU, Columbia, Wharton, MIT, Stanford.]
[Figure: GMAT vs. Enroll %. x-axis: Enroll % (standardized); y-axis: GMAT (standardized). Labeled points: Harvard, Minnesota, Cornell, Columbia, Wharton, MIT, Stanford, Rochester, NYU.]
Step  Number of clusters  Similarity level  Distance level  Clusters joined  New cluster  Number of obs. in new cluster
1  26  81.1711  1.92787  9  17  9  2
2  25  79.2999  2.11946  5  6  5  2
3  24  77.4028  2.31370  21  22  21  2
4  23  76.9886  2.35612  1  4  1  2
5  22  76.6438  2.39141  9  10  9  3
6  21  76.5324  2.40283  5  11  5  3
7  20  76.5139  2.40472  5  9  5  6
8  19  76.2752  2.42916  5  23  5  7
9  18  75.5949  2.49881  1  5  1  9
10  17  75.1837  2.54091  1  15  1  10
11  16  74.3490  2.62638  1  16  1  11
12  15  74.1586  2.64588  13  21  13  3
13  14  73.5778  2.70534  12  19  12  2
14  13  72.2647  2.83979  12  13  12  5
15  12  72.0863  2.85805  1  12  1  16
16  11  69.3901  3.13412  3  14  3  2
17  10  69.2620  3.14723  18  20  18  2
18  9  68.8567  3.18872  1  7  1  17
19  8  68.8347  3.19098  1  8  1  18
20  7  65.8747  3.49405  26  27  26  2
21  6  64.3943  3.64563  1  26  1  20
22  5  64.2088  3.66462  1  18  1  22
23  4  64.1918  3.66636  1  25  1  23
24  3  62.7114  3.81794  1  3  1  25
25  2  61.3983  3.95238  1  24  1  26
26  1  59.7832  4.11776  1  2  1  27
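The similarity and distance columns in this output appear to be linked by similarity = 100 × (1 − distance / maximum distance). The Python sketch below checks that relationship against two rows of the table. The function name is hypothetical, and the maximum distance of 10.2389 is an assumption inferred from the numbers (it is the largest pairwise distance that appears in the centroid-distance output).

```python
def similarity_level(distance, max_distance):
    """Similarity as a percentage: 100 when distance is 0, 0 at the maximum distance."""
    return 100.0 * (1.0 - distance / max_distance)


MAX_DISTANCE = 10.2389  # assumed: largest pairwise distance in the data

first_step = similarity_level(1.92787, MAX_DISTANCE)  # step 1 of the table
last_step = similarity_level(4.11776, MAX_DISTANCE)   # step 26 of the table
print(round(first_step, 2), round(last_step, 2))
```

Under that assumption the computed values agree with the tabulated similarity levels (about 81.17 at step 1 and about 59.78 at step 26), so similarity falls as the clusters being joined get farther apart.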
Cluster  Number of observations  Within cluster sum of squares  Average distance from centroid  Maximum distance from centroid
Cluster1  16  152.337  2.94527  5.25820
Cluster2  1  0.000  0.00000  0.00000
Cluster3  2  4.911  1.56706  1.56706
Cluster4  1  0.000  0.00000  0.00000
Cluster5  1  0.000  0.00000  0.00000
Cluster6  2  4.953  1.57362  1.57362
Cluster7  1  0.000  0.00000  0.00000
Cluster8  1  0.000  0.00000  0.00000
Cluster9  1  0.000  0.00000  0.00000
Cluster10  1  0.000  0.00000  0.00000
Variable  Cluster1  Cluster2  Cluster3  Cluster4  Cluster5  Cluster6
Accept %  -0.038602  -1.62669  -1.14449  0.94506  0.97185  -1.35880
Enroll %  0.107802  2.02010  0.57950  0.42423  0.41448  -0.55259
GMAT  0.015098  1.66709  1.10928  -0.82163  0.68019  0.44419
GPA  0.031042  1.81605  0.22938  -1.56203  -0.74310  1.15067
Cost  0.099078  0.77748  1.05504  -0.22952  0.68133  -1.39023
Minority pct  -0.163118  1.08618  0.61740  0.57990  -0.00139  0.42989
Non-U.S. pct  -0.011110  -1.58728  -0.48491  1.80461  0.44785  -0.48491
Female %  -0.308277  1.68044  1.39412  -0.03747  -1.37362  0.05797
Salary (base+signing)  0.117210  1.74379  0.41147  0.06074  0.50592  0.21257
Pop. Den. In City (people per s  -0.273673  -0.88123  2.71135  -0.16828  0.77406  0.15817
Size: Student Body  0.132103  0.18125  0.56664  -0.63609  0.92357  -0.32053
Mean Age  0.021061  -1.68491  -1.68491  0.13479  1.95450  0.13479
Rolling (1=yes)  -0.244604  -0.75263  1.27947  -0.75263  -0.75263  0.26342
Students w/ first job offer by  0.295438  0.57646  0.34588  0.80705  0.69176  -2.47880

Variable  Cluster7  Cluster8  Cluster9  Cluster10  Grand centroid
Accept %  1.45405  1.05221  1.05221  1.77552  -0.0000000
Enroll %  -1.01274  -1.06333  -1.24651  -1.31486  0.0000000
GMAT  -1.80853  -0.13508  -1.46526  -1.46526  -0.0000000
GPA  -0.74310  -1.05020  -0.23127  -0.74310  0.0000000
Cost  0.21335  0.55060  -1.40373  -1.50438  -0.0000000
Minority pct  -1.48274  2.64253  -0.50767  -1.80151  -0.0000000
Non-U.S. pct  3.16136  -0.73930  0.27826  -1.24809  0.0000000
Female %  -0.99186  0.15341  2.25307  0.34429  0.0000000
Salary (base+signing)  -0.92224  -1.23231  -1.30021  -1.97914  0.0000000
Pop. Den. In City (people per s  -0.17789  -0.02694  -0.71307  -0.16693  0.0000000
Size: Student Body  -0.60246  -0.25070  -1.05769  -1.16373  -0.0000000
Mean Age  1.95450  0.13479  0.13479  0.13479  -0.0000000
Rolling (1=yes)  1.27947  -0.75263  1.27947  1.27947  0.0000000
Students w/ first job offer by  -1.72939  -0.69176  -0.92234  0.80705  0.0000000
Distances Between Cluster Centroids
  Cluster1  Cluster2  Cluster3  Cluster4  Cluster5  Cluster6  Cluster7  Cluster8
Cluster1  0.00000  5.1787  4.66361  3.07784  3.14580  3.86172  5.6111  3.98074
Cluster2  5.17873  0.0000  5.07379  7.06157  7.04249  5.91279  10.2389  6.95451
Cluster3  4.66361  5.0738  0.00000  6.14703  6.00605  5.48510  8.4794  6.00463
Cluster4  3.07784  7.0616  6.14703  0.00000  3.81626  5.82255  5.1003  4.29133
Cluster5  3.14580  7.0425  6.00605  3.81626  0.00000  5.81509  5.8010  4.89544
Cluster6  3.86172  5.9128  5.48510  5.82255  5.81509  0.00000  6.6180  5.13248
Cluster7  5.61115  10.2389  8.47936  5.10025  5.80099  6.61797  0.0000  6.74368
Cluster8  3.98074  6.9545  6.00463  4.29133  4.89544  5.13248  6.7437  0.00000
Cluster9  4.66285  7.9670  6.77100  4.95710  6.69781  5.01759  5.2538  5.19326
Cluster10  4.89897  8.8418  7.36838  5.48899  6.39972  6.38305  5.9808  5.87407

  Cluster9  Cluster10
Cluster1  4.66285  4.89897
Cluster2  7.96699  8.84179
Cluster3  6.77100  7.36838
Cluster4  4.95710  5.48899
Cluster5  6.69781  6.39972
Cluster6  5.01759  6.38305
Cluster7  5.25382  5.98082
Cluster8  5.19326  5.87407
Cluster9  0.00000  3.49405
Cluster10  3.49405  0.00000
Married Sophisticates: You're in your late 20s or early 30s, recently married and likely have a household income between $50,000 and $100,000. You probably own a home, most likely in an upscale suburban neighborhood. You're a fan of "green and trendy cars," shop at Banana Republic and The Gap and are a loyal Netflix Inc. subscriber.
Truckin' & Stylin': You're in your 30s or 40s, live in a rural town and earn a moderate income. You may be married, but you don't have any children. You shop at stores like Wal-mart and AutoZone and enjoy watching NASCAR and classic shows on TV Land.
Collegiate Crowd: Between 18 and 23 years old, you're single and highly mobile. You're likely a renter and probably live in a college town. You buy clothes from American Eagle and Express Inc. and are a frequent liquor store patron. Your TV is tuned to Family Guy and you probably have copies of Rolling Stone and Us Weekly lying around.
Shooting Stars: You're in your 30s or 40s, married without any kids. You enjoy a six-figure household income and likely have a graduate degree. You shop at stores like Ann Taylor and Sephora, read magazines like Men's Health and Real Simple and use the web to check your stock investments and make travel plans.
Apple Pie Families: You're part of an upper-middle class family, likely living in a smaller city or nearby suburb. You probably drive a minivan. You shop at stores like Home Depot, Target and Best Buy, read Sports Illustrated and listen to NPR.
City Mixers: You're a childless, single "urbanite" living in a city like New York, Los Angeles or Chicago. Well-educated, you likely enjoy museums and the theater. You buy groceries from Trader Joe's and Whole Foods, outfit your home with Crate & Barrel and buy clothes from Banana Republic. You read The New York Times and watch The Office.
[Figure: Dendrogram. Similarity (59.78 to 100.00) by Observations; Single Linkage, Euclidean Distance]
[Tables: the MBA programs listed in dendrogram order (2 Stanford, 3 Columbia, 14 NYU, 15 Cornell, 1 Harvard, ...) under column headings 1 through 26]
Discriminant Analysis
• Concerned with predicting membership in two or more sub-groups (categories) on the basis of predictor variables
• Similar to regression in terms of its purpose: finding a function of independent variables that enables us to correctly forecast the value of a dependent variable
• Dependent variable is categorical
Subject  Single  Married  Divorced  Widowed  Credit A  Credit B  Credit C  Credit D  Credit E  Children?  Age   Income      Debt  Female  July Default
      1       1        0         0        0         1         0         0         0         0          0   29  $65,311  $185,246       1             0
      2       0        1         0        0         0         0         1         0         0          1   44  $25,803   $24,699       0             0
      3       0        1         0        0         0         1         0         0         0          1   28  $33,286   $59,406       0             0
      4       0        0         1        0         1         0         0         0         0          0   39  $53,188  $170,868       0             0
      5       0        1         0        0         0         1         0         0         0          1   49  $75,419  $101,881       0             0
      6       0        1         0        0         0         0         0         0         1          1   52  $77,962   $61,582       1             1
      7       0        1         0        0         0         0         1         0         0          1   35  $37,222   $28,267       0             0
      8       0        1         0        0         0         0         0         1         0          1   54  $52,914   $44,654       0             1
      9       0        1         0        0         0         1         0         0         0          1   34  $67,021   $92,176       0             0
     10       0        0         1        0         1         0         0         0         0          1   42  $74,753  $191,216       0             0
     11       0        1         0        0         0         0         1         0         0          1   40  $59,282   $52,319       0             0
     12       1        0         0        0         0         1         0         0         0          0   36  $46,501   $71,008       1             0
     13       0        1         0        0         1         0         0         0         0          1   33  $40,820  $159,388       0             0
     14       1        0         0        0         0         1         0         0         0          0   38  $36,557   $64,047       0             0
     15       0        1         0        0         0         0         1         0         0          1   27  $62,586   $56,442       1             0
     16       1        0         0        0         0         1         0         0         0          0   53  $69,656   $94,161       0             0
     17       0        1         0        0         0         0         1         0         0          1   32  $74,703   $66,860       1             0
     18       0        1         0        0         0         0         1         0         0          1   31  $59,561   $54,065       1             0
     19       0        1         0        0         0         0         1         0         0          1   42  $50,329   $41,829       0             0
     20       0        0         1        0         0         1         0         0         0          1   50  $67,447   $89,373       1             0
     21       0        1         0        0         1         0         0         0         0          1   39  $21,207  $136,043       0             0
     22       0        1         0        0         0         1         0         0         0          1   25  $18,380   $42,364       1             0
     23       0        1         0        0         1         0         0         0         0          1   40  $61,626  $173,986       1             0
     24       0        1         0        0         0         1         0         0         0          1   26  $45,353   $67,803       1             0
     25       1        0         0        0         0         1         0         0         0          1   35  $54,935   $84,884       1             0
     26       1        0         0        0         0         1         0         0         0          0   37  $30,084   $57,730       1             1
     27       0        1         0        0         0         0         0         1         0          1   45  $40,077   $33,029       1             0
     28       0        1         0        0         1         0         0         0         0          1   47  $29,328  $147,257       1             0
     29       0        1         0        0         0         1         0         0         0          1   44  $33,745   $58,952       1             0
     30       1        0         0        0         0         1         0         0         0          0   24  $18,004   $45,675       1             0
Rick Beck Consumer Credit
Excel Method
• Standardize data
• Create coefficients for each independent variable
• Create a “score” for each observation (the sumproduct of the independent variables and the coefficients)
• Create a “cut-off” value
Excel Method
• Use the cut-off value as a decision rule for categorization – a predicted value of the dependent variable
• Track how many observations are correctly predicted using the current coefficients and cut-off value
• Optimize the coefficients and cut-off value so as to maximize the number of correct predictions
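The Excel steps listed above can be sketched in Python. This is only an illustration under stated assumptions: it uses the first eight subjects of the credit data with three predictors (Age, Income, Debt), it assumes the sample standard deviation when standardizing, and the cut-off search is a hypothetical stand-in for the spreadsheet's optimizer, which the slides do not name. Only the cut-off is searched here; the coefficients stay at their starting value of 1.

```python
from statistics import mean, stdev

# First eight subjects of the credit data: (Age, Income, Debt) and July Default.
X = [
    (29, 65311, 185246), (44, 25803, 24699), (28, 33286, 59406),
    (39, 53188, 170868), (49, 75419, 101881), (52, 77962, 61582),
    (35, 37222, 28267), (54, 52914, 44654),
]
y = [0, 0, 0, 0, 0, 1, 0, 1]

def standardize(rows):
    """Convert each column to z-scores (sample standard deviation is an assumption)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def scores(Z, coefs):
    """Sumproduct of the coefficients and the standardized predictors, per row."""
    return [sum(c * z for c, z in zip(coefs, row)) for row in Z]

def predict(score_list, cutoff):
    """Decision rule: predict 1 when the score exceeds the cut-off, else 0."""
    return [1 if s > cutoff else 0 for s in score_list]

def number_correct(pred, actual):
    """Count observations whose prediction matches the actual category."""
    return sum(1 for p, a in zip(pred, actual) if p == a)

Z = standardize(X)
coefs = [1.0, 1.0, 1.0]      # starting coefficients, all 1 as in the spreadsheet
start_cutoff = 1.0           # starting cut-off, as in the spreadsheet
s = scores(Z, coefs)
start_correct = number_correct(predict(s, start_cutoff), y)

# Hypothetical optimizer: try every observed score (and the starting value) as cut-off.
candidates = sorted(set(s)) + [start_cutoff]
best_cutoff = max(candidates, key=lambda c: number_correct(predict(s, c), y))
best_correct = number_correct(predict(s, best_cutoff), y)
print(start_correct, best_correct)
```

A fuller version would also vary the coefficients, since the number of correct predictions is a step function of both the coefficients and the cut-off.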
Discriminant Scores (spreadsheet columns A through R)
Coefficients (row 2, columns B through O): 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Cut-off Value: 1
Number Correct: 582

Subject  Single  Married  Divorced  Widowed  Credit A  Credit B  Credit C  Credit D  Credit E  Children?    Age  Income   Debt  Female  July Default  Score  Prediction
      1    1.57    -1.13     -0.31    -0.25      1.65     -0.60     -0.64     -0.35     -0.25      -1.19  -1.06    0.85   1.92    1.01             0   1.20           1
      2   -0.64     0.88     -0.31    -0.25     -0.60     -0.60      1.56     -0.35     -0.25       0.84   0.37   -1.27  -1.12   -0.99             0  -2.76           0
      3   -0.64     0.88     -0.31    -0.25     -0.60      1.65     -0.64     -0.35     -0.25       0.84  -1.15   -0.87  -0.47   -0.99             0  -3.17           0
      4   -0.64    -1.13      3.20    -0.25      1.65     -0.60     -0.64     -0.35     -0.25      -1.19  -0.11    0.20   1.65   -0.99             0   0.53           0
      5   -0.64     0.88     -0.31    -0.25     -0.60      1.65     -0.64     -0.35     -0.25       0.84   0.85    1.40   0.34   -0.99             0   1.90           1
      6   -0.64     0.88     -0.31    -0.25     -0.60     -0.60     -0.64     -0.35      3.96       0.84   1.13    1.53  -0.43    1.01             1   5.51           1
      7   -0.64     0.88     -0.31    -0.25     -0.60     -0.60      1.56     -0.35     -0.25       0.84  -0.49   -0.66  -1.06   -0.99             0  -2.94           0
      8   -0.64     0.88     -0.31    -0.25     -0.60     -0.60     -0.64      2.81     -0.25       0.84   1.32    0.18  -0.75   -0.99             1   1.00           0
      9   -0.64     0.88     -0.31    -0.25     -0.60      1.65     -0.64     -0.35     -0.25       0.84  -0.58    0.94   0.15   -0.99             0  -0.16           0
     10   -0.64    -1.13      3.20    -0.25      1.65     -0.60     -0.64     -0.35     -0.25       0.84   0.18    1.36   2.03   -0.99             0   4.39           1
     11   -0.64     0.88     -0.31    -0.25     -0.60     -0.60      1.56     -0.35     -0.25       0.84  -0.01    0.53  -0.60   -0.99             0  -0.82           0
     12    1.57    -1.13     -0.31    -0.25     -0.60      1.65     -0.64     -0.35     -0.25      -1.19  -0.39   -0.16  -0.25    1.01             0  -1.31           0
     13   -0.64     0.88     -0.31    -0.25      1.65     -0.60     -0.64     -0.35     -0.25       0.84  -0.68   -0.47   1.43   -0.99             0  -0.39           0
     14    1.57    -1.13     -0.31    -0.25     -0.60      1.65     -0.64     -0.35     -0.25      -1.19  -0.20   -0.70  -0.38   -0.99             0  -3.79           0
     15   -0.64     0.88     -0.31    -0.25     -0.60     -0.60      1.56     -0.35     -0.25       0.84  -1.25    0.71  -0.52    1.01             0   0.20           0
     16    1.57    -1.13     -0.31    -0.25     -0.60      1.65     -0.64     -0.35     -0.25      -1.19   1.23    1.09   0.19   -0.99             0  -0.01           0
     17   -0.64     0.88     -0.31    -0.25     -0.60     -0.60      1.56     -0.35     -0.25       0.84  -0.77    1.36  -0.33    1.01             0   1.52           1
     18   -0.64     0.88     -0.31    -0.25     -0.60     -0.60      1.56     -0.35     -0.25       0.84  -0.87    0.54  -0.57    1.01             0   0.37           0
     19   -0.64     0.88     -0.31    -0.25     -0.60     -0.60      1.56     -0.35     -0.25       0.84   0.18    0.05  -0.80   -0.99             0  -1.31           0
     20   -0.64    -1.13      3.20    -0.25     -0.60      1.65     -0.64     -0.35     -0.25       0.84   0.94    0.97   0.10    1.01             0   4.83           1
     21   -0.64     0.88     -0.31    -0.25      1.65     -0.60     -0.64     -0.35     -0.25       0.84  -0.11   -1.52   0.99   -0.99             0  -1.32           0
     22   -0.64     0.88     -0.31    -0.25     -0.60      1.65     -0.64     -0.35     -0.25       0.84  -1.44   -1.67  -0.79    1.01             0  -2.58           0
     23   -0.64     0.88     -0.31    -0.25      1.65     -0.60     -0.64     -0.35     -0.25       0.84  -0.01    0.65   1.71    1.01             0   3.67           1
     24   -0.64     0.88     -0.31    -0.25     -0.60      1.65     -0.64     -0.35     -0.25       0.84  -1.34   -0.22  -0.31    1.01             0  -0.55           0
     25    1.57    -1.13     -0.31    -0.25     -0.60      1.65     -0.64     -0.35     -0.25       0.84  -0.49    0.29   0.02    1.01             0   1.34           1
     26    1.57    -1.13     -0.31    -0.25     -0.60      1.65     -0.64     -0.35     -0.25      -1.19  -0.30   -1.04  -0.50    1.01             1  -2.35           0
     27   -0.64     0.88     -0.31    -0.25     -0.60     -0.60     -0.64      2.81     -0.25       0.84   0.46   -0.51  -0.97    1.01             0   1.23           1
     28   -0.64     0.88     -0.31    -0.25      1.65     -0.60     -0.64     -0.35     -0.25       0.84   0.65   -1.08   1.20    1.01             0   2.09           1
     29   -0.64     0.88     -0.31    -0.25     -0.60      1.65     -0.64     -0.35     -0.25       0.84   0.37   -0.85  -0.47    1.01             0   0.37           0
Row  P: July Default  Q: Score  R: Prediction  S: Correct?
 10                0      1.20              1            0
 11                0     -2.76              0            1
 12                0     -3.17              0            1
 13                0      0.53              0            1
 14                0      1.90              1            0
 15                1      5.51              1            1
 16                0     -2.94              0            1
 17                1      1.00              0            0
 18                0     -0.16              0            1
 19                0      4.39              1            0
 20                0     -0.82              0            1
 21                0     -1.31              0            1
 22                0     -0.39              0            1
 23                0     -3.79              0            1

Score (column Q, shown for row 12):       =SUMPRODUCT($B$2:$O$2,B12:O12)
Prediction (column R, shown for row 14):  =IF(Q14>$B$4,1,0)
Correct? (column S, shown for row 16):    =IF(R16=P16,1,0)
Discriminant Scores (spreadsheet columns A through E shown)
Coefficients (row 2): 1 1 1 1
Cut-off Value: 1
Number Correct: 582    =SUM(S10:S1009)

Subject  Single  Married  Divorced  Widowed
      1    1.57    -1.13     -0.31    -0.25
      2   -0.64     0.88     -0.31    -0.25
      3   -0.64     0.88     -0.31    -0.25
      4   -0.64    -1.13      3.20    -0.25
Discriminant Scores

Coefficients (row 2, columns B through O):
Single       2E+29
Married      2E+29
Divorced     -6.7E+28
Widowed      8.375E+28
Credit A     -5.9E+29
Credit B     -3.1E+29
Credit C     -1.6E+29
Credit D     1.9E+29
Credit E     6.9E+29
Children?    -2.8E+29
Age          2.40956E+28
Income       -1.7E+29
Debt         3E+29
Female       2.8E+28

Cut-off Value: 7E+29
Number Correct: 877

The standardized predictor values (columns B through O) are the same as on the earlier Discriminant Scores sheet; only the scores change.

Subject  July Default      Score
      1             0  -1.51E+29
      2             0  -2.36E+29
      3             0  -5.00E+29
      4             0  -7.51E+29
      5             0  -6.07E+29
      6             1   2.83E+30
      7             0  -3.42E+29
      8             1   6.15E+29
      9             0  -6.17E+29
     10             0  -1.40E+30
     11             0  -4.02E+29
     12             0   3.80E+28
     13             0  -6.36E+29
     14             0   3.93E+28
     15             0  -3.83E+29
     16             0  -6.58E+28
     17             0  -4.26E+29
     18             0  -3.59E+29
     19             0  -3.72E+29
     20             0  -1.19E+30
     21             0  -5.69E+29
     22             0  -4.06E+29
     23             0  -6.74E+29
     24             0  -5.13E+29
     25             0  -5.34E+29
     26             1   1.18E+29
     27             0   7.05E+29
     28             0  -5.08E+29
     29             0  -4.13E+29
* ERROR * After subtracting group means, Credit E is highly correlated with other predictors.
* ERROR * Calculations for discriminant analysis cannot be done.
Linear Method for Response: Default
Predictors: Married, Divorced, Widowed, Credit A, Credit B, Credit C, Credit D, Children, Age, Income, Debt, Male
Group      0      1
Count    847    153

Summary of classification

                  True Group
Put into Group      0      1
0                 757     56
1                  90     97
Total N           847    153
N correct         757     97
Proportion      0.894  0.634
N = 1000 N Correct = 854 Proportion Correct = 0.854
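The proportions in this summary follow directly from the four classification counts. A short Python check, offered as a sketch (the dictionary layout and function name are illustrative, not part of the Minitab output):

```python
# Classification counts from the Minitab summary:
# outer key = group the observation was put into, inner key = true group.
counts = {
    0: {0: 757, 1: 56},   # put into group 0
    1: {0: 90, 1: 97},    # put into group 1
}

def classification_summary(counts):
    """Return totals per true group, proportion correct per group, and overall figures."""
    groups = sorted(counts)
    total_by_true = {g: sum(counts[p][g] for p in groups) for g in groups}
    correct_by_true = {g: counts[g][g] for g in groups}
    proportion = {g: correct_by_true[g] / total_by_true[g] for g in groups}
    n = sum(total_by_true.values())
    n_correct = sum(correct_by_true.values())
    return total_by_true, proportion, n, n_correct, n_correct / n

totals, proportion, n, n_correct, overall = classification_summary(counts)
print(totals)                                            # {0: 847, 1: 153}
print({g: round(p, 3) for g, p in proportion.items()})   # {0: 0.894, 1: 0.634}
print(n, n_correct, round(overall, 3))                   # 1000 854 0.854
```

The per-group proportions show that defaults (group 1) are classified much less accurately than non-defaults, which the overall 0.854 hides.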
Squared Distance Between Groups

          0        1
0   0.00000  3.43784
1   3.43784  0.00000

Linear Discriminant Function for Groups

                0        1
Constant   -33.44   -27.08
Married      1.68     0.52
Divorced     0.23    -0.45
Widowed     -1.07    -2.23
Credit A   233.62   228.17
Credit B    86.86    80.92
Credit C    38.30    32.32
Credit D    34.27    31.99
Children     1.88     1.02
Age          0.31     0.30
Income       0.00     0.00
Debt        -0.00    -0.00
Male         2.49     2.46
Summary of Misclassified Observations

             True   Pred          Squared
Observation  Group  Group  Group  Distance  Probability
    26**       1      0      0       8.366     0.627
                             1       9.402     0.373
    27**       0      1      0      12.842     0.172
                             1       9.699     0.828
    32**       0      1      0      18.46      0.032
                             1      11.67      0.968
    37**       0      1      0      10.55      0.468
                             1      10.29      0.532
    40**       0      1      0      15.651     0.027
                             1       8.499     0.973
    52**       0      1      0      14.99      0.172
                             1      11.85      0.828
    62**       1      0      0       9.263     0.872
                             1      13.104     0.128
    64**       0      1      0      16.556     0.026
                             1       9.290     0.974
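The listed probabilities appear consistent with the usual linear-discriminant rule in which each group's posterior probability is proportional to exp(-d²/2), where d² is the squared distance to that group. That reading is an assumption (the slide does not state the formula, and it presumes equal prior probabilities), but it reproduces the printed values for the observations checked below. A small Python sketch with an illustrative function name:

```python
import math

def posterior_probabilities(squared_distances):
    """Posterior probability of each group from its squared distance.

    Assumes equal priors and that probability is proportional to exp(-d^2 / 2).
    """
    weights = {g: math.exp(-0.5 * d2) for g, d2 in squared_distances.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

# Observation 26: squared distance 8.366 to group 0 and 9.402 to group 1.
obs26 = posterior_probabilities({0: 8.366, 1: 9.402})
print({g: round(p, 3) for g, p in obs26.items()})   # {0: 0.627, 1: 0.373}

# Observation 62: squared distance 9.263 to group 0 and 13.104 to group 1.
obs62 = posterior_probabilities({0: 9.263, 1: 13.104})
print({g: round(p, 3) for g, p in obs62.items()})   # {0: 0.872, 1: 0.128}
```

Observation 37, with nearly equal distances (10.55 and 10.29), is the closest call in the list, which is why its probabilities are near 0.5.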
Excel vs. Minitab
• In this case the Excel method found a better solution
• The Excel method illustrated here is really only good for distinguishing between two groups; the Minitab method is more general (multiple groups)
• For the special case in which there are only two groups, there is a better tool called logistic regression (or logit regression). This will be the topic for Session 10.
Steam Case
Detergent plant energy costs are increasing and the company wants to control them better.
The actual causes of energy use are not well understood beyond an intuitive grasp.
The problem is complex, as there are many hypothesized causes.
Relatively good historical records have been maintained.
Question: Can past data be used to build a descriptive and predictive model?
Why the Variation? How can steam usage be predicted or controlled?
Historical plant data are available for a 25-month period on:
STEAM Thousands of tons of high pressure steam used
INV Tons of inventory of fats
PROD Tons of detergent production
WIND Average wind velocity
CDAY Calendar days in the month
OPDAY Operating days in the month
FDAY Number of days below freezing
TEMP Average outside air temperature
STARTS Number of production start-ups
SEQ  MONTH  STEAM   INV  PROD  WIND  CDAY  OPDAY  FDAY  TEMP  STARTS
  1      1  10.98  5.20  0.61   7.4    31     20    22  35.3       4
  2      2  11.13  5.12  0.64   8.0    29     20    25  29.7       5
  3      3  12.51  6.19  0.78   7.4    31     23    17  30.8       4
  4      4   8.40  3.89  0.49   7.5    30     20    22  58.8       4
  5      5   9.27  6.28  0.84   5.5    31     21     0  61.4       5
  6      6   8.73  5.76  0.74   8.9    30     22     0  71.3       4
  7      7   6.36  3.45  0.42   4.1    31     11     0  74.4       2
  8      8   8.50  6.57  0.87   4.1    31     23     0  76.7       5
  9      9   7.82  5.69  0.75   4.1    30     21     0  70.7       4
 10     10   9.14  6.14  0.76   4.5    31     20     0  57.5       5
 11     11   8.24  4.84  0.65  10.3    30     20    11  46.4       4
 12     12  12.19  4.88  0.62   6.9    31     21    12  28.9       4
 13      1  11.88  6.03  0.79   6.6    31     21    25  28.1       5
 14      2   9.57  4.55  0.60   7.3    28     19    18  39.1       5
 15      3  10.94  5.71  0.70   8.1    31     23     5  46.8       4
 16      4   9.58  5.67  0.74   8.4    30     20     7  48.5       4
 17      5  10.09  6.72  0.85   6.1    31     22     0  59.3       6
 18      6   8.11  4.95  0.67   4.9    30     22     0  70.0       4
 19      7   6.83  4.62  0.45   4.6    31     11     0  70.0       3
 20      8   8.88  6.60  0.95   3.7    31     23     0  74.5       4
 21      9   7.68  5.01  0.64   4.7    30     20     0  72.1       4
 22     10   8.47  5.68  0.75   5.3    31     21     1  58.1       6
 23     11   8.86  5.28  0.70   6.2    30     20    14  44.6       4
 24     12  10.36  5.36  0.67   6.8    31     20    22  33.4       4
 25      1  11.08  5.87  0.70   7.5    31     22    28  28.6       5
[Figure: Run Chart. Steam by Month]
Building a Useful Regression Model
• Multiple regression will permit us to fit Steam versus any combination of the explanatory variables.
• It is not necessarily good to use them all. The scientific principle of modeling parsimony should be applied.
• Fitting the “best” simple model and fitting a “full model” can both be useful diagnostics.
• Bottom Line: We will find that several quite good and simple models can be developed in this case.
            SEQ    MONTH    STEAM      INV     PROD     WIND     CDAY    OPDAY     FDAY     TEMP   STARTS
SEQ           1
MONTH    0.3438        1
STEAM   -0.1415  -0.4122        1
INV      0.1140  -0.0601   0.3832        1
PROD     0.1088   0.0268   0.3055   0.9436        1
WIND    -0.2062  -0.3179   0.4743  -0.1261  -0.1437        1
CDAY     0.1102   0.1738   0.1367   0.3821   0.2482  -0.3168        1
OPDAY    0.0094  -0.1017   0.5361   0.6851   0.7645   0.2311   0.0201        1
FDAY    -0.0908  -0.4498   0.6406  -0.1911  -0.2264   0.5581  -0.2048   0.1169        1
TEMP     0.0257   0.3296  -0.8452  -0.0019   0.0677  -0.6163   0.0774  -0.2098  -0.8576        1
STARTS   0.1195  -0.1651   0.3821   0.6163   0.6013   0.0739  -0.0533   0.6006   0.1175  -0.2370        1
One approach to building a model is to start with one independent variable, and then add others sequentially on the basis of their correlation with the dependent variable. (We will examine variants of this approach next time.)
Here are the independent variables, ranked by their correlation with Steam:
Variable  Correlation with STEAM
TEMP      -0.8452
FDAY       0.6406
OPDAY      0.5361
WIND       0.4743
MONTH     -0.4122
INV        0.3832
STARTS     0.3821
PROD       0.3055
CDAY       0.1367
[Figure: Steam vs. Temp. Scatter plot of Steam against Temp]
Regression Statistics
Multiple R           0.8452
R Square             0.7144
Adjusted R Square    0.7020
Standard Error       0.8901
Observations         25

ANOVA
             df   SS        MS        F         Significance F
Regression    1   45.5924   45.5924   57.5428   0.0000
Residual     23   18.2234    0.7923
Total        24   63.8158

             Coefficients   Standard Error   t Stat    P-value
Intercept    13.6230        0.5815           23.4288   0.0000
TEMP         -0.0798        0.0105           -7.5857   0.0000
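For readers who want to reproduce this output outside Excel, here is a minimal Python sketch of the same one-variable least-squares fit. The STEAM and TEMP values are copied from the 25-month data table; the function name is illustrative.

```python
# STEAM and TEMP for the 25 months, copied from the data table.
steam = [10.98, 11.13, 12.51, 8.40, 9.27, 8.73, 6.36, 8.50, 7.82, 9.14,
         8.24, 12.19, 11.88, 9.57, 10.94, 9.58, 10.09, 8.11, 6.83, 8.88,
         7.68, 8.47, 8.86, 10.36, 11.08]
temp = [35.3, 29.7, 30.8, 58.8, 61.4, 71.3, 74.4, 76.7, 70.7, 57.5,
        46.4, 28.9, 28.1, 39.1, 46.8, 48.5, 59.3, 70.0, 70.0, 74.5,
        72.1, 58.1, 44.6, 33.4, 28.6]

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; also returns R-square."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_square = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_square

intercept, slope, r_square = fit_line(temp, steam)
predicted = [intercept + slope * t for t in temp]
residuals = [s - p for s, p in zip(steam, predicted)]
print(round(intercept, 3), round(slope, 4), round(r_square, 4))
```

The printed values should agree with the Intercept, TEMP coefficient, and R Square in the Excel output; the residuals sum to zero up to floating-point error, as they must for a least-squares fit with an intercept.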
SEQ  MONTH  STEAM  TEMP  Predicted STEAM  Residuals
  1      1  10.98  35.3            10.81       0.17
  2      2  11.13  29.7            11.25      -0.12
  3      3  12.51  30.8            11.16       1.35
  4      4   8.40  58.8             8.93      -0.53
  5      5   9.27  61.4             8.72       0.55
  6      6   8.73  71.3             7.93       0.80
  7      7   6.36  74.4             7.68      -1.32
  8      8   8.50  76.7             7.50       1.00
  9      9   7.82  70.7             7.98      -0.16
 10     10   9.14  57.5             9.03       0.11
 11     11   8.24  46.4             9.92      -1.68
 12     12  12.19  28.9            11.32       0.87
 13      1  11.88  28.1            11.38       0.50
 14      2   9.57  39.1            10.50      -0.93
 15      3  10.94  46.8             9.89       1.05
 16      4   9.58  48.5             9.75      -0.17
 17      5  10.09  59.3             8.89       1.20
 18      6   8.11  70.0             8.03       0.08
 19      7   6.83  70.0             8.03      -1.20
 20      8   8.88  74.5             7.68       1.20
 21      9   7.68  72.1             7.87      -0.19
 22     10   8.47  58.1             8.98      -0.51
 23     11   8.86  44.6            10.06      -1.20
 24     12  10.36  33.4            10.96      -0.60
 25      1  11.08  28.6            11.34      -0.26
[Figure: Intervals. Steam vs. Temp, showing Data, Y-hat, Confidence LCL/UCL, and Prediction LCL/UCL]
Let’s add “Days Below Freezing” to the model:

Regression Statistics
Multiple R           0.8610
R Square             0.7413
Adjusted R Square    0.7177
Standard Error       0.8663
Observations         25

ANOVA
             df   SS        MS        F         Significance F
Regression    2   47.3047   23.6524   31.5153   0.0000
Residual     22   16.5111    0.7505
Total        24   63.8158

             Coefficients   Standard Error   t Stat    P-value
Intercept    15.4426        1.3310           11.6025   0.0000
FDAY         -0.0505        0.0334           -1.5105   0.1451
TEMP         -0.1056        0.0199           -5.3039   0.0000
Not encouraging; why?
Let’s skip “Days Below Freezing” and try “Operating Days”:

Regression Statistics
Multiple R           0.9215
R Square             0.8491
Adjusted R Square    0.8354
Standard Error       0.6616
Observations         25

ANOVA
             df   SS        MS        F         Significance F
Regression    2   54.1871   27.0935   61.9043   0.0000
Residual     22    9.6287    0.4377
Total        24   63.8158

             Coefficients   Standard Error   t Stat    P-value
Intercept    9.1269         1.1028           8.2761    0.0000
TEMP         -0.0724        0.0080           -9.0498   0.0000
OPDAY        0.2028         0.0458           4.4314    0.0002
Better!
The most obvious 3-variable model:

Regression Statistics
Multiple R           0.9279
R Square             0.8609
Adjusted R Square    0.8411
Standard Error       0.6501
Observations         25

ANOVA
             df   SS        MS        F         Significance F
Regression    3   54.9417   18.3139   43.3388   0.0000
Residual     21    8.8741    0.4226
Total        24   63.8158

             Coefficients   Standard Error   t Stat    P-value
Intercept    10.1988        1.3482           7.5647    0.0000
TEMP         -0.0802        0.0098           -8.1868   0.0000
OPDAY        0.2108         0.0454           4.6470    0.0001
WIND         -0.1295        0.0969           -1.3363   0.1957
The “Wind Velocity” variable doesn’t seem to add much.
A better 3-variable model:

Regression Statistics
Multiple R           0.9379
R Square             0.8796
Adjusted R Square    0.8624
Standard Error       0.6049
Observations         25

ANOVA
             df   SS        MS        F         Significance F
Regression    3   56.1310   18.7103   51.1290   0.0000
Residual     21    7.6848    0.3659
Total        24   63.8158

             Coefficients   Standard Error   t Stat     P-value
Intercept    8.5663         1.0373           8.2581     0.0000
TEMP         -0.0758        0.0075           -10.1578   0.0000
INV          0.4880         0.2117           2.3048     0.0315
OPDAY        0.1082         0.0586           1.8456     0.0791
This one is better; the adjusted R-square is up to 0.8624.
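Adjusted R-square penalizes R-square for the number of predictors, which is why it is the fairer yardstick when comparing models of different sizes. The standard formula is 1 - (1 - R²)(n - 1)/(n - k - 1), with n observations and k predictors. A short Python check against the reported outputs (inputs are the rounded R Square values, so agreement is only to about three decimals):

```python
def adjusted_r_square(r_square, n, k):
    """Adjusted R-square for a model with k predictors fitted to n observations."""
    return 1 - (1 - r_square) * (n - 1) / (n - k - 1)

# (R Square, number of predictors) as reported in the regression outputs, n = 25.
models = {
    "TEMP": (0.7144, 1),
    "TEMP, OPDAY": (0.8491, 2),
    "TEMP, INV, OPDAY": (0.8796, 3),
}
for name, (r2, k) in models.items():
    print(name, round(adjusted_r_square(r2, 25, k), 3))
```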
There are some useful inferences to be made from the “full” model, even though it contains some insignificant variables and may not be the best model for forecasting purposes.
Regression Statistics
Multiple R           0.9539
R Square             0.9098
Adjusted R Square    0.8648
Standard Error       0.5997
Observations         25

ANOVA
             df   SS        MS       F         Significance F
Regression    8   58.0619   7.2577   20.1818   0.0000
Residual     16    5.7539   0.3596
Total        24   63.8158

             Coefficients   Standard Error   t Stat    P-value
Intercept    6.2909         6.8096           0.9238    0.3693
INV          0.9365         0.5760           1.6257    0.1235
PROD         -4.6388        3.9979           -1.1603   0.2629
WIND         -0.0868        0.1035           -0.8388   0.4139
CDAY         0.1053         0.2152           0.4893    0.6313
OPDAY        0.2206         0.0810           2.7222    0.0151
FDAY         -0.0181        0.0258           -0.7001   0.4939
TEMP         -0.0874        0.0163           -5.3696   0.0001
STARTS       -0.2548        0.2142           -1.1896   0.2516
Conclusions from the Full Model
Note that R² for the simple model with only TEMP was 71%, while for the full model it was 91%. The full-model R² becomes a benchmark for other models.
TEMP is still very significant in the full model, even given that all other variables are in the model.
OPDAY is the only other variable that is significant on a “last-in” basis.
INV is of borderline significance on a “last-in” basis.
These three variables’ significance is evaluated on what might be called a “worst-case” basis.
Examine the correlations among these variables:

          STEAM      INV    OPDAY     TEMP
STEAM         1
INV      0.3832        1
OPDAY    0.5361   0.6851        1
TEMP    -0.8452  -0.0019  -0.2098        1
Validating a Three-Parameter Model
The bottom line is whether the model makes good predictions:
[Figure: Predictions vs. Observations. Steam by Months (1 to 25), Actual Data and Predictions]
Residual Analysis
[Figure: Histogram of Residuals. Frequency by Residual Error]
[Figure: Normal Probability Plot. Standardized Residual vs. Normal Score]
[Figure: Residuals vs. Observations. Residual Error vs. Steam]
[Figure: Residuals vs. Predictions. Residual Error vs. Predicted Steam]
[Figure: Residuals vs. Time. Residual Error vs. Months]
[Figure: Residuals vs. Temperature. Residual Error vs. Temperature]
[Figure: Residuals vs. Inventory. Residual Error vs. Inventory of Fats (tons)]
[Figure: Residuals vs. Operating Days. Residual Error vs. Operating Days]
We might check the residuals against other independent variables not in the model. Why?
[Figure: Residuals vs. Starts. Residual Error vs. Starts]
Holdout Samples
One possible problem with our model is the potential for “overfitting”, where we have a great model in terms of fitting past data, but a poor model for predicting the future. How can we assess the predictive quality of our model?
One method is to “hold out” some data, and fit a model using only a subset of all of the data we have. Then, we can use the model to see how well it would have predicted the holdout sample.
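The holdout idea can be sketched in a few lines of Python with NumPy. This is an illustration, not the spreadsheet used in the slides: it fits STEAM on TEMP, INV, and OPDAY using the first 12 months of the data table and then predicts the remaining 13 months. The root-mean-square holdout error is one possible summary, chosen here as an assumption; the slides do not specify an error measure.

```python
import numpy as np

# Columns: STEAM, TEMP, INV, OPDAY for the 25 months in the data table.
data = np.array([
    [10.98, 35.3, 5.20, 20], [11.13, 29.7, 5.12, 20], [12.51, 30.8, 6.19, 23],
    [8.40, 58.8, 3.89, 20], [9.27, 61.4, 6.28, 21], [8.73, 71.3, 5.76, 22],
    [6.36, 74.4, 3.45, 11], [8.50, 76.7, 6.57, 23], [7.82, 70.7, 5.69, 21],
    [9.14, 57.5, 6.14, 20], [8.24, 46.4, 4.84, 20], [12.19, 28.9, 4.88, 21],
    [11.88, 28.1, 6.03, 21], [9.57, 39.1, 4.55, 19], [10.94, 46.8, 5.71, 23],
    [9.58, 48.5, 5.67, 20], [10.09, 59.3, 6.72, 22], [8.11, 70.0, 4.95, 22],
    [6.83, 70.0, 4.62, 11], [8.88, 74.5, 6.60, 23], [7.68, 72.1, 5.01, 20],
    [8.47, 58.1, 5.68, 21], [8.86, 44.6, 5.28, 20], [10.36, 33.4, 5.36, 20],
    [11.08, 28.6, 5.87, 22],
])

train, holdout = data[:12], data[12:]   # first year fits, the rest is held out

def design(rows):
    """Design matrix: intercept column followed by TEMP, INV, OPDAY."""
    return np.column_stack([np.ones(len(rows)), rows[:, 1:]])

# Least-squares coefficients (intercept, TEMP, INV, OPDAY) from the first year only.
coef, *_ = np.linalg.lstsq(design(train), train[:, 0], rcond=None)

holdout_pred = design(holdout) @ coef
holdout_err = holdout[:, 0] - holdout_pred
rmse = float(np.sqrt(np.mean(holdout_err ** 2)))
print(np.round(coef, 4), round(rmse, 3))
```

Comparing the holdout error with the in-sample standard error gives a rough sense of how much the fit degrades on data the model has not seen.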
Here, we use one year of data to fit a model:

Regression Statistics
Multiple R           0.9463
R Square             0.8954
Adjusted R Square    0.8562
Standard Error       0.7080
Observations         12

ANOVA
             df   SS        MS        F         Significance F
Regression    3   34.3438   11.4479   22.8370   0.0003
Residual      8    4.0103    0.5013
Total        11   38.3541

             Coefficients   Standard Error   t Stat    P-value
Intercept    9.8145         1.7183           5.7117    0.0004
TEMP         -0.0861        0.0127           -6.7716   0.0001
INV          0.4893         0.3620           1.3516    0.2135
OPDAY        0.0804         0.1161           0.6927    0.5081
Applied Regression -- Prof. Juran 105
Using 1st Year to Predict 2nd Year
[Line chart: STEAM and Y-hat (y-axis: Steam, 0 to 14) by Month (x-axis, 1 to 25)]
Applied Regression -- Prof. Juran 106
Is the 3-variable model significantly better than the 2-variable model?
Applied Regression -- Prof. Juran 107
                          2 Variables   3 Variables
R-square                       0.8491        0.8796
Adjusted R-square              0.8354        0.8624
Standard Error                 0.6616        0.6049
Coefficient for Temp          -0.0724       -0.0758
Coefficient for Inv                          0.4880
Coefficient for Opdays         0.2028        0.1082
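One way to approach the "significantly better?" question is a partial F-test for the one added variable. The sketch below works only from the two standard errors in the table. It assumes n = 25 monthly observations (inferred from the charts, not stated on this slide) and uses a 5% critical value of about 4.32 for F(1, 21) taken from a standard F table. Because the inputs are rounded, the result is approximate.

```python
# Assumption: n = 25 observations (inferred, not stated on the slide).
n = 25
se_reduced, k_reduced = 0.6616, 2   # 2-variable model
se_full, k_full = 0.6049, 3         # 3-variable model

# Residual sum of squares = (standard error)^2 * residual degrees of freedom.
df_reduced = n - k_reduced - 1
df_full = n - k_full - 1
sse_reduced = se_reduced ** 2 * df_reduced
sse_full = se_full ** 2 * df_full

# Partial F statistic for the variables added (here, just one).
added = k_full - k_reduced
f_stat = ((sse_reduced - sse_full) / added) / (sse_full / df_full)

F_CRITICAL_5PCT = 4.32  # F(1, 21) at the 5% level, from a standard table
significant = f_stat > F_CRITICAL_5PCT

print(f"partial F = {f_stat:.2f}, significant at 5%: {significant}")
```

Under these assumptions the statistic comes out at about 5.3, so the third variable would pass at the 5% level, though not by a wide margin.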
Applied Regression -- Prof. Juran 108
Predictions and Observations
[Two line-chart panels, each titled "Predictions vs. Observations": Actual Data and Predictions (y-axis: Steam, 0 to 14) by Months (x-axis, 1 to 25)]
Applied Regression -- Prof. Juran 109
Histograms of Residuals
[Two histogram panels, each titled "Histogram of Residuals": Frequency (y-axis) vs. Residual Error (x-axis, -2.5 to 2.5)]
Applied Regression -- Prof. Juran 110
Normal Plots
[Two panels, each titled "Normal Probability Plot": Standardized Residual (y-axis, -3 to 3) vs. Normal Score (x-axis, -3 to 3)]
Applied Regression -- Prof. Juran 111
Residuals vs. Observations
[Two scatter-plot panels: Residual Error (y-axis, -2.0 to 2.0) vs. Steam (x-axis, 6 to 13)]
Applied Regression -- Prof. Juran 112
Residuals vs. Predictions
[Two scatter-plot panels: Residual Error (y-axis, -2.0 to 2.0) vs. Predicted Steam (x-axis, 0 to 14)]
Applied Regression -- Prof. Juran 113
Residuals vs. Temp
[Two scatter-plot panels, each titled "Residuals vs. Temperature": Residual Error (y-axis, -2.0 to 2.0) vs. Temperature (x-axis, 20 to 80)]
Applied Regression -- Prof. Juran 114
Residuals vs. Opdays
[Two scatter-plot panels, each titled "Residuals vs. Operating Days": Residual Error (y-axis, -2.0 to 2.0) vs. Operating Days (x-axis, 10 to 24)]
Applied Regression -- Prof. Juran 115
Residuals vs. Inventory
[Two scatter-plot panels: Residual Error (y-axis, -2.0 to 2.0) vs. Inventory of Fats in tons (x-axis, 3.0 to 7.0)]
Applied Regression -- Prof. Juran 116
Residuals vs. Time
[Two scatter-plot panels: Residual Error (y-axis, -2.0 to 2.0) vs. Months (x-axis, 0 to 25)]
Applied Regression -- Prof. Juran 117
Residuals vs. Starts
[Two scatter-plot panels: Residual Error (y-axis, -2.0 to 2.0) vs. Starts (x-axis, 0 to 7)]
Applied Regression -- Prof. Juran 118
Conclusions
• We observe very low values for operating days in months 7 and 19. On investigation, we find that the plant shuts down for a two-week vacation in July. Perhaps we need a dummy variable for the summer vacation shutdown?
• Some reduced model appears to fit the data well. Which one we adopt depends in part on how management wishes to use the model and on whether the predictor variables being considered (in this case INV, TEMP, OPDAYS, etc.) can themselves be predicted.
• A more complex model might be possible. Perhaps TEMP*CDAYS would be better than either variable alone. Perhaps WIND^3*TEMP (measuring a wind-chill factor)? However, it is easy to reach the point of diminishing returns and fall into paralysis by analysis.
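A dummy variable for the July shutdown could be built as in the sketch below. It assumes the 25 observations are consecutive months with month 1 being January, which is consistent with months 7 and 19 both being July. The names `is_july` and `shutdown` are hypothetical.

```python
# Assumption: 25 consecutive monthly observations, month 1 = January.
months = list(range(1, 26))


def is_july(month_index):
    """True when the running month index falls in July."""
    return (month_index - 1) % 12 + 1 == 7


# 1 in shutdown months, 0 otherwise; add this column as an extra predictor.
shutdown = [1 if is_july(m) else 0 for m in months]

flagged = [m for m, d in zip(months, shutdown) if d == 1]
print(flagged)  # months flagged as shutdown months
```

The coefficient on such a column would estimate the shift in steam use during shutdown months, holding the other predictors fixed.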
Applied Regression -- Prof. Juran 119
Cars Case
Need to remove one drive type (I removed FWD)
Need to remove one make (I removed Chevrolet)
Still need to remove Volkswagen because of multicollinearity
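The reason one level of each categorical variable has to go can be seen in a small sketch. With an intercept in the model, the full set of dummy columns for a category always sums to 1 in every row, so it duplicates the intercept column exactly. The five cars below are hypothetical, not rows from the case data.

```python
# Hypothetical drive types for five cars (not the case data).
drive_types = ["FWD", "RWD", "AWD", "FWD", "AWD"]
levels = ["FWD", "RWD", "AWD"]

# Full dummy coding: one 0/1 column per level.
full = [[1 if d == level else 0 for level in levels] for d in drive_types]

# Every row sums to 1, which is exactly the intercept column:
# perfect multicollinearity if all three columns are kept.
row_sums = [sum(row) for row in full]

# Dropping one level (the baseline) removes the redundancy.
baseline = "FWD"
kept = [level for level in levels if level != baseline]
reduced = [[1 if d == level else 0 for level in kept] for d in drive_types]

print(row_sums)
print(kept, reduced)
```

Coefficients on the remaining dummies are then read as differences from the baseline level.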
Applied Regression -- Prof. Juran 120
A Possible “Enter” Procedure
ABS(Correl) Regression Statistics          Model 1  Model 2  Model 3  Model 4  Model 5  Model 6 Model 7a Model 7b Model 7c Model 7d
            Multiple R                      0.9175   0.9408   0.9507   0.9595   0.9632   0.9736   0.9736   0.9738   0.9737   0.9736
            R Square                        0.8418   0.8852   0.9038   0.9206   0.9278   0.9479   0.9479   0.9484   0.9481   0.9479
            Adjusted R Square               0.8379   0.8793   0.8962   0.9120   0.9178   0.9390   0.9372   0.9377   0.9374   0.9372
            Standard Error                    6492     5602     5195     4782     4623     3983     4040     4024     4033     4041

            Intercept                        -7326     9993    10618    13428    11043    -9528    -8779    -8912   -10341    -9532
0.9175      HP                                 171      129      118      109       96      136      135      135      137      136
-0.8163     FWD                                      -11145    -9929   -11738    -7298    -8040    -8131    -7930    -8028    -8055
0.6064      AWD                                                 9886      417     6608     3864     3808     4023     3901     3853
0.5584      Audi                                                        10095    11199    10615    10505    10460    10507    10644
0.5507      Lexus                                                                 7316     3662     3627     3665     3513     3680
-0.5170     MPG City                                                                        635      628      618      668      634
0.5163      RWD
-0.2737     Powertrain Warranty (miles)                                                               -6
-0.2367     Chevrolet                                                                                       -1203
-0.2192     Toyota                                                                                                    -825
-0.1827     Nissan                                                                                                              191
-0.1546     Mazda
-0.1518     Ford
-0.1513     Volkswagen
-0.1508     Saturn
-0.1455     Honda
-0.0928     Trunk
-0.0888     Chrysler
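The order in which candidates are tried in the table above appears to follow the ABS(Correl) column: variables are ranked by the absolute value of their correlation. The sketch below reproduces that ranking from the listed values. Reading the column as each variable's correlation with the response is my interpretation, not something the slide states.

```python
# Correlations as listed in the ABS(Correl) column, entered here in
# arbitrary order so that the sort does the work.
correlations = {
    "Chrysler": -0.0888, "Trunk": -0.0928, "Honda": -0.1455,
    "Saturn": -0.1508, "Volkswagen": -0.1513, "Ford": -0.1518,
    "Mazda": -0.1546, "Nissan": -0.1827, "Toyota": -0.2192,
    "Chevrolet": -0.2367, "Powertrain Warranty (miles)": -0.2737,
    "RWD": 0.5163, "MPG City": -0.5170, "Lexus": 0.5507,
    "Audi": 0.5584, "AWD": 0.6064, "FWD": -0.8163, "HP": 0.9175,
}

# Strongest absolute correlation first.
entry_order = sorted(correlations, key=lambda v: abs(correlations[v]), reverse=True)

for rank, name in enumerate(entry_order, start=1):
    print(f"{rank:2d}. {name:28s} {correlations[name]:+.4f}")
```

Each candidate is then kept or dropped according to whether it improves the fit, which is why some variables appear only in the trial Models 7a to 7d.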
Applied Regression -- Prof. Juran 121
A Possible “Remove” Procedure
                     Step 1   Step 2   Step 3   Step 4   Step 5   Step 6   Step 7   Step 8   Step 9  Step 10  Step 11
Multiple R           0.9757   0.9757   0.9757   0.9756   0.9755   0.9753   0.9752   0.9750   0.9748   0.9742   0.9727
R Square             0.9520   0.9520   0.9520   0.9519   0.9516   0.9513   0.9510   0.9505   0.9501   0.9490   0.9462
Adjusted R Square    0.9243   0.9271   0.9296   0.9320   0.9338   0.9355   0.9372   0.9385   0.9399   0.9402   0.9388
Standard Error         4435     4353     4276     4206     4148     4093     4041     3997     3953     3941     3989

Intercept            -13060   -13066   -13972   -14427   -15063   -15499   -16268   -18731   -17596   -19837   -20269
MPG City                666      666      662      662      656      650      657      673      626      669      687
HP                      141      141      140      140      140      139      139      141      139      147      145
Trunk                    -1
Warranty                -70      -70      -59      -55      -46      -35      -25
Audi                   7857     7857     8440     8640     9056     9193     9305     9866    10047     9533    10060
Chrysler               1770     1770     1760     1762     1755
Ford                  -4533    -4535    -4078    -3915    -3586    -3579    -3566    -3222    -3069    -3362
Honda                 -1215    -1215     -776
Lexus                  2386     2388     2782     2885     3157     2500     2586     2851     3090
Nissan                -1692    -1693    -1247    -1074
Saturn                 1899     1900     1897     1902     1908     1443
Toyota                -2435    -2435    -1983    -1800    -1443    -1417    -1461    -1168
Volkswagen             -621     -622
RWD                    7563     7564     7578     7570     7573     8263     8069     7727     7758     9473    10141
AWD                   11443    11446    11506    11474    11487    11534    11560    11456    11443    10595    10934
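One reasonable stopping rule for the procedure above is to pick the column with the highest adjusted R-square. The sketch below applies that rule to the values in the table, with columns counted from left to right. The choice of rule is an assumption on my part; the slide does not say which criterion was used.

```python
# R Square and Adjusted R Square rows from the table above, left to right.
r_square = [0.9520, 0.9520, 0.9520, 0.9519, 0.9516, 0.9513,
            0.9510, 0.9505, 0.9501, 0.9490, 0.9462]
adj_r_square = [0.9243, 0.9271, 0.9296, 0.9320, 0.9338, 0.9355,
                0.9372, 0.9385, 0.9399, 0.9402, 0.9388]

# R Square can only fall (or stay flat) as variables are removed...
never_increases = all(a >= b for a, b in zip(r_square, r_square[1:]))

# ...but Adjusted R Square penalizes extra variables, so it can peak.
best_index = max(range(len(adj_r_square)), key=adj_r_square.__getitem__)
best_step = best_index + 1  # 1-based column number

print(f"R Square never increases: {never_increases}")
print(f"best column by Adjusted R Square: {best_step} "
      f"(Adjusted R Square = {adj_r_square[best_index]:.4f})")
```

The selected column is also the one with the smallest standard error in the table.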
Applied Regression -- Prof. Juran 122
Minitab “Best Subsets” Procedure
Number of variables         1        2        3        4        5        6        7
Multiple R             0.9175   0.9372   0.9543   0.9669   0.9727   0.9742   0.9748
R Square               0.8418   0.8783   0.9107   0.9348   0.9462   0.9490   0.9501
Adjusted R Square      0.8379   0.8720   0.9036   0.9278   0.9388   0.9402   0.9399
Standard Error           6492     5768     5005     4334     3989     3941     3953

Intercept               -7326   -32141   -30500   -23445   -20269   -19837   -17596
HP                      171.1    206.9    191.6    159.1    145.3    147.2    139.1
MPG City                         805.4    828.5    721.1    686.9    668.9    626.4
Audi                                       9856    14864    10060     9533    10047
RWD                                                 8518    10141     9473     7758
AWD                                                         10934    10595    11443
Ford                                                                 -3362    -3069
Lexus                                                                          3090
Conclusions
• These factors can be used to estimate MSRP within about $4,000
• You should expect to pay more for higher HP and for higher fuel efficiency (despite the fact that they are negatively correlated with each other)
• FWD is the least expensive drive train
• Audi (and maybe Lexus) is overpriced, based on these features
• Ford is a relative bargain
Applied Regression -- Prof. Juran 123
Applied Regression -- Prof. Juran 124
Comparison of Three Models
Regression Statistics     Enter "Best"   Remove "Best" Best Subsets #6
Multiple R                      0.9736          0.9742          0.9742
R Square                        0.9479          0.9490          0.9490
Adjusted R Square               0.9390          0.9402          0.9402
Standard Error                    3983            3941            3941

Intercept                        -9528          -19837          -19837
HP                               135.9           147.2           147.2
FWD                              -8040
AWD                               3864           10595           10595
Audi                             10615            9533            9533
Lexus                             3662
MPG City                         634.6           668.9           668.9
RWD                                               9473            9473
Ford                                             -3362           -3362
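To see how such a model would be used, the sketch below plugs a hypothetical car into the Best Subsets #6 coefficients from the comparison above. The function name `predict_msrp` and the example cars are made up for illustration. The dummy variables are 0/1, with FWD and the other makes as the baseline.

```python
# Coefficients of the Best Subsets #6 model, as listed above.
COEFFICIENTS = {
    "Intercept": -19837.0,
    "HP": 147.2,
    "MPG City": 668.9,
    "Audi": 9533.0,
    "RWD": 9473.0,
    "AWD": 10595.0,
    "Ford": -3362.0,
}


def predict_msrp(hp, mpg_city, audi=0, rwd=0, awd=0, ford=0):
    """Predicted MSRP in dollars; audi, rwd, awd and ford are 0/1 dummies."""
    c = COEFFICIENTS
    return (c["Intercept"] + c["HP"] * hp + c["MPG City"] * mpg_city
            + c["Audi"] * audi + c["RWD"] * rwd + c["AWD"] * awd
            + c["Ford"] * ford)


# Hypothetical car: 200 HP, 25 MPG city, front-wheel drive, baseline make.
base_price = predict_msrp(hp=200, mpg_city=25)
# Same specifications, but an Audi with all-wheel drive.
audi_awd_price = predict_msrp(hp=200, mpg_city=25, audi=1, awd=1)

print(f"baseline car: ${base_price:,.0f}")
print(f"Audi AWD:     ${audi_awd_price:,.0f}")
```

The difference between the two predictions is just the sum of the Audi and AWD coefficients, which is how the "overpriced" and "bargain" conclusions are read off the model.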
Applied Regression -- Prof. Juran 125
Summary
Two Multivariate Methods
• Cluster Analysis
– Excel
– Minitab
• Discriminant Analysis
– Excel
– Minitab
Steam case
Cars