
AdPac

Advanced Biometrics Pac

From left to right: Francis Galton (1822-1911), Karl Pearson (1857-1936), William Sealy Gosset (1876-1937), Ronald A. Fisher (1890-1962)


A Brief Introduction to SAS

History - SAS started as the Statistical Analysis System, developed at the Institute of Statistics at North Carolina State University, Raleigh, in the 1960s. (By the way, North Carolina State has an excellent degree program in biostatistics.) In 1976, it became SAS Institute, a privately held corporation. It's now one of the largest privately held software companies in the world, and is used at more than 37,000 sites in 111 countries. SAS has grown far beyond its origins as a "statistics package". SAS has positioned itself as "enterprise software", i.e. a complete system to manage, analyze, and present information, especially in a business environment. SAS contains its own programming language (similar to BASIC or FORTRAN), database capability, and even geographic information system capability.

SAS is enormous and complex. Fortunately, like most complex programs, you can make good use of SAS with minimal knowledge. Unfortunately, the complexity makes it harder to learn how to do new things in SAS. Like most '60s-era programs, SAS carries some obsolete baggage. For example, you'll see references to "cards", which means computer punch cards, which haven't been used for years. SAS also has the ability to extensively format output, because people used to write reports with SAS before there were word processors.

Where To Find More Information - SAS documentation is provided with the program. In previous versions, there were substantial computer problems in trying to read the documentation; these problems have been solved with version 9. Getting familiar with the SAS documentation is a formidable task - that's one reason for this handout. You could purchase documentation from SAS, but to get the minimum you need - Language Reference (2 vols.), SAS/STAT User's Guide (3 vols.), Procedures Guide (2 vols.) - would cost around $350. However, if money is no object for you, the actual printed docs are easier to use than the online version. You can order docs from the SAS web site: http://www.sas.com/

There are also "after market" books on how to use SAS. These are often easier to read than the manuals, but are usually less complete. Some examples (available from www.amazon.com, with prices as of 12/7/2010):

Delwiche, Lora and Susan Slaughter. 2003. The Little SAS Book: A Primer. Third Edition. $44.07
Cody, Ron P. and Jeffrey K. Smith. 2005. Applied Statistics and the SAS Programming Language (5th Edition). $66.81

The SAS Web site: http://www.sas.com

The Big Picture - The SAS program uses three windows: (1) Program Editor - this is where you enter SAS instructions; (2) Output - results produced by SAS from your instructions; (3) Log - this is where SAS shows how it interpreted your instructions, including any error messages. You enter your instructions in the editor window, and click the Submit button. Check the output window for the results. If there's nothing in the output window, or something is missing, or anything doesn't seem right, then check the log window for error messages. Even if everything seems okay, ALWAYS CHECK THE LOG WINDOW. An error affecting your results may have occurred without you being aware of it.

Data may be kept in a separate file (this needs to be an "ASCII text" file or a special format created by SAS), or data may be entered directly into the SAS instructions. For small data sets, it is convenient to enter the data with the SAS instructions. For larger data sets, it's better to use Excel for data entry and checking, and then have Excel save the data to a text file for SAS.
If you need this method (you may for your research data; you won't need it for this class), use the "comma separated value" (.csv) text file format. Dr. M. can help.

Data Structure - Regardless of whether you put data in the SAS instructions or in a separate file, you generally need to structure the data using the concept that columns are variables and rows are observations (this structure is general to many computer programs). Variables can be measurements (i.e. ratio/interval or ordinal scale) or used to show group membership (categorical variables). Categorical variables are called "classification" or "stratifier" variables. It is usually best to use numbers to represent groups, but names can be used. In each row, separate values for different variables by at least one space. The missing value symbol is the period with at least one space on each side (i.e. " . " without the quotes). If you are in Advanced Biometrics or the ANOVA class, you should have many examples of SAS. Refer to these as you read what follows.

SAS Instructions - A SAS job (or SAS run, or whatever you want to call it) consists of two "steps" or "segments":

1. DATA step - here you tell SAS the name of your data file (if there is one), have SAS read the data, perform any needed transformations (e.g. logarithms; arcsine), and create any new variables which are algebraic functions of the input variables.

2. PROCEDURES step - this is where you give instructions on what statistical test should be done, and information SAS needs to do the tests. SAS users don't use the word "procedure" - it's called a "proc" (pronounced "prock"). Plural is procs. If you're going to walk the walk, you have to talk the talk.
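To make that structure concrete, here is a made-up fragment in this layout - one row per observation, with columns for a group code and two measurements, and a " . " marking a missing second measurement in the last row:

    1 65.2 3515
    1 58.2 3420
    2 62.1 3444
    2  .   3827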


All SAS commands (in either step) end with a semicolon (;). If you don't have a semicolon at the end of a SAS command, an error will be generated, and your analysis won't be done. SAS is not case sensitive, i.e. it doesn't matter whether commands are in lower or upper case, or any combination.

DATA Step

The first line you put in the editor should be the DATA statement. It has the form:

DATA [name];

The brackets around the name mean that you supply the name. The name may consist of characters including letters, numbers, and the underscore character. What you're doing here is naming the data set, and you're doing this because it is possible to use more than one data set in a SAS run - but this isn't something to try right away. For example:

DATA BabyMoms;

Next, if you're using a data file, you must tell SAS the path and name of the file (i.e. the file specification) by using the INFILE command. This requires you to know where the file is on the disk, which can be a challenge in Windows. For example:

INFILE 'C:\Files\Research\SAS\Year1\ALL_DATA.TXT';

Notice that the file specification must be enclosed in single quotes. Next, you should tell SAS to read your data with the INPUT command. Supply a name for each variable, and separate the names by at least one space. Try to keep the names short, about eight characters (letters, numbers, underscore). For example:

INPUT Smoking Mom_Wt Baby_Wt;

If you want to put the data in the instructions (i.e. you're not using a data file), do that next by using the DATALINES command:

DATALINES;
1 65.2 3515
1 58.2 3420
2 62.1 3444
2 72.1 3827
3 59.3 2608
3 51.2 2509
;

In this little example, the six lines of data are consistent with the INPUT command above. The first number (Smoking) shows the group membership; the second (Mom_Wt) is mom's prepregnancy weight; the third (Baby_Wt) is baby's birth weight. Notice that you start the data with the DATALINES; command (don't forget the semicolon). Then, lines of data follow. Do NOT put semicolons at the end of the data lines. After the last data line, put a semicolon. This indicates the end of the data.

The order of statements in the DATA step is important. Start with the DATA command, then the INPUT command. Then, enter any data transformation or data handling statements (see below). The DATALINES command with the data must come after any transformations or data handling.

SAS can do many data transformations. Let's say you want to take the common logarithm of the Mom_Wt variable, and create a new variable called LogMom. Here's the command:

LogMom = LOG10(Mom_Wt);

Note that the function for common logs is LOG10(). A few other SAS functions are:

LOG()    Log base e (Naperian)
LOG10()  Log base 10 (common)
LOG2()   Log base 2
MAX()    The maximum value, e.g. BIG_BABY = MAX(Baby_Wt);
MIN()    The minimum value
SQRT()   Square root
EXP()    Exponential function, i.e. raises the constant e to the given power, e.g. E_Mom = EXP(Mom_Wt);
ARSIN()  Inverse sine function. When you have data that are proportions, use this transformation to bring the values closer to a normal distribution (proportions are binomially distributed).

There are many more SAS functions. Consult the documentation for a complete list.
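Putting those pieces together, a complete DATA step for the mom/baby example might look like the following sketch (the names are the ones used above; the LogMom transformation is included just for illustration):

    DATA BabyMoms;
       INPUT Smoking Mom_Wt Baby_Wt;
       LogMom = LOG10(Mom_Wt);   * transformation goes between INPUT and DATALINES;
       DATALINES;
    1 65.2 3515
    1 58.2 3420
    2 62.1 3444
    2 72.1 3827
    3 59.3 2608
    3 51.2 2509
    ;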


You may want to create new variables which are algebraic functions of the variables you input to SAS. This can be done with simple statements using operators: + for addition, - for subtraction, * for multiplication, / for division, and ** for exponentiation. For example:

B_Plus_M = Baby_Wt + Mom_Wt;     creates a new variable with the sum of each mom and baby
BminusM = Baby_Wt - Mom_Wt;      baby weight minus mom weight
BtimesM = Baby_Wt * Mom_Wt;      multiplies baby times mom
B_Div_M = Baby_Wt / Mom_Wt;      divides baby by mom
B_Squared = Baby_Wt**2;          squares the baby weights
B_CubeRoot = Baby_Wt**(1/3);     gives the cube root of the baby weights

The names to the left of the equal sign are new variables (you supply the names - follow the same rules as naming input variables). The operation is performed on each observation (e.g. each mom and baby).

Sometimes you want to create a new variable whose value depends on existing variables. For example, let's say you want to use the Mom weights as a factor (categorical variable) in an ANOVA. Let's create a new variable (we'll call it Mom_Cat) that has a value of 1 if the mom is below the mean weight (which is 62.5 kg), and a 2 if they are equal to or above the mean. One way to do this is the following statement:

IF expression THEN statement; <ELSE statement;>

The brackets (< >) mean that part is optional. Let's see some examples. We could do the following:

If Mom_Wt LT 62.5 Then Mom_Cat = 1;
If Mom_Wt GE 62.5 Then Mom_Cat = 2;

The LT means LESS THAN, the GE means GREATER THAN OR EQUAL TO. If we wanted to be clever, we could do this:

If Mom_Wt LT 62.5 Then Mom_Cat = 1; Else Mom_Cat = 2;

The LT and GE are called comparison operators. Here is a list of comparison operators in SAS:

Mnemonic  Meaning
EQ        equal to
NE        not equal to
GT        greater than
LT        less than
GE        greater than or equal to
LE        less than or equal to

You can write complicated statements using Boolean operators (e.g. AND, OR), but make sure you know what you're doing. For example, the two statements

If Mom_Wt LT 62.5 AND Baby_Wt LT 3292 Then Mom_Cat = 1;
If Mom_Wt LT 62.5 OR Baby_Wt LT 3292 Then Mom_Cat = 1;

do not produce the same results. Be careful. It's usually best to keep it as simple as possible. Also, have SAS print (see below) values for new variables so you can check to see that SAS has done what you want.

The SAS command TITLE allows you to specify a title that will appear on each page of your output. This could help you remember what the output was when you stumble across it five years from now. Put the title in quotes. For example:

TITLE "Advanced Courses in Biostatistics Are Very Cool";

One (somewhat) advanced command is SELECT. Let's start with an example, and then explain it:

SELECT (Smoking);
   WHEN (1) SmokeCat = 1;
   WHEN (2) SmokeCat = 2;
   WHEN (3) SmokeCat = 2;
END;

We're creating a new categorical variable called SmokeCat. If the Smoking variable is 1, then SmokeCat is 1. If Smoking is a 2 or a 3, then SmokeCat is 2. This combines the two smoking categories (2 = 1 pack/day, 3 = 1+ pack/day) into a single category, allowing a t-test or ANOVA of the nonsmoking mean vs. the mean of all smoking babies. You could do the same thing with a series of IF-THEN statements, but SELECT is sometimes more efficient. You must have the END; statement at the end. Check out the following:

SELECT (Smoking);
   WHEN (1)   SmokeCat = 1;
   WHEN (2,3) SmokeCat = 2;
END;

This little shortcut produces the same results, but with fewer statements. The data step ends when SAS encounters the first PROC statement.
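As a sketch, here is how the IF-THEN/ELSE and SELECT statements would sit inside the BabyMoms DATA step shown earlier - between the INPUT and DATALINES statements (the 62.5 kg cutoff is the made-up mean from above):

    If Mom_Wt LT 62.5 Then Mom_Cat = 1;   * below the mean mom weight;
    Else Mom_Cat = 2;                     * at or above the mean;
    SELECT (Smoking);
       WHEN (1)   SmokeCat = 1;           * nonsmokers;
       WHEN (2,3) SmokeCat = 2;           * all smokers combined;
    END;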


PROCEDURES Step

Procs allow you to do statistical tests (and some other stuff). Procs may be repeated in a single run, and a run can have multiple procs. While it is possible to write huge SAS runs with many procs, don't let yourself (or SAS) get too confused. Also, you must have the RUN; command as the very last statement in the editor, or SAS won't do the analysis.

PROC Print;

This will print all of your data (including any newly created variables) to the Output window. To make it easier to work with, you can select the output, copy it, and paste it into Word or Excel. If you go into Excel, you'll probably want to click Data, then Text to Columns, and use the wizard to get each variable into its own Excel column.

PROC Means;

This will give you the mean, standard deviation, sample size, maximum and minimum for all variables.

PROC Means n mean std stderr;

This will give you the sample size, mean, standard deviation, and standard error for all variables.

PROC Means; VAR Baby_Wt Mom_Wt;

By adding VAR variablelist; you restrict output to the variables listed - just baby weights and mom weights in this example.

PROC Means; VAR Baby_Wt Mom_Wt; CLASS Smoking;

This gives stats for babies and moms broken down by smoking group. PROC MEANS always excludes all missing data from all calculations.

PROC Univariate;

This gives a lot more basic stats, including median, mode, variance, and standard error. It also tests the null hypothesis Ho: µ = 0 using a t-test. If you print the output from the Output window, you get a separate page for each variable, so you can use a lot of paper in a hurry if you're not careful.

PROC Univariate; VAR Baby_Wt Mom_Wt;

This limits the analysis to the variables specified. To get analyses by groups, do the following (the Univariate proc allows the CLASS statement like PROC Means does):

PROC Univariate; VAR Baby_Wt Mom_Wt; CLASS Smoking;

Now, how about testing for normality? Again, we use PROC Univariate for this task.

PROC Univariate Normal; VAR Baby_Wt;

The use of the Normal option after Univariate will add a test for normality (Ho: Distribution is normal) to the output. You get several statistics which test normality, including W, the Shapiro-Wilk statistic, and the Kolmogorov-Smirnov D. Both of these tests are considered better than the test for normality used in StatCat, but they are also a lot more complicated to calculate. But SAS doesn't seem to mind doing the calculations.

PROC Univariate Normal; CLASS Smoking; VAR Baby_Wt;

This gives you a normality test of the Baby_Wt variable for each smoking group (normality within each group is an assumption of ANOVA).

PROC TTest; CLASS SmokeCat; VAR Baby_Wt Mom_Wt;

This does a Two-sample t-test of Baby_Wt for the nonsmoking and smoking groups, and a Two-sample t-test of Mom_Wt for the nonsmoking and smoking groups. SAS does both the "regular" Two-sample t-test (i.e. assuming equal variances) and Welch's approximate t (t'; see Zar), which does not assume equal variances. SAS also automatically does the Variance Ratio Test to see if the variances are equal.
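As a sketch, the summary and t-test procs above could be strung together in one run after the BabyMoms DATA step (SmokeCat as created with SELECT earlier):

    PROC Means n mean std stderr; VAR Baby_Wt Mom_Wt; CLASS Smoking;
    PROC Univariate Normal; VAR Baby_Wt; CLASS Smoking;
    PROC TTest; CLASS SmokeCat; VAR Baby_Wt Mom_Wt;
    RUN;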


Suppose you wish to do a One-sample t-test. For example, suppose you had a priori reason to believe that all moms weigh 66 kg. This can be done with PROC TTest by specifying the null hypothesis as an option in the PROC statement, and specifying a single variable in the VAR statement:

PROC TTest H0=66; VAR Mom_Wt;

Now, how about a Paired-sample t-test? For example, suppose you weigh a group of people before and after a 6-month exercise program, and you want to test for a difference in weight. The data are paired because it's the same person before and after. In the data, each row is a person, the first column is the before weights, and the second column is the after weights. Use INPUT Before After; to read the data in the data step. Then use the PAIRED statement in PROC TTest to do the Paired-sample t-test:

INPUT Before After;
PROC TTest; PAIRED Before*After;

Note that you don't use either a CLASS statement or a VAR statement to do a Paired-sample t-test.

PROC ANOVA; CLASS Smoking; MODEL Baby_Wt = Smoking;

The ANOVA procedure will do several types of ANOVA, but is primarily designed for balanced data. It should be used for unbalanced data only when doing a one-factor ANOVA. Other unbalanced designs should use PROC GLM (below). This example does the simple one-factor ANOVA with baby weight as the dependent variable and Smoking as the single factor.

PROC GLM; CLASS Smoking; MODEL Baby_Wt = Smoking;

The General Linear Models (GLM) procedure can be used for ANOVA, ANCOVA, and MANOVA. This example does the simple one-factor ANOVA with baby weight as the dependent variable and Smoking as the single factor. Consult the ANOVAPac from the BIO 499 Biological Applications of ANOVA class for lots of examples using PROC GLM;. The ANOVAPac may be downloaded from Dr. M's courses page (linked from his home page).

PROC GLM; CLASS Smoking; MODEL Baby_Wt = Smoking; MEANS Smoking / HOVTEST=BF TUKEY;

The MEANS Smoking / HOVTEST=BF TUKEY; statement above requests SAS to show the means for the smoking groups, do the Brown-Forsythe variation of Levene's test for homoscedasticity (an assumption of ANOVA), and do the Tukey multiple comparison test.

PROC Reg; MODEL Baby_Wt = Mom_Wt;

The Reg procedure does regression. The above does the regression of baby birth weight on mom weight that Dr. M. uses in BIO 211.

Recall that earlier we created a new variable called Mom_Cat which was 1 if the mom was less than 62.5 kg, and 2 if greater than or equal to 62.5 kg. We can now do a contingency table analysis:

PROC Freq; TABLES Smoking*Mom_Cat / Chisq;

This produces a contingency table with smoking category as the rows and the mom category as the columns. SAS does what's called a cross tabulation, i.e. how many times was 1 in Smoking associated with 1 in Mom_Cat, and how many times with 2 in Mom_Cat. It then does this for each of the other values in Smoking. The ChiSq option prints statistics, including the chi-squared value. If you know what a pivot table does in Excel, then you are familiar with cross tabulation.
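For instance, after the BabyMoms DATA step, the one-factor ANOVA, the regression, and the contingency table could all be requested in a single run - a sketch built entirely from the statements above:

    PROC GLM; CLASS Smoking; MODEL Baby_Wt = Smoking;
       MEANS Smoking / HOVTEST=BF TUKEY;
    PROC Reg; MODEL Baby_Wt = Mom_Wt;
    PROC Freq; TABLES Smoking*Mom_Cat / Chisq;
    RUN;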


The Freq procedure can be used to do a Fisher Exact Test. Here's a complete SAS job which will do the Fisher Exact Test in Zar on pages 544-551 of the 4th edition.

DATA FISHEREX;
INPUT ROW COL FREQ;
DATALINES;
1 1 12
1 2 7
2 1 2
2 2 9
;
PROC FREQ;
WEIGHT FREQ;
TABLE ROW*COL / CHISQ;
TITLE 'Fisher Exact Test for 2x2 Table';
RUN;

Notice that you're putting in the actual frequencies here, not doing a cross tabulation. The WEIGHT command tells SAS that the FREQ variable has the actual frequencies. The ROW and COL variables give the row and column index for the frequency, e.g. the frequency in row 1, column 1 is 12. The frequency in row 1, column 2 is 7. And so on.

RUN;

This tells SAS you're done with commands and want the analysis run. This should be the last command you enter in the editor.


Advanced Biometrics - SAS Assignments

DUE DATES WILL BE ANNOUNCED IN CLASS

You will submit your assignments by email to Dr. M. When you are sure you have the assignment working properly, simply select and copy your commands in the SAS program editor, and paste them into the email message. Dr. M. will copy your commands and paste them into SAS, and then see if you did the assignment correctly. It is not recommended that you retype the commands into the email message. You could make a typo that would cause your commands not to work, and cause you to lose points on the assignment. If you can't email from the computer where you are using SAS, then save your commands in the Program Editor to a disk. The file created (for example, A1.SAS) is a text file and can be opened on your email computer with Notepad. Then, copy and paste from Notepad into your email client.

The subject line of your email MUST be of the form SAS A? LastName, where the ? is replaced by the assignment number, and LastName is replaced by your last name. For example, if your name is Charles Darwin, and you're submitting Assignment 1, the subject would be SAS A1 Darwin.

Here's a sample email submission from a previous class. Note that this Assignment 1 is not the same as yours.

Delivered-To: [email protected]
X-Sender: [email protected]
Date: Fri, 08 Nov 2002 17:35:41 -0800
To: [email protected]
From: ChuckBob Darwin <[email protected]>
Subject: SAS A1 Darwin

Dr. M,

The trip is going well. These are pretty neat islands. I've been thinking about doing something with all the different finches found on the islands, but what could possibly be important about a bunch of little brown birds? Anyway, here's my first Advanced Biometrics assignment. I hope this boat doesn't sink.

Chuck

Data A1;
Input Mom Baby;
Datalines;
65.2 3515
58.2 3420
48.7 3175
65.8 3586
73.5 3232
68.2 3884
69.3 3856
69.3 3941
59.3 3232
;
Proc Corr;
Var Mom Baby;
Run;

_____________________________________________
Galapagos.Net - the natural selection for the best in internet fitness

Notice that the email starts with a little message - you do not have to include any message. Notice there is a blank line before the SAS commands begin, and a blank line after the last SAS command (Run;). This will help Dr. M. copy and paste easily and correctly, which is a good thing.

YOU MUST DO YOUR OWN WORK. DO NOT DISCUSS ASSIGNMENTS WITH OTHER STUDENTS.

At the course web page you may download an Excel file (data_all.xls; only 18 KB). This file has the data for all the assignments. You are not required to use this file. You may find it easier (and more accurate) to use the data from the Excel file, but you are more than welcome to manually enter the data into SAS yourself.


ASSIGNMENT 1

The data below are body weights (in pounds) of 12 black bears from the San Gabriel Mountains, and 12 black bears from the San Bernardino Mountains. THINK ABOUT THE DATA STRUCTURE/FORMAT USED IN SAS AND ALL STATISTICAL PROGRAMS.

San Gabriel Mts   San Bernardino Mts
      360                344
      514                365
      270                356
      204                202
      446                436
      332                220
      262                212
      220                182
      202                236
      348                166
      204                180
      316                416

Using only one SAS command language file, do the following:

1. Determine the common logarithms for the data.
2. Print all data for all variables (including the logarithms) in the output.
3. Test for a difference (two-tailed) between the means (untransformed data) of the bears from the two mountain ranges using the two-sample t procedure. The absolute value of your t statistic should be 0.74.
4. Test for a difference (two-tailed) between the means (log transformed data) of the bears from the two mountain ranges using the two-sample t procedure. The absolute value of your t statistic should be 0.85.

ASSIGNMENT 2

In this assignment, we'll learn something about statistics as well as expanding our knowledge of SAS. We'll be working with two-sample t-tests, paired-sample tests, One-way ANOVA, and Two-way ANOVA. If you want to review your ANOVA, try Zar or the TestPac from BIO 211. Surely you didn't sell your Zar text!!

The following data are serum cholesterol levels for 12 subjects before and after a diet-exercise program.

SUBJECT  BEFORE  AFTER
-------  ------  -----
   1       201    200
   2       231    236
   3       221    216
   4       260    233
   5       228    224
   6       237    216
   7       326    296
   8       235    195
   9       240    207
  10       267    247
  11       284    210
  12       201    209

Here is what you have to do:

1. These data are clearly paired (because it's the same person before and after), but let's start out with a two-sample t-test of the before mean versus the after mean. This is an incorrect test, because it does not recognize the paired nature of the data, but do it anyway. This part is easy, because it is almost exactly like Assignment 1. Your t value should be 1.56.


2. Use PROC ANOVA to do two one-way ANOVAs. In the first One-way ANOVA, the groups should be before and after (F = 2.43). In the second One-way ANOVA, the groups should be the subjects (i.e. we are testing for equality of 12 means here; the 12 means are the mean of before and after for each subject. F = 3.88).

3. Use PROC ANOVA to do a two-way ANOVA, grouped by before-after and subject. In your model statement, do not ask for interactions. Notice there are no replications in the cells, so poor SAS could not calculate interactions even if you asked. F for Before/After = 9.12. F for Subjects = 6.51.

4. Finally, do a paired-sample t-test. To do this, you are going to have to violate the "columns are separate variables" rule. Make a data file that has BEFORE and AFTER as separate columns. Then, use PROC TTEST to do the paired-sample t-test. Remember that doing a paired-sample t-test with PROC TTEST was discussed above. Your value should be t = 3.02.

If you think about it carefully, you can do numbers 1, 2, & 3 using the same data file. Number 4 will require its own data file. You can do this using only one SAS command file, but you need separate DATA steps for the two data files. Do one DATA statement followed by its PROCs, then do your second DATA statement followed by its PROCs, then don't forget your RUN; statement.

We want to investigate the following questions using the above results. You don't have to write answers to these questions - but think about them and be ready to discuss them in class.

1. Are the results of a two-sample t-test consistent with a One-way ANOVA?
2. What is the difference between a paired-sample t and a two-sample t?
3. What is the difference between a One-way ANOVA and a Two-way ANOVA?
4. Are the results of a paired-sample t and a Two-way ANOVA consistent? Are they testing the same thing?

ASSIGNMENT 3

The following data are heights and weights of English boys of various ages. Each height and weight is the average of fifty boys of the given age.

AGE  HEIGHT  WEIGHT
 5     45      46
 6     48      52
 7     49      58
 8     51      62
 9     54      70
10     55      75
11     57      83
12     58      86
13     59      88

Using PROC REG, do the following regression: Height (dependent variable) vs. Age (independent variable). Your equation should be:

Y = 36.98889 + 1.76667 X

Your output must include the 95% upper- and lower-confidence limits for each predicted value. We did not discuss how to do this in class, and it's not discussed in "A Brief Introduction to SAS" (above). You are expected to consult the SAS documentation and figure out how to do this. Do not ask your fellow students or Dr. M. how to do this. They've already read the documentation, and now it's your turn. The first predicted value (for the observed HEIGHT = 45) is 45.8222, and the 95% confidence limits are 43.8256 and 47.8188.


ASSIGNMENT 4

Below is a small subset of the Werner blood chemistry data (these are actual data; the reference is in the BIO 211 Test Pac). The first variable indicates if the subject was on the birth control pill (1 = not on pill; 2 = on pill), and the second variable is serum cholesterol (mg/dl).

PILL    SERUM
STATUS  CHOLESTEROL
------  -----------
  1         200
  2         600
  1         243
  2          50
  1         158
  2         255
  1         210
  2         192
  1         246
  2         245
  1         208
  2         260
  1         204
  2         192
  1         280
  2         230
  1         215
  2         225
  1         165
  2         200

Your assignment is -

1. Using PROC UNIVARIATE, determine the median serum cholesterol value. Also test the serum cholesterol data for normality using the Shapiro-Wilk W statistic. Your W value should be 0.679365.

2. Using PROC FREQ, analyze a 2x2 contingency table (chi-square statistic), where the rows are pill or no pill, and the columns are values above the median and less than (or equal to) the median. χ2 = 0.800. The table must be created by cross tabulation.

You'll have to do number 1 first to find out the value of the median. Once you know the value, you can then add the appropriate data steps and PROC FREQ to do the contingency table. The challenge is to create a new, categorical variable which indicates whether a cholesterol value is greater than the median, or less than or equal to the median. This new categorical variable must be created with SAS commands in the Data Step. You are not allowed to enter it as an input variable. Your submission must include both PROCs.


ASSIGNMENT 5

Below are data from an agricultural experiment. 24 steers were randomly placed into one of three groups, such that each group had 8 steers. The groups were fed different rations (called Ration 1, Ration 2, and Ration 3) for a period of time. The question was to determine if there was a difference among the three rations in the area of the ribeye steak produced. The data show the area of the ribeye steak (in square inches) as well as the final body weight (in pounds) for each steer.

   RATION 1          RATION 2          RATION 3
Weight  Ribeye    Weight  Ribeye    Weight  Ribeye
  855     8.7       870    10.1      1030    11.3
  900    10.1       890     9.5       950    11.0
  825     9.5       880     9.7       920     9.5
  805     8.7      1035    10.3       855    10.5
  880     9.3       875     9.1      1025    11.1
  950    10.1       985     9.6      1000     9.2
  955     9.9      1130    10.8      1040    10.9
  900     8.9       925    10.4      1115    12.0

Your assignment is:

1. Using PROC GLM, test for a difference in ribeye area among the three rations (i.e. a one-way ANOVA). Ignore body weight. Your F value should be 6.55.

2. Also using PROC GLM, perform an Analysis of Covariance (ANCOVA) for a difference in ribeye area among the three rations, using body weight as the covariate. Use the GLM procedure twice: (1) the first time, include (in your model statement) the ration variable, the weight variable, and the ration*weight interaction as independent variables (ribeye is dependent). This will provide a test of equality of the slopes among the three rations, i.e. the null hypothesis Ho: β1 = β2 = β3. It won't tell you the value of the slopes, but don't worry about that. The F value for this test (the interaction on the ANOVA table) should be 0.34. (2) The second time, use only ration and weight as independent variables (do not include the interaction). Use the SOLUTION option of the MODEL statement (consult your SAS documentation), as well as the MEANS and LSMEANS statements (see your documentation) for the ration variable. All of this will provide the pooled regression coefficient and tell us the value of the adjusted means. The ANOVA table here will test equality of the adjusted means (F = 2.32) and the significance of the pooled regression (F = 11.32). Look in the "Type III SS" section of the ANOVA table for these F values. Can you find the "pooled regression coefficient" on the output? (Hint: the pooled regression coefficient is the parameter estimate for the weight variable.) The adjusted means (LSMEANS) are Ration 1 = 9.7173248, Ration 2 = 9.8974048, Ration 3 = 10.4102704.


Matrix Algebra - A basic review

By convention, symbols for single numbers (called scalars) are in standard weight typeface, while matrices are in bold. Thus, the symbol x represents a scalar, while a bold x represents a matrix. This convention is not always followed - it's used when it's convenient, but it's often ignored.

Vectors

A matrix with only one column is a vector, also called a column matrix. Vector a below has 4 elements. A row of numbers is usually thought of as the transpose of a column matrix, although it is sometimes called a row vector. The transpose of a is a'.

$$\mathbf{a} = \begin{bmatrix} 4 \\ 8 \\ 9 \\ 7 \end{bmatrix} \qquad \mathbf{a}' = \begin{bmatrix} 4 & 8 & 9 & 7 \end{bmatrix}$$

Remember that vectors are just special cases of matrices.

Matrix (Matrices)

The order or dimension of a matrix informs us of the number of elements. Matrix A below is of order 3x3. Since A is a square matrix, we can also say it is a "square matrix of order 3".

$$\mathbf{A} = \begin{bmatrix} 8 & 3 & 5 \\ 7 & 4 & 6 \\ 9 & 2 & 7 \end{bmatrix}$$

The individual numbers that make up the matrix are called the elements of the matrix. Elements are referred to by subscripts, with the first number being the row number, and the second number being the column. Thus, in A above, $a_{23} = 6$. Matrix A in all symbols is:

$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$$

Matrix Addition and Subtraction

Two matrices of identical dimension may be added or subtracted. Simply add or subtract the elements in the corresponding positions:

$$\begin{bmatrix} 8 & 4 \\ 9 & 2 \end{bmatrix} + \begin{bmatrix} 1 & 5 \\ 6 & 3 \end{bmatrix} = \begin{bmatrix} 9 & 9 \\ 15 & 5 \end{bmatrix} \qquad\qquad \begin{bmatrix} 8 & 4 \\ 9 & 2 \end{bmatrix} - \begin{bmatrix} 1 & 5 \\ 6 & 3 \end{bmatrix} = \begin{bmatrix} 7 & -1 \\ 3 & -1 \end{bmatrix}$$

Matrix Multiplication

In order to multiply two matrices, they must conform for multiplication:

Two matrices conform for multiplication if and only if the number of columns in the first matrix equals the number of rows in the second matrix.

If A is a matrix of order (m x r), that is m rows and r columns, and B is a matrix of order (r x n), then AB exists because they conform for multiplication. If AB = C, matrix C will be of order (m x n). The elements of C are calculated by this formula:

$$c_{ij} = \sum_{s=1}^{r} a_{is} b_{sj}$$

where a, b, and c represent the elements of A, B, and C respectively. In words, $c_{ij}$ is the sum of the products of the elements of row i of A with the corresponding elements in column j of B. Many find the formula more confusing than helpful. It may be easier to see an example, and memorize the pattern. Let's do an example.


$$\mathbf{A} = \begin{bmatrix} 1 & 4 & 3 \\ 2 & 6 & 8 \\ 5 & 7 & 9 \end{bmatrix} \quad \text{and} \quad \mathbf{B} = \begin{bmatrix} 10 & 15 \\ 14 & 13 \\ 12 & 11 \end{bmatrix}$$

Since A is (3 x 3) and B is (3 x 2), then AB = C exists. C will be (3 x 2).

$$\mathbf{C} = \begin{bmatrix} 1{\times}10 + 4{\times}14 + 3{\times}12 & \;\; 1{\times}15 + 4{\times}13 + 3{\times}11 \\ 2{\times}10 + 6{\times}14 + 8{\times}12 & \;\; 2{\times}15 + 6{\times}13 + 8{\times}11 \\ 5{\times}10 + 7{\times}14 + 9{\times}12 & \;\; 5{\times}15 + 7{\times}13 + 9{\times}11 \end{bmatrix} = \begin{bmatrix} 102 & 100 \\ 200 & 196 \\ 256 & 265 \end{bmatrix}$$
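By the way, if your SAS installation licenses SAS/IML, you can have SAS check this arithmetic directly. A minimal sketch using the A and B above (IML matrix literals list rows separated by commas):

    PROC IML;
       A = {1 4 3, 2 6 8, 5 7 9};
       B = {10 15, 14 13, 12 11};
       C = A * B;       * matrix multiplication;
       PRINT C;         * should match the product worked out above;
    QUIT;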

Notice that although AB exists, BA does not exist, because the matrices don't conform for multiplication. In scalar algebra, AB = BA, but not necessarily in matrix algebra. Even if both AB and BA exist, they may or may not result in the same matrix. Another way to say this is that the commutative law does not hold for matrix multiplication. To multiply a scalar times a matrix, simply multiply the scalar times each element in the matrix.

The Inverse of a Square Matrix

There is no division in matrix algebra, but with square matrices, you can do a comparable operation. First, some definitions:

A square matrix has the same number of rows and columns. If it's not square, then it is a rectangular matrix.

The principal diagonal of a square matrix is the set of values from the upper left to the lower right. If you call this the principle diagonal rather than the principal diagonal, prepare to lose all points on the exam. Make sure you know the definitions of principle and principal.

The Identity Matrix is a square matrix with 1s along the principal diagonal, and 0s elsewhere. The identity matrix of order 4 is:

$$\mathbf{I}_4 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

The symbol $\mathbf{I}_p$ is reserved for the identity matrix of order p.

Now, finally, the inverse of a square matrix is the matrix such that when the two matrices are multiplied, the identity matrix results. Let A be a square matrix. The symbol for the inverse of A is $\mathbf{A}^{-1}$. If $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I} = \mathbf{A}^{-1}\mathbf{A}$, then $\mathbf{A}^{-1}$ is the inverse of A. Finding the inverse of a matrix can be computationally intense.

Multiplying a square matrix by its inverse is comparable to division in scalar algebra. For example, consider the following scalar algebra problem: 2X = 8. To solve for X, you divide both sides by 2. Now, how about the matrix problem AX = B, where all matrices are square and of equal order? To solve for X, you use the inverse of A and the identity matrix:

$$\mathbf{A}^{-1}\mathbf{A}\mathbf{X} = \mathbf{A}^{-1}\mathbf{B}$$
$$\mathbf{I}\mathbf{X} = \mathbf{A}^{-1}\mathbf{B}$$
$$\mathbf{X} = \mathbf{A}^{-1}\mathbf{B}$$

One "trick" here is that IX = X. You need to know that any square matrix multiplied by the identity matrix of the same order yields itself. That's why they call it the identity matrix.


Transpose

The transpose of a matrix is formed when the rows and columns are interchanged, i.e. the first row becomes the first column, the first column becomes the first row, and so on. For matrix A, the symbol for the transpose is A'.

$$\mathbf{A} = \begin{bmatrix} 1 & 4 & 3 \\ 2 & 6 & 8 \\ 5 & 7 & 9 \end{bmatrix} \qquad \mathbf{A}' = \begin{bmatrix} 1 & 2 & 5 \\ 4 & 6 & 7 \\ 3 & 8 & 9 \end{bmatrix}$$

Symmetry

A symmetrical matrix is equal to its transpose:

$$\mathbf{R} = \begin{bmatrix} 1 & 0.25 & 0.68 \\ 0.25 & 1 & 0.47 \\ 0.68 & 0.47 & 1 \end{bmatrix} = \mathbf{R}'$$

Null Matrix

The null matrix has all elements equal to 0.

Diagonal Matrix

A diagonal matrix is a square matrix with nonzero elements only on the principal diagonal. The identity matrix is a diagonal matrix.

Trace

The trace of a square matrix is the sum of the elements along the principal diagonal.

$$\mathbf{A} = \begin{bmatrix} 7 & 5 \\ 4 & 8 \end{bmatrix} \qquad \operatorname{tr}\mathbf{A} = 7 + 8 = 15$$

Determinants

Determinants are scalar quantities associated with square matrices. Determinants are calculated from the elements of the matrix. The symbol for the determinant of a matrix is |A|, or det A. If A is a square matrix of order 2, then $|\mathbf{A}| = a_{11}a_{22} - a_{21}a_{12}$.

$$\mathbf{A} = \begin{bmatrix} 7 & 5 \\ 4 & 8 \end{bmatrix} \qquad |\mathbf{A}| = 7{\times}8 - 4{\times}5 = 56 - 20 = 36$$

Determinants of larger order square matrices are very computationally intense, and we will not demonstrate any here. The computer doesn't seem to mind doing them. A matrix is singular if its determinant = 0.

Determinants are used to calculate inverses of matrices (see above). If A is a square matrix, then the inverse of A is:

$$\mathbf{A}^{-1} = \frac{1}{|\mathbf{A}|}\,\operatorname{adj}\mathbf{A}$$

where adj A is the adjoint of A. We have not, and will not, define this, because we don't need to worry about it.

The important thing to see here is that calculating the inverse of a matrix requires the calculation of the inverse (the scalar inverse) of the determinant. Therefore, if a matrix is singular (determinant = 0), the matrix inverse does not exist (is undefined). There is a mathematical method to estimate the inverse of a singular matrix; this is called a "generalized inverse". Well, okay, since you asked: if A is a square matrix of order 2, then adj A is:

$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \qquad \operatorname{adj}\mathbf{A} = \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix}$$

The calculation is very complicated for larger order matrices.
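SAS/IML (again, assuming it is licensed on your system) computes traces, determinants, and inverses directly. A sketch using the 2x2 matrix from the trace and determinant examples above:

    PROC IML;
       A    = {7 5, 4 8};
       trA  = TRACE(A);   * sum along the principal diagonal: 15;
       detA = DET(A);     * determinant: 36;
       Ainv = INV(A);     * inverse exists because the determinant is not 0;
       PRINT trA detA Ainv;
    QUIT;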


Linear Dependency

A matrix is said to be linearly dependent (have linear dependency) if its rows are related by constants.

In the simplest case:

$$\mathbf{C} = \begin{bmatrix} 1 & 4 \\ 3 & 12 \end{bmatrix}$$

Notice that if you multiply each element in the first row by 3, you get the second row.

A more formal way of stating this is: let $r_i$ be the elements in row i. Then $-3r_1 + 1r_2 = 0$. That is, multiply each element in the first row by -3. You get -3 and -12. Then multiply each element in the second row by 1. Now add the columns. Note they add to 0. This means that the matrix is linearly dependent.

Let's try a harder matrix:

$$\mathbf{D} = \begin{bmatrix} 1 & 5 & 2 \\ 1 & 2 & 1 \\ 1 & -1 & 0 \end{bmatrix}$$

D is linearly dependent because $-1r_1 + 2r_2 - 1r_3 = 0$. Try it!

$$\begin{bmatrix} -1 & -5 & -2 \\ 2 & 4 & 2 \\ -1 & 1 & 0 \end{bmatrix}$$

Note that if you add the columns, they all add to 0.

Trying to find linear dependency by this method is difficult, especially for large matrices. The way the computer does it for square matrices is to calculate the determinant. If a square matrix has a determinant of 0, then the matrix is linearly dependent. Singular matrices are linearly dependent. Some statistical procedures require matrices that are not linearly dependent.

Simultaneous Equations

Many multivariate statistical procedures require the solution of sets of simultaneous equations. This is the sort of problem your math teacher used to torture you with in junior high. Consider the following equations:

W + X + 2Y + 3Z = 10
3W + 2X + 2Y + Z = 20
W + 3Y + 4Z = 15
W + X + Z = 6

What are the values for W, X, Y, and Z that satisfy these equations? If you knew matrix algebra in junior high (and had a computer to calculate matrix inverses), then life would have been much easier. Here's how you do the problem in matrix algebra. First, rewrite the equations with all variables and coefficients. Make sure the sequence of the variables is the same in each equation!

1W + 1X + 2Y + 3Z = 10
3W + 2X + 2Y + 1Z = 20
1W + 0X + 3Y + 4Z = 15
1W + 1X + 0Y + 1Z = 6

Next, form a square matrix with the coefficients:

$$\mathbf{A} = \begin{bmatrix} 1 & 1 & 2 & 3 \\ 3 & 2 & 2 & 1 \\ 1 & 0 & 3 & 4 \\ 1 & 1 & 0 & 1 \end{bmatrix}$$

Then, make a vector with the answers:

$$\mathbf{Z} = \begin{bmatrix} 10 \\ 20 \\ 15 \\ 6 \end{bmatrix}$$

Then, make a vector with the unknowns:

$$\mathbf{X} = \begin{bmatrix} W \\ X \\ Y \\ Z \end{bmatrix}$$

This system in matrix algebra is then AX = Z.

Note that the order (sequence) of everything is important!


Next, we solve for vector X using matrix algebra:

$$\mathbf{A}\mathbf{X} = \mathbf{Z}$$
$$\mathbf{A}^{-1}\mathbf{A}\mathbf{X} = \mathbf{A}^{-1}\mathbf{Z}$$
$$\mathbf{I}\mathbf{X} = \mathbf{A}^{-1}\mathbf{Z}$$
$$\mathbf{X} = \mathbf{A}^{-1}\mathbf{Z}$$

Now, we need the inverse of A, which would require a computer (Excel, SAS, or R will compute the inverse of a matrix):

$$\mathbf{A}^{-1} = \begin{bmatrix} -1.375 & 0.25 & 0.75 & 0.875 \\ 1.5 & 0 & -1 & -0.5 \\ 0.625 & 0.25 & -0.25 & -1.125 \\ -0.125 & -0.25 & 0.25 & 0.625 \end{bmatrix}$$

Then we multiply A⁻¹ times the Z vector, which yields the X vector with the answers:

$$\mathbf{X} = \mathbf{A}^{-1}\mathbf{Z} = \begin{bmatrix} 7.75 \\ -3 \\ 0.75 \\ 1.25 \end{bmatrix}$$

Therefore, our solution is W = 7.75, X = -3, Y = 0.75, Z = 1.25. If you're skeptical, and you always should be, then go ahead and put these values into the equations to see if they work.
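In SAS/IML (assuming it is available), the same solution is one statement; SOLVE(A, Z) computes A⁻¹Z without forming the inverse explicitly, which is the numerically safer route:

    PROC IML;
       A = {1 1 2 3, 3 2 2 1, 1 0 3 4, 1 1 0 1};   * coefficient matrix;
       Z = {10, 20, 15, 6};                         * answer vector;
       X = SOLVE(A, Z);                             * solves AX = Z;
       PRINT X;                                     * 7.75, -3, 0.75, 1.25;
    QUIT;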


Matrix Algebra - Sample Exam-type Problems

The answers to these problems are on the next page. It is recommended you try to do the problems before looking at the answers.

For problems 1 through 7 below, let's define A and B as:

$$\mathbf{A} = \begin{bmatrix} 2 & 4 \\ 3 & 5 \end{bmatrix} \qquad \mathbf{B} = \begin{bmatrix} 1 & 3 \\ 5 & 2 \end{bmatrix}$$

1. What is the order of A?
2. Find |A| and |B|. Are either A or B singular?
3. Find A + B.
4. Find A - B.
5. Let k be a scalar such that k = 4. Find kB.
6. Find AB.
7. Write A'. Does AA' = A'A?
8. If 3x - y = 20, and -4x + 2y = -20, solve for x and y using matrix methods. Hint:

$$\text{If } \mathbf{C} = \begin{bmatrix} 3 & -1 \\ -4 & 2 \end{bmatrix} \text{ then } \mathbf{C}^{-1} = \begin{bmatrix} 1 & 0.5 \\ 2 & 1.5 \end{bmatrix}$$

9. Using C and C⁻¹ as defined in the previous question, prove that C⁻¹ really is the inverse of C.


Matrix Algebra - Sample Exam-type Problems - Answers

For problems 1 through 7 below, let's define A and B as:

$$\mathbf{A} = \begin{bmatrix} 2 & 4 \\ 3 & 5 \end{bmatrix} \qquad \mathbf{B} = \begin{bmatrix} 1 & 3 \\ 5 & 2 \end{bmatrix}$$

1. What is the order of A? The order of A is 2. Or 2x2 would also be correct.

2. Find |A| and |B|. Are either A or B singular? |A| = 2x5 - 3x4 = 10 - 12 = -2. |B| = 1x2 - 3x5 = -13. Since neither determinant is zero, neither matrix is singular.

3. Find A + B. Simply add the elements in the corresponding positions:

$$\mathbf{A} + \mathbf{B} = \begin{bmatrix} 3 & 7 \\ 8 & 7 \end{bmatrix}$$

4. Find A - B. Simply subtract the elements in the corresponding positions:

$$\mathbf{A} - \mathbf{B} = \begin{bmatrix} 1 & 1 \\ -2 & 3 \end{bmatrix}$$

5. Let k be a scalar such that k = 4. Find kB. Simply multiply 4 times each element in B:

$$k\mathbf{B} = \begin{bmatrix} 4 & 12 \\ 20 & 8 \end{bmatrix}$$

6. Find AB.

$$\mathbf{AB} = \begin{bmatrix} 2{\times}1 + 4{\times}5 & \;\; 2{\times}3 + 4{\times}2 \\ 3{\times}1 + 5{\times}5 & \;\; 3{\times}3 + 5{\times}2 \end{bmatrix} = \begin{bmatrix} 22 & 14 \\ 28 & 19 \end{bmatrix}$$

7. Write A'. Does AA' = A'A?

$$\mathbf{A}' = \begin{bmatrix} 2 & 3 \\ 4 & 5 \end{bmatrix} \qquad \mathbf{AA}' = \begin{bmatrix} 20 & 26 \\ 26 & 34 \end{bmatrix} \qquad \mathbf{A}'\mathbf{A} = \begin{bmatrix} 13 & 23 \\ 23 & 41 \end{bmatrix}$$

Therefore, AA' ≠ A'A.

8. If 3x - y = 20, and -4x + 2y = -20, solve for x and y using matrix methods. C is the coefficients matrix. Define an answer vector B, and an unknown vector X:

$$\mathbf{C} = \begin{bmatrix} 3 & -1 \\ -4 & 2 \end{bmatrix} \qquad \mathbf{B} = \begin{bmatrix} 20 \\ -20 \end{bmatrix} \qquad \mathbf{X} = \begin{bmatrix} x \\ y \end{bmatrix}$$

Then, CX = B, C⁻¹CX = C⁻¹B, IX = C⁻¹B, X = C⁻¹B:

$$\mathbf{X} = \mathbf{C}^{-1}\mathbf{B} = \begin{bmatrix} 1{\times}20 + 0.5{\times}(-20) \\ 2{\times}20 + 1.5{\times}(-20) \end{bmatrix} = \begin{bmatrix} 10 \\ 10 \end{bmatrix}$$

Therefore, x = 10 and y = 10.

9. Using C and C⁻¹ as defined in the previous question, prove that C⁻¹ really is the inverse of C. The way to "prove" C⁻¹ really is the inverse of C is to multiply them. If you do this, you get the identity matrix I. Since CC⁻¹ = I = C⁻¹C, it must be true that C⁻¹ is the inverse of C.

$$\mathbf{C} = \begin{bmatrix} 3 & -1 \\ -4 & 2 \end{bmatrix} \qquad \mathbf{C}^{-1} = \begin{bmatrix} 1 & 0.5 \\ 2 & 1.5 \end{bmatrix}$$

Big Hint: For the exam, make sure you know all the terms, e.g. trace, diagonal, principal diagonal, symmetrical, etc.


Matrix Algebra using Excel

The following assumes you have some familiarity with Excel. You are not required to know how to do matrix algebra in Excel (i.e. it will not be on the test). This information is provided just in case you may need it. For example, if some junior high kid asks you to help them with their math homework!

With Excel you can add and subtract matrices using standard, simple Excel formulas. There are also functions provided to do the transpose, determinant, inverse, and to multiply matrices. For our examples, let's say we have a spreadsheet as seen below. The values in cells A1:B2 will be the "first matrix", while the values in cells D1:E2 will be referred to as the "second matrix".

     A   B   C   D   E   F   G   H
1    6   3       3   9
2    8   7       7   1
3
4
5

Addition and Subtraction

Use standard Excel formulas. For example, in cell G1, enter the formula =A1+D1. Then copy that formula to cells G2, H1, and H2. You have now added the matrices. You can figure out subtraction from here.

Determinant

To calculate the determinant of the first matrix, go to an empty cell (let's say cell A4) and enter this formula: =MDETERM(A1:B2). You should see the value of 18 in cell A4.

Transpose

The quickest way to transpose a matrix is to select and copy the cells. Then, in an area on the worksheet with enough room, Paste Special (found in the Edit menu), and check the Transpose box. A more complicated method is: first select a block of empty cells big enough to hold the transposed matrix. For example, to get the transpose of the first matrix, we need to select a 2x2 block of cells. Let's say we select cells B4:C5. With the cells selected, type the formula =TRANSPOSE(A1:B2). Now, here's the tricky part. Do NOT press the Enter key. Instead, hold down the Ctrl and Shift keys, and with them held down, press the Enter key. The transpose of the first matrix should appear in cells B4:C5. Holding down the Ctrl and Shift keys while pressing the Enter key is what Excel calls "entering an array formula".

Inverse

Select a block of empty cells big enough to hold the inverse, use the formula =MINVERSE(array), and enter it as an array formula. For example, to get the inverse of the first matrix, select a 2x2 block of cells (let's say D4:E5). With the cells selected, enter the formula =MINVERSE(A1:B2), and hold down Ctrl and Shift while you press Enter. Cells D4:E5 should look like this:

 0.388889   -0.16667
-0.44444    0.333333

Multiplication

Select a block of empty cells big enough to hold the product, use the formula =MMULT(array1, array2), and enter it as an array formula. To multiply the first matrix times the second, select a 2x2 block of cells (let's say G4:H5). With the cells selected, enter the formula =MMULT(A1:B2, D1:E2). Cells G4:H5 should look like this:

39   57
73   79

Excel multiplies array1 times array2. Remember that the order in which you specify the arrays is important in multiplying matrices! There is no commutative law in multiplying matrices!

General Hints

When selecting a block of empty cells for your results (transpose, inverse, or multiplication) it's okay to select a block of cells that's too big (too many rows and columns). Excel will fill in the appropriate number of cells, and put #N/A in the extra cells. However, do NOT select a block that's too small. Excel will fill in the cells you have selected, but does NOT tell you there were too few cells for a complete answer. You'll think you have the whole answer, but you don't.

If you just press Enter when you should have pressed Ctrl, Shift, Enter, you'll get just one number in your block of cells. Select the block again, press the F2 key, then hold down Ctrl and Shift while pressing Enter.

Excel formulas/functions are not case sensitive. You don't have to use all capitals as was done in the examples.


Dispersion Matrices

A dispersion matrix is a symmetrical square matrix whose elements are statistics which describe the dispersion of variables. Dispersion is synonymous with variability, so examples of dispersion statistics include the sum of squares, standard deviation, and variance. Quantities that show the relationships between variables (e.g. the correlation coefficient) are considered dispersion statistics, because they show how the variability of one variable relates to the variability of a second variable. Therefore, a matrix of correlation coefficients is considered a dispersion matrix.

Let's start with raw data (a data matrix), and work our way to a dispersion matrix. The following data are just made-up numbers for three variables, which we are calling X, Y, and Z.

X  Y  Z
4  2  2
3  1  2
0  1  2
5  0  3
6  1  2
5  3  3
5  6  0

Note: The letters X, Y, and Z are shown just for readability. They would not be considered part of the data matrix. Only the numbers are considered the data matrix.

Mean of X = 4. Standard deviation of X = 2.
Mean of Y = 2. Standard deviation of Y = 2.
Mean of Z = 2. Standard deviation of Z = 1.

Next, we subtract the mean of the variable from each data point in the variable. So, the first data point in X is 4, and the mean of X is 4. Therefore, 4 - 4 = 0. This gives a matrix called the deviation matrix or scores matrix. We don't consider this a dispersion matrix, because it is not a square matrix. Rather, it is a transformed data matrix. This matrix is often symbolized by lower case x. This is the scores matrix (x):

 X   Y   Z
 0   0   0
-1  -1   0
-4  -1   0
 1  -2   1
 2  -1   0
 1   1   1
 1   4  -2

Now, we write the transpose of the scores matrix (x'):

$$\mathbf{x}' = \begin{bmatrix} 0 & -1 & -4 & 1 & 2 & 1 & 1 \\ 0 & -1 & -1 & -2 & -1 & 1 & 4 \\ 0 & 0 & 0 & 1 & 0 & 1 & -2 \end{bmatrix}$$

Next, multiply x'x. Let's think about this. Since the order of x' is 3x7, and x is 7x3, that means their product will be 3x3. Doing the matrix multiplication yields:

$$\mathbf{x}'\mathbf{x} = \begin{bmatrix} 24 & 6 & 0 \\ 6 & 24 & -9 \\ 0 & -9 & 6 \end{bmatrix}$$

Notice this is a symmetrical, square matrix. Now, the key question: what statistical quantities are represented by the elements of this matrix? There are two things you need to think about:

1. The method for multiplying matrices, and
2. The elements of x' and x. Each element of the first row of x' and the first column of x is $(X_i - \bar{X})$. The second row of x' and second column of x is $(Y_i - \bar{Y})$. The third row and column is $(Z_i - \bar{Z})$.

The first person to put these two things together and correctly identify the elements of our square matrix, without looking at the next page, wins a free lunch from Dr. M. (Note: if you attempt to claim the prize, Dr. M. will weasel out by saying you cheated.)


Putting our two things from the previous page together, you're supposed to see that

$$\mathbf{x}'\mathbf{x} = \begin{bmatrix} \sum(X_i-\bar{X})(X_i-\bar{X}) & \sum(X_i-\bar{X})(Y_i-\bar{Y}) & \sum(X_i-\bar{X})(Z_i-\bar{Z}) \\ \sum(Y_i-\bar{Y})(X_i-\bar{X}) & \sum(Y_i-\bar{Y})(Y_i-\bar{Y}) & \sum(Y_i-\bar{Y})(Z_i-\bar{Z}) \\ \sum(Z_i-\bar{Z})(X_i-\bar{X}) & \sum(Z_i-\bar{Z})(Y_i-\bar{Y}) & \sum(Z_i-\bar{Z})(Z_i-\bar{Z}) \end{bmatrix}$$

which we can also write as

$$\mathbf{x}'\mathbf{x} = \begin{bmatrix} \sum(X_i-\bar{X})^2 & \sum(X_i-\bar{X})(Y_i-\bar{Y}) & \sum(X_i-\bar{X})(Z_i-\bar{Z}) \\ \sum(X_i-\bar{X})(Y_i-\bar{Y}) & \sum(Y_i-\bar{Y})^2 & \sum(Y_i-\bar{Y})(Z_i-\bar{Z}) \\ \sum(X_i-\bar{X})(Z_i-\bar{Z}) & \sum(Y_i-\bar{Y})(Z_i-\bar{Z}) & \sum(Z_i-\bar{Z})^2 \end{bmatrix}$$

You should immediately recognize the quantities along the principal diagonal. If you don't, find your BIO 211 Test Pac from the Biometrics class, and stare at the front cover until something looks familiar. These values are sums of squares. You also encountered the off-diagonal elements in Biometrics. Think back to regression and correlation. Consider the formulas for b, the regression coefficient (slope of the line), and the correlation coefficient (r) - they're both in the BIO 211 Test Pac. Just in case you can't find your Test Pac, the formula for b was

$$b = \frac{\sum(X_i-\bar{X})(Y_i-\bar{Y})}{\sum(X_i-\bar{X})^2}$$

The numerator was called the sum of the crossproducts. The off-diagonal elements are sums of crossproducts. Our matrix is therefore called the Sum of Squares and Crossproducts matrix, and is symbolized by SSCP. The SSCP is our first dispersion matrix.

Next, we multiply the elements of the SSCP by a scalar, namely $\frac{1}{n-1}$, where n is the sample size for each variable (n = 7). This yields the following matrix:

$$\begin{bmatrix} \dfrac{\sum(X_i-\bar{X})^2}{n-1} & \dfrac{\sum(X_i-\bar{X})(Y_i-\bar{Y})}{n-1} & \dfrac{\sum(X_i-\bar{X})(Z_i-\bar{Z})}{n-1} \\ \dfrac{\sum(X_i-\bar{X})(Y_i-\bar{Y})}{n-1} & \dfrac{\sum(Y_i-\bar{Y})^2}{n-1} & \dfrac{\sum(Y_i-\bar{Y})(Z_i-\bar{Z})}{n-1} \\ \dfrac{\sum(X_i-\bar{X})(Z_i-\bar{Z})}{n-1} & \dfrac{\sum(Y_i-\bar{Y})(Z_i-\bar{Z})}{n-1} & \dfrac{\sum(Z_i-\bar{Z})^2}{n-1} \end{bmatrix}$$

In terms of numbers:

$$\begin{bmatrix} \frac{24}{6} & \frac{6}{6} & \frac{0}{6} \\ \frac{6}{6} & \frac{24}{6} & \frac{-9}{6} \\ \frac{0}{6} & \frac{-9}{6} & \frac{6}{6} \end{bmatrix} = \begin{bmatrix} 4 & 1 & 0 \\ 1 & 4 & -1.5 \\ 0 & -1.5 & 1 \end{bmatrix}$$

Look at the formulas for the elements along the principal diagonal. You should recognize immediately that these are variances. The off-diagonal elements are called covariances. This matrix is called the Variance-Covariance Matrix, and is often symbolized as S² or S. The Sum of Squares and Crossproducts matrix (SSCP) and the Variance-Covariance Matrix (S²) are important dispersion matrices. But we're not done yet! There's one more very important dispersion matrix to develop!


First, take the square root of the elements along the principal diagonal of the Variance-Covariance Matrix. Since these elements are variances, their square roots are standard deviations. The element in the first row, first column of the Variance-Covariance matrix is the variance of the X variable, so when we take the square root, we have the standard deviation of X (sX). For the second row, second column, it will be the standard deviation of Y (sY). For the third row, third column, it will be the standard deviation of Z (sZ). Therefore: sX = 2, sY = 2, sZ = 1. Since X is the first variable, Y is the second, and Z the third, we can symbolize the standard deviations as s₁ = 2, s₂ = 2, s₃ = 1.

Now, this next part is confusing, but not really that hard. Consider the Variance-Covariance Matrix S². Let's use the symbol $S^2_{ij}$ to denote the element in row i, column j of the matrix. What we need to do is divide each element of the Variance-Covariance matrix by the product of the standard deviations that correspond to the row and column of that element. In other words, divide $S^2_{11}$ by the product s₁s₁. Divide $S^2_{12}$ by the product s₁s₂. Here's this process applied to the whole matrix:

$$\begin{bmatrix} \dfrac{S^2_{11}}{s_1 s_1} & \dfrac{S^2_{12}}{s_1 s_2} & \dfrac{S^2_{13}}{s_1 s_3} \\ \dfrac{S^2_{21}}{s_2 s_1} & \dfrac{S^2_{22}}{s_2 s_2} & \dfrac{S^2_{23}}{s_2 s_3} \\ \dfrac{S^2_{31}}{s_3 s_1} & \dfrac{S^2_{32}}{s_3 s_2} & \dfrac{S^2_{33}}{s_3 s_3} \end{bmatrix} = \begin{bmatrix} \frac{4}{2\times2} & \frac{1}{2\times2} & \frac{0}{2\times1} \\ \frac{1}{2\times2} & \frac{4}{2\times2} & \frac{-1.5}{2\times1} \\ \frac{0}{2\times1} & \frac{-1.5}{2\times1} & \frac{1}{1\times1} \end{bmatrix} = \begin{bmatrix} 1 & 0.25 & 0 \\ 0.25 & 1 & -0.75 \\ 0 & -0.75 & 1 \end{bmatrix}$$

Now, what is this new matrix?

The easiest thing to see is that the elements along the principal diagonal of this matrix will always be 1. Think about the first row, first column. The variance of X is the numerator. In the denominator, you multiply the standard deviation of X times the standard deviation of X. In other words, you square the standard deviation of X. What do you get when you square the standard deviation of X? That's right, the variance of X. So, you wind up dividing the variance of X by the variance of X, so of course you get 1. This pattern continues down the principal diagonal.

The off-diagonal elements are more complicated. Let's just look at one of them; how about the first row, second column (0.25). The numerator is the covariance of X and Y (the crossproduct divided by n - 1). The denominator is the standard deviation of X times the standard deviation of Y. Let's do the math:

$$\frac{\dfrac{\sum(X_i-\bar{X})(Y_i-\bar{Y})}{n-1}}{\sqrt{\dfrac{\sum(X_i-\bar{X})^2}{n-1}}\,\sqrt{\dfrac{\sum(Y_i-\bar{Y})^2}{n-1}}} = \frac{\sum(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum(X_i-\bar{X})^2\,\sum(Y_i-\bar{Y})^2}}$$

Now, who recognizes this last quantity? You've seen it before. It's the Pearson product-moment correlation coefficient (correlation coefficient). Therefore, the off-diagonal elements are correlation coefficients. Actually, the elements on the principal diagonal are also correlation coefficients. The correlation of any variable with itself is always 1. This matrix is the correlation matrix and is symbolized as R. The correlation matrix is our most important dispersion matrix.

The two most important dispersion matrices are the Variance-Covariance matrix and the Correlation matrix. They are used as input for many of our multivariate procedures. They both contain the same general information - they show the linear relationship between all pairs of the original variables. The Variance-Covariance matrix has units - the original units of the data. The Correlation matrix is standardized - there are no units. The Correlation matrix is a standardized version of the Variance-Covariance matrix. When we do multivariate procedures, if we want our results to have units, we use the Variance-Covariance matrix. If we want our results standardized (not affected by units), we use the Correlation matrix.
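To tie the whole chain together, here is a SAS/IML sketch (assuming SAS/IML is licensed) that builds the SSCP, Variance-Covariance, and Correlation matrices from the little X, Y, Z data matrix above; the same numbers can be cross-checked with PROC CORR and its COV option:

    PROC IML;
       x    = {4 2 2, 3 1 2, 0 1 2, 5 0 3, 6 1 2, 5 3 3, 5 6 0};
       n    = NROW(x);
       dev  = x - REPEAT(x[:,], n, 1);   * scores matrix: data minus column means;
       SSCP = T(dev) * dev;              * Sum of Squares and Crossproducts matrix;
       S2   = SSCP / (n - 1);            * Variance-Covariance matrix;
       s    = SQRT(VECDIAG(S2));         * standard deviations (2, 2, 1);
       R    = S2 / (s * T(s));           * Correlation matrix (elementwise division);
       PRINT SSCP, S2, R;
    QUIT;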


Multiple Regression Example - Pollution Data

DATA POLLUTE;
INPUT CITY $ SO2 TEMP FACTORYS POP WIND PRECIP PRCPDAYS;
* Data are means for 1969, 1970, and 1971.
  CITY     = city name (note that the $ tells SAS it is a character variable, i.e. not a number)
  SO2      = annual mean concentration of SO2 (micrograms/cubic meter)
  TEMP     = annual mean temperature (degrees F)
  FACTORYS = number of factories with 20 or more employees
  POP      = population (in thousands) from 1970 census
  WIND     = annual mean wind speed (miles/hour)
  PRECIP   = annual mean precipitation (inches)
  PRCPDAYS = annual mean number of days/year with precipitation
;
Datalines;
PHOENIX  10 70.3  213  582  6.0  7.05  36
SAN_FRAN 12 56.7  453  716  8.7 20.66  67
DENVER   17 51.9  454  515  9.0 12.95  86
MIAMI    10 75.5  207  335  9.0 59.80 128
ATLANTA  24 61.5  368  497  9.1 48.34 115
CHICAGO 110 50.6 3344 3369 10.4 34.44 122
NEW_ORLS  9 68.3  204  361  8.4 56.77 113
DETROIT  35 49.9 1064 1513 10.1 30.96 129
ST_LOUIS 56 55.9  775  622  9.5 35.89 105
ALBQURQE 11 56.8   46  244  8.9  7.77  58
CLEVLAND 65 49.7 1007  751 10.9 34.99 155
DALLAS    9 66.2  641  844 10.9 35.94  78
HOUSTON  10 68.9  721 1233 10.8 48.19 103
SLT_LAKE 28 51.0  137  176  8.7 15.17  89
SEATTLE  29 51.1  379  531  9.4 38.79 164
;
PROC PRINT;
PROC REG CORR SIMPLE;
* The CORR option will print a correlation matrix of the variables, which will allow us to inspect for multicollinearities. The SIMPLE option will print basic statistics for our variables;
MODEL SO2 = TEMP FACTORYS POP WIND PRECIP PRCPDAYS / STB;
* This first model will include all independent variables. The STB option requests that the standardized partial regression coefficients (standardized b values) be printed;
MODEL SO2 = TEMP FACTORYS POP WIND PRECIP PRCPDAYS / SELECTION=STEPWISE STB TOL VIF;
* This model will use the STEPWISE procedure to have SAS select important predictors. The TOL and VIF options print quantities to help determine if we have multicollinearities. TOL (TOLerance) shows how closely each predictor is correlated with the other predictors in the model. TOL is calculated as TOL = 1 - R^2, where R^2 is the multiple coefficient of determination of the other independent variables with the particular variable. A tolerance value less than 0.1 means the predictor is highly correlated with others, and therefore you have a problem. VIF is the Variance Inflation Factor and is calculated as VIF = 1/TOL; therefore a VIF greater than 10 indicates you have a problem with multicollinearities. It's called "Variance Inflation" because it shows how much the variance of the standardized b value is inflated by multicollinearity. Inflating the variance is a fancy way of saying there is a lot of error in the estimate of the b.;
RUN;


The SAS System          14:38 Monday, December 9, 2002   1

Obs  CITY       SO2  TEMP  FACTORYS   POP  WIND  PRECIP  PRCPDAYS
  1  PHOENIX     10  70.3       213   582   6.0    7.05        36
  2  SAN_FRAN    12  56.7       453   716   8.7   20.66        67
  3  DENVER      17  51.9       454   515   9.0   12.95        86
  4  MIAMI       10  75.5       207   335   9.0   59.80       128
  5  ATLANTA     24  61.5       368   497   9.1   48.34       115
  6  CHICAGO    110  50.6      3344  3369  10.4   34.44       122
  7  NEW_ORLS     9  68.3       204   361   8.4   56.77       113
  8  DETROIT     35  49.9      1064  1513  10.1   30.96       129
  9  ST_LOUIS    56  55.9       775   622   9.5   35.89       105
 10  ALBQURQE    11  56.8        46   244   8.9    7.77        58
 11  CLEVLAND    65  49.7      1007   751  10.9   34.99       155
 12  DALLAS       9  66.2       641   844  10.9   35.94        78
 13  HOUSTON     10  68.9       721  1233  10.8   48.19       103
 14  SLT_LAKE    28  51.0       137   176   8.7   15.17        89
 15  SEATTLE     29  51.1       379   531   9.4   38.79       164

The SAS System          14:38 Monday, December 9, 2002   2

The REG Procedure

Descriptive Statistics
                                     Uncorrected                   Standard
Variable          Sum       Mean              SS     Variance     Deviation
Intercept    15.00000    1.00000        15.00000            0             0
TEMP        884.30000   58.95333           53211     77.06552       8.77870
FACTORYS        10013  667.53333        15700937       644066     802.53730
POP             12289  819.26667        18801473       623822     789.82389
WIND        139.80000    9.32000      1325.00000      1.57600       1.25539
PRECIP      487.71000   32.51400           19846    284.90137      16.87902
PRCPDAYS   1548.00000  103.20000          177008   1232.45714      35.10637
SO2         435.00000   29.00000           23903    806.28571      28.39517

Correlation
Variable      TEMP  FACTORYS       POP      WIND    PRECIP  PRCPDAYS       SO2
TEMP        1.0000   -0.3765   -0.2767   -0.3133    0.4250   -0.3140   -0.5847
FACTORYS   -0.3765    1.0000    0.9675    0.4757    0.1002    0.3031    0.8773
POP        -0.2767    0.9675    1.0000    0.4334    0.0908    0.2150    0.7499
WIND       -0.3133    0.4757    0.4334    1.0000    0.4021    0.5618    0.4116
PRECIP      0.4250    0.1002    0.0908    0.4021    1.0000    0.6467    0.0389
PRCPDAYS   -0.3140    0.3031    0.2150    0.5618    0.6467    1.0000    0.4513
SO2        -0.5847    0.8773    0.7499    0.4116    0.0389    0.4513    1.0000


The SAS System          14:38 Monday, December 9, 2002   3

The REG Procedure
Model: MODEL1
Dependent Variable: SO2

Analysis of Variance
                            Sum of         Mean
Source           DF        Squares       Square   F Value   Pr > F
Model             6          10817   1802.75491     30.59   <.0001
Error             8      471.47054     58.93382
Corrected Total  14          11288

Root MSE         7.67684   R-Square  0.9582
Dependent Mean  29.00000   Adj R-Sq  0.9269
Coeff Var       26.47185

Parameter Estimates
                   Parameter   Standard                       Standardized
Variable    DF      Estimate      Error  t Value  Pr > |t|        Estimate
Intercept    1      54.44194   48.07122     1.13    0.2902               0
TEMP         1      -0.34631    0.59357    -0.58    0.5757        -0.10707
FACTORYS     1       0.07233    0.01197     6.04    0.0003         2.04438
POP          1      -0.04431    0.01156    -3.83    0.0050        -1.23248
WIND         1      -3.06619    2.32270    -1.32    0.2233        -0.13556
PRECIP       1      -0.12776    0.36184    -0.35    0.7332        -0.07594
PRCPDAYS     1       0.15233    0.15038     1.01    0.3407         0.18834

The SAS System          14:38 Monday, December 9, 2002   4

The REG Procedure
Model: MODEL2
Dependent Variable: SO2

Stepwise Selection: Step 1
Variable FACTORYS Entered: R-Square = 0.7697 and C(p) = 33.1164

Analysis of Variance
                            Sum of         Mean
Source           DF        Squares       Square   F Value   Pr > F
Model             1     8688.05333   8688.05333     43.44   <.0001
Error            13     2599.94667    199.99590
Corrected Total  14          11288

             Parameter   Standard
Variable      Estimate      Error   Type II SS   F Value   Pr > F
Intercept      8.27927    4.81835    590.48488      2.95   0.1095
FACTORYS       0.03104    0.00471   8688.05333     43.44   <.0001

Bounds on condition number: 1, 1
------------------------------------------------------------------------------------------------

Stepwise Selection: Step 2
Variable POP Entered: R-Square = 0.9225 and C(p) = 5.8488


Analysis of Variance
                            Sum of         Mean
Source           DF        Squares       Square   F Value   Pr > F
Model             2          10413   5206.45100     71.39   <.0001
Error            12      875.09799     72.92483
Corrected Total  14          11288

The SAS System          14:38 Monday, December 9, 2002   5

The REG Procedure
Model: MODEL2
Dependent Variable: SO2

Stepwise Selection: Step 2
             Parameter   Standard
Variable      Estimate      Error   Type II SS   F Value   Pr > F
Intercept     18.48185    3.58698   1936.01791     26.55   0.0002
FACTORYS       0.08393    0.01124   4065.74405     55.75   <.0001
POP           -0.05554    0.01142   1724.84868     23.65   0.0004

Bounds on condition number: 15.621, 62.485
------------------------------------------------------------------------------------------------

Stepwise Selection: Step 3
Variable TEMP Entered: R-Square = 0.9417 and C(p) = 4.1683

Analysis of Variance
                            Sum of         Mean
Source           DF        Squares       Square   F Value   Pr > F
Model             3          10630   3543.26986     59.22   <.0001
Error            11      658.19043     59.83549
Corrected Total  14          11288

             Parameter   Standard
Variable      Estimate      Error   Type II SS   F Value   Pr > F
Intercept     49.21771   16.46685    534.54034      8.93   0.0123
TEMP          -0.52173    0.27402    216.90756      3.63   0.0834
FACTORYS       0.07423    0.01138   2543.75168     42.51   <.0001
POP           -0.04762    0.01115   1090.81469     18.23   0.0013

Bounds on condition number: 19.531, 117.11
------------------------------------------------------------------------------------------------

All variables left in the model are significant at the 0.1500 level.
No other variable met the 0.1500 significance level for entry into the model.


The SAS System          14:38 Monday, December 9, 2002   6

The REG Procedure
Model: MODEL2
Dependent Variable: SO2

Summary of Stepwise Selection
      Variable  Variable   Number   Partial     Model
Step  Entered   Removed   Vars In  R-Square  R-Square      C(p)   F Value   Pr > F
 1    FACTORYS                 1    0.7697    0.7697    33.1164     43.44   <.0001
 2    POP                      2    0.1528    0.9225     5.8488     23.65   0.0004
 3    TEMP                     3    0.0192    0.9417     4.1683      3.63   0.0834

The SAS System          14:38 Monday, December 9, 2002   7

The REG Procedure
Model: MODEL2
Dependent Variable: SO2

Analysis of Variance
                            Sum of         Mean
Source           DF        Squares       Square   F Value   Pr > F
Model             3          10630   3543.26986     59.22   <.0001
Error            11      658.19043     59.83549
Corrected Total  14          11288

Root MSE         7.73534   R-Square  0.9417
Dependent Mean  29.00000   Adj R-Sq  0.9258
Coeff Var       26.67359

Parameter Estimates
                   Parameter   Standard                       Standardized
Variable    DF      Estimate      Error  t Value  Pr > |t|        Estimate  Tolerance
Intercept    1      49.21771   16.46685     2.99    0.0123               0          .
TEMP         1      -0.52173    0.27402    -1.90    0.0834        -0.16130    0.73857
FACTORYS     1       0.07423    0.01138     6.52    <.0001         2.09793    0.05120
POP          1      -0.04762    0.01115    -4.27    0.0013        -1.32445    0.05509

Parameter Estimates
                    Variance
Variable    DF     Inflation
Intercept    1             0
TEMP         1       1.35397
FACTORYS     1      19.53106
POP          1      18.15249
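Two arithmetic checks tie this output together (these are standard regression relationships, not spelled out in the Pac). First, the standardized b printed by the STB option is the raw b rescaled by the standard deviations from the descriptive statistics. For TEMP in the full model:

   standardized b = b × (sTEMP/sSO2) = -0.34631 × (8.77870/28.39517) = -0.10707

which matches the Standardized Estimate column. Second, VIF is the reciprocal of tolerance. For FACTORYS in the final stepwise model:

   VIF = 1/TOL = 1/0.05120 = 19.53106

which matches the Variance Inflation column, and (being greater than 10) flags the FACTORYS-POP multicollinearity.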


Multiple Regression - Ridge Trace - Example

DATA POLLUTE;
INPUT CITY $ SO2 TEMP FACTORYS POP WIND PRECIP PRCPDAYS;
Datalines;
PHOENIX 10 70.3 213 582 6.0 7.05 36
SAN_FRAN 12 56.7 453 716 8.7 20.66 67
DENVER 17 51.9 454 515 9.0 12.95 86
MIAMI 10 75.5 207 335 9.0 59.80 128
ATLANTA 24 61.5 368 497 9.1 48.34 115
CHICAGO 110 50.6 3344 3369 10.4 34.44 122
NEW_ORLS 9 68.3 204 361 8.4 56.77 113
DETROIT 35 49.9 1064 1513 10.1 30.96 129
ST_LOUIS 56 55.9 775 622 9.5 35.89 105
ALBQURQE 11 56.8 46 244 8.9 7.77 58
CLEVLAND 65 49.7 1007 751 10.9 34.99 155
DALLAS 9 66.2 641 844 10.9 35.94 78
HOUSTON 10 68.9 721 1233 10.8 48.19 103
SLT_LAKE 28 51.0 137 176 8.7 15.17 89
SEATTLE 29 51.1 379 531 9.4 38.79 164
;
PROC REG OUTSTB OUTEST=BVALS RIDGE = 0 TO 0.3 BY 0.006;
* The specified options request a ridge trace, with the ridge value (or ridge factor) starting at 0 and going up to a maximum of 0.3. The value is incremented by 0.006 each time. The OUTSTB and OUTEST=BVALS options create a data set with the standardized b values (in addition to the unstandardized b values).;
MODEL SO2 = TEMP FACTORYS POP WIND PRECIP PRCPDAYS / STB;
* This model will include all independent variables. The STB option requests that the standardized partial regression coefficients (standardized b values) be printed;
PROC SORT;
BY _TYPE_;
* SAS does a ridge trace on both standardized and unstandardized b values, but we're only interested in the standardized values. The _TYPE_ variable has the value RIDGE for unstandardized and RIDGESTB for standardized b values. We sort by _TYPE_ to group all of our standardized values together - this just makes it easier to read the output.;
PROC PRINT;
VAR _TYPE_ TEMP FACTORYS POP WIND PRECIP PRCPDAYS;
RUN;
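The Pac does not give the formula behind the RIDGE option, but the usual textbook form of the ridge estimator (for standardized variables) is:

   b(k) = (X'X + kI)^-1 X'Y

where k is the ridge value. At k = 0 this is just ordinary least squares - which is why the first RIDGE row in the output below is identical to the PARMS row - and as k increases the b values are shrunk toward more stable values.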


The SAS System          14:25 Tuesday, December 10, 2002   1

The REG Procedure
Model: MODEL1
Dependent Variable: SO2

Analysis of Variance
                            Sum of         Mean
Source           DF        Squares       Square   F Value   Pr > F
Model             6          10817   1802.75491     30.59   <.0001
Error             8      471.47054     58.93382
Corrected Total  14          11288

Root MSE         7.67684   R-Square  0.9582
Dependent Mean  29.00000   Adj R-Sq  0.9269
Coeff Var       26.47185

Parameter Estimates
                   Parameter   Standard                       Standardized
Variable    DF      Estimate      Error  t Value  Pr > |t|        Estimate
Intercept    1      54.44194   48.07122     1.13    0.2902               0
TEMP         1      -0.34631    0.59357    -0.58    0.5757        -0.10707
FACTORYS     1       0.07233    0.01197     6.04    0.0003         2.04438
POP          1      -0.04431    0.01156    -3.83    0.0050        -1.23248
WIND         1      -3.06619    2.32270    -1.32    0.2233        -0.13556
PRECIP       1      -0.12776    0.36184    -0.35    0.7332        -0.07594
PRCPDAYS     1       0.15233    0.15038     1.01    0.3407         0.18834

The SAS System          14:25 Tuesday, December 10, 2002   2

Obs  _TYPE_       TEMP   FACTORYS        POP       WIND     PRECIP   PRCPDAYS
  1  PARMS    -0.34631   0.072334  -0.044309   -3.06619   -0.12776    0.15233
  2  RIDGE    -0.34631   0.072334  -0.044309   -3.06619   -0.12776    0.15233
  3  RIDGE    -0.51046   0.060544  -0.033063   -3.15140   -0.08085    0.15295
  4  RIDGE    -0.61162   0.052666  -0.025577   -3.16829   -0.05412    0.15447
  5  RIDGE    -0.67837   0.047023  -0.020238   -3.14943   -0.03791    0.15613
  6  RIDGE    -0.72453   0.042776  -0.016241   -3.11042   -0.02770    0.15766
  7  RIDGE    -0.75755   0.039459  -0.013138   -3.05955   -0.02110    0.15898
  8  RIDGE    -0.78176   0.036795  -0.010662   -3.00157   -0.01680    0.16007
  9  RIDGE    -0.79981   0.034604  -0.008642   -2.93933   -0.01400    0.16094
 10  RIDGE    -0.81343   0.032770  -0.006964   -2.87465   -0.01221    0.16161
 11  RIDGE    -0.82379   0.031210  -0.005549   -2.80869   -0.01112    0.16210
 12  RIDGE    -0.83166   0.029865  -0.004341   -2.74224   -0.01051    0.16244
 13  RIDGE    -0.83764   0.028694  -0.003298   -2.67584   -0.01025    0.16264
 14  RIDGE    -0.84213   0.027662  -0.002390   -2.60985   -0.01023    0.16273
 15  RIDGE    -0.84544   0.026747  -0.001593   -2.54453   -0.01038    0.16272
 16  RIDGE    -0.84781   0.025928  -0.000888   -2.48006   -0.01066    0.16262
 17  RIDGE    -0.84941   0.025191  -0.000260   -2.41656   -0.01103    0.16244
 18  RIDGE    -0.85039   0.024523   0.000301   -2.35411   -0.01145    0.16221
 19  RIDGE    -0.85085   0.023914   0.000806   -2.29277   -0.01192    0.16191
 20  RIDGE    -0.85088   0.023357   0.001262   -2.23257   -0.01241    0.16157
 21  RIDGE    -0.85055   0.022845   0.001676   -2.17353   -0.01292    0.16119
 22  RIDGE    -0.84993   0.022373   0.002052   -2.11565   -0.01344    0.16078
 23  RIDGE    -0.84905   0.021935   0.002396   -2.05894   -0.01395    0.16033
 24  RIDGE    -0.84796   0.021527   0.002711   -2.00338   -0.01447    0.15986
 25  RIDGE    -0.84669   0.021148   0.003000   -1.94897   -0.01497    0.15936
 26  RIDGE    -0.84527   0.020792   0.003267   -1.89568   -0.01547    0.15885
 27  RIDGE    -0.84371   0.020459   0.003513   -1.84350   -0.01595    0.15832
 28  RIDGE    -0.84204   0.020145   0.003741   -1.79241   -0.01642    0.15778


Obs  _TYPE_          TEMP   FACTORYS        POP       WIND     PRECIP   PRCPDAYS
 29  RIDGE       -0.84028   0.019849   0.003952   -1.74239   -0.01688    0.15722
 30  RIDGE       -0.83843   0.019570   0.004148   -1.69341   -0.01732    0.15666
 31  RIDGE       -0.83651   0.019306   0.004330   -1.64545   -0.01774    0.15609
 32  RIDGE       -0.83453   0.019055   0.004500   -1.59849   -0.01816    0.15551
 33  RIDGE       -0.83250   0.018816   0.004659   -1.55250   -0.01855    0.15493
 34  RIDGE       -0.83043   0.018589   0.004808   -1.50747   -0.01893    0.15434
 35  RIDGE       -0.82832   0.018373   0.004947   -1.46336   -0.01930    0.15376
 36  RIDGE       -0.82618   0.018166   0.005077   -1.42015   -0.01965    0.15317
 37  RIDGE       -0.82402   0.017968   0.005199   -1.37783   -0.01999    0.15257
 38  RIDGE       -0.82183   0.017779   0.005313   -1.33636   -0.02031    0.15198
 39  RIDGE       -0.81962   0.017597   0.005421   -1.29574   -0.02062    0.15139
 40  RIDGE       -0.81740   0.017423   0.005523   -1.25593   -0.02092    0.15080
 41  RIDGE       -0.81517   0.017255   0.005618   -1.21692   -0.02120    0.15022
 42  RIDGE       -0.81293   0.017094   0.005708   -1.17868   -0.02148    0.14963
 43  RIDGE       -0.81069   0.016938   0.005793   -1.14120   -0.02173    0.14905
 44  RIDGE       -0.80844   0.016788   0.005873   -1.10446   -0.02198    0.14847
 45  RIDGE       -0.80619   0.016643   0.005949   -1.06844   -0.02222    0.14789
 46  RIDGE       -0.80394   0.016503   0.006020   -1.03313   -0.02244    0.14731
 47  RIDGE       -0.80169   0.016368   0.006088   -0.99849   -0.02266    0.14674
 48  RIDGE       -0.79944   0.016237   0.006152   -0.96453   -0.02286    0.14617
 49  RIDGE       -0.79719   0.016110   0.006212   -0.93121   -0.02306    0.14561
 50  RIDGE       -0.79495   0.015987   0.006270   -0.89853   -0.02324    0.14505
 51  RIDGE       -0.79271   0.01587    0.00632    -0.86647   -0.023417   0.14449
 52  RIDGE       -0.79048   0.01575    0.00638    -0.83501   -0.023583   0.14394
 53  RIDGESTB    -0.10707   2.04438   -1.23248   -0.13556   -0.075943    0.18834
 54  RIDGESTB    -0.15781   1.71115   -0.91967   -0.13933   -0.048061    0.18910
 55  RIDGESTB    -0.18909   1.48852   -0.71144   -0.14007   -0.032170    0.19098
 56  RIDGESTB    -0.20973   1.32903   -0.56293   -0.13924   -0.022538    0.19303
 57  RIDGESTB    -0.22400   1.20898   -0.45174   -0.13752   -0.016465    0.19493
 58  RIDGESTB    -0.23421   1.11525   -0.36544   -0.13527   -0.012544    0.19656
 59  RIDGESTB    -0.24169   1.03993   -0.29657   -0.13270   -0.009986    0.19791
 60  RIDGESTB    -0.24727   0.97803   -0.24037   -0.12995   -0.008321    0.19898
 61  RIDGESTB    -0.25148   0.92619   -0.19370   -0.12709   -0.007258    0.19981
 62  RIDGESTB    -0.25468   0.88209   -0.15434   -0.12418   -0.006608    0.20042
 63  RIDGESTB    -0.25712   0.84409   -0.12074   -0.12124   -0.006248    0.20083
 64  RIDGESTB    -0.25897   0.81097   -0.09174   -0.11830   -0.006091    0.20108
 65  RIDGESTB    -0.26035   0.78183   -0.06648   -0.11538   -0.006079    0.20119
 66  RIDGESTB    -0.26138   0.75595   -0.04430   -0.11250   -0.006171    0.20117
 67  RIDGESTB    -0.26211   0.73281   -0.02469   -0.10965   -0.006337    0.20105
 68  RIDGESTB    -0.26260   0.71197   -0.00724   -0.10684   -0.006554    0.20084
 69  RIDGESTB    -0.26291   0.69309    0.00838   -0.10408   -0.006807    0.20054
 70  RIDGESTB    -0.26305   0.67589    0.02242   -0.10137   -0.007085    0.20018
 71  RIDGESTB    -0.26306   0.66015    0.03510   -0.09870   -0.007379    0.19976
 72  RIDGESTB    -0.26296   0.64568    0.04661   -0.09609   -0.007681    0.19929
 73  RIDGESTB    -0.26277   0.63232    0.05708   -0.09354   -0.007988    0.19877
 74  RIDGESTB    -0.26250   0.61994    0.06664   -0.09103   -0.008294    0.19822
 75  RIDGESTB    -0.26216   0.60843    0.07540   -0.08857   -0.008599    0.19764
 76  RIDGESTB    -0.26177   0.59770    0.08345   -0.08617   -0.008899    0.19703
 77  RIDGESTB    -0.26132   0.58765    0.09087   -0.08381   -0.009194    0.19639
 78  RIDGESTB    -0.26084   0.57823    0.09771   -0.08150   -0.009481    0.19574
 79  RIDGESTB    -0.26033   0.56936    0.10405   -0.07925   -0.009761    0.19507
 80  RIDGESTB    -0.25978   0.56101    0.10992   -0.07703   -0.010032    0.19438
 81  RIDGESTB    -0.25921   0.55311    0.11538   -0.07487   -0.010294    0.19368
 82  RIDGESTB    -0.25862   0.54564    0.12045   -0.07275   -0.010548    0.19298
 83  RIDGESTB    -0.25801   0.53855    0.12518   -0.07067   -0.010793    0.19227
 84  RIDGESTB    -0.25738   0.53181    0.12960   -0.06864   -0.011029    0.19155
 85  RIDGESTB    -0.25674   0.52539    0.13373   -0.06665   -0.011255    0.19082
 86  RIDGESTB    -0.25609   0.51927    0.13759   -0.06470   -0.011473    0.19010
 87  RIDGESTB    -0.25542   0.51343    0.14121   -0.06279   -0.011682    0.18937
 88  RIDGESTB    -0.25475   0.50784    0.14461   -0.06092   -0.011883    0.18864
 89  RIDGESTB    -0.25408   0.50249    0.14780   -0.05908   -0.012075    0.18791
 90  RIDGESTB    -0.25340   0.49735    0.15080   -0.05729   -0.012260    0.18718
 91  RIDGESTB    -0.25271   0.49242    0.15362   -0.05553   -0.012436    0.18645
 92  RIDGESTB    -0.25202   0.48768    0.15627   -0.05380   -0.012605    0.18572
 93  RIDGESTB    -0.25133   0.48312    0.15878   -0.05211   -0.012766    0.18500
 94  RIDGESTB    -0.25063   0.47873    0.16114   -0.05045   -0.012920    0.18427


Obs  _TYPE_          TEMP   FACTORYS        POP        WIND     PRECIP   PRCPDAYS
 95  RIDGESTB    -0.24994   0.47449    0.16337   -0.04883   -0.013067    0.18356
 96  RIDGESTB    -0.24924   0.47040    0.16547   -0.04724   -0.013207    0.18284
 97  RIDGESTB    -0.24855   0.46644    0.16746   -0.04568   -0.013341    0.18213
 98  RIDGESTB    -0.24785   0.46262    0.16934   -0.04414   -0.013469    0.18143
 99  RIDGESTB    -0.24716   0.45891    0.17112   -0.04264   -0.013590    0.18072
100  RIDGESTB    -0.24646   0.45533    0.17280   -0.04117   -0.013705    0.18003
101  RIDGESTB    -0.24577   0.45185    0.17439   -0.039725  -0.013815    0.17933
102  RIDGESTB    -0.24508   0.44848    0.17590   -0.038308  -0.013920    0.17865
103  RIDGESTB    -0.24439   0.44520    0.17733   -0.036917  -0.014019    0.17796

[Graph: ridge trace. The standardized b values (y-axis, -1.5 to 2.5) are plotted against the ridge value (x-axis, 0 to 0.35) for Temp, Factorys, Pop, Wind, Precip, and PrcpDays.]

Note: the above graph was produced with Excel, not with SAS

Polynomial Regression

In polynomial regression, we have a single independent variable, but we develop a multiple regression model by using successive powers of the independent variable. The model may be written:

   Y = a + b1X + b2X² + b3X³ + … + bkXᵏ

Notice there is (as usual) a single dependent variable (Y), and a single independent variable (X). Polynomial regression is used when we have X,Y data that are not linear. The polynomial regression equation produces a line that has inflection points (which is a fancy way of saying it bends). The number of bends in the line is a function of the highest power (the degree) of the equation. A polynomial of degree k produces a line with up to k-1 inflection points. Our simple linear regression equation (Y = a + bX) can be described as a polynomial of degree 1. Since the degree (k) is 1, the line will bend k-1 = 0 times. A line that doesn't bend is what we call a straight line! You should recognize a polynomial of degree 2 (Y = a + b1X + b2X²) as a quadratic equation, which has one inflection point (bend), i.e. this is a parabola. Here's a summary table, up to a 5th degree polynomial:

   Degree   Name        Equation
     1      Linear      Y = a + b1X
     2      Quadratic   Y = a + b1X + b2X²
     3      Cubic       Y = a + b1X + b2X² + b3X³
     4      Quartic     Y = a + b1X + b2X² + b3X³ + b4X⁴
     5      Quintic     Y = a + b1X + b2X² + b3X³ + b4X⁴ + b5X⁵

The reason for doing a polynomial regression is usually to demonstrate that the relationship between X and Y is nonlinear. Generally, there is no biological interpretation to the degree, or to any of the powers of X. Using the successive powers of X is simply a convenient way to demonstrate a nonlinear relationship.

To do a polynomial regression, first examine a graph of the data and see what might be reasonable in terms of a possible degree. You want a degree high enough to model the inflections in the data, but don't consider very high degrees - especially degrees near the sample size. For example, if you have n = 25 X,Y points, you can fit the data pretty well with a polynomial of degree 23 - you're basically bending the curve to hit each point. This would not produce any useful information. Have the computer (e.g. SAS) do successive models, adding an additional degree each time. Examine the F statistic for each successive term (i.e. Ho: β = 0). If the term is significant, keep the term in the model. If not significant, stop with the previous term. Besides the example here, you can find an example and discussion of polynomial regression in Zar (4th edition), Chapter 21.

Consider the data shown below, which are soil moisture as a function of distance from a creek. The relationship is clearly nonlinear, and would appear to have at least two major inflection points. We'll use SAS to do polynomial regression.

[Graph: scatterplot of soil moisture value (y-axis, 0 to 12) against distance from creek in meters (x-axis, 0 to 400).]

Data Polynomial_Reg;
* X is distance from the creek in meters. Y is soil moisture. The soil moisture was measured with an inexpensive meter, which did not provide units (although it is some measure of electrical conductivity). We'll treat the numbers as a ratio scale variable for the purposes of this example. ;
INPUT X Y @@;
* The @@ symbol allows you to have multiple X Y values on a single line. In the data below, the first line has nine X,Y pairs. This just keeps the DATALINES portion shorter, and saves some space.;
X2 = X**2; X3 = X**3; X4 = X**4; X5 = X**5; X6 = X**6; X7 = X**7; X8 = X**8;
* Note that we create the powers of the independent variable (X) in the DATA segment. As you can see, X2 is X squared, X3 is X cubed, and so on. When doing X to all of these powers, you need to be careful not to let the numbers get too big or too small. For example, if X has values in the thousands, divide all of them by 1000 before taking the powers (1,000 to the 6th power is a big number). If your numbers are small, multiply them by something so most of them are between 1 and 10 (0.001 to the 6th power is a small number).;
DATALINES;
1 10 6 1.5 11 3 16 3 21 1 26 4 31 1.5 36 3.5 41 4
46 1 51 1.75 56 1.5 61 1 66 1 71 1 76 0.5 81 0.5 86 0.5
91 0.5 96 1 101 1 106 1 111 1 116 1 121 1 126 0.5 131 0.5
136 1 141 1 146 1.5 151 0.5 156 1 161 1 166 0.5 171 1 176 1
181 1 186 1.75 191 0.5 196 1.9 201 2.21 206 5.5 211 2 216 2.5 221 3
226 1.5 231 2 236 1 241 2 246 3 251 3.5 256 6 261 8 266 6
271 10 276 10 281 7 286 7 291 8 296 5.5 301 7 306 9 311 3.5
316 3 321 6 326 4.5 331 6.5 336 4 341 4 346 1.5 351 1.75 356 4.5
361 3 366 1 371 3.5 376 2
;
PROC GLM;
   MODEL Y = X / SS1;
PROC GLM;
   MODEL Y = X X2 / SS1;
PROC GLM;
   MODEL Y = X X2 X3 / SS1;
PROC GLM;
   MODEL Y = X X2 X3 X4 / SS1 P;
PROC GLM;
   MODEL Y = X X2 X3 X4 X5 / SS1;
PROC GLM;
   MODEL Y = X X2 X3 X4 X5 X6 / SS1;
PROC GLM;
   MODEL Y = X X2 X3 X4 X5 X6 X7 / SS1;
PROC GLM;
   MODEL Y = X X2 X3 X4 X5 X6 X7 X8 / SS1;
* What we want to do is add successive powers, one at a time. The easiest way to do this on SAS is just to provide separate PROC GLM statements. The SS1 option tells SAS to use Type I Sums of Squares. Type I SS are appropriate when the predictors should be entered in a particular order. This is the case in polynomial regression, i.e. we enter the linear term first, then the quadratic, then the cubic, and so on. On the SAS output for PROC GLM, look at the F Value test of the sources, not the t-tests (they are not equivalent here because of the use of the Type I SS). Don't worry about Type I SS - that's covered in the ANOVA class and is not part of this class. The / P on the fourth degree model tells SAS to print the predicted values from that model (i.e. the predicted values for the quartic model). From our graph, we saw that we need at least two major inflection points, so we're interested in a degree 3 polynomial (at a minimum). What we do is evaluate the F value of the sources for each successive term. If it's significant, we keep the term in the model. If not, we stop with the previous term. In this example, the first (linear) degree is significant, but the second (quadratic) is not. In other words, a parabola does not fit these data very well - but we already knew that by looking at the graph. Since the graph suggested a minimum degree of 3, we keep going. The third (cubic) and fourth (quartic) terms were both significant in the model when they were added. However, the fifth was not significant. At degree 6, all of the terms except degree 5 are highly significant, but neither degree 7 nor 8 is significant. So, what should be our final model? Degrees 3, 4, and 6 all are candidates. The degree 6 model has the highest r-squared (which is expected because it has more predictors). If you examine graphs of the lines produced, they all seem to fit the data pretty well. The answer to our question is: it really doesn't matter too much if you go with degree 3, 4, or 6. They all demonstrate nicely that the relationship is nonlinear, and all of them model the data rather well. Degree four is a nice compromise for our final model, but there's no compelling reason why you couldn't use degree 6. They all show nonlinearity in the data. Selecting the final model is subjective. Notice that just because you have fit a polynomial of degree k to your data, you won't always see all k-1 inflections on the graph. Look at the graph for the quartic (degree 4). You don't really see 3 inflection points. Notice the parameter value for X4: it's -0.000000005, which is almost zero! So that's not much of an inflection at all.;
run;

See the graphs following the SAS output.

The SAS System   1

The GLM Procedure
Number of Observations Read   76
Number of Observations Used   76
_______________________________________________________________________________________________

The SAS System   2

The GLM Procedure
Dependent Variable: Y
                              Sum of
Source           DF          Squares    Mean Square   F Value   Pr > F
Model             1       83.4452221     83.4452221     14.74   0.0003
Error            74      418.8303305      5.6598693
Corrected Total  75      502.2755526

R-Square   Coeff Var   Root MSE    Y Mean
0.166134    81.49628   2.379048  2.919211

Source   DF      Type I SS   Mean Square   F Value   Pr > F
X         1    83.44522215   83.44522215     14.74   0.0003

                              Standard
Parameter       Estimate         Error   t Value   Pr > |t|
Intercept    1.118475051    0.54259768      2.06     0.0428
X            0.009552973    0.00248795      3.84     0.0003
_______________________________________________________________________________________________

The SAS System   4

The GLM Procedure


Dependent Variable: Y
                              Sum of
Source           DF          Squares    Mean Square   F Value   Pr > F
Model             2       98.1018695     49.0509347      8.86   0.0004
Error            73      404.1736832      5.5366258
Corrected Total  75      502.2755526

R-Square   Coeff Var   Root MSE    Y Mean
0.195315    80.60411   2.353004  2.919211

Source   DF      Type I SS   Mean Square   F Value   Pr > F
X         1    83.44522215   83.44522215     15.07   0.0002
X2        1    14.65664732   14.65664732      2.65   0.1080

                              Standard
Parameter        Estimate        Error   t Value   Pr > |t|
Intercept     2.077773967   0.79726556      2.61     0.0111
X            -0.005835976   0.00977318     -0.60     0.5523
X2            0.000040819   0.00002509      1.63     0.1080
_______________________________________________________________________________________________

The SAS System   6

The GLM Procedure
Dependent Variable: Y
                              Sum of
Source           DF          Squares    Mean Square   F Value   Pr > F
Model             3      285.3329378     95.1109793     31.57   <.0001
Error            72      216.9426149      3.0130919
Corrected Total  75      502.2755526

R-Square   Coeff Var   Root MSE    Y Mean
0.568080    59.46217   1.735826  2.919211

Source   DF      Type I SS   Mean Square   F Value   Pr > F
X         1     83.4452221    83.4452221     27.69   <.0001
X2        1     14.6566473    14.6566473      4.86   0.0306
X3        1    187.2310683   187.2310683     62.14   <.0001

                              Standard
Parameter        Estimate        Error   t Value   Pr > |t|
Intercept     6.043010892   0.77391761      7.81     <.0001
X            -0.134565276   0.01785104     -7.54     <.0001


X2            0.000897797   0.00011028      8.14     <.0001
X3           -0.000001515   0.00000019     -7.88     <.0001
_______________________________________________________________________________________________

The SAS System   8

The GLM Procedure
Dependent Variable: Y
                              Sum of
Source           DF          Squares    Mean Square   F Value   Pr > F
Model             4      301.1172138     75.2793035     26.57   <.0001
Error            71      201.1583388      2.8332160
Corrected Total  75      502.2755526

R-Square   Coeff Var   Root MSE    Y Mean
0.599506    57.65997   1.683216  2.919211

Source   DF      Type I SS   Mean Square   F Value   Pr > F
X         1     83.4452221    83.4452221     29.45   <.0001
X2        1     14.6566473    14.6566473      5.17   0.0260
X3        1    187.2310683   187.2310683     66.08   <.0001
X4        1     15.7842761    15.7842761      5.57   0.0210

                              Standard
Parameter        Estimate        Error   t Value   Pr > |t|
Intercept     4.775668728   0.92276278      5.18     <.0001
X            -0.064920153   0.03420924     -1.90     0.0618
X2            0.000059026   0.00037110      0.16     0.8741
X3            0.000001954   0.00000148      1.32     0.1915
X4           -0.000000005   0.00000000     -2.36     0.0210
_______________________________________________________________________________________________

The SAS System   9

The GLM Procedure
Observation      Observed     Predicted      Residual
          1   10.00000000    4.71080955    5.28919045
          2    1.50000000    4.38868889   -2.88868889
          3    3.00000000    4.07122288   -1.07122288
          4    3.00000000    3.75975989   -0.75975989
          5    1.00000000    3.45557923   -2.45557923
          6    4.00000000    3.15989121    0.84010879
          7    1.50000000    2.87383709   -1.37383709
          8    3.50000000    2.59848911    0.90151089
          9    4.00000000    2.33485051    1.66514949
         10    1.00000000    2.08385547   -1.08385547
         11    1.75000000    1.84636917   -0.09636917
         12    1.50000000    1.62318775   -0.12318775
         13    1.00000000    1.41503832   -0.41503832
         14    1.00000000    1.22257898   -0.22257898
         15    1.00000000    1.04639879   -0.04639879
         16    0.50000000    0.88701781   -0.38701781


         17    0.50000000    0.74488703   -0.24488703
         18    0.50000000    0.62038845   -0.12038845
         19    0.50000000    0.51383503   -0.01383503
         20    1.00000000    0.42547072    0.57452928
         21    1.00000000    0.35547043    0.64452957
         22    1.00000000    0.30394003    0.69605997
         23    1.00000000    0.27091640    0.72908360
         24    1.00000000    0.25636736    0.74363264
         25    1.00000000    0.26019173    0.73980827
         26    0.50000000    0.28221930    0.21778070
         27    0.50000000    0.32221081    0.17778919
         28    1.00000000    0.37985800    0.62014200
         29    1.00000000    0.45478359    0.54521641
         30    1.50000000    0.54654124    0.95345876
         31    0.50000000    0.65461563   -0.15461563
         32    1.00000000    0.77842236    0.22157764
         33    1.00000000    0.91730806    0.08269194
         34    0.50000000    1.07055030   -0.57055030
         35    1.00000000    1.23735763   -0.23735763
         36    1.00000000    1.41686958   -0.41686958
         37    1.00000000    1.60815665   -0.60815665
         38    1.75000000    1.81022032   -0.06022032
         39    0.50000000    2.02199303   -1.52199303
         40    1.90000000    2.24233822   -0.34233822
         41    2.21000000    2.47005027   -0.26005027
         42    5.50000000    2.70385457    2.79614543
         43    2.00000000    2.94240746   -0.94240746
         44    2.50000000    3.18429626   -0.68429626
         45    3.00000000    3.42803927   -0.42803927
         46    1.50000000    3.67208576   -2.17208576
         47    2.00000000    3.91481597   -1.91481597
         48    1.00000000    4.15454113   -3.15454113
         49    2.00000000    4.38950342   -2.38950342
_______________________________________________________________________________________________

The SAS System   10

The GLM Procedure
Observation      Observed     Predicted      Residual
         50    3.00000000    4.61787601   -1.61787601
         51    3.50000000    4.83776305   -1.33776305
         52    6.00000000    5.04719965    0.95280035
         53    8.00000000    5.24415191    2.75584809
         54    6.00000000    5.42651688    0.57348312
         55   10.00000000    5.59212261    4.40787739
         56   10.00000000    5.73872811    4.26127189
         57    7.00000000    5.86402337    1.13597663
         58    7.00000000    5.96562934    1.03437066
         59    8.00000000    6.04109798    1.95890202
         60    5.50000000    6.08791218   -0.58791218
         61    7.00000000    6.10348584    0.89651416
         62    9.00000000    6.08516381    2.91483619
         63    3.50000000    6.03022192   -2.53022192
         64    3.00000000    5.93586699   -2.93586699
         65    6.00000000    5.79923680    0.20076320
         66    4.50000000    5.61740010   -1.11740010
         67    6.50000000    5.38735662    1.11264338
         68    4.00000000    5.10603708   -1.10603708
         69    4.00000000    4.77030314   -0.77030314
         70    1.50000000    4.37694747   -2.87694747
         71    1.75000000    3.92269369   -2.17269369
         72    4.50000000    3.40419641    1.09580359
         73    3.00000000    2.81804120    0.18195880
         74    1.00000000    2.16074461   -1.16074461


         75    3.50000000    1.42875417    2.07124583
         76    2.00000000    0.61844838    1.38155162

Sum of Residuals                        -0.0000000
Sum of Squared Residuals               201.1583388
Sum of Squared Residuals - Error SS      0.0000000
First Order Autocorrelation              0.2504760
Durbin-Watson D                          1.3504874
_______________________________________________________________________________________________

The SAS System   12

The GLM Procedure
Dependent Variable: Y
                              Sum of
Source           DF          Squares    Mean Square   F Value   Pr > F
Model             5      301.2298328     60.2459666     20.98   <.0001
Error            70      201.0457198      2.8720817
Corrected Total  75      502.2755526

R-Square   Coeff Var   Root MSE    Y Mean
0.599730    58.05411   1.694722  2.919211

Source   DF      Type I SS   Mean Square   F Value   Pr > F
X         1     83.4452221    83.4452221     29.05   <.0001
X2        1     14.6566473    14.6566473      5.10   0.0270
X3        1    187.2310683   187.2310683     65.19   <.0001
X4        1     15.7842761    15.7842761      5.50   0.0219
X5        1      0.1126189     0.1126189      0.04   0.8436

                              Standard
Parameter        Estimate        Error   t Value   Pr > |t|
Intercept     4.889870000   1.09351525      4.47     <.0001
X            -0.074514577   0.05944678     -1.25     0.2142
X2            0.000239858   0.00098669      0.24     0.8086
X3            0.000000668   0.00000666      0.10     0.9204
X4           -0.000000001   0.00000002     -0.04     0.9692
X5           -0.000000000   0.00000000     -0.20     0.8436
_______________________________________________________________________________________________

The SAS System   14

The GLM Procedure
Dependent Variable: Y
                              Sum of
Source           DF          Squares    Mean Square   F Value   Pr > F
Model             6      337.8159931     56.3026655     23.62   <.0001
Error            69      164.4595595      2.3834719


Corrected Total  75      502.2755526

R-Square   Coeff Var   Root MSE    Y Mean
0.672571    52.88586   1.543850  2.919211

Source   DF      Type I SS   Mean Square   F Value   Pr > F
X         1     83.4452221    83.4452221     35.01   <.0001
X2        1     14.6566473    14.6566473      6.15   0.0156
X3        1    187.2310683   187.2310683     78.55   <.0001
X4        1     15.7842761    15.7842761      6.62   0.0122
X5        1      0.1126189     0.1126189      0.05   0.8286
X6        1     36.5861604    36.5861604     15.35   0.0002

                              Standard
Parameter        Estimate        Error   t Value   Pr > |t|
Intercept     7.038343557   1.13712763      6.19     <.0001
X            -0.332884311   0.08533221     -3.90     0.0002
X2            0.007244243   0.00200103      3.62     0.0006
X3           -0.000074316   0.00002008     -3.70     0.0004
X4            0.000000374   0.00000010      3.84     0.0003
X5           -0.000000001   0.00000000     -3.92     0.0002
X6            0.000000000   0.00000000      3.92     0.0002
_______________________________________________________________________________________________

The SAS System   16

The GLM Procedure
Dependent Variable: Y
                              Sum of
Source           DF          Squares    Mean Square   F Value   Pr > F
Model             7      341.4795540     48.7827934     20.63   <.0001
Error            68      160.7959987      2.3646470
Corrected Total  75      502.2755526

R-Square   Coeff Var   Root MSE    Y Mean
0.679865    52.67660   1.537741  2.919211

Source   DF      Type I SS   Mean Square   F Value   Pr > F
X         1     83.4452221    83.4452221     35.29   <.0001
X2        1     14.6566473    14.6566473      6.20   0.0152
X3        1    187.2310683   187.2310683     79.18   <.0001
X4        1     15.7842761    15.7842761      6.68   0.0119
X5        1      0.1126189     0.1126189      0.05   0.8279
X6        1     36.5861604    36.5861604     15.47   0.0002
X7        1      3.6635608     3.6635608      1.55   0.2175

                              Standard
Parameter        Estimate        Error   t Value   Pr > |t|
Intercept     6.339745354   1.26406165      5.02     <.0001


X            -0.218047668   0.12544291     -1.74     0.0867
X2            0.003010225   0.00394252      0.76     0.4478
X3           -0.000011135   0.00005456     -0.20     0.8389
X4           -0.000000090   0.00000038     -0.23     0.8157
X5            0.000000001   0.00000000      0.62     0.5366
X6           -0.000000000   0.00000000     -0.96     0.3407
X7            0.000000000   0.00000000      1.24     0.2175
_______________________________________________________________________________________________

The SAS System   18

The GLM Procedure
Dependent Variable: Y
                              Sum of
Source           DF          Squares    Mean Square   F Value   Pr > F
Model             8      342.7216584     42.8402073     17.99   <.0001
Error            67      159.5538942      2.3814014
Corrected Total  75      502.2755526

R-Square   Coeff Var   Root MSE    Y Mean
0.682338    52.86289   1.543179  2.919211

Source   DF      Type I SS   Mean Square   F Value   Pr > F
X         1     83.4452221    83.4452221     35.04   <.0001
X2        1     14.6566473    14.6566473      6.15   0.0156
X3        1    187.2310683   187.2310683     78.62   <.0001
X4        1     15.7842761    15.7842761      6.63   0.0123
X5        1      0.1126189     0.1126189      0.05   0.8285
X6        1     36.5861604    36.5861604     15.36   0.0002
X7        1      3.6635608     3.6635608      1.54   0.2192
X8        1      1.2421044     1.2421044      0.52   0.4727

                              Standard
Parameter        Estimate        Error   t Value   Pr > |t|
Intercept     6.753169582   1.39171319      4.85     <.0001
X            -0.307805100   0.17689939     -1.74     0.0865
X2            0.007333059   0.00717500      1.02     0.3104
X3           -0.000096599   0.00013039     -0.74     0.4614
X4            0.000000767   0.00000125      0.61     0.5409
X5           -0.000000004   0.00000001     -0.57     0.5692
X6            0.000000000   0.00000000      0.59     0.5586
X7           -0.000000000   0.00000000     -0.64     0.5224
X8            0.000000000   0.00000000      0.72     0.4727
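Before turning to the graphs, here is a worked example of reading a fitted polynomial off the output (the X value is chosen arbitrarily). The cubic (degree 3) parameter estimates give:

   predicted Y = 6.043011 - 0.134565X + 0.000897797X² - 0.000001515X³

At X = 200 m: 6.04 - 26.91 + 35.91 - 12.12 ≈ 2.92, which agrees well with the observed soil moisture values of about 2 near 200 m in the data.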


[Graphs: the soil moisture data (y-axis, 0 to 12) plotted against distance from creek (x-axis, 0 to 400 m), with the fitted line for each polynomial: Linear (Degree 1), Quadratic (Degree 2), Cubic (Degree 3), Quartic (Degree 4), and Sextic (Degree 6).]

The Logistic Transformation and the Logit

The Logistic Transformation for Y is to calculate YT, where

   YT = 1/(1 + e^-Y) = e^Y/(1 + e^Y)

When Y = -∞, YT = 0; when Y = 0, YT = 0.5; when Y = +∞, YT = 1.

[Graph: YT plotted against Y for Y from -10 to +10. The curve is S-shaped, approaching 0 as Y goes to -∞, passing through YT = 0.5 at Y = 0, and approaching 1 as Y goes to +∞.]

Both 1/(1 + e^-Y) and e^Y/(1 + e^Y) are considered forms of the Logistic Transformation. As indicated above, the two forms are mathematically equivalent. The form e^Y/(1 + e^Y) is more commonly encountered. It is easy to show they are equivalent (multiply the numerator and denominator by e^Y):

   1/(1 + e^-Y) = e^Y/[e^Y(1 + e^-Y)] = e^Y/(e^Y + e^Y·e^-Y) = e^Y/(e^Y + 1) = e^Y/(1 + e^Y)

In order to work with regression, we wish to isolate Y on the right side of the equal sign. This requires manipulating the transform into the Logit. We now do the necessary algebra. For this derivation, we'll start with the 1/(1 + e^-Y) form. Next, take the inverse of both sides and continue:

   YT = 1/(1 + e^-Y)
   1/YT = 1 + e^-Y
   1/YT - 1 = e^-Y
   (1 - YT)/YT = e^-Y
   ln[(1 - YT)/YT] = ln(e^-Y) = -Y        (recall that ln(e^X) = X)
   Y = ln[YT/(1 - YT)]

The quantity ln[YT/(1 - YT)] is the Logit. If Y = a + b1X + b2GPA + b3Emerg + b4Family, then

   ln[YT/(1 - YT)] = a + b1X + b2GPA + b3Emerg + b4Family

Note that this last form is a more usual regression form. This form (using the Logit) is the Logistic Regression form. The Logit serves to "link" the Logistic Transformation to the standard regression form (dependent = independent). However, we cannot use ordinary least squares to calculate values for our parameter estimates (i.e. a, b1, b2, b3, b4). Instead, these parameters are estimated using the method of maximum likelihood.
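A quick way to convince yourself that the Logit really does undo the Logistic Transformation is to compute both in a short DATA step. This is a minimal sketch, not part of the Pac; the data set and variable names are made up for illustration:

DATA LogitCheck;
* Apply the logistic transformation, then the logit. ;
* BACK should reproduce Y exactly. ;
DO Y = -2 TO 2 BY 1;
   YT = EXP(Y)/(1 + EXP(Y));   * logistic transformation;
   BACK = LOG(YT/(1 - YT));    * the logit (LOG is the natural log in SAS);
   OUTPUT;
END;
PROC PRINT;
VAR Y YT BACK;
RUN;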


Logistic Regression - Example

option formdlim='_';
* For LOGISTIC REGRESSION, let's consider the following example: Suppose we are interested in developing a model that would predict which physicians would be good at diagnosing a certain disease. A sample of 35 physicians is identified. Each physician is given a copy of the file of the same patient (an individual who has the disease). The file contains background information on the patient, as well as the results of various lab tests. Each physician is asked to make a diagnosis based on the information in the file. The dependent variable (called Y below) is a 1 if the diagnosis was correct, and a 0 if it was not correct. The independent variables are:
  GPA - the physician's GPA as an undergraduate.
  Months_exp - the number of months since the physician completed their residency.
  Specialty - physician's specialty. Note that this variable is categorical:
    Emerg = Emergency Medicine
    Internal = Internal Medicine
    Family = Family Practice;
DATA LOGREG;
INPUT GPA Months_exp Y Specialty $;
* Note the dollar sign ($) after Specialty tells SAS that this is a string variable;
DATALINES;
3.90  2 0 Emerg
3.97  4 0 Internal
3.43  5 0 Family
3.63  6 0 Internal
3.88  7 0 Emerg
3.46  8 1 Family
3.81  8 1 Emerg
3.27  9 0 Emerg
3.57 10 0 Emerg
3.95 10 0 Emerg
3.37 11 1 Family
3.71 12 1 Emerg
3.29 13 0 Family
3.34 15 1 Internal
3.75 16 1 Family
3.81 17 0 Internal
3.35 19 1 Internal
3.42 20 1 Family
3.51 22 0 Family
3.30 23 1 Emerg
3.32 24 1 Internal
3.74 27 1 Internal
3.73 30 0 Emerg
3.71 31 1 Family
3.66 32 1 Family
3.60 23 1 Family
3.54 21 0 Family
3.87 17 0 Family
3.91  2 1 Family
3.85 12 1 Emerg
3.60  3 1 Emerg
3.74  9 1 Internal
3.80 22 0 Family
3.63 20 1 Family
3.57 17 1 Family
;


PROC LOGISTIC DESCENDING;
CLASS Specialty;
MODEL Y = Months_exp GPA Specialty / STB;
OUTPUT OUT=LOGOUT P=PRED;
* The DESCENDING option in PROC LOGISTIC tells SAS which code (a 1 or a 0) represents a positive outcome (SAS calls this the "ordered value"). The DESCENDING means the largest value (1 in our example) is the positive outcome. This causes SAS to calculate the probability of a correct diagnosis - which is what we want. The STB option prints out standardized b values. The OUTPUT statement creates a new SAS dataset with the original variables plus P, which is the probability of a correct diagnosis. We will call this variable PRED. ;
PROC SORT;
BY Months_exp;
* We sort the new dataset by Months_exp (months of experience), then print the independent variables along with the probabilities. ;
PROC PRINT;
VAR Months_exp GPA Specialty PRED Y;
RUN;


The SAS System   1          17:27 Wednesday, February 9, 2005

The LOGISTIC Procedure

Model Information
Data Set                    WORK.LOGREG
Response Variable           Y
Number of Response Levels   2
Number of Observations      35
Model                       binary logit
Optimization Technique      Fisher's scoring

Response Profile
Ordered            Total
  Value   Y    Frequency
      1   1           20
      2   0           15

Probability modeled is Y=1.

Class Level Information
                       Design Variables
Class       Value          1       2
Specialty   Emerg          1       0
            Family         0       1
            Internal      -1      -1

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
__________________________________________________________________________________________

The SAS System   2          17:27 Wednesday, February 9, 2005

The LOGISTIC Procedure

Model Fit Statistics
              Intercept   Intercept and
Criterion          Only      Covariates
AIC              49.804          55.006
SC               51.359          62.783
-2 Log L         47.804          45.006

AIC is the Akaike Information Criterion - it is a quantity used in a variety of statistical procedures to assess how well a particular model fits a set of data. The smaller the AIC value, the better the fit. Generally, AIC is related to the Error SS. For linear regression models, the formula for AIC is:

   AIC = n × ln(SSError/n) + 2p

where n is the sample size, and p is the number of parameters in the model (including the intercept). In our logistic regression, we have five parameters (see SAS output): intercept, Months_exp, GPA, Emergency, Family. The exact formula for AIC in logistic regression is different (as you can see, we don't have an Error SS, or any SS quantity, in logistic regression), but this should give you an idea of the general approach.
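You can verify how SAS got the AIC values in the Model Fit Statistics above. For logistic regression, SAS computes AIC = -2 Log L + 2p. With covariates (p = 5): 45.006 + 2(5) = 55.006. Intercept only (p = 1): 47.804 + 2(1) = 49.804. Both match the output.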


Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio       2.7972    4       0.5923
Score                  2.7474    4       0.6009
Wald                   2.5961    4       0.6275

Type III Analysis of Effects
                        Wald
Effect        DF   Chi-Square   Pr > ChiSq
Months_exp     1       0.7973       0.3719
GPA            1       0.7704       0.3801
Specialty      2       0.2445       0.8849

Analysis of Maximum Likelihood Estimates
                              Standard        Wald                 Standardized
Parameter          DF  Estimate   Error  Chi-Square  Pr > ChiSq        Estimate
Intercept           1    5.3596  6.5943      0.6606      0.4164
Months_exp          1    0.0408  0.0457      0.7973      0.3719          0.1905
GPA                 1   -1.5594  1.7765      0.7704      0.3801         -0.1817
Specialty Emerg     1   -0.2625  0.5387      0.2375      0.6261               .
Specialty Family    1    0.0567  0.5007      0.0128      0.9099               .
__________________________________________________________________________________________

The SAS System   3          17:27 Wednesday, February 9, 2005

The LOGISTIC Procedure

Odds Ratio Estimates
                                  Point        95% Wald
Effect                         Estimate   Confidence Limits
Months_exp                        1.042     0.952     1.139
GPA                               0.210     0.006     6.839
Specialty Emerg vs Internal       0.626     0.089     4.402
Specialty Family vs Internal      0.861     0.138     5.369

Association of Predicted Probabilities and Observed Responses
Percent Concordant   69.0    Somers' D   0.380
Percent Discordant   31.0    Gamma       0.380
Percent Tied          0.0    Tau-a       0.192
Pairs                 300    c           0.690


__________________________________________________________________________________________

The SAS System   4          17:27 Wednesday, February 9, 2005

       Months_
Obs        exp    GPA   Specialty       PRED   Y
  1          2   3.90   Emerg        0.28847   0
  2          2   3.91   Family       0.35452   1
  3          3   3.60   Emerg        0.40269   1
  4          4   3.97   Internal     0.38648   0
  5          5   3.43   Family       0.56748   0
  6          6   3.63   Internal     0.53733   0
  7          7   3.88   Emerg        0.33900   0
  8          8   3.46   Family       0.58592   1
  9          8   3.81   Emerg        0.37336   1
 10          9   3.27   Emerg        0.59025   0
 11          9   3.74   Internal     0.52508   1
 12         10   3.57   Emerg        0.48450   0
 13         10   3.95   Emerg        0.34196   0
 14         11   3.37   Family       0.64790   1
 15         12   3.71   Emerg        0.45047   1
 16         12   3.85   Emerg        0.39721   1
 17         13   3.29   Family       0.69341   0
 18         15   3.34   Internal     0.72489   1
 19         16   3.75   Family       0.55506   1
 20         17   3.81   Internal     0.57871   0
 21         17   3.87   Family       0.51869   0
 22         17   3.57   Family       0.63242   1
 23         19   3.35   Internal     0.75331   1
 24         20   3.42   Family       0.71071   1
 25         20   3.63   Family       0.63908   1
 26         21   3.54   Family       0.67972   0
 27         22   3.51   Family       0.69848   0
 28         22   3.80   Family       0.59576   0
 29         23   3.30   Emerg        0.70871   1
 30         23   3.60   Family       0.67710   1
 31         24   3.32   Internal     0.79690   1
 32         27   3.74   Internal     0.69729   1
 33         30   3.73   Emerg        0.62341   0
 34         31   3.71   Family       0.70996   1
 35         32   3.66   Family       0.73379   1

See the next page for how the PRED (predicted probabilities) values are calculated.


Logistic Regression: Predicted Probability Calculation

Our model is:

   ln[YT/(1 - YT)] = a + b1Months_exp + b2GPA + b3Emerg + b4Family

The logit (the "dependent variable") above is not a probability. In fact it can range from -∞ to +∞. So we need to use the logistic transformation, which is a probability:

   YT = e^(a + b1Months_exp + b2GPA + b3Emerg + b4Family) / [1 + e^(a + b1Months_exp + b2GPA + b3Emerg + b4Family)]

The "Emerg" and "Family" variables are the "Design Variables" we saw on the SAS output (the "effects" coding). Here are the values we would use:

   If physician is:       Emerg   Family
   Emergency Medicine       1       0
   Family Practice          0       1
   Internal Medicine       -1      -1

Values for a and b1, b2, b3, b4 are the maximum likelihood estimates from the SAS output. Let's evaluate this expression for physician #35 (on the last page of the SAS output), a Family Practice physician with 32 months of experience and a GPA of 3.66:

   a + b1Months_exp + b2GPA + b3Emerg + b4Family
      = 5.3596 + 0.0408(32) + (-1.5594)(3.66) + (-0.2625)(0) + 0.0567(1) = 1.014496

Now we calculate our probability:

   YT = e^1.014496 / (1 + e^1.014496) = 2.75797302 / 3.75797302 = 0.73389

Notice that this matches the probability from the SAS output (with a little rounding error).


What are the odds? Odds and Odds Ratios; Concordant / Discordant

Odds and Odds Ratios are associated with probabilities. Since the output of logistic regression is in the form of a probability, odds and odds ratios are encountered frequently. However, many biologists are not too familiar with odds or odds ratios. Epidemiologists (and people who frequent the race track) are well acquainted with these concepts.

Odds
If the probability of some event happening is p, and the probability of it not happening is q, where q = 1 - p, then the odds of it happening is defined as p divided by q. For example, in our logistic regression output (above), we see that the probability of physician #18 getting the diagnosis correct is p = 0.72489. Therefore, the probability of the physician getting it wrong is q = 1 - 0.72489 = 0.27511. The odds of physician #18 getting it correct is:

   p/q = 0.72489/0.27511 = 2.63491

Or, as our friends at the race track might say: "About 2.6 to 1." Note that we could also calculate the odds of physician #18 getting it wrong:

   q/p = 0.27511/0.72489 = 0.37952, or "About 1 to 2.6."

Odds Ratio
Not too surprisingly, an Odds Ratio is the ratio of two odds. For example, we already calculated the odds of physician #18 getting the diagnosis correct as 2.63491. Now, let's calculate the odds of physician #19 getting the diagnosis correct. For physician #19, we see that p = 0.55506, so q = 0.44494, and the odds of correct are:

   p/q = 0.55506/0.44494 = 1.24749

One way to compare physician #18 to physician #19 is to calculate the odds ratio:

   2.63491/1.24749 = 2.11217

In words, the odds of #18 getting it correct are a little more than twice the odds of #19 getting it correct.

Concordant / Discordant
Pair each physician with a correct diagnosis with each physician with an incorrect diagnosis. Since there were 20 correct diagnoses and 15 incorrect diagnoses, that's 20 x 15 = 300 pairs. For each pair, if the physician with the correct diagnosis was predicted to have a higher probability of a correct diagnosis, then that pair is concordant. However, if the physician with the correct diagnosis was predicted to have a lower probability of a correct diagnosis, then that pair is discordant.

Examples
Pair physician #2 (correct diagnosis) with physician #1 (incorrect diagnosis). On the last page of the SAS output, we see that the probability of physician #2 having a correct diagnosis was 0.35452, while the probability of physician #1 having a correct diagnosis was 0.28847. Since the physician with the correct diagnosis had the higher probability, this is a concordant pair. Next pair physician #2 with physician #4 (incorrect diagnosis; probability = 0.38648). This pair is discordant because the physician with the incorrect diagnosis had the higher probability.

The SAS output indicates that 69% of the 300 pairs, or 207 pairs, were concordant. The other 31% (93 pairs) were discordant. The percent concordant/discordant is another description of how well your model performs. You want a model with the highest possible percentage of concordant pairs.
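One connection worth knowing (standard logistic regression theory, not stated in the Pac): the Odds Ratio Estimates SAS printed earlier are simply e raised to the corresponding b. For Months_exp:

   e^0.0408 ≈ 1.042

which is the value in the Point Estimate column. In words: each additional month of experience multiplies the odds of a correct diagnosis by about 1.042, holding the other predictors constant.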


Eigenvalues and Eigenvectors

Eigenvalues (scalars) and eigenvectors (vectors) are associated with square matrices. A square matrix of order p has p eigenvalues. Each of the eigenvalues is associated with an eigenvector. An eigenvalue and corresponding eigenvector of a matrix satisfy the property that the eigenvector multiplied by the matrix yields a vector proportional to itself. The constant of proportionality is the eigenvalue. This is less confusing in mathematical terms:

   Let A be a square matrix of order p
   Let v be a vector with p elements
   Let λ be a scalar
   If Av = λv is true, then λ is an eigenvalue of A, and v is the associated eigenvector.

Example:

   A = | 16  -24   18 |        v = | 2 |
       |  3   -2    0 |            | 1 |
       | -9   18  -17 |            | 0 |

   Av = | 16(2) - 24(1) + 18(0) |     | 8 |         | 2 |
        |  3(2) -  2(1) +  0(0) |  =  | 4 |  =  4 × | 1 |  =  4v
        | -9(2) + 18(1) - 17(0) |     | 0 |         | 0 |

Thus, λ = 4 (4 is an eigenvalue of A), and v is the associated eigenvector.

The eigenvalues of a matrix are determined by solving the characteristic function of the matrix. Let A be a square matrix of order p, Ip be the identity matrix of order p, and λ be the eigenvalue. The characteristic function of A is:

   |A - λIp| = 0

Note that the straight lines represent the determinant of A - λIp.

Important properties of eigenvalues:
1. The characteristic equation of A (order p) reduces to a polynomial of degree p, and the p solutions are the eigenvalues. The eigenvalues will not necessarily all be different values.
2. The sum of the p eigenvalues is equal to the trace of the original matrix: Σλ = tr(A)
3. The product of the p eigenvalues is equal to the determinant of the original matrix: Πλ = |A|

The eigenvectors associated with each eigenvalue are determined by solving the equation (A - λI)v = 0. Note that this equation does not involve the determinant. Since this equation does not provide unique values for the elements of v, we apply the normalizing function to the elements. This says that the sum of the square of each element must equal 1. That is: v1² + v2² + … + vp² = 1.

An important property of eigenvectors is that, given the p eigenvectors of a symmetric square matrix (like our dispersion matrices), the product of any two different eigenvectors is zero. In mathematical terms: vi' vj = 0. Note that we transpose the first vector to conform for multiplication. A second important property is that if you make a matrix whose columns are these eigenvectors (called the eigenvector matrix), and multiply it times its transpose, you will get the identity matrix. In other words, if V is an eigenvector matrix, then VV' = I.

Excel does not have functions for calculating eigenvalues or eigenvectors. They can be done on SAS using the Interactive Matrix Language procedure. See below.


Eigenvalues and Eigenvectors - Example calculations

Let

   A = | 2  1 |
       | 1  2 |

The characteristic function of A is |A - λIp| = 0. Substitute for A and Ip, then solve for λ:

   | 2  1 |  -  λ | 1  0 |   =   | 2-λ    1  |
   | 1  2 |       | 0  1 |       |  1    2-λ |

and the determinant of this matrix must equal 0:

   (2-λ)(2-λ) - (1)(1) = 4 - 2λ - 2λ + λ² - 1 = λ² - 4λ + 3 = 0

Note that this last equation is a polynomial of degree 2 (a quadratic equation), which can be written (λ-3)(λ-1) = 0. Therefore, the eigenvalues of A are λ1 = 3 and λ2 = 1. By definition, the largest eigenvalue is considered the first eigenvalue, the next largest is the second, and so on.

Now that we know the eigenvalues, we can solve for the eigenvectors. We'll only solve for the eigenvector associated with λ1 = 3. You can do the eigenvector for λ2 = 1 on your own. We use the expression (A - λI)v = 0 to find the eigenvector (v). We substitute for A, I, and λ, then solve for v:

   ( | 2  1 |  -  3 | 1  0 | ) | v1 |   =   | -1   1 | | v1 |   =   | 0 |
     | 1  2 |       | 0  1 |   | v2 |       |  1  -1 | | v2 |       | 0 |

Notice that this means -1v1 + 1v2 = 0, and 1v1 + -1v2 = 0. Both of these expressions simplify to v1 = v2. Therefore, any values will work for the two elements of v, as long as they are equal. This is an indeterminate solution. In order to get unique values for v1 and v2, we need another expression relating them. This is where we use the normalizing function, which says that v1² + v2² = 1.

Since v1² + v2² = 1, and v1 = v2, we substitute v1 for v2 in the normalizing function:

   v1² + v1² = 2v1² = 1,  so  v1² = 1/2,  and  v1 = 1/√2

Since v1 = v2, the eigenvector associated with λ1 = 3 is

   v = | 1/√2 |
       | 1/√2 |

When you solve for λ2 = 1, you should get

   v = |  1/√2 |      or      v = | -1/√2 |
       | -1/√2 |                  |  1/√2 |

Either solution is correct. With an eigenvector, you can always change the signs on the elements, as long as you change the signs on all of the elements, not just some of them! Satisfy yourself that the product of the two eigenvectors is zero, that is: vi' vj = 0. Note: transpose the first vector to conform for multiplication. Do this on Excel, or on SAS using IML (see below).
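Here is a minimal IML sketch (not part of the Pac) that performs these checks on the 2 × 2 example above; the backquote (`) is IML's transpose operator:

PROC IML;
A = {2 1,
     1 2};
CALL EIGEN(L, V, A);      * L = eigenvalues, V = eigenvectors;
ortho = V[,1]` * V[,2];   * product of the two eigenvectors - should be 0;
P = V * V`;               * eigenvector matrix times its transpose - should be I;
PRINT L V ortho P;
QUIT;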


The Interactive Matrix Language (IML)

DATA EigStruct;
* The Interactive Matrix Language (IML) is a SAS tool for manipulating matrices. In this first example, you see how to use IML to multiply matrices, calculate a determinant, and calculate an inverse. Notice how matrices are entered. The rows must be separated by commas. Separate the elements by at least one space. Enclose the matrix in braces, and don't forget the semicolon! Here, we multiply square matrix A by vector X to get vector B. Then, we calculate the determinant of A as scalar D. Then, we calculate the inverse of A as matrix G. If G is the inverse of A, then if we multiply A times G, we should get the identity matrix. We do this to check. Matrix P should be the identity matrix. When you look on the output, you'll see numbers of magnitude 10^-15 or 10^-16 instead of 0. This is due to rounding error - these values are essentially 0. ;
PROC IML;
TITLE "Multiplication example. D = Determinant, G = Inverse";
A = { 16 -24  18,
       3  -2   0,
      -9  18 -17 };
X = {2, 1, 0};
B = A*x;
D = Det(A);
G = Inv(A);
P = A*G;
Print A X B;
Print D;
Print G;
Print P;
Stop;
* This next routine uses the IML procedure to calculate the eigenvalues and eigenvectors of any square matrix. Enter your matrix below as matrix A. The rows must be separated by commas. Separate the elements by at least one space. Enclose the matrix in braces, and don't forget the semicolon! In the output, L is a matrix of eigenvalues. If your matrix A is symmetrical (like a correlation matrix), L is a column vector. If your matrix A is asymmetrical (such as a Leslie matrix in pop eco), then the eigenvalues you want are in the first column. Use the first nonzero eigenvalue, which should be the largest eigenvalue. The other columns are the imaginary parts of the eigenvalues. (Eigenvalues of asymmetrical matrices can have real and imaginary parts.) V is a matrix of eigenvectors. If A is symmetrical, the first column of V is the eigenvector corresponding to the first row of L, and so on. If A is asymmetrical, V is complicated. For eigenvalues with nonzero real and imaginary parts, there are two columns in V. They are the real and imaginary parts of the eigenvectors. The values below for A are a correlation matrix.;
PROC IML;
TITLE "L = Eigenvalues V = Eigenvectors";
A = { 1.000 -0.047  0.845,
     -0.047  1.000 -0.433,
      0.845 -0.433  1.000 };
CALL EIGEN (L,V,A);
Print L;
Print V;
quit;
run;


Multiplication example. D = Determinant, G = Inverse   1
                          17:30 Wednesday, December 11, 2002

          A                  X        B
   16   -24    18            2        8
    3    -2     0            1        4
   -9    18   -17            0        0

          D
        -32

          G
   -1.0625      2.625     -1.125
   -1.59375     3.4375    -1.6875
   -1.125       2.25      -1.25

          P
    1           1.776E-15   4.441E-16
    2.22E-16    1           2.22E-16
   -8.88E-16    1.332E-15   1

L = Eigenvalues V = Eigenvectors   2
                          17:30 Wednesday, December 11, 2002

          L
    1.9691294
    0.9618815
    0.0689891

          V
    0.6267117   0.4548108  -0.632756
   -0.343013    0.8901161   0.3000593
    0.699696    0.0289926   0.7138521
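The eigenvalues above also give a quick check of properties 2 and 3 from the previous pages. A here is a 3 × 3 correlation matrix, so its trace (the sum of the 1s on the principal diagonal) is 3, and indeed:

   Σλ = 1.9691294 + 0.9618815 + 0.0689891 = 3.0000000 = tr(A)

By property 3, the determinant of this correlation matrix must be Πλ = 1.9691294 × 0.9618815 × 0.0689891 ≈ 0.1307.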


Principal Components Analysis - Example

Data are length, width, and height of the shells of 24 female painted turtles (Chrysemys picta). Data are in mm. Remember: 10 mm = 1 cm = 0.3937 in. The smallest turtle is 98 mm in length, or 9.8 cm, or 3.9 in. The biggest turtle is 177 mm in length, or 17.7 cm, or 7 in.

Note that the most obvious pattern is size. Bigger turtles are bigger in all dimensions. But are there any patterns with respect to shape? Can you see anything on the graph?


Data TurtlePCA;
Input Length Width Height;
Datalines;
 98  81 38
103  84 38
103  86 42
105  86 40
109  88 44
123  92 50
123  95 46
133  99 51
133 102 51
133 102 51
134 100 48
136 102 49
137  98 51
138  99 51
141 105 53
147 108 57
149 107 55
153 107 56
155 115 63
155 115 60
158 118 62
159 118 63
162 124 61
177 132 67
;
* The Factor procedure will do a principal components analysis. The Corr option prints the correlation matrix. NFACTORS = 3 tells SAS to retain 3 components in the analysis. The Eigenvectors option tells SAS to print the eigenvectors. The OUT = FACS option creates a new dataset (FACS) with the factor scores as well as the original variables. FACS becomes the active dataset, so we see all variables and the factor scores when we print. ;
PROC Factor Corr NFactors = 3 Eigenvectors OUT = FACS;
PROC PRINT;
Run;


The SAS System        15:03 Tuesday, February 18, 2003   1

The FACTOR Procedure

Correlations
          Length     Width      Height
Length    1.00000    0.97467    0.97258
Width     0.97467    1.00000    0.96748
Height    0.97258    0.96748    1.00000

The SAS System        15:03 Tuesday, February 18, 2003   2

The FACTOR Procedure
Initial Factor Method: Principal Components

Prior Communality Estimates: ONE

Eigenvalues of the Correlation Matrix: Total = 3 Average = 1
     Eigenvalue    Difference    Proportion    Cumulative
1    2.94316257    2.91047201    0.9811        0.9811
2    0.03269056    0.00854369    0.0109        0.9920
3    0.02414687                  0.0080        1.0000

3 factors will be retained by the NFACTOR criterion.

Eigenvectors
              1          2          3
Length    0.57816   -0.11633   -0.80759
Width     0.57715   -0.64132    0.50557
Height    0.57674    0.75840    0.30365

Factor Pattern
          Factor1    Factor2    Factor3
Length    0.99187   -0.02103   -0.12549
Width     0.99014   -0.11595    0.07856
Height    0.98943    0.13712    0.04718

Variance Explained by Each Factor
Factor1      Factor2      Factor3
2.9431626    0.0326906    0.0241469

Final Communality Estimates: Total = 3.000000
Length       Width        Height
1.0000000    1.0000000    1.0000000


The SAS System        15:03 Tuesday, February 18, 2003   3

The FACTOR Procedure
Initial Factor Method: Principal Components

Scoring Coefficients Estimated by Regression

Squared Multiple Correlations of the Variables with Each Factor
Factor1      Factor2      Factor3
1.0000000    1.0000000    1.0000000

Standardized Scoring Coefficients
          Factor1       Factor2       Factor3
Length    0.33700874    -0.6433854    -5.1970814
Width     0.33642135    -3.5470412    3.25348491
Height    0.33617905    4.19457094    1.95407797

The SAS System        15:03 Tuesday, February 18, 2003   4

Obs  Length  Width  Height   Factor1    Factor2    Factor3
  1      98     81      38  -1.73062   -0.18962    0.60569
  2     103     84      38  -1.57458   -1.15000    0.12463
  3     103     86      42  -1.35872    0.36585    1.57672
  4     105     86      40  -1.40935   -0.72230    0.60879
  5     109     88      44  -1.13004    0.67242    1.08242
  6     123     92      50  -0.55859    2.25259    0.08329
  7     123     95      46  -0.64658   -0.61154   -0.13211
  8     133     99      51  -0.17976    0.57597   -0.39214
  9     133    102      51  -0.10303   -0.23300    0.34988
 10     133    102      51  -0.10303   -0.23300    0.34988
 11     134    100      48  -0.26186   -1.26534   -1.10748
 12     136    102      49  -0.13780   -1.35143   -0.86268
 13     137     98      51  -0.14189    0.72449   -1.61794
 14     138     99      51  -0.10045    0.42455   -1.61521
 15     141    105      53   0.18295   -0.25665   -0.38632
 16     147    108      57   0.51956    0.80785   -0.15458
 17     149    107      55   0.44335   -0.01064   -1.36985
 18     153    107      56   0.54798    0.38202   -2.10896
 19     155    115      63   1.07256    1.76075    1.05600
 20     155    115      60   0.94902    0.21938    0.33793
 21     158    118      62   1.15569    0.34714    0.82481
 22     159    118      63   1.21273    0.83065    0.81955
 23     162    124      61   1.33142   -1.90571    1.09103
 24     177    132      67   2.02103   -1.43445    0.83664

What SAS calls Factor1, Factor2, and Factor3, we call PC 1, PC 2, and PC 3.


Below, the 24 turtles are plotted by their factor scores. The x-axis is the score on PC 1; the y-axis is PC 2.

Interpretation of Principal Components

PC 1 is a "size component", that is, the factor scores on PC 1 represent a gradient from small turtles (negative values) to large turtles (positive values). Note that this component explains over 98% of the variation in the three variables. The principal pattern in the turtle data is that some turtles have larger values for length, width, and height than other turtles. In other words, some turtles are bigger than others. Although this is a trivial result, the purpose here is to show how PCA works - not to win the Nobel prize in turtle biology.

PC 2 is a shape component. Turtles with high (positive) values on PC 2 tend to be tall for their width ("tall, skinny turtles"). Turtles low on PC 2 (negative values) tend to be short for their width ("short, wide turtles"). This is not absolute size. Think of this as the ratio of height:width. The turtle with the highest score on PC 2 is #6, and her height:width ratio is 50/92 = 0.543. The lowest value on PC 2 is turtle #23, and her height:width ratio is 61/124 = 0.492. Turtle #17 is right about in the middle of PC 2, and her height:width ratio is 55/107 = 0.514. These are very small, subtle differences in shape that you would probably not notice by looking at the raw data, or at the turtle shells. Remember, this component explains only a little over 1% of the total variation in length, width, and height. However, the shape difference is apparently real. Other studies have shown that older female turtles are relatively taller, so the turtles higher on PC 2 might be the older ones.
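To make the height:width comparison concrete, here is a small sketch (an add-on, not in the original handout) that computes the ratio for each turtle alongside its PC 2 score. It assumes the FACS data set created by PROC FACTOR above.

DATA SHAPE;
  SET FACS;                   * FACS holds the raw variables plus factor scores;
  HWRATIO = Height / Width;   * the shape measure discussed above;
PROC SORT;
  BY Factor2;                 * sort from "short, wide" to "tall, skinny";
PROC PRINT;
  VAR Length Width Height HWRATIO Factor2;
RUN;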

[Figure: "Chrysemys picta morphology" - the 24 turtles plotted by their factor scores, with PC 1 on the x-axis and PC 2 on the y-axis.]


Additional Matrix Calculations in PCA

Let Z be a matrix of the Z-scores (standard scores) of the data. For the turtles, this would be a 24 x 3 matrix with the Z-score for each turtle length, width, and height. Let V be the eigenvector matrix. Let L^(1/2) be a diagonal matrix whose elements along the principal diagonal are the square roots of the eigenvalues. Remember that eigenvalues are variances, so their square roots are standard deviations. Let L^(-1/2) be a diagonal matrix whose elements along the principal diagonal are the reciprocals of the square roots of the eigenvalues.

Factor Pattern Matrix (S):  S = V L^(1/2)
Recall that the Factor Pattern Matrix contains correlations between each component and the original variables. This is the same basic information as the eigenvector matrix, but most of us are more comfortable interpreting correlations than eigenvector elements. If you ever see the term "factor loadings", that usually refers to the factor pattern matrix.

Standardized Scoring Coefficients (B):  B = V L^(-1/2)
This matrix is used to calculate the standardized factor scores (see below). This matrix has no interpretive value. By examining the formula or the values in one of the examples, you can see that the values in this matrix are directly related to the eigenvectors.

Factor Scores (F):  F = ZB
These are the factor scores as produced by SAS in PROC FACTOR. They are actually standardized factor scores, which means that the scores on any PC have a mean of 0 and a variance (and standard deviation) of 1. Remember that the variance of the observations along a PC is the eigenvalue associated with that component; however, these factor scores are standardized to 0 mean and unit variance. Note that the B matrix above has the reciprocals of the square roots of the eigenvalues - by using this matrix, SAS is essentially dividing by the standard deviation (the square root of the eigenvalue). The reason for standardizing the factor scores is to spread the points out better on a graph. If they weren't standardized, the scores on a component with a small eigenvalue (such as PC 2 and PC 3 for the turtles) would be very close together. Standardized factor scores, as produced by PROC FACTOR, are the common output of statistical programs.

Unstandardized Factor Scores (U):  U = ZV
If you multiply the standardized data matrix (Z) times the eigenvector matrix (V), you get factor scores that are not standardized. The variance of these scores for a PC will be the eigenvalue for that component. If you were to then divide each score by the square root of the eigenvalue, you would get the standardized factor scores (same as the F matrix). If you use PROC PRINCOMP to get factor scores, you get the unstandardized scores (U). You can have PROC PRINCOMP produce standardized factor scores by adding the STANDARD option to the PROC PRINCOMP statement.
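Here is a minimal IML sketch (an add-on, not part of the original handout) that carries out these four matrix calculations for the turtle data. It assumes the TurtlePCA data set entered earlier; PROC STANDARD converts the raw data to Z-scores, and since the data are standardized, the correlation matrix is simply Z'Z/(n-1).

PROC STANDARD DATA=TurtlePCA MEAN=0 STD=1 OUT=ZDATA;
  VAR Length Width Height;
PROC IML;
USE ZDATA;
READ ALL VAR {Length Width Height} INTO Z;   * 24 x 3 matrix of Z-scores;
CLOSE ZDATA;
R = T(Z)*Z / (NROW(Z)-1);         * correlation matrix of the original data;
CALL EIGEN (L, V, R);             * L = eigenvalues, V = eigenvectors;
S = V * DIAG(SQRT(L));            * factor pattern matrix,    S = V L^(1/2);
B = V * DIAG(1/SQRT(L));          * scoring coefficients,     B = V L^(-1/2);
F = Z * B;                        * standardized factor scores,   F = ZB;
U = Z * V;                        * unstandardized factor scores, U = ZV;
Print S, B;
QUIT;
RUN;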


The Importance of Principal Components - Stopping Rules

The table below provides minimum values for eigenvalues for principal components to be interpretable. The values are those from the broken-stick method as presented in Jackson (1993). Rows in the table represent the number of variables in a particular PCA, while columns represent the eigenvalues in decreasing order. For example: suppose you have a PCA involving eight variables. Look in the eighth row below. The first (i.e. the largest) eigenvalue should be greater than 2.718 if it is to be interpreted. The second eigenvalue should be at least 1.718. If the eigenvalues are not greater than these broken-stick values, then the component represents random variation.

 p    b1     b2     b3     b4     b5     b6     b7     b8     b9     b10    b11    b12    b13    b14    b15    b16    b17    b18    b19    b20
 1   1.000
 2   1.500  0.500
 3   1.833  0.833  0.333
 4   2.083  1.083  0.583  0.250
 5   2.283  1.283  0.783  0.450  0.200
 6   2.450  1.450  0.950  0.617  0.367  0.167
 7   2.593  1.593  1.093  0.760  0.510  0.310  0.143
 8   2.718  1.718  1.218  0.885  0.635  0.435  0.268  0.125
 9   2.829  1.829  1.329  0.996  0.746  0.546  0.379  0.236  0.111
10   2.929  1.929  1.429  1.096  0.846  0.646  0.479  0.336  0.211  0.100
11   3.020  2.020  1.520  1.187  0.937  0.737  0.570  0.427  0.302  0.191  0.091
12   3.103  2.103  1.603  1.270  1.020  0.820  0.653  0.510  0.385  0.274  0.174  0.083
13   3.180  2.180  1.680  1.347  1.097  0.897  0.730  0.587  0.462  0.351  0.251  0.160  0.077
14   3.252  2.252  1.752  1.418  1.168  0.968  0.802  0.659  0.534  0.423  0.323  0.232  0.148  0.071
15   3.318  2.318  1.818  1.485  1.235  1.035  0.868  0.725  0.600  0.489  0.389  0.298  0.215  0.138  0.067
16   3.381  2.381  1.881  1.547  1.297  1.097  0.931  0.788  0.663  0.552  0.452  0.361  0.278  0.201  0.129  0.063
17   3.440  2.440  1.940  1.606  1.356  1.156  0.990  0.847  0.722  0.611  0.511  0.420  0.336  0.259  0.188  0.121  0.059
18   3.495  2.495  1.995  1.662  1.412  1.212  1.045  0.902  0.777  0.666  0.566  0.475  0.392  0.315  0.244  0.177  0.114  0.056
19   3.548  2.548  2.048  1.714  1.464  1.264  1.098  0.955  0.830  0.719  0.619  0.528  0.445  0.368  0.296  0.230  0.167  0.108  0.053
20   3.598  2.598  2.098  1.764  1.514  1.314  1.148  1.005  0.880  0.769  0.669  0.578  0.495  0.418  0.346  0.280  0.217  0.158  0.103  0.050

The values above are calculated using the formula below, where p is the number of variables and bk is the minimum value of the eigenvalue for the kth component.

    bk = sum from i = k to p of (1/i)  =  1/k + 1/(k+1) + ... + 1/p

For example, suppose a PCA is done on three variables. The minimum value for the largest eigenvalue (k=1) is: b1 = 1/1 + 1/2 + 1/3 = 1 + 0.5 + 0.333 = 1.833. For the second eigenvalue (k=2): b2 = 1/2 + 1/3 = 0.5 + 0.333 = 0.833. The value for the third eigenvalue is b3 = 1/3 = 0.333. Note that these are the values found in the table above.

Literature Cited

Jackson, Donald A. 1993. Stopping rules in principal components analysis: a comparison of heuristical and statistical approaches. Ecology 74:2204-2214.
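As a quick illustration (an add-on, not in the original handout), the following DATA step computes the broken-stick values for any number of variables; here p = 8, which reproduces the eighth row of the table.

DATA BSTICK;
  p = 8;                  * number of variables in the PCA;
  DO k = 1 TO p;
    b = 0;
    DO i = k TO p;
      b = b + 1/i;        * bk = 1/k + 1/(k+1) + ... + 1/p;
    END;
    OUTPUT;
  END;
  KEEP k b;
PROC PRINT;
RUN;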


The PCA Game - America's Favorite Game Show A PCA is done on data (below) showing the abundance of each of 5 different species in three communities (called A, B, and C). The graph shows each of the 5 species plotted by their factor scores on the first two principal components. The object of the game is to identify each of the species on the graph, that is, which point is Species1, which is Species2, and so on. In order to do this, you have to understand and interpret the information from the principal components analysis given below. Hint: pay special attention to the Data and Factor Pattern matrix. Note: the PCA game is a classroom exercise to help you understand and interpret a PCA. This is not something that is done in the "real world" (which is actually a show on MTV where they don't do PCAs).

Data
            A     B     C
Species1   40     0    50
Species2   30    30    20
Species3   20    40    20
Species4   10    25     0
Species5    0     5    10

Correlations
       A          B          C
A      1.00000   -0.04663    0.84515
B     -0.04663    1.00000   -0.43346
C      0.84515   -0.43346    1.00000

Eigenvalues of the Correlation Matrix: Total = 3 Average = 1
     Eigenvalue    Difference    Proportion    Cumulative
1    1.96932471    1.00715932    0.6564        0.6564
2    0.96216539    0.89365549    0.3207        0.9772
3    0.06850990                  0.0228        1.0000

Factor Pattern
      Factor1     Factor2
A     0.87935     0.44646
B    -0.48142     0.87295
C     0.98199     0.02817

[Figure: the five species plotted by their factor scores, PC 1 on the x-axis and PC 2 on the y-axis. The points are unlabeled - identifying them is the game.]


PCA Example - Pollution Data

DATA POLLUTE;
INPUT CITY $ SO2 TEMP FACTORYS POP WIND PRECIP PRCPDAYS;
Datalines;
PHOENIX   10 70.3  213  582  6.0  7.05  36
SAN_FRAN  12 56.7  453  716  8.7 20.66  67
DENVER    17 51.9  454  515  9.0 12.95  86
MIAMI     10 75.5  207  335  9.0 59.80 128
ATLANTA   24 61.5  368  497  9.1 48.34 115
CHICAGO  110 50.6 3344 3369 10.4 34.44 122
NEW_ORLS   9 68.3  204  361  8.4 56.77 113
DETROIT   35 49.9 1064 1513 10.1 30.96 129
ST_LOUIS  56 55.9  775  622  9.5 35.89 105
ALBQURQE  11 56.8   46  244  8.9  7.77  58
CLEVLAND  65 49.7 1007  751 10.9 34.99 155
DALLAS     9 66.2  641  844 10.9 35.94  78
HOUSTON   10 68.9  721 1233 10.8 48.19 103
SLT_LAKE  28 51.0  137  176  8.7 15.17  89
SEATTLE   29 51.1  379  531  9.4 38.79 164
;
PROC PRINCOMP;
* The PRINCOMP procedure does a "barebones" PCA. We'll get simple statistics
  (mean, standard deviation) for our variables, the correlation matrix,
  eigenvalues, and eigenvectors. The FACTOR procedure below gives you the factor
  pattern matrix, which (as you remember) contains the eigenvectors transformed
  into more familiar correlation coefficients. ;
PROC FACTOR MSA NFACTORS=3 OUT=FACS;
* The FACTOR procedure will produce PCA output as you commonly see it in the
  literature. The MSA option prints a matrix of partial correlation coefficients -
  we don't really need this for the PCA, but it shows us how we can get SAS to
  print such a matrix. The NFACTORS=3 option says I want to retain 3 factors;
  this option must be specified if you want to specify the OUT= option. The
  OUT=FACS option creates a new data set, which has all the original variables
  plus the factor scores calculated by SAS. You can call the data set any legal
  SAS name; the FACS name was just used by Dr. M. for this example. FACS now
  replaces POLLUTE as the "current data set". This allows us to print and plot
  the factor scores, which we accomplish with the following procedures. ;
PROC PRINT;
VAR CITY FACTOR1 FACTOR2 FACTOR3;
RUN;
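One possible addition (an assumption on our part, not in the original program) is a quick text plot of the cities on the first two components, using the FACS data set and PROC PLOT's $ option to label each point with its city name.

PROC PLOT DATA=FACS;
  PLOT FACTOR2*FACTOR1 $ CITY;   * y*x, with each point labeled by CITY;
RUN;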


The SAS System        15:32 Thursday, December 12, 2002   1

The PRINCOMP Procedure

Observations  15
Variables      7

Simple Statistics
            SO2           TEMP          FACTORYS      POP
Mean    29.00000000   58.95333333   667.5333333   819.2666667
StD     28.39517062    8.77869716   802.5373037   789.8238924

Simple Statistics
            WIND          PRECIP        PRCPDAYS
Mean    9.320000000   32.51400000   103.2000000
StD     1.255388386   16.87902155    35.1063690

Correlation Matrix
            SO2     TEMP    FACTORYS   POP     WIND    PRECIP  PRCPDAYS
SO2       1.0000   -.5847   0.8773    0.7499  0.4116   0.0389   0.4513
TEMP      -.5847   1.0000   -.3765    -.2767  -.3133   0.4250   -.3140
FACTORYS  0.8773   -.3765   1.0000    0.9675  0.4757   0.1002   0.3031
POP       0.7499   -.2767   0.9675    1.0000  0.4334   0.0908   0.2150
WIND      0.4116   -.3133   0.4757    0.4334  1.0000   0.4021   0.5618
PRECIP    0.0389   0.4250   0.1002    0.0908  0.4021   1.0000   0.6467
PRCPDAYS  0.4513   -.3140   0.3031    0.2150  0.5618   0.6467   1.0000

Eigenvalues of the Correlation Matrix
     Eigenvalue    Difference    Proportion    Cumulative
1    3.57712916    1.83796725    0.5110        0.5110
2    1.73916191    0.73514870    0.2485        0.7595
3    1.00401320    0.50945619    0.1434        0.9029
4    0.49455701    0.36494403    0.0707        0.9736
5    0.12961298    0.07918011    0.0185        0.9921
6    0.05043288    0.04534002    0.0072        0.9993
7    0.00509286                  0.0007        1.0000

The SAS System        15:32 Thursday, December 12, 2002   2

The PRINCOMP Procedure

Eigenvectors
            Prin1     Prin2     Prin3     Prin4     Prin5     Prin6     Prin7
SO2       0.476877  -.180538  0.013700  0.356991  0.713054  0.064110  0.315924
TEMP      -.288831  0.381692  0.651315  -.027340  0.237915  0.537784  0.010638
FACTORYS  0.482661  -.155212  0.347322  0.026460  -.031244  -.020512  -.787539
POP       0.444554  -.153084  0.457597  -.079213  -.538310  0.016992  0.522687
WIND      0.366943  0.272014  -.207668  -.828812  0.213948  0.117940  0.040286
PRECIP    0.133018  0.708151  0.169598  0.169708  0.043718  -.648027  0.037600
PRCPDAYS  0.325937  0.445783  -.416754  0.386142  -.310662  0.521651  -.060186


The SAS System        15:32 Thursday, December 12, 2002   3

The FACTOR Procedure
Initial Factor Method: Principal Components

Partial Correlations Controlling all other Variables
            SO2       TEMP      FACTORYS   POP       WIND      PRECIP    PRCPDAYS
SO2        1.00000   -0.20202   0.90566   -0.80461  -0.42293  -0.12387   0.33718
TEMP      -0.20202    1.00000   0.06181   -0.02658  -0.41201   0.83634  -0.65558
FACTORYS   0.90566    0.06181   1.00000    0.97211   0.37062   0.16716  -0.29979
POP       -0.80461   -0.02658   0.97211    1.00000  -0.28461  -0.15404   0.24220
WIND      -0.42293   -0.41201   0.37062   -0.28461   1.00000   0.28501   0.05290
PRECIP    -0.12387    0.83634   0.16716   -0.15404   0.28501   1.00000   0.85330
PRCPDAYS   0.33718   -0.65558  -0.29979    0.24220   0.05290   0.85330   1.00000

Kaiser's Measure of Sampling Adequacy: Overall MSA = 0.48794684
SO2         TEMP        FACTORYS    POP         WIND        PRECIP      PRCPDAYS
0.53002607  0.41088798  0.51802421  0.50849249  0.64033013  0.33119609  0.45213607

Prior Communality Estimates: ONE

Eigenvalues of the Correlation Matrix: Total = 7 Average = 1
     Eigenvalue    Difference    Proportion    Cumulative
1    3.57712916    1.83796725    0.5110        0.5110
2    1.73916191    0.73514870    0.2485        0.7595
3    1.00401320    0.50945619    0.1434        0.9029
4    0.49455701    0.36494403    0.0707        0.9736
5    0.12961298    0.07918011    0.0185        0.9921
6    0.05043288    0.04534002    0.0072        0.9993
7    0.00509286                  0.0007        1.0000

3 factors will be retained by the NFACTOR criterion.

The SAS System        15:32 Thursday, December 12, 2002   4

The FACTOR Procedure
Initial Factor Method: Principal Components

Factor Pattern
            Factor1    Factor2    Factor3
SO2         0.90193   -0.23809    0.01373
TEMP       -0.54628    0.50337    0.65262
FACTORYS    0.91287   -0.20469    0.34802
POP         0.84080   -0.20188    0.45851
WIND        0.69401    0.35872   -0.20808
PRECIP      0.25158    0.93389    0.16994
PRCPDAYS    0.61645    0.58789   -0.41759


Variance Explained by Each Factor
Factor1      Factor2      Factor3
3.5771292    1.7391619    1.0040132

Final Communality Estimates: Total = 6.320304
SO2         TEMP        FACTORYS    POP         WIND        PRECIP      PRCPDAYS
0.87035569  0.97770742  0.99634731  0.95793191  0.65363189  0.96432262  0.90000742

The SAS System        15:32 Thursday, December 12, 2002   5

The FACTOR Procedure
Initial Factor Method: Principal Components

Scoring Coefficients Estimated by Regression

Squared Multiple Correlations of the Variables with Each Factor
Factor1      Factor2      Factor3
1.0000000    1.0000000    1.0000000

Standardized Scoring Coefficients
            Factor1     Factor2     Factor3
SO2         0.25214    -0.13690     0.01367
TEMP       -0.15271     0.28943     0.65001
FACTORYS    0.25520    -0.11769     0.34663
POP         0.23505    -0.11608     0.45668
WIND        0.19401     0.20626    -0.20725
PRECIP      0.07033     0.53698     0.16926
PRCPDAYS    0.17233     0.33803    -0.41592

The SAS System        15:32 Thursday, December 12, 2002   6

Obs  CITY       Factor1    Factor2    Factor3
  1  PHOENIX   -1.53031   -1.43540    1.58640
  2  SAN_FRAN  -0.53362   -0.77323    0.08496
  3  DENVER    -0.35771   -0.93924   -0.73577
  4  MIAMI     -0.56114    1.83012    0.76975
  5  ATLANTA   -0.18998    0.78031   -0.07433
  6  CHICAGO    2.74165   -1.01343    1.66905
  7  NEW_ORLS  -0.61697    1.25478    0.49627
  8  DETROIT    0.78401   -0.16037   -0.54512
  9  ST_LOUIS   0.31905   -0.06330   -0.29791
 10  ALBQURQE  -0.88110   -1.09993   -0.41247
 11  CLEVLAND   1.07704    0.31874   -1.41038
 12  DALLAS    -0.16998    0.46155    0.60185
 13  HOUSTON    0.09144    1.09084    0.90489
 14  SLT_LAKE  -0.46845   -0.87542   -1.09380
 15  SEATTLE    0.29605    0.62399   -1.54339


Cities high on PC 1 are high in SO2, Factories, and Population. They also tend to be high in wind and precipitation days, and low in temperature. Cities low on PC 1 are the opposite. PC 2 is primarily related to the amount of precipitation, with wet cities high on PC 2 and dry cities low on PC 2.

Note how the cities sort rather well geographically - which is reasonable given that several of the variables deal with weather. The southeast is the upper left quadrant. The southwest is the lower left quadrant. The Midwest is the right side of the graph (above 0 on PC 1). One city that is "out of position" is Seattle, which tends to be placed in the Midwest rust belt (sorry, citizens of Seattle).

This analysis immediately identifies Chicago as an outlier. Chicago is a much bigger city (population and factories) than any of the others. US cities comparable to Chicago (New York, Los Angeles) are not included in this data set.

[Figure: the 15 cities plotted by their factor scores, PC 1 on the x-axis and PC 2 on the y-axis, with each point labeled by city name.]


PCA Game - Practice Game A

Data are for four different commercial apples. Variables are the average water provided per day (gal), the average daily temperature (°F) where the tree was grown, the average number of apples/tree, and the height of the tree.

Apple Type   Water (gal)   Temp (F)   Apples/Tree   Tree Height (ft)
Fuji             10           88          121             19
Red               5           65          140             30
Green             7           55          196             37
Yellow            2           71           89             23

R (Correlation Matrix)
                   Water (gal)   Temp (F)    Apples/Tree   Tree Height (ft)
Water (gal)         1             0.41491     0.40573       -0.11235
Temp (F)            0.41491       1          -0.66315       -0.94954
Apples/Tree         0.40573      -0.66315     1              0.86053
Tree Height (ft)   -0.11235      -0.94954     0.86053        1

                PC 1     PC 2
Eigenvalues     2.659    1.339
% Trace         66.5     33.5
Cumulative %    66.5     99.9

Factor Pattern
                   PC 1       PC 2
Water (gal)       -0.09002    0.99591
Temp (F)          -0.94343    0.33121
Apples/Tree        0.87332    0.48653
Tree Height (ft)   0.99907   -0.02278

Identify which point is which apple type.

[Figure: the four apple types plotted by their factor scores, PC 1 on the x-axis and PC 2 on the y-axis. The points are unlabeled.]


PCA Game - Practice Game B

Same variables as Practice Game A, but more apple types make this game more difficult.

Apple Type   Water (gal)   Temp (F)   Apples/Tree   Tree Height (ft)
Fuji             10           88          121             19
Red               5           65          140             30
Green             7           55          196             37
Yellow            2           71           89             23
Dry               6           87          188             13
Juicy            12           40          101             30
Sweet             5           60           52             28
Sour              6           72          168             15

R (Correlation Matrix)
                   Water (gal)   Temp (F)   Apples/Tree   Tree Height (ft)
Water (gal)         1.000        -0.281       0.071          0.117
Temp (F)           -0.281         1.000       0.265         -0.784
Apples/Tree         0.071         0.265       1.000         -0.159
Tree Height (ft)    0.117        -0.784      -0.159          1.000

                PC 1    PC 2    PC 3
Eigenvalues     1.95    1.07    0.79
% Trace         48.8    26.8    19.8
Cumulative %    48.8    75.6    95.4

Factor Pattern
                   PC 1      PC 2      PC 3
Water (gal)       -0.358     0.756    -0.542
Temp (F)           0.943    -0.028    -0.102
Apples/Tree        0.384     0.706     0.592
Tree Height (ft)  -0.886    -0.029     0.367

[Figure: the eight apple types plotted by their factor scores, PC 1 on the x-axis and PC 2 on the y-axis. The points are unlabeled.]


PCA Game - Practice Game C

Data are various environmental variables for eight mines. The variables are explained below, next to the factor pattern matrix.

Mine            Elev   Precip   Temp   Frost    Area   A. c.   A. t.
Black Thunder   1433    280     6.8     125     26.3    328       1
Belle Ayr       1433    422     6.8     125      2       89       0
WyoDak          1341    422     6.8     125      9       50     186
Pathfinder      2195    244     4.1     100     37.3    256       4
Dave Johnston   1646    328     8.8     123     23.9      5     331
Seminoe I       2012    261     5.5     106     35.4    166      40
Bridger Coal    2073    225     5.9     112    152.3     46     562
Kemmerer        2225    274     3.5      71     77.2    392       6

R (Correlation)
          Elev     Precip   Temp     Frost    Area     A. c.    A. t.
Elev      1.000   -0.798   -0.763   -0.834    0.636    0.390    0.065
Precip   -0.798    1.000    0.500    0.518   -0.683   -0.428   -0.179
Temp     -0.763    0.500    1.000    0.858   -0.364   -0.706    0.387
Frost    -0.834    0.518    0.858    1.000   -0.416   -0.660    0.249
Area      0.636   -0.683   -0.364   -0.416    1.000    0.032    0.647
A. c.     0.390   -0.428   -0.706   -0.660    0.032    1.000   -0.673
A. t.     0.065   -0.179    0.387    0.249    0.647   -0.673    1.000

                PC 1    PC 2
Eigenvalue      3.95    2.13
% Trace         56.5    30.5
Cumulative %    56.5    87.0

Factor Pattern
          PC 1     PC 2    Variables
Elev     -0.92     0.23    Elevation (m)
Precip    0.79    -0.36    Mean Annual Precipitation (mm)
Temp      0.90     0.26    Mean Annual Temperature
Frost     0.91     0.15    Mean frost-free days
Area     -0.59     0.75    Area Seeded (ha)
A. c.    -0.70    -0.59    Atriplex canescens, number of plants
A. t.     0.15     0.97    Artemisia tridentata, number of plants

The graph of the mines on PC 1 and PC 2 is on the next page.

Graph for PCA Practice Game C

[Figure: the eight mines plotted by their factor scores on the first two principal components. The points are unlabeled.]


PCA Game - Practice Game D

The data are for fish in some lakes of Africa. The variables are explained below, next to the factor pattern matrix.

Lake         Fam   Stotal   Scichlid    Area    Depth   Volume
Chad          23    176        13        1540     4.1      72
Turkana       15     39         7        6750    30.2    2750
Albert        15     46         9        5300    25.0     280
Edward         8     57        40        2325    17.0      39
Victoria      12    238       200       68800    40.0    2750
Tanganyika    19    247       136       32000   572.0   17800
Malawi         9    242       200        6400   292.0    8400

Correlations
           Fam      Stotal   Scichlid   Area     Depth    Volume
Fam        1.000    0.136    -0.378    -0.063    0.129    0.149
Stotal     0.136    1.000     0.842     0.569    0.622    0.630
Scichlid  -0.378    0.842     1.000     0.670    0.524    0.542
Area      -0.063    0.569     0.670     1.000    0.189    0.276
Depth      0.129    0.622     0.524     0.189    1.000    0.989
Volume     0.149    0.630     0.542     0.276    0.989    1.000

Eigenvalues
     Eigenvalue   Difference   Proportion   Cumulative
1    3.3819068    1.99121562   0.5637       0.5637
2    1.3906911    0.51003961   0.2318       0.7954

Factor Pattern
           PC 1       PC 2      Variables
Fam       -0.00549    0.79865   Number of families of fish in lake
Stotal     0.90055   -0.0216    Number of species of fish in lake
Scichlid   0.87239   -0.45386   Number of species of Cichlid fish
Area       0.62529   -0.4479    Area in km2
Depth      0.83063    0.42704   Mean lake depth in m
Volume     0.85374    0.40427   Volume of lake in km3

The graph of the lakes on PC 1 and PC 2 is on the next page.

Page 74: AdPac Advanced Biometrics Pac - CPP

AdPac Page 73

Graph for PCA Practice Game D

[Figure: the seven lakes plotted by their factor scores, PC 1 on the x-axis and PC 2 on the y-axis. The points are unlabeled.]


Answers for PCA Games

Game A: [Answer figure: the Game A scatterplot (PC 1 vs. PC 2) with each point labeled by apple type - Fuji, Red, Green, and Yellow.]

Game B: [Answer figure: the Game B scatterplot (PC 1 vs. PC 2) with each point labeled by apple type - Fuji, Red, Green, Yellow, Dry, Juicy, Sweet, and Sour.]

Game C: [Answer figure: the Game C scatterplot with each point labeled by mine name - Black Thunder, Belle Ayr, WyoDak, Pathfinder, Dave Johnston, Seminoe I, Bridger Coal, and Kemmerer.]

Game D: [Answer figure: the Game D scatterplot with each point labeled by lake name - Chad, Turkana, Albert, Edward, Victoria, Tanganyika, and Malawi.]


Discriminant Analysis

[Conceptual figure: two point clouds, Group 1 and Group 2, plotted on two variables, X and Y. There is lots of overlap between Groups 1 and 2 in the X dimension, and lots of overlap in the Y dimension. A new dimension drawn through the clouds - the discriminant variable (canonical discriminant variable) - provides the maximum separation between Group 1 and Group 2. We want to find the function of X and Y that describes the canonical discriminant variable.]


Discriminant Analysis - Example

DATA POLLUTE;
INPUT CITY $ SO2 TEMP FACTORYS POP WIND PRECIP PRCPDAYS;
IF SO2 GE 20 THEN AIR='SMOGGY';
ELSE AIR='CLEAN';
* Note that we have created a new classification variable named AIR. Instead of
  assigning a number to AIR, we have used strings (i.e. SMOGGY or CLEAN). SAS will
  handle this just as if numbers had been used. ;
Datalines;
PHOENIX   10 70.3  213  582  6.0  7.05  36
SAN_FRAN  12 56.7  453  716  8.7 20.66  67
DENVER    17 51.9  454  515  9.0 12.95  86
MIAMI     10 75.5  207  335  9.0 59.80 128
ATLANTA   24 61.5  368  497  9.1 48.34 115
CHICAGO  110 50.6 3344 3369 10.4 34.44 122
NEW_ORLS   9 68.3  204  361  8.4 56.77 113
DETROIT   35 49.9 1064 1513 10.1 30.96 129
ST_LOUIS  56 55.9  775  622  9.5 35.89 105
ALBQURQE  11 56.8   46  244  8.9  7.77  58
CLEVLAND  65 49.7 1007  751 10.9 34.99 155
DALLAS     9 66.2  641  844 10.9 35.94  78
HOUSTON   10 68.9  721 1233 10.8 48.19 103
SLT_LAKE  28 51.0  137  176  8.7 15.17  89
SEATTLE   29 51.1  379  531  9.4 38.79 164
;
PROC PRINT;
* Check out the value of AIR in this print. ;
PROC GLM;
CLASS AIR;
MODEL TEMP = AIR;
MEANS AIR / HOVTEST;
* We do a one-way ANOVA on the TEMP variable, not because we are interested in
  the ANOVA, but rather to do the test of homogeneity of variances
  (homoscedasticity). The HOVTEST option of the MEANS statement will do Levene's
  test. We've only done this for the TEMP variable, but could do similar
  procedures for the other variables. ;
PROC DISCRIM ANOVA CANONICAL POOL=TEST LIST CROSSLIST OUT=CANVARS;
CLASS AIR;
VAR TEMP FACTORYS POP WIND PRECIP PRCPDAYS;
ID CITY;
* The DISCRIM procedure performs the multiple discriminant functions analysis.
  AIR is the classification variable; we want a linear function of the other
  variables which best separates the CLEAN and SMOGGY cities. The CANONICAL
  option tells SAS to perform a canonical analysis, which is the type of analysis
  discussed in class and the type most often done in the literature. The
  POOL=TEST option produces the test of homogeneity of within-group covariance
  matrices. The LIST option provides the listing of how each city was classified
  by the classification functions. The CROSSLIST option tells SAS to produce a
  jackknifed classification matrix, and to list how each city was classified in
  the jackknife process. The OUT=CANVARS option creates a new data set (called
  CANVARS) which contains the original data as well as the canonical
  (discriminant) variable. (Note that we only have one discriminant variable
  because we only have 2 groups.) CANVARS then becomes the default data set,
  allowing us to print the canonical scores, which we do below. With the VAR
  statement, we tell SAS what variables to use in the discriminant analysis.
  Notice that we don't include the SO2 variable - that wouldn't make any sense,
  because we used the SO2 variable to form the CLEAN and SMOGGY groups. The
  ID CITY statement tells SAS to use the CITY variable on the listings provided
  by the LIST and CROSSLIST options. This just makes it easier to read the
  output. ;
PROC SORT;
BY AIR;
PROC PRINT;
VAR AIR CITY CAN1;
RUN;


The SAS System        11:15 Tuesday, December 17, 2002   1

Obs  CITY       SO2   TEMP  FACTORYS   POP   WIND  PRECIP  PRCPDAYS  AIR
  1  PHOENIX     10   70.3     213     582    6.0    7.05      36    CLEAN
  2  SAN_FRAN    12   56.7     453     716    8.7   20.66      67    CLEAN
  3  DENVER      17   51.9     454     515    9.0   12.95      86    CLEAN
  4  MIAMI       10   75.5     207     335    9.0   59.80     128    CLEAN
  5  ATLANTA     24   61.5     368     497    9.1   48.34     115    SMOGGY
  6  CHICAGO    110   50.6    3344    3369   10.4   34.44     122    SMOGGY
  7  NEW_ORLS     9   68.3     204     361    8.4   56.77     113    CLEAN
  8  DETROIT     35   49.9    1064    1513   10.1   30.96     129    SMOGGY
  9  ST_LOUIS    56   55.9     775     622    9.5   35.89     105    SMOGGY
 10  ALBQURQE    11   56.8      46     244    8.9    7.77      58    CLEAN
 11  CLEVLAND    65   49.7    1007     751   10.9   34.99     155    SMOGGY
 12  DALLAS       9   66.2     641     844   10.9   35.94      78    CLEAN
 13  HOUSTON     10   68.9     721    1233   10.8   48.19     103    CLEAN
 14  SLT_LAKE    28   51.0     137     176    8.7   15.17      89    SMOGGY
 15  SEATTLE     29   51.1     379     531    9.4   38.79     164    SMOGGY

The SAS System        11:15 Tuesday, December 17, 2002   2

The GLM Procedure

Class Level Information
Class    Levels    Values
AIR      2         CLEAN SMOGGY

Number of observations  15

The SAS System        11:15 Tuesday, December 17, 2002   3

The GLM Procedure

Dependent Variable: TEMP
                        Sum of
Source           DF     Squares        Mean Square    F Value    Pr > F
Model             1      494.653762    494.653762     11.01      0.0056
Error            13      584.263571     44.943352
Corrected Total  14     1078.917333

R-Square    Coeff Var    Root MSE    TEMP Mean
0.458472    11.37167     6.703980    58.95333

Source    DF    Type I SS      Mean Square    F Value    Pr > F
AIR        1    494.6537619    494.6537619    11.01      0.0056

Source    DF    Type III SS    Mean Square    F Value    Pr > F
AIR        1    494.6537619    494.6537619    11.01      0.0056


The SAS System        11:15 Tuesday, December 17, 2002   4

The GLM Procedure

Levene's Test for Homogeneity of TEMP Variance
ANOVA of Squared Deviations from Group Means
                 Sum of      Mean
Source    DF     Squares     Square    F Value    Pr > F
AIR        1      6718.0     6718.0    3.56       0.0816
Error     13     24506.3     1885.1

The SAS System        11:15 Tuesday, December 17, 2002   5

The GLM Procedure

Level of        -------------TEMP------------
AIR      N      Mean            Std Dev
CLEAN    8      64.3250000      8.19385658
SMOGGY   7      52.8142857      4.36441236

The SAS System        11:15 Tuesday, December 17, 2002   6

The DISCRIM Procedure

Observations  15    DF Total             14
Variables      6    DF Within Classes    13
Classes        2    DF Between Classes    1

Class Level Information
         Variable                               Prior
AIR      Name        Frequency    Weight    Proportion    Probability
CLEAN    CLEAN       8            8.0000    0.533333      0.500000
SMOGGY   SMOGGY      7            7.0000    0.466667      0.500000

Within Covariance Matrix Information
                         Natural Log of the
          Covariance     Determinant of the
AIR       Matrix Rank    Covariance Matrix
CLEAN     6              33.92177
SMOGGY    6              32.16990
Pooled    6              37.53746


The SAS System        11:15 Tuesday, December 17, 2002   7

The DISCRIM Procedure

Test of Homogeneity of Within Covariance Matrices

Notation: K = Number of Groups
          P = Number of Variables
          N = Total Number of Observations - Number of Groups
          N(i) = Number of Observations in the i'th Group - 1

    V = [ product over i of |Within SS Matrix(i)|**(N(i)/2) ] / |Pooled SS Matrix|**(N/2)

    RHO = 1.0 - [ SUM over i of 1/N(i) - 1/N ] * (2P**2 + 3P - 1) / ( 6(P+1)(K-1) )

    DF = .5(K-1)P(P+1)

Under the null hypothesis,
    -2 RHO ln[ N**(PN/2) V / product over i of N(i)**(PN(i)/2) ]
is distributed approximately as Chi-Square(DF).

Chi-Square    DF    Pr > ChiSq
29.166416     21    0.1101

Since the Chi-Square value is not significant at the 0.1 level, a pooled covariance matrix will be used in the discriminant function.

Reference: Morrison, D.F. (1976) Multivariate Statistical Methods p252.

The SAS System        11:15 Tuesday, December 17, 2002   8

The DISCRIM Procedure

Pairwise Generalized Squared Distances Between Groups
    D2(i|j) = (Xbar(i) - Xbar(j))' COV**(-1) (Xbar(i) - Xbar(j))

Generalized Squared Distance to AIR
From AIR     CLEAN      SMOGGY
CLEAN        0          7.94228
SMOGGY       7.94228    0


The SAS System        11:15 Tuesday, December 17, 2002   9

The DISCRIM Procedure

Univariate Test Statistics
F Statistics, Num DF=1, Den DF=13
            Total       Pooled      Between
            Standard    Standard    Standard               R-Square
Variable    Deviation   Deviation   Deviation   R-Square   / (1-RSq)   F Value   Pr > F
TEMP          8.7787      6.7040      8.1212     0.4585     0.8466      11.01    0.0056
FACTORYS    802.5373    758.1577    453.7967     0.1713     0.2067       2.69    0.1251
POP         789.8239    781.3826    325.8306     0.0912     0.1003       1.30    0.2741
WIND          1.2554      1.2364      0.5405     0.0993     0.1102       1.43    0.2526
PRECIP       16.8790     17.4451      2.0754     0.0081     0.0082       0.11    0.7498
PRCPDAYS     35.1064     28.6700     29.5946     0.3807     0.6147       7.99    0.0143

Average R-Square
Unweighted              0.2015056
Weighted by Variance    0.1321024

The SAS System        11:15 Tuesday, December 17, 2002   10

The DISCRIM Procedure
Canonical Discriminant Analysis

                Adjusted       Approximate    Squared
  Canonical     Canonical      Standard       Canonical
  Correlation   Correlation    Error          Correlation
1 0.833788      0.777586       0.081461       0.695202

Test of H0: The canonical correlations in the current row and all that follow are zero

Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)
                                                   Likelihood    Approximate
  Eigenvalue  Difference  Proportion  Cumulative   Ratio         F Value  Num DF  Den DF  Pr > F
1 2.2809                  1.0000      1.0000       0.30479812    3.04     6       8       0.0743

NOTE: The F statistic is exact.


The SAS System        11:15 Tuesday, December 17, 2002   11

The DISCRIM Procedure
Canonical Discriminant Analysis

Total Canonical Structure
Variable      Can1
TEMP          0.812084
FACTORYS     -0.496372
POP          -0.362137
WIND         -0.377938
PRECIP       -0.107936
PRCPDAYS     -0.740011

Between Canonical Structure
Variable      Can1
TEMP          1.000000
FACTORYS     -1.000000
POP          -1.000000
WIND         -1.000000
PRECIP       -1.000000
PRCPDAYS     -1.000000

Pooled Within Canonical Structure
Variable      Can1
TEMP          0.609252
FACTORYS     -0.301031
POP          -0.209719
WIND         -0.219855
PRECIP       -0.059833
PRCPDAYS     -0.519152

The SAS System        11:15 Tuesday, December 17, 2002   12

The DISCRIM Procedure
Canonical Discriminant Analysis

Total-Sample Standardized Canonical Coefficients
Variable      Can1
TEMP          1.567678595
FACTORYS     -0.927472227
POP           0.623125314
WIND          0.580056210
PRECIP       -0.720218205
PRCPDAYS     -0.512306266

Pooled Within-Class Standardized Canonical Coefficients
Variable      Can1
TEMP          1.197180652
FACTORYS     -0.876183792
POP           0.616465594
WIND          0.571284451
PRECIP       -0.744372896
PRCPDAYS     -0.418380387


Raw Canonical Coefficients
Variable      Can1
TEMP          0.1785775915
FACTORYS      -.0011556749
POP           0.0007889421
WIND          0.4620531908
PRECIP        -.0426694286
PRCPDAYS      -.0145929722

Class Means on Canonical Variables
AIR        Can1
CLEAN      1.315162453
SMOGGY    -1.503042804

The SAS System        11:15 Tuesday, December 17, 2002   13

The DISCRIM Procedure

Linear Discriminant Function
    Constant = -.5 Xbar(j)' COV**(-1) Xbar(j)
    Coefficient Vector = COV**(-1) Xbar(j)

Linear Discriminant Function for AIR
Variable     CLEAN         SMOGGY
Constant    -319.36947    -286.33533
TEMP           7.57147       7.06820
FACTORYS       0.01009       0.01335
POP           -0.00978      -0.01200
WIND          20.29620      18.99404
PRECIP        -3.68109      -3.56084
PRCPDAYS       1.03595       1.07707


The SAS System        11:15 Tuesday, December 17, 2002   14

The DISCRIM Procedure

Classification Results for Calibration Data: WORK.POLLUTE
Resubstitution Results using Linear Discriminant Function

Generalized Squared Distance Function
    D2(X) = (X - Xbar(j))' COV**(-1) (X - Xbar(j))

Posterior Probability of Membership in Each AIR
    Pr(j|X) = exp(-.5 D2(X,j)) / SUM over k of exp(-.5 D2(X,k))

                                      Posterior Probability of Membership in AIR
              From       Classified
CITY          AIR        into AIR        CLEAN     SMOGGY
PHOENIX       CLEAN      CLEAN           0.9998    0.0002
SAN_FRAN      CLEAN      CLEAN           0.8464    0.1536
DENVER        CLEAN      SMOGGY *        0.3492    0.6508
MIAMI         CLEAN      CLEAN           0.9866    0.0134
ATLANTA       SMOGGY     SMOGGY          0.2954    0.7046
CHICAGO       SMOGGY     SMOGGY          0.0014    0.9986
NEW_ORLS      CLEAN      CLEAN           0.7194    0.2806
DETROIT       SMOGGY     SMOGGY          0.0199    0.9801
ST_LOUIS      SMOGGY     SMOGGY          0.0906    0.9094
ALBQURQE      CLEAN      CLEAN           0.9854    0.0146
CLEVLAND      SMOGGY     SMOGGY          0.0024    0.9976
DALLAS        CLEAN      CLEAN           0.9988    0.0012
HOUSTON       CLEAN      CLEAN           0.9977    0.0023
SLT_LAKE      SMOGGY     SMOGGY          0.1711    0.8289
SEATTLE       SMOGGY     SMOGGY          0.0014    0.9986

* Misclassified observation


The SAS System        11:15 Tuesday, December 17, 2002   15

The DISCRIM Procedure

Classification Summary for Calibration Data: WORK.POLLUTE
Resubstitution Summary using Linear Discriminant Function

Generalized Squared Distance Function
    D2(X) = (X - Xbar(j))' COV**(-1) (X - Xbar(j))

Posterior Probability of Membership in Each AIR
    Pr(j|X) = exp(-.5 D2(X,j)) / SUM over k of exp(-.5 D2(X,k))

Number of Observations and Percent Classified into AIR
From AIR     CLEAN      SMOGGY     Total
CLEAN        7          1          8
             87.50      12.50      100.00
SMOGGY       0          7          7
             0.00       100.00     100.00
Total        7          8          15
             46.67      53.33      100.00
Priors       0.5        0.5

Error Count Estimates for AIR
          CLEAN     SMOGGY    Total
Rate      0.1250    0.0000    0.0625
Priors    0.5000    0.5000


The SAS System        11:15 Tuesday, December 17, 2002   16

The DISCRIM Procedure

Classification Results for Calibration Data: WORK.POLLUTE
Cross-validation Results using Linear Discriminant Function

Generalized Squared Distance Function
    D2(X) = (X - Xbar(X)j)' COV(X)**(-1) (X - Xbar(X)j)
(the (X) subscripts indicate that the observation being classified is left out when the means and covariance matrix are computed)

Posterior Probability of Membership in Each AIR
    Pr(j|X) = exp(-.5 D2(X,j)) / SUM over k of exp(-.5 D2(X,k))

                                      Posterior Probability of Membership in AIR
              From       Classified
CITY          AIR        into AIR        CLEAN     SMOGGY
PHOENIX       CLEAN      CLEAN           1.0000    0.0000
SAN_FRAN      CLEAN      CLEAN           0.5816    0.4184
DENVER        CLEAN      SMOGGY *        0.0630    0.9370
MIAMI         CLEAN      CLEAN           0.9407    0.0593
ATLANTA       SMOGGY     CLEAN *         0.7455    0.2545
CHICAGO       SMOGGY     SMOGGY          0.0000    1.0000
NEW_ORLS      CLEAN      SMOGGY *        0.0257    0.9743
DETROIT       SMOGGY     SMOGGY          0.1342    0.8658
ST_LOUIS      SMOGGY     SMOGGY          0.3144    0.6856
ALBQURQE      CLEAN      CLEAN           0.9710    0.0290
CLEVLAND      SMOGGY     SMOGGY          0.0026    0.9974
DALLAS        CLEAN      CLEAN           0.9994    0.0006
HOUSTON       CLEAN      CLEAN           0.9980    0.0020
SLT_LAKE      SMOGGY     SMOGGY          0.4132    0.5868
SEATTLE       SMOGGY     SMOGGY          0.0010    0.9990

* Misclassified observation


The SAS System        11:15 Tuesday, December 17, 2002   17

The DISCRIM Procedure

Classification Summary for Calibration Data: WORK.POLLUTE
Cross-validation Summary using Linear Discriminant Function

Generalized Squared Distance Function
    D2(X) = (X - Xbar(X)j)' COV(X)**(-1) (X - Xbar(X)j)

Posterior Probability of Membership in Each AIR
    Pr(j|X) = exp(-.5 D2(X,j)) / SUM over k of exp(-.5 D2(X,k))

Number of Observations and Percent Classified into AIR
From AIR     CLEAN      SMOGGY     Total
CLEAN        6          2          8
             75.00      25.00      100.00
SMOGGY       1          6          7
             14.29      85.71      100.00
Total        7          8          15
             46.67      53.33      100.00
Priors       0.5        0.5

Error Count Estimates for AIR
          CLEAN     SMOGGY    Total
Rate      0.2500    0.1429    0.1964
Priors    0.5000    0.5000

The SAS System        11:15 Tuesday, December 17, 2002   18

Obs   AIR      CITY         Can1
  1   CLEAN    PHOENIX     2.89753
  2   CLEAN    SAN_FRAN    0.51166
  3   CLEAN    DENVER     -0.31491
  4   CLEAN    MIAMI       1.43099
  5   CLEAN    NEW_ORLS    0.24016
  6   CLEAN    ALBQURQE    1.40125
  7   CLEAN    DALLAS      2.29587
  8   CLEAN    HOUSTON     2.05874
  9   SMOGGY   ATLANTA    -0.40244
 10   SMOGGY   CHICAGO    -2.43076
 11   SMOGGY   DETROIT    -1.47738
 12   SMOGGY   ST_LOUIS   -0.91224
 13   SMOGGY   CLEVLAND   -2.23013
 14   SMOGGY   SLT_LAKE   -0.65386
 15   SMOGGY   SEATTLE    -2.41449


Plot of Cities by Canonical Scores

[Figure: each of the 15 cities plotted by its canonical score (y-axis, roughly -3 to 4), with a dashed vertical line and a solid horizontal line dividing the plot as described below.]

Cities to the left of the dashed line were put in the CLEAN group a priori, that is, by us. Those to the right of the dashed line were put in the SMOGGY group by us. Cities above the solid line (with positive canonical scores) were classified as CLEAN by the analysis. Those with negative canonical scores were classified as SMOGGY. Note that this figure quickly points out that Denver was actually a CLEAN city that the analysis classified as SMOGGY. The distances along the solid line (the "x-axis") are uniform, and have no statistical or biological meaning; the spacing just makes the graph easier to deal with visually.

How the Discriminant Functions are Evaluated

In the SAS output above, you are given coefficients for two equations: one for CLEAN, and one for SMOGGY. The equations are:

For CLEAN:
Y = -319.36947 + 7.57147(TEMP) + 0.01009(FACTORYS) - 0.00978(POP) + 20.29620(WIND) - 3.68109(PRECIP) + 1.03595(PRCPDAYS)

For SMOGGY:
Y = -286.33533 + 7.06820(TEMP) + 0.01335(FACTORYS) - 0.01200(POP) + 18.99404(WIND) - 3.56084(PRECIP) + 1.07707(PRCPDAYS)

For each city, the data for all the variables are put into each equation, and the equations are evaluated. The equation with the larger value gives the classification. For example, let's put the data for Chicago into both equations:

For CLEAN:
Y = -319.36947 + 7.57147(50.6) + 0.01009(3344) - 0.00978(3369) + 20.29620(10.4) - 3.68109(34.44) + 1.03595(122) = 275.2287

For SMOGGY:
Y = -286.33533 + 7.06820(50.6) + 0.01335(3344) - 0.01200(3369) + 18.99404(10.4) - 3.56084(34.44) + 1.07707(122) = 281.8352

Since the value for the SMOGGY equation (281.8352) is greater than the value for the CLEAN equation (275.2287), Chicago is classified into the SMOGGY group.
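The same arithmetic can be scripted. Here is a minimal sketch (an add-on, not part of the original program) that evaluates both classification functions for every city in a DATA step; it assumes the POLLUTE data set from the example above.

DATA CLASSIFY;
  SET POLLUTE;
  * Evaluate the CLEAN and SMOGGY classification functions;
  YCLEAN = -319.36947 + 7.57147*TEMP + 0.01009*FACTORYS - 0.00978*POP
           + 20.29620*WIND - 3.68109*PRECIP + 1.03595*PRCPDAYS;
  YSMOG  = -286.33533 + 7.06820*TEMP + 0.01335*FACTORYS - 0.01200*POP
           + 18.99404*WIND - 3.56084*PRECIP + 1.07707*PRCPDAYS;
  * The larger value gives the classification;
  IF YCLEAN GE YSMOG THEN CLASS = 'CLEAN ';
  ELSE CLASS = 'SMOGGY';
PROC PRINT;
  VAR CITY YCLEAN YSMOG CLASS;
RUN;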


Canonical Correlation Analysis - Introductory Example

Canonical Correlation Analysis will allow us to look for relationships between two sets of variables, measured on the same subjects. This example, of course, deals with birds. Our two sets of variables are: (1) Ecology and (2) Morphology. The Ecology variables are: (1) distance from tree trunk (m); (2) diameter of foraging substrate (cm). The Morphology variables are: (1) body length (cm); (2) body weight (g). Below are data on the variables for 5 different kinds of birds. By examining the data, we can see that woodpeckers are big birds, that they forage close to the tree trunk, and that tree trunks are big. Warblers are small birds that forage far from the trunk on little branches. First, we graph the data, and show the approximate positions of the principal components with dashed lines:

                         Ecology                              Morphology
              Distance from   Diameter of Foraging    Body Length   Body Weight
              Trunk (m)       Substrate (cm)          (cm)          (g)
Grosbeak          1.3              15                     21            45.0
Sparrow           1.7              13                     16            20.0
Warbler           2.0               1                     13            12.5
Nuthatch          1.0              10                     15            21.0
Woodpecker        0.2              50                     23            80.0

[Figure, left panel (Ecology): the five birds plotted by Distance from Trunk (m) on the x-axis and Diameter of Foraging Substrate (cm) on the y-axis. Dashed lines are principal components.]

[Figure, right panel (Morphology): the five birds plotted by Body Length (cm) on the x-axis and Body Weight (g) on the y-axis. Dashed lines are principal components.]


Below are the birds graphed by their factor scores on PC 1 and PC 2. Note that a separate PCA was done on each set of variables. Next, Canonical Correlation will determine a "derived axis" for the Ecology PCA, and a "derived axis" for the Morphology PCA. These derived axes are determined such that the correlation between them is the maximum possible across the 5 birds. The correlation between these derived axes is the "canonical correlation". Below, the dashed lines represent visual approximations of the derived axes.

               PCA Ecology               PCA Morphology
               PC 1        PC 2          PC 1        PC 2
Grosbeak       0.12091    -0.13738       0.57919     1.42946
Sparrow        0.47119     0.88938      -0.48042     0.57702
Warbler        1.02105     0.43560      -0.97851    -0.76217
Nuthatch       0.03595    -1.66506      -0.58220    -0.25565
Woodpecker    -1.64910     0.47746       1.46195    -0.98865

[Figure, left panel (Ecology): species graphed by their factor scores (given above), PC 1 on the x-axis and PC 2 on the y-axis. The dashed line is a visual approximation of the first derived ecology axis.]

[Figure, right panel (Morphology): species graphed by their factor scores, PC 1 on the x-axis and PC 2 on the y-axis. The dashed line is a visual approximation of the first derived morphology axis.]

Actually, since there are two principal components each for ecology and morphology, two sets of derived axes are determined. We are only looking at the first set of derived axes, just to keep things a bit simpler.


Finally, the species are graphed by their positions on the first two derived axes. The canonical correlation is the regular Pearson product-moment correlation coefficient between the two axes. However, assessment of significance is very different here. In this case, although the correlation looks impressive, the p value associated with it is only p = 0.43. What the analysis tells us (setting aside significance here, just for the sake of example) is that woodpeckers are big birds, that they forage close to the tree trunk, and that tree trunks are big. Warblers are small birds that forage far from the trunk on little branches.

First Derived Axes
               Ecology      Morphology
Grosbeak      -0.14413      0.07609
Sparrow       -0.29946     -0.64366
Warbler       -0.92343     -0.67290
Nuthatch      -0.34178     -0.46552
Woodpecker     1.70880      1.70599

[Figure: species plotted by their scores on the first derived ecology axis (x-axis) and the first derived morphology axis (y-axis). The solid line is the "canonical correlation" between the derived axes; r = 0.969368.]


Canonical Correlations Analysis - Example

One way to think of Canonical Correlations is to view it as a multiple regression - with multiple independent and multiple dependent variables. For example, from our pollution data, let's consider TEMP, WIND, PRECIP, and PRCPDAYS as independent variables - we'll call them the "Weather" variables. Let's consider SO2, FACTORYS, and POP as dependent variables - we'll call them the "Anthropogenic" variables, since they are generated by people. Canonical Correlations Analysis will ask if a linear function of the Weather variables predicts (is correlated with) a linear function of the Anthropogenic variables. In other words, is there a relationship between the Weather and Anthropogenic variables? This is the example we'll do on SAS. By the way, the answer to the question "Is there a relationship between the Weather and Anthropogenic variables?" is going to be: "Not particularly."

DATA POLLUTE;
INPUT CITY $ SO2 TEMP FACTORYS POP WIND PRECIP PRCPDAYS;
Datalines;
PHOENIX   10 70.3  213  582  6.0  7.05  36
SAN_FRAN  12 56.7  453  716  8.7 20.66  67
DENVER    17 51.9  454  515  9.0 12.95  86
MIAMI     10 75.5  207  335  9.0 59.80 128
ATLANTA   24 61.5  368  497  9.1 48.34 115
CHICAGO  110 50.6 3344 3369 10.4 34.44 122
NEW_ORLS   9 68.3  204  361  8.4 56.77 113
DETROIT   35 49.9 1064 1513 10.1 30.96 129
ST_LOUIS  56 55.9  775  622  9.5 35.89 105
ALBQURQE  11 56.8   46  244  8.9  7.77  58
CLEVLAND  65 49.7 1007  751 10.9 34.99 155
DALLAS     9 66.2  641  844 10.9 35.94  78
HOUSTON   10 68.9  721 1233 10.8 48.19 103
SLT_LAKE  28 51.0  137  176  8.7 15.17  89
SEATTLE   29 51.1  379  531  9.4 38.79 164
;
PROC CANCORR ALL OUT=CANSCORE
     VPREFIX = ANTH  WPREFIX = WEATH
     VNAME = 'Anthropogenic Variables'
     WNAME = 'Weather Variables';
TITLE 'Canonical Correlation Analysis for 15 Cities';
TITLE2 'Pollution and Weather Variables';
VAR SO2 FACTORYS POP;
WITH TEMP WIND PRECIP PRCPDAYS;
* The CANCORR procedure will do the canonical correlation. The ALL option
  instructs SAS to print extensive statistics. The OUT=CANSCORE option creates a
  new data set, which has all the original variables plus the scores on the
  derived axes calculated by SAS. CANSCORE replaces POLLUTE as the "current data
  set". The VPREFIX, WPREFIX, VNAME, WNAME, TITLE, and TITLE2 options just
  provide labels for the output - none of these are necessary for actually doing
  the analysis. VPREFIX and WPREFIX provide names for the derived axes in data
  set CANSCORE (the defaults would be V and W). The VAR and WITH statements are
  important: these statements identify the two variable sets to be used in the
  analysis. ;
PROC PRINT;
VAR CITY ANTH1-ANTH3 WEATH1-WEATH3;
* This prints the scores for each city on the derived axes. Note that we specify
  the derived axes using the names specified with VPREFIX and WPREFIX. ;
RUN;
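As with the PCA example, a quick way to see the cities on the first pair of derived axes is a text plot (a sketch we've added, not in the original program), using the CANSCORE data set:

PROC PLOT DATA=CANSCORE;
  PLOT ANTH1*WEATH1 $ CITY;   * y*x, with each point labeled by CITY;
RUN;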


Canonical Correlation Analysis for 15 Cities                      1
Pollution and Weather Variables      12:55 Tuesday, December 17, 2002

The CANCORR Procedure

Anthropogenic Variables    3
Weather Variables          4
Observations              15

Means and Standard Deviations
                              Standard
Variable       Mean           Deviation
SO2             29.000000      28.395171
FACTORYS       667.533333     802.537304
POP            819.266667     789.823892
TEMP            58.953333       8.778697
WIND             9.320000       1.255388
PRECIP          32.514000      16.879022
PRCPDAYS       103.200000      35.106369

Canonical Correlation Analysis for 15 Cities                      2
Pollution and Weather Variables      12:55 Tuesday, December 17, 2002

The CANCORR Procedure

Correlations Among the Original Variables

Correlations Among the Anthropogenic Variables
            SO2       FACTORYS   POP
SO2         1.0000    0.8773     0.7499
FACTORYS    0.8773    1.0000     0.9675
POP         0.7499    0.9675     1.0000

Correlations Among the Weather Variables
            TEMP      WIND      PRECIP    PRCPDAYS
TEMP        1.0000   -0.3133    0.4250   -0.3140
WIND       -0.3133    1.0000    0.4021    0.5618
PRECIP      0.4250    0.4021    1.0000    0.6467
PRCPDAYS   -0.3140    0.5618    0.6467    1.0000

Correlations Between the Anthropogenic Variables and the Weather Variables
            TEMP      WIND      PRECIP    PRCPDAYS
SO2        -0.5847    0.4116    0.0389    0.4513
FACTORYS   -0.3765    0.4757    0.1002    0.3031
POP        -0.2767    0.4334    0.0908    0.2150


Canonical Correlation Analysis for 15 Cities                      3
Pollution and Weather Variables      12:55 Tuesday, December 17, 2002

The CANCORR Procedure
Canonical Correlation Analysis

                Adjusted       Approximate    Squared
  Canonical     Canonical      Standard       Canonical
  Correlation   Correlation    Error          Correlation
1 0.754727      0.643927       0.115026       0.569612
2 0.552889      0.474993       0.185563       0.305686
3 0.091938      -.269777       0.265002       0.008453

Test of H0: The canonical correlations in the current row and all that follow are zero

Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)
                                                   Likelihood    Approximate
  Eigenvalue  Difference  Proportion  Cumulative   Ratio         F Value  Num DF  Den DF   Pr > F
1 1.3235      0.8832      0.7468      0.7468       0.29629834    1.04     12      21.458   0.4478
2 0.4403      0.4317      0.2484      0.9952       0.68844493    0.62      6      18       0.7152
3 0.0085                  0.0048      1.0000       0.99154735    0.04      2      10       0.9584

Multivariate Statistics and F Approximations
S=3  M=0  N=3

Statistic                  Value         F Value   Num DF   Den DF   Pr > F
Wilks' Lambda              0.29629834    1.04      12       21.458   0.4478
Pillai's Trace             0.88375107    1.04      12       30       0.4373
Hotelling-Lawley Trace     1.77228145    1.10      12       10.323   0.4449
Roy's Greatest Root        1.32348563    3.31       4       10       0.0568

NOTE: F Statistic for Roy's Greatest Root is an upper bound.


Canonical Correlation Analysis for 15 Cities                      4
Pollution and Weather Variables      12:55 Tuesday, December 17, 2002

The CANCORR Procedure
Canonical Correlation Analysis

Raw Canonical Coefficients for the Anthropogenic Variables
            ANTH1           ANTH2           ANTH3
SO2         0.0952576496    -0.061401686    0.0561606761
FACTORYS    -0.004911431    0.0067362969    -0.008212213
POP         0.0025160677    -0.003937349    0.0072501909

Raw Canonical Coefficients for the Weather Variables
            WEATH1          WEATH2          WEATH3
TEMP        -0.088803214    -0.085033707    0.0754151053
WIND        -0.36146664     0.68616785      0.5534895041
PRECIP      -0.002688268    0.0578715442    -0.098892778
PRCPDAYS    0.0182372727    -0.029027126    0.0110932647

Canonical Correlation Analysis for 15 Cities                      5
Pollution and Weather Variables      12:55 Tuesday, December 17, 2002

The CANCORR Procedure
Canonical Correlation Analysis

Standardized Canonical Coefficients for the Anthropogenic Variables
            ANTH1      ANTH2      ANTH3
SO2          2.7049    -1.7435     1.5947
FACTORYS    -3.9416     5.4061    -6.5906
POP          1.9873    -3.1098     5.7264

Standardized Canonical Coefficients for the Weather Variables
            WEATH1     WEATH2     WEATH3
TEMP        -0.7796    -0.7465     0.6620
WIND        -0.4538     0.8614     0.6948
PRECIP      -0.0454     0.9768    -1.6692
PRCPDAYS     0.6402    -1.0190     0.3894


Canonical Correlation Analysis for 15 Cities                      6
Pollution and Weather Variables      12:55 Tuesday, December 17, 2002

The CANCORR Procedure
Canonical Structure

Correlations Between the Anthropogenic Variables and Their Canonical Variables
            ANTH1      ANTH2      ANTH3
SO2         0.7370     0.6674     0.1067
FACTORYS    0.3540     0.8679     0.3485
POP         0.2022     0.8130     0.5460

Correlations Between the Weather Variables and Their Canonical Variables
            WEATH1     WEATH2     WEATH3
TEMP       -0.8577    -0.2812    -0.3875
WIND        0.1319     0.9155     0.0351
PRECIP     -0.1451     0.3468    -0.8566
PRCPDAYS    0.6008     0.3311    -0.5076

Correlations Between the Anthropogenic Variables and the Canonical Variables of the Weather Variables
            WEATH1     WEATH2     WEATH3
SO2         0.5562     0.3690     0.0098
FACTORYS    0.2672     0.4799     0.0320
POP         0.1526     0.4495     0.0502

Correlations Between the Weather Variables and the Canonical Variables of the Anthropogenic Variables
            ANTH1      ANTH2      ANTH3
TEMP       -0.6473    -0.1555    -0.0356
WIND        0.0996     0.5062     0.0032
PRECIP     -0.1095     0.1917    -0.0788
PRCPDAYS    0.4534     0.1830    -0.0467


Canonical Correlation Analysis for 15 Cities                      7
Pollution and Weather Variables      12:55 Tuesday, December 17, 2002

The CANCORR Procedure
Canonical Redundancy Analysis

Raw Variance of the Anthropogenic Variables Explained by
              Their Own Canonical Variables     The Opposite Canonical Variables
Canonical                   Cumulative   Canonical                Cumulative
Variable      Proportion    Proportion   R-Square     Proportion  Proportion
1             0.0841        0.0841       0.5696       0.0479      0.0479
2             0.7077        0.7918       0.3057       0.2163      0.2642
3             0.2082        1.0000       0.0085       0.0018      0.2660

Raw Variance of the Weather Variables Explained by
              Their Own Canonical Variables     The Opposite Canonical Variables
Canonical                   Cumulative   Canonical                Cumulative
Variable      Proportion    Proportion   R-Square     Proportion  Proportion
1             0.3180        0.3180       0.5696       0.1811      0.1811
2             0.1108        0.4288       0.3057       0.0339      0.2150
3             0.3372        0.7660       0.0085       0.0029      0.2178

Canonical Correlation Analysis for 15 Cities                      8
Pollution and Weather Variables      12:55 Tuesday, December 17, 2002

The CANCORR Procedure
Canonical Redundancy Analysis

Standardized Variance of the Anthropogenic Variables Explained by
              Their Own Canonical Variables     The Opposite Canonical Variables
Canonical                   Cumulative   Canonical                Cumulative
Variable      Proportion    Proportion   R-Square     Proportion  Proportion
1             0.2365        0.2365       0.5696       0.1347      0.1347
2             0.6199        0.8564       0.3057       0.1895      0.3242
3             0.1436        1.0000       0.0085       0.0012      0.3254

Standardized Variance of the Weather Variables Explained by
              Their Own Canonical Variables     The Opposite Canonical Variables
Canonical                   Cumulative   Canonical                Cumulative
Variable      Proportion    Proportion   R-Square     Proportion  Proportion
1             0.2838        0.2838       0.5696       0.1616      0.1616
2             0.2868        0.5706       0.3057       0.0877      0.2493
3             0.2857        0.8563       0.0085       0.0024      0.2517


Canonical Correlation Analysis for 15 Cities                      9
Pollution and Weather Variables      12:55 Tuesday, December 17, 2002

The CANCORR Procedure
Canonical Redundancy Analysis

Squared Multiple Correlations Between the Anthropogenic Variables and the First M Canonical Variables of the Weather Variables
M           1         2         3
SO2         0.3094    0.4456    0.4457
FACTORYS    0.0714    0.3016    0.3027
POP         0.0233    0.2253    0.2279

Squared Multiple Correlations Between the Weather Variables and the First M Canonical Variables of the Anthropogenic Variables
M           1         2         3
TEMP        0.4191    0.4432    0.4445
WIND        0.0099    0.2661    0.2662
PRECIP      0.0120    0.0488    0.0550
PRCPDAYS    0.2056    0.2391    0.2413

Canonical Correlation Analysis for 15 Cities                     10
Pollution and Weather Variables      12:55 Tuesday, December 17, 2002

Obs  CITY       ANTH1      ANTH2      ANTH3      WEATH1     WEATH2     WEATH3
  1  PHOENIX   -0.17447   -0.96104    0.94544   -0.96464   -2.76594    0.79086
  2  SAN_FRAN  -0.82554    0.00527    0.05836   -0.20411    0.13096    0.25760
  3  DENVER    -0.85989    0.49640   -1.12634    0.48094   -0.25273    1.03489
  4  MIAMI     -0.76647   -0.02893   -0.79608   -0.97480   -0.76739   -1.35252
  5  ATLANTA    0.18400   -0.44186   -0.15747    0.02603    0.20585   -1.36389
  6  CHICAGO    0.98589    3.01675    1.05535    0.68910    1.01713   -0.01411
  7  NEW_ORLS  -0.78157   -0.09011   -0.63910   -0.38395   -0.30679   -2.09436
  8  DETROIT    0.36981   -0.42916    2.11079    0.99672    0.46622    0.18885
  9  ST_LOUIS   1.54781   -0.15721   -0.79642    0.22983    0.52627   -0.44453
 10  ALBQURQE  -0.10943   -0.81658   -0.07752   -0.41477   -0.22503    1.55073
 11  CLEVLAND   1.59024    0.34508   -1.26093    1.18864    0.51068    0.50644
 12  DALLAS    -1.71261    0.95191   -0.72599   -1.68343    1.39769    0.80266
 13  HOUSTON   -1.03151   -0.10221    1.49351   -1.46406    1.08273    0.01683
 14  SLT_LAKE   0.89192   -0.97966   -0.36311    0.71805   -0.34066    0.61471
 15  SEATTLE    0.69181   -0.80864    0.27951    1.76044   -0.67896   -0.49416


Canonical Correlations Analysis - Example

[Figure: each city plotted by its scores on WEATH1 and ANTH1; both axes run from -2 to 2.]

From the SAS output, we see that the canonical correlation between WEATH1 and ANTH1 is 0.754727; however, p = 0.4478, so we do not have a strong relationship between these two derived variables. WEATH1 is a temperature/moisture gradient: warm, dry cities tend to have negative scores, while cold, wet cities tend to have positive scores. ANTH1 is an SO2 gradient, running from smaller values (negative scores) to higher values (positive scores). Notice that many of the city positions are difficult to interpret. For example:

1. If warm, dry cities tend to be low on WEATH1, why does Dallas have a lower score than Phoenix?
2. If ANTH1 is an SO2 gradient, why doesn't Chicago have the highest value?

What you have to remember is that these variables are not really well related (p = 0.4478). There's more "noise" than "pattern" here.
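The scores plotted above come from PROC CANCORR. If you want to reproduce the plot yourself, here is a minimal sketch, assuming the POLLUTE data set used elsewhere in this handout, and assuming the OUT=, VPREFIX=, and WPREFIX= options were used to create and name the score variables (the data set name SCORES is made up for illustration):

* Save the canonical variable scores to a data set and plot the first
  pair. VPREFIX= and WPREFIX= name the score variables to match the
  output above (ANTH1-ANTH3 and WEATH1-WEATH3).;
PROC CANCORR DATA=POLLUTE OUT=SCORES VPREFIX=ANTH WPREFIX=WEATH;
VAR SO2 FACTORYS POP;
WITH TEMP WIND PRECIP PRCPDAYS;
PROC PLOT DATA=SCORES;
PLOT ANTH1*WEATH1 $ CITY;   * label each point with the city name;
RUN;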


Multivariate Multiple Regression

A Multivariate Multiple Regression is simply a regression with multiple independent variables and multiple dependent variables. It is very similar to Canonical Correlations Analysis, especially when testing the entire model, i.e. does one set of variables predict the other set of variables. Multivariate Multiple Regression does allow other tests with respect to particular dependent and independent variables (example below).

dm 'output; clear; log; clear;';
*options ls=76 ps=55 pageno=1;          *** FOR THE SCREEN ***;
*options ls=123 ps=41 nodate pageno=1;  *** FOR HARD COPIES - Landscape ***;
options ls=95 ps=55 nodate pageno=1;    *** FOR HARD COPIES - Portrait ***;
option formdlim = '_';
OPTIONS FORMCHAR="|----|+|---+=|-/\<>*";

DATA POLLUTE;
INPUT CITY $ SO2 TEMP FACTORYS POP WIND PRECIP PRCPDAYS;
Datalines;
PHOENIX 10 70.3 213 582 6.0 7.05 36
SAN_FRAN 12 56.7 453 716 8.7 20.66 67
DENVER 17 51.9 454 515 9.0 12.95 86
MIAMI 10 75.5 207 335 9.0 59.80 128
ATLANTA 24 61.5 368 497 9.1 48.34 115
CHICAGO 110 50.6 3344 3369 10.4 34.44 122
NEW_ORLS 9 68.3 204 361 8.4 56.77 113
DETROIT 35 49.9 1064 1513 10.1 30.96 129
ST_LOUIS 56 55.9 775 622 9.5 35.89 105
ALBQURQE 11 56.8 46 244 8.9 7.77 58
CLEVLAND 65 49.7 1007 751 10.9 34.99 155
DALLAS 9 66.2 641 844 10.9 35.94 78
HOUSTON 10 68.9 721 1233 10.8 48.19 103
SLT_LAKE 28 51.0 137 176 8.7 15.17 89
SEATTLE 29 51.1 379 531 9.4 38.79 164
;
PROC REG;
MODEL TEMP WIND PRECIP PRCPDAYS = SO2 FACTORYS POP / STB;
* Here we use the REG procedure to do a multivariate multiple regression,
  that is, a multiple regression with multiple dependent variables rather
  than just one dependent variable. Notice on the output that SAS does a
  separate multiple regression for each dependent variable. None of them
  are significant, which means we probably do not have very good
  predictors. Examination of the standardized partial regression
  coefficients indicates that FACTORYS is the best predictor (even if not
  a very good predictor).;
can_corr: MTEST / CANPRINT;
* This MTEST statement with the CANPRINT option does a canonical
  correlation analysis on our two sets of variables. Notice this is the
  same as when we used PROC CANCORR. Multivariate multiple regression and
  canonical correlations analysis are very similar procedures. The tests
  at the end of this part (Wilks' Lambda, etc.) test the hypothesis that
  all the predictors are zero for all the dependent variables. Notice this
  is not significant (p = 0.4478), so as we expected, we don't have
  significant predictors. Again, this exactly matches the canonical
  correlations output.;
Factorys_same: MTEST TEMP-WIND, WIND-PRECIP, PRECIP-PRCPDAYS, FACTORYS;
* This MTEST statement tests the hypothesis that the FACTORYS predictor is
  the same for all four dependent variables. It is (p = 0.6353). However,
  remember that FACTORYS was not a significant predictor, therefore it's
  not surprising that it is the same (i.e. equally nonsignificant) for all
  the dependent variables. A variety of hypotheses about the predictors
  and dependent variables may be tested. This one was done just as an
  example.;
RUN;
QUIT;
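In symbols, the Factorys_same test is (this just restates the comment in the program; $\beta_{F,Y}$ denotes the partial regression coefficient of FACTORYS for dependent variable Y):

$$H_0:\ \beta_{F,\mathrm{TEMP}} = \beta_{F,\mathrm{WIND}} = \beta_{F,\mathrm{PRECIP}} = \beta_{F,\mathrm{PRCPDAYS}}$$

The MTEST statement expresses this as three differences (TEMP-WIND, WIND-PRECIP, and PRECIP-PRCPDAYS) that must all be zero for the FACTORYS coefficient.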


                                 The SAS System                               1

                                The REG Procedure
                                  Model: MODEL1
                            Dependent Variable: TEMP

                     Number of Observations Read          15
                     Number of Observations Used          15

                              Analysis of Variance

                                     Sum of         Mean
         Source          DF        Squares       Square    F Value    Pr > F
         Model            3      479.57659    159.85886       2.93    0.0808
         Error           11      599.34075     54.48552
         Corrected Total 14     1078.91733

              Root MSE            7.38143    R-Square     0.4445
              Dependent Mean     58.95333    Adj R-Sq     0.2930
              Coeff Var          12.52080

                              Parameter Estimates

                      Parameter     Standard                      Standardized
 Variable      DF      Estimate        Error  t Value  Pr > |t|       Estimate
 Intercept      1      67.69157      5.55703    12.18    <.0001             0
 SO2            1      -0.47508      0.24952    -1.90    0.0834      -1.53668
 FACTORYS       1       0.02128      0.02309     0.92    0.3763       1.94576
 POP            1      -0.01119      0.01702    -0.66    0.5243      -1.00688

_______________________________________________________________________________

                                 The SAS System                               2

                                The REG Procedure
                                  Model: MODEL1
                            Dependent Variable: WIND

                     Number of Observations Read          15
                     Number of Observations Used          15

                              Analysis of Variance

                                     Sum of         Mean
         Source          DF        Squares       Square    F Value    Pr > F
         Model            3        5.87254      1.95751       1.33    0.3144
         Error           11       16.19146      1.47195
         Corrected Total 14       22.06400

              Root MSE            1.21324    R-Square     0.2662
              Dependent Mean      9.32000    Adj R-Sq     0.0660
              Coeff Var          13.01759


                              Parameter Estimates

                      Parameter     Standard                      Standardized
 Variable      DF      Estimate        Error  t Value  Pr > |t|       Estimate
 Intercept      1       9.44226      0.91337    10.34    <.0001             0
 SO2            1      -0.02688      0.04101    -0.66    0.5256      -0.60805
 FACTORYS       1       0.00363      0.00379     0.96    0.3589       2.32277
 POP            1      -0.00216      0.00280    -0.77    0.4566      -1.35780

_______________________________________________________________________________

                                 The SAS System                               3

                                The REG Procedure
                                  Model: MODEL1
                           Dependent Variable: PRECIP

                     Number of Observations Read          15
                     Number of Observations Used          15

                              Analysis of Variance

                                     Sum of         Mean
         Source          DF        Squares       Square    F Value    Pr > F
         Model            3      219.22378     73.07459       0.21    0.8851
         Error           11     3769.39538    342.67231
         Corrected Total 14     3988.61916

              Root MSE           18.51141    R-Square     0.0550
              Dependent Mean     32.51400    Adj R-Sq    -0.2028
              Coeff Var          56.93366

                              Parameter Estimates

                      Parameter     Standard                      Standardized
 Variable      DF      Estimate        Error  t Value  Pr > |t|       Estimate
 Intercept      1      39.79357     13.93610     2.86    0.0156             0
 SO2            1      -0.44946      0.62576    -0.72    0.4876      -0.75612
 FACTORYS       1       0.04180      0.05790     0.72    0.4854       1.98730
 POP            1      -0.02703      0.04267    -0.63    0.5394      -1.26490

_______________________________________________________________________________

                                 The SAS System                               4

                                The REG Procedure
                                  Model: MODEL1
                          Dependent Variable: PRCPDAYS

                     Number of Observations Read          15
                     Number of Observations Used          15

                              Analysis of Variance

                                     Sum of         Mean
         Source          DF        Squares       Square    F Value    Pr > F
         Model            3     4162.81063   1387.60354       1.17    0.3666
         Error           11          13092   1190.14449
         Corrected Total 14          17254


              Root MSE           34.49847    R-Square     0.2413
              Dependent Mean    103.20000    Adj R-Sq     0.0343
              Coeff Var          33.42875

                              Parameter Estimates

                      Parameter     Standard                      Standardized
 Variable      DF      Estimate        Error  t Value  Pr > |t|       Estimate
 Intercept      1      85.29699     25.97178     3.28    0.0073             0
 SO2            1       1.02970      1.16620     0.88    0.3961       0.83286
 FACTORYS       1      -0.02144      0.10790    -0.20    0.8461      -0.49006
 POP            1       0.00287      0.07953     0.04    0.9719       0.06458

_______________________________________________________________________________

                                 The SAS System                               5

                                The REG Procedure
                                  Model: MODEL1
                          Multivariate Test: can_corr

                                        Adjusted    Approximate        Squared
                        Canonical      Canonical       Standard      Canonical
                      Correlation    Correlation          Error    Correlation
                 1       0.754727       0.643927       0.115026       0.569612
                 2       0.552889       0.474993       0.185563       0.305686
                 3       0.091938       -.269777       0.265002       0.008453

                  Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)

                   Eigenvalue    Difference    Proportion    Cumulative
              1        1.3235        0.8832        0.7468        0.7468
              2        0.4403        0.4317        0.2484        0.9952
              3        0.0085                      0.0048        1.0000

          Test of H0: The canonical correlations in the current row
                        and all that follow are zero

                  Likelihood    Approximate
                       Ratio        F Value    Num DF    Den DF    Pr > F
              1   0.29629834           1.04        12    21.458    0.4478
              2   0.68844493           0.62         6        18    0.7152
              3   0.99154735           0.04         2        10    0.9584

                 Multivariate Statistics and F Approximations

                                S=3    M=0    N=3

 Statistic                      Value    F Value    Num DF    Den DF    Pr > F
 Wilks' Lambda             0.29629834       1.04        12    21.458    0.4478
 Pillai's Trace            0.88375107       1.04        12        30    0.4373
 Hotelling-Lawley Trace    1.77228145       1.10        12    10.323    0.4449
 Roy's Greatest Root       1.32348563       3.31         4        10    0.0568

        NOTE: F Statistic for Roy's Greatest Root is an upper bound.

_______________________________________________________________________________


                                 The SAS System                               6

                                The REG Procedure
                                  Model: MODEL1
                        Multivariate Test: Factorys_same

                 Multivariate Statistics and Exact F Statistics

                               S=1    M=0.5    N=3.5

 Statistic                      Value    F Value    Num DF    Den DF    Pr > F
 Wilks' Lambda             0.83502918       0.59         3         9    0.6353
 Pillai's Trace            0.16497082       0.59         3         9    0.6353
 Hotelling-Lawley Trace    0.19756294       0.59         3         9    0.6353
 Roy's Greatest Root       0.19756294       0.59         3         9    0.6353


Correspondence Analysis (CA)

CA is a heuristic technique for analyzing two-way or multi-way tables containing some measure of correspondence between the rows and columns. Contingency tables are common examples of such tables. CA provides information similar to a PCA on a correlation matrix, i.e. CA allows you to explore the structure of the categorical variables found in the table. Consider the following contingency table. The columns represent four different species of plants. The rows represent five different species of frugivores (two bird species, three monkey species). The frequencies in the table show how often each frugivore was seen eating the fruit of each plant species.

                Plant_1   Plant_2   Plant_3   Plant_4   Row Totals
 Bird_A               4         2         3         2           11
 Bird_B               4         3         7         4           18
 Monkey_A            25        10        12         4           51
 Monkey_B            18        24        33        13           88
 Monkey_C            10         6         7         2           25

 Column Totals       61        45        62        25      n = 193

Notice on the SAS output from PROC FREQ that the χ2 = 16.4416, DF = 12, p = 0.1718. Think of the 4 column values in each row as coordinates for that animal in 4-dimensional space (don't ask me to draw this). You could then calculate the distance between each pair of animals in this 4-dimensional space. These would be Euclidean distances - which you had to calculate in high school, except that you probably only used 2-dimensional space, which allowed you to use the Pythagorean theorem (the good old days!). These distances between the animals in 4-dimensional space would summarize all the information about the similarities between the animals.

Now suppose you could find a lower-dimensional (less than 4) space where you could position the animals in such a way as to retain most of the information about the differences between the animals. You could then present the information about the similarities (and differences) between the animals in a simple 1- or 2-dimensional graph. Notice how this is philosophically like PCA - it is a dimension reduction technique. Remember the PCA on the turtles? We could reduce information on Length, Width, and Height to a single component, and still retain most of the information. CA does the same thing for our contingency table.

Two important points: (1) we've discussed this example in terms of how similar (or different) the animals are in their use of the plants. It would be just as correct (mathematically and statistically) to discuss the problem in terms of how similar the plants are in terms of the animals; (2) as with the turtles, this is a simple example. You don't really need CA to see the patterns in this table (just as you didn't need PCA to know that some turtles are bigger than others). Again, this example is for teaching purposes. CA is more useful for much larger tables (consider 50 species of animals and 30 species of plants - a 50x30 table is tough to examine for pattern). On to the CA. As always, there are new terms to be learned.

Mass - Consider taking each frequency in the table and dividing by n (193 in our example). Each frequency is now a relative frequency (proportion), and all 20 relative frequencies sum to 1. (Note: PROC FREQ actually does this, although it's displayed in the output as percentages rather than proportions - check the output!) We can now say the table has one unit of mass distributed across the table. A row total of the relative frequencies is called a row mass. A column total of relative frequencies is a column mass.

Inertia and Row, Column Profiles - The inertia of a table is the χ2 value for the table divided by the sample size. In our table, inertia = 16.4416 / 193 = 0.08519. A row profile is a row of the table expressed as proportions of that row's total; a column profile is a column expressed as proportions of that column's total (see the Row Profiles and Column Profiles tables on the PROC CORRESP output below).
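The inertia can also be written directly in terms of the masses (this formula restates the definition above; p_ij is the relative frequency in cell (i,j), r_i is the row mass, and c_j is the column mass; these symbols are introduced here for clarity and are not part of the SAS output):

$$\text{inertia} = \sum_i \sum_j \frac{(p_{ij} - r_i c_j)^2}{r_i c_j} = \frac{\chi^2}{n} = \frac{16.4416}{193} = 0.08519$$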


What a CA does - If you're not comfortable with the basics of contingency tables, you might want to do a little review. Recall that if the rows and columns are independent (null hypothesis accepted), then the entries in the table (the distribution of mass) can be reproduced from the row and column profiles (row and column totals). For example, consider this simple 2x3 table:

                Column A   Column B   Column C   Row Totals
 Row 1                 2          4          8           14
 Row 2                 4          8         16           28

 Column Totals         6         12         24       n = 42

This table has a χ2 = 0, because the expected frequencies will be exactly the same as the observed. Thus, it is easy to "reduce the dimensionality" of this table - you can reduce it to the row totals or column totals. When the expected deviates from the observed, as in real tables (or our example frugivore x plant table), then each deviation contributes to the chi-squared. The CA attempts to decompose the chi-squared (or inertia) by finding a small number of dimensions in which the deviations (between observed and expected) can be represented. This is analogous to a PCA: a PCA attempts to decompose the total variance in a correlation matrix by finding a small number of components that represent most of the correlations among the variables.

In a CA, the dimensions are extracted by an eigenvalue technique. They are extracted to maximally explain the inertia. The maximum number of dimensions that can be extracted is the minimum of (r-1) and (c-1), where r is the number of rows, and c is the number of columns. For our example, (c-1) = (4-1) = 3. Hopefully, much of the information will be captured in the first few dimensions.

Interpretation - Similarity among the row categories (the animals in our example) can be examined with a plot of each category along the dimensions extracted. These are called the "row coordinates". Similarity among the column categories (the plants in our example) can be examined with a plot of each category along the dimensions extracted. These are called the "column coordinates". You can plot both the row and column categories on the same plot, but there is no meaning to the relative position of a column category compared to a row category. Limit your interpretation to the row categories relative to each other, and the column categories relative to each other.

Several other quantities are routinely provided to assist in interpretation. The quantities are provided for both the row categories and the column categories. We'll only discuss row categories here, just to save space.

Mass - the mass column shows the row totals expressed as relative frequencies. Compare the values to the row percentages provided by PROC FREQ.

Quality - shows how well the retained dimensions represent each row. In our example, SAS has retained 2 dimensions (the default - you can change this) which represent 99.51% of the total inertia. In other words, these two dimensions represent almost all of the original information. Therefore, the quality of each row is high (~0.9), because the two dimensions do very well at representing the original table. If the first two dimensions only accounted for 25% of the inertia, our quality values would be lower. Quality is calculated as the sum of the squared cosine values (see below) for each row across all the retained dimensions.

Inertia - shows how much each row contributes to the total inertia. The inertia for the first row (Bird_A) is only 0.0314, while the value for row three (Monkey_A) is 0.4497. Note on the output that the contribution to the χ2 of 16.4416 for the first row is 0.5159. Further note that 0.5159/16.4416 = 0.0314. The contribution to χ2 for row three is 7.3946, and 7.3946/16.4416 = 0.4497.

Partial Contributions to Inertia - shows how much each row contributes to the inertia explained by each dimension. We see that row three (Monkey_A) explains the largest amount of the inertia along Dimension 1 (the inertia along Dimension 1 is 0.07476). Row two (Bird_B) explains the largest amount of the inertia along Dimension 2.
Indices of the Coordinates that Contribute Most to Inertia - tells you, for each row, which dimension had the highest partial contribution to inertia for that row.

Squared cosines - essentially, the correlation between each dimension and each row. Dimension 1 is closely related to row three (Monkey_A). Remember that this row is the big contributor to the χ2. Row three is the one that's the most "different". Similar to an eigenvector element or factor loading in a PCA, the squared cosines show you the position of the dimensions in the original space defined by the rows.

Following the SAS output are graphs of the Frugivores on the first two dimensions, the Plants on the first two dimensions, and a combined graph of Frugivores and Plants on the first two dimensions.
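As a quick check of the Quality calculation described above, take the squared cosines for Bird_A from the output below (0.0922 on Dimension 1, 0.8003 on Dimension 2):

$$\text{Quality} = 0.0922 + 0.8003 = 0.8925 \approx 0.8926$$

The small discrepancy is rounding in the printed values.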


Data ContingencyTable;
*The Frugivores are animals (1=Bird_A, 2=Bird_B, 3=Monkey_A, 4=Monkey_B,
 5=Monkey_C). The Plants are plant species (1=Plant_1, 2=Plant_2,
 3=Plant_3, and 4=Plant_4). The Freq values show how often each frugivore
 was seen feeding on each plant species.;
Input Frugivore Plant Freq;
Datalines;
1 1 4
1 2 2
1 3 3
1 4 2
2 1 4
2 2 3
2 3 7
2 4 4
3 1 25
3 2 10
3 3 12
3 4 4
4 1 18
4 2 24
4 3 33
4 4 13
5 1 10
5 2 6
5 3 7
5 4 2
;
PROC FREQ;
Weight Freq;
Table Frugivore*Plant / Chisq;
* Correspondence analysis is accomplished by PROC CORRESP. Note that the
  table is configured in a manner very similar to PROC FREQ. There is a
  slightly different syntax in the Table statements (PROC FREQ uses an
  asterisk, PROC CORRESP uses a comma). For large tables, there is an
  easier way to enter the frequencies. See the SAS documentation (and the
  VAR statement) if this becomes a concern for you. This option allows you
  to enter the table as a table (i.e. with rows and columns), not one cell
  at a time as we have done here. The ALL option prints out a number of
  tables, most of which describe the original contingency table.;
Proc Corresp All;
Weight Freq;
Tables Frugivore, Plant;
Run;

Note that PROC FREQ uses the TABLE statement with an asterisk, while PROC CORRESP uses the TABLES statement with a comma. Needless to say, this can lead to confusion.
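The program above enters the contingency table one cell at a time. As the comment notes, PROC CORRESP can also read a table entered as a table, using the VAR statement. A minimal sketch (the data set name TableForm and the row-label variable Animal are made up for illustration; see the SAS documentation for details):

Data TableForm;
Input Animal $ Plant_1 Plant_2 Plant_3 Plant_4;
Datalines;
Bird_A 4 2 3 2
Bird_B 4 3 7 4
Monkey_A 25 10 12 4
Monkey_B 18 24 33 13
Monkey_C 10 6 7 2
;
Proc Corresp Data=TableForm All;
* Each observation is one row of the contingency table. The VAR statement
  lists the columns and the ID statement supplies the row labels.;
Var Plant_1 Plant_2 Plant_3 Plant_4;
ID Animal;
Run;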


                    The SAS System      13:42 Thursday, March 6, 2003       1

                               The FREQ Procedure

                          Table of Frugivore by Plant

         Frugivore     Plant

         Frequency|
         Percent  |
         Row Pct  |
         Col Pct  |       1|       2|       3|       4|  Total
         ---------+--------+--------+--------+--------+
                1 |      4 |      2 |      3 |      2 |     11
                  |   2.07 |   1.04 |   1.55 |   1.04 |   5.70
                  |  36.36 |  18.18 |  27.27 |  18.18 |
                  |   6.56 |   4.44 |   4.84 |   8.00 |
         ---------+--------+--------+--------+--------+
                2 |      4 |      3 |      7 |      4 |     18
                  |   2.07 |   1.55 |   3.63 |   2.07 |   9.33
                  |  22.22 |  16.67 |  38.89 |  22.22 |
                  |   6.56 |   6.67 |  11.29 |  16.00 |
         ---------+--------+--------+--------+--------+
                3 |     25 |     10 |     12 |      4 |     51
                  |  12.95 |   5.18 |   6.22 |   2.07 |  26.42
                  |  49.02 |  19.61 |  23.53 |   7.84 |
                  |  40.98 |  22.22 |  19.35 |  16.00 |
         ---------+--------+--------+--------+--------+
                4 |     18 |     24 |     33 |     13 |     88
                  |   9.33 |  12.44 |  17.10 |   6.74 |  45.60
                  |  20.45 |  27.27 |  37.50 |  14.77 |
                  |  29.51 |  53.33 |  53.23 |  52.00 |
         ---------+--------+--------+--------+--------+
                5 |     10 |      6 |      7 |      2 |     25
                  |   5.18 |   3.11 |   3.63 |   1.04 |  12.95
                  |  40.00 |  24.00 |  28.00 |   8.00 |
                  |  16.39 |  13.33 |  11.29 |   8.00 |
         ---------+--------+--------+--------+--------+
         Total          61       45       62       25      193
                     31.61    23.32    32.12    12.95   100.00

                    The SAS System      13:42 Thursday, March 6, 2003       2

                               The FREQ Procedure

                   Statistics for Table of Frugivore by Plant

              Statistic                     DF       Value      Prob
              ------------------------------------------------------
              Chi-Square                    12     16.4416    0.1718
              Likelihood Ratio Chi-Square   12     16.3476    0.1758
              Mantel-Haenszel Chi-Square     1      0.0000    0.9944
              Phi Coefficient                       0.2919
              Contingency Coefficient               0.2802
              Cramer's V                            0.1685

               WARNING: 35% of the cells have expected counts less
                        than 5. Chi-Square may not be a valid test.

                                Sample Size = 193


                    The SAS System      13:42 Thursday, March 6, 2003       3

                              The CORRESP Procedure

                                Contingency Table

                             1       2       3       4     Sum
                  1          4       2       3       2      11
                  2          4       3       7       4      18
                  3         25      10      12       4      51
                  4         18      24      33      13      88
                  5         10       6       7       2      25
                  Sum       61      45      62      25     193

                     Chi-Square Statistic Expected Values

                             1         2         3         4
                  1     3.4767    2.5648    3.5337    1.4249
                  2     5.6891    4.1969    5.7824    2.3316
                  3    16.1192   11.8912   16.3834    6.6062
                  4    27.8135   20.5181   28.2694   11.3990
                  5     7.9016    5.8290    8.0311    3.2383

                        Observed Minus Expected Values

                             1         2         3         4
                  1    0.52332  -0.56477  -0.53368   0.57513
                  2   -1.68912  -1.19689   1.21762   1.66839
                  3    8.88083  -1.89119  -4.38342  -2.60622
                  4   -9.81347   3.48187   4.73057   1.60104
                  5    2.09845   0.17098  -1.03109  -1.23834

                Contributions to the Total Chi-Square Statistic

                           1         2         3         4       Sum
                1     0.0788    0.1244    0.0806    0.2321    0.5159
                2     0.5015    0.3413    0.2564    1.1938    2.2931
                3     4.8929    0.3008    1.1728    1.0282    7.3946
                4     3.4625    0.5909    0.7916    0.2249    5.0698
                5     0.5573    0.0050    0.1324    0.4735    1.1682
                Sum   9.4929    1.3624    2.4338    3.1526   16.4416


                    The SAS System      13:42 Thursday, March 6, 2003       4

                              The CORRESP Procedure

                                  Row Profiles

                          1           2           3           4
        1          0.363636    0.181818    0.272727    0.181818
        2          0.222222    0.166667    0.388889    0.222222
        3          0.490196    0.196078    0.235294    0.078431
        4          0.204545    0.272727    0.375000    0.147727
        5          0.400000    0.240000    0.280000    0.080000

                                 Column Profiles

                          1           2           3           4
        1          0.065574    0.044444    0.048387    0.080000
        2          0.065574    0.066667    0.112903    0.160000
        3          0.409836    0.222222    0.193548    0.160000
        4          0.295082    0.533333    0.532258    0.520000
        5          0.163934    0.133333    0.112903    0.080000

                    The SAS System      13:42 Thursday, March 6, 2003       5

                              The CORRESP Procedure

                      Inertia and Chi-Square Decomposition

  Singular   Principal     Chi-                 Cumulative
     Value     Inertia   Square    Percent      Percent    18   36   54   72   90
                                                          ----+----+----+----+---
   0.27342     0.07476  14.4285      87.76       87.76    ************************
   0.10009     0.01002   1.9333      11.76       99.51    ***
   0.02034     0.00041   0.0798       0.49      100.00

   Total       0.08519  16.4416     100.00

                             Degrees of Freedom = 12

                                 Row Coordinates

                                       Dim1      Dim2
                          1         -0.0658    0.1937
                          2          0.2590    0.2433
                          3         -0.3806    0.0107
                          4          0.2330   -0.0577
                          5         -0.2011   -0.0789

                      Summary Statistics for the Row Points

                               Quality      Mass    Inertia
                      1         0.8926    0.0570     0.0314
                      2         0.9911    0.0933     0.1395
                      3         0.9998    0.2642     0.4497
                      4         0.9998    0.4560     0.3084
                      5         0.9986    0.1295     0.0711


             Partial Contributions to Inertia for the Row Points

                                       Dim1      Dim2
                          1          0.0033    0.2136
                          2          0.0837    0.5512
                          3          0.5120    0.0030
                          4          0.3310    0.1518
                          5          0.0701    0.0805

                    The SAS System      13:42 Thursday, March 6, 2003       6

                              The CORRESP Procedure

               Indices of the Coordinates that Contribute Most to
                         Inertia for the Row Points

                                 Dim1      Dim2      Best
                          1         0         2         2
                          2         0         2         2
                          3         1         0         1
                          4         1         1         1
                          5         0         0         2

                      Squared Cosines for the Row Points

                                       Dim1      Dim2
                          1          0.0922    0.8003
                          2          0.5264    0.4647
                          3          0.9990    0.0008
                          4          0.9419    0.0579
                          5          0.8653    0.1333

                               Column Coordinates

                                       Dim1      Dim2
                          1         -0.3933    0.0305
                          2          0.0995   -0.1411
                          3          0.1963   -0.0074
                          4          0.2938    0.1978

                    Summary Statistics for the Column Points

                               Quality      Mass    Inertia
                      1         1.0000    0.3161     0.5774
                      2         0.9840    0.2332     0.0829
                      3         0.9832    0.3212     0.1480
                      4         0.9946    0.1295     0.1917

            Partial Contributions to Inertia for the Column Points

                                       Dim1      Dim2
                          1          0.6540    0.0293
                          2          0.0308    0.4632
                          3          0.1656    0.0017
                          4          0.1495    0.5058


                    The SAS System      13:42 Thursday, March 6, 2003       7

                              The CORRESP Procedure

               Indices of the Coordinates that Contribute Most to
                        Inertia for the Column Points

                                 Dim1      Dim2      Best
                          1         1         0         1
                          2         0         2         2
                          3         1         0         1
                          4         0         2         2

                     Squared Cosines for the Column Points

                                       Dim1      Dim2
                          1          0.9940    0.0060
                          2          0.3267    0.6573
                          3          0.9818    0.0014
                          4          0.6844    0.3102


[Figure: Frugivores (Bird_A, Bird_B, Monkey_A, Monkey_B, Monkey_C) plotted by their row coordinates on Dimension 1 and Dimension 2.]

[Figure: Plants (Plant_1, Plant_2, Plant_3, Plant_4) plotted by their column coordinates on Dimension 1 and Dimension 2.]


[Figure: combined plot of the Frugivores and the Plants on Dimension 1 and Dimension 2.]

Detrended Correspondence Analysis is a method that tries to remove arch distortion/compression from the plots. Notice above (especially if you ignore Bird_A) that the points tend to form an arch (actually an upside-down arch, more like a bowl). So, although Dimension 1 and Dimension 2 are orthogonal (uncorrelated by definition), they appear to have a systematic relationship (or trend). Various detrending methods have been proposed, but they tend to be somewhat arbitrary, and are often not recommended. There is no detrending capability provided by SAS (as far as I know).

Multiple Correspondence Analysis is done on tables that involve three or more categorical variables (multi-way contingency tables).


Cluster Analysis

The objective in cluster analysis is to organize a group of entities (e.g. study sites, or species) into groups, based on similarity over multiple variables. The results are often presented in a dendrogram, a "tree diagram" which graphically demonstrates similarity by progressively uniting groupings (clusters) of the entities. You may be familiar with dendrograms from systematics. Although the method of presenting the information is similar (i.e. the dendrogram), you should be aware that the methods are quite different. In systematic biology, the principles of cladistics are generally used, which involve procedures not used in a cluster analysis. Anytime you are examining a dendrogram, make sure you know the method used to generate the relationships.

Example of Cluster Analysis

Consider the following data set, which shows the abundance of four species (A, B, C, D) at six different sites:

                    Species
 Sites      A      B      C      D
   1        1      9     12      1
   2        1      8     11      1
   3        1      6     10     10
   4       10      0      9     10
   5       10      2      8     10
   6       10      0      7      2

A cluster analysis is done to examine similarity among the sites across the species. The results may look like this:

[Figure: dendrogram in which sites 1 and 2 join first, then sites 4 and 5; site 6 then joins the 4-5 cluster, site 3 joins the 1-2 cluster, and the two clusters join last.]

Examine the data to see the patterns that produced the relationships in the graph. The dendrogram shows that sites 1 and 2 are the most similar, followed closely by sites 4 and 5. Site 6 is closer to the cluster of 4 and 5. Site 3 is closer to the cluster of 1 and 2.


The techniques used to produce the above cluster (and the ones most commonly used) are:

1. Exclusive: each entity (site) can be in only one cluster.
2. Sequential: the clustering is not done all at once, but progressively. For example, after sites 1 and 2 were clustered, the similarity of the remaining sites to the cluster had to be calculated, i.e. sites 1 and 2 are no longer considered separate entities.
3. Hierarchical: all clusters are united at sequentially higher levels.
4. Agglomerative: each entity starts out separate, and the entities are then progressively joined. A divisive technique would start with all entities united, and then progressively split them apart.
5. Polythetic: all variables are used to determine similarity among the entities.

Similarity

The first task is to quantify the similarity between entities. In actual practice, what is calculated is actually dissimilarity, or distance, between entities. There are many ways to do this, including the familiar correlation coefficient, percent similarity, and Euclidean distance. Since Euclidean distance is commonly used, it will be discussed here. The Euclidean distance (ED) between entities j and k (these would be the sites above) across P variables (the species above) is calculated using this formula (x_ij is the data for entity j on variable i; x_ik is the data for entity k on variable i):

$$ED_{jk} = \sqrt{\sum_{i=1}^{P} (x_{ij} - x_{ik})^2}$$

For sites 1 and 2,

$$ED = \sqrt{(1-1)^2 + (9-8)^2 + (12-11)^2 + (1-1)^2} = 1.414$$

Euclidean distance is simple geometry, an application of the Pythagorean theorem. For simplicity's sake, let's consider the similarity between sites 1 and 5 using only the first two species (A and B). The Euclidean distance is

$$ED = \sqrt{(1-10)^2 + (9-2)^2} = 11.4$$

This allows us to draw a simple 2-dimensional plot:

[Figure: Site 1 plotted at (A=1, B=9) and Site 5 at (A=10, B=2). The legs of the right triangle are (1 - 10) = -9 along A and (9 - 2) = 7 along B; the hypotenuse is the Euclidean distance between the two sites.]

This shows that the Euclidean distance is the length of the line between Site 1 and Site 5. Using all four species (A, B, C, D) is doing the same thing, except in 4 dimensions (which is a lot more difficult to draw).

Once a similarity (or dissimilarity) measure is employed, one must determine how to unite entities (individual entities or clusters of entities) in order to produce the dendrogram. These are called "clustering methods" (in SAS), or sometimes "fusion strategies". There are many ways to do this, and no universal agreement on what's the best way. SAS offers eleven different clustering methods. We'll limit discussion to four, including the two (UPGMA, Ward's Minimum Variance) most commonly used by biologists.



Clustering Methods

1. Unweighted Pair-Group Method using Averages (UPGMA) (SAS METHOD = AVERAGE)
- Calculate distances (usually the square of the Euclidean distance) between each pair of subjects. If the subjects are already in clusters, then each subject in one cluster is paired with each subject in the other clusters.
- Calculate the mean of all distances between each pair of subjects (clusters). The two subjects (clusters) with the smallest mean are joined.
This is probably the method most widely used by biologists. It is the method generally recommended if you have no basis for choosing among the available methods.

2. Ward's Minimum Variance (SAS METHOD = WARD)
- For each existing cluster, calculate the within-cluster variance of the distances between each pair of subjects.
- Combine all possible pairs of clusters, and calculate the within-cluster variance for all possible combinations. The pair of clusters with the minimum increase in variance when combined (less than if either cluster were combined with some other cluster) is joined.
This method is fairly popular among biologists. The concept of minimizing variance is somewhat similar to the concepts found in regression or ANOVA, and is therefore familiar to many biologists.

3. Centroid Linkage (SAS METHOD = CENTROID)
- For each existing cluster, a centroid is calculated. This is usually the n-dimensional mean of all subjects in the cluster on all n variables.
- Distances between cluster centroids are calculated for each pair of clusters. The clusters with the minimum distance are joined.
This method has enjoyed popularity in the past. It can lead to reversals (a cluster is joined back to a previous grouping).

4. Single Linkage (Nearest Neighbor) (SAS METHOD = SINGLE)
- For all possible pairs of clusters, each subject in the first cluster is paired with each subject in the second cluster, and distances are calculated.
- The clusters of the pair of subjects with the minimum distance are joined.
This method was sometimes used in the past for taxonomic purposes, but cluster analysis is no longer an important taxonomic method. The method tends to "chain", i.e. produce dendrograms where the subjects are added to the same cluster one at a time. In other words, it doesn't result in groupings of clusters, and is therefore difficult to interpret.

If you think about these methodologies, you can see that they are very computationally intensive. Indeed, large clustering problems can challenge even modern computers. SAS has a procedure for "fast clustering" of large data sets (PROC FASTCLUS) to deal with this problem.

Computational Example

Let's return to the site x species data discussed above, and show how some of the calculations are done. The SAS program and output for this example are provided below. The dendrogram from this SAS example is above. On the output (below), you can see that SAS has calculated the eigenvalues for the variables. This is to describe your variables, i.e. to indicate how independent/related they are. Next we see the history of the clustering used to produce the dendrogram:

NCL tells you the number of clusters that exist after the join that occurred in each step. We started with 6 sites, so at the first step sites 1 and 2 are joined. This leaves 5 clusters: sites 1 & 2 as one cluster, and the other 4 sites as "individual" clusters.

FREQ tells you how many of the original subjects are clustered together at each step. In each of the first two steps, two subjects are clustered. In each of the next two steps, one subject is added to a cluster, so each new cluster has 3 original subjects. The last step joins all 6 subjects.

Norm RMS Dist is the "Normalized Root-Mean-Square Distance". This is the actual distance used to produce the dendrogram (compare the dendrogram above with the Norm RMS Dist values below). Norm RMS Dist is calculated by dividing the Euclidean distance by the square root of the mean of all squared Euclidean distances. It is a method of standardizing the distances, so the clustering and dendrogram are not affected by the units. The calculation of Norm RMS Dist for the original sites is shown below.
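Written out as a formula (this simply restates the description above; ED_jk is the Euclidean distance between subjects j and k, and m is the number of distinct pairs of subjects):

$$\text{Norm RMS Dist}_{jk} = \frac{ED_{jk}}{\sqrt{\frac{1}{m}\sum_{\text{all pairs}} ED^2}}, \qquad m = \frac{n(n-1)}{2}$$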


Cluster Analysis - Example (Sites x Species)

DATA ClusterExample;
INPUT Sites $ A B C D;
Datalines;
Site_1 1 9 12 1
Site_2 1 8 11 1
Site_3 1 6 10 10
Site_4 10 0 9 10
Site_5 10 2 8 10
Site_6 10 0 7 2
;
PROC CLUSTER OUTTREE=BySites METHOD=AVE;
* The OUTTREE = BySites statement creates a new data set with the
  information SAS will use to produce the dendrogram in PROC TREE below.
  The METHOD = AVE statement tells SAS to use the UPGMA clustering method.
  The ID statement makes the site names available both for the output from
  this procedure and for PROC TREE to use in the dendrogram.;
ID Sites;
PROC TREE DATA=BySites HORIZONTAL;
* This procedure draws the dendrogram, which appears in a separate window.
  You can do some primitive formatting and editing on the dendrogram (see
  the SAS documentation). You will probably want to export the dendrogram
  as a bitmap file (.bmp) to use with other programs (word processing or
  PowerPoint).;
ID Sites;
RUN;

                    The SAS System      14:09 Tuesday, May 6, 2003          1

                              The CLUSTER Procedure
                        Average Linkage Cluster Analysis

                     Eigenvalues of the Covariance Matrix

                  Eigenvalue    Difference    Proportion    Cumulative
             1    48.9275612    32.6256828        0.7343        0.7343
             2    16.3018783    15.2959246        0.2447        0.9789
             3     1.0059537     0.6080136        0.0151        0.9940
             4     0.3979401                      0.0060        1.0000

          Root-Mean-Square Total-Sample Standard Deviation = 4.081462
          Root-Mean-Square Distance Between Observations   = 11.54412

                                 Cluster History
                                                              Norm    T
                                                               RMS    i
          NCL    --Clusters Joined---          FREQ           Dist    e
            5    Site_1        Site_2             2         0.1225
            4    Site_4        Site_5             2         0.1937
            3    CL4           Site_6             3         0.7169
            2    CL5           Site_3             3         0.8218
            1    CL2           CL3                6         1.1817


Calculation of Norm RMS Dist

First, we calculate the Euclidean distance for all pairs of sites. This results in the following matrix:

 Sites          1          2          3          4          5          6
   1            0   1.414214    9.69536   15.87451   15.06652   13.71131
   2     1.414214          0   9.273618   15.16575   14.38749   12.72792
   3      9.69536   9.273618          0   10.86278   10.04988   13.78405
   4     15.87451   15.16575   10.86278          0   2.236068   8.246211
   5     15.06652   14.38749   10.04988   2.236068          0   8.306624
   6     13.71131   12.72792   13.78405   8.246211   8.306624          0
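You don't have to build this matrix by hand. A minimal sketch that should reproduce it in SAS (assuming the ClusterExample data set from the program above, and assuming SAS/STAT's PROC DISTANCE is available; the output data set name DistMatrix is made up for illustration):

* Compute the Euclidean distance between every pair of sites. The
  INTERVAL() specification treats A, B, C, and D as interval-scaled
  variables.;
PROC DISTANCE DATA=ClusterExample METHOD=EUCLID OUT=DistMatrix;
VAR INTERVAL(A B C D);
ID Sites;
PROC PRINT DATA=DistMatrix;
RUN;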

We can already see that the minimum distance is between sites 1 and 2, so those will be joined first. However, for the purpose of the dendrogram, we need to standardize the distance. First, we square all of the distances:

 Sites       1      2      3      4      5      6
   1         0      2     94    252    227    188
   2         2      0     86    230    207    162
   3        94     86      0    118    101    190
   4       252    230    118      0      5     68
   5       227    207    101      5      0     69
   6       188    162    190     68     69      0

Next, we calculate the mean of the squared distances between each pair of sites. In other words, the mean of the 15 numbers in the upper-right (or lower-left) triangle of this matrix. It's not the mean of all the values in the matrix, because the squared distances of each site to itself (i.e. 0) are not included. The mean of all 15 squared distances is 133.2666667. We then calculate the square root of this mean. The square root of 133.2666667 is 11.54412. Notice that this value is on the SAS output, and is labeled as Root-Mean-Square Distance Between Observations.

The Normalized Root-Mean-Square Distance is then calculated by dividing the Euclidean distance by the Root-Mean-Square Distance Between Observations. For example:

Norm RMS Dist for Sites 1 and 2: 1.414214 / 11.54412 = 0.1225
Norm RMS Dist for Sites 4 and 5: 2.236068 / 11.54412 = 0.1937

Make sure that you realize that, at each step in the clustering, you have to recalculate all of the distances in order to get the Norm RMS Distance. For example, the SAS output says the cluster containing Sites 4 and 5 is joined to Site 6 at a Normalized Root-Mean-Square Distance of 0.7169. Let's see how this is calculated.

Once Site 4 and Site 5 are combined into a cluster, we need to calculate the distance of this cluster to Site 6. To do this, we take the average of the Euclidean distance from Site 4 to Site 6 and from Site 5 to Site 6. That is:

Euclidean distance of Site 4 to Site 6: 8.246211 (you can see this in the matrix above)
Euclidean distance of Site 5 to Site 6: 8.306624 (you can see this in the matrix above)

The average of these two distances is (8.246211 + 8.306624) / 2 = 8.2764175. This is the Euclidean distance of the cluster to Site 6. We now normalize this distance by dividing it by the RMS Distance of 11.54412:

8.2764175 / 11.54412 = 0.7169

This value (0.7169) is the Norm RMS Distance at which this cluster joins Site 6. See the SAS output and the graph to verify this.

Next, we'll have SAS cluster the cities in our data set using the UPGMA method (below). Notice that in the PROC CLUSTER statement we have used the STD option, which tells SAS to form clusters based on standardized data. This is to reduce the impact of outliers (i.e. Chicago), which can have large effects on the results of a cluster analysis.


DATA POLLUTE;
INPUT CITY $ SO2 TEMP FACTORYS POP WIND PRECIP PRCPDAYS;
* Data are means for 1969, 1970, and 1971.
  CITY     = city name (note that the $ tells SAS it is a character
             variable, i.e. not a number)
  SO2      = annual mean concentration of SO2 (micrograms/cubic meter)
  TEMP     = annual mean temperature (degrees F)
  FACTORYS = number of factories with 20 or more employees
  POP      = population (in thousands) from 1970 census
  WIND     = annual mean wind speed (miles/hour)
  PRECIP   = annual mean precipitation (inches)
  PRCPDAYS = annual mean number of days/year with precipitation;
Datalines;
PHOENIX 10 70.3 213 582 6.0 7.05 36
SAN_FRAN 12 56.7 453 716 8.7 20.66 67
DENVER 17 51.9 454 515 9.0 12.95 86
MIAMI 10 75.5 207 335 9.0 59.80 128
ATLANTA 24 61.5 368 497 9.1 48.34 115
CHICAGO 110 50.6 3344 3369 10.4 34.44 122
NEW_ORLS 9 68.3 204 361 8.4 56.77 113
DETROIT 35 49.9 1064 1513 10.1 30.96 129
ST_LOUIS 56 55.9 775 622 9.5 35.89 105
ALBQURQE 11 56.8 46 244 8.9 7.77 58
CLEVLAND 65 49.7 1007 751 10.9 34.99 155
DALLAS 9 66.2 641 844 10.9 35.94 78
HOUSTON 10 68.9 721 1233 10.8 48.19 103
SLT_LAKE 28 51.0 137 176 8.7 15.17 89
SEATTLE 29 51.1 379 531 9.4 38.79 164
;
* PROC CLUSTER performs the cluster analysis. The METHOD= option is
  required. SAS has 11 different clustering methods available. Researching
  all 11 will take you a while. METHOD=AVE requests the unweighted
  pair-group method using arithmetic averages (UPGMA). This is a
  frequently used method, and is probably the best choice when you have no
  reason to select any particular method. Another method frequently used
  by biologists is Ward's minimum-variance method (METHOD=WARD). The
  OUTTREE=CITIES statement creates a dataset that can be used by PROC TREE
  to draw a dendrogram. Since we know we have at least one substantial
  outlier (Chicago), we have included the STD option, which standardizes
  (i.e. Z-scores) the data prior to clustering. Outliers can have a
  substantial impact on clustering. Standardizing reduces, but does not
  eliminate, the impact of outliers. To assess the effect of an outlier,
  do the analysis both with and without the outlier. The ID CITY statement
  tells SAS to use the CITY variable to label the output.;
PROC CLUSTER METHOD=AVE OUTTREE=CITIES STD;
ID CITY;
* PROC TREE draws the dendrogram, using the dataset CITIES which was
  created by PROC CLUSTER. HORIZONTAL orients the diagram so the leaves
  (cities) are vertical on the page, which is how such graphs are often
  presented in the literature. The default in SAS is to place the leaves
  horizontally across the page. The ID CITY statement tells SAS to use the
  CITY variable to label the output.;
PROC TREE DATA=CITIES HORIZONTAL;
ID CITY;
RUN;
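To compare the UPGMA result with Ward's method (mentioned in the comment above), only the METHOD= option needs to change. A minimal sketch (the OUTTREE= data set name CITIES_W is made up for illustration):

* Same analysis, but clustering with Ward's minimum-variance method.;
PROC CLUSTER METHOD=WARD OUTTREE=CITIES_W STD;
ID CITY;
PROC TREE DATA=CITIES_W HORIZONTAL;
ID CITY;
RUN;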


                    The SAS System      15:21 Tuesday, May 6, 2003          1

                              The CLUSTER Procedure
                        Average Linkage Cluster Analysis

                     Eigenvalues of the Correlation Matrix

                  Eigenvalue    Difference    Proportion    Cumulative
             1    3.57712916    1.83796725        0.5110        0.5110
             2    1.73916191    0.73514870        0.2485        0.7595
             3    1.00401320    0.50945619        0.1434        0.9029
             4    0.49455701    0.36494403        0.0707        0.9736
             5    0.12961298    0.07918011        0.0185        0.9921
             6    0.05043288    0.04534002        0.0072        0.9993
             7    0.00509286                      0.0007        1.0000

           The data have been standardized to mean 0 and variance 1
          Root-Mean-Square Total-Sample Standard Deviation =        1
          Root-Mean-Square Distance Between Observations   = 3.741657

                                 Cluster History
                                                              Norm    T
                                                               RMS    i
          NCL    --Clusters Joined---          FREQ           Dist    e
           14    DENVER        SLT_LAKE           2          0.204
           13    MIAMI         NEW_ORLS           2         0.2826
           12    SAN_FRAN      ALBQURQE           2         0.3035
           11    DALLAS        HOUSTON            2         0.3149
           10    CL12          CL14               4         0.3299
            9    CL13          ATLANTA            3          0.422
            8    DETROIT       CLEVLAND           2         0.4678
            7    CL8           ST_LOUIS           3         0.5065
            6    CL7           SEATTLE            4         0.5355
            5    CL9           CL11               5         0.6441
            4    CL10          CL6                8         0.8203
            3    CL4           CL5               13          0.856
            2    PHOENIX       CL3               14         1.2042
            1    CL2           CHICAGO           15         1.7089

The dendrogram produced by SAS is below.


[Figure: dendrogram of the 15 cities produced by PROC TREE.]

Note that we have three major clusters, which correspond to geographic regions. There's a "southwest" cluster with San Francisco, Albuquerque, Denver, and Salt Lake City. The "Midwest" cluster contains Detroit, Cleveland, and St. Louis. Notice that Seattle is clustered with this group (we have seen Seattle placed with "Midwest" cities before, apparently due to climate). The "southeast" cluster contains Miami, New Orleans, Atlanta, Dallas, and Houston. Chicago is clearly an outlier; although standardizing the data reduces the impact of outliers, it is not a "fix." Phoenix is also somewhat unique: besides the unusually warm and dry climate, its SO2 value is quite low. Notice that cluster analysis is a good way to identify outliers in your data.