Nonparametric Statistical Methods
Presented by Guo Cheng, Ning Liu, Faiza Khan, Zhenyu Zhang, Du Huang, Christopher Porcaro, Hongtao Zhao, Wei Huang

Slide 2: Introduction

Slide 3: Definition
Nonparametric (rank-based) methods are:
• used when the form of the population distribution from which the data are sampled is unknown;
• used for small sample sizes;
• used when the data are measured on an ordinal scale and only their ranks are meaningful.

Slide 4: Outline
1. Sign Test
2. Wilcoxon Signed Rank Test
3. Inferences for Two Independent Samples
4. Inferences for Several Independent Samples
5. Friedman Test
6. Spearman's Rank Correlation
7. Kendall's Rank Correlation Coefficient

Slide 5: 1. Sign Test

Slide 6: Parameter of interest: Median
The median is used as the parameter because, for skewed distributions, it is a better measure of the center of the data than the mean.

Slide 7: Hypothesis test
$H_0: \tilde{\mu} = \tilde{\mu}_0$ vs. $H_a: \tilde{\mu} > \tilde{\mu}_0$, where $\tilde{\mu}_0$ is a specified value and $\tilde{\mu}$ is the unknown median.

Slide 8: Testing Procedure
Step 1: Given a random sample $x_1, x_2, \ldots, x_n$ from a population with unknown median $\tilde{\mu}$, count the number of $x_i$ that exceed $\tilde{\mu}_0$. Denote this count by $s_+$, and let $s_- = n - s_+$.
Step 2: Reject $H_0$ if $s_+$ is large or, equivalently, if $s_-$ is small.

Slide 9: How to reject $H_0$?
To determine how large $s_+$ must be in order to reject $H_0$, we need the distribution of the corresponding random variable $S_+$.
$X_i$: random variable corresponding to the observed value $x_i$
$S_-$: random variable corresponding to $s_-$

Slide 10: Distribution of $S_+$ and $S_-$
Slide 11: Calculating P-value
Slide 12: Rejection criteria
Slide 13: Large sample z-test
Slide 14: Confidence Interval
Slide 15: Example

Slide 16: SAS code
    DATA themostat;
      INPUT temp;
      datalines;
    202.2
    203.4
    ;
    PROC UNIVARIATE DATA=themostat loccount mu0=200;
      VAR temp;
    RUN;

Slide 17: SAS Output
Basic Statistical Measures
    Location                Variability
    Mean     201.7700       Std Deviation          2.41019
    Median   201.7500       Variance               5.80900
    Mode     .              Range                  8.30000
                            Interquartile Range    2.90000

Tests for Location: Mu0=200
    Test           Statistic        p Value
    Student's t    t  2.322323      Pr > |t|     0.0453
    Sign           M  3             Pr >= |M|    0.1094
    Signed Rank    S  19.5          Pr >= |S|    0.048

Slide 18: 2. Wilcoxon signed rank test

Slide 19: Inventor
Frank Wilcoxon (2 September 1892, County Cork, Ireland - 18 November 1965, Tallahassee, Florida, USA) was a chemist and statistician, known for the development of several statistical tests.

Slide 20: What is it used for?
• Two related samples
• Matched samples
• Repeated measurements on a single sample

Slide 21: Hypothesis
Slide 22: Testing procedure
Slide 23: Example

Slide 24: SAS codes
    DATA thermo;
      INPUT temp;
      datalines;
    202.2
    203.4
    ;
    PROC UNIVARIATE DATA=thermo loccount mu0=200;
      TITLE "Wilcoxon signed rank test the thermostat";
      VAR temp;
    RUN;

Slide 25: SAS outputs (selected results)
Basic Statistical Measures
    Location                Variability
    Mean     201.7700       Std Deviation          2.41019
    Median   201.7500       Variance               5.80900
    Mode     .              Range                  8.30000
                            Interquartile Range    2.90000

Tests for Location: Mu0=200
    Test           Statistic        p Value
    Student's t    t  2.322323      Pr > |t|     0.0453
    Sign           M  3             Pr >= |M|    0.1094
    Signed Rank    S  19.5          Pr >= |S|    0.048

Slide 26: Large sample approximation
Slide 27: Derive E(x) & Var(x)
Slide 28: Rejection region

Slide 29: 3. Inferences for Two Independent Samples
Slide 30: Hypothesis
Slide 31: Definition
Slide 32: Definition
Slide 33: Wilcoxon rank sum test
Slide 34: Mann-Whitney U test
Slide 35: Between two tests
Slide 36: Advantages
Slide 37: For large samples
Slide 38: For large samples
Slide 39: Treatment of ties

Slide 40: Example
To test whether the grades of two classes taught by the same teacher are the same, we randomly pick 7 students from Class A and 9 from Class B. Their scores are as follows:
A: 8.50 9.48 8.65 8.16 8.83 7.76 8.63
B: 8.27 8.20 8.25 8.14 9.00 8.10 7.20 8.32 7.70

Slide 41: Example
    Score  7.20  7.70  7.76  8.10  8.14  8.16  8.20  8.25
    Class  B     B     A     B     B     A     B     B
    Rank   1     2     3     4     5     6     7     8

    Score  8.27  8.32  8.50  8.63  8.65  8.83  9.00  9.48
    Class  B     B     A     A     A     A     B     A
    Rank   9     10    11    12    13    14    15    16

Slide 42: Example
Slide 43: Example

Slide 44: SAS code
    Data exam;
      Input group $ score @@;
      Datalines;
    A 8.50 A 9.48 A 8.65 A 8.16 A 8.83 A 7.76 A 8.63
    B 8.27 B 8.20 B 8.25 B 8.14 B 9.00 B 8.10 B 7.20 B 8.32 B 7.70
    ;

Slide 45: SAS code
    Proc npar1way data=exam wilcoxon;
      Var score;
      Class group;
      Exact wilcoxon;
    Run;

Slide 46: Output
Wilcoxon Scores (Rank Sums) for Variable score
Classified by Variable group
    group   N   Sum of Scores   Expected Under H0   Std Dev Under H0   Mean Score
    A       7   75.0            59.50               9.447222           10.714286
    B       9   61.0            76.50               9.447222            6.777778

Slide 47: Output
Wilcoxon Two-Sample Test
    Statistic (S)                    75.0000

    Normal Approximation
    Z                                 1.5878
    One-Sided Pr > Z                  0.0562
    Two-Sided Pr > |Z|                0.1123

    t Approximation
    One-Sided Pr > Z                  0.0666
    Two-Sided Pr > |Z|                0.1332

    Exact Test
    One-Sided Pr >= S                 0.0571
    Two-Sided Pr >= |S - Mean|        0.1142

    Z includes a continuity correction of 0.5.

Slide 48: Output

Slide 49: 4. Inferences for Several Independent Samples

Slide 50: Introduction
If the data are normally distributed and the population standard deviations are equal, we can test for a difference among several populations by using the one-way ANOVA F test.

Slide 51: When to use the Kruskal-Wallis test?
But what happens when our data are not normal?
This is when we use the nonparametric Kruskal-Wallis test, which compares more than two populations as long as the data come from a continuous distribution. The idea of the Kruskal-Wallis (kw) rank test is to rank the data from all groups together and then apply one-way ANOVA to the ranks rather than to the original data.

Slide 52: Kruskal-Wallis Test (kw Test)
• A nonparametric method for testing whether samples originate from the same distribution.
• Used for comparing more than two independent samples.

Slide 53: Kruskal-Wallis Test: History
William Henry Kruskal (October 10, 1919 - April 21, 2005): obtained bachelor's and master's degrees in mathematics at Harvard University and received his Ph.D. from Columbia University in 1955.
Wilson Allen Wallis (November 5, 1912 - October 12, 1998): undergraduate work at the University of Minnesota and graduate work at the University of Chicago in 1933.

Slide 54: Kruskal-Wallis Test: Steps
1. State the hypotheses:
   Null hypothesis ($H_0$): the samples come from identical populations.
   Alternative hypothesis ($H_a$): at least one population is different.

Slide 55: Kruskal-Wallis Test: Steps
2. Rank all the data together. The lowest value gets the lowest rank, and so on. Tied values get the average of the ranks they would have received if they were not tied.
3. Add up the ranks within each sample. Label these sums $L_1, L_2, L_3$, and $L_4$.

Slide 56: Kruskal-Wallis Test: Steps
4. Find the test statistic:
   $kw = \frac{12}{n(n+1)} \sum_{i} \frac{L_i^2}{n_i} - 3(n+1)$
   n = total number of observations in all samples
   $n_i$ = number of observations in sample i
   $L_i$ = total rank of each sample
   kw = test statistic
5. Reject $H_0$ if kw is greater than the chi-square table value (degrees of freedom = number of samples - 1).

Slide 57: Kruskal-Wallis Test: Example
An experiment was done to compare four different ways of teaching a concept to a class of students. In this experiment, 28 tenth-grade classes were randomly assigned to the four methods (7 classes per method). A 45-question test was given to each class. The average test scores of the classes are given in the following table. Apply the Kruskal-Wallis test to the test scores data set.

Slide 58: Kruskal-Wallis Test: Example
Given data; ranks of data values

Slide 59: Kruskal-Wallis Test: Example
Slide 60: Kruskal-Wallis Test: Example

Slide 61: SAS Input
    data test;
      input methodname $ scores;
      cards;
    case 14.59
    case 23.44
    case 25.43
    case 18.15
    case 20.82
    case 14.06
    case 14.26
    formula 20.27
    formula 26.84
    formula 14.71
    formula 22.34
    formula 19.49
    formula 24.92
    formula 20.20
    equation 27.82
    equation 24.92
    equation 28.68
    equation 23.32
    equation 32.85
    equation 33.90
    equation 23.42
    unitary 33.16
    unitary 26.93
    unitary 30.43
    unitary 36.43
    unitary 37.04
    unitary 29.76
    unitary 33.88
    ;
    proc npar1way data=test wilcoxon;
      class methodname;
      var scores;
    run;

Slide 62: SAS Output
Wilcoxon Scores (Rank Sums) for Variable scores
Classified by Variable methodname
    methodname   N   Sum of Scores   Expected Under H0   Std Dev Under H0   Mean Score
    case         7    49.00          101.50              18.845498           7.000000
    formula      7    66.50          101.50              18.845498           9.500000
    equation     7   125.50          101.50              18.845498          17.928571
    unitary      7   165.00          101.50              18.845498          23.571429

    Average scores were used for ties.

Kruskal-Wallis Test
    Chi-Square          18.1390
    DF                  3
    Pr > Chi-Square     0.0004

Slide 63: 5. Friedman Test

Slide 64: Introduction
A distribution-free, rank-based test for comparing treatments is known as the Friedman test, named after the Nobel laureate economist Milton Friedman, who proposed it. The Friedman test is a version of repeated-measures ANOVA that can be performed on ordinal (ranked) data.

Slide 65: Steps in the Friedman test
Slide 66: Steps in the Friedman test

Slide 67: Example
Now we have 8 treatments arranged in 3 blocks, with $\alpha = 0.025$.

Slide 68: Define Null and Alternative Hypothesis
$H_0$: There is no difference between the 8 treatments.
$H_a$: There is a difference between the 8 treatments.

Slide 69: Rank Sum
Slide 70: Friedman Test
Slide 71: Conclusion

Slide 72: 6. Spearman's Rank Correlation Coefficient

Slide 73: Introduction
• From Pearson to Spearman
• Spearman's Rank Correlation Coefficient
• Large-Sample Approximation
• Hypothesis Test
• Examples

Slide 74: From Pearson to Spearman
Pearson's:
• Measures only the degree of linear association
• Based on the assumption of bivariate normality of the two variables
Spearman's:
• Takes into account only the ranks
• Measures the degree of monotone association
• Inferences on the rank correlation coefficient are distribution-free

Slide 75: From Pearson to Spearman

Slide 76: From Pearson to Spearman
Charles Edward Spearman (10 Sept. 1863 - 17 Sept. 1945)
As a psychologist: general factor of intelligence; the nature and causes of variations in human
As a statistician: rank correlation; two-way analysis; correlation coefficient

Slide 77: Spearman's Rank Correlation Coefficient
Slide 78: Spearman's Rank Correlation Coefficient
Slide 79: Large sample approximation
Slide 80: Hypothesis testing

Slide 81: Example
Table 5.1 Wine Consumption and Heart Disease Deaths
    Country        Wine consumption   Heart disease deaths
    Australia      2.5                211
    Austria        3.9                167
    Belgium        2.9                131
    Canada         2.4                191
    Denmark        2.9                220
    Finland        0.8                297
    France         9.1                 71
    Iceland        0.8                211
    Ireland        0.7                300
    Italy          7.9                107
    Netherlands    1.8                167
    New Zealand    1.9                266
    Norway         0.8                227
    Spain          6.5                 86
    Sweden         1.6                207
    Switzerland    5.8                115
    U.K.           1.3                285
    U.S.           1.2                199
    W. Germany     2.7                172

Slide 82: Example

Slide 83: Example
Table 5.2 Ranks of Wine Consumption and Heart Disease Deaths
    No.   Wine rank   Deaths rank   Difference
    1     11          12.5           -1.5
    2     15           6.5            8.5
    3     13.5         5              8.5
    4     10           9              1.0
    5     13.5        14             -0.5
    6      3          18            -15.0
    7     19           1             18.0
    8      3          12.5           -9.5
    9      1          19            -18.0
    10    18           3             15.0
    11     8           6.5            1.5
    12     9          16             -7.0
    13     3          15            -12.0
    14    17           2             15.0
    15     7          11             -4.0
    16    16           4             12.0
    17     6          17            -11.0
    18     5          10             -5.0
    19    12           8              4.0

Slide 84: Example
Slide 85: Example

Slide 86: 7. Kendall's Rank Correlation Coefficient

Slide 87: Kendall's Tau
• A coefficient used to measure the association between two sets of ranked data.
• Named after the British statistician Maurice Kendall, who developed it in 1938.
• Ranges from -1.0 to 1.0
• Tau-a (with no ties) and Tau-b (with ties)

Slide 88: Formula for Tau-a
Slide 89: Concordant and Discordant

Slide 90: Example 1: Kendall's tau-a
Raw data for 11 students in 2 exams:
Exam 1 Exam 2 85 9895 9080 8375 5770 6365 7773 9993 8079 9688 6974

Slide 91: Ranks of exam results
    Exam 1 (x)   Exam 2 (y)   c    d
    1            2            9    1
    2            1            9    0
    3            3            8    0
    4            5            6    1
    5            4            6    0
    6            7            4    1
    7            6            4    0
    8            9            2    1
    9            8            2    0
    10           11           0    1
    11           10
                              C=50 D=5

Slide 92: Calculation for $\tau$

Slide 93: Steps for calculating $\tau$
1. Sort the data by x in ascending order and pair each x with the rank of its y.
2. For each y, count c (concordant) and d (discordant) among the pairs that follow it.
3. Sum the counts to get C and D.
4. Use the formula to calculate $\tau$.

Slide 94: Formula for tau-b (with ties)

Slide 95: Example 2: Kendall's tau-b
Wine consumption and heart disease deaths data
    i    Country        x_i    y_i    c    d
    1    Ireland        0.7    300    0    18
    2    Iceland        0.8    211    3    11
    2    Norway         0.8    227    2    13
    4    Finland        0.8    297    0    15
    5    U.S.           1.2    199    5     9
    6    U.K.           1.3    285    0    13
    7    Sweden         1.6    207    3     9
    8    Netherlands    1.8    167    5     5
    9    N.Z.           1.9    266    0    10
    10   Canada         2.4    191    2     7
    11   Australia      2.5    211    1     7
    12   Germany        2.7    172    1     6
    13   Belgium        2.9    131    2     4
    14   Denmark        2.9    220    0     5
    15   Austria        3.9    167    0     4
    16   Switzerland    5.8    115    0     3
    17   Spain          6.5     86    1     1
    18   Italy          7.9    107    0     1
    19   France         9.1     71    0     0
                                      C=25 D=141

Slide 96: Calculation for tau-b
Slide 97: Hypothesis Test for $\tau$
Slide 98: Hypothesis test results
Slide 99: Hypothesis test results
Slide 100

Slide 101: Example 1 extension
    Exam 1   Exam 2   Kendall's tau    Spearman r_s
    x        y        c      d
    1        2        9      1         1    1
    2        1        9      0         1    1
    3        3        8      0         0    0
    4        5        6      1         1    1
    5        4        6      0         1    1
    6        7        4      1         1    1
    7        6        4      0         1    1
    8        9        2      1         1    1
    9        8        2      0         1    1
    10       11       0      1         1    1
    11       10                        1    1
                      C=50   D=5

Slide 102
Slide 103

Slide 104: SAS Code
    Data exams;
      Input exam1 exam2;
      Datalines;
    85 98 95
    ;
    Run;
    Proc corr data=exams kendall;
      Var exam1 exam2;
    Run;

Slide 105: SAS output

Slide 106: Conclusion

Slide 107: Summary
• Nonparametric tests are very useful when we know little about the underlying distributions.
• In particular, when the distribution is not normal and the t-test cannot be used, we turn to nonparametric methods.
• The median is a better measure of central tendency for non-normal populations.
• The data can be ordinal, and the sample size is usually small.
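As a cross-check on Example 1 (slides 91 and 101), the concordant and discordant counts can be recomputed from the exam ranks. The sketch below is in Python rather than the SAS used in the deck, and the function names `kendall_tau_a` and `spearman_rs` are illustrative, not taken from the slides. It assumes the eleventh pair is (11, 10), which is what the totals C = 50 and D = 5 imply, and it uses the standard no-ties formulas $\tau_a = (C - D) / \binom{n}{2}$ and $r_s = 1 - 6 \sum d_i^2 / (n(n^2 - 1))$.

```python
from itertools import combinations

# Exam ranks from slide 91; x is already sorted in ascending order.
# Assumption: the last pair is (11, 10), as implied by C = 50 and D = 5.
x_ranks = list(range(1, 12))
y_ranks = [2, 1, 3, 5, 4, 7, 6, 9, 8, 11, 10]


def kendall_tau_a(x, y):
    """Return (C, D, tau_a) for paired data that contain no ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x2 - x1) * (y2 - y1)
        if sign > 0:
            concordant += 1      # both variables move in the same direction
        elif sign < 0:
            discordant += 1      # the variables move in opposite directions
    n_pairs = len(x) * (len(x) - 1) // 2
    return concordant, discordant, (concordant - discordant) / n_pairs


def spearman_rs(x, y):
    """Spearman's r_s for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


C, D, tau = kendall_tau_a(x_ranks, y_ranks)
r_s = spearman_rs(x_ranks, y_ranks)
print(C, D, round(tau, 4), round(r_s, 4))
```

With these ranks the counts agree with the slide totals (C = 50, D = 5), giving $\tau_a = 45/55 \approx 0.818$ and $r_s \approx 0.955$. The pairwise loop is quadratic in n, which is fine for classroom-sized examples.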
Slide 108: Summary
In summary, we have briefly introduced some of the most common methods, including:
• Sign test
• Wilcoxon rank sum test and signed rank test
• Kruskal-Wallis test
• Friedman test
• Spearman's rank correlation
• Kendall's rank correlation coefficient

Slide 109: Questions
Q1, Q2

Slide 110: The End. Thank You!
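For readers who want to re-check the two rank tests by hand, the class-score example (slide 40) and the teaching-method example (slide 61) can be recomputed without SAS. The following is a minimal Python sketch, not the presenters' code: the helper names `midranks`, `rank_sums` and `kruskal_wallis` are made up for illustration, the z statistic uses the 0.5 continuity correction mentioned in the SAS output, and the Kruskal-Wallis statistic is divided by the usual tie correction $1 - \sum (t^3 - t) / (n^3 - n)$, which is assumed here to be what PROC NPAR1WAY applies.

```python
import math


def midranks(values):
    """Rank values 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1   # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sums(groups):
    """Rank the pooled data; return (rank sum per group, pooled values)."""
    pooled = [v for group in groups for v in group]
    ranks = midranks(pooled)
    sums, start = [], 0
    for group in groups:
        sums.append(sum(ranks[start:start + len(group)]))
        start += len(group)
    return sums, pooled


def kruskal_wallis(groups):
    """Kruskal-Wallis statistic with the standard correction for ties."""
    sums, pooled = rank_sums(groups)
    n = len(pooled)
    stat = 12 / (n * (n + 1)) * sum(
        s * s / len(g) for s, g in zip(sums, groups)) - 3 * (n + 1)
    tie_counts = {}
    for v in pooled:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in tie_counts.values()) / (n ** 3 - n)
    return stat / correction


# Two independent samples (slide 40): Wilcoxon rank sum, normal approximation.
class_a = [8.50, 9.48, 8.65, 8.16, 8.83, 7.76, 8.63]
class_b = [8.27, 8.20, 8.25, 8.14, 9.00, 8.10, 7.20, 8.32, 7.70]
(w_a, w_b), _ = rank_sums([class_a, class_b])
n1, n2 = len(class_a), len(class_b)
mean_a = n1 * (n1 + n2 + 1) / 2                  # expected rank sum under H0
sd_a = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # std dev under H0 (no ties)
z = (w_a - mean_a - 0.5) / sd_a                  # continuity correction of 0.5
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability

# Several independent samples (slide 61): Kruskal-Wallis.
methods = [
    [14.59, 23.44, 25.43, 18.15, 20.82, 14.06, 14.26],   # case
    [20.27, 26.84, 14.71, 22.34, 19.49, 24.92, 20.20],   # formula
    [27.82, 24.92, 28.68, 23.32, 32.85, 33.90, 23.42],   # equation
    [33.16, 26.93, 30.43, 36.43, 37.04, 29.76, 33.88],   # unitary
]
kw = kruskal_wallis(methods)

print(w_a, w_b, round(z, 4), round(p_one_sided, 4), round(kw, 4))
```

Under these assumptions the rank sums (75 and 61), the z statistic, the one-sided normal p-value and the Kruskal-Wallis chi-square agree with the SAS output on slides 46, 47 and 62 to the precision shown there. The exact test and the t approximation from PROC NPAR1WAY are not reproduced.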