Xuhua Xia
Data Transformation
• Objectives:
– Understand why we often need to transform our data
– The three commonly used data transformation techniques
– Additive effects and multiplicative effects
– Application of data transformation in ANOVA and regression.
Why Data Transformation?
• The assumptions of most parametric methods:
– Homoscedasticity
– Normality
– Additivity
– Linearity
• Data transformation is used to make your data conform to the assumptions of the statistical methods
• Illustrative examples
Homoscedasticity and Normality
The data deviate from both homoscedasticity and normality.
Wouldn't it be nice if we could make the data look this way?
Types of Data Transformation
• The logarithmic transformation
• The square-root transformation
• The arcsine transformation
• Data transformation can be done conveniently in EXCEL.
• Alternatives: ranks and nonparametric methods.
Homoscedasticity
ID             Group 1      Group 2
1                 2.72        20.09
2                 7.39        54.60
3                 7.39        54.60
4                20.09       148.41
5                20.09       148.41
6                20.09       148.41
7                54.60       403.43
8                54.60       403.43
9               148.41      1096.63
n                    9            9
Mean             37.26       275.33
Var            2102.35    114784.50
t                -2.09
df                  16
P               0.0530
Equal Var.? P=  0.0000
Kurtosis          4.89         4.89
Skewness          2.12         2.12
P(Zg1)          0.0046       0.0046
P(Zg2)          0.0144       0.0144
• The two groups of data seem to differ greatly in means, but a t-test shows that the means do not differ significantly from each other - a surprising result.
• The two groups of data differ greatly in variance, and both deviate significantly from normality. These results invalidate the t-test.
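• We calculate two ratios: the var/mean ratio and the Std/mean ratio (i.e., the coefficient of variation, C.V.):
            Group 1    Group 2
Var/mean     56.420    416.891
C.V.          1.230      1.230
• The C.V. is the same in both groups while the var/mean ratio is not, which points to the log-transformation.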
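The two ratios above can be recomputed from the raw values. The following is a minimal sketch using only Python's standard library; the helper name `dispersion_ratios` is an illustrative choice, not part of the original slides.

```python
from statistics import mean, stdev, variance

# Data as listed on the slide (rounded to two decimals).
group1 = [2.72, 7.39, 7.39, 20.09, 20.09, 20.09, 54.60, 54.60, 148.41]
group2 = [20.09, 54.60, 54.60, 148.41, 148.41, 148.41, 403.43, 403.43, 1096.63]


def dispersion_ratios(values):
    """Return (variance/mean, std/mean) using the sample variance (n - 1)."""
    m = mean(values)
    return variance(values) / m, stdev(values) / m


for name, values in (("Group 1", group1), ("Group 2", group2)):
    var_mean, cv = dispersion_ratios(values)
    print(f"{name}: var/mean = {var_mean:.3f}, C.V. = {cv:.3f}")
```

A constant C.V. across groups means the standard deviation grows in proportion to the mean, which is the pattern a logarithmic transformation is meant to remove.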
Log-Transformed Data
NewX = ln(X+1)
ID             Group 1    Group 2    NewX (Group 1)    NewX (Group 2)
1                 2.72      20.09        1.31              3.05
2                 7.39      54.60        2.13              4.02
3                 7.39      54.60        2.13              4.02
4                20.09     148.41        3.05              5.01
5                20.09     148.41        3.05              5.01
6                20.09     148.41        3.05              5.01
7                54.60     403.43        4.02              6.00
8                54.60     403.43        4.02              6.00
9               148.41    1096.63        5.01              7.00
Statistics for the transformed data:
                NewX (Group 1)    NewX (Group 2)
n                    9                 9
Mean                 3.08              5.01
Var                  1.30              1.47
t                   -3.48
df                  16
P                    0.0031
Equal Var.? P=       0.8687
Kurtosis            -0.35             -0.30
Skewness             0.16              0.02
P(Zg1)               0.82              0.97
P(Zg2)               0.659393          0.684475
• The transformation is successful because:
– The variances are now similar
– The deviation from normality is now nonsignificant
– The t-test reveals a highly significant difference in means between the two groups
Transform back: $X = e^{\mathrm{NewX}} - 1$
Compare this back-transformed mean with the original mean. Which one is preferable?
Calculate the standard error, the degrees of freedom, and the 95% CL (t0.025,16 = 2.12).
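The back-transformation of the mean can be sketched as follows. This is an illustration for Group 1 only, assuming the NewX = ln(X+1) transformation from the slide; the variable names are hypothetical.

```python
import math
from statistics import mean

# Group 1 on the original scale, as listed on the slide.
group1 = [2.72, 7.39, 7.39, 20.09, 20.09, 20.09, 54.60, 54.60, 148.41]

# Forward transformation: NewX = ln(X + 1).
log_values = [math.log(x + 1) for x in group1]
log_mean = mean(log_values)

# Back-transformation of the mean: X = exp(NewX) - 1.
back_transformed_mean = math.exp(log_mean) - 1
arithmetic_mean = mean(group1)

print(f"mean on log scale:     {log_mean:.2f}")
print(f"back-transformed mean: {back_transformed_mean:.2f}")
print(f"arithmetic mean:       {arithmetic_mean:.2f}")
```

For right-skewed data the back-transformed mean is smaller than the arithmetic mean, because it is pulled less by the few very large values.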
Normal but Heteroscedastic
Any transformation that you use is likely to change normality. Fortunately, t-test and ANOVA are quite robust for this kind of data. Of course, you can also use nonparametric tests.
ID             Group 1    Group 2
1                 11          1
2                 12          4
3                 12          4
4                 13          8
5                 13          8
6                 13          8
7                 14         12
8                 14         12
9                 15         15
n                  9          9
Mean              13          8
Var                1.5       20.25
t                  3.216338
df                16
P                  0.0054
Equal Var.? P=     0.0013
Kurtosis          -0.28571   -0.76582
Skewness           0          0
The two variances are significantly different.
The t-test, however, detects a significant difference in means. You can use nonparametric methods to analyse the data for comparison, and you are likely to find the t-test to be more powerful.
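The t statistic reported for these data can be reproduced with a short sketch. It assumes the ordinary pooled-variance (equal-variance) two-sample t-test, which matches the df of 16 on the slide; the function name `pooled_t` is an illustrative choice.

```python
import math
from statistics import mean, variance

group1 = [11, 12, 12, 13, 13, 13, 14, 14, 15]
group2 = [1, 4, 4, 8, 8, 8, 12, 12, 15]


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance; returns (t, df)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / se, df


t, df = pooled_t(group1, group2)
print(f"t = {t:.4f}, df = {df}")
```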
Additivity
Factor A      Factor B Level 1    Factor B Level 2
Level 1            1.313               2.450
                   2.127               3.240
                   2.127               3.343
                   3.049               3.976
                   3.049               3.578
                   3.049               4.507
                   4.018               4.666
                   4.018               5.324
                   5.007               6.060
  Mean             3.084               4.127
Level 2            2.030               2.927
                   2.805               3.599
                   2.751               3.837
                   3.968               4.933
                   3.766               4.576
                   3.589               4.781
                   4.570               5.719
                   4.562               5.983
                   5.956               6.868
  Mean             3.778               4.803
• What experimental design is this?
• Compare the group means. Is there an interaction effect?
Additivity means that the difference between levels of one factor is consistent for different levels of another factor.
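The definition of additivity can be checked directly on the four cell means. The sketch below assumes that rows of the table are levels of Factor A and columns are levels of Factor B; the variable names are illustrative.

```python
# Cell means from the slide, keyed by (Factor A level, Factor B level).
cell_means = {
    (1, 1): 3.084, (1, 2): 4.127,
    (2, 1): 3.778, (2, 2): 4.803,
}

# Effect of Factor B (Level 2 minus Level 1) at each level of Factor A.
effect_b_at_a1 = cell_means[(1, 2)] - cell_means[(1, 1)]
effect_b_at_a2 = cell_means[(2, 2)] - cell_means[(2, 1)]

# Under perfect additivity the two differences are equal,
# so this interaction contrast is zero.
interaction = effect_b_at_a2 - effect_b_at_a1

print(f"B effect at A Level 1: {effect_b_at_a1:.3f}")
print(f"B effect at A Level 2: {effect_b_at_a2:.3f}")
print(f"interaction contrast:  {interaction:.3f}")
```

A contrast close to zero relative to the main effects indicates that the difference between levels of one factor is about the same at both levels of the other.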
Multiplicative Effects
Factor A      Factor B Level 1    Factor B Level 2
Level 1            2.718              10.589
                   7.389              24.530
                   7.389              27.316
                  20.086              52.304
                  20.086              34.795
                  20.086              89.660
                  54.598             105.306
                  54.598             204.262
                 148.413             427.365
  Mean            37.262             108.458
Level 2            6.616              17.678
                  15.524              35.570
                  14.665              45.408
                  51.884             137.739
                  42.222              96.110
                  35.185             118.225
                  95.498             303.596
                  94.819             395.790
                 385.215             960.457
  Mean            82.403             234.508
• Compare the group means. Is there an interaction effect?
• Does this data set meet the assumption of additivity?
• When the assumption of additivity is not met, we have difficulty in interpreting main effects.
• Now calculate the ratio of group means. What did you find?
Multiplicative Effects
For Factor B, Level 2 has a mean about 2.88 times as large as that for Level 1. For Factor A, Level 2 has a mean about 2.18 times as large as that for Level 1.
If you know the value for Level 1 of Factor A, you can obtain the value for Level 2 of Factor A by multiplying the known value by 2.18. Similarly, you can do the same for Factor B, multiplying by 2.88.
We say that the effects of Factors A and B are multiplicative, not additive.
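The ratios of group means can be computed from the four cell means of the untransformed data. This sketch assumes the table layout with Factor A in rows and Factor B in columns; the variable names are illustrative.

```python
# Cell means of the untransformed data, keyed by (Factor A level, Factor B level).
cell_means = {
    (1, 1): 37.262, (1, 2): 108.458,
    (2, 1): 82.403, (2, 2): 234.508,
}

# Differences between Factor B levels are far from constant ...
diff_b_at_a1 = cell_means[(1, 2)] - cell_means[(1, 1)]
diff_b_at_a2 = cell_means[(2, 2)] - cell_means[(2, 1)]

# ... but the ratios are nearly constant, for both factors.
ratio_b_at_a1 = cell_means[(1, 2)] / cell_means[(1, 1)]
ratio_b_at_a2 = cell_means[(2, 2)] / cell_means[(2, 1)]
ratio_a_at_b1 = cell_means[(2, 1)] / cell_means[(1, 1)]
ratio_a_at_b2 = cell_means[(2, 2)] / cell_means[(1, 2)]

print(f"B differences: {diff_b_at_a1:.3f}, {diff_b_at_a2:.3f}")
print(f"B ratios:      {ratio_b_at_a1:.2f}, {ratio_b_at_a2:.2f}")
print(f"A ratios:      {ratio_a_at_b1:.2f}, {ratio_a_at_b2:.2f}")
```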
Log-transformation
Now log-transform the data. Compare the means. Is the assumption of additivity met now?
                          Original data                Transformed data
Factor A              B Level 1    B Level 2       B Level 1    B Level 2
Level 1   Mean          37.262      108.458          3.084        4.127
          Variance    2102.351    17878.648          1.302        1.268
Level 2   Mean          82.403      234.508          3.778        4.803
          Variance   12400.091    80241.944          1.235        1.385
Why can log-transformation change multiplicative effects into additive effects?
$Z = XY$
$\ln(Z) = \ln(X) + \ln(Y)$
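The identity above can be illustrated numerically. The sketch below builds a hypothetical 2 x 2 table of cell values from a purely multiplicative model; the baseline of 10.0 and the two multipliers are made-up illustration values, not data from the slides.

```python
import math

# Hypothetical multiplicative model: cell = baseline * A effect * B effect.
baseline = 10.0
effect_a = {1: 1.0, 2: 2.18}  # assumed multiplier for each Factor A level
effect_b = {1: 1.0, 2: 2.88}  # assumed multiplier for each Factor B level

cells = {(a, b): baseline * effect_a[a] * effect_b[b]
         for a in (1, 2) for b in (1, 2)}


def interaction(table):
    """Interaction contrast: (B effect at A2) minus (B effect at A1)."""
    return (table[(2, 2)] - table[(2, 1)]) - (table[(1, 2)] - table[(1, 1)])


raw_interaction = interaction(cells)
log_interaction = interaction({k: math.log(v) for k, v in cells.items()})

print(f"interaction on the original scale: {raw_interaction:.3f}")
print(f"interaction on the log scale:      {log_interaction:.3f}")
```

On the log scale each cell is ln(baseline) + ln(A effect) + ln(B effect), so the interaction contrast vanishes up to floating-point error.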
Square-Root Transformation
ID          Group 1    Group 2
1               1          9
2               4         16
3               4         16
4               9         25
5               9         25
6               9         25
7              16         36
8              16         36
9              25         49
Mean        10.333     26.333
Var         56.500    152.500
Var/Mean     5.468      5.791
Std/Mean     0.727      0.469
• The two groups of data differ much in variance.
• Calculate two ratios: var/mean ratio and Std/mean ratio (i.e., coefficient of variation).
• Does your calculation suggest log-transformation? When is log-transformation appropriate?
• Use square-root transformation when different groups have similar Variance/Mean ratios
Notice the means, which do not coincide with the most frequent observations
Square-Root Transformation
Square-root transformation: $X' = \sqrt{X + 3/8}$
ID          Group 1    Group 2    X' (Group 1)    X' (Group 2)
1               1          9         1.17            3.06
2               4         16         2.09            4.05
3               4         16         2.09            4.05
4               9         25         3.06            5.04
5               9         25         3.06            5.04
6               9         25         3.06            5.04
7              16         36         4.05            6.03
8              16         36         4.05            6.03
9              25         49         5.04            7.03
Mean        10.333     26.333        3.07            5.04
Var         56.500    152.500        1.412           1.475
The variance is now almost identical between the two groups.
Transform the means back to the original scale and compare these means with the original means: $X = (X')^2 - 3/8$
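The square-root transformation and its back-transformation can be sketched as follows, assuming the 3/8 offset shown on the slide. The function names are illustrative.

```python
import math
from statistics import mean, variance

group1 = [1, 4, 4, 9, 9, 9, 16, 16, 25]
group2 = [9, 16, 16, 25, 25, 25, 36, 36, 49]


def sqrt_transform(x):
    """X' = sqrt(X + 3/8)."""
    return math.sqrt(x + 3 / 8)


def back_transform(x_prime):
    """X = (X')^2 - 3/8."""
    return x_prime ** 2 - 3 / 8


transformed1 = [sqrt_transform(x) for x in group1]
transformed2 = [sqrt_transform(x) for x in group2]

for name, raw, new in (("Group 1", group1, transformed1),
                       ("Group 2", group2, transformed2)):
    print(f"{name}: var before = {variance(raw):.3f}, "
          f"var after = {variance(new):.3f}, "
          f"original mean = {mean(raw):.3f}, "
          f"back-transformed mean = {back_transform(mean(new)):.3f}")
```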
Quiz on Data Transformation
             Group
           1         2         3         4
           2         6         9         2
           0         4         5         4
           2         8         6         1
           3         2         5         0
           0         4        11         2
n          5         5         5         5
Mean       1.4       4.8       7.2       1.8
Var        1.8       5.2       7.2       2.2
SE         0.600     1.020     1.200     0.663
T          2.776     2.776     2.776     2.776
LowerL    -0.266     1.969     3.868    -0.042
UpperL     3.066     7.631    10.532     3.642
The data set is right-skewed for each group.
Calculate the variance/mean ratio and C.V. for each group, and decide what transformation you should use.
Do the transformation and convert the means back to the original scale.
With Multiple Groups
When you have multiple groups, a “Variance vs Mean” or a “Std vs Mean” plot can help you to decide which data transformation to use. The graph on the left shows that the Var/Mean ratio is almost constant. What transformation should you use?
[Figure: variance (y-axis, 0 to 8) plotted against mean (x-axis, 0 to 8) for the groups.]
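The numbers behind a "Variance vs Mean" or "Std vs Mean" plot can be tabulated without plotting. The sketch below does this for the four groups of the quiz data set; the dictionary and variable names are illustrative.

```python
from statistics import mean, stdev, variance

# Quiz data: four groups of five observations each.
groups = {
    1: [2, 0, 2, 3, 0],
    2: [6, 4, 8, 2, 4],
    3: [9, 5, 6, 5, 11],
    4: [2, 4, 1, 0, 2],
}

var_mean_ratios = {g: variance(v) / mean(v) for g, v in groups.items()}
cvs = {g: stdev(v) / mean(v) for g, v in groups.items()}

for g in groups:
    print(f"Group {g}: mean = {mean(groups[g]):.1f}, "
          f"var/mean = {var_mean_ratios[g]:.3f}, C.V. = {cvs[g]:.3f}")
```

Whichever ratio stays roughly constant across groups indicates which relationship (variance proportional to mean, or standard deviation proportional to mean) holds in the data.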
Confidence Limits
[Figure: two panels, "Before transformation" (left) and "After transformation" (right), each plotting the mean with its lower and upper confidence limits (y-axis) against the group mean (x-axis).]
With the skewness in our data, do confidence limits on the right make more sense? Why?
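One way to explore this question is to compute the limits both ways for a single group. The sketch below uses Group 1 of the quiz data and assumes the square-root transformation with the 3/8 offset; the critical value 2.776 is the T value listed in the quiz table, and the helper name `confidence_limits` is illustrative.

```python
import math
from statistics import mean, stdev

group1 = [2, 0, 2, 3, 0]
T_CRIT = 2.776  # T value from the quiz table (n = 5 per group)


def confidence_limits(values):
    """Return (lower, mean, upper) of the 95% confidence interval."""
    m = mean(values)
    half_width = T_CRIT * stdev(values) / math.sqrt(len(values))
    return m - half_width, m, m + half_width


# Limits computed directly on the original scale: symmetric about the mean.
raw_lower, raw_mean, raw_upper = confidence_limits(group1)

# Limits computed on the square-root scale, then transformed back.
transformed = [math.sqrt(x + 3 / 8) for x in group1]
lower, centre, upper = (v ** 2 - 3 / 8 for v in confidence_limits(transformed))

print(f"before: {raw_lower:.3f} < {raw_mean:.3f} < {raw_upper:.3f}")
print(f"after:  {lower:.3f} < {centre:.3f} < {upper:.3f}")
```

The back-transformed interval is asymmetric, with a longer upper arm, which mirrors the right skew of the data.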
Arcsine Transformation
• Used for proportions
• Compare the variances before and after transformation
• Do you know how to transform the means and C.L. back to the original scale?
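Arcsine transformation: $X' = \arcsin(\sqrt{X})$
Transform back: $X = (\sin X')^2$
In the table, the original values are percentages (X is the corresponding proportion) and the transformed values are in degrees.
            Original (%)           Transformed (degrees)
            Group1     Group2      Group1     Group2
            84.20      92.30       66.579     73.890
            88.90      95.10       70.539     77.211
            89.20      90.30       70.814     71.854
            83.40      88.60       65.957     70.267
            80.10      92.60       63.507     74.215
            81.30      96.00       64.378     78.463
            85.80      93.70       67.863     75.463
Mean        84.70      92.66       67.091     74.480
Var         12.29       6.73        8.017      8.226
SE           1.325      0.980       1.070      1.084
LowerL      81.457     90.258      64.472     71.828
UpperL      87.943     95.056      69.709     77.133
Transform back
NewMean     84.847     92.841
LowerL      81.428     90.273
UpperL      87.974     95.041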
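The arcsine transformation and its back-transformation can be sketched as follows. The sketch assumes, as the tabulated values suggest, that the inputs are percentages and the transformed values are expressed in degrees; the function names are illustrative.

```python
import math
from statistics import mean

# Percentages as listed on the slide.
group1 = [84.20, 88.90, 89.20, 83.40, 80.10, 81.30, 85.80]
group2 = [92.30, 95.10, 90.30, 88.60, 92.60, 96.00, 93.70]


def arcsine_transform(percent):
    """X' = arcsin(sqrt(X)), with X = percent / 100 and X' in degrees."""
    return math.degrees(math.asin(math.sqrt(percent / 100)))


def back_transform(degrees):
    """X = (sin X')^2, returned as a percentage."""
    return 100 * math.sin(math.radians(degrees)) ** 2


for name, values in (("Group1", group1), ("Group2", group2)):
    transformed_mean = mean(arcsine_transform(v) for v in values)
    print(f"{name}: mean = {mean(values):.2f}, "
          f"transformed mean = {transformed_mean:.3f}, "
          f"back-transformed mean = {back_transform(transformed_mean):.3f}")
```

The same `back_transform` function applies to the lower and upper confidence limits computed on the transformed scale.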
Data Transformation Using SAS
data Mydata;
  input x;
  logx  = log(x);           /* natural logarithm transformation */
  sqrtx = sqrt(x + 3/8);    /* square-root transformation */
  asinx = arsin(sqrt(x));   /* arcsine transformation: x must be a proportion between 0 and 1; result is in radians */
  cards;