Xuhua Xia

Data Transformation

• Objectives:
– Understand why we often need to transform our data
– The three commonly used data transformation techniques
– Additive effects and multiplicative effects
– Application of data transformation in ANOVA and regression

Why Data Transformation?

• The assumptions of most parametric methods:
– Homoscedasticity
– Normality
– Additivity
– Linearity

• Data transformation is used to make your data conform to the assumptions of the statistical methods

• Illustrative examples

Homoscedasticity and Normality

The data deviates from both homoscedasticity and normality.

Homoscedasticity and Normality

Wouldn't it be nice if we could make the data look this way?

Types of Data Transformation

• The logarithmic transformation
• The square-root transformation
• The arcsine transformation
• Data transformation can be done conveniently in EXCEL.
• Alternatives: ranks and nonparametric methods.

Homoscedasticity

ID          Group 1      Group 2
1              2.72        20.09
2              7.39        54.60
3              7.39        54.60
4             20.09       148.41
5             20.09       148.41
6             20.09       148.41
7             54.60       403.43
8             54.60       403.43
9            148.41      1096.63

n                 9            9
Mean          37.26       275.33
Var         2102.35    114784.50
Kurtosis       4.89         4.89
Skewness       2.12         2.12
P(Zg1)       0.0046       0.0046
P(Zg2)       0.0144       0.0144

t-test: t = -2.09, df = 16, P = 0.0530
Equal Var.? P = 0.0000

• The two groups of data seem to differ greatly in means, but a t-test shows that the means do not differ significantly from each other - a surprising result.

• The two groups of data differ greatly in variance, and both deviate significantly from normality. These results invalidate the t-test.

• We calculate two ratios: var/mean ratio and Std/mean ratio (i.e., coefficient of variation).

            Group 1    Group 2
Var/mean     56.420    416.891
C.V.          1.230      1.230

• Log-transformation
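The two ratios above can be checked with a few lines of code. This is only a sketch, assuming Python and its standard `statistics` module (the slides themselves use EXCEL and SAS); the function name `dispersion_ratios` is made up for illustration. Because the data are typed in with two decimals, the results agree with the slide only up to rounding.

```python
import statistics

# Data from the Homoscedasticity slide (rounded to two decimals as shown there)
group1 = [2.72, 7.39, 7.39, 20.09, 20.09, 20.09, 54.60, 54.60, 148.41]
group2 = [20.09, 54.60, 54.60, 148.41, 148.41, 148.41, 403.43, 403.43, 1096.63]

def dispersion_ratios(values):
    """Return (variance/mean, standard deviation/mean) for one group."""
    mean = statistics.mean(values)
    var = statistics.variance(values)  # sample variance, n - 1 denominator
    return var / mean, var ** 0.5 / mean

for name, values in (("Group 1", group1), ("Group 2", group2)):
    var_mean, cv = dispersion_ratios(values)
    print(f"{name}: Var/mean = {var_mean:.3f}, C.V. = {cv:.3f}")
```

The Var/mean ratio differs several-fold between the groups while the C.V. is essentially the same in both, which is the pattern the slide points to before introducing the log-transformation.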

• The transformation is successful because:
– The variance is now similar
– Deviation from normality is now nonsignificant
– The t-test revealed a highly significant difference in means between the two groups

Log-Transformed Data

NewX = ln(X + 1)

         Original data         Transformed data
ID    Group 1    Group 2     Group 1    Group 2
1        2.72      20.09        1.31       3.05
2        7.39      54.60        2.13       4.02
3        7.39      54.60        2.13       4.02
4       20.09     148.41        3.05       5.01
5       20.09     148.41        3.05       5.01
6       20.09     148.41        3.05       5.01
7       54.60     403.43        4.02       6.00
8       54.60     403.43        4.02       6.00
9      148.41    1096.63        5.01       7.00

Statistics for the transformed data:

            Group 1     Group 2
n                 9           9
Mean           3.08        5.01
Var            1.30        1.47
Kurtosis      -0.35       -0.30
Skewness       0.16        0.02
P(Zg1)         0.82        0.97
P(Zg2)     0.659393    0.684475

t-test: t = -3.48, df = 16, P = 0.0031
Equal Var.? P = 0.8687
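As a rough check of the numbers on this slide, the sketch below applies NewX = ln(X + 1) and computes the pooled-variance t statistic before and after transformation. It assumes Python's standard library only; `log_transform` and `pooled_t` are illustrative names, and no P values are computed because the standard library has no t distribution.

```python
import math
import statistics

group1 = [2.72, 7.39, 7.39, 20.09, 20.09, 20.09, 54.60, 54.60, 148.41]
group2 = [20.09, 54.60, 54.60, 148.41, 148.41, 148.41, 403.43, 403.43, 1096.63]

def log_transform(values):
    """Apply NewX = ln(X + 1) to every observation."""
    return [math.log(x + 1) for x in values]

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance; returns (t, df)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se, df

log1, log2 = log_transform(group1), log_transform(group2)
t_raw, df = pooled_t(group1, group2)
t_log, _ = pooled_t(log1, log2)

print(f"variances after transformation: "
      f"{statistics.variance(log1):.2f}, {statistics.variance(log2):.2f}")
print(f"t before = {t_raw:.2f}, t after = {t_log:.2f}, df = {df}")
```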

Transform back:

$X = e^{NewX} - 1$

Compare this mean with the original mean. Which one is more preferable?

Calculate the standard error, the degrees of freedom, and the 95% CL (t0.025,16 = 2.12).
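To make the comparison of means concrete, here is a minimal sketch for Group 1, assuming Python's standard library. It covers only the mean (not the standard error or confidence limits asked for above), and the variable names are illustrative.

```python
import math
import statistics

group1 = [2.72, 7.39, 7.39, 20.09, 20.09, 20.09, 54.60, 54.60, 148.41]

# Forward transformation: NewX = ln(X + 1)
log1 = [math.log(x + 1) for x in group1]
mean_log = statistics.mean(log1)

# Back-transformation of the mean: X = e^NewX - 1
back_transformed_mean = math.exp(mean_log) - 1

arithmetic_mean = statistics.mean(group1)
median = statistics.median(group1)

print(f"arithmetic mean       = {arithmetic_mean:.2f}")
print(f"back-transformed mean = {back_transformed_mean:.2f}")
print(f"median                = {median:.2f}")
```

For these right-skewed data the back-transformed mean falls close to the median, whereas the arithmetic mean is pulled upward by the single largest observation.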

Normal but Heteroscedastic

Any transformation that you use is likely to change normality. Fortunately, the t-test and ANOVA are quite robust for this kind of data. Of course, you can also use nonparametric tests.

Normal but Heteroscedastic

ID          Group 1     Group 2
1                11           1
2                12           4
3                12           4
4                13           8
5                13           8
6                13           8
7                14          12
8                14          12
9                15          15

n                 9           9
Mean             13           8
Var             1.5       20.25
Kurtosis   -0.28571    -0.76582
Skewness          0           0

t-test: t = 3.216338, df = 16, P = 0.0054
Equal Var.? P = 0.0013

The two variances are significantly different.

The t-test, however, detects a significant difference in means. You can use nonparametric methods to analyse the data for comparison, and you are likely to find the t-test to be more powerful.
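The variance ratio and the t statistic for this data set can be reproduced as follows. This is a sketch assuming Python's standard library; it computes the statistics only, not their P values, and the variable names are illustrative.

```python
import math
import statistics

group1 = [11, 12, 12, 13, 13, 13, 14, 14, 15]
group2 = [1, 4, 4, 8, 8, 8, 12, 12, 15]

var1 = statistics.variance(group1)
var2 = statistics.variance(group2)

# Ratio of the larger to the smaller sample variance (the F ratio)
variance_ratio = max(var1, var2) / min(var1, var2)

# Pooled-variance two-sample t statistic
n1, n2 = len(group1), len(group2)
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (statistics.mean(group1) - statistics.mean(group2)) / se_diff

print(f"variances: {var1}, {var2}; ratio = {variance_ratio:.1f}")
print(f"t = {t_stat:.4f}, df = {df}")
```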

Additivity

                        Factor B
Factor A          Level 1    Level 2
Level 1             1.313      2.450
                    2.127      3.240
                    2.127      3.343
                    3.049      3.976
                    3.049      3.578
                    3.049      4.507
                    4.018      4.666
                    4.018      5.324
                    5.007      6.060
          Mean      3.084      4.127
Level 2             2.030      2.927
                    2.805      3.599
                    2.751      3.837
                    3.968      4.933
                    3.766      4.576
                    3.589      4.781
                    4.570      5.719
                    4.562      5.983
                    5.956      6.868
          Mean      3.778      4.803

• What experimental design is this?

• Compare the group means. Is there an interaction effect?

Additivity means that the difference between levels of one factor is consistent for different levels of another factor.
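This definition can be tried out on the four cell means from the table. The sketch below assumes Python and takes Factor A as the row factor and Factor B as the column factor; the dictionary keys and variable names are illustrative.

```python
# Cell means from the Additivity slide: (Factor A level, Factor B level)
cell_means = {
    ("A1", "B1"): 3.084, ("A1", "B2"): 4.127,
    ("A2", "B1"): 3.778, ("A2", "B2"): 4.803,
}

# Difference between the two levels of Factor B, at each level of Factor A
effect_b_at_a1 = cell_means[("A1", "B2")] - cell_means[("A1", "B1")]
effect_b_at_a2 = cell_means[("A2", "B2")] - cell_means[("A2", "B1")]

# Under perfect additivity the two differences are equal,
# so this contrast would be exactly zero.
interaction_contrast = effect_b_at_a2 - effect_b_at_a1

print(f"B2 - B1 at A1: {effect_b_at_a1:.3f}")
print(f"B2 - B1 at A2: {effect_b_at_a2:.3f}")
print(f"interaction contrast: {interaction_contrast:.3f}")
```

A contrast near zero relative to the size of the main effects is what "consistent differences" looks like numerically; whether it is significantly different from zero is what the interaction term of the ANOVA tests.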

Multiplicative Effects

                        Factor B
Factor A          Level 1    Level 2
Level 1             2.718     10.589
                    7.389     24.530
                    7.389     27.316
                   20.086     52.304
                   20.086     34.795
                   20.086     89.660
                   54.598    105.306
                   54.598    204.262
                  148.413    427.365
          Mean     37.262    108.458
Level 2             6.616     17.678
                   15.524     35.570
                   14.665     45.408
                   51.884    137.739
                   42.222     96.110
                   35.185    118.225
                   95.498    303.596
                   94.819    395.790
                  385.215    960.457
          Mean     82.403    234.508

• Compare the group means. Is there an interaction effect?

• Does this data set meet the assumption of additivity?

• When the assumption of additivity is not met, we have difficulty in interpreting main effects.

• Now calculate the ratio of group means. What did you find?


Multiplicative Effects

For Factor B, Level 2 has a mean about 2.88 times as large as that for Level 1. For Factor A, Level 2 has a mean about 2.18 times as large as that for Level 1.

If you know the value for Level 1 of Factor B, you can obtain the value for Level 2 of Factor B by multiplying the known value by 2.88. Similarly, you can do the same for Factor A, multiplying by 2.18.

We say that the effects of Factors A and B are multiplicative, not additive.
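The contrast between differences and ratios of the cell means can be computed directly. This is a sketch assuming Python; "row" and "column" refer to the layout of the data table (two row blocks, two columns), and the names are illustrative.

```python
# Cell means of the untransformed data: (row block, column)
cell_means = {
    ("row1", "col1"): 37.262, ("row1", "col2"): 108.458,
    ("row2", "col1"): 82.403, ("row2", "col2"): 234.508,
}

# Differences between columns are far from constant across rows ...
col_diff_row1 = cell_means[("row1", "col2")] - cell_means[("row1", "col1")]
col_diff_row2 = cell_means[("row2", "col2")] - cell_means[("row2", "col1")]

# ... but the ratios are nearly constant.
col_ratio_row1 = cell_means[("row1", "col2")] / cell_means[("row1", "col1")]
col_ratio_row2 = cell_means[("row2", "col2")] / cell_means[("row2", "col1")]
row_ratio_col1 = cell_means[("row2", "col1")] / cell_means[("row1", "col1")]
row_ratio_col2 = cell_means[("row2", "col2")] / cell_means[("row1", "col2")]

print(f"column differences: {col_diff_row1:.1f} vs {col_diff_row2:.1f}")
print(f"column ratios:      {col_ratio_row1:.2f} vs {col_ratio_row2:.2f}")
print(f"row ratios:         {row_ratio_col1:.2f} vs {row_ratio_col2:.2f}")
```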

Log-transformation

Now log-transform the data. Compare the means. Is the assumption of additivity met now?

Cell means and variances (Factor A in rows, Factor B in columns):

                            Original data             Transformed data
                        B Level 1    B Level 2     B Level 1    B Level 2
A Level 1   Mean           37.262      108.458         3.084        4.127
            Variance     2102.351    17878.648         1.302        1.268
A Level 2   Mean           82.403      234.508         3.778        4.803
            Variance    12400.091    80241.944         1.235        1.385

Why can the log-transformation change multiplicative effects into additive effects?

$Z = XY$

$\ln(Z) = \ln(X) + \ln(Y)$
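The same identity can be written out for a two-factor design. The notation below is an assumption introduced for illustration and does not appear in the slides: $\mu$ is a baseline value, $\alpha_i$ is the multiplier for level $i$ of Factor A, and $\beta_j$ is the multiplier for level $j$ of Factor B.

```latex
\begin{align}
X_{ij} &= \mu \, \alpha_i \, \beta_j \\
\ln X_{ij} &= \ln \mu + \ln \alpha_i + \ln \beta_j \\
\ln X_{2j} - \ln X_{1j} &= \ln \alpha_2 - \ln \alpha_1
  \quad \text{for every level } j \text{ of Factor B}
\end{align}
```

On the log scale, the difference between the levels of Factor A no longer depends on the level of Factor B, which is exactly the definition of additivity. With NewX = ln(X + 1) and real data the relation holds only approximately.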

Square-Root Transformation

ID          Group 1    Group 2
1                 1          9
2                 4         16
3                 4         16
4                 9         25
5                 9         25
6                 9         25
7                16         36
8                16         36
9                25         49

Mean         10.333     26.333
Var          56.500    152.500
Var/Mean      5.468      5.791
Std/Mean      0.727      0.469

• The two groups of data differ much in variance.

• Calculate two ratios: var/mean ratio and Std/mean ratio (i.e., coefficient of variation).

• Does your calculation suggest log-transformation? When is log-transformation appropriate?

• Use square-root transformation when different groups have similar Variance/Mean ratios

Notice the means, which do not coincide with the most frequent observations

Square-Root Transformation

Square-root transformation: $X' = \sqrt{X + 3/8}$

         Original data         Transformed data
ID    Group 1    Group 2     Group 1    Group 2
1           1          9        1.17       3.06
2           4         16        2.09       4.05
3           4         16        2.09       4.05
4           9         25        3.06       5.04
5           9         25        3.06       5.04
6           9         25        3.06       5.04
7          16         36        4.05       6.03
8          16         36        4.05       6.03
9          25         49        5.04       7.03

Mean   10.333     26.333        3.07       5.04
Var    56.500    152.500       1.412      1.475

The variance is now almost identical between the two groups.

Transform the means back to the original scale and compare these means with the original means:

$X = (X')^2 - 3/8$
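A short sketch of the forward and backward transformation, assuming Python's standard library; `sqrt_transform` and `back_transform` are illustrative names.

```python
import math
import statistics

group1 = [1, 4, 4, 9, 9, 9, 16, 16, 25]
group2 = [9, 16, 16, 25, 25, 25, 36, 36, 49]

def sqrt_transform(values):
    """Apply X' = sqrt(X + 3/8) to every observation."""
    return [math.sqrt(x + 3 / 8) for x in values]

def back_transform(value):
    """Invert the transformation: X = (X')^2 - 3/8."""
    return value ** 2 - 3 / 8

results = {}
for name, values in (("Group 1", group1), ("Group 2", group2)):
    transformed = sqrt_transform(values)
    mean_t = statistics.mean(transformed)
    results[name] = {
        "original_mean": statistics.mean(values),
        "transformed_var": statistics.variance(transformed),
        "back_transformed_mean": back_transform(mean_t),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

The back-transformed means come out lower than the arithmetic means of the raw data, as expected for right-skewed groups.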

Quiz on Data Transformation

Group          1         2         3         4
               2         6         9         2
               0         4         5         4
               2         8         6         1
               3         2         5         0
               0         4        11         2

n              5         5         5         5
Mean         1.4       4.8       7.2       1.8
Var          1.8       5.2       7.2       2.2
SE         0.600     1.020     1.200     0.663
T          2.776     2.776     2.776     2.776
LowerL    -0.266     1.969     3.868    -0.042
UpperL     3.066     7.631    10.532     3.642

The data set is right-skewed for each group.

Calculate the variance/mean ratio and C.V. for each group, and decide what transformation you should use.

Do the transformation and convert the means back to the original scale.

With Multiple Groups

When you have multiple groups, a “Variance vs Mean” or a “Std vs Mean” plot can help you to decide which data transformation to use. The graph on the left shows that the Var/Mean ratio is almost constant. What transformation should you use?

[Figure: Variance (vertical axis, 0 to 8) plotted against Mean (horizontal axis, 0 to 8) for the groups]
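When no plot is at hand, the same comparison can be made numerically. The sketch below uses the four groups from the quiz slide and assumes Python's standard library. Judging which ratio is "more constant" by its own coefficient of variation is an ad hoc heuristic chosen for this illustration, not a rule from the slides.

```python
import statistics

# The four groups from the quiz slide
groups = {
    1: [2, 0, 2, 3, 0],
    2: [6, 4, 8, 2, 4],
    3: [9, 5, 6, 5, 11],
    4: [2, 4, 1, 0, 2],
}

var_mean_ratios = []
cvs = []
for label, values in groups.items():
    mean = statistics.mean(values)
    var = statistics.variance(values)
    var_mean_ratios.append(var / mean)
    cvs.append(var ** 0.5 / mean)
    print(f"Group {label}: mean = {mean:.1f}, var = {var:.1f}, "
          f"Var/Mean = {var / mean:.3f}, C.V. = {var ** 0.5 / mean:.3f}")

def relative_spread(ratios):
    """Standard deviation of the ratios divided by their mean (heuristic)."""
    return statistics.stdev(ratios) / statistics.mean(ratios)

spread_var_mean = relative_spread(var_mean_ratios)
spread_cv = relative_spread(cvs)
print(f"relative spread of Var/Mean: {spread_var_mean:.2f}")
print(f"relative spread of C.V.:     {spread_cv:.2f}")
```

The ratio that varies less across groups is the one to treat as "almost constant" when choosing between the transformations discussed earlier.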

Confidence Limits

[Figure: two panels, each plotting the mean and its lower and upper confidence limits (vertical axis, -2 to 12) against the mean (horizontal axis, 0 to 8). Left panel: before transformation. Right panel: after transformation.]

With the skewness in our data, do confidence limits on the right make more sense? Why?

Arcsine Transformation

• Used for proportions

• Compare the variances before and after transformation

• Do you know how to transform the means and C.L. back to the original scale?

             Original data (%)     Transformed data (degrees)
            Group 1    Group 2        Group 1    Group 2
              84.20      92.30         66.579     73.890
              88.90      95.10         70.539     77.211
              89.20      90.30         70.814     71.854
              83.40      88.60         65.957     70.267
              80.10      92.60         63.507     74.215
              81.30      96.00         64.378     78.463
              85.80      93.70         67.863     75.463

Mean          84.70      92.66         67.091     74.480
Var           12.29       6.73          8.017      8.226
SE            1.325      0.980          1.070      1.084
LowerL       81.457     90.258         64.472     71.828
UpperL       87.943     95.056         69.709     77.133

Transform back:
NewMean                                84.847     92.841
LowerL                                 81.428     90.273
UpperL                                 87.974     95.041

Transformation: $X' = \arcsin\sqrt{X}$, with $X$ expressed as a proportion

Back-transformation: $X = (\sin X')^2$
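The back-transformed mean and confidence limits for Group 1 can be reproduced as sketched below, assuming Python's standard library. Two things here are assumptions rather than statements from the slide: the percentages are divided by 100 before the transformation and the angles are expressed in degrees (which matches the tabulated values), and the critical value 2.447 is the two-sided 95% t value for 6 degrees of freedom, hard-coded because the standard library has no t distribution.

```python
import math
import statistics

group1_pct = [84.20, 88.90, 89.20, 83.40, 80.10, 81.30, 85.80]
T_CRIT_DF6 = 2.447  # assumed: two-sided 95% critical t for n - 1 = 6 df

def arcsine_transform(pct):
    """X' = arcsin(sqrt(X)), X as a proportion; result in degrees."""
    return math.degrees(math.asin(math.sqrt(pct / 100)))

def back_transform(deg):
    """X = (sin X')^2, returned as a percentage."""
    return 100 * math.sin(math.radians(deg)) ** 2

transformed = [arcsine_transform(p) for p in group1_pct]
mean_t = statistics.mean(transformed)
se_t = math.sqrt(statistics.variance(transformed) / len(transformed))
lower_t = mean_t - T_CRIT_DF6 * se_t
upper_t = mean_t + T_CRIT_DF6 * se_t

# Limits are computed on the transformed scale, then converted back
new_mean = back_transform(mean_t)
new_lower = back_transform(lower_t)
new_upper = back_transform(upper_t)

print(f"transformed: mean = {mean_t:.3f}, CL = ({lower_t:.3f}, {upper_t:.3f})")
print(f"back-transformed: mean = {new_mean:.3f}, "
      f"CL = ({new_lower:.3f}, {new_upper:.3f})")
```

Because the limits are computed on the transformed scale and then converted back, they are no longer symmetric around the back-transformed mean.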

Data Transformation Using SAS

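Data Mydata;
  input x;
  /* Keep only the one assignment you need: each line overwrites newx */
  newx = log(x);            /* natural logarithm transformation */
  newx = sqrt(x + 3/8);     /* square-root transformation */
  newx = arsin(sqrt(x));    /* arcsine transformation; x must be a proportion (0 to 1) */
cards;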
