June 27, 2013 1
INTRODUCTION TO BIOSTATISTICS FOR GRADUATE AND MEDICAL STUDENTS
Descriptive Statistics and Graphically Visualizing Data
Beverley Adams Huet, MS
Assistant Professor
Department of Clinical Sciences, Division of Biostatistics
Files for today (June 27) Lecture and Homework
Lecture #2 (1 file) PPT presentation
Homework #1 (1 file) Biostat_HW_Huet_assigned062713.docx
Medical Student Research Page -- Summer II
http://www.utsouthwestern.edu/education/medical-school/academics/research/summer/course-ii.html
Today’s Outline
• Describing data
• Descriptive statistics
  – Measures of central tendency
  – Measures of dispersion
• Other statistics
  – Coefficient of variation
  – Standard error of the mean
• Transformations
• Histograms and other graphs
Types of Statistics
Descriptive statistics
• Which summary statistics to use to organize and describe the data?
• Proportion, mean, median, SD, percentiles
Inferential statistics
• Generalizing from the sample. Which test?
• T-test, Fisher’s Exact, ANOVA, survival analysis
• Bayesian approaches
Summary of commonly used statistical tests
Rows give the goal of the analysis; columns give the type of outcome variable. A dash marks a cell with no entry.

Goal | Continuous measurement (from a normal distribution) | Rank, score, or measurement (from non-normal distribution) | Binomial (e.g. heads or tails) | Survival (time to event)
Describe one group | Mean, SD | Median, interquartile range | Proportion | Kaplan-Meier survival curve
Compare one group to a hypothetical value | One-sample t test | Wilcoxon signed-rank test | Chi-square or binomial test | -
Compare two unpaired (independent) groups | Two-sample (unpaired) t test | Mann-Whitney (Wilcoxon rank sum) test | Fisher's exact test (or chi-square for large samples) | Log-rank test or Mantel-Haenszel
Compare two paired groups | Paired t test | Wilcoxon signed-rank test | McNemar test | Conditional proportional hazards regression
Compare three or more unmatched groups | One-way ANOVA | Kruskal-Wallis test | Chi-square test | Cox proportional hazard regression
Compare three or more matched groups | Repeated-measures analysis | Friedman test | Cochran's Q test | Conditional proportional hazards regression
Quantify association between two variables | Pearson correlation | Spearman correlation | Contingency coefficients | -
Predict value from another measured variable | Simple linear regression | Nonparametric regression | Simple logistic regression | Cox proportional hazard regression
Predict value from several measured or binomial variables | Multiple linear regression, ANCOVA | - | Multiple logistic regression | Cox proportional hazard regression
Censored data
•Left censoring
•Right censoring
Cannot be measured beyond some limit
Left Censored data
• Lab data – “undetectable”, “below lower limit”
• Example CRP “< 0.2 mg/dL”
Cannot be measured beyond some limit
Subject  CRP
001      0.7
002      1.6
003      <0.2
004      3.8
Censored at the limit of detectability
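One way to keep track of left-censored lab values in software is to store the reported number together with a flag saying whether it was below the detection limit. The sketch below is illustrative only: the helper name `parse_lab_value` and the dictionary layout are assumptions, and the CRP values are the four from the table above.

```python
def parse_lab_value(text):
    """Return (value, is_left_censored) for a lab result given as text.

    A leading "<" means the true value lies somewhere below the reported limit.
    (Hypothetical helper, written for this example only.)
    """
    text = text.strip()
    if text.startswith("<"):
        return float(text[1:]), True
    return float(text), False


# CRP results from the slide, keyed by subject
crp_reported = {"001": "0.7", "002": "1.6", "003": "<0.2", "004": "3.8"}

parsed = {subject: parse_lab_value(value) for subject, value in crp_reported.items()}
n_censored = sum(1 for _, censored in parsed.values() if censored)

for subject, (value, censored) in parsed.items():
    label = "left censored at" if censored else "observed"
    print(subject, label, value)
print("number of censored values:", n_censored)
```

Keeping the flag separate avoids silently treating "<0.2" as if it were a measured 0.2.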
Right Censored data
• Right censoring
- “Survival” data – the period of observation was cut off before the event of interest occurred.
Cannot be measured beyond some limit
Note – an event in a ‘survival’ analysis may be infection, fracture, transplant, or metastasis
Right censored survival data
[Figure: follow-up timelines for subjects 1-10; x-axis: Study time, months (0-12); y-axis: Subject; legend: Survival time known, Censored; annotations: “Event” at 3 months, Lost to follow-up at 9 months]
Right censored survival data
[Figure: follow-up timelines for subjects 1-10; x-axis: Study time, months (0-12); y-axis: Subject; legend: Survival time known, Censored]
Survival Analysis
[Figure: survival curve drawn as a step function; x-axis: Time (0-12); y-axis: Survival (0.0-1.0)]
• Measures of Central Tendency
• Measures of Dispersion
Descriptive statistics
Measures of Central Tendency*
• Mean
• Median
• Geometric mean
• Mode
*or Measures of Location
Measures of Central Tendency*
• Mean
  – Arithmetic average or balance point
  – Discrete/continuous data; symmetric distribution
  – May be sensitive to outliers
  – Sample mean symbol is denoted as ‘x-bar’
$\bar{X} = \frac{\sum X}{N}$
Fasting plasma glucose, n=6
SubjectID  Glucose mg/dL
0204       145
0205       126
0206       136
0210       97
0211       264
0212       144
Mean       152
*or Measures of Location
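The mean of the six glucose values can be reproduced in a few lines. This is a minimal sketch using only the numbers on the slide; the variable names are illustrative, and dropping subject 0211 is done here purely to show how one large value pulls the mean.

```python
# Fasting plasma glucose (mg/dL) for the six subjects on the slide
glucose = [145, 126, 136, 97, 264, 144]

total = sum(glucose)   # sum of the observations
n = len(glucose)       # number of observations
x_bar = total / n      # sample mean, "x-bar"

# Illustration only: recompute the mean without the largest value (264)
without_largest = [x for x in glucose if x != 264]
x_bar_without_largest = sum(without_largest) / len(without_largest)

print("mean of all six values:", x_bar)
print("mean without the largest value:", x_bar_without_largest)
```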
Fasting plasma glucose, n=6
[Figure: two panels; a bar chart of the mean glucose (mg/dL) and a plot titled Fasting Plasma Glucose with y-axis Glucose, mg/dL]
Mean 152, Median 140
What about other measures of central tendency?
Measures of Central Tendency*
*or Measures of Location
In a symmetric distribution with a single peak, the median, mode and mean will have the same value.
[Figure: example distribution]
In a non-symmetric (skewed) distribution, the median, mode and mean may not have the same value.
Measures of Central Tendency
Median
• Middle value when the data are ranked in order (if the sample size is an even number then the median is the average of the two middle values)
• 50th percentile
• Ordinal/discrete/continuous data
• Useful with highly skewed discrete or continuous data
• Relatively insensitive to outliers
Measures of Central Tendency
The median of 13, 11, 17 is 13
The median of 13, 11, 568 is 13
The median of 14, 12, 11, 568 is 13
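The three examples above can be checked with Python's standard `statistics` module. The sketch also prints the mean of each list, which is an addition for contrast and not part of the slide, to show how the extreme value 568 moves the mean but not the median.

```python
from statistics import mean, median

# The three example data sets from the slide
examples = [
    [13, 11, 17],
    [13, 11, 568],
    [14, 12, 11, 568],  # even n: median is the average of the two middle values
]

medians = [median(values) for values in examples]
means = [mean(values) for values in examples]

for values, med, avg in zip(examples, medians, means):
    print(values, "median =", med, "mean =", round(avg, 2))
```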
Measures of Central Tendency
Order the glucose values from smallest to largest
SubjectID  Glucose mg/dL
0210       97
0205       126
0206       136
0212       144
0204       145
0211       264
Mean 152
Median 140 (average of the two middle values, 136 and 144)
Gonick & Smith (1993) The Cartoon Guide to Statistics.
The median is often better than the mean for describing the center of the data
Geometric mean
Report for log transformed data
• Never larger than the arithmetic mean (for positive values the two are equal only when all values are identical).
• Can be used instead of arithmetic mean when data have a skewed or log-normal distribution
• Find the mean on the log scale; taking the antilog of this mean yields the geometric mean.
$G = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n}$
Geometric mean
Sometimes we can transform our data
[Figure: histograms of Creatinine and of Log10(Creatinine), the log transformed data]
Geometric mean
SubjectID Glucose mg/dL ln(Glucose)
0204 145 4.976734
0205 126 4.836282
0206 136 4.912655
0210 97 4.574711
0211 264 5.575949
0212 144 4.969813
Mean 152 4.9743573
SD 57.644 0.330
Median 140 4.941234093
Geometric mean:
Back-transform (antilog) the mean of the loge transformed data
Take the antilog of the mean
exp(4.974357) = 144.6558278
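The back-transformation can be sketched as follows, using the six glucose values from the table. The comparison against `statistics.geometric_mean` (available in Python 3.8 and later) is an added cross-check, not something shown on the slide.

```python
import math
from statistics import geometric_mean

glucose = [145, 126, 136, 97, 264, 144]  # mg/dL, from the slide

# Step 1: natural-log transform each value
log_values = [math.log(x) for x in glucose]

# Step 2: arithmetic mean on the log scale
mean_log = sum(log_values) / len(log_values)

# Step 3: antilog (exp) of that mean gives the geometric mean
geo_mean = math.exp(mean_log)

arithmetic_mean = sum(glucose) / len(glucose)

print("mean of ln(glucose):", round(mean_log, 6))
print("geometric mean:", round(geo_mean, 2))
print("arithmetic mean:", arithmetic_mean)
```

Because the logs are averaged before exponentiating, the single large value (264) has less influence than it has on the arithmetic mean.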
Measures of Central Tendency
Mode
• Most frequently occurring value in the distribution
• Nominal/ordinal/discrete/continuous data
The mode of 13, 11, 22, 11, 17 = 11
Gonick & Smith (1993) The Cartoon Guide to Statistics.
Measures of Dispersion
Measures of Dispersion
• Also known as
  – Measures of Spread
  – Measures of Variability
• Commonly used measures of variability
  – Standard deviation
  – Range
  – Percentiles
Measures of Dispersion
• Standard Deviation
  – Square root of average of the squared deviations
  – Has the same units as the original observation
  – NEVER negative
Sample standard deviation (Roman letters) [n-1 in the denominator corrects for bias]
$s_x = \sqrt{\frac{\sum (X - \bar{X})^2}{n-1}}$
Population standard deviation (Greek letters; lower case sigma)
$\sigma_x = \sqrt{\frac{\sum (X - \mu)^2}{n}}$
Standard deviation calculation
SubjectID  Glucose mg/dL  Deviation from mean (glucose minus mean of 152)  Squared deviation
0204       145            -7                                               49
0205       126            -26                                              676
0206       136            -16                                              256
0210       97             -55                                              3025
0211       264            112                                              12544
0212       144            -8                                               64
Mean 152
Median 140
n = 6
Sum of squares = 16614
Variance (divide by n-1) = 3322.8
SD (square root of variance) = 57.644
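The same calculation, step by step, as a minimal sketch. The final comparison with `statistics.stdev` (which also uses n-1 in the denominator) is an added cross-check.

```python
import math
from statistics import stdev

glucose = [145, 126, 136, 97, 264, 144]  # mg/dL, from the slide

n = len(glucose)
x_bar = sum(glucose) / n                       # mean = 152

deviations = [x - x_bar for x in glucose]      # glucose minus mean
squared_deviations = [d ** 2 for d in deviations]

sum_of_squares = sum(squared_deviations)       # sum of squared deviations
variance = sum_of_squares / (n - 1)            # sample variance: divide by n-1
sd = math.sqrt(variance)                       # SD = square root of variance

print("sum of squares:", sum_of_squares)
print("variance:", variance)
print("SD:", round(sd, 3))
print("statistics.stdev:", round(stdev(glucose), 3))
```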
Standard deviation and the Normal Distribution
Approximately 68% of the observations fall within 1 SD of the mean
Approximately 95% of the observations fall within 2 SDs of the mean
Approximately 99.7% of the observations fall within 3 SDs of the mean
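These three percentages follow from the normal cumulative distribution function. A short sketch with `statistics.NormalDist` from the Python standard library (3.8+) is shown below; using the standard normal (mean 0, SD 1) is a choice made here for convenience, since the proportions are the same for any normal distribution.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of a normal distribution lying within k SDs of the mean
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, proportion in coverage.items():
    print(f"within {k} SD: {proportion:.4f}")
```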
Percentiles
From Primer of Biostatistics by Stanton A Glantz
Percentiles
• The value below which a given percentage of the values occur
  – The 50th percentile is the median
  – Quartiles
    • The 25th percentile is first quartile (Q1)
    • The 75th percentile is third quartile (Q3)
  – Interquartile range is the difference between 25th and 75th percentiles
  – Other commonly reported percentiles
    • 10th and 90th percentiles
    • 5th and 95th percentiles
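As a rough illustration, quartiles and the interquartile range for the six glucose values used earlier in this lecture can be computed with `statistics.quantiles`. Note the assumption: software packages use several different percentile definitions, so the `method="inclusive"` interpolation chosen here may give slightly different Q1 and Q3 values than SAS, SigmaPlot or Prism, especially with a sample this small.

```python
from statistics import quantiles

glucose = [145, 126, 136, 97, 264, 144]  # mg/dL, n = 6

# n=4 asks for the three cut points that split the data into quarters
q1, q2, q3 = quantiles(glucose, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range

print("Q1 (25th percentile):", q1)
print("Q2 (50th percentile, the median):", q2)
print("Q3 (75th percentile):", q3)
print("IQR:", iqr)
```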
“Special” Percentiles: Quartiles
[Figure: distribution divided into four parts of 25% each by Q1, Q2 and Q3; the IQR spans Q1 to Q3]
Percentiles
From Primer of Biostatistics by Stanton A Glantz
Summarizing data with medians and percentiles
Box and Whisker Plot
From Lang and Secic, How to Report Statistics in Medicine
Copyright ©2007 The Endocrine Society
Dovio, A. et al. J Clin Endocrinol Metab 2007;92:1803-1808
FIG. 1. Serum levels of OPG (A) and sRANKL (B) in CS patients and controls. Values are median, 25th and 75th percentile, and range. *, P < 0.01 by Mann-Whitney U test.
Box and Whisker plots
Descriptive statistics with percentiles
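/* Request n, mean, SD, median, min, 25th and 75th percentiles, and max, to 5 decimals */
proc means n mean std median min p25 p75 max maxdec=5;
   title3 'Descriptive statistics';
   class group;   /* summarize separately within each group */
   var hdl;       /* analysis variable */
run;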
Measures of shape
“Skewed left” – a long left tail
“Skewed right” – a long right tail
Skewness measures symmetry
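The slides do not give a formula for skewness. As an assumption, the sketch below uses one common definition, the moment coefficient (third central moment divided by the second central moment raised to the power 1.5), applied to the glucose values used earlier in this lecture. A positive result indicates a long right tail and a negative result a long left tail.

```python
def moment_skewness(values):
    """Moment coefficient of skewness, g1 = m3 / m2**1.5.

    One common definition, assumed here; statistical packages may apply
    small-sample corrections and report slightly different numbers.
    """
    n = len(values)
    center = sum(values) / n
    m2 = sum((x - center) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - center) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


glucose = [145, 126, 136, 97, 264, 144]   # one large value: long right tail
mirrored = [-x for x in glucose]          # mirror image: long left tail

skew_right = moment_skewness(glucose)
skew_left = moment_skewness(mirrored)

print("glucose skewness:", round(skew_right, 2))
print("mirrored skewness:", round(skew_left, 2))
```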
Figure 1. Histograms of the fasting insulin levels in Japanese men and women.
Characteristics of the Insulin Resistance Syndrome in a Japanese Population The Jichi Medical School Cohort Study
Arteriosclerosis, Thrombosis, and Vascular Biology. 1996;16:269-274
Skewed left or right??
Examples of variables having a skewed distribution
Skewness measures symmetry
• Triglycerides, insulin, HOMA
• Bilirubin, leptin, CRP, viral load counts
• Urine albumin
• Income
• Health care costs
• Hospital length of stay
Other statistics
• Coefficient of variation
• Standard error of the mean
Coefficient of variation (CV)
The standard deviation expressed as a proportion of the mean:
$CV = \frac{SD}{Mean}$
Often expressed as a percent:
$CV = \frac{SD}{Mean} \times 100\%$
Coefficient of variation (CV)
“The magnitude of total intra-individual variability based on coefficient of variation (CV) for these lipids in premenopausal women (CV, 4% to 8.1%) was similar to that found for men (CV, 4.3% to 9.1%) and for postmenopausal women (CV, 3.7% to 6.7%).”
Metabolism. 2000 Sep;49(9):1101-5.
Femoral Neck Hip Bone Density
ID1 ID2 ID3 ID4
Scan1 0.691 0.775 0.754 0.846
Scan2 0.688 0.779 0.780 0.832
Scan3 0.693 0.775 0.753 0.798
Mean 0.690667 0.776333 0.762333 0.825333
SD 0.002517 0.002309 0.015308 0.024685
CV 0.003644 0.002975 0.02008 0.029909
CV% 0.36% 0.30% 2.01% 2.99%
Coefficient of variation (CV)
Subject A
CO2 AST ALT
Day1 28 24 27
Day2 31 29 27
Day3 29 30 28
Mean 29.33333 27.66667 27.33333
SD 1.527525 3.21455 0.57735
CV 0.052075 0.116189 0.021123
CV% 5.21% 11.62% 2.11%
$CV = \frac{SD}{Mean}$
$CV = \frac{SD}{Mean} \times 100\%$
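The CV% row of the bone density table can be reproduced from the three scans per subject. This is a minimal sketch; the dictionary layout is an illustrative choice, and the SD is the sample SD (n-1 denominator), which matches the SD row on the slide.

```python
from statistics import mean, stdev

# Femoral neck bone density: three repeat scans per subject (from the slide)
scans = {
    "ID1": [0.691, 0.688, 0.693],
    "ID2": [0.775, 0.779, 0.775],
    "ID3": [0.754, 0.780, 0.753],
    "ID4": [0.846, 0.832, 0.798],
}

# CV% = SD / mean x 100, computed separately for each subject
cv_percent = {
    subject: 100 * stdev(values) / mean(values)
    for subject, values in scans.items()
}

for subject, cv in cv_percent.items():
    print(f"{subject}: CV = {cv:.2f}%")
```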
Coefficient of variation (CV)
• A measure of spread
• A unitless fraction (SD/mean)
• Used for comparing the relative variability of variables measured in different units
  – e.g., HDL-cholesterol (4.6%) and triglycerides (2.7%)
• Used for comparing the relative variability of variables measured in the same units
  – e.g., inter-assay versus intra-assay variability
Standard error of the mean
• Is NOT a descriptive statistic
• The standard error of the mean is useful in the calculations of confidence intervals and significance tests
Do not summarize continuous data with the mean and the standard error of the mean.
Lang and Secic (2006) How to Report Statistics in Medicine
Standard error of the mean (also called SEM, SE, standard error)
SEM = standard deviation / square root of the sample size
or
$SEM = \frac{SD}{\sqrt{n}}$
Which is smaller - SD or SEM?
Standard error (SE, SEM, standard error of the mean)
Why is the standard error commonly used as a descriptive statistic and graphed as ‘error bars’?
• It is smaller than the standard deviation
• ‘Looks’ better?
The only role of the standard error…is to distort and conceal the data.
Feinstein, Clinical Biostatistics
Standard deviation (SD) and Standard error of the mean (SEM)
$SEM = \frac{SD}{\sqrt{n}}$
$SD = SEM \times \sqrt{n}$
These formulas do need to be memorized!
Can convert one to the other if the sample size (n) is known
Standard deviation (SD) and Standard error of the mean (SEM)
Can convert one to the other if the sample size (n) is known
SD = 40, n = 64, SEM = ?
SEM = 40/8 = 5
SEM = 12, n = 81, SD = ?
SD = 12 * 9 = 108
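The two conversions can be written as a pair of small functions. The function names are hypothetical, chosen for this sketch; the numbers are the two worked examples above.

```python
import math


def sem_from_sd(sd, n):
    """Standard error of the mean from the SD and the sample size."""
    return sd / math.sqrt(n)


def sd_from_sem(sem, n):
    """Standard deviation from the SEM and the sample size."""
    return sem * math.sqrt(n)


sem_example = sem_from_sd(sd=40, n=64)   # 40 / 8
sd_example = sd_from_sem(sem=12, n=81)   # 12 * 9

print("SEM for SD = 40, n = 64:", sem_example)
print("SD for SEM = 12, n = 81:", sd_example)
```

When a paper reports only mean and SEM, the second function recovers the SD, provided n is stated.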
Make friends with your data!
Look at your data!
Don’t run away!
Transformations
Why transform data?
• Many statistical analyses include assumptions
  – Normality (Normally distributed)
  – The different groups have the same standard deviation
  – Linearity for correlation or modeling
Transformations
• Linear
• Nonlinear
[Figure: Plasma Glucose; x-axis: mg/dL (0-300); y-axis: mmol/L (0-16)]
Linear Transformations
Add, subtract, multiply, divide…
• A straight line is obtained when plotting the new values against the original values
• Mean and standard deviation of the transformed values are easily obtained
Linear Transformations
Adding/subtracting a constant
• The mean increases/decreases by the same amount as the constant
• The standard deviation is unaffected
Obs X X+20
a 5 25.00
b 7 27.00
c 2 22.00
d 9 29.00
e 3 23.00
f 1 21.00
Mean 4.5 24.50
SD 3.082 3.082
Median 4 24.00
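The table can be verified directly. This sketch uses the six X values from the slide and the standard `statistics` module; nothing beyond the slide's numbers is assumed.

```python
from statistics import mean, median, stdev

x = [5, 7, 2, 9, 3, 1]              # observations a-f from the slide
x_plus_20 = [v + 20 for v in x]     # add a constant to every observation

mean_shift = mean(x_plus_20) - mean(x)      # mean moves by the constant
median_shift = median(x_plus_20) - median(x)
sd_x = stdev(x)
sd_shifted = stdev(x_plus_20)               # spread is unchanged

print("mean:", mean(x), "->", mean(x_plus_20))
print("median:", median(x), "->", median(x_plus_20))
print("SD:", round(sd_x, 3), "->", round(sd_shifted, 3))
```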
[Figure: Adding a constant; x-axis: X; y-axis: X+20]
Linear Transformations - Multiplying by a constant
[Figure: Plasma Glucose; x-axis: mg/dL; y-axis: mmol/L]
Convert to SI units
SubjectID  Glucose mg/dL  Glucose mmol/L
0204 145 8.05
0205 126 6.99
0206 136 7.55
0210 97 5.38
0211 264 14.65
0212 144 7.99
Mean 152 8.44
SD 57.644 3.20
Median 140 7.77
Multiply by 0.0555 to convert mg/dL to mmol/L
Mean 152 152*0.0555= 8.44
SD 57.6437334 57.64*0.0555= 3.20
Median 140 140*0.0555= 7.77
Conversion from ‘conventional’ units to International System of Units (SI) units
Both the mean and SD are multiplied by 0.0555
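A short check of the unit conversion, using the six glucose values and the 0.0555 factor from the slide. The sketch confirms that converting each value and then summarizing gives the same mean and SD as summarizing first and then multiplying.

```python
from statistics import mean, median, stdev

FACTOR = 0.0555                              # mg/dL -> mmol/L, from the slide
glucose_mgdl = [145, 126, 136, 97, 264, 144]
glucose_mmol = [g * FACTOR for g in glucose_mgdl]

mean_mmol = mean(glucose_mmol)
sd_mmol = stdev(glucose_mmol)
median_mmol = median(glucose_mmol)

print("mean:", round(mean_mmol, 2), "mmol/L")
print("SD:", round(sd_mmol, 2), "mmol/L")
print("median:", round(median_mmol, 2), "mmol/L")
```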
Nonlinear transformation
Copyright ©1996 BMJ Publishing Group Ltd.
Bland, J M. et al. BMJ 1996;312:1079
Fig 1--Serum triglyceride and log10 serum triglyceride concentrations in cord blood for 282 babies, with best fitting normal distribution
Transformations
Common non-linear transformations
• Log transformation (log10 or loge)
• Square-root transformation
  – Less dramatic than the log transformation
• Reciprocal transformation
  – More drastic than the log transformation
  – Use for extremely skewed data distributions
Transformations
For assessing linear association
[Figure: scatter plot, untransformed; x-axis: % Total Body Fat; y-axis: % Liver Fat] Pearson correlation coefficient = 0.28
[Figure: scatter plot, loge transformed; x-axis: % Total Body Fat; y-axis: % Liver Fat on the loge scale] Pearson correlation coefficient = 0.38
Log transformation
• When might a log transformation be useful?
  – Remove positive (right) skewness
  – The standard deviation is greater than half of the mean (if the measure cannot be negative)
  – The mean is larger than the median
    • Mean liver fat = 6.2%, SD = 6.3% (median = 4.2)
  – The mean is proportional to the standard deviation
  – When comparing several groups the variances or standard deviations are not equal
Leptin (ng/ml)
         Group A   Group B
n        33        29
Mean     36.3      27.4
SD       19.5      15.1
Median   29        19
These data might need a log transformation – why?
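Two of the warning signs for a log transformation (SD greater than half the mean, and mean larger than the median) can be checked from summary statistics alone. The sketch below applies them to the liver fat and leptin summaries quoted on these slides; the function name `log_transform_hints` is hypothetical, and the checks are rules of thumb, not a formal test.

```python
def log_transform_hints(mean, sd, median):
    """Flag two rule-of-thumb signs of right skewness.

    Intended only for measures that cannot be negative.
    (Hypothetical helper, written for this example.)
    """
    return {
        "sd_exceeds_half_mean": sd > mean / 2,
        "mean_exceeds_median": mean > median,
    }


# (mean, SD, median) as quoted on the slides
summaries = {
    "Liver fat (%)": (6.2, 6.3, 4.2),
    "Leptin, Group A": (36.3, 19.5, 29),
    "Leptin, Group B": (27.4, 15.1, 19),
}

hints = {name: log_transform_hints(*stats) for name, stats in summaries.items()}

for name, result in hints.items():
    print(name, result)
```

A histogram of the raw data remains the better guide; these flags only suggest where to look.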
Log Transformation Serum Creatinine
Original units Log transformed
Histogram with normal distribution overlay
[Figure: histogram with a normal curve overlaid]
Statistics and graph software: SigmaPlot and GraphPad Prism can be downloaded from the UTSW Information Resources INTRAnet
Histogram
• Group the data into intervals (x-axis)
• Height of the bar indicates the frequency (y-axis)
• Each “bar” begins/ends at the “true limits” of the interval.
• Bars are presented next to each other (unless the data in the next interval has a frequency of 0).
• Bars are usually the same width
• Frequencies correspond to the area of each bar
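The grouping step can be sketched as a small counting routine. The interval limits (50 mg/dL wide, starting at 50) and the function name are assumptions made for this example, and the data are the six glucose values used earlier in this lecture; real histograms need many more observations.

```python
def histogram_counts(values, lower, width, n_bins):
    """Count how many values fall in each equal-width interval.

    Interval i covers [lower + i*width, lower + (i+1)*width).
    Values outside all intervals are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        index = int((v - lower) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


glucose = [145, 126, 136, 97, 264, 144]  # mg/dL

# Assumed intervals: 50-100, 100-150, 150-200, 200-250, 250-300
absolute = histogram_counts(glucose, lower=50, width=50, n_bins=5)
relative = [count / len(glucose) for count in absolute]

for i, (count, share) in enumerate(zip(absolute, relative)):
    start = 50 + i * 50
    print(f"{start:>3}-{start + 50:<3} {'#' * count:<5} n={count} ({share:.0%})")
```

The absolute and relative versions have the same shape; only the y-axis scale differs.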
Histograms for continuous/discrete data
[Figure: two panels; an absolute frequency histogram and a relative frequency histogram]
Spaghetti plot for Repeated Measurements
Each subject is observed under multiple conditions or on multiple occasions
[Figure: spaghetti plot across multiple time points]
Forest plot for Meta-analysis
[Figure: forest plot]
Look at your data!
Do not rely on p-values. The non-significant results might be just as interesting or enlightening.
A difference must be a difference to make a difference