Lecture 3 – Data Summary Measures and Graphical Display of Results


Univariate Data –

Analysis of one variable at a time

Why Think About/Explore Data?

• Done to accomplish:
– Checking for data entry errors
– Describing demographic and study characteristics
– Examining distributions of outcomes
   • Central tendency
   • Variability
– Checking for outliers
– Checking assumptions for subsequent analyses
– Giving a picture of your sample

• To understand which statistics are appropriate, first establish the measurement level of the outcome(s) and predictor(s).

Dependent variable = outcome
Independent variable = predictor

Types of Data

Nominal – Qualitative Data
Measured in unordered categories (summarize with %’s)

Ordinal – Qualitative Data
Measured in ordered categories (summarize with %’s)

Continuous – Quantitative Data
Measured on a continuum (summarize with many summary measures)

Types of Data: Examples

Nominal – Qualitative Data
Measured in unordered categories
Examples: Race, Blood Type, Dead/Alive, Gender, On Dialysis/Not on Dialysis

Ordinal – Qualitative Data
Measured in ordered categories
Examples: Cancer Stages, Socio-economic Status (low, med, hi), Likert (unlikely, somewhat unlikely, neutral, likely, very likely)

Continuous – Quantitative Data
Measured on a continuum
Examples: Serum Creatinine, Height/Weight/BMI, Systolic Blood Pressure, Diastolic Blood Pressure, Others???

Continuous (Numerical)

Measures of Location

Mean
– Arithmetic average
– Sum of values / number of values
– Nice mathematical/statistical properties

Median (a.k.a. 50th percentile)
– Value where half the sample is above, half the sample is below
– Better measure for skewed data; robust to extreme values

Mode
– Most frequently occurring value in the sample
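The three measures of location can be compared on a small sample. The sketch below uses Python's standard `statistics` module; the data values are made up for illustration and do not come from the lecture.

```python
import statistics

# Hypothetical sample with one extreme value (illustration only).
values = [1, 2, 2, 3, 3, 3, 4, 20]

mean_value = statistics.mean(values)      # sum of values / number of values
median_value = statistics.median(values)  # 50th percentile
mode_value = statistics.mode(values)      # most frequently occurring value

# Make the extreme value ten times larger and recompute.
shifted = values[:-1] + [200]
shifted_mean = statistics.mean(shifted)
shifted_median = statistics.median(shifted)

print(mean_value, median_value, mode_value)
print(shifted_mean, shifted_median)
```

The mean moves with the extreme value while the median stays put, which is the sense in which the median is robust.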

Continuous (Numerical)

[Figure: Normal distribution]

Measures of Variability

• Range = (maximum - minimum)

• Interquartile range = (Q3 – Q1) always covers half the sample (75th - 25th percentile)

• Variance = average of the squares of the deviations of the observations from their mean

• Standard deviation = $\sqrt{\text{Variance}}$

$$\text{var} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
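The measures of variability can be computed directly from their definitions. This is a minimal sketch on made-up data; the quartiles use the `inclusive` method of `statistics.quantiles`, which is one of several conventions, so other software may report slightly different quartiles.

```python
import math
import statistics

# Hypothetical sample (illustration only).
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)
mean_value = sum(values) / n

# Range = maximum - minimum
value_range = max(values) - min(values)

# Variance with the n - 1 divisor, then its square root
variance = sum((x - mean_value) ** 2 for x in values) / (n - 1)
std_dev = math.sqrt(variance)

# Interquartile range = Q3 - Q1 (quartile convention is an assumption)
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(value_range, iqr, variance, std_dev)
```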

Continuous (Numerical): Normal Distribution

http://www.stattucino.com/berrie/dsl/index.html

Describing Data using Numerical Summaries

Descriptive statistics:

Explore data in order to describe their main features

Get an initial picture of data sample

Let’s Talk Data…

Categorical

Gender      N       %
Female      6163    61.6%
Male        3837    38.4%

Dialysis    N       %
No          8093    80.9%
Yes         1907    19.1%

[Bar charts: Gender (Female, Male) and Dialysis (No, Yes), percent on the vertical axis]

Categorical

Race                N       %
Black               1942    19.4%
Hispanic            723     7.2%
Other               1068    10.7%
White               6267    62.7%

Education           N       %
Elementary          1491    14.9%
High School Grad    2640    26.4%
College Grad        3246    32.5%
Post Graduate       2616    26.2%

[Bar charts: Education (Elementary, High School Grad, College Grad, Post Graduate) and Race/Ethnicity (Black, Hispanic, Other, White), percent on the vertical axis]


Continuous

BMI
Measure            Value
Mean               32.2
Std Dev            5.46
Median             31.8
Minimum            16.0
Maximum            50.7
25th Percentile    28.2
75th Percentile    35.9
Mode               29.0

N = 115

BMI
Measure            Value
Mean               32.0
Std Dev            5.34
Median             31.2
Minimum            21.8
Maximum            44.5
25th Percentile    28.5
75th Percentile    34.8
Mode               .

[Figure: histograms of three continuous variables; vertical axis: Fraction]

BMI: Mean 32.2, Std 5.4, Median 31.8
Second histogram: Mean 136.3, Std 17.1, Median 135
Third histogram: Mean 189.77, Std 148.9, Median 154.11

Shape of a distribution

• Symmetric
• Skewed to the right: mean greater than median (positively skewed)
• Skewed to the left: mean less than median (negatively skewed)

Mean: 136.3, Std: 17.1, Median: 135, Skewness: 0.38
Mean: 189.77, Std: 148.9, Median: 154.11, Skewness: 5.63
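The link between the sign of the skewness and the ordering of mean and median can be checked numerically. The sketch below uses the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second); this choice is an assumption, since statistical packages often report a bias-adjusted version and the lecture does not say which was used. The data are made up.

```python
import statistics

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no small-sample adjustment)."""
    n = len(values)
    mean_value = sum(values) / n
    m2 = sum((x - mean_value) ** 2 for x in values) / n
    m3 = sum((x - mean_value) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical samples (illustration only).
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
left_skewed = [-x for x in right_skewed]

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(name, statistics.mean(data), statistics.median(data), skewness(data))
```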

NORMAL DISTRIBUTION

Normal Distribution – Has Excellent Statistical Properties

Many Statistical techniques require normal distributions

If the data are not Normally distributed, consider alternative techniques appropriate for the data

Box (and Whisker) Plots

• A graph of the 5 number summary with suspected outliers plotted individually
• 5 number summary: Min, Q1, Median, Q3, Max
• A line somewhere inside the box marks the Median
• IQR = Q3 – Q1
• Cases more than 1.5*IQR below Q1 or above Q3 are plotted individually (possible outliers)
• Lines from the box extend to the smallest and largest values that are not more than 1.5*IQR from the box

[Figure: annotated box plot labelling the median, 25th percentile, 75th percentile, mean, 1.5 x IQR, and an outlier; example plots for distributions skewed to the right, skewed to the left, and symmetric]
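The quantities a box plot draws can be computed without plotting anything. The sketch below is one possible implementation of the 1.5*IQR rule; the function name `box_plot_summary`, the data, and the quartile convention (`inclusive` method) are assumptions for illustration, and plotting software may compute quartiles differently.

```python
import statistics

def box_plot_summary(values):
    """Return the quartiles, whisker ends and flagged points for a box plot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers reach the most extreme values that are still inside the fences.
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }

# Hypothetical sample with one suspiciously large value.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```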

Normal Probability Plot

• Plot that can help assess normality.

• Idea: plot the observed levels of the variable against the expected levels corresponding to a Normal distribution.

• If data lie in a reasonably straight diagonal line, then assumption of Normality is reasonable.
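The idea of plotting observed against expected values can be sketched as follows. The expected values are standard Normal quantiles at the plotting positions (i - 0.5)/n, which is one common convention and an assumption here; the correlation of the plotted points is used only as a rough numerical stand-in for "reasonably straight". All names and data are illustrative.

```python
import math
from statistics import NormalDist

def normal_plot_points(values):
    """Pair each ordered observation with its expected standard Normal quantile."""
    data = sorted(values)
    n = len(data)
    std_normal = NormalDist()
    expected = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, data))

def straightness(points):
    """Pearson correlation of the plotted points (1 means a perfect line)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Data built to follow a Normal shape exactly, and a hypothetical skewed sample.
quantiles = [z for z, _ in normal_plot_points(range(10))]
normal_like = [100 + 15 * z for z in quantiles]
skewed = [1, 1, 2, 2, 3, 4, 6, 10, 25, 80]

r_normal = straightness(normal_plot_points(normal_like))
r_skewed = straightness(normal_plot_points(skewed))
print(r_normal, r_skewed)
```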

Normal Probability Plots

[Figures: normal probability plots for BMI and Triglycerides]

Error Bar Plots

Circle denotes the mean and the bars denote the standard deviation (in this case).

Part II – Measures of Association

(plus a little more)

Measures of Association

• Continuous Variables
– Correlation
– Agreement (reliability)

• Categorical Variables
– Two-way layout (2×2 tables)
– “Risk” measures
– Agreement
– Others

Two Continuous Variables

Correlation
– General sense: the relationship between two variables (quantitative or qualitative)
– Narrow (statistical) sense: measure of interdependence between two continuous random variables

• The degree to which increases or decreases in Y occur with increases or decreases in X

• Values range between -1 (perfect discordance) and 1 (perfect concordance)

• A value of 0 indicates no linear association

Pearson Correlation

Data

Subject #    X     Y
1            x1    y1
2            x2    y2
.            .     .
.            .     .
.            .     .
n            xn    yn

Purpose - measures linear association between two continuous variables X and Y

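Pearson Correlation

The Pearson (product-moment) correlation coefficient can be calculated for 2 continuous variables in a sample (regardless of distribution) using the formula:

$$\hat{\rho} = r = r_{xy} = \frac{\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N}(X_i - \bar{X})^2 \, \sum_{i=1}^{N}(Y_i - \bar{Y})^2}}$$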
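The formula translates almost term by term into code. This is a minimal sketch; the function name `pearson_r` and the small data sets are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: an exact increasing line, an exact decreasing line,
# and a noisier increasing pattern.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
r_noisy = pearson_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(r_up, r_down, r_noisy)
```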

Correlation Figures

[Figure: five scatterplots of Y against X, panels A to E]
– No relationship (ρ = 0)
– Perfect positive relationship (ρ = 1)
– Perfect negative relationship (ρ = -1)
– Moderate positive relationship (ρ = 0.5)
– Strong negative relationship (ρ = -0.8)

Correlation Inference

• Easy “large sample” test for H0: ρ = 0

For n ≥ 25, compute

$$z = \frac{1}{2}\log_e\!\left(\frac{1+\hat{\rho}}{1-\hat{\rho}}\right)$$

which has a $N\!\left(0, \frac{1}{n-3}\right)$ distribution under H0

• This test assumes X,Y ~ NBiv(μX, μY, σX², σY², ρ)

Many times a tenuous assumption!
• Beware positive skewness & outliers
• Beware data not truly continuous
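The large-sample test can be sketched in a few lines. Since z has variance 1/(n-3) under H0, multiplying by the square root of (n-3) gives a standard Normal statistic. The function name and the inputs (r = 0.5, n = 28) are hypothetical, and the two-sided p-value is an assumption about how the test would be used.

```python
import math
from statistics import NormalDist

def fisher_z_test(r, n):
    """Large-sample test of H0: rho = 0 using Fisher's z transform."""
    z = 0.5 * math.log((1 + r) / (1 - r))
    statistic = z * math.sqrt(n - 3)          # ~ N(0, 1) under H0
    p_value = 2 * (1 - NormalDist().cdf(abs(statistic)))
    return z, statistic, p_value

# Hypothetical sample correlation and sample size (n >= 25 as the slide requires).
z, statistic, p_value = fisher_z_test(r=0.5, n=28)
print(z, statistic, p_value)
```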

Timeout: ASSUMPTIONS

• As with any mathematical or physical model, model assumptions are critical to making the correct inference

• Dealing with assumptions has led to development of:
– Nonparametric statistics: techniques that reduce or eliminate dependence on the underlying distribution of the data
– Robust statistics: techniques that are affected little by departures from assumptions

Correlation (resumed)

• A nonparametric version of the correlation coefficient: Spearman’s Rank Correlation

$$r_s = 1 - \frac{6\sum_{i}\left[R(X_i) - R(Y_i)\right]^2}{n(n^2 - 1)}$$

where R( ) is the rank of the variable

• Like ρ, rs:
– ranges from -1 to 1
– 0 no correlation, 1 perfect agreement
– only requires ordinal data

Correlation Example: SBP and DBP

SBP DBP R(SBP) R(DBP)

141.8 89.7 12 14

140.2 74.4 8.5 1

131.8 83.5 3 4

132.5 77.8 4 2

135.7 85.8 7 7

141.2 86.5 11 10

143.9 89.4 14 13

140.2 89.3 8.5 12

140.8 88.0 10 11

131.7 82.2 2 3

130.8 84.6 1 6

135.6 84.4 6 5

143.6 86.3 13 9

133.2 85.9 5 8

Correlation Example: SBP and DBP

[Figure: scatterplot of SBP (vertical axis, 125 to 145) against DBP (horizontal axis, 70 to 90)]

• All Data: ρ = 0.42; rs = 0.71

• Outlier deleted: ρ = 0.75; rs = 0.82
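The four coefficients quoted for this example can be recomputed from the 14 listed pairs. The sketch below assumes the deleted outlier is the pair (140.2, 74.4), the point with by far the lowest DBP; the slides do not name it. Tied values receive average ranks, as in the R(SBP) column, and the helper names are illustrative.

```python
import math

sbp = [141.8, 140.2, 131.8, 132.5, 135.7, 141.2, 143.9,
       140.2, 140.8, 131.7, 130.8, 135.6, 143.6, 133.2]
dbp = [89.7, 74.4, 83.5, 77.8, 85.8, 86.5, 89.4,
       89.3, 88.0, 82.2, 84.6, 84.4, 86.3, 85.9]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank from 1 (smallest); tied values share the average of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]

def spearman_rs(x, y):
    """Rank correlation via 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(average_ranks(x), average_ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Assumed outlier: the second pair, (140.2, 74.4).
outlier = 1
sbp_del = sbp[:outlier] + sbp[outlier + 1:]
dbp_del = dbp[:outlier] + dbp[outlier + 1:]

r_all, rs_all = pearson_r(sbp, dbp), spearman_rs(sbp, dbp)
r_del, rs_del = pearson_r(sbp_del, dbp_del), spearman_rs(sbp_del, dbp_del)
print(round(r_all, 2), round(rs_all, 2), round(r_del, 2), round(rs_del, 2))
```

Under that assumption the values agree with the slide to two decimals, and the single point changes the Pearson coefficient far more than the rank-based one.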

Questions –

1. Can we calculate a correlation coefficient between the incomes of a group of people and what city they live in?

Correlation Coefficient

No, we cannot, since city is a categorical variable. Correlation requires that both variables be quantitative.

Questions –

2. Does it change the correlation between height and weight if we measure height in inches rather than centimeters and weight in pounds rather than kilograms?

No. Because ρ (and r) uses the standardized values of the observations, ρ does not change when we change the units of measurement of x, y, or both. The correlation ρ itself has no unit of measure; it is just a number.

Question –

3. Does ρ = 0 mean there is no relationship between X and Y?

Not necessarily. Correlation only measures the strength of the linear relationship between two variables. Correlation does not describe nonlinear relationships between two variables, no matter how strong they are.
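A small made-up example makes the point concrete: below, y is determined exactly by x (y = x squared), yet the Pearson correlation is zero because the relationship is symmetric rather than linear. The helper is an illustrative implementation of the usual formula.

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a perfect but nonlinear (quadratic) relationship.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]

r_quadratic = pearson_r(x, y)
print(r_quadratic)
```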

[Figure: scatterplot of y against x]

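Correlation and Regression

[Figure: scatterplots with fitted lines for a moderate positive relationship (ρ = 0.5) and a strong negative relationship (ρ = -0.8)]

Y = α + βX

$$\hat{\beta} = \hat{\rho}\,\frac{\hat{\sigma}_Y}{\hat{\sigma}_X} = \hat{\rho}\,\frac{\sqrt{\sum_i (Y_i - \bar{Y})^2 / (n-1)}}{\sqrt{\sum_i (X_i - \bar{X})^2 / (n-1)}}$$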

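Correlation and Regression

[Figure: scatterplot of SBP (vertical axis, 125 to 145) against DBP (horizontal axis, 70 to 90) with fitted lines]

SBP = 40.1 + 1.12×DBP

DBP = 16.3 + 0.51×SBP

SBP and DBP example (continued)

σSBP = 4.9 (mmHg)

σDBP = 3.3 (mmHg)

ρ = 0.75

Slope for SBP on DBP: $0.75 \times \frac{4.9}{3.3}$

Slope for DBP on SBP: $0.75 \times \frac{3.3}{4.9}$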
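The two slopes follow from the rounded summary values alone. The sketch below plugs ρ = 0.75, σSBP = 4.9 and σDBP = 3.3 into the slope formula; because these inputs are rounded, the results can differ in the last digit from coefficients fitted to the raw data.

```python
# Summary values quoted for the SBP/DBP example (rounded).
rho = 0.75
sd_sbp = 4.9  # mmHg
sd_dbp = 3.3  # mmHg

# Slope of the outcome on the predictor = rho * SD(outcome) / SD(predictor)
slope_sbp_on_dbp = rho * sd_sbp / sd_dbp
slope_dbp_on_sbp = rho * sd_dbp / sd_sbp

# The product of the two slopes equals rho squared.
slope_product = slope_sbp_on_dbp * slope_dbp_on_sbp
print(slope_sbp_on_dbp, slope_dbp_on_sbp, slope_product)
```

Swapping outcome and predictor does not invert the slope; the two regression lines coincide only when ρ is 1 or -1.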

Correlation and Covariance

• Suppose two random variables, X and Y:
E(X) = μX, V(X) = σX²; E(Y) = μY, V(Y) = σY²; and Corr(X,Y) = ρ

• Define Cov(X,Y) = E[(X-μX)(Y-μY)]

Note: Cov(X,X) = E[(X-μX)(X-μX)] = E(X-μX)² = σX²

• Population correlation (ρ) is defined as:

$$\rho = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$$

• Thus Cov(X,Y) = ρσXσY

Correlation and Covariance

What’s the big deal about covariance?
Use it to find the variance of functions of random variables, e.g.:

$$V(X+Y) = \sigma_X^2 + \sigma_Y^2 + 2\,\text{Cov}(X,Y)$$

$$V(X-Y) = \sigma_X^2 + \sigma_Y^2 - 2\,\text{Cov}(X,Y)$$

In general:

$$V(aX+bY) = a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\,\text{Cov}(X,Y)$$
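The general formula also holds exactly for sample variances and covariances, which gives an easy numerical check. The data, the constants a and b, and the helper name `sample_cov` are made up for illustration.

```python
def sample_cov(x, y):
    """Sample covariance with the n - 1 divisor; sample_cov(x, x) is the variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# Hypothetical paired data and coefficients.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [3.0, 1.0, 4.0, 1.0, 5.0]
a, b = 2.0, -3.0

combined = [a * xi + b * yi for xi, yi in zip(x, y)]

left_side = sample_cov(combined, combined)            # V(aX + bY)
right_side = (a ** 2 * sample_cov(x, x)
              + b ** 2 * sample_cov(y, y)
              + 2 * a * b * sample_cov(x, y))
print(left_side, right_side)
```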

Correlation as Agreement

SBP1 SBP2

141.8 139.7

140.2 144.4

131.8 133.5

132.5 127.8

135.7 135.8

141.2 146.5

143.9 139.4

140.2 139.3

140.8 138.0

131.7 132.2

130.8 134.6

135.6 134.4

143.6 146.3

133.2 135.9

Suppose two nurses are measuring SBP in the same patients and each nurse measures SBP 3 times in each patient.

Correlation as Agreement

• Could use Pearson correlation

• Another measure, intraclass correlation
– Can separate the variance into two sources: between-subject and within-subject
– The intraclass correlation is the ratio of the between-subject variance to the total (i.e., within + between) variance
– By definition, intraclass correlation ranges from 0 to 1
– Best measure of the “individual” touch

• In SBP example:

ρ(Pearson) = 0.809 ρ(Intraclass) = 0.814

Things to Remember About Correlation

• 5 warnings (adapted from Huck):

1. Does not speak to cause-and-effect

2. Beware outliers

3. Assumes linear relationship

4. Correlation vs. independence: zero correlation implies independence for the Normal distribution only

5. Strength of relationship WRT trend

Categorical Outcomes: Two-way Tables

• Prospective Design: Relative Risk (RR)

$$RR = \frac{P(\text{Disease in Exposed Group})}{P(\text{Disease in Unexposed Group})} = \frac{P(D|E)}{P(D|\bar{E})}$$

• Retrospective Design: Odds Ratio (OR)

$$OR = \frac{P(\text{Exposure in Cases}) \,/\, [1 - P(\text{Exposure in Cases})]}{P(\text{Exposure in Controls}) \,/\, [1 - P(\text{Exposure in Controls})]} = \frac{P(E|D) \,/\, P(\bar{E}|D)}{P(E|\bar{D}) \,/\, P(\bar{E}|\bar{D})}$$

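Two-way Tables

                Disease
Exposure     Yes     No      Total
Yes          a       b       a+b
No           c       d       c+d
Total        a+c     b+d     n=a+b+c+d

Prospective:
$P(D|E) = a/(a+b)$
$P(D|\bar{E}) = c/(c+d)$

Retrospective:
$P(E|D) = a/(a+c)$
$P(E|\bar{D}) = b/(b+d)$

$$RR = \frac{a/(a+b)}{c/(c+d)} = \frac{ac+ad}{ac+bc}$$

$$OR_E = \frac{\frac{a}{a+c} \cdot \frac{d}{b+d}}{\frac{c}{a+c} \cdot \frac{b}{b+d}} = \frac{ad}{bc}$$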

Two-way Tables

• Prospective design and relative risk (RR) are optimal

• Retrospective designs and odds ratio (OR) are easiest (cheapest)

• Can compute OR for prospective design

$$OR_D = \frac{\frac{a}{a+b} \cdot \frac{d}{c+d}}{\frac{b}{a+b} \cdot \frac{c}{c+d}} = \frac{ad}{bc}$$

Two-way Table

• Why we like the odds ratio…

The exposure odds ratio is equivalent to the disease odds ratio!

• Regardless of study design (i.e., which margin is fixed) the estimate of the OR is the same

$$OR_D = \frac{\frac{a}{a+b} \cdot \frac{d}{c+d}}{\frac{b}{a+b} \cdot \frac{c}{c+d}} = \frac{ad}{bc} = \frac{\frac{a}{a+c} \cdot \frac{d}{b+d}}{\frac{b}{b+d} \cdot \frac{c}{a+c}} = OR_E$$

Two-way Tables

              Cancer
Smoke      Yes     No      Total
Yes        35      25      60
No         5       35      40
Total      40      60      100

$$RR = \frac{35 \times 5 + 35 \times 35}{35 \times 5 + 25 \times 5} = 4.7$$

$$OR = \frac{35 \times 35}{25 \times 5} = 9.8$$

Two-way Table

Why we like the odds ratio – Part II

• For retrospective design, if…
– Cases are representative of the population of all cases
– Controls are representative of the population of all controls
– The disease is “rare” (i.e., prevalence <20%)

Then OR ≈ RR

Two-way Tables

              Cancer
Smoke      Yes     No      Total
Yes        75      325     400
No         25      575     600
Total      100     900     1000

$$RR = \frac{75/400}{25/600} = 4.5$$

$$OR = \frac{75 \times 575}{325 \times 25} = 5.3$$
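Both smoking and cancer tables can be run through the same two formulas. The sketch below uses the cell layout a, b, c, d from the general two-way table (rows are exposure, columns are disease); the function names are illustrative.

```python
def relative_risk(a, b, c, d):
    """RR = P(D|E) / P(D|not E) = [a/(a+b)] / [c/(c+d)]."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """OR = ad / bc."""
    return (a * d) / (b * c)

# Cells (a, b, c, d) read from the two tables above.
first_table = (35, 25, 5, 35)      # 40 cancers among 100 subjects
second_table = (75, 325, 25, 575)  # 100 cancers among 1000 subjects

for cells in (first_table, second_table):
    print(round(relative_risk(*cells), 2), round(odds_ratio(*cells), 2))
```

In the second table the disease is less common (10% rather than 40%) and the odds ratio sits much closer to the relative risk, in line with the rare-disease condition.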

Other Measures From Clinical Trials

                  Outcome
Treatment      Yes     No      Total
Experimental   15      135     150
Control        100     150     250
Total          115     285     400

P(O|E) = 15/150 = 0.1
P(O|C) = 100/250 = 0.4

RR = P(O|E)/P(O|C) = 0.25

• Absolute Risk Reduction (ARR) = P(O|C) - P(O|E) = 0.3
• Relative Risk Reduction (RRR) = 1 – RR = 0.75
• Number Needed to Treat (NNT) = 1/ARR = 3.33 (number needed to treat in the population to prevent 1 outcome event)
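The trial measures above are simple arithmetic on the two outcome proportions, as this short sketch shows; the variable names are illustrative.

```python
# Counts from the clinical trial table above.
events_experimental, total_experimental = 15, 150
events_control, total_control = 100, 250

risk_experimental = events_experimental / total_experimental  # P(O|E)
risk_control = events_control / total_control                 # P(O|C)

relative_risk = risk_experimental / risk_control
absolute_risk_reduction = risk_control - risk_experimental
relative_risk_reduction = 1 - relative_risk
number_needed_to_treat = 1 / absolute_risk_reduction

print(relative_risk, absolute_risk_reduction,
      relative_risk_reduction, number_needed_to_treat)
```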

Things to Remember About Measures of Association

1. Beware: some sources use “odds ratio” and “relative risk” interchangeably
– In most settings, OR overestimates RR

2. Be on guard when considering ARR, RRR, and NNT
– Almost never see a SE or CI estimate
– Should be based on large, well planned, prospective studies

Categorical Measures of Agreement

• The “kappa” coefficient or κ
• Example: two physicians diagnosing a disease

Here pa, pb, pc, pd are the proportions of subjects, not the number of subjects.

                  DOCTOR B
DOCTOR A       Disease    No Disease
Disease        pa         pb            pA
No Disease     pc         pd            qA
               pB         qB            1

$$\hat{\kappa} = \frac{2(p_a p_d - p_b p_c)}{p_A q_B + p_B q_A}$$

Kappa Example

                     Psychiatrist B
Psychiatrist A    Neurosis    Normal
Neurosis          0.04        0.06      0.10
Normal            0.01        0.89      0.90
                  0.05        0.95      1.00

$$\hat{\kappa} = \frac{2(0.04 \times 0.89 - 0.06 \times 0.01)}{0.10 \times 0.95 + 0.05 \times 0.90} = 0.50$$

• Kappa is a categorical analog of the intraclass correlation
• Kappa can be computed for any “square” (k×k) tables
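The psychiatrist example can be recomputed two ways: with the 2×2 formula from the slide, and with the equivalent "observed minus chance-expected agreement" form, (po - pe)/(1 - pe). The second form is not written on the slides and is included here as a cross-check; the function names are illustrative.

```python
def kappa_2x2(pa, pb, pc, pd):
    """Kappa from 2x2 cell proportions using 2(pa*pd - pb*pc) / (pA*qB + pB*qA)."""
    p_rater_a, q_rater_a = pa + pb, pc + pd   # row margins: pA, qA
    p_rater_b, q_rater_b = pa + pc, pb + pd   # column margins: pB, qB
    return 2 * (pa * pd - pb * pc) / (p_rater_a * q_rater_b + p_rater_b * q_rater_a)

def kappa_from_agreement(pa, pb, pc, pd):
    """Kappa as (observed agreement - chance agreement) / (1 - chance agreement)."""
    observed = pa + pd
    expected = (pa + pb) * (pa + pc) + (pc + pd) * (pb + pd)
    return (observed - expected) / (1 - expected)

# Cell proportions from the psychiatrist table above.
cells = (0.04, 0.06, 0.01, 0.89)
kappa_slide = kappa_2x2(*cells)
kappa_check = kappa_from_agreement(*cells)
print(kappa_slide, kappa_check)
```

The raters agree on 93% of subjects, but most of that agreement is expected by chance when "Normal" is so common, which is why kappa is only about 0.50.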
