
Lecture 3 – Data Summary Measures and Graphical Display of Results

Page 1: Lecture 3 – Data Summary Measures and Graphical Display of Results

Lecture 3 – Data Summary Measures and Graphical Display of Results

Univariate Data – Analysis of one variable at a time

Page 2: Lecture 3 – Data Summary Measures and Graphical Display of Results

Why Think About/Explore Data?

• Done to accomplish:
  – Checking for data entry errors
  – Describing demographic and study characteristics
  – Examining distributions of outcomes
    • Central tendency
    • Variability
  – Checking for outliers
  – Checking assumptions for subsequent analyses
  – Giving a picture of your sample

Page 3: Lecture 3 – Data Summary Measures and Graphical Display of Results

• To understand which statistics are appropriate, first establish the measurement level of the outcome(s) and predictor(s).

Dependent variable = outcome
Independent variable = predictor

Page 4: Lecture 3 – Data Summary Measures and Graphical Display of Results

Types of Data

Nominal – Qualitative Data
  Measured in unordered categories
  (summarize with %’s)

Ordinal – Qualitative Data
  Measured in ordered categories
  (summarize with %’s)

Continuous – Quantitative Data
  Measured on a continuum
  (summarize with many summary measures)

Page 5: Lecture 3 – Data Summary Measures and Graphical Display of Results

Types of Data

Nominal – Qualitative Data
  Measured in unordered categories
  Examples: Race, Blood Type, Dead/Alive, Gender, On Dialysis/Not on Dialysis

Ordinal – Qualitative Data
  Measured in ordered categories
  Examples: Cancer Stages, Socio-economic Status (low, med, hi), Likert (unlikely, somewhat unlikely, neutral, likely, very likely)

Continuous – Quantitative Data
  Measured on a continuum
  Examples: Serum Creatinine, Height/Weight/BMI, Systolic Blood Pressure, Diastolic Blood Pressure, Others???

Page 6: Lecture 3 – Data Summary Measures and Graphical Display of Results

Continuous (Numerical): Measures of Location

Mean
  – Arithmetic average
  – Sum of values / number of values
  – Nice mathematical/statistical properties

Median (a.k.a. 50th percentile)
  – Value where half the sample is above and half the sample is below
  – Better measure for skewed data; robust to extreme values

Mode
  – Most frequently occurring value in the sample
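The three measures of location can be computed with Python's standard library. This is a minimal sketch; the `values` list is hypothetical (not data from the lecture), with one deliberately extreme value to show how the mean is pulled toward it while the median is not.

```python
import statistics

# Hypothetical BMI-like values; the last one is deliberately extreme.
values = [24.1, 27.5, 29.0, 29.0, 31.8, 33.2, 50.7]

mean = statistics.mean(values)      # sum of values / number of values
median = statistics.median(values)  # middle value of the sorted sample
mode = statistics.mode(values)      # most frequently occurring value

print(f"mean={mean:.2f} median={median:.2f} mode={mode:.1f}")
```

In this sample the extreme value raises the mean above the median, which is why the median is described as robust.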

Page 7: Lecture 3 – Data Summary Measures and Graphical Display of Results

Continuous (Numerical): Normal Distribution

Page 8: Lecture 3 – Data Summary Measures and Graphical Display of Results

Continuous (Numerical): Measures of Variability

• Range = (maximum - minimum)

• Interquartile range = (Q3 – Q1) = (75th - 25th percentile); always covers the middle half of the sample

• Variance = average of the squares of the deviations of the observations from their mean

$$\mathrm{var} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

• Standard deviation = $\sqrt{\text{Variance}}$
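The measures of variability can be sketched in a few lines of Python. The sample below is hypothetical, and the quartiles use the default ("exclusive") method of `statistics.quantiles`; other software may use slightly different quartile conventions.

```python
import statistics

values = [24.1, 27.5, 29.0, 29.0, 31.8, 33.2, 50.7]  # hypothetical sample

value_range = max(values) - min(values)

# Quartiles: Q1 (25th percentile), Q2 (median), Q3 (75th percentile).
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Sample variance written out from the definition (divisor n - 1).
n = len(values)
x_bar = sum(values) / n
manual_variance = sum((x - x_bar) ** 2 for x in values) / (n - 1)

variance = statistics.variance(values)  # same quantity from the library
std_dev = statistics.stdev(values)      # square root of the variance

print(f"range={value_range:.1f} IQR={iqr:.1f} var={variance:.2f} sd={std_dev:.2f}")
```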

Page 9: Lecture 3 – Data Summary Measures and Graphical Display of Results

Continuous (Numerical): Normal Distribution

http://www.stattucino.com/berrie/dsl/index.html

Page 10: Lecture 3 – Data Summary Measures and Graphical Display of Results

Describing Data using Numerical Summaries

Descriptive statistics:

Explore data in order to describe their main features

Get an initial picture of data sample

Page 11: Lecture 3 – Data Summary Measures and Graphical Display of Results

Let’s Talk Data…

Page 12: Lecture 3 – Data Summary Measures and Graphical Display of Results

Categorical

Gender      N       %
Female      6163    61.6%
Male        3837    38.4%

Dialysis    N       %
No          8093    80.9%
Yes         1907    19.1%

[Bar charts: Gender (Female, Male); Dialysis (No, Yes); vertical axes in percent]

Page 13: Lecture 3 – Data Summary Measures and Graphical Display of Results

Categorical

Race        N       %
Black       1942    19.4%
Hispanic    723     7.2%
Other       1068    10.7%
White       6267    62.7%

Education           N       %
Elementary          1491    14.9%
High School Grad    2640    26.4%
College Grad        3246    32.5%
Post Graduate       2616    26.2%

[Bar charts: Education (Elementary, High School Grad, College Grad, Post Graduate); Race/Ethnicity (Black, Hispanic, Other, White); vertical axes in percent]


Page 15: Lecture 3 – Data Summary Measures and Graphical Display of Results

Continuous

Page 16: Lecture 3 – Data Summary Measures and Graphical Display of Results

BMI

Measure            Value
Mean               32.2
Std Dev            5.46
Median             31.8
Minimum            16.0
Maximum            50.7
25th Percentile    28.2
75th Percentile    35.9
Mode               29.0

Page 17: Lecture 3 – Data Summary Measures and Graphical Display of Results

BMI (N = 115)

Measure            Value
Mean               32.0
Std Dev            5.34
Median             31.2
Minimum            21.8
Maximum            44.5
25th Percentile    28.5
75th Percentile    34.8
Mode               .

Page 18: Lecture 3 – Data Summary Measures and Graphical Display of Results

BMI

Mean: 32.2
Std: 5.4
Median: 31.8

Page 19: Lecture 3 – Data Summary Measures and Graphical Display of Results

Mean: 136.3

Std: 17.1

Median: 135

Page 20: Lecture 3 – Data Summary Measures and Graphical Display of Results

Mean: 189.77

Std: 148.9

Median: 154.11

Page 21: Lecture 3 – Data Summary Measures and Graphical Display of Results

Shape of a distribution

[Figure: three histograms (vertical axis: Fraction) showing a symmetric distribution, a distribution skewed to the left, and a distribution skewed to the right]

• Symmetric
• Skewed to the right (positively skewed): mean greater than median
• Skewed to the left (negatively skewed): mean less than median
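The link between skewness and the mean/median ordering can be checked numerically. The sketch below uses the moment coefficient of skewness, which is one common definition; the software behind the slides may use a bias-adjusted variant, and the three samples are hypothetical.

```python
import statistics

def skewness(xs):
    """Moment coefficient of skewness: m3 / m2**1.5 (divisor n moments)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 20]        # long tail to the right
left_skewed = [-x for x in right_skewed]     # mirror image: long tail to the left

for name, xs in [("symmetric", symmetric),
                 ("right", right_skewed),
                 ("left", left_skewed)]:
    print(name, round(skewness(xs), 2),
          "mean:", round(statistics.mean(xs), 2),
          "median:", statistics.median(xs))
```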

Page 22: Lecture 3 – Data Summary Measures and Graphical Display of Results

Mean: 136.3

Std: 17.1

Median: 135

Skewness: 0.38

Page 23: Lecture 3 – Data Summary Measures and Graphical Display of Results

Mean: 189.77

Std: 148.9

Median: 154.11

Skewness: 5.63

Page 24: Lecture 3 – Data Summary Measures and Graphical Display of Results

NORMAL DISTRIBUTION

Normal Distribution – Has Excellent Statistical Properties

Many Statistical techniques require normal distributions

If the data do not follow a Normal distribution, consider alternative techniques appropriate for the data

Page 25: Lecture 3 – Data Summary Measures and Graphical Display of Results

Box (and Whisker) Plots

• A graph of the 5 number summary with suspected outliers plotted individually
• 5 number summary: Min, Q1, Median, Q3, Max
• A line somewhere inside the box marks the Median
• IQR = Q3 – Q1
• Cases more than 1.5*IQR below Q1 or above Q3 are plotted individually (possible outliers)
• Lines from the box extend to the smallest and largest values that are not more than 1.5*IQR beyond the box
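The quantities a box plot draws can be computed directly. This is a sketch with a hypothetical sample; the function name `box_plot_summary` is illustrative, and the quartiles use the default method of `statistics.quantiles`, which may differ slightly from a given plotting package.

```python
import statistics

def box_plot_summary(xs):
    """Return the pieces a box-and-whisker plot draws."""
    xs = sorted(xs)
    q1, median, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in xs if lower_fence <= x <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),    # smallest value within the fences
        "whisker_high": max(inside),   # largest value within the fences
        "outliers": [x for x in xs if x < lower_fence or x > upper_fence],
    }

summary = box_plot_summary([24.1, 27.5, 29.0, 29.0, 31.8, 33.2, 50.7])
print(summary)
```

Here the value 50.7 lies more than 1.5*IQR above Q3, so it would be plotted individually and the upper whisker would stop at 33.2.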

Page 26: Lecture 3 – Data Summary Measures and Graphical Display of Results

[Figure: annotated box plot showing the median, 25th percentile, 75th percentile, mean, whisker limit at 1.5 x IQR, and an outlier]

Page 27: Lecture 3 – Data Summary Measures and Graphical Display of Results
Page 28: Lecture 3 – Data Summary Measures and Graphical Display of Results

[Figure: three box plots, one each for a distribution that is skewed to the right, symmetric, and skewed to the left]

Page 29: Lecture 3 – Data Summary Measures and Graphical Display of Results

Normal Probability Plot

• Plot that can help assess normality.

• Idea: plot the observed levels of the variable against the expected levels corresponding to a Normal distribution.

• If data lie in a reasonably straight diagonal line, then assumption of Normality is reasonable.
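The idea behind the normal probability plot can be sketched without a plotting library: pair each sorted observation with the quantile expected under a Normal distribution, then see how close the pairs are to a straight line. The plotting position `(i - 0.5) / n` and the use of a correlation as a "straightness" score are assumptions made for this illustration, and both samples are constructed, not lecture data.

```python
import math
from statistics import NormalDist

def normal_scores(n):
    """Expected standard-normal quantiles for a sample of size n."""
    std_normal = NormalDist()
    # Plotting position (i - 0.5) / n is an assumed convention.
    return [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

def straightness(observed):
    """Correlation between sorted data and normal scores (1.0 = straight line)."""
    ys = sorted(observed)
    xs = normal_scores(len(ys))
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical samples built from known quantiles so the example is deterministic.
normal_like = [32.0 + 5.0 * z for z in normal_scores(20)]
right_skewed = [math.exp(z) for z in normal_scores(20)]

print(round(straightness(normal_like), 3))
print(round(straightness(right_skewed), 3))
```

The Normal-shaped sample falls exactly on a line, while the right-skewed sample bends away from it and scores noticeably lower.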

Page 30: Lecture 3 – Data Summary Measures and Graphical Display of Results

Normal Probability Plots

[Figure: normal probability plots for BMI and Triglycerides]

Page 31: Lecture 3 – Data Summary Measures and Graphical Display of Results

Error Bar Plots

Circle denotes the mean and the bars denote the standard deviation (in this case).

Page 32: Lecture 3 – Data Summary Measures and Graphical Display of Results

Part II – Measures of Association

(plus a little more)

Page 33: Lecture 3 – Data Summary Measures and Graphical Display of Results

Measures of Association

• Continuous Variables
  – Correlation
  – Agreement (reliability)

• Categorical Variables
  – Two-way layout (2×2 tables)
  – “Risk” measures
  – Agreement
  – Others

Page 34: Lecture 3 – Data Summary Measures and Graphical Display of Results

Two Continuous Variables

Correlation
  – General sense: the relationship between two variables (quantitative or qualitative)
  – Narrow (statistical) sense: measure of interdependence between two continuous random variables
    • The degree to which increases or decreases in Y occur with increases or decreases in X
    • Values range between -1 (perfect discordance) and 1 (perfect concordance)
    • A value of 0 indicates no linear association

Page 35: Lecture 3 – Data Summary Measures and Graphical Display of Results

Pearson Correlation

Data

Subject #   X     Y
1           x1    y1
2           x2    y2
...         ...   ...
n           xn    yn

Purpose: measures linear association between two continuous variables X and Y

Page 36: Lecture 3 – Data Summary Measures and Graphical Display of Results

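Pearson Correlation

The Pearson (product-moment) correlation coefficient can be calculated for 2 continuous variables in a sample (regardless of distribution) using the formula:

$$\hat{\rho} = r = r_{xy} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2 \, \sum_{i=1}^{N} (Y_i - \bar{Y})^2}}$$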
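The product-moment formula translates almost term by term into code. The function name `pearson_r` and the small data sets are illustrative assumptions, chosen so the perfect positive and perfect negative cases are easy to recognize.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of the same length (at least 2)")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # numerator
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(pearson_r(x, [2.0, 4.0, 6.0, 8.0, 10.0]))   # perfect positive line
print(pearson_r(x, [10.0, 8.0, 6.0, 4.0, 2.0]))   # perfect negative line
```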

Page 37: Lecture 3 – Data Summary Measures and Graphical Display of Results

Correlation Figures

[Figure: five scatter plots of Y against X]
A. No relationship (ρ = 0)
B. Perfect positive relationship (ρ = 1)
C. Perfect negative relationship (ρ = -1)
D. Moderate positive relationship (ρ = 0.5)
E. Strong negative relationship (ρ = -0.8)

Page 38: Lecture 3 – Data Summary Measures and Graphical Display of Results

Correlation Inference

• Easy “large sample” test for H0: ρ = 0

  For n ≥ 25, compute

  $$z = \frac{1}{2}\log_e\!\left(\frac{1+\hat{\rho}}{1-\hat{\rho}}\right)$$

  which has a N(0, 1/(n-3)) distribution under H0

• This test assumes X,Y ~ NBiv(μX, μY, σX², σY², ρ)

  Many times a tenuous assumption!
  • Beware positive skewness & outliers
  • Beware data not truly continuous
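The large-sample test can be sketched as follows. The function name `fisher_z_test` and the inputs (`r = 0.5`, `n = 28`) are hypothetical; the two-sided p-value is one reasonable way to use the N(0, 1/(n-3)) reference distribution.

```python
import math
from statistics import NormalDist

def fisher_z_test(r, n):
    """Large-sample test of H0: rho = 0 using the log transform of r."""
    z = 0.5 * math.log((1 + r) / (1 - r))   # transformed correlation
    se = 1 / math.sqrt(n - 3)               # standard deviation under H0
    statistic = z / se                      # approximately N(0, 1) under H0
    p_value = 2 * (1 - NormalDist().cdf(abs(statistic)))  # two-sided
    return z, statistic, p_value

z, statistic, p_value = fisher_z_test(r=0.5, n=28)
print(f"z={z:.4f} statistic={statistic:.3f} p={p_value:.4f}")
```

The result is only as trustworthy as the bivariate Normal assumption behind it.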

Page 39: Lecture 3 – Data Summary Measures and Graphical Display of Results

Timeout: ASSUMPTIONS

• As with any mathematical or physical model, model assumptions are critical to making the correct inference
• Dealing with assumptions has led to development of:
  – Nonparametric statistics: techniques that reduce or eliminate dependence on the underlying distribution of the data
  – Robust statistics: techniques that are affected little by departures from assumptions

Page 40: Lecture 3 – Data Summary Measures and Graphical Display of Results

Correlation (resumed)

• A nonparametric version of the correlation coefficient: Spearman’s Rank Correlation

$$r_s = 1 - \frac{6\sum_{i} [R(X_i) - R(Y_i)]^2}{n(n^2 - 1)}$$

where R( ) is the rank of the variable

• Like ρ, r_s:
  – ranges from -1 to 1
  – 0 no correlation, 1 perfect agreement
  – only requires ordinal data

Page 41: Lecture 3 – Data Summary Measures and Graphical Display of Results

Correlation Example: SBP and DBP

SBP      DBP     R(SBP)   R(DBP)
141.8    89.7    12       14
140.2    74.4    8.5      1
131.8    83.5    3        4
132.5    77.8    4        2
135.7    85.8    7        7
141.2    86.5    11       10
143.9    89.4    14       13
140.2    89.3    8.5      12
140.8    88.0    10       11
131.7    82.2    2        3
130.8    84.6    1        6
135.6    84.4    6        5
143.6    86.3    13       9
133.2    85.9    5        8
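The ranks in this table and the rank correlation can be reproduced from the raw SBP and DBP values. This is a sketch: the helper names are illustrative, tied values receive the average of their ranks (as the 8.5 entries in the table suggest), and the difference-of-ranks formula is applied as written even though ties make it slightly approximate.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rs(xs, ys):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

sbp = [141.8, 140.2, 131.8, 132.5, 135.7, 141.2, 143.9,
       140.2, 140.8, 131.7, 130.8, 135.6, 143.6, 133.2]
dbp = [89.7, 74.4, 83.5, 77.8, 85.8, 86.5, 89.4,
       89.3, 88.0, 82.2, 84.6, 84.4, 86.3, 85.9]

print(average_ranks(sbp))
print(round(spearman_rs(sbp, dbp), 2))  # about 0.71
```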

Page 42: Lecture 3 – Data Summary Measures and Graphical Display of Results

Correlation Example: SBP and DBP

[Figure: scatter plot of SBP (vertical axis, 125 to 145) against DBP (horizontal axis, 70 to 90)]

• All Data: ρ = 0.42; rs = 0.71

• Outlier deleted: ρ = 0.75; rs = 0.82

Page 43: Lecture 3 – Data Summary Measures and Graphical Display of Results

Correlation Coefficient: Questions

1. Can we calculate a correlation coefficient between the incomes of a group of people and what city they live in?

No, we cannot, since city is a categorical variable. Correlation requires that both variables be quantitative.

Page 44: Lecture 3 – Data Summary Measures and Graphical Display of Results

Correlation Coefficient: Questions

2. Does it change the correlation between height and weight if we measure height in inches rather than centimeters and weight in pounds rather than kilograms?

No. Because ρ (and r) uses the standardized values of the observations, ρ does not change when we change the units of measurement of x, y, or both. The correlation ρ itself has no unit of measure; it is just a number.

Page 45: Lecture 3 – Data Summary Measures and Graphical Display of Results

Correlation Coefficient: Questions

3. Does ρ = 0 mean there is no relationship between X and Y?

Not necessarily. Correlation only measures the strength of the linear relationship between two variables. Correlation does not describe nonlinear relationships between two variables, no matter how strong they are.

[Figure: scatter plot of y against x]
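A small constructed example makes the point concrete: below, y is completely determined by x (y = x²), yet the Pearson correlation is zero because the relationship is not linear. The data and the helper `pearson_r` are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]   # perfect, but nonlinear, dependence on x

r = pearson_r(x, y)
print(r)
```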

Page 46: Lecture 3 – Data Summary Measures and Graphical Display of Results

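Correlation and Regression

[Figure: two scatter plots of Y against X with fitted lines: moderate positive relationship (ρ = 0.5) and strong negative relationship (ρ = -0.8)]

Y = α + βX

$$\hat{\beta} = \hat{\rho}\,\frac{\sigma_Y}{\sigma_X} = \hat{\rho}\,\sqrt{\frac{\sum_i (Y_i - \bar{Y})^2 / (n-1)}{\sum_i (X_i - \bar{X})^2 / (n-1)}}$$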

Page 47: Lecture 3 – Data Summary Measures and Graphical Display of Results

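Correlation and Regression

SBP and DBP example (continued)

[Figure: scatter plot of SBP (vertical axis, 125 to 145) against DBP (horizontal axis, 70 to 90) with both regression lines]

σSBP = 4.9 (mmHg)
σDBP = 3.3 (mmHg)
ρ = 0.75

SBP = 40.1 + 1.12×DBP, with slope = 0.75 × 4.9/3.3
DBP = 16.3 + 0.51×SBP, with slope = 0.75 × 3.3/4.9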
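The two slopes follow from the correlation and the two standard deviations. The sketch below uses the rounded values printed on the slide, so the first slope comes out at roughly 1.11 rather than the 1.12 shown; that small gap is presumably rounding in the inputs, which is an assumption on my part.

```python
rho = 0.75      # correlation with the outlier deleted, from the slide
sd_sbp = 4.9    # mmHg
sd_dbp = 3.3    # mmHg

# Regressing SBP on DBP: slope = rho * sd(Y) / sd(X) with Y = SBP, X = DBP.
slope_sbp_on_dbp = rho * sd_sbp / sd_dbp

# Regressing DBP on SBP swaps the roles of the two standard deviations.
slope_dbp_on_sbp = rho * sd_dbp / sd_sbp

print(round(slope_sbp_on_dbp, 2), round(slope_dbp_on_sbp, 2))

# The product of the two slopes equals rho squared.
print(round(slope_sbp_on_dbp * slope_dbp_on_sbp, 4))
```

Note that the two regression lines are not inverses of each other: their slopes multiply to ρ², not to 1.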

Page 48: Lecture 3 – Data Summary Measures and Graphical Display of Results

Correlation and Covariance

• Suppose two random variables, X and Y:
  E(X) = μX, V(X) = σX²; E(Y) = μY, V(Y) = σY²; and Corr(X,Y) = ρ

• Define Cov(X,Y) = E[(X-μX)(Y-μY)]
  Note: Cov(X,X) = E[(X-μX)(X-μX)] = E(X-μX)² = σX²

• Population correlation (ρ) is defined as:

$$\rho = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}$$

• Thus Cov(X,Y) = ρσXσY

Page 49: Lecture 3 – Data Summary Measures and Graphical Display of Results

Correlation and Covariance

What’s the big deal about covariance?

Use it to find the variance of functions of random variables, e.g.:

$$V(X+Y) = \sigma_X^2 + \sigma_Y^2 + 2\,\mathrm{Cov}(X,Y)$$

$$V(X-Y) = \sigma_X^2 + \sigma_Y^2 - 2\,\mathrm{Cov}(X,Y)$$

In general:

$$V(aX+bY) = a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\,\mathrm{Cov}(X,Y)$$
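The general identity also holds exactly for sample variances and covariances (with the same n - 1 divisor throughout), which makes it easy to check numerically. The data, the constants `a` and `b`, and the helper `covariance` are hypothetical.

```python
import statistics

def covariance(xs, ys):
    """Sample covariance with divisor n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

x = [1.0, 4.0, 2.0, 8.0, 5.0]
y = [3.0, 1.0, 7.0, 6.0, 2.0]
a, b = 2.0, -3.0

# Left-hand side: variance of the combined variable aX + bY.
lhs = statistics.variance([a * xi + b * yi for xi, yi in zip(x, y)])

# Right-hand side: a^2 V(X) + b^2 V(Y) + 2ab Cov(X, Y).
rhs = (a ** 2 * statistics.variance(x)
       + b ** 2 * statistics.variance(y)
       + 2 * a * b * covariance(x, y))

print(round(lhs, 6), round(rhs, 6))
```

Setting a = 1 and b = 1 or b = -1 recovers the V(X+Y) and V(X-Y) cases.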

Page 50: Lecture 3 – Data Summary Measures and Graphical Display of Results

Correlation as Agreement

Suppose two nurses are measuring SBP in the same patients and each nurse measures SBP 3 times in each patient.

SBP1     SBP2
141.8    139.7
140.2    144.4
131.8    133.5
132.5    127.8
135.7    135.8
141.2    146.5
143.9    139.4
140.2    139.3
140.8    138.0
131.7    132.2
130.8    134.6
135.6    134.4
143.6    146.3
133.2    135.9

Page 51: Lecture 3 – Data Summary Measures and Graphical Display of Results

Correlation as Agreement

• Could use Pearson correlation
• Another measure: intraclass correlation
  – Can separate the variance into two sources: between-subject and within-subject
  – The intraclass correlation is the ratio of the between-subject variance to the total variance (i.e., within + between)
  – By definition, intraclass correlation ranges from 0 to 1
  – Best measure of the “individual” touch

• In SBP example:
  ρ(Pearson) = 0.809    ρ(Intraclass) = 0.814

Page 52: Lecture 3 – Data Summary Measures and Graphical Display of Results

Things to Remember About Correlation

• 5 warnings (adapted from Huck):
  1. Does not speak to cause-and-effect
  2. Beware outliers
  3. Assumes linear relationship
  4. Correlation vs. independence: zero correlation implies independence for the (bivariate) Normal distribution only
  5. Strength of relationship WRT trend

Page 53: Lecture 3 – Data Summary Measures and Graphical Display of Results

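Categorical Outcomes: Two-way Tables

(E = exposed, Ē = unexposed, D = disease, D̄ = no disease)

• Prospective Design: Relative Risk (RR)

$$RR = \frac{P(\text{Disease in Exposed Group})}{P(\text{Disease in Unexposed Group})} = \frac{P(D \mid E)}{P(D \mid \bar{E})}$$

• Retrospective Design: Odds Ratio (OR)

$$OR = \frac{P(\text{Exposure in Cases}) / [1 - P(\text{Exposure in Cases})]}{P(\text{Exposure in Controls}) / [1 - P(\text{Exposure in Controls})]} = \frac{P(E \mid D) / P(\bar{E} \mid D)}{P(E \mid \bar{D}) / P(\bar{E} \mid \bar{D})}$$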

Page 54: Lecture 3 – Data Summary Measures and Graphical Display of Results

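Two-way Tables

                     Disease
                     Yes     No      Total
Exposure   Yes       a       b       a+b
           No        c       d       c+d
           Total     a+c     b+d     n=a+b+c+d

Prospective:
  P(D|E) = a/(a+b)
  P(D|Ē) = c/(c+d)

$$RR = \frac{a/(a+b)}{c/(c+d)} = \frac{ac+ad}{ac+bc}$$

Retrospective:
  P(E|D) = a/(a+c)
  P(E|D̄) = b/(b+d)

$$OR_E = \frac{\frac{a}{a+c} \cdot \frac{d}{b+d}}{\frac{c}{a+c} \cdot \frac{b}{b+d}} = \frac{ad}{bc}$$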

Page 55: Lecture 3 – Data Summary Measures and Graphical Display of Results

Two-way Tables

• Prospective design and relative risk (RR) are optimal
• Retrospective designs and odds ratio (OR) are easiest (cheapest)
• Can compute OR for prospective design

$$OR_D = \frac{\frac{a}{a+b} \cdot \frac{d}{c+d}}{\frac{b}{a+b} \cdot \frac{c}{c+d}} = \frac{ad}{bc}$$

Page 56: Lecture 3 – Data Summary Measures and Graphical Display of Results

Two-way Table

• Why we like the odds ratio…
  The exposure odds ratio is equivalent to the disease odds ratio!
• Regardless of study design (i.e., which margin is fixed) the estimate of the OR is the same

$$OR_D = \frac{\frac{a}{a+b} \cdot \frac{d}{c+d}}{\frac{b}{a+b} \cdot \frac{c}{c+d}} = \frac{ad}{bc} = \frac{\frac{a}{a+c} \cdot \frac{d}{b+d}}{\frac{c}{a+c} \cdot \frac{b}{b+d}} = OR_E$$

Page 57: Lecture 3 – Data Summary Measures and Graphical Display of Results

Two-way Tables

                  Cancer
                  Yes     No      Total
Smoke    Yes      35      25      60
         No       5       35      40
         Total    40      60      100

$$RR = \frac{35 \cdot 5 + 35 \cdot 35}{35 \cdot 5 + 25 \cdot 5} = 4.7$$

$$OR = \frac{35 \cdot 35}{25 \cdot 5} = 9.8$$
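The relative risk and odds ratio for this smoking and cancer table can be recomputed from the four cell counts. The function names are illustrative; the cell labels a, b, c, d follow the generic two-way table (rows = exposure, columns = disease).

```python
def relative_risk(a, b, c, d):
    """P(D | exposed) / P(D | unexposed)."""
    return (a / (a + b)) / (c / (c + d))

def disease_odds_ratio(a, b, c, d):
    """Odds of disease in the exposed over odds of disease in the unexposed."""
    return (a / b) / (c / d)

def exposure_odds_ratio(a, b, c, d):
    """Odds of exposure in cases over odds of exposure in controls."""
    return (a / c) / (b / d)

# Smoke yes: 35 cancer, 25 no cancer; smoke no: 5 cancer, 35 no cancer.
a, b, c, d = 35, 25, 5, 35

rr = relative_risk(a, b, c, d)
or_d = disease_odds_ratio(a, b, c, d)
or_e = exposure_odds_ratio(a, b, c, d)

print(round(rr, 1), round(or_d, 1), round(or_e, 1))
```

Both odds ratios reduce to ad/bc, so they agree regardless of which margin the study design fixes, while the relative risk is a different quantity.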

Page 58: Lecture 3 – Data Summary Measures and Graphical Display of Results

Two-way Table

Why we like the odds ratio – Part II

• For retrospective design, if…
  – Cases are representative of the population of all cases
  – Controls are representative of the population of all controls
  – The disease is “rare” (i.e., prevalence <20%)

Then OR ≈ RR

Page 59: Lecture 3 – Data Summary Measures and Graphical Display of Results

Two-way Tables

                  Cancer
                  Yes     No      Total
Smoke    Yes      75      325     400
         No       25      575     600
         Total    100     900     1000

$$RR = \frac{75 \cdot 25 + 75 \cdot 575}{75 \cdot 25 + 325 \cdot 25} = 4.5$$

$$OR = \frac{75 \cdot 575}{325 \cdot 25} = 5.3$$

Page 60: Lecture 3 – Data Summary Measures and Graphical Display of Results

Other Measures From Clinical Trials

                           Outcome
                           Yes     No      Total
Treatment   Experimental   15      135     150
            Control        100     150     250
            Total          115     285     400

P(O|E) = 15/150 = 0.1
P(O|C) = 100/250 = 0.4

RR = P(O|E)/P(O|C) = 0.25

• Absolute Risk Reduction (ARR) = P(O|C) - P(O|E) = 0.3
• Relative Risk Reduction (RRR) = 1 – RR = 0.75
• Number Needed to Treat (NNT) = 1/ARR = 3.33 (number needed to treat in the population to prevent 1 outcome event)
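These trial measures are simple arithmetic on the two outcome proportions, as the sketch below shows using the counts from the table. Variable names are illustrative.

```python
# Outcome counts from the table: events / group size.
p_experimental = 15 / 150   # P(O|E)
p_control = 100 / 250       # P(O|C)

rr = p_experimental / p_control        # relative risk
arr = p_control - p_experimental       # absolute risk reduction
rrr = 1 - rr                           # relative risk reduction
nnt = 1 / arr                          # number needed to treat

print(f"RR={rr:.2f} ARR={arr:.2f} RRR={rrr:.2f} NNT={nnt:.2f}")
```

Since a fraction of a patient cannot be treated, NNT is often reported rounded up to the next whole number (4 here).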

Page 61: Lecture 3 – Data Summary Measures and Graphical Display of Results

Things to Remember About Measures of Association

1. Beware: some sources use “odds ratio” and “relative risk” interchangeably
   – In most settings, OR overestimates RR

2. Be on guard when considering ARR, RRR, and NNT
   – Almost never see a SE or CI estimate
   – Should be based on large, well planned, prospective studies

Page 62: Lecture 3 – Data Summary Measures and Graphical Display of Results

Categorical Measures of Agreement

• The “kappa” coefficient or κ
• Example: two physicians diagnosing a disease

                         DOCTOR B
                         Disease   No Disease   Total
DOCTOR A   Disease       pa        pb           pA
           No Disease    pc        pd           qA
           Total         pB        qB           1

Here pa, pb, pc, pd are the proportions of subjects, not the number of subjects.

$$\hat{\kappa} = \frac{2(p_a p_d - p_b p_c)}{p_A q_B + p_B q_A}$$

Page 63: Lecture 3 – Data Summary Measures and Graphical Display of Results

Kappa Example

                             Psychiatrist B
                             Neurosis   Normal   Total
Psychiatrist A   Neurosis    0.04       0.06     0.10
                 Normal      0.01       0.89     0.90
                 Total       0.05       0.95     1.00

$$\hat{\kappa} = \frac{2(0.04 \times 0.89 - 0.06 \times 0.01)}{0.10 \times 0.95 + 0.05 \times 0.90} = 0.50$$

• Kappa is a categorical analog of the intraclass correlation
• Kappa can be computed for any “square” (k×k) tables
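The kappa value for the two psychiatrists can be checked two ways: with the 2×2 formula above, and with the more general "observed agreement versus chance agreement" form, which is the one that extends to k×k tables. The second form is not shown in the slides and is included here as a standard equivalent; function names are illustrative.

```python
def kappa_2x2(pa, pb, pc, pd):
    """Kappa from the four cell proportions of a 2x2 agreement table."""
    p_A, q_A = pa + pb, pc + pd     # row margins (rater A)
    p_B, q_B = pa + pc, pb + pd     # column margins (rater B)
    return 2 * (pa * pd - pb * pc) / (p_A * q_B + p_B * q_A)

def kappa_from_agreement(table):
    """Kappa for a square table of proportions: (po - pe) / (1 - pe)."""
    k = len(table)
    row = [sum(table[i]) for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_observed = sum(table[i][i] for i in range(k))       # raters agree
    p_expected = sum(row[i] * col[i] for i in range(k))   # agreement by chance
    return (p_observed - p_expected) / (1 - p_expected)

k1 = kappa_2x2(0.04, 0.06, 0.01, 0.89)
k2 = kappa_from_agreement([[0.04, 0.06],
                           [0.01, 0.89]])
print(round(k1, 2), round(k2, 2))
```

Although the raters agree on 93% of subjects, most of that agreement is expected by chance because "Normal" is so common, which is why kappa is only 0.50.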

Page 64: Lecture 3 – Data Summary Measures and Graphical Display of Results

Schedule