Statistics

Achim Tresch

UoC / MPIPZ Cologne

treschgroup.de/OmicsModule1415.html


Two ways of dealing with uncertainty

Francis, Pope

Andrey Kolmogorov, mathematician

http://turnbull.mcs.st-and.ac.uk/history/Mathematicians/Kolmogorov.html

Topics

I. Descriptive Statistics

II. Testing

III. Clustering

IV. Regression


What is „data“?

A collection of observations of a similar structure.

Cases (samples, observations): the rows of the data table. The sample (sample population) ⊆ population.

Endpoints (variables): the columns of the data table. Only one item per column! Meaningful variable names!

Values (instances) of a variable: the entries of a column.

Different Scales of a Variable

Categorical variables
Have only a finite number of instances: male/female; Mon/Tue/…/Sun

Nominal data: categorical variables without a given order
E.g. eye color [brown, blue, green, grey]
Special case: binary (= dichotomous) variables (yes/no, 0/1, …)

Ordinal data: instances are ordered in a natural way
E.g. tumor grade [I, II, III, IV], rank in a contest (1, 2, 3, …)

Continuous variables
Can take values in an interval of the real numbers
E.g. blood pressure [mmHg], costs [€]

85% shinier hair!

I. Description

Problem:

It is often difficult to map a variable to an appropriate scale:

E.g. metabolic activity, evolutionary success, pain, social status, customer satisfaction, anger

-> Check whether your choice of scale is meaningful!

Description of a categorical variable: Tables

Example: Blood antigens (ABO), n = 188 samples

Value                  A     B     AB    O     Total
(absolute) frequency   83    20    10    75    188
relative frequency     44%   11%   5%    40%   100%

Always list absolute frequencies!
• Do not list relative frequencies in percent if the sample size is small (n < 20)
• Do not use decimal digits in percent numbers for n < 300
• Rule of thumb: use about log10(n) − 2 decimal digits

„Side effects were observed in 14.2857% of all cases“
Nonsense, we conclude that n = 7!
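The frequency table above can be reproduced with a few lines of R. This is only a sketch: the vector name `abs_freq` is chosen here for illustration, and blood group 0 is written as `O`.

```r
# Absolute frequencies of the ABO example (n = 188)
abs_freq <- c(A = 83, B = 20, AB = 10, O = 75)
n <- sum(abs_freq)

# Relative frequencies in whole percent (no decimal digits, since n < 300)
rel_freq <- round(100 * abs_freq / n)

# Always show the absolute frequencies next to the percentages
freq_table <- rbind(absolute = abs_freq, percent = rel_freq)
print(n)
print(freq_table)
```
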

Description of a categorical variable: Barplot

[Figure: barplot of the blood antigen frequencies (A, B, AB, O); one axis shows the relative frequency in %, the other the absolute frequency.]

Description of continuous data: Histogram

[Figure: histogram; x-axis: value of the variable (−3 to 3), y-axis: number of cases.]

[Figure: three histograms with 50 bins, 12 bins and 4 bins; x-axis: value of the variable, y-axis: number of cases.]

The size of the bins (= width of the bars) is a matter of choice and has to be determined sensibly!
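The effect of the bin width can be tried out in R. The following is a minimal sketch on simulated data (the sample `d`, the helper `bin_counts` and the fixed range from −5 to 5 are assumptions made for this illustration, not part of the lecture data).

```r
set.seed(1)
d <- rnorm(1000)   # hypothetical sample: 1000 standard normal values

# Count the observations per bin for k equally wide bins on [-5, 5]
bin_counts <- function(x, k) {
  breaks <- seq(-5, 5, length.out = k + 1)
  hist(x, breaks = breaks, plot = FALSE)$counts
}

counts50 <- bin_counts(d, 50)   # many narrow bins: noisy
counts12 <- bin_counts(d, 12)
counts4  <- bin_counts(d, 4)    # few wide bins: shape is lost

# To draw a histogram instead of counting, call e.g.
# hist(d, breaks = seq(-5, 5, length.out = 13))
```

The number of bins changes the height of the bars and the amount of visible detail, but never the total number of cases.
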

Description of continuous data: Density plot

[Figure: density plots; x-axis: value of the variable (−3 to 3), y-axis: relative frequency (0.0 to 0.4).]

Caution: A density plot smooths the data automatically. The smooth curve is very suggestive and blurs discontinuities in a distribution.

Most important: The Gaussian (=normal) distribution

[Figure: Gaussian density curve with the expectation value and the standard deviation marked.]

C. F. Gauss (1777-1855): Roughly speaking, continuous variables that are the (additive) result of a lot of other random variables follow a Gaussian distribution.

-> It is often sensible to assume a Gaussian distribution for continuous variables.
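The statement about additive effects can be checked by simulation. The sketch below is an illustration under assumptions not taken from the slides: each observation is the sum of 12 independent uniform random numbers, and the check uses the fact that a Gaussian variable has about 68% of its mass within one standard deviation of the mean.

```r
set.seed(42)
n_obs   <- 10000   # number of simulated observations (assumed)
n_terms <- 12      # number of additive random effects per observation (assumed)

# Each row holds the 12 uniform effects of one observation
u <- matrix(runif(n_obs * n_terms), nrow = n_obs)
s <- rowSums(u)    # the observed variable: a sum of many random effects

# Fraction of observations within mean +/- one standard deviation
within_1sd <- mean(abs(s - mean(s)) <= sd(s))
print(within_1sd)

# hist(s, breaks = 50) shows the bell shape
```
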

Measures of Location, Scale and Scatter

Mean: sum of all observations / number of observations

Ex.: observations: 2, 3, 7, 9, 14
sum: 2+3+7+9+14 = 35
# observations: 5
Mean: 35/5 = 7

Median: A number M such that at least 50% of all observations are less than or equal to M, and at least 50% are greater than or equal to M. (Q: What if #observations is even?)

[Figure: rug plot of a sample d; the median splits the observations into two halves of 50% each.]
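The example can be verified in R. The sketch also shows one common convention for an even number of observations: R's `median` returns the average of the two middle values. The vectors `obs_even` and `obs_outlier` are made up for this illustration.

```r
obs <- c(2, 3, 7, 9, 14)
mean_obs   <- mean(obs)      # 35 / 5
median_obs <- median(obs)    # the middle one of the five sorted values

# Even number of observations: R averages the two middle values (3 and 7)
obs_even    <- c(2, 3, 7, 9)
median_even <- median(obs_even)

# One extreme value moves the mean a lot, but leaves the median unchanged
obs_outlier    <- c(2, 3, 7, 9, 1400)
mean_outlier   <- mean(obs_outlier)
median_outlier <- median(obs_outlier)
```
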

Mode: A value for which the density of the variable reaches a local maximum. If there is only one such value, the distribution is called unimodal, otherwise multimodal (special case: bimodal).

The mode usually is an unstable description of a sample.

Description of Location, Scale and Scatter

[Figure: a distribution with its mode, median and mean marked.]

Distribution Shapes

Symmetric: Mean ≈ Median

Skewed to the right: Median << Mean

Skewed to the left: Mean << Median

The median should be preferred to the mean if
• the distribution is very asymmetric
• there are extreme outliers

The mean is more „precise“ than the median if the distribution is approximately normal.

Rule of thumb: If the skewness g of the distribution lies between −1 and +1, the distribution is approximately symmetric.

Right skew: skewness g > 0

Left skew: skewness g < 0
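The slides do not state how the skewness g is computed. The sketch below assumes the usual moment coefficient of skewness (third central moment divided by the cubed standard deviation, both with divisor n); the function name `skewness_g` and the example samples are hypothetical.

```r
# Moment coefficient of skewness (an assumed definition of g)
skewness_g <- function(x) {
  m  <- mean(x)
  m2 <- mean((x - m)^2)   # second central moment
  m3 <- mean((x - m)^3)   # third central moment
  m3 / m2^(3 / 2)
}

right_skewed <- c(1, 1, 2, 2, 3, 10)   # long tail to the right
g_right <- skewness_g(right_skewed)    # positive
g_left  <- skewness_g(-right_skewed)   # mirrored sample: negative
g_sym   <- skewness_g(c(1, 2, 3, 4, 5))  # symmetric sample: zero
```

For the right-skewed sample the median (2) lies well below the mean (about 3.17), in line with the distribution shapes described above.
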

How would you describe this distribution?


„…it showed a giant boa swallowing an elephant. I painted the inside of the boa to make it visible to the adults. They always need explanations.“

Antoine de Saint-Exupéry, Le petit prince

Unexpected distributions have unexpected causes!

More measures of location

Quantile: A q-quantile Q (0≤q≤1) splits the data into a fraction of q points below or equal to Q and a fraction of 1-q points above or equal to Q.

Median = 50%-quantile

1st quartile = 25%-quantile

3rd quartile = 75%-quantile

0-quantile = minimum

1-quantile = maximum

[Figure: rug plots of a sample d with the median and the quartiles marked; the quartiles split the observations into four parts of 25% each.]

The five-point Summary and the Boxplot

[Figure: rug plot of a sample d illustrating the five-point summary and the boxplot.]
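Quantiles and the five-point summary are available in base R. The sketch reuses the observations 2, 3, 7, 9, 14 from the mean example. Note that several conventions exist for interpolating quantiles between observations; `quantile` uses R's default (type 7) and `fivenum` uses Tukey's hinges, which agree for this sample but need not agree in general.

```r
obs <- c(2, 3, 7, 9, 14)

# 0-, 25%-, 50%-, 75%- and 1-quantile
q <- quantile(obs, probs = c(0, 0.25, 0.5, 0.75, 1))

# Five-point summary: minimum, 1st quartile, median, 3rd quartile, maximum
five <- fivenum(obs)

print(q)
print(five)

# boxplot(obs) draws the corresponding boxplot
```
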

Measures of Variation

How far do the observations scatter around their „center“ (= measure of location)?

[Figure: two samples scattering around a location measure, one with large variation and one with small variation.]

E.g.: location = median, variation = 3rd quartile – 1st quartile = interquartile range (IQR)


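[Figure: an observation $x_j$, the median $\tilde{x}$ and the deviation $|x_j - \tilde{x}|$ marked on the axis.]

E.g.: location = median $\tilde{x}$, variation = mean deviation (MD) from $\tilde{x}$

$$\mathrm{MD} = \frac{1}{n}\sum_{j=1}^{n} |x_j - \tilde{x}|$$

E.g.: location = median $\tilde{x}$, variation = median absolute deviation (MAD) from $\tilde{x}$

$$\mathrm{MAD} = \mathrm{Median}\left(|x_j - \tilde{x}|,\; j = 1,\dots,n\right)$$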

E.g.: location = mean $\bar{x}$, variation = mean squared deviation from $\bar{x}$ = variance

$$s^2 = \frac{1}{n}\sum_{j=1}^{n} (x_j - \bar{x})^2$$

Or: variation = square root of the variance = standard deviation (s, std.dev)

Numbers for Gaussian variables:

Mean ± s contains ~68% of the data

Mean ± 2s contains ~95% of the data

Mean ± 3s contains ~99.7% of the data

[Figure: Gaussian density with $\bar{x} - s$, $\bar{x}$ and $\bar{x} + s$ marked.]
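The measures of variation introduced so far can be compared on one small sample. This is a sketch using the observations 2, 3, 7, 9, 14 from the mean example; the variable names are chosen for illustration. Two details are worth knowing: R's `mad` multiplies by 1.4826 unless `constant = 1` is given, and R's `var` and `sd` divide by n − 1 rather than by n.

```r
obs <- c(2, 3, 7, 9, 14)

iqr_obs <- IQR(obs)                          # 3rd quartile - 1st quartile
md_obs  <- mean(abs(obs - median(obs)))      # mean deviation from the median
mad_obs <- mad(obs, constant = 1)            # median absolute deviation

var_n <- mean((obs - mean(obs))^2)           # mean squared deviation (divisor n)
sd_n  <- sqrt(var_n)                         # its square root

var_r <- var(obs)                            # R's sample variance (divisor n - 1)
```
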

Histogram/Density Plot vs. Boxplot

Boxplot contains less information, but it is easier to interpret.


Multiple Boxplots

Sample: 2769 schoolchildren

Summary

Always report the sample size!

a) numerical: median, Q1, Q3, min., max. (five-point summary); for symmetric distributions alternatively: mean, standard deviation

b) graphical: boxplots, histograms, density plots

c) textual, e.g. „Blood pressure was reduced by 12 mmHg (interquartile range: 8 to 18 mmHg = 10 mmHg), whereas the reduction in the placebo group was only 3 mmHg (IQR: –2 to 4 mmHg = 6 mmHg).“

Two categorical variables: Cross Tables

Data

Person   Medication   Response
A        Verum        yes
B        Placebo      no

Cross Table

The rows are the values of variable 1 (potential causes), the columns are the values of variable 2 (potential effects). Each case is one count in the table.

                         Response
                         yes   no
Medication   Verum       1     0
             Placebo     0     1

The most common question is: Are there differences between █ and █ ?

Absolute number, row-, column percent

Cross Table: n = 80 cases

Each cell lists the absolute number, followed by the row percent and the column percent.

                         Response
                         yes             no              Total
Medication   Verum       20 (50%, 67%)   20 (50%, 40%)   40 (50%)
             Placebo     10 (25%, 33%)   30 (75%, 60%)   40 (50%)
             Total       30 (37%)        50 (63%)        80 (100%)
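Row and column percentages of the n = 80 cross table can be computed with `prop.table`. The sketch below types the four cell counts in by hand; the object names are chosen for illustration.

```r
# Cell counts of the cross table (rows: medication, columns: response)
counts <- matrix(c(20, 20,
                   10, 30),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Medication = c("Verum", "Placebo"),
                                 Response   = c("yes", "no")))
tab <- as.table(counts)

row_pct <- round(100 * prop.table(tab, margin = 1))  # each row sums to 100%
col_pct <- round(100 * prop.table(tab, margin = 2))  # each column sums to 100%
with_totals <- addmargins(tab)                       # adds the "Sum" row and column

print(with_totals)
print(row_pct)
print(col_pct)
```
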

What's bad about this table?

[Table: shown on the slide, not recoverable here.]

Cross tables: Independent vs. paired data

Independent data

Person   Medication   Response
A        Verum        yes
B        Placebo      no

Paired data

Person   Medic.: Verum   Medic.: Placebo
A        yes             yes
B        yes             no

Paired data: One object (or two closely related objects) serves for the measurement of two variables of the same kind.

Exercise: The influence of diet on body height is assessed in
1) a study with 100 randomly picked subjects,
2) a study with 50 identical twins that grew up separately.
Write down the cross tables. Which study is probably more informative?

Cross tables: Paired data

Cross table of the paired data (persons A and B)

                          Medic.: Placebo
                          yes   no
Medic.: Verum    yes      1     1
                 no       0     0

Concordant observations: both responses agree (cells yes/yes and no/no).

Discordant observations: the two responses differ (cells yes/no and no/yes).

A typical question is: Are the observations concordant or discordant? Is there a particularly large number in █ or █ ?
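For paired data, the cross table and the numbers of concordant and discordant observations can be obtained as follows. This is a sketch for the two persons A and B from the example; the data frame `paired` and its column names are made up for this illustration.

```r
# Paired data: each person has a response under Verum and under Placebo
paired <- data.frame(
  person  = c("A", "B"),
  verum   = factor(c("yes", "yes"), levels = c("yes", "no")),
  placebo = factor(c("yes", "no"),  levels = c("yes", "no"))
)

# Cross table: rows = response under Verum, columns = response under Placebo
tab <- table(Verum = paired$verum, Placebo = paired$placebo)

concordant <- sum(diag(tab))         # yes/yes and no/no
discordant <- sum(tab) - concordant  # yes/no and no/yes

print(tab)
```
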

Two continuous variables: Scatter Plots

Comparison of two global gene expression measurements

[Figure: scatterplots of the two measurements on an absolute scale and on a double logarithmic scale, each with the lines y = ¼ x, y = ½ x, y = 2x and y = 4x.]

Advantages of double log scale:
• Skewed distributions appear more evenly spread across the plot
• Loci of fixed expression folds are lines parallel to the main diagonal

Scatterplot vs. M-A-plot

[Figure: left, scatterplot of log(y) against log(x); right, the M-A-plot obtained by turning the scatterplot by 45°. M-A-plot axes: x-axis log(x * y)/2 = log(geometric mean of x and y), y-axis log(y/x) = log(fold ratio of y and x).]

> x = log(exprs[,1])
> y = log(exprs[,2])
> plot(x,y)

> xMA = (x+y)/2
> yMA = y - x
> plot(xMA,yMA)

Advantages of the MA-Plot:
• Lines of constant expression folds are parallel to the x-axis.
• Differences between channel 1 and channel 2 can easily be read off the plot. Intensity-dependent systematic errors can be detected.

There is a mistake in these plots (compare left and right plot)!
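The M-A transformation can be tried without the lecture's `exprs` matrix. The following sketch uses simulated intensities (the vectors `x_raw` and `y_raw` and the factor 2 are assumptions for illustration) in which channel 2 is exactly twice channel 1, so that all points must lie on a horizontal line at log(2).

```r
set.seed(7)
x_raw <- exp(runif(200, min = 0, max = 9))  # hypothetical channel 1 intensities
y_raw <- 2 * x_raw                          # channel 2: constant expression fold of 2

x <- log(x_raw)
y <- log(y_raw)

xMA <- (x + y) / 2   # log of the geometric mean of the two channels
yMA <- y - x         # log of the fold ratio y/x

# plot(xMA, yMA); abline(h = 0) draws the M-A-plot with the "no difference" line
```

A constant fold between the channels shows up as a constant offset of `yMA` from zero, independent of the intensity `xMA`.
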

M-A-plot

[Figure: M-A-plots illustrating the following cases.]

• No visible bias (= systematic error)
• Multiplicative bias: channel 2 differs from channel 1 by a constant factor

Dependence of two continuous variables

[Figure: example scatterplot of two continuous variables x and y.]

How to quantify such a relation between x and y?

The Pearson correlation coefficient r

measures the degree of linear dependence of two variables

Properties:

-1 ≤ r ≤ +1

r = ± 1: perfect linear dependence

the sign of r indicates the direction of the dependence

r is symmetric, i.e., r_xy = r_yx

[Figure: scatterplots with r = 1, r = 1 and r = −1.]

The smaller the absolute value of r, the weaker the linear dependence.

Example: Relation between height and weight, and between height and arm length.

[Figure: two scatterplots with r_xy = 0.38 and r_xy = 0.84.]

The closer the points scatter around a line, the larger the absolute value of r.


What is the value of r in these cases?

Pearson correlation has difficulties in recognizing non-linear dependencies.

[Figure: three scatterplots with non-linear dependencies, each with r ≈ 0.]
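The slides do not give a formula for r. The sketch below uses the standard definition (sum of the products of the deviations from the means, divided by the square root of the product of the two sums of squared deviations) and compares it with R's `cor`. The helper name `pearson` and the example vectors are made up for illustration.

```r
# Pearson correlation coefficient from its standard definition
pearson <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

x <- seq(-3, 3, by = 0.5)

r_up       <- pearson(x, 2 * x + 1)  # perfect increasing line
r_down     <- pearson(x, -x)         # perfect decreasing line
r_parabola <- pearson(x, x^2)        # perfect, but non-linear dependence
```

Although y = x² is completely determined by x, its Pearson correlation is zero here, because the dependence is not linear.
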


Spearman correlation measures monotonic dependencies.

Idea: Calculate the Pearson correlation coefficient of the rank transformed data. The result is the Spearman correlation s.

[Figure: left, scatterplot of the raw data (Y against X) with Pearson correlation r = 0.88; right, scatterplot of the rank transformed data (rank(Y) against rank(X)) with Spearman correlation s = 0.95.]

Pearson vs. Spearman Correlation

Pearson correlation (raw data)

            NM_001767     NM_000734     NM_001049     NM_006205
NM_001767    1.00000000    0.94918522   -0.04559766    0.04341766
NM_000734    0.94918522    1.00000000   -0.02659545    0.01229839
NM_001049   -0.04559766   -0.02659545    1.00000000   -0.85043885
NM_006205    0.04341766    0.01229839   -0.85043885    1.00000000

Spearman correlation (rank transformed data)

            NM_001767     NM_000734     NM_001049     NM_006205
NM_001767    1.00000000    0.9529094    -0.10869080   -0.17821449
NM_000734    0.9529094     1.00000000   -0.11247013   -0.20515650
NM_001049   -0.10869080   -0.11247013    1.00000000    0.03386758
NM_006205   -0.17821449   -0.20515650    0.03386758    1.00000000


Conclusion: Spearman correlation is more robust against outliers. However, in the case of linear dependence, it is less sensitive than Pearson correlation.
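Both points can be illustrated in R: Spearman correlation equals the Pearson correlation of the ranks, and a single outlier can dominate the Pearson correlation. The data below are made up for this sketch: nine points on a decreasing line plus one extreme point.

```r
# Nine points with a perfect decreasing relation, plus one outlier at (100, 100)
x <- c(1:9, 100)
y <- c(9:1, 100)

r_pearson  <- cor(x, y)                       # dominated by the outlier
s_spearman <- cor(x, y, method = "spearman")  # based on ranks only

# Spearman correlation = Pearson correlation of the rank transformed data
s_by_ranks <- cor(rank(x), rank(y))
```

Here the Pearson correlation is close to +1 because of the single outlier, while the Spearman correlation stays negative, reflecting the decreasing trend of the other nine points.
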

Q(uantile)-Q(uantile) Plots

Quantile-quantile plot (qq-plot): For the comparison of two distributions (of x and y), plot the quantiles of the x distribution against the corresponding quantiles of the y distribution.

Interpretation:

• Dissimilar distributions: the qq-plot is not linear, in particular not in the center of the qq-line.

• Similar distributions except for the tails; the tails of the y distribution are “heavier”.

• Similar distributions except for the tails; the tails of the x distribution are “heavier”.
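A qq-plot of two samples can be produced with R's `qqplot`. The sketch below uses simulated data chosen for illustration: `x` is Gaussian and `y` follows a t distribution with 2 degrees of freedom, which has heavier tails. For two samples of equal size the qq-plot simply plots the sorted values of one sample against the sorted values of the other.

```r
set.seed(3)
x <- rnorm(500)        # hypothetical sample with Gaussian tails
y <- rt(500, df = 2)   # hypothetical sample with heavier tails

# Compute the points of the qq-plot without drawing it
qq <- qqplot(x, y, plot.it = FALSE)

# To draw it and add the reference line y = x:
# plot(qq$x, qq$y, xlab = "quantiles of x", ylab = "quantiles of y")
# abline(0, 1)
```
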
