Preprocessing of Microarray Data Normalization and Statistical Analysis (Working with Noise...)


Preprocessing of Microarray Data

Normalization and Statistical Analysis

(Working with Noise...)


Microarray Processing Pipeline

Question/Experimental Design

Sample Preparation/Hybridization

Image Analysis

Array Design/Probe Design

Normalization (Scaling)

Comparable Gene Expression Data

Statistical Analysis

Buy Chip or Array

Advanced Data Analysis: Clustering, PCA, Classification, Promoter Analysis, Regulatory Network

Expression Index Calculation


Question/Experimental Design

• Read papers!

• Formulate a detailed hypothesis

• Design experiment accordingly!

• Take our extended 27612 course ☺


Setup at DTU/CBS—The GCS3000

• > 50,000 probe sets per chip

• Can use the newest series chips

• Automation of routine procedures
  – better reproducibility
  – lighter workload
  – faster scans


The CBS Lab

Conducts minor experiments/validations

~200 Chips scanned last year by CBS


Two kinds of variation

Global variation (systematic)

Amount of RNA in the sample

Efficiencies of:
– RNA extraction
– Reverse transcription
– Amplification
– Labeling
– Photodetection

Gene-specific variation (stochastic)

Spotting efficiency
– Spot size
– Spot shape

Cross-/unspecific hybridization

Biological variation
– RNA degradation
– Noise


Sources of variation

Systematic: global variation
– Similar effect on many measurements
– Corrections can be estimated from data
– Handled by normalization

Stochastic: gene-specific variation
– Too random to be explicitly accounted for ("noise")
– Handled by statistical testing


The R Project for Statistical Computing


BioConductor and R

R is flexible and community oriented

BioConductor is a repository for R methods for microarray analysis

Web: http://www.bioconductor.org


The CBS Workhorse


Qspline – Normalization developed at CBS

Qspline performs a high-quality non-linear normalization


The Quantile and Qspline method

From the empirical distribution, a number of quantiles are calculated for each of the channels to be normalized (one channel shown in red) and for the reference distribution (shown in black).

A QQ-plot is made and a normalization curve is constructed by fitting a cubic spline function.

As reference one can use an artificial “median array” for a set of arrays or use a log-normal distribution, which is a good approximation.
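
The quantile step can be sketched in a few lines of Python. This is an illustration only, not the Qspline implementation: the function names are made up for this example, the arrays are assumed to have equal length, and the cubic spline is swapped for straight-line interpolation between the quantile pairs so that only the standard library is needed.

```python
from statistics import median

def quantiles(values, n_quantiles):
    """Return n_quantiles (>= 2) evenly spaced empirical quantiles, min to max."""
    s = sorted(values)
    last = len(s) - 1
    out = []
    for k in range(n_quantiles):
        pos = k * last / (n_quantiles - 1)   # fractional index into sorted data
        lo = int(pos)
        hi = min(lo + 1, last)
        out.append(s[lo] + (pos - lo) * (s[hi] - s[lo]))
    return out

def median_array(arrays):
    """Artificial reference: median at each rank across equal-length arrays."""
    sorted_arrays = [sorted(a) for a in arrays]
    return [median(col) for col in zip(*sorted_arrays)]

def interpolate(x, xs, ys):
    """Piecewise-linear curve through the QQ points; clamps outside the range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            frac = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + frac * (ys[i] - ys[i - 1])

def normalize_to_reference(channel, reference, n_quantiles=5):
    """Map each channel value through the channel-to-reference quantile curve."""
    qx = quantiles(channel, n_quantiles)     # x-axis of the QQ-plot
    qy = quantiles(reference, n_quantiles)   # y-axis of the QQ-plot
    return [interpolate(v, qx, qy) for v in channel]
```

A typical call would build the reference with `median_array(all_arrays)` and then pass each array through `normalize_to_reference`.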


Visualizing data

MVA plot
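
For reference, the two axes of such a plot are usually computed as below. M is the log2 ratio of the two channels and A is their average log2 intensity. The function and argument names are assumptions for this sketch, and all intensities are assumed to be positive.

```python
import math

def ma_values(red, green):
    """Return a list of (M, A) pairs for paired channel intensities."""
    pairs = []
    for r, g in zip(red, green):
        m = math.log2(r) - math.log2(g)          # log2 fold change
        a = 0.5 * (math.log2(r) + math.log2(g))  # average log2 intensity
        pairs.append((m, a))
    return pairs
```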


Lowess Normalization

One of the most commonly utilized normalization techniques is the LOcally Weighted Scatterplot Smoothing (LOWESS) algorithm.

[Figure: scatterplot of M against A]
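
A minimal sketch of the idea, assuming tricube weights and a local straight-line fit. It is simplified: the robustness iterations of the full LOWESS algorithm are left out, and `frac` (the fraction of points in each local window) is a parameter name chosen for this example.

```python
def lowess_fit(a, m, frac=0.5):
    """Locally weighted linear fit of m on a, evaluated at every a[i]."""
    n = len(a)
    k = min(n, max(2, round(frac * n)))      # points in each local window
    fitted = []
    for x0 in a:
        # Bandwidth: distance from x0 to its k-th nearest neighbour.
        dists = sorted(abs(x - x0) for x in a)
        h = dists[k - 1] or 1e-12
        # Tricube weights, zero at and beyond the bandwidth.
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in a]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, a)) / sw
        my = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def lowess_normalize(a, m, frac=0.5):
    """Subtract the fitted intensity-dependent trend from M."""
    trend = lowess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]
```

After normalization the M values should scatter around zero across the whole range of A.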


Invariant set normalization (Li and Wong)

An invariant set of probes is used: probes that do not change intensity rank between arrays

A piecewise linear median line is calculated

This curve is used for normalization
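
The selection step could look roughly like this. It is a single-pass simplification, whereas the published procedure iterates the selection. The function names and the `max_rank_shift` tolerance (as a fraction of the number of probes) are assumptions for this sketch.

```python
def ranks(values):
    """Rank of each value (0 = smallest), ties broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order):
        r[idx] = rank
    return r

def invariant_set(array, baseline, max_rank_shift=0.05):
    """Indices of probes whose rank barely moves between the two arrays."""
    n = len(array)
    ra, rb = ranks(array), ranks(baseline)
    tolerance = max_rank_shift * n
    return [i for i in range(n) if abs(ra[i] - rb[i]) <= tolerance]
```

The normalization curve would then be fitted through the (array, baseline) intensities of the selected probes only.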


Comparison of Normalization Algorithms

[Figure: comparison of Qspline, Loess, Linear and Inv. Set normalization]

(From Workman et al., 2002)


Calibration = Normalization = Scaling


Expression Index Calculation

Some microarrays have multiple probes addressing the expression of the same gene

- Affymetrix GeneChips have 11-20 probe pairs per gene
- Perfect Match (PM)
- MisMatch (MM)

PM: CGATCAATTGCACTATGTCATTTCT
MM: CGATCAATTGCAGTATGTCATTTCT
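
In the probe pair above, the MM probe differs from the PM probe only at the middle (13th) base, which is replaced by its complement. The sketch below reproduces that and adds one very simple expression index, the mean PM minus MM difference. The function names are made up for this example, and the index shown is only an illustration, not what RMA, dChip or MAS 5.0 compute.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def mismatch_probe(pm):
    """Return the MM probe: PM with its middle base complemented."""
    mid = len(pm) // 2                  # index 12 for a 25-mer
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]

def average_difference(pm_signals, mm_signals):
    """Mean of PM - MM over all probe pairs of one probe set."""
    diffs = [p - m for p, m in zip(pm_signals, mm_signals)]
    return sum(diffs) / len(diffs)

print(mismatch_probe("CGATCAATTGCACTATGTCATTTCT"))
# CGATCAATTGCAGTATGTCATTTCT
```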


Comparison of Methods

Reproducibility of Genetic Expression from 20 Replicates:

Blue and Red RMA; Black Li & Wong dChip; Green MAS 5.0

(By Terry Speed)


Asking questions of your microarray data

• Typically we want to identify differentially expressed genes

• Example: Alcohol dehydrogenase is expressed at a higher level when alcohol is added to the media

[Figure: alcohol dehydrogenase expression without alcohol and with alcohol]


Significance Testing

wildtype+alcohol

wildtype

mutant +alcohol

mutant


Alas, the data contain stochastic noise

He’s going to say it

There is no way around it...

Statistics


You can think of statistics as a black box...

...but, you still need to perform the test and understand how to interpret the results

Noisy measurements → statistic → p-value


What's inside the black box of ‘statistics’?

t-tests, ANOVAs and Volcanoes


The notorious p-value

p-value = The chance of seeing a result at least this extreme by coincidence when the null hypothesis is true

For gene expression analysis we can say: The chance that a gene with no real change in expression would look this differentially expressed by mistake


The t-test

The t statistic is based on the sample mean and variance

Then a lookup of t in a table of the t-distribution finds the p-value
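
As a sketch of the first step, the statistic for two groups of replicate measurements can be computed as below. The slide does not say which variant of the t-test is meant, so the unequal-variance (Welch) form is assumed here, and the function name is made up. The p-value lookup itself is left to a table or a statistics package.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Return (t, degrees of freedom) for two samples with >= 2 values each."""
    vx = variance(x) / len(x)           # squared standard error of mean(x)
    vy = variance(y) / len(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df
```

The pair `(t, df)` is what one would look up in the t-distribution to obtain the p-value.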


ANOVA = ANalysis Of VAriance

• Very similar to the t-test, but can test multiple categories

• Ex: is gene x differentially expressed between wt, mutant 1 and mutant 2

• Advantage: it has more ‘power’ than the t-test
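
For the one-way case, the F statistic compares the variation between category means with the variation within categories. A minimal sketch, with a made-up function name and each category given as a list of measurements for one gene:

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic for k groups of measurements (one-way ANOVA)."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation of the measurements around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

For example, `one_way_anova_f([wt, mutant1, mutant2])` gives the statistic for the three-category question above, to be looked up in an F-distribution with k - 1 and n - k degrees of freedom.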


The two way ANOVA

wildtype+alcohol

wildtype

mutant +alcohol

mutant



The fine print...

We can rank the genes according to the p-value

But, we really can’t trust the p-value, in the strict statistical way!

We rarely fulfill all the assumptions of the statistical tests...

And we should take multiple testing into account


The Problem of Multiple Testing

1981
• Prince Charles gets married
• Liverpool wins the European Championship League
• Pope dies

2005
• Prince Charles gets married
• Liverpool wins the European Championship League
• Pope dies


The Problem of Multiple Testing

In a typical microarray analysis we test thousands of genes

If we use a significance level of 0.05 and test 1000 genes, we would expect 50 genes to be significant by chance, even if no gene is truly differentially expressed

1000 x 0.05 = 50


Correction for multiple testing

Confidence level of 99%

Bonferroni:

$$P \le \frac{0.01}{N}$$

Benjamini-Hochberg:

$$P \le \frac{i}{N} \cdot 0.01$$

N = Number of genes
i = Ranking number of gene (genes ranked by increasing P)
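
Both corrections can be sketched as follows, with the 0.01 level from the slide as the default. The function names are chosen for this example. For Benjamini-Hochberg the usual step-up rule is assumed: every gene up to the largest rank i that satisfies the inequality is called significant.

```python
def bonferroni(pvalues, alpha=0.01):
    """True for each gene with P <= alpha / N."""
    n = len(pvalues)
    return [p <= alpha / n for p in pvalues]

def benjamini_hochberg(pvalues, alpha=0.01):
    """True for each gene accepted by the step-up rule P_(i) <= (i / N) * alpha."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])   # gene indices by P
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / n * alpha:
            cutoff_rank = rank           # keep the largest rank that passes
    significant = [False] * n
    for rank, idx in enumerate(order, start=1):
        significant[idx] = rank <= cutoff_rank
    return significant
```

With the p-values 0.001, 0.004, 0.009 and 0.5, Bonferroni keeps only the first gene, while Benjamini-Hochberg keeps the first two.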


But really, those methods are not the answer

What we can trust is the ability of a statistical test to rank genes according to their reliability

The number of genes that are needed or can be handled in downstream processes can be used to set the cutoff

If we permute the samples we then can get an estimate of the False Discovery Rate (FDR) in this set
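
One way to sketch such a permutation estimate: take the top-ranked genes, shuffle the sample labels many times, and count how many genes reach the same score by chance. Everything here is an assumption for illustration: the function names, the absolute difference of group means as the score, and the 0/1 labels.

```python
import random
from statistics import mean

def abs_mean_diff(values, labels):
    """Score for one gene: |mean of group 0 - mean of group 1|."""
    a = [v for v, l in zip(values, labels) if l == 0]
    b = [v for v, l in zip(values, labels) if l == 1]
    return abs(mean(a) - mean(b))

def permutation_fdr(expr, labels, top_k, n_perm=200, seed=0):
    """Estimate the FDR among the top_k genes by permuting sample labels.

    expr is a list of per-gene measurement lists; labels holds 0 or 1
    for each sample, in the same order as the measurements.
    """
    rng = random.Random(seed)
    observed = sorted((abs_mean_diff(row, labels) for row in expr), reverse=True)
    threshold = observed[top_k - 1]      # score of the last gene in the set
    false_counts = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)            # break the link between label and data
        hits = sum(abs_mean_diff(row, shuffled) >= threshold for row in expr)
        false_counts.append(hits)
    # Expected number of chance hits divided by the number of genes selected.
    return min(1.0, mean(false_counts) / top_k)
```

The cutoff `top_k` would be set by how many genes the downstream processes can handle, as described above.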


...but then what do we do?

[Figure: plot of P-value against log2 fold change (M)]


Conclusions

Array data contains stochastic noise
– Statistics is needed to conclude on differential expression

We can’t trust the p-value

But the statistics can rank genes
– The capacity/needs of downstream processes can be used to set cutoff

FDR can be estimated
– e.g. Volcano Plots

t-test is used for two category tests

ANOVA is used for multiple categories


The “Result”—A Ranked Gene List


Relapse: Cluster Analysis

Clustering of 19 class predictive genes

Sensitivity:
– Relapse: 87%
– CCR: 74%

Specificity:
– Relapse: 69%
– CCR: 92%

[Figure: clustered heatmap of the class predictive genes (rows) against patient samples (columns); color scale from -0.48 to 0.48]

Samples (columns, in plot order): P30.R, P18.C, P35.C, P52.C, P33.C, P13.C, P57.C, P7.C, P28.C, P37.C, P34.C, P22.C, P19.C, P55.C, P29.C, P5.R, P17.R, P21.R, P9.C, P27.R, P50.C, P53.R, P48.C, P20.R, P24.R, P51.R, P25.R, P47.C

Genes (rows, probe set ID and gene symbol):
204812_at ZW10
203073_at LDLC
202958_at PTPN9
202772_at HMGCL
218896_s_at HSA277841
204798_at MYB
206586_at CNR2
208594_x_at ILT8
204708_at MAPK4
221746_at UBL4
222140_s_at SH120
217942_at MDS023
202322_s_at GGPS1
222103_at ATF1
202630_at APPBP2
200050_at ZNF146
204327_s_at ZNF202
202535_at FADD
201746_at TP53