
Microarray data analysis - Göteborgs universitet (bio.lundberg.gu.se/courses/vt06/microarray_data_analysis_JG.pdf)


Bioinformatics 13-17 March 2006

Microarray data analysis

John Gustafsson
Mathematical statistics
Chalmers

The microarray block

• Lectures
  – DNA microarray technology overview (KS)
  – Analysis of microarray data (JG)
  – How to use microarray technology, and what to do with the results… (BG)
• Computer exercise
  – A case study using Bioconductor (JG)

Microarray experiments

Biological question or hypothesis → Experimental design → Lab work → Image analysis → Normalization → Statistics and ranking → Biological verification and interpretation

Considering all of these parts is essential for a satisfactory result.

Outline

• Microarray experiments
  – Technologies
  – Experimental design
  – Image acquisition and analysis
• Data analysis steps
• Introduction to the Computer Exercise

Microarray Technologies

• Spotted two-channel cDNA
  – Up to 100,000 spots on a single glass slide
  – Two sources of RNA on each array, labeled “red” (Cy5) and “green” (Cy3)
  – Can be bought or produced
  – One slide costs from 100 euro


Microarray Technologies

• Affymetrix
  – Each gene has several perfect match (PM) and mismatch (MM) probes
  – One source of RNA on each array
  – Can only be bought
  – One chip costs from 200 euro

Experimental design

• Depends heavily on the biological question
• Common design types are direct comparison, common reference and time course
• Differs between cDNA and Affymetrix microarrays


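Experimental design

[Figure: four design diagrams]
• Direct comparison: A versus B
• Common reference: A and B each versus a common reference (C. Ref.)
• Time course with common reference: T1, T2, T3 and T4 each versus a common reference (C. Ref.)
• Time course with direct comparisons: T1, T2, T3 and T4 compared directly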

Image acquisition and analysis

[Figure: cDNA microarray image with a print-tip area indicated]


Outline

• Microarray experiments
• Data analysis
  – R and Bioconductor
  – Visualization and exploration of data
  – Normalization
  – Statistics and ranking
• Introduction to the Computer Exercise

R and Bioconductor

• R is a powerful statistical software environment that is free and open source
• R is available for many platforms (Windows, Linux, Solaris, …), see www.r-project.org
• Bioconductor is a collection of R packages for the analysis of biological data of various kinds, see www.bioconductor.org

LIMMA – Linear Models for Microarray Analysis

• An R package for analysis of microarray data
• Originally intended for spotted cDNA microarrays, but it can analyze Affymetrix data as well
• Relatively easy to use and can handle almost any experimental design
• Contains most of the functions needed for a basic microarray analysis
• More information about LIMMA is available at http://bioinf.wehi.edu.au/limma/

Data Representation (cDNA)

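• To make the data easier to visualize, the red (R) and green (G) spot intensities are usually transformed as

$$M = \log_2 R - \log_2 G = \log_2 \frac{R}{G}$$

$$A = \frac{1}{2}\left(\log_2 R + \log_2 G\right)$$

• M reflects the fold change and A the average spot intensity, both on the log scale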
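As a small illustration of the transformation above, the following Python sketch computes M and A for a single spot. It assumes base-2 logarithms and strictly positive, background-corrected intensities; the function name `ma_values` is hypothetical and not part of LIMMA.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities.

    Assumes both intensities are strictly positive and that base-2
    logarithms are used, so M = 1 corresponds to a two-fold change.
    """
    log_r = math.log2(red)
    log_g = math.log2(green)
    m = log_r - log_g          # log ratio (fold change)
    a = 0.5 * (log_r + log_g)  # average log intensity
    return m, a


# A spot four times brighter in red than in green:
print(ma_values(1024.0, 256.0))  # (2.0, 9.0)
```

Swapping the two channels changes the sign of M but leaves A unchanged, which is why M is convenient for comparing up- and down-regulation symmetrically.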

[Figure: MA-plot]

Data Representation (Affy)

• A model that takes into account the different affinities of the probes is fitted to the PM values
• The MM values are usually discarded
• The result is a log-scale value (signal) for each gene
• MA-values can be created by comparing two Affymetrix arrays


Visualization

• Very important for quality control and for “understanding” the data
• Visualization of spatial patterns
• Visualization of intensity-dependent bias: MA-plots
• Visualization of variance: box-plots

[Figure: MA-plot for array swirl.1; x-axis: A, y-axis: M]

[Figure: Box-plot of M for arrays swirl.1, swirl.2, swirl.3 and swirl.4]

Normalization

• Necessary to remove red or green dye bias
• Which subset of spots should be used for normalization?
• The bias can be intensity-dependent. An ad hoc solution: the loess function
• The bias can be spatially dependent. A solution: print-tip loess normalization
• Normalizing between arrays: adjusting the distributions

Normalization – loess

• A robust way to remove bias and intensity-dependent trends.

• A curve is fitted to the MA-plot using robust weighted least squares and is then subtracted from the M-values.

• Can be done both locally (print-tip loess) and globally (global loess).

• Computationally intensive for large datasets (i.e. many genes).
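The idea of fitting a curve to the MA-plot and subtracting it can be sketched in plain Python. This is a simplified local linear fit with tricube weights and without the robustness iterations of a real loess, so it is an illustration of the principle rather than the routine LIMMA uses. The function names and the `span` default are assumptions made for the example.

```python
def tricube(u):
    """Tricube kernel weight for a scaled distance u >= 0."""
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def local_linear_fit(a_values, m_values, span=0.5):
    """Estimate the trend of M as a function of A at every spot.

    For each spot a straight line is fitted by weighted least squares to
    its nearest neighbours along the A axis; closer spots weigh more.
    """
    n = len(a_values)
    k = min(n, max(3, round(span * n)))  # neighbours used in each local fit
    fitted = []
    for a0 in a_values:
        dists = sorted(abs(a - a0) for a in a_values)
        h = dists[k - 1] * 1.001 + 1e-12  # bandwidth just beyond k-th neighbour
        sw = swx = swy = swxx = swxy = 0.0
        for a, m in zip(a_values, m_values):
            w = tricube(abs(a - a0) / h)
            x = a - a0  # centre on a0 so the intercept is the fitted value
            sw += w
            swx += w * x
            swy += w * m
            swxx += w * x * x
            swxy += w * x * m
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # neighbours share one A value
        else:
            slope = (sw * swxy - swx * swy) / denom
            fitted.append((swy - slope * swx) / sw)
    return fitted


def loess_normalize(a_values, m_values, span=0.5):
    """Subtract the fitted intensity-dependent trend from the M-values."""
    trend = local_linear_fit(a_values, m_values, span)
    return [m - t for m, t in zip(m_values, trend)]


# A purely intensity-dependent bias is removed by the normalization.
a = [6.0 + 0.4 * i for i in range(21)]
m = [0.3 * x - 2.5 for x in a]
print(max(abs(v) for v in loess_normalize(a, m)) < 1e-6)  # True
```

Each fitted value requires a pass over all spots, which illustrates why the method becomes computationally intensive with many genes. Applying the same function separately to the spots of each print tip would give a print-tip variant.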


[Figure: Normalization – loess]

Statistics and ranking

• To decide which genes are regulated, we need a statistic or ranking function
• The most common hypotheses are
  H0: gene g is not regulated
  HA: gene g is regulated
  but other hypotheses might also be interesting
• Examples of statistics used are the M-value threshold, the t-statistic and the moderated t-statistic

Statistics and ranking – M-value threshold

Reject H0 if

$$|\bar{M}_g| > T$$

where $\bar{M}_g$ is the mean M-value of gene g over the arrays and T is a chosen threshold.

Pros
• Easy to use and implement
• Fast and works with any number of arrays
• Easy to understand
• Easy to calculate significance under the right assumptions

Cons
• Does not take the variance of a gene into account

Statistics and ranking – t-statistic

Reject H0 if

$$\frac{|\bar{M}_g|}{\sqrt{S_g^2 / n}} > T$$

where $\bar{M}_g$ is the mean M-value of gene g, $S_g^2$ is the sample variance of its M-values, n is the number of arrays and T is a chosen threshold.

Pros
• Takes the variance into account
• Easy to use and implement
• Easy to calculate significance under the right assumptions

Cons
• Very sensitive to genes with small variance
• Needs many arrays to work satisfactorily
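The contrast between ranking by mean M-value and ranking by the ordinary t-statistic can be illustrated with two made-up genes. The sketch below assumes the usual one-sample form, mean divided by the square root of (sample variance / n); the data and function names are invented for the example.

```python
import math
import statistics


def mean_m(m_values):
    """Mean M-value of one gene across arrays."""
    return statistics.fmean(m_values)


def t_statistic(m_values):
    """Ordinary one-sample t-statistic; undefined if the variance is zero."""
    n = len(m_values)
    variance = statistics.variance(m_values)  # sample variance (n - 1)
    return statistics.fmean(m_values) / math.sqrt(variance / n)


# Invented data for four arrays.
gene_a = [2.0, 1.0, 3.0, 2.0]      # large fold change, moderate variance
gene_b = [0.11, 0.10, 0.09, 0.10]  # tiny fold change, tiny variance

# Ranking by mean M puts gene_a first ...
print(abs(mean_m(gene_a)) > abs(mean_m(gene_b)))   # True
# ... but the t-statistic puts gene_b first because its variance is tiny.
print(t_statistic(gene_b) > t_statistic(gene_a))   # True
```

With only four arrays the variance estimate is itself very noisy, so a gene can reach a large t-statistic by chance alone.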


Statistics and ranking – moderated t-statistic

Reject H0 if

$$\frac{|\bar{M}_g|}{\sqrt{(a + S_g^2) / n}} > T$$

where $\bar{M}_g$ is the mean M-value of gene g, $S_g^2$ is the sample variance of its M-values, n is the number of arrays, a is a positive constant added to the variance and T is a chosen threshold.

Pros
• Takes the variance into account
• Robust
• Works well with few arrays (>2)

Cons
• Numerical methods are needed to calculate significance.
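A minimal sketch of a moderated statistic follows, assuming the penalized form in which a constant a is added to the sample variance: mean / sqrt((a + variance) / n). The value a = 0.5 is chosen by hand for illustration. LIMMA estimates its moderation from all genes by an empirical Bayes procedure, which this sketch does not reproduce.

```python
import math
import statistics


def moderated_t(m_values, a):
    """Penalized t-statistic: a constant a >= 0 is added to the variance.

    With a = 0 this reduces to the ordinary one-sample t-statistic.
    """
    n = len(m_values)
    variance = statistics.variance(m_values)
    return statistics.fmean(m_values) / math.sqrt((a + variance) / n)


# Invented data for four arrays.
gene_a = [2.0, 1.0, 3.0, 2.0]      # large fold change, moderate variance
gene_b = [0.11, 0.10, 0.09, 0.10]  # tiny fold change, tiny variance

# The constant keeps the tiny variance of gene_b from inflating its score.
print(moderated_t(gene_a, a=0.5) > moderated_t(gene_b, a=0.5))  # True
```

The larger a is, the closer the ranking comes to a plain ranking by mean M-value, so a acts as a compromise between the two simpler statistics.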

Outline

• Microarray experiments
• Data analysis steps
• Introduction to the Computer Exercise

Introduction to Computer Exercise

Aims
• To understand a complete microarray analysis
• To understand the need for normalization
• To understand the difference between different statistics
• To use the LIMMA package and understand some of its structure

The SWIRL dataset

• ~8000 genes on four arrays from zebrafish.
• Direct comparison of wild type (WT) vs BMP2 mutant.
• Dye swaps: two arrays with WT Cy3, mutant Cy5 and two arrays with WT Cy5, mutant Cy3.
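To see why dye swaps help, consider a gene with a true log ratio (mutant versus WT) and a constant dye bias added to every M-value. The sketch below flips the sign of M on the swapped arrays before averaging. The array order, the sign vector and the numbers are assumptions made for the example, not values taken from the dataset.

```python
from statistics import fmean

# Hypothetical array order: +1 where the mutant is labeled Cy5 (red),
# -1 where the dyes are swapped and the mutant is labeled Cy3 (green).
DYE_SIGNS = [-1, 1, -1, 1]


def dye_swap_mean(m_values, signs=DYE_SIGNS):
    """Average M across arrays after flipping the dye-swapped arrays."""
    return fmean([s * m for s, m in zip(signs, m_values)])


# Invented gene: true mutant/WT log ratio 1.0, dye bias 0.3 on every array.
true_effect, dye_bias = 1.0, 0.3
observed = [s * true_effect + dye_bias for s in DYE_SIGNS]

print(round(dye_swap_mean(observed), 6))  # 1.0, the bias cancels
print(round(fmean(observed), 6))          # 0.3, only the bias remains
```

With equal numbers of arrays in each dye orientation, a constant dye bias cancels exactly in the sign-corrected average.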

Questions at issue

• What happens in the mutant? (Exploratory approach)

• What happens with the BMP2 gene? What should happen in an ideal experiment?

Acknowledgements

• The lecture is partially based on slides by Erik Kristiansson and Petter Mostad


Extra slides

Array Layout

• cDNA microarrays are printed using sets of print tips
• This creates “blocks” containing “rows” and “columns”
• To visualize each spot correctly, its spatial position on the chip must be inferred
• An object containing the number of rows and columns of blocks, and the number of rows and columns of spots within blocks, must be created
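Such a layout object can be sketched as a small Python class that maps a spot's index in the data file to its row and column on the slide. The field names, the example dimensions and the ordering convention are all assumptions made for illustration: spots are taken to be listed block by block, with blocks and spots both running row-wise. The actual order depends on the image analysis software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayLayout:
    """Grid of print-tip blocks, each holding a grid of spots."""
    block_rows: int
    block_cols: int
    spot_rows: int
    spot_cols: int

    @property
    def n_spots(self):
        return (self.block_rows * self.block_cols
                * self.spot_rows * self.spot_cols)

    def position(self, index):
        """Return the 0-based (row, column) of a spot on the whole slide."""
        per_block = self.spot_rows * self.spot_cols
        block, within = divmod(index, per_block)
        block_row, block_col = divmod(block, self.block_cols)
        spot_row, spot_col = divmod(within, self.spot_cols)
        return (block_row * self.spot_rows + spot_row,
                block_col * self.spot_cols + spot_col)


# Illustrative dimensions: 4 x 4 blocks of 22 x 24 spots.
layout = ArrayLayout(block_rows=4, block_cols=4, spot_rows=22, spot_cols=24)
print(layout.n_spots)         # 8448
print(layout.position(528))   # (0, 24): first spot of the second block
```

With positions available for every spot, M-values can be drawn as an image of the slide to reveal spatial patterns, and spots can be grouped by block for print-tip normalization.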

Multiple testing adjustments

• The raw p-values are valid only individually
• There are several ways to deal with this, for example controlling the family-wise error rate or controlling the false discovery rate
• The “Holm” method controls the family-wise error rate: gives few significant genes
• The “FDR” method controls the false discovery rate: the proportion of false positives (genes that only appear to be differentially expressed) one can expect among the genes called significant
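Both adjustments can be sketched in a few lines of Python. The sketch assumes that “FDR” refers to the Benjamini-Hochberg procedure, as it does in R's `p.adjust`; the function names and the p-values are invented for the example.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values (controls family-wise error rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, ...
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values (controls false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        # The k-th smallest p-value is multiplied by m / k.
        running_min = min(running_min, m / (rank + 1) * p_values[i])
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print([round(p, 4) for p in holm_adjust(raw)])  # [0.03, 0.06, 0.06, 0.02]
print([round(p, 4) for p in bh_adjust(raw)])    # [0.02, 0.04, 0.04, 0.02]
```

At a 0.05 cut-off, Holm keeps two of the four genes while the FDR adjustment keeps all four, which matches the remark that Holm gives few significant genes.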

On normalization of trends

• Attention: trends might be confounded with interesting information
• Normalization might conceal this interesting information
• Solution: randomize!
• Example: spatial trends in microarrays (gene positions can be randomized); this is not possible for 2D-gels, where protein positions cannot be randomized
