Computational Approaches for Understanding Biological Significance of Microarray Data Liangjiang (LJ) Wang [email protected] KSU Bioinformatics Center, Biology Division June, 2005 Spotted Microarray Workshop


Page 1: Computational Approaches for Understanding Biological Significance of Microarray Data

Computational Approaches for Understanding Biological Significance of Microarray Data

Liangjiang (LJ) Wang

[email protected]

KSU Bioinformatics Center, Biology Division

June, 2005

Spotted Microarray Workshop

Page 2: Computational Approaches for Understanding Biological Significance of Microarray Data

• Sample acquisition

– RNA: purification, labeling

• Data acquisition

– Microarray: hybridization, washing, image analysis

• Data analysis

– Data: preprocessing, statistical inference, clustering analysis, . . .

– (Hypothesis generation)

• Biological insight

• Hypothesis testing

Page 3: Computational Approaches for Understanding Biological Significance of Microarray Data

Beyond Clustering Analysis

• Using GO to understand significant functional associations of a gene cluster.

• Mapping gene expression data onto biochemical pathways.

• Discovering regulatory elements shared by the promoters of co-expressed genes.

• Inferring gene regulatory networks.

Page 4: Computational Approaches for Understanding Biological Significance of Microarray Data

Genes in a Cluster May Be Co-Regulated

Multi-level regulation:

• Transcriptional regulation: RNA synthesis

• Posttranscriptional regulation: RNA processing, RNA turnover

• Translational regulation: protein synthesis

• Posttranslational regulation: protein modification and degradation

Microarrays measure steady-state levels of mRNAs.

Page 5: Computational Approaches for Understanding Biological Significance of Microarray Data

The Gene Ontology (GO)

• Provides controlled vocabularies for describing gene products in the domain of molecular biology.

• Enables a common understanding across model organisms and between databases.

• Consists of three unlinked hierarchies:

– Molecular function: elemental activity/task (what) (e.g., DNA binding, polymerase, transcription factor)

– Biological process: goal or objective (why) (e.g., mitosis, DNA replication, cell cycle control)

– Cellular component: location or complex (where) (e.g., nucleus, ribosome, pre-replication complex)

• Available at http://www.geneontology.org/.

Page 6: Computational Approaches for Understanding Biological Significance of Microarray Data

Example of Gene Ontology Hierarchy

[Figure: a portion of the GO biological process hierarchy. Edges are labeled i ("is a") or P ("part of"); "…" marks branches that are not shown. Terms in the figure:]

• Biological process (GO:0008150)

• Behavior (GO:0007610)

• Cellular process (GO:0009987)

• Development (GO:0007275)

• Physiological (GO:0007582)

• Communication (GO:0007154)

• Cell growth (GO:0008151)

• Cell death (GO:0008219)

• Cell aging (GO:0007569)

• Programmed (GO:0012501)

• Apoptosis (GO:0006915)

• Induction (GO:0012502)

• Autophagic cell death (GO:0048102)

• HS response (GO:0009626)

Page 7: Computational Approaches for Understanding Biological Significance of Microarray Data

Gene Annotation Using GO Terms

• Association of GO terms with gene products based on evidence from literature reference or computational analysis.

• The creation of GO and the association of GO terms with gene products (gene annotation) are two independent operations.

• A gene can be associated with one or more GO terms (gene categories), and one category normally has many genes (many-to-many relationship between genes and GO terms).
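The many-to-many relationship can be made concrete with a small sketch. The gene identifiers below (geneA, geneB, geneC) are hypothetical placeholders, and the helper name `invert_annotations` is an assumption for illustration; only the GO IDs come from the hierarchy example in this deck.

```python
from collections import defaultdict

# Hypothetical annotation table: each gene maps to one or more GO terms.
gene_to_terms = {
    "geneA": {"GO:0006915", "GO:0008219"},  # apoptosis, cell death
    "geneB": {"GO:0006915"},                # apoptosis
    "geneC": {"GO:0007154"},                # communication
}

def invert_annotations(gene_to_terms):
    """Build the reverse mapping: GO term -> set of annotated genes."""
    term_to_genes = defaultdict(set)
    for gene, terms in gene_to_terms.items():
        for term in terms:
            term_to_genes[term].add(gene)
    return dict(term_to_genes)

term_to_genes = invert_annotations(gene_to_terms)
print(sorted(term_to_genes["GO:0006915"]))  # ['geneA', 'geneB']
```

The reverse mapping is the form needed when counting how many genes on an array, or in a cluster, belong to a given GO category.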

Page 8: Computational Approaches for Understanding Biological Significance of Microarray Data

Example of GO Annotation

(The Gene Ontology Consortium, 2000)

Page 9: Computational Approaches for Understanding Biological Significance of Microarray Data

Genes from the Same Biological Process Tend to Be Co-Expressed

[Figure columns: Gene Names, Bio Process]

(The Gene Ontology Consortium, 2000)

Page 10: Computational Approaches for Understanding Biological Significance of Microarray Data

How to Assess Overrepresentation of a GO Term?

Genes on an array:

Total number of genes (N): 2,285

Number of genes – cell cycle (R): 161

Genes in a cluster:

Number of genes in the cluster (n): 147

Number of genes – cell cycle (r): 25

Is the GO term (i.e., cell cycle) significantly overrepresented in the cluster?

Page 11: Computational Approaches for Understanding Biological Significance of Microarray Data

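Using the Z-Statistic

• Assume the hypergeometric distribution.

• The z-score:

\[
z = \frac{\text{observed} - \text{expected}}{\text{stdev(observed)}}
  = \frac{r - n\,\frac{R}{N}}
         {\sqrt{n\,\frac{R}{N}\left(1 - \frac{R}{N}\right)\left(1 - \frac{n-1}{N-1}\right)}}
\]

• For the example:

\[
z = \frac{25 - 147 \cdot \frac{161}{2285}}
         {\sqrt{147 \cdot \frac{161}{2285}\left(1 - \frac{161}{2285}\right)\left(1 - \frac{147-1}{2285-1}\right)}}
  \approx 4.88
\]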
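As a quick check of the arithmetic, the z-score for the cell cycle example can be recomputed in a few lines. This is a minimal sketch; the function name `go_zscore` is an assumption, while N, R, n and r are the quantities defined on the previous slide.

```python
from math import sqrt

def go_zscore(N, R, n, r):
    """z-score for r category genes in a cluster of n, given R of N on the array."""
    p = R / N                      # fraction of array genes in the GO category
    expected = n * p               # expected category genes in the cluster
    # Hypergeometric variance: binomial variance times the finite-population correction.
    variance = n * p * (1 - p) * (1 - (n - 1) / (N - 1))
    return (r - expected) / sqrt(variance)

z = go_zscore(N=2285, R=161, n=147, r=25)
print(round(z, 2))  # 4.88
```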

Page 12: Computational Approaches for Understanding Biological Significance of Microarray Data

Using the Fisher Exact Test

• Contingency table:

                  Cluster: in   Cluster: out
GO class: in          a             b
GO class: out         c             d

a = r
b = R - r
c = n - r
d = N - R - n + r

• Probability of finding exactly a genes of the GO class in the cluster:

\[
p_a = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{N!\,a!\,b!\,c!\,d!}
\]

• The p value:

\[
p = \sum_{i=a}^{a+b} p_i
\]
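The one-sided p value can be sketched with the standard library alone. The factorial expression for the table probability is algebraically the same as a ratio of binomial coefficients, which is what the code uses; the function names are assumptions for illustration. The sum stops at min(R, n), since terms beyond that are zero.

```python
from math import comb

def table_probability(i, N, R, n):
    """Probability that exactly i of the n cluster genes belong to the GO class."""
    return comb(R, i) * comb(N - R, n - i) / comb(N, n)

def overrepresentation_pvalue(N, R, n, r):
    """One-sided p value: probability of r or more GO-class genes in the cluster."""
    upper = min(R, n)
    return sum(table_probability(i, N, R, n) for i in range(r, upper + 1))

p = overrepresentation_pvalue(N=2285, R=161, n=147, r=25)
print(f"p = {p:.2e}")
```

For the cell cycle example the p value is very small, in agreement with the large z-score obtained from the normal approximation.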

Page 13: Computational Approaches for Understanding Biological Significance of Microarray Data

MAPPFinder

• A tool for mapping gene expression data to the GO hierarchies.

• Part of the free software package GenMAPP.

• Available at http://www.genmapp.org/.

(Doniger et al., 2003)

Page 14: Computational Approaches for Understanding Biological Significance of Microarray Data

MAPPFinder Sample Output

(Doniger et al., 2003)

Page 15: Computational Approaches for Understanding Biological Significance of Microarray Data

GoMiner

(Zeeberg et al., 2003)

• A client-server application using Java (data on the server side).

• Available at http://discover.nci.nih.gov/gominer/.

Page 16: Computational Approaches for Understanding Biological Significance of Microarray Data

Onto-Express

• A web application for GO-based microarray data analysis (http://vortex.cs.wayne.edu/Projects.html).

• The input to Onto-Express is a list of Affymetrix probe IDs, GenBank sequence accessions or UniGene cluster IDs.

• Part of the integrated Onto-Tools, including:

– Onto-Compare: compares commercial arrays.

– Onto-Design: helps with array design (probe selection).

– Onto-Translate: provides mapping between different IDs.

[Figure: Onto-Express output (columns: p, GO, # genes) for genes linked to poor breast cancer outcome.]

Page 17: Computational Approaches for Understanding Biological Significance of Microarray Data

Pathway Visualization of Microarray Data

Yeast metabolic pathways

Page 18: Computational Approaches for Understanding Biological Significance of Microarray Data

Pathway Tools Software

• A software package for the creation, curation, querying and visualization of metabolic pathways.

• The Pathway Tools Omics Viewer is used to paint data from high-throughput experiments onto the metabolic pathway diagram for an organism.

• For microarray data, each reaction line is color-coded according to the gene expression level of the enzyme that catalyzes the reaction step.

• The Omics Viewer can also be used to display an animated time series.

• The Pathway Tools software is freely available to academics (http://biocyc.org/download.shtml).

Page 19: Computational Approaches for Understanding Biological Significance of Microarray Data

Web Access to the Omics Viewer

• Yeast biochemical pathways at http://pathway.yeastgenome.org/biocyc/.

• Arabidopsis thaliana biochemical pathways at http://www.arabidopsis.org/tools/aracyc/.

• For other species (including E. coli and human), http://biocyc.org/ECOLI/expression.html.

Page 20: Computational Approaches for Understanding Biological Significance of Microarray Data

Other Tools for Pathway Visualization

• KEGG (Kyoto Encyclopedia of Genes and Genomes, at http://www.genome.jp/kegg/kegg2.html):

– A database of curated metabolic and signaling pathways (mostly reference maps).

– KEGG Expression is for mapping microarray data to pathways and genomes (static view).

• GenMAPP (http://www.genmapp.org/):

– A software package for viewing and analyzing microarray data in the context of pathways.

– Allows users to modify pathways or design new pathways for their own use.

– MAPP files can be used for data exchange.

Page 21: Computational Approaches for Understanding Biological Significance of Microarray Data

Transcriptional Regulation

• Cells respond to various stimuli by regulating the expression of particular genes.

• Transcription factors regulate gene expression by binding to specific DNA sequence motifs.

• Transcription factor binding sites are often short (5-25 bases), degenerate DNA motifs.

• Co-expressed genes may have common regulatory motifs in their promoters.

[Figure: MyoD HLH dimer on DNA. Labels: H1, H2 and L for each monomer; DNA; CAACTGAC.]

Page 22: Computational Approaches for Understanding Biological Significance of Microarray Data

How to Represent a Promoter Motif?

• Multiple sequence alignment:

AGTCC
AGTCC
AGTAC
AGTAC
AGTGG
AGTGG
AACTT
AAGTT

• Consensus: e.g., TATAAAA (the TATA box).

• Position Weight Matrices (PWM): relative frequencies of nucleotides at different positions.

     1      2      3      4      5
A    1      0.25   0      0.25   0
C    0      0      0.125  0.25   0.5
G    0      0.75   0.125  0.25   0.25
T    0      0      0.75   0.25   0.25

• Sequence logo: information content of each site (a measure of tolerance for substitution).
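The frequency matrix on this slide can be recomputed from the eight 5-base sites of the alignment (AGTCC, AGTCC, AGTAC, AGTAC, AGTGG, AGTGG, AACTT, AAGTT). The sketch below also computes a per-position information content of the kind a sequence logo displays, assuming a uniform background and no small-sample correction; both function names are assumptions for illustration.

```python
from collections import Counter
from math import log2

sites = ["AGTCC", "AGTCC", "AGTAC", "AGTAC",
         "AGTGG", "AGTGG", "AACTT", "AAGTT"]

def frequency_matrix(sites):
    """Relative frequency of each base at each alignment position."""
    length = len(sites[0])
    matrix = {base: [0.0] * length for base in "ACGT"}
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        for base in "ACGT":
            matrix[base][pos] = counts[base] / len(sites)
    return matrix

def information_content(matrix, pos):
    """2 bits minus the Shannon entropy of the column (uniform background assumed)."""
    freqs = [matrix[base][pos] for base in "ACGT"]
    entropy = -sum(f * log2(f) for f in freqs if f > 0)
    return 2.0 - entropy

pwm = frequency_matrix(sites)
for base in "ACGT":
    print(base, pwm[base])
print([round(information_content(pwm, i), 2) for i in range(5)])
```

Position 1 (always A) carries the full 2 bits, while position 4 (all four bases equally frequent) carries none, which is why it tolerates any substitution.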

Page 23: Computational Approaches for Understanding Biological Significance of Microarray Data

Discovery of Shared Promoter Motifs

• Co-expressed gene cluster

– Promoter retrieval from SCPD, GBrowser, etc.

• Promoters

– Motif discovery using MEME, AlignACE, etc.

• Shared motifs

– Factor identification in TransFac, TESS, etc.

• Transcription factors

• Hypothesis testing

Page 24: Computational Approaches for Understanding Biological Significance of Microarray Data

Retrieval of Promoter Sequences

• Yeast promoters may be retrieved from SCPD (http://cgsigma.cshl.org/jian/HTML/retrieveseq.html) using gene or ORF names (batch retrieval).

• For model organisms, promoter sequences of RefSeq genes may be downloaded from the UCSC Genome Browser site at: http://hgdownload.cse.ucsc.edu/downloads.html.

• Gene prediction tools predict transcription start sites (TSS). The sequence upstream of TSS may be used as the promoter.
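Taking the sequence upstream of a TSS can be sketched as follows. This is an illustration only: the function names, the 0-based coordinate convention and the default length of 500 bases are assumptions, not part of any tool named on this slide. For a gene on the minus strand the upstream region lies at higher coordinates and is reverse-complemented.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def upstream_region(chrom_seq, tss, strand, length=500):
    """Return up to `length` bases upstream of the TSS.

    `tss` is the 0-based index of the first transcribed base on the
    forward strand of `chrom_seq`; `strand` is "+" or "-".
    """
    if strand == "+":
        start = max(0, tss - length)
        return chrom_seq[start:tss]
    end = min(len(chrom_seq), tss + 1 + length)
    return reverse_complement(chrom_seq[tss + 1:end])

chrom = "AAAACCCCGGGGTTTT"
print(upstream_region(chrom, tss=8, strand="+", length=4))   # CCCC
print(upstream_region(chrom, tss=11, strand="-", length=3))  # AAA
```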

Page 25: Computational Approaches for Understanding Biological Significance of Microarray Data

Statistical Overrepresentation of Motifs

Enumerate all the possible motif patterns with ambiguous characters, e.g., CWTNC, CRTGTW, YCGGAYRRAWG, …… over {A, C, G, T, R, Y, S, W, M, K, V, H, D, B, N}.

Count the occurrences of all the motif patterns in the promoter sequences.

Compute statistical significance based on the background distribution, e.g., the z-score:

\[
z = \frac{N - E(X)}{\sigma(X)}
\]

(here N is the observed number of occurrences of a pattern, and E(X) and σ(X) are its expected value and standard deviation under the background distribution)

(The method has been used to discover novel transcription factor binding sites in yeast)
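The counting and scoring steps can be sketched as below. IUPAC ambiguity codes are translated into a regular expression, and overlapping matches are counted with a lookahead. The background here is an assumption for illustration: E(X) and σ(X) are estimated empirically from a collection of background promoter sets, which is only one of several possible background models. Function names and the toy sequences are hypothetical.

```python
import re
from statistics import mean, pstdev

# IUPAC nucleotide ambiguity codes as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "M": "[AC]", "K": "[GT]", "V": "[ACG]", "H": "[ACT]",
    "D": "[AGT]", "B": "[CGT]", "N": "[ACGT]",
}

def count_pattern(pattern, sequences):
    """Count occurrences (overlaps included) of an IUPAC pattern in the sequences."""
    regex = re.compile("(?=" + "".join(IUPAC[c] for c in pattern) + ")")
    return sum(len(regex.findall(seq)) for seq in sequences)

def pattern_zscore(pattern, cluster_seqs, background_sets):
    """z = (observed - mean) / stdev, with mean and stdev taken over background sets."""
    observed = count_pattern(pattern, cluster_seqs)
    counts = [count_pattern(pattern, seqs) for seqs in background_sets]
    sigma = pstdev(counts)
    if sigma == 0:
        raise ValueError("background counts have zero variance")
    return (observed - mean(counts)) / sigma

# Toy data, for illustration only.
cluster = ["CATGCCTTAC"]
background = [["CATGC"], ["GGGGG"]]
print(count_pattern("CWTNC", cluster))               # 2
print(pattern_zscore("CWTNC", cluster, background))  # 3.0
```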

Page 26: Computational Approaches for Understanding Biological Significance of Microarray Data

MEME: Multiple EM for Motif Elicitation

• A widely used motif discovery tool available at http://meme.sdsc.edu/meme/website/meme.html.

• Based on the Expectation Maximization (EM) algorithm with several extensions.

• A sequence motif is represented as a Position Weight Matrix (PWM).

Page 27: Computational Approaches for Understanding Biological Significance of Microarray Data

Gibbs Sampling Methods

• For motif discovery, Gibbs sampling can be viewed as a stochastic analog of EM.

• Gibbs sampling may be less susceptible to local optima than EM.

• Gibbs Motif Sampler at http://bayesweb.wadsworth.org/gibbs/gibbs.html.

• AlignACE at http://atlas.med.harvard.edu/.

Page 28: Computational Approaches for Understanding Biological Significance of Microarray Data

Search for Transcription Factors

• TransFac (http://www.gene-regulation.com/):

– A database of eukaryotic transcription factors and their DNA binding sites (PWMs).

– Provides a classification of transcription factors.

– Match™ uses the PWMs to search for potential transcription factor binding sites.

• TESS (Transcription Element Search System, at http://www.cbil.upenn.edu/cgi-bin/tess/tess?RQ=WELCOME):

– A web tool for predicting TF binding sites.

– Uses PWMs from TransFac and others.
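How a PWM is used to search a sequence can be illustrated with a generic log-odds scan. This is not the algorithm of Match or TESS, whose scoring schemes are not described here; it is a minimal sketch that assumes a uniform background (0.25 per base), a small pseudocount and an arbitrary score threshold. The frequency matrix is the 5-position example from earlier in this deck.

```python
from math import log2

# Frequency matrix from the PWM example in this deck.
FREQS = {
    "A": [1.0, 0.25, 0.0, 0.25, 0.0],
    "C": [0.0, 0.0, 0.125, 0.25, 0.5],
    "G": [0.0, 0.75, 0.125, 0.25, 0.25],
    "T": [0.0, 0.0, 0.75, 0.25, 0.25],
}

def log_odds(freqs, pseudocount=0.01, background=0.25):
    """Convert frequencies to log2 odds against a uniform background."""
    total = 1.0 + 4 * pseudocount
    return {
        base: [log2((f + pseudocount) / total / background) for f in column]
        for base, column in freqs.items()
    }

def scan(sequence, matrix, threshold):
    """Score every window of the sequence; return (position, score) for hits."""
    width = len(matrix["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[base][i] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, round(score, 2)))
    return hits

hits = scan("TTAGTCCTT", log_odds(FREQS), threshold=5.0)
print(hits)  # one hit, at position 2 (the window AGTCC)
```

The pseudocount keeps unseen bases from producing a score of minus infinity; the threshold trades sensitivity against false positives and would need tuning for real data.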

Page 29: Computational Approaches for Understanding Biological Significance of Microarray Data

Inferring Gene Regulatory Networks

(Segal et al., 2003. Bioinformatics, 19:i273-i282)

• Integrate gene expression and promoter data into a unified model.

• Use a probabilistic graphical model trained using the EM algorithm.

• Validate the model using GO and protein complex information.

Page 30: Computational Approaches for Understanding Biological Significance of Microarray Data

Yeast Respiration and Carbon Regulation

(Segal et al., 2003. Nature Genetics, 34:166-176)

Page 31: Computational Approaches for Understanding Biological Significance of Microarray Data

Summary

• “Statistical significance is fine, but biological significance is better” (Baxevanis and Ouellette, 2005).

• Gene Ontology (GO) can be used to assess significant functional associations of a gene cluster or a list of significant genes.

• Pathway visualization allows one to interpret microarray results in a pathway context.

• Motif discovery tools can be used to identify common regulatory elements shared by the promoters of co-expressed genes.