60
Methods for function prediction based on diverse large-scale data Some slides borrowed from Serafim Batzoglou, Stanford University

Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Methods for function prediction based on diverse large-scale data

Some slides borrowed from Serafim Batzoglou, Stanford University

Page 2: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

The central densest region of interaction network (15000 interactions) from yeast. The interactions were collected from large-scale studies and the MIPS and BIND databases. Known molecular complexes can be seen clearly, as well as a large, previously unsuspected nucleolarcomplex.

Page 3: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Abundance of high-throughput data• Sequence data

– Transcription factor binding sites– Functional domains

• Location data– Colocalization

• Physical association data – Yeast two hybrid– Co-IP

• Genetic relationship data– Synthetic lethality– Synthetic rescue– Synthetic interaction

Data explosion

Information explosion

?

Page 4: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Overcoming noise with data integration

Gene clustering based on Gene clustering based on Microarray AnalysisMicroarray Analysis: :

Gene groupings based on Gene groupings based on Other Experimental DataOther Experimental Data: :

••TF binding sitesTF binding sites•• Affinity precipitationAffinity precipitation•• Two HybridTwo Hybrid

Gene grouping based on Gene grouping based on diverse highdiverse high--throughput datathroughput data

MetabolismMetabolism Function unknownFunction unknown

Page 5: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

General representation for gene groupings

Gene AGene AGene BGene BGene CGene C

Cluster 1Cluster 1

Gene AGene AGene DGene D

Cluster 2Cluster 2

Gene A Gene B Gene C Gene DGene A 1 1 1 1Gene B 1 1 1Gene C 1 1 1Gene D 1 1

Page 6: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

gene gene i…mi…m

gene

ge

ne i.

.mi..m

1 1 1 1 11

1 11 1

1 1 11 1 11 1 1 1 1

11 1

1 1

gene gene i…mi…m

gene

ge

ne i.

.mi..m

1 1 1 11

1 11

1 11

1 1 1 11

1 11 1

gene gene i…mi…m

gene

ge

ne i.

.mi..m

1 1 11 1

1 11 1

1 1 1 11

1 1 1 11

1 1 11 1

Apply MAGIC algorithm to Apply MAGIC algorithm to each gene each gene ii-- gene gene jj pairpair

gene gene i…mi…m

gene

ge

ne i.

.mi..m

1.0 1.0 0.3 0.8 0.91.0

1.0 0.11.0 0.4

1.0 1.00.3 0.4 1.00.8 0.1 1.0 0.8

1.00.9 1.0

0.8 1.0

MAGIC: Multi-source Association of Genes by Integration of Clusters

Page 7: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

A brief intro to Bayesian networks• Bayes rule• Bayesian networks – graphical probabilistic

models• Inference and learning

Page 8: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

The “famous” sprinkler Bayes netBayes nets are graphical probabilistic models

Conditional probability tables contain the priors

Murphy, K. “A Brief Introduction to Graphical Models and Bayesian Networks”

Page 9: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

The “famous” sprinkler Bayes netPrior probability that it is cloudy

Conditional probability that it rains when it’s cloudy

Probability that grass is wet when the sprinkler is off and it rains

Page 10: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

InferenceObserve: grass is wetTwo possible causes: either it is raining, or the sprinkler is on. Which

is more likely? Use Bayes' rule to compute the posterior probability of each explanation.

is a normalizing constant (probability (likelihood) of the data).

=> it is more likely that the grass is wet because it is raining: the likelihood ratio is 0.7079/0.4298 = 1.647.

Page 11: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Learning Bayesian networks

• Two learning problems: structure + CPTs

Due either to missing data or to hidden nodes

Page 12: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Back to gene function predictionwith Bayes nets

Page 13: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

MAGIC Bayesian network

Page 14: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Gene Ontology

• Three hierarchies: – Molecular function– Biological process– Cellular component

• Curated annotation

Biological process

Cell growth and/or maintenance

Metabolism Cell organization and biogenesis

Nucleobase, nucleoside,

nucleotide and nucleic acid metabolism

Protein metabolism

DNA metabolism

Establishment and/or maintenance

of chromatin architecture

DNA packaging

Nuclear organization and biogenesis

Chromosome organization and

biogenesis

Protein modification

Cell cycle

Mitotic cell cycle

Mitotic cell cycle

Transcription

Page 15: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Biological process

Cell growth and/or maintenance

Metabolism Cell organization and biogenesis

Nucleobase, nucleoside,

nucleotide and nucleic acid metabolism

Protein metabolism

DNA metabolism

Establishment and/or maintenance

of chromatin architecture

DNA packaging

Nuclear organization and biogenesis

Chromosome organization and

biogenesis

Protein modification

Cell cycle

Mitotic cell cycle

Mitotic cell cycle

Transcription

Evaluation method

pairs of num totalGOterm same toannotated geneB &geneA s.t. pairs numTPproportion =

pairs of num totalGOterm same the toannotatednot geneB&geneA s.t. pairs numFPproportion =

Predicted Predicted Gene PairsGene Pairs

Page 16: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

0 1000 2000 3000 4000 5000FP

TPMAGIC

SOM

K-means

HIER

MAGIC performs better than input methods over a range of FP levels

0

100

200

300

400

500

600

700

800

900

1000

0 100 200 300FP

TPMAGIC

SOM

K-means

HIER

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

0 1000 2000 3000 4000 5000FP

TP

MAGIC Non-Expression only

MAGIC Expression only

MAGIC

SOM

K-means

HIER

Page 17: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

UbiquitinUbiquitin--dependent dependent protein catabolismprotein catabolism

10 genes in cluster:10 genes in cluster:9 ubiquitin9 ubiquitin--dependent dependent

protein catabolismprotein catabolism1 nucleotide1 nucleotide--excision excision

repairrepair

Page 18: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Vacuolar Vacuolar acidification acidification clustercluster9 genes in cluster:9 genes in cluster:7 vacuolar acidification7 vacuolar acidification2 phosphate transport2 phosphate transport

Page 19: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Protein Protein biosynthesis biosynthesis clustercluster•• 49 protein biosynthesis49 protein biosynthesis•• 9 other functions9 other functions•• 10 unknown 10 unknown function function

predictionprediction

Page 20: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Next Steps• Experiment with automatically learning the

weights (in progress)• MAGIC for specific processes (in progress)• Test best predictions experimentally• Integrate system with the Saccharomyces

Genome Database for public use

Page 21: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Promoter finding & expression modules

What are promoters?Finding regulatory regions from MSAs

Using coexpression in motif findingRegulatory modules

Page 22: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

What are promoters and why are they important?

Page 23: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Opportunities for gene regulation

• Opening of DNA duplex

• Transcription• mRNA stability• Translation• Protein stability• Protein modification

Page 24: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Transcriptional regulation

• Thought to be the most used • Does not waste intermediate products

(mRNA, protein, etc)• But transcriptional regulation is slow, and

thus may not be used in cases when fast, transient regulation is necessarily

Page 25: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –
Page 26: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –
Page 27: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Transcriptional activation

Page 28: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Transcription Factors Bind DNA

Transcription factors bind DNA in a specific manner.

Binding recognizes DNA substrings called regulatory motifs

Page 29: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Regulation of Genes

Transcription Factor(Protein)

RNA polymerase(Protein)

DNA

Regulatory Element Gene

Page 30: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Regulation of Genes

RNA polymerase

Transcription Factor(Protein)

DNA

Regulatory Element Gene

Page 31: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Regulation of Genes

New proteinRNA

polymeraseTranscription Factor

DNA

GeneRegulatory Element

Page 32: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Example: A Human heat shock protein

• TATA box: positioning transcription start

• TATA, CCAAT: constitutive transcription• GRE: glucocorticoid response• MRE: metal response• HSE: heat shock element

TATASP1CCAAT AP2HSEAP2CCAAT--158

GENE

0SP1

promoter of heat shock hsp70

Page 33: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Gene Regulatory NetworkIf C then D

A B Make DCgene D

• Genes =? wires• Motifs =? gates

If B then NOT D

DIf A and B then D

Make BD Cgene B

If D then B

Page 34: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Regulatory Networks

Page 35: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Identifying TF factor binding sites directly – “ChIP on Chip”

Page 36: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Protein - TF

DNA

Incubation with antibodies against TF

Hybridization to intragenic array

Precipitation

Regions that are enriched on the array as compared to whole-cell lysate correspond to putative binding sites

Page 37: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Promoter finding from genome sequences

Page 38: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Multiple sequence alignment

Multiple sequence alignment of Saccharomyces species.

Species relatively closely related to S. cerevisiae are likely to be most useful for identifying conserved non-coding sequences.

Alignments of S. cerevisiae protein sequences to their orthologs in species that are more distantly related to S. cerevisiae are likely to be more useful for identifying possible functional domains in protein sequence

Cliften et al. Science 301 2003

Page 39: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Promoters can be identified from multiple sequence alignment

Cliften et al. Science 301 2003

Reb1 - RNA polymerase I Enhancer Binding protein

Ume6 - Regulator of both repression and induction of early meiotic genes.

Ndt80 - DNA binding transcription factor that activates middle sporulation genes

Page 40: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Promoter finding based on transcriptional profiles

Page 41: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Co-regulated genes are co-expressed

Expression profiles of 53 genes in S. cerevisiaegenome that contain the exact match to an MCB box in their promoters (profiles normalized by mean & variance).

Cliften et al. Science 301 2003

Page 42: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Gibbs Sampling in Motif FindingGibbs Sampling in Motif Finding

Page 43: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Idea

• Identify sets of co-regulated genes from microarrays– Unsupervised analysis - clustering– Supervised analysis

• Identify common motifs in regulatory regions of co-regulated genes– Combinatorial methods (enumeration with tricks)– Probabilistic methods (EM, Gibbs Sampling – a special

case of MCMC)

Page 44: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –
Page 45: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Gibbs Sampling• Given:

– x1, …, xN sequences– motif length K,– background B,

• Find:– Model M– Locations a1,…, aN in x1, …, xN

Maximizing log-odds likelihood ratio:

∑∑= = +

+N

i

K

ki

ka

ika

i

i

xBxkM

1 1 )(),(

log

Page 46: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Gibbs Sampling (2)• AlignACE: first statistical motif finder• BioProspector: more recent, faster algorithm

with higher accuracyAlgorithm (sketch):1. Initialization:

a. Select random locations in sequences x1, …, xN

b. Compute an initial model M from these locations

2. Sampling Iterations:a. Remove one sequence xi

b. Recalculate modelc. Pick a new location of motif in xi according to

probability the location is a motif occurrence

Page 47: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Gibbs Sampling (3)

Initialization:• Select random locations a1,…, aN in x1, …, xN

• For these locations, compute M:

∑=

+ ==N

ikakj jx

NM

i1

)(1

• That is, Mkj is the number of occurrences of letter j in motif position k, over the total

Page 48: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Gibbs Sampling (4)Predictive Update:

• Select a sequence x = xi

• Remove xi, recompute model: M

Again, Mkj is the proportion of occurrences of letter j in motif position k

))(()1(

1,1∑

≠=+ =+

+−=

N

isskajkj jx

BNM

where βj are pseudocounts to avoid 0s,and B = Σj βj

Page 49: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Gibbs Sampling (5)Sampling:For every K-long word xj,…,xj+k-1 in x:

Qj = P( word | motif ) = M(1,xj)×…×M(k,xj+k-1)Pi = P( word | background ) = B(xj)×…×B(xj+k-1)

Let

Sample a random new position ai according to the probabilities A1,…, A|x|-k+1 (new location for the motif)

∑+−

=

= 1||

1/

/kx

jjj

jjj

PQ

PQA

0 |x|

Prob

How “overrepresented” this word is in motif vs. background

Represents weights for sampling (words more different from background get higher weight)

Page 50: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Gibbs Sampling (6)

Running Gibbs Sampling:

1. Initialize

2. Run until convergence

3. Repeat 1,2 several times, report common motifs

Page 51: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Advantages / Disadvantages• Very similar to EM (essentially EM’s stochastic analog)

Advantages:• Easier to implement• Less dependent on initial parameters• Less likely to converge to local minima than EM• More versatile, easier to enhance with heuristics

Disadvantages:• More dependent on all sequences to exhibit the motif• Less systematic search of initial parameter space (doesn’t

converge to point estimate like EM)

Page 52: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Regulatory ModulesSegal et.al. Nature Genetics 32(2):166 2004

Page 53: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Regulatory modules• Regulatory Module – set of genes that are co-

regulated by a shared regulation program• Regulation program – set of regulatory genes and

expression profile of genes in the module as a function of expression of regulatory genes

• Assumption – regulators are themselves transcriptionally regulated

• Input: gene expression data set, list of putative regulator genes

• Output: partition of genes into modules and a regulation program for each module

Page 54: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –
Page 55: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Regulation programs

Page 56: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Defining regulation programs• Regulation program specifies:

– Set of contexts (rules describing behavior of genes in modules e.g. upregulation)

– Response of modules in each context• Contexts organized in regression tree

– Decision nodes are regulators – Each path to a leaf defines a context using texts on the

path– Contexts effectively specify sets of arrays– Context model: normal distribution over the expression

of the module’s genes in these arrays (mean, variance stored in corresponding leaf)

– Small variance => tight regulation

Page 57: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Learning Module Networks with EM• E-step - Given g’s inferred regulation program, find module that

best predicts g’s behavior– For each gene g:

• Calculate:

context c of array j is defined as • Reassign gene g to the program that gives highest

• M-step - Given partition of genes into modules, learn best regulation program (tree) through combinatorial search of trees

– Tree grown from root to leaves, for each regulatory node:

– Choose query that best partitions gene expression into two distinct distributions

– Stop when no such split exists

)( cc ,σµN)|()_|( contextarraygpprogramregulatorygP j

arraysj∏=

)_|( programregulatorygP

Page 58: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

EM learning details

• Initialize with clusters• Converges after 23 iterations to the 50

modules (initial assignments changed for 49% of genes)

Page 59: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Predicted Modules

• Regulators (in ovals) are connected to modules (numbered squares)– Red line – regulation supported in

literature, dashed line - inferred

• Module groups (boxes) share common motifs and sometimes common function

• Yellow regulators tested experimentally

Page 60: Methods for function prediction based on diverse large ...€¦ · Methods for function prediction based on diverse large-scale data ... – Transcription factor binding sites –

Limitations• Requires a set of putative transcription factors as input and

a predefined number of modules• Cannot find regulators whose expression does not change

sufficiently for detection• Cannot identify multiple regulators that participate in a

regulatory even, will only identify one of them• Can mistakenly identify a gene as regulator because it is

highly predictive of a module either b/c it’s a member of the module or by chance (gene has to be a member of the putative regulator set

• Will not identify regulatory events specific to regulator and its target

• Can only handle non-overlapping modules – a gene can belong to only one module (other methods address this problem)