Predicting the function of a protein from either a sequence or a structure (is not trivial)
Adam Godzik
The Sanford-Burnham Medical Research Institute


Summary - overview

Homology based methods
Analogy based methods
Physics based methods
Why function prediction?

What we mean by function

Multilevel definition
  Phenotype
  Cellular function
  Molecular function (activity)
    Substrates
    Inhibitors
    Cofactors

Several attempts to develop a unified function classification
  EC classification for enzymes, e.g. 4.2.31.101
  MEROPS (proteases), CAZy (hydrolases)
  Gene Ontology
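The four-field layout of an EC number lends itself to a small data structure. The sketch below is only an illustration: the names `ECNumber`, `parse_ec` and `shared_levels` are hypothetical helpers, not part of any existing library, and the class names are the top-level classes of the standard enzyme nomenclature.

```python
from typing import NamedTuple

# Top-level EC classes of the standard enzyme nomenclature.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


class ECNumber(NamedTuple):
    """Hypothetical container for the four levels of an EC number."""
    enzyme_class: int
    subclass: int
    sub_subclass: int
    serial: int


def parse_ec(text: str) -> ECNumber:
    """Split a dotted EC number such as '4.2.31.101' into its four levels."""
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {text!r}")
    return ECNumber(*(int(p) for p in parts))


def shared_levels(a: ECNumber, b: ECNumber) -> int:
    """Count how many leading levels two EC numbers have in common (0 to 4)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


ec = parse_ec("4.2.31.101")  # the example number quoted on the slide
print(EC_CLASSES[ec.enzyme_class])
print(shared_levels(parse_ec("4.2.1.1"), parse_ec("4.2.3.5")))
```

Counting shared leading levels gives one crude way to say how close two molecular functions are within this hierarchy. It says nothing about phenotype or cellular function, the other levels listed above.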

Two complementary views of the evolution and diversity of life

Organisms (species)
Genes (proteins)

Both are amazingly large and diverse

Organisms (species)
  About 1.5M known today; 10-100 million species estimated to exist, depending on the definition of species and other assumptions
  Their relations can be described in a tree of life, at least for eukaryotes
  The bacterial and archaeal tree of life is much more controversial; some even dispute the concept of species for bacteria

Proteins
  With a 20 amino acid alphabet, the number of possible protein sequences is very large (20^100, i.e. about 1.3*10^130, for short proteins of 100 residues (!))
  Total number: >10 billion? 10-100M species, with ~4K genes in a bacterial and ~10K in a eukaryotic genome
  Over 25 million known today, i.e. ~0.2%
  Representative sample?
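The back-of-the-envelope numbers on this slide can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures quoted on the slide; the variable names are illustrative.

```python
import math

# Size of sequence space for a 100-residue protein over a 20-letter alphabet.
ALPHABET = 20
LENGTH = 100
n_sequences = ALPHABET ** LENGTH
exponent = math.floor(math.log10(n_sequences))
mantissa = n_sequences / 10 ** exponent
print(f"20^100 = {mantissa:.2f}e{exponent}")

# Rough bounds on the number of natural proteins, from the slide's estimates.
species_low, species_high = 10_000_000, 100_000_000
genes_bacterial, genes_eukaryotic = 4_000, 10_000
lower = species_low * genes_bacterial      # every species bacterial-sized
upper = species_high * genes_eukaryotic    # every species eukaryote-sized
print(f"between {lower:.0e} and {upper:.0e} proteins")

# Fraction known, measured against the slide's 10 billion lower bound.
known = 25_000_000
fraction_known = known / 10_000_000_000
print(f"known fraction: {fraction_known:.2%}")
```

Even the lower bound on natural proteins is smaller than the space of possible 100-residue sequences by roughly 120 orders of magnitude.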

Of the 25 million proteins known today

Direct experimental data are available for a few thousand proteins

Indirect experimental data are available for perhaps a few hundred thousand

Structures of ~60 thousand have been solved

The protein universe seems to be very large. But is it random?

Many proteins (like species) are close relatives

Histone H1 (human) - histone H1 (chicken)

SRRSASHPTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAA
|  |  || || || ||| ||| | |||||||||||||||||| ||| |||||| ||
SKKSTDHPKYSDMIVAAIQAEKNRAGSSRQSIQKYIKSHYKVGENADSQIKLSIKRLVTT

similarity: 77% id, BLAST E-value 0.0
function: two H1 histones from different species (orthologs)
Their functions and structures are obviously very similar
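Percent identity is the simplest similarity measure used on these slides. The sketch below computes it for an already aligned pair. The function name `percent_identity` is hypothetical, and the choice of denominator (aligned, non-gap columns) is an assumption, since alignment programs differ on this point.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent of aligned (non-gap) columns in which both sequences agree."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


# The two 60-residue fragments shown on the slide.
human = "SRRSASHPTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAA"
chicken = "SKKSTDHPKYSDMIVAAIQAEKNRAGSSRQSIQKYIKSHYKVGENADSQIKLSIKRLVTT"

print(f"{percent_identity(human, chicken):.1f}% identical")
```

On the fragment shown this gives about 73% (44 of 60 positions). The 77% quoted on the slide presumably comes from a longer alignment of the same two proteins.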

Can we organize the protein universe into neighborhoods (families)?

Number of protein clusters (modeling families) grows linearly with the number of protein sequences (and exponentially in time)

[Figure: rate of discovery, cumulative total. x-axis: number of sequences (millions), 0 to 7; y-axis: number of clusters (thousands), 0 to 250; one curve each for cluster size >=3, >=5, >=10 and >=20. From Yooseph et al., PLoS Biology (2007) 5:e16]

How many protein families are still out there?

How far can we go?

Histone H5 - histone H1

TYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRLA
   | |  |  | | | | |     |               |||    |   | | |||| ||||||||
SVTELITKAVSASKERKGLSLAALKKALAAGGYDVEKNNSRIKLGLKSLVSKGTLVQTKGTGASGSFRLS

similarity: 40% seq id, BLAST E-value 10^-15
function: two histones (paralogs)
Structures still very similar, functions somewhat different, but obviously similar

This is surely too far?

Histone H5 - transcription factor E2F-4

PTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRL
GLLTTKFVSLLQEAKD-GVLDLKLAADTLA------VRQKRRIYDITNVLEGIGLIEKKS----KNSIQW

similarity: 7% seq id, BLAST E-value 1


Is it?

Structure – obviously similar (2.4 Å RMSD over 80 aa)

function – clearly related (both bind DNA)

More subtle similarity can be detected with more sophisticated methods
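The RMSD quoted for the structural comparison is the root-mean-square distance between equivalent atoms. The sketch below computes it for two coordinate sets that are assumed to be already superimposed and listed in matching order. Real structure comparison programs first search for the equivalences and the optimal superposition, which this example does not attempt; the function name `rmsd` and the coordinates are illustrative.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Root-mean-square deviation between paired 3D points.

    Assumes both coordinate sets are already superimposed and that
    a[i] is the structural equivalent of b[i].
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(a, b):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(a))


# Two made-up pairs of points: one identical, one displaced by 2 units.
first = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
second = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(rmsd(first, second))
```

An RMSD such as the 2.4 Å over 80 residues quoted above is only meaningful together with the number of residues it was computed over.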


We can keep adding more layers

Most “function assignments” are provided by predicted homology

Unknown protein
GLLTTKFVSLLQEAKDGVLDLKLAADTLAVRQKRRIYDITNVLEGIGLIEKKSKNSIQW

Well studied protein
SRRSASHPTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAA

Similarity -> homology -> prediction?

Similarity -> homology based annotations

Recognition of close and/or distant homologs based on similarity
  Sequence
  Sequence/profile, profile/profile
  Structure

Problems
  How to predict differences? Even homologous proteins evolve and change!

Prediction by homology

Recognition
  Are there any well characterized proteins similar to my protein?
  Can we assume they are homologous?
Alignment
  What is the position-by-position target/template equivalence?
Modeling
  Structure of my protein is similar to the other one
Function prediction
  Function of my protein is similar to the other one

We could predict

Activity
Role in the whole organism
3D structure
Structure of a complex

Important distinction

Similarity
  Two proteins have similar sequences/structures/functions (s/s/f) if, by some metric, the s/s/f of one protein is more similar to the s/s/f of the other than to that of a randomly chosen protein
Homology
  Two proteins are homologous if they have evolved from a common ancestor
Common error
  “Two proteins are 65% homologous”
What we really meant
  The sequences of two proteins are 65% similar, therefore we can safely assume they are homologous; why else would they be so similar?

If life were easy, this is what it would look like

similar = homologous
not similar = unrelated

Not (obviously) similar, but (probably) homologous

Histone H5 and transcription factor E2F-4: identity 7%, similar fold, similar function (DNA binding)

PTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRL
GLLTTKFVSLLQEAKD-GVLDLKLAADTLA------VRQKRRIYDITNVLEGIGLIEKKS----KNSIQW

Similar, but not homologous

Phosphoribosyltransferase and viral coat protein: identity 42%, different folds, different functions

 99 IRLKSYCNDQSTGDIKVIGGDDLSTLTGKNVLIVEDIIDTGKTMQTLLSLVRQY.NPKMVKVASLLVKRTPRSVGY 173
    : ||.  |||  ||          |.    || |  : |   |  |  | || |  || |:|      | ||.| |
214 VPLKTDANDQ.IGDSLY....SAMTVDDFGVLAVRVVNDHNPTKVT..SKVRIYMKPKHVRV...WCPRPPRAVPY 279

Similarity vs. homology

similar and homologous
not similar and unrelated
not similar but homologous
similar but not homologous

Can we return to this simple picture by redefining similarity?

similar = homologous
not similar = unrelated


Are these two protein families related?

New protein (target)

KAAELEMEKEQILRSLGEISVHNCMFKLEECDREEIEAITDRLTKRTKTVQVVVETPRNEEQKKALEDATLMIDEVGEMMHSNIEKAKLCLQ

Known protein (template)

VKKDALENLRVYLCEKIIAERHFDHLRAKKILSREDTEEISCRTSSRKRAGKLLDYLQENPKGLDTLVESIRREKTQNF

How to compare two families?

$$\mathrm{Score}(\mathrm{Fam}_1, \mathrm{Fam}_2) = \sum_{\mathrm{alignment}} A_{li}\, MM_{ij}\, B_{jm} = \;?$$

Compare as vectors in 21-dimensional space (FFAS)

Profile-profile similarity
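The idea of comparing families as vectors can be illustrated with a toy example. The sketch below is not the FFAS scoring function. It assumes that the 21 dimensions are the 20 amino acids plus a gap symbol, uses plain column frequencies without sequence weighting, and compares columns at the same index instead of searching for an optimal alignment. All names and the two tiny families are hypothetical.

```python
import math
from collections import Counter
from typing import List

# Assumed 21-letter alphabet: 20 amino acids plus a gap symbol.
SYMBOLS = "ACDEFGHIKLMNPQRSTVWY-"


def column_profile(alignment: List[str]) -> List[List[float]]:
    """Turn a multiple alignment into one 21-dimensional frequency vector per column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        profile.append([counts.get(sym, 0) / len(alignment) for sym in SYMBOLS])
    return profile


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two profile columns (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Two made-up families of already aligned sequences.
fam1 = ["ACDE", "ACDD", "AC-E"]
fam2 = ["ACDE", "GCDE"]

p1, p2 = column_profile(fam1), column_profile(fam2)
scores = [cosine(u, v) for u, v in zip(p1, p2)]
print([round(s, 3) for s in scores])
```

A column that is fully conserved in both families scores 1.0, while a column that varies in one family scores lower. Summing such column scores along an alignment gives a family-to-family score in the spirit of the question on the previous slide.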

How to validate a protocol
1. Recognition

Folding benchmarks
  from structural clustering of PDB (several sets, 700 pairs used here) compared to sequence based clustering of the same group of proteins
  correct predictions vs. wrong predictions
CASP meetings, CAFASP, LiveBench
  published and/or publicly available predictions, fold prediction servers, available prediction programs
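A "correct vs. wrong predictions" benchmark can be summarized by ranking all predictions by score and counting how many correct ones appear before a given number of wrong ones. The sketch below shows that bookkeeping. The function name, the scores and the correctness labels are invented for illustration and are not taken from the benchmark on the slide.

```python
from typing import Iterable, Tuple


def correct_before_errors(predictions: Iterable[Tuple[float, bool]], max_wrong: int) -> int:
    """Count correct predictions ranked above the (max_wrong + 1)-th wrong one.

    Each prediction is (score, is_correct); higher scores rank first.
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    correct = 0
    wrong = 0
    for _score, is_correct in ranked:
        if is_correct:
            correct += 1
        else:
            wrong += 1
            if wrong > max_wrong:
                break
    return correct


# Hypothetical scored pairs: True means the pair shares a fold in the structural benchmark.
example = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False), (0.4, True)]

for allowed in (0, 1, 2):
    print(allowed, correct_before_errors(example, allowed))
```

Plotting this count against the number of tolerated errors gives a curve on which different recognition methods can be compared over the same set of pairs.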

Summary - overview

Homology based methods
Analogy based methods
Physics based methods
Why function prediction?

Similarity -> analogy based annotations

Recognition of potential analogs based on similarity in
  Genome organization (non homologous replacements)
  Genomic fingerprints
  Expression patterns
  Specific features
    Charge distribution
    Presence of specific patterns

Problems
  Is this similarity related to function?

TM0449 (thy1) – from prediction to proof

TM0449
  Hypothetical, uncharacterized protein
  Multiple homologs in pathogenic and thermophilic bacteria
  Novel fold

Evidence
  Phylogenetic profile complementing thymidylate synthase
  A homolog complements TS in Dictyostelium

Confirmed experimentally
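A complementary phylogenetic profile means that, across genomes, one gene tends to be present exactly where the other is absent, which hints at a non-homologous replacement. The sketch below scores that pattern for two presence/absence profiles. The genome names and profile values are invented for illustration and are not the TM0449 data.

```python
from typing import Dict


def complementarity(profile_a: Dict[str, bool], profile_b: Dict[str, bool]) -> float:
    """Fraction of shared genomes in which exactly one of the two genes is present."""
    genomes = profile_a.keys() & profile_b.keys()
    if not genomes:
        raise ValueError("profiles share no genomes")
    exclusive = sum(profile_a[g] != profile_b[g] for g in genomes)
    return exclusive / len(genomes)


# Hypothetical presence (True) / absence (False) of two genes in five genomes.
known_enzyme = {"genome_1": True, "genome_2": False, "genome_3": True,
                "genome_4": False, "genome_5": True}
candidate = {"genome_1": False, "genome_2": True, "genome_3": False,
             "genome_4": True, "genome_5": True}

print(complementarity(known_enzyme, candidate))
```

A value near 1.0 marks the candidate as a possible replacement for the known enzyme. As the slide stresses, such a signal is only a prediction until it is confirmed experimentally.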

3D motif search finds an identical arrangement binding phosphate in a different protein

Summary - overview

Homology based methods
Analogy based methods
Physics based methods
Why function prediction?


“Ab initio function prediction” – substrate docking

We know the structure of one protein in the family and functions of some others – is the function conserved?

Newly solved target

Gallery of models

We can analyze conservation of surface features by mapping them on the sphere


And then compare maps between homologs


And come up with new (predicted) functions

Phospholipid vs. retinol vs. short peptide binding
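The surface-mapping step from the preceding slides can be pictured as a change of coordinates: each surface point is described by two angles as seen from the center of the protein, so that maps from different homologs share one frame. The sketch below is an assumption about how such a projection could be done, not the method behind the slides. The function name `to_sphere` and the test points are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def to_sphere(points: Sequence[Point]) -> List[Tuple[float, float]]:
    """Project 3D points onto a sphere around their centroid.

    Returns (theta, phi) per point: theta is the polar angle in [0, pi],
    phi the azimuth in [-pi, pi]. Points located exactly at the centroid are skipped.
    """
    n = len(points)
    if n == 0:
        raise ValueError("need at least one point")
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    angles = []
    for x, y, z in points:
        dx, dy, dz = x - cx, y - cy, z - cz
        r = math.sqrt(dx * dx + dy * dy + dz * dz)
        if r == 0.0:
            continue
        cos_theta = max(-1.0, min(1.0, dz / r))  # clamp against rounding error
        angles.append((math.acos(cos_theta), math.atan2(dy, dx)))
    return angles


# Six made-up surface points on the coordinate axes.
surface = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
           (0.0, -1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
for theta, phi in to_sphere(surface):
    print(round(theta, 3), round(phi, 3))
```

Once a surface property such as charge or conservation is attached to each pair of angles, maps from homologous structures can be compared bin by bin, provided the structures were superimposed beforehand.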

Summary - overview

Homology based methods
Analogy based methods
Physics based methods
Why function prediction?


Why my interest in function prediction?

Structural genomics: the structure is often the easiest experimental information to obtain (after sequence)

Function vs function

We have witnessed dramatic technological advances in sequencing and now in structure determination, while function analysis remains a painstaking, manual effort.

We used to know a lot about function even before we started working on a protein. Well, not anymore.

[Figure: timeline marked 1970, 1990, 2005, 2010 and "?", with a "1 year" level and curves labeled Sequencing, Structure determination and Function discovery]

Structure determination is now done on an assembly line

[Figure: structure determination pipeline with the steps target selection, cloning, expression, purification, crystallization, imaging, harvesting, xtal screening, bl xtal mounting, data collection, phasing, tracing, struc. refinement, struc. validation, annotation, publication and PDB; each step carries a multiplier label between 1 X and 7 X]

Even a few years ago, functional annotation seemed trivial

[Figure: structure determination pipeline, from target selection through cloning, expression, purification, crystallization, data collection, phasing, refinement, validation, annotation and publication to PDB; each step carries a multiplier label between 1 X and 7 X]

After a few years, the reality seems to be very different

[Figure: structure determination pipeline, from target selection through cloning, expression, purification, crystallization, data collection, phasing, refinement, validation, annotation and publication to PDB; multiplier labels (1 X, 2 X, 7 X) shown on only some of the steps]

“Reverse order” of function and structure determination and its challenges

The classical way
  1. A function is discovered and studied
  2. The gene responsible for this function is identified
  3. Function is confirmed
  4. Product of this gene is isolated, crystallized, solved
  5. We have a whole story!
  Structure “rationalizes” function and provides molecular details

Post-genomic
  1. A new, uncharacterized gene is found in a genome
  2. Predictions or high-throughput methods prioritize this gene for further studies
  3. The protein is studied in detail
  Structure is solved in a high throughput center
  Structure is the first experimental information about the “hypothetical” protein

We now have hundreds of structures of proteins with unknown functions

Summary

For some, function prediction is a practical, day to day problem

Analogy based approaches dominate the field
  Homology seen from sequence similarity
  Structural similarities
  Potential active sites, clefts, surface features

Many useful tools exist, but they are very scattered and not very user-friendly


Summary (2)

Avoid overconfidence - “easy” predictions contain many surprises

Only synergy of several independent lines of reasoning can give a correct answer

Elimination of “easy”, but inconsistent predictions is critical

So far, automated function prediction (AFP) doesn’t even come close to expert analysis