Predicting the function of a protein from either a sequence or a structure (is not trivial)
Adam Godzik
The Sanford-Burnham Medical Research Institute
Summary - overview
Homology based methods
Analogy based methods
Physics based methods
Why function prediction?
What we mean by function
Multilevel definition: phenotype, cellular function, molecular function (activity), substrates, inhibitors, cofactors
Several attempts to develop a unified function classification:
EC classification for enzymes, e.g. 4.2.31.101
MEROPS (proteases), CAZy (carbohydrate-active enzymes, including glycoside hydrolases)
Gene Ontology
Two complementary views of the evolution and diversity of life
Organisms (species)
Genes (proteins)
Both are amazingly large and diverse
Organisms (species)
About 1.5M known today; 10-100 million species are estimated to exist, depending on the definition of species and other assumptions
Their relations can be described in a tree of life, at least for eukaryotes.
The bacterial and archaeal tree of life is much more controversial; some even dispute the concept of species for bacteria
Proteins
With a 20 amino acid alphabet, the number of possible protein sequences is very large (20^100, i.e. about 1.3*10^130, for short, 100-residue proteins(!))
Total number: >10 billion? 10-100M species, with ~4K genes in a bacterial and ~10K in a eukaryotic genome
Over 25 million known today, i.e. ~0.2%
Representative sample?
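The arithmetic behind these estimates can be checked with a short Python sketch. The function names are illustrative only, and the 100-residue length and the 10 billion total are the round figures quoted above, not measured values.

```python
def sequence_space_size(length, alphabet_size=20):
    """Number of distinct sequences of a given length over the alphabet."""
    return alphabet_size ** length


def scientific(n):
    """Split a positive integer into (mantissa, exponent) in base 10."""
    exponent = len(str(n)) - 1
    return n / 10 ** exponent, exponent


def known_fraction(known, estimated_total):
    """Fraction of the estimated protein universe that has been sequenced."""
    return known / estimated_total


mantissa, exponent = scientific(sequence_space_size(100))
print(f"20^100 is about {mantissa:.2f} x 10^{exponent}")

# Assumed round numbers from the slide: 25 million known, >10 billion total.
print(f"known fraction: {known_fraction(25_000_000, 10_000_000_000):.2%}")
```

With a total of exactly 10 billion the known fraction comes out at a quarter of a percent; any larger total pushes it lower, which is consistent with the rough ~0.2% quoted.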
Of the 25 million proteins known today:
Direct experimental data are available for a few thousand proteins
Indirect experimental data are available for perhaps a few hundred thousand
Structures of ~60 thousand have been solved
The protein universe seems to be very large. But is it random?
Many proteins (like species) are close relatives
Histone H1 (human) - histone H1 (chicken)
SRRSASHPTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAA
|  |  || || || ||| ||| | |||||||||||||||||| ||| |||||| ||
SKKSTDHPKYSDMIVAAIQAEKNRAGSSRQSIQKYIKSHYKVGENADSQIKLSIKRLVTT
similarity: 77% id, BLAST E-value 0.0
function: two H1 histones from different species (orthologs)
Their functions and structures are obviously very similar
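As a minimal sketch of how a percent identity figure is obtained from an alignment, the Python function below counts identical residues over aligned, non-gap positions. The function name and the gap convention (`-`) are assumptions for illustration. The value depends on which region is aligned, so the 60-residue excerpt shown on the slide need not reproduce the quoted figure exactly.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned (non-gap) positions carrying identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue.
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


# The two histone H1 excerpts from the slide (ungapped alignment).
h1_human = "SRRSASHPTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAA"
h1_chicken = "SKKSTDHPKYSDMIVAAIQAEKNRAGSSRQSIQKYIKSHYKVGENADSQIKLSIKRLVTT"
print(f"{percent_identity(h1_human, h1_chicken):.1f}% identity over the excerpt")
```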
Can we organize the protein universe into neighborhoods (families)?
The number of protein clusters (modeling families) grows linearly with the number of protein sequences (and exponentially in time) – cumulative total
[Figure: Rate of discovery. Number of clusters (thousands, 0-250) vs. number of sequences (millions, 0-7), plotted for cluster sizes >=3, >=5, >=10 and >=20.]
From Yooseph et al., PLoS Biology (2007) 5:e16
How many protein families are still out there?
How far can we go?
Histone H5 - histone H1
TYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRLA
   | |  |  | | | | |     |               |||    |   | | |||| ||||||||
SVTELITKAVSASKERKGLSLAALKKALAAGGYDVEKNNSRIKLGLKSLVSKGTLVQTKGTGASGSFRLS
similarity: 40% seq id, BLAST E-value 10^-15
function: two histones (paralogs)
Structures still very similar, functions somewhat different, but obviously similar
This is surely too far?
Histone H5 - TRANSCRIPTION FACTOR E2F-4
PTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRL | | | | |
GLLTTKFVSLLQEAKD-GVLDLKLAADTLA------VRQKRRIYDITNVLEGIGLIEKKS----KNSIQW
similarity: 7% seq id, BLAST E-value 1
Is it?
Structure – obviously similar (2.4 Å RMSD over 80 aa)
Function – clearly related (both bind DNA)
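For reference, an RMSD such as the 2.4 Å quoted here is the root of the mean squared distance between equivalent atoms. The Python sketch below assumes the two coordinate sets are already optimally superimposed and matched one-to-one (for example, equivalent C-alpha atoms); the superposition step itself is not shown, and the coordinates are made-up toy values.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched sets of 3D points.

    Assumes the structures are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy coordinates (in angstroms), not taken from any real structure.
toy_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
toy_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(f"RMSD = {rmsd(toy_a, toy_b):.3f} angstroms")
```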
More subtle similarity can be detected with more sophisticated methods
We can keep adding more layers
Most “function assignments” are provided by predicted homology
Unknown protein: GLLTTKFVSLLQEAKDGVLDLKLAADTLAVRQKRRIYDITNVLEGIGLIEKKSKNSIQW
Well studied protein: SRRSASHPTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAA
Similarity -> homology -> prediction?
Similarity -> homology based annotations
Recognition of close and/or distant homologs based on similarity:
Sequence
Sequence/profile, profile/profile
Structure
Problems: How to predict differences? Even homologous proteins evolve and change!
Prediction by homology
Recognition
Are there any well-characterized proteins similar to my protein?
Can we assume they are homologous?
Structure of my protein is similar to the other one
Modeling
Alignment: What is the position-by-position target/template equivalence?
Function prediction
Function of my protein is similar to the other one
We could predict: activity, role in the whole organism, 3D structure, structure of a complex
Important distinction
Similarity: Two proteins have similar sequences/structures/functions (s/s/f) if, by some metric, the s/s/f of one protein is more similar to the s/s/f of the other than to that of a randomly chosen protein
Homology: Two proteins are homologous if they have evolved from a common ancestor
Common error: “Two proteins are 65% homologous”
What we really meant: The sequences of the two proteins are 65% similar, therefore we can safely assume they are homologous; why else would they be so similar?
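The "more similar than to a randomly chosen protein" criterion can be made concrete with a shuffle test. The Python sketch below is one simple way to do it, not the method behind any particular tool: it scores an ungapped pair by identity count, compares that with the scores obtained after shuffling one sequence, and reports a Z-score. The function names, the number of shuffles, and the fixed seed are all illustrative assumptions.

```python
import random
import statistics


def identity_count(a, b):
    """Number of positions with identical residues (ungapped comparison)."""
    return sum(x == y for x, y in zip(a, b))


def shuffle_z_score(a, b, n_shuffles=200, seed=0):
    """Z-score of the observed identity against shuffled versions of b."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = identity_count(a, b)
    letters = list(b)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same composition, random order
        background.append(identity_count(a, letters))
    mean = statistics.mean(background)
    spread = statistics.pstdev(background)
    if spread == 0:
        return 0.0
    return (observed - mean) / spread


h1_human = "SRRSASHPTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAA"
h1_chicken = "SKKSTDHPKYSDMIVAAIQAEKNRAGSSRQSIQKYIKSHYKVGENADSQIKLSIKRLVTT"
print(f"Z-score: {shuffle_z_score(h1_human, h1_chicken):.1f}")
```

A high Z-score says only that the similarity is unlikely to arise by chance; the step from there to homology remains an inference.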
If life were easy, this is what it would look like
similar -> homologous
not similar -> unrelated
Not (obviously) similar, but (probably) homologous
Histone H5 and transcription factor E2F4, identity 7%, similar fold, similar function (DNA binding)
PTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRL | | | | |
GLLTTKFVSLLQEAKD-GVLDLKLAADTLA------VRQKRRIYDITNVLEGIGLIEKKS----KNSIQW
Similar, but not homologous
phosphoribosyltransferase and viral coat protein, identity: 42%, different folds, different functions
 99 IRLKSYCNDQSTGDIKVIGGDDLSTLTGKNVLIVEDIIDTGKTMQTLLSLVRQY.NPKMVKVASLLVKRTPRSVGY 173
214 VPLKTDANDQ.IGDSLY....SAMTVDDFGVLAVRVVNDHNPTKVT..SKVRIYMKPKHVRV...WCPRPPRAVPY 279
Similarity vs. homology
similar -> homologous
not similar -> unrelated
not similar -> homologous
similar -> not homologous
Can we return to this simple picture by redefining similarity?
similar -> homologous
not similar -> unrelated
Are these two protein families related?
New protein (target)
KAAELEMEKEQILRSLGEISVHNCMFKLEECDREEIEAITDRLTKRTKTVQVVVETPRNEEQKKALEDATLMIDEVGEMMHSNIEKAKLCLQ
Known protein (template)
VKKDALENLRVYLCEKIIAERHFDHLRAKKILSREDTEEISCRTSSRKRAGKLLDYLQENPKGLDTLVESIRREKTQNF
How to compare two families?
$\mathrm{Score}(\mathrm{Fam}_1, \mathrm{Fam}_2) = \sum_{\text{alignment}} A_{li}\, M_{ij}\, B_{jm} = ?$
Compare as vectors in 21-dimensional space (FFAS)
Profile-profile similarity
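A minimal Python sketch of the "vectors in 21-dimensional space" idea follows. It is not the actual FFAS scoring scheme: it assumes each alignment column is turned into a normalized count vector over 20 amino acids plus a gap symbol, and that two columns are compared by a plain dot product. The toy families and all function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids + gap = 21 dimensions


def column_vector(column):
    """Unit-length count vector for one alignment column."""
    counts = Counter(column)
    vec = [float(counts.get(ch, 0)) for ch in ALPHABET]  # other symbols ignored
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def profile(msa):
    """Turn a multiple alignment (equal-length strings) into column vectors."""
    return [column_vector(col) for col in zip(*msa)]


def column_score(u, v):
    """Dot product of two column vectors (1.0 for identical compositions)."""
    return sum(x * y for x, y in zip(u, v))


def aligned_profile_score(profile_a, profile_b, aligned_pairs):
    """Sum of column scores over aligned column index pairs (l, m)."""
    return sum(column_score(profile_a[l], profile_b[m]) for l, m in aligned_pairs)


# Toy families, invented for illustration.
p1 = profile(["ACD", "ACE", "ACD"])
p2 = profile(["ACD", "GCD"])
print(aligned_profile_score(p1, p2, [(0, 0), (1, 1), (2, 2)]))
```

Comparing whole columns rather than single residues is what lets profile-profile methods pick up family-level signals that a pairwise sequence comparison misses.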
How to validate a protocol: 1. Recognition
Folding benchmarks from structural clustering of PDB (several sets, 700 pairs used here) compared to sequence based clustering of the same group of proteins
Correct predictions vs. wrong predictions
CASP meetings, CAFASP, LiveBench
Published and/or publicly available predictions, fold prediction servers, available prediction programs
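One common way to summarize "correct vs. wrong predictions" on such a benchmark is to rank all predictions by score and count how many correct ones appear before a given number of errors. The Python sketch below assumes that convention; it is not necessarily the protocol used for the benchmark described here, and the scores and labels are invented.

```python
def correct_before_nth_error(predictions, max_errors=1):
    """Count correct predictions ranked above the max_errors-th wrong one.

    predictions: iterable of (score, is_correct), higher score = more confident.
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    correct = 0
    errors = 0
    for _score, is_correct in ranked:
        if is_correct:
            correct += 1
        else:
            errors += 1
            if errors >= max_errors:
                break  # stop at the allowed number of wrong predictions
    return correct


# Hypothetical benchmark output: (score, matches the structural classification?)
toy_predictions = [(9.1, True), (8.0, True), (7.5, False), (7.0, True), (3.0, False)]
print(correct_before_nth_error(toy_predictions, max_errors=1))
print(correct_before_nth_error(toy_predictions, max_errors=2))
```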
Similarity -> analogy based annotations
Recognition of potential analogs based on similarity in:
Genome organization (non-homologous replacements)
Genomic fingerprints
Expression patterns
Specific features
Charge distribution
Presence of specific patterns
Problems: Is this similarity related to function?
TM0449 (thy1) – from prediction to proof
TM0449: hypothetical, uncharacterized protein
Multiple homologs in pathogenic and thermophilic bacteria
Novel fold
Evidence:
Phylogenetic profile complementing thymidylate synthase
A homolog complements TS in Dictyostelium
Confirmed experimentally
3D motif search finds an identical arrangement binding phosphate in a different protein
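The phylogenetic profile argument can be sketched in Python as follows: two genes have complementary profiles when, across genomes, one tends to be present exactly where the other is absent, which hints at a non-homologous replacement. The scoring function and the presence/absence data below are illustrative assumptions, not the analysis actually performed for TM0449.

```python
def complementarity(profile_a, profile_b):
    """Fraction of genomes in which exactly one of the two genes is present."""
    if len(profile_a) != len(profile_b) or not profile_a:
        raise ValueError("profiles must be non-empty and of equal length")
    exclusive = sum(1 for a, b in zip(profile_a, profile_b) if bool(a) != bool(b))
    return exclusive / len(profile_a)


# Hypothetical presence (1) / absence (0) of two genes across five genomes.
gene_x = [1, 1, 0, 0, 1]
gene_y = [0, 0, 1, 1, 0]
print(complementarity(gene_x, gene_y))
```

A score near 1 means the two genes almost never co-occur, which is a hint rather than a proof; as on the slide, experimental confirmation is still needed.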
“Ab initio function prediction” – substrate docking
We know the structure of one protein in the family and functions of some others – is the function conserved?
Newly solved target
Gallery of models
We can analyze conservation of surface features by mapping them on the sphere
And then compare maps between homologs
And come up with new (predicted) functions
Phospholipid vs. retinol vs. short peptide binding
Why my interest in function prediction?
Structural genomics: the structure is often the easiest experimental information to obtain (after sequence)
Function vs function
We have witnessed dramatic technological advances in sequencing and now in structure determination, but function analysis remains a painstaking, manual effort.
We used to know a lot about function even before we started working on a protein. Well, not anymore
[Figure: timeline (1970, 1990, 2005, 2010, ?) comparing sequencing, structure determination and function discovery; scale bar: 1 year]
Structure determination is now done on an assembly
line
[Figure: structure determination pipeline with stages target selection, cloning, expression, purification, crystallization, imaging, harvesting, xtal screening, xtal mounting, data collection, phasing, tracing, struc. refinement, struc. validation, annotation, publication, PDB; stages carry multipliers from 1 X to 7 X]
Even a few years ago, functional annotation seemed trivial
[Figure: the same structure determination pipeline, from target selection through annotation and publication to PDB]
After a few years, the reality seems to be very different
“Reverse order” of function and structure determination and its challenges
The classical way
1. A function is discovered and studied
2. The gene responsible for this function is identified
3. Function is confirmed
4. Product of this gene is isolated, crystallized, and its structure solved
5. We have a whole story!
Structure “rationalizes” function and provides molecular details
Post-genomic
1. A new, uncharacterized gene is found in a genome
2. Predictions or high-throughput methods prioritize this gene for further studies
3. The protein is studied in detail
Structure is solved in a high-throughput center
Structure is the first experimental information about the “hypothetical” protein
We now have hundreds of structures of proteins with
unknown functions
Summary
For some, function prediction is a practical, day-to-day problem
Analogy based approaches dominate the field
Homology seen from sequence similarity, structural similarities
Potential active sites, clefts, surface features
Many useful tools exist, but they are very scattered and not very user-friendly
Summary (2)
Avoid overconfidence - “easy” predictions contain many surprises
Only synergy of several independent lines of reasoning can give a correct answer
Elimination of “easy”, but inconsistent predictions is critical
So far, AFP (automated function prediction) doesn’t even come close to expert analysis