DIMACS 2005-06-13
Functional Annotation of Proteins with Known Structure by Structure and Sequence Similarity, DNA-protein Interaction Patterns and GO Framework
Ilya Shindyalov, PhD, UCSD/SDSC
Group Leader, Protein Science Research
Slide 2
Essential Dataflow in Protein Science
[Diagram labels: Sequence, Structure, Function; Data, Methods, Results]
Sequence similarity: (i) BLAST, (ii) fold recognition, (iii) homology modeling
Structure similarity: (i) DALI, (ii) VAST, (iii) CE
Slide 3
COVERAGE RATIO FOR FUNCTIONAL ANNOTATION
                  Disease  Biological Process  Cell Component  Molecular Function
PDB STRUCTURES    0.758    0.396               0.371           0.335
SG TARGETS        0.355    0.315               0.452           0.259
PDB+SG            0.822    0.528               0.593           0.477
HOMOLOGY MODELS   0.984    0.792               0.839           0.821
Do we know the function, if we know the structure?
Slide 4
The Subjects of my Talk
3 Approaches of Using Structure Similarity to Infer Protein Function:
#1: Assigning function from known to unknown. CASE STUDY: Prediction of calcium binding in Acetylcholinesterase. Projection on SNP responsible for Autism.
#2: Classification of DNA-binding protein domains involving (in addition to structure similarity) DNA-protein interaction patterns and sequence similarity.
#3: Extending GO annotation using structure similarity: how reliable can it be?
#4 [BONUS]: Why is ontology so important for humans?
Slide 6
CE: Protein structure comparison by Combinatorial Extension of the optimal path (Shindyalov and Bourne, 1998).
http://cl.sdsc.edu
Slide 7
CE Step 1. Heuristic search for initial path.
Slide 8
CE Step 2. Iterative dynamic programming on starting superposition from step 1.
Slide 9
CE vs. other Algorithms ???
Novotny, M., Madsen, D., and Kleywegt, G.J. 2004. Evaluation of protein fold comparison servers. Proteins 54: 260-270.
Slide 10
2ACE vs. 1TN4 (Acetylcholinesterase vs. Troponin C):
RMSD = 4.6 Å, Z-score = 4.6, LALI = 86, LGAP = 8, Seq. Identity = 3.5%
Slide 12
Data and algorithms used:
PDB - Protein Data Bank as of February 13, 2002, with 17,304 entries, was used as the source of original structural data.
- The DNA fragment is at least 5 bp long.
- At least 5 different protein residues are involved in the interaction with DNA.
- The contact distance cutoff between interacting atoms was < 5 Å.
- We did not take into account the different types of DNA (A, B, Z) because this annotation is insufficient in the PDB.
PDP - Protein Domain Parser (Alexandrov, Shindyalov, Bioinformatics, submitted)
CE - Protein structure alignment by Combinatorial Extension (Shindyalov, Bourne, 1998)
SCOP - Structural Classification of Proteins (Murzin et al., 1995)
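The contact criteria listed on this slide can be sketched as a small filter. This is only an illustration under stated assumptions: the function names, the input layout (residue id plus coordinates) and the reading of the cutoff as 5 angstroms are hypothetical choices, not the pipeline actually used in the study.

```python
import math

CONTACT_CUTOFF = 5.0  # contact distance cutoff from the slide (assumed to be angstroms)
MIN_RESIDUES = 5      # at least 5 different protein residues must contact DNA
MIN_DNA_BP = 5        # the DNA fragment must be at least 5 bp long


def contacting_residues(protein_atoms, dna_atoms, cutoff=CONTACT_CUTOFF):
    """Return ids of residues with at least one atom closer than `cutoff` to any DNA atom.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples (hypothetical layout).
    dna_atoms: iterable of (x, y, z) tuples.
    """
    dna_atoms = list(dna_atoms)
    found = set()
    for res_id, xyz in protein_atoms:
        if res_id in found:
            continue  # this residue is already known to be in contact
        if any(math.dist(xyz, d) < cutoff for d in dna_atoms):
            found.add(res_id)
    return found


def is_dna_binding(protein_atoms, dna_atoms, dna_length_bp):
    """Apply the slide's two size criteria to one protein chain or domain."""
    if dna_length_bp < MIN_DNA_BP:
        return False
    return len(contacting_residues(protein_atoms, dna_atoms)) >= MIN_RESIDUES
```

A brute-force distance scan is enough for a sketch; a real pipeline over the whole PDB would more likely use a spatial index.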
Slide 13
Building representative set of domains:
- PDB
- Selection of DNA-binding protein chains by analyzing DNA-protein contacts
- Parsing of DNA-binding protein chains into domains using PDP
- Selection of DNA-binding protein domains by analyzing DNA-protein contacts
- All-against-all structural alignment of DNA-binding protein domains using CE
- Selection of representative (non-redundant) set of DNA-binding protein domains
- Calculating classification of DNA-binding protein domains
Slide 14
Parameters measuring structural similarity:
- Rmsd, root mean squared deviation between two aligned and compared protein domains: > 2.0 Å;
- Z-score, statistical score obtained from CE: < 4.5;
- Rnar, ratio of the number of aligned residues to the smallest domain length: < 90%;
- Note: sequence identity in the alignment < 90%.
Slide 15
Figure 2. Determination of matched protein-DNA contact pattern for two hypothetical DNA-protein domain complexes A and B structurally aligned to each other. All residues except those matched to "-" are considered aligned to each other. Stars denote residues involved in protein-DNA interactions. Vertical bars denote matched protein residues involved in interaction with DNA.
(1) Parameters measuring structural similarity: Rmsd, Z-score, Rnar;
(2) Parameter measuring the match between DNA-protein contact patterns: Rmat.
A and B - DNA-protein domain complexes;
Rmat = min{Rmat_A, Rmat_B};
Rmat_X - ratio of the number of matched residues to the total number of residues involved in contacts with DNA in the DNA-protein complex X.
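The Rmat definition can be made concrete with a short sketch. It assumes that a "matched" residue pair is an aligned pair in which both residues contact DNA, which is how the figure caption reads; the function name and the input layout are hypothetical.

```python
def rmat(aligned_pairs, contacts_a, contacts_b):
    """Compute Rmat = min(Rmat_A, Rmat_B) for two aligned DNA-protein complexes.

    aligned_pairs: list of (residue_in_A, residue_in_B) pairs from the structural
        alignment, with gap positions already excluded.
    contacts_a, contacts_b: sets of residues that contact DNA in A and in B.
    """
    if not contacts_a or not contacts_b:
        return 0.0  # assumption: no DNA contacts on one side means no match
    # A pair is matched when both aligned residues are involved in DNA contacts.
    matched = sum(1 for a, b in aligned_pairs if a in contacts_a and b in contacts_b)
    rmat_a = matched / len(contacts_a)
    rmat_b = matched / len(contacts_b)
    return min(rmat_a, rmat_b)
```

For example, with two matched pairs, three contacting residues in A and four in B, Rmat_A is 2/3, Rmat_B is 1/2, and Rmat is 1/2.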
Slide 16
Realignment using scoring function taking into account structural similarity between two protein domains and protein-DNA contact pattern
Structure similarity term:
Protein-DNA contact pattern term:
where m denotes protein residue, X denotes protein-DNA complex; C_3 is a scaling constant.
Slide 17
- If Rmsd > 5.0 Å or Rnar < 70% or Z-score < 3.5, then the domains are not considered similar;
- If Rmsd ≤ 3.0 Å and Rnar ≥ 80%, then the domains are considered similar;
- If Rmat ≥ Rmat threshold and either 3.0 Å < Rmsd ≤ 5.0 Å and Rnar ≥ 70%, or 70% ≤ Rnar < 80% and Rmsd ≤ 5.0 Å, then the domains are considered similar.
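The three rules can be written as one decision function. This is a sketch under assumptions: several comparison signs did not survive in the slide text, so the inclusive comparisons below (at most 3.0, at least 80%, and so on) are inferred from the first rule, and `rmat_threshold` is a free parameter because the slide gives no value for it.

```python
def domains_similar(rmsd, rnar, z_score, rmat, rmat_threshold):
    """Decide whether two DNA-binding domains are considered similar.

    rmsd in angstroms, rnar in percent; rmat and rmat_threshold are fractions.
    """
    # Rule 1: hard rejection on any structural similarity parameter.
    if rmsd > 5.0 or rnar < 70.0 or z_score < 3.5:
        return False
    # Rule 2: structurally close enough that the contact pattern is not consulted.
    if rmsd <= 3.0 and rnar >= 80.0:
        return True
    # Rule 3: the intermediate zone (3.0 < rmsd <= 5.0, or 70 <= rnar < 80)
    # is decided by the match between DNA-protein contact patterns.
    return rmat >= rmat_threshold
```

After rule 1 passes, every remaining case that fails rule 2 falls into the intermediate zone described by rule 3, so a single Rmat comparison covers both of its branches.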
Slide 18
Comparison of the classification for all 338 DNA-binding domain representatives with SCOP at various threshold parameters
Slide 19
Final classification of DNA-binding protein domains (fragment):
Slide 20
[Figure: similarity regions in the Rmsd (ticks at 5, 3) versus Rnar (ticks at 70, 80) plane; region label: "Similar if Rmat"]
For two polypeptides A and B with all calculated parameter values (Rmsd, Z-score, Rnar, Rseq) and given threshold values (Rmsd threshold, Z-score threshold, Rnar threshold, Rseq threshold) we define:

$SSC_{AB} = (Rmsd < Rmsd_{threshold}) \wedge (Zscore > Zscore_{threshold}) \wedge (Rnar > Rnar_{threshold}) \wedge (Rseq > Rseq_{threshold})$

where $\wedge$ denotes logical AND. $SSC_{AB}$ can only take two values: true or false. If $SSC_{AB}$ is true, then A and B are similar. If $SSC_{AB}$ is false, then A and B are not similar. The chains were clustered such that for every two chains in each cluster the above condition holds true.
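A minimal sketch of the similarity condition and of a clustering that respects it follows. The default thresholds are the values quoted later in the talk (Rmsd 5.0, Z-score 3.8, Rnar 70%, Rseq 90%); the strict inequalities and the greedy first-fit procedure are assumptions, since the slide only states the property the clusters must satisfy, not how they were built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    rmsd: float = 5.0
    z_score: float = 3.8
    rnar: float = 70.0
    rseq: float = 90.0


def ssc(rmsd, z_score, rnar, rseq, t=Thresholds()):
    """Structure similarity condition SSC_AB for one pair of chains."""
    return (rmsd < t.rmsd and z_score > t.z_score
            and rnar > t.rnar and rseq > t.rseq)


def greedy_clusters(chains, similar):
    """Group chains so that every two chains in a cluster satisfy `similar`.

    similar: callable (chain_a, chain_b) -> bool, e.g. a wrapper around ssc().
    A chain joins the first cluster whose members are all similar to it;
    otherwise it starts a new cluster (hypothetical first-fit strategy).
    """
    clusters = []
    for chain in chains:
        for cluster in clusters:
            if all(similar(chain, member) for member in cluster):
                cluster.append(chain)
                break
        else:
            clusters.append([chain])
    return clusters
```

Because the condition is checked against every member, similarity that is not transitive (A similar to B, B similar to C, A not similar to C) never puts A and C in the same cluster.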
Slide 29
Specificity Criteria:
For the clusters where GO terms were available for at least two chains we define:
- positive cluster - a cluster where all chains have the same GO terms;
- negative cluster - a cluster where chains have different GO terms (more specific definitions for three criteria will be given further);
- TP (true positives) - the number of chains with GO terms in the positive clusters;
- FP (false positives) - the number of chains with GO terms in the negative clusters;
- ppv (positive predictive value), or specificity - the ratio TP/(TP+FP).
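The ratio TP/(TP+FP) can be illustrated with a short sketch. It assumes that false positives are counted in the negative clusters (the natural counterpart of the TP definition) and uses the simplest notion of a positive cluster, in which all annotated chains carry identical GO terms; the input layout is hypothetical.

```python
def ppv(clusters):
    """Positive predictive value over clusters of chains.

    clusters: list of clusters; each cluster is a list with one set of GO term
        ids per chain (an empty set means the chain has no GO annotation).
    Returns None when no cluster has at least two annotated chains.
    """
    tp = fp = 0
    for cluster in clusters:
        annotated = [frozenset(terms) for terms in cluster if terms]
        if len(annotated) < 2:
            continue  # the criteria are defined only for two or more annotated chains
        if len(set(annotated)) == 1:
            tp += len(annotated)  # positive cluster: identical GO terms
        else:
            fp += len(annotated)  # negative cluster: differing GO terms
    return tp / (tp + fp) if tp + fp else None
```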
Slide 30
Specificity Criteria (cont.):
$\{t_{i1},\dots,t_{ik(i)}\}$ is the set of $k(i)$ GO terms for the i-th chain. Each specificity is defined for clusters with at least two annotated chains.
- Specificity-1 (the most rigorous) - in a positive cluster every pair of chains (i, j) must have the same set of GO terms: $t_{in} = t_{jn}$, $n = 1,\dots,k(i)$, $k(i) = k(j)$, for all $(i, j)$, $i \in \{1,\dots,N\}$, $j \in \{1,\dots,N\}$.
- Specificity-2 (less rigorous than specificity-1) - in a positive cluster, for every pair of chains (i, j) with a different number of GO terms, all terms of the chain with the smaller number of terms must be present amongst the terms of the chain with the larger number of GO terms: $\{t_{i1},\dots,t_{ik(i)}\} \subseteq \{t_{j1},\dots,t_{jk(j)}\}$ if $k(i) \le k(j)$; $i \in \{1,\dots,N\}$, $j \in \{1,\dots,N\}$.
- Specificity-3 (less rigorous than specificity-2) - a positive cluster must have a common set of terms $\{t_1,\dots,t_L\}$ for all N chains within the cluster: $\{t_1,\dots,t_L\} \subseteq \{t_{i1},\dots,t_{ik(i)}\}$, $i = 1,\dots,N$; $\{t_1,\dots,t_L\} \neq \emptyset$.
- Further detailing of specificity (Specificity-4) should involve the semantic distance (e.g. Lord et al., 2003) between terms in judging a cluster to be positive.
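The three criteria map naturally onto set operations. The sketch below is one possible reading, with assumptions stated plainly: chains without GO terms are ignored, and under specificity-2 two chains with the same number of terms are required to have identical terms (otherwise specificity-3 would not be the weaker criterion). Function names are hypothetical.

```python
from itertools import combinations


def _annotated(cluster):
    """Keep only the chains that carry at least one GO term."""
    return [frozenset(terms) for terms in cluster if terms]


def positive_spec1(cluster):
    """Specificity-1: every pair of annotated chains has the same set of GO terms."""
    sets = _annotated(cluster)
    return len(sets) >= 2 and all(a == b for a, b in combinations(sets, 2))


def positive_spec2(cluster):
    """Specificity-2: for every pair, the smaller term set is contained in the larger."""
    sets = _annotated(cluster)
    return len(sets) >= 2 and all(a <= b or b <= a for a, b in combinations(sets, 2))


def positive_spec3(cluster):
    """Specificity-3: all annotated chains share a non-empty common set of terms."""
    sets = _annotated(cluster)
    return len(sets) >= 2 and bool(frozenset.intersection(*sets))
```

A cluster mixing chains annotated with terms {3677, 3700, 5524, 5634, 6355, 7275, 8151} and chains annotated with the subset {3677, 5524, 5634, 6355} is negative under specificity-1 but positive under specificity-2 and specificity-3.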
Slide 31
Clustering of PDB chains and the accuracy of GO annotation at different threshold values of structural similarity parameters.
Slide 32
Assignment of GO annotation with structural similarity parameters (thresholds: Rmsd 5.0 Å, Z-score 3.8, Rnar 70%, Rseq 90%).
- Red dot denotes newly annotated chains.
- Red arrow denotes new GO term-chain associations assigned for newly annotated chains.
- Purple line denotes new GO term-chain associations assigned for chains previously annotated (by EBI).
- Black arrow denotes existing GO term-chain associations assigned by EBI.
Slide 33
An example of a negative cluster by the definition of specificity-1 and a positive cluster by the definitions of specificity-2 and specificity-3. Seven GO terms could be assigned to chains 1h9dA, 1h9dC (thresholds: Rmsd 5.0 Å, Z-score 3.8, Rnar 70%, Rseq 90%).

Chains with 7 GO terms (3677, 3700, 5524, 5634, 6355, 7275, 8151):
1e50A, 1e50C, 1e50E, 1e50G, 1e50Q, 1e50R, 1cmoA, 1co1A, 1ljmA, 1ljmB
Chains with 4 GO terms (3677, 5524, 5634, 6355):
1hjbC, 1hjbF, 1hjcA, 1hjcD, 1io4C, 1eanA, 1eaoA, 1eaoB, 1eaqA, 1eaqB
Chains with no GO terms:
1h9dA, 1h9dC

GO terms:
3677 (F) - DNA binding
3700 (F) - transcription factor activity
5524 (F) - ATP binding
5634 (C) - nucleus
6355 (P) - regulation of transcription, DNA-dependent
7275 (P) - development
8151 (P) - cell growth and/or maintenance

All chains in this cluster belong to the same protein, Runt-related transcription factor 1 (synonyms: core-binding factor alpha subunit, acute myeloid leukemia 1 protein, etc.).
Only four negative clusters occurred by the definition of specificity-3. An example of missed GO terms for 2mtaC and other chains of Cytochrome c-L (cytochrome c551i):

2mtaC (3): 5489, 6118, 15945
1mg2D, 1mg2H, 1mg2L, 1mg2P, 1mg3D, 1mg3H, 1mg3L, 1mg3P (2 each): 16021, 16032

GO terms:
5489 (F) - electron transporter activity
6118 (P) - electron transport
15945 (P) - methanol metabolism
16021 (C) - integral to membrane
16032 (P) - viral life cycle
Slide 37
Evolution of complex systems:
- Computers: complexity doubles every 18 months per $$$ (Moore's Law)
- Human Brain: very slow (complexity doubles in ~100,000 years)

System                  Short Term Storage  Long Term Storage  Speed      Cost
PC cluster (256 units)  65 GB               5 TB               256 GFLOP  $130K
Human Brain (Average)   57 TB               1137 TB            4.4 TFLOP  $130K

Complexity = Speed x Memory
Computer = 5 TB x 256 GFLOP = 10^24 memory FLOPs
Brain = 1137 TB x 4.4 TFLOP = 5x10^27 memory FLOPs
Brain/Computer = 5x10^3, or 3.7 log units
Moore's Law: 3.5 years/log unit
Based on (Ramsey, 1997)
Human brain capacity for computers will be reached: 2000 + 3.7 x 3.5 = 2013
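The arithmetic on this slide can be retraced in a few lines. The sketch takes every number from the slide, including the 3.5 years per log unit and the rounding of the computer's complexity to 10^24; the decimal unit conversions (1 TB = 10^12 bytes, 1 GFLOP = 10^9, 1 TFLOP = 10^12) are assumptions.

```python
import math

TB = 1e12     # bytes, decimal convention (assumption)
GFLOP = 1e9
TFLOP = 1e12

# Complexity = Speed x Memory, using long term storage as the memory figure.
computer = (5 * TB) * (256 * GFLOP)   # about 1.3e24; the slide rounds this to 10^24
brain = (1137 * TB) * (4.4 * TFLOP)   # about 5.0e27

# Gap in log10 units, computed against the slide's rounded 10^24.
log_units = math.log10(brain / 1e24)  # about 3.7

YEARS_PER_LOG_UNIT = 3.5              # rate quoted on the slide
year = 2000 + log_units * YEARS_PER_LOG_UNIT

print(f"gap: {log_units:.1f} log units, parity year: {year:.0f}")
```

Using the unrounded computer figure shortens the gap to about 3.6 log units, and the estimate still rounds to 2013.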
Slide 39
The accuracy of predicting the future for the next 2 years
equals 10%
Slide 40
Credits:
- Julia Ponomarenko (she did #2 and #3)
- Phil Bourne (discussions, conceptualizations, logistics)
- Lei Xie (PDB statistics)
- NIH Grant GM63208
- NSF Grants DBI 9808706, DBI 0111710
- Gift from Ceres Inc.