Protein Classification
[Figure: PDB growth, number of new PDB structures per year]
Only a few folds are found in nature
Protein classification
• Number of protein sequences grows exponentially
• Number of solved structures grows exponentially
• Number of new folds identified very small (and close to constant)
• Protein classification can
  - Generate an overview of structure types
  - Detect similarities (evolutionary relationships) between protein sequences
  - Help predict the 3D structure of new protein sequences
Morten Nielsen,CBS, BioCentrum, DTU
SCOP release 1.69: classification of 25,973 protein structures in PDB

Class                            # folds   # superfamilies   # families
All alpha proteins                   218               376          608
All beta proteins                    144               290          560
Alpha and beta proteins (a/b)        136               222          629
Alpha and beta proteins (a+b)        279               409          717
Multi-domain proteins                 46                46           61
Membrane & cell surface               47                88           99
Small proteins                        75               108          171
Total                                945              1539         2845
[Figure: protein structure classification of the protein world, hierarchy fold > superfamily > family]
Structure Classification Databases
• SCOP: manual classification (A. Murzin), scop.berkeley.edu
• CATH: semi-manual classification (C. Orengo), www.biochem.ucl.ac.uk/bsm/cath
• FSSP: automatic classification (L. Holm), www.ebi.ac.uk/dali/fssp/fssp.html
Major classes in SCOP
• Classes:
  - All α proteins
  - All β proteins
  - α and β proteins (α/β)
  - α and β proteins (α+β)
  - Multi-domain proteins
  - Membrane and cell surface proteins
  - Small proteins
  - Coiled coil proteins
All α: Hemoglobin (1bab)
All β: Immunoglobulin (8fab)
α/β: Triosephosphate isomerase (1hti)
α+β: Lysozyme (1jsf)
Families
• Proteins whose evolutionary relationship is readily recognizable from the sequence (>~25% sequence identity)
• Families are further subdivided into proteins
• Proteins are divided into species; the same protein may be found in several species
[Sidebar: hierarchy Fold > Superfamily > Family > Proteins]
Superfamilies
• Proteins which are (remotely) evolutionarily related
Sequence similarity low
Share function
Share special structural features
• Relationships between members of a superfamily may not be readily recognizable from the sequence alone
Folds
• >~50% secondary structure elements arranged in the same order in sequence and in 3D
• No evolutionary relation
Protein Classification
• Given a new protein sequence, can we place it in its “correct” position within an existing protein hierarchy?
Methods
• BLAST / PsiBLAST
• Profile HMMs
• Supervised Machine Learning methods
[Figure: a new protein (?) to be placed within the Fold > Superfamily > Family > Proteins hierarchy]
BLAST
(Basic Local Alignment Search Tool)
Main idea:
1. Construct a dictionary of all the words in the query
2. Initiate a local alignment for each word match between query and DB
Running Time: O(MN)
However, orders of magnitude faster than Smith-Waterman
[Figure: word matches between the query and the DB seeding local alignments]
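The two steps above can be sketched as follows (a simplification: real BLAST also seeds on "neighborhood" words whose substitution score exceeds a threshold, and extends each seed into a gapped local alignment):

```python
# Sketch of BLAST's seeding, simplified to exact word matches only.

def word_dictionary(query, w=3):
    """Step 1: index every length-w word of the query by position."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i+w], []).append(i)
    return words

def find_seeds(query, db_seq, w=3):
    """Step 2: each word match (query_pos, db_pos) seeds a local alignment."""
    words = word_dictionary(query, w)
    seeds = []
    for j in range(len(db_seq) - w + 1):
        for i in words.get(db_seq[j:j+w], []):
            seeds.append((i, j))
    return seeds

print(find_seeds("HEAGAWGHEE", "PAWHEAE", w=3))  # [(0, 3)]: HEA matches
```

The dictionary lookup is what avoids Smith-Waterman's full O(MN) dynamic program in the common case: only seeded diagonals are extended.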
PSI-BLAST
Given a sequence query x, and database D
1. Find all pairwise alignments of x to sequences in D
2. Collect all matches of x to y with some minimum significance
3. Construct position specific matrix M, a profile
4. Using the matrix M, search D for more matches
5. Iterate 1–4 until convergence
[Figure: a position-specific profile M]
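Steps 2-3 can be sketched as follows; this is a minimal stand-in (uniform pseudocounts, no sequence weighting or gapped columns), not the actual PSI-BLAST profile construction:

```python
import math

# Build a position-specific log-odds matrix from ungapped, equal-length
# matches collected in step 2, then use it to score candidates (step 4).

def build_profile(matches, background, pseudo=1.0):
    """M[i][a] = log(f_ia / background[a]) with additive pseudocounts."""
    alphabet = sorted(background)
    profile = []
    for i in range(len(matches[0])):
        col = [m[i] for m in matches]
        total = len(col) + pseudo * len(alphabet)
        profile.append({a: math.log(((col.count(a) + pseudo) / total)
                                    / background[a])
                        for a in alphabet})
    return profile

def score(profile, seq):
    """Sum of position-specific log-odds scores."""
    return sum(profile[i][c] for i, c in enumerate(seq))

bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
M = build_profile(["ACGT", "ACGA", "ACCT"], bg)
print(score(M, "ACGT") > score(M, "TTTT"))  # conserved positions score higher
```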
Profile HMMs
• Each M state has a position-specific pre-computed substitution table
• Each I and D state has position-specific gap penalties
• A profile HMM is a generative model: the sequence X aligned to H is thought of as "generated by" H; therefore H parametrizes a conditional distribution P(X | H)
Protein profile HMM
[Diagram: profile HMM with match states M1 … Mm, insert states I0 … Im, delete states D1 … Dm, between BEGIN and END]
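The generative view can be sketched in miniature by keeping only the match states (no insert or delete states), so that P(X | H) reduces to a product of per-position emission probabilities; the 3-column model below uses made-up numbers over a toy alphabet:

```python
import math

# Match-states-only simplification of a profile HMM: for a sequence of
# exactly the model length m, log P(X | H) = sum_j log e_j(x_j).

def log_prob(emissions, seq):
    """log P(seq | H) under the match-states-only model."""
    assert len(seq) == len(emissions)
    return sum(math.log(e[c]) for e, c in zip(emissions, seq))

# Toy 3-column model over a 2-letter alphabet (invented probabilities).
H = [{"A": 0.9, "B": 0.1},
     {"A": 0.2, "B": 0.8},
     {"A": 0.9, "B": 0.1}]

print(log_prob(H, "ABA") > log_prob(H, "BAB"))  # consensus scores higher
```

A full profile HMM adds the insert/delete states from the diagram and sums over all alignment paths with the forward algorithm.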
Classification with Profile HMMs
[Figure: one profile HMM per family and superfamily in the hierarchy; a new protein (?) is scored against each model]
Classification with Profile HMMs
• How generative models work:
  - Training examples: sequences known to be members of the family
  - The model assigns a probability to any given protein sequence
  - Sequences from the family should yield a higher probability than sequences from outside the family
• Log-likelihood ratio as score:

  L(X) = log [P(X | H1) P(H1) / P(X | H0) P(H0)]
       = log [P(H1 | X) P(X) / P(H0 | X) P(X)]
       = log [P(H1 | X) / P(H0 | X)]
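Plugging toy numbers into the ratio makes the behavior concrete (all probabilities below are invented for illustration, not from a real model):

```python
import math

# Log-likelihood ratio with toy numbers: even a small prior on the
# family can be overcome by a large enough likelihood ratio.
p_x_h1 = 1e-20   # P(X | H1): family model
p_x_h0 = 1e-24   # P(X | H0): null/background model
prior1, prior0 = 0.01, 0.99

L = math.log((p_x_h1 * prior1) / (p_x_h0 * prior0))
print(L > 0)  # X is more likely a family member, despite the small prior
```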
Generative Models
Discriminative Methods
Instead of modeling the process that generates data, directly discriminate between classes
• More direct way to the goal
• Better if the model is not accurate
Discriminative Models -- SVM
[Figure: separating hyperplane with normal vector v and margin]

Decision rule: classify x as red if vᵀx > 0

If x1, …, xn are training examples, sign(Σ_i α_i x_iᵀ x) "decides" where x falls

• Train the α_i to achieve the best margin
• Equivalently: largest margin for |v| ≤ 1, or margin of 1 for smallest |v|
Discriminative protein classification
Jaakkola, Diekhans, Haussler, ISMB 1999
• Define the discriminant function to be

  L(X) = Σ_{Xi ∈ H1} α_i K(X, Xi) − Σ_{Xj ∈ H0} α_j K(X, Xj)

  We decide X ∈ family H1 whenever L(X) > 0

• For now, let's just assume K(., .) is a similarity function
• Then, we want to train the α_i so that this classifier makes as few mistakes as possible on new data
• Similarly to SVMs, train the α_i so that the margin is largest, subject to 0 ≤ α_i ≤ 1
Discriminative protein classification
• Ideally, for training examples, L(Xi) ≥ 1 if Xi ∈ H1, and L(Xi) ≤ −1 otherwise
• This is not always possible; softer constraints are obtained with the following objective function:

  J(α) = Σ_{Xi ∈ H1} α_i (2 − L(Xi)) + Σ_{Xj ∈ H0} α_j (2 + L(Xj))

• Training: for Xi ∈ H1, try to "make" L(Xi) = 1:

  α_i ← [1 − L(Xi) + α_i K(Xi, Xi)] / K(Xi, Xi), clipped to a minimum of 0 and a maximum of 1

• Similarly, for Xj ∈ H0, try to "make" L(Xj) = −1
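A minimal sketch of this coordinate-wise update, on toy 1-D data with an RBF kernel (both the data and the kernel are my choices for illustration, not from the paper):

```python
import math

# Coordinate updates pushing L(Xi) toward +1 for positives and -1 for
# negatives, clipping each alpha to [0, 1] as in the slide.

def K(x, y):
    return math.exp(-(x - y) ** 2)   # toy RBF kernel; K(x, x) = 1

pos, neg = [0.0, 0.5, 1.0], [9.0, 9.5, 10.0]
X = pos + neg
sign = [+1] * len(pos) + [-1] * len(neg)
alpha = [0.0] * len(X)

def L(x):
    return sum(s * a * K(x, xi) for s, a, xi in zip(sign, alpha, X))

for _ in range(20):
    for i, xi in enumerate(X):
        # solve L(xi) = sign[i] for alpha_i, then clip to [0, 1]
        a = (1 - sign[i] * L(xi) + alpha[i] * K(xi, xi)) / K(xi, xi)
        alpha[i] = min(1.0, max(0.0, a))

print(all(L(x) > 0 for x in pos) and all(L(x) < 0 for x in neg))
```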
The Fisher Kernel
• The function K(X, Y) compares two sequences; it acts effectively as an inner product in a (non-Euclidean) space, and is called a "kernel"
• It has to be positive definite: for any X1, …, Xn, the matrix K with Kij = K(Xi, Xj) satisfies xᵀ K x > 0 for any x ∈ Rⁿ, x ≠ 0
• The choice of this function is important
• Consider P(X | H1, θ) and its sufficient statistics: how many times X is expected to take each transition/emission
The Fisher Kernel
Let θ be the vector of parameters of the HMM (the probabilities on each arrow and emission)

• Fisher score: U_X = ∇_θ log P(X | H1, θ)
  - Quantifies how each parameter contributes to generating X
  - For two different sequences X and Y, we can compare U_X and U_Y
• Distance: D²_F(X, Y) = (1/(2σ²)) |U_X − U_Y|², where σ is just a scaling parameter
• Given this distance function, K(X, Y) is defined as a similarity measure: K(X, Y) = exp(−D²_F(X, Y))
  - σ is set so that the average distance from training sequences Xi ∈ H1 to sequences Xj ∈ H0 is 1
The Fisher Kernel
• To train a classifier for a given family H1:
  1. Build a profile HMM, H1
  2. U_X = ∇_θ log P(X | H1, θ) (Fisher score)
  3. D²_F(X, Y) = (1/(2σ²)) |U_X − U_Y|² (distance)
  4. K(X, Y) = exp(−D²_F(X, Y)) (akin to a dot product)
  5. L(X) = Σ_{Xi ∈ H1} α_i K(X, Xi) − Σ_{Xj ∈ H0} α_j K(X, Xj)
  6. Iteratively adjust the α_i to optimize
     J(α) = Σ_{Xi ∈ H1} α_i (2 − L(Xi)) + Σ_{Xj ∈ H0} α_j (2 + L(Xj))

• To classify a query X:
  - Compute U_X
  - Compute K(X, Xi) for all training examples Xi with α_i ≠ 0 (few)
  - Decide based on whether L(X) > 0
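Steps 3-4 can be sketched with NumPy, starting from precomputed Fisher score vectors (random stand-ins here; real U_X vectors come from a trained profile HMM):

```python
import numpy as np

# Fisher-score vectors as random stand-ins: 5 positives, 5 negatives,
# 10 parameters each (the dimensions are arbitrary choices).
rng = np.random.default_rng(0)
U_pos = rng.normal(0.0, 1.0, size=(5, 10))   # scores for Xi in H1
U_neg = rng.normal(3.0, 1.0, size=(5, 10))   # scores for Xj in H0

# Calibrate sigma so the average positive-to-negative squared distance,
# scaled by 1/(2 sigma^2), is 1.
d2 = ((U_pos[:, None, :] - U_neg[None, :, :]) ** 2).sum(-1)
sigma2 = d2.mean() / 2.0

def fisher_kernel(u, v):
    """K(X, Y) = exp(-|U_X - U_Y|^2 / (2 sigma^2))."""
    return np.exp(-((u - v) ** 2).sum() / (2.0 * sigma2))

print(fisher_kernel(U_pos[0], U_pos[0]))  # 1.0: identical score vectors
```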
The Fisher Kernel
• If a given superfamily has more than one profile model,
  Lmax(X) = max_i L_i(X) = max_i (Σ_{Xj ∈ Hi} α_j K(X, Xj) − Σ_{Xj ∈ H0} α_j K(X, Xj))
[Figure: several profile HMMs, one per family within the superfamily]
O. Jangmin
Benchmarks
Other methods
• WU-BLAST version 2.0a16 (Altschul & Gish 1996)
  The PDB90 database was queried with each positive training example, and E-values were recorded.
  BLAST:SCOP-only
  BLAST:SCOP+SAM-T98-homologs
  Scores were combined by the maximum method
• SAM-T98 method
  Null model: reverse-sequence model
  Same data and same set of models as in SVM-Fisher
  Combined by the maximum method
Results
• Metric : the rate of false positives (RFP)
• RFP for a positive test sequence: the fraction of negative test sequences that score as well as or better than the positive sequence
• Results shown for a family of the nucleotide triphosphate hydrolases SCOP superfamily:
  test the ability to distinguish 8 PDB90 G proteins from 2,439 sequences in other SCOP folds
QUESTION: What is the running time of the Fisher kernel SVM on a query X?
k-mer based SVMs
Leslie, Eskin, Weston, Noble; NIPS 2002
Highlights
• K(X, Y) = exp(−(1/(2σ²)) |U_X − U_Y|²) requires an expensive profile alignment:
  computing U_X = ∇_θ log P(X | H1, θ) costs O(|X| |H1|)
• Instead, the new kernel K(X, Y) just "counts up" k-mers with mismatches in common between X and Y: O(|X|) in practice
• Off-the-shelf SVM software used
k-mer based SVMs
• For a given word size k and mismatch tolerance l, define
K(X, Y) = # distinct k-long word pairs with ≤ l mismatches
• Define normalized kernel K’(X, Y) = K(X, Y)/ sqrt(K(X,X)K(Y,Y))
• SVM can be learned by supplying this kernel function
X: A B A C A R D I
Y: A B R A D A B I

Let k = 3, l = 1:
K(X, Y) = 4
K'(X, Y) = 4/sqrt(7*7) = 4/7
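The example can be checked with a direct, brute-force implementation of the counting kernel as defined above (the published method uses a mismatch-tree data structure to reach the O(|X|) behavior claimed earlier; this O(|X||Y|) version only illustrates the definition):

```python
import math

# Counting kernel: number of distinct k-long word pairs with <= l
# mismatches; unordered pairs (including each word with itself) when a
# sequence is compared with itself.

def kmers(s, k):
    return sorted({s[i:i+k] for i in range(len(s) - k + 1)})

def mismatches(a, b):
    return sum(c != d for c, d in zip(a, b))

def K(x, y, k=3, l=1):
    A, B = kmers(x, k), kmers(y, k)
    if x == y:  # unordered pairs, including self-pairs
        return sum(1 for i, a in enumerate(A) for b in A[i:]
                   if mismatches(a, b) <= l)
    return sum(1 for a in A for b in B if mismatches(a, b) <= l)

def K_norm(x, y, k=3, l=1):
    return K(x, y, k, l) / math.sqrt(K(x, x, k, l) * K(y, y, k, l))

X, Y = "ABACARDI", "ABRADABI"
print(K(X, Y), K_norm(X, Y))  # 4 and 4/7, matching the slide
```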
SVMs will find a few support vectors
v
After training, SVM has determined a small set of sequences, the support vectors, who need to be compared with query sequence X
Benchmarks
Semi-Supervised Methods

GENERATIVE SUPERVISED METHODS

DISCRIMINATIVE SUPERVISED METHODS

UNSUPERVISED METHODS
Mixture of Centers
Data generated by a fixed set of centers (how many?)
Semi-Supervised Methods
• Some examples are labeled
• Assume labels vary smoothly among all examples
• SVMs and other discriminative methods may make significant mistakes due to lack of data
• Attempt to "contract" the distances within each cluster while keeping intercluster distances larger
Cluster Kernels
1. Neighborhood kernel
   1. For each X, run PSI-BLAST to get similar sequences Nbd(X)
   2. Define Φ_nbd(X) = (1/|Nbd(X)|) Σ_{X' ∈ Nbd(X)} Φ_original(X')
      "Counts of all k-mers matching, with at most 1 mismatch, all sequences that are similar to X"
   3. K_nbd(X, Y) = (1/(|Nbd(X)| |Nbd(Y)|)) Σ_{X' ∈ Nbd(X)} Σ_{Y' ∈ Nbd(Y)} K(X', Y')
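A sketch of the neighborhood kernel with toy stand-ins: the neighborhoods below are hard-coded rather than produced by PSI-BLAST, and the base kernel is a simple position-agreement fraction instead of the mismatch kernel:

```python
# Average the base kernel over the neighborhoods of X and Y.

def K_base(x, y):
    # toy base kernel: fraction of positions that agree
    return sum(a == b for a, b in zip(x, y)) / max(len(x), len(y))

def K_nbd(x, y, nbd):
    nx, ny = nbd[x], nbd[y]
    total = sum(K_base(xp, yp) for xp in nx for yp in ny)
    return total / (len(nx) * len(ny))

nbd = {"ABAB": ["ABAB", "ABAA"],   # each sequence plus its "hits"
       "BABA": ["BABA", "BABB"]}
print(K_nbd("ABAB", "BABA", nbd))  # 0.125
```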
2. Bagged mismatch
1. Run k-means clustering n times, giving p = 1,…,n assignments cp(X)
2. For every X and Y, count up the fraction of times they are bagged together
   K_bag(X, Y) = (1/n) Σ_p 1(c_p(X) = c_p(Y))
3. Combine the “bag fraction” with the original comparison K(.,.)
Knew(X, Y) = Kbag(X, Y) K(X, Y)
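A sketch of the bagged-mismatch construction on toy 1-D points, with a minimal hand-rolled k-means standing in for a library clusterer (the data, k, and number of runs are all illustrative choices):

```python
import random

# Steps 1-3: cluster n times, compute the bag fraction, and multiply it
# into a base kernel K.

def kmeans(points, k, seed):
    """Tiny Lloyd's algorithm on 1-D points; returns cluster labels."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(10):
        assign = [min(range(k), key=lambda c: (p - centers[c]) ** 2)
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return assign

points = [0.0, 0.1, 0.2, 9.0, 9.1, 9.2]
n = 5
runs = [kmeans(points, 2, seed=p) for p in range(n)]

def K_bag(i, j):
    """Fraction of runs in which points i and j land in the same cluster."""
    return sum(r[i] == r[j] for r in runs) / n

def K_new(i, j, K):
    return K_bag(i, j) * K(points[i], points[j])

print(K_bag(0, 1), K_bag(0, 3))  # nearby points co-cluster; far ones don't
```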
Benchmarks