Graphical Modeling of Multiple Sequence Alignment
Jinbo Xu
Toyota Technological Institute at Chicago
Computational Institute, The University of Chicago
Two applications of MSA
• Predict inter-residue interaction network (i.e., protein contact map) from MSA using joint graphical lasso
– An important subproblem of protein folding
• Align two MSAs through alignment of two Markov Random Fields (MRFs)
– Homology detection and fold recognition
– Merge two MSAs into a larger one
Modeling MSA by Markov Random Fields
The generating probability of a sequence:
Infer the model parameters by maximum likelihood; they encode the residue correlation relationship
A special case is the Gaussian Graphical Model
Numeric Representation of MSA
… 0 0 … 1 0 …
21 elements for each column in MSA
Represent a sequence in the MSA as an L×21 binary vector
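The encoding above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the alphabet ordering, the name `encode_sequence`, and the choice to treat unknown symbols as gaps are all assumptions made for the example.

```python
# Hypothetical alphabet order: 20 standard amino acids plus the gap symbol = 21 states.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALPHABET = AMINO_ACIDS + "-"


def encode_sequence(seq):
    """Return a flat 0/1 list of length 21 * L with exactly one 1 per MSA column."""
    vec = []
    for ch in seq:
        block = [0] * len(ALPHABET)
        idx = ALPHABET.find(ch.upper())
        if idx < 0:
            # Assumption: symbols outside the alphabet (e.g. X) are treated as gaps.
            idx = ALPHABET.index("-")
        block[idx] = 1
        vec.extend(block)
    return vec


vec = encode_sequence("GRK-")
print(len(vec), sum(vec))
```

Each aligned sequence of length L becomes one row of a binary data matrix with 21L columns, which is the input from which a covariance matrix can be computed.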
Gaussian Graphical Model (GGM)
• X: a multiple sequence alignment (MSA)
• Assume X has a Gaussian distribution N(μ, Σ), where Σ is the covariance matrix
• Ω (inverse of Σ): the precision matrix, implying the residue interaction pattern among all MSA columns
Covariance and Precision Matrix
The precision matrix has dimension 21L×21L: an L×L grid of 21×21 submatrices, one per residue pair
Larger values indicate stronger interaction
Today’s talk
• Predict inter-residue interaction network (i.e., protein contact map) from MSA using joint graphical lasso
– An important subproblem of protein folding
• Align two MSAs through alignment of two Markov Random Fields (MRFs)
– Homology detection and fold recognition
– Merge two MSAs into a larger one
Protein Contact Map (residue interaction network)
[Figure: four residues (1 to 4) with pairwise distances labeled 6.0, 8.1, 5.9 and 3.8]
  1 2 3 4
1 0 1 1 0
2 1 0 1 1
3 1 1 0 1
4 0 1 1 0
Two residues are in contact if their Cα or Cβ distance is < 8 Å
Shorter distance → stronger interaction
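A minimal sketch of how a contact matrix follows from the 8 Å rule, assuming one 3D coordinate (Cα or Cβ) per residue. The function name `contact_map` and the toy coordinates are made up for illustration; they are placed on a straight line 3.8 Å apart and are not the coordinates behind the slide's figure.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary contact matrix: 1 if two residues are closer than `threshold` angstroms."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1  # contacts are symmetric
    return cmap


# Toy example: four residues on a line, 3.8 angstroms between neighbours.
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
for row in contact_map(toy):
    print(row)
```

With these toy coordinates only the pair of residues 1 and 4 (11.4 Å apart) falls outside the threshold, so the result has the same pattern as the 4×4 matrix shown on the slide.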
Contact Matrix is Sparse
short range: 6-12 AAs apart along primary sequence
medium range: 12-24 AAs apart
long range: >24 AAs apart
#contacts is linear w.r.t. sequence length
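The range categories can be expressed as a small helper. The slide's intervals share their endpoints, so the boundary handling here (12 counts as medium, 24 counts as long) is an assumption, as is the name `contact_range`.

```python
def contact_range(i, j):
    """Classify a residue pair by sequence separation |i - j|.

    Assumed boundaries: short = 6..11, medium = 12..23, long = 24 or more.
    Pairs closer than 6 residues are not classified.
    """
    sep = abs(i - j)
    if sep >= 24:
        return "long"
    if sep >= 12:
        return "medium"
    if sep >= 6:
        return "short"
    return None


print(contact_range(3, 10), contact_range(0, 15), contact_range(5, 60))
```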
Input:MEKVNFLKNGVLRLPPGFRFRPTDEELVVQYLKRKVFSFPLPASIIPEVEVYKSDPWDLPGDMEQEKYFFSTKEVKYPNGNRSNRATNSGYWKATGIDKQIILRGRQQQQQLIGLKKTLVFYRGKSPHGCRTNWIMHEYRLANLESNYHPIQGNWVICRIFLKKRGNTKNKEENMTTHDEVRNREIDKNSPVVSVKMSSRDSEALASANSELKKKASIIFYDFMGRNNSNGVAASTSSSGITDLTTTNEESDDHEESTSSFNNFTTFKRKIN
Protein Contact Prediction
Output:
With L/12 long-range native contacts, the fold of a protein can be roughly determined [Baker group]
Contact Prediction Methods
• Evolutionary coupling analysis (unsupervised learning)
– Identify co-evolved residues from multiple sequence alignment
– No solved protein structures used at all
– High-throughput sequencing makes this method promising
– e.g., mutual information, Evfold, PSICOV, plmDCA, GREMLIN
• Supervised machine learning
– Input features: sequence (profile) similarity, chemical properties similarity, mutual information
– (Implicitly) learn information from solved structures
– Examples: NNcon, SVMcon, CMAPpro, PhyCMAP
Evolutionary Coupling (EC) Analysis
Observation: two residues in contact tend to co-evolve, i.e., two co-evolved residues are likely to form a contact
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0028766
Evolutionary Coupling (EC) Analysis (Cont’d)
• Local statistical methods: examine the correlation between two residues independent of the others
– Mutual information (MI): two residues in contact are likely to have large MI
– Not all residue pairs with large MI are in contact, due to indirect evolutionary coupling: if A~B and B~C, then likely A~C
• Global statistical methods: examine the correlation between two residues conditioned on the others
– Need a large number of sequences
– Maximum-Entropy: Evfold
– Graphical lasso: PSICOV
– Pseudo-likelihood: plmDCA, GREMLIN
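As an illustration of the local statistic mentioned above, here is a minimal sketch of mutual information between two MSA columns, using plain empirical frequencies. Real pipelines typically add sequence weighting and pseudocounts, which are omitted here; the function name is hypothetical.

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """MI (in nats) between two equally long MSA columns, from raw frequencies."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        mi += p_ab * math.log(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi


# Two columns that always change together vs. two that vary independently.
print(mutual_information("AABB", "CCDD"))
print(mutual_information("AABB", "CDCD"))
```

The first pair of columns co-varies perfectly and gets a large MI; the second pair is independent and gets zero. MI looks at one pair at a time, which is why it cannot separate direct from indirect coupling.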
Single MSA-based Contact Prediction
• Given a protein sequence under prediction, run PSI-BLAST to detect its homologs and build an MSA
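• Calculate the sample covariance matrix Σ̂ from the MSA
• Σ̂ is singular, so the precision matrix cannot be calculated as Ω = Σ̂⁻¹
• Calculate Ω by maximum likelihood, i.e., maximize the probability of the observed sequences, with an L1 penalty:
$$\max_{\Omega}\; \log|\Omega| - \mathrm{tr}(\Omega\hat{\Sigma}) - \lambda_1\|\Omega\|_1$$
The L1 term enforces a sparse precision matrix. Why?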
Issues with Existing Methods
• Evolutionary coupling (EC) analysis works for proteins with a large number of sequence homologs
• Focus on how to improve the statistical methods instead of using extra biological information/insight, e.g., relaxing the Gaussian assumption, consensus of a few EC methods
• Use information mostly in a single protein family
• Physical constraints other than sparsity not used
Our Work: contact prediction using multiple MSAs
• Jointly predict contacts for related families of similar folds, i.e., predict contacts using multiple MSAs. These MSAs share an inter-residue interaction network to some degree
• Integrate evolutionary coupling (EC) analysis with supervised learning
– EC analysis makes use of residue co-evolution information
– Supervised learning makes use of sequence (profile) similarity
Goal: focus on proteins without many sequence homologs
Strategy: increase statistical power by information aggregation
Red: shared; Blue: unique to PF00116; Green: unique to PF13473
Observation: different protein families share similar contact maps
Joint evolutionary coupling (EC) analysis
• Jointly predict contacts for a set of related protein families
• Predict contacts for a protein family using information in other related families
• Enforce contact map consistency among related families
• Do not lose family-specific information
Joint graphical lasso for joint evolutionary coupling analysis
1. Given a protein family and its MSA, find related families and corresponding MSAs
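Let Ω^1, …, Ω^K be the precision matrices of the K families
2. Estimate them by maximizing the joint log-likelihood as follows
$$\max\; \sum_{k=1}^{K}\left(\log|\Omega^k| - \mathrm{tr}(\Omega^k\hat{\Sigma}^k)\right) - \lambda_1\sum_{k=1}^{K}\|\Omega^k\|_1$$
where the last term enforces sparse precision matrices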
How to enforce contact map consistency?
Residue Pair/Submatrix Grouping
In total ≤L(L-1)/2 groups where L is the seq length
Enforce Contact Map Consistency by Group Penalty
G: the number of groups; K: the number of families
Using group lasso to model family consistency:
$$\sum_{g=1}^{G}\lambda_g\|\Omega_g\|_2$$
λ_g: group conservation level
is defined as
Supervised Machine Learning
• Input features: sequence profile, amino acid chemical properties, mutual information power series, context-specific statistical potential
• Mutual information power series:
– Local info: mutual information matrix (MI)
– Partially global info: MI^2, MI^3, …, MI^11
– Can be calculated much faster than PSICOV
• Random Forests trained on 800-900 proteins
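A minimal sketch of the power series idea, under the assumption that MI^k denotes the k-th matrix power of the L×L mutual information matrix, so that entry (i, j) of MI^k aggregates coupling along chains of k steps. The helper names are hypothetical and the pure-Python multiplication is for illustration only.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mi_power_series(mi, max_power=11):
    """Return [MI^1, MI^2, ..., MI^max_power] (assumed meaning: matrix powers)."""
    powers = [mi]
    for _ in range(max_power - 1):
        powers.append(mat_mul(powers[-1], mi))
    return powers


# Toy 3-residue MI matrix: residues 1-2 and 2-3 are coupled, 1-3 is not.
toy_mi = [[0.0, 1.0, 0.0],
          [1.0, 0.0, 1.0],
          [0.0, 1.0, 0.0]]
series = mi_power_series(toy_mi, max_power=3)
print(series[1][0][2])  # MI^2 entry for pair (1, 3)
```

In the toy matrix, pair (1, 3) has zero MI but a nonzero MI^2 entry because both residues are coupled to residue 2. Matrix powers only need repeated multiplication, which is consistent with the slide's remark about speed.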
Joint EC Analysis with Supervised Prediction as Prior
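$$\max\; \sum_{k=1}^{K}\left(\log|\Omega^k| - \mathrm{tr}(\Omega^k\hat{\Sigma}^k)\right) - \lambda_1\sum_{k=1}^{K}\|\Omega^k\|_1 - \sum_{g=1}^{G}\lambda_g\|\Omega_g\|_2 - \lambda_2\sum_{k=1}^{K}\sum_{ij}\frac{\|\Omega^k_{ij}\|_1}{\max(P^k_{ij},\,0.3)}$$
Term 1: log-likelihood of K families
Term 2: sparsity
Term 3: contact map consistency among families
Term 4: similarity with supervised prediction
This optimization problem can be solved by ADMM to a suboptimal solution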
Accuracy on 98 Pfam families
             Medium-range           Long-range
Method       L/10   L/5    L/2     L/10   L/5    L/2
CoinDCA      0.496  0.435  0.312   0.561  0.502  0.391
PSICOV       0.375  0.312  0.213   0.446  0.400  0.311
PSICOV_b     0.388  0.306  0.199   0.462  0.400  0.294
plmDCA       0.433  0.354  0.233   0.484  0.443  0.343
plmDCA_h     0.433  0.339  0.211   0.480  0.413  0.292
GREMLIN      0.401  0.332  0.225   0.447  0.423  0.329
GREMLIN_h    0.391  0.316  0.204   0.428  0.400  0.301
Merge_p      0.303  0.246  0.178   0.370  0.328  0.253
Merge_m      0.276  0.223  0.169   0.355  0.309  0.232
Voting       0.405  0.280  0.168   0.337  0.353  0.275
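For readers unfamiliar with the L/10, L/5 and L/2 columns, the sketch below shows one common way such numbers are computed: rank the predicted pairs in a given sequence-separation range, keep the top L/k, and report the fraction that are native contacts. The function name, the dictionary-based input format and the separation bounds are assumptions for illustration, not the evaluation code behind the table.

```python
def top_accuracy(scores, native, seq_len, k=10, min_sep=24, max_sep=None):
    """Fraction of the top seq_len // k predicted pairs that are native contacts.

    scores: dict mapping (i, j) with i < j to a predicted contact score.
    native: set of (i, j) pairs that are true contacts.
    min_sep / max_sep: keep pairs with min_sep <= j - i < max_sep.
    """
    candidates = [
        (score, pair) for pair, score in scores.items()
        if pair[1] - pair[0] >= min_sep
        and (max_sep is None or pair[1] - pair[0] < max_sep)
    ]
    candidates.sort(key=lambda item: item[0], reverse=True)
    top = candidates[:max(1, seq_len // k)]
    if not top:
        return 0.0
    hits = sum(1 for _, pair in top if pair in native)
    return hits / len(top)


toy_scores = {(0, 30): 0.9, (1, 35): 0.8, (2, 39): 0.7,
              (3, 38): 0.6, (4, 37): 0.5, (5, 6): 0.99}
toy_native = {(0, 30), (2, 39), (4, 37)}
print(top_accuracy(toy_scores, toy_native, seq_len=40, k=10))
```

In the toy call, the high-scoring pair (5, 6) is excluded because it is not long-range, and two of the four top-ranked long-range pairs are native contacts.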
Accuracy vs. # Sequence Homologs
(A) Medium-range (B) Long-range
X-axis: ln of the number of non-redundant sequence homologs
Y-axis: L/10 accuracy
Accuracy on 123 CASP10 targets
             Medium-range           Long-range
Method       L/10   L/5    L/2     L/10   L/5    L/2
CoinDCA      0.500  0.440  0.340   0.412  0.351  0.279
Evfold       0.294  0.249  0.188   0.257  0.225  0.171
PSICOV       0.310  0.259  0.192   0.276  0.225  0.168
plmDCA       0.344  0.289  0.214   0.326  0.280  0.213
GREMLIN      0.343  0.280  0.229   0.320  0.278  0.159
NNcon        0.393  0.334  0.226   0.239  0.188  0.001
CMAPpro      0.414  0.363  0.276   0.336  0.297  0.227
Accuracy vs. # sequence homologs (CASP10)
X-axis: ln of # non-redundant sequence homologs
Y-axis: L/10 long-range prediction accuracy
Accuracy vs. Contact Conservation Level
(A) Medium-range; (B) Long-range
X-axis: conservation level; the larger, the more conserved
Today’s Talk
• Predict inter-residue interaction network (i.e., protein contact map) from MSA using joint graphical lasso
– An important subproblem of protein folding
• Align two MSAs through alignment of two Markov Random Fields (MRFs)
– Remote homology detection and fold recognition
– Merge two MSAs into a larger one
Homology Detection & Fold Recognition
• Primary sequence comparison
– Similar sequences -> very likely homologous
– Sequence alignment method, e.g., BLAST, FASTA
– Works only for close homologs
• Profile-based method
– Compare two protein families instead of primary sequences, using evolutionary information in a family
– Sequence-profile alignment & profile-profile alignment
– Profile can be represented as a matrix (e.g., FFAS) or an HMM (e.g., HHpred, HMMER)
– Sometimes works for remote homologs, but not sensitive enough
MSA to Sequence Profile
Two popular profile representations: (1) Position-specific scoring matrix (PSSM); (2) Hidden Markov Model (HMM)
Position-Specific Scoring Matrix (PSSM)
Taken from http://carrot.mcb.uconn.edu/~olgazh/bioinf2010/class10.html
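To make the PSSM representation concrete, here is a minimal sketch that turns an MSA into per-column log-odds scores. The uniform background frequency, the pseudocount of 1 and the name `build_pssm` are simplifying assumptions; tools such as PSI-BLAST use sequence weighting and empirical background frequencies instead.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(msa, pseudocount=1.0):
    """Return one dict per MSA column mapping amino acid -> log2 odds score.

    Assumptions: uniform background (1/20), additive pseudocounts, gaps ignored.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(msa[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in msa:
            if seq[col] in counts:
                counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({aa: math.log2(counts[aa] / total / background)
                     for aa in AMINO_ACIDS})
    return pssm


toy_msa = ["GRK-YSA", "GRK-YSA", "FLV-LYI", "KLV-LYI"]
pssm = build_pssm(toy_msa)
print(round(pssm[0]["G"], 3), round(pssm[0]["A"], 3))
```

Residues that are frequent in a column get positive scores and unseen residues get negative scores. Each column is scored independently, so a PSSM carries no information about correlations between columns.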
Our Work: Markov Random Fields (MRF) Representation
1) MRF encodes long-range residue interaction pattern while HMM does not;
2) Long-range interaction pattern encodes global information of a protein, so it can deal with proteins of similar folds but divergent sequences
Protein alignment by aligning two MRFs
Family 1 (MRF1):
G R K - Y S A
G R K - Y S A
F L V - L Y I
K L V - L Y I
Family 2 (MRF2):
P T A K F R E
P T A K F R S
P T V P G Y E
P T V P G R S
Scoring function for MRF alignment
Two terms: local alignment potential and pairwise alignment potential
[Figure: MRF1 and MRF2 with pairwise potentials θ^M_{i,j} and θ^M_{k,l} and alignment indicators Z^M_{i,j} = 1, Z^M_{k,l} = 1]
NP-hard due to
1) Gaps allowed
2) Pairwise potential
Alternating Direction Method of Multipliers (ADMM)
Make a copy of z to y
Add a penalty term to obtain an augmented problem
ADMM (Cont’d)
Use a Lagrangian multiplier to relax the original problem and obtain an upper bound
Solve the above problem iteratively as follows:
Step 1: Solve the optimization problem for a fixed multiplier
Step 2: Update the multiplier by subgradient and repeat Step 1 until convergence
ADMM (Cont’d)
For a fixed multiplier, split the relaxation problem into two subproblems and solve them alternately
(SP1)
(SP2)
Both subproblems can be solved efficiently by dynamic programming!
Superfamily & Fold Recognition Rate
Superfamily level detection Fold level detection
Conclusion
• Joint evolutionary coupling analysis + supervised learning can significantly improve protein contact prediction by using information in multiple MSAs
• Long-range residue interaction encoded in an MSA is helpful for remote homolog detection
Acknowledgements
• RaptorX servers at http://raptorx.uchicago.edu
• Students: Jianzhu Ma, Zhiyong Wang, Sheng Wang
• Funding
– NIH R01GM0897532
– NSF CAREER award and NSF ABI
– Alfred P. Sloan Research Fellowship
• Computational resources
– University of Chicago Beagle team
– TeraGrid and Open Science Grid
Input:MEKVNFLKNGVLRLPPGFRFRPTDEELVVQYLKRKVFSFPLPASIIPEVEVYKSDPWDLPGDMEQEKYFFSTKEVKYPNGNRSNRATNSGYWKATGIDKQIILRGRQQQQQLIGLKKTLVFYRGKSPHGCRTNWIMHEYRLANLESNYHPIQGNWVICRIFLKKRGNTKNKEENMTTHDEVRNREIDKNSPVVSVKMSSRDSEALASANSELKKKASIIFYDFMGRNNSNGVAASTSSSGITDLTTTNEESDDHEESTSSFNNFTTFKRKIN
Protein Structure Prediction
Output:
1. One of the most challenging problems in computational biology!
2. Improved due to better algorithms and large databases
3. Knowledge-based methods outperform physics-based methods
4. Big demand: our server processes > 800 jobs/week, > 12k users in 3 yrs
Performance in CASP9 (2010)
A blind test for protein structure prediction
Server ranking tested on the 50 hardest TBM targets
Adapted from http://predictioncenter.org/casp9/doc/presentations/CASP9_TBM.pdf
Performance in CASP10 (2012)
A blind test for protein structure prediction
The only server group among top 10
Adapted from http://predictioncenter.org/casp10/doc/presentations/CASP10_TBM_GM.pdf
The top 10 performing human/server groups on the hardest TBM targets
My Work
Analyze large-scale biological data and build predictive models
• Protein sequence and structure alignment
• Homology detection and fold recognition
• Protein structure prediction
• Protein function prediction (e.g., interaction and binding site prediction)
• Biological network construction and analysis
Study computational methods that have applications beyond bioinformatics
• Machine learning (e.g., probabilistic graphical model)
• Optimization (discrete, combinatorial and continuous)
Homology Detection & Fold Recognition
• Homology detection & fold recognition
– Determine the relationship between two proteins
– Given a query, search for all homologs in a database
• Homology search/fold recognition useful for
– Study protein evolutionary relationship
– Functional transfer
– Homology modeling (i.e., template-based modeling)
Two proteins are homologous if they have shared ancestry.
Two proteins have the same fold if their 3D structures are similar.
Structure Prediction (Cont’d)
• Template-based modeling (TBM)
– Using solved protein structures as template, e.g., homology modeling and protein threading
– Most reliable, but fails when there are no good templates
• Template-free modeling (FM) or ab initio folding
– Not using solved protein structures as template
– Mostly works only on some small proteins
• Subproblems
– Loop modeling
– Inter-residue contact prediction
Residue Pair Grouping
Precision Submatrix Grouping
for Family 1 for Family 2
Suppose that residue pair (2,4) in Family 1 aligned to pair (3,5) in Family 2
In total ≤L(L-1)/2 groups where L is the seq length
Performance on the 31 Pfam families with only distantly-related auxiliary families
             Medium-range           Long-range
Method       L/10   L/5    L/2     L/10   L/5    L/2
CoinFold     0.457  0.400  0.267   0.558  0.524  0.416
PSICOV       0.413  0.360  0.252   0.494  0.465  0.377
PSICOV_p     0.320  0.295  0.212   0.396  0.355  0.290
PSICOV_v     0.400  0.320  0.179   0.396  0.375  0.261
CoinFold: our work
PSICOV: single-family method
PSICOV_p: merge multiple families and apply single-family method
PSICOV_v: single-family method for each family and then consensus
Performance on the 13 Pfam families with closely-related auxiliary families
             Medium-range           Long-range
Method       L/10   L/5    L/2     L/10   L/5    L/2
CoinFold     0.501  0.395  0.251   0.462  0.413  0.293
PSICOV       0.433  0.351  0.231   0.398  0.331  0.234
PSICOV_p     0.335  0.220  0.175   0.322  0.276  0.194
PSICOV_v     0.423  0.320  0.188   0.386  0.384  0.301
Our method vs. PSICOV
Our method vs. GREMLIN
L/10 top predicted long-range contacts are evaluated
Performance vs. family size
Multiple Sequence Alignment (MSA) of One Protein Family
Top L/10 long-range prediction accuracy on 15 large Pfam families
PFAM ID   MEFF    CoinFold  PSICOV  PSICOV_p
PF00041   2981    0.767     0.767   0.667
PF00595   3026    0.556     0.444   0.233
PF03061   3334    0.375     0.250   0.500
PF01522   3519    0.615     0.462   0.077
PF00578   3733    0.308     0.308   0.231
PF00059   3744    0.455     0.455   0.182
PF07686   3801    0.917     0.583   0.667
PF00034   4060    0.600     0.500   0.200
PF00989   4596    0.583     0.250   0.500
PF00144   4684    0.272     0.212   0.242
PF00085   5075    0.636     0.545   0.182
PF00168   6735    0.667     0.667   0.556
PF00515   7230    0.500     0.500   0.250
PF00089   9045    0.783     0.783   0.739
PF00550   11476   0.857     0.857   0.714
Running Time
Axes: average protein sequence length; time (in seconds)
Performance: Alignment Accuracy
Performance: Homology Detection
Performance: Alignment Accuracy
Tmalign, Matt and DeepAlign represent three different ground truths
Joint Graphical Lasso Formulation
Log-likelihood: f(Ω)    Regularization: P(Ω)
Rewrite the original problem as max_Ω f(Ω) − P(Ω)
• Unconstrained optimization problem
• Both functions are convex, so the objective is the difference of two convex functions
• Can be solved by the Convex-Concave Procedure [A. Yuille]
Alternating Direction Method of Multipliers (ADMM)
Make a copy of Ω to Z, without changing the solution space:
max f(Ω) − P(Z)  s.t. Z = Ω
Add a penalty term to obtain an augmented problem, which has the same solution but converges faster.
Lagrangian Relaxation
$$\min_{U}\max_{\Omega, Z}\; f(\Omega) - P(Z) - \sum_{k=1}^{K}(\cdots)$$
Use a Lagrange multiplier U for each constraint and obtain a dual problem that upper-bounds the augmented problem
Solve the dual problem iteratively by subgradient as follows.
Step 1) Fix U and solve the inner maximization over Ω and Z
Step 2) Update U and repeat Step 1) until convergence
ADMM (Cont’d)
For a fixed U, split the relaxation problem into two subproblems and solve them alternately
(SP1)
(SP2)