Graphical Modeling of Multiple Sequence Alignment

Jinbo Xu

Toyota Technological Institute at Chicago

Computational Institute, The University of Chicago

Two applications of MSA
• Predict the inter-residue interaction network (i.e., protein contact map) from an MSA using joint graphical lasso
  – An important subproblem of protein folding
• Align two MSAs through alignment of two Markov Random Fields (MRFs)
  – Homology detection and fold recognition
  – Merge two MSAs into a larger one

Modeling MSA by Markov Random Fields

The generating probability of a sequence: [formula not preserved in this transcript]

Infer the model parameters by maximum likelihood; the pairwise term encodes the residue correlation relationship.
A special case is the Gaussian Graphical Model.
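For orientation, a pairwise MRF over an aligned sequence of length L is commonly written in the form below. This is a generic textbook form given as an assumption; the symbols \phi_i, \psi_{ij} and Z are not taken from the slide.

```latex
P(x_1,\dots,x_L) \;=\; \frac{1}{Z}\,
\exp\!\Big(\sum_{i=1}^{L}\phi_i(x_i) \;+\; \sum_{1\le i<j\le L}\psi_{ij}(x_i,x_j)\Big)
```

Here \phi_i would score the residue at column i, \psi_{ij} would score the residue pair at columns i and j, and Z normalizes the distribution.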

Numeric Representation of MSA

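[Figure: one-hot encoding of a single MSA column: … 0 0 … 1 0 …]

21 elements for each column in the MSA (20 amino acid types plus gap)

Represent each sequence in the MSA as a binary vector of length L×21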
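The encoding above can be sketched in a few lines of NumPy. The alphabet order, the function name `encode_msa`, and the choice to map unknown symbols to the gap state are assumptions made for this illustration.

```python
import numpy as np

# 20 standard amino acids plus the gap symbol: 21 states per column.
# The ordering of the alphabet is an arbitrary choice for this sketch.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def encode_msa(msa):
    """Return an (N, L*21) binary matrix, one row per aligned sequence."""
    n, length = len(msa), len(msa[0])
    x = np.zeros((n, length * 21), dtype=np.int8)
    for row, seq in enumerate(msa):
        for col, aa in enumerate(seq):
            # Assumption: unknown symbols (e.g. X) are treated like gaps.
            x[row, col * 21 + INDEX.get(aa, 20)] = 1
    return x

# Toy alignment with 3 sequences and 7 columns.
msa = ["GRK-YSA", "FLV-LYI", "KLV-LYI"]
X = encode_msa(msa)
```

Each row of `X` has exactly one nonzero entry per alignment column, so every row sums to L.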

Gaussian Graphical Model (GGM)

• X: a multiple sequence alignment (MSA)
• Assume X has a Gaussian distribution N(μ, Σ), where Σ is the covariance matrix
• Ω (the inverse of Σ): the precision matrix, implying the residue interaction pattern among all MSA columns

Covariance and Precision Matrix

[Figure: the precision matrix drawn as an L × L grid of 21×21 blocks]

The precision matrix has dimension 21L×21L.
Each 21×21 block corresponds to one residue pair.
Larger values indicate stronger interaction.
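To turn a 21L×21L precision matrix into one interaction score per residue pair, each 21×21 block has to be reduced to a single number. The sketch below uses the Frobenius norm of each block; that scoring rule and the name `block_scores` are assumptions, since the slide only says that larger values indicate stronger interaction.

```python
import numpy as np

def block_scores(omega, length, q=21):
    """Collapse a (q*L, q*L) precision matrix into an (L, L) score matrix."""
    # Entry omega[i*q + a, j*q + b] becomes blocks[i, a, j, b].
    blocks = omega.reshape(length, q, length, q)
    # Assumed scoring rule: Frobenius norm of each q x q block.
    scores = np.sqrt((blocks ** 2).sum(axis=(1, 3)))
    np.fill_diagonal(scores, 0.0)  # a residue is not scored against itself
    return scores

# Tiny demonstration with q=2 states and L=3 columns.
demo = np.zeros((6, 6))
demo[0:2, 4:6] = 3.0  # block for residue pair (0, 2)
demo_scores = block_scores(demo, length=3, q=2)
```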

Today’s talk
• Predict the inter-residue interaction network (i.e., protein contact map) from an MSA using joint graphical lasso
  – An important subproblem of protein folding
• Align two MSAs through alignment of two Markov Random Fields (MRFs)
  – Homology detection and fold recognition
  – Merge two MSAs into a larger one

Protein Contact Map (residue interaction network)

[Figure: four residues, numbered 1 to 4, with inter-residue distances labeled 3.8, 5.9, 6.0 and 8.1]

     1  2  3  4
1    0  1  1  0
2    1  0  1  1
3    1  1  0  1
4    0  1  1  0

Two residues are in contact if their Cα or Cβ distance is < 8 Å.

Shorter distance → stronger interaction
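The contact definition can be sketched as a distance threshold on atom coordinates. The coordinates below are hypothetical (four atoms on a line, 3.8 Å apart) and were chosen only so that the result matches the 4×4 matrix on this slide; `contact_map` is an illustrative name.

```python
import numpy as np

def contact_map(coords, cutoff=8.0):
    """Binary contact matrix from an (L, 3) array of atom coordinates in Å."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    contacts = (dist < cutoff).astype(int)
    np.fill_diagonal(contacts, 0)  # no self-contacts
    return contacts

# Hypothetical coordinates: four atoms on a straight line, 3.8 Å apart.
coords = np.array([[0.0, 0.0, 0.0],
                   [3.8, 0.0, 0.0],
                   [7.6, 0.0, 0.0],
                   [11.4, 0.0, 0.0]])
cmap = contact_map(coords)
```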

Contact Matrix is Sparse

short range: 6-12 AAs apart along the primary sequence
medium range: 12-24 AAs apart
long range: >24 AAs apart

The number of contacts is linear w.r.t. sequence length

Input:MEKVNFLKNGVLRLPPGFRFRPTDEELVVQYLKRKVFSFPLPASIIPEVEVYKSDPWDLPGDMEQEKYFFSTKEVKYPNGNRSNRATNSGYWKATGIDKQIILRGRQQQQQLIGLKKTLVFYRGKSPHGCRTNWIMHEYRLANLESNYHPIQGNWVICRIFLKKRGNTKNKEENMTTHDEVRNREIDKNSPVVSVKMSSRDSEALASANSELKKKASIIFYDFMGRNNSNGVAASTSSSGITDLTTTNEESDDHEESTSSFNNFTTFKRKIN

Protein Contact Prediction

Output:

With L/12 long-range native contacts, the fold of a protein can be roughly determined [Baker group]

Contact Prediction Methods

• Evolutionary coupling analysis (unsupervised learning)
  – Identify co-evolved residues from a multiple sequence alignment
  – No solved protein structures used at all
  – High-throughput sequencing makes this method promising
  – e.g., mutual information, Evfold, PSICOV, plmDCA, GREMLIN
• Supervised machine learning
  – Input features: sequence (profile) similarity, chemical properties similarity, mutual information
  – (Implicitly) learn information from solved structures
  – Examples: NNcon, SVMcon, CMAPpro, PhyCMAP

Evolutionary Coupling (EC) Analysis

Observation: two residues in contact tend to co-evolve, i.e., two co-evolved residues are likely to form a contact

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0028766

Evolutionary Coupling (EC) Analysis (Cont’d)
• Local statistical methods: examine the correlation between two residues independently of the others
  – Mutual information (MI): two residues in contact are likely to have large MI
  – Not all residue pairs with large MI are in contact, due to indirect evolutionary coupling. If A~B and B~C, then likely A~C
• Global statistical methods: examine the correlation between two residues conditioned on the others
  – Need a large number of sequences
  – Maximum-Entropy: Evfold
  – Graphical lasso: PSICOV
  – Pseudo-likelihood: plmDCA, GREMLIN
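As a concrete instance of a local statistical method, the sketch below computes the mutual information between two MSA columns from empirical frequencies. It uses no pseudocounts or sequence weighting, which real tools typically add; the function name and the toy columns are illustrative.

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    """Empirical mutual information (in nats) between two MSA columns."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in pab.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((pa[a] / n) * (pb[b] / n)))
    return mi

# Two perfectly co-varying columns versus two independent columns.
mi_coupled = mutual_information("AACC", "GGTT")
mi_independent = mutual_information("AACC", "GTGT")
```

The coupled pair gives ln 2 and the independent pair gives 0, but MI alone cannot tell direct coupling from the indirect A~B, B~C case described above.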

Single MSA-based Contact Prediction

• Given a protein sequence under prediction, run PSI-BLAST to detect its homologs and build an MSA

• Calculate the sample covariance matrix Σ̂ from the MSA

• Σ̂ is singular, so the precision matrix cannot be calculated as the inverse of Σ̂

• Calculate the precision matrix Ω by maximum likelihood, i.e., maximize the probability of the observed sequences

Enforce a sparse precision matrix. Why?
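A quick numerical check shows why the sample covariance cannot simply be inverted. The sketch below draws a random one-hot encoded "MSA" (synthetic data, not a real alignment) with fewer sequences than encoded dimensions and inspects the rank of its covariance matrix; all sizes are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_seqs, length, q = 50, 10, 21

# Synthetic one-hot data: one random state per sequence and column.
states = rng.integers(0, q, size=(n_seqs, length))
X = np.zeros((n_seqs, length * q))
X[np.arange(n_seqs)[:, None], np.arange(length) * q + states] = 1.0

S = np.cov(X, rowvar=False)       # (210, 210) sample covariance
rank = np.linalg.matrix_rank(S)   # at most n_seqs - 1 after centering
```

Even with many sequences the matrix stays rank-deficient, because the 21 indicators of each column always sum to 1. This is one reason a regularized maximum-likelihood estimate is used instead of a direct inverse.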

Issues with Existing Methods
• Evolutionary coupling (EC) analysis works for proteins with a large number of sequence homologs
• Focus on how to improve the statistical methods instead of using extra biological information/insight, e.g., relaxing the Gaussian assumption or taking the consensus of a few EC methods
• Use information mostly from a single protein family
• Physical constraints other than sparsity are not used

Our Work: contact prediction using multiple MSAs

• Jointly predict contacts for related families of similar folds. That is, predict contacts using multiple MSAs. These MSAs share an inter-residue interaction network to some degree
• Integrate evolutionary coupling (EC) analysis with supervised learning
  – EC analysis makes use of residue co-evolution information
  – Supervised learning makes use of sequence (profile) similarity

Goal: focus on proteins without many sequence homologs
Strategy: increase statistical power by information aggregation

Red: shared; Blue: unique to PF00116; Green: unique to PF13473

Observation: different protein families share similar contact maps

Joint evolutionary coupling (EC) analysis

• Jointly predict contacts for a set of related protein families
• Predict contacts for a protein family using information in other related families
• Enforce contact map consistency among related families
• Do not lose family-specific information

Joint graphical lasso for joint evolutionary coupling analysis

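1. Given a protein family and its MSA, find related families and their corresponding MSAs

Let Ω_1, …, Ω_K be the precision matrices of the K families

2. Estimate Ω_1, …, Ω_K by maximizing the joint log-likelihood as follows

\[
\max \; \sum_{k=1}^{K}\left(\log|\Omega_k| - \operatorname{tr}(\Omega_k\hat{\Sigma}_k)\right) \; - \; \lambda_1\sum_{k=1}^{K}\|\Omega_k\|_1
\]

where Σ̂_k is the sample covariance matrix of family k and the last term enforces sparse precision matrices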

How to enforce contact map consistency?

Residue Pair/Submatrix Grouping

In total ≤L(L-1)/2 groups where L is the seq length

Enforce Contact Map Consistency by Group Penalty

G: the number of groups; K: the number of families
Using group lasso to model family consistency:

\[
\sum_{g=1}^{G}\lambda_g\,\|\Omega_g\|_2
\]

Group conservation level

is defined as
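The definition itself did not survive in this transcript. One plausible reading, assumed here and not confirmed by the slide, takes the group norm as the square root of the sum of squared entries over all precision submatrices in a group, as in a standard group lasso. The function and variable names are illustrative.

```python
import numpy as np

def group_penalty(groups, weights):
    """Sum over groups of lambda_g times the assumed group norm.

    groups:  list of groups, each a list of precision submatrices
             (one per family that contains the aligned residue pair)
    weights: one lambda_g per group
    """
    total = 0.0
    for lam_g, blocks in zip(weights, groups):
        # Assumed norm: root of the summed squared entries in the group.
        squared = sum(float((block ** 2).sum()) for block in blocks)
        total += lam_g * np.sqrt(squared)
    return total

# Two toy groups built from 2x2 blocks instead of 21x21 blocks.
groups = [[np.ones((2, 2)), np.ones((2, 2))], [np.zeros((2, 2))]]
penalty = group_penalty(groups, weights=[0.5, 2.0])
```

Because the norm is taken over the whole group, the penalty favors solutions in which a residue pair is either coupled in all aligned families or in none, which is how a group lasso encourages consistency.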

Supervised Machine Learning

• Input features: sequence profile, amino acid chemical properties, mutual information power series, context-specific statistical potential
• Mutual information power series:
  – Local info: mutual information matrix (MI)
  – Partially global info: MI^2, MI^3, …, MI^11
  – Can be calculated much faster than PSICOV
• Random Forests trained on 800-900 proteins
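A minimal sketch of the power series, assuming that MI^k denotes the k-th matrix power of the L×L mutual information matrix (so that entry (i, j) of MI^2 sums over intermediate residues). That interpretation and the function name are assumptions.

```python
import numpy as np

def mi_power_series(mi, max_power=11):
    """Stack MI, MI^2, ..., MI^max_power as matrix powers."""
    powers = [mi]
    for _ in range(max_power - 1):
        powers.append(powers[-1] @ mi)
    return np.stack(powers)  # shape (max_power, L, L)

# Toy 2x2 "MI matrix" whose square is the identity.
toy_mi = np.array([[0.0, 1.0], [1.0, 0.0]])
series = mi_power_series(toy_mi)
```

Each power costs one L×L matrix product, which is cheap compared with estimating a sparse inverse covariance matrix.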

Joint EC Analysis with Supervised Prediction as Prior

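\[
\max \;
\underbrace{\sum_{k=1}^{K}\left(\log|\Omega_k| - \operatorname{tr}(\Omega_k\hat{\Sigma}_k)\right)}_{\text{log-likelihood of } K \text{ families}}
\; - \;
\underbrace{\lambda_1\sum_{k=1}^{K}\|\Omega_k\|_1}_{\text{sparsity}}
\; - \;
\underbrace{\sum_{g=1}^{G}\lambda_g\|\Omega_g\|_2}_{\text{contact map consistency among families}}
\; - \;
\underbrace{\lambda_2\sum_{k=1}^{K}\sum_{ij}\frac{\|\Omega^{k}_{ij}\|_1}{\max(P^{k}_{ij},\,0.3)}}_{\text{similarity with supervised prediction}}
\]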

This optimization problem can be solved by ADMM, which yields a suboptimal solution

Accuracy on 98 Pfam families

             Medium-range             Long-range
Method       L/10    L/5     L/2      L/10    L/5     L/2
CoinDCA      0.496   0.435   0.312    0.561   0.502   0.391
PSICOV       0.375   0.312   0.213    0.446   0.400   0.311
PSICOV_b     0.388   0.306   0.199    0.462   0.400   0.294
plmDCA       0.433   0.354   0.233    0.484   0.443   0.343
plmDCA_h     0.433   0.339   0.211    0.480   0.413   0.292
GREMLIN      0.401   0.332   0.225    0.447   0.423   0.329
GREMLIN_h    0.391   0.316   0.204    0.428   0.400   0.301
Merge_p      0.303   0.246   0.178    0.370   0.328   0.253
Merge_m      0.276   0.223   0.169    0.355   0.309   0.232
Voting       0.405   0.280   0.168    0.337   0.353   0.275
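The L/10, L/5 and L/2 columns report the fraction of native contacts among the top-scoring predicted pairs. The sketch below shows one way such a number could be computed; the separation cut-offs passed as `min_sep` and `max_sep` and the name `top_accuracy` are assumptions, since the exact evaluation protocol is not given in the slides.

```python
import numpy as np

def top_accuracy(scores, native, k=10, min_sep=25, max_sep=None):
    """Fraction of native contacts among the top L/k pairs in a range."""
    length = scores.shape[0]
    i, j = np.triu_indices(length, k=1)
    sep = j - i
    keep = sep >= min_sep
    if max_sep is not None:
        keep &= sep <= max_sep
    i, j = i[keep], j[keep]
    n_top = max(1, length // k)
    # Rank candidate pairs by predicted score, highest first.
    order = np.argsort(-scores[i, j], kind="stable")[:n_top]
    return native[i[order], j[order]].mean()

# Toy example: L=40, so the top L/10 = 4 pairs are evaluated.
scores = np.zeros((40, 40))
native = np.zeros((40, 40), dtype=int)
for rank_value, (a, b) in zip([4, 3, 2, 1], [(0, 30), (2, 36), (1, 35), (3, 39)]):
    scores[a, b] = rank_value
native[0, 30] = native[1, 35] = 1
acc = top_accuracy(scores, native, k=10, min_sep=25)
```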

Accuracy vs. # Sequence Homologs

[Figure panels: (A) Medium-range; (B) Long-range]
X-axis: ln of the number of non-redundant sequence homologs
Y-axis: L/10 accuracy

Accuracy on 123 CASP10 targets

             Medium-range             Long-range
Method       L/10    L/5     L/2      L/10    L/5     L/2
CoinDCA      0.500   0.440   0.340    0.412   0.351   0.279
Evfold       0.294   0.249   0.188    0.257   0.225   0.171
PSICOV       0.310   0.259   0.192    0.276   0.225   0.168
plmDCA       0.344   0.289   0.214    0.326   0.280   0.213
GREMLIN      0.343   0.280   0.229    0.320   0.278   0.159
NNcon        0.393   0.334   0.226    0.239   0.188   0.001
CMAPpro      0.414   0.363   0.276    0.336   0.297   0.227

Accuracy vs. # sequence homologs (CASP10)

X-axis: ln of # non-redundant sequence homologs
Y-axis: L/10 long-range prediction accuracy

Accuracy vs. Contact Conservation Level

[Figure panels: (A) Medium-range; (B) Long-range]
X-axis: conservation level; the larger, the more conserved

Today’s Talk
• Predict the inter-residue interaction network (i.e., protein contact map) from an MSA using joint graphical lasso
  – An important subproblem of protein folding
• Align two MSAs through alignment of two Markov Random Fields (MRFs)
  – Remote homology detection and fold recognition
  – Merge two MSAs into a larger one

Homology Detection & Fold Recognition

• Primary sequence comparison
  – Similar sequences -> very likely homologous
  – Sequence alignment methods, e.g., BLAST, FASTA
  – Works only for close homologs
• Profile-based method
  – Compare two protein families instead of primary sequences, using evolutionary information in a family
  – Sequence-profile alignment & profile-profile alignment
  – A profile can be represented as a matrix (e.g., FFAS) or an HMM (e.g., HHpred, HMMER)
  – Sometimes works for remote homologs, but not sensitive enough

MSA to Sequence Profile

Two popular profile representations: (1) Position-specific scoring matrix (PSSM); (2) Hidden Markov Model (HMM)

Position-Specific Scoring Matrix (PSSM)

Taken from http://carrot.mcb.uconn.edu/~olgazh/bioinf2010/class10.html
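As a rough illustration of how an MSA becomes a PSSM, the sketch below counts residues per column, adds a pseudocount, and takes the log-odds against a background distribution. The uniform background, the pseudocount of 1, and the function name `pssm` are simplifying assumptions; production tools such as PSI-BLAST use sequence weighting and empirical backgrounds.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def pssm(msa, pseudocount=1.0, background=None):
    """Log2-odds profile of shape (L, 20) from a list of aligned sequences."""
    q = len(ALPHABET)
    # Assumption: uniform background frequencies unless given explicitly.
    bg = np.full(q, 1.0 / q) if background is None else np.asarray(background)
    counts = np.full((len(msa[0]), q), float(pseudocount))
    for seq in msa:
        for col, aa in enumerate(seq):
            k = ALPHABET.find(aa)
            if k >= 0:  # gaps and unknown symbols are skipped
                counts[col, k] += 1
    freqs = counts / counts.sum(axis=1, keepdims=True)
    return np.log2(freqs / bg)

profile = pssm(["GRK", "GRK", "GLV"])
```

Each row of `profile` describes one column of the alignment independently of all other columns, which is the limitation the MRF representation is meant to address.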

Hidden Markov Model (HMM)

http://www.biopred.net/eddy.html

Our Work: Markov Random Fields (MRF) Representation

1) MRF encodes long-range residue interaction pattern while HMM does not;

2) The long-range interaction pattern encodes global information of a protein, so MRFs can deal with proteins of similar folds but divergent sequences

Protein alignment by aligning two MRFs

Family 1 (MRF1):
G R K - Y S A
G R K - Y S A
F L V - L Y I
K L V - L Y I

Family 2 (MRF2):
P T A K F R E
P T A K F R S
P T V P G Y E
P T V P G R S

Scoring function for MRF alignment

 

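The scoring function has two terms: a local alignment potential and a pairwise alignment potential.

[Figure: MRF1 and MRF2 with two aligned position pairs, labeled \theta^{M}_{i,j}, \theta^{M}_{k,l}, Z^{M}_{i,j} = 1 and Z^{M}_{k,l} = 1]

NP-hard due to
1) Gaps allowed
2) Pairwise potential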

Alternating Direction Method of Multipliers (ADMM)

Make a copy of z to y

Add a penalty term to obtain an augmented problem

ADMM (Cont’d)

Use a Lagrange multiplier to relax the original problem and obtain an upper bound.

Solve the relaxed problem iteratively as follows:
Step 1: Solve the optimization problem for a fixed multiplier
Step 2: Update the multiplier by subgradient and repeat Step 1 until convergence

ADMM (Cont’d)

For a fixed multiplier, split the relaxation problem into two subproblems, (SP1) and (SP2), and solve them alternately.

Both subproblems can be solved efficiently by dynamic programming!
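The exact recursions for (SP1) and (SP2) are not preserved in this transcript. To show the kind of dynamic programming involved, the sketch below is a standard Needleman-Wunsch-style global alignment over a matrix of per-position scores with a linear gap penalty. It is a generic stand-in, not the talk's subproblem solver, and `align` and the toy scores are illustrative.

```python
import numpy as np

def align(score, gap=-1.0):
    """Global alignment by dynamic programming over an (n, m) score matrix."""
    n, m = score.shape
    dp = np.zeros((n + 1, m + 1))
    dp[:, 0] = gap * np.arange(n + 1)
    dp[0, :] = gap * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i, j] = max(dp[i - 1, j - 1] + score[i - 1, j - 1],  # match
                           dp[i - 1, j] + gap,                      # gap in 2
                           dp[i, j - 1] + gap)                      # gap in 1
    # Trace back the aligned position pairs from the bottom-right corner.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if dp[i, j] == dp[i - 1, j - 1] + score[i - 1, j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i, j] == dp[i - 1, j] + gap:
            i -= 1
        else:
            j -= 1
    return dp[n, m], pairs[::-1]

# Toy local potentials: three favorable position pairs, all others penalized.
toy = -np.ones((3, 4))
for a, b in [(0, 0), (1, 1), (2, 3)]:
    toy[a, b] = 2.0
total, pairs = align(toy)
```

With only local potentials the problem decomposes position by position, which is what makes this recursion exact; the pairwise potential is what breaks that structure and motivates the relaxation.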

Superfamily & Fold Recognition Rate

[Figure panels: superfamily-level detection; fold-level detection]

Conclusion

• Joint evolutionary coupling analysis + supervised learning can significantly improve protein contact prediction by using information in multiple MSAs

• Long-range residue interaction encoded in an MSA is helpful for remote homolog detection

Acknowledgements
• RaptorX servers at http://raptorx.uchicago.edu
• Students: Jianzhu Ma, Zhiyong Wang, Sheng Wang
• Funding
  – NIH R01GM0897532
  – NSF CAREER award and NSF ABI
  – Alfred P. Sloan Research Fellowship
• Computational resources
  – University of Chicago Beagle team
  – TeraGrid and Open Science Grid

Input:MEKVNFLKNGVLRLPPGFRFRPTDEELVVQYLKRKVFSFPLPASIIPEVEVYKSDPWDLPGDMEQEKYFFSTKEVKYPNGNRSNRATNSGYWKATGIDKQIILRGRQQQQQLIGLKKTLVFYRGKSPHGCRTNWIMHEYRLANLESNYHPIQGNWVICRIFLKKRGNTKNKEENMTTHDEVRNREIDKNSPVVSVKMSSRDSEALASANSELKKKASIIFYDFMGRNNSNGVAASTSSSGITDLTTTNEESDDHEESTSSFNNFTTFKRKIN

Protein Structure Prediction

Output:

1. One of the most challenging problems in computational biology!

2. Improved due to better algorithms and large databases

3. Knowledge-based methods outperform physics-based methods

4. Big demand: our server processes >800 jobs/week, >12k users in 3 years

Performance in CASP9 (2010)

A blind test for protein structure prediction

Server ranking tested on the 50 hardest TBM targets

Adapted from http://predictioncenter.org/casp9/doc/presentations/CASP9_TBM.pdf

Performance in CASP10 (2012)

A blind test for protein structure prediction

The only server group among top 10

Adapted from http://predictioncenter.org/casp10/doc/presentations/CASP10_TBM_GM.pdf

The top 10 performing human/server groups on the hardest TBM targets

My Work

Analyze large-scale biological data and build predictive models
• Protein sequence and structure alignment
• Homology detection and fold recognition
• Protein structure prediction
• Protein function prediction (e.g., interaction and binding site prediction)
• Biological network construction and analysis

Study computational methods that have applications beyond bioinformatics
• Machine learning (e.g., probabilistic graphical models)
• Optimization (discrete, combinatorial and continuous)

Homology Detection & Fold Recognition

• Homology detection & fold recognition
  – Determine the relationship between two proteins
  – Given a query, search for all homologs in a database
• Homology search/fold recognition is useful for
  – Studying protein evolutionary relationships
  – Functional transfer
  – Homology modeling (i.e., template-based modeling)

Two proteins are homologous if they have shared ancestry.
Two proteins have the same fold if their 3D structures are similar.

Structure Prediction (Cont’d)
• Template-based modeling (TBM)
  – Using solved protein structures as templates, e.g., homology modeling and protein threading
  – Most reliable, but fails when there are no good templates
• Template-free modeling (FM) or ab initio folding
  – Not using solved protein structures as templates
  – Mostly works only on some small proteins
• Subproblems
  – Loop modeling
  – Inter-residue contact prediction

Residue Pair Grouping

Precision Submatrix Grouping

[Figure: precision matrix for Family 1; precision matrix for Family 2]

Suppose that residue pair (2,4) in Family 1 aligned to pair (3,5) in Family 2

In total ≤L(L-1)/2 groups where L is the seq length

Performance on the 31 Pfam families with only distantly-related auxiliary families

             Medium-range             Long-range
Method       L/10    L/5     L/2      L/10    L/5     L/2
CoinFold     0.457   0.400   0.267    0.558   0.524   0.416
PSICOV       0.413   0.360   0.252    0.494   0.465   0.377
PSICOV_p     0.320   0.295   0.212    0.396   0.355   0.290
PSICOV_v     0.400   0.320   0.179    0.396   0.375   0.261

CoinFold: our work
PSICOV: single-family method
PSICOV_p: merge multiple families and apply the single-family method
PSICOV_v: single-family method for each family and then consensus

Performance on the 13 Pfam families with closely-related auxiliary families

             Medium-range             Long-range
Method       L/10    L/5     L/2      L/10    L/5     L/2
CoinFold     0.501   0.395   0.251    0.462   0.413   0.293
PSICOV       0.433   0.351   0.231    0.398   0.331   0.234
PSICOV_p     0.335   0.220   0.175    0.322   0.276   0.194
PSICOV_v     0.423   0.320   0.188    0.386   0.384   0.301

CoinFold: our work
PSICOV: single-family method
PSICOV_p: merge multiple families and apply the single-family method
PSICOV_v: single-family method for each family and then consensus

Our method vs. PSICOV

Our method vs. GREMLIN

L/10 top predicted long-range contacts are evaluated

Performance vs. family size

CoinFold: our work
PSICOV: single-family method
PSICOV_p: merge multiple families and apply the single-family method
PSICOV_v: single-family method for each family and then consensus

Multiple Sequence Alignment (MSA) of One Protein Family

Top L/10 long-range prediction accuracy on 15 large Pfam families

PFAM ID    MEFF     CoinFold   PSICOV   PSICOV_p
PF00041    2981     0.767      0.767    0.667
PF00595    3026     0.556      0.444    0.233
PF03061    3334     0.375      0.250    0.500
PF01522    3519     0.615      0.462    0.077
PF00578    3733     0.308      0.308    0.231
PF00059    3744     0.455      0.455    0.182
PF07686    3801     0.917      0.583    0.667
PF00034    4060     0.600      0.500    0.200
PF00989    4596     0.583      0.250    0.500
PF00144    4684     0.272      0.212    0.242
PF00085    5075     0.636      0.545    0.182
PF00168    6735     0.667      0.667    0.556
PF00515    7230     0.500      0.500    0.250
PF00089    9045     0.783      0.783    0.739
PF00550    11476    0.857      0.857    0.714

Running Time

[Figure: X-axis: average protein sequence length; Y-axis: time (in seconds)]

Performance: Alignment Accuracy

Performance: Homology Detection

Performance: Alignment Accuracy

TMalign, Matt and DeepAlign provide three different ground truths

Joint Graphical Lasso Formulation

Log-likelihood: f(Ω)
Regularization: P(Ω)

Rewrite the original problem as max f(Ω) − P(Ω)

• Unconstrained optimization problem
• Both terms are convex, so the objective is the difference of two convex functions
• Can be solved by the Convex-Concave Procedure [A. Yuille]

Alternating Direction Method of Multipliers (ADMM)

Make a copy of Ω to Z, without changing the solution space:

max f(Ω) − P(Z)   s.t. Z = Ω

Add a penalty term to obtain an augmented problem, which has the same solution but converges faster; the constraint Z = Ω is unchanged.

Lagrangian Relaxation

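\[
\min_{U}\,\max_{\Omega,\,Z}\; f(\Omega) - P(Z) - \sum_{k=1}^{K}\big(\cdots\big)
\]

[The summand is not recoverable from this transcript.]

Use a Lagrange multiplier U for each constraint.
Obtain a dual problem to upper bound the augmented problem.

Solve the dual problem iteratively by subgradient as follows.
Step 1) Fix U and solve for Ω and Z
Step 2) Update U by a subgradient step and repeat Step 1) until convergence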

ADMM (Cont’d)

For a fixed U, split the relaxation problem into two subproblems and solve them alternately

(SP1)

(SP2)
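The formulas for (SP1) and (SP2) are missing here, so the sketch below falls back on the textbook ADMM for a single-family graphical lasso, max log|Ω| − tr(ΩS) − λ‖Ω‖₁, to show the alternating pattern. It is not the joint, multi-family solver from the talk; the scaled-multiplier form, the fixed iteration count and all names are assumptions.

```python
import numpy as np

def soft_threshold(a, t):
    """Elementwise soft-thresholding, the proximal operator of t * |.|_1."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def glasso_admm(S, lam, rho=1.0, n_iter=200):
    """Single-family graphical lasso by ADMM (scaled multiplier U)."""
    p = S.shape[0]
    Z = np.zeros((p, p))
    U = np.zeros((p, p))
    for _ in range(n_iter):
        # Omega-update (smooth part): closed form via eigendecomposition.
        w, Q = np.linalg.eigh(rho * (Z - U) - S)
        eigs = (w + np.sqrt(w ** 2 + 4.0 * rho)) / (2.0 * rho)
        Omega = (Q * eigs) @ Q.T
        # Z-update (penalty part): soft-threshold the shifted copy.
        Z = soft_threshold(Omega + U, lam / rho)
        # Multiplier update: accumulate the constraint violation Omega - Z.
        U = U + Omega - Z
    return Omega, Z

S = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.0, 0.3],
              [0.0, 0.3, 1.0]])
omega_dense, _ = glasso_admm(S, lam=0.0)   # no penalty: plain inverse
_, z_sparse = glasso_admm(S, lam=1.0)      # strong penalty: diagonal Z
```

With λ = 0 the iteration recovers the ordinary inverse of S, and with a penalty larger than every off-diagonal covariance the sparse copy Z becomes diagonal. In the joint setting the Z-update would additionally have to handle the group and prior terms.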
