Fast Alignment of Protein Structures Based on
Conformational Letters
Wei-Mou Zheng
Institute of Theoretical Physics
Academia Sinica
• Introduction
• Conformational alphabet and CLESUM
• CLePAPS: Pairwise structure alignment
• BloMAPS: Multiple structure alignment
• Conclusion
Introduction
Protein structure comparison: an extremely important problem in structural and evolutionary biology.
• detection of local or global structural similarity
• prediction of a new protein's function
• structures are better conserved → remote homology detection
• structural comparison →
  organizing and classifying structures
  discovering structure patterns
  discovering correlations between structures and sequences → structure prediction
Conformational alphabet and CLESUM
The main difference between CLePAPS and other existing algorithms for structure alignment is the use of conformational letters.
Conformational letters = discretized states of 3D segmental conformations.
A letter = a cluster of combinations of three angles formed by the Cα pseudobonds of four contiguous residues (obtained by clustering according to the probability distribution).
Centers of 17 conformational letters
Similarity between conformational letters
CLESUM: Conformational LEtter SUbstitution Matrix
$M_{ij} = 20 \log_2 \bigl( P_{ij} / (P_i P_j) \bigr)$, comparable to BLOSUM83, $H \approx 1.05$
constructed using FSSP representatives.
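The log-odds formula above can be sketched in a few lines. This is a minimal illustration, not the author's code: the function name `log_odds_matrix`, the input format (a dictionary of symmetric pair counts) and the toy counts are assumptions; only the scale factor 20 and the base-2 logarithm come from the slide.

```python
import math


def log_odds_matrix(pair_counts, scale=20):
    """Return {(a, b): round(scale * log2(P_ab / (P_a * P_b)))}.

    pair_counts maps (a, b) -> count and is assumed symmetric,
    i.e. pair_counts[(a, b)] == pair_counts[(b, a)].
    """
    total = sum(pair_counts.values())
    # Marginal frequency of each letter: P_a = sum over b of P_ab.
    marginals = {}
    for (a, _b), n in pair_counts.items():
        marginals[a] = marginals.get(a, 0.0) + n / total
    scores = {}
    for (a, b), n in pair_counts.items():
        if n == 0:
            continue  # log-odds is undefined for unobserved pairs
        p_ab = n / total
        scores[(a, b)] = round(scale * math.log2(p_ab / (marginals[a] * marginals[b])))
    return scores


# Toy counts for a two-letter alphabet (made up for illustration).
toy_counts = {("A", "A"): 40, ("A", "B"): 10, ("B", "A"): 10, ("B", "B"): 40}
toy_scores = log_odds_matrix(toy_counts)
```

Pairs seen more often than chance get positive scores and pairs seen less often get negative ones, which is what lets the matrix rank fragment similarity.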
[Figure labels: typical helix; typical sheet; evolutionary + geometric]
CLePAPS: Pairwise structure alignment
Structure alignment: a self-consistent problem
Correspondence ↔ Rigid transformation
However, when aligning two protein structures, at the beginning we know neither the transformation nor the correspondence.
DALI, CE, VAST, STRUCTAL, ProSup
CLePAPS: Conformational Letters based Pairwise Alignment of Protein Structures
Initialization + iteration
• Similar Fragment Pairs (SFPs)
• Anchor-based
• Alignment = as many consistent SFPs as possible
Anchor-based superposition
[Figure: SFPs around the anchor SFP, marked consistent or inconsistent]
Collect as many consistent SFPs as possible
[Figure labels: the smaller; similar; seed]
Redundancy removal
[Figure labels: shaved; kept]
SFP = highly scored string pair
• Fast search for SFPs by string comparison
• CLESUM similarity score → importance of SFPs
Guided by CLESUM scores, only the top few SFPs need to be examined to determine the superposition for alignment, and hence a reliable greedy strategy becomes possible.
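The string-comparison search for SFPs can be sketched as an ungapped sliding-window scan over two conformational-letter strings. This is a hedged illustration: the function name `find_sfps`, the `score` callback and the toy match/mismatch scoring are assumptions standing in for a CLESUM lookup. The default-parameter list in these slides gives width 20 with threshold 350 for long SFPs and width 8 with threshold 0 for short ones.

```python
def find_sfps(seq_a, seq_b, score, width, threshold):
    """Return [(total_score, i, j), ...] sorted by descending score.

    Each hit is an ungapped window pair seq_a[i:i+width], seq_b[j:j+width]
    whose summed letter-pair score reaches the threshold.
    """
    hits = []
    for i in range(len(seq_a) - width + 1):
        for j in range(len(seq_b) - width + 1):
            total = sum(score(seq_a[i + k], seq_b[j + k]) for k in range(width))
            if total >= threshold:
                hits.append((total, i, j))
    hits.sort(key=lambda hit: -hit[0])  # highest-scoring SFPs first
    return hits


def toy_score(a, b):
    """Stand-in for a CLESUM lookup: +2 for identical letters, -1 otherwise."""
    return 2 if a == b else -1


toy_hits = find_sfps("XXHHHHYY", "HHHHZZ", toy_score, width=4, threshold=8)
```

Because the scan touches only letters and a score table, no coordinates are needed until the top-ranked SFPs are examined.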
Selection of optimal anchor
Example: top K, K = 2; J = 5 (rank: 1, 3, 5, 2, 4)
[Figure: anchor 1, # of consistent SFPs = 4; anchor 2, # of consistent SFPs = 1]
The Top-1 SFP is globally supported by three other SFPs, while the Top-2 SFP is supported only by itself.
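Anchor selection by consistency counting can be sketched as follows. All names here (`select_anchor`, `is_consistent`, `superpose`) are hypothetical, and the superposition is passed in as a callback. The toy data uses a translation-only fit in place of a full rigid superposition, purely to keep the example short. Consistency is tested on the maximal coordinate difference, as in the slides.

```python
def apply_transform(rotation, translation, point):
    """Return rotation * point + translation (rotation is a 3x3 nested list)."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


def is_consistent(sfp, rotation, translation, cutoff):
    """True if every matched atom pair agrees within cutoff on each axis."""
    frag_a, frag_b = sfp
    for pa, pb in zip(frag_a, frag_b):
        moved = apply_transform(rotation, translation, pa)
        if max(abs(m - q) for m, q in zip(moved, pb)) > cutoff:
            return False
    return True


def select_anchor(sfps, superpose, top_k, top_j, cutoff):
    """Try each of the top_k SFPs as anchor; count consistent SFPs in the top_j.

    sfps must already be sorted by descending similarity score.
    Returns (index of the best anchor, number of consistent SFPs).
    """
    best_index, best_count = None, -1
    for idx in range(min(top_k, len(sfps))):
        rotation, translation = superpose(sfps[idx])
        count = sum(
            is_consistent(s, rotation, translation, cutoff) for s in sfps[:top_j]
        )
        if count > best_count:
            best_index, best_count = idx, count
    return best_index, best_count


# Toy setup (assumption): translation-only fit instead of a rigid superposition.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def shift_only(sfp):
    frag_a, frag_b = sfp
    n = len(frag_a)
    shift = tuple(
        sum(q[d] - p[d] for p, q in zip(frag_a, frag_b)) / n for d in range(3)
    )
    return IDENTITY, shift


def toy_sfp(start, offset, length=3):
    frag_a = [(float(start + k), 0.0, 0.0) for k in range(length)]
    frag_b = [(x + offset[0], y + offset[1], z + offset[2]) for x, y, z in frag_a]
    return frag_a, frag_b


sfps = [
    toy_sfp(0, (5.0, 0.0, 0.0)),
    toy_sfp(10, (0.0, 9.0, 0.0)),
    toy_sfp(20, (5.0, 0.0, 0.0)),
    toy_sfp(30, (5.0, 0.5, 0.0)),
    toy_sfp(40, (5.0, 0.0, 0.0)),
]
best = select_anchor(sfps, shift_only, top_k=2, top_j=5, cutoff=1.0)
```

The toy data is built to mirror the K = 2, J = 5 example: the first SFP is consistent with four of the five SFPs, the second only with itself.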
'Zoom-in'
[Figure: anchor with cutoffs d1, d2, d3]
d1 > d2 > d3: successively reduced cutoffs for maximal coordinate difference
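The 'zoom-in' idea can be sketched as a loop that alternates between collecting SFPs consistent under the current cutoff and refitting the transformation, then tightens the cutoff. This is a sketch under stated assumptions: `zoom_in`, `fit` and `is_consistent` are hypothetical names, re-selecting from all SFPs at every stage is a design choice of this example, and the 1-D shift below merely stands in for a rigid transformation. The cutoff values are illustrative.

```python
def zoom_in(sfps, anchor, fit, is_consistent, cutoffs):
    """Refine a transformation with successively reduced cutoffs.

    fit(list_of_sfps) returns a transformation; is_consistent(sfp, transform,
    cutoff) tests one SFP. Returns (final transformation, selected SFPs).
    """
    if any(a <= b for a, b in zip(cutoffs, cutoffs[1:])):
        raise ValueError("cutoffs must be strictly decreasing (d1 > d2 > ...)")
    selected = [anchor]
    transform = fit(selected)  # initial superposition from the anchor alone
    for cutoff in cutoffs:
        candidates = [s for s in sfps if is_consistent(s, transform, cutoff)]
        if not candidates:
            break  # nothing survives this cutoff; keep the previous result
        selected = candidates
        transform = fit(selected)  # update using all currently consistent SFPs
    return transform, selected


# 1-D toy: an "SFP" is a pair of scalars (a, b) and the transformation is a shift.
def fit_shift(pairs):
    return sum(b - a for a, b in pairs) / len(pairs)


def within(pair, shift, cutoff):
    a, b = pair
    return abs(a + shift - b) <= cutoff


pairs = [(0.0, 10.0), (1.0, 11.5), (2.0, 20.0), (3.0, 40.0)]
shift, kept = zoom_in(pairs, pairs[0], fit_shift, within, cutoffs=(8.0, 6.0, 5.0))
```

A loose first cutoff lets distant but roughly compatible pairs pull the fit away from the anchor's local optimum; the tighter cutoffs then discard the pairs that do not hold up.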
Flow chart of the CLePAPS algorithm
[Figure labels: specificity; sensitivity; 1/2; 5/6; 1]
• Finding SFPs of high CLESUM similarity scores
• The greedy 'zoom-in' strategy
• Refinement by elongation and shrinking
• The Fischer benchmark test
• Database search with CLePAPS
• Multiple solutions of alignments: repeats, symmetry, domain move
• Non-topological alignment and domain shuffling
Multiple structure alignment
Multiple alignment carries significantly more information than pairwise alignment, and hence is a much more powerful tool for classifying proteins, detecting evolutionary relationship and common structural motifs, and assisting structure/function prediction.
Most existing methods of multiple structural alignment combine a pairwise alignment with some heuristic, using a progressive-type layout (like CLUSTAL-W and T-Coffee) to merge pairwise alignments into a multiple alignment: MAMMOTH-mult, CE-MC
• slow
• alignments which are optimal for the whole input set might be missed
A few true multiple alignment tools: MASS, MultiProt
Vertical equivalency and horizontal consistency
• Vertical equivalency: local similarity among structures
• Horizontal consistency: consistent spatial arrangement for a pair
Highly similar fragment block (HSFB)
Attributes of HSFB: width, positions, depth, score, consensus
Horizontal consistency of two HSFBs
[Figure labels: template; pivot; superposition; inconsistent; consistent; anchor HSFB; similar; seed; a; b; c; a&b; a&c]
1. Creating HSFBs
create HSFBs using the shortest protein as a template
sort HSFBs according to depths, then to scores
derive redundancy-removed HSFBs by examining position overlap
If a new HSFB has a high proportion of positions that overlap with existing HSFBs, remove it.
2. Selecting optimal HSFB
for each HSFB in top K
    select the pivot protein based on the HSFB consensus;
    superimpose anchored proteins on the pivot;
    find consistent HSFBs;
select the best HSFB which admits most consistent HSFBs;
A consistent HSFB contains at least 3 consistent SFPs.
3. Building scaffold
build a primary scaffold from the consistent HSFBs;
update the transformation using the consistent HSFBs;
recruit aligned fragments;
improve the scaffold;
create average template;
4. Dealing with unanchored proteins
Unanchored protein: has no member in the anchor HSFB that is supported by a sufficient number of consistent SFPs.
for each unanchored protein
    if (members are found in colored HSFBs)
        find top K members;
    else
        search for fragments similar to the scaffold, and select top K;
    pairwisely align the protein on the template;
Try to use 'colored' HSFBs other than the anchor HSFB.
5. Finding missing motifs
find missing motifs by registering atoms in spatial cells;
Only patterns shared by the pivot protein have a chance to be discovered above. Two ways of discovering 'missing motifs': by searching for maximal star-trees and by registering atoms in spatial cells. The latter: we divide the space occupied by the structures after superimposition into uniform cubic cells of a finite size, say 6 Å. Cell depth = the number of different proteins in the cell. Sort cells in descending order of depth.
from cells to 'octads'
find fragments in octads
find aligned fragments
6. Final refinement
refine the alignment and the average template;
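The cell registration used in step 5 can be sketched as below. The function name `cell_depths` and the input layout (a dictionary from protein name to superimposed atom coordinates) are assumptions; the 6 Å cell size and the definition of depth as the number of different proteins in a cell follow the slide. The later steps (octads, fragment extraction) are not covered.

```python
from collections import defaultdict


def cell_depths(structures, cell_size=6.0):
    """Register atoms in cubic cells and rank the cells by depth.

    structures maps a protein name to a list of (x, y, z) coordinates that
    are assumed to be already superimposed in a common frame.
    Returns [(cell_index, depth), ...] in descending order of depth.
    """
    members = defaultdict(set)
    for name, atoms in structures.items():
        for x, y, z in atoms:
            # Floor division also handles negative coordinates correctly.
            cell = (int(x // cell_size), int(y // cell_size), int(z // cell_size))
            members[cell].add(name)
    ranked = [(cell, len(names)) for cell, names in members.items()]
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Toy coordinates, made up for illustration.
toy_structures = {
    "p1": [(1.0, 1.0, 1.0), (7.0, 1.0, 1.0)],
    "p2": [(2.0, 2.0, 2.0)],
    "p3": [(5.9, 0.1, 3.0), (13.0, 0.0, 0.0)],
}
ranked_cells = cell_depths(toy_structures)
```

Cells at the top of the ranking are places where many proteins put atoms close together, so they are natural starting points for motifs that the pivot protein lacks.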
Conclusion
CLePAPS and BLOMAPS distinguish themselves from other existing algorithms for structure alignment in the use of conformational letters.
• Conformational alphabet: aptly balances precision with simplicity
• CLESUM: a proper measure of similarity between states
• fits the ε-congruent problem
• CLESUM extracted from the database FSSP contains information on structure database statistics, which reduces the chance of accidental matching of two irrelevant helices. evolutionary + geometric = specificity gain. For example, two frequent helices are geometrically very similar, but their score is relatively low.
• CLESUM similarity score can be used to sort the importance of SFPs for a greedy algorithm. Only the top few SFPs need to be examined.
Conclusion
Greedy strategies:
• HSFBs instead of SFPs
• Use the shortest protein to generate HSFBs
• Use consensus to select pivot
• Top K: guided by scores
• Optimal anchor HSFB
• Missing motifs
Tested on 17 structure datasets
Faster than MASS by 3 orders of magnitude
Thank you
* * * Overall algorithm of BLOMAPS * * *
create HSFBs using the shortest protein as a template;
sort HSFBs;
derive redundancy-removed HSFBs;
for each HSFB in top K
    select the pivot protein based on the HSFB consensus;
    superimpose anchored proteins on the pivot;
    find consistent HSFBs;
select the best HSFB which admits most consistent HSFBs;
build a primary scaffold from the consistent HSFBs;
update the transformation using the consistent HSFBs;
recruit aligned fragments;
improve the scaffold;
create average template;
for each unanchored protein
    if (members are found in colored HSFBs)
        find top K members;
    else
        search for fragments similar to the scaffold, and select top K;
    pairwisely align the protein on the template;
find missing motifs by registering atoms in spatial cells;
refine the average template;
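The sorting and redundancy-removal steps at the top of the algorithm above can be sketched as follows. The dictionary fields (`depth`, `score`, `start`, `width`), the function name and the 0.5 overlap threshold are assumptions made for illustration; the slides say only that an HSFB with a high proportion of overlapping positions is removed.

```python
def remove_redundant_hsfbs(hsfbs, max_overlap=0.5):
    """Sort HSFBs by depth, then score, and drop heavily overlapping ones.

    Each HSFB is a dict with 'depth', 'score', and its template positions
    given by 'start' and 'width'. An HSFB is dropped when more than
    max_overlap of its positions are already covered by kept HSFBs.
    """
    ordered = sorted(hsfbs, key=lambda h: (-h["depth"], -h["score"]))
    kept, covered = [], set()
    for hsfb in ordered:
        positions = set(range(hsfb["start"], hsfb["start"] + hsfb["width"]))
        overlap = len(positions & covered) / len(positions)
        if overlap > max_overlap:
            continue  # redundant with higher-ranked HSFBs
        kept.append(hsfb)
        covered |= positions
    return kept


# Toy HSFBs, made up for illustration.
toy_hsfbs = [
    {"depth": 3, "score": 100, "start": 0, "width": 8},
    {"depth": 4, "score": 90, "start": 2, "width": 8},
    {"depth": 4, "score": 120, "start": 20, "width": 8},
    {"depth": 2, "score": 300, "start": 40, "width": 8},
]
kept_hsfbs = remove_redundant_hsfbs(toy_hsfbs)
```

Ranking by depth before score favours blocks shared by more proteins, so a deep but lower-scoring block can displace a shallower overlapping one.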
17 test sets
Comparison of BLOMAPS with MAMMOTH-mult and CE-MC
1ak6-aa  sasgvqVADEVCRIFYDMkvrkcstpeeikkrkKAVIFCLSADKKCIIVEEGKeilvgdvgvtitDPFKHFVGMLPEKD
1cfyB-aa vaVADESLTAFNDLKLGKKY---------KFILFGLNDAKTEIVVKETStd----------PSYDAFLEKLPEND
1cnuA-aa giaVSDDCVQKFNELKLGHQH---------RYVTFKMNASNTEVVVEHVGgpn---------ATYEDFKSQLPERD
1f7sA-aa asgmaVHDDCKLRFLELKAKRTH---------RFIVYKIEEKQKQVVVEKVGqpi---------QTYEEFAACLPADE
1cfyB-ss ceECHHHHHHHHHHHHHCCC CEEEEEECCCCCEEEEEEEEcc CCHHHHHCCCCCCC
Core     **1111111111111*** xaaaaaa******bbbbbbb **22222xxxxxxx
1ak6-cl  mkobbeCCAJJKJJIKJKmlnmgalpjkmjjjcBMEDEEQCQNOLQCFBGNCLqdbahjmklnleCGCMJHJJKMEAJL
1cfyB-cl CCAJJHHHHHHHIKKAPL---------BLDEEECCAHOMLDECPLEEbg----------PGAJJHIJJGCAKL
1cnuA-cl fCCAJJIJIHHHHKKKAPL---------BLDEFECCAHOMLDECBLEEcai---------GFAHJIKJIGCAKL
1f7sA-cl dceDCAJJIHHHHHHIKKAML---------BLDEEEDPPJKOGDECBLEFcal---------EDAHHHHJJMCAIL

1ak6-aa  CRYALYDASFETkesrke--ELMFFLWAPELAPLKSKMIYASSKDAIKKKFQGIKHECQANGPeDLNRACIAEKLGGsl
1cfyB-aa CLYAIYDFEYEIngNEGKRSKIVFFTWSPDTAPVRSKMVYASSKDALRRALNGVSTDVQGTDFsEVSYDSVLERVSR
1cnuA-aa CRYAIFDYEFQV--DGGQRNKITFILWAPDSAPIKSKMMYTSTKDSIKKKLVGIQVEVQATDAaEISEDAVSERAKK
1f7sA-aa CRYAIYDFDFVT-aENCQKSKIFFIAWCPDIAKVRSKMIYASSKDRFKRELDGIQVELQATDPte
1cfyB-ss CEEEEEEEEEEEccCCEEEEEEEEEEECCCCCCHHHHHHHHHHHHHHHHHCCCCCEEEEECCHhHHCHHHHHHHHHC
Core     xccccc****** ******dddddddxxxxxx333333333333333*******eeeeexx4 44x444*******
1ak6-cl  EECEEEAFBLDEbkfnge--BEFCEPOGEAIGCAJIIIKIJKJJIIIJJKJMKPOLBGEEPLAjJJGAHJKKKINIJkq
1cfyB-cl DEDEDEBFEEDEcnOMCQEEECEEEBEBEAJGCAHHHHHHHIIMHJKIHIGENGBLEEBEPLAjHKGAIHIHJHIK
1cnuA-cl DEDEDEAFEEDB--NOGCEEEBFEEBEBEAJGCAHHHHIIHHIMIIIHHHMENBBLEEEEBLAiIIGAJIIIJJII
1f7sA-cl DEDEDEBBEEDF-aJOGCEFDCEEFBEBEAIGCAJHHHHHHHIMIHHHJHMENBBLEEDFEDPj
The alignment of four CL proteins to the template for the CL-GL ensemble. Two additional helices are indexed as 1 and 2; 'x': capping; '*': submotifs specific to CL.
Default parameters of BLOMAPS
FSSP representative pairs with the same first three family indices are used to construct CLESUM.
amino acids
a.b.c.u.v.w avpetRPNHTIYINNLNEKIKKDELKKSLHAIFSRFGQILDILVSRS...
a.b.c.x.y.z ahLTVKKIFVGGIKEDT....EEHHLRDYFEQYGKIEVIEIMTDRGS

conformational letters
a.b.c.u.v.w CCPMCEALEEEENGCPJGCCIHHHHHHHHIKMJILQEPLDEEEBGAIK
a.b.c.x.y.z ...BBEBGEDEENMFNML....FAHHHHHKKMJJLCEBLDEBCECAKK
For each aligned letter pair (A, B): N_AB++; N_BA++;
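The counting rule can be sketched for one pair of aligned conformational-letter strings. The function name and the treatment of `.` and `-` as gap characters are assumptions; both increments are applied literally, so an identical pair (A, A) adds 2 to the same entry.

```python
from collections import Counter


def count_aligned_pairs(row_a, row_b, gap_chars=".-"):
    """Accumulate symmetric pair counts from two aligned strings.

    Columns with a gap character in either row are skipped.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    counts = Counter()
    for a, b in zip(row_a, row_b):
        if a in gap_chars or b in gap_chars:
            continue
        counts[(a, b)] += 1  # N_AB++
        counts[(b, a)] += 1  # N_BA++
    return counts


# Toy rows, made up for illustration.
toy_pair_counts = count_aligned_pairs("ABB.", "AC.B")
```

Summing such counters over all qualifying FSSP representative pairs would give the pair frequencies from which a log-odds matrix is derived.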
1, Fast search for SFPs by merely string comparison
2, Width 20 for specificity + width 8 for sensitivity
3, Optimal anchor SFP selected by checking consistency
4, Avoid local traps by 'zoom-in'
The running time for the 68 pairs of the Fischer benchmark is less than 2% of that of the downloaded CE local version.
Next steps
1, BLOMAPS: fast multiple structure alignment;
SFPs → Highly Similar Fragment Blocks (HSFBs)
2, Include biochemical information into CLESUM by amino acid clustering.
Entropic clustering: AVCFIWLMY (h) + DEGHKNPQRST (p)
Entropic h-p clustering: AVCFIWLMY and DEGHKNPQRST
CLESUM-hh (lower left) and CLESUM-pp (upper right)
CLESUM-hp for type h-p (row-column)
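Choosing among the three matrices for a given residue pair can be sketched as below. The two residue groups come from the slide. The function names and the convention of returning a `swapped` flag for a p-h pair (so that the h-p matrix can be read with row and column exchanged) are assumptions of this example.

```python
HYDROPHOBIC = set("AVCFIWLMY")  # class h, from the entropic clustering
POLAR = set("DEGHKNPQRST")      # class p


def residue_class(amino_acid):
    """Return 'h' or 'p' for a one-letter amino acid code."""
    aa = amino_acid.upper()
    if aa in HYDROPHOBIC:
        return "h"
    if aa in POLAR:
        return "p"
    raise ValueError(f"unknown amino acid: {amino_acid!r}")


def matrix_for_pair(aa_a, aa_b):
    """Return (matrix name, swapped) for a residue pair.

    The name is 'hh', 'pp' or 'hp'. swapped is True for a p-h pair, meaning
    the h-p matrix should be indexed with the two positions exchanged.
    """
    classes = residue_class(aa_a) + residue_class(aa_b)
    if classes == "ph":
        return "hp", True
    return classes, False
```

For example, `matrix_for_pair("D", "A")` selects the h-p matrix with the positions exchanged, since D is in the p group and A in the h group.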
Rigid transformation for superimposition
Find the rotation R and the translation T that minimize $\sum_{i=1}^{n} \lVert R x_i + T - y_i \rVert^2$.
If the two sets of points are shifted so that their centers of mass are at the origin, T = 0. Let $X_{3 \times n}$ and $Y_{3 \times n}$ be the sets after the shift. Introduce the matrix C; the objective function is defined with Lagrange multipliers g and a symmetric matrix L, which represent the conditions for R to be an orthogonal and proper rotation matrix. Constraints: $R^{\mathsf T} R = I$, $|R| = 1$.
In the solution, M is symmetric and $S = \mathrm{diag}(s_i)$, $s_i = 1$ or $-1$. $|C| = |R||M| = |M| = |D||S|$. Singular values are non-negative, so $|D| > 0$. Finally, $|S| = \mathrm{sgn}(|C|)$.
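For reference, the standard SVD-based solution of this least-squares problem (commonly associated with Kabsch and Umeyama) can be written as below. This is a sketch under an assumed convention: C = Y Xᵀ, with R mapping X onto Y and singular values in descending order. The slide's own definition of C is not recoverable from the text, so the symbols U and V and the exact form are assumptions.

```latex
\begin{aligned}
&\min_{R,\,T}\ \sum_{i=1}^{n} \lVert R x_i + T - y_i \rVert^2,
  \qquad R^{\mathsf T} R = I,\quad |R| = 1, \\
&C = Y X^{\mathsf T} = U D V^{\mathsf T},
  \qquad D = \operatorname{diag}(d_1, d_2, d_3),\quad d_1 \ge d_2 \ge d_3 \ge 0, \\
&C = R M,\qquad M = V D S V^{\mathsf T}\ \text{(symmetric)},\qquad
  S = \operatorname{diag}\bigl(1,\ 1,\ \operatorname{sgn}|C|\bigr), \\
&R = U S V^{\mathsf T}, \qquad T = \bar{y} - R\,\bar{x}.
\end{aligned}
```

Here x̄ and ȳ are the centers of mass of the two point sets before the shift. One can check that R M = U S D S Vᵀ = U D Vᵀ = C, and that |R| = |U||V| sgn|C| = 1 whenever |D| > 0.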
[Flow chart: initial correspondence (anchor SFP) → optimal transformation for the correspondence → correspondence updating by adding consistent SFPs → convergent? (no: repeat; yes: end)]
• Protein structure alphabet: discretization of 3D continuous structure states
• Bridge 2° structure and 3D structure
• Enhance correlation between sequence and structure
• Fast structure comparison
• Structure prediction (transplant 2° structure prediction methods for 3D)
Default parameters of CLePAPS
Value   Meaning
20      length of long SFPs
350     similarity threshold for long SFPs
10      number of long SFPs used as seed candidates
50      number of long SFPs for building a star-tree
8       length of short SFPs for blank-filling
0       similarity threshold for short SFPs
4       minimum length of aligned fragments
5 Å     distance cutoff for evaluating overall alignment
10 Å    separation threshold for star construction
8 Å     separation cutoff for blank-filling in first run
6 Å     separation cutoff for blank-filling in second run
5 Å     separation cutoff for blank-filling in third run
0.1     maximal difference for rotational matrix entries of two 'identical' alignments