Upload
jaumebp
View
653
Download
3
Tags:
Embed Size (px)
Citation preview
The The InfobioticsInfobiotics Contact Map Contact Map Predictor at CASP9Predictor at CASP9
J. BacarditJ. Bacardit1,21,2, P.Widera, P.Widera11 and N. Krasnogor and N. Krasnogor11
1 - School of Computer Science, University of 1 - School of Computer Science, University of Nottingham, 2 – School of Biosciences, University of Nottingham, 2 – School of Biosciences, University of
Nottingham Nottingham
[email protected]@nottingham.ac.uk
Contact MapContact Map Two residues of a chain Two residues of a chain
are said to be in contact if are said to be in contact if their distance is less than a their distance is less than a certain thresholdcertain threshold
The contacts of a protein The contacts of a protein can be represented by a can be represented by a binary matrix. 1 = contact binary matrix. 1 = contact 0 = non contact0 = non contact
Plotting this matrix reveals Plotting this matrix reveals many characteristics from many characteristics from the protein structurethe protein structure
CM prediction CM prediction is used in is used in many 3D PSP methods many 3D PSP methods (e.g. I-Tasser)(e.g. I-Tasser)
helices sheet
s
Contact
StepsSteps
1.1. Prediction ofPrediction of Secondary structure (using PSIPRED)Secondary structure (using PSIPRED) Solvent AccessibilitySolvent Accessibility Recursive Convex HullRecursive Convex Hull Coordination numberCoordination number
§ Integration of all these predictions plus Integration of all these predictions plus other sources of informationother sources of information
§ Final CM prediction (using BioHEL)Final CM prediction (using BioHEL)
Using BioHEL [Bacardit et al., 09]
Prediction of RCH, SA and CNPrediction of RCH, SA and CN
Dataset of 3262 protein chains created Dataset of 3262 protein chains created using PDB-REPRDB with:using PDB-REPRDB with: A resolution less than 2ÅA resolution less than 2Å Less than 30% sequence identifyLess than 30% sequence identify Without chain breaks nor non-standard Without chain breaks nor non-standard
residuesresidues
90% of this set was used for training 90% of this set was used for training (~490000 residues)(~490000 residues)
10% for test 10% for test
How are these features How are these features predicted?predicted?
Many of these features are due to local Many of these features are due to local interactions of an amino acid and its immediate interactions of an amino acid and its immediate neighbours neighbours We predict them from the closest neighbours We predict them from the closest neighbours
in the chainin the chain
Ri
SSi
Ri+1
SSi+1
Ri-1
SSi-1
Ri+2
SSi+2
Ri-2
SSi-2
Ri+3
SSi+3
Ri+4
SSi+4
Ri-3
SSi-3
Ri-4
SSi-4
Ri-5
SSi-5
Ri+5
SSi+5
Ri-1 Ri Ri+1 SSi
Ri Ri+1 Ri+2 SSi+1
Ri+1 Ri+2 Ri+3 SSi+2
Prediction of RCH, SA and CNPrediction of RCH, SA and CN
All three features were predicted based on All three features were predicted based on a window of ±4 residues around the targeta window of ±4 residues around the target Evolutionary information (as a Position-Evolutionary information (as a Position-
Specific Scoring Matrix) is the basis of this Specific Scoring Matrix) is the basis of this local informationlocal information
Each residue characterised by a vector of 180 Each residue characterised by a vector of 180 valuesvalues
The domain for all three features was The domain for all three features was partitioned into 5 statespartitioned into 5 states
Characterisation of the contact Characterisation of the contact map problemmap problem
Three types of input information were usedThree types of input information were used1.1. Detailed information of three different windows of Detailed information of three different windows of
residues centered aroundresidues centered around The two target residues (2x)The two target residues (2x)
The middle point between themThe middle point between them
§ Statistics about the connecting segment between Statistics about the connecting segment between the two target residues and the two target residues and
§ Global protein information. Global protein information.
1
2
3
Contact Map datasetContact Map dataset
From the set of 3262 proteins we kept all From the set of 3262 proteins we kept all proteins with less than 250AA and a proteins with less than 250AA and a randomly selected 20% of larger proteinsrandomly selected 20% of larger proteins
The resulting training set contained more The resulting training set contained more than 32 million instances and 631 than 32 million instances and 631 attributesattributes
Less than 2% of those are actual contactsLess than 2% of those are actual contacts
56GB of disk space56GB of disk space
Samples and ensemblesSamples and ensembles
50 samples of 660K examples 50 samples of 660K examples are generated from the training are generated from the training set with a ratio of 2:1 non-set with a ratio of 2:1 non-contacts/contacts contacts/contacts
BioHEL is run 25 times for each BioHEL is run 25 times for each samplesample
Prediction is done by a Prediction is done by a consensus of 1250 rule setsconsensus of 1250 rule sets
Confidence of prediction is Confidence of prediction is computed based on the votes computed based on the votes distribution in the ensemble. distribution in the ensemble.
Whole training process takes Whole training process takes about 25000 CPU hoursabout 25000 CPU hours
Training set
x50
x25
Consensus
Predictions
Samples
Rule sets