The Infobiotics Contact Map predictor at CASP9

The The InfobioticsInfobiotics Contact Map Contact Map Predictor at CASP9Predictor at CASP9

J. BacarditJ. Bacardit1,21,2, P.Widera, P.Widera11 and N. Krasnogor and N. Krasnogor11

1 - School of Computer Science, University of 1 - School of Computer Science, University of Nottingham, 2 – School of Biosciences, University of Nottingham, 2 – School of Biosciences, University of

Nottingham Nottingham

[email protected]@nottingham.ac.uk

mailto:[email protected]

mailto:[email protected]

Contact MapContact Map Two residues of a chain Two residues of a chain

are said to be in contact if are said to be in contact if their distance is less than a their distance is less than a certain thresholdcertain threshold

The contacts of a protein The contacts of a protein can be represented by a can be represented by a binary matrix. 1 = contact binary matrix. 1 = contact 0 = non contact0 = non contact

Plotting this matrix reveals Plotting this matrix reveals many characteristics from many characteristics from the protein structurethe protein structure

CM prediction CM prediction is used in is used in many 3D PSP methods many 3D PSP methods (e.g. I-Tasser)(e.g. I-Tasser)

helices sheet

s

Contact

StepsSteps

1.1. Prediction ofPrediction of Secondary structure (using PSIPRED)Secondary structure (using PSIPRED) Solvent AccessibilitySolvent Accessibility Recursive Convex HullRecursive Convex Hull Coordination numberCoordination number

§ Integration of all these predictions plus Integration of all these predictions plus other sources of informationother sources of information

§ Final CM prediction (using BioHEL)Final CM prediction (using BioHEL)

Using BioHEL [Bacardit et al., 09]

Prediction of RCH, SA and CNPrediction of RCH, SA and CN

Dataset of 3262 protein chains created Dataset of 3262 protein chains created using PDB-REPRDB with:using PDB-REPRDB with: A resolution less than 2ÅA resolution less than 2Å Less than 30% sequence identifyLess than 30% sequence identify Without chain breaks nor non-standard Without chain breaks nor non-standard

residuesresidues

90% of this set was used for training 90% of this set was used for training (~490000 residues)(~490000 residues)

10% for test 10% for test

How are these features How are these features predicted?predicted?

Many of these features are due to local Many of these features are due to local interactions of an amino acid and its immediate interactions of an amino acid and its immediate neighbours neighbours We predict them from the closest neighbours We predict them from the closest neighbours

in the chainin the chain

Ri

SSi

Ri+1

SSi+1

Ri-1

SSi-1

Ri+2

SSi+2

Ri-2

SSi-2

Ri+3

SSi+3

Ri+4

SSi+4

Ri-3

SSi-3

Ri-4

SSi-4

Ri-5

SSi-5

Ri+5

SSi+5

Ri-1 Ri Ri+1 SSi

Ri Ri+1 Ri+2 SSi+1

Ri+1 Ri+2 Ri+3 SSi+2

Prediction of RCH, SA and CNPrediction of RCH, SA and CN

All three features were predicted based on All three features were predicted based on a window of ±4 residues around the targeta window of ±4 residues around the target Evolutionary information (as a Position-Evolutionary information (as a Position-

Specific Scoring Matrix) is the basis of this Specific Scoring Matrix) is the basis of this local informationlocal information

Each residue characterised by a vector of 180 Each residue characterised by a vector of 180 valuesvalues

The domain for all three features was The domain for all three features was partitioned into 5 statespartitioned into 5 states

Characterisation of the contact Characterisation of the contact map problemmap problem

Three types of input information were usedThree types of input information were used1.1. Detailed information of three different windows of Detailed information of three different windows of

residues centered aroundresidues centered around The two target residues (2x)The two target residues (2x)

The middle point between themThe middle point between them

§ Statistics about the connecting segment between Statistics about the connecting segment between the two target residues and the two target residues and

§ Global protein information. Global protein information.

1

2

3

Contact Map datasetContact Map dataset

From the set of 3262 proteins we kept all From the set of 3262 proteins we kept all proteins with less than 250AA and a proteins with less than 250AA and a randomly selected 20% of larger proteinsrandomly selected 20% of larger proteins

The resulting training set contained more The resulting training set contained more than 32 million instances and 631 than 32 million instances and 631 attributesattributes

Less than 2% of those are actual contactsLess than 2% of those are actual contacts

56GB of disk space56GB of disk space

Samples and ensemblesSamples and ensembles

50 samples of 660K examples 50 samples of 660K examples are generated from the training are generated from the training set with a ratio of 2:1 non-set with a ratio of 2:1 non-contacts/contacts contacts/contacts

BioHEL is run 25 times for each BioHEL is run 25 times for each samplesample

Prediction is done by a Prediction is done by a consensus of 1250 rule setsconsensus of 1250 rule sets

Confidence of prediction is Confidence of prediction is computed based on the votes computed based on the votes distribution in the ensemble. distribution in the ensemble.

Whole training process takes Whole training process takes about 25000 CPU hoursabout 25000 CPU hours

Training set

x50

x25

Consensus

Predictions

Samples

Rule sets

Technology

The Infobiotics Contact Map predictor at CASP9