Neural Networks for Protein Structure Prediction Brown, JMB 1999veda.cs.uiuc.edu › courses › fa08 › cs466 › lectures › Lecture22.pdf · 2008-11-18 · •“Ab initio”

Neural Networks for Protein Structure Prediction

Brown, JMB 1999

CS 466

Saurabh Sinha

Outline

• Goal is to predict “secondary structure” of a protein from its sequence

• Artificial Neural Network used for this task

• Evaluation of prediction accuracy

What is Protein Structure?

http://academic.brooklyn.cuny.edu/biology/bio4fv/page/3d_prot.htm

http://matcmadison.edu/biotech/resources/proteins/labManual/images/220_04_114.png

Protein Structure

• An amino acid sequence “folds” into a complex 3-D structure

• Finding out this 3-D structure is a crucial and challenging task

• Experimental methods (e.g., X-ray crystallography) are very tedious

• Computational predictions are a possibility, but very difficult

What is “secondary structure”?

http://www.wiley.com/college/pratt/0471393878/student/structure/secondary_structure/secondary_structure.gif

(figure labels: “Strand”, “Helix”)

http://www.npaci.edu/features/00/Mar/protein.jpg

(figure labels: “Strand”, “Helix”)

Secondary structure prediction

• Well, the whole 3-D “tertiary” protein structure may be hard to predict from sequence

• But can we at least predict the secondary structural elements such as “strand”, “helix” or “coil”?

• This is what this paper does

• ... and so do many other papers (it is a hard problem!)

A survey of structure prediction

• The most reliable technique is “comparative modeling”

– Find a protein P whose amino acid sequence is very similar to that of your “target” protein T

– Hope that this other protein P has a known structure

– Predict a structure for T similar to that of P, after carefully considering how the sequences of P and T differ

A survey of structure prediction

• Comparative modeling fails if we don’t have a suitable homologous “template” protein P for our protein T

• “Ab initio” tertiary methods attempt to predict the structure without using a template protein structure

– Incorporate basic physical and chemical principles into the structure calculation

– Gets very hairy, and highly computationally intensive

• The other option is prediction of secondary structure only (i.e., making the goal more modest)

– These predictions may be used to provide constraints for tertiary structure prediction

Secondary structure prediction

• Early methods were based on stereochemical principles

• Later methods realized that we can do better if we use not only the one sequence T (our sequence), but also a family of “related sequences”

• Search for sequences similar to T, build a multiple alignment of these, and predict secondary structure from the multiple alignment of sequences

What’s multiple alignment doing here?

• Most conserved regions of a protein sequence are either functionally important or buried in the protein “core”

• More variable regions are usually on the surface of the protein

– there are few constraints on what type of amino acids have to be here (apart from a bias towards hydrophilic residues)

• Multiple alignment tells us which portions are conserved and which are not

http://bio.nagaokaut.ac.jp/~mbp-lab/img/hpc.png

hydrophobic core
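
As a rough illustration of how a multiple alignment exposes conserved versus variable positions, the sketch below scores each alignment column by its Shannon entropy (low entropy means conserved). The toy alignment and the function name `column_entropy` are made up for illustration, and entropy is only one common conservation measure, not necessarily the one used in the paper.

```python
import math
from collections import Counter

def column_entropy(alignment):
    """Shannon entropy (in bits) of each column of an equal-length alignment."""
    n_cols = len(alignment[0])
    entropies = []
    for c in range(n_cols):
        counts = Counter(seq[c] for seq in alignment)
        total = sum(counts.values())
        h = -sum((k / total) * math.log2(k / total) for k in counts.values())
        entropies.append(h)
    return entropies

# Hypothetical gap-free alignment of four short sequences.
toy_alignment = ["MKVLA", "MRVLS", "MKILT", "MQVLG"]
entropies = column_entropy(toy_alignment)

for position, h in enumerate(entropies):
    print(position, round(h, 3))
```

Columns where every sequence agrees get entropy 0, while a column with four different residues gets the maximum for four sequences, 2 bits.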

What’s multiple alignment doing here?

• Therefore, by looking at the multiple alignment, we could predict which residues are in the core of the protein and which are on the surface (“solvent accessibility”)

• Secondary structure is then predicted by comparing the accessibility patterns associated with helices, strands, etc.

• This approach (Benner & Gerloff) is mostly manual

• Today’s paper suggests an automated method

The PSI-PRED algorithm

• Given an amino-acid sequence, predict secondary structure elements in the protein

• Three stages:

1. Generation of a sequence profile (the “multiple alignment” step)

2. Prediction of an initial secondary structure (the neural network step)

3. Filtering of the predicted structure (another neural network step)

Generation of sequence profile

• A BLAST-like program called “PSI-BLAST” is used for this step

• We saw BLAST earlier -- it is a fast way to find high scoring local alignments

• PSI-BLAST is an iterative approach

– an initial scan of a protein database using the target sequence T

– align all matching sequences to construct a “sequence profile”

– scan the database using this new profile

• Can also pick out and align distantly related protein sequences for our target sequence T

The sequence profile looks like this

• Has 20 x M numbers (20 amino acids, M positions in the sequence)

• The numbers are the log likelihood of each residue at each position
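
To make the shape of such a profile concrete, here is a minimal sketch that computes log-odds scores from a small gap-free alignment. The uniform background frequency of 1/20, the pseudocount of 1, the base-2 logarithm, and the name `build_profile` are all assumptions for illustration; PSI-BLAST itself uses a more sophisticated scheme. The result is stored as M rows of 20 numbers, i.e., the transpose of the 20 x M layout.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def build_profile(alignment, pseudocount=1.0, background=1.0 / 20):
    """Return M rows (one per position), each holding 20 log-odds scores."""
    n_seqs = len(alignment)
    profile = []
    for c in range(len(alignment[0])):
        column = [seq[c] for seq in alignment]
        scores = []
        for aa in AMINO_ACIDS:
            # Smoothed frequency of this residue in this column.
            freq = (column.count(aa) + pseudocount) / (n_seqs + 20 * pseudocount)
            scores.append(math.log2(freq / background))
        profile.append(scores)
    return profile

# Hypothetical alignment; real profiles come from PSI-BLAST hits.
profile = build_profile(["MKVLA", "MRVLS", "MKILT", "MQVLG"])
print(len(profile), "positions x", len(profile[0]), "scores")
```

A residue seen more often than the background expects gets a positive score at that position, and a residue never seen there gets a negative one.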

Preparing for the second step

• Feed the sequence profile to an artificial neural network

• But before feeding, do a simple “scaling” to bring the numbers to a 0-1 scale

$$x \leftarrow \frac{1}{1 + e^{-x}}$$
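
The scaling step is the standard logistic function, which can be sketched in a few lines. The raw scores below are made-up numbers, not values from a real profile.

```python
import math

def logistic(x):
    """Map any real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

raw_scores = [-4.0, -1.0, 0.0, 2.0, 6.0]  # hypothetical profile entries
scaled = [logistic(x) for x in raw_scores]
print([round(s, 3) for s in scaled])
```

The mapping preserves order: strongly negative scores land near 0, a score of 0 maps to exactly 0.5, and strongly positive scores land near 1.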

Intro to Neural nets

(the second and third steps of PSIPRED)

Artificial Neural Network

• Supervised learning algorithm

• Training examples. Each example has a label

– “class” of the example, e.g., “positive” or “negative”

– “helix”, “strand”, or “coil”

• Learns how to predict the class of an example

Artificial Neural Network

• Directed graph

• Nodes or “units” or “neurons”

• Edges between units

• Each edge has a weight (not known a priori)

Layered Architecture

Input here is a four-dimensional vector. Each dimension goes into one input unit

http://www.akri.org/cognition/images/annet2.gif

Layered Architecture

http://www.geocomputation.org/2000/GC016/GC016_01.GIF

(units)

What a unit (neuron) does

• Unit i receives a total input $x_i$ from the units connected to it, and produces an output $y_i = f_i(x_i)$, where $f_i()$ is the “transfer function” of unit i

$$x_i = \sum_{j \in N \setminus \{i\}} w_{ij} y_j + w_i$$

$$y_i = f_i(x_i) = f_i\left( \sum_{j \in N \setminus \{i\}} w_{ij} y_j + w_i \right)$$

$w_i$ is called the “bias” of the unit
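
The computation of a single unit can be sketched directly from these formulas. The function name `unit_output`, the choice of a sigmoid as default transfer function, and the example numbers are assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(inputs, weights, bias, transfer=sigmoid):
    """Compute y_i = f_i(sum_j w_ij * y_j + w_i) for one unit."""
    total_input = sum(w * y for w, y in zip(weights, inputs)) + bias
    return transfer(total_input)

# Three incoming edges with made-up weights and a made-up bias.
y = unit_output(inputs=[0.5, 0.1, 0.9], weights=[1.0, -2.0, 0.5], bias=-0.75)
print(y)
```

Passing `transfer=lambda x: x` gives a linear unit instead of a sigmoidal one.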

Weights, bias and transfer function

Unit takes n inputs

Each input edge has weight wi

Bias b

Output a

Transfer function f()

Linear, sigmoidal, or other

Weights, bias and transfer function

• Weights $w_{ij}$ and bias $w_i$ of each unit are “parameters” of the ANN.

– Parameter values are learned from input data

• Transfer function is usually the same for every unit in the same layer

• Graphical architecture (connectivity) is decided by you.

– Could use fully connected architecture: all units in one layer connect to all units in “next” layer
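
A fully connected, layered network is then just the single-unit computation repeated layer by layer. The sketch below assumes sigmoid units everywhere, random initial weights in [-0.5, 0.5], and an arbitrary 4-5-3 layout (four inputs, five hidden units, three outputs); none of these choices come from the paper.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_layer(n_in, n_out, rng):
    """One weight row (plus one bias) per unit in the layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
    return weights, biases

def forward_layer(inputs, layer):
    """Every unit in the layer sees every input (fully connected)."""
    weights, biases = layer
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs, layers):
    activations = inputs
    for layer in layers:
        activations = forward_layer(activations, layer)
    return activations

rng = random.Random(0)  # fixed seed so the run is repeatable
network = [make_layer(4, 5, rng), make_layer(5, 3, rng)]
outputs = forward([0.2, 0.7, 0.1, 0.9], network)
print(outputs)
```

With untrained random weights the three outputs are meaningless; training is what makes them useful.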

Where’s the algorithm?

• It’s in the training of parameters!

• Given several examples and their labels: the training data

• Search for parameter values such that output units make correct predictions on the training examples

• “Back-propagation” algorithm

– Read up more on neural nets if you are interested
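
To give a feel for what "searching for parameter values" means, here is a sketch of gradient descent on a single sigmoid unit. This is not full back-propagation, only its simplest case: with one unit there is just one step of the chain rule. The logical-OR toy task, the cross-entropy loss, the learning rate, and the epoch count are all assumptions chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(inputs, weights, bias):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_unit(examples, epochs=2000, learning_rate=0.5):
    """Fit one sigmoid unit by per-example gradient descent."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            output = predict(inputs, weights, bias)
            # Gradient of the cross-entropy loss w.r.t. the unit's total input.
            error = output - target
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias

# Toy labeled data: the logical OR of two binary inputs.
or_examples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_unit(or_examples)
predictions = [round(predict(x, weights, bias)) for x, _ in or_examples]
print(predictions)
```

Back-propagation extends the same idea to hidden layers by passing each unit's error signal backwards along the weighted edges.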

Back to PSIPRED …

Step 2

• Feed the sequence profile to the input layer of an ANN

• Not the whole profile, only a window of 15 consecutive positions

• For each position, there are 20 numbers in the profile (one for each amino acid)

• Therefore, ~ 15 x 20 = 300 numbers fed

• Therefore, ~ 300 “input units” in ANN

• 3 output units, for “strand”, “helix”, “coil”

– each number is the confidence in that secondary structure for the central position in the window of 15

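[Figure: a window of 15 positions feeds the input layer, followed by a hidden layer and three output units labeled helix, strand, coil; example output values 0.18, 0.09, 0.67]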
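
The windowing in Step 2 can be sketched as follows: for a chosen central position, take the 15 surrounding profile columns and flatten them into one input vector of 15 x 20 = 300 numbers. The slides do not say what happens when the window runs past the ends of the sequence, so the zero padding here is an assumption, as are the function name `window_inputs` and the made-up profile.

```python
def window_inputs(profile, center, window=15):
    """Flatten the `window` profile rows centred on `center` into one list."""
    half = window // 2
    n_features = len(profile[0])
    flat = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(profile):
            flat.extend(profile[pos])
        else:
            # Assumed handling of sequence ends: pad with zeros.
            flat.extend([0.0] * n_features)
    return flat

# Made-up scaled profile: 30 positions, 20 numbers per position.
toy_profile = [[((p + a) % 10) / 10 for a in range(20)] for p in range(30)]

inputs = window_inputs(toy_profile, center=0)
print(len(inputs))
```

Sliding `center` from the first to the last position yields one 300-number input vector, and hence one three-number prediction, per residue.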

Step 3

• Feed the output of 1st ANN to the 2nd ANN

• Each window of 15 positions gave 3 numbers from the 1st ANN

• Take 15 successive windows’ outputs and feed them to 2nd ANN

• Therefore, ~ 15 x 3 = 45 input units in 2nd ANN

• 3 output units, for “strand”, “helix”, “coil”
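
The input to the second network can be assembled the same way, only now each position contributes the three numbers produced by the first network rather than 20 profile values. In this sketch the (helix, strand, coil) ordering, the zero padding at the sequence ends, and the name `second_stage_inputs` are assumptions.

```python
def second_stage_inputs(first_stage_outputs, center, window=15):
    """Flatten 15 successive first-network outputs into 45 numbers."""
    half = window // 2
    flat = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(first_stage_outputs):
            flat.extend(first_stage_outputs[pos])
        else:
            flat.extend([0.0, 0.0, 0.0])  # assumed padding past the ends
    return flat

# Made-up first-network outputs for a 20-residue sequence,
# one (helix, strand, coil) triple per position.
first_stage = [[0.18, 0.09, 0.67] for _ in range(20)]

inputs = second_stage_inputs(first_stage, center=10)
print(len(inputs))
```

Because the second network sees the predictions for neighbouring residues, it can smooth out isolated, implausible assignments, which is the "filtering" role described in the three-stage outline.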

Test of performance

Cross-validation

• Partition the labeled data into a “training set” (two thirds of the examples) and a “test set” (remaining one third)

• Train PSIPRED on the training set, then make predictions on the test set and compare them with the known answers

• What is an answer?

– For each position of the sequence, a prediction of what secondary structure that position is involved in

– That is, a sequence over “H/S/C” (helix/strand/coil)

• How to compare answer with known answer?

– Number of positions that match
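
The comparison described here amounts to counting matching positions between two H/S/C strings. The sketch below reports that count as a fraction and also shows a two-thirds / one-third split; the example strings, the function names, and the unshuffled split are made up for illustration.

```python
def per_residue_accuracy(predicted, known):
    """Fraction of positions where predicted and known states agree."""
    if len(predicted) != len(known):
        raise ValueError("sequences must have the same length")
    matches = sum(p == k for p, k in zip(predicted, known))
    return matches / len(known)

def split_two_thirds(examples):
    """First two thirds for training, remaining third for testing."""
    cut = (2 * len(examples)) // 3
    return examples[:cut], examples[cut:]

predicted = "CCHHHHHCCSSSSCC"  # hypothetical prediction
known = "CCHHHHCCCSSSCCC"      # hypothetical known structure
accuracy = per_residue_accuracy(predicted, known)
print(round(accuracy, 3))

train_set, test_set = split_two_thirds(list(range(9)))
print(train_set, test_set)
```

In this example 13 of the 15 positions agree, so the accuracy is about 0.867.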
