Chapter-3 HOMOLOGY MODELING - INFLIBNETshodhganga.inflibnet.ac.in/bitstream/10603/8277/13/13_chapter 3.pdf · Chapter-3 HOMOLOGY MODELING 3.1 HOMOLOGY MODELING One technique that


Chapter-3

HOMOLOGY MODELING

3.1 HOMOLOGY MODELING

One technique that can be applied to generate reasonable models of protein

structures is homology modeling. This procedure, also termed comparative modeling

or knowledge-based modeling, develops a three-dimensional model from a protein

sequence based on the structures of homologous proteins.

3.1.1 Swiss-Prot

Swiss-Prot is a protein sequence database maintained by the Swiss Institute of Bioinformatics (SIB) and was established in 1986. It strives to provide reliable protein sequences with a high level of annotation, such as the description of a protein's function, its domain structure, post-translational modifications and variants. Swiss-Prot later became part of the UniProt consortium. The UniProt Knowledgebase (UniProtKB) is the central database of protein sequences with accurate, consistent and rich sequence and functional annotation, and it is among the most widely used protein information resources in the world. The group also develops and maintains other databases, including PROSITE, a database of protein families and domains, and ENZYME, a database of enzyme nomenclature. For this study, the protein sequence database was scanned for Glutathione S-transferase (GST) sequences, and the resulting GST proteins were listed so that one could be selected for homology modeling.

http://www.expasy.org/prosite/

http://www.expasy.org/enzyme/

Figure 3.1: Image of the Swiss-Prot protein sequence database.

3.1.2 Protein Sequence Selection

The protein Q08392 was identified from the Swiss-Prot database. In total, six GST proteins were selected. They are:

1. Q08392 (GSTA1_CHICK),

2. Q08393 (GSTA2_CHICK),

3. P80894 [Antechinus stuartii (Brown marsupial mouse)],

4. P46428 [Anopheles gambiae (African malaria mosquito)],

5. Q7REH6 [Plasmodium yoelii yoelii],

6. P46423 [Hyoscyamus muticus (Egyptian henbane)].

The protein-protein BLAST program (blastp) from NCBI was used to scan each query protein sequence against the PDB structure database. Swiss-Prot entry Q08392 was scanned first, and the other five protein sequences were scanned in the same way. The results are discussed in detail in the Results section. The GST sequences extracted from the Swiss-Prot protein sequence database are given below in FASTA format.


1. Q08392 (GSTA1_CHICK)

>sp|Q08392|GSTA1_CHICK Glutathione S-transferase OS=Gallus gallus PE=2 SV=1

MSGKPVLHYANTRGRMESVRWLLAAAGVEFEEKFLEKKEDLQKLKSDGSLLFQQVPMVEI

DGMKMVQTRAILNYIAGKYNLYGKDLKERALIDMYVEGLADLYELIMMNVVQPADKKEEH

LANALDKAANRYFPVFEKVLKDHGHDFLVGNKLSRADVHLLETILAVEESKPDALAKFPL

LQSFKARTSNIPNIKKFLQPGSQRKPRLEEKDIPRLMAIFH

2. Q08393 (GSTA2_CHICK)

>sp|Q08393|GSTA2_CHICK Glutathione S-transferase OS=Gallus gallus PE=2 SV=2

MAGKPKLHYTRGRGKMESIRWLLAAAGVEFEEEFIEKKEDLEKLRNDGSLLFQQVPMVEIDGMKMVQSR

AILCYIAGKYNLYGKDLKERAWIDMYVEGTTDLMGMIMALPFQAADVKEKNIALITERATTRYFPVYEK

ALKDHGQDYLVGNKLSWADIHLLEAILMTEELKSDILSAFPLLQAFKGRMSNVPTIKKFLQPGSQRKPP

LDEKSIANVRKIFSF

3. Antechinus stuartii (Brown marsupial mouse)

>sp|P80894|GSTA1_ANTST Glutathione S-transferase OS=Antechinus stuartii PE=1 SV=1

MAGEQNIKYFNIKGRMEAIRWLLAVAGVEFEEKFFETKEQLQKLKETVLLFQQVPMVEIDGMKLVQTRA

ILHYIAEKYNLLGKDMKEHAQIIMYSEGTMDLMELIMIYPFLKGEEKKQRLVEIANKAKGRYFPAFENV

LKTHGQNFLVGNQLSMADVQLFEAILMVEEKVPDALSGFPLLQAFKTRISNIPTVKTFLAPGSKRKPVP

DAKYVEDIIKIFYF

4. Anopheles gambiae (African malaria mosquito)

>sp|P46428|GST_ANOGA Glutathione S-transferase OS=Anopheles gambiae GN=GstS1 PE=2 SV=4

MPDYKVYYFNVKALGEPLRFLLSYGNLPFDDVRITREEWPALKPTMPMGQMPVLEVDGKKVHQSVAMSR

YLANQVGLAGADDWENLMIDTVVDTVNDFRLKIAIVAYEPDDMVKEKKMVTLNNEVIPFYLTKLNVIAK

ENNGHLVLGKPTWADVYFAGILDYLNYLTKTNLLENFPNLQEVVQKVLDNENVKAYIAKRPITEV


5. Plasmodium yoelii yoelii

>sp|Q7REH6|GST_PLAYO Glutathione S-transferase OS=Plasmodium yoelii yoelii GN=GST PE=3 SV=1

MTYLYNFFFFFFFFFSRGKAELIRLIFAYLQVKYTDIRFGVNGDAFAEFNNFKKEKEIPFNQVPILEIG

GLILAQSQAIVRYLSKKYNISGNGELNEFYADMIFCGVQDIHYKFNNTNLFKQNETTFLNEELPKWSGY

FEKLLQKNNTNYFVGDTITYADLAVFNLYDDIESKYPNCLKNFPLLKAHIELISNIPNIKHYIANRKES

VY

6. Hyoscyamus muticus (Egyptian henbane)

>sp|P46423|GSTF_HYOMU Glutathione S-transferase OS=Hyoscyamus muticus PE=1 SV=1

MGMKLHGPAMSPAVMRVIATLKEKDLDFELVPVNMQAGDHKKEPFITLNPFGQVPAFEDGDLKLFESRA

ITQYIAHTYADKGNQLLANDPKKMAIMSVWMEVESQKFDPVASKLTFEIVIKPMLGMVTDDAAVAENEE

KLGKVLDVYESRLKDSKYLGGDSFTLADLHHAPAMNYLMGTK

VKSLFDSRPHVSAWCADILARPAWSKAIEYKQ

These six sequences were selected for this study from the thousands of GST sequences available. No specific criterion was followed during selection, as almost all of the sequences submitted to the database are identical.
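For later processing, FASTA records such as those listed above can be read programmatically. The following is a minimal Python sketch and is not part of the original workflow; the function name `parse_fasta` and the choice to key each record by its accession field are assumptions made for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of {accession: sequence}.

    Assumes UniProt-style headers of the form '>sp|ACCESSION|NAME description'.
    Headers without that layout are keyed by their first word instead.
    """
    records = {}
    accession = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            fields = line[1:].split("|")
            accession = fields[1] if len(fields) >= 3 else line[1:].split()[0]
            records[accession] = []
        elif accession is not None:
            records[accession].append(line)  # sequence may span several lines
    return {acc: "".join(parts) for acc, parts in records.items()}


# The Q08392 record as listed in this section.
example = """>sp|Q08392|GSTA1_CHICK Glutathione S-transferase OS=Gallus gallus PE=2 SV=1
MSGKPVLHYANTRGRMESVRWLLAAAGVEFEEKFLEKKEDLQKLKSDGSLLFQQVPMVEI
DGMKMVQTRAILNYIAGKYNLYGKDLKERALIDMYVEGLADLYELIMMNVVQPADKKEEH
LANALDKAANRYFPVFEKVLKDHGHDFLVGNKLSRADVHLLETILAVEESKPDALAKFPL
LQSFKARTSNIPNIKKFLQPGSQRKPRLEEKDIPRLMAIFH
"""

sequences = parse_fasta(example)
print(len(sequences["Q08392"]))  # sequence length in residues
```

Keying by accession makes it easy to look up the sequence that is later submitted to BLAST, FASTA and MODELLER.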

3.1.3 Sequence Retrieval

The Sequence Retrieval System (SRS) is a widely used data integration, analysis and display tool for genomics, bioinformatics and related data. It provides a homogeneous interface to over 80 biological databases and was developed at the European Bioinformatics Institute (EBI) at Hinxton, UK. It includes databases of sequences, transcription factors, metabolic pathways and application results such as BLAST and FASTA, as well as protein 3-D structures, genomes, mappings, mutations and locus-specific mutations. SRS thus integrates heterogeneous databanks in molecular biology and genome analysis. There are currently several dozen servers worldwide that provide access to over 300 different databanks via the World Wide Web. Additional technology for integrating externally developed applications into the package provides powerful capabilities for biological data analysis.

3.1.4 BLAST

The Basic Local Alignment Search Tool (BLAST) is the most popular database searching program because it combines speed with sensitivity. It finds regions of local similarity between sequences: the program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of the matches. BLAST is a heuristic method for finding the highest-scoring locally optimal alignments between a query sequence and a database. It can be used to infer functional and evolutionary relationships between sequences and to help identify members of gene families. The statistics allow the probability of obtaining an ungapped alignment (HSP, High-scoring Segment Pair) with a particular score to be estimated, and the BLAST algorithm permits nearly all HSPs above a cutoff to be located efficiently in a database.

The fundamental problem of a sequence similarity search against a DNA or protein sequence database is to infer structural, or even functional, homology from a sequence similarity score. This score is calculated from pairwise sequence alignments using a variety of algorithms. Homologous genes are genes that have evolved from a common ancestral gene through duplications and mutations. Homology is a powerful inference because it can be inferred with high confidence from statistically significant similarity scores. It is also informative because homologous sequences generally share a common 3D structure, although their functions can be quite different. In the following discussion the BLAST search algorithm is used as the example, as it is the most popular similarity search program.
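The statistical significance of a BLAST match is usually reported as a bit score and an expect value (E-value). As a rough sketch under the Karlin-Altschul model, the bit score is S' = (λS - ln K) / ln 2 and the expected number of chance HSPs is E = K·m·n·e^(-λS), where S is the raw score, m the query length and n the database length. The λ and K values and the database size used below are illustrative assumptions (values of this order are typical for gapped BLOSUM62 searches) and are not taken from this study; edge-effect corrections applied by real BLAST are ignored.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalised bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance HSPs: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


LAMBDA = 0.267       # assumed statistical parameter, for illustration only
K = 0.041            # assumed statistical parameter, for illustration only
QUERY_LEN = 221      # length of Q08392
DB_LEN = 10_000_000  # hypothetical database size in residues

for raw in (40, 80, 160):
    bits = bit_score(raw, LAMBDA, K)
    expect = e_value(raw, QUERY_LEN, DB_LEN, LAMBDA, K)
    print(raw, round(bits, 1), expect)
```

The loop shows how quickly the E-value falls as the raw score rises, which is why a low E-value is taken as evidence that a hit is not due to chance.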

3.2 How BLAST Works

3.2.1 The Basics

The BLAST algorithm is a heuristic program, which means that it relies on some smart shortcuts to perform the search faster. BLAST performs "local" alignments. Most proteins are modular in nature, with functional domains often being repeated within the same protein as well as across different proteins from different species. The BLAST algorithm is tuned to find these domains or shorter stretches of sequence similarity. The local alignment approach also means that an mRNA can be aligned with a piece of genomic DNA, as is frequently required in genome assembly and analysis. If BLAST instead started out by attempting to align two sequences over their entire lengths (known as a global alignment), fewer similarities would be detected, especially with respect to domains and motifs.

When a query is submitted via one of the BLAST Web pages, the sequence, plus any other input information such as the database to be searched, word size, expect value, and so on, is fed to the algorithm on the BLAST server. BLAST works by first making a look-up table of all the “words” (short subsequences, which for proteins are three letters long by default) and “neighbouring words”, i.e., similar words, in the query sequence. The sequence database is then scanned for these “hot spots”. When a match is identified, it is used to initiate gap-free and gapped extensions of the “word”.

BLAST does not search GenBank flat files (or any subset of GenBank flat files) directly. Rather, sequences are made into BLAST databases. Each entry is split, and two files are formed, one containing just the header information and one containing just the sequence information. These are the data that the algorithm uses. If BLAST is to be run in “stand-alone” mode, the data file could consist of local, private data, downloaded NCBI BLAST databases, or a combination of the two.

http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=handbook&part=A1237&rendertype=def-item&id=app9

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/BLAST_algorithm.html

For the blastp run on the NCBI website, the Glutathione S-transferase query sequence (Q08392) was downloaded in FASTA format and searched against the PDB database. All parameters were left at their defaults except the scoring matrix, and the results obtained are reported.
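The word look-up and extension steps described in this section can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the BLAST implementation: it indexes exact three-letter words only (no neighbouring words and no substitution matrix) and extends a hit only while residues are identical (no score-based drop-off and no gapped extension). All function names are hypothetical.

```python
from collections import defaultdict


def build_word_index(query, word_size=3):
    """Look-up table: word -> list of start positions in the query."""
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    return index


def find_word_hits(index, subject, word_size=3):
    """Scan the subject and return (query_pos, subject_pos) for each shared word."""
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


def extend_ungapped(query, subject, i, j, word_size=3):
    """Extend a word hit in both directions while the residues are identical."""
    start_q, start_s = i, j
    end_q, end_s = i + word_size, j + word_size
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return query[start_q:end_q]


query = "MSGKPVLHYANTRGRM"    # first 16 residues of Q08392
subject = "AGKPVLHYFNARGRME"  # first 16 residues of 1ML6

index = build_word_index(query)
hits = find_word_hits(index, subject)
segments = sorted({extend_ungapped(query, subject, i, j) for i, j in hits})
print(hits)
print(segments)
```

Several word hits fall inside the same conserved stretch, so they extend to the same segment; collecting the results in a set removes these repeats.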









Figure 3.2: BLAST input sequence, showing the PDB database being chosen for analysis.

3.2.2 WU-BLAST2

WU-BLAST2 stands for Washington University Basic Local Alignment Search Tool. It is a sensitive and fast alignment tool that gives good alignments between query and subject sequences. The WU-BLAST program was used to scan the query sequence against the PDB protein database using default options. Changing the scoring matrix produced variation in the results, i.e. in the score and E-value.

Figure 3.3: WU-BLAST2 input sequence, showing the Protein Structure Sequence database being chosen for analysis.

3.2.3 FASTA

FASTA stands for FAST-ALL, reflecting the fact that it can be used for a fast nucleotide comparison or a fast protein comparison. The program achieves a high level of sensitivity for similarity searching at high speed. This is achieved by performing optimized searches for local alignments using a substitution matrix. The trade-off between speed and sensitivity is controlled by the ktup parameter, which specifies the size of the word. Increasing ktup decreases the number of background hits. Not every word hit is investigated; instead, the program initially looks for segments containing several nearby hits. The high speed of the program is achieved by using the observed pattern of word hits to identify potential matches before attempting the more time-consuming optimized search. FASTA analysis was performed by pasting the query sequence into the input box shown below. All parameters were kept at their defaults except the matrix. The results are discussed in detail in the Results section. An example of FASTA input is given below, with the PDB database selected and the BLOSUM45 matrix.
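The effect of ktup can be illustrated with a small sketch that counts word hits per diagonal, where a diagonal is the difference between the hit position in the query and in the subject. Nearby hits on the same diagonal are what the program looks for before running the slower optimized search. This is only an illustration of the idea, not the FASTA program itself, and the function name `diagonal_hits` is hypothetical.

```python
from collections import Counter


def diagonal_hits(query, subject, ktup):
    """Count exact ktup-word hits on each diagonal (query_pos - subject_pos)."""
    positions = {}
    for i in range(len(query) - ktup + 1):
        positions.setdefault(query[i:i + ktup], []).append(i)
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], []):
            diagonals[i - j] += 1
    return diagonals


query = "MSGKPVLHYANTRGRM"    # first 16 residues of Q08392
subject = "AGKPVLHYFNARGRME"  # first 16 residues of 1ML6

for ktup in (1, 2):
    diagonals = diagonal_hits(query, subject, ktup)
    # total number of hits, and the diagonal carrying the most hits
    print(ktup, sum(diagonals.values()), diagonals.most_common(1))
```

With ktup = 1 many scattered background hits appear on unrelated diagonals, while with ktup = 2 only the hits on the true alignment diagonal remain in this small example.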

Figure 3.4: FASTA input sequence, showing the Protein Structure Sequence database selected for analysis of the GST sequence.

3.2.4 PDB

The RCSB PDB provides a variety of tools and resources for studying the structure of biological macromolecules and their relationship to sequence, function, and disease. The RCSB is a member of the wwPDB, whose mission is to ensure that the PDB archive remains an international resource with uniform data. The site offers tools for browsing, searching and reporting that utilize the data resulting from ongoing efforts to create a more consistent and comprehensive archive. It is a repository for the 3-D structural data of large biological molecules, such as proteins and nucleic acids. The PDB archive contains information about experimentally determined structures of proteins, nucleic acids, and complex assemblies. As a member of the wwPDB, the RCSB PDB curates and annotates PDB data according to agreed-upon standards.

http://www.pdb/

The RCSB PDB also provides a variety of tools and resources. Users can perform simple and advanced searches based on annotations relating to sequence, structure and function. The molecules can be visualized, downloaded, and analyzed by users who range from students to specialized scientists. The Protein Data Bank (PDB) format provides a standard representation for macromolecular structure data derived from X-ray diffraction and NMR studies. This representation was created in the 1970s, and a large amount of software using it has been written. If the contents of the PDB are thought of as primary data, then there are hundreds of derived (i.e., secondary) databases that categorize the data differently. For example, both SCOP and CATH categorize structures according to type of structure and assumed evolutionary relations, while GO categorizes structures based on genes.

This database was used to download structure files in PDB format (.pdb extension) for homology modeling. The structural data, summary information, sequence length, X-ray parameters, resolution, Ramachandran plot and other factors were carefully studied.
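Because the PDB format stores each atom on a fixed-column line, the coordinates needed for modeling can be extracted with simple string slicing. The sketch below reads a few fields from ATOM records and collects the alpha-carbon (CA) coordinates. The function names are hypothetical, and the two sample records use made-up coordinates for illustration only; they are not taken from 1ML6.

```python
def parse_atom_line(line):
    """Extract selected fields from one ATOM record using the fixed PDB columns."""
    return {
        "serial": int(line[6:11]),       # columns 7-11: atom serial number
        "name": line[12:16].strip(),     # columns 13-16: atom name
        "res_name": line[17:20].strip(), # columns 18-20: residue name
        "chain": line[21],               # column 22: chain identifier
        "res_seq": int(line[22:26]),     # columns 23-26: residue number
        "x": float(line[30:38]),         # columns 31-38: x coordinate
        "y": float(line[38:46]),         # columns 39-46: y coordinate
        "z": float(line[46:54]),         # columns 47-54: z coordinate
    }


def ca_coordinates(pdb_text):
    """Return the (x, y, z) coordinates of all CA atoms in the given PDB text."""
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            atom = parse_atom_line(line)
            coords.append((atom["x"], atom["y"], atom["z"]))
    return coords


# Two made-up records, for illustration only.
sample = (
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504\n"
    "ATOM      2  CA  ALA A   1      11.639   6.071  -5.147\n"
)

print(parse_atom_line(sample.splitlines()[0]))
print(ca_coordinates(sample))
```

CA coordinates extracted this way are the usual input for comparing a model with its template.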

Figure 3.5: Image showing the search page of the Protein Data Bank.

3.2.5 Sequence Search Using Hill climbing Algorithm (SSHC)

Step 1: Start

Step 2: Take the unknown sequence Sk

Step 3: Take the set of known sequences KSs as the search space

Step 4: Apply the hill climbing technique to match Sk with KSs:

If Sk[i,j] (Intersection) KSs[i,j] > Sk[i,k] (Intersection) KSs[i,k]

Then Sk[i,j] (Intersection) KSs[i,j] is the result

Step 5: If no matches are found with KSs, then Sk is a new sequence

Step 6: Stop
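The steps above leave the neighbourhood and the scoring function open. One possible reading, sketched here as an assumption rather than as the method actually used, treats a candidate match as a window of Sk that also occurs in a known sequence, scores it by its length, and climbs by extending the window one residue at a time until no extension still matches. The function names are hypothetical, and only the first 57 residues of 1ML6 and Q08392 are used to keep the example short.

```python
def climb(unknown, known, start):
    """Hill climbing from unknown[start]: grow the window while it still occurs in known."""
    lo, hi = start, start + 1
    if unknown[lo:hi] not in known:
        return ""  # the starting residue does not occur in the known sequence
    improved = True
    while improved:
        improved = False
        # Neighbour 1: extend the window one residue to the right.
        if hi < len(unknown) and unknown[lo:hi + 1] in known:
            hi += 1
            improved = True
        # Neighbour 2: extend the window one residue to the left.
        elif lo > 0 and unknown[lo - 1:hi] in known:
            lo -= 1
            improved = True
    return unknown[lo:hi]


def best_shared_segment(unknown, known):
    """Restart the climb from every position and keep the longest local optimum."""
    return max((climb(unknown, known, s) for s in range(len(unknown))), key=len)


known = "AGKPVLHYFNARGRMECIRWLLAAAGVEFEEKFIQSPEDLEKLKKDGNLMFDQVPMV"    # 1ML6, first 57 residues
unknown = "MSGKPVLHYANTRGRMESVRWLLAAAGVEFEEKFLEKKEDLQKLKSDGSLLFQQVPM"  # Q08392, first 57 residues

segment = best_shared_segment(unknown, known)
if segment:
    print("Longest shared segment:", segment)
else:
    print("No match found: Sk is treated as a new sequence")  # Step 5
```

Restarting from every position is a simple way to reduce the risk that a single climb stops at a short local optimum.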

KNOWN PROTEIN SEQUENCE 1ML6:

AGKPVLHYFNARGRMECIRWLLAAAGVEFEEKFIQSPEDLEKLKKDGNLMFDQVPMV

EIDGMKLVQTRAILNYIATKYDLYGKDMKERALIDMYTEGILDLTEMIGQLVLCPPD

QREAKTALAKDRTKNRYLPAFEKVLKSHGQDYLVGNRLTRVDVHLLELLLYVEELDA

SLLTPFPLLKAFKSRISSLPNVKKFLQPGSQRKPPLDAKQIEEARKVFKF

UNKNOWN PROTEIN SEQUENCE Q08392:

MSGKPVLHYANTRGRMESVRWLLAAAGVEFEEKFLEKKEDLQKLKSDGSLLFQQVPM

VEIDGMKMVQTRAILNYIAGKYNLYGKDLKERALIDMYVEGLADLYELIMMNVVQPA

DKKEEHLANALDKAANRYFPVFEKVLKDHGHDFLVGNKLSRADVHLLETILAVEESK

PDALAKFPLLQSFKARTSNIPNIKKFLQPGSQRKPRLEEKDIPRLMAIFH

3.2.6 Script

<HTML><BODY>

<TABLE border=1>

<?php

// First sequence. It is wrapped over several lines here, so all whitespace
// is stripped below before any matching is done.
$str1 =
"MGAASGRRGPGLLLPLPLLLLLPPQPALALDPGLQPGNFSADEAGAQLFAQSYNSS
AEQVFQSVAASWAHDTNITAENARRQEEAALLSQEFAEAWGQKAKELYEPIWQNFTD
PQLRRIGAVRTLGSANLPLAKRQQYNALLSNMSRIYSTAKVCLPNKTATCWSLDPDL
TNILASSRSYAMLLFAWEGWHNAAGIPLKPLYEDFTALSNEAYKQDGFTDTGAYWRS
WYNSPTFEDDLEHLYQQLEPLYLNLHAFVRRALHRRYGDRYINLRGPIPAHLLGDMW
AQSWENIYDMVVPFPDKPNLDVTSTMLQQGWNATHMFRVAEEFFTSLELSPMPPEFW
EGSMLEKPADGREVVCHASAWDFYNRKDFRIKQCTRVTMDQLSTVHHEMGHIQYYLQ
YKDLPVSLRRGANPGFHEAIGDVLALSVSTPEHLHKIGLLDRVTNDTESDINYLLKM
ALEKIAFLPFGYLVDQWRWGVFS GRTPPSRYNFDW";
$str1 = preg_replace('/\s+/', '', $str1);

// Second sequence, cleaned in the same way.
$str2 =
"MGNTTSDRVSGERHGAKAARSEGAGGHAPGKEHKIMVGSTDDPSVFSLPDSKLPGDK
EFVSWQQDLEDSVKPTQQARPTVIRWSEGGKEVFISGSFNNWSTKIPLIKSHNDFVA
ILDLPEGEHQYKFFVDGQWVHDPSEPVVTSQLGTINNLIHVKKSDFEVFDALKLDSM
ESSETSCRDLSSSPPGPYGQEMYAFRSEERFKSPPILPPHLLQVILNKDTNISCDPA
LLPEPNHVMLNHLYALSIKDSVMVLSATHRYKKKYVTTLLYKPI";
$str2 = preg_replace('/\s+/', '', $str2);
?>
<TR>
<TD>String length</TD>
<TD><?php print_r(strlen($str1)); ?></TD>
<TD><?php print_r($str1); ?></TD>
</TR>
<TR>
<TD>String length</TD>
<TD><?php print_r(strlen($str2)); ?></TD>
<TD><?php print_r($str2); ?></TD>
</TR>
</table>
<table style="float:left" border=0>
<?php
$red = array();  // shared substring => number of occurrences in $str1
$c = 0;          // number of distinct shared substrings
$high = 0;       // highest occurrence count seen so far

// Enumerate every substring of $str2 (start $l, length $k) and count
// how often it occurs in $str1.
$len2 = strlen($str2);
for ($l = 0; $l < $len2; $l++) {
    for ($k = 1; $k <= $len2 - $l; $k++) {
        $chunk = substr($str2, $l, $k);
        $cnt = substr_count($str1, $chunk);
        if ($cnt == 0) {
            // A longer substring starting at $l cannot match either.
            break;
        }
        if (!isset($red[$chunk])) {
            $c++;
            if ($cnt > $high) {
                $high = $cnt;
            }
            $red[$chunk] = $cnt;
        }
    }
}

// One table row per shared substring: the substring, its count,
// its share of all matches, and a simple bar of blue cells.
foreach ($red as $i => $value) {
?>
<tr><td><b><?php print_r($i); ?></b></td>
<TD> - <?php echo $value; ?> - </TD>
<?php
    echo " <TD>" . (($value / sizeof($red)) * 100) . "%</TD> ";
    for ($j = 0; $j < $value; $j++) {
        echo " <TD bgcolor='blue'>.</TD> ";
    }
    echo "</tr>";
}
?>
<tfoot></tfoot>
</table>
<?php
echo sizeof($red) . "<span style='float:left'><b>Total match count is " . $c . " with highest frequency as " . $high . "</b></span>";
?>

</BODY></HTML>

3.2.7 Modeller9v1

MODELLER is a computer program that models three-dimensional structures of proteins and their assemblies by satisfaction of spatial restraints. It implements an automated approach to comparative protein structure modeling. Briefly, the core modeling procedure begins with an alignment of the sequence to be modeled (the target) with related known 3D structures (the templates). This alignment is usually the input to the program. The output is a 3D model for the target sequence containing all main-chain and side-chain non-hydrogen atoms. The model is thus obtained from the given alignment.

Figure 3.6: Modeller workspace shown in the Windows command prompt.

3.3 HOMOLOGY MODELING METHODOLOGY

3.3.1 Step 1

Comparative models were constructed for various gene/protein sequences to study the sequences in their structural context and to suggest site-directed mutagenesis experiments for elucidating specificity changes in this apparent case of convergent evolution of enzymatic specificity. To perform homology modeling, BLAST analysis was carried out using the BLOSUM45 (BL-45) matrix against the protein structure sequence database with the following Swiss-Prot entries: Q08392, P80894, Q08393, P46428, Q7REH6 and P46423. Of these, Q08392 was taken into consideration because it showed the highest similarity and identity and the lowest E-value. Only this sequence was therefore considered further.


3.3.2 Step 2

>>PDB:1ML6 mol:protein length:221 Glutathione S-Trans (221 aa)
 initn: 784 init1: 784 opt: 786 Z-score: 1835.9 bits: 346.9 E(): 2.3e-95
Smith-Waterman score: 786; 66.2% identity (83.1% similar) in 219 aa overlap (2-220:1-219)

                 10        20        30        40        50        60
Sequence MSGKPVLHYANTRGRMESVRWLLAAAGVEFEEKFLEKKEDLQKLKSDGSLLFQQVPMVEI
          .::::::: :.::::: .:::::::::::::::..  :::.::: ::.:.: :::::::
PDB:1M    AGKPVLHYFNARGRMECIRWLLAAAGVEFEEKFIQSPEDLEKLKKDGNLMFDQVPMVEI
                  10        20        30        40        50

                 70        80        90       100       110
Sequence DGMKMVQTRAILNYIAGKYNLYGKDLKERALIDMYVEGLADLYELIMMNVVQPADKKEEH
         ::::.::::::::::: ::.:::::.:::::::::.::. :: :.: . :. : :..:
PDB:1M   DGMKLVQTRAILNYIATKYDLYGKDMKERALIDMYTEGILDLTEMIGQLVLCPPDQREAK
        60        70        80        90       100       110

                130       140       150       160       170       180
Sequence LANALDKAANRYFPVFEKVLKDHGHDFLVGNKLSRADVHLLETILAVEESKPDALAKFPL
          : : :.. :::.:.:::::: ::.:.::::.:.:.:::::: .: :::     :. :::
PDB:1M   TALAKDRTKNRYLPAFEKVLKSHGQDYLVGNRLTRVDVHLLELLLYVEELDASLLTPFPL
       120       130       140       150       160       170

                190       200       210       220
Sequence LQSFKARTSNIPNIKKFLQPGSQRKPRLEEKDIPRLMAIFH
         :..::.: :..::.:::::::::::: :. : :     .:
PDB:1M   LKAFKSRISSLPNVKKFLQPGSQRKPPLDAKQIEEARKVFKF
       180       190       200       210       220

http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+%5bpdb-id:1ML6%5d+-noSession
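The identity figure in the FASTA report can be checked directly from the two sequences. The report gives an ungapped overlap of residues 2-220 of Q08392 with residues 1-219 of 1ML6, so the check reduces to counting identical residue pairs over that range. The helper name `percent_identity` is hypothetical; the sequences are the ones listed earlier in this chapter.

```python
QUERY = (  # Q08392, 221 residues
    "MSGKPVLHYANTRGRMESVRWLLAAAGVEFEEKFLEKKEDLQKLKSDGSLLFQQVPMVEI"
    "DGMKMVQTRAILNYIAGKYNLYGKDLKERALIDMYVEGLADLYELIMMNVVQPADKKEEH"
    "LANALDKAANRYFPVFEKVLKDHGHDFLVGNKLSRADVHLLETILAVEESKPDALAKFPL"
    "LQSFKARTSNIPNIKKFLQPGSQRKPRLEEKDIPRLMAIFH"
)

TEMPLATE = (  # 1ML6, 221 residues
    "AGKPVLHYFNARGRMECIRWLLAAAGVEFEEKFIQSPEDLEKLKKDGNLMFDQVPMV"
    "EIDGMKLVQTRAILNYIATKYDLYGKDMKERALIDMYTEGILDLTEMIGQLVLCPPD"
    "QREAKTALAKDRTKNRYLPAFEKVLKSHGQDYLVGNRLTRVDVHLLELLLYVEELDA"
    "SLLTPFPLLKAFKSRISSLPNVKKFLQPGSQRKPPLDAKQIEEARKVFKF"
)


def percent_identity(query, template, q_start, q_end, t_start, t_end):
    """Identity over an ungapped overlap; ranges are 1-based and inclusive."""
    q = query[q_start - 1:q_end]
    t = template[t_start - 1:t_end]
    if len(q) != len(t):
        raise ValueError("ungapped overlap ranges must have equal length")
    matches = sum(a == b for a, b in zip(q, t))
    return matches, len(q), 100.0 * matches / len(q)


matches, overlap, identity = percent_identity(QUERY, TEMPLATE, 2, 220, 1, 219)
print(matches, overlap, round(identity, 1))  # 145 219 66.2
```

The count of 145 identical residues in 219 aligned positions reproduces the 66.2% identity reported by FASTA.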

3.3.3 Step 3

The following series of commands in Modeller9v1 generates the model and then optimizes it and superimposes it on the template:

1. mod9v1 search.py

2. mod9v1 malign.py

3. mod9v1 get-model.py

4. mod9v1 optimize.py

5. mod9v1 superpose.py

3.3.4 Files in Modeller9v1

1. Search file:

Figure 3.7: Modeller workspace showing the sequence search file.

In this search file, the target sequence file name is specified as Q08392 together with its extension. The command “mod9v1 search.py” runs the search using this file.
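MODELLER commonly reads the target sequence from a PIR-format file, often given an .ali extension. The sketch below builds such a record for Q08392. The helper name `to_pir` and the 75-column line wrapping are illustrative assumptions, and the file actually used in this study may differ in detail.

```python
import textwrap


def to_pir(code, sequence, width=75):
    """Build a PIR-format record for a target sequence."""
    header = ">P1;" + code
    # Second line: ten colon-separated fields. Only the entry type and the
    # code are filled in for a sequence without a known structure.
    description = "sequence:" + code + ":::::::0.00: 0.00"
    # The sequence ends with '*' and is wrapped to a fixed line width.
    body = "\n".join(textwrap.wrap(sequence + "*", width))
    return header + "\n" + description + "\n" + body + "\n"


q08392 = (
    "MSGKPVLHYANTRGRMESVRWLLAAAGVEFEEKFLEKKEDLQKLKSDGSLLFQQVPMVEI"
    "DGMKMVQTRAILNYIAGKYNLYGKDLKERALIDMYVEGLADLYELIMMNVVQPADKKEEH"
    "LANALDKAANRYFPVFEKVLKDHGHDFLVGNKLSRADVHLLETILAVEESKPDALAKFPL"
    "LQSFKARTSNIPNIKKFLQPGSQRKPRLEEKDIPRLMAIFH"
)

record = to_pir("Q08392", q08392)
print(record)
```

The returned text could be written to a file such as Q08392.ali (a hypothetical file name) before the search script is run.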

2. Alignment file:

Figure 3.8: Modeller workspace showing the sequence alignment file.

In this file, the alignment must be the same as the alignment produced by the FASTA alignment program. The command mod9v1 malign.py checks this alignment and also checks the template (1ML6) sequence against the template structure. The 1ML6 sequence in the alignment must exactly match the sequence in the 1ML6 PDB atom file.


3. Get-model file:

Figure 3.9: Modeller workspace showing the get-model file.

In this get-model file, the known template structure file name and the target protein sequence file name have to be specified. Here the target protein is specified as Q08392 and the template structure as 1ML6.

Certain modifications were made here, namely:

Starting model = 1

Ending model = 5

This generates five models.


4. Optimize file:

Figure 3.10: Modeller workspace showing the optimize file.

In this file, the name of the modeled protein has to be given for optimization. Here the modeled protein name is specified as Q08392, and the command mod9v1 optimize.py runs the optimization to obtain a stable, minimum-energy model of the protein.


5. Superpose:

Figure 3.11: Modeller workspace showing the superpose file.

Here the modeled protein file name is specified as Q08392 and the template file name as 1ML6 for superimposition. The command mod9v1 superpose.py superimposes these two proteins and reports the RMSD (root mean square deviation) value.
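The RMSD itself is the square root of the mean squared distance between equivalent atoms in the two structures. The sketch below computes it for two coordinate lists that are assumed to be already superimposed and paired atom by atom; the superposition step performed by superpose.py is not reproduced here, and the coordinates are made-up values for illustration only.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Made-up coordinates: the second set is the first shifted by 1 along z.
model_atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
template_atoms = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]

print(rmsd(model_atoms, template_atoms))
```

A lower RMSD between model and template indicates that the backbone of the model follows the template more closely.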
