Allele Mining: with respect to Comparative Protein Structure Modelling and Docking study


Sunil Kumar

Institute of Life Sciences

Bhubaneswar

E-mail: [email protected]

• Enormous sequence information is available in public databases as a result of sequencing of diverse crop genomes.

• It is important to use this genomic information for the identification and isolation of novel and superior alleles of agronomically important genes from crop gene pools to suitably deploy for the development of improved cultivars.

• Allele mining is a promising approach to dissect naturally occurring allelic variation at candidate genes controlling key agronomic traits, with potential applications in crop improvement programs.

• It helps in tracing the evolution of alleles, identifying new haplotypes and developing allele-specific markers for use in marker-assisted selection.

Allele Mining: an Introduction

Allele Mining…..cont

• Initial studies of allele mining focused only on identifying SNPs/InDels in the coding sequences (exons) of the gene, since these variations were expected to affect the structure and/or function of the encoded protein.

• However, recent reports indicate that nucleotide changes in non-coding regions (the 5' UTR including the promoter, introns and the 3' UTR) also have a significant effect on transcript synthesis and accumulation, which in turn alters trait expression.

Information Transfer pathway within the cell

[Figure: DNA (…ATGCATGCATGCATGCATGC…) → RNA (…CGUACGUACGUACGU…) → decoding mechanism → protein sequence → protein structure → biological function]

Proteins

Proteins are the building blocks of life.

In a cell, 70% is water and 15%-20% are proteins.

Examples:

• hormones – regulate metabolism

• structural – hair, wool, muscle, …

• antibodies – immune response

• enzymes – chemical reactions

A protein is composed of a central backbone and a collection of (typically) 50-2000 amino acids

There are 20 different kinds of amino acids, for example:

Name            3-letter code   1-letter code
Leucine         Leu             L
Alanine         Ala             A
Serine          Ser             S
Glycine         Gly             G
Valine          Val             V
Glutamic acid   Glu             E
Threonine       Thr             T

Amino Acids


Side chain

Each amino acid is identified by its side chain, which determines the properties of this amino acid.

Side Chain Properties

• Hydrophobic residues stay inside the protein, while hydrophilic residues stay close to water

• Oppositely charged amino acids can form a salt bridge

• Polar amino acids can participate in hydrogen bonding

Protein Folding

•Proteins must fold to function

• Some diseases are caused by misfolding, e.g., mad cow disease

Three Structure Levels

Beta Sheet

Helix

Loop

Primary structure: sequence of amino acids, e.g., DRVYIHPF

Secondary structure: local folding patterns, e.g., alpha-helix, beta-sheet, loop

Tertiary structure: complete 3D fold

Beta Sheet Examples

Anti-parallel beta sheet

Parallel beta sheet

Helix Examples

Domain, Fold, Motif

• A protein chain can have several domains

▫ A domain is a discrete portion of a protein that can fold independently and possesses its own function

•The overall shape of a domain is called a fold. There are only a few thousand possible folds.

•Sequence motif: highly conserved protein subsequence

•Structure motif: highly conserved substructure

Protein Data Bank

About 50,000 protein structures, solved using experimental techniques

~800 are unique structural folds

Different structural folds

Same structural folds

The Problem

• Protein functions determined by 3D structures

• ~ 50,000 protein structures in PDB (Protein Data Bank)

• Experimental determination of protein structures time-consuming and expensive

• Many protein sequences available

sequence → protein structure → function → medicine

“Three-dimensional protein structures are important in understanding the mechanisms of human genetic diseases, predicting the effect of non-synonymous single nucleotide polymorphisms and developing new personalized medicines”

Xie and Bourne (2005) PLoS Comput. Biol. 1:e31

Why Protein 3D Structures?

3D Structures of Proteins

Better Understanding of Protein Functions

What is Homology Modeling?

An approach to predict a model of the three-dimensional structure of a given protein sequence (TARGET) based on an alignment to one or more known protein structures (TEMPLATES)

The homology modeling method is based on the assumption that the structure of an unknown protein is similar to known structures of reference proteins

A model is desirable when neither X-ray crystallography nor NMR spectroscopy can determine the structure of a protein in time, or at all.

The 3D structure of proteins can be determined by X-ray crystallography and NMR spectroscopy, but these experimental techniques are time-consuming and are not possible if the protein is not available in sufficient quantity and quality.

The built model provides a wealth of information on how the protein functions, down to the level of individual residue properties. This information can then be used for mutational studies or for drug design.

Why a Model?

Protein Structure Determination

• High-resolution structure determination

▫ X-ray crystallography (~1 Å)

▫ Nuclear magnetic resonance (NMR) (~1-2.5 Å)

• Low-resolution structure determination

▫ Cryo-EM (electron microscopy) (~10-15 Å)

X-ray crystallography

• Most accurate

• An extremely pure protein sample is needed.

• The protein sample must form crystals that are relatively large without flaws. Generally the biggest problem.

• Many proteins are not amenable to crystallization at all (e.g., proteins that do their work inside a cell membrane).

• ~$100K per structure

Nuclear Magnetic Resonance

• Fairly accurate

• No need for crystals

• Limited to small, soluble proteins only.

1. Identification of structures that will form the template for modelling

2. Sequence Alignment of the target with template

3. Transfer of the coordinates of the structurally conserved regions (SCRs) from the template(s) to the target

4. Modelling the missing regions

5. Refinement and validation of the model

Steps in homology modelling

[Figure: from the target's sequence to the target's structure]

Template search

• Homology modeling is based on using similar structures, i.e. no similar structures = no model

• 40% amino acid identity or higher is best; below that is not advisable, although examples of success do exist

• Sequence similarity is needed across the whole sequence, not just in one part
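The template criteria above can be checked on an existing pairwise alignment. The following is a minimal sketch, not part of any modelling package: the function names and the 80% coverage cut-off are assumptions chosen for illustration, while the 40% identity threshold comes from the bullet above.

```python
def percent_identity(aligned_target, aligned_template, gap="-"):
    """Percent identity over the columns where both sequences have a residue."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == gap or b == gap:
            continue  # gapped columns do not count towards identity
        aligned += 1
        if a == b:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


def coverage(aligned_target, aligned_template, gap="-"):
    """Fraction of target residues that are aligned to a template residue."""
    target_len = sum(1 for a in aligned_target if a != gap)
    covered = sum(
        1 for a, b in zip(aligned_target, aligned_template)
        if a != gap and b != gap
    )
    return covered / target_len if target_len else 0.0


def is_usable_template(aligned_target, aligned_template,
                       min_identity=40.0, min_coverage=0.8):
    """Apply the identity threshold and an (assumed) whole-sequence coverage cut-off."""
    return (percent_identity(aligned_target, aligned_template) >= min_identity
            and coverage(aligned_target, aligned_template) >= min_coverage)
```

A template that matches only a short stretch of the target fails the coverage test even when that stretch is highly identical, which is the situation the last bullet warns about.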

Searching Databases

[Figure: a query sequence is searched against a database with BLAST or FASTA]

Key Step:

Sequence alignment of the target with the basis structures

Good Alignment

Good Model

• Sequence alignment is a basic technique in homology modeling.

• It is used to establish a one-to-one correspondence between the amino acids of the reference protein (template) and those of the unknown protein (target) in the structurally conserved regions.

• The correspondence is the basis for transferring coordinates from the reference to the model protein

A sample alignment of two DNA sequences

(a) Un-gapped alignment

Sequence A  GGTGGAC
Sequence B  AAAGGTGAC

(b) Gapped alignment. The "|" indicates matching nucleotides

Sequence A     GGTGGAC
               |||| ||
Sequence B  AAAGGTG-AC

Local Alignment

Global Alignment

Sequence Alignment

Applications:

• Global alignment: essential for comparative modeling.

• Local alignment: sufficient for functional domains.

N.B.: Global alignment is computationally more time-consuming than local alignment.

Sequence Homology Vs Sequence Similarity

Dotplot:

[Figure: dotplot of two sequences, ATTCACATA and TACATTACGTAC, on axes labelled Sequence 1 and Sequence 2]

A dotplot gives an overview of all possible alignments
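A dotplot is simple to compute: every residue of one sequence is compared with every residue of the other, and a mark is placed wherever they are identical. The sketch below is only an illustration (the function names are invented here); it uses the two sequences from the dotplot slide.

```python
def dotplot(seq_rows, seq_cols):
    """Boolean matrix: entry [i][j] is True when seq_rows[i] == seq_cols[j]."""
    return [[a == b for b in seq_cols] for a in seq_rows]


def render(seq_rows, seq_cols):
    """Text rendering of the dotplot: '*' marks a match, '.' a mismatch."""
    matrix = dotplot(seq_rows, seq_cols)
    lines = ["  " + " ".join(seq_cols)]
    for residue, row in zip(seq_rows, matrix):
        cells = " ".join("*" if hit else "." for hit in row)
        lines.append(residue + " " + cells)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("ATTCACATA", "TACATTACGTAC"))
```

Runs of marks along a diagonal correspond to stretches that align without gaps; a shift from one diagonal to a neighbouring one corresponds to a gap.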

Dynamic Programming

• Needleman and Wunsch Algorithm

- Global Alignment -

• Smith and Waterman Algorithm

- Local Alignment -

Dynamic programming is a computational method used for aligning two protein or nucleotide sequences. The method compares every pair of residues/nucleotides in the two sequences and generates an alignment.

In the alignment, matches, mismatches and gaps in the two sequences are positioned in such a way that the number of matches between identical or similar residues is the maximum possible.

$$
F(i, j) = \max
\begin{cases}
F(i-1, j-1) + s(x_i, y_j) \\
F(i-1, j) - d \\
F(i, j-1) - d
\end{cases}
$$

[Figure: cell F(i, j) is reached from F(i-1, j-1) with score s(x_i, y_j), and from F(i-1, j) or F(i, j-1) with gap penalty -d]

Steps

1. Initialization: the 1st row and 1st column are filled with multiples of the gap penalty

2. Rest of the cells: filled with the maximum of the three candidate values

3. Generation of the optimal path: through backtracking

4. Generation of the optimal alignment: from the optimal path (no. of optimal paths = no. of optimal alignments)

Scoring scheme: given an alignment between two sequences, we can compute its similarity by:

1) Rewarding for a match: Match => +1

2) Penalizing for a mismatch: Mismatch => -1

3) Penalizing for a gap: Gap or Indel => -2
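The four steps and the scoring scheme above can be put together in a short global-alignment sketch. This is an illustrative implementation of the Needleman and Wunsch recurrence, not code from the original slides; the parameter `gap` holds the gap score (-2 here, i.e. -d in the recurrence), and when several optimal paths exist only one of them is returned.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]

    # Step 1: first row and column hold multiples of the gap score.
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap

    # Step 2: every other cell takes the maximum of the three candidates.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Steps 3 and 4: backtrack from the bottom-right corner to the origin.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                ax.append(x[i - 1])
                ay.append(y[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1])   # residue of x aligned to a gap
            ay.append("-")
            i -= 1
        else:
            ax.append("-")        # residue of y aligned to a gap
            ay.append(y[j - 1])
            j -= 1

    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    score, a, b = needleman_wunsch("GGTGGAC", "AAAGGTGAC")
    print(score)
    print(a)
    print(b)
```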

Smith and Waterman (local alignment)

Two differences:

1. The recurrence gains a fourth option, 0, so a cell score never becomes negative:

$$
F(i, j) = \max
\begin{cases}
0 \\
F(i-1, j-1) + s(x_i, y_j) \\
F(i-1, j) - d \\
F(i, j-1) - d
\end{cases}
$$

2. An alignment can now end anywhere in the matrix

Example:

Sequence 1: H E A G A W G H E E

Sequence 2: P A W H E A E

Scoring parameters: BLOSUM

Gap penalty: linear gap penalty of 8
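A minimal local-alignment sketch follows, showing how the zero option and the free end point change the procedure. To stay self-contained it uses a simple match/mismatch score instead of the BLOSUM matrix and gap penalty of 8 from the example, so those parameter values are assumptions made for illustration and the scores are not comparable with the worked example.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=-2):
    """Local alignment; returns (best_score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # first row/column stay 0
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # Difference 1: zero is a fourth option, so scores never go negative.
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Difference 2: backtracking starts at the best cell anywhere in the
    # matrix and stops as soon as a cell with score 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i -= 1
            j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1

    return best, "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))
```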

Comparative Modelling Methods

Restrained based methods -MODELLER

(Sali and Blundell, 1993)

MODELLER

MODELLER is a computer program that models three-dimensional structures of proteins and their assemblies by satisfaction of spatial restraints.

MODELLER is most frequently used for homology or comparative protein structure modeling.

The user provides an alignment of a sequence to be modeled with known related structures and MODELLER will automatically calculate a model with all non-hydrogen atoms.

A 3D model is obtained by optimization of a molecular probability density function (pdf).

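Format for Modeller:

INCLUDE                               # include the predefined TOP routines
SET ATOM_FILES_DIRECTORY = './:../'   # directories searched for atom files
SET PDB_EXT = '.atm'                  # extension of the coordinate files
SET STARTING_MODEL = 1                # index of the first model
SET ENDING_MODEL = 20                 # index of the last model
SET MD_LEVEL = 'refine1'              # level of molecular dynamics refinement
SET DEVIATION = 4.0                   # randomization of starting coordinates (Å)
SET KNOWNS = '1JKE'                   # code of the template structure
SET HETATM_IO = off                   # do not read HETATM records
SET WATER_IO = off                    # do not read water molecules
SET ALIGNMENT_FORMAT = 'PIR'          # format of the alignment file
SET SEQUENCE = 'target1'              # code of the target sequence
SET ALNFILE = 'multiple1.ali'         # alignment file name
CALL ROUTINE = 'model'                # build the models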

Loop Modelling

Loop region

Calculate distances between the anchor residues.

Loop Generation Process:

1. Select a loop for each region

2. Fixing of the loop

FRAGMENT DATABASE

Loop Library

• Loops extracted from PDB using high resolution (<2 Å) X-ray structures

• Typically thousands of loops in DB

• Includes loop coordinates, sequence, number of residues in the loop, Cα-Cα distance, and the preceding and following secondary structure (or their Cα coordinates)
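Selecting candidate loops from such a library can be sketched as a filter on loop length and on the Cα-Cα distance between the two anchor residues. Everything in the example below is hypothetical and for illustration only: the library entries, the field names and the 0.5 Å tolerance are not taken from any real loop database.

```python
import math

# Hypothetical toy library; a real one would hold thousands of loops
# extracted from high-resolution structures, with coordinates.
LOOP_LIBRARY = [
    {"sequence": "GSG", "n_residues": 3, "ca_ca_distance": 5.8},
    {"sequence": "NGKD", "n_residues": 4, "ca_ca_distance": 7.1},
    {"sequence": "PDGS", "n_residues": 4, "ca_ca_distance": 9.6},
    {"sequence": "GDGSK", "n_residues": 5, "ca_ca_distance": 7.3},
]


def ca_distance(anchor_a, anchor_b):
    """Euclidean distance (in Å) between two Cα positions given as (x, y, z)."""
    return math.dist(anchor_a, anchor_b)


def candidate_loops(anchor_start, anchor_end, n_residues,
                    library=LOOP_LIBRARY, tolerance=0.5):
    """Loops of the right length whose end-to-end distance fits the anchors."""
    target = ca_distance(anchor_start, anchor_end)
    hits = [
        loop for loop in library
        if loop["n_residues"] == n_residues
        and abs(loop["ca_ca_distance"] - target) <= tolerance
    ]
    # Best geometric fit first.
    return sorted(hits, key=lambda loop: abs(loop["ca_ca_distance"] - target))
```

In practice the surviving candidates would still have to be fitted onto the anchors and checked for clashes with the rest of the model.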

Structure Validation

(a) Stereochemical Quality Check

(b) Residue Environment Check

Stereochemical Quality Check

PROCHECK (Thornton and co-workers)

The following properties are calculated and analysed in comparison with those of highly refined structures solved at varying resolutions.

Torsional angles:

- (φ, ψ) combination

- χ1-χ2 combination

- χ1 torsion for those residues without χ2

- combined χ3 and χ4 angles

- ω angles

Covalent geometry:

- main-chain bond lengths

- main-chain bond angles
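All of the torsional angles listed above are dihedral angles defined by four consecutive atoms; for example φ uses C(i-1), N(i), Cα(i), C(i), ψ uses N(i), Cα(i), C(i), N(i+1), and ω uses Cα(i), C(i), N(i+1), Cα(i+1). The sketch below shows one common way to compute such an angle from coordinates. It is a generic geometric routine written for illustration, not PROCHECK's own code.

```python
import math


def _sub(a, b):
    return tuple(p - q for p, q in zip(a, b))


def _dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _scale(a, k):
    return tuple(p * k for p in a)


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees (range -180..180) about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Unit vector along the central bond.
    b1 = _scale(b1, 1.0 / math.sqrt(_dot(b1, b1)))
    # Components of the outer bonds perpendicular to the central bond.
    v = _sub(b0, _scale(b1, _dot(b0, b1)))
    w = _sub(b2, _scale(b1, _dot(b2, b1)))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))
```

Computing (φ, ψ) for every residue of a model and plotting one against the other gives the Ramachandran plot that PROCHECK-style checks are based on.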

Profiles-3D

• Amino acid residues in proteins can be classified according to their local environments:

▫ solvent accessibility

▫ secondary structure

▫ polarity of other protein chemical groups in contact with them

Refining the Model

- Energy minimize N- and C-termini.

- Repair spliced peptide bonds.

- Minimize loop regions.

- Energy minimize mutated side chains in SCRs.

- Minimize segments together.

Energy Minimization

• Energy minimization adjusts the structure of the molecule in order to lower the energy of the system.

• For small molecules, a global minimum energy configuration can often be found.

• For large macromolecular systems, energy minimization allows one to examine the local minimum around a particular conformation.
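The idea can be illustrated on the smallest possible system: two atoms interacting through a Lennard-Jones potential, minimized by steepest descent on their separation r. This is a toy sketch under stated assumptions (reduced units with ε = σ = 1, a fixed step size, one degree of freedom); real modelling packages minimize full force fields over all atomic coordinates.

```python
def lj_energy(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones 12-6 energy of two atoms at separation r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def lj_gradient(r, epsilon=1.0, sigma=1.0):
    """Derivative dE/dr of the Lennard-Jones energy."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (-12.0 * sr6 * sr6 + 6.0 * sr6) / r


def minimize(r_start, step=0.01, tolerance=1e-8, max_steps=10000):
    """Steepest descent: move against the gradient until it vanishes."""
    r = r_start
    for _ in range(max_steps):
        g = lj_gradient(r)
        if abs(g) < tolerance:
            break
        r -= step * g
    return r, lj_energy(r)


if __name__ == "__main__":
    r_min, e_min = minimize(1.5)
    print(r_min, e_min)
```

Starting from r = 1.5 the separation relaxes towards the analytical minimum at r = 2^(1/6) σ with energy -ε. For a macromolecule the same downhill procedure only reaches the nearest local minimum, as the last bullet notes.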

Modelling on the Web

• Prior to 1998 homology modelling could only be done with commercial software or command-line freeware

• The process was time-consuming and labor-intensive

• The past few years have seen an explosion in automated web-based homology modelling servers

• Now anyone can homology model!

Application of Comparative Modeling

Comparative modeling is an efficient way to obtain useful information about the proteins of interest. For example, comparative modeling can be helpful in:

- Designing mutants to test hypotheses about the protein's function.

- Identifying active and binding sites.

- Searching for, designing and improving ligands.

- Modeling substrate specificity.

- Predicting antigenic epitopes.

- Simulating protein-protein docking.

- Confirming a remote structural relationship.

Prediction of the optimal physical configuration and energy between two molecules

The docking problem optimizes:

• Binding between two molecules, such that their orientation maximizes the interaction

• The total energy of interaction, such that the best binding configuration has the minimum binding energy

• The resultant structural changes brought about by the interaction

What is docking?

Molecular Docking

• The process of “docking” a ligand to a binding site mimics the natural course of interaction of the ligand and its receptor via a lowest energy pathway.

• Put a compound in the approximate area where binding occurs and evaluate the following:

– Do the molecules bind to each other?

– If yes, how strong is the binding?

– What does the molecule (or the protein-ligand complex) look like? (Understand the intermolecular interactions.)

– Quantify the extent of binding.

Few terms related to docking

• Receptor: The receiving molecule, most commonly a protein or other biopolymer.

• Ligand: The complementary partner molecule which binds to the receptor. Ligands are most often small molecules but could also be another biopolymer.

• Docking: Computational simulation of a candidate ligand binding to a receptor.

• Binding mode: The orientation of the ligand relative to the receptor as well as the conformation of the ligand and receptor when bound to each other.

• Pose: A candidate binding mode.

• Scoring: The process of evaluating a particular pose by counting the number of favorable intermolecular interactions such as hydrogen bonds and hydrophobic contacts.

• Ranking: The process of classifying which ligands are most likely to interact favorably with a particular receptor, based on the predicted free energy of binding.

Classes of Docking

1. Protein-Protein Docking

• Both molecules usually considered rigid.

• 6 degrees of freedom: 3 for rotation, 3 for translation

• First apply only steric constraints to limit the search space

• Then examine the energetics of possible binding conformations

2. Protein-Ligand Docking

• Flexible ligand, rigid receptor.

• Search space much larger

• Either reduce the flexible ligand to rigid fragments connected by one or several hinges (reduces the conformational space)

• Or search the conformational space using Monte Carlo methods or molecular dynamics.


It involves:

Finding useful ways of representing the molecules and molecular properties.

Exploration of the configuration spaces available for interaction between ligand and receptor.

Evaluate and rank configurations using a scoring system, in this case the binding energy

However, since it is difficult to evaluate the binding energy because the binding sites may not be easily accessible, the binding energy is modeled as follows:

$$
\Delta G_{bind} = \Delta G_{vdw} + \Delta G_{hbond} + \Delta G_{elect} + \Delta G_{conform} + \Delta G_{tor} + \Delta G_{sol}
$$
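Once the individual terms have been estimated for each candidate pose, scoring and ranking reduce to summing them and sorting. The sketch below only illustrates that bookkeeping: the pose names and all energy values are invented, and how each term is actually computed depends on the docking program.

```python
# The six terms of the binding free energy expression above.
TERMS = ("vdw", "hbond", "elect", "conform", "tor", "sol")


def binding_energy(terms):
    """Sum of the six contributions for one pose (same units as the inputs)."""
    return sum(terms[name] for name in TERMS)


def rank_poses(poses):
    """Poses sorted from most to least favorable (lowest energy first)."""
    return sorted(poses, key=lambda pose: binding_energy(pose["terms"]))


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    poses = [
        {"name": "pose_1", "terms": {"vdw": -3.0, "hbond": -1.0, "elect": -0.5,
                                     "conform": 0.5, "tor": 0.5, "sol": 0.5}},
        {"name": "pose_2", "terms": {"vdw": -5.0, "hbond": -2.0, "elect": -1.0,
                                     "conform": 1.5, "tor": 0.5, "sol": 1.0}},
    ]
    for pose in rank_poses(poses):
        print(pose["name"], binding_energy(pose["terms"]))
```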

Docking uses a “search and score” method

3D Structure of the Complex

Experimental Information:

The active site can be identified based on the position of the ligand in the crystal structures of the protein-ligand complexes

What if the active site is not known?

Some Available Programs to Perform Docking

• Affinity

• AutoDock

• BioMedCAChe

• CAChe for Medicinal Chemists

• DOCK

• DockVision

• FlexX

• Glide

• GOLD

• Hammerhead

• PRO_LEADS

• SLIDE

• VRDD

Ligand in Active Site Region

Ligand

Active site residues: Histidine 6; Phenylalanine 5; Tyrosine 21; Aspartic acid 91; Aspartic acid 48; Tyrosine 51; Histidine 47; Glycine 29; Leucine 2; Glycine 31; Glycine 22; Alanine 18; Cysteine 28; Valine 20; Lysine 62

Examples of docked structures: HIV protease inhibitors, COX2 inhibitors

• Shape-complementarity method: find binding mode(s) without any steric clashes

• Only 6-degrees of freedom (translations and rotations)

• Move ligand to binding site and monitor the decrease in the energy

• Only non-bonded terms remain in the energy term

• Try to find a good steric match between ligand and receptor

Rigid Docking

The DOCK algorithm in rigid-ligand mode


1. Define the target binding site points.

2. Match the distances.

3. Calculate the transformation matrix for the orientation.

4. Dock the molecule.

5. Score the fit.
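Step 2 of the list above, matching distances, can be sketched as a comparison of the internal distances between ligand atoms with the internal distances between binding-site points. This is a simplified illustration rather than the DOCK program's implementation; the function names and the 0.5 Å tolerance are assumptions, and the fitting of a transformation matrix to the matched pairs (step 3) is left out.

```python
import math
from itertools import combinations


def internal_distances(points):
    """Distance for every pair of points, keyed by the pair of indices."""
    return {
        (i, j): math.dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }


def match_distances(ligand_atoms, site_points, tolerance=0.5):
    """Pairs of (ligand atom pair, site point pair) with similar distances."""
    ligand = internal_distances(ligand_atoms)
    site = internal_distances(site_points)
    matches = []
    for ligand_pair, d_ligand in ligand.items():
        for site_pair, d_site in site.items():
            if abs(d_ligand - d_site) <= tolerance:
                matches.append((ligand_pair, site_pair))
    return matches
```

Each match says that two ligand atoms could sit on two site points at the same time. Sets of mutually consistent matches then define a rigid-body orientation of the ligand, which is docked and scored in the remaining steps.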

Flexible Docking

• Dock flexible ligands into binding pocket of rigid protein

• Binding site broken down into regions of possible interactions

[Figure: binding site from X-ray; H-bonds; parameterised binding site]

Detailed calculations on all possibilities would be very expensive

The major challenge in structure-based drug design is to identify the best position and orientation of the ligand in the binding site of the target.

This is done by scoring or ranking the various possibilities, using schemes based on empirical parameters, knowledge-based potentials or rigorous calculations.

Need for Scoring

Caspase Dependent Programmed Cell Death in Developing Embryos: A Potential Target for Therapeutic Intervention against Pathogenic Nematodes

For the first time, we developed and evaluated flow cytometry based assays to assess several conserved features of apoptosis in developing embryos of a pathogenic filarial nematode, Setaria digitata, in vitro.

We validated programmed cell death in developing embryos by using immuno-fluorescence microscopy and scoring expression profile of nematode specific proteins related to apoptosis [e.g. CED-3, CED-4 and CED-9].

Mechanistically, apoptotic death of embryonic stages was found to be a caspase dependent phenomenon mediated primarily through induction of intracellular ROS. The apoptogenicity of some pharmacological compounds viz. DEC, Chloroquine, Primaquine and Curcumin were also evaluated. Curcumin was found to be the most effective pharmacological agent followed by Primaquine while Chloroquine displayed minimal effect and DEC had no demonstrable effect.

Further, demonstration of induction of apoptosis in embryonic stages by lipid peroxidation products [molecules commonly associated with inflammatory responses in filarial disease] and demonstration of in-situ apoptosis of developing embryos in adult parasites in a natural bovine model of filariasis have offered a framework to understand anti-fecundity host immunity operational against parasitic helminths.

PLoS NTD, 2011

Induction of apoptosis in developing embryos of a pathogenic nematode

PLoS NTD, 2011

CARD Domain

α/β(P-loop) Domain

Cytochrome-c

Helical Domain

Winged helix Domain

CED- 4

J Mol Model, 2011

Binding efficiencies of carbohydrate ligands with different genotypes of cholera toxin B: molecular modeling, dynamics and docking simulation studies

Molecular interaction plots between carbohydrate ligand and genotype 1. a) Galactose b) Sialic acid c) N-acetyl galactosamine

Molecular interaction plots between carbohydrate ligand and genotype 3. a) Galactose b) Sialic acid c) N-acetyl galactosamine

Molecular interaction plots between carbohydrate ligand and genotype 5. a) Galactose b) Sialic acid c) N-acetyl galactosamine

Molecular interaction plots between carbohydrate ligand and genotype 6. a) Galactose b) Sialic acid c) N-acetyl galactosamine

• The promyelocytic leukemia zinc finger (Plzf) gene containing evolutionary conserved BTB domain plays a key role in self-renewal of mammalian spermatogonial stem cells.

• Little is known about the function of plzf in vertebrates, especially in fish species.

• plzf was cloned from the testis of Labeo rohita (rohu), a commercially important freshwater carp. It contains a conserved N-terminal BTB domain and C-terminal C2H2 zinc finger motifs.

Molecular cloning of cDNA and peptide structure prediction of Plzf expressed in the spermatogonial cells of Labeo rohita

Marine Genomics, 2010


•A 3D model of BTB domain of plzf protein was constructed by homology modeling approach.


•Molecular docking on this 3D structure established a homo-dimer between two BTB domains, creating a charged pocket containing conserved amino acid residues: L33, C34, D35 and R49.


Thus, Plzf of SSC is structurally and possibly functionally conserved.

The identified plzf could be the first step towards exploring its role in rohu SSC behavior.

Thank you

• Alok Das Mohapatra, Sunil Kumar, Ashok Kumar Satapathy and Balachandran Ravindran (2011). Apoptosis in a pathogenic nematode involves mitochondrial pathway. PLoS Neglected Tropical Diseases (in press).

• MHU Turabe Fazil, Sunil Kumar, Rohit Farmer, HP Pandey and DV Singh (2011). Binding efficiencies of carbohydrate ligands with different genotypes of cholera toxin B: molecular modeling, dynamics and docking simulation studies. J Mol Model, DOI 10.1007/s00894-010-0947-6 (Springer publication).

• Biswaranjan Paital, Sunil Kumar*, Rohit Farmer, Niraj Kanti Tripathy and Gagan Bihari Nityananda Chainy (2011). In silico prediction and characterization of 3D structure and binding properties of catalase from the commercially important crab, Scylla serrata. Interdiscip Sci Comput Life Sci 3: 110–120 (Springer publication). *Corresponding author.

• Chinmayee Mohapatra, Hirak Kumar Barman, Rudra Prasanna Panda, Sunil Kumar, Varsha Das, Ramya Mohanta, Shibani Mohapatra and Pallipuram Jayasankar (2010). Cloning of cDNA and prediction of peptide structure of plzf expressed in the spermatogonial cells of Labeo rohita. Mar. Genomics, doi: 10.1016/j.margen.2010.09.002 (Elsevier publication).

• MHU Turabe Fazil*, Sunil Kumar*, N Subbarao, H P Pandey and Durg V. Singh (2010). Homology modeling of a sensor histidine kinase from Aeromonas hydrophila. J Mol Model, 16: 1003-1009 (Springer publication). *Equal contribution.

• Babu A Manjasetty, Sunil Kumar, Andrew P Turnbull and Niraj Kanti Tripathy (2009). Homology Modeling and Analysis of Human Disease Proteins: Structural Investigations of Shwachman-Bodian-Diamond Syndrome (SBDS) model through Bioinformatics Approach. InterJRI Science and Technology, Vol. 1, Issue 2, 97-104.
