FACULTY OF INDUSTRIAL SCIENCES AND TECHNOLOGY
SEMESTER 2 2014/2015

ASSESSMENT COVER SHEET
Subject: BIOINFORMATIC
Subject Code: BSB3553
TYPE OF COURSEWORK: INDIVIDUAL ASSIGNMENT
LECTURER: Dr. Aizi Nor Mazila Bt Ramli
TITLE: In Silico study on an uncultured bacterium clone Lip1 lipase (lip1) gene (Accession code number: GQ352455.1)
DUE DATE: 29 May 2015
SUBMISSION DATE: 29 May 2015
SUBMITTED BY: Lai Kei Fung SB12048
LECTURER’S SIGNATURE:

In Silico study on an uncultured bacterium clone Lip1 lipase (lip1) gene (Accession code number: GQ352455.1)


1.0 INFORMATION OF THE PROTEIN

Accession number GQ352455.1 refers to the complete coding sequence of the Lip1 lipase (lip1) gene from an uncultured bacterium clone. The sequence was identified by Cieslinski, Bialkowska, Dlugolecka, Tkaczuk, Kur, and Turkiewicz in a metagenomic library study. The sample was soil from Antarctica, and the genomic DNA is believed to originate from an uncultured bacterium whose identity has not been confirmed. The authors have not yet published a journal paper on this sequence.

Antarctica is a harsh, cold environment. Antarctic soil temperatures lie far below 0 °C for most of the year and only rarely exceed 0 °C in summer (Convey, 1996). The microorganisms living in Antarctic soil are therefore expected to be psychrophilic or at least cold-tolerant. On this basis, the Lip1 lipase is presumed to be a psychrophilic (cold-active) enzyme that retains its activity at low temperature, although this has not been confirmed experimentally in this study. Such an enzyme has considerable industrial potential: psychrophilic lipases can be used in low-temperature food processing and in other industries where low-temperature processes are desired.

2.0 RESEARCH ARTICLES RELATING TO THE PROTEIN

Lipases are glycerol ester hydrolases that catalyse the hydrolysis of triglycerides to free fatty acids and glycerol. The temperature stability of lipases is regarded as their most important attribute for industrial application. Psychrophilic lipases have attracted attention because of their increasing use in the organic synthesis of chiral intermediates: their low optimum temperature and high activity at very low temperatures are favourable for the production of relatively fragile compounds (Joseph et al., 2008).

The structural differences of cold-adapted enzymes have also been investigated. Enzymes from Antarctic bacteria have been analysed using protein modelling and X-ray crystallography, and the deduced three-dimensional structures of cold enzymes (α-amylase, β-lactamase, lipase and subtilisin) have been compared with those of their mesophilic homologues (Gerday et al., 1997). That work concluded that the molecular adaptation resides in a weakening of the intramolecular interactions, and in some cases in an increased interaction with the solvent, leading to more flexible molecular edifices capable of performing catalysis at a lower energy cost (Gerday et al., 1997).

The various industrial applications of cold-active microbial lipases in medicine and pharmaceuticals, fine chemical synthesis, the food industry, and domestic and environmental uses have also been described in a review paper (Joseph et al., 2007).


3.0 DOMAIN AND MOTIF ANALYSIS

3.1 Analysis using the first database: InterPro

Domain analysis with InterPro returned four entries:

3.1.1 Alpha/Beta hydrolase fold (IPR029058)

The alpha/beta hydrolase fold is common to a number of hydrolytic enzymes of widely differing phylogenetic origin and catalytic function. The core of each enzyme is an alpha/beta-sheet (rather than a barrel), containing 8 strands connected by helices.

3.1.2 Serralysin-like metalloprotease, C-terminal (IPR011049)

Serralysin is a bacterial Zn-endopeptidase that acts as a virulence factor to cause tissue damage and anaphylactic response. Many Zn-endopeptidases contain the metal binding motif HexxHxxGxxH; in addition to these coordinated histidine residues, serralysin contains a coordinated tyrosine residue that is unique to the astacin-like Zn enzymes. The Zn-endopeptidases containing the histidine motif are structurally similar to one another, containing an N-terminal catalytic domain that belongs to the zincin family, and a C-terminal beta-helix metal-binding domain.

3.1.3 Haemolysin-type calcium-binding repeat (IPR001343)

Gram-negative bacteria produce a number of proteins that are secreted into the growth medium by a mechanism that does not require a cleaved N-terminal signal sequence. These proteins, while having different functions, seem to share two properties: they bind calcium and they contain a multiple tandem repeat of a nonapeptide.

3.1.4 Hemolysin-type calcium-binding conserved site (IPR018511)

The description of this entry is the same as that of the Haemolysin-type calcium-binding repeat (IPR001343).


Figure 1: Screenshot of Interpro online domain analysis.

3.2 Analysis using the second database: PROSITE

Domain analysis with PROSITE returned three pattern hits:

3.2.1 PS00120 LIPASE_SER Lipases, serine active site

Triglyceride lipases (EC 3.1.1.3) are lipolytic enzymes that hydrolyse the ester bond of triglycerides. Lipases are widely distributed in animals, plants and prokaryotes. In higher vertebrates there are at least three tissue-specific isozymes: pancreatic, hepatic, and gastric/lingual.

3.2.2 PS00330 HEMOLYSIN_CALCIUM Hemolysin-type calcium-binding region

Gram-negative bacteria produce a number of proteins which are secreted into the growth medium by a mechanism that does not require a cleaved N-terminal signal sequence. These proteins, while having different functions, seem to share two properties: they bind calcium and they contain a variable number of tandem repeats consisting of a nine amino acid motif rich in glycine, aspartic acid and asparagine. It has been shown that such a domain is involved in the binding of calcium ions in a parallel β roll structure.

3.2.3 PS00012 PHOSPHOPANTETHEINE Phosphopantetheine attachment site

Phosphopantetheine (or pantetheine 4' phosphate) is the prosthetic group of acyl carrier proteins (ACP) in some multienzyme complexes where it serves as a 'swinging arm' for the attachment of activated fatty acid and amino-acid groups. Phosphopantetheine is attached to a serine residue in these proteins.


Figure 2: Screenshot of Prosite online domain analysis.
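The idea behind a PROSITE pattern hit can be illustrated with a short sketch. It scans a sequence for the widely known lipase pentapeptide G-x-S-x-G around the active-site serine. This is a simplification chosen for illustration, not the full PS00120 pattern used by PROSITE, and the function name `find_gxsxg` is hypothetical. The example sequence is the third 77-residue row of the Lip1 protein as printed in Section 6.2.

```python
import re

# Simplified lipase motif G-x-S-x-G (assumption: NOT the complete PS00120 pattern).
# The lookahead lets overlapping occurrences be reported as well.
GXSXG = re.compile(r"(?=(G.S.G))")


def find_gxsxg(sequence):
    """Return (1-based start position, matched pentapeptide) for each G-x-S-x-G hit."""
    return [(m.start() + 1, m.group(1)) for m in GXSXG.finditer(sequence.upper())]


# Third row of the Lip1 sequence as printed in Section 6.2 of this report.
row3 = ("IGDLVSDLLAALGPKDYAKNYAGEAFGGLLKNVADYASAHGLSGKDVLVSG"
        "HSLGGLAVNSLADLSGNKWGGFYKDA")

for start, motif in find_gxsxg(row3):
    print(f"{motif} starts at position {start} of this row")
```

Positions are reported relative to the fragment that is passed in, so an offset has to be added to obtain whole-protein numbering.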

3.3 Analysis using the third database: ProDom

The domains identified by ProDom are listed below:

1. PD014259: Lipase repeat EC=3.1.1.3 hydrolase lipase. Polyurethanase.
2. PD368883: Lipase repeat EC=3.1.1.3 hydrolase lipase. Polyurethanase.
3. PD000479: Repeat calcium-binding. Hemolysin-type. Putative plasmid uncharacterized binding membrane.
4. PD699086: Lipase repeat EC=3.1.1.3 hydrolase A. Polyurethanase calcium binding.
5. PD000317: Flagellum bacterial. Flagellin. Putative domain membrane uncharacterized flagellin precursor.
6. PDB0T940: EC=3.1.1.3 Hydrolase. Lipase. Putative extracellular.

Figure 3: Screenshot of ProDom online domain analysis.


In general, the three domain databases agree with one another. First, the consistent domain assignments from InterPro, PROSITE, and ProDom indicate that the enzyme is a lipolytic enzyme. Second, all three indicate that the enzyme has a calcium-binding site. Furthermore, the haemolysin-type calcium-binding repeats, which are described for proteins secreted by Gram-negative bacteria, suggest that the uncultured bacterium is Gram-negative.

4.0 PHYLOGENETIC ANALYSIS

Nucleotide BLAST offers three algorithms: megablast for highly similar sequences, discontiguous megablast for more dissimilar sequences, and blastn for somewhat similar sequences. I used blastn because it returns a broader set of somewhat similar sequences, which is suitable for phylogenetic tree construction. From the blastn results I selected suitable sequences for tree construction, avoiding whole-genome and partial sequences. I also added a few lipase gene sequences from other species that did not appear in the BLAST results, to give the phylogenetic analysis a broader context.

After selecting the sequences, I performed multiple sequence alignment with ClustalW as implemented in MEGA6. Other multiple sequence alignment tools include MUSCLE, T-Coffee, and COBALT. I then constructed the phylogenetic tree using the neighbour-joining method with the p-distance model. The resulting tree is shown in Figure 5. Neighbour-joining is a bottom-up (agglomerative) clustering method for building phylogenetic trees, introduced by Naruya Saitou and Masatoshi Nei in 1987. Its main advantage is speed, which makes it practical for analysing large data sets and for bootstrapping. In the tree, the horizontal lines are branches representing evolutionary lineages changing over time; the longer the branch, the larger the amount of genetic change. From the tree, I conclude that the gene of interest most likely shares its most recent common ancestor with lipase sequences from Pseudomonas species.

Figure 4: Screenshot of multiple sequence alignment by using MEGA6 (ClustalW).


Figure 5: Phylogenetic tree constructed.
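The p-distance used for the tree is simply the proportion of aligned sites at which two sequences differ. The following is a minimal sketch of that calculation, assuming the two sequences are already aligned to equal length and that gapped sites are skipped pair by pair. MEGA6 offers several gap-handling options, so this is an illustration rather than a reproduction of its exact settings, and the function name `p_distance` is hypothetical.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # skip sites where either sequence has a gap
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


# Toy aligned fragments (made up for illustration, not taken from the BLAST hits).
print(p_distance("ACGTACGT", "ACGTTCGA"))
```

A matrix of such pairwise distances is what the neighbour-joining algorithm takes as its input.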

5.0 PRIMARY STRUCTURE ANALYSIS

5.1 ProtParam analysis.

I ran ProtParam on the protein sequence; a screenshot of the result is shown in Figure 6. The protein has 469 amino acids, a predicted molecular weight of 49372.8 Da, and a theoretical pI of 4.83. It contains 53 negatively charged residues (Asp + Glu) and 32 positively charged residues (Arg + Lys). The excess of acidic residues is consistent with the acidic theoretical pI.

Figure 6: Screenshot of Protparam result.
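The charged-residue counts can be reproduced directly from a sequence. The sketch below counts Asp + Glu as negatively charged and Arg + Lys as positively charged, which is how ProtParam labels these two totals. The function name `charged_residue_counts` is hypothetical, and the example uses only the first 15 residues of the protein as printed in Section 6.2.

```python
from collections import Counter


def charged_residue_counts(sequence):
    """Return (negative, positive) counts: Asp + Glu and Arg + Lys."""
    counts = Counter(sequence.upper())
    negative = counts["D"] + counts["E"]
    positive = counts["R"] + counts["K"]
    return negative, positive


# First 15 residues of the Lip1 protein, used here only as a short example.
negative, positive = charged_residue_counts("MGIFDYKNLGTEGSK")
print("negative:", negative, "positive:", positive)
```

Running the same function on the full 469-residue sequence should give the totals reported by ProtParam.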


5.2 SignalP analysis.

A screenshot of the SignalP result is shown in Figure 7. The graph plots three scores: the C-score, the S-score and the Y-score. The C-score is the output of the CS networks, which are trained to distinguish signal peptide cleavage sites from everything else. The S-score is the output of the SP networks, which are trained to distinguish positions within signal peptides from positions in the mature part of proteins and from proteins without signal peptides. The Y-score is a combination (geometric average) of the C-score and the slope of the S-score, which gives a better cleavage site prediction than the raw C-score alone. SignalP concluded that the protein sequence contains no signal peptide.

Figure 7: Screenshot of SignalP result.
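The way the Y-score combines the other two scores can be sketched as follows. The sketch takes the geometric average of the C-score and the S-score slope, where the slope at a position is estimated as the mean S-score over a few positions before it minus the mean over a few positions from it onward. The window size, the clamping of negative slopes to zero, and the function name `y_scores` are assumptions made for illustration; this is not SignalP's own code.

```python
import math


def y_scores(c_scores, s_scores, window=3):
    """Geometric average of the C-score and the local drop in S-score."""
    if len(c_scores) != len(s_scores):
        raise ValueError("C-score and S-score lists must have equal length")
    result = []
    for i, c in enumerate(c_scores):
        before = s_scores[max(0, i - window):i]   # positions preceding i
        after = s_scores[i:i + window]            # position i and those following
        if not before:
            result.append(0.0)                    # no slope defined at the first position
            continue
        slope = sum(before) / len(before) - sum(after) / len(after)
        result.append(math.sqrt(c * max(slope, 0.0)))  # assumption: negative slope -> 0
    return result


# Toy scores: the S-score drops sharply where the C-score peaks.
c = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
s = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
print(y_scores(c, s))
```

A high Y-score therefore requires both a C-score peak and a steep fall in the S-score at the same position.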

5.3 CELLO analysis.

Figure 8: Screenshot of CELLO result.

Experimentally determining the subcellular localization of a protein is laborious and time-consuming. Computational approaches have therefore been developed to provide fast and accurate localization predictions for many organisms. One such tool is CELLO, which uses a two-level Support Vector Machine system to assign localizations to both prokaryotic and eukaryotic proteins. For my protein of interest, CELLO predicted an extracellular localization. This is consistent with the absence of a signal peptide, since proteins carrying haemolysin-type calcium-binding repeats are secreted by a mechanism that does not require a cleaved N-terminal signal sequence (see 3.1.3).

6.0 SECONDARY STRUCTURE ANALYSIS

6.1 HHpred secondary structure analysis

In the HHpred analysis, many templates were used to derive predicted secondary structures. For simplicity, I discuss only one: the secondary structure obtained with the crystal structure of the extracellular lipase LipA from Serratia marcescens as template. In HHpred, the probability value is the probability that the template is a true positive. It takes into account both the secondary structure score (column 8) and the raw score ('Score', column 7). True positives are defined as templates that are either globally homologous to the query or at least locally similar in structure; in almost all cases the structural similarity is due to global or local homology between query and template. In practice, the higher the probability, the more reliable the template. The LipA template has a probability of 100.0, as seen in Figure 9. The E-value gives the average number of false positives expected with a score better than that of the template when scanning the database. It is a measure of reliability.


Figure 9: Screenshot of HHpred secondary structure prediction result.


E-values near 0 signify a very reliable hit, whereas an E-value of 10 means that about 10 wrong hits with a score at least this good are expected in the database. The E-value for this template is quoted as 2.9. Taken literally, that would mean about three false positives are expected, which is not a reliable hit; the value can only be called excellent if it carries a large negative exponent, as the probability of 100.0 and the high raw score suggest, so the exact figure should be read from Figure 9. The raw score is 1034.55. It is calculated by comparing the amino acid distributions between columns of the query alignment and columns of the template alignment, with the probabilities of insertions and deletions at each position taken into account as position-specific gap penalties. From this template I obtained a predicted secondary structure for my protein of interest, shown in detail in Figure 9, which is used later to build a consensus. The secondary structure codes are as follows.

G = 3-turn helix (3₁₀ helix).
H = 4-turn helix (α helix).
I = 5-turn helix (π helix).
T = hydrogen bonded turn (3, 4 or 5 turn).
E = extended strand in parallel and/or anti-parallel β-sheet conformation.
B = residue in isolated β-bridge (single pair β-sheet hydrogen bond formation).
S = bend (the only non-hydrogen-bond based assignment).
C = coil (residues which are not in any of the above conformations).

6.2 Jpred secondary structure analysis

MGIFDYKNLGTEGSKALFADAMAITLYTYHNLDNGFAVGYQHNGLGLGLPATLVGALLGSTDSQGVIPGLPWNPDSE
-----------HHHHHHHHHHHHHHHHHH--------EE----------HHHHHHHHH------------------H

KAALDAVQKAGWTPISASTLGYGGKVDARGTFFGEKAGYTTAQVEVLGKYDDAGKLLEIGIGFRGTSGPRETLISDS
HHHHHHHH----------------------EE-----------EEEEEEE-----EEEEEEEEE-------------

IGDLVSDLLAALGPKDYAKNYAGEAFGGLLKNVADYASAHGLSGKDVLVSGHSLGGLAVNSLADLSGNKWGGFYKDA
--------------------EE------EE-------EEE-----EEEEE---------EEEEE-----E-------

HYVAYASPTQSAGDKVLNIGYENDPVFRALDGSSFNLSSLGVLYKPHESTTDNIVSFNDHYASTLWNVLPFSIANLP
EEEE----------EEEEE-----EEEE------EE--------EE-------EEE-------E--------E----

TWLPHLPTGYGDGMTRILESGFYEQMSRDSTVIVANLSDPARANTWVQDLNRNAEPHKGNTFIIGSDGDDLIQGGKG
---EEE-------EEEE-----E------EEEEEE-----EE--EE-----EEEE------EEE------EEE----

VDFIEGGKGNDTLRDNSGHNTFLFSGHFGQDRVIGYQSTDKLVFKDLQGSVDYREHGGDTVISVGGDSVTLVGVSGG
--EEE------EEEE-----EEEEE-----EEEE-----EEEEEE-----EEEE----EEEEEE----EEEEE----

LGEVVIG
--EEE--

Figure 10: Screenshot of Jpred secondary structure prediction result.

Figure 10 shows the Jpred secondary structure prediction, in which the dashes represent coil. The next step is to build a consensus between the Jpred and HHpred results.


6.3 Predictprotein secondary structure analysis

Figure 11: Screenshot of Predictprotein result.

Figure 11 shows the PredictProtein result, which lays out predicted features that correspond to regions within the queried sequence. Initially I intended to construct the consensus structure from all three secondary structure prediction tools. However, the PredictProtein output proved far more complex than that of HHpred and Jpred, and I was unable to analyse the large amount of data it provides, so it is not included in the consensus.

6.4 Consensus secondary structure analysis

MGIFDYKNLGTEGSKALFADAMAITLYTYHNLDNGFAVGYQHNGLGLGLPATLVGALLGSTDSQGVIPGLPWNPDSE
-----------HHHHHHHHHHHHHHHHHH--------EE----------HHHHHHHHH------------------H
----------HHHHHHHHHHHHHHHHHHH--------HHHH---------HHHHHHHH-----------------HH

KAALDAVQKAGWTPISASTLGYGGKVDARGTFFGEKAGYTTAQVEVLGKYDDAGKLLEIGIGFRGTSGPRETLISDS
HHHHHHHH----------------------EE-----------EEEEEEE-----EEEEEEEEE-------------
HHHHHHHHH-------HHH-----EE----EEE--------HHHEEEEEE-----EEEEEEEEE-------------

IGDLVSDLLAALGPKDYAKNYAGEAFGGLLKNVADYASAHGLSGKDVLVSGHSLGGLAVNSLADLSGNKWGGFYKDA
--------------------EE------EE-------EEE-----EEEEE---------EEEEE-----E-------
--HHHHHHHHH-------HHHHHHHHHHHHHHHHHHHHH-------EEEE----------HHHH-------------

HYVAYASPTQSAGDKVLNIGYENDPVFRALDGSSFNLSSLGVLYKPHESTTDNIVSFNDHYASTLWNVLPFSIANLP
EEEE----------EEEEE-----EEEE------EE--------EE-------EEE-------E--------E----
--EE-----------EEE------EEEE-----EEEEE---------------EEEE-------EEE----------

TWLPHLPTGYGDGMTRILESGFYEQMSRDSTVIVANLSDPARANTWVQDLNRNAEPHKGNTFIIGSDGDDLIQGGKG
---EEE-------EEEE-----E------EEEEEE-----EE--EE-----EEEE------EEE------EEE----
------------------------------EEEEEE--------EEEE----EEE------EEE------EEE----

VDFIEGGKGNDTLRDNSGHNTFLFSGHFGQDRVIGYQSTDKLVFKDLQGSVDYREHGGDTVISVGGDSVTLVGVSGG
--EEE------EEEE-----EEEEE-----EEEE-----EEEEEE-----EEEE----EEEEEE----EEEEE----
--EEE------EEE------EEEE------EEEE------EEEEE-----EEEEEE---EEEEE---EEEEE-----

LGEVVIG = Protein sequence
--EEE-- = Jpred result
---EEE- = HHpred result

Figure 12: Consensus analysis.

MGIFDYKNLGTEGSKALFADAMAITLYTYHNLDNGFAVGYQHNGLGLGLPATLVGALLGSTDSQGVIPGLPWNPDSE
-----------HHHHHHHHHHHHHHHHHH---------------------HHHHHHHH------------------H

KAALDAVQKAGWTPISASTLGYGGKVDARGTFFGEKAGYTTAQVEVLGKYDDAGKLLEIGIGFRGTSGPRETLISDS
HHHHHHHH----------------------EE------------EEEEEE-----EEEEEEEEE-------------

IGDLVSDLLAALGPKDYAKNYAGEAFGGLLKNVADYASAHGLSGKDVLVSGHSLGGLAVNSLADLSGNKWGGFYKDA
----------------------------------------------EEEE---------------------------

HYVAYASPTQSAGDKVLNIGYENDPVFRALDGSSFNLSSLGVLYKPHESTTDNIVSFNDHYASTLWNVLPFSIANLP
--EE-----------EEE------EEEE------EE-----------------EEE---------------------

TWLPHLPTGYGDGMTRILESGFYEQMSRDSTVIVANLSDPARANTWVQDLNRNAEPHKGNTFIIGSDGDDLIQGGKG
------------------------------EEEEE---------EE------EEE------EEE------EEE----

VDFIEGGKGNDTLRDNSGHNTFLFSGHFGQDRVIGYQSTDKLVFKDLQGSVDYREHGGDTVISVGGDSVTLVGVSGG
--EEE------EEE------EEEE------EEEE------EEEEE-----EEEE-----EEEEE----EEEE-----

LGEVVIG = Protein sequence
---EE-- = Consensus secondary structure

Figure 13: Final consensus result.

I carried out the consensus analysis manually, without any software. I aligned the Jpred and HHpred results as shown in Figure 12, where the structure highlighted in blue is the consensus. I then removed all non-consensus assignments to obtain the final secondary structure of my protein of interest (Figure 13).
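The manual consensus step amounts to keeping a secondary structure state only where both predictions agree at the same position and writing coil everywhere else. A minimal sketch of that rule is given below; the function name `consensus_structure` is hypothetical, and the example uses the final seven-residue row (LGEVVIG) shown in the figures.

```python
def consensus_structure(pred_a, pred_b, coil="-"):
    """Keep a state where two per-residue predictions agree, else write coil."""
    if len(pred_a) != len(pred_b):
        raise ValueError("predictions must cover the same residues")
    return "".join(a if a == b else coil for a, b in zip(pred_a, pred_b))


jpred_last_row = "--EEE--"    # Jpred result for LGEVVIG
hhpred_last_row = "---EEE-"   # HHpred result for LGEVVIG
print(consensus_structure(jpred_last_row, hhpred_last_row))
```

Applied row by row, this strict agreement rule gives the same kind of result as the manual comparison, and it makes the procedure repeatable if a third predictor is added later.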


7.0 TERTIARY STRUCTURE ANALYSIS

7.1 BLASTp and PSI-BLAST analysis.

Figure 14: Screenshot of BLASTp result. Result identities range from 73% to 100%.

Figure 15: Screenshot of psiBLAST result. Result identities range from 28% to 100%.


Before predicting the tertiary structure, it is vital to determine whether suitable templates exist. If templates with at least 30% sequence identity are available, the tertiary structure can be predicted by homology modelling; if only templates below 30% identity are available, fold recognition should be used instead. To check this, I ran BLASTp. Figure 14 shows the BLASTp results, with identities ranging from 73% to 100%. I also ran PSI-BLAST, whose results had identities of 28% to 100%. PSI-BLAST scores matches between the query and database sequences with position-specific scoring matrices (PSSMs), in contrast to BLASTp, which uses pre-defined scoring matrices. PSI-BLAST results therefore include proteins with significantly lower identities, which is very useful for studying the evolutionary relatedness between proteins.

7.2 Protein tertiary structure prediction using SWISS-MODEL

Figure 16: Tertiary structure predicted using template 2zj6.1.A, the crystal structure of the D337A mutant of Pseudomonas sp. MIS38 lipase (75% identity). First model in the SWISS-MODEL result. GMQE = 0.91; QMEAN4 = -1.33.

Figure 17: Tertiary structure predicted using template 2zj7.1.A, the crystal structure of the D157A mutant of Pseudomonas sp. MIS38 lipase (75% identity). Second model in the SWISS-MODEL result. GMQE = 0.90; QMEAN4 = -1.46.

I carried out tertiary structure prediction using SWISS-MODEL; the results are shown in Figure 16 and Figure 17. To choose the better model, the quality estimates must be compared. Of the two predictions, the structure in Figure 16 is deemed more accurate because it has both the higher GMQE (0.91 versus 0.90) and the higher QMEAN4 score (-1.33 versus -1.46). GMQE (Global Model Quality Estimation) is a quality estimate that combines properties from the target-template alignment. It is expressed as a number between zero and one, reflecting the expected accuracy of a model built with that alignment and template; higher numbers indicate higher reliability. QMEAN4 is reported on a Z-score-like scale, so values closer to zero indicate better agreement with experimental structures, and the less negative value is the better one. SWISS-MODEL selected the templates automatically. The BLASTp and PSI-BLAST results contain plenty of proteins with identities above 75%, which raises the question of why SWISS-MODEL chose templates with only 75% identity. The answer is that the proteins with higher identity in the BLAST results do not have tertiary structures determined by X-ray crystallography or NMR, so no 3D structure data are available to use them as templates. SWISS-MODEL therefore chose appropriate templates.
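The comparison of the two models can be written down as a small ranking rule. The sketch below sorts models by GMQE (higher first) and breaks ties by how close QMEAN4 is to zero. The scores are the ones reported for Figures 16 and 17; the ranking rule and the function name `rank_models` are illustrative assumptions, not part of SWISS-MODEL.

```python
# Scores as reported for the two SWISS-MODEL results in this report.
models = [
    {"template": "2zj7.1.A", "gmqe": 0.90, "qmean4": -1.46},
    {"template": "2zj6.1.A", "gmqe": 0.91, "qmean4": -1.33},
]


def rank_models(candidates):
    """Order models by descending GMQE, then by QMEAN4 closest to zero."""
    return sorted(candidates, key=lambda m: (-m["gmqe"], abs(m["qmean4"])))


best = rank_models(models)[0]
print("Preferred template:", best["template"])
```

With only two models the choice is obvious by eye, but the same rule scales to a longer list of candidate templates.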

An alternative to BLAST for finding templates is to search a protein structure database directly, such as PDBe or PDBj. PDBe (Protein Data Bank in Europe) is a European resource for the collection, organisation and dissemination of data on biological macromolecular structures. PDBj (Protein Data Bank Japan) maintains a centralized PDB archive of macromolecular structures and provides integrated tools.

7.3 Predicted model analysis using the PyMOL viewer

Figure 18: Protein view in cartoon style.

Figure 19: Protein surface analysis.

I viewed the predicted structure in PyMOL; the structure is shown in Figure 18. I then carried out surface analysis in PyMOL, and the result is shown in Figure 19. As seen in Figure 19, there is an inward cavity which may, hypothetically, be the enzyme’s active site.


Next, I carried out domain analysis. Because several domains were identified, I chose one for this analysis: the Alpha/Beta hydrolase fold (see 3.1.1), which comprises residues 135 to 263 according to InterPro. I entered a command in PyMOL to select residues 135-263 and coloured the selection red. The Alpha/Beta hydrolase fold domain is shown in red in Figure 20.

Figure 20: Predicted protein structure with the Alpha/Beta hydrolase fold domain coloured in red.
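The exact commands typed into PyMOL are not recorded in this report. The sketch below shows one plausible pair, built as text so that it can be pasted into the PyMOL command line: a `select` with a `resi` range followed by a `color` on the named selection. The selection name `ab_hydrolase` and the helper `domain_colour_commands` are hypothetical.

```python
def domain_colour_commands(name, first_residue, last_residue, colour="red"):
    """Build PyMOL command-line strings that select a residue range and colour it."""
    if first_residue > last_residue:
        raise ValueError("first_residue must not exceed last_residue")
    return [
        f"select {name}, resi {first_residue}-{last_residue}",
        f"color {colour}, {name}",
    ]


# Residues 135-263: the Alpha/Beta hydrolase fold range reported by InterPro.
for line in domain_colour_commands("ab_hydrolase", 135, 263):
    print(line)
```

Keeping the residue range in one place makes it easy to repeat the colouring for the other domains listed in Section 3.1.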

7.4 Superposition analysis

Next, I carried out superposition analysis in PyMOL. I opened the tertiary structure of the protein of interest (the first SWISS-MODEL model) together with a second tertiary structure predicted by SWISS-MODEL from a different template. The two structures as loaded are shown in Figure 21, and the result after issuing the align command is shown in Figure 22. The blue model is the primary selected tertiary structure. The two structures in Figure 22 are said to be superposed.


Figure 21: Two structures before aligning.

Figure 22: The structures after aligning.

Algorithms that superimpose protein 3D structures are used to identify similarities between protein folds. The coordinates of a mobile protein are transformed (superposed) so that its backbone lies over the backbone of a reference protein. Distant homologues may not be recognizable from their amino acid sequences, because sequences diverge more rapidly in evolution than 3D structures do. Superposition is therefore useful for detecting such relationships, and also for comparing two highly similar protein structures, as done here.
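How well two superposed structures agree is usually summarised as a root-mean-square deviation (RMSD) over matched atoms. The sketch below computes that value for two equally long lists of (x, y, z) coordinates that are assumed to be already superposed and paired atom by atom. It does not perform the superposition itself, and the function name `rmsd` and the coordinates are illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two paired lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Two made-up backbone fragments, two atoms each.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
mobile = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(rmsd(reference, mobile))
```

A value close to zero means the two backbones overlap almost exactly after superposition.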

7.5 Model evaluation using SAVES

The predicted structure was validated by submitting it to the SAVES protein structure verification server. The overall quality factor from ERRAT was 81.879. In the PROVE analysis, 78 buried protein atoms (4.9% of scored atoms) were outliers. VERIFY3D determined that 98.93% of the residues had an averaged 3D-1D score >= 0.2, which meets the requirement that at least 80% of the amino acids score >= 0.2 in the 3D/1D profile. For PROCHECK, out of 12 evaluations, four gave errors, three gave warnings, and five passed.

ERRAT analyses the statistics of non-bonded interactions between different atom types and plots the value of the error function against the position of a 9-residue sliding window, calculated by comparison with statistics from highly refined structures. The higher the score, the higher the quality; a score above 50 is normally accepted for a high-quality tertiary structure model. Since the overall quality factor here is 81.879, the model can be considered good. PROCHECK checks the stereochemical quality of a protein structure by analysing residue-by-residue geometry and overall structure geometry. VERIFY3D determines the compatibility of an atomic model (3D) with its own amino acid sequence (1D) by assigning a structural class based on location and environment (alpha, beta, loop, polar, nonpolar, etc.) and comparing the results to good structures. The percentage of residues with a VERIFY3D score >= 0.2 should be more than 80% for a reliable model. PROVE calculates the volumes of atoms in macromolecules using an algorithm which treats the atoms as hard spheres, and calculates a statistical Z-score deviation for the model from highly resolved (2.0 Å or better) and refined (R-factor of 0.2 or better) PDB-deposited structures.
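The VERIFY3D criterion can be expressed as a simple check on per-residue scores: count the residues scoring at least 0.2 and compare that fraction with 80%. The sketch below illustrates the rule with made-up scores. The function name `verify3d_pass` is hypothetical and the code is not part of the SAVES server.

```python
def verify3d_pass(scores, cutoff=0.2, required_fraction=0.80):
    """Return (fraction of residues scoring >= cutoff, whether the model passes)."""
    if not scores:
        raise ValueError("at least one residue score is required")
    passing = sum(1 for s in scores if s >= cutoff)
    fraction = passing / len(scores)
    return fraction, fraction >= required_fraction


# Made-up averaged 3D-1D scores for five residues (illustration only).
fraction, ok = verify3d_pass([0.5, 0.3, 0.2, 0.1, 0.4])
print(f"{fraction:.0%} of residues pass; model accepted: {ok}")
```

For the model in this report the fraction returned by VERIFY3D was 98.93%, well above the 80% requirement.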

Next is the Ramachandran plot analysis. A Ramachandran plot visualizes the backbone dihedral angles ψ against φ of the amino acid residues in a protein structure. Ideally, over 90% of the residues should fall in the most favoured ("core") regions of the plot. The percentage of residues in the core regions is one of the better guides to stereochemical quality.

Figure 23: Ramachandran plot for the protein model.


The summary of the Ramachandran plot is as below:

In conclusion, 89% of the residues fall in the most favoured regions. An ideal model would exceed 90%, so the model falls short of that criterion by about one percentage point. Although it is not an ideal model, the predicted model is therefore of acceptable accuracy.

CONCLUSION

In this study, I investigated the sequence with accession number GQ352455.1, a lipase gene from an uncultured bacterium. First, I carried out domain analysis, and the consistent domain assignments from the InterPro, PROSITE, and ProDom databases indicated that the enzyme is a lipolytic enzyme. Second, the analysis showed that the enzyme has a calcium-binding site, and the domains found suggest that the uncultured bacterium is Gram-negative. Third, from the phylogenetic tree that I generated, I conclude that the gene of interest most likely shares its most recent common ancestor with lipase sequences from Pseudomonas species. I then carried out several primary sequence analyses. The protein has 469 amino acids, a predicted molecular weight of 49372.8 Da, and a theoretical pI of 4.83, with 53 negatively charged and 32 positively charged residues. SignalP concluded that the protein sequence contains no signal peptide, and CELLO predicted an extracellular localization. Next, I carried out secondary structure analysis using Jpred and HHpred, aligned the two results, and removed all non-consensus assignments to obtain the final secondary structure of my protein of interest. I then predicted the tertiary structure using SWISS-MODEL, viewed the predicted structure in PyMOL, and carried out surface and domain analysis in PyMOL. Lastly, I assessed the quality of the model using ERRAT, PROVE, VERIFY3D, PROCHECK, and Ramachandran plot analysis. In conclusion, the model is judged to be of acceptable accuracy.

REFERENCES

1. ProDom: Servant F, Bru C, Carrère S, Courcelle E, Gouzy J, Peyruc D, Kahn D (2002) ProDom: Automated clustering of homologous domains. Briefings in Bioinformatics 3(3): 246-251.

2. PROSITE: Sigrist CJA, de Castro E, Cerutti L, Cuche BA, Hulo N, Bridge A, Bougueleret L, Xenarios I (2012) New and continuing developments at PROSITE. Nucleic Acids Res. doi: 10.1093/nar/gks1067

3. InterPro: Mitchell A, Chang HY, Daugherty L, Fraser M, Hunter S, Lopez R, McAnulla C, McMenamin C, Nuka G, Pesseat S, Sangrador-Vegas A, Scheremetjew M, Rato C, Yong SY, Bateman A, Punta M, Attwood TK, Sigrist CJA, Redaschi N, Rivoire C, Xenarios I, Kahn D, Guyot D, Bork P, Letunic I, Gough J, Oates M, Haft D, Huang H, Natale DA, Wu CH, Orengo C, Sillitoe I, Mi H, Thomas PD and Finn RD (2015) The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Research, Jan 2015. doi: 10.1093/nar/gku1243

4. ProtParam: Gasteiger E, Hoogland C, Gattiker A, Duvaud S, Wilkins MR, Appel RD, Bairoch A (2005) Protein Identification and Analysis Tools on the ExPASy Server. In: John M. Walker (ed): The Proteomics Protocols Handbook, Humana Press. pp. 571-607.

5. MEGA6: Tamura K, Stecher G, Peterson D, Filipski A and Kumar S (2013) MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Molecular Biology and Evolution 30: 2725-2729.

6. ClustalW: Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ and Higgins DG (2007) ClustalW and ClustalX version 2. Bioinformatics 23(21): 2947-2948.

7. SignalP: Petersen TN, Brunak S, von Heijne G and Nielsen H (2011) SignalP 4.0: discriminating signal peptides from transmembrane regions. Nature Methods 8: 785-786.


8. CELLO v.2.5: Yu CS, Lin CJ and Hwang JK (2004) Predicting subcellular localization of proteins for Gram-negative bacteria by support vector machines based on n-peptide compositions. Protein Science 13: 1402-1406.

9. PredictProtein: Rost B, Yachdav G and Liu J (2004) Nucleic Acids Res. 32: 321-326.

10. Jpred: Drozdetskiy A, Cole C, Procter J and Barton GJ (first published online April 16, 2015) JPred4: a protein secondary structure prediction server. Nucleic Acids Res. Web Server issue.

11. HHpred: Söding J (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21: 951-960.

12. PROCHECK: Laskowski RA, Rullmann JA, MacArthur MW, Kaptein R, Thornton JM (1996) AQUA and PROCHECK-NMR: programs for checking the quality of protein structures solved by NMR. J Biomol NMR 8: 477-486.

13. PROVE: Pontius J, Richelle J and Wodak SJ (1996) Deviations from standard atomic volumes as a quality measure for protein crystal structures. Journal of Molecular Biology 264(1): 121-136.

14. VERIFY3D: Lüthy R, Bowie JU and Eisenberg D (1992) Assessment of protein models with three-dimensional profiles. Nature 356(6364): 83-85.

15. Other literature:

i) Convey P (1996) The influence of environmental characteristics on life history attributes of Antarctic terrestrial biota. Biological Reviews 71(2): 191-225.

ii) Gerday C, Aittaleb M, Arpigny JL, Baise E, Chessa JP, Garsoux G, ... and Feller G (1997) Psychrophilic enzymes: a thermodynamic challenge. Biochimica et Biophysica Acta (BBA)-Protein Structure and Molecular Enzymology 1342(2): 119-131.

iii) Joseph B, Ramteke PW and Thomas G (2008) Cold active microbial lipases: some hot issues and recent developments. Biotechnology Advances 26(5): 457-470.

iv) Joseph B, Ramteke PW, Thomas G and Shrivastava N (2007) Standard review cold-active microbial lipases: a versatile tool for industrial applications. Biotechnology and Molecular Biology Review 2(2): 039-048.
