Comparative genomics and proteomics

Genome

The genome contains all the biological information required to build and maintain any given living organism.

The genome contains the organism's molecular history.

Decoding the biological information encoded in these molecules will have an enormous impact on our understanding of biology.

Timeline: from Mendel to the ‘finished’ human genome sequence

• 1865: Mendel discovers laws of genetics

• 1900: Rediscovery of Mendel’s genetics

• 1944: DNA identified as hereditary material

• 1953: DNA structure

• 1960s: Genetic code

• 1975-79: First human genes isolated

• 1977: Advent of DNA sequencing

• 1986: DNA sequencing automated

• 1990: Human genome project officially begins

• 1995: First whole genome

• 1999: First human chromosome

• 2003: ‘Finished’ human genome sequence (~50 years after the DNA structure)

Some of the completed genomes

• Haemophilus influenzae

• Escherichia coli

• Bacillus subtilis

• Helicobacter pylori

• Borrelia burgdorferi

• Streptococcus pneumoniae

• Saccharomyces cerevisiae

• Caenorhabditis elegans

• Arabidopsis thaliana

• Archaeoglobus fulgidus

• Methanobacterium thermoautotrophicum

• Methanococcus jannaschii

• Mycoplasma pneumoniae

• Mycoplasma genitalium

• Rickettsia prowazekii

• Mycobacterium tuberculosis

How much can sequence data alone tell us?

• The answer is that a DNA sequence taken in isolation from a single organism reveals very little.

• The vast majority of DNA in most organisms is noncoding. Protein coding sequences or genes cannot function as isolated units without interaction with noncoding DNA and neighboring genes.

• This genomic environment is specific to each organism. To understand it, we need to look at similar genes in different organisms and determine how their function and position have changed over the course of evolution.

• By understanding evolutionary processes we can gain greater insight into what makes a gene, and into the wider processes of genetics and inheritance.

Comparative Genomics

• The study of the relationship between genome structure and function across different biological species or strains.

• One of the important goals of the field is the identification of the mechanisms of eukaryotic genome evolution.

• This is, however, often complicated by the multiplicity of events that have taken place throughout the history of individual lineages, leaving only distorted and superimposed traces in the genome of each living organism.

• For this reason, comparative genomics studies of small model organisms (for example Caenorhabditis elegans and the closely related Caenorhabditis briggsae) are of great importance in advancing our understanding of the general mechanisms of evolution.

• "Proteome" is a blend of "protein" and "genome".

• The proteome gives a much better understanding of an organism than genomics alone, for two reasons.

• First, the level of transcription of a gene gives only a rough estimate of its level of expression into a protein. An mRNA produced in abundance may be degraded rapidly or translated inefficiently, resulting in a small amount of protein.

• Second, many proteins experience post-translational modifications that profoundly affect their activities; for example, some proteins are not active until they become phosphorylated.

Human Genome Project

• An international scientific research project with the primary goal of determining the sequence of chemical base pairs which make up DNA, and of identifying and mapping the approximately 20,000-25,000 genes of the human genome from both a physical and a functional standpoint.

• The project began in October 1990 and was initially headed by Ari Patrinos, head of the Office of Biological and Environmental Research in the U.S. Department of Energy's Office of Science.

• While the objective of the Human Genome Project is to understand the genetic makeup of the human species, the project has also focused on several other nonhuman organisms such as E. coli, the fruit fly, and the laboratory mouse.

http://www.ornl.gov/sci/techresources/Human_Genome/research/function.shtml

Original goals

• construction of a high-resolution genetic map of the human genome;

• production of a variety of physical maps of all human chromosomes and of the DNA of selected model organisms;

• determination of the complete sequence of human and selected model-organism DNA;

• development of capabilities for collecting, storing, distributing, and analyzing the data produced; and

• creation of appropriate technologies necessary to achieve these objectives.

This was a huge technical undertaking so further aims of the project were…

• Develop and improve technologies for: DNA sequencing, physical and genetic mapping, database design, informatics, public access

• Genome projects of 5 model organisms: E. coli, S. cerevisiae, C. elegans, D. melanogaster and M. musculus. These projects provide information about the organisms themselves and serve as test cases for the refinement and implementation of the various tools required for the HGP.

• There are approximately 23,000 genes in human beings, the same range as in mice and roundworms. Understanding how these genes express themselves will provide clues to how diseases are caused.

The Human genome project

http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml

Advantages of Human Genome Project:

• Knowledge of the effects of DNA variation among individuals can revolutionize the ways we diagnose, treat and even prevent a number of diseases that affect human beings.

• It provides clues to the understanding of human biology.

• The functions of human genes and other DNA regions often are revealed by studying their parallels in nonhumans.

• To enable such comparisons, HGP researchers have obtained complete genomic sequences for the bacterium Escherichia coli, the yeast Saccharomyces cerevisiae, the roundworm Caenorhabditis elegans, the fruit fly Drosophila melanogaster, the laboratory mouse, and many other organisms.

• The availability of complete genome sequences generated both inside and outside the HGP is driving a major breakthrough in fundamental biology as scientists compare entire genomes to gain new insights into evolutionary, biochemical, genetic, metabolic, and physiological pathways.

To use this data to understand physiology, behaviour, disease and variation between species and individuals, we need to know:

• The evolutionary history of every genetic element (every base)

• The evolutionary forces shaping the genome

• Structural and sequence variation in the population and between species

Comparative genomics studies differences between genome sequences, pinpointing changes over time. Comparing the number and type of changes against the background of expected “neutral” changes gives a better understanding of the forces that shaped genomes and traits.

Introduction

• Mass spectrometry recently emerged as a valuable technique for proteogenomic annotations that improves on the state of the art in predicting genes and other features. Previous proteogenomic approaches were limited to a single genome and did not take advantage of analyzing mass spectrometry data from multiple genomes at once. We show that such a comparative proteogenomics approach allows one to address the problems that remained beyond the reach of the traditional “single proteome” approach in mass spectrometry.

• In particular, we show how comparative proteogenomics addresses the notoriously difficult problem of “one-hit-wonders” in proteomics, improves on the existing gene prediction tools in genomics and allows identification of rare post-translational modifications.

http://genome.cshlp.org/content/18/7/1133.short

Developments

• Since the sequencing of the first genome, Haemophilus influenzae, in 1995, the number of sequenced genomes has been rising sharply. Every sequencing project is followed by annotation of the genome to identify genes, pathways, etc.

• MS-Genome software for automated proteogenomic annotation of bacterial genomes was developed and applied for improving annotation of Shewanella oneidensis MR-1, a model bacterium for studies of bioremediation and metal reduction. However, the synergy between MS/MS data from different species was never explored in the past. We show that such comparative proteogenomics analysis sheds new light on the annotations of both genomes and proteomes.


Continued

• Similar to Expressed Sequence Tag (EST) studies, mass spectrometry experiments generate Expressed Protein Tags (EPTs) that provide valuable information about expressed proteins. Unlike ESTs, EPTs are relatively uniformly distributed along the protein length and provide information about translational starts, proteolytic events and post-translational modifications, making it nontrivial to transfer the existing EST approaches into the EPT domain.

• Here, we will analyze MS/MS data sets for three Shewanella bacteria representing multiple growth conditions: S. oneidensis MR-1 (So, ~14.5 million spectra), S. frigidimarina (Sf, ~0.955 million spectra) and S. putrefaciens CN-32 (Sp, ~0.768 million spectra). In addition to predicting new genes and finding errors in existing annotations, we will show that MS/MS data help to identify programmed frameshifts, a difficult problem in genomics. We will also demonstrate that comparative analysis of peptides across species is helpful in resolving the dilemma of “one-hit-wonders” in proteomics.


Methods

• Peptide identification

Peptide identification for S. oneidensis MR-1 (So) had been performed earlier. The MS/MS spectra were acquired on ion-trap mass spectrometers using electrospray ionization. InsPecT was used to search the spectra of each species against a database containing the six-frame translation of the genome along with common contaminants, and against a decoy database of the same size.
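The six-frame translation used to build the search database can be sketched as follows. This is a minimal illustration, not the InsPecT pipeline itself: the function names are hypothetical, the standard codon table is assumed, and incomplete or ambiguous codons are handled in a simplified way.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate both strands in all three reading frames (hypothetical helper)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
print(frames["+1"])  # MA*
```

Searching spectra against all six frames, rather than against annotated proteins only, is what lets peptides reveal genes and start sites that the annotation missed.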

The InsPecT score threshold was selected in each case to limit the number of identifications on the decoy database to at most 1% of the number of identifications on the target database, keeping the false discovery rate under control. After this filtering step we obtain the peptides in all three species that do not match the annotated proteins in these genomes.
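The decoy-based threshold selection described above can be sketched as below. This is an illustrative assumption about how such a cutoff might be computed, not InsPecT's actual procedure; the function name and the toy scores are hypothetical.

```python
def score_threshold(target_scores, decoy_scores, max_ratio=0.01):
    """Return the lowest cutoff at which decoy hits are at most
    max_ratio times the target hits, or None if no cutoff qualifies."""
    candidates = sorted(set(target_scores) | set(decoy_scores))
    for cutoff in candidates:
        n_target = sum(s >= cutoff for s in target_scores)
        n_decoy = sum(s >= cutoff for s in decoy_scores)
        if n_target > 0 and n_decoy <= max_ratio * n_target:
            return cutoff
    return None


# Toy scores (made up for illustration only).
target = [10, 9, 8, 7, 6, 5]
decoy = [6, 4, 3]
print(score_threshold(target, decoy, max_ratio=0.2))  # 5
```

The same function with `max_ratio=0.05` corresponds to the 5% level used for the PTM search later in the methods.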

• Analyzing late start codons

We describe an algorithm for predicting ‘late’ start codons, i.e. the (correct) start codons that are located downstream from wrongly annotated start codons. While a late start codon implies a “missing” peptide at the beginning of the protein, such missing peptides can also be caused by low peptide detectability or may simply represent signal peptides.

However, noncovered peptides at the beginning of the protein that cannot be explained by the signal peptide consensus sequence point to late start codons. Within 18 residues of the start, there are 33 cases of N-terminal most-noncovered peptides in So. Many of them start with an ATG start codon or start immediately after a start codon. The distribution of codons for the amino acids at positions 1 and -1 of these peptides is non-uniform, which suggests that they cannot all be artifacts.



• Correlated peptides

Traditional MS/MS analysis is focused on identification of proteins and is less concerned with the question of which peptides in a protein are observed and which are not. In a typical mass spectrometry experiment, some peptides with low detectability are always missed, resulting in highly non-uniform protein coverage by identified peptides.

Peptide detectability depends on protein abundance, peptide length, peptide hydrophobicity, etc., and several groups are using large data sets to learn how to predict it. Peptides identified by MS/MS in two species are called correlated peptides if they are observed at the same position in the protein alignment or if one of them spans the other.

For example, if one peptide is located at position (start1, end1) and the other at (start2, end2) in the alignment, then they are considered correlated if

start1 ≤ start2 ≤ end2 ≤ end1

or

start2 ≤ start1 ≤ end1 ≤ end2
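The correlation condition translates directly into code. A minimal sketch, assuming each peptide is given as a (start, end) pair in alignment coordinates; the function name is hypothetical.

```python
def are_correlated(span_a, span_b):
    """True if one peptide span contains the other (or both are identical)."""
    (start1, end1), (start2, end2) = span_a, span_b
    return (start1 <= start2 <= end2 <= end1) or (start2 <= start1 <= end1 <= end2)


print(are_correlated((10, 25), (12, 20)))  # True: the second lies inside the first
print(are_correlated((10, 20), (15, 30)))  # False: partial overlap only
```

Note that a partial overlap does not count: one peptide must fully span the other.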


• Identification of post-translational modifications

MS-Alignment was used to identify PTMs in each of the three organisms in blind mode, in the range from -200 to 250 Da. Common contaminants such as keratin were included in the protein sequence databases. A decoy database of the same size as the actual protein database, containing shuffled sequences, was used to control the error rate. Any hits to the decoy database are expected to be incorrect identifications.

A score cutoff is chosen such that the number of PTMs identified in the decoy database is at most 5% of the number of identifications in the target database. This provides a controlled PTM site-specific false discovery rate of 5%. All spectra that were identified in the regular InsPecT search were removed. After this post-processing of the MS-Alignment results, 9917, 7649 and 6709 PTMs were obtained in So, Sf and Sp respectively.


Results

• Multiple Shewanella genomes

The genomes of the three Shewanella species used in this study have been sequenced. The protein orthology assignments across the different Shewanella species were prepared using INPARANOID, and the orthologs were subsequently aligned with MUSCLE.

Figure: Expression of orthologous genes across the three species. (A) The number of orthologs shared between different species. There are 2590 orthologous genes present in all three species (referred to as “shared genes”). (B) The number of expressed shared genes among the three species; 1052 shared genes are expressed in all three species.


• Protein Identification

MS-based protein identification can be done to analyze the expression of pathways or functional categories. Having proteomic data for these three species allows us to compare the expression of pathways and identify which pathways are conserved or differentially expressed across these species.


• Resolving one hit wonders

There are 1052 shared genes that are expressed in all three species. However, as per guidelines, we require at least two peptides to consider a protein as expressed. Since almost every analysis of MS/MS data sets reveals a large number of proteins with a single identified peptide, this requirement leads to a significant reduction in the number of identified proteins.

While orthologous one-hit-wonders are strong indicators of protein expression, peptides identified at the same orthologous positions in different species provide overwhelming evidence that the proteins are expressed.


We should observe orthologous peptides in closely related species. We thus check whether the only peptide observed in a protein is correlated between multiple species. If the peptide identification is spurious, it is very unlikely that the peptide will be at the same position as the observed peptides in its orthologs.

Figure: Aligned amino acid sequences of the shared gene (annotated as hypothetical lipoprotein). The identified peptides are shown in blue.
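The one-hit-wonder check can be sketched as follows. This is an illustrative assumption about the bookkeeping, not the authors' implementation: peptides for one shared gene are given per species as (start, end) spans in alignment coordinates, and all names and coordinates are hypothetical.

```python
def spans_correlated(a, b):
    """True if one span contains the other in alignment coordinates."""
    return (a[0] <= b[0] <= b[1] <= a[1]) or (b[0] <= a[0] <= a[1] <= b[1])


def rescued_one_hit_wonders(peptides_by_species):
    """Return species whose single peptide is supported by an ortholog.

    peptides_by_species maps species name -> list of (start, end) spans
    for one orthologous gene.
    """
    rescued = []
    for species, spans in peptides_by_species.items():
        if len(spans) != 1:
            continue  # not a one-hit wonder in this species
        others = [
            span
            for other, other_spans in peptides_by_species.items()
            if other != species
            for span in other_spans
        ]
        if any(spans_correlated(spans[0], other) for other in others):
            rescued.append(species)
    return sorted(rescued)


# Made-up coordinates for illustration only.
gene = {"So": [(5, 18), (40, 52)], "Sf": [(7, 16)], "Sp": [(60, 70)]}
print(rescued_one_hit_wonders(gene))  # ['Sf']
```

Here the single Sf peptide is supported because it falls inside an So peptide, while the single Sp peptide has no counterpart and stays unconfirmed.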


Identification of programmed frameshifts and sequencing errors

• A frameshift occurs when a ribosome skips one or more nucleotides in an mRNA sequence, thereby changing the reading frame and producing a protein sequence different from that of the original frame. In programmed frameshifts, this phenomenon is built into the translational machinery.

Mass spectrometry provides experimental evidence for the actual translation products and allows one to detect frameshifts. The presence of peptides from two different reading frames within the region of a predicted gene may represent:

(1) an incorrect peptide identification,

(2) an insertion/deletion sequencing error,

(3) overlapping genes in different frames, or

(4) a programmed frameshift.
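A first screen for such cases is to flag genes whose peptides map to more than one reading frame. The sketch below assumes each peptide hit is recorded as (gene id, strand, nucleotide start) with starts measured consistently on one coordinate axis; the names and coordinates are hypothetical, and deciding among the four explanations above still requires further inspection.

```python
from collections import defaultdict


def genes_with_mixed_frames(peptide_hits):
    """Return genes whose peptide hits fall in more than one reading frame.

    peptide_hits is an iterable of (gene_id, strand, nucleotide_start).
    The frame is taken as nucleotide_start modulo 3 on the given strand.
    """
    frames = defaultdict(set)
    for gene_id, strand, nt_start in peptide_hits:
        frames[gene_id].add((strand, nt_start % 3))
    return {gene: sorted(f) for gene, f in frames.items() if len(f) > 1}


# Made-up hits for illustration only.
hits = [
    ("geneA", "+", 300),
    ("geneA", "+", 451),
    ("geneB", "+", 900),
    ("geneB", "+", 930),
]
print(genes_with_mixed_frames(hits))  # {'geneA': [('+', 0), ('+', 1)]}
```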


Proteolytic events

An in vivo proteolytic event can be observed as a non-tryptic peptide. However, non-tryptic peptides may also be observed due to other reasons such as degradation of tryptic peptides or incorrect peptide identifications.

After applying the filtering approach and removing the cuts explained by trypsin specificity, we obtain a set of putative proteolytic sites in the three species. To check whether these sites are conserved between multiple organisms, we map them onto the alignment of orthologous proteins. The number of proteolytic sites conserved between two or more organisms was greater than expected by chance. This argues that the conserved sites reported here cannot be the result of non-specific degradation.
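Removing the cuts explained by trypsin specificity can be sketched as below. The rule used here (cleavage after K or R, but not before P) is the commonly quoted trypsin convention and is an assumption of this sketch, as are the function names and the toy sequence; it is not the authors' exact filter.

```python
def is_tryptic_cut(protein, position):
    """True if cutting between protein[position-1] and protein[position]
    is explained by trypsin, or if the position is a protein terminus."""
    if position == 0 or position == len(protein):
        return True
    return protein[position - 1] in "KR" and protein[position] != "P"


def non_tryptic_ends(protein, peptide):
    """Return the peptide ends that trypsin specificity does not explain."""
    start = protein.find(peptide)
    if start < 0:
        raise ValueError("peptide not found in protein")
    end = start + len(peptide)
    ends = []
    if not is_tryptic_cut(protein, start):
        ends.append(("N-term", start))
    if not is_tryptic_cut(protein, end):
        ends.append(("C-term", end))
    return ends


protein = "MKTAYIAKQRQISFVK"  # toy sequence
print(non_tryptic_ends(protein, "TAYIAK"))  # []
print(non_tryptic_ends(protein, "AYIAK"))   # [('N-term', 3)]
```

Positions returned by `non_tryptic_ends` are the candidates that would then be mapped onto the ortholog alignment to test for conservation.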



Post-translational modifications

• While algorithms for blind searches for unexpected modifications have been developed, they had to rely on the “strength in numbers” principle to distinguish real modifications from computational artifacts. As a result, biologically important modifications that appear only a few times in the genome are likely to be classified as computational artifacts.

Blind PTM searches with MS-Alignment find all possible mass offsets without a priori knowledge of which modifications may be present in the sample. Since blind searches may yield thousands of modifications, the strength-in-numbers approach considers frequent modifications reliable and discards rare modifications as unreliable.

After the post-processing of MS-Alignment results, we find 162 different modifications that are observed in all three species. While 74 of these represent chemical adducts that are expected in mass spectrometry experiments, 88 others reveal biologically interesting modifications as well as other potentially important modifications that remain unknown.
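Finding modifications shared by all species amounts to a set intersection. The sketch below assumes, for illustration only, that each modification is keyed by its position in the ortholog alignment and its mass offset rounded to the nearest dalton; the actual matching criteria used in the study are not given here, and the data are made up.

```python
def conserved_modifications(mods_by_species):
    """Return (alignment_position, rounded_mass_offset) pairs seen in every species.

    mods_by_species maps species name -> iterable of
    (alignment_position, mass_offset_in_Da) pairs.
    """
    keyed = [
        {(pos, round(offset)) for pos, offset in mods}
        for mods in mods_by_species.values()
    ]
    return sorted(set.intersection(*keyed)) if keyed else []


# Made-up observations for illustration only.
mods = {
    "So": [(12, 15.99), (40, 42.01), (77, -17.03)],
    "Sf": [(12, 16.0), (40, 42.0)],
    "Sp": [(12, 15.995), (77, -17.0)],
}
print(conserved_modifications(mods))  # [(12, 16)]
```

A modification seen only once per species is kept if it recurs at the same aligned site everywhere, which is how conservation can stand in for "strength in numbers".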


Some more applications:

• “RNA editing” is difficult to confirm by MS-based analysis of a single genome since amino acid mutations can also be explained by DNA sequencing errors or false peptide identifications. While mass spectrometry is routinely used for confirming RNA editing events in a case-by-case fashion, it was never used for genome-wide discovery of RNA editing. The comparative proteogenomics analysis of related species would be a simple way to rule out such alternative explanations and to confirm RNA editing.

• While “signal peptides” are important for understanding protein function, they are difficult to confirm experimentally, and computational tools are used to fill the gap. Comparative proteogenomics opens a possibility to construct the first reliable data set of all signal peptides in a set of genomes and to study evolution of signal peptides across multiple species.

• “Operon prediction” in bacterial genomes is an important but still unsolved problem. Since peptide detectability varies from species to species, we expect that a comparative proteogenomics approach based on signatures may minimize errors and improve on existing operon predictions.

Missing genes in metabolic pathways

• Comparative analysis of a large and growing number of diverse sequenced genomes is revolutionizing the pace of gene discovery.

• A common theme of these efforts is the integration of various types of genomic evidence such as clustering of genes on the chromosome, protein fusion events, occurrence profiles or signatures and shared regulatory sites to infer functional coupling for proteins participating in related cellular processes.

• The approach is primarily focused on which components (e.g. metabolic enzymes) are actually present and which should be present but cannot be identified, and thus provides a rather specific and precise notion of what is actually missing.

Source- Missing genes in metabolic pathways: a comparative genomics approach by Andrei Osterman and Ross Overbeek

Human coenzyme A biosynthesis

• Only the gene for human pantothenate kinase (PANK) was known.

• Given the conservation at the functional level of this pathway between humans and bacteria, genes from bacteria to humans were projected using comparative genomics.

Source- Complete Reconstitution of the Human Coenzyme A Biosynthetic Pathway via Comparative Genomics by Matthew Daugherty, Boris Polanuyer, Michael Farrell, Michael Scholle, Athanasios Lykidis, Valérie de Crécy-Lagard, and Andrei Osterman


PANK PPCS PPCDC PPAT DPCK

• PSI-BLAST searches identified three proteins in the human cDNA sequence database as strong homologs of E. coli CoA biosynthesis enzymes.

• One homolog was found for PPCDC and two homologs were found for DPCK.

• No reliable homologs could be found for E. coli PPCS or PPAT.

• The predicted human PPCDC appeared to be a mono-functional enzyme. This is in contrast to most bacteria, in which PPCDC is fused with PPCS, forming a bi-functional protein.
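The "missing gene" bookkeeping at this stage can be written down in a few lines. This is a sketch under assumptions: the pathway order follows the enzyme names listed above, the candidate labels are placeholders rather than real gene identifiers, and the function name is hypothetical.

```python
# The five enzymatic steps named on the slide.
PATHWAY = ["PANK", "PPCS", "PPCDC", "PPAT", "DPCK"]


def missing_steps(pathway, assigned):
    """Return pathway steps that have no candidate gene assigned."""
    return [step for step in pathway if not assigned.get(step)]


# State after the PSI-BLAST searches (placeholder labels, not real gene IDs).
after_psi_blast = {
    "PANK": ["known gene"],
    "PPCDC": ["homolog 1"],
    "DPCK": ["homolog 1", "homolog 2"],
}
print(missing_steps(PATHWAY, after_psi_blast))  # ['PPCS', 'PPAT']
```

The remaining gaps, PPCS and PPAT, are exactly the steps that needed the additional genome-context and biochemical evidence described next.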



PANK PPCS PPCDC PPAT DPCK

•Among prokaryotic genomes, only streptococci and enterococci contain mono-functional PPCDC genes. Bacterial mono-functional PPCDC from these organisms shows the highest sequence similarity to human mono-functional PPCDC.

• In the same bacterial genomes, PPCS is also mono-functional and is found in the same operon with PPCDC. Using this unique mono-functional PPCS from Streptococcus pneumoniae, a candidate for human mono-functional PPCS was identified with a reliable similarity.

•There was marginal similarity between PPCS domains of bacterial bi-functional proteins and putative human mono-functional PPCS.





• Biochemical analysis in rat and pig suggested the existence of a non-dissociable complex, potentially a bi-functional fusion protein, of PPAT and DPCK.

• Based on the biochemical evidence of PPAT/DPCK fusion, additional searches in the human expressed sequence tag database were performed, revealing that the predicted human DPCK open reading frame was potentially 5’-truncated.





Source- Missing genes in metabolic pathways: a comparative genomics approach by Andrei Osterman and Ross Overbeek




The analysis of publicly available human genomic data allows us to establish the chromosomal localization of all genes encoding the final four steps of CoA biosynthesis. Most of them exist as single copies, such as:

a. PPCS on chromosome 1

b. PPCDC on chromosome 15

c. PPAT/DPCK on chromosome 17

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3190097/

Comparative proteome analysis of Helicobacter pylori clinical strains by two-dimensional gel electrophoresis

Objective: To investigate the pathogenic properties of Helicobacter pylori by comparing the proteome map of H. pylori clinical strains.


Introduction

• Gastric cancer is the fourth most common cancer and the second most common cause of cancer deaths.

• There is a multistep progression from pre-malignancy to invasive malignancy.

• Thus, early gastric cancer diagnosis for effective preventive strategies and therapeutics against gastric cancer is urgently required.

• Based on the results of epidemiological and clinical studies, the World Health Organization (WHO) declared H. pylori a definitive carcinogen in 1994.


• The evidence comes mainly from epidemiological studies supporting that the risk of developing stomach diseases is higher in persons infected with cag pathogenicity island (cagPAI)-positive H. pylori than in those infected with cagPAI-negative strains.

• Gastric epithelial cell damage may be a consequence of the inflammatory responses induced by H. pylori infection.


• cagA and iceA have been proposed as biomarkers that might predict the risk for symptomatic clinical outcomes.

• However, they cannot explain why H. pylori strains isolated from asymptomatic patients show the same frequency of expression of both CagA and VacA as strains isolated from patients with peptic ulcer or gastric cancer.

• Neither iceA nor the combination of iceA/vacA/cagA is helpful in predicting the clinical presentation of H. pylori infection. H. pylori strain-specific factors may influence the pathogenicity of different H. pylori isolates and the presentation of a clinical outcome.

• There should be other proteins/ factors cooperating with CagA and VacA to induce or promote the development of disease.


Necessity of proteome analysis

• The complete genome of strains provides sufficient genetic information for proteome analysis of H. pylori.

• The fact that 35 genes of H. pylori 11637 translate into 93 proteins suggests that H. pylori proteins undergo a high degree of post-translational modification.

• A comparative proteome map of H. pylori strains should be beneficial to investigate the pathogenic properties of these organisms.


Experiment

• Materials and methods: two wild-type H. pylori strains were used, YN8 (isolated from biopsy tissue of a gastric cancer patient) and YN14 (isolated from biopsy tissue of a gastritis and duodenal ulcer patient).

Experimental procedure:

• H. pylori and protein preparation

• Protein assay

• Two-dimensional gel electrophoresis

• In-gel digestion


• Quadrupole time-of-flight (Q-TOF) mass spectrometry analysis and database search

The peptide analysis was carried out as per protocols of the supplier (Bio-Rad).

Mass spectra were obtained using the Q-TOF mass spectrometer.

MS/MS data were searched against the NCBInr protein sequence database (http://www.ncbi.nlm.nih.gov) using Mascot (http://www.matrixscience.com).


Results

• The proteins of H. pylori YN8 and YN14 were initially separated on 2-DE gels and visualized by silver staining.

Figure: 2-DE gels of YN8 and YN14


• The protein spots were separated over the molecular weight (Mr) range of 10–200 kDa and the pI range of 3–10.

• The gel revealed prominent individual proteins with several protein “families” (most notably as clusters of bands).

• Although several main spots/clusters were found at the same position, some proteins visually varied in expression level.


Protein identification

• Because most expressed proteins were spread over the center of the pI 3–10 gel, we further performed 2-DE experiments over the range pI 5–8.

Figure: 2-DE gels (pI 5–8) of YN8 and YN14


• Q-TOF analysis was then performed for protein identification with statistical confidence.

• Seven of the nine protein spots were identified using the protein database (http://www.ncbi.nlm.nih.gov).


• Interestingly, the same amino acid sequence has a different protein definition depending on the strain; e.g., IVESDAITALIQR is defined as hydantoin utilization protein A in H. pylori HPKX and as a hypothetical protein in H. pylori B128.

• Two of the nine proteins are unknown.


Discussion

• H. pylori strains display a high interstrain genomic divergence. This high variation at the genomic level does not provide evidence of a functional protein difference between strains, because silent mutations happen naturally.

• Disordered proteins are considered as disease-initiating factors.

• Thus, authors focused on protein expression levels of H. pylori isolates.

• From this study, different H. pylori isolates have individual protein expression levels. The presence or absence of some protein spots on 2-DE was thought to be useful for H. pylori infection characterization.


• Disease-specific proteins were thought to be responsible for the clinical presentation induced by H. pylori infection. However, none of the seven identified proteins showed similarities with virulence factors.

• E.g. the Dsb family of redox proteins: interestingly, H. pylori isolated from gastric cancer showed a marked increase in DsbB-like protein compared to the strain isolated from gastritis. This suggests that a strain which produces much more redox protein when it colonizes human gastric mucosa may portend a higher risk for gastric cancer.


• Taylor (1992) speculated that H. pylori strains undergo genomic rearrangements to adapt to a new human host environment.

• H. pylori strains vary in the proteins they express or repress when they infect a human host, not only in terms of virulence proteins but also in terms of physiological proteins.


Inference

• Comparative analysis of proteins is, to date, a better method for finding a new disease-specific protein antigen.

• In this preliminary study, variation at the protein level between H. pylori isolated from patients with gastric cancer and from patients with gastritis was confirmed.

• This reveals completely unexpected complexity and diversity in protein expression.

Future directions

• Comprehensively identify the structural and functional components encoded in the human genome

• Develop a detailed understanding of the heritable variation in the human genome

• Understand evolutionary variation across species and the mechanisms underlying it

• Develop robust strategies for identifying the genetic contributions to disease and drug response

• Develop strategies to identify gene variants that contribute to good health and resistance to disease

• Develop genome and proteome approaches to detect illness and thus accelerate drug discovery.


Bibliography

• http://genome.cshlp.org/content/18/7/1133.short

• http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3190097/

• Complete Reconstitution of the Human Coenzyme A Biosynthetic Pathway via Comparative Genomics by Matthew Daugherty, Boris Polanuyer, Michael Farrell, Michael Scholle, Athanasios Lykidis, Valérie de Crécy-Lagard, and Andrei Osterman

• http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml







Questions?
