Proteogenomics Kelly Ruggles, Ph.D. Proteomics Informatics March 31, 2015


As the cost of high-throughput genome sequencing goes down, whole-genome, exome and RNA sequencing can be readily obtained for most proteomics experiments.

In combination with mass spectrometry-based proteomics, sequencing can be used for:
1. Genome annotation
2. Studying the effect of genomic variation on the proteome
3. Biomarker identification

Proteogenomics: Intersection of proteomics and genomics


First published in 2004: “Proteogenomic mapping as a complementary method to perform genome annotation” (Jaffe JD, Berg HC and Church GM), using proteomic data together with genomic sequencing to better annotate Mycoplasma pneumoniae

Renuse S, Chaerkady R and A Pandey, Proteomics. 11(4) 2011

Proteogenomic workflow

High-throughput shotgun MS/MS

De novo sequencing requires no prior knowledge of the peptides present; it uses the mass difference between fragment peaks to determine the next amino acid (AA) in the peptide chain.
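The mass-difference idea can be sketched in a few lines of Python. This is a minimal illustration, not a real de novo sequencing tool: the function name `read_ladder`, the tolerance, and the example peak list (a hypothetical singly charged b-ion ladder) are assumptions made for this example, and real spectra also contain noise, missing peaks and several ion series.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
# Leucine and isoleucine have identical mass, so they are reported as "L/I".
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L/I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(peaks, tol=0.01):
    """Assign a residue to each gap between consecutive fragment peaks.

    peaks: m/z values of one ion series (assumed singly charged).
    tol:   allowed mass error in Da (an illustrative choice).
    Gaps that match no residue, or more than one, are reported as "?".
    """
    peaks = sorted(peaks)
    residues = []
    for low, high in zip(peaks, peaks[1:]):
        delta = high - low
        matches = [aa for aa, mass in RESIDUE_MASS.items()
                   if abs(mass - delta) <= tol]
        residues.append(matches[0] if len(matches) == 1 else "?")
    return residues


# Hypothetical b-ion ladder b1..b4; each gap corresponds to one residue.
example_peaks = [58.02874, 129.06585, 216.09788, 344.19284]
print(read_ladder(example_peaks))
```

Only the residues after the first peak can be read this way; the first residue requires knowing the mass of the ion terminus as well.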

Requirements for Proteogenomic Analysis

• DNA or RNA sequencing data
• High resolution MS/MS
• Informatics tools for proteogenomic database construction and protein searching

[Workflow figure: Sample → DNA and/or RNA sequencing → (informatics tools) → personalized protein DB; Sample → MS/MS; spectra and DB → compare, score, test significance (informatics tools) → identified peptides and proteins]

Proteogenomics

• In the past, computational algorithms were commonly used to predict and annotate genes.
– Many limitations
• With mass spectrometry we can
– Confirm existing gene models
– Correct gene models
– Identify novel genes and splice isoforms

Proteogenomics

1. Improving genome annotation
2. Sequencing driven database construction
3. Proteomic mapping to genomic coordinates



Genome Annotation

• Process of identifying and assigning function to genes
• Historically, identification of protein coding regions was completed using
– Comparative sequence similarity analysis
– ab initio gene prediction algorithms
– RNA transcript analysis
• Limitations associated with these methods in determining
– Gene start and stop sites
– Translation reading frames
– Short genes, overlapping genes
– Alternative splice boundaries
– Translated vs. transcribed genes

• Therefore, MS-based proteomics can be used to supplement sequence analysis for genome annotation

Protein Sequence Databases

• Identification of peptides from MS relies heavily on the quality of the protein sequence database (DB)

• DBs with missing peptide sequences will fail to identify the corresponding peptides

• DBs that are too large will have low sensitivity

• Ideal DB is complete and small, containing all proteins in the sample and no irrelevant sequences

Genome Sequence-based database for genome annotation

[Figure: MS/MS spectra (intensity vs. m/z) are compared, scored and tested for significance against a reference protein DB (yielding annotated peptides) or against a 6 frame translation of the genome sequence (yielding annotated + novel peptides)]

A commonly used method is to search the MS/MS spectra against a 6-frame translation of the genome, which removes the bias introduced by the established annotation.

Creating 6-frame translation database

ATGAAAAGCCTCAGCCTACAGAAACTCTTTTAATATGCATCAGTCAGAATTTAAAAAAAAAATC

Positive Strand

M K S L S L Q K L F * Y A S V R I * K K N

* K A S A Y R N S F N M H Q S E F K K K I

E K P Q P T E T L L I C I S Q N L K K K S

Negative Strand (rows are aligned under the positive strand, so they read right to left)

H F A E A * L F E K L I C * D S N L F F I

S F G * G V S V R K I H M L * F K F F F D

F L R L R C F S K * Y A D T L I * F F F G
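The six reading frames above can be reproduced with a short Python sketch. This is a minimal illustration using the standard genetic code, not the method of any of the tools named in this deck; the function names are assumptions made for this example, and the input is assumed to contain only A, C, G and T.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; stop codons become '*'.

    A trailing incomplete codon is ignored.
    """
    return "".join(CODON_TABLE[seq[i:i + 3]]
                   for i in range(0, len(seq) - 2, 3))


def six_frame(seq):
    """Return the translations of the three forward and three reverse frames."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


genome = "ATGAAAAGCCTCAGCCTACAGAAACTCTTTTAATATGCATCAGTCAGAATTTAAAAAAAAAATC"
for name, protein in six_frame(genome).items():
    print(name, protein)
```

The reverse-strand frames are printed N-terminus first, that is, in the opposite direction to the aligned negative-strand rows shown above. For a search database, each frame would typically be split at the stop codons ('*') into candidate ORFs before being written to FASTA.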

Software:
• Peppy: creates the database + searches MS, Risk BA, et al. (2013)
• PIUS (Peptide Identification by Unbiased Searching): Costa et al., 2013
• MS-Dictionary: Kim et al., 2009

Genome Annotation Example 1: A. gambiae


Peptides mapping to annotated 3’ UTR

Peptides mapping to novel exon within an existing gene



Peptides mapping to unannotated gene

related species

Genome Annotation Example 2: Correcting Mis-annotations

A. Establishes new transcriptional start location
B. Confirms ORF
C. Establishes intron-exon boundaries
D. Determines new reading frame for exons
E. Predicts novel coding region
F. Finds the end of a gene
G. Uses a related species to build on genomic annotation

RNA Sequence-based database for alternative splicing identification

[Figure: MS/MS spectra (intensity vs. m/z) searched against an RNA-Seq junction DB → identification of novel splice isoforms]

Annotation of organisms which lack genome sequencing


[Figure: MS/MS spectra (intensity vs. m/z) → de novo MS/MS sequencing, compared with a reference DB of related species → identification of potential protein coding regions]

Proteogenomics: Genome Annotation Summary

• Confirms existing gene models
• Corrects existing gene models
– Intron-exon boundaries
– Reading frames
– Novel splice isoforms
– Novel exons
• Identifies novel genes
• Fusion protein identification
• Identify genomic polymorphisms



Proteogenomic workflow

Before the advent of proteogenomics, variant protein analysis was laborious and often required de novo sequencing, which is very time-consuming, so only a very limited number of peptides could be sequenced.

DNA/RNA sequencing

Single nucleotide variant database for variant protein identification


[Figure: variants predicted from genome sequencing, e.g. within Exon 1:
TCGAGAGCTGTCGAGAGCTGTCGAGAGCTGTCGAGAGCTGTCGAGAGCTGTCGATAGCTG
are added to the protein DB (+ Variant DB); MS/MS spectra (intensity vs. m/z) searched against it → identification of variant proteins]

Creating variant sequence DB: VCF File Format

# Meta-information lines

Columns:
1. Chromosome
2. Position
3. ID (ex: dbSNP)
4. Reference base
5. Alternative allele
6. Quality score
7. Filter (PASS=passed filters)
8. Info (ex: SOMATIC, VALIDATED..)
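The eight columns listed above can be read with a few lines of Python. This is a minimal sketch that assumes tab-separated VCF text and ignores any genotype columns after column 8; the names `Variant` and `parse_vcf` and the example records are made up for illustration.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int          # 1-based position, as in the VCF format
    var_id: str       # e.g. a dbSNP identifier, or "." if none
    ref: str
    alts: List[str]   # one or more alternative alleles
    qual: str
    filt: str         # "PASS" when the call passed all filters
    info: str


def parse_vcf(lines, passed_only=True):
    """Parse VCF text lines into Variant records.

    Lines starting with '#' (meta-information and the column header)
    are skipped. With passed_only=True, records whose FILTER column is
    not 'PASS' are dropped.
    """
    variants = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        chrom, pos, var_id, ref, alt, qual, filt, info = line.split("\t")[:8]
        if passed_only and filt != "PASS":
            continue
        variants.append(Variant(chrom, int(pos), var_id, ref,
                                alt.split(","), qual, filt, info))
    return variants


# Hypothetical records, for illustration only.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\t.\tA\tG\t50\tPASS\tSOMATIC",
    "1\t22222\t.\tC\tT,G\t10\tq10\t.",
]
for variant in parse_vcf(example):
    print(variant.chrom, variant.pos, variant.ref, variant.alts)
```

Filtering on PASS before building the variant DB keeps low-confidence calls from inflating the search space.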

Creating variant sequence DB

…GTATTGCAAAAATAAGATAGAATAAGAATAATTACGACAAGATTC…

Add in variants within exon boundaries:

…CTATTGCAAAAATACGATAGCATAAGAATAGTTACGACAAGATTC…
(EXON 1, EXON 2)

In silico translation:

…LLQKYDSIRIVTTRF…

→ Variant DB
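The two steps shown on this slide (adding variants, then in silico translation) can be sketched in Python as follows. The sketch assumes that the first nucleotide sequence above is the reference and the second is the same region with variants added; the four substitutions were obtained by comparing those two sequences position by position. The function names and the use of 0-based offsets are choices made for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def apply_snvs(seq, snvs):
    """Apply single nucleotide variants to a sequence.

    snvs: iterable of (offset, ref_base, alt_base), offset 0-based
    relative to the start of seq. The reference base is checked so that
    a coordinate mix-up fails loudly instead of silently.
    """
    bases = list(seq)
    for offset, ref, alt in snvs:
        if bases[offset] != ref:
            raise ValueError(f"expected {ref} at offset {offset}, "
                             f"found {bases[offset]}")
        bases[offset] = alt
    return "".join(bases)


def translate(seq):
    """Translate complete codons starting at the first base."""
    return "".join(CODON_TABLE[seq[i:i + 3]]
                   for i in range(0, len(seq) - 2, 3))


reference = "GTATTGCAAAAATAAGATAGAATAAGAATAATTACGACAAGATTC"
snvs = [(0, "G", "C"), (14, "A", "C"), (20, "A", "C"), (30, "A", "G")]

variant_seq = apply_snvs(reference, snvs)
print(variant_seq)
print(translate(variant_seq))
```

In practice the offsets would come from VCF positions converted to exon-relative coordinates, and the reading frame from the gene annotation rather than being assumed to start at the first base.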

Splice junction database for novel exon, alternative splicing identification


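[Figure: intron/exon boundaries from RNA sequencing are added to the protein DB (+ RNA-Seq junction DB); MS/MS spectra (intensity vs. m/z) searched against it → identification of novel splice proteins]

[Figure panels: annotated gene (Exon 1, Exon 2, Exon 3); Alt. Splicing; Novel Expression (Exon 1, Exon X, Exon 2)]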

Creating splice junction DB: BED File Format

Columns:
1. Chromosome
2. Chromosome Start
3. Chromosome End
4. Name
5. Score
6. Strand (+ or -)
7-9. Display info
10. # blocks (exons)
11. Size of blocks
12. Start of blocks
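A 12-column BED line can be turned into exon blocks and the introns between them with a short Python sketch. BED coordinates are 0-based and half-open, and block starts are relative to the chromosome start column. The function names and the example line are assumptions made for illustration.

```python
def parse_bed12(line):
    """Parse one 12-column BED line into a dict with absolute exon blocks."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name, score, strand = fields[3], fields[4], fields[5]
    block_count = int(fields[9])
    # Block lists are comma-separated and may carry a trailing comma.
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    # Convert block starts (relative to chromStart) to genomic coordinates.
    blocks = [(start + rel, start + rel + size)
              for rel, size in zip(starts[:block_count], sizes[:block_count])]
    return {"chrom": chrom, "start": start, "end": end, "name": name,
            "score": score, "strand": strand, "blocks": blocks}


def introns(blocks):
    """Return the gaps between consecutive blocks, i.e. the spliced introns."""
    return [(left_end, right_start)
            for (_, left_end), (right_start, _) in zip(blocks, blocks[1:])]


# Hypothetical junction record: two blocks flanking one intron.
example = "chr1\t1000\t2000\tJUNC1\t12\t+\t1000\t2000\t255,0,0\t2\t50,60\t0,940"
record = parse_bed12(example)
print(record["blocks"])
print(introns(record["blocks"]))
```

For a junction BED file, each record typically has two blocks, and the single intron between them is the splice junction supported by the RNA-Seq reads.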

Creating splice junction DB

Junction bed file → Map to known intron/exon boundaries
– Unannotated alternative splicing
– One novel intron/exon boundary
– Two novel intron/exon boundaries
→ Bed file with new gene mapping
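One way to sort junctions into the three categories above is sketched below in Python. It is only an interpretation, not the procedure of any tool named in this deck. It assumes that "unannotated alternative splicing" means both boundaries are annotated but never joined to each other, and the function name, coordinate convention and example junctions are hypothetical.

```python
def classify_junction(donor, acceptor, known_junctions):
    """Classify an RNA-Seq junction against annotated junctions.

    donor, acceptor: genomic coordinates of the two intron boundaries.
    known_junctions: set of (donor, acceptor) pairs from the annotation.
    """
    if (donor, acceptor) in known_junctions:
        return "known junction"
    known_donors = {d for d, _ in known_junctions}
    known_acceptors = {a for _, a in known_junctions}
    # Booleans add up as integers: count how many boundaries are novel.
    novel = (donor not in known_donors) + (acceptor not in known_acceptors)
    if novel == 0:
        return "unannotated alternative splicing"
    if novel == 1:
        return "one novel intron/exon boundary"
    return "two novel intron/exon boundaries"


# Hypothetical annotation with two introns: exon1-exon2 and exon2-exon3.
annotated = {(1050, 1940), (2000, 2500)}
print(classify_junction(1050, 1940, annotated))  # annotated junction
print(classify_junction(1050, 2500, annotated))  # exon 2 skipped
print(classify_junction(1050, 1800, annotated))  # new acceptor site
print(classify_junction(3000, 3400, annotated))  # neither boundary annotated
```

A real implementation would restrict the comparison to junctions on the same chromosome and strand.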

Fusion protein identification


[Figure: MS/MS spectra (intensity vs. m/z) searched against the protein DB + Fusion Gene DB → identification of variant proteins]

[Figure: Gene X (Exon 1, Exon 2) on Chr 1 and Gene Y (Exon 1, Exon 2) on Chr 2 → fusion of Gene X Exon 1 with Gene Y Exon 2]

Fusion Genes → Fusion Location → Find consensus sequence → 6 frame translation → FASTA

.…AGAACTGGAAGAATTGG*AATGGTAGATAACGCAGATCATCT..…

Informatics tools for customized DB creation

• QUILTS: perl/python based tool to generate DB from genomic and RNA sequencing data (Fenyo lab)

• customProDB: R package to generate DB from RNA-Seq data (Zhang B, et al.)

• Splice-graph database creation (Bafna V. et al.)

Proteogenomics and Human Disease: Genomic Heterogeneity

• Whole genome sequencing has uncovered millions of germline variants between individuals

• Genomic and proteomic studies typically use a reference database to model the general population, masking patient-specific variation

Nature October 28, 2010

Proteogenomics and Human Disease: Cancer Proteomics

Cancer is characterized by altered expression of tumor drivers and suppressors
• Results from gene mutations causing changes in protein expression, activity
• Can influence diagnosis, prognosis and treatment

Cancer proteomics
• Are genomic variants evident at the protein level?
• What is their effect on protein function?
• Can we classify tumors based on protein markers?

Tumor Specific Proteomic Variation

Stephens, et al. Complex landscape of somatic rearrangement in human breast cancer genomes.

Nature 2009

Nature April 15, 2010

Personalized Database for Protein Identification

[Figure: MS/MS spectra (intensity vs. m/z) searched against a protein DB]

Somatic Variants
SVATGSSEAAGGASGGGARGQVAGTMKIEIAQYRDSGSYGQSGGEQQREETSDFAEPTTCITNNQHSEPRDPRFIKGWFCFIISAR….

Germline Variants
MQYAPNTQVEIIPQGRSSAEVIAQSRASSSIIINESEPTTNIQIRQRAQEAIIQISQAISIMETVKSSPVEFECINDKSPAPGMAIGSGR…


[Figure: RNA-Seq and genome sequencing → tumor specific protein DB; MS/MS spectra (intensity vs. m/z) searched against it → + tumor specific + patient specific peptides]


Tumor Specific Protein Databases

Non-tumor sample → genome sequencing → identify germline variants
Tumor sample → genome sequencing, RNA-Seq → identify alternative splicing, somatic variants and novel expression
Reference Human Database (Ensembl) + the above → Tumor Specific Protein DB

[Figure panels: Variants (Exon 1: TCGAGAGCTGTCGAGAGCTGTCGAGAGCTGTCGAGAGCTGTCGAGAGCTGTCGATAGCTG); annotated gene (Exon 1, Exon 2, Exon 3); Alt. Splicing; Novel Expression (Exon 1, Exon X, Exon 2); Fusion Genes (Gene X Exon 1, Exon 2 and Gene Y Exon 1, Exon 2 → Gene X + Gene Y)]

Proteogenomics and Biomarker Discovery

• Tumor-specific peptides identified by MS can be used as sensitive drug targets or diagnostic tools
– Fusion proteins
– Protein isoforms
– Variants

• Effects of genomic rearrangements on protein expression can elucidate cancer biology



Proteogenomic mapping

• Map back observed peptides to their genomic location.
• Requires tools to convert proteomic location to genomic coordinates
• Use to determine:
– Exon location of peptides
– Proteotypic
– Novel coding region
– Visualize in genome browsers (UCSC Genome Browser, Integrative Genomics Viewer (IGV))
– Quantitative comparison based on genomic location
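The conversion from a position in a protein to genomic coordinates can be sketched as below. This is a minimal Python illustration, not the algorithm of PGx or any other tool: it assumes the coding exons of the protein are known as 0-based, half-open genomic intervals and that translation starts at the first coding base. The function name and example coordinates are hypothetical.

```python
def peptide_to_genome(pep_start, pep_len, cds_exons, strand="+"):
    """Map a peptide to genomic intervals.

    pep_start: 0-based index of the peptide's first residue in the protein.
    pep_len:   peptide length in residues.
    cds_exons: list of (start, end) coding intervals, 0-based half-open.
    Returns a sorted list of genomic (start, end) intervals; a peptide
    that spans a splice junction yields more than one interval.
    """
    nt_start = 3 * pep_start            # first coding base of the peptide
    nt_end = nt_start + 3 * pep_len     # one past its last coding base
    exons = sorted(cds_exons)
    if strand == "-":
        exons = exons[::-1]             # transcript order on the minus strand
    intervals = []
    consumed = 0                        # coding bases covered by earlier exons
    for ex_start, ex_end in exons:
        ex_len = ex_end - ex_start
        lo = max(nt_start, consumed)
        hi = min(nt_end, consumed + ex_len)
        if lo < hi:                     # the peptide overlaps this exon
            if strand == "+":
                intervals.append((ex_start + lo - consumed,
                                  ex_start + hi - consumed))
            else:
                intervals.append((ex_end - (hi - consumed),
                                  ex_end - (lo - consumed)))
        consumed += ex_len
    return sorted(intervals)


# Hypothetical gene with two coding exons (30 nt and 60 nt).
exons = [(100, 130), (200, 260)]
print(peptide_to_genome(8, 4, exons, "+"))   # spans the exon 1 / exon 2 junction
print(peptide_to_genome(8, 4, exons, "-"))
```

The resulting intervals can be written out as BED blocks so that the peptide is displayed next to gene models in a genome browser.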

Informatics tools for proteogenomic mapping

• PGx: python-based tool, maps peptides back to genomic coordinates using user defined reference database (Fenyo lab)

• The Proteogenomic Mapping Tool: Java-based search of peptides against 6-reading frame sequence database (Sanders WS, et al).

PGx: Proteogenomic mapping tool

Peptides + sample specific protein database → peptides mapped onto genomic coordinates

Manor Askenazi, David Fenyo

[Figure tracks: Log Fold Change in Expression (10,000 bp bins); Copy Number Variation; Methylation Status; Exon Expression (RNA-Seq); Number of Genes/Bin; Peptides]

Variant Peptide Mapping

SVATGSSEAAGGASGGGAR

SVATGSSETAGGASGGGAR

ACG->GCG

Peptides with single amino acid changes corresponding to germline and somatic variants

ENSEMBL Gene

Tumor Peptide

Reference Peptide

Novel Peptide Mapping

Peptides corresponding to RNA-Seq expression in non-coding regions

ENSEMBL Gene

Tumor Peptide

Tumor RNA-Seq

Proteogenomic integration

Maps genomic, transcriptomic and proteomic data to same coordinate system including quantitative information

[Figure tracks: Variants; Proteomic Quantitation; RNA-Seq Data; Proteomic Mapping; Predicted gene expression]

Summary

• The integration of proteomics and genomics can improve our understanding of not only genomic annotation, but also of the functional protein products integral in biological processes.

• Proteogenomics is currently being used extensively in cancer discovery
– Genetic rearrangement differs between tumors
– Requires personalized database
– Can provide cancer specific proteins for biomarker development

• Proteogenomics will likely continue to grow, particularly in the identification of genomic abnormalities in disease

Questions?
