Proteogenomics
Kelly Ruggles, Ph.D. Proteomics Informatics
March 31, 2015
As the cost of high-throughput genome sequencing falls, whole-genome, exome and RNA sequencing can be readily obtained for most proteomics experiments.
In combination with mass spectrometry-based proteomics, sequencing can be used for:
1. Genome annotation
2. Studying the effect of genomic variation on the proteome
3. Biomarker identification
Proteogenomics: Intersection of proteomics and genomics
First published in 2004: “Proteogenomic mapping as a complementary method to perform genome annotation”
(Jaffe JD, Berg HC and Church GM), which mapped proteomic data onto the genome sequence to better annotate Mycoplasma pneumoniae
Renuse S, Chaerkady R and A Pandey, Proteomics. 11(4) 2011
Proteogenomic workflow High throughput shotgun MS/MS
Requires no knowledge of the peptides present; uses mass differences to determine the next amino acid (AA) in the peptide chain.
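The mass-difference idea in the line above can be sketched in a few lines of Python. This is only an illustration, not a real de novo sequencing algorithm: it assumes a clean, complete ladder of fragment masses from a single ion series, and the function name `read_ladder` and the example ladder are hypothetical. The residue masses are standard monoisotopic values; leucine and isoleucine have the same mass, so "L" here stands for either.

```python
# Monoisotopic residue masses in daltons (standard values; L and I are isobaric).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def read_ladder(peaks, tol=0.01):
    """Turn a sorted ladder of same-series fragment masses into residues.

    Each gap between neighbouring peaks is matched to the closest residue
    mass; gaps that match nothing within the tolerance are reported as '?'.
    """
    residues = []
    for lower, upper in zip(peaks, peaks[1:]):
        gap = upper - lower
        best = min(RESIDUE_MASS, key=lambda aa: abs(RESIDUE_MASS[aa] - gap))
        residues.append(best if abs(RESIDUE_MASS[best] - gap) <= tol else "?")
    return "".join(residues)

# Hypothetical ladder built from the prefix masses of the peptide SVATG.
ladder = [0.0]
for aa in "SVATG":
    ladder.append(ladder[-1] + RESIDUE_MASS[aa])
print(read_ladder(ladder))
```

Real spectra have missing and noisy peaks, which is one reason the approach is slow and limited to few peptides.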
Requirements for Proteogenomic Analysis
• DNA or RNA sequencing data
• High resolution MS/MS
• Informatics tools for proteogenomic database construction and protein searching
DNA and/or RNA Sequencing
Personalized Protein DB
MS/MS
Compare, score, test significance
Identified peptides and proteins
Sample
Informatics Tools
Proteogenomics
• In the past, computational algorithms were commonly used to predict and annotate genes.
– Many limitations
• With mass spectrometry we can
– Confirm existing gene models
– Correct gene models
– Identify novel genes and splice isoforms
Proteogenomics
1. Improving genome annotation
2. Sequencing driven database construction
3. Proteomic mapping to genomic coordinates
Genome Annotation
• Process of identifying and assigning function to genes
• Historically, identification of protein coding regions was completed using
– Comparative sequence similarity analysis
– ab initio gene prediction algorithms
– RNA transcript analysis
• Limitations associated with these methods in determining
– Gene start and stop sites
– Translation reading frames
– Short genes, overlapping genes
– Alternative splice boundaries
– Translated vs. transcribed genes
• Therefore, MS-based proteomics can be used to supplement sequence analysis for genome annotation
Protein Sequence Databases
• Identification of peptides from MS relies heavily on the quality of the protein sequence database (DB)
• DBs with missing peptide sequences will fail to identify the corresponding peptides
• DBs that are too large will have low sensitivity
• Ideal DB is complete and small, containing all proteins in the sample and no irrelevant sequences
Genome Sequence-based database for genome annotation
Reference protein DB
Compare, score, test significance
annotated peptides
6 frame translation of genome sequence
Compare, score, test significance
annotated + novel peptides
MS/MS spectrum (intensity vs. m/z)
A commonly used method is to search MS/MS spectra against a 6-frame translation of the genome, which removes bias from the established annotation
Creating 6-frame translation database
ATGAAAAGCCTCAGCCTACAGAAACTCTTTTAATATGCATCAGTCAGAATTTAAAAAAAAAATC
M K S L S L Q K L F * Y A S V R I * K K N
* K A S A Y R N S F N M H Q S E F K K K I
E K P Q P T E T L L I C I S Q N L K K K S
H F A E A * L F E K L I C * D S N L F F I
S F G * G V S V R K I H M L * F K F F F D
F L R L R C F S K * Y A D T L I * F F F G
Positive Strand
Negative Strand
Software:
• Peppy: creates the database + searches MS, Risk BA, et al. (2013)
• PIUS (Peptide Identification by Unbiased Searching): Costa et al., 2013
• MS-Dictionary: Kim et al., 2009
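As a minimal sketch of the 6-frame translation shown on this slide (not the method used by Peppy, PIUS or MS-Dictionary), the following Python translates the slide's example sequence in all six frames using the standard genetic code. The function names are illustrative. Stop codons are kept as `*`, and a trailing partial codon is dropped.

```python
from itertools import product

# Standard genetic code, built in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate complete codons only; a trailing partial codon is dropped."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def reverse_complement(dna):
    """Return the reverse complement of an upper-case A/C/G/T sequence."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def six_frame(dna):
    """Return {frame: protein} for frames +1..+3 and -1..-3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

genome = "ATGAAAAGCCTCAGCCTACAGAAACTCTTTTAATATGCATCAGTCAGAATTTAAAAAAAAAATC"
for frame, protein in six_frame(genome).items():
    print(frame, protein)
```

The slide prints the negative-strand frames left to right, aligned under the forward sequence. The sketch returns them in their own 5'→3' reading order, so they appear reversed relative to the slide. A search database would normally also split each frame at stop codons before writing it out.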
Genome Annotation Example 1: A. gambiae
Renuse S, Chaerkady R and A Pandey, Proteomics. 11(4) 2011
Peptides mapping to annotated 3’ UTR
Peptides mapping to novel exon within an existing gene
Genome Annotation Example 1: A. gambiae
Renuse S, Chaerkady R and A Pandey, Proteomics. 11(4) 2011
Peptides mapping to unannotated gene
related species
Genome Annotation Example 2: Correcting Misannotations
A. Establishes new transcriptional start location
B. Confirms ORF
C. Establishes intron-exon boundaries
D. Determines new reading frame for exons
E. Predicts novel coding region
F. Finds the end of a gene
G. Uses a related species to build on genomic annotation
RNA sequence-based database for alternative splicing identification
RNA-Seq junction DB
Compare, score, test significance
Identification of novel splice isoforms
MS/MS spectrum (intensity vs. m/z)
Annotation of organisms which lack genome sequencing
Compare, score, test significance
Identification of potential protein coding regions
Reference DB of related species
MS/MS spectrum (intensity vs. m/z)
De novo MS/MS sequencing
Proteogenomics: Genome Annotation Summary
• Confirms existing gene models
• Corrects existing gene models
– Intron-exon boundaries
– Reading frames
– Novel splice isoforms
– Novel exons
• Identifies novel genes
• Fusion protein identification
• Identify genomic polymorphisms
Proteogenomics
1. Improving genome annotation
2. Sequencing driven database construction
3. Proteomic mapping to genomic coordinates
Proteogenomic workflow
Before the advent of proteogenomics, variant protein analysis was laborious, often requiring de novo sequencing. De novo sequencing is very time-consuming, so only a very limited number of peptides can be sequenced.
DNA/RNA sequencing
Single nucleotide variant database for variant protein identification
Compare, score, test significance
Identification of variant proteins
MS/MS spectrum (intensity vs. m/z)
TCGAGAGCTGTCGAGAGCTGTCGAGAGCTGTCGAGAGCTGTCGAGAGCTGTCGATAGCTG
Exon 1
Variants predicted from genome sequencing
Reference protein DB
+ Variant DB
Creating variant sequence DB
VCF File Format
# Meta-information lines
Columns:
1. Chromosome
2. Position
3. ID (ex: dbSNP)
4. Reference base
5. Alternative allele
6. Quality score
7. Filter (PASS=passed filters)
8. Info (ex: SOMATIC, VALIDATED..)
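To make the column layout concrete, here is a small Python sketch that reads the eight fixed VCF columns listed above and keeps only records whose filter is PASS. It is an illustration only: the `Variant` type, the `parse_vcf` function and the two example records are hypothetical, and real pipelines would usually rely on an established VCF library instead.

```python
from typing import NamedTuple

class Variant(NamedTuple):
    chrom: str
    pos: int       # 1-based position, as in VCF
    id: str
    ref: str
    alt: str
    qual: str
    filter: str
    info: str

def parse_vcf(lines):
    """Yield Variant records from VCF text lines, skipping '#' header lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
        yield Variant(chrom, int(pos), vid, ref, alt, qual, flt, info)

# Hypothetical two-record file; the values are illustrative, not real calls.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1015\t.\tA\tC\t50\tPASS\tSOMATIC",
    "1\t2040\t.\tG\tT\t12\tLowQual\t.",
]
passed = [v for v in parse_vcf(example) if v.filter == "PASS"]
print(passed)
```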
Creating variant sequence DB
…GTATTGCAAAAATAAGATAGAATAAGAATAATTACGACAAGATTC…
……
…CTATTGCAAAAATACGATAGCATAAGAATAGTTACGACAAGATTC…
Add in variants within exon boundaries
In silico translation
EXON 1 EXON2
…LLQKYDSIRIVTTRF…
Variant DB
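The two steps on this slide, adding variants and then translating in silico, can be sketched as follows. The sequences are the fragments shown above, and the four substitutions are simply the positions at which the slide's two sequences differ. The offsets are relative to the fragment (0-based), not genomic coordinates, and the function names are illustrative rather than taken from any tool.

```python
from itertools import product

# Standard genetic code, built in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def apply_snvs(reference, snvs):
    """Apply (offset, ref_base, alt_base) substitutions to a fragment.

    Offsets are 0-based within the fragment; the reference base is checked
    so that a coordinate mistake fails loudly instead of silently.
    """
    seq = list(reference)
    for offset, ref_base, alt_base in snvs:
        if seq[offset] != ref_base:
            raise ValueError(f"reference mismatch at offset {offset}")
        seq[offset] = alt_base
    return "".join(seq)

def translate(dna):
    """Translate complete codons only; a trailing partial codon is dropped."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

reference = "GTATTGCAAAAATAAGATAGAATAAGAATAATTACGACAAGATTC"
snvs = [(0, "G", "C"), (14, "A", "C"), (20, "A", "C"), (30, "A", "G")]
variant = apply_snvs(reference, snvs)
print(variant)
print(translate(variant))
```

Translating the variant fragment gives the peptide shown on the slide, LLQKYDSIRIVTTRF.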
Splice junction database for novel exon, alternative splicing identification
Compare, score, test significance
Identification of novel splice proteins
MS/MS spectrum (intensity vs. m/z)
Intron/Exon boundaries from RNA sequencing
Reference protein DB
+ RNA-Seq junction DB
Exon 1 Exon 2 Exon 3
Alt. Splicing Novel Expression
Exon 1 Exon X Exon 2
Creating splice junction DB
BED File Format
Columns:
1. Chromosome
2. Chromosome Start
3. Chromosome End
4. Name
5. Score
6. Strand (+ or -)
7-9. Display info
10. # blocks (exons)
11. Size of blocks
12. Start of blocks
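As a rough illustration of how the block columns encode a splice junction, the sketch below takes one 12-column BED line with two blocks (the exon segments flanking a junction) and derives the intron coordinates between them. The function name and the example record are hypothetical, and the sketch assumes the usual BED convention of 0-based, half-open coordinates with block starts given relative to the chromosome start.

```python
def junction_from_bed12(line):
    """Return (chrom, intron_start, intron_end, strand) for a two-block
    BED12 junction line. Coordinates are 0-based, half-open, as in BED."""
    f = line.rstrip("\n").split("\t")
    chrom, chrom_start, strand = f[0], int(f[1]), f[5]
    block_count = int(f[9])
    # Block lists may carry a trailing comma, so strip it before splitting.
    block_sizes = [int(x) for x in f[10].rstrip(",").split(",")]
    block_starts = [int(x) for x in f[11].rstrip(",").split(",")]
    if block_count != 2:
        raise ValueError("expected exactly two blocks flanking one junction")
    intron_start = chrom_start + block_starts[0] + block_sizes[0]
    intron_end = chrom_start + block_starts[1]
    return chrom, intron_start, intron_end, strand

# Hypothetical record: 50 bp and 60 bp blocks flanking an intron on chr1.
record = "chr1\t1000\t2000\tJUNC1\t25\t+\t1000\t2000\t255,0,0\t2\t50,60\t0,940"
print(junction_from_bed12(record))
```

The flanking block sequences on either side of the intron are what would be joined and translated to produce a junction-spanning peptide entry.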
Creating splice junction DB
Junction bed file
Map to known intron/exon boundaries
Unannotated alternative splicing
One novel intron/exon boundary
Two novel intron/exon boundaries
Bed file with new gene mapping
Fusion protein identification
Compare, score, test significance
Identification of variant proteins
MS/MS spectrum (intensity vs. m/z)
Reference protein DB
+ Fusion Gene DB
Gene X Exon 1
Gene X Exon 2
Gene Y Exon 1
Gene Y Exon 2
Chr 1 Chr 2
Gene X Exon 1
Gene Y Exon 2
Fusion Genes
Fusion Location
…AGAACTGGAAGAATTGG*AATGGTAGATAACGCAGATCATCT…
Find consensus sequence
6 frame translation FASTA
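The last step, writing a 6-frame translation of the fusion junction as FASTA, might look like the sketch below. It uses only the sequence fragment shown on the slide, with `*` marking the fusion location, and does not cover the consensus-finding step. The header layout, the gene label "GeneX-GeneY" and the function names are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, built in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate complete codons only; a trailing partial codon is dropped."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def reverse_complement(dna):
    """Return the reverse complement of an upper-case A/C/G/T sequence."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def fusion_fasta(name, marked_seq):
    """Build FASTA text for all six frames of a fusion junction sequence.

    `marked_seq` uses a single '*' to mark the fusion location; the marker
    is removed before translation.
    """
    left, right = marked_seq.split("*")
    joined = left + right
    records = []
    for strand, seq in (("+", joined), ("-", reverse_complement(joined))):
        for offset in range(3):
            records.append(f">{name}|{strand}{offset + 1}\n{translate(seq[offset:])}")
    return "\n".join(records) + "\n"

fasta = fusion_fasta("GeneX-GeneY", "AGAACTGGAAGAATTGG*AATGGTAGATAACGCAGATCATCT")
print(fasta, end="")
```

Frames containing stop codons would normally be split or filtered before the file is used as a search database.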
Informatics tools for customized DB creation
• QUILTS: Perl/Python-based tool to generate DB from genomic and RNA sequencing data (Fenyo lab)
• customProDB: R package to generate DB from RNA-Seq data (Zhang B, et al.)
• Splice-graph database creation (Bafna V. et al.)
Proteogenomics and Human Disease: Genomic Heterogeneity
• Whole genome sequencing has uncovered millions of germline variants between individuals
• Genomic and proteomic studies typically use a reference database to model the general population, masking patient-specific variation
Nature October 28, 2010
Proteogenomics and Human Disease: Cancer Proteomics
Cancer is characterized by altered expression of tumor drivers and suppressors
• Results from gene mutations causing changes in protein expression, activity
• Can influence diagnosis, prognosis and treatment
Cancer proteomics
• Are genomic variants evident at the protein level?
• What is their effect on protein function?
• Can we classify tumors based on protein markers?
Tumor Specific Proteomic Variation
Stephens, et al. Complex landscape of somatic rearrangement in human breast cancer genomes.
Nature 2009
Nature April 15, 2010
Personalized Database for Protein Identification
MS/MS spectrum (intensity vs. m/z)
Protein DB
Compare, score, test significance
Somatic Variants
SVATGSSEAAGGASGGGARGQVAGTMKIEIAQYRDSGSYGQSGGEQQREETSDFAEPTTCITNNQHSEPRDPRFIKGWFCFIISAR…
Germline Variants
MQYAPNTQVEIIPQGRSSAEVIAQSRASSSIIINESEPTTNIQIRQRAQEAIIQISQAISIMETVKSSPVEFECINDKSPAPGMAIGSGR…
Identified peptides and proteins
Personalized Database for Protein Identification
MS/MS spectrum (intensity vs. m/z)
Tumor Specific Protein DB
Compare, score, test significance
+ tumor specific + patient specific peptides
RNA-Seq
Genome Sequencing
Identified peptides and proteins
Tumor Specific Protein Databases
Tumor Specific Protein DB
Non-Tumor Sample: Genome sequencing; identify germline variants
Reference Human Database (Ensembl)
Tumor Sample: Genome sequencing, RNA-Seq; identify alternative splicing, somatic variants and novel expression
TCGAGAGCTGTCGAGAGCTGTCGAGAGCTGTCGAGAGCTGTCGAGAGCTGTCGATAGCTG
Exon 1 Exon 2 Exon 3
Exon 1
Variants
Alt. Splicing Novel Expression
Exon 1 Exon X Exon 2
Fusion Genes
Gene X Exon 1
Gene X Exon 2
Gene Y Exon 1
Gene Y Exon 2
Gene X Gene Y
Proteogenomics and Biomarker Discovery
• Tumor-specific peptides identified by MS can be used as sensitive drug targets or diagnostic tools
– Fusion proteins
– Protein isoforms
– Variants
• Effects of genomic rearrangements on protein expression can elucidate cancer biology
Proteogenomics
1. Improving genome annotation
2. Sequencing driven database construction
3. Proteomic mapping to genomic coordinates
Proteogenomic mapping
• Map observed peptides back to their genomic location.
• Requires tools to convert proteomic location to genomic coordinates
• Use to determine:
– Exon location of peptides
– Proteotypic peptides
– Novel coding region
– Visualize in genome browsers (UCSC Genome Browser, Integrative Genomics Viewer (IGV))
– Quantitative comparison based on genomic location
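The conversion from a protein position to genomic coordinates can be sketched as below. This is a simplified illustration and not how any particular mapping tool works: it assumes a plus-strand gene, coding exons given as 0-based half-open intervals in transcript order, and a reading frame that starts at the first base of the first exon. The function name and the exon coordinates are hypothetical.

```python
def peptide_to_genomic(exons, pep_start, pep_len):
    """Map a peptide to genomic intervals on the + strand.

    exons: coding exon intervals (start, end), 0-based half-open, in
           transcript order, with the first base being codon position 1.
    pep_start: 0-based index of the peptide's first residue in the protein.
    Returns a list of (start, end) genomic intervals; more than one interval
    means the peptide spans an exon junction.
    """
    cds_start = pep_start * 3
    cds_end = cds_start + pep_len * 3
    intervals, consumed = [], 0
    for start, end in exons:
        length = end - start
        # Overlap between the peptide and this exon, in coding coordinates.
        lo = max(cds_start, consumed)
        hi = min(cds_end, consumed + length)
        if lo < hi:
            intervals.append((start + lo - consumed, start + hi - consumed))
        consumed += length
    return intervals

# Hypothetical gene with two coding exons of 30 bp and 60 bp.
exons = [(100, 130), (200, 260)]
print(peptide_to_genomic(exons, pep_start=8, pep_len=5))
```

Intervals in this form translate directly into the block columns of a BED line, which is what allows the peptides to be displayed as a track in a genome browser.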
Informatics tools for proteogenomic mapping
• PGx: Python-based tool, maps peptides back to genomic coordinates using a user-defined reference database (Fenyo lab)
• The Proteogenomic Mapping Tool: Java-based search of peptides against 6-reading frame sequence database (Sanders WS, et al).
PGx: Proteogenomic mapping tool
Peptides
Sample specific protein database
Peptides mapped onto genomic coordinates
Manor Askenazi, David Fenyo
Log Fold Change in Expression (10,000 bp bins)
Copy Number Variation
Methylation Status
Exon Expression (RNA-Seq)
Number of Genes/Bin
Peptides
Variant Peptide Mapping
SVATGSSEAAGGASGGGAR
SVATGSSETAGGASGGGAR
ACG->GCG
Peptides with single amino acid changes corresponding to germline and somatic variants
ENSEMBL Gene
Tumor Peptide
Reference Peptide
Novel Peptide Mapping
Peptides corresponding to RNA-Seq expression in non-coding regions
ENSEMBL Gene
Tumor Peptide
Tumor RNA-Seq
Proteogenomic integration
Maps genomic, transcriptomic and proteomic data to same coordinate system including quantitative information
Variants
Proteomic Quantitation
RNA-Seq Data
Proteomic Mapping
Predicted gene expression
Summary
• The integration of proteomics and genomics can improve our understanding of not only genomic annotation, but also of the functional protein products integral in biological processes.
• Proteogenomics is currently being used extensively in cancer discovery
– Genetic rearrangement differs between tumors
– Requires personalized database
– Can provide cancer specific proteins for biomarker development
• Proteogenomics will likely continue to grow, particularly in the identification of genomic abnormalities in disease
Questions?