Finding the Lost Treasure of NGS Data Yan Guo, PhD


Page 1: Finding the Lost Treasure of NGS Data

Finding the Lost Treasure of NGS Data

Yan Guo, PhD

Page 4: Finding the Lost Treasure of NGS Data

Modules Overview for DNA-sequence Exome / whole-Genome

[Workflow diagram; box labels grouped by stage]

• Input and QC: fastq files, FastQC
• Alignment: bwa alignment, BAM files, bamQC
• GATK refinement: realignment, recalibration, mark duplicates; dbSNP / indel resources
• Variant calls: SNP/INDEL VCF files, best-practice filter
• Somatic mutation
• Structural variant analysis: translocation, inversion, copy number variants
• Gene-level analysis: gene coding changes, gene associates

Page 5: Finding the Lost Treasure of NGS Data

RNAseq

[Workflow diagram; box labels grouped by stage]

• Input and QC: fastq files, FastQC
• Alignment: tophat alignment, BAM files, SeQC
• Refinement: cufflinks annotations, cuffmerge, cuffdiff comparisons
• Gene quantification: cufflinks annotations, cuffdiff comparisons
• Gene-fusion analysis
• Downstream: identifying genes, novel gene discovery, gene list, cluster, functional/pathway

Page 6: Finding the Lost Treasure of NGS Data

What do you expect to find in NGS data?

DNAseq
• SNPs
• Somatic mutations
• Small indels
• Large structural changes
• CNV

RNAseq
• Gene expression differences
• Splicing variants
• Fusion genes

Page 7: Finding the Lost Treasure of NGS Data

What don't you expect to find in NGS data?

[Decision tree]

Exome sequencing reads
• Is mapped? Yes: mapped reads
    • Is targeted? Yes: targeted DNA
    • Is targeted? No: untargeted DNA (intronic DNA, intergenic DNA, mitochondrial DNA)
• Is mapped? No: unmapped DNA reads (virus/microbe DNA, contamination)
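The mapped/targeted decision tree on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the presenter's pipeline: the names `classify_read` and `in_targets`, the layout of `targets` (per-chromosome lists of sorted, non-overlapping, 0-based half-open intervals), and the mitochondrial contig names `chrM` / `MT` are all hypothetical. Only the leftmost aligned position of the read is tested against the capture targets. The one outside fact relied on is that bit 0x4 of the SAM FLAG field marks an unmapped read.

```python
from bisect import bisect_right

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def in_targets(chrom, pos, targets):
    """True if 0-based position `pos` falls inside a capture interval.

    `targets` maps chromosome name -> sorted list of non-overlapping
    (start, end) tuples, 0-based and half-open (assumed layout).
    """
    intervals = targets.get(chrom, [])
    # Index of the last interval whose start is <= pos.
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]


def classify_read(flag, chrom, pos, targets, mito_names=("chrM", "MT")):
    """Follow the slide's tree: Is mapped? -> Is targeted?

    Returns 'unmapped', 'targeted', 'mitochondrial' or 'untargeted'.
    'mitochondrial' is reported separately here, although the slide
    lists it as one kind of untargeted DNA.
    """
    if flag & UNMAPPED_FLAG:
        return "unmapped"       # candidates: virus/microbe DNA, contamination
    if in_targets(chrom, pos, targets):
        return "targeted"
    if chrom in mito_names:
        return "mitochondrial"
    return "untargeted"         # intronic or intergenic DNA


if __name__ == "__main__":
    demo_targets = {"chr1": [(100, 200), (500, 600)]}  # made-up intervals
    for read in [(4, "*", 0), (0, "chr1", 150), (16, "chrM", 10), (0, "chr2", 150)]:
        print(read, classify_read(*read, demo_targets))
```

Checking the capture targets before the mitochondrial contig is a design choice: if a kit happens to include mitochondrial baits, those reads count as targeted rather than as incidental.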

Page 8: Finding the Lost Treasure of NGS Data

Exome Capture: TruSeq sa306744

Page 10: Finding the Lost Treasure of NGS Data

Why do we care about intronic and intergenic regions?

• Some introns can encode specific proteins and can be processed after splicing to form noncoding RNA molecules (Rearick, Prakash et al. 2011).

• The majority of GWAS SNPs are not in coding regions (706 exonic, 3986 intronic, 3323 intergenic).

• The ENCODE Project: ENCyclopedia Of DNA Elements

Page 11: Finding the Lost Treasure of NGS Data

GWAS catalog SNPs

Kit                Target total bases   Missing exon SNPs   Missing intron SNPs   Missing intergenic SNPs
SureSelect (v2)    37627747             387                 3946                  3323
TruSeq             62085286             206                 3980                  3320
SeqCap EZ (v3.0)   64190747             326                 3880                  3317
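A short worked calculation makes the table easier to read. The sketch below assumes that the "missing" counts refer to the same GWAS catalog SNP set as the totals quoted on the previous slide (706 exon, 3986 intron, 3323 intergenic); the deck does not state this explicitly. The names `TOTAL`, `MISSING`, `captured` and `captured_fraction` are introduced here for illustration only.

```python
# GWAS catalog SNP totals by region, as quoted on the previous slide.
TOTAL = {"exon": 706, "intron": 3986, "intergenic": 3323}

# SNPs NOT covered by each kit's target regions, copied from the table.
MISSING = {
    "SureSelect (v2)":  {"exon": 387, "intron": 3946, "intergenic": 3323},
    "TruSeq":           {"exon": 206, "intron": 3980, "intergenic": 3320},
    "SeqCap EZ (v3.0)": {"exon": 326, "intron": 3880, "intergenic": 3317},
}


def captured(kit):
    """SNPs inside the kit's targets per region (total minus missing)."""
    return {region: TOTAL[region] - MISSING[kit][region] for region in TOTAL}


def captured_fraction(kit):
    """Fraction of all GWAS catalog SNPs that fall inside the kit's targets."""
    return sum(captured(kit).values()) / sum(TOTAL.values())


if __name__ == "__main__":
    for kit in MISSING:
        print(f"{kit}: {captured(kit)} ({captured_fraction(kit):.1%} of all SNPs)")
```

Under that assumption, every kit misses a large share of even the exonic GWAS SNPs and nearly all of the intronic and intergenic ones, which is why reads outside the capture targets are worth examining.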

Page 16: Finding the Lost Treasure of NGS Data

                                                                              Exonic
Samples          Average depth   Intronic   Splicing(1)   ncRNA(2)   Intergenic   Non-synonymous   Stopgain   Stoploss
Agilent (N=22)   ≥ 2             21741      48            9129       91480        1431             38         6
                 ≥ 5             7362       39            5794       44269        1142             29         5
                 ≥ 10            4766       37            4393       28673        892              19         4
1000G (N=6)      ≥ 2             4561       19            648        4658         491              10         1
                 ≥ 5             2784       12            360        2815         337              6          1
                 ≥ 10            1419       9             194        1624         233              5          1
Illumina (N=6)   ≥ 2             6114       0             985        9659         25               0          0
                 ≥ 5             2408       0             501        5344         0                0          0
                 ≥ 10            1058       0             327        3498         0                0          0

(1) Variant is within 2 bp of a splicing junction.
(2) Variant overlaps a transcript without coding annotation in the gene definition.

Page 18: Finding the Lost Treasure of NGS Data

Mitochondria

• Mitochondria play an important role in cellular energy metabolism, free radical generation, and apoptosis (Andrews, Kubacka et al. 1999; Verma and Kumar 2007).

• Mitochondrial DNA (mtDNA) is a maternally inherited, 16,569-bp closed-circle genome that encodes two rRNAs, 22 tRNAs, and 13 polypeptides.

• Mitochondrial dysfunction is an important cause of many neurological diseases (Fernandez-Vizarra, Bugiani et al. 2007) and drug toxicities (Lemasters, Qian et al. 1999; Wallace and Starkov 2000), and may contribute to carcinogenesis and tumor progression (Modica-Napolitano and Singh 2004; Chen 2012).

Page 20: Finding the Lost Treasure of NGS Data

Mitochondria Extraction Strategy

Page 21: Finding the Lost Treasure of NGS Data

Results

Page 22: Finding the Lost Treasure of NGS Data

EXAMPLE

Page 23: Finding the Lost Treasure of NGS Data

Virus

• Known oncogenic viruses are estimated to cause 15 to 20 percent of all cancers in humans (Parkin 2006).

• Understanding the viral integration pattern of cancer-associated viruses may uncover novel oncogenes and tumor suppressors that are associated with cellular transformation.

• Viral genomes have been detected using off-target exome sequencing reads (Barzon, Lavezzo et al. 2011; Li and Delwart 2011; Chevaliez, Rodriguez et al. 2012; Radford, Chapman et al. 2012; Capobianchi, Giombini et al. 2013).

Page 24: Finding the Lost Treasure of NGS Data

One example using HNSCC

Page 25: Finding the Lost Treasure of NGS Data

Virus Detection in HNSCC in TCGA

Site            clin_hpv_ish   clin_hpv_p16   ExomeSeq   low_pass   RNAseq   HPV
Buccal Mucosa   0              0              0          0          0        0
Buccal Mucosa   0              0              0          0          0        0
Buccal Mucosa   0              0              0          0          0        0
Buccal Mucosa   0              0              0          0          0        0
Buccal Mucosa   0              0              0          0          0        0
Buccal Mucosa   0              0              0          0          0        0
Buccal Mucosa   0              0              0          0          0        0
Buccal Mucosa   0              0              0          0          0        0
Oropharynx      0              0              1          0          0        1
Oropharynx      0              0              0          0          0        0
Oropharynx      0              0              0          0          0        0
Tonsil          1              1              1          0          1        4
Tonsil          1              1              1          0          1        4
Tonsil          1              1              1          0          1        4
Tonsil          0              0              1          1          1        4
Tonsil          0              0              1          1          1        4
Tonsil          0              1              1          0          1        3
Tonsil          1              0              1          0          1        3
Tonsil          0              0              0          1          1        3
Tonsil          0              0              1          0          1        2
Tonsil          0              0              1          0          1        2
Tonsil          0              0              1          0          1        2
Tonsil          0              0              1          0          1        2
Tonsil          0              0              1          0          1        2
Tonsil          0              0              1          0          1        2
Tonsil          0              0              1          0          0        1
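The table on this slide can be tallied per assay and per tumor site with a few lines of Python. This is an illustrative sketch only: it assumes that 1 means HPV was detected by that assay and 0 means it was not, which the slide does not spell out. The final HPV column is left out because its derivation is not stated. Identical rows are stored once with a repeat count, and the names `ASSAYS`, `ROWS`, `positives_per_assay` and `positives_by_site` are introduced here.

```python
from collections import Counter

ASSAYS = ("clin_hpv_ish", "clin_hpv_p16", "ExomeSeq", "low_pass", "RNAseq")

# (site, calls in ASSAYS order, number of identical rows in the table)
ROWS = [
    ("Buccal Mucosa", (0, 0, 0, 0, 0), 8),
    ("Oropharynx",    (0, 0, 1, 0, 0), 1),
    ("Oropharynx",    (0, 0, 0, 0, 0), 2),
    ("Tonsil",        (1, 1, 1, 0, 1), 3),
    ("Tonsil",        (0, 0, 1, 1, 1), 2),
    ("Tonsil",        (0, 1, 1, 0, 1), 1),
    ("Tonsil",        (1, 0, 1, 0, 1), 1),
    ("Tonsil",        (0, 0, 0, 1, 1), 1),
    ("Tonsil",        (0, 0, 1, 0, 1), 6),
    ("Tonsil",        (0, 0, 1, 0, 0), 1),
]


def positives_per_assay(rows):
    """Number of samples called positive (assumed: value 1) by each assay."""
    totals = Counter()
    for _site, calls, n in rows:
        for assay, call in zip(ASSAYS, calls):
            totals[assay] += call * n
    return dict(totals)


def positives_by_site(rows, assay):
    """Positive calls for one assay, broken down by tumor site."""
    idx = ASSAYS.index(assay)
    by_site = Counter()
    for site, calls, n in rows:
        by_site[site] += calls[idx] * n
    return dict(by_site)


if __name__ == "__main__":
    print("samples:", sum(n for _, _, n in ROWS))
    print(positives_per_assay(ROWS))
    print(positives_by_site(ROWS, "ExomeSeq"))
```

Read this way, the exome and RNAseq data flag more samples than either clinical test, and almost all of the positive calls are in tonsil tumors.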

Page 26: Finding the Lost Treasure of NGS Data

Existing Tools

• PathSeq (Kostic, Ojesina et al. 2011)
• VirusSeq (Chen, Yao et al. 2012)
• ViralFusionSeq (Li, Wan et al. 2013)

Page 27: Finding the Lost Treasure of NGS Data

SNP and Somatic Mutation Identification using RNAseq Data

• Traditionally, somatic mutations have been detected using Sanger sequencing or RT-PCR by comparing paired tumor and normal samples. One obvious limitation of such methods is that the search must be limited to a specific genomic region of interest.

• With the maturation of next-generation sequencing, we can now screen all coding genes, or even the whole genome, for somatic mutations at a reasonable cost.

Page 28: Finding the Lost Treasure of NGS Data

Why do we want to detect mutation in RNAseq data?

• You don't have DNA sequencing data

• Detecting mutations was not the original goal, but why not?

• There is much more RNAseq data than DNAseq data

• A mutation in RNA is more relevant than a mutation in DNA

Page 29: Finding the Lost Treasure of NGS Data

Difficulties

• There is not enough depth in non-expressed genes to detect mutations

• Reverse transcription of RNA to cDNA introduces additional errors

• It is hard to distinguish mutations from RNA editing

• In summary, somatic mutation detection using RNAseq data produces many more false positives.

Page 30: Finding the Lost Treasure of NGS Data

Somatic Mutation Caller Designed Specifically for RNAseq Data

Page 31: Finding the Lost Treasure of NGS Data

Other Ways you can mine your data

Page 32: Finding the Lost Treasure of NGS Data

Summary

• Get your priorities right: never design a study just for secondary analysis targets.

• If you have old data, think about what else you can do with it and try to get the most out of it.

• At VANGARD, we help you with your basic genomic data analysis needs.

• Advanced data analysis can be done through collaboration.

Page 33: Finding the Lost Treasure of NGS Data

Acknowledgement

• Yu Shyr
• Tiger Sheng
• Chung-I Li
• Jiang Li
• Mike Guo
• David Samuels
• Chun Li