A technical and methodological introduction
to NGS (data) analysis.
biomina
Geert Vandeweyer 2015-04-24
Outline
Next Generation Sequencing:
- Technological principles
- Applications
- NGS Data Description
Example Application: NGS based Variant Calling
- From sequence to variant: Analysis flow
- From variant to knowledge: Interpretation flow
Final Remarks:
- Reducing computational complexity
The digital code of DNA, Leroy Hood and David Galas Nature 421, 444-448, 23 January 2003
NGS Principles: Sanger
NGS Principles: Illumina (sbs)
- Sample preparation: target + adaptors ==> library
- Cluster Generation
- Sequencing by Synthesis
The figures above are provided by
NGS Applications: DNA-Seq
Whole Genome Sequencing
• Novel organisms, de novo reference genome
• Structural variant detection
“Selective” Sequencing
• Whole exome sequencing
• Gene panel resequencing
  • Candidate genes for disease
  • All genes in pathway
  • ...
  => PCR, Capture, MIPs, ...
• ChIP-Seq
• 16S metagenomics
NGS Applications: RNA-Seq
Whole Transcriptome Sequencing
• Gene/Transcript variant identification
• Gene Expression
  • Unbiased detection
  • Highly quantitative
“Selective” Sequencing
• Ribosome Profiling
NGS Data Description
What kind of data are we working with?
- Sanger Sequencing:
  - 1 amplicon / reaction
  - 1 sequence / amplicon (or 2)
  - Visual inspection for overlapping peaks
- Next-Generation Sequencing:
  - Massively parallel sequencing
    - small panel : a few hundred targets
    - exome panel : > 200.000 targets
  - Multiple amplicons / target
    - optimal design: > 40 unique fragments covering every nucleotide in the targets
  => Amount of data : > 8.000.000 sequences / sample
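As a rough sanity check, the data volume quoted above follows from the two design figures on this slide. The sketch below assumes one sequence per unique fragment and uses the lower bounds as given; the variable names are illustrative only.

```python
# Lower bounds taken from the slide; the names are illustrative assumptions.
exome_targets = 200_000        # exome panel: > 200.000 targets
fragments_per_target = 40      # optimal design: > 40 unique fragments per target

# Assumption: each unique fragment yields exactly one sequence (read).
min_sequences_per_sample = exome_targets * fragments_per_target
print(f"{min_sequences_per_sample:,} sequences / sample")
```

Paired-end sequencing or deeper coverage would only push this number higher, which is why it is stated as a lower bound.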
NGS Data Description
What kind of data are we working with?
- Data format : FASTQ
- FASTA :
>Sequence_Name
AACTACTAGATACTGATAGTATATCTCTCTTAATCGA
GCTCTAGATCGATCTATACCGAT
- Add Quality (fasta-Q => FASTQ)
@Read_Name
AACTACTAGATACTGATAGTATATCTCTCTTAATCGA
+
BCEECEEFFECGECGECFGFF@?<<=??<>53@##
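The four-line record layout described above can be read with a few lines of Python. This is a minimal sketch, assuming unwrapped records (exactly four lines each); `parse_fastq` and the two example reads are hypothetical and not part of any tool named in these slides.

```python
import io

def parse_fastq(handle):
    """Yield (name, sequence, quality) tuples from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input (or a blank line): stop reading
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality

# Hypothetical two-read example, not taken from real data.
example = io.StringIO("@read_1\nACGTAC\n+\nIIII#I\n@read_2\nGGCATT\n+\nIIIIII\n")
records = list(parse_fastq(example))
for name, sequence, quality in records:
    print(name, sequence, quality)
```

For real data, an established parser is the safer choice; the point here is only that every read carries one quality character per base.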
NGS Data Description
What kind of data are we working with?
- Data format : FASTQ
@Read_Name
AACTACATACTGATAGTATATCTC
+
BCEECEECGECGECFGFF@?<<
Standard == Sanger Format : Quality = phred + 33, ascii-encoded
Phred Score : correlates with the risk of error (probability that the basecall is wrong): Q = -10 * log10(P(error))
“High Quality” : P(error) < 0.001, i.e. Q > 30
NGS Data Description
What kind of data are we working with?
Quality = phred + 33, ascii-encoded
=> Example: Value == B
   Ascii-decode : 66
   Phred : 66 - 33 = 33
   Chance of error = 1/10^3.3 (about 0.0005)
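The decoding steps in the example above can be sketched in Python as follows. The function names are illustrative, and the offset of 33 assumes the Sanger encoding described on this slide.

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]

def error_probability(phred):
    """Convert a Phred score into the probability that the basecall is wrong."""
    return 10 ** (-phred / 10)

scores = phred_scores("BCEE")  # first four quality characters from the slide
for ch, q in zip("BCEE", scores):
    print(ch, q, error_probability(q))
```

For the character `B` this reproduces the slide's numbers: ASCII 66, Phred 33, error probability 10^-3.3.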
NGS Data Description
What kind of data are we working with?
- Data format : FASTQ << WARNING >>
@Read_Name
AACTACTAGATACTGATAGTATATCTCTCTTAATCGA
+
BCEECEEFFECGECGECFGFF@?<<=??<>53@##
=> Standard == Sanger Format (phred + 33) : Other scales exist, e.g. older Illumina pipelines (1.3 to 1.7) used phred + 64 !
NGS Based Variant Calling From sequence to variant: Analysis flow
Pre - Processing
- Adapter Trimming: Remove artificial sequences
- Quality-Trimming: Remove low quality sequences to improve specificity
- Generate QC Reports: Visual inspection of main quality parameters
Sequence – To – Variant
- Read Mapping: Place reads on the reference genome (BWA)
- Optimize Mapping: Remove duplicate reads (picard), recalibrate base quality scores (GATK), realign around indels (GATK)
- Call and Annotate Variants: Call variants (GATK) and annotate using ANNOVAR and snpEff (VariantDB)
NGS Based Variant Calling From sequence to variant: Analysis flow
Pre - Processing: Adapter Trimming
Figure labels: Sequence Read 1, Sequence Read 2, Sequence Barcode
Scan all reads for the presence of artificial sequence and remove it from the reads.
Note: Adapters are present when length(targeted fragment) < read_length
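A minimal sketch of the scan-and-remove step, assuming exact matching only. Real trimmers also tolerate mismatches and trim the quality string in parallel. The function `trim_adapter`, its `min_overlap` parameter and the example adapter string are illustrative choices, not the configuration used in this pipeline.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove an adapter (and everything after it) from the 3' end of a read."""
    # Case 1: the full adapter occurs somewhere inside the read.
    index = read.find(adapter)
    if index != -1:
        return read[:index]
    # Case 2: the read ends with only the first few bases of the adapter.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read  # no adapter found: fragment was longer than the read

ADAPTER = "AGATCGGAAGAGC"  # example adapter sequence, assumed for illustration
print(trim_adapter("ACGTACGT" + ADAPTER + "TTT", ADAPTER))  # full adapter inside read
print(trim_adapter("ACGTACGTAGATC", ADAPTER))               # partial adapter at 3' end
print(trim_adapter("ACGTACGT", ADAPTER))                    # nothing to trim
```

The partial-match case is what the note above refers to: when the fragment is shorter than the read length, sequencing runs into the adapter.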
NGS Based Variant Calling From sequence to variant: Analysis flow
Pre - Processing: Quality-Trimming
Low quality leads to high error rates (cfr Phred Score)
=> Due to chemical degradation, 3’ ends have a lower quality
=> We want a limit of 1 error in 1000 positions
=> Trim everything on 3’ end with quality < 30
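The trimming rule above can be sketched as a walk from the 3' end that stops at the first base reaching the threshold. This is the simplest possible reading of the rule, assuming Sanger-encoded qualities (phred + 33). Production trimmers typically use windowed or running-sum variants, and `quality_trim` is a hypothetical name.

```python
def quality_trim(sequence, quality, threshold=30, offset=33):
    """Trim bases from the 3' end while their Phred score is below the threshold."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < threshold:
        end -= 1
    return sequence[:end], quality[:end]

# 'I' encodes Phred 40, '#' encodes Phred 2 (Sanger offset 33).
trimmed_seq, trimmed_qual = quality_trim("ACGTACGT", "IIIII###")
print(trimmed_seq, trimmed_qual)
```

Note that a single low-quality base in the middle of the read is left untouched by this sketch; only the 3' tail is removed.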
NGS Based Variant Calling From sequence to variant: Analysis flow
Pre - Processing: Generate QC Reports
- Quality should improve after trimming
- Base composition should be 25% for G,C,T,A
  (figure panels: Good run, Failed Run)
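The base-composition check can be sketched as a per-position tally over all reads. This is an illustration of the idea behind the QC plot, not the code of any particular QC tool; `base_composition` and the four example reads are made up for the occasion.

```python
from collections import Counter

def base_composition(reads):
    """Return, for each read position, the fraction of A, C, G and T observed."""
    if not reads:
        return []
    longest = max(len(read) for read in reads)
    composition = []
    for position in range(longest):
        counts = Counter(read[position] for read in reads if position < len(read))
        total = sum(counts[base] for base in "ACGT")  # ignore N and other symbols
        composition.append(
            {base: (counts[base] / total if total else 0.0) for base in "ACGT"}
        )
    return composition

# Hypothetical, perfectly balanced reads: every position shows 25% per base.
example_reads = ["ACGT", "CGTA", "GTAC", "TACG"]
for position, fractions in enumerate(base_composition(example_reads)):
    print(position, fractions)
```

Strong deviations from a flat profile at specific positions are the kind of pattern the "Failed Run" panel illustrates.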
NGS Data analysis From sequence to variant: Analysis flow
Sequence – To – Variant: Read Mapping
Burrows-Wheeler Transformation:
- Highly efficient method to scan a string for substring matches
- Principle: Build a prefix trie, scan top-down using reverse search.
  1. Generate all cyclic rotations of the string (with an end marker appended)
  2. Sort the rotated strings
  3. Last column = Burrows-Wheeler Transformation of the string
  4. Build prefix trie from BWT
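Steps 1 to 3 above can be sketched naively in a few lines. This version materialises every rotation, which is fine for a toy string but far too memory-hungry for a genome. Aligners such as BWA build the transform via suffix arrays instead. The `$` end marker and the function names are conventional choices assumed here.

```python
def bwt(text, end_marker="$"):
    """Naive Burrows-Wheeler transform: sort all rotations, take the last column."""
    text += end_marker  # the marker sorts before A, C, G, T and all letters
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(transformed, end_marker="$"):
    """Rebuild the original string by repeatedly prepending and sorting columns."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(end_marker))
    return original[:-1]

transformed = bwt("banana")
print(transformed)
print(inverse_bwt(transformed))
```

The transform is reversible and tends to group identical characters together, which is what makes the compressed index searchable.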
NGS Based Variant Calling From sequence to variant: Analysis flow
Pre - Processing: Generate QC Reports
- Insert Size
- Capture Efficiency
NGS Based Variant Calling From sequence to variant: Analysis flow
Sequence – To – Variant: Optimize Mapping
- Remove duplicate reads (picard)
  => Reduce computational time
  => Reduce amplification bias
- Realign around indels (GATK)
  => InDels are hard to align
  => P(>1 SNPs) < P(1 indel)
  If, at a certain locus, both an InDel AND multiple SNPs are found
  => Replace SNPs by one InDel
  => Reduction of false positives
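The idea behind duplicate removal can be sketched as follows: reads that map to the same position and strand are assumed to stem from the same amplified fragment, and only the best one is kept. This is a deliberate simplification and not Picard's actual algorithm; the `Read` record and its fields are hypothetical.

```python
from typing import NamedTuple

class Read(NamedTuple):
    name: str
    chrom: str
    pos: int             # leftmost mapped position
    strand: str          # "+" or "-"
    mean_quality: float  # average base quality of the read

def remove_duplicates(reads):
    """Keep one read per (chrom, pos, strand): the one with the highest quality."""
    best = {}
    for read in reads:
        key = (read.chrom, read.pos, read.strand)
        if key not in best or read.mean_quality > best[key].mean_quality:
            best[key] = read
    return list(best.values())

# Hypothetical mapped reads: r1 and r2 are duplicates of each other.
mapped = [
    Read("r1", "chr1", 100, "+", 30.0),
    Read("r2", "chr1", 100, "+", 35.0),
    Read("r3", "chr1", 200, "+", 20.0),
]
print([read.name for read in remove_duplicates(mapped)])
```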
NGS Based Variant Calling From sequence to variant: Analysis flow
Sequence – To – Variant: Call And Annotate Variants
- Call Variants (GATK)
  - Search for positions with statistically significant evidence for a non-reference nucleotide
  - Take into account: base-quality, position in read, strand bias, ...
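To make "statistically significant evidence" concrete, here is a toy test at a single position: after dropping low-quality bases, it asks how likely the observed number of non-reference bases would be if they were all sequencing errors. This is a sketch under strong assumptions (independent errors, one fixed error rate, no strand or position effects) and is not GATK's model; all names and thresholds are illustrative.

```python
from collections import Counter
from math import comb

def binomial_tail(k, n, p):
    """Probability of observing k or more successes in n trials with rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def call_site(bases, quals, ref_base, min_quality=30, alpha=1e-6):
    """Return (alt_base, alt_count, depth, p_value) or None if no variant is called."""
    kept = [b for b, q in zip(bases, quals) if q >= min_quality]
    depth = len(kept)
    alt_counts = Counter(b for b in kept if b != ref_base)
    if depth == 0 or not alt_counts:
        return None
    alt_base, alt_count = alt_counts.most_common(1)[0]
    error_rate = 10 ** (-min_quality / 10)  # worst-case error rate of kept bases
    p_value = binomial_tail(alt_count, depth, error_rate)
    if p_value < alpha:
        return alt_base, alt_count, depth, p_value
    return None

# Hypothetical pileups at one position, reference base 'A', all bases at Phred 35.
heterozygous = call_site("A" * 22 + "T" * 18, [35] * 40, "A")
single_error = call_site("A" * 39 + "T", [35] * 40, "A")
print(heterozygous)
print(single_error)
```

One stray `T` in 40 reads is compatible with sequencing error and is not called, while 18 of 40 is not.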
NGS Based Variant Calling From sequence to variant: Analysis flow
Sequence – To – Variant: Call And Annotate Variants
Annotate Variants (ANNOVAR, snpEff, ...)
- Add information to the variant to ease interpretation
  - Effect on gene transcription (RefSeq, Ensembl, UCSC)
  - Quality parameters (GATK)
  - Occurrence in control populations (dbSNP, ESP, HapMap, 1KG, ...)
  - Known pathogenic variations (dbSNP, OMIM, ...)
  - Effect on gene function (PolyPhen, MutationTaster, Sift, ...)
  - ...
NGS Based Variant Calling From variant to knowledge: Interpretation flow
Step 1 : Quality Filtering: GATK Variant Recalibration
“The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact.”
“The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.”
- Train model on known variants (both positive and negative)
- Apply model to experimental data
NGS Based Variant Calling From variant to knowledge: Interpretation flow
Step 2 : Select an inheritance model
- De Novo: Variant not present in either parent
- Dominant: Variant present in affected parent
- Recessive: Variant homozygous in patient, heterozygous in both parents
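The three models above translate directly into a genotype filter on a patient/parents trio. In this sketch, genotypes are encoded as the number of variant alleles (0, 1 or 2). The function `matches_model` and this encoding are assumptions for illustration, not the interface of VariantDB or any other tool.

```python
def matches_model(model, patient, father, mother, affected_parent=None):
    """Check whether trio genotypes (0, 1 or 2 variant alleles) fit a model."""
    if model == "de_novo":
        # Variant in the patient, absent from both parents.
        return patient >= 1 and father == 0 and mother == 0
    if model == "dominant":
        # Variant in the patient and in the affected parent.
        if affected_parent not in ("father", "mother"):
            raise ValueError("dominant model needs affected_parent='father' or 'mother'")
        parent = father if affected_parent == "father" else mother
        return patient >= 1 and parent >= 1
    if model == "recessive":
        # Homozygous in the patient, heterozygous in both parents.
        return patient == 2 and father == 1 and mother == 1
    raise ValueError(f"unknown inheritance model: {model!r}")

print(matches_model("de_novo", patient=1, father=0, mother=0))
print(matches_model("dominant", patient=1, father=1, mother=0, affected_parent="father"))
print(matches_model("recessive", patient=2, father=1, mother=1))
```

Applying such a filter to every variant is what reduces the candidate list before the functional assessment in the next step.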
NGS Based Variant Calling From variant to knowledge: Interpretation flow
Step 3 : Effect on gene function (~ from high to low)
- Variant causes gain/loss of a stop/start codon?
- Variant causes aberrant splicing of the transcript?
- Variant replaces a highly conserved nucleotide/amino acid?
- Variant replaces an amino acid, and is not reported in control populations?
- Variant can modify binding of regulatory elements?
- ...
Extended annotation is critical.
Manual inspection of > 20.000 variants/sample is impossible. Automation is needed.
Final Remarks
Reducing Computational complexity: Web-Tools
Sequence-to-Variants: Galaxy
- A website offering an easy way to run complete pipelines
- No programming skills needed, very useful for dynamic analysis
- Support for almost all types of analysis
  - Variant Calling
  - RNA seq : Expression / transcript identification
  - MetaGenomics
  - ChIP-seq
- Many organisms available by default (on main servers)
- New organisms can be added on request (on Biomina Servers)
Public Server: http://www.usegalaxy.org
Biomina Server: http://www.biomina.be/apps/galaxy
Final Remarks
Reducing Computational complexity: Web-Tools
Variant Interpretation: VariantDB
- Extensive annotation
- Flexible filtering options
- Automatic updates
- Multiple output formats:
  - online (tabular)
  - offline (CSV)
  - API (JSON)
Final Remarks
Reducing Computational complexity: Future ?
Future NGS assays will be:
- Real-Time
- On-Site
- Low-Cost
- ....