20/10/15 Yannick Boursin
NGS, Cancer and Bioinformatics
NGS and Clinical Oncology
• NGS in hereditary cancer genome testing
  • BRCA1/2 (breast/ovary cancer)
  • XPC (melanoma)
  • ERCC1 (colorectal cancer)
• NGS for personalized cancer treatment
  • Clinical trials: MOSCATO (GR), SAFIR (GR), SHIVA (Curie), …
  • Ipilimumab (anti-CTLA4), Nivolumab (anti-PD1), Trastuzumab (anti-HER2), Cetuximab (anti-EGFR)
• Detection of chimeric transcripts
  • Chronic Myeloid Leukemia: Philadelphia chromosome (BCR/ABL)
  • Non-Small-Cell Lung Cancer: EML4-ALK
NGS and Oncology
NGS is now widely used as:
• A research tool to screen a large number of cancer samples
• A clinical/diagnostic tool in daily practice
These projects require a dedicated bioinformatics integration effort to access and analyze this huge amount of data.
07-09th April 2014 NGS and Bioinformatics
Why do we need computers for NGS
[Figure: sequencing data size evolution]
Needs to address:
• Store petabytes of data (1 PB is 1000 TB).
• Share data around the world through networks
• Analyze huge amounts of data with complex algorithms
Bioinformatics and Oncology
• Problem: finding, extracting, and presenting relevant information.
• Partial solution: designing workflows in order to ease data analysis.
Interdisciplinary collaboration
Bioinformatics acts as a hub between the different fields. Trust between partners is needed, and training is needed as well for efficient understanding.
• Biology knowledge: Knowledge modeling, …
• Technological platforms: Sequencing, Microarrays, ImmunoChemistry, …
• Bioinformatics: Raw data storage, Integration of biological and clinical data, Quality Control, Data analysis, Clinical Biostatistics, Report for biological/medical staff
• Medical staff: Clinicians, specialists, …
• Biological staff: Biologists, Geneticists, …
Standard Workflow for NGS Analysis
[Figure: a typical NGS workflow: Sequencing & Primary Analysis → Raw Reads → Reads Cleaning → Reads Mapping → Data Analysis (depends on the NGS application), with quality-control checkpoints QC 1, QC 2 and QC 3]
Step 1: Quality Check and improvements
NGS Data: what do they look like ?
A raw data file (.fastq, .sff, .fa, .csfasta/.qual) with millions of short reads of the same size (SOLiD, HiSeq) or reads of different size (Ion PGM/Proton)
[Figure: enhanced view of the reads in a fastq file]
FASTQ format
• 1 sequence = 1 read = 4 lines in the file (base-space FASTQ)
• First line = sequence identifier
• Fourth line = Quality
• ASCII encoded (reduces the file size)
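The four-line record layout described above can be read with a few lines of Python. This is only a minimal sketch, assuming well-formed records that are not wrapped over several lines; the `parse_fastq` helper and the two example reads are hypothetical and made up for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading '@'
    sequence: str    # line 2
    quality: str     # line 4, ASCII-encoded, one character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per group of 4 lines (assumes unwrapped records)."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    for i in range(0, len(stripped), 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at non-blank line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], seq, qual)


# Hypothetical two-read example, not taken from a real run.
example = """@read_1
ACTGATTAGT
+
IIIIIHHGF#
@read_2
GATCGATGCA
+
IIIIIIIIII
"""
records = list(parse_fastq(example.splitlines()))
```

Real files hold millions of records, so in practice one would stream the file rather than load all lines in memory as this sketch does.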
Sequence quality encoding
Phred scores Q: Q scores are defined as a property that is logarithmically related to the base-calling error probability (P).
Q = -10 log10(P)
For example, Q = 20 corresponds to P = 0.01 (1 error in 100 base calls) and Q = 30 to P = 0.001.
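The relation between Q and P can be checked numerically. The sketch below assumes the Phred+33 ASCII offset (Sanger / Illumina 1.8+ style); the slides do not state which offset is used, so `PHRED_OFFSET` is an assumption, and the function names are illustrative.

```python
import math
from typing import List

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ encoding


def phred_to_error_probability(q: float) -> float:
    """Invert Q = -10 log10(P), giving P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Q = -10 log10(P)."""
    return -10 * math.log10(p)


def decode_quality(quality_line: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Turn the ASCII quality line of a FASTQ record into Phred scores."""
    return [ord(ch) - offset for ch in quality_line]


# 'I' is ASCII 73, '5' is 53 and '#' is 35.
scores = decode_quality("II5#")
```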
Quality controls on raw reads: let's start after sequencing
A raw data file (.fastq, .sff, .fa, .csfasta/.qual) with millions of short reads of the same size (SOLiD, HiSeq) or reads of different size (Ion PGM/Proton)
Raw reads:
ACTGATTAGTCTGAATTAGANNGATAGGAT
GATCGATGCATAGCGATCAGCATCGATACG
CGGCGCTCCGCTCTCGAAACTAGCACTGAC
AGCATCAGGATCTACGATCTAGCGAACTGAC
ACTAGCTACTATCGAGCGAGCGATCATCGAC
ACTAGGCATCGGCATCACGGACNNNNNNNN
ACTAGCTATCGAGCTATCAGCGAGCATCTATC
CTGACTACTATCGAGCGAGCTACTAACTGAC
ACTACTTACGACATCGAGGTTAGGAGCATCA
ACTANNGACTAGGAATTAGCTACTGAGCTAC
ACTAGCAGCTATATGAGCTACTAGCACTGAC
ACTATCAGCTAGCGCTTCAGCATTACCGT
NNNNNNNNNNNNNNNNNNNNNNNNNNNNN
A raw read is characterized by three parameters:
• Its length
• Its sequence
• Per-base-in-sequence quality
Why look at sequencing quality ?
• Quality of data is very important for various downstream analyses:
  • Sequence assembly or mapping
  • Variant detection
  • Gene expression studies
  • ...
• Quality of data = poor
  • Try to find a reason
  • Can we correct/improve the quality?
  • May lead to erroneous conclusions
Quality controls on raw reads: which metrics to check ?
Mainly:
• Quality score per base and over the reads
But also:
• Read length distribution
• Sequence content per base and % of GC
• K-mer content
• Overrepresented sequences
• Duplicated reads
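Several of the secondary metrics listed above are simple counts over the read sequences. The following is a rough sketch, not the implementation used by any particular QC tool; the function names and the toy reads are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def read_length_distribution(reads: List[str]) -> Dict[int, int]:
    """Number of reads observed for each read length."""
    return dict(Counter(len(r) for r in reads))


def gc_percent(read: str) -> float:
    """% of G and C among called bases (N is ignored)."""
    called = [b for b in read.upper() if b in "ACGT"]
    if not called:
        return 0.0
    return 100.0 * sum(b in "GC" for b in called) / len(called)


def duplicated_fraction(reads: List[str]) -> float:
    """Fraction of reads that are exact copies of another read."""
    counts = Counter(reads)
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(reads) if reads else 0.0


def kmer_counts(reads: List[str], k: int) -> Counter:
    """Count every k-mer across all reads (overlapping windows)."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy reads made up for illustration.
reads = ["ACGT", "ACGT", "GGCC", "ANNT", "ACGTAC"]
lengths = read_length_distribution(reads)
dup = duplicated_fraction(reads)
kmers = kmer_counts(reads, 2)
```

Overrepresented sequences can be approximated the same way, by looking at the most frequent entries of `Counter(reads)` or of the k-mer table.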
Quality scores
• Per base (box-and-whisker plot) -> to see whether base calls fall into low quality (commonly towards the end of a read)
• Per sequence (mean quality distribution) -> to see if a subset of your sequences has universally low quality values
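Both views can be derived from the quality lines alone. The sketch below groups scores by read position (the data behind a box-and-whisker plot) and computes a mean per read; it assumes a Phred+33 offset and non-empty quality lines, and all names and example values are illustrative.

```python
import statistics
from typing import Dict, List

PHRED_OFFSET = 33  # assumption: Phred+33 encoding


def per_base_quality(quality_lines: List[str]) -> Dict[int, List[int]]:
    """Group Phred scores by 0-based position in the read."""
    by_position: Dict[int, List[int]] = {}
    for line in quality_lines:
        for pos, ch in enumerate(line):
            by_position.setdefault(pos, []).append(ord(ch) - PHRED_OFFSET)
    return by_position


def per_sequence_mean_quality(quality_lines: List[str]) -> List[float]:
    """Mean Phred score of each read (assumes non-empty quality lines)."""
    return [
        statistics.mean(ord(ch) - PHRED_OFFSET for ch in line)
        for line in quality_lines
    ]


# Toy quality lines: 'I' decodes to 40, '5' to 20 and '#' to 2.
quality_lines = ["IIII5", "II55#", "####"]
by_pos = per_base_quality(quality_lines)
medians = {pos: statistics.median(v) for pos, v in by_pos.items()}
means = per_sequence_mean_quality(quality_lines)
```

In this toy input the median drops towards the end of the reads, and the third read has a uniformly low mean, which are the two patterns the plots are meant to reveal.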
[Figure: quality score plots for PGM runs A and B]
[Figure: quality score plots for Illumina runs C and D]
Quality control on raw reads: adapters removal
• An adapter is a small piece of known DNA located at the end of the reads
• Adapter roles:
  • Attach the read to the sequencer flowcell
  • Allow a specific PCR enrichment of reads carrying the adapter
  • Used in multiplex sequencing (samples in mix)
• Available tools to trim adapters:
  • Cutadapt
  • SeqPrep
  • RmAdapter
[Figure: in blue, adapters; in orange, the informative part of the read]
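To make the idea of adapter trimming concrete, here is a deliberately simplified sketch that removes an exact 3' adapter match, or a partial adapter overlapping the end of the read. It is not how Cutadapt, SeqPrep or RmAdapter work internally (those tools tolerate sequencing errors, among other things); the adapter sequence and the `min_overlap` parameter are assumptions chosen for illustration.

```python
def trim_3prime_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove an exact adapter match, or a partial adapter at the read end."""
    # Full adapter anywhere in the read: cut from its first occurrence.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Otherwise look for the longest adapter prefix that ends the read.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


ADAPTER = "AGATCGG"  # hypothetical adapter sequence

full = trim_3prime_adapter("ACGTACGTAGATCGGTTT", ADAPTER)
partial = trim_3prime_adapter("ACGTACGTAGAT", ADAPTER)
untouched = trim_3prime_adapter("ACGTACGT", ADAPTER)
```

The `min_overlap` threshold is a trade-off: a small value removes more adapter remnants but also trims genuine bases that happen to match the start of the adapter.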
Quality controls on raw reads: let's start after sequencing
A first Quality Control of raw reads is mandatory and can be established according to the application ('N', adapter sequences, barcode, contamination, etc.)
[Figure: processed reads: blue parts are to be kept, green and red parts are to be removed; the highlighted fragments include runs of 'N' and the trailing ACTGAC sequence]
Quality controls : Standard Workflow for NGS Analysis
[Figure: a typical NGS workflow, from raw reads through reads cleaning and reads mapping to data analysis, with checkpoints QC 1, QC 2 and QC 3]
Step 2: Short Reads Alignment
Reads alignment - Vocabulary
Alignment (mapping): The reads alignment aims at transforming the single reads information into an organized and reduced set of information.
Mismatch: Incoherence between two nucleotides
Reference Genome: The reference genome is a known sequence, supposed to be as close as possible to the input genome, and which is used as an anchor to organize the single reads information.
Gap: Bridge within the read alignment (i.e. small insertion/deletion)
Mappability: Uniqueness of a region (repeated region = low mappability, unique region = good mappability)
Indels: Insertion/deletion relative to the reference genome
Reads alignment – Two strategies
The reads alignment aims at transforming the single reads information into an organized and reduced set of information. Two strategies can be applied:
- De novo reads assembly: used when no reference genome is available. It aims at reconstructing long scaffolds from single reads information.
- Alignment on a reference genome: the reads are directly compared to a known reference genome.
Alignment on a reference genome
The reference genome is a known sequence, supposed to be as close as possible to the input genome, and which is used as an anchor to organize the single reads information.
[Figure: alignment of short reads against the reference genome sequence]
[Figure: alignment of reads against the reference genome, highlighting a homozygous polymorphism (T/C)]
Alignment on a reference genome - Challenges
New alignment algorithms must address the requirements and characteristics of NGS reads:
– Millions of reads per run (30x genome coverage)
– Reads of different sizes (35 bp - 200 bp)
– Different types of reads (single-end, paired-end, mate-pair, etc.)
– Base-calling quality factors
– Sequencing errors (~1%)
– Repetitive regions
– Sequenced organism vs. reference genome
– Must adjust to evolving sequencing technologies and data formats
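To give an order of magnitude for "30x genome coverage", mean coverage is commonly estimated as (number of reads × read length) / genome size. This formula is not stated on the slide; the sketch below uses it with two assumptions: an approximate haploid human genome size of 3.1 Gb and 100 bp reads.

```python
import math

HUMAN_GENOME_BP = 3_100_000_000  # assumption: approximate haploid genome size


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering each base of the genome."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads reaching the target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


n_reads_30x = reads_needed(30, 100, HUMAN_GENOME_BP)
```

Under these assumptions a 30x whole genome needs on the order of a billion 100 bp reads, which is why the aligner's speed and memory footprint matter.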
Alignment on a reference genome – Bioinformatics tools
[Figure: mappers timeline (since 2001)]
Fonseca N A et al. Bioinformatics 2012;28:3169-3177
Finding the best alignment - Rationale
Given a reference and a set of reads, report at least one “good” local alignment for each read if one exists.
What is “good”? For now, we concentrate on:
– Fewer mismatches is better
– Failing to align a low-quality base is better than failing to align a high-quality base
[Figure: for each criterion, two candidate alignments of a read against the same reference, the first one being "better than" the second]
Based on a scoring system, e.g. score for a match (1), mismatch (MM) penalty (3), gap open penalty (5), gap extension penalty (2). The best alignment is the one with the highest score.
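The scoring system above can be applied to an already aligned pair of strings. In this sketch, gaps are written as '-' and a gap of length n is assumed to cost the open penalty plus n times the extension penalty; aligners differ on this convention, so treat it as an assumption. The example sequences are made up.

```python
MATCH = 1        # score for a match
MISMATCH = 3     # mismatch (MM) penalty
GAP_OPEN = 5     # gap open penalty
GAP_EXTEND = 2   # gap extension penalty, charged for every gap position


def alignment_score(ref_aln: str, read_aln: str) -> int:
    """Score two equal-length aligned strings where '-' marks a gap."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have the same length")
    score = 0
    in_gap = False
    for r, q in zip(ref_aln, read_aln):
        if r == "-" or q == "-":
            if not in_gap:
                score -= GAP_OPEN  # charged once per run of gap columns
            score -= GAP_EXTEND
            in_gap = True
        else:
            score += MATCH if r == q else -MISMATCH
            in_gap = False
    return score


perfect = alignment_score("GATCAT", "GATCAT")
one_mismatch = alignment_score("GATCAT", "GAGCAT")
one_gap = alignment_score("GATCAT", "GA-CAT")
```

With these values a single mismatch is cheaper than a one-base gap, so the aligner prefers mismatches unless a gap rescues several matches.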
Alignment key parameters - Repeats
Approximately 50% of the human genome is comprised of repeats
Treangen T.J. and Salzberg S.L. 2012. Nature Reviews Genetics 13, 36-46
Close proximity with genes: intergenic and intragenic positions
[Figure: BRCA2, a mosaic of repeated regions]
Alignment key parameters – Repeats – 3 strategies
-1- Report only unique alignments
-2- Report best alignments and randomly assign reads across equally good loci
-3- Report all (best) alignments
[Figure: reads assigned to repeat copies A and B under strategies -1-, -2- and -3-]
Treangen T.J. and Salzberg S.L. 2012. Nature Reviews Genetics 13, 36-46
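The three reporting strategies can be expressed as three small functions over the candidate hits of one read. This is an illustrative sketch only: the `Hit` tuple, the locus names and the scores are hypothetical, and real aligners expose these choices through their own options.

```python
import random
from typing import List, Tuple

Hit = Tuple[str, int, int]  # (locus name, position, alignment score)


def best_hits(hits: List[Hit]) -> List[Hit]:
    """All hits sharing the highest alignment score."""
    if not hits:
        return []
    top = max(score for _, _, score in hits)
    return [h for h in hits if h[2] == top]


def report_unique(hits: List[Hit]) -> List[Hit]:
    """-1- Keep the read only if a single best locus exists."""
    best = best_hits(hits)
    return best if len(best) == 1 else []


def report_random_best(hits: List[Hit], rng: random.Random) -> List[Hit]:
    """-2- Pick one of the equally good loci at random."""
    best = best_hits(hits)
    return [rng.choice(best)] if best else []


def report_all_best(hits: List[Hit]) -> List[Hit]:
    """-3- Report every best alignment."""
    return best_hits(hits)


# A read hitting two repeat copies (A and B) equally well, plus a worse hit.
hits = [("A", 100, 20), ("B", 5000, 20), ("C", 9000, 12)]
unique = report_unique(hits)
chosen = report_random_best(hits, random.Random(0))
everything = report_all_best(hits)
```

Strategy 1 loses coverage in repeats, strategy 2 keeps coverage but places reads arbitrarily, and strategy 3 counts the same read several times; which one fits depends on the downstream analysis.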
Alignment key parameters – Using single or paired-end reads ?
The type of sequencing (i.e. single or paired-end reads) is often driven by the application.
Example: finding large indels, genomic rearrangements, ...
However, in most cases, the pair information can improve the mapping specificity.
- Single-end alignment – repeated sequence
- Paired-end alignment – unique sequence
[Figure: reads ACGACTC and GGCCAAC aligned against the reference genome sequence, single-end versus paired-end]
Alignment on a reference genome
Key points
• The alignment is a crucial step of the NGS analysis.
• The reference genome has to be carefully chosen.
• The mappability of the region of interest has to be taken into account (primer design).
• The scoring method has to be chosen according to the sequencing error rate and the quality of the raw reads.
• The alignment parameters have to be set properly.
Limitations of Alignment Tools
Even though we now have some nice tools to align reads on a reference genome, several issues remain important:
- Homopolymer mapping
- Efficient alignment of small indels
- Alignment on several genomes
- Alignment on repeated sequences
- ...
Alignment formats
• A lot of formats exist:
  • SAM
  • BAM
  • ELAND (Illumina specific)
  • MAQ map
  • …
SAM and BAM are now the standard for aligned data
SAM format
• SAM stands for Sequence Alignment/Map
• Tab-delimited text file
• 1 line per read
• Each line is composed of 11 fields (minimum)
Example SAM records (fields are tab-separated in the actual file):
11695_6 0 chr1 3292760 255 20M * 0 0 AAGAGATCTGGAACCATAGA DGDFCDGFFGBEFFGFDEEF XA:i:0 MD:Z:20 NM:i:0 XX:i:3984
9985_1 0 chr1 3292761 255 19M * 0 0 AGAGATCTGGAACCATAGA IIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:19 NM:i:0 XX:i:3990
4226_1 0 chr1 3296594 255 22M * 0 0 TCTGCAAGGCAAAAGACACTGT GHHHHHGHGHHHGHHHHBHBGG XA:i:0 MD:Z:22 NM:i:0 XX:i:4194
7001_1 0 chr1 3328828 255 20M * 0 0 AAGAAAGAGAACTTCAGACC GGGG+GGGGGGIIIIIBHII XA:i:0 MD:Z:20 NM:i:0 XX:i:2357
1042_1 0 chr1 3334731 255 21M * 0 0 GGGACTCAGCAGAACTTAGGA ?@GGGDGGGG>DDGGGGGGDB XA:i:0 MD:Z:21 NM:i:0 XX:i:1027
14647_1 0 chr1 3334756 255 23M * 0 0 AGTCTGAACAGGTTAGAGGGTGC IIIIIIEGIHIGID<DBDGDBGB XA:i:0 MD:Z:23 NM:i:0 XX:i:1910
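The 11 mandatory fields can be pulled apart with a plain tab split. The sketch below parses the first record of the example; the `SamRecord` type and `parse_sam_line` helper are illustrative names, and the line is rebuilt with tabs because the slide shows the fields separated by spaces.

```python
from typing import Dict, NamedTuple, Union


class SamRecord(NamedTuple):
    qname: str   # 1. read name
    flag: int    # 2. bitwise flag
    rname: str   # 3. reference sequence name
    pos: int     # 4. 1-based leftmost mapping position
    mapq: int    # 5. mapping quality
    cigar: str   # 6. CIGAR string
    rnext: str   # 7. reference name of the mate
    pnext: int   # 8. position of the mate
    tlen: int    # 9. observed template length
    seq: str     # 10. read sequence
    qual: str    # 11. ASCII-encoded base qualities
    tags: Dict[str, Union[int, str]]  # optional TAG:TYPE:VALUE fields


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (header lines starting with '@' are not handled)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    tags: Dict[str, Union[int, str]] = {}
    for item in fields[11:]:
        tag, type_code, value = item.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


# First record of the example, re-joined with tabs.
line = "\t".join(
    "11695_6 0 chr1 3292760 255 20M * 0 0 AAGAGATCTGGAACCATAGA "
    "DGDFCDGFFGBEFFGFDEEF XA:i:0 MD:Z:20 NM:i:0 XX:i:3984".split()
)
rec = parse_sam_line(line)
```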
• The second field (FLAG) can be used to quickly filter the file
• With Samtools (command line) and the -f and -F options
• Useful webpage:
  • http://broadinstitute.github.io/picard/explain-flags.html
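The FLAG field is a sum of bits, each with a fixed meaning in the SAM specification. The sketch below decodes a flag and mimics the filtering logic of the `-f` (all listed bits must be set) and `-F` (none of the listed bits may be set) options; the helper names are hypothetical and this is not Samtools code.

```python
from typing import List

SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails platform/vendor quality checks",
    0x400: "read is PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def explain_flag(flag: int) -> List[str]:
    """List the meaning of every bit set in the flag."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def passes_filter(flag: int, require: int = 0, exclude: int = 0) -> bool:
    """Same logic as -f (require all bits) and -F (reject any excluded bit)."""
    return (flag & require) == require and (flag & exclude) == 0


meaning_of_99 = explain_flag(99)  # 99 = 1 + 2 + 32 + 64
```

For instance, keeping only mapped reads corresponds to `exclude=0x4`, and keeping only properly paired reads to `require=0x2`.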
BAM format
• BAM stands for Binary Alignment/Map
• Corresponds to the SAM format compressed as BGZF
• Reduces the size of the alignment file about 5-fold
• Not directly readable like the SAM format
  • Requires Samtools
• Best format for sharing alignment files
• Comes with an index file (BAI)
  • Avoids a sequential read of the complete file
Quality controls on aligned data : Standard workflow for NGS analysis
[Figure: a typical NGS workflow, with QC 3 applied to the aligned data]
QC 3 : Which metric to check ?
In practice, how do I validate my alignment?
Be aware of the mapping strategy used
Look at simple descriptive statistics:
– Number of aligned reads
– Coverage/Depth
– Mapping quality
– Number of normal/abnormal pairs for paired-end data
– Strand bias
– ...
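Most of these descriptive statistics are simple aggregates over the alignment records. The sketch below works on a simplified record with only four values and assumes ungapped alignments, so that a read covers `pos` to `pos + length - 1`; a real implementation would interpret the CIGAR string. The `Aln` type and the example values are hypothetical.

```python
from collections import Counter
from typing import Dict, List, NamedTuple

UNMAPPED = 0x4  # SAM flag bit: read unmapped
REVERSE = 0x10  # SAM flag bit: read on the reverse strand


class Aln(NamedTuple):
    flag: int
    pos: int     # 1-based leftmost position
    mapq: int
    length: int  # assumption: ungapped alignment of this many bases


def alignment_summary(alns: List[Aln]) -> Dict[str, float]:
    """Number of aligned reads, mapping quality, depth and strand counts."""
    mapped = [a for a in alns if not a.flag & UNMAPPED]
    depth: Counter = Counter()
    for a in mapped:
        for p in range(a.pos, a.pos + a.length):
            depth[p] += 1
    reverse = sum(1 for a in mapped if a.flag & REVERSE)
    return {
        "total": len(alns),
        "aligned": len(mapped),
        "mean_mapq": sum(a.mapq for a in mapped) / len(mapped) if mapped else 0.0,
        "max_depth": max(depth.values(), default=0),
        "mean_depth_covered": sum(depth.values()) / len(depth) if depth else 0.0,
        "forward": len(mapped) - reverse,
        "reverse": reverse,
    }


# Two overlapping mapped reads on opposite strands and one unmapped read.
alns = [Aln(0, 100, 60, 10), Aln(16, 105, 40, 10), Aln(4, 0, 0, 10)]
summary = alignment_summary(alns)
```

A strong imbalance between the forward and reverse counts at a given locus is one simple indicator of strand bias.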
Paired-end mapping
• Insert-size checking
• % of "All Good" = both reads in the pair have aligned
  • "the pair is properly aligned", meaning that they mapped within a proper distance from each other
• % of "All Bad" = neither the read nor its mate mapped
• % of "Only one read maps" = only one read in a pair is mapped
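These categories can be derived from the SAM flag of one read of each pair, using the "read unmapped" (0x4), "mate unmapped" (0x8) and "proper pair" (0x2) bits. The sketch below is one possible way to compute the percentages; the category labels follow the slide, while the function names and example flags are made up.

```python
from collections import Counter
from typing import Dict, List


def classify_pair(flag: int) -> str:
    """Classify a pair from the flag of its first read."""
    read_unmapped = bool(flag & 0x4)
    mate_unmapped = bool(flag & 0x8)
    if read_unmapped and mate_unmapped:
        return "all bad"
    if read_unmapped or mate_unmapped:
        return "only one read maps"
    if flag & 0x2:
        return "all good, properly aligned"
    return "all good, not properly aligned"


def pair_percentages(first_read_flags: List[int]) -> Dict[str, float]:
    """Percentage of pairs falling in each category."""
    counts = Counter(classify_pair(f) for f in first_read_flags)
    total = len(first_read_flags)
    return {category: 100.0 * n / total for category, n in counts.items()}


# 99 = proper pair, 65 = both mapped but not proper,
# 73 = mate unmapped, 77 = read and mate unmapped.
flags = [99, 99, 65, 73, 77]
percentages = pair_percentages(flags)
```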
NGS Analysis : How can I work with my NGS data ?
• Difficult on a personal computer (lack of resources)
  • 1 alignment = 4 processors + 15 GB RAM (to multiply by the number of samples)
  • Impossible to open the files in software such as a text editor
  • Need a very large storage capacity
  • Data backup administration
• Applications server connected to a computing cluster and storage array:
  • Commercial solutions (CLC Bio, NextGene, ...)
  • Galaxy server: https://galaxy.gustaveroussy.fr/galaxyprod
Data analysis
[Figure: a typical NGS workflow, ending with the data analysis step, which depends on the NGS application]
Data Analyses in Cancer
Chimeric transcript search
Alternative transcripts study
Differential expression study
Methylation study
Detection of genomic variants
Detection of copy-number variation
Chimeric transcripts
Do the tumoral cells express any chimeric transcript?
[Figure: history of the bcr-abl fusion]
Alternative transcripts
Differential expression
Are there genes that are strongly expressed in one kind of tumor but not in the other kind? Can we group tumors according to their expression profiles?
[Figure: clustering differential expression in breast tumours]
Methylome
Is there any difference between DNA methylation in tumors and in normal cells?
How does methylation promote cancer?
Detection of copy-number variations
Are there any copy-number alterations (gain or loss of chromosomal regions, amplifications…) that could explain tumorigenesis?
[Figure: copy-number variations in cancer. MYC and KRAS are amplified.]
Detection of genomic variants
Are there mutational events that are specific to the tumoral genome? Could the tumorigenesis be explained by those? Is there any drug targeting those mutations?
[Figure: pancreas adenocarcinoma: from normal cells to tumoral cells]
Limitations: Detection of genomic variants
Between 1.4 and 8.9% of the variants are technology specific
The reasons why a SNP is not detected by one sequencing technology, whereas it is reported by another, can be broadly divided into three categories:
• Issues related to coverage: These can be further subdivided into complete lack of coverage, low coverage (which is not enough to call a SNP based on predefined criteria), and higher-than-expected coverage (based on a model used to separate SNPs from structural variants and assembly errors) at the candidate location.
• Issues with the alternate allele: Most software tools (including SAMtools and Newbler) require observing the alternate allele at least twice or more, before they consider the location as a potential variant. These can be further subdivided into instances where the alternate allele is not seen at all and others, when the alternate allele is not seen a sufficient number of times.
• Issues with the variant calling: These refer to the situations where the alternate allele is seen a requisite number of times, but the SNP is not called due to other reasons. These reasons may include proximity to many other SNPs, proximity to a high quality indel, existence in a non-uniquely alignable region, and a huge deviation from the expected diploid behavior of the sample for the data aligned using BWA. For the reads aligned using Newbler, the reasons include the location being in a non-uniquely alignable region and other alignment errors that arise due to the unique error-profile of the 454 reads.
We investigated the alignments at the 439,122 locations that were called as putative variants by using 454 and Illumina sequences, but not using SOLiD sequences (Figure 4a i). We assigned each location to a particular category based on the reason why it was not called a SNP. We found that the variant allele was observed in the SOLiD reads in 64% of these cases, but the SNP was filtered away for various reasons. 27% of the locations were filtered away due to a low SNP quality (defined as the Phred-scaled likelihood that the called genotype is identical to the reference), 18% of them were filtered away due to a low RMS (root mean square) mapping quality (reflecting the limitation of shorter reads) and another 19% were filtered away as the variant allele was not seen enough number of times. Coverage related issues (no coverage, too little coverage or more than expected coverage) were responsible for another 19% of the locations. The alternate allele was not seen at all, despite adequate coverage at the site, for the remaining 17% locations.
For the 71,567 locations that were called using the SOLiD sequences (but not by others), we looked at the alignments for both the 454 dataset and the Illumina datasets. At about 15% of these locations (Figure 4a ii), the alternate allele was seen just once in the 454 dataset and at about another 16% of them, the coverage of 454 reads was not enough to call a SNP. For another 21% of the locations the SNP was not called by Newbler, even though the allele was seen multiple times in the pairwise alignments between the reference and the 454 reads, with most of them being associated with homopolymer errors. On the other hand at 25% of these locations the SNP was seen in the Illumina dataset (Figure 4a iii), but it was filtered away due to a lower SNP quality (15%), or because lower mapping quality (9%). Another 14% of these locations did not have sufficient coverage with Illumina reads to be considered in SNP calling. Considering the locations where both 454 and Illumina had little, no, or higher than expected coverage, and where the alternate allele was seen at least once in either 454 or Illumina dataset as true SNPs, we expect 14,707 of the 71,567 locations to be false-positives for the SOLiD calls.
When we looked at the 47,381 locations that were called a SNP using 454 and SOLiD reads, we found that primary reason (at 60% of the locations) these were not called a SNP with Illumina reads had to do with the coverage (Figure 4b i). 57% of the locations were in regions where the coverage was more than expected (signaling a putative structural variant), whereas there was little of no coverage for the remaining 3%. We used a Poisson distribution with the same mean value to calculate the coverage threshold to filter variants, but this data suggests that a gamma distribution with more weight on more tails is probably a better model for Illumina data. The second largest contributor was low SNP quality (22% of the locations), which is the result of an observed deviation from the expectation that both allele should be seen approximately the same number of times on a heterozygous location.
We found 225,981 locations that were called as putative variants using Illumina reads only. Looking at the alignments for the SOLiD reads at those locations (Figure 4b ii), we found that for 22% of them we saw the alternate allele a sufficient number of times, but it was filtered away either due to low RMS mapping quality or a low SNP quality. Another 16% of the locations were […]
Figure 3. Venn diagram showing the overlap in the SNP calls made using data from the three sequencing technologies. We display the sizes of each of the seven categories of overlaps among the variant calls in the three technologies. (a) depicts the overlaps when all substitution calls are used, (b) depicts the overlaps when all calls from Illumina and SOLiD are used but only the high-confidence subset of the 454 dataset is used, and (c) depicts the overlaps when only the variants in the uniquely alignable regions of the reference sequence are used. doi:10.1371/journal.pone.0055089.g003
Comparison of Sequencing Platforms
PLOS ONE | www.plosone.org 4 February 2013 | Volume 8 | Issue 2 | e55089
[Figure: common genomic variants between different variant callers]
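The kind of overlap shown in such Venn diagrams reduces to set operations on the variant calls. The sketch below counts the variants shared by all callers and those specific to each one; the caller names and the variants are entirely hypothetical, and a variant is simplified to a (chromosome, position, reference allele, alternate allele) tuple.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def caller_overlap(calls: Dict[str, Set[Variant]]) -> Dict[str, int]:
    """Count variants shared by all callers and specific to each caller."""
    names = sorted(calls)
    shared = set.intersection(*calls.values()) if calls else set()
    result = {"shared_by_all": len(shared)}
    for name in names:
        # Union of everything reported by the other callers.
        others = set().union(*(calls[o] for o in names if o != name))
        result[f"only_{name}"] = len(calls[name] - others)
    return result


# Hypothetical calls from three hypothetical callers.
v1 = ("chr1", 1000, "A", "G")
v2 = ("chr1", 2000, "C", "T")
v3 = ("chr2", 500, "G", "A")
v4 = ("chr3", 42, "T", "C")
calls = {
    "caller_A": {v1, v2, v3},
    "caller_B": {v1, v2},
    "caller_C": {v1, v4},
}
overlap = caller_overlap(calls)
```

Variants reported by a single caller are good candidates for manual review or orthogonal validation before drawing conclusions.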
Conclusion
• Nowadays, NGS is widely used in cancer centers in order to categorize cancers and link patients with personalized treatments (Precision Medicine)
• NGS is also used in cancer research, in order to discover new oncogenetic mechanisms, to understand the way a treatment works, to link biological and genetic characteristics…
• Due to technical and “how-the-universe-works”-related issues, using NGS might not solve your problems. It is important to know that the technique is limited:
  • A) by the question you asked at first. If a cancer cannot be explained by mutational events, it might be explained by other mechanisms. But still, nothing will be found in the data.
  • B) by technical issues. Sequencers and software are prone to errors. Statistically, there will be at least one error in your analysis. You can often reduce the impact of this limitation by making biological and technical replicates.
Galaxy: a web-based genome analysis platform
29 January 2015 NGS & Cancer Training - Exome Analyses
• Galaxy is an open-source framework for integrating various computational tools and databases into a cohesive workspace
  • https://main.g2.bx.psu.edu/
• A web-based service that provides and integrates many popular tools and resources for comparative genomics
• A completely self-contained application for building your own Galaxy-style sites
Galaxy: the instant web-based tool and data resource integration platform
• Open-source downloadable package that can be deployed in individual labs
• Modularized
  • Add new tools
  • Integrate new data sources
  • Easy to plug in your own components
• Straightforward to run your own private Galaxy server
Galaxy: the one-stop shop for genome analysis
• Analyze
  • Retrieve shared data between Galaxy users or upload your own
  • Interactively manipulate genomic data with a comprehensive and expanding best-practices toolset
  • Galaxy is designed to work with many different datatypes.
    • http://wiki.galaxyproject.org/Learn/Datatypes
• Visualize
  • Visual analysis environment of your data, your analysis workflows.
• Publish and Share
  • Results and step-by-step analysis record (Data Libraries and Histories)
  • Customizable pipelines (Workflows)
  • Complete protocols/documentation (Pages)
https://galaxy.gustaveroussy.fr/galaxyprod
Data libraries
• Datasets are accessible from Galaxy or for download.
History
• Histories record all the steps in the process and the settings used.
• Histories can be imported into your session and rerun as is or modified.
Workflows
• Workflows specify the steps in a process (a suite of ordered tools).
• Workflows are analyses that are meant to be run, each time with different user-provided datasets.
User account
• Galaxy public Main or Test instances
  • An account is not required to access them
  • But if one is used, the data quota is increased and full functionality across sessions opens up, such as naming, saving, sharing, and publishing Galaxy objects (Histories, Workflows, Datasets, Pages).
• Galaxy @ GR: https://galaxy.gustaveroussy.fr/galaxyprod
  • An account is required to access it
  • Full functionality across sessions opens up, such as naming, saving, sharing, and publishing Galaxy objects (Histories, Workflows, Datasets, Pages).