NGS, Cancer and Bioinformatics

20/10/15 Yannick Boursin

NGS and Clinical Oncology

• NGS in hereditary cancer genome testing
  • BRCA1/2 (breast/ovarian cancer)
  • XPC (melanoma)
  • ERCC1 (colorectal cancer)

• NGS for personalized cancer treatment
  • Clinical trials: MOSCATO (GR), SAFIR (GR), SHIVA (Curie), …
  • Ipilimumab (anti-CTLA4), Nivolumab (anti-PD1), Trastuzumab (anti-HER2), Cetuximab (anti-EGFR)

• Detection of chimeric transcripts
  • Chronic Myeloid Leukemia: Philadelphia chromosome (BCR/ABL)
  • Non-Small-Cell Lung Cancer: EML4-ALK


NGS and Oncology


07-09th April 2014 NGS and Bioinformatics

NGS is now widely used as:

• A research tool to screen large numbers of cancer samples
• A clinical/diagnostic tool in daily practice

These projects require dedicated bioinformatics integration to access and analyze this huge amount of data.

Why do we need computers for NGS?

Figure: Sequencing data size evolution.

Needs to address:

• Store petabytes of data (1 PB is 1000 TB).
• Share data around the world through networks.
• Analyze huge amounts of data with complex algorithms.

Bioinformatics and Oncology

• Problem: finding, extracting, and presenting relevant information.
• Partial solution: designing workflows in order to ease data analysis.

Interdisciplinary collaboration

Bioinformatics acts as a hub between the different fields. Trust between partners is needed, and training is needed as well so that the partners understand each other efficiently.

Diagram: bioinformatics at the center, connected to:

• Biology knowledge: knowledge modeling
• Technological platforms: sequencing, microarrays, immunochemistry, …
• Bioinformatics: raw data storage, integration of biological and clinical data, quality control, data analysis
• Clinical biostatistics: report for biological/medical staff
• Medical staff: clinicians, specialists, …
• Biological staff: biologists, geneticists, …

Standard Workflow for NGS Analysis

A typical NGS workflow: Sequencing & Primary Analysis → Raw Reads → Reads Cleaning → Reads Mapping → Data Analysis (depends on the NGS application).

Quality-control checkpoints along the workflow: QC 1, QC 2, QC 3.

Step 1: Quality Check and improvements

NGS data: what do they look like?

A raw data file (.fastq, .sff, .fa, .csfasta/.qual) contains millions of short reads of the same size (SOLiD, HiSeq) or of different sizes (Ion PGM/Proton).

Figure: Enhanced view of the reads in a FASTQ file.

FASTQ format (base-space)

• 1 sequence = 1 read = 4 lines in the file
• First line = sequence identifier
• Fourth line = quality
• ASCII encoded (reduces the file size)
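The four-line layout described above can be sketched as a small reader. This is a minimal illustration, not the parser of any particular tool: the names `FastqRecord` and `read_fastq` and the two example reads are hypothetical, and the sketch assumes unwrapped records (exactly four lines each), with the sequence on line 2 and a `+` separator on line 3.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading '@'
    sequence: str    # line 2
    quality: str     # line 4, one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines."""
    while True:
        header = handle.readline()
        if header == "":
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:].strip(), sequence, quality)


# Hypothetical two-read file, built in memory for illustration.
example = io.StringIO(
    "@read1\nACTGATTAGT\n+\nIIIIIHHHGG\n"
    "@read2\nGATCGATGCA\n+\nIIIIIIII##\n"
)
records = list(read_fastq(example))
for record in records:
    print(record.identifier, len(record.sequence))
```

Reading lazily with a generator matters here because real files hold millions of reads and should not be loaded into memory at once.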

Sequence quality encoding

Phred scores Q: Q scores are logarithmically related to the base-calling error probability P:

Q = -10 log10(P)

For example, Q = 20 corresponds to P = 0.01 (one expected error in 100 base calls) and Q = 30 to P = 0.001.
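As a rough illustration of the encoding, the sketch below converts between Q and P and decodes a quality string. The function names are hypothetical, and the ASCII offset of 33 is an assumption (it is the Sanger / recent Illumina convention; some older files use 64).

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Invert Q = -10 * log10(P), giving P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Apply Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality_string(quality: str, offset: int = 33) -> list[int]:
    """Turn each ASCII character into a Phred score.

    offset=33 is an assumption about the file; check your platform.
    """
    return [ord(char) - offset for char in quality]


scores = decode_quality_string("II5+#")
print(scores)
print([phred_to_error_probability(q) for q in scores])
```

Storing one character per base instead of a multi-digit number is what keeps the quality line the same length as the sequence line.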

Quality controls on raw reads: let's start after sequencing

ACTGATTAGTCTGAATTAGANNGATAGGAT
GATCGATGCATAGCGATCAGCATCGATACG
CGGCGCTCCGCTCTCGAAACTAGCACTGAC
AGCATCAGGATCTACGATCTAGCGAACTGAC
ACTAGCTACTATCGAGCGAGCGATCATCGAC
ACTAGGCATCGGCATCACGGACNNNNNNNN
ACTAGCTATCGAGCTATCAGCGAGCATCTATC
CTGACTACTATCGAGCGAGCTACTAACTGAC
ACTACTTACGACATCGAGGTTAGGAGCATCA
ACTANNGACTAGGAATTAGCTACTGAGCTAC
ACTAGCAGCTATATGAGCTACTAGCACTGAC
ACTATCAGCTAGCGCTTCAGCATTACCGT
NNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Raw reads

A raw read is characterized by three parameters:
• Its length
• Its sequence
• Its per-base quality

Why look at sequencing quality?

• Quality of data is very important for various downstream analyses:
  • Sequence assembly or mapping
  • Variant detection
  • Gene expression studies
  • ...
• If the quality of the data is poor:
  • Try to find a reason
  • Can we correct/improve the quality?
  • Poor quality may lead to erroneous conclusions

Quality controls on raw reads: which metrics to check ?

Mainly:
• Quality score per base and over the reads

But also:
• Read length distribution
• Sequence content per base and % of GC
• K-mer content
• Overrepresented sequences
• Duplicated reads
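Several of the metrics listed above reduce to simple counting over the read sequences. The sketch below is only an illustration under stated assumptions: `read_qc_metrics` is a hypothetical name, reads are plain uppercase strings, and "duplicated" means exact sequence identity, which is a simplification of what dedicated QC tools report.

```python
from collections import Counter


def read_qc_metrics(reads: list[str]) -> dict:
    """Compute a few descriptive metrics over raw read sequences."""
    total_bases = sum(len(read) for read in reads)
    if total_bases == 0:
        raise ValueError("no bases to summarize")
    gc_bases = sum(read.count("G") + read.count("C") for read in reads)
    n_bases = sum(read.count("N") for read in reads)
    copies = Counter(reads)
    return {
        # how many reads have each length
        "length_distribution": dict(Counter(len(read) for read in reads)),
        "gc_percent": 100 * gc_bases / total_bases,
        "n_percent": 100 * n_bases / total_bases,
        # every copy beyond the first counts as a duplicate
        "duplicated_reads": sum(count - 1 for count in copies.values()),
    }


# Hypothetical toy input.
metrics = read_qc_metrics(["ACGT", "ACGT", "GGCC", "NNAT"])
print(metrics)
```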

Quality scores

• Per base (box-and-whisker plot): to see whether base calls fall into low quality (commonly towards the end of a read)
• Per sequence (mean quality distribution): to see if a subset of your sequences has universally low quality values

Figures: quality score plots for PGM runs A and B and for Illumina runs C and D.
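The two views of quality described above (per base and per sequence) can be sketched as follows. This is a simplified illustration: it computes means only, whereas the plots show full distributions, and it assumes Phred+33 encoded quality strings. The function names and the three example strings are hypothetical.

```python
from statistics import mean


def per_base_mean_quality(qualities: list[str], offset: int = 33) -> list[float]:
    """Mean Phred score at each read position (reads may differ in length)."""
    longest = max(len(quality) for quality in qualities)
    result = []
    for position in range(longest):
        # Only reads long enough to reach this position contribute.
        column = [
            ord(quality[position]) - offset
            for quality in qualities
            if len(quality) > position
        ]
        result.append(mean(column))
    return result


def per_read_mean_quality(qualities: list[str], offset: int = 33) -> list[float]:
    """Mean Phred score of each read, to spot universally poor reads."""
    return [mean([ord(char) - offset for char in quality]) for quality in qualities]


# Hypothetical quality strings of different lengths.
example_qualities = ["IIII", "II5#", "I5"]
per_base = per_base_mean_quality(example_qualities)
per_read = per_read_mean_quality(example_qualities)
print(per_base)
print(per_read)
```

A falling per-base curve toward the last positions is the pattern the first bullet refers to; a cluster of low per-read means is the pattern the second bullet refers to.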

Quality control on raw reads: adapter removal

• An adapter is a small piece of known DNA located at the end of the reads
• Adapter roles:
  • Attach the read to the sequencer flow cell
  • Allow a specific PCR enrichment of reads carrying adapters
  • Used in multiplex sequencing (samples in mix)
• Available tools to trim adapters:
  • Cutadapt
  • SeqPrep
  • RmAdapter

Figure: adapters in blue, informative part of the read in orange.
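To make the idea of trimming concrete, here is a minimal sketch of 3' adapter removal. It is not how the tools listed above work internally: it only handles exact matches (no sequencing errors), and the adapter sequence `AGATCG`, the `min_overlap` parameter, and the function name are hypothetical choices for illustration.

```python
def trim_3prime_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove an exact-match adapter from the 3' end of a read."""
    # Case 1: the full adapter occurs in the read; keep what precedes it.
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    # Case 2: the read ends with a prefix of the adapter
    # (the read stopped before the adapter was fully sequenced).
    for overlap in range(min(len(adapter), len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[: len(read) - overlap]
    # No adapter found: return the read unchanged.
    return read


ADAPTER = "AGATCG"  # hypothetical adapter sequence
print(trim_3prime_adapter("TTTTCCCCAGATCGGG", ADAPTER))
print(trim_3prime_adapter("TTTTCCCCAGAT", ADAPTER))
print(trim_3prime_adapter("TTTTCCCC", ADAPTER))
```

The `min_overlap` threshold is a trade-off: a very short overlap removes more partial adapters but also trims genuine bases that happen to match by chance.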

Quality controls on raw reads: let's start after sequencing

A first quality control of raw reads is mandatory and can be established according to the application ('N', adapter sequences, barcode, contamination, etc.)

ACTGATTAGTCTGAATTAGANNGATAGGAT
GATCGATGCATAGCGATCAGCATCGATACG
CGGCGCTCCGCTCTCGAAACTAGCATCGAC
AGCATCAGGATCTACGATCTAGCGAACTGAC
ACTAGCTACTATCGAGCGAGCGATCATCGAC
ACTAGGCATCGGCATCACGGACNNNNNNNN
ACTAGCTATCGAGCTATCAGCGAGCATCTATC
CTGACTACTATCGAGCGAGCTACTAACTGAC
ACTACTTACGACATCGAGGTTAGGAGCATCA
ACTANNGACTAGGAATTAGCTACTGAGCTAC
ACTAGCAGCTATATGAGCTACTAGCACTGAC
ACTATCAGCTAGCGCTTCAGCATTACCGT
NNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Processed reads: in the original figure, blue parts are to be kept, green and red parts are to be removed. Fragments highlighted in the figure: NN, NNNNNNNN, ACTGAC.

Quality controls : Standard Workflow for NGS Analysis

A typical NGS workflow: Sequencing & Primary Analysis → Raw Reads → Reads Cleaning → Reads Mapping → Data Analysis, with quality-control checkpoints QC 1, QC 2, QC 3.

Step 2: Short Reads Alignment

Reads alignment - Vocabulary

• Alignment (mapping): read alignment aims at transforming the information from single reads into an organized and reduced set of information.
• Mismatch: inconsistency between two nucleotides.
• Reference genome: a known sequence, supposed to be as close as possible to the input genome, which is used as an anchor to organize the information from single reads.
• Gap: bridge within the read alignment (i.e. small insertion/deletion).
• Mappability: uniqueness of a region (repeated region = low mappability, unique region = good mappability).
• Indels: insertions/deletions relative to the reference genome.

Reads alignment – Two strategies

Read alignment aims at transforming the information from single reads into an organized and reduced set of information. Two strategies can be applied:

- De novo read assembly: used when no reference genome is available. It aims at reconstructing long scaffolds from the information in single reads.
- Alignment on a reference genome: the reads are directly compared to a known reference genome.

Alignment on a reference genome

The reference genome is a known sequence, supposed to be as close as possible to the input genome, which is used as an anchor to organize the information from single reads.

Figure: Alignment of reads against the reference genome (short reads stacked along the reference genome sequence).

Figure: Alignment of reads against the reference genome, with a homozygous polymorphism (T/C) visible in the aligned reads.

Alignment on a reference genome - Challenges

New alignment algorithms must address the requirements and characteristics of NGS reads:
– Millions of reads per run (30x genome coverage)
– Reads of different sizes (35 bp to 200 bp)
– Different types of reads (single-end, paired-end, mate-pair, etc.)
– Base-calling quality factors
– Sequencing errors (~1%)
– Repetitive regions
– Sequenced organism vs. reference genome
– Must adjust to evolving sequencing technologies and data formats

Alignment on a reference genome – Bioinformatics tools

Figure: Mappers timeline (since 2001). Fonseca N.A. et al. Bioinformatics 2012;28:3169-3177

Finding the best alignment - Rationale

Given a reference and a set of reads, report at least one “good” local alignment for each read if one exists.

What is “good”? For now, we concentrate on:
– Fewer mismatches is better
– Failing to align a low-quality base is better than failing to align a high-quality base

Figure: two pairs of candidate alignments of a short read against the reference (… T G A T C A T A ...), each pair marked "is better than" to illustrate the two criteria above.

The choice is based on a scoring system, e.g. score for a match (1), mismatch (MM) penalty (3), gap open penalty (5), gap extension penalty (2). The best alignment is the one with the highest score.
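The scoring system can be made concrete with a short sketch that scores an already-aligned read against the reference using the values quoted on the slide (match 1, mismatch penalty 3, gap open 5, gap extension 2). Two points are assumptions of this sketch rather than something the slide states: gaps are written as `-`, and a gap of length n costs one opening penalty plus n extension penalties (conventions differ between aligners). The function name is hypothetical.

```python
MATCH_SCORE = 1
MISMATCH_PENALTY = 3
GAP_OPEN_PENALTY = 5
GAP_EXTENSION_PENALTY = 2


def alignment_score(aligned_read: str, aligned_reference: str) -> int:
    """Score two equal-length aligned strings; '-' marks a gap."""
    if len(aligned_read) != len(aligned_reference):
        raise ValueError("aligned strings must have the same length")
    score = 0
    previous_gap_side = None
    for read_base, reference_base in zip(aligned_read, aligned_reference):
        if read_base == "-":
            gap_side = "read"
        elif reference_base == "-":
            gap_side = "reference"
        else:
            gap_side = None
        if gap_side is None:
            score += MATCH_SCORE if read_base == reference_base else -MISMATCH_PENALTY
        else:
            # Assumption: open penalty once per gap, extension for every gap column.
            if gap_side != previous_gap_side:
                score -= GAP_OPEN_PENALTY
            score -= GAP_EXTENSION_PENALTY
        previous_gap_side = gap_side
    return score


candidates = {
    "perfect": ("GATCAT", "GATCAT"),
    "one mismatch": ("GATCAA", "GATCAT"),
    "one-base gap": ("GAT-AT", "GATCAT"),
}
for name, (read, reference) in candidates.items():
    print(name, alignment_score(read, reference))
```

With these values a single mismatch is cheaper than a single-base gap, so the aligner prefers substitutions unless a gap rescues several matches. Weighting the mismatch penalty by base quality, as the second criterion suggests, is not implemented here.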

Alignment key parameters - Repeats

Approximately 50% of the human genome consists of repeats.

Treangen T.J. and Salzberg S.L. 2012. Nature Reviews Genetics 13, 36-46

Repeats lie in close proximity to genes: intergenic and intragenic positions.

Figure: BRCA2, a mosaic of repeated regions.

Alignment key parameters – Repeats – 3 strategies

-1- Report only unique alignments
-2- Report best alignments and randomly assign reads across equally good loci
-3- Report all (best) alignments

Figure: reads mapping to two repeat copies A and B, shown under strategies -1-, -2- and -3-.

Treangen T.J. and Salzberg S.L. 2012. Nature Reviews Genetics 13, 36-46
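The three strategies can be sketched as a single function applied to the candidate hits of one read. This is an illustration only: the `Hit` type, the function name, and the interpretation of "unique" as "exactly one best-scoring locus" are assumptions, and real aligners expose these behaviors through their own options.

```python
import random
from typing import NamedTuple


class Hit(NamedTuple):
    locus: str
    score: int


def report_alignments(hits: list[Hit], strategy: int, rng: random.Random) -> list[Hit]:
    """Apply one of the three multi-read strategies to one read's hits."""
    if not hits:
        return []
    best_score = max(hit.score for hit in hits)
    best = [hit for hit in hits if hit.score == best_score]
    if strategy == 1:
        # Report only if a single best locus exists; otherwise discard the read.
        return best if len(best) == 1 else []
    if strategy == 2:
        # Pick one of the equally good loci at random.
        return [rng.choice(best)]
    if strategy == 3:
        # Report every best locus.
        return best
    raise ValueError("strategy must be 1, 2 or 3")


rng = random.Random(0)  # seeded so the example is reproducible
repeat_hits = [Hit("A", 20), Hit("B", 20)]  # read falls in a two-copy repeat
for strategy in (1, 2, 3):
    print(strategy, report_alignments(repeat_hits, strategy, rng))
```

Strategy 1 leaves repeats uncovered, strategy 2 spreads coverage across copies at the cost of arbitrary placement, and strategy 3 keeps all information but counts the same read several times.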

Alignment key parameters – Using single or paired-end reads ?

The type of sequencing (i.e. single or paired-end reads) is often driven by the application. Example: finding large indels, genomic rearrangements, ... However, in most cases, the pair information can improve the mapping specificity.

- Single-end alignment: repeated sequence
- Paired-end alignment: unique sequence

Figure: the same read (ACGACTC) aligned as a single-end read (repeated sequence) and as part of a pair (unique sequence).

Alignment of reads against reference genome

Key points
• The alignment is a crucial step of the NGS analysis.
• The reference genome has to be carefully chosen.
• The mappability of the region of interest has to be taken into account (primer design).
• The scoring method has to be chosen according to the sequencing error rate and the quality of the raw reads.
• The alignment parameters have to be set properly.

Limitations of Alignment Tools

Even if we now have some good tools to align reads on a reference genome, several issues are still important:
- Homopolymer mapping
- Efficient alignment of small indels
- Alignment on several genomes
- Alignment on repeated sequences
- ...

Alignment formats

• A lot of formats exist:
  • SAM
  • BAM
  • ELAND (Illumina specific)
  • MAQ map
  • …

SAM and BAM are now the standard for aligned data.

SAM format

• SAM stands for Sequence Alignment/Map
• Tab-delimited text file
• 1 line per read
• Each line is composed of at least 11 fields

Example records (fields are tab-separated in the actual file):

11695_6 0 chr1 3292760 255 20M * 0 0 AAGAGATCTGGAACCATAGA DGDFCDGFFGBEFFGFDEEF XA:i:0 MD:Z:20 NM:i:0 XX:i:3984
9985_1 0 chr1 3292761 255 19M * 0 0 AGAGATCTGGAACCATAGA IIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:19 NM:i:0 XX:i:3990
4226_1 0 chr1 3296594 255 22M * 0 0 TCTGCAAGGCAAAAGACACTGT GHHHHHGHGHHHGHHHHBHBGG XA:i:0 MD:Z:22 NM:i:0 XX:i:4194
7001_1 0 chr1 3328828 255 20M * 0 0 AAGAAAGAGAACTTCAGACC GGGG+GGGGGGIIIIIBHII XA:i:0 MD:Z:20 NM:i:0 XX:i:2357
1042_1 0 chr1 3334731 255 21M * 0 0 GGGACTCAGCAGAACTTAGGA ?@GGGDGGGG>DDGGGGGGDB XA:i:0 MD:Z:21 NM:i:0 XX:i:1027
14647_1 0 chr1 3334756 255 23M * 0 0 AGTCTGAACAGGTTAGAGGGTGC IIIIIIEGIHIGID<DBDGDBGB XA:i:0 MD:Z:23 NM:i:0 XX:i:1910
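The 11 mandatory fields can be illustrated by parsing the first record shown above. This is a minimal sketch rather than a replacement for Samtools or a dedicated library: the names `SamRecord` and `parse_sam_line` are hypothetical, header lines (starting with `@`) are not handled, and optional tags are only converted when their type is `i` (integer). Field names follow the SAM specification.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise FLAG
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str
    qual: str
    tags: dict   # optional TAG:TYPE:VALUE fields


def parse_sam_line(line: str) -> SamRecord:
    """Split one tab-delimited alignment line into its fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    tags = {}
    for item in fields[11:]:
        name, value_type, value = item.split(":", 2)
        tags[name] = int(value) if value_type == "i" else value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


# First example record from the slide, rebuilt with real tab separators.
line = "\t".join([
    "11695_6", "0", "chr1", "3292760", "255", "20M", "*", "0", "0",
    "AAGAGATCTGGAACCATAGA", "DGDFCDGFFGBEFFGFDEEF",
    "XA:i:0", "MD:Z:20", "NM:i:0", "XX:i:3984",
])
record = parse_sam_line(line)
print(record.rname, record.pos, record.cigar, record.tags)
```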

• The second field (the FLAG) can be used to quickly filter the file
• With Samtools (command line) and the -f and -F options
• Useful webpages:
  • http://picard.sourceforge.net/explain-flags.html
  • http://broadinstitute.github.io/picard/explain-flags.html
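The FLAG is a sum of bits, which is why it lends itself to fast filtering. The sketch below decodes a FLAG and mimics the filtering logic in Python: `required` plays the role of `-f` (all of these bits must be set) and `excluded` the role of `-F` (none of these bits may be set). The function names are hypothetical; the bit meanings are those of the SAM specification.

```python
FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails quality checks",
    0x400: "read is PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def explain_flag(flag: int) -> list[str]:
    """List the properties encoded in a FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def passes_filter(flag: int, required: int = 0, excluded: int = 0) -> bool:
    """Keep a record if all `required` bits are set and no `excluded` bit is."""
    return (flag & required) == required and (flag & excluded) == 0


print(explain_flag(99))
print(passes_filter(99, required=0x2))   # properly paired reads only
print(passes_filter(4, excluded=0x4))    # drop unmapped reads
```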

BAM format

• BAM stands for Binary Alignment/Map
• Corresponds to the SAM format compressed with BGZF
• Reduces the size of the alignment file about 5-fold
• Not directly human-readable, unlike SAM
• Requires Samtools
• Best format for sharing alignment files
• Coupled with an index file (BAI)
  • Avoids a sequential read of the complete file

Quality controls on aligned data : Standard workflow for NGS analysis

A typical NGS workflow: Sequencing & Primary Analysis → Raw Reads → Reads Cleaning → Reads Mapping → Data Analysis, with quality-control checkpoints QC 1, QC 2, QC 3.

QC 3 : Which metric to check ?

In practice, how do I validate my alignment?

Be aware of the mapping strategy used, and look at simple descriptive statistics:
– Number of aligned reads
– Coverage/depth
– Mapping quality
– Number of normal/abnormal pairs for paired-end data
– Strand bias
– ...
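Coverage and depth, two of the statistics listed above, can be sketched from alignment start positions and lengths. The sketch makes simplifying assumptions that are not in the slides: alignments are ungapped (as with the `20M`-style CIGAR strings shown earlier), positions are 1-based and inclusive, and the function names and toy coordinates are hypothetical.

```python
def depth_per_position(
    alignments: list[tuple[int, int]], region_start: int, region_end: int
) -> list[int]:
    """Count how many alignments cover each position of a region.

    Each alignment is (start, length), 1-based and ungapped.
    """
    depth = [0] * (region_end - region_start + 1)
    for start, length in alignments:
        first = max(start, region_start)
        last = min(start + length - 1, region_end)
        for position in range(first, last + 1):
            depth[position - region_start] += 1
    return depth


def summarize_depth(depth: list[int]) -> dict:
    """Mean depth and breadth of coverage (% of positions seen at least once)."""
    covered = sum(1 for value in depth if value > 0)
    return {
        "mean_depth": sum(depth) / len(depth),
        "breadth_percent": 100 * covered / len(depth),
    }


# Two hypothetical 4-base alignments over an 8-base region.
depth = depth_per_position([(1, 4), (3, 4)], region_start=1, region_end=8)
summary = summarize_depth(depth)
print(depth)
print(summary)
```

Looking at both numbers together helps: a high mean depth with low breadth points to reads piling up in a few places, for example in repeats.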

Paired-end mapping

• Insert-size checking
• % of "All Good" = both reads in the pair have aligned
  • "The pair is properly aligned" means that the reads mapped within a proper distance from each other
• % of "All Bad" = neither the read nor its mate mapped
• % of "Only one read maps" = only one read in a pair is mapped
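These categories can be derived from the SAM FLAG of each record. The sketch below is one possible reading of the slide, not the definition used by any specific QC tool: it reserves "all good" for pairs flagged as properly paired and adds a separate label for pairs where both reads mapped but not at a proper distance. Function names and the example FLAG values are hypothetical.

```python
from collections import Counter

PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify_pair(flag: int) -> str:
    """Classify a paired-end read from the FLAG of one of its records."""
    if not flag & PAIRED:
        return "not paired"
    read_mapped = not flag & UNMAPPED
    mate_mapped = not flag & MATE_UNMAPPED
    if read_mapped and mate_mapped:
        # Assumption: "all good" also requires the proper-pair bit.
        return "all good" if flag & PROPER_PAIR else "both mapped, not properly paired"
    if not read_mapped and not mate_mapped:
        return "all bad"
    return "only one read maps"


def pair_percentages(flags: list[int]) -> dict:
    """Percentage of records falling in each category."""
    counts = Counter(classify_pair(flag) for flag in flags)
    return {label: 100 * count / len(flags) for label, count in counts.items()}


# Hypothetical FLAG values for four records.
example_flags = [99, 99, 77, 73]
print(pair_percentages(example_flags))
```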

NGS Analysis : How can I work with my NGS data ?

• Difficult on a personal computer (lack of resources)
  • 1 alignment = 4 processors + 15 GB RAM (to be multiplied by the number of samples)
  • Impossible to open the files in software such as a text editor
  • Need for a very large storage capacity
  • Data backup administration
• Applications server connected to a computing cluster and storage array:
  • Commercial solutions (CLC Bio, NextGene, ...)
  • Galaxy server: https://galaxy.gustaveroussy.fr/galaxyprod

Data analysis

A typical NGS workflow: Sequencing & Primary Analysis → Raw Reads → Reads Cleaning → Reads Mapping → Data Analysis, with quality-control checkpoints QC 1, QC 2, QC 3.

Data Analyses in Cancer

• Chimeric transcript search
• Alternative transcript study
• Differential expression study
• Methylation study
• Detection of genomic variants
• Detection of copy-number variation

Chimeric transcripts

Do the tumor cells express any chimeric transcript?

Figure: History of the BCR-ABL fusion.

Alternative transcripts

Differential expression

Are there genes that are strongly expressed in one kind of tumor but not in the other? Can we group tumors according to their expression profiles?

Figure: Clustering differential expression in breast tumours.

Methylome

Is there any difference between DNA methylation in tumors and in normal cells?

How does methylation promote cancer?

Detection of copy-number variations

Are there any copy-number alterations (gain or loss of chromosomal regions, amplifications…) that could explain tumorigenesis?

Figure: Copy-number variations in cancer. MYC and KRAS are amplified.

Detection of genomic variants

Are there mutational events that are specific to the tumor genome? Could the tumorigenesis be explained by them? Is there any drug targeting those mutations?

Figure: Pancreatic adenocarcinoma, from normal cells to tumor cells.

Limitations: Detection of genomic variants

Between 1.4 and 8.9% of the variants are technology specific.

The reasons why a SNP is not detected by one sequencing technology, whereas it is reported by another, can be broadly divided into three categories:

• Issues related to coverage: These can be further subdivided into complete lack of coverage, low coverage (which is not enough to call a SNP based on predefined criteria), and higher-than-expected coverage (based on a model used to separate SNPs from structural variants and assembly errors) at the candidate location.

• Issues with the alternate allele: Most software tools (including SAMtools and Newbler) require observing the alternate allele at least twice or more, before they consider the location as a potential variant. These can be further subdivided into instances where the alternate allele is not seen at all and others, when the alternate allele is not seen a sufficient number of times.

• Issues with the variant calling: These refer to the situations where the alternate allele is seen a requisite number of times, but the SNP is not called due to other reasons. These reasons may include proximity to many other SNPs, proximity to a high quality indel, existence in a non-uniquely alignable region, and a huge deviation from the expected diploid behavior of the sample for the data aligned using BWA. For the reads aligned using Newbler, the reasons include the location being in a non-uniquely alignable region and other alignment errors that arise due to the unique error-profile of the 454 reads.

We investigated the alignments at the 439,122 locations that were called as putative variants by using 454 and Illumina sequences, but not using SOLiD sequences (Figure 4a i). We assigned each location to a particular category based on the reason why it was not called a SNP. We found that the variant allele was observed in the SOLiD reads in 64% of these cases, but the SNP was filtered away for various reasons. 27% of the locations were filtered away due to a low SNP quality (defined as the Phred-scaled likelihood that the called genotype is identical to the reference), 18% of them were filtered away due to a low RMS (root mean square) mapping quality (reflecting the limitation of shorter reads) and another 19% were filtered away as the variant allele was not seen enough number of times. Coverage related issues (no coverage, too little coverage or more than expected coverage) were responsible for another 19% of the locations. The alternate allele was not seen at all, despite adequate coverage at the site, for the remaining 17% locations.

For the 71,567 locations that were called using the SOLiD sequences (but not by others), we looked at the alignments for both the 454 dataset and the Illumina datasets. At about 15% of these locations (Figure 4a ii), the alternate allele was seen just once in the 454 dataset and at about another 16% of them, the coverage of 454 reads was not enough to call a SNP. For another 21% of the locations the SNP was not called by Newbler, even though the allele was seen multiple times in the pairwise alignments between the reference and the 454 reads, with most of them being associated with homopolymer errors. On the other hand at 25% of these locations the SNP was seen in the Illumina dataset (Figure 4a iii), but it was filtered away due to a lower SNP quality (15%), or because lower mapping quality (9%). Another 14% of these locations did not have sufficient coverage with Illumina reads to be considered in SNP calling. Considering the locations where both 454 and Illumina had little, no, or higher than expected coverage, and where the alternate allele was seen at least once in either 454 or Illumina dataset as true SNPs, we expect 14,707 of the 71,567 locations to be false-positives for the SOLiD calls.

When we looked at the 47,381 locations that were called a SNP using 454 and SOLiD reads, we found that primary reason (at 60% of the locations) these were not called a SNP with Illumina reads had to do with the coverage (Figure 4b i). 57% of the locations were in regions where the coverage was more than expected (signaling a putative structural variant), whereas there was little or no coverage for the remaining 3%. We used a Poisson distribution with the same mean value to calculate the coverage threshold to filter variants, but this data suggests that a gamma distribution with more weight on more tails is probably a better model for Illumina data. The second largest contributor was low SNP quality (22% of the locations), which is the result of an observed deviation from the expectation that both allele should be seen approximately the same number of times on a heterozygous location.

We found 225,981 locations that were called as putative variants using Illumina reads only. Looking at the alignments for the SOLiD reads at those locations (Figure 4b ii), we found that for 22% of them we saw the alternate allele a sufficient number of times, but it was filtered away either due to low RMS mapping quality or a low SNP quality. Another 16% of the locations were

Figure 3. Venn diagram showing the overlap in the SNP calls made using data from the three sequencing technologies. We display the sizes of each of the seven categories of overlaps among the variant calls in the three technologies. (a) depicts the overlaps when all substitution calls are used, (b) depicts the overlaps when all calls from Illumina and SOLiD are used but only the high-confidence subset of the 454 dataset is used, and (c) depicts the overlaps when only the variants in the uniquely alignable regions of the reference sequence are used. doi:10.1371/journal.pone.0055089.g003

Source: Comparison of Sequencing Platforms. PLOS ONE, February 2013, Volume 8, Issue 2, e55089.

Limitations: Detection of genomic variants

Figure: Common genomic variants between different variant callers.

Conclusion

• Nowadays, NGS is widely used in cancer centers in order to categorize cancers and link patients with personalized treatments (Precision Medicine).
• NGS is also used in cancer research, in order to discover new oncogenetic mechanisms, to understand the way a treatment works, to link biological and genetic characteristics…
• Due to technical and “how-the-universe-works”-related issues, using NGS might not solve your problems. It is important to know that the technique is limited:
  • A) by the question you asked at first. If a cancer cannot be explained by mutational events, it might be explained by other mechanisms. But still, nothing will be found in the data.
  • B) by technical issues. Sequencers and software are prone to errors. Statistically, there will be at least one error in your analysis. You can often limit the impact of this limitation by making biological and technical replicates.

Galaxy: a web-based genome analysis platform

29 January 2015, NGS & Cancer training: exome analyses

• Galaxy is an open-source framework for integrating various computational tools and databases into a cohesive workspace
• https://main.g2.bx.psu.edu/
• A web-based service that provides and integrates many popular tools and resources for comparative genomics
• A completely self-contained application for building your own Galaxy-style sites

Galaxy: the instant web-based tool and data resource integration platform

• Open-source downloadable package that can be deployed in individual labs
• Modularized
  • Add new tools
  • Integrate new data sources
  • Easy to plug in your own components
• Straightforward to run your own private Galaxy server

Galaxy: the one-stop shop for genome analysis

• Analyze
  • Retrieve shared data between Galaxy users or upload your own
  • Interactively manipulate genomic data with a comprehensive and expanding best-practices toolset
  • Galaxy is designed to work with many different data types.
  • http://wiki.galaxyproject.org/Learn/Datatypes
• Visualize
  • Visual analysis environment for your data and your analysis workflows.
• Publish and Share
  • Results and step-by-step analysis record (Data Libraries and Histories)
  • Customizable pipelines (Workflows)
  • Complete protocols/documentation (Pages)

https://galaxy.gustaveroussy.fr/galaxyprod

Data libraries

• Datasets are accessible from Galaxy or for download.

History

• Histories record all the steps in the process and the settings used.
• Histories can be imported into your session and rerun as is or modified.

Workflows

• Workflows specify the steps in a process (a suite of ordered tools).
• Workflows are analyses that are meant to be run each time with different user-provided datasets.

User account

• Galaxy public Main or Test instances
  • An account is not required to access them
  • But if one is used, the data quota is increased and full functionality across sessions opens up, such as naming, saving, sharing, and publishing Galaxy objects (Histories, Workflows, Datasets, Pages).
• Galaxy@GR: https://galaxy.gustaveroussy.fr/galaxyprod
  • An account is required to access it
  • Full functionality across sessions opens up, such as naming, saving, sharing, and publishing Galaxy objects (Histories, Workflows, Datasets, Pages).
