Detection and analysis of SNP polymorphisms
Alexis Dereeper
CIBA courses – Brasil 2011
Objectives
• To know and manipulate available packages/tools for SNP and INDEL detection from NGS data (assembly of NGS data)
• To think about the difficulties encountered when analysing next-generation sequencing data (differentiating sequencing errors, paralogs and allelic variation)
• To detect SNPs and assign genotypes to every polymorphic position
• To exploit polymorphism data simply via a Web-based application (genetic diversity, LD)
• To obtain an exploitable dataset to send for the design of a high-throughput SNP chip (Illumina VeraCode technology)
Workflow overview (figure)
• Short reads Solexa
• Mapping SAM
• Assignment of genotypes
• Exploitation of polymorphism data
• Design of an Illumina SNP chip
Allelic variations
Ind1 ATTGTGTCGTAACGTATGTCATGTCGT
Ind2 ATTGTGTCGGAACGTATGTCATGTCGT
Ind3 ATTGTGTCGKAACGTATGTCATGTCGT
List of SNPs
867 A/G
1998 T/C
2341 T/G
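In the Ind1 to Ind3 example, K is the IUPAC code for "G or T", so Ind3 is heterozygous at the position where Ind1 carries T and Ind2 carries G. A minimal Python sketch of how such polymorphic columns could be listed from aligned sequences is given below; the function name `polymorphic_positions` is hypothetical and is not part of any tool named in these slides.

```python
# Two-allele IUPAC codes only; any other symbol (N, gaps) is ignored.
IUPAC_TO_ALLELES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}


def polymorphic_positions(alignment):
    """Return {1-based position: 'X/Y'} for columns with more than one allele.

    `alignment` maps an individual name to its aligned sequence.
    """
    snps = {}
    for position, column in enumerate(zip(*alignment.values()), start=1):
        alleles = set()
        for code in column:
            alleles.update(IUPAC_TO_ALLELES.get(code.upper(), ""))
        if len(alleles) > 1:
            snps[position] = "/".join(sorted(alleles))
    return snps


alignment = {
    "Ind1": "ATTGTGTCGTAACGTATGTCATGTCGT",
    "Ind2": "ATTGTGTCGGAACGTATGTCATGTCGT",
    "Ind3": "ATTGTGTCGKAACGTATGTCATGTCGT",
}
print(polymorphic_positions(alignment))
```

Positions are counted from 1 within the alignment, so they do not correspond to the coordinates shown in the "List of SNPs" of the figure.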
Tablet
• Graphical viewer for assemblies of NGS data
• Accepts different formats: ACE, SAM, BAM
Automatic detection of SNP from SAM assembly
Example of pipelines feasible with the Galaxy system: 3 alternatives
• VarScan: SAM assembly → SAM-to-BAM → Generate Pileup → Pileup file → Pileup2snp → SNP tabular file
• GATK: Fastq → FastQ Groomer → Mapping BWA → AddReadGroupIntoSam → SAM-to-BAM → IndelRealigner → CountCovariates → TableRecalibration → UnifiedGenotyper → VCF file → VCFToFastaAlignments → FASTA alignments with IUPAC
• SNiPlay Utilities: SAM assembly → SamToFastaAlignments → FASTA alignments with IUPAC
Tools shown in the figure legend: SamTools, GATK, PicardTools, VarScan, SNiPlay Utilities
VarScan
Program for SNP detection from a Pileup file: Pileup2snp
Another module, Pileup2indel, exists for indels but is not yet implemented in Galaxy SouthGreen
Pileup format
Text file describing for each position: reference base, depth of coverage, variations, quality
seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
seq1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
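To make the pileup columns concrete, the sketch below counts the bases observed at one position. It is only an illustration of how the read-bases column is read, not the VarScan algorithm: "." and "," are matches to the reference on the forward and reverse strand, "^" starts a read and is followed by one mapping-quality character, "$" ends a read, and "+N"/"-N" introduce an indel of N bases. The function names are hypothetical.

```python
import re
from collections import Counter


def count_pileup_bases(ref_base, read_bases):
    """Count the bases seen at one pileup position (indel sequences are skipped)."""
    counts = Counter()
    i = 0
    while i < len(read_bases):
        c = read_bases[i]
        if c == "^":
            i += 2  # '^' plus the mapping-quality character that follows it
        elif c in "+-":
            digits = re.match(r"\d+", read_bases[i + 1:]).group()
            i += 1 + len(digits) + int(digits)  # sign, length, inserted/deleted bases
        elif c in ".,":
            counts[ref_base.upper()] += 1
            i += 1
        elif c.upper() in "ACGTN":
            counts[c.upper()] += 1
            i += 1
        else:
            i += 1  # '$' (end of read), '*' (deleted base) and other symbols
    return counts


def parse_pileup_line(line):
    """Split one six-column pileup line and return (sequence, position, counts)."""
    seq, pos, ref, _depth, bases, _quals = line.split()
    return seq, int(pos), count_pileup_bases(ref, bases)


line = "seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<"
print(parse_pileup_line(line))
```

Note that in the first example line the "+" after "^" is a mapping-quality character, not the start of an insertion, which is why "^" must be handled before "+" and "-".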
[Figure: depth thresholds and heterozygosity illustrated for genotype1, genotype2 and genotype3; example bases W, Y, A, A, T]
Threshold values per genotype
Depth   Frequency   Depth
1       0           1
4       0.3         2
4       0.3         2
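One way to read the depth and frequency thresholds is as filters applied to the allele counts of one genotype at one position: below the depth threshold no base is called, and an allele is kept only if its frequency reaches the frequency threshold. The sketch below follows that reading. It is an assumption about the general idea, not the documented behaviour of the SNiPlay utilities, and the defaults of 4 and 0.3 are borrowed from the table only as plausible values.

```python
# IUPAC codes for one or two alleles.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def call_base(counts, min_depth=4, min_freq=0.3):
    """Return an IUPAC base from allele counts, or 'N' when no call is made."""
    depth = sum(counts.values())
    if depth < min_depth:
        return "N"  # not enough reads to call this position
    kept = frozenset(base for base, n in counts.items() if n / depth >= min_freq)
    return IUPAC.get(kept, "N")  # 'N' also covers zero or three retained alleles


print(call_base({"A": 6, "T": 4}))
print(call_base({"T": 20, "C": 1, "G": 1}))
print(call_base({"A": 2}))
```

With these defaults a rare variant seen in one read out of 22 is treated as a sequencing error, while two alleles at comparable frequencies give a heterozygous IUPAC code.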
SamToFastaAlignments and AceToFastaAlignments: SNiPlay utilities for management of NGS data
Input: Assembly (ACE format) or Mapping (SAM format)
Output, for each contig (CL1Contig1, …):
• FASTA alignments including IUPAC codes: CL1Contig1.align.fa + CL1Contig2.align.fa , CL2Contig1.align.fa …
• List of heterozygous positions
• Stats: estimation of average heterozygosity for each genotype
Threshold labels in the figure: for position, for heterozygosity estimation
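As an illustration of the heterozygosity statistic, the sketch below estimates it for one genotype as the fraction of called positions that carry a two-allele IUPAC code in its FASTA alignment sequence. This is an assumed definition for teaching purposes; the exact formula used by the SNiPlay utilities is not given in the slides, and the function name is hypothetical.

```python
HETEROZYGOUS_CODES = set("RYSWKM")  # two-allele IUPAC codes
HOMOZYGOUS_CODES = set("ACGT")


def heterozygosity(sequence):
    """Fraction of called positions (A, C, G, T or two-allele code) that are heterozygous."""
    called = [b for b in sequence.upper()
              if b in HOMOZYGOUS_CODES or b in HETEROZYGOUS_CODES]
    if not called:
        return 0.0
    return sum(b in HETEROZYGOUS_CODES for b in called) / len(called)


# Ind3 from the allelic variation example: one heterozygous position (K) out of 27.
print(heterozygosity("ATTGTGTCGKAACGTATGTCATGTCGT"))
```

Positions coded N or gaps are excluded from the denominator, so poorly covered regions do not lower the estimate.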
GATK (Genome Analysis ToolKit)
• Package for analysis of NGS data.
• Developed for the analysis of human medical resequencing projects (1000 Genomes, The Cancer Genome Atlas)
• Includes tools for depth analysis, quality score recalibration, SNP/InDel discovery
• Complementary to 2 other packages: SamTools, PicardTools
PREPROCESS:
* Index human genome (Picard), we used HG18 from UCSC.
* Convert Illumina reads to Fastq format
* Convert Illumina 1.6 read quality scores to standard Sanger scores
FOR EACH SAMPLE:
1. Align samples to genome (BWA), generates SAI files.
2. Convert SAI to SAM (BWA)
3. Convert SAM to BAM binary format (SAM Tools)
4. Sort BAM (SAM Tools)
5. Index BAM (SAM Tools)
6. Identify target regions for realignment (Genome Analysis Toolkit)
7. Realign BAM to get better Indel calling (Genome Analysis Toolkit)
8. Reindex the realigned BAM (SAM Tools)
9. Call Indels (Genome Analysis Toolkit)
10. Call SNPs (Genome Analysis Toolkit)
11. View aligned reads in BAM/BAI (Integrative Genomics Viewer)
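The quality conversion in the PREPROCESS list can be sketched as a simple offset change, assuming the Illumina 1.6 scores are Phred values encoded with an ASCII offset of 64 and the Sanger scores use an offset of 33. This is the kind of operation FastQ Groomer performs in Galaxy, but the function below is a hypothetical illustration, not that tool's code.

```python
def illumina64_to_sanger(quality):
    """Convert a Phred+64 quality string to a Phred+33 (Sanger) quality string."""
    converted = []
    for ch in quality:
        phred = ord(ch) - 64  # decode the Illumina 1.3-1.7 style offset
        if phred < 0:
            raise ValueError(f"{ch!r} is not a valid Phred+64 quality character")
        converted.append(chr(phred + 33))  # re-encode with the Sanger offset
    return "".join(converted)


# 'h' is Phred 40 with offset 64 and becomes 'I' with offset 33.
print(illumina64_to_sanger("hhhB"))
```

Running the conversion on data that is already in Sanger encoding either raises an error or silently produces wrong scores, so the input encoding should be checked first.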
[Figure: pipeline from Fastq files to a VCF file for several samples (RC1, RC2, RC3, RC4 ….)]
• For each sample: Fastq (RC1), Fastq (RC2), Fastq (RC3), Fastq (RC4) …. → FastQ Groomer → Mapping BWA → AddReadGroupIntoSam → SAM with read group
• mergeSam → Global SAM with read group
• Global SAM with read group → SAM-to-BAM → IndelRealigner → CountCovariates → TableRecalibration → UnifiedGenotyper → VCF file
• Also shown: Fastq (RC1) Fastq (RC2) Fastq (RC3) Fastq (RC4) → Fastq global → FastQ Groomer → Mapping BWA → AddReadGroupIntoSam
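The read-group step is what lets the genotyper tell samples apart after the SAM files are merged. In the SAM format this means one `@RG` header line per sample (with an `ID` and a sample name `SM`) and an `RG:Z:` tag on every read. The sketch below shows that idea on plain text lines; it is a hypothetical illustration and not the code of the AddReadGroupIntoSam utility.

```python
def add_read_group(sam_lines, rg_id, sample):
    """Return SAM lines with an @RG header line and an RG:Z tag on every read."""
    rg_header = f"@RG\tID:{rg_id}\tSM:{sample}"
    out = []
    header_written = False
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):  # existing header lines are kept as they are
            out.append(line)
            continue
        if not header_written:  # place @RG after the last header line
            out.append(rg_header)
            header_written = True
        out.append(f"{line}\tRG:Z:{rg_id}")
    if not header_written:  # file without any read
        out.append(rg_header)
    return out


sam = [
    "@SQ\tSN:seq1\tLN:1000",
    "read1\t0\tseq1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
for line in add_read_group(sam, "RC1", "RC1"):
    print(line)
```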
VCF format (Variant Call Format)
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3
Advantages: describes the variations for each position + genotype assignment
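To show how the genotype assignment is read from a VCF record, the sketch below translates the GT field of each sample into alleles: index 0 is the REF allele and 1, 2, … are the ALT alleles in order. It is a minimal hand-written parser for illustration, with a hypothetical function name; real VCF files are tab-delimited and are better handled with a dedicated library.

```python
def parse_vcf_record(line, sample_names):
    """Return (chromosome, position, {sample: [allele, allele]}) for one record."""
    fields = line.split()
    chrom, pos, _snp_id, ref, alt = fields[:5]
    alleles = [ref] + alt.split(",")  # index 0 = REF, 1.. = ALT alleles
    format_keys = fields[8].split(":")
    genotypes = {}
    for name, sample_field in zip(sample_names, fields[9:]):
        values = dict(zip(format_keys, sample_field.split(":")))
        # '|' separates phased alleles, '/' unphased ones; '.' is a missing call.
        indices = values["GT"].replace("|", "/").split("/")
        genotypes[name] = [alleles[int(i)] if i != "." else "." for i in indices]
    return chrom, int(pos), genotypes


record = ("20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
          "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51")
print(parse_vcf_record(record, ["NA00001", "NA00002"]))
```

For this record NA00001 is homozygous for the reference allele and NA00002 is heterozygous.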
Other functionalities of GATK
• DepthOfCoverage module: reports the sequencing depth of coverage for each gene, each position and each individual
• ReadBackedPhasing module: determines, where possible, the allele association (phase or haplotype) at heterozygous positions…
[Figure: phasing example, labelled "And not AGGGGA"]
SNiPlay: Web-based application for polymorphism analysis
http://sniplay.cirad.fr
Options of SNiPlay
Select the VCF format
Load the VCF file
Load reference file
Select the Rice genome as reference
Design of Illumina chip
Cartesian coordinates
Genotyping file
Submission file for Illumina
Analysis with the BeadStudio software
Allelic files
• DARwin format
@DARwin 5.0 - ALLELIC - 2
33 20
N° 50 50 122 122 218 218 245 245 261 261 290 290 356
1 1 1 1 1 3 3 3 3 4 4 2 2 2
2 1 1 1 1 3 3 1 3 4 4 2 2 2
3 1 1 1 1 3 3 3 3 4 4 2 2 2
4 1 1 1 1 3 3 3 3 4 4 2 2 2
• .inp format for Phase
33
10
P 49 121 217 244 260 289
SSSSSSSSSS
#cARB
A A G G T C C A T T
A A G G T C C A T T
#cSYR
A A G A T C C A T C
A A G G T C C A T T
• PED format
cARB 1 0 0 1 0 1 1 1 1 3 3 3 3 4 4 2 2 2 2 1 1 4 4 4 4
cSYR 2 0 0 1 0 1 1 1 1 3 3 1 3 4 4 2 2 2 2 1 1 4 4 2 4
cARA 3 0 0 1 0 1 1 1 1 3 3 3 3 4 4 2 2 2 2 1 1 4 4 4 4
• Format for TASSEL (association studies)
33 10:2
50 122 218 245 261 290 356 461 467 560
cARB A:A A:A G:G G:G T:T C:C C:C A:A T:T T:T
cSYR A:A A:A G:G A:G T:T C:C C:C A:A T:T C:T
cARA A:A A:A G:G G:G T:T C:C C:C A:A T:T T:T
cORL A:A A:A G:G G:G T:T C:C C:C A:A T:T T:T
cLAR A:G A:G A:G A:G C:T C:C C:C A:A T:T C:T
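The allelic files describe the same genotypes with different encodings. Comparing the TASSEL and PED examples shows that the PED rows use the numeric coding A=1, C=2, G=3, T=4, preceded by six leading columns. The sketch below converts one TASSEL-style row into a PED-style row under that observation. The values written in the six leading columns simply copy the pattern visible in the example (name, running number, 0, 0, 1, 0) and are an assumption, as is the function name.

```python
ALLELE_CODE = {"A": "1", "C": "2", "G": "3", "T": "4"}


def genotypes_to_ped(name, number, genotypes):
    """Build a PED-style row from genotypes written as 'X:Y' (TASSEL-style)."""
    # Six leading columns copied from the slide's example: name, number, 0, 0, 1, 0.
    fields = [name, str(number), "0", "0", "1", "0"]
    for genotype in genotypes:
        for allele in genotype.split(":"):
            fields.append(ALLELE_CODE.get(allele.upper(), "0"))  # '0' = missing allele
    return " ".join(fields)


tassel_row = "cSYR A:A A:A G:G A:G T:T C:C C:C A:A T:T C:T".split()
print(genotypes_to_ped(tassel_row[0], 2, tassel_row[1:]))
```

The printed row can be compared with the cSYR line of the PED example.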
Haplotype networks
High frequency haplotypes
Low frequency haplotype
Group distribution within this haplotype
Distance between 2 haplotypes (number of mutations)
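The two quantities drawn in a haplotype network can be sketched in a few lines: the frequency of each haplotype (node size) and the number of mutations separating two haplotypes (edge length), taken here as the number of differing positions between sequences of equal length. This is an illustrative simplification with hypothetical function names, not the network algorithm used by SNiPlay.

```python
from collections import Counter


def haplotype_distance(hap_a, hap_b):
    """Number of positions at which two haplotypes of equal length differ."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must have the same length")
    return sum(a != b for a, b in zip(hap_a, hap_b))


def haplotype_frequencies(haplotypes):
    """Count how many times each haplotype occurs."""
    return Counter(haplotypes)


# The two phased haplotypes of cSYR from the Phase example, plus two copies from cARB.
haplotypes = ["AAGGTCCATT", "AAGGTCCATT", "AAGATCCATC", "AAGGTCCATT"]
print(haplotype_frequencies(haplotypes).most_common())
print(haplotype_distance("AAGGTCCATT", "AAGATCCATC"))
```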