A talk I gave for the 2011 metagenomics course at the Biological Dept., Univ. of Oslo, April 2011
NGS techniques and data relevant for metagenomics analyses
Lex Nederbragt
Norwegian Sequencing Center &
Centre for Ecological and Evolutionary Synthesis
University of Oslo
The sequence revolution
Stratton et al Nature 458, 719-724
Norwegian Sequencing Center
www.sequencing.uio.no
This talk
• Technologies
– 454
– Illumina
• Topics
– How does it work
– What do you get
– Quality check
– Filtering
How does it work: 454
Library preparation
• Shotgun library: starting from DNA sample
• Amplicon library: starting from PCR product
Library preparation
• Shotgun library
– Fragmentation
– Addition of adaptors
• Amplicon library
[Figure labels: Fw, Rv, adaptors A and B]
Multiplexing
• Amplicon library
[Figure labels: A, Tag, Fw]
• Shotgun: tag in the adaptors
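To make the multiplexing idea concrete, here is a minimal sketch of how tagged reads could be sorted back into samples after sequencing. It assumes the tag sits at the very start of each read and must match exactly; the tag sequences, sample names and reads are hypothetical and only there for illustration.

```python
def demultiplex(reads, tags):
    """Assign each read to a sample by the tag at its start, and strip the tag."""
    bins = {sample: [] for sample in tags.values()}
    bins["unassigned"] = []
    for read in reads:
        for tag, sample in tags.items():
            if read.startswith(tag):
                bins[sample].append(read[len(tag):])
                break
        else:
            # No tag matched exactly; keep the read aside instead of guessing.
            bins["unassigned"].append(read)
    return bins


# Hypothetical tags and reads, for illustration only.
tags = {"ACGAGTGCGT": "sample_1", "ACGCTCGACA": "sample_2"}
reads = [
    "ACGAGTGCGTCCACTTTGCTCC",
    "ACGCTCGACAGGATTCATTTTT",
    "TTTTTTTTTTGGGG",
]
bins = demultiplex(reads, tags)
print(bins)
```

Real tags can contain sequencing errors, so production tools usually allow a small number of mismatches; exact matching is used here only to keep the sketch short.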
Amplification
Plate loading
Multiplexing
2, 4, 8 or 16 lanes (image: Flickr.com)
Sequencing
PPi: pyrophosphate
Basecalling
Read length
500 bases
Coming soon
Single end
• Default single end sequencing
• Special protocols for mate-pairs
How does it work: Illumina
Library preparation
Multiplexing: same as for 454
Bridge amplification
Metzker 2010 Nat Rev Genet.11(1):31-46
Multiplexing
Flowcell: 8 lanes
Sequencing
Metzker 2010 Nat Rev Genet.11(1):31-46
Reversible terminators
Basecalling
Metzker 2010 Nat Rev Genet.11(1):31-46
Read length
[Figure panels: 454 GS FLX Titanium; Illumina HiSeq]
500 bases
Paired-end
• Default paired-end sequencing
– single end also possible
150–600 bases
What do you get?
454 Throughput
• GS FLX Titanium per-run output:
– Up to 1.5 million single-end reads
– Up to 600 megabases (Mb, million bases)
– Less for amplicons
Illumina throughput (HiSeq 2000)
• Variable length
– 50, 100 (soon 150)
– single or paired-end
• per-run output:
– Up to 1 billion (10^9) single-end reads
– Up to 2 billion paired-end reads
– Up to 200 gigabases (Gb, billion bases)
– Soon: 3 times more reads and bases
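The read counts and base yields above are linked by simple arithmetic, which the sketch below spells out using only the "up to" figures from the slides. The function name is an illustration, and real runs yield less than these upper bounds.

```python
def total_bases(n_reads, read_length):
    """Total yield in bases for n_reads reads of a fixed length."""
    return n_reads * read_length


# HiSeq 2000: up to 2 billion paired-end reads of 100 bases each.
illumina_bases = total_bases(2_000_000_000, 100)
print(illumina_bases / 1e9, "Gb")

# 454 GS FLX Titanium: 600 Mb spread over 1.5 million reads
# implies a mean read length of about 400 bases.
mean_454_length = 600_000_000 / 1_500_000
print(mean_454_length, "bases per read on average")
```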
What do you get? Errors!
http://www.it.bton.ac.uk/staff/je/java/jewl/tutorial/tutorial.html
Error profiles
[Figure panels: 454 GS FLX Titanium; Illumina Genome Analyzer II]
454 specific
3 G's? 4 G's?
Illumina specific
• Substitutions
– e.g. AG
• Underrepresentation of AT and GC rich regions
Solving errors
• Oversampling
Oversampling: 454
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATT-GGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAAATTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAAATTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAAATTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATT-GGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAAATTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
Undercall in two reads
Overcall in three reads
Consensus
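The oversampling idea can be sketched as a per-column majority vote over reads that are already aligned. This is a simplified illustration, not the method of any particular assembler: the reads below are a shortened excerpt of the alignment above, gaps are written as "-", and the function name is an assumption.

```python
from collections import Counter


def consensus(aligned_reads, gap="-"):
    """Majority vote per alignment column; columns won by the gap are dropped."""
    calls = []
    for column in zip(*aligned_reads):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != gap:
            calls.append(symbol)
    return "".join(calls)


# Shortened excerpt of the aligned 454 reads shown above.
reads = [
    "CAAATTTGGTTTTAGACGAA-TTGT",
    "CAAATT-GGTTTTAGACGAA-TTGT",  # undercall: one T missing
    "CAAATTTGGTTTTAGACGAAATTGT",  # overcall: one extra A
    "CAAATTTGGTTTTAGACGAA-TTGT",
    "CAAATTTGGTTTTAGACGAA-TTGT",
]
print(consensus(reads))
```

With enough reads covering a position, an occasional undercall or overcall is outvoted, which is why higher coverage helps against homopolymer errors.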
Solving errors
• Oversampling
• 454 amplicons: AmpliconNoise
– this course
• Illumina GC-bias: PCR conditions
– Aird et al. Genome Biology 2011, 12:R18
Duplicate reads
• Illumina: PCR step in library prep
• 454: two beads in one microreactor
– emulsion PCR
Chimeras
Haas B J et al. Genome Res. 2011;21:494-504
Chimeras
• 454 FLX Titanium
– chimera rate of up to 20%
• >70% of sequences representing particular genera
Haas B J et al. Genome Res. 2011;21:494-504
Chimeras: solutions
• ChimeraSlayer
– AmpliconNoise
• ChimeraCheck
– Mothur
• See Haas et al. 2011 Genome Res. 21:494-504
What do you get? Bytes!
Filesizes
• 454
– Up to 2 Gbytes per lane (sff)
– two lanes
• HiSeq
– up to 20 Gbytes per lane (fastq)
– eight lanes
Datafiles 454
• sff file (standard flowgram format)
– binary
• fasta & qual
– text
454: sff file (text format)
>F7K88GK01BMPI0
Run Prefix: R_2009_12_18_15_27_42_
Region #: 1
XY Location: 0551_2346

Run Name: R_2009_12_18_15_27_42_FLX########_Administrator_yourrunname
Analysis Name: D_2009_12_19_01_11_43_XX_fullProcessing
Full Path: /data/R_2009_12_18_15_27_42_FLX########_Administrator_yourrunname/D_2009_12_19_01_11_43_XX_fullProcessing/

Read Header Len: 32
Name Length: 14
# of Bases: 500
Clip Qual Left: 15
Clip Qual Right: 490
Clip Adap Left: 0
Clip Adap Right: 0

Flowgram: 1.03 0.00 1.01 0.02 0.00 0.96 0.00 1.00 0.00 1.04 0.00 0.00 0.97 0.00 0.96 0.02 0.00 1.04 0.01 1.04 0.00 0.97 0.96 0.02 0.00 1.00 0.95 1.04 0.00 0.00 2.04 0.02 0.03 1.05
Flow Indexes: 1 3 6 8 10 13 15 18 20 22 23 26 27 28 31 31 34 35 37 37 37 40 43 45 47 47 47 50 53 53 53 55 58 60 63 66 67 67 67 67 70 71 71 74 74 76 79 82 83 86 86 88 88 91 93 96 97...
Bases: tcagatcagacacgCCACTTTGCTCCCATTTCAGCACCCCACCAAGCACAAGGCTGTCATCCCAATTGGACGGACAGATATGAGGTTAGCATTGGAAACCAATTCAGTCCCTAATTATTCACGACTGAACCCAGCGACAATTGGACATGGATTCATTTTTCAACTTGATTTGTTGTTGTAAAAGCA...
Quality Scores: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 38 38 38 40 40 40 39 39 39 40 34 34 34 40 40 40 40 39 26 26 26 26 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 ...
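How the flowgram relates to the bases and flow indexes can be shown with a naive basecaller: round each flow signal to the nearest integer and emit that many copies of the nucleotide flowed in that cycle. This is a sketch under two assumptions not stated on the slide: the nucleotides are flowed cyclically in the order T, A, C, G, and plain rounding stands in for the real basecalling model. The function name is hypothetical.

```python
def call_bases(flowgram, flow_order="TACG"):
    """Naive 454 basecalling: round each signal to a homopolymer length."""
    bases = []
    flow_indexes = []
    for i, signal in enumerate(flowgram):
        n = int(signal + 0.5)  # round to the nearest whole number of bases
        base = flow_order[i % len(flow_order)]
        bases.append(base * n)
        flow_indexes.extend([i + 1] * n)  # 1-based flow index, once per base
    return "".join(bases), flow_indexes


# The flowgram values shown in the sff example above.
flowgram = [
    1.03, 0.00, 1.01, 0.02, 0.00, 0.96, 0.00, 1.00, 0.00, 1.04,
    0.00, 0.00, 0.97, 0.00, 0.96, 0.02, 0.00, 1.04, 0.01, 1.04,
    0.00, 0.97, 0.96, 0.02, 0.00, 1.00, 0.95, 1.04, 0.00, 0.00,
    2.04, 0.02, 0.03, 1.05,
]
sequence, indexes = call_bases(flowgram)
print(sequence)
print(indexes)
```

The result agrees with the start of the Bases and Flow Indexes lines in the example. It also shows where the "3 G's? 4 G's?" problem comes from: a signal close to 3.5 rounds either way, so long homopolymers are the typical 454 error.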
454: fasta and qual files
Fasta:
>FTJD6BE02HHD3W length=409 xy=2951_1562 region=2 run=R_2009_04_01_11_28_49_
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAATTGTCCCTTTGACATAACGACTAAAGGAGTCAACAGATTTTCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACGCTATT...
Qual:
>FTJD6BE02HHD3W length=409 xy=2951_1562 region=2 run=R_2009_04_01_11_28_49_
40 40 39 39 39 40 40 40 40 40 40 40 40 38 31 26 26 16 16 16 20 20 14 14 14 14 27 33 32 35 36 33 36 35 36 38 35 20 20 21 24 24 22 36 39 40 38 38 38 40 40 40 40 40 40 37 37 37 33 33 29 36 38 38 38 38 38 38 38 35 20 21 21 21 31 36 37 40 40 35 37 37 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40...
Sanger-style Phred scores
chance of being wrong: 1:10^4.0 = 1:10000
chance of being wrong: 1:10^3.5 = 1:3162
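The two numbers above follow from the standard Phred definition, Q = -10 log10(P), so they correspond to Phred scores of 40 and 35. A minimal sketch of the conversion (the function name is an assumption):

```python
def phred_to_error_probability(q):
    """Probability that a base with Phred score q is wrong: P = 10^(-q/10)."""
    return 10 ** (-q / 10)


for q in (40, 35, 20):
    p = phred_to_error_probability(q)
    print(f"Q{q}: chance of being wrong 1:{1 / p:.0f}")
```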
Illumina: fastq file
@PCUS-319-EAS487_0004_FC:6:1:1351:952#0/1
CCAACATAGCTGGATGCCAACATAGCTGGATTGTTATAGCTGGTTTGCTTTTCTAACTCGCTGGAAGTTTATAAGCATTCCTACTATTTCATAGTATTAC
+@PCUS-319-EAS487_0004_FC:6:1:1351:952#0/1
BBbfYcbV^BV`cQffaBZfB_fdfUYaa]`adcbfef\acfd^cad^fOabRceb`beSbdfaad_e^^dbeedTbd`V\ecdfffYBddb^fa\d\de
Quality score as characters: Phred score = ASCII value - 33
'B' is ASCII 66, so Phred 33
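Decoding a quality string is a one-liner once the offset is known. The sketch below takes the offset as a parameter because it differs between FASTQ variants: the Sanger format uses 33 as on the slide, while the Illumina 1.3+ variants described by Cock et al. use 64. The function name is an assumption, and it is only a guess on my part that the high characters in the example read (up to 'f') point to the offset-64 variant.

```python
def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


# First eight quality characters of the read shown above.
print(decode_quality("BBbfYcbV", offset=33))  # Sanger-style decoding
print(decode_quality("BBbfYcbV", offset=64))  # Illumina 1.3+ style decoding
```

Checking which offset gives scores in a plausible range (roughly 0 to 41) is a quick way to find out which variant a file uses before filtering on quality.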
Matching pair in the other file:
+@PCUS-319-EAS487_0004_FC:6:1:1351:952#0/2
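Matching the two reads of a pair comes down to comparing names: everything before the final "/" must be identical, and the suffixes must be 1 and 2. A minimal sketch, assuming the "/1" and "/2" naming shown in the example (the function names are hypothetical):

```python
def split_read_name(name):
    """Split 'prefix/1' into ('prefix', '1'); the last '/' is the separator."""
    base, _sep, mate = name.rpartition("/")
    return base, mate


def are_mates(name_1, name_2):
    """True if the two names share a prefix and carry the suffixes 1 and 2."""
    base_1, mate_1 = split_read_name(name_1)
    base_2, mate_2 = split_read_name(name_2)
    return base_1 == base_2 and {mate_1, mate_2} == {"1", "2"}


forward = "PCUS-319-EAS487_0004_FC:6:1:1351:952#0/1"
reverse = "PCUS-319-EAS487_0004_FC:6:1:1351:952#0/2"
print(are_mates(forward, reverse))
```

Filtering tools that drop reads from only one of the two files can break this one-to-one correspondence, so it is worth re-checking the pairing after any filtering step.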
FastQ formats
Cock PJ et al 2009
The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants.
Nucleic Acids Res. 2010 Apr;38(6):1767-71.
and
http://en.wikipedia.org/wiki/Fastq
Quality control
• 454 (and others): Prinseq
• Illumina (and others): fastQC, fastQA, etc.
Prinseq
• http://edwards.sdsu.edu/prinseq_beta
• Web-based and stand-alone
• Upload
– fasta file
– qual file (optional)
Prinseq: read length
Prinseq: quality per position
Prinseq: quality values
Prinseq: duplicate reads
Prinseq: adaptors
No tag
Barcode (Roche 'MID')
Transcriptome library adaptor
Prinseq: contamination
The dinucleotide odds ratios*
Principal component analysis (PCA)
*dinucleotide frequencies normalized for the base composition
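The footnote can be written as a formula: the odds ratio for a dinucleotide XY is its observed frequency divided by the product of the frequencies of X and Y. The sketch below computes this for a single sequence. It is a simplified illustration and may differ from what Prinseq does internally (for instance in how the two strands are handled); the function name is an assumption.

```python
from collections import Counter


def dinucleotide_odds_ratios(seq):
    """Odds ratio f(XY) / (f(X) * f(Y)) for every dinucleotide seen in seq."""
    seq = seq.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    ratios = {}
    for pair, count in di.items():
        f_xy = count / n_di
        f_x = mono[pair[0]] / n_mono
        f_y = mono[pair[1]] / n_mono
        ratios[pair] = f_xy / (f_x * f_y)
    return ratios


ratios = dinucleotide_odds_ratios("ACAC")
print(ratios)
```

A ratio near 1 means the dinucleotide occurs about as often as the base composition predicts; the pattern of over- and under-represented dinucleotides is what the PCA plot compares between samples.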
FastQC
• http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
• Stand-alone
• GUI (Java based)
• Upload
– fastq file
FastQC: quality per position
FastQC: quality values
FastQC: nucleotide composition
FastQC: GC distribution
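The GC distribution plot is built from one number per read, the fraction of G and C bases. A minimal sketch of that per-read value follows; the function name is an assumption, and ambiguous bases such as N are left out of the denominator here, which may not match FastQC's own handling.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a read."""
    seq = seq.upper()
    n_acgt = sum(seq.count(base) for base in "ACGT")
    if n_acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / n_acgt


# Hypothetical reads, for illustration only.
for read in ("GGCC", "ACGT", "AATTNN"):
    print(read, gc_fraction(read))
```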
FastQC: duplicated reads
Filtering/trimming
• Adaptor removal
– especially Illumina
• Duplicate removal
• Filtering for low quality bases
– or stretches of them
– reads with 'N's
• E.g.
– fastX toolkit
– prinseq
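The filtering steps listed above can be sketched in a few lines to show what the tools do in principle. This is not how the fastX toolkit or prinseq are implemented: the thresholds, function names and reads are hypothetical, duplicates are defined as exact sequence matches, and adaptor removal is left out.

```python
def trim_low_quality_end(seq, quals, min_q=20):
    """Remove bases from the 3' end while their quality is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def filter_reads(records, min_q=20, min_len=5):
    """Trim each read, then drop short reads, reads with N, and exact duplicates."""
    seen = set()
    kept = []
    for seq, quals in records:
        seq, quals = trim_low_quality_end(seq, quals, min_q)
        if len(seq) < min_len or "N" in seq.upper() or seq in seen:
            continue
        seen.add(seq)
        kept.append((seq, quals))
    return kept


# Hypothetical reads with Phred scores, for illustration only.
records = [
    ("ACGTACGTAC", [40] * 8 + [12, 10]),     # low-quality tail gets trimmed
    ("ACGTACGT", [40] * 8),                  # duplicate of the trimmed read above
    ("ACGNACGTAC", [40] * 10),               # contains an N
    ("TTGACCA", [38, 38, 37, 9, 8, 7, 5]),   # too short after trimming
    ("GGATTCATTT", [40] * 10),               # passes all filters
]
kept = filter_reads(records)
print([seq for seq, _quals in kept])
```

The order of the steps matters: trimming before duplicate removal means that reads differing only in a low-quality tail are treated as duplicates.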
Other technologies
• Life Technologies
– SOLiD
– Ion Torrent
– not much used for metagenomics
• Pacific Biosciences
– PacBio RS
– large potential
Pacific Biosciences
Metzker 2010 Nat Rev Genet.11(1):31-46
Zero Mode Waveguides
Videos
http://www.qiagen.com/media/player.aspx?movie=Pyrosequencing
http://www.youtube.com/watch?v=HtuUFUnYB9Y