Workshop on Whole Genome Sequencing and Analysis, 2-4 Oct. 2017
Sequencing techniques
Learning objectives:
After this lecture, you should be able to…
… account for different techniques for whole genome sequencing (Illumina, Ion Torrent, PacBio, Nanopore)
… identify the elements that make up the raw sequence files
… assess the quality of your data at a general level
Preparing for sequencing
2nd generation sequencing technologies have many steps in common
1. DNA isolation
2. DNA fragmentation
3. Primer ligation
4. Amplification
[Figure labels: amplification primers, sequencing primers, barcode, isolated DNA]
Illumina sequencing
Question
In the figure above each coloured spot represents a spot on the flow cell where millions of identical DNA templates are clustered and each grey square one cycle of sequencing. What is the sequence of the template DNA strand in the lower right corner of the flow cell?
Illumina reads have equal lengths. One base is determined per cycle
[Figure: three reads growing by one base per cycle]
              >Read_1  >Read_2  >Read_3
End cycle 1:  A        C        T
End cycle 2:  AT       CA       TC
End cycle 3:  ATA      CAT      TCC
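The "one base per cycle" idea can be sketched in a few lines of Python. This is only an illustration, not how the instrument software works: the function name `sequence_by_cycles` and the three toy base-call strings are assumptions made for the example.

```python
def sequence_by_cycles(base_calls, n_cycles):
    """Grow every read by exactly one base per cycle.

    base_calls: one string per cluster, holding the bases that would be
    called for that cluster (hypothetical toy input).
    Returns the state of all reads after each cycle.
    """
    reads = ["" for _ in base_calls]
    history = []
    for cycle in range(n_cycles):
        # Every cluster contributes exactly one new base in each cycle.
        reads = [read + calls[cycle] for read, calls in zip(reads, base_calls)]
        history.append(list(reads))
    return history


toy_calls = ["ATA", "CAT", "TCC"]  # toy data, three clusters
illumina_history = sequence_by_cycles(toy_calls, n_cycles=3)
for cycle, reads in enumerate(illumina_history, start=1):
    print(f"End cycle {cycle}:", *reads)
```

Because every read gains one base per cycle, all reads have the same length after any given number of cycles.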
Ion Torrent - also a 2nd gen. sequencing technology
• Does not rely on optical signals from fluorescently labelled nucleotides
• Detects the small pH change caused by H+ release, when a nucleotide is incorporated
https://www.youtube.com/watch?v=ZL7DXFPz8rU&t=4s
• Has difficulties correctly calling homopolymers (stretches of identical nucleotides)
Question
Imagine the above is the output from an Ion Torrent run. Which sequence does it represent?
[Figure x-axis: type of nucleotide flooded across the well]
Ion Torrent reads do not usually have equal lengths
[Figure: reads >Read_1, >Read_2 and >Read_3 growing by different numbers of bases over cycles 1 to 4]
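Why the reads end up with unequal lengths can be sketched as follows. This is a simplified model under stated assumptions: the function name `sequence_by_flows`, the flow order "TACG" and the toy reads are all made up for illustration, and real instruments use their own flow orders and signal processing.

```python
def sequence_by_flows(read_sequences, flow_order, n_flows):
    """Simulate flow-based sequencing on toy reads.

    Each flow offers ONE nucleotide type to all wells. A read grows by the
    whole run of matching bases (zero, one, or several), so a homopolymer
    is incorporated in a single flow.
    """
    positions = [0] * len(read_sequences)
    history = []
    for flow in range(n_flows):
        base = flow_order[flow % len(flow_order)]
        for i, seq in enumerate(read_sequences):
            # Incorporate every consecutive copy of the flooded base.
            while positions[i] < len(seq) and seq[positions[i]] == base:
                positions[i] += 1
        history.append([seq[:pos] for seq, pos in zip(read_sequences, positions)])
    return history


toy_reads = ["TGA", "TTTCC", "ACGT"]  # toy data
flow_history = sequence_by_flows(toy_reads, flow_order="TACG", n_flows=4)
for flow, reads in enumerate(flow_history, start=1):
    print(f"End flow {flow}:", reads)
```

In the real signal, the three T's of the second toy read would show up as one larger pH change in a single flow, and the number of bases has to be estimated from its size. That is where homopolymer errors come from.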
Third generation sequencing
• No template amplification step (single molecule sequencing)
• Fast
• Produces very long reads (>10,000 bp)
• Assembly gets much easier
Pacific Biosciences - PacBio (3rd gen. sequencing)
• The first 3rd generation sequencer on the market
• Uses Single-Molecule Sequencing in Real Time (SMRT) technology
• Single DNA polymerases are attached to the bottom surface of individual detector wells
• DNA is sequenced as fluorescently labelled nucleotides are incorporated into the complementary strand, since incorporation results in retention of the nucleotide, and this retention can be detected
Advantages:
• Long reads
• Quick run time
Disadvantages:
• Big, expensive machine
• Relatively low accuracy (but errors are not context specific)
• Reagent costs per run are high when only one bacterial strain is sequenced per run
[Figure: PacBio RS II and Sequel sequencers]
Recently: a protocol for multiplexing 5 Mb microbial genomes (e.g., E. coli) up to 12-plex and 2 Mb genomes (e.g., Campylobacter) up to 16-plex, making sequencing of microbial genomes more affordable.
PacBio
Oxford Nanopore (3rd gen. sequencing)
https://www.youtube.com/watch?v=CE4dW64x3Ts
• The newest kid in class
• Sequences single-stranded DNA as it passes through a nanopore
• The MinION is the size of a small cell phone
• VERY long reads (up to 1,000,000 bp?)
• So far also very high error rates (up to 15%)
Comparing sequencing technologies
Platform     Sequencer   Platform cost ($)  Output per run/lane  Max. read length (bp)  Average run time
Illumina     HiSeq 3000  750,000            150 Gbp              250                    4 days
Illumina     MiSeq       100,000            15 Gbp               300                    2 days
Ion Torrent  Proton II   224,000            66 Gbp               200                    4 hours
Ion Torrent  PGM 318     50,000             2 Gbp                400                    7 hours
PacBio       RS II       700,000            400 Mbp              54,000                 3 hours
Nanopore     MinION      1,000              1-10 Gbp             150,000                n.a.*
*Machine run time is adjusted to need of sequencing depth. Example given is for 48 hours
Bleidorn C., Systematics and Biodiversity (2016), 14(1):1-8
What is the data? Fastq files
Fastq example:
@FCC0CD5ACXX:1:1101:1103:2048#ACCGT/1
ACNGTGTTTTTAGTTATTGTTTTGTTAAGTTGGGTTTTTTGTACCCAATAGCCAACAAGCCGCCTTTATGGCGGTTTTTTTGTGCCTGAAAAGTGGGCGCA
+
_BP`ccceggcegihiiighiifhihfddgfhi^efgfhhhhhegiiiiiiiihiihihggeeccdddcccacWTT^acc[ab_`]`[_b`^BBBBBBBB
@FCC0CD5ACXX:1:1101:1165:2058#ACGTT/1
ACGTTAGCAGAATCGCTTTCTGTTCGTTTTCCACCTGCGACAGACGCACCGGACCACGGTTGGCGAGATCGTCGCGCAGAATATCGGCGGCACGCTGCGAC
+
bb_eeceefeggehhdagfghhiihfghighhffhifhhcghfdhiihafgdceba`a\aaccc^V]^baccaccXaaX^bbcccaac[_X]]a[aacXT
@FCC0CD5ACXX:1:1101:1135:2082#AGCGT/1
AGCGTGACAAACATTTTATTGCGCCCGGTTTTATCCAGCTTGAATGCCTGACGAAAGAAGATGATGGTGACGACGATGGAGAGAACAATCAGCACCAGATT
+
bbbeeeeefggfgiihgiigiiiiiiiffgifgeghiiihhfefffhhhfgh_fhggdgegeaceeacbdcbcc\^aa]``_^bb]bcccccbac_a^bc
@FCC0CD5ACXX:1:1101:1239:2083#AGCGT/1
AGCGTCTGACTCACACAAAAACGGTAACACAGTTATCCACAGAATCAGGGGATAAGGCCGGAAAGAACATGTGAGCAAAAAGGCAAAGCCAGGACAAAAGG
+
bbbeeeeegggggiiiiiiiiiigifhhiiighiiihhiiiiiiihiiiiiiiiiihiigcdbbdcdcccccdccccccccacccccccbcccacccccc
1 read, 4 lines
Line 1: Header/ID (starts with @)
Line 2: DNA sequence
Line 3: Name field (optional; the line starts with +)
Line 4: Quality scores (also called PHRED scores)
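The "1 read, 4 lines" layout can be read with a small Python function. This is a minimal sketch, assuming plain-text input in which every record occupies exactly four lines (no wrapped sequences); `FastqRecord`, `read_fastq` and the toy records are names and data invented for this example, not part of any library.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str    # line 1, without the leading '@'
    sequence: str  # line 2, the DNA sequence
    name: str      # line 3, without the leading '+' (often empty)
    quality: str   # line 4, one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one FastqRecord per four lines; stop at the first empty line."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in: {header!r}")
        yield FastqRecord(header[1:], sequence, plus[1:], quality)


# Toy data (not taken from a real run).
toy_fastq = "@read_1\nACNGT\n+\nIIIII\n@read_2\nACG\n+\nII#\n"
records = list(read_fastq(io.StringIO(toy_fastq)))
for record in records:
    print(record.header, record.sequence, record.quality)
```

The length check reflects the rule that every base has exactly one quality character.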
Why are quality scores necessary?
In a perfect world…
In our world…
PHRED (Q) quality scores
• PHRED quality score, $Q = -10 \log_{10} P$
• Error probability, $P = 10^{-Q/10}$
Example: A base call with Q = 30 has an error probability of $10^{-3}$, meaning 1 out of 1000 bases called with this quality score would be wrong
Encodes the probability of an erroneous call
Phred quality score (Q)  Error probability (P)  Probability of incorrect base call  Base call accuracy
10                       0.1                    1 in 10                             90 %
20                       0.01                   1 in 100                            99 %
30                       0.001                  1 in 1000                           99.9 %
40                       0.0001                 1 in 10,000                         99.99 %
50                       0.00001                1 in 100,000                        99.999 %
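The two conversions between Q and P can be checked with a short Python sketch. The function names `error_probability` and `phred_score` are chosen for this example only.

```python
import math


def error_probability(q: float) -> float:
    """P = 10^(-Q/10)"""
    return 10 ** (-q / 10)


def phred_score(p: float) -> float:
    """Q = -10 * log10(P)"""
    return -10 * math.log10(p)


# Recompute the rows of the table from Q alone.
for q in (10, 20, 30, 40, 50):
    p = error_probability(q)
    print(f"Q={q}  P={p:g}  1 in {round(1 / p):,}  accuracy={100 * (1 - p):.3f} %")
```

Each step of 10 in Q corresponds to a tenfold drop in the error probability.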
The PHRED quality scores are written using ASCII encoding
Shown here is the Sanger/Phred+33 conversion table currently used by Illumina
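The Phred+33 conversion amounts to subtracting 33 from the ASCII code of each quality character. A minimal sketch follows; the function names and the `offset` parameter are assumptions for this example. Older files may use a different offset, so it is left adjustable.

```python
def decode_quality(quality_string: str, offset: int = 33) -> list:
    """Turn a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality_string]


def encode_quality(scores, offset: int = 33) -> str:
    """Turn a list of Phred scores back into a quality string."""
    return "".join(chr(score + offset) for score in scores)


toy_quality = "!5?I"  # toy string, not from a real read
toy_scores = decode_quality(toy_quality)
print(toy_scores)
print(encode_quality(toy_scores))
```

With offset 33, the character "!" (ASCII 33) stands for Q = 0 and "I" (ASCII 73) for Q = 40.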
Data quality assessed via FastQC
Great data!
• FastQC is freely downloadable (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
• Great for generating reports on your WGS data
• Not able to trim the data
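To get a feeling for what a per-position quality report summarises, the mean Phred score at each read position can be computed by hand. This is only a rough sketch and not FastQC's implementation: the function name `mean_quality_per_position`, the Phred+33 offset and the toy quality strings are assumptions for this example.

```python
def mean_quality_per_position(quality_strings, offset: int = 33):
    """Mean Phred score at each read position; reads may differ in length."""
    totals = []  # sum of scores seen at each position
    counts = []  # number of reads that reach each position
    for quality in quality_strings:
        for position, char in enumerate(quality):
            if position == len(totals):
                totals.append(0)
                counts.append(0)
            totals[position] += ord(char) - offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


toy_qualities = ["III5", "I?5", "II"]  # toy data, Phred+33
position_means = mean_quality_per_position(toy_qualities)
for position, mean in enumerate(position_means, start=1):
    print(f"position {position}: mean Q = {mean:.1f}")
```

A mean that drops towards the end of the reads, as in the toy data, is the kind of pattern that suggests trimming.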
Data quality assessed via FastQC
Horrible data!
How to perform read trimming using PRINSEQ
Recap by multiple choice scratch cards