Quality Control
Hubert DENISE ([email protected])
Image credits:
(1) Christina Toft & Siv G. E. Andersson; (2) Dalebroux Z D et al. Microbiol. Mol. Biol. Rev. 2010;74:171-199
Metagenomics data analysis
Quality control
Diversity analysis
Functional analysis
QC rationale
Why?
Garbage in, garbage out
Base call errors: each base call has an associated quality score, and each platform has its own specific errors
Read quality decreases with read length
NGS generates duplicate reads (false and real). Reducing duplication reduces analysis time and prevents analysis bias.
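As a minimal sketch of what the per-base quality score means: under the standard Phred definition, a score Q corresponds to an error probability of 10^(-Q/10), and in Phred+33 FASTQ files each score is stored as the character with ASCII code Q + 33. The function names below are illustrative only and are not part of any pipeline described here.

```python
def phred33_to_scores(quality_string):
    """Decode a Phred+33 quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


scores = phred33_to_scores("!!#$:(*1<=")
print(scores)
print([round(error_probability(q), 4) for q in scores])
```

A score of 20 therefore means roughly a 1 in 100 chance that the base call is wrong, and a score of 30 roughly 1 in 1000.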
EBI Metagenomics: QC step by step
Clipping - low-quality ends are trimmed and adapter sequences removed using the Biopython SeqIO package
Quality filtering - sequences with > 10% undetermined nucleotides are removed
Read length filtering - short sequences are removed: 100 nt threshold
Duplicate sequence removal - reads are clustered at 99% identity (UCLUST v 1.1.579 for 454 and QIIME prefix clustering for Illumina) and a representative sequence is chosen
Repeat masking - RepeatMasker (open-3.2.2); reads with 50% or more nucleotides masked are removed
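The filtering steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the EBI pipeline code: reads of exactly 100 nt are assumed to pass the length filter, exact-duplicate removal stands in for the 99% identity clustering, and clipping and repeat masking are left out. All function names are hypothetical.

```python
def passes_ambiguity_filter(seq, max_n_fraction=0.10):
    """Keep reads with at most 10% undetermined (N) nucleotides."""
    if not seq:
        return False
    return seq.upper().count("N") / len(seq) <= max_n_fraction


def passes_length_filter(seq, min_length=100):
    """Keep reads of at least min_length nt (boundary handling is an assumption)."""
    return len(seq) >= min_length


def filter_reads(seqs):
    """Apply both filters, then drop exact duplicates (a simplification of
    clustering at 99% identity); the first copy of each read is kept."""
    seen = set()
    kept = []
    for seq in seqs:
        if not (passes_ambiguity_filter(seq) and passes_length_filter(seq)):
            continue
        key = seq.upper()
        if key in seen:
            continue
        seen.add(key)
        kept.append(seq)
    return kept


reads = ["A" * 100, "A" * 100, "A" * 99, "A" * 89 + "N" * 11, "A" * 90 + "N" * 10]
print(len(filter_reads(reads)))
```

In this toy input the second read is dropped as a duplicate, the third as too short, and the fourth because 11% of its bases are N.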
EBI Metagenomics: QC consequences
(Figure panels: Roche 454, Illumina, Ion Torrent)
MG-RAST QC
• dereplication (first 50 bp)
• model organism screening (bowtie)
• length filtering (> 75 bp)
• ambiguous base filtering (< 5 bp)
• dynamic base filtering (phred score)
• analysis
EBI Metagenomics QC
• duplicate sequence filtering (first 50 bp)
• repeat masking
• clipping (10%)
• quality filtering (phred score)
• read length filtering (> 100 bp)
• analysis
QC Tutorial
• Today we’ll investigate a dataset obtained from water sampled at varying depths in the Pacific Ocean: 25 m, 75 m, 125 m and 500 m
• First we will look at the “HOT_Station_ALOHA,_25m_depth” fastq sequence file using the FastQC software
• Then we will use the Trimmomatic package to perform quality and length trimming on this file
Performing QC steps using Trimmomatic
• All instructions are provided in the manual
• Trimmomatic is written in Java but you only need basic Unix knowledge to run it
• Trimmomatic functions:
- removal of Illumina adapters from reads,
- quality filtering,
- length trimming,
- conversion of quality score format
• In this tutorial we will only perform quality and length filtering
• More details at http://www.usadellab.org/cms/?page=trimmomatic.
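For orientation, a single-end Trimmomatic run with the steps used in this tutorial could be assembled as below. This is a sketch under assumptions: the jar path and the file names are placeholders, single-end (SE) mode is assumed, and the MINLEN value of 100 is a hypothetical length threshold, since the slides do not state one for the tutorial. The helper only builds the argument list; it does not run anything.

```python
def build_trimmomatic_se_command(jar_path, fastq_in, fastq_out, min_len):
    """Return the argument list for a single-end Trimmomatic run.

    jar_path, fastq_in, fastq_out and min_len are supplied by the caller;
    the quality steps mirror the ones discussed in this tutorial.
    """
    return [
        "java", "-jar", jar_path,
        "SE", "-phred33",
        fastq_in, fastq_out,
        "LEADING:8",           # trim low-quality bases from the 5' end
        "TRAILING:8",          # trim low-quality bases from the 3' end
        "SLIDINGWINDOW:4:15",  # window size 4, required average quality 15
        f"MINLEN:{min_len}",   # drop reads shorter than min_len after trimming
    ]


# Placeholder names; MINLEN 100 is an assumed value.
cmd = build_trimmomatic_se_command("trimmomatic.jar", "input.fastq", "trimmed.fastq", 100)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` once the placeholders are replaced with real paths.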
@D8QSB6V1:140:HA62CADXX:1:1101:1343:2227_1:N:0:AGTTCC
TCGGTTTTTCATCCAATTGAGTCGTCCCGTTGATAGTGAACTGGTACGTCATCGACTGCA...
...TGCACGTTCGGATTGGTCACCTCAATCGCAATATCGTAGCGATTGTTACCCAGAGGAAATA
+
!!#$:(*1<="#HHA@IJIIJIHIJIJIJIIIJIGIBGIJJIIIFHGBHIIJIIIIIJJI...
...@CCFDFFFGHHHHIIIJIIJIHIJIJIJIIIJIGJHIJIIIFHGB2$'=IC5);=HA&&#%
Trimmomatic steps used in this tutorial
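A - LEADING:8 TRAILING:8 (8 = quality threshold)
Quality scores (phred 33) at the read ends:
5’ end: 0 0 2 3 25 7 9 16 27 28 …
3’ end: … 26 28 39 32 5 5 2 4
Bases scoring below 8 are removed from each end: 0 0 2 3 at the 5’ end and 5 5 2 4 at the 3’ end. The 7 further inside the read is kept.
Trimmed sequence: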
@D8QSB6V1:140:HA62CADXX:1:1101:1343:2227_1:N:0:AGTTCC
TTTTTCATCCAATTGAGTCGTCCCGTTGATAG...CGTAGCGATTGTTACCCAGAGGA
+
:(*1<="#HHA@I...B2$'=IC5);=HA
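The end trimming in step A can be reproduced with a few lines of Python. This is a minimal sketch of the idea, not Trimmomatic's own code: bases are removed from each end while their score is below the threshold, and low scores inside the read are left alone. The test read joins the first ten and last ten bases of the tutorial read, which is an assumption made to keep the example short.

```python
def trim_leading_trailing(seq, qual, leading=8, trailing=8, offset=33):
    """Trim bases scoring below the thresholds from both ends of a read.

    seq and qual are the sequence and Phred+33 quality strings of one read.
    Returns the trimmed (sequence, quality) pair.
    """
    scores = [ord(ch) - offset for ch in qual]
    start = 0
    while start < len(scores) and scores[start] < leading:
        start += 1
    end = len(scores)
    while end > start and scores[end - 1] < trailing:
        end -= 1
    return seq[start:end], qual[start:end]


# First 10 and last 10 bases of the tutorial read, joined for brevity.
seq = "TCGGTTTTTC" + "AGAGGAAATA"
qual = "!!#$:(*1<=" + "5);=HA&&#%"
print(trim_leading_trailing(seq, qual))
```

Four bases are removed at each end, while the base with score 7 just inside the 5’ end survives because end trimming stops at the first base that meets the threshold.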
Trimmomatic steps used in this tutorial
B - SLIDINGWINDOW:4:15 (4 = window size, 15 = required average quality)
The read is scanned window by window in the 5’ to 3’ direction (whole read is scanned).
Example windows shown on the slide:
- sum = 57, avg = 14.25
- sum = 58, avg = 14.5
- sum = 141, avg = 35.25: no trimming
- etc. … avg ≥ 15: no trimming
- last window (33 3 6 17): sum = 59, avg = 14.75 < 15 => trimming
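The sliding-window idea can be sketched as follows. This is a simplified illustration and may differ from Trimmomatic's exact behaviour: the read is scanned from the 5’ end, and it is cut at the start of the first window whose average quality falls below the threshold. Reads shorter than the window are returned untrimmed, which is an assumption of this sketch. The scores before the last window in the example are hypothetical.

```python
def window_average(scores):
    """Average quality of one window."""
    return sum(scores) / len(scores)


def sliding_window_keep_length(scores, window=4, min_avg=15):
    """Return how many bases to keep from the 5' end.

    Scans windows of `window` bases from 5' to 3' and cuts the read at the
    start of the first window whose average is below min_avg.
    """
    for start in range(len(scores) - window + 1):
        if window_average(scores[start:start + window]) < min_avg:
            return start
    return len(scores)


# Hypothetical high-quality bases followed by the slide's window 33 3 6 17.
scores = [30, 30, 30, 30, 33, 3, 6, 17]
print(window_average([33, 3, 6, 17]))
print(sliding_window_keep_length(scores))
```

Here every window up to 30 33 3 6 (average 18) passes, and the read is cut where the window 33 3 6 17 (average 14.75) begins.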