Quality Control Hubert DENISE (hudenise@ebi.ac.uk)

Page 1: Quality Control Hubert DENISE (hudenise@ebi.ac.uk)

Quality Control

Hubert DENISE (hudenise@ebi.ac.uk)

Page 2: Quality Control Hubert DENISE (hudenise@ebi.ac.uk)

Image credits:

(1) Christina Toft & Siv G. E. Andersson; (2) Dalebroux Z D et al. Microbiol. Mol. Biol. Rev. 2010;74:171-199

Metagenomics data analysis

Quality control

Diversity analysis

Functional analysis

Page 3: Quality Control Hubert DENISE (hudenise@ebi.ac.uk)

QC rationale

Why?

Garbage in, garbage out

Base call errors:
- each base call has an associated quality score
- errors are specific to each platform
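As a minimal sketch of what a per-base quality score means, the snippet below decodes a Phred+33 quality string (the encoding used later in this tutorial) and converts a score Q into the standard error probability 10^(-Q/10). The function names are illustrative and not taken from any tool mentioned in these slides.

```python
def phred33_scores(quality_string):
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    print(phred33_scores("!!#$:(*1<="))
    print(error_probability(20))
```

A score of 20 therefore corresponds to roughly one wrong call in 100, and a score of 30 to one in 1000.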

Read quality decreases with read length

NGS generates duplicate reads (false and real). Reducing duplication reduces analysis time and prevents analysis bias.

Page 4: Quality Control Hubert DENISE (hudenise@ebi.ac.uk)

EBI Metagenomics: QC step by step

Clipping - low quality ends trimmed and adapter sequences removed using Biopython SeqIO package

Quality filtering - sequences with > 10% undetermined nucleotides removed
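The 10% rule above can be sketched as follows. This is an illustration only, not the pipeline's actual code: it assumes that undetermined nucleotides are written as "N", and the function names are hypothetical.

```python
def undetermined_fraction(seq):
    """Fraction of positions in seq that are 'N' (assumed undetermined)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def passes_quality_filter(seq, max_fraction=0.10):
    """Keep a sequence unless more than max_fraction of its bases are N."""
    return undetermined_fraction(seq) <= max_fraction


if __name__ == "__main__":
    print(passes_quality_filter("ACGTACGTAN"))  # exactly 10% N
    print(passes_quality_filter("ACGTNACGTN"))  # 20% N
```

A read with exactly 10% undetermined bases is kept here, since the slide removes only sequences with more than 10%.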

Read length filtering - short sequences are removed: 100 nt threshold

Duplicate sequence removal - sequences clustered at 99% identity (UCLUST v 1.1.579 for 454 and QIIME prefix clustering for Illumina), with a representative sequence chosen for each cluster

Repeat masking - RepeatMasker (open-3.2.2); reads with 50% or more nucleotides masked are removed

Page 5: Quality Control Hubert DENISE (hudenise@ebi.ac.uk)

EBI Metagenomics: QC consequences

Roche 454

Illumina

Ion Torrent

Page 6: Quality Control Hubert DENISE (hudenise@ebi.ac.uk)

MG-RAST QC

- dereplication (first 50 bp)
- model organism screening (bowtie)
- length filtering (>75 bp)
- ambiguous base filtering (<5 bp)
- dynamic base filtering (phred score)
- analysis

EBI Metagenomics QC

- duplicate sequence filtering (first 50 bp)
- repeat masking
- clipping (10%)
- quality filtering (phred score)
- read length filtering (>100 bp)
- analysis
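Both pipelines list a duplicate-removal step keyed on the first 50 bp. A minimal sketch of that idea is shown below. It is an assumption-laden illustration rather than either pipeline's implementation: it compares prefixes exactly and keeps the first read seen for each prefix, whereas the real tools may pick the representative differently.

```python
def dereplicate_by_prefix(reads, prefix_len=50):
    """Keep one read per distinct prefix.

    reads is an iterable of (name, sequence) pairs; the first read seen
    for each prefix of length prefix_len is kept (illustrative choice).
    """
    seen = set()
    kept = []
    for name, seq in reads:
        key = seq[:prefix_len]
        if key in seen:
            continue
        seen.add(key)
        kept.append((name, seq))
    return kept


if __name__ == "__main__":
    reads = [("r1", "ACGTAC"), ("r2", "ACGTTT"), ("r3", "TTGTAC")]
    print(dereplicate_by_prefix(reads, prefix_len=4))
```

With a prefix length of 4, r1 and r2 share the prefix ACGT, so only r1 and r3 are kept.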

Page 7: Quality Control Hubert DENISE (hudenise@ebi.ac.uk)

QC Tutorial

Introduction to exercise

Hubert Denise

hudenise@ebi.ac.uk

Page 8: Quality Control Hubert DENISE (hudenise@ebi.ac.uk)

QC Tutorial

• Today we’ll be investigating a dataset obtained from varying depths of water taken from the Pacific Ocean

Depths sampled: 25 m, 75 m, 125 m, 500 m

• First we will look at the “HOT_Station_ALOHA,_25m_depth” FASTQ sequence file using the software FastQC

• Then we will use the Trimmomatic package to:

• Perform quality and length trimming on this file
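For the FastQC step above, one of the summaries worth inspecting is how quality changes along the read. The sketch below computes a mean quality per read position from Phred+33 quality strings. It is only meant to illustrate that kind of summary and is not how FastQC itself is implemented; the function name is hypothetical.

```python
def mean_quality_by_position(quality_strings, offset=33):
    """Mean Phred score at each read position across all quality strings."""
    totals = []
    counts = []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):
                # First read long enough to reach this position.
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


if __name__ == "__main__":
    print(mean_quality_by_position(["II", "I5#"]))
```

Reads of different lengths are handled by averaging each position over only the reads that reach it.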

Page 9: Quality Control Hubert DENISE (hudenise@ebi.ac.uk)

Performing QC steps using Trimmomatic

• All instructions are provided in the manual

• Trimmomatic is written in Java but you only need basic Unix knowledge to run it

• Trimmomatic functions: - removal of Illumina adapters from reads,

- quality filtering,

- length trimming,

- conversion of quality score format

• In this tutorial we will only perform quality and length filtering

• More details at http://www.usadellab.org/cms/?page=trimmomatic.
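A hedged sketch of how the quality and length filtering run could be assembled is given below. It builds a Trimmomatic command line as an argument list using the step values shown in this tutorial (LEADING:8, TRAILING:8, SLIDINGWINDOW:4:15). Several things are assumptions: single-end mode (SE), the jar and file names, and the MINLEN value, which the slides do not state and is therefore a required parameter here.

```python
def trimmomatic_se_command(jar, fastq_in, fastq_out, min_len,
                           leading=8, trailing=8,
                           window=4, window_quality=15):
    """Build a single-end Trimmomatic command as a list of arguments.

    jar, fastq_in and fastq_out are placeholder paths supplied by the
    caller; min_len is the minimum read length to keep (not given in
    the slides, so it has no default).
    """
    return [
        "java", "-jar", jar, "SE", "-phred33",
        fastq_in, fastq_out,
        f"LEADING:{leading}",
        f"TRAILING:{trailing}",
        f"SLIDINGWINDOW:{window}:{window_quality}",
        f"MINLEN:{min_len}",
    ]


if __name__ == "__main__":
    # All names and the MINLEN value below are hypothetical placeholders.
    cmd = trimmomatic_se_command("trimmomatic.jar", "input.fastq",
                                 "trimmed.fastq", min_len=100)
    print(" ".join(cmd))
```

The list can be passed to `subprocess.run` once the placeholder paths are replaced with real ones. Trimmomatic applies the steps in the order they appear on the command line.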

Page 10: Quality Control Hubert DENISE (hudenise@ebi.ac.uk)

@D8QSB6V1:140:HA62CADXX:1:1101:1343:2227_1:N:0:AGTTCC
TCGGTTTTTCATCCAATTGAGTCGTCCCGTTGATAGTGAACTGGTACGTCATCGACTGCA...TGCACGTTCGGATTGGTCACCTCAATCGCAATATCGTAGCGATTGTTACCCAGAGGAAATA
+
!!#$:(*1<=“#HHA@IJIIJIHIJIJIJIIIJIGIBGIJJIIIFHGBHIIJIIIIIJJI...@CCFDFFFGHHHHIIIJIIJIHIJIJIJIIIJIGJHIJIIIFHGB2$’=IC5);=HA&&#%

Trimmomatic steps used in this tutorial

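A - LEADING:8 TRAILING:8 (quality threshold)

[Figure: quality scores (phred 33) of the read above. Start of read: 0 0 2 3 25 7 9 16 27 28 … End of read: … 26 28 39 32 5 5 2 4. The bases between the low-quality ends form the trimmed sequence.]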
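The effect of LEADING:8 TRAILING:8 can be sketched in a few lines: bases are dropped from each end for as long as their quality is below the threshold. The example read is a shortened version of the one on this slide (its first ten and last six bases, with the middle omitted), and the function name is illustrative rather than part of Trimmomatic.

```python
def trim_leading_trailing(seq, qual, leading=8, trailing=8, offset=33):
    """Trim low-quality bases from both ends of a read.

    Bases are removed from the start while their Phred score is below
    `leading`, and from the end while it is below `trailing`.
    """
    scores = [ord(ch) - offset for ch in qual]
    start = 0
    while start < len(scores) and scores[start] < leading:
        start += 1
    end = len(scores)
    while end > start and scores[end - 1] < trailing:
        end -= 1
    return seq[start:end], qual[start:end]


if __name__ == "__main__":
    # Shortened slide read: scores 0 0 2 3 25 7 9 16 27 28 39 32 5 5 2 4
    print(trim_leading_trailing("TCGGTTTTTCGAAATA", "!!#$:(*1<=HA&&#%"))
```

The base with score 7 survives because it is not at the very start of the read: trimming stops at the first base that meets the threshold (score 25).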

Page 11: Quality Control Hubert DENISE (hudenise@ebi.ac.uk)

@D8QSB6V1:140:HA62CADXX:1:1101:1343:2227_1:N:0:AGTTCC
TTTTTCATCCAATTGAGTCGTCCCGTTGATAG...CGTAGCGATTGTTACCCAGAGGA
+
:(*1<=“#HHA@IJ...GB2$’=IC5);=HA

Trimmomatic steps used in this tutorial

B - SLIDINGWINDOW:4:15 (window size: 4, average quality: 15)

Works in the 5' to 3' direction (the whole read is scanned).

[Figure: sliding-window example on the read above. Windows shown: sum: 57, avg: 14.25; sum: 58, avg: 14.5; sum = 141, avg = 32.25 (no trimming); sum = 59, avg < 15 (trimming). Windows with avg ≥ 15 are not trimmed.]
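The sliding-window idea can be sketched as below, under a simplifying assumption: the read is scanned from the 5' end and cut at the start of the first window whose average quality falls below the threshold. Trimmomatic's own implementation may place the cut differently, so treat this as an illustration of the principle rather than a reimplementation. The scores in the usage example are made up.

```python
def sliding_window_keep_length(scores, window=4, min_avg=15):
    """Number of bases to keep from the 5' end of a read.

    Scans windows of `window` consecutive scores from the 5' end and
    returns the start index of the first window whose mean is below
    `min_avg`. Reads shorter than the window are kept unchanged.
    """
    for start in range(len(scores) - window + 1):
        if sum(scores[start:start + window]) / window < min_avg:
            return start
    return len(scores)


if __name__ == "__main__":
    # Hypothetical scores: quality collapses towards the 3' end.
    scores = [30, 32, 31, 29, 28, 12, 9, 7, 5, 3]
    keep = sliding_window_keep_length(scores)
    print(keep, scores[:keep])
```

Averaging over a window makes the trimming tolerant of a single poor base inside an otherwise good stretch, which a per-base threshold would not be.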

Page 12: Quality Control Hubert DENISE (hudenise@ebi.ac.uk)

Hubert DENISE (hudenise@ebi.ac.uk)