NGSrawdataanalysis · 2013. 3. 4. · Libraries, lanes, and flowcells Each reaction produces a unique library of DNA fragments for sequencing. Each NGS machine processes a single

Laura Riva & Ma+a Pelizzola

NGS raw data analysis Introduc3on to Next-‐Genera3on Sequencing (NGS)-‐data Analysis

1.  Descrip7on of sequencing pla

DNA fragmenta3on and adapter liga3on A@achment to the flow cell

Cluster genera3on by bridge amplifica3on of DNA fragments

Single base extension with incorpora3on of fluorescently labeled nucleo3des

Library prepara3on

Sequencing

Automated cluster genera3on

Illumina Sequencing

Sequence Files Quality Scores

Illumina Output

454 Ti RocheT Illumina HiSeqTM 2000

ABI 5500 (SOLiD)

Amplifica7on Emulsion PCR Bridge PCR Emulsion PCR

Sequencing reac7on

Pyrosequencing Reversible terminators

Liga7on-‐based sequencing

Paired ends/sep Yes/3kb Yes/200 bp Yes/3 kb

Read length 400 bp 100 bp 75 bp

Advantages Short run 7mes. Longer reads improve mapping in repe77ve regions. Ability to detect large structural varia7ons

The most popular pla

The Illumina HiSeq!

Stacey Gabriel

Libraries, lanes, and flowcells

Each reaction produces a unique library of DNA fragments for

sequencing.

Each NGS machine processes a single flowcell containing several independent lanes during a single

sequencing run

Illumina

Lanes

Flowcell

Applications of Next-Generation Sequencing

Basic data analysis concepts: •  Raw data analysis= image processing and base calling

•  Understanding Fastq •  Understanding Quality scores •  Quality control and read cleaning •  Alignment to the reference genome •  Understanding SAM/BAM formats •  Samtools •  Bedtools

Understanding FASTQ file format FASTQ format is a text-‐based format for storing a biological sequence and its corresponding quality scores. It has become the standard for storing the output of high throughput sequencing instruments. EXAMPLE:

1.  begins with a '@' character and is followed by a sequence iden7fier 2.  the raw sequence leeers. 3.  begins with a '+' character and is op#onally followed by the same sequence iden7fier 4.  encodes the quality values for the sequence in and must contain the same number of symbols as leeers in the sequence.

Understanding Quality scores Fastq contains quality informa7on. Phred quality scores �Q� are defined as a property which is logarithmically related to the base-‐calling error probabili#es �P� Where P is the probability that the corresponding base call is incorrect.

Phred quality scores Phred quality scores �Q� are represented with a single bit in ASCII format. ASCII stands for American Standard Code for Informa7on Interchange. ASCII code is the numerical representa7on of a character such as 'a' or '@' or an ac7on of some sort. The first 32 symbols in ASCII are control characters, so we start at 33. If your ASCII character is ‘B’ and they are in Sanger format, then 66-‐33=33, so 33= (-‐10log10p) -‐3.3=log10p 10-‐3.3=p, so p= 0.0005 or 0.05% chance of an incorrect base.

@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA + BBBBCCCC?

•  Before alignment there is some7mes the need to preprocess/manipulate the FASTA/FASTQ files to produce beeer mapping results. It is important to do quality control checks to understand whether your data has any problems of which you should be aware before doing any further analysis

•  FastQC: quality control checks on raw sequence data coming from

high throughput sequencing pipelines (hep://www.bioinforma7cs.bbsrc.ac.uk/projects/fastqc/).

•  The FASTX-‐Toolkit tools perform some of these preprocessing tasks (hep://hannonlab.cshl.edu/fastx_toolkit/). Two of many useful tools are:

–  FASTQ Quality Filter à Filters sequences based on quality –  FASTQ Quality Trimmer à Trims (cuts) sequences based on quality

Quality control and read cleaning

FastQC: quality controls of sequencing files I

PER BASE SEQUENCE QUALITY

ACCTGGGATCAAACATTCAGGACATATAGCACAATAGGAC

90th percen7le of the data

median

mean

10th percen7le of the data

Very good quality

Reasonable quality

Poor quality

A very good quality sequence

FastQC: quality controls of sequencing files II

DUPLICATION LEVELS

How many 7mes is the sequence repeated?

Which por7on of the sequences has that characteris7c?

Computed for the first 200’000 reads gives an overall impression for the duplicate levels in the whole file

A very good quality sequence

General steps of data analysis: – Quality control and read cleaning – Alignment to the genome

There are many aligners and many short reads aligners:

Alignment strategy: •  Mapping: quickly iden7fy candidates of hits on the reference

genome •  Alignment and report: score the alignment

•  Important features: –  Some souware use the base quality score to evaluate alignment,

others do not –  For all the aligners there is a trade off between performance and

accuracy –  Gapped or ungapped alignment –  Important parameters:

•  Maximum of mismatches •  Repor7ng unique hits or mul7ple hits

BWA • It uses Burrows-‐Wheeler indexing algorithm to speed up alignment 7me

•  Fast and moderate memory usage

•  Work for different sequencing pla

The Sequence Alignment/Map (SAM) format: • Format for the storage of sequence alignments and their mapping coordinates • Supports different sequencing pla

SAM file format

BAM headers: an optional (but essential) part of a BAM file

@HD VN:1.0 GO:none SO:coordinate @SQ SN:chrM LN:16571 @SQ SN:chr1 LN:247249719 @SQ SN:chr2 LN:242951149 [cut for clarity] @SQ SN:chr9 LN:140273252 @SQ SN:chr10 LN:135374737 @SQ SN:chr11 LN:134452384 [cut for clarity] @SQ SN:chr22 LN:49691432 @SQ SN:chrX LN:154913754 @SQ SN:chrY LN:57772954 @RG ID:20FUK.1 PL:illumina PU:20FUKAAXX100202.1 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.2 PL:illumina PU:20FUKAAXX100202.2 LB:Solexa-18484 SM:NA12878 CN:BI @RG ID:20FUK.3 PL:illumina PU:20FUKAAXX100202.3 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.4 PL:illumina PU:20FUKAAXX100202.4 LB:Solexa-18484 SM:NA12878 CN:BI @RG ID:20FUK.5 PL:illumina PU:20FUKAAXX100202.5 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.6 PL:illumina PU:20FUKAAXX100202.6 LB:Solexa-18484 SM:NA12878 CN:BI @RG ID:20FUK.7 PL:illumina PU:20FUKAAXX100202.7 LB:Solexa-18483 SM:NA12878 CN:BI @RG ID:20FUK.8 PL:illumina PU:20FUKAAXX100202.8 LB:Solexa-18484 SM:NA12878 CN:BI @PG ID:BWA VN:0.5.7 CL:tk @PG ID:GATK TableRecalibration VN:1.0.2864 20FUKAAXX100202:1:1:12730:189900 163 chrM 1 60 101M = 282 381

GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTA…[more bases] ?BA@A>BBBBACBBAC@BBCBBCBC@BC@CAC@:BBCBBCACAACBABCBCCAB…[more quals] RG:Z:20FUK.1 NM:i:1 SM:i:37 AM:i:37 MD:Z:72G28 MQ:i:60 PG:Z:BWA UQ:i:33

Required: Standard header

Essential: contigs of aligned reference

sequence. Should be in karotypic order.

Essential: read groups. Carries platform (PL), library

(LB), and sample (SM) information. Each read is

associated with a read group

Useful: Data processing tools applied to the reads

SAM file format

SAM file format

They quan7fy the probability that a read is misplaced (P probability that the alignment is wrong). The error probability scaled in the Phred is the mapping quality (Q). The calcula7on of mapping quali7es considers all the following factors: •  The repeat structure of the reference. Reads falling in repe77ve regions

have very low mapping quality. •  The base quality of the read. Low quality means the observed read

sequence is possibly wrong, and wrong sequence may lead to a wrong alignment.

•  The sensi7vity of the alignment algorithm. The true hit is more likely to be missed by an algorithm with low sensi7vity, which also causes mapping errors.

•  Paired end or not. Reads mapped in pairs are more likely to be correct.

Mapping quality scores

•M: match/mismatch •I: inser7on •D: dele7on •S: souclip •H: hardclip •P: padding •N: skip

CIGAR string

SAMtools •  Library and souware package that manipulate BAM/SAM files •  SAMàBAM conversion (samtools view) •  Samtools view –f INT file.bam

Only output alignments with all bits in INT present in the FLAG field. •  Samtools view –F INT file.bam

Skip alignments with bits present in INT [0] •  Creates sorted and indexed BAM files from SAM files (samtools sort /samtools

index) •  Removing PCR duplicates (samtools rmdup) •  Merging alignments (samtools merge) •  Visualiza7on of alignments from BAM files •  SNP calling and short indel detec3on

References: 1.  hep://samtools.sourceforge.net/ 2. hep://samtools.sourceforge.net/samtools.shtml

BEDtools

References: 1  Quinlan AR and Hall IM, 2010. BEDTools: a flexible suite of u7li7es for comparing genomic

features. Bioinforma7cs. 26, 6, pp. 841–842. 2  hep://code.google.com/p/bedtools/downloads/detail?name=BEDTools-‐User-‐Manual.v4.pdf

The BEDTools u7li7es allow one to address common genomics tasks such as finding feature overlaps and compu7ng coverage. The u7li7es are largely based on four widely-‐used file formats: BED, GFF/GTF, VCF and SAM/BAM.

•  slopBed Adjusts each BED entry by a requested number of base pairs. •  shuffleBed Randomly permutes the loca7ons of a BED file among a genome. •  intersectBed (BAM) Returns overlaps between two BED/GFF/VCF files. •  genomeCoverageBed (BAM) Creates a "per base" report of genome

coverage. •  subtractBed Removes the por7on of an interval that is overlapped by

another feature. •  mergeBed Merges overlapping features into a single feature.

BED format

Freely available at hep://github.org/pezmaster31/bamtools

BamTools

Picard comprises Java-‐based command-‐line u7li7es that manipulate SAM files, and a Java API (SAM-‐JDK) for crea7ng new programs that read and write SAM files. Both SAM text format and SAM binary (BAM) format are supported Freely available at hep://picard.sourceforge.net

Picard

GATK

Freely available at hep://www.broadins7tute.org/gatk/

Reads and fragments

Depth of coverage

Individual reads aligned to the genome

Clean C/T heterozygote

Details about specific read

First and second read from the same fragment

Reference genome

Non-reference bases are colored; reference bases are grey

IGV screenshot

Reference: hep://www.broadins7tute.org/igv/

Reads Visualiza7on: IGV

Recita7on We are going to use a fastq file that is present in GEO (hep://www.ncbi.nlm.nih.gov/geo/, record GSE24326) It is an Hsf1 ChIP-‐seq experiment used to iden7fy the genome-‐wide sites of HSF1 binding in human adipocytes derived from mesenchymal stem cells. HSF1 (Heat shock factor protein 1) is a heat-‐shock transcrip7on factor. Its target genes include major inducible heat shock proteins such as Hsp72. The transcrip7on of heat-‐shock genes is rapidly induced auer temperature stress. The known Hsf1 mo7f is:

Quality control

Quality control fastx_quality_stats -‐Q 33 -‐i hsf1.fastq -‐o hsf1_stats.txt The output TEXT file will have the following fields (one row per column): •  column = column number (1 to 36 for a 36-‐cycles read solexa file) •  count = number of bases found in this column. •  min = Lowest quality score value found in this column. •  max = Highest quality score value found in this column. •  sum = Sum of quality score values for this column. •  mean = Mean quality score value for this column. •  Q1 = 1st quar7le quality score. •  med = Median quality score. •  Q3 = 3rd quar7le quality score. •  IQR = Inter-‐Quar7le range (Q3-‐Q1). •  lW = 'Leu-‐Whisker' value (for boxplo+ng). •  rW = 'Right-‐Whisker' value (for boxplo+ng). •  A_Count = Count of 'A' nucleo7des found in this column. •  C_Count = Count of 'C' nucleo7des found in this column. •  G_Count = Count of 'G' nucleo7des found in this column. •  T_Count = Count of 'T' nucleo7des found in this column. •  N_Count = Count of 'N' nucleo7des found in this column. •  max-‐count = max. number of bases (in all cycles)

fastq_quality_boxplot_graph.sh -‐i hsf1_stats.txt -‐o hsf1.png fastx_ar7facts_filter -‐Q 33 -‐v -‐i hsf1.fastq -‐o hsf1_ar7fact.fastq It removes sequencing ar7facts Input: 6313567 reads. Output: 6312015 reads. discarded 1552 (0%) ar7fact reads.

fastq_quality_trimmer -‐Q 33 -‐t 20 -‐l 28 -‐v -‐i hsf1_ar7fact.fastq –o hsf1_ar7fact_trim.fastq

It trims sequences based on quality Minimum Quality Threshold: 20 Minimum Length: 28 Input: 6312015 reads. Output: 5929465 reads. discarded 382550 (6%) too-‐short reads. fastq_quality_filter -‐Q 33 -‐v -‐q 20 -‐p 80 –I hsf1_ar7fact_trim.fastq -‐o

hsf1_ar7fact_trim_qual.fastq It filters sequences based on quality Quality cut-‐off: 20 Minimum percentage: 80 Input: 5929465 reads. Output: 5655141 reads. discarded 274324 (4%) low-‐quality reads.

•  fastx_quality_stats -‐Q 33 –i hsf1_ar7fact_trim_qual.fastq -‐o hsf1_ar7fact_trim_qual_stats.txt

•  fastq_quality_boxplot_graph.sh -‐i hsf1_ar7fact_trim_qual_stats.txt -‐o hsf1_ar7fact_trim_qual.png

Quality control with fastQC /home/lriva/FastQC/fastqc -‐-‐contaminants /home/lriva/FastQC/Contaminants/contaminant_list.txt hsf1_ar7fact_trim_qual.fastq

§  A fast program able to align short query sequences (< 200 bp; error rate < 3% ) to a target reference genome

§  Work flow:

ü index [-options] (it index the database)

ü aln [-options] < input query.fastq> (it computes the SA coordinates for the input reads)

ü samse [-options] (it converts SA coordinates into chromosomal coordinates and generates alignment in SAM formats for single reads)

BWA

Alignment of the reads

•  bwa aln -‐t 4 -‐f hsf1.sai /db/bwa/0.6.2/hg19/hg19 hsf1_ar7fact_trim_qual.fastq

•  bwa samse /db/bwa/0.6.2/hg19/hg19 hsf1.sai hsf1_ar7fact_trim_qual.fastq | samtools view –ut /db/bwa/0.6.2/hg19/hg19 -‐ | samtools sort -‐ hsf1

•  samtools index hsf1.bam

Visualizing alignments, conver7ng from BAM binary to SAM text format samtools view -‐h -‐o hsf1_stdchr.nodup.sam hsf1_stdchr.nodup.bam

less hsf1_stdchr.nodup.sam

Go to hep://bioserver.iit.ieo.eu/~lriva/ and download corsoPoplimiChIPseq.txt ssh [email protected] ssh user@dpt-‐n1.bioinfo.ieo.eu cd public_html You can check all your files at hep://bioserver.iit.ieo.eu/~user/

Let’s start

Filtering alignments

coun7ng and visualizing unmapped reads •  samtools view -‐f 4 -‐c hsf1.bam •  samtools view -‐f 4 hsf1.bam | less

HWI-‐EAS413:4:1:3:1708 4 * 0 0 * * 0 0 GGTGCTGGCCAACATGACCTTGCACGGA BCACCCCBCCCBCCBBCCCBB

Filtering alignments coun7ng and visualizing mapped reads •  samtools view -‐F 4 -‐c hsf1.bam •  samtools view -‐F 4 hsf1.bam | less HWI-‐EAS413:4:100:367:1780 16 chr1 9999 25 35M * 0 0 GATCTCCCTAACCCTAACCCTAACCCTAACCCTAA ?B;BABBAAC?AC?AC?CBCBCBCCCBCBBCCBCB XT:A:U NM:i:2 XN:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:3A0A30

select only mapped reads with mapping quality>=0 samtools view -‐F 4 -‐q 1 -‐bo hsf1_stdchr.bam hsf1.bam chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX

Prefix trie of string ‘GOOGOL’ 0,6

1,2

1,2 4,4

6,6

2,2

2,2

4,6

1,2

1,2 4,4

6,6

2,2

2,2

6,6

2,2

2,2

3,3

5,5

1,1

4,4

6,6

2,2

2,2

G

O

O

G

O

O

G

O

O

G

G

O

G

O

O

G

L

^

^

^

^

^

^ ^ = start of the string

root

Burrows-‐Wheeler transform (BWT)

X = ‘googol$’ (reference string) W = ‘go’ (query string) SA interval=[1,2]

* * g o o g o l 0 1 2 3 4 5

SA interval of ‘go’

B= BWT(X)" (BWT string of X)"

Suffix array =

googol + $"

Suffix array interval: formal defini#on

Example:""X = ‘googol$’"W = ‘go’ "

If W is an empty string then" = [1, n-1]"[ , ]"= SA interval of W "

(‘go’)= 1 "(‘go’)= 2 "

SA interval of ‘go’ = [1,2]"

[ , ]"

Exact match: Ferragina-‐Manzini algorithm

true only if and only if aW is a substring of X

X = reference string; W = query string"C(a) = number of symbols < a in X[0,n-2]"O(a, i) = number of occurrences of a in B[0,i] where B= BWT (X)"If W is a substring of X: "

Exact match: backward search

0 6 $googol 1 3 gol$goo 2 0 googol$ 3 5 l$googo 4 2 ogol$go 5 4 ol$goog 6 1 oogol$g

X= ‘googol$’"bwt= ‘lo$oogg’"W=‘ogo’"

""

if W ← ‘ ’:! L ←1! R ← length(X)-1!j←length(W)-1!!while (L ≤ R and j ≥ 0) do:! aW←W[ j:end ]!! if length(aW)

Exact match: backward search (con#nued)

A T T T G A T C T T T T G A T C 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

* *

X=

W= GATC

SA interval =[6,7]

SA= (16, 13, 5, 0, 15, 7, 12, 4, 14, 6, 11, 3, 10, 2, 9, 1, 8)

BWT string= CGG$TTTTAATTTTTAC 0 16 $ATTTGATCTTTTGATC 1 13 ATC$ATTTGATCTTTTG 2 5 ATCTTTTGATC$ATTTG 3 0 ATTTGATCTTTTGATC$ 4 15 C$ATTTGATCTTTTGAT 5 7 CTTTTGATC$ATTTGAT 6 12 GATC$ATTTGATCTTTT 7 4 GATCTTTTGATC$ATTT 8 14 TC$ATTTGATCTTTTGA 9 6 TCTTTTGATC$ATTTGA 10 11 TGATC$ATTTGATCTTT 11 3 TGATCTTTTGATC$ATT 12 10 TTGATC$ATTTGATCTT 13 2 TTGATCTTTTGATC$AT 14 9 TTTGATC$ATTTGATCT 15 1 TTTGATCTTTTGATC$A 16 8 TTTTGATC$ATTTGATC

Inexact match algorithm

•  DNA is fragmented •  Adaptors ligated to fragments •  Several possible protocols yield

array of PCR colonies. (AMPLIFICATION)

•  Enzyma7c extension with fluorescently tagged nucleo7des. (SEQUENCING REACTION)

•  Cyclic readout by imaging the array.

•  Align strings to the reference genome.

Basics of NGS technology!

Consider the following paper that describes sequencing technologies in details: Michael Metzker ‘Sequencing technologies-‐the next genera7on’ Nat Rev Genet. 2010 Jan;11(1):31-‐46.

Mapping Algorithm (2 mismatches)

GATGCATTGCTATGCCTCCCAGTCCGCAACTTCACG

GATGCATTG CTATGCCTC CCAGTCCGC AACTTCACG seeds

Exact match Genome

GATGCATTG CTATGCCTC CCAGTCCGC AACTTCACG .........

Indexed table of exactly matching seeds

Approximate search around the exactly matching seeds

Mapping Algorithm (2 mismatches)

•  Partition reads into 4 seeds {A,B,C,D} –  At least 2 seed must map with no mismatches

•  Scan genome to identify locations where the seeds match exactly –  6 possible combinations of the seeds to search

•  {AB, CD, AC, BD, AD, BC} –  6 scans to find all candidates

•  Do approximate matching around the exactly-matching seeds. –  Determine all targets for the reads –  Ins/del can be incorporated

•  The reads are indexed and hashed before scanning genome

•  Bit operations are used to accelerate mapping –  Each nt encoded into 2-bits

Documents

NGSrawdataanalysis · 2013. 3. 4. · Libraries, lanes, and flowcells Each reaction produces a unique library of DNA fragments for sequencing. Each NGS machine processes a single