Laura Riva & Mattia Pelizzola
NGS raw data analysis
Introduction to Next-Generation Sequencing (NGS) data analysis
1. Description of sequencing platforms
DNA fragmentation and adapter ligation
Attachment to the flow cell
Cluster generation by bridge amplification of DNA fragments
Single base extension with incorporation of fluorescently labeled nucleotides
Library preparation
Sequencing
Automated cluster generation
Illumina Sequencing
Sequence Files Quality Scores
Illumina Output
                      454 Ti Roche      Illumina HiSeq 2000      ABI 5500 (SOLiD)
Amplification         Emulsion PCR      Bridge PCR               Emulsion PCR
Sequencing reaction   Pyrosequencing    Reversible terminators   Ligation-based sequencing
Paired ends/sep       Yes/3 kb          Yes/200 bp               Yes/3 kb
Read length           400 bp            100 bp                   75 bp
Advantages: Short run times. Longer reads improve mapping in repetitive regions. Ability to detect large structural variations.
The most popular platform
The Illumina HiSeq!
Stacey Gabriel
Libraries, lanes, and flowcells
Each reaction produces a unique library of DNA fragments for sequencing.
Each NGS machine processes a single flowcell containing several independent lanes during a single sequencing run.
Illumina
Lanes
Flowcell
Applications of Next-Generation Sequencing
Basic data analysis concepts:
• Raw data analysis = image processing and base calling
• Understanding FASTQ
• Understanding quality scores
• Quality control and read cleaning
• Alignment to the reference genome
• Understanding SAM/BAM formats
• Samtools
• Bedtools
Understanding FASTQ file format
FASTQ is a text-based format for storing a biological sequence and its corresponding quality scores. It has become the standard for storing the output of high-throughput sequencing instruments. Each record uses four lines:
1. Line 1 begins with a '@' character and is followed by a sequence identifier.
2. Line 2 is the raw sequence letters.
3. Line 3 begins with a '+' character and is optionally followed by the same sequence identifier.
4. Line 4 encodes the quality values for the sequence in line 2 and must contain the same number of symbols as letters in the sequence.
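The four-line record layout can be checked with a few lines of Python. This is a minimal sketch, assuming unwrapped records (exactly four lines each); the function name `read_fastq` and the record `read1` are made up for illustration.

```python
import io

def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:          # end of file (or trailing blank line)
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a 4-line FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], seq, qual

# Hypothetical record, used only to exercise the parser.
example = io.StringIO("@read1\nACGTACGT\n+\nIIIIHHHH\n")
records = list(read_fastq(example))
print(records)
```

The length check mirrors rule 4 above: a record whose quality line is shorter or longer than its sequence line is rejected.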
Understanding Quality scores
FASTQ contains quality information. Phred quality scores (Q) are defined as a property that is logarithmically related to the base-calling error probabilities (P):
$Q = -10 \log_{10} P$
where P is the probability that the corresponding base call is incorrect.
Phred quality scores
Phred quality scores (Q) are represented with a single character in ASCII format. ASCII stands for American Standard Code for Information Interchange. An ASCII code is the numerical representation of a character such as 'a' or '@', or of an action of some sort. Codes 0 to 31 are control characters and code 32 is the space, so the encoding starts at 33 ('!').
If your ASCII character is 'B' and the file is in Sanger format, then Q = 66 - 33 = 33, so 33 = -10 log10(p), log10(p) = -3.3, and p = 10^-3.3 ≈ 0.0005, a 0.05% chance of an incorrect base.
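The arithmetic above can be reproduced in Python. This is a small sketch, assuming the Sanger offset of 33; the function names are illustrative only.

```python
import math

def char_to_phred(ch, offset=33):
    """Phred score encoded by one quality character (Sanger offset assumed)."""
    return ord(ch) - offset

def phred_to_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 log10(P)."""
    return 10 ** (-q / 10)

def prob_to_phred(p):
    """Phred score Q for an error probability P."""
    return -10 * math.log10(p)

q = char_to_phred("B")      # ord('B') = 66, so Q = 33
p = phred_to_prob(q)        # about 0.0005
print(q, p)
```

Files that use a different offset (for example older Illumina pipelines with offset 64) would need the `offset` argument changed.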
@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
BBBBCCCC?
• Before alignment there is sometimes the need to preprocess/manipulate the FASTA/FASTQ files to produce better mapping results. It is important to do quality control checks to understand whether your data has any problems of which you should be aware before doing any further analysis.
• FastQC: quality control checks on raw sequence data coming from high-throughput sequencing pipelines (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/).
• The FASTX-Toolkit tools perform some of these preprocessing tasks (http://hannonlab.cshl.edu/fastx_toolkit/). Two of many useful tools are:
– FASTQ Quality Filter → Filters sequences based on quality
– FASTQ Quality Trimmer → Trims (cuts) sequences based on quality
Quality control and read cleaning
FastQC: quality controls of sequencing files I
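PER BASE SEQUENCE QUALITY
[Figure: per-base quality boxplot for the example read ACCTGGGATCAAACATTCAGGACATATAGCACAATAGGAC. Annotations: 90th percentile of the data, median, mean, 10th percentile of the data. Background bands: very good quality, reasonable quality, poor quality. Example shown: a very good quality sequence.]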
FastQC: quality controls of sequencing files II
DUPLICATION LEVELS
[Figure: sequence duplication levels plot. Annotations: How many times is the sequence repeated? Which portion of the sequences has that characteristic? Example shown: a very good quality sequence.]
Computed on the first 200,000 reads, which gives an overall impression of the duplication levels in the whole file.
General steps of data analysis:
– Quality control and read cleaning
– Alignment to the genome
There are many aligners and many short-read aligners:
Alignment strategy:
• Mapping: quickly identify candidate hits on the reference genome
• Alignment and report: score the alignment
• Important features:
– Some software uses the base quality score to evaluate the alignment, others do not
– For all the aligners there is a trade-off between performance and accuracy
– Gapped or ungapped alignment
– Important parameters:
  • Maximum number of mismatches
  • Reporting unique hits or multiple hits
BWA
• It uses the Burrows-Wheeler indexing algorithm to speed up alignment
• Fast, with moderate memory usage
• Works for different sequencing platforms
The Sequence Alignment/Map (SAM) format:
• Format for the storage of sequence alignments and their mapping coordinates
• Supports different sequencing platforms
SAM file format
BAM headers: an optional (but essential) part of a BAM file
@HD VN:1.0 GO:none SO:coordinate
@SQ SN:chrM LN:16571
@SQ SN:chr1 LN:247249719
@SQ SN:chr2 LN:242951149
[cut for clarity]
@SQ SN:chr9 LN:140273252
@SQ SN:chr10 LN:135374737
@SQ SN:chr11 LN:134452384
[cut for clarity]
@SQ SN:chr22 LN:49691432
@SQ SN:chrX LN:154913754
@SQ SN:chrY LN:57772954
@RG ID:20FUK.1 PL:illumina PU:20FUKAAXX100202.1 LB:Solexa-18483 SM:NA12878 CN:BI
@RG ID:20FUK.2 PL:illumina PU:20FUKAAXX100202.2 LB:Solexa-18484 SM:NA12878 CN:BI
@RG ID:20FUK.3 PL:illumina PU:20FUKAAXX100202.3 LB:Solexa-18483 SM:NA12878 CN:BI
@RG ID:20FUK.4 PL:illumina PU:20FUKAAXX100202.4 LB:Solexa-18484 SM:NA12878 CN:BI
@RG ID:20FUK.5 PL:illumina PU:20FUKAAXX100202.5 LB:Solexa-18483 SM:NA12878 CN:BI
@RG ID:20FUK.6 PL:illumina PU:20FUKAAXX100202.6 LB:Solexa-18484 SM:NA12878 CN:BI
@RG ID:20FUK.7 PL:illumina PU:20FUKAAXX100202.7 LB:Solexa-18483 SM:NA12878 CN:BI
@RG ID:20FUK.8 PL:illumina PU:20FUKAAXX100202.8 LB:Solexa-18484 SM:NA12878 CN:BI
@PG ID:BWA VN:0.5.7 CL:tk
@PG ID:GATK TableRecalibration VN:1.0.2864
20FUKAAXX100202:1:1:12730:189900 163 chrM 1 60 101M = 282 381
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTA…[more bases]
?BA@A>BBBBACBBAC@BBCBBCBC@BC@CAC@:BBCBBCACAACBABCBCCAB…[more quals]
RG:Z:20FUK.1 NM:i:1 SM:i:37 AM:i:37 MD:Z:72G28 MQ:i:60 PG:Z:BWA UQ:i:33

Required: standard header (@HD).
Essential: contigs of the aligned reference sequence (@SQ). Should be in karyotypic order.
Essential: read groups (@RG). Carries platform (PL), library (LB), and sample (SM) information. Each read is associated with a read group.
Useful: data processing tools applied to the reads (@PG).
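Header lines like the ones above are easy to inspect programmatically. The sketch below assumes the header is available as text (for example from `samtools view -H`) and that fields are tab-separated, as the SAM specification requires; the helper name `parse_header_line` is hypothetical and `@CO` comment lines are not handled.

```python
def parse_header_line(line):
    """Split one SAM header line into (record type, {tag: value})."""
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0][1:]                 # drop the leading '@'
    tags = dict(field.split(":", 1) for field in fields[1:])
    return record_type, tags

header = [
    "@HD\tVN:1.0\tGO:none\tSO:coordinate",
    "@SQ\tSN:chrM\tLN:16571",
    "@RG\tID:20FUK.1\tPL:illumina\tPU:20FUKAAXX100202.1\tLB:Solexa-18483\tSM:NA12878\tCN:BI",
]

# Map each read group ID to its (platform, library, sample).
read_groups = {}
for line in header:
    record_type, tags = parse_header_line(line)
    if record_type == "RG":
        read_groups[tags["ID"]] = (tags["PL"], tags["LB"], tags["SM"])
print(read_groups)
```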
SAM file format
Mapping quality scores quantify the probability that a read is misplaced (P = probability that the alignment is wrong). The mapping quality (Q) is this error probability expressed on the Phred scale. The calculation of mapping qualities considers all the following factors:
• The repeat structure of the reference. Reads falling in repetitive regions have very low mapping quality.
• The base quality of the read. Low quality means the observed read sequence is possibly wrong, and a wrong sequence may lead to a wrong alignment.
• The sensitivity of the alignment algorithm. The true hit is more likely to be missed by an algorithm with low sensitivity, which also causes mapping errors.
• Paired end or not. Reads mapped in pairs are more likely to be correct.
Mapping quality scores
• M: match/mismatch
• I: insertion
• D: deletion
• S: soft clip
• H: hard clip
• P: padding
• N: skip
CIGAR string
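A CIGAR string is a run of length/operation pairs, so it can be decoded with one regular expression. The sketch below is illustrative: the function names are made up, and it also accepts the `=` and `X` operations that the SAM specification defines in addition to the seven listed here.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Return the CIGAR string as a list of (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_RE.findall(cigar)]

def reference_span(cigar):
    """Number of reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")

def query_length(cigar):
    """Number of read bases stored in SEQ: M, I, S, = and X consume the read."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")

print(parse_cigar("101M"))
print(reference_span("5S30M2I10M4D20M"), query_length("5S30M2I10M4D20M"))
```

Hard clips (H) and padding (P) consume neither the read nor the reference, which is why they appear in neither sum.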
SAMtools
• Library and software package that manipulates BAM/SAM files
• SAM → BAM conversion (samtools view)
• samtools view -f INT file.bam
  Only output alignments with all bits in INT present in the FLAG field.
• samtools view -F INT file.bam
  Skip alignments with bits present in INT [0].
• Creates sorted and indexed BAM files from SAM files (samtools sort / samtools index)
• Removing PCR duplicates (samtools rmdup)
• Merging alignments (samtools merge)
• Visualization of alignments from BAM files
• SNP calling and short indel detection
References:
1. http://samtools.sourceforge.net/
2. http://samtools.sourceforge.net/samtools.shtml
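The `-f` and `-F` options are bitwise tests on the FLAG field. The sketch below mimics that logic in Python as an illustration, not as the samtools implementation. The bit values come from the SAM specification; the short names in `FLAG_BITS` and the function names are labels chosen here.

```python
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse", 0x20: "mate_reverse", 0x40: "first_in_pair",
    0x80: "second_in_pair", 0x100: "secondary", 0x200: "qc_fail",
    0x400: "duplicate", 0x800: "supplementary",
}

def decode_flag(flag):
    """List the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

def keep(flag, require=0, exclude=0):
    """Mimic 'samtools view -f require -F exclude' for a single FLAG value."""
    return (flag & require) == require and (flag & exclude) == 0

print(decode_flag(163))          # the FLAG of the read in the BAM header example
print(keep(4, require=4))        # -f 4 keeps unmapped reads
print(keep(16, exclude=4))       # -F 4 keeps mapped reads
```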
BEDtools
References:
1. Quinlan AR and Hall IM, 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6), pp. 841–842.
2. http://code.google.com/p/bedtools/downloads/detail?name=BEDTools-User-Manual.v4.pdf
The BEDTools utilities allow one to address common genomics tasks such as finding feature overlaps and computing coverage. The utilities are largely based on four widely used file formats: BED, GFF/GTF, VCF and SAM/BAM.
• slopBed: adjusts each BED entry by a requested number of base pairs.
• shuffleBed: randomly permutes the locations of a BED file among a genome.
• intersectBed (BAM): returns overlaps between two BED/GFF/VCF files.
• genomeCoverageBed (BAM): creates a "per base" report of genome coverage.
• subtractBed: removes the portion of an interval that is overlapped by another feature.
• mergeBed: merges overlapping features into a single feature.
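To make the idea behind mergeBed concrete, here is a minimal Python sketch of interval merging. It is not the BEDTools implementation. It assumes BED-style 0-based, half-open intervals given as (chrom, start, end) tuples, and it treats book-ended intervals (one ending exactly where the next starts) as mergeable, which is a choice made for this sketch.

```python
def merge_intervals(intervals):
    """Merge overlapping (chrom, start, end) intervals, chromosome by chromosome."""
    merged = []
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2]:
            last[2] = max(last[2], end)      # extend the current merged feature
        else:
            merged.append([chrom, start, end])
    return [tuple(feature) for feature in merged]

features = [("chr1", 150, 300), ("chr1", 100, 200), ("chr2", 100, 200), ("chr1", 400, 500)]
print(merge_intervals(features))
```

Sorting first means each interval only has to be compared with the most recent merged feature, so the merge itself is a single pass.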
BED format
Freely available at http://github.com/pezmaster31/bamtools
BamTools
Picard comprises Java-based command-line utilities that manipulate SAM files, and a Java API (SAM-JDK) for creating new programs that read and write SAM files. Both SAM text format and SAM binary (BAM) format are supported.
Freely available at http://picard.sourceforge.net
Picard
GATK
Freely available at http://www.broadinstitute.org/gatk/
Reads and fragments
Depth of coverage
Individual reads aligned to the genome
Clean C/T heterozygote
Details about specific read
First and second read from the same fragment
Reference genome
Non-reference bases are colored; reference bases are grey
IGV screenshot
Reference: http://www.broadinstitute.org/igv/
Reads visualization: IGV
Recitation
We are going to use a FASTQ file that is present in GEO (http://www.ncbi.nlm.nih.gov/geo/, record GSE24326). It is an Hsf1 ChIP-seq experiment used to identify the genome-wide sites of HSF1 binding in human adipocytes derived from mesenchymal stem cells. HSF1 (heat shock factor protein 1) is a heat-shock transcription factor. Its target genes include major inducible heat shock proteins such as Hsp72. The transcription of heat-shock genes is rapidly induced after temperature stress. The known Hsf1 motif is:
Quality control
fastx_quality_stats -Q 33 -i hsf1.fastq -o hsf1_stats.txt
The output text file will have the following fields (one row per column):
• column = column number (1 to 36 for a 36-cycle Solexa read file)
• count = number of bases found in this column.
• min = lowest quality score value found in this column.
• max = highest quality score value found in this column.
• sum = sum of quality score values for this column.
• mean = mean quality score value for this column.
• Q1 = 1st quartile quality score.
• med = median quality score.
• Q3 = 3rd quartile quality score.
• IQR = inter-quartile range (Q3-Q1).
• lW = 'Left-Whisker' value (for boxplotting).
• rW = 'Right-Whisker' value (for boxplotting).
• A_Count = count of 'A' nucleotides found in this column.
• C_Count = count of 'C' nucleotides found in this column.
• G_Count = count of 'G' nucleotides found in this column.
• T_Count = count of 'T' nucleotides found in this column.
• N_Count = count of 'N' nucleotides found in this column.
• max-count = max. number of bases (in all cycles)
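To show where a few of these fields come from, the sketch below computes count, min, max, sum, mean and med per column from a list of quality strings. It is an illustration rather than the FASTX-Toolkit code: quartiles, whiskers and nucleotide counts are left out, a Sanger offset of 33 is assumed (matching `-Q 33`), and the function name `column_stats` is made up.

```python
from statistics import mean, median

def column_stats(quality_strings, offset=33):
    """Per-column (per-cycle) summary of Phred scores across reads."""
    columns = {}
    for qual in quality_strings:
        for i, ch in enumerate(qual, start=1):      # columns are numbered from 1
            columns.setdefault(i, []).append(ord(ch) - offset)
    return [
        {"column": i, "count": len(scores), "min": min(scores), "max": max(scores),
         "sum": sum(scores), "mean": mean(scores), "med": median(scores)}
        for i, scores in sorted(columns.items())
    ]

# Three hypothetical quality strings of different lengths.
stats = column_stats(["II", "5I", "!"])
for row in stats:
    print(row)
```

Reads of different lengths simply contribute to fewer columns, which is why `count` can drop toward the end of the read.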
fastq_quality_boxplot_graph.sh -i hsf1_stats.txt -o hsf1.png

fastx_artifacts_filter -Q 33 -v -i hsf1.fastq -o hsf1_artifact.fastq
It removes sequencing artifacts.
Input: 6313567 reads.
Output: 6312015 reads.
discarded 1552 (0%) artifact reads.

fastq_quality_trimmer -Q 33 -t 20 -l 28 -v -i hsf1_artifact.fastq -o hsf1_artifact_trim.fastq
It trims sequences based on quality.
Minimum Quality Threshold: 20
Minimum Length: 28
Input: 6312015 reads.
Output: 5929465 reads.
discarded 382550 (6%) too-short reads.

fastq_quality_filter -Q 33 -v -q 20 -p 80 -i hsf1_artifact_trim.fastq -o hsf1_artifact_trim_qual.fastq
It filters sequences based on quality.
Quality cut-off: 20
Minimum percentage: 80
Input: 5929465 reads.
Output: 5655141 reads.
discarded 274324 (4%) low-quality reads.

• fastx_quality_stats -Q 33 -i hsf1_artifact_trim_qual.fastq -o hsf1_artifact_trim_qual_stats.txt
• fastq_quality_boxplot_graph.sh -i hsf1_artifact_trim_qual_stats.txt -o hsf1_artifact_trim_qual.png
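The trimming and filtering steps can be pictured with two small Python functions. This is a sketch of the idea under stated assumptions, not the FASTX-Toolkit source: trimming is assumed to remove low-quality bases from the 3' end only, the defaults mirror `-t 20 -l 28` and `-q 20 -p 80`, and the function names are hypothetical.

```python
def trim_3prime(seq, qual, threshold=20, min_length=28, offset=33):
    """Cut trailing bases with quality below threshold; None if the read gets too short."""
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < threshold:
        end -= 1
    if end < min_length:
        return None                      # read would be discarded as too short
    return seq[:end], qual[:end]

def passes_filter(qual, min_quality=20, min_percent=80, offset=33):
    """True if at least min_percent of the bases have quality >= min_quality."""
    good = sum(1 for ch in qual if ord(ch) - offset >= min_quality)
    return 100 * good >= min_percent * len(qual)

# 'I' encodes Q40 and '!' encodes Q0 with offset 33.
print(trim_3prime("ACGTACGT", "IIIIII!!", min_length=4))
print(passes_filter("IIII!"), passes_filter("III!!"))
```

Running the trimmer before the filter, as in the commands above, means the percentage test is applied to reads whose low-quality tails have already been removed.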
Quality control with FastQC
/home/lriva/FastQC/fastqc --contaminants /home/lriva/FastQC/Contaminants/contaminant_list.txt hsf1_artifact_trim_qual.fastq
• A fast program able to align short query sequences (< 200 bp; error rate < 3%) to a target reference genome
• Workflow:
  – index [-options] (indexes the database)
  – aln [-options] <input query.fastq> (computes the SA coordinates for the input reads)
  – samse [-options] (converts SA coordinates into chromosomal coordinates and generates alignments in SAM format for single-end reads)
BWA
Alignment of the reads
• bwa aln -t 4 -f hsf1.sai /db/bwa/0.6.2/hg19/hg19 hsf1_artifact_trim_qual.fastq
• bwa samse /db/bwa/0.6.2/hg19/hg19 hsf1.sai hsf1_artifact_trim_qual.fastq | samtools view -ut /db/bwa/0.6.2/hg19/hg19 - | samtools sort - hsf1
• samtools index hsf1.bam
Visualizing alignments, converting from BAM binary to SAM text format
samtools view -h -o hsf1_stdchr.nodup.sam hsf1_stdchr.nodup.bam
less hsf1_stdchr.nodup.sam
Go to http://bioserver.iit.ieo.eu/~lriva/ and download corsoPoplimiChIPseq.txt
ssh [email protected]
ssh user@dpt-n1.bioinfo.ieo.eu
cd public_html
You can check all your files at http://bioserver.iit.ieo.eu/~user/
Let’s start
Filtering alignments
Counting and visualizing unmapped reads
• samtools view -f 4 -c hsf1.bam
• samtools view -f 4 hsf1.bam | less
HWI-EAS413:4:1:3:1708 4 * 0 0 * * 0 0 GGTGCTGGCCAACATGACCTTGCACGGA BCACCCCBCCCBCCBBCCCBB
Filtering alignments
Counting and visualizing mapped reads
• samtools view -F 4 -c hsf1.bam
• samtools view -F 4 hsf1.bam | less
HWI-EAS413:4:100:367:1780 16 chr1 9999 25 35M * 0 0 GATCTCCCTAACCCTAACCCTAACCCTAACCCTAA ?B;BABBAAC?AC?AC?CBCBCBCCCBCBBCCBCB XT:A:U NM:i:2 XN:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:3A0A30
Select only mapped reads with mapping quality >= 1 (-q 1):
samtools view -F 4 -q 1 -bo hsf1_stdchr.bam hsf1.bam chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX
Prefix trie of string 'GOOGOL'
[Figure: prefix trie of 'GOOGOL'. Edges are labeled with the characters G, O and L; each node is labeled with a pair of numbers such as 1,2 or 4,6 (root: 0,6); ^ = start of the string.]
Burrows-‐Wheeler transform (BWT)
X = 'googol$' (reference string)
W = 'go' (query string)
SA interval of 'go' = [1,2]
[Figure: the two occurrences of 'go' marked with * on g o o g o l (positions 0 to 5); the suffix array of 'googol' + '$'; B = BWT(X), the BWT string of X.]
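Suffix array interval: formal definition
$\underline{R}(W) = \min\{k : W \text{ is the prefix of } X_{S(k)}\}$
$\overline{R}(W) = \max\{k : W \text{ is the prefix of } X_{S(k)}\}$
where $S$ is the suffix array of $X$ and $X_{S(k)}$ is the suffix of $X$ starting at position $S(k)$.
If W is an empty string then $[\underline{R}(W), \overline{R}(W)] = [1, n-1]$.
$[\underline{R}(W), \overline{R}(W)]$ = SA interval of W
Example:
X = 'googol$'
W = 'go'
$\underline{R}(\text{'go'}) = 1$, $\overline{R}(\text{'go'}) = 2$
SA interval of 'go' = [1,2]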
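The SA interval can be found directly from its definition: sort the suffixes, then collect the ranks whose suffix starts with W. The Python sketch below does this naively (quadratic sorting, linear scan), which is fine for toy strings but is not how an aligner would do it; the function names are illustrative.

```python
def suffix_array(x):
    """Start positions of the suffixes of x in lexicographic order ('$' sorts first)."""
    return sorted(range(len(x)), key=lambda i: x[i:])

def sa_interval(x, w):
    """(lowest rank, highest rank) of the suffixes that have w as a prefix, or None."""
    sa = suffix_array(x)
    hits = [k for k, start in enumerate(sa) if x[start:].startswith(w)]
    return (min(hits), max(hits)) if hits else None

sa = suffix_array("googol$")
print(sa)                              # rank -> start position in 'googol$'
print(sa_interval("googol$", "go"))    # the interval quoted above, [1,2]
```

Because suffixes sharing a prefix are adjacent in sorted order, the matching ranks always form one contiguous interval, which is what makes the two-number summary sufficient.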
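Exact match: Ferragina-Manzini algorithm
X = reference string; W = query string
C(a) = number of symbols in X[0,n-2] that are lexicographically smaller than a
O(a, i) = number of occurrences of a in B[0,i], where B = BWT(X)
If W is a substring of X, with SA interval $[\underline{R}(W), \overline{R}(W)]$:
$\underline{R}(aW) = C(a) + O(a, \underline{R}(W) - 1) + 1$
$\overline{R}(aW) = C(a) + O(a, \overline{R}(W))$
and $\underline{R}(aW) \le \overline{R}(aW)$ if and only if aW is a substring of X.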
Exact match: backward search
X = 'googol$'
bwt = 'lo$oogg'
W = 'ogo'

i  SA  sorted rotation
0  6   $googol
1  3   gol$goo
2  0   googol$
3  5   l$googo
4  2   ogol$go
5  4   ol$goog
6  1   oogol$g

if W ← ' ':
    L ← 1
    R ← length(X)-1
j ← length(W)-1

while (L ≤ R and j ≥ 0) do:
    aW ← W[ j:end ]

    if length(aW)
Exact match: backward search (continued)
X = A T T T G A T C T T T T G A T C
    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
(* marks the two occurrences of W in X)
W = GATC
SA interval = [6,7]
SA = (16, 13, 5, 0, 15, 7, 12, 4, 14, 6, 11, 3, 10, 2, 9, 1, 8)
BWT string = CGG$TTTTAATTTTTAC

i   SA  sorted rotation
0   16  $ATTTGATCTTTTGATC
1   13  ATC$ATTTGATCTTTTG
2   5   ATCTTTTGATC$ATTTG
3   0   ATTTGATCTTTTGATC$
4   15  C$ATTTGATCTTTTGAT
5   7   CTTTTGATC$ATTTGAT
6   12  GATC$ATTTGATCTTTT
7   4   GATCTTTTGATC$ATTT
8   14  TC$ATTTGATCTTTTGA
9   6   TCTTTTGATC$ATTTGA
10  11  TGATC$ATTTGATCTTT
11  3   TGATCTTTTGATC$ATT
12  10  TTGATC$ATTTGATCTT
13  2   TTGATCTTTTGATC$AT
14  9   TTTGATC$ATTTGATCT
15  1   TTTGATCTTTTGATC$A
16  8   TTTTGATC$ATTTGATC
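Both worked examples can be reproduced with a short Python sketch of backward search. It follows the C(a) and O(a, i) definitions given for the Ferragina-Manzini algorithm, but it makes two simplifying choices that are assumptions of this sketch: C and O are recomputed naively instead of being precomputed tables, and the search starts from the full interval [0, n-1] (including the '$' row), so that a suffix beginning with the last character of the text is not skipped. The function names are illustrative.

```python
def suffix_array(x):
    """Start positions of the suffixes of x in lexicographic order ('$' sorts first)."""
    return sorted(range(len(x)), key=lambda i: x[i:])

def bwt(x, sa):
    """BWT string: the character preceding each sorted suffix (x[-1] is '$' for position 0)."""
    return "".join(x[i - 1] for i in sa)

def backward_search(x, w):
    """Return (lo, hi, positions) for query w in text x (x must end with '$'), or None."""
    sa = suffix_array(x)
    b = bwt(x, sa)
    text = x[:-1]                                   # X[0, n-2], without the '$'
    c = {a: sum(1 for ch in text if ch < a) for a in set(text)}

    def occ(a, i):
        return b[:i + 1].count(a)                   # occurrences of a in B[0, i]; 0 if i < 0

    lo, hi = 0, len(x) - 1
    for a in reversed(w):                           # extend the match one character to the left
        if a not in c:
            return None
        lo, hi = c[a] + occ(a, lo - 1) + 1, c[a] + occ(a, hi)
        if lo > hi:
            return None
    return lo, hi, sorted(sa[k] for k in range(lo, hi + 1))

print(bwt("googol$", suffix_array("googol$")))
print(backward_search("googol$", "ogo"))
print(backward_search("ATTTGATCTTTTGATC$", "GATC"))
```

The query is consumed from its last character to its first, which is why the method is called backward search; each step costs two O lookups regardless of the genome size once the tables are precomputed.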
Inexact match algorithm
• DNA is fragmented
• Adaptors ligated to fragments
• Several possible protocols yield an array of PCR colonies. (AMPLIFICATION)
• Enzymatic extension with fluorescently tagged nucleotides. (SEQUENCING REACTION)
• Cyclic readout by imaging the array.
• Align strings to the reference genome.
Basics of NGS technology!
Consider the following paper, which describes sequencing technologies in detail: Michael Metzker, 'Sequencing technologies - the next generation', Nat Rev Genet. 2010 Jan;11(1):31-46.
Mapping Algorithm (2 mismatches)
GATGCATTGCTATGCCTCCCAGTCCGCAACTTCACG
GATGCATTG CTATGCCTC CCAGTCCGC AACTTCACG seeds
Exact match Genome
GATGCATTG CTATGCCTC CCAGTCCGC AACTTCACG .........
Indexed table of exactly matching seeds
Approximate search around the exactly matching seeds
Mapping Algorithm (2 mismatches)
• Partition reads into 4 seeds {A,B,C,D}
  – At least 2 seeds must map with no mismatches
• Scan genome to identify locations where the seeds match exactly
  – 6 possible combinations of the seeds to search: {AB, CD, AC, BD, AD, BC}
  – 6 scans to find all candidates
• Do approximate matching around the exactly-matching seeds.
  – Determine all targets for the reads
  – Ins/del can be incorporated
• The reads are indexed and hashed before scanning genome
• Bit operations are used to accelerate mapping
  – Each nt encoded into 2 bits
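The seed strategy rests on a pigeonhole argument: two mismatches can spoil at most two of the four seeds, so at least two seeds still match exactly. The Python sketch below illustrates that idea on a toy genome. It is a simplification under stated assumptions: the genome is scanned naively rather than through a hash index of the reads, indels are ignored, the read length is assumed to be a multiple of 4, only A/C/G/T are handled, and all function names are made up.

```python
from itertools import combinations

ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack_2bit(seq):
    """Pack a nucleotide string into one integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | ENCODE[base]
    return value

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)

def map_read(read, genome, max_mismatches=2):
    """Positions where the read aligns with at most max_mismatches substitutions."""
    size = len(read) // 4
    seeds = [pack_2bit(read[k * size:(k + 1) * size]) for k in range(4)]
    hits = set()
    for i, j in combinations(range(4), 2):          # the 6 seed pairs AB, AC, AD, BC, BD, CD
        for pos in range(len(genome) - len(read) + 1):
            window_i = genome[pos + i * size: pos + (i + 1) * size]
            window_j = genome[pos + j * size: pos + (j + 1) * size]
            if pack_2bit(window_i) == seeds[i] and pack_2bit(window_j) == seeds[j]:
                # Both seeds match exactly: verify the whole read around them.
                if hamming(read, genome[pos:pos + len(read)]) <= max_mismatches:
                    hits.add(pos)
    return sorted(hits)

read = "GATGCATTGCTATGCCTCCCAGTCCGCAACTTCACG"       # the 36-nt read from the figure
mutated = "A" + read[1:10] + "G" + read[11:]        # 2 mismatches, in seeds A and B
genome = "ACGTACGT" + mutated + "TTGCA"             # toy genome built for this example
print(map_read(read, genome))
```

Comparing packed integers stands in for the bit operations mentioned above: one integer comparison checks a whole 9-nt seed at once.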