10/09/2015
__________________________________________________________________________________________________ Fall 2015 GCBA 815
Tools and Algorithms in Bioinformatics GCBA815, Fall 2015
Week 13: Next Generation Sequencing (NGS) Analysis
Adam Cornish
Graduate Student Guda lab
Department of Genetics, Cell Biology and Anatomy
University of Nebraska Medical Center
Introduction
- Vector NTI is an integrated suite of sequence analysis and design tools that help you manage, view, analyze, transform, share, and publicize diverse types of molecular biology data, in a graphically rich analysis environment.
Eisenstein. Nature. 2015
Sources of NGS data
- Illumina
- Ion Torrent
- PacBio
Single Cell Sequencing
Applications of NGS
- Genome
  - Targeted sequencing panels (cancer, newborns, autism, etc.)
  - Whole exome sequencing
  - Whole genome sequencing
  - Copy number analysis
  - Reconstruction of extinct species’ genomes
- Transcriptome
  - Whole transcriptome (poly-A selection)
  - Small RNA analysis (siRNA, snoRNA, lincRNA, etc.)
  - Gene expression profiling for selected target genes
  - Rare cell identification
- Metagenome
  - Bulk sequencing of many types of bacteria
  - Examples: human gut microbiome, pollen composition, bacterial composition, viral studies
- Epigenome
  - Chromatin Immunoprecipitation Sequencing (ChIP-Seq)
  - Methylation Sequencing (Methyl-Seq)
Variant calling using NGS data
Important file types
The big three:
- Fastq
  - Raw sequencing data usually directly from the sequencer
- SAM/BAM
  - Sequence data that has usually been aligned to a specific genome
- VCF
  - Tab-delimited text file that contains a list of possible variants:
    - SNV
    - Insertion and deletion (indel)
    - Duplication
    - Copy number variation
    - Inversion
    - Tandem duplication
Fastq
@SRR098401.11403008/1
GAGGCTATAGCATGGTCAAGGCACAAGAAGATCACTGGACTGCCCTCGCTCAGCCCTCAGCTACTG
+
>>?>?@>?>@@>?@@=@@@@@??>??@??@?@A?>@@@?>@@???A@:@A@@A@@@A@@AAB@@BB
Row 1: Read identifier, with information from the sequencer about the location of this read on the plate
Row 2: The sequence
Row 3: A "+" separator line, optionally followed by metadata provided by the sequencing team
Row 4: Quality scores pertaining to each nucleotide in the sequence
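The four-row layout can be checked with a few lines of Python. This is a minimal sketch rather than anything from the original slides: the helper name `parse_fastq_record` and the dictionary keys are illustrative, and it assumes each record occupies exactly four lines (no wrapped sequences).

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into its parts (hypothetical helper)."""
    if len(lines) != 4:
        raise ValueError("a FASTQ record must have exactly four lines")
    header, sequence, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@"):
        raise ValueError("row 1 must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("row 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("rows 2 and 4 must have the same length")
    return {
        "read_id": header[1:],      # row 1 without the leading '@'
        "sequence": sequence,       # row 2
        "comment": separator[1:],   # row 3 without the leading '+'
        "quality": quality,         # row 4, one character per base
    }


# The example record shown on the slide.
record = parse_fastq_record([
    "@SRR098401.11403008/1",
    "GAGGCTATAGCATGGTCAAGGCACAAGAAGATCACTGGACTGCCCTCGCTCAGCCCTCAGCTACTG",
    "+",
    ">>?>?@>?>@@>?@@=@@@@@??>??@??@?@A?>@@@?>@@???A@:@A@@A@@@A@@AAB@@BB",
])
print(record["read_id"], len(record["sequence"]))
```

The length check on rows 2 and 4 is the useful part in practice: a mismatch usually means the file was truncated or the records are out of step.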
Fastq continued
Quality scores are phred-scaled:
Seq:   TCAGCCCTCAGCTACTGCTCT
Score: A@@A@@@A@@AAB@@BBABAB
Phred-33 is the most common encoding, and is based on ASCII values.
The quality score of a base call is the ASCII value of its character minus 33.
Example: the ASCII value for ‘A’ is 65, and 65 - 33 = 32. That means the base call corresponding to this score has roughly a 1 in 1,600 chance of being wrong (10^(-32/10) ≈ 0.00063).

Phred quality score   Probability that the base is called wrong   Accuracy of the base call
20                    1 in 100                                    99%
30                    1 in 1,000                                  99.9%
40                    1 in 10,000                                 99.99%
50                    1 in 100,000                                99.999%
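The Phred-33 decoding and the table above follow from the relation P(error) = 10^(-Q/10). A minimal Python sketch, with function names chosen here for illustration, assuming the quality string uses the Phred-33 encoding:

```python
def phred33_scores(quality_string):
    """Convert a Phred-33 encoded quality string to integer scores."""
    return [ord(char) - 33 for char in quality_string]


def error_probability(score):
    """Probability that a call with this Phred score is wrong: 10^(-Q/10)."""
    return 10 ** (-score / 10)


# Sequence and quality characters from the slide example.
scores = phred33_scores("A@@A@@@A@@AAB@@BBABAB")
for base, score in zip("TCAGCCCTCAGCTACTGCTCT", scores):
    print(base, score, f"{error_probability(score):.5f}")
```

Because the scale is logarithmic, every 10 points of quality score corresponds to a tenfold drop in the error probability.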
Sequence Alignment / Map (SAM / BAM)
Similar to the Fastq file in that it contains the raw sequence and its quality scores.
It also tells you where the sequence aligned to the genome, and how well (this score is also phred-scaled).
SRR098401.104031357 83 chr22 17445857 60 76M = 17445512 -421 ACTGTTACCAGATCAAGAACTGATAGGGACAGGGATCATTATTCCCCCTTTACAGATGAGAAGGCCGTCACGCCTC @@>>B@@@BBAAAB9A@@>:@@?=A@?@?@A???>?@??=???@@@@@>@>>@@@><??@>@>@@8?>?=:@>?>> BD:Z:NOJKPQQQQMONOMKKKLNOMNLLLJLMINLJLMLMLKKKKJLJJJMKCKLINJMMLJKKKMOOMNNOLPQSNMKK PG:Z:MarkDuplicates RG:Z:NA12878 BI:Z:OOMLRRPPRPPQQONOLOPOONOOOKLNMONJKMNONMMMMLMKKKMLGMNLNMMNNJMJLNOMLNMPNONONNMM NM:i:0 MQ:i:60 AS:i:76 XS:i:0
In this case, the read aligned to chromosome 22, position 17445857, and has a mapping quality score of 60 (a 1 in 1,000,000 chance of being placed incorrectly).
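The mandatory fields of the example record can be pulled apart as follows. This is a sketch under stated assumptions: the helper `parse_sam_line`, the dictionary keys, and the short flag labels are illustrative names, not part of any library, and only the eleven mandatory columns are read (optional tags such as `NM:i:0` are ignored). Real SAM files are tab-delimited, so the record is rebuilt here with tabs.

```python
# Meaning of the lower FLAG bits; the label strings are illustrative.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def parse_sam_line(line):
    """Read the 11 mandatory SAM columns of one alignment (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    flag = int(fields[1])
    mapq = int(fields[4])
    return {
        "qname": fields[0],
        "flags": [name for bit, name in SAM_FLAG_BITS.items() if flag & bit],
        "rname": fields[2],
        "pos": int(fields[3]),
        "mapq": mapq,
        # MAPQ is phred-scaled, like base qualities.
        "mapq_error_probability": 10 ** (-mapq / 10),
        "cigar": fields[5],
        "seq": fields[9],
        "qual": fields[10],
    }


example = "\t".join([
    "SRR098401.104031357", "83", "chr22", "17445857", "60", "76M",
    "=", "17445512", "-421",
    "ACTGTTACCAGATCAAGAACTGATAGGGACAGGGATCATTATTCCCCCTTTACAGATGAGAAGGCCGTCACGCCTC",
    "@@>>B@@@BBAAAB9A@@>:@@?=A@?@?@A???>?@??=???@@@@@>@>>@@@><??@>@>@@8?>?=:@>?>>",
])
alignment = parse_sam_line(example)
print(alignment["rname"], alignment["pos"], alignment["mapq"], alignment["flags"])
```

For real analyses an established parser such as samtools or pysam is the safer choice; the sketch only illustrates how the columns map onto the description above.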
ExAC Browser
Variant Call Format (VCF)