Deep Sequencing Introduction to Bioinformatics Seminar November 9th, 2009 Angela Benton, Samuel Darko, Prakriti Mudvari and Prisca Takundwa

Page 1: sam

Deep Sequencing

Introduction to Bioinformatics Seminar

November 9th, 2009

Angela Benton, Samuel Darko, Prakriti Mudvari and Prisca Takundwa

Page 2: sam

History of Sequencing

"Sanger sequencing" was developed by Fred Sanger et al. in the mid-1970s

Uses dideoxynucleotides for "chain termination", generating fragments of different lengths ending in ddATP, ddGTP, ddCTP or ddTTP

http://openwetware.org/wiki/BE.109:Bio-material_engineering/Sequence_analysis

Page 3: sam

History of Sequencing Cont.

• A schematic of Sanger sequencing

http://www.scq.ubc.ca/genome-projects-uncovering-the-blueprints-of-biology/

Page 4: sam

History of Sequencing Cont.

• DNA fragments are separated by size by gel electrophoresis

• From the gel, the DNA sequence can be determined

• Can produce DNA fragments 700-900 bp long (good), but it’s slow (bad)

• Lots of other problems, including clone library generation and low throughput

• The Human Genome Project used Sanger sequencing; completion took over 10 years

Page 5: sam

Next Generation Sequencers

Next (or 2nd) generation sequencers came onto the scene in the mid-2000s

General characteristics include:

• Amplification of genetic material by PCR

• Ligation of amplified material to a solid surface

• Sequence of the target genetic material is determined using sequencing-by-synthesis (using labelled nucleotides or pyrosequencing for detection) or sequencing-by-ligation

• Sequencing is done in a massively parallel fashion and sequence information is captured by a computer

Page 6: sam

Next Gen. Sequencers Cont.

Sequencing platform | Sequencing chemistry | Template amplification method | Read length | Sequencing throughput
ABI3730xl Genome Analyzer | Automated Sanger sequencing | In vivo amplification via cloning | 700–900 bp | 0.03–0.07 Mb/h
Roche (454) FLX | Pyrosequencing on solid support | Emulsion PCR | 200–300 bp | 13 Mb/h
Illumina Genome Analyzer | Sequencing-by-synthesis with reversible terminators | Bridge PCR | 32–40 bp | 25 Mb/h
ABI SOLiD | Sequencing by ligation | Emulsion PCR | 35 bp | 21–28 Mb/h
HeliScope | Sequencing-by-synthesis with virtual terminators | None (single molecule) | 25–35 bp | 83 Mb/h

Page 7: sam

Next Gen. Sequencers Cont.

Page 8: sam

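Next Gen. Sequencers Cont.

Figure: base added in each cycle (G, C, A, G, T, C, A) and what templates 1-3 incorporate; a dash means nothing was incorporated.

Cycle | 1 | 2 | 3
G | G | - | -
C | - | C | C
A | A | A | -
G | - | G | G
T | T | - | -
C | - | - | C
A | A | A | A

Provided to author courtesy of Helicos representative

Annotation (Jim Brayer): Why is this not following the base addition order G - C - A - G - T - C - A? Shouldn't it be G - 'T' - C - A - G - T - C - A?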
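
The cycle figure on this slide can be reproduced with a few lines of code. The sketch below is only an illustration, not Helicos software: it assumes each column of the figure is one template, that the three synthesized strands are GATA, CAGA and CGCA (read off the columns), and that a strand gains at most one base per cycle. The function name `simulate_cycles` is made up for this example.

```python
def simulate_cycles(strands, flow_order):
    """Return one row per cycle: the base each strand incorporates, or '-'."""
    positions = [0] * len(strands)  # index of the next base each strand needs
    rows = []
    for base in flow_order:
        row = []
        for i, strand in enumerate(strands):
            pos = positions[i]
            if pos < len(strand) and strand[pos] == base:
                row.append(base)      # the flowed base matches, so it is added
                positions[i] += 1
            else:
                row.append("-")       # no incorporation in this cycle
        rows.append("".join(row))
    return rows


# Strands read off the figure columns (an interpretation of the slide).
strands = ["GATA", "CAGA", "CGCA"]
flow_order = "GCAGTCA"

for base, row in zip(flow_order, simulate_cycles(strands, flow_order)):
    print(base, row)
```

Under these assumptions the printed rows match the figure (G--, -CC, AA-, -GG, T--, --C, AAA), which shows how three different sequences are read in parallel from one shared order of base additions.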
Page 9: sam

Next Gen. Sequencers Cont.

• Sequencing-by-ligation on SOLiD

http://www.umcutrecht.nl/subsite/genetics/Research/PersonalGenomics.htm

Page 10: sam

Next Gen vs Sanger

Let’s think about the domesticated silkworm genome

• The reference genome is about 432 Mb in size

• It was assembled from approximately 8.5-fold coverage

Platform | Sequencing speed | Time to sequence (days)
ABI3730xl Genome Analyzer | 0.03-0.07 Mb/h | 2185.7
Roche (454) FLX | 13 Mb/h | 11.8
Illumina Genome Analyzer | 25 Mb/h | 6.1
ABI SOLiD | 21–28 Mb/h | 5.5
Helicos Heliscope | 83 Mb/h | 1.8
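
The days in this comparison follow from simple arithmetic: genome size times coverage gives the total bases to sequence, and dividing by throughput gives the run time. The sketch below assumes the upper end of each quoted throughput range was used (0.07 Mb/h for Sanger, 28 Mb/h for SOLiD), since that choice reproduces the slide's numbers; the function and variable names are made up for this example.

```python
GENOME_MB = 432    # silkworm reference genome size, in Mb
COVERAGE = 8.5     # fold coverage used for the assembly

# Throughput in Mb per hour; upper end of each range (an assumption).
THROUGHPUT_MB_PER_H = {
    "ABI3730xl": 0.07,
    "Roche (454) FLX": 13,
    "Illumina Genome Analyzer": 25,
    "ABI SOLiD": 28,
    "HeliScope": 83,
}


def days_to_sequence(genome_mb, coverage, mb_per_hour):
    """Total bases to read divided by throughput, converted from hours to days."""
    total_mb = genome_mb * coverage
    return total_mb / mb_per_hour / 24


for platform, rate in THROUGHPUT_MB_PER_H.items():
    days = days_to_sequence(GENOME_MB, COVERAGE, rate)
    print(f"{platform}: {days:.1f} days")
```

The calculation ignores library preparation, run setup and failed runs, so it is a lower bound on real project time rather than a schedule.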

Page 11: sam

Bioinformatics

• Because of the massively parallel nature of next gen sequencers, huge amounts of data are produced quickly, requiring terabytes of storage

• New bioinformatics tools were developed to utilize the huge number of much shorter reads (~35 bp vs ~800 bp)

– Bowtie: ultrafast, memory-efficient short read aligner

– SOAPdenovo: part of the SOAP suite, used to build a reference genome

– TopHat: a fast splice junction mapper for RNA-Seq reads

Page 12: sam

Applications

• Novel whole genome sequencing

– The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific

• Whole genome resequencing

– Complete Resequencing of 40 Genomes Reveals Domestication Events and Genes in Silkworm (Bombyx)

• RNA-Seq (transcriptomics)

– A Global View of Gene Activity and Alternative Splicing by Deep Sequencing of the Human Transcriptome

Page 13: sam

Prisca Takundwa

NEXT-GENERATION SEQUENCING : APPLICATIONS

Page 14: sam

APPLICATIONS

• The range of potential applications for next-generation sequencing is enormous.

• Some examples that will be discussed include applications in:

– Cellular genomes using WGS

– Metagenomics

– Genomic medicine

– Other novel applications

Page 15: sam

Cellular Genomes

• The advent of automation in sequencing, initiated by Craig Venter et al., gave rise to sequencing beyond viruses and organelles.

• In 1995 Venter’s group at TIGR reported complete sequences of two bacteria, Haemophilus influenzae and Mycoplasma genitalium.

Page 16: sam

Cellular Genomes

• Significance:

– 1st glimpse of the complete instruction set for a living organism

– An approximation of the minimal set of genes required for cellular life

– Insight into the methods used to come up with these cellular genomes

Page 17: sam

Cellular Genomes

• Significance

– Paved the way for other cellular genomes such as E. coli, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster

– Human Genome Project

– Next-generation appeal

Page 18: sam

Metagenomics

• Getting rid of cultures

• Introduces diversity, includes all genes and potentially all members contributing to a given environment

• Typically use 16S rRNA gene to identify different species and strains

• Advantages:

– Closes the huge gap in sequence data in non-model species

– Many prokaryotes are human pathogens

Page 19: sam

Metagenomics

• Some examples

• Breitbart et al. showed that 200 liters of sea water contained >5000 different viruses. Later work found >1000 viral species in human stool, and the majority of the viruses in these studies were new species.

• Craig Venter’s Global Ocean Voyage

Page 20: sam

Genomic Medicine

• Sequencing and how it lends itself to medicine

• Implications in diagnosis, treatment and prevention

• Personalized medicine

• $1000 genome

• Some examples include Cancer and HIV applications

Page 21: sam

Other Novel Applications

• Resequencing

• Plants – Sugar beet and Tropical Evergreen Fagaceae

• Junk DNA

• Drug discovery

Page 22: sam

Angela Benton

Transcriptomics

Page 23: sam

Background

• Transcriptome – the complete set of coding and non-coding RNA molecules in a cell at a particular time

– Varies between cell types

• Transcriptomics – the study of the transcripts in a cell, cell type, organism, etc.

Page 24: sam

Candidate Gene Analysis

1. Northern blot analysis

– Separation of RNA molecules by size

– Hybridization of a complementary radioactively-labeled probe

– Detection method

2. Reverse transcriptase PCR (RT-PCR)

– RNA molecules reverse transcribed into cDNA

– PCR amplified

– Quantification method

Page 25: sam

Microarray Technology

• High-throughput gene expression profiling

• Hybridization of labeled cDNAs to an array of complementary DNA probes

• Measurement of expression levels based on hybridization intensity

Page 26: sam

Sequencing-Based Approaches

1. Full-length cDNA (FLcDNA) sequencing

– Complete sequencing of cDNA clone

2. Expressed sequence tag (EST) sequencing

– Single-pass sequencing of cDNA clone

3. Serial Analysis of Gene Expression (SAGE)

– Short sequence tags at 3’ end of transcript

– Tags concatenated and sequenced

Page 27: sam

RNA-Seq

• Alternative to Sanger sequencing

• RNA molecules converted into library of cDNA fragments

–Adaptors attached to one/both ends

• Short sequence reads obtained

• Aligned to reference genome and classified as:

1.Exons

2.Junctions

3.Poly-A ends

• Can be used to assemble de novo sequences

Page 28: sam

Next Generation Sequencing Applications

Protein-coding gene annotation

– Transcriptome sequences can be aligned:

1. To genome of same species

2. To genome of related species

– Discovery of novel exons and introns

– Long read lengths: de novo analyses

– Short read lengths: novel splicing events

Page 29: sam

Next Generation Sequencing Applications

Gene expression profiling

– SAGE method

– 5’-RATE method (454 sequencing)

– 3’-UTR method (454 sequencing)

Page 30: sam

Next Generation Sequencing Applications

Noncoding RNA (ncRNA) discovery

– ncRNA not translated into protein product

– Role in regulation of development and cell fate determination

– Three kinds:

1. Micro RNAs (miRNAs)

2. Small interfering RNAs (siRNAs)

3. Piwi-interacting RNAs (piRNAs)

Page 31: sam

Next Generation Sequencing Applications

Transcript rearrangement discovery

– Genome rearrangements common in human cancers

– Includes:

1. Translocations

2. Inversions

3. Indels

4. Copy number variants

– Paired-end sequencing

– Infers presence of rearrangement

Page 32: sam

Bioinformatic Implications

• Large amounts of data generated

• Tools are needed to aid in:

– Storage

– Retrieval

– Processing

– Interpretation

– Integration

Page 33: sam

Bioinformatics of Deep Sequencing

Prakriti Mudvari

Page 34: sam

Bioinformatics of Deep Sequencing

http://www.cbcb.umd.edu/research/viewer.jpg

Page 35: sam

The Basics.

http://www.k.u-tokyo.ac.jp/pros-e/person/shinichi_morishita/shinichi_morishita.htm

Page 36: sam

Creating a Paired End Tag

http://media.wiley.com/wires/WSBM/WSBM40/nfig001.jpg

Page 37: sam

Paired End vs. Unpaired Reads

• Millions of reads are generated.

• Repetitive regions within the genome cause the reads to be mapped to multiple locations.

• Polymorphism in a read can cause it to be mapped to a wrong location.

• Discarding ambiguous reads can reduce coverage

Page 38: sam

Comparison of Output

Platform | Read length | Sequencing throughput
ABI 3730 Genome Analyzer | 700-900 bp | 0.03-0.07 Mb/h
454 | 200-300 bp | 13 Mb/h
Illumina Genome Analyzer | 32-40 bp | 25 Mb/h
ABI SOLiD | 35 bp | 21-28 Mb/h
HeliScope | 25-35 bp | 83 Mb/h

Page 39: sam

Challenges

• Quality of data

• Storage

• Cross Platform Analysis

• Data Annotation

• Assembly

• SNP/Mutation Detection

Page 40: sam

Bioinformatics Tools

• Alignment of reads to reference genome

• Assembly of de novo sequence

• Quality Control & Base Calling

• Polymorphism detection

• Genome browsing and annotation

Page 41: sam

Alignment of reads

• Reads generated from sequencing are mapped to a reference genome.

• Conventional tools like BLAST or BLAT do not work well with short sequence reads.

• Existing alignment algorithms have been modified to handle short reads.

Page 42: sam

Alignment Tools

• Cross_match

• ELAND

• Exonerate

• MAQ

• Mosaik

• SHRiMP

• SOAP

• Zoom!

Page 43: sam

Short Oligonucleotide Alignment Program (SOAP)

• Maps short oligonucleotides to reference sequence in a gapped or ungapped alignment.

• Can be used for single as well as paired end alignments.

• Allows at most two mismatches per read or one continuous gap of size 1-3bp when aligning. No mismatches allowed in the flanking region.

• Best hit is the one with least number of mismatches or smallest gap.

• Iteratively trims several base pairs at the 3’ end, where sequencing errors are most frequent, and realigns.

• Uses seed and hash-lookup algorithm to accelerate alignment.

• Loads reference sequence into memory instead of reads.

• Written in C++.
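
The seed and hash-lookup idea can be sketched in a few lines. This is a toy illustration and not SOAP's actual implementation: it handles ungapped alignment only, does no 3’ trimming, and every name in it (`build_index`, `hamming`, `align`, the example reference string) is made up for this example. The reference is indexed once in memory, each read is cut into non-overlapping seeds, and every exact seed hit is verified by counting mismatches over the full read.

```python
from collections import defaultdict


def build_index(reference, seed_len):
    """Hash table from every seed-length substring to its reference positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def align(read, reference, index, seed_len, max_mismatches=2):
    """Return (position, mismatches) of the best ungapped hit, or None."""
    best = None
    # Non-overlapping seeds: with at most 2 mismatches and at least 3 seeds,
    # one seed must match exactly, so the hash lookup will find the locus.
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[offset:offset + seed_len]
        for seed_pos in index.get(seed, []):
            start = seed_pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            mismatches = hamming(read, reference[start:start + len(read)])
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best


reference = "ACGTTGCATGCCGTAAGCTTAGGCTAACGT"
index = build_index(reference, seed_len=4)

print(align("TGCCGTAAGCTT", reference, index, seed_len=4))  # exact read
print(align("TGCCGAAAGCTT", reference, index, seed_len=4))  # one mismatch
```

Indexing the reference rather than the reads mirrors the slide's point about what is held in memory: the index is built once and each read then costs only a few hash lookups plus a short verification.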

Page 44: sam

Assembly

• De novo sequencing involves assembling overlapping reads to form contiguous sequence of DNA.

• Done in cases where there’s no genomic information available.

Page 45: sam

Assembly

• ABySS

• ALLPATHS

• Edena

• Euler-SR

• SHARCGS

• SHRAP

• SSAKE

• Velvet

Page 46: sam

Assembly By Short Sequences (ABySS)

• Originally developed for de novo assembly of large genomes using short reads.

• Uses a distributed representation of a de Bruijn graph, which allows parallel computation of the algorithm across a network of computers.

• Assembly is done in two steps.

• First, all possible substrings of a specific length (k-mers) are generated from the sequence reads. The substring data set is then processed to remove errors, and contiguous sequences (contigs) are built without using paired end information.

• Mate pair information is then used to extend the contigs.
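
The first assembly step can be illustrated with a toy de Bruijn graph. This sketch is not ABySS code: it runs on one machine, skips error removal and the mate pair step, and all names (`kmers`, `build_de_bruijn`, `start_nodes`, `extend_contig`) and the example reads are made up for this example. Reads are broken into substrings of length k, each substring links its (k-1)-base prefix to its (k-1)-base suffix, and a contig is grown while the path has exactly one way forward.

```python
from collections import defaultdict


def kmers(read, k):
    """All substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def start_nodes(graph):
    """Nodes that no edge points to; contigs can begin here."""
    successors = {s for targets in graph.values() for s in targets}
    return [node for node in graph if node not in successors]


def extend_contig(graph, start):
    """Walk from start while there is exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:        # stop instead of looping through a repeat
            break
        contig += nxt[-1]      # each step adds one new base
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ATGGCGT", "GCGTGCA", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
for start in start_nodes(graph):
    print(extend_contig(graph, start))
```

The walk stops wherever a node has more than one successor, which is what a repeat looks like in the graph; that is the ambiguity the mate pair information is later used to resolve.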

Page 47: sam

Assembly By Short Sequences (ABySS)

• Use of paired end reads reduces the ambiguity of repetitive regions.

• Written in C++ and uses the Message Passing Interface (MPI) to communicate between nodes.

Page 48: sam

Basecalling

Determination of the nucleotide base at each position from the signal in the trace file produced by a sequencer

http://stat.fsu.edu/~lilei/lilei/research/hmm/simulate.gif
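
In its simplest form, basecalling picks the channel with the strongest signal at each position. The sketch below is a naive illustration only: it assumes the trace has already been reduced to one intensity per base (A, C, G, T) per position, the intensity values are invented, and the name `call_bases` is made up for this example. Real base callers also model noise and report a quality score for each call.

```python
BASES = "ACGT"


def call_bases(intensities):
    """Pick, for each position, the base whose channel has the highest signal."""
    calls = []
    for channels in intensities:
        strongest = max(range(len(BASES)), key=lambda i: channels[i])
        calls.append(BASES[strongest])
    return "".join(calls)


# Invented signal values, one (A, C, G, T) tuple per position.
trace = [
    (0.90, 0.10, 0.05, 0.10),
    (0.10, 0.20, 0.80, 0.10),
    (0.05, 0.70, 0.10, 0.20),
    (0.10, 0.10, 0.20, 0.90),
]

print(call_bases(trace))
```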

Page 49: sam

Basecalling

• PyroBayes

• Alta-Cyclic

• BayesCall

Page 50: sam

Single Nucleotide Polymorphisms (SNP) Detection

Sequence variation in which a single nucleotide base differs between members of a species or between the two chromosomes of an individual.
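
Once reads are aligned, a candidate SNP is a position where the reads consistently disagree with the reference. The sketch below is a naive pileup count, not the method of any particular SNP caller: it assumes ungapped alignments given as (start, read) pairs, ignores base qualities, and the thresholds `min_depth` and `min_fraction` as well as the name `call_snps` are made up for this example.

```python
from collections import Counter


def call_snps(reference, alignments, min_depth=3, min_fraction=0.8):
    """Return (position, reference base, observed base) for well-supported differences."""
    pileup = [Counter() for _ in reference]
    for start, read in alignments:
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1

    snps = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue                       # too few reads to trust this site
        base, support = counts.most_common(1)[0]
        if base != reference[pos] and support / depth >= min_fraction:
            snps.append((pos, reference[pos], base))
    return snps


reference = "ACGTACGTAC"
alignments = [
    (0, "ACGTTC"),
    (2, "GTTCGT"),
    (3, "TTCGTA"),
    (4, "TCGTAC"),
]

print(call_snps(reference, alignments))
```

The depth threshold is why discarding ambiguous reads matters: positions with too little coverage are skipped here rather than called.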

Page 51: sam

SNP Detection

• PbShort

• ssahaSNP

Page 52: sam

Other Tools

• TagDust: Program for identifying and eliminating artifacts from next generation sequencing data.

• ShortRead: Package for input, quality assessment and exploration of high-throughput sequence data.

Page 53: sam

The End

Thank you!

Questions?

Page 54: sam

References

O. Morozova, et al. Applications of New Sequencing Technologies for Transcriptome Analysis. Annu. Rev. Genomics Hum. Genet. 2009

J.C. Vera, et al. 2008. Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Mol. Ecol. 17:1636-1647.

N. Cloonan, et al. 2008. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat. Methods. 5:613-619.

R.D. Morin, et al. 2008. Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. Genome Res. 18(4):610-21.

R.K. Thomas, et al. 2007. High-throughput oncogene mutation profiling in human cancer. Nat. Genet. 39:347-51.

Z. Wang, et al. 2009. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10(1):57-63.

T.A. Brown. 2007. Genomics. Garland Science Publishing. Chapter 6.

Page 55: sam

References cont.

Venter, C et al: The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biology, 2007.

Sultan, M et al: A Global View of Gene Activity and Alternative Splicing by Deep Sequencing of the Human Transcriptome. Science, 2008.

Xia, Q et al: Complete Resequencing of 40 Genomes Reveals Domestication Events and Genes in Silkworm (Bombyx). Science, 2009.

Li, H, Ruan, J and Durbin, R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research 18(11): 1851–1858, 2008.

Ruan, Y and Wei, C-L: Multiplex parallel pair-end-ditag sequencing approaches in system biology. Genome Technology & Biology Group, Genome Institute of Singapore.

SOAP: short oligonucleotide alignment program. Bioinformatics 24(5): 713–714, 2008. doi:10.1093/bioinformatics/btn025

Page 56: sam

References cont.

Hutchison, Clyde A. DNA sequencing: bench to bedside and beyond. Nucleic Acids Research, 2007, Vol. 35, No. 18, 6227-6237

Breitbart, M; Salamon P, Andresen B, Mahaffy JM, Segall AM, Mead D, Azam F, Rohwer F (2002). "Genomic analysis of uncultured marine viral communities". Proceedings of the National Academy of Sciences USA 99: 14250–14255.

Himmelbauer et al, Plant Genomics in the era of high throughput sequencing: The case of the sugar beet, Next Generation Sequencing, 2009

Kua, CS and Cannon, CH, Comparative genomics of Tropical Evergreen Fagaceae, Next Generation Sequencing, 2009

Liu George, Applications and Case Studies of the Next-Generation Sequencing Technologies in Food, Nutrition and Agriculture, Recent Patents on Food, Nutrition & Agriculture, 2009,1,75-79