Massive Parallel Sequencing Kun Huang, PhD Department of Biomedical Informatics OSU CCC Bioinformatics Shared Resources


Massive Parallel Sequencing

Kun Huang, PhD

Department of Biomedical Informatics

OSU CCC Bioinformatics Shared Resources


Introduction

High-throughput sequencing: a new paradigm
• Solexa
• SOLiD
• 454/Roche Genome Sequencer

Applications
• Genome sequencing
• microRNA screening
• Gene expression
• ChIP-seq


What is ChIP-Sequencing?

ChIP-Sequencing is a new frontier technology to analyze protein interactions with DNA.

ChIP-Seq
• Combines chromatin immunoprecipitation (ChIP) with ultra-high-throughput massively parallel sequencing
• Allows mapping of protein–DNA interactions in vivo on a genome-wide scale


Mardis, E.R. Nat. Methods 4, 613-614 (2007)

Workflow of ChIP-Seq


Workflow of ChIP-Seq


ChIP-seq

Challenges:
• Millions of segments
• Mapping to genome
• Visualization
• Peak detection
• Data normalization
• …


Johnson et al, 2007

ChIP-Seq technology is used to understand in vivo binding of the neuron-restrictive silencer factor (NRSF)

• Results are compared to known binding sites
• ChIP-Seq signals agree strongly with existing knowledge
• Sharp resolution of binding position
• New noncanonical NRSF binding motifs are identified


Robertson et al, 2007

ChIP-Seq technology used to study genome-wide profiles of STAT1 DNA association

STAT1 targets in interferon-γ-stimulated and unstimulated human HeLa S3 cells are compared

The performance of ChIP-Seq is compared to the alternative protein-DNA interaction methods of ChIP-PCR and ChIP-chip.

41,582 and 11,004 putative STAT1 binding regions are identified in stimulated and unstimulated cells, respectively.


Why ChIP-Sequencing?

Current microarray and ChIP-chip designs require prior knowledge of the sequence of interest, such as a promoter, enhancer, or RNA-coding domain.

• Lower cost
• Less work in ChIP-Seq
• Higher accuracy

Alterations in transcription-factor binding in response to environmental stimuli can be evaluated for the entire genome in a single experiment.


Bioinformatics


Sequencers

Solexa (Illumina)
• 1 GB of sequences in a single run
• 35 bases in length

454 Life Sciences (Roche Diagnostics)
• 25-50 MB of sequences in a single run
• Up to 500 bases in length

SOLiD (Applied Biosystems)
• 6 GB of sequences in a single run
• 35 bases in length


Illumina Genome Analysis System

• 8 lanes
• 100 tiles per lane


Sequencing


Sequencer Output

• Sequence Files
• Quality Scores


Sequence Files

~10 million sequences per lane

~500 MB files


Quality Score Files

• Quality scores describe the confidence of bases in each read
• Solexa pipeline assigns a quality score to the four possible nucleotides for each sequenced base
• 9 million sequences (500 MB sequence file) yield a ~6.5 GB quality score file
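
As a rough illustration of how four per-base scores can be reduced to a called read, the sketch below picks the highest-scoring nucleotide at each position. It assumes the scores are on the Solexa log-odds scale, Q = -10 log10(p / (1 - p)), and that each row is ordered A, C, G, T; the function names and the column order are assumptions made for this example, not part of the Solexa pipeline.

```python
def solexa_to_error_prob(q):
    """Convert a Solexa-scaled score Q = -10*log10(p/(1-p)) to an error probability p."""
    return 1.0 / (1.0 + 10 ** (q / 10.0))


def call_base(scores, order="ACGT"):
    """Pick the nucleotide with the highest of the four scores (A, C, G, T order assumed)."""
    best = max(range(4), key=lambda i: scores[i])
    return order[best], scores[best]


def call_read(score_rows):
    """Turn a non-empty list of four-score rows into (sequence, per-base qualities)."""
    bases, quals = zip(*(call_base(row) for row in score_rows))
    return "".join(bases), list(quals)


if __name__ == "__main__":
    rows = [(40, -40, -40, -40), (-40, -40, 30, -40)]
    seq, quals = call_read(rows)
    print(seq, quals, [round(solexa_to_error_prob(q), 5) for q in quals])
```

Keeping only the best score per base is also why a called-sequence file is so much smaller than the four-scores-per-base file.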


Bioinformatics Challenges

• Rapid mapping of these short sequence reads to the reference genome
• Visualize mapping results
  • Thousands of enriched regions
• Peak analysis
  • Peak detection
  • Finding exact binding sites
• Compare results of different experiments
  • Normalization
  • Statistical tests


Mapping of Short Oligonucleotides to the Reference Genome

Mapping Methods
• Need to allow mismatches and gaps
  • SNP locations
  • Sequencing errors
  • Reading errors
• Indexing and hashing
  • Genome
  • Oligonucleotide reads
• Use of quality scores
• Use of SNP knowledge
• Performance
  • Partitioning the genome or sequence reads


Mapping Methods: Indexing the Genome

• Fast sequence similarity search algorithms (like BLAST)
  • Not specifically designed for mapping millions of query sequences
  • Take a very long time
  • e.g. 2 days to map half a million sequences to a 70 MB reference genome (using BLAST)
• Indexing the genome is memory expensive


SOAP (Li et al, 2008)

• Both reads and reference genome are converted to a numeric data type using 2-bits-per-base coding
• Load reference genome into memory
  • For the human genome, 14 GB of RAM is required for storing reference sequences and index tables
• 300 (gapped) to 1200 (ungapped) times faster than BLAST
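
A minimal sketch of 2-bits-per-base coding, to show why it makes comparison cheap: once sequences are packed into integers, mismatches can be counted with XOR and a bit count. The A/C/G/T to 0/1/2/3 assignment and the function names are assumptions chosen for illustration; this is not SOAP's actual code.

```python
# Assumed mapping; any fixed assignment of the four bases to 2-bit codes works.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def encode(seq):
    """Pack a DNA string into an integer, 2 bits per base (first base in the highest bits)."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def decode(value, length):
    """Unpack an integer produced by encode() back into a DNA string."""
    out = []
    for shift in range(2 * (length - 1), -1, -2):
        out.append(BASES[(value >> shift) & 3])
    return "".join(out)


def mismatches(a, b, length):
    """Count differing bases between two packed sequences of the same length."""
    diff = a ^ b
    # Fold each 2-bit pair onto its low bit: set if either bit of the pair differs.
    mask = int("01" * length, 2)
    folded = (diff | (diff >> 1)) & mask
    return bin(folded).count("1")


if __name__ == "__main__":
    x, y = encode("GATGCATTG"), encode("GATCCATTA")
    print(decode(x, 9), mismatches(x, y, 9))
```

At 2 bits per base, a 32-base seed fits in one 64-bit machine word, which is what makes the bit operations fast in a compiled implementation.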


SOAP (Li et al, 2008)

• Allows 2 mismatches or a 1-3 bp continuous gap
• Errors accumulate during the sequencing process
  • Much higher number of sequencing errors at the 3’-end (sometimes making the reads unalignable to the reference genome)
• Iteratively trim several base pairs at the 3’-end and redo the alignment
  • Improves sensitivity
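
The iterative trimming idea can be sketched as a retry loop. Here exact string search stands in for a real aligner, and `min_length` and `trim_step` are hypothetical parameters chosen for the example rather than SOAP's actual settings.

```python
def align_with_trimming(read, genome, min_length=20, trim_step=2):
    """Try to align a read; on failure, trim bases from the 3' end and retry.

    Returns (position, aligned_length), or None if nothing aligns.
    Exact matching via str.find is a stand-in for a real aligner.
    """
    length = len(read)
    while length >= min_length:
        pos = genome.find(read[:length])
        if pos != -1:
            return pos, length
        length -= trim_step  # drop the error-prone 3' bases and try again
    return None


if __name__ == "__main__":
    ref = "GATGCATTGCTATGCCTCCCAGTCCGCAACTTCACG"
    genome = "TTTT" + ref + "TTTT"
    noisy = ref[:34] + "GT"  # two wrong bases at the 3' end
    print(align_with_trimming(noisy, genome))
```

The trade-off is that shorter reads are more likely to map to several locations, so a lower bound on the trimmed length is needed.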


Mapping Methods: Indexing the Oligonucleotide Reads

• ELAND (Cox, unpublished)
  • “Efficient Large-Scale Alignment of Nucleotide Databases” (Solexa Ltd.)
• SeqMap (Jiang, 2008)
  • “Mapping massive amount of oligonucleotides to the genome”
• RMAP (Smith, 2008)
  • “Using quality scores and longer reads improves accuracy of Solexa read mapping”
• MAQ (Li, 2008)
  • “Mapping short DNA sequencing reads and calling variants using mapping quality scores”


Mapping Algorithm (2 mismatches)

Read: GATGCATTGCTATGCCTCCCAGTCCGCAACTTCACG

Seeds: GATGCATTG | CTATGCCTC | CCAGTCCGC | AACTTCACG

Genome (exact match): GATGCATTGCTATGCCTCCCAGTCCGCAACTTCACG.........

• Indexed table of exactly matching seeds
• Approximate search around the exactly matching seeds


Mapping Algorithm (2 mismatches)

• Partition reads into 4 seeds {A, B, C, D}
  • At least 2 seeds must map with no mismatches
• Scan genome to identify locations where the seeds match exactly
  • 6 possible combinations of the seeds to search: {AB, CD, AC, BD, AD, BC}
  • 6 scans to find all candidates
• Do approximate matching around the exactly matching seeds
  • Determine all targets for the reads
  • Ins/del can be incorporated
• The reads are indexed and hashed before scanning the genome
• Bit operations are used to accelerate mapping
  • Each nt is encoded into 2 bits
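
The seed strategy rests on a pigeonhole argument: with 4 seeds and at most 2 mismatches, at least 2 seeds are error-free, so indexing every pair of seeds cannot miss a valid hit. The sketch below hashes the reads by seed pair and scans the genome once per pair. It uses plain strings instead of 2-bit codes, handles mismatches only (no ins/del), and assumes all reads share one length divisible by 4; the function names are made up for this example.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def build_index(reads, n_seeds=4):
    """Hash every read under each of its seed pairs: (i, j, seed_i, seed_j) -> read ids."""
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        size = len(read) // n_seeds
        seeds = [read[k * size:(k + 1) * size] for k in range(n_seeds)]
        for i, j in combinations(range(n_seeds), 2):
            index[(i, j, seeds[i], seeds[j])].append(rid)
    return index


def map_reads(reads, genome, max_mismatches=2, n_seeds=4):
    """Return {read id: sorted genome positions} for hits within max_mismatches."""
    read_len = len(reads[0])
    size = read_len // n_seeds
    index = build_index(reads, n_seeds)
    hits = defaultdict(set)
    for i, j in combinations(range(n_seeds), 2):  # 6 scans for 4 seeds
        for start in range(len(genome) - read_len + 1):
            window = genome[start:start + read_len]
            key = (i, j, window[i * size:(i + 1) * size], window[j * size:(j + 1) * size])
            for rid in index.get(key, ()):
                # Approximate matching around the exactly matching seed pair.
                if hamming(window, reads[rid]) <= max_mismatches:
                    hits[rid].add(start)
    return {rid: sorted(pos) for rid, pos in hits.items()}


if __name__ == "__main__":
    ref = "GATGCATTGCTATGCCTCCCAGTCCGCAACTTCACG"
    two_errors = "A" + ref[1:20] + "T" + ref[21:]
    print(map_reads([ref, two_errors], "TTTT" + ref + "TTTT"))
```

Indexing the reads rather than the genome keeps memory proportional to the number of reads, at the cost of scanning the whole genome for each seed pair.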


ELAND (Cox, unpublished)

• Commercial sequence mapping program that comes with the Solexa machine
• Allows at most 2 mismatches
• Maps sequences up to 32 nt in length
• All sequences have to be the same length


RMAP (Smith et al, 2008)

• Improve mapping accuracy
  • Possible sequencing errors at 3’-ends of longer reads
  • Base-call quality scores
• Use of base-call quality scores
  • Quality cutoff
    • High-quality positions are checked for mismatches
    • Low-quality positions always induce a match
  • Quality control step eliminates reads with too many low-quality positions
• Allow any number of mismatches
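
The quality-cutoff rule can be sketched as follows: positions below the cutoff are treated as matching whatever the genome has, and only high-quality positions contribute mismatches. The cutoff values and function names are hypothetical, picked for illustration rather than taken from RMAP.

```python
def quality_aware_mismatches(read, quals, target, cutoff=10):
    """Count mismatches only at positions whose base-call quality is >= cutoff.

    Low-quality positions always count as a match.
    """
    return sum(
        1
        for base, q, ref in zip(read, quals, target)
        if q >= cutoff and base != ref
    )


def passes_quality_control(quals, cutoff=10, max_low_quality=5):
    """Reject reads with too many low-quality positions before mapping."""
    return sum(q < cutoff for q in quals) <= max_low_quality


if __name__ == "__main__":
    read, quals = "ACGT", [30, 30, 5, 30]
    print(quality_aware_mismatches(read, quals, "ACTT"))  # mismatch sits on a low-quality base
    print(passes_quality_control(quals, max_low_quality=1))
```

The quality control step matters because a read made mostly of low-quality positions would otherwise match almost anywhere.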


[Figure: read-mapping flowchart. Steps: Map to reference genome, Quality filter. Outcomes: Mapped to a unique location, Mapped to multiple locations, No mapping, Low quality. Read counts shown: 12 M, 7.2 M, 1.8 M, 3 M, 2.5 M, 0.5 M.]


Bioinformatics Challenges

• Rapid mapping of these short sequence reads to the reference genome
• Visualize mapping results
  • Thousands of enriched regions
• Peak analysis
  • Peak detection
  • Finding exact binding sites
• Compare results of different experiments
  • Normalization
  • Statistical tests


Visualization

BED files are built to summarize mapping results

BED files can be easily visualized in the UCSC Genome Browser

http://genome.ucsc.edu
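
A minimal sketch of writing mapped reads as a BED track that can be uploaded as a custom track. BED uses 0-based start and exclusive end coordinates, with columns chrom, start, end, name, score, strand. The input tuple layout, the track name, and the `read<n>` naming are assumptions for this example.

```python
def reads_to_bed(mapped_reads, track_name="chipseq_reads"):
    """Format mapped reads as BED text.

    mapped_reads: iterable of (chrom, start, read_length, strand),
    where start is the 0-based leftmost position of the read.
    """
    lines = [f'track name="{track_name}"']
    for n, (chrom, start, length, strand) in enumerate(mapped_reads, 1):
        fields = [chrom, str(start), str(start + length), f"read{n}", "0", strand]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    reads = [("chr1", 100, 36, "+"), ("chr1", 180, 36, "-")]
    print(reads_to_bed(reads), end="")
```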


Visualization: Genome Browser

Robertson, G. et al. Nat. Methods 4, 651-657 (2007)


Mikkelsen,T.S. et al. Nature 448, 553-562 (2007)

Visualization: Custom

300 kb region from mouse ES cells


Visualization

Huang, 2008 (unpublished)


Huang, 2008 (unpublished)


Bioinformatics Challenges

• Rapid mapping of these short sequence reads to the reference genome
• Visualize mapping results
  • Thousands of enriched regions
• Peak analysis
  • Peak detection
  • Finding exact binding sites
• Compare results of different experiments
  • Normalization
  • Statistical tests


Peak Analysis

Peak Detection

ChIP-Peak Analysis Module (Swiss Institute of Bioinformatics)

ChIPSeq Peak Finder (Wold Lab, Caltech)
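
To make the idea of peak detection concrete, here is a generic fixed-window sketch: count read starts per window, keep windows above a threshold, and merge adjacent enriched windows. This is not the algorithm of either tool listed above; `window` and `min_reads` are hypothetical parameters.

```python
from collections import Counter


def call_peaks(read_starts, window=200, min_reads=10):
    """Return enriched regions as (start, end, read_count) tuples.

    Read start positions are binned into fixed windows; windows with at
    least min_reads reads are kept, and adjacent kept windows are merged.
    """
    counts = Counter(pos // window for pos in read_starts)
    enriched = sorted(b for b, c in counts.items() if c >= min_reads)
    peaks = []
    for b in enriched:
        start, end = b * window, (b + 1) * window
        if peaks and peaks[-1][1] == start:
            prev_start, _, prev_count = peaks[-1]
            peaks[-1] = (prev_start, end, prev_count + counts[b])
        else:
            peaks.append((start, end, counts[b]))
    return peaks


if __name__ == "__main__":
    starts = [1000 + i for i in range(15)] + [1200 + i for i in range(12)] + [5000, 5001]
    print(call_peaks(starts))
```

A fixed threshold ignores background variation along the genome, which is one reason dedicated peak finders use statistical models or control samples.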


Peak Analysis

Finding Exact Binding Site

Determining the exact binding sites from short reads generated from ChIP-Seq experiments

SISSRs (Site Identification from Short Sequence Reads) (Jothi 2008)

MACS (Model-based Analysis of ChIP-Seq) (Zhang et al, 2008)


Bioinformatics Challenges

• Rapid mapping of these short sequence reads to the reference genome
• Visualize mapping results
  • Thousands of enriched regions
• Peak analysis
  • Peak detection
  • Finding exact binding sites
• Compare results of different experiments
  • Normalization
  • Statistical tests


Compare Samples

Huang, 2008 (unpublished)


Compare Samples

Fold change

HPeak: An HMM-based algorithm for defining read-enriched regions from massive parallel sequencing data (Xu et al, 2008)

Advanced statistics
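
A simple way to compare a region between two samples is to scale counts by sequencing depth before taking a fold change. The sketch below uses reads per million mapped reads and a pseudocount to avoid division by zero; both are common conventions assumed here for illustration, not a method taken from the slides.

```python
import math


def reads_per_million(count, total_mapped):
    """Scale a raw read count by sequencing depth."""
    return count * 1_000_000 / total_mapped


def log2_fold_change(count_a, total_a, count_b, total_b, pseudocount=1.0):
    """log2 ratio of depth-normalized counts for one region in samples A and B."""
    a = reads_per_million(count_a + pseudocount, total_a)
    b = reads_per_million(count_b + pseudocount, total_b)
    return math.log2(a / b)


if __name__ == "__main__":
    # Region with 79 reads out of 10 M mapped in A, 19 reads out of 5 M mapped in B.
    print(log2_fold_change(79, 10_000_000, 19, 5_000_000))
```

Fold change alone says nothing about significance for low-count regions, which is where statistical tests come in.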


QUESTIONS?