Massive Parallel Sequencing
Kun Huang, PhD
Department of Biomedical Informatics
OSU CCC Bioinformatics Shared Resources
High-throughput sequencing: a new paradigm
• Solexa
• SOLiD
• 454/Roche Genome Sequencer
Applications
• Genome sequencing
• microRNA screening
• Gene expression
• ChIP-seq
Introduction
What is ChIP-Sequencing?
ChIP-Sequencing is a new frontier technology to analyze protein interactions with DNA.
ChIP-Seq
• Combination of chromatin immunoprecipitation (ChIP) with ultra-high-throughput massively parallel sequencing
• Allows mapping of protein–DNA interactions in vivo on a genome scale
Mardis, E.R. Nat. Methods 4, 613-614 (2007)
Workflow of ChIP-Seq
ChIP-seq
Challenges:
• Millions of segments
• Mapping to genome
• Visualization
• Peak detection
• Data normalization
• …
Johnson et al, 2007
ChIP-Seq technology is used to understand in vivo binding of the neuron-restrictive silencer factor (NRSF)
• Results are compared to known binding sites
• ChIP-Seq signals agree strongly with existing knowledge
• Sharp resolution of binding position
• New noncanonical NRSF binding motifs are identified
Robertson et al, 2007
ChIP-Seq technology is used to study genome-wide profiles of STAT1 DNA association
STAT1 targets in interferon-γ-stimulated and unstimulated human HeLa S3 cells are compared
The performance of ChIP-Seq is compared to the alternative protein-DNA interaction methods of ChIP-PCR and ChIP-chip.
41,582 and 11,004 putative STAT1 binding regions are identified in stimulated and unstimulated cells, respectively.
Why ChIP-Sequencing?
Current microarray and ChIP-chip designs require prior knowledge of the sequence of interest, such as a promoter, enhancer, or RNA-coding domain.
• Lower cost
• Less work in ChIP-Seq
• Higher accuracy
Alterations in transcription-factor binding in response to environmental stimuli can be evaluated for the entire genome in a single experiment.
Bioinformatics
Sequencers
Solexa (Illumina)
• 1 GB of sequences in a single run
• 35 bases in length
454 Life Sciences (Roche Diagnostics)
• 25-50 MB of sequences in a single run
• Up to 500 bases in length
SOLiD (Applied Biosystems)
• 6 GB of sequences in a single run
• 35 bases in length
Illumina Genome Analysis System
8 lanes, 100 tiles per lane
Sequencing
Sequencer Output
• Sequence files
• Quality scores
Sequence Files
~10 million sequences per lane
~500 MB files
Quality Score Files
• Quality scores describe the confidence of bases in each read
• The Solexa pipeline assigns a quality score to each of the four possible nucleotides for each sequenced base
• 9 million sequences (500 MB file) correspond to a ~6.5 GB quality score file
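The gap between the two file sizes follows from storing four scores per base. The arithmetic below is a rough sketch, assuming 35-base reads (the Solexa read length quoted on the Sequencers slide) and about 5 bytes of text per score; the bytes-per-score figure is an assumption for illustration, not a documented property of the Solexa pipeline.

```python
# Back-of-the-envelope size of a plain-text Solexa quality score file.
READS = 9_000_000        # reads per lane, from the slide
READ_LENGTH = 35         # bases per read, from the Sequencers slide
SCORES_PER_BASE = 4      # one score for each possible nucleotide (A, C, G, T)
BYTES_PER_SCORE = 5      # ASSUMPTION: sign, digits and separator stored as text


def quality_file_bytes(reads, read_length, scores_per_base, bytes_per_score):
    """Return the estimated size in bytes of the quality score file."""
    return reads * read_length * scores_per_base * bytes_per_score


size = quality_file_bytes(READS, READ_LENGTH, SCORES_PER_BASE, BYTES_PER_SCORE)
print(f"{size / 1e9:.1f} GB")
```

Under these assumptions the estimate lands in the same range as the ~6.5 GB figure above.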
Bioinformatics Challenges
Rapid mapping of these short sequence reads to the reference genome
Visualize mapping results: thousands of enriched regions
Peak analysis: peak detection, finding exact binding sites
Compare results of different experiments: normalization, statistical tests
Mapping of Short Oligonucleotides to the Reference Genome
Mapping Methods
• Need to allow mismatches and gaps: SNP locations, sequencing errors, reading errors
• Indexing and hashing: genome or oligonucleotide reads
• Use of quality scores
• Use of SNP knowledge
• Performance: partitioning the genome or sequence reads
Mapping Methods: Indexing the Genome
• Fast sequence similarity search algorithms (like BLAST)
• Not specifically designed for mapping millions of query sequences
• Take a very long time, e.g. 2 days to map half a million sequences to a 70 MB reference genome (using BLAST)
• Indexing the genome is memory expensive
SOAP (Li et al, 2008)
• Both reads and reference genome are converted to a numeric data type using 2-bits-per-base coding
• Load reference genome into memory
• For the human genome, 14 GB RAM is required for storing reference sequences and index tables
• 300 (gapped) to 1200 (ungapped) times faster than BLAST
SOAP (Li et al, 2008)
• Allows 2 mismatches or a 1-3 bp continuous gap
• Errors accumulate during the sequencing process
• Much higher number of sequencing errors at the 3’-end (sometimes making the reads unalignable to the reference genome)
• Iteratively trim several base pairs at the 3’-end and redo the alignment
• Improves sensitivity
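The trim-and-retry idea can be sketched as follows. This is only an illustration: a brute-force mismatch scan stands in for the real aligner, and the parameter names and defaults (`trim_step`, `min_length`) are hypothetical choices, not SOAP's settings.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_hits(read, genome, max_mismatches=2):
    """Brute-force stand-in for an aligner: offsets with few enough mismatches."""
    n = len(read)
    return [i for i in range(len(genome) - n + 1)
            if hamming(read, genome[i:i + n]) <= max_mismatches]


def align_with_trimming(read, genome, trim_step=3, min_length=12,
                        max_mismatches=2):
    """Try the full read first; on failure, trim the 3' end and retry."""
    current = read
    while len(current) >= min_length:
        hits = find_hits(current, genome, max_mismatches)
        if hits:
            return current, hits
        current = current[:-trim_step]  # drop bases from the error-prone 3' end
    return None, []


genome = "CCCCCCCC" + "GATGCATTGCTATG" + "CCCCCCCC"
read = "GATGCATTGCTATG" + "AAAAAA"  # six erroneous bases at the 3' end
trimmed, hits = align_with_trimming(read, genome)
```

In this toy case the full 20-base read fails, as does the 17-base version, and the 14-base version aligns at offset 8.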
Mapping Methods: Indexing the Oligonucleotide Reads
• ELAND (Cox, unpublished): “Efficient Large-Scale Alignment of Nucleotide Databases” (Solexa Ltd.)
• SeqMap (Jiang, 2008): “Mapping massive amount of oligonucleotides to the genome”
• RMAP (Smith, 2008): “Using quality scores and longer reads improves accuracy of Solexa read mapping”
• MAQ (Li, 2008): “Mapping short DNA sequencing reads and calling variants using mapping quality scores”
Mapping Algorithm (2 mismatches)
Read: GATGCATTGCTATGCCTCCCAGTCCGCAACTTCACG
Seeds: GATGCATTG CTATGCCTC CCAGTCCGC AACTTCACG
Genome: GATGCATTGCTATGCCTCCCAGTCCGCAACTTCACG.........
• Seeds are matched exactly against the genome
• Indexed table of exactly matching seeds
• Approximate search around the exactly matching seeds
Mapping Algorithm (2 mismatches)
• Partition reads into 4 seeds {A, B, C, D}; at least 2 seeds must map with no mismatches
• Scan the genome to identify locations where the seeds match exactly; there are 6 possible combinations of the seeds to search, {AB, CD, AC, BD, AD, BC}, so 6 scans find all candidates
• Do approximate matching around the exactly matching seeds to determine all targets for the reads; ins/del can be incorporated
• The reads are indexed and hashed before scanning the genome
• Bit operations are used to accelerate mapping; each nt is encoded into 2 bits
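The seed-pair strategy above can be sketched in a few lines. With at most 2 mismatches spread over 4 seeds, at least 2 seeds are untouched, so every valid alignment is found by one of the 6 seed-pair scans. The sketch below is a simplified illustration, not the code of ELAND or any other named tool: it handles substitutions only, uses plain strings instead of 2-bit encoding, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def seed_bounds(read_length, n_seeds=4):
    """Split [0, read_length) into n_seeds contiguous (start, end) intervals."""
    step = read_length // n_seeds
    bounds = [(i * step, (i + 1) * step) for i in range(n_seeds)]
    bounds[-1] = (bounds[-1][0], read_length)  # last seed absorbs the remainder
    return bounds


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_reads(reads, genome, max_mismatches=2):
    """Map equal-length reads, allowing up to max_mismatches substitutions."""
    length = len(reads[0])
    bounds = seed_bounds(length)
    hits = defaultdict(set)
    # One genome scan per seed pair: AB, AC, AD, BC, BD, CD.
    for (s1, e1), (s2, e2) in combinations(bounds, 2):
        # Index the reads (not the genome) by the content of the two seeds.
        index = defaultdict(list)
        for read in reads:
            index[(read[s1:e1], read[s2:e2])].append(read)
        # Scan the genome; look up each window's seed pair in the read index.
        for pos in range(len(genome) - length + 1):
            window = genome[pos:pos + length]
            for read in index.get((window[s1:e1], window[s2:e2]), ()):
                # Approximate check around the exactly matching seeds.
                if hamming(read, window) <= max_mismatches:
                    hits[read].add(pos)
    return {read: sorted(positions) for read, positions in hits.items()}
```

Indexing the reads rather than the genome keeps memory proportional to the number of reads, at the cost of six passes over the genome.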
ELAND (Cox, unpublished)
• Commercial sequence mapping program that comes with the Solexa machine
• Allows at most 2 mismatches
• Maps sequences up to 32 nt in length
• All sequences have to be the same length
RMAP (Smith et al, 2008)
• Improve mapping accuracy: possible sequencing errors at 3’-ends of longer reads, base-call quality scores
• Use of base-call quality scores with a quality cutoff: high-quality positions are checked for mismatches, low-quality positions always induce a match
• Quality control step eliminates reads with too many low-quality positions
• Allows any number of mismatches
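The quality-cutoff rule described above can be sketched as follows. This is a minimal illustration of the idea, not RMAP's implementation: the brute-force genome scan, the function names, and the default values for `cutoff`, `max_mismatches` and `max_low_quality` are all assumptions.

```python
def quality_aware_mismatches(read, quals, window, cutoff):
    """Count mismatches at high-quality positions only."""
    return sum(
        1
        for base, q, ref in zip(read, quals, window)
        if q >= cutoff and base != ref  # low-quality positions always match
    )


def passes_quality_control(quals, cutoff, max_low_quality):
    """Reject reads with too many low-quality positions."""
    return sum(q < cutoff for q in quals) <= max_low_quality


def map_read(read, quals, genome, cutoff=20, max_mismatches=2,
             max_low_quality=5):
    """Return genome offsets where the read maps under the quality rule."""
    if not passes_quality_control(quals, cutoff, max_low_quality):
        return []
    n = len(read)
    return [
        pos for pos in range(len(genome) - n + 1)
        if quality_aware_mismatches(read, quals, genome[pos:pos + n], cutoff)
        <= max_mismatches
    ]
```

Because low-quality positions never count against a read, the quality control step is what prevents reads made up mostly of unreliable bases from mapping everywhere.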
Map to reference genome
[Figure: flow chart of read mapping with a quality filter. Categories: mapped to a unique location, mapped to multiple locations, no mapping, low quality. Read counts shown: 12 M, 3 M, 7.2 M, 1.8 M, 2.5 M, 0.5 M.]
Visualization
BED files are built to summarize mapping results
BED files can be easily visualized in the UCSC Genome Browser
http://genome.ucsc.edu
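A mapping result can be turned into BED text with very little code. The sketch below assumes each hit is a tuple of (chromosome, 0-based start, read length, read name, strand); that tuple layout, the function names, and the track name are hypothetical. BED start coordinates are 0-based and end coordinates are exclusive.

```python
def bed_line(chrom, start, length, name, strand, score=0):
    """Format one mapped read as a 6-column BED record."""
    end = start + length  # BED end coordinate is exclusive
    return "\t".join([chrom, str(start), str(end), name, str(score), strand])


def bed_text(hits, track_name="chipseq_reads"):
    """Build BED text with a track line, ready to upload as a custom track."""
    lines = [f'track name="{track_name}"']
    lines += [bed_line(*hit) for hit in hits]
    return "\n".join(lines) + "\n"


example = bed_text([
    ("chr1", 100, 35, "read1", "+"),
    ("chr1", 250, 35, "read2", "-"),
])
print(example, end="")
```

The resulting text can be saved to a file and loaded into the browser as a custom track.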
Visualization: Genome Browser
Robertson, G. et al. Nat. Methods 4, 651-657 (2007)
Mikkelsen,T.S. et al. Nature 448, 553-562 (2007)
Visualization: Custom
300 kb region from mouse ES cells
Visualization
Huang, 2008 (unpublished)
Peak Analysis
Peak Detection
ChIP-Peak Analysis Module (Swiss Institute of Bioinformatics)
ChIPSeq Peak Finder (Wold Lab, Caltech)
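To make the idea of peak detection concrete, here is a deliberately simple sketch that counts read starts in fixed windows and reports windows above a threshold. It is a generic illustration and does not reproduce the method of either tool named above; the window size, the threshold, and the function names are assumptions.

```python
from collections import Counter


def window_counts(read_starts, window=200):
    """Count read start positions falling in each fixed-size window."""
    return Counter(start // window for start in read_starts)


def call_enriched_windows(read_starts, window=200, min_reads=5):
    """Return (start, end, count) regions with at least min_reads per window.

    Adjacent enriched windows are merged into a single region.
    """
    counts = window_counts(read_starts, window)
    regions = []
    for idx in sorted(i for i, c in counts.items() if c >= min_reads):
        start, end = idx * window, (idx + 1) * window
        if regions and regions[-1][1] == start:
            prev_start, _, prev_count = regions[-1]
            regions[-1] = (prev_start, end, prev_count + counts[idx])
        else:
            regions.append((start, end, counts[idx]))
    return regions


starts = [10, 20, 30, 40, 50, 210, 220, 230, 240, 250, 260, 1000, 5000]
regions = call_enriched_windows(starts)
```

A fixed threshold ignores background variation, which is one reason dedicated peak finders use statistical models instead.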
Peak Analysis
Finding Exact Binding Site
Determining the exact binding sites from short reads generated from ChIP-Seq experiments
SISSRs (Site Identification from Short Sequence Reads) (Jothi 2008)
MACS (Model-based Analysis of ChIP-Seq) (Zhang et al, 2008)
Compare Samples
Huang, 2008 (unpublished)
Compare Samples
Fold change
HPeak: An HMM-based algorithm for defining read-enriched regions from massive parallel sequencing data (Xu et al, 2008)
Advanced statistics
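The fold-change comparison can be illustrated with a small calculation that first normalizes read counts by sequencing depth. This is a sketch under stated assumptions: reads-per-million scaling and a pseudocount of 1 are common conventions chosen here for illustration, not the method used in the slides, and the function names are hypothetical.

```python
import math


def reads_per_million(count, total_reads):
    """Scale a region's read count to a library of one million reads."""
    return count * 1_000_000 / total_reads


def log2_fold_change(count_a, total_a, count_b, total_b, pseudocount=1.0):
    """log2 ratio of depth-normalized counts between samples A and B."""
    a = reads_per_million(count_a, total_a) + pseudocount  # avoids division by zero
    b = reads_per_million(count_b, total_b) + pseudocount
    return math.log2(a / b)


# Region with 30 reads out of 10 M in sample A and 5 reads out of 5 M in sample B.
change = log2_fold_change(30, 10_000_000, 5, 5_000_000)
print(f"log2 fold change: {change:.2f}")
```

Without the depth normalization, a sample sequenced twice as deeply would appear enriched everywhere.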
QUESTIONS?