GNUMAP-SNP Nathan Clement The University of Texas Austin, TX, USA

PARALLEL PAIR-HMM SNP DETECTION


Outline

Motivation
NGS Issues and Requirements
Pair-HMM
Memory Optimizations
Results
Conclusion

Motivation

Mutation Detection: SNP discovery
HapMap and resequencing
Species Identification
Bisulfite Sequencing
Epigenetic influences
RNA editing

Error Rates*

Instrument              Run Time  Mb/run     Bases/read  Primary Error Type  Error Rate (%)
3730xl (Capillary)      2 h       0.06       650         Substitution        0.1-1
454 FLX+                18-20 h   900        700         Indel               1
Illumina HiSeq2000      10 days   ≤ 600,000  100+100     Substitution        ≥0.1
Ion Torrent – 318 chip  2 h       >1000      >100        Indel               ~1
PacBio RS               0.5-2 h   5-10       860-1100    CG Deletions        16

* Data current as of May 2011: Glenn, Travis C., “Field guide to next-generation DNA sequencers,” Molecular Ecology Resources, vol. 11, pp. 759-769, 2011.

Pair-HMM

Pair-HMM (Mathematics)

Match

Gap (in both directions)
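The equations from the mathematics slide did not survive extraction. As a rough illustration only, the sketch below uses the textbook three-state pair-HMM (a match state M and two gap states X and Y) with a forward-backward pass to compute posterior probabilities of the kind tabulated in the M, X, and Y matrices that follow. The function name `pair_hmm_posteriors` and the parameters `delta` (gap open), `eps` (gap extend), and `match` (per-base agreement) are assumptions made for this example, not GNUMAP-SNP's actual parameterization, so the numbers will not reproduce the slides.

```python
def pair_hmm_posteriors(x, y, delta=0.1, eps=0.3, match=0.9):
    """Posterior state probabilities for a textbook three-state pair-HMM.

    delta (gap open), eps (gap extend) and match (per-base agreement) are
    illustrative assumptions, not GNUMAP-SNP's parameters.
    Returns (post_m, post_x, post_y), each an (n+1) x (m+1) matrix where
    index [i][j] means "x[:i] and y[:j] have been consumed".
    """
    n, m = len(x), len(y)
    q = 0.25  # emission probability of a base opposite a gap

    def p(a, b):
        # Joint emission probability of an aligned pair of bases.
        return 0.25 * (match if a == b else (1.0 - match) / 3.0)

    def grid():
        return [[0.0] * (m + 1) for _ in range(n + 1)]

    # Forward pass: weight of all partial alignments ending in each state.
    f_m, f_x, f_y = grid(), grid(), grid()
    f_m[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                f_m[i][j] = p(x[i - 1], y[j - 1]) * (
                    (1 - 2 * delta) * f_m[i - 1][j - 1]
                    + (1 - eps) * (f_x[i - 1][j - 1] + f_y[i - 1][j - 1]))
            if i > 0:
                f_x[i][j] = q * (delta * f_m[i - 1][j] + eps * f_x[i - 1][j])
            if j > 0:
                f_y[i][j] = q * (delta * f_m[i][j - 1] + eps * f_y[i][j - 1])

    # Backward pass: weight of completing the alignment from each state.
    b_m, b_x, b_y = grid(), grid(), grid()
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                b_m[i][j] = b_x[i][j] = b_y[i][j] = 1.0
                continue
            diag = p(x[i], y[j]) * b_m[i + 1][j + 1] if i < n and j < m else 0.0
            down = q * b_x[i + 1][j] if i < n else 0.0
            right = q * b_y[i][j + 1] if j < m else 0.0
            b_m[i][j] = (1 - 2 * delta) * diag + delta * (down + right)
            b_x[i][j] = (1 - eps) * diag + eps * down
            b_y[i][j] = (1 - eps) * diag + eps * right

    total = f_m[n][m] + f_x[n][m] + f_y[n][m]

    def posterior(f, b):
        return [[f[i][j] * b[i][j] / total for j in range(m + 1)]
                for i in range(n + 1)]

    return posterior(f_m, b_m), posterior(f_x, b_x), posterior(f_y, b_y)


# Sequences taken from the column and row labels of the slide's matrices.
post_m, post_x, post_y = pair_hmm_posteriors("atacgact", "agtagacc")
```

A useful sanity check on any such implementation: every base of `x` is emitted exactly once, either in a match or in an X-gap, so for each `i` the posteriors `post_m[i][j] + post_x[i][j]` summed over `j` should equal 1.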

Pair-HMM (M)

   a     t     a     c     g     a     c     t
a  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g  0.00  0.68  0.00  0.00  0.00  0.00  0.00  0.00
t  0.00  0.32  0.68  0.00  0.00  0.00  0.00  0.00
a  0.00  0.00  0.32  0.68  0.00  0.00  0.00  0.00
g  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00
a  0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  1.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00

Pair-HMM (X)

   a     t     a     c     g     a     c     t
a  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g  0.31  0.00  0.00  0.00  0.00  0.00  0.00  0.00
t  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

Pair-HMM (Y)

   a     t     a     c     g     a     c     t
a  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
t  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a  0.00  0.00  0.00  0.31  0.00  0.00  0.00  0.00
g  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

Pair-HMM

   A     C     G     T
a  1.00  0.00  0.00  0.00
g  0.00  0.00  0.68  0.31
t  0.32  0.00  0.00  0.68
a  0.99  0.00  0.00  0.00
g  0.00  0.00  1.00  0.00
a  1.00  0.00  0.00  0.00
c  0.00  1.00  0.00  0.00
c  0.00  1.00  0.00  0.00

Expected Results

CHR   POS      TOT    A     C      G      T      SNP?        PVAL
chrX  1755234  17.00  0.00  0.00   17.00  0.00   N
chrX  1755235  18.00  0.00  18.00  0.00   0.00   N
chrX  1755236  19.00  9.99  0.00   9.00   0.01   Y:g->a/g    2.54e-08
chrX  1755237  19.50  0.00  0.00   0.00   19.50  N
chrX  1755238  19.50  0.00  0.00   19.50  0.00   N
chrX  1755239  46.00  0.01  19.49  0.00   0.00   N

Why Inline SNP Calling?

Post-Processing
○ Requires disk space, but less memory
Inline
○ Requires more memory
○ Less disk space
○ Can include specific probabilities for each read

Previous Optimizations

Two methods for speeding up mapping:
1. Entire genome on one machine
2. Split memory among different machines
○ Must normalize across all genome portions
○ MPI reduction
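Splitting the genome means each machine sees only part of the evidence for a read, so per-location scores have to be divided by a sum taken over all portions. A minimal sketch of that idea follows, with the MPI reduction simulated by a plain Python `sum` over per-node partial sums. The function name, data layout, and scores are hypothetical and are not taken from GNUMAP-SNP.

```python
def normalize_across_nodes(node_scores):
    """Normalize one read's alignment scores across all genome portions.

    node_scores holds one list of raw scores per node (genome portion).
    Returns the same structure with every score divided by the global
    total, so the read's weights sum to 1 over the whole genome.
    """
    # Each node first reduces its own scores locally.
    local_sums = [sum(scores) for scores in node_scores]
    # Stand-in for an MPI sum-reduction (assumption: in a distributed run
    # every node would receive this global total).
    global_sum = sum(local_sums)
    if global_sum == 0.0:
        return [[0.0 for _ in scores] for scores in node_scores]
    return [[s / global_sum for s in scores] for scores in node_scores]


# Hypothetical raw scores for one read on three genome portions.
weights = normalize_across_nodes([[1.2, 0.2], [0.6], []])
```

Only one scalar per read crosses the network in this scheme, which is why a reduction is a natural fit.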


Memory Requirements

Human Genome (3 Gbp)
○ HashMap ≈ 12 GB
○ 4 bits/character = 1.5 GB
○ 5 floating point values per base (plus N) = sizeof(float) * 5 * 3 Gbp = 60 GB
○ Also stores total for easy computation = sizeof(float) * 3 Gbp = 12 GB
○ Total of ≈ 90 GB per run
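The arithmetic behind these figures can be checked with a few lines. This is only a sketch: it assumes 4-byte floats and decimal gigabytes, which is what the slide's numbers appear to use, and the variable names are made up for the example.

```python
GENOME_BASES = 3e9   # human genome, roughly 3 billion positions
SIZEOF_FLOAT = 4     # bytes, assuming 32-bit floats
GB = 1e9             # decimal gigabytes (assumption)

hash_map = 12.0                                         # GB, quoted on the slide
packed_genome = GENOME_BASES * 4 / 8 / GB               # 4 bits per character
per_base_counts = SIZEOF_FLOAT * 5 * GENOME_BASES / GB  # A, C, G, T and N
totals = SIZEOF_FLOAT * GENOME_BASES / GB               # one total per position

itemized = hash_map + packed_genome + per_base_counts + totals
print(packed_genome, per_base_counts, totals, itemized)
```

The itemized entries add up to 85.5 GB, which the slide rounds to ≈ 90 GB per run.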

Three Memory Optimizations

Normal (no optimization)
Integer discretization
Centroid discretization

Integer Discretization

Only need one floating point value (for total) and 1 byte/nucleotide.
“Parts per 255”
Biggest hit: Going into and out of “integer space”

Integer Discretization

Added from ri:
Total  A     C     G     T     N
1.0    0.00  0.68  0.31  0.01  0.00

Genome (integer space):
Total  A  C    G  T   N
12.0   3  231  7  12  3

Step 1: Convert from Integer Space
Total  A     C     G     T     N
12.0   0.15  10.9  0.33  0.56  0.15

Step 2: Add from ri to Genome
Total  A     C     G     T     N
13.0   0.15  11.6  0.64  0.57  0.15

Step 3: Convert back to Integer Space
Total  A  C    G   T   N
13.0   2  228  13  11  2
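The three steps on this slide can be sketched in a few lines. The function names are hypothetical, and rounding to the nearest integer is an assumption; the slides do not say whether the original code rounds or truncates.

```python
def to_float_space(total, parts):
    """Step 1: turn 'parts per 255' bytes back into fractional counts."""
    return [p / 255.0 * total for p in parts]


def to_integer_space(total, counts):
    """Step 3: quantize fractional counts to one byte each."""
    if total == 0.0:
        return [0] * len(counts)
    # Rounding to nearest is an assumption; clamp to the byte range.
    return [min(255, max(0, round(c / total * 255.0))) for c in counts]


def add_read(total, parts, read_total, read_probs):
    """Add one read's per-nucleotide contribution to a genome position."""
    counts = to_float_space(total, parts)                 # Step 1
    counts = [c + r for c, r in zip(counts, read_probs)]  # Step 2
    new_total = total + read_total
    return new_total, to_integer_space(new_total, counts)  # Step 3


# Values from the slide, in the order A, C, G, T, N.
new_total, new_parts = add_read(12.0, [3, 231, 7, 12, 3],
                                1.0, [0.00, 0.68, 0.31, 0.01, 0.00])
```

With these inputs the sketch lands within one unit of the slide's final row for every nucleotide; the small differences come from the rounding choice and from the slide's intermediate values being rounded for display. Each update pays for two conversions, which is the "biggest hit" the slide refers to.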

Centroid Discretization

Many states not used:
○ [255, 255, 255, 255, 255]
○ [0, 0, 0, 0, 0]
Many states not biologically relevant
○ SNP transition (common) vs. transversion (not likely)
MSA uses this compression to perform fast one-to-many alignment

Centroid Discretization (cont)

Benefits:
○ Doesn’t waste impossible or infrequently used space
○ Much smaller memory footprint
Drawbacks:
○ Slight overhead in converting between centroid and floating point spaces
○ Rounding error (how significant?)
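To make the trade-off concrete, here is a minimal sketch of centroid quantization: each position stores only the index of the nearest entry in a small codebook of nucleotide proportions. The codebook below is entirely hypothetical (the slides do not list the centroids used), and nearest-neighbor lookup by squared Euclidean distance is an assumption.

```python
# Hypothetical codebook over (A, C, G, T) proportions: the four homozygous
# states plus the two transition heterozygotes. A real codebook would
# presumably be larger.
CENTROIDS = [
    (1.0, 0.0, 0.0, 0.0),
    (0.0, 1.0, 0.0, 0.0),
    (0.0, 0.0, 1.0, 0.0),
    (0.0, 0.0, 0.0, 1.0),
    (0.5, 0.0, 0.5, 0.0),  # A/G transition
    (0.0, 0.5, 0.0, 0.5),  # C/T transition
]


def encode(proportions):
    """Return the index of the nearest centroid (squared Euclidean distance)."""
    def dist(centroid):
        return sum((p - c) ** 2 for p, c in zip(proportions, centroid))
    return min(range(len(CENTROIDS)), key=lambda k: dist(CENTROIDS[k]))


def decode(index):
    """Return the proportions that a stored index stands for."""
    return CENTROIDS[index]


near_transition = encode((0.52, 0.01, 0.46, 0.01))  # close to the A/G centroid
transversion = encode((0.6, 0.4, 0.0, 0.0))         # A/C has no centroid here
```

In this toy codebook the A/C mixture collapses onto the pure-A centroid, which illustrates the rounding error listed among the drawbacks: evidence for any state without a nearby centroid is lost at encoding time.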

Speed Comparison

Optimization Stats (chrX)

Optimization  Memory   Mem %  Wallclock  TP    FP
Normal        4.76 GB  100%   04:25:55   1309  127
CharDisc      2.58 GB  54.2%  04:36:58   677   0
CentDisc      2.01 GB  42.2%  04:27:29   166   9058
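The Mem % column, and a precision figure that is derived here rather than reported on the slide, can be recomputed directly from the table. The sketch assumes TP and FP are counts of true-positive and false-positive SNP calls; the variable names are made up for the example.

```python
# (memory in GB, true positives, false positives), copied from the table.
runs = {
    "Normal": (4.76, 1309, 127),
    "CharDisc": (2.58, 677, 0),
    "CentDisc": (2.01, 166, 9058),
}

baseline_mem = runs["Normal"][0]
summary = {}
for name, (mem_gb, tp, fp) in runs.items():
    mem_pct = 100.0 * mem_gb / baseline_mem  # memory relative to Normal
    precision = tp / (tp + fp)               # fraction of calls that are true
    summary[name] = (mem_pct, precision)
    print(f"{name:8s} mem={mem_pct:5.1f}%  precision={precision:.3f}")
```

Read this way, the CharDisc run makes no false calls but recovers about half as many true SNPs as Normal, while the CentDisc run's calls are almost all false positives.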

Conclusion

For high error rates, the HMM approach is ideal, but requires more memory
Distributing the genome across processors doesn’t scale linearly
Discretization methods provide good memory reductions (down to 42% of the unoptimized footprint)
Centroid discretization performs poorly
Integer discretization can be used when available memory is low

Questions
