
SeqMapReduce: software and web service for accelerating sequence mapping

Yanen Li
Department of Computer Science, University of Illinois at Urbana-Champaign
Email: yanenli2@illinois.edu

10/05/2009, CAMDA 2009, Chicago

Challenge of NGS Alignment

• Sequences: short (25 ~ 76 bp)
• Data set size: large and still increasing
• BLAST?

Transaction / long query: BLAST
Batch / short query: NGS aligner

We need an INDEX!

The NGS Aligner War

Where are you?

NGS Aligner Classification

• Standalone algorithms
  - Hash reads: Eland, RMAP, MAQ, SHRiMP, …
    Pros: less RAM, less overhead. Cons: wasted full-genome scans
  - Hash genome: SOAP, PASS, Mosaik, BFAST, …
    Pros: fast, scales up well. Cons: large RAM footprint, heavy overhead
  - Index genome (Burrows-Wheeler): Bowtie, BWA

NGS Aligner Classification

Parallel algorithm options and what to consider:
• Multi-threading: hard to scale up to many cores
• Cluster computing: load balancing, fault tolerance
• Cloud computing: restricted programming interface

Programming Model of Cloud Computing

• MapReduce: the developer supplies two functions
  - map(k, v) → list of (k', v')
  - reduce(k', list of v') → output
  - All v' with the same k' are reduced together
• A simple framework that usually scales up well
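The two functions above can be sketched as a toy, single-machine model of MapReduce (the function names and the word-count example are illustrative only, not part of SeqMapReduce):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-machine MapReduce: map each record to (k', v') pairs,
    group all v' by k' (the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):   # map: (k, v) -> list of (k', v')
            groups[k2].append(v2)           # shuffle: group all v' by k'
    return {k2: reduce_fn(k2, vs) for k2, vs in groups.items()}

# Illustrative word count: map emits (word, 1); reduce sums the counts.
def map_fn(doc_id, text):
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    return sum(counts)

result = run_mapreduce([(1, "to be or not to be")], map_fn, reduce_fn)
print(result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

A real Hadoop job distributes the map and shuffle steps across nodes, but the programmer-visible contract is exactly these two functions.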

Why Cloud Computing Attractive?

• Fits data-intensive computing (DIC)
• NGS alignment is DIC in nature
• Hadoop: an open-source cloud-computing system
  - Built-in load balancing and fault tolerance
  - Easy to program

Cloud Based NGS Aligner

• Hash reads: SeqMapReduce (hashes all reads in RAM on every node)
• Hash both: CloudBurst (hashes reads and the genome, but not in RAM)
• Hash/index genome will be the next step

The SeqMapReduce Framework

Inside SeqMapReduce

• Pre-processing: formatting the genome

Format once, use every time

Bases at the end are duplicated
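One common way to format a genome for fast bit-wise comparison is to pack each base into a 2-bit code. A minimal sketch follows; the encoding A=00, C=01, G=10, T=11 is an assumption for illustration, since the slides do not specify SeqMapReduce's actual on-disk format:

```python
# Hypothetical 2-bit encoding; the actual SeqMapReduce format is not specified.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq):
    """Pack a DNA string into one integer, 2 bits per base
    (first base lands in the highest-order bit pair)."""
    word = 0
    for base in seq:
        word = (word << 2) | CODE[base]
    return word

print(format(pack("ACGT"), "08b"))  # 00011011
```

Packing once during pre-processing is what makes the later XOR-based mismatch counting possible.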

Inside SeqMapReduce

• Map phase: seed & filter
  - Divide each read into K parts. With at most M mismatches, at least K - M parts must match exactly (pigeonhole principle).
  - e.g. K = 4, M = 2: at least 4 - 2 = 2 parts match exactly; C(4, 2) = 6 combinations, so only 6 hash tables are needed.
  - Genome sequences are scanned for potential hits, which then go to mismatch counting.
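The pigeonhole seeding above can be sketched as follows. This is a simplified illustration: the helper names are mine, and a real implementation hashes fixed-length seed keys rather than Python strings:

```python
from itertools import combinations

def seed_part_combinations(K, M):
    """With at most M mismatches, at least K - M of the K parts match
    exactly, so one hash table per choice of K - M parts suffices."""
    return list(combinations(range(K), K - M))

def seed_key(read, parts_idx, K):
    """Concatenate the chosen parts of a read split into K equal pieces,
    forming the key looked up in that combination's hash table."""
    n = len(read) // K
    return "".join(read[i * n:(i + 1) * n] for i in parts_idx)

combos = seed_part_combinations(4, 2)
print(len(combos))  # 6 hash tables for K=4, M=2
print(seed_key("AACCGGTT", combos[0], 4))  # parts 0 and 1 -> "AACC"
```

Any genome window that shares one of these part-combinations with a read is a candidate hit and proceeds to exact mismatch counting.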

Inside SeqMapReduce

• Reduce phase: aggregating intermediate results
• Post-processing: duplicate detection, mismatch counting, final output report

Inside SeqMapReduce
• Mismatch counting
  - Naive way: simple per-base comparison, O(N)
  - Faster: bit operations, using bit-wise XOR (exclusive or)

XOR truth table for 2-bit base codes:

  XOR | 00  01  10  11
   00 | 00  01  10  11
   01 | 01  00  11  10
   10 | 10  11  00  01
   11 | 11  10  01  00

Mismatches counting

• R = packed read, G = packed genome window
• W = R XOR G
• Define two constant masks: W1 = 101010… and W2 = 010101…
• X = W & W1 (keeps 10, clears 01, maps 11 => 10)
• Y = (W & W2) << 1 (keeps 01 shifted into the high bit, clears 10, maps 11 => 01 then 10)
• N = POPCNT(X | Y)

Worked example, where W contains each of the pairs 00 01 10 11:

  W           00 01 10 11
  W1          10 10 10 10
  X = W & W1  00 00 10 10
  W2          01 01 01 01
  Y = W & W2  00 01 00 01
  Y << 1      00 10 00 10
  X | Y       00 10 10 10   (POPCNT = 3 mismatches)
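The bit trick can be checked directly in Python, assuming the 2-bit-per-base packing; `bin(...).count("1")` stands in for the hardware POPCNT instruction:

```python
def count_mismatches(R, G, n_bases):
    """Count differing 2-bit base codes between packed words R and G,
    using the XOR + mask + popcount trick."""
    W = R ^ G                         # W = R XOR G
    W1 = int("10" * n_bases, 2)       # mask 1010...10
    W2 = int("01" * n_bases, 2)       # mask 0101...01
    X = W & W1                        # keeps 10, clears 01, maps 11 -> 10
    Y = (W & W2) << 1                 # keeps 01 (shifted), clears 10, 11 -> 10
    return bin(X | Y).count("1")      # N = POPCNT(X | Y)

# Worked example from the slide: W pairs are 00 01 10 11 -> 3 mismatches.
print(count_mismatches(0b00011011, 0b00000000, 4))  # 3
```

Each differing base contributes exactly one set bit to X | Y, so a single popcount replaces the O(N) per-base loop.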

Web Service of SeqMapReduce

Web Service of SeqMapReduce

• Input format: .zip of FASTA-format reads
• Reads can be uploaded through the web site
• Supports 13 model organisms
• Supports reads longer than 32 bp
• Up to 5 mismatches
• No indels in the current version (update coming soon)
• Output in ELAND format
• Free of charge for academics
• Target users: small labs that want quick results but cannot afford expensive hardware and software

Results on CAMDA 2009 datasets

• Pol II ChIP-seq FC201WVA_20080307_s_5 (4.5 million)

• IFNg stimulated STAT1 ChIP-seq FC302MA_20080507_s_1 (6.2 million)

• Illinois Cloud Computing Testbed (CCT). Each node: 64 bit 2.6 GHz CPUs, 16 GB RAM, and 2 TB storage.

• 2 mismatches are allowed.
• Accuracy: 95% of results are the same as MAQ's.

Speed Up

[Figure: run time vs. number of cores (1, 2, 4, 8, 16, 32), with and without overhead, for the Pol II data set (average overhead time: 67.22 s) and the STAT1 data set (average overhead time: 86.09 s).]

Speed-up is quasi-linear in the number of cores.

Scale Up

  Data set   Size          Size ratio   Run time      Run-time ratio
  Pol II     4.5 million   1.00         354 seconds   1.00
  STAT1      6.2 million   1.38         364 seconds   1.03

RAM requirement: ~50 MB per million reads, so the system can scale to tens of millions of reads with several GB of RAM.

Comparison to CloudBurst

Why is CloudBurst slow?
• It hashes both the reads and the genome, using the Hadoop system hash function.
• No filtering in the Map phase: heavy I/O into the Reduce phase.

Results on Amazon EC2

• Speed-up similar to that on the UIUC Hadoop cluster, but slower overall
• Large standard instances were chosen
• Total cost: $99.01

Future Plans

• Apply to bisulfite reads for genome-wide methylation analysis

• Web-based visualization of short-read alignments

Acknowledgements

• UIUC Cloud Test Bed • Michael Schatz • CAMDA Organizers

This work is supported by NSF DBI 08-45823 (SZ)

Thank you!
