O'Connor, BOSC 2010

SeqWare Query Engine: Storing & Searching Sequence Data in the Cloud

Brian O'Connor
UNC Lineberger Comprehensive Cancer Center

BOSC, July 9th, 2010

SeqWare Query Engine

● Want to ask simple questions:
  ● “What SNVs are in 5'UTR of phosphatases?”
  ● “What frameshift indels affect PTEN?”
  ● “What genes include homozygous, non-synonymous SNVs?”
● SeqWare Query Engine created to query data
  ● RESTful Webservice
  ● Scalable/Queryable Backend

Variant Annotation with SeqWare

[Diagram. Labels: Whole Genome/Exome Alignment (BAM); SeqWare Pipeline (pileup, variant calling; variant, coverage & consequence; dbSNP); SeqWare Query Engine Backend (backend interface, HBase or Berkeley DB stores); SeqWare Query Engine Webservice (RESTlet); outputs WIG and BED]

Webservice Interfaces

track name="SeqWare BED Mon Oct 26 22:16:18 PDT 2009"
chr10 89675294 89675295 G->T(24:23:95.8%[F:18:75.0%|R:5:20.8%])
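The BED line above can be parsed along these lines. This is a minimal sketch, not SeqWare code. It assumes, from this single example only, that the name field reads `REF->ALT(depth:alt_reads:alt_pct%[F:fwd_reads:fwd_pct%|R:rev_reads:rev_pct%])`; the numbers are consistent with that reading (23 of 24 reads is 95.8%), but the field meanings and the names `BedVariant` and `parse_bed_line` are hypothetical.

```python
import re
from dataclasses import dataclass

# Assumed layout of the BED name field, inferred from one example line:
#   REF->ALT(depth:alt_reads:alt_pct%[F:fwd_reads:fwd_pct%|R:rev_reads:rev_pct%])
NAME_PATTERN = re.compile(
    r"(?P<ref>[^-]+)->(?P<alt>[^(]+)"
    r"\((?P<depth>\d+):(?P<alt_reads>\d+):(?P<alt_pct>[\d.]+)%"
    r"\[F:(?P<fwd_reads>\d+):(?P<fwd_pct>[\d.]+)%"
    r"\|R:(?P<rev_reads>\d+):(?P<rev_pct>[\d.]+)%\]\)"
)


@dataclass
class BedVariant:
    chrom: str
    start: int
    end: int
    ref: str
    alt: str
    depth: int
    alt_reads: int
    fwd_reads: int
    rev_reads: int


def parse_bed_line(line: str) -> BedVariant:
    """Parse one data line (not the 'track' header) of the BED output."""
    chrom, start, end, name = line.split(None, 3)
    match = NAME_PATTERN.fullmatch(name.strip())
    if match is None:
        raise ValueError(f"unrecognized name field: {name!r}")
    return BedVariant(
        chrom=chrom,
        start=int(start),
        end=int(end),
        ref=match["ref"],
        alt=match["alt"],
        depth=int(match["depth"]),
        alt_reads=int(match["alt_reads"]),
        fwd_reads=int(match["fwd_reads"]),
        rev_reads=int(match["rev_reads"]),
    )


if __name__ == "__main__":
    example = "chr10 89675294 89675295 G->T(24:23:95.8%[F:18:75.0%|R:5:20.8%])"
    print(parse_bed_line(example))
```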

SeqWare Query Engine includes a RESTful web service that returns XML describing variant DBs, provides a web form for querying, and returns BED/WIG data via a well-defined URL

http://server.ucla.edu/seqware/queryengine/realtime/variants/mismatches/1?format=bed&filter.tag=PTEN

RESTful XML Client API or HTML Forms
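A query URL of the form shown above can be assembled programmatically. This is a minimal sketch: the base path and the `format` and `filter.tag` parameters are taken from the example URL, while the helper name `variant_query_url` is hypothetical and no other query parameters are assumed to exist.

```python
from typing import Optional
from urllib.parse import urlencode

# Base path copied from the example URL on the slide.
BASE = "http://server.ucla.edu/seqware/queryengine/realtime/variants/mismatches/1"


def variant_query_url(base: str, fmt: str = "bed", tag: Optional[str] = None) -> str:
    """Build a query URL using only the two parameters seen in the example."""
    params = {"format": fmt}
    if tag is not None:
        params["filter.tag"] = tag
    return f"{base}?{urlencode(params)}"


if __name__ == "__main__":
    # Variants tagged PTEN, returned as BED.
    print(variant_query_url(BASE, fmt="bed", tag="PTEN"))
```

Because the result is a plain URL, the same string can be pasted into a browser or handed to any tool that accepts a BED URL.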

Loading in Genome Browsers

SeqWare Query Engine URLs can be directly loaded into IGV & UCSC genome browsers

Requirements for QueryEngine Backend

The backend must:
– Represent many types of data
– Support a rich level of annotation
– Support very large variant databases (~3 billion rows x thousands of columns)
– Be distributed across a cluster
– Support processing, annotating, querying & comparing samples (variants, coverage, annotations)
– Support a crazy growth of data

Increase in Sequencer Output
Nelson Lab - UCLA

[Figure: Illumina Sequencer Output, Sequence File Sizes Per Lane. X axis: Date (08/10/06 to 12/27/10). Y axis: File Size (Bytes), log scale, 10,000,000 to 100,000,000,000.]

Suggests Sequencer Output Increases by 5-10x Every 2 Years!

Far outpacing hard drive, CPU, and bandwidth growth

HBase to the Rescue?

● Billions of rows x millions of columns!
● Focus on random access (vs. HDFS)
● Table is column oriented, sparse matrix
● Versioning (timestamps) built in
● Flexible storage of different data types
● Splits DB across many nodes transparently
● Locality of data, I can run map/reduce jobs that process the table rows present on a given node
● 22M variants processed <1 minute on 5 node cluster
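The data model in the bullets above (a sparse, column-oriented table with timestamped cell versions) can be illustrated with a small in-memory stand-in. This is a sketch of the concept only: it does not use the HBase client API, and the class name `SparseTable` and its methods are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class SparseTable:
    """In-memory illustration of a sparse table with versioned cells."""

    def __init__(self) -> None:
        # row key -> "family:label" column -> list of (timestamp, value)
        self._rows: Dict[str, Dict[str, List[Tuple[int, bytes]]]] = defaultdict(
            lambda: defaultdict(list)
        )

    def put(self, row: str, column: str, value: bytes, timestamp: int) -> None:
        """Store a new version of a cell; older versions are kept."""
        cells = self._rows[row][column]
        cells.append((timestamp, value))
        cells.sort(key=lambda cell: cell[0])

    def get(self, row: str, column: str, timestamp: Optional[int] = None) -> Optional[bytes]:
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        cells = self._rows.get(row, {}).get(column, [])
        for ts, value in reversed(cells):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def columns(self, row: str) -> List[str]:
        """Only columns that were actually written exist for a row (sparseness)."""
        return sorted(self._rows.get(row, {}))


if __name__ == "__main__":
    table = SparseTable()
    table.put("chr15:00000123454", "variant:genome4", b"first", timestamp=1)
    table.put("chr15:00000123454", "variant:genome4", b"second", timestamp=2)
    print(table.get("chr15:00000123454", "variant:genome4"))
    print(table.columns("chr15:00000123454"))
```

Rows only pay for the columns they contain, which is why adding another genome as a new column does not require a schema change in this model.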

Underlying HBase Tables

hg18Table (byte[] cells hold a Variant object as a byte array)

key                 variant:genome4   variant:genome7   coverage:genome7
chr15:00000123454   byte[]            byte[]            byte[]

Column names have the form family:label.

Database on filesystem (HDFS)

key                 timestamp   column:variant
chr15:00000123454   t1          genome7 = byte[]

Genome1102NTagIndexTable

key                                                            rowId:
is_dbSNP|hg18.chr15.00000123454.variant.genome4.SNV.A.G.v123   byte[]

Queries look up by tag, then filter the variant results.
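The key layout above suggests how a tag lookup could work: row keys are zero-padded so that they sort by position, and index keys start with the tag so that a prefix scan finds every row carrying it. The sketch below emulates this with a sorted Python list. The field order in `tag_index_key` is read off the single example key, the meaning of the trailing `v123` is not stated in the slides, and all function and parameter names are hypothetical.

```python
from bisect import bisect_left
from typing import List


def row_key(chrom: str, position: int, width: int = 11) -> str:
    """Zero-padded key, e.g. chr15:00000123454, so string order matches position order."""
    return f"{chrom}:{position:0{width}d}"


def tag_index_key(tag: str, table: str, chrom: str, position: int, family: str,
                  label: str, var_type: str, ref: str, alt: str, suffix: str) -> str:
    """Index key that starts with the tag; field order copied from the example key."""
    return (f"{tag}|{table}.{chrom}.{position:011d}.{family}.{label}."
            f"{var_type}.{ref}.{alt}.{suffix}")


def prefix_scan(sorted_keys: List[str], prefix: str) -> List[str]:
    """Return all keys starting with `prefix`; assumes `sorted_keys` is sorted."""
    i = bisect_left(sorted_keys, prefix)
    matches = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        matches.append(sorted_keys[i])
        i += 1
    return matches


if __name__ == "__main__":
    index = sorted([
        tag_index_key("is_dbSNP", "hg18", "chr15", 123454, "variant",
                      "genome4", "SNV", "A", "G", "v123"),
        tag_index_key("PTEN", "hg18", "chr10", 89675294, "variant",
                      "genome4", "SNV", "G", "T", "v1"),
    ])
    # Step 1: look up by tag. Step 2 (not shown) would fetch and filter the variant rows.
    print(prefix_scan(index, "is_dbSNP|"))
```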

HBase API & Map/Reduce Querying

● HBase API
  ● Powers the current Backend & Webservice
  ● Provides a familiar API, scanners, iterators, etc
  ● Backend written using this, retrieve variants by tags
  ● Distributed database but single thread using API
● Prototype somatic mutations by Map/Reduce
  ● Every row is examined, variants in tumor not in normal are retrieved
  ● Map/Reduce jobs run on node with local data
  ● Highly parallel & faster than API with single thread
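The somatic prototype described above can be sketched as a map step and a reduce step over table rows. This is a plain-Python illustration, not a Hadoop job and not the author's implementation: it treats "somatic" as simply "a variant cell exists for the tumor sample and none exists for the normal sample", and the sample labels and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, bytes]  # "family:label" column -> serialized value


def map_somatic(row_key: str, columns: Row, tumor: str, normal: str) -> Iterator[Tuple[str, str]]:
    """Map step: examine one row, emit (chromosome, row key) for tumor-only variants."""
    if f"variant:{tumor}" in columns and f"variant:{normal}" not in columns:
        chrom = row_key.split(":", 1)[0]
        yield chrom, row_key


def reduce_by_chrom(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Reduce step: group the emitted row keys by chromosome."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for chrom, key in pairs:
        grouped[chrom].append(key)
    return {chrom: sorted(keys) for chrom, keys in grouped.items()}


def run_job(table: Dict[str, Row], tumor: str, normal: str) -> Dict[str, List[str]]:
    """Run map over every row, then reduce; a cluster would run the maps in parallel."""
    pairs: List[Tuple[str, str]] = []
    for row_key, columns in table.items():
        pairs.extend(map_somatic(row_key, columns, tumor, normal))
    return reduce_by_chrom(pairs)


if __name__ == "__main__":
    table = {
        "chr10:00089675294": {"variant:tumor1": b"..."},
        "chr15:00000123454": {"variant:tumor1": b"...", "variant:normal1": b"..."},
    }
    print(run_job(table, tumor="tumor1", normal="normal1"))
```

Each map call only needs its own row, which is what lets the real jobs run on the node that already holds the data.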

SeqWare Query Engine on HBase

[Diagram. Backend: HBase on HDFS variant & coverage database system (NameNode, DataNodes). ETL Map and Reduce jobs extract, transform, &/or load in parallel. Analysis/Web Nodes: querying & loading nodes process queries via the API. RESTful Web Service: combines variant/coverage data with metadata from the MetaDB and returns BED/WIG files and XML metadata to clients. Webservice access paths: MapReduce, HBase API.]

Status of HBase Backend

● Both BerkeleyDB & HBase, Relational soon
● Multiple genomes stored in the same table, very Map/Reduce compatible
● Basic secondary indexing for “tags”
● API used for queries via Webservice
● Prototype Map/Reduce example for “somatic” mutation detection in paired normal/cancer samples
● Currently loading 1102 normal/tumor (GBM)

Backend Performance Comparison

[Figure: Pileup Load Time 1102N, HBase vs. Berkeley DB. Series: load time bdb, load time hbase. X axis: time (s), 0 to 18,000. Y axis: variants, 0 to 8,000,000.]

Backend Performance Comparison

[Figure: BED Export Time 1102N, HBase API vs. M/R vs. BerkeleyDB. Series: dump time bdb, dump time hbase, dump time m/r. X axis: time, 0 to 7,000. Y axis: variants, 0 to 8,000,000.]

HBase/Hadoop Have Potential!

● Era of Big Data for Biology is here!
● CPU bound problems no doubt, but as short reads become long reads and prices per GBase drop, the problems seem to shift to handling/mining data
● Tools designed for Peta-scale datasets are key

Next Steps

● Model other datatypes: copy number, RNAseq gene/exon/splice junction counts, isoforms etc

● Focus on porting analysis/querying to Map/Reduce

● Indexing beyond “tags” with Katta (distributed Lucene)

● Push scalability, what are the limits of an 8 node HBase/Hadoop cluster?

● Look at Cascading, Pig, Hive, etc as advanced workflow and data mining tools

● Standards for Webservice dialect (DAS?)

● Exposing Query Engine through GALAXY

Acknowledgments

UCLA: Jordan Mendler, Michael Clark, Hane Lee, Bret Harry, Stanley Nelson

UNC: Sara Grimm, Matt Soloway, Jianying Li, Feri Zsuppan, Neil Hayes, Chuck Perou, Derek Chiang

Resources

● HBase & Hadoop: http://hadoop.apache.org
● When to use HBase: http://blog.rapleaf.com/dev/?p=26
● NOSQL presentations: http://blog.oskarsson.nu/2009/06/nosql-debrief.html
● Other DBs: CouchDB, Hypertable, Cassandra, Project Voldemort, and more...
● Data mining tools: Pig and Hive
● SeqWare: http://seqware.sourceforge.net
● [email protected]

Extra Slides

Overview

● SeqWare Query Engine background
● New tools for combating the data deluge
● HBase/Hadoop in SeqWare Query Engine
  ● HBase for backend
  ● Map/Reduce & HBase API for webservice
● Better performance and scalability?
● Next steps

SeqWare Query Engine: BerkeleyDB

[Diagram. BerkeleyDB variant & coverage databases (one genome database each) on a Lustre filesystem. Web/Analysis Nodes process queries. RESTful Web Service (backend, webservice): combines variant/coverage data with metadata from the MetaDB and returns BED/WIG files and XML metadata to clients.]

More

● Details on API vs. M/R
● Details on XML RESTful API & web app including loading in UCSC browser
● Details on generic store object (BerkeleyDB, HBase, and Relational at Renci)
● Byte serialization from BerkeleyDB, custom secondary key creation

Pressures of Sequencing

● A lot of data (50GB SRF file, 150GB alignment files, 60GB variants for a 20x human genome)

● PostgreSQL (2xquad core, 64GB RAM) died with the Celsius schema (microarray database) after loading ~1 billion rows

● Needs to be processed, annotated, and queryable & comparable (variants, coverage, annotations)

● ~3 billion rows x thousands of columns

● COMBINE WITH PREVIOUS SLIDE

Thoughts on BerkeleyDB

● BerkeleyDB let me:
  ● Create a database per genome, independent from a single database daemon
  ● Provision database to cluster
  ● Adapt to key-value database semantics
● Limitations:
  ● Creation on single node only
  ● Not inherently distributed
  ● Performance issues with big DBs, high I/O wait
● Google to the rescue?

HBase Backend

● How the table(s) are actually structured
  ● Variants
  ● Coverage
  ● Etc
● How I do indexing currently (similar to indexing feature extension)
  ● Multiple secondary indexes

Frontend

● RESTlet API
● What queries can you do?
  ● Examples
  ● URLs

● Potential for swapping out generic M/R for many of these queries (less reliance on indexes which will speed things up as DB grows)

Ideas for a distributed future

● Federated DBs/datastores/clusters for computation rather than one giant datacenter

● Distribute software not data

Potential Questions

● How big is the DB to store whole human genome?

● How long does it take to M/R 3 billion positions on 5 node cluster?

● How does my stuff compare to other bioinf software? GATK, Crossbow, etc

● How did I choose HBase instead of Pig, Hive, etc?

Current Prototyping Work

● Validate creation of U87 (genome resequencing at 20x) genome database
  ● SNVs
  ● Coverage
  ● Annotations
● Test fast querying of record subsets
● Test fast processing of whole DB using MapReduce
● Test stability, fault-tolerance, auto-balancing, and deployment issues along the way

What About Fast Queries?

● I'm fairly convinced I can create a distributed HBase database on a Hadoop cluster

● I have a prototype HBase database running on two nodes

● But HBase shines when bulk processing DB
● Big question is how to make individual lookups fast
● Possible solution is HBase+Katta for indexes (distributed Lucene)


[Diagram: BerkeleyDB architecture. BerkeleyDB variant & coverage databases (one genome database each) on a Lustre filesystem. Analysis/Web Nodes (8 CPU, 32GB RAM). RESTful Web Service (backend, webservice): combines variant/coverage data with annotations (hg18) from the Annotation Database and returns BED/WIG files and DAS XML to clients.]

How Do We Scale Up the QE?

● Sequencers are increasing output by a factor of 10 every two years!

● Hard drives: 4x every 2 years
● CPUs: 2x every 2 years
● Bandwidth: 2x every 2 years (really?!)
● So there's a huge disconnect, can't just throw more hardware at a single database server!
● Must look for better ways to scale
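The disconnect in the bullets above is a compounding effect, which a few lines of arithmetic make concrete. This sketch uses only the per-two-year factors quoted on the slide (10x, 4x, 2x, 2x) and assumes they hold steadily over the period; the six-year horizon is an arbitrary illustration, not a figure from the talk.

```python
import math

# Growth factors per two years, as quoted on the slide.
GROWTH_PER_TWO_YEARS = {
    "sequencer": 10.0,
    "hard_drive": 4.0,
    "cpu": 2.0,
    "bandwidth": 2.0,
}


def growth_after(years: float, factor_per_two_years: float) -> float:
    """Total growth after `years`, assuming steady compounding."""
    return factor_per_two_years ** (years / 2.0)


def gap_vs_sequencer(years: float) -> dict:
    """How many times faster sequencer output grew than each resource."""
    seq = growth_after(years, GROWTH_PER_TWO_YEARS["sequencer"])
    return {
        name: seq / growth_after(years, factor)
        for name, factor in GROWTH_PER_TWO_YEARS.items()
        if name != "sequencer"
    }


if __name__ == "__main__":
    years = 6  # arbitrary horizon chosen for illustration
    for name, gap in gap_vs_sequencer(years).items():
        print(f"after {years} years, sequencer output outgrows {name} by {gap:.1f}x")
    assert math.isclose(growth_after(years, 10.0), 1000.0)
```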

Google to the Rescue?

● Companies like Google, Amazon, Facebook, etc have had to deal with massive scalability issues over the last 10+ years

● Solutions include:
  ● Frameworks like MapReduce
  ● Distributed file systems like HDFS
  ● Distributed databases like HBase

● Focus here on HBase

What Do You Give Up?

● SQL queries
● Well defined schema, normalized data structure
● Relationships managed by DB
● Flexible and easy indexing of table columns
● Existing tools that query a SQL database must be re-written
● Certain ACID aspects
● Software maturity, most distributed NOSQL projects are very new

What Do You Gain?

● Scalability is the clear win, you can have many processes on many nodes hit the collection of database servers

● Ability to look at very large datasets and do complex computations across a cluster

● More flexibility in representing information now and in the future

● HBase includes data timestamps/versions

● Integration with Hadoop


[Diagram: HBase architecture. HBase variant & coverage database system on Hadoop HDFS (NameNode, DataNodes). Hadoop MapReduce: ETL Map and Reduce jobs extract, transform, & load in parallel. Analysis/Web Nodes. RESTful Web Service (backend, webservice): combines variant/coverage data with annotations (hg18) from the Annotation Database and returns BED/WIG files and XML metadata to clients.]

What an HBase DB Looks Like

A Record in my HBase

key                 variant:genome4   variant:genome7   coverage:genome7
chr15:00000123454   byte[]            byte[]            byte[]

A Record in my HBase

Variant object to byte array

Database on filesystem (HDFS)

key                 timestamp   column:variant
chr15:00000123454   t1          genome7 = byte[]

Column names have the form family:label.
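The "variant object to byte array" step can be illustrated with a round trip between an object and the bytes stored in a cell. This is a sketch of the idea only: JSON is swapped in for readability and is not the serialization the backend uses, and the `Variant` fields and function names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Variant:
    # Hypothetical fields chosen for illustration.
    chrom: str
    position: int
    ref: str
    alt: str
    var_type: str = "SNV"


def to_bytes(variant: Variant) -> bytes:
    """Serialize a variant to the byte array that would be stored in a table cell."""
    return json.dumps(asdict(variant), sort_keys=True).encode("utf-8")


def from_bytes(data: bytes) -> Variant:
    """Rebuild the variant object from a stored byte array."""
    return Variant(**json.loads(data.decode("utf-8")))


if __name__ == "__main__":
    original = Variant(chrom="chr15", position=123454, ref="A", alt="G")
    cell = to_bytes(original)
    assert from_bytes(cell) == original
    print(len(cell), "bytes stored under column variant:genome7")
```

Because the table only sees opaque bytes, the object layout can change without altering the table structure, at the cost of the database being unable to query inside the value.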

Scalability and BerkeleyDB

● BerkeleyDB let me:

● Create a database per genome, independent from a single database daemon

● Provision database to cluster for distributed analysis

● Adapt to key-value database semantics with nice API

● Limitations:

● Creation on single node only

● Want to query easily across genomes

● Databases are not distributed

● I saw performance issues, high I/O wait

Would 2,000 Genomes Kill SQL?

● Say each genome has 5M variants (not counting coverage!)

● 5M variant rows x 2,000 genomes = 10 billion rows

● Our DB server running PostgreSQL (2xquad core, 64GB RAM) died with the Celsius (Chado) schema after loading ~1 billion rows

● So maybe conservatively we would have issues with 150+ genomes

● That threshold is probably 1 year away with public datasets available via SRA, 1000 genomes, TCGA
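The row-count estimate above can be checked with a few lines of arithmetic. All inputs come from the slide (5M variants per genome, 2,000 genomes, failure at roughly 1 billion rows). Dividing the failure point by the rows per genome gives about 200 genomes, so the slide's "150+" is the more conservative figure.

```python
VARIANTS_PER_GENOME = 5_000_000   # from the slide, not counting coverage
GENOMES = 2_000                   # from the slide
ROWS_AT_FAILURE = 1_000_000_000   # approximate load at which PostgreSQL died

# Total variant rows for the full collection.
total_rows = VARIANTS_PER_GENOME * GENOMES

# Number of genomes that would reach the observed failure point.
genomes_at_failure = ROWS_AT_FAILURE // VARIANTS_PER_GENOME

if __name__ == "__main__":
    print(f"total rows for {GENOMES:,} genomes: {total_rows:,}")
    print(f"genomes needed to reach ~1 billion rows: {genomes_at_failure}")
```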

Related Projects

My Abstract

● backend/frontend
● Traverse and query with Map/Reduce
● Java web service with RESTlet
● Deployment on 8 node cluster

Background on Problem

● Why abandon PostgreSQL/MySQL/SQL?
  ● Experience with Celsius...
● What you give up
● What you gain

First Solution: BerkeleyDB

● Good:
  ● key/value data store
  ● Easy to use
  ● Great for testing
● Bad:
  ● Not performant for multiple genomes
  ● Manual distribution across cluster
  ● Annoying phobia of shared filesystems

Sequencers vs. Information Technology

● Sequencers are increasing output by a factor of 10 every two years!

● Hard drives: 4x every 2 years
● CPUs: 2x every 2 years
● Bandwidth: 2x every 2 years (really?!)
● So there's a huge disconnect, can't just throw more hardware at a single database server!
● Must look for better ways to scale

● What are we doing, what are the challenges. Big picture of the project (webservice, backend etc)

● How did people solve this problem before? How did I attempt to solve this problem? Where did it break down?

● “New” approach, looking to Google et al for scalability for big data problems

● What is HBase/Hadoop & what do they provide?

● How did I adapt HBase/Hadoop to my problem?

● Specifics of implementation: overall flow, tables, query engine search (API), example M/R task

● Is this performant, does this scale? Can I get billionxmillion? Fast arbitrary retrieval?

● Next steps: more data types, focus on M/R for analytical tasks, focus on Katta for rich querying, push scalability w/ 8 nodes (test with genomes), look at Cascading & other tools for datamining