SeqWare Query Engine: Storing & Searching Sequence Data in the Cloud
Brian O'Connor
UNC Lineberger Comprehensive Cancer Center
BOSC, July 9th, 2010
SeqWare Query Engine
● Want to ask simple questions:
  ● “What SNVs are in 5'UTR of phosphatases?”
  ● “What frameshift indels affect PTEN?”
  ● “What genes include homozygous, non-synonymous SNVs?”
● SeqWare Query Engine created to query data
  ● RESTful Webservice
  ● Scalable/Queryable Backend
Variant Annotation with SeqWare
[Diagram labels: Whole Genome/Exome; Alignment; BAM; Variant Calling; pileup; SeqWare Pipeline; Variant, Coverage, & Consequence; dbSNP; SeqWare Query Engine; SeqWare Query Engine Webservice (RESTlet); Backend Interface; SeqWare Query Engine Backend (HBase or Berkeley DB stores); outputs: WIG, BED]
Webservice Interfaces
SeqWare Query Engine includes a RESTful web service that returns XML describing variant DBs, provides a web form for querying, and returns BED/WIG data via a well-defined URL.
Example BED output:
track name="SeqWare BED Mon Oct 26 22:16:18 PDT 2009"
chr10 89675294 89675295 G->T(24:23:95.8%[F:18:75.0%|R:5:20.8%])
Example URL:
http://server.ucla.edu/seqware/queryengine/realtime/variants/mismatches/1?format=bed&filter.tag=PTEN
Interfaces: RESTful XML Client API or HTML Forms
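As a rough sketch of how a client might assemble such a query URL, the Python below uses only the standard library. The host, path, and the `format` and `filter.tag` parameters are copied from the example URL on this slide. The helper name `build_query_url` and the reading of the trailing path segment (`1`) as a database identifier are assumptions made for illustration, not documented SeqWare API.

```python
from urllib.parse import urlencode

# Base path copied from the example URL on the slide.
BASE = "http://server.ucla.edu/seqware/queryengine/realtime/variants/mismatches"


def build_query_url(db_id, fmt="bed", tag=None):
    """Build a query URL (hypothetical helper).

    db_id: assumed to be the identifier in the last path segment.
    fmt:   value for the `format` parameter seen in the example URL.
    tag:   optional value for the `filter.tag` parameter.
    """
    params = {"format": fmt}
    if tag is not None:
        params["filter.tag"] = tag
    return f"{BASE}/{db_id}?{urlencode(params)}"


url = build_query_url(1, fmt="bed", tag="PTEN")
print(url)
```

With these arguments the helper reproduces the example URL from the slide, which is the kind of address that could then be pasted into a genome browser.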
Loading in Genome Browsers
SeqWare Query Engine URLs can be directly loaded into the IGV & UCSC genome browsers
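For clients other than a genome browser, the BED line shown on the Webservice Interfaces slide can be parsed with a small regular expression. The sketch below is only an illustration. The slides do not define the name field, so the interpretation used here (total depth, reads supporting the variant, percentage, then forward and reverse strand counts) is an assumption, chosen because the numbers are consistent with it (23/24 = 95.8%, 18/24 = 75.0%, 5/24 = 20.8%). All field names are hypothetical.

```python
import re

# Assumed layout of the BED name field:
#   REF->ALT(depth:support:pct%[F:fwd:pct%|R:rev:pct%])
BED_NAME = re.compile(
    r"(?P<ref>[ACGT])->(?P<alt>[ACGT])"
    r"\((?P<depth>\d+):(?P<support>\d+):(?P<pct>[\d.]+)%"
    r"\[F:(?P<fwd>\d+):(?P<fwd_pct>[\d.]+)%"
    r"\|R:(?P<rev>\d+):(?P<rev_pct>[\d.]+)%\]\)"
)


def parse_bed_line(line):
    """Split a BED line into coordinates plus the parsed name field."""
    chrom, start, end, name = line.split(None, 3)
    match = BED_NAME.fullmatch(name.strip())
    if match is None:
        raise ValueError(f"unrecognized name field: {name!r}")
    fields = match.groupdict()
    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "ref": fields["ref"],
        "alt": fields["alt"],
        "depth": int(fields["depth"]),
        "support": int(fields["support"]),
        "fwd": int(fields["fwd"]),
        "rev": int(fields["rev"]),
    }


rec = parse_bed_line(
    "chr10 89675294 89675295 G->T(24:23:95.8%[F:18:75.0%|R:5:20.8%])"
)
print(rec)
```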
Requirements for Query Engine Backend
The backend must:
– Represent many types of data
– Support a rich level of annotation
– Support very large variant databases (~3 billion rows x thousands of columns)
– Be distributed across a cluster
– Support processing, annotating, querying & comparing samples (variants, coverage, annotations)
– Support a crazy growth of data
Increase in Sequencer Output
Nelson Lab - UCLA
[Figure: "Illumina Sequencer Output: Sequence File Sizes Per Lane". x-axis: Date (08/10/06 to 12/27/10); y-axis: File Size (Bytes), log scale, 10,000,000 to 100,000,000,000]
● Suggests sequencer output increases by 5-10x every 2 years!
● Far outpacing hard drive, CPU, and bandwidth growth
HBase to the Rescue?
● Billions of rows x millions of columns!
● Focus on random access (vs. HDFS)
● Table is column oriented, sparse matrix
● Versioning (timestamps) built in
● Flexible storage of different data types
● Splits DB across many nodes transparently
● Locality of data: I can run map/reduce jobs that process the table rows present on a given node
● 22M variants processed in <1 minute on a 5 node cluster
Underlying HBase Tables
hg18Table (column names are family:label; cells hold a Variant object byte array)
key                | variant:genome4 | variant:genome7 | coverage:genome7
chr15:00000123454  | byte[]          | byte[]          | byte[]
Database on filesystem (HDFS)
key                | timestamp | column:variant
chr15:00000123454  | t1        | genome7 = byte[]
Genome1102NTagIndexTable
key                                                           | rowId:
is_dbSNP|hg18.chr15.00000123454.variant.genome4.SNV.A.G.v123  | byte[]
● Queries look up by tag, then filter the variant results
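The key layout above can be illustrated with a short Python sketch. It only mirrors the two example keys on this slide and is not the SeqWare implementation. The function names, the 11-digit padding width (read off the example `00000123454`), and the meaning given to each dot-separated component of the tag index key are assumptions. The trailing `v123` component is not explained on the slide, so it is passed through as an opaque `suffix`.

```python
def row_key(chrom, pos, width=11):
    """Row key for the variant table: chromosome plus zero-padded position."""
    return f"{chrom}:{pos:0{width}d}"


def tag_index_key(tag, ref_genome, chrom, pos, family, genome,
                  vtype, ref, alt, suffix, width=11):
    """Tag index key: the tag, a '|' separator, then a dotted variant id."""
    return (f"{tag}|{ref_genome}.{chrom}.{pos:0{width}d}."
            f"{family}.{genome}.{vtype}.{ref}.{alt}.{suffix}")


key = row_key("chr15", 123454)
idx = tag_index_key("is_dbSNP", "hg18", "chr15", 123454,
                    "variant", "genome4", "SNV", "A", "G", "v123")
print(key)
print(idx)

# Zero padding makes byte-wise (lexicographic) key order agree with
# numeric position order, which matters because HBase sorts rows by key.
positions = [123454, 99, 2000000, 15]
padded = sorted(row_key("chr15", p) for p in positions)
unpadded = sorted(f"chr15:{p}" for p in positions)
print(padded)
print(unpadded)
```

Without padding, `chr15:15` sorts before `chr15:99` but `chr15:123454` sorts before both, so a range scan over neighbouring positions would not be contiguous.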
HBase API & Map/Reduce Querying
● HBase API
  ● Powers the current Backend & Webservice
  ● Provides a familiar API: scanners, iterators, etc.
  ● Backend written using this; retrieves variants by tags
  ● Distributed database, but single-threaded when using the API
● Prototype somatic mutation detection by Map/Reduce
  ● Every row is examined; variants in tumor but not in normal are retrieved
  ● Map/Reduce jobs run on the node with local data
  ● Highly parallel & faster than the API with a single thread
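The somatic prototype described above can be sketched as a map function over table rows. The Python below simulates the table with a plain dictionary rather than calling HBase or Hadoop, so it only shows the per-row logic. The genome labels `tumor1` and `normal1`, the function name, and the simplification that "variant in tumor, not in normal" means "the tumor column has a cell at this position and the normal column does not" are all assumptions for illustration.

```python
def somatic_map(row_key, columns, tumor, normal):
    """Map step: emit the tumor variant when the normal has none at this row.

    columns maps 'family:label' names (e.g. 'variant:genome7') to cell bytes.
    """
    tumor_cell = columns.get(f"variant:{tumor}")
    normal_cell = columns.get(f"variant:{normal}")
    if tumor_cell is not None and normal_cell is None:
        yield row_key, tumor_cell


# Toy stand-in for the variant table: row key -> sparse columns.
table = {
    "chr10:00089675294": {"variant:tumor1": b"G->T"},
    "chr15:00000123454": {"variant:tumor1": b"A->G",
                          "variant:normal1": b"A->G"},
    "chr15:00000200000": {"variant:normal1": b"C->T"},
}

# Every row is examined, as on the slide; on a cluster each node would
# run the map step over the rows it stores locally.
somatic = [
    hit
    for key in sorted(table)
    for hit in somatic_map(key, table[key], "tumor1", "normal1")
]
print(somatic)
```

Because each row is handled independently, the work splits cleanly across nodes, which is the property the slide credits for the speedup over a single-threaded API scan.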
SeqWare Query Engine on HBase
[Diagram labels: clients; RESTful Web Service; Backend; MetaDB (webservice combines Variant/Coverage data with metadata); outputs: BED/WIG files, XML metadata; Analysis/Web Nodes (x3) for querying & loading (nodes process queries via API); HBase on HDFS Variant & Coverage Database System (NameNode, DataNodes); ETL Map Jobs and ETL Reduce Job (ETL jobs extract, transform, &/or load in parallel); access paths: Webservice, MapReduce, HBase API]
Status of HBase Backend
● Both BerkeleyDB & HBase, Relational soon
● Multiple genomes stored in the same table, very Map/Reduce compatible
● Basic secondary indexing for “tags”
● API used for queries via Webservice
● Prototype Map/Reduce example for “somatic” mutation detection in paired normal/cancer samples
● Currently loading 1102 normal/tumor (GBM)
Backend Performance Comparison
[Figure: "Pileup Load Time 1102N: HBase vs. Berkeley DB". x-axis: time (s), 0 to 18,000; y-axis: variants, 0 to 8,000,000; series: load time bdb, load time hbase]
Backend Performance Comparison
[Figure: "BED Export Time 1102N: HBase API vs. M/R vs. BerkeleyDB". x-axis: time, 0 to 7,000; y-axis: variants, 0 to 8,000,000; series: dump time bdb, dump time hbase, dump time m/r]
HBase/Hadoop Have Potential!
● Era of Big Data for Biology is here!
● There are CPU-bound problems, no doubt, but as short reads become long reads and prices per GBase drop, the problems seem to shift to handling/mining data
● Tools designed for peta-scale datasets are key
Next Steps
● Model other datatypes: copy number, RNAseq gene/exon/splice junction counts, isoforms, etc.
● Focus on porting analysis/querying to Map/Reduce
● Indexing beyond “tags” with Katta (distributed Lucene)
● Push scalability: what are the limits of an 8 node HBase/Hadoop cluster?
● Look at Cascading, Pig, Hive, etc. as advanced workflow and data mining tools
● Standards for Webservice dialect (DAS?)
● Exposing Query Engine through Galaxy
Acknowledgments
UCLA: Jordan Mendler, Michael Clark, Hane Lee, Bret Harry, Stanley Nelson
UNC: Sara Grimm, Matt Soloway, Jianying Li, Feri Zsuppan, Neil Hayes, Chuck Perou, Derek Chiang
Resources
● HBase & Hadoop: http://hadoop.apache.org
● When to use HBase: http://blog.rapleaf.com/dev/?p=26
● NOSQL presentations: http://blog.oskarsson.nu/2009/06/nosql-debrief.html
● Other DBs: CouchDB, Hypertable, Cassandra, Project Voldemort, and more...
● Data mining tools: Pig and Hive
● SeqWare: http://seqware.sourceforge.net
● [email protected]
Extra Slides
Overview
● SeqWare Query Engine background
● New tools for combating the data deluge
● HBase/Hadoop in SeqWare Query Engine
  ● HBase for backend
  ● Map/Reduce & HBase API for webservice
● Better performance and scalability?
● Next steps
SeqWare Query Engine: BerkeleyDB
[Diagram labels: clients; RESTful Web Service (backend, webservice); MetaDB (webservice combines Variant/Coverage data with metadata); outputs: BED/WIG files, XML metadata; Analysis/Web Nodes (x3; Web/Analysis nodes process queries); BerkeleyDB Variant & Coverage Databases (Genome Database x3); Lustre filesystem]
More
● Details on API vs. M/R
● Details on XML RESTful API & web app, including loading in UCSC browser
● Details on generic store object (BerkeleyDB, HBase, and Relational at RENCI)
● Byte serialization from BerkeleyDB, custom secondary key creation
Pressures of Sequencing
● A lot of data (50GB SRF file, 150GB alignment files, 60GB variants for a 20x human genome)
● PostgreSQL (2x quad core, 64GB RAM) died with the Celsius schema (microarray database) after loading ~1 billion rows
● Needs to be processed, annotated, and queryable & comparable (variants, coverage, annotations)
● ~3 billion rows x thousands of columns
● COMBINE WITH PREVIOUS SLIDE
Thoughts on BerkeleyDB
● BerkeleyDB let me:
  ● Create a database per genome, independent from a single database daemon
  ● Provision database to cluster
  ● Adapt to key-value database semantics
● Limitations:
  ● Creation on single node only
  ● Not inherently distributed
  ● Performance issues with big DBs, high I/O wait
● Google to the rescue?
HBase Backend
● How the table(s) are actually structured
  ● Variants
  ● Coverage
  ● Etc.
● How I do indexing currently (similar to indexing feature extension)
  ● Multiple secondary indexes
Frontend
● RESTlet API
● What queries can you do?
  ● Examples
  ● URLs
● Potential for swapping out generic M/R for many of these queries (less reliance on indexes, which will speed things up as DB grows)
Ideas for a distributed future
● Federated DBs/datastores/clusters for computation rather than one giant datacenter
● Distribute software, not data
Potential Questions
● How big is the DB to store whole human genome?
● How long does it take to M/R 3 billion positions on 5 node cluster?
● How does my stuff compare to other bioinf software? GATK, Crossbow, etc
● How did I choose HBase instead of Pig, Hive, etc?
Current Prototyping Work
● Validate creation of U87 (genome resequencing at 20x) genome database
  ● SNVs
  ● Coverage
  ● Annotations
● Test fast querying of record subsets
● Test fast processing of whole DB using MapReduce
● Test stability, fault-tolerance, auto-balancing, and deployment issues along the way
What About Fast Queries?
● I'm fairly convinced I can create a distributed HBase database on a Hadoop cluster
● I have a prototype HBase database running on two nodes
● But HBase shines when bulk processing DB
● Big question is how to make individual lookups fast
● Possible solution is HBase+Katta for indexes (distributed Lucene)
SeqWare Query Engine
[Diagram labels: clients; RESTful Web Service (backend, webservice); Annotation Database (webservice combines Variant/Coverage data with annotations, hg18); outputs: BED/WIG files, DAS XML; Analysis/Web Nodes (x3; 8 CPU, 32GB RAM each; Web/Analysis nodes process queries); BerkeleyDB Variant & Coverage Databases (Genome Database x3); Lustre filesystem]
How Do We Scale Up the QE?
● Sequencers are increasing output by a factor of 10 every two years!
● Hard drives: 4x every 2 years
● CPUs: 2x every 2 years
● Bandwidth: 2x every 2 years (really?!)
● So there's a huge disconnect, can't just throw more hardware at a single database server!
● Must look for better ways to scale for
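To make the disconnect concrete, the growth rates quoted on this slide can be compounded over a few years. The Python below is a back-of-the-envelope sketch that assumes each rate holds steadily as smooth exponential growth, which the slide does not claim. The function names are hypothetical.

```python
# Growth factors per two-year period, as quoted on the slide.
RATES = {"sequencers": 10, "hard drives": 4, "CPUs": 2, "bandwidth": 2}


def growth(rate_per_2y, years):
    """Cumulative growth after `years`, assuming smooth exponential growth."""
    return rate_per_2y ** (years / 2)


def gap(years, other):
    """How many times faster sequencer output grew than `other` did."""
    return growth(RATES["sequencers"], years) / growth(RATES[other], years)


for years in (2, 4, 6):
    summary = ", ".join(
        f"{name}: {gap(years, name):.1f}x"
        for name in ("hard drives", "CPUs", "bandwidth")
    )
    print(f"after {years} years, sequencers lead by {summary}")
```

Under these assumptions the gap itself compounds: the lead over hard drives is 2.5x after two years, and the lead over CPUs reaches 125x after six.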
Google to the Rescue?
● Companies like Google, Amazon, Facebook, etc. have had to deal with massive scalability issues over the last 10+ years
● Solutions include:
  ● Frameworks like MapReduce
  ● Distributed file systems like HDFS
  ● Distributed databases like HBase
● Focus here on HBase
What Do You Give Up?
● SQL queries
● Well defined schema, normalized data structure
● Relationships managed by DB
● Flexible and easy indexing of table columns
● Existing tools that query a SQL database must be re-written
● Certain ACID aspects
● Software maturity, most distributed NOSQL projects are very new
What Do You Gain?
● Scalability is the clear win, you can have many processes on many nodes hit the collection of database servers
● Ability to look at very large datasets and do complex computations across a cluster
● More flexibility in representing information now and in the future
● HBase includes data timestamps/versions
● Integration with Hadoop
SeqWare Query Engine
[Diagram labels: clients; RESTful Web Service (backend, webservice); Annotation Database (webservice combines Variant/Coverage data with annotations, hg18); outputs: BED/WIG files, XML metadata; Analysis/Web Nodes (x3; Web/Analysis nodes process queries); HBase Variant & Coverage Database System on Hadoop HDFS (NameNode, DataNodes); Hadoop Map Reduce: ETL Map Jobs and ETL Reduce Job (MapReduce jobs extract, transform, & load in parallel)]
What an HBase DB Looks Like
A Record in my HBase
key                | variant:genome4 | variant:genome7 | coverage:genome7
chr15:00000123454  | byte[]          | byte[]          | byte[]
A Record in my HBase
Variant object to byte array; database on filesystem (HDFS); column names are family:label
key                | timestamp | column:variant
chr15:00000123454  | t1        | genome7 = byte[]
Scalability and BerkeleyDB
● BerkeleyDB let me:
  ● Create a database per genome, independent from a single database daemon
  ● Provision database to cluster for distributed analysis
  ● Adapt to key-value database semantics with nice API
● Limitations:
  ● Creation on single node only
  ● Want to query easily across genomes
  ● Databases are not distributed
  ● I saw performance issues, high I/O wait
Would 2,000 Genomes Kill SQL?
● Say each genome has 5M variants (not counting coverage!)
● 5M variant rows x 2,000 genomes = 10 billion rows
● Our DB server running PostgreSQL (2x quad core, 64GB RAM) died with the Celsius (Chado) schema after loading ~1 billion rows
● So maybe conservatively we would have issues with 150+ genomes
● That threshold is probably 1 year away with public datasets available via SRA, 1000 Genomes, TCGA
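The arithmetic behind these bullets can be checked in a few lines of Python. The constants are the round figures from the slide; treating the ~1 billion row failure as a hard ceiling for a one-row-per-variant table is an assumption made only for this estimate.

```python
VARIANTS_PER_GENOME = 5_000_000      # "5M variants" per genome, from the slide
GENOMES = 2_000                      # the hypothetical cohort size
ROW_CEILING = 1_000_000_000          # ~1 billion rows, where PostgreSQL died

# Total rows if every variant of every genome is one row.
total_rows = VARIANTS_PER_GENOME * GENOMES

# Number of genomes that fit before reaching the observed ceiling.
genomes_at_ceiling = ROW_CEILING // VARIANTS_PER_GENOME

print(f"total rows for {GENOMES} genomes: {total_rows:,}")
print(f"genomes at the ~1 billion row ceiling: {genomes_at_ceiling}")
print(f"cohort exceeds ceiling by {total_rows / ROW_CEILING:.0f}x")
```

The literal ceiling works out to 200 genomes, so the slide's "150+" figure leaves some margin below the point where the server actually failed.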
Related Projects
My Abstract
● backend/frontend
● Traverse and query with Map/Reduce
● Java web service with RESTlet
● Deployment on 8 node cluster
Background on Problem
● Why abandon PostgreSQL/MySQL/SQL?
  ● Experience with Celsius...
● What you give up
● What you gain
First Solution: BerkeleyDB
● Good:
  ● key/value data store
  ● Easy to use
  ● Great for testing
● Bad:
  ● Not performant for multiple genomes
  ● Manual distribution across cluster
  ● Annoying phobia of shared filesystems
Sequencers vs. Information Technology
● Sequencers are increasing output by a factor of 10 every two years!
● Hard drives: 4x every 2 years
● CPUs: 2x every 2 years
● Bandwidth: 2x every 2 years (really?!)
● So there's a huge disconnect, can't just throw more hardware at a single database server!
● Must look for better ways to scale
● What are we doing, what are the challenges. Big picture of the project (webservice, backend etc)
● How did people solve this problem before? How did I attempt to solve this problem? Where did it break down?
● “New” approach, looking to Google et al for scalability for big data problems
● What are HBase/Hadoop & what do they provide?
● How did I adapt HBase/Hadoop to my problem?
● Specifics of implementation: overall flow, tables, query engine search (API), example M/R task
● Is this performant, does this scale? Can I get billions x millions? Fast arbitrary retrieval?
● Next steps: more data types, focus on M/R for analytical tasks, focus on Katta for rich querying, push scalability w/ 8 nodes (test with genomes), look at Cascading & other tools for datamining