DESCRIPTION
NextBio relies on HBase to store the world’s largest collection of continuously curated genomic knowledge. The HBase cluster is leveraged to store billions of correlations as well as processed genomic information. In this talk, we will describe how we use HBase, why we migrated from a large MySQL deployment to HBase, and the challenges along the way.
NEXTBIO 2008
Leveraging HBase for the World's Largest Curated Genomic Data Collection
Satnam Alag, Ph.D. VP of [email protected]
© 2012 NextBio | All rights reserved | This information is proprietary and confidential.
Technology Generating Exponential Data
Genomic Big Data
Use Case 1: HBase to Store Variant Data
• Each genome has ~4 million variants
• Immutable – write once, never change, read many times
• Bloom filters are useful
• Batch import of data – HFile
• Data to be accessed collocated in region
• Separate HBase cluster from Hadoop
• All the smarts are in the keys for the various tables
In HBase:
1 Genome: 10 million rows
100 Genomes: 1 billion rows
100K Genomes: 1 trillion rows
100M Genomes: 1 quadrillion rows (1,000,000,000,000,000)
Fortunately, HBase cluster access can be partitioned by the application when required
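The row counts above all follow from the slide's figure of roughly 10 million rows per genome. A minimal sketch of that arithmetic, where the function name `total_rows` is purely illustrative:

```python
# Figure taken from the slide: one genome occupies about 10 million HBase rows.
ROWS_PER_GENOME = 10_000_000


def total_rows(genomes: int) -> int:
    """Estimate total HBase rows for a given number of stored genomes."""
    return genomes * ROWS_PER_GENOME


if __name__ == "__main__":
    for genomes in (1, 100, 100_000, 100_000_000):
        print(f"{genomes:>11,} genomes -> {total_rows(genomes):,} rows")
```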
Accessing Data with Pagination
Table 1: Key: Bioset Id + Display Order
Columns
Pagination Example: Page 5, Page Size = 100
Retrieve 100 rows from Display Order = 400-500
Number of rows = 1 per SNP, on the order of 4 million
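The pagination example can be sketched as a computation of scan bounds on the composite key. The slides do not show the actual key encoding, so the zero-padded layout below, and the names `row_key` and `page_bounds`, are assumptions for illustration only. The sketch relies on HBase's usual scan semantics: start row inclusive, stop row exclusive.

```python
def row_key(bioset_id: int, display_order: int) -> bytes:
    """Build a composite key (hypothetical encoding, not NextBio's).

    Fixed-width, zero-padded fields make the lexicographic byte order of
    keys match the numeric order of (bioset_id, display_order).
    """
    return f"{bioset_id:010d}:{display_order:010d}".encode("ascii")


def page_bounds(bioset_id: int, page: int, page_size: int) -> tuple[bytes, bytes]:
    """Return (start_row, stop_row) for a 1-based page number.

    Assumes display order starts at 0, so page 5 with page size 100
    covers display orders 400 up to, but not including, 500.
    """
    first = (page - 1) * page_size
    return row_key(bioset_id, first), row_key(bioset_id, first + page_size)


if __name__ == "__main__":
    start, stop = page_bounds(bioset_id=42, page=5, page_size=100)
    print(start, stop)
```

Because each page maps to one contiguous key range, a single scan serves the page without touching the other rows of the bioset.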
Accessing Data with Keys
Table 1: Key: Bioset Id + Display Order
Keys returned by search index
Filtering Data with Pagination
Table 1: Key: Bioset Id + Display Order
Example: Gene: ESR1, Class: Missense, Page Size = 100
Retrieve rows from Table 2
Retrieve rows by keys from Table 1
Number of rows: on the order of 0.5 million per dataset (# genes x classes)
Table 2: Id + GeneId + MutationClass
Column: Counts, Keys to Table
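The two-step lookup above (read the index row from Table 2, then fetch the referenced rows by key) can be sketched with in-memory dictionaries standing in for the HBase tables. All of the data, the column names `count` and `keys`, and the function `filtered_page` are made up for illustration; they are not taken from NextBio's schema.

```python
# Toy stand-in for Table 1: key = (bioset id, display order). Illustrative data.
table1 = {(1, order): {"snp": f"variant-{order}"} for order in range(1000)}

# Toy stand-in for Table 2: key = (id, gene id, mutation class).
# Each row stores a count plus the Table 1 keys of the matching variants.
table2 = {
    (1, "ESR1", "Missense"): {"count": 3, "keys": [(1, 7), (1, 42), (1, 900)]},
}


def filtered_page(bioset_id, gene, mutation_class, page, page_size):
    """Return one page of Table 1 rows matching a gene and mutation class."""
    index_row = table2.get((bioset_id, gene, mutation_class))
    if index_row is None:
        return []
    first = (page - 1) * page_size  # 1-based page number
    wanted = index_row["keys"][first:first + page_size]
    # Second step: fetch each row by key (a multi-get in a real cluster).
    return [table1[key] for key in wanted]


if __name__ == "__main__":
    print(filtered_page(1, "ESR1", "Missense", page=1, page_size=2))
```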
Powering the Genome Browser
Table 1: Key: Bioset Id + Display Order
Example: Chr: 6, Specified Range
Retrieve all rows
1 row per SNP, ~4 million per dataset
Table 2: Id + GeneId + MutationClass
Table 3: Id + ChromosomeId + Range + DisplayOrder
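One way a key shaped like Table 3's could serve a genome-browser query is sketched below. The slides do not say what the "Range" component contains; this sketch assumes it is a fixed-size bucket of genomic positions. The bucket size, the key encoding, and the name `range_prefixes` are all hypothetical.

```python
# Hypothetical bucket width in base pairs; not a value from the slides.
BUCKET_SIZE = 1_000_000


def range_prefixes(bioset_id: int, chromosome_id: int, start: int, end: int) -> list[bytes]:
    """List the key prefixes whose buckets overlap positions [start, end].

    Each prefix covers id + chromosome id + range bucket; the display order
    would follow the prefix in the full key. One prefix scan per bucket
    returns every variant row that falls inside the requested window.
    """
    first_bucket = start // BUCKET_SIZE
    last_bucket = end // BUCKET_SIZE
    return [
        f"{bioset_id:010d}:{chromosome_id:02d}:{bucket:05d}:".encode("ascii")
        for bucket in range(first_bucket, last_bucket + 1)
    ]


if __name__ == "__main__":
    for prefix in range_prefixes(42, 6, 1_500_000, 3_200_000):
        print(prefix)
```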
Use Case 2: Correlation Data
Use Case 2
• Each correlation score stored as a row
• HFile created for new score
• Over 20 billion correlations
[Figure: correlation matrix with biosets B1, B2, ..., Bn, Bn+1 along both rows and columns]
T1: scorebioset (base table) key: biosetid_1 [+] biosetid_2
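A key of the form biosetid_1 + biosetid_2 keeps all correlations for one bioset adjacent, so they can be read with a single range scan. A minimal sketch, assuming fixed-width zero-padded ids (the encoding and the names `score_key` and `scan_bounds_for` are illustrative, not taken from the slides):

```python
ID_WIDTH = 10  # hypothetical fixed width for each bioset id


def score_key(bioset_id_1: int, bioset_id_2: int) -> bytes:
    """Concatenate the two bioset ids into one row key."""
    return f"{bioset_id_1:0{ID_WIDTH}d}{bioset_id_2:0{ID_WIDTH}d}".encode("ascii")


def scan_bounds_for(bioset_id_1: int) -> tuple[bytes, bytes]:
    """Return (start_row, stop_row) covering every correlation of one bioset.

    Start is inclusive and stop is exclusive, matching HBase scan semantics.
    """
    start = f"{bioset_id_1:0{ID_WIDTH}d}".encode("ascii")
    stop = f"{bioset_id_1 + 1:0{ID_WIDTH}d}".encode("ascii")
    return start, stop


if __name__ == "__main__":
    print(score_key(7, 12))
    print(scan_bounds_for(7))
```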
Lessons Learnt
• HBase Works Well For
-- Immutable Data
-- Insertions Using HFiles
-- Billions of Rows
-- Intelligence in Key Definition
• Road to Production
-- Redundant Data in Database