NEXTBIO 2008
Leveraging HBase for the World's Largest Curated Genomic Data Collection
Satnam Alag, Ph.D.
VP of Engineering
[email protected]
© 2012 NextBio | All rights reserved | This information is proprietary and confidential.

HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

DESCRIPTION

NextBio relies on HBase to store the world’s largest collection of continuously curated genomic knowledge. The HBase cluster is leveraged to store billions of correlations as well as processed genomic information. In this talk, we will describe how we use HBase, why we migrated from a large MySQL deployment to HBase, and the challenges along the way.

Page 1: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

NEXTBIO 2008

Leveraging HBase for the World's Largest Curated Genomic Data Collection

Satnam Alag, Ph.D.
VP of Engineering
[email protected]

Page 2: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

Technology Generating Exponential Data

Page 3: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

Genomic Big Data

Page 4: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

Use Case 1: HBase to Store Variant Data

• Each genome has ~4 million variants
• Immutable: write once, never change, read many times
• Bloom filters are useful
• Batch import of data: HFile
• Data to be accessed collocated in region
• Separate HBase cluster from Hadoop
• All the smarts are in the keys for the various tables

In HBase:
• 1 genome: 10 million rows
• 100 genomes: 1 billion rows
• 100K genomes: 1 trillion rows
• 100M genomes: 1 quadrillion rows (1,000,000,000,000,000)

Fortunately, HBase cluster access can be partitioned by the application when required
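
The row counts on this slide follow from simple linear scaling. The sketch below reproduces them, assuming every genome contributes the same 10 million rows quoted above; the names `ROWS_PER_GENOME` and `total_rows` are illustrative and not from the talk.

```python
# Rows-per-genome figure taken from the slide (1 genome -> 10 million rows).
ROWS_PER_GENOME = 10_000_000


def total_rows(genomes: int) -> int:
    """Total HBase rows, assuming row count grows linearly with genome count."""
    return genomes * ROWS_PER_GENOME


# Print the same four data points listed on the slide.
for genomes in (1, 100, 100_000, 100_000_000):
    print(f"{genomes:>11,} genomes -> {total_rows(genomes):,} rows")
```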

Page 5: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

Accessing Data with Pagination

Table 1: Key: Bioset Id + Display Order

Columns

Pagination example: Page 5, Page Size = 100

Retrieve 100 rows from Display Order = 400-500

Number of rows = 1 per SNP, on the order of 4 million
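
The pagination example above can be sketched as a key-range computation. This is a minimal illustration, not NextBio's code: it assumes the row key is two fixed-width big-endian integers (so byte order matches numeric order), that display order is 0-based, and that the stop key is exclusive, as in an HBase scan. The names `row_key` and `page_bounds`, the bioset id 42, and the sorted dictionary standing in for a table are all hypothetical.

```python
import struct
from typing import Tuple


def row_key(bioset_id: int, display_order: int) -> bytes:
    """Hypothetical key layout: two big-endian unsigned 32-bit integers.

    Fixed-width big-endian fields make byte order match numeric order.
    """
    return struct.pack(">II", bioset_id, display_order)


def page_bounds(bioset_id: int, page: int, page_size: int) -> Tuple[bytes, bytes]:
    """Return (start_key, stop_key) for a 1-based page; stop_key is exclusive."""
    first = (page - 1) * page_size
    return row_key(bioset_id, first), row_key(bioset_id, first + page_size)


# Stand-in for a sorted HBase table: 1,000 rows for one made-up bioset.
table = {row_key(42, order): f"snp-{order}" for order in range(1000)}

# Page 5 with page size 100 maps to display orders 400 (inclusive) to 500 (exclusive).
start, stop = page_bounds(42, page=5, page_size=100)
page_rows = [table[k] for k in sorted(table) if start <= k < stop]
print(len(page_rows), page_rows[0], page_rows[-1])
```

Because the page boundaries are derived from the key alone, a page deep in the result set costs the same as the first one: no rows before the start key need to be read.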

Page 6: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

Accessing Data with Keys

Table 1: Key: Bioset Id + Display Order

Keys returned by search index
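
Key-based access can be sketched as a batch of exact-key lookups. The example below is an illustration under assumptions: the key encoding, the `multi_get` helper, the bioset id 42, and the display orders standing in for a search-index result are all hypothetical, and a dictionary plays the role of Table 1.

```python
import struct


def row_key(bioset_id: int, display_order: int) -> bytes:
    """Hypothetical key layout: two big-endian unsigned 32-bit integers."""
    return struct.pack(">II", bioset_id, display_order)


# Stand-in for Table 1: 50 made-up rows for one bioset.
table = {row_key(42, order): {"snp": f"snp-{order}"} for order in range(50)}


def multi_get(table: dict, keys: list) -> list:
    """Fetch rows by exact key; absent keys are skipped here for simplicity."""
    return [table[k] for k in keys if k in table]


# Pretend the search index returned these three row keys.
hits = [row_key(42, order) for order in (3, 17, 44)]
rows = multi_get(table, hits)
print([r["snp"] for r in rows])
```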

Page 7: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

Filtering Data with Pagination

Table 1: Key: Bioset Id + Display Order

Example: Gene: ESR1, Class: Missense, Page Size = 100

1. Retrieve rows from Table 2
2. Retrieve rows by keys from Table 1

Number of rows: on the order of 0.5 million per dataset (# genes x classes)

Table 2: Id + GeneId + MutationClass

Column: Counts, Keys to Table
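
The two-step lookup on this slide can be sketched as follows. This is a simplified model under stated assumptions: Table 2 is taken to hold, per (bioset, gene, mutation class), a count and the display orders of the matching Table 1 rows; the key encodings, the function `filtered_page`, and all data values are made up for illustration.

```python
import struct


def table1_key(bioset_id: int, display_order: int) -> bytes:
    """Hypothetical Table 1 key: two big-endian unsigned 32-bit integers."""
    return struct.pack(">II", bioset_id, display_order)


def table2_key(bioset_id: int, gene_id: str, mutation_class: str) -> str:
    """Hypothetical Table 2 key: Id + GeneId + MutationClass, joined with '|'."""
    return f"{bioset_id}|{gene_id}|{mutation_class}"


# Stand-in for Table 1: one row per SNP.
table1 = {table1_key(42, order): f"snp-{order}" for order in range(1000)}

# Stand-in for Table 2: a count plus pointers (display orders) into Table 1.
matching_orders = list(range(0, 1000, 4))  # 250 made-up matches
table2 = {
    table2_key(42, "ESR1", "Missense"): {
        "count": len(matching_orders),
        "orders": matching_orders,
    }
}


def filtered_page(bioset_id, gene_id, mutation_class, page, page_size):
    """Step 1: read the Table 2 row. Step 2: fetch one page of rows from Table 1."""
    entry = table2.get(table2_key(bioset_id, gene_id, mutation_class))
    if entry is None:
        return 0, []
    first = (page - 1) * page_size
    orders = entry["orders"][first:first + page_size]
    rows = [table1[table1_key(bioset_id, order)] for order in orders]
    return entry["count"], rows


total, rows = filtered_page(42, "ESR1", "Missense", page=1, page_size=100)
print(total, len(rows), rows[:3])
```

Storing the count next to the keys lets the caller show the total number of matches without touching Table 1 at all.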

Page 8: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

Powering the Genome Browser

Table 1: Key: Bioset Id + Display Order

Example: Chr: 6, Specified Range

Retrieve all rows

1 row per SNP, ~4 million per dataset

Table 2: Id + GeneId + MutationClass

Table 3: Id + ChromosomeId + Range + DisplayOrder
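
A region query against a key shaped like Table 3 can be sketched as a range scan over sorted keys. The slide does not say how "Range" is encoded, so this example assumes it is a fixed-width position bin (`BIN_SIZE` is a made-up value); the key encoding, the function `scan_region`, and the SNP positions are likewise hypothetical.

```python
import struct
from bisect import bisect_left

BIN_SIZE = 100_000  # hypothetical bin width in base pairs


def table3_key(bioset_id: int, chromosome: int, bin_index: int, display_order: int) -> bytes:
    """Hypothetical key: Id + ChromosomeId + Range (bin) + DisplayOrder, big-endian."""
    return struct.pack(">IIII", bioset_id, chromosome, bin_index, display_order)


# Made-up SNPs on chromosome 6 of bioset 42: (position, display_order).
snps = [(50_000, 0), (150_000, 1), (152_000, 2), (420_000, 3), (990_000, 4)]
keys = sorted(table3_key(42, 6, pos // BIN_SIZE, order) for pos, order in snps)


def scan_region(keys, bioset_id, chromosome, start_pos, end_pos):
    """Return display orders of rows whose bin overlaps [start_pos, end_pos]."""
    lo = table3_key(bioset_id, chromosome, start_pos // BIN_SIZE, 0)
    hi = table3_key(bioset_id, chromosome, end_pos // BIN_SIZE + 1, 0)
    i, j = bisect_left(keys, lo), bisect_left(keys, hi)
    return [struct.unpack(">IIII", k)[3] for k in keys[i:j]]


orders = scan_region(keys, 42, 6, 100_000, 450_000)
print(orders)
```

The scan works at bin granularity, so rows in the first and last bins may fall slightly outside the requested positions; a caller needing exact bounds would filter the returned rows by position. The display orders returned here could then be used as keys into Table 1.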

Page 9: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

Use Case 2: Correlation Data

Page 10: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

Use Case 2

• Each correlation score stored as a row
• HFile created for new score
• Over 20 billion correlations

[Figure: bioset-by-bioset correlation matrix with rows and columns B1, B2, ..., Bn, Bn+1]

T1: scorebioset (base table) key: biosetid_1 [+] biosetid_2
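
With a key of biosetid_1 + biosetid_2, all scores for one bioset sit next to each other and can be read with a prefix scan. The sketch below illustrates that idea under assumptions: the fixed-width integer encoding, the function names, and the score values are made up, and whether each pair is stored in one or both directions is not stated on the slide.

```python
import struct
from bisect import bisect_left


def score_key(bioset_1: int, bioset_2: int) -> bytes:
    """Hypothetical key: biosetid_1 + biosetid_2 as big-endian unsigned 32-bit ints."""
    return struct.pack(">II", bioset_1, bioset_2)


# Stand-in for the scorebioset table with made-up scores, one row per pair.
scores = {
    score_key(1, 2): 0.91,
    score_key(1, 3): 0.10,
    score_key(2, 3): 0.55,
    score_key(3, 1): 0.10,
}
sorted_keys = sorted(scores)


def correlations_for(bioset_id: int) -> dict:
    """Prefix scan: every row whose key starts with bioset_id."""
    lo = score_key(bioset_id, 0)
    hi = score_key(bioset_id + 1, 0)  # first key of the next bioset (exclusive)
    i, j = bisect_left(sorted_keys, lo), bisect_left(sorted_keys, hi)
    return {struct.unpack(">II", k)[1]: scores[k] for k in sorted_keys[i:j]}


print(correlations_for(1))
```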

Page 11: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

Lessons Learnt

• HBase works well for
-- Immutable data
-- Insertions using HFiles
-- Billions of rows
-- Intelligence in key definition
• Road to production
-- Redundant data in database