HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

NEXTBIO 2008

Leveraging HBase for the World's Largest Curated Genomic Data Collection

Satnam Alag, Ph.D. VP of [email protected]

© 2012 NextBio | All rights reserved | This information is proprietary and confidential.

Technology Generating Exponential Data


Genomic Big Data


Use Case 1: HBase to Store Variant Data

• Each genome has ~4 million variants
• Immutable: write once, never change, read many times
• Bloom filters are useful
• Batch import of data: HFile
• Data to be accessed collocated in region
• Separate HBase cluster from Hadoop
• All the smarts are in the keys for the various tables


In HBase:
1 Genome: 10 million rows
100 Genomes: 1 billion rows
100K Genomes: 1 trillion rows
100M Genomes: 1 quadrillion rows (1,000,000,000,000,000)
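The row counts above all follow from roughly 10 million rows per genome. A minimal sketch of that arithmetic in Python; the constant comes from the slide, while the function name is only illustrative:

```python
ROWS_PER_GENOME = 10_000_000  # from the slide: 1 genome is about 10 million rows

def total_rows(genomes: int) -> int:
    """Estimated HBase row count for a given number of genomes."""
    return genomes * ROWS_PER_GENOME

# Reproduce the four scale points listed on the slide.
for genomes in (1, 100, 100_000, 100_000_000):
    print(f"{genomes:>11,} genomes -> {total_rows(genomes):,} rows")
```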

Fortunately, HBase cluster access can be partitioned by the application when required

Accessing Data with Pagination


Table 1: Key: Bioset Id + Display Order

Columns

Pagination Example: Page 5, Page Size = 100

Retrieve 100 rows from Display Order = 400-500

Number of rows = 1 per SNP, on the order of 4 million
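Because the display order is part of the row key, a page maps directly to a key range. The sketch below, in Python, shows one way this could work. It assumes fixed-width big-endian integers for both key parts, a display order that starts at 0, and a stop key that is exclusive (as in an HBase scan); none of these encoding details are stated on the slide, and the function names are hypothetical.

```python
import struct
from typing import Tuple

def table1_key(bioset_id: int, display_order: int) -> bytes:
    """Row key: bioset id followed by display order.

    Assumed encoding: fixed-width big-endian, so that byte-wise key order
    (how HBase sorts rows) matches numeric order.
    """
    return struct.pack(">QI", bioset_id, display_order)

def page_range(bioset_id: int, page: int, page_size: int) -> Tuple[bytes, bytes]:
    """Start (inclusive) and stop (exclusive) keys for a 1-based page number."""
    first = (page - 1) * page_size
    return (table1_key(bioset_id, first),
            table1_key(bioset_id, first + page_size))

# Page 5 with page size 100 covers display orders 400 through 499.
start, stop = page_range(bioset_id=42, page=5, page_size=100)
```

A scan bounded by `start` and `stop` touches only the 100 rows of the requested page, regardless of how many millions of rows the bioset has.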

Accessing Data with Keys



Keys returned by search index

Filtering Data with Pagination



Example: Gene: ESR1, Class: Missense
Page Size = 100

Retrieve rows from Table 2
Retrieve rows by keys from Table 1

Number of rows: on the order of 0.5 million per dataset (# genes x classes)

Table 2: Id + GeneId + MutationClass

Column: Counts, Keys to Table
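The two-step lookup for a filtered page can be sketched as follows, in Python. Plain dictionaries stand in for the two HBase tables, and the column layout, the function name, and the toy data are all assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Table1Key = Tuple[int, int]        # (bioset id, display order)
Table2Key = Tuple[int, str, str]   # (bioset id, gene id, mutation class)

# Hypothetical in-memory stand-ins for the two HBase tables.
table1: Dict[Table1Key, dict] = {}
table2: Dict[Table2Key, Tuple[int, List[Table1Key]]] = {}

def filtered_page(bioset_id: int, gene: str, mutation_class: str,
                  page: int, page_size: int) -> Tuple[int, List[dict]]:
    """Return (total matches, Table 1 rows for one 1-based page)."""
    # Step 1: one Table 2 row gives the count and the matching Table 1 keys.
    count, keys = table2.get((bioset_id, gene, mutation_class), (0, []))
    # Step 2: fetch only the Table 1 rows that fall on the requested page.
    first = (page - 1) * page_size
    return count, [table1[k] for k in keys[first:first + page_size]]

# Toy data, invented purely to exercise the function.
for order in range(1000):
    table1[(42, order)] = {"display_order": order}
matching = [(42, order) for order in range(0, 1000, 4)]
table2[(42, "ESR1", "Missense")] = (len(matching), matching)

count, rows = filtered_page(42, "ESR1", "Missense", page=2, page_size=100)
```

Storing the count alongside the keys lets the application show the total number of matches without touching Table 1 at all.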

Powering the Genome Browser



Example: Chr: 6, Specified Range

Retrieve all rows

1 row per SNP, ~4 million per dataset

Table 2: Id + GeneId + MutationClass

Table 3: Id + ChromosomeId + Range + DisplayOrder
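One way the Table 3 key could serve a genome-browser query is sketched below in Python. The slide does not say how "Range" is encoded; this sketch assumes it is a fixed-size position bucket, that chromosomes are numbered, and that all key parts are fixed-width big-endian integers. `BUCKET_SIZE` and the function names are hypothetical.

```python
import struct
from typing import Tuple

BUCKET_SIZE = 1_000_000  # assumed width of one "Range" bucket, in base pairs

def table3_prefix(bioset_id: int, chromosome: int, bucket: int) -> bytes:
    """Key prefix: id + chromosome id + range bucket (display order follows)."""
    return struct.pack(">QHI", bioset_id, chromosome, bucket)

def browser_scan_range(bioset_id: int, chromosome: int,
                       start_pos: int, end_pos: int) -> Tuple[bytes, bytes]:
    """Start (inclusive) and stop (exclusive) keys covering [start_pos, end_pos]."""
    first_bucket = start_pos // BUCKET_SIZE
    last_bucket = end_pos // BUCKET_SIZE
    return (table3_prefix(bioset_id, chromosome, first_bucket),
            table3_prefix(bioset_id, chromosome, last_bucket + 1))

# Chromosome 6, positions 2,500,000 to 4,200,000: buckets 2 through 4.
start, stop = browser_scan_range(42, 6, 2_500_000, 4_200_000)
```

Under these assumptions a single contiguous scan returns every row for the visible window, with rows inside each bucket already sorted by display order.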

Use Case 2: Correlation Data


Use Case 2

• Each correlation score stored as a row
• HFile created for new score
• Over 20 billion correlations

[Figure: correlation matrix with biosets B1, B2, …, Bn, Bn+1 as both rows and columns]

T1: scorebioset (base table) key: biosetid_1 [+] biosetid_2
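The composite key for the score table can be sketched as below, in Python. The fixed-width big-endian encoding is an assumption, and the slide does not say whether each pair is stored in one or both orders, so the sketch covers only key construction and a prefix range over the first bioset id. Function names are hypothetical.

```python
import struct
from typing import Tuple

def score_key(biosetid_1: int, biosetid_2: int) -> bytes:
    """Row key for one correlation score: biosetid_1 followed by biosetid_2.

    Assumed encoding: two 8-byte big-endian integers, so rows sort by the
    first bioset id and then by the second.
    """
    return struct.pack(">QQ", biosetid_1, biosetid_2)

def scores_for(biosetid_1: int) -> Tuple[bytes, bytes]:
    """Start (inclusive) and stop (exclusive) keys for all rows of one bioset."""
    return struct.pack(">Q", biosetid_1), struct.pack(">Q", biosetid_1 + 1)

start, stop = scores_for(7)
```

With this layout, every score whose key begins with a given bioset id sits in one contiguous key range.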


Lessons Learnt

• HBase Works Well For
-- Immutable Data
-- Insertions Using HFiles
-- Billions of Rows
-- Intelligence in Key Definition

• Road to Production
-- Redundant Data in Database