Solbase is an open-source, real-time search engine developed at Photobucket to serve the more than 30 million daily search requests Photobucket handles. Solbase replaces Lucene's file-system-based index with HBase, allowing the system to update in real time and scale linearly to serve millions of daily search requests over a large dataset. This session explores the architecture of Solbase, some of Lucene/Solr's inherent issues we overcame, and performance metrics of Solbase against production traffic.
Kyungseog Oh | May 22, 2012 | HBaseCon
What is Solbase?
Solbase is an open-source, real-time search platform based on Lucene, Solr, and HBase, built at Photobucket.
Search at Photobucket
• 40% of total page views
• 500 million 'docs' or images
• 30 million search requests per day
• 120 GB in size
• Previous infrastructure built on Solr/Lucene
Why Solbase?
• Memory issues
• Indexing time
• Speed
• Capacity and scalability
Lucene Memory Issues
• Field cache: sortable and filterable fields are stored in a Java array the size of the maximum document number
• Example: if every doc is sorted by an integer field, for 500 million documents the array is 2 GB in size
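The arithmetic behind that 2 GB figure can be checked directly: an int field cache holds one 4-byte slot per document up to the maximum document number, regardless of how many documents a query actually touches. A minimal sketch (not Lucene's actual FieldCache code):

```python
# Back-of-the-envelope size of a Lucene-style int field cache:
# one 4-byte int slot per document, sized to the maximum doc number.
INT_BYTES = 4

def field_cache_bytes(max_doc: int) -> int:
    """Size in bytes of an int field cache covering max_doc documents."""
    return max_doc * INT_BYTES

size = field_cache_bytes(500_000_000)  # Photobucket's ~500M docs
print(size)            # 2000000000 bytes
print(size / 10**9)    # 2.0 GB
```

One such array is needed per sorted or filtered field, so the cost multiplies with every sortable field added.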
Indexing Time
• Solr indexing took 15-16 hours to rebuild the indices
• We wanted to provide near real-time updates
Speed
• Every 100 ms improvement in response time equates to approximately 1 extra page view per visit
• That can add up to hundreds of millions of extra page views per month
Capacity & Scalability
• Impractical to add a significant number of new docs and data (Geo, EXIF, etc.)
• Difficult to divide the data set to create a brand-new shard
• Fault tolerance is not built in
The Concept
Modify Lucene and Solr to use HBase as the source of index and document data.
Term/Document Tables
create 'TV', {NAME => 'd', COMPRESSION => 'SNAPPY', VERSIONS => 1, REPLICATION_SCOPE => 1}

create 'Docs',
  {NAME => 'field', COMPRESSION => 'SNAPPY', VERSIONS => 1, REPLICATION_SCOPE => 1},
  {NAME => 'allTerms', COMPRESSION => 'SNAPPY', VERSIONS => 1, REPLICATION_SCOPE => 1},
  {NAME => 'timestamp', COMPRESSION => 'SNAPPY', VERSIONS => 1, REPLICATION_SCOPE => 1}

Solbase tables
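A toy model of the two tables can make the split concrete: the term table ('TV') keys rows by field/term/doc id, while 'Docs' keys rows by doc id. The sketch below uses in-memory dicts as a stand-in for HBase, and the key layout and payload are illustrative assumptions, not Solbase's actual encoding:

```python
# In-memory stand-in for the two Solbase tables: 'TV' maps term rows
# to posting data, 'Docs' maps doc ids to stored fields.
tv: dict[bytes, bytes] = {}    # term table (family 'd')
docs: dict[int, dict] = {}     # document table

def index_doc(doc_id: int, fields: dict) -> None:
    """Write a document into both tables, one TV row per (field, term)."""
    docs[doc_id] = fields
    for field, text in fields.items():
        for term in text.split():
            key = f"{field}\x00{term}\x00{doc_id:08d}".encode()
            tv[key] = b""      # posting payload (positions, norms, ...)

index_doc(1, {"title": "red cat"})
print(sorted(tv))    # two TV rows for doc 1: one per term
print(docs[1])       # {'title': 'red cat'}
```

Keeping one TV row per (field, term, doc) pair is what makes term lookups range scans rather than file reads.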
Query Methodology
Term queries are HBase range scans:
Start key: <field><delimiter><term><delimiter><begin doc id>0x00000000
End key:   <field><delimiter><term><delimiter><end doc id>0xffffffff
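A sketch of how such a scan behaves, using a sorted in-memory list as a stand-in for an HBase table (the 0x00 delimiter byte and helper names are assumptions for illustration): because row keys sort lexicographically, padding the doc id range with 0x00000000 and 0xffffffff bounds exactly the rows for one (field, term) pair.

```python
# Model a term-query range scan over lexicographically sorted row keys.
import bisect
import struct

DELIM = b"\x00"  # assumed delimiter byte

def row_key(field: bytes, term: bytes, doc_id: int) -> bytes:
    # Big-endian doc id keeps numeric order consistent with byte order.
    return field + DELIM + term + DELIM + struct.pack(">I", doc_id)

def term_scan(keys: list, field: bytes, term: bytes) -> list:
    """Return all row keys for (field, term), any doc id."""
    start = field + DELIM + term + DELIM + b"\x00\x00\x00\x00"
    stop = field + DELIM + term + DELIM + b"\xff\xff\xff\xff"
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_right(keys, stop)
    return keys[lo:hi]

keys = sorted([
    row_key(b"title", b"cat", 7),
    row_key(b"title", b"cat", 42),
    row_key(b"title", b"dog", 3),
])
print(term_scan(keys, b"title", b"cat"))  # the two 'cat' rows only
```

In real HBase the same bounds would go into a Scan's start and stop rows; the point is that a term lookup never touches rows outside its key range.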
Solbase – Distributed Processing
[Diagram: Solr sharding – a master distributes queries to shards, each shard with its own local index file]
[Diagram: Solbase sharding – a master distributes queries to shards, all backed by a shared HBase cluster]
Solbase – Sorts & Filters
• Extra bits in encoded metadata
• Solved Lucene's sort/filter field cache issue
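One way those "extra bits" can work, sketched below with an assumed layout (not Solbase's actual encoding): pack a small sort value and filter flags into each posting's metadata, so sorting and filtering happen while scanning postings, with no doc-wide field cache array in memory.

```python
# Illustrative bit-packing of per-posting metadata:
# low 8 bits = filter flags, upper bits = a small sort key.
def encode_meta(sort_key: int, flags: int) -> int:
    assert 0 <= flags < 256 and 0 <= sort_key < 2**24
    return (sort_key << 8) | flags

def decode_meta(meta: int) -> tuple:
    """Recover (sort_key, flags) from packed metadata."""
    return meta >> 8, meta & 0xFF

meta = encode_meta(sort_key=1234, flags=0b0001)
print(decode_meta(meta))  # (1234, 1)
```

The trade-off is that the sortable value must fit in a few bytes per posting, in exchange for dropping the 2 GB-per-field cache described earlier.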
Solbase – Indexing Process
• Initial indexing: leveraging the MapReduce framework
• Real-time indexing: using Solr's update API
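The initial bulk-index step can be pictured as a classic MapReduce inverted-index job. A toy in-memory version (function names are illustrative, not from the Solbase codebase):

```python
# Toy MapReduce-style bulk indexing: map emits (term, doc_id) pairs,
# reduce groups them into per-term postings lists.
from collections import defaultdict

def map_doc(doc_id: int, text: str):
    """Map phase: emit one (term, doc_id) pair per token."""
    for term in text.lower().split():
        yield term, doc_id

def reduce_postings(pairs) -> dict:
    """Reduce phase: group doc ids per term into sorted postings."""
    postings = defaultdict(list)
    for term, doc_id in pairs:
        postings[term].append(doc_id)
    return {t: sorted(set(ids)) for t, ids in postings.items()}

docs = {1: "red cat", 2: "red dog"}
pairs = [p for doc_id, text in docs.items() for p in map_doc(doc_id, text)]
index = reduce_postings(pairs)
print(index["red"])  # [1, 2]
```

In the real job the reduce output would be written as TV-table rows, while real-time updates flow through Solr's update API one document at a time.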
Results
• Term 'me' takes 13 seconds to load from HBase, 500 ms from cache ('me' has ~14M docs, the largest term in our indices)
• Most terms not in cache take < 200 ms
• Most cached terms take < 20 ms
• Average query time for native Solr/Lucene: 169 ms
• Average query time for Solbase: 109 ms, a 35% decrease
• ~300 real-time updates per second
HBase Configuration / Limitations
• Compatibility issues with the latest Solr
• Runs on the latest CDH3 build
• HBase/Solbase clusters per data center
Repos
• https://github.com/Photobucket/Solbase
• https://github.com/Photobucket/Solbase-Lucene
• https://github.com/Photobucket/Solbase-Solr
Q&A