Solbase is an open-source, real-time search engine developed at Photobucket to serve the more than 30 million daily search requests Photobucket handles. Solbase replaces Lucene's file-system-based index with HBase, allowing the system to update in real time and scale linearly to serve millions of daily search requests over a large dataset. This session explores the architecture of Solbase, some of Lucene/Solr's inherent issues we overcame, and performance metrics of Solbase against production traffic.
Kyungseog Oh | May 22, 2012 | HBaseCon
What is Solbase?
Solbase is an open-source, real-time search platform based on Lucene, Solr, and HBase, built at Photobucket.
Search at Photobucket
• 40% of total page views
• 500 million 'docs' or images
• 30 million search requests per day
• 120 GB in size
• Previous infrastructure built on Solr/Lucene
Why Solbase?
• Memory issues
• Indexing time
• Speed
• Capacity and scalability
Lucene Memory Issues
• Field cache: sortable and filterable fields are stored in a Java array the size of the maximum document number
• Example: if every doc is sorted by an integer field, for 500 million documents the array is 2 GB in size
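The arithmetic behind that 2 GB figure can be checked directly: an int field cache holds one 4-byte slot per document up to the maximum document number, regardless of how many documents a query actually touches. A minimal sketch (not Lucene's actual FieldCache code):

```python
# Back-of-the-envelope size of a Lucene-style int field cache:
# one 4-byte int slot per document, sized to the maximum doc number.
INT_BYTES = 4

def field_cache_bytes(max_doc: int) -> int:
    """Size in bytes of an int field cache covering max_doc documents."""
    return max_doc * INT_BYTES

size = field_cache_bytes(500_000_000)  # Photobucket's ~500M docs
print(size)            # 2000000000 bytes
print(size / 10**9)    # 2.0 GB
```

One such array is needed per sorted or filtered field, so the cost multiplies with every sortable field added.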
Indexing Time
• Solr indexing took 15-16 hours to rebuild the indices
• We wanted to provide near real-time updates
Speed
• Every 100 ms improvement in response time equates to approximately 1 extra page view per visit
• That can add up to hundreds of millions of extra page views per month
Capacity & Scalability
• Impractical to add a significant number of new docs and data (Geo, EXIF, etc.)
• Difficult to divide the data set to create a brand-new shard
• Fault tolerance is not built in
The Concept
Modify Lucene and Solr to use HBase as the source of index and document data.
Term/Document Tables
create 'TV', {NAME => 'd', COMPRESSION => 'SNAPPY', VERSIONS => 1, REPLICATION_SCOPE => 1}

create 'Docs',
  {NAME => 'field', COMPRESSION => 'SNAPPY', VERSIONS => 1, REPLICATION_SCOPE => 1},
  {NAME => 'allTerms', COMPRESSION => 'SNAPPY', VERSIONS => 1, REPLICATION_SCOPE => 1},
  {NAME => 'timestamp', COMPRESSION => 'SNAPPY', VERSIONS => 1, REPLICATION_SCOPE => 1}

Solbase tables
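A toy model of the two tables can make the split concrete: the term table ('TV') keys rows by field/term/doc id, while 'Docs' keys rows by doc id. The sketch below uses in-memory dicts as a stand-in for HBase, and the key layout and payload are illustrative assumptions, not Solbase's actual encoding:

```python
# In-memory stand-in for the two Solbase tables: 'TV' maps term rows
# to posting data, 'Docs' maps doc ids to stored fields.
tv: dict[bytes, bytes] = {}    # term table (family 'd')
docs: dict[int, dict] = {}     # document table

def index_doc(doc_id: int, fields: dict) -> None:
    """Write a document into both tables, one TV row per (field, term)."""
    docs[doc_id] = fields
    for field, text in fields.items():
        for term in text.split():
            key = f"{field}\x00{term}\x00{doc_id:08d}".encode()
            tv[key] = b""      # posting payload (positions, norms, ...)

index_doc(1, {"title": "red cat"})
print(sorted(tv))    # two TV rows for doc 1: one per term
print(docs[1])       # {'title': 'red cat'}
```

Keeping one TV row per (field, term, doc) pair is what makes term lookups range scans rather than file reads.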
Query Methodology
Term queries are HBase range scans:
Start key: <field><delimiter><term><delimiter><begin doc id>0x00000000
End key:   <field><delimiter><term><delimiter><end doc id>0xffffffff
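A sketch of how such a scan behaves, using a sorted in-memory list as a stand-in for an HBase table (the 0x00 delimiter byte and helper names are assumptions for illustration): because row keys sort lexicographically, padding the doc id range with 0x00000000 and 0xffffffff bounds exactly the rows for one (field, term) pair.

```python
# Model a term-query range scan over lexicographically sorted row keys.
import bisect
import struct

DELIM = b"\x00"  # assumed delimiter byte

def row_key(field: bytes, term: bytes, doc_id: int) -> bytes:
    # Big-endian doc id keeps numeric order consistent with byte order.
    return field + DELIM + term + DELIM + struct.pack(">I", doc_id)

def term_scan(keys: list, field: bytes, term: bytes) -> list:
    """Return all row keys for (field, term), any doc id."""
    start = field + DELIM + term + DELIM + b"\x00\x00\x00\x00"
    stop = field + DELIM + term + DELIM + b"\xff\xff\xff\xff"
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_right(keys, stop)
    return keys[lo:hi]

keys = sorted([
    row_key(b"title", b"cat", 7),
    row_key(b"title", b"cat", 42),
    row_key(b"title", b"dog", 3),
])
print(term_scan(keys, b"title", b"cat"))  # the two 'cat' rows only
```

In real HBase the same bounds would go into a Scan's start and stop rows; the point is that a term lookup never touches rows outside its key range.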
Solbase – Distributed Processing
[Diagram: Solr sharding – a master distributes queries to shards, each shard with its own local index file]
[Diagram: Solbase sharding – a master distributes queries to shards, all backed by a shared HBase cluster]
Solbase – Sorts & Filters
• Extra bits in encoded metadata
• Solved Lucene's sort/filter field cache issue
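One way those "extra bits" can work, sketched below with an assumed layout (not Solbase's actual encoding): pack a small sort value and filter flags into each posting's metadata, so sorting and filtering happen while scanning postings, with no doc-wide field cache array in memory.

```python
# Illustrative bit-packing of per-posting metadata:
# low 8 bits = filter flags, upper bits = a small sort key.
def encode_meta(sort_key: int, flags: int) -> int:
    assert 0 <= flags < 256 and 0 <= sort_key < 2**24
    return (sort_key << 8) | flags

def decode_meta(meta: int) -> tuple:
    """Recover (sort_key, flags) from packed metadata."""
    return meta >> 8, meta & 0xFF

meta = encode_meta(sort_key=1234, flags=0b0001)
print(decode_meta(meta))  # (1234, 1)
```

The trade-off is that the sortable value must fit in a few bytes per posting, in exchange for dropping the 2 GB-per-field cache described earlier.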
Solbase – Indexing Process
• Initial indexing: leveraging the MapReduce framework
• Real-time indexing: using Solr's update API
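The initial bulk-index step can be pictured as a classic MapReduce inverted-index job. A toy in-memory version (function names are illustrative, not from the Solbase codebase):

```python
# Toy MapReduce-style bulk indexing: map emits (term, doc_id) pairs,
# reduce groups them into per-term postings lists.
from collections import defaultdict

def map_doc(doc_id: int, text: str):
    """Map phase: emit one (term, doc_id) pair per token."""
    for term in text.lower().split():
        yield term, doc_id

def reduce_postings(pairs) -> dict:
    """Reduce phase: group doc ids per term into sorted postings."""
    postings = defaultdict(list)
    for term, doc_id in pairs:
        postings[term].append(doc_id)
    return {t: sorted(set(ids)) for t, ids in postings.items()}

docs = {1: "red cat", 2: "red dog"}
pairs = [p for doc_id, text in docs.items() for p in map_doc(doc_id, text)]
index = reduce_postings(pairs)
print(index["red"])  # [1, 2]
```

In the real job the reduce output would be written as TV-table rows, while real-time updates flow through Solr's update API one document at a time.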
Results
• Term 'me' takes 13 seconds to load from HBase, 500 ms from cache ('me' has ~14M docs, the largest term in our indices)
• Most terms not in cache take < 200 ms
• Most cached terms take < 20 ms
• Average query time for native Solr/Lucene: 169 ms
• Average query time for Solbase: 109 ms, a 35% decrease
• ~300 real-time updates per second
HBase Configuration / Limitations
• Compatibility issues with the latest Solr
• Runs on the latest CDH3 build
• HBase/Solbase clusters per data center
Repos
• https://github.com/Photobucket/Solbase
• https://github.com/Photobucket/Solbase-Lucene
• https://github.com/Photobucket/Solbase-Solr
Q&A