Musings on Secondary Indexing in HBase

Secondary Indexing

The discussion so far…

9/11/12 HBase Pow-wow

Jesse Yates, Salesforce.com

What is it?

Problem

• HBase rows are multi-dimensional
  – Only sorted on the row key
• How do you efficiently look up anything deeper than the row key?

Example

Row  Family  Qualifier  Timestamp  Value
1    Name    First      0          Babe
1    Name    Last       0          Ruth

How do we find all people with the last name ‘Ruth’?

Full table scan!

Indexing!

Row   Family  Qualifier  Timestamp  Value
Ruth  Name    Last       0          1

Store the property we need to search for as the primary key
• pointer back to the primary row
• fast lookup - O(lg(n))
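The index-table idea can be sketched in a few lines. This is only an in-memory toy, not HBase client code: `SortedTable` and `find_by_last_name` are hypothetical names invented here, and the sorted key list stands in for HBase's row-key ordering. It follows the slide's layout, where the index row key is the last name and the cell value points back at the primary row.

```python
import bisect


class SortedTable:
    """Toy stand-in for a table whose rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []  # row keys, kept in sorted order
        self._rows = {}  # row key -> {(family, qualifier): value}

    def put(self, row, family, qualifier, value):
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        self._rows[row][(family, qualifier)] = value

    def get(self, row):
        # Binary search over the sorted keys: the O(lg(n)) lookup.
        i = bisect.bisect_left(self._keys, row)
        if i < len(self._keys) and self._keys[i] == row:
            return self._rows[row]
        return None


# Primary table: the two cells from the example slide.
people = SortedTable()
people.put("1", "Name", "First", "Babe")
people.put("1", "Name", "Last", "Ruth")

# Index table: row key is the indexed value, cell value is the primary row.
index = SortedTable()
index.put("Ruth", "Name", "Last", "1")


def find_by_last_name(last_name):
    """Two sorted lookups instead of a full table scan."""
    entry = index.get(last_name)
    if entry is None:
        return None
    primary_row = entry[("Name", "Last")]
    return people.get(primary_row)


print(find_by_last_name("Ruth"))
```

With this layout a second person named Ruth would overwrite the first pointer; a fuller sketch would probably make the primary row key part of the index row key, but that goes beyond what the slide shows.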

Use Cases

• Point lookups
  – Volume of data influences usefulness of index
    • Let user decide if they need to use an index
• Scan lookup
  – WHERE age > 16
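A scan lookup such as WHERE age > 16 depends on the index keys sorting in the same order as the values. HBase compares row keys as raw bytes, so one common approach is a fixed-width big-endian encoding. The sketch below assumes non-negative ages and an index key of the form encoded age followed by primary row key; the helper names and the `row-a` style keys are made up for illustration, and a sorted Python list stands in for the index table.

```python
import bisect
import struct

AGE_WIDTH = 4  # bytes used by the encoded age prefix


def encode_age(age):
    # Fixed-width big-endian: byte order matches numeric order for
    # non-negative ints. Decimal strings would not ("9" sorts after "16").
    return struct.pack(">I", age)


def index_key(age, primary_row):
    # Assumed layout: <encoded age><primary row key>
    return encode_age(age) + primary_row.encode("utf-8")


index_keys = []  # sorted list standing in for the index table's row keys


def index_put(age, primary_row):
    bisect.insort(index_keys, index_key(age, primary_row))


def scan_age_greater_than(age):
    """Range scan: seek to the first key with age + 1, then read forward."""
    start = encode_age(age + 1)
    i = bisect.bisect_left(index_keys, start)
    return [key[AGE_WIDTH:].decode("utf-8") for key in index_keys[i:]]


index_put(15, "row-a")
index_put(17, "row-b")
index_put(16, "row-c")
index_put(300, "row-d")

print(scan_age_greater_than(16))
```

The scan touches only the matching tail of the index, which is the point of indexing a range predicate rather than filtering a full table scan.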

Implementations

Full transactional support

Centralized oracle

WAL implementation on top of HBase

100-500 writes/sec

Percolator

Full transactions

Distributed, optimistic locking

~10 sec latencies possible

Culvert

Async

Dead project, incomplete

http://jyates.github.com/2012/07/09/consistent-enough-secondary-indexes.html

Client-side coordinated index

Use timestamps to coordinate

Not yet implemented

Trend Micro Implementation

Still just POC???

Solr/Lucene

Standard Lucene library bolted on HBase

Not commonly used

Lots of formats/codecs already written

Considerations for HBase

What do we need to do?

Built-in vs. external library vs. semi-supported (e.g. security)

Which should I use??

• HBase experts write a single ‘right’ impl
• Officially endorse a ‘correct’ version
• What changes do we need to make?
• How close to the core is the project?
  – Written in everywhere
  – hbase-index module
  – External library

Async vs. Synchronous vs. Transactional

Key Observation

“Secondary indexing is inherently an easier problem than full transactions… secondary index updates are idempotent.”

- Lars Hofhansl
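One reading of why idempotence helps: a failed or timed-out index write can simply be retried, because writing the same cell again leaves the index unchanged. The toy model below keys cells by (row, family, qualifier, timestamp), mirroring HBase's cell coordinates; `VersionedCells` and `write_index_entry` are hypothetical names, and the sketch assumes the retry reuses the original timestamp.

```python
class VersionedCells:
    """Toy model of cells addressed by (row, family, qualifier, timestamp)."""

    def __init__(self):
        self.cells = {}

    def put(self, row, family, qualifier, timestamp, value):
        # Same coordinates and timestamp: the cell is overwritten, not duplicated.
        self.cells[(row, family, qualifier, timestamp)] = value


def write_index_entry(index, last_name, primary_row, timestamp):
    # Index row key is the indexed value; the cell value points at the primary row.
    index.put(last_name, "Name", "Last", timestamp, primary_row)


index = VersionedCells()
write_index_entry(index, "Ruth", "1", timestamp=0)
after_first_write = dict(index.cells)

# Retry, e.g. after a timeout where the client cannot tell whether the
# first attempt was applied. Assumption: the retry reuses the timestamp.
write_index_entry(index, "Ruth", "1", timestamp=0)

print(index.cells == after_first_write)
```

A retry can therefore be issued blindly, without coordinating with other writers the way a full transaction would.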

Async vs. Synchronous vs. Transactional

• We don’t need full transactions
  – Transactions are slow
  – Transactions fail with increasing probability as number of servers increases
• Optionally async or sync
  – Async
    • Inherently ‘dirty’ index
• How does index cleanup work?
  – Inherently different for each type
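The deck leaves cleanup as an open question. One possible approach for a dirty index, offered here only as an illustration and not as something the deck proposes, is to validate each index hit against the primary row at read time and drop entries that no longer match. Plain dicts stand in for the two tables, and `lookup_with_repair` and the `"Name:Last"` column label are hypothetical.

```python
def lookup_with_repair(index, primary, last_name):
    """Return primary row keys whose current last name still matches,
    removing stale index entries found along the way.

    index:   dict mapping last name -> set of primary row keys
    primary: dict mapping row key -> {column: value}
    """
    live = []
    # sorted() copies the set, so discarding entries while looping is safe.
    for row in sorted(index.get(last_name, set())):
        current = primary.get(row, {}).get("Name:Last")
        if current == last_name:
            live.append(row)
        else:
            # Stale pointer: the primary row changed or was deleted.
            index[last_name].discard(row)
    return live


primary = {
    "1": {"Name:Last": "Ruth"},
    "2": {"Name:Last": "Gehrig"},  # was once indexed under "Ruth"
}
index = {"Ruth": {"1", "2"}, "Gehrig": {"2"}}

print(lookup_with_repair(index, primary, "Ruth"))
```

The trade-off is an extra read of the primary row on every index hit, which is the price of tolerating a dirty index instead of keeping it synchronously consistent.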

Locality

Where’s my data?

• Extra columns vs. index table
• HBase Region-pinning
  – Has to be best-effort or will decrease availability
  – Helps minimize RPC overhead
  – Cross-table region-pinning
  – Needs a coprocessor hook to be useful
• HDFS block allocation
  – Keep index and data blocks on same HDFS node

Index Cardinality

How much data are we talking?

“Seems like there are 3 categories of sparseness:
1. sparse indexes (like ipAddress) where a per-table approach is
