HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce

HBase Internals

Lars Hofhansl
Architect @ Salesforce.com
HBase Committer

HBase Internals
A.K.A. Get ready to:

Agenda

• Overview

• Scanning

• Atomicity/MVCC

• Updates
  – Put/Delete

• Code!

A sparse, consistent, distributed, multi-dimensional, persistent, sorted map.

In the end it comes down to a sorting problem.
How do you sort 1PB with 16GB of memory?

7 5 6 3 2 9 1 8 4

7 5 6 | 3 2 9 | 1 8 4

5 6 7 | 2 3 9 | 1 4 8

1 2 3 4 5 6 7 8 9
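
The picture above can be sketched in code: sort chunks that fit in memory, then merge the sorted chunks. This is only an illustration, not HBase code. The class and method names (`ChunkedSort`, `sortedRuns`, `merge`) are made up for this sketch, and the chunks live in arrays here, whereas a real external sort would keep them on disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ChunkedSort {
    // Sort each fixed-size chunk on its own, as if only chunkSize values fit in memory.
    static List<int[]> sortedRuns(int[] input, int chunkSize) {
        List<int[]> runs = new ArrayList<>();
        for (int start = 0; start < input.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, input.length);
            int[] run = Arrays.copyOfRange(input, start, end);
            Arrays.sort(run);
            runs.add(run);
        }
        return runs;
    }

    // Merge the sorted runs by repeatedly taking the smallest head element.
    static int[] merge(List<int[]> runs) {
        int total = 0;
        for (int[] run : runs) total += run.length;
        int[] out = new int[total];
        int[] pos = new int[runs.size()]; // read position within each run
        for (int i = 0; i < total; i++) {
            int best = -1;
            for (int r = 0; r < runs.size(); r++) {
                if (pos[r] < runs.get(r).length
                        && (best < 0 || runs.get(r)[pos[r]] < runs.get(best)[pos[best]])) {
                    best = r;
                }
            }
            out[i] = runs.get(best)[pos[best]++];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] input = {7, 5, 6, 3, 2, 9, 1, 8, 4};
        System.out.println(Arrays.toString(merge(sortedRuns(input, 3))));
    }
}
```

Only one chunk needs to be in memory while sorting, and the merge needs only the current head of each run.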

Overview

• All changes collected/sorted in memory

• Periodically flushed to HDFS
  – Writes to disk are batched

• HFiles periodically compacted into larger files

• Scanning/Compacting: Merge Sort

HDFS Directory Hierarchy

/hbase
    /-ROOT-
    /.META.
    /.logs
    /<table1>
        /<region>
            [/.recovered_logs]
            /<column family>
                /<HFile>
                /...
                /...
            /<column family>
                /...
    /<table2>
    ...

Storage is per CF

Scanning

• "Log Structured Merge Trees"
• Multiway mergesort
• Efficient scanning/seeking

/Region
   |
   +--/CF
        |
        +--/HFile

KeyValueHeap

• Maintains PriorityQueue of “lower” Scanners

• TopScanner in the Queue has the next KV
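
A minimal sketch of the heap idea, assuming string keys instead of KeyValues. `ScannerHeapSketch` and `PeekScanner` are hypothetical names for this illustration and are not the HBase classes; the point is only that the scanner at the top of the queue always holds the next key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class ScannerHeapSketch {
    // Hypothetical stand-in for a "lower" scanner: a sorted stream with one-element lookahead.
    static final class PeekScanner {
        private final Iterator<String> it;
        private String current;

        PeekScanner(List<String> sortedKeys) {
            this.it = sortedKeys.iterator();
            this.current = it.hasNext() ? it.next() : null;
        }

        String peek() { return current; }

        String next() {
            String ret = current;
            current = it.hasNext() ? it.next() : null;
            return ret;
        }
    }

    // Scanners are ordered by the key they would return next.
    private final PriorityQueue<PeekScanner> heap =
        new PriorityQueue<PeekScanner>((a, b) -> a.peek().compareTo(b.peek()));

    ScannerHeapSketch(List<PeekScanner> scanners) {
        for (PeekScanner s : scanners) {
            if (s.peek() != null) heap.add(s); // exhausted scanners never enter the queue
        }
    }

    // Returns the smallest remaining key across all scanners, or null when done.
    String next() {
        PeekScanner top = heap.poll();
        if (top == null) return null;
        String ret = top.next();
        if (top.peek() != null) heap.add(top); // re-insert under its new head key
        return ret;
    }

    static List<String> mergeAll(List<List<String>> sortedInputs) {
        List<PeekScanner> scanners = new ArrayList<>();
        for (List<String> in : sortedInputs) scanners.add(new PeekScanner(in));
        ScannerHeapSketch merged = new ScannerHeapSketch(scanners);
        List<String> out = new ArrayList<>();
        for (String k = merged.next(); k != null; k = merged.next()) out.add(k);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(mergeAll(Arrays.asList(
            Arrays.asList("a", "d"), Arrays.asList("b", "c"), Arrays.asList("e"))));
    }
}
```

Because a heap can itself be treated as a scanner, the same structure can be stacked: one heap per store over its files, and one heap per region over its stores.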


Updates

• All changes are written to the WAL first
• All changes are written to memory (the "MemStore")
• MemStores are flushed to disk (creating a new HFile)
• HFiles are periodically and asynchronously compacted into fewer files
• HFiles are immutable
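
The write path in these bullets can be sketched as a toy store. This is an assumption-laden illustration, not HBase code: `TinyLsmStore` is a made-up name, the "WAL" and "HFiles" are in-memory collections, and flush and compaction run synchronously here, whereas HBase does both asynchronously.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class TinyLsmStore {
    final List<String> wal = new ArrayList<>();                      // stand-in for the write-ahead log
    final List<TreeMap<String, String>> hfiles = new ArrayList<>();  // flushed "files", newest first
    TreeMap<String, String> memstore = new TreeMap<>();              // sorted in-memory buffer
    final int flushThreshold;

    TinyLsmStore(int flushThreshold) { this.flushThreshold = flushThreshold; }

    void put(String key, String value) {
        wal.add(key + "=" + value);   // 1. log the change first
        memstore.put(key, value);     // 2. then apply it in memory
        if (memstore.size() >= flushThreshold) flush();
    }

    // Turn the current memstore into a new "file"; it is never modified afterwards.
    void flush() {
        if (memstore.isEmpty()) return;
        hfiles.add(0, memstore);
        memstore = new TreeMap<>();
    }

    // Merge all files into one new file; values from newer files win.
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (int i = hfiles.size() - 1; i >= 0; i--) merged.putAll(hfiles.get(i)); // oldest first
        hfiles.clear();
        if (!merged.isEmpty()) hfiles.add(merged);
    }

    String get(String key) {
        if (memstore.containsKey(key)) return memstore.get(key);
        for (TreeMap<String, String> file : hfiles) { // newest file first
            if (file.containsKey(key)) return file.get(key);
        }
        return null;
    }

    public static void main(String[] args) {
        TinyLsmStore store = new TinyLsmStore(2);
        store.put("a", "1");
        store.put("b", "2"); // reaches the threshold and flushes
        store.put("a", "3");
        store.put("c", "4"); // second flush
        store.compact();
        System.out.println(store.get("a") + " in " + store.hfiles.size() + " file(s)");
    }
}
```

Since files are never changed in place, a newer value for a key simply shadows the older one until a compaction merges them.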

ROW Atomicity

• Snapshot isolation and locking (per row)
  o Row is locked while memory is changed
  o Each row-update "creates" a new snapshot
  o Each row-read sees exactly one snapshot

• Done with MultiVersionConcurrencyControl (MVCC)

Locking

• All KVs for a “row” are co-located in a region

• Locks are per row

• Stored in-memory at region server

MultiVersionConcurrencyControl

Wikipedia: "implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version"

Note that HBase calls it MultiVersionConsistencyControl

MVCC writing

• Each write gets a monotonically increasing "writenumber"

• Each KV written is tagged with this writenumber (called memstoreTS in HBase)

• "Committing" a writenumber:
  – wait for all prior writes to finish
  – set the current readpoint to the writenumber

MVCC reading

• Each read reads as of a "readpoint"
  – Filters KVs with a newer memstoreTS

• This is per regionserver (cheap in-memory data structures)
  – if the regionserver dies, the current read is lost anyway

• but... "writenumbers" are persisted to disk
  – for scanners that outlive a Memstore flush

MVCC

• Readers do not lock

• Transactions are committed strictly serially

• HBase has no client demarcated transactions
  – A transaction does not outlive an RPC
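
The write and read rules above can be sketched in a few lines. `TinyMvcc` is a hypothetical class for illustration only. One deliberate simplification: committing here does not block until prior writes finish; it only refuses to move the readpoint past an unfinished write.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class TinyMvcc {
    static final class Entry {
        final long writeNumber;
        boolean completed;
        Entry(long writeNumber) { this.writeNumber = writeNumber; }
    }

    static final class Cell {
        final String value;
        final long writeNumber; // plays the role of memstoreTS
        Cell(String value, long writeNumber) { this.value = value; this.writeNumber = writeNumber; }
    }

    private final Deque<Entry> writeQueue = new ArrayDeque<>();
    private final List<Cell> cells = new ArrayList<>();
    private long lastWriteNumber = 0;
    private long readPoint = 0;

    // Hand out the next write number and remember the write as pending.
    synchronized Entry begin() {
        Entry e = new Entry(++lastWriteNumber);
        writeQueue.addLast(e);
        return e;
    }

    // Store a value tagged with the write number; it is not visible yet.
    synchronized void write(Entry e, String value) {
        cells.add(new Cell(value, e.writeNumber));
    }

    // Mark e completed and advance the readpoint over the completed prefix of the queue.
    synchronized void complete(Entry e) {
        e.completed = true;
        while (!writeQueue.isEmpty() && writeQueue.peekFirst().completed) {
            readPoint = writeQueue.removeFirst().writeNumber;
        }
    }

    synchronized long readPoint() { return readPoint; }

    // A read sees only cells whose write number is at or below the given readpoint.
    synchronized List<String> readAsOf(long point) {
        List<String> out = new ArrayList<>();
        for (Cell c : cells) {
            if (c.writeNumber <= point) out.add(c.value);
        }
        return out;
    }

    public static void main(String[] args) {
        TinyMvcc mvcc = new TinyMvcc();
        Entry w1 = mvcc.begin();
        Entry w2 = mvcc.begin();
        mvcc.write(w1, "first");
        mvcc.write(w2, "second");
        mvcc.complete(w2); // w1 is still pending, so nothing becomes visible
        System.out.println(mvcc.readAsOf(mvcc.readPoint()));
        mvcc.complete(w1); // now both writes become visible, in order
        System.out.println(mvcc.readAsOf(mvcc.readPoint()));
    }
}
```

Readers take no row locks in this scheme; they only compare write numbers against the readpoint they started with.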

Anatomy of an "update"

1. Acquire an MVCC "writenumber"
2. Lock the "row" (the row-key)
3. Tag all KVs with the writenumber
4. Write change to the WAL and sync to file system
5. Apply update in memory ("Memstore")
6. Unlock "row"
7. Roll MVCC forward -> now change is visible
8. If Memstore reaches configurable size, take a snapshot, flush it to disk (asynchronously)


1. Acquire an MVCC "writenumber"
2. Lock the "row" (the row-key)
3. Tag all KVs with the writenumber
4. Write change to the WAL and sync to file system
5. Apply update in memory ("Memstore")
6. Unlock "row"
   o 5.5 Sync WAL to HDFS without the row lock held
     If that fails, undo changes in Memstore;
     that works because changes are not visible yet
7. Roll MVCC forward -> now change is visible
8. If Memstore reaches configurable size, take a snapshot, flush it to disk (asynchronously)


Puts are optimized


1. Acquire an MVCC "writenumber"
2. Lock as many "rows" as possible
3. Tag all KVs with the writenumber
4. Write all changes to the WAL and sync to file system
5. Apply all updates in memory ("Memstore")
6. Unlock all "rows"
   o 5.5 Sync WAL to HDFS without the row locks held
     If that fails, undo changes in Memstore;
     that works because changes are not visible yet
7. Roll MVCC forward -> now changes are visible
8. If Memstore reaches configurable size, take a snapshot, flush it to disk (asynchronously)


Puts are optimized, and batched

Undo

• In-memory only

• Changes are not visible until MVCC is rolled

• Changes are tagged with the writenumber

• Undo removes KVs tagged with the writenumber
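
A small sketch of why undo is cheap, assuming the memstore is just a list of tagged entries. `UndoSketch` and `TaggedKv` are hypothetical names; the sketch only shows removal by write number, which is safe because the readpoint has not yet reached that number.

```java
import java.util.ArrayList;
import java.util.List;

final class UndoSketch {
    static final class TaggedKv {
        final String key;
        final long writeNumber;
        TaggedKv(String key, long writeNumber) { this.key = key; this.writeNumber = writeNumber; }
    }

    // Remove every KV tagged with the failed write number; returns how many were removed.
    static int rollback(List<TaggedKv> memstore, long writeNumber) {
        int before = memstore.size();
        memstore.removeIf(kv -> kv.writeNumber == writeNumber);
        return before - memstore.size();
    }

    public static void main(String[] args) {
        List<TaggedKv> memstore = new ArrayList<>();
        memstore.add(new TaggedKv("row1/a", 7));
        memstore.add(new TaggedKv("row1/b", 8));
        memstore.add(new TaggedKv("row2/a", 8));
        System.out.println(rollback(memstore, 8) + " removed, " + memstore.size() + " left");
    }
}
```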

Deletes

• Nothing is deleted in place
• A Delete sets a "tombstone marker" with a timestamp
• Upon compaction deleted KeyValues are removed
• Upon major compactions the tombstones are removed
• Three different tombstone marker scopes (all within a row)
  o version - mark a specific version of a column as deleted
  o column - mark all versions of a column as deleted
  o column family - mark all versions of all columns of a column family as deleted

Deletes, cont

• Delete markers always sort before KVs

• A scanner remembers markers it encounters

• Each KV is checked against remembered markers

• Only one-pass is required for scanning

• Deletes are just KVs stored in HFiles
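
The one-pass idea can be sketched for a single row, assuming string qualifiers and only column and version markers (family markers are left out to keep it short). `DeleteScanSketch` and its types are hypothetical and far simpler than the real tracker; the sort order is the part that matters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class DeleteScanSketch {
    // Declaration order doubles as sort order at equal (qualifier, timestamp): markers first.
    enum Type { DELETE_COLUMN, DELETE_VERSION, PUT }

    static final class Cell {
        final String qualifier;
        final long ts;
        final Type type;
        Cell(String qualifier, long ts, Type type) {
            this.qualifier = qualifier;
            this.ts = ts;
            this.type = type;
        }
    }

    // Qualifier ascending, timestamp descending (newest first), markers before puts.
    static final Comparator<Cell> ORDER = (a, b) -> {
        int c = a.qualifier.compareTo(b.qualifier);
        if (c != 0) return c;
        c = Long.compare(b.ts, a.ts);
        if (c != 0) return c;
        return a.type.compareTo(b.type);
    };

    // Single pass: remember markers as they appear, check each put against them.
    static List<String> visible(List<Cell> cells) {
        List<Cell> sorted = new ArrayList<>(cells);
        sorted.sort(ORDER);
        List<String> out = new ArrayList<>();
        String columnMarker = null;
        long columnMarkerTs = 0;
        String versionMarker = null;
        long versionMarkerTs = 0;
        for (Cell c : sorted) {
            if (c.type == Type.DELETE_COLUMN) {
                // The first marker seen for a column is the newest; older ones add nothing.
                if (!c.qualifier.equals(columnMarker)) {
                    columnMarker = c.qualifier;
                    columnMarkerTs = c.ts;
                }
            } else if (c.type == Type.DELETE_VERSION) {
                versionMarker = c.qualifier;
                versionMarkerTs = c.ts;
            } else {
                boolean byColumn = c.qualifier.equals(columnMarker) && c.ts <= columnMarkerTs;
                boolean byVersion = c.qualifier.equals(versionMarker) && c.ts == versionMarkerTs;
                if (!byColumn && !byVersion) out.add(c.qualifier + "@" + c.ts);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(visible(Arrays.asList(
            new Cell("a", 3, Type.PUT), new Cell("a", 2, Type.PUT), new Cell("a", 1, Type.PUT),
            new Cell("a", 2, Type.DELETE_COLUMN),
            new Cell("b", 5, Type.PUT), new Cell("b", 4, Type.PUT),
            new Cell("b", 5, Type.DELETE_VERSION))));
    }
}
```

Because a marker always sorts ahead of every KV it affects, the scanner never has to look back.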

Let’s look at some code!

MVCC

MultiVersionConsistencyControl.java

public WriteEntry beginMemstoreInsert() {
  synchronized (writeQueue) {
    long nextWriteNumber = ++memstoreWrite;
    WriteEntry e = new WriteEntry(nextWriteNumber);
    writeQueue.add(e);
    return e;
  }
}

public void completeMemstoreInsert(WriteEntry e) {
  advanceMemstore(e);
  waitForRead(e);
}

Acquire a new Writenumber

Roll forward the readpoint

Wait for prior transactions to finish


boolean advanceMemstore(WriteEntry e) {
  synchronized (writeQueue) {
    e.markCompleted();

    long nextReadValue = -1;
    while (!writeQueue.isEmpty()) {
      WriteEntry queueFirst = writeQueue.getFirst();
      ...
      if (queueFirst.isCompleted()) {
        nextReadValue = queueFirst.getWriteNumber();
        writeQueue.removeFirst();
      } else {
        break;
      }
    }
    ...

Roll forward completed transactions. Ordering is preserved.


    ...
    if (nextReadValue > 0) {
      synchronized (readWaiters) {
        memstoreRead = nextReadValue;
        readWaiters.notifyAll();
      }
    }
    if (memstoreRead >= e.getWriteNumber()) {
      return true;
    }
    return false;
  }
}

Notify later transactions


public void waitForRead(WriteEntry e) {
  boolean interrupted = false;
  synchronized (readWaiters) {
    while (memstoreRead < e.getWriteNumber()) {
      try {
        readWaiters.wait(0);
      } catch (InterruptedException ie) {
        interrupted = true;
      }
    }
  }
  if (interrupted) Thread.currentThread().interrupt();
}

Wait until write entry was applied.

Scanning

RegionScanner, creation

RegionScannerImpl(Scan scan, List<KeyValueScanner> additionalScanners) {
  ...
  IsolationLevel isolationLevel = scan.getIsolationLevel();
  synchronized(scannerReadPoints) {
    if (isolationLevel == IsolationLevel.READ_UNCOMMITTED) {
      // This scan can read even uncommitted transactions
      this.readPt = Long.MAX_VALUE;
      MVCC.setThreadReadPoint(this.readPt);
    } else {
      this.readPt = MVCC.resetThreadReadPoint(mvcc);
    }
    scannerReadPoints.put(this, this.readPt);
  }
  ...

MVCC protocol

RegionScanner

  ...
  for (Map.Entry<byte[], NavigableSet<byte[]>> entry :
      scan.getFamilyMap().entrySet()) {
    Store store = stores.get(entry.getKey());
    StoreScanner scanner = store.getScanner(...);
    scanners.add(scanner);
  }
  this.storeHeap = new KeyValueHeap(scanners, comparator);
}

Get all StoreScanners

Heap of StoreScanners

RegionScanner, cont.

public synchronized boolean next(List<KeyValue> outResults, int limit) {
  ...
  MVCC.setThreadReadPoint(this.readPt);
  boolean returnResult = nextInternal(limit);
  ...
}

private boolean nextInternal(int limit) throws IOException {
  if (isStopRow(currentRow)) {
    ...
    return false;
  } else {
    byte [] nextRow;
    do {
      this.storeHeap.next(results, limit - results.size());
    } while (Bytes.equals(currentRow, nextRow = peekRow()));

    final boolean stopRow = isStopRow(nextRow);
  }
  ...
}

MVCC

Get next KV from StoreScanners

StoreScanner

public synchronized boolean next(…) {
  ...
  LOOP: while((kv = this.heap.peek()) != null) {
    ScanQueryMatcher.MatchCode qcode = matcher.match(kv);
    switch(qcode) {
      case INCLUDE:
        results.add(kv);
        this.heap.next();
        ...
        continue;
      case DONE_SCAN:
        ...
      case SEEK_NEXT_ROW:
        reseek(matcher.getKeyForNextRow(kv));
        break;
      case SEEK_NEXT_COL:
        ...
      case SKIP:
        this.heap.next();
        break;
      case SEEK_NEXT_USING_HINT:
        KeyValue nextKV = matcher.getNextKeyHint(kv);
        ...
        reseek(nextKV);
        break;
      ...
    }
  }
}

Heap of Memstore/StoreFile scanners

What to do with the KV:
• Versions
• TTL
• Deletes

Seek Hints

• Allow skipping many KVs without “touching”

• Seek-to(KV) instead of skip, skip, skip, …

• Used internally (for deletes, skipping older versions, TTL)

• Used by filters
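
The difference between skipping and seeking can be sketched with a sorted set of string keys. This is an illustration under stated assumptions, not the HBase scanner API: `SeekHintSketch` and its methods are made-up names, and `NavigableSet.ceiling` stands in for a real seek.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

final class SeekHintSketch {
    // Skip, skip, skip: count every key visited before reaching the target.
    static int keysTouchedBySkipping(NavigableSet<String> keys, String target) {
        int touched = 0;
        for (String k : keys) {
            if (k.compareTo(target) >= 0) break;
            touched++;
        }
        return touched;
    }

    // Seek: jump straight to the first key at or after the target.
    static String seekTo(NavigableSet<String> keys, String target) {
        return keys.ceiling(target);
    }

    // 1000 columns in row1, followed by a single column in row2.
    static NavigableSet<String> sampleKeys() {
        NavigableSet<String> keys = new TreeSet<>();
        for (int i = 0; i < 1000; i++) keys.add(String.format("row1/c%03d", i));
        keys.add("row2/c000");
        return keys;
    }

    public static void main(String[] args) {
        NavigableSet<String> keys = sampleKeys();
        System.out.println("touched by skipping: " + keysTouchedBySkipping(keys, "row2"));
        System.out.println("seek lands on: " + seekTo(keys, "row2"));
    }
}
```

A hint such as "next row" lets the scanner pay for one ordered lookup instead of visiting every remaining column of the current row.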

Memstore Scanner

public synchronized KeyValue next() {
  if (theNext == null) {
    return null;
  }

  final KeyValue ret = theNext;

  // Advance one of the iterators
  if (theNext == kvsetNextRow) {
    kvsetNextRow = getNext(kvsetIt);
  } else {
    snapshotNextRow = getNext(snapshotIt);
  }

  // Calculate the next value
  theNext = getLowest(kvsetNextRow, snapshotNextRow);

  return ret;
}

Snapshot during flushes

KeyValueSkipListSet.iterator()

Memstore Scanner

protected KeyValue getNext(Iterator<KeyValue> it) {

long readPoint = MVCC.getThreadReadPoint();

while (it.hasNext()) {

KeyValue v = it.next();

if (v.getMemstoreTS() <= readPoint) {

return v;

}

}

return null;

}

MVCC

Memstore Scanner

@Override
public synchronized boolean seek(KeyValue key) {
  ...
  // kvset and snapshot will never be null.
  // if tailSet can't find anything, SortedSet is empty (not null).
  kvTail = kvsetAtCreation.tailSet(key);
  snapshotTail = snapshotAtCreation.tailSet(key);

  return seekInSubLists(key);
}

For seek find the right tailSet

seekInSubLists() almost identical to next()

StoreFileScanner

private final HFileScanner hfs;
private KeyValue cur = null;
...
public KeyValue next() throws IOException {
  KeyValue retKey = cur;

  try {
    // only seek if we aren't at the end.
    // cur == null implies 'end'.
    if (cur != null) {
      hfs.next();
      cur = hfs.getKeyValue();
      skipKVsNewerThanReadpoint();
    }
  } catch(IOException e) {
    throw new IOException("Could not iterate " + this, e);
  }
  return retKey;
}

HFileScanner/Reader

@Override
public boolean next() throws IOException {
  ...
  blockBuffer.position(...);
  ...
  if (blockBuffer.remaining() <= 0) {
    long lastDataBlockOffset =
        reader.getTrailer().getLastDataBlockOffset();

    // read the next block
    HFileBlock nextBlock = readNextDataBlock();
    if (nextBlock == null) {
      return false;
    }

    updateCurrBlock(nextBlock);
    return true;
  }

  // We are still in the same block.
  readKeyValueLen();
  return true;
}

Still on current block?

If not, read the next block

Mark the next KV in the buffer

Puts

Batch Put

private long doMiniBatchPut(BatchOperationInProgress<…> batchOp) {
  WALEdit walEdit = new WALEdit();
  ...
  MultiVersionConsistencyControl.WriteEntry w = null;
  ...
  try {
    // STEP 1. Try to acquire as many locks as we can

    // STEP 2. Update any LATEST_TIMESTAMP timestamps

    // Acquire the latest mvcc number
    w = mvcc.beginMemstoreInsert();

    // STEP 3. Write back to memstore
    for (int i = firstIndex; i < lastIndexExclusive; i++) {
      addedSize += applyFamilyMapToMemstore(familyMaps[i], w);
    }

    // STEP 4. Build WAL edit
    for (int i = firstIndex; i < lastIndexExclusive; i++) {
      addFamilyMapToWALEdit(familyMaps[i], walEdit);
    }

Begin transaction

Batch Put, cont.

    // STEP 5. Append the edit to WAL. Do not sync wal.
    txid = this.log.appendNoSync(regionInfo, …, walEdit, …);

    // STEP 6. Release row locks, etc.
    this.updatesLock.readLock().unlock();
    for (Integer toRelease : acquiredLocks) {
      releaseRowLock(toRelease);
    }

    // STEP 7. Sync wal.
    this.log.sync(txid);
    walSyncSuccessful = true;

    // STEP 8. Advance mvcc.
    // This will make this put visible to scanners and getters.
    if (w != null) {
      mvcc.completeMemstoreInsert(w);
      w = null;
    }
    ...
    return addedSize;
  }

Write WAL record but don’t sync!

Sync after locks are released

Commit

Guard against concurrent flushes

Batch Put, something went wrong

  } finally {
    if (!walSyncSuccessful) {
      rollbackMemstore(batchOp, familyMaps, firstIndex, lastIndexExclusive);
    }
    if (w != null) mvcc.completeMemstoreInsert(w);

    if (locked) {
      this.updatesLock.readLock().unlock();
    }

    for (Integer toRelease : acquiredLocks) {
      releaseRowLock(toRelease);
    }
    ...
  }

Always complete the transaction!

Memstore changes

private long applyFamilyMapToMemstore(
    Map<byte[], List<KeyValue>> familyMap,
    WriteEntry localizedWriteEntry) {
  long size = 0;
  boolean freemvcc = false;

  try {
    if (localizedWriteEntry == null) {
      localizedWriteEntry = mvcc.beginMemstoreInsert();
      freemvcc = true;
    }
    for (Map.Entry<…> e : familyMap.entrySet()) {
      byte[] family = e.getKey();
      List<KeyValue> edits = e.getValue();
      ...

Can pass a write entry that spans multiple calls

This begins the transaction (MVCC)

Memstore changes

      ...
      Store store = getStore(family);
      for (KeyValue kv: edits) {
        kv.setMemstoreTS(localizedWriteEntry.getWriteNumber());
        size += store.add(kv);
      }
    }
  } finally {
    if (freemvcc) {
      mvcc.completeMemstoreInsert(localizedWriteEntry);
    }
  }
  return size;
}

Tag KV with write number (MVCC)

This makes the changes visible

Deletes

ScanDeleteTracker (checking markers)

public void add(buffer, qualifierOffset, qualifierLength, ts, type) {
  if (!hasFamilyStamp || ts > familyStamp) {
    if (type == KeyValue.Type.DeleteFamily.getCode()) {
      hasFamilyStamp = true;
      familyStamp = ts;
      return;
    }

    if (deleteBuffer != null && type < deleteType) {
      // same column, so ignore less specific delete
      if (Bytes.equals(deleteBuffer, deleteOffset, deleteLength,
          buffer, qualifierOffset, qualifierLength)) {
        return;
      }
    }
    // new column, or more general delete type
    deleteBuffer = buffer;
    deleteOffset = qualifierOffset;
    deleteLength = qualifierLength;
    deleteType = type;
    deleteTimestamp = ts;
  }
}

Only remember TS for family deletes

A version delete marker can be ignored if there is a column marker already.

Remember the KV

ScanDeleteTracker (checking markers)

public DeleteResult isDeleted(buffer, qualifierOffset, qualifierLength, timestamp) {
  if (hasFamilyStamp && timestamp <= familyStamp) {
    return DeleteResult.FAMILY_DELETED;
  }
  if (deleteBuffer != null) {
    int ret = Bytes.compareTo(deleteBuffer, deleteOffset, deleteLength,
        buffer, qualifierOffset, qualifierLength);
    if (ret == 0) {
      if (deleteType == KeyValue.Type.DeleteColumn.getCode()) {
        return DeleteResult.COLUMN_DELETED;
      }
      // If the timestamp is the same, keep this one
      if (timestamp == deleteTimestamp) {
        return DeleteResult.VERSION_DELETED;
      }
      // different timestamp, let's clear the buffer.
      deleteBuffer = null;
    } else if (ret < 0) {
      // Next column case.
      deleteBuffer = null;
    } else {
      throw new IllegalStateException(...);
    }
  }
  return DeleteResult.NOT_DELETED;
}

Family marker case

Column marker case

Version marker case

HFiles scanned newest TS first

Questions? Comments?

More details on http://hadoop-hbase.blogspot.com
