HBase JDD

HBase: Tame your BigData

Andrzej Grzesik, Lunar Logic Polska

me: present

past

Questions? Ask them right away!

So

HBase

distributed NoSQL

open-source

fast datastore

high-performance BigTable

scalable

Cool and fun to work with!

fault tolerant

built upon Hadoop

Who uses HBase?

Beware! Lots of text

Hadoop stack

via (http://gigaom.com/cloud/with-40m-for-cloudera-how-much-is-hadoop-worth/)

By my count, and it’s very possible I’m missing someone, Hadoop-based startups have raised $104.5 million since May. The same set of companies has raised $159.7 million since 2009 when Cloudera closed its first round.

By comparison, the handful of popular NoSQL database vendors, often lumped into the big data category as well, and similar to Hadoop in their focus on unstructured data, have announced just more than $90 million in funding overall.

Some theory

Architecture (diagram): servers (node, node, node); Hadoop (HDFS, M/R); HBase; ZooKeeper

Related projects:
•  Chukwa
   o  Log analysis tool
•  Hive
   o  Or, if Hive is slow:
•  Pig
   o  High level data manipulation language
   o  Don’t write all MapReduce jobs by hand!

Brewer’s CAP theorem

Pick 2:
•  Consistency
•  Availability
•  Partition Tolerance

Systems placed on the diagram: RDBMS, HBase, CouchDB

Data organisation

Rows are ordered by rowkey and split into regions:
•  Region 1: … Rowkey 1 … Rowkey n
•  Region 2: Rowkey n+1 …
•  …
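The idea of sorted rowkeys cut into regions can be sketched in a few lines. This is a toy illustration only, not HBase's client code: the region names and start keys below are made up, and the only assumption is that each region serves a contiguous range of rowkeys, from its start key (inclusive) to the next region's start key (exclusive).

```python
from bisect import bisect_right

# Hypothetical region layout: start keys are kept sorted, and the empty
# string stands for "lowest possible rowkey".
REGION_START_KEYS = ["", "g", "p"]
REGION_NAMES = ["region-1", "region-2", "region-3"]


def locate_region(rowkey: str) -> str:
    """Return the name of the region whose key range contains rowkey."""
    # bisect_right finds the first start key greater than rowkey;
    # the region just before that position owns the key.
    index = bisect_right(REGION_START_KEYS, rowkey) - 1
    return REGION_NAMES[index]
```

Because rows are sorted, a lookup is a binary search over region boundaries, and rows with neighbouring keys tend to land in the same region.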

Data organisation

Region

Column family Column family

Column family col1, col2, col3

Column family col1, col2

Data organisation ColumnKey

Region

Timestamp

column1 column2 column3

v1@t1 v1@t1 v1@t1

v1@t2 v1@t2

v1@t3
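One way to picture the layout above is a nested map: rowkey, then column, then timestamp, then value. The sketch below is plain Python with no HBase involved; the class name `TinyTable` and its methods are invented for illustration and are not the HBase client API.

```python
from collections import defaultdict


class TinyTable:
    """Toy model of a table: rowkey -> column -> {timestamp: value}."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, rowkey, column, value, timestamp):
        # Writing the same cell again with a new timestamp adds a version
        # instead of overwriting the old one.
        self._rows[rowkey][column][timestamp] = value

    def get(self, rowkey, column, max_versions=1):
        # Return (timestamp, value) pairs, newest first.
        versions = self._rows.get(rowkey, {}).get(column, {})
        newest_first = sorted(versions.items(), reverse=True)
        return newest_first[:max_versions]


table = TinyTable()
table.put("row1", "family:column1", "old value", timestamp=1)
table.put("row1", "family:column1", "new value", timestamp=2)
latest = table.get("row1", "family:column1")
history = table.get("row1", "family:column1", max_versions=3)
```

Here `latest` holds only the newest version of the cell, while `history` holds both versions, newest first.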

Let’s see some code?

Integration testing? Start cluster locally

Use a remote one

?

How to start hacking? Grab Hadoop

http://hadoop.apache.org/

and HBase

http://hbase.apache.org/

Spend an eon learning more than you wanted about plumbing

How to start hacking? Better (faster) way: Grab a VM/packages from

Pro tip: Don’t run HBase on Windows, or face problems. It’s doable under Cygwin (http://hbase.apache.org/docs/r0.20.6/cygwin.html) but VMs are faster!

How to start hacking? Situation will improve, since

Modes

Develop with
•  Local mode
   o  single instance, single JVM

Then
•  Pseudo-distributed
   o  multiple instances, single machine

For production
•  Distributed mode
   o  many nodes

One more: befriend some admins, you will need them

Use cases?

Example from X
•  Customer-provided user data
•  Schema varying between customers
   o  kept in RDBMS
•  Data in HBase

Example from Facebook

HBase drives Facebook messages
•  Key: UserId
•  Column: Word
•  Version: MessageId

For more details see http://www.infoq.com/presentations/HBase-at-Facebook
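One reading of the Key / Column / Version layout is a per-user word index: each row belongs to a user, each column is a word, and every message containing that word is stored as a version of the cell. The sketch below models that reading in plain Python. The function names are made up for illustration, and this is not Facebook's code or the HBase API.

```python
from collections import defaultdict


def new_index():
    # user_id -> word -> set of message ids (standing in for cell versions)
    return defaultdict(lambda: defaultdict(set))


def index_message(index, user_id, message_id, text):
    """Record message_id under every distinct word of the message."""
    for word in set(text.lower().split()):
        index[user_id][word].add(message_id)


def search(index, user_id, word):
    """Return ids of the user's messages containing word, highest id first."""
    message_ids = index.get(user_id, {}).get(word.lower(), set())
    return sorted(message_ids, reverse=True)


idx = new_index()
index_message(idx, "user1", 1, "hello world")
index_message(idx, "user1", 2, "Hello again")
hits = search(idx, "user1", "hello")
```

With this shape, a search for a word is a single lookup in the user's row rather than a scan over all of that user's messages.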

When to use HBase?
•  Lots of key/value data
•  Need good scalability
•  Need good query times with random access
•  Data analytics

What is HBase poor at?
•  transactions
•  relying on indexes
•  security

T(h)ank you!

Useful
•  Brewer’s CAP theorem: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf
•  Google BigTable: http://labs.google.com/papers/bigtable-osdi06.pdf
•  DZone Refcardz: http://refcardz.dzone.com/refcardz/getting-started-apache-hadoop and http://refcardz.dzone.com/refcardz/deploying-hadoop
