HBase Tame your BigData Andrzej Grzesik LunarLogicPolska


Page 1: Hbase jdd

HBase Tame your BigData

Andrzej Grzesik LunarLogicPolska

Page 2: Hbase jdd

me: present

past

Page 3: Hbase jdd

Questions? Ask them right away!

Page 4: Hbase jdd
Page 5: Hbase jdd

So

Page 6: Hbase jdd

HBase

distributed NoSQL

open-source

fast datastore

high-performance BigTable

scalable

Cool and fun to work with!

fault tolerant

built upon Hadoop

Page 7: Hbase jdd

Who uses HBase?

Page 8: Hbase jdd

Beware! Lots of text

Page 9: Hbase jdd

Hadoop stack

via (http://gigaom.com/cloud/with-40m-for-cloudera-how-much-is-hadoop-worth/)

By my count (and it’s very possible I’m missing someone) Hadoop-based startups have raised $104.5 million since May. The same set of companies has raised $159.7 million since 2009 when Cloudera closed its first round.

By comparison, the handful of popular NoSQL database vendors, often lumped into the big data category as well, and similar to Hadoop in their focus on unstructured data, have announced just more than $90 million in funding overall.

Page 10: Hbase jdd

Some theory

Page 11: Hbase jdd

architecture

servers

hadoop hdfs m/r

node node node

HBase

Zookeeper

Page 12: Hbase jdd

Related projects:
•  Chukwa
o  Log analysis tool

•  Hive
o  Or, if Hive is slow:

•  Pig
o  High level data manipulation language
o  Don’t write all MapReduce jobs by hand!

Page 13: Hbase jdd

Brewer’s CAP theorem

RDBMS HBase

CouchDB

Pick 2

Availability

Consistency

Partition Tolerance

Page 14: Hbase jdd

Data organisation

Region 1: Rowkey 1 … Rowkey n

Region 2: Rowkey n+1 …
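
The row-key-to-region layout on this slide can be sketched in a few lines. This is a toy model in plain Python, not the HBase client API: the split keys, the region names and the `region_for` helper are all made up for illustration. It assumes each region holds one contiguous, sorted range of row keys, with an inclusive start key and an exclusive end key.

```python
from bisect import bisect_right

# Hypothetical split points: each region covers [start_key, next_start_key).
# The first region starts at the empty key, so every row key lands somewhere.
REGION_START_KEYS = [b"", b"g", b"n", b"t"]
REGION_NAMES = ["region-1", "region-2", "region-3", "region-4"]


def region_for(row_key: bytes) -> str:
    """Return the name of the region whose key range contains row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before that position owns the key.
    index = bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_NAMES[index]


if __name__ == "__main__":
    for key in (b"alice", b"grzesik", b"zoe"):
        print(key.decode(), "->", region_for(key))
```

Because keys are compared as raw bytes, rows with similar key prefixes end up next to each other, which is why row key design matters so much for scans.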

Page 15: Hbase jdd

Data organisation

Region

Column family Column family

Column family col1, col2, col3

Column family col1, col2

Page 16: Hbase jdd

Data organisation

ColumnKey

Region

Timestamp

column1 column2 column3

v1@t1 v1@t1 v1@t1

v1@t2 v1@t2

v1@t3
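
One way to read the last three slides is as a nested, versioned map: row key, then column (inside a column family), then timestamp, then value. The sketch below models only that idea in plain Python. `ToyTable` and its methods are hypothetical names, not HBase classes, and it ignores regions, storage and any limit on the number of versions kept.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: row key -> "family:column" -> {timestamp: value}."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value, timestamp):
        # Writing at a new timestamp adds a version instead of overwriting.
        self._rows[row_key][column][timestamp] = value

    def get(self, row_key, column, timestamp=None):
        """Return the newest value at or before timestamp (latest if None)."""
        versions = self._rows.get(row_key, {}).get(column, {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]


if __name__ == "__main__":
    table = ToyTable()
    for ts in (1, 2, 3):
        table.put("row1", "cf:column1", f"v1@t{ts}", timestamp=ts)
    print(table.get("row1", "cf:column1"))
    print(table.get("row1", "cf:column1", timestamp=2))
```

A read without a timestamp sees the newest version, while older versions stay addressable by timestamp, matching the `v1@t1`, `v1@t2`, `v1@t3` cells in the diagram.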

Page 17: Hbase jdd

Let’s see some code?

Page 18: Hbase jdd

Integration testing?

Start cluster locally

Use a remote one

?

Page 19: Hbase jdd

How to start hacking?

Grab Hadoop

http://hadoop.apache.org/

and HBase

http://hbase.apache.org/

Spend an eon learning more than you wanted about plumbing

Page 20: Hbase jdd

How to start hacking?

Better (faster) way: Grab a VM/packages from

Page 21: Hbase jdd

Pro tip

Don’t run HBase on Windows, or you will face problems. It’s doable (http://hbase.apache.org/docs/r0.20.6/cygwin.html), but VMs are faster!

Page 22: Hbase jdd

How to start hacking?

Situation will improve, since

Page 23: Hbase jdd

modes

Develop with
•  local mode
o  single instance, single JVM

Then
•  Pseudo-distributed
o  multiple instances, single machine

For production
•  Distributed mode
o  many nodes

Page 24: Hbase jdd

One more

Befriend some admins; you will need them

Page 25: Hbase jdd

Use cases?

Page 26: Hbase jdd

Example from X
•  Customer-provided user data
•  Schema varying between customers

o  kept in RDBMS,

•  Data in HBase

Page 27: Hbase jdd

Example from Facebook

HBase drives Facebook messages

•  Key: UserId
•  Column: Word
•  Version: MessageId

See (http://www.infoq.com/presentations/HBase-at-Facebook) for more details
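
The three bullets on this slide read like a per-user search index: one row per user, one column per word, and one cell version per message containing that word. The sketch below is only an interpretation of those bullets in plain Python, not Facebook's implementation or the HBase API. The helper names, the whitespace tokenizer and the assumption that message ids grow over time are all illustrative.

```python
from collections import defaultdict


def new_index():
    # row key (user id) -> column (word) -> versions (message ids)
    return defaultdict(lambda: defaultdict(set))


def index_message(index, user_id, message_id, text):
    """Record message_id as a new version under every word of the message."""
    for word in set(text.lower().split()):
        index[user_id][word].add(message_id)


def search(index, user_id, word):
    """Return ids of the user's messages containing word, newest first."""
    versions = index.get(user_id, {}).get(word.lower(), ())
    return sorted(versions, reverse=True)


if __name__ == "__main__":
    idx = new_index()
    index_message(idx, "user-1", 1, "hello HBase")
    index_message(idx, "user-1", 2, "HBase rocks")
    print(search(idx, "user-1", "hbase"))
```

With this layout a search touches a single row (the user) and a single column (the word), which is the kind of random-access read HBase is good at.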

Page 28: Hbase jdd

When to use HBase?
•  Lots of key/value data
•  Need good scalability
•  Need good query times with random access
•  Data analytics

Page 29: Hbase jdd

What is HBase poor at?
•  transactions
•  relying on indexes
•  security
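
To make the "indexes" and "transactions" points concrete: rows are found by row key, so looking something up by any other field means either scanning or keeping a second, hand-maintained table keyed by that field. The sketch below uses plain Python dicts as stand-ins for tables. All names are hypothetical, and it assumes the application writes both tables itself with no transaction spanning the two writes.

```python
# Toy "tables": plain dicts standing in for tables keyed by row key.
users = {}           # row key: user id -> {"email": ...}
users_by_email = {}  # row key: email   -> user id (hand-maintained index)


def put_user(user_id, email):
    users[user_id] = {"email": email}
    # The application has to write the index row itself. Nothing here ties
    # the two writes together, so a failure in between leaves them out of sync.
    users_by_email[email] = user_id


def find_by_email_scan(email):
    """Without an index: walk every row in key order and filter."""
    for user_id in sorted(users):
        if users[user_id]["email"] == email:
            return user_id
    return None


def find_by_email_index(email):
    """With the index table: a single lookup by row key."""
    return users_by_email.get(email)


if __name__ == "__main__":
    put_user("u1", "a@example.com")
    put_user("u2", "b@example.com")
    print(find_by_email_scan("b@example.com"), find_by_email_index("b@example.com"))
```

Both lookups return the same answer; the difference is that the scan touches every row, while the index lookup costs an extra write on every update.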

Page 30: Hbase jdd

T(h)ank you!

Page 31: Hbase jdd

Useful

Brewer’s CAP theorem
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf

Google BigTable
http://labs.google.com/papers/bigtable-osdi06.pdf

DZone Refcardz
http://refcardz.dzone.com/refcardz/getting-started-apache-hadoop
http://refcardz.dzone.com/refcardz/deploying-hadoop