HBase Tame your BigData
Andrzej Grzesik, Lunar Logic Polska
me: present
past
Questions? Ask them right away!
So
HBase
distributed NoSQL
open-source
fast datastore
high-performance BigTable
scalable
Cool and fun to work with!
fault tolerant
built upon Hadoop
Who uses HBase?
Beware! Lots of text
Hadoop stack
via (http://gigaom.com/cloud/with-40m-for-cloudera-how-much-is-hadoop-worth/)
By my count, and it’s very possible I’m missing someone, Hadoop-based startups have raised $104.5 million since May. The same set of companies has raised $159.7 million since 2009 when Cloudera closed its first round.
By comparison, the handful of popular NoSQL database vendors, often lumped into the big data category as well, and similar to Hadoop in their focus on unstructured data, have announced just more than $90 million in funding overall.
Some theory
architecture
servers
hadoop hdfs m/r
node node node
HBase
Zookeeper
Related projects:
• Chukwa
o Log analysis tool
• Hive
o Or, if Hive is slow:
• Pig
o High level data manipulation language
o Don’t write all MapReduce jobs by hand!
Brewer’s CAP theorem
RDBMS HBase
CouchDB
Pick 2
Availability
Consistency Partition Tolerance
Data organisation
… Rowkey 1
Rowkey n
Rowkey n+1
…
Region 1 Region 2
…
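The row-to-region layout above can be sketched in a few lines. This is a toy in-memory model, not the HBase client API: it assumes rows are kept sorted by rowkey and that each region owns one contiguous key range. The boundary keys and the names `REGION_START_KEYS`, `REGION_NAMES` and `region_for` are made up for illustration.

```python
from bisect import bisect_right

# Hypothetical region boundaries (assumption, not taken from the slides).
# Each region starts at its start key (inclusive) and ends just before
# the start key of the next region. b"" means "from the very first row".
REGION_START_KEYS = [b"", b"g", b"p"]
REGION_NAMES = ["region-1", "region-2", "region-3"]


def region_for(rowkey: bytes) -> str:
    """Return the name of the region whose key range contains rowkey."""
    # bisect_right counts the start keys that are <= rowkey;
    # the last of those belongs to the region that owns the row.
    index = bisect_right(REGION_START_KEYS, rowkey) - 1
    return REGION_NAMES[index]


if __name__ == "__main__":
    for key in (b"alice", b"grzesik", b"zoe"):
        print(key.decode(), "->", region_for(key))
```

Because neighbouring rowkeys land in the same region, rowkey design decides how evenly reads and writes spread across the cluster.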
Data organisation
Region
Column family Column family
Column family col1, col2, col3
Column family col1, col2
Data organisation
Region, ColumnKey, Timestamp
column1   column2   column3
v1@t1     v1@t1     v1@t1
v1@t2     v1@t2
v1@t3
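The idea that every cell is addressed by row, column and timestamp can be illustrated with a small model. This is a sketch under assumptions, not the HBase API: `ToyTable`, its methods and the `cf:column1` style column names are hypothetical, and timestamps are plain integers.

```python
from collections import defaultdict


class ToyTable:
    """Toy in-memory model of versioned cells: row -> column -> {timestamp: value}."""

    def __init__(self):
        self._cells = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        # Writing the same cell at a new timestamp adds a version
        # instead of overwriting the previous one.
        self._cells[row][column][timestamp] = value

    def get(self, row, column, as_of=None):
        """Return the newest value, or the newest one not later than as_of."""
        versions = self._cells.get(row, {}).get(column, {})
        candidates = [ts for ts in versions if as_of is None or ts <= as_of]
        if not candidates:
            return None
        return versions[max(candidates)]


if __name__ == "__main__":
    table = ToyTable()
    # Mirrors the slide: column1 has three versions, column2 two, column3 one.
    for ts in (1, 2, 3):
        table.put("row1", "cf:column1", ts, f"v1@t{ts}")
    for ts in (1, 2):
        table.put("row1", "cf:column2", ts, f"v1@t{ts}")
    table.put("row1", "cf:column3", 1, "v1@t1")

    print(table.get("row1", "cf:column1"))            # newest version
    print(table.get("row1", "cf:column1", as_of=2))   # version as of t2
```

Reading without a timestamp returns the newest version; passing `as_of` reads the cell as it looked at that point in time.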
Let’s see some code?
Integration testing? Start cluster locally
Use a remote one
?
How to start hacking? Grab Hadoop
http://hadoop.apache.org/
and HBase
http://hbase.apache.org/
Spend an eon learning more than you wanted about plumbing
How to start hacking? Better (faster) way: Grab a VM/packages from
Pro tip: Don’t run HBase on Windows, or face problems. It’s doable (http://hbase.apache.org/docs/r0.20.6/cygwin.html) but VMs are faster!
How to start hacking? Situation will improve, since
Modes
Develop with
• Local mode
o single instance, single JVM
Then
• Pseudo-distributed
o multiple instances, single machine
For production
• Distributed mode
o many nodes
One more: befriend some admins, you will need them
Use cases?
Example from X
• Customer-provided user data
• Schema varying between customers
o kept in RDBMS
• Data in HBase
Example from Facebook
HBase drives Facebook messages
• Key: UserId
• Column: Word
• Version: MessageId
For more details, see http://www.infoq.com/presentations/HBase-at-Facebook
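The key/column/version layout listed for Facebook messages reads like a per-user word index. The sketch below is only a guess at how such a layout could be used, written as a toy in-memory model; it is not Facebook's implementation and not the HBase API. The names `index`, `index_message` and `search`, and the whitespace tokenisation, are assumptions.

```python
from collections import defaultdict

# Toy model of the layout: row key = user id, column = word,
# versions of that cell = ids of the messages containing the word.
index = defaultdict(lambda: defaultdict(set))


def index_message(user_id, message_id, text):
    """Record message_id under every word of text for the given user."""
    for word in text.lower().split():
        index[user_id][word].add(message_id)


def search(user_id, word):
    """Return the sorted ids of the user's messages that contain word."""
    return sorted(index.get(user_id, {}).get(word.lower(), set()))


if __name__ == "__main__":
    index_message(42, 1, "HBase is fun")
    index_message(42, 2, "fun with Hadoop")
    print(search(42, "fun"))
    print(search(42, "hbase"))
```

With this shape, a search for one word in one user's mailbox touches a single row and a single column, which matches the random-access pattern HBase is good at.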
When to use HBase?
• Lots of key/value data
• Need good scalability
• Need good query times with random access
• Data analytics
What is HBase poor at?
• transactions
• relying on indexes
• security
T(h)ank you!
Useful
• Brewer’s CAP theorem
o http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf
• Google BigTable
o http://labs.google.com/papers/bigtable-osdi06.pdf
• DZone Refcardz
o http://refcardz.dzone.com/refcardz/getting-started-apache-hadoop
o http://refcardz.dzone.com/refcardz/deploying-hadoop