Apache HBase: Overview, Hands-On, and Use Cases Apekshit Sharma Dima Spivak

Introduction to HBase - NoSqlNow2015


Page 1: Introduction to HBase - NoSqlNow2015

1©  Cloudera,  Inc.  All  rights  reserved.

Apache HBase: Overview, Hands-On, and Use Cases
Apekshit Sharma
Dima Spivak

Page 2: Introduction to HBase - NoSqlNow2015

Who we are

Apekshit Sharma
• Distributed Software Engineer, Cloudera
• Software Engineer, Google
• Apache HBase contributor
  • Performance improvements and configuration framework

Dima Spivak (@dimaspivak)
• Distributed Software Engineer, Cloudera
• Research Assistant (Physics), University of Minnesota
• Apache HBase contributor
  • Test frameworks and automation

Page 3: Introduction to HBase - NoSqlNow2015

Contents

• Motivation
• Introduction to Apache HBase
  • Data model
• Hands-On: Installation, HBase shell
• Break
• A slightly more in-depth introduction to Apache HBase
  • Apache Hadoop
  • System internals
  • APIs
• Break

Page 4: Introduction to HBase - NoSqlNow2015

Contents

• Industry use cases & patterns
• Augmenting HBase
  • OpenTSDB
  • Apache Phoenix

Page 5: Introduction to HBase - NoSqlNow2015

Motivation

http://www.internetlivestats.com/total-number-of-websites/

Page 6: Introduction to HBase - NoSqlNow2015

Motivation

“We've known it for a long time: the web is big.”
– Jesse Alpert & Nissan Hajaj, Google

http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html

Page 7: Introduction to HBase - NoSqlNow2015

Motivation

• Indexing the internet has challenges:
  • Scale
    • Volume
    • Rate
  • Diversity of content
    • URLs
    • High-resolution images
    • Video
  • Access

Page 8: Introduction to HBase - NoSqlNow2015


Motivation

Page 9: Introduction to HBase - NoSqlNow2015

Motivation

• What if you’re not trying to index the internet?

Page 10: Introduction to HBase - NoSqlNow2015

Motivation

• Data for analytical processing
• User-facing real-time platforms

Page 11: Introduction to HBase - NoSqlNow2015

Introduction to Apache HBase

• “Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.”
  http://hbase.apache.org/

Page 12: Introduction to HBase - NoSqlNow2015

Introduction to Apache HBase

• Apache HBase is an open source, horizontally scalable, consistent, random access, low latency data store built on top of Apache Hadoop.

Page 13: Introduction to HBase - NoSqlNow2015

Apache HBase is open source

• Apache 2.0 License
• A community project with committers and contributors from diverse organizations
  • Cloudera, Facebook, Salesforce.com, Huawei, TrendMicro, eBay, HortonWorks, Intel, Twitter, …
• Code license means anyone can modify and use the code.

Page 14: Introduction to HBase - NoSqlNow2015

Apache HBase is horizontally scalable

• Adding more servers linearly increases performance and capacity
  • Storage capacity
  • Input/output operations
• Store and access data on commodity servers
• Largest cluster: > 3000 nodes, > 100 PB
• Average cluster: 10-40 nodes, 100-400 TB

[Figure: performance (IOPs/storage/throughput) plotted against # of servers]

Page 15: Introduction to HBase - NoSqlNow2015

Apache HBase is horizontally scalable

• Commodity servers (circa 2015)
  • 12-24 hard disks of 2-4 TB each
  • 2 octa-core CPUs, 2-3 GHz
  • 64-512 GB of RAM
  • 10 Gbps Ethernet
  • $5,000-$10,000 / machine

Page 16: Introduction to HBase - NoSqlNow2015

Apache HBase is consistent

• Brewer’s theorem
  • Consistency
  • Availability
  • Partition tolerance

[Figure: HBase positioned relative to these three properties]

Page 17: Introduction to HBase - NoSqlNow2015

Data model

• Data is stored… in a big table
  • Sorted map datastore
• Tables consist of sorted rows, each of which has a primary row key
• Each row has a set of columns
• A column is specified as a column family and column qualifier pair
• A given cell (row, column family:column qualifier) can have different time-stamped values

Page 18: Introduction to HBase - NoSqlNow2015

Data model

Row key   info:height   info:state   roles:hadoop                               roles:hbase
cutting   ‘9ft’         ‘CA’         ‘Founder’
todd      ‘5ft7’        ‘CA’         ‘PMC’ (ts=2011), ‘Committer’ (ts=2010)     ‘Committer’
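The table above can be pictured as a sorted, nested map. Below is a minimal Python sketch, assuming a purely in-memory model. It is not the HBase client API: the helper names `get` and `scan` are hypothetical, and every timestamp other than the two shown on the slide is made up for illustration.

```python
# Hypothetical in-memory model of the slide's table, not the HBase API.
# row key -> "family:qualifier" -> {timestamp: value}
table = {
    "todd": {
        "info:height": {2010: "5ft7"},          # timestamp assumed
        "info:state": {2010: "CA"},             # timestamp assumed
        "roles:hadoop": {2011: "PMC", 2010: "Committer"},
        "roles:hbase": {2010: "Committer"},     # timestamp assumed
    },
    "cutting": {
        "info:height": {2010: "9ft"},           # timestamp assumed
        "info:state": {2010: "CA"},             # timestamp assumed
        "roles:hadoop": {2010: "Founder"},      # timestamp assumed
    },
}


def get(table, row, column, max_versions=1):
    """Return up to max_versions (timestamp, value) pairs, newest first."""
    versions = table.get(row, {}).get(column, {})
    return sorted(versions.items(), reverse=True)[:max_versions]


def scan(table, start_row, stop_row):
    """Return (row key, columns) pairs in sorted key order.

    start_row is inclusive, stop_row is exclusive.
    """
    return [(key, table[key]) for key in sorted(table) if start_row <= key < stop_row]
```

Asking for one version of `roles:hadoop` for `todd` returns the newest value; asking for two returns both time-stamped values, newest first. A missing cell, such as `roles:hbase` for `cutting`, is simply absent rather than stored as an empty value.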

Page 19: Introduction to HBase - NoSqlNow2015

Hands-On

• Apache HBase installation
• The HBase shell

http://pastebin.com/nMkZeq5S

Page 20: Introduction to HBase - NoSqlNow2015

What's up for the next hour?

Understanding the basic architecture of HDFS (Apache Hadoop)

And more Hands-On with Apache HBase.

Page 21: Introduction to HBase - NoSqlNow2015


Break

Page 22: Introduction to HBase - NoSqlNow2015

Understanding the basic architecture of Hadoop (HDFS)

Page 23: Introduction to HBase - NoSqlNow2015

Apache Hadoop

• open source
• commodity servers
• horizontally scalable
• highly fault-tolerant
• massive processing power

Page 24: Introduction to HBase - NoSqlNow2015

Apache Hadoop

2 Core Components

• HDFS (Hadoop Distributed File System)
• MapReduce + YARN

Page 25: Introduction to HBase - NoSqlNow2015


History

2003

Page 26: Introduction to HBase - NoSqlNow2015

GFS

• distributed file system
• commodity servers
• horizontally scalable
• highly fault-tolerant
• proprietary

Page 27: Introduction to HBase - NoSqlNow2015

HDFS

• distributed file system
• commodity servers
• horizontally scalable
• highly fault-tolerant
• open source

Page 28: Introduction to HBase - NoSqlNow2015

HDFS API

• File
  • Open, Close, Read, Write, Move, etc.
• Directories
  • Create, Delete, etc.
• Permissions
  • Owners, Groups, rwx permissions

Page 29: Introduction to HBase - NoSqlNow2015


Basic  Architecture  of  HDFS

Page 30: Introduction to HBase - NoSqlNow2015

Local file system

File system will split the file into blocks

[Figure: a file split into blocks B1, B2, B3, all stored on one disk]

Page 31: Introduction to HBase - NoSqlNow2015

HDFS

File distributed across machines

[Figure: blocks B1, B2, B3; DataNode 1 to DataNode 4]

Page 32: Introduction to HBase - NoSqlNow2015

HDFS: DataNode

[Figure: B1 on DataNode 1, B2 on DataNode 2, B3 on DataNode 3; DataNode 4 also shown]

Page 33: Introduction to HBase - NoSqlNow2015

HDFS: NameNode

[Figure: NameNode alongside DataNode 1 to DataNode 4, which hold blocks B1, B2, B3]

Page 34: Introduction to HBase - NoSqlNow2015

HDFS: Reading a file

1. Client to NameNode: File ‘foo’
2. NameNode: Verify client has permissions to read the file
3. NameNode to client: List of foo’s blocks and DataNodes

[Figure: Client, NameNode, and DataNode 1 to DataNode 4 holding blocks B1, B2, B3]
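The three steps of the read path can be sketched as a toy metadata lookup. This is a minimal sketch, not the HDFS client API: the class, its method names, and the user names `alice` and `bob` are all hypothetical.

```python
class NameNode:
    """Toy NameNode: holds metadata only, never the file contents."""

    def __init__(self):
        self.blocks = {}    # path -> list of (block id, [DataNode names])
        self.readers = {}   # path -> set of users allowed to read

    def add_file(self, path, blocks, readers):
        self.blocks[path] = blocks
        self.readers[path] = set(readers)

    def get_block_locations(self, path, user):
        # Step 2: verify the client has permission to read the file.
        if user not in self.readers.get(path, set()):
            raise PermissionError(f"{user} may not read {path}")
        # Step 3: return the file's blocks and the DataNodes holding them.
        return self.blocks[path]


namenode = NameNode()
namenode.add_file(
    "foo",
    [("B1", ["DataNode 1"]), ("B2", ["DataNode 2"]), ("B3", ["DataNode 3"])],
    readers={"alice"},
)
# Step 1: the client asks for file 'foo'.
locations = namenode.get_block_locations("foo", "alice")
```

With the list in hand, the client reads each block from a DataNode directly; the NameNode stays out of the data path.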

Page 35: Introduction to HBase - NoSqlNow2015

HDFS: Fault tolerance

[Figure: NameNode; DataNode 1 to DataNode 4; blocks B1, B2, B3]

Page 36: Introduction to HBase - NoSqlNow2015

HDFS: Redundancy

[Figure: NameNode; DataNode 1 to DataNode 4 holding three copies each of blocks B1, B2, B3]
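Block splitting and redundancy can be sketched together. This is a minimal sketch under loud assumptions: the round-robin placement below is made up for illustration (real HDFS uses a rack-aware placement policy), and the tiny block size is only there to keep the example readable.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Assumption: illustrative policy only; requires replication <= len(datanodes).
    """
    placement = {}
    for b in range(num_blocks):
        placement[f"B{b + 1}"] = [
            datanodes[(b + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


def readable(placement, failed):
    """True if every block still has at least one replica on a live DataNode."""
    return all(any(dn not in failed for dn in nodes) for nodes in placement.values())


datanodes = ["DataNode 1", "DataNode 2", "DataNode 3", "DataNode 4"]
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), datanodes)
```

With three replicas per block, the file stays readable when any two DataNodes fail; it is lost only if all three holders of some block fail at once.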

Page 37: Introduction to HBase - NoSqlNow2015

HDFS: Horizontal Scalability

[Figure: NameNode; DataNode 1 to DataNode 4 holding three copies each of blocks B1, B2, B3; DataNode 5 added]

Page 38: Introduction to HBase - NoSqlNow2015


Let’s  look  at  some  existing  HDFS  systems...

Page 39: Introduction to HBase - NoSqlNow2015

• Yahoo! HDFS clusters
  • 40k+ servers, 100k+ CPUs, 450 PB data
• Facebook HDFS cluster
  • 15 TB new data per day
  • 1200+ machines, 30 PB in one cluster
• Lots of 5-40 node clusters at companies without petabytes of data (web, retail, finance, telecom, research, government)

Page 40: Introduction to HBase - NoSqlNow2015

But… there are restrictions!
It’s not a magic wand!

Page 41: Introduction to HBase - NoSqlNow2015

Files are append only

• Access model: write-once-read-many
• Cannot change existing contents

Page 42: Introduction to HBase - NoSqlNow2015

Not designed for small files

• Block sizes are in MB (default 128 MB)
• Designed for files that are typically GBs to TBs in size
• Normal file systems have a 4 KB block size!
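A quick worked calculation shows why block size matters. This sketch only does arithmetic on the two block sizes quoted above; the 1 TB file and the one million 1 KB files are made-up inputs.

```python
KB, MB, TB = 1024, 1024 ** 2, 1024 ** 4


def block_count(file_size, block_size):
    """Number of blocks needed for a file (ceiling division)."""
    return -(-file_size // block_size)


# One 1 TB file needs far fewer block entries with 128 MB blocks.
one_tb_hdfs = block_count(1 * TB, 128 * MB)     # 8192 blocks
one_tb_local = block_count(1 * TB, 4 * KB)      # 268435456 blocks

# One million 1 KB files still need one block entry each.
small_files = 1_000_000 * block_count(1 * KB, 128 * MB)   # 1000000 blocks
```

The million small files hold under 1 GB in total, yet they need over a hundred times more block entries than the single 1 TB file. That is one way to read "not designed for small files": the bookkeeping grows with the number of files, not with the amount of data.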

Page 43: Introduction to HBase - NoSqlNow2015


Summary

HDFS  is  a  great  distributed  file  system!

• Store  massive  data

• Scalable

• High  throughput

• Fault  tolerance

Page 44: Introduction to HBase - NoSqlNow2015


MapReduce

• Distributed    processing  framework

• Commodity  machines

• Fault  tolerance

Page 45: Introduction to HBase - NoSqlNow2015

MapReduce

[Figure: Input Data split into Input 1 to Input 4, processed by Map1 to Map4, then by Reduce1 to Reduce3, which write the outputs that make up Output Data]
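The flow in the figure can be imitated on one machine with a word count. This is a minimal single-process sketch, not the Hadoop MapReduce API: the function names, the sample inputs, and the CRC32-based partitioning are assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def map_phase(split):
    """Map: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in split.split()]


def shuffle(mapped, num_reducers):
    """Group values by key and route each key to exactly one reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped:
        for key, value in pairs:
            index = zlib.crc32(key.encode("utf-8")) % num_reducers
            partitions[index][key].append(value)
    return partitions


def reduce_phase(partition):
    """Reduce: sum the counts for each key in one partition."""
    return {key: sum(values) for key, values in partition.items()}


inputs = ["hbase hdfs", "hdfs yarn", "hbase hbase", "yarn hdfs"]   # Input 1..4
mapped = [map_phase(split) for split in inputs]                    # Map1..Map4
outputs = [reduce_phase(p) for p in shuffle(mapped, 3)]            # Reduce1..Reduce3
word_counts = {k: v for out in outputs for k, v in out.items()}
```

Because every occurrence of a key is routed to the same reducer, each reducer can total its keys without talking to the others.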

Page 46: Introduction to HBase - NoSqlNow2015


Page 47: Introduction to HBase - NoSqlNow2015

HBase Architecture

Name                            Weight   UPC        Price
Prego Tomato Sauce              67 Oz    xxxxxxxx   $4.97
Trumoo Lowfat Chocolate Milk    128 Oz   xxxxxxxx   $2.99
Gatorade Lemon-Lime             64 Oz    xxxxxxxx   $3.98

Row key                         info:weight   info:upc   info:price
Prego Tomato Sauce              67 Oz         xxxxxxxx   $4.97
Trumoo Lowfat Chocolate Milk    128 Oz        xxxxxxxx   $2.99
Gatorade Lemon-Lime             64 Oz         xxxxxxxx   $3.98

Page 48: Introduction to HBase - NoSqlNow2015


HBase  Architecture

info:weight info:upc info:price

Prego  Tomato  Sauce 67  Oz xxxxxxxx $4.97

Trumoo Lowfat Chocolate  Milk 128  Oz xxxxxxxx $2.99

Gatorade  Lemon-­‐Lime 64  Oz xxxxxxxx $3.98

Page 49: Introduction to HBase - NoSqlNow2015


HBase  Architecture

info:weight info:upc info:price

Prego  Tomato  Sauce 67  Oz xxxxxxxx $4.97

Trumoo Lowfat Chocolate  Milk 128  Oz xxxxxxxx $2.99

Gatorade  Lemon-­‐Lime 64  Oz xxxxxxxx $3.98

A New  Product 4  Oz xxxxxxxx $9.99

Page 50: Introduction to HBase - NoSqlNow2015


HBase  Architecture

info:weight info:upc info:price

Prego  Tomato  Sauce 67  Oz xxxxxxxx $4.97

Trumoo Lowfat Chocolate  Milk 128  Oz xxxxxxxx $2.99

Gatorade  Lemon-­‐Lime 64  Oz xxxxxxxx $3.98

A New  Product 4  Oz xxxxxxxx $9.99

Yet  Another  New  Product 8  Oz xxxxxxxx $19.99

Page 51: Introduction to HBase - NoSqlNow2015


HBase  Architecture

info:weight info:upc info:price

Prego  Tomato  Sauce 67  Oz xxxxxxxx $4.97

Trumoo Lowfat Chocolate  Milk 128  Oz xxxxxxxx $2.99

Gatorade  Lemon-­‐Lime 64  Oz xxxxxxxx $3.98

A New  Product 4  Oz xxxxxxxx $9.99

Yet  Another  New  Product 8  Oz xxxxxxxx $19.99

Four  More Products   (1) 16  Oz xxxxxxxx $9.99

Four  More Products   (2) 16  Oz xxxxxxxx $9.99

Four  More  Products   (3) 16  Oz xxxxxxxx $9.99

Four  More  Products   (4) 16  Oz xxxxxxxx $9.99

Page 52: Introduction to HBase - NoSqlNow2015

HBase Architecture | Regions

• Tables are chopped up (split) into regions.
• A region is only served by a single “region server” at a time.
• A RegionServer can serve multiple regions.

Page 53: Introduction to HBase - NoSqlNow2015

HBase Architecture | Regions

Row key                         info:weight   info:upc   info:price
Prego Tomato Sauce              67 Oz         xxxxxxxx   $4.97
Trumoo Lowfat Chocolate Milk    128 Oz        xxxxxxxx   $2.99
Gatorade Lemon-Lime             64 Oz         xxxxxxxx   $3.98
Yet Another New Product         8 Oz          xxxxxxxx   $19.99

Row key                         info:weight   info:upc   info:price
A New Product                   4 Oz          xxxxxxxx   $9.99
Four More Products (1)          16 Oz         xxxxxxxx   $9.99
Four More Products (2)          16 Oz         xxxxxxxx   $9.99
Four More Products (3)          16 Oz         xxxxxxxx   $9.99
Four More Products (4)          16 Oz         xxxxxxxx   $9.99

Served by RegionServer on machine 2

Served by RegionServer on machine 3

Page 54: Introduction to HBase - NoSqlNow2015

HBase Architecture | Regions

Row key                         info:price   info:upc   info:weight
Gatorade Lemon-Lime             $3.98        xxxxxxxx   64 Oz
Prego Tomato Sauce              $4.97        xxxxxxxx   67 Oz
Trumoo Lowfat Chocolate Milk    $2.99        xxxxxxxx   128 Oz
Yet Another New Product         $19.99       xxxxxxxx   8 Oz

Row key                         info:price   info:upc   info:weight
A New Product                   $9.99        xxxxxxxx   4 Oz
Four More Products (1)          $9.99        xxxxxxxx   16 Oz
Four More Products (2)          $9.99        xxxxxxxx   16 Oz
Four More Products (3)          $9.99        xxxxxxxx   16 Oz
Four More Products (4)          $9.99        xxxxxxxx   16 Oz

Served by RegionServer on machine 2

Served by RegionServer on machine 3
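Because rows are sorted by key, finding the region for a row is a search over region start keys. This is a minimal sketch, not HBase's lookup code: the split point "G" and the names "RegionServer A" and "RegionServer B" are hypothetical, chosen only so that the two groups of rows above fall into different regions.

```python
import bisect


class RegionLocator:
    """Maps a row key to the server of the region whose key range contains it."""

    def __init__(self, regions):
        # regions: list of (start key, server); "" marks the start of the table.
        self.regions = sorted(regions)
        self.start_keys = [start for start, _ in self.regions]

    def locate(self, row_key):
        # The owning region is the last one whose start key is <= row_key.
        index = bisect.bisect_right(self.start_keys, row_key) - 1
        return self.regions[index][1]


# Hypothetical layout: one split point at "G".
locator = RegionLocator([("", "RegionServer A"), ("G", "RegionServer B")])
server_for_new = locator.locate("A New Product")
server_for_gatorade = locator.locate("Gatorade Lemon-Lime")
```

Each row key belongs to exactly one region, which matches the rule that a region is served by a single RegionServer at a time.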

Page 55: Introduction to HBase - NoSqlNow2015

HBase Architecture

Row key                         info:price   info:upc   info:weight   available:store1   available:store2   available:store3
Gatorade Lemon-Lime             $3.98        xxxxxxxx   64 Oz         Yes                Yes                Yes
Prego Tomato Sauce              $4.97        xxxxxxxx   67 Oz         Yes                No                 Yes
Trumoo Lowfat Chocolate Milk    $2.99        xxxxxxxx   128 Oz        No                 No                 Yes
Yet Another New Product         $19.99       xxxxxxxx   8 Oz          Yes                Yes                Yes

Row key                         info:price   info:upc   info:weight   available:store1   available:store2   available:store3
A New Product                   $9.99        xxxxxxxx   4 Oz          Yes                Yes                Yes
Four More Products (1)          $9.99        xxxxxxxx   16 Oz         Yes                Yes                Yes
Four More Products (2)          $9.99        xxxxxxxx   16 Oz         Yes                Yes                No
Four More Products (3)          $9.99        xxxxxxxx   16 Oz         Yes                Yes                No
Four More Products (4)          $9.99        xxxxxxxx   16 Oz         No                 No                 No


Page 57: Introduction to HBase - NoSqlNow2015

HBase Architecture | Column family

• A column family is a set of related columns.
• Group sets of columns that have similar access patterns.
• Tune read performance.
  • Compression
  • Version retention policies
  • Cache priority

Page 58: Introduction to HBase - NoSqlNow2015

HBase Architecture

Region

Store (column family: info)

Row key                         info:price   info:upc   info:weight
Gatorade Lemon-Lime             $3.98        xxxxxxxx   64 Oz
Prego Tomato Sauce              $4.97        xxxxxxxx   67 Oz
Trumoo Lowfat Chocolate Milk    $2.99        xxxxxxxx   128 Oz

Store (column family: available)

Row key                         available:store1   available:store2   available:store3
Gatorade Lemon-Lime             Yes                Yes                Yes
Prego Tomato Sauce              Yes                No                 Yes
Trumoo Lowfat Chocolate Milk    No                 No                 Yes

Page 59: Introduction to HBase - NoSqlNow2015

HBase Architecture | Write Path

1. Client creates a row to put.
2. Client checks with meta* for which RegionServer hosts this row.
3. Row is written into write-ahead log (WAL).
4. Row is written to MemStore.

Page 60: Introduction to HBase - NoSqlNow2015

HBase Architecture | Write Path

Client (Put): Which RegionServer should host this row?

meta: RegionServer 2

Page 61: Introduction to HBase - NoSqlNow2015

HBase Architecture | Write Path

[Figure: a Put arriving at RegionServer 2, which has a WAL and a Region containing two Stores, each with a MemStore]

Page 62: Introduction to HBase - NoSqlNow2015

HBase Architecture | Write Path

• When MemStore gets full or a flush is triggered, contents of MemStore are flushed to disk.
• HFiles are created.

Page 63: Introduction to HBase - NoSqlNow2015

HBase Architecture | Write Path

[Figure: RegionServer 2 with a WAL and a Region containing two Stores, each with a MemStore and HFiles]

Page 64: Introduction to HBase - NoSqlNow2015

HBase Architecture | Write Path

• Each subsequent write repeats this process.
  • Write to WAL.
  • Write to MemStore.
  • Flush when MemStore fills or a flush is triggered.
  • Create HFiles.
• Lots of HFiles in a Region mean lots of disk seeks on read.
  • Might be better to combine (compact) HFiles.
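The write path above can be sketched as a toy class. This is a minimal in-memory sketch, not HBase code: it collapses a RegionServer to a single Region and Store, and the class name, method names, and the entry-count flush threshold are all assumptions.

```python
class ToyRegionServer:
    """Write path collapsed to a single Region and Store, all in memory."""

    def __init__(self, flush_threshold=2):
        self.wal = []        # append-only log of every edit, kept for recovery
        self.memstore = {}   # in-memory buffer of recent writes
        self.hfiles = []     # immutable sorted files, oldest first
        self.flush_threshold = flush_threshold

    def put(self, row, value):
        self.wal.append((row, value))     # write to the WAL first
        self.memstore[row] = value        # then write to the MemStore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the MemStore out as one sorted HFile and start a new MemStore."""
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, row):
        """Check the MemStore, then HFiles from newest to oldest."""
        if row in self.memstore:
            return self.memstore[row]
        for hfile in reversed(self.hfiles):
            for key, value in hfile:
                if key == row:
                    return value
        return None


server = ToyRegionServer(flush_threshold=2)
for row, value in [("b", 1), ("a", 2), ("a", 3), ("c", 4), ("d", 5)]:
    server.put(row, value)
```

The `get` method shows the cost the last bullet describes: the more HFiles pile up, the more files a read may have to search.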

Page 65: Introduction to HBase - NoSqlNow2015

HBase Architecture

[Figure: RegionServer 2 with a Region containing two Stores, each with a MemStore and HFiles]

Page 66: Introduction to HBase - NoSqlNow2015

HBase Architecture | Compactions

• Minor compactions
  • Merge some HFiles (in a given Store).
• Major compactions
  • Merge all HFiles (in a given Store).
  • Take care of other HBase housekeeping tasks.
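The difference between the two kinds of compaction can be sketched on lists of sorted (row, value) pairs. This is a minimal sketch under assumptions: picking the oldest `count` files for a minor compaction is a made-up policy, and the housekeeping tasks of a major compaction are not modeled.

```python
def compact(hfiles):
    """Merge sorted HFiles (oldest first) into one; the newest value per row wins."""
    merged = {}
    for hfile in hfiles:         # later (newer) files overwrite earlier ones
        merged.update(hfile)
    return sorted(merged.items())


def minor_compaction(hfiles, count=2):
    """Merge only some HFiles: here, the `count` oldest (assumed policy)."""
    return [compact(hfiles[:count])] + hfiles[count:]


def major_compaction(hfiles):
    """Merge all HFiles of a Store into a single HFile."""
    return [compact(hfiles)]


hfiles = [
    [("a", 1), ("c", 1)],   # oldest
    [("a", 2), ("b", 2)],
    [("c", 3)],             # newest
]
after_minor = minor_compaction(hfiles)
after_major = major_compaction(hfiles)
```

After the minor compaction two files remain, so a read of row "c" may still touch both; after the major compaction every row lives in one file.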

Page 67: Introduction to HBase - NoSqlNow2015

HBase Architecture | Compaction

[Figure: RegionServer 2 with a Region containing two Stores, each with a MemStore and HFiles]

Page 68: Introduction to HBase - NoSqlNow2015

HBase Architecture | Minor compaction

[Figure: RegionServer 2 with a Region containing two Stores, each with a MemStore and HFiles]

Page 69: Introduction to HBase - NoSqlNow2015

HBase Architecture | Major compaction

[Figure: RegionServer 2 with a Region containing two Stores, each with a MemStore and HFiles]

Page 70: Introduction to HBase - NoSqlNow2015

HBase Architecture | Compactions

• Minor compactions
  • Controlled by policy (pluggable).
• Major compactions
  • Automatic (by time) or manually triggered.
  • Tend to be run during off-peak times.

Page 71: Introduction to HBase - NoSqlNow2015

HBase Architecture | Splits

• Eventually, Regions become imbalanced.
  • Some grow to be huge, others remain small.
  • Leads to disparate load across RegionServers.
• In these cases, HBase can split a Region into two.
  • Each Region is then available to be moved to a different RegionServer, if necessary.

Page 72: Introduction to HBase - NoSqlNow2015

HBase Architecture | Splits

[Figure: RegionServer 2 holding a Region]

Page 73: Introduction to HBase - NoSqlNow2015

HBase Architecture | Splits

Master: RegionServer 2 is really busy… Maybe another RegionServer can handle one of its Regions?

RegionServer 3: Yeah! Pick me!

[Figure: RegionServer 2 holding two Regions]

Page 74: Introduction to HBase - NoSqlNow2015

HBase Architecture | APIs

• Conventional write path can be accessed through multiple APIs:
  • Java API
    • Most full-featured.
  • REST API
    • Easily accessible.
  • Thrift API
    • Support for many languages (e.g. C, C++, Perl, Ruby, Python).

Page 75: Introduction to HBase - NoSqlNow2015

HBase Architecture | APIs

• This write path is durable, but if you’re importing a lot of data, it can be problematic…
  • Every put goes into WAL, which means disk seeks. Lots of puts mean lots of disk seeks.
  • Lots of data into MemStores means lots of flushing to disk.
  • Lots of flushing to disk might mean lots of compactions.

Page 76: Introduction to HBase - NoSqlNow2015

HBase Architecture | Bulk Loading

• Bypass conventional write path.
  • Extract data from source.
  • Transform data into HFiles directly (done with a MapReduce job).
  • Tell RegionServers to serve these HFiles.

Page 77: Introduction to HBase - NoSqlNow2015


Enough  of  Architecture

Page 78: Introduction to HBase - NoSqlNow2015


What’s  up  next,  Doc?

• Break

• What  have  we  learned  from  the  users

• How can  you  benefit  from that  information

Page 79: Introduction to HBase - NoSqlNow2015


Break

Page 80: Introduction to HBase - NoSqlNow2015


Apache  HBase  “Nascar”  Slide


Page 87: Introduction to HBase - NoSqlNow2015


What  have  we  learned  from  all  these  users?

Page 88: Introduction to HBase - NoSqlNow2015


There  are  some  patterns  which  repeat  often.

Just  like  a  lego  block,  maybe  you  can  fit  one  directly  in  your  system!

Page 89: Introduction to HBase - NoSqlNow2015

Know your ...

Data
● Entity Data
● Time-centric Event Data

Use of data
● Operational
● Analytical

How it goes in and out
● Real-time vs Batch
● Random vs Sequential

Page 90: Introduction to HBase - NoSqlNow2015

Know your data ...

There are primarily two kinds of big data workloads. They have different storage requirements.

• Entity centric data
• Time centric event data

Page 91: Introduction to HBase - NoSqlNow2015

Entity centric data

• Scales up with # of entities
• Billions of distinct entities

Examples: Users, Accounts, Location, Clicks and Metrics, Sensor Data

Page 92: Introduction to HBase - NoSqlNow2015

Time centric event data

• Time-series data points over a period
• Scales up due to finer grained intervals, retention policies, and the passage of time

Examples: Periodic Sensor Data, Stock Ticker Data, Monitoring applications

Page 93: Introduction to HBase - NoSqlNow2015

[Figure: entities e1 to e5 against a time axis ending at Now]

Page 94: Introduction to HBase - NoSqlNow2015

Entities data

Millions of entities = Big Data

[Figure: entities e1 to e5 against a time axis ending at Now]

Page 95: Introduction to HBase - NoSqlNow2015

Time-centric events data

Millions of events = Big Data

[Figure: events along a time axis ending at Now]

Page 96: Introduction to HBase - NoSqlNow2015

Time-centric events about Entities

|Entities| * |Events| = Really Big Data

[Figure: entities e1 to e5 against a time axis ending at Now]

Page 97: Introduction to HBase - NoSqlNow2015


What  questions  do  you  ask?

• Do  you  focus  in  on  entity  first?

OR

• Do  you  focus  in  on  time  ranges  first?

• Your  answer  will  help  you  determine  where  and  how  to  store  your  data.

Page 98: Introduction to HBase - NoSqlNow2015

Entity first questions…

For a given user, show all the messages.

[Figure: entities user1 to user5 against a time axis ending at Now]

Page 99: Introduction to HBase - NoSqlNow2015

Entity first questions…

For a given user, show the last message.

[Figure: entities user1 to user5 against a time axis ending at Now]

Page 100: Introduction to HBase - NoSqlNow2015

Entity first questions…

For a given user, show the last N messages.

[Figure: entities user1 to user5 against a time axis ending at Now]

Page 101: Introduction to HBase - NoSqlNow2015

Entity first questions…

For a given user, show all messages received between time [t1, t2].

[Figure: entities user1 to user5 against a time axis ending at Now, with T1 and T2 marked]

Page 102: Introduction to HBase - NoSqlNow2015

Time centric event first questions…

Find all messages between time [t1, t2].

[Figure: entities user1 to user5 against a time axis ending at Now, with T1 and T2 marked]

Page 103: Introduction to HBase - NoSqlNow2015

Time centric event first questions…

Find all messages between time [t1, t2] for all users.

[Figure: entities user1 to user5 against a time axis ending at Now, with T1 and T2 marked]

Page 104: Introduction to HBase - NoSqlNow2015


How  does  the  data  get  in  and  out  of  HBase?

Page 105: Introduction to HBase - NoSqlNow2015

Getting data in...

[Figure: Put, Incr, Append and Bulk Import feeding into Apache HBase]

Page 106: Introduction to HBase - NoSqlNow2015

Getting data out...

[Figure: Get, Short Scans and Full scan reading from Apache HBase]

Page 107: Introduction to HBase - NoSqlNow2015


So,  what’s  the  best  way?

Page 108: Introduction to HBase - NoSqlNow2015

Depends on your use case

Bottom line: disk I/O takes time.

- Limited disk read-write heads in a cluster

- Use the I/O bandwidth of your cluster efficiently

Page 109: Introduction to HBase - NoSqlNow2015

Apache HBase

            Getting data in      Getting data out
Real-time   Put, Incr, Append    Get, Short Scans
Batch       Bulk Import          Full scan

Page 110: Introduction to HBase - NoSqlNow2015

Let’s dive into use cases ...

Page 111: Introduction to HBase - NoSqlNow2015

Simple Entities

• Purely entity data, no relation between entities
• Often from many different sources
• Could be a well-done de-normalized RDBMS port

[Figure: entities e1 to e5 against a time axis ending at Now]

Page 112: Introduction to HBase - NoSqlNow2015

Simple Entities: Schema

• Row per entity
• Row key => entity ID, or hash of entity ID
• Column => Property / field, possibly timestamp
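The two row key choices can be sketched side by side. This is a minimal sketch: the use of MD5, the helper name `entity_row_key`, and the `book-…` IDs and field names are assumptions for illustration, not something the schema above prescribes.

```python
import hashlib


def entity_row_key(entity_id, hashed=True):
    """Row key is either the entity ID itself or a hash of it (MD5 assumed here)."""
    raw = entity_id.encode("utf-8")
    if hashed:
        return hashlib.md5(raw).hexdigest().encode("ascii")
    return raw


# One row per entity: column => property / field (all names hypothetical).
table = {}
for entity_id, title in [("book-0001", "First title"), ("book-0002", "Second title")]:
    table[entity_row_key(entity_id)] = {"info:id": entity_id, "info:title": title}

plain_keys = sorted(entity_row_key(e, hashed=False) for e in ["book-0001", "book-0002"])
hashed_keys = sorted(table)
```

Plain IDs keep related entities next to each other in sort order, which helps range scans. Hashed keys give that up in exchange for spreading sequentially assigned IDs across the key space.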

Page 113: Introduction to HBase - NoSqlNow2015

Simple Entities: Example

OCLC: Online Computer Library Center

Workloads:
• Look up books → Real-time read
• Add new books one at a time, update information about existing books, issue books → Real-time write
• New library joins the group, import its data → Batch write

Page 114: Introduction to HBase - NoSqlNow2015

Simple Entities: Access Pattern

• Access patterns
  • Writes: Batch / Real-time
  • Reads: Real-time

[Figure: Apache HBase with real-time writes (Put, Incr, Append), batch writes (Bulk Import), and real-time reads (Get, Short Scans)]

Page 115: Introduction to HBase - NoSqlNow2015

Linked Entities (Graph Data)

• Entities are linked to form a graph

[Figure: linked entities e1 to e5 against a time axis ending at Now]

Page 116: Introduction to HBase - NoSqlNow2015

Linked Entities (Graph Data): Schema

• Row per Node (Entity)
• Row key => Node ID (Entity ID)
• Column => “Relationship:OtherNodeID”
• Value => Metadata about relationship
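The graph schema above can be sketched as one row of edge columns per node. This is a minimal in-memory sketch; the helper names, the `friend` and `follows` relationship names, the user IDs, and the metadata fields are hypothetical.

```python
def add_edge(graph, node_id, relationship, other_node_id, metadata):
    """Store one edge as a column named 'relationship:otherNodeID' in the node's row."""
    graph.setdefault(node_id, {})[f"{relationship}:{other_node_id}"] = metadata


def neighbors(graph, node_id, relationship):
    """Return the IDs linked to node_id by the given relationship, sorted."""
    prefix = relationship + ":"
    row = graph.get(node_id, {})
    return sorted(col[len(prefix):] for col in row if col.startswith(prefix))


graph = {}  # row key (node ID) -> {column: metadata about the relationship}
add_edge(graph, "user1", "friend", "user2", {"since": 2012})
add_edge(graph, "user1", "friend", "user3", {"since": 2014})
add_edge(graph, "user1", "follows", "user4", {"since": 2015})
add_edge(graph, "user2", "friend", "user1", {"since": 2012})

friends_of_user1 = neighbors(graph, "user1", "friend")
```

Everything about a node's immediate neighborhood sits in a single row, so a one-row read answers "who are this user's friends?".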

Page 117: Introduction to HBase - NoSqlNow2015

Linked Entities (Graph Data): Example

Social Network (Facebook)

Workloads:
• Get any info about a user → Real-time read
• Update any info about a user → Real-time write
• Limited graph analysis (based on immediate friends) → Batch read

Page 118: Introduction to HBase - NoSqlNow2015

Linked Entities (Graph Data): Access Pattern

• Access patterns
  • Reads: Real-time or Batch
  • Writes: Real-time

[Figure: Apache HBase with real-time writes (Put, Incr, Append), real-time reads (Get, Short Scans), and batch reads (Full scan)]

Page 119: Introduction to HBase - NoSqlNow2015

Time-coupled entities

• Time-centric events about entities
• Focus on entities first

[Figure: entities e1 to e5 against a time axis ending at Now]

Page 120: Introduction to HBase - NoSqlNow2015

Time-coupled entities: Schema

• Row = Entity’s events in a time slice
• Row key = Entity ID + (time / k)
• Column Qualifier = timestamp
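The row key formula above can be sketched as follows. This is a minimal in-memory sketch with assumed details: the `:` separator, the zero-padded bucket number, the helper names, and the slice width `k = 100` are illustrative choices only.

```python
def row_key(entity_id, timestamp, k):
    """Row key = entity ID + (time / k): one row per entity per time slice of width k."""
    return f"{entity_id}:{timestamp // k:010d}"


def put_event(table, entity_id, timestamp, event, k):
    # Column qualifier = timestamp of the event.
    table.setdefault(row_key(entity_id, timestamp, k), {})[timestamp] = event


def events_between(table, entity_id, t1, t2, k):
    """Events for one entity with t1 <= timestamp <= t2, oldest first."""
    found = []
    for bucket_start in range((t1 // k) * k, t2 + 1, k):
        row = table.get(row_key(entity_id, bucket_start, k), {})
        found.extend((ts, ev) for ts, ev in sorted(row.items()) if t1 <= ts <= t2)
    return found


K = 100   # width of a time slice (assumed)
table = {}
for user, ts, message in [("user1", 50, "a"), ("user1", 150, "b"),
                          ("user1", 250, "c"), ("user2", 160, "x")]:
    put_event(table, user, ts, message, K)

recent = events_between(table, "user1", 100, 260, K)
```

An entity-first question such as "all messages for a user between t1 and t2" touches only that user's rows for the relevant time slices; no other user's rows are read.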

Page 121: Introduction to HBase - NoSqlNow2015

Time-coupled entities: Example

Messaging service

Primary Workload
• Sending a message, update metadata (read, star, move, delete) → Real-time write
• Reading a message, get last N messages → Real-time read

Page 122: Introduction to HBase - NoSqlNow2015

Time-coupled entities: Access Pattern

• Access pattern
  • Writes: Real-time
  • Reads: Real-time

[Figure: Apache HBase with real-time writes (Put, Incr, Append) and real-time reads (Get, Short Scans)]

Page 123: Introduction to HBase - NoSqlNow2015

123©  Cloudera,  Inc.  All  rights  reserved.

HBase  is  great!

But  not  for  everything  ...

Page 124: Introduction to HBase - NoSqlNow2015

124©  Cloudera,  Inc.  All  rights  reserved.

Current  HBase  weak  spots

• HBase architecture can handle a lot

• Engineering tradeoffs optimize for some use cases

• HBase  can  still  do  things  it  is  not  optimal  for

• Other  systems  are  fundamentally  more  efficient  for  some  workloads

• Just because it is not good today doesn’t mean it can’t be better tomorrow!

Page 125: Introduction to HBase - NoSqlNow2015

125©  Cloudera,  Inc.  All  rights  reserved.

A  not  so  good  use  case:  Large  Blob  Store

• Saving  large  objects  >50  MB  per  cell

• Examples

• Raw  video  storage  in  HBase

• Problems:

• Write amplification when re-optimizing data for read (compactions on large unchanging data)

• New: Medium Object (MOB) storage is supported (lots of 100 KB to 10 MB cells)

Page 126: Introduction to HBase - NoSqlNow2015

126©  Cloudera,  Inc.  All  rights  reserved.

Another not-so-good use case: Analytic archive

• Store data chronologically, time as primary index

• Row key = timestamp

• Real-time writes

• Column-centric aggregations over all rows

• Schema

• Row key: timestamp

• Column qualifiers: properties with data or counters

• Example

• Machine logs organized by timestamp (causes write hot-spotting)
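
A small simulation of why timestamp row keys cause write hot-spotting. It assumes a table split into key ranges (regions) at three made-up boundaries and uses only the standard library; the boundaries and timestamps are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

def region_for(row_key, split_points):
    """Index of the key range (region) that holds row_key, given sorted split points."""
    return bisect_right(split_points, row_key)

# Hypothetical region boundaries for a table keyed by timestamp.
split_points = ["1439820000", "1439824000", "1439828000"]

# Ten log lines arriving "now": their keys are all near the current time.
incoming = [str(ts) for ts in range(1439828251, 1439828261)]

writes_per_region = Counter(region_for(key, split_points) for key in incoming)
print(writes_per_region)
```

Every new key sorts after the last split point, so all ten writes go to the final region while the others sit idle.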

Page 127: Introduction to HBase - NoSqlNow2015

127©  Cloudera,  Inc.  All  rights  reserved.

Summary

• HBase  is  used  widely  across  industry

• A few patterns learned from these users

• Understanding

• Data : Entity and time-centric events

• Questions you ask of your data

• How data gets in and out

• When  not  to  use  HBase

Page 128: Introduction to HBase - NoSqlNow2015

128©  Cloudera,  Inc.  All  rights  reserved.

Scalable  time  series  database

Page 129: Introduction to HBase - NoSqlNow2015

129©  Cloudera,  Inc.  All  rights  reserved.

Time-Series

Data points for entities over time

Page 130: Introduction to HBase - NoSqlNow2015

130©  Cloudera,  Inc.  All  rights  reserved.

OpenTSDB

• Store trillions of data points

• Millisecond precision

• Keep raw data forever

• Scales to millions of writes per sec

• Generate graphs from GUI

Page 132: Introduction to HBase - NoSqlNow2015

132©  Cloudera,  Inc.  All  rights  reserved.

OpenTSDB :  Use  Cases

• System  Monitoring

• Servers

• Network

• Sensor  Data

• Stock  market  data

Page 133: Introduction to HBase - NoSqlNow2015

133©  Cloudera,  Inc.  All  rights  reserved.

OpenTSDB :  Example

OVH

• Large cloud/hosting provider

• Monitor everything: networking, temperature, voltage, application performance, resource utilization, customer-facing metrics, etc.

• 35 servers, 100k writes/s, 25TB raw data

Yahoo!

• Monitoring application performance and statistics

• 15 servers, 280k writes/s

Source: http://www.slideshare.net/HBaseCon/ecosystem-session-6

Page 134: Introduction to HBase - NoSqlNow2015

134©  Cloudera,  Inc.  All  rights  reserved.

OpenTSDB :  Datapoints

• In OpenTSDB, a data point has

• Metric

• Timestamp

• Value

• Tags (key-value pairs) : to identify the entity

Page 135: Introduction to HBase - NoSqlNow2015

135©  Cloudera,  Inc.  All  rights  reserved.

OpenTSDB :  Datapoints example

• E.g. 10 servers handling web requests

• Metric: num_requests_per_second

• Tags: “host=web-server-1”, “host=web-server-2”, and so on

• Example data points

• num_requests_per_second 1439828251 50 host=web-server-1

• num_requests_per_second 1439828251 72 host=web-server-2

• num_requests_per_second 1439828252 30 host=web-server-3

• … and so on
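
To make the four parts of a data point concrete, here is a small parser for the text form used in the examples above. The DataPoint type and parse_data_point function are illustrative names, not part of OpenTSDB.

```python
from typing import NamedTuple

class DataPoint(NamedTuple):
    metric: str
    timestamp: int  # seconds since the Unix epoch
    value: float
    tags: dict      # tag key -> tag value, identifies the entity

def parse_data_point(line):
    """Split '<metric> <timestamp> <value> <tagk=tagv> ...' into its parts."""
    metric, timestamp, value, *tag_parts = line.split()
    tags = dict(part.split("=", 1) for part in tag_parts)
    return DataPoint(metric, int(timestamp), float(value), tags)

point = parse_data_point("num_requests_per_second 1439828251 50 host=web-server-1")
print(point)
```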

Page 136: Introduction to HBase - NoSqlNow2015

136©  Cloudera,  Inc.  All  rights  reserved.

OpenTSDB :  How  it  works

Image  source:  http://opentsdb.net/overview.html

[Diagram: Sensor1, Sensor2, ... SensorN send data to TSD processes; the TSDs make up OpenTSDB and store the data in HBase]

Page 137: Introduction to HBase - NoSqlNow2015

137©  Cloudera,  Inc.  All  rights  reserved.

OpenTSDB :  Writing  data

• Telnet

• put <metric> <timestamp> <value> <tagk1=tagv1[ tagk2=tagv2 ...tagkN=tagvN]>

• Example: put num_requests_per_second 1439828251 50 host=web-server-1

• HTTP API

• <host>:<port>/api/put

• JSON objects containing data points

• Bulk Import

• Using ‘import’ CLI utility
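
A sketch of what the first two write paths send, built with the standard library only and without contacting a server. The helper names are made up. The JSON field names (metric, timestamp, value, tags) follow the OpenTSDB 2.x HTTP API as I understand it, so check them against the documentation for your version.

```python
import json

def telnet_put_line(metric, timestamp, value, tags):
    """Format one data point as a telnet-style 'put' command."""
    tag_text = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {timestamp} {value} {tag_text}"

def http_put_body(points):
    """Serialize data points as the JSON array sent to <host>:<port>/api/put."""
    return json.dumps([
        {"metric": metric, "timestamp": timestamp, "value": value, "tags": tags}
        for metric, timestamp, value, tags in points
    ])

line = telnet_put_line("num_requests_per_second", 1439828251, 50, {"host": "web-server-1"})
body = http_put_body([("num_requests_per_second", 1439828251, 50, {"host": "web-server-1"})])
print(line)
print(body)
```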

Page 138: Introduction to HBase - NoSqlNow2015

138©  Cloudera,  Inc.  All  rights  reserved.

OpenTSDB :  Reading  data

• OpenTSDB GUI

• Select metrics and tags to generate graphs

• HTTP API

• <host>:<port>/api/query
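
A sketch of building a query URL for the HTTP API. The host name and port are placeholders, and the start and m parameters (with m in the form aggregator:metric{tags}) reflect the OpenTSDB 2.x query syntax as I understand it. Treat the exact parameters as an assumption to verify against the API documentation.

```python
from urllib.parse import urlencode

def query_url(base_url, metric, tags, start="1h-ago", aggregator="sum"):
    """Build an /api/query URL for one metric filtered by tags."""
    tag_filter = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    m = f"{aggregator}:{metric}{{{tag_filter}}}"  # e.g. sum:metric{host=web-server-1}
    return f"{base_url}/api/query?" + urlencode({"start": start, "m": m})

# tsd.example.com:4242 is a hypothetical TSD address.
url = query_url("http://tsd.example.com:4242",
                "num_requests_per_second",
                {"host": "web-server-1"})
print(url)
```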

Page 139: Introduction to HBase - NoSqlNow2015

139©  Cloudera,  Inc.  All  rights  reserved.

OpenTSDB :   Storing  data  – row  key

• Row key is a concatenation of metric, timestamp and tags

• num_requests_per_second1439827200host=web-server-1

• Since data is stored in sorted order, chunking happens in this order:

1. Metric

• Enables fast scan of all time series for a metric

2. Time

• Normalized on 1 hour boundaries

• All data points for an hour are stored in a single row

3. Tags
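
The row key construction can be sketched as follows. This reproduces the human-readable form shown on the slide. As far as I know, OpenTSDB itself stores compact fixed-width IDs in place of the metric and tag strings, so treat the string form as illustrative.

```python
def base_hour(timestamp):
    """Round a Unix timestamp (seconds) down to the start of its hour."""
    return timestamp - (timestamp % 3600)

def readable_row_key(metric, timestamp, tags):
    """Concatenate metric, hour-normalized timestamp and tags, in that order."""
    tag_text = "".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{metric}{base_hour(timestamp)}{tag_text}"

key = readable_row_key("num_requests_per_second", 1439828251, {"host": "web-server-1"})
print(key)  # num_requests_per_second1439827200host=web-server-1
```

Two data points taken a second apart produce the same row key, while a point from the following hour starts a new row.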

Page 140: Introduction to HBase - NoSqlNow2015

140©  Cloudera,  Inc.  All  rights  reserved.

OpenTSDB :   Storing  data  – column

• Offset from timestamp in row key

• Example

• num_requests_per_second 1439828251 50 host=web-server-1

• num_requests_per_second 1439828251 72 host=web-server-2

• num_requests_per_second 1439828252 30 host=web-server-3

Row key                                               Data:1051   Data:1052
num_requests_per_second1439827200host=web-server-1    50
num_requests_per_second1439827200host=web-server-2    72
num_requests_per_second1439827200host=web-server-3                30
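
The table above can be reproduced with a few lines of Python. A dict of dicts stands in for the HBase table and to_cell is an illustrative helper. The column qualifier is the number of seconds between the data point and the hour stored in the row key.

```python
from collections import defaultdict

def to_cell(metric, timestamp, value, host):
    """Map one data point to (row key, column qualifier, value)."""
    base = timestamp - timestamp % 3600   # hour boundary kept in the row key
    row_key = f"{metric}{base}host={host}"
    qualifier = timestamp - base          # offset in seconds within the hour
    return row_key, qualifier, value

points = [
    ("num_requests_per_second", 1439828251, 50, "web-server-1"),
    ("num_requests_per_second", 1439828251, 72, "web-server-2"),
    ("num_requests_per_second", 1439828252, 30, "web-server-3"),
]

rows = defaultdict(dict)  # row key -> {qualifier: value}
for point in points:
    row_key, qualifier, value = to_cell(*point)
    rows[row_key][qualifier] = value

for row_key, cells in sorted(rows.items()):
    print(row_key, cells)
```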

Page 141: Introduction to HBase - NoSqlNow2015

141©  Cloudera,  Inc.  All  rights  reserved.

OpenTSDB : GUI

Page 142: Introduction to HBase - NoSqlNow2015

142©  Cloudera,  Inc.  All  rights  reserved.

• High-performance relational database layer over HBase for low-latency applications

• JDBC  API

Page 143: Introduction to HBase - NoSqlNow2015

143©  Cloudera,  Inc.  All  rights  reserved.

Phoenix  :  Use  Case

Scalability of HBase + SQL interface access

Page 144: Introduction to HBase - NoSqlNow2015

144©  Cloudera,  Inc.  All  rights  reserved.

Phoenix

• Provides typed access to data

• Provides secondary indexes

• Compiles SQL queries to native HBase scans

• Executes scans in parallel

• Directly uses HBase API, server-side hooks and custom filters

• Brings computation to the data

• Pushes the WHERE clause to a server-side filter

• Executes aggregate queries using server-side hooks

Page 145: Introduction to HBase - NoSqlNow2015

145©  Cloudera,  Inc.  All  rights  reserved.

That’s  it  folks!

Page 146: Introduction to HBase - NoSqlNow2015

146© Cloudera, Inc. All rights reserved.

Try Hadoop Now

cloudera.com/live

Page 147: Introduction to HBase - NoSqlNow2015

147© Cloudera, Inc. All rights reserved.

Join  the  Discussion

Get  community  help  or  provide  feedback

cloudera.com/community

Page 148: Introduction to HBase - NoSqlNow2015

148©  Cloudera,  Inc.  All  rights  reserved.

Sources

• A Survey of HBase Application Archetypes. Lars George, Jon Hsieh. http://www.slideshare.net/HBaseCon/case-studies-session-7

• OpenTSDB 2.0. Benoit Sigoure, Chris Larsen. http://www.slideshare.net/HBaseCon/ecosystem-session-6

• Hadoop and HBase: Motivations, Use cases and Trade-offs. Jon Hsieh.

• Phoenix. https://phoenix.apache.org

Page 149: Introduction to HBase - NoSqlNow2015

149©  Cloudera,  Inc.  All  rights  reserved.

Questions  ?