HBase schema design
Amandeep Khurana | Solutions Architect
Big Data TechCon, Boston, April 2013
Friday, April 12, 13


Page 2: HBase schema design Big Data TechCon Boston

About me

• Solutions Architect, Cloudera Inc
• Amazon Web Services
• Interested in large scale distributed systems
• Co-author, HBase In Action
• Twitter: amansk

[Book cover: HBase in Action, Nick Dimiduk and Amandeep Khurana, Manning]

Page 3: HBase schema design Big Data TechCon Boston

About the talk

• Data model recap
• Data modeling thought process
• Tools and techniques

Page 4: HBase schema design Big Data TechCon Boston

HBase is ...

• Column family oriented database
  • Column family oriented
  • Tables consisting of rows and columns
• Persisted Map
  • Sparse
  • Multi dimensional
  • Sorted
  • Indexed by rowkey, column and timestamp
• Key Value store
  • [rowkey, col family, col qualifier, timestamp] -> cell value

Page 5: HBase schema design Big Data TechCon Boston

HBase is not ...

• A relational database
  • No SQL query language
  • No joins
  • No secondary indexing
  • No transactions

Page 6: HBase schema design Big Data TechCon Boston

Data Model recap

It's not a relational database system

Page 7: HBase schema design Big Data TechCon Boston

Important terms

• Table
  • Consists of rows and columns
• Row
  • Has a bunch of columns
  • Identified by a rowkey (primary key)
• Column Qualifier
  • Dynamic column name
• Column Family
  • Column groups - logical and physical (similar access pattern)
• Cell
  • The actual element that contains the data for a row-column intersection
• Version
  • Every cell has multiple versions

Page 8: HBase schema design Big Data TechCon Boston

Data coordinates

• Row is addressed using rowkey
• Cell is addressed using [rowkey + family + qualifier]

Page 9: HBase schema design Big Data TechCon Boston

Tabular representation

Users table, Column Family - Info

Rowkey         name                     email              password
GrandpaD       Fyodor Dostoyevsky       [email protected]   abc123
HMS_Surprise   Patrick O'Brien          [email protected]   abc123
SirDoyle       Sir Arthur Conan Doyle   [email protected]   abc123
TheRealMT      Mark Twain               [email protected]   ts1: Langhorne, ts2: abc123

Cells: each cell has multiple versions, typically represented by the timestamp of when they were inserted into the table (ts2 > ts1).
ts1=1329088321289 ts2=1329088818321

The table is lexicographically sorted on the rowkeys.

The coordinates used to identify data in an HBase table are: (1) rowkey, (2) column family, (3) column qualifier, (4) version

Page 10: HBase schema design Big Data TechCon Boston

Key-Value store

Keys                                           Values
[TheRealMT, info, password, 1329088818321]     abc123
[TheRealMT, info, password, 1329088321289]     Langhorne

Each line is a single KeyValue instance.

Page 11: HBase schema design Big Data TechCon Boston

Key-Value store

1. Start with coordinates of full precision

   Key:   [TheRealMT, info, password, 1329088818321]
   Value: abc123

2. Drop version and you're left with a map of version to values

   Key:   [TheRealMT, info, password]
   Value: { 1329088818321 : "abc123",
            1329088321289 : "Langhorne" }

3. Omit qualifier and you have a map of qualifiers to the previous maps

   Key:   [TheRealMT, info]
   Value: { "name" : { 1329088321289 : "Mark Twain" },
            "email" : { 1329088321289 : "[email protected]" },
            "password" : { 1329088818321 : "abc123",
                           1329088321289 : "Langhorne" } }

4. Finally, drop the column family and you have a row, a map of maps

   Key:   [TheRealMT]
   Value: { "info" : {
              "name" : { 1329088321289 : "Mark Twain" },
              "email" : { 1329088321289 : "[email protected]" },
              "password" : { 1329088818321 : "abc123",
                             1329088321289 : "Langhorne" } } }

Page 12: HBase schema design Big Data TechCon Boston

Sorted map of maps

Levels, from the outside in: Rowkey, Column family, Column qualifiers, Versions, Values

{
  "TheRealMT" : {
    "info" : {
      "name" : { 1329088321289 : "Mark Twain" },
      "email" : { 1329088321289 : "[email protected]" },
      "password" : { 1329088818321 : "abc123",
                     1329088321289 : "Langhorne" }
    }
  }
}
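The nested structure above can be modeled with plain Java sorted maps. This is a minimal sketch for intuition only: the class name `SortedMapOfMaps` and its methods are made up for illustration and are not part of the HBase client API. It assumes versions are ordered newest first, so the first entry of the innermost map is the latest value.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model: rowkey -> family -> qualifier -> version -> value.
class SortedMapOfMaps {
    static final NavigableMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>>
            table = new TreeMap<>();

    // Store one cell version; the three outer levels sort ascending,
    // the version level sorts descending (newest timestamp first).
    static void put(String row, String family, String qualifier, long version, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(family, f -> new TreeMap<>())
             .computeIfAbsent(qualifier, q -> new TreeMap<Long, String>(Comparator.reverseOrder()))
             .put(version, value);
    }

    // Return the newest version of a cell; assumes the cell exists.
    static String getLatest(String row, String family, String qualifier) {
        return table.get(row).get(family).get(qualifier).firstEntry().getValue();
    }

    public static void main(String[] args) {
        put("TheRealMT", "info", "password", 1329088321289L, "Langhorne");
        put("TheRealMT", "info", "password", 1329088818321L, "abc123");
        put("TheRealMT", "info", "name", 1329088321289L, "Mark Twain");
        System.out.println(getLatest("TheRealMT", "info", "password")); // prints abc123
    }
}
```

Dropping a coordinate from the right simply returns the map one level up, which mirrors the four steps on the previous slide.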

Page 13: HBase schema design Big Data TechCon Boston

HFiles and physical data model

• HFiles are
  • Immutable
  • Sorted on rowkey + qualifier + timestamp
  • In the context of a column family per region

HFile for the info column family in the users table:

"TheRealMT", "info", "email", 1329088321289, "[email protected]"
"TheRealMT", "info", "name", 1329088321289, "Mark Twain"
"TheRealMT", "info", "password", 1329088818321, "abc123"
"TheRealMT", "info", "password", 1329088321289, "Langhorne"

Page 14: HBase schema design Big Data TechCon Boston

Thinking through the design

... it's a database after-all

Page 15: HBase schema design Big Data TechCon Boston

But isn't HBase schema-less?

• Number of tables
• Rowkey design
• Number of column families per table. What goes into what column family
• Column qualifier names
• What goes into the cells
• Number of versions

Page 16: HBase schema design Big Data TechCon Boston

Rowkeys

• Rowkey design is the single most important aspect of HBase table designs
• The only way to address rows in HBase

Page 17: HBase schema design Big Data TechCon Boston

TwitBase relationships

• Users follow users
• Relationships need to be persisted for usage later on
• Model tables for the expected access patterns
• Read pattern
  • Who does A follow?
  • Who follows A?
  • Does A follow B?
• Write pattern
  • A follows B
  • A unfollows B

Page 18: HBase schema design Big Data TechCon Boston

Start simple

• Adjacency list

Column family: follows
Row key: userid
Column qualifier: followed user number
Cell value: followed userid

Rowkey       follows (col qualifier:cell value)
TheFakeMT    1:TheRealMT, 2:MTFanBoy, 3:Olivia, 4:HRogers
TheRealMT    1:HRogers, 2:Olivia

Page 19: HBase schema design Big Data TechCon Boston

Optimizing the adjacency list

• We need a count
• Where does a new followed user go?

Rowkey       follows
TheFakeMT    1:TheRealMT, 2:MTFanBoy, 3:Olivia, 4:HRogers, count:4
TheRealMT    1:HRogers, 2:Olivia, count:2

Page 20: HBase schema design Big Data TechCon Boston

Adding a new user

Rowkey       follows
TheFakeMT    1:TheRealMT, 2:MTFanBoy, 3:Olivia, 4:HRogers, count:4   (row that needs to be updated)
TheRealMT    1:HRogers, 2:Olivia, count:2

Client code:
Step 1: Get current count
    TheFakeMT : follows: {count -> 4}
Step 2: Update count (increment count)
    TheFakeMT : follows: {count -> 5}
Step 3: Add new entry
    TheFakeMT : follows: {5 -> MTFanBoy2, count -> 5}
Step 4: Write the new data to HBase

Rowkey       follows
TheFakeMT    1:TheRealMT, 2:MTFanBoy, 3:Olivia, 4:HRogers, 5:MTFanBoy2, count:5
TheRealMT    1:HRogers, 2:Olivia, count:2
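The four client steps can be sketched against a plain map standing in for one row of the follows column family. This is an illustrative model, not HBase client code; the class `FollowWithCounter` and the method `addFollowed` are hypothetical names. The comments mark where a second client could interleave, which is the reason this design wants a transaction.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of one row: column qualifier -> cell value.
class FollowWithCounter {
    static void addFollowed(Map<String, String> row, String followedUser) {
        // Step 1: get the current count (a read round trip in a real store).
        int count = Integer.parseInt(row.getOrDefault("count", "0"));
        // Step 2: update the count on the client.
        int next = count + 1;
        // A second client that reads the count at this point sees the same
        // value and would later overwrite the entry written below.
        // Step 3: add the new entry under the next position number.
        row.put(Integer.toString(next), followedUser);
        // Step 4: write the new count back.
        row.put("count", Integer.toString(next));
    }

    public static void main(String[] args) {
        Map<String, String> theFakeMT = new TreeMap<>();
        theFakeMT.put("1", "TheRealMT");
        theFakeMT.put("2", "MTFanBoy");
        theFakeMT.put("3", "Olivia");
        theFakeMT.put("4", "HRogers");
        theFakeMT.put("count", "4");
        addFollowed(theFakeMT, "MTFanBoy2");
        System.out.println(theFakeMT);
    }
}
```

Nothing in this read-modify-write sequence is atomic across clients, so two concurrent follows can both claim position 5.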

Page 21: HBase schema design Big Data TechCon Boston

Transactions == not good

• HBase doesn't have native support (think scale)
• Don't want to complicate client side logic
• Only solution -> simplify schema

Rowkey       follows (col qualifier:cell value)
TheFakeMT    TheRealMT:1, MTFanBoy:1, Olivia:1, HRogers:1
TheRealMT    HRogers:1, Olivia:1

Page 22: HBase schema design Big Data TechCon Boston

Revisit the questions

• Read pattern
  • Who all does A follow?
  • Who all follows A?
  • Does A follow B?
• Write pattern
  • A follows B
  • A unfollows B


Page 24: HBase schema design Big Data TechCon Boston

Denormalization

• Second table for reverse relationship
• Otherwise scan across entire table and affect read performance

[Figure: read performance plotted against write performance, with regions labeled Normalization, Denormalization, Poor design and Dreamland]

Page 25: HBase schema design Big Data TechCon Boston

More optimizations

• Convert into tall-narrow table
• Leverage rowkey indexing better
• Gets -> short Scans

CF: f
Row key: follower + followed
CQ: followed user's name
Cell value: 1

The + in the row key refers to concatenating the two values. You could delimit them using any character you like, eg: A-B or A,B.

Keeping the column family and column qualifier names short reduces the data transferred over the network back to the client. The KeyValue objects become smaller.

Page 26: HBase schema design Big Data TechCon Boston

Tall-narrow table example

• Denormalization is the way to go

Rowkey                 f (col qualifier:cell value)
TheFakeMT+HRogers      Henry Rogers:1
TheFakeMT+MTFanBoy     Amandeep Khurana:1
TheFakeMT+Olivia       Olivia Clemens:1
TheFakeMT+TheRealMT    Mark Twain:1
TheRealMT+HRogers      Henry Rogers:1
TheRealMT+Olivia       Olivia Clemens:1

Putting the user name in the column qualifier saves you from looking up the users table for the name of the user given an id. You can simply list out names or ids while looking at relationships just from this table.

The downside of this is that you need to update the name in all the cells if the user updates their name in their profile. This is classic denormalization.
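How the tall-narrow layout answers the read and write patterns can be sketched with a sorted map keyed by the concatenated rowkey. This is a simplified model under stated assumptions: `TallNarrowFollows` and its methods are hypothetical, the followed user's name is stored as the map value rather than as a column qualifier with cell value 1, and "+" is the delimiter. "Does A follow B?" becomes a single key lookup and "Who does A follow?" becomes a short scan over the keys sharing A's prefix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical stand-in for the tall-narrow table: rowkey -> followed user's name.
class TallNarrowFollows {
    static final NavigableMap<String, String> table = new TreeMap<>();

    static String rowkey(String follower, String followed) {
        return follower + "+" + followed;
    }

    // Write pattern: A follows B / A unfollows B touch exactly one row.
    static void follow(String follower, String followed, String followedName) {
        table.put(rowkey(follower, followed), followedName);
    }

    static void unfollow(String follower, String followed) {
        table.remove(rowkey(follower, followed));
    }

    // Read pattern: Does A follow B? A single-row lookup.
    static boolean follows(String follower, String followed) {
        return table.containsKey(rowkey(follower, followed));
    }

    // Read pattern: Who does A follow? A short scan starting at A's prefix.
    static List<String> followedBy(String follower) {
        String prefix = follower + "+";
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, String> e : table.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // left the contiguous block of rows for this follower
            }
            names.add(e.getValue());
        }
        return names;
    }

    public static void main(String[] args) {
        follow("TheRealMT", "HRogers", "Henry Rogers");
        follow("TheRealMT", "Olivia", "Olivia Clemens");
        follow("TheFakeMT", "TheRealMT", "Mark Twain");
        System.out.println(followedBy("TheRealMT"));
    }
}
```

"Who follows A?" is not served by this key order, which is why the deck keeps a second table for the reverse relationship.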

Page 27: HBase schema design Big Data TechCon Boston

Uniform rowkey length

• MD5 the userids -> 16 bytes + 16 bytes rowkeys
• Better distribution of load

CF: f
Row key: md5(follower)md5(followed)
CQ: followed userid
Cell value: followed user's name

Using MD5 of the user ids gives you fixed lengths instead of variable length user ids. You don't need concatenation logic anymore.
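A fixed-length rowkey of this shape can be built with the JDK alone. The sketch below is an assumption-laden illustration rather than the speaker's code: the class `Md5RowKey` is a hypothetical name and user ids are assumed to be encoded as UTF-8 before hashing. Each MD5 digest is 16 bytes, so the composite key is always 32 bytes and needs no delimiter.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper that builds md5(follower) + md5(followed) rowkeys.
class Md5RowKey {
    static byte[] md5(String userId) {
        try {
            return MessageDigest.getInstance("MD5").digest(userId.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    static byte[] rowkey(String follower, String followed) {
        byte[] key = new byte[32];
        System.arraycopy(md5(follower), 0, key, 0, 16);  // bytes 0..15: follower hash
        System.arraycopy(md5(followed), 0, key, 16, 16); // bytes 16..31: followed hash
        return key;
    }

    public static void main(String[] args) {
        System.out.println(rowkey("TheRealMT", "HRogers").length); // prints 32
    }
}
```

All rows for one follower still share the same 16-byte prefix, so the short prefix scan from the tall-narrow design keeps working.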

Page 28: HBase schema design Big Data TechCon Boston

Uniform rowkey length (continued)

Rowkey                            f (col qualifier:cell value)
MD5(TheFakeMT) MD5(HRogers)       HRogers:Henry Rogers
MD5(TheFakeMT) MD5(MTFanBoy)      MTFanBoy:Amandeep Khurana
MD5(TheFakeMT) MD5(Olivia)        Olivia:Olivia Clemens
MD5(TheFakeMT) MD5(TheRealMT)     TheRealMT:Mark Twain
MD5(TheRealMT) MD5(HRogers)       HRogers:Henry Rogers
MD5(TheRealMT) MD5(Olivia)        Olivia:Olivia Clemens

Page 29: HBase schema design Big Data TechCon Boston

Tall v/s Wide tables storage footprint

Logical representation of an HBase table. We'll look at what it means to Get() row r5 from this table.

Row   CF1             CF2
r1    c1:v1           c1:v9, c6:v2
r2    c1:v2, c3:v6
r3    c2:v3           c5:v6
r4    c2:v4
r5    c1:v1, c3:v5    c7:v8

Actual physical storage of the table:

HFile for CF1        HFile for CF2
r1:CF1:c1:t1:v1      r1:CF2:c1:t1:v9
r2:CF1:c1:t2:v2      r1:CF2:c6:t4:v2
r2:CF1:c3:t3:v6      r3:CF2:c5:t4:v6
r3:CF1:c2:t1:v3      r5:CF2:c7:t3:v8
r4:CF1:c2:t1:v4
r5:CF1:c1:t2:v1
r5:CF1:c3:t3:v5

Result object returned for a Get() on row r5 (KeyValue objects):

r5:CF1:c1:t2:v1
r5:CF1:c3:t3:v5
r5:CF2:c7:t3:v8

Structure of a KeyValue object:

Key:   Row Key, Col Fam, Col Qual, TimeStamp
Value: Cell Value

Page 30: HBase schema design Big Data TechCon Boston

Rowkey design

• Single most important aspect of designing tables
• Depends on expected access patterns
• HFiles are sorted on Key part of KeyValue objects

HFile for the info column family in the users table:

"TheRealMT", "info", "email", 1329088321289, "[email protected]"
"TheRealMT", "info", "name", 1329088321289, "Mark Twain"
"TheRealMT", "info", "password", 1329088818321, "abc123"
"TheRealMT", "info", "password", 1329088321289, "Langhorne"

Page 31: HBase schema design Big Data TechCon Boston

Write optimized

• Distribute writes across the cluster
• Issue most pronounced with time series data
• Hashing
    hash("TheRealMT") -> random byte[]
• Salting
    // numRegionServers: number of region servers; floorMod keeps the salt non-negative
    int salt = Math.floorMod(Long.hashCode(timestamp), numRegionServers);
    byte[] rowkey = Bytes.add(Bytes.toBytes(salt), Bytes.toBytes("|"), Bytes.toBytes(timestamp));
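The salting idea can be tried out without the HBase client library. The following is a minimal sketch, assuming a 4-byte salt, a one-byte "|" separator and an 8-byte timestamp laid out with `java.nio.ByteBuffer`; the class `SaltedRowKey` and the parameter `buckets` (standing in for the number of region servers) are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that prefixes a timestamp rowkey with a salt bucket.
class SaltedRowKey {
    // Map the timestamp to one of `buckets` salt values in [0, buckets).
    static int salt(long timestamp, int buckets) {
        return Math.floorMod(Long.hashCode(timestamp), buckets);
    }

    // Layout: [4-byte salt]["|"][8-byte timestamp], 13 bytes in total.
    static byte[] rowkey(long timestamp, int buckets) {
        byte[] separator = "|".getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(Integer.BYTES + separator.length + Long.BYTES)
                .putInt(salt(timestamp, buckets))
                .put(separator)
                .putLong(timestamp)
                .array();
    }

    public static void main(String[] args) {
        long ts = 1329088321289L;
        System.out.println("salt=" + salt(ts, 4) + " keyLength=" + rowkey(ts, 4).length);
    }
}
```

The trade-off is on the read side: because the salt is derived from the key, a point lookup can recompute it, but a time-range scan has to be issued once per salt bucket and the results merged.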

Page 32: HBase schema design Big Data TechCon Boston

Read optimized

• Data to be accessed together should be stored together
• eg: twit streams - last 10 twits by the users I follow

Rowkeys with the userid first (each user's twits are contiguous):
Olivia1, Olivia2, Olivia5, Olivia7, Olivia9, TheFakeMT2, TheFakeMT3, TheFakeMT4, TheFakeMT5, TheFakeMT6, TheRealMT1, TheRealMT2, TheRealMT5, TheRealMT8

Rowkeys with the number first (twits from all users are interleaved):
1Olivia, 1TheRealMT, 2Olivia, 2TheFakeMT, 2TheRealMT, 3TheFakeMT, 4TheFakeMT, 5Olivia, 5TheFakeMT, 5TheRealMT, 6TheFakeMT, 7Olivia, 8TheRealMT, 9Olivia

Page 33: HBase schema design Big Data TechCon Boston

Relational to Non relational

• Relational concepts
  • Entities
  • Attributes
  • Relationships
• Entities
  • Table is a table. Not much going on there
  • Users table contains... users. Those are entities
  • Good place to start

Page 34: HBase schema design Big Data TechCon Boston

Relational to Non relational

• Attributes
  • Identifying
    • Primary keys. Compound keys
    • Maps to rowkeys
  • Non-identifying
    • Other columns
    • Maps to column qualifiers and cells
• Relationships
  • Foreign keys, junction tables, joins
  • Non-existent in HBase. Instead try to denormalize

Page 35: HBase schema design Big Data TechCon Boston

Nested Entities

• Column Qualifiers can contain data instead of just a column name

[Figure: Nested entities]
hbase table
  row key
    column family
      fixed qualifier → timestamp → value
      variable qualifier → timestamp → value (repeating entity)

Page 36: HBase schema design Big Data TechCon Boston

Schema design summary

• Schema can make or break the performance you get
• Rowkey is the single most important thing
  • Use tricks like hashing and salting
• Denormalize to your advantage
  • There are no joins
• Isolate access patterns
  • Separate CFs or even separate tables
• Shorter names -> lower storage footprint
• Column qualifiers can be used to store data and not just column names
  • Big difference as compared to RDBMS
