HBase Spring 2014 WPI, Mohamed Eltabakh


HBase: Overview

- HBase is a distributed, column-oriented data store built on top of HDFS.
- HBase is an Apache open source project whose goal is to provide storage for Hadoop distributed computing.
- Data is logically organized into tables, rows, and columns.

HBase: Part of Hadoop's Ecosystem

- HBase is built on top of HDFS.
- HBase files are internally stored in HDFS.

HBase vs. HDFS

- Both are distributed systems that scale to hundreds or thousands of nodes.
- HDFS is good for batch processing (scans over big files), but it is:
  - Not good for record lookup
  - Not good for incremental addition of small batches
  - Not good for updates

HBase vs. HDFS (Cont'd)

- HBase is designed to efficiently address the above points:
  - Fast record lookup
  - Support for record-level insertion
  - Support for updates (not in place)
- HBase updates are done by creating new versions of values.

HBase vs. HDFS (Cont'd)

- If an application has neither random reads nor random writes, stick to HDFS.

HBase Data Model

- HBase is based on Google's Bigtable model.
- Data is stored as key-value pairs.

HBase Logical View

HBase: Keys and Column Families

Each row has a key. Each record is divided into column families, and each column family consists of one or more columns.

- Key
  - Byte array
  - Serves as the primary key for the table
  - Indexed for fast lookup
- Column Family
  - Has a name (string)
  - Contains one or more related columns
- Column
  - Belongs to one column family
  - Included inside the row
  - Written as familyName:columnName
- Version Number
  - Unique within each key
  - By default, the system's timestamp
  - Data type is Long
- Value (Cell)
  - Byte array

[Figure annotations: column family named "contents"; column family named "anchor"; column named "apache.com"]

[Figure annotations: version number for each row; value]

Notes on Data Model

- An HBase schema consists of several tables.
- Each table consists of a set of column families.
  - Columns are not part of the schema.
- HBase has dynamic columns:
  - Column names are encoded inside the cells.
  - Different cells can have different columns.

[Figure annotation: the "Roles" column family has different columns in different cells]

Notes on Data Model (Cont'd)

- The version number can be user-supplied.
  - It does not even have to be inserted in increasing order.
  - Version numbers are unique within each key.
- A table can be very sparse: many cells are empty.
- Keys are indexed as the primary key.

[Figure annotation: the row has two columns, cnnsi.com and my.look.ca]

HBase Physical Model

- Each column family is stored in a separate file (called HTables).
- Key and version numbers are replicated with each column family.
- Empty cells are not stored.
- HBase maintains a multi-level index on values: [figure]
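
One plausible reading of the multi-level index, assuming its levels follow the data model described earlier (row key, then familyName:columnName, then version number), is a set of nested sorted maps. The sketch below is a toy model in plain Java, not HBase code: the class name `MultiLevelIndexSketch` and its methods are made up for illustration, and strings stand in for the byte arrays HBase actually stores. It also shows two properties stated above: empty cells take no space, and versions may be inserted out of order.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model (not HBase code): row key -> "family:column" -> version -> value. */
public class MultiLevelIndexSketch {
    // Every level is a sorted map; an entry exists only if a cell was actually written.
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    /** Stores a value under an explicit version; versions may arrive in any order. */
    public void put(String rowKey, String column, long version, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<>())
            .put(version, value);
    }

    /** Returns the value with the highest version, or null if the cell is empty. */
    public String getLatest(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null) return null;
        NavigableMap<Long, String> versions = row.get(column);
        return versions == null ? null : versions.lastEntry().getValue();
    }

    /** Counts stored (row, column, version) entries; empty cells contribute nothing. */
    public int storedCells() {
        int count = 0;
        for (NavigableMap<String, NavigableMap<Long, String>> row : rows.values()) {
            for (NavigableMap<Long, String> versions : row.values()) count += versions.size();
        }
        return count;
    }

    public static void main(String[] args) {
        MultiLevelIndexSketch table = new MultiLevelIndexSketch();
        table.put("com.cnn.www", "anchor:cnnsi.com", 9L, "CNN");
        table.put("com.cnn.www", "anchor:my.look.ca", 8L, "CNN.com");
        // Hypothetical older version, inserted after the newer one.
        table.put("com.cnn.www", "anchor:cnnsi.com", 3L, "older value");

        System.out.println(table.getLatest("com.cnn.www", "anchor:cnnsi.com"));  // CNN
        System.out.println(table.getLatest("com.cnn.www", "anchor:apache.com")); // null
        System.out.println(table.storedCells());                                 // 3
    }
}
```

Sorted maps are used here because they make "highest version" a single `lastEntry()` call and keep rows ordered by key, which is what range partitioning into regions relies on.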

Example

Column Families

HBase Regions

- Each HTable (column family) is partitioned horizontally into regions.
  - Regions are the counterpart to HDFS blocks.

[Figure annotation: each will be one region]

HBase Architecture

Three Major Components

- The HBaseMaster
  - One master
- The HRegionServer
  - Many region servers
- The HBase client

HBase Components

- Region
  - A subset of a table's rows, like horizontal range partitioning
  - Automatically done
- RegionServer (many slaves)
  - Manages data regions
  - Serves data for reads and writes (using a log)
- Master
  - Responsible for coordinating the slaves
  - Assigns regions, detects failures
  - Admin functions

Big Picture
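
As a rough illustration of how the components above fit together, the following sketch models regions as contiguous row-key ranges, a master-side step that assigns each region to a region server, and a client-side step that finds the server responsible for a row key. It is a toy model in plain Java, not HBase code: the class `RegionRoutingSketch`, its methods, the split points, and the server names are all hypothetical, and how a real client discovers region locations is outside what these slides describe.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model (not HBase code) of range-partitioned regions and row-key routing. */
public class RegionRoutingSketch {
    // Region start key -> server hosting that region. A region covers the keys from
    // its start key up to (but excluding) the next region's start key.
    private final TreeMap<String, String> regionStartToServer = new TreeMap<>();

    /** Master-side step: assign the region beginning at startKey to a server. */
    public void assignRegion(String startKey, String serverName) {
        regionStartToServer.put(startKey, serverName);
    }

    /** Client-side step: find the server whose region contains rowKey. */
    public String locate(String rowKey) {
        Map.Entry<String, String> region = regionStartToServer.floorEntry(rowKey);
        if (region == null) {
            throw new IllegalStateException("no region covers row key " + rowKey);
        }
        return region.getValue();
    }

    public static void main(String[] args) {
        RegionRoutingSketch cluster = new RegionRoutingSketch();
        // Hypothetical split points and server names; "" starts the first region.
        cluster.assignRegion("", "regionserver-1");
        cluster.assignRegion("com.cnn", "regionserver-2");
        cluster.assignRegion("org", "regionserver-3");

        System.out.println(cluster.locate("com.apache.www")); // regionserver-1
        System.out.println(cluster.locate("com.cnn.www"));    // regionserver-2
        System.out.println(cluster.locate("org.example"));    // regionserver-3
    }
}
```

Because rows are kept sorted by key, locating a region is a single floor lookup on the region start keys.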

ZooKeeper

- HBase depends on ZooKeeper.
- By default, HBase manages the ZooKeeper instance (e.g., starts and stops ZooKeeper).
- HMaster and HRegionServers register themselves with ZooKeeper.

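Creating a Table

    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    // config is the cluster Configuration, e.g. from HBaseConfiguration.create().
    HBaseAdmin admin = new HBaseAdmin(config);

    // One descriptor per column family; family names must not contain ':'.
    HColumnDescriptor[] column = new HColumnDescriptor[2];
    column[0] = new HColumnDescriptor("columnFamily1");
    column[1] = new HColumnDescriptor("columnFamily2");

    // Describe the table, attach the families, and create it.
    HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes("MyTable"));
    desc.addFamily(column[0]);
    desc.addFamily(column[1]);
    admin.createTable(desc);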

Operations On Regions: Get()

- Given a key, return the corresponding record.
- For each value, return the highest version.
- You can control the number of versions you want.

Example:

    Select value from table where key='com.apache.www' AND label='anchor:apache.com'

    Row key          Timestamp   Column "anchor:"
    com.apache.www   t12
                     t11
                     t10         anchor:apache.com = "APACHE"
    com.cnn.www      t9          anchor:cnnsi.com = "CNN"
                     t8          anchor:my.look.ca = "CNN.com"
                     t6
                     t5
                     t3

Operations On Regions: Scan()

Example (over the same table as in the Get() example):

    Select value from table where anchor='cnnsi.com'

Operations On Regions: Put()

- Insert a new record (with a new key), or
- Insert a record for an existing key.
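
The three operations above can be sketched over a small in-memory table. This is a toy model in plain Java, not the HBase client API: the class `RegionOpsSketch` and its method signatures are made up for illustration, strings stand in for byte arrays, and the numbers 10, 9, and 8 stand in for the timestamps t10, t9, and t8 from the example table. `put` accepts either an explicit version or defaults to the system timestamp, `get` returns the newest versions first up to a caller-chosen limit, and `scan` walks all rows in key order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model (not the HBase client API) of Put(), Get() and Scan(). */
public class RegionOpsSketch {
    // row key -> column ("family:name") -> version -> value
    private final NavigableMap<String, Map<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    /** Put() with an implicit version: the current system timestamp. */
    public void put(String rowKey, String column, String value) {
        put(rowKey, column, System.currentTimeMillis(), value);
    }

    /** Put() with an explicit version; works for new and for existing row keys. */
    public void put(String rowKey, String column, long version, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<>())
            .put(version, value);
    }

    /** Get(): values of one cell, newest version first, at most maxVersions of them. */
    public List<String> get(String rowKey, String column, int maxVersions) {
        List<String> result = new ArrayList<>();
        Map<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null || !row.containsKey(column)) return result;
        for (String value : row.get(column).descendingMap().values()) {
            if (result.size() == maxVersions) break;
            result.add(value);
        }
        return result;
    }

    /** Scan(): visit rows in key order; keep the newest value wherever the column exists. */
    public Map<String, String> scan(String column) {
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, NavigableMap<Long, String>>> row : rows.entrySet()) {
            NavigableMap<Long, String> versions = row.getValue().get(column);
            if (versions != null) result.put(row.getKey(), versions.lastEntry().getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        RegionOpsSketch table = new RegionOpsSketch();
        table.put("com.apache.www", "anchor:apache.com", 10L, "APACHE");
        table.put("com.cnn.www", "anchor:cnnsi.com", 9L, "CNN");
        table.put("com.cnn.www", "anchor:my.look.ca", 8L, "CNN.com");

        System.out.println(table.get("com.apache.www", "anchor:apache.com", 1)); // [APACHE]
        System.out.println(table.scan("anchor:cnnsi.com"));                      // {com.cnn.www=CNN}
    }
}
```

The difference in cost is visible in the code: `get` goes straight to one row by key, while `scan` has to touch every row to evaluate a condition on a column.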

[Figure annotations for Put(): implicit version number (timestamp); explicit version number]

Operations On Regions: Delete()

- Marking table cells as deleted
- Multiple levels:
  - Can mark an entire column family as deleted
  - Can mark all column families of a given row as deleted

All operations are logged by the RegionServers. The log is flushed periodically.

HBase: Joins

- HBase does not support joins.
- Joins can be done in the application layer, using scan() and get() operations.

Altering a Table

- Disable the table before changing the schema.

Logging Operations
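
The earlier statements that every operation is logged by the RegionServer, that the log is flushed periodically, and that a delete only marks cells as deleted can be sketched as follows. This is a toy model in plain Java, not HBase code: the class `LoggedRegionSketch` and its methods are hypothetical, flushing after a fixed number of entries is an assumption standing in for whatever schedule HBase really uses, and a plain list stands in for durable storage.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy model (not HBase code): log each operation first, then apply it. */
public class LoggedRegionSketch {
    private final List<String> pendingLog = new ArrayList<>(); // logged, not yet flushed
    private final List<String> flushedLog = new ArrayList<>(); // stand-in for durable storage
    private final Map<String, String> cells = new TreeMap<>(); // "rowKey/column" -> value
    private final Set<String> deleted = new HashSet<>();       // cells marked as deleted
    private final int flushEvery;                              // assumed flush policy

    public LoggedRegionSketch(int flushEvery) {
        this.flushEvery = flushEvery;
    }

    public void put(String rowKey, String column, String value) {
        log("PUT " + rowKey + " " + column);
        String cell = rowKey + "/" + column;
        cells.put(cell, value);
        deleted.remove(cell); // in this toy model a later put makes the cell visible again
    }

    /** Delete marks the cell; the stored value is not erased. */
    public void delete(String rowKey, String column) {
        log("DELETE " + rowKey + " " + column);
        deleted.add(rowKey + "/" + column);
    }

    /** Returns the value, or null if the cell is empty or marked as deleted. */
    public String get(String rowKey, String column) {
        String cell = rowKey + "/" + column;
        return deleted.contains(cell) ? null : cells.get(cell);
    }

    public int flushedCount() {
        return flushedLog.size();
    }

    private void log(String entry) {
        pendingLog.add(entry);
        if (pendingLog.size() >= flushEvery) {
            flushedLog.addAll(pendingLog); // periodic flush of the buffered entries
            pendingLog.clear();
        }
    }

    public static void main(String[] args) {
        LoggedRegionSketch region = new LoggedRegionSketch(2);
        region.put("com.cnn.www", "anchor:cnnsi.com", "CNN");
        System.out.println(region.flushedCount());                           // 0
        region.delete("com.cnn.www", "anchor:cnnsi.com");
        System.out.println(region.flushedCount());                           // 2
        System.out.println(region.get("com.cnn.www", "anchor:cnnsi.com"));   // null
    }
}
```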

HBase Deployment

[Figure annotations: master node; slave nodes]

HBase vs. HDFS

HBase vs. RDBMS

When to use HBase
