Adding Value to HBase with IBM InfoSphere BigInsights and Big SQL
Session Number 1687
Piotr Pruski
@ppruski
2
Please note
IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
3
Agenda
• Introduction to HBase
• Big SQL HBase Storage Handler
  – Column mapping
  – Data encoding
  – Data load
• Secondary Indexes
• Querying
• Recommendations and limitations
• Logs and Troubleshooting
• Highlights and HBase use cases
4
HBase Basics
• Client/server database
  – Master and a set of region servers
• Key-value store
  – Key and value are byte arrays
  – Efficient access using row key
• Different from relational databases
  – No types: all data is stored as bytes
  – No schema: rows can have different sets of columns
5
HBase Data Model
• Table
  – Contains column families
• Column family
  – Logical and physical grouping of columns
• Column
  – Exists only when inserted
  – Can have multiple versions
  – Each row can have a different set of columns
  – Each column is identified by its key
• Row key
  – Implicit primary key
  – Used for storing ordered rows
  – Efficient queries using row key
HBTABLE (logical view):

Row key | Value
11111   | cf_data: {'cq_name': 'name1', 'cq_val': 1111}; cf_info: {'cq_desc': 'desc11111'}
22222   | cf_data: {'cq_name': 'name2', 'cq_val': 2013 @ ts=2013, 'cq_val': 2012 @ ts=2012}

Physical storage (HFiles, one set per column family):

11111 cf_data cq_name name1 @ ts1
11111 cf_data cq_val 1111 @ ts1
22222 cf_data cq_name name2 @ ts1
22222 cf_data cq_val 2013 @ ts1
22222 cf_data cq_val 2012 @ ts2

11111 cf_info cq_desc desc11111 @ ts1
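The versioned cell model above can be sketched with a toy in-memory store (a simplification for illustration only; the names and structure are invented and are not HBase code):

```python
# Toy model of HBase cell storage: every cell is addressed by
# (row, family, qualifier, timestamp), and all keys/values are bytes.
cells = {}

def put(row, family, qualifier, ts, value):
    # "Updating" a column just adds a new version under a later timestamp.
    cells[(row, family, qualifier, ts)] = value

def get_latest(row, family, qualifier):
    # Return the newest version of a cell, as a default HBase Get would.
    versions = [(ts, v) for (r, f, q, ts), v in cells.items()
                if (r, f, q) == (row, family, qualifier)]
    return max(versions)[1]

# The sample rows from the slide:
put(b"11111", b"cf_data", b"cq_name", 1, b"name1")
put(b"11111", b"cf_data", b"cq_val", 1, b"1111")
put(b"11111", b"cf_info", b"cq_desc", 1, b"desc11111")
put(b"22222", b"cf_data", b"cq_name", 1, b"name2")
put(b"22222", b"cf_data", b"cq_val", 2012, b"2012")
put(b"22222", b"cf_data", b"cq_val", 2013, b"2013")
```

Reading cq_val for row 22222 returns the 2013 version; the 2012 version is still present as history.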
6
More on the HBase Data Model
• There is no schema for an HBase table in the RDBMS sense
  – Except that the column families must be declared
    • Since they determine the physical on-disk organization
  – Thus every row can have a different set of columns
• HBase is described as a key-value store
• Each key-value pair is versioned
  – The version can be a timestamp or an integer
  – Updating a column just adds a new version
• All data are byte arrays, including table names, column family names, and column names (also called column qualifiers)
Key/Value layout: Key = (Row, Column Family, Column Qualifier, Timestamp) → Value
7
HBase Cluster Architecture

A master and a set of region servers sit on top of HDFS / GPFS, coordinated by a ZooKeeper quorum:

• The HBase master assigns regions and performs load balancing
• The client finds region server addresses in ZooKeeper
• The client reads and writes rows by accessing the region server directly
• ZooKeeper is used for coordination / monitoring
• Each region server hosts regions (each backed by HFiles) and coprocessors

8
BigInsights - Big SQL
• Big SQL brings robust SQL support to the Hadoop ecosystem
• Driving design goals
  – Existing queries should run with no or few modifications
  – Existing JDBC and ODBC compliant tools should continue to function
    • Data warehouse augmentation is a very common use case for Hadoop
• While highly scalable, MapReduce is notoriously difficult to use
• SQL support opens the data to a much wider audience
• Making data in BigInsights accessible to SQL-capable tools
  – Cognos BI
  – Microstrategy
  – Tableau
  – …
9
Big Data for a Query-able Archive
BigInsights (Hadoop) ↔ InfoSphere Warehouse / Netezza

• Cognos BI can issue SQL queries against data managed by Apache Hive in BigInsights
• The IBM Big Data platform supports bi-directional queries between BigInsights and the EDW
• Key benefits:
  – Existing SQL-based applications can leverage the Big Data platform
  – The EDW is optimized from a size and performance perspective
  – Provides cost-effective and flexible big data storage and analysis

(Diagram: Cognos Insight and the Cognos BI Server explore, analyze, report, and act over both stores through bi-directional SQL, with InfoSphere Optim in the data flow.)
10
Big SQL HBase Storage Handler
• Maps SQL to HBase data: column mapping
• Handles serialization/deserialization of data (SerDe)
• Efficiently handles SQL queries by pushing down predicates

Architecture: input data (delimited files, warehouse, JDBC applications) and SQL queries enter Big SQL; the HBase storage handler and its SerDe sit between Big SQL and HBase on DFS. The query optimizer (compile time) processes hints; the query analyzer (runtime) determines HBase scan limits, filters, and index usage.
11
Column Mapping
• Mapping HBase row key/columns to SQL columns
  – Supports one-to-one and one-to-many mappings
• One-to-one mapping
  – A single HBase entity is mapped to a single SQL column
SQL table: id, name, value, desc
HBase mapping: key → id; cf_data:cq_name → name; cf_data:cq_val → value; cf_info:cq_desc → desc

Example row:
key 11111, cf_data:cq_name = name1, cf_data:cq_val = 1111, cf_info:cq_desc = desc11111
12
Create Table: One to One Mapping
CREATE HBASE TABLE HBTABLE
( id INT,
name VARCHAR(10),
value INT,
desc VARCHAR(20)
)
COLUMN MAPPING
(
key mapped by (id),
cf_data:cq_name mapped by (name),
cf_data:cq_val mapped by (value),
cf_info:cq_desc mapped by (desc)
);
Notes: the row key mapping is required; each HBase column is identified by family:qualifier.
13
One to Many Column Mapping
• A single HBase entity mapped to multiple SQL columns
• Composite key
  – HBase row key mapped to multiple SQL columns
• Dense column
  – One HBase column mapped to multiple SQL columns
Example row (column family cf_data):
key 11111_ac11 → (userid, acc_no); cq_names = fname1_lname1 → (first_name, last_name); cq_acct = 11111#11#0.25 → (balance, min_bal, interest)
14
Create Table: One to Many Mapping
CREATE HBASE TABLE DENSE_TABLE
( userid INT,
acc_no VARCHAR(10),
first_name VARCHAR(10),
last_name VARCHAR(10),
balance double,
min_bal double,
interest double
)
COLUMN MAPPING
(
key mapped by (userid, acc_no),
cf_data:cq_names mapped by (first_name, last_name),
cf_data:cq_acct mapped by (balance, min_bal, interest)
);
Notes: (userid, acc_no) form a composite key; cq_names and cq_acct are dense columns, each mapped to a list of SQL columns.
15
Why use One to Many mapping?
• HBase is very verbose
  – Stores a lot of information for each value: <row> <columnfamily> <columnqualifier> <timestamp> <value>
  – Primarily intended for sparse data
• Save storage space
  – Sample table with 9 columns, 1.5 million rows
  – One-to-one mapping: 522 MB
  – One-to-many mapping: 276 MB
• Improve query response time
  – Query results also return the entire key for each value
  – select * query on the sample table:
    • One-to-one mapping: 1m 31s
    • One-to-many mapping: 1m 2s
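The space saving above comes from packing several SQL values into one HBase cell. A hypothetical sketch of that packing (the separator and helper names are invented for illustration, not Big SQL internals):

```python
# Dense-column packing: instead of one HBase cell per SQL column (each cell
# repeating the full row/family/qualifier/timestamp overhead), several SQL
# values share one cell, joined by a separator.
SEP = "#"

def pack(values):
    # Serialize a list of SQL values into a single dense-cell string.
    return SEP.join(str(v) for v in values)

def unpack(cell):
    # Split a dense cell back into its parts.
    return cell.split(SEP)

cell = pack([11111, 11, 0.25])   # one HBase cell carries 3 SQL columns
balance, min_bal, interest = unpack(cell)
```

With one-to-one mapping the same three values would need three cells, each repeating the row key and qualifier overhead.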
16
Data encoding
• HBase stores all data as an array of bytes
  – The application decides how to encode/decode the bytes
• Big SQL uses the Hive SerDe interface for serialization/deserialization
• Supports two types of data encoding: string and binary
• Encoding can be specified at the HBase row key/column level
Example row (column family cf_data):
key 11111_ac11 (string), cq_names = fname1_lname1 (string), cq_acct = 0x000001 … (binary)
17
String encoding
• Default encoding
• Value is converted to a string and stored as UTF-8 bytes
• A separator identifies the parts in a one-to-many mapping
  – Default separator: \u0000
CREATE HBASE TABLE DENSE_TABLE_STR
( userid INT,
acc_no VARCHAR(10),
first_name VARCHAR(10),
last_name VARCHAR(10),
balance double,
min_bal double,
interest double
)
COLUMN MAPPING
(
key mapped by (userid, acc_no) separator '_',
cf_data:cq_names mapped by (first_name, last_name) separator '_',
cf_data:cq_acct mapped by (balance, min_bal, interest) separator '#'
);
Note: a different separator can be specified for each column and for the row key. The default separator for string encoding is the null byte (\u0000).
18
String Encoding: Pros and Cons
• Readable format and easier to port across applications
• Useful for mapping existing data
• Numeric data is not collated correctly
  – HBase stores data as bytes
  – Lexicographic ordering: as strings, "2" > "10" and "9" > "10"
• Slow
  – Parsing strings is expensive

Example of an existing HBase table mapped as an external Big SQL table (column family cf_data):
key 11111_ac11, cq_names = fname1_lname1, cq_acct = 10000#10#0.25
→ SQL columns userid, acc_no, first_name, last_name, balance, min_bal, interest
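The collation problem is easy to reproduce: HBase compares keys byte by byte, so string-encoded numbers sort lexicographically rather than numerically.

```python
# String-encoded numbers collate byte-by-byte, not numerically:
nums = [2, 9, 10, 100]
as_strings = sorted(str(n) for n in nums)
# "10" and "100" sort before "2" and "9" - not the numeric order.
```

This is why range predicates on string-encoded numeric row keys can return surprising results, and why binary encoding (next slides) exists.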
19
External Tables
• Useful to map tables that already exist in HBase
  – Data in external tables is not pre-validated
• Can create multiple views of the same table (the example below uses a subset of the columns in dense_table)

create external hbase table externalhbase_table
  (user INT, acc string, balance double, min_bal double, interest double)
column mapping (
  key mapped by (user, acc),
  cf_data:cq_acct mapped by (balance, min_bal, interest) separator '#'
)
hbase table name 'dense_table';

• HBase tables created using the Hive HBase storage handler cannot be read by Big SQL
  – Need to create external tables for this
• Things to note:
  – Dropping an external table only drops the metadata
  – Cannot create a secondary index on external tables
20
Binary Encoding
• Data encoded using a sortable binary representation
• Separators handled internally
  – Escaped to avoid the issue of the separator appearing within the data
CREATE HBASE TABLE MIXED_ENCODING
(
C1 INT, C2 INT, C3 INT,
C4 VARCHAR(10), C5 DECIMAL(5,2),
C6 SMALLINT
)
COLUMN MAPPING
(
KEY MAPPED BY (C1, C2, C3) ENCODING BINARY,
CF1:COL1 MAPPED BY (C4, C5) SEPARATOR '|',
CF2:COL1 MAPPED BY (C6) ENCODING BINARY
);
Resulting HBase row:
key: 0x000000000000000100000000000000020000000000000003 (binary-encoded C1, C2, C3)
cf1:col1: foo|97.31 (string-encoded C4, C5 with separator '|')
cf2:col1: 0x0000DEAF (binary-encoded C6)

Note: if an encoding is not specified, string is used as the default.
21
Binary Encoding: Pros and Cons
• Faster
• Numeric types collate correctly, including negative numbers
• Limited portability

CREATE HBASE TABLE WEATHER (temp INT, date TIMESTAMP, humidity DOUBLE)
COLUMN MAPPING (key mapped by (temp, date), cf:cq mapped by (humidity))
default encoding binary;

Input rows (temp, date, humidity):
100, 2012-06-10 17:00:00:000, 40.25
-17, 2012-12-12 17:00:00:000, 30.25
95, 2012-06-05 17:00:00:000, 50.25

Binary-encoded row keys sort in numeric order (-17, 95, 100):
\x01\x7F\xFF\xFF\xEF\x012012-12-12 17:00:00:000\x00
\x01\x80\x00\x00_\x012012-06-05 17:00:00:000\x00
\x01\x80\x00\x00d\x012012-06-10 17:00:00:000\x00

cf:cq values (binary-encoded doubles):
\x01\xC0>@\x00\x00\x00\x00\x00
\x01\xC0I \x00\x00\x00\x00\x00
\x01\xC0D \x00\x00\x00\x00\x00
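The hex dump above is consistent with the standard sign-bit-flip trick for order-preserving integer encoding (-17 becomes 0x7FFFFFEF, 95 becomes 0x8000005F). A sketch of that trick, assuming 32-bit ints (Big SQL's exact encoding may differ in detail):

```python
import struct

def encode_int(n):
    # Big-endian two's complement with the sign bit flipped: unsigned
    # byte-by-byte comparison then agrees with signed numeric order.
    return struct.pack(">I", (n + (1 << 31)) & 0xFFFFFFFF)

keys = sorted(encode_int(n) for n in (100, -17, 95))
# Byte order is -17, 95, 100: negatives collate correctly.
```

Sorting the encoded bytes lexicographically (as HBase does) yields the numeric order, which is exactly what string encoding fails to provide.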
22
Load Data
• Load HBase
  – Loads data from delimited files (the file can be on DFS or local to the Big SQL server)
  – A column list can be specified; if not, the column ordering in the table definition is used

load hbase data inpath 'file:///input.dat'
delimited fields terminated by '|'
into table hbtable
(name, value, desc, id);

• Load FROM
  – Loads data from a (JDBC) source outside of a BigInsights cluster
• Insert command available

insert into hbtable
(name, value, desc, id)
values ('name5', 5555, 'desc55555', 55555);
23
Load Data: Upsert
� HBase ensures uniqueness of row key
• Upsert can be confusing: no errors, but fewer rows!
• Combine multiple columns to make the row key unique:
  key mapped by (id, name)

Example with key mapped by (id) only:

Delimited file (10 rows):
11111, name1, 1111, desc11111
11111, name9, 9999, desc99999
22222, name2, 2222, desc22222
…

Load: 10 rows affected
select count(*) from hbtable: 7 rows

HBase keeps one row per key; duplicate keys become new versions of the same row:
11111, name1, 1111, desc11111 @ts0 (older version)
11111, name9, 9999, desc99999 @ts1 (current)
22222, name2, 2222, desc22222 @ts1
…

With key mapped by (id, name), the keys become unique and all rows survive:
11111\x00name1, 1111, desc11111 @ts0
11111\x00name9, 9999, desc99999 @ts1
22222\x00name2, 2222, desc22222 @ts1
…
24
Force Key Unique
• Use the force key unique option when creating a table
CREATE HBASE TABLE HBTABLE_FORCE_KEY_UNIQUE
( id INT, name VARCHAR(10), value INT, desc VARCHAR(20) )
COLUMN MAPPING
(
key mapped by (id) force key unique,
cf_data:cq_name mapped by (name),
cf_data:cq_val mapped by (value),
cf_info:cq_desc mapped by (desc)
);
• Load adds a UUID to the row key
• Prevents data loss
• Inefficient
  – Stores more data
  – Slower queries

Input rows:
11111, name1, 1111, desc11111
11111, name9, 9999, desc99999
22222, name2, 2222, desc22222
…

After load, row keys carry a UUID suffix:
11111\x00b71c95d8-ffdd-4d49-9015-2fdd6f7dcdf4, name1, 1111, desc11111
11111\x00ea780078-9893-4bf7-95d8-cb9ca4b2427f, name9, 9999, desc99999
22222\x00a90885b0-418b-49ac-a6f6-aa73273b57ca, name2, 2222, desc22222
…
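The UUID-suffix scheme above can be sketched in a few lines (the key format is taken from the slide's example output; the function name is invented):

```python
import uuid

def unique_row_key(id_value):
    # force key unique: append a UUID so repeated ids never collide --
    # at the cost of longer keys, and point lookups by id alone no
    # longer hit a single row.
    return f"{id_value}\x00{uuid.uuid4()}"

k1 = unique_row_key(11111)
k2 = unique_row_key(11111)
# Two loads of the same id produce distinct keys: no silent upsert.
```

Note the trade-off: queries on id now need a prefix scan rather than a single get, which is why the slide calls this option inefficient.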
25
Load Data: Error Handling
• Option to continue and log error rows
  – LOG ERROR ROWS IN FILE 'filename'
• Common errors
  – Separator exists within data for string encoding
  – Invalid numeric types
• Always count the number of rows after loading
  – Load always reports the total number of rows that it handled
Example: key mapped by (id, name) separator '-', with id defined as an integer.

Input file:
11111, name1, 1111, desc11111
11111, name9, 9999, desc99999
22222, name-2, 2222, desc22222   (separator '-' appears inside the data)
3333a, name3, 3333, desc33333    (invalid integer)

Load: 4 rows affected

HBase table (2 rows):
11111-name1, 1111, desc11111
11111-name9, 9999, desc99999

Error file (2 rows):
22222, name-2, 2222, desc22222
3333a, name3, 3333, desc33333
26
Options to Speed up Load
• Disable WAL
  – Data loss can happen if a region server crashes

LOAD HBASE DATA INPATH 'tpch/ORDERS.TXT' DELIMITED FIELDS
TERMINATED BY '|' INTO TABLE ORDERS DISABLE WAL;

• Increase the write buffer
  – set hbase.client.write.buffer=8388608;
27
Secondary Index Support
• Self-maintaining secondary indexes
  – Stored in an HBase table
  – Populated using a MapReduce index builder
  – Kept up to date using a synchronous coprocessor

Architecture: a client query enters Big SQL, where the query optimizer (compile time) processes hints and the query analyzer (runtime) decides whether to use an index. The HBase storage handler (with its SerDe) builds the index over the data regions with the MapReduce index builder on create index, maintains it through the index coprocessor on writes, and answers indexed queries with batched Get requests against the data table.
28
Index Creation and Usage
create hbase table dt (id int, c1 string, c2 string, c3 string, c4 string, c5 string)
column mapping (
  key mapped by (id),
  f:a mapped by (c1, c2, c3),
  f:b mapped by (c4, c5)
);

create index ixc3 on table dt (c3) as 'hbase';
• Automatic index usage
  – Range scan on the index table to get the matching row key(s) in the base table
  – Batched get requests to the base table with the matched row key(s)
Data table (dt):            Index table (dt_ixc3):
bt1, c11_c21_c31, c41_c51   c31_bt1
bt2, c12_c22_c32, c42_c52   c32_bt2
bt3, c13_c23_c33, c43_c53   c33_bt3
…                           …

Query c3=c32: a range scan on the index table (start row = c32, stop row = c32++) finds c32_bt2, followed by a get on the data table for row bt2. Without a usable index, a full table scan is needed.
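The index lookup described above can be sketched with sorted strings standing in for HBase tables (a toy illustration; the key format value_rowkey is taken from the slide, everything else is invented):

```python
# Toy secondary index in the style of the slide: the index "table" stores
# keys of the form <value>_<rowkey>, so an equality predicate becomes a
# short range scan on the index followed by a get on the data table.
data = {"bt1": ("c31", "c41_c51"),
        "bt2": ("c32", "c42_c52"),
        "bt3": ("c33", "c43_c53")}

# Build the index: one entry per (c3 value, row key).
index = sorted(f"{c3}_{row}" for row, (c3, _) in data.items())

def lookup(c3_value):
    # Range scan on the index (all keys starting with "<value>_"),
    # returning the matching base-table row keys.
    prefix = c3_value + "_"
    return [k.split("_", 1)[1] for k in index if k.startswith(prefix)]

rows = [data[rk] for rk in lookup("c32")]   # batched gets on the data table
```

Because the index keys are sorted, the "scan" touches only the narrow range for the requested value instead of the whole data table.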
29
Index Pros and Cons
• Fast key-based lookups for queries that return limited data
• Not beneficial if there are too many matches
• No statistics to make the decision in the compiler
• useindex hint to make explicit choices
• Index adds latency to data load
  – When loading a big data set, drop the index and recreate it
• The LOAD FROM option bypasses index maintenance
  – Uses HBase bulk load, which writes to HFiles directly
30
Column Family Options
• Compression
  – compression(gz)
• Bloom filters
  – NONE, ROW, ROWCOL
• In-memory columns
  – in memory, no in memory
create hbase table colopt_table (key string, c1 string)
column mapping(key mapped by (key), cf1:c1 mapped by(c1))
column family options(cf1 compression(gz) bloom filter(row) in memory);
31
Query Handling
• Projection pushdown
• Predicate pushdown
  – Point scan
  – Range scan
  – Automatic index usage
  – Filters
• Query hints
32
Sample Data
• TPCH orders table with 1.5 million rows

drop table if exists orders;
CREATE HBASE TABLE ORDERS (
  O_ORDERKEY BIGINT, O_CUSTKEY INTEGER, O_ORDERSTATUS VARCHAR(1),
  O_TOTALPRICE FLOAT, O_ORDERDATE TIMESTAMP, O_ORDERPRIORITY VARCHAR(15),
  O_CLERK VARCHAR(15), O_SHIPPRIORITY INTEGER, O_COMMENT VARCHAR(79)
)
column mapping (
  key mapped by (O_ORDERKEY, O_CUSTKEY),
  cf:d mapped by (O_ORDERSTATUS, O_TOTALPRICE, O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT),
  cf:od mapped by (O_ORDERDATE)
)
default encoding binary;

LOAD HBASE DATA INPATH 'tpch/ORDERS.TXT' DELIMITED FIELDS
TERMINATED BY '|' INTO TABLE ORDERS;
33
Projection Pushdown
• Get only the columns required by the query
• Limit data retrieved to the client

select * from orders
go -m discard
1500000 rows in results(first row: 0.21s; total: 1m1.77s)
Log: HBase scan details:{ … , families={cf=[d, od]}, …}

select o_totalprice from orders
go -m discard
1500000 rows in results(first row: 0.19s; total: 21.27s)
Log: HBase scan details:{ … , families={cf=[d]}, …}

select o_orderdate from orders
go -m discard
1500000 rows in results(first row: 0.36s; total: 36.24s)
Log: HBase scan details:{ … , families={cf=[od]}, …}

• Projection happens at the HBase column level
  – For composite keys and dense columns, the entire value is retrieved to the client
  – It is efficient to pack columns that are queried together

Note: the o_orderdate query has a higher response time even though it retrieves less data than the o_totalprice query, because the timestamp type is more expensive to process.
34
Predicate Pushdown: Point Scan
• With full row key
• Big SQL can combine predicates on row key parts

set force local on;
select o_orderkey, o_totalprice from orders where o_custkey=1 and o_orderkey=454791;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
|     454791 | 208660.75000 |
+------------+--------------+
1 row in results(first row: 0.14s; total: 0.14s)
Log: Found a row scan by combining all composite key parts.

Row keys: 1#454791, 1#579908, 1#3868359, 1#4273923, 1#4808192, 1#5133509, …
Query o_custkey=1 and o_orderkey=454791 → start row = 1#454791, stop row = 1#454791
35
Predicate Pushdown: Partial row Scan
select o_orderkey,o_totalprice from orders where o_custkey=1;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
| 454791 | 74602.81250 |
| 579908 | 54048.26172 |
| 3868359 | 123076.84375 |
| 4273923 | 95911.00781 |
| 4808192 | 65478.05078 |
| 5133509 | 174645.93750 |
+------------+--------------+
6 rows in results(first row: 0.13s; total: 0.13s)
Log: Found a row scan that uses the first 1 part(s) of composite key.

Row keys: 1#454791, 1#579908, 1#3868359, 1#4273923, 1#4808192, 1#5133509, 2#430243, …
Query o_custkey=1 → start row = 1, stop row = 1++ (predicate on the leading part of the composite row key)
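The "start row = 1, stop row = 1++" rule generalizes to any prefix of the composite key. A sketch of how a prefix predicate becomes a bounded range scan (the helper is invented for illustration; "prefix++" here means incrementing the last character of the prefix):

```python
def prefix_scan_bounds(prefix):
    # A predicate on the leading part(s) of a composite key becomes a
    # range scan: start at the prefix, stop just past it.
    stop = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, stop

start, stop = prefix_scan_bounds("1#")   # o_custkey=1, '#' as separator
rows = ["1#454791", "1#579908", "1#5133509", "2#430243"]
matched = [r for r in rows if start <= r < stop]
# Only the rows whose key starts with "1#" fall inside the scan range.
```

This is why only predicates on the leading key parts can bound the scan: a predicate on a non-leading part gives no usable prefix, forcing a full table scan (next slide).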
36
Predicate Pushdown: Range Scan
• With range predicates

select o_orderkey, o_totalprice from orders where o_custkey < 3;

Log: Found a row scan that uses the first 1 part(s) of composite key.
Log: HBase scan details:{ .. , stopRow=\x01\x80\x00\x00\x03, startRow=, … }

Row keys: 1#454791, …, 1#5133509, 2#430243, …, 4#164711, …
Query o_custkey<3 → start row = (empty), stop row = 3#
37
Predicate Pushdown: Full table Scan
• This is an example of a case where predicates are not pushed down
• Predicates on non-leading parts of the row key cannot limit the scan

set force local on;
select o_orderkey, o_totalprice from orders where o_orderkey=454791;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
| 454791 | 74602.81250 |
+------------+--------------+
1 row in results(first row: 32.13s; total: 32.13s)
Log: HBase scan details:{ .. , stopRow=, startRow=, … }
38
Automatic Index Usage
select * from orders where o_clerk='Clerk#000000999'
go -m discard
1472 rows in results(first row: 1.63s; total: 30.32s)

create index ix_clerk on table orders (o_clerk) as 'hbase';
0 rows affected (total: 3m57.82s)

select * from orders where o_clerk='Clerk#000000999'
go -m discard
1472 rows in results(first row: 3.60s; total: 3.65s)
Log: Index query successful

• The index is used automatically
• For a composite index, rules similar to the composite row key apply
  – Parts will be combined where possible
  – With a partial value for a composite index, a range scan is done on the index table
• Multiple indexes on a table
  – The index to be used is randomly chosen
  – Specify the useindex hint to use a specific index
39
Pushing down Filters into HBase
• Filters do not avoid a full table scan
  – Some filters can skip certain sections, e.g., PrefixFilter
• Limit rows returned to the client
• Limit data returned to the client
  – Key-only filters

select o_orderkey from orders where o_custkey>100000 and o_orderstatus='P'
go -m discard
12819 rows in results(first row: 1.12s; total: 6.80s)

Log: Found a row scan that uses the first 1 part(s) of composite key.
Log: HBase filter list created using AND.
Log: HBase scan details:{… , filter=FilterList AND (1/1):
[SingleColumnValueFilter (cf, d, EQUAL, \x01P\x00)], stopRow=,
startRow=\x01\x80\x01\x86\xA1, …}

Note: this is a row scan on o_custkey combined with a column filter, because there is a predicate on the leading part of a dense column (o_orderstatus).
40
Key Only Tables
• Big SQL allows creation of tables without specifying any HBase column

create hbase table KEY_ONLY_TABLE (k1 string, k2 string, k3 string)
column mapping (key mapped by (k1, k2, k3));

select * from KEY_ONLY_TABLE;

Log: Only row key or parts of row key requested. Applying filters.
Log: HBase scan details:{… families={}, filter=FilterList AND (2/2):
[FirstKeyOnlyFilter, KeyOnlyFilter], …}
41
Predicate Precedence
• When a query contains multiple predicates, the following precedence applies:
  – Row scan
  – Index
  – Filters
    • Row filters
    • Column filters
• Filters will be applied along with row scans
• Filters cannot be combined with index lookups
• Multiple predicates: use of a row filter and a column filter

select o_orderkey, o_custkey, o_orderdate from orders where
o_orderdate=cast('1996-12-09' as timestamp) or o_custkey=2;

Log: HBase filter list created using OR.
Log: HBase scan details:{… , filter=FilterList OR (2/2):
[SingleColumnValueFilter (cf, od, EQUAL, \x011996-12-09 00:00:00.000\x00),
PrefixFilter \x01\x80\x00\x00\x02], cacheBlocks=false, stopRow=,
startRow=, … }

Note: the OR condition prevents use of a row scan; a row filter (PrefixFilter) is used along with a column filter.
42
Accessmode Hint
• Runs the query locally in the Big SQL server
  – Useful to avoid MapReduce overhead
• Very important for HBase point queries
  – This is not currently detected by the compiler
  – Specify the accessmode='local' hint when getting a limited set of data from HBase
• Specify at query level:

select o_orderkey from orders /*+ accessmode='local' +*/ where o_custkey=1
and o_orderkey=454791;

• Specify at session level
  – set force local on
  – set force commands override query-level hints
43
HBase Hints
• rowcachesize (default=2000)
  – Used as the scan cache setting
  – Also used to determine the number of get requests to batch in index lookups
• colbatchsize (default=100)
• useindex ('false' to avoid index usage)

select o_orderkey from orders /*+ rowcachesize=10000 +*/ where o_custkey>5000
go -m discard
1450136 rows in results(first row: 22.67s; total: 27.46s)
Log: HBase scan details:{... , caching=10000, ...}

• rowcachesize can also be set using the set command:
  – set hbase.client.scanner.caching=10000;
44
Recommendations
• Row key design is the most important factor
  – Try to combine the most commonly used predicates into row key columns
  – Do not make the row key too long
• Use short names for HBase column families and column qualifiers
  – f:q instead of mycolumnfamily:mycolumnqualifier
• Check whether key-only tables can be used
• Pack columns that are queried together into dense columns
  – Use the column that serves as a query predicate as the prefix
  – Create indexes for columns that do not have repeating values and are queried often
• Separate columns that are rarely or never queried into a different column family
• Set hbase.client.scanner.caching to an optimal value
• Ensure even data distribution
45
Limitations
• No diagnostic info about HBase pushdown
  – How the HBase storage handler pushes down a query is decided only at runtime
  – Predicate handling details are logged at INFO level
  – Many examples of log messages are covered in the previous slides
• No auto-detection of local vs. MR mode
  – Currently depends on user-specified hints
• Statistics not available
  – Big SQL does not have a framework to collect statistics
  – Query optimization could be improved with useful statistics
• Map type not supported
  – Big SQL does not support the map data type
  – The Hive HBase handler supports the map data type and many-to-one mapping
    • Mapping an entire HBase column family to a map data type
46
Logs and Troubleshooting
• Big SQL logs
  – Look for the rewritten query
  – More information is in the Big SQL logs if the query is run in local mode
• MapReduce logs
  – Predicate handling information is in the map task log when run in MR mode
• HBase web GUI
  – http://<hostname>:60010/master-status
47
Big SQL HBase Handler Highlights
• Support for composite keys and dense columns
• Pushdown for efficient execution of queries
• Support for secondary indexes
• Binary encoding (collates correctly)
• Key-only tables
• Support for hints to make query optimization decisions
48
Scenarios that can leverage HBase features
• Point queries
  – Queries that return a single row of results
  – The row can be determined using the row key or a secondary index
    • Not all queries using a secondary index are point queries
• Queries with projections
  – If a query requires only a few columns
  – Projection happens at the HBase column level
• Data maintenance using upserts
  – Loading different values for columns using the same row key
49
Acknowledgements and Disclaimers
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
© Copyright IBM Corporation 2013. All rights reserved.
•U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM, the IBM logo, ibm.com, and InfoSphere BigInsights are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of others.
Piotr Pruski
@ppruski
Thank You
Adding Value to HBase with IBM InfoSphere BigInsights and Big SQL

Acknowledgements
• Full credit to Deepa Remesh