Column Qualifier Encoding in Apache Phoenix Samarth Jain


Page 1: Column encoding

Column Qualifier Encoding in Apache Phoenix

Samarth Jain

Page 2: Column encoding

About Me

• Software Engineer @ Salesforce
• Previously, Software Engineer @ eBay
• Apache Phoenix PMC
• PHOENIX-1819 – Metrics collection framework
• PHOENIX-1504 – Altering views
• PHOENIX-1779 – 8x faster non-aggregate, non-ordered queries
• PHOENIX-914 – Row timestamp feature
• PHOENIX-1598 – Column qualifier encoding

Page 3: Column encoding

Overview

• Data model
• Drawbacks
• Column qualifier encoding
• Benefits
• ORDER BY performance
• GROUP BY performance

Pages 4–8: Column encoding (slide content not captured in the text extract)

Drawbacks

• Column names are used as HBase column qualifiers
• Size bloat for dense tables with long column names
• Renaming a column is inefficient
• Pressure on GC, network I/O, and the block cache
• Lack of control over column qualifiers prevents possible optimizations

Page 9: Column encoding

Column Qualifier Encoding

• Simple
• Don't use column names as HBase column qualifiers
• Use numbers (short/integer) as column qualifiers
• Controlled assignment of column qualifiers
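The core idea can be sketched in a few lines: serialize the assigned column number into a fixed-width byte[] and use that as the HBase qualifier. This is only an illustration of the scheme, not Phoenix's actual encoding implementation:

```java
import java.nio.ByteBuffer;

// Illustrative sketch: number-based column qualifiers.
// Not Phoenix's actual qualifier-encoding code.
public class QualifierCodec {

    // Encode an assigned column number as a fixed-width 4-byte qualifier.
    static byte[] encode(int columnNumber) {
        return ByteBuffer.allocate(4).putInt(columnNumber).array();
    }

    // Decode the qualifier bytes back to the column number.
    static int decode(byte[] qualifier) {
        return ByteBuffer.wrap(qualifier).getInt();
    }
}
```

Every qualifier is 4 bytes regardless of how long the SQL column name is, which is where the size savings for dense tables with long column names come from.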

Page 10: Column encoding

Column Renaming

• Currently, renaming a column isn't possible without copying the data under the new column qualifier (PHOENIX-2341)

• Phoenix stores table-related metadata in its SYSTEM.CATALOG table

• Store a mapping of column name to column qualifier
• Renaming a column would then just involve updating a few metadata rows in SYSTEM.CATALOG
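The rename then becomes a pure metadata operation: move the name→qualifier mapping entry while the numeric qualifier, and therefore every stored cell, stays untouched. A minimal in-memory sketch of this idea (not Phoenix's SYSTEM.CATALOG code):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a name -> qualifier catalog mapping.
public class CatalogSketch {
    private final Map<String, Integer> nameToQualifier = new HashMap<>();

    void addColumn(String name, int qualifier) {
        nameToQualifier.put(name, qualifier);
    }

    // Rename = rewrite the metadata entry only; the qualifier
    // (and all data keyed by it) is unchanged.
    void renameColumn(String oldName, String newName) {
        Integer q = nameToQualifier.remove(oldName);
        if (q == null) {
            throw new IllegalArgumentException("unknown column: " + oldName);
        }
        nameToQualifier.put(newName, q);
    }

    Integer qualifierFor(String name) {
        return nameToQualifier.get(name);
    }
}
```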

Page 11: Column encoding

Packing Key Values in Immutable Tables

• Store all column values in a single column qualifier per column family (PHOENIX-2565)

• Uses a variable-width array format for storing values

• Column encoding provides the capability to index into the array to access the value of a key-value column
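The packing idea can be sketched as follows: serialize all values of a row into one byte[] with an offset table, so the encoded column number indexes straight to a value. This is an assumption-laden toy format for illustration; Phoenix's actual variable-width array serialization differs:

```java
import java.nio.ByteBuffer;

// Toy packed-cell format: [value bytes...][4-byte offsets...][4-byte count].
// Illustrates indexing by column number; not Phoenix's actual format.
public class PackedValues {

    static byte[] pack(byte[][] values) {
        int dataLen = 0;
        for (byte[] v : values) dataLen += v.length;
        ByteBuffer buf = ByteBuffer.allocate(dataLen + 4 * values.length + 4);
        int[] offsets = new int[values.length];
        int pos = 0;
        for (int i = 0; i < values.length; i++) {
            offsets[i] = pos;
            buf.put(values[i]);
            pos += values[i].length;
        }
        for (int off : offsets) buf.putInt(off);  // offset table
        buf.putInt(values.length);                // value count
        return buf.array();
    }

    // O(1) access to the i-th value via the offset table.
    static byte[] get(byte[] packed, int i) {
        ByteBuffer buf = ByteBuffer.wrap(packed);
        int count = buf.getInt(packed.length - 4);
        int tableStart = packed.length - 4 - 4 * count;  // == data length
        int start = buf.getInt(tableStart + 4 * i);
        int end = (i + 1 < count) ? buf.getInt(tableStart + 4 * (i + 1)) : tableStart;
        byte[] out = new byte[end - start];
        buf.position(start);
        buf.get(out);
        return out;
    }
}
```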

Page 12: Column encoding

Performance Benefits

• Considerable reduction in disk size
• More rows fit into the block cache
• Relieved GC pressure as less garbage is generated
• Replaces binary search over column qualifiers with an O(1) lookup

Page 13: Column encoding

ORDER BY Overview

• Phoenix compiles the query into scans projecting the columns in <SELECT>, <ORDER BY>, and <WHERE>

• The Phoenix co-processor retrieves rows by asking HRegionScanner to fill a List<Cell>

• The List<Cell> is lexicographically sorted
• Rows are added to the Phoenix sort data structure
• Sort keys are formed by binary-searching the List<Cell> filled by HRegionScanner

Page 14: Column encoding

ORDER BY with Column Encoding

• Use numbers as column qualifiers
• Custom list implementation for HBase scanners to fill the key values into
• Key values are added to the list at the index obtained by converting the qualifier byte[] to an integer/short

• Replaces binary search with an O(1) lookup
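The custom-list idea above can be sketched as an array-backed container where each cell lands at the slot named by its decoded qualifier, so projecting a column is a plain array index instead of a binary search. Class and method names here are illustrative, not Phoenix's actual list implementation:

```java
import java.nio.ByteBuffer;

// Sketch: cells are positioned by their numeric qualifier so that
// lookup is O(1) instead of a binary search over a sorted List<Cell>.
public class PositionedCellList {
    private final String[] cells;  // stand-in for HBase Cell objects

    PositionedCellList(int maxQualifier) {
        cells = new String[maxQualifier + 1];
    }

    // Scanner fill path: place the cell at the slot given by
    // decoding its qualifier byte[] to an integer.
    void add(byte[] qualifier, String cell) {
        cells[ByteBuffer.wrap(qualifier).getInt()] = cell;
    }

    // O(1) projection lookup by encoded qualifier.
    String get(int encodedQualifier) {
        return cells[encodedQualifier];
    }
}
```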

Page 15: Column encoding

ORDER BY with Column Encoding

• Table with 50 columns
• Dense
• 20-character column names generated randomly
• ORDER BY columns selected were the first and last in lexicographic order

Page 16: Column encoding

ORDER BY Performance Test Results

• Table size: 5x smaller (4-byte vs. 20-byte column qualifiers)
• 2x faster (with and without block cache)
• Near-constant/slow growth as the number of projected columns increases

Page 17: Column encoding

ORDER BY Test Results

[Chart: query times for Encoded vs. Non-encoded tables as the number of columns projected grows from 25 to 50; chart data not captured in the text extract]

Page 18: Column encoding

GROUPED AGGREGATIONS Overview

• Queries are compiled to scans that project the key-value columns in <SELECT>, <GROUP BY>, and <WHERE>

• Rows are aggregated in a GROUP BY map in Phoenix's aggregate co-processor

• Map key construction requires a binary search in the List<Cell>
• Number-based column qualifiers and the custom list implementation come to the rescue again

Page 19: Column encoding

GROUPED AGGREGATION with Column Encoding

• PHOENIX-1940 – TPC Q1
• TPC data
• 60% smaller disk size
• Query time 25% faster, with and without block cache enabled

Page 20: Column encoding

Heap and GC

Page 21: Column encoding

Possible Performance Gains (to be measured)

• Faster bulk load times because of smaller data size

• Reduced index build times – both ASYNC and SYNC

• Reduction in network I/O – faster UPSERT, UPSERT SELECT

• Faster joins – smaller hash caches

Page 22: Column encoding

Work in Progress

• https://github.com/apache/phoenix/tree/encodecolumns

• 4.9 release
• Make joins take advantage of encoded columns
• More encoding schemes (2-byte column qualifiers)
• More performance testing and tuning

Page 23: Column encoding

Thank You!