Column Qualifier Encoding in Apache Phoenix
Samarth Jain
About Me
• Software Engineer @ Salesforce
• Previously, Software Engineer @ eBay
• Apache Phoenix PMC
• PHOENIX-1819 – Metrics collection framework
• PHOENIX-1504 – Altering views
• PHOENIX-1779 – 8x faster non-aggregate, non-ordered queries
• PHOENIX-914 – Row timestamp feature
• PHOENIX-1598 – Column qualifier encoding
Overview
• Data model
• Drawbacks
• Column qualifier encoding
• Benefits
• ORDER BY performance
• GROUP BY performance
Drawbacks
• Column names used as column qualifiers
• Size bloat: dense tables with large column names
• Inefficient column renaming
• GC pressure, network I/O, block cache
• Lack of control over column qualifiers prevents possible optimizations
Column Qualifier Encoding
• Simple
• Don’t use column names as HBase column qualifiers
• Use numbers (short/integer) as column qualifiers
• Controlled assignment of column qualifiers
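A minimal sketch of the idea above, not Phoenix's actual implementation: a hypothetical `QualifierAllocator` hands out increasing numbers per column name, and the number is serialized as a fixed-width qualifier. The starting value, the big-endian byte order, and all names here are assumptions for illustration.

```python
import struct


class QualifierAllocator:
    """Hypothetical allocator: assigns each new column the next free number."""

    def __init__(self, start=0):
        self._next = start
        self._by_name = {}

    def assign(self, column_name):
        # A column keeps the number it was first given.
        if column_name not in self._by_name:
            self._by_name[column_name] = self._next
            self._next += 1
        return self._by_name[column_name]


def encode_qualifier(number, width=4):
    """Serialize a qualifier number as a 2-byte or 4-byte big-endian value."""
    fmt = {2: ">H", 4: ">I"}[width]
    return struct.pack(fmt, number)


def decode_qualifier(raw):
    """Inverse of encode_qualifier; the width is taken from the byte length."""
    fmt = {2: ">H", 4: ">I"}[len(raw)]
    return struct.unpack(fmt, raw)[0]
```

With this sketch, a 24-character column name such as `customer_billing_address` is stored as a 4-byte (or 2-byte) qualifier instead of its full name.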
Column Renaming
• Currently, renaming a column isn’t possible without copying data under the new column qualifier (PHOENIX-2341)
• Phoenix stores table-related metadata in its SYSTEM.CATALOG table
• Store mapping of column name to column qualifier
• Renaming a column would then just involve updating a few metadata rows in SYSTEM.CATALOG
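To illustrate why a name-to-qualifier mapping makes renaming cheap, here is a small sketch. The `ColumnMetadata` class is a hypothetical in-memory stand-in for the metadata rows; it does not reflect the real SYSTEM.CATALOG schema.

```python
class ColumnMetadata:
    """Hypothetical stand-in for the column name -> qualifier mapping."""

    def __init__(self):
        self._name_to_qualifier = {}

    def add_column(self, name, qualifier):
        self._name_to_qualifier[name] = qualifier

    def rename_column(self, old_name, new_name):
        # Only the mapping changes; stored cells keep their qualifier.
        if new_name in self._name_to_qualifier:
            raise ValueError("column already exists: " + new_name)
        self._name_to_qualifier[new_name] = self._name_to_qualifier.pop(old_name)

    def qualifier_for(self, name):
        return self._name_to_qualifier[name]


# Stored data is keyed by (row key, qualifier number), never by column name.
cells = {("row1", 11): b"alice@example.com"}

meta = ColumnMetadata()
meta.add_column("email", 11)
meta.rename_column("email", "contact_email")

# The same cell is reachable through the new name without rewriting any data.
value = cells[("row1", meta.qualifier_for("contact_email"))]
```

The data dictionary is never touched by the rename, which is the point of the last bullet above.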
Packing Key Values in Immutable Tables
• Store all column values in a single column qualifier per column family (PHOENIX-2565)
• Uses a variable-width array format for storing values
• Column encoding makes it possible to index into the array to access the value of a key value column
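A rough sketch of how a variable-width array lets a reader jump to one column's value by position. The byte layout below (values, then an offset table, then a count) is an assumption made for illustration and is not Phoenix's actual serialization format.

```python
import struct


def pack_values(values):
    """Pack a list of byte strings (None = absent) into one blob.

    Assumed layout: [value bytes][(n + 1) uint32 offsets][uint32 n].
    """
    data = b""
    offsets = []
    for value in values:
        offsets.append(len(data))
        data += value or b""
    offsets.append(len(data))  # end marker so every value has a start and an end
    trailer = b"".join(struct.pack(">I", off) for off in offsets)
    return data + trailer + struct.pack(">I", len(values))


def read_value(packed, index):
    """Read the value at the given position without scanning the others."""
    (count,) = struct.unpack(">I", packed[-4:])
    if not 0 <= index < count:
        raise IndexError(index)
    offsets_start = len(packed) - 4 - 4 * (count + 1)
    start, end = struct.unpack_from(">II", packed, offsets_start + 4 * index)
    return packed[start:end]
```

Here the array position plays the role of the encoded column qualifier: `read_value(packed, 2)` fetches the third column directly, and an absent column reads back as an empty byte string.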
Performance Benefits
• Considerable disk size reduction
• More rows fit into the block cache
• Relieved GC pressure, as garbage size goes down
• Replace binary search of column qualifiers with O(1) lookup
ORDER BY Overview
• Phoenix compiles the query into scans projecting the columns in <SELECT>, <ORDER BY>, and <WHERE>
• Phoenix co-processor retrieves rows by asking HRegionScanner to fill a List<Cell>
• List<Cell> is lexicographically sorted
• Adds rows to the Phoenix sort data structure
• Sort keys are formed by doing binary search in the List<Cell> filled by HRegionScanner
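The binary search step can be pictured with the sketch below. It models a row as a Python list of `(qualifier, value)` tuples sorted by qualifier; this is a simplification of List<Cell>, and the function names are hypothetical.

```python
from bisect import bisect_left


def find_cell(sorted_cells, qualifier):
    """Binary search for a qualifier in a row's lexicographically sorted cells."""
    # A 1-tuple sorts just before any (qualifier, value) pair with the same qualifier.
    i = bisect_left(sorted_cells, (qualifier,))
    if i < len(sorted_cells) and sorted_cells[i][0] == qualifier:
        return sorted_cells[i][1]
    return None


def sort_key_for(sorted_cells, order_by_qualifiers):
    """Build the sort key for one row: one binary search per ORDER BY column."""
    return tuple(find_cell(sorted_cells, q) for q in order_by_qualifiers)


row = sorted([(b"zip", b"94105"), (b"age", b"30"), (b"name", b"bob")])
key = sort_key_for(row, [b"age", b"zip"])
```

Each lookup costs O(log n) comparisons of qualifier bytes, repeated for every ORDER BY column of every row.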
ORDER BY with Column Encoding
• Use numbers as column qualifiers
• Custom list implementation for HBase scanners to fill the key values in
• Key values added to the list at index determined by converting qualifier byte[] to integer/short
• Replaces binary search with O(1) lookup
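A minimal sketch of such a positional list, assuming 4-byte big-endian qualifiers and a known qualifier range for the scan. The class name and its methods are hypothetical and only mimic the idea, not the actual Phoenix list implementation.

```python
import struct


class EncodedCellList:
    """Hypothetical list that places each value at the slot given by its qualifier."""

    def __init__(self, min_qualifier, max_qualifier):
        self._min = min_qualifier
        self._slots = [None] * (max_qualifier - min_qualifier + 1)

    def _slot(self, number):
        index = number - self._min
        if not 0 <= index < len(self._slots):
            raise IndexError(number)
        return index

    def add(self, qualifier_bytes, value):
        # Convert the qualifier bytes to an integer and use it as the position.
        (number,) = struct.unpack(">I", qualifier_bytes)
        self._slots[self._slot(number)] = value

    def get(self, number):
        # Direct index access: no search, regardless of how many columns exist.
        return self._slots[self._slot(number)]


cells = EncodedCellList(min_qualifier=11, max_qualifier=15)
cells.add(struct.pack(">I", 13), b"bob")
cells.add(struct.pack(">I", 11), b"30")
```

The trade-off in this sketch is memory: the list reserves a slot for every qualifier in the range, even for columns a row does not have.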
ORDER BY with Column Encoding
• Table with 50 columns
• Dense
• 20-character column names generated randomly
• ORDER BY columns selected were the first and last in lexicographical order
ORDER BY Performance Test Results
• Table size: 5x smaller (4-byte CQ vs 20-byte)
• 2x faster (with and without block cache)
• Near constant/slow growth with increase in number of columns projected
ORDER BY Test Results
[Chart: Encoded vs. Non-encoded. X-axis: Number of Columns Projected (25 to 50). Y-axis: 0 to 20000, unit not stated.]
GROUPED AGGREGATIONS Overview
• Queries are compiled to scans that project key value columns in <SELECT>, <GROUP BY>, and <WHERE>
• Rows aggregated in a GROUP BY map in Phoenix’s aggregate co-processor
• Map key – binary search in List<Cell>
• Number-based column qualifiers and custom list implementation to the rescue again
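As a rough illustration of the aggregation path, the sketch below builds each GROUP BY map key by direct index access into a row whose values are laid out by qualifier number. The row layout and the function name are assumptions, and only a SUM aggregate is shown.

```python
from collections import defaultdict


def sum_by_group(rows, group_by_positions, sum_position):
    """Aggregate rows into a GROUP BY map.

    Each row is a list indexed by encoded qualifier number, so the map key
    is formed by positional access instead of a search.
    """
    totals = defaultdict(int)
    for cells in rows:
        key = tuple(cells[p] for p in group_by_positions)
        totals[key] += cells[sum_position]
    return dict(totals)


# Positions: 0 = return flag, 1 = line status, 2 = quantity (made-up sample rows).
rows = [
    ["A", "F", 10],
    ["N", "O", 5],
    ["A", "F", 7],
]
result = sum_by_group(rows, group_by_positions=[0, 1], sum_position=2)
```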
GROUPED AGGREGATION with Column Encoding
• PHOENIX-1940 – TPC-Q1
• TPC Data
• 60% smaller disk size
• Query time – with and without block cache enabled – 25% faster
Heap and GC
Possible Performance Gains (to be measured)
• Faster bulk load times because of smaller data size
• Reduced index build times – both ASYNC and SYNC
• Reduction in network I/O – faster UPSERT, UPSERT SELECT
• Faster joins – smaller hash caches
Work in Progress..
• https://github.com/apache/phoenix/tree/encodecolumns
• 4.9 release
• Make joins take advantage of encoded columns
• More encoding schemes (2-byte column qualifiers)
• More perf. testing and tuning
Thank You!