HyperLogLog in Hive - How to count sheep efficiently?

HyperLogLog in Hive

How to count sheep efficiently?

Phillip Capper: Whitecliffs Sheep

@bzamecnik

Agenda

● the problem – count distinct elements
● exact counting
● fast approximate counting – using HLL in Hive
● comparing performance and accuracy
● appendix – a bit of theory of probabilistic counting
  ○ how does it work?

The problem: count distinct elements

● e.g. the number of unique visitors
● each visitor can make a lot of clicks
● typically grouped in various ways
● the "set cardinality estimation" problem

Small data solutions

● sort the data O(N*log(N)) and skip duplicates O(N)
  ○ O(N) space

● put data into a hash or tree set and iterate
  ○ hash set: O(N) expected build (O(N^2) worst case), O(N) iteration
  ○ tree set: O(N*log(N)) build, O(N) iteration
  ○ both O(N) space

● but: we have big data

Example:

~100M unique values in 5B rows each day

32 bytes per value -> 3 GB unique, 150 GB total
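The two small-data approaches can be sketched in a few lines of Python. This is only an illustration: the function names and the sample `clicks` list are made up for the example and do not come from the slides.

```python
def count_distinct_sorted(values):
    """Sort the data, then count the positions where the value changes."""
    ordered = sorted(values)            # O(N log N)
    distinct = 0
    previous = object()                 # sentinel that equals no real value
    for value in ordered:               # O(N) scan that skips duplicates
        if value != previous:
            distinct += 1
            previous = value
    return distinct


def count_distinct_hashed(values):
    """Put everything into a hash set and take its size."""
    return len(set(values))             # O(N) expected time, O(N) space


# Hypothetical click log: three distinct visitors in six clicks.
clicks = ["u1", "u2", "u1", "u3", "u2", "u1"]
print(count_distinct_sorted(clicks), count_distinct_hashed(clicks))
```

Both variants need memory proportional to the number of values, which is exactly what stops working at the data sizes in the example above.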

Problems with counting big data

● data is partitioned
  ○ across many machines
  ○ in time

● we can't sum the cardinalities of the partitions
  ○ since the subsets are generally not disjoint
  ○ we would overestimate

count(part1) + count(part2) >= count(part1 ∪ part2)

● we need to merge estimators and then estimate cardinality

count(estimator(part1) ∪ estimator(part2))
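A tiny Python illustration of the overestimate, using exact sets as stand-ins for the partitions. The visitor names are invented for the example.

```python
# Two partitions (e.g. two days) with overlapping visitors.
part1 = {"alice", "bob", "carol"}
part2 = {"bob", "carol", "dave"}

sum_of_counts = len(part1) + len(part2)    # counts bob and carol twice
count_of_union = len(part1 | part2)        # merge first, then count

print(sum_of_counts, count_of_union)       # 6 4
```

The same "merge first, count later" order is what a mergeable estimator has to support.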

Exact counting in Hive

SELECT COUNT(DISTINCT user_id)
FROM events;

● single reducer!

Exact counting in Hive – subquery

SELECT COUNT(*) FROM (
  SELECT 1 FROM events
  GROUP BY user_id
) unique_guids;

Or more concisely:

SELECT COUNT(*) FROM (
  SELECT DISTINCT user_id
  FROM events
) unique_guids;

● many reducers
● two phases
● cannot combine more aggregations

Exact counting in Hive

● hive.optimize.distinct.rewrite
  ○ allows Hive to rewrite COUNT(DISTINCT) into a subquery
  ○ since Hive 1.2.0

Probabilistic counting

● fast results, but approximate
● practical example of using HLL in Hive
● more theory in the appendix

Implementations of HLL as Hive UDFs

● klout/brickhouse
  ○ single option
  ○ no JAR, some tests
  ○ based on HLL++ from stream-lib (quite fast)

● jdmaturen/hive-hll
  ○ no options (they are in API, but not implemented!)
  ○ no JAR, no tests
  ○ compatible with java-hll, pg-hll, js-hll

● t3rmin4t0r/hive-hll-udf
  ○ no options, no JAR, no tests

UDFs in Hive

● User-Defined Functions
● a function is registered from a class (loaded from a JAR)
● the JAR needs to be on HDFS (otherwise it fails)
● you can choose the UDF name at will
● they work both in HiveServer2/Beeline and in Hive CLI

ADD JAR hdfs:///path/to/the/library.jar;

CREATE TEMPORARY FUNCTION foo_func
AS 'com.example.foo.FooUDF';

● Usage:

SELECT foo_func(...) FROM ...;

General UDFs API for HLL

● to_hll(value)
  ○ aggregate values to HLL
  ○ UDAF (aggregation function)
  ○ + hash each value
  ○ optionally can be configured (e.g. for precision)

● union_hlls(hll)
  ○ union multiple HLLs
  ○ UDAF

● hll_approx_count(hll)
  ○ estimate cardinality from an HLL
  ○ UDF

HLL can be stored as binary or string type.

Example usage

● Estimate of total unique visitors:

SELECT hll_approx_count(to_hll(user_id))
FROM events;

● Estimate of total events + unique visitors at once:

SELECT
  count(*) AS total_events,
  hll_approx_count(to_hll(user_id))
    AS unique_visitors
FROM events;

Example usage

● Compute each daily estimator once:

CREATE TABLE daily_user_hll AS
SELECT date, to_hll(user_id) AS users_hll
FROM events
GROUP BY date;

● Then quickly aggregate and estimate:

SELECT hll_approx_count(union_hlls(users_hll))
  AS user_count
FROM daily_user_hll
WHERE date BETWEEN '2015-01-01' AND '2015-01-31';

Brickhouse – installation

https://github.com/klout/brickhouse - Hive UDF
https://github.com/addthis/stream-lib - HLL++

$ git clone https://github.com/klout/brickhouse
$ cd brickhouse

disable maven-javadoc-plugin in pom.xml (since it fails)

$ mvn package
$ wget http://central.maven.org/maven2/com/clearspring/analytics/stream/2.3.0/stream-2.3.0.jar
$ scp target/brickhouse-0.7.1-SNAPSHOT.jar \
    stream-2.3.0.jar cluster-host:

cluster-host$ hdfs dfs -copyFromLocal *.jar \
    /user/me/hive-libs

Brickhouse – usage

ADD JAR /user/zamecnik/lib/brickhouse-0.7.1-15f5e8e.jar;
ADD JAR /user/zamecnik/lib/stream-2.3.0.jar;

CREATE TEMPORARY FUNCTION to_hll
  AS 'brickhouse.udf.hll.HyperLogLogUDAF';
CREATE TEMPORARY FUNCTION union_hlls
  AS 'brickhouse.udf.hll.UnionHyperLogLogUDAF';
CREATE TEMPORARY FUNCTION hll_approx_count
  AS 'brickhouse.udf.hll.EstimateCardinalityUDF';

to_hll(value, [bit_precision])

● bit_precision: 4 to 16 (default 6)

Hive-hll usage

ADD JAR /user/zamecnik/lib/hive-hll-0.1-2807db.jar;

CREATE TEMPORARY FUNCTION hll_hash
  AS 'com.kresilas.hll.HashUDF';
CREATE TEMPORARY FUNCTION to_hll
  AS 'com.kresilas.hll.AddAggUDAF';
CREATE TEMPORARY FUNCTION union_hlls
  AS 'com.kresilas.hll.UnionAggUDAF';
CREATE TEMPORARY FUNCTION hll_approx_count
  AS 'com.kresilas.hll.CardinalityUDF';

Hive-hll usage

We have to explicitly hash the value:

SELECT
  hll_approx_count(to_hll(hll_hash(user_id)))
FROM events;

Options for creating HLL:

to_hll(x, [log2m, regwidth, expthresh, sparseon])

hardcoded to:

[log2m=11, regwidth=5, expthresh=-1, sparseon=true]

Nice things

● HLLs are additive
  ○ can be computed once
  ○ various partitions can be merged and estimated for cardinality later

● we can count multiple unique columns at once
  ○ no need to subquery
  ○ we can do wild grouping (by country, browser, …)

● HLLs take only little space

Rolling window

-- keep a reasonable number of tasks for a month of data
SET mapreduce.input.fileinputformat.split.maxsize=5368709120;
-- keep a low number of output files (HLLs are quite small)
SET hive.merge.mapredfiles=true;
-- maximum precision
SET hivevar:hll_precision=16;

-- HLL for each day
CREATE TABLE guids_parquet_hll AS
SELECT
  '${year}' AS year,
  '${month}' AS month,
  day,
  to_hll(guid, ${hll_precision}) AS guid_hll
FROM parquet.dump_${year}_${month}
GROUP BY day;

Rolling window

-- for each day, estimate the number of guids over the last 7 days
-- (ROWS 6 PRECEDING = the current day plus the 6 preceding days;
-- the table name says 30 days, but the window below covers 7)
CREATE TABLE zamecnik.guids_parquet_rolling_30_day_count AS
SELECT
  `date`,
  hll_approx_count(guids_union) AS guid_count
FROM (
  SELECT
    concat(`year`, '-', `month`, '-', `day`) AS `date`,
    union_hlls(guid_hll) OVER w AS guids_union
  FROM guids_parquet_hll
  WINDOW w AS (
    ORDER BY `year`, `month`, `day`
    ROWS 6 PRECEDING
  )
) rolling_guids;
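The logic of the rolling-window query can be mimicked in plain Python. In this sketch exact sets stand in for the per-day HLLs, set union stands in for union_hlls and len() for hll_approx_count. The function name and the sample data are assumptions made for the illustration.

```python
from collections import deque


def rolling_distinct(daily_sets, window=7):
    """For each day, count distinct ids in that day plus the (window - 1) preceding days."""
    recent = deque(maxlen=window)       # keeps only the last `window` daily sets
    counts = []
    for day_set in daily_sets:
        recent.append(day_set)
        counts.append(len(set().union(*recent)))   # merge first, then count
    return counts


# Hypothetical four days of visitors, with a 2-day window to keep it readable.
days = [{"a", "b"}, {"b", "c"}, {"d"}, {"a"}]
print(rolling_distinct(days, window=2))   # [2, 3, 3, 2]
```

With real HLLs the per-day sketches are small and fixed-size, so re-merging a window for every day stays cheap.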

Pitfalls

● when JARs are not on HDFS the query fails (why?)
● computing on many days of raw clickstream fails in Beeline (works in Hive CLI), parquet is ok
● HIVE-9073 WINDOW + custom UDAF → NPE
  ○ fixed in Hive 1.2.0
● DISTRO-631

Approximation error

● Typically < 1-2 %
● Can be controlled by the parameters
● Example: 1 year of guids
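As a rough guide to how the precision parameter controls the error, the original HyperLogLog analysis gives a relative standard error of about 1.04 / sqrt(m) for m = 2^precision registers. The helper below is a sketch built on that formula; its name is made up here, and concrete implementations (e.g. HLL++) may behave somewhat differently, especially for small cardinalities.

```python
import math


def hll_standard_error(precision_bits):
    """Approximate relative standard error of HLL with 2**precision_bits registers."""
    m = 2 ** precision_bits
    return 1.04 / math.sqrt(m)


for p in (6, 11, 16):
    # prints roughly 13%, 2.3% and 0.41%
    print(p, 2 ** p, f"{hll_standard_error(p):.2%}")
```

Each additional 2 bits of precision halves the expected error and quadruples the number of registers.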

Appendix – more interesting things

Probabilistic counting

● trade-off: some approximation error for far better performance and memory consumption

● sketch - streaming & probabilistic algorithm
● KMV - k minimal values
● linear counter
● loglog counter

LogLog counter

● run length of initial zeros
● multiple estimators (registers)
● stochastic averaging
  ○ single hash function
  ○ multiple buckets

● hash → (register index, run length)
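The mapping hash → (register index, run length) might look like this in Python. This is a sketch under assumptions: a 64-bit hash taken from SHA-1 (a stand-in so the example needs only the standard library), 4 index bits, and the common convention of storing the run of leading zeros plus one. All names are illustrative.

```python
import hashlib

P = 4              # index bits -> 2**P = 16 registers (assumed for the example)
HASH_BITS = 64     # width of the hash we split


def hash64(value):
    """64-bit hash of a string; SHA-1 is only a stand-in for a fast hash."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def split_hash(h, p=P):
    """Split a hash into (register index, run length of leading zeros + 1)."""
    rest_bits = HASH_BITS - p
    index = h >> rest_bits                      # top p bits pick the register
    rest = h & ((1 << rest_bits) - 1)           # remaining bits feed the estimator
    rank = rest_bits - rest.bit_length() + 1    # leading zeros in `rest`, plus one
    return index, rank


def add(registers, value):
    """Stochastic averaging: one hash function, many registers, keep the max rank."""
    index, rank = split_hash(hash64(value))
    registers[index] = max(registers[index], rank)


registers = [0] * (1 << P)
for visitor in ("alice", "bob", "carol"):
    add(registers, visitor)
```

A long run of zeros is rare, so a large value in a register suggests that many distinct values were hashed into it.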

Linear counter

import math

import mmh3                          # MurmurHash3 bindings
from bitarray import bitarray

m = 20                               # size of the register
register = bitarray(m)               # register, m bits
register.setall(0)                   # make sure all bits start at zero

def add(value):
    h = mmh3.hash(value) % m         # select bit index
    register[h] = 1                  # = max(1, register[h])

def cardinality():
    u_n = register.count(0)          # number of zeros
    v_n = u_n / m                    # relative number of zeros
    n_hat = -m * math.log(v_n)       # estimate of the set cardinality (undefined once no zero bits remain)
    return n_hat

HyperLogLog (HLL)

● structure like loglog counter
● harmonic mean to combine registers
● correction for small and large cardinalities
● values need to be hashed well – murmur3
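A minimal HyperLogLog sketch in Python, to make the points above concrete. It is a teaching sketch under assumptions, not the code of any of the Hive libraries: SHA-1 truncated to 64 bits stands in for murmur3, the default of 11 index bits simply mirrors the log2m=11 setting mentioned earlier, and only the small-range correction (linear counting) is included, since a 64-bit hash makes the large-range correction unnecessary in practice.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=11):
        self.p = p                       # number of index bits
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash
        rest_bits = 64 - self.p
        index = h >> rest_bits                       # register index
        rest = h & ((1 << rest_bits) - 1)
        rank = rest_bits - rest.bit_length() + 1     # leading zeros + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def cardinality(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)             # bias constant, valid for m >= 128
        # harmonic mean of 2**register over all registers
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            return m * math.log(m / zeros)           # small range: linear counting
        return raw


hll = HyperLogLog()
for i in range(20000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")                             # duplicates change nothing
estimate = hll.cardinality()
```

With 2048 registers the estimate should typically land within a few percent of the true 20000.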

HLL union

● just take max of each register value
● no loss – same result as HLL of union of streams
● parallelizable
● union preserves error bound, intersection/diff do not
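The "no loss" property can be checked directly: taking the register-wise maximum of two sketches gives exactly the registers of a sketch built over both streams. The code below is an illustrative sketch with assumed names and parameters (10 index bits, SHA-1 as a stand-in hash), not the API of any particular library.

```python
import hashlib

P = 10                 # index bits (assumed for the example)
M = 1 << P             # number of registers


def registers_for(values):
    """Build HLL-style registers for a stream of values."""
    regs = [0] * M
    for v in values:
        digest = hashlib.sha1(str(v).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        rest_bits = 64 - P
        index = h >> rest_bits
        rest = h & ((1 << rest_bits) - 1)
        rank = rest_bits - rest.bit_length() + 1
        regs[index] = max(regs[index], rank)
    return regs


def union(a, b):
    """Union of two sketches: just take the max of each register."""
    return [max(x, y) for x, y in zip(a, b)]


day1 = range(0, 6000)            # overlapping streams
day2 = range(4000, 10000)

merged = union(registers_for(day1), registers_for(day2))
direct = registers_for(range(0, 10000))
print(merged == direct)          # True
```

Because max is associative and commutative, partial sketches can be merged in any order and on any machine.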

Further reading

● very nice explanation of HLL
● Probabilistic Data Structures For Web Analytics And Data Mining
● Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure
● HyperLogLog in Pure SQL
● Use Subqueries to Count Distinct 50X Faster
● It is possible to combine HLL of different sizes

Papers

● HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm

● https://github.com/addthis/stream-lib#cardinality

Other problems & structures

● set membership – bloom filter
● top-k elements – count-min-sketch, stream-summary