Cardinality Estimation
with special thanks to Jérémie Lumbroso
Robert Sedgewick, Princeton University

Philippe Flajolet (1948–2011), mathematician and computer scientist extraordinaire
Don Knuth’s legacy: Analysis of Algorithms (AofA)

Understood since Babbage:
• Computational resources are limited.
• Method (algorithm) used matters.

Analytic Engine: “how many times do we have to turn the crank?”

Knuth’s insight: AofA is a scientific endeavor.
• Start with a working program (algorithm implementation).
• Develop mathematical model of its behavior.
• Use the model to formulate hypotheses on resource usage.
• Use the program to validate hypotheses.
• Iterate on basis of insights gained.

Difficult to overstate the significance of this insight.
AofA has played a critical role in the development of our computational infrastructure and in the advance of scientific knowledge.

How many times to turn the crank?
How long to compile my program?
How long to sort random data for cryptanalysis preprocessing?
How long to check that my VLSI circuit follows the rules?
How many bodies in motion can I simulate?
How quickly can I find clusters?
“People who analyze algorithms have double happiness. They experience the sheer beauty of elegant mathematical patterns that surround elegant computational procedures. Then they receive a practical payoff when their theories make it possible to get other jobs done more quickly and more economically.”
− Don Knuth
Analysis of Algorithms (present-day context)

Practical computing
• Real code on real machines
• Thorough validation
• Limited math models

Theoretical computer science
• Theorems
• Abstract math models
• Limited experimentation

AofA
• Theorems and code
• Precise math models
• Experiment, validate, iterate
Cardinality Estimation
• Warmup: exact cardinality count
• Probabilistic counting
• Stochastic averaging
• Refinements
• Final frontier
Cardinality counting

Q. In a given stream of data values, how many different values are present?

Reference application. How many unique visitors in a web log?

log.07.f3.txt (6 million strings):
pool-71-104-94-246.lsanca.dsl-w.verizon.net gsearch.CS.Princeton.EDU CPE001cdfbc55ac-CM0011ae926e6c.cpe.net.cable.rogers.com CPE001cdfbc55ac-CM0011ae926e6c.cpe.net.cable.rogers.com 118-171-27-8.dynamic.hinet.net cpe-76-170-182-222.socal.res.rr.com 203-88-22-144.live.vodafone.in 117.211.88.36 cpe-76-170-182-222.socal.res.rr.com 117.211.88.36 220.248.45.69 HSI-KBW-078-042-025-130.hsi3.kabel-badenwuerttemberg.de 220.248.45.69 59.7.197.132 cpe-76-170-182-222.socal.res.rr.com 87-206-191-72.dynamic.chello.pl pool-71-104-94-246.lsanca.dsl-w.verizon.net adsl-76-212-198-113.dsl.sndg02.sbcglobal.net telemedia-ap-Static-113.5.175.122.airtelbroadband.in ABTS-KK-Dynamic-095.93.179.122.airtelbroadband.in cs27063206.pp.htv.fi pool-71-104-94-246.lsanca.dsl-w.verizon.net gsearch.CS.Princeton.EDU 109-162-6-189-ter.broadband.kyivstar.net 109-162-6-189-ter.broadband.kyivstar.net cm56-159-216.liwest.at 109-162-6-189-ter.broadband.kyivstar.net 109-162-6-189-ter.broadband.kyivstar.net crawl-66-249-71-36.googlebot.com ABTS-KK-Dynamic-095.93.179.122.airtelbroadband.in telemedia-ap-Static-113.5.175.122.airtelbroadband.in c-71-231-40-193.hsd1.wa.comcast.net 29-72-93-178.pool.ukrtel.net 118-171-27-8.dynamic.hinet.net

State of the art in the wild for decades. Sort, then count (“unique”).

UNIX (1970s–present):
% sort -u log.07.f3.txt | wc -l
1112365

SQL (1970s–present):
SELECT DATE_TRUNC('day', event_time),
       COUNT(DISTINCT user_id),
       COUNT(DISTINCT url)
FROM weblog
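The same exact count can be done in one pass without sorting, by inserting every value into a set and reporting its size. A minimal Python sketch (the example stream below is hypothetical; feeding it the lines of a file such as log.07.f3.txt works the same way):

```python
# Exact cardinality count: one pass, a set of all distinct values seen.
# Same result as "sort -u file | wc -l", without the sort.

def exact_cardinality(lines):
    """Return the number of distinct values in the iterable `lines`."""
    distinct = set()
    for value in lines:
        distinct.add(value.strip())  # normalize trailing newlines from files
    return len(distinct)

if __name__ == "__main__":
    stream = ["alice", "bob", "alice", "carol", "bob"]  # toy stand-in for a web log
    print(exact_cardinality(stream))  # 3
```

Note the cost: memory grows linearly with the number of distinct values, which is exactly the limitation the probabilistic methods later in the talk remove.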
Standard “optimal” solution: Use a hash table

Hashing with linear probing:
• Create a table of size M.
• Transform each value into a “random” table index.
  (Example: multiply by a prime, then take the remainder after dividing by M.)
• Move right to find space if the value collides.
• Count values new to the table.

Small example data stream: P J J E K J L C K O M T P G L J I F K C
Hash values ((x − ‘A’) * 97 % 17): 15 6 6 14 1 6 13 7 1 15 8 7 15 4 …

[Slide animation: each value is hashed into a table of size M = 17 (indices 0–16); a value already in the table leaves the count unchanged, and each value new to the table is inserted and increments the count.]
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPC
count 012345678
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG
count 0123456789
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG
count 0123456789
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG
count 0123456789
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG
count 0123456789
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG I
count 012345678910
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11 9
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG I
count 012345678910
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11 9
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG I
Fcount 012345678910
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11 9
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG I
Fcount 012345678910
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11 9
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG IF
count 01234567891011
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11 9 1
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG IF
count 01234567891011
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11 9 1 7
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG IF
count 01234567891011
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11 9 1 7
Additional (key) idea. Keep searches short by doubling table size when it becomes half full.
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG IF
count 01234567891011
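The linear-probing scheme above can be sketched as a self-contained Java class. This is a minimal illustration of the technique, not code from the slides: the class name, the use of String keys, and the hashCode-based hash function are my assumptions.

```java
public class LinearProbingCounter {
    private String[] table = new String[16];
    private int count = 0;                      // number of distinct values seen so far

    // transform a key into a table index in [0, table.length)
    private int hash(String key) {
        return (key.hashCode() & 0x7fffffff) % table.length;
    }

    // add a value; increment count only if it is new to the table
    public void add(String key) {
        int i = hash(key);
        while (table[i] != null) {
            if (table[i].equals(key)) return;   // already counted
            i = (i + 1) % table.length;         // move right to find space (wrap around)
        }
        table[i] = key;
        count++;
        // key idea: double the table size when it becomes half full
        if (2 * count >= table.length) resize(2 * table.length);
    }

    // rebuild the table at the new capacity, rehashing every stored key
    private void resize(int capacity) {
        String[] old = table;
        table = new String[capacity];
        count = 0;
        for (String key : old)
            if (key != null) add(key);
    }

    public int count() { return count; }

    public static void main(String[] args) {
        LinearProbingCounter c = new LinearProbingCounter();
        for (String s : "P J J E K J L C K O M T P G L J I F K C".split(" "))
            c.add(s);
        System.out.println(c.count());          // 12 distinct values in the small example
    }
}
```

Running main on the slides’ 20-character stream reports 12 distinct values; doubling at half full keeps the load factor at most 1/2, so probe sequences stay short.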
9

Mathematical analysis of exact cardinality count with linear probing

Theorem. Expected time and space cost is linear.

Proof. Follows from classic Knuth Theorem 6.4.K.

“I first formulated [this] derivation in 1962. Since this was the first nontrivial algorithm I had ever analyzed satisfactorily, it had a strong influence on the structure of these books. Ever since that day, the analysis of algorithms has in fact been one of the major themes of my life.”
— Knuth, TAOCP volume 3

Q. Do the hash functions that we use uniformly and independently distribute keys in the table?

A. Not likely.
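The classical result behind that proof can be stated explicitly. This is the standard form of the linear-probing analysis (stated here for context, not quoted from the slides): for a table of size $M$ holding $N$ keys, with load factor $\alpha = N/M$, the expected number of probes is

```latex
C_N \approx \frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right)
  \quad \text{(successful search)}
\qquad
C'_N \approx \frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^2}\right)
  \quad \text{(unsuccessful search)}
```

With table doubling at half full, $\alpha \le 1/2$, so these are at most about $1.5$ and $2.5$ probes per operation, and the total expected cost of $N$ insertions is linear in $N$.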
10

Scientific validation of exact cardinality count with linear probing

Hypothesis. Time and space cost is linear for the hash functions we use and the data we have.

Quick experiment. Doubling the problem size should double the running time.

Driver to read N strings and count distinct values:

public static void main(String[] args)
{
    int N = Integer.parseInt(args[0]);           // get problem size
    StringStream stream = new StringStream(N);   // initialize input stream
    long start = System.currentTimeMillis();     // get current time
    StdOut.println(count(stream));               // print count
    long now = System.currentTimeMillis();
    double time = (now - start) / 1000.0;
    StdOut.println(time + " seconds");           // print elapsed time
}

% java Hash 2000000 < log.07.f3.txt
483477
3.322 seconds

% java Hash 4000000 < log.07.f3.txt
883071
6.55 seconds

% java Hash 6000000 < log.07.f3.txt
1097944
9.49 seconds  ✓

% sort -u log.07.f3 | wc -l
1097944
(sort-based method takes about 3 minutes)

Q. Is hashing with linear probing effective?

A. Yes. Validated in countless applications for over half a century.
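The timing ratios from the runs above can be checked directly; this small snippet (my own sanity check, using the times printed on the slide) confirms that doubling and tripling the problem size roughly doubles and triples the running time, as the linear hypothesis predicts.

```java
public class RatioCheck {
    public static void main(String[] args) {
        // elapsed times reported for N = 2M, 4M, 6M strings
        double t2 = 3.322, t4 = 6.55, t6 = 9.49;
        System.out.println(t4 / t2);   // about 1.97: doubling N about doubles the time
        System.out.println(t6 / t2);   // about 2.86: tripling N about triples the time
    }
}
```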
11
Complexity of exact cardinality count
Q. Does there exist an optimal algorithm for this problem?Upper bound
Lower bound
11
Complexity of exact cardinality count
Q. Does there exist an optimal algorithm for this problem?
A. Depends on definition of “optimal”.
Upper bound
Lower bound
11
Complexity of exact cardinality count
Q. Does there exist an optimal algorithm for this problem?
Guaranteed linear-time? NO. Linearithmic lower bound.
A. Depends on definition of “optimal”.
Upper bound
Lower bound
11
Complexity of exact cardinality count
Q. Does there exist an optimal algorithm for this problem?
11

Complexity of exact cardinality count

Q. Does there exist an optimal algorithm for this problem?

A. Depends on definition of “optimal”.
• Guaranteed linearithmic? YES. Balanced BSTs or mergesort. [upper bound]
• Guaranteed linear-time? NO. Linearithmic lower bound. [lower bound]
• Linear-time with high probability, assuming the existence of random bits? YES. Dynamic perfect hashing.
  (Dietzfelbinger, Karlin, Mehlhorn, Meyer auf der Heide, Rohnert, and Tarjan, Dynamic Perfect Hashing: Upper and Lower Bounds, SICOMP 1994.)
• Linear with a small constant factor in practical situations? YES. Hashing with linear probing.
  (M. Mitzenmacher and S. Vadhan, Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream, SODA 2008.)

Hypothesis. Hashing with linear probing is “optimal” (but TSTs may give a sublinear algorithm).
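The hashing-with-linear-probing approach named in the hypothesis can be sketched in a few lines. This is a minimal illustration of one-pass exact counting, not the tuned implementation the slide refers to; the table size and probe policy here are arbitrary choices.

```java
// Sketch: exact cardinality count via hashing with linear probing.
// One pass; expected constant time per value under the uniform hashing assumption.
public class LinearProbingCount {
    public static int countDistinct(String[] stream) {
        int m = 4 * stream.length + 1;      // keep the table sparse so probe sequences stay short
        String[] table = new String[m];
        int count = 0;
        for (String value : stream) {
            int i = (value.hashCode() & 0x7fffffff) % m;
            while (table[i] != null && !table[i].equals(value))
                i = (i + 1) % m;            // linear probing: step to the next slot
            if (table[i] == null) {         // first occurrence of this value
                table[i] = value;
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] stream = { "a.com", "b.com", "a.com", "c.com", "b.com", "a.com" };
        System.out.println(countDistinct(stream));   // prints 3
    }
}
```

Note the catch the next slide makes explicit: the table holds every distinct value, so memory grows linearly with the cardinality.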
12

Exact cardinality count requires linear space

Q. I can’t use a hash table. The stream is much too big to fit all values in memory. Now what? Help!

pool-71-104-94-246.lsanca.dsl-w.verizon.net gsearch.CS.Princeton.EDU CPE001cdfbc55ac-CM0011ae926e6c.cpe.net.cable.rogers.com CPE001cdfbc55ac-CM0011ae926e6c.cpe.net.cable.rogers.com 118-171-27-8.dynamic.hinet.net cpe-76-170-182-222.socal.res.rr.com 203-88-22-144.live.vodafone.in 117.211.88.36 cpe-76-170-182-222.socal.res.rr.com 117.211.88.36 220.248.45.69 HSI-KBW-078-042-025-130.hsi3.kabel-badenwuerttemberg.de 220.248.45.69 59.7.197.132 cpe-76-170-182-222.socal.res.rr.com 87-206-191-72.dynamic.chello.pl pool-71-104-94-246.lsanca.dsl-w.verizon.net adsl-76-212-198-113.dsl.sndg02.sbcglobal.net telemedia-ap-Static-113.5.175.122.airtelbroadband.in ABTS-KK-Dynamic-095.93.179.122.airtelbroadband.in cs27063206.pp.htv.fi pool-71-104-94-246.lsanca.dsl-w.verizon.net gsearch.CS.Princeton.EDU 109-162-6-189-ter.broadband.kyivstar.net 109-162-6-189-ter.broadband.kyivstar.net cm56-159-216.liwest.at 109-162-6-189-ter.broadband.kyivstar.net 109-162-6-189-ter.broadband.kyivstar.net crawl-66-249-71-36.googlebot.com ABTS-KK-Dynamic-095.93.179.122.airtelbroadband.in telemedia-ap-Static-113.5.175.122.airtelbroadband.in c-71-231-40-193.hsd1.wa.comcast.net 29-72-93-178.pool.ukrtel.net 118-171-27-8.dynamic.hinet.net

A. Bad news: You cannot get an exact count.
A. (Bloom, 1970) You can get an accurate estimate using a few bits per distinct value.
A. Much better news: You can get an accurate estimate using only a handful of bits (stay tuned).
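Bloom’s “few bits per distinct value” idea can be illustrated with the standard linear-counting estimator: hash each value to one bit of an m-bit array, then invert the expected fraction of bits left at 0. To be clear, this is one common realization of a bit-per-value sketch, not necessarily the exact 1970 scheme; the stream and array size below are toy choices.

```java
import java.util.BitSet;

// Sketch: estimate cardinality from an m-bit array (linear counting).
// With n distinct values, E[fraction of zero bits] = (1 - 1/m)^n ≈ e^(-n/m),
// so n ≈ -m ln(zeroFraction). Assumes at least one bit remains 0.
public class LinearCounting {
    static double estimate(String[] stream, int m) {
        BitSet bits = new BitSet(m);                 // m bits total
        for (String value : stream)
            bits.set((value.hashCode() & 0x7fffffff) % m);
        double zeroFraction = (double) (m - bits.cardinality()) / m;
        return -m * Math.log(zeroFraction);          // invert the expected-zeros formula
    }

    public static void main(String[] args) {
        String[] stream = { "a", "b", "a", "c" };    // toy stream, 3 distinct values
        System.out.println(estimate(stream, 64));    // ≈ 3.07
    }
}
```

The memory cost is a few bits per distinct value, as the slide says; the “handful of bits” algorithms that follow do far better still.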
Cardinality Estimation

• Warmup: exact cardinality count
• Probabilistic counting
• Stochastic averaging
• Refinements
• Final frontier
14

Cardinality estimation is a fundamental problem with many applications where memory is limited.

Q. About how many different values appear in a given stream?

Constraints
• Make one pass through the stream.
• Use as few operations per value as possible.
• Use as little memory as possible.
• Produce as accurate an estimate as possible.

Typical applications
• How many unique visitors to my website?
• How many different websites visited by each customer?
• How many different values for a database join?
• Which sites are the most/least popular?

To fix ideas on scope: Think of billions of streams, each having trillions of values.
15

Probabilistic counting with stochastic averaging (PCSA)

Flajolet and Martin, Probabilistic Counting Algorithms for Data Base Applications, FOCS 1983 / JCSS 1985.

Contributions
• Introduced the problem
• Idea of a streaming algorithm
• Idea of a “small” sketch of “big” data
• Detailed analysis that yields tight bounds on accuracy
• Full validation of mathematical results with experimentation
• Practical algorithm that has remained effective for decades

Bottom line: Quintessential example of the effectiveness of the scientific approach to algorithm design.

Philippe Flajolet 1948-2011
16

PCSA first step: Use hashing

Transform each value to a “random” computer word.
• Compute a hash function that transforms the data value into a 32- or 64-bit value.
  [20th century: use 32 bits (millions of values); 21st century: use 64 bits (quadrillions of values)]
• Cardinality count is unaffected (with high probability).
• Built-in capability in modern systems.
• Allows use of fast machine-code operations.

Example: Java
• All data types implement a hashCode() method (though we often override the default).
• String data type stores its hash value (computed once).

String value = "gsearch.CS.Princeton.EDU";
int x = value.hashCode();   // current Java default is a 32-bit int value

00010000111001101000111010010011 00001001011011100000010010010111 00001001011011100000010010010111 00111000101001001011010101001100 00111000101001001011010101001100 01101001001000011100110100110011 00001000011101100110110010100101 00001001011011100000010010010111 00001001001011010110111101111110 00001001001011010110111101111110 01101001001000011100110100110011 01100000110010011101011001011111 01110111001110110010000010111111 01011001010110101100010110110100 00101101101111101111001110001100 01101001001000011100110100110011 01110110000011101101011001000101 00001001001011010110111101111110 00001001001011010110111101111110 00001001001011010110111101111110 01101001001000011100110100110011 01101001001000011100110100110011 01101001001000011100110100110011 01101001001000011100110100110011 01101001001000011100110100110011 01110111001110110010000010111111 01101001001000011100110100110011 00001000011101100110110010100101 00101101101111101111001110001100 01101001001000011100110100110011 00001000011101100110110010100101 01101001001000011100110100110011 01101001001000011100110100110011 01100001000111001001110010100000

“Random” except for the fact that some values are equal.

Bottom line: Do cardinality estimation on streams of (binary) integers.
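The first step above can be sketched end to end: replace each value by its 32-bit hash and count on the integer stream instead. This is a toy illustration (single-letter values stand in for the hostnames in a real log); equal values always produce equal hashes, so the count is unaffected unless two distinct values collide, which happens with low probability for 32- or 64-bit hashes.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: hash each stream value to a "random" integer, then count
// distinct integers. The distinct-hash count equals the distinct-value
// count unless two different values collide.
public class HashStream {
    static int distinctHashes(String[] stream) {
        Set<Integer> hashes = new HashSet<>();
        for (String value : stream)
            hashes.add(value.hashCode());   // current Java default: 32-bit int
        return hashes.size();
    }

    public static void main(String[] args) {
        String[] stream = { "a", "b", "a", "c", "b", "a" };   // 3 distinct values
        System.out.println(distinctHashes(stream));           // prints 3
    }
}
```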
17

Initial hypothesis

Hypothesis. Uniform hashing assumption is reasonable in this context.

Implication. Need to run experiments to validate any hypotheses about performance.

No problem!
• AofA is a scientific endeavor (we always validate hypotheses).
• End goal is development of algorithms that are useful in practice.
• It is the responsibility of the designer to validate utility before claiming it.
• After decades of experience, discovering a performance problem due to a bad hash function would be a significant research result.

Unspoken bedrock principle of AofA. Experimenting to validate hypotheses is WHAT WE DO!
0 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 x1 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0 ~x0 1 1 0 1 0 0 1 0 1 1 0 0 0 0 0 x + 10 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 ~x & (x + 1)
18
Probabilistic counting starting point: three integer functions
Definition. r (x) is the number of trailing 1s in the binary representation of x.
Definition. R(x) = 2r (x)
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 p (x) r(x) R (x) R (x)21 0 1 1 1 1 0 1 1 1 1 1 0 1 0 1 12 1 2 1 0
1 0 1 0 1 0 1 0 1 0 0 0 1 1 1 0 8 0 1 1
0 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 10 5 32 1 0 0 0 0 0
Bit-whacking magic: R(x) is easy to compute.
3 instructions on a typical computer
Exercise: Compute p (x) as easily.
position of rightmost 0
Definition. p (x) is the number of 1s in the binary representation of x.
0 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 x1 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0 ~x0 1 1 0 1 0 0 1 0 1 1 0 0 0 0 0 x + 10 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 ~x & (x + 1)
18
Probabilistic counting starting point: three integer functions
Definition. r (x) is the number of trailing 1s in the binary representation of x.
Definition. R(x) = 2r (x)
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 p (x) r(x) R (x) R (x)21 0 1 1 1 1 0 1 1 1 1 1 0 1 0 1 12 1 2 1 0
1 0 1 0 1 0 1 0 1 0 0 0 1 1 1 0 8 0 1 1
0 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 10 5 32 1 0 0 0 0 0
Bit-whacking magic: R(x) is easy to compute.
3 instructions on a typical computer
Exercise: Compute p (x) as easily.
position of rightmost 0
Definition. p (x) is the number of 1s in the binary representation of x.
0 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 x1 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0 ~x0 1 1 0 1 0 0 1 0 1 1 0 0 0 0 0 x + 10 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 ~x & (x + 1)
Beeler, Gosper, and SchroeppelHAKMEM item 169, MIT AI Laboratory AIM 239, 1972
http://www.inwap.com/pdp10/hbaker/hakmem/hakmem.html
18
Probabilistic counting starting point: three integer functions
Definition. r (x) is the number of trailing 1s in the binary representation of x.
Definition. R(x) = 2r (x)
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 p (x) r(x) R (x) R (x)21 0 1 1 1 1 0 1 1 1 1 1 0 1 0 1 12 1 2 1 0
1 0 1 0 1 0 1 0 1 0 0 0 1 1 1 0 8 0 1 1
0 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 10 5 32 1 0 0 0 0 0
Bit-whacking magic: R(x) is easy to compute.
3 instructions on a typical computer
Exercise: Compute p (x) as easily.
position of rightmost 0
Definition. p (x) is the number of 1s in the binary representation of x.
0 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 x1 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0 ~x0 1 1 0 1 0 0 1 0 1 1 0 0 0 0 0 x + 10 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 ~x & (x + 1)
Note: r (x ) = p (R(x ) − 1).
Beeler, Gosper, and SchroeppelHAKMEM item 169, MIT AI Laboratory AIM 239, 1972
http://www.inwap.com/pdp10/hbaker/hakmem/hakmem.html
18
Probabilistic counting starting point: three integer functions
Definition. r (x) is the number of trailing 1s in the binary representation of x.
Definition. R(x) = 2r (x)
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 p (x) r(x) R (x) R (x)21 0 1 1 1 1 0 1 1 1 1 1 0 1 0 1 12 1 2 1 0
1 0 1 0 1 0 1 0 1 0 0 0 1 1 1 0 8 0 1 1
0 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 10 5 32 1 0 0 0 0 0
Bit-whacking magic: R(x) is easy to compute.
3 instructions on a typical computer
Exercise: Compute p (x) as easily.
position of rightmost 0
Definition. p (x) is the number of 1s in the binary representation of x.
0 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 x1 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0 ~x0 1 1 0 1 0 0 1 0 1 1 0 0 0 0 0 x + 10 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 ~x & (x + 1)
Note: r (x ) = p (R(x ) − 1).
Beeler, Gosper, and SchroeppelHAKMEM item 169, MIT AI Laboratory AIM 239, 1972
http://www.inwap.com/pdp10/hbaker/hakmem/hakmem.html
see also Knuth volume 4A
18
Probabilistic counting starting point: three integer functions
Definition. r(x) is the number of trailing 1s in the binary representation of x.

Definition. R(x) = 2^r(x)  (position of the rightmost 0 in x, as a power of 2)

   bits 15..0          p(x)  r(x)  R(x)  R(x) in binary
   1011110111110101     12    1     2     10
   1010101010001110      8    0     1     1
   0110100101011111     10    5    32     100000

Bit-whacking magic: R(x) is easy to compute (3 instructions on a typical computer).

   0 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1   x
   1 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0   ~x
   0 1 1 0 1 0 0 1 0 1 1 0 0 0 0 0   x + 1
   0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0   ~x & (x + 1)

Exercise: Compute p(x) as easily.

Definition. p(x) is the number of 1s in the binary representation of x.

Note: r(x) = p(R(x) − 1).

Bottom line: p(x), r(x), and R(x) can all be computed with just a few machine instructions.
Beeler, Gosper, and Schroeppel, HAKMEM item 169, MIT AI Laboratory AIM 239, 1972.
http://www.inwap.com/pdp10/hbaker/hakmem/hakmem.html
see also Knuth volume 4A
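The three functions are easy to check in Java; a minimal sketch (the class name is mine, and `Long.bitCount` supplies p(x) directly; the test value is the third row of the table above):

```java
public class BitFns {
    // p(x): number of 1 bits in x (population count)
    static int p(long x) { return Long.bitCount(x); }

    // R(x): position of the rightmost 0 bit of x, as a power of 2
    static long R(long x) { return ~x & (x + 1); }

    // r(x): number of trailing 1s, via the identity r(x) = p(R(x) - 1)
    static int r(long x) { return p(R(x) - 1); }

    public static void main(String[] args) {
        long x = 0b0110100101011111L;  // third row of the table above
        System.out.println(p(x) + " " + r(x) + " " + R(x));  // 10 5 32
    }
}
```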
19
Probabilistic counting (Flajolet and Martin, 1983)
Maintain a single-word sketch that summarizes a data stream x0, x1, …, xN, …
• For each xN in the stream, update sketch by bitwise OR with R(xN).
• Use position of rightmost 0 (with slight correction factor) to estimate lg N.
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
sketch 0 0 0 0 0 0 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
xN 0 0 1 1 0 1 0 1 0 1 1 1 1 1 1 0 1 0 1 0 1 0 1 0 1 0 0 0 1 1 1 1
R(xN) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
sketch | R(xN) 0 0 0 0 0 0 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
(typical sketch, N = 10^6; R(x) = 2^k with probability 1/2^k, so leading bits are almost surely 0, trailing bits almost surely 1, and the position of the rightmost 0 is an estimate of lg N)

Rough estimate of lg N is r(sketch).

Rough estimate of N is R(sketch) (correction factor needed; stay tuned).
20
Probabilistic counting trace
x                                 r(x)  R(x)     sketch
01100010011000111010011110111011   2    100      00000000000000000000000000000100
01100111001000110001111100000101   1    10       00000000000000000000000000000110
00010001000111000110110110110011   2    100      00000000000000000000000000000110
01000100011101110000000111011111   5    100000   00000000000000000000000000100110
01101000001011000101110001000100   0    1        00000000000000000000000000100111
00110111101100000000101001010101   1    10       00000000000000000000000000100111
00110100011000111010101111111100   0    1        00000000000000000000000000100111
00011000010000100001011100110111   3    1000     00000000000000000000000000101111
00011001100110011110010000111111   6    1000000  00000000000000000000000001101111
01000101110001001010110011111100   0    1        00000000000000000000000001101111

R(sketch) = 10000₂ = 16
21
Probabilistic counting (Flajolet and Martin, 1983)
public long R(long x)
{  return ~x & (x+1);  }

public long estimate(Iterable<String> stream)
{
   long sketch = 0;
   for (String s : stream)
      sketch = sketch | R(s.hashCode());
   return R(sketch);
}
Early example of “a simple algorithm whose analysis isn’t”
Maintain a sketch of the data
• A single word
• OR of all values of R(x) in the stream
• Return smallest value not seen
Q. (Martin) Estimate seems a bit low. How much?
A. (unsatisfying) Obtain correction factor empirically.
A. (Flajolet) Without the analysis, there is no algorithm!
with correction for bias: return R(sketch)/.77351
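Putting the slide's pieces together, a self-contained sketch with the bias correction applied (the class wrapper and the demo stream are mine, not the authors' code):

```java
import java.util.Arrays;

public class ProbabilisticCounting {
    static long R(long x) { return ~x & (x + 1); }

    // PC estimate with the .77351 bias correction applied
    static long estimate(Iterable<String> stream) {
        long sketch = 0;
        for (String s : stream)
            sketch = sketch | R(s.hashCode());
        return (long) (R(sketch) / .77351);
    }

    public static void main(String[] args) {
        // tiny demo stream; duplicates do not change the sketch
        System.out.println(estimate(Arrays.asList("a", "b", "a", "c")));
    }
}
```

On a stream this tiny the estimate is dominated by variance; the slides that follow show how averaging many sketches fixes that.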
22
Mathematical analysis of probabilistic counting
Theorem. The expected number of trailing 1s in the PC sketch is

   lg(φN) + P(lg N) + o(1),  where φ ≐ .77351

and P is an oscillating function of lg N of very small amplitude.

Proof (omitted).
• 1980s: Flajolet tour de force
• 1990s: trie parameter
• 21st century: standard AC

In other words. In PC code, R(sketch)/.77351 is an unbiased statistical estimator of N.
Kirschenhofer, Prodinger, and Szpankowski, Analysis of a splitting process arising in probabilistic counting and other related algorithms, ICALP 1992.
(trie view: trailing 1s in sketch = highest null left of the right spine)
stay tuned for Szpankowski talk
Jacquet and Szpankowski, Analytical depoissonization and its applications, TCS 1998.
23
Validation of probabilistic counting
Hypothesis. Expected value returned is N for random values from a large range.

Quick experiment. 100,000 31-bit random values (20 trials)

Flajolet and Martin: Result is “typically one binary order of magnitude off.”

Of course! (Always returns a power of 2 divided by .77351.)

   16384/.77351 = 21181
   32768/.77351 = 42362
   65536/.77351 = 84725
   …

Need to incorporate more experiments for more accuracy.
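The quick experiment is easy to reproduce; a hedged sketch (the random source, fixed seed, and output format are my assumptions, not the original setup):

```java
import java.util.Random;

public class PCValidation {
    static long R(long x) { return ~x & (x + 1); }

    // one PC trial: sketch N random 31-bit values, return corrected estimate
    static long trial(Random rnd, int N) {
        long sketch = 0;
        for (int i = 0; i < N; i++)
            sketch |= R(rnd.nextInt() & 0x7fffffff);   // 31-bit random value
        return (long) (R(sketch) / .77351);
    }

    public static void main(String[] args) {
        Random rnd = new Random(12345);   // fixed seed so runs repeat
        for (int t = 0; t < 20; t++)      // 20 trials of 100,000 values
            System.out.println(trial(rnd, 100_000));
    }
}
```

Every printed estimate is a power of 2 divided by .77351, which is exactly the granularity problem the quote describes.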
Cardinality Estimation
• Rules of the game
• Probabilistic counting
• Stochastic averaging
• Refinements
• Final frontier
25
Stochastic averaging
Goal: Perform M independent PC experiments and average results.

Alternative 1: M independent hash functions? No, too expensive.

Alternative 2: M-way alternation? No, bad results for certain inputs.

Alternative 3: Stochastic averaging
• Use second hash to divide stream into 2^m independent streams (key point: equal values all go to the same stream).
• Use PC on each stream, yielding 2^m sketches.
• Compute mean = average number of trailing 1s in the sketches.
• Return 2^mean/.77351.
01 02 03 04 01 02 03 04
01 01
02 02
03 03
04 04
01 02 03 04
01
02
03
04
10 11 39 21
09 07 07
11
23 22 22
31
11 09 07 23 31 07 22 2221
39
10 11
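The key point holds because the routing hash depends only on the hashed value, so duplicates always land in the same substream and cannot inflate two different sketches. A minimal sketch of the second hash, assuming the initial-m-bits scheme used in the PCSA trace that follows (M = 4, 16-bit hashed values):

```java
public class SameStream
{
    // Second hash for 16-bit hashed values: use the initial m bits
    // to pick one of M = 2^m substreams (assumes M is a power of 2).
    public static int hash2(int x, int M)
    {
        int m = Integer.numberOfTrailingZeros(M);
        return x >>> (16 - m);
    }

    public static void main(String[] args)
    {
        int x = 0b1010011110111011;       // first value in the PCSA trace
        System.out.println(hash2(x, 4));  // 2: initial bits "10" pick sketch[2]
        // Equal values always hash alike, so duplicates are routed
        // to the same substream.
        System.out.println(hash2(x, 4) == hash2(x, 4));  // true
    }
}
```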
26

PCSA trace

M = 4 (use initial m bits for second hash)

x                R(x)     sketch[0]        sketch[1]        sketch[2]        sketch[3]
1010011110111011 100      0000000000000000 0000000000000000 0000000000000100 0000000000000000
0001111100000101 10       0000000000000010 0000000000000000 0000000000000100 0000000000000000
0110110110110011 100      0000000000000010 0000000000000100 0000000000000100 0000000000000000
0000000111011111 100000   0000000000100010 0000000000000100 0000000000000100 0000000000000000
0101110001000100 1        0000000000100010 0000000000000101 0000000000000100 0000000000000000
0000101001010101 10       0000000000100010 0000000000000101 0000000000000100 0000000000000000
1010101111111100 1        0000000000100010 0000000000000101 0000000000000101 0000000000000000
0001011100110111 1000     0000000000101010 0000000000000101 0000000000000101 0000000000000000
1110010000111111 1000000  0000000000101010 0000000000000101 0000000000000101 0000000001000000
1010110011111101 10       0000000000101010 0000000000000101 0000000000000111 0000000001000000
0001110100110100 1        0000000000101011 0000000000000101 0000000000000111 0000000001000000

r(sketch[ ])              2                1                3                0
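The trace relies on two helpers used throughout but not defined on the slides: R(x), a word with a single 1 bit at the position of the lowest 0 bit of x (equivalently 2^(number of trailing 1 bits of x)), and r(s), the number of trailing 1 bits of s. Both have one-line bit-trick implementations; a sketch:

```java
public class Bits
{
    // R(x): a word with a single 1 bit, at the position of the lowest
    // 0 bit of x -- i.e., 2^(number of trailing 1 bits of x).
    public static long R(long x)
    {  return ~x & (x + 1);  }

    // r(s): the number of trailing 1 bits of s.
    public static int r(long s)
    {  return Long.numberOfTrailingZeros(~s);  }

    public static void main(String[] args)
    {
        // First value in the trace: ...011 has two trailing 1 bits.
        System.out.println(Long.toBinaryString(R(0b1010011110111011L)));  // 100
        // Final sketch[2] in the trace: ...0111 has three trailing 1 bits.
        System.out.println(r(0b0000000000000111L));                       // 3
    }
}
```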
27

Probabilistic counting with stochastic averaging in Java

Idea. Stochastic averaging
• Use second hash to split into M = 2^m independent streams.
• Use PC on each stream, yielding 2^m sketches.
• Compute mean = average # trailing 1 bits in the sketches.
• Return 2^mean/.77351.

public static long estimate(Iterable<Long> stream, int M)
{
   long[] sketch = new long[M];
   for (long x : stream)
   {
      int k = hash2(x, M);
      sketch[k] = sketch[k] | R(x);
   }
   int sum = 0;
   for (int k = 0; k < M; k++)
      sum += r(sketch[k]);
   double mean = 1.0 * sum / M;
   return (long) (M * Math.pow(2, mean) / .77351);
}

Flajolet-Martin 1983

Accuracy improves as M increases. Q. How much?

Theorem (paraphrased to fit context of this talk).
Under the uniform hashing assumption, PCSA
• Uses 64M bits.
• Produces estimate with a relative accuracy close to 0.78/√M.
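The method above calls hash2(), R(), and r(), which are not shown on the slide. A self-contained sketch that fills them in; the initial-m-bits hash2 and the random test stream are illustrative assumptions, not part of the original:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class PCSA
{
    static long R(long x)  {  return ~x & (x + 1);  }                   // lowest 0 bit of x
    static int  r(long s)  {  return Long.numberOfTrailingZeros(~s);  } // # trailing 1 bits

    // Second hash: initial m bits of the (already hashed) value.
    static int hash2(long x, int M)
    {  return (int) (x >>> (64 - Integer.numberOfTrailingZeros(M)));  }

    public static long estimate(Iterable<Long> stream, int M)
    {
        long[] sketch = new long[M];
        for (long x : stream)
            sketch[hash2(x, M)] |= R(x);
        int sum = 0;
        for (int k = 0; k < M; k++)
            sum += r(sketch[k]);
        double mean = 1.0 * sum / M;
        return (long) (M * Math.pow(2, mean) / .77351);
    }

    // Demo: 100,000 distinct random longs stand in for hashed values
    // (uniform hashing assumption); expect an estimate within a few percent.
    public static long demo()
    {
        Random rnd = new Random(12345);
        List<Long> stream = new ArrayList<>();
        for (int i = 0; i < 100000; i++) stream.add(rnd.nextLong());
        return estimate(stream, 1024);
    }

    public static void main(String[] args)
    {  System.out.println(demo());  }
}
```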
28

Validation of PCSA analysis

Hypothesis. Value returned is accurate to 0.78/√M for random values from a large range.

Experiment. 1,000,000 31-bit random values, M = 1024 (10 trials)

% java PCSA 1000000 31 1024 10
964416
997616
959857
1024303
972940
985534
998291
996266
959208
1015329
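As a sanity check (not on the original slide), the RMS relative error of the ten trials above can be computed and compared with the predicted 0.78/√1024 ≈ 2.4%:

```java
public class Validate
{
    // The ten PCSA estimates from the experiment above (true value 1,000,000).
    static final long[] TRIALS = {
        964416, 997616, 959857, 1024303, 972940,
        985534, 998291, 996266, 959208, 1015329 };

    // Root-mean-square relative error over the trials.
    public static double rmsError(long[] trials, long truth)
    {
        double sum = 0;
        for (long t : trials)
        {
            double rel = (double) (t - truth) / truth;
            sum += rel * rel;
        }
        return Math.sqrt(sum / trials.length);
    }

    public static void main(String[] args)
    {
        System.out.printf("%.4f%n", rmsError(TRIALS, 1000000));  // about 0.025
        System.out.printf("%.4f%n", 0.78 / Math.sqrt(1024));     // 0.0244
    }
}
```

The observed error is right in line with the hypothesis.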
29

Space-accuracy tradeoff for probabilistic counting with stochastic averaging

[plot: relative accuracy 0.78/√M, for M = 32, 64, 128, 256, 512, 1024]

Bottom line.
• Attain 10% relative accuracy with a sketch consisting of 64 words (M = 64).
• Attain 2.4% relative accuracy with a sketch consisting of 1024 words (M = 1024).
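The bottom-line figures follow directly from the formula; a quick check (helper method is for illustration only):

```java
public class Tradeoff
{
    // PCSA relative accuracy with an M-word sketch: 0.78 / sqrt(M).
    public static double accuracy(int M)
    {  return 0.78 / Math.sqrt(M);  }

    public static void main(String[] args)
    {
        for (int M = 32; M <= 1024; M *= 2)
            System.out.printf("M = %4d: %4.1f%%%n", M, 100 * accuracy(M));
        // M =   64:  9.8%  (the "10%" point)
        // M = 1024:  2.4%
    }
}
```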
30

Scientific validation of PCSA

Hypothesis. Accuracy is as specified for the hash functions we use and the data we have.

Validation (Flajolet and Martin, 1985). Extensive reproducible scientific experiments (!)

Validation (RS, this morning).

% java PCSA 6000000 1024 < log.07.f3.txt
1106474

<1% larger than actual value

Q. Is PCSA effective?
A. ABSOLUTELY!
31

Summary: PCSA (Flajolet-Martin, 1983) is a demonstrably effective approach to cardinality estimation

Q. About how many different values are present in a given stream?

PCSA
• Makes one pass through the stream.
• Uses a few machine instructions per value.
• Uses M words to achieve relative accuracy 0.78/√M.
✓ Results validated through extensive experimentation.

A poster child for AofA/AC

Open questions
• Better space-accuracy tradeoffs?
• Support other operations?

“IT IS QUITE CLEAR that other observable regularities on hashed values of records could have been used…”
− Flajolet and Martin
32

Small sample of work on related problems

1970   Bloom                            set membership
1984   Wegman                           unbiased sampling estimate
1996–  many authors                     refinements (stay tuned)
2000   Indyk                            L1 norm
2004   Cormode–Muthukrishnan            frequency estimation, deletion, and other operations
2005   Giroire                          fast stream processing
2012   Lumbroso                         full range, asymptotically unbiased
2014   Helmi–Lumbroso–Martínez–Viola    uses neither sampling nor hashing
Cardinality Estimation
• Rules of the game
• Probabilistic counting
• Stochastic averaging
• Refinements
• Final frontier
34

We can do better (in theory)

Alon, Matias, and Szegedy. The Space Complexity of Approximating the Frequency Moments. STOC 1996; JCSS 1999.

Contributions
• Studied the problem of estimating higher moments.
• Formalized the idea of randomized streaming algorithms.
• Won the Gödel Prize in 2005 for “foundational contribution.”

Theorem (paraphrased to fit the context of this talk). With strongly universal hashing, PC, for any c > 2,
• Uses O(log N ) bits.
• Is accurate to a factor of c, with probability at least 2/c.

Replaces the “uniform hashing” assumption with a “random bit existence” assumption.

BUT, no impact on cardinality estimation in practice ???!
• The “algorithm” just changes the hash function for PC.
• The accuracy estimate is too weak to be useful.
• No validation.
35

Interesting quote

“Flajolet and Martin [assume] that one may use in the algorithm an explicit family of hash functions which exhibits some ideal random properties. Since we are not aware of the existence of such a family of hash functions…”
− Alon, Matias, and Szegedy

No! They hypothesized that practical hash functions would be as effective as random ones. They then validated that hypothesis by proving tight bounds that match experimental results.

Points of view re hashing
• Theoretical computer science. The uniform hashing assumption is not proved.
• Practical computing. Hashing works for many common data types.
• AofA. Extensive experiments have validated precise analytic models.

Points of view re random bits
• Theoretical computer science. Axiomatic that random bits exist.
• Practical computing. No, they don’t! And randomized algorithms are inconvenient, by the way.
• AofA. The more effective path forward is to validate precise analysis, even if stronger assumptions are needed.
36

logs and loglogs

To improve space-time tradeoffs, we need to carefully count bits.

Relevant quantities
• N is the number of items in the data stream.
• lg N is the number of bits needed to represent numbers less than N in binary.
• lg lg N is the number of bits needed to represent numbers less than lg N in binary.

For real-world applications
• N is less than 2^64.
• lg N is less than 64.
• lg lg N is less than 8.

Typical PCSA implementations
• Could use M lg N bits, in theory.
• Use 64-bit words to take advantage of machine-language efficiencies.
• Use (therefore) 64 × 64 = 4096 bits with M = 64 (for 10% accuracy with N < 2^64).
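The lg and lg lg arithmetic above is easy to check mechanically. A small sketch, assuming N < 2^64 (the helper name is ours, not the slide's):

```java
public class BitCounts {
    // Number of bits needed to represent any value in 0 .. maxExclusive-1.
    static int bits(long maxExclusive) {
        if (maxExclusive <= 1) return 0;
        return 64 - Long.numberOfLeadingZeros(maxExclusive - 1);
    }

    public static void main(String[] args) {
        // N < 2^64, so lg N = 64 bits per hashed value.
        // Values less than lg N = 64 need only bits(64) = 6 bits, which is
        // why a "loglog" register fits comfortably in under 8 bits.
        System.out.println(bits(64));
    }
}
```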
37

We can do better (in theory)

Bar-Yossef, Jayram, Kumar, Sivakumar, and Trevisan. Counting Distinct Elements in a Data Stream. RANDOM 2002.

Contribution. Improves the space-accuracy tradeoff at extra stream-processing expense.

Theorem (paraphrased to fit the context of this talk). With strongly universal hashing, there exists an algorithm that
• Uses O(M log log N ) bits. [PCSA uses M lg N bits.]
• Achieves relative accuracy O(1/√M ).

STILL no impact on cardinality estimation in practice ???!
• Infeasible because of high stream-processing expense.
• Big constants hidden in the O-notation.
• No validation.
38

We can do better (in theory and in practice)

Durand and Flajolet. LogLog Counting of Large Cardinalities. ESA 2003; LNCS volume 2832.

Contributions (independent of BYJKST)
• Presents the LogLog algorithm, an easy variant of PCSA.
• Improves the space-accuracy tradeoff without extra expense per value.
• Full analysis, fully validated with experimentation.

PCSA saves sketches (lg N bits each): 00000000000000000000000001101111
LogLog saves r() values (lg lg N bits each): 00100 ( = 4)

Theorem (paraphrased to fit the context of this talk). Under the uniform hashing assumption, LogLog
• Uses M lg lg N bits.
• Achieves relative accuracy close to 1.30/√M.

Not much impact on cardinality estimation in practice, only because
• PCSA was effectively deployed in practical systems [I like PCSA]
• The idea led to a better algorithm a few years later (stay tuned).
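A LogLog register never stores a sketch, only the largest r() value seen in its substream, which is what makes lg lg N bits enough. A hedged sketch of just that update (the rank convention here is the 1-based position of the lowest set bit, a standard choice that may differ in detail from the talk's Bits.r):

```java
public class LogLogRegister {
    // Rank of a hashed value: 1-based position of its lowest set bit.
    // (A stand-in for the talk's Bits.r; the exact convention may differ.)
    static int rank(long hash) {
        return Long.numberOfTrailingZeros(hash) + 1;
    }

    // A register keeps only the maximum rank seen in its substream.
    // Ranks of 64-bit hashes are at most 64, so a few bits per register suffice.
    static void update(byte[] registers, int k, long hash) {
        int r = rank(hash);
        if (registers[k] < r) registers[k] = (byte) r;
    }

    public static void main(String[] args) {
        byte[] reg = new byte[4];
        update(reg, 0, 0b1000L);   // rank 4: register becomes 4
        update(reg, 0, 0b0010L);   // rank 2: smaller, so ignored
        System.out.println(reg[0]);
    }
}
```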
39
We can do better (in theory and in practice): HyperLogLog algorithm (2007)
public static long estimate(Iterable<Long> stream, int M) { int[] bytes = new int[M]; for (long x : stream) { int k = hash2(x, M); if (bytes[k] < Bits.r(x)) bytes[k] = Bits.r(x); } double sum = 0.0; for (int k = 0; k < M; k++) sum += Math.pow(2, -1.0 - bytes[k]); return (int) (alpha * M * M / sum); }
Flajolet-Fusy-Gandouet-Meunier 2007
Idea. Harmonic mean of r() values
• Use stochastic splitting
• Keep track of min(r (x)) for each stream
• Return harmonic mean.
8-bit bytes (code to pack into M loglogN bits omitted)
39
We can do better (in theory and in practice): HyperLogLog algorithm (2007)
public static long estimate(Iterable<Long> stream, int M) { int[] bytes = new int[M]; for (long x : stream) { int k = hash2(x, M); if (bytes[k] < Bits.r(x)) bytes[k] = Bits.r(x); } double sum = 0.0; for (int k = 0; k < M; k++) sum += Math.pow(2, -1.0 - bytes[k]); return (int) (alpha * M * M / sum); }
Flajolet-Fusy-Gandouet-Meunier 2007
about .709 for M = 64
Idea. Harmonic mean of r() values
• Use stochastic splitting
• Keep track of min(r (x)) for each stream
• Return harmonic mean.
8-bit bytes (code to pack into M loglogN bits omitted)
39
We can do better (in theory and in practice): HyperLogLog algorithm (2007)
public static long estimate(Iterable<Long> stream, int M) { int[] bytes = new int[M]; for (long x : stream) { int k = hash2(x, M); if (bytes[k] < Bits.r(x)) bytes[k] = Bits.r(x); } double sum = 0.0; for (int k = 0; k < M; k++) sum += Math.pow(2, -1.0 - bytes[k]); return (int) (alpha * M * M / sum); }
Flajolet-Fusy-Gandouet-Meunier 2007
Flajolet, Fusy, Gandouet, and Meunier
HyperLogLog: the analysis of a near-
optimal cardinality estimation algorithm
AofA 2007; DMTCS 2007.about .709 for M = 64
Idea. Harmonic mean of r() values
• Use stochastic splitting
• Keep track of min(r (x)) for each stream
• Return harmonic mean.
8-bit bytes (code to pack into M loglogN bits omitted)
39
We can do better (in theory and in practice): HyperLogLog algorithm (2007)
public static long estimate(Iterable<Long> stream, int M) { int[] bytes = new int[M]; for (long x : stream) { int k = hash2(x, M); if (bytes[k] < Bits.r(x)) bytes[k] = Bits.r(x); } double sum = 0.0; for (int k = 0; k < M; k++) sum += Math.pow(2, -1.0 - bytes[k]); return (int) (alpha * M * M / sum); }
Flajolet-Fusy-Gandouet-Meunier 2007
Flajolet, Fusy, Gandouet, and Meunier
HyperLogLog: the analysis of a near-
optimal cardinality estimation algorithm
AofA 2007; DMTCS 2007.
Theorem (paraphrased to fit context of this talk).
about .709 for M = 64
Idea. Harmonic mean of r() values
• Use stochastic splitting
• Keep track of min(r (x)) for each stream
• Return harmonic mean.
8-bit bytes (code to pack into M loglogN bits omitted)
39
We can do better (in theory and in practice): HyperLogLog algorithm (2007)
public static long estimate(Iterable<Long> stream, int M) { int[] bytes = new int[M]; for (long x : stream) { int k = hash2(x, M); if (bytes[k] < Bits.r(x)) bytes[k] = Bits.r(x); } double sum = 0.0; for (int k = 0; k < M; k++) sum += Math.pow(2, -1.0 - bytes[k]); return (int) (alpha * M * M / sum); }
Flajolet-Fusy-Gandouet-Meunier 2007
Flajolet, Fusy, Gandouet, and Meunier
HyperLogLog: the analysis of a near-
optimal cardinality estimation algorithm
AofA 2007; DMTCS 2007.
Theorem (paraphrased to fit context of this talk). Under the uniform hashing assumption, HyperLogLog
about .709 for M = 64
Idea. Harmonic mean of r() values
• Use stochastic splitting
• Keep track of min(r (x)) for each stream
• Return harmonic mean.
8-bit bytes (code to pack into M loglogN bits omitted)
39
We can do better (in theory and in practice): HyperLogLog algorithm (2007)
public static long estimate(Iterable<Long> stream, int M) { int[] bytes = new int[M]; for (long x : stream) { int k = hash2(x, M); if (bytes[k] < Bits.r(x)) bytes[k] = Bits.r(x); } double sum = 0.0; for (int k = 0; k < M; k++) sum += Math.pow(2, -1.0 - bytes[k]); return (int) (alpha * M * M / sum); }
Flajolet-Fusy-Gandouet-Meunier 2007
Flajolet, Fusy, Gandouet, and Meunier
HyperLogLog: the analysis of a near-
optimal cardinality estimation algorithm
AofA 2007; DMTCS 2007.
Theorem (paraphrased to fit context of this talk). Under the uniform hashing assumption, HyperLogLog
• Uses M log log N bits.
about .709 for M = 64
Idea. Harmonic mean of r() values
• Use stochastic splitting
• Keep track of min(r (x)) for each stream
• Return harmonic mean.
8-bit bytes (code to pack into M loglogN bits omitted)
39
We can do better (in theory and in practice): HyperLogLog algorithm (2007)
public static long estimate(Iterable<Long> stream, int M)
{
   int[] bytes = new int[M];
   for (long x : stream)
   {
      int k = hash2(x, M);
      if (bytes[k] < Bits.r(x)) bytes[k] = Bits.r(x);
   }
   double sum = 0.0;
   for (int k = 0; k < M; k++)
      sum += Math.pow(2, -1.0 - bytes[k]);
   return (long) (alpha * M * M / sum);
}
Flajolet-Fusy-Gandouet-Meunier 2007
Flajolet, Fusy, Gandouet, and Meunier. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. AofA 2007; DMTCS 2007.
Theorem (paraphrased to fit context of this talk). Under the uniform hashing assumption, HyperLogLog
• Uses M log log N bits.
• Achieves relative accuracy close to 1.02/√M.
alpha: about .709 for M = 64
Idea. Harmonic mean of r() values.
• Use stochastic splitting.
• Keep track of max(r(x)) for each substream.
• Return harmonic mean.
8-bit bytes (code to pack into M lg lg N bits omitted)
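The code above relies on helpers (hash2, Bits.r, alpha) defined elsewhere in the talk. A self-contained sketch of the same idea, with hypothetical stand-ins for those helpers and pre-hashed 64-bit values in place of the stream, might look like this; it is a sketch under the uniform hashing assumption, not the talk's exact implementation:

```java
public class HLLSketch
{
   // Stand-in for Bits.r(): rank = 1 + number of leading zeros after
   // discarding the lg M bits used for the substream index.
   static int rank(long x, int lgM)
   {  return Long.numberOfLeadingZeros(x << lgM) + 1;  }

   // HyperLogLog estimate for a stream of (already hashed) 64-bit values.
   // M must be a power of two.
   public static long estimate(long[] stream, int M)
   {
      int lgM = Integer.numberOfTrailingZeros(M);
      int[] bytes = new int[M];                  // one 8-bit register per substream
      for (long x : stream)
      {
         int k = (int) (x >>> (64 - lgM));       // stochastic splitting: leading bits pick substream
         int r = rank(x, lgM);
         if (bytes[k] < r) bytes[k] = r;         // keep max rank per substream
      }
      double sum = 0.0;
      for (int k = 0; k < M; k++)
         sum += Math.pow(2, -bytes[k]);          // harmonic-mean denominator
      double alpha = 0.7213 / (1 + 1.079 / M);   // bias-correction constant (asymptotic form)
      return (long) (alpha * M * M / sum);
   }
}
```

With M = 1024 registers, the theorem predicts relative accuracy near 1.02/√1024 ≈ 3.2% on uniformly hashed data.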
Space-accuracy tradeoff for HyperLogLog
[Plot: relative accuracy 1.02/√M, shown for M = 64, 128, 256, 512, 1024, with the 5% and 10% levels marked]

Bottom line (for N < 2^64).
• Attain 10% relative accuracy with a sketch consisting of 108×6 = 648 bits.
• Attain 3.1% relative accuracy with a sketch consisting of 1024×6 = 6144 bits.

Yay!
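The bottom-line numbers follow directly from the 1.02/√M formula on this slide; a quick sanity check (plain arithmetic, nothing assumed beyond the formula):

```java
public class HLLAccuracy
{
   // Relative accuracy of HyperLogLog as a function of the number of substreams M.
   public static double accuracy(int M)
   {  return 1.02 / Math.sqrt(M);  }
}
```

accuracy(108) ≈ 0.098, so 108 six-bit registers (648 bits) suffice for 10%; accuracy(1024) ≈ 0.0319, in line with the 3.1% claim for 6144 bits.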
PCSA vs HyperLogLog

Typical PCSA implementations
• Could use M lg N bits, in theory.
• Use 64-bit words to take advantage of machine-language efficiencies.
• Use (therefore) 64×64 = 4096 bits with M = 64 (for 10% accuracy with N < 2^64).

Typical HyperLogLog implementations
• Could use M lg lg N bits, in theory.
• Use 8-bit bytes to take advantage of machine-language efficiencies.
• Use (therefore) 108×8 = 864 bits with M = 108 (for 10% accuracy with N < 2^64).
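The space figures above are just registers times register width; spelled out (the helper name here is ours, for illustration):

```java
public class SketchBits
{
   // Total sketch size: M registers of 'width' bits each.
   public static int bits(int M, int width)
   {  return M * width;  }

   public static void main(String[] args)
   {
      System.out.println("PCSA,        64-bit words: " + bits(64, 64) + " bits");
      System.out.println("HyperLogLog,  8-bit bytes: " + bits(108, 8) + " bits");
   }
}
```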
42
Validation of HyperLogLog

S. Heule, M. Nunkesser, and A. Hall. HyperLogLog in Practice: Algorithmic Engineering of a State-of-the-Art Cardinality Estimation Algorithm. Extending Database Technology/International Conference on Database Theory 2013.
Philippe Flajolet, mathematician, data scientist, and computer scientist extraordinaire
Cardinality Estimation
• Rules of the game
• Probabilistic counting
• Stochastic averaging
• Refinements
• Final frontier
We can do a bit better (in theory) but not much better
Lower bound
Indyk and Woodruff. Tight Lower Bounds for the Distinct Elements Problem. FOCS 2003.
Theorem (paraphrased to fit context of this talk). Any algorithm that achieves relative accuracy O(1/√M) must use Ω(M) bits.

Upper bound
Kane, Nelson, and Woodruff. An Optimal Algorithm for the Distinct Elements Problem. PODS 2010.
Theorem (paraphrased to fit context of this talk). With strongly universal hashing there exists an algorithm that
• Uses O(M) bits (a lg lg N improvement, matching the lower bound: optimal).
• Achieves relative accuracy O(1/√M).

Unlikely to have impact on cardinality estimation in practice
• Tough to beat HyperLogLog's low stream-processing expense.
• Constants hidden in O-notation not likely to be < 6.
• No validation.
Can we beat HyperLogLog in practice?
Also, results need to be validated through extensive experimentation.
Necessary characteristics of a better algorithm
• Makes one pass through the stream.
• Uses a few dozen machine instructions per value.
• Uses a few hundred bits.
• Achieves 10% relative accuracy or better.

                  machine instructions   memory      memory bound    # bits for 10% accuracy
                  per stream element     bound       when N < 2^64   when N < 2^64
HyperLogLog       20-30                  M lg lg N   6M              648
BetterAlgorithm   a few dozen                                        a few hundred
I love HyperLogLog
“ I’ve long thought that there should be a simple algorithm that uses a small constant times M bits…”
− Jérémie Lumbroso
A proposal: HyperBitBit (Sedgewick, 2016)
public static long estimate(Iterable<String> stream, int M)
{
   int lgN = 5;
   long sketch = 0;
   long sketch2 = 0;
   for (String s : stream)
   {
      long x = hash(s);
      int k = hash2(x, 64);
      if (r(x) > lgN)     sketch  = sketch  | (1L << k);
      if (r(x) > lgN + 1) sketch2 = sketch2 | (1L << k);
      if (p(sketch) > 31)
      {  sketch = sketch2; lgN++; sketch2 = 0;  }
   }
   return (long) Math.pow(2, lgN + 5.4 + p(sketch)/32.0);
}
Idea.
• lgN is an estimate of lg N.
• sketch is 64 indicators of whether to increment lgN.
• sketch2 is 64 indicators of whether to increment lgN by 2.
• Update when half the bits in sketch are 1.
• Correct with p(sketch) and a bias factor, 5.4 (determined empirically).
Q. Does this even work?
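One way to try the question is to make the proposal self-contained and run it. Here the talk's hash/hash2/r()/p() helpers are replaced by hypothetical stand-ins (trailing-zero rank, top bits for the bucket, Long.bitCount), the input is a stream of already-hashed 64-bit values, and 5.4 is the slide's empirical bias constant; this is a sketch of the proposal, not a validated implementation:

```java
public class HyperBitBit
{
   static int r(long x)      { return Long.numberOfTrailingZeros(x) + 1; }  // rank: geometric(1/2)
   static int p(long sketch) { return Long.bitCount(sketch); }              // population count
   static int bucket(long x) { return (int) (x >>> 58); }                   // second hash: top 6 bits -> [0, 64)

   // Estimate the number of distinct values among (already hashed) 64-bit inputs.
   public static long estimate(long[] stream)
   {
      int lgN = 5;
      long sketch = 0, sketch2 = 0;
      for (long x : stream)
      {
         int k = bucket(x);
         if (r(x) > lgN)     sketch  |= (1L << k);
         if (r(x) > lgN + 1) sketch2 |= (1L << k);
         if (p(sketch) > 31)                       // half the bits set: increment the lg N estimate
         {  sketch = sketch2; lgN++; sketch2 = 0;  }
      }
      return (long) Math.pow(2, lgN + 5.4 + p(sketch)/32.0);  // 5.4: empirical bias constant
   }
}
```

Because the rank and bucket conventions here are guesses at the talk's helpers, the bias constant may be off by a small power of two; the point of the sketch is the 128 + 6 bit state, not the exact calibration.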
Initial experiments
% java Hash 1000000 < log.07.f3.txt 242601 % java Hash 2000000 < log.07.f3.txt 483477 % java Hash 4000000 < log.07.f3.txt 883071 % java Hash 6000000 < log.07.f3.txt 1097944
Exact values for web log example
% java HyperBitBit 1000000 < log.07.f3.txt234219% java HyperBitBit 2000000 < log.07.f3.txt499889% java HyperBitBit 4000000 < log.07.f3.txt916801% java HyperBitBit 6000000 < log.07.f3.txt1044043
HyperBitBit estimates
Conjecture. On practical data, HyperBitBit, for N < 264, • Uses 128 + 6 bits.• Estimates cardinality within 10% of the actual.
1,000,000 2,000,000 4,000,000 6,000,000
Exact 242,601 483,477 883,071 1,097,944
HyperBitBit 234,219 499,889 916,801 1,044,043
ratio 1.05 1.03 0.96 1.03
Next steps. • Analyze. • Experiment. • Iterate
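The estimates above come from an implementation of HyperBitBit: two 64-bit sketches plus a 6-bit exponent. The following self-contained Java sketch is one plausible reading of the algorithm, not the exact program that produced the runs above — in particular, the bucket/geometric-value bit extraction and the `mix` hash (a splitmix64-style finalizer, used here only to generate test keys) are my choices, while the empirical bias-correction constant 5.4 is from the talk.

```java
public class HyperBitBit {
    private int lgN = 5;       // running estimate of lg(cardinality)
    private long sketch = 0;   // 64 one-bit counters at threshold lgN
    private long sketch2 = 0;  // lookahead counters at threshold lgN + 1

    // splitmix64 finalizer: scrambles i into a pseudorandom 64-bit hash
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    public void add(long hash) {
        int k = (int) (hash & 63);                       // bucket from low 6 bits
        int r = Long.numberOfTrailingZeros(hash >>> 6);  // geometric(1/2) value
        if (r > lgN)     sketch  |= 1L << k;
        if (r > lgN + 1) sketch2 |= 1L << k;
        if (Long.bitCount(sketch) > 31) {                // half the bits set: rescale
            sketch = sketch2;
            sketch2 = 0;
            lgN++;
        }
    }

    public double estimate() {
        // 5.4 is the empirical bias-correction constant from the talk
        return Math.pow(2, lgN + 5.4 + Long.bitCount(sketch) / 32.0);
    }

    public static void main(String[] args) {
        int n = args.length > 0 ? Integer.parseInt(args[0]) : 1000000;
        HyperBitBit hbb = new HyperBitBit();
        for (long i = 0; i < n; i++) hbb.add(mix(i));    // n distinct keys
        System.out.printf("n = %d, estimate = %.0f%n", n, hbb.estimate());
    }
}
```

Feeding it n distinct synthetic keys gives estimates in the right ballpark, consistent with the ~3–5% error observed on the web log above.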
48
Summary/timeline for cardinality estimation

year   who                                          algorithm     hashing           feasible and  memory          relative accuracy  # bits for 10% accuracy
                                                                  assumption        validated?    bound (bits)    constant           when N < 2^64
1970   Bloom                                        Bloom filter  uniform           ✓             kN                                 > 2^64
1985   Flajolet–Martin                              PCSA          uniform           ✓             M log N         0.78               4096
1996   Alon–Matias–Szegedy                          [theorem]     strong universal  ✗             O(M log N)      O(1)               ?
2002   Bar-Yossef–Jayram–Kumar–Sivakumar–Trevisan   [theorem]     strong universal  ✗             O(M log log N)  O(1)               ?
2003   Durand–Flajolet                              LogLog        uniform           ✓             M lg lg N       1.30               1536
2007   Flajolet–Fusy–Gandouet–Meunier               HyperLogLog   uniform           ✓             M lg lg N       1.04               648
2010   Kane–Nelson–Woodruff                         [theorem]     strong universal  ✗             O(M) + lg lg N  O(1)               ?
2018+  RS–?                                         HyperBitBit   uniform           ✓ (?)         2M + lg lg N    ?                  134 (?)
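The last column of the table follows from the relative-accuracy constants: a standard error of roughly c/√M means 10% accuracy needs about M = (c/0.1)² counters, times the bits per counter (lg N = 64 bits for PCSA's counters, lg lg N = 6 bits for the LogLog family when N < 2^64); HyperBitBit's conjectured 134 is 2·64 + 6. A quick arithmetic check — the choice of when to round M up to a power of two is my assumption about how the table's entries were obtained:

```java
public class AccuracyBits {
    // standard error ~ c / sqrt(M)  =>  M ~ (c / eps)^2 counters for accuracy eps
    static long counters(double c, double eps) {
        return Math.round(Math.pow(c / eps, 2));
    }

    // smallest power of two >= m (assumed rounding for PCSA and LogLog)
    static long nextPow2(long m) {
        long p = 1;
        while (p < m) p *= 2;
        return p;
    }

    public static void main(String[] args) {
        double eps = 0.10;  // target 10% relative accuracy
        // PCSA: c = 0.78, 64-bit counters; M = 61 rounded up to 64
        System.out.println("PCSA:        " + nextPow2(counters(0.78, eps)) * 64);
        // LogLog: c = 1.30, 6-bit counters; M = 169 rounded up to 256
        System.out.println("LogLog:      " + nextPow2(counters(1.30, eps)) * 6);
        // HyperLogLog: c = 1.04, 6-bit counters; M = 108 used directly
        System.out.println("HyperLogLog: " + counters(1.04, eps) * 6);
    }
}
```

These reproduce the table's 4096, 1536, and 648 bits respectively.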
49
Happy Birthday, Don!

Colloquium for Don Knuth's 80th birthday
Algorithms, Combinatorics, Information
http://knuth80.elfbrink.se

Anders Björner, Mireille Bousquet-Mélou, Erik Demaine, Persi Diaconis,
Michael Drmota, Ron Graham, Yannis Haralambous, Susan Holmes, Svante Janson,
Dick Karp, John Knuth, Svante Linusson, Laci Lovász, Jan Overduin,
Mike Paterson, Tim Roughgarden, Martin Ruckert, Bob Sedgewick,
Jeffrey Shallit, Richard Stanley, Wojtek Szpankowski, Bob Tarjan,
Greg Tucker, Andrew Yao