Robert Sedgewick Princeton Universityrs/talks/CardinalityX.pdf · Knuth’s insight: AofA is a...

Cardinality Estimation

with special thanks to Jérémie Lumbroso

Robert Sedgewick Princeton University

Philippe Flajolet 1948-2011

Philippe Flajolet, mathematician and computer scientist extraordinaire

Don Knuth’s legacy: Analysis of Algorithms (AofA)

Understood since Babbage:

Understood since Babbage:• Computational resources are limited.

Understood since Babbage:• Computational resources are limited.• Method (algorithm) used matters.

Analytic Engine

how many times do we have to turn the crank?

Knuth’s insight: AofA is a scientific endeavor.

Analytic Engine

Knuth’s insight: AofA is a scientific endeavor.• Start with a working program (algorithm implementation).

Analytic Engine

Knuth’s insight: AofA is a scientific endeavor.• Start with a working program (algorithm implementation).• Develop mathematical model of its behavior.

Analytic Engine

Knuth’s insight: AofA is a scientific endeavor.• Start with a working program (algorithm implementation).• Develop mathematical model of its behavior.• Use the model to formulate hypotheses on resource usage.

Analytic Engine

Knuth’s insight: AofA is a scientific endeavor.• Start with a working program (algorithm implementation).• Develop mathematical model of its behavior.• Use the model to formulate hypotheses on resource usage.• Use the program to validate hypotheses.

Analytic Engine

Knuth’s insight: AofA is a scientific endeavor.• Start with a working program (algorithm implementation).• Develop mathematical model of its behavior.• Use the model to formulate hypotheses on resource usage.• Use the program to validate hypotheses.• Iterate on basis of insights gained.

Analytic Engine

Difficult to overstate the significance of this insight.

AofA has played a critical role

in the development of our computational infrastructure

how many times to turn the crank?

how long to compile my program?

how long to sort random data for cryptanalysis preprocessing?

how long to compile my program?

how long to compile my program? how long to check

that my VLSI circuit follows the rules?

how many bodies in motion can I

simulate?

and the advance of scientific knowledge

how quickly can I find clusters?how many bodies

in motion can I simulate?

how quickly can I find clusters?how many bodies

in motion can I simulate?

“ PEOPLE WHO ANALYZE ALGORITHMS have double happiness. They experience the sheer beauty of elegant mathematical patterns that surround elegant computational procedures. Then they receive a practical payoff when their theories make it possible to get other jobs done more quickly and more economically.”

− Don Knuth

Analysis of Algorithms (present-day context)

Practical computing

• Real code on real machines

Practical computing

• Thorough validation

Practical computing

• Limited math models

Practical computing

Theoretical Computer Science

Practical computing

Theoretical computer science

Practical computing

• Theorems

Practical computing

• Theorems

• Abstract math models

Practical computing

• Theorems

• Limited experimentation

Practical computing

• Theorems

Practical computing AofA

Practical computing

• Theorems

AofA• Theorems and code

Practical computing

• Theorems

AofA• Theorems and code• Precise math models

Practical computing

• Theorems

AofA• Theorems and code• Precise math models• Experiment, validate, iterate

•Warmup: exact cardinality count •Probabilistic counting •Stochastic averaging •Refinements •Final frontier

Cardinality counting

Q. In a given stream of data values, how many different values are present?

Reference application. How many unique visitors in a web log?

pool-71-104-94-246.lsanca.dsl-w.verizon.net gsearch.CS.Princeton.EDU CPE001cdfbc55ac-CM0011ae926e6c.cpe.net.cable.rogers.com CPE001cdfbc55ac-CM0011ae926e6c.cpe.net.cable.rogers.com 118-171-27-8.dynamic.hinet.net cpe-76-170-182-222.socal.res.rr.com 203-88-22-144.live.vodafone.in 117.211.88.36 cpe-76-170-182-222.socal.res.rr.com 117.211.88.36 220.248.45.69 HSI-KBW-078-042-025-130.hsi3.kabel-badenwuerttemberg.de 220.248.45.69 59.7.197.132 cpe-76-170-182-222.socal.res.rr.com 87-206-191-72.dynamic.chello.pl pool-71-104-94-246.lsanca.dsl-w.verizon.net adsl-76-212-198-113.dsl.sndg02.sbcglobal.net telemedia-ap-Static-113.5.175.122.airtelbroadband.in ABTS-KK-Dynamic-095.93.179.122.airtelbroadband.in cs27063206.pp.htv.fi pool-71-104-94-246.lsanca.dsl-w.verizon.net gsearch.CS.Princeton.EDU 109-162-6-189-ter.broadband.kyivstar.net 109-162-6-189-ter.broadband.kyivstar.net cm56-159-216.liwest.at 109-162-6-189-ter.broadband.kyivstar.net 109-162-6-189-ter.broadband.kyivstar.net crawl-66-249-71-36.googlebot.com ABTS-KK-Dynamic-095.93.179.122.airtelbroadband.in telemedia-ap-Static-113.5.175.122.airtelbroadband.in c-71-231-40-193.hsd1.wa.comcast.net 29-72-93-178.pool.ukrtel.net 118-171-27-8.dynamic.hinet.net

log.07.f3.txt

6 million strings

State of the art in the wild for decades. Sort, then count.

log.07.f3.txt

6 million strings

log.07.f3.txt

6 million strings

% sort -u log.07.f3.txt | wc -l 1112365

UNIX (1970s-present)

“unique”

SELECT DATE_TRUNC(‘day’,event_time), COUNT(DISTINCT user_id), COUNT(DISTINCT url)FROM weblog

SQL (1970s-present)

log.07.f3.txt

6 million strings

% sort -u log.07.f3.txt | wc -l 1112365

UNIX (1970s-present)

“unique”

Standard “optimal” solution: Use a hash table

Hashing with linear probing

Hashing with linear probing• Create a table of size M.

Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.

example: multiply by a prime, then take remainder after dividing by M.

Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.

Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.

small example data stream P J J E K J L C K O M T P G L J I F K C

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17)

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

count 0

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15

count 0

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15

count 01

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6

count 01

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6

count 012

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6

count 012

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14

count 012

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14

count 0123

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1

count 0123

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1

J EK P

count 01234

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6

J EK P

count 01234

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13

J EK P

count 01234

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13

J EK L P

count 012345

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7

J EK L P

count 012345

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7

J EK L PC

count 012345

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1

J EK L PC

count 012345

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15

J EK L PC

count 012345

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15

J EK L

count 012345

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15

J EK L OPC

count 0123456

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8

J EK L OPC

count 0123456

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8

J EK LM OPC

count 01234567

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7

J EK LM OPC

count 01234567

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7

J EK LM OPC

count 01234567

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7

J EK LM OPC

count 01234567

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7

J EK LM OPC

count 01234567

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7

TJ EK LM OPC

count 012345678

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15

TJ EK LM OPC

count 012345678

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4

TJ EK LM OPC

count 012345678

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4

TJ EK LM OPCG

count 0123456789

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13

TJ EK LM OPCG

count 0123456789

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6

TJ EK LM OPCG

count 0123456789

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11

TJ EK LM OPCG

count 0123456789

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11

TJ EK LM OPCG I

count 012345678910

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11 9

TJ EK LM OPCG I

count 012345678910

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11 9

TJ EK LM OPCG I

Fcount 012345678910

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11 9

TJ EK LM OPCG I

Fcount 012345678910

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11 9

TJ EK LM OPCG IF

count 01234567891011

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11 9 1

TJ EK LM OPCG IF

count 01234567891011

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11 9 1 7

TJ EK LM OPCG IF

count 01234567891011

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

hash table (M = 17)

hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11 9 1 7

Additional (key) idea. Keep searches short by doubling table size when it becomes half full.

TJ EK LM OPCG IF

count 01234567891011

Mathematical analysis of exact cardinality count with linear probing

Theorem. Expected time and space cost is linear.

Proof. Follows from classic Knuth Theorem 6.4.K.

ptg999

From the Library of Robert Sedgewick

ptg999

“ I first formulated [this] derivation in 1962. Since this was the first nontrivial algorithm I had ever analyzed satisfactorily, it had a strong influence on the structure of these books. Ever since that day, the analysis of algorithms has in fact been one of the major themes of my life.”

− Knuth, TAOCP volume 3

ptg999

Q. Do the hash functions that we use uniformly and independently distribute keys in the table?

ptg999

Q. Do the hash functions that we use uniformly and independently distribute keys in the table?

A. Not likely.

Scientific validation of exact cardinality count with linear probing

Hypothesis. Time and space cost is linear for the hash functions we use and the data we have.

Quick experiment. Doubling the problem size should double the running time.

public static void main(String[] args)

Driver to read N strings and count distinct values

public static void main(String[] args){

public static void main(String[] args){ int N = Integer.parseInt(args[0]);

get problem size

public static void main(String[] args){ int N = Integer.parseInt(args[0]); StringStream stream = new StringStream(N);

get problem sizeinitialize input stream

public static void main(String[] args){ int N = Integer.parseInt(args[0]); StringStream stream = new StringStream(N); long start = System.currentTimeMillis();

get current time

StdOut.println(count(stream));

get current time

print count

long now = System.currentTimeMillis();

get current time

print count

long now = System.currentTimeMillis(); double time = (now - start) / 1000.0;

get current time

print count

long now = System.currentTimeMillis(); double time = (now - start) / 1000.0; StdOut.println(time + " seconds”);

get current time

print count

print elapsed time

long now = System.currentTimeMillis(); double time = (now - start) / 1000.0; StdOut.println(time + " seconds”);}

get current time

print count

print elapsed time

% java Hash 2000000 < log.07.f3.txt

get current time

print count

print elapsed time

% java Hash 2000000 < log.07.f3.txt483477

get current time

print count

print elapsed time

% java Hash 2000000 < log.07.f3.txt4834773.322 seconds

get current time

print count

print elapsed time

get current time

print count

print elapsed time

get current time

print count

print elapsed time

get current time

print count

print elapsed time

get current time

print count

print elapsed time

get current time

print count

print elapsed time

get current time

print count

print elapsed time

get current time

print count

print elapsed time

get current time

print count

print elapsed time

% java Hash 6000000 < log.07.f3.txt10979449.49 seconds ✓

get current time

print count

print elapsed time

% sort -u log.07.f3 | wc -l 1097944

sort-based method takes about 3 minutes

get current time

print count

print elapsed time

Q. Is hashing with linear probing effective?

% sort -u log.07.f3 | wc -l 1097944

get current time

print count

print elapsed time

Q. Is hashing with linear probing effective?

A. Yes. Validated in countless applications for over half a century.

% sort -u log.07.f3 | wc -l 1097944

get current time

print count

print elapsed time

Complexity of exact cardinality count

Q. Does there exist an optimal algorithm for this problem?Upper bound

Lower bound

Q. Does there exist an optimal algorithm for this problem?

A. Depends on definition of “optimal”.

Upper bound

Lower bound

Guaranteed linear-time? NO. Linearithmic lower bound.

Upper bound

Lower bound

Guaranteed linearithmic? YES. Balanced BSTs or mergesort.

Upper bound

Lower bound

Linear-time with high probability assuming the existence of random bits? YES. Dynamic perfect hashing.

Upper bound

Lower bound

Linear-time with high probability assuming the existence of random bits? YES. Dynamic perfect hashing. Dietzfelbinger, Karlin, Mehlhorn, Meyer auf der Heide, Rohnert, and Tarjan

Dynamic Perfect Hashing: Upper and Lower Bounds

SICOMP 1994.

Upper bound

Lower bound

Linear with a small constant factor in practical situations?

Dietzfelbinger, Karlin, Mehlhorn, Meyer auf der Heide, Rohnert, and TarjanDynamic Perfect Hashing: Upper and Lower Bounds

SICOMP 1994.

Upper bound

Lower bound

Linear with a small constant factor in practical situations? YES. Hashing with linear probing.

SICOMP 1994.

Upper bound

Lower bound

Linear with a small constant factor in practical situations? YES. Hashing with linear probing. M. Mitzenmacher and S. Vadhan

Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream. SODA 2008.

SICOMP 1994.

Upper bound

Lower bound

SICOMP 1994.

Hypothesis. Hashing with linear probing is “optimal”.

Upper bound

Lower bound

SICOMP 1994.

Hypothesis. Hashing with linear probing is “optimal”.

Upper bound

Lower bound

but TSTs may give a sublinear algorithm

Exact cardinality count requires linear space

Q. I can’t use a hash table. The stream is much too big to fit all values in memory. Now what?

A. Bad news: You cannot get an exact count.

A. (Bloom, 1970) You can get an accurate estimate using a few bits per distinct value.

A. Much better news: You can get an accurate estimate using only a handful of bits (stay tuned).

•Warmup: exact cardinality count •Probabilistic counting •Stochastic averaging •Refinements •Final frontier

is a fundamental problem with many applications where memory is limited.

Cardinality estimation

Q. About how many different values appear in a given stream?

Constraints

Constraints• Make one pass through the stream.

Constraints• Make one pass through the stream.• Use as few operations per value as possible

Constraints• Make one pass through the stream.• Use as few operations per value as possible• Use as little memory as possible.

Constraints• Make one pass through the stream.• Use as few operations per value as possible• Use as little memory as possible.• Produce as accurate an estimate as possible.

typical applications

How many unique visitors to my website?

How many different values for a database join?

How many different websites visited by each customer? How many different values

for a database join?

Which sites are the most/least popular?

To fix ideas on scope: Think of billions of streams each having trillions of values.

Which sites are the most/least popular?

Probabilistic counting with stochastic averaging (PCSA)

Flajolet and Martin, Probabilistic Counting Algorithms for Data Base Applications FOCS 1983, JCSS 1985.

Contributions

Contributions• Introduced problem

Contributions• Introduced problem• Idea of streaming algorithm

Contributions• Introduced problem• Idea of streaming algorithm• Idea of “small” sketch of “big” data

Contributions• Introduced problem• Idea of streaming algorithm• Idea of “small” sketch of “big” data• Detailed analysis that yields tight bounds on accuracy

Contributions• Introduced problem• Idea of streaming algorithm• Idea of “small” sketch of “big” data• Detailed analysis that yields tight bounds on accuracy• Full validation of mathematical results with experimentation

Contributions• Introduced problem• Idea of streaming algorithm• Idea of “small” sketch of “big” data• Detailed analysis that yields tight bounds on accuracy• Full validation of mathematical results with experimentation• Practical algorithm that has remained effective for decades

Bottom line: Quintessential example of the effectiveness of scientific approach to algorithm design.

PCSA first step: Use hashing

Transform value to a “random” computer word.

Transform value to a “random” computer word.• Compute a hash function that transforms

data value into a 32- or 64-bit value.

data value into a 32- or 64-bit value.• Cardinality count is unaffected (with high probability).

data value into a 32- or 64-bit value.• Cardinality count is unaffected (with high probability).• Built-in capability in modern systems.

data value into a 32- or 64-bit value.• Cardinality count is unaffected (with high probability).• Built-in capability in modern systems.• Allows use of fast machine-code operations.

21st century: use 64 bits (quadrillions of values)20th century: use 32 bits (millions of values)

Example: Java

• All data types implement a hashCode() method (though we often override the default).

Example: Java

• String data type stores value (computed once).

String value = “gsearch.CS.Princeton.EDU” int x = value.hashCode();

current Java default is 32-bit int value

Example: Java

Bottom line: Do cardinality estimation on streams of (binary) integers.

Example: Java

00010000111001101000111010010011 00001001011011100000010010010111 00001001011011100000010010010111 00111000101001001011010101001100 00111000101001001011010101001100 01101001001000011100110100110011 00001000011101100110110010100101 00001001011011100000010010010111 00001001001011010110111101111110 00001001001011010110111101111110 01101001001000011100110100110011 01100000110010011101011001011111 01110111001110110010000010111111 01011001010110101100010110110100 00101101101111101111001110001100 01101001001000011100110100110011 01110110000011101101011001000101 00001001001011010110111101111110 00001001001011010110111101111110 00001001001011010110111101111110 01101001001000011100110100110011 01101001001000011100110100110011 01101001001000011100110100110011 01101001001000011100110100110011 01101001001000011100110100110011 01110111001110110010000010111111 01101001001000011100110100110011 00001000011101100110110010100101 00101101101111101111001110001100 01101001001000011100110100110011 00001000011101100110110010100101 01101001001000011100110100110011 01101001001000011100110100110011 01100001000111001001110010100000

Example: Java

00010000111001101000111010010011 00001001011011100000010010010111 00001001011011100000010010010111 00111000101001001011010101001100 00111000101001001011010101001100 01101001001000011100110100110011 00001000011101100110110010100101 00001001011011100000010010010111 00001001001011010110111101111110 00001001001011010110111101111110 01101001001000011100110100110011 01100000110010011101011001011111 01110111001110110010000010111111 01011001010110101100010110110100 00101101101111101111001110001100 01101001001000011100110100110011 01110110000011101101011001000101 00001001001011010110111101111110 00001001001011010110111101111110 00001001001011010110111101111110 01101001001000011100110100110011 01101001001000011100110100110011 01101001001000011100110100110011 01101001001000011100110100110011 01101001001000011100110100110011 01110111001110110010000010111111 01101001001000011100110100110011 00001000011101100110110010100101 00101101101111101111001110001100 01101001001000011100110100110011 00001000011101100110110010100101 01101001001000011100110100110011 01101001001000011100110100110011 01100001000111001001110010100000

Example: Java

“Random” except for the fact that some values are equal.

Initial hypothesis

Hypothesis. Uniform hashing assumption is reasonable in this context.

Initial hypothesis

Implication. Need to run experiments to validate any hypotheses about performance.

Initial hypothesis

No problem!

Initial hypothesis

No problem!

• AofA is a scientific endeavor (we always validate hypotheses).

Initial hypothesis

No problem!

• End goal is development of algorithms that are useful in practice.

Initial hypothesis

No problem!

• It is the responsibility of the designer to validate utility before claiming it.

Initial hypothesis

No problem!

• After decades of experience, discovering a performance problem due toa bad hash function would be a significant research result.

Initial hypothesis

No problem!

Unspoken bedrock principle of AofA. Experimenting to validate hypotheses is WHAT WE DO!

Initial hypothesis

No problem!

Unspoken bedrock principle of AofA. Experimenting to validate hypotheses is WHAT WE DO!

Probabilistic counting starting point: three integer functions

Definition. p (x) is the number of 1s in the binary representation of x.

Definition. r (x) is the number of trailing 1s in the binary representation of x.

Definition. r (x) is the number of trailing 1s in the binary representation of x. position of rightmost 0

Definition. R(x) = 2r (x)

position of rightmost 0