Cardinality Estimation
with special thanks to Jérémie Lumbroso
Robert Sedgewick, Princeton University

Philippe Flajolet (1948–2011), mathematician and computer scientist extraordinaire
Don Knuth’s legacy: Analysis of Algorithms (AofA)

Understood since Babbage:
• Computational resources are limited.
• Method (algorithm) used matters.

Analytic Engine: “how many times do we have to turn the crank?”

Knuth’s insight: AofA is a scientific endeavor.
• Start with a working program (algorithm implementation).
• Develop mathematical model of its behavior.
• Use the model to formulate hypotheses on resource usage.
• Use the program to validate hypotheses.
• Iterate on basis of insights gained.

Difficult to overstate the significance of this insight.
AofA has played a critical role in the development of our computational infrastructure and in the advance of scientific knowledge.

How many times to turn the crank?
How long to compile my program?
How long to sort random data for cryptanalysis preprocessing?
How long to check that my VLSI circuit follows the rules?
How many bodies in motion can I simulate?
How quickly can I find clusters?
“People who analyze algorithms have double happiness. They experience the sheer beauty of elegant mathematical patterns that surround elegant computational procedures. Then they receive a practical payoff when their theories make it possible to get other jobs done more quickly and more economically.”
− Don Knuth
Analysis of Algorithms (present-day context)

Practical computing
• Real code on real machines
• Thorough validation
• Limited math models

Theoretical computer science
• Theorems
• Abstract math models
• Limited experimentation

AofA
• Theorems and code
• Precise math models
• Experiment, validate, iterate
Cardinality Estimation
• Warmup: exact cardinality count
• Probabilistic counting
• Stochastic averaging
• Refinements
• Final frontier
Cardinality counting

Q. In a given stream of data values, how many different values are present?

Reference application. How many unique visitors in a web log?

log.07.f3.txt (6 million strings):
pool-71-104-94-246.lsanca.dsl-w.verizon.net gsearch.CS.Princeton.EDU CPE001cdfbc55ac-CM0011ae926e6c.cpe.net.cable.rogers.com CPE001cdfbc55ac-CM0011ae926e6c.cpe.net.cable.rogers.com 118-171-27-8.dynamic.hinet.net cpe-76-170-182-222.socal.res.rr.com 203-88-22-144.live.vodafone.in 117.211.88.36 cpe-76-170-182-222.socal.res.rr.com 117.211.88.36 220.248.45.69 HSI-KBW-078-042-025-130.hsi3.kabel-badenwuerttemberg.de 220.248.45.69 59.7.197.132 cpe-76-170-182-222.socal.res.rr.com 87-206-191-72.dynamic.chello.pl pool-71-104-94-246.lsanca.dsl-w.verizon.net adsl-76-212-198-113.dsl.sndg02.sbcglobal.net telemedia-ap-Static-113.5.175.122.airtelbroadband.in ABTS-KK-Dynamic-095.93.179.122.airtelbroadband.in cs27063206.pp.htv.fi pool-71-104-94-246.lsanca.dsl-w.verizon.net gsearch.CS.Princeton.EDU 109-162-6-189-ter.broadband.kyivstar.net 109-162-6-189-ter.broadband.kyivstar.net cm56-159-216.liwest.at 109-162-6-189-ter.broadband.kyivstar.net 109-162-6-189-ter.broadband.kyivstar.net crawl-66-249-71-36.googlebot.com ABTS-KK-Dynamic-095.93.179.122.airtelbroadband.in telemedia-ap-Static-113.5.175.122.airtelbroadband.in c-71-231-40-193.hsd1.wa.comcast.net 29-72-93-178.pool.ukrtel.net 118-171-27-8.dynamic.hinet.net

State of the art in the wild for decades. Sort, then count (“unique”).

UNIX (1970s–present):
% sort -u log.07.f3.txt | wc -l
1112365

SQL (1970s–present):
SELECT DATE_TRUNC('day', event_time),
       COUNT(DISTINCT user_id),
       COUNT(DISTINCT url)
FROM weblog
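The same exact count can be done in one pass without sorting, by inserting every value into a set and reporting its size. A minimal Python sketch (the example stream below is hypothetical; feeding it the lines of a file such as log.07.f3.txt works the same way):

```python
# Exact cardinality count: one pass, a set of all distinct values seen.
# Same result as "sort -u file | wc -l", without the sort.

def exact_cardinality(lines):
    """Return the number of distinct values in the iterable `lines`."""
    distinct = set()
    for value in lines:
        distinct.add(value.strip())  # normalize trailing newlines from files
    return len(distinct)

if __name__ == "__main__":
    stream = ["alice", "bob", "alice", "carol", "bob"]  # toy stand-in for a web log
    print(exact_cardinality(stream))  # 3
```

Note the cost: memory grows linearly with the number of distinct values, which is exactly the limitation the probabilistic methods later in the talk remove.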
Standard “optimal” solution: Use a hash table

Hashing with linear probing:
• Create a table of size M.
• Transform each value into a “random” table index.
  (Example: multiply by a prime, then take the remainder after dividing by M.)
• Move right to find space if the value collides.
• Count values new to the table.

Small example data stream: P J J E K J L C K O M T P G L J I F K C
Hash values ((x − ‘A’) * 97 % 17): 15 6 6 14 1 6 13 7 1 15 8 7 15 4 …

[Slide animation: each value is hashed into a table of size M = 17 (indices 0–16); a value already in the table leaves the count unchanged, and each value new to the table is inserted and increments the count.]
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPC
count 012345678
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG
count 0123456789
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG
count 0123456789
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG
count 0123456789
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG
count 0123456789
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG I
count 012345678910
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11 9
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG I
count 012345678910
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11 9
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG I
Fcount 012345678910
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11 9
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG I
Fcount 012345678910
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11 9
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG IF
count 01234567891011
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11 9 1
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG IF
count 01234567891011
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11 9 1 7
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG IF
count 01234567891011
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hash table (M = 17)
8
Standard “optimal” solution: Use a hash table
Hashing with linear probing• Create a table of size M.• Transform each value into a “random” table index.• Move right to find space if value collides.• Count values new to the table.
small example data stream P J J E K J L C K O M T P G L J I F K C
hash values (x-(‘A’))*97 % 17) 15 6 6 14 1 6 13 7 1 15 8 7 15 4 13 6 11 9 1 7
Additional (key) idea. Keep searches short by doubling table size when it becomes half full.
example: multiply by a prime, then take remainder after dividing by M.
TJ EK LM OPCG IF
count 01234567891011
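The linear-probing scheme above can be sketched as a self-contained Java class. This is a minimal illustration of the technique, not code from the slides: the class name, the use of String keys, and the hashCode-based hash function are my assumptions.

```java
public class LinearProbingCounter {
    private String[] table = new String[16];
    private int count = 0;                      // number of distinct values seen so far

    // transform a key into a table index in [0, table.length)
    private int hash(String key) {
        return (key.hashCode() & 0x7fffffff) % table.length;
    }

    // add a value; increment count only if it is new to the table
    public void add(String key) {
        int i = hash(key);
        while (table[i] != null) {
            if (table[i].equals(key)) return;   // already counted
            i = (i + 1) % table.length;         // move right to find space (wrap around)
        }
        table[i] = key;
        count++;
        // key idea: double the table size when it becomes half full
        if (2 * count >= table.length) resize(2 * table.length);
    }

    // rebuild the table at the new capacity, rehashing every stored key
    private void resize(int capacity) {
        String[] old = table;
        table = new String[capacity];
        count = 0;
        for (String key : old)
            if (key != null) add(key);
    }

    public int count() { return count; }

    public static void main(String[] args) {
        LinearProbingCounter c = new LinearProbingCounter();
        for (String s : "P J J E K J L C K O M T P G L J I F K C".split(" "))
            c.add(s);
        System.out.println(c.count());          // 12 distinct values in the small example
    }
}
```

Running main on the slides’ 20-character stream reports 12 distinct values; doubling at half full keeps the load factor at most 1/2, so probe sequences stay short.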
9

Mathematical analysis of exact cardinality count with linear probing

Theorem. Expected time and space cost is linear.

Proof. Follows from classic Knuth Theorem 6.4.K.

“I first formulated [this] derivation in 1962. Since this was the first nontrivial algorithm I had ever analyzed satisfactorily, it had a strong influence on the structure of these books. Ever since that day, the analysis of algorithms has in fact been one of the major themes of my life.”
— Knuth, TAOCP volume 3

Q. Do the hash functions that we use uniformly and independently distribute keys in the table?

A. Not likely.
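The classical result behind that proof can be stated explicitly. This is the standard form of the linear-probing analysis (stated here for context, not quoted from the slides): for a table of size $M$ holding $N$ keys, with load factor $\alpha = N/M$, the expected number of probes is

```latex
C_N \approx \frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right)
  \quad \text{(successful search)}
\qquad
C'_N \approx \frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^2}\right)
  \quad \text{(unsuccessful search)}
```

With table doubling at half full, $\alpha \le 1/2$, so these are at most about $1.5$ and $2.5$ probes per operation, and the total expected cost of $N$ insertions is linear in $N$.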
10

Scientific validation of exact cardinality count with linear probing

Hypothesis. Time and space cost is linear for the hash functions we use and the data we have.

Quick experiment. Doubling the problem size should double the running time.

Driver to read N strings and count distinct values:

public static void main(String[] args)
{
    int N = Integer.parseInt(args[0]);           // get problem size
    StringStream stream = new StringStream(N);   // initialize input stream
    long start = System.currentTimeMillis();     // get current time
    StdOut.println(count(stream));               // print count
    long now = System.currentTimeMillis();
    double time = (now - start) / 1000.0;
    StdOut.println(time + " seconds");           // print elapsed time
}

% java Hash 2000000 < log.07.f3.txt
483477
3.322 seconds

% java Hash 4000000 < log.07.f3.txt
883071
6.55 seconds

% java Hash 6000000 < log.07.f3.txt
1097944
9.49 seconds  ✓

% sort -u log.07.f3 | wc -l
1097944
(sort-based method takes about 3 minutes)

Q. Is hashing with linear probing effective?

A. Yes. Validated in countless applications for over half a century.
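The timing ratios from the runs above can be checked directly; this small snippet (my own sanity check, using the times printed on the slide) confirms that doubling and tripling the problem size roughly doubles and triples the running time, as the linear hypothesis predicts.

```java
public class RatioCheck {
    public static void main(String[] args) {
        // elapsed times reported for N = 2M, 4M, 6M strings
        double t2 = 3.322, t4 = 6.55, t6 = 9.49;
        System.out.println(t4 / t2);   // about 1.97: doubling N about doubles the time
        System.out.println(t6 / t2);   // about 2.86: tripling N about triples the time
    }
}
```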
11
Complexity of exact cardinality count
Q. Does there exist an optimal algorithm for this problem?Upper bound
Lower bound
11
Complexity of exact cardinality count
Q. Does there exist an optimal algorithm for this problem?
A. Depends on definition of “optimal”.
Upper bound
Lower bound
11
Complexity of exact cardinality count
Q. Does there exist an optimal algorithm for this problem?
Guaranteed linear-time? NO. Linearithmic lower bound.
A. Depends on definition of “optimal”.
Upper bound
Lower bound
11
Complexity of exact cardinality count
Q. Does there exist an optimal algorithm for this problem?
11

Complexity of exact cardinality count

Q. Does there exist an optimal algorithm for this problem?

A. Depends on definition of “optimal”.
• Guaranteed linearithmic? YES. Balanced BSTs or mergesort. [upper bound]
• Guaranteed linear-time? NO. Linearithmic lower bound. [lower bound]
• Linear-time with high probability, assuming the existence of random bits? YES. Dynamic perfect hashing.
  (Dietzfelbinger, Karlin, Mehlhorn, Meyer auf der Heide, Rohnert, and Tarjan, Dynamic Perfect Hashing: Upper and Lower Bounds, SICOMP 1994.)
• Linear with a small constant factor in practical situations? YES. Hashing with linear probing.
  (M. Mitzenmacher and S. Vadhan, Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream, SODA 2008.)

Hypothesis. Hashing with linear probing is “optimal” (but TSTs may give a sublinear algorithm).
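The hashing-with-linear-probing approach named in the hypothesis can be sketched in a few lines. This is a minimal illustration of one-pass exact counting, not the tuned implementation the slide refers to; the table size and probe policy here are arbitrary choices.

```java
// Sketch: exact cardinality count via hashing with linear probing.
// One pass; expected constant time per value under the uniform hashing assumption.
public class LinearProbingCount {
    public static int countDistinct(String[] stream) {
        int m = 4 * stream.length + 1;      // keep the table sparse so probe sequences stay short
        String[] table = new String[m];
        int count = 0;
        for (String value : stream) {
            int i = (value.hashCode() & 0x7fffffff) % m;
            while (table[i] != null && !table[i].equals(value))
                i = (i + 1) % m;            // linear probing: step to the next slot
            if (table[i] == null) {         // first occurrence of this value
                table[i] = value;
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] stream = { "a.com", "b.com", "a.com", "c.com", "b.com", "a.com" };
        System.out.println(countDistinct(stream));   // prints 3
    }
}
```

Note the catch the next slide makes explicit: the table holds every distinct value, so memory grows linearly with the cardinality.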
12

Exact cardinality count requires linear space

Q. I can’t use a hash table. The stream is much too big to fit all values in memory. Now what? Help!

pool-71-104-94-246.lsanca.dsl-w.verizon.net gsearch.CS.Princeton.EDU CPE001cdfbc55ac-CM0011ae926e6c.cpe.net.cable.rogers.com CPE001cdfbc55ac-CM0011ae926e6c.cpe.net.cable.rogers.com 118-171-27-8.dynamic.hinet.net cpe-76-170-182-222.socal.res.rr.com 203-88-22-144.live.vodafone.in 117.211.88.36 cpe-76-170-182-222.socal.res.rr.com 117.211.88.36 220.248.45.69 HSI-KBW-078-042-025-130.hsi3.kabel-badenwuerttemberg.de 220.248.45.69 59.7.197.132 cpe-76-170-182-222.socal.res.rr.com 87-206-191-72.dynamic.chello.pl pool-71-104-94-246.lsanca.dsl-w.verizon.net adsl-76-212-198-113.dsl.sndg02.sbcglobal.net telemedia-ap-Static-113.5.175.122.airtelbroadband.in ABTS-KK-Dynamic-095.93.179.122.airtelbroadband.in cs27063206.pp.htv.fi pool-71-104-94-246.lsanca.dsl-w.verizon.net gsearch.CS.Princeton.EDU 109-162-6-189-ter.broadband.kyivstar.net 109-162-6-189-ter.broadband.kyivstar.net cm56-159-216.liwest.at 109-162-6-189-ter.broadband.kyivstar.net 109-162-6-189-ter.broadband.kyivstar.net crawl-66-249-71-36.googlebot.com ABTS-KK-Dynamic-095.93.179.122.airtelbroadband.in telemedia-ap-Static-113.5.175.122.airtelbroadband.in c-71-231-40-193.hsd1.wa.comcast.net 29-72-93-178.pool.ukrtel.net 118-171-27-8.dynamic.hinet.net

A. Bad news: You cannot get an exact count.
A. (Bloom, 1970) You can get an accurate estimate using a few bits per distinct value.
A. Much better news: You can get an accurate estimate using only a handful of bits (stay tuned).
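Bloom’s “few bits per distinct value” idea can be illustrated with the standard linear-counting estimator: hash each value to one bit of an m-bit array, then invert the expected fraction of bits left at 0. To be clear, this is one common realization of a bit-per-value sketch, not necessarily the exact 1970 scheme; the stream and array size below are toy choices.

```java
import java.util.BitSet;

// Sketch: estimate cardinality from an m-bit array (linear counting).
// With n distinct values, E[fraction of zero bits] = (1 - 1/m)^n ≈ e^(-n/m),
// so n ≈ -m ln(zeroFraction). Assumes at least one bit remains 0.
public class LinearCounting {
    static double estimate(String[] stream, int m) {
        BitSet bits = new BitSet(m);                 // m bits total
        for (String value : stream)
            bits.set((value.hashCode() & 0x7fffffff) % m);
        double zeroFraction = (double) (m - bits.cardinality()) / m;
        return -m * Math.log(zeroFraction);          // invert the expected-zeros formula
    }

    public static void main(String[] args) {
        String[] stream = { "a", "b", "a", "c" };    // toy stream, 3 distinct values
        System.out.println(estimate(stream, 64));    // ≈ 3.07
    }
}
```

The memory cost is a few bits per distinct value, as the slide says; the “handful of bits” algorithms that follow do far better still.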
Cardinality Estimation

• Warmup: exact cardinality count
• Probabilistic counting
• Stochastic averaging
• Refinements
• Final frontier
14

Cardinality estimation is a fundamental problem with many applications where memory is limited.

Q. About how many different values appear in a given stream?

Constraints
• Make one pass through the stream.
• Use as few operations per value as possible.
• Use as little memory as possible.
• Produce as accurate an estimate as possible.

Typical applications
• How many unique visitors to my website?
• How many different websites visited by each customer?
• How many different values for a database join?
• Which sites are the most/least popular?

To fix ideas on scope: Think of billions of streams, each having trillions of values.
15

Probabilistic counting with stochastic averaging (PCSA)

Flajolet and Martin, Probabilistic Counting Algorithms for Data Base Applications, FOCS 1983 / JCSS 1985.

Contributions
• Introduced the problem
• Idea of a streaming algorithm
• Idea of a “small” sketch of “big” data
• Detailed analysis that yields tight bounds on accuracy
• Full validation of mathematical results with experimentation
• Practical algorithm that has remained effective for decades

Bottom line: Quintessential example of the effectiveness of the scientific approach to algorithm design.

Philippe Flajolet 1948-2011
16

PCSA first step: Use hashing

Transform each value to a “random” computer word.
• Compute a hash function that transforms the data value into a 32- or 64-bit value.
  [20th century: use 32 bits (millions of values); 21st century: use 64 bits (quadrillions of values)]
• Cardinality count is unaffected (with high probability).
• Built-in capability in modern systems.
• Allows use of fast machine-code operations.

Example: Java
• All data types implement a hashCode() method (though we often override the default).
• String data type stores its hash value (computed once).

String value = "gsearch.CS.Princeton.EDU";
int x = value.hashCode();   // current Java default is a 32-bit int value

00010000111001101000111010010011 00001001011011100000010010010111 00001001011011100000010010010111 00111000101001001011010101001100 00111000101001001011010101001100 01101001001000011100110100110011 00001000011101100110110010100101 00001001011011100000010010010111 00001001001011010110111101111110 00001001001011010110111101111110 01101001001000011100110100110011 01100000110010011101011001011111 01110111001110110010000010111111 01011001010110101100010110110100 00101101101111101111001110001100 01101001001000011100110100110011 01110110000011101101011001000101 00001001001011010110111101111110 00001001001011010110111101111110 00001001001011010110111101111110 01101001001000011100110100110011 01101001001000011100110100110011 01101001001000011100110100110011 01101001001000011100110100110011 01101001001000011100110100110011 01110111001110110010000010111111 01101001001000011100110100110011 00001000011101100110110010100101 00101101101111101111001110001100 01101001001000011100110100110011 00001000011101100110110010100101 01101001001000011100110100110011 01101001001000011100110100110011 01100001000111001001110010100000

“Random” except for the fact that some values are equal.

Bottom line: Do cardinality estimation on streams of (binary) integers.
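The first step above can be sketched end to end: replace each value by its 32-bit hash and count on the integer stream instead. This is a toy illustration (single-letter values stand in for the hostnames in a real log); equal values always produce equal hashes, so the count is unaffected unless two distinct values collide, which happens with low probability for 32- or 64-bit hashes.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: hash each stream value to a "random" integer, then count
// distinct integers. The distinct-hash count equals the distinct-value
// count unless two different values collide.
public class HashStream {
    static int distinctHashes(String[] stream) {
        Set<Integer> hashes = new HashSet<>();
        for (String value : stream)
            hashes.add(value.hashCode());   // current Java default: 32-bit int
        return hashes.size();
    }

    public static void main(String[] args) {
        String[] stream = { "a", "b", "a", "c", "b", "a" };   // 3 distinct values
        System.out.println(distinctHashes(stream));           // prints 3
    }
}
```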
17

Initial hypothesis

Hypothesis. Uniform hashing assumption is reasonable in this context.

Implication. Need to run experiments to validate any hypotheses about performance.

No problem!
• AofA is a scientific endeavor (we always validate hypotheses).
• End goal is development of algorithms that are useful in practice.
• It is the responsibility of the designer to validate utility before claiming it.
• After decades of experience, discovering a performance problem due to a bad hash function would be a significant research result.

Unspoken bedrock principle of AofA. Experimenting to validate hypotheses is WHAT WE DO!
0 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 x1 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0 ~x0 1 1 0 1 0 0 1 0 1 1 0 0 0 0 0 x + 10 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 ~x & (x + 1)
18
Probabilistic counting starting point: three integer functions
Definition. r (x) is the number of trailing 1s in the binary representation of x.
Definition. R(x) = 2r (x)
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 p (x) r(x) R (x) R (x)21 0 1 1 1 1 0 1 1 1 1 1 0 1 0 1 12 1 2 1 0
1 0 1 0 1 0 1 0 1 0 0 0 1 1 1 0 8 0 1 1
0 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 10 5 32 1 0 0 0 0 0
Bit-whacking magic: R(x) is easy to compute.
3 instructions on a typical computer
Exercise: Compute p (x) as easily.
position of rightmost 0
Definition. p (x) is the number of 1s in the binary representation of x.
0 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 x1 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0 ~x0 1 1 0 1 0 0 1 0 1 1 0 0 0 0 0 x + 10 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 ~x & (x + 1)
18
Probabilistic counting starting point: three integer functions
Definition. r (x) is the number of trailing 1s in the binary representation of x.
Definition. R(x) = 2r (x)
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 p (x) r(x) R (x) R (x)21 0 1 1 1 1 0 1 1 1 1 1 0 1 0 1 12 1 2 1 0
1 0 1 0 1 0 1 0 1 0 0 0 1 1 1 0 8 0 1 1
0 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 10 5 32 1 0 0 0 0 0
Bit-whacking magic: R(x) is easy to compute.
3 instructions on a typical computer
Exercise: Compute p (x) as easily.
position of rightmost 0
Definition. p (x) is the number of 1s in the binary representation of x.
0 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 x1 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0 ~x0 1 1 0 1 0 0 1 0 1 1 0 0 0 0 0 x + 10 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 ~x & (x + 1)
Beeler, Gosper, and SchroeppelHAKMEM item 169, MIT AI Laboratory AIM 239, 1972
http://www.inwap.com/pdp10/hbaker/hakmem/hakmem.html
18
Probabilistic counting starting point: three integer functions
Definition. r (x) is the number of trailing 1s in the binary representation of x.
Definition. R(x) = 2r (x)
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 p (x) r(x) R (x) R (x)21 0 1 1 1 1 0 1 1 1 1 1 0 1 0 1 12 1 2 1 0
1 0 1 0 1 0 1 0 1 0 0 0 1 1 1 0 8 0 1 1
0 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 10 5 32 1 0 0 0 0 0
Bit-whacking magic: R(x) is easy to compute.
3 instructions on a typical computer
Exercise: Compute p (x) as easily.
position of rightmost 0
Definition. p (x) is the number of 1s in the binary representation of x.
0 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 x1 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0 ~x0 1 1 0 1 0 0 1 0 1 1 0 0 0 0 0 x + 10 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 ~x & (x + 1)
Note: r (x ) = p (R(x ) − 1).
Beeler, Gosper, and SchroeppelHAKMEM item 169, MIT AI Laboratory AIM 239, 1972
http://www.inwap.com/pdp10/hbaker/hakmem/hakmem.html
18
Probabilistic counting starting point: three integer functions
Definition. r (x) is the number of trailing 1s in the binary representation of x.
Definition. R(x) = 2r (x)
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 p (x) r(x) R (x) R (x)21 0 1 1 1 1 0 1 1 1 1 1 0 1 0 1 12 1 2 1 0
1 0 1 0 1 0 1 0 1 0 0 0 1 1 1 0 8 0 1 1
0 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 10 5 32 1 0 0 0 0 0
Bit-whacking magic: R(x) is easy to compute.
3 instructions on a typical computer
Exercise: Compute p (x) as easily.
position of rightmost 0
Definition. p (x) is the number of 1s in the binary representation of x.
0 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 x1 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0 ~x0 1 1 0 1 0 0 1 0 1 1 0 0 0 0 0 x + 10 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 ~x & (x + 1)
Note: r (x ) = p (R(x ) − 1).
Beeler, Gosper, and SchroeppelHAKMEM item 169, MIT AI Laboratory AIM 239, 1972
http://www.inwap.com/pdp10/hbaker/hakmem/hakmem.html
see also Knuth volume 4A
18
Probabilistic counting starting point: three integer functions
Definition. r(x) is the number of trailing 1s in the binary representation of x.

Definition. R(x) = 2^r(x)  (position of the rightmost 0 in x, as a power of 2)

   bits 15..0          p(x)  r(x)  R(x)  R(x) in binary
   1011110111110101     12    1     2     10
   1010101010001110      8    0     1     1
   0110100101011111     10    5    32     100000

Bit-whacking magic: R(x) is easy to compute (3 instructions on a typical computer).

   0 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1   x
   1 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0   ~x
   0 1 1 0 1 0 0 1 0 1 1 0 0 0 0 0   x + 1
   0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0   ~x & (x + 1)

Exercise: Compute p(x) as easily.

Definition. p(x) is the number of 1s in the binary representation of x.

Note: r(x) = p(R(x) − 1).

Bottom line: p(x), r(x), and R(x) can all be computed with just a few machine instructions.
Beeler, Gosper, and Schroeppel, HAKMEM item 169, MIT AI Laboratory AIM 239, 1972.
http://www.inwap.com/pdp10/hbaker/hakmem/hakmem.html
see also Knuth volume 4A
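The three functions are easy to check in Java; a minimal sketch (the class name is mine, and `Long.bitCount` supplies p(x) directly; the test value is the third row of the table above):

```java
public class BitFns {
    // p(x): number of 1 bits in x (population count)
    static int p(long x) { return Long.bitCount(x); }

    // R(x): position of the rightmost 0 bit of x, as a power of 2
    static long R(long x) { return ~x & (x + 1); }

    // r(x): number of trailing 1s, via the identity r(x) = p(R(x) - 1)
    static int r(long x) { return p(R(x) - 1); }

    public static void main(String[] args) {
        long x = 0b0110100101011111L;  // third row of the table above
        System.out.println(p(x) + " " + r(x) + " " + R(x));  // 10 5 32
    }
}
```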
19
Probabilistic counting (Flajolet and Martin, 1983)
Maintain a single-word sketch that summarizes a data stream x0, x1, …, xN, …
• For each xN in the stream, update sketch by bitwise OR with R(xN).
• Use position of rightmost 0 (with slight correction factor) to estimate lg N.
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
sketch 0 0 0 0 0 0 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
xN 0 0 1 1 0 1 0 1 0 1 1 1 1 1 1 0 1 0 1 0 1 0 1 0 1 0 0 0 1 1 1 1
R(xN) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
sketch | R(xN) 0 0 0 0 0 0 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
(typical sketch, N = 10^6; R(x) = 2^k with probability 1/2^k, so leading bits are almost surely 0, trailing bits almost surely 1, and the position of the rightmost 0 is an estimate of lg N)

Rough estimate of lg N is r(sketch).

Rough estimate of N is R(sketch) (correction factor needed; stay tuned).
20
Probabilistic counting trace
x                                 r(x)  R(x)     sketch
01100010011000111010011110111011   2    100      00000000000000000000000000000100
01100111001000110001111100000101   1    10       00000000000000000000000000000110
00010001000111000110110110110011   2    100      00000000000000000000000000000110
01000100011101110000000111011111   5    100000   00000000000000000000000000100110
01101000001011000101110001000100   0    1        00000000000000000000000000100111
00110111101100000000101001010101   1    10       00000000000000000000000000100111
00110100011000111010101111111100   0    1        00000000000000000000000000100111
00011000010000100001011100110111   3    1000     00000000000000000000000000101111
00011001100110011110010000111111   6    1000000  00000000000000000000000001101111
01000101110001001010110011111100   0    1        00000000000000000000000001101111

R(sketch) = 10000₂ = 16
21
Probabilistic counting (Flajolet and Martin, 1983)
public long R(long x)
{  return ~x & (x+1);  }

public long estimate(Iterable<String> stream)
{
   long sketch = 0;
   for (String s : stream)
      sketch = sketch | R(s.hashCode());
   return R(sketch);
}
Early example of “a simple algorithm whose analysis isn’t”
Maintain a sketch of the data
• A single word
• OR of all values of R(x) in the stream
• Return smallest value not seen
Q. (Martin) Estimate seems a bit low. How much?
A. (unsatisfying) Obtain correction factor empirically.
A. (Flajolet) Without the analysis, there is no algorithm!
with correction for bias: return R(sketch)/.77351
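Putting the slide's pieces together, a self-contained sketch with the bias correction applied (the class wrapper and the demo stream are mine, not the authors' code):

```java
import java.util.Arrays;

public class ProbabilisticCounting {
    static long R(long x) { return ~x & (x + 1); }

    // PC estimate with the .77351 bias correction applied
    static long estimate(Iterable<String> stream) {
        long sketch = 0;
        for (String s : stream)
            sketch = sketch | R(s.hashCode());
        return (long) (R(sketch) / .77351);
    }

    public static void main(String[] args) {
        // tiny demo stream; duplicates do not change the sketch
        System.out.println(estimate(Arrays.asList("a", "b", "a", "c")));
    }
}
```

On a stream this tiny the estimate is dominated by variance; the slides that follow show how averaging many sketches fixes that.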
22
Mathematical analysis of probabilistic counting
Theorem. The expected number of trailing 1s in the PC sketch is

   lg(φN) + P(lg N) + o(1),  where φ ≐ .77351

and P is an oscillating function of lg N of very small amplitude.

Proof (omitted).
• 1980s: Flajolet tour de force
• 1990s: trie parameter
• 21st century: standard AC

In other words. In PC code, R(sketch)/.77351 is an unbiased statistical estimator of N.
Kirschenhofer, Prodinger, and Szpankowski, Analysis of a splitting process arising in probabilistic counting and other related algorithms, ICALP 1992.
(trie view: trailing 1s in sketch = highest null left of the right spine)
stay tuned for Szpankowski talk
Jacquet and Szpankowski, Analytical depoissonization and its applications, TCS 1998.
23
Validation of probabilistic counting
Hypothesis. Expected value returned is N for random values from a large range.

Quick experiment. 100,000 31-bit random values (20 trials)

Flajolet and Martin: Result is “typically one binary order of magnitude off.”

Of course! (Always returns a power of 2 divided by .77351.)

   16384/.77351 = 21181
   32768/.77351 = 42362
   65536/.77351 = 84725
   …

Need to incorporate more experiments for more accuracy.
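The quick experiment is easy to reproduce; a hedged sketch (the random source, fixed seed, and output format are my assumptions, not the original setup):

```java
import java.util.Random;

public class PCValidation {
    static long R(long x) { return ~x & (x + 1); }

    // one PC trial: sketch N random 31-bit values, return corrected estimate
    static long trial(Random rnd, int N) {
        long sketch = 0;
        for (int i = 0; i < N; i++)
            sketch |= R(rnd.nextInt() & 0x7fffffff);   // 31-bit random value
        return (long) (R(sketch) / .77351);
    }

    public static void main(String[] args) {
        Random rnd = new Random(12345);   // fixed seed so runs repeat
        for (int t = 0; t < 20; t++)      // 20 trials of 100,000 values
            System.out.println(trial(rnd, 100_000));
    }
}
```

Every printed estimate is a power of 2 divided by .77351, which is exactly the granularity problem the quote describes.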
Cardinality Estimation
• Rules of the game
• Probabilistic counting
• Stochastic averaging
• Refinements
• Final frontier
25
Stochastic averaging
Goal: Perform M independent PC experiments and average results.

Alternative 1: M independent hash functions? No, too expensive.

Alternative 2: M-way alternation? No, bad results for certain inputs.

Alternative 3: Stochastic averaging
• Use second hash to divide stream into 2^m independent streams (key point: equal values all go to the same stream).
• Use PC on each stream, yielding 2^m sketches.
• Compute mean = average number of trailing 1s in the sketches.
• Return 2^mean/.77351.
01 02 03 04 01 02 03 04
01 01
02 02
03 03
04 04
01 02 03 04
01
02
03
04
10 11 39 21
09 07 07
11
23 22 22
31
11 09 07 23 31 07 22 2221
39
10 11
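The key point holds because the routing hash depends only on the hashed value, so duplicates always land in the same substream and cannot inflate two different sketches. A minimal sketch of the second hash, assuming the initial-m-bits scheme used in the PCSA trace that follows (M = 4, 16-bit hashed values):

```java
public class SameStream
{
    // Second hash for 16-bit hashed values: use the initial m bits
    // to pick one of M = 2^m substreams (assumes M is a power of 2).
    public static int hash2(int x, int M)
    {
        int m = Integer.numberOfTrailingZeros(M);
        return x >>> (16 - m);
    }

    public static void main(String[] args)
    {
        int x = 0b1010011110111011;       // first value in the PCSA trace
        System.out.println(hash2(x, 4));  // 2: initial bits "10" pick sketch[2]
        // Equal values always hash alike, so duplicates are routed
        // to the same substream.
        System.out.println(hash2(x, 4) == hash2(x, 4));  // true
    }
}
```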
26

PCSA trace

M = 4 (use initial m bits for second hash)

x                R(x)     sketch[0]        sketch[1]        sketch[2]        sketch[3]
1010011110111011 100      0000000000000000 0000000000000000 0000000000000100 0000000000000000
0001111100000101 10       0000000000000010 0000000000000000 0000000000000100 0000000000000000
0110110110110011 100      0000000000000010 0000000000000100 0000000000000100 0000000000000000
0000000111011111 100000   0000000000100010 0000000000000100 0000000000000100 0000000000000000
0101110001000100 1        0000000000100010 0000000000000101 0000000000000100 0000000000000000
0000101001010101 10       0000000000100010 0000000000000101 0000000000000100 0000000000000000
1010101111111100 1        0000000000100010 0000000000000101 0000000000000101 0000000000000000
0001011100110111 1000     0000000000101010 0000000000000101 0000000000000101 0000000000000000
1110010000111111 1000000  0000000000101010 0000000000000101 0000000000000101 0000000001000000
1010110011111101 10       0000000000101010 0000000000000101 0000000000000111 0000000001000000
0001110100110100 1        0000000000101011 0000000000000101 0000000000000111 0000000001000000

r(sketch[ ])              2                1                3                0
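The trace relies on two helpers used throughout but not defined on the slides: R(x), a word with a single 1 bit at the position of the lowest 0 bit of x (equivalently 2^(number of trailing 1 bits of x)), and r(s), the number of trailing 1 bits of s. Both have one-line bit-trick implementations; a sketch:

```java
public class Bits
{
    // R(x): a word with a single 1 bit, at the position of the lowest
    // 0 bit of x -- i.e., 2^(number of trailing 1 bits of x).
    public static long R(long x)
    {  return ~x & (x + 1);  }

    // r(s): the number of trailing 1 bits of s.
    public static int r(long s)
    {  return Long.numberOfTrailingZeros(~s);  }

    public static void main(String[] args)
    {
        // First value in the trace: ...011 has two trailing 1 bits.
        System.out.println(Long.toBinaryString(R(0b1010011110111011L)));  // 100
        // Final sketch[2] in the trace: ...0111 has three trailing 1 bits.
        System.out.println(r(0b0000000000000111L));                       // 3
    }
}
```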
27

Probabilistic counting with stochastic averaging in Java

Idea. Stochastic averaging
• Use second hash to split into M = 2^m independent streams.
• Use PC on each stream, yielding 2^m sketches.
• Compute mean = average # trailing 1 bits in the sketches.
• Return 2^mean/.77351.

public static long estimate(Iterable<Long> stream, int M)
{
   long[] sketch = new long[M];
   for (long x : stream)
   {
      int k = hash2(x, M);
      sketch[k] = sketch[k] | R(x);
   }
   int sum = 0;
   for (int k = 0; k < M; k++)
      sum += r(sketch[k]);
   double mean = 1.0 * sum / M;
   return (long) (M * Math.pow(2, mean) / .77351);
}

Flajolet-Martin 1983

Accuracy improves as M increases. Q. How much?

Theorem (paraphrased to fit context of this talk).
Under the uniform hashing assumption, PCSA
• Uses 64M bits.
• Produces estimate with a relative accuracy close to 0.78/√M.
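The method above calls hash2(), R(), and r(), which are not shown on the slide. A self-contained sketch that fills them in; the initial-m-bits hash2 and the random test stream are illustrative assumptions, not part of the original:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class PCSA
{
    static long R(long x)  {  return ~x & (x + 1);  }                   // lowest 0 bit of x
    static int  r(long s)  {  return Long.numberOfTrailingZeros(~s);  } // # trailing 1 bits

    // Second hash: initial m bits of the (already hashed) value.
    static int hash2(long x, int M)
    {  return (int) (x >>> (64 - Integer.numberOfTrailingZeros(M)));  }

    public static long estimate(Iterable<Long> stream, int M)
    {
        long[] sketch = new long[M];
        for (long x : stream)
            sketch[hash2(x, M)] |= R(x);
        int sum = 0;
        for (int k = 0; k < M; k++)
            sum += r(sketch[k]);
        double mean = 1.0 * sum / M;
        return (long) (M * Math.pow(2, mean) / .77351);
    }

    // Demo: 100,000 distinct random longs stand in for hashed values
    // (uniform hashing assumption); expect an estimate within a few percent.
    public static long demo()
    {
        Random rnd = new Random(12345);
        List<Long> stream = new ArrayList<>();
        for (int i = 0; i < 100000; i++) stream.add(rnd.nextLong());
        return estimate(stream, 1024);
    }

    public static void main(String[] args)
    {  System.out.println(demo());  }
}
```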
28

Validation of PCSA analysis

Hypothesis. Value returned is accurate to 0.78/√M for random values from a large range.

Experiment. 1,000,000 31-bit random values, M = 1024 (10 trials)

% java PCSA 1000000 31 1024 10
964416
997616
959857
1024303
972940
985534
998291
996266
959208
1015329
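As a sanity check (not on the original slide), the RMS relative error of the ten trials above can be computed and compared with the predicted 0.78/√1024 ≈ 2.4%:

```java
public class Validate
{
    // The ten PCSA estimates from the experiment above (true value 1,000,000).
    static final long[] TRIALS = {
        964416, 997616, 959857, 1024303, 972940,
        985534, 998291, 996266, 959208, 1015329 };

    // Root-mean-square relative error over the trials.
    public static double rmsError(long[] trials, long truth)
    {
        double sum = 0;
        for (long t : trials)
        {
            double rel = (double) (t - truth) / truth;
            sum += rel * rel;
        }
        return Math.sqrt(sum / trials.length);
    }

    public static void main(String[] args)
    {
        System.out.printf("%.4f%n", rmsError(TRIALS, 1000000));  // about 0.025
        System.out.printf("%.4f%n", 0.78 / Math.sqrt(1024));     // 0.0244
    }
}
```

The observed error is right in line with the hypothesis.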
29

Space-accuracy tradeoff for probabilistic counting with stochastic averaging

[plot: relative accuracy 0.78/√M, for M = 32, 64, 128, 256, 512, 1024]

Bottom line.
• Attain 10% relative accuracy with a sketch consisting of 64 words (M = 64).
• Attain 2.4% relative accuracy with a sketch consisting of 1024 words (M = 1024).
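The bottom-line figures follow directly from the formula; a quick check (helper method is for illustration only):

```java
public class Tradeoff
{
    // PCSA relative accuracy with an M-word sketch: 0.78 / sqrt(M).
    public static double accuracy(int M)
    {  return 0.78 / Math.sqrt(M);  }

    public static void main(String[] args)
    {
        for (int M = 32; M <= 1024; M *= 2)
            System.out.printf("M = %4d: %4.1f%%%n", M, 100 * accuracy(M));
        // M =   64:  9.8%  (the "10%" point)
        // M = 1024:  2.4%
    }
}
```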
30

Scientific validation of PCSA

Hypothesis. Accuracy is as specified for the hash functions we use and the data we have.

Validation (Flajolet and Martin, 1985). Extensive reproducible scientific experiments (!)

Validation (RS, this morning).

% java PCSA 6000000 1024 < log.07.f3.txt
1106474

<1% larger than actual value

Q. Is PCSA effective?
A. ABSOLUTELY!
31

Summary: PCSA (Flajolet-Martin, 1983) is a demonstrably effective approach to cardinality estimation

Q. About how many different values are present in a given stream?

PCSA
• Makes one pass through the stream.
• Uses a few machine instructions per value.
• Uses M words to achieve relative accuracy 0.78/√M.
✓ Results validated through extensive experimentation.

A poster child for AofA/AC

Open questions
• Better space-accuracy tradeoffs?
• Support other operations?

“IT IS QUITE CLEAR that other observable regularities on hashed values of records could have been used…”
− Flajolet and Martin
32

Small sample of work on related problems

1970   Bloom                            set membership
1984   Wegman                           unbiased sampling estimate
1996–  many authors                     refinements (stay tuned)
2000   Indyk                            L1 norm
2004   Cormode–Muthukrishnan            frequency estimation, deletion, and other operations
2005   Giroire                          fast stream processing
2012   Lumbroso                         full range, asymptotically unbiased
2014   Helmi–Lumbroso–Martínez–Viola    uses neither sampling nor hashing
Cardinality Estimation
• Rules of the game
• Probabilistic counting
• Stochastic averaging
• Refinements
• Final frontier
34

We can do better (in theory)

Alon, Matias, and Szegedy. The Space Complexity of Approximating the Frequency Moments. STOC 1996; JCSS 1999.

Contributions
• Studied the problem of estimating higher moments.
• Formalized the idea of randomized streaming algorithms.
• Won the Gödel Prize in 2005 for “foundational contribution.”

Theorem (paraphrased to fit the context of this talk). With strongly universal hashing, PC, for any c > 2,
• Uses O(log N ) bits.
• Is accurate to a factor of c, with probability at least 2/c.

Replaces the “uniform hashing” assumption with a “random bit existence” assumption.

BUT, no impact on cardinality estimation in practice ???!
• The “algorithm” just changes the hash function for PC.
• The accuracy estimate is too weak to be useful.
• No validation.
35

Interesting quote

“Flajolet and Martin [assume] that one may use in the algorithm an explicit family of hash functions which exhibits some ideal random properties. Since we are not aware of the existence of such a family of hash functions…”
− Alon, Matias, and Szegedy

No! They hypothesized that practical hash functions would be as effective as random ones. They then validated that hypothesis by proving tight bounds that match experimental results.

Points of view re hashing
• Theoretical computer science. The uniform hashing assumption is not proved.
• Practical computing. Hashing works for many common data types.
• AofA. Extensive experiments have validated precise analytic models.

Points of view re random bits
• Theoretical computer science. Axiomatic that random bits exist.
• Practical computing. No, they don’t! And randomized algorithms are inconvenient, by the way.
• AofA. The more effective path forward is to validate precise analysis, even if stronger assumptions are needed.
36

logs and loglogs

To improve space-time tradeoffs, we need to carefully count bits.

Relevant quantities
• N is the number of items in the data stream.
• lg N is the number of bits needed to represent numbers less than N in binary.
• lg lg N is the number of bits needed to represent numbers less than lg N in binary.

For real-world applications
• N is less than 2^64.
• lg N is less than 64.
• lg lg N is less than 8.

Typical PCSA implementations
• Could use M lg N bits, in theory.
• Use 64-bit words to take advantage of machine-language efficiencies.
• Use (therefore) 64 × 64 = 4096 bits with M = 64 (for 10% accuracy with N < 2^64).
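The lg and lg lg arithmetic above is easy to check mechanically. A small sketch, assuming N < 2^64 (the helper name is ours, not the slide's):

```java
public class BitCounts {
    // Number of bits needed to represent any value in 0 .. maxExclusive-1.
    static int bits(long maxExclusive) {
        if (maxExclusive <= 1) return 0;
        return 64 - Long.numberOfLeadingZeros(maxExclusive - 1);
    }

    public static void main(String[] args) {
        // N < 2^64, so lg N = 64 bits per hashed value.
        // Values less than lg N = 64 need only bits(64) = 6 bits, which is
        // why a "loglog" register fits comfortably in under 8 bits.
        System.out.println(bits(64));
    }
}
```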
37

We can do better (in theory)

Bar-Yossef, Jayram, Kumar, Sivakumar, and Trevisan. Counting Distinct Elements in a Data Stream. RANDOM 2002.

Contribution. Improves the space-accuracy tradeoff at extra stream-processing expense.

Theorem (paraphrased to fit the context of this talk). With strongly universal hashing, there exists an algorithm that
• Uses O(M log log N ) bits. [PCSA uses M lg N bits.]
• Achieves relative accuracy O(1/√M ).

STILL no impact on cardinality estimation in practice ???!
• Infeasible because of high stream-processing expense.
• Big constants hidden in the O-notation.
• No validation.
38

We can do better (in theory and in practice)

Durand and Flajolet. LogLog Counting of Large Cardinalities. ESA 2003; LNCS volume 2832.

Contributions (independent of BYJKST)
• Presents the LogLog algorithm, an easy variant of PCSA.
• Improves the space-accuracy tradeoff without extra expense per value.
• Full analysis, fully validated with experimentation.

PCSA saves sketches (lg N bits each): 00000000000000000000000001101111
LogLog saves r() values (lg lg N bits each): 00100 ( = 4)

Theorem (paraphrased to fit the context of this talk). Under the uniform hashing assumption, LogLog
• Uses M lg lg N bits.
• Achieves relative accuracy close to 1.30/√M.

Not much impact on cardinality estimation in practice, only because
• PCSA was effectively deployed in practical systems [I like PCSA]
• The idea led to a better algorithm a few years later (stay tuned).
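A LogLog register never stores a sketch, only the largest r() value seen in its substream, which is what makes lg lg N bits enough. A hedged sketch of just that update (the rank convention here is the 1-based position of the lowest set bit, a standard choice that may differ in detail from the talk's Bits.r):

```java
public class LogLogRegister {
    // Rank of a hashed value: 1-based position of its lowest set bit.
    // (A stand-in for the talk's Bits.r; the exact convention may differ.)
    static int rank(long hash) {
        return Long.numberOfTrailingZeros(hash) + 1;
    }

    // A register keeps only the maximum rank seen in its substream.
    // Ranks of 64-bit hashes are at most 64, so a few bits per register suffice.
    static void update(byte[] registers, int k, long hash) {
        int r = rank(hash);
        if (registers[k] < r) registers[k] = (byte) r;
    }

    public static void main(String[] args) {
        byte[] reg = new byte[4];
        update(reg, 0, 0b1000L);   // rank 4: register becomes 4
        update(reg, 0, 0b0010L);   // rank 2: smaller, so ignored
        System.out.println(reg[0]);
    }
}
```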
39
We can do better (in theory and in practice): HyperLogLog algorithm (2007)
public static long estimate(Iterable<Long> stream, int M) { int[] bytes = new int[M]; for (long x : stream) { int k = hash2(x, M); if (bytes[k] < Bits.r(x)) bytes[k] = Bits.r(x); } double sum = 0.0; for (int k = 0; k < M; k++) sum += Math.pow(2, -1.0 - bytes[k]); return (int) (alpha * M * M / sum); }
Flajolet-Fusy-Gandouet-Meunier 2007
Idea. Harmonic mean of r() values
• Use stochastic splitting
• Keep track of min(r (x)) for each stream
• Return harmonic mean.
8-bit bytes (code to pack into M loglogN bits omitted)
39
We can do better (in theory and in practice): HyperLogLog algorithm (2007)
public static long estimate(Iterable<Long> stream, int M) { int[] bytes = new int[M]; for (long x : stream) { int k = hash2(x, M); if (bytes[k] < Bits.r(x)) bytes[k] = Bits.r(x); } double sum = 0.0; for (int k = 0; k < M; k++) sum += Math.pow(2, -1.0 - bytes[k]); return (int) (alpha * M * M / sum); }
Flajolet-Fusy-Gandouet-Meunier 2007
about .709 for M = 64
Idea. Harmonic mean of r() values
• Use stochastic splitting
• Keep track of min(r (x)) for each stream
• Return harmonic mean.
8-bit bytes (code to pack into M loglogN bits omitted)
39
We can do better (in theory and in practice): HyperLogLog algorithm (2007)
public static long estimate(Iterable<Long> stream, int M) { int[] bytes = new int[M]; for (long x : stream) { int k = hash2(x, M); if (bytes[k] < Bits.r(x)) bytes[k] = Bits.r(x); } double sum = 0.0; for (int k = 0; k < M; k++) sum += Math.pow(2, -1.0 - bytes[k]); return (int) (alpha * M * M / sum); }
Flajolet-Fusy-Gandouet-Meunier 2007
Flajolet, Fusy, Gandouet, and Meunier
HyperLogLog: the analysis of a near-
optimal cardinality estimation algorithm
AofA 2007; DMTCS 2007.about .709 for M = 64
Idea. Harmonic mean of r() values
• Use stochastic splitting
• Keep track of min(r (x)) for each stream
• Return harmonic mean.
8-bit bytes (code to pack into M loglogN bits omitted)
39
We can do better (in theory and in practice): HyperLogLog algorithm (2007)
public static long estimate(Iterable<Long> stream, int M) { int[] bytes = new int[M]; for (long x : stream) { int k = hash2(x, M); if (bytes[k] < Bits.r(x)) bytes[k] = Bits.r(x); } double sum = 0.0; for (int k = 0; k < M; k++) sum += Math.pow(2, -1.0 - bytes[k]); return (int) (alpha * M * M / sum); }
Flajolet-Fusy-Gandouet-Meunier 2007
Flajolet, Fusy, Gandouet, and Meunier
HyperLogLog: the analysis of a near-
optimal cardinality estimation algorithm
AofA 2007; DMTCS 2007.
Theorem (paraphrased to fit context of this talk).
about .709 for M = 64
Idea. Harmonic mean of r() values
• Use stochastic splitting
• Keep track of min(r (x)) for each stream
• Return harmonic mean.
8-bit bytes (code to pack into M loglogN bits omitted)
39
We can do better (in theory and in practice): HyperLogLog algorithm (2007)
public static long estimate(Iterable<Long> stream, int M) { int[] bytes = new int[M]; for (long x : stream) { int k = hash2(x, M); if (bytes[k] < Bits.r(x)) bytes[k] = Bits.r(x); } double sum = 0.0; for (int k = 0; k < M; k++) sum += Math.pow(2, -1.0 - bytes[k]); return (int) (alpha * M * M / sum); }
Flajolet-Fusy-Gandouet-Meunier 2007
Flajolet, Fusy, Gandouet, and Meunier
HyperLogLog: the analysis of a near-
optimal cardinality estimation algorithm
AofA 2007; DMTCS 2007.
Theorem (paraphrased to fit context of this talk). Under the uniform hashing assumption, HyperLogLog
about .709 for M = 64
Idea. Harmonic mean of r() values
• Use stochastic splitting
• Keep track of min(r (x)) for each stream
• Return harmonic mean.
8-bit bytes (code to pack into M loglogN bits omitted)
39
We can do better (in theory and in practice): HyperLogLog algorithm (2007)
public static long estimate(Iterable<Long> stream, int M) { int[] bytes = new int[M]; for (long x : stream) { int k = hash2(x, M); if (bytes[k] < Bits.r(x)) bytes[k] = Bits.r(x); } double sum = 0.0; for (int k = 0; k < M; k++) sum += Math.pow(2, -1.0 - bytes[k]); return (int) (alpha * M * M / sum); }
Flajolet-Fusy-Gandouet-Meunier 2007
Flajolet, Fusy, Gandouet, and Meunier
HyperLogLog: the analysis of a near-
optimal cardinality estimation algorithm
AofA 2007; DMTCS 2007.
Theorem (paraphrased to fit context of this talk). Under the uniform hashing assumption, HyperLogLog
• Uses M log log N bits.
about .709 for M = 64
Idea. Harmonic mean of r() values
• Use stochastic splitting
• Keep track of min(r (x)) for each stream
• Return harmonic mean.
8-bit bytes (code to pack into M loglogN bits omitted)
39
We can do better (in theory and in practice): HyperLogLog algorithm (2007)
public static long estimate(Iterable<Long> stream, int M)
{
   int[] bytes = new int[M];
   for (long x : stream)
   {
      int k = hash2(x, M);
      if (bytes[k] < Bits.r(x)) bytes[k] = Bits.r(x);
   }
   double sum = 0.0;
   for (int k = 0; k < M; k++)
      sum += Math.pow(2, -1.0 - bytes[k]);
   return (long) (alpha * M * M / sum);
}
Flajolet-Fusy-Gandouet-Meunier 2007
Flajolet, Fusy, Gandouet, and Meunier. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. AofA 2007; DMTCS 2007.
Theorem (paraphrased to fit context of this talk). Under the uniform hashing assumption, HyperLogLog
• Uses M log log N bits.
• Achieves relative accuracy close to 1.02/√M.
alpha: about .709 for M = 64
Idea. Harmonic mean of r() values.
• Use stochastic splitting.
• Keep track of max(r(x)) for each substream.
• Return harmonic mean.
8-bit bytes (code to pack into M lg lg N bits omitted)
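The code above relies on helpers (hash2, Bits.r, alpha) defined elsewhere in the talk. A self-contained sketch of the same idea, with hypothetical stand-ins for those helpers and pre-hashed 64-bit values in place of the stream, might look like this; it is a sketch under the uniform hashing assumption, not the talk's exact implementation:

```java
public class HLLSketch
{
   // Stand-in for Bits.r(): rank = 1 + number of leading zeros after
   // discarding the lg M bits used for the substream index.
   static int rank(long x, int lgM)
   {  return Long.numberOfLeadingZeros(x << lgM) + 1;  }

   // HyperLogLog estimate for a stream of (already hashed) 64-bit values.
   // M must be a power of two.
   public static long estimate(long[] stream, int M)
   {
      int lgM = Integer.numberOfTrailingZeros(M);
      int[] bytes = new int[M];                  // one 8-bit register per substream
      for (long x : stream)
      {
         int k = (int) (x >>> (64 - lgM));       // stochastic splitting: leading bits pick substream
         int r = rank(x, lgM);
         if (bytes[k] < r) bytes[k] = r;         // keep max rank per substream
      }
      double sum = 0.0;
      for (int k = 0; k < M; k++)
         sum += Math.pow(2, -bytes[k]);          // harmonic-mean denominator
      double alpha = 0.7213 / (1 + 1.079 / M);   // bias-correction constant (asymptotic form)
      return (long) (alpha * M * M / sum);
   }
}
```

With M = 1024 registers, the theorem predicts relative accuracy near 1.02/√1024 ≈ 3.2% on uniformly hashed data.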
Space-accuracy tradeoff for HyperLogLog
[Plot: relative accuracy 1.02/√M, shown for M = 64, 128, 256, 512, 1024, with the 5% and 10% levels marked]

Bottom line (for N < 2^64).
• Attain 10% relative accuracy with a sketch consisting of 108×6 = 648 bits.
• Attain 3.1% relative accuracy with a sketch consisting of 1024×6 = 6144 bits.

Yay!
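The bottom-line numbers follow directly from the 1.02/√M formula on this slide; a quick sanity check (plain arithmetic, nothing assumed beyond the formula):

```java
public class HLLAccuracy
{
   // Relative accuracy of HyperLogLog as a function of the number of substreams M.
   public static double accuracy(int M)
   {  return 1.02 / Math.sqrt(M);  }
}
```

accuracy(108) ≈ 0.098, so 108 six-bit registers (648 bits) suffice for 10%; accuracy(1024) ≈ 0.0319, in line with the 3.1% claim for 6144 bits.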
PCSA vs HyperLogLog

Typical PCSA implementations
• Could use M lg N bits, in theory.
• Use 64-bit words to take advantage of machine-language efficiencies.
• Use (therefore) 64×64 = 4096 bits with M = 64 (for 10% accuracy with N < 2^64).

Typical HyperLogLog implementations
• Could use M lg lg N bits, in theory.
• Use 8-bit bytes to take advantage of machine-language efficiencies.
• Use (therefore) 108×8 = 864 bits with M = 108 (for 10% accuracy with N < 2^64).
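The space figures above are just registers times register width; spelled out (the helper name here is ours, for illustration):

```java
public class SketchBits
{
   // Total sketch size: M registers of 'width' bits each.
   public static int bits(int M, int width)
   {  return M * width;  }

   public static void main(String[] args)
   {
      System.out.println("PCSA,        64-bit words: " + bits(64, 64) + " bits");
      System.out.println("HyperLogLog,  8-bit bytes: " + bits(108, 8) + " bits");
   }
}
```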
42
Validation of HyperLogLog

S. Heule, M. Nunkesser, and A. Hall. HyperLogLog in Practice: Algorithmic Engineering of a State-of-the-Art Cardinality Estimation Algorithm. Extending Database Technology/International Conference on Database Theory 2013.
Philippe Flajolet, mathematician, data scientist, and computer scientist extraordinaire
Cardinality Estimation
• Rules of the game
• Probabilistic counting
• Stochastic averaging
• Refinements
• Final frontier
We can do a bit better (in theory) but not much better
Lower bound
Indyk and Woodruff. Tight Lower Bounds for the Distinct Elements Problem. FOCS 2003.
Theorem (paraphrased to fit context of this talk). Any algorithm that achieves relative accuracy O(1/√M) must use Ω(M) bits.

Upper bound
Kane, Nelson, and Woodruff. An Optimal Algorithm for the Distinct Elements Problem. PODS 2010.
Theorem (paraphrased to fit context of this talk). With strongly universal hashing there exists an algorithm that
• Uses O(M) bits (a lg lg N improvement, matching the lower bound: optimal).
• Achieves relative accuracy O(1/√M).

Unlikely to have impact on cardinality estimation in practice
• Tough to beat HyperLogLog's low stream-processing expense.
• Constants hidden in O-notation not likely to be < 6.
• No validation.
Can we beat HyperLogLog in practice?
Also, results need to be validated through extensive experimentation.
Necessary characteristics of a better algorithm
• Makes one pass through the stream.
• Uses a few dozen machine instructions per value.
• Uses a few hundred bits.
• Achieves 10% relative accuracy or better.

                  machine instructions   memory      memory bound    # bits for 10% accuracy
                  per stream element     bound       when N < 2^64   when N < 2^64
HyperLogLog       20-30                  M lg lg N   6M              648
BetterAlgorithm   a few dozen                                        a few hundred
I love HyperLogLog
“ I’ve long thought that there should be a simple algorithm that uses a small constant times M bits…”
− Jérémie Lumbroso
A proposal: HyperBitBit (Sedgewick, 2016)
public static long estimate(Iterable<String> stream, int M)
{
   int lgN = 5;
   long sketch = 0;
   long sketch2 = 0;
   for (String s : stream)
   {
      long x = hash(s);
      int k = hash2(x, 64);
      if (r(x) > lgN)     sketch  = sketch  | (1L << k);
      if (r(x) > lgN + 1) sketch2 = sketch2 | (1L << k);
      if (p(sketch) > 31)
      {  sketch = sketch2; lgN++; sketch2 = 0;  }
   }
   return (long) Math.pow(2, lgN + 5.4 + p(sketch)/32.0);
}
Idea.
• lgN is an estimate of lg N.
• sketch is 64 indicators of whether to increment lgN.
• sketch2 is 64 indicators of whether to increment lgN by 2.
• Update when half the bits in sketch are 1.
• Correct with p(sketch) and a bias factor, 5.4 (determined empirically).
Q. Does this even work?
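One way to try the question is to make the proposal self-contained and run it. Here the talk's hash/hash2/r()/p() helpers are replaced by hypothetical stand-ins (trailing-zero rank, top bits for the bucket, Long.bitCount), the input is a stream of already-hashed 64-bit values, and 5.4 is the slide's empirical bias constant; this is a sketch of the proposal, not a validated implementation:

```java
public class HyperBitBit
{
   static int r(long x)      { return Long.numberOfTrailingZeros(x) + 1; }  // rank: geometric(1/2)
   static int p(long sketch) { return Long.bitCount(sketch); }              // population count
   static int bucket(long x) { return (int) (x >>> 58); }                   // second hash: top 6 bits -> [0, 64)

   // Estimate the number of distinct values among (already hashed) 64-bit inputs.
   public static long estimate(long[] stream)
   {
      int lgN = 5;
      long sketch = 0, sketch2 = 0;
      for (long x : stream)
      {
         int k = bucket(x);
         if (r(x) > lgN)     sketch  |= (1L << k);
         if (r(x) > lgN + 1) sketch2 |= (1L << k);
         if (p(sketch) > 31)                       // half the bits set: increment the lg N estimate
         {  sketch = sketch2; lgN++; sketch2 = 0;  }
      }
      return (long) Math.pow(2, lgN + 5.4 + p(sketch)/32.0);  // 5.4: empirical bias constant
   }
}
```

Because the rank and bucket conventions here are guesses at the talk's helpers, the bias constant may be off by a small power of two; the point of the sketch is the 128 + 6 bit state, not the exact calibration.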
Initial experiments
% java Hash 1000000 < log.07.f3.txt 242601 % java Hash 2000000 < log.07.f3.txt 483477 % java Hash 4000000 < log.07.f3.txt 883071 % java Hash 6000000 < log.07.f3.txt 1097944
Exact values for web log example
% java HyperBitBit 1000000 < log.07.f3.txt234219% java HyperBitBit 2000000 < log.07.f3.txt499889% java HyperBitBit 4000000 < log.07.f3.txt916801% java HyperBitBit 6000000 < log.07.f3.txt1044043
HyperBitBit estimates
Conjecture. On practical data, HyperBitBit, for N < 264, • Uses 128 + 6 bits.• Estimates cardinality within 10% of the actual.
1,000,000 2,000,000 4,000,000 6,000,000
Exact 242,601 483,477 883,071 1,097,944
HyperBitBit 234,219 499,889 916,801 1,044,043
ratio 1.05 1.03 0.96 1.03
Next steps. • Analyze. • Experiment. • Iterate
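The estimates above come from an implementation of HyperBitBit: two 64-bit sketches plus a 6-bit exponent. The following self-contained Java sketch is one plausible reading of the algorithm, not the exact program that produced the runs above — in particular, the bucket/geometric-value bit extraction and the `mix` hash (a splitmix64-style finalizer, used here only to generate test keys) are my choices, while the empirical bias-correction constant 5.4 is from the talk.

```java
public class HyperBitBit {
    private int lgN = 5;       // running estimate of lg(cardinality)
    private long sketch = 0;   // 64 one-bit counters at threshold lgN
    private long sketch2 = 0;  // lookahead counters at threshold lgN + 1

    // splitmix64 finalizer: scrambles i into a pseudorandom 64-bit hash
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    public void add(long hash) {
        int k = (int) (hash & 63);                       // bucket from low 6 bits
        int r = Long.numberOfTrailingZeros(hash >>> 6);  // geometric(1/2) value
        if (r > lgN)     sketch  |= 1L << k;
        if (r > lgN + 1) sketch2 |= 1L << k;
        if (Long.bitCount(sketch) > 31) {                // half the bits set: rescale
            sketch = sketch2;
            sketch2 = 0;
            lgN++;
        }
    }

    public double estimate() {
        // 5.4 is the empirical bias-correction constant from the talk
        return Math.pow(2, lgN + 5.4 + Long.bitCount(sketch) / 32.0);
    }

    public static void main(String[] args) {
        int n = args.length > 0 ? Integer.parseInt(args[0]) : 1000000;
        HyperBitBit hbb = new HyperBitBit();
        for (long i = 0; i < n; i++) hbb.add(mix(i));    // n distinct keys
        System.out.printf("n = %d, estimate = %.0f%n", n, hbb.estimate());
    }
}
```

Feeding it n distinct synthetic keys gives estimates in the right ballpark, consistent with the ~3–5% error observed on the web log above.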
48
Summary/timeline for cardinality estimation

year   who                                          algorithm     hashing           feasible and  memory          relative accuracy  # bits for 10% accuracy
                                                                  assumption        validated?    bound (bits)    constant           when N < 2^64
1970   Bloom                                        Bloom filter  uniform           ✓             kN                                 > 2^64
1985   Flajolet–Martin                              PCSA          uniform           ✓             M log N         0.78               4096
1996   Alon–Matias–Szegedy                          [theorem]     strong universal  ✗             O(M log N)      O(1)               ?
2002   Bar-Yossef–Jayram–Kumar–Sivakumar–Trevisan   [theorem]     strong universal  ✗             O(M log log N)  O(1)               ?
2003   Durand–Flajolet                              LogLog        uniform           ✓             M lg lg N       1.30               1536
2007   Flajolet–Fusy–Gandouet–Meunier               HyperLogLog   uniform           ✓             M lg lg N       1.04               648
2010   Kane–Nelson–Woodruff                         [theorem]     strong universal  ✗             O(M) + lg lg N  O(1)               ?
2018+  RS–?                                         HyperBitBit   uniform           ✓ (?)         2M + lg lg N    ?                  134 (?)
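The last column of the table follows from the relative-accuracy constants: a standard error of roughly c/√M means 10% accuracy needs about M = (c/0.1)² counters, times the bits per counter (lg N = 64 bits for PCSA's counters, lg lg N = 6 bits for the LogLog family when N < 2^64); HyperBitBit's conjectured 134 is 2·64 + 6. A quick arithmetic check — the choice of when to round M up to a power of two is my assumption about how the table's entries were obtained:

```java
public class AccuracyBits {
    // standard error ~ c / sqrt(M)  =>  M ~ (c / eps)^2 counters for accuracy eps
    static long counters(double c, double eps) {
        return Math.round(Math.pow(c / eps, 2));
    }

    // smallest power of two >= m (assumed rounding for PCSA and LogLog)
    static long nextPow2(long m) {
        long p = 1;
        while (p < m) p *= 2;
        return p;
    }

    public static void main(String[] args) {
        double eps = 0.10;  // target 10% relative accuracy
        // PCSA: c = 0.78, 64-bit counters; M = 61 rounded up to 64
        System.out.println("PCSA:        " + nextPow2(counters(0.78, eps)) * 64);
        // LogLog: c = 1.30, 6-bit counters; M = 169 rounded up to 256
        System.out.println("LogLog:      " + nextPow2(counters(1.30, eps)) * 6);
        // HyperLogLog: c = 1.04, 6-bit counters; M = 108 used directly
        System.out.println("HyperLogLog: " + counters(1.04, eps) * 6);
    }
}
```

These reproduce the table's 4096, 1536, and 648 bits respectively.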
49
Happy Birthday, Don!

Colloquium for Don Knuth's 80th birthday
Algorithms, Combinatorics, Information
http://knuth80.elfbrink.se

Anders Björner, Mireille Bousquet-Mélou, Erik Demaine, Persi Diaconis,
Michael Drmota, Ron Graham, Yannis Haralambous, Susan Holmes, Svante Janson,
Dick Karp, John Knuth, Svante Linusson, Laci Lovász, Jan Overduin,
Mike Paterson, Tim Roughgarden, Martin Ruckert, Bob Sedgewick,
Jeffrey Shallit, Richard Stanley, Wojtek Szpankowski, Bob Tarjan,
Greg Tucker, Andrew Yao