Sublinear Algorithms for Big Data. Grigory Yaroslavtsev, http://grigory.us. Lecture 1.


Page 1: Sublinear Algorithms for Big Data

Sublinear Algorithms for Big Data

Grigory Yaroslavtsev
http://grigory.us

Lecture 1

Page 2: Sublinear Algorithms for Big Data

Part 0: Introduction

• Disclaimers
• Logistics
• Materials
• …

Page 3: Sublinear Algorithms for Big Data

Name

Correct:
• Grigory
• Gregory (easiest and highly recommended!)
Also correct:
• Dr. Yaroslavtsev (I bet it’s difficult to pronounce)
Wrong:
• Prof. Yaroslavtsev (Not any easier)

Page 4: Sublinear Algorithms for Big Data

Disclaimers

• A lot of Math!

Page 5: Sublinear Algorithms for Big Data

Disclaimers

• No programming!

Page 6: Sublinear Algorithms for Big Data

Disclaimers

• 10-15 times longer than “Fuerza Bruta”, soccer game, milonga…

Page 7: Sublinear Algorithms for Big Data

Big Data

• Data
• Programming and Systems
• Algorithms
• Probability and Statistics

Page 8: Sublinear Algorithms for Big Data

Sublinear Algorithms

n = size of the data; we want o(n), i.e. strictly sublinear in n
• Sublinear Time
  – Queries
  – Samples
• Sublinear Space
  – Data Streams
  – Sketching
• Distributed Algorithms
  – Local and distributed computations
  – MapReduce-style algorithms

Page 9: Sublinear Algorithms for Big Data

Why is it useful?

• Algorithms for big data used by big companies: ultra-fast randomized algorithms for approximate decision making
  – Networking applications (counting and detecting patterns in small space)
  – Distributed computations (small sketches to reduce communication overheads)
• Aggregate Knowledge: startup doing streaming algorithms, acquired for $150M
• Today: applications to soccer

Page 10: Sublinear Algorithms for Big Data

Course Materials

• Will be posted at the class homepage: http://grigory.us/big-data.html
• Related and further reading:
  – Sublinear Algorithms (MIT) by Indyk, Rubinfeld
  – Algorithms for Big Data (Harvard) by Nelson
  – Data Stream Algorithms (University of Massachusetts) by McGregor
  – Sublinear Algorithms (Penn State) by Raskhodnikova

Page 11: Sublinear Algorithms for Big Data

Course Overview

• Lecture 1
• Lecture 2
• Lecture 3
• Lecture 4
• Lecture 5

3 hours = 3 × (45-50 min lecture + 10-15 min break).

Page 12: Sublinear Algorithms for Big Data

Puzzles

You see a sequence of n values x_1, …, x_n, arriving one by one:
• (Easy, “Find a missing player”)
  – If all x_i are different and have values between 1 and n+1, which value is missing?
  – You have O(log n) space
• Example:
  – There are 11 soccer players with numbers 1, …, 11.
  – You see 10 of them one by one; which one is missing? You can only remember a single number.

Page 13: Sublinear Algorithms for Big Data

1

Page 14: Sublinear Algorithms for Big Data

8

Page 15: Sublinear Algorithms for Big Data

5

Page 16: Sublinear Algorithms for Big Data

11

Page 17: Sublinear Algorithms for Big Data

3

Page 18: Sublinear Algorithms for Big Data

9

Page 19: Sublinear Algorithms for Big Data

2

Page 20: Sublinear Algorithms for Big Data

6

Page 21: Sublinear Algorithms for Big Data

7

Page 22: Sublinear Algorithms for Big Data

4

Page 23: Sublinear Algorithms for Big Data

Which number was missing?

Page 24: Sublinear Algorithms for Big Data

Puzzle #1

You see a sequence of n values x_1, …, x_n, arriving one by one:
• (Easy, “Find a missing player”)
  – If all x_i are different and have values between 1 and n+1, which value is missing?
  – You have O(log n) space
• Example:
  – There are 11 soccer players with numbers 1, …, 11.
  – You see 10 of them one by one; which one is missing? You can only remember a single number.

Page 25: Sublinear Algorithms for Big Data

Puzzle #2

You see a sequence of values, arriving one by one:
• (Harder, “Keep a random team”)
  – How can you maintain a uniformly random sample of k values out of those you have seen so far?
  – You can store exactly k items at any time
• Example:
  – You want to have a team of 11 players randomly chosen from the set you have seen.
  – Players arrive one at a time and you have to decide whether to keep them or not.

Page 26: Sublinear Algorithms for Big Data

Puzzle #3

You see a sequence of n values, arriving one by one:
• (Very hard, “Count the number of players”)
  – What is the total number of values, up to a multiplicative error (1 ± ε)?
  – You have o(log n) space and can be completely wrong with some small probability

Page 27: Sublinear Algorithms for Big Data

Puzzles

You see a sequence of n values x_1, …, x_n, arriving one by one:
• (Easy, “Find a missing player”)
  – If all x_i are different and have values between 1 and n+1, which value is missing?
  – You have O(log n) space
• (Harder, “Keep a random team”)
  – How can you maintain a uniformly random sample of k values out of those you have seen so far?
  – You can store exactly k items at any time
• (Very hard, “Count the number of players”)
  – What is the total number of values, up to a multiplicative error (1 ± ε)?
  – You have o(log n) space and can be completely wrong with some small probability

Page 28: Sublinear Algorithms for Big Data

Part 1: Probability 101

“The bigger the data the better you should know your Probability”
• Basic Spanish: Hola, Gracias, Bueno, Por favor, Bebida, Comida, Jamon, Queso, Gringo, Chica, Amigo, …
• Basic Probability:
  – Probability, events, random variables
  – Expectation, variance / standard deviation
  – Conditional probability, independence, pairwise independence, mutual independence

Page 29: Sublinear Algorithms for Big Data

Expectation

• X = random variable with values x_1, …, x_n
• Expectation 𝔼[X] = Σ_i x_i · Pr[X = x_i]
• Properties (linearity):
  𝔼[cX] = c · 𝔼[X]
  𝔼[X + Y] = 𝔼[X] + 𝔼[Y]

• Useful fact: if all x_i ≥ 0 and integer, then 𝔼[X] = Σ_{i ≥ 1} Pr[X ≥ i]

Page 30: Sublinear Algorithms for Big Data

Expectation

• Example: a fair die takes values 1, 2, …, 6, each with probability 1/6
  𝔼[Value] = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5

Page 31: Sublinear Algorithms for Big Data

Variance

• Variance Var[X] = 𝔼[(X − 𝔼[X])²]

• If c is some fixed value (a constant):
  Var[X + c] = Var[X]
  Var[cX] = c² · Var[X]

• Corollary: Var[X] = 𝔼[X²] − (𝔼[X])²

Page 32: Sublinear Algorithms for Big Data

Variance
• Example (variance of a fair die):
  𝔼[X²] = (1 + 4 + 9 + 16 + 25 + 36) / 6 = 91/6
  Var[X] = 𝔼[X²] − (𝔼[X])² = 91/6 − (3.5)² = 35/12 ≈ 2.92
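The die numbers above are easy to verify by direct enumeration. The sketch below (mine, not from the slides) uses exact rational arithmetic and also checks the “useful fact” 𝔼[X] = Σ_{i ≥ 1} Pr[X ≥ i] for this distribution:

```python
from fractions import Fraction

# Exact expectation and variance of a fair six-sided die, by enumeration.
values = range(1, 7)
p = Fraction(1, 6)                       # each outcome equally likely

E = sum(p * v for v in values)           # E[X]
E2 = sum(p * v * v for v in values)      # E[X^2]
var = E2 - E * E                         # Var[X] = E[X^2] - E[X]^2

# "Useful fact" for nonnegative integer X: E[X] = sum over i >= 1 of Pr[X >= i]
E_alt = sum(Fraction(sum(1 for v in values if v >= i), 6) for i in range(1, 7))

print(E, var, E_alt)                     # 7/2 35/12 7/2
```

Using `Fraction` avoids floating-point noise, so 35/12 comes out exactly rather than as 2.9166…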

Page 33: Sublinear Algorithms for Big Data

Independence

• Two random variables X and Y are independent if and only if (iff) for every x, y:
  Pr[X = x, Y = y] = Pr[X = x] · Pr[Y = y]

• Variables X_1, …, X_n are mutually independent iff
  Pr[X_1 = x_1, …, X_n = x_n] = Π_i Pr[X_i = x_i]

• Variables X_1, …, X_n are pairwise independent iff for all pairs i, j:
  Pr[X_i = x_i, X_j = x_j] = Pr[X_i = x_i] · Pr[X_j = x_j]
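The gap between pairwise and mutual independence can be made concrete with the standard XOR example (my illustration, not from the slides): two fair bits and their XOR are pairwise independent but not mutually independent. A small exhaustive check:

```python
from itertools import product

# X, Y fair coin flips, Z = X XOR Y: 4 equally likely outcomes (x, y, z).
outcomes = [(x, y, x ^ y) for x, y in product([0, 1], repeat=2)]

def pr(event):
    """Probability of an event over the 4 equally likely outcomes."""
    return sum(1 for o in outcomes if event(o)) / len(outcomes)

# Pairwise independence: Pr[A=a, B=b] = Pr[A=a] * Pr[B=b] for every pair.
pairwise_ok = all(
    pr(lambda o: o[i] == a and o[j] == b)
    == pr(lambda o: o[i] == a) * pr(lambda o: o[j] == b)
    for i, j in [(0, 1), (0, 2), (1, 2)]
    for a in [0, 1] for b in [0, 1]
)

# Mutual independence fails: Pr[X=0, Y=0, Z=0] = 1/4, product of marginals = 1/8.
joint = pr(lambda o: o == (0, 0, 0))
prod = pr(lambda o: o[0] == 0) * pr(lambda o: o[1] == 0) * pr(lambda o: o[2] == 0)
print(pairwise_ok, joint, prod)
```

Z is fully determined by X and Y, yet knowing any single one of the three bits tells you nothing about any other single one.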

Page 34: Sublinear Algorithms for Big Data

Independence: Example

Independent or not?
• Event: Argentina wins the World Cup
• Event: Messi becomes the best striker

Independent or not?
• Event: Argentina wins against Netherlands in the semifinals
• Event: Germany wins against Brazil in the semifinals

Page 35: Sublinear Algorithms for Big Data

Independence: Example

• Ratings of mortgage securities
  – AAA = 1% probability of default (over X years)
  – AA = 2% probability of default
  – A = 5% probability of default
  – B = 10% probability of default
  – C = 50% probability of default
  – D = 100% probability of default
• You are a portfolio holder with 1000 AAA securities:
  – Are they all independent?
  – Is the probability that they all default 0.01^1000?

Page 36: Sublinear Algorithms for Big Data

Conditional Probabilities

• For two events A and B:
  Pr[A | B] = Pr[A and B] / Pr[B]

• If two random variables (r.v.s) X and Y are independent:
  Pr[X = x | Y = y]
  = Pr[X = x, Y = y] / Pr[Y = y]   (by definition)
  = Pr[X = x] · Pr[Y = y] / Pr[Y = y] = Pr[X = x]   (by independence)

Page 37: Sublinear Algorithms for Big Data

Union Bound

For any events E_1, …, E_n:
  Pr[E_1 or E_2 or … or E_n] ≤ Pr[E_1] + Pr[E_2] + … + Pr[E_n]

• Pro: works even for dependent variables!
• Con: sometimes very loose, especially for mutually independent events

Page 38: Sublinear Algorithms for Big Data

Union Bound: Example

Events “Argentina wins the World Cup” and “Messi becomes the best striker” are not independent, but:

Pr[“Argentina wins the World Cup” or “Messi becomes the best striker”]
≤ Pr[“Argentina wins the World Cup”] + Pr[“Messi becomes the best striker”]

Page 39: Sublinear Algorithms for Big Data

Independence and Linearity of Expectation/Variance

• Linearity of expectation (even for dependent variables!):
  𝔼[Σ_i X_i] = Σ_i 𝔼[X_i]

• Linearity of variance (only for pairwise independent variables!):
  Var[Σ_i X_i] = Σ_i Var[X_i]
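A minimal exact check (my example, not from the slides) that variance is additive for independent variables but not in general:

```python
from fractions import Fraction
from itertools import product

def var(dist):
    """Variance of a finite distribution given as {value: probability}."""
    e = sum(p * v for v, p in dist.items())
    return sum(p * (v - e) ** 2 for v, p in dist.items())

half = Fraction(1, 2)
coin = {0: half, 1: half}                 # one fair 0/1 coin flip

# Sum of two INDEPENDENT flips: variances add (1/4 + 1/4 = 1/2).
two = {}
for x, y in product([0, 1], repeat=2):
    two[x + y] = two.get(x + y, Fraction(0)) + half * half

# Fully DEPENDENT case Y = X: Var[X + X] = 4 * Var[X], not 2 * Var[X].
double = {2 * v: p for v, p in coin.items()}

print(var(coin), var(two), var(double))   # 1/4 1/2 1
```

The dependent case is the extreme failure mode: with Y = X, the "sum of variances" rule would predict 1/2, but the true variance is 1.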

Page 40: Sublinear Algorithms for Big Data

Part 2: Inequalities

• Markov’s inequality
• Chebyshev’s inequality
• Chernoff bound

Page 41: Sublinear Algorithms for Big Data

Markov’s Inequality
• For every a > 0 and X ≥ 0: Pr[X ≥ a] ≤ 𝔼[X] / a
• Proof (by contradiction): assume Pr[X ≥ a] > 𝔼[X] / a. Then
  𝔼[X] = Σ_i x_i · Pr[X = x_i]   (by definition)
  ≥ Σ_{i : x_i ≥ a} x_i · Pr[X = x_i]   (pick only some i’s)
  ≥ a · Pr[X ≥ a] > 𝔼[X]   (by assumption), a contradiction.

Page 42: Sublinear Algorithms for Big Data

Markov’s Inequality

• For every a > 0 and X ≥ 0: Pr[X ≥ a] ≤ 𝔼[X] / a
• Corollary: for every c > 0: Pr[X ≥ c · 𝔼[X]] ≤ 1/c
• Pro: always works (as long as X ≥ 0)!
• Cons:
  – Not very precise
  – Doesn’t work for the lower tail: Pr[X ≤ a]

Page 43: Sublinear Algorithms for Big Data

Markov Inequality: Example

Markov 1: For every a > 0: Pr[X ≥ a] ≤ 𝔼[X] / a

• Example (fair die): Pr[Value ≥ 5] ≤ 3.5 / 5 = 0.7, while the true probability is 2/6 ≈ 0.33

Page 44: Sublinear Algorithms for Big Data

Markov Inequality: Example

Markov 2: For every c > 0: Pr[X ≥ c · 𝔼[X]] ≤ 1/c

• Example (fair die): Pr[Value ≥ 2 · 𝔼[Value]] = Pr[Value ≥ 7] ≤ 1/2, while the true probability is 0

Page 45: Sublinear Algorithms for Big Data

Markov Inequality: Example
Markov 2: For every c > 0: Pr[X ≥ c · 𝔼[X]] ≤ 1/c

• Example (fair die): Pr[Value ≥ (12/7) · 𝔼[Value]] = Pr[Value ≥ 6] ≤ 7/12 ≈ 0.58, while the true probability is 1/6 ≈ 0.17

Page 46: Sublinear Algorithms for Big Data

Markov + Union Bound: Example

Markov 2: For every c > 0: Pr[X ≥ c · 𝔼[X]] ≤ 1/c

• Example (two fair dice X_1, X_2): Pr[X_1 ≥ 4 or X_2 ≥ 4] ≤ 3.5/4 + 3.5/4 = 1.75, a vacuous bound (probabilities never exceed 1) showing how loose the combination can be
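For the die, Markov's bound and the exact tail can be compared at every threshold by enumeration (a quick sketch of mine, not part of the original slides):

```python
# Compare Markov's bound E[X]/a with the exact tail Pr[X >= a] for a fair die.
E = 3.5
for a in range(1, 7):
    exact = (7 - a) / 6      # Pr[die >= a]: outcomes a, a+1, ..., 6
    markov = E / a           # Markov's upper bound (can exceed 1)
    assert exact <= markov   # the bound holds for every threshold a
    print(a, round(exact, 3), round(markov, 3))
```

The table it prints makes the looseness visible: at a = 4 the bound is 0.875 against a true tail of 0.5, and for small a the bound is above 1 and says nothing.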

Page 47: Sublinear Algorithms for Big Data

Chebyshev’s Inequality

• For every a > 0:
  Pr[|X − 𝔼[X]| ≥ a] ≤ Var[X] / a²

• Proof:
  Pr[|X − 𝔼[X]| ≥ a]

  = Pr[(X − 𝔼[X])² ≥ a²]   (by squaring)

  ≤ 𝔼[(X − 𝔼[X])²] / a² = Var[X] / a²   (by Markov’s inequality)

Page 48: Sublinear Algorithms for Big Data

Chebyshev’s Inequality

• For every a > 0:
  Pr[|X − 𝔼[X]| ≥ a] ≤ Var[X] / a²

• Corollary (σ = √Var[X]): for every c > 0:
  Pr[|X − 𝔼[X]| ≥ c · σ] ≤ 1/c²

Page 49: Sublinear Algorithms for Big Data

Chebyshev: Example

Page 50: Sublinear Algorithms for Big Data

Chebyshev: Example

• Roll a die 10 times: X = average value over 10 rolls

• X_i = value of the i-th roll, X = (1/10) · Σ_i X_i
• Variance (by linearity for independent r.v.s):
  Var[X] = (1/10²) · Σ_i Var[X_i] = (1/100) · 10 · (35/12) = 35/120 ≈ 0.29

Page 51: Sublinear Algorithms for Big Data

Chebyshev: Example

• Roll a die 10 times: X = average value over 10 rolls

• Pr[|X − 3.5| ≥ 1] ≤ Var[X] / 1² ≈ 0.29 (if n rolls, then ≈ 2.91 / n)
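A simulation sketch (my own, not from the slides) confirms that the Chebyshev bound of roughly 0.29 holds, and shows it is far from tight here:

```python
import random

random.seed(1)   # fixed seed for a reproducible run

# Empirical check of Chebyshev for the average of 10 die rolls:
# Pr[|avg - 3.5| >= 1] <= Var[avg] = (35/12)/10, about 0.29.
trials = 100_000
deviations = sum(
    1 for _ in range(trials)
    if abs(sum(random.randint(1, 6) for _ in range(10)) / 10 - 3.5) >= 1
)

empirical = deviations / trials
bound = (35 / 12) / 10          # variance of the average of 10 rolls
print(empirical, bound)         # empirical frequency is well below the bound
```

The empirical deviation frequency comes out around a few percent, comfortably under the 0.29 guarantee.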

Page 52: Sublinear Algorithms for Big Data

Chernoff bound

• Let X_1, …, X_t be independent and identically distributed r.v.s with range [0, 1] and expectation μ.

• Then if X = (1/t) · Σ_i X_i and 0 < ε < 1:
  Pr[|X − μ| ≥ ε·μ] ≤ 2 · exp(−t·μ·ε² / 3)

Page 53: Sublinear Algorithms for Big Data

Chernoff bound (corollary)

• Let X_1, …, X_t be independent and identically distributed r.v.s with range [0, c] and expectation μ.

• Then if X = (1/t) · Σ_i X_i and 0 < ε < 1:
  Pr[|X − μ| ≥ ε·μ] ≤ 2 · exp(−t·μ·ε² / 3c)

Page 54: Sublinear Algorithms for Big Data

Chernoff: Example

• Roll a die 10 times: X = average value over 10 rolls (range [0, 6], μ = 3.5)
• For deviation 1 (ε = 2/7): Pr[|X − 3.5| ≥ 1] ≤ 2 · exp(−10 · 3.5 · (2/7)² / 18) ≈ 1.7, a vacuous bound for only 10 rolls

Page 55: Sublinear Algorithms for Big Data

Chernoff: Example

• Roll a die 1000 times: X = average value over 1000 rolls
• For deviation 1 (ε = 2/7): Pr[|X − 3.5| ≥ 1] ≤ 2 · exp(−1000 · 3.5 · (2/7)² / 18) ≈ 2.5 · 10⁻⁷

Page 56: Sublinear Algorithms for Big Data

Chernoff vs. Chebyshev: Example

Let X = average of t i.i.d. r.v.s:
• Chebyshev: the tail bound decays as O(1/t)
• Chernoff: the tail bound decays as exp(−Ω(t))
If t is very big (for fixed ε and μ, all other values are constants!):
• Chebyshev: bound goes to 0 polynomially in t
• Chernoff: bound goes to 0 exponentially in t
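The contrast in decay rates can be tabulated numerically (a sketch with illustrative parameters μ = 0.5, ε = 0.1 of my choosing, not from the slides):

```python
import math

# Tail bounds for the average of t i.i.d. variables in [0, 1] with mean mu,
# at relative deviation eps: Chebyshev decays like 1/t, Chernoff like exp(-t).
mu, eps = 0.5, 0.1       # assumed illustrative parameters
sigma2 = 0.25            # Var <= 1/4 for any [0, 1]-valued variable

for t in [10, 100, 1000, 10000]:
    chebyshev = sigma2 / (t * (eps * mu) ** 2)       # Var[avg] / (eps * mu)^2
    chernoff = 2 * math.exp(-t * mu * eps ** 2 / 3)  # Chernoff bound
    print(t, chebyshev, chernoff)
```

For small t both bounds are vacuous (above 1); by t = 10000 Chebyshev gives 0.01 while Chernoff is already around 10⁻⁷.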

Page 57: Sublinear Algorithms for Big Data

Chernoff vs. Chebyshev: Example

Large values of t is exactly what we need!
• Chebyshev: O(1/t)
• Chernoff: exp(−Ω(t))

So is Chernoff always better for us?
• Yes, if we have i.i.d. variables.
• No, if we have dependent or only pairwise independent random variables.
• If the variables are not identical, Chernoff-type bounds exist.

Page 58: Sublinear Algorithms for Big Data

Answers to the puzzles
You see a sequence of n values x_1, …, x_n, arriving one by one:
• (Easy)
  – If all x_i are different and have values between 1 and n+1, which value is missing?
  – You have O(log n) space
  – Answer: missing value = (n+1)(n+2)/2 − Σ_i x_i

• (Harder)
  – How can you maintain a uniformly random sample of k values out of those you have seen so far?
  – You can store exactly k values at any time
  – Answer: store the first k. When you see x_i for i > k, with probability k/i replace a uniformly random value from your storage with x_i.
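Both answers can be sketched in a few lines of Python (illustrative code of mine, not from the slides; the stream of player numbers matches the earlier example):

```python
import random

random.seed(42)   # fixed seed for a reproducible run

# Puzzle 1: n distinct values from {1, ..., n+1} arrive one by one.
# Track only the running sum; the missing value is (1 + ... + (n+1)) - (sum seen).
def find_missing(stream, n):
    return (n + 1) * (n + 2) // 2 - sum(stream)

players = [1, 8, 5, 11, 3, 9, 2, 6, 7, 4]    # the 10 players shown on the slides
missing = find_missing(players, 10)

# Puzzle 2: reservoir sampling. Keep a uniform sample of k items using exactly
# k slots: store the first k; on seeing item i > k, with probability k/i
# replace a uniformly random stored item with it.
def reservoir_sample(stream, k):
    sample = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            sample.append(x)
        elif random.random() < k / i:
            sample[random.randrange(k)] = x
    return sample

team = reservoir_sample(range(1, 101), 11)    # a random team of 11 out of 100
print(missing, team)                          # missing player is 10
```

The missing-player trick stores one running sum (O(log n) bits); the reservoir keeps exactly k items, and a short induction shows each item seen so far is in the sample with probability k/i.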

Page 59: Sublinear Algorithms for Big Data

Part 3: Morris’s Algorithm

• (Very hard, “Count the number of players”)
  – What is the total number of values, up to a multiplicative error (1 ± ε)?
  – You have o(log n) space and can be completely wrong with some small probability

Page 60: Sublinear Algorithms for Big Data

Morris’s Algorithm: Alpha-version

Maintains a counter X using O(log log n) bits
• Initialize X to 0
• When an item arrives, increase X by 1 with probability 2^(−X)
• When the stream is over, output 2^X − 1

Claim: 𝔼[2^X] = n + 1
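The alpha-version is easy to simulate; the sketch below (mine, not from the slides) checks empirically that averaging the estimator 2^X − 1 over many independent runs lands near n, as the claim 𝔼[2^X] = n + 1 predicts:

```python
import random

random.seed(7)   # fixed seed for a reproducible run

# Morris's alpha-version: keep a counter X (about log log n bits in practice);
# on each arrival increment X with probability 2^(-X); estimate n as 2^X - 1.
def morris(n_items):
    x = 0
    for _ in range(n_items):
        if random.random() < 2.0 ** -x:
            x += 1
    return 2 ** x - 1

# Since E[2^X] = n + 1, the estimator 2^X - 1 is unbiased; the mean over
# many independent runs should land near n (single runs vary wildly).
n, runs = 1000, 2000
mean_estimate = sum(morris(n) for _ in range(runs)) / runs
print(mean_estimate)
```

A single run returns a power of two minus one, so it can be far off; it is only the expectation that is exactly right, which is why the variance analysis on the next slides matters.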

Page 61: Sublinear Algorithms for Big Data

Morris’s Algorithm: Alpha-version

Maintains a counter X using O(log log n) bits
• Initialize X to 0; when an item arrives, increase X by 1 with probability 2^(−X)
Claim: 𝔼[2^X] = n + 1
• Let the value after seeing n items be X_n:
  𝔼[2^(X_n)] = Σ_j Pr[X_{n−1} = j] · (2^(−j) · 2^(j+1) + (1 − 2^(−j)) · 2^j)
  = Σ_j Pr[X_{n−1} = j] · (2^j + 1)
  = 1 + 𝔼[2^(X_{n−1})], so by induction 𝔼[2^(X_n)] = n + 1.

Page 62: Sublinear Algorithms for Big Data

Morris’s Algorithm: Alpha-version

Maintains a counter X using O(log log n) bits
• Initialize X to 0; when an item arrives, increase X by 1 with probability 2^(−X)
Claim: 𝔼[2^(2·X_n)] = (3/2)n² + (3/2)n + 1, hence
  Var[2^(X_n)] = 𝔼[2^(2·X_n)] − (n + 1)² ≤ n²/2

Page 63: Sublinear Algorithms for Big Data

Morris’s Algorithm: Alpha-version

Maintains a counter X using O(log log n) bits
• Initialize X to 0; when an item arrives, increase X by 1 with probability 2^(−X)
• 𝔼[2^X] = n + 1, Var[2^X] ≤ n²/2
• Is this good?

Page 64: Sublinear Algorithms for Big Data

Morris’s Algorithm: Beta-version

Maintains t counters X_1, …, X_t using O(log log n) bits for each
• Initialize all to 0; when an item arrives, increase each X_i by 1 independently with probability 2^(−X_i)
• Output Z = (1/t) · Σ_i 2^(X_i) − 1
• 𝔼[Z + 1] = n + 1, Var[Z] ≤ n²/2t
• Claim: if t ≥ 3/(2ε²), then Pr[|Z − n| ≥ εn] ≤ 1/3

Page 65: Sublinear Algorithms for Big Data

Morris’s Algorithm: Beta-version

Maintains t counters X_1, …, X_t using O(log log n) bits for each
• Output Z = (1/t) · Σ_i 2^(X_i) − 1

• Claim: if t ≥ 3/(2ε²), then Pr[|Z − n| ≥ εn] ≤ 1/3
  – By Chebyshev: Pr[|Z − n| ≥ εn] ≤ Var[Z] / (εn)² ≤ 1/(2tε²)
  – If t ≥ 3/(2ε²) we can make this at most 1/3

Page 66: Sublinear Algorithms for Big Data

Morris’s Algorithm: Final

• What if I want the probability of error to be really small, i.e. Pr[|Z − n| ≥ εn] ≤ δ?
• Same Chebyshev-based analysis: t = O(1/(ε²δ))
• Better: do these steps O(log(1/δ)) times independently in parallel and output the median answer.
• Total space: O(ε⁻² · log(1/δ) · log log n) bits

Page 67: Sublinear Algorithms for Big Data

Morris’s Algorithm: Final

• Do these steps m = O(log(1/δ)) times independently in parallel and output the median answer Z*:

Maintains t counters X_1, …, X_t using O(log log n) bits for each
• Initialize all to 0; when an item arrives, increase each X_i by 1 independently with probability 2^(−X_i)
• Output Z = (1/t) · Σ_i 2^(X_i) − 1

Page 68: Sublinear Algorithms for Big Data

Morris’s Algorithm: Final Analysis

Claim: Pr[|Z* − n| ≥ εn] ≤ δ
• Let Y_i be an indicator r.v. for the event that |Z_i − n| ≤ εn, where Z_i is the i-th trial; by the beta-version analysis, Pr[Y_i = 1] ≥ 2/3.
• Let Y = Σ_i Y_i. The median Z* fails only if Y ≤ m/2; by a Chernoff bound this happens with probability exp(−Ω(m)) ≤ δ for m = O(log(1/δ)).
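Putting the whole construction together, here is a simulation sketch (the parameter choices t = 30 and m = 9 are mine, for illustration only; a real streaming implementation would run all t · m counters in parallel over a single stream rather than replaying it):

```python
import random
import statistics

random.seed(3)   # fixed seed for a reproducible run

def morris_basic(n_items):
    # alpha-version counter: increment with probability 2^(-X), output 2^X - 1
    x = 0
    for _ in range(n_items):
        if random.random() < 2.0 ** -x:
            x += 1
    return 2 ** x - 1

def morris_final(n_items, t, m):
    # Average t independent counters (the Chebyshev step), repeat m times
    # independently and take the median (the Chernoff step).
    averages = [sum(morris_basic(n_items) for _ in range(t)) / t
                for _ in range(m)]
    return statistics.median(averages)

estimate = morris_final(1000, t=30, m=9)
print(estimate)   # an approximate count of the 1000 stream items
```

Averaging tames the n²/2 variance of a single counter, and the median of independent repetitions turns the resulting constant success probability into a high-probability guarantee, exactly as in the analysis above.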

Page 69: Sublinear Algorithms for Big Data

Thank you!

• Questions?
• Next time:
  – More streaming algorithms
  – Testing distributions