CS 109 Lecture 19
May 9th, 2016
[Figure: Midterm Distribution histogram of scores. E[X] = 87, √Var(X) = 15, median = 90]
[Figure: Midterm Distribution histogram with grade cutoffs marked (A+, A, A-, B+, B, B-, Cs). E[X] = 87, √Var(X) = 15, median = 90]
[Figure: Midterm Distribution histogram split into "Core Understanding" and "Advanced Understanding" regions. E[X] = 87, √Var(X) = 15, median = 90]
[Figure: Midterm Beta: score histogram overlaid with a fitted (scaled) Beta distribution, α = 7.1, β = 2.7. E[X] = 87, √Var(X) = 15, median = 90]
[Figure: Midterm Beta, two panels. Left: CS109 Midterm CDF, which you can interpret as a percentile function. Right: CS109 Midterm PDF, the derivative of the percentile per point.]
[Figure: score histograms for Problems 1 through 6 of the midterm.]
Midterm Question Correlations
[Figure: correlation graph over sub-questions 1.a through 6.c; edge weight encodes correlation strength (legend: 0.60, 0.40, 0.22).]
*Correlations below 0.22 are not shown. **Almost all correlations were positive.
[Figure: Joint PMF for Questions 6.a and 6.c, with axes "Points on 6.a" and "Points on 6.c" and fitted line y = 0.46x + 2.6.]
[Figure: Joint PMF for Questions 4 and 6, with axes "Total Points on 6" and "Total Points on 4" and fitted line y = 0.2x + 11.]
Conditional Expectation
• Let X be your score on problem 4
• Let Y be your score on problem 6
• What is E[X | Y]?

[Figure: E[X | Y = y] plotted against y, with standard-error bars; standard error is used because each expectation is estimated from a sample.]
Four Prototypical Trajectories
1. Inequalities
• In many cases, we don't know the true form of a probability distribution
  § E.g., midterm scores
  § But we know the mean
  § May also have other measures/properties
    o Variance
    o Non-negativity
    o Etc.
  § Inequalities and bounds still allow us to say something about the probability distribution in such cases
    o May be imprecise compared to knowing the true distribution!
Inequality, Probability and Joviality
• Say X is a non-negative random variable, then:

  P(X ≥ a) ≤ E[X]/a, for all a > 0

• Proof:
  § Let I = 1 if X ≥ a, 0 otherwise
  § Since X ≥ 0, I ≤ X/a
  § Taking expectations:

  E[I] = P(X ≥ a) ≤ E[X/a] = E[X]/a
Markov’s Inequality
• Andrey Andreyevich Markov (1856-1922) was a Russian mathematician
  § Markov's Inequality is named after him
  § He also invented Markov Chains…
    o …which are the basis for Google's PageRank algorithm
Andrey Markov
• Statistics from a previous quarter's CS109 midterm
  § X = midterm score
  § Using sample mean: X̄ = 86.7 ≈ E[X]
  § What is P(X ≥ 100)?

  P(X ≥ 100) ≤ E[X]/100 = 86.7/100 ≈ 0.867

  § Markov bound: ≤ 86.7% of the class scored 100 or greater
  § In fact, 20.1% of the class scored 100 or greater
    o The Markov inequality can be a very loose bound
    o But it made no assumption at all about the form of the distribution!
Markov and the Midterm
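A quick way to see how loose the Markov bound can be is to compare it with the empirical tail. A minimal Python sketch, using synthetic non-negative "scores" (the Gamma distribution is an arbitrary stand-in; the real exam data isn't reproduced here):

```python
# Sketch: Markov's bound P(X >= a) <= E[X]/a vs. the empirical tail.
# The Gamma-distributed "scores" are synthetic stand-ins, not real exam data.
import numpy as np

rng = np.random.default_rng(109)
scores = rng.gamma(shape=9.0, scale=10.0, size=100_000)  # non-negative, mean ~90

a = 100
markov_bound = scores.mean() / a          # E[X]/a
empirical = (scores >= a).mean()          # fraction with X >= a
print(f"Markov bound: {markov_bound:.3f}   empirical P(X >= {a}): {empirical:.3f}")
```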
• X is a random variable with E[X] = µ, Var(X) = σ², then:

  P(|X − µ| ≥ k) ≤ σ²/k², for all k > 0

• Proof:
  § Since (X − µ)² is a non-negative random variable, apply Markov's Inequality with a = k²:

  P((X − µ)² ≥ k²) ≤ E[(X − µ)²]/k² = σ²/k²

  § Note that (X − µ)² ≥ k² ⇔ |X − µ| ≥ k, yielding:

  P(|X − µ| ≥ k) ≤ σ²/k²
Chebyshev’s Inequality
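As a sanity check, here is a minimal sketch comparing Chebyshev's bound with the empirical two-sided tail; the exponential distribution is an arbitrary choice (any distribution with known µ and σ² works):

```python
# Sketch: Chebyshev's bound P(|X - mu| >= k) <= sigma^2/k^2 vs. the empirical tail.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)  # Exp(1): mu = 1, sigma^2 = 1
mu, var = 1.0, 1.0

for k in (1.5, 2.0, 3.0):
    bound = var / k**2
    empirical = (np.abs(x - mu) >= k).mean()
    print(f"k = {k}: bound {bound:.3f} >= empirical {empirical:.3f}")
```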
• Pafnuty Lvovich Chebyshev (1821-1894) was also a Russian mathematician
  § Chebyshev's Inequality is named after him
    o But it was actually formulated by his colleague Irénée-Jules Bienaymé
  § He was Markov's doctoral advisor
    o And is sometimes credited with first deriving Markov's Inequality
  § There is a crater on the moon named in his honor
Pafnuty Chebyshev
Four Prototypical Trajectories
2. Law of Large Numbers
• Consider I.I.D. random variables X1, X2, ...
  § Xi have distribution F with E[Xi] = µ and Var(Xi) = σ²
  § Let X̄ = (1/n) ∑_{i=1}^{n} Xi
  § For any ε > 0:

  P(|X̄ − µ| ≥ ε) → 0 as n → ∞

• Proof:
  § E[X̄] = E[(X1 + X2 + ... + Xn)/n] = µ
  § Var(X̄) = Var((X1 + X2 + ... + Xn)/n) = σ²/n
  § By Chebyshev's inequality, P(|X̄ − µ| ≥ k) ≤ Var(X̄)/k², so:

  P(|X̄ − µ| ≥ ε) ≤ σ²/(nε²) → 0 as n → ∞

Weak Law of Large Numbers
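A minimal simulation of the Weak Law, assuming Uniform(0, 1) draws (µ = 0.5, σ² = 1/12): the sample mean concentrates around µ, and the Chebyshev bound from the proof shrinks toward 0:

```python
# Sketch: sample means of Uniform(0,1) draws concentrating around mu = 0.5,
# alongside the Chebyshev bound sigma^2/(n * eps^2) used in the proof above.
import numpy as np

rng = np.random.default_rng(1)
mu, var, eps = 0.5, 1 / 12, 0.01

for n in (100, 10_000, 1_000_000):
    xbar = rng.uniform(size=n).mean()
    bound = var / (n * eps**2)
    print(f"n = {n:>9,}: |Xbar - mu| = {abs(xbar - mu):.5f}, bound = {bound:.4f}")
```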
• Consider I.I.D. random variables X1, X2, ...
  § Xi have distribution F with E[Xi] = µ
  § Let X̄ = (1/n) ∑_{i=1}^{n} Xi
  § Strong Law of Large Numbers:

  P( lim_{n→∞} (X1 + X2 + ... + Xn)/n = µ ) = 1

  § Strong Law ⇒ Weak Law, but not vice versa
  § The Strong Law implies that (with probability 1) for any ε > 0, there are only finitely many values of n for which the condition of the Weak Law, |X̄ − µ| ≥ ε, holds

Strong Law of Large Numbers
• Say we have repeated trials of an experiment
  § Let event E = some outcome of the experiment
  § Let Xi = 1 if E occurs on trial i, 0 otherwise
  § The Strong Law of Large Numbers (Strong LLN) yields:

  (X1 + X2 + ... + Xn)/n → E[Xi] = P(E)

  § Recall from the first week of class:

  P(E) = lim_{n→∞} n(E)/n

  § Strong LLN justifies the "frequency" notion of probability
  § Misconception arising from LLN:
    o Gambler's fallacy: "I'm due for a win"
    o Consider being "due for a win" with repeated coin flips...

Intuitions and Misconceptions of LLN
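A minimal coin-flip sketch of both points: the running frequency of heads converges to P(heads) = 0.5 (the frequency notion), yet a streak of tails does not make heads any more likely on the next flip (no one is ever "due"):

```python
# Sketch: Strong LLN's frequency notion, plus a check that a streak of tails
# does not change the next flip's chances (the gambler's fallacy).
import numpy as np

rng = np.random.default_rng(2)
flips = rng.integers(0, 2, size=1_000_000)  # 1 = heads, fair coin

freq = flips.cumsum() / np.arange(1, flips.size + 1)
print(f"running frequency of heads after 10^6 flips: {freq[-1]:.4f}")

# P(heads | previous 3 flips were all tails) -- still ~0.5
prev_tails = (flips[:-3] == 0) & (flips[1:-2] == 0) & (flips[2:-1] == 0)
print(f"P(heads | 3 tails in a row) = {flips[3:][prev_tails].mean():.4f}")
```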
• History of the Law of Large Numbers
  § 1713: Weak LLN described by Jacob Bernoulli
  § 1835: Poisson calls it "La Loi des Grands Nombres"
    o That would be "Law of Large Numbers" in French
  § 1909: Émile Borel develops the Strong LLN for Bernoulli random variables
  § 1928: Andrei Nikolaevich Kolmogorov proves the Strong LLN in the general case
La Loi des Grands Nombres
And now a moment of silence...
...before we present...
...the greatest result of probability theory!
Silence!!
Four Prototypical Trajectories
3. Central Limit Theorem
• Consider I.I.D. random variables X1, X2, ...
  § Xi have distribution F with E[Xi] = µ and Var(Xi) = σ²
  § Let X̄ = (1/n) ∑_{i=1}^{n} Xi
  § Central Limit Theorem:

  X̄ ~ N(µ, σ²/n) as n → ∞

The Central Limit Theorem
http://onlinestatbook.com/stat_sim/sampling_dist/
The Central Limit Theorem
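The linked demo can be approximated in a few lines. A minimal sketch, assuming a deliberately skewed Exp(1) population (µ = 1, σ² = 1): the distribution of the sample mean matches the CLT prediction N(µ, σ²/n):

```python
# Sketch: sampling distribution of the mean of a skewed population,
# compared against the CLT prediction Xbar ~ N(mu, sigma^2/n).
import numpy as np

rng = np.random.default_rng(3)
n, reps = 30, 50_000
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

print(f"mean of sample means: {means.mean():.3f}   (CLT: 1.000)")
print(f"var  of sample means: {means.var():.4f}  (CLT: {1/n:.4f})")
```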
• Consider I.I.D. random variables X1, X2, ...
  § Xi have distribution F with E[Xi] = µ and Var(Xi) = σ²
  § Let X̄ = (1/n) ∑_{i=1}^{n} Xi, so X̄ ~ N(µ, σ²/n) as n → ∞
  § Recall Z = (X̄ − µ)/√(σ²/n), where Z ~ N(0, 1)
  § Rewriting Z in terms of the sum:

  Z = (X̄ − µ)/(σ/√n) = [(1/n) ∑_{i=1}^{n} Xi − µ] · (√n/σ) = (∑_{i=1}^{n} Xi − nµ)/(σ√n)

  § This gives another form of the CLT:

  (X1 + X2 + ... + Xn − nµ)/(σ√n) → N(0, 1) as n → ∞

Another form of the Central Limit Theorem
1733: Abraham de Moivre
• History of the Central Limit Theorem
  § 1733: CLT for X ~ Ber(1/2) postulated by Abraham de Moivre
  § 1823: Pierre-Simon Laplace extends de Moivre's work to approximating Bin(n, p) with the Normal
  § 1901: Aleksandr Lyapunov provides a precise definition and rigorous proof of the CLT
  § 2003: Charlie Sheen stars in the television series "Two and a Half Men"
    o By the end of the 7th season, there were 161 episodes
    o The mean quality of subsamples of episodes is Normally distributed (thanks to the Central Limit Theorem)
Once Upon a Time…
• The CLT is why many things in the "real world" appear Normally distributed
  § Many quantities are sums of independent variables
  § Exam scores
    o Sum of individual problems (e.g., on the SAT)
    o Why does the CLT not apply to our midterm?
  § Election polling
    o Ask 100 people if they will vote for candidate X (p1 = # "yes"/100)
    o Repeat with different groups to get p1, ..., pn
    o The pi will have a Normal distribution
    o Can produce a "confidence interval": how likely is it that the estimate of the true p is correct?
Central Limit Theorem in the Real World
• Consider I.I.D. Bernoulli variables X1, X2, ..., each with probability p of success
  § Xi have distribution F with E[Xi] = µ and Var(Xi) = σ²
  § Let X̄ = (1/n) ∑_{i=1}^{n} Xi and Y = nX̄, so:

  X̄ ~ N(µ, σ²/n) as n → ∞
  Y = nX̄ ~ N(nµ, n² · σ²/n) = N(nµ, nσ²)

  § For Bernoulli variables, µ = p and σ² = p(1 − p), so:

  Y ~ N(np, np(1 − p))

Binomial Approximation
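A minimal check of this approximation with SciPy, using arbitrary values n = 100, p = 0.3 and a continuity correction on the tail:

```python
# Sketch: Bin(n, p) tail vs. the Normal approximation N(np, np(1-p)).
from scipy.stats import binom, norm

n, p = 100, 0.3
exact = 1 - binom.cdf(39, n, p)  # P(Y >= 40), exact Binomial
approx = 1 - norm.cdf(39.5, loc=n * p, scale=(n * p * (1 - p)) ** 0.5)
print(f"exact P(Y >= 40) = {exact:.4f}   normal approx = {approx:.4f}")
```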
• Start with 370 midterm scores: X1, X2, ..., X370
  § E[Xi] = 87 and Var(Xi) = 225
  § Created 50 samples of size n = 10; Yi = mean of sample i
  § Prediction by CLT: Yi ~ N(87, 225/10) = N(87, 22.5)
  § Standardizing each sample mean:

  Zi = (Yi − E[Xi])/√(σ²/n) = (Yi − 87)/√22.5

  § Empirically, across the 50 samples:

  Z̄ = (1/50) ∑_{i=1}^{50} Zi = 4×10⁻¹⁶ and Var(Z) = 0.997

[Figure: histogram of the 50 Zi values over the range −3 to 3.]

Your Midterm on the Central Limit Theorem
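The experiment is easy to rerun on synthetic data. A minimal sketch (the actual 370 midterm scores aren't reproduced here, so the scores below are drawn to match the stated mean 87 and variance 225, and subsamples are taken with replacement):

```python
# Sketch: 50 samples of size 10 from synthetic scores with mean 87, var 225;
# standardized sample means should be roughly N(0, 1) per the CLT.
import numpy as np

rng = np.random.default_rng(109)
scores = rng.normal(87, 15, size=370)         # synthetic stand-in scores

n = 10
samples = rng.choice(scores, size=(50, n))    # 50 samples, with replacement
y = samples.mean(axis=1)
z = (y - 87) / np.sqrt(225 / n)               # (Yi - 87) / sqrt(22.5)
print(f"mean(Z) = {z.mean():+.3f}   Var(Z) = {z.var():.3f}   (CLT: 0, 1)")
```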
• Have a new algorithm; want to test its running time
  § Mean (clock) running time: µ = t sec
  § Variance of running time: σ² = 4 sec²
  § Run the algorithm repeatedly (I.I.D. trials), measure time
    o How many trials n so that the estimated time = t ± 0.5 with 95% certainty?
    o Xi = running time of the i-th run (for 1 ≤ i ≤ n)
    o By the Central Limit Theorem, Zn ~ N(0, 1), where:

  Zn = (∑_{i=1}^{n} Xi − nµ)/(σ√n) = (∑_{i=1}^{n} Xi − nt)/(2√n)

• Express the desired certainty in terms of Zn:

  0.95 = P(−0.5 ≤ ∑_{i=1}^{n} Xi/n − t ≤ 0.5)
       = P(−0.5 · √n/2 ≤ (∑_{i=1}^{n} Xi/n − t) · √n/2 ≤ 0.5 · √n/2)
       = P(−√n/4 ≤ (∑_{i=1}^{n} Xi − nt)/(2√n) ≤ √n/4)
       = P(−√n/4 ≤ Zn ≤ √n/4)

• Solve for n using the standard Normal CDF Φ:

  0.95 = Φ(√n/4) − Φ(−√n/4) = Φ(√n/4) − (1 − Φ(√n/4)) = 2Φ(√n/4) − 1
  0.975 = Φ(√n/4)
  √n/4 = Φ⁻¹(0.975) = 1.96 ⇒ n ≈ 61.4, so run 62 trials

Estimating Clock Running Time
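The last few steps amount to inverting the standard Normal CDF. A minimal sketch of that computation with SciPy:

```python
# Sketch: solve 2*Phi(sqrt(n)/4) - 1 = 0.95 for n, as in the derivation above.
from scipy.stats import norm

sigma, half_width, conf = 2.0, 0.5, 0.95
z = norm.ppf((1 + conf) / 2)            # Phi^{-1}(0.975) = 1.96
n = (z * sigma / half_width) ** 2       # from 1.96 = sqrt(n)/4  =>  n = 61.4
print(f"need n >= {n:.1f}, i.e. 62 trials")
```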
William Sealy Gosset (aka Student)
• Have a new algorithm; want to test its running time
  § Mean (clock) running time: µ = t sec
  § Variance of running time: σ² = 4 sec²
  § Run the algorithm repeatedly (I.I.D. trials), measure time
    o How many trials n so that the estimated time = t ± 0.5 with 95% certainty?
    o Xi = running time of the i-th run (for 1 ≤ i ≤ n), and X̄S = (1/n) ∑_{i=1}^{n} Xi
  § What would Chebyshev say?

  P(|X̄S − µS| ≥ k) ≤ σS²/k², where µS = E[X̄S] and σS² = Var(X̄S)

  E[X̄S] = E[(1/n) ∑_{i=1}^{n} Xi] = t
  Var(X̄S) = Var((1/n) ∑_{i=1}^{n} Xi) = σ²/n = 4/n

  P(|∑_{i=1}^{n} Xi/n − t| ≥ 0.5) ≤ (4/n)/(0.5)² = 16/n = 0.05 ⇒ n ≥ 320

  Thanks for playing, Pafnuty!

Estimating Time With Chebyshev
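The same requirement, computed from Chebyshev's bound; the arithmetic is one line and shows how much more pessimistic the bound is than the CLT's 62 trials:

```python
# Sketch: Chebyshev needs 16/n <= 0.05, i.e. n >= 320 trials.
var, eps, alpha = 4.0, 0.5, 0.05
n = var / (eps**2 * alpha)  # (4/n)/0.25 <= 0.05  =>  n >= 320
print(f"Chebyshev: n >= {n:.0f} trials (vs. 62 with the CLT)")
```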
• Number of visitors to a web site per minute: X ~ Poi(100)
  § Server crashes if ≥ 120 requests/minute
  § What is P(crash in next minute)?
  § Exact solution:

  P(X ≥ 120) = ∑_{i=120}^{∞} e^{−100} · 100^i / i! ≈ 0.0282

  § Using the CLT, where Poi(100) = ∑_{i=1}^{n} Poi(100/n) (all I.I.D.)
    o Note: the Normal can be used to approximate the Poisson

  P(X ≥ 120) = P(Y ≥ 119.5) = P((Y − 100)/√100 ≥ (119.5 − 100)/√100) = 1 − Φ(1.95) ≈ 0.0256

Crashing Your Website
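Both numbers above can be reproduced with SciPy. A minimal sketch:

```python
# Sketch: exact Poisson tail vs. the Normal approximation (with continuity
# correction), matching the 0.0282 and 0.0256 above.
from scipy.stats import norm, poisson

lam = 100
exact = 1 - poisson.cdf(119, lam)                         # P(X >= 120)
approx = 1 - norm.cdf(119.5, loc=lam, scale=lam ** 0.5)   # P(Y >= 119.5)
print(f"exact = {exact:.4f}   normal approx = {approx:.4f}")
```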
It’s play time!
• You will roll 10 six-sided dice (X1, X2, …, X10)
  § X = total value of all 10 dice = X1 + X2 + … + X10
  § Win if X ≤ 25 or X ≥ 45
  § Roll!
• And now the truth (according to the CLT)…

Sum of Dice
• You will roll 10 six-sided dice (X1, X2, …, X10)
  § X = total value of all 10 dice = X1 + X2 + … + X10
  § Win if X ≤ 25 or X ≥ 45
• Recall the CLT:

  (X1 + X2 + ... + Xn − nµ)/(σ√n) → N(0, 1) as n → ∞

  § Here µ = E[Xi] = 3.5 and σ² = Var(Xi) = 35/12
  § Determine P(X ≤ 25 or X ≥ 45) using the CLT (continuity correction: treat the discrete cutoffs 25 and 45 as 25.5 and 44.5):

  P(X ≤ 25 or X ≥ 45) = 1 − P(25.5 ≤ X ≤ 44.5)
    = 1 − P((25.5 − 10(3.5))/√(10 · 35/12) ≤ Z ≤ (44.5 − 10(3.5))/√(10 · 35/12))
    ≈ 1 − (2Φ(1.76) − 1) = 2(1 − 0.9608) = 0.0784

Sum of Dice
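A minimal simulation of the game to check the CLT estimate:

```python
# Sketch: simulate 10 dice a million times; compare the empirical win
# probability with the CLT estimate of 0.0784.
import numpy as np

rng = np.random.default_rng(6)
totals = rng.integers(1, 7, size=(1_000_000, 10)).sum(axis=1)
p_win = ((totals <= 25) | (totals >= 45)).mean()
print(f"empirical P(win) = {p_win:.4f}   (CLT estimate: 0.0784)")
```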
I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the "[Central Limit Theorem]". The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshalled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.
- Sir Francis Galton
Wonderful Form of Cosmic Order
[Figure: recap of the midterm Beta fit: score histogram with fitted Beta (α = 7.1, β = 2.7), grade cutoffs (A+ through Cs), and "Core Understanding"/"Advanced Understanding" regions. E[X] = 87, √Var(X) = 15, median = 90]