Estimating the Unseen: Sublinear Statistics Paul Valiant


Page 1: Estimating the Unseen: Sublinear Statistics

Estimating the Unseen: Sublinear Statistics

Paul Valiant

Page 2

Fisher’s Butterflies

Turing’s Enigma Codewords

How many new species if I observe for another period?

Probability mass of unseen codewords

[Figure: Corbet’s Butterfly Data – bar chart of the fingerprint F_1 through F_15 (x-axis 1–15, y-axis 0–120), with alternating + and − signs over consecutive bars]

New species in another period: F_1 − F_2 + F_3 − F_4 + F_5 − …

Unseen probability mass: F_1/(number of samples)

(“Fingerprint”)
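As a concrete sketch (not from the slides; the sample data below is made up), the fingerprint and the two estimates above can be computed as:

```python
from collections import Counter

def fingerprint(samples):
    """F[j] = number of distinct elements seen exactly j times."""
    counts = Counter(samples)          # element -> multiplicity
    return Counter(counts.values())    # multiplicity j -> F_j

def unseen_mass(samples):
    """Good-Turing estimate of the unseen probability mass: F_1 / k."""
    F = fingerprint(samples)
    return F[1] / len(samples)

def new_species(samples):
    """Estimate of new species in an equal second period: F_1 - F_2 + F_3 - ..."""
    F = fingerprint(samples)
    return sum((-1) ** (j + 1) * F[j] for j in sorted(F))

samples = ["a", "a", "b", "c", "c", "c", "d"]
# F_1 = 2 (b, d), F_2 = 1 (a), F_3 = 1 (c)
print(unseen_mass(samples))   # -> 2/7
print(new_species(samples))   # -> 2 - 1 + 1 = 2
```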

Page 3

Characteristic Functions

For an element with probability p_i:

Unseen-mass term: Pr[not seen]·p_i = e^{−k·p_i}·p_i = poi(k·p_i, 1)/k; summed over elements, this is estimated by F_1/(number of samples).

New-species term: Pr[not seen in first period, but seen in second period] = e^{−k·p_i}·(1 − e^{−k·p_i}) = poi(k·p_i, 1) − poi(k·p_i, 2) + poi(k·p_i, 3) − poi(k·p_i, 4) + …; summed over elements, this is estimated by F_1 − F_2 + F_3 − F_4 + F_5 − …

where poi(x, i) ≜ e^{−x}·x^i / i!
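A quick numeric sanity check of the two per-element identities above, for an arbitrary illustrative choice of k and p_i:

```python
import math

def poi(x, i):
    """poi(x, i) = e^{-x} x^i / i!"""
    return math.exp(-x) * x ** i / math.factorial(i)

k, p = 1000, 0.002   # illustrative sample size and element probability
x = k * p

# unseen-mass term: e^{-kp} * p = poi(kp, 1) / k
assert math.isclose(math.exp(-x) * p, poi(x, 1) / k)

# new-species term: Pr[unseen in first period, seen in an equal second period]
#   = e^{-kp} (1 - e^{-kp}) = poi(kp,1) - poi(kp,2) + poi(kp,3) - ...
alternating = sum((-1) ** (j + 1) * poi(x, j) for j in range(1, 60))
assert math.isclose(math.exp(-x) * (1 - math.exp(-x)), alternating)
print("per-element identities verified")
```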

Page 4

Other Properties?

Entropy: −∑_i p_i log p_i

Support size: step function

Approximate as functions of p_i: log p_i (for entropy) and 1/p_i (for support size, a step function at 0)

Accurate to O(1) when k·p_i = Ω(1), i.e., given linearly many samples

Exponentially hard to approximate below 1/k

Easier case? The L2 norm ∑_i p_i²:

x² = ∑_{j≥0} 2!·C(j,2)·poi(x, j)
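This identity can be checked numerically; note that 2!·C(j,2) = j(j−1), so the sum is the second factorial moment of a Poisson(x) variable:

```python
import math

def poi(x, j):
    """poi(x, j) = e^{-x} x^j / j!"""
    return math.exp(-x) * x ** j / math.factorial(j)

# 2! * C(j,2) = j*(j-1), so the sum is E[J(J-1)] for J ~ Poisson(x),
# the second factorial moment, which equals x^2.
for x in (0.5, 1.0, 3.7):
    total = sum(j * (j - 1) * poi(x, j) for j in range(2, 60))
    assert math.isclose(total, x ** 2)
print("x^2 identity verified")
```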

Page 5

L2 approximation

∑_j 2!·C(j,2)·F_j

Works very well if we have a bound on the j’s encountered
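A hedged sketch of the resulting estimator: dividing ∑_j 2!·C(j,2)·F_j by k(k−1) (rather than the Poissonized k²) makes it exactly unbiased for ∑_i p_i² under k i.i.d. samples. The test distribution and parameters below are illustrative:

```python
import random
from collections import Counter

def l2_norm_sq_estimate(samples):
    """Estimate sum_i p_i^2 as sum_j 2! C(j,2) F_j / (k(k-1)), which is
    unbiased for k i.i.d. samples, since E[sum_i c_i(c_i-1)] = k(k-1) sum_i p_i^2."""
    k = len(samples)
    F = Counter(Counter(samples).values())   # fingerprint: j -> F_j
    return sum(j * (j - 1) * F[j] for j in F) / (k * (k - 1))

random.seed(0)
n, k = 100, 5000
samples = [random.randrange(n) for _ in range(k)]  # uniform: true value is 1/n
est = l2_norm_sq_estimate(samples)
print(est, "vs true", 1 / n)
```

Subtracting 1/m from such an estimate gives the squared L2 distance to the uniform distribution over m elements.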

L2 distance related to L1: ‖p − q‖_1 ≤ √n·‖p − q‖_2 (Cauchy–Schwarz)

Yields 1-sided testers for L1 distance; also for L1 distance to uniform, and for L1 distance to an arbitrary known distribution

[Batu, Fortnow, Rubinfeld, Smith, White, ‘00]

Page 6

Are good testers computationally trivial?

Page 7

Maximum Likelihood Distributions

[Orlitsky et al., Science, etc]

Page 8

Relaxing the Problem

Given {F_j}, find a distribution p such that the expected fingerprint of k samples from p approximates {F_j}

By concentration bounds, the “right” distribution should also satisfy this, i.e., lie in the feasible region of the linear program

Yields: O(n/log n)-sample estimators for entropy, support size, L1 distance, and anything similar

Does the extra computational power help??
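A minimal sketch of the relaxed linear program, assuming scipy and numpy are available; the probability grid, the slack formulation, and the toy data are all choices made here for illustration, not details from the slides:

```python
import math
import random
from collections import Counter

import numpy as np
from scipy.optimize import linprog

def poi(x, j):
    # poi(x, j) = e^{-x} x^j / j!, computed in log-space for stability
    return math.exp(-x + j * math.log(x) - math.lgamma(j + 1))

def unseen_histogram(F_obs, k, max_j=15):
    """Relaxed-LP sketch: find a histogram h >= 0 over a geometric grid of
    probabilities whose expected fingerprints (under Poissonization with
    sample size k) are close to the observed F_j, minimizing total slack."""
    grid = [2.0 ** (-e) for e in range(1, 25)]   # illustrative probability grid
    T, J = len(grid), max_j
    A = np.array([[poi(k * p, j) for p in grid] for j in range(1, J + 1)])
    F = np.array([F_obs.get(j, 0) for j in range(1, J + 1)], dtype=float)
    # variables: [h, s_plus, s_minus]; constraints: A h + s_plus - s_minus = F
    c = np.concatenate([np.zeros(T), np.ones(2 * J)])   # minimize total slack
    A_eq = np.hstack([A, np.eye(J), -np.eye(J)])
    mass = np.concatenate([np.array(grid), np.zeros(2 * J)])[None, :]
    res = linprog(c, A_ub=mass, b_ub=[1.0], A_eq=A_eq, b_eq=F,
                  bounds=(0, None), method="highs")   # total mass <= 1
    return np.array(grid), res.x[:T], res

# toy demo: 5000 samples from the uniform distribution over 1024 elements
random.seed(1)
n, k = 1024, 5000
counts = Counter(random.randrange(n) for _ in range(k))
F_obs = Counter(counts.values())
grid, h, res = unseen_histogram(F_obs, k)
print("LP solved:", res.success, "| support-size estimate:", round(h.sum()))
```

Given such a histogram h, plug-in estimates of properties follow, e.g. support size as ∑_t h_t or entropy as −∑_t h_t·grid[t]·log grid[t].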

Page 9

Lower Bounds

Find not-large {c_i} that minimize

DUAL

Find distributions y⁺, y⁻ that maximize the difference in property value while the difference in expected fingerprints is small

“Find distributions with very different property values, but almost identical fingerprint expectations”

NEEDS Theorem: close expected fingerprints ⇒ indistinguishable [Raskhodnikova, Ron, Shpilka, Smith ’07]

Page 10

“Roos’s Theorem”: Generalized Multinomial Distributions

Definition: a distribution expressible as a sum ∑_i Z_i of independent random vectors, where each Z_i ∈ { 0, (1,0,0,0,…), (0,1,0,0,…), (0,0,1,0,…), … }

Includes fingerprint distributions. Also: binomial distributions, multinomial distributions, and any sums of such distributions.

“Generalized Multinomial Distributions” appear all over CS, and characterizing them is central to many papers (for example, Daskalakis and Papadimitriou, Discretized multinomial distributions and Nash equilibria in anonymous games, FOCS 2008.)

Comment:

Thm: if there are suitable bounds on the component probabilities, then the distribution is multivariate Poisson to within the stated error

Page 11

Distributions of Rare Elements

The distribution of fingerprints is approximately multivariate Poisson – provided every element is rare, even in k samples

Yields the best known lower bounds for non-trivial 1-sided testing problems: Ω(n^{2/3}) for L1 distance, Ω(n^{2/3}·m^{1/3}) for “independence”

Note: it is impossible to confuse a count > log n with one of o(1). Can we cut off above log n? This suggests these lower bounds are tight to within log n factors.

Can we do better?

Page 12

A Better Central Limit Theorem (?)

Roos’s Theorem: Fingerprints are like Poissons

(provided…)

Poissons: 1-parameter family

Gaussians: 2-parameter family

New CLT: Fingerprints are like Gaussians

(provided variance is high enough in every direction)

How to ensure high variance? “Fatten” distributions by adding elements at many different probabilities; this trick can’t be used for 1-sided bounds

Page 13

Results

Additive estimates of entropy, support size, L1 distance:

2-approximation of L1 distance to U_m:

All testers are linear expressions in the fingerprint

Page 14

Duality

Find not-large {c_i} that minimize

DUAL

Find distributions y⁺, y⁻ that maximize the difference in property value while the difference in expected fingerprints is small

that minimize

≤ k^{−d}

Yields an estimator when d < ½

Yields a lower bound when d > ½

“When …, the optimum is log-convex”

Theorem: For a linear symmetric property π, and ε > 0, c > ½: if all p⁺, p⁻ of support ≤ n whose property values differ by more than ε are distinguishable w.p. > c via k samples, then there exists a linear estimator with error O(ε) using (1+o(1))·k samples, succeeding w.p. 1 − o(1/poly(k))

Page 15

Open Problems

Dependence on ε (resolved for entropy)

Beyond additive estimates – “case-by-case optimal”?

We suspect linear programming is better than linear estimators

Leveraging these results for non-symmetric properties

Monotonicity, with respect to different posets

Practical applications!