Estimating the Unseen: Sublinear Statistics Paul Valiant


Page 1: Estimating the Unseen: Sublinear Statistics

Estimating the Unseen: Sublinear Statistics

Paul Valiant

Page 2

Fisher’s Butterflies

Turing’s Enigma Codewords

How many new species if I observe for another period?

Probability mass of unseen codewords

[Figure: Corbet’s Butterfly Data – bar chart of the fingerprint F_1 through F_15 (x-axis 1–15, y-axis 0–120), with alternating + and − signs over consecutive bars]

New species in another period: F_1 − F_2 + F_3 − F_4 + F_5 − …

Unseen probability mass: F_1/(number of samples)

(“Fingerprint”)
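As a concrete sketch (not from the slides; the sample data below is made up), the fingerprint and the two estimates above can be computed as:

```python
from collections import Counter

def fingerprint(samples):
    """F[j] = number of distinct elements seen exactly j times."""
    counts = Counter(samples)          # element -> multiplicity
    return Counter(counts.values())    # multiplicity j -> F_j

def unseen_mass(samples):
    """Good-Turing estimate of the unseen probability mass: F_1 / k."""
    F = fingerprint(samples)
    return F[1] / len(samples)

def new_species(samples):
    """Estimate of new species in an equal second period: F_1 - F_2 + F_3 - ..."""
    F = fingerprint(samples)
    return sum((-1) ** (j + 1) * F[j] for j in sorted(F))

samples = ["a", "a", "b", "c", "c", "c", "d"]
# F_1 = 2 (b, d), F_2 = 1 (a), F_3 = 1 (c)
print(unseen_mass(samples))   # -> 2/7
print(new_species(samples))   # -> 2 - 1 + 1 = 2
```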

Page 3

Characteristic Functions

For an element with probability p_i:

Unseen-mass term: Pr[not seen]·p_i = e^{−k·p_i}·p_i = poi(k·p_i, 1)/k; summed over elements, this is estimated by F_1/(number of samples).

New-species term: Pr[not seen in first period, but seen in second period] = e^{−k·p_i}·(1 − e^{−k·p_i}) = poi(k·p_i, 1) − poi(k·p_i, 2) + poi(k·p_i, 3) − poi(k·p_i, 4) + …; summed over elements, this is estimated by F_1 − F_2 + F_3 − F_4 + F_5 − …

where poi(x, i) ≜ e^{−x}·x^i / i!
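A quick numeric sanity check of the two per-element identities above, for an arbitrary illustrative choice of k and p_i:

```python
import math

def poi(x, i):
    """poi(x, i) = e^{-x} x^i / i!"""
    return math.exp(-x) * x ** i / math.factorial(i)

k, p = 1000, 0.002   # illustrative sample size and element probability
x = k * p

# unseen-mass term: e^{-kp} * p = poi(kp, 1) / k
assert math.isclose(math.exp(-x) * p, poi(x, 1) / k)

# new-species term: Pr[unseen in first period, seen in an equal second period]
#   = e^{-kp} (1 - e^{-kp}) = poi(kp,1) - poi(kp,2) + poi(kp,3) - ...
alternating = sum((-1) ** (j + 1) * poi(x, j) for j in range(1, 60))
assert math.isclose(math.exp(-x) * (1 - math.exp(-x)), alternating)
print("per-element identities verified")
```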

Page 4

Other Properties?

Entropy: −∑_i p_i log p_i

Support size: step function

Approximate as functions of p_i: log p_i (for entropy) and 1/p_i (for support size, a step function at 0)

Accurate to O(1) when k·p_i = Ω(1), i.e., given linearly many samples

Exponentially hard to approximate below 1/k

Easier case? The L2 norm ∑_i p_i²:

x² = ∑_{j≥0} 2!·C(j,2)·poi(x, j)
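This identity can be checked numerically; note that 2!·C(j,2) = j(j−1), so the sum is the second factorial moment of a Poisson(x) variable:

```python
import math

def poi(x, j):
    """poi(x, j) = e^{-x} x^j / j!"""
    return math.exp(-x) * x ** j / math.factorial(j)

# 2! * C(j,2) = j*(j-1), so the sum is E[J(J-1)] for J ~ Poisson(x),
# the second factorial moment, which equals x^2.
for x in (0.5, 1.0, 3.7):
    total = sum(j * (j - 1) * poi(x, j) for j in range(2, 60))
    assert math.isclose(total, x ** 2)
print("x^2 identity verified")
```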

Page 5

L2 approximation

∑_j 2!·C(j,2)·F_j

Works very well if we have a bound on the j’s encountered
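A hedged sketch of the resulting estimator: dividing ∑_j 2!·C(j,2)·F_j by k(k−1) (rather than the Poissonized k²) makes it exactly unbiased for ∑_i p_i² under k i.i.d. samples. The test distribution and parameters below are illustrative:

```python
import random
from collections import Counter

def l2_norm_sq_estimate(samples):
    """Estimate sum_i p_i^2 as sum_j 2! C(j,2) F_j / (k(k-1)), which is
    unbiased for k i.i.d. samples, since E[sum_i c_i(c_i-1)] = k(k-1) sum_i p_i^2."""
    k = len(samples)
    F = Counter(Counter(samples).values())   # fingerprint: j -> F_j
    return sum(j * (j - 1) * F[j] for j in F) / (k * (k - 1))

random.seed(0)
n, k = 100, 5000
samples = [random.randrange(n) for _ in range(k)]  # uniform: true value is 1/n
est = l2_norm_sq_estimate(samples)
print(est, "vs true", 1 / n)
```

Subtracting 1/m from such an estimate gives the squared L2 distance to the uniform distribution over m elements.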

L2 distance related to L1: ‖p − q‖_1 ≤ √n·‖p − q‖_2 (Cauchy–Schwarz)

Yields 1-sided testers for L1 distance; also for L1 distance to uniform, and for L1 distance to an arbitrary known distribution

[Batu, Fortnow, Rubinfeld, Smith, White, ‘00]

Page 6

Are good testers computationally trivial?

Page 7

Maximum Likelihood Distributions

[Orlitsky et al., Science, etc]

Page 8

Relaxing the Problem

Given {F_j}, find a distribution p such that the expected fingerprint of k samples from p approximates {F_j}

By concentration bounds, the “right” distribution should also satisfy this, i.e., lie in the feasible region of the linear program

Yields: O(n/log n)-sample estimators for entropy, support size, L1 distance, and anything similar

Does the extra computational power help??
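A minimal sketch of the relaxed linear program, assuming scipy and numpy are available; the probability grid, the slack formulation, and the toy data are all choices made here for illustration, not details from the slides:

```python
import math
import random
from collections import Counter

import numpy as np
from scipy.optimize import linprog

def poi(x, j):
    # poi(x, j) = e^{-x} x^j / j!, computed in log-space for stability
    return math.exp(-x + j * math.log(x) - math.lgamma(j + 1))

def unseen_histogram(F_obs, k, max_j=15):
    """Relaxed-LP sketch: find a histogram h >= 0 over a geometric grid of
    probabilities whose expected fingerprints (under Poissonization with
    sample size k) are close to the observed F_j, minimizing total slack."""
    grid = [2.0 ** (-e) for e in range(1, 25)]   # illustrative probability grid
    T, J = len(grid), max_j
    A = np.array([[poi(k * p, j) for p in grid] for j in range(1, J + 1)])
    F = np.array([F_obs.get(j, 0) for j in range(1, J + 1)], dtype=float)
    # variables: [h, s_plus, s_minus]; constraints: A h + s_plus - s_minus = F
    c = np.concatenate([np.zeros(T), np.ones(2 * J)])   # minimize total slack
    A_eq = np.hstack([A, np.eye(J), -np.eye(J)])
    mass = np.concatenate([np.array(grid), np.zeros(2 * J)])[None, :]
    res = linprog(c, A_ub=mass, b_ub=[1.0], A_eq=A_eq, b_eq=F,
                  bounds=(0, None), method="highs")   # total mass <= 1
    return np.array(grid), res.x[:T], res

# toy demo: 5000 samples from the uniform distribution over 1024 elements
random.seed(1)
n, k = 1024, 5000
counts = Counter(random.randrange(n) for _ in range(k))
F_obs = Counter(counts.values())
grid, h, res = unseen_histogram(F_obs, k)
print("LP solved:", res.success, "| support-size estimate:", round(h.sum()))
```

Given such a histogram h, plug-in estimates of properties follow, e.g. support size as ∑_t h_t or entropy as −∑_t h_t·grid[t]·log grid[t].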

Page 9

Lower Bounds

Find not-large {c_i} that minimize

DUAL

Find distributions y⁺, y⁻ that maximize the difference in property value while the difference in expected fingerprints is small

“Find distributions with very different property values, but almost identical fingerprint expectations”

NEEDS Theorem: close expected fingerprints ⇒ indistinguishable [Raskhodnikova, Ron, Shpilka, Smith ’07]

Page 10

“Roos’s Theorem”: Generalized Multinomial Distributions

Definition: a distribution expressible as a sum ∑_i Z_i of independent random vectors, where each Z_i ∈ { 0, (1,0,0,0,…), (0,1,0,0,…), (0,0,1,0,…), … }

Includes fingerprint distributions. Also: binomial distributions, multinomial distributions, and any sums of such distributions.

“Generalized Multinomial Distributions” appear all over CS, and characterizing them is central to many papers (for example, Daskalakis and Papadimitriou, Discretized multinomial distributions and Nash equilibria in anonymous games, FOCS 2008.)

Comment:

Thm: if there are suitable bounds on the component probabilities, then the distribution is multivariate Poisson to within the stated error

Page 11

Distributions of Rare Elements

The distribution of fingerprints is approximately multivariate Poisson – provided every element is rare, even in k samples

Yields the best known lower bounds for non-trivial 1-sided testing problems: Ω(n^{2/3}) for L1 distance, Ω(n^{2/3}·m^{1/3}) for “independence”

Note: it is impossible to confuse a count > log n with one of o(1). Can we cut off above log n? This suggests these lower bounds are tight to within log n factors.

Can we do better?

Page 12

A Better Central Limit Theorem (?)

Roos’s Theorem: Fingerprints are like Poissons

(provided…)

Poissons: 1-parameter family

Gaussians: 2-parameter family

New CLT: Fingerprints are like Gaussians

(provided variance is high enough in every direction)

How to ensure high variance? “Fatten” distributions by adding elements at many different probabilities; this trick can’t be used for 1-sided bounds

Page 13

Results

Additive estimates of entropy, support size, L1 distance:

2-approximation of L1 distance to U_m:

All testers are linear expressions in the fingerprint

Page 14

Duality

Find not-large {c_i} that minimize

DUAL

Find distributions y⁺, y⁻ that maximize the difference in property value while the difference in expected fingerprints is small

that minimize

≤ k^{−d}

Yields an estimator when d < ½

Yields a lower bound when d > ½

“When …, the optimum is log-convex”

Theorem: For a linear symmetric property π, and ε > 0, c > ½: if all p⁺, p⁻ of support ≤ n whose property values differ by more than ε are distinguishable w.p. > c via k samples, then there exists a linear estimator with error O(ε) using (1+o(1))·k samples, succeeding w.p. 1 − o(1/poly(k))

Page 15

Open Problems

Dependence on ε (resolved for entropy)

Beyond additive estimates – “case-by-case optimal”?

We suspect linear programming is better than linear estimators

Leveraging these results for non-symmetric properties

Monotonicity, with respect to different posets

Practical applications!