Estimating the Unseen: Sublinear Statistics


Paul Valiant

Fisher’s Butterflies

Turing’s Enigma Codewords

How many new species will I see if I observe for another period?

Probability mass of unseen codewords

[Figure: Corbet's butterfly data: histogram of the number of species (y-axis, 0 to 120) seen exactly j times (x-axis, j = 1 to 15), with alternating + and − signs marked over successive bars.]

Fisher's estimate of the number of new species in a second period: $F_1 - F_2 + F_3 - F_4 + F_5 - \dots$

Turing's estimate of the probability mass of unseen codewords: $F_1/(\text{number of samples})$

Here $F_j$, the number of domain elements seen exactly $j$ times, is the $j$th entry of the sample's "fingerprint".
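A minimal sketch of both estimates (mine, not from the talk; the power-law "species" distribution and all parameters are illustrative assumptions):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Hypothetical "species" distribution: a power law over n species.
n = 10_000
p = 1.0 / np.arange(1, n + 1)
p /= p.sum()

k = 5_000  # number of samples in one observation period
sample = rng.choice(n, size=k, p=p)

# Fingerprint: F[j] = number of species seen exactly j times.
counts = Counter(sample)
F = Counter(counts.values())

# Turing / Good-Turing: probability mass of unseen species ~ F_1 / k.
unseen_mass_est = F[1] / k
unseen_mass_true = p[[i for i in range(n) if i not in counts]].sum()

# Fisher: expected number of NEW species in a second k-sample period
# ~ F_1 - F_2 + F_3 - F_4 + ...
fisher_est = sum((-1) ** (j + 1) * Fj for j, Fj in F.items())

print(f"unseen mass: estimate {unseen_mass_est:.4f}, truth {unseen_mass_true:.4f}")
print(f"new species next period: estimate {fisher_est}")
```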

Characteristic Functions

For an element of probability $p_i$, Poissonize so that each $k$-sample period gives it an independent $\mathrm{Poisson}(kp_i)$ count. Then:

$\Pr[\text{not seen}] \cdot p_i = e^{-kp_i} \cdot p_i = \mathrm{poi}(kp_i, 1)/k$, estimated by $F_1/(\text{number of samples})$

$\Pr[\text{not seen in first period, but seen in second period}] = e^{-kp_i}(1 - e^{-kp_i}) = \mathrm{poi}(kp_i, 1) - \mathrm{poi}(kp_i, 2) + \mathrm{poi}(kp_i, 3) - \mathrm{poi}(kp_i, 4) + \dots$, estimated by $F_1 - F_2 + F_3 - F_4 + F_5 - \dots$

where $\mathrm{poi}(x, i) \triangleq e^{-x} x^i / i!$
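The alternating-series identity follows from the Taylor expansion of $e^{-x}$; a short derivation (my reconstruction of the step the slide elides):

```latex
\sum_{j \ge 1} (-1)^{j+1}\,\mathrm{poi}(x,j)
  = e^{-x} \sum_{j \ge 1} \frac{(-1)^{j+1} x^j}{j!}
  = e^{-x}\bigl(1 - e^{-x}\bigr)
```

With $x = kp_i$ this is exactly $\Pr[\text{not seen in period 1}] \cdot \Pr[\text{seen in period 2}]$ under Poissonization, and since $\mathbb{E}[F_j] = \sum_i \mathrm{poi}(kp_i, j)$, the statistic $F_1 - F_2 + F_3 - \dots$ is an unbiased estimate of the expected number of new elements in a second period.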

Other Properties?

Entropy: $-\sum_i p_i \log p_i$

Support size: $\sum_i \mathbb{1}[p_i > 0]$, a step function

To estimate these, approximate the per-element functions: $-\log p_i$ for entropy, and $1/p_i$ for support size (taking the value $0$ at $p_i = 0$).

Accurate to $O(1)$ for $x = \Omega(1)$, with linearly many samples

Exponentially hard to approximate below 1/k

Easier case? The L2 norm $\sum_i p_i^2$. Use the identity
$x^2 = \sum_{j \ge 0} 2!\binom{j}{2}\,\mathrm{poi}(x, j)$

L2 approximation: $\sum_j 2!\binom{j}{2} F_j$, which in expectation equals $\sum_i (kp_i)^2 = k^2 \sum_i p_i^2$

Works very well if we have a bound on the j’s encountered
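A quick numerical sanity check (my sketch; the uniform test distribution and parameters are illustrative): the statistic $\sum_j j(j-1) F_j / k^2$ should track $\sum_i p_i^2$.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

# Hypothetical check: uniform over n elements, so sum_i p_i^2 = 1/n.
n, k = 1_000, 20_000
sample = rng.integers(n, size=k)

F = Counter(Counter(sample).values())

# 2! * C(j,2) = j(j-1); in expectation this sums to k^2 * sum_i p_i^2.
l2_sq_est = sum(j * (j - 1) * Fj for j, Fj in F.items()) / k**2

print(f"estimate {l2_sq_est:.6f} vs truth {1 / n:.6f}")
```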

L2 distance related to L1: by Cauchy-Schwarz, $\lVert p - q \rVert_1 \le \sqrt{n}\,\lVert p - q \rVert_2$ on a support of size $n$

Yields 1-sided testers for L1 distance; also L1 distance to uniform, and L1 distance to an arbitrary known distribution

[Batu, Fortnow, Rubinfeld, Smith, White, ‘00]

Are good testers computationally trivial?

Maximum Likelihood Distributions

[Orlitsky et al., Science, etc]

Relaxing the Problem

Given $\{F_j\}$, find a distribution $p$ such that the expected fingerprint of $k$ samples from $p$ approximates $\{F_j\}$.

By concentration bounds, the "right" distribution should also satisfy this, i.e., lie in the feasible region of the linear program (see the sketch below).

Yields: $O(n/\log n)$-sample estimators for entropy, support size, L1 distance, and anything similar.
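A minimal sketch of this linear-program relaxation (my reconstruction, assuming scipy; the grid, slack objective, and parameter choices are illustrative, not the paper's): variables $h_x$ count how many domain elements sit at each grid probability $x$, and we require the expected fingerprint $\sum_x h_x\,\mathrm{poi}(kx, j)$ to match the observed $F_j$.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import poisson

def expected_fingerprint_lp(F, k, jmax, grid_size=100):
    """Find a histogram h over a probability grid whose expected
    fingerprint matches the observed F, minimizing total slack."""
    x = np.geomspace(1 / (100 * k), 1.0, grid_size)  # probability grid
    # A[j-1, i] = poi(k * x_i, j): expected fingerprint contribution
    # of one domain element with probability x_i.
    A = np.array([poisson.pmf(j, k * x) for j in range(1, jmax + 1)])

    # Variables: [h (grid_size), slack s (jmax)]; minimize sum of slacks.
    c = np.concatenate([np.zeros(grid_size), np.ones(jmax)])
    Fvec = np.array([F.get(j, 0) for j in range(1, jmax + 1)], dtype=float)
    # |A @ h - Fvec| <= s, written as two families of inequalities.
    A_ub = np.block([[A, -np.eye(jmax)], [-A, -np.eye(jmax)]])
    b_ub = np.concatenate([Fvec, -Fvec])
    # Total probability mass represented by h is at most 1.
    mass = np.concatenate([x, np.zeros(jmax)])[None, :]
    res = linprog(c,
                  A_ub=np.vstack([A_ub, mass]),
                  b_ub=np.concatenate([b_ub, [1.0]]),
                  bounds=[(0, None)] * (grid_size + jmax))
    return x, res.x[:grid_size]  # grid and recovered histogram
```

Properties are then read off the recovered histogram, e.g. an entropy estimate $\sum_i h_i\, x_i \log(1/x_i)$.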

Does the extra computational power help??

Lower Bounds

Primal: find not-large $\{c_i\}$ that minimize [the error of the corresponding linear estimator]

DUAL

Find distributions $y^+, y^-$ that maximize [the gap in property value] while [the discrepancy between their expected fingerprints] is small

"Find distributions with very different property values, but almost identical fingerprint expectations"

NEEDS Theorem: close expected fingerprints $\Rightarrow$ indistinguishable [Raskhodnikova, Ron, Shpilka, Smith '07]

"Roos's Theorem": Generalized Multinomial Distributions

Definition: a distribution expressible as $\sum_i Z_i$ for independent $Z_i \in \{\,0,\ (1,0,0,0,\dots),\ (0,1,0,0,\dots),\ (0,0,1,0,\dots),\ \dots\,\}$

Includes fingerprint distributions. Also: binomial distributions, multinomial distributions, and any sums of such distributions.

“Generalized Multinomial Distributions” appear all over CS, and characterizing them is central to many papers (for example, Daskalakis and Papadimitriou, Discretized multinomial distributions and Nash equilibria in anonymous games, FOCS 2008.)
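An illustrative sketch of the definition (mine, not from the talk): each independent $Z_i$ is either the zero vector or a standard basis vector, with its own probabilities; the probability matrix below is an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(2)

def generalized_multinomial(P):
    """One draw of sum_i Z_i, where row i of P gives
    Pr[Z_i = e_1], ..., Pr[Z_i = e_d], and Z_i = 0 otherwise."""
    n, d = P.shape
    out = np.zeros(d, dtype=int)
    for row in P:
        j = rng.choice(d + 1, p=np.append(row, 1.0 - row.sum()))
        if j < d:  # j == d means Z_i = 0
            out[j] += 1
    return out

# 10 independent Z_i over d = 4 coordinates, each zero with prob. 1/2.
P = rng.dirichlet(np.ones(4), size=10) * 0.5
print(generalized_multinomial(P))
```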

Comment:

Thm (Roos): if there are bounds [on the probabilities $\Pr[Z_i \ne 0]$] such that [each $Z_i$ is rarely nonzero], then $\sum_i Z_i$ is multivariate Poisson to within [a correspondingly small total variation distance]

Distributions of Rare Elements

Distribution of fingerprints [is approximately multivariate Poisson], provided every element is rare, even in $k$ samples

Yields best known lower bounds for non-trivial 1-sided testing problems: $\Omega(n^{2/3})$ for L1 distance, $\Omega(n^{2/3} m^{1/3})$ for "independence"

Note: impossible to confuse [an expected count] $> \log n$ with one that is $o(1)$. Can we cut off above $\log n$? Suggests these lower bounds are tight to within $\log n$ factors.

Can we do better?

A Better Central Limit Theorem (?)

Roos’s Theorem: Fingerprints are like Poissons

(provided…)

Poissons: 1-parameter family

Gaussians: 2-parameter family

New CLT: Fingerprints are like Gaussians

(provided variance is high enough in every direction)

How to ensure high variance? "Fatten" distributions by adding elements at many different probabilities; this can't be used for 1-sided bounds.

Results

Additive estimates of entropy, support size, L1 distance: $O(n/\log n)$ samples (per the LP result above)

2-approximation of L1 distance to $U_m$:

All testers are linear expressions in the fingerprint

Duality

Primal: find not-large $\{c_i\}$ that minimize [the error of the linear estimator $\sum_j c_j F_j$]

DUAL

Find distributions $y^+, y^-$ that maximize [the gap in property value] while [the discrepancy between their expected fingerprints] is small, $\le k^{-d}$ (schematic below)

Yields estimator when d<½

Yields lower bound when d>½

"When [$d = \frac12$], the optimum is log-convex"

Theorem: For a linear symmetric property $\pi$, $\varepsilon > 0$, and $c > \frac12$: if all pairs $p^+, p^-$ of support $\le n$ with [$|\pi(p^+) - \pi(p^-)| \ge \varepsilon$] are distinguishable w.p. $> c$ via $k$ samples, then there exists a linear estimator with error [$O(\varepsilon)$] using $(1+o(1))k$ samples, succeeding w.p. $1 - o(1/\mathrm{poly}(k))$.
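A schematic of the primal/dual pair (my reconstruction of the elided formulas; the exact objectives in the talk may differ):

```latex
\textbf{Primal:}\quad
\min_{\{c_j\}\ \text{not large}}\ \max_{p}\;
\Bigl|\,\pi(p) - \textstyle\sum_j c_j\,\mathbb{E}_p[F_j]\,\Bigr|

\textbf{Dual:}\quad
\max_{y^+,\,y^-}\ \pi(y^+) - \pi(y^-)
\quad\text{s.t.}\quad
\textstyle\sum_j \bigl|\,\mathbb{E}_{y^+}[F_j] - \mathbb{E}_{y^-}[F_j]\,\bigr| \le k^{-d}
```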

Open Problems

Dependence on ε (resolved for entropy)

Beyond additive estimates – “case-by-case optimal”?

We suspect linear programming is better than linear estimators

Leveraging these results for non-symmetric properties

Monotonicity, with respect to different posets

Practical applications!
