Download pdf - Unified Maximum Likelihood Estimation of Symmetric ...ccanonne/workshop-focs2017/files/...Unified Maximum Likelihood Estimation of Symmetric Distribution Properties JayadevAcharya

UnifiedMaximumLikelihoodEstimation

ofSymmetricDistributionProperties

Jayadev AcharyaHirakendu DasAlon Orlitsky AnandaSuresh

Cornell Yahoo UCSD Google

FrontiersinDistributionTestingWorkshopFOCS2017

SpecialThanksto…

Idon’tsmile Ionlysmile

Symmetricproperties

𝒫 ={(𝒑𝟏, …𝒑𝒌) }- collectionofdistributionsover{1,…,k}

Distributionproperty:𝒇:𝒫→ℝ

𝒇 issymmetricifunchangedunderinputpermutations

Entropy𝑯 𝒑 ≜ ∑𝒑𝒊𝒍𝒐𝒈𝟏𝒑𝒊

��

Supportsize𝑺 𝒑 ≜ ∑ 𝕀 𝒑𝒊5𝟎�𝒊

Rényi entropy,supportcoverage,distancetouniformity,…

Determinedbytheprobabilitymultiset{𝒑𝟏, 𝒑𝟐, … , 𝒑𝒌 }

Symmetricproperties

𝒑 = (𝒑𝟏, …𝒑𝒌), (𝒌 finiteorinfinite)- discretedistribution

𝒇 𝒑 , apropertyof𝒑

𝒇 𝒑 issymmetricifunchangedunderinputpermutations

Entropy 𝑯 𝒑 ≜ ∑𝒑𝒊𝒍𝒐𝒈𝟏𝒑𝒊

��

Supportsize𝑺 𝒑 ≜ ∑ 𝕀 𝒑𝒊5𝟎�𝒊

Rényi entropy,supportcoverage,distancetouniformity,…

Determinedbytheprobabilitymultiset{𝒑𝟏, 𝒑𝟐, … , 𝒑𝒌 }

Propertyestimation

𝒑 unknowndistributionin𝒫

Givenindependentsamples𝐗𝟏, 𝐗𝟐, … , 𝐗𝐧 ∼ 𝐩

Estimate𝒇 𝒑

Samplecomplexity 𝑺 𝒇, 𝒌, 𝜺, 𝜹

Minimum𝑛 necessaryto

Estimate𝒇 𝒑 ± 𝜺

Witherrorprobability< 𝜹

Plug-inestimation

Use𝑿𝟏, 𝑿𝟐, … , 𝑿𝒏 ∼ 𝒑tofindanestimate𝒑Fof𝒑

Estimate𝒇 𝒑 by𝒇 𝒑F

Howtoestimate𝒑?

SequenceMaximumLikelihood(SML)

𝒑𝐬𝐦𝐥 = 𝐚𝐫𝐠𝐦𝐚𝐱𝒑 𝒑(𝒙𝒏 )= 𝐚𝐫𝐠𝐦𝐚𝐱

𝒑(𝒙) 𝚷𝒊𝒑(𝒙𝒊 )

𝑥Q= ℎ, ℎ, 𝑡𝒑𝒉,𝒉,𝒕𝒔𝒎𝒍 = 𝐚𝐫𝐠𝐦𝐚𝐱𝒑𝟐 𝒉 · 𝒑(𝒕)

𝒑𝒉,𝒉,𝒕𝒔𝒎𝒍 (h)=2/3 𝒑𝒉,𝒉,𝒕𝒔𝒎𝒍 (t)=1/3

Sameasempirical-frequencydistribution

Multiplicity𝑵𝒙 - #times𝑥 appearsin𝒙𝒏

𝒑𝐬𝐦𝐥 𝒙 =𝑵𝒙𝒏

PriorWork

Differentestimatorforeachproperty

Usesophisticatedapproximationtheoryresults

A Unified Maximum Likelihood Approach for Estimating Symmetric Properties of Discrete Distributions

Property Notation SML OptimalEntropy H(p) k

"

?Support size S(p)

k

k log 1

"

?Support coverage S

m

(p)

m

m ?Distance to u kp� uk

1

k

"

2 ?

Table 2. Estimation complexity for various properties, up to a constant factor. For all properties shown, PML achieves the best knownresults. Citations are for specialized techniques, PML results are shown in this paper. Support and support coverage results have beennormalized for consistency with existing literature.

Property Notation SML Optimal ReferencesEntropy H(p) k

"

k

log k

1

"

(Valiant & Valiant, 2011a; Wu &Yang, 2016; Jiao et al., 2015)

Support size S(p)

k

k log 1

"

k

log k

log

2

1

"

(Wu & Yang, 2015)Support coverage S

m

(p)

m

m m

logm

log

1

"

(Orlitsky et al., 2016)Distance to u kp� uk

1

k

"

2k

log k

1

"

2 (Valiant & Valiant, 2011b; Jiaoet al., 2016)

Table 3. Estimation complexity for various properties, up to a constant factor. For all properties shown, PML achieves the best knownresults. Citations are for specialized techniques, PML results are shown in this paper. Support and support coverage results have beennormalized for consistency with existing literature.

Forseveralimportantproperties

Empirical-frequencypluginrequiresΘ 𝒌 samples

Newcomplex(non-plugin)estimatorsneedΘ 𝑘log 𝑘

samples

Entropyestimation

SMLestimateofentropy= ∑ _`alog a

_`�b

Samplecomplexity:Θ 𝑘/𝜀Variouscorrectionsproposed:Miller-Maddow,Jackknifedestimator,Coverageadjusted,…Samplecomplexity:Ω(𝑘) foralltheaboveestimators

Entropyestimation[Paninski’03]:𝑜(𝑘) samplecomplexity(existential)

[ValiantValiant’11a]:ConstructiveLPbasedmethods:Θgh

ijk h[ValiantValiant11b,WuYang’14,HanJiaoVenkatWeissman’14]:

Simplifiedalgorithms,andgrowthrate:Θ hgijk h

New(asofAugust)Results

Unified,simple,sample-optimalapproachforallaboveproblems

Plug-inestimator,replacesequencemaximumlikelihood

withprofilemaximumlikelihood

Profiles𝒉, 𝒉, 𝒕 or 𝒉, 𝒕, 𝒉or𝒕, 𝒉, 𝒕➞ sameestimate

Oneelementappearedonce,onappearedtwice

Profile:Multi-setofmultiplicities:𝚽 𝑿𝟏𝒏 = {𝑵𝒙: 𝒙 ∈ 𝑿𝟏𝒏}

𝚽(𝒉, 𝒉, 𝒕) = 𝚽(𝒕, 𝒉, 𝒕) = {𝟏, 𝟐}

𝚽(𝜶, 𝜸, 𝜷, 𝜸) = {𝟏, 𝟏, 𝟐}

Sufficientstatisticforsymmetricproperties

Profilemaximumlikelihood[+SVZ’04]Profile probability

𝑝 Φ = u 𝑝(𝑥va)�

w bxy zw

Maximizetheprofileprobability

𝑝w{|} = argmax

{𝑝(Φ 𝑋va )

See“Onestimatingtheprobabilitymultiset”,Orlitsky,Santhanam,Viswanathan,Zhangforadetailedtreatment,andanargumentforcompetitivedistributionestimation.

Profilemaximumlikelihood(PML)[+SVZ’04]

Profile probability

𝑝 Φ = u 𝑝(𝑥a)�

by:w bxy zw

DistributionMaximizingtheprofileprobability

𝑝w{|} = argmax

{𝑝(Φ)

PMLcompetitivefordistributionestimation

PMLexample𝑋Q = h, h, t

𝒑𝒉,𝒉,𝒕𝒔𝒎𝒍 (h)=2/3𝒑𝒉,𝒉,𝒕𝒔𝒎𝒍 (t)=1/3

Φ h, h, t = 1,2𝑝 Φ = 1,2 = 𝑝 𝑠, 𝑠, 𝑑 + 𝑝 𝑠, 𝑑, 𝑠 + 𝑝 𝑑, 𝑠, 𝑠

=3p(s,s,d)

= Qv ∑ 𝑝� 𝑥 𝑝(𝑦)�

b��

𝑝�|} 1, 2 =31

23

� 13 +

13

� 23 =

1827 =

23

PMLof{1,2}

P({1,2})=p(s,s,d)+p(s,d,s)+p(d,s,s)=3p(s,s,d)

p s, s, d = Σ��𝑝� 𝑥 𝑝 𝑦= Σ�𝑝� 𝑥 1 − 𝑝 𝑥

≤¼Σ�𝑝 𝑥 = v�

𝑝{|}(s,s,d)=¼

(1/2,1/2)➞ p(s,s,d)=1/8+1/8=1/4

𝑝{|}({1,2})=¾ Recall:𝑝�|} 1, 2 =2/3

PML({1,1,2})

Φ(𝛼, 𝛾, 𝛽, 𝛾) = {1,1, 2}

𝑝{|} 1,1,2 = 𝑈[5]

PMLcanpredictexistenceofnewsymbols

ProfilemaximumlikelihoodPMLof{1,2}is{½,½}

𝑝{|} 1,2 =31

12

� 12 +

12

� 12 =

34 >

1827

Σ��𝑝� 𝑥 𝑝 𝑦 = Σ�𝑝� 𝑥 1 − 𝑝 𝑥 ≤¼Σ�𝑝 𝑥 = v�

𝑋a = 𝛼, 𝛾, 𝛽, 𝛾,Φ 𝑋a = {1,1, 2}

𝑝{|} 1,1,2 = 𝑈[5]

PMLcanpredictexistenceofnewsymbols

PMLPlug-inToestimateasymmetricproperty𝑓

Find𝑝{|} Φ(𝑋a)Output𝑓(𝑝{|})

Simple

UnifiedNotuningparameters

Someexperimentalresults(c.2009)

Uniform500symbols350samples

2x6,3x4,13x3,63x2,161x1242appeared, 258didnot

U[500],350x,12experiments

Uniform500symbols350samples

2x6,3x4,13x3,63x2,161x1248appeared,258didnot

700samples

U[500],700x,12experiments

Staircase

15Kelements,5steps,~3x30KsamplesObserve8,882elts6,118missing

Zipf

Underliesmanynaturalphenomenapi=C/i,i=100…15,00030,000samplesObserve9,047elts5,953missing

1990Census- LastnamesSMITH 1.006 1.006 1JOHNSON 0.810 1.816 2WILLIAMS 0.699 2.515 3JONES 0.621 3.136 4BROWN 0.621 3.757 5DAVIS 0.480 4.237 6MILLER 0.424 4.660 7WILSON 0.339 5.000 8MOORE 0.312 5.312 9TAYLOR 0.311 5.623 10

AMEND 0.001 77.478 18835ALPHIN 0.001 77.478 18836ALLBRIGHT 0.001 77.479 18837AIKIN 0.001 77.479 18838ACRES 0.001 77.480 18839ZUPAN 0.000 77.480 18840ZUCHOWSKI 0.000 77.481 18841ZEOLLA 0.000 77.481 18842

18,839 names77.48% population~230 million

1990Census- Lastnames18,839 lastnamesbasedon~230 million35,000 samples,observed9,813 names

Coverage(#newsymbols)Zipfdistributionover15Kelements

Sample30KtimesEstimate:#newsymbolsinsampleofsizeλ*30K

Good-Toulmin:

EstimatePML&predictExtendstoλ>1Appliestootherproperties

λ<1

λ>1

λ>1

FindingthePMLdistribution

EMalgorithm[+ Pan,Sajama,Santhanam,Viswanathan,Zhang’05- ’08]

ApproximatePMLviaBethePermanents[Vontobel]

ExtensionsofMarkovChains[Vatedka,Vontobel]

Noprovablealgorithmsknown

MotivatedValiant&Valiant

MaximumLikelihoodEstimationPlugin

Generalpropertyestimationtechnique

𝒫- collectionofdistributionsoverdomain𝒵

𝑓:𝒫 → ℝ anyproperty(sayentropy)

MLEestimator

Given𝑧 ∈ 𝒵

Determine𝑝ª«¬ ≜ argmax{∈𝒫

𝑝 𝑧

Output𝑓(𝑝ª«¬)

HowgoodisMLE?

CompetitivenessofMLEplugin

𝒫 - collectionofdistributionsoverdomain𝒵

𝑓®: 𝒵 → ℝ any estimaorsuchthat∀𝑝 ∈ 𝒫, 𝑍 ∼ 𝑝

Pr 𝑓 𝑝 − 𝑓® 𝑍 > 𝜀 < 𝛿

MLEpluginerrorboundedby

Pr 𝑓 𝑝 − 𝑓(𝑝ª«¬) > 2 ⋅ 𝜀 < 𝛿 ⋅ |𝒵|

Simple,universal,competitivewithany𝑓®

Quiz:Probabilityofunlikelyoutcomes6-sideddie,p=(𝑝v, 𝑝�, … , 𝑝µ)

𝑝¶ ≥ 0,andΣ𝑝¶ = 1,otherwisearbitrary

Z～p

𝑃º 𝑝ª ≤ 1/6

(1,0,…,0)➞ Pr(𝑝ª≤1/6)=0

(1/6,…,1/6)➞ Pr(𝑝ª ≤1/6)=1

𝑃¼ 𝑝ª ≤ 0.01

𝑃¼ 𝑝ª ≤ 0.01 = Σ¶:{¾¿À.Àv 𝑝¶ ≤ 6 ⋅ 0.01 = 0.06

Canbeanything

≤0.06

CompetitivenessofMLEplugin- proof𝑓®: 𝒵 → ℝ: ∀𝑝 ∈ 𝒫, 𝑍 ∼ 𝑝➞ Pr 𝑓 𝑝 − 𝑓® 𝑍 > 𝜀 < 𝛿

thenPr 𝑓 𝑝 − 𝑓(𝑝ª«¬) > 2 ⋅ 𝜀 < 𝛿 ⋅ |𝒵|

Forall𝑧suchthat𝑝 𝑧 ≥ 𝛿: 1) 𝑓 𝑝 − 𝑓® 𝑧 ≤ 𝜀

2)𝑝ª«¬ 𝑧 ≥ 𝑝 𝑧 > 𝛿,hence 𝑓(𝑝ª«¬) − 𝑓® 𝑧 ≤ 𝜀

Triangleinequality: 𝑓(𝑝ª«¬) − 𝑓 𝑝 ≤ 2𝜀

If 𝑓(𝑝ª«¬) − 𝑓 𝑝 > 2𝜀then𝑝 𝑧 < 𝛿,

Pr 𝑓 𝑝ª«¬ − 𝑓 𝑝 > 2𝜀 ≤ Pr 𝑝 𝑍 < 𝛿 ≤ u 𝑝(𝑧)�

{ ª ÆÇ

≤ 𝛿 ⋅ |𝒵|

PMLperformancebound

If𝑛 = 𝑆 𝑓, 𝑘, 𝜀, 𝛿 ,then𝑆{|} 𝑓, 𝑘, 2 ⋅ 𝜀, Φa ⋅ 𝛿 ≤ 𝑛

|Φa|:numberofprofilesoflength𝑛

Profileoflengthn:partitionofn

{3},{1,2},{1,1,1}➞ 3,2+1,1+1+1Φa = 𝑝𝑎𝑟𝑡𝑖𝑡𝑖𝑜𝑛#𝑜𝑓𝑛

Hardy-Ramanujan:|Φa| < 𝑒Q a�

Easy:𝑒ijk a⋅ a�

If 𝒏 = 𝑺 𝒇, 𝒌, 𝜺, 𝒆Ï𝟒 𝒏� ,then 𝑺𝒑𝒎𝒍 𝒇, 𝒌, 𝟐𝜺, 𝒆Ï 𝒏� ≤ 𝒏

SummarySymmetricpropertyestimationPMLplug-inapproach

SimpleUniversalSampleoptimalforknownsublinearproperties

FuturedirectionsProvablyefficientalgorithmsIndependentprooftechnique

ThankYou!

PMLforsymmetricfForany symmetricproperty𝑓,

if𝑛 = 𝑆 𝑓, 𝑘, 𝜀, 0.1 ,then𝑆{|} 𝑓, 𝑘, 2 ⋅ 𝜀, 0.1 = 𝑂(𝑛�).

Proof.Bymediantrick,𝑆 𝑓, 𝑘, 𝜀, 𝑒Ï| = 𝑂 𝑛 ⋅ 𝑚 .Therefore,

𝑆{|} 𝑓, 𝑘, 2 ⋅ 𝜀, 𝑒Q a⋅|� Ï| = 𝑂(𝑛 ⋅ 𝑚),Pluggingin𝑚 = 𝐶 ⋅ 𝑛,givesthedesiredresult.

Bettererrorprobabilities– warmup

Estimatingadistribution𝑝 over[𝑘] toL1distance𝜀 w.p.>0.9 requiresΘ(𝑘/𝜀�) samples.Proof:Exercise.Estimatingadistribution𝑝 over[𝑘] toL1distance𝜀 w.p.1 − 𝑒Ïh requiresΘ(𝑘/𝜀�) samples.Proof:• Empiricalestimator�̂�• |�̂� − 𝑝| hasboundeddifferenceconstant(b.d.c.)2/𝑛• ApplyMcDiarmid’s inequality

BettererrorprobabilitiesRecall

𝑆 𝐻, 𝜀, 𝑘, 2/3 = Θ𝑘

𝜀 ⋅ log 𝑘• Existingoptimalestimators:highb.d.c.• Modifythemtohavesmallb.d.c.,andstillbeoptimal• Inparticular,cangetb.d.c.=𝑛ÏÀ.Ö× (exponentcloseto1)

• Withtwicethesampleserrordropssuper-fast

𝑆 𝐻, 𝜀, 𝑘, 𝑒ÏaØ.Ù = Θ hg⋅ijk h

Similarresultsforotherproperties

EvenapproximatePMLworks!• Perhaps findingexactPMLishard.• EvenapproximatePMLworks.

Findadistribution𝑞 suchthat

𝑞 Φ 𝑋va ≥ 𝑒Ïa.Û ⋅ 𝑝{|} Φ(𝑋va)

Eventhisisoptimal(forlarge𝑘)

InFisher’swords…

OfcoursenobodyhasbeenabletoprovethatMLEisbestunderallcircumstances.MLEcomputedwithalltheinformationavailablemayturnouttobeinconsistent.Throwingawayasubstantialpartoftheinformationmayrenderthemconsistent.

R.A.Fisher

ProofofPMLperformanceIf𝒏 = 𝑺 𝒇, 𝒌, 𝜺, 𝜹 ,then𝑺𝒑𝒎𝒍 𝒇, 𝒌, 𝟐 ⋅ 𝜺, 𝜱𝒏 ⋅ 𝜹 ≤ 𝒏

𝑆 𝑓, 𝑘, 𝜀, 𝛿 ,achievedbyanestimator𝑓®(Φ(𝑋a))

• ProfilesΦ 𝑋a suchthat𝑝 Φ 𝑋a > 𝛿,

𝑝ÝÞß Φ ≥ 𝑝 Φ > 𝛿𝑓 𝑝wÝÞß − 𝑓 𝑝 ≤ 𝑓 𝑝wÝÞß − 𝑓® Φ + 𝑓® Φ − 𝑓 𝑝 < 2𝜀

• Profileswith𝑝 Φ 𝑋a < 𝛿,

𝑝(𝑝 Φ 𝑋a < 𝛿 < 𝛿 ⋅ |Φa|