UnifiedMaximumLikelihoodEstimation
ofSymmetricDistributionProperties
Jayadev AcharyaHirakendu DasAlon Orlitsky AnandaSuresh
Cornell Yahoo UCSD Google
FrontiersinDistributionTestingWorkshopFOCS2017
SpecialThanksto…
Idon’tsmile Ionlysmile
Symmetricproperties
𝒫 ={(𝒑𝟏, …𝒑𝒌) }- collectionofdistributionsover{1,…,k}
Distributionproperty:𝒇:𝒫→ℝ
𝒇 issymmetricifunchangedunderinputpermutations
Entropy𝑯 𝒑 ≜ ∑𝒑𝒊𝒍𝒐𝒈𝟏𝒑𝒊
��
Supportsize𝑺 𝒑 ≜ ∑ 𝕀 𝒑𝒊5𝟎�𝒊
Rényi entropy,supportcoverage,distancetouniformity,…
Determinedbytheprobabilitymultiset{𝒑𝟏, 𝒑𝟐, … , 𝒑𝒌 }
Symmetricproperties
𝒑 = (𝒑𝟏, …𝒑𝒌), (𝒌 finiteorinfinite)- discretedistribution
𝒇 𝒑 , apropertyof𝒑
𝒇 𝒑 issymmetricifunchangedunderinputpermutations
Entropy 𝑯 𝒑 ≜ ∑𝒑𝒊𝒍𝒐𝒈𝟏𝒑𝒊
��
Supportsize𝑺 𝒑 ≜ ∑ 𝕀 𝒑𝒊5𝟎�𝒊
Rényi entropy,supportcoverage,distancetouniformity,…
Determinedbytheprobabilitymultiset{𝒑𝟏, 𝒑𝟐, … , 𝒑𝒌 }
Propertyestimation
𝒑 unknowndistributionin𝒫
Givenindependentsamples𝐗𝟏, 𝐗𝟐, … , 𝐗𝐧 ∼ 𝐩
Estimate𝒇 𝒑
Samplecomplexity 𝑺 𝒇, 𝒌, 𝜺, 𝜹
Minimum𝑛 necessaryto
Estimate𝒇 𝒑 ± 𝜺
Witherrorprobability< 𝜹
Plug-inestimation
Use𝑿𝟏, 𝑿𝟐, … , 𝑿𝒏 ∼ 𝒑tofindanestimate𝒑Fof𝒑
Estimate𝒇 𝒑 by𝒇 𝒑F
Howtoestimate𝒑?
SequenceMaximumLikelihood(SML)
𝒑𝐬𝐦𝐥 = 𝐚𝐫𝐠𝐦𝐚𝐱𝒑 𝒑(𝒙𝒏 )= 𝐚𝐫𝐠𝐦𝐚𝐱
𝒑(𝒙) 𝚷𝒊𝒑(𝒙𝒊 )
𝑥Q= ℎ, ℎ, 𝑡𝒑𝒉,𝒉,𝒕𝒔𝒎𝒍 = 𝐚𝐫𝐠𝐦𝐚𝐱𝒑𝟐 𝒉 · 𝒑(𝒕)
𝒑𝒉,𝒉,𝒕𝒔𝒎𝒍 (h)=2/3 𝒑𝒉,𝒉,𝒕𝒔𝒎𝒍 (t)=1/3
Sameasempirical-frequencydistribution
Multiplicity𝑵𝒙 - #times𝑥 appearsin𝒙𝒏
𝒑𝐬𝐦𝐥 𝒙 =𝑵𝒙𝒏
PriorWork
Differentestimatorforeachproperty
Usesophisticatedapproximationtheoryresults
A Unified Maximum Likelihood Approach for Estimating Symmetric Properties of Discrete Distributions
Property Notation SML OptimalEntropy H(p) k
"
?Support size S(p)
k
k log 1
"
?Support coverage S
m
(p)
m
m ?Distance to u kp� uk
1
k
"
2 ?
Table 2. Estimation complexity for various properties, up to a constant factor. For all properties shown, PML achieves the best knownresults. Citations are for specialized techniques, PML results are shown in this paper. Support and support coverage results have beennormalized for consistency with existing literature.
Property Notation SML Optimal ReferencesEntropy H(p) k
"
k
log k
1
"
(Valiant & Valiant, 2011a; Wu &Yang, 2016; Jiao et al., 2015)
Support size S(p)
k
k log 1
"
k
log k
log
2
1
"
(Wu & Yang, 2015)Support coverage S
m
(p)
m
m m
logm
log
1
"
(Orlitsky et al., 2016)Distance to u kp� uk
1
k
"
2k
log k
1
"
2 (Valiant & Valiant, 2011b; Jiaoet al., 2016)
Table 3. Estimation complexity for various properties, up to a constant factor. For all properties shown, PML achieves the best knownresults. Citations are for specialized techniques, PML results are shown in this paper. Support and support coverage results have beennormalized for consistency with existing literature.
Forseveralimportantproperties
Empirical-frequencypluginrequiresΘ 𝒌 samples
Newcomplex(non-plugin)estimatorsneedΘ 𝑘log 𝑘
samples
Entropyestimation
SMLestimateofentropy= ∑ _`alog a
_`�b
Samplecomplexity:Θ 𝑘/𝜀Variouscorrectionsproposed:Miller-Maddow,Jackknifedestimator,Coverageadjusted,…Samplecomplexity:Ω(𝑘) foralltheaboveestimators
Entropyestimation[Paninski’03]:𝑜(𝑘) samplecomplexity(existential)
[ValiantValiant’11a]:ConstructiveLPbasedmethods:Θgh
ijk h[ValiantValiant11b,WuYang’14,HanJiaoVenkatWeissman’14]:
Simplifiedalgorithms,andgrowthrate:Θ hgijk h
New(asofAugust)Results
Unified,simple,sample-optimalapproachforallaboveproblems
Plug-inestimator,replacesequencemaximumlikelihood
withprofilemaximumlikelihood
Profiles𝒉, 𝒉, 𝒕 or 𝒉, 𝒕, 𝒉or𝒕, 𝒉, 𝒕➞ sameestimate
Oneelementappearedonce,onappearedtwice
Profile:Multi-setofmultiplicities:𝚽 𝑿𝟏𝒏 = {𝑵𝒙: 𝒙 ∈ 𝑿𝟏𝒏}
𝚽(𝒉, 𝒉, 𝒕) = 𝚽(𝒕, 𝒉, 𝒕) = {𝟏, 𝟐}
𝚽(𝜶, 𝜸, 𝜷, 𝜸) = {𝟏, 𝟏, 𝟐}
Sufficientstatisticforsymmetricproperties
Profilemaximumlikelihood[+SVZ’04]Profile probability
𝑝 Φ = u 𝑝(𝑥va)�
w bxy zw
Maximizetheprofileprobability
𝑝w{|} = argmax
{𝑝(Φ 𝑋va )
See“Onestimatingtheprobabilitymultiset”,Orlitsky,Santhanam,Viswanathan,Zhangforadetailedtreatment,andanargumentforcompetitivedistributionestimation.
Profilemaximumlikelihood(PML)[+SVZ’04]
Profile probability
𝑝 Φ = u 𝑝(𝑥a)�
by:w bxy zw
DistributionMaximizingtheprofileprobability
𝑝w{|} = argmax
{𝑝(Φ)
PMLcompetitivefordistributionestimation
PMLexample𝑋Q = h, h, t
𝒑𝒉,𝒉,𝒕𝒔𝒎𝒍 (h)=2/3𝒑𝒉,𝒉,𝒕𝒔𝒎𝒍 (t)=1/3
Φ h, h, t = 1,2𝑝 Φ = 1,2 = 𝑝 𝑠, 𝑠, 𝑑 + 𝑝 𝑠, 𝑑, 𝑠 + 𝑝 𝑑, 𝑠, 𝑠
=3p(s,s,d)
= Qv ∑ 𝑝� 𝑥 𝑝(𝑦)�
b��
𝑝�|} 1, 2 =31
23
� 13 +
13
� 23 =
1827 =
23
PMLof{1,2}
P({1,2})=p(s,s,d)+p(s,d,s)+p(d,s,s)=3p(s,s,d)
p s, s, d = Σ���𝑝� 𝑥 𝑝 𝑦= Σ�𝑝� 𝑥 1 − 𝑝 𝑥
≤¼Σ�𝑝 𝑥 = v�
𝑝{|}(s,s,d)=¼
(1/2,1/2)➞ p(s,s,d)=1/8+1/8=1/4
𝑝{|}({1,2})=¾ Recall:𝑝�|} 1, 2 =2/3
PML({1,1,2})
Φ(𝛼, 𝛾, 𝛽, 𝛾) = {1,1, 2}
𝑝{|} 1,1,2 = 𝑈[5]
PMLcanpredictexistenceofnewsymbols
ProfilemaximumlikelihoodPMLof{1,2}is{½,½}
𝑝{|} 1,2 =31
12
� 12 +
12
� 12 =
34 >
1827
Σ���𝑝� 𝑥 𝑝 𝑦 = Σ�𝑝� 𝑥 1 − 𝑝 𝑥 ≤¼Σ�𝑝 𝑥 = v�
𝑋a = 𝛼, 𝛾, 𝛽, 𝛾,Φ 𝑋a = {1,1, 2}
𝑝{|} 1,1,2 = 𝑈[5]
PMLcanpredictexistenceofnewsymbols
PMLPlug-inToestimateasymmetricproperty𝑓
Find𝑝{|} Φ(𝑋a)Output𝑓(𝑝{|})
Simple
UnifiedNotuningparameters
Someexperimentalresults(c.2009)
Uniform500symbols350samples
2x6,3x4,13x3,63x2,161x1242appeared, 258didnot
U[500],350x,12experiments
Uniform500symbols350samples
2x6,3x4,13x3,63x2,161x1248appeared,258didnot
700samples
U[500],700x,12experiments
Staircase
15Kelements,5steps,~3x30KsamplesObserve8,882elts6,118missing
Zipf
Underliesmanynaturalphenomenapi=C/i,i=100…15,00030,000samplesObserve9,047elts5,953missing
1990Census- LastnamesSMITH 1.006 1.006 1JOHNSON 0.810 1.816 2WILLIAMS 0.699 2.515 3JONES 0.621 3.136 4BROWN 0.621 3.757 5DAVIS 0.480 4.237 6MILLER 0.424 4.660 7WILSON 0.339 5.000 8MOORE 0.312 5.312 9TAYLOR 0.311 5.623 10
AMEND 0.001 77.478 18835ALPHIN 0.001 77.478 18836ALLBRIGHT 0.001 77.479 18837AIKIN 0.001 77.479 18838ACRES 0.001 77.480 18839ZUPAN 0.000 77.480 18840ZUCHOWSKI 0.000 77.481 18841ZEOLLA 0.000 77.481 18842
18,839 names77.48% population~230 million
1990Census- Lastnames18,839 lastnamesbasedon~230 million35,000 samples,observed9,813 names
Coverage(#newsymbols)Zipfdistributionover15Kelements
Sample30KtimesEstimate:#newsymbolsinsampleofsizeλ*30K
Good-Toulmin:
EstimatePML&predictExtendstoλ>1Appliestootherproperties
λ<1
λ>1
λ>1
FindingthePMLdistribution
EMalgorithm[+ Pan,Sajama,Santhanam,Viswanathan,Zhang’05- ’08]
ApproximatePMLviaBethePermanents[Vontobel]
ExtensionsofMarkovChains[Vatedka,Vontobel]
Noprovablealgorithmsknown
MotivatedValiant&Valiant
MaximumLikelihoodEstimationPlugin
Generalpropertyestimationtechnique
𝒫- collectionofdistributionsoverdomain𝒵
𝑓:𝒫 → ℝ anyproperty(sayentropy)
MLEestimator
Given𝑧 ∈ 𝒵
Determine𝑝ª«¬ ≜ argmax{∈𝒫
𝑝 𝑧
Output𝑓(𝑝ª«¬)
HowgoodisMLE?
CompetitivenessofMLEplugin
𝒫 - collectionofdistributionsoverdomain𝒵
𝑓®: 𝒵 → ℝ any estimaorsuchthat∀𝑝 ∈ 𝒫, 𝑍 ∼ 𝑝
Pr 𝑓 𝑝 − 𝑓® 𝑍 > 𝜀 < 𝛿
MLEpluginerrorboundedby
Pr 𝑓 𝑝 − 𝑓(𝑝ª«¬) > 2 ⋅ 𝜀 < 𝛿 ⋅ |𝒵|
Simple,universal,competitivewithany𝑓®
Quiz:Probabilityofunlikelyoutcomes6-sideddie,p=(𝑝v, 𝑝�, … , 𝑝µ)
𝑝¶ ≥ 0,andΣ𝑝¶ = 1,otherwisearbitrary
Z~p
𝑃º 𝑝ª ≤ 1/6
(1,0,…,0)➞ Pr(𝑝ª≤1/6)=0
(1/6,…,1/6)➞ Pr(𝑝ª ≤1/6)=1
𝑃¼ 𝑝ª ≤ 0.01
𝑃¼ 𝑝ª ≤ 0.01 = Σ¶:{¾¿À.Àv 𝑝¶ ≤ 6 ⋅ 0.01 = 0.06
Canbeanything
≤0.06
CompetitivenessofMLEplugin- proof𝑓®: 𝒵 → ℝ: ∀𝑝 ∈ 𝒫, 𝑍 ∼ 𝑝➞ Pr 𝑓 𝑝 − 𝑓® 𝑍 > 𝜀 < 𝛿
thenPr 𝑓 𝑝 − 𝑓(𝑝ª«¬) > 2 ⋅ 𝜀 < 𝛿 ⋅ |𝒵|
Forall𝑧suchthat𝑝 𝑧 ≥ 𝛿: 1) 𝑓 𝑝 − 𝑓® 𝑧 ≤ 𝜀
2)𝑝ª«¬ 𝑧 ≥ 𝑝 𝑧 > 𝛿,hence 𝑓(𝑝ª«¬) − 𝑓® 𝑧 ≤ 𝜀
Triangleinequality: 𝑓(𝑝ª«¬) − 𝑓 𝑝 ≤ 2𝜀
If 𝑓(𝑝ª«¬) − 𝑓 𝑝 > 2𝜀then𝑝 𝑧 < 𝛿,
Pr 𝑓 𝑝ª«¬ − 𝑓 𝑝 > 2𝜀 ≤ Pr 𝑝 𝑍 < 𝛿 ≤ u 𝑝(𝑧)�
{ ª ÆÇ
≤ 𝛿 ⋅ |𝒵|
PMLperformancebound
If𝑛 = 𝑆 𝑓, 𝑘, 𝜀, 𝛿 ,then𝑆{|} 𝑓, 𝑘, 2 ⋅ 𝜀, Φa ⋅ 𝛿 ≤ 𝑛
|Φa|:numberofprofilesoflength𝑛
Profileoflengthn:partitionofn
{3},{1,2},{1,1,1}➞ 3,2+1,1+1+1Φa = 𝑝𝑎𝑟𝑡𝑖𝑡𝑖𝑜𝑛#𝑜𝑓𝑛
Hardy-Ramanujan:|Φa| < 𝑒Q a�
Easy:𝑒ijk a⋅ a�
If 𝒏 = 𝑺 𝒇, 𝒌, 𝜺, 𝒆Ï𝟒 𝒏� ,then 𝑺𝒑𝒎𝒍 𝒇, 𝒌, 𝟐𝜺, 𝒆Ï 𝒏� ≤ 𝒏
SummarySymmetricpropertyestimationPMLplug-inapproach
SimpleUniversalSampleoptimalforknownsublinearproperties
FuturedirectionsProvablyefficientalgorithmsIndependentprooftechnique
ThankYou!
PMLforsymmetricfForany symmetricproperty𝑓,
if𝑛 = 𝑆 𝑓, 𝑘, 𝜀, 0.1 ,then𝑆{|} 𝑓, 𝑘, 2 ⋅ 𝜀, 0.1 = 𝑂(𝑛�).
Proof.Bymediantrick,𝑆 𝑓, 𝑘, 𝜀, 𝑒Ï| = 𝑂 𝑛 ⋅ 𝑚 .Therefore,
𝑆{|} 𝑓, 𝑘, 2 ⋅ 𝜀, 𝑒Q a⋅|� Ï| = 𝑂(𝑛 ⋅ 𝑚),Pluggingin𝑚 = 𝐶 ⋅ 𝑛,givesthedesiredresult.
Bettererrorprobabilities– warmup
Estimatingadistribution𝑝 over[𝑘] toL1distance𝜀 w.p.>0.9 requiresΘ(𝑘/𝜀�) samples.Proof:Exercise.Estimatingadistribution𝑝 over[𝑘] toL1distance𝜀 w.p.1 − 𝑒Ïh requiresΘ(𝑘/𝜀�) samples.Proof:• Empiricalestimator�̂�• |�̂� − 𝑝| hasboundeddifferenceconstant(b.d.c.)2/𝑛• ApplyMcDiarmid’s inequality
BettererrorprobabilitiesRecall
𝑆 𝐻, 𝜀, 𝑘, 2/3 = Θ𝑘
𝜀 ⋅ log 𝑘• Existingoptimalestimators:highb.d.c.• Modifythemtohavesmallb.d.c.,andstillbeoptimal• Inparticular,cangetb.d.c.=𝑛ÏÀ.Ö× (exponentcloseto1)
• Withtwicethesampleserrordropssuper-fast
𝑆 𝐻, 𝜀, 𝑘, 𝑒ÏaØ.Ù = Θ hg⋅ijk h
Similarresultsforotherproperties
EvenapproximatePMLworks!• Perhaps findingexactPMLishard.• EvenapproximatePMLworks.
Findadistribution𝑞 suchthat
𝑞 Φ 𝑋va ≥ 𝑒Ïa.Û ⋅ 𝑝{|} Φ(𝑋va)
Eventhisisoptimal(forlarge𝑘)
InFisher’swords…
OfcoursenobodyhasbeenabletoprovethatMLEisbestunderallcircumstances.MLEcomputedwithalltheinformationavailablemayturnouttobeinconsistent.Throwingawayasubstantialpartoftheinformationmayrenderthemconsistent.
R.A.Fisher
ProofofPMLperformanceIf𝒏 = 𝑺 𝒇, 𝒌, 𝜺, 𝜹 ,then𝑺𝒑𝒎𝒍 𝒇, 𝒌, 𝟐 ⋅ 𝜺, 𝜱𝒏 ⋅ 𝜹 ≤ 𝒏
𝑆 𝑓, 𝑘, 𝜀, 𝛿 ,achievedbyanestimator𝑓®(Φ(𝑋a))
• ProfilesΦ 𝑋a suchthat𝑝 Φ 𝑋a > 𝛿,
𝑝ÝÞß Φ ≥ 𝑝 Φ > 𝛿𝑓 𝑝wÝÞß − 𝑓 𝑝 ≤ 𝑓 𝑝wÝÞß − 𝑓® Φ + 𝑓® Φ − 𝑓 𝑝 < 2𝜀
• Profileswith𝑝 Φ 𝑋a < 𝛿,
𝑝(𝑝 Φ 𝑋a < 𝛿 < 𝛿 ⋅ |Φa|