Learning Mixtures of Structured Distributions over Discrete Domains
Xiaorui Sun, Columbia University
Joint work with Siu-On Chan (UC Berkeley), Ilias Diakonikolas (University of Edinburgh), Rocco Servedio (Columbia University)
Density Estimation
• PAC-type learning model
• Set 𝒞 of possible target distributions over [n]
• Learner
– Knows the set 𝒞 but does not know the target distribution p ∈ 𝒞
– Independently draws a few samples from p
– Outputs (a succinct description of a) distribution h which is ε-close to p
• Total variation distance d_TV(h, p) = (1/2)·Σ_i |h(i) − p(i)| is the standard measure in statistics (a sketch for computing it follows this slide)
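A minimal Python sketch of the distance measure above (the function name is mine, not from the talk):

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance: half the L1 distance between two
    distributions over [n]."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

print(tv_distance([0.5, 0.5], [0.6, 0.4]))  # ~0.1
```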
Learn a structured distribution
• If 𝒞 = {all distributions over [n]}, Θ(n/ε^2) samples are required
• Much better sample complexity is possible for structured distributions
– Poisson binomial distributions [DDS12a]: Õ(1/ε^3) samples
– Monotone / k-modal [Bir87, DDS12b]: Õ(log(n)/ε^3) samples / Õ(k·log(n)/ε^3) samples
This work: Learn mixtures of structured distributions
• Learn a mixture of distributions?
– A set 𝒞 of distributions over [n]
– Target distribution p is a mixture of k distributions from 𝒞
– i.e. p = μ_1·p_1 + … + μ_k·p_k, where p_1, …, p_k ∈ 𝒞 and μ_1, …, μ_k ≥ 0 such that Σ_i μ_i = 1
• Our result: learn mixtures for several classes of structured distributions
– Sample complexity close to optimal
– Efficient running time
Our results: learning mixtures of log-concave distributions
• Log-concave distribution p over [n]
– The support of p is a contiguous interval
– p(i)^2 ≥ p(i−1)·p(i+1) for 1 < i < n
(figure: a log-concave distribution over [1, n])
Our results: log-concave
• Algorithm to learn a mixture of k log-concave distributions
– Sample complexity: k·Õ(1/ε^4)
– Running time: Õ(k/ε^4) bit operations
• Lower bound: Ω(k/ε^(5/2)) samples
Our results: mixture of unimodal
• Unimodal distribution p over [n]
– There is a mode m ∈ [n] s.t. p is non-decreasing on [1, m] and non-increasing on [m, n]
(figure: a unimodal distribution over [1, n])
Our results: mixture of unimodal
• A mixture of 2 unimodal distributions may have Ω(n) modes (a construction is sketched after this slide)
• Algorithm to learn a mixture of k unimodal distributions
– Sample complexity: Õ(k·log(n)/ε^4) samples
– Running time: Õ(k·log(n)/ε^4) bit operations
• Lower bound: Ω(k·log(n)/ε^3) samples
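One way to see the Ω(n)-modes claim: mix a strictly decreasing and a strictly increasing distribution whose increments alternate in size, so the mixture zigzags. A minimal sketch (the construction and names are mine, not from the talk):

```python
import numpy as np

def zigzag_mixture(n):
    """Two unimodal distributions over [n] whose 50/50 mixture
    has about n/2 modes."""
    a = np.zeros(n)  # cumulative decrease of f (mode at 1)
    b = np.zeros(n)  # cumulative increase of g (mode at n)
    for i in range(1, n):
        if i % 2:
            a[i], b[i] = a[i-1] + 0.01, b[i-1] + 2   # mixture steps up
        else:
            a[i], b[i] = a[i-1] + 2, b[i-1] + 0.01   # mixture steps down
    f = 1.01 * n + 1.0 - a        # positive, strictly decreasing
    g = 1.0 + b                   # positive, strictly increasing
    return f / f.sum(), g / g.sum()

def count_modes(p):
    """Strict local maxima of p (endpoints included)."""
    return sum((i == 0 or p[i] > p[i-1]) and (i == len(p)-1 or p[i] > p[i+1])
               for i in range(len(p)))

f, g = zigzag_mixture(100)
print(count_modes(f), count_modes(g), count_modes(0.5*f + 0.5*g))  # 1 1 50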
Our results: mixture of MHR
• Monotone hazard rate distribution
– Hazard rate of p: h(i) = p(i) / Σ_{j ≥ i} p(j)
– h(i) = +∞ if Σ_{j ≥ i} p(j) = 0
– MHR distribution: h is a non-decreasing function over [n] (a checker is sketched after this slide)
(figure: an MHR distribution over [1, n])
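A minimal sketch of the hazard-rate definition above (function names are mine, not from the talk):

```python
import numpy as np

def hazard_rate(p):
    """h(i) = p(i) / sum_{j >= i} p(j), on the prefix where the tail is positive."""
    tail = np.cumsum(p[::-1])[::-1]     # tail[i] = p[i] + p[i+1] + ...
    return p[tail > 0] / tail[tail > 0]

def is_mhr(p, tol=1e-12):
    """MHR means the hazard rate is non-decreasing."""
    h = hazard_rate(np.asarray(p, dtype=float))
    return bool(np.all(np.diff(h) >= -tol))

# A truncated geometric distribution is MHR.
q = 0.3
geo = np.array([q * (1 - q)**i for i in range(50)])
assert is_mhr(geo / geo.sum())
```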
Our results: mixture of MHR
• Algorithm to learn a mixture of k MHR distributions
– Sample complexity: Õ(k·log(n/ε)/ε^4)
– Running time: Õ(k·log(n)/ε^4) bit operations
• Lower bound: Ω(k·log(n)/ε^3) samples
Compare with parameter estimation
• Parameter estimation [KMV10, MV10]
– Learn a mixture of k Gaussians
– Independently draw a few samples from p
– Estimate the parameters of each Gaussian component accurately
• Number of samples inherently depends exponentially on k, even for a mixture of k 1-dimensional normal distributions [MV10]
Compare with parameter estimation
• Parameter estimation needs at least exp(k) samples to learn a mixture of k binomial distributions
– Similar to the lower bound in [MV10]
• Density estimation makes it possible to estimate non-parametric distributions
– E.g. log-concave, unimodal, MHR
• Density estimation learns a mixture of k binomial distributions over [n] using k·Õ(1/ε^4) samples
– Binomial distributions are log-concave
Outline
• Learning algorithm based on decomposition
• Structural results for log-concave, unimodal, MHR distributions
Flat decomposition
• Key definition: distribution p is (ε, L)-flat if there exists a partition ℐ of [n] into L intervals I_1, …, I_L such that d_TV(p_ℐ, p) ≤ ε
– ℐ is an ε-flat decomposition for p
• p_ℐ is obtained by "flattening" p within each interval
– p_ℐ(i) = (Σ_{j ∈ I_ℓ} p(j)) / |I_ℓ| for i ∈ I_ℓ (see the sketch after the next slide)
Flat decomposition
(figure: a distribution over [1, n] and its flattened version)
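A minimal sketch of the flattening operation above (names are mine, not from the talk):

```python
import numpy as np

def flatten(p, intervals):
    """Spread each interval's total mass uniformly over its points."""
    p = np.asarray(p, dtype=float)
    q = np.empty_like(p)
    for lo, hi in intervals:          # half-open intervals [lo, hi)
        q[lo:hi] = p[lo:hi].sum() / (hi - lo)
    return q

# p is (eps, L)-flat iff for some partition into L intervals,
# 0.5 * np.abs(flatten(p, partition) - p).sum() <= eps
```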
Learn ε-flat distributions
• Main general Thm: Let 𝒞 = {all the (ε, L)-flat distributions over [n]}. There is an algorithm which draws Õ(L/ε^3) samples from p ∈ 𝒞, and outputs a hypothesis h such that d_TV(h, p) ≤ O(ε).
• Linear running time with respect to the number of samples
Easier problem: known decomposition
• Given
– Samples from an (ε, L)-flat distribution p
– An ε-flat decomposition ℐ for p
• Idea: estimate the probability mass of every interval in ℐ (sketched below)
• O(L/ε^2) samples are enough
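A sketch of the estimator just described, assuming half-open intervals and integer samples in [0, n) (names are mine):

```python
import numpy as np

def learn_known_decomposition(samples, intervals, n):
    """Estimate each interval's empirical mass and spread it
    uniformly inside the interval."""
    counts = np.bincount(samples, minlength=n).astype(float)
    h = np.empty(n)
    for lo, hi in intervals:      # half-open intervals [lo, hi)
        h[lo:hi] = counts[lo:hi].sum() / (len(samples) * (hi - lo))
    return h
```

Roughly, with O(L/ε^2) samples the empirical masses of the L intervals are accurate enough that h is O(ε)-close to the flattened target, hence O(ε)-close to p.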
Real problem: unknown decomposition
• Only given samples from an (ε, L)-flat distribution p
• There exists some ε-flat decomposition ℒ for p, but it is unknown
• A useful fact [DDS+13]: If ℒ is an ε-flat decomposition of p, and 𝒥 is a "refinement" of ℒ, then 𝒥 is a 2ε-flat decomposition of p
– If we knew a refinement of ℒ, that would be good enough
Unknown flat decomposition (cont)
• Idea: partition [n] into intervals 𝒦, each with small probability mass
– Achieved by sampling from p
(figure: the sampled partition 𝒦 and the unknown decomposition ℒ over [1, n])
Unknown flat decomposition (cont)
• There exists an (unknown) partition 𝒥
– Refinement of both ℒ and 𝒦
– At most |ℒ| + |𝒦| intervals
(figure: 𝒦 and ℒ over [1, n], with their common refinement)
Unknown flat decomposition (cont)
• There exists 𝒥
– Refinement of both ℒ and 𝒦
– At most |ℒ| + |𝒦| intervals
– A 2ε-flat decomposition for p
(figure: the common refinement 𝒥 over [1, n])
Unknown flat decomposition (cont)
• Compare p_𝒥 and p_𝒦
– They differ only inside intervals of 𝒦 that are split by boundaries of ℒ; there are at most |ℒ| = L of those
(figure: p flattened over 𝒥 and over 𝒦, on [1, n])
Unknown flat decomposition (cont)
• If the total probability mass of every interval of 𝒦 is at most ε/L, then d_TV(p_𝒦, p_𝒥) ≤ O(ε)
– Each split interval contributes at most twice its mass, so the total is at most 2·L·(ε/L) = 2ε
• Partition [n] into O(L/ε) intervals each with probability mass at most ε/L (sketched below)
– Õ(L/ε^3) samples are enough
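A greedy sketch of the light-partition step, using empirical masses (names and the greedy rule are mine, not from the talk):

```python
import numpy as np

def light_partition(samples, n, mass):
    """Split [0, n) into half-open intervals of empirical mass at most
    `mass`; a single heavy point gets an interval of its own."""
    emp = np.bincount(samples, minlength=n) / len(samples)
    intervals, lo, acc = [], 0, 0.0
    for i in range(n):
        if acc + emp[i] > mass and i > lo:
            intervals.append((lo, i))
            lo, acc = i, 0.0
        acc += emp[i]
    intervals.append((lo, n))
    return intervals
```

Feeding the intervals returned by light_partition(samples, n, ε/L) to the estimator from the earlier sketch gives, roughly, the general algorithm.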
Learn ε-flat distributions
• Main general Thm: Let 𝒞 = {all the (ε, L)-flat distributions over [n]}. There is an algorithm which draws Õ(L/ε^3) samples from p ∈ 𝒞, and outputs a hypothesis h such that d_TV(h, p) ≤ O(ε).
Learn mixtures of distributions
• Lem: A mixture of k (ε, L)-flat distributions has an (ε, kL)-flat decomposition (see the refinement sketch below)
– Tight for interesting distribution classes
• Thm (Learn mixture): Let p be a mixture of k (ε, L)-flat distributions. There is an algorithm which draws Õ(kL/ε^3) samples, and outputs a hypothesis h s.t. d_TV(h, p) ≤ O(ε).
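The lemma follows by taking the common refinement of the k component decompositions; a minimal sketch (the function name is mine):

```python
def common_refinement(partitions):
    """Common refinement of interval partitions of [0, n): take the
    union of all breakpoints. k partitions of L intervals each give
    at most k*L intervals."""
    cuts = sorted({c for part in partitions
                     for (lo, hi) in part
                     for c in (lo, hi)})
    return list(zip(cuts[:-1], cuts[1:]))
```

Each component stays O(ε)-close to its flattening on the refinement (by the refinement fact), and flattening is linear, so the mixture is O(ε)-close to its flattening on the at most kL intervals.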
First application: learning mixtures of log-concave distributions
• Recall definition:
– The support of p is a contiguous interval
– p(i)^2 ≥ p(i−1)·p(i+1) for 1 < i < n
• Lem: Every log-concave distribution is (ε, O(1/ε))-flat
• Learn a mixture of k log-concave distributions with k·Õ(1/ε^4) samples
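A quick numerical check of the definition, and of the earlier claim that binomials are log-concave (function names are mine):

```python
import numpy as np
from math import comb

def is_log_concave(p, tol=1e-12):
    """Check contiguous support and p(i)^2 >= p(i-1)*p(i+1)."""
    p = np.asarray(p, dtype=float)
    sup = np.flatnonzero(p)
    if sup.size and np.any(np.diff(sup) > 1):
        return False                        # gap in the support
    return bool(np.all(p[1:-1]**2 >= p[:-2] * p[2:] - tol))

# Binomial(m, q) over [0, m] is log-concave, so the mixture result
# above covers mixtures of binomials.
m, q = 20, 0.3
binom = np.array([comb(m, i) * q**i * (1 - q)**(m - i) for i in range(m + 1)])
assert is_log_concave(binom)
```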
Second application: learning mixtures of unimodal distributions
• Lem: Every unimodal distribution is (ε, O(log(n)/ε))-flat [Bir87, DDS+13]
• Learn a mixture of k unimodal distributions with Õ(k·log(n)/ε^4) samples
Third application: learning mixtures of MHR distributions
• Monotone hazard rate distribution
– Hazard rate of p: h(i) = p(i) / Σ_{j ≥ i} p(j)
– h(i) = +∞ if Σ_{j ≥ i} p(j) = 0
– MHR distribution: h is a non-decreasing function over [n]
• Lem: Every MHR distribution is (ε, O(log(n/ε)/ε))-flat
• Learn a mixture of k MHR distributions with Õ(k·log(n/ε)/ε^4) samples
Conclusion and further directions
• Flat decomposition is a useful way to study mixtures of structured distributions
• Extend to higher dimensions?
• Efficient algorithms with optimal sample complexity?

Distribution  | Sample complexity    | Lower bound
Log-concave   | k·Õ(1/ε^4)           | Ω(k/ε^(5/2))
Unimodal      | Õ(k·log(n)/ε^4)      | Ω(k·log(n)/ε^3)
MHR           | Õ(k·log(n/ε)/ε^4)    | Ω(k·log(n)/ε^3)
Thank you!