Hierarchical Mixture of Experts
Presented by Qi An, Machine learning reading group, Duke University, 07/15/2005
Outline: Background, Hierarchical tree structure, Gating networks, Expert networks, E-M algorithm, Experimental results, Conclusions
Background
The idea of mixture of experts: first presented by Jacobs and Hinton in 1988.
Hierarchical mixture of experts: proposed by Jordan and Jacobs in 1994.
Difference from previous mixture models: the mixing weights depend on both the input and the output.
One-layer structure
[Figure: a one-layer mixture of experts. Three Expert Networks and one Gating Network each receive the input x; the experts produce μ1, μ2, μ3, which are combined with the gating weights g1, g2, g3 to form the overall output μ. An ellipsoidal gating function is illustrated.]
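To make the figure concrete, here is a minimal NumPy sketch of the one-layer structure; the dimensions, the parameter names U and V, and the linear experts are my own illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 4, 3
U = rng.normal(size=(n_experts, d))   # one linear expert per branch: mu_i = U_i x
V = rng.normal(size=(n_experts, d))   # gating parameters: g = softmax(V x)

def one_layer_moe(x):
    """Blend the expert outputs with input-dependent gating weights."""
    mu = U @ x                          # expert outputs mu_1, mu_2, mu_3
    logits = V @ x
    g = np.exp(logits - logits.max())
    g /= g.sum()                        # gating weights g_1, g_2, g_3 (sum to one)
    return g @ mu                       # overall output mu = sum_i g_i * mu_i

print(one_layer_moe(rng.normal(size=d)))
```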
Expert network
At the leaves of the tree, for each expert:
$\xi_{ij} = U_{ij} x$ (linear predictor)
$\mu_{ij} = f(\xi_{ij})$ (output of the expert, where $f$ is the link function)
For example: the logistic function for binary classification.
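As a sketch of a single expert for the binary-classification example just mentioned (the array shapes and the hypothetical parameter values are my own assumptions):

```python
import numpy as np

def expert_output(U_ij, x):
    """Expert (i, j): linear predictor xi_ij = U_ij x, then the logistic link f."""
    xi = U_ij @ x                        # linear predictor
    return 1.0 / (1.0 + np.exp(-xi))     # mu_ij = f(xi_ij), here a class probability

U_11 = np.array([0.5, -1.0, 0.2])        # hypothetical parameters for expert (1, 1)
print(expert_output(U_11, np.array([1.0, 0.3, -0.7])))
```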
Gating network
At the nonterminal nodes of the tree.
Top layer: $\xi_i = v_i^T x$, $g_i = \frac{\exp(\xi_i)}{\sum_k \exp(\xi_k)}$
Other layers: $\xi_{ij} = v_{ij}^T x$, $g_{j|i} = \frac{\exp(\xi_{ij})}{\sum_k \exp(\xi_{ik})}$
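A small sketch of the two levels of gating as normalized exponentials of linear predictors; V_top and V_lower are hypothetical parameter arrays (one row per branch), not names from the slides.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gating_weights(V_top, V_lower, x):
    """Top-layer g_i and, for each branch i, lower-layer g_{j|i}."""
    g_top = softmax(V_top @ x)                                 # g_i
    g_lower = np.stack([softmax(V_i @ x) for V_i in V_lower])  # row i holds g_{j|i}
    return g_top, g_lower

x = np.array([1.0, -0.5])
V_top = np.array([[0.3, 0.1], [-0.2, 0.4]])                    # 2 top-level branches
V_lower = np.array([[[0.1, 0.2], [0.0, -0.3]],                 # 2 experts under branch 1
                    [[0.5, 0.1], [-0.1, 0.2]]])                # 2 experts under branch 2
print(gating_weights(V_top, V_lower, x))
```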
Probability model
For each expert, assume the true output y is chosen from a distribution with mean $\mu_{ij}$: $P(y \mid x, \theta_{ij})$.
Therefore, the total probability of generating y from x is given by
$P(y \mid x) = \sum_i g_i(x, v_i) \sum_j g_{j|i}(x, v_{ij}) \, P(y \mid x, \theta_{ij})$
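For instance, assuming Gaussian experts of unit variance centred at $\mu_{ij}$ (an assumption made here only for illustration), the total probability can be computed as:

```python
import numpy as np

def gaussian_pdf(y, mu, sigma=1.0):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def total_probability(y, g_top, g_lower, mu):
    """P(y|x) = sum_i g_i sum_j g_{j|i} P_ij(y); mu[i, j] is expert (i, j)'s mean for this x."""
    lik = gaussian_pdf(y, mu)                      # P_ij(y) for every expert
    return np.sum(g_top[:, None] * g_lower * lik)
```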
Posterior probabilities
Since $g_{j|i}$ and $g_i$ are computed based only on the input x, we refer to them as prior probabilities.
We can define the posterior probabilities, with knowledge of both the input x and the output y, using Bayes' rule (writing $P_{ij}(y)$ for $P(y \mid x, \theta_{ij})$):
$h_i = \frac{g_i \sum_j g_{j|i} P_{ij}(y)}{\sum_i g_i \sum_j g_{j|i} P_{ij}(y)}$
$h_{j|i} = \frac{g_{j|i} P_{ij}(y)}{\sum_j g_{j|i} P_{ij}(y)}$
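A sketch of the same Bayes'-rule computation in array form, where lik[i, j] holds $P_{ij}(y)$ for the observed pair (x, y); the array layout is my own.

```python
import numpy as np

def posteriors(g_top, g_lower, lik):
    """Posterior responsibilities h_i and h_{j|i} from the prior gates and expert likelihoods."""
    joint = g_top[:, None] * g_lower * lik     # g_i * g_{j|i} * P_ij(y)
    branch = joint.sum(axis=1)                 # g_i * sum_j g_{j|i} P_ij(y)
    h_i = branch / branch.sum()
    h_j_given_i = joint / branch[:, None]      # g_{j|i} P_ij(y) / sum_j g_{j|i} P_ij(y)
    return h_i, h_j_given_i
```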
E-M algorithm
Introduce auxiliary variables $z_{ij}$, which have an interpretation as the labels that correspond to the experts.
The probability model can be simplified with knowledge of the auxiliary variables:
$P(y^{(t)}, z^{(t)} \mid x^{(t)}) = \prod_i \prod_j \left[ g_i^{(t)} g_{j|i}^{(t)} P_{ij}(y^{(t)}) \right]^{z_{ij}^{(t)}}$
E-M algorithm
Complete-data likelihood:
$l_c(\theta; \mathcal{Y}) = \sum_t \sum_i \sum_j z_{ij}^{(t)} \left[ \ln g_i^{(t)} + \ln g_{j|i}^{(t)} + \ln P_{ij}(y^{(t)}) \right]$
The E-step:
$Q(\theta, \theta^{(p)}) = E[l_c(\theta; \mathcal{Y})] = \sum_t \sum_i \sum_j h_{ij}^{(t)} \left[ \ln g_i^{(t)} + \ln g_{j|i}^{(t)} + \ln P_{ij}(y^{(t)}) \right]$
where $h_{ij}^{(t)} = h_i^{(t)} h_{j|i}^{(t)}$ is the joint posterior for expert (i, j).
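In code, once the posteriors h have been computed, the E-step amounts to forming the weighted sum below; this is a per-sample sketch (sum it over t for the full Q), with my own array layout.

```python
import numpy as np

def q_per_sample(h_i, h_j_given_i, log_g_top, log_g_lower, log_lik):
    """One sample's contribution to Q: sum_ij h_ij [ln g_i + ln g_{j|i} + ln P_ij(y)]."""
    h_ij = h_i[:, None] * h_j_given_i            # joint posterior h_ij = h_i * h_{j|i}
    return np.sum(h_ij * (log_g_top[:, None] + log_g_lower + log_lik))
```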
E-M algorithm
The M-step:
$\theta_{ij}^{(p+1)} = \arg\max_{\theta_{ij}} \sum_t h_{ij}^{(t)} \ln P_{ij}(y^{(t)})$
$v_i^{(p+1)} = \arg\max_{v_i} \sum_t \sum_k h_k^{(t)} \ln g_k^{(t)}$
$v_{ij}^{(p+1)} = \arg\max_{v_{ij}} \sum_t \sum_k \sum_l h_k^{(t)} h_{l|k}^{(t)} \ln g_{l|k}^{(t)}$
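Each M-step maximization is a posterior-weighted maximum-likelihood fit. As one concrete case (my assumption, not stated on this slide), if the experts are linear-Gaussian the expert update reduces to weighted least squares:

```python
import numpy as np

def m_step_expert_wls(X, y, h_ij):
    """Fit one expert: argmax_U sum_t h_ij^(t) ln P_ij(y^(t)) for a Gaussian expert
    is weighted least squares with the posteriors h_ij^(t) as sample weights."""
    Xw = h_ij[:, None] * X                              # weight each row of X by its posterior
    return np.linalg.solve(X.T @ Xw, X.T @ (h_ij * y))  # solve the weighted normal equations
```

The gating maximizations are handled with IRLS, described on the next slide.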
IRLS
Iteratively reweighted least squares: an iterative algorithm for computing the maximum likelihood estimates of the parameters of a generalized linear model.
A special case of the Fisher scoring method:
$\theta^{(r+1)} = \theta^{(r)} - \left[ E\left( \frac{\partial^2 l}{\partial \theta \, \partial \theta^T} \right) \right]^{-1} \frac{\partial l}{\partial \theta}$
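A minimal IRLS sketch for logistic regression, the canonical GLM example (a generic illustration, not code from the slides), in which each Fisher-scoring step solves a weighted least-squares problem:

```python
import numpy as np

def irls_logistic(X, y, n_iter=20):
    """IRLS / Fisher scoring for logistic regression."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))      # current predicted probabilities
        w = p * (1.0 - p)                         # IRLS weights
        H = X.T @ (w[:, None] * X)                # Fisher information X^T W X
        theta = theta + np.linalg.solve(H, X.T @ (y - p))
    return theta
```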
Online algorithm
This algorithm can be used for online regression.
For the expert network:
$U_{ij}^{(t+1)} = U_{ij}^{(t)} + h_i^{(t)} h_{j|i}^{(t)} \left( y^{(t)} - \mu_{ij}^{(t)} \right) x^{(t)T} R_{ij}^{(t)}$
$R_{ij}^{(t)} = R_{ij}^{(t-1)} - \frac{R_{ij}^{(t-1)} x^{(t)} x^{(t)T} R_{ij}^{(t-1)}}{(h_{ij}^{(t)})^{-1} + x^{(t)T} R_{ij}^{(t-1)} x^{(t)}}$
where $R_{ij}$ is the inverse covariance matrix for EN(i,j).
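A sketch of one online expert update following the two equations above; the shapes, and the use of the single weight h_ij = h_i * h_{j|i}, are my reading of the slide.

```python
import numpy as np

def online_expert_update(U, R, x, y, h_ij):
    """One recursive-least-squares-style step for expert (i, j).
    U: (output_dim, input_dim), R: (input_dim, input_dim), h_ij: posterior weight."""
    mu = U @ x                                              # current expert output
    U_new = U + h_ij * np.outer(y - mu, x @ R)              # U += h (y - mu) x^T R
    Rx = R @ x
    R_new = R - np.outer(Rx, Rx) / (1.0 / h_ij + x @ Rx)    # matrix-inversion-lemma update of R
    return U_new, R_new
```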
Online algorithm
For the gating network:
$v_i^{(t+1)} = v_i^{(t)} + S_i^{(t)} \left( \ln h_i^{(t)} - \xi_i^{(t)} \right) x^{(t)}$
$S_i^{(t)} = S_i^{(t-1)} - \frac{S_i^{(t-1)} x^{(t)} x^{(t)T} S_i^{(t-1)}}{1 + x^{(t)T} S_i^{(t-1)} x^{(t)}}$
where $S_i$ is the inverse covariance matrix, and
$v_{ij}^{(t+1)} = v_{ij}^{(t)} + S_{ij}^{(t)} h_i^{(t)} \left( \ln h_{j|i}^{(t)} - \xi_{ij}^{(t)} \right) x^{(t)}$
$S_{ij}^{(t)} = S_{ij}^{(t-1)} - \frac{S_{ij}^{(t-1)} x^{(t)} x^{(t)T} S_{ij}^{(t-1)}}{(h_i^{(t)})^{-1} + x^{(t)T} S_{ij}^{(t-1)} x^{(t)}}$
where $S_{ij}$ is the inverse covariance matrix.
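A matching sketch for one top-level gating update, under the reading above that ln h_i serves as the target for the linear predictor xi_i = v_i^T x; the lower-level update would add the h_i weight and use S_ij instead.

```python
import numpy as np

def online_gating_update(v_i, S_i, x, h_i):
    """One online step for the top-level gating parameters v_i (sketch)."""
    xi = v_i @ x                                        # current linear predictor
    v_new = v_i + (S_i @ x) * (np.log(h_i) - xi)        # v += S (ln h - xi) x
    Sx = S_i @ x
    S_new = S_i - np.outer(Sx, Sx) / (1.0 + x @ Sx)     # RLS-style update of S_i
    return v_new, S_new
```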
Conclusions
Introduces a tree-structured architecture for supervised learning.
Much faster than the traditional back-propagation algorithm.
Can be used for on-line learning.