Bayesian Dark Knowledge and Matrix Factorization
Masatoshi UeharaMentor: Oono Kenta, Brian Vogel
October 27, 2016
Contents
1 Introduction
2 Bayesian Dark Knowledge with various SG-MCMC methods
3 Matrix Factorization
(JPN) Masatoshi October 27, 2016 2 / 18
Introduction
SG-MCMC is a family of sampling algorithms for large datasets.
We apply a variety of SG-MCMC methods to Bayesian Dark Knowledge.
We combine GANs with Bayesian Dark Knowledge.
We apply SG-MCMC and neural networks to matrix factorization.
SGLD
SGLD is a method that combines SGD with the Metropolis-adjusted Langevin algorithm (MALA), a sampling algorithm.
θ_{t+1} ← θ_t − ε_t D ∇U(θ_t) + N(0, 2ε_t D)
In the case of a Bayesian neural network, the update is as follows:
Δθ_t = (ε_t / 2) [ ∇ log p(θ_t) + (N / n) Σ_{i=1}^{n} ∇ log p(y_{ti} | x_{ti}, θ_t) ] + η_t,   η_t ∼ N(0, ε_t)
Note that removing the noise term recovers plain SGD.
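The SGLD update above can be sketched in a few lines. This is a minimal illustration, not the slides' code; the function and argument names (`sgld_step`, `grad_log_prior`, etc.) are hypothetical, and the preconditioner D is taken as the identity.

```python
import numpy as np

def sgld_step(theta, grad_log_prior, grad_log_lik_minibatch, N, n, eps, rng):
    """One SGLD step: the minibatch log-likelihood gradient is rescaled by N/n,
    and Gaussian noise with variance eps is injected (dropping the noise gives SGD)."""
    grad = grad_log_prior(theta) + (N / n) * grad_log_lik_minibatch(theta)
    noise = rng.normal(0.0, np.sqrt(eps), size=theta.shape)
    return theta + 0.5 * eps * grad + noise
```

Iterating this step (after burn-in, with thinning) yields approximate posterior samples of θ.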
Bayesian Dark Knowledge with various SG-MCMC methods
Overview
Bayesian Dark Knowledge is a method that combines SGLD with the concept of distillation.
SGLD is a useful method for learning Bayesian deep networks.
The problem is that SGLD needs to store many copies of the parameters.
The motivation is to replace a set of neural networks with a single deep network.
We can estimate predictive confidence even when the amount of data is small.
Method
The teacher's predictive distribution is denoted p(y|x, D_N). The student network is denoted S(y|x, ω).
In the distillation phase, the following objective is minimized.
Distillation loss
L(ω) = − ∫ p(x) Σ_y p(y|x, D_N) log S(y|x, ω) dx
     ≈ − (1/|Θ|) (1/|D′|) Σ_{θ∈Θ} Σ_{x′∈D′} Σ_y p(y|x′, θ) log S(y|x′, ω)
Algorithm
Note that the student network is trained online, so we do not have to store many copies of the parameters.
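One online distillation step can be sketched as follows, using a linear softmax student purely for illustration (the actual student is a deep network; `distill_step` and its arguments are hypothetical names):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_step(W_student, x_batch, teacher_probs, lr):
    """Gradient step on the cross-entropy between the teacher's predictive
    probabilities and a linear softmax student, on unlabeled inputs x'."""
    student_probs = softmax(x_batch @ W_student)
    # d(cross-entropy)/d(logits) for a softmax output, averaged over the batch.
    grad_logits = (student_probs - teacher_probs) / len(x_batch)
    return W_student - lr * (x_batch.T @ grad_logits)
```

In the full algorithm, `teacher_probs` would come from the current SGLD sample θ_t, so only one teacher copy is ever held in memory.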
How to improve?
We want to make a variety of teachers.
Use other SG-MCMC methods.
How do we make an unlabeled dataset?
Use GANs.
SG-HMC and SG-NHT
SG-HMC
θ_{t+1} ← θ_t + ε_t M^{-1} r_t
r_{t+1} ← r_t − ε_t ∇U(θ_t) − ε_t C M^{-1} r_t + N(0, ε_t (2C − ε_t B_t))
SG-NHT
θ_{t+1} ← θ_t + ε_t r_t
r_{t+1} ← r_t − ε_t ∇U(θ_t) − ε_t ζ_t r_t + N(0, ε_t (2C − ε_t B_t))
ζ_{t+1} ← ζ_t + ε_t ((1/d) r_t^T r_t − 1)
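A minimal SG-NHT step following the updates above; the injected-noise level A and the r-before-θ ordering follow one common presentation of the algorithm, and all names here are illustrative:

```python
import numpy as np

def sgnht_step(theta, r, zeta, grad_U, eps, A, rng):
    """One SG-NHT update: the thermostat zeta adapts the friction so that
    the average kinetic energy (1/d) r^T r is driven toward 1."""
    r = (r - eps * grad_U(theta) - eps * zeta * r
         + rng.normal(0.0, np.sqrt(2.0 * A * eps), size=r.shape))
    theta = theta + eps * r
    zeta = zeta + eps * ((r @ r) / r.size - 1.0)
    return theta, r, zeta
```

Swapping this step for the SGLD update gives another teacher sampler for the distillation loop.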
Bayesian Dark Knowledge with GANs
GANs can mimic the empirical distribution.
In the distillation phase, we use a GAN as a simulator.
How do we remove poor-quality generated images?
Anomaly detection with GANs
[Figure: panels labeled "uLSIF" and "GAN"]
Result: MNIST
Setting: 800 labeled MNIST samples; epochs: 2000; burn-in interval: 200; thinning interval: 5.
Network: 784-1200-1200-10; activation: ReLU.
Result
Matrix Factorization
The rating matrix is given.
u_i ... user feature vector, v_j ... item feature vector, R_{ij} ... rating matrix entry.
When learning, SGD is used (with step size η):
u_i ← u_i − η ∇_{u_i} [ (R_{ij} − u_i^T v_j)^2 + λ ‖u_i‖^2 ]
v_j ← v_j − η ∇_{v_j} [ (R_{ij} − u_i^T v_j)^2 + λ ‖v_j‖^2 ]
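The per-rating SGD updates above can be written out directly; the function name and learning-rate argument are illustrative:

```python
import numpy as np

def mf_sgd_update(u, v, r_ij, lr, lam):
    """One SGD step on a single observed rating: regularized squared error.
    Both feature vectors are updated from the same prediction error."""
    err = r_ij - u @ v
    u_new = u - lr * (-2.0 * err * v + 2.0 * lam * u)
    v_new = v - lr * (-2.0 * err * u + 2.0 * lam * v)
    return u_new, v_new
```

Sweeping this step over the observed entries of R fits the factorization.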
Matrix Factorization with SGLD
p(R | U, V, τ) = ∏_{i=1}^{L} ∏_{j=1}^{M} [ N(R_{ij} | U_i^T V_j, τ^{-1}) ]^{I_{ij}}
p(U | λ_U) = ∏_{i=1}^{L} N(U_i | 0, λ_U^{-1} I)
p(V | λ_V) = ∏_{j=1}^{M} N(V_j | 0, λ_V^{-1} I)
λ_U ∼ Gamma(α_0, β_0)
λ_V ∼ Gamma(α_0, β_0)
Use Gibbs sampling.
When updating U and V, SGLD is used.
λ is tuned automatically.
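With the Gamma prior above, the precision λ_U has a conjugate Gamma full conditional, so its Gibbs update is a single draw. The shape/rate algebra below is a sketch under the model as written (note numpy's sampler takes a scale, i.e. 1/rate):

```python
import numpy as np

def sample_lambda_U(U, alpha0, beta0, rng):
    """Draw lambda_U | U ~ Gamma(alpha0 + L*d/2, beta0 + sum_i ||U_i||^2 / 2)
    in the shape/rate parameterization."""
    L, d = U.shape
    shape = alpha0 + 0.5 * L * d
    rate = beta0 + 0.5 * np.sum(U ** 2)
    return rng.gamma(shape, 1.0 / rate)  # numpy's gamma uses scale = 1/rate
```

This is the sense in which λ is "tuned automatically": it is resampled from its conditional each sweep instead of being set by hand.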
Neural Network Matrix Factorization
Estimate X_{n,m} by the equation X̂_{n,m} = f_θ(U_n, V_m).
Cost function: Σ_{(n,m)} (X_{n,m} − X̂_{n,m})^2 + λ [ Σ_n ‖U_n‖_2^2 + Σ_m ‖V_m‖_2^2 ]
θ, U_n, and V_m are updated at the same time.
NNMF is reported to reach state-of-the-art accuracy.
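A toy version of f_θ and the cost above, using a single hidden ReLU layer purely as an illustration; the architecture and all names here are assumptions, not the paper's exact network:

```python
import numpy as np

def nnmf_predict(u, v, W1, b1, w2, b2):
    """X_hat_{n,m} = f_theta(U_n, V_m): feed the concatenated feature
    vectors through one hidden ReLU layer to a scalar output."""
    h = np.maximum(0.0, np.concatenate([u, v]) @ W1 + b1)
    return h @ w2 + b2

def nnmf_cost(X, predict, U, V, observed, lam):
    """Squared error over observed entries plus L2 penalties on U and V."""
    sq = sum((X[n, m] - predict(U[n], V[m])) ** 2 for n, m in observed)
    return sq + lam * (np.sum(U ** 2) + np.sum(V ** 2))
```

Training would backpropagate this cost through θ, U_n, and V_m jointly, matching the simultaneous update described above.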
Results
Use the ML-100K and ML-1M datasets.
Evaluate by root mean squared error (RMSE).
Unfortunately, the state-of-the-art accuracy was not reproduced.
Discussion
Does data generated by GANs help classifiers?
What is a good method for combining neural networks with matrix factorization?
References
Large-Scale Distributed Bayesian Matrix Factorization usingStochastic Gradient MCMC
Neural Network Matrix Factorization
A Complete Recipe for Stochastic Gradient MCMC
Bayesian Dark Knowledge
Probabilistic Matrix Factorization