Deep Style: Using Variational Auto-encoders for Image Generation


TJ Torres Data Scientist, Stitch Fix

PyData NYC 2015


Data Labs


MOTIVATION

Our goal at Stitch Fix

[Figure: Total Inventory -> Recommendation Algo -> Filtered Items -> Stylists -> Final Items Sent]

COLD START PROBLEM

New Clients New Clothing


1. Get new clothing.

2. Get new clients.

3. ????????

4. PROFIT!!!


Preemptive Modeling


TURN TO IMAGES

• Style/fashion is primarily visual.

• We wish to use images for modeling purposes.

• Heuristics for how we process image data are unknown or quite complex.

• We don’t want to have to develop image features.

• Turn to deep learning to learn the feature extraction.

OUTLINE

1. Introduction to NNs

2. Unsupervised Deep Learning

3. Getting started with Chainer

4. Training a simple model.

5. Open source package!

6. Conclusions/Future (current) Directions

NEURAL NETWORKS

http://www.wired.com/2013/02/three-awesome-tools-scientists-may-use-to-map-your-brain-in-the-future/


http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html


Whoa, Dude!



http://arxiv.org/pdf/1502.04623v2.pdf


INTRO TO NEURAL NETS

Begin with input:

[Figure: network diagram with nodes 1-4 in layer 1 (input) and nodes 5-6 in layer 2]

$$ f^{(l)}_i(x) = \tanh\left( \sum_j W^{(l)}_{ij} x^{(l-1)}_j + b^{(l)} \right) $$


[Figure: the same network extended with layer 3 (output)]

Transform data repeatedly with non-linear function.

$$ f^{(1)} \circ \cdots \circ f^{(n)}(x) $$


Calculate loss function and update weights

$$ L(x_{out}, y) = \overbrace{\frac{1}{m} \sum_{k=1}^{m} (x_k - y_k)^2}^{\text{MSE}} $$


$$ W^{(l)*}_{ij} = W^{(l)}_{ij} - \alpha \frac{\partial L}{\partial W^{(l)}_{ij}} $$



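$$ \frac{\partial L}{\partial W^{(l)}_{ij}} = \left( \frac{\partial L}{\partial x_{out}} \right) \left( \frac{\partial x_{out}}{\partial f^{(n-1)}} \right) \cdots \left( \frac{\partial f^{(l)}}{\partial W^{(l)}_{ij}} \right) $$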
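The forward pass, MSE loss, chain-rule gradient and weight update from these slides can be sketched in NumPy. This is a minimal illustration, not the speaker's code: the 4-input, 2-hidden-node sizes follow the slide diagram, while the single output node, the input values, the learning rate `alpha` and all function names are assumptions. The update is plain gradient descent, `W - alpha * dL/dW`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes follow the slide diagram: 4 input nodes, 2 hidden nodes.
# The single output node and all numeric values are assumptions.
n_in, n_hidden, n_out = 4, 2, 1
W1 = rng.normal(scale=0.5, size=(n_hidden, n_in))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=(n_out, n_hidden))
b2 = np.zeros(n_out)


def forward(x, W1, b1, W2, b2):
    """Apply f(x) = tanh(W x + b) once per layer."""
    h = np.tanh(W1 @ x + b1)
    x_out = np.tanh(W2 @ h + b2)
    return h, x_out


def mse(x_out, y):
    """L = (1/m) * sum_k (x_k - y_k)^2."""
    return np.mean((x_out - y) ** 2)


def gradients(x, y, W1, b1, W2, b2):
    """Backpropagate dL/dW through both layers with the chain rule."""
    h, x_out = forward(x, W1, b1, W2, b2)
    d_out = 2.0 * (x_out - y) / y.size   # dL/dx_out
    d_z2 = d_out * (1.0 - x_out ** 2)    # through the output tanh
    grad_W2 = np.outer(d_z2, h)
    grad_b2 = d_z2
    d_h = W2.T @ d_z2                    # dL/dh
    d_z1 = d_h * (1.0 - h ** 2)          # through the hidden tanh
    grad_W1 = np.outer(d_z1, x)
    grad_b1 = d_z1
    return grad_W1, grad_b1, grad_W2, grad_b2


x = np.array([0.5, -1.0, 0.25, 2.0])
y = np.array([0.3])
alpha = 0.05  # learning rate (hypothetical value)

loss_before = mse(forward(x, W1, b1, W2, b2)[1], y)
gW1, gb1, gW2, gb2 = gradients(x, y, W1, b1, W2, b2)

# Plain gradient-descent update: W <- W - alpha * dL/dW
W1_new, b1_new = W1 - alpha * gW1, b1 - alpha * gb1
W2_new, b2_new = W2 - alpha * gW2, b2 - alpha * gb2
loss_after = mse(forward(x, W1_new, b1_new, W2_new, b2_new)[1], y)
```

Comparing `loss_before` with `loss_after` shows the effect of one update. Frameworks such as Chainer compute the `gradients` step automatically.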





WHY DEEP LEARNING?

1) With no hidden layers, NNs resemble just a linear transformation.

2) Shallow networks approximate PCA.

3) Composing non-linear activation functions adds increasing nonlinearity.

$$ f^{(1)} \circ \cdots \circ f^{(n)}(x) $$

4) Learn more complex/nonlinear models with deep architectures.
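A small NumPy experiment illustrates points 1) and 3) above. It is only a sketch: the layer sizes are hypothetical and biases are left out for brevity. Stacked layers without an activation collapse into a single matrix, while inserting tanh between them breaks linearity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three stacked layers with hypothetical sizes 6 -> 5 -> 4 -> 3.
weights = [rng.normal(size=(5, 6)),
           rng.normal(size=(4, 5)),
           rng.normal(size=(3, 4))]
x = rng.normal(size=6)


def stack(x, weights, activation):
    """Apply each layer in turn: x <- activation(W @ x)."""
    for W in weights:
        x = activation(W @ x)
    return x


# Without a non-linearity the whole stack equals one matrix product.
collapsed = weights[2] @ weights[1] @ weights[0]
linear_out = stack(x, weights, activation=lambda v: v)
same_as_single_layer = np.allclose(linear_out, collapsed @ x)

# With tanh between layers the map is no longer linear:
# doubling the input does not double the output.
tanh_out = stack(x, weights, activation=np.tanh)
tanh_out_doubled = stack(2 * x, weights, activation=np.tanh)
is_linear_with_tanh = np.allclose(tanh_out_doubled, 2 * tanh_out)
```

Here `same_as_single_layer` comes out true and `is_linear_with_tanh` false, which is the reason depth only helps once non-linear activations are composed.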

DL WITH SUPERVISION

Most deep learning methods rely on supervised training data.

MO:

Feature Extraction w/ Deep Learning -> Final Classification Layer(s)

http://parse.ele.tue.nl/education/cluster2

ISSUES FOR STYLE

PROBLEM: No reliable system of style labels for image data.

Thankfully, we can learn feature representations from unlabeled data.

The key is to compress the data with a nonlinear encoding process.

UNSUPERVISED DEEP LEARNING


AUTO-ENCODERS

Two different processes combined into one.

1) Encoding (inferential)

2) Decoding (generative)

[Figure: Original Image -> Encode -> Compressed Data -> Decode -> Reconstructed Image]


Training:

1) Initialize to random weights in layers.

2) Full forward pass of batch through encoding and then decoding of encoded rep.

3) Construct loss via MSE of original data to reconstructed data.

4) Calculate gradients and backprop through to train new weights.

5) Iterate.
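The five training steps can be sketched as a tiny NumPy auto-encoder. Everything concrete here is an assumption made for illustration: the toy data (8 "pixels" generated from 2 hidden factors), the single tanh encoding layer, the linear decoder, the learning rate and the step count. It is not the model used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for flattened images: 64 samples with 8 "pixels" that
# really depend on only 2 underlying factors (all sizes hypothetical).
factors = rng.normal(size=(64, 2))
data = np.tanh(factors @ rng.normal(size=(2, 8)))

n_pixels, n_code = 8, 2
alpha = 0.1  # learning rate (hypothetical value)

# 1) Initialize to random weights in layers.
W_enc = rng.normal(scale=0.1, size=(n_pixels, n_code))
W_dec = rng.normal(scale=0.1, size=(n_code, n_pixels))

losses = []
for step in range(500):  # 5) Iterate.
    # 2) Forward pass: encode, then decode the encoded representation.
    code = np.tanh(data @ W_enc)
    recon = code @ W_dec

    # 3) MSE between the original and the reconstructed data.
    err = recon - data
    losses.append(np.mean(err ** 2))

    # 4) Backpropagate the gradients and update the weights.
    d_recon = 2.0 * err / err.size
    grad_dec = code.T @ d_recon
    d_code = (d_recon @ W_dec.T) * (1.0 - code ** 2)
    grad_enc = data.T @ d_code
    W_dec -= alpha * grad_dec
    W_enc -= alpha * grad_enc
```

The `losses` list records the reconstruction error per iteration, so plotting it is a quick way to watch training progress.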

AUTO-ENCODER ISSUES

1) AEs will often overfit unless the amount of training data is large.

2) Gradients diminish quickly, so weight corrections are small “far away” from the output.

SOLUTION

1) Use variational component to “regularize” training.

2) *Not Covered* Stack auto-encoders and train greedily (DBN)

MANY deep learning frameworks!!!

[Figure: framework logos, including Passage]

Easy-to-use framework for training Neural Networks.

INTRO TO CHAINER

BASIC OBJECTS

Variables: Wrapper on ndarrays.

Functions: Operate on Variable objects.

Operations of functions on variables are memorized in sequence.

Backpropagation is done by automatic differentiation, moving backwards through the sequence of operations.

    import numpy as np
    import chainer

    x = np.ones(1) * 5
    y = np.ones(1) * 3
    x = chainer.Variable(x)
    y = chainer.Variable(y)
    z = x**2 + y**2 + 2*y

    In [3]: z.data
    Out[3]: array([ 40.])

    # calculate gradients
    z.backward()
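To make the "memorize the operations, then walk them backwards" idea concrete, here is a toy reverse-mode differentiator in pure Python. The `Var` class is a hypothetical stand-in written for this illustration; it is not Chainer's `Variable` and supports only scalar addition, multiplication and constant powers.

```python
class Var:
    """A scalar value that remembers how it was computed."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        # Each parent is stored with the local derivative d(self)/d(parent).
        self.parents = parents

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.data * other.data,
                   ((self, other.data), (other, self.data)))

    __radd__ = __add__
    __rmul__ = __mul__

    def __pow__(self, exponent):
        # Only constant exponents are supported in this sketch.
        return Var(self.data ** exponent,
                   ((self, exponent * self.data ** (exponent - 1)),))

    def backward(self):
        """Walk the recorded operations in reverse, accumulating gradients."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad


# Same expression as on the slide.
x = Var(5.0)
y = Var(3.0)
z = x**2 + y**2 + 2*y
z.backward()
# z.data is 40.0, x.grad is 10.0 (= 2x) and y.grad is 8.0 (= 2y + 2)
```

Because the graph is built while the forward expression runs, ordinary Python control flow can be used inside the forward pass.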

Steps to NN

1. Define a model using chainer.FunctionSet

   1. Contains all parametric functions.

   2. Simple way to wrap computational elements into one object.

2. Design and code forward network pass.

3. Set optimizer: chainer.optimizers

4. Make a train script which iteratively passes batches forward through the network and updates the weights:

    loss.backward()
    optimizer.update()

ADVANTAGES

1. Forward passes through networks are intuitive and easily debugged.

2. Can use arbitrary control flow statements.

3. Backpropagation easily implemented through backwards traversal of computational graph.

4. High level of readability.

BUILDING A SIMPLE AUTO-ENCODER

MODEL SETUP

    import chainer
    import chainer.functions as F
    from chainer import optimizers

    # layer setup (img_size, n0 and encoding_size are integers chosen beforehand)
    layers = {}

    # encoding layers
    layers['encode0'] = F.Linear(img_size, n0)
    layers['encode1'] = F.Linear(n0, 2*encoding_size)

    # decoding layers
    layers['decode0'] = F.Linear(encoding_size, n0)
    layers['decode1'] = F.Linear(n0, img_size)

    # model setup
    model = chainer.FunctionSet(**layers)
    optimizer = optimizers.Adam()
    optimizer.setup(model)

ENCODING

    # Encoder
    input = chainer.Variable(input)
    # keep `input` intact: the loss later compares it with the reconstruction
    hidden = F.relu(model.encode0(input))
    latent = F.relu(model.encode1(hidden))

VARIATIONAL STEP

Sample from distribution:

$$ q_\phi(z) = \mathcal{N}\left(z; \mu^{(i)}, \sigma^{2(i)} I\right) $$

    # Variational layer
    # first half of `latent` is the mean; the second half (`std`) is used
    # as the log-variance, hence exp(0.5 * std) below
    mean, std = F.split_axis(latent, 2, 1)
    # cast to float32 to match the layer outputs
    noise = np.random.standard_normal(mean.data.shape).astype(np.float32)
    sampled = noise * F.exp(0.5 * std) + mean

DECODING

    # Decoder
    output = F.relu(model.decode0(sampled))
    reconstruction = F.sigmoid(model.decode1(output))

UPDATE

$$ L(x) = D_{KL}\left(q_\phi(z) \,\|\, \mathcal{N}(0, I)\right) + \mathrm{MSE}(x, y_{out}) $$

    # Loss is just MSE
    loss = F.mean_squared_error(reconstruction, input)

    # "Regularize" the latent vector
    loss += F.gaussian_kl_divergence(mean, std)

    # backprop
    optimizer.zero_grads()
    loss.backward()
    optimizer.update()
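For readers without a Chainer install, the encode, sample, decode and loss steps can be traced in plain NumPy. This is a forward-pass sketch only, with no training. The layer sizes, the random stand-in weights, the fake input batch and the helper names (`linear`, `gaussian_kl`) are all assumptions. The `gaussian_kl` helper is written to follow the closed-form KL term between a diagonal Gaussian and N(0, I), taking a mean and a log-variance as inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the slides leave img_size, n0 and encoding_size open.
batch, img_size, n0, encoding_size = 4, 12, 6, 2


def linear(n_in, n_out):
    """Random weights and zero bias standing in for a trained linear layer."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)


def relu(a):
    return np.maximum(a, 0.0)


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def gaussian_kl(mean, ln_var):
    """KL(N(mean, exp(ln_var)) || N(0, I)), summed over all elements."""
    return 0.5 * np.sum(mean ** 2 + np.exp(ln_var) - ln_var - 1.0)


enc0, enc1 = linear(img_size, n0), linear(n0, 2 * encoding_size)
dec0, dec1 = linear(encoding_size, n0), linear(n0, img_size)

x = rng.uniform(size=(batch, img_size))  # fake images in [0, 1]

# Encoder: the last layer outputs mean and log-variance side by side.
hidden = relu(x @ enc0[0] + enc0[1])
latent = relu(hidden @ enc1[0] + enc1[1])
mean, ln_var = np.split(latent, 2, axis=1)

# Variational step: z = mean + sigma * noise, with sigma = exp(ln_var / 2).
noise = rng.standard_normal(mean.shape)
sampled = noise * np.exp(0.5 * ln_var) + mean

# Decoder and loss: reconstruction error plus the KL regularizer.
output = relu(sampled @ dec0[0] + dec0[1])
reconstruction = sigmoid(output @ dec1[0] + dec1[1])
mse = np.mean((reconstruction - x) ** 2)
kl = gaussian_kl(mean, ln_var)
loss = mse + kl
```

Writing the sample as `mean + sigma * noise` keeps the randomness outside the network parameters, so gradients can still flow through `mean` and `ln_var` during backpropagation.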

AFTER TRAINING

RESULTS

Still testing the efficacy of modeling style with the encoded space.

Normally, the generative portion would be thrown out after training, but here we can use it to look at our style space.

TRY IT YOURSELF

https://github.com/stitchfix/fauxtograph


COMMAND LINE TOOL

$ pip install fauxtograph

$ fauxtograph download images/

$ fauxtograph train images/ models/model_out

$ fauxtograph generate models/model_out generated_images/

source: @genekogan

http://www.apple.com

FUTURE DIRECTIONS

Issues with scaling to high resolution.

For 100x200 RGB Image: 100x200x3 = 60,000 node input layer

60,000 x (step down layer 4000) = 240M weights

240M x 32-bits = ~960 MB

Add Convolution Layers:

1) Reduce # of parameters.

2) Add translation robustness.

3) Hierarchical feature structure.
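The memory arithmetic for the fully connected first layer is easy to reproduce, and the same few lines show why convolution reduces the parameter count. The dense-layer numbers come from the slide. The convolution layer (32 filters of size 5x5) is a hypothetical example chosen only for comparison.

```python
# Dense first layer from the slide: 100x200 RGB image into 4000 nodes.
height, width, channels = 100, 200, 3
input_nodes = height * width * channels
hidden_nodes = 4000
bytes_per_weight = 4  # 32-bit floats

dense_weights = input_nodes * hidden_nodes
dense_megabytes = dense_weights * bytes_per_weight / 1e6

# Hypothetical convolution layer for comparison: 32 filters of size 5x5
# over the 3 colour channels, plus one bias per filter.
n_filters, kernel = 32, 5
conv_weights = n_filters * (kernel * kernel * channels + 1)

print(input_nodes)      # 60000
print(dense_weights)    # 240000000
print(dense_megabytes)  # 960.0
print(conv_weights)     # 2432
```

The convolution filters are shared across every image position, so their count does not grow with the image resolution.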

COMING SOON

CONCLUSIONS

1) Style feature space would help resolve cold-start problem for both clients and items.

2) Auto-encoders are useful for deducing feature space in an unsupervised way.

3) Turn to VAEs for a drag-and-drop way to prevent overfitting.

4) Convolution is on its way.

You can check out the branch: convolutional-vae

QUESTIONS?

Original VAE Paper: http://arxiv.org/abs/1312.6114

Blog Post: http://multithreaded.stitchfix.com/blog/2015/09/17/deep-style/

APPENDIX: VARIATIONAL INFERENCE

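Want to solve for posterior:

$$ p_\theta(z|x) = \frac{p_\theta(x|z)\, p_\theta(z)}{p_\theta(x)} $$

But posterior can be intractable to calculate efficiently.

Approximate

$$ p_\theta(z|x) \approx q_\phi(z) $$

Minimize KL Divergence

$$ D_{KL}\left(q_\phi(z) \,\|\, p_\theta(z|x)\right) = \int dz\, q_\phi(z) \ln\left( \frac{q_\phi(z)}{p_\theta(z|x)} \right) $$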

APPENDIX: VARIATIONAL AUTO-ENCODER

Auto-encoder learns/infers in the Bayesian sense too.

Learning encoding is equivalent to maximizing likelihood:

$$ \arg\max_z \, p_\theta(x|z) $$

And generating decoding by maximizing posterior:

$$ \arg\max_x \, p_\theta(z|x) $$

Apply variational inference at the decoding step to calculate posterior.

Auto-encoder now models distributions for latent space.

If we guess a normal form for our “variational distribution” …


$$ D_{KL}\left(q_\phi(z) \,\|\, p_\theta(z|x)\right) = \log\frac{\sigma_2}{\sigma_1} + \frac{\left(\sigma_1^2 - \sigma_2^2\right) + \underbrace{(\mu_1 - \mu_2)^2}_{\text{L2 Loss}}}{2\sigma_2^2} $$

$$ = \sum_i \left( \frac{1}{2}\left[ \sigma_i^2 + \mu_i^2 - 1 \right] - \log \sigma_i \right) $$

Drop in loss term to regularize latent space!
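The closed-form expressions can be checked numerically. The sketch below uses only the standard library. It compares the general two-Gaussian formula with its special case for a standard normal target (mu_2 = 0, sigma_2 = 1), and compares both against a direct midpoint-rule evaluation of the KL integral. The function names and the sample means and standard deviations are hypothetical.

```python
import math


def kl_normal(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) in closed form."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 - sigma2 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2))


def kl_to_standard_normal(mus, sigmas):
    """Sum over latent dimensions of KL( N(mu_i, sigma_i^2) || N(0, 1) )."""
    return sum(0.5 * (s ** 2 + m ** 2 - 1.0) - math.log(s)
               for m, s in zip(mus, sigmas))


def kl_numeric(mu1, sigma1, mu2, sigma2, lo=-20.0, hi=20.0, steps=40000):
    """Midpoint-rule estimate of the integral of q(z) * ln(q(z) / p(z))."""
    def log_pdf(z, mu, sigma):
        return (-0.5 * ((z - mu) / sigma) ** 2
                - math.log(sigma * math.sqrt(2 * math.pi)))

    dz = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        z = lo + (k + 0.5) * dz
        log_q = log_pdf(z, mu1, sigma1)
        total += math.exp(log_q) * (log_q - log_pdf(z, mu2, sigma2)) * dz
    return total


mus, sigmas = [0.5, -1.0], [0.8, 1.5]  # hypothetical encoder outputs
general = sum(kl_normal(m, s, 0.0, 1.0) for m, s in zip(mus, sigmas))
special = kl_to_standard_normal(mus, sigmas)
numeric = kl_numeric(0.5, 0.8, 0.0, 1.0)
```

`general` and `special` agree up to floating-point rounding, and `numeric` matches `kl_normal(0.5, 0.8, 0.0, 1.0)` closely. The term is zero only when the encoder outputs exactly mu = 0 and sigma = 1, which is what makes it act as a regularizer on the latent space.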



