Deep Style
TJ Torres Data Scientist, Stitch Fix
PyData NYC 2015
Using Variational Auto-encoders for Image Generation
Data Labs
MOTIVATION
Our goal at Stitch Fix
[Figure: pipeline diagram with the stages Total Inventory, Recommendation Algo, Stylists, Filtered Items (items numbered 1 to 5), and Final Items Sent]
COLD START PROBLEM
New Clients New Clothing
1. Get new clothing.
2. Get new clients.
3. ????????
4. PROFIT!!!
Preemptive Modeling
TURN TO IMAGES
• Style/fashion is primarily visual.
• We wish to use images for modeling purposes.
• Heuristics for how we process image data are unknown or quite complex.
• We don’t want to have to develop image features by hand.
• Turn to deep learning to learn the feature extraction.
OUTLINE
1. Introduction to NNs
2. Unsupervised Deep Learning
3. Getting started with Chainer
4. Training a simple model
5. Open source package!
6. Conclusions/Future (current) Directions
NEURAL NETWORKS
http://www.wired.com/2013/02/three-awesome-tools-scientists-may-use-to-map-your-brain-in-the-future/
http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html
Whoa, Dude!
http://arxiv.org/pdf/1502.04623v2.pdf
INTRO TO NEURAL NETS
[Figure: three-layer network diagram with layer 1 (input), layer 2, and layer 3 (output); nodes numbered 1 to 6]
Begin with input:
$$ f^{(l)}_i(x) = \tanh\left( \sum_j W^{(l)}_{ij} x^{(l-1)}_j + b^{(l)} \right) $$
Transform data repeatedly with non-linear function:
$$ f^{(1)} \circ \cdots \circ f^{(n)}(x) $$
Calculate loss function and update weights:
$$ L(x_{out}, y) = \overbrace{\frac{1}{m} \sum_{k=1}^{m} (x_k - y_k)^2}^{\mathrm{MSE}} $$
$$ W^{(l)*}_{ij} = W^{(l)}_{ij} - \alpha \frac{\partial L}{\partial W^{(l)}_{ij}} $$
$$ \frac{\partial L}{\partial W^{(l)}_{ij}} = \left( \frac{\partial L}{\partial x_{out}} \right) \left( \frac{\partial x_{out}}{\partial f^{(n-1)}} \right) \cdots \left( \frac{\partial f^{(l)}}{\partial W^{(l)}_{ij}} \right) $$
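As a rough illustration of the forward pass, MSE loss, and chain-rule weight update described in this section, here is a minimal NumPy sketch. The toy data, layer sizes, and learning rate `alpha` are all made up for the example, and the update shown is the standard gradient-descent step; it is not code from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 samples, 3 input features, 2 targets (all values are made up).
x = rng.normal(size=(4, 3))
y = rng.uniform(-0.5, 0.5, size=(4, 2))

# One hidden layer of 5 units; sizes and learning rate are arbitrary choices.
W1, b1 = rng.normal(scale=0.5, size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(scale=0.5, size=(5, 2)), np.zeros(2)
alpha = 0.1

def forward(inputs):
    h = np.tanh(inputs @ W1 + b1)   # first non-linear transform
    out = np.tanh(h @ W2 + b2)      # second non-linear transform
    return h, out

def mse(out, target):
    return np.mean((out - target) ** 2)

losses = []
for step in range(200):
    h, out = forward(x)
    losses.append(mse(out, y))
    # Backward pass: chain rule applied layer by layer.
    d_out = 2.0 * (out - y) / out.size      # dL/d(out)
    d_z2 = d_out * (1.0 - out ** 2)         # through the output tanh
    dW2, db2 = h.T @ d_z2, d_z2.sum(axis=0)
    d_h = d_z2 @ W2.T
    d_z1 = d_h * (1.0 - h ** 2)             # through the hidden tanh
    dW1, db1 = x.T @ d_z1, d_z1.sum(axis=0)
    # Gradient-descent update: W <- W - alpha * dL/dW
    W1 -= alpha * dW1
    b1 -= alpha * db1
    W2 -= alpha * dW2
    b2 -= alpha * db2
```

The list `losses` records the MSE at each step, so plotting it is a quick way to confirm that the updates are moving in the right direction.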
WHY DEEP LEARNING?
1) With no hidden layers, a NN resembles a simple linear transformation.
2) Shallow networks approximate PCA.
3) Composing non-linear activation functions adds increasing nonlinearity: $f^{(1)} \circ \cdots \circ f^{(n)}(x)$
4) Deep architectures can learn more complex, nonlinear models.
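One way to see why the non-linear activation matters: without it, stacked layers collapse into a single linear map. The NumPy sketch below checks this on random matrices (shapes and values are arbitrary, chosen only for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))
W1 = rng.normal(size=(4, 6))
W2 = rng.normal(size=(6, 3))

# Two stacked layers with no activation are equivalent to one matrix W1 @ W2.
stacked_linear = (x @ W1) @ W2
single_linear = x @ (W1 @ W2)
linear_collapses = np.allclose(stacked_linear, single_linear)

# With tanh in between, the output no longer matches that single linear map.
stacked_tanh = np.tanh(x @ W1) @ W2
tanh_differs = not np.allclose(stacked_tanh, single_linear)
```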
DL WITH SUPERVISION
Most deep learning methods rely on supervised training data.
MO:
Feature Extraction w/ Deep Learning
Final Classification Layer(s)
http://parse.ele.tue.nl/education/cluster2
ISSUES FOR STYLE
PROBLEM: No reliable system of style labels for image data.
Thankfully, we can learn feature representations from unlabeled data.
The key is to compress the data with a nonlinear encoding process.
UNSUPERVISED DEEP LEARNING
AUTO-ENCODERS
Two different processes combined into one.
1) Encoding (inferential)
2) Decoding (generative)
[Figure: Original Image → Encode → Compressed Data → Decode → Reconstructed Image]
Training:
1) Initialize layers with random weights.
2) Run a full forward pass of a batch through encoding, then decoding of the encoded representation.
3) Construct the loss via MSE between the original data and the reconstructed data.
4) Calculate gradients and backprop through to train new weights.
5) Iterate.
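The five training steps can be sketched in plain NumPy as follows. This is only an illustrative single-hidden-layer auto-encoder on synthetic stand-in data; the sizes (`n_in`, `n_code`), the learning rate `alpha`, and the tanh/sigmoid activations are assumptions for the example, not the configuration used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "images": 64 samples of 16 pixels lying near a 2-D subspace,
# squashed into (0, 1). Real image data would replace this.
codes = rng.normal(size=(64, 2))
mix = rng.normal(size=(2, 16))
data = 1.0 / (1.0 + np.exp(-(codes @ mix)))

n_in, n_code, alpha = 16, 4, 0.5

# 1) Initialize layers with random weights.
W_enc = rng.normal(scale=0.1, size=(n_in, n_code))
b_enc = np.zeros(n_code)
W_dec = rng.normal(scale=0.1, size=(n_code, n_in))
b_dec = np.zeros(n_in)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

history = []
for epoch in range(500):                       # 5) Iterate.
    # 2) Forward pass: encode, then decode the encoded representation.
    code = np.tanh(data @ W_enc + b_enc)
    recon = sigmoid(code @ W_dec + b_dec)
    # 3) MSE between original and reconstructed data.
    loss = np.mean((recon - data) ** 2)
    history.append(loss)
    # 4) Gradients via backprop, then a gradient-descent weight update.
    d_recon = 2.0 * (recon - data) / data.size
    d_a2 = d_recon * recon * (1.0 - recon)     # through the sigmoid
    dW_dec, db_dec = code.T @ d_a2, d_a2.sum(axis=0)
    d_code = d_a2 @ W_dec.T
    d_a1 = d_code * (1.0 - code ** 2)          # through the tanh
    dW_enc, db_enc = data.T @ d_a1, d_a1.sum(axis=0)
    W_enc -= alpha * dW_enc
    b_enc -= alpha * db_enc
    W_dec -= alpha * dW_dec
    b_dec -= alpha * db_dec
```

After training, `code` holds the compressed representation of each sample and `recon` the reconstruction; `history` tracks the reconstruction loss per epoch.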
AUTO-ENCODER ISSUES
1) AEs will often overfit unless the amount of training data is large.
2) Gradients diminish quickly, so weight corrections are small in layers “far away” from the output.
SOLUTION
1) Use variational component to “regularize” training.
2) *Not Covered* Stack auto-encoders and train greedily (DBN)
MANY deep learning frameworks!!!
[Figure: logos of deep learning frameworks, including Passage]
Easy-to-use framework for training Neural Networks.
INTRO TO CHAINER
BASIC OBJECTS
Variables: wrapper on ndarrays.
Functions: operate on Variable objects.
Operations of functions on variables are memorized in sequence.
Backpropagation is done by automatic differentiation, moving backwards through the sequence of operations.
import numpy as np
import chainer

x = np.ones(1)*5
y = np.ones(1)*3
x = chainer.Variable(x)
y = chainer.Variable(y)
z = x**2 + y**2 + 2*y

In [3]: z.data
Out[3]: array([ 40.])

# calculate gradients
z.backward()
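For reference, the gradients that backpropagation should produce for z = x**2 + y**2 + 2*y can be worked out by hand: dz/dx = 2x and dz/dy = 2y + 2. The pure-Python sketch below (no Chainer required; the helper names are made up for this example) computes them at x = 5, y = 3 and cross-checks against finite differences.

```python
def z(x, y):
    return x**2 + y**2 + 2*y

def grad_z(x, y):
    # Analytic partial derivatives: dz/dx = 2x, dz/dy = 2y + 2
    return 2*x, 2*y + 2

def numeric_grad(f, x, y, eps=1e-6):
    # Central finite differences as an independent check.
    dx = (f(x + eps, y) - f(x - eps, y)) / (2*eps)
    dy = (f(x, y + eps) - f(x, y - eps)) / (2*eps)
    return dx, dy

value = z(5.0, 3.0)                  # 40.0, matching z.data above
analytic = grad_z(5.0, 3.0)          # (10.0, 8.0)
numeric = numeric_grad(z, 5.0, 3.0)  # approximately (10.0, 8.0)
```

In Chainer, the corresponding values would be expected in the `grad` attribute of the input variables after the backward call.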
Steps to NN
1. Define a model using chainer.FunctionSet
   1. Contains all parametric functions.
   2. Simple way to wrap computational elements into one object.
2. Design and code the forward network pass.
3. Set optimizer: chainer.optimizers
4. Make a train script which iteratively passes batches forward through the network and updates the weights:
   loss.backward()
   optimizer.update()
ADVANTAGES
1. Forward passes through networks are intuitive and easily debugged.
2. Can use arbitrary control flow statements.
3. Backpropagation is easily implemented through backwards traversal of the computational graph.
4. High level of readability.
BUILDING A SIMPLE AUTO-ENCODER
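MODEL SETUP
import chainer
import chainer.functions as F
from chainer import optimizers

# img_size, n0, and encoding_size are integers chosen by the user
# layer setup
layers = {}
# encoding layers (encode1 outputs two values per latent dimension)
layers['encode0'] = F.Linear(img_size, n0)
layers['encode1'] = F.Linear(n0, 2*encoding_size)
# decoding layers
layers['decode0'] = F.Linear(encoding_size, n0)
layers['decode1'] = F.Linear(n0, img_size)
# model setup
model = chainer.FunctionSet(**layers)
optimizer = optimizers.Adam()
optimizer.setup(model)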
ENCODING
# Encoder
input = chainer.Variable(input)
# keep `input` intact so the loss can compare the reconstruction against it
hidden = F.relu(model.encode0(input))
latent = F.relu(model.encode1(hidden))
VARIATIONAL STEP
Sample from distribution:
$$ q_\phi(z) = \mathcal{N}(z; \mu^{(i)}, \sigma^{2(i)} I) $$
# Variational layer
# split latent into the µ half and the σ half; `std` holds the log-variance ln(σ²)
mean, std = F.split_axis(latent, 2, 1)
noise = np.random.standard_normal(mean.data.shape).astype(np.float32)
sampled = noise * F.exp(0.5 * std) + mean
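To make the sampling step concrete, here is a small NumPy-only sketch of the same idea: draw standard normal noise, scale it by σ = exp(0.5 · log-variance), and shift by the mean. The example `mean` and `ln_var` values and the helper name `sample_latent` are hypothetical, chosen only so the empirical statistics can be compared with the intended ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder output for one image: 3 latent means and log-variances.
mean = np.array([0.5, -1.0, 2.0])
ln_var = np.log(np.array([0.25, 1.0, 4.0]))   # sigma = 0.5, 1.0, 2.0

def sample_latent(mean, ln_var, n_samples, rng):
    # z = mean + sigma * noise, with noise ~ N(0, I) and sigma = exp(0.5 * ln_var)
    noise = rng.standard_normal((n_samples,) + mean.shape)
    return mean + np.exp(0.5 * ln_var) * noise

samples = sample_latent(mean, ln_var, 200_000, rng)
empirical_mean = samples.mean(axis=0)
empirical_std = samples.std(axis=0)
```

Because the randomness lives entirely in `noise`, the sample stays a differentiable function of the mean and log-variance, which is what lets gradients flow back into the encoder.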
DECODING
# Decoder
output = F.relu(model.decode0(sampled))
reconstruction = F.sigmoid(model.decode1(output))
UPDATE
$$ L(x) = D_{KL}(q_\phi(z) \,\|\, \mathcal{N}(0, I)) + \mathrm{MSE}(x, y_{out}) $$
# Loss is just MSE
loss = F.mean_squared_error(reconstruction, input)
# “Regularize” the latent vector
loss += F.gaussian_kl_divergence(mean, std)
# backprop
optimizer.zero_grads()
loss.backward()
optimizer.update()
AFTER TRAINING
RESULTS
Still testing the efficacy of modeling style with the encoded space.
Normally, the generative portion would be thrown out after training, but here we can use it to look at our style space.
COMMAND LINE TOOL
$ pip install fauxtograph
$ fauxtograph download images/
$ fauxtograph train images/ models/model_out
$ fauxtograph generate models/model_out generated_images/
source: @genekogan
FUTURE DIRECTIONS
Issues with scaling to high resolution.
For 100x200 RGB Image:
100x200x3 = 60,000 node input layer
60,000 x (step down layer 4000) = 240M weights
240M x 32 bits = ~960 MB
Add Convolution Layers:
1) Reduce # of parameters.
2) Add translation robustness.
3) Hierarchical feature structure.
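The memory estimate above is simple arithmetic and can be reproduced in a few lines of Python. The convolutional comparison at the end is an assumed example (32 filters of size 5x5) added only to show the scale of the difference; it is not a figure from the talk.

```python
height, width, channels = 100, 200, 3
input_nodes = height * width * channels      # 60,000 node input layer
hidden_nodes = 4000                          # step-down layer from the slide
weights = input_nodes * hidden_nodes         # 240,000,000 weights
bytes_fp32 = weights * 4                     # 32 bits = 4 bytes per weight
megabytes = bytes_fp32 / 1e6                 # 960.0 MB

# Assumed comparison: 32 convolution filters of size 5x5 over 3 channels,
# counting one bias per filter.
conv_filters, kernel = 32, 5
conv_weights = conv_filters * kernel * kernel * channels + conv_filters
```

Under these assumptions the first fully connected layer alone needs 240M weights, while the convolutional layer needs 2,432 parameters, because its filters are shared across image positions.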
COMING SOON
CONCLUSIONS
1) A style feature space would help resolve the cold-start problem for both clients and items.
2) Auto-encoders are useful for deducing a feature space in an unsupervised way.
3) Turn to VAEs for a drop-in way to prevent overfitting.
4) Convolution is on its way.
You can check out the branch: convolutional-vae
QUESTIONS?
Original VAE Paper: http://arxiv.org/abs/1312.6114
Blog Post: http://multithreaded.stitchfix.com/blog/2015/09/17/deep-style/
APPENDIX: VARIATIONAL INFERENCE
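Want to solve for posterior:
$$ p_\theta(z|x) = \frac{p_\theta(x|z)\, p_\theta(z)}{p_\theta(x)} $$
But posterior can be intractable to calculate efficiently.
Approximate:
$$ p_\theta(z|x) \approx q_\phi(z) $$
Minimize KL Divergence:
$$ D_{KL}(q_\phi(z) \,\|\, p_\theta(z|x)) = \int dz\, q_\phi(z) \ln\left( \frac{q_\phi(z)}{p_\theta(z|x)} \right) $$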
APPENDIX: VARIATIONAL AUTO-ENCODER
Auto-encoder learns/infers in the Bayesian sense too.
Learning encoding is equivalent to maximizing likelihood:
$$ \arg\max_z \; p_\theta(x|z) $$
And generating decoding by maximizing posterior:
$$ \arg\max_x \; p_\theta(z|x) $$
Apply variational inference at the decoding step to calculate posterior.
Auto-encoder now models distributions for latent space.
If we guess a normal form for our “variational distribution” …
$$ D_{KL}(q_\phi(z) \,\|\, p_\theta(z|x)) = \log\frac{\sigma_2}{\sigma_1} + \frac{(\sigma_1^2 - \sigma_2^2) + (\mu_1 - \mu_2)^2}{2\sigma_2^2} $$
[Slide annotation: L2 Loss]
With a standard normal target ($\mu_2 = 0$, $\sigma_2 = 1$), summed over latent dimensions $i$:
$$ = \sum_i \left( \frac{1}{2}\left[ \sigma_i^2 + \mu_i^2 - 1 \right] - \log \sigma_i \right) $$
Drop in loss term to regularize latent space!
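As a sanity check on the two KL expressions, the NumPy sketch below evaluates the general two-Gaussian formula with a standard normal target (mean 0, standard deviation 1) and compares it with the simplified sum over latent dimensions. The function names and the example `mu` and `sigma` values are made up for illustration.

```python
import numpy as np

def kl_two_gaussians(mu1, sigma1, mu2, sigma2):
    # KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)), evaluated per dimension.
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 - sigma2**2 + (mu1 - mu2)**2) / (2.0 * sigma2**2))

def kl_to_standard_normal(mu, sigma):
    # Special case mu2 = 0, sigma2 = 1, summed over latent dimensions.
    return np.sum(0.5 * (sigma**2 + mu**2 - 1.0) - np.log(sigma))

mu = np.array([0.5, -1.0, 0.0])
sigma = np.array([0.5, 1.0, 2.0])

general = np.sum(kl_two_gaussians(mu, sigma, 0.0, 1.0))
closed_form = kl_to_standard_normal(mu, sigma)
zero_at_prior = kl_to_standard_normal(np.zeros(3), np.ones(3))
```

The penalty vanishes only when every latent dimension already matches the standard normal, which is what makes it act as a regularizer on the latent space.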