
Deep Learning: An Introduction from the NLP Perspective

Kevin Duh

NAIST

August 19, 2012


Disclaimer

I am not (yet) an expert in Deep Learning. Let me know if these slides contain any mistakes.

The focus here is Natural Language Processing (NLP); I'm glossing over much active work in Vision & Speech.

Lots of good tutorial information online, some borrowed here:

[Bengio, 2009]: excellent short book summarizing the area
[Socher et al., 2012a]: tutorial with video
Step-by-step code based on the Theano Python library: http://deeplearning.net/tutorial/


Outline

1 Introduction

2 Neural Networks
   Preliminaries
   1-Layer & 2-Layer Nets
   Neural Language Models

3 Deep Learning Approach 1: Deep Belief Nets
   Preliminaries
   Restricted Boltzmann Machines
   Deep Belief Nets

4 Deep Learning Approach 2: Stacked Auto-Encoders
   Auto-Encoders
   Stacked Auto-Encoders
   Denoising Auto-Encoders and Variants


What is Deep Learning?

A model (e.g. a neural network) with many layers, trained in a layer-wise way

An approach for unsupervised learning of feature representations, at successively higher levels

These two definitions are closely related, but correspond to different motivations.


Why explore Deep Learning?

1 It can model complex non-linear phenomena

2 It learns a distributed feature representation

3 It learns a hierarchical feature representation

4 It can exploit unlabeled data


#1 Modeling complex non-linearities

Given the same number of units (with non-linear activation), a deeper architecture is more expressive than a shallow one [Bishop, 1995].


#2 Distributed Feature Representations

One-hot representations are common in NLP:

"dog" = [1, 0, 0, ..., 0] (vector dimension = vocabulary size)
"cat" = [0, 1, 0, ..., 0]
"the" = [0, 0, 0, ..., 1]
"dog" and "cat" share zero similarity, just like "dog" and "the"

Word clustering has proven effective in many tasks:

"dog" = [1, 0, 0, 0] (vector dimension = number of clusters)
"cat" = [1, 0, 0, 0] ("dog" and "cat" were clustered together)
"the" = [0, 1, 0, 0]
〈dog, cat〉 > 〈dog, the〉 = 0

A distributed representation (≠ "distributional representation") is a multi-clustering, modeling factors like POS & semantics:

"dog" = [1, 0, 0.9, 0.0]
"cat" = [1, 0, 0.5, 0.2]
"the" = [0, 1, 0.0, 0.0]


#3 Hierarchical Feature Representations

Hierarchical features effectively capture part-and-whole relationships and naturally address multi-task problems [Lee et al., 2009].


#4 Exploiting Unlabeled Data

Unsupervised & semi-supervised learning will be standard¹:

Engineering reason: unlabeled data is more abundant than labeled data.
Scientific reason: children learn language (syntax, meaning, etc.) mostly from raw unlabeled data.

Layer-wise pre-training in Deep Learning: a good model of the input P(X) can help train P(Y|X)

"If you want to do computer vision, first learn computer graphics." – Geoff Hinton

¹ My prediction for 2020.


Some (personal) skepticism

1 There are other ways to learn distributed representations, e.g.:

Topic models for documents
Concatenating multiple word clustering solutions (has anyone tried this?)
Dictionary learning and sparse reconstruction methods

2 Are multiple levels of representation really necessary in NLP?

For Vision problems, there is a clear analogy to the brain's structure, but for language?
Maybe: compositionality and recursion in natural language.

3 Is black magic required for effective training, e.g. hyper-parameter settings and large computational resources?


Research Opportunities in NLP

1 Improving on current state-of-the-art results on standard tasks

2 Encoding linguistic knowledge into the training process

Current methods are relatively generic and incorporate little domain knowledge.

3 Integrating deep learning into current NLP pipelines

In particular: how to handle structured prediction problems over sequences and trees


What we’ll cover here

1 Neural Language Models & Distributed Word Representations

Not sure if they're "deep", but they're relevant to what we're interested in
The basic math here is useful for later material

2 Restricted Boltzmann Machines & Deep Belief Nets

Deep Learning Approach #1: the original generative model

3 Autoencoders, Denoising Autoencoders, and Stacked Denoising Autoencoders

Deep Learning Approach #2: competitive with #1 and perhaps easier to train


Aside: A Brief History

Early days of AI. Invention of the artificial neuron [McCulloch and Pitts, 1943] & perceptron [Rosenblatt, 1958]

AI Winter. [Minsky and Papert, 1969] showed the perceptron only learns linearly separable concepts

Revival in the 1980s: Multi-layer Perceptrons (MLP) and Back-propagation [Rumelhart et al., 1986]

Other directions (1990s - present): SVMs, Bayesian Networks

Revival in 2006: Deep Learning [Hinton et al., 2006]

Recent successes in applications: Speech at IBM/Toronto [Sainath et al., 2011] and Microsoft [Dahl et al., 2012]. Vision at Google/Stanford [Le et al., 2012]


Outline

1 Introduction

2 Neural Networks
   Preliminaries
   1-Layer & 2-Layer Nets
   Neural Language Models

3 Deep Learning Approach 1: Deep Belief Nets
   Preliminaries
   Restricted Boltzmann Machines
   Deep Belief Nets

4 Deep Learning Approach 2: Stacked Auto-Encoders
   Auto-Encoders
   Stacked Auto-Encoders
   Denoising Auto-Encoders and Variants


Basic Setup of Machine Learning

Training Data: a set of pairs $(x^{(m)}, y^{(m)})_{m=\{1,2,\ldots,M\}}$, where

input $x^{(m)} \in \mathbb{R}^d$ and output $y^{(m)} \in \{0, 1\}$
e.g. $x$ = document, $y$ = spam or not

Goal: learn a function $f: x \rightarrow y$ that predicts correctly on new inputs $x$.

Step 1: Choose a function model family:

e.g. $f(x) = \sigma(w^T \cdot x)$ (logistic regression, a.k.a. 1-layer net)
e.g. $f(x) = \mathrm{sign}(w^T \cdot x)$ (perceptron)
e.g. $f(x) = \mathrm{sign}\left(\sum_m w_m \cdot k(x, x^{(m)})\right)$ (SVM)

Step 2: Optimize parameters $w$ on the Training Data

e.g. minimize the loss function $\min_w \sum_{m=1}^{M} (f_w(x^{(m)}) - y^{(m)})^2$


1-Layer Nets (logistic regression)

Function model: $f(x) = \sigma(w^T \cdot x + b)$

Parameters: vector $w \in \mathbb{R}^d$, scalar bias term $b$
$\sigma$ is a non-linearity: $\sigma(z) = 1/(1 + \exp(-z))$
For simplicity, we sometimes write $f(x) = \sigma(w^T x)$ where $w = [w; b]$ and $x = [x; 1]$

The non-linearity will be important for the expressiveness of multi-layer nets. Other non-linearities are also used, e.g. tanh


Training 1-Layer Nets

Easiest method: gradient descent

Let $Loss(w) = \sum_m (\sigma(w^T x^{(m)}) - y^{(m)})^2$

Gradient: $\nabla_w Loss = \sum_m 2\,(\sigma(w^T x^{(m)}) - y^{(m)})\,\sigma(w^T x^{(m)})(1 - \sigma(w^T x^{(m)}))\,x^{(m)}$

General form of the gradient: $Error \cdot \sigma'(in) \cdot x$

Stochastic gradient descent algorithm:
1 Initialize $w$
2 For each sample $(x^{(m)}, y^{(m)})$ in the training set:
3   $w \leftarrow w - \gamma\,(Error \cdot \sigma'(in) \cdot x^{(m)})$
4 Repeat steps 2-3 until some stopping condition is satisfied

Some practical tricks for the learning rate $\gamma$ & stopping condition give quick training and good generalization
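A minimal numpy sketch of this SGD loop (illustrative only, not the Theano tutorial code); the data, labels, learning rate, and the fixed-epoch stopping condition are stand-ins chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
X = rng.randn(100, 5)                      # 100 samples, d = 5 (random stand-in data)
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # toy labels
X = np.hstack([X, np.ones((100, 1))])      # append 1 so the bias is folded into w

w = np.zeros(X.shape[1])
gamma = 0.1                                # learning rate

for epoch in range(50):                    # "some condition" = a fixed number of epochs
    for x_m, y_m in zip(X, y):
        pred = sigmoid(np.dot(w, x_m))
        error = pred - y_m                        # Error term
        grad = error * pred * (1 - pred) * x_m    # Error * sigma'(in) * x
        w -= gamma * grad

print("training accuracy:", np.mean((sigmoid(X @ w) > 0.5) == y))
```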


2-Layer Nets (MLP, Multi-layer Perceptron)

[Figure: a 2-layer network with inputs $x_1, x_2, x_3, x_4$, hidden units $h_1, h_2, h_3$, and output $y$; weights $w_{ij}$ connect input $x_i$ to hidden unit $h_j$, and weights $w_j$ connect $h_j$ to the output]

$f(x) = \sigma\left(\sum_j w_j \cdot h_j\right) = \sigma\left(\sum_j w_j \cdot \sigma\left(\sum_i w_{ij} x_i\right)\right)$
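A numpy sketch of this forward pass for the 4-3-1 network in the figure; the weights W_ij and w_j are random stand-ins.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
W_ij = rng.randn(4, 3)   # weights from inputs x_1..x_4 to hidden units h_1..h_3
w_j = rng.randn(3)       # weights from hidden units to the output y

def forward(x):
    h = sigmoid(x @ W_ij)        # h_j = sigma(sum_i w_ij x_i)
    return sigmoid(h @ w_j), h   # f(x) = sigma(sum_j w_j h_j)

x = np.array([1.0, 0.0, -1.0, 0.5])
y_hat, h = forward(x)
print(h, y_hat)
```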


Training 2-Layer Nets: Backpropagation

Recall the gradient for 1-Layer Nets consists of:

$\partial Loss/\partial w_j = Error \cdot \sigma'(in) \cdot x_j$
We just need to use the Chain Rule to take derivatives over 2 layers

For the 2-Layer network (previous slide):

$\partial Loss/\partial w_j = [y - f(x)]\,f'(x)\,h_j$
$\partial Loss/\partial w_{ij} = [y - f(x)]\,f'(x)\,w_j\,\sigma'\left(\sum_i w_{ij} x_i\right)x_i$

Note:
1 First, run the sample through the network to get the result $f(x)$.
2 Then, "errors" are propagated back and weights are fixed according to their "responsibility".
3 The problem is not convex (it may have several local optima)
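The same gradients written out in numpy for the single-output network above, a sketch under the squared-error loss from the earlier slides; the slide's expressions are used directly as the update direction, and all values are random stand-ins.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(1)
W_ij, w_j = rng.randn(4, 3), rng.randn(3)   # figure's notation: input->hidden, hidden->output
x, y = rng.randn(4), 1.0                    # one training sample (stand-in)

# Forward pass: hidden activations and output f(x).
h = sigmoid(x @ W_ij)
f = sigmoid(h @ w_j)

# Backward pass (chain rule), using the slide's expressions as the update direction:
delta = (y - f) * f * (1 - f)                          # [y - f(x)] * f'(x)
update_w_j = delta * h                                 # [y - f(x)] f'(x) h_j
update_W_ij = np.outer(x, delta * w_j * h * (1 - h))   # [y - f(x)] f'(x) w_j sigma'(in_j) x_i

gamma = 0.1
w_j += gamma * update_w_j     # adding this direction reduces the squared error
W_ij += gamma * update_W_ij
```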


Definition of ”Depth”

Depends on elementary computational elements:

weighted sum, product, single neuron, kernel, logic gate

1-Layer - linear classifier:

Logistic Regression, Maximum Entropy Classifier
Perceptron, Linear SVM

2-Layer - universal approximator:

Most MLPs (except some convolutional neural nets)
SVMs with kernels
Gaussian processes
Decision trees

3-Layer or more - compact universal approximator:

Deep Learning
Boosted decision trees, Random Forests


Neural Language Models [Bengio et al., 2003]

Motivation: use Neural Nets to learn continuous distributed representations of words.

Addresses the curse of dimensionality arising from the one-hot representation of discrete variables.

Architecture (see the figure on the next slide):

$C(\cdot)$ are the learned word representations of dimension $m$.
The history context $x = [C(w_{t-1})\ C(w_{t-2})\ C(w_{t-3})]$ is compressed to an $h$-node hidden layer via $\tanh(Hx)$.
A final output mapping with softmax gives the probabilities $p(w_t | x)$.
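A numpy sketch of one forward pass of this architecture; the vocabulary and layer sizes are toy values, the parameters are random stand-ins, and the direct word-to-output connections of the original model are omitted.

```python
import numpy as np

rng = np.random.RandomState(0)
V, m, h = 10, 4, 5           # vocabulary size, embedding dim, hidden units (toy sizes)
n_context = 3                # use the previous 3 words as the history

C = rng.randn(V, m)          # C(.): learned word representations, one row per word
H = rng.randn(h, n_context * m)   # hidden-layer weights
U = rng.randn(V, h)          # hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_probs(context_ids):
    """p(w_t | w_{t-1}, w_{t-2}, w_{t-3}) for a toy neural LM."""
    x = np.concatenate([C[i] for i in context_ids])  # x = [C(w_{t-1}) C(w_{t-2}) C(w_{t-3})]
    hidden = np.tanh(H @ x)                          # tanh(Hx)
    return softmax(U @ hidden)                       # p(w_t | x)

probs = next_word_probs([3, 7, 1])
print(probs.shape, probs.sum())   # (10,) 1.0
```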


Neural Language Models [Bengio et al., 2003]

[Figure: architecture diagram of the neural language model]


Distributed Representations: many possibilities

1 Neural Networks & Neural Language Model:

The hidden layer serves as the learned representation
We can view this as analogous to learning a kernel.

2 Principal Component Analysis (PCA), Factor Analysis

Linear transform to decorrelated features: $h = W^T x + b$

3 Sparse coding

$h^* = \arg\min_h ||x - W \cdot h||_2^2 + \lambda ||h||_1$

4 Also: manifold embeddings, ICA, and various unsupervised methods.


Summary: things to remember about Neural Nets

1 Stacking layers of non-linearity (e.g. $\sigma$) is critical for the expressive power of neural nets

2 Hidden layers of neural nets can serve as distributed representations

3 Backpropagation training is just gradient descent, applied with the Chain Rule.

4 Unfortunately, training beyond 2 layers is often difficult due to local optima and vanishing gradients


Minimal Reading List for Neural Language Models

Original Neural LM paper: [Bengio et al., 2003]

Alternate training criteria & architecture: [Collobert et al., 2011]

Hierarchical distributed representations: [Mnih and Hinton, 2008]

Handling large data (code also available): [Mikolov et al., 2011, Schwenk et al., 2012]

Applications in NLP: [Turian et al., 2010]


Outline

1 Introduction

2 Neural Networks
   Preliminaries
   1-Layer & 2-Layer Nets
   Neural Language Models

3 Deep Learning Approach 1: Deep Belief Nets
   Preliminaries
   Restricted Boltzmann Machines
   Deep Belief Nets

4 Deep Learning Approach 2: Stacked Auto-Encoders
   Auto-Encoders
   Stacked Auto-Encoders
   Denoising Auto-Encoders and Variants


Motivation

Goal: discover useful latent features $h$ from data $x$

One possibility: Directed Graphical Models:

Model $p(x, h) = p(x|h)\,p(h)$, where $p(x|h)$ is the likelihood and $p(h)$ is the prior
Directed: we can think of $h$ as a "cause". Given $h = 1$, what's the probability of $x$?

[Figure: directed graphical model $h \rightarrow x$]


Explaining away effect of directed graphical models

$p(h_1)$ and $p(h_2)$ are a priori independent, but dependent given $x$: $p(h_1, h_2 | x) \neq p(h_1|x) \cdot p(h_2|x)$

Thus the posterior $p(h|x)$, which is needed for features or deep learning, is not easy to compute

Example:
$x$ = grass is wet;
$h_1$ = it rained last night; $h_2$ = the water sprinkler was on.

[Figure: directed graphical model $h_1 \rightarrow x \leftarrow h_2$]


Undirected Graphical Models (aka MRFs, Markov Random Fields)

An MRF models $p(x, h) = \frac{1}{Z_\theta} \prod_i \phi_i(x) \prod_j \eta_j(h) \prod_k \nu_k(x, h)$ as a product of un-normalized potentials

$\theta$ are parameters, $Z_\theta$ is the (potentially expensive) normalization
Clique potentials $\phi_i(x)$, $\eta_j(h)$, $\nu_k(x, h)$ describe interactions between input, hidden, and input-hidden variables

Boltzmann Machines define $p(x, h) = \frac{1}{Z_\theta} \exp(-E_\theta(x, h))$ where $x$ and $h$ are binary variables, and
$E_\theta(x, h) = -\frac{1}{2} x^T U x - \frac{1}{2} h^T V h - x^T W h - b^T x - d^T h$
with $\theta = \{U, V, W, b, d\}$ as parameters

The posterior $p(h|x)$ of Boltzmann Machines is also intractable, e.g. $p(h_j|x) = \sum_{h_1} \cdots \sum_{h_{j-1}} \sum_{h_{j+1}} \cdots p(h|x)$.


Restricted Boltzmann Machine (RBM)

RBM: $p(x, h) = \frac{1}{Z_\theta} \exp(-E_\theta(x, h))$ with only $h$-$x$ interactions: $E_\theta(x, h) = -x^T W h - b^T x - d^T h$

[Figure: bipartite graph between visible units $x_1, x_2, x_3$ and hidden units $h_1, h_2, h_3$]

The conditional distribution over hidden units factorizes: $p(h|x) = \prod_j p(h_j|x)$, with $p(h_j = 1|x) = \sigma(\sum_i w_{ij} x_i + d_j)$
Similarly: $p(x|h) = \prod_i p(x_i|h)$, with $p(x_i = 1|h) = \sigma(\sum_j w_{ij} h_j + b_i)$

Computing posteriors $p(h|x)$ or features $E[p(h|x)]$ is easy.

Note the partition function $Z_\theta$ is still expensive, so approximation is required during parameter learning
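A numpy sketch of these factorized conditionals; W, b, d are random stand-ins.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
n_visible, n_hidden = 6, 4
W = 0.1 * rng.randn(n_visible, n_hidden)   # visible-hidden weights w_ij
b = np.zeros(n_visible)                    # visible biases
d = np.zeros(n_hidden)                     # hidden biases

def p_h_given_x(x):
    """p(h_j = 1 | x) = sigma(sum_i w_ij x_i + d_j), computed for all j at once."""
    return sigmoid(x @ W + d)

def p_x_given_h(h):
    """p(x_i = 1 | h) = sigma(sum_j w_ij h_j + b_i)."""
    return sigmoid(h @ W.T + b)

x = rng.binomial(1, 0.5, n_visible)        # a binary visible vector
print(p_h_given_x(x))                      # posterior features, cheap to compute
```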


Training RBMs

Gradient of the Log-Likelihood: $\nabla_w \log P_w(x = x^{(m)})$

$= \nabla_{w_{ij}} \log \sum_h P_w(x = x^{(m)}, h)$   (1)

$= \nabla_{w_{ij}} \log \sum_h \frac{1}{Z_w} \exp(-E_w(x^{(m)}, h))$   (2)

$= -\nabla_{w_{ij}} \log Z_w + \nabla_{w_{ij}} \log \sum_h \exp(-E_w(x^{(m)}, h))$   (3)

$= \frac{1}{Z_w} \sum_{h,x} e^{-E_w(x,h)}\,\nabla_{w_{ij}} E_w(x, h) - \frac{1}{\sum_h e^{-E_w(x^{(m)},h)}} \sum_h e^{-E_w(x^{(m)},h)}\,\nabla_{w_{ij}} E_w(x^{(m)}, h)$

$= \sum_{h,x} P_w(x, h)\,[\nabla_{w_{ij}} E_w(x, h)] - \sum_h P_w(h | x^{(m)})\,[\nabla_{w_{ij}} E_w(x^{(m)}, h)]$   (4)

$= -E_{p(x,h)}[x_i \cdot h_j] + E_{p(h|x = x^{(m)})}[x_i^{(m)} \cdot h_j]$   (5)


Training RBMs with Contrastive Divergence

In the previous equation, the first term $E_{p(x,h)}[x_i \cdot h_j]$ is expensive

Gibbs Sampling (sample $x$ then $h$ iteratively) works, but re-running it for each gradient step is slow.

Contrastive Divergence is a faster but biased method that initializes with the training data:

1 $h \sim P(h | x^{(m)})$
2 $\tilde{x} \sim P(x | h)$; $\tilde{h} \sim P(h | \tilde{x})$
3 $w_{ij} \leftarrow w_{ij} + \gamma \sum_{batch} (x_i^{(m)} \cdot h_j - \tilde{x}_i \cdot \tilde{h}_j)$
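A numpy sketch of one CD-1 update on a mini-batch (bias terms omitted, toy random data, and the hyper-parameters are stand-ins).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
n_visible, n_hidden, batch = 6, 4, 8
W = 0.01 * rng.randn(n_visible, n_hidden)
gamma = 0.1

X = rng.binomial(1, 0.5, (batch, n_visible)).astype(float)  # toy training mini-batch

# Positive phase: h ~ P(h | x^(m))
ph = sigmoid(X @ W)
h = (rng.uniform(size=ph.shape) < ph).astype(float)

# Negative phase: sample x~ from P(x | h), then h~ from P(h | x~)  (one Gibbs step)
px_neg = sigmoid(h @ W.T)
x_neg = (rng.uniform(size=px_neg.shape) < px_neg).astype(float)
ph_neg = sigmoid(x_neg @ W)

# Contrastive Divergence update: w_ij += gamma * sum_batch(x_i h_j - x~_i h~_j)
W += gamma * (X.T @ ph - x_neg.T @ ph_neg) / batch
```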


Deep Belief Nets (DBN)

A DBN stacks RBMs layer-by-layer to get a deep architecture.

Layer-wise pre-training is critical:
First, train an RBM to learn a 1st layer of features $h$ from the input $x$.
Then, treat $h$ as input and learn a 2nd layer of features.
Each added layer improves the variational lower bound on the log probability of the training data.

Further fine-tuning can be obtained with the Wake-Sleep Algorithm:

Do a stochastic bottom-up pass (adjust weights to reconstruct the layer below)
Do a few iterations of Gibbs sampling at the top-level RBM
Do a stochastic top-down pass (adjust weights to reconstruct the layer above)

Note: not to be confused with Dynamic Bayesian Nets or Deep Boltzmann Machines
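A numpy sketch of the greedy layer-wise idea: train an RBM on the input, take its hidden probabilities as the input to the next RBM, and repeat. The tiny train_rbm helper, layer sizes, and data here are stand-ins; biases and fine-tuning are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, epochs=10, gamma=0.1, rng=np.random.RandomState(0)):
    """Tiny CD-1 trainer (biases omitted); returns the learned weight matrix."""
    W = 0.01 * rng.randn(data.shape[1], n_hidden)
    for _ in range(epochs):
        ph = sigmoid(data @ W)
        h = (rng.uniform(size=ph.shape) < ph).astype(float)
        x_neg = sigmoid(h @ W.T)              # mean-field reconstruction
        ph_neg = sigmoid(x_neg @ W)
        W += gamma * (data.T @ ph - x_neg.T @ ph_neg) / len(data)
    return W

rng = np.random.RandomState(0)
X = rng.binomial(1, 0.5, (50, 20)).astype(float)   # toy input data

layer_sizes = [10, 5]          # sizes of the 1st and 2nd hidden layers
weights, layer_input = [], X
for n_hidden in layer_sizes:   # greedy layer-wise pre-training
    W = train_rbm(layer_input, n_hidden)
    weights.append(W)
    layer_input = sigmoid(layer_input @ W)   # treat h as the input to the next layer

print([W.shape for W in weights])   # [(20, 10), (10, 5)]
```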


Summary: things to remember about DBNs

1 Layer-wise pre-training is the innovation that enabled training deep architectures.

2 Pre-training focuses on optimizing likelihood on the data, not the target label. The philosophy is to first model $p(x)$ in order to do better at $p(y|x)$.

3 Why use an undirected graphical model like the RBM? Because $p(h|x)$ is computationally tractable (no "explaining away" effect), so stacking them into DBNs is feasible.

4 Learning an RBM still requires approximate inference (e.g. contrastive divergence) since the partition function is expensive.


Minimal Reading List for RBM/DBN

Original DBN paper [Hinton et al., 2006]

Why does unsupervised pre-training help deep learning? [Erhan et al., 2010]

Successful application in Collaborative Filtering [Salakhutdinov et al., 2007]


Outline

1 Introduction

2 Neural Networks
   Preliminaries
   1-Layer & 2-Layer Nets
   Neural Language Models

3 Deep Learning Approach 1: Deep Belief Nets
   Preliminaries
   Restricted Boltzmann Machines
   Deep Belief Nets

4 Deep Learning Approach 2: Stacked Auto-Encoders
   Auto-Encoders
   Stacked Auto-Encoders
   Denoising Auto-Encoders and Variants


Auto-Encoders

Auto-Encoders are a simpler, non-probabilistic alternative to RBMs.

Define an encoder and a decoder and pass the data through them:

Encoder: $h = f_\theta(x)$, e.g. $h = \sigma(Wx + b)$
Decoder: $\hat{x} = g_\theta(h)$, e.g. $\hat{x} = \sigma(W'h + d)$
$W$ and $W'$ need not be tied, but often are in practice.

Encourage $\theta$ to give small reconstruction error:

e.g. $Loss = \sum_m ||x^{(m)} - g_\theta(f_\theta(x^{(m)}))||^2$

A linear encoder/decoder with squared reconstruction error learns the same subspace as PCA.

A sigmoid encoder/decoder gives the same form of $p(h|x)$, $p(x|h)$ as RBMs.
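A numpy sketch of a tied-weight sigmoid auto-encoder and its reconstruction loss; the parameters and toy data are stand-ins.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
d_in, d_hidden = 8, 3
W = 0.1 * rng.randn(d_hidden, d_in)   # tied weights: the decoder uses W.T
b = np.zeros(d_hidden)
d = np.zeros(d_in)

def encode(x):
    return sigmoid(W @ x + b)          # h = f_theta(x)

def decode(h):
    return sigmoid(W.T @ h + d)        # x_hat = g_theta(h)

X = rng.binomial(1, 0.5, (10, d_in)).astype(float)
loss = sum(np.sum((x - decode(encode(x))) ** 2) for x in X)   # sum_m ||x - g(f(x))||^2
print(loss)
```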


Architecture: Stacked Auto-Encoders

Auto-encoders can be stacked in the same way RBMs are stacked to give deep architectures

Hidden unit size:

The hidden layer should be lower-dimensional, or else the auto-encoder may just learn the identity mapping
Alternatively, allow more hidden units but enforce sparsity.


Denoising Auto-Encoders

First, perturb the input data $x$ to $\tilde{x}$ using invariances from domain knowledge.

Reconstruct the original data:

e.g. $Loss = \sum_m ||x^{(m)} - g_\theta(f_\theta(\tilde{x}^{(m)}))||^2$

[Vincent et al., 2010] explored Gaussian noise and salt-and-pepper noise for Vision data. [Glorot et al., 2011] explored masking noise (randomly setting entries to 0) for Text data.
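A numpy sketch of the denoising variant with masking noise (randomly zeroing input entries) while reconstructing the clean input; mask_prob and the other values are stand-ins, and the encoder/decoder are the same tied-weight sigmoids as above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
d_in, d_hidden, mask_prob = 8, 3, 0.3
W = 0.1 * rng.randn(d_hidden, d_in)
b, d = np.zeros(d_hidden), np.zeros(d_in)

def corrupt(x):
    """Masking noise: randomly set a fraction of the input entries to 0."""
    mask = rng.uniform(size=x.shape) > mask_prob
    return x * mask

X = rng.binomial(1, 0.5, (10, d_in)).astype(float)
loss = 0.0
for x in X:
    x_tilde = corrupt(x)                                   # perturb x to x~
    x_hat = sigmoid(W.T @ sigmoid(W @ x_tilde + b) + d)    # encode the corrupted input, then decode
    loss += np.sum((x - x_hat) ** 2)                       # reconstruct the *original* x
print(loss)
```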


Predictive Sparse Decomposition [Kavukcuoglu et al., 2008]

Objective (minimize with respect to $h$, $W$, $\theta$):
$\sum_m \lambda ||h^{(m)}||_1 + ||x^{(m)} - W h^{(m)}||_2^2 + ||h^{(m)} - f_\theta(x^{(m)})||_2^2$

The first two terms are similar to sparse coding. The third term learns a fast encoder that approximates the sparse coder.
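A numpy sketch that only evaluates this objective for given h, W, and encoder parameters; the alternating optimization is not shown, the tanh encoder form is an assumption, and all values are random stand-ins.

```python
import numpy as np

rng = np.random.RandomState(0)
d_in, d_code, M, lam = 8, 5, 10, 0.1

X = rng.randn(M, d_in)          # data x^(m)
H = rng.randn(M, d_code)        # codes h^(m) (would be optimized)
W = rng.randn(d_in, d_code)     # dictionary (would be optimized)
W_enc = rng.randn(d_code, d_in) # fast encoder parameters theta (would be optimized)

def f_theta(x):
    return np.tanh(W_enc @ x)   # one possible fast encoder form (an assumption here)

objective = sum(
    lam * np.abs(h).sum()                  # sparsity:      lambda * ||h||_1
    + np.sum((x - W @ h) ** 2)             # sparse coding: ||x - W h||_2^2
    + np.sum((h - f_theta(x)) ** 2)        # prediction:    ||h - f_theta(x)||_2^2
    for x, h in zip(X, H))
print(objective)
```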


Summary: things to remember about Stacked Auto-Encoders

1 Auto-encoders are computationally cheaper alternatives to RBMs. We stack them into deep architectures in the same way we stack RBMs into DBNs.

2 Auto-encoders learn to "compress" and "reconstruct" input data. Low reconstruction error corresponds to an encoding that captures the main variations in the data. Again, the focus is on modeling $p(x)$ first.

3 Many variants of encoders are out there, and some provide effective ways to incorporate domain expertise.


Minimal Reading List for Stacked Auto-Encoders

Original Stacked Auto-encoder paper [Bengio et al., 2006]

Comparison of optimization methods [Le et al., 2011]

Speeding up the reconstruction error computation for large word vectors [Dauphin et al., 2011]

Denoising Auto-Encoders [Vincent et al., 2010]


Selected Readings for NLPers

Deep Learning Applications in NLP:

Sentiment Analysis [Glorot et al., 2011]Parsing[Socher et al., 2011b, Collobert et al., 2011, Collobert, 2011]Paraphrase Detection [Socher et al., 2011a]Learning lexical semantics:[Huang et al., 2012, Socher et al., 2012b]

Applications in other fields, but worth reading:

Good reference that defines many terms popular in Deep Learning vision papers [Jarrett et al., 2009]
Deep learning of cats: entirely unsupervised learning of high-level features on massive datasets [Le et al., 2012]

References

Bengio, Y. (2009). Learning Deep Architectures for AI. Foundations and Trends in Machine Learning. NOW Publishers.

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language model. JMLR.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2006). Greedy layer-wise training of deep networks. In NIPS'06, pages 153–160.

Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press.

Collobert, R. (2011). Deep learning for efficient discriminative parsing. In AISTATS.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Dahl, G., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, Special Issue on Deep Learning for Speech and Language Processing.

Dauphin, Y., Glorot, X., and Bengio, Y. (2011). Large-scale learning of embeddings with reconstruction sampling. In ICML'11, pages 945–952.


Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., and Bengio, S. (2010). Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625–660.

Glorot, X., Bordes, A., and Bengio, Y. (2011). Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML.

Hinton, G., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554.

Huang, E., Socher, R., Manning, C., and Ng, A. (2012). Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 873–882, Jeju Island, Korea. Association for Computational Linguistics.

Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th International Conference on Computer Vision.

Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008). Fast inference in sparse coding algorithms with applications to object recognition. Technical Report CBLL-TR-2008-12-01, Computational and Biological Learning Lab, Courant Institute, NYU.


Le, Q., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A. (2011). On optimization methods for deep learning. In Getoor, L. and Scheffer, T., editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 265–272, New York, NY, USA. ACM.

Le, Q. V., Ranzato, M., Monga, R., Devin, M., Chen, K., Corrado, G. S., Dean, J., and Ng, A. Y. (2012). Building high-level features using large scale unsupervised learning. In ICML.

Lee, H., Grosse, R., Ranganath, R., and Ng, A. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML.


McCulloch, W. S. and Pitts, W. H. (1943). A logical calculus of the ideas immanent in nervous activity. In Bulletin of Mathematical Biophysics, volume 5, pages 115–137.

Mikolov, T., Deoras, A., Povey, D., Burget, L., and Cernocky, J. (2011). Strategies for training large scale neural network language models. In ASRU.

Minsky, M. and Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.

Mnih, A. and Hinton, G. (2008). A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems 21 (NIPS 2008).


Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323:533–536.

Sainath, T. N., Kingsbury, B., Ramabhadran, B., Fousek, P., Novak, P., and Mohamed, A. (2011). Making deep belief networks effective for large vocabulary continuous speech recognition. In ASRU.

Salakhutdinov, R., Mnih, A., and Hinton, G. (2007). Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pages 791–798.

Schwenk, H., Rousseau, A., and Attik, M. (2012). Large, pruned or continuous space language models on a GPU for statistical machine translation. In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, pages 11–19, Montreal, Canada. Association for Computational Linguistics.

Socher, R., Bengio, Y., and Manning, C. (2012a). Deep learning for NLP (without the magic). ACL Tutorials. http://www.socher.org/index.php/DeepLearningTutorial/DeepLearningTutorial.

Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011a). Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS.

Socher, R., Huval, B., Manning, C. D., and Ng, A. Y. (2012b). Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211, Jeju Island, Korea. Association for Computational Linguistics.

Socher, R., Lin, C., Ng, A. Y., and Manning, C. D. (2011b). Parsing natural scenes and natural language with recursive neural networks. In ICML.

Turian, J., Ratinov, L.-A., and Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394, Uppsala, Sweden. Association for Computational Linguistics.

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408.
