
Deep Learning: An Introduction from the NLP Perspective

Kevin Duh

NAIST

August 19, 2012


Disclaimer

I am not (yet) an expert in Deep Learning. Let me know if these slides contain any mistakes.

The focus here is Natural Language Processing (NLP); I'm glossing over much active work in Vision & Speech.

Lots of good tutorial information online, some borrowed here:

[Bengio, 2009]: excellent short book summarizing the area
[Socher et al., 2012a]: tutorial with video
Step-by-step code based on the Theano Python library: http://deeplearning.net/tutorial/


Outline

1 Introduction

2 Neural Networks
   Preliminaries
   1-Layer & 2-Layer Nets
   Neural Language Models

3 Deep Learning Approach 1: Deep Belief Nets
   Preliminaries
   Restricted Boltzmann Machines
   Deep Belief Nets

4 Deep Learning Approach 2: Stacked Auto-Encoders
   Auto-Encoders
   Stacked Auto-Encoders
   Denoising Auto-Encoders and Variants


What is Deep Learning?

A model (e.g. a neural network) with many layers, trained in a layer-wise way

An approach for unsupervised learning of feature representations, at successively higher levels

These two definitions are closely related, but correspond to different motivations.


Why explore Deep Learning?

1 It can model complex non-linear phenomena

2 It learns a distributed feature representation

3 It learns a hierarchical feature representation

4 It can exploit unlabeled data


#1 Modeling complex non-linearities

Given the same number of units (with non-linear activation), a deeper architecture is more expressive than a shallow one [Bishop, 1995].


#2 Distributed Feature Representations

One-hot representations are common in NLP:

"dog" = [1, 0, 0, ..., 0] (vector dimension = vocabulary size)
"cat" = [0, 1, 0, ..., 0]
"the" = [0, 0, 0, ..., 1]
"dog" and "cat" share zero similarity, just like "dog" and "the"

Word clustering has proven effective in many tasks:

"dog" = [1, 0, 0, 0] (vector dimension = number of clusters)
"cat" = [1, 0, 0, 0] ("dog" and "cat" were clustered together)
"the" = [0, 1, 0, 0]
〈dog, cat〉 > 〈dog, the〉 = 0

A distributed representation (≠ "distributional representation") is a multi-clustering, modeling factors like POS & semantics:

"dog" = [1, 0, 0.9, 0.0]
"cat" = [1, 0, 0.5, 0.2]
"the" = [0, 1, 0.0, 0.0]


#3 Hierarchical Feature Representations

Hierarchical features effectively capture part-and-whole relationships and naturally address multi-task problems [Lee et al., 2009].


#4 Exploiting Unlabeled Data

Unsupervised & semi-supervised learning will be standard¹:

Engineering reason: unlabeled data is more abundant than labeled data.
Scientific reason: children learn language (syntax, meaning, etc.) mostly from raw unlabeled data.

Layer-wise pre-training in Deep Learning: a good model of the input P(X) can help train P(Y|X)

"If you want to do computer vision, first learn computer graphics." – Geoff Hinton

¹ My prediction for 2020.


Some (personal) skepticism

1 There are other ways to learn distributed representations, e.g.:

Topic models for documents
Concatenating multiple word clustering solutions (has anyone tried this?)
Dictionary learning and sparse reconstruction methods

2 Are multiple levels of representation really necessary in NLP?

For Vision problems, there is a clear analogy to the brain's structure, but for language?
Maybe: compositionality and recursion in natural language.

3 Is black magic required for effective training, e.g. hyper-parameter settings and large computational resources?


Research Opportunities in NLP

1 Improving on current state-of-the-art results on standard tasks

2 Encoding linguistic knowledge into the training process

Current methods are relatively generic and incorporate little domain knowledge.

3 Integrating deep learning into current NLP pipelines

In particular: how to handle structured prediction problems over sequences and trees


What we’ll cover here

1 Neural Language Models & Distributed Word Representations

Not sure if they're "deep", but they're relevant to what we're interested in
The basic math here is useful for later material

2 Restricted Boltzmann Machines & Deep Belief Nets

Deep Learning Approach #1: the original generative model

3 Autoencoders, Denoising Autoencoders, and Stacked Denoising Autoencoders

Deep Learning Approach #2: competitive with #1 and perhaps easier to train


Aside: A Brief History

Early days of AI. Invention of the artificial neuron [McCulloch and Pitts, 1943] & perceptron [Rosenblatt, 1958]

AI Winter. [Minsky and Papert, 1969] showed the perceptron only learns linearly separable concepts

Revival in the 1980s: Multi-layer Perceptrons (MLP) and Back-propagation [Rumelhart et al., 1986]

Other directions (1990s - present): SVMs, Bayesian Networks

Revival in 2006: Deep Learning [Hinton et al., 2006]

Recent successes in applications: Speech at IBM/Toronto [Sainath et al., 2011] and Microsoft [Dahl et al., 2012]. Vision at Google/Stanford [Le et al., 2012]


Outline

1 Introduction

2 Neural Networks
   Preliminaries
   1-Layer & 2-Layer Nets
   Neural Language Models

3 Deep Learning Approach 1: Deep Belief Nets
   Preliminaries
   Restricted Boltzmann Machines
   Deep Belief Nets

4 Deep Learning Approach 2: Stacked Auto-Encoders
   Auto-Encoders
   Stacked Auto-Encoders
   Denoising Auto-Encoders and Variants


Basic Setup of Machine Learning

Training Data: a set of pairs $(x^{(m)}, y^{(m)})_{m=\{1,2,\ldots,M\}}$, where

input $x^{(m)} \in \mathbb{R}^d$ and output $y^{(m)} \in \{0, 1\}$
e.g. $x$ = document, $y$ = spam or not

Goal: learn a function $f: x \rightarrow y$ that predicts correctly on new inputs $x$.

Step 1: Choose a function model family:

e.g. $f(x) = \sigma(w^T \cdot x)$ (logistic regression, a.k.a. 1-layer net)
e.g. $f(x) = \mathrm{sign}(w^T \cdot x)$ (perceptron)
e.g. $f(x) = \mathrm{sign}\left(\sum_m w_m \cdot k(x, x^{(m)})\right)$ (SVM)

Step 2: Optimize parameters $w$ on the Training Data

e.g. minimize the loss function $\min_w \sum_{m=1}^{M} (f_w(x^{(m)}) - y^{(m)})^2$


1-Layer Nets (logistic regression)

Function model: $f(x) = \sigma(w^T \cdot x + b)$

Parameters: vector $w \in \mathbb{R}^d$, scalar bias term $b$
$\sigma$ is a non-linearity: $\sigma(z) = 1/(1 + \exp(-z))$
For simplicity, we sometimes write $f(x) = \sigma(w^T x)$ where $w = [w; b]$ and $x = [x; 1]$

The non-linearity will be important for the expressiveness of multi-layer nets. Other non-linearities are also used, e.g. tanh


Training 1-Layer Nets

Easiest method: gradient descent

Let $Loss(w) = \sum_m (\sigma(w^T x^{(m)}) - y^{(m)})^2$

Gradient: $\nabla_w Loss = \sum_m 2\,(\sigma(w^T x^{(m)}) - y^{(m)})\,\sigma(w^T x^{(m)})(1 - \sigma(w^T x^{(m)}))\,x^{(m)}$

General form of the gradient: $Error \cdot \sigma'(in) \cdot x$

Stochastic gradient descent algorithm:
1 Initialize $w$
2 For each sample $(x^{(m)}, y^{(m)})$ in the training set:
3   $w \leftarrow w - \gamma\,(Error \cdot \sigma'(in) \cdot x^{(m)})$
4 Repeat steps 2-3 until some stopping condition is satisfied

Some practical tricks for the learning rate $\gamma$ & stopping condition give quick training and good generalization
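A minimal numpy sketch of this SGD loop (illustrative only, not the Theano tutorial code); the data, labels, learning rate, and the fixed-epoch stopping condition are stand-ins chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
X = rng.randn(100, 5)                      # 100 samples, d = 5 (random stand-in data)
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # toy labels
X = np.hstack([X, np.ones((100, 1))])      # append 1 so the bias is folded into w

w = np.zeros(X.shape[1])
gamma = 0.1                                # learning rate

for epoch in range(50):                    # "some condition" = a fixed number of epochs
    for x_m, y_m in zip(X, y):
        pred = sigmoid(np.dot(w, x_m))
        error = pred - y_m                        # Error term
        grad = error * pred * (1 - pred) * x_m    # Error * sigma'(in) * x
        w -= gamma * grad

print("training accuracy:", np.mean((sigmoid(X @ w) > 0.5) == y))
```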


2-Layer Nets (MLP, Multi-layer Perceptron)

[Figure: a 2-layer network with inputs $x_1, x_2, x_3, x_4$, hidden units $h_1, h_2, h_3$, and output $y$; weights $w_{ij}$ connect input $x_i$ to hidden unit $h_j$, and weights $w_j$ connect $h_j$ to the output]

$f(x) = \sigma\left(\sum_j w_j \cdot h_j\right) = \sigma\left(\sum_j w_j \cdot \sigma\left(\sum_i w_{ij} x_i\right)\right)$
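A numpy sketch of this forward pass for the 4-3-1 network in the figure; the weights W_ij and w_j are random stand-ins.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
W_ij = rng.randn(4, 3)   # weights from inputs x_1..x_4 to hidden units h_1..h_3
w_j = rng.randn(3)       # weights from hidden units to the output y

def forward(x):
    h = sigmoid(x @ W_ij)        # h_j = sigma(sum_i w_ij x_i)
    return sigmoid(h @ w_j), h   # f(x) = sigma(sum_j w_j h_j)

x = np.array([1.0, 0.0, -1.0, 0.5])
y_hat, h = forward(x)
print(h, y_hat)
```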


Training 2-Layer Nets: Backpropagation

Recall the gradient for 1-Layer Nets consists of:

$\partial Loss/\partial w_j = Error \cdot \sigma'(in) \cdot x_j$
We just need to use the Chain Rule to take derivatives over 2 layers

For the 2-Layer network (previous slide):

$\partial Loss/\partial w_j = [y - f(x)]\,f'(x)\,h_j$
$\partial Loss/\partial w_{ij} = [y - f(x)]\,f'(x)\,w_j\,\sigma'\left(\sum_i w_{ij} x_i\right)x_i$

Note:
1 First, run the sample through the network to get the result $f(x)$.
2 Then, "errors" are propagated back and weights are fixed according to their "responsibility".
3 The problem is not convex (it may have several local optima)
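The same gradients written out in numpy for the single-output network above, a sketch under the squared-error loss from the earlier slides; the slide's expressions are used directly as the update direction, and all values are random stand-ins.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(1)
W_ij, w_j = rng.randn(4, 3), rng.randn(3)   # figure's notation: input->hidden, hidden->output
x, y = rng.randn(4), 1.0                    # one training sample (stand-in)

# Forward pass: hidden activations and output f(x).
h = sigmoid(x @ W_ij)
f = sigmoid(h @ w_j)

# Backward pass (chain rule), using the slide's expressions as the update direction:
delta = (y - f) * f * (1 - f)                          # [y - f(x)] * f'(x)
update_w_j = delta * h                                 # [y - f(x)] f'(x) h_j
update_W_ij = np.outer(x, delta * w_j * h * (1 - h))   # [y - f(x)] f'(x) w_j sigma'(in_j) x_i

gamma = 0.1
w_j += gamma * update_w_j     # adding this direction reduces the squared error
W_ij += gamma * update_W_ij
```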


Definition of ”Depth”

Depends on elementary computational elements:

weighted sum, product, single neuron, kernel, logic gate

1-Layer - linear classifier:

Logistic Regression, Maximum Entropy Classifier
Perceptron, Linear SVM

2-Layer - universal approximator:

Most MLPs (except some convolutional neural nets)
SVMs with kernels
Gaussian processes
Decision trees

3-Layer or more - compact universal approximator:

Deep Learning
Boosted decision trees, Random Forests


Neural Language Models [Bengio et al., 2003]

Motivation: use Neural Nets to learn continuous distributed representations of words.

Addresses the curse of dimensionality arising from the one-hot representation of discrete variables.

Architecture (see the figure on the next slide):

$C(\cdot)$ are the learned word representations of dimension $m$.
The history context $x = [C(w_{t-1})\ C(w_{t-2})\ C(w_{t-3})]$ is compressed to an $h$-node hidden layer via $\tanh(Hx)$.
A final output mapping with softmax gives the probabilities $p(w_t | x)$.
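A numpy sketch of one forward pass of this architecture; the vocabulary and layer sizes are toy values, the parameters are random stand-ins, and the direct word-to-output connections of the original model are omitted.

```python
import numpy as np

rng = np.random.RandomState(0)
V, m, h = 10, 4, 5           # vocabulary size, embedding dim, hidden units (toy sizes)
n_context = 3                # use the previous 3 words as the history

C = rng.randn(V, m)          # C(.): learned word representations, one row per word
H = rng.randn(h, n_context * m)   # hidden-layer weights
U = rng.randn(V, h)          # hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_probs(context_ids):
    """p(w_t | w_{t-1}, w_{t-2}, w_{t-3}) for a toy neural LM."""
    x = np.concatenate([C[i] for i in context_ids])  # x = [C(w_{t-1}) C(w_{t-2}) C(w_{t-3})]
    hidden = np.tanh(H @ x)                          # tanh(Hx)
    return softmax(U @ hidden)                       # p(w_t | x)

probs = next_word_probs([3, 7, 1])
print(probs.shape, probs.sum())   # (10,) 1.0
```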


Neural Language Models [Bengio et al., 2003]

[Figure: architecture diagram of the neural language model]


Distributed Representations: many possibilities

1 Neural Networks & Neural Language Model:

The hidden layer serves as the learned representation
We can view this as analogous to learning a kernel.

2 Principal Component Analysis (PCA), Factor Analysis

Linear transform to decorrelated features: $h = W^T x + b$

3 Sparse coding

$h^* = \arg\min_h ||x - W \cdot h||_2^2 + \lambda ||h||_1$

4 Also: manifold embeddings, ICA, and various unsupervised methods.


Summary: things to remember about Neural Nets

1 Stacking layers of non-linearity (e.g. $\sigma$) is critical for the expressive power of neural nets

2 Hidden layers of neural nets can serve as distributed representations

3 Backpropagation training is just gradient descent, applied with the Chain Rule.

4 Unfortunately, training beyond 2 layers is often difficult due to local optima and vanishing gradients


Minimal Reading List for Neural Language Models

Original Neural LM paper: [Bengio et al., 2003]

Alternate training criteria & architecture: [Collobert et al., 2011]

Hierarchical distributed representations: [Mnih and Hinton, 2008]

Handling large data (code also available): [Mikolov et al., 2011, Schwenk et al., 2012]

Applications in NLP: [Turian et al., 2010]


Outline

1 Introduction

2 Neural Networks
   Preliminaries
   1-Layer & 2-Layer Nets
   Neural Language Models

3 Deep Learning Approach 1: Deep Belief Nets
   Preliminaries
   Restricted Boltzmann Machines
   Deep Belief Nets

4 Deep Learning Approach 2: Stacked Auto-Encoders
   Auto-Encoders
   Stacked Auto-Encoders
   Denoising Auto-Encoders and Variants


Motivation

Goal: discover useful latent features $h$ from data $x$

One possibility: Directed Graphical Models:

Model $p(x, h) = p(x|h)\,p(h)$, where $p(x|h)$ is the likelihood and $p(h)$ is the prior
Directed: we can think of $h$ as a "cause". Given $h = 1$, what's the probability of $x$?

[Figure: directed graphical model $h \rightarrow x$]


Explaining away effect of directed graphical models

$p(h_1)$ and $p(h_2)$ are a priori independent, but dependent given $x$: $p(h_1, h_2 | x) \neq p(h_1|x) \cdot p(h_2|x)$

Thus the posterior $p(h|x)$, which is needed for features or deep learning, is not easy to compute

Example:
$x$ = grass is wet;
$h_1$ = it rained last night; $h_2$ = the water sprinkler was on.

[Figure: directed graphical model $h_1 \rightarrow x \leftarrow h_2$]


Undirected Graphical Models (aka MRFs, Markov Random Fields)

An MRF models $p(x, h) = \frac{1}{Z_\theta} \prod_i \phi_i(x) \prod_j \eta_j(h) \prod_k \nu_k(x, h)$ as a product of un-normalized potentials

$\theta$ are parameters, $Z_\theta$ is the (potentially expensive) normalization
Clique potentials $\phi_i(x)$, $\eta_j(h)$, $\nu_k(x, h)$ describe interactions between input, hidden, and input-hidden variables

Boltzmann Machines define $p(x, h) = \frac{1}{Z_\theta} \exp(-E_\theta(x, h))$ where $x$ and $h$ are binary variables, and
$E_\theta(x, h) = -\frac{1}{2} x^T U x - \frac{1}{2} h^T V h - x^T W h - b^T x - d^T h$
with $\theta = \{U, V, W, b, d\}$ as parameters

The posterior $p(h|x)$ of Boltzmann Machines is also intractable, e.g. $p(h_j|x) = \sum_{h_1} \cdots \sum_{h_{j-1}} \sum_{h_{j+1}} \cdots p(h|x)$.


Restricted Boltzmann Machine (RBM)

RBM: $p(x, h) = \frac{1}{Z_\theta} \exp(-E_\theta(x, h))$ with only $h$-$x$ interactions: $E_\theta(x, h) = -x^T W h - b^T x - d^T h$

[Figure: bipartite graph between visible units $x_1, x_2, x_3$ and hidden units $h_1, h_2, h_3$]

The conditional distribution over hidden units factorizes: $p(h|x) = \prod_j p(h_j|x)$, with $p(h_j = 1|x) = \sigma(\sum_i w_{ij} x_i + d_j)$
Similarly: $p(x|h) = \prod_i p(x_i|h)$, with $p(x_i = 1|h) = \sigma(\sum_j w_{ij} h_j + b_i)$

Computing posteriors $p(h|x)$ or features $E[p(h|x)]$ is easy.

Note the partition function $Z_\theta$ is still expensive, so approximation is required during parameter learning
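A numpy sketch of these factorized conditionals; W, b, d are random stand-ins.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
n_visible, n_hidden = 6, 4
W = 0.1 * rng.randn(n_visible, n_hidden)   # visible-hidden weights w_ij
b = np.zeros(n_visible)                    # visible biases
d = np.zeros(n_hidden)                     # hidden biases

def p_h_given_x(x):
    """p(h_j = 1 | x) = sigma(sum_i w_ij x_i + d_j), computed for all j at once."""
    return sigmoid(x @ W + d)

def p_x_given_h(h):
    """p(x_i = 1 | h) = sigma(sum_j w_ij h_j + b_i)."""
    return sigmoid(h @ W.T + b)

x = rng.binomial(1, 0.5, n_visible)        # a binary visible vector
print(p_h_given_x(x))                      # posterior features, cheap to compute
```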


Training RBMs

Gradient of the Log-Likelihood: $\nabla_w \log P_w(x = x^{(m)})$

$= \nabla_{w_{ij}} \log \sum_h P_w(x = x^{(m)}, h)$   (1)

$= \nabla_{w_{ij}} \log \sum_h \frac{1}{Z_w} \exp(-E_w(x^{(m)}, h))$   (2)

$= -\nabla_{w_{ij}} \log Z_w + \nabla_{w_{ij}} \log \sum_h \exp(-E_w(x^{(m)}, h))$   (3)

$= \frac{1}{Z_w} \sum_{h,x} e^{-E_w(x,h)}\,\nabla_{w_{ij}} E_w(x, h) - \frac{1}{\sum_h e^{-E_w(x^{(m)},h)}} \sum_h e^{-E_w(x^{(m)},h)}\,\nabla_{w_{ij}} E_w(x^{(m)}, h)$

$= \sum_{h,x} P_w(x, h)\,[\nabla_{w_{ij}} E_w(x, h)] - \sum_h P_w(h | x^{(m)})\,[\nabla_{w_{ij}} E_w(x^{(m)}, h)]$   (4)

$= -E_{p(x,h)}[x_i \cdot h_j] + E_{p(h|x = x^{(m)})}[x_i^{(m)} \cdot h_j]$   (5)


Training RBMs with Contrastive Divergence

In the previous equation, the first term $E_{p(x,h)}[x_i \cdot h_j]$ is expensive

Gibbs Sampling (sample $x$ then $h$ iteratively) works, but re-running it for each gradient step is slow.

Contrastive Divergence is a faster but biased method that initializes with the training data:

1 $h \sim P(h | x^{(m)})$
2 $\tilde{x} \sim P(x | h)$; $\tilde{h} \sim P(h | \tilde{x})$
3 $w_{ij} \leftarrow w_{ij} + \gamma \sum_{batch} (x_i^{(m)} \cdot h_j - \tilde{x}_i \cdot \tilde{h}_j)$
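A numpy sketch of one CD-1 update on a mini-batch (bias terms omitted, toy random data, and the hyper-parameters are stand-ins).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
n_visible, n_hidden, batch = 6, 4, 8
W = 0.01 * rng.randn(n_visible, n_hidden)
gamma = 0.1

X = rng.binomial(1, 0.5, (batch, n_visible)).astype(float)  # toy training mini-batch

# Positive phase: h ~ P(h | x^(m))
ph = sigmoid(X @ W)
h = (rng.uniform(size=ph.shape) < ph).astype(float)

# Negative phase: sample x~ from P(x | h), then h~ from P(h | x~)  (one Gibbs step)
px_neg = sigmoid(h @ W.T)
x_neg = (rng.uniform(size=px_neg.shape) < px_neg).astype(float)
ph_neg = sigmoid(x_neg @ W)

# Contrastive Divergence update: w_ij += gamma * sum_batch(x_i h_j - x~_i h~_j)
W += gamma * (X.T @ ph - x_neg.T @ ph_neg) / batch
```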


Deep Belief Nets (DBN)

A DBN stacks RBMs layer-by-layer to get a deep architecture.

Layer-wise pre-training is critical:
First, train an RBM to learn a 1st layer of features $h$ from the input $x$.
Then, treat $h$ as input and learn a 2nd layer of features.
Each added layer improves the variational lower bound on the log probability of the training data.

Further fine-tuning can be obtained with the Wake-Sleep Algorithm:

Do a stochastic bottom-up pass (adjust weights to reconstruct the layer below)
Do a few iterations of Gibbs sampling at the top-level RBM
Do a stochastic top-down pass (adjust weights to reconstruct the layer above)

Note: not to be confused with Dynamic Bayesian Nets or Deep Boltzmann Machines
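A numpy sketch of the greedy layer-wise idea: train an RBM on the input, take its hidden probabilities as the input to the next RBM, and repeat. The tiny train_rbm helper, layer sizes, and data here are stand-ins; biases and fine-tuning are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, epochs=10, gamma=0.1, rng=np.random.RandomState(0)):
    """Tiny CD-1 trainer (biases omitted); returns the learned weight matrix."""
    W = 0.01 * rng.randn(data.shape[1], n_hidden)
    for _ in range(epochs):
        ph = sigmoid(data @ W)
        h = (rng.uniform(size=ph.shape) < ph).astype(float)
        x_neg = sigmoid(h @ W.T)              # mean-field reconstruction
        ph_neg = sigmoid(x_neg @ W)
        W += gamma * (data.T @ ph - x_neg.T @ ph_neg) / len(data)
    return W

rng = np.random.RandomState(0)
X = rng.binomial(1, 0.5, (50, 20)).astype(float)   # toy input data

layer_sizes = [10, 5]          # sizes of the 1st and 2nd hidden layers
weights, layer_input = [], X
for n_hidden in layer_sizes:   # greedy layer-wise pre-training
    W = train_rbm(layer_input, n_hidden)
    weights.append(W)
    layer_input = sigmoid(layer_input @ W)   # treat h as the input to the next layer

print([W.shape for W in weights])   # [(20, 10), (10, 5)]
```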


Summary: things to remember about DBNs

1 Layer-wise pre-training is the innovation that enabled training deep architectures.

2 Pre-training focuses on optimizing likelihood on the data, not the target label. The philosophy is to first model $p(x)$ in order to do better at $p(y|x)$.

3 Why use an undirected graphical model like the RBM? Because $p(h|x)$ is computationally tractable (no "explaining away" effect), so stacking them into DBNs is feasible.

4 Learning an RBM still requires approximate inference (e.g. contrastive divergence) since the partition function is expensive.


Minimal Reading List for RBM/DBN

Original DBN paper [Hinton et al., 2006]

Why does unsupervised pre-training help deep learning? [Erhan et al., 2010]

Successful application in Collaborative Filtering [Salakhutdinov et al., 2007]


Outline

1 Introduction

2 Neural Networks
   Preliminaries
   1-Layer & 2-Layer Nets
   Neural Language Models

3 Deep Learning Approach 1: Deep Belief Nets
   Preliminaries
   Restricted Boltzmann Machines
   Deep Belief Nets

4 Deep Learning Approach 2: Stacked Auto-Encoders
   Auto-Encoders
   Stacked Auto-Encoders
   Denoising Auto-Encoders and Variants


Auto-Encoders

Auto-Encoders are a simpler, non-probabilistic alternative to RBMs.

Define an encoder and a decoder and pass the data through them:

Encoder: $h = f_\theta(x)$, e.g. $h = \sigma(Wx + b)$
Decoder: $\hat{x} = g_\theta(h)$, e.g. $\hat{x} = \sigma(W'h + d)$
$W$ and $W'$ need not be tied, but often are in practice.

Encourage $\theta$ to give small reconstruction error:

e.g. $Loss = \sum_m ||x^{(m)} - g_\theta(f_\theta(x^{(m)}))||^2$

A linear encoder/decoder with squared reconstruction error learns the same subspace as PCA.

A sigmoid encoder/decoder gives the same form of $p(h|x)$, $p(x|h)$ as RBMs.
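A numpy sketch of a tied-weight sigmoid auto-encoder and its reconstruction loss; the parameters and toy data are stand-ins.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
d_in, d_hidden = 8, 3
W = 0.1 * rng.randn(d_hidden, d_in)   # tied weights: the decoder uses W.T
b = np.zeros(d_hidden)
d = np.zeros(d_in)

def encode(x):
    return sigmoid(W @ x + b)          # h = f_theta(x)

def decode(h):
    return sigmoid(W.T @ h + d)        # x_hat = g_theta(h)

X = rng.binomial(1, 0.5, (10, d_in)).astype(float)
loss = sum(np.sum((x - decode(encode(x))) ** 2) for x in X)   # sum_m ||x - g(f(x))||^2
print(loss)
```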


Architecture: Stacked Auto-Encoders

Auto-encoders can be stacked in the same way RBMs are stacked to give deep architectures

Hidden unit size:

The hidden layer should be lower-dimensional, or else the auto-encoder may just learn the identity mapping
Alternatively, allow more hidden units but enforce sparsity.


Denoising Auto-Encoders

First, perturb the input data $x$ to $\tilde{x}$ using invariances from domain knowledge.

Reconstruct the original data:

e.g. $Loss = \sum_m ||x^{(m)} - g_\theta(f_\theta(\tilde{x}^{(m)}))||^2$

[Vincent et al., 2010] explored Gaussian noise and salt-and-pepper noise for Vision data. [Glorot et al., 2011] explored masking noise (randomly setting entries to 0) for Text data.
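A numpy sketch of the denoising variant with masking noise (randomly zeroing input entries) while reconstructing the clean input; mask_prob and the other values are stand-ins, and the encoder/decoder are the same tied-weight sigmoids as above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
d_in, d_hidden, mask_prob = 8, 3, 0.3
W = 0.1 * rng.randn(d_hidden, d_in)
b, d = np.zeros(d_hidden), np.zeros(d_in)

def corrupt(x):
    """Masking noise: randomly set a fraction of the input entries to 0."""
    mask = rng.uniform(size=x.shape) > mask_prob
    return x * mask

X = rng.binomial(1, 0.5, (10, d_in)).astype(float)
loss = 0.0
for x in X:
    x_tilde = corrupt(x)                                   # perturb x to x~
    x_hat = sigmoid(W.T @ sigmoid(W @ x_tilde + b) + d)    # encode the corrupted input, then decode
    loss += np.sum((x - x_hat) ** 2)                       # reconstruct the *original* x
print(loss)
```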


Predictive Sparse Decomposition [Kavukcuoglu et al., 2008]

Objective (minimize with respect to $h$, $W$, $\theta$):
$\sum_m \lambda ||h^{(m)}||_1 + ||x^{(m)} - W h^{(m)}||_2^2 + ||h^{(m)} - f_\theta(x^{(m)})||_2^2$

The first two terms are similar to sparse coding. The third term learns a fast encoder that approximates the sparse coder.
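A numpy sketch that only evaluates this objective for given h, W, and encoder parameters; the alternating optimization is not shown, the tanh encoder form is an assumption, and all values are random stand-ins.

```python
import numpy as np

rng = np.random.RandomState(0)
d_in, d_code, M, lam = 8, 5, 10, 0.1

X = rng.randn(M, d_in)          # data x^(m)
H = rng.randn(M, d_code)        # codes h^(m) (would be optimized)
W = rng.randn(d_in, d_code)     # dictionary (would be optimized)
W_enc = rng.randn(d_code, d_in) # fast encoder parameters theta (would be optimized)

def f_theta(x):
    return np.tanh(W_enc @ x)   # one possible fast encoder form (an assumption here)

objective = sum(
    lam * np.abs(h).sum()                  # sparsity:      lambda * ||h||_1
    + np.sum((x - W @ h) ** 2)             # sparse coding: ||x - W h||_2^2
    + np.sum((h - f_theta(x)) ** 2)        # prediction:    ||h - f_theta(x)||_2^2
    for x, h in zip(X, H))
print(objective)
```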


Summary: things to remember about Stacked Auto-Encoders

1 Auto-encoders are computationally cheaper alternatives to RBMs. We stack them into deep architectures in the same way we stack RBMs into DBNs.

2 Auto-encoders learn to "compress" and "reconstruct" input data. Low reconstruction error corresponds to an encoding that captures the main variations in the data. Again, the focus is on modeling $p(x)$ first.

3 Many variants of encoders are out there, and some provide effective ways to incorporate domain expertise.


Minimal Reading List for Stacked Auto-Encoders

Original Stacked Auto-encoder paper [Bengio et al., 2006]

Comparison of optimization methods [Le et al., 2011]

Speeding up the reconstruction error computation for large word vectors [Dauphin et al., 2011]

Denoising Auto-Encoders [Vincent et al., 2010]


Selected Readings for NLPers

Deep Learning Applications in NLP:

Sentiment Analysis [Glorot et al., 2011]Parsing[Socher et al., 2011b, Collobert et al., 2011, Collobert, 2011]Paraphrase Detection [Socher et al., 2011a]Learning lexical semantics:[Huang et al., 2012, Socher et al., 2012b]

Applications in other fields, but worth reading:

Good reference that defines many terms popular in Deep Learning vision papers [Jarrett et al., 2009]
Deep learning of cats: entirely unsupervised learning of high-level features on massive datasets [Le et al., 2012]

References

Bengio, Y. (2009). Learning Deep Architectures for AI. Foundations and Trends in Machine Learning. NOW Publishers.

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language model. JMLR.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2006). Greedy layer-wise training of deep networks. In NIPS'06, pages 153–160.

Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press.

Collobert, R. (2011). Deep learning for efficient discriminative parsing. In AISTATS.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Dahl, G., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, Special Issue on Deep Learning for Speech and Language Processing.

Dauphin, Y., Glorot, X., and Bengio, Y. (2011). Large-scale learning of embeddings with reconstruction sampling. In ICML'11, pages 945–952.


Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., and Bengio, S. (2010). Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625–660.

Glorot, X., Bordes, A., and Bengio, Y. (2011). Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML.

Hinton, G., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554.

Huang, E., Socher, R., Manning, C., and Ng, A. (2012). Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 873–882, Jeju Island, Korea. Association for Computational Linguistics.

Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th International Conference on Computer Vision.

Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008). Fast inference in sparse coding algorithms with applications to object recognition. Technical Report CBLL-TR-2008-12-01, Computational and Biological Learning Lab, Courant Institute, NYU.


Le, Q., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A. (2011). On optimization methods for deep learning. In Getoor, L. and Scheffer, T., editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 265–272, New York, NY, USA. ACM.

Le, Q. V., Ranzato, M., Monga, R., Devin, M., Chen, K., Corrado, G. S., Dean, J., and Ng, A. Y. (2012). Building high-level features using large scale unsupervised learning. In ICML.

Lee, H., Grosse, R., Ranganath, R., and Ng, A. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML.


McCulloch, W. S. and Pitts, W. H. (1943). A logical calculus of the ideas immanent in nervous activity. In Bulletin of Mathematical Biophysics, volume 5, pages 115–137.

Mikolov, T., Deoras, A., Povey, D., Burget, L., and Cernocky, J. (2011). Strategies for training large scale neural network language models. In ASRU.

Minsky, M. and Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.

Mnih, A. and Hinton, G. (2008). A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems 21 (NIPS 2008).


Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323:533–536.

Sainath, T. N., Kingsbury, B., Ramabhadran, B., Fousek, P., Novak, P., and Mohamed, A. (2011). Making deep belief networks effective for large vocabulary continuous speech recognition. In ASRU.

Salakhutdinov, R., Mnih, A., and Hinton, G. (2007). Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pages 791–798.

Schwenk, H., Rousseau, A., and Attik, M. (2012). Large, pruned or continuous space language models on a GPU for statistical machine translation. In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, pages 11–19, Montreal, Canada. Association for Computational Linguistics.

Socher, R., Bengio, Y., and Manning, C. (2012a). Deep learning for NLP (without the magic). ACL Tutorials. http://www.socher.org/index.php/DeepLearningTutorial/DeepLearningTutorial.

Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011a). Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS.

Socher, R., Huval, B., Manning, C. D., and Ng, A. Y. (2012b). Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211, Jeju Island, Korea. Association for Computational Linguistics.

Socher, R., Lin, C., Ng, A. Y., and Manning, C. D. (2011b). Parsing natural scenes and natural language with recursive neural networks. In ICML.

Turian, J., Ratinov, L.-A., and Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394, Uppsala, Sweden. Association for Computational Linguistics.

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408.
