
Dive into Deep Learning in 1 Day - c.d2l.ai


Page 1

© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Alex Smola http://courses.d2l.ai/odsc2019/

Dive into Deep Learning in 1 Day

1 Basics · 2 Convnets · 3 Computation · 4 Sequences

ODSC 2019

Page 2

Outline

• Installation
• Linear Algebra and Deep NumPy
• Autograd
• Softmax Classification
• Optimization
• Multilayer Perceptron

Page 3

Installation

Page 4

Tools

• Python
• Jupyter
• Reveal (for notebook slides)

conda create --name mxnet
conda activate mxnet
conda install python==3.7 pip jupyter rise
pip install mxnet-cu100 --pre
pip install git+https://github.com/d2l-ai/d2l-en@numpy2
git clone https://github.com/d2l-ai/1day-notebooks.git

Pages 5-6

Deep mxnet.np

Pages 7-8

Typical Deep Learning Workflow

• Get data and load it (e.g. in Pandas)
  • Preprocess it in NumPy
  • Convert to framework-specific format
• Train model
  • Build model in framework-specific format (TF, PyTorch, Gluon)
  • Optimize
• Convert predictions into NumPy
  • Plot and serve

Do you remember all the peculiarities?

Page 9

Matrix Multiplication in NumPy

Random normal matrix

Matrix-matrix multiplication

Numbers from old 2015 MacBook Pro (runs on GPU, too)
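The code on this slide did not survive extraction; a minimal NumPy version of what it shows (matrix size and timing style are illustrative):

```python
import time
import numpy as np

a = np.random.normal(size=(1000, 1000))   # random normal matrix
b = np.random.normal(size=(1000, 1000))

start = time.time()
c = np.dot(a, b)                           # matrix-matrix multiplication
print(f'{time.time() - start:.3f} sec')
```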

Page 10

Matrix Multiplication in PyTorch

Random normal matrix

Matrix-matrix multiplication
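The PyTorch code is likewise missing from the extraction; roughly it would be:

```python
import torch

a = torch.randn(1000, 1000)   # random normal matrix
b = torch.randn(1000, 1000)
c = torch.mm(a, b)            # matrix-matrix multiplication
```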

Page 11

Matrix Multiplication in TensorFlow

Random normal matrix

Matrix-matrix multiplication
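And a sketch of the TensorFlow (2.x) equivalent:

```python
import tensorflow as tf

a = tf.random.normal((1000, 1000))   # random normal matrix
b = tf.random.normal((1000, 1000))
c = tf.matmul(a, b)                  # matrix-matrix multiplication
```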

Pages 12-13

Matrix Multiplication in …

• MXNet Gluon
• Caffe2
• CNTK
• Chainer
• Paddle
• Keras

Confusing

Page 14

Deep NumPy

(Figure label: Barrier)

Page 15

Frontend for Deep NumPy (on MXNet)

• np (NumPy drop-in replacement)
  • 100% NumPy compatible* (arguments for device context)
  • Asynchronous evaluation
  • autograd
• npx (NumPy extensions)
  • Devices
  • Convolutions and other DL operators
• nn (neural networks)
  • Layers, optimizers, loss functions … (MXNet Gluon)
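In practice the drop-in replacement looks like plain NumPy; a minimal sketch:

```python
from mxnet import np, npx
npx.set_np()            # make MXNet arrays behave like NumPy

a = np.ones((2, 3))     # same call as numpy.ones
b = np.ones((3, 2))
c = np.dot(a, b)        # evaluated asynchronously, on CPU or GPU
print(c)
```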

Page 16

NumPy notebook

Page 17

Automatic Differentiation

Pages 18-25

Automatic Differentiation

Computing derivatives by hand is hard; computers can automate this.

Compute graph
• Build explicitly (TensorFlow, MXNet Symbol)
• Build implicitly by tracing (Chainer, PyTorch, Deep NumPy)

Example: z = (⟨x, w⟩ − y)² decomposes into a = ⟨x, w⟩, b = a − y, z = b², with inputs x, w, y.

Chain rule (evaluated e.g. via backprop):

∂y/∂x = (∂y/∂u_n) (∂u_n/∂u_{n-1}) ⋯ (∂u₂/∂u₁) (∂u₁/∂x)
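A short sketch of the tracing approach using MXNet's autograd, computing the gradient of z = (⟨x, w⟩ − y)² from the slide (array values are made up for illustration):

```python
from mxnet import np, npx, autograd
npx.set_np()

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.5, 1.0])
y = 2.0

w.attach_grad()                  # allocate storage for dz/dw
with autograd.record():          # trace: a = <x, w>, b = a - y, z = b^2
    z = (np.dot(x, w) - y) ** 2
z.backward()                     # backprop through the traced graph
print(w.grad)                    # equals 2 * (<x, w> - y) * x
```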

Page 26

AutoGrad notebook

Pages 27-28

(Screenshot: AWS spot prices, e.g. g3.4xlarge $0.41, g3.8xlarge $0.73, g3.16xlarge $1.37)

Can we estimate prices (time, server, region)?

p = w_time · t + w_server · s + w_region[r]

Page 29

Linear Model

• n-dimensional inputs: x = [x₁, x₂, …, xₙ]ᵀ
• Linear model: y = w₁x₁ + w₂x₂ + … + wₙxₙ + b
  with weights w = [w₁, w₂, …, wₙ]ᵀ and bias b
• Vectorized version: y = ⟨w, x⟩ + b
• Loss function (to measure goodness of fit): ℓ(ŷ, y) = (ŷ − y)²
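A minimal Deep NumPy sketch of the model and loss (function names are illustrative):

```python
from mxnet import np, npx
npx.set_np()

def linreg(X, w, b):
    """Vectorized linear model: y_hat = Xw + b."""
    return np.dot(X, w) + b

def squared_loss(y_hat, y):
    """Per-example squared loss (y_hat - y)^2."""
    return (y_hat - y) ** 2
```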

Page 30

Linear Model as a Single-layer Neural Network

We can stack multiple layers to get deep neural networks

Page 31

Training and Test Data

• Collect multiple data points to fit the parameters (e.g. spot instance prices over the past 6 months)
• Training data (the more the better): X = [x₀, x₁, …, xₙ]ᵀ, y = [y₀, y₁, …, yₙ]ᵀ
• Test data X′ = [x′₀, x′₁, …, x′ₙ′]ᵀ: when we actually want to try it out.
  We do not observe labels at prediction time (after all, that's what we use ML for).

Page 32

Training

• Objective is to minimize the (regularized) training loss:

  argmin_w (1/n) Σ_{i=1..n} ℓ(yᵢ, ŷᵢ)

• Plugging in the expansion ŷ = ⟨w, x⟩ + b yields the objective

  (1/n) Σ_{i=1..n} (yᵢ − ⟨w, xᵢ⟩ − b)² = (1/n) ‖y − Xw − b‖²
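The vectorized objective is a one-liner in code; a sketch assuming X, y, w, b as defined above:

```python
from mxnet import np, npx
npx.set_np()

def training_loss(w, b, X, y):
    """(1/n) * ||y - Xw - b||^2, the vectorized squared training loss."""
    return ((y - np.dot(X, w) - b) ** 2).mean()
```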

Page 33

Linear Regression notebook

Page 34

Logistic Regression

Pages 35-37

Regression vs. Classification

Regression estimates a continuous value; classification predicts a discrete category.

• MNIST: classify hand-written digits (10 classes)
• ImageNet: classify natural objects (1000 classes)

Page 38

From Regression to Multi-class Classification

Regression
• Single continuous output
• Natural scale in ℝ
• Loss given e.g. in terms of the difference y − f(x)

Classification
• Multiple classes, typically multiple outputs
• Score should reflect confidence

Page 39

From Regression to Multi-class Classification

Square Loss
• One-hot encoding per class: y = [y₁, y₂, …, yₙ]ᵀ with yᵢ = 1 if i = y, 0 otherwise
• Train with squared loss
• Largest output wins: ŷ = argmaxᵢ oᵢ

Page 40

From Regression to Multi-class Classification

Calibrated Scale
• Output matches probabilities (non-negative, sums to 1):

  p(y | o) = softmax(o), i.e. p(y | o) = exp(o_y) / Σᵢ exp(oᵢ)

• Negative log-likelihood:

  −log p(y | o) = log Σᵢ exp(oᵢ) − o_y

Page 41

Softmax and Cross-Entropy Loss

• Negative log-likelihood (for a given label y):

  −log p(y | o) = log Σᵢ exp(oᵢ) − o_y

• Cross-entropy loss (for a probability distribution y):

  l(y, o) = log Σᵢ exp(oᵢ) − yᵀo

• Gradient: the difference between the estimated and true probabilities

  ∂_o l(y, o) = exp(o) / Σᵢ exp(oᵢ) − y = softmax(o) − y
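A sketch of these formulas in Deep NumPy, following the d2l notebook style (a production implementation would also subtract the row-wise maximum before exponentiating, for numerical stability):

```python
from mxnet import np, npx
npx.set_np()

def softmax(O):
    """Map each row of scores to probabilities (non-negative, sums to 1)."""
    O_exp = np.exp(O)
    return O_exp / O_exp.sum(axis=1, keepdims=True)

def cross_entropy(O, y):
    """-log p(y|o): pick out the probability of the true label in each row."""
    return -np.log(softmax(O)[range(len(O)), y])
```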

Page 42

Logistic Regression Notebook

Page 43

Stochastic Gradient Descent

(Figure: optimization trajectory with arrows labeled "negative gradient" and "momentum".)

Page 44

Gradient Descent

Choose a starting point w₀, then repeat the update

w_t = w_{t-1} − η ∂_w ℓ(w_{t-1})

• The gradient direction indicates the direction of increase in value, so we step against it
• The learning rate η adjusts the step length

(Figure: descent steps w₀ → w₁ → w₂ on the loss contours.)
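A minimal end-to-end sketch of this update on the linear-regression objective (synthetic data and learning rate are illustrative):

```python
from mxnet import np, npx, autograd
npx.set_np()

# synthetic data for illustration
X = np.random.normal(size=(100, 2))
y = np.dot(X, np.array([2.0, -3.4])) + 4.2

w, b = np.zeros(2), np.zeros(1)
w.attach_grad(); b.attach_grad()

eta = 0.1                                   # learning rate
for t in range(50):                         # repeat the update
    with autograd.record():
        loss = ((np.dot(X, w) + b - y) ** 2).mean()
    loss.backward()
    w[:] = w - eta * w.grad                 # w_t = w_{t-1} - eta * grad
    b[:] = b - eta * b.grad
```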

Page 45

Goldilocks Learning Rate

(Figure panels: learning rate too small vs. too big.)

Page 46

Stochastic Gradient Descent (SGD)

• Computing the gradient over all data is too slow • Redundancy in data (e.g. many similar digits)

• Single observation is not efficient on GPU

• Sample examples to approximate loss/gradientb is the mini-batch size (chosen for GPU efficiency)

b i1, …ib1b ∑

i∈Ib

l(xi, yi, w) and 1b ∑

i∈Ib

∂wl(xi, yi, w)
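A sketch of the mini-batch sampling, in the style of the d2l notebooks (batch size 32 is a placeholder):

```python
import random
from mxnet import np, npx
npx.set_np()

def data_iter(X, y, batch_size=32):
    """Yield random mini-batches (X_i, y_i) covering the data once."""
    indices = list(range(len(X)))
    random.shuffle(indices)                  # visit examples in random order
    for i in range(0, len(indices), batch_size):
        j = np.array(indices[i:i + batch_size])
        yield X[j], y[j]                     # batch approximates loss/gradient
```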

Page 47

Multilayer Perceptron

Pages 48-51

Learning XOR

(Figure: data in four quadrants, labeled 1-4; no single linear classifier separates the classes.)

Sign pattern per quadrant:

quadrant           1  2  3  4
first classifier   +  -  +  -
second classifier  +  +  -  -
product            +  -  -  +

The two linear classifiers: w₁₁x₁ + w₁₂x₂ + b₁ and w₂₁x₁ + w₂₂x₂ + b₂
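A tiny numeric check of this construction (hand-picked points and classifiers, purely illustrative):

```python
import numpy as np

# one point per quadrant (illustrative layout)
pts = np.array([[1.0, 1.0], [-1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])

f1 = np.sign(pts[:, 0])   # first linear classifier: sign of x1
f2 = np.sign(pts[:, 1])   # second linear classifier: sign of x2
print(f1 * f2)            # positive when x1, x2 agree in sign, i.e. XOR of the signs
```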

Pages 52-54

Single Hidden Layer

• Input x ∈ ℝⁿ
• Hidden layer W₁ ∈ ℝ^{m×n}, b₁ ∈ ℝᵐ
• Output layer w₂ ∈ ℝᵐ, b₂ ∈ ℝ

h = σ(W₁x + b₁)
o = w₂ᵀ h + b₂

σ is an element-wise activation function.
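A direct translation of these equations into Deep NumPy (sizes are placeholders; ReLU stands in for σ and is defined two slides later):

```python
from mxnet import np, npx
npx.set_np()

n, m = 4, 8                                  # input and hidden sizes (placeholders)
W1 = np.random.normal(size=(m, n)); b1 = np.zeros(m)
w2 = np.random.normal(size=(m,));   b2 = np.zeros(1)

def sigma(x):
    return np.maximum(x, 0)                  # element-wise activation (ReLU)

def net(x):
    h = sigma(np.dot(W1, x) + b1)            # h = sigma(W1 x + b1)
    return np.dot(w2, h) + b2                # o = w2^T h + b2
```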

Page 55

Single Hidden Layer

Why do we need a nonlinear activation?

Page 56

Single Hidden Layer

Why do we need a nonlinear activation? Without one the network collapses:

h = W₁x + b₁
o = w₂ᵀ h + b₂
hence o = w₂ᵀ W₁ x + b′

Linear …

Page 57

ReLU Activation

ReLU: rectified linear unit

ReLU(x) = max(x,0)

Page 58

Sigmoid Activation

Maps the input into (0, 1):

sigmoid(x) = 1 / (1 + exp(−x))

A soft version of the step function σ(x) = 1 if x > 0, 0 otherwise.
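Both functions in code; a small Deep NumPy sketch:

```python
from mxnet import np, npx
npx.set_np()

def sigmoid(x):
    return 1 / (1 + np.exp(-x))       # smooth, output in (0, 1)

def step(x):
    return (x > 0).astype('float32')  # the hard threshold that sigmoid softens

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), step(x))
```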

Page 59

Multiclass Classification

Page 60

Multiclass Classification

y₁, y₂, …, yₖ = softmax(o₁, o₂, …, oₖ)

Page 61

Multiple Hidden Layers

Hyper-parameters
• # of hidden layers
• hidden size of each layer

h₁ = σ(W₁x + b₁)
h₂ = σ(W₂h₁ + b₂)
h₃ = σ(W₃h₂ + b₃)
o = W₄h₃ + b₄
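The same stack via the nn frontend (MXNet Gluon) from the Deep NumPy slide; the layer widths are placeholders:

```python
from mxnet import npx
from mxnet.gluon import nn
npx.set_np()

net = nn.Sequential()
net.add(nn.Dense(256, activation='relu'),  # hidden layer 1
        nn.Dense(128, activation='relu'),  # hidden layer 2
        nn.Dense(64, activation='relu'),   # hidden layer 3
        nn.Dense(10))                      # output layer o
net.initialize()
```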

Page 62

MLP Notebook

Page 63

Summary

• Installation
• Linear Algebra and Deep NumPy
• Autograd
• Softmax Classification
• Optimization
• Multilayer Perceptron