Lecture 4: Backpropagation and Automatic Differentiation CSE599W: Spring 2018


Page 1:

Lecture 4: Backpropagation and Automatic Differentiation

CSE599W: Spring 2018

Page 2:

Announcement

• Assignment 1 is out today, due in 2 weeks (Apr 19th, 5pm)

Page 3:

Model Training Overview

[Diagram: input → layer1 extractor → layer2 extractor → predictor; a training objective is optimized during training]

Page 4:

Symbolic Differentiation

• The input formula is a symbolic expression tree (computation graph).
• Implement differentiation rules, e.g., the sum rule, product rule, and chain rule.

✘ For complicated functions, the resulting expression can be exponentially large.
✘ Wasteful to keep intermediate symbolic expressions around if we only need a numeric value of the gradient in the end.
✘ Prone to error.

$$\frac{d(f+g)}{dx} = \frac{df}{dx} + \frac{dg}{dx}$$

$$\frac{d(fg)}{dx} = \frac{df}{dx}\,g + f\,\frac{dg}{dx}$$

$$\frac{d\,h(x)}{dx} = \frac{df(g(x))}{dg(x)} \cdot \frac{dg(x)}{dx}, \quad \text{where } h(x) = f(g(x))$$
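To see the expression swell concretely, here is a small SymPy sketch (SymPy and the composed function are illustrative choices, not part of the lecture): repeatedly composing a product makes the symbolic derivative grow much faster than the function itself.

    # Illustrative SymPy sketch, not from the lecture: symbolic
    # differentiation applies the rules above, and the result swells.
    import sympy as sp

    x = sp.symbols('x')
    f = x
    for _ in range(5):          # repeatedly compose a product
        f = f * sp.sin(f)
    df = sp.diff(f, x)          # sum/product/chain rules, applied symbolically
    print(sp.count_ops(f), sp.count_ops(df))  # derivative is much larger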

Page 5:

Numerical Differentiation

• We can approximate the gradient using

$$\frac{\partial f(\theta)}{\partial \theta_i} \approx \frac{f(\theta + h e_i) - f(\theta)}{h} \quad \text{for small } h$$

[Worked example in the figure: $f(W, x) = W \cdot x$ with the entries $-0.8,\ 0.3,\ 0.5$ shown for $W$, and output $-0.2$]

Page 6:

Numerical Differentiation

• We can approximate the gradient using

$$\frac{\partial f(\theta)}{\partial \theta_i} \approx \frac{f(\theta + h e_i) - f(\theta)}{h} \quad \text{for small } h$$

[Figure: the same $f(W, x) = W \cdot x$ example evaluated twice, once with one entry perturbed from $-0.8$ to $-0.8 + \epsilon$, to estimate a single partial derivative]

Page 7:

Numerical Differentiation

• We can approximate the gradient using

$$\frac{\partial f(\theta)}{\partial \theta_i} \approx \frac{f(\theta + h e_i) - f(\theta)}{h}$$

• Reduce the truncation error by using the center difference

$$\frac{\partial f(\theta)}{\partial \theta_i} \approx \frac{f(\theta + h e_i) - f(\theta - h e_i)}{2h}$$

✘ Bad: rounding error, and slow to compute.
✓ A powerful tool to check the correctness of an implementation; usually use h = 1e-6.
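A minimal center-difference gradient checker in numpy might look as follows; the test function $f(\theta) = \sum_i \theta_i^2$ is an illustrative choice, not from the slides.

    import numpy as np

    def numeric_grad(f, theta, h=1e-6):
        """Center-difference estimate of df/dtheta_i for each i."""
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            e = np.zeros_like(theta)
            e.flat[i] = h
            grad.flat[i] = (f(theta + e) - f(theta - e)) / (2 * h)
        return grad

    # Check against the analytic gradient of f(theta) = sum(theta**2),
    # which is 2*theta (example function chosen for this sketch).
    theta = np.array([-0.8, 0.3, 0.5])
    f = lambda t: np.sum(t ** 2)
    assert np.allclose(numeric_grad(f, theta), 2 * theta, atol=1e-4)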

Page 8:

Backpropagation

[Diagram: an operator $f$ takes inputs $x$ and $y$ and produces $z = f(x, y)$]

Page 9:

Backpropagation

[Diagram: operator $f$ with inputs $x$, $y$ and output $z = f(x, y)$; the upstream gradient $\frac{\partial J}{\partial z}$ flows in from the output side]

$$\frac{\partial J}{\partial x} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial x}, \qquad \frac{\partial J}{\partial y} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial y}$$

Computing the gradient becomes a local computation.
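As a sketch of this idea, a hypothetical multiply operator only needs its cached inputs and the upstream gradient $\frac{\partial J}{\partial z}$ to produce $\frac{\partial J}{\partial x}$ and $\frac{\partial J}{\partial y}$; the class below is illustrative scaffolding, not a specific framework's API.

    # Hypothetical operator in the style described above: each op
    # combines the upstream gradient with its local partials.
    class MulOp:
        def forward(self, x, y):
            self.x, self.y = x, y     # cache inputs for the backward pass
            return x * y              # z = f(x, y) = x * y

        def backward(self, dJ_dz):
            dJ_dx = dJ_dz * self.y    # dz/dx = y
            dJ_dy = dJ_dz * self.x    # dz/dy = x
            return dJ_dx, dJ_dy

    op = MulOp()
    z = op.forward(3.0, -2.0)         # z = -6.0
    print(op.backward(1.0))           # (-2.0, 3.0)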

Page 10:

Backpropagation simple example

$$y = \frac{1}{1 + e^{-(w_1 x_1 + w_2 x_2 + w_3)}}$$

Page 11:

Backpropagation simple example

∗"#

+%#

∗"&%&"'

+ ∗ −1 *%+ +1

, = 11 + *.(012034320545)

1/%

Page 12:

Backpropagation simple example

∗"#

+%#

∗"&%&"'

+ ∗ −1 *%+ +1

, = 11 + *.(012034320545)

1/%

1.0

3.0

-2.0

2.02.0

3.0

-4.0 -1.01.0 -1.0 0.37 1.37 0.73

Page 13:

Backpropagation simple example

∗"#

+%#

∗"&%&"'

+ ∗ −1 *%+ +1

, = 11 + *.(012034320545)

1/%

1.0

3.0

-2.0

2.02.0

3.0

-4.0 -1.01.0 -1.0 0.37 1.37 0.73

1.0

Page 14:

Backpropagation simple example

∗"#

+%#

∗"&%&"'

+ ∗ −1 *%+ +1

, = 11 + *.(012034320545)

1/%

1.0

3.0

-2.0

2.02.0

3.0

-4.0 -1.01.0 -1.0 0.37 1.37 0.73

1.0-0.53

, % = 1/% à8984 = −1/%&

:;:% =

:;:,:,:% = −1/%&

Page 15:

Backpropagation simple example

∗"#

+%#

∗"&%&"'

+ ∗ −1 *%+ +1

, = 11 + *.(012034320545)

1/%

1.0

3.0

-2.0

2.02.0

3.0

-4.0 -1.01.0 -1.0 0.37 1.37 0.73

1.0-0.53-0.53

, % = % + 1 à8984 = 1

Page 16:

Backpropagation simple example

∗"#

+%#

∗"&%&"'

+ ∗ −1 *%+ +1

, = 11 + *.(012034320545)

1/%

1.0

3.0

-2.0

2.02.0

3.0

-4.0 -1.01.0 -1.0 0.37 1.37 0.73

1.0-0.53-0.53-0.20

, % = *4 à8984 = *4

:;:% =

:;:,:,:% =

:;:, ⋅ *

4

Page 17:

Backpropagation simple example

∗"#

+%#

∗"&%&"'

+ ∗ −1 *%+ +1

, = 11 + *.(012034320545)

1/%

1.0

3.0

-2.0

2.02.0

3.0

-4.0 -1.01.0 -1.0 0.37 1.37 0.73

1.0-0.53-0.53-0.20

, %," = %" à9:94 = ", 9:

90 = %

0.200.20

0.20

0.200.20

0.600.20

Page 18:

Backpropagation simple example

∗"#

+%#

∗"&%&"'

+ ∗ −1 *%+ +1

, = 11 + *.(012034320545)

1/%

1.0

3.0

-2.0

2.02.0

3.0

-4.0 -1.01.0 -1.0 0.37 1.37 0.73

1.0-0.53-0.53-0.200.200.20

0.20

0.200.20

-0.400.40

0.600.20
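The whole example fits in a few lines of Python; a sketch follows. Assigning the figure's values to w1, x1, w2, x2, w3 as below is my reading of the diagram (any consistent assignment gives the same picture).

    import math

    # Forward pass, mirroring the circuit on the slides.
    w1, x1, w2, x2, w3 = 1.0, 3.0, -2.0, 2.0, 2.0
    p1 = w1 * x1           # 3.0
    p2 = w2 * x2           # -4.0
    s  = p1 + p2 + w3      # 1.0
    n  = -s                # -1.0
    e  = math.exp(n)       # 0.37
    d  = 1.0 + e           # 1.37
    y  = 1.0 / d           # 0.73

    # Backward pass: apply each node's local rule to the upstream gradient.
    dy = 1.0
    dd = dy * (-1.0 / d**2)   # 1/x rule: -0.53
    de = dd * 1.0             # +1 rule:  -0.53
    dn = de * e               # exp rule: -0.20
    ds = dn * -1.0            # *(-1) rule: 0.20
    dw3 = ds                          # 0.20
    dw1, dx1 = ds * x1, ds * w1       # 0.59, 0.20 (slides show 0.60
    dw2, dx2 = ds * x2, ds * w2       # 0.39, -0.39 because they round
    print(dw1, dx1, dw2, dx2)         # to 0.20 before multiplying)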

Page 19:

Any problems? Can we do better?

Page 20:

Problems of backpropagation

• You always need to keep intermediate data in memory during the forward pass, in case it is needed during backpropagation.

• Lack of flexibility, e.g., computing the gradient of a gradient.

Page 21:

Automatic Differentiation (autodiff)

• Create computation graph for gradient computation

Page 22:

Automatic Differentiation (autodiff)

• Create computation graph for gradient computation

∗"#

+%#

∗"&%&"'

+ ∗ −1 *%+ +1

, = 11 + *.(012034320545)

1/%

Page 23:

Automatic Differentiation (autodiff)

• Create computation graph for gradient computation

∗"#

+%#

∗"&%&"'

+ ∗ −1 *%+ +1 1/%

− 1%&

- = 11 + */(123145431656)

- % = 1/% à8985 = −1/%&

Page 24:

Automatic Differentiation (autodiff)

• Create computation graph for gradient computation

∗"#

+%#

∗"&%&"'

+ ∗ −1 *%+ +1 1/%

− 1%&

- = 11 + */(123145431656)

∗ 1

- % = % + 1 à8985 = 1

Page 25:

Automatic Differentiation (autodiff)

• Create computation graph for gradient computation

∗"#

+%#

∗"&%&"'

+ ∗ −1 *%+ +1 1/%

− 1%&

- = 11 + */(123145431656)

∗ 1∗

- % = *5 à8985 = *5

Page 26:

Automatic Differentiation (autodiff)

• Create computation graph for gradient computation

∗"#

+%#

∗"&%&"'

+ ∗ −1 *%+ +1 1/%

− 1%&

- = 11 + */(123145431656)

∗ 1∗∗ −1∗89814

- %, " = %" à8;81 = %

Page 27:

Automatic Differentiation (autodiff)

• Create computation graph for gradient computation

∗"#

+%#

∗"&%&"'

+ ∗ −1 *%+ +1 1/%

− 1%&

- = 11 + */(123145431656)

∗ 1∗∗ −1∗89814

∗89816

Page 28:

AutoDiff Algorithm

[Forward graph: $W$ and $x$ → matmult → softmax → log → mul (with labels y_) → mean → cross_entropy. Gradient graph appended: log-grad → softmax-grad → mul by 1/batch_size → matmult-transpose → W_grad]
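As a concrete, hedged illustration of this pipeline: the well-known closed form for the gradient of mean softmax cross-entropy with respect to the logits is (softmax − labels) / batch_size, and the matmult-transpose step then yields W_grad. The shapes and function name below are illustrative choices, not the lecture's code.

    import numpy as np

    # Sketch of the pipeline above: logits = x @ W, loss = mean
    # cross-entropy against one-hot labels y_.
    def forward_backward(W, x, y_):
        logits = x @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)            # softmax
        loss = -np.mean(np.sum(y_ * np.log(p), axis=1))
        batch = x.shape[0]
        dlogits = (p - y_) / batch                   # softmax/log/mean grads
        W_grad = x.T @ dlogits                       # "matmult-transpose" step
        return loss, W_grad

    x = np.random.randn(4, 3); W = np.random.randn(3, 2)
    y_ = np.eye(2)[np.random.randint(0, 2, size=4)]  # one-hot labels
    loss, W_grad = forward_backward(W, x, y_)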

Page 29:

AutoDiff Algorithm

def gradient(out):
    node_to_grad[out] = 1
    nodes = get_node_list(out)
    for node in reverse_topo_order(nodes):
        grad ← sum partial adjoints from output edges
        input_grads ← node.op.gradient(input, grad) for input in node.inputs
        add input_grads to node_to_grad
    return node_to_grad

[Example graph: $x_2 = \exp(x_1)$, $x_3 = x_2 + 1$, $x_4 = x_2 \times x_3$; the following pages walk through building the adjoint nodes for this graph]
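The pseudocode above is language-agnostic; a minimal runnable Python sketch of the same idea follows. The Node, AddOp, and MulOp classes and the constant-1 placeholder node are scaffolding of this sketch, not the assignment's API; it builds the gradient graph without evaluating it.

    from collections import defaultdict

    class Node:
        def __init__(self, op=None, inputs=()):
            self.op, self.inputs = op, tuple(inputs)

    class AddOp:
        def gradient(self, node, grad):   # d(a+b)/da = d(a+b)/db = 1
            return [grad, grad]

    class MulOp:
        def gradient(self, node, grad):   # d(ab)/da = b, d(ab)/db = a
            a, b = node.inputs
            return [Node(MulOp(), (grad, b)), Node(MulOp(), (grad, a))]

    def get_node_list(out):               # post-order DFS over the graph
        order, seen = [], set()
        def visit(n):
            if id(n) in seen: return
            seen.add(id(n))
            for i in n.inputs: visit(i)
            order.append(n)
        visit(out)
        return order

    def gradient(out):
        node_to_grad = defaultdict(list)
        node_to_grad[id(out)] = [Node()]   # placeholder node for the constant 1
        for node in reversed(get_node_list(out)):  # reverse topological order
            grads = node_to_grad[id(node)]
            if not grads: continue
            grad = grads[0]
            for g in grads[1:]:                    # sum partial adjoints
                grad = Node(AddOp(), (grad, g))
            node_to_grad[id(node)] = [grad]        # store the summed adjoint
            if node.op is None: continue           # leaf: nothing to propagate
            for inp, g in zip(node.inputs, node.op.gradient(node, grad)):
                node_to_grad[id(inp)].append(g)    # new nodes extend the graph
        return node_to_grad

    # Usage on a slice of the slides' example: x3 = x2 + 1, x4 = x2 * x3.
    x2 = Node(); x3 = Node(AddOp(), (x2, Node())); x4 = Node(MulOp(), (x2, x3))
    grads = gradient(x4)   # grads[id(x2)][0] is the summed adjoint x̄2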

Page 30:

AutoDiff Algorithm (walkthrough)

[Step: create the output adjoint $\bar{x}_4 = 1$. node_to_grad: $x_4 \mapsto \bar{x}_4$]

Page 31:

AutoDiff Algorithm (walkthrough, cont.)

[Step: list the nodes in reverse topological order: $x_4, x_3, x_2, x_1$]

Page 32:

AutoDiff Algorithm (walkthrough, cont.)

[Step: at the $\times$ node ($x_4 = x_2 \times x_3$), append nodes $\bar{x}_3 = \bar{x}_4 \times x_2$ and the partial adjoint $\bar{x}_2' = \bar{x}_4 \times x_3$]

Page 33:

AutoDiff Algorithm (walkthrough, cont.)

[Step: record the new adjoints. node_to_grad: $x_4 \mapsto \bar{x}_4$, $x_3 \mapsto \bar{x}_3$, $x_2 \mapsto \bar{x}_2'$]

Page 34:

AutoDiff Algorithm (walkthrough, cont.)

[Step: move on to the $+$ node ($x_3 = x_2 + 1$); node_to_grad is unchanged]

Page 35:

AutoDiff Algorithm (walkthrough, cont.)

[Step: the $+$ node contributes a second partial adjoint $\bar{x}_2'' = \bar{x}_3$, appended as an identity (id) node]

Page 36:

AutoDiff Algorithm (walkthrough, cont.)

[Step: node_to_grad now maps $x_2 \mapsto \bar{x}_2', \bar{x}_2''$]

Page 37:

AutoDiff Algorithm (walkthrough, cont.)

[Step: sum the partial adjoints with a new $+$ node: $\bar{x}_2 = \bar{x}_2' + \bar{x}_2''$]

Page 38:

AutoDiff Algorithm (walkthrough, cont.)

[Step: at the $\exp$ node ($x_2 = \exp(x_1)$), append $\bar{x}_1 = \bar{x}_2 \times x_2$]

Page 39:

AutoDiff Algorithm (walkthrough, cont.)

[Step: all adjoints are built. node_to_grad: $x_4 \mapsto \bar{x}_4$; $x_3 \mapsto \bar{x}_3$; $x_2 \mapsto \bar{x}_2', \bar{x}_2''$; $x_1 \mapsto \bar{x}_1$]

Page 40:

AutoDiff Algorithm (walkthrough, cont.)

[Final gradient graph: the adjoint nodes $\bar{x}_4, \bar{x}_3, \bar{x}_2', \bar{x}_2'', \bar{x}_2, \bar{x}_1$ extend the original graph, so gradients are obtained by evaluating the extended graph]
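A quick numeric sanity check of the walkthrough's result, under the stated graph $x_2 = \exp(x_1)$, $x_3 = x_2 + 1$, $x_4 = x_2 \times x_3$: the adjoint chain gives $\bar{x}_1 = \bar{x}_2 \cdot x_2$ with $\bar{x}_2 = \bar{x}_4 x_3 + \bar{x}_3 = x_3 + x_2$ (since $\bar{x}_4 = 1$), which the center-difference estimate from page 7 should match.

    import math

    def f(x1):
        x2 = math.exp(x1); x3 = x2 + 1.0
        return x2 * x3                    # x4

    x1 = 0.5
    x2 = math.exp(x1); x3 = x2 + 1.0
    adjoint = (x3 + x2) * x2              # x̄1 from the gradient graph
    h = 1e-6
    numeric = (f(x1 + h) - f(x1 - h)) / (2 * h)   # center difference
    assert abs(adjoint - numeric) < 1e-4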

Page 41:

Backpropagation vs AutoDiff

!"

!#exp 1

!(

!)

+

×

!"

!#exp 1

!(

!)

+

×!)

!(×!#′×

!#′′id

!#+

!"×

Backpropagation AutoDiff

Page 42:

Recap

• Numerical differentiation
  • A tool to check the correctness of an implementation

• Backpropagation
  • Easy to understand and implement
  • Bad for memory use and schedule optimization

• Automatic differentiation
  • Generates the gradient computation for the entire computation graph
  • Better for system optimization

Page 43:

References

• Automatic differentiation in machine learning: a survey. https://arxiv.org/abs/1502.05767
• CS231n backpropagation notes: http://cs231n.github.io/optimization-2/