
CS 744: PYTORCH


Page 1: CS 744: PYTORCH

CS 744: PYTORCH

Shivaram Venkataraman, Fall 2020

Hi!

Page 2: CS 744: PYTORCH

ADMINISTRIVIA

Assignment 2 out!

Bid on topics, submit group (1 sentence) – Oct 5
Project Proposal (2 pages) – Oct 16

Introduction
Related Work
Timeline (with eval plan)

→ Due 10/5 (Monday next week)


Oct 1

- Piazza

Page 3: CS 744: PYTORCH

Scalable Storage Systems

Datacenter Architecture

Resource Management

Computational Engines

Machine Learning SQL Streaming Graph

Applications

(Annotations: example systems at each layer — MapReduce and Spark as computational engines; Mesos → DRF for resource management.)

Page 4: CS 744: PYTORCH

EMPIRICAL RISK MINIMIZATION

Function

Data (Examples)

Model

Regularization

(Annotations: supervised learning — examples and labels; use training data to fit a model.)
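A minimal PyTorch-style sketch of the objective this slide names: the empirical risk (average loss of the model on the data examples) plus a regularization term. The function name `erm_loss` and the weight `lam` are illustrative, not from the slides.

```python
import torch
import torch.nn as nn

# Illustrative ERM objective: loss of the model on a batch of examples
# plus an L2 regularization penalty on the parameters.
def erm_loss(model, batch_x, batch_y, lam=1e-4):
    criterion = nn.CrossEntropyLoss()               # the loss "Function"
    preds = model(batch_x)                          # the "Model" applied to "Data"
    data_loss = criterion(preds, batch_y)           # empirical risk on this batch
    reg = sum((p ** 2).sum() for p in model.parameters())  # "Regularization"
    return data_loss + lam * reg
```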

Page 5: CS 744: PYTORCH

DEEP LEARNING

ResNet18

Convolution, ReLU
MaxPool, Fully Connected

SoftMax

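To make the layer names concrete, here is a small sketch that loads torchvision's ResNet18 and runs one forward pass; the input shape and `num_classes` value are illustrative.

```python
import torch
from torchvision.models import resnet18

# ResNet18 is built from the blocks listed on the slide:
# convolutions, ReLU, max pooling, a fully connected layer, and softmax.
model = resnet18(num_classes=1000)
x = torch.randn(1, 3, 224, 224)          # one RGB image
logits = model(x)                        # forward pass
probs = torch.softmax(logits, dim=1)     # SoftMax over the class scores
print(model.conv1, model.maxpool, model.fc, sep="\n")
```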

Page 6: CS 744: PYTORCH

STOCHASTIC GRADIENT DESCENT

Initialize w
For many iterations:
  Loss = forward pass
  Gradient = backward pass
  Update model
End

(Annotations: the gradient of the loss f(model, input) is computed in the backward pass via the chain rule.)

↳ The model is shared → how do we parallelize?
↳ Every iteration depends on the previous one.
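A minimal training loop matching the pseudocode above, assuming a toy linear model and random data (both illustrative):

```python
import torch

model = torch.nn.Linear(10, 2)                       # "Initialize w"
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(100):                              # "For many iterations"
    x = torch.randn(64, 10)                          # a batch of examples
    y = torch.randint(0, 2, (64,))                   # their labels
    loss = loss_fn(model(x), y)                      # Loss = forward pass
    opt.zero_grad()
    loss.backward()                                  # Gradient = backward pass
    opt.step()                                       # Update model
```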

Page 7: CS 744: PYTORCH

DATA PARALLEL MODEL TRAINING

Parallelize one iteration → next iteration

(Diagram: replicate the model on each worker; a batch of 256 data points is split into B1–B4 of 64 each; each worker runs the forward and backward pass f(model, Bi) to get gradient(Bi); the gradients are averaged so the update step takes all gradients into account.)
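A single-process sketch of one data-parallel iteration as drawn on the slide: the 256-example batch is split into four shards of 64, each model replica computes a gradient on its shard, and the gradients are averaged before one update. Real systems run the shards on separate workers; here everything is sequential for illustration.

```python
import copy
import torch

model = torch.nn.Linear(10, 2)
loss_fn = torch.nn.CrossEntropyLoss()
x, y = torch.randn(256, 10), torch.randint(0, 2, (256,))

shard_grads = []
for xs, ys in zip(x.chunk(4), y.chunk(4)):           # B1..B4, 64 examples each
    replica = copy.deepcopy(model)                   # replicate the shared model
    loss_fn(replica(xs), ys).backward()              # forward + backward on f(model, Bi)
    shard_grads.append([p.grad for p in replica.parameters()])

# Average the per-shard gradients and apply one update to the shared model.
with torch.no_grad():
    for p, *grads in zip(model.parameters(), *shard_grads):
        p -= 0.1 * torch.stack(grads).mean(dim=0)    # lr = 0.1
```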

Page 8: CS 744: PYTORCH

COLLECTIVE COMMUNICATION
Broadcast, Scatter, Gather, Reduce

From https://mpitutorial.com/tutorials/

(Annotations: MPI-style collectives take a data tensor and a root rank. Example: Reduce with sum over per-worker values 5, 7, 4, 2 gives 5 + 2 + 7 + 4 = 18 at the root.)
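A sketch of two of these collectives using torch.distributed with the Gloo backend; it assumes the script is launched with torchrun so the rank and world size come from environment variables, and the per-rank values mirror the 5 + 2 + 7 + 4 example in the notes.

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
rank = dist.get_rank()

t = torch.tensor([42.0]) if rank == 0 else torch.zeros(1)
dist.broadcast(t, src=0)                      # every rank now holds 42.0

vals = [5.0, 2.0, 7.0, 4.0]                   # per-rank values from the notes
v = torch.tensor([vals[rank % 4]])
dist.reduce(v, dst=0, op=dist.ReduceOp.SUM)   # with 4 ranks, rank 0 ends with 18
if rank == 0:
    print(v)
```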

Page 9: CS 744: PYTORCH

ALL REDUCE

From https://mpitutorial.com/tutorials/

Ring All Reduce

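The same setup with AllReduce: every process contributes a tensor and every process receives the reduced result (unlike Reduce, where only the root does). Again assumes a torchrun launch; averaging by world size mirrors the data-parallel use case.

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
rank = dist.get_rank()

grad = torch.tensor([float(rank + 1)])        # stand-in for a local gradient
dist.all_reduce(grad, op=dist.ReduceOp.SUM)   # typically a ring-based algorithm
grad /= dist.get_world_size()                 # average, as in data parallelism
print(f"rank {rank}: {grad}")
```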

Page 10: CS 744: PYTORCH

DISTRIBUTED DATA PARALLEL API

→ Only one line of code change: wrap the local model (sketch below)

- Non-intrusive
- Hooks to do optimizations in the background
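A sketch of that one-line change using the DistributedDataParallel wrapper; the model, optimizer, and data are illustrative, and the script is assumed to be launched with torchrun with one process per GPU.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 2).cuda()
model = DDP(model, device_ids=[local_rank])   # <-- the one-line change

opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(64, 10).cuda(), torch.randint(0, 2, (64,)).cuda()
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()        # DDP hooks run AllReduce on gradients in the background
opt.step()
```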

Page 11: CS 744: PYTORCH

GRADIENT BUCKETING
Why do we need gradient bucketing?

A model can have ~60M parameters spread over many tensors.
↳ Small tensor sizes lead to greater total time for AllReduce.

Every AllReduce call pays a latency cost (fixed overhead) plus a bandwidth cost.

→ Why not one big bucket?
= We would have to wait for all gradients to be ready; cannot overlap the backward pass with AllReduce.
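A sketch of the packing idea: rather than issuing one AllReduce per small gradient tensor (each paying the fixed latency cost), the gradients are flattened into one larger buffer so far fewer calls are needed. The sizes are illustrative, and the actual AllReduce call is elided.

```python
import torch

grads = [torch.randn(n) for n in (1000, 10, 500, 20)]    # many small gradient tensors

bucket = torch.cat([g.view(-1) for g in grads])          # pack into one flat bucket
# ... a single dist.all_reduce(bucket) would replace four small calls ...

# Unpack the reduced bucket back into the original gradient shapes.
out, offset = [], 0
for g in grads:
    out.append(bucket[offset:offset + g.numel()].view_as(g))
    offset += g.numel()
```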

Page 12: CS 744: PYTORCH

GRADIENT BUCKETING + ALL REDUCE

(Annotations: as buckets of gradients become ready, we start AllReduce on them; in the background, the gradient computation continues. Default bucket size = 25 MB.)
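A small sketch of the mechanism that makes this overlap possible: autograd hooks fire as each parameter's gradient becomes ready during the backward pass, so communication on early buckets can start while later gradients are still being computed. DDP registers such hooks internally; this toy version only records the order.

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(10, 10), torch.nn.Linear(10, 2))
ready_order = []

for name, p in model.named_parameters():
    # Hook fires when this parameter's gradient is computed during backward.
    p.register_hook(lambda grad, name=name: ready_order.append(name))

loss = model(torch.randn(4, 10)).sum()
loss.backward()
print(ready_order)     # gradients of later layers become ready first
```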

Page 13: CS 744: PYTORCH

Gradient Accumulation

(Annotations: with no_sync, skip the AllReduce for the first micro-batches B1, B2, B3, ..., accumulate gradients locally, and run AllReduce only on the last micro-batch; see the sketch below.)
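A sketch of gradient accumulation with DDP's no_sync() context manager: the first micro-batches accumulate gradients locally, and only the backward pass of the last micro-batch triggers the AllReduce. The helper name and arguments are illustrative.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def accumulate_and_sync(model: DDP, micro_batches, loss_fn, opt):
    *head, last = list(micro_batches)     # e.g. B1, B2, B3, ... and the final Bn
    with model.no_sync():                 # accumulate locally, no AllReduce
        for x, y in head:
            loss_fn(model(x), y).backward()
    x, y = last
    loss_fn(model(x), y).backward()       # this backward triggers the AllReduce
    opt.step()
    opt.zero_grad()
```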

Page 14: CS 744: PYTORCH

IMPLEMENTATION

Bucket_cap_mb

Parameter-to-bucket mapping

Round-robin ProcessGroups

(Annotations: bucket_cap_mb is a tunable parameter — small buckets add per-AllReduce overhead, very large buckets prevent overlap, and the default middle ground is 25 MB. Parameters are mapped to buckets, and AllReduce starts on each bucket as it fills up during the backward pass; see the sketch below.)
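The knobs named on this slide as exposed by the DDP constructor and torch.distributed: bucket_cap_mb sets the bucket size (default 25 MB), and new_group() creates extra process groups like the round-robin pool the paper describes. Passing a single group to DDP here is a simplification; the round-robin assignment itself happens inside the implementation. Assumes a torchrun launch.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

world = list(range(dist.get_world_size()))
extra_groups = [dist.new_group(ranks=world) for _ in range(3)]   # a pool of groups

model = torch.nn.Linear(10, 2).cuda()
ddp = DDP(model,
          device_ids=[local_rank],
          bucket_cap_mb=25,                  # tunable bucket size (default 25 MB)
          process_group=extra_groups[0])     # which group carries the AllReduce
```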

Page 15: CS 744: PYTORCH

BREAKDOWN

Page 16: CS 744: PYTORCH

SUMMARY

PyTorch: framework for deep learning
DistributedDataParallel API
Gradient bucketing, AllReduce
Overlap computation and communication

Page 17: CS 744: PYTORCH

DISCUSSION
https://forms.gle/6xhVBNBhdzsJ6gBE6

Page 18: CS 744: PYTORCH

(Discussion notes: per-iteration latency scales well as the number of GPUs increases; the optimal bucket size depends on the setup; NCCL is more performant and shows less variance.)

Page 19: CS 744: PYTORCH

Does this paper really show that it scales well? The distinction is weak scaling versus strong scaling: in one setting the per-GPU batch size is held at B = 64 while the number of GPUs increases, while in the other the total batch size is held at B = 256 while the number of GPUs increases.

Page 20: CS 744: PYTORCH

What could be some challenges in implementing similar optimizations for AllReduce in Apache Spark?

Spark: each worker node holds a partition of the dataset, so aggregating gradients requires a shuffle, which is more expensive than NCCL collectives. A tree-based reduce (e.g., treeAggregate) helps, but it is harder to overlap compute and communication, since the reduce can only start once the compute tasks finish.

Page 21: CS 744: PYTORCH

NEXT STEPS

Next class: PipeDream
Assignment 2 is due soon!

Project Proposal
Groups by Oct 5
2-pager by Oct 16
