Upload
others
View
24
Download
0
Embed Size (px)
Citation preview
CS 744: PYTORCH
Shivaram VenkataramanFall 2020
Hi!
ADMINISTRIVIA
Assignment 2 out!
Bid on topics, submit group (1 sentences) – Oct 5Project Proposal (2 pages) – Oct 16
IntroductionRelated WorkTimeline (with eval plan)
→ Due 10/5 ( Monday next week )
-28g y,
Oct 1
- Piazza
Scalable Storage Systems
Datacenter Architecture
Resource Management
Computational Engines
Machine Learning SQL Streaming Graph
Applications
- Iem ←
→spark ,
MapReduce
ars→
pesos→ DRF
-
EMPIRICAL RISK MINIMIZATION
Function
Data (Examples)
Model
Regularization
ised→ and labels
Shifrin, green
trainingdd"
f)Fit a model
--
DEEP LEARNING
ResNet18
ConvolutionReLU
MaxPoolFully Connected
SoftMax
-ppFC. 84 = [ 84 dim ]
(gtfo" # eager; read argon man eager I
-
r
O'
rt.g.ie:"qq.im Him 'm?!";fion'
STOCHASTIC GRADIENT DESCENT
Initialize wFor many iterations:
Loss = Forward passGradient = backwardUpdate model
End
Good fit for
sin"
raiser Tinhorn in→ ardent;
←
f -
- eat[ y
leathers
diindiarddef - (model)→ yfcw , input )- b Ha
'Fwy"
-(model) chain rule
-
↳ model is shared → how do we parallelize↳
every iteration depends on previous
DATA PARALLEL MODEL TRAINING
+
Parallelizeoneiteration
→nextiteration
model -' reflate WH int,ft.
data points→ does CB . ) ←
25664 B , ← gradient (B. )
model wµ , forward pass
pandita 132→
lots!B" f (model , B)
model ! what ↳ flmodd ,Bi)
64 133 .
ffwodd.
,BD
model .
i
.
Wied-
64 Bu !dd up ) :
x÷iWy average lyFuntAdn . update step thatEli takes all grads into accent
COLLECTIVE COMMUNICATIONBroadcast, Scatter Gather, Reduce
From https://mpitutorial.com/tutorials/
°→
MPIgo.im?iodno*EI:ag::g:[email protected]:ad.EE] send,
ties"-
(data ten, root )① detain D
→comate
-
-
--
vector¥"
Chief D- - e -
Es , 47,42
D- -
- 5+2+7+4
ALL REDUCE
From https://mpitutorial.com/tutorials/
Ring All Reduce
-
Po →" 'EET⑧Ds-②
- - - - I 1¥ 's⑨④ c- Da!# 18
14Pe
B
ends
DISTRIBUTED DATA PARALLEL API
→ only line of code change✓ local model
- Non - intrusive
-Hooks to do optimizationsin background
GRADIENT BUCKETINGWhy do we need gradient bucketing?
60M parameter
↳ small tensor sires
lead to greater time for
Ad Reduce
Every All Redn? latency con-how wt
(fixed overhead) t handoff't
→ why not one big bucket
= wait for all gradiah-reatdy.be O g= Cannot overlap backward, Altadena
GRADIENT BUCKETING + ALL REDUCEparameter=. layers
\ A. buckets become
② 0 ready ,
we start
All Reduceonthem
{⑧CTO wageIn background ,
£the gradient comp
9 continuesgriffe. Eredtf by sive =
.25 MB
-
Gradient Accumulation
xtra parameterC 3-
e
✓dgidda.FI[ wm
no- sync AllReduceBet , Bu , Bi D " y Allrednce DCI 134 BED \Br
,Bs ,
B- D '" -④
00¥; !!!8§,B ,
'C D
Bg , Bb I'33 D -
IMPLEMENTATION
Bucket_cap_mb
Parameter-to-bucket mapping
Round-robin ProcessGroups
Fazio Port ①→ ⑦↳ 1234
y I
viii.iii. y③← ②
= Parameter that is tunable-
small → overhead ~middle =
.
25 MB
large →no overlap -
↳ query baiatal""SMB
¥7 Lag::] - mate> um 's
flayer ]-
buckets → filled up↳ math function amp
GPUs = on abatch
cpu, ..
data / backward0 pass
BREAKDOWN
SUMMARY
Pytorch: Framework for deep learningDistributedDataParallel APIGradient bucketing, AllReduceOverlap computation and communication
DISCUSSIONhttps://forms.gle/6xhVBNBhdzsJ6gBE6
profanity na%ner÷ldf% , Andy.de;rwrk well terrene
Timefr329pI⑦ -
e fine for 16 am
-
o.O -0
④scales well !
- optimal bucket
sin depends on
→ 00000 a. orNew
-000
-Nccu is more
performart &
variancetown less
This paperscales well ! ? fweak Seeling
strong scaling13=64
,incremeaprn.mn
i
B÷256
increasemm
GPUs
T - #
T - ¥,
I2
What could be some challenges in implementing similar optimizations for AllReduce in Apache Spark?
spark :
"
larger workloads"
?
Each worker node on spark had dataset
↳ spark needs to shuffle operation14
Org ,
more expensivethan
Necc
- pig reduce0 Top Tree-
Org , - Reduce overlap compute / communicationOtsu - knees ask compete. → veggie fahimgdfngtietime
NEXT STEPS
Next class: PipeDreamAssignment 2 is due soon!
Project ProposalGroups by Oct 52 pager by Oct 16
bucket flare:C. - Jbye!
copy user programAltoC .
. .
eI scatter h÷! an
Process GroupAPI
safer \-
<
aloo / EI. TITE )
Nccu -
#which
link
network monitoring Fes• '- ↳¥→ too:÷:*.
Eisman.mn?!;aYE!ir'?/FEiI.nm⇒ + We"" ¥ YE