CS 61C: Great Ideas in Computer Architecture
Lecture 18: Parallel Processing – SIMD
Bernhard Boser & Randy Katz
http://inst.eecs.berkeley.edu/~cs61c

…cs61c/fa16/lec/18/L18.pdf · 2016-10-28


61C Survey

It would be nice to have a review lecture every once in a while, actually showing us how things fit in the bigger picture

Agenda

• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl’s law
• Loop unrolling
• Memory access strategy - blocking
• And in Conclusion, …

61C Topics so far …

• What we learned:
  1. Binary numbers
  2. C
  3. Pointers
  4. Assembly language
  5. Datapath architecture
  6. Pipelining
  7. Caches
  8. Performance evaluation
  9. Floating point
• What does this buy us?
  − Promise: execution speed
  − Let’s check!

Reference Problem

• Matrix multiplication
  − Basic operation in many engineering, data, and image processing tasks
  − Image filtering, noise reduction, …
  − Many closely related operations
    § E.g. stereo vision (project 4)
• dgemm
  − double precision floating point matrix multiplication

Application Example: Deep Learning

• Image classification (cats …)
• Pick “best” vacation photos
• Machine translation
• Clean up accent
• Fingerprint verification
• Automatic game playing

Matrices

• Square (or rectangular) N x N array of numbers
  − Dimension N

$$C = A \cdot B$$

$$c_{ij} = \sum_{k} a_{ik} b_{kj}$$

[Figure: matrix with indices i and j, each running from 0 to N-1, marking element c_ij]

Matrix Multiplication

$$C = A \cdot B, \qquad c_{ij} = \sum_{k} a_{ik} b_{kj}$$

[Figure: matrix multiplication diagram with indices i, j, and summation index k]

Reference: Python

• Matrix multiplication in Python

N      Python [Mflops]
32     5.4
160    5.5
480    5.4
960    5.3

• 1 Mflop = 1 million floating point operations per second (fadd, fmul)
• dgemm(N …) takes 2*N^3 flops
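A minimal sketch of a triple-loop dgemm in plain Python, following c_ij = Σ_k a_ik b_kj. The function names `dgemm` and `dgemm_flops` and the list-of-rows storage are assumptions made for illustration, not the lecture's code.

```python
def dgemm(n, a, b):
    """Return c = a x b for n x n matrices stored as lists of rows."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(n):
                total += a[i][k] * b[k][j]  # one fmul and one fadd per iteration
            c[i][j] = total
    return c


def dgemm_flops(n):
    """Floating point operations in dgemm: n^3 multiplies plus n^3 adds."""
    return 2 * n ** 3
```

The innermost statement runs N^3 times and performs two floating point operations each time, which is where the 2*N^3 count comes from.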

C

• c = a x b
• a, b, c are N x N matrices

Timing Program Execution
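One possible way to time a matrix multiply and convert the result to Gflops is sketched below. The helper name `measure_gflops` and the choice of taking the best of several runs are assumptions; the sketch only relies on the 2*N^3 flop count for dgemm.

```python
import time


def measure_gflops(func, n, *args, repeats=3):
    """Time func(*args) and report Gflops, assuming the call performs 2*n^3 flops."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    best = max(best, 1e-12)  # guard against a zero reading from a coarse timer
    return 2 * n ** 3 / best / 1e9
```

Taking the fastest of a few runs reduces the influence of other processes that happen to be scheduled during a measurement.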

C versus Python

N      C [Gflops]   Python [Gflops]
32     1.30         0.0054
160    1.30         0.0055
480    1.32         0.0054
960    0.91         0.0053

240x!

Which class gives you this kind of power?
We could stop here … but why? Let’s do better!


Why Parallel Processing?

• CPU clock rates are no longer increasing
  − Technical & economic challenges
    § Advanced cooling technology too expensive or impractical for most applications
    § Energy costs are prohibitive
• Parallel processing is the only path to higher speed
  − Compare airlines:
    § Maximum speed limited by speed of sound and economics
    § Use more and larger airplanes to increase throughput
    § And smaller seats …

Using Parallelism for Performance

• Two basic ways:
  − Multiprogramming
    § run multiple independent programs in parallel
    § “Easy”
  − Parallel computing
    § run one program faster
    § “Hard”
• We’ll focus on parallel computing in the next few lectures

New-School Machine Structures (It’s a bit more complicated!)

• Parallel Requests
  Assigned to computer, e.g., Search “Katz”
• Parallel Threads
  Assigned to core, e.g., Lookup, Ads
• Parallel Instructions
  >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data
  >1 data item @ one time, e.g., Add of 4 pairs of words
• Hardware descriptions
  All gates @ one time
• Programming Languages

Harness Parallelism & Achieve High Performance

[Figure: software/hardware stack from Warehouse Scale Computer and Smart Phone, through Computer (Core … Core, Memory (Cache), Input/Output) and Core (Instruction Unit(s), Functional Unit(s) computing A0+B0, A1+B1, A2+B2, A3+B3), down to Cache Memory and Logic Gates, with a “Today’s Lecture” callout]

Single-Instruction/Single-Data Stream (SISD)

• Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines
  − E.g. our trusted MIPS

[Figure: single Processing Unit]

This is what we did up to now in 61C

Single-Instruction/Multiple-Data Stream (SIMD or “sim-dee”)

• SIMD computer exploits multiple data streams against a single instruction stream to perform operations that may be naturally parallelized, e.g., Intel SIMD instruction extensions or NVIDIA Graphics Processing Unit (GPU)

Today’s topic.

Multiple-Instruction/Multiple-Data Streams (MIMD or “mim-dee”)

• Multiple autonomous processors simultaneously executing different instructions on different data.
• MIMD architectures include multicore and Warehouse-Scale Computers

[Figure: Instruction Pool, Data Pool, and four processing units (PU)]

Topic of Lecture 19 and beyond.

Multiple-Instruction/Single-Data Stream (MISD)

• Multiple-Instruction, Single-Data stream computer that exploits multiple instruction streams against a single data stream.
• Historical significance

This has few applications. Not covered in 61C.

Flynn Taxonomy, 1966

• SIMD and MIMD are currently the most common parallelism in architectures – usually both in same system!
• Most common parallel processing programming style: Single Program Multiple Data (“SPMD”)
  − Single program that runs on all processors of a MIMD
  − Cross-processor execution coordination using synchronization primitives


SIMD – “Single Instruction Multiple Data”
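To make the idea concrete, the sketch below counts “instructions” for a scalar add versus a 4-lane SIMD add. It is only a counting model in plain Python (nothing here runs in parallel), and the names `scalar_add`, `simd_add`, and `lanes` are illustrative assumptions.

```python
def scalar_add(a, b):
    """SISD style: one add 'instruction' per pair of operands."""
    result = []
    instructions = 0
    for x, y in zip(a, b):
        result.append(x + y)
        instructions += 1
    return result, instructions


def simd_add(a, b, lanes=4):
    """SIMD style: one add 'instruction' handles `lanes` pairs at once."""
    result = []
    instructions = 0
    for start in range(0, len(a), lanes):
        # Conceptually a single vector add over one register's worth of data.
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        instructions += 1
    return result, instructions
```

For eight pairs of operands, the scalar version issues eight adds while the 4-lane version issues two; the results are identical.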

SIMD Applications & Implementations

• Applications
  − Scientific computing
    § Matlab, NumPy
  − Graphics and video processing
    § Photoshop, …
  − Big Data
    § Deep learning
  − Gaming
  − …
• Implementations
  − x86
  − ARM
  − …

First SIMD Extensions: MIT Lincoln Labs TX-2, 1957

x86 SIMD Evolution

• New instructions
• New, wider, more registers
• More parallelism

http://svmoore.pbworks.com/w/file/fetch/70583970/VectorOps.pdf

CPU Specs (Bernhard’s Laptop)

$ sysctl -a | grep cpu
hw.physicalcpu: 2
hw.logicalcpu: 4

machdep.cpu.brand_string: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz

machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C

machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT FPU_CSDS

SIMD Registers

SIMD Data Types

SIMD Vector Mode


Problem

• Today’s compilers (largely) do not generate SIMD code
• Back to assembly …
• x86
  − Over 1000 instructions to learn …
  − Green Book
• Can we use the compiler to generate all non-SIMD instructions?

x86 Intrinsics AVX Data Types

Intrinsics: Direct access to registers & assembly from C

[Figure: AVX data types, with a “Register” label]

Intrinsics AVX Code Nomenclature

x86 SIMD “Intrinsics”

https://software.intel.com/sites/landingpage/IntrinsicsGuide/

[Figure annotations:]
• assembly instruction
• 4 parallel multiplies
• 2 instructions per clock cycle (CPI = 0.5)

Raw Double Precision Throughput (Bernhard’s Powerbook Pro)

Characteristic                         Value
CPU                                    i7-5557U
Clock rate (sustained)                 3.1 GHz
Instructions per clock (mul_pd)        2
Parallel multiplies per instruction    4
Peak double flops                      24.8 Gflops

Actual performance is lower because of overhead

https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
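The peak figure in the table is the product of the three rows above it. A small sketch of that arithmetic, with `peak_gflops` as a hypothetical helper name:

```python
def peak_gflops(clock_ghz, instructions_per_clock, ops_per_instruction):
    """Peak throughput: cycles per second x instructions per cycle x operations per instruction."""
    return clock_ghz * instructions_per_clock * ops_per_instruction


# Values from the table: 3.1 GHz, 2 mul_pd per clock, 4 multiplies per instruction.
laptop_peak = peak_gflops(3.1, 2, 4)
```

With the table's values this gives 3.1 x 2 x 4 = 24.8 Gflops.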

Vectorized Matrix Multiplication

[Figure: matrix multiplication diagram with indices i, j, and k; index i advances in steps of 4 (i += 4)]

Inner Loop:

for i …; i += 4
  for j ...

“Vectorized” dgemm
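The sketch below emulates, in plain Python, the loop structure of a 4-wide vectorized dgemm: the i loop advances by 4 and each step updates four elements of c at once. It assumes flat column-major storage (element (i, j) at index i + j*n) and N a multiple of 4; the name `dgemm_vectorized` and the list-slice “registers” are illustrative stand-ins, not AVX intrinsics.

```python
LANES = 4  # one 256-bit register holds four doubles


def dgemm_vectorized(n, a, b, c):
    """Accumulate c += a x b on flat column-major lists; n must be a multiple of LANES."""
    for i in range(0, n, LANES):
        for j in range(n):
            # "Load" four consecutive elements of column j of c into one register.
            c0 = c[i + j * n:i + j * n + LANES]
            for k in range(n):
                a_vec = a[i + k * n:i + k * n + LANES]  # four rows of column k of a
                b_val = b[k + j * n]                     # one element of b, broadcast to all lanes
                c0 = [acc + x * b_val for acc, x in zip(c0, a_vec)]
            c[i + j * n:i + j * n + LANES] = c0          # "store" the register back
```

Each pass through the k loop corresponds to one vector load, one broadcast, one 4-wide multiply, and one 4-wide add, so the i loop needs only N/4 iterations.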

Performance

       Gflops
N      scalar   avx
32     1.30     4.56
160    1.30     5.47
480    1.32     5.27
960    0.91     3.64

• 4x faster
• But still << theoretical 25 Gflops!

We are flying …

• Survey:
• But … there is so much material to cover!
  − Solution: targeted reading
  − Weekly homework with integrated reading & lecture review


A trip to LA

Commercial airline:

Get to SFO & check-in    SFO → LAX    Get to destination
3 hours                  1 hour       3 hours

Total time: 7 hours

Supersonic aircraft:

Get to SFO & check-in    SFO → LAX    Get to destination
3 hours                  6 min        3 hours

Total time: 6.1 hours

Speedup:

Flying time    S_flight = 60/6 = 10x
Trip time      S_trip = 7/6.1 = 1.15x

Amdahl’s Law

• Get enhancement E for your new PC
  − E.g. floating point rocket booster
• E
  − Speeds up some task (e.g. arithmetic) by factor S_E
  − F is fraction of program that uses this “task”

Execution time:

                 no speedup    speedup section
T_0 (no E)       1 - F         F
T_E (with E)     1 - F         F / S_E

Speedup:

$$S = \frac{T_0}{T_E} = \frac{1}{(1 - F) + \frac{F}{S_E}}$$

Big Idea: Amdahl’s Law

$$S = \frac{T_0}{T_E} = \frac{1}{(1 - F) + \frac{F}{S_E}}$$

(1 - F: part not sped up; F / S_E: part sped up)

Example: The execution time of half of a program can be accelerated by a factor of 2. What is the program speed-up overall?

$$S = \frac{T_0}{T_E} = \frac{1}{(1 - 0.5) + \frac{0.5}{2}} = 1.33 \ll 2$$
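A short sketch of the formula as a function, checked against the two worked numbers in the lecture (the half-program example and the trip to LA, where flying is 1 of 7 hours and gets 10x faster). The name `amdahl_speedup` is an assumption.

```python
def amdahl_speedup(f, s_e):
    """Overall speedup when a fraction f of the run time is sped up by factor s_e."""
    return 1.0 / ((1.0 - f) + f / s_e)


half_program = amdahl_speedup(0.5, 2)      # half the program, twice as fast
trip_to_la = amdahl_speedup(1.0 / 7, 10)   # 1 of 7 hours is flying, 10x faster flight
```

The first value is about 1.33 and the second about 1.15, matching S = 1.33 and S_trip = 7/6.1.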

Maximum “Achievable” Speed-Up

Question: What is a reasonable # of parallel processors to speed up an algorithm with F = 95%? (i.e. 19/20th can be sped up)

a) Maximum speedup:

$$S_{max} = \left. \frac{1}{(1 - F) + \frac{F}{S_E}} \right|_{S_E \to \infty} = \frac{1}{1 - F}$$

$F = 95\% \implies S_{max} = 20$, but $S_E \to \infty$ !?

b) Reasonable “engineering” compromise: equal time in sequential and parallel code

$$1 - F = \frac{F}{S_E} \implies S_E = \frac{F}{1 - F} = \frac{0.95}{0.05} = 19$$

Then $S = \frac{1}{0.05 + 0.05} = 10$
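The two cases can be checked numerically. This is a sketch with hypothetical helper names (`max_speedup`, `balanced_factor`); it only restates the formulas on this slide.

```python
def amdahl_speedup(f, s_e):
    """Overall speedup when a fraction f of the run time is sped up by factor s_e."""
    return 1.0 / ((1.0 - f) + f / s_e)


def max_speedup(f):
    """Limit of the speedup as s_e grows without bound: 1 / (1 - f)."""
    return 1.0 / (1.0 - f)


def balanced_factor(f):
    """s_e at which the sequential and sped-up parts take equal time: 1 - f = f / s_e."""
    return f / (1.0 - f)


s_max = max_speedup(0.95)                 # about 20
s_e = balanced_factor(0.95)               # about 19
s_balanced = amdahl_speedup(0.95, s_e)    # about 10, half of the maximum
```

At the balance point the speedup is always half of the maximum, since the denominator is exactly 2(1 - F).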

[Figure annotations:]
• If the portion of the program that can be parallelized is small, then the speedup is limited
• In this region, the sequential portion limits the performance
• 20 processors for 10x
• 500 processors for 19x

Strong and Weak Scaling

• To get good speedup on a parallel processor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
  − Strong scaling: when speedup can be achieved on a parallel processor without increasing the size of the problem
  − Weak scaling: when speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors
• Load balancing is another important factor: every processor doing same amount of work
  − Just one unit with twice the load of others cuts speedup almost in half
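The load-balancing claim can be checked with a small model: total work is split over p units, one of which carries twice the load of each of the others, and the slowest unit determines the finish time. The function name and the assumption of perfectly parallel work are illustrative.

```python
def speedup_with_one_overloaded_unit(p, overload=2.0):
    """Speedup on p units when one unit carries `overload` times the load of each other unit."""
    # Total work is normalized to 1: (p - 1) units carry x each, one carries overload * x.
    x = 1.0 / ((p - 1) + overload)
    parallel_time = overload * x  # the most heavily loaded unit finishes last
    return 1.0 / parallel_time
```

With 10 units the speedup drops from 10 (balanced, `overload=1.0`) to (10 + 1) / 2 = 5.5, which is almost half.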

Clickers/Peer Instruction

Suppose a program spends 80% of its time in a square root routine. How much must you speed up square root to make the program run 5 times faster?

$$S = \frac{T_0}{T_E} = \frac{1}{(1 - F) + \frac{F}{S_E}}$$

Answer    S_E
A         5
B         16
C         20
D         100
E         None of the above
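One way to check the answer is to solve Amdahl's formula for S_E. The sketch below uses a hypothetical helper `required_factor` and a small tolerance to absorb floating point rounding.

```python
def required_factor(f, target):
    """Return the S_E needed for an overall speedup of `target`, or None if unreachable."""
    # Solve target = 1 / ((1 - f) + f / s_e) for s_e.
    remaining = 1.0 / target - (1.0 - f)  # time budget left for the sped-up part
    if remaining <= 1e-12:
        return None  # the sped-up part would have to take zero (or negative) time
    return f / remaining


for_5x = required_factor(0.8, 5)  # None: 5x is only the limit as S_E goes to infinity
for_4x = required_factor(0.8, 4)  # about 16
```

With F = 0.8 the remaining 20% already takes 1/5 of the original time, so a 5x overall speedup equals the limit 1/(1 - F) and no finite S_E reaches it; under this formula the answer is E.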


Administrivia

• MT2 is
  − Tuesday, November 1,
  − 3:30-5pm
  − see web for room assignments
• TA Review Session:
  § Sunday 10/30, 3:30 – 5 PM in 10 Evans
  § See Piazza

MT2 Topics

• Covers lecture material up to 10/20
  − Caches
  − not floating point
• Combinatorial logic including synthesis and truth tables
• FSMs
• Timing and timing diagrams
• Pipelining
• Datapath, hazards, stalls
• Performance (e.g. CPI, instructions per second, latency)
• Caches
• All topics covered in MT1
  − Focus is new material, but do not be surprised by e.g. MIPS assembly


Amdahl’s Law applied to dgemm

• Measured dgemm performance
  − Peak              5.5 Gflops
  − Large matrices    3.6 Gflops
  − Processor         24.8 Gflops
• Why are we not getting (close to) 25 Gflops?
  − Something else (not floating point ALU) is limiting performance!
  − But what? Possible culprits:
    § Cache
    § Hazards
    § Let’s look at both!

Pipeline Hazards – dgemm

Loop Unrolling

[Figure annotations:]
• 4 registers
• Compiler does the unrolling
• How do you verify that the generated code is actually unrolled?
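A plain-Python emulation of what unrolling by 4 might look like for the vectorized dgemm: four independent accumulator “registers”, each 4 doubles wide, are updated in every k iteration, so consecutive multiply-adds do not wait on one another. The constants `LANES` and `UNROLL`, the flat column-major layout, and the function name are assumptions for illustration.

```python
LANES = 4   # doubles per 256-bit register
UNROLL = 4  # independent accumulator registers


def dgemm_unrolled(n, a, b, c):
    """Accumulate c += a x b on flat column-major lists; n must be a multiple of LANES * UNROLL."""
    step = LANES * UNROLL
    for i in range(0, n, step):
        for j in range(n):
            base = i + j * n
            # Four accumulators covering sixteen consecutive elements of column j of c.
            acc = [c[base + x * LANES:base + (x + 1) * LANES] for x in range(UNROLL)]
            for k in range(n):
                b_val = b[k + j * n]  # broadcast once, reused by all four accumulators
                for x in range(UNROLL):  # an unrolling compiler emits these four bodies back to back
                    off = i + x * LANES + k * n
                    a_vec = a[off:off + LANES]
                    acc[x] = [s + v * b_val for s, v in zip(acc[x], a_vec)]
            for x in range(UNROLL):
                c[base + x * LANES:base + (x + 1) * LANES] = acc[x]
```

In compiled code, one way to confirm unrolling is to inspect the generated assembly and look for the repeated multiply-add sequence without intervening loop branches.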

Performance

       Gflops
N      scalar   avx     unroll
32     1.30     4.56    12.95
160    1.30     5.47    19.70
480    1.32     5.27    14.50
960    0.91     3.64    6.91


FPU versus Memory Access

• How many floating point operations does matrix multiply take?
  − F = 2 x N^3 (N^3 multiplies, N^3 adds)
• How many memory load/stores?
  − M = 3 x N^2 (for A, B, C)
• Many more floating point operations than memory accesses
  − q = F/M = 2/3 * N
  − Good, since arithmetic is faster than memory access
  − Let’s check the code …

But memory is accessed repeatedly

Inner loop:

• q = F/M = 1! (2 loads and 2 floating point operations)
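The gap between the ideal ratio and what the inner loop actually achieves can be put in numbers. A sketch, with hypothetical function names, assuming each inner-loop iteration loads one element of a and one of b and performs one multiply and one add:

```python
def ideal_intensity(n):
    """q = F / M if every matrix element were moved to or from memory only once."""
    flops = 2 * n ** 3
    memory_ops = 3 * n ** 2
    return flops / memory_ops


def inner_loop_intensity():
    """q for one iteration of the naive inner loop: 2 loads, 1 multiply, 1 add."""
    flops = 2
    loads = 2
    return flops / loads
```

For N = 960 the ideal ratio is 2/3 x 960 = 640 flops per memory access, while the naive inner loop stays at 1 regardless of N.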

Typical Memory Hierarchy

Level                                         Speed (cycles)   Size (bytes)
RegFile (on-chip)                             ½’s              100’s
Instr Cache, Data Cache (on-chip)             1’s              10K’s
Second-Level and Third-Level Cache (SRAM)     10’s             M’s
Main Memory (DRAM)                            100’s-1000       G’s
Secondary Memory (Disk or Flash)              1,000,000’s      T’s

Cost/bit: highest (top) to lowest (bottom)

On-Chip Components: Control, Datapath, RegFile, Instr Cache, Data Cache

• Where are the operands (A, B, C) stored?
• What happens as N increases?
• Idea: arrange that most accesses are to fast cache!

Sub-Matrix Multiplication or: Beating Amdahl’s Law

Blocking

• Idea:
  − Rearrange code to use values loaded in cache many times
  − Only “few” accesses to slow main memory (DRAM) per floating point operation
  − → throughput limited by FP hardware and cache, not slow DRAM
  − P&H p. 556

Memory Access Blocking
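A minimal sketch of a blocked (tiled) matrix multiply, assuming flat column-major lists and a hypothetical default tile size of 32. Python itself will not show the cache benefit; the point is the loop structure, in which each tile of a and b is reused for a whole tile of c before moving on.

```python
def dgemm_blocked(n, a, b, c, block=32):
    """Accumulate c += a x b on flat column-major lists, one block x block tile at a time."""
    for sj in range(0, n, block):
        for si in range(0, n, block):
            for sk in range(0, n, block):
                # Multiply one tile of a by one tile of b; tiles are chosen small
                # enough that their elements can stay in cache while being reused.
                for i in range(si, min(si + block, n)):
                    for j in range(sj, min(sj + block, n)):
                        cij = c[i + j * n]
                        for k in range(sk, min(sk + block, n)):
                            cij += a[i + k * n] * b[k + j * n]
                        c[i + j * n] = cij
```

The arithmetic is unchanged (still 2*N^3 flops); only the order of the memory accesses differs, so the result matches the unblocked version.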

Performance

       Gflops
N      scalar   avx     unroll   blocking
32     1.30     4.56    12.95    13.80
160    1.30     5.47    19.70    21.79
480    1.32     5.27    14.50    20.17
960    0.91     3.64    6.91     15.82


And in Conclusion, …

• Approaches to Parallelism
  − SISD, SIMD, MIMD (next lecture)
• SIMD
  − One instruction operates on multiple operands simultaneously
• Example: matrix multiplication
  − Floating point heavy → exploit Moore’s law to make fast
• Amdahl’s Law:
  − Serial sections limit speedup
  − Cache
    § Blocking
  − Hazards
    § Loop unrolling