CS61C: Great Ideas in Computer Architecture
Lecture 18: Parallel Processing – SIMD
Bernhard Boser & Randy Katz
http://inst.eecs.berkeley.edu/~cs61c
61C Survey
"It would be nice to have a review lecture every once in a while, actually showing us how things fit in the bigger picture."
CS61c Lecture 18: Parallel Processing – SIMD
Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy – blocking
• And in conclusion, …
61C Topics So Far
• What we learned:
1. Binary numbers
2. C
3. Pointers
4. Assembly language
5. Datapath architecture
6. Pipelining
7. Caches
8. Performance evaluation
9. Floating point
• What does this buy us?
− Promise: execution speed
− Let's check!
Reference Problem
• Matrix multiplication
− Basic operation in many engineering, data, and image processing tasks
− Image filtering, noise reduction, …
− Many closely related operations
§ E.g. stereo vision (project 4)
• dgemm
− double-precision floating-point matrix multiplication
Application Example: Deep Learning
• Image classification (cats …)
• Pick "best" vacation photos
• Machine translation
• Clean up accents
• Fingerprint verification
• Automatic game playing
Matrices
• Square (or rectangular) N x N array of numbers
− Dimension N
− Rows indexed by $i$, columns by $j$, each running from 0 to N-1

$C = A \cdot B$
$c_{ij} = \sum_{k} a_{ik} b_{kj}$
Matrix Multiplication

$C = A \cdot B$
$c_{ij} = \sum_{k} a_{ik} b_{kj}$

(Figure: element $c_{ij}$ is the dot product of row $i$ of A with column $j$ of B, summed over the index $k$.)
Reference: Python
• Matrix multiplication in Python

N    Python [Mflops]
32   5.4
160  5.5
480  5.4
960  5.3

• 1 Mflop/s = 1 million floating-point operations per second (fadd, fmul)
• dgemm(N…) takes 2*N^3 flops
C
• c = a × b
• a, b, c are N x N matrices
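The C listing itself was lost in extraction. As a hedged sketch, a scalar triple-loop dgemm in the spirit of the slides might look like this (the column-major layout and the name `dgemm_scalar` are illustrative assumptions, not the slide's exact code):

```c
#include <stddef.h>

/* Scalar reference dgemm: C = C + A*B for N x N column-major matrices.
 * Performs 2*N^3 floating-point operations (N^3 multiplies, N^3 adds). */
void dgemm_scalar(size_t n, const double *a, const double *b, double *c) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double cij = c[i + j * n];               /* c[i][j] */
            for (size_t k = 0; k < n; k++)
                cij += a[i + k * n] * b[k + j * n];  /* a[i][k] * b[k][j] */
            c[i + j * n] = cij;
        }
}
```

This is the baseline the rest of the lecture measures against: correct, but it leaves all the parallelism in the hardware unused.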
Timing Program Execution
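The timing code on this slide did not survive extraction. A minimal sketch of how one might measure dgemm and convert elapsed time to Gflops with the standard C `clock()` API (the helper names are assumptions, not the slide's code):

```c
#include <time.h>

/* dgemm performs 2*N^3 flops, so throughput in Gflops is
 * 2*N^3 / seconds / 1e9. */
double dgemm_gflops(double n, double seconds) {
    return 2.0 * n * n * n / seconds / 1e9;
}

/* Elapsed CPU time since `start`, using the standard clock() API. */
double seconds_since(clock_t start) {
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Usage: record `clock_t start = clock();`, run dgemm, then report `dgemm_gflops(N, seconds_since(start))`.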
C versus Python

N    C [Gflops]  Python [Gflops]
32   1.30        0.0054
160  1.30        0.0055
480  1.32        0.0054
960  0.91        0.0053

240x!
Which class gives you this kind of power? We could stop here … but why? Let's do better!
Why Parallel Processing?
• CPU clock rates are no longer increasing
− Technical & economic challenges
§ Advanced cooling technology too expensive or impractical for most applications
§ Energy costs are prohibitive
• Parallel processing is the only path to higher speed
− Compare airlines:
§ Maximum speed limited by the speed of sound and economics
§ Use more and larger airplanes to increase throughput
§ And smaller seats …
Using Parallelism for Performance
• Two basic ways:
− Multiprogramming
§ Run multiple independent programs in parallel
§ "Easy"
− Parallel computing
§ Run one program faster
§ "Hard"
• We'll focus on parallel computing in the next few lectures
New-School Machine Structures
(It's a bit more complicated!)
• Parallel requests: assigned to computer, e.g., search "Katz"
• Parallel threads: assigned to core, e.g., lookup, ads
• Parallel instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel data: >1 data item @ one time, e.g., add of 4 pairs of words (today's lecture)
• Hardware descriptions: all gates @ one time
• Programming languages
(Figure: harnessing parallelism to achieve high performance across software and hardware, from warehouse-scale computer and smartphone down through computer, cores, memory (cache), input/output, instruction unit(s), functional unit(s) computing A0+B0 … A3+B3, and logic gates.)
Single-Instruction/Single-Data Stream (SISD)
• Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines
− E.g. our trusted MIPS
(Figure: a single processing unit.)
This is what we did up to now in 61C.
Single-Instruction/Multiple-Data Stream (SIMD, or "sim-dee")
• A SIMD computer applies a single instruction stream to multiple data streams, for operations that can be naturally parallelized, e.g., Intel SIMD instruction extensions or NVIDIA Graphics Processing Units (GPUs)
Today's topic.
Multiple-Instruction/Multiple-Data Streams (MIMD, or "mim-dee")
• Multiple autonomous processors simultaneously executing different instructions on different data.
• MIMD architectures include multicore and warehouse-scale computers
(Figure: several processing units, each drawing from a shared instruction pool and data pool.)
Topic of Lecture 19 and beyond.
Multiple-Instruction/Single-Data Stream (MISD)
• Multiple-instruction, single-data stream computer that exploits multiple instruction streams against a single data stream.
• Historical significance
This has few applications. Not covered in 61C.
Flynn* Taxonomy, 1966
• SIMD and MIMD are currently the most common forms of parallelism in architectures – usually both in the same system!
• Most common parallel processing programming style: Single Program Multiple Data ("SPMD")
− Single program that runs on all processors of a MIMD
− Cross-processor execution coordination using synchronization primitives
SIMD – "Single Instruction Multiple Data"
SIMD Applications & Implementations
• Applications
− Scientific computing
§ Matlab, NumPy
− Graphics and video processing
§ Photoshop, …
− Big data
§ Deep learning
− Gaming
− …
• Implementations
− x86
− ARM
− …
First SIMD Extensions: MIT Lincoln Labs TX-2, 1957
x86 SIMD Evolution
• New instructions
• New, wider, more registers
• More parallelism
http://svmoore.pbworks.com/w/file/fetch/70583970/VectorOps.pdf
CPU Specs (Bernhard's Laptop)
$ sysctl -a | grep cpu
hw.physicalcpu: 2
hw.logicalcpu: 4
machdep.cpu.brand_string: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT FPU_CSDS
SIMD Registers

SIMD Data Types

SIMD Vector Mode
Problem
• Today's compilers (largely) do not generate SIMD code
• Back to assembly …
• x86
− Over 1000 instructions to learn …
− Green Book
• Can we use the compiler to generate all non-SIMD instructions?
x86 Intrinsics AVX Data Types
• Intrinsics: direct access to registers & assembly from C

Intrinsics AVX Code Nomenclature
x86 SIMD "Intrinsics"
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
• Each intrinsic corresponds to an assembly instruction
• Example: 4 parallel multiplies, 2 instructions per clock cycle (CPI = 0.5)
Raw Double-Precision Throughput (Bernhard's Powerbook Pro)

Characteristic                        Value
CPU                                   i7-5557U
Clock rate (sustained)                3.1 GHz
Instructions per clock (mul_pd)       2
Parallel multiplies per instruction   4
Peak double flops                     24.8 Gflops

Actual performance is lower because of overhead.
https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
Vectorized Matrix Multiplication
• Same loop nest over $i$, $j$, $k$, but the inner loop now computes 4 elements of C at a time, so the i loop advances in steps of 4:

for i …; i += 4
  for j ...
    inner loop over k

"Vectorized" dgemm
Performance

N    scalar [Gflops]  avx [Gflops]
32   1.30             4.56
160  1.30             5.47
480  1.32             5.27
960  0.91             3.64

• 4x faster
• But still << the theoretical 25 Gflops!
We are flying …
• Survey:
• But … there is so much material to cover!
− Solution: targeted reading
− Weekly homework with integrated reading & lecture review
A Trip to LA

Commercial airline:
Get to SFO & check in: 3 hours | SFO → LAX: 1 hour | Get to destination: 3 hours
Total time: 7 hours

Supersonic aircraft:
Get to SFO & check in: 3 hours | SFO → LAX: 6 min | Get to destination: 3 hours
Total time: 6.1 hours

Speedup:
Flying time: S_flight = 60/6 = 10x
Trip time: S_trip = 7/6.1 = 1.15x
Amdahl's Law
• Get enhancement E for your new PC
− E.g. floating-point rocket booster
• E
− Speeds up some task (e.g. arithmetic) by factor $S_E$
− F is the fraction of the program that uses this "task"

Execution time:
$T_0$ (no E): $(1-F) + F$
$T_E$ (with E): $(1-F) + F/S_E$

Speedup:
$S = \frac{T_0}{T_E} = \frac{1}{(1-F) + \frac{F}{S_E}}$
(the $(1-F)$ section gets no speedup; the $F$ section is sped up)
Big Idea: Amdahl's Law

$S = \frac{T_0}{T_E} = \frac{1}{(1-F) + \frac{F}{S_E}}$, where $(1-F)$ is the part not sped up and $F$ is the part sped up.

Example: The execution time of half of a program can be accelerated by a factor of 2. What is the program speedup overall?

$S = \frac{1}{(1-0.5) + \frac{0.5}{2}} = \frac{1}{0.75} = 1.33 \ll 2$
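The formula is easy to check numerically; a tiny helper (illustrative, not from the slides):

```c
/* Amdahl's Law: overall speedup when a fraction f of execution time
 * is accelerated by a factor s_e. */
double amdahl(double f, double s_e) {
    return 1.0 / ((1.0 - f) + f / s_e);
}
```

For the example above, `amdahl(0.5, 2.0)` gives 1.33, far short of 2.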
Maximum "Achievable" Speed-Up

Question: What is a reasonable number of parallel processors to speed up an algorithm with F = 95%? (i.e. 19/20ths can be sped up)

a) Maximum speedup:
$S_{max} = \frac{1}{(1-F) + \frac{F}{S_E}} \xrightarrow{S_E \to \infty} \frac{1}{1-F}$
$F = 95\% \Rightarrow S_{max} = 20$, but it requires $S_E \to \infty$!?

b) Reasonable "engineering" compromise: equal time in sequential and parallel code:
$1-F = \frac{F}{S_E} \Rightarrow S_E = \frac{F}{1-F} = \frac{0.95}{0.05} = 19$
Then $S = \frac{1}{2(1-F)} = 10$

If the portion of the program that can be parallelized is small, then the speedup is limited. In this region, the sequential portion limits the performance.
(Figure: speedup vs. number of processors – 500 processors for 19x, 20 processors for 10x.)
Strong and Weak Scaling
• Getting good speedup on a parallel processor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
− Strong scaling: speedup achieved on a parallel processor without increasing the size of the problem
− Weak scaling: speedup achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors
• Load balancing is another important factor: every processor should do the same amount of work
− Just one unit with twice the load of the others cuts the speedup almost in half
Clickers/Peer Instruction

Suppose a program spends 80% of its time in a square-root routine. How much must you speed up square root to make the program run 5 times faster?

$S = \frac{T_0}{T_E} = \frac{1}{(1-F) + \frac{F}{S_E}}$

Answer: $S_E$ =
A) 5
B) 16
C) 20
D) 100
E) None of the above
Administrivia
• MT2 is
− Tuesday, November 1
− 3:30-5 pm
− see web for room assignments
• TA review session:
§ Sunday 10/30, 3:30-5 PM in 10 Evans
§ See Piazza
CS61c Lecture 19: Thread-Level Parallel Processing
MT2 Topics
• Covers lecture material up to 10/20
− Caches
− not floating point
• Combinational logic, including synthesis and truth tables
• FSMs
• Timing and timing diagrams
• Pipelining
• Datapath, hazards, stalls
• Performance (e.g. CPI, instructions per second, latency)
• Caches
• All topics covered in MT1
− Focus is new material, but do not be surprised by e.g. MIPS assembly
Amdahl's Law Applied to dgemm
• Measured dgemm performance
− Peak: 5.5 Gflops
− Large matrices: 3.6 Gflops
− Processor: 24.8 Gflops
• Why are we not getting (close to) 25 Gflops?
− Something else (not the floating-point ALU) is limiting performance!
− But what? Possible culprits:
§ Cache
§ Hazards
§ Let's look at both!
Pipeline Hazards – dgemm

Loop Unrolling
• The compiler does the unrolling
• How do you verify that the generated code is actually unrolled?
• 4 registers
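The unrolled listing itself is missing from this extract. As a scalar illustration of the same idea, four independent accumulators (the "4 registers") break the single dependence chain that would otherwise stall the floating-point pipeline on every add (the function name and shape are assumptions, not the slide's code):

```c
#include <stddef.h>

/* Unrolled dot product: four independent accumulators let the pipeline
 * overlap floating-point operations instead of waiting for each add to
 * finish before starting the next. n is assumed to be a multiple of 4. */
double dot_unrolled(size_t n, const double *x, const double *y) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += x[i]     * y[i];      /* four chains proceed */
        s1 += x[i + 1] * y[i + 1];  /* in parallel through  */
        s2 += x[i + 2] * y[i + 2];  /* the pipelined FPU    */
        s3 += x[i + 3] * y[i + 3];
    }
    return (s0 + s1) + (s2 + s3);   /* combine at the end */
}
```

To verify the compiler really did this, inspect the generated assembly (e.g. `gcc -S -O2`) and count the accumulator registers.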
Performance

N    scalar [Gflops]  avx [Gflops]  unroll [Gflops]
32   1.30             4.56          12.95
160  1.30             5.47          19.70
480  1.32             5.27          14.50
960  0.91             3.64          6.91
FPU versus Memory Access
• How many floating-point operations does matrix multiply take?
− F = 2 × N^3 (N^3 multiplies, N^3 adds)
• How many memory loads/stores?
− M = 3 × N^2 (for A, B, C)
• Many more floating-point operations than memory accesses
− q = F/M = (2/3) × N
− Good, since arithmetic is faster than memory access
− Let's check the code …

But memory is accessed repeatedly
• In the inner loop, q = F/M = 1! (2 loads and 2 floating-point operations)
Inner loop:
Typical Memory Hierarchy
(Figure: on-chip components – control, datapath, register file, instruction cache, and data cache – backed by second- and third-level SRAM caches, DRAM main memory, and secondary memory (disk or flash).
Speed (cycles): ½'s → 1's → 10's → 100's-1000 → 1,000,000's
Size (bytes): 100's → 10K's → M's → G's → T's
Cost/bit: highest → lowest)
• Where are the operands (A, B, C) stored?
• What happens as N increases?
• Idea: arrange that most accesses are to the fast cache!
Sub-Matrix Multiplication, or: Beating Amdahl's Law

Blocking
• Idea:
− Rearrange code to use values loaded into the cache many times
− Only "few" accesses to slow main memory (DRAM) per floating-point operation
− → throughput limited by FP hardware and cache, not by slow DRAM
− P&H p. 556

Memory Access Blocking
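The blocked dgemm code is not preserved here. A hedged sketch of cache blocking in the spirit of P&H p. 556 (the block size, function name, and column-major layout are assumptions): the loop nest is tiled so that three BLOCK x BLOCK sub-matrices stay resident in cache and each value loaded from DRAM is reused many times.

```c
#include <stddef.h>

#define BLOCK 32  /* chosen so three BLOCK x BLOCK tiles fit in cache */

/* Blocked dgemm sketch: C = C + A*B, column-major, N a multiple of BLOCK.
 * The outer three loops walk over tiles; the inner three multiply one
 * BLOCK x BLOCK sub-problem entirely out of cache. */
void dgemm_blocked(size_t n, const double *a, const double *b, double *c) {
    for (size_t sj = 0; sj < n; sj += BLOCK)
        for (size_t si = 0; si < n; si += BLOCK)
            for (size_t sk = 0; sk < n; sk += BLOCK)
                for (size_t j = sj; j < sj + BLOCK; j++)
                    for (size_t i = si; i < si + BLOCK; i++) {
                        double cij = c[i + j * n];
                        for (size_t k = sk; k < sk + BLOCK; k++)
                            cij += a[i + k * n] * b[k + j * n];
                        c[i + j * n] = cij;
                    }
}
```

In a full implementation the innermost tile would also be vectorized and unrolled, combining all three optimizations from this lecture.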
Performance

N    scalar [Gflops]  avx [Gflops]  unroll [Gflops]  blocking [Gflops]
32   1.30             4.56          12.95            13.80
160  1.30             5.47          19.70            21.79
480  1.32             5.27          14.50            20.17
960  0.91             3.64          6.91             15.82
And in Conclusion, …
• Approaches to parallelism
− SISD, SIMD, MIMD (next lecture)
• SIMD
− One instruction operates on multiple operands simultaneously
• Example: matrix multiplication
− Floating-point heavy → exploit Moore's law to make it fast
• Amdahl's Law:
− Serial sections limit speedup
− Cache
§ Blocking
− Hazards
§ Loop unrolling