Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
10/25/2016 CS152,Fall2016
CS152ComputerArchitectureandEngineering
Lecture14:Multithreading
JohnWawrzynekElectricalEngineeringandComputerSciences
UniversityofCalifornia,Berkeley
http://www.eecs.berkeley.edu/~johnwhttp://inst.cs.berkeley.edu/~cs152
10/25/2016 CS152,Fall2016
LastTimeLecture13:VLIW
§ InaclassicVLIW,compilerisresponsible foravoidingallhazards->simplehardware,complexcompiler.LaterVLIWs addedmoredynamichardwareinterlocks
§ Useloopunrollingandsoftwarepipeliningforloops,traceschedulingformoreirregularcode
§ Staticschedulingdifficultinpresenceofunpredictablebranchesandvariablelatencymemory
2
10/25/2016 CS152,Fall2016
CS152Administrivia
§ Quiz3results
§ PS4posted– dueNov3rd
§ Lab3dueFriday§ Noofficehourstoday§ FridaydiscussionmovedtoThursday
3
10/25/2016 CS152,Fall2016
Multithreading
§ Difficulttocontinuetoextractinstruction-levelparallelism(ILP)fromasinglesequential threadofcontrol
§ Manyworkloadscanmakeuseofthread-level parallelism(TLP)– TLPfrommultiprogramming(runindependentsequential jobs)
– TLPfrommultithreaded applications(runonejobfasterusingparallelthreads)
§ MultithreadingusesTLPtoimproveutilizationofasingleprocessor
4
10/25/2016 CS152,Fall2016
PipelineHazards
§ Eachinstructionmaydependonthenext
5
LW r1, 0(r2)LW r5, 12(r1)ADDI r5, r5, #12SW 12(r1), r5
F D X M Wt0 t1 t2 t3 t4 t5 t6 t7 t8
F D X M WD D DF D X M WD D DF F F
F DD D DF F F
t9 t10 t11 t12 t13 t14
What is usually done to cope with this?– interlocks (slow)– or bypassing (needs hardware, doesn’t help all
hazards)
10/25/2016 CS152,Fall2016
Multithreading
Howcanweguaranteenodependencies betweeninstructionsinapipeline?
-- Onewayistointerleaveexecutionofinstructionsfromdifferentprogramthreadsonsamepipeline
6
F D X M Wt0 t1 t2 t3 t4 t5 t6 t7 t8
T1: LW r1, 0(r2)T2: ADD r7, r1, r4T3: XORI r5, r4, #12T4: SW 0(r7), r5T1: LW r5, 12(r1)
t9
F D X M WF D X M W
F D X M WF D X M W
Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe
Prior instruction in a thread always completes write-back before next instruction in same thread reads register file
10/25/2016 CS152,Fall2016
CDC6600PeripheralProcessors(Cray,1964)
§ Firstmultithreadedhardware§ 10“virtual”I/Oprocessors§ Fixedinterleave onsimplepipeline§ Pipelinehas100nscycletime§ Eachvirtualprocessorexecutesoneinstructionevery1000ns§ Accumulator-based instructionsettoreduceprocessor state
7
10/25/2016 CS152,Fall2016
SimpleMultithreadedPipeline
§ Havetocarrythreadselectdownpipelinetoensurecorrectstatebitsread/writtenateachpipestage
§ Appears tosoftware(includingOS)asmultiple,albeitslower,CPUs
8
+1
2 Thread select
PC1PC
1PC1PC
1I$ IR GPR1GPR1GPR1GPR1
X
Y
2
D$
10/25/2016 CS152,Fall2016
MultithreadingCosts
§ Eachthreadrequiresitsownuserstate– PC– GPRs
§ Also,needsitsownsystemstate– Virtual-memorypage-table-base register– Exception-handling registers
§Otheroverheads:– Additionalcache/TLBconflictsfromcompetingthreads– (oraddlargercache/TLBcapacity)– MoreOSoverhead toschedulemorethreads (wheredoallthesethreadscomefrom?)
9
10/25/2016 CS152,Fall2016
ThreadSchedulingPolicies
§ Fixedinterleave (CDC6600PPUs,1964)– EachofNthreadsexecutes oneinstructioneveryNcycles– Ifthreadnotreadytogoinitsslot,insertpipelinebubble
§ Software-controlled interleave (TIASCPPUs,1971)– OSallocatesSpipeline slotsamongstNthreads– Hardwareperformsfixedinterleave overSslots,executingwhicheverthreadisinthatslot
§ Hardware-controlledthreadscheduling(HEP,1982)– Hardwarekeepstrackofwhichthreadsarereadytogo– Picksnextthreadtoexecutebasedonhardwarepriorityscheme
10
10/25/2016 CS152,Fall2016
DenelcorHEP(BurtonSmith,1982)
FirstcommercialmachinetousehardwarethreadinginmainCPU– 120threadsperprocessor– 10MHzclockrate– Upto8processors– precursortoTeraMTA(Multithreaded Architecture)
11
10/25/2016 CS152,Fall2016
Tera MTA(1990-)
§ Upto256processors§ Upto128activethreadsperprocessor§ Processorsandmemorymodulespopulateasparse3Dtorusinterconnection fabric
§ Flat,sharedmainmemory– Nodatacache– Sustainsonemainmemoryaccesspercycleperprocessor
§ GaAs logicinprototype, 1KW/processor@260MHz– SecondversionCMOS,MTA-2,50W/processor– Newerversion,XMT,fitsintoAMDOpteronsocket,runsat500MHz
12
10/25/2016 CS152,Fall2016
MTAArchitecture
14
• Each Processor:• Every cycle, one VLIW instruction from one active thread is launched into pipeline• Instruction pipeline is 21 cycles long• Memory operations incur ~150 cycles of latency
Assuming a single thread issues one instruction every 21 cycles, and clock rate is 260 MHz…
What is single-thread performance?
Effective single-thread issue rate is 260/21 = 12.4 MIPS
How many memory instructions in-flight for 256 processors?
256 mem-ops/cycle * 150 cycles/mem-op = 38K
10/25/2016 CS152,Fall2016
Coarse-GrainMultithreading
Tera MTAdesignedforsupercomputingapplicationswithlargedatasetsandlowlocality– Nodatacache– Manyparallelthreadsneededtohidelargememorylatency– Ultimatelynotverycommerciallysuccessful
Otherapplicationsaremorecachefriendly– Fewpipelinebubblesifcachemostlyhashits– Justaddafewthreadstohideoccasionalcachemisslatencies– Swapthreadsoncachemisses
15
10/25/2016 CS152,Fall2016
MITAlewife(1990)
§ModifiedSPARCchips– registerwindowsholddifferentthreadcontexts
§Uptofourthreadspernode§Threadswitchonlocalcachemiss
16
10/25/2016 CS152,Fall2016
IBMPowerPCRS64-IV(2000)
§ Commercialcoarse-grainmultithreadingCPU§ BasedonPowerPCwithquad-issuein-orderfive-stagepipeline
§ EachphysicalCPUsupportstwovirtualCPUs§ OnL2cachemiss,pipelineisflushedandexecutionswitchestosecondthread– shortpipelineminimizesflushpenalty(4cycles),smallcomparedtomemoryaccesslatency
– flushpipelinetosimplifyexceptionhandling
17
10/25/2016 CS152,Fall2016
Oracle/SunNiagaraprocessors
§ Targetisdatacenters runningwebserversanddatabases,withmanyconcurrentrequests
§ Providemultiplesimplecoreseachwithmultiplehardwarethreads,reducedenergy/operation thoughmuchlowersinglethreadperformance
§ Niagara-1[2004],8cores,4threads/core§ Niagara-2[2007],8cores,8threads/core§ Niagara-3[2009],16cores,8threads/core§ T4[2011],8cores,8threads/core§ T5[2012],16cores,8threads/core
18
10/25/2016 CS152,Fall2016
Oracle/SunNiagara-3,“Rainbow Falls”2009
19
10/25/2016 CS152,Fall2016
SimultaneousMultithreading(SMT)forOoOSuperscalars
§ Techniquespresented sofarhaveallbeen“vertical”multithreadingwhereeachpipelinestageworksononethreadatatime
§ SMTusesfine-graincontrolalreadypresent insideanOoOsuperscalartoallowinstructionsfrommultiplethreadstoenterexecutiononsameclockcycle.Givesbetterutilizationofmachineresources.
20
10/25/2016 CS152,Fall2016
Formostapps,mostexecutionunitslieidleinanOoOsuperscalar
21
From: Tullsen, Eggers, and Levy,“Simultaneous Multithreading: Maximizing On-chip Parallelism”, ISCA 1995.
For an 8-way superscalar.
10/25/2016 CS152,Fall2016
SuperscalarMachineEfficiency
22
Issue width
Time
Completely idle cycle (vertical waste)
Instruction issue
Partially filled cycle, i.e., IPC < 4(horizontal waste)
10/25/2016 CS152,Fall2016
VerticalMultithreading
§ Whatistheeffectofcycle-by-cycleinterleaving?– removesverticalwaste,butleavessomehorizontalwaste
23
Issue width
Time
Second thread interleaved cycle-by-cycle
Instruction issue
Partially filled cycle, i.e., IPC < 4(horizontal waste)
10/25/2016 CS152,Fall2016
ChipMultiprocessing(CMP)
§ Whatistheeffectofsplitting intomultiple processors?– reduceshorizontalwaste,– leavessomeverticalwaste,and– putsupperlimitonpeakthroughputofeachthread.
24
Issue width
Time
10/25/2016 CS152,Fall2016
IdealSuperscalarMultithreading[Tullsen,Eggers,Levy,UW,1995]
§ Interleavemultiplethreadstomultipleissueslotswithnorestrictions
25
Issue width
Time
10/25/2016 CS152,Fall2016
O-o-OSimultaneousMultithreading[Tullsen,Eggers,Emer,Levy,Stamm,Lo,DEC/UW,1996]
§ Addmultiplecontextsandfetchenginesandallowinstructionsfetchedfromdifferentthreadstoissuesimultaneously
§ Utilizewideout-of-ordersuperscalarprocessorissuequeuetofindinstructionstoissuefrommultiplethreads
§ OOOinstructionwindowalreadyhasmostofthecircuitryrequiredtoschedulefrommultiplethreads
§ Anysinglethreadcanutilizewholemachine
26
10/25/2016 CS152,Fall2016
IBMPower4
27
Single-threaded predecessor to Power 5 (2004). 8 execution units in out-of-order engine, each may issue an instruction each cycle.
10/25/2016 CS152,Fall2016 28
Power 4
Power 5
2 fetch (PC),2 initial decodes
2 commits (architected register sets)
10/25/2016 CS152,Fall2016
Power5dataflow...
29
Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck
10/25/2016 CS152,Fall2016
ChangesinPower5tosupportSMT§ Increasedassociativity ofL1instructioncacheandtheinstructionaddresstranslationbuffers
§ Addedper-thread loadandstorequeues§ IncreasedsizeoftheL2(1.92vs.1.44MB)andL3caches§ Addedseparate instructionprefetch andbufferingperthread§ Increasedthenumberofvirtualregistersfrom152to240§ Increasedthesizeofseveralissuequeues§ ThePower5coreisabout24%largerthanthePower4corebecauseoftheadditionofSMTsupport
30
10/25/2016 CS152,Fall2016
Pentium-4Hyperthreading(2002)
§ FirstcommercialSMTdesign(2-waySMT)– Hyperthreading ==SMT
§ Logicalprocessorssharenearlyallresourcesofthephysicalprocessor– Caches,execution units,branchpredictors
§ Dieareaoverhead ofhyperthreading ~5%§ Whenonelogicalprocessorisstalled,theothercanmakeprogress
§ Processorrunningonlyoneactivesoftwarethreadrunsatapproximatelysamespeedwithorwithouthyperthreading
§ Hyperthreading droppedonOoO P6basedfollow-onstoPentium-4(Pentium-M,CoreDuo,Core2Duo),untilrevivedwithNehalemgenerationmachinesin2008.
§ IntelAtom(in-orderx86core)hastwo-wayverticalmultithreading
31
10/25/2016 CS152,Fall2016
PerformanceofSMT(fromH&P)
§ RunningonPentium4SMTeachof26SPECbenchmarkspairedwitheveryother(262 runs)speed-upsfrom0.90to1.58;averagewas1.20
§ Power5,8-processorserver1.23fasterforSPECint_ratewithSMT,1.16fasterforSPECfp_rate
§ Power5running2copiesofeachappspeedupbetween0.89and1.41– Mostgainedsome– Fl.Pt.appshadmostcacheconflictsandleastgains
32
10/25/2016 CS152,Fall2016
Summary:MultithreadedCategories
35
Time
(pro
cess
or cy
cle) Superscalar Fine-Grained Coarse-Grained Multiprocessing
SimultaneousMultithreading
Thread 1Thread 2
Thread 3Thread 4
Thread 5Idle slot
10/25/2016 CS152,Fall2016
Acknowledgements
§ Theseslidescontainmaterialdeveloped andcopyrightby:– Arvind(MIT)– KrsteAsanovic(MIT/UCB)– JoelEmer(Intel/MIT)– JamesHoe(CMU)– JohnKubiatowicz(UCB)– DavidPatterson(UCB)
§ MITmaterialderivedfromcourse6.823§ UCBmaterialderivedfromcourseCS252
36