CS 152 Computer Architecture and Engineering Lecture 14: Multithreadingcs152/fa16/lectures/L14... · 2016. 10. 25. · Simultaneous Multithreading (SMT) for OoO Superscalars § Techniques

10/25/2016 CS152,Fall2016

CS152ComputerArchitectureandEngineering

Lecture14:Multithreading

JohnWawrzynekElectricalEngineeringandComputerSciences

UniversityofCalifornia,Berkeley

http://www.eecs.berkeley.edu/~johnwhttp://inst.cs.berkeley.edu/~cs152

10/25/2016 CS152,Fall2016

LastTimeLecture13:VLIW

§ InaclassicVLIW,compilerisresponsible foravoidingallhazards->simplehardware,complexcompiler.LaterVLIWs addedmoredynamichardwareinterlocks

§ Useloopunrollingandsoftwarepipeliningforloops,traceschedulingformoreirregularcode

§ Staticschedulingdifficultinpresenceofunpredictablebranchesandvariablelatencymemory

2

10/25/2016 CS152,Fall2016

CS152Administrivia

§ Quiz3results

§ PS4posted– dueNov3rd

§ Lab3dueFriday§ Noofficehourstoday§ FridaydiscussionmovedtoThursday

3

10/25/2016 CS152,Fall2016

Multithreading

§ Difficulttocontinuetoextractinstruction-levelparallelism(ILP)fromasinglesequential threadofcontrol

§ Manyworkloadscanmakeuseofthread-level parallelism(TLP)– TLPfrommultiprogramming(runindependentsequential jobs)

– TLPfrommultithreaded applications(runonejobfasterusingparallelthreads)

§ MultithreadingusesTLPtoimproveutilizationofasingleprocessor

4

10/25/2016 CS152,Fall2016

PipelineHazards

§ Eachinstructionmaydependonthenext

5

LW r1, 0(r2)LW r5, 12(r1)ADDI r5, r5, #12SW 12(r1), r5

F D X M Wt0 t1 t2 t3 t4 t5 t6 t7 t8

F D X M WD D DF D X M WD D DF F F

F DD D DF F F

t9 t10 t11 t12 t13 t14

What is usually done to cope with this?– interlocks (slow)– or bypassing (needs hardware, doesn’t help all

hazards)

10/25/2016 CS152,Fall2016

Multithreading

Howcanweguaranteenodependencies betweeninstructionsinapipeline?

-- Onewayistointerleaveexecutionofinstructionsfromdifferentprogramthreadsonsamepipeline

6

F D X M Wt0 t1 t2 t3 t4 t5 t6 t7 t8

T1: LW r1, 0(r2)T2: ADD r7, r1, r4T3: XORI r5, r4, #12T4: SW 0(r7), r5T1: LW r5, 12(r1)

t9

F D X M WF D X M W

F D X M WF D X M W

Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe

Prior instruction in a thread always completes write-back before next instruction in same thread reads register file

10/25/2016 CS152,Fall2016

CDC6600PeripheralProcessors(Cray,1964)

§ Firstmultithreadedhardware§ 10“virtual”I/Oprocessors§ Fixedinterleave onsimplepipeline§ Pipelinehas100nscycletime§ Eachvirtualprocessorexecutesoneinstructionevery1000ns§ Accumulator-based instructionsettoreduceprocessor state

7

10/25/2016 CS152,Fall2016

SimpleMultithreadedPipeline

§ Havetocarrythreadselectdownpipelinetoensurecorrectstatebitsread/writtenateachpipestage

§ Appears tosoftware(includingOS)asmultiple,albeitslower,CPUs

8

+1

2 Thread select

PC1PC

1PC1PC

1I$ IR GPR1GPR1GPR1GPR1

X

Y

2

D$

10/25/2016 CS152,Fall2016

MultithreadingCosts

§ Eachthreadrequiresitsownuserstate– PC– GPRs

§ Also,needsitsownsystemstate– Virtual-memorypage-table-base register– Exception-handling registers

§Otheroverheads:– Additionalcache/TLBconflictsfromcompetingthreads– (oraddlargercache/TLBcapacity)– MoreOSoverhead toschedulemorethreads (wheredoallthesethreadscomefrom?)

9

10/25/2016 CS152,Fall2016

ThreadSchedulingPolicies

§ Fixedinterleave (CDC6600PPUs,1964)– EachofNthreadsexecutes oneinstructioneveryNcycles– Ifthreadnotreadytogoinitsslot,insertpipelinebubble

§ Software-controlled interleave (TIASCPPUs,1971)– OSallocatesSpipeline slotsamongstNthreads– Hardwareperformsfixedinterleave overSslots,executingwhicheverthreadisinthatslot

§ Hardware-controlledthreadscheduling(HEP,1982)– Hardwarekeepstrackofwhichthreadsarereadytogo– Picksnextthreadtoexecutebasedonhardwarepriorityscheme

10

10/25/2016 CS152,Fall2016

DenelcorHEP(BurtonSmith,1982)

FirstcommercialmachinetousehardwarethreadinginmainCPU– 120threadsperprocessor– 10MHzclockrate– Upto8processors– precursortoTeraMTA(Multithreaded Architecture)

11

10/25/2016 CS152,Fall2016

Tera MTA(1990-)

§ Upto256processors§ Upto128activethreadsperprocessor§ Processorsandmemorymodulespopulateasparse3Dtorusinterconnection fabric

§ Flat,sharedmainmemory– Nodatacache– Sustainsonemainmemoryaccesspercycleperprocessor

§ GaAs logicinprototype, 1KW/processor@260MHz– SecondversionCMOS,MTA-2,50W/processor– Newerversion,XMT,fitsintoAMDOpteronsocket,runsat500MHz

12

10/25/2016 CS152,Fall2016

MTAArchitecture

14

• Each Processor:• Every cycle, one VLIW instruction from one active thread is launched into pipeline• Instruction pipeline is 21 cycles long• Memory operations incur ~150 cycles of latency

Assuming a single thread issues one instruction every 21 cycles, and clock rate is 260 MHz…

What is single-thread performance?

Effective single-thread issue rate is 260/21 = 12.4 MIPS

How many memory instructions in-flight for 256 processors?

256 mem-ops/cycle * 150 cycles/mem-op = 38K

10/25/2016 CS152,Fall2016

Coarse-GrainMultithreading

Tera MTAdesignedforsupercomputingapplicationswithlargedatasetsandlowlocality– Nodatacache– Manyparallelthreadsneededtohidelargememorylatency– Ultimatelynotverycommerciallysuccessful

Otherapplicationsaremorecachefriendly– Fewpipelinebubblesifcachemostlyhashits– Justaddafewthreadstohideoccasionalcachemisslatencies– Swapthreadsoncachemisses

15

10/25/2016 CS152,Fall2016

MITAlewife(1990)

§ModifiedSPARCchips– registerwindowsholddifferentthreadcontexts

§Uptofourthreadspernode§Threadswitchonlocalcachemiss

16

10/25/2016 CS152,Fall2016

IBMPowerPCRS64-IV(2000)

§ Commercialcoarse-grainmultithreadingCPU§ BasedonPowerPCwithquad-issuein-orderfive-stagepipeline

§ EachphysicalCPUsupportstwovirtualCPUs§ OnL2cachemiss,pipelineisflushedandexecutionswitchestosecondthread– shortpipelineminimizesflushpenalty(4cycles),smallcomparedtomemoryaccesslatency

– flushpipelinetosimplifyexceptionhandling

17

10/25/2016 CS152,Fall2016

Oracle/SunNiagaraprocessors

§ Targetisdatacenters runningwebserversanddatabases,withmanyconcurrentrequests

§ Providemultiplesimplecoreseachwithmultiplehardwarethreads,reducedenergy/operation thoughmuchlowersinglethreadperformance

§ Niagara-1[2004],8cores,4threads/core§ Niagara-2[2007],8cores,8threads/core§ Niagara-3[2009],16cores,8threads/core§ T4[2011],8cores,8threads/core§ T5[2012],16cores,8threads/core

18

10/25/2016 CS152,Fall2016

Oracle/SunNiagara-3,“Rainbow Falls”2009

19

10/25/2016 CS152,Fall2016

SimultaneousMultithreading(SMT)forOoOSuperscalars

§ Techniquespresented sofarhaveallbeen“vertical”multithreadingwhereeachpipelinestageworksononethreadatatime

§ SMTusesfine-graincontrolalreadypresent insideanOoOsuperscalartoallowinstructionsfrommultiplethreadstoenterexecutiononsameclockcycle.Givesbetterutilizationofmachineresources.

20

10/25/2016 CS152,Fall2016

Formostapps,mostexecutionunitslieidleinanOoOsuperscalar

21

From: Tullsen, Eggers, and Levy,“Simultaneous Multithreading: Maximizing On-chip Parallelism”, ISCA 1995.

For an 8-way superscalar.

10/25/2016 CS152,Fall2016

SuperscalarMachineEfficiency

22

Issue width

Time

Completely idle cycle (vertical waste)

Instruction issue

Partially filled cycle, i.e., IPC < 4(horizontal waste)

10/25/2016 CS152,Fall2016

VerticalMultithreading

§ Whatistheeffectofcycle-by-cycleinterleaving?– removesverticalwaste,butleavessomehorizontalwaste

23

Issue width

Time

Second thread interleaved cycle-by-cycle

Instruction issue

Partially filled cycle, i.e., IPC < 4(horizontal waste)

10/25/2016 CS152,Fall2016

ChipMultiprocessing(CMP)

§ Whatistheeffectofsplitting intomultiple processors?– reduceshorizontalwaste,– leavessomeverticalwaste,and– putsupperlimitonpeakthroughputofeachthread.

24

Issue width

Time

10/25/2016 CS152,Fall2016

IdealSuperscalarMultithreading[Tullsen,Eggers,Levy,UW,1995]

§ Interleavemultiplethreadstomultipleissueslotswithnorestrictions

25

Issue width

Time

10/25/2016 CS152,Fall2016

O-o-OSimultaneousMultithreading[Tullsen,Eggers,Emer,Levy,Stamm,Lo,DEC/UW,1996]

§ Addmultiplecontextsandfetchenginesandallowinstructionsfetchedfromdifferentthreadstoissuesimultaneously

§ Utilizewideout-of-ordersuperscalarprocessorissuequeuetofindinstructionstoissuefrommultiplethreads

§ OOOinstructionwindowalreadyhasmostofthecircuitryrequiredtoschedulefrommultiplethreads

§ Anysinglethreadcanutilizewholemachine

26

10/25/2016 CS152,Fall2016

IBMPower4

27

Single-threaded predecessor to Power 5 (2004). 8 execution units in out-of-order engine, each may issue an instruction each cycle.

10/25/2016 CS152,Fall2016 28

Power 4

Power 5

2 fetch (PC),2 initial decodes

2 commits (architected register sets)

10/25/2016 CS152,Fall2016

Power5dataflow...

29

Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck

10/25/2016 CS152,Fall2016

ChangesinPower5tosupportSMT§ Increasedassociativity ofL1instructioncacheandtheinstructionaddresstranslationbuffers

§ Addedper-thread loadandstorequeues§ IncreasedsizeoftheL2(1.92vs.1.44MB)andL3caches§ Addedseparate instructionprefetch andbufferingperthread§ Increasedthenumberofvirtualregistersfrom152to240§ Increasedthesizeofseveralissuequeues§ ThePower5coreisabout24%largerthanthePower4corebecauseoftheadditionofSMTsupport

30

10/25/2016 CS152,Fall2016

Pentium-4Hyperthreading(2002)

§ FirstcommercialSMTdesign(2-waySMT)– Hyperthreading ==SMT

§ Logicalprocessorssharenearlyallresourcesofthephysicalprocessor– Caches,execution units,branchpredictors

§ Dieareaoverhead ofhyperthreading ~5%§ Whenonelogicalprocessorisstalled,theothercanmakeprogress

§ Processorrunningonlyoneactivesoftwarethreadrunsatapproximatelysamespeedwithorwithouthyperthreading

§ Hyperthreading droppedonOoO P6basedfollow-onstoPentium-4(Pentium-M,CoreDuo,Core2Duo),untilrevivedwithNehalemgenerationmachinesin2008.

§ IntelAtom(in-orderx86core)hastwo-wayverticalmultithreading

31

10/25/2016 CS152,Fall2016

PerformanceofSMT(fromH&P)

§ RunningonPentium4SMTeachof26SPECbenchmarkspairedwitheveryother(262 runs)speed-upsfrom0.90to1.58;averagewas1.20

§ Power5,8-processorserver1.23fasterforSPECint_ratewithSMT,1.16fasterforSPECfp_rate

§ Power5running2copiesofeachappspeedupbetween0.89and1.41– Mostgainedsome– Fl.Pt.appshadmostcacheconflictsandleastgains

32

10/25/2016 CS152,Fall2016

Summary:MultithreadedCategories

35

Time

(pro

cess

or cy

cle) Superscalar Fine-Grained Coarse-Grained Multiprocessing

SimultaneousMultithreading

Thread 1Thread 2

Thread 3Thread 4

Thread 5Idle slot

10/25/2016 CS152,Fall2016

Acknowledgements

§ Theseslidescontainmaterialdeveloped andcopyrightby:– Arvind(MIT)– KrsteAsanovic(MIT/UCB)– JoelEmer(Intel/MIT)– JamesHoe(CMU)– JohnKubiatowicz(UCB)– DavidPatterson(UCB)

§ MITmaterialderivedfromcourse6.823§ UCBmaterialderivedfromcourseCS252

36

Documents

CS 152 Computer Architecture and Engineering Lecture 14: Multithreadingcs152/fa16/lectures/L14... · 2016. 10. 25. · Simultaneous Multithreading (SMT) for OoO Superscalars § Techniques