27
WU UCB CS252 SP17 CS252 Spring 2017 Graduate Computer Architecture Lecture 10: VLIW Architectures Lisa Wu, Krste Asanovic http://inst.eecs.berkeley.edu/~cs252/sp17

CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

WU UCB CS252 SP17

CS252 Spring 2017Graduate Computer Architecture

Lecture 10:VLIW Architectures

Lisa Wu, Krste Asanovichttp://inst.eecs.berkeley.edu/~cs252/sp17

Page 2: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

LastTimeinLecture9

Vectorsupercomputers§ Vectorregisterversusvectormemory§ Scalingperformancewithlanes§ Stripmining§ Chaining§ Masking§ Scatter/Gather

2

Page 3: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

SequentialISABottleneck

3

Checkinstructiondependencies

Superscalarprocessor

a = foo(b);for (i=0, i<

Sequentialsourcecode

Superscalarcompiler

Findindependentoperations

Scheduleoperations

Sequentialmachinecode

Scheduleexecution

Page 4: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

VLIW:VeryLongInstructionWord

§ Multipleoperationspackedintooneinstruction§ Eachoperationslotisforafixedfunction§ Constantoperationlatenciesarespecified§ Architecturerequiresguaranteeof:

- Parallelismwithinaninstruction=>nocross-operationRAWcheck- Nodatausebeforedataready=>nodatainterlocks

4

TwoIntegerUnits,SingleCycleLatency

TwoLoad/StoreUnits,ThreeCycleLatency TwoFloating-PointUnits,

FourCycleLatency

IntOp2 MemOp1 MemOp2 FPOp1 FPOp2Int Op1

Page 5: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

EarlyVLIWMachines

§ FPSAP120B(1976)- scientificattachedarrayprocessor- firstcommercialwideinstructionmachine- hand-codedvectormathlibrariesusingsoftwarepipeliningandloopunrolling

§ MultiflowTrace(1987)- commercializationofideasfromFisher’sYalegroupincluding“tracescheduling”

- availableinconfigurationswith7,14,or28operations/instruction

- 28operationspackedintoa1024-bitinstructionword§ CydromeCydra-5(1987)- 7operationsencodedin256-bitinstructionword- rotatingregisterfile

5

Page 6: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

VLIWCompilerResponsibilities

§ Scheduleoperationstomaximizeparallelexecution

§ Guaranteesintra-instructionparallelism

§ Scheduletoavoiddatahazards(nointerlocks)- TypicallyseparatesoperationswithexplicitNOPs

6

Page 7: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

LoopExecution

7

HowmanyFPops/cycle?

for (i=0; i<N; i++)

B[i] = A[i] + C;Int1 Int 2 M1 M2 FP+ FPx

loop: fldadd x1

fadd

fsdadd x2 bne

1 fadd / 8 cycles = 0.125

loop: fld f1, 0(x1)

add x1, 8

fadd f2, f0, f1

fsd f2, 0(x2)

add x2, 8

bne x1, x3, loop

Compile

Schedule

Page 8: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

LoopUnrolling

8

for (i=0; i<N; i++)

B[i] = A[i] + C;

for (i=0; i<N; i+=4)

{

B[i] = A[i] + C;

B[i+1] = A[i+1] + C;

B[i+2] = A[i+2] + C;

B[i+3] = A[i+3] + C;

}

Unroll inner loop to perform 4 iterations at once

Need to handle values of N that are not multiples of unrolling factor with final cleanup loop

Page 9: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

SchedulingLoopUnrolledCode

9

loop: fld f1, 0(x1)fld f2, 8(x1)fld f3, 16(x1)fld f4, 24(x1)add x1, 32fadd f5, f0, f1fadd f6, f0, f2 fadd f7, f0, f3 fadd f8, f0, f4fsd f5, 0(x2)fsd f6, 8(x2)fsd f7, 16(x2)fsd f8, 24(x2)add x2, 32bne x1, x3, loop

Schedule

Int1 Int 2 M1 M2 FP+ FPx

loop:

Unroll 4 ways

fld f1fld f2fld f3fld f4add x1 fadd f5

fadd f6fadd f7fadd f8

fsd f5fsd f6fsd f7fsd f8add x2 bne

How many FLOPS/cycle?4 fadds / 11 cycles = 0.36

Page 10: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

SoftwarePipelining

10

HowmanyFLOPS/cycle?

loop: fld f1, 0(x1)fld f2, 8(x1)fld f3, 16(x1)fld f4, 24(x1)add x1, 32fadd f5, f0, f1fadd f6, f0, f2 fadd f7, f0, f3 fadd f8, f0, f4fsd f5, 0(x2)fsd f6, 8(x2)fsd f7, 16(x2)add x2, 32fsd f8, -8(x2)bne x1, x3, loop

Int1 Int 2 M1 M2 FP+ FPxUnroll 4 ways firstfld f1fld f2fld f3fld f4

fadd f5fadd f6fadd f7fadd f8

fsd f5fsd f6fsd f7fsd f8

add x1

add x2bne

fld f1fld f2fld f3fld f4

fadd f5fadd f6fadd f7fadd f8

fsd f5fsd f6fsd f7fsd f8

add x1

add x2bne

fld f1fld f2fld f3fld f4

fadd f5fadd f6fadd f7fadd f8

fsd f5

add x1

loop:iterate

prolog

epilog

4 fadds / 4 cycles = 1

Page 11: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

SoftwarePipeliningvs.LoopUnrolling

11

time

performance

time

performance

LoopUnrolled

SoftwarePipelined

Startupoverhead

Wind-downoverhead

LoopIteration

LoopIterationSoftwarepipeliningpaysstartup/wind-downcostsonlyonceperloop,notonceperiteration

Page 12: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

Whatiftherearenoloops?

§ Brancheslimitbasicblocksizeincontrol-flowintensiveirregularcode

§ DifficulttofindILPinindividualbasicblocks

12

Basicblock

Page 13: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

TraceScheduling[Fisher,Ellis]

§ Pickstringofbasicblocks,atrace,thatrepresentsmostfrequentbranchpath

§ Useprofilingfeedbackorcompilerheuristicstofindcommonbranchpaths

§ Schedulewhole“trace”atonce§ Addfixup codetocopewithbranchesjumpingoutoftrace

13

Page 14: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

Problemswith“Classic”VLIW

§ Object-codecompatibility- havetorecompileallcodeforeverymachine,evenfortwomachinesinsamegeneration

§ Objectcodesize- instructionpaddingwastesinstructionmemory/cache- loopunrolling/softwarepipeliningreplicatescode

§ Schedulingvariablelatencymemoryoperations- cachesand/ormemorybankconflictsimposestaticallyunpredictablevariability

§ Knowingbranchprobabilities- Profilingrequiresansignificantextrastepinbuildprocess

§ Schedulingforstaticallyunpredictablebranches- optimalschedulevarieswithbranchpath

14

Page 15: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

VLIWInstructionEncoding

§ Schemestoreduceeffectofunusedfields- Compressedformatinmemory,expandonI-cacherefill- usedinMultiflow Trace- introducesinstructionaddressingchallenge

-Markparallelgroups- usedinTMS320C6xDSPs,IntelIA-64

- Provideasingle-opVLIWinstruction- Cydra-5UniOp instructions

15

Group 1 Group 2 Group 3

Page 16: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

IntelItanium,EPICIA-64

§ EPICisthestyleofarchitecture(cf.CISC,RISC)- ExplicitlyParallelInstructionComputing(reallyjustVLIW)

§ IA-64isIntel’schosenISA(cf.x86,MIPS)- IA-64=IntelArchitecture64-bit- Anobject-code-compatibleVLIW

§ MercedwasfirstItaniumimplementation(cf.8086)- Firstcustomershipmentexpected1997(actually2001)-McKinley,secondimplementationshippedin2002- Recentversion,Poulson,eightcores,32nm,announced2011

16

Page 17: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

EightCoreItanium“Poulson”[Intel2011]

§ 8cores§ 1-cycle16KBL1I&Dcaches§ 9-cycle512KBL2I-cache§ 8-cycle256KBL2D-cache§ 32MBsharedL3cache§ 544mm2in32nmCMOS§ Over3billiontransistors

§ Coresare2-waymultithreaded

§ 6instruction/cyclefetch- Two128-bitbundles

§ Upto12insts/cycleexecute

17

Page 18: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

IA-64InstructionFormat

§ Templatebitsdescribegroupingoftheseinstructionswithothersinadjacentbundles

§ Eachgroupcontainsinstructionsthatcanexecuteinparallel

18

Instruction 2 Instruction 1 Instruction 0 Template

128-bit instruction bundle

group i group i+1 group i+2group i-1

bundle j bundle j+1bundle j+2bundle j-1

Page 19: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

IA-64Registers

§ 128GeneralPurpose64-bitIntegerRegisters§ 128GeneralPurpose64/80-bitFloatingPointRegisters

§ 641-bitPredicateRegisters

§ GPRs“rotate”toreducecodesizeforsoftwarepipelinedloops- Rotationisasimpleformofregisterrenamingallowingoneinstructiontoaddressdifferentphysicalregistersoneachiteration

19

Page 20: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

RotatingRegisterFiles

20

Problems:Scheduledloopsrequirelotsofregisters,Lotsofduplicatedcodeinprolog,epilog

Solution:Allocatenewsetofregistersforeachloopiteration

20

Page 21: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

RotatingRegisterFile

21

P0P1P2P3P4P5P6P7

RRB=3

+R1

RotatingRegisterBase(RRB)registerpointstobaseofcurrentregisterset. Valueaddedontologicalregisterspecifier togivephysicalregisternumber.Usually,splitintorotatingandnon-rotatingregisters.

21

Page 22: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

RotatingRegisterFile(PreviousLoopExample)

22

bloopsd f9, ()fadd f5, f4, ...ld f1, ()

Three cycle load latency encoded as difference of 3

in register specifier number (f4 - f1 = 3)

Four cycle fadd latency encoded as difference of 4

in register specifier number (f9 – f5 = 4)

bloopsd P17, ()fadd P13, P12,ld P9, () RRB=8bloopsd P16, ()fadd P12, P11,ld P8, () RRB=7bloopsd P15, ()fadd P11, P10,ld P7, () RRB=6bloopsd P14, ()fadd P10, P9,ld P6, () RRB=5bloopsd P13, ()fadd P9, P8,ld P5, () RRB=4bloopsd P12, ()fadd P8, P7,ld P4, () RRB=3bloopsd P11, ()fadd P7, P6,ld P3, () RRB=2bloopsd P10, ()fadd P6, P5,ld P2, () RRB=1

22

Page 23: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

IA-64PredicatedExecution

23

Problem:Mispredicted brancheslimitILPSolution:Eliminatehardtopredictbrancheswithpredicatedexecution- AlmostallIA-64instructionscanbeexecutedconditionallyunderpredicate

- InstructionbecomesNOPifpredicateregisterfalse

Inst 1Inst 2br a==b, b2

Inst 3Inst 4br b3

Inst 5Inst 6

Inst 7Inst 8

b0:

b1:

b2:

b3:

if

else

then

Four basic blocks

Inst 1Inst 2p1,p2 <- cmp(a==b)(p1) Inst 3 || (p2) Inst 5(p1) Inst 4 || (p2) Inst 6Inst 7Inst 8

Predication

One basic block

Mahlke et al, ISCA95: On average >50% branches removed

Page 24: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

IA-64SpeculativeExecution

24

Problem:Branchesrestrictcompilercodemotion

Inst 1Inst 2br a==b, b2

Load r1Use r1Inst 3

Can’t move load above branch because might cause spurious

exception

Load.s r1Inst 1Inst 2br a==b, b2

Chk.s r1Use r1Inst 3

Speculative load never causes

exception, but sets “poison” bit on

destination register

Check for exception in original home block

jumps to fixup code if exception detected

Particularly useful for scheduling long latency loads early

Solution: Speculative operations that don’t cause exceptions

Page 25: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

IA-64DataSpeculation

25

Problem:Possiblememoryhazardslimitcodescheduling

Requires associative hardware in address check table

Inst 1Inst 2Store

Load r1Use r1Inst 3

Can’t move load above store because store might be to same

address

Load.a r1Inst 1Inst 2Store

Load.cUse r1Inst 3

Data speculative load adds address to

address check table

Store invalidates any matching loads in

address check table

Check if load invalid (or missing), jump to fixup

code if so

Solution: Hardware to check pointer hazards

Page 26: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

LimitsofStaticScheduling

§ Unpredictablebranches§ Variablememorylatency(unpredictablecachemisses)

§ Codesizeexplosion§ Compilercomplexity§ Despiteseveralattempts,VLIWhasfailedingeneral-purposecomputingarena(sofar).-MorecomplexVLIWarchitecturesclosetoin-ordersuperscalarincomplexity,norealadvantageonlargecomplexapps.

§ SuccessfulinembeddedDSPmarket- SimplerVLIWswithmoreconstrainedenvironment,friendliercode.

26

Page 27: CS252 Spring 2017 Graduate Computer …inst.eecs.berkeley.edu/~cs252/sp17/lec/CS252-Sp17-Lec10.pdfEPIC is the style of architecture (cf. CISC, RISC)-Explicitly Parallel Instruction

©KrsteAsanovic,2015CS252,Fall2015,Lecture10

Acknowledgements

§ ThiscourseispartlyinspiredbypreviousMIT6.823andBerkeleyCS252computerarchitecturecoursescreatedbymycollaboratorsandcolleagues:- Arvind (MIT)- JoelEmer (Intel/MIT)- JamesHoe(CMU)- JohnKubiatowicz (UCB)- DavidPatterson(UCB)

27