CSC 631: High-Performance Computer Architecture
Spring 2017
Lecture 3: CISC versus RISC

Instruction Set Architecture (ISA)

§ The contract between software and hardware
§ Typically described by giving all the programmer-visible state (registers + memory) plus the semantics of the instructions that operate on that state
§ IBM 360 was the first line of machines to separate ISA from implementation (a.k.a. microarchitecture)
§ Many implementations are possible for a given ISA
  - E.g., the Soviets built code-compatible clones of the IBM 360, as did Amdahl after he left IBM.
  - E.g., today you can buy AMD or Intel processors that run the x86 ISA.
  - E.g., many cellphones use the ARM ISA with implementations from many different companies, including Apple, Qualcomm, Samsung, etc.
§ We use Berkeley RISC-V 2.0 as the standard ISA in this course
  - www.riscv.org

Source: harmanani.github.io/classes/csc631/Notes/L03-CISCvsRISC.pdf



Control versus Datapath

§ Processor designs can be split between datapath, where numbers are stored and arithmetic operations computed, and control, which sequences operations on the datapath
§ The biggest challenge for early computer designers was getting the control circuitry correct
§ Maurice Wilkes invented the idea of microprogramming to design the control unit of a processor for EDSAC-II, 1958
  - Foreshadowed by Babbage's "Barrel" and mechanisms in earlier programmable calculators

[Figure: block diagram of Control and Datapath. Labels: Control; Datapath (PC, Inst. Reg., Registers, ALU); Main Memory; Address; Data; Instruction; Control Lines; Condition?; Busy?]

Microcoded CPU

[Figure: microcoded CPU block diagram. Labels: Datapath; Main Memory (holds user program written in macroinstructions, e.g., x86, RISC-V); Address; Data; Decoder; µPC; Microcode ROM (holds fixed µcode instructions); Next State; Control Lines; Opcode; Condition; Busy?]

Technology Influence

§ When microcode appeared in the '50s, different technologies were used for:
  - Logic: vacuum tubes
  - Main memory: magnetic cores
  - Read-only memory: diode matrix, punched metal cards, …
§ Logic was very expensive compared to ROM or RAM
§ ROM was cheaper than RAM
§ ROM was much faster than RAM


Single Bus Datapath for Microcoded RISC-V

Microinstructions are written as register transfers:
§ MA := PC means RegSel=PC; RegW=0; RegEn=1; MALd=1
§ B := Reg[rs2] means RegSel=rs2; RegW=0; RegEn=1; BLd=1
§ Reg[rd] := A + B means ALUOp=Add; ALUEn=1; RegSel=rd; RegW=1

[Figure: single-bus datapath. Labels: Main Memory (Address, In, Data Out); Mem. Address; Instruction Reg.; Immediate; Register RAM (32, (PC), rd, rs1, rs2); Registers; PC; A; B; ALU; control signals ImmEn, RegEn, ALUEn, MemEn, ALUOp, MemW, ImmSel, RegW, ALd, BLd, InstLd, MALd, RegSel; Opcode; Condition?; Busy?]
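
The three register transfers above can be written down as a small lookup table. This is only a sketch: the signal names come from the slide, but the dictionary layout, the split into bus enables versus loads/writes, and the default of 0 for every signal the slide does not mention are assumptions made here for illustration.

```python
# Control-signal settings for the three register transfers on this slide.
# Signal names come from the slide; the dict layout and the "everything
# else is 0" default are assumptions of this sketch.
MICRO_OPS = {
    "MA := PC":         {"RegSel": "PC",  "RegW": 0, "RegEn": 1, "MALd": 1},
    "B := Reg[rs2]":    {"RegSel": "rs2", "RegW": 0, "RegEn": 1, "BLd": 1},
    "Reg[rd] := A + B": {"ALUOp": "Add", "ALUEn": 1, "RegSel": "rd", "RegW": 1},
}

# Assumed grouping of the signal names that appear in the figure.
BUS_ENABLES = ("ImmEn", "RegEn", "ALUEn", "MemEn")   # units that can drive the bus
LOADS_AND_WRITES = ("ALd", "BLd", "MALd", "InstLd", "RegW", "MemW")


def control_word(transfer):
    """Expand one register transfer into a full control word."""
    word = {name: 0 for name in BUS_ENABLES + LOADS_AND_WRITES}
    word.update(MICRO_OPS[transfer])
    return word


def bus_drivers(word):
    """Names of the bus enables asserted in this control word."""
    return [name for name in BUS_ENABLES if word[name] == 1]


for transfer in MICRO_OPS:
    # Each transfer should put exactly one source on the shared bus.
    print(transfer, "->", bus_drivers(control_word(transfer)))
```

With a single shared bus, a useful sanity check on any microinstruction is that exactly one enable is asserted, which is what `bus_drivers` makes easy to verify.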

RISC-V Instruction Execution Phases

§ Instruction Fetch
§ Instruction Decode
§ Register Fetch
§ ALU Operations
§ Optional Memory Operations
§ Optional Register Writeback
§ Calculate Next Instruction Address

Microcode Sketches (1)

Instruction Fetch:
    MA, A := PC
    PC := A + 4
    wait for memory
    IR := Mem
    dispatch on opcode

ALU:
    A := Reg[rs1]
    B := Reg[rs2]
    Reg[rd] := ALUOp(A, B)
    goto instruction fetch

ALUI:
    A := Reg[rs1]
    B := ImmI    // Sign-extend 12b immediate
    Reg[rd] := ALUOp(A, B)
    goto instruction fetch

Microcode Sketches (2)

LW:
    A := Reg[rs1]
    B := ImmI    // Sign-extend 12b immediate
    MA := A + B
    wait for memory
    Reg[rd] := Mem
    goto instruction fetch

JAL:
    A := PC    // Reload incremented PC (after fetch, A still holds the pre-increment PC)
    Reg[rd] := A    // Store return address
    A := A - 4    // Recover original PC
    B := ImmJ    // Jump-style immediate
    PC := A + B
    goto instruction fetch

Branch:
    A := Reg[rs1]
    B := Reg[rs2]
    if (!ALUOp(A, B)) goto instruction fetch    // Not taken
    A := PC    // Microcode fall through if branch taken
    A := A - 4
    B := ImmB    // Branch-style immediate
    PC := A + B
    goto instruction fetch
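
To make the microcode sketches concrete, here is a minimal register-transfer simulator for the fetch, ALU, ALUI, and LW routines. It is a sketch under several assumptions that are not in the slides: instructions are pre-decoded dictionaries rather than 32-bit encodings, memory always responds immediately (so "wait for memory" takes no extra steps), memory is a dictionary keyed by byte address, and the names `fetch`, `execute`, `run`, and `ALU_OPS` are made up for this example.

```python
import operator

# Hypothetical subset of ALU operations, keyed by a decoded "func" field.
ALU_OPS = {"add": operator.add, "sub": operator.sub}


def fetch(s, mem):
    """Instruction fetch microroutine; returns the number of microsteps."""
    s["MA"] = s["A"] = s["PC"]   # MA, A := PC
    s["PC"] = s["A"] + 4         # PC := A + 4
    s["IR"] = mem[s["MA"]]       # IR := Mem (memory assumed ready, no wait)
    return 3


def execute(s, mem):
    """Dispatch on opcode and run the matching microroutine."""
    ir, reg = s["IR"], s["Reg"]
    if ir["op"] in ("ALU", "ALUI"):
        s["A"] = reg[ir["rs1"]]                                       # A := Reg[rs1]
        s["B"] = reg[ir["rs2"]] if ir["op"] == "ALU" else ir["imm"]   # B := Reg[rs2] or ImmI
        reg[ir["rd"]] = ALU_OPS[ir["func"]](s["A"], s["B"])           # Reg[rd] := ALUOp(A, B)
        return 3
    if ir["op"] == "LW":
        s["A"] = reg[ir["rs1"]]         # A := Reg[rs1]
        s["B"] = ir["imm"]              # B := ImmI
        s["MA"] = s["A"] + s["B"]       # MA := A + B
        reg[ir["rd"]] = mem[s["MA"]]    # Reg[rd] := Mem
        return 4
    raise ValueError(f"no microroutine for {ir['op']}")


def run(program, data, count):
    """Run `count` instructions; return the final state and total microsteps."""
    mem = dict(data)
    mem.update({4 * i: inst for i, inst in enumerate(program)})
    s = {"PC": 0, "A": 0, "B": 0, "MA": 0, "IR": None, "Reg": [0] * 32}
    steps = 0
    for _ in range(count):
        steps += fetch(s, mem) + execute(s, mem)
    return s, steps


PROGRAM = [
    {"op": "LW",   "rd": 1, "rs1": 0, "imm": 100},                # x1 = Mem[100]
    {"op": "ALUI", "rd": 2, "rs1": 1, "imm": 5, "func": "add"},   # x2 = x1 + 5
    {"op": "ALU",  "rd": 3, "rs1": 1, "rs2": 2, "func": "add"},   # x3 = x1 + x2
]
state, steps = run(PROGRAM, {100: 7}, len(PROGRAM))
print(state["Reg"][1:4], state["PC"], steps)
```

Even this three-instruction program takes 19 register-transfer steps, which is the kind of multi-cycle cost per instruction that the CPI discussion later in the lecture is about.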

Pure ROM Implementation

§ How many address bits?
  |µaddress| = |µPC| + |opcode| + 1 + 1
§ How many data bits?
  |data| = |µPC| + |control signals| = |µPC| + 18
§ Total ROM size = $2^{|\mu\text{address}|} \times |\text{data}|$

[Figure: ROM with Address inputs µPC, Opcode, Cond?, Busy? and Data outputs Next µPC, Control Signals.]

Pure ROM Contents

Address                        |  Data
µPC     Opcode  Cond?  Busy?   |  Control Lines           Next µPC
fetch0  X       X      X       |  MA, A := PC             fetch1
fetch1  X       X      1       |                          fetch1
fetch1  X       X      0       |  IR := Mem               fetch2
fetch2  ALU     X      X       |  PC := A + 4             ALU0
fetch2  ALUI    X      X       |  PC := A + 4             ALUI0
fetch2  LW      X      X       |  PC := A + 4             LW0
….
ALU0    X       X      X       |  A := Reg[rs1]           ALU1
ALU1    X       X      X       |  B := Reg[rs2]           ALU2
ALU2    X       X      X       |  Reg[rd] := ALUOp(A, B)  fetch0
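
The table above can be read as a pattern match with don't-cares. The sketch below encodes only the rows shown (the elided rows stay elided) and also counts how many physical ROM words each row expands to, since a real ROM has no don't-cares. The 5-bit opcode width is taken from the ROM-size slide; the names `ROM_ROWS`, `rom_lookup`, and `words_occupied` are illustrative, not part of the lecture.

```python
X = None  # don't-care marker

# Rows from the table: (µPC, opcode, Cond?, Busy?) -> (control lines, next µPC).
ROM_ROWS = [
    (("fetch0", X,      X, X), ("MA, A := PC",            "fetch1")),
    (("fetch1", X,      X, 1), ("",                       "fetch1")),
    (("fetch1", X,      X, 0), ("IR := Mem",              "fetch2")),
    (("fetch2", "ALU",  X, X), ("PC := A + 4",            "ALU0")),
    (("fetch2", "ALUI", X, X), ("PC := A + 4",            "ALUI0")),
    (("fetch2", "LW",   X, X), ("PC := A + 4",            "LW0")),
    (("ALU0",   X,      X, X), ("A := Reg[rs1]",          "ALU1")),
    (("ALU1",   X,      X, X), ("B := Reg[rs2]",          "ALU2")),
    (("ALU2",   X,      X, X), ("Reg[rd] := ALUOp(A, B)", "fetch0")),
]

# Widths of the opcode, Cond?, and Busy? address fields (µPC is always given).
FIELD_BITS = (5, 1, 1)


def rom_lookup(upc, opcode, cond, busy):
    """Return (control lines, next µPC) for one ROM address."""
    inputs = (upc, opcode, cond, busy)
    for pattern, data in ROM_ROWS:
        if all(p is X or p == v for p, v in zip(pattern, inputs)):
            return data
    raise KeyError(inputs)


def words_occupied(pattern):
    """Physical ROM words one table row expands to (2^bits per don't-care field)."""
    words = 1
    for field, bits in zip(pattern[1:], FIELD_BITS):
        if field is X:
            words *= 2 ** bits
    return words


print(rom_lookup("fetch1", "LW", 0, 1))           # memory busy: stay in fetch1
print(words_occupied(("fetch0", X, X, X)))        # one row, many identical words
```

The second function shows where the pure-ROM cost comes from: a row with every input marked X is replicated across all combinations of opcode, Cond?, and Busy?.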

Single-Bus Microcode RISC-V ROM Size

§ Instruction fetch sequence: 3 common steps
§ ~12 instruction groups
§ Each group takes ~5 steps (1 for dispatch)
§ Total steps: 3 + 12 × 5 = 63, needs 6 bits for µPC
§ Opcode is 5 bits, ~18 control signals
§ Total size = $2^{(6+5+2)} \times (6+18) = 2^{13} \times 24$ bits = ~25 KB!
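
The size estimate above is easy to reproduce. This sketch plugs the slide's numbers into the pure-ROM formula; the function name and the `flag_bits` parameter (standing for the Cond? and Busy? inputs) are assumptions of the example.

```python
def pure_rom_bits(n_states, opcode_bits, n_control_signals, flag_bits=2):
    """ROM size in bits: 2^|µaddress| words of |data| bits each."""
    upc_bits = (n_states - 1).bit_length()              # bits needed to number the states
    address_bits = upc_bits + opcode_bits + flag_bits   # µPC + opcode + Cond? + Busy?
    data_bits = upc_bits + n_control_signals            # next µPC + control signals
    return (2 ** address_bits) * data_bits


n_states = 3 + 12 * 5   # 3 fetch steps + 12 groups of ~5 steps
bits = pure_rom_bits(n_states, opcode_bits=5, n_control_signals=18)
print(n_states, bits, bits // 8)   # states, total bits, total bytes
```

The result is 196,608 bits, i.e. 24,576 bytes, which is the "~25 KB" quoted on the slide.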

Reducing Control Store Size

§ Reduce ROM height (# address bits)
  - Use external logic to combine input signals
  - Reduce # states by grouping opcodes
§ Reduce ROM width (# data bits)
  - Restrict µPC encoding (next, dispatch, wait on memory, …)
  - Encode control signals (vertical µcoding, nanocoding)

Single-Bus RISC-V Microcode Engine

[Figure: microcode engine. Labels: µPC; +1; fetch0; Decode; Opcode; µPC Jump Logic; Cond?; Busy?; µPC jump; ROM Address; Data; Control Signals.]

µPC jump = next | spin | fetch | dispatch | ftrue | ffalse

µPC Jump Types

§ next: increments µPC
§ spin: waits for memory
§ fetch: jumps to start of instruction fetch
§ dispatch: jumps to start of decoded opcode group
§ ftrue/ffalse: jumps to fetch if Cond? true/false

Encoded ROM Contents

Address  |  Data
µPC      |  Control Lines           Next µPC
fetch0   |  MA, A := PC             next
fetch1   |  IR := Mem               spin
fetch2   |  PC := A + 4             dispatch

ALU0     |  A := Reg[rs1]           next
ALU1     |  B := Reg[rs2]           next
ALU2     |  Reg[rd] := ALUOp(A, B)  fetch

Branch0  |  A := Reg[rs1]           next
Branch1  |  B := Reg[rs2]           next
Branch2  |  A := PC                 ffalse
Branch3  |  A := A - 4              next
Branch4  |  B := ImmB               next
Branch5  |  PC := A + B             fetch
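
The encoded ROM and the µPC jump types can be exercised together with a small sequencer model. This is a sketch, not the lecture's implementation: it assumes `spin` holds the µPC while Busy? is set and otherwise advances by one (mirroring the fetch1 rows of the pure ROM), it models memory latency as a simple count of busy cycles, and the names `next_upc`, `trace`, and `DISPATCH` are invented here.

```python
# (label, control lines, µPC jump type), in ROM order, copied from the table.
ROM = [
    ("fetch0",  "MA, A := PC",            "next"),
    ("fetch1",  "IR := Mem",              "spin"),
    ("fetch2",  "PC := A + 4",            "dispatch"),
    ("ALU0",    "A := Reg[rs1]",          "next"),
    ("ALU1",    "B := Reg[rs2]",          "next"),
    ("ALU2",    "Reg[rd] := ALUOp(A, B)", "fetch"),
    ("Branch0", "A := Reg[rs1]",          "next"),
    ("Branch1", "B := Reg[rs2]",          "next"),
    ("Branch2", "A := PC",                "ffalse"),
    ("Branch3", "A := A - 4",             "next"),
    ("Branch4", "B := ImmB",              "next"),
    ("Branch5", "PC := A + B",            "fetch"),
]
INDEX = {label: i for i, (label, _, _) in enumerate(ROM)}
FETCH = INDEX["fetch0"]
DISPATCH = {"ALU": INDEX["ALU0"], "Branch": INDEX["Branch0"]}  # stands in for the decoder


def next_upc(upc, jump, opcode, cond, busy):
    """µPC jump logic for the six jump types."""
    if jump == "next":
        return upc + 1
    if jump == "spin":
        return upc if busy else upc + 1
    if jump == "fetch":
        return FETCH
    if jump == "dispatch":
        return DISPATCH[opcode]
    if jump == "ftrue":
        return FETCH if cond else upc + 1
    if jump == "ffalse":
        return upc + 1 if cond else FETCH
    raise ValueError(jump)


def trace(opcode, cond=False, busy_cycles=0):
    """Labels of the µinstructions visited while executing one macroinstruction."""
    upc, visited = FETCH, []
    while True:
        label, _, jump = ROM[upc]
        visited.append(label)
        busy = jump == "spin" and busy_cycles > 0   # memory still busy?
        if busy:
            busy_cycles -= 1
        upc = next_upc(upc, jump, opcode, cond, busy)
        if upc == FETCH:
            return visited


print(trace("ALU"))
print(len(trace("Branch", cond=False)), len(trace("Branch", cond=True)))
```

In this model a not-taken branch leaves the routine at Branch2 and costs 6 µinstructions, while a taken branch runs through Branch5 and costs 9; each busy memory cycle adds one more visit to fetch1.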

Implementing Complex Instructions

Memory-memory add: M[rd] = M[rs1] + M[rs2]

Address  |  Data
µPC      |  Control Lines         Next µPC
MMA0     |  MA := Reg[rs1]        next
MMA1     |  A := Mem              spin
MMA2     |  MA := Reg[rs2]        next
MMA3     |  B := Mem              spin
MMA4     |  MA := Reg[rd]         next
MMA5     |  Mem := ALUOp(A, B)    spin
MMA6     |                        fetch

Complex instructions usually do not require datapath modifications, only extra space for the control program

Very difficult to implement these instructions using a hardwired controller without substantial datapath modifications

Horizontal vs Vertical µCode

§ Horizontal µcode has wider µinstructions
  - Multiple parallel operations per µinstruction
  - Fewer microcode steps per macroinstruction
  - Sparser encoding ⇒ more bits
§ Vertical µcode has narrower µinstructions
  - Typically a single datapath operation per µinstruction
    - separate µinstruction for branches
  - More microcode steps per macroinstruction
  - More compact ⇒ fewer bits
§ Nanocoding
  - Tries to combine best of horizontal and vertical µcode

[Figure: plot with axes "# µInstructions" and "Bits per µInstruction".]

Nanocoding

§ Motorola 68000 had 17-bit µcode containing either a 10-bit µjump or a 9-bit nanoinstruction pointer
  - Nanoinstructions were 68 bits wide, decoded to give 196 control signals

Exploits recurring control signal patterns in µcode, e.g.,
    ALU0   A ← Reg[rs1]
    ...
    ALUI0  A ← Reg[rs1]
    ...

[Figure: two-level control store. Labels: µPC (state); µaddress; µcode ROM; µcode next-state; nanoaddress; nanoinstruction ROM; data.]
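
A rough way to see why sharing nanoinstructions saves storage is to compare bit counts. The widths (17, 68, 10) are from the slide; everything else is an assumption made for illustration: the µinstruction and nanoinstruction counts are hypothetical, and the "flat" alternative (each µinstruction carrying its own 68-bit control pattern plus a 10-bit next address) is a made-up baseline, not a description of any real design.

```python
MICRO_BITS = 17   # µcode word width (from the slide)
NANO_BITS = 68    # nanoinstruction width (from the slide)
JUMP_BITS = 10    # µjump width (from the slide)


def two_level_bits(n_micro, n_nano):
    """Storage for a µcode ROM plus a shared nanoinstruction ROM."""
    return n_micro * MICRO_BITS + n_nano * NANO_BITS


def flat_bits(n_micro):
    """Assumed baseline: every µinstruction stores a full control pattern
    and a next address, with no sharing of repeated patterns."""
    return n_micro * (NANO_BITS + JUMP_BITS)


# Hypothetical counts: 512 µinstructions that use only 256 distinct patterns.
N_MICRO, N_NANO = 512, 256
print(two_level_bits(N_MICRO, N_NANO), flat_bits(N_MICRO))
```

Under these assumed counts the two-level store needs 26,112 bits against 39,936 for the flat baseline. The saving depends entirely on how often control patterns repeat: if every µinstruction needed its own nanoinstruction, the two-level scheme would be larger, not smaller.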

IBM 360: Initial Implementations

                Model 30            ...   Model 70
Storage         8K - 64 KB                256K - 512 KB
Datapath        8-bit                     64-bit
Circuit Delay   30 nsec/level             5 nsec/level
Local Store     Main Store                Transistor Registers
Control Store   Read only 1 µsec          Conventional circuits

The IBM 360 instruction set architecture (ISA) completely hid the underlying technological differences between various models.
Milestone: the first true ISA designed as a portable hardware-software interface!

With minor modifications it still survives today!

Microprogramming in IBM 360

                          M30     M40     M50     M65
Datapath width (bits)     8       16      32      64
µinst width (bits)        50      52      85      87
µcode size (K µinsts)     4       4       2.75    2.75
µstore technology         CCROS   TCROS   BCROS   BCROS
µstore cycle (ns)         750     625     500     200
memory cycle (ns)         1500    2500    2000    750
Rental fee ($K/month)     4       7       15      35

§ Only the fastest models (75 and 95) were hardwired
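
Two quantities can be derived directly from the table: the total control store size per model and how much faster the µstore is than main memory. The sketch below only does arithmetic on the table's numbers; the function names are illustrative, and "K" is left as a unit rather than converted.

```python
# Columns copied from the table: µinst width (bits), µcode size (K µinsts),
# µstore cycle (ns), memory cycle (ns).
MODELS = {
    "M30": (50, 4,    750, 1500),
    "M40": (52, 4,    625, 2500),
    "M50": (85, 2.75, 500, 2000),
    "M65": (87, 2.75, 200, 750),
}


def control_store_kbits(model):
    """Control store size in Kbits: µinst width times number of K µinsts."""
    width, k_uinsts, _, _ = MODELS[model]
    return width * k_uinsts


def memory_to_ustore_ratio(model):
    """How many µstore cycles fit in one main-memory cycle."""
    _, _, ustore_ns, memory_ns = MODELS[model]
    return memory_ns / ustore_ns


for name in MODELS:
    print(name, control_store_kbits(name), memory_to_ustore_ratio(name))
```

Across the four models the control store stays in the range of roughly 200 to 240 Kbits, and the µstore is 2x to 4x faster than main memory, which is the speed gap that made microcode attractive at the time.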

Microcode Emulation

§ IBM initially miscalculated the importance of software compatibility with earlier models when introducing the 360 series
§ Honeywell stole some IBM 1401 customers by offering translation software ("Liberator") for the Honeywell H200 series machine
§ IBM retaliated with optional additional microcode for the 360 series that could emulate the IBM 1401 ISA, later extended for the IBM 7000 series
  - One popular program on the 1401 was a 650 simulator, so some customers ran many 650 programs on emulated 1401s
  - (650 simulated on 1401 emulated on 360)

Microprogramming thrived in the '60s and '70s

§ Significantly faster ROMs than DRAMs were available
§ For complex instruction sets, datapath and controller were cheaper and simpler
§ New instructions, e.g., floating point, could be supported without datapath modifications
§ Fixing bugs in the controller was easier
§ ISA compatibility across various models could be achieved easily and cheaply

Except for the cheapest and fastest machines, all computers were microprogrammed

Microprogramming: early Eighties

§ Evolution bred more complex micro-machines
  - Complex instruction sets led to need for subroutine and call stacks in µcode
  - Need for fixing bugs in control programs was in conflict with read-only nature of µROM
  - ⇒ Writable Control Store (WCS) (B1700, QMachine, Intel i432, …)
§ With the advent of VLSI technology, assumptions about ROM & RAM speed became invalid ⇒ more complexity
§ Better compilers made complex instructions less important.
§ Use of numerous micro-architectural innovations, e.g., pipelining, caches and buffers, made multiple-cycle execution of reg-reg instructions unattractive

Writable Control Store (WCS)

§ Implement control store in RAM not ROM
  - MOS SRAM memories now almost as fast as control store (core memories/DRAMs were 2-10x slower)
  - Bug-free microprograms difficult to write
§ User-WCS provided as option on several minicomputers
  - Allowed users to change microcode for each processor
§ User-WCS failed
  - Little or no programming tools support
  - Difficult to fit software into small space
  - Microcode control tailored to original ISA, less useful for others
  - Large WCS part of processor state - expensive context switches
  - Protection difficult if user can change microcode
  - Virtual memory required restartable microcode

Analyzing Microcoded Machines

§ John Cocke and group at IBM
  - Working on a simple pipelined processor, 801, and advanced compilers inside IBM
  - Ported experimental PL.8 compiler to IBM 370, and only used simple register-register and load/store instructions similar to 801
  - Code ran faster than other existing compilers that used all 370 instructions! (up to 6 MIPS whereas 2 MIPS considered good before)
§ Emer, Clark, at DEC
  - Measured VAX-11/780 using external hardware
  - Found it was actually a 0.5 MIPS machine, although usually assumed to be a 1 MIPS machine
  - Found 20% of VAX instructions responsible for 60% of microcode, but only account for 0.2% of execution time!
§ VAX 8800
  - Control Store: 16K*147b RAM, Unified Cache: 64K*8b RAM
  - 4.5x more microstore RAM than cache RAM!
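
The VAX 8800 comparison is a one-line calculation. The sketch assumes "K" on the slide means 2^10; the ratio does not depend on that choice.

```python
K = 1024  # assumes "K" on the slide means 2^10

control_store_bits = 16 * K * 147   # 16K words of 147 bits
cache_bits = 64 * K * 8             # 64K words of 8 bits
ratio = control_store_bits / cache_bits

print(control_store_bits, cache_bits, ratio)
```

The ratio works out to about 4.6, in line with the roughly 4.5x figure on the slide.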

"Iron Law" of Processor Performance

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$

§ Instructions per program depends on source code, compiler technology, and ISA
§ Cycles per instruction (CPI) depends on ISA and µarchitecture
§ Time per cycle depends upon the µarchitecture and base technology
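
The Iron Law is a product of three terms, so it is easy to experiment with. In the sketch below every number is hypothetical and chosen only for illustration: the two "machines" are not measurements, and the CPI of 10 with a 200 ns cycle is simply one combination that yields 0.5 MIPS, the figure quoted for the VAX-11/780 earlier in the lecture.

```python
def time_per_program_ns(instructions, cpi, cycle_ns):
    """Iron Law: instructions/program x cycles/instruction x time/cycle."""
    return instructions * cpi * cycle_ns


def mips(cpi, cycle_ns):
    """Millions of instructions per second for a given CPI and cycle time."""
    return 1000 / (cpi * cycle_ns)


# Hypothetical comparison: fewer, slower instructions vs more, faster ones.
microcoded = time_per_program_ns(instructions=1_000_000, cpi=10, cycle_ns=200)
pipelined = time_per_program_ns(instructions=1_500_000, cpi=2, cycle_ns=200)

print(microcoded, pipelined)
print(mips(cpi=10, cycle_ns=200))
```

With these made-up inputs the second machine executes 50% more instructions yet finishes in less than a third of the time, because the CPI term dominates.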

CPI for Microcoded Machine

[Figure: timeline along the Time axis showing Inst 1 (7 cycles), Inst 2 (5 cycles), Inst 3 (10 cycles).]

Total clock cycles = 7 + 5 + 10 = 22
Total instructions = 3
CPI = 22/3 = 7.33
CPI is always an average over a large number of instructions.
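
The same average can be computed for any instruction mix. The first input below is the three-instruction example from the slide; the second, longer mix is hypothetical and only shows how the average shifts with instruction frequencies.

```python
def cpi(mix):
    """mix: list of (cycles per instruction, number of instructions executed)."""
    total_cycles = sum(cycles * count for cycles, count in mix)
    total_instructions = sum(count for _, count in mix)
    return total_cycles / total_instructions


slide_example = [(7, 1), (5, 1), (10, 1)]              # Inst 1, Inst 2, Inst 3
hypothetical_run = [(7, 500), (5, 300), (10, 200)]     # made-up dynamic counts

print(round(cpi(slide_example), 2))
print(cpi(hypothetical_run))
```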

IC Technology Changes Tradeoffs

§ Logic, RAM, ROM all implemented using MOS transistors
§ Semiconductor RAM ~ same speed as ROM


From CISC to RISC

§ Use fast RAM to build fast instruction cache of user-visible instructions, not fixed hardware microroutines
  - Contents of fast instruction memory change to fit what application needs right now
§ Use simple ISA to enable hardwired pipelined implementation
  - Most compiled code only used a few of the available CISC instructions
  - Simpler encoding allowed pipelined implementations
§ Further benefit with integration
  - In early '80s, could finally fit 32-bit datapath + small caches on a single chip
  - No chip crossings in common case allows faster operation

Berkeley RISC Chips

RISC-I (1982) contains 44,420 transistors, was fabbed in 5 µm NMOS, has a die area of 77 mm², and ran at 1 MHz. This chip is probably the first VLSI RISC.

RISC-II (1983) contains 40,760 transistors, was fabbed in 3 µm NMOS, ran at 3 MHz, and has a die area of 60 mm².

Stanford built some too…

Microprogramming is far from extinct

§ Played a crucial role in micros of the Eighties
  - DEC uVAX, Motorola 68K series, Intel 286/386
§ Plays an assisting role in most modern micros
  - e.g., AMD Bulldozer, Intel Ivy Bridge, Intel Atom, IBM PowerPC, …
  - Most instructions executed directly, i.e., with hard-wired control
  - Infrequently-used and/or complicated instructions invoke microcode
§ Patchable microcode common for post-fabrication bug fixes, e.g., Intel processors load µcode patches at bootup

Acknowledgements

§ These course notes were developed by:
  - Krste Asanovic (UCB)
  - Arvind (MIT)
  - Joel Emer (Intel/MIT)
  - James Hoe (CMU)
  - John Kubiatowicz (UCB)
  - David Patterson (UCB)