CSC631: High-Performance Computer Architecture
Spring 2017 Lecture 3: CISC versus RISC
Instruction Set Architecture (ISA)
§ The contract between software and hardware
§ Typically described by giving all the programmer-visible state (registers + memory) plus the semantics of the instructions that operate on that state
§ IBM 360 was the first line of machines to separate ISA from implementation (a.k.a. microarchitecture)
§ Many implementations possible for a given ISA
- E.g., the Soviets built code-compatible clones of the IBM 360, as did Amdahl after he left IBM.
- E.g. 2, today you can buy AMD or Intel processors that run the x86 ISA.
- E.g. 3, many cellphones use the ARM ISA with implementations from many different companies including Apple, Qualcomm, Samsung, etc.
§ We use Berkeley RISC-V 2.0 as the standard ISA in this course
- www.riscv.org
Control versus Datapath
§ Processor designs can be split between datapath, where numbers are stored and arithmetic operations computed, and control, which sequences operations on the datapath
§ Biggest challenge for early computer designers was getting control circuitry correct
§ Maurice Wilkes invented the idea of microprogramming to design the control unit of a processor for EDSAC-II, 1958
- Foreshadowed by Babbage's "Barrel" and mechanisms in earlier programmable calculators
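[Figure: Control and Datapath blocks. Control receives Instruction, Condition?, and Busy? and drives Control Lines; the Datapath (PC, Inst. Reg., Registers, ALU) connects to Main Memory through Address and Data.]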
Microcoded CPU
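[Figure: microcoded CPU. A Decoder and the µPC address the Microcode ROM (holds fixed µcode instructions), which produces the Next State and the Control Lines for the Datapath; inputs are Opcode, Condition, and Busy?. Main Memory (holds user program written in macroinstructions, e.g., x86, RISC-V) connects to the Datapath through Address and Data.]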
Technology Influence
§ When microcode appeared in the 50s, different technologies for:
- Logic: Vacuum Tubes
- Main Memory: Magnetic cores
- Read-Only Memory: Diode matrix, punched metal cards, …
§ Logic very expensive compared to ROM or RAM
§ ROM cheaper than RAM
§ ROM much faster than RAM
Single Bus Datapath for Microcoded RISC-V
Microinstructions written as register transfers:
§ MA := PC means RegSel=PC; RegW=0; RegEn=1; MALd=1
§ B := Reg[rs2] means RegSel=rs2; RegW=0; RegEn=1; BLd=1
§ Reg[rd] := A + B means ALUop=Add; ALUEn=1; RegSel=rd; RegW=1
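[Figure: single-bus datapath. Units on the bus: Instruction Reg. with Immediate logic, ALU with A and B input registers, Register RAM (RegSel chooses among 32 (PC), rd, rs1, rs2), and Main Memory with its Mem. Address register. Control signals: ImmSel, ImmEn, InstLd, ALUOp, ALd, BLd, ALUEn, RegSel, RegW, RegEn, MALd, MemW, MemEn. Status signals: Opcode, Condition?, Busy?.]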
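The register-transfer notation above can be read as a table of control-signal settings. The following is a minimal sketch of that reading: the signal names come from the slide, while the dictionary layout and the helper names `bus_drivers` and `check_single_driver` are illustrative assumptions. It also checks a property assumed to hold on any single-bus design, namely that each microinstruction enables exactly one bus driver.

```python
# Hypothetical encoding of the three register transfers from the slide as
# control-signal settings. Signal names follow the slide; the data layout and
# helper functions are assumptions made for illustration.
MICRO_OPS = {
    "MA := PC":         {"RegSel": "PC",  "RegW": 0, "RegEn": 1, "MALd": 1},
    "B := Reg[rs2]":    {"RegSel": "rs2", "RegW": 0, "RegEn": 1, "BLd": 1},
    "Reg[rd] := A + B": {"ALUop": "Add", "ALUEn": 1, "RegSel": "rd", "RegW": 1},
}

# Tri-state enables that can drive the shared bus.
BUS_ENABLES = ("ImmEn", "RegEn", "ALUEn", "MemEn")


def bus_drivers(signals):
    """Return the bus enables asserted by one microinstruction."""
    return [name for name in BUS_ENABLES if signals.get(name, 0) == 1]


def check_single_driver(micro_ops):
    """Assumed single-bus rule: exactly one unit drives the bus per step."""
    return all(len(bus_drivers(sig)) == 1 for sig in micro_ops.values())


if __name__ == "__main__":
    for transfer, signals in MICRO_OPS.items():
        print(f"{transfer:18s} driven by {bus_drivers(signals)}")
    print("single driver per step:", check_single_driver(MICRO_OPS))
```

Signals not listed for a transfer are treated as 0 here, which is why `Reg[rd] := A + B` has only the ALU driving the bus while the register file is written.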
RISC-V Instruction Execution Phases
§ Instruction Fetch
§ Instruction Decode
§ Register Fetch
§ ALU Operations
§ Optional Memory Operations
§ Optional Register Writeback
§ Calculate Next Instruction Address
Microcode Sketches (1)
Instruction Fetch:
    MA, A := PC
    PC := A + 4
    wait for memory
    IR := Mem
    dispatch on opcode
ALU:
    A := Reg[rs1]
    B := Reg[rs2]
    Reg[rd] := ALUOp(A, B)
    goto instruction fetch
ALUI:
    A := Reg[rs1]
    B := ImmI    // Sign-extend 12b immediate
    Reg[rd] := ALUOp(A, B)
    goto instruction fetch
Microcode Sketches (2)
LW:
    A := Reg[rs1]
    B := ImmI    // Sign-extend 12b immediate
    MA := A + B
    wait for memory
    Reg[rd] := Mem
    goto instruction fetch
JAL:
    A := PC    // PC was advanced by 4 during fetch
    Reg[rd] := A    // Store return address
    A := A - 4    // Recover original PC
    B := ImmJ    // Jump-style immediate
    PC := A + B
    goto instruction fetch
Branch:
    A := Reg[rs1]
    B := Reg[rs2]
    if (!ALUOp(A, B)) goto instruction fetch    // Not taken
    A := PC    // Microcode fall through if branch taken
    A := A - 4
    B := ImmB    // Branch-style immediate
    PC := A + B
    goto instruction fetch
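To make the sketches concrete, here is a minimal Python model that executes the instruction fetch, ALU, and LW microroutines one register transfer at a time. It is a sketch under stated assumptions, not the course's simulator: the `state` layout, the tuple instruction format, and `run_one` are hypothetical, ALUOp is fixed to addition, memory is assumed always ready (no Busy? wait), and the step count covers register transfers only.

```python
def run_one(state):
    """Run one macroinstruction; return the number of microcode steps used.

    Hypothetical model: instructions are tuples stored in a dict-based memory,
    and memory never asserts Busy?, so "wait for memory" costs no extra steps.
    """
    reg, mem = state["reg"], state["mem"]
    steps = 0

    # Instruction fetch (three register transfers).
    ma = a = state["pc"]          # MA, A := PC
    state["pc"] = a + 4           # PC := A + 4
    ir = mem[ma]                  # IR := Mem
    steps += 3
    opcode = ir[0]                # dispatch on opcode

    if opcode == "ADD":           # ALU group, format ("ADD", rd, rs1, rs2)
        _, rd, rs1, rs2 = ir
        a = reg[rs1]              # A := Reg[rs1]
        b = reg[rs2]              # B := Reg[rs2]
        reg[rd] = a + b           # Reg[rd] := ALUOp(A, B), ALUOp = add
        steps += 3
    elif opcode == "LW":          # format ("LW", rd, rs1, imm)
        _, rd, rs1, imm = ir
        a = reg[rs1]              # A := Reg[rs1]
        b = imm                   # B := ImmI
        ma = a + b                # MA := A + B
        reg[rd] = mem[ma]         # Reg[rd] := Mem
        steps += 4
    else:
        raise ValueError(f"no microroutine for {opcode!r}")
    return steps


# Two-instruction program plus one data word at address 24.
state = {
    "pc": 0,
    "reg": [0] * 32,
    "mem": {0: ("ADD", 3, 1, 2), 4: ("LW", 4, 1, 8), 24: 99},
}
state["reg"][1] = 16
state["reg"][2] = 5

add_steps = run_one(state)   # fetch + ALU microroutine
lw_steps = run_one(state)    # fetch + LW microroutine
```

Even in this simplified model every macroinstruction costs several microcode steps, which is the multi-cycle behaviour the later CPI discussion relies on.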
Pure ROM Implementation
§ How many address bits? |µaddress| = |µPC| + |opcode| + 1 + 1
§ How many data bits? |data| = |µPC| + |control signals| = |µPC| + 18
§ Total ROM size = 2^|µaddress| × |data|
[Figure: the µPC, Opcode, Cond?, and Busy? form the ROM Address; the ROM Data supplies the Next µPC and the Control Signals.]
Pure ROM Contents
Address                        | Data
µPC     Opcode  Cond?  Busy?   | Control Lines            Next µPC
fetch0  X       X      X       | MA, A := PC              fetch1
fetch1  X       X      1       |                          fetch1
fetch1  X       X      0       | IR := Mem                fetch2
fetch2  ALU     X      X       | PC := A+4                ALU0
fetch2  ALUI    X      X       | PC := A+4                ALUI0
fetch2  LW      X      X       | PC := A+4                LW0
….
ALU0    X       X      X       | A := Reg[rs1]            ALU1
ALU1    X       X      X       | B := Reg[rs2]            ALU2
ALU2    X       X      X       | Reg[rd] := ALUOp(A,B)    fetch0
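The table can be read as a lookup keyed on (µPC, Opcode, Cond?, Busy?), where X matches any input value. The sketch below models only the rows shown on the slide; the tuple layout and the names `rom_lookup` and `trace` are assumptions made for illustration, and a real ROM would store every fully expanded address rather than wildcard patterns.

```python
X = None  # don't-care marker, matches any input value

# Rows from the slide: (µPC, opcode, cond, busy) -> (control lines, next µPC).
ROM_ROWS = [
    (("fetch0", X,      X, X), ("MA, A := PC",           "fetch1")),
    (("fetch1", X,      X, 1), ("",                      "fetch1")),
    (("fetch1", X,      X, 0), ("IR := Mem",             "fetch2")),
    (("fetch2", "ALU",  X, X), ("PC := A+4",             "ALU0")),
    (("fetch2", "ALUI", X, X), ("PC := A+4",             "ALUI0")),
    (("fetch2", "LW",   X, X), ("PC := A+4",             "LW0")),
    (("ALU0",   X,      X, X), ("A := Reg[rs1]",         "ALU1")),
    (("ALU1",   X,      X, X), ("B := Reg[rs2]",         "ALU2")),
    (("ALU2",   X,      X, X), ("Reg[rd] := ALUOp(A,B)", "fetch0")),
]


def rom_lookup(upc, opcode, cond, busy):
    """Return (control lines, next µPC) for one ROM address."""
    address = (upc, opcode, cond, busy)
    for pattern, data in ROM_ROWS:
        if all(p is X or p == a for p, a in zip(pattern, address)):
            return data
    raise KeyError(address)


def trace(opcode, cond=0, busy=0):
    """List the µPC states visited for one instruction, starting at fetch0."""
    visited, upc = [], "fetch0"
    while True:
        visited.append(upc)
        _, upc = rom_lookup(upc, opcode, cond, busy)
        if upc == "fetch0":
            return visited


alu_states = trace("ALU")
```

Because the ROM is addressed by every input bit, each don't-care row is physically replicated for all values of the ignored inputs, which is what drives up the size of a pure ROM implementation.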
Single-Bus Microcode RISC-V ROM Size
§ Instruction fetch sequence: 3 common steps
§ ~12 instruction groups
§ Each group takes ~5 steps (1 for dispatch)
§ Total steps 3 + 12*5 = 63, needs 6 bits for µPC
§ Opcode is 5 bits, ~18 control signals
§ Total size = 2^(6+5+2) × (6+18) bits = 2^13 × 24 bits = ~25 KB!
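The size estimate can be reproduced with a few lines of arithmetic. This sketch uses the numbers from the bullets; the function name `pure_rom_bits` and its parameters are hypothetical.

```python
def pure_rom_bits(upc_bits, opcode_bits, control_signals, status_bits=2):
    """ROM size in bits: 2^(address bits) words of (data bits) each.

    status_bits covers the Cond? and Busy? inputs (1 bit each).
    """
    address_bits = upc_bits + opcode_bits + status_bits
    data_bits = upc_bits + control_signals
    return (2 ** address_bits) * data_bits


total_steps = 3 + 12 * 5                   # 63 microcode states
upc_bits = (total_steps - 1).bit_length()  # bits needed to number 63 states
rom_bits = pure_rom_bits(upc_bits, opcode_bits=5, control_signals=18)
rom_bytes = rom_bits // 8

print(f"{total_steps} states, {upc_bits}-bit µPC, "
      f"{rom_bits} bits = {rom_bytes} bytes")
```

The result is 196,608 bits, or 24,576 bytes, which the slide rounds to about 25 KB.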
Reducing Control Store Size
§ Reduce ROM height (# address bits)
- Use external logic to combine input signals
- Reduce # states by grouping opcodes
§ Reduce ROM width (# data bits)
- Restrict µPC encoding (next, dispatch, wait on memory, …)
- Encode control signals (vertical µcoding, nanocoding)
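Single-Bus RISC-V Microcode Engine
[Figure: the µPC supplies the ROM Address; the ROM Data provides the Control Signals and a µPC jump field. µPC Jump Logic uses the µPC jump field, Cond?, and Busy? to select the next µPC from µPC + 1, fetch0, or the output of Decode on the Opcode.]
µPC jump = next | spin | fetch | dispatch | ftrue | ffalse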
µPC Jump Types
§ next: increments µPC
§ spin: waits for memory
§ fetch: jumps to start of instruction fetch
§ dispatch: jumps to start of decoded opcode group
§ ftrue/ffalse: jumps to fetch if Cond? true/false
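The jump types above amount to a small next-state function. The sketch below is one plausible reading rather than the engine's actual logic: it assumes that spin holds the µPC while Busy? is set and otherwise advances, that ftrue/ffalse fall through to µPC + 1 when they do not jump, and that `FETCH0` and `dispatch_target` are hypothetical numeric µPC values.

```python
FETCH0 = 0  # hypothetical µPC value of the first instruction-fetch state


def next_upc(upc, jump, cond=False, busy=False, dispatch_target=None):
    """Compute the next µPC from the jump type and the status inputs."""
    if jump == "next":
        return upc + 1
    if jump == "spin":
        # Assumed: repeat this state while memory is busy, then advance.
        return upc if busy else upc + 1
    if jump == "fetch":
        return FETCH0
    if jump == "dispatch":
        if dispatch_target is None:
            raise ValueError("dispatch needs the decoded opcode group address")
        return dispatch_target
    if jump == "ftrue":
        return FETCH0 if cond else upc + 1
    if jump == "ffalse":
        return FETCH0 if not cond else upc + 1
    raise ValueError(f"unknown µPC jump type: {jump}")


if __name__ == "__main__":
    print(next_upc(1, "spin", busy=True))     # stays in the same state
    print(next_upc(14, "ffalse", cond=True))  # condition holds, fall through
```

With this encoding the ROM no longer needs Cond? and Busy? as address inputs, so each state occupies a single ROM row.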
Encoded ROM Contents
Address  | Data
µPC      | Control Lines            Next µPC
fetch0   | MA, A := PC              next
fetch1   | IR := Mem                spin
fetch2   | PC := A+4                dispatch
ALU0     | A := Reg[rs1]            next
ALU1     | B := Reg[rs2]            next
ALU2     | Reg[rd] := ALUOp(A,B)    fetch
Branch0  | A := Reg[rs1]            next
Branch1  | B := Reg[rs2]            next
Branch2  | A := PC                  ffalse
Branch3  | A := A-4                 next
Branch4  | B := ImmB                next
Branch5  | PC := A+B                fetch
Implementing Complex Instructions
Memory-memory add: M[rd] = M[rs1] + M[rs2]
Address  | Data
µPC      | Control Lines            Next µPC
MMA0     | MA := Reg[rs1]           next
MMA1     | A := Mem                 spin
MMA2     | MA := Reg[rs2]           next
MMA3     | B := Mem                 spin
MMA4     | MA := Reg[rd]            next
MMA5     | Mem := ALUOp(A,B)        spin
MMA6     |                          fetch
Complex instructions usually do not require datapath modifications, only extra space for the control program
Very difficult to implement these instructions using a hardwired controller without substantial datapath modifications
Horizontal vs Vertical µCode
§ Horizontal µcode has wider µinstructions
- Multiple parallel operations per µinstruction
- Fewer microcode steps per macroinstruction
- Sparser encoding ⇒ more bits
§ Vertical µcode has narrower µinstructions
- Typically a single datapath operation per µinstruction
- separate µinstruction for branches
- More microcode steps per macroinstruction
- More compact ⇒ fewer bits
§ Nanocoding
- Tries to combine best of horizontal and vertical µcode
[Figure: plot with axes # µInstructions and Bits per µInstruction.]
Nanocoding
§ Motorola 68000 had 17-bit µcode containing either a 10-bit µjump or a 9-bit nanoinstruction pointer
- Nanoinstructions were 68 bits wide, decoded to give 196 control signals
[Figure: the µPC (state) supplies the µaddress to the µcode ROM, which outputs the µcode next-state and a nanoaddress; the nanoaddress indexes the nanoinstruction ROM, which outputs the data.]
Exploits recurring control signal patterns in µcode, e.g.,
    ALU0   A ← Reg[rs1]
    ...
    ALUI0  A ← Reg[rs1]
    ...
IBM 360: Initial Implementations
                Model 30            ...   Model 70
Storage         8K - 64 KB                256K - 512 KB
Datapath        8-bit                     64-bit
Circuit Delay   30 nsec/level             5 nsec/level
Local Store     Main Store                Transistor Registers
Control Store   Read only 1 µsec          Conventional circuits
IBM 360 instruction set architecture (ISA) completely hid the underlying technological differences between various models.
Milestone: The first true ISA designed as portable hardware-software interface!
With minor modifications it still survives today!
Microprogramming in IBM 360
                          M30     M40     M50     M65
Datapath width (bits)     8       16      32      64
µinst width (bits)        50      52      85      87
µcode size (K µinsts)     4       4       2.75    2.75
µstore technology         CCROS   TCROS   BCROS   BCROS
µstore cycle (ns)         750     625     500     200
memory cycle (ns)         1500    2500    2000    750
Rental fee ($K/month)     4       7       15      35
§ Only the fastest models (75 and 95) were hardwired
Microcode Emulation
§ IBM initially miscalculated the importance of software compatibility with earlier models when introducing the 360 series
§ Honeywell stole some IBM 1401 customers by offering translation software ("Liberator") for Honeywell H200 series machine
§ IBM retaliated with optional additional microcode for 360 series that could emulate IBM 1401 ISA, later extended for IBM 7000 series
- one popular program on 1401 was a 650 simulator, so some customers ran many 650 programs on emulated 1401s
- (650 simulated on 1401 emulated on 360)
Microprogramming thrived in '60s and '70s
§ Significantly faster ROMs than DRAMs were available
§ For complex instruction sets, datapath and controller were cheaper and simpler
§ New instructions, e.g., floating point, could be supported without datapath modifications
§ Fixing bugs in the controller was easier
§ ISA compatibility across various models could be achieved easily and cheaply
Except for the cheapest and fastest machines, all computers were microprogrammed
Microprogramming: early Eighties
§ Evolution bred more complex micro-machines
- Complex instruction sets led to need for subroutine and call stacks in µcode
- Need for fixing bugs in control programs was in conflict with read-only nature of µROM
- ⇒ Writable Control Store (WCS) (B1700, QMachine, Intel i432, …)
§ With the advent of VLSI technology, assumptions about ROM & RAM speed became invalid ⇒ more complexity
§ Better compilers made complex instructions less important.
§ Use of numerous micro-architectural innovations, e.g., pipelining, caches and buffers, made multiple-cycle execution of reg-reg instructions unattractive
Writable Control Store (WCS)
§ Implement control store in RAM not ROM
- MOS SRAM memories now almost as fast as control store (core memories/DRAMs were 2-10x slower)
- Bug-free microprograms difficult to write
§ User-WCS provided as option on several minicomputers
- Allowed users to change microcode for each processor
§ User-WCS failed
- Little or no programming tools support
- Difficult to fit software into small space
- Microcode control tailored to original ISA, less useful for others
- Large WCS part of processor state - expensive context switches
- Protection difficult if user can change microcode
- Virtual memory required restartable microcode
Analyzing Microcoded Machines
§ John Cocke and group at IBM
- Working on a simple pipelined processor, 801, and advanced compilers inside IBM
- Ported experimental PL.8 compiler to IBM 370, and only used simple register-register and load/store instructions similar to 801
- Code ran faster than other existing compilers that used all 370 instructions! (up to 6 MIPS whereas 2 MIPS considered good before)
§ Emer, Clark, at DEC
- Measured VAX-11/780 using external hardware
- Found it was actually a 0.5 MIPS machine, although usually assumed to be a 1 MIPS machine
- Found 20% of VAX instructions responsible for 60% of microcode, but only account for 0.2% of execution time!
§ VAX 8800
- Control Store: 16K*147b RAM, Unified Cache: 64K*8b RAM
- 4.5x more microstore RAM than cache RAM!
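The 4.5x figure for the VAX 8800 follows directly from the two RAM sizes. A quick check, assuming K means 1024 (the ratio is the same either way):

```python
K = 1024  # assumed meaning of "K"; it cancels out in the ratio

control_store_bits = 16 * K * 147   # 16K words of 147 bits
cache_bits = 64 * K * 8             # 64K bytes of 8 bits

ratio = control_store_bits / cache_bits
print(f"control store: {control_store_bits} bits, cache: {cache_bits} bits, "
      f"ratio: {ratio:.2f}x")
```

The exact ratio is about 4.59, which the slide quotes as 4.5x.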
"Iron Law" of Processor Performance
Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)
§ Instructions per program depends on source code, compiler technology, and ISA
§ Cycles per instruction (CPI) depends on ISA and µarchitecture
§ Time per cycle depends upon the µarchitecture and base technology
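CPI for Microcoded Machine
[Figure: timeline (Time axis) of three instructions: Inst 1 takes 7 cycles, Inst 2 takes 5 cycles, Inst 3 takes 10 cycles.]
Total clock cycles = 7 + 5 + 10 = 22
Total instructions = 3
CPI = 22/3 = 7.33
CPI is always an average over a large number of instructions.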
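The CPI example and the three-term performance product can be checked with a short calculation. The cycle counts are the ones from the slide; the function names and the 1 µs cycle time are hypothetical values chosen only to make the product concrete.

```python
def cpi(cycles_per_instruction):
    """Average cycles per instruction over an instruction trace."""
    return sum(cycles_per_instruction) / len(cycles_per_instruction)


def execution_time(instructions, cycles_per_inst, cycle_time_s):
    """Time/Program = Instructions/Program x Cycles/Instruction x Time/Cycle."""
    return instructions * cycles_per_inst * cycle_time_s


cycles = [7, 5, 10]                 # Inst 1, Inst 2, Inst 3 from the slide
average_cpi = cpi(cycles)           # 22 / 3
cycle_time_s = 1e-6                 # assumed 1 µs clock period, for illustration
total_time_s = execution_time(len(cycles), average_cpi, cycle_time_s)

print(f"CPI = {average_cpi:.2f}, total time = {total_time_s * 1e6:.1f} µs")
```

Three instructions are far too few for a meaningful average; the same functions apply unchanged to a long trace.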
IC Technology Changes Tradeoffs
§ Logic, RAM, ROM all implemented using MOS transistors
§ Semiconductor RAM ~ same speed as ROM
From CISC to RISC
§ Use fast RAM to build fast instruction cache of user-visible instructions, not fixed hardware microroutines
- Contents of fast instruction memory change to fit what application needs right now
§ Use simple ISA to enable hardwired pipelined implementation
- Most compiled code only used a few of the available CISC instructions
- Simpler encoding allowed pipelined implementations
§ Further benefit with integration
- In early '80s, could finally fit 32-bit datapath + small caches on a single chip
- No chip crossings in common case allows faster operation
Berkeley RISC Chips
RISC-I (1982) contains 44,420 transistors, fabbed in 5 µm NMOS, with a die area of 77 mm², ran at 1 MHz. This chip is probably the first VLSI RISC.
RISC-II (1983) contains 40,760 transistors, was fabbed in 3 µm NMOS, ran at 3 MHz, and the size is 60 mm².
Stanford built some too…
Microprogramming is far from extinct
§ Played a crucial role in micros of the Eighties
- DEC uVAX, Motorola 68K series, Intel 286/386
§ Plays an assisting role in most modern micros
- e.g., AMD Bulldozer, Intel Ivy Bridge, Intel Atom, IBM PowerPC, …
- Most instructions executed directly, i.e., with hard-wired control
- Infrequently-used and/or complicated instructions invoke microcode
§ Patchable microcode common for post-fabrication bug fixes, e.g. Intel processors load µcode patches at bootup