Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
Very long instruction word architectures
Nicholas FitzRoy-Dale
For cs9244, S38n
Source: http://imagelab.ing.unimo.it/architettura/varie/periferiche/DS80C88.pdf
3-3
Functional Diagram
REGISTER FILE
EXECUTION UNIT
CONTROL AND TIMING
INSTRUCTIONQUEUE
4-BYTE
FLAGS
16-BIT ALU
BUS 8
4
QS0, QS1
S2, S1, S0
2
4
3
GNDVCC
CLK RESET READY
BUS INTERFACE UNIT
RELOCATIONREGISTER FILE
3
A19/S6. . . A16/S3
INTA, RD, WR
DT/R, DEN, ALE, IO/M
SSO/HIGH
2
SEGMENT REGISTERSAND
INSTRUCTION POINTER(5 WORDS)
DATA POINTERAND
INDEX REGS(8 WORDS)
TESTINTRNMI
HLDA
HOLD
RQ/GT0, 1
LOCK
MN/MX
3
ES
CS
SS
DS
IP
AH
BH
CH
DH
AL
BL
CL
DL
SP
BP
SI
DI
ARITHMETIC/LOGIC UNIT
B-BUS
C-BUS
EXECUTIONUNIT
INTERFACEUNIT
BUS
QUEUE
INSTRUCTIONSTREAM BYTE
EXECUTION UNITCONTROL SYSTEM
FLAGS
MEMORY INTERFACE
A-BUS
AD7-AD0
8 A8-A15
INTERFACEUNIT
80C88
Intel Technology Journal Q1, 2001
The Microarchitecture of the Pentium!!!! 4 Processor 4
11 22 33 44 55 66 77 88 99 1010FetchFetch FetchFetch DecodeDecode DecodeDecode DecodeDecode RenameRename ROB RdROB Rd RdyRdy/Sch/Sch DispatchDispatch ExecExec
Basic Pentium III Processor Basic Pentium III Processor MispredictionMisprediction Pipeline Pipeline
Basic Pentium 4 Processor Basic Pentium 4 Processor MispredictionMisprediction Pipeline Pipeline
11 22 33 44 55 66 77 88 99 1010 1111 1212TC TC Nxt Nxt IPIP TC FetchTC Fetch DriveDrive AllocAlloc RenameRename QueQue SchSch SchSch SchSch
1313 1414DispDisp DispDisp
1515 1616 1717 1818 1919 2020RFRF ExEx FlgsFlgs Br CkBr Ck Drive DriveRF RF
Figure 3: Misprediction Pipeline
Allocator / Register RenamerAllocator / Register Renamer
Memory Memory uopuop Queue Queue Integer/Floating Point Integer/Floating Point uop uop QueueQueue
FP Register / BypassFP Register / Bypass
FPFPMMXMMXSSESSESSE2SSE2
FPFPMoveMove
Simple FPSimple FP
L1 Data Cache (8Kbyte 4-way)L1 Data Cache (8Kbyte 4-way)
Memory SchedulerMemory Scheduler FastFast Slow/General FP SchedulerSlow/General FP Scheduler
Integer Register File / Bypass NetworkInteger Register File / Bypass Network
ComplexComplexInstr.Instr.
Slow ALUSlow ALU
SimpleSimpleInstr.Instr.
2x ALU2x ALU
SimpleSimpleInstr.Instr.
2x ALU2x ALU
LoadLoadAddressAddress
AGUAGU
StoreStoreAddressAddress
AGUAGU
256 bits256 bits
64-bits wide64-bits wide
QuadQuadPumpedPumped3.2 GB/s3.2 GB/s
BusBusInterfaceInterface
UnitUnit
SystemSystemBusBus
L2 CacheL2 Cache(256K Byte(256K Byte
8-way)8-way)
48GB/s48GB/s
InstructionInstructionTLB/TLB/PrefetcherPrefetcher
Front-End BTBFront-End BTB(4K Entries)(4K Entries)
Instruction DecoderInstruction Decoder
Trace CacheTrace Cache(12K (12K µµopsops))
Trace Cache BTBTrace Cache BTB(512 Entries)(512 Entries)
MicrocodeMicrocodeROMROM
µµopop Queue Queue
Figure 4: Pentium® 4 processor microarchitecture
NETBURST MICROARCHITECTURE Figure 4 shows a more detailed block diagram of the NetBurst microarchitecture of the Pentium 4 processor. The top-left portion of the diagram shows the front end of the machine. The middle of the diagram illustrates the out-of-order buffering logic, and the bottom of the diagram shows the integer and floating-point execution units and the L1 data cache. On the right of the diagram is the memory subsystem.
Front End The front end of the Pentium 4 processor consists of several units as shown in the upper part of Figure 4. It has the Instruction TLB (ITLB), the front-end branch predictor (labeled here Front-End BTB), the IA-32 Instruction Decoder, the Trace Cache, and the Microcode ROM.
Source: http://www.intel.com/technology/itj/q12001/pdf/art_2.pdf
Superscalar vs VLIWInstruction cache
Buffer, decoder,
dispatcher
Execution unit Execution unit Execution unitExecution unit
Register file
Data cache
Reorder buffer
Superscalar
Superscalar vs VLIWInstruction cache
Instruction
register
Execution unit Execution unit Execution unitExecution unit
Register file
Data cache
VLIW
What changes?➔ We lose:
➔ Dynamic scheduling
➔ Branch prediction?
➔ Register renaming?
➔ Interlocks?
➔ We gain:
➔ Explicit parallelism
➔ Predication?
Compiler techniques
➔ Loop parallelism
➔ Global code motion
➔ Predication
➔ Speculation
Loop parallelism
Unrolling Software pipelining
counter = 0loop arr2[counter] += arr1[counter] counter += 1until counter == 8
counter = 0loop arr2[counter] += arr1[counter] arr2[counter+1] += arr1[counter+1] arr2[counter+2] += arr1[counter+2] arr2[counter+3] += arr1[counter+3] counter += 4until counter == 8
counter = 0loop arr2[counter] += arr1[counter] arr3[counter] %= arr2[counter] counter += 1until counter == 8
STARTUPcounter = 1loop arr2[counter] += arr1[counter] arr3[counter-1] %= arr2[counter-1] counter += 1until counter == 8SHUTDOWN
Global code motion
➔ AKA “moving instructions across branches”
➔ Pure-software
➔ Trace scheduling
➔ Superblocks
➔ Hardware-assisted
➔ Predication
➔ Speculation
Trace scheduling
A
B
C D
E
A
C
E
B
FIXUP
D
Superblocks
➔ A simplified form of traces
➔ Traces: multiple entry, multiple exit
➔ Superblocks, single entry, multiple exit
Predication
bz R1, post add R2, R2, R3post: ...
add.pred R1, R2, R2, R3...
if(a == 0) { x(); y();} else { c(); d();}
a==0? x();a==0? y();a!=0? c();a!=0? d();
Speculation
ld R1, 0(R2) bz R1, else ld R3, 8(R2) j outelse: ld r3, 16(R2)out: ...
ld R1, 0(R2) ld.spec R14, 8(R2) bnz R1, out ld r14, 16(R2)out: ...
History
Joseph Fisher
➔ ELI
➔ Proposed by Joseph Fisher
➔ 512-bit words
➔ Trace scheduling
➔ 1979
MultiFlow Trace
➔ Successor to ELI
➔ 1984-1990
➔ 256/512-bit IW
➔ Compiler outlived hardware MultiFlow Trace
CydromeCydra 5
➔ 256-bit instruction word
➔ 6 operations / cycle
➔ or 1 operation / cycle
➔ All predicated
➔ “Directed dataflow” architecture
Cydra 5
Analog Devices SHARC
ADSP-2136x SHARC Processor Programming Reference 1-3
!"#$%&'(#)%"
• External port for interfacing to off-chip SDRAM (ADSP-21367/8/9 processors) and configuring a shared memory system with up to four other ADSP-21368 SHARC processors
• Parallel port for interfacing to off-chip memory and peripherals (ADSP-21362/3/4/5/6 processors)
Figure 1-1 also shows the three on-chip buses of the ADSP-2136x proces-sors: the PM bus, DM bus, and I/O bus. The PM bus provides access to either instructions or data. During a single cycle, these buses let the pro-cessor access two data operands from memory, access an instruction (from the cache), and perform a DMA transfer.
Figure 1-1. ADSP-21362/3/4/5/6 SHARC Processor Block Diagram
ADDR DATA
IOD
ADDR DATA
IOA
ADDR DATA
IOA
SRAM1 MBIT ROM
2 MBIT
SRAM0.5 MBIT
BLOCK 0 BLOCK 1 BLOCK 2 BLOCK 3
ADDR DATA
IOA
IOP REGISTERS(MEMORY MAPPED)
I/O PROCESSORAND PERIPHERALS
6JTAG TEST & EMULATION
32PM ADDRESS BUS
DM ADDRESS BUS 32
PM DATA BUS
DM DATA BUS
64
64
PX REGISTERPROCESSING
ELEMENT(PEY)
PROCESSINGELEMENT
(PEX)
TIMERINSTRUCTION
CACHE32 X 48-BIT
DAG18X4X32
DAG28X4X32
CORE PROCESSOR
PROGRAMSEQUENCER
SRAM1 MBIT ROM
2 MBIT
SIGNALROUTING
UNIT
SRAM0.5 MBIT
4 BLOCKS OF ON-CHIP MEMORY
IOD IOA IOD IOD
SPISPORTS
IDPPCG
TIMERSSRC
SPDIFDTCP
SHARC 21362
SHARC
➔ “Super Harvard” architecture
➔ DAG1, DAG2
➔ PEx, PEy
➔ 16 registers, all universal
SHARC
ADSP-2136x SHARC Processor Programming Reference 8-5
!"#$%&'$()"*+,$
(DM) and R4, S4 (PM) respectively. In the second instruction, values are simultaneously read from data memory to registers R8 and S8 and written to program memory from registers R0 and S0.
R0=DM(I1,M1);
When the processor is in broadcast mode (the BDCST1 bit is set in the MODE1 system register), the R0 (PEx) data register in this example is loaded with the value from data memory utilizing the I1 register from DAG1, and S0 (PEy) is loaded with the same value.
Type 1 Opcode
47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23
001DMD
DMI DMMPMD
DM DREG PMI PMM PM DREG
22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
COMPUTE
Bits Description
DMD, PMD Select the access types (read or write)
DM DREG,PM DREG
Specify register file location.
DMI, PMI Specify I registers for data and program memory
DMM, PMM Specify M registers used to update the I registers
COMPUTE Defines a compute operation to be performed in parallel with the data accesses; if omitted, this is a NOP
SHARC Group 1, Type 1 instruction
Matrix × matrix
mxnxnxp:bit set MODE1 PEYEN | CBUFEN; /* enable SIMD mode, circular buffers */
lcntr=R15, do matrix until lce; lcntr=r7, do column until lce; r5=i0; /*store i0 in r5*/ lcntr=R14, do row until lce; /*N/2 times*/ /*calc mat_A * mat_B,accumulate,read mat_A and mat_B */row: f12=f0*f4, f8=f8+f12, f0=dm(i0,m1), f4=pm(i8,m9); f12=f0*f4, f8=f8+f12; /*multiply, accumulate*/ f8=f8+f12; /*final accumulate*/ r4=r4 xor r4, r9<->s8; /*move Pey to Pex*/ f8=f8+f9, r0=r4; /*add values for result*/ r8=r8 xor r8, dm(i1,m0)=r8; /*clear r8, save values in mat_c */ column: r12=r12 xor r12, i0=r5; /*clear f12, restore i0 with r5*/ r5=r5+r6; /*accumulate r5 by N for next row of matrix B*/matrix: i0=r5; /* loads i0 with a value which points to the next row for matrix B*/ rts(db); bit clr mode1 PEYEN | CBUFEN; /*disable SIMD and circular buffers*/ /*the last value in the buffer is a dummy value and here it is cleared*/ dm(i1,m0)=0;
EPIC
Montecito
Itanium
➔ 128-bit-wide “bundles”
➔ 128 GP registers
➔ 128 FP registers
➔ 64 predicate registers
➔ 8 branch registers
➔ Comparatively short pipeline
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
TC IP TC Fetch Drv Alc Rename Q Sch Sch Sch Dsp Dsp RF RF Ex Flg BrCk Drv
Pentium 41 2 3 4 6
IPG ROT
5 7 8
EXPRENREGEXEDET WRB
Itanium 2
1 2 3 4 6
IFU IFU
5 7 8
IFU DEC DEC RAT ROB DIS
109 11
EX RET RET
P6 family
Pipelines and complexity
EPIC’s big ideas
➔ Bundles: avoid hardware dependence
➔ Speculation
➔ Predication
BundlingTmpl Instruction 1 Instruction 2 Instruction 3
5 4141 41
EPIC bundle
Template Slot 0 Slot 1 Slot 2
0 M I I
1 M I I
2 M I I
3 M I I
Four bundle templates
Other EPIC features
➔ Register “renaming”
➔ All instructions predicated
➔ Speculative loads
➔ For control and data
Convergence
➔ Backwards compatibility
➔ Universal predication
➔ Branch prediction hints and speculative loads