Upload
stella-carls
View
234
Download
1
Tags:
Embed Size (px)
Citation preview
VLIW-DLX Simulator
Milos Becvar and Stanislav KahanekFaculty of Electrical Engineering
Czech Technical University in Prague
Presentation Outline
• Undergraduate Comp. Arch. Course
• Experience with WinDLX
• VLIW-DLX Simulation Model
• VLIW-DLX Simulator Features
• Example of Program
• Planned use in comp. arch. Course
• Future work
X36APS Course ContentUndergraduate course intended for CS/CE students, follow-
up to digital design and basic computer organization course. 90 minutes lecture + 90 minutes lab/seminar per week. 200-300 students per semester.
• Introduction, computer performance – 1 lecture• ISA – 2 lectures• Pipelining of RISC – 2 lectures• Memory subsystem – 2 lectures• Intro. to ILP - Superscalar, VLIWs – 2 lectures• Data parallelism – vector computers – 1 lecture• Multiprocessors, coherency on SMP – 2 lectures
X36APS Seminars and Labs
• Goal is to complement lectures with additional experience with presented topics:
1. Using visualization simulators (WinDLX, HDLDLX, SMPCache)
2. Running benchmarks and evaluating various trade-offs (SPEC benchmarks, Dinero)
3. “Table and chalk” seminars about topics where simulators are not available (cache design, vector computers)
Visualization simulators prove to be the most efficient way for student interaction with the topics.
Good Experience with WinDLX
WinDLX in X36APS Course • Used to demonstrate correspondence
between C source code and assembly program in DLX ISA, importance of GCC optimization (1 week in class)
• Used to demonstrate loop unrolling to improve speed of execution on DLX (1 week in class)
• Matrix multiplication program (3 weeks homework)
Matrix Multiplication ProgramWrite a program for the DLX processor that will compute a
product of two square matrices of dimension N. Optimize this program for the given processor parameters so as to achieve as low execution time as possible.
Result Rating (for N=10)Clock Cycles Points
> 10 800 1
9800 ... 10800 2
8800... 9800 3
7800 ... 8800 4
6800 ... 7800 5
<6800 6
Competition to achieve best resultlimits cheating.For achieving full number of points,students have to employ unrolling of the rightloop and schedule instructions to eliminatestalls. Register constrains are necessary to prohibita brutal-force approach to a solution.(e.g. completely eliminating inner loop byunrolling it 10 times)
VLIW-DLX Goals• A tool similar to WinDLX to illustrate basics of VLIWs.• Show relationship between VLIWs and scalar pipelines• Show relationship between software elimination
of hazards by inserting NOPs into code and hardware solution by pipeline interlocks and stalls.
• Show that speedup achievable by extending pipeline width is limited and show sources of these limitations.
• Demonstrate software pipelining algorithm efficiency for VLIW and superscalar processors
Requirements for VLIW-DLX Simulator
• Similar ISA to DLX
• Similar GUI/features to WinDLX GUI
• Visualization of pipeline similar to WinDLX
• Must run in both Win/Linux environment(hence it is in Java)
VLIW-DLX Architecture
InstructionMemory
RegisterFile
64 x 32bGPRs
64 x 32bFPRs
EX
EX
EX(adder)
MEM(2R+1W)
EX1 EX2 EX3 EX4
EX1 EX2 EX3 EX4 EX5
Integer
Integer/Load
Load/Store
Floating Point
Floating Point+ Multiply
VLIW-DLX Features• Currently no forwarding, all data transfers through
a unified register file (Int. and FP registers).• RAW and WAW hazards possible. Multiple
write conflicts possible (later operation wins)• Single branch allowed per VLIW instruction in pipeline slot
1• VLIW instruction following branch in the delay slot is always
executed (branch is executed in the ID stage)• Number and type of pipeline slots can be
easily modified in simulator code. • Operations are all DLX instructions except double
precision FP instructions and division, new operations can be added easily.
VLIW-DLX Instructions bnez r3, loop lf f3,0(r2) sf -16(r2),f10 nop multf f2,f1,f2;; subi r3,r3,4 lf f5,4(r2) sf -12(r2),f11 nop multf f4,f1,f4;;
DLX Instruction = VLIW-DLX Operation
VLIW-DLX Instruction= Group of 5 DLX Instructions
VLIW-DLX Instruction delimiter
Pipeline 1 (Integer, Branch)
Pipeline 2 (Integer, Load)
Pipeline 3 (Load / Store)
Pipeline 4 (Floating Point)
Pipeline 5 (Multiplication)
VLIW-DLX Instructions• Simple HW oriented representation of VLIW instructions• Position in instruction corresponds to pipeline slot, operation type allowed
is checked by compiler• Explicit nops must be included in unused
instruction slots (bundle concept or instruction compression is not used)
• Exchange of values between two registers ispossible in a single VLIW instruction withoutintermediate storage register. r2 r1add r1,r2,r0 add r2,r1,r0 nop nop nop ;;
nop nop nop nop nop;; Full 5-slot NOP
VLIW-DLX Simulator Features
Source code editorRegister view
Memory view
Pipeline view
VLIW-DLX Simulator Features
Same colors as WinDLX Shows in which stage is a given operation
VLIW-DLX Simulator FeaturesHelp shows which operations are now allowed in a given pipeline slot
VLIW-DLX Simulator Features
Simulator shows register values read by a given operationin the ID stage. This helps to track RAW, WAW dependences.
VLIW-DLX Demonstration
Program XPK (X Plus K):
float x[100], k;
for (i=0; i<100; i++)
x[i]+=k;
XPK on Scalar DLX (main loop)
loop:
lf f1, 0(r1)
addf f1, f1, f0
addui r1,r1,4
subui r3,r3,1
sf -4(r1),f1
bnez r3, loop
XPK on WinDLX• Same latency as VLIW-DLX, no forwarding, 2 multicycle
FP adders used to simulate a single pipelined FP adder
Trivial2x
unrolled4x
unrolledSoftwarePipelined
Instruction Count 604 454 379 371
Cycles 1104 654 429 422
RAW stalls 400 150 0 0
Control stalls 99 49 24 22
Structural stalls 1 1 26 29
IPC 0,55 0,69 0,88 0,88
CPI 1,83 1,44 1,13 1,14
Trivial XPK on VLIW-DLX
Pipeline 1 Pipeline 2 Pipeline 3 Pipeline 4
#1 loop: nop lf f1, 0(r1) nop nop
#2 nop nop nop nop
#3 nop nop nop nop
#4 nop nop nop addf f1,f1,f0
#5 subui r3,r3,1 nop nop nop
#6 nop addui r1,r1,4 nop nop
#7 bnez r3, loop nop nop nop
#8 (Delay slot) nop nop sf 0(r1), f1 nop
2xUnrolled XPK on VLIW-DLX
Pipeline 1 Pipeline 2 Pipeline 3 Pipeline 4
#1 loop: nop lf f1, 0(r1) nop nop
#2 nop lf f2, 4(r1) nop nop
#3 nop nop nop nop
#4 nop nop nop addf f1,f1,f0
#5 nop nop nop addf f2,f2,f0
#6 subui r3,r3,2 nop nop nop
#7 nop addui r1,r1,8 nop nop
#8 bnez r3, loop nop sf 0(r1), f1 nop
#9 (Delay slot) nop nop sf 4(r1), f2
Soft. Pipelined XPK on VLIW-DLX
Pipeline 1 Pipeline 2 Pipeline 3 Pipeline 4
#1 loop: nop lf f1,28(r1) sf 0(r1),f1 addf f1,f1,f0
#2 addui r1,r1,16 lf f1, 32(r1) sf 4(r1),f1 addf f1,f1,f0
#3 bnez r3,loop lf f1,36(r1) sf 8(r1), f1 addf f1,f1,f0
#4 (Delay sl.) subui r3,r3, 4 lf f1,24(r1) sf -4(r1),f1 addf f1,f1,f0
+ Prolog and Epilog (not shown)
XPK Loop Performance
Triv ial 2x unrolled
4x unrolled
Sof t. Pipe-lined
0
100
200
300
400
500
600
700
800
900
1000
1100
1200
XPK Loop
WinDLX
VLIW-DLXCycle
s
Pipeline Efficiency of VLIW-DLX
Triv ial 2x unrolled
4x unrolled
Sof t. Pipe-lined
0
100
200
300
400
500
600
700
800
900
IC / Cycles
Int nops
Int/Load nops
Load/Store nops
FP nops
Cycle
s
VLIW-DLX in X36APS Course
• Students will be introduced to VLIW-DLXwithin a single seminar
• They will try to implement a simple loop(similar to XPK) to learn how to use the tool and software pipelining
• They will be assigned a slightly more complex homework (SAXPY loop,matrix mult. kernel)
Summary and Future Work• VLIW-DLX is a simple tool for introduction
of VLIWs to undergraduate students
• It can be easily integrated into course based on DLX and also MIPS processors
• Similar tool is planned to replace aging WinDLX simulator. It will support also vectorinstructions.
• Our goal is to introduce all these conceptsto undergraduate students within a commonISA framework