Out-of-Order OpenRISC 2 semesters project

Semester B: OR1200 ISA Extension Final B Presentation

By: Vova Menis-Lurie, Sonia Gershkovich
Advisor: Mony Orbach

10.3.14

Spring 2013

Content:
1. Project Overview
   a. Background
   b. Goals
2. The System: OR1200
3. Project Flow
   a. Simulation Environment
   b. Out-of-Order Implementation
   c. Super Scalar Implementation
   d. ISA Extension
4. Conclusions

Project Overview

Background
• OpenRISC 1200 (OR1200) is an open-source Verilog implementation of the OR1000 ISA
• In part A, we created a basic working environment on the XUPV5 board and an SoC with the OR1200 CPU

Project Overview

Project Goal
• Initial goal: Out-of-Order execution processor based on the OR1200 implementation
• Changed goal: Super Scalar processor based on the OR1200 implementation
• Final goal: ISA extension implementation for OR1200

[Block diagram: OR1200 top level. CPU, QMEM, MMU (IMMU, DMMU), Cache (ICache, DCache), Store Buffer, and WBI (Instruction WBIU, Data WBIU), connected by 32-bit buses to the instruction and data WB buses]

1. Cache initialization function in assembly to enable the caches.
   (The WB interface protocol requires 3 cycles per transaction, which is not efficient for RTL analysis and implementation improvements.)
2. Simulation environment creation (testbench)
3. Out-of-Order implementation: attempt
4. Super Scalar implementation: attempt
5. ISA extension of the current implementation

Project Flow

Environment features:
• UART interface emulation
• Waveform generation
• One Makefile for:
  • RTL compilation
  • Testbench instantiation
  • C program compilation
  • Running the simulation
  • Assembly code file creation
  • Xilinx RAM initialization file

Simulation Environment

Environment features:
• Advanced monitor:
  • Monitors all data and control transactions of the SoC
  • Monitors states and SPRS values
  • Creates log files with the desired information:
    • State of the register file after each command
    • Execution time analysis

Simulation Environment

Fundamental statements (based on the Tomasulo algorithm):
• Execution parallelism should be implemented
• Non-architectural shadow registers must be implemented
• In-order commitment (SW executes in order)
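The in-order commitment rule can be illustrated with a small software model. This is only a sketch, not the project's RTL: it uses a reorder-buffer style queue as the shadow storage, and every name in it (`rob_t`, `rob_issue`, `rob_complete`, `rob_commit`, the depth `ROB_SIZE`) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ROB_SIZE 4u   /* hypothetical buffer depth */
#define NUM_REGS 32u  /* OR1200 general-purpose registers */

typedef struct {
    bool valid;      /* slot holds an in-flight instruction */
    bool done;       /* result has been produced */
    unsigned dest;   /* architectural destination register */
    uint32_t value;  /* shadow (non-architectural) result */
} rob_entry;

typedef struct {
    rob_entry entry[ROB_SIZE];
    unsigned head, tail, count;
    uint32_t arch_rf[NUM_REGS];  /* state visible to software */
} rob_t;

/* Allocate a slot in program order; returns the slot index or -1 if full. */
int rob_issue(rob_t *r, unsigned dest)
{
    assert(dest < NUM_REGS);
    if (r->count == ROB_SIZE) {
        return -1;
    }
    unsigned slot = r->tail;
    r->entry[slot] = (rob_entry){ true, false, dest, 0 };
    r->tail = (r->tail + 1u) % ROB_SIZE;
    r->count++;
    return (int)slot;
}

/* Execution units may finish in any order; the result stays in the shadow slot. */
void rob_complete(rob_t *r, int slot, uint32_t value)
{
    assert(slot >= 0 && (unsigned)slot < ROB_SIZE && r->entry[slot].valid);
    r->entry[slot].value = value;
    r->entry[slot].done = true;
}

/* Retire only from the head, and only while the oldest instruction is done.
   Returns the number of instructions committed. */
unsigned rob_commit(rob_t *r)
{
    unsigned committed = 0;
    while (r->count > 0 && r->entry[r->head].done) {
        rob_entry *e = &r->entry[r->head];
        r->arch_rf[e->dest] = e->value;
        e->valid = false;
        e->done = false;
        r->head = (r->head + 1u) % ROB_SIZE;
        r->count--;
        committed++;
    }
    return committed;
}
```

If a younger instruction completes first, `rob_commit` retires nothing until the older one is done, so the architectural register file only ever changes in program order.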

Out-of-Order implementation – attempt

[Block diagram: OR1200 CPU. IF, GenPC (PC, Next PC), CTRL, Except, Freeze, RF, Operand MUX, ALU, MAC, LSU, FPU, SPRS, CFGR, WB MUX]

• LSU instruction parallelism requires multi-port memories and a wider bus: multi-port Cache, QMEM and MMU
• Branch prediction is not necessary: the delay slot is handled at the compiler level
• Multiple ALUs are not an effective solution: ALU instructions execute in one cycle

Fundamental statements:
• Still in-order commitment. Multiple execution should not affect SW in-order execution
• Non-parallel Fetch and Decode to avoid instruction dependencies

Super Scalar implementation – attempt

• Fetch and Decode units would have to be completely rewritten, based on the current implementation
• The exception engine would have to support 2 pipes, which requires a complete redesign of the exception unit
• Not all dependencies can be seen at the fetch/decode stage: LSU results may be required
• A multi-port SPRS would have to be implemented
• Parallel LSU instruction execution in 2 pipes requires multi-port memories and a wider bus

• The gcc OR1000 compiler and assembler support empty slots for custom ISA extensions
• 7 non-parameterized commands:
  • l.cust1
  • l.cust2
  • l.cust3
  • l.cust4
  • l.cust6
  • l.cust7
  • l.cust8
• 1 highly parameterized command:
  • l.cust5 Rd, Ra, Rb, L immediate[5:0], K immediate[4:0]
  • Allows 2048 commands (6 + 5 = 11 immediate bits), each operating on 3 registers
• The compiler will not use the ISA extension when generating assembly from C code, but gcc allows assembly commands alongside C code
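The count of 2048 follows from the two immediates: 6 bits of L and 5 bits of K give 2^11 combinations. The sketch below packs an l.cust5 instruction word. The field positions and the major opcode value are assumptions based on the OpenRISC 1000 architecture manual and should be checked against it; `encode_lcust5` and `lcust5_variant_count` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: opcode[31:26] D[25:21] A[20:16] B[15:11] L[10:5] K[4:0] */
#define LCUST5_OPCODE 0x3Cu  /* assumption: l.cust5 major opcode */

/* Pack the register numbers and the two immediates into one 32-bit word. */
uint32_t encode_lcust5(unsigned rd, unsigned ra, unsigned rb,
                       unsigned l, unsigned k)
{
    assert(rd < 32 && ra < 32 && rb < 32);  /* 5-bit register fields */
    assert(l < 64);                         /* L immediate[5:0] */
    assert(k < 32);                         /* K immediate[4:0] */
    return (LCUST5_OPCODE << 26)
         | ((uint32_t)rd << 21)
         | ((uint32_t)ra << 16)
         | ((uint32_t)rb << 11)
         | ((uint32_t)l << 5)
         | (uint32_t)k;
}

/* Number of distinct (L, K) pairs: 2^6 * 2^5. */
unsigned lcust5_variant_count(void)
{
    return (1u << 6) * (1u << 5);
}
```

Under these assumptions, `encode_lcust5(3, 4, 5, 2, 0xe)` would describe l.cust5 r3, r4, r5 with L=2 and K=0xe.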

ISA Extension – final goal

4 non-parameterized commands
• l.cust1: set flag (unconditional)
• l.cust2: unset flag (unconditional)
• l.cust3: set carry (unconditional)
• l.cust4: unset carry (unconditional)
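A minimal behavioral model of these four commands, written against a copy of the supervision register (SR). The bit positions used for the flag and carry bits are assumptions taken from the OR1000 SR layout, and `exec_cust1_to_4` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define SR_F  (1u << 9)   /* assumption: flag bit position in SR */
#define SR_CY (1u << 10)  /* assumption: carry bit position in SR */

/* Return the new SR value after executing l.cust<n>, n = 1..4. */
uint32_t exec_cust1_to_4(unsigned n, uint32_t sr)
{
    assert(n >= 1 && n <= 4);
    switch (n) {
    case 1:  return sr | SR_F;    /* l.cust1: set flag */
    case 2:  return sr & ~SR_F;   /* l.cust2: unset flag */
    case 3:  return sr | SR_CY;   /* l.cust3: set carry */
    default: return sr & ~SR_CY;  /* l.cust4: unset carry */
    }
}
```

All other SR bits pass through unchanged, which matches the unconditional nature of the commands.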

l.cust Commands Implementation

l.cust5 parameterized command: the K immediate selects the command, the L immediate selects its options
• K=0x1: replace byte L of A with byte 0 of B; result in D
• K=0x2: set bit A[L]; result in D
• K=0x3: unset bit A[L]; result in D
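The three commands above can be modeled in a few lines of C. This is a sketch of the described behavior, not the RTL; it assumes byte 0 is the least significant byte, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* K=0x1: replace byte l of a with byte 0 of b (byte 0 = least significant, assumed). */
uint32_t cust5_replace_byte(uint32_t a, uint32_t b, unsigned l)
{
    assert(l < 4);
    unsigned shift = 8u * l;
    return (a & ~(0xFFu << shift)) | ((b & 0xFFu) << shift);
}

/* K=0x2: set bit l of a. */
uint32_t cust5_set_bit(uint32_t a, unsigned l)
{
    assert(l < 32);
    return a | (1u << l);
}

/* K=0x3: unset bit l of a. */
uint32_t cust5_unset_bit(uint32_t a, unsigned l)
{
    assert(l < 32);
    return a & ~(1u << l);
}
```

For example, `cust5_replace_byte(0x11223344, 0xAABBCCDD, 2)` yields `0x11DD3344` under the byte-numbering assumption above.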



• K=0x4: slice A (MSBs) and B (LSBs); result in D: D = {A[31:L], B[L-1:0]}
• K=0x5: slice B (MSBs) and A (LSBs); result in D: D = {B[31:L], A[L-1:0]}
• K=0x6: reverse the bit order of A: D = A[0:31]
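The two slice commands (K=0x4 and K=0x5) differ only in which operand supplies the upper bits. The sketch below reads the formula as "upper 32-L bits from one source, lower L bits from the other", which is an interpretation of the slide; `cust5_slice` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Take the upper (32 - l) bits from hi_src and the lower l bits from lo_src. */
uint32_t cust5_slice(uint32_t hi_src, uint32_t lo_src, unsigned l)
{
    assert(l <= 32);
    /* Shifting a 32-bit value by 32 is undefined in C, so handle l == 32 separately. */
    uint32_t low_mask = (l == 32) ? 0xFFFFFFFFu : ((1u << l) - 1u);
    return (hi_src & ~low_mask) | (lo_src & low_mask);
}
```

K=0x4 would correspond to `cust5_slice(a, b, l)` and K=0x5 to `cust5_slice(b, a, l)`.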



• K=0x7: reverse the bits of A within each half-word: D = {A[16:31], A[0:15]}
• K=0x8: reverse the bits of A within each byte: D = {A[24:31], A[16:23], A[8:15], A[0:7]}
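K=0x6, K=0x7 and K=0x8 all reverse bit order, over the whole word, each half-word, or each byte respectively. A single loop parameterized by group width can model all three. This is a behavioral sketch with hypothetical names; hardware would implement it as pure wiring.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the order of the lowest `width` bits of v. */
static uint32_t reverse_low_bits(uint32_t v, unsigned width)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < width; i++) {
        r = (r << 1) | ((v >> i) & 1u);
    }
    return r;
}

/* Reverse bits inside each group of `width` bits; groups keep their position.
   width = 32 models K=0x6, width = 16 models K=0x7, width = 8 models K=0x8. */
uint32_t cust5_reverse_in_groups(uint32_t a, unsigned width)
{
    assert(width == 8 || width == 16 || width == 32);
    uint32_t d = 0;
    for (unsigned base = 0; base < 32; base += width) {
        d |= reverse_low_bits(a >> base, width) << base;
    }
    return d;
}
```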

• K=0xa: check whether A is even. If true, D=1 and the flag is set; otherwise D=0
• K=0xb: check whether A is odd. If true, D=1 and the flag is set; otherwise D=0
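A sketch of the even/odd checks. The slide does not say what happens to the flag when the check fails; the model below assumes it is left unchanged. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t d;  /* value written to the destination register */
    bool flag;   /* flag value after the instruction */
} cust5_check_result;

/* K=0xa tests for even, K=0xb tests for odd. */
cust5_check_result cust5_check_parity(unsigned k, uint32_t a, bool flag_in)
{
    assert(k == 0xAu || k == 0xBu);
    bool is_odd = (a & 1u) != 0u;
    bool hit = (k == 0xAu) ? !is_odd : is_odd;
    /* Assumption: the flag is only ever set here, never cleared. */
    cust5_check_result r = { hit ? 1u : 0u, hit ? true : flag_in };
    return r;
}
```

Only the least significant bit of A is examined, so the check costs no more than a single gate plus the flag write.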



• K=0xe
  • L=2: swap the 2 MSB bytes of A with the 2 LSB bytes: D = {A[15:0], A[31:16]}
  • L=4: reverse the byte order of A: D = {A[7:0], A[15:8], A[23:16], A[31:24]}
  • L=8: reverse the half-byte (nibble) order of A: D = {A[3:0], A[7:4], A[11:8], A[15:12], A[19:16], A[23:20], A[27:24], A[31:28]}
• K=0xf
  • L=0: mirror the LSBs: D = {A[0:15], A[15:0]}
  • L=1: mirror the MSBs: D = {A[31:16], A[16:31]}
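For K=0xe, L can be read as the number of equal groups whose order is reversed (2 half-words, 4 bytes, 8 nibbles). For K=0xf, one half-word is kept and its bit-reversed copy fills the other half. The sketch below models both under that reading; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* K=0xe: split a into l equal groups and reverse the order of the groups. */
uint32_t cust5_reverse_group_order(uint32_t a, unsigned l)
{
    assert(l == 2 || l == 4 || l == 8);
    unsigned width = 32u / l;             /* 16, 8 or 4 bits per group */
    uint32_t mask = (1u << width) - 1u;
    uint32_t d = 0;
    for (unsigned i = 0; i < l; i++) {
        uint32_t group = (a >> (i * width)) & mask;
        d |= group << ((l - 1u - i) * width);
    }
    return d;
}

/* Reverse the bit order of a 16-bit value. */
static uint32_t reverse16(uint32_t h)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < 16; i++) {
        r = (r << 1) | ((h >> i) & 1u);
    }
    return r;
}

/* K=0xf: l=0 gives {A[0:15], A[15:0]}, l=1 gives {A[31:16], A[16:31]}. */
uint32_t cust5_mirror(uint32_t a, unsigned l)
{
    assert(l == 0 || l == 1);
    if (l == 0) {
        uint32_t lo = a & 0xFFFFu;
        return (reverse16(lo) << 16) | lo;
    }
    uint32_t hi = a >> 16;
    return (hi << 16) | reverse16(hi);
}
```

With L=4, `cust5_reverse_group_order` is the familiar endianness swap: `0x12345678` becomes `0x78563412`.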


ISA Extension – FPGA proven: Test C program
[Figure: test C program]

ISA Extension – FPGA proven: UART output
[Figure: UART output]

FPGA Utilization: Old RTL vs. New RTL
[Figure: utilization comparison]
~1% change

• The given implementation is not suitable for any significant u-Arch improvements
• Out-of-Order / Super-Scalar OR1200 implementations are possible but should be done from scratch
• Software written in assembly can easily be optimized for a specific application thanks to the l.cust instructions (2048 instructions with 5 operands)

Conclusions

Thank you!
