
Spring 2018, EC 513 Computer Architecture
Adaptive and Secure Computing Systems (ASCS) Laboratory
Department of Electrical and Computer Engineering, Boston University
Prof. Michel A. Kinsy
http://ascslab.org/courses/ec513/index.html

Class Project: BRISC-V© Extensions
Assigned March 28th, 2018

Milestones
a. Project Proposal
b. Mid-project (Phase I) Report
c. Project Presentations and Final Project Report

Due Dates
a. April 5th, 2018
b. April 19th, 2018
c. May 1st, 2018

I. Introduction

The RISC-V instruction set was recently proposed by a group of researchers in the EECS Department of the University of California, Berkeley. The main purposes of the RISC-V ISA are summarized below:

• To have a completely free-access ISA for both academic and industrial activities.
• To support different variations in processor design, including 1) processor widths of 32 and 64 bits, 2) single-, multi-, and many-core designs, and 3) FPGA and ASIC implementations.
• To have a core set of base integer instructions that can be extended with other categories of instructions, allowing architects to include only the needed features.
• To support both user and supervisor modes of operation for the processor.
• To support variable-length instructions.
• To support custom instructions for the specific tasks the processor is intended to run in specific fields.

The RISC-V instruction set has been used by researchers to test architectures relating to memory and cache sub-systems, and to power and performance improvement, among others. There are also several open-source CPU designs, including the 64-bit Berkeley Out-of-Order Machine (BOOM), the 64-bit Rocket, five 32-bit Sodor CPUs from Berkeley, picorv32 by Clifford Wolf, scr1 from Syntacore, and our own open-source RISC-V processor implementation, called BRISC-V©. The parameterized BRISC-V© implementation, developed at the Adaptive and Secure Computing Systems Laboratory (ASCS Lab) of Boston University, uses the RV32I version of the ISA.

II. Project Overview

In lecture, we have covered the key and time-tested concepts in computer architecture: pipelining and complex pipelining (Superscalar, Out-of-Order Execution, VLIW, Vector, Hardware Multi-threading, Branch Prediction, Speculative Execution, Caching, Memory Virtualization, Multi-core, etc.). All these concepts exploit one or more of these parallelism modalities: Instruction-Level Parallelism (ILP), Data-Level Parallelism (DLP), and Task-Level Parallelism (TLP), to improve cycles per instruction (CPI), a core processor performance metric. In this project, you will select one of these concepts, implement it, and optimize it to gain hands-on experience with architecture design trade-offs.
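Since CPI is the metric you will report at the end of the project, here is a minimal sketch of how it can be computed from two counters you might add to your testbench: total clock cycles and committed instructions. The helper names are hypothetical. The second helper models the usual pipelined approximation, an ideal CPI of 1 plus stall cycles per instruction, with pipeline fill ignored.

```c
#include <assert.h>
#include <stdint.h>

/* CPI = total clock cycles / committed (retired) instructions. */
double cpi(uint64_t cycles, uint64_t instructions) {
    assert(instructions > 0);
    return (double)cycles / (double)instructions;
}

/* Pipelined approximation: ideal CPI of 1 plus the average number of stall
 * cycles per instruction (hazard stalls, cache-miss stalls, flushes).
 * Pipeline fill cycles are ignored. */
double pipelined_cpi(uint64_t instructions, uint64_t stall_cycles) {
    assert(instructions > 0);
    return 1.0 + (double)stall_cycles / (double)instructions;
}
```

For example, 250 stall cycles over 1000 committed instructions gives 1 + 250/1000 = 1.25, the same value as measuring 1250 cycles for those 1000 instructions directly.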

III. Design Base

To start off your design, we are providing you with the BRISC-V© single-cycle processor base.

Figure 1: Canonical view of a single-cycle, non-pipelined BRISC-V© CPU.

BRISC-V© is part of the Heracles©¹ design platform, an open-source, functional, parameterized, synthesizable research and teaching tool for architectural exploration and hardware-software co-design. The provided BRISC-V© single-cycle processor base comprises the soft hardware (HDL) modules, their testbeds, and an application compiler. It is designed with a high degree of modularity to support fast exploration of different architectural features and memory system organizations. It is a component-based framework with parameterized interfaces and a strong emphasis on module reusability. The compiler toolchain is used to map C-based applications onto the processor.

IV. Project Outline

Phase I: Pipeline the base version of the BRISC-V© processor and add a multi-level direct-mapped cache to the design.

Task 1: Pipelining the base version of the BRISC-V© processor

• You can select the number of pipeline stages (ideally between 2 and 10). In your report, explain clearly why you selected a given number of stages.

¹ http://ascslab.org/research/heracles/index.html


• We observed in lecture that the main difficulty in pipelining is pipeline hazards, in particular data hazards, that is, data dependencies between instructions in the pipeline. We discussed two possibilities for dealing with them: stalling and data forwarding. Stalling means that we stop issuing instructions when we detect that the next instruction to be issued depends on the result of an in-flight instruction. The alternative to stalling is data forwarding (bypassing): rather than waiting for the producing instruction to complete and commit its data to the register file, we forward the data from the earlier instruction in the pipeline directly to later instructions. You can elect to implement stalling or bypassing. Again, make sure to highlight the design choice in your report.
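Both options can be modeled in software before committing to Verilog. The sketch below assumes a classic five-stage pipeline (IF, ID, EX, MEM, WB) and hypothetical pipeline-register fields (`stage_t`, `reg_write`, `mem_read`, `rd`); it shows the forwarding-unit decision, the load-use stall that remains even with bypassing, and the stall condition for a design without bypassing. The `rd != 0` checks reflect that x0 is hardwired to zero in RISC-V.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of one pipeline register in a classic five-stage
 * (IF, ID, EX, MEM, WB) pipeline; field names are illustrative only. */
typedef struct {
    bool    reg_write;  /* the instruction writes a destination register */
    bool    mem_read;   /* the instruction is a load */
    uint8_t rd;         /* destination register number, x0..x31 */
} stage_t;

typedef enum { FWD_NONE, FWD_FROM_EX_MEM, FWD_FROM_MEM_WB } fwd_t;

/* Bypass option: choose the source of one EX-stage operand rs. The younger
 * producer (EX/MEM) has priority over the older one (MEM/WB). x0 is never
 * forwarded because it is hardwired to zero. */
fwd_t forward_select(uint8_t rs, stage_t ex_mem, stage_t mem_wb) {
    assert(rs < 32);
    if (rs == 0) return FWD_NONE;
    if (ex_mem.reg_write && ex_mem.rd == rs) return FWD_FROM_EX_MEM;
    if (mem_wb.reg_write && mem_wb.rd == rs) return FWD_FROM_MEM_WB;
    return FWD_NONE;
}

/* Even with bypassing, a load in EX has no data yet, so the dependent
 * instruction in ID must stall for one cycle (load-use hazard). */
bool load_use_stall(stage_t id_ex, uint8_t rs1, uint8_t rs2) {
    return id_ex.mem_read && id_ex.rd != 0 &&
           (id_ex.rd == rs1 || id_ex.rd == rs2);
}

/* Stall option (no bypassing): hold the instruction in ID while any older
 * in-flight instruction will still write one of its source registers.
 * inflight[] holds ID/EX, EX/MEM and MEM/WB; MEM/WB can be left out if the
 * register file writes in the first half of the cycle and reads in the second. */
bool raw_stall(uint8_t rs1, uint8_t rs2, const stage_t *inflight, int n) {
    for (int i = 0; i < n; i++) {
        uint8_t rd = inflight[i].rd;
        if (inflight[i].reg_write && rd != 0 && (rd == rs1 || rd == rs2))
            return true;
    }
    return false;
}
```

The stall checks compare against both rs1 and rs2 even for instructions that read only one source register; that is conservative, occasionally costing a cycle but never correctness.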

Task 2: Multi-level direct-mapped cache design

• You can implement two or three levels of caches (L1, L2, and main memory, or L1, L2, L3, and main memory). You should have two L1 caches (one for instructions and one for data), but a single inclusive L2 or L3.

• Describe your caching policies: write-back or write-through, and write-allocate or no-write-allocate.

• You do not have to implement any cache coherence protocol.
• For this phase of the project, you are not required to have the caches working with the processor, but you should have a testbed implemented to test them.
• You can also use the Heracles© design base for inspiration on how you may want to implement your caches.

Phase II: Implement one complex architecture feature on top of the design from Phase I.

Task 1: The memory system with the caches should be integrated and tested with the processor.
Task 2: Implement the additional feature.
Task 3: Evaluate your processor design's performance (e.g., CPI, cache miss rates).

Project Timeline

1. Apr 5: Project proposals due.
2. Apr 19: Mid-project (Phase I) reports due.
3. May 1: Project presentations and final project reports due.

Deliverables

One group member uploads a .zip file titled [groupname]_project.zip (example: brisc_bros_project.zip) to the Blackboard assignment posting. For both the mid-project and final submissions, each group will submit (1) Verilog source code, (2) Verilog testbeds for each new or modified module, and (3) a 2-3 page report.
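For Phase I, Task 2, it can help to prototype the address split and the caching policies in software first, for example to obtain expected hit and miss counts for your cache testbed. The sketch below is a tag-only C model of one direct-mapped level with write-back and write-allocate policies; levels are chained through a `next` pointer, so an L1 can sit in front of an L2. All names (`cache_t`, `cache_new`, `cache_access`, `miss_rate`) are hypothetical. It stores no data and it does not enforce inclusion (an L2 eviction does not back-invalidate the L1), which a real inclusive L2 must do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Tag-only model of one direct-mapped cache level with write-back and
 * write-allocate policies. It counts hits and misses but stores no data. */
typedef struct cache {
    unsigned offset_bits;          /* log2(block size in bytes) */
    unsigned index_bits;           /* log2(number of lines) */
    uint32_t *tags;
    bool *valid, *dirty;
    struct cache *next;            /* next level, or NULL for main memory */
    unsigned long hits, misses, writebacks;
} cache_t;

cache_t *cache_new(unsigned offset_bits, unsigned index_bits, cache_t *next) {
    assert(offset_bits + index_bits < 32);
    size_t lines = (size_t)1 << index_bits;
    cache_t *c = calloc(1, sizeof *c);
    assert(c != NULL);
    c->offset_bits = offset_bits;
    c->index_bits = index_bits;
    c->tags = calloc(lines, sizeof *c->tags);
    c->valid = calloc(lines, sizeof *c->valid);
    c->dirty = calloc(lines, sizeof *c->dirty);
    assert(c->tags && c->valid && c->dirty);
    c->next = next;
    return c;
}

void cache_free(cache_t *c) {
    free(c->tags);
    free(c->valid);
    free(c->dirty);
    free(c);
}

/* Splits addr into tag | index | offset, then looks it up. */
void cache_access(cache_t *c, uint32_t addr, bool is_write) {
    unsigned shift = c->offset_bits + c->index_bits;
    uint32_t index = (addr >> c->offset_bits) & ((1u << c->index_bits) - 1u);
    uint32_t tag = addr >> shift;

    if (c->valid[index] && c->tags[index] == tag) {
        c->hits++;
    } else {
        c->misses++;
        if (c->valid[index] && c->dirty[index]) {
            /* Write-back: the dirty victim goes to the next level only now. */
            c->writebacks++;
            uint32_t victim = (c->tags[index] << shift) | (index << c->offset_bits);
            if (c->next) cache_access(c->next, victim, true);
        }
        /* Read and write misses both fetch the block (write-allocate). */
        if (c->next) cache_access(c->next, addr, false);
        c->valid[index] = true;
        c->tags[index] = tag;
        c->dirty[index] = false;
    }
    if (is_write) c->dirty[index] = true;  /* updated in this level only */
}

double miss_rate(const cache_t *c) {
    unsigned long total = c->hits + c->misses;
    return total ? (double)c->misses / (double)total : 0.0;
}
```

For example, with 16-byte blocks, `cache_new(4, 2, l2)` builds a 4-line L1 in which addresses 0x000 and 0x040 map to the same index and evict each other; the per-level counters then give the miss rates asked for in Phase II, Task 3. Switching to write-through or no-write-allocate only changes the two commented branches in `cache_access`.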

V. BRISC-V© Base

The provided BRISC-V© base is split into hardware and software directories. In the hardware folder there are two sub-folders named testbeds and src. The testbeds directory contains an individual testbed for each module of the BRISC-V© processor. The src sub-folder contains the Verilog files for each module of BRISC-V©.


In the software directory there are three folders called applications, compiler-scripts, and riscv-compiler. There is also a file named compile.sh, which uses the riscv-compiler to generate all the binary files in the binaries directory from the C programs in applications/src/. Editing compile.sh allows binaries to be generated for programs that are not provided. In the applications folder there are two sub-folders, binaries and src. The src folder contains 12 sample C programs, and the binaries folder contains the sample programs' .asm, .dump, .mem, .s, and .vmh versions. The compiler-scripts folder contains Perl scripts used to arrange the BRISC-V© program kernel. The riscv-compiler folder contains two folders named bin and libexec. libexec contains a number of sub-folders, which are empty, while bin contains the RISC-V GCC tools used to compile and generate binaries for the RISC-V rv32 ISA.

Figure 2: Illustrative schematic describing the organization of the hardware and software folders.

Program compilation flow

To assist in developing software for the BRISC-V© processor, it is accompanied by a GCC RISC-V cross-compiler. Figure 3 depicts the software flow for compiling a C program into compatible BRISC-V© instruction code that can be executed on the processor. The compilation process consists of a series of seven steps.

1. First, the user invokes riscv32-unknown-elf-gcc to translate the C code into assembly language (e.g., ./riscv32-unknown-elf-gcc -S fibonacci.c).

2. In step 2, the assembly code is run through the linker script to set up the stack pointer and return value registers (e.g., ./link.pl fibonacci.s). Its output is a .asm file.

3. In step 3, the user assembles the .asm file into an object file using the cross-compiler's assembler. This is accomplished by executing riscv32-unknown-elf-as on the .asm file (e.g., ./riscv32-unknown-elf-as fibonacci.asm -o fibonacci.o).

4. In step 4, all the jump addresses are properly linked with ./riscv32-unknown-elf-ld -N -Ttext 0x0004 --unresolved-symbols=ignore-all fibonacci.o -o fibonacci.


5. In step 5, the object file is disassembled using the riscv32-unknown-elf-objdump command (e.g., ./riscv32-unknown-elf-objdump fibonacci.o). Its output is a .dump file.

6. In step 6, the constructor script (dump2vmh) is called to transform the dump file into the Verilog memory .vmh file format (e.g., ./dump2vmh fibonacci.dump).

7. Finally, a second constructor script (dump2mem) is called to transform the dump file into another Verilog memory file format, .mem (e.g., ./dump2mem fibonacci.dump). Different Verilog simulation or FPGA synthesis tools use different formats, i.e., .vmh or .mem; both contain the same data. Programs/applications that have some initial values/data stored in memory will also have a data file generated for them (e.g., data_fibonacci.vmh/.mem).

Figure 3: Application compilation toolchain
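Both memory formats are plain-text images. If you want to hand-build small images for your own testbeds (for example, to preload main memory in a Phase I cache testbed), the sketch below writes 32-bit words in the generic layout read by Verilog's $readmemh: an "@address" line in hexadecimal, where the address counts memory-array entries rather than bytes, followed by hexadecimal values separated by whitespace. It is an assumption that this is what ModelSim's "Verilog hex" import expects for your memory array, and it is not necessarily the exact layout dump2vmh or dump2mem produce, so compare it against a generated file in software/applications/binaries/ first. The function name is hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Formats n 32-bit words as a $readmemh-style image: an "@address" line
 * (address in memory-array entries, hex) and then one 8-digit hex word per
 * line. Returns the number of characters written, or -1 if buf is too small. */
int format_readmemh(char *buf, size_t size, uint32_t start_index,
                    const uint32_t *words, size_t n) {
    assert(words != NULL || n == 0);
    int k = snprintf(buf, size, "@%08" PRIx32 "\n", start_index);
    if (k < 0 || (size_t)k >= size) return -1;
    size_t used = (size_t)k;
    for (size_t i = 0; i < n; i++) {
        k = snprintf(buf + used, size - used, "%08" PRIx32 "\n", words[i]);
        if (k < 0 || (size_t)k >= size - used) return -1;  /* truncated */
        used += (size_t)k;
    }
    return (int)used;
}
```

For example, the words {0x00000013, 0xdeadbeef} (0x00000013 encodes the RV32I nop, addi x0, x0, 0) starting at index 0 produce the three lines @00000000, 00000013, and deadbeef.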

For script-based compilation, running ./compile.sh takes the set of predefined C applications/programs in the applications/src folder and compiles all of them. If you would like to compile your own application (e.g., albert_s_beautiful_code.c) with your own stack pointer size (albert_s_stack, a decimal number), you can execute ./compile.sh albert_s_beautiful_code.c albert_s_stack (e.g., ./compile.sh foo.c 128).

Project Ideas

These are just suggested project ideas. Please feel free to be creative.

1. Branch Predictor


[Figure: pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) extended with a branch target buffer (BTB) that provides a predicted direction and target at fetch, and a mispredict detection unit that flushes the pipeline and allocates or updates BTB entries.]


2. Hardware Multi-threading (two hardware threads)

3. Memory Virtualization (TLB implementation)
4. Out-of-Order Execution
5. Dynamic Register Renaming Unit
6. Vector Extension
7. Superscalar (add additional execution units to the processor, e.g., a multiplication unit, with the correct logic modifications)
8. Very Long Instruction Word (VLIW): two- or three-instruction bundles
9. Special co-processor (e.g., a Neural Network Accelerator directly connected to the processor)

General Notes: Although you can develop your processor design on any machine, we suggest you use the PHO307 machines that we provided, and we strongly suggest that you test your design on them. Do not use the VM; use the native computer.
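For project idea 1 (Branch Predictor), a common starting point is a branch target buffer (BTB) consulted at fetch for a predicted direction and target, plus a mispredict detection step that flushes wrong-path instructions and updates the table. The sketch below is a behavioral C model under assumed parameters (64 direct-mapped entries indexed by PC bits [7:2], 2-bit saturating counters, allocation when a branch first resolves); all names are hypothetical. In Verilog, `bp_predict` would live in the fetch stage and `bp_update` in the stage where branches resolve.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BP_ENTRIES 64u  /* assumed size; must be a power of two */

/* Direct-mapped entry: BTB tag and target plus a 2-bit saturating counter
 * (0, 1 predict not-taken; 2, 3 predict taken). */
typedef struct {
    bool     valid;
    uint32_t tag;      /* full PC of the branch that owns the entry */
    uint32_t target;   /* most recent taken target */
    uint8_t  counter;
} bp_entry_t;

typedef struct { bp_entry_t e[BP_ENTRIES]; } predictor_t;

/* RV32I instructions are 4-byte aligned, so PC bits [1:0] are dropped. */
static unsigned bp_index(uint32_t pc) {
    return (pc >> 2) & (BP_ENTRIES - 1u);
}

/* Fetch: predicted next PC. Taken only on a BTB hit with counter >= 2;
 * otherwise fall through to PC + 4 (the not-taken target). */
uint32_t bp_predict(const predictor_t *p, uint32_t pc) {
    const bp_entry_t *e = &p->e[bp_index(pc)];
    if (e->valid && e->tag == pc && e->counter >= 2) return e->target;
    return pc + 4;
}

/* Resolve: compare the prediction with the actual outcome, then allocate or
 * update the entry. Returns true on a mispredict (flush and refetch). */
bool bp_update(predictor_t *p, uint32_t pc, bool taken, uint32_t target) {
    assert(!taken || (target & 3u) == 0);
    bool mispredict = bp_predict(p, pc) != (taken ? target : pc + 4);
    bp_entry_t *e = &p->e[bp_index(pc)];
    if (!e->valid || e->tag != pc) {  /* allocate, start weakly not-taken */
        e->valid = true;
        e->tag = pc;
        e->counter = 1;
    }
    if (taken) {
        if (e->counter < 3) e->counter++;
        e->target = target;
    } else if (e->counter > 0) {
        e->counter--;
    }
    return mispredict;
}
```

The 2-bit counter adds hysteresis: a loop branch that has been taken repeatedly mispredicts only on the loop exit, and its counter (now 2) still predicts taken the next time the loop runs.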

[Figures: a multithreaded pipeline in which a thread-select counter picks one of four per-thread PCs and register files; a pipeline with separate FAdd, FMul, and unpipelined FDiv execution units.]


Setting up

To obtain the materials for the project, download the file at: https://bit.ly/2GhJIww

a. Compilation

You can extract the project by running:

$ tar -xzf ec513-project.tar.gz

Now you have the project base containing both the BRISC-V© processor written in Verilog (located in ec513-project/hardware) and the cross-compiler (located in ec513-project/software) that takes your C code and provides you with raw memory files to run on the bare processor. Along with the compiler, you will find some simple applications (located in ec513-project/software/applications) and associated scripts. Once you have extracted your project, compile the applications for BRISC-V©:

$ cd ec513-project/software
$ ./compile.sh

If everything went correctly, you should see the message:

COMPILATION SUCCESSFUL!

In the software/applications/binaries/ directory you should be able to find the *.vmh and *.mem files that you will use in your simulations.

b. Simulation

Step 1: Start the ModelSim simulation environment. To run ModelSim, type the following in the terminal:

$ /ad/eng/opt/mentor/modelsim/modeltech/bin/vsim

A new window should appear, and it should display a popup. You can safely close the popup.


Step 2: Setting up the design library in ModelSim. In the main menu, click on the File menu item and then on the Change Directory sub-item. A file explorer will appear; navigate to your_project_location/ec513-project/simulation/.


Step 3: Compiling Verilog sources. In the main menu, click on Compile, and then on the Compile sub-item. A popup should open; navigate to your Verilog sources in ec513-project/hardware/src/. Select all of the Verilog sources, and click Compile.

You may think that the Compile button did not work, but a "Create Library" popup should have appeared, possibly behind the "Compile Source Files" popup. When prompted whether you want to create a library, click Yes. Click Compile, and then Done.


Step 4: Load the simulation. In the Library pane, you should see multiple libraries. Expand the first one, work, and you should see all your Verilog files. Right-click on the testbench tb_RISC_V_Core, and select the Simulate menu item.

After clicking on Simulate, the simulation window should open.

Step 5: Load the binaries. Before we can run our processor, we need to populate the instruction memory with our binaries and, for some applications, the data memory as well. Assuming you have run the compile script, you should find some *.mem files in the ec513-project/software/applications/binaries directory. In the new window, click on the View menu item, and select the "Memory List (w)" sub-menu item.


You will see three different memories: the fetch-stage instruction memory, the decode-stage register file, and the data memory. We want to populate the first one, IF/i_mem_interface/RAM/sram. If you right-click on the first item and select "View Contents", a pane will open on the right-hand side. Notice that it is currently undefined: it is filled with x's. Right-click on the new pane, and select "Import Data Patterns". A popup will appear. In the "Load Type" segment select "File only", and in the "File format" segment select "Verilog hex". Next, click on Browse, navigate to ec513-project/software/applications/binaries/, and from there select the right binary, for example gcd.mem. Click OK, and in the memory pane you should see the contents of your .mem file in the instruction memory.

Step 6: Run the simulation. In the Transcript pane, type "run 10000". This command will run the simulation for 10000 ns.
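Before reading the result, it helps to know what value to expect. A small golden model, compiled and run on the host, predicts a test program's output; the sketch below uses Euclid's algorithm for the gcd example. It is an independent check written for illustration, not the contents of applications/src/gcd.c, and the name gcd_ref is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side golden model: Euclid's algorithm. Used only to predict the
 * value the simulated program should produce. */
uint32_t gcd_ref(uint32_t a, uint32_t b) {
    assert(a != 0 || b != 0);  /* gcd(0, 0) is left undefined here */
    while (b != 0) {
        uint32_t r = a % b;    /* gcd(a, b) == gcd(b, a mod b) */
        a = b;
        b = r;
    }
    return a;
}
```

gcd_ref(64, 48) returns 16. The same approach works for the other sample programs: compute the expected result on the host and compare it with the register file or data memory contents after simulation.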

We see that register 9 contains the value 16, which is the greatest common divisor of 64 and 48.

End Note: Make sure to manage your time properly and distribute the workload fairly.
