OpenPiton with RISC-V Cores
A Hands-On Tutorial with the Open Source Manycore Processor

Princeton University and ETH Zürich
http://openpiton.org
http://pulp-platform.org


Princeton Parallel Research Group

• Computer Architecture after Moore’s Law
  – @MICRO 2019: ComputeDRAM: In-Memory Compute Using Off-the-Shelf DRAMs (Monday)
• Redesigning the Data Center of the Future
  – @MICRO 2019: Architectural Implications of Function-as-a-Service Computing (Wednesday)
• Biodegradable Computing (Materials)
• 10 PhD Students
• 1 Postdoc
• 3 Undergraduates

[Photo: Grand Canyon Trip 2019]

Support

This work was partially supported by the NSF under Grants No. CNS-1823222, CCF-1823032, CCF-1217553, CCF-1453112, and CCF-1438980, Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) under agreements No. FA8650-18-2-7846, FA8650-18-2-7852, and FA8650-18-2-7862, AFOSR under Grant No. FA9550-14-1-0148, and DARPA under Grants No. N66001-14-1-4040 and HR0011-13-2-0005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA), the NSF, AFOSR, or the U.S. Government.

The world’s first open source, general purpose, multithreaded manycore processor

• Open source manycore
• Written in Verilog RTL
• Scales to ½ billion cores
• Configurable core, uncore
• Includes synthesis and back-end flow
• Simulate in VCS, ModelSim, NCSim, Verilator, Icarus
• ASIC & FPGA verified
• ASIC power and energy fully characterized [HPCA 2018]
• Runs full stack multi-user Debian Linux
• Used for Architecture, Programming Language, Compilers, Operating Systems, Security, EDA research

[Figure: Tile, Chip, and Chipset]

OpenPiton+Ariane

• Collaboration between Princeton University and PULP team from ETH Zürich
• Goal is to develop a permissively licensed, Linux capable many-core research platform based on RISC-V
• Ariane
  – RV64GC Core
  – Linux capable
• OpenPiton
  – Research manycore system
  – OpenSPARC T1 based
  – Coherent NoC, distributed cache

Parallel Ultra Low Power (PULP)

• Project started in 2013 by Luca Benini
• A collaboration between University of Bologna and ETH Zürich
  – Large team. In total about 60 people, not all are working on PULP
• Key goal is: how to get the most BANG for the ENERGY consumed in a computing system
• We were able to start with a clean slate, with no need to remain compatible with legacy systems.

ARIANE: Linux capable 64-bit core

• Application class processor
• Linux Capable
  – Tightly integrated D$ and I$
  – M, S and U privilege modes
  – TLB, SV39
  – Hardware PTW
• Optimized for performance
  – Frequency: 1.5 GHz (22 FDX)
  – Area: ~ 175 kGE
  – Critical path: ~ 25 logic levels
• 6-stage pipeline
  – In-order issue
  – Out-of-order write-back
  – In-order commit
• Scoreboarding
• Designed for extendibility
• Branch prediction
  – Return Address Stack (RAS)
  – Branch Target Buffer (BTB)
  – Branch History Table (BHT)
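The SV39 entry above refers to the RISC-V 39-bit virtual memory scheme, in which a virtual address carries three 9-bit page-table indices and a 12-bit page offset. The sketch below shows how a hardware page table walker would slice such an address. The function name `split_sv39` is hypothetical and is not taken from the Ariane sources; the function only splits the address and does not model the walk, the TLB, or the sign-extension check on the upper bits.

```python
def split_sv39(va: int) -> dict:
    """Split a 39-bit Sv39 virtual address into its page-table indices.

    Illustrative helper only; the name is an assumption, not Ariane code.
    """
    if not 0 <= va < (1 << 39):
        raise ValueError("expected a 39-bit virtual address")
    return {
        "vpn2": (va >> 30) & 0x1FF,  # index into the root page table
        "vpn1": (va >> 21) & 0x1FF,  # index into the second-level table
        "vpn0": (va >> 12) & 0x1FF,  # index into the leaf table
        "offset": va & 0xFFF,        # byte offset within a 4 KiB page
    }


if __name__ == "__main__":
    example = (3 << 30) | (5 << 21) | (7 << 12) | 0x9
    print(split_sv39(example))
```

Each 9-bit index selects one of 512 entries in a page table, so a full walk touches at most three tables before reaching a 4 KiB page.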


OpenPiton System Overview

[Figure, built up over eight slides. Labels: Tile; Chip; P-Mesh Off-Chip Routers (3); Chip Bridge; P-Mesh Chipset Crossbars (3); Chipset; DRAM; Wishbone SDHC; AXI I/O]

Tile Overview

[Figure: tile block diagram. Labels: Modified OpenSPARC T1 Core; FPU; CCX Arbiter; L1.5 Cache; L2 Cache Slice + Directory Cache; MITTS (Traffic Shaper); P-Mesh Routers (3); To Other Tiles]

Silicon Proven Designs: Ariane

• Ariane has been taped out in Globalfoundries 22nm FDX in 2017 and 2018
• The system features 16 kByte of instruction and 32 kByte of data cache.
• Poseidon:
  – Area: 0.23 mm² – 175 kGE
  – 0.2 - 1.7 GHz (0.5 V – 1.15 V)
• Kosmodrom:
  – RV64GCXsmallFloat
  – Transprecision / Vector FPU
  – Ariane HP
    • 8T library, 0.8V, 1.3 GHz
    • 55 mW @ 1 GHz
  – Ariane LP
    • 7.5T ULP library, 0.5V, 250 MHz
    • 5 mW @ 200 MHz

[Figures: Poseidon layout and Kosmodrom layout. Labels: Issue; Quentin; Kerbin; Hyperdrive; Ariane; Ariane LP; Ariane HP; L2; NTX]

Silicon Proven Designs: Piton Chip

• 25-core
  – 2 Threads per core
  – 64-bit Architecture
  – Modified OpenSPARC T1 Core
• 3 NoCs (P-Mesh)
  – 64-bit, 2D Mesh
  – Extend off-chip enabling multichip systems
• Directory-Based Cache System
  – 64KB L2 Cache per core (Shared)
  – 8KB L1.5 Data Cache
  – 8KB L1 Data Cache
  – 16KB L1 Instruction Cache
• IBM 32nm SOI Process
  – 6mm x 6mm
  – 460 Million Transistors
• Target: 1GHz Clock @ 900mV
• 208 Pin CQFP Package

[Figure: chip floorplan with a grid of 25 tiles (Tile 0 to Tile 24), the PLL, and the Chip Bridge (CB)]
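To make the 25-tile 2D mesh concrete, here is a minimal Python sketch of how a tile number could map to mesh coordinates and how many hops a message would take between two tiles. Two things are assumptions for illustration, not statements from the slides: the tiles are numbered row by row in a 5 x 5 grid, and messages travel along X first and then along Y (dimension-ordered routing). The function names are hypothetical.

```python
MESH_WIDTH = 5        # assumption: 25 tiles laid out as a 5 x 5 grid
L2_SLICE_KB = 64      # from the slide: 64KB of shared L2 per core


def tile_to_xy(tile_id, width=MESH_WIDTH):
    """Map a row-major tile number to (x, y) mesh coordinates."""
    return tile_id % width, tile_id // width


def xy_route(src, dst, width=MESH_WIDTH):
    """List the tiles visited from src to dst, moving in X first, then Y."""
    x, y = tile_to_xy(src, width)
    dst_x, dst_y = tile_to_xy(dst, width)
    path = [src]
    while x != dst_x:                 # walk along the row
        x += 1 if dst_x > x else -1
        path.append(y * width + x)
    while y != dst_y:                 # then walk along the column
        y += 1 if dst_y > y else -1
        path.append(y * width + x)
    return path


if __name__ == "__main__":
    path = xy_route(0, 24)
    print("route:", path, "hops:", len(path) - 1)
    print("total shared L2:", MESH_WIDTH * MESH_WIDTH * L2_SLICE_KB, "KB")
```

Under these assumptions the corner-to-corner route from Tile 0 to Tile 24 takes 8 hops, and the 25 shared 64KB slices add up to 1600KB of L2.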

Piton Test Setup

[Photo: test board. Labels: Piton + Heat Sink; Chipset FPGA (Kintex 7); Bridge FPGA (Spartan 6); DRAM + I/O; Bulk Decoupling; Power Supply; Misc. Configuration]

[McKeown et al, HotChips 2016] [McKeown et al, IEEE MICRO 2017] [McKeown et al, HPCA 2018]

Putting it all together

[Figure: tile block diagram. Labels: Modified OpenSPARC T1 Core; FPU; CCX Arbiter; L1.5 Cache; L2 Cache Slice + Directory Cache; MITTS (Traffic Shaper); P-Mesh Routers (3); To Other Tiles]

• Native L1.5 interface is the ideal point to attach a new core
• Well defined interface similar to CCX from OpenSPARC
• Write-through cache protocol
• Coherency mechanism: only need to support invalidation messages
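The last two points go together: a write-through cache never holds the only copy of modified data, so staying coherent only requires dropping a line when told to. The toy Python model below illustrates that idea. It is a sketch under simplifying assumptions, not the L1.5 interface: the class and method names are hypothetical, a "line" is a single address, and a shared dictionary stands in for the L1.5 and everything behind it.

```python
class WriteThroughL1:
    """Toy write-through, no-write-allocate cache (illustrative only)."""

    def __init__(self, backing):
        self.backing = backing  # stands in for the L1.5 and the levels below
        self.lines = {}         # addr -> locally cached value

    def load(self, addr):
        if addr not in self.lines:            # miss: fetch from the next level
            self.lines[addr] = self.backing.get(addr, 0)
        return self.lines[addr]

    def store(self, addr, value):
        self.backing[addr] = value            # always write through
        if addr in self.lines:
            self.lines[addr] = value          # update the local copy on a hit

    def invalidate(self, addr):
        # No dirty data can exist here, so dropping the line is always safe.
        self.lines.pop(addr, None)


if __name__ == "__main__":
    shared = {}
    core_a, core_b = WriteThroughL1(shared), WriteThroughL1(shared)
    core_a.load(0x40)            # core A caches the line
    core_b.store(0x40, 7)        # core B writes through to the shared level
    core_a.invalidate(0x40)      # the invalidation message reaches core A
    print(core_a.load(0x40))     # core A refetches the new value
```

In this model the new core needs no write-back path and no ownership states; it only has to react to invalidations, which matches the short list of requirements above.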


FPGA Prototyping Platforms

Available:
• Digilent Genesys2
  – $999 ($600 academic)
  – 1-2 cores at 66MHz
• Xilinx VC707
  – $3500
  – 1-4 cores at 60MHz
• Digilent Nexys Video
  – $500 ($250 academic)
  – 1 core at 30MHz

• BittWare XUPP3R
  – $7000-8000
  – >100MHz (12 cores)
• Amazon AWS F1
  – Rent by the hour
  – 12 cores

OpenPiton Philosophy

• Focus/Value is in the Uncore
  – Not religious about ISA
  – Provide whole working system
• We are practical
  – Use Verilog (Ariane is SV)
  – Industry standard tools
  – Use the best tool for the job (including commercial CAD tools)
• Primarily for research, but welcome industry also
• Licensing
  – All our code, Hypervisor, are BSD-like
  – Linux, T1 core (GPL or LGPL)
  – Ariane (Solderpad)
• Scalability (Million Core)

OpenPiton Community

• Visit http://openpiton.org
• [email protected]
• Building a community
  – Welcome community contributions
  – Thousands of Downloads
• Google Group

Doing Research with OpenPiton + Ariane

• Software
  – Install on Debian, test scalability
• Operating System
  – Recompile kernel, rebuild SW, run
• Hardware/Software Co-design
  – Add new instructions, change compiler/HV/OS/SW
• Architecture
  – Change parameters, rebuild HW, run

[Figure: system stack. Labels: Apps; Compiler/Runtime; HV/OS; ISA; HW]

Enabled Research

• Coherence Domain Restriction
  – Fu et al. MICRO 2015
• Execution Drafting
  – McKeown et al. MICRO 2014
• Memory Inter-arrival Time Traffic Shaper
  – Zhou et al. ISCA 2016
• Oblivious RAM
  – Fletcher et al. ASPLOS 2015
• DVFS modelling
• Numerous outside papers
• Numerous class research projects

[Figure: pipeline diagram with Fetch, Thread Select, Decode, Execute, Memory, and Writeback stages. Labels: Program A Instruction; Program B Instruction; Lead Instructions; Successfully Drafted Instructions]

[Figure: frequency plotted against request inter-arrival time (t, 2t, 3t) for App 1, App 2, and App 3. Labels: Uniform Traffic; More Bursty Traffic; A Distribution of Traffic; Time]
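The inter-arrival-time distributions that the Memory Inter-arrival Time Traffic Shaper is named after can be illustrated with a short Python sketch. It does not model the shaper itself. It only buckets the gaps between consecutive memory requests, to show how uniform and bursty traffic differ. The function name, the timestamps, and the bin width are made up for illustration.

```python
from collections import Counter


def interarrival_histogram(timestamps, bin_width):
    """Count the gaps between consecutive requests, bucketed by bin_width."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return Counter(gap // bin_width for gap in gaps)


if __name__ == "__main__":
    uniform = [0, 20, 40, 60, 80]    # evenly spaced requests
    bursty = [0, 1, 2, 40, 41, 80]   # short bursts separated by long idle gaps
    print("uniform:", dict(interarrival_histogram(uniform, 10)))
    print("bursty: ", dict(interarrival_histogram(bursty, 10)))
```

With these inputs the uniform trace puts all four gaps in a single bin, while the bursty trace splits them between the shortest bin and a much longer one.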