[IEEE 2009 IEEE International Conference on Microelectronic Systems Education (MSE) - San Francisco, CA, USA (2009.07.25-2009.07.27)] 2009 IEEE International Conference on Microelectronic

¹ This work was supported in part by the Cyprus Research Promotion Foundation.

FPGA-Based NoC-Driven Sequence of Lab Assignments for Manycore Systems

Christos Ttofis, Christos Kyrkou, Theocharis Theocharides and Maria K. Michael

KIOS Research Center for Intelligent Systems and Networks
Department of Electrical and Computer Engineering, University of Cyprus
{ee05ct1, ckyrko01, ttheocharides, mmichael}@ucy.ac.cy

Abstract¹

Manycore systems are expected to emerge as the dominant trend in next-generation computer systems. These parallel systems are expected to be interconnected via packet-based Networks-on-Chip (NoC). As such, they present several educational challenges, because their design is affected by a large number of NoC-based design-space parameters, and they demand modifications to existing computer-related teaching curricula. In this work we present a practical FPGA-based teaching framework for emulating NoC-based manycore systems. Instructors can use it to teach students the fundamental design concepts of NoC-based manycore systems and their integration through a series of VHDL laboratory assignments.

1. Introduction

Homogeneous and heterogeneous manycore architectures, interconnected via packet-based Networks-on-Chip (NoC), have emerged as the dominant architectural trend in System-on-Chip (SoC) design [1]. The increasing number of on-chip cores gives rise to new challenges in next-generation SoC design, evidenced by the emergence of NoCs. These micro-networks provide packet-based communication and have been proposed to replace traditional buses as the on-chip communication infrastructure, offering reusability, scalability and predictability [2]. The era of NoC-based manycore systems presents significant challenges and demands a solid educational background in multiple areas; students need to understand all aspects of system design, ranging from the on-chip interconnection infrastructure and the interaction of the network with the cores to the core architecture and the overall system integration methodology. It is therefore imperative that educational tools aimed at introductory and advanced computer architecture, FPGA design and embedded systems courses feature design-space exploration of the emerging design parameters involved in the development and sustainability of next-generation manycore systems. Currently, the main topics in SoC design and NoCs are taught in a theoretical way and students have difficulty understanding them. In particular, we believe that academia significantly lacks educational tools for exploring NoC communication mechanisms.

This paper introduces a practical teaching framework consisting of a sequence of VHDL-based laboratory assignments that aim to help students learn, in practice, the principles of NoC-based manycore design. We utilize FPGA-based emulation as part of our sequence of lab assignments, particularly targeting exploration of the design parameters associated with the interconnection network supporting the manycore system. FPGAs are capable of speeding up the design cycle, and their low cost favors FPGA-based emulation over software-based simulation, which would require expensive PCs to be fast and effective. In developing this series of labs, we stress the need for teaching NoC design concepts and for giving students a better understanding of the design parameters affecting the overall operation and performance of NoC-based manycore systems.

The presented framework enables the teaching of manycore architectures and SoC design using NoC interconnects. The sequence of lab assignments consists of a NoC IP development flow, a processor core integration stage, and system integration via the I/O ports and the targeted applications. The hardware modules derived from the lab assignments can be configured and modified easily and quickly to provide a variety of multicore implementations for use in other educational labs. Students have the opportunity to explore and evaluate many aspects of manycore architectures through our evaluation and benchmarking platform, which consists of a few lines of C++ code for generating the data packets and a VGA controller for visualizing the system status during execution. Furthermore, we use two popular academic FPGA boards, the XUP Virtex-II Pro and the XUP Virtex-5 LX110T, both part of Xilinx’s University Program (XUP). The Virtex-II Pro allows for up to a 6x6 NoC, whereas the Virtex-5 LX110T allows for up to an 11x11 NoC, exposing the students to the programming and scalability challenges involved in the development of future manycore designs.

2. System Architecture

2.1 The interconnection network

The NoC architecture has been designed in a modular, plug-and-play manner that permits the emulation of a variety of NoC implementations and, subsequently, manycore architectures. Several goals were kept in mind during the design process in order to develop a flexible NoC that can be easily modified and configured. Each of these goals is discussed below.

978-1-4244-4406-9/09/$25.00 ©2009 IEEE

• Topology exploration: NoC topologies significantly impact the performance, routing algorithms and associated overheads, so we consider this one of the most important design parameters that students need to be able to explore. This is achieved by creating a set of router classes. Each router can have a different number of ports, depending on its position in the topology of the network. Three different types exist: central, corner and semi-corner. Central routers have five ports, semi-corner routers have four, and corner routers have three. This allows networks of various dimensions to be designed. In our labs we focus on the 2D-mesh topology; we can, however, easily change the topology through minor modifications. For example, if we change the routing tables per router and use only central routers, then we can develop a 2D-torus topology.

• Configuration: This is done using VHDL packages and identifying the targeted configurations. Currently we support configuration in terms of data width, FIFO depth and packet length. For example, we might want to include a CRC code in every header/flit that travels in the network to allow the receiver core to check for data integrity; hence, a variable flit length is needed. A deeper FIFO can reduce blocking and consequently latency, and a variable packet length can make the NoC flexible for different applications.

• Modularity of router design: A router consists of a routing decision unit, one or more FIFO channels per port, and arbitration logic for the FIFOs and the output ports. All components have well-defined interfaces, and changes in the code of one component do not affect the operation of the remaining components.

• Simple and common external interface: Each router in the NoC, depending on its type (central, corner or semi-corner), has a similar interface. This makes the integration of the NoC into manycore designs easy.

• Virtual Channels (VCs): Our implementation illustrates to the students the concept of VCs and their associated area/energy overhead costs and performance benefits. Using VCs in the NoC can reduce blocking, increase the number of levels of Quality of Service (QoS) and reduce latency for high-priority data. However, VCs incur a large area overhead, which constrains the NoC size when educational-sized FPGAs are used, because most of them provide only a small number of slices.

• Switching mechanisms: We route packets through the network using wormhole switching, which reduces the amount of buffer storage within the routers. This is particularly useful since we target educational FPGAs, where the area overhead must be as small as possible. However, we can easily use virtual cut-through switching through a simple modification of the FIFO and a larger channel depth. The signal that indicates a full FIFO must be asserted whenever the FIFO cannot guarantee that an entire packet will be accepted. Also, the FIFO depth parameter must be larger than the packet length.

• Packet Format: The format was chosen in order to facilitate scalability and flexibility. We defined destination and source numbers of ten bits, so the system can scale up to 1024 cores. The number of flits per packet can vary between 0 and 7. Also, the reserved field of the packet can be used for application-specific requirements such as packet type, timestamp and so forth. Figure 1 shows the packet format.

Figure 1: The format of the header and data fields.

• Routing Algorithm: Our implementation features support for XY and YX (dimension-ordered) routing algorithms. We plan on expanding to adaptive algorithms in the future.
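As a rough software illustration of the router classes and the XY routing described above, the following Python sketch classifies each router of a 2D mesh by its position and computes a dimension-ordered path. It is not the VHDL used in the labs; the coordinate convention, the port names (east, west, north, south, local) and all function names are assumptions made for this example.

```python
from collections import Counter

# Port counts per router class, as stated in the text (includes the local port).
PORTS = {"central": 5, "semi-corner": 4, "corner": 3}

# Assumed convention: x grows to the east, y grows to the north.
STEP = {"east": (1, 0), "west": (-1, 0), "north": (0, 1), "south": (0, -1)}


def router_type(x, y, cols, rows):
    """Classify a mesh router by how many mesh boundaries it touches."""
    on_edges = (x in (0, cols - 1)) + (y in (0, rows - 1))
    return ("central", "semi-corner", "corner")[on_edges]


def type_counts(cols, rows):
    """Count how many routers of each class a cols x rows mesh needs."""
    return Counter(router_type(x, y, cols, rows)
                   for x in range(cols) for y in range(rows))


def xy_next_port(cur, dst):
    """XY routing: correct the X offset first, then the Y offset."""
    (cx, cy), (dx, dy) = cur, dst
    if dx != cx:
        return "east" if dx > cx else "west"
    if dy != cy:
        return "north" if dy > cy else "south"
    return "local"  # packet has arrived; deliver to the attached core


def xy_route(src, dst):
    """Return the list of router coordinates visited from src to dst."""
    path, cur = [src], src
    while True:
        port = xy_next_port(cur, dst)
        if port == "local":
            return path
        cur = (cur[0] + STEP[port][0], cur[1] + STEP[port][1])
        path.append(cur)
```

For a 4x4 mesh, `type_counts(4, 4)` reports how many routers of each class are required, and `xy_route((0, 0), (2, 1))` lists the hops of one packet. YX routing would follow from swapping the two tests in `xy_next_port`.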

2.2 Processor Cores

For the purposes of this work, we use MIPS and MicroBlaze processors, since the former is very popular in Computer Architecture courses and the latter is readily available through Xilinx.

2.2.1 The MIPS32 core

MIPS32 is a 32-bit RISC processor core, widely used in the teaching of computer architecture courses [3]. Most students are familiar with its architecture and operation, and will be able to understand it and make modifications where necessary. We feature a basic five-stage pipelined, 32-bit load/store MIPS processor, which detects and resolves hazards and reduces stall cycles through data forwarding and cache prefetching. The processor copes with branches by always assuming that branches are not taken. The processor utilizes an L1 direct-mapped instruction cache and an L1 direct-mapped data cache.

The main goal is not to design the MIPS32 processor but to identify how the overall processor operation and design are affected when the processor is connected to the NoC, and also to describe the design process for the network interface (NI), the component that interfaces the processor with the NoC. Thus, the processor supports only a subset of the instructions in the original MIPS ISA [3]; students can easily add or remove instructions. Supported instructions are listed in Table 1.

We use a handshake protocol based on ready-valid signals to interface the processors to the NI, and the NI to the NoC. The basic interconnection signals, along with their functions, are given in Table 2. The NoC is used to transfer instructions and data from the memory to the processor and data from the processor to the memory. Our NI assembles the packets to be sent through the NoC to the RAM and vice versa. We define three different types of packets. Packets of type I contain only instruction addresses and are created each time there is an instruction miss. The NI assembles a packet containing the address of the missed instruction and the next three consecutive instruction addresses. This aims to prefetch instructions close to the requested one, exploiting spatial locality. Packets of type II contain only data addresses, whereas packets of type III contain both instruction and data addresses. Type III packets occur when there is an instruction miss at the Instruction Fetch stage and, in the same cycle, a data miss at the Memory Access stage.
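The packet-type selection performed by the NI can be sketched in software as follows. This is only an illustrative Python model, not the VHDL network interface: the function name, the dictionary layout, byte addressing with 4-byte instructions, and the choice of a single address per data miss are all assumptions that the text does not specify.

```python
WORD_BYTES = 4  # assumption: byte-addressed memory with 32-bit instructions


def build_request(instr_miss_addr=None, data_miss_addr=None, prefetch=3):
    """Model the NI's choice of request packet for one clock cycle.

    Type I   : instruction miss only (miss address + next `prefetch` addresses)
    Type II  : data miss only
    Type III : instruction miss and data miss in the same cycle
    Returns None when there is no miss, i.e. no packet is generated.
    """
    if instr_miss_addr is None and data_miss_addr is None:
        return None

    instr_addrs = []
    if instr_miss_addr is not None:
        # Missed instruction plus the next consecutive ones (spatial locality).
        instr_addrs = [instr_miss_addr + i * WORD_BYTES
                       for i in range(prefetch + 1)]

    # Assumption: a data miss contributes exactly one address.
    data_addrs = [] if data_miss_addr is None else [data_miss_addr]

    if instr_addrs and data_addrs:
        ptype = "III"
    elif instr_addrs:
        ptype = "I"
    else:
        ptype = "II"
    return {"type": ptype, "instr_addrs": instr_addrs, "data_addrs": data_addrs}
```

With the default `prefetch=3`, a type I request names four 32-bit instructions, which would fill the 128-bit RAM_In bus listed in Table 2; the paper does not state this link explicitly, so it should be read as an observation rather than a documented design rule.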

Table 1: Supported ISA.

| Instruction | Description |
|---|---|
| add | Three reg. operands; addition |
| add immediate | Add reg. with constant |
| subtract | Three reg. operands; subtraction |
| and | Three reg. operands; bit-by-bit AND |
| or | Three reg. operands; bit-by-bit OR |
| and immediate | Bit-by-bit AND reg. with constant |
| or immediate | Bit-by-bit OR reg. with constant |
| sll | Shift left by constant |
| srl | Shift right by constant |
| load word | Word from memory to register |
| store word | Word from register to memory |
| load upper immediate | Loads constant in upper 16 bits |
| beq | Equal test and branch |
| bne | Not equal test and branch |
| jump | Jump to target address |
| slt | Set less than |
| sltu | Set less than unsigned |
| slti | Set less than immediate |
| sltiu | Set less than immediate unsigned |
| JR | Jump to target register |
| JL | Jump and link |

2.2.2 The MicroBlaze Core

The MicroBlaze is another 32-bit RISC processor, used in FPGA designs targeting supported Xilinx Spartan® or Virtex® device families and licensed as part of the Xilinx EDK (Embedded Development Kit). The EDK tool provides a wide range of options when configuring the MicroBlaze architecture. Internal processor memory, a multiply/divide unit (MDU), an on-chip debug system and the mapping of peripheral devices to memory addresses are some of the choices provided [4]. We implemented a Wishbone-compatible NI, which targets a potential MicroBlaze implementation.

2.3 The Monitoring/Debugging I/O System

Students can obtain a better understanding of the NoC operation by experimenting with the FPGA board and by monitoring the network traffic during execution. Students must generate 16 dummy processing elements (PEs), each with its own memory created with Xilinx’s Core Generator. These memories store randomly generated packets created with C++ code. For simulation purposes, we set the data width of the memories to 36 bits and the depth to 256 flits. We used a constant number of flits per packet (7 flits/packet), so each memory contains 64 packets. The experimental network is a 4x4 mesh.
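A random packet generator of the kind mentioned above might look like the sketch below. The paper's generator is written in C++ and is not reproduced here; this Python version is only an illustration. The 36-bit flit width, the 10-bit source and destination fields and the 0 to 7 flit count come from the text, whereas the bit positions of the fields, the width of the reserved field, the use of a separate header flit and all function names are assumptions (Figure 1 is not available in this copy).

```python
import random

FLIT_BITS = 36   # memory data width given in the text
ID_BITS = 10     # source/destination field width given in Section 2.1
LEN_BITS = 3     # enough to encode 0..7 flits
# Assumption: the reserved field fills the remaining bits of the header flit.
RESERVED_BITS = FLIT_BITS - 2 * ID_BITS - LEN_BITS


def make_header(dest, src, n_flits, reserved=0):
    """Pack the header fields into one flit (assumed order, LSB first)."""
    fields = ((dest, ID_BITS), (src, ID_BITS),
              (n_flits, LEN_BITS), (reserved, RESERVED_BITS))
    word, shift = 0, 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word


def parse_header(word):
    """Inverse of make_header: return (dest, src, n_flits, reserved)."""
    id_mask = (1 << ID_BITS) - 1
    dest = word & id_mask
    src = (word >> ID_BITS) & id_mask
    n_flits = (word >> (2 * ID_BITS)) & ((1 << LEN_BITS) - 1)
    reserved = word >> (2 * ID_BITS + LEN_BITS)
    return dest, src, n_flits, reserved


def random_packet(src, n_pes, n_flits, rng):
    """One header flit followed by n_flits random data flits."""
    dest = rng.choice([pe for pe in range(n_pes) if pe != src])
    payload = [rng.getrandbits(FLIT_BITS) for _ in range(n_flits)]
    return [make_header(dest, src, n_flits)] + payload


def memory_image(src, n_packets, n_pes=16, n_flits=7, seed=0):
    """Flat list of flits to preload into the memory of PE `src`."""
    rng = random.Random(seed)  # fixed seed keeps the traffic reproducible
    flits = []
    for _ in range(n_packets):
        flits.extend(random_packet(src, n_pes, n_flits, rng))
    return flits
```

Each entry of the returned list is one 36-bit memory word; converting the list into the initialization format expected by the memory generator is left out of the sketch.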

Table 2: MIPS Interface Signals.

| Name | Type | Polarity/Bus size | Description |
|---|---|---|---|
| Reset | I | High | External (system) reset |
| Clock | I | High | External (system) clock |
| RAM_In | I | 128 | Instructions received from RAM |
| RAM_Dat | I | 128 | Data received from RAM |
| VALID_IN_I | I | High | Indicates the start of a valid instruction transfer cycle |
| VALID_IN_D | I | High | Indicates the start of a valid data transfer cycle |
| ADDR_I | O | 32 | Address of the instruction that caused the CPU to stall |
| ADDR_D | O | 32 | Address of the data that caused the CPU to stall |
| DOUT | O | 32 | Data to be written to RAM |
| WE | O | High | Write enable signal. Indicates a valid data transfer cycle to the RAM |
| RE_I | O | High | Read enable for instruction |
| RE_D | O | High | Read enable for data |

We implemented an FPGA-based VGA controller for displaying and monitoring the system. Students can use the switches on the FPGA board to select the PE that they want to monitor; according to their selection, information about the traffic arriving at that PE is printed to a VGA monitor attached to the FPGA board. The display shows the ID number of the selected PE, the arriving packets, the source PE and the number of packets received so far. Additionally, our I/O system can print the CPU registers to the screen.

We also developed a set of benchmark applications suitable for testing and evaluating the student implementations. The applications consist of four popular MIPS programs written in assembly. The four programs consist of a matrix multiplication algorithm, a producer-consumer problem, addition of operands fetched in sequential and random memory accesses, and summation of 100,000 numbers using parallel processing. The students are given the programs when they complete their manycore system to evaluate their implementation.

3. Laboratory Assignments

We created a series of five VHDL lab assignments targeted for a 14-week semester, each of which aims to teach the students the aforementioned concepts of NoC-based manycore system design. The assignments are addressed to senior or graduate-level students, with a comprehensive background in Digital Systems, Computer Organization and Design, Computer Architecture, Data Structures and Algorithms. The labs can be integrated as part of a new lab-based course or as part of senior undergraduate Computer Architecture or Embedded Systems courses.


The labs can use either of the two XUP boards mentioned previously; during each lab, students are expected to download and evaluate their implementation on the FPGA board. The students' design selections are also impacted by the constraints of the targeted FPGA. Next, we give a brief description of each of the five laboratory exercises.

• Lab 1 (3 weeks): Design of the NoC router. In the first assignment, students are given the basic design principles of NoC routers, and are asked to design the three router types. They are also asked to produce a cost-implementation analysis report which details the reasoning behind their selection of router architecture, FIFO depth, switching and routing algorithm and, most importantly, a VC cost analysis in terms of area overhead vs. performance.

• Lab 2 (2 weeks): Design of a 4x4 NoC, using the developed routers. Students are asked to interconnect their routers into a 16-core NoC, where they are expected to address communication and synchronization signals between the routers, as well as the chosen topology. As mentioned earlier, our platform supports both mesh and torus topologies, so the students are expected to experiment with both and explain their approach in their lab report. The features explored in the first two lab assignments are summarized in Table 3.

Table 3: NoC Parameter Exploration.

| Feature | Description |
|---|---|
| Topology | 2D-Mesh or 2D-Torus |
| Routing algorithm | XY and YX, or XY only (deterministic) |
| Switching | Wormhole or virtual cut-through |
| Flow control | Implemented using valid-ready communication protocol and handshaking signals |
| Buffer allocation | Ports and virtual channels |
| Allocation | Independent allocators. Round-robin arbiters. |
| Ordering | Packets travel in sender order |
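The round-robin arbitration listed in Table 3 can be modeled in a few lines. The following Python class is a behavioral sketch only, not the arbiter used in the labs: the class name, the rule that requester 0 has the highest priority after reset, and the one-grant-per-call interface are assumptions.

```python
class RoundRobinArbiter:
    """Behavioral model of a round-robin arbiter with n requesters."""

    def __init__(self, n):
        self.n = n
        # Assumed reset state: requester 0 has the highest priority first.
        self.last = n - 1

    def grant(self, requests):
        """Return the index granted this cycle, or None if nobody requests.

        The search starts just after the previously granted requester, so
        a requester that was just served gets the lowest priority next time.
        """
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                self.last = idx
                return idx
        return None
```

In a router, one such arbiter per output port could choose among the input FIFOs that currently request that port; because the priority rotates after every grant, no persistently requesting input is starved.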

• Lab 3 (2 weeks): Network Interface. Students are asked to design an NI which is expected to host a 32-bit RISC processor. Here, the signals and synchronization mechanisms for the communication protocols between the processor cores and the off-chip RAM are implemented. The NI acts as a bridge between the processor and the RAM and is responsible for synchronization and cache coherency protocols.

• Lab 4 (3 weeks): Integrating MIPS32 cores into the network. The students are given a basic MIPS32 core, as described in section 2.2.1, and are expected to integrate it with the NoC through the NI designed in Lab 3. Given the area constraints and their chosen network parameters, the students need to select the number of cores that fit on the FPGA and fill the remaining available slots with L2 caches. In this way, students are able to see the tradeoffs between the NoC and core area overheads and how the NoC design decisions impact the overall system.

• Lab 5 (4 weeks): Testing and Evaluation. Students are given the set of developed benchmarks and are asked to convert the assembly programs into MIPS32 machine language in order to evaluate their platform. They are also given a VGA core, which they are expected to integrate with their system in order to debug and monitor the system’s operation through the procedure outlined in section 2.3. This lab illustrates the capabilities and operation of the manycore system to the students.

4. Lab Evaluation

The presented laboratory curriculum has been developed very recently, and has only been used for one semester in a course with a small number of students. Hence, no conclusive evaluation results are available at this point. However, early feedback received from students indicates that they found the sequence challenging yet extremely helpful, as the hands-on approach helped them understand many issues surrounding NoC-based manycore systems. In particular, students mentioned that they now understand the programming challenges involved in efficiently utilizing such a large-scale system.

5. Conclusion

We developed a sequence of lab assignments which students can use to experiment with the various features of a NoC-based manycore system. The labs are structured in a way that facilitates learning of the fundamental concepts of the interconnection network, and of how such networks can support and impact the performance of manycore architectures. The assignments include evaluation and monitoring of system operation for a better understanding of NoC-based manycore architectures.

6. Acknowledgments

The authors would like to acknowledge Xilinx for their generous donation of the ISE Design Suite and EDK tools.

7. References

[1] S. Borkar, “Thousand Core Chips: A Technology Perspective”, in IEEE/ACM DAC, 2007, pp. 746-749.

[2] L. Benini and G. De Micheli, “Networks on Chips: A New SoC Paradigm”, IEEE Computer, vol. 35, pp. 70-78, January 2002.

[3] D. A. Patterson and J. L. Hennessy, “Computer Organization and Design: The Hardware/Software Interface”, 3rd ed., Morgan Kaufmann.

[4] MicroBlaze™, http://www.xilinx.com, Feb. 2009.
