Mr Fabio Maschi
Electronic, Energy and Systems
Internship Report, 4th year of Engineering School

Exploitation of Data-Flow Parallelism and Stream Processing Paradigms within Reconfigurable Processor Architectures

June–September 2017

Academic Supervisor: Mr Cédric Koeniguer
E-mail: [email protected]

Host Organisation: INESC-ID
Address: R. Alves Redol 9, 1000-029 Lisboa - Portugal
Internship Supervisor: Mr Nuno Roma
E-mail: [email protected]

Université Paris-Saclay
Maison de l’Ingénieur - bât 620 - Centre Scientifique d’Orsay - 91405 Orsay - France
tel: +33 (0)1 69 33 86 00 - fax: +33 (0)1 69 41 99 58 - www.u-psud.fr




Acknowledgements

First of all, I would like to express my sincere gratitude to Nuno Roma, who trusted me and offered me this internship within his research group. Working closely with such interesting and smart people was the key factor in my immense satisfaction throughout the whole period. Nuno also revised this report, providing excellent corrections and suggestions.

I am very fortunate to have received mentoring from Nuno Neves. The several occasions on which he generously took his time to explain concepts and technologies to me had a huge influence on the overall depth of this report. Giving me the full big picture of the context in which my internship was placed was not strictly necessary, but he fully accomplished the objective of being a mentor, and not merely a working colleague.

Once again, Cédric Koeniguer figures on my acknowledgements list. He recommended me for this internship position and, without him, I would not have been contacted by Nuno Roma.

Thanks also to the sisters Liliana and Elsa Sagres, who provided all the computational resources needed to accomplish the assigned tasks.

I would like to take this opportunity to likewise thank the people who contributed to my progress throughout this year: Martin Rünz and Lourdes Agapito, from the Computer Science department at University College London. Martin gave me one of the best pieces of feedback I have ever received, with scientific and constructive criticism. His insight helped me to raise the level of this report and of my self-criticism in terms of scientific approach. Lourdes Agapito gave me a taste for, and a new point of view of, academic research in computing.

I hope to meet such great people next year.

“If I were stupid enough to think that the difficulty of understanding the author was his problem, I would be happy. And if I were a genius, I would produce as well as him. I am in the middle, which I call a cruel degree of intelligence: the one who has the angst of contemplating brilliance, but who is not in the tranquillity of the swamp of stupidity.”

—Leandro Karnal


Contents

Introduction

1 Presentation of INESC-ID
  1.1 Organisation
  1.2 Embedded Electronic Systems action line
    1.2.1 SiPS Group
  1.3 Working organisation
    1.3.1 Hierarchy

2 Design of a risc-v Interpretable Processor Based on mb-lite
  2.1 Motivation
  2.2 Background
    2.2.1 Microarchitecture and All That Jazz
    2.2.2 Introduction to the classic RISC pipeline
    2.2.3 mb-lite Architecture
    2.2.4 risc-v Instruction Types
  2.3 Design and Implementation
    2.3.1 Adaptation of the Instruction Fetch Stage
    2.3.2 Modification of the Instruction Decode Stage
    2.3.3 Adaptation of the Execute Stage
  2.4 Analysis
    2.4.1 Critics to risc-v ISA encoding
    2.4.2 Branch Prediction
  2.5 The Big Picture
  2.6 Validation
  2.7 Results

3 Stream-Based Loop Execution Acceleration
  3.1 Motivation
  3.2 Background
    3.2.1 Count-Controlled Loops
    3.2.2 Architecture’s Behaviour
  3.3 Design and Implementation
    3.3.1 Extension from 5-stage to 6-stage Pipeline
    3.3.2 Instruction Buffer
    3.3.3 Dedicated Instructions
    3.3.4 Loop-Accelerator Engine
  3.4 The Big Picture
  3.5 Validation
  3.6 Results
  3.7 Limitations

4 Data-Flow Engine for Automatic Loop Parallelisation
  4.1 Motivation
  4.2 Design and Implementation
    4.2.1 Pipeline Replication
    4.2.2 Instruction-to-Lane Mapping
    4.2.3 Data-Forwarding Management
  4.3 Limitations
  4.4 The Big Picture
  4.5 Results

5 Analysis and Comparisons
  5.1 Benchmarks
  5.2 Final considerations

Conclusion

Bibliography

A Binary Value Comparisons
B risc-v Implemented Instructions
C Detailed Microarchitecture Designs
D Processors Performance Summary
E Acronyms


Introduction

In the processor architecture domain, efforts to increase performance have historically focused on raising the operating frequency of such devices. Due to physical limitations, however, the most recent technological improvements have shifted towards new processing paradigms. In particular, the demand for computational resources has found in reconfigurable architectures an opportunity to improve processor performance without penalising energy-efficiency [13].

Several research projects have been conducted at INESC-ID following the current trends in morphable processing architectures for energy-efficient computing, which has led to a wide range of reconfigurable system implementations [12]. As a result, it has been recognised that, despite their straightforward implementations, conventional processor architectures are not particularly well suited to deployment in morphable systems. This is mainly because the reconfiguration of a given core (or group of cores) in the system requires complex program and data management and can incur synchronisation issues.

As such, new and highly efficient processing architectures must be developed to overcome the above-mentioned issues and comply with current processing demands.

The research project described herein centred on the investigation of a new processing architecture, intended to serve as the basis for a future implementation in a morphable accelerator currently under investigation in the INESC-ID SiPS group. This architecture introduces two different paradigms to a RISC1 single-core processor: streamed processing and data-flow parallelism.

The internship was divided into three main parts, each one resulting in a functional processor architecture (and formalised in a dedicated chapter of this report), denoted atlas, andes and alpes, respectively.

The resulting architectures presented a real gain in performance, in terms of both execution time and energy consumption, indicating the promising perspectives of such an application. As expected, the introduced paradigms increased the complexity of data-coherence management, signalling some possible improvements to the technology.

From the 12th June to the 1st September 2017, I joined the INESC-ID research group within the framework of my specialisation in Electronics, Energy and Systems. Following one academic year abroad at University College London, this internship concludes the fourth year of my engineering degree at Polytech Paris-Sud.

Note I took the liberty of writing a slightly longer report in order to explain some important concepts. Readers with a solid knowledge of electronics and computer architecture may skip parts of the background sections.

1Reduced Instruction Set Computer



Chapter 1

Presentation of INESC-ID

The Institute of Systems and Computer Engineering, Research and Development (INESC-ID) is located in the Portuguese capital, Lisbon. The institute, created in 2000, is a non-profit organisation, privately owned by Instituto Superior Técnico, of the University of Lisbon, and by INESC, an association dedicated to education, science incubation, research activity and technological consulting.

INESC-ID is a research institute focused on advanced research and development in the areas of Electronics, Communications, Information Technologies, and Energy. It is officially recognised as being of public interest and, since 2004, has held the status of Associated Laboratory, granted by the Portuguese Ministry of Science.

The main objectives of INESC-ID are: to integrate competences from researchers in electrical engineering and computer science to advance the state of the art in computers, telecommunications, and information systems; to support the first stages of the value-generation chain: basic research, applied research, and advanced education; and, in cooperation with other institutions, to perform technology transfer, to support the creation of technology-based start-ups, and to provide technical support [8].

1.1 Organisation

There are currently more than one hundred and ten PhD holders and two hundred postgraduate researchers, divided into nineteen groups organised in five action lines:

• Computing System and Communication Networks

• Embedded Electronic Systems

• Information and Decision Support Systems

• Interactive Intelligent Systems

• Energy Systems




Since INESC-ID focuses its activity on the rapidly growing areas of information technology, communications and electronics, the number of researchers with higher degrees is expected to increase within the next few years. Many researchers are carrying out their postgraduate work at INESC-ID.

INESC-ID has participated in more than fifty research projects funded by the European Union and more than one hundred and ninety funded by national entities. To date, its researchers have published more than seven hundred papers in international journals and more than three thousand papers in international conferences, and have registered fifteen patents and/or brands.

1.2 Embedded Electronic Systems action line

During this internship, I was hosted in the Embedded Electronic Systems action line, which is divided into seven groups of researchers:

• Algorithms for Optimisation and Simulation

• Analogue and Mixed-Signal Circuits

• Electronic System Design and Automation

• Quality, Test and Co-design of Hardware and Software Systems

• Signal Processing Systems

• Software Algorithms and Tools for Constraint Solving

Embedded systems are crucial in the development of new devices and products, whether for emergent applications or for consumer electronics, IT, communications and media, energy, environment, transport, biomedicine and life science. The main objective is to research new algorithms, architectures, methodologies, tools and circuits for designing efficient embedded electronic systems. With the purpose of designing, verifying and testing embedded systems, the EES line performs research on electronic circuit design for high-scale integration, on algorithms and tools for implementing programmable and reconfigurable embedded systems, and on control and signal processing algorithms suitable for implementation in real-time, energy-efficient embedded systems. Starting with basic research enabling different and general applications, these researchers target advanced applied investigation and innovation using emergent technologies and applications, in partnership with worldwide institutions (academia, R&D, and industry).

1.2.1 SiPS Group

The Signal Processing Systems (SiPS) group is devoted to research, development and education on algorithms and architectures for signal processing systems. The research is oriented towards several application areas, including telecommunications and video and image processing systems.

The long-term objectives are divided into two main areas. In image and video processing, the main objective is to develop new architectures and parallel implementations of video encoding algorithms; this research is complemented by the development of real-time algorithm implementations. In signal processing for communications, the group performs research on digital synchronisers, spread-spectrum systems and adaptive systems.

1.3 Working organisation

During my internship at INESC-ID, I joined the SiPS group, where my internship supervisor, Mr Nuno Roma, manages a team of PhD and MSc students and supervises their theses. He is a Senior Researcher at INESC-ID and teaches Computer Architecture and High-Performance Computing at the University of Lisbon.




The task I was assigned to was one branch of a research project supervised by Mr Roma. The PhD student carrying it out was Mr Nuno Neves, with whom I collaborated closely from the first working day. In fact, I was placed in the same room as the SiPS PhD students, which fostered very good communication between Mr Neves and me.

Once a week, the three of us met with Mr Pedro Tomás, the SiPS Group Manager, for a project follow-up: presenting the results from the previous week, discussing the ongoing tasks and re-assessing future research fronts.

1.3.1 Hierarchy

As illustrated in Figure 1.1, each research group is managed by two entities: the Group Manager and the Group Scientific Coordinator. Both positions are assigned yearly to two of the Senior Researchers of the group. The former is responsible for the budget of the research projects, while the latter is responsible for the strategic planning and the organisation of the research units, and evaluates the research projects, plans and reports. The Group Scientific Coordinator manages Senior Researchers, and each Senior Researcher manages Post-Doc, PhD and MSc students.

Figure 1.1: The hierarchical organisation of SiPS group.



Chapter 2

Design of a risc-v Interpretable Processor Based on mb-lite

The study envisaged for this internship required the conception of a General-Purpose Processor (GPP) as its starting point. Two technologies were combined to maximise the design quality and performance. The result of the first task, reported in this chapter, is a robust processor architecture, herein denoted atlas, able to perform forty-seven different instructions and compatible with many existing applications. The designed processor was then used as the base architecture for the subsequent tasks, described in Chapters 3 and 4.

2.1 Motivation

In order to study and implement the desired new processor features, it was first necessary to conceive a General-Purpose Processor (GPP) that is performant, robust, portable and potentially acceptable to industry. To this end, a hardware description architecture and an Instruction Set Architecture (ISA)1 had to be chosen.

With respect to the architecture, mb-lite offers a robust and performant configurable processor suitable for FPGAs [10]. Moreover, its description in VHDL2 follows many good-practice guidelines, which makes the implementation very easy to manipulate and to use for research. mb-lite is a re-implementation of the well-known commercial MicroBlaze ISA from Xilinx and, contrary to the proprietary ISA, this light-weight implementation is open source. However, using a mixed proprietary and open-source solution might raise some inconveniences, so an open-source ISA is required in order to have a fully open-source solution.

There are many ISAs available in industry, and many reasons not to create a new one: a new ISA would have to present great advantages over the existing ones in order to be adopted by industry, and such a mission would certainly be time- and effort-consuming and is not a priority. Furthermore, picking the right ISA allows us to take advantage of the large and widely supported software ecosystem it offers, including development tools, ported applications, compilers and simulation tools.

In this context, risc-v [15] is currently gaining strong traction in industry, and the adoption of this modern ISA is increasing in both the academic and commercial communities. Its support for the LLVM compiler allows easy manipulation of the ISA for further research envisaged at INESC-ID. Therefore, the first task of this project was to replace the proprietary MicroBlaze ISA with the open-source risc-v ISA within mb-lite.

1As explained in more detail in Section 2.2, an ISA is essentially the protocol of the instructions to be interpreted and performed by a processor.

2Hardware description language.




2.2 Background

Before starting this task, it is convenient to state some definitions and discuss some paradigms in computer architecture. The following concepts form the basis of any digital design project and therefore had to be mastered in order to achieve the goals of the internship. Readers with a solid knowledge of electronics and digital systems may skip to Section 2.2.3.

2.2.1 Microarchitecture and All That Jazz

The goal of this section is not to lecture a 4th-year Computer Engineering module, but simply to summarise some key points that may be useful for a full understanding of this report.

A processor is an electronic component responsible for executing simple but numerous instructions every second. These instructions are defined by the software and are stored in memory. For instance, one sequence of instructions could be the equivalent of: read position X of the memory, then add 40 to this value, and finally store the result in position Y of the memory. The processor hence fetches one instruction, performs it, saves the result, fetches the next instruction, performs it, and so on. A series of simple instructions may produce complex results; this is the key point to remember.
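This fetch-and-execute cycle can be sketched as a toy interpreter for the three-instruction example above; the mnemonics and the single-accumulator model are invented purely for illustration and do not correspond to any real ISA discussed in this report.

```python
# A toy fetch-decode-execute loop for the example in the text:
# read memory[X], add 40, store into memory[Y].

def run(program, memory):
    acc = 0                        # one temporary register
    for op, arg in program:        # fetch the next instruction...
        if op == "LOAD":           # ...then decode and execute it
            acc = memory[arg]
        elif op == "ADDI":
            acc = acc + arg
        elif op == "STORE":
            memory[arg] = acc
    return memory

mem = {"X": 2, "Y": 0}
run([("LOAD", "X"), ("ADDI", 40), ("STORE", "Y")], mem)
print(mem["Y"])  # 42
```

Three trivial instructions already compute a useful result; a real program is simply a much longer sequence of such steps.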

A computer provides several levels of memory, each presenting different capacity and access time. From the processor’s point of view, it communicates basically with three storage facilities: i) the very limited but fast local registers, where temporary data may be stored between instructions; ii) the instruction channel, through which the processor sends the address of the next instruction and receives the desired command; and iii) the data memory, where all the dynamic data is stored. While in the Harvard architecture the instruction and dynamic memories are independent, in the Von Neumann architecture they are the same physical component, and the distinction is made by internal regions within the memory.

[Figure 2.1 shows the abstract organisation of a microprocessor: a CPU with its registers, connected through address and instruction buses to the Instruction Memory, and through address and bidirectional data buses to the Data Memory, with the ISA as the communication protocol.]

Figure 2.1: Abstract organisation of a microprocessor. The ISA is not a physical component, but just the communication protocol. While the Data Memory may be read and written (illustrated by the bidirectional data bus), the Instruction Memory is only read by the processor.

All these circuits are driven by the clock, a square periodic signal which synchronises and commands the time response of the system. Every signal must settle to its final value before the rising edge of the clock signal. Flip-flops are components which store the value of a signal through a clock cycle: their output during a clock cycle corresponds to their input during the former clock cycle.

However, each electronic component has its own propagation delay, tdelay, to settle its output to the final value from the moment the input settles. Moreover, flip-flops have a clock-to-output propagation delay, tc2q, and a set-up time, tsetup. Figure 2.2 represents the timing diagram of a combinational logic circuit (CL) placed between two flip-flops (F1, F2). The signal sent by F1 traverses CL and is stored in F2, all commanded by the clock period Tc. Signal X is the output of F1 and signal Y is the result of CL, as well as the input of F2 [7].




[Figure 2.2 comprises two panels: (a) Circuit illustration, with flip-flop F1 driving combinational logic CL, whose output Y feeds flip-flop F2, both clocked by CLK; and (b) Timing diagram, showing the clock period Tc and the delays tc2q, tdelay and tsetup along signals X and Y.]

Figure 2.2: Delay analysis for the set-up time constraint. The gray area represents the transitory stage where the signal has started to change, but has not yet settled to its final value.

To satisfy the set-up time of flip-flop F2, signal Y must settle no later than the set-up time before the next clock edge. If the propagation delay tdelay is too long or the clock period is not long enough, Y may not have settled to its final value by the time F2 samples it. In this scenario, F2 may sample an incorrect or invalid value.

The clock period must then be equal to or greater than the time necessary for a signal to go through the critical path (the path whose delay is the longest) between two flip-flops. Cascaded logic components, where one input depends on the output of other components, lengthen the critical path and hence the minimum clock period. The more complex the combinational circuit, the longer the minimal time required to stabilise the final output. The clock period must then be longer and the clock frequency lower, since f = 1/T.

Performance Analysis A relevant metric to compare the performance of different architectures is how long the execution of a program takes. The execution time of a program is given by:

Te = (#instructions) × (clock cycles / instruction) × (seconds / clock cycle)    (2.1)

The number of Cycles Per Instruction (CPI), or latency, is the number of clock cycles required to execute an average instruction. It is the reciprocal of the throughput (instructions per cycle, or IPC). The number of seconds per cycle is the clock period, Tc [7].
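Equation (2.1) can be exercised numerically. The instruction count and clock figures below are hypothetical, chosen only to show that a lower CPI can outweigh a slightly longer clock period.

```python
def execution_time(n_instructions, cpi, clock_period_s):
    """Te = (#instructions) x (cycles/instruction) x (seconds/cycle), Eq. (2.1)."""
    return n_instructions * cpi * clock_period_s

# Hypothetical workload of one million instructions.
te_a = execution_time(1_000_000, cpi=1.5, clock_period_s=10e-9)  # 100 MHz clock
te_b = execution_time(1_000_000, cpi=1.1, clock_period_s=12e-9)  # ~83 MHz clock
print(f"{te_a * 1e3:.2f} ms vs {te_b * 1e3:.2f} ms")
```

Here the processor with the slower clock still finishes first, because its average instruction takes fewer cycles; this is exactly the trade-off that pipelining (next section) exploits in the opposite direction.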

2.2.2 Introduction to the classic RISC pipeline

Pipelining is a technique to improve both the processor’s performance and its hardware utilisation. It draws on two facts: i) reducing the complexity of the components between two registers reduces the critical path and hence permits a higher clock frequency (Time Constraint); and ii) the execution of an instruction may be divided into small tasks (Processing Stages).

Time Constraint This is the main limiting factor on the operating frequency of the clock. As discussed above, the bottleneck is determined by the minimum period which allows the propagation of any signal from one flip-flop to the next, i.e. the so-called critical path.

One may tackle the critical path by redesigning the circuit to have a shorter propagation delay, placing flip-flops between smaller blocks of combinational logic. The immediate consequence is that one additional clock cycle is required for the signal to settle at the output of the circuit. On the other hand, the critical path becomes shorter, allowing a smaller clock period. The final result is a circuit which takes more clock cycles, but runs at a higher frequency.




[Figure 2.3 shows four combinational logic blocks between two flip-flops clocked by CLK: CL1 (tdelay = 2.4 ns) in parallel with CL2 (tdelay = 3.0 ns), followed in series by CL3 (tdelay = 2.0 ns) and CL4 (tdelay = 4.0 ns), with Tc = 9.5 ns.]

Figure 2.3: Illustration of some combinational logic circuits (CLs) placed between two flip-flops. The clock period (Tc) must respect the minimal time necessary for signals to traverse the entire circuit. With tc2q and tsetup equal to 0.25 ns each, Tc = 0.25 + max(2.4, 3.0) + 2.0 + 4.0 + 0.25 = 9.5 ns.

[Figure 2.4 shows the same circuit with an additional flip-flop inserted between CL3 and CL4, splitting it into Stage1 (5.5 ns) and Stage2 (4.5 ns).]

Figure 2.4: Starting from the previous illustration, a flip-flop is added between CL3 and CL4. The resulting minimal time necessary for a signal to traverse the circuit decreases to max([0.25 + max(2.4, 3.0) + 2.0 + 0.25], [0.25 + 4.0 + 0.25]) = 5.5 ns.

Figures 2.3 and 2.4 illustrate the effects of adding a flip-flop within a combinational logic circuit. The critical path in this example was 9.5 ns; it then becomes the maximum between the two stage paths, max(5.5, 4.5), which is 5.5 ns. The signal in Figure 2.3 takes one clock cycle, at a maximum frequency of 105 MHz, to reach the output. To perform the same function in Figure 2.4, it takes two clock cycles, but the shorter critical path permits a new maximum frequency of 181 MHz.
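The arithmetic behind Figures 2.3 and 2.4 can be reproduced in a few lines: each register-to-register stage costs tc2q plus its combinational delay plus tsetup, and the slowest stage sets the clock period.

```python
# Reproducing the numbers of Figures 2.3 and 2.4 (all delays in ns).
T_C2Q, T_SETUP = 0.25, 0.25

def min_period(stage_delays):
    """The clock period is set by the slowest register-to-register stage:
    t_c2q + combinational delay + t_setup (the critical path)."""
    return max(T_C2Q + d + T_SETUP for d in stage_delays)

# Figure 2.3: max(CL1, CL2), then CL3 and CL4, all in a single stage.
single = [max(2.4, 3.0) + 2.0 + 4.0]
# Figure 2.4: an extra flip-flop splits the path between CL3 and CL4.
split = [max(2.4, 3.0) + 2.0, 4.0]

for name, stages in (("one stage", single), ("two stages", split)):
    tc = min_period(stages)
    print(f"{name}: Tc = {tc} ns, f_max = {1e3 / tc:.1f} MHz")
```

Running this yields Tc = 9.5 ns (about 105 MHz) for the single-stage circuit and Tc = 5.5 ns (about 182 MHz) for the pipelined one, matching the figure captions.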

Processing Stages The second fact pipelining draws on corresponds to dividing the execution of an instruction into small tasks. Each task corresponds to a pipeline stage and has dedicated hardware. Stages communicate through registers. The standard RISC pipeline stages to completely execute one instruction are:

• Instruction Fetch (IF), responsible for handling the program counter (PC), the pointer to the memory address of the instruction to be executed, and for fetching the instruction from the instruction memory;

• Instruction Decode (ID) stage, where all the control signals to the upcoming stages are set up based on the instruction inspection and decoding. If the execution depends on values stored in registers, these are prepared so as to deliver the data to the next stage;

• Execute (EXE) stage, where the Arithmetic Logic Unit and/or the branch and comparison units perform the operation based on the control signals;

• Memory Access (MEM) stage, responsible for reading and/or writing data from/into the data memory;

• Write Back (WB) stage, where the result of the instruction execution is eventually written back to the register bank.

Since each instruction is processed at one stage at a time, the other stages are available to execute other instructions. Hence, each stage may concomitantly execute a different (but consecutive) instruction compared to the upcoming and previous stages. For example, at a time t, an instruction A might be at the Execute stage while an instruction B is at the Instruction Decode stage and an instruction C is at the Instruction Fetch stage. At time t + Tc, all instructions move to their next stage.

This conception creates parallelism across the different pipeline stages. Up to five instructions are in flight at the same time and, at each clock cycle, one instruction is "delivered". The latency of each instruction is hence five cycles, while the throughput is equal to one instruction per cycle under perfect conditions. Microprocessors perform millions of instructions per second, so throughput is more relevant than latency.
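The overlap described above can be visualised with a small simulation. The ideal one-stage-per-cycle advance, with no stalls or flushes, is an assumption of this sketch: instruction i0 occupies WB at cycle 4 (five-cycle latency) and one instruction retires every cycle thereafter.

```python
# Idealised five-stage pipeline occupancy: instruction i enters IF at
# cycle i and advances one stage per cycle (no stalls or flushes).
STAGES = ["IF", "ID", "EXE", "MEM", "WB"]

def occupancy(n_instr, n_cycles):
    """Return, for each cycle, which instruction sits in each stage."""
    table = []
    for t in range(n_cycles):
        row = []
        for s in range(len(STAGES)):
            i = t - s          # index of the instruction in stage s at cycle t
            row.append(f"i{i}" if 0 <= i < n_instr else "--")
        table.append(row)
    return table

for t, row in enumerate(occupancy(4, 8)):
    print(f"cycle {t}: " + "  ".join(f"{s}={x}" for s, x in zip(STAGES, row)))
```

The printed table shows the pipeline filling up over the first four cycles and draining over the last four, with all five stages busy in between.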

Data coherence The execution of consecutive instructions in parallel raises some issues related to data coherence. One instruction might depend on the result of the previous one. Within the processor pipeline, the result of an instruction is only stored at the final stage (WB); hence it is necessary to conceive a data-forwarding mechanism between pipeline stages, so that the instruction at the Execute stage is able to use the results of the instructions at MEM and WB.
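A minimal sketch of such a forwarding choice follows, assuming each in-flight instruction exposes its destination register and computed value; the register names and the tuple representation are hypothetical, not taken from the mb-lite description.

```python
def forward_operand(src_reg, regfile, mem_stage, wb_stage):
    """Select the freshest value of src_reg for the EXE stage.
    mem_stage and wb_stage are (dest_reg, value) pairs describing the
    in-flight instructions, or None; the younger result (MEM) has priority."""
    if mem_stage is not None and mem_stage[0] == src_reg:
        return mem_stage[1]
    if wb_stage is not None and wb_stage[0] == src_reg:
        return wb_stage[1]
    return regfile[src_reg]  # no dependency: read the register bank

regs = {"r1": 0, "r2": 7}
# r1 is being recomputed by the instruction currently in MEM, so the
# stale value 0 in the register bank must be bypassed.
print(forward_operand("r1", regs, mem_stage=("r1", 99), wb_stage=("r1", 5)))  # 99
print(forward_operand("r2", regs, mem_stage=("r1", 99), wb_stage=None))       # 7
```

In hardware this decision is a pair of comparators driving a multiplexer at each ALU input, evaluated every cycle.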

Another challenge for data coherence occurs when dealing with Jump instructions. These instructions change the address of the next instruction, which means that the next instruction might not be the consecutive one in the program memory, as in the regular flow. In many RISC architectures, the target address and the decision to "jump" or not are only known when the jump instruction leaves the EXE stage. At this point, if the decision is to take the jump, the instructions at the IF, ID and EXE stages must be flushed, because they are not supposed to be executed.

More details of how to tackle these issues are discussed later in this report.

2.2.3 mb-lite Architecture

The implementation of the mb-lite architecture by Tamar Kranenburg uses the five-stage pipeline principles described above. It employs a Harvard architecture, where data and instruction memories are separate, reachable through two independent buses (channels). The dynamic memory might not be synchronised with the processor; therefore, it may take more than one cycle to retrieve data during a read operation. The hardware description is very flexible in terms of configuration, which permits an easy use of 32-bit, 64-bit or any 2^n architecture. In the scope of this internship, a full 32-bit architecture (data and address) was used. The number of general-purpose internal registers is also configurable; here, thirty-two were used. Interruptions are supported and trigger a programmable routine. The full architecture illustration may be found in Figure C.1 on page 36.

In this implementation, each pipeline stage corresponds to one VHDL entity. All the entities are connected by the core, which is also responsible for the connection to external sources (memories, clock and reset signals). The description exploits the two-process methodology, where combinational and sequential logic are clearly defined and separated, in order to produce a highly readable description.


Type A:  opcode (6 bits) | Rd (5) | Ra (5) | Rb (5) | function (11)
Type B:  opcode (6 bits) | Rd (5) | Ra (5) | Immediate (16)

(fields listed from bit 31 down to bit 0)

Figure 2.5: The two types of instructions of mb-lite's ISA, spanning the thirty-two bits of the encoded instruction. The immediate value is either sign-extended or combined with the 16-bit immediate from the previous instruction.

Since the first goal to achieve was to replace the original mb-lite ISA with the risc-v ISA, the main particularity tackled during this mission was the encoding and decoding of instructions. mb-lite presents two types of instructions, Type A and Type B, both 32 bits long, comprising two operand registers and one destination register — Ra, Rb and Rd — and possibly an immediate value [16] — see Figure 2.5.

2.2.4 risc-v Instruction Types

Differently from MicroBlaze, risc-v has six instruction types, called R-, I-, S-, B-, U- and J-Type. They are all fixed 32-bit strings and may contain up to six out of seven different fields. The opcode is a seven-bit field and the main piece of information responsible for distinguishing the instructions; it is thus the only field fixed across all instruction types. The registers Ra (first operand), Rb (second operand) and Rd (destination) are coded on five bits — this ISA supports up to thirty-two general-purpose internal registers — their presence depending on the instruction type. There are two supplemental fields to define the performed function, called funct3 and funct7, coded on three and seven bits respectively. The last field is the immediate integer value, which may vary in length, position and composition according to the type of the instruction, as presented in Section 2.3.2.

Raw instruction:  A | B | C | D | E | F | G | opcode   (bits 31 down to 0)

R-Type:         funct7 | Rb | Ra | funct3 | Rd | opcode
I-Type:         A B C D | Ra | funct3 | Rd | opcode
S- and B-Type:  A B | Rb | Ra | funct3 | E G | opcode
U- and J-Type:  A B C D E | Rd | opcode

Figure 2.6: risc-v's six instruction types. The coloured pieces compose the immediate value.

For example, the following instruction, in I-Type format — deduced from its opcode value (0010011) — is an ADDI — deduced from its funct3 value. The destination register (11001) is Reg_25, the first operand register (01001) is Reg_9 and the immediate value is 112.

At the end of the execution of this instruction, the value stored in Reg_25 will be the sum of the constant 112 and the value initially stored in Reg_9.


imm (blocks A B C D) | Ra    | funct3 | Rd    | opcode
0 000011 1000 0      | 01001 | 000    | 11001 | 0010011
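As a cross-check of this example, the field extraction can be sketched in Python (an illustration of the RV32I I-Type layout, not the report's VHDL decoder):

```python
def decode_i_type(word):
    """Split a 32-bit RV32I I-Type instruction into its fields."""
    opcode = word & 0x7F          # bits 6..0
    rd     = (word >> 7) & 0x1F   # bits 11..7
    funct3 = (word >> 12) & 0x7   # bits 14..12
    ra     = (word >> 15) & 0x1F  # bits 19..15 (rs1)
    imm    = word >> 20           # bits 31..20
    if imm & 0x800:               # sign-extend the 12-bit value
        imm -= 1 << 12
    return opcode, rd, funct3, ra, imm

# The ADDI instruction discussed in the text:
word = 0b00000111000001001000110010010011
assert decode_i_type(word) == (0b0010011, 25, 0b000, 9, 112)
```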

2.3 Design and Implementation

In order to port the mb-lite architecture to the risc-v ISA, the control signals were carefully examined and adapted to comply with the new behaviour imposed by the new instruction protocol.

Owing to the differences in features supported by the two ISAs, a few control signals were suppressed from the ID stage and a few new adder components were introduced into the IF and EXE stages. On the other hand, the differences in instruction behaviour required more data to be transmitted between stages.

2.3.1 Adaptation of the Instruction Fetch Stage

This stage received a small but significant adjustment, aiming to provide Jump-and-link instructions with the address of the subsequent instruction. They need this information in order to save the concerned address into a general-purpose register. For instance, this mechanism permits the program in execution to jump to a routine and, once the routine is over, to go back to the original execution flow from where it stopped.

Since the Instruction Fetch stage is responsible for computing the address of the subsequent instruction, it was coherent to propagate this value as the instruction advances through the pipeline. Propagating a 32-bit value implies more hardware utilisation. Another solution would be to compute the address again at the Execute stage, but this would lengthen the critical path of the processor. The choice of propagating the value instead of recomputing it was based on taking advantage of parallelism within each component and stage, reducing the internal dependencies to a minimum and thus not compromising the critical path.

2.3.2 Modification of the Instruction Decode Stage

Based on the specifications discussed above, the pipeline stage where most of the adjustments had to be made was the Instruction Decode, where control signals are set up and registers are eventually read.

Differently from mb-lite, risc-v supports neither jump-delay instructions nor immediate-dependent instructions. Jump-delay is a feature that reduces losses when executing jumps by placing, after the jump, one instruction that is executed regardless of the jump decision, so that it does not need to be flushed. Immediate-dependent instructions are triggered by the MicroBlaze ISA's IMM instruction, which preloads the upper half of the immediate value and complements it with the immediate of the subsequent instruction. The same result is achieved in risc-v using a register and the sequence i) Load Upper Immediate and ii) Bitwise-Logic Or Immediate. All signals responsible for controlling jump-delays and immediate-dependency were thus suppressed from the architecture.
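The LUI/ORI sequence can be sketched as follows (a Python illustration; note that it composes cleanly only when bit 11 of the lower part is 0, since ORI sign-extends its 12-bit immediate):

```python
MASK32 = 0xFFFFFFFF

def lui(imm20):
    """Load Upper Immediate: place a 20-bit value in bits 31..12."""
    return (imm20 << 12) & MASK32

def ori(value, imm12):
    """Bitwise OR with a sign-extended 12-bit immediate."""
    if imm12 & 0x800:
        imm12 -= 1 << 12
    return (value | imm12) & MASK32

# Building the 32-bit constant 0x12345678 in two steps:
reg = lui(0x12345)     # reg = 0x12345000
reg = ori(reg, 0x678)  # reg = 0x12345678 (bit 11 of 0x678 is 0)
assert reg == 0x12345678
```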

Immediate Composition

The position and composition of immediate values within the instruction differ between the two ISAs. In MicroBlaze, the immediate field exists only for Type B instructions, at the sixteen least significant bits of the instruction; in risc-v, it exists for all but R-Type instructions, and its position within the instruction encoding is specific to the type.

In the risc-v ISA, this composition may cause misunderstanding at first glance, as the J-Type makes evident (sequence AEDBC in Figure 2.7). The reason for having the immediate value cut into small pieces instead of encoded directly — for instance, as the U-Type does: ABCDE — is a more optimal use of hardware components. As illustrated in Figure 2.7, each block of the final immediate (from I to IX) may be assigned from up to four blocks of the instruction (from A to G). For example, block VIII corresponds to bits 1–4 of the immediate and


Raw instruction:   A | B | C | D | E | F | G | opcode   (bits 31 down to 0)
Immediate blocks:  I | II | III | IV | V | VI | VII | VIII | IX

I-Type:  – A – B C D
S-Type:  – A – B F G
B-Type:  – A – G B F 0
U-Type:  A B C D E – 0 –
J-Type:  – A – E D B C 0

Figure 2.7: According to the instruction type, immediate values are composed using pieces of the original raw instruction.

may be associated with either block C or block F, or a 4-bit zero value. The hardware necessary for this association is made of multiplexers with up to four inputs and thus one or two control signals to select which input is piped to the output.

If the immediate value of the J-Type format were encoded directly — ABCDE instead of AEDBC — in the raw instruction, the number of instruction blocks (from A to G) associated with the final immediate blocks (from I to IX) would be greater than four, increasing the complexity of the multiplexers required to pipe the blocks.
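The block shuffling of Figure 2.7 corresponds to the following extraction logic (a Python sketch of the RV32I immediate formats; the hardware performs the same selection with multiplexers):

```python
def _sx(value, bits):
    """Sign-extend a `bits`-wide value (block A carries the sign)."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

# Instruction blocks: A=inst[31], B=inst[30:25], C=inst[24:21],
# D=inst[20], E=inst[19:12], F=inst[11:8], G=inst[7].
def imm_i(w):  # A B C D
    return _sx(w >> 20, 12)

def imm_s(w):  # A B F G
    return _sx(((w >> 25) << 5) | ((w >> 7) & 0x1F), 12)

def imm_b(w):  # A G B F 0
    return _sx((((w >> 31) & 1) << 12) | (((w >> 7) & 1) << 11)
               | (((w >> 25) & 0x3F) << 5) | (((w >> 8) & 0xF) << 1), 13)

def imm_j(w):  # A E D B C 0
    return _sx((((w >> 31) & 1) << 20) | (((w >> 12) & 0xFF) << 12)
               | (((w >> 20) & 1) << 11) | (((w >> 21) & 0x3FF) << 1), 21)

# I-Type: the ADDI from Section 2.2.4 carries the immediate 112.
assert imm_i(0x07048C93) == 112
# B-Type: a branch with offset -4 (one instruction backwards).
assert imm_b(0xFE000EE3) == -4
```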

Immediate for R-Type instructions

Although R-Type instructions have no immediate value, the available hardware was used to carry additional information from the Instruction Decode stage to the Execute stage for instructions such as multiplications and logical/arithmetic shifts. These instructions require control signals to define whether the values are signed or unsigned, whether the result is the upper or lower part of the value, and whether the shift is logical or arithmetic. Instead of creating new signals, which would enlarge the inter-stage barrier and add flip-flops, this (until then unused) immediate value was reused.

2.3.3 Adaptation of the Execute Stage

The differences between the MicroBlaze and risc-v ISAs extend to the actions performed by some instructions. Thus, the Execute stage had to be slightly modified. In the first architecture, branches³ are taken depending on the comparison between the value in one register and the zero constant, whereas in the second one the branch decision depends on either two registers or one register and one immediate value.

³ Instructions which may set a different address for the next instruction depending on the comparison of two values.


In both ISAs, the address the program counter (PC) must target for the subsequent instruction when the branch is taken is the sum of the current PC value and another operand — either the immediate value or the value in one register. This means that there is always an addition to be carried out when executing a branch instruction.

The comparison between one value and the zero constant is much simpler than between two non-zero values. In the first case, any operand with any single bit equal to logic 1 will be greater than zero if the value is unsigned, and an operand with the most significant bit equal to logic 1 will be smaller than zero if the value is signed, as illustrated in Table A.1. In the second case, the most efficient solution is to perform the subtraction of the values and verify the sign of the result, which means that for risc-v a second addition module is required to perform the comparison — the first adder, as discussed in the previous paragraph, is required for the target address. Appendix A on page 32 presents a full description of the logic employed.
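The subtract-and-check-sign comparison can be sketched as follows (Python, 32-bit two's-complement arithmetic; one detail added here beyond the text is that hardware must also combine the result's sign with the overflow flag):

```python
MASK32 = 0xFFFFFFFF

def signed_less_than(a, b):
    """a < b (signed), computed the way the second adder would:
    subtract and inspect the sign, corrected by the overflow flag."""
    a32, b32 = a & MASK32, b & MASK32
    diff = (a32 - b32) & MASK32
    sign = diff >> 31
    # Overflow: operands have different signs and the result's sign
    # differs from the minuend's sign.
    overflow = ((a32 >> 31) ^ (b32 >> 31)) & ((a32 >> 31) ^ sign)
    return bool(sign ^ overflow)

def greater_than_zero_unsigned(a):
    """Comparison against the zero constant is far simpler:
    any set bit means the unsigned value is greater than zero."""
    return (a & MASK32) != 0

assert signed_less_than(-1, 1)
assert not signed_less_than(1, -1)
assert signed_less_than(-2**31, 1)  # wrong without the overflow fix
```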

2.4 Analysis

2.4.1 Criticisms of the risc-v ISA Encoding

In MicroBlaze, because of the smaller number of instruction types, the decoding process is simpler and reduces to just a few rules. For instance, it is common to almost all instructions that one opcode bit determines whether the second operand is the immediate value or the value in Rb. On the other hand, risc-v is less careful concerning this aspect.

Even though the 6th bit of the opcode determines the second operand (just like in MicroBlaze) within arithmetic instructions, it is necessary to create specific treatments for multiplication instructions. They could have been comprised in the same treatment as arithmetic instructions if they all used the funct3 field to determine the Arithmetic Logic Unit (ALU) operation. Distinctions between multiplication instructions (upper or lower saving, signed or unsigned) could be determined by the funct7 field.

Another point of discussion is the fact that, for the same opcode, the risc-v ISA presents instructions of different types, such as the Shift Logic Immediate (SLLI) and Bitwise Logic Or Immediate (ORI) instructions, of R- and I-Type respectively. These instructions share no common point in terms of hardware execution or meaning. Sharing the same opcode does not present any advantage, especially if the types in question — and hence the immediate value composition — are different.

The decoding process would be much simpler and more direct if the following general rules were established for all instructions:

• The six bits of the opcode field determine the instruction type;

• Two of these six bits determine the first and second operand source;

• The three bits of the funct3 field determine the operation to be performed by the Arithmetic Logic Unit;

• The seven bits of the funct7 field distinguish particular treatments within the same instruction — for instance, whether the result of the multiplication will contain the upper or lower thirty-two bits, like MUL and MULH;

• For instructions with neither funct3 nor funct7 fields, those distinctions should be determined by the opcode, as the LUI and AUIPC instructions do.

It is not the goal of this internship to develop a new ISA; however, it is very interesting from a research perspective to explore these flaws and their direct consequences at the hardware level.


2.4.2 Branch Prediction

Branches are massively used by programs to change the flow of execution. Indeed, the execution of a program through its instructions is not as linear as the disposition of these instructions in memory, so branches appear as a good method to overcome the current memory paradigm.

However, within pipelined architectures, processors perform a few successive instructions at the same time. If a branch decision is taken, the instruction subsequent to the branch should no longer be the subsequent one in memory, but the one located where the branch points to. This behaviour is only known after the decoding stage, and the decision only after the execution stage, meaning that there might be up to two instructions within the pipeline that should be flushed.

To overcome this limitation, a few solutions exist: i) invariably place two NOP⁴ instructions after each branch; ii) use the branch-delay method, where the compiler switches the position of the branch with a former instruction, so that no flush is required; iii) use branch prediction to minimise the number of flushes.

A good branch prediction by hardware or software is critical to the exploitation of more than modest amounts of instruction-level parallelism [14]. When implemented in hardware, this control unit tries to predict whether the branch not yet decided will be taken or not. By using the predicted decision, the IF stage may fetch the subsequent instruction while the decision is being computed. If the prediction is correct, there is no need to flush any stage. Conversely, when the predicted decision differs from the one decided by the execution of the branch — e.g. the value in register Ra is greater than or equal to the one in register Rb for the BGE instruction — it is necessary to flush the pipeline and fetch the correct instruction. Flushing a stage corresponds to replacing the current instruction by a NOP, increasing the overall Cycles per Instruction (CPI). Therefore, the better the system predicts the correct decision, the higher the performance of the processor.

Whether adding the Branch Control Unit described above yields a net performance increase will depend on how many branches there are in the code, how much CPI it saves, and how much the clock frequency is penalised by the more complex hardware. There is, notably, an increase in hardware and thus in area usage, but here one is analysing only the consequences with regard to the performance of the processor.
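The CPI side of this trade-off can be quantified with a simple model (illustrative only; the branch fraction and accuracy figures below are made-up inputs, not measurements from this report):

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    """Average cycles per instruction once misprediction flushes
    (flush_penalty NOPs per mispredicted branch) are accounted for."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty

# Two flushed stages (IF and ID) per wrong decision, 20% branches:
never_right  = effective_cpi(1.0, 0.20, 1.0, 2)  # worst-case predictor
ninety_right = effective_cpi(1.0, 0.20, 0.1, 2)  # 90% accurate predictor
assert never_right > ninety_right
```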

2.5 The Big Picture

In all, forty-seven instructions from the risc-v ISA were ported to the mb-lite architecture. They represent the full RV32I Base Integer Instruction Set (version 2.0) plus the multiplication instructions from the RV32M Standard Extension (version 2.0), except for the ECALL, EBREAK and FENCE instructions. The reader may find the list of them at the end of this report, in Appendix B on page 34.

Concerning the hardware differences, the atlas architecture received 273 new bits and lost 165 bits at the inter-stage registers of the pipeline, a net increase of 108 bits. The illustration of added, preserved and removed signals and components is presented in Appendix C, at the end of this report (see Figure C.2 on page 36).

2.6 Validation

To confirm the expected operation of the designed microarchitecture according to the risc-v ISA specifications, many benchmark programs were simulated with the Xilinx ISE Simulation Tool. The goal of this set of programs was to ensure and validate the data coherence of each of the

⁴ Instructions which do not perform any action on any register or memory.


following items: i) data forwarding; ii) loads; iii) stores; iv) arithmetic; v) branches; and vi) hazards. All data levels were inspected: control signals, flip-flops, data memory and general-purpose registers.

A spreadsheet was created to facilitate the translation from assembly language to binary instructions. This tool was first conceived to save time during the early validations and, as the validation process got more complex, it proved a very useful and practical tool to catalyse the validation.

A test program was run to compare all the architectures concerned in this report. The reader may find it in Chapter 5 on page 27.

2.7 Results

Once the hardware description is validated, the architecture may be synthesised and placed-and-routed onto an FPGA device, which defines the final critical path. The target device used for this place-and-route process was a Xilinx Virtex-7 FPGA. The summary with the main data from the synthesis of the processors concerned in this report is available in Appendix D on page 38.

Time Constraints As justified in Section 2.2.1, the critical path fixes the maximum clock working frequency. By analysing the logs generated by the synthesis tool, it was verified that the critical path of the atlas microarchitecture is slightly longer due to the introduced comparison unit. In fact, while within the MicroBlaze microarchitecture the result of the ALU goes directly to the EXE stage output, within risc-v the ALU result is used to compute the comparison signals, which in turn are used to calculate the branch decision. This logic dependency is the critical path of the atlas processor.

The original mb-lite microarchitecture implemented on the Xilinx Virtex-7 FPGA device presents a minimal clock period equal to 9.58 ns (104.35 MHz). On the same device, the conceived atlas processor presents a minimal clock period equal to 15.567 ns (64.23 MHz), i.e. the clock frequency drops to 61.5% of the original.

Power Consumption Another metric with which to compare the different architectures is their power consumption. It depends on many factors, such as the quantity and characteristics of the components used for the implementation. The Xilinx XPower Analyser software was used to calculate this metric.

The total power consumption on the Virtex-7 FPGA device of the original mb-lite microarchitecture is 337.93 mW. On the other hand, despite the greater number of components used in the implementation of the processor described throughout this chapter, there was an energy saving, thanks not only to the reduction of the required DRAM but also to the reduction of the working frequency. Therefore, on the same FPGA device, atlas consumes 8% less power than the initial processor, presenting a total power consumption equal to 311.46 mW.


Chapter 3

Stream-Based Loop Execution Acceleration

Within this chapter, the second task of the internship is presented. A new feature was introduced to the atlas microarchitecture to execute simple loops with a smaller CPI by introducing the stream concept. This development was the middle step towards the final goal of the internship and consisted of designing a new path and execution procedure for simple loops — a preliminary data-flow processing engine. The resulting microarchitecture described within this chapter is herein denoted andes.

3.1 Motivation

In the domain of microprocessors, efforts to increase the clock frequency are stagnating. Instead, the strategy for increasing processor performance is to take advantage of diverse forms of parallelism, as Oliver Pell and Oskar Mencer advocate in the article Surviving the End of Frequency Scaling with Reconfigurable Dataflow Computing [13].

As presented earlier in this report, memory access is a recurrent issue in architectures, given the constraints and delays it imposes. Applications defined as memory-bounded have a considerable level of memory access and hence their performance is highly limited. More than a simple cache-prefetching mechanism, memory-bounded applications might require arbitrary memory access. Matrix operations, media processing and DNA computing are just a few examples of commonly used memory-bounded applications whose complexity tends to increase.

Stream processing is an emerging concept which enhances data-access performance by including dedicated and specialised components to extract performant parallelism [12]. While the computing core unit focuses on the inherent computing instructions, the data management unit prefetches the data respecting the access pattern defined by the application, delivering it in the correct order. This new paradigm delegates address computation, index incrementation and load instructions to the data management unit, which works in parallel with the computing unit.

It is out of the scope of this report to enter into complex domains such as cache management, but it is appropriate to highlight the ability of such a system to adapt the existing prefetching mechanism to dynamic and non-linear accesses [12, 17]. In industry, one may find some stream-based architectures, such as the Borealis processor [1].

Similar issues are found in memory-bounded loop executions. Efforts to increase the performance of loops are widely found in academia, for instance the loop-unrolling technique discussed by Wall D.W. [14].

Repeated tasks such as fetching instructions from program memory within a loop executioncan be avoided by adopting an automatic feeder. Furthermore, there are many instructions


whose only goal is to manage the loop execution, as presented in Section 3.2.1.

The motivation for the andes processor was to tackle both data-access and loop-performance

issues by creating a new execution behaviour for simple loops. A stream-based processor was the inspiration for many of the design and architecture choices.

By adding new control components to avoid repeated loop-control instructions, the core's pipeline may perform only the instructions of the loop body.

3.2 Background

A loop-dedicated section will state the paradigm of loops in computer science, referring to its equivalent in assembly language. Once again, if the reader is familiar with this subject or does not need to go deep into it, he/she may skip to Section 3.2.2, where the implications of a loop execution for the processor's architecture are analysed.

3.2.1 Count-Controlled Loops

In computing, a loop is defined as a sequence of instructions which may be executed several times. It is generally composed of a body — the instructions — and a header — the loop controls. The latter defines the loop type, conditions and index variables. Within the context of this internship, we were dealing at first with count-controlled loops, i.e. loops that have a fixed and known number of iterations defined before the execution.

Algorithm 1: Example of a basic count-controlled loop in high-level language
1  integer array A
2  integer array B
3  integer array Y
4  for i ← 1 to N do
5      var_a ← A[i] + B[i]
6      var_b ← A[i] − B[i]
7      Y[i] ← var_a × var_b

In Algorithm 1, line 4 corresponds to the header, indicating the variable i as the index and the constant N as the number of iterations. The implicit increment of i is 1, i.e. for each iteration, the variable i will be incremented by 1. Lines 5–7 correspond to the loop's body; these are the instructions which will be performed N times. The program might continue after the loop, so there may exist instructions 8, 9 and so on.

The assembler code¹ generated by the compilation of this piece of code is equivalent to Algorithm 2. Loop controls [LC] correspond to lines 2–3 and 11–12, while body [B] instructions correspond to lines 4–10. Once again, there may exist instructions after the loop, corresponding to instructions 13, 14 and so on. Note that, due to the characteristics of RISC architectures, one big instruction in a high-level language — Y[i] = (A[i] + B[i]) × (A[i] − B[i]) — may be decomposed into multiple smaller ones in a low-level language. Furthermore, in high-level programming languages one may index the memory using the index variable (A[i]), but in low-level code this relation may not exist, resulting in two registers: one for the memory address (reg_1) and another for the loop indexation (reg_2).

3.2.2 Architecture’s Behaviour

A repetition in assembly is performed by a conditional jump instruction, also called a branch. Branches, like jumps, are decided when they are carried out by the EXE pipeline stage, as presented

¹ A direct abstraction of the instructions in the format perceived by the processor.


Algorithm 2: Example of a basic count-controlled loop in assembler language
1   reg_1 ← x003000           Base memory address
2   reg_2 ← 1                 index register                  [LC]
3   reg_3 ← 500               iterations register             [LC]
4   reg_4 ← A[reg_1]          LW load instruction             [B]
5   reg_5 ← B[reg_1]          LW load instruction             [B]
6   reg_6 ← reg_4 + reg_5     ADD instruction                 [B]
7   reg_7 ← reg_4 − reg_5     SUB instruction                 [B]
8   reg_8 ← reg_6 × reg_7     MUL instruction                 [B]
9   Y[reg_1] ← reg_8          SW store instruction            [B]
10  reg_1 ← reg_1 + 4         Memory address incrementation   [B]
11  reg_2 ← reg_2 + 1         Index incrementation            [LC]
12  jump to #4 if (reg_2 ≤ reg_3)   Conditional Jump          [LC]

in Section 2.4.2. This means that, while the branch instruction is at the EXE stage, the subsequent instructions in memory are being carried out at the IF and ID stages. If the branch is taken (which happens N − 1 times in a loop execution), the IF and ID stages will need to be flushed.

Hence, a loop execution following the standard RISC pipeline flow will introduce 2 × (N − 1) NOP instructions and will execute 2 + 2 × N loop-control instructions. As the number of clock cycles increases, this directly impacts the processor's performance and hence the program's execution time.
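These formulas can be checked numerically for Algorithm 2 (a Python sketch; seven body instructions, two set-up and two per-iteration control instructions):

```python
def loop_cost(n_iter, body=7, setup_lc=2, per_iter_lc=2, flushed=2):
    """Instruction slots consumed by a count-controlled loop on a
    classic RISC pipeline without branch prediction."""
    useful  = n_iter * body                    # loop-body work
    control = setup_lc + per_iter_lc * n_iter  # 2 + 2*N
    nops    = flushed * (n_iter - 1)           # 2*(N-1) flush bubbles
    return useful, control, nops

useful, control, nops = loop_cost(500)  # N = 500 as in Algorithm 2
assert (control, nops) == (1002, 998)
# Roughly 36% of all slots are spent on control and flushes here.
overhead = (control + nops) / (useful + control + nops)
```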

The following simulation illustrates a classic count-controlled loop and its instructions being performed by each pipeline stage. Highlighting the ID stage for better comprehension, one should note that instructions are performed by all stages, advancing diagonally through time.


Figure 3.1: Illustration of a classic pipeline executing a count-controlled loop. Element A represents the instructions preceding the loop; B corresponds to the initial control instructions (lines 2–3 in Algorithm 2); C to the loop body (lines 4–10 in Algorithm 2); D to the loop-control instructions (lines 11–12 in Algorithm 2); and E to the instructions which will be flushed (replaced by a NOP) as imposed by the branch.

3.3 Design and Implementation

New dedicated hardware was required in order to remove the need for loop-control instructions. This hardware should control the loop iterations and feed the pipeline with the loop-body instructions. From the software side, new instructions had to be created to activate the new feature and command the new hardware. Preserving the IF, EXE, MEM and WB pipeline stages, the only modified stage was ID, which was split into two stages.


3.3.1 Extension from 5-stage to 6-stage Pipeline

The first task consisted of simply splitting the logic of the processor's original ID stage to include a new pipeline stage, named Dispatcher/Operand Fetch (DOF). Accordingly, the original instruction-decoding logic remained in the ID stage, and the Register File (RF) access is now performed at the new DOF stage.

The following modifications to the processor's structure were required to perform the aimed pipeline extension:

• Implementation of the RF access at the DOF stage, including i) register access from decoded instructions; and ii) the WB connection;

• Back-propagation of flush and hazard signals, with appropriate handling in both the ID and DOF stages.

The extension maintained the functionality of the original processor.

3.3.2 Instruction Buffer

A stream-processing-compatible architecture requires a component which manages the streamed data. Within the processor presented here, this function is carried out by an instruction buffer, a circular queue on a first-in-first-out basis: it works as a pipe into which instructions are put at one end — the writing process — and from which instructions are taken in the same order at the other end — the reading process. The circular feature means that, when the last added instruction is read, the read pointer of the buffer goes back to the starting point and will deliver the first instruction again, as illustrated in Figure 3.2.


Figure 3.2: Illustration of a circular buffer. W and R are pointers to the write and read positions, respectively. After each operation (write or read), the concerned pointer moves to the left. Pointer W starts at position zero and may move up to the maximal capacity of the buffer. Pointer R starts at position zero and may move up to the position of pointer W; when this happens, pointer R is reset to position zero.
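The pointer behaviour described above can be sketched as a behavioural model (not the actual RTL; the class and method names are illustrative):

```python
class CircularInstructionBuffer:
    """Behavioural model of the loop instruction buffer (Figure 3.2)."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.w = 0  # write pointer: advances up to the buffer capacity
        self.r = 0  # read pointer: advances up to the write pointer

    def write(self, decoded_instruction):
        self.slots[self.w] = decoded_instruction
        self.w += 1

    def read(self):
        """Return the next instruction and whether a full iteration finished."""
        instruction = self.slots[self.r]
        self.r += 1
        end_of_iteration = (self.r == self.w)
        if end_of_iteration:
            self.r = 0  # wrap: replay the loop body from its first instruction
        return instruction, end_of_iteration
```

Reading the buffer repeatedly replays the stored loop body in order; the end-of-iteration flag raised at each wrap corresponds to the signal that decrements the iteration counter.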

Used in the context of a loop-dedicated instruction stream, the buffer stores the loop's body instructions and iterates over them N times. After the first iteration of fetching and decoding instructions, the IF and ID stages "freeze" during the loop execution. Indeed, in this new paradigm, loop instructions are loaded and decoded only once.

The circular buffer was implemented at the DOF stage to store the already-decoded instructions. It starts storing instructions after the reception of a loop-activation instruction, detailed in the next sections.

3.3.3 Dedicated Instructions

The envisaged loop-accelerator engine required the addition of a few configuration instructions in order to give the loop control the essential information, notably the number of body instructions and the number of iterations to be performed. Moreover, an instruction was added to indicate where the loop set starts.


In order to create these new instructions, one must respect the conventions of the ISA. All the knowledge discussed in Section 2.3.2 was very useful to master each available field and its corresponding hardware influence. It was stipulated that only one opcode value should be used, in order to minimise the interference with the original ISA and hence maximise its compatibility.

Concerning the number of iterations, applications such as media processing might use a large number of iterations. Furthermore, one may imagine several applications where the number of iterations is only known at runtime, hence the value must come from a register. Unlike the number of iterations, the number of instructions in a loop is known at compile time. After some preliminary studies, it was determined that a 12-bit immediate value should be enough to hold this information.

The following constraint list was stated with the help of my direct mentor, Mr Nuno Neves:

New instructions One or two to determine control information and one to indicate the beginning of the loop set;

opcode value Only one for all created instructions, hence an instruction type which holds at least a funct3 field was required;

Number of loop iterations 0 to 2^32 − 1 = 4 294 967 295;

Number of body instructions 1 to 2^12 − 1 = 4095.

It was decided to use the I-Type instruction format because of its funct3 field, useful to distinguish the different loop-control instructions sharing the same — and unique — opcode. Moreover, the I-Type format generates a 32-bit sign-extended value from a 12-bit field.

An opcode still available in the full risc-v instruction set was chosen: "1010111".

LPCFG Ra, Imm Loop configuration instruction. The immediate value sets the number of body instructions; the value in the Ra register sets the number of iterations.

LPST Imm Loop activation instruction. The immediate value defines the loop address offset in the Lane-Mapping memory, addressed in Chapter 4.
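Under these conventions, the encoding of the two instructions can be sketched as follows. The funct3 values assigned to LPCFG and LPST below are illustrative assumptions, as the report does not fix them:

```python
# I-Type layout: imm[31:20] | rs1[19:15] | funct3[14:12] | rd[11:7] | opcode[6:0]
OPCODE_LOOP = 0b1010111    # the free opcode chosen for all loop instructions
FUNCT3_LPCFG = 0b000       # assumption: actual funct3 assignments are not given
FUNCT3_LPST = 0b001        # assumption

def encode_itype(imm12, rs1, funct3, rd):
    """Pack the I-Type fields into a 32-bit instruction word."""
    assert 0 <= imm12 < (1 << 12) and 0 <= rs1 < 32 and 0 <= rd < 32
    return (imm12 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | OPCODE_LOOP

def lpcfg(ra, body_instructions):
    # Ra carries the 32-bit iteration count; the 12-bit immediate, the body size
    return encode_itype(body_instructions, ra, FUNCT3_LPCFG, 0)

def lpst(lane_map_offset):
    # the immediate is the loop's offset in the Lane-Mapping memory (Chapter 4)
    return encode_itype(lane_map_offset, 0, FUNCT3_LPST, 0)
```

The single shared opcode keeps the footprint on the original ISA minimal, while funct3 distinguishes the loop-control instructions from one another.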

The assembler code equivalent to Algorithm 2, made compatible with the new feature, is presented in Algorithm 3.

Algorithm 3: Example of a basic count-controlled loop in assembler language using the automatic execution feature.

1   reg_1 ← 0x003000         Base memory address
2   reg_3 ← 500              Iterations register            [LC]
3   LPCFG reg_3, 7           Loop configuration set-up      [LC]
4   LPST                     Loop activation                [LC]
5   reg_4 ← A[reg_1]         LW load instruction            [B]
6   reg_5 ← B[reg_1]         LW load instruction            [B]
7   reg_6 ← reg_4 + reg_5    ADD instruction                [B]
8   reg_7 ← reg_4 − reg_5    SUB instruction                [B]
9   reg_8 ← reg_6 × reg_7    MUL instruction                [B]
10  Y[reg_1] ← reg_8         SW store instruction           [B]
11  reg_1 ← reg_1 + 4        Memory address incrementation  [B]

It is worth noting that the index-incrementation and conditional-jump instructions are not required anymore: their function is assured by the loop-accelerator engine. Hence, the number of executed instructions will be 3 + B × N instead of 2 + (B + 2) × N. It should also be highlighted that, by removing the branch instruction, no NOPs are added anymore, saving a cost of 2 × (N + 1) cycles. The total number of saved clock cycles is thus [2 + (B + 2) × N] − [3 + B × N] + [2 × (N + 1)] = 1 + 4 × N, as illustrated in Figure 3.3.
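The counts above can be checked with a short sketch, using the 2 × (N + 1) flush term from the closed-form expression:

```python
def standard_cycles(B, N):
    # standard flow: 2 set-up instructions, then (body + increment + branch)
    # per iteration, plus the flushed NOPs tied to the branch mechanism
    return 2 + (B + 2) * N + 2 * (N + 1)

def streamed_cycles(B, N):
    # streamed flow: 3 loop-control instructions, then only the body
    return 3 + B * N

def saved_cycles(B, N):
    return standard_cycles(B, N) - streamed_cycles(B, N)
```

For any body size B, the saving collapses to 1 + 4 × N, so it grows with the iteration count alone.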



Figure 3.3: Illustration of the conceived pipeline executing a count-controlled loop. Element A represents the instructions preceding the loop; B corresponds to the initial control instructions (lines 2–4 in Algorithm 3); C to the loop body (lines 5–11 in Algorithm 3). There are no more flushed instructions, nor control instructions being iterated.

3.3.4 Loop-Accelerator Engine

With the instruction buffer and control instructions implemented, the subsequent task was to implement the loop-accelerator engine, responsible for managing the loop execution. It is spread over the ID and DOF stages. Accordingly, it aimed at providing the basic loop control and instruction issue from the DOF stage, by considering:

• An instructions-to-load counter at the ID stage, whose initial value is set up by LPCFG and is decremented for each instruction following LPST;

• When the instructions-to-load counter reaches zero, the IF and ID stages freeze until the end of the loop execution;

• After the loop-activation instruction, the DOF stage starts reading from the instruction buffer, instead of reading the instruction arriving from the ID stage as under normal execution;

• An iterations-to-perform counter at the DOF stage, whose initial value is set up by LPCFG and is decremented for each iteration. This decrement signal is generated by the instruction buffer at the end of a full cycle;

• When the iterations-to-perform counter reaches zero, the IF, ID and DOF stages resume their original functionality.
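The interplay of the two counters can be summarised by a behavioural sketch. It is simplified: in the real pipeline the first iteration is issued while the buffer is still being filled, whereas here the two phases are sequential:

```python
def loop_engine(body, iterations):
    """Issue sequence produced by the loop-accelerator engine (behavioural)."""
    # ID stage: the instructions-to-load counter (set by LPCFG) fills the buffer
    buffer = []
    to_load = len(body)
    while to_load > 0:
        buffer.append(body[len(buffer)])
        to_load -= 1
    # IF and ID now freeze; DOF issues from the buffer until the
    # iterations-to-perform counter (also set by LPCFG) reaches zero
    issued = []
    to_perform = iterations
    while to_perform > 0:
        issued.extend(buffer)  # one full pass over the circular buffer
        to_perform -= 1        # decremented at the end of each full cycle
    return issued
```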

3.4 The Big Picture

The conceived andes architecture is completely compatible with the atlas one. Indeed, all of the original instruction set (risc-v) was respected.

The new hardware blocks that were designed and integrated into the former architecture correspond to: i) the new DOF pipeline stage; ii) the circular buffer for loop instructions; and iii) the loop and iteration controllers. An illustration of the added, preserved and removed signals and components is presented in Appendix C, at the end of this report (see Figure C.3 on page 37).

3.5 Validation

By using the same data-coherence tests presented in the previous chapter, the operation of the andes architecture was fully validated, by considering the hazard detection and resolution, the control of the circular buffer for loop instructions and the loop-control engine (instructions to be loaded and iterations to be accounted by the counters).

A test program was also executed to compare all the architectures concerned in this report. The reader may find it in Chapter 5, on page 27.

3.6 Results

Aiming at a full-parallel loop execution, the intermediary solution described in this chapter presents a simple and efficient improvement of loop execution, without penalising the clock period too much or consuming more energy. The reader may find the full report and metrics in Appendix D on page 38.

By comparing the metrics obtained for atlas — the mb-lite architecture ported to risc-v (from the previous chapter) — with those for the andes architecture, it is observed that the latter has improved its maximum operating frequency. Furthermore, despite the considerable increase of resource utilisation by andes (6.28x more registers), the power consumption has increased by less than 2%. Combined with the fact that the parallelism within the architecture is better distributed, the final critical path of andes is only 82% of atlas' critical path.

3.7 Limitations

It is worth mentioning that the conceived andes architecture also has a few limitations. Count-controlled loops cover only a small part of the possible loop mechanisms. Furthermore, the andes architecture supports only simple, fully-sequential count-controlled loops, which means that it is not possible to have any kind of jump instruction inside the loop body when using the conceived automatic-execution feature. This limitation on the loop body extends to other loops: two-dimensional loops are not supported. The solution for such cases is to keep the standard execution flow for the outer loop and accelerate the inner one (by using the introduced feature).

Moreover, in order to be able to use this feature, the program compiler² must be compatible with the processor. This means compatible not only with the introduced instructions, but also with the processor's instruction-buffer capacity.

² Computer software which translates a program from a high-level language to binary instructions.


Chapter 4

Data-Flow Engine for Automatic Loop Parallelisation

The third and final task of this internship is presented in this chapter. A new parallelisation direction was exploited in the andes microarchitecture to take advantage of the data-flow processing paradigm. The conceived architecture, herein denoted by alpes, parallelises the pipelining flow during simple-loop executions, at the cost of an increase in the complexity of data-coherence processing, but presenting a better performance for several compute-intensive applications.

4.1 Motivation

Data-flow processing is typically performed by pre-analysing an application and extracting the data dependencies between tasks/operations/instructions. Such dependencies are then used to create a so-called Data-Flow Graph (DAG) that describes the flow of data between tasks [11]. Each DAG node (i.e., task) and its corresponding data source(s) and destination(s) are then mapped to the available processing resources. This pre-coding of the paths for data movement between running tasks allows, in most cases, data to be transferred explicitly between them, without requiring its intermediate storage in the main memory (e.g., in register files).

The key fact that motivates the exploitation of the DAG in this research topic is that instructions belonging to different branches of the DAG are independent; thus, DAG branches can be parallelised.

A highly-efficient atmospheric equation solver [6], a framework for financial simulations [9] and two reconfigurable hardware architectures [5, 3] are some examples of recent applications of data-flow within both industry and the academic community.

The proposed processing engine took on the data-flow paradigm and adapted it to the context of common RISC pipelining architectures during simple-loop executions. While typical data-flow engines are based on very large grids of functional units with equally large communication networks, the envisaged architecture relied on a far more contained and simple solution. In fact, it was based on the andes processor architecture's resources and their replication.

To achieve the proposed goal, a data-flow encoding style was used to automatically parallelise the computational loops of a given application.

4.2 Design and Implementation

The conception of the alpes architecture was divided into two fronts: the implementation of the hardware engine and its controlling circuit; and the data management derived from the data-flow processing. In what concerns the new hardware structure, the alpes architecture is characterised by the coexistence of several parallel pipelining lanes. On the data side, the architecture provides extensive data-forwarding management.

The scope of my collaboration was restricted to the hardware adaptations and did not involve the software encoding level — through compilers, for instance.

4.2.1 Pipeline Replication

This first task comprised the replication of the pipeline parts corresponding to the DOF, EXE, MEM and WB stages, creating multiple parallel pipeline lanes. Lane 0 preserved the original functionality of the pipeline, with full-ISA support. Branch, jump and CSR-access instructions were removed from the additional lanes.

4.2.2 Instruction-to-Lane Mapping

By providing multiple lanes, the conceived architecture allows several parallel execution flows. The instruction-routing decision is taken at compile time, where data dependencies are processed, thus avoiding program inspection by the hardware at runtime.

A simple data-flow encoding of each loop-body instruction is used to explicitly configure the data connections between instructions, as well as its destination lane. As such, a minimal 16-bit data-flow control word is defined for each loop-body instruction, illustrated in Figure 4.1.

Bit layout (15 … 0): | Lane id | RegChain | Rd | Ra | Rb |

Figure 4.1: Lane-mapping record. The Lane ID field indicates the destination lane for the instruction. The RegChain field indicates, in the case of register-chain forwarding, the destination lane. The Rd, Ra and Rb fields indicate the source or destination flow of each register.
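A decoder for this control word might look as follows. The report fixes the field order (Lane ID, RegChain, Rd, Ra, Rb) but not the individual widths, so the widths below — 5 bits for each lane field and 2 bits per register-flow field, matching the four flow types listed in the next section — are assumptions for illustration only:

```python
# assumed layout: lane_id[15:11] | reg_chain[10:6] | rd[5:4] | ra[3:2] | rb[1:0]
FLOW_NAMES = ("local", "forwarding", "unicast", "multicast")  # 2-bit flow code

def decode_control_word(word):
    """Unpack a 16-bit lane-mapping record into its named fields."""
    assert 0 <= word < (1 << 16)
    return {
        "lane_id": (word >> 11) & 0x1F,
        "reg_chain": (word >> 6) & 0x1F,
        "rd": FLOW_NAMES[(word >> 4) & 0x3],
        "ra": FLOW_NAMES[(word >> 2) & 0x3],
        "rb": FLOW_NAMES[word & 0x3],
    }
```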

As will be explained in more detail in the next section, the architecture provides four types of register flows, in particular:

Local the corresponding operand preserves the behaviour of a standard pipeline: fetched from or stored in the local register file and forwarded from the local pipeline stages;

Forwarding the corresponding operand is either obtained from the upper lane's forwarding lines, or it will be stored in the lower lane as well;

Unicast the corresponding operand is either obtained through the register chain from a further lane, or it will be sent to a further lane through the register chain;

Multicast the corresponding operand is either obtained through the register chain from a further lane, or it will be sent to multiple further lanes through the register chain.

4.2.3 Data-Forwarding Management

In order to fully supply the Execution stage of each lane with the results at the MEM or WB stages of all the other lanes, an immense quantity of data and control signals would be required, and the forwarding logic to choose the correct value among the several paths would be too complex. Instead, two forwarding mechanisms were developed to fill this need. These two mechanisms are the key factor allowing the compiler to distribute the instructions into different lanes and orders.

Figure 4.2 illustrates all the possible data flows for a register within the alpes architecture.


Figure 4.2: alpes' data-forwarding flows. Orange wires represent the data flow coming from the upper-adjacent lane. Blue wires represent the data flow from the local lane. The purple wire represents the flow coming from the Register Chain.

Local-Proximity Forwarding

To provide the required data flow between stages, the forwarding lines of each lane were extended to the EXE stage of one adjacent lane. Specifically, the forwarding lines originating in lane i, namely EXi, MEMi and WBi, are accessed by the EXi+1 stage.

In addition, lane i may write to the register file of lane i+1. Proper forwarding and control signals are routed.

Parallel DOF Issue

With the provided operand source/destination encoding, the DOF stage in each lane is able to automatically perform the issue of its assigned instructions. To do so, the DOF stage should monitor the forwarding lines of the upper lane and release the instruction to be executed as soon as its operands are available. In particular, the issue logic should assume that structural conflicts (i.e., register-write conflicts and simultaneous memory accesses) are solved at compile time by the data-flow generation tool. This way, it is only required to resolve Read-After-Write hazards (between lanes) by inserting NOPs in its pipeline lane.

Register Chain

Non-adjacent lanes may exchange data through the register chain as well. The register chain is composed of a circular chain and one cache register per lane; hence, each lane has a chain-node/cache-register pair, as illustrated in Figure 4.3. The information moves along the chain and is written into the cache register of the destination lane. The chain is written when the instruction is at the WB stage of the sender lane. The routing within the chain is controlled by the number of cycles necessary for the information to arrive at the destination lane: it is equal to the number of lanes between the sender and receiver minus one — i.e. from lane 1 to lane 7, it takes 4 cycles.
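Following the rule stated above (the number of lanes strictly between sender and receiver, minus one), the routing delay can be sketched as below; the modular wrap-around for the circular chain is an assumption of this sketch:

```python
def register_chain_latency(sender, receiver, num_lanes):
    """Cycles for data to travel the circular register chain between two lanes."""
    # lanes strictly between sender and receiver, following the chain direction
    lanes_between = (receiver - sender - 1) % num_lanes
    return lanes_between - 1
```

Adjacent lanes (zero lanes in between) are served by the local-proximity forwarding instead, so the chain latency only applies when at least one lane separates sender and receiver.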

4.3 Limitations

The same limitations of the andes architecture described in the previous chapter also apply to the alpes architecture. Furthermore, the lane-mapping engine and the forwarding-priority set increase the dependency of the compiled program on the processor architecture — i.e. a program compiled for a processor providing four lanes would be compatible (from an ISA point of view) with a processor providing only two (and vice-versa), but data coherence might not be ensured.

Figure 4.3: The Register Chain component.

4.4 The Big Picture

The conceived alpes architecture is completely compatible with the atlas one. Once again, all of the original instruction set (risc-v) was respected.

The new hardware blocks that were designed and integrated into the former architecture correspond to: i) the new parallel pipeline flow; ii) the register-chain engine; iii) the several forwarding-control signals to the DOF stages; iv) the several data-forwarding signals to the EXE stages; v) the lane-mapping memory; and vi) the upgraded loop-control engine. An illustration of the added, preserved and removed signals and components is presented in Appendix C, at the end of this report (see Figure C.4 on page 37).

4.5 Results

In terms of the critical path (see Appendix D on page 38), the addition of parallel lanes presents a slightly higher (and natural) value compared to the andes architecture. More investigation concerning this topic will be carried out before the end of this internship.

Concerning resource utilisation, there is a considerable step from andes to alpes with two lanes, due to the essential hardware necessary for the parallelisation (data-coherence management and a more complex loop-control engine). When comparing the different considered configurations of alpes (see Appendix D), the resource utilisation increases in the same proportion as the number of lanes.

In terms of power consumption, one may note a reduction of 10% from alpes with two lanes to andes, once again mainly caused by the reduction of used BRAM. Comparing the different configurations of alpes, the power consumption increases slowly with the number of lanes.


Chapter 5

Analysis and Comparisons

5.1 Benchmarks

Four benchmarks were simulated on the following architectures: mb-lite, atlas, andes and alpes, the last one under different configurations: two, four and eight lanes. Due to schedule constraints, this report was written two weeks before the end of the internship; thus, more validations (with other benchmark programs) are planned to be carried out, but their results were not ready in time to be published herein.

The four benchmarks herein reported were selected from the Livermore Loops suite [4] and are known to be a standard reference for loop-execution evaluation, combining different types of instructions and using the maximal hardware resources of the processor architecture. Benchmark A corresponds to Kernel 1 - Hydro Fragment, Benchmark B to Kernel 3 - Inner Product, Benchmark C to Kernel 5 - Tri-Diagonal Elimination, Below Diagonal and Benchmark D to Kernel 7 - Equation of State Fragment.

Clock Cycles

In terms of clock cycles, both mb-lite and atlas present the same metric. The first one supports branch delay, which reduces the number of NOPs after a branch to one; on the other hand, its branches require one additional arithmetic instruction, because the comparison can only be made against the zero constant. The result is a clock-cycle equivalence between the two architectures.

In the other two architectures, andes and alpes, where loop-execution improvements have been introduced, one may notice a real reduction in clock cycles. Naturally, the variation is dependent on the number of parallel lanes (Figure 5.1).

(a) Number of clock cycles

(b) Clock-cycles speed-up

Figure 5.1: Benchmark results (for 10,000 iterations) in number of clock cycles. The blue series corresponds to Benchmark A, the red one to Benchmark B, the yellow one to Benchmark C and the green one to Benchmark D. The speed-up metric assumes the mb-lite results as the base value.


Execution Time

As stated in equation (2.1), this second metric depends not only on the efficiency of the instructions (through the total number of instructions), but also on the physical performance of the processor (through the operating frequency). Hence, thanks to its higher operating frequency, mb-lite presents an advantage when compared to atlas, being 61% faster than the latter.

Comparing the timing results of mb-lite with the andes and alpes versions, one verifies a real reduction of the total execution time for architectures providing two or more lanes (alpes). In spite of the reduction in the operating frequency of those architectures, the reduction in clock cycles drives the gain in execution time.

In fact, there is an open opportunity to perform further optimisations on the atlas architecture in order to increase its operating frequency. Such an increase would consequently increase the performance of all the derived architectures. One may also note, by comparing Figures 5.2b and 5.1b, that the speed-up in terms of clock cycles is up to 5 for eight parallel lanes. If there is no, or only a small, increase in the operating frequency, the speed-up in terms of execution time may be much more significant.

(a) Execution time (b) Execution-time speed-up

Figure 5.2: Benchmark results (for 10,000 iterations) in terms of execution time. The blue series corresponds to Benchmark A, the red one to Benchmark B, the yellow one to Benchmark C and the green one to Benchmark D. The speed-up metric assumes the mb-lite results as the base value.

5.2 Final considerations

The notable case of the mb-lite processor illustrates that simple systems may overall be more performant than complex systems which try to outperform them. The reason is that, when more complexity is added to a system, the clock frequency may decrease, and the supplemental hardware may increase costs and energy consumption. While the performance improvement may be restricted to a few instructions, the clock frequency applies to all instructions. Hence, the new hardware, initially aimed at increasing the performance, may instead decrease it; the gain in performance generated by the new hardware must thus overcome the penalisation of the clock frequency.
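This trade-off can be made concrete with the execution-time relation of equation (2.1): added hardware only pays off when the relative cycle reduction exceeds the relative frequency penalty. The figures below are purely illustrative:

```python
def execution_time(cycles, frequency_hz):
    # execution time = clock cycles / operating frequency
    return cycles / frequency_hz

# A feature that removes 20% of the cycles but costs 25% of the frequency
# is a net loss; the same feature costing only 10% of the frequency is a gain.
base = execution_time(1_000_000, 100e6)
loss = execution_time(800_000, 75e6)
gain = execution_time(800_000, 90e6)
```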

Notably, the risc-v ISA supports branch decisions based on the comparison between two dynamic values, which most certainly reduces the number of instructions of a program compared to the limited zero-value comparison of mb-lite; but, in doing so, it increases the complexity of the system, possibly reducing the processor's operating frequency.

The proof of concept investigated in this internship is extrinsic to the employed ISA (in this case, risc-v). Hence, the goal here was to prove that such data-flow parallelism, combined with stream processing, enhances the performance in terms of execution time and energy consumption.


Conclusion

The initial assignment for this internship concerned the investigation of four technologies — the mb-lite processor architecture, the risc-v ISA, stream processing and data-flow parallelism — in order to produce a prototype of an architecture combining all four paradigms.

This preliminary investigation of the stream-processing and data-flow paradigms within a standard RISC pipeline processor architecture has raised some concerns as well as promising results. The results have proven that the performance of such an architecture has increased compared to the original simple one, as well as its energy consumption under certain conditions. Moreover, bearing in mind that this internship lasted only twelve weeks, the possibility of introducing several optimisations and the ideas raised for new features indicate that the concerned topic is indeed a promising one and deserves further and deeper research.

The ongoing study, carried out by Nuno Neves, will use the conclusions of this internship to integrate the conceived structures into a morphable and possibly multi-core architecture targeted at specific, high-computing-demand applications. The stream-processing paradigm will be exploited with more robustness, reducing some of the engine-control complexity.

Recommendation and Further Research

The atlas architecture presented a somewhat lower operating frequency. Once again, due to schedule constraints, the optimisation of the architecture could not be performed before the conclusion of this report. A careful inspection of its critical path is planned before the end of the internship.

Naturally, further optimisations must be implemented to broaden the compatibility of the data-flow engine with application kernels. Hence, the following features could be considered:

• Register lock after write, to prevent data loss from consecutive writes. This optimisation allows the constraints on the data-flow generation tool to be loosened and permits easier communication between non-adjacent lanes;

• Support for nested loops. This feature requires each lane to maintain the state of the higher loop levels while the lower levels are executing. It should be complemented with a balancing between the conventional and data-flow execution paradigms;

• Add support for conditional instructions, allowing conditional sections of loop code to be executed in the data-flow engine without requiring branch instructions;

• Ensure the software compatibility between different alpes configurations. For instance, a code compiled for a four-lane architecture might not execute properly on a two- or even an eight-lane architecture.


Bibliography

[1] Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Ugur Cetintemel, Mitch Cherniack, Jeong-Hyon Hwang, Wolfgang Lindner, Anurag Maskey, Alex Rasin, Esther Ryvkina, et al. The design of the Borealis stream processing engine. In CIDR Conference, volume 5, pages 277–289, 2005.

[2] ARM. ARM big.LITTLE technology: The future of mobile. Technical report, ARM White Paper, 2013.

[3] D. Capalija and T. S. Abdelrahman. A high-performance overlay architecture for pipelined execution of data flow graphs. In 2013 23rd International Conference on Field Programmable Logic and Applications, pages 1–8, Sept 2013.

[4] Jack Dongarra and Piotr Luszczek. Livermore Loops, pages 1041–1043. Springer US, Boston, MA, 2011.

[5] A. Fell, Z. E. Rákossy, and A. Chattopadhyay. Force-directed scheduling for data flow graph mapping on coarse-grained reconfigurable architectures. In 2014 International Conference on ReConFigurable Computing and FPGAs (ReConFig14), pages 1–8, Dec 2014.

[6] L. Gan, H. Fu, C. Yang, W. Luk, W. Xue, O. Mencer, X. Huang, and G. Yang. A highly-efficient and green data flow engine for solving Euler atmospheric equations. In 2014 24th International Conference on Field Programmable Logic and Applications (FPL), pages 1–6, Sept 2014.

[7] David Harris and Sarah Harris. Digital Design and Computer Architecture. Morgan Kaufmann Publishers, 1st edition, 2007.

[8] INESC-ID. 2015 annual report. Consulted online on 7 August 2017 at http://www.inesc-id.pt/ficheiros/publicacoes/12408.pdf.

[9] Qiwei Jin, Diwei Dong, Anson Tse, Gary Chow, David Thomas, Wayne Luk, and Stephen Weston. Multi-level customisation framework for curve based Monte Carlo financial simulations. Reconfigurable Computing: Architectures, Tools and Applications, pages 187–201, 2012.

[10] Tamar Kranenburg. Design of a portable and customizable microprocessor for rapid system prototyping. Master of Science thesis, Delft University of Technology, 2009.

[11] E. A. Lee and D. G. Messerschmitt. Synchronous data flow. Proceedings of the IEEE, 75(9):1235–1245, Sept 1987.

[12] Nuno Neves, Pedro Tomás, and Nuno Roma. Adaptive in-cache streaming for efficient data management. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 25(7):2130–2143, July 2017.

[13] Oliver Pell and Oskar Mencer. Surviving the end of frequency scaling with reconfigurable dataflow computing. SIGARCH Comput. Archit. News, 39(4):60–65, December 2011.

[14] David W. Wall. Limits of instruction-level parallelism. SIGARCH Comput. Archit. News, 19(2):176–188, April 1991.

[15] Andrew Waterman and Krste Asanovic. The RISC-V Instruction Set Manual. CS Department, University of California, Berkeley, 2.2 edition, 2017.

[16] Xilinx. MicroBlaze Processor Reference Guide, 2016.3 edition, 2016.

[17] Ying Xing, Stan Zdonik, and J.-H. Hwang. Dynamic load distribution in the Borealis stream processor. In Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on, pages 791–802. IEEE, 2005.


Appendix A

Binary Value Comparisons

Binary representation

As illustrated in Table A.1, an n-bit binary word may represent up to 2^n unsigned values. With the signed two's-complement method, the same n-bit binary word may represent signed values from −2^(n−1) to 2^(n−1) − 1.

Binary comparison

In order to generate the comparison signals unsigned less than and signed less than of two binary values, Table A.2 was conceived. The first two columns hold two arbitrary values used to build the Function column, i.e. the truth table, where A, B and C denote the most-significant bits of A, B and A − B, respectively. X values indicate an invalid input: a positive value minus a negative value always gives a positive result, and a negative value minus a positive value never gives a positive result. Applying the Karnaugh map method to simplify the equations (with ! denoting the complement) yields:

A <u B = !A.B + !A.C + B.C

A <s B = C

Combined with a signal is_zero, which indicates whether all bits of the subtraction result are zero — that is, whether the two values are equal — one may deduce the full set of comparisons, e.g. greater than, greater than or equal, less than or equal.

Table A.2: Columns A, B and A − B present two example values and their subtraction. u[A] and u[B] give the unsigned decoding of A and B, while s[A] and s[B] give the signed one. Column A <u B indicates the value of the unsigned comparison of A and B; likewise for A <s B. The Function column collects the most-significant (left-most) bits of the A, B and A − B columns, i.e. the inputs to the comparison functions.

A           B           A − B       Function   A <u B   A <s B   u[A]   u[B]   s[A]   s[B]
0000000010  0000000001  0000000001  000        false    false    2      1      2      1
0000000001  0000000010  1111111111  001        true     true     1      2      1      2
0000000111  1111111111  0000001000  010        true     false    7      1023   7      -1
0000000001  1111111000  0000001001  011        X        X
1111111111  0000000111  1111111000  100        X        X
1111111111  0000000010  1111111101  101        false    true     1023   2      -1     2
1111111111  1111111000  0000000111  110        false    false    1023   1016   -1     -8
1111111000  1111111111  1111111001  111        true     true     1016   1023   -8     -1
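As a sanity check, the comparison flags can be rebuilt in Python from only the three most-significant bits plus an is_zero signal and compared against the native operators. This is an illustrative sketch, not the report's VHDL; the signed flag is only meaningful when the subtraction does not overflow, which matches the X rows of the truth table:

```python
def compare_from_msbs(a: int, b: int, n: int) -> dict:
    """Derive comparison flags for two n-bit words from the MSBs of A, B and
    of the n-bit subtraction A - B, plus an is_zero signal (Appendix A)."""
    mask = (1 << n) - 1
    diff = (a - b) & mask                 # n-bit wrap-around subtraction
    A = (a >> (n - 1)) & 1                # MSB of A
    B = (b >> (n - 1)) & 1                # MSB of B
    C = (diff >> (n - 1)) & 1             # MSB of A - B
    is_zero = diff == 0                   # all result bits zero -> A == B
    ltu = bool((not A and B) or (not A and C) or (B and C))   # A <u B
    lts = bool(C)                         # A <s B, valid when A - B does not overflow
    return {"ltu": ltu, "lts": lts, "eq": is_zero,
            "geu": not ltu, "gtu": not ltu and not is_zero}
```

Checking every 4-bit pair confirms the unsigned flag exactly, and the signed flag on all non-overflowing pairs.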


Table A.1: A four-digit binary value may be interpreted as unsigned or signed according to the following table. The signed version ordinarily employed is 2-Complement, where the maximum number of distinct values can be represented.

Binary  Unsigned  Sign-Magnitude  1-Complement  2-Complement
0000    +0        +0              +0            +0
0001    +1        +1              +1            +1
0010    +2        +2              +2            +2
0011    +3        +3              +3            +3
0100    +4        +4              +4            +4
0101    +5        +5              +5            +5
0110    +6        +6              +6            +6
0111    +7        +7              +7            +7
1000    +8        -0              -7            -8
1001    +9        -1              -6            -7
1010    +10       -2              -5            -6
1011    +11       -3              -4            -5
1100    +12       -4              -3            -4
1101    +13       -5              -2            -3
1110    +14       -6              -1            -2
1111    +15       -7              -0            -1


Appendix B

risc-v Implemented Instructions

Table B.1: List of risc-v instructions ported to mb-lite, concerning the atlas processor from Chapter 2.

Type  Mnemonic  Assembler      Meaning
I     LB        Rd, Imm(Ra)    Rd_1B := s(M_1B[ Ra + s(Imm) ])
I     LH        Rd, Imm(Ra)    Rd_2B := s(M_2B[ Ra + s(Imm) ])
I     LW        Rd, Imm(Ra)    Rd := M[ Ra + s(Imm) ]
I     LBU       Rd, Imm(Ra)    Rd_1B := u(M_1B[ Ra + s(Imm) ])
I     LHU       Rd, Imm(Ra)    Rd_2B := u(M_2B[ Ra + s(Imm) ])
I     ADDI      Rd, Ra, Imm    Rd := Ra + s(Imm)
I     SLTI      Rd, Ra, Imm    Rd := ( Ra <s s(Imm) )
I     SLTIU     Rd, Ra, Imm    Rd := ( Ra <u s(Imm) )
I     XORI      Rd, Ra, Imm    Rd := Ra ^ s(Imm)
I     ORI       Rd, Ra, Imm    Rd := Ra | s(Imm)
I     ANDI      Rd, Ra, Imm    Rd := Ra & s(Imm)
R     SRLI      Rd, Ra, Imm    Rd := Ra >> Imm
R     SRAI      Rd, Ra, Imm    Rd := Ra >>> Imm
R     SLLI      Rd, Ra, Imm    Rd := Ra << Imm
U     AUIPC     Rd, Imm        Rd := PC + ( Imm << 12 )
S     SB        Rb, Imm(Ra)    M_1B[ Ra + s(Imm) ] := Rb_1B
S     SH        Rb, Imm(Ra)    M_2B[ Ra + s(Imm) ] := Rb_2B
S     SW        Rb, Imm(Ra)    M[ Ra + s(Imm) ] := Rb
R     ADD       Rd, Ra, Rb     Rd := Ra + Rb
R     SUB       Rd, Ra, Rb     Rd := Ra + !Rb + 1
R     SLL       Rd, Ra, Rb     Rd := Ra << Rb[4:0]
R     SLT       Rd, Ra, Rb     Rd := ( Ra <s Rb )
R     SLTU      Rd, Ra, Rb     Rd := ( Ra <u Rb )
R     XOR       Rd, Ra, Rb     Rd := Ra ^ Rb
R     SRL       Rd, Ra, Rb     Rd := Ra >> Rb[4:0]
R     SRA       Rd, Ra, Rb     Rd := Ra >>> Rb[4:0]
R     OR        Rd, Ra, Rb     Rd := Ra | Rb
R     AND       Rd, Ra, Rb     Rd := Ra & Rb
R     MUL       Rd, Ra, Rb     Rd := Ra * Rb
R     MULH      Rd, Ra, Rb     Rd := ( Ra * Rb ) >> 32 (signed)
R     MULHSU    Rd, Ra, Rb     Rd := ( Ra signed * Rb unsigned ) >> 32 (signed)
R     MULHU     Rd, Ra, Rb     Rd := ( Ra * Rb ) >> 32 (unsigned)
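The register-transfer notation used in the table — s() for sign extension, >> for logical and >>> for arithmetic right shift — can be mimicked with a few Python helpers. The function names below are illustrative, not part of the atlas implementation:

```python
MASK32 = 0xFFFFFFFF

def s(value: int, bits: int = 12) -> int:
    """Sign-extend an immediate of the given width (the s() operator above)."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def addi(ra: int, imm: int) -> int:
    """ADDI: Rd := Ra + s(Imm), modulo 2**32."""
    return (ra + s(imm)) & MASK32

def srl(ra: int, shamt: int) -> int:
    """SRL: logical right shift (>>) inserts zeros from the left."""
    return (ra & MASK32) >> (shamt & 0x1F)

def sra(ra: int, shamt: int) -> int:
    """SRA: arithmetic right shift (>>>) replicates the sign bit."""
    signed = ra - (1 << 32) if ra >> 31 else ra
    return (signed >> (shamt & 0x1F)) & MASK32
```

For example, addi(5, 0xFFF) yields 4, since the 12-bit immediate 0xFFF sign-extends to -1.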


Table B.2: Continuation of Table B.1.

Type  Mnemonic  Assembler      Meaning
U     LUI       Rd, Imm        Rd := Imm << 12
B     BEQ       Ra, Rb, Imm    PC := ( Ra == Rb ) ? PC + s(Imm) : PC + 4
B     BNE       Ra, Rb, Imm    PC := ( Ra != Rb ) ? PC + s(Imm) : PC + 4
B     BLT       Ra, Rb, Imm    PC := ( Ra <s Rb ) ? PC + s(Imm) : PC + 4
B     BGE       Ra, Rb, Imm    PC := ( Ra >=s Rb ) ? PC + s(Imm) : PC + 4
B     BLTU      Ra, Rb, Imm    PC := ( Ra <u Rb ) ? PC + s(Imm) : PC + 4
B     BGEU      Ra, Rb, Imm    PC := ( Ra >=u Rb ) ? PC + s(Imm) : PC + 4
I     JALR      Rd, Ra, Imm    Rd := PC + 4; PC := ( Ra + s(Imm) ) & 0xfffffffe
J     JAL       Rd, Imm        Rd := PC + 4; PC := PC + s(Imm)
C     CSRRW     Rd, csr, Ra    Rd := MCSR(csr); MCSR(csr) := Ra
C     CSRRS     Rd, csr, Ra    Rd := MCSR(csr); MCSR(csr) := Ra or MCSR(csr)
C     CSRRC     Rd, csr, Ra    Rd := MCSR(csr); MCSR(csr) := !Ra and MCSR(csr)
C     CSRRWI    Rd, csr, Imm   Rd := MCSR(csr); MCSR(csr) := Imm
C     CSRRSI    Rd, csr, Imm   Rd := MCSR(csr); MCSR(csr) := Imm or MCSR(csr)
C     CSRRCI    Rd, csr, Imm   Rd := MCSR(csr); MCSR(csr) := !Imm and MCSR(csr)

Table B.3: Instructions created for the Automatic Loop Execution, concerning the andes and alpes processors from Chapters 3 and 4 respectively.

Type  Mnemonic  Assembler   Meaning
I     LPCFG     Ra, Imm     Loop configuration; size := Imm; iter := Ra
I     LPST      Imm         Loop activation; MapOffset := Imm


Appendix C

Detailed Microarchitecture Designs

Figure C.1: mb-lite’s original microarchitecture.

Figure C.2: atlas’ final microarchitecture. Green signals and components correspond to addedhardware compared to the original mb-lite microarchitecture.


Figure C.3: andes’ final microarchitecture. Green signals and components correspond to addedhardware compared to the atlas’ microarchitecture.

Figure C.4: alpes’ final microarchitecture. Green signals and components correspond to addedhardware compared to the andes’ microarchitecture.


Appendix D

Processors Performance Summary

                     Available   mb-lite      atlas        andes        alpes         alpes         alpes         alpes
                                                           One lane     Two lanes     Four lanes    Eight lanes   Sixteen lanes

Clock period (ns)    -           9.583        15.567       12.774       14.911        16.883        17.386        19.011
Frequency (MHz)      -           104.35       64.23        78.28        67.06         59.23         57.51         52.60

Registers            607,200     343 (.1%)    362 (.1%)    2,273 (.4%)  8,709 (1.4%)  16,979 (2.8%) 33,822 (5.6%) 67,298 (11.1%)
LUTs                 303,600     1,165 (.4%)  1,169 (.4%)  1,886 (.6%)  9,580 (3.2%)  18,271 (6.0%) 36,343 (12.0%) 71,885 (23.7%)
Slices               75,900      463 (.6%)    587 (.8%)    1,226 (1.6%) 4,796 (6.3%)  10,584 (13.9%) 20,221 (26.6%) 38,955 (51.3%)
BRAM                 2,060       3 (.2%)      2 (.1%)      2 (.1%)      0 (0%)        0 (0%)        0 (0%)        0 (0%)

Static Power (mW)    -           241.28       241.07       241.11       240.88        241.12        241.45        241.76
Dynamic Power (mW)   -           96.65        70.39        75.99        47.06         80.13         124.89        167.05
Total Power (mW)     -           337.93       311.46       317.10       287.94        321.25        366.34        408.81

Table D.1: Timing constraints, resource usage and power consumption for the processors addressed in this report, implemented on a Xilinx Virtex-7 FPGA device.
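The Frequency row is simply the reciprocal of the Clock period row; a one-line helper (illustrative, not part of the report's tooling) makes the conversion explicit:

```python
def frequency_mhz(period_ns: float) -> float:
    """Convert a clock period in nanoseconds to a clock frequency in MHz."""
    # 1 / (T * 1e-9 s) Hz = (1000 / T) MHz
    return 1000.0 / period_ns
```

For example, the 9.583 ns mb-lite period corresponds to 1000/9.583 ≈ 104.35 MHz, as listed in the table.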


Appendix E

Acronyms

ALU Arithmetic Logic Unit

CPI Clock cycles per instruction

CPU Central Processing Unit

DAG Data-Flow Graph

DOF Dispatch/Operand Fetch stage

EXE Execute stage

FPGA Field-Programmable Gate Array

GPP General-Purpose Processor

ID Instruction Decode stage

IF Instruction Fetch stage

INESC-ID Institute of Systems and Computer Engineering, Research and Development

ISA Instruction Set Architecture

LLVM Low Level Virtual Machine

MEM Memory Access stage

PC Program Counter

RF Register File

RISC Reduced Instruction Set Computer

SiPS Signal Processing Systems group

VHDL VHSIC Hardware Description Language

VHSIC Very High Speed Integrated Circuit

WB Write Back stage
