
Journal of Network and Computer Applications (1997) 20, 223–252

An asynchronous approach to efficient execution of programs on adaptive architectures utilizing FPGAs

Sumit Ghosh

Department of Computer Science & Engineering, Arizona State University, Tempe, AZ 85287, USA

Computers are broadly classified into two classes: general-purpose and special-purpose. General-purpose computers provide tolerable performance on a wide range of programs. In contrast, specialized computers, tailored to a narrow class of programs, usually yield significantly higher throughput. However, specialized computer architectures are limited in availability, inflexible, require special programming environments, and are, in general, expensive. Both classes are limited in that they utilize a ‘fixed’ hardware architecture, i.e. their designs, conceived at creation, are unchanged during their lifetime. PRISM-I introduced a new concept, wherein a custom architecture is dynamically created to execute a specific high-level program in a faster and more efficient manner. While one component of this architecture is a traditional general-purpose processor, the other is automatically synthesized from a collection of FPGAs by a configuration compiler. Speed-up is achieved by executing key sections of the high-level program on the synthesized hardware, while the remainder of the program executes on the core processor. While PRISM-I developed a proof-of-concept platform, it is significantly limited to simple programs. This paper introduces a significant conceptual advancement, PRISM-II, which synthesizes asynchronous, adaptive architectures for complex programs, including those that contain iterative ‘loop’ structures with dynamic loop counts. PRISM-II introduces a novel execution model, wherein an operator-network and controller are synthesized for key sections of a high-level program. The operator-network, custom-synthesized from FPGAs, executes the key sections in a data-flow manner, i.e. any instruction is executed as soon as its input operands are available. The controller controls the computations in the operator-network, accurately determines when the execution is complete by utilizing key principles developed in PRISM-II, and generates an end-of-computation signal to asynchronously inform the core-processor to fetch the results from the FPGA platform. While the realization of a general-purpose data-flow architecture has continued to be difficult in the architecture community, the PRISM-II approach promises asynchronous, data-flow execution of programs on custom synthesized FPGA hardware. © 1997 Academic Press Limited

1. Introduction

Since their advent, the performance of computers has increased exponentially, roughly an order of magnitude speed-up every 8–10 years [1]. The performance increase can be attributed to both technology-dependent and technology-independent improvements. While a few of the key technological advances include VLSI technology, semiconductor memories [2], and advances in silicon technology, important technology-independent advances include advanced compiler techniques, caching, pipelining, RISC, and vectorization. Given that VLSI and semiconductor technologies are fast approaching

Email: [email protected]

1084–8045/97/030223+30 $25.00/0 ma970052 1997 Academic Press Limited


the limits of physics, technology-independent techniques are increasingly assuming a prominent role. This paper presents a novel technology-independent mechanism.

Along with the ever increasing demand for higher performance, the diversity of programs executed on computers has increased tremendously. While in the early days of ENIAC [3] a computer was principally used for number-crunching applications like computing artillery firing tables, today computers are used for scientific number-crunching applications, mission-critical business applications, such as air-line reservation systems, banking and medical applications, engineering and design applications, e.g. CAD/CAM, complex simulation and animation tasks, e.g. virtual reality, and ubiquitous applications, such as word processing, spread-sheets and games. Each of these applications requires unique organization of the computational, input–output, memory and communication subcomponents. That is, every application requires an architecture that has been tailored specifically to address the needs of the application and execute it most efficiently. However, such an architecture, tailored to a specific application, may not yield either a modest or even acceptable performance for other applications. Economics and flexibility of architecture trade-offs have led to the design of general-purpose and special-purpose computers. While general-purpose computers cost relatively less and provide tolerable performance on a wide mix of application programs, special-purpose machines are expensive, yield excellent performance on a specific application or a class of applications, but execute poorly on other applications.

The literature records several efforts to integrate elements of special-purpose architectures into a general-purpose framework. These efforts include the attachment of special hardware substructures, proposed by Estrin [4], enhancing processor instruction sets with specialized complex instructions [5, 6], dynamic microprogramming [7–10], utilizing reconfigurable computing elements [11–14], and the use of co-processors [15].

Unfortunately, all of the above efforts suffer from one or both of the following fundamental limitations:

• the integration effort is determined at design time, is permanent throughout its life, and therefore incapable of addressing new application programs;

• the integration effort is neither automatic nor transparent to the programmer, and therefore the programmer must possess knowledge of the processor architecture and hardware design.

PRISM-I [16] is perhaps the first attempt that effectively addresses the above limitations by demonstrating a proof-of-concept system. In it, the integration effort may be easily adapted to a large class of application programs and it is transparent to the programmer. PRISM-I permits the realization of specialized architectures for maximum execution performance of every individual application program. The underlying philosophy of PRISM-I is to exploit the notion of ‘locality of reference’ [3], which states an empirical finding that most programs spend 90% of their execution time in 10% of the code [17]. PRISM-I aims to expend effort and resources to increase performance of the smaller and frequently-executed sections of the program, termed ‘critical sections’, as opposed to the remainder of the less-frequently-executed sections.

In PRISM-I, shown in Fig. 1, reconfigurable hardware is used to augment the functionality of a general-purpose, core processor. A set of Field Programmable Gate Arrays (FPGAs) constitutes the reconfigurable hardware.


Figure 1. The PRISM-I approach. (Blocks shown: a general-purpose CPU and a reconfigurable platform connected by the system bus.)

The reconfigurable hardware may be dynamically configured to execute the critical sections of an application program quickly. The less-frequently-accessed sections are executed on the core processor. The overall impact of PRISM-I is improved execution performance.

To achieve its goals, the synthesized hardware in PRISM must execute the critical sections faster than a general-purpose processor. This requirement, in turn, translates into several low-level requirements for the synthesized hardware, namely, (i) simplicity, (ii) minimal communication overhead, and (iii) the exploitation of fine-grain, i.e. operator-level, parallelism. Requirements (i) and (ii) are particularly true for today's FPGAs, with lower gate counts and slower speeds. The need to exploit fine-grain parallelism is important because of the frequently-encountered small-sized critical sections.

A substantial amount of work in hardware synthesis has been reported in the literature. This section reviews the research into hardware synthesis tools and data-flow computational models. Trickey [18] presents a ‘hardware’ compiler that translates high-level behavioral descriptions of digital systems, in Pascal, into hardware, subject to a user-specified cost function. Lanneer and colleagues [19] report the CATHEDRAL high-level synthesis environment for automated synthesis of IC architectures for real-time, digital signal processing. IBM's HIS system [20] translates a behavioral description of a synchronous digital system specified in VHDL into a resulting design consisting of a finite state machine and a datapath, both described in the output language BDL/CS. The Cyber system [21] aims to compile software programs in C and BDL into ASIC chips, called ‘software chips’. Its first targets are synchronous control dominated applications and processors implemented in ASICs. In addition, Camposano [22] and Walker [23] report a survey of different high-level synthesis systems.

Most of the approaches reported in the literature differ with regard to the high-level hardware description language, optimization and transformation techniques, and scheduling and allocation algorithms. However, they share a common underlying model of execution, namely, a synchronous digital machine that consists of a datapath and a controller that is governed by a finite state machine. The execution of the machine is organized through basic time units, termed control steps, that correspond to the states of the finite state machine. The datapath consists of a set of functional units, e.g. arithmetic logic units, multiplexors, registers, storage units, and buses. The controller is either microcoded or hardwired and executes the instructions sequentially. It also controls the computation of data in the functional units and the transfer of data and results to and from the storage units. Two key limitations of the synchronous approach, similar to those for Von Neumann's ‘stored program control’, are:

• a centralized controller imposes strict sequential execution of all instructions. As a result, a preceding instruction whose operands are not yet available may prevent the execution of the subsequent instructions even when the operands of the latter are available [24]. This clearly results in the failure to exploit potential parallelism that may be inherent in the program. An added problem is the reduced ability for the processor to tolerate latency in fetching operands from the storage units [25]. That is, the processor has to wait for each operand fetch to be completed before initiating the computation. Techniques including the use of large register sets, caches, and pipelines aim to reduce the adverse impact of latency.

• intermediate results or data are passed between successive instructions through the use of registers and storage units. This indirect mode of transfer not only slows down the communication of data among instructions, but may also cause side effects. Because of the latter, for correctness, external synchronization must be imposed on operand fetches, which may impose additional overhead on the overall hardware execution.

In contrast, the data-flow computational model is based on two fundamental principles [26]:

• A.1: an operation can be executed as soon as all its required operands are available.
• A.2: all operations are functions, i.e. there are no side effects arising from the intermediate results and data being stored in registers and storage units.
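As a small illustration of these two principles (hypothetical code, not drawn from the paper), consider the C fragment below. The addition and the subtraction have no mutual data dependence, so under A.1 they may fire concurrently, and under A.2 their intermediate results flow directly to the multiplication without being stored in registers or storage units.

int dataflow_example(int a, int b, int c, int d)
{
    /* '+' and '-' depend only on the inputs and may execute in parallel (A.1);
       '*' fires as soon as both intermediate results arrive, and those results
       are passed directly between operators rather than through storage (A.2). */
    return (a + b) * (c - d);
}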

The principle A.1 permits one to take advantage of the fine-grain parallelism that may be inherent in a program, and enhances the processor's ability to tolerate operand fetch latency. In theory, A.2 enables faster sharing of intermediate results or data between successive instructions. Although the data-flow model apparently promises higher hardware execution performance, in reality there are several limitations.

First, general-purpose data-flow architectures, proposed in the literature [27, 28], are very complex. They require complex mechanisms for labeling tokens, storing data, and communication between successive instructions. They impose high demand for silicon, because of which it is unrealistic to implement them on the current FPGAs that feature modest gate counts. Second, general-purpose data-flow architectures involve significant overheads in terms of execution time, implying a slow rate of providing operands to the individual processing elements. Consequently, to outperform the conventional processors, general-purpose data-flow architectures require parallelism of the order of several hundred concurrently executable instructions in the programs [26].

The limitations of both hardware synthesis tools and general-purpose data-flow architectures make them poor candidates for flexible, high-performance FPGA-based architectures, which demand (i) inexpensive hardware platforms, (ii) maximum exploitation of parallelism inherent in critical sections, and (iii) minimal implementation overheads. While PRISM-I [16] aims to address a few of these limitations, it suffers from the following critical weaknesses:

• the reconfigured hardware is constrained to evaluate functions within the elapsed time of a single bus-cycle of the core processor. As a result, critical sections whose critical-path delays exceed the core processor bus cycle cannot be synthesized on the reconfigurable platform.

• inefficient execution of critical sections that contain control constructs, e.g. ‘if-then-else’. In general, control constructs may imply multiple possible execution paths that, in turn, require different execution times depending on the actual execution semantics and input data. In PRISM-I, the hardware design always chooses the longest of the execution times of the different possible paths which, while conservative, implies inefficiency.

• inability to synthesize loops with dynamic loop-counts, which eliminates a large class of programs requiring iterative computations.

This paper presents PRISM-II, a new execution model and a novel architecture that addresses the key weaknesses of PRISM-I. The execution model facilitates the exploitation of maximal fine-grain parallelism in the critical sections without imposing rigid sequentiality. The architecture addresses critical sections that may require arbitrary execution times, contain control constructs such as ‘if-then-else’ and ‘switch-case’, and loop constructs with static and dynamic loop counts. As with PRISM-I, PRISM-II accepts user programs written in C.

The remainder of the paper is organized as follows: section 2 presents an overview of PRISM-II highlighting the configuration compiler and the hardware platform; section 3 introduces a framework for the configuration compiler and presents a detailed discussion on the mechanism to translate a critical section into executable code; section 4 presents details on the algorithm for the translator and illustrates with an example; section 5 presents the conclusions and suggestions for future work.

2. The PRISM-II approach: overview

The PRISM-II approach consists of two principal components—the hardware platform, which ultimately executes the application program, and the configuration compiler, which accepts an application program in C and translates it into executable code for the hardware platform.

2.1 Hardware platform

The hardware platform, shown in Fig. 2, consists of (i) a core processor, namely the AMD AM29050 RISC processor [29], and (ii) a reconfigurable platform that consists of a set of FPGAs that are interconnected to the core processor through the latter's co-processor interface.

The hardware platform design addresses two key limitations of PRISM-I. First, unlike PRISM-I, which requires between 45 and 75 clock cycles, 100 ns in length, to access a component of the synthesized hardware, PRISM-II requires only 30 ns. In addition, while the length of the data transfer in PRISM-I is 32 bits, that for PRISM-II is 64 bits for inputs and 32 bits for outputs. Second, unlike PRISM-I, where an operation must fit in a single FPGA, PRISM-II permits an operation to utilize up to three FPGAs through partitioning the data-flow graph. The fast AMD AM29050 processor is selected to enable a reasonable balance between hardware and software performance. The AM29050 processor, clocked at 33 MHz, can provide roughly 28 MIPS performance. In addition, it has a built-in floating point unit, which is important, since the area expense necessary to synthesize it on the reconfigurable platform is high. Data transfer to and from the FPGAs is supported in the form of 64-, 32-, 16- and 8-bit quantities to facilitate hardware specifications in a high-level language.


Figure 2. PRISM-II hardware platform. (Blocks shown: the Am29050-33 core processor with MMU, FPU, timer and cache; a burst-mode memory controller (V3); boot ROM; timer, communications and PIO; interleaved DRAM banks A and B; a status display; a bus exchanger/latch; and reconfigurable platforms 1, 2 and 3, attached to the 32-bit address, data and instruction buses.)

The Xilinx 4010 FPGA [30] provides 160 general purpose I/O pins which provides for several 8-bit buses. It is expected that the application programs, implemented on PRISM-II, will have high data fan-in, i.e. a large number of inputs. The fan-in may be viewed as a manifestation of function calls that accept several arguments and return a single result.

2.2 The configuration compiler

Figure 3 shows an overview of the PRISM-II architecture. In it, the configuration compiler accepts a user program in C as input, and generates hardware and software images. The hardware image contains information that is necessary for synthesizing hardware corresponding to the critical sections, on the reconfigurable platform. The software image contains executable code that realizes the execution of the critical sections on the reconfigurable platform and the remainder of the program on the AM29050. Both hardware and software images are generated automatically without intervention from the user. A current underlying assumption in PRISM-II is that the critical section(s) of an application program is identified a priori, by the programmer.

The configuration compiler consists of two principal components—(i) a C parser and optimizer, and (ii) a hardware synthesizer.


Figure 3. Overview of PRISM-II architecture. (Blocks shown: the C program, the configuration compiler, a standard C compiler, the hardware image, and the software image.)

The parser and optimizer constitute the front end, while the hardware synthesizer forms the back end. To reduce development time, the parser and optimizer of the GNU C compiler (gcc) are utilized as the starting point and are significantly modified to adapt to the needs of this research. The hardware synthesizer constitutes the core of the synthesis subtask, and is being designed and developed in this investigation. Figure 4 presents the structure of the configuration compiler including its components and the flow of information, and is described in greater detail in the subsequent sections of the paper.

3. A framework for the configuration compiler

This section presents a framework for the PRISM-II configuration compiler. The critical section is first translated into an intermediate representation and then synthesized onto the FPGA platform. A novel execution model is proposed for developing the architecture of the synthesized hardware. Among its key advantages over existing execution models, used in traditional high-level synthesis architectures and data-flow machines, the proposed execution model exploits fine-grain parallelism inherent in the critical section, requires minimum data and control communication overheads, and imposes low implementation cost. The designs of the intermediate representation and the execution model both utilize Control Flow Graphs (CFG) and Data Flow Graphs (DFG) of the critical section.

3.1 Control Flow Graphs

The translation begins with a block-level CFG of the critical section. A CFG is a directed graph in which the nodes represent ‘basic blocks’ [31] and the edges represent the flow of control. For example, an edge from node X to node Y indicates that execution of block X may be followed immediately by execution of block Y. A basic block is a sequence of consecutive statements of the section in which the flow of control enters at the beginning and leaves at the end without halting or branching, except at the end. A total of five basic types of nodes are conceivable:

• Start: a unique node that has no incoming edges. It represents the start of the computation.


Figure 4. The structure of the configuration compiler. (Blocks shown: program source (C); GCC front end, performing parsing and standard optimizations; RTL; flow graph generation; the hardware synthesizer, comprising machine synthesis and X-BLOX netlist generation; XNF; the Xilinx tools (PPR etc.); and the hardware images. GCC = GNU C Compiler, RTL = Register Transfer Language, XNF = Xilinx Netlist Format.)

• Stop: a unique node that has no outgoing edges. It represents the end of the computation.

• Sequential: it has only one outgoing edge.
• Predicate: it has at least two outgoing edges.
• Merge: it has at least two incoming edges.

A complex node is a combination of two or more basic node types. For instance, a ‘start’ node may also be a ‘predicate’ node. A ‘stop’ node may also be a ‘merge’ node. Figures 5(a) through 5(c) present an example function, its CFG, and its DFG. In Fig. 5(b), the node labeled ‘predicate-block’ is both the start node and a predicate node. The nodes labeled ‘then-block’ and ‘else-block’ are both sequential nodes. The node labeled ‘join-block’ is both a merge node and the stop node.


if_example(int a, int b)
{
    int c = 0;

    if (a < 0) {
        c = a + b;
        c = c | 4;
    } else {
        c = a * b;
        c = c & 4;
    }

    return c;
}

Figure 5. (a) An example function; (b) the CFG, in which the basic blocks are the predicate-block (execution time 25), the then-block (75), the else-block (525) and the join-block (0), giving cumulative path delays of 100 through the then-block and 550 through the else-block; (c) the DFG, with operator nodes mult (500), plus (50), or (25), and (25), ge (50) and mux (25), the input registers a and b, constant nodes, and the result; the parenthesized numbers are operator execution times.

The number at the bottom of each node represents the time units necessary to execute the corresponding block, while that on an edge represents the cumulative execution time up to the point where the edge emanates.

The CFG of a function encapsulates the flow of control information in the function, and a particular instance of execution of a function is expressed through a path from the ‘start’ to the ‘stop’ node. A path in a CFG is an ordered set of statements executed in a sequence. A delay, associated with each path, represents the total time required to execute all the statements in the path, sequentially. The exact path followed during an instance of execution is a function of the input data. Of the two paths in Fig. 5(b) between the ‘start’ and ‘stop’ nodes, the path through the ‘then-block’ requires 100 time units, while that through the ‘else-block’ requires 550 time units. The value of the parameter ‘a’ determines the actual path. In essence, the CFG provides information on the different possible execution paths of a function.

The CFG representation of a function is limited, in that its underlying execution model is the sequential, von Neumann architecture. Consequently, it fails to expose any inherent parallelism—coarse-grain, i.e. block-level, and fine-grain, i.e. operator- or statement-level. For example, in Fig. 5(b), one may not obviously conclude whether the first and second statements in the ‘then-block’ may execute concurrently.

3.2 Data Flow Graphs (DFG)

A block-level DFG of the critical section is utilized to address the limitations of the CFG. The DFG is a directed graph wherein the nodes and edges represent the operators and the flow of values among them. The DFG, utilized in this paper, differs from the traditional DFG, in that it contains, as explained subsequently, multiplexor and latch operators. A total of five different types of nodes are conceivable in a DFG:

• Constant: represented through a circle, as shown in Fig. 5(c).
• Unary operator: it accepts only one input and generates a result, e.g. − and NOT.
• Binary operator: it accepts two inputs and generates a result, e.g. +, −, AND, and mult.
• Ternary operator: it accepts three inputs to generate a result, e.g. multiplexor.
• Latches: it accepts two inputs—a data value and clock, and generates the latched value at its output.
• Input/output registers: these respectively store the inputs and outputs of a critical section and are represented through shaded rectangles as shown in Fig. 5(c).

It is noted that the DFG that is propagated to the ‘machine synthesis’ module from the ‘flow graph generation’ module in Fig. 4 is basic and does not contain multiplexor and latch operators. The ‘machine synthesis’ module adds these operators to derive a ‘complete DFG’.

Figure 5(c) presents the DFG representation of the critical section in Fig. 5(a). The unary and the binary operators represent the unary and binary arithmetic and/or logic operations, respectively. Two types of multiplexor operators are supported—‘merge-mux’ and ‘loop-mux’. A multiplexor selects one of two inputs, as dictated by a third select input. The ‘merge-mux’ selects one of two definitions of a variable arriving at a merge-node in a CFG. The ‘loop-mux’, present at the top of a loop, selects either the initial value of a variable or its fed-back value from a subsequent loop iteration. A latch temporarily stores a fed-back value of a variable from a loop iteration.
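In C terms, the merge-mux at the join-block of Fig. 5(b) corresponds to a conditional selection between the two reaching definitions of ‘c’ (a hypothetical restatement for illustration only, not output of the configuration compiler):

int merged_c(int a, int b)
{
    int c_then = (a + b) | 4;          /* definition of c from the then-block */
    int c_else = (a * b) & 4;          /* definition of c from the else-block */
    return (a < 0) ? c_then : c_else;  /* merge-mux driven by the predicate   */
}

Note that the mux may fire as soon as the predicate and the selected definition are available; as discussed in section 4, availability of the other definition is not required.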

Given that the DFG representation of a critical section is based on the data-flow model of execution [27], i.e. there is a lack of a central locus of control, a node of a DFG may execute as soon as all its input operands are available. Upon execution, a node places the result on its outgoing edges which carry the value to other operators. Thus, an operator never stalls the execution of another operator unless its output serves as the latter's input operand. Unlike the CFG, the DFG does expose the fine-grain parallelism inherent in the critical section, and eliminates the performance-retarding side-effects [26] by converting all operations into functions.

Clearly, the DFG fails to capture any control information inherent in the critical section in scenarios such as loops which involve iterative computation. In the execution of a loop, additional information is required to control the iterations and correctly feed back the values to subsequent iterations. A number of techniques such as tagged tokens and waiting–matching [27], that are proposed in the literature on data-flow architectures, are very complex, expensive in terms of hardware, and require substantial implementation overheads. Such techniques are also inappropriate for an FPGA-based architecture, where hardware is limited and high implementation overheads are unacceptable.

3.3 Model of execution

PRISM-II's objectives include: (i) the exploitation of fine-grain parallelism; (ii) providing primitives for addressing iterative computations; (iii) efficient execution of control constructs; and (iv) implementation on FPGAs that is efficient yet inexpensive. These objectives are encapsulated in the proposed execution model which provides the underlying basis for hardware synthesis on the FPGA platform.

The execution model consists of two principal components—(i) the ‘operator-network’, and (ii) the ‘controller’. The operator-network is a specific organization of arithmetic and logic units intended to perform the actual computation, and is an instantiation of the DFG on the FPGAs. The controller, a finite state machine, controls the computation in the operator-network, determines when the computation is complete, and generates the ‘end-of-computation’ signal at the conclusion of the computation. The ‘end-of-computation’ signal provides an asynchronous means of informing the core-processor to fetch the results from the FPGA platform. Figure 6 presents a pictorial view of the execution model.

Prior to initiating execution, the input values are loaded into the input registers and the controller is initialized. The operator-network is initiated as soon as inputs are available. That is, operators execute as soon as their input operands are available and intermediate results and data are communicated from one operator to the subsequent operator directly through dedicated connections, thereby implying minimal communications overheads.

The controller plays the role of a timer that tracks the execution time of the operations. The controller stores information on the execution delays along all possible execution paths in the operator-network, utilizes specific intermediate data values generated inside the operator-network, and tracks the actual execution path taken by the current instance of execution. The data values may include the predicates for the conditionals and predicates for loops. The presence of control constructs including ‘if-then-else’ generates the possibility of multiple execution paths during the execution of a critical section. Where each of the possible execution paths may require a different execution time, it is important, for efficiency, that the controller tracks the actual path to precisely account for the time taken for an execution. At the end of the duration of the ‘tracked’ path, the controller latches the results in the output registers and generates the ‘end-of-computation’ signal.



Figure 6. The PRISM-II execution model. (Blocks shown on the FPGA platform: an input latch, the operator-network, the controller (FSM) and an output latch; the main processor, an AMD 29050, supplies the input data and receives the final result and the End_of_Computation control signal.)

The execution of a loop with dynamic loop count involves three phases—‘initialization’, ‘execution of the body’, and ‘feed-back’. In the ‘initialization’ phase, the controller signals all the ‘loop-muxes’, located at the top of the loop, to select the initial values of the input variables to the loop. This phase is executed only once during the entire execution of the loop. In the ‘execution of the body’ phase, the operator-network executes the code segment in the loop-body in a data-flow manner. The controller tracks the time required for this phase. In the subsequent ‘feed-back’ phase, the controller first examines the boolean value generated by the loop-predicate, i.e. the loop exit condition, to determine whether to iterate the loop further or exit. Where the controller decides to iterate the loop, it generates appropriate signals to latch the values of all intermediate state variables that are fed back to the loop. The controller also generates signals to the ‘loop-muxes’, at the top of the loop, to select the feed-back values of the intermediate variables. Following the ‘feed-back’ phase, the ‘execution of the body’ phase may be executed again and the cycle continues until termination. To address the issue of nested-loops, an inner-loop is considered a part of the body of the outer-loop, and section 4 presents further details on the synthesis and use of the corresponding operator-network and controller.
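The C sketch below is a purely software mimic of these three phases for a simple do-while loop; the names (select_initial, latch_s) are hypothetical and the sketch is intended only to show which value the loop-mux selects in each phase.

/* Software mimic of the three loop-execution phases for:
       s = 0;  do { s = s + step; } while (s < limit);                         */
int run_loop(int limit, int step)
{
    int latch_s = 0;        /* latch holding the fed-back value of s           */
    int select_initial = 1; /* controller signal driving the loop-mux          */

    for (;;) {
        /* 'initialization' (first pass only) or 'feed-back' selection: the
           loop-mux picks either the initial value or the latched value of s.  */
        int s = select_initial ? 0 : latch_s;

        /* 'execution of the body': operators fire as operands become available */
        int next_s = s + step;
        int iterate = (next_s < limit);   /* loop predicate, i.e. exit condition */

        /* 'feed-back': latch the intermediate state and re-enter, or exit      */
        if (!iterate)
            return next_s;
        latch_s = next_s;
        select_initial = 0;
    }
}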

In essence, the controller permits the operator-network to execute asynchronously, independently, and in a data-flow manner, while asserting minimal control over the latter's operations. Every operator in the operator-network executes as fast as possible, limited only by the inherent data dependencies. Given that the data communication overheads among the operators are minimal, the PRISM-II approach has the potential to exploit maximal fine-grain parallelism inherent in the critical section, and result in high performance relative to sequential execution of the critical section. In addition, in the proposed execution model, hardware is required to implement only the operator-network and controller. The execution model is novel and unique in many respects. First, unlike PRISM-I [16], which limits the execution time of a critical section to the time period of the core-processor bus cycle, PRISM-II permits fast evaluations of critical sections requiring arbitrary execution times. Second, the PRISM-II execution model successfully addresses the issue of executing loops with dynamic loop counts. Third, the proposed execution model possesses several important advantages over existing models proposed in traditional high-level synthesis architectures [18–20] and [21]. Traditional high-level synthesis architectures utilize centralized controllers that, in turn, impose strict sequential execution of instructions, resulting in the failure to exploit potential fine-grain parallelism. In contrast, such restrictions are absent in PRISM-II. Fourth, unlike the slow mechanism of communicating intermediate results and data between successive instructions or operators through registers and storage units, the novel execution model permits direct communication of data among the operators.

Throughout the literature, data-flow architectures [27, 28] have always promised high performance for general-purpose programs but have failed to meet the expectations. For general-purpose programs, they require complex mechanisms for token-labeling, storing, and communicating data which, in turn, pose high demand for silicon area and significant execution overheads. As a result, to compete effectively with a conventional uniprocessor, a general-purpose data-flow architecture requires programs with inherent parallelism in the hundreds [26]. The proposed execution model fully exploits data-flow concepts, namely A.1 and A.2 of section 1, and yet achieves high performance for critical sections of limited sizes.

3.4 Intermediate representation

To facilitate its analysis and efficient execution, a critical section is expressed within the configuration compiler through an intermediate representation. The aim of the intermediate representation is a technology-independent expression of the critical section which is based on the execution model described above. Once the intermediate representation is available, customized hardware can be synthesized from it by using the FPGA vendor tools.

The intermediate representation in PRISM-II is a ‘machine graph’ that consists of two principal components—a DFG and a Finite State Machine (FSM). The ‘operator-network’ and the ‘controller’ described in the previous section are instantiated on the FPGAs from the DFG and FSM, respectively. In the rest of the paper, operator-network and DFG, and controller and FSM, may be used interchangeably, and the meaning would be clear from the context. The DFG and FSM represent the computational and the control aspects of the critical section respectively and maintain links to each other. While the DFG purely represents the computational aspects of the critical section, the FSM, derived from the CFG, represents the control flow information. In essence, the FSM behaves as a timer that provides timing signals corresponding to each of the possible paths through the CFG. The FSM is represented through a directed graph, wherein the nodes represent states and the arcs represent state transitions. A duration is associated with every state, and a transition from a state occurs at the end of this duration. The next state is determined by selecting the appropriate arc emanating from the state, based on the inputs. The input to a state is either a value obtained from the DFG or an internal signal generated at the end of the duration of the state.

4. The underlying algorithm of the configuration compiler

An algorithm is proposed for the synthesis of the operator-network and controller corresponding to a critical section. The algorithm accepts as input the ‘simple DFG’ and CFG of the critical section, and ultimately generates a ‘hardware netlist’ that, in turn, is utilized by the FPGA vendor tools to place and route and synthesize the target hardware. The algorithm is represented in Fig. 4 through the three modules—flow graph generation, machine synthesis, and X-BLOX netlist generation.

In the flow graph generation module, first the CFG and DFG are constructed. The ‘complete DFG’ is then constructed by inserting latches and muxes at appropriate places in the ‘simple DFG’. Then, the operator nodes in the DFG are ‘time-stamped’, i.e. each operator node is assigned the earliest possible time when the operator node may complete execution. In the CFG, utilizing the execution times of the operations, each basic block is assigned two time-stamps—‘starting time-stamp’ and ‘ending time-stamp’. The ‘starting time-stamp’ refers to the earliest time when all of the inputs to a basic block are available while the ‘ending time-stamp’ is the latest time when all of the outputs from a basic block have been generated. Then, the CFG is restructured and optimized, utilizing the time-stamps of the basic blocks, to generate an FSM. As indicated earlier, the FSM represents the flow of control in the critical section, serves as a controller for the operator-network, and indicates the end of the computation. The number of states in the FSM is optimized as explained later in this section. The process of generating the FSM is detailed subsequently in this section.

The algorithm is presented, in pseudo-code, in Fig. 7 followed by a detailed discussion and an example.

(1) createOperatorNetwork: the operator-network is derived from the ‘simple DFG’ passed on from the ‘Flow Graph Generation’ module in Fig. 4. Merge-muxes, introduced earlier in this paper, are inserted at appropriate locations in the ‘simple DFG’ to resolve multiple definitions of a variable reaching an operator node. The select line for a multiplexor is the true/false output from the corresponding predicate operator. Latches are added to loops to help retain the values of variables from a previous iteration. Loop-muxes are added at the beginning of loops to allow selecting either the initial value of a variable or its subsequent values. The select line of a loop-mux is derived from the state of the controller.


build_machine (DFG, CFG, CDG, DT, PDT)
{
    createOperatorNetwork();  /* creates the initial version of the operator network       */
    timeStampOperators();     /* time stamp operations in the operator network             */
    timeStampBlocks();        /* using the operator time stamps, create time stamps for
                                 the basic blocks in CFG                                    */
    createController();       /* create the initial states of controller from the CFG      */
    determineDuration();      /* determine the "duration" of each of the states of the
                                 controller                                                 */
    optimizeController();     /* optimize the number of states in the controller           */
    writeNetlist();           /* write out the hardware description of the machine graph   */
}

Figure 7. The underlying algorithm of the configuration compiler, in pseudocode.


(2) timeStampOperators: each operation in the DFG is assigned a time-stamp that indicates the earliest time it may complete execution. Through this assignment, a schedule is automatically created that also reveals the operations that may execute in parallel. The schedule also permits a view into the timing of the execution of the operator-network.

A time-stamp is determined based on the premise that an operation may not execute until all of its input operands are present. Thus, it is computed as the maximum of the time-stamp values of all of the inputs plus the execution time of the operator node. The time-stamps of operations that have no inputs are set to zero, e.g. nodes for constants. The time-stamps of operator nodes are computed utilizing a breadth-first search algorithm. That is, first all operator nodes with zero incoming edges are time-stamped. These nodes constitute the first level of the DFG. Then, all operator nodes at the second level of the DFG are time-stamped. This level includes all operator nodes that are either directly connected to the nodes of level 1 or bear a single incoming edge. Next, operator nodes at levels 2, 3, etc. are successively time-stamped. A time-stamp, assigned to an operation, does not necessarily reflect the actual completion time of a computation. For instance, in the case of a loop, a time-stamp value assumes that the loop is executed only once. However, time-stamps serve as a useful mechanism to order data dependencies in a critical section.
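A minimal C sketch of this rule is given below; the node structure and function name are hypothetical, and the nodes are assumed to be supplied in level order (producers before consumers), as produced by the breadth-first traversal described above.

#include <stddef.h>

#define MAX_INPUTS 3                        /* at most ternary operators (muxes)     */

struct op_node {
    int delay;                              /* execution time of the operator        */
    int time_stamp;                         /* earliest possible completion time     */
    int num_inputs;                         /* 0 for constants and input registers   */
    struct op_node *inputs[MAX_INPUTS];     /* producers of the input operands       */
};

/* Assign a time-stamp to every operator node: the maximum time-stamp of its inputs
   plus its own execution time; nodes with no inputs are stamped zero.              */
void time_stamp_operators(struct op_node *nodes[], size_t count)
{
    for (size_t i = 0; i < count; i++) {
        struct op_node *n = nodes[i];
        if (n->num_inputs == 0) {
            n->time_stamp = 0;              /* e.g. constant nodes                   */
            continue;
        }
        int ready = 0;                      /* time when all operands are available  */
        for (int j = 0; j < n->num_inputs; j++)
            if (n->inputs[j]->time_stamp > ready)
                ready = n->inputs[j]->time_stamp;
        n->time_stamp = ready + n->delay;
    }
}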

(3) timeStampBlocks: time-stamps are computed for every basic block in a CFG utilizing the time-stamps of the individual operations in the DFG. For each basic block, ‘starting time-stamp’ and ‘ending time-stamp’ values are computed, as indicated earlier in this paper. For efficiency, corresponding to every basic block, the ‘ending time-stamps’ of the direct predecessor basic blocks are examined. Where the ‘starting time-stamp’ of a basic block is less than the minimum of the ‘ending time-stamps’ of the predecessor basic blocks, the ‘starting time-stamp’ of the basic block is modified to the minimum value. As a result, the basic block may be initiated for execution earlier, thereby achieving concurrency.

A predicate basic block determines which of the two succeeding basic blocks will be executed subsequently. The definitions of variables from the two succeeding basic blocks will be merged through a merge-mux using the predicate value as its select signal. The predicate value selects which of the two definitions will be propagated. It may be observed that, for a merge-mux to execute, it is adequate if the select signal and the input actually selected are available. It is not necessary for the other input to be available. Thus, for efficient execution of the operator-network, it is important to identify the predicate basic blocks.

(4) createController: the initial version of the controller is simply the CFG, with the states of the controller corresponding directly to the basic blocks of the CFG. The final version of the controller results from efforts to restructure and optimize the number of states. A state transition is dictated by the predicate value of the corresponding basic block. In a given state, all operations of the corresponding basic block must be completed. Therefore, in the implementation, a counter is initially loaded with the cumulative time duration of all operations and its value is decremented as time progresses. The controller remains in the given state as long as the counter value is non-zero and then permits a state transition.
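A minimal C sketch of this dwell mechanism, assuming a hypothetical per-state record: on entry the counter is loaded with the state's cumulative duration, and a state transition is enabled only once the counter has counted down to zero.

struct ctrl_state {
    int duration;   /* cumulative time duration of the corresponding basic block */
    int counter;    /* down-counter; the state is held while it is non-zero      */
};

/* Load the counter on entry to a state. */
void enter_state(struct ctrl_state *s)
{
    s->counter = s->duration;
}

/* Advance one time unit; returns 1 when a state transition may be taken. */
int tick(struct ctrl_state *s)
{
    if (s->counter > 0)
        s->counter--;
    return s->counter == 0;
}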


if (a > b)        : BB0
    c = a - b;    : BB1
else
    c = a * b;    : BB2
d = c + 5;        : BB3

Figure 8. Code fragment to illustrate optimization of FSM.

Figure 9. (a) CFG for the example code fragment in Figure 8, with block durations BB0 = 10, BB1 = 20, BB2 = 30 and BB3 = 40, giving a then-path duration of 70 and an else-path duration of 80; (b) the corresponding unoptimized FSM with four states; (c) the optimized FSM with two states, state0 of duration 70 and state1 of duration 10.

(5) determineDuration: for every state of the FSM, durations are computed utilizing the time-stamps of the corresponding blocks in the CFG. Duration of a block is the difference between its ‘ending time-stamp’ and ‘starting time-stamp’.
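For instance, in Fig. 5(b) the then-block may start once the predicate-block completes at time 25, and its outputs are available at time 100, so its duration is 100 - 25 = 75 time units.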

(6) optimizeController: the initial FSM, derived directly from the CFG, may contain more than the minimal number of states. In this step, the initial FSM is optimized to yield the final FSM that, in turn, implies an efficient controller hardware for the operator-network. In general, FSMs derived from CFGs corresponding to ‘if-then-else’ constructs lend themselves to optimization. For instance, the code fragment shown in Fig. 8 consists of four basic-blocks, each with distinct duration, between the ‘if-block’ and ‘join-block’. The corresponding CFG, shown in Fig. 9(a), contains four basic-blocks and Fig. 9(b) presents the unoptimized FSM with four states. The optimized FSM requires only two states as shown in Fig. 9(c). The integers at the bottom of the basic-blocks in Fig. 9(a) indicate the corresponding durations. It may be observed that the CFG contains two paths, namely the ‘then-path’ of duration 70 and the ‘else-path’ of duration 80 and, as a result, the FSM needs only two states, ‘state0’ and ‘state1’. The duration for ‘state0’ is obtained by adding the durations for the ‘predicate-block BB0’, ‘join-block BB3’, and the minimum of the durations of the ‘then’ and ‘else’ blocks. The duration for ‘state1’ is equal to the difference of the duration of the longer of the two paths from the ‘if-block’ to the ‘join-block’ and that of ‘state0’. Thus, optimization leads to an FSM with two fewer states than the initial FSM.


if (a > b) {           : BB0
    if (b > 10)        : BB1
        c = a + b;     : BB2
    else
        c = a * b;     : BB3
    c++;               : BB4
}
else {
    if (a == 0)        : BB5
        c = b / 5;     : BB6
    else
        c = ++b - 5;   : BB7
    c /= 2;            : BB8
}
c = c * c;             : BB9

Figure 10. Code fragment to illustrate optimization of FSM for nested ‘if-then-else’ constructs.


The rationale underlying the optimization is as follows. In Fig. 9(a), once the thread of execution reaches block ‘BB0’, it is certain that ‘predicate-block BB0’ and ‘join-block BB3’ will be executed. Also, depending on the boolean value of the predicate, either the ‘then-block’ or the ‘else-block’ will be executed. Therefore, the total execution time for the ‘if-then-else’ construct will at least equal the sum of the durations of the ‘predicate-block’, the ‘join-block’, and the minimum of the ‘then’ and ‘else’ blocks. This minimum required execution time serves as the duration for ‘state0’. The second state, ‘state1’, accounts for the additional time that may be required when the execution is required to adopt the longer of the two alternate paths.
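As a worked check with the durations of Fig. 9(a): duration(state0) = 10 + 40 + min(20, 30) = 70 time units, and duration(state1) = max(70, 80) - 70 = 10 time units, matching the two states of the optimized FSM in Fig. 9(c).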

The basic idea is extensible to nested ‘if-then-else’ constructs. As a result, in general, for a critical section, the reduction in the number of states is a function of the number of ‘if-then-else’ constructs that it contains. For instance, consider the code segment shown in Fig. 10. The corresponding CFG is shown in Fig. 11(a) with 10 blocks and four possible execution paths from basic-block ‘BB0’ to basic-block ‘BB9’. The initial unoptimized FSM, derived directly from the CFG, is shown in Fig. 11(b). The fully optimized FSM, shown in Fig. 12(a), contains only four states. A limitation associated with the FSM is expressed as follows. Complex logic is required for a state transition from ‘state0’ to a subsequent state. Thus, a transition from ‘state0’ to ‘state1’ requires predicate P1 to be TRUE and predicate P2 FALSE. This is reflected in Fig. 12(a) through the symbol ‘P1 AND –P2’ on the arc from ‘state0’ to ‘state1’. In general, the depth of nesting determines the complexity of the logic required to initiate the state transitions. While the additional logic results from the effort to reduce the number of states in the FSM, the evaluation of the logic requires time and implies an increase in the total execution time of the corresponding ‘if-then-else’ construct.


Figure 11. (a) CFG of the code fragment in Figure 10 illustrating optimization of nested ‘if-then-else’ constructs, with block durations BB0 = 10, BB1 = 10, BB2 = 30, BB3 = 40, BB4 = 50, BB5 = 20, BB6 = 70, BB7 = 50, BB8 = 90 and BB9 = 100; the four execution paths are path1 = BB0 BB1 BB2 BB4 BB9 (duration 200), path2 = BB0 BB1 BB3 BB4 BB9 (duration 210), path3 = BB0 BB5 BB6 BB8 BB9 (duration 290) and path4 = BB0 BB5 BB7 BB8 BB9 (duration 270); (b) the corresponding unoptimized FSM with ten states.


Figure 12. (a) Optimized FSM, with four states, for the CFG of the code fragment in Figure 10, with state transitions guarded by combinations of the predicates P1, P2 and P3; (b) alternate optimization of the CFG and generation of an FSM, with five states, without additional logic.

For a nested ‘if-then-else’ construct, the principal reason underlying the additional logic is that the ‘inner predicate’ states, such as ‘state1’ and ‘state5’ shown in Fig. 11(b), are merged with the ‘outer predicate-state’ ‘state0’ in Fig. 11(b) to create a single predicate state ‘state0’ in Fig. 12(a). As a result of this merging, it becomes necessary to examine combinations of the predicate values, as opposed to a single predicate, to determine state transitions.

Alternate state optimization may be achieved without introducing additional logic, as shown in Fig. 12(b). The predicate states ‘state1’ and ‘state5’ in Fig. 11(b) are not merged with their parent predicate state ‘state0’. Instead, for the outer ‘if-then-else’ construct, only the join-state ‘state9’ in Fig. 11(b) is merged with ‘state0’ in Fig. 11(b) to create ‘state0’ in Fig. 12(b). Moreover, each of the innermost ‘if-then-else’ constructs—{BB1, BB2, BB3} and {BB5, BB6, BB7}—is fully optimized to two states—{‘state1’, ‘state2’} and {‘state3’, ‘state4’}, in Fig. 12(b). The resulting FSM consists of five states and no additional logic, in contrast to four states and additional logic in the FSM in Fig. 12(a).

The algorithm for achieving optimization without introducing additional logic is described as follows. First, every innermost ‘if-then-else’ construct is identified and then optimized to two states as explained earlier. Second, starting with every innermost ‘if-then-else’ construct, move outwards, optimizing the unoptimized outer ‘if-then-else’ constructs until the outermost ‘if-then-else’ construct is reached. Finally, the ‘join-state’ of the outermost ‘if-then-else’ construct is merged into its ‘predicate-state’ to form the new ‘predicate-state’ with the duration equal to the sum of those of the two merged states.

Non-loop constructs in critical sections, such as ‘switch-case’, may be first reduced to ‘if-then-else’ form and then optimized. For loop-constructs with fixed or dynamic loop counts, including ‘for’, ‘while’, and ‘do-while’, the body of the loop is optimized analogously to the ‘if-then-else’ construct. The body of a loop is the set of basic-blocks, except the predicate block that tests the exit condition from the loop, that is executed iteratively. The loop-body may contain any mix of constructs and is represented through a corresponding CFG.

Given a critical section containing a mix of loop, switch-case, if-then-else, and other constructs, the final FSM is obtained as follows. First, all of the loops in the CFG and their constituent basic-blocks are uniquely identified. Then, every loop-body is optimized separately and, in the process, only states contained within a given loop-body are permitted to merge. A state outside a loop-body is not permitted to merge with a state within the loop-body, since this would increase the duration of a single iteration of the loop, thereby resulting in increased overall execution time. Code segments that contain ‘if-then-else’ and ‘switch-case’ constructs but lack loops are optimized and the corresponding FSM is generated.
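As a small illustration of the loop-boundary rule above, the following hypothetical check (an assumption for this sketch, not code from the PRISM-II compiler) allows two FSM states to merge only when they belong to the same loop-body:

    /* Two states may merge only if they lie in the same loop-body; merging a
     * state from outside a loop into the loop-body would lengthen every single
     * iteration of the loop and thus the overall execution time.               */
    int states_may_merge(int loop_id_a, int loop_id_b)
    {
        return loop_id_a == loop_id_b;      /* 0 denotes "outside any loop" */
    }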

(7) writeNetlist: the final FSM, representing the controller, and the operator-network obtained in step 1 of the algorithm are translated into the XBLOX Xilinx Netlist Format (XNF) representation. XBLOX [32] is a library of components, including adders, multiplexors, and counters, that have been previously optimized with respect to speed and area for the Xilinx FPGA. Following the translation, the ‘xblox’ and ‘ppr’ tools [30] are used to place and route the FPGA, thereby yielding custom hardware for fast and efficient execution of the critical section.

4.1 Limitations of PRISM-II

The limitations of PRISM-II, at the present time, include the following:

• The execution model does not address memory references, pointers, data structures including arrays and structs, and global and static variables that require accessing memory locations outside the FPGA. Global and static variables must be re-implemented in the form of explicit arguments of the critical section.

• Floating point types and operations are not supported because of their intense demand for silicon.

• The number of input operands to the current FPGA platform is restricted to two 32-bit numbers. The size of the return value is limited to 32 bits.

4.2 An example

To illustrate the principle of PRISM-II, consider a critical section, expressed in C, in Fig. 13.

The critical section performs integer division, computing the quotient ‘q’ given two integers ‘d’ and ‘e’. The purpose of this program is to illustrate the PRISM-II approach; the author does not necessarily recommend its usage for integer division. It is selected for the following reasons:

• it contains loops whose iteration counts are dynamic, i.e. unknown at compile time; and

• it contains if-then-else constructs.

First, the code segment in Fig. 13 is analysed and the executable statements are grouped into nine basic-blocks, BB0–BB3 and BB5–BB9.


int div(int d, int e) {
 1.     int q = 0;                      bb0
 2.     char samesign = 1;              bb0
 3.                                     bb0
 4.     if (d < 0) {                    bb0
 5.         d = -d;                     bb1
 6.         samesign = -samesign;       bb1
 7.     }
 8.
 9.     if (e < 0) {                    bb2
10.         e = -e;                     bb3
11.         samesign = -samesign;       bb3
12.     }
13.
14.     loop: d = d - e;                bb5
15.     if (d >= 0) {                   bb5
16.
17.         q++;                        bb6
18.         goto loop;                  bb6
19.     }
20.
21.     if (samesign)                   bb7
22.         return q;                   bb9
23.     else
24.         return -q;                  bb8
}

Figure 13. An example function to illustrate the synthesis of operator-network and controller.

Second, the initial, traditional DFG and CFG are created for the code segment in Fig. 13 [31]. Then, multiplexors and latches are added to the traditional DFG to synthesize the operator-network. The operator nodes corresponding to the basic-blocks of the code segment in Fig. 13 are shown enclosed in dashed rectangular boxes in the operator-network in Fig. 14. This correspondence will subsequently assist in correlating the functions of the operator-network and the FSM controller.

In Fig. 14, the circular nodes represent constant values that are utilized during the computation, while the rectangular nodes represent unary or binary arithmetic or logical operators. For example, nodes labeled ‘minus’ and ‘plus’ in Fig. 14 correspond to the ‘−’ and ‘+’ arithmetic operators, respectively. The diamond nodes labeled ‘lt’, ‘ne’, and ‘ge’ correspond to the boolean operators ‘less than’, ‘not equal to’, and ‘greater than or equal to’, respectively.

Figure 14. Operator-network for the C-function in Fig. 13.


[Figure 14 appears here. The operator-network receives its two inputs from the registers ‘d’ and ‘e’ at the top, groups its operator nodes by basic-block (bb0–bb10), exchanges latch and loop-control signals with the state machine, and delivers its result into the ‘return’ register at the bottom. Execution delays of the operators (time units): neg, minus, plus = 50; lt, ne, ge = 50; not, mux = 15; latch, constants = 0.]


Multiplexors are expressed through rectangular-diamond nodes. Nodes labeled ‘mux 1’ through ‘mux 5’ represent ‘merge-muxes’, which were introduced in section 3.2. For instance, the merge-mux ‘mux 2’ selects one of the two definitions, ‘d’ and ‘−d’, of the variable ‘d’ reaching line 7 in Fig. 13. Nodes labeled ‘mux 6’ and ‘mux 7’ represent ‘loop-muxes’, which were also introduced in section 3.2. A ‘loop-mux’ is located at the top of a loop and it selects either the initial value of a variable or its value fed back from a subsequent loop iteration. For instance, ‘mux 7’ selects either the initial value of the variable ‘q’, which is ‘0’, or the values fed back from subsequent iterations. The node labeled ‘latch’ represents a latch that is utilized within a loop. It temporarily stores a value from an iteration and feeds it back to the subsequent iteration. The two initial input values to the operator-network are stored in registers represented through shaded rectangular nodes labeled ‘d’ and ‘e’ at the top of Fig. 14. The final result of the critical section is stored in the register labeled ‘return’, which appears at the bottom of Fig. 14. The shaded operations, principally basic-blocks bb5 and bb6, correspond to the loop.
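In software terms, a merge-mux and a loop-mux are simple selections. The following C helpers are an illustration only (hypothetical names, not code generated by PRISM-II), showing what ‘mux 2’ and ‘mux 7’ compute for the example of Fig. 13:

    /* A merge-mux forwards whichever reaching definition the predicate selects;
     * for 'mux 2', the predicate is 'd < 0' and the definitions are '-d' and 'd'. */
    int merge_mux(int predicate, int def_if_true, int def_if_false)
    {
        return predicate ? def_if_true : def_if_false;
    }

    /* A loop-mux forwards the initial value on the first iteration and the value
     * latched from the previous iteration thereafter; for 'mux 7', the initial
     * value of 'q' is 0. */
    int loop_mux(int first_iteration, int initial_value, int latched_value)
    {
        return first_iteration ? initial_value : latched_value;
    }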

The CFG for the critical section is presented in Fig. 15. Except for blocks bb4 and bb10, each of the blocks bb0–bb10 in Fig. 15 corresponds to a basic-block in Fig. 13. The block bb4 in Fig. 15 contains the merge-muxes ‘mux 3’ and ‘mux 4’ and does not correspond directly to any code segment in Fig. 13. Similarly, block bb10 in Fig. 15 contains ‘mux 7’ and does not correspond directly to any code segment in Fig. 13. The shaded blocks, bb5 and bb6, correspond to the loop.

Next, the underlying algorithm of PRISM-II generates time-stamps for the operations in the operator-network and corresponding time-stamps for every block of the CFG. The time-stamps are shown adjacent to every operation in Fig. 14. The starting and ending time-stamps are also shown adjacent to every block in Fig. 15.

For the operator-network, the time-stamp values are generated as explained earlier in step 2 of the algorithm in section 4 of this paper. A time-stamp for a node in Fig. 14 is obtained by adding the time required to perform the operation to the maximum of the time-stamp values of the input operands. Thus, a ‘plus’ node with operator delay 50 and input time-stamps 75 and 100 will have a time-stamp equal to max{75, 100} + 50 = 150. As also explained earlier in section 4, the nodes in Fig. 14 are selected using a breadth-first search algorithm and their time-stamp values are then determined.
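The following C fragment is a minimal sketch of this time-stamping step, assuming a hypothetical node structure (the paper does not show its internal data structures) and assuming that feedback edges through latches are cut, i.e. latches and constants act as sources with time-stamp 0, so the graph that is traversed is acyclic:

    #include <stddef.h>

    #define MAX_INPUTS 4

    struct op_node {
        int delay;                            /* execution delay of the operator, e.g. 50 for 'plus' */
        int num_inputs;                       /* 0 for constants, input registers and latches        */
        struct op_node *inputs[MAX_INPUTS];   /* nodes producing the input operands                  */
        int time_stamp;                       /* computed time-stamp                                 */
        int stamped;                          /* 1 once time_stamp is valid                          */
    };

    /* Visit the nodes breadth-first: repeatedly stamp every node whose inputs
     * have all been stamped, until the entire operator-network carries
     * time-stamps.  A node's time-stamp is the maximum input time-stamp plus
     * the node's delay, e.g. max{75, 100} + 50 = 150 for the 'plus' node above. */
    void stamp_operator_network(struct op_node *nodes[], size_t n)
    {
        size_t remaining = n;
        while (remaining > 0) {
            for (size_t i = 0; i < n; i++) {
                struct op_node *nd = nodes[i];
                if (nd->stamped)
                    continue;
                int ready = 1;
                int max_in = 0;
                for (int k = 0; k < nd->num_inputs; k++) {
                    if (!nd->inputs[k]->stamped) { ready = 0; break; }
                    if (nd->inputs[k]->time_stamp > max_in)
                        max_in = nd->inputs[k]->time_stamp;
                }
                if (ready) {
                    nd->time_stamp = max_in + nd->delay;
                    nd->stamped = 1;
                    remaining--;
                }
            }
        }
    }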

For the CFG, the time-stamp values are generated as illustrated in step 3 of the algorithm in section 4 of the paper. The starting time-stamp for a block in Fig. 15 is derived as the maximum of the time-stamps of all inputs to all operations in the block. The ending time-stamp for a block is computed as the maximum time-stamp of all outputs of all operations in the block. The starting and ending time-stamps for the blocks are shown in Fig. 15.
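A companion sketch for the CFG, an illustration under the same assumptions and with hypothetical names: the starting and ending time-stamps of a basic-block are simply maxima over the block's operand inputs and operation outputs, respectively.

    #include <stddef.h>

    static int max_of(const int *v, size_t n)
    {
        int m = 0;
        for (size_t i = 0; i < n; i++)
            if (v[i] > m)
                m = v[i];
        return m;
    }

    /* input_ts:  time-stamps of all inputs to all operations in the block
     * output_ts: time-stamps of all outputs of all operations in the block
     * For example, block bb5 of Fig. 15 starts at 65 and ends at 180.        */
    void stamp_cfg_block(const int *input_ts,  size_t n_inputs,
                         const int *output_ts, size_t n_outputs,
                         int *start_ts, int *end_ts)
    {
        *start_ts = max_of(input_ts, n_inputs);
        *end_ts   = max_of(output_ts, n_outputs);
    }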

In the subsequent step, an initial FSM controller is obtained as an identical copy of the CFG in Fig. 15, and is shown in Fig. 16(a). As per step 5 (‘DetermineDuration’) of the algorithm, the initial FSM is optimized by collapsing a few states onto other states and assigning them appropriate durations. In Fig. 16(a), states ‘ss0’ and ‘ss1’ have the same set of time-stamps. This implies the lack of data dependencies between ‘ss0’ and ‘ss1’, and hence they may be executed concurrently. Thus, ‘ss1’ is merged into ‘ss0’, and the resulting FSM will show a direct link from the resulting state ‘ss0’ (time-stamps 0 and 50) into ‘ss2’ (time-stamps 50 and 65). Since the resulting state ‘ss0’ and state ‘ss2’ are now sequential, i.e. ‘ss0’ is directly followed by ‘ss2’, they are merged into one state, ‘s′0’, whose starting and ending time-stamps are 0 and 65, respectively, as shown in Fig. 16(b).


[Figure 15 appears here, showing the CFG blocks bb0–bb10 annotated with their starting and ending time-stamps.]

Figure 15. CFG for the C-function in Fig. 13.

States ‘s′1’ and ‘s′2’ in Fig. 16(b) are derived directly from states ‘ss3’ and ‘ss4’ of Fig. 16(a), respectively. It may be observed that the statements constituting the loop, i.e. state ‘ss5’ in Fig. 16(a), are executed at least once. Therefore, the FSM controller must account for at least one time duration of ss5, i.e. from 65 to 180 time units. In addition, the starting and ending time-stamps of state ‘ss7’ in Fig. 16(a) are completely overlapped by those of state ‘ss5’. Hence, state ‘ss7’ may be merged with state ‘ss5’ for efficiency and without any adverse effect. This resultant state is represented by ‘s′3’ in Fig. 16(b). State ‘ss6’ in Fig. 16(a) is re-labeled as ‘s′4’ in Fig. 16(b), and it accounts for executions of the loop body in subsequent iterations. State ‘ss9’ has the same starting and ending time-stamp values and is therefore deleted. States ‘ss10’ and ‘ss8’ are merged into state ‘s′5’, as shown in Fig. 16(b).
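The merge conditions applied in this example can be summarized by three simple tests on the states' time-stamp intervals; the following C predicates are a sketch of those tests (hypothetical helper names, not the paper's implementation):

    struct state_interval {
        int start_ts;   /* starting time-stamp */
        int end_ts;     /* ending time-stamp   */
    };

    /* Identical intervals (e.g. ss0 and ss1): no data dependency, the states may
     * execute concurrently and can be merged.                                    */
    int identical_intervals(struct state_interval a, struct state_interval b)
    {
        return a.start_ts == b.start_ts && a.end_ts == b.end_ts;
    }

    /* Sequential states (e.g. ss0 followed by ss2): the first ends exactly where
     * the second begins, so the two can be collapsed into one longer state.      */
    int directly_follows(struct state_interval a, struct state_interval b)
    {
        return a.end_ts == b.start_ts;
    }

    /* Complete overlap (e.g. ss7 inside ss5): the inner state can be absorbed by
     * the outer one without lengthening it.                                       */
    int completely_overlapped(struct state_interval inner, struct state_interval outer)
    {
        return inner.start_ts >= outer.start_ts && inner.end_ts <= outer.end_ts;
    }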

The intermediate FSM obtained in Fig. 16(b) is further optimized, and the final FSM is shown in Fig. 16(c).


[Figure 16 appears here. Panel (a) shows the initial FSM with states ss0–ss10 and their time-stamps; panel (b) shows the intermediate FSM with s′0 = ss0 + ss1 + ss2, s′1 = ss3, s′2 = ss4, s′3 = ss5 + ss7, s′4 = ss6 and s′5 = ss8 + ss9 + ss10; panel (c) shows the final FSM with s0 = s′0, s1 = s′1 + s′2 + s′3 + s′4 and s2 = s′5, whose durations are 65, 115 and 65 time units, respectively.]

Figure 16. (a) The initial FSM for the code in Fig. 13; (b) intermediate FSM as a result of a few optimizations; (c) final FSM following further optimizations.


[Figure 17 appears here, showing the machine-graph: the operator-network partitioned into shaded blocks that execute within the FSM states s0 (65 time units), s1 (115) and s2 (65); the three-state finite state machine (s0: 0–65, s1: 65–180, s2: 180–245); and the End_Of_Computation signal issued to the core-processor.]

Figure 17. Machine graph for the division function in Fig. 13.

The duration of state ‘s′3’ completely overlaps the starting and ending time-stamps of states ‘s′1’ and ‘s′2’ and, therefore, the latter are merged into state ‘s′3’. Furthermore, state ‘s′4’ is merged into state ‘s′3’. The resulting state is represented through state ‘s1’ in Fig. 16(c). Thus, the final FSM consists of only three states, ‘s0’, ‘s1’, and ‘s2’, and the duration of each state is determined simply by subtracting the ‘starting’ time-stamp from the ‘ending’ time-stamp, as shown within parentheses in Fig. 16(c). The FSM controller in Fig. 16(c) is in its simplest, irreducible form, and therefore constitutes the final FSM.

Figure 17 presents the final ‘machine-graph’, consisting of the operator-network and the optimized FSM. Sets of operations in the operator-network that are executed within a particular state of the FSM are shown enclosed by shaded rectangular blocks. For instance, all operations within the shaded block ‘s0’ are executed corresponding to the state ‘s0’ of the FSM. The operations within ‘s0’ complete execution at 65 time units, upon which state ‘s1’ is initiated.


State ‘s1’ corresponds to a loop-state, i.e. the operations may be iterated more than once. First, the FSM signals the multiplexors ‘mux 5’ and ‘mux 6’ in the operator-network to choose the initial values of the variables ‘d’ and ‘e’, respectively. The first iteration of state ‘s1’ completes at 180 time units. Then, the FSM receives from the operator-network the boolean value signal generated by the operator ‘ge’, and determines whether to continue execution of the loop. The propagation of the signal is represented by a dashed line from node ‘ge’ of the operator-network to state ‘s1’ in the FSM. Where the loop continues to be executed, the FSM first instructs the two ‘latch’ operators in the operator-network to latch the feedback values of the two variables ‘d’ and ‘q’. When it returns to the state ‘s1’, the FSM instructs the muxes ‘mux 5’ and ‘mux 6’ to choose the values from their corresponding latches rather than utilize the initial values. Next, the operations within the loop are executed. Following execution of the loop, the FSM transfers control to state ‘s2’ and then latches the final result from the output of ‘mux 7’ into the register labeled ‘return’. The FSM then generates the ‘end-of-computation’ signal and propagates it to the core-processor, informing it of the availability of the final result.

The final machine is then instantiated with XBLOX library modules and a netlist file is generated to program the reconfigurable FPGA testbed. Presently, the author is developing an implementation of PRISM-II, i.e. integrating the design of the reconfigurable FPGA testbed with the configuration compiler augmented by the new execution model. The results will be reported in a future publication.

5. Conclusions and future work

This paper has introduced a significant conceptual advancement, PRISM-II. PRISM-II is a new general-purpose adaptive computing system, and this paper has presented its architecture and compiler. In this architecture, specialized hardware is synthesized for an FPGA-based reconfigurable platform that executes the critical section(s) of a program written in C. Speeding up the execution of the critical section(s) greatly enhances the overall program execution performance. The reconfigurability of the hardware platform permits it to be reused for different applications, thereby maintaining an architecture that is general yet specialized for each application. The synthesis process is automatic and transparent, thereby allowing the user to concentrate on the application and not the architecture. In the novel execution model underlying the architecture, an operator-network and controller are synthesized for a given high-level program. The operator-network, custom-synthesized from FPGAs, executes the high-level program in a data-flow manner, i.e. any instruction is executed as soon as its input operands are available. The controller controls the computations in the operator-network, accurately determines when the execution is complete by utilizing key principles developed in PRISM-II, and generates an end-of-computation signal to asynchronously inform the core-processor to fetch the results from the FPGA platform. While the realization of a general-purpose data-flow architecture has continued to be difficult, the PRISM-II approach promises asynchronous, data-flow execution of programs on custom synthesized FPGA hardware. Presently, an implementation of PRISM-II is under development.


Acknowledgments

The author gratefully acknowledges the support of the National Science Foundation through grant MIP-9021118.

References

1. J. Savage, S. Magidson and A. M. Stein 1986. The Mystical Machine: Issues and Ideas in Computing. Reading, Massachusetts: Addison-Wesley.

2. E. W. Pugh 1984. Memories That Shaped Industry: Decisions Leading to IBM System/360. Cambridge, Massachusetts: MIT Press.

3. J. Hennessy and D. Patterson 1990. Computer Architecture: A Quantitative Approach. San Mateo, California: Morgan Kaufmann.

4. G. Estrin 1960. Organization of computer systems: the fixed-plus variable structure computer. In Proceedings of the Western Joint Computer Conference, 33–40.

5. G. Radin 1983. The 801 minicomputer. IBM Journal of Research and Development, 27, 237–246.

6. R. A. Brunner 1991. VAX Architecture Reference Manual. Second edition. Bedford, Massachusetts: Digital Press.

7. A. Tucker and M. Flynn 1970. Dynamic microprogramming: processor organization and programming. Communications of the ACM, 14, 240–250.

8. A. Abd-Alla and D. Karlgaard 1974. Heuristic synthesis of microprogrammed computer architecture. IEEE Transactions on Computers, C-23, 802–807.

9. T. G. Rauscher and A. K. Agrawala 1978. Dynamic problem-oriented redefinition of computer architecture via microprogramming. IEEE Transactions on Computers, C-27, 1006–1014.

10. C. Papachristou and V. Immaneni 1993. Vertical migration of software functions and algorithms using enhanced microsequencing. IEEE Transactions on Computers, 42, 45–61.

11. Maya Gokhale et al. 1990. SPLASH: A reconfigurable linear logic array. In International Conference on Parallel Processing, I-526–I-532.

12. Jeffrey M. Arnold, Duncan A. Buell and Elaine G. Davis 1992. SPLASH 2. In ACM Symposium on Parallel Algorithms and Architectures, 316–322.

13. P. Bertin, D. Roncin and J. Vuillemin 1993. Programmable active memories: a performance assessment. Research Report 24, Digital Paris Research Laboratory.

14. Herve Touati 1993. Perle1DC: A C++ library for the simulation and generation of DECPeRLe-1 designs. Technical Note 4, Digital Paris Research Laboratory.

15. P. Callahan 1988. Dynamic instruction set coprocessors. MILCOM, 19.1.1–19.1.6.

16. P. Athanas and H. Silverman 1993. Processor reconfiguration through instruction-set metamorphosis: architecture and compiler. IEEE Computer, 26, 11–18.

17. Donald E. Knuth 1971. An empirical study of FORTRAN programs. Software-Practice and Experience, 1, 105–133.

18. Howard Trickey 1987. Flamel: A high-level hardware compiler. IEEE Transactions on Computer-Aided Design, CAD-6, 259–269.

19. D. Lanneer, S. Note, F. Depuydt, M. Pauwels, F. Catthoor, G. Goossens and H. De Man 1991. Architectural synthesis for medium and high throughput signal processing with the new CATHEDRAL environment. In High-Level VLSI Synthesis. Kluwer Academic Publishers, 27–54.

20. R. Camposano, R. A. Bergamaschi, C. E. Haynes, M. Payer and S. M. Wu 1991. The IBM high-level synthesis system. In High-Level VLSI Synthesis. Kluwer Academic Publishers, 79–104.

21. K. Wakabayashi 1991. Cyber: High level synthesis system from software into ASIC. In High-Level VLSI Synthesis. Kluwer Academic Publishers, 127–151.

22. R. Camposano and W. Wolf, editors 1991. High-Level VLSI Synthesis. Kluwer Academic Publishers.

23. Robert A. Walker and Raul Camposano, editors 1991. A Survey of High-Level Synthesis Systems. Kluwer Academic Publishers.


24. G. Uvieghara et al. 1992. An experimental single-chip data flow CPU. IEEE Journal of Solid-State Circuits, 27, 17–27.

25. Richard Buehrer and Kattamuri Ekanadham 1987. Incorporating data flow ideas into von Neumann processors for parallel execution. IEEE Transactions on Computers, C-36, 1515–1522.

26. D. D. Gajski, D. A. Padua, D. J. Kuck and R. H. Kuhn 1982. A second opinion on data flow machines and languages. Computer, 58–69.

27. J. B. Dennis 1980. Data flow supercomputers. Computer, 13, 48–56.

28. Arvind and R. Nikhil 1987. Executing a program on the MIT tagged-token dataflow architecture. In Proceedings of the Conference on Parallel Architecture and Languages Europe, volume II of Lecture Notes in Computer Science. Eindhoven: Springer, 1–29.

29. Advanced Micro Devices, Inc. 1991. AM29050 Microprocessor User's Manual.

30. Xilinx Inc., San Jose, California 1992. XACT Reference Guide.

31. A. V. Aho, R. Sethi and J. D. Ullman 1986. Compilers: Principles, Techniques, and Tools. Reading, Massachusetts: Addison-Wesley.

32. Xilinx Inc., San Jose, California 1992. X-BLOX Design Tool User Guide.

Sumit Ghosh is currently an associate professor and the associate chair for research and graduate programs in the Computer Science and Engineering Department at Arizona State University. He received his BTech degree from IIT Kanpur, India, and his MS and PhD degrees from Stanford University, California. Prior to his current assignment, Sumit was on the faculty at Brown University, Rhode Island, and before that he worked at Bell Labs Research in Holmdel, New Jersey. Sumit's research interests are in fundamental problems from the disciplines of asynchronous distributed algorithms, modeling and distributed simulation of complex systems, networking, network security, computer-aided design of digital systems, continuity of care in medicine, and metrics to evaluate advanced graduate courses. Presently, he serves on the editorial board of the IEEE Press Book Series on Microelectronic Systems Principles and Practice. Sumit is a US citizen.