
A PLACEMENT TOOL FOR A NOC-BASED DYNAMICALLY RECONFIGURABLE SYSTEM

Mario Raffo, Jonas Gomes Filho, Marius Strum, and Wang Jiang Chau

Microelectronics Laboratory, University of Sao Paulo

Av. Prof. Luciano Gualberto trav. 3 N158, Sao Paulo, Brazil

email: { mraffo, jonasgfilho, strum, jcwang }@lme.usp.br

ABSTRACT

In recent years, field-programmable gate arrays (FPGAs) with partial reconfiguration capabilities have raised interest in the implementation of dynamically reconfigurable systems. This has not become a mainstream activity, though, due to the lack of solid design methodologies and associated tools. One approach that frees the designer from lower-level implementation details is to use structured communication resources to provide the interaction between reconfigurable partitions (modules). This paper presents the architecture of a network-on-chip (NoC) based dynamically reconfigurable system and a placement tool that automatically places all of its modules. The tool takes into consideration the partitioned design information and the restrictions imposed by the device family architecture. The basics of the placement algorithm and a case study are presented.

1. INTRODUCTION

In recent years, SRAM-based FPGAs with dynamic or run-time reconfiguration (RTR), here called DRFPGAs, have been accepted as an important alternative for lowering the cost of digital circuits, especially because of their flexibility in rapidly changing the functions being performed and in reducing area consumption [1]. However, they add new dimensions to the SoC design space, due to the different possible physical, temporal and functional partitions of the original application and to the different devices available in the market. Although new methodologies and tools have been proposed in recent years [2], [3], [4], [5], [6], [7], [8], [9], [10] to deal with the increased design complexity of this class of circuits, solutions to the associated problems are still very ad hoc.

One of the problems to be solved is the communication between reconfigurable partitions or modules, which includes data retention, management of bus-macro signals and the characterization of latency effects due to the reconfiguration cycle. To deal with these problems, structured communication resources have been proposed, since they can free the designer from the details of dynamically reconfigurable systems (DRSs), allowing the designer to focus on wrappers and computing logic and reducing the design effort. Different works have been reported for both bus-based [2], [3] and network-on-chip (NoC) based [4], [5], [6] implementations of DRS communication. The latter have better performance potential for systems with a large number of components, since they support parallel communication, and they also offer high scalability and modularity. Unfortunately, for both approaches, only architectural or specific implementation aspects have been reported, with little indication of CAD tools for their use.

This work was partially supported by the CNPq (National Council for Scientific and Technological Development), Brazil.

Although they assist designers with some lower-level design tasks, the current commercial design tools do not offer complete solutions for designs starting from register-transfer or higher levels of abstraction. An example of established support for designs with DRFPGAs is the Early Access Partial Reconfiguration methodology from Xilinx [11], which sets a design flow from a partitioned RTL DRS to the bitstreams, but lacks automation in several steps. One task left to be performed manually by the designer is the placement of each synthesized module or bus-macro, with the help of a graphical placement tool. This tool supplies device-related information, since specific families of DRFPGAs impose particular restrictions on placement, such as those for bus-macros, which must obey certain location patterns. In recent years, several published works have also treated the placement problem, delivering efficient algorithms for the optimization of area consumption in DRFPGAs. In those works, temporal placement is considered (spatial placement can be seen as a particular case), both at compile time [7] and online [8]; for the latter case, proposed extensions include a weighted cost of interconnections [9] and 2-D defragmentation [10]. The authors of those papers focused on improving the algorithmic efficiency of placing logic onto CLBs in simplified DRFPGA models, without taking into account the restrictions of real devices.

Fig. 1. Dynamic reconfigurable system. A PE could be a static or a dynamic module.

In this paper, a DRS model based on the Hermes NoC [12] is presented, together with a rule-based placement tool named DynoPlace. The tool is responsible for automatically distributing the elements of the system throughout a Virtex-4 family FPGA, taking into account all of its architectural restrictions for partial reconfiguration. The placement algorithm deals with all parts of the DRS, such as the processing elements and the individual routers of the NoC, which are placed sequentially. They are ordered so that the modules with more restrictions are placed first. Results of applying the algorithm, such as the final placed and routed circuit, are shown, demonstrating its feasibility.

The paper is organized as follows: the NoC-based system architecture is described in Section 2 and the design methodology for DRSs in Section 3. Section 4 presents the information related to the placement tool, while the results are shown in Section 5. Finally, Section 6 presents the conclusions and suggestions for future work.

2. THE ARCHITECTURE UNDER NOC

Xilinx DRFPGA structures consist of two-dimensional arrays of CLBs, for which the smallest unit of reconfiguration is the frame. For the Virtex-4 DRFPGA family, for example, a frame spans only part of a column, 16 CLB rows high (in Virtex-II and Virtex-II Pro, a frame spans the complete column). Virtex-4 devices are therefore said to follow a two-dimensional (2D) model. This makes the Virtex-4 compatible with the NoC structure of the proposed DRS [11]. The placement tool presented in this work targets a simplified DRS architecture based on the Hermes NoC [12], as depicted in figure 1. The reconfigurable area is connected to the non-reconfigurable one through the NoC, and processing elements (PEs) can be dynamically swapped for other ones at any time during operation. The general controller and the other modules used in the system operation are also depicted in figure 1.

The Hermes network is a communication infrastructure for digital systems [12]. It is a NoC with a mesh topology in which components communicate by transmitting packets. A handshake protocol is used to ensure correct data transmission. The communication mechanism uses packet switching, and the switching mode is wormhole [13]. Each router has input buffers that work as circular FIFOs to store incoming packets. In addition, each router can be simultaneously requested to establish up to five connections, which are served using a dynamic arbitration scheme (the priority of a port is a function of the last port that had a routing request granted). The routing strategy employed is the XY adaptive algorithm [13].
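The XY strategy mentioned above can be illustrated with a minimal sketch. It assumes plain deterministic XY routing on a mesh whose routers are addressed by hypothetical (x, y) coordinates; it does not model the adaptive behaviour, the arbitration scheme or the input buffers of the Hermes routers, and the function names are illustrative only.

```python
def xy_next_hop(current, target):
    """Return the next router on a deterministic XY path, or None on arrival."""
    cx, cy = current
    tx, ty = target
    if cx != tx:
        # First correct the X coordinate, one router at a time.
        return (cx + (1 if tx > cx else -1), cy)
    if cy != ty:
        # Only then correct the Y coordinate.
        return (cx, cy + (1 if ty > cy else -1))
    return None  # the packet has reached its destination router


def xy_path(source, target):
    """List every router visited from source to target, both included."""
    path = [source]
    while (hop := xy_next_hop(path[-1], target)) is not None:
        path.append(hop)
    return path
```

In a 3x3 mesh, for instance, `xy_path((0, 0), (2, 1))` visits `(0, 0)`, `(1, 0)`, `(2, 0)` and `(2, 1)`: the packet travels along the X direction first and turns only once.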

The PEs used in the system may be memories, arithmetic operation modules or logic operation modules. A PE includes wrappers for connecting to its router. The main function of the router is to provide an interface that adapts data transmission and reception between the PEs and the network. In the architecture shown in figure 1, PEs 4 and 5 are dynamic modules, each with two interchangeable logic partitions, and the other ones are static modules. For each PE there is an I/O block interfacing the data ports of the module with the system borders.

The general controller unit controls which PEs are going to communicate and checks which PEs have finished their work so that they can be reconfigured (PEs 4 and 5).

Fig. 2. Partial Reconfiguration Design Flow.

The bus-macro allows communication between the static and dynamic modules of the system. The Virtex-4 and more recent DRFPGA families have bus-macros for vertical communication and bus-macros for horizontal communication (although only vertical ones are used in this work).

3. THE DESIGN FLOW

For designers to adopt the DRS methodology, its benefits must outweigh the additional workload in the design phase compared to a plain design without dynamic reconfiguration. Presently, designers are not encouraged to implement DRSs, due to the lack of tools and automation. The ongoing research aims to contribute an automated environment in which the design flow needs little or no designer intervention. In the current DRS design flow provided by the Early Access methodology, when starting from synthesizable RTL code, the designer must follow a series of steps, as shown in figure 2. The main steps in the current design flow are:

Design partitioning: The system is manually divided into fixed modules, which are always active in the system, and partially reconfigurable modules (PRMs). The latter perform a task only during a specific period of system operation and are located in partially reconfigurable regions (PRRs). The bus-macros must be instantiated in the top-level system file to allow communication with the PRMs.

Individual modules synthesis: This is the only automatic step. First, the top module is synthesized by the ISE tool with the IOBUF parameter enabled. Next, all the other modules are synthesized without the IOBUF parameter (when this option is unchecked, drivers are not created for the I/Os). At the end, the designer has NGC files for all the system modules, with their area information (number of slices or CLBs).

Modules placement and restrictions: In this step the Xilinx graphical Floorplanner tool is used for the placement task, which covers the static modules, input pins, output pins and clock. For each PRR, only one PRM is allocated, according to the DRFPGA frame restrictions [11]. Finally, the Floorplanner generates the user constraint file (UCF) with the input, output and clock pins and the area allocated to each module.

Static modules and PRMs routing: The static part is placed and routed, and the PRMs are placed and routed within their PRRs. The bus-macro placement from the UCF is also included.

Merge: In this step the static part is merged with each PRM to generate the bitstream for DRFPGA reconfiguration, as well as the partial bitstreams used to reconfigure each PRR.

In summary, the Floorplanner tool is required, but the placement is done manually by the user, as are the modifications to the UCF file. Only the subsequent steps of the partial reconfiguration design flow can be automated with scripts from Xilinx [11].

The purpose of the proposed tool is to automate the fifth task of the partial reconfiguration design flow shown in figure 2. From the device and user placement restrictions, it generates the UCF file, including the bus-macros.
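As a rough illustration of the kind of UCF output such a placement step has to produce, the sketch below turns a rectangle expressed in CLB coordinates into area-group constraint lines. It is not the DynoPlace generator. It assumes that each Virtex-4 CLB maps to a 2x2 block of slice coordinates and that area groups are written with the `AREA_GROUP`, `RANGE` and `MODE = RECONFIG` constraint forms; both assumptions, and the exact spelling of the constraints, should be checked against [11]. Instance and group names are hypothetical.

```python
def clb_to_slice_range(col, row, width, height):
    """Map a CLB rectangle to a SLICE range, assuming 2x2 slices per CLB."""
    x0, y0 = 2 * col, 2 * row
    x1 = 2 * (col + width) - 1
    y1 = 2 * (row + height) - 1
    return f"SLICE_X{x0}Y{y0}:SLICE_X{x1}Y{y1}"


def ucf_area_group(instance, group, col, row, width, height,
                   reconfigurable=False):
    """Return UCF lines that bind one instance to a rectangular area group."""
    slice_range = clb_to_slice_range(col, row, width, height)
    lines = [
        f'INST "{instance}" AREA_GROUP = "{group}";',
        f'AREA_GROUP "{group}" RANGE = {slice_range};',
    ]
    if reconfigurable:
        # Assumed marker for a partially reconfigurable region.
        lines.append(f'AREA_GROUP "{group}" MODE = RECONFIG;')
    return lines
```

For a hypothetical PRR that is 6 CLBs wide and one frame (16 CLB rows) high at the bottom-left corner of the device, `clb_to_slice_range(0, 0, 6, 16)` gives `SLICE_X0Y0:SLICE_X11Y31` under the 2x2 assumption.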

4. THE DYNOPLACE TOOL

The present version of the DynoPlace tool only performs the placement of the allocated modules on the DRFPGA. The assumption is that two or more PRMs will be allocated to the same PRR at different execution times.

As shown in figure 2, the tool takes as input information about each module, such as its area (number of CLBs), the system architecture, the DRFPGA restrictions, and the system and module descriptions. The tool is rule-based; in other words, for every module to be placed, the restrictions are analysed to determine the position where it can actually be placed. The restrictions and the characteristics of the tool algorithm are explained next.

4.1. Restrictions

According to the Early Access Partial Reconfiguration methodology [11], a series of specific restrictions must be observed for each device family and, moreover, some specific data must be included in the UCF file in order to proceed with the dynamic reconfiguration (DR) flow. The restrictions presented here refer to the Virtex-4 family devices, although they can be extended to Virtex-5 with different parameter values (e.g., the height of an individual reconfigurable frame in Virtex-5 is 20 CLB rows).

Fig. 3. Graph G(I,IO,O,V,E)

The first important restriction refers to the frame size. The height of an individual reconfigurable partition must be an exact multiple of 16 CLB rows, while the width is unrestricted. This means that reconfigurable partitions should have standardized heights and fixed start positions. As a consequence, other fixed or dynamic partitions connected to the PRM must also respect the same relative positioning, because the bus-macros have to match.

Another important point refers to the bus-macros, which are built on CLBs. Although each CLB contains 8 available bit lines, not all CLBs in a frame are useful for providing bus-macro connectivity. Because of this limitation, CLBs used for bus-macro connections are separated by exactly 4 rows. This means that, in a reconfigurable frame, only 4 CLBs can actually be used for bus-macros, while the other 12 must be used for regular logic. Analyzing the number of port lines of a PRM allows us to calculate the number of CLBs required for the connections and, therefore, the real total number of CLBs for the PRM implementation.
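The counting argument above can be sketched as follows. The constants come from the text (8 bit lines per CLB, one usable CLB every 4 rows, 16 CLB rows per frame); the rounding rule and the function names are assumptions made for illustration, since the exact formula used by the tool is not given here.

```python
import math

BITS_PER_BUS_MACRO_CLB = 8  # bit lines available in one CLB (from the text)
BUS_MACRO_ROW_PITCH = 4     # rows between usable bus-macro CLBs (from the text)
FRAME_HEIGHT = 16           # CLB rows in one Virtex-4 frame (from the text)


def bus_macro_clbs(port_lines):
    """CLBs needed to carry all port lines of a PRM (assumed: round up)."""
    return math.ceil(port_lines / BITS_PER_BUS_MACRO_CLB)


def bus_macro_columns(port_lines):
    """Frame-high CLB columns needed to host those bus-macro CLBs."""
    slots_per_column = FRAME_HEIGHT // BUS_MACRO_ROW_PITCH  # 4 usable CLBs
    return math.ceil(bus_macro_clbs(port_lines) / slots_per_column)
```

Under these assumptions a PRM with 34 port lines needs 5 bus-macro CLBs, which do not fit in the 4 usable positions of a single column, so a second column is required.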

4.2. The System Model

The dynamically reconfigurable system can be represented by a set M of n modules m_i, composed of the sets S and D, where S contains t static modules and D contains f dynamic modules; therefore n = t + f.

$$M = \{m_1, \dots, m_n\} = \{s_1, \dots, s_t\} \cup \{d_1, \dots, d_f\} \qquad (1)$$

In order to structure the placement procedure, the modules are distinguished by their position relative to the system borders, as well as by their interdependence. The system can be seen as a set of connected components, as shown in figure 3. The modules v_i ∈ V are elements whose ports carry only internal signals (they are not system inputs or outputs), while the modules w_i ∈ W are all the other modules of the system architecture, that is, modules with system input ports, system output ports or both. A directed edge e_i ∈ E in the graph represents a dependency (connection) between modules.

Fig. 4. Placement of modules

The system architecture shown in figure 1 can be represented by the following sets: w ∈ (I ∪ IO ∪ O), v ∈ (PE ∪ R ∪ OE). I is the set of modules that have only system inputs, O is the set of modules that have only system outputs, and IO is the set of modules that have both system inputs and outputs.

PE is the set of processing elements of the network, R is the set of routers of the network, and OE is the set of modules that have no system inputs or outputs and are neither processing elements nor routers.
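One possible way to encode this model in software is sketched below. The class and field names (`Kind`, `Module`, `connects_to`, `split_sets`) are hypothetical and are not taken from the DynoPlace implementation; the sketch only mirrors the sets defined above, with W = I ∪ IO ∪ O and V = PE ∪ R ∪ OE.

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    I = "I"    # only system inputs
    IO = "IO"  # system inputs and outputs
    O = "O"    # only system outputs
    PE = "PE"  # processing element
    R = "R"    # router
    OE = "OE"  # other internal element


BORDER_KINDS = {Kind.I, Kind.IO, Kind.O}  # kinds that make up the W set


@dataclass
class Module:
    name: str
    kind: Kind
    clbs: int              # area reported by synthesis
    dynamic: bool = False  # True for members of D, False for members of S
    connects_to: list = field(default_factory=list)  # names of dependents


def split_sets(modules):
    """Return the sets (W, V, S, D) as lists of modules."""
    w = [m for m in modules if m.kind in BORDER_KINDS]
    v = [m for m in modules if m.kind not in BORDER_KINDS]
    s = [m for m in modules if not m.dynamic]
    d = [m for m in modules if m.dynamic]
    return w, v, s, d
```

By construction, the sizes of S and D always add up to the number of modules, which is the relation n = t + f stated with equation (1).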

4.3. DynoPlace Algorithm

The algorithm receives the number of CLBs of each system module and its dependencies (connections) on other modules. It works like a list scheduling algorithm and adopts the rules that all modules have the same height and that placement is sequential, from bottom to top and from left to right. The parts (a part is a module or a set of related modules placed as a single unit) are always placed at the leftmost free position, with priority from bottom to top if more than one option is available. The sequence is defined by the type of module, using the priority criteria described later in this section.

Initially, the placement proceeds from bottom to top, with the elements positioned at the zero point (leftmost). Figure 4a illustrates this condition, in which only one spot is left to complete the first column, while figure 4b shows the next step, in which the column is completed. Finally, figure 4c shows a further step, in which a new element is placed in the row with the leftmost free position. The DynoPlace algorithm is divided into specific tasks, which are explained below.
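The leftmost-free rule illustrated by figure 4 can be sketched as a small greedy loop. This is an interpretation of the rule, not the DynoPlace code: the device is modelled as horizontal bands one frame high, each with a cursor marking its next free column. The default of 6 bands by 28 columns is an assumed geometry for the Virtex-4 LX25 that is consistent with the 2688 CLBs quoted in Section 5; the function name is hypothetical.

```python
def place_parts(widths, device_columns=28, bands=6):
    """Place parts in the given order; return a list of (band, column) origins.

    Band 0 is the bottom of the device. Every part is one band high.
    """
    cursor = [0] * bands  # next free column in each band
    origins = []
    for width in widths:
        fitting = [b for b in range(bands)
                   if cursor[b] + width <= device_columns]
        if not fitting:
            raise ValueError(f"part of width {width} does not fit")
        # Leftmost free position first; ties are broken bottom to top.
        band = min(fitting, key=lambda b: (cursor[b], b))
        origins.append((band, cursor[band]))
        cursor[band] += width
    return origins
```

With three bands of ten columns, `place_parts([3, 2, 4, 1], device_columns=10, bands=3)` fills column zero from bottom to top with the first three parts and then puts the fourth part in band 1, whose free position (column 2) is the leftmost one, giving `[(0, 0), (1, 0), (2, 0), (1, 2)]`.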

Preprocessing: The dimensions (width and height) of all the modules are defined in this step. To normalize them, all modules are given the same height of one Virtex-4 frame (16 CLB rows). The width of each module is calculated as its total number of CLBs divided by 16, following the restrictions described in section 4.1.

The width of the modules that are members of the I, IO and O sets is calculated in the same manner but, in order to prevent possible problems with bus-macro placement at the routing step (a practical requirement to avoid place-and-route errors), such a module must be at least 6 CLBs wide if it is connected to a PRR.
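A minimal sketch of this preprocessing rule follows. Rounding the division up to a whole column is an assumption, chosen because a module that needs a single CLB still occupies a full 16-CLB column; the function and parameter names are illustrative.

```python
import math

FRAME_HEIGHT = 16     # CLB rows in one Virtex-4 frame
MIN_BORDER_WIDTH = 6  # minimum width of an I/IO/O module attached to a PRR


def module_width(clbs, border=False, connected_to_prr=False):
    """Width in CLB columns of a module normalized to one frame in height."""
    width = max(1, math.ceil(clbs / FRAME_HEIGHT))  # assumed: round up
    if border and connected_to_prr:
        width = max(width, MIN_BORDER_WIDTH)
    return width
```

For example, a 40-CLB module normally becomes 3 columns wide, but the same module is widened to 6 columns when it belongs to I, IO or O and is connected to a PRR.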

Placement of PRRs and Adjacent Modules: This is the first placement step, for the modules with the most restrictions. In the proposed system only a few PEs (the PRMs) are placed in PRRs. In this case, each element formed by a PRR and its dependent modules must be placed in a row (while it fits), and the components are always placed in a horizontal sequence: I/O/IO → PRR (for the corresponding PRMs) → R or I/O/IO → OE → PRR (for the corresponding PRMs) → R.

Placement of modules that are members of the I, IO and O sets and have no dependencies on PRMs: The next priority criterion concerns the modules with system inputs or outputs. All the remaining fixed PEs, together with their dependent modules, must be placed in a row (while they fit), and the components are always placed in a horizontal sequence: I/IO/O → static PEs → R or I/IO/O → OE → static PEs → R.

Placement of modules composed of the remaining members of the V set: In this step the internal modules are placed one by one.

The Bus-macros placement: The Early Access Methodology [11] presents the restrictions for bus-macro placement, which are included in our tool. For simplicity, we do not use the bus-macros for vertical communications.
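The priority order of the three placement steps described in this section can be sketched as a sort key. Each part is represented here by a hypothetical dictionary holding the kinds of the modules chained in it and a flag telling whether the chain contains a PRR; this representation is an assumption made for illustration, not the data structure used by DynoPlace.

```python
def placement_priority(part):
    """Lower value means the part is placed earlier."""
    if part.get("has_prr", False):
        return 0  # chains built around a PRR carry the most restrictions
    if part["kinds"] & {"I", "IO", "O"}:
        return 1  # chains that touch the system borders
    return 2      # remaining internal modules of the V set


def order_parts(parts):
    """Sort parts by priority; the stable sort keeps the input order on ties."""
    return sorted(parts, key=placement_priority)
```

A list ordered this way could then be handed, part by part, to whatever routine picks the leftmost free position for each part.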

5. EXPERIMENTAL RESULTS

In order to test the algorithm, we described in VHDL the system architecture presented in figure 1, along with the bus-macro modules, and performed the synthesis step as indicated in Section 4.3. The system consisted of a series of arithmetic and logic operations controlled by an FSM, with initial values stored in 8 ROMs and a 3x3 NoC for communication. Basically, the system performs a sequence of operations at the processing nodes and sends the results through the routers to another PE for further computation. The RTL description was validated in a simulation based on dynamic circuit switching [14], with the Virtex-4 LX25 DRFPGA as the target.

After obtaining the size and aspect ratio of the modules, DynoPlace was used for placement. The resulting placement is shown in figure 5. It can be observed that all the elements of the set M have a height of 16 CLB rows, while the width depends on the total number of CLBs of each one. Synthesis indicates that the total number of CLBs required is 1238, and the two PRRs of the system are at the bottom of the DRFPGA. Since the DynoPlace algorithm requires a fixed height of 16 CLB rows, the minimum number of CLBs in each module column is 16, even if synthesis reports a smaller number. This means that the actual number of CLBs of the modules increases (e.g., even if a module needs only one CLB, the height restriction increases it to 16). The number of CLBs occupied by the system after placement is 1856 (49.92% more than the synthesized total), and the maximum operating frequency is 96 MHz. The DynoPlace tool also generates a report of the number of CLBs used by each module.
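The overhead figure quoted above follows directly from the two CLB counts, as the short sketch below shows. The helper names are illustrative; the only inputs are the numbers reported in the text and the fixed module height of 16 CLB rows.

```python
FRAME_HEIGHT = 16  # CLB rows occupied by every placed module


def placed_clbs(total_width_columns):
    """CLBs occupied when every column of every module is one frame high."""
    return FRAME_HEIGHT * total_width_columns


def overhead_percent(synthesized_clbs, occupied_clbs):
    """Relative area increase caused by the height normalization."""
    return 100.0 * (occupied_clbs - synthesized_clbs) / synthesized_clbs


print(round(overhead_percent(1238, 1856), 2))  # prints 49.92
```

The 1856 occupied CLBs correspond to 116 module columns of 16 CLBs each, that is, `placed_clbs(116)`.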

Although this first version of DynoPlace shows obvious inefficiencies, as shown in the previous paragraph, it yields area gains compared to a totally static design. The proposed system, without reconfiguration, requires 11 PEs, which increases the number of routers, so the system would need a 4x4 NoC; the use of dynamic reconfiguration allows a reduction of nearly 41% in CLBs.

To test other limitations of the DynoPlace tool, a larger SDR design was proposed. The original number of CLBs after the individual synthesis was 2011, but the tool algorithm increased it to 2688 (the total capacity of the Virtex-4 LX25), i.e., an increase of about 33.7%. The placement obtained for this SDR is shown in figure 6. The project presented no difficulty in being completed using the UCF files produced by DynoPlace together with the scripts of the Early Access Methodology.

Fig. 5. The system architecture placement using the DynoPlace algorithm.

Fig. 6. Another system architecture placement using the DynoPlace algorithm, with complete utilization of the device resources.

6. CONCLUSIONS AND FUTURE WORK

The DynoPlace tool was developed to generate the placement data from a partitioned NoC-based DRS, extending the automation of the Early Access Methodology [11]. It enables designers to build dynamically reconfigurable applications automatically, without knowing the restrictions imposed by the devices. The algorithm used by the DynoPlace tool increases the number of CLBs of the system (by nearly 50%), because it sets the height of all the modules to 16 CLB rows, while several system modules actually need an area of only 1 CLB. Compared to a totally static system, the results produced by DynoPlace show an area gain, in terms of number of CLBs, of nearly 41%. This is because the system without reconfiguration requires 11 PEs, which increases the number of routers and requires a 4x4 NoC.

As future work, the DynoPlace tool may include the information for all the Virtex-4 family device subtypes (LX, SX and FX), being able to determine the device subtype according to the area required by the system. The generated files should be usable with the Early Access Methodology and with the PlanAhead tool from Xilinx as well. An extra capability will be the automated simulation of DRSs with isolation switches and the use of a PRM configuration controller library, to allow the validation of the DRS behavior before implementation.

7. ACKNOWLEDGMENTS

We would like to thank the members of the Hardware Design Support Group (GAPH) at PUCRS (Brazil) for permission to use the Hermes NoC and for their support.

8. REFERENCES

[1] J. Eldredge and B. Hutchings, “Run time reconfiguration: A method for enhancing the functional density of SRAM based FPGAs,” The Journal of VLSI Signal Processing, vol. 12, no. 1, pp. 67–86, Jan. 1996.

[2] M. Huebner, T. Becker, and J. Becker, “Real time LUT-based network topologies for dynamic and partial FPGA self-reconfiguration,” 17th Symposium on Integrated Circuits and Systems Design.

[3] H. Walder and M. Platzner, “A runtime environment for reconfigurable hardware operating systems,” 14th International Conference on Field Programmable Logic and Applications (FPL).

[4] C. Bobda, M. Majer, D. Koch, A. Ahmadinia, and J. Teich, “A dynamic NoC approach for communication in reconfigurable devices,” 14th International Conference on Field Programmable Logic and Applications (FPL).

[5] T. Pionteck, R. Koch, and C. Albrecht, “Applying partial reconfiguration to networks on chips,” 16th International Conference on Field Programmable Logic and Applications (FPL).

[6] S. Jovanovic, C. Tanougast, C. Bobda, and S. Weber, “CuNoC: A dynamic scalable communication structure for dynamically reconfigurable FPGAs,” Microprocessors and Microsystems, vol. 33, no. 1, pp. 24–36, Aug. 2009.

[7] J. Teich, S. Fekete, and J. Schepers, “Optimization of dynamic hardware reconfigurations,” The Journal of Supercomputing, vol. 19, no. 1, pp. 57–75, May 2001.

[8] A. Ahmadinia, C. Bobda, M. Bednara, and J. Teich, “A new approach for on-line placement on reconfigurable devices,” Proceedings. 18th International Parallel and Distributed Processing Symposium, 2004.

[9] A. Ahmadinia, C. Bobda, S. Fekete, J. Teich, and J. van der Veen, “Optimal free-space management and routing-conscious dynamic placement for reconfigurable devices,” IEEE Transactions on Computers, vol. 56, no. 5, pp. 673–680, May 2007.

[10] J. Tabero, J. Septien, H. Mecha, and D. Mozos, “A low fragmentation heuristic for task placement in 2D RTR HW management,” 14th International Conference on Field Programmable Logic and Applications (FPL).

[11] Xilinx, “Early access partial reconfiguration user guide,” UG208.

[12] F. Moraes, N. Calazans, A. Mello, L. Möller, and L. Ost, “Hermes: an infrastructure for low area overhead packet-switching networks on chip,” Integration, the VLSI Journal, vol. 38, no. 1, pp. 69–93, 2004.

[13] C. Bobda, Introduction to Reconfigurable Computing: Architectures, algorithms and applications, 1st ed. Dordrecht: Springer, 2007.

[14] M. Raffo, M. Strum, and J. C. Wang, “A simulation methodology for a NoC-based dynamically reconfigurable system,” Proceedings. 16th Workshop Iberchip, 2010.