A BEHAVIORAL SYNTHESIS APPROACH FOR DISTRIBUTED MEMORY … · 2010-11-02 · A BEHAVIORAL SYNTHESIS APPROACH FOR DISTRIBUTED MEMORY FPGA ARCHITECTURES Ashutosh Pal* *CoWare India

A BEHAVIORAL SYNTHESIS APPROACH FOR DISTRIBUTED MEMORY FPGA ARCHITECTURES

Ashutosh Pal*

*CoWare India Pvt. Ltd., Tower-B, Logix Techno Park

Noida, [email protected]

M. Balakrishnan †

†Department of Computer Science and Engineering, Indian Institute of Technology

Hauz Khas, New Delhi, [email protected]

ABSTRACT

This paper presents an approach for efficiently mapping loops and array-intensive applications onto FPGA architectures with distributed RAMs, multipliers and logic. We perform a data-dependency based, two-level partitioning of the application's iteration space under target FPGA architectural constraints to achieve better performance. We show that this approach can result in a super-linear speedup: a linear speedup due to concurrent computation on multiple compute elements, and an additional speedup due to an improvement in the clock frequency (up to 30%). The clock period reduction is possible because computation and memory accesses are now localized, i.e. the compute elements interact only with memories that are close by.

1 INTRODUCTION

One of the main aspects of behavioral synthesis is the efficient utilization of target architecture information while performing transformations. In this work, we propose to utilize the physical closeness between the embedded RAMs and compute units present in current FPGA architectures, such as Xilinx Virtex II [7] and Altera's Stratix II [8], to create distributed data paths. This is integrated with the apparent parallelism available in FPGA architectures due to the presence of replicated resources. We have adopted the loop partitioning techniques reported in the domain of parallelizing compilers for multi-processors to partition the application's iteration space.

There have been some previous efforts in the behavioral synthesis domain targeting FPGA architectures. FPGA compilers like the SA-C compiler [2] mainly performed parallelizing code transformations to utilize the huge arrays of CLBs. Baradaran et al. [3] present a technique to utilize the embedded RAMs as caches to enable data reuse. Ouaiss et al. [4] target the hierarchical memory mapping problem on an RC board, aiming to optimize performance, but the analysis assumes a single processing unit. Baradaran et al. [5] present a custom array mapping approach for generalized configurable architectures and report an improvement in clock cycles, but not in the clock period. In this work, we target specialized FPGA architectures with distributed RAMs and associated compute units to achieve concurrency in computation and localization of memory accesses. We report improvement in both clock period and clock cycles, achieving super-linear speedup.

2 MOTIVATIONAL EXAMPLE

We used the Xilinx Virtex II [7] FPGA architecture for our experiments. The architecture comprises embedded block RAMs (BRAMs) with multipliers adjacent to them, as shown in fig. 2.1. To establish the motivation, we use Dot Product as the input application, whose kernel is shown below:

for (i = 0; i < N; i++)
    dprod = dprod + A[i] * B[i];

Fig. 2.1(a) shows the synthesized design of dot product when taken through the normal synthesis flow. Each of the arrays gets mapped to the BRAMs, with only a single multiplier unit being used for the computation. The memory access path to the multiplier is the critical path of the design. As the size of the arrays increases, the critical path delay increases, which in turn increases the clock period and degrades the performance.

The transformed dot product design is shown in fig. 2.1(b). It is obtained by partitioning the iteration space into four partitions (each mapped to a BRAM and a multiplier). The memory accesses are now localized, and thus the critical path delay (or the clock period) will not increase with the array sizes. However, as can be seen in fig. 2.2, there is still an increase in the clock period, which is primarily due to the increase in routing overheads. It can be inferred from fig. 2.2 that a clock period improvement of up to 30% is achieved.

Fig. 2-1: Dot Product synthesized designs. (a) Normal design utilizing a single compute unit; (b) Transformed design utilizing multiple compute units. Legend: BRAMs (holding arrays A and B), Multiplier (*).

1-4244-1060-6/07/$25.00 ©2007 IEEE.

Fig. 2-2: Clock period Variation for the two designs
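The transformation of fig. 2.1(b) can be illustrated in software terms. The following is a minimal sketch, not the synthesized design itself: it assumes the iteration space is split into four contiguous blocks, and the names `NUM_PARTS`, `dot_product` and `dot_product_partitioned` are hypothetical. Each block reads only its own slice of A and B, which is the property that lets each BRAM and multiplier pair work on local data.

```c
#include <assert.h>
#include <stddef.h>

/* Number of partitions; four matches fig. 2.1(b). Hypothetical name. */
#define NUM_PARTS 4

/* Reference kernel from the text: one accumulator over the whole range. */
long dot_product(const int *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL);
    long dprod = 0;
    for (size_t i = 0; i < n; i++)
        dprod += (long)a[i] * b[i];
    return dprod;
}

/* Partitioned version: the iteration space [0, n) is split into NUM_PARTS
 * contiguous blocks. Each block touches only its own slice of A and B and
 * keeps its own partial sum; the partial sums are combined at the end. */
long dot_product_partitioned(const int *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL);
    long partial[NUM_PARTS] = {0};
    size_t chunk = (n + NUM_PARTS - 1) / NUM_PARTS; /* ceiling division */

    for (size_t p = 0; p < NUM_PARTS; p++) {
        size_t lo = p * chunk;
        size_t hi = (lo + chunk < n) ? lo + chunk : n;
        for (size_t i = lo; i < hi; i++) /* independent of other blocks */
            partial[p] += (long)a[i] * b[i];
    }

    long dprod = 0;
    for (size_t p = 0; p < NUM_PARTS; p++) /* final combine step */
        dprod += partial[p];
    return dprod;
}
```

In hardware the four inner loops would run concurrently; the sequential loop over `p` here only demonstrates that the partitions do not depend on one another.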

3 PROBLEM DESCRIPTION

Consider an application with an N-dimensional, perfectly nested [6] iteration space with constant loop bounds and uniformly generated array access functions (as described in Chen et al. [1]). Further, consider a target FPGA architecture with p identical embedded RAMs, p associated compute units and a latency description Lat(i, j), which represents the latency of accessing an element from RAM i to compute unit j.

The aim of this work is to map the input application's iteration and data space onto the resources of the target architecture to achieve better performance.

The latency descriptions were obtained empirically by placing a logical RAM onto the different BRAM positions on the device, while keeping the position of the multiplier fixed. The values ranged from 8.310 ns to 14.706 ns for the XC2V6000 Xilinx Virtex II [7] device (128 BRAMs organized in 6 columns).

4 PARTITIONING MODEL

We propose a partitioning model according to which the given input application will be partitioned. The model is hierarchical in nature and comprises three different kinds of partitions. We use the notation It(X) to represent the set of iterations mapped to partition X, Data(X) to represent the set of data elements accessed by partition X, and Comm(X, Y) to represent the set of data elements transferred from partition X to partition Y and vice versa during computation.

4.1 Types of Partitions

1. Logical level-1 (L1) Partitions $\{L1_1, L1_2, \ldots, L1_n\}$ are at the topmost level of the hierarchy. Formally we have,

$$It(L1_i) \cap It(L1_j) = \phi;\quad Data(L1_i) \cap Data(L1_j) = \phi;\quad Comm(L1_i, L1_j) = \phi,\quad \forall i, j,\ 1 \le i, j \le n$$

These partitions are obtained by grouping the data-dependent iterations together in one partition, and hence there is no communication between them. We represent these partitions using the data-dependency based vector space formulation of an iteration partition from Chen et al. [1]. So we have,

$$It(L1_i) = b_i + a_1 d_1 + a_2 d_2 + \ldots + a_u d_u,$$

where $b_i$ is an initial point and $d_1, d_2$ etc. are data-dependence distance vectors [6]. We call $a_1, a_2$ etc. the dependency control variables, as they control the extent to which the corresponding dependencies are applied. Each partition has a unique initial point, and by varying the $a_i$'s all the iterations in a partition can be obtained.
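As an illustration of this representation, the sketch below evaluates one iteration point $b + \sum_j a_j d_j$ for given values of the dependency control variables. It is a minimal example under stated assumptions: the function name `iteration_point` and the bound `MAX_DIM` are hypothetical and are not part of the formulation in [1].

```c
#include <assert.h>

#define MAX_DIM 8 /* hypothetical upper bound on the loop nest depth */

/* Computes point = b + sum over j of a[j] * d[j], where
 *   dim   = depth of the loop nest (length of each vector),
 *   u     = number of dependence distance vectors,
 *   b     = initial point of the partition,
 *   a     = dependency control variables a_1..a_u,
 *   d     = dependence distance vectors d_1..d_u (one per row). */
void iteration_point(int dim, int u, const int b[], const int a[],
                     const int d[][MAX_DIM], int point[])
{
    assert(dim > 0 && dim <= MAX_DIM);
    assert(u >= 0);
    for (int x = 0; x < dim; x++) {
        point[x] = b[x];
        for (int j = 0; j < u; j++)
            point[x] += a[j] * d[j][x];
    }
}
```

For example, with initial point (0, 0, 0), distance vectors (0, 1, 0), (1, 0, 0), (0, 0, 1) and control variables (5, 3, 7), the function yields the iteration (3, 5, 7). Sweeping each control variable over its range enumerates every iteration of the partition.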

2. Logical level-2 (L2) Partitions $\{L2_1, L2_2, \ldots, L2_k\}$ are at the next level of the hierarchy. Formally we have,

$$It(L2_i) \cap It(L2_j) = \phi;\quad Data(L2_i) \cap Data(L2_j) \ne \phi;\quad Comm(L2_i, L2_j) = \phi,\quad \forall i, j,\ 1 \le i, j \le k$$

These are obtained by further partitioning the L1 partitions by relaxing the false dependencies [6] and duplicating the shared data, as described in section 5. Each of these k L2 partitions will in turn comprise t physical level partitions, described next.

3. Physical level Partitions $\{P_1, P_2, \ldots, P_t\}$ are at the lowest level of the hierarchy. Each of these partitions comprises an embedded RAM and its associated computation unit. Formally we have,

$$It(P_i) \cap It(P_j) = \phi;\quad Data(P_i) \cap Data(P_j) \ne \phi;\quad Comm(P_i, P_j) \ne \phi,\quad \forall i, j,\ 1 \le i, j \le t$$

Fig. 4-1: Partitioning Model

4.2 Description of the Model

As can be seen in fig. 4.1, the model comprises three main units: the Init unit, the Compute unit and the Collate unit. The Init unit initializes all the embedded RAMs with the appropriate data elements using the Address Map and the Input Data, both of which reside in the external memory. The Address Map contains the mapping of the array elements to the locations in the physical partitions. The Compute unit encapsulates all three levels of partitions discussed in the previous subsection. There is also a hierarchical FSM (LFSM, PFSM and TFSM respectively at each of the 3 levels) in place to provide control at each level. Finally, there is a Collate unit to collate all the results after computation and transfer them to the Output memory. Note that we make use of two clocks, viz. compute-clock and init-clock. Init-clock is a slower clock which triggers the Init unit, the Collate unit and the top level FSM (TFSM), as they are involved in global communication all over the chip. Compute-clock is a faster clock which triggers the RAMs, compute units and local FSMs.

5 PARTITIONING APPROACH

The aim of our approach is to capture those design points where there is maximum parallelism and minimum clock period, implying maximum performance. This is achieved by increasing the number of L2 partitions within one L1 partition and minimizing the number of physical partitions within an L2 partition. Algorithms 1 and 2 describe the complete exploration process. We will go over the main steps of the algorithms using the Matrix Multiplication example, whose kernel is shown below:

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            R[i, j] = R[i, j] + P[i, k] * Q[k, j];

________________________________________________
Algorithm 1: Level1_Partitioning
Input: Iteration space I, Loop body B, Total BRAMs on the device nbrams, Number of BRAM ports m, Set of loop-carried dependencies $\Delta = \{d_1, d_2, \ldots, d_u\}$
Output: No. of L1 partitions n, No. of L2 partitions k, No. of BRAMs or physical partitions in an L2 partition t, Dependency Bound Vector DBV, Dependency Relax Vector DRV
Begin
1. Perform List Scheduling of B given m memory ports and 1 multiplier to obtain a schedule length of l.
2. Group all the data-dependent iterations of I into n L1 partitions, where $It(L1_i) = b_i + \sum_{j=1}^{u} a_j d_j$
3. For i = 1 to u do
   a. $DBV_i$ := bound($a_i$)
4. Perform clustering to obtain similar size L1 partitions
5. Calculate BRAMs to be allocated to each L1 partition: $nbrams(L1_i) = (nbrams / |I|) * |It(L1_i)|$
6. For i = 1 to n do
   a. $<k, t, DRV>_i$ := Level2_Partitioning($L1_i$, $DBV_i$, l, $nbrams(L1_i)$, $\Delta$);
End
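Step 5 of Algorithm 1 distributes the BRAMs among the L1 partitions in proportion to their iteration counts. The following sketch shows one way to compute that allocation. It assumes that |I| equals the sum of the partition sizes (the L1 partitions are disjoint and are taken to cover the iteration space) and that fractional BRAMs are rounded down; the rounding rule and the name `allocate_brams` are assumptions, not details given in the algorithm.

```c
#include <assert.h>

/* Proportional BRAM allocation, following step 5 of Algorithm 1:
 *   nbrams(L1_i) = (nbrams / |I|) * |It(L1_i)|
 * it_size[i] holds |It(L1_i)|; alloc[i] receives the BRAM count.
 * Assumption: |I| is the sum of all it_size entries, and the result
 * is rounded down to a whole number of BRAMs. */
void allocate_brams(int nbrams, int n, const long it_size[], int alloc[])
{
    long total = 0;
    for (int i = 0; i < n; i++)
        total += it_size[i];
    assert(total > 0);

    for (int i = 0; i < n; i++)
        alloc[i] = (int)((long)nbrams * it_size[i] / total);
}
```

With a single L1 partition, as in the matrix multiplication example, the partition simply receives all the BRAMs on the device.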

We obtain L1 partitions using the approach proposed by Chen et al. [1] for multi-processor architectures, in which all the data-dependent iterations are grouped together into a set of partitions (step 2 of Algorithm 1). Using the representation of an iteration partition described in subsection 4.1, the L1 partition for matrix multiplication obtained after step 2 of Algorithm 1 is:

$$(0, 0, 0) + a_1 \cdot (0, 1, 0) + a_2 \cdot (1, 0, 0) + a_3 \cdot (0, 0, 1)$$

Here (0, 0, 0) is the initial point, while (0, 1, 0), (1, 0, 0) and (0, 0, 1) are the data-dependence distance vectors [6] corresponding to array accesses P[i, k], Q[k, j] and R[i, j] respectively. One can see that there will be only one L1 partition in this case, as by varying the $a_i$'s from 0 to N-1 all iterations can be obtained. If the loop bound N is 32, then $0 \le a_1, a_2, a_3 < 32$, which is also the bound returned by the bound function in step 3 for each $a_i$. Hence after step 3, DBV gets the value (32, 32, 32).

________________________________________________
Algorithm 2: Level2_Partitioning
Input: Input Partition L1_in, Dependency Bound Vector DBV, Schedule length l, BRAMs Allocated nbrams_L1, Set of loop-carried dependencies $\Delta = \{d_1, d_2, \ldots, d_u\}$
Output: No. of L2 partitions k, No. of BRAMs or physical partitions in an L2 partition t, Dependency Relax Vector DRV
Begin
1. Initialize a temporary L2 partition tempL2 with input L1 partition L1_in
2. Initialize k and t: k := 1; t := |Data(tempL2)| / BRAM_SIZE;
3. Get an estimate of clock-period: clk_prd_est := Lat(1, t);
4. cost := (|It(tempL2)| / t) * l * clk_prd_est;
5. For each false dependency $d_i \in \Delta$ do
   a. $DRV_i$ := 1;
   b. While ($DRV_i$ < $DBV_i$) do
      1. Relax $d_i$ to obtain two L2 partitions L2' and L2'' from tempL2 having equal number of iterations
      2. Increase the data-size of L1_in by the amount of the duplicated data, i.e. |Data(L2') ∩ Data(L2'')|;
      3. if (size exceeds the capacity of nbrams_L1) then Print "Capacity Exceeded"; return <k, t, DRV>;
      4. $DRV_i$ := $DRV_i$ * relax_factor($d_i$);
      5. Update k and t: k := k * 2; t := |Data(L2')| / BRAM_SIZE;
      6. clk_prd_est := Lat(1, t);
      7. cost := (|It(L2')| / t) * l * clk_prd_est;
      8. if (cost exceeds prev(cost)) then Print "Minima"; return prev(<k, t, DRV>);
      9. tempL2 := L2';
End
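The cost evaluation and stopping rule in steps 6 to 8 of the while loop can be sketched as follows. This is a simplified illustration, not the full exploration: it assumes that the sequence of candidate design points (iteration count of one L2 partition, BRAM count t, and clock period estimate Lat(1, t)) has already been computed and is passed in as arrays. The function names `design_cost` and `first_minimum` are hypothetical.

```c
#include <assert.h>

/* Cost of one design point, as in steps 4 and 7 of Algorithm 2:
 *   cost = (|It| / t) * l * clk_prd_est
 * iterations  = |It| of one L2 partition,
 * t           = BRAMs (physical partitions) per L2 partition,
 * l           = schedule length of the loop body,
 * clk_prd_est = estimated clock period, Lat(1, t). */
double design_cost(long iterations, int t, int l, double clk_prd_est)
{
    assert(t > 0);
    return ((double)iterations / t) * l * clk_prd_est;
}

/* Walks candidate design points in exploration order and applies the
 * rule of step 8: stop as soon as the cost exceeds the previous cost
 * and keep the previous design point. Returns the index of the chosen
 * point, or n - 1 if the cost never increases. */
int first_minimum(int n, const long iterations[], const int t[],
                  const double clk[], int l)
{
    assert(n > 0);
    double prev = design_cost(iterations[0], t[0], l, clk[0]);
    for (int s = 1; s < n; s++) {
        double cost = design_cost(iterations[s], t[s], l, clk[s]);
        if (cost > prev)
            return s - 1; /* "Minima": previous design point is kept */
        prev = cost;
    }
    return n - 1;
}
```

Because the rule stops at the first increase in cost, it finds a local minimum along the exploration path; whether that is also the global minimum depends on how the clock period estimate behaves as t shrinks.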

In Algorithm 2 (called for every L1 partition), we further partition the input L1 partition to obtain L2 partitions by relaxing the false dependencies [6], as shown in step 5. The Dependency Relax Vector contains the amounts by which the dependencies are relaxed. For matrix multiplication, the dependency d1 = (0, 1, 0) is relaxed in the first iteration of the for loop. In every successive while loop iteration inside the for loop, d1 is relaxed by half to obtain two equally sized L2 partitions. Hence, relax_factor(d1) is 2, and after 2 iterations of the while loop, DRV = (4, 0, 0). After relaxing a dependency, the dependent or shared data is duplicated among the newly formed partitions, as shown in step 2 of the while loop. It may be noted (from subsection 4.1) that communication happens only within an L2 partition; thus the access path between the farthest BRAM and multiplier inside one L2 partition forms the critical path, and the critical path delay (using the latency descriptions from section 3) becomes Lat(1, t). We use this as a compute-clock period estimate to evaluate the cost of a design point, as shown in steps 6 and 7 of the while loop. In every iteration of the while loop, the number of L2 partitions, k, increases, while the number of physical partitions or BRAMs within an L2 partition, t, decreases. This is consistent with the objective stated at the beginning of this section.
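To make the effect of relaxing d1 = (0, 1, 0) concrete, the sketch below splits the j loop of matrix multiplication into independent blocks. It reflects one reading of the relaxation: iterations that differ only in j share elements of P, so after relaxation each block needs its own copy of P (the duplicated data), while it uses only its own columns of Q and R. The size `N`, the function names and the use of valid C indexing `R[i][j]` in place of the paper's `R[i, j]` notation are choices made for this illustration.

```c
#include <assert.h>

#define N 8 /* hypothetical small matrix size for illustration */

/* Original kernel. The caller must zero-initialize R. */
void matmul(int P[N][N], int Q[N][N], int R[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                R[i][j] += P[i][k] * Q[k][j];
}

/* Kernel split into `parts` blocks along j. Each block reads all of P
 * (which would be duplicated per L2 partition in hardware) but only its
 * own columns of Q, and writes only its own columns of R, so the blocks
 * do not communicate. The caller must zero-initialize R. */
void matmul_l2(int parts, int P[N][N], int Q[N][N], int R[N][N])
{
    assert(parts > 0 && N % parts == 0);
    int width = N / parts;
    for (int p = 0; p < parts; p++) /* one pass per L2 partition */
        for (int i = 0; i < N; i++)
            for (int j = p * width; j < (p + 1) * width; j++)
                for (int k = 0; k < N; k++)
                    R[i][j] += P[i][k] * Q[k][j];
}
```

Each doubling of `parts` corresponds to one iteration of the while loop: the iteration count per block halves, and so does the share of Q and R each block holds, at the price of one more copy of P.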

6 EXPERIMENTAL RESULTS

In this section, we present the experimental results obtained after synthesis for a set of designs for the Matrix Multiplication application with 32x32 matrices. The details are shown in table 1. We used the XC2V8000 Virtex II [7] device as the target FPGA, which has 168 BRAMs organized in 6 columns. The graph in figure 6.1 shows that the number of compute cycles drops sharply as the number of L2 partitions increases, as expected from the increasing parallelism. The number of init cycles, however, increases, primarily because of the growing amount of data duplication in the designs. The compute clock period first decreases and then increases again (due to increasing routing overheads). The init clock period, on the other hand, keeps rising because the initialization and collation paths become longer. The design with the minimum overall time (the sum of initialization, computation and collation time) is chosen, which is design 5, as can be inferred from table 1.
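The selection criterion can be written down as a short sketch. It assumes, following subsection 4.2, that initialization and collation run on the init-clock and computation runs on the compute-clock; the function names `total_time`, `best_design` and `speedup` are hypothetical, and the cycle counts are inputs that would come from the synthesized designs.

```c
#include <assert.h>
#include <stddef.h>

/* Overall time of one design: initialization and collation are clocked
 * by the (slower) init-clock, computation by the compute-clock.
 * Periods and the result share the same time unit. */
double total_time(long init_cycles, long collate_cycles, long compute_cycles,
                  double init_clk_period, double compute_clk_period)
{
    return (double)(init_cycles + collate_cycles) * init_clk_period
         + (double)compute_cycles * compute_clk_period;
}

/* Index of the design with the smallest total time. */
size_t best_design(const double times[], size_t n)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (times[i] < times[best])
            best = i;
    return best;
}

/* Speedup of a design relative to the un-distributed baseline. */
double speedup(double baseline_time, double design_time)
{
    assert(design_time > 0.0);
    return baseline_time / design_time;
}
```

Applied to the total times reported in table 1, `best_design` returns zero-based index 4, i.e. design 5 (k = 16, t = 1), whose speedup over the un-distributed design is about 8.68.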

Fig. 6.1: Init and Compute cycles & period variation

7 CONCLUSION

In this paper we have presented a high-level synthesis approach for distributed memory FPGA architectures. In this approach, we distribute the iteration and data space of a given loop-based input application to achieve localization of memory accesses and concurrency in computations. Using this approach, we have achieved an improvement in the clock frequency and thus a super-linear speedup in overall performance.

Table 1: Comparison between the set of designs

Type  Configuration   Tot. time (us)  Speed-up  Slices
1     un-distributed  2464.24         1.00      278
2     k=1, t=4        672.23          3.66      930
3     k=4, t=2        373.45          6.59      1726
4     k=8, t=1        368.32          6.69      1798
5     k=16, t=1       283.70          8.68      3506
6     k=32, t=1       297.05          8.29      6982
7     k=64, t=1       360.36          6.83      13635
8     k=128, t=1      551.49          4.46      27514

8 REFERENCES

[1] T.S. Chen et al. Communication-Free Data Allocation Techniques for Parallelizing Compilers on Multicomputers. IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 9, pp. 924-938, Sept., 1994.

[2] Bruce A. Draper et al. Compiling SA-C Programs to FPGAs: Performance Results. In ICVS '01: Proceedings of the Second International Workshop on Computer Vision Systems, pages 220-235, London, UK, 2001. Springer-Verlag.

[3] Baradaran et al. Compiler reuse analysis for the mapping of data in FPGAs with RAM blocks. In FPT '04: Proceedings of the Conference on Field-Programmable Technology, pages 145-152, Dec., 2004.

[4] I. Ouaiss and R. Vemuri. Hierarchical memory mapping during synthesis in FPGA-based reconfigurable computers. In DATE '01: Proceedings of the Conference on Design, Automation and Test in Europe, pages 650-657. Piscataway, NJ, USA, 2001. IEEE Press.

[5] Baradaran et al. Memory parallelism using custom array mapping to heterogeneous storage structures. In FPL '06: Proceedings of the International Conference on Field Programmable Logic and Applications, pages 383,388. Madrid, Spain, Aug., 2006.

[6] Randy Allen & Ken Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann Publishers, first edition, 2001.

[7] Virtex II Platform FPGAs: Datasheet and User guide, http://www.xilinx.com/products/siliconsolutions/fpgas/virtex/virtexiiplatformfpgas

[8] http://www.altera.com/products/devices/stratix2/
