A Performance Oriented Systematic Flow Of Structural Transformations In Synchronous Elastic Control Networks

Abstract—Because they are latency insensitive, synchronous elastic designs are robust to interconnect and computation latency variations. Generating a control network is an essential part of converting any clocked design into a synchronous elastic one. This paper reports on a systematic flow of structural transformations that allows an elastic control network to meet tight timing constraints. The flow has been verified and applied to the MiniMIPS processor and the s298 ISCAS-89 benchmark. It removed a total negative slack of 1.3 ns from the former with an area improvement of 6.2%. From the latter, it removed 5.3 ns with an area penalty of only 0.4%. Results are reported in comparison with contemporary work.

I. INTRODUCTION

Being latency insensitive (LI), synchronous elastic designs are able to tolerate communication and computation latency variations without affecting the system functionality [1]. This allows for increased modularity and facilitates IP reuse.

Synchronous elasticization is one approach (among others [2]) to transforming an ordinary clocked design into an LI design [3], [4], [5]. A typical first step in synchronous elasticization is to replace each register in the original clocked design with an elastic buffer (EB). A block diagram of an EB is shown in Fig. 1 [6]. An EB is composed of two planes. The data plane is the same as that of the ordinary clocked system, except that it usually uses two latches instead of an edge-triggered flip-flop [7]. The control plane generates the latch enable signals (i.e., Em and Es) based on the communications over the left and right control channels. The next step is to analyze the register-to-register communications in the original design. Between each two communicating registers there must be a corresponding elastic control channel to control the data flow of this communication. Synchronous Elastic Flow (SELF) [5] is a communication protocol over the elastic control channels. In SELF, each control channel is composed of a pair of signals: valid in the forward direction and stall in the backward direction. Valid denotes that data is available at the sending register. Stall is asserted by the receiving register if it cannot accept data. For convenience, we will consider the direction of a channel to be the direction of the associated valid signal. When more than one transmitting register and one receiving register are involved, a network of control channels (i.e., a control network) is required. Control networks incorporate join and fork components to achieve the required communications between the registers. A join component joins two or more input channels into one channel. A fork component forks one input channel into two or more control channels. Fig. 2 shows sample implementations of an n-input join and an n-output eager fork [5], [6], [7]. The actual implementation depends on the gates available in the library used. In this paper we designate forks and joins with � and ⊗, respectively.

The control network can be constructed in many different ways. An algorithm has been reported that automatically generates a control network incorporating the minimum total number of 2-input joins and 2-output forks [8]. However, it does not guarantee the minimum possible critical path delay in the control network.

This paper reports on a systematic flow of structural transformations of the control network that reduces the network delay to permit meeting timing requirements. It targets paths that do not meet timing constraints (i.e., have negative slack) at the cost of possibly adding some hardware. It essentially tries to eliminate the total negative slack of the control network at a specified clock period constraint. The flow is also verified to guarantee that the two versions of the control network (i.e., before and after the structural transformations) are functionally equivalent.

Fig. 1. An EB implementation.

Fig. 2. Sample join and fork implementations. (a) An n-input join. (b) An n-output eager fork.

II. PROPOSED STRUCTURAL TRANSFORMATIONS

A path, p_i, in the control network is defined the same way as in the data path. A path is a concatenation of signals. It starts at a Q-output of a synchronizing element (e.g., register or latch), and it ends at a D-input of a synchronizing element. The delay of a certain signal, s_i, in a certain path, p_i, denoted as D_sp(s_i, p_i), is the time between a change at the path start point and the resulting intended (i.e., not transient) change in that signal's value. One signal usually belongs to several paths. We define the delay of a signal s_i, denoted as D_s(s_i), to be the maximum delay of that signal over all the paths that it belongs to. The delay of a path, p_i, with an endpoint ep_i, denoted as D_p(p_i), is defined as D_sp(ep_i, p_i). A path, p_i, is called a violator, v_i, if its delay violates one of the timing constraints. In this paper we are concerned with maximum delay constraints. We consider a path a violator if its delay exceeds some maximum delay constraint (usually a clock period with setup, propagation delays and time borrowing taken into account). The difference between a time constraint and the path delay is known as slack. If the slack is negative, the path is a violator. The total negative slack is defined to be the sum of the negative slacks of all the violators of the design (control network). It is usually represented as a positive number.
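The definitions above can be illustrated with a minimal Python sketch. The function names, the path labels and the delay values (in ns) are hypothetical and chosen only for illustration; the sketch assumes a single maximum delay constraint shared by all paths.

```python
from typing import Dict, Tuple


def signal_delay(dsp: Dict[Tuple[str, str], float], signal: str) -> float:
    """D_s(s_i): the maximum of D_sp(s_i, p) over every path p that contains s_i."""
    return max(d for (s, _path), d in dsp.items() if s == signal)


def total_negative_slack(path_delays: Dict[str, float], constraint: float):
    """Return (violators, tns) for path delays D_p(p_i) under one delay constraint.

    violators maps each violating path to its (negative) slack;
    tns is the sum of those slacks, reported as a positive number.
    """
    slacks = {p: constraint - d for p, d in path_delays.items()}
    violators = {p: s for p, s in slacks.items() if s < 0}
    tns = -sum(violators.values())
    return violators, tns


# Hypothetical numbers: one signal seen on two paths, and three path delays.
example_dsp = {("VAB", "p1"): 0.20, ("VAB", "p2"): 0.31}
example_paths = {"p1": 0.30, "p2": 0.45, "p3": 0.52}
violators, tns = total_negative_slack(example_paths, constraint=0.37)
```

With these made-up numbers, p2 and p3 are the violators and the total negative slack is about 0.23 ns.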


Fig. 3. Combining concatenated n-input and m-input joins. (a) Before. (b) After.

The purpose of the presented flow is to reduce the total negative slack to zero at a certain clock period constraint. The following are some proposed structural transformations that help reduce violator delays:

A. Combining Joins And Input Valids Reorder

Concatenated m-input-channel and n-input-channel joins can be combined into an (m+n-1)-input-channel join, as shown in Fig. 3. The combination preserves the control network functionality. It also reduces the delay of the valid output signal, Vr.

Combining reduces the number of logic gates between the latest input valid signal and the join valid output, Vr. It also allows for an optimization inside the combined join that takes into account the relative arrival times of the different input valid signals. In other words, it allows for moving critical signals closer to the output. Similarly, local optimization inside the combined (m+n-1)-input join can reduce the delays of the stall output signals (i.e., Sl1, Sl2, etc.).
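The claim that combining preserves functionality can be checked exhaustively on a small case. The Python sketch below is an illustration rather than the implementation of Fig. 2. It assumes a purely combinational join whose stall equation is the one used for the LJoin2_u model in Section V (Sl_i = Vl_i & !(Vr & !Sr)), with the unit delay dropped and the equation generalized to n inputs. It compares two chained 2-input joins against one 3-input join.

```python
from itertools import product


def join(valids, stall_r):
    """Combinational n-input join: returns (Vr, [Sl_1, ..., Sl_n])."""
    v_r = all(valids)                      # output is valid only when every input is valid
    transfer = v_r and not stall_r         # a token moves when valid and not stalled
    stalls_l = [v and not transfer for v in valids]
    return v_r, stalls_l


def chained(va, vb, vc, sr):
    """A 2-input join (A, B) feeding a second 2-input join (AB, C)."""
    v_ab = va and vb
    v_r, (s_ab, s_c) = join([v_ab, vc], sr)
    _, (s_a, s_b) = join([va, vb], s_ab)   # the first join sees the stall of channel AB
    return v_r, [s_a, s_b, s_c]


def combined(va, vb, vc, sr):
    """The (2+2-1) = 3-input join that replaces the chain."""
    return join([va, vb, vc], sr)


def equivalent():
    """Compare both structures on all 16 input combinations."""
    return all(chained(*bits) == combined(*bits)
               for bits in product([False, True], repeat=4))
```

Under these assumptions `equivalent()` returns True. The sketch says nothing about delay, which depends on how the synthesis tool maps the combined join.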

B. Combining Forks And Input Stalls Reorder

Similarly, concatenated m-output-channel and n-output-channel forks can be combined into an (m+n-1)-output-channel fork. The combination preserves the control network functionality. It also reduces the delay of the stall output signal, Sl.

The reasons are the same as in Section II-A but with respect to the stall signals. Also, local optimization inside the combined (m+n-1)-output-channel fork can reduce the delays of the valid output signals (i.e., Vr1, Vr2, etc.).

C. Rolling Back A Fork

If concatenated joins and forks are, respectively, combined, then any path would pass through a concatenation of interleaving multi-input joins and multi-output forks.

Rolling back a fork moves a fork back in a path, such that it can combine with forks preceding it in that path. Further, this allows the joins before and after it to be combined together. Rolling back a fork preserves the control network functionality (see the verification in Section V). It has the potential to cut the path delay because of the combining action that takes place in both the joins and the forks that surround this fork. However, in some cases the transformation can introduce more violators. Quantifying the effect of rolling back a fork is deferred to Section III.

Example 1. Let A, B, C, D, X1, X2, X3, X4 be eight registers in the original ordinary clocked design. The following registers pass data to X1: A, B, C; to X2: A, B, D; to X3: A; and to X4: B. A possible control network of the LI version of this design is shown in Fig. 4a.

Assume that the following path is a violator: (from A), VA, VA2, VAB, VAB1, VABC (to X1), where Vx and Sx are the valid and stall signals of control channel x, respectively. This path passes through two 2-output forks and two 2-input joins. Rolling fork FAB back to the inputs of join JAB is shown in Fig. 4b. This allows for combining the preceding and following joins and forks as shown in Fig. 4c. The path from A to X1 now incorporates only one 3-output fork and one 3-input join. Hence, rolling back fork FAB reduces the delay of that path.

Fig. 4. Steps of rolling back fork FAB: panels (a), (b), and (c).

Fig. 5. Rolling back an n-output fork through an m-input join. (a) Before. (b) After.

In general, rolling back an n-output fork through an m-input join is shown in Fig. 5, where Iij is the jth output of an n-output fork whose input is Ii. The m n-output forks that produce the Iij's are omitted from Fig. 5b for simplicity. The Ii's and Xi's in Fig. 5 could be any control channels (i.e., not necessarily directly connected to controllers). Rolling back some (not all) of the branches of an n-output fork through an m-input join also reduces the delay of some of the paths. However, in the context of this paper, when we roll back a fork, we roll back all its branches.
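As a rough cross-check of Example 1, the Python sketch below computes the fork and join sizes that result once every fork has been rolled back and all concatenated joins and forks have been combined. It assumes that, in this fully combined form, each sender drives one fork with one output per receiver and each receiver is fed by one join with one input per sender. The helper names are hypothetical, and the sketch has only been compared against the A-to-X1 path described for Fig. 4c.

```python
from collections import Counter

# Register-to-register communications of Example 1: receiver -> senders.
SOURCES_OF = {
    "X1": ["A", "B", "C"],
    "X2": ["A", "B", "D"],
    "X3": ["A"],
    "X4": ["B"],
}


def join_fan_in(sources_of):
    """Number of inputs of the single combined join in front of each receiver."""
    return {dst: len(srcs) for dst, srcs in sources_of.items()}


def fork_fan_out(sources_of):
    """Number of outputs of the single combined fork behind each sender."""
    return dict(Counter(src for srcs in sources_of.values() for src in srcs))


fan_in = join_fan_in(SOURCES_OF)
fan_out = fork_fan_out(SOURCES_OF)
```

Here `fan_out["A"]` and `fan_in["X1"]` are both 3, which agrees with the one 3-output fork and one 3-input join on the path from A to X1. A fan-out or fan-in of 1 means that no fork or join is needed on that side.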


III. GAIN FUNCTION

Rolling back a fork would usually decrease the delay of the associated paths because of the combining action that takes place in the preceding and following joins and forks. However, in some cases, it may increase the negative slack of some violators. To quantify these effects on a certain fork F_i, we define a heuristic gain function, Gain(F_i). Gain(F_i) evaluates to a number that should be proportional to the reduction in the total negative slack of the network if fork F_i is rolled back.

To compute the Gain of a certain fork, F_i, we first need to examine the different path types that can pass through this fork. The following is a list of six path types along with the effect of rolling back on each. The argument makes use of the network of Fig. 4, where fork FAB is to be rolled back. We use the join and eager fork implementations of Fig. 2.

A. Type I

A path of this type has the fork Vl and any of the Vri as part of it (i.e., it passes through the fork in the valid direction).

Let us consider a path of type I passing through fork FAB in Fig. 4a. A path can neither start nor end in a join, since a join does not have any synchronizing elements. A path can only start or end either in an elastic controller or in a fork (since eager forks, which we use, incorporate registers). Hence, a type I path that passes through fork FAB will end either at the valid input of the X1 controller (i.e., through join JABC), or at the valid input of the X2 controller (i.e., through join JABD), or at the stall input of the C controller (i.e., VAB1, then through join JABC to SC), or at the stall input of the D controller (i.e., VAB2, then through join JABD to SD). In all four cases, rolling back fork FAB will reduce the delay at the path endpoint. The delay reduction is due to the fork combination (FA with FAB, and FB with FAB) and the join combination (JAB with JABC, and JAB with JABD), as shown in Fig. 4c.

B. Type II

A path of this type has any of the fork Sri and Sl as part of it (i.e., it passes through the fork in the stall direction).

Let us consider a path of type II passing through fork FAB in Fig. 4a. This path will end either at the stall input of the A or B controllers, or at the D-input of either of the two registers R1 and R2 in forks FA or FB. In all these cases, the path delays are the same or less after rolling back fork FAB.

Consider, as an example, the following path in Fig. 4a: (from X1), SABC, SAB1, SAB, SA2, SA (to A). The path incorporates two 2-output forks and two 2-input joins. After rolling back, in Fig. 4c, the path is reduced to only one 3-output fork and one 3-input join.

C. Type III

A path of this type has the fork Vl and any of the Ri register D-inputs as part of it (i.e., it is a path coming in the valid direction and ending inside the fork). Rolling back a fork is likely to decrease the delay of this type of path.

An example of this type in Fig. 4a is: (from A), VA, VA2, VAB (to FAB/R1/D). It can be easily shown that rolling back fork FAB will decrease the delay at that path endpoint.

D. Type IV

A path of this type has any of the Ri register Q-outputs (inside the fork) and Sl as part of it (i.e., it starts inside the fork and propagates in the stall direction). Rolling back a fork is likely to decrease the delay of this type of path.

An example of this type in Fig. 4a is: (from FAB/R1/Q), SAB, SA2, SA (to A). It can be easily shown that rolling back fork FAB will decrease the delay at that path endpoint.

E. Type V

A path of this type has any of the Ri register Q-outputs (inside the fork) and the corresponding Vri as part of it (i.e., it is a path starting inside the fork and propagating in the valid direction). Rolling back a fork is likely to increase the delay of this type of path.

An example of this type in Fig. 4a is: (from FAB/R1/Q), VAB1, VABC (to X1). It can be easily shown that rolling back fork FAB will increase the delay at that path endpoint.

F. Type VI

A path of this type has any of the fork Sri and any of the Ri register D-inputs as part of it (i.e., it is a path coming in the stall direction and ending inside the fork). Rolling back a fork is likely to increase the delay of this type of path.

An example of this type in Fig. 4a is: (from X1), SABC, SAB1 (to FAB/R1/D). It can be easily shown that rolling back fork FAB will increase the delay at that path endpoint.

We define the Gain function of a certain fork, F_i, as follows:

$$\mathrm{Gain}(F_i) = \sum_{j=1}^{|\mathit{Violators}|} r_j \, w_j \tag{1}$$

where |Violators| is the number of violators, v_j; r_j is a number proportional to the delay reduction in violator v_j caused by rolling back fork F_i; and w_j is the weight of violator v_j.

One approach to choosing the violator weights (i.e., w_j) is to give each violator a weight based on its negative slack. This approach gives priority to fixing the worst-slack violators. Another approach is to choose a value of 1 for all violator weights, giving all of them the same priority. The results reported in this paper are based on the latter approach.

The value of r_j is technology and topology dependent. It also depends on the synthesis tool's optimization algorithms. Accurate evaluation of these values is left for future work. We choose a value of 1 for each violator that is of type I, II, III or IV, -1 for each violator that is of type V or VI, and 0 otherwise (i.e., when the violator does not include F_i at all).
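With unit weights, Eq. (1) reduces to counting violator occurrences per path type. The Python sketch below is one way to write this; the names `R_BY_TYPE`, `gain` and `TABLE_I` are hypothetical. It assumes that a violator crossing a fork in more than one way is counted once per type, which is how the per-type counts of Table I (Example 2, Section IV) appear to be tallied. The counts themselves are copied from that table.

```python
# r_j per path type, as chosen in the text: +1 for types I-IV, -1 for types V-VI.
R_BY_TYPE = {"I": 1, "II": 1, "III": 1, "IV": 1, "V": -1, "VI": -1}


def gain(type_counts, weight=1):
    """Eq. (1) with one common weight w_j for all violators.

    type_counts maps a path type to the number of violator occurrences of
    that type through the fork; types that are absent contribute 0.
    """
    return sum(R_BY_TYPE[t] * weight * n for t, n in type_counts.items())


# Per-fork violator counts from Table I (Example 2, iteration 1).
TABLE_I = {
    "FBCG":  {"I": 0,  "II": 0,  "III": 0, "IV": 0, "V": 0, "VI": 0},
    "FABE":  {"I": 21, "II": 13, "III": 0, "IV": 0, "V": 0, "VI": 2},
    "FABDE": {"I": 21, "II": 17, "III": 0, "IV": 0, "V": 0, "VI": 0},
}

gains = {fork: gain(counts) for fork, counts in TABLE_I.items()}
best_fork = max(gains, key=gains.get)
```

The computed gains are 0, 32 and 38, matching the Gain row of Table I, so FABDE is selected.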

IV. THE PROPOSED FLOW

A chart of the proposed flow is shown in Fig. 6. The flow starts by running the CNG tool [8], a tool that generates a control network for latency insensitive designs with the minimum total number of 2-input joins and 2-output forks. The resultant network is synthesized and checked against the timing constraints. If there is no violation, the flow exits successfully. If there are timing violations, the violators reported by the synthesis tool are analyzed. The Gain function is computed for all the forks in the design. The fork with the highest Gain is chosen to be rolled back. The new network is then passed to the synthesis tool again. The loop continues until the network meets the timing constraint (i.e., success) or there are no more forks available to be rolled back (i.e., fail). We use the implementations of joins, eager forks and controllers of [5].
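The loop described above can be sketched in Python as follows. This is a schematic of the control flow only. The three callables stand in for the synthesis run, the Gain computation and the rolling back transformation, and both their names and the toy stand-ins underneath are hypothetical. The sketch assumes that `compute_gains` returns an empty mapping once no fork can be rolled back.

```python
def run_flow(network, synthesize, compute_gains, roll_back):
    """Iterate synthesize -> analyze -> roll back until timing is met or no fork is left.

    synthesize(network)               -> list of violators (empty when timing is met)
    compute_gains(network, violators) -> {fork: gain} for forks that can still be rolled back
    roll_back(network, fork)          -> transformed network
    Returns (network, rolled_back_forks, success).
    """
    rolled = []
    while True:
        violators = synthesize(network)
        if not violators:
            return network, rolled, True       # timing constraint met
        gains = compute_gains(network, violators)
        if not gains:
            return network, rolled, False      # no more forks to roll back
        best = max(gains, key=gains.get)       # fork with the highest Gain
        network = roll_back(network, best)
        rolled.append(best)


# Toy stand-ins: the "network" is just the set of forks not yet rolled back,
# and timing is met once the two made-up critical forks are gone.
TOY_GAINS = {"F1": 5, "F2": 9, "F3": 0}


def toy_synthesize(net):
    return ["violator"] if {"F1", "F2"} & net else []


def toy_gains(net, violators):
    return {fork: TOY_GAINS[fork] for fork in net}


def toy_roll_back(net, fork):
    return net - {fork}


final_net, order, ok = run_flow(frozenset(TOY_GAINS), toy_synthesize,
                                toy_gains, toy_roll_back)
```

In the toy run, F2 is rolled back first, then F1, after which the stand-in synthesis step reports no violators and the loop exits with success.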


Fig. 6. The proposed flow.

A. Synthesis Considerations

Only the control network part of the design is synthesized. The data path is abstracted out. We use the EB controller implementation of [5]. In the controllers, we set a value of zero for the output port delays of the master and slave latch enables (i.e., Em and Es, respectively). This allows Em and Es to change as late as the positive clock edge but not later. It also ensures the maximum possible time borrowing (for Em) without affecting the data path performance (i.e., no time borrowing from the data path will take place). A more accurate value for the Em and Es port delays would be the enable setup times, which are library dependent.

One of the strongest motivations behind the latency insensitive paradigm is to tackle long wire delay problems [9], [10], [11]. Besides, it facilitates communication between different IP cores on a chip. Hence, the logic in the LI control network is expected to be highly distributed, where wire delays are substantial contributors to the violator slacks. We plan to include a metric for wire delays in the Gain function proposed in Section III in future work. The wire delay metric will be based on back-annotated place and route information. Hence, the choice of rolling back a fork will take into account the added (or removed) wire delay expenses. For this same reason, we keep the hierarchy during synthesis (i.e., the logical positions of joins and forks are kept and only local optimizations inside the joins and forks are allowed). This way we can back-annotate the wire delays into our calculations and into the synthesis tool. We use DC Ultra with -timing script to ensure the highest effort in optimizing for performance. Finally, to minimize the area, we set the maximum area constraint (set_max_area) to zero.

Example 2. Given the control network of Fig. 7 [8], find a functionally equivalent network that can be clocked with a 370 ps clock.

We synthesize the original control network of Fig. 7 [8] with DC, for a clock period constraint of 370 ps. DC reports an area of 1304.4, 23 violators, and a total negative slack of 1.4 ns. All reported violators are then analyzed and the gain function is calculated for all the network forks.

Fig. 7. Control network of Example 2.

TABLE I
EXAMPLE 2 - ITERATION 1

            FBCG   FABE   FABDE
Type I         0     21      21
Type II        0     13      17
Type III       0      0       0
Type IV        0      0       0
Type V         0      0       0
Type VI        0      2       0
Gain           0     32      38

Table I shows the analysis results. Since fork FABDE has the highest gain of 38, it is chosen to be rolled back. FABDE is preferred over FABE because 4 of the violators that pass through both of them in the valid direction (i.e., type I) pass through only FABDE in the stall direction. An example of such violators is: (start from FA/R2/Q), VA2, VABE, VABE2, VABDE, VABDE2, (through join JABCDE), SABDE2, SABDE (end at SD). Besides, 2 violators end at the internal registers of FABE coming in the stall direction (i.e., type VI).

TABLE II
EXAMPLE 2 RESULTS

#   Total Neg.    Area          Area         Fork To     Its
    Slack (ns)    @T=0.37 ns    @T=400 ns    Roll Back   Gain
1   1.4           1304.4        852          FABDE       38
2   0.1           1174.2        859          FABE        16
3   0.0           1195.8        940.8

Hence, FABDE is rolled back and the new control network is synthesized again with the same timing constraints (i.e., 370 ps clock period). DC reports an area of 1174.2, 9 violators and a total negative slack of only 0.1 ns. The violators are similarly analyzed. FABE is rolled back. Then, the network is synthesized. DC reports an area of 1195.8 and no violations. Results are summarized in Table II.

V. VERIFICATION

We verify the correctness of the proposed structural transformations of Section II using the symbolic model checker NuSMV [12]. We verify that the control networks before and after the transformations are functionally equivalent. In other words, there is no sequence of inputs to the control network that produces different outputs in the two versions of the control network. In this section we verify the correctness of rolling back a fork (Section II-C). Other transformations (i.e., of Sections II-A and II-B) can be verified similarly.

Fig. 8. Verification setup for rolling back a fork.

Fig. 5 showed rolling back an n-output fork through an m-input join. For brevity, we will verify the case of n=2 and m=2. Higher values of n and m have also been verified. We use the setup of Fig. 8. Elastic buffer controllers I1 and I2 are connected to controllers X1 and X2 through two versions of the control network. The one on the top (designated 'Before') is the control network before any transformations. The one on the bottom (designated 'After') is the control network after rolling back fork FI1I2 through join JI1I2. Green lines represent the valid signals of the control channels. Red lines represent the stalls. Suffixes _B and _A are used to designate the outputs of the control network before and after the transformation, respectively. The inputs coming from the controllers (i.e., VI1, VI2, SX1, SX2) are applied to the two networks simultaneously. The corresponding outputs of the two networks (i.e., VX1, VX2, SI1, SI2) are ORed together, respectively, and then passed to the controllers. For example, VX1_B and VX1_A are ORed and passed to the input valid pin of controller X1. We generated NuSMV code for the joins, eager forks, logic gates, and the elastic controllers. The models were connected synchronously in NuSMV, similar to [13]. Synchronous connection in NuSMV guarantees that all components of the design advance synchronously. The delay of each component is then encoded in individual counters in terms of the global time unit used by NuSMV. Without loss of generality, we used unit delay models for all the flip-flops and logic functions except for the clock generator (described below). This minimizes the required state space for verification.

For example, to model a unit delay join, we use the following NuSMV code:

    MODULE LJoin2_u(Vl1, Vl2, Sr)
    VAR
      Sl1 : boolean;
    ASSIGN
      init(Sl1) := FALSE;
      -- Unit delay: Sl1 takes its new value one step after its inputs change.
      next(Sl1) := Vl1 & !(!Sr & (Vl1 & Vl2));
    -- We similarly model Sl2 and Vr.

The EB model is the same as in [5] and its NuSMV code is omitted due to space limitations. A clock generator was also modeled with a clock period long enough for the control network signals to reach the destination controllers. We used the following NuSMV code to model a clock generator with a period of 20 time units:

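    MODULE ClkGenerator
    VAR
      clk  : boolean;
      wait : 1..10;                  -- steps left in the current half period
    DEFINE
      HalfPeriod := 10;
    ASSIGN
      init(wait) := HalfPeriod;
      next(wait) := case
        wait > 1 : wait - 1;         -- count down
        wait = 1 : HalfPeriod;       -- reload at the end of each half period
        TRUE     : wait;
      esac;
      init(clk) := FALSE;
      next(clk) := case
        !clk & (wait = 1) : TRUE;    -- rising edge
        clk & (wait = 1)  : FALSE;   -- falling edge
        TRUE              : clk;
      esac;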

Finally, we used the following PSL properties to check that the two versions of the control network (i.e., before and after the transformation) are functionally equivalent.

    DEFINE VX1_MISMATCH := Clk & (VX1_B xor VX1_A);
    PSLSPEC never {[*]; VX1_MISMATCH[+]; !Clk[+]};
    -- We similarly check VX2, SI1, SI2.

The properties can be read simply as follows: a mismatch in the value of any control network output (at the end of the positive phase of the clock) must not happen. Note that the elastic controllers produce the valid and stall signals at the positive edge of the clock [5]. To take into account the propagation delay of these signals over the control network, we check the above properties at the end of the positive phase of the clock (i.e., just before the clock goes low). Without loss of generality, we assume the positive phase of the clock is long enough for the control signals to propagate.
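To make the sampling point of the property concrete, the Python sketch below applies the same condition to a finite recorded trace. It is an illustration only: NuSMV checks the property symbolically over all reachable behaviors, whereas this hypothetical helper inspects one trace given as equal-length lists of clock and output samples. It flags a step when the clock is high, the two outputs differ, and the clock is low at the next step, which is the pattern matched by the sequence in the PSL property.

```python
def mismatch_violations(clk, out_before, out_after):
    """Return the steps t at which a mismatch is seen just before the clock goes low.

    clk, out_before and out_after are equal-length sequences of 0/1 (or bool)
    samples, one per global time unit.
    """
    hits = []
    for t in range(len(clk) - 1):
        mismatch = bool(clk[t]) and (bool(out_before[t]) != bool(out_after[t]))
        if mismatch and not clk[t + 1]:
            hits.append(t)
    return hits


# Made-up trace: the outputs differ at t=0 (early in the high phase, tolerated)
# and at t=5 (last high step before the falling edge, a violation).
clk_trace = [1, 1, 0, 0, 1, 1, 0]
vx1_b     = [0, 1, 1, 0, 0, 1, 1]
vx1_a     = [1, 1, 1, 0, 0, 0, 1]
violations = mismatch_violations(clk_trace, vx1_b, vx1_a)
```

For this made-up trace only step 5 is reported. A transient difference early in the high phase is ignored, in line with the assumption that the positive phase is long enough for the signals to settle.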

All the properties were proven true by NuSMV, which guarantees functional equivalence between the two versions of the control network. It also proves the correctness of the transformation (rolling back a fork).

VI. CASE STUDIES AND RESULTS

This section presents two case studies: the MiniMIPS processor and the s298 ISCAS-89 benchmark. Results are synthesis numbers. We used Design Compiler Ultra and an ARM 65 nm library.

A. MiniMIPS

MIPS (Microprocessor without Interlocked Pipeline Stages) is a 32-bit architecture, first designed by Hennessy [14]. MiniMIPS is an 8-bit subset of MIPS. The register-to-register communications of the MiniMIPS were analyzed and the CNG tool was used to generate the control network of Fig. 9 [8]. We passed the MiniMIPS control network to our flow in order to meet a clock period constraint of 370 ps. The results are shown in Table III. Our flow eliminated, in only one iteration, the whole negative slack (1.3 ns), with an area gain (i.e., decrease) of 6.2%. Rolling back a fork usually involves adding redundant forks and joins to the design. However, this is compensated, in part, by the join and fork combinations that take place. Besides, rolling a fork back makes it easier for DC to meet the timing constraints. This, in turn, seems to help DC optimize the area more efficiently. Column 4 in Table III shows the area of the control network in the different iterations when they are synthesized with a 400 ns timing constraint (i.e., virtually no constraint). In that case, rolling the fork back costs an area degradation (increase) of 6.5%.


Fig. 9. Control network of the synchronous elastic version of the MiniMIPS processor.

TABLE III
MINIMIPS RESULTS

#   Total Neg.    Area          Area         Fork To     Its
    Slack (ns)    @T=0.37 ns    @T=400 ns    Roll Back   Gain
1   1.3           1350          953.4        FABCI4P     35
2   0.0           1266          1015.2

Fig. 10. Control network of the synchronous elastic version of s298.

B. s298

s298 is an ISCAS-89 benchmark. It is a traffic light controller. s298 has a total of 23 synchronization points (14 registers + 3 inputs + 6 outputs). We analyzed all the register-to-register communications in the data path and passed this information to the CNG tool. The resultant control network is shown in Fig. 10. The s298 control network is passed to our flow in order to meet a clock period constraint of 500 ps. The results are shown in Table IV. Our flow eliminated, in 3 iterations, the whole negative slack (5.3 ns), with an area degradation (i.e., increase) of only 0.4%.
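The quoted area percentages follow from the first and last rows of Tables III and IV. The short Python sketch below reproduces the arithmetic; the helper name is hypothetical and the area values are copied from the tables, in the same unspecified units.

```python
def relative_change(before, after):
    """Percentage change from the initial network to the final one (negative = smaller area)."""
    return 100.0 * (after - before) / before


# MiniMIPS (Table III): first row versus last row.
minimips_tight = relative_change(1350, 1266)        # at T = 0.37 ns
minimips_relaxed = relative_change(953.4, 1015.2)   # at T = 400 ns

# s298 (Table IV): first row versus last row at T = 0.5 ns.
s298_tight = relative_change(2657.4, 2668.8)

summary = [round(x, 1) for x in (minimips_tight, minimips_relaxed, s298_tight)]
```

Rounded to one decimal place, the three values are -6.2, 6.5 and 0.4 percent, matching the figures quoted in the text.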

TABLE IV
S298 RESULTS

#   Total Neg.    Area          Area         Fork To     Its
    Slack (ns)    @T=0.5 ns     @T=400 ns    Roll Back   Gain
1   5.3           2657.4        1991.4       F5          70
2   2.2           2799          1977         F3          42
3   0.4           2392.8        1989.6       F4          36
4   0.0           2668.8        2374.2

VII. CONCLUSION

We introduced a systematic flow of structural transformations of the synchronous elastic control network. The flow reduces the network delay to permit meeting timing requirements. It essentially tries to eliminate the total negative slack of the control network at a specified clock period constraint. We presented the main structural transformations and introduced a function named Gain that guides the flow. We also verified the correctness of such transformations. Finally, we applied the flow to two case studies, namely, the MiniMIPS microprocessor and the s298 ISCAS-89 benchmark. At a target period of 370 ps, the flow eliminated a total negative slack of 1.3 ns from the MiniMIPS with an area improvement (decrease) of 6.2%. At a target period of 500 ps, it eliminated a total negative slack of 5.3 ns from s298, with an area penalty (increase) of only 0.4%. Results are in comparison with the initial network generated by the CNG tool reported in [8].

REFERENCES

[1] L. Carloni, K. McMillan, and A. L. Sangiovanni-Vincentelli, “Theory of latency insensitive design,” in IEEE Transactions on CAD of Integrated Circuits and Systems, vol. 20, no. 9, Sep 2001, pp. 1059–1076.

[2] N. Andrikos, L. Lavagno, D. Pandini, and C. Sotiriou, “A fully-automated desynchronization flow for synchronous circuits,” in Design Automation Conference, 2007. DAC ’07. 44th ACM/IEEE, June 2007, pp. 982–985.

[3] J. Carmona, J. Cortadella, M. Kishinevsky, and A. Taubin, “Elastic circuits,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 28, no. 10, pp. 1437–1455, Oct. 2009.

[4] S. Krstic, J. Cortadella, M. Kishinevsky, and J. O’Leary, “Synchronous elastic networks,” in Formal Methods in Computer Aided Design, 2006. FMCAD ’06, Nov. 2006, pp. 19–30.

[5] J. Cortadella, M. Kishinevsky, and B. Grundmann, “Synthesis of synchronous elastic architectures,” in ACM/IEEE Design Automation Conference, July 2006, pp. 657–662.

[6] “Reference is omitted for blind review.”

[7] H. M. Jacobson, P. N. Kudva, P. Bose, P. W. Cook, S. E. Schuster, E. G. Mercer, and C. J. Myers, “Synchronous interlocked pipelines,” in 8th International Symposium on Asynchronous Circuits and Systems, Apr. 2002, pp. 3–12.

[8] “Reference is omitted for blind review.”

[9] M. Bohr, “Interconnect scaling-the real limiter to high performance ULSI,” in Electron Devices Meeting, 1995., International, Dec 1995, pp. 241–244.

[10] R. Ho, K. Mai, H. Kapadia, and M. Horowitz, “Interconnect scaling implications for CAD,” in Computer-Aided Design, 1999. Digest of Technical Papers. 1999 IEEE/ACM International Conference on, 1999, pp. 425–429.

[11] L. Carloni and A. Sangiovanni-Vincentelli, “Coping with latency in SoC design,” Micro, IEEE, vol. 22, no. 5, pp. 24–35, Sep/Oct 2002.

[12] A. Cimatti, E. Clarke, E. Giunchiglia, F. Giunchiglia, M. Pistore, M. Roveri, R. Sebastiani, and A. Tacchella, “NuSMV 2: An opensource tool for symbolic model checking,” in Proc. of 14th Conf. on Computer Aided Verification (CAV 2002), vol. 2404, July 2002.

[13] V. Vakilotojar and P. Beerel, “RTL verification of timed asynchronous and heterogeneous systems using symbolic model checking,” in Design Automation Conference 1997. Proceedings of the ASP-DAC ’97. Asia and South Pacific, 28-31 1997, pp. 181–188.

[14] J. Hennessy et al., “The MIPS Machine,” in COMPCON, 1982, pp. 2–7.