Theory of latency-insensitive design - Computer-Aided ...luca/research/lipTransactions.pdf · delay-insensitive circuits [19], [20]. A delay-insensitive circuit is designed to operate

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 20, NO. 9, SEPTEMBER 2001 1059

Theory of Latency-Insensitive DesignLuca P. Carloni, Student Member, IEEE, Kenneth L. McMillan, and Alberto L. Sangiovanni-Vincentelli, Fellow, IEEE

Abstract—The theory of latency-insensitive design is presentedas the foundation of a new correct-by-construction methodology todesign complex systems by assembling intellectual property com-ponents. Latency-insensitive designs are synchronous distributedsystems and are realized by composing functional modules that ex-change data on communication channels according to an appro-priate protocol. The protocol works on the assumption that themodules are stallable, a weak condition to ask them to obey. Thegoal of the protocol is to guarantee that latency-insensitive designscomposed of functionally correct modules behave correctly inde-pendently of the channel latencies. This allows us to increase therobustness of a design implementation because any delay varia-tions of a channel can be “recovered” by changing the channel la-tency while the overall system functionality remains unaffected. Asa consequence, an important application of the proposed theory isrepresented by the latency-insensitive methodology to design largedigital integrated circuits by using deep submicrometer technolo-gies.

Index Terms—Deep submicrometer design, formal methods, la-tency-insensitive protocols, system design.

I. INTRODUCTION

T HE THEORY of latency-insensitive design formallyseparates communication from computation by defining a

system as a collection of computational processes that exchangedata by means of communication channels. The communicationis governed by an abstract protocol, whose main character-istic is to be insensitive to the latencies of the channels. Thetheory may be applied as a rigorous basis to design complexdigital systems by simply composing predesigned and verifiedcomponents so that the composition satisfies, formally and “byconstruction,” the required properties of synchronization andcommunication. In particular, relevant applications may befound in the design of integrated circuits to be implementedwith the future generations of process technologies. Indeed, forthe so-called deep submicrometer (DSM) technologies (0.1mand below), where millions of gates will be customary, a designmethodology that guarantees by construction that certain keyproperties are satisfied appears as the only hope to achievecorrect designs in short time. Furthermore, the characteristics ofDSM designs will exacerbate thetiming-closure problemthatis already present with the current technologies: the designers

Manuscript received June 2, 2000; revised April 23, 2001. This work wassupported in part by the Semiconductor Research Corporation, the ConsiglioNazionale delle Ricerche, Italy, and Cadence Design Systems. This paper wasrecommended by Associate Editor M. Papaefthymiou.

L. P. Carloni and A. L. Sangiovanni-Vincentelli are with the Departmentof Electrical Engineering and Computer Science, University of California,Berkeley, CA 94720 USA (e-mail: [email protected]; [email protected]).

K. L. McMillan is with Cadence Berkeley Laboratories, Berkeley, CA 94704USA (e-mail: [email protected]).

Publisher Item Identifier S 0278-0070(01)06887-7.

of semicustom integrated circuits are forced to iterate manytimes between logic synthesis and layout generation becausethe two steps are performed independently and synthesis usesstatistical delay models, which badly estimate the postlayoutwire load capacitance [1], [2].

In fact, despite the increase in number of layers and in aspectratios, the resistance–capacitance delay of an average metalline with constant length is getting worse with each processgeneration [3], [4]. This effect, combined with the increases inoperating frequency, die size, and average interconnect length,is making interconnect delay becoming a larger fraction of theclock-cycle time [5]. Introducing copper metallizationand low-dielectric insulators helps reducing the interconnect delay, butthese one-time improvements will not suffice in the long run asfeature size continues to shrink [5], [6]. Furthermore, while thenumber of gates reachable in a cycle will not change significantlyand the on-chip bandwidth that wires provide will continue togrow, the percentage of the die reachable within one clock cyclewill decrease dramatically: we will soon reach a point wheremore gates can be fit on a chip than can communicate in one cycle[7]. In particular, it has been predicted that for DSM designs, asignal will need more than ten clock cycles to traverse the entirechip area [8]. Also, it has been estimated that only a fraction ofthe chip area between 0.4% and 1.4% will be reachable in oneclock cycle [9]. Hence, limiting the on-chip distance traveled bycritical signalswill be the key toguarantee the performanceof theoverall design. However, since precise data on wire lengths areavailable only late in the design process, several costly redesignsbecome necessary to change the placement or the speed of thechip modules while satisfying performance and functionalityconstraints [2].

The theory of latency-insensitive design provides the founda-tion for a new design methodology that maintains the inherentsimplicity of synchronous design and yet does not suffer of the“interconnect-delay problem.” According to our approach, thesystem can be thought as completely synchronous, i.e., just as acollection of modules that communicate by means of channelshaving “zero delay,” i.e., a delay negligible with respect to the pe-riod of the common clock signal (synchronous assumption). Werefer this clock as thevirtual clockand we call a system that isspecified starting from this assumptionstrict. Once the final im-plementation of the system is derived, its operation is controlledby areal clockthat has a precise frequency value. Unfortunately,due the above-mentioned DSM effects, some of the wires imple-menting these channels on the final layout may likely require adelay longer than one real clock cycle to transmit the appropriatesignals.However,theproposedtheoryguaranteesthatit isnotnec-essary to complete costly redesign iterations or to slow down thereal clock. The key idea is borrowed from pipelining [10], [11]:partition the long wires into segments whose lengths satisfy the

0278–0070/01$10.00 © 2001 IEEE

1060 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 20, NO. 9, SEPTEMBER 2001

timing requirements imposed by the real clock by inserting logicblocks calledrelay stations, which have a function similar to theone of latches on a pipelined data path. While the timing require-ments imposed by the real clock are now met by construction, thelatency of a channel connecting two modules may end up beingtwo or more clock cycles. Still, if the functionality of the designis based on the sequencing of the output signals and not on theirexact timing, then this modificationof the designdoesnot changeits functional correctness provided that all its components arepa-tient processes. Informally, a module is a patient process if itsbehavior does not depend on the latency of the communicationchannels because it is compliant with alatency-insensitive com-munication protocol. The key point of this approach is to relaxtimeconstraintsduringtheearlyphasesofthedesignwhencorrectmeasures of the delay paths among the modules are not yet avail-able. Instead, the specification of a complex system is stronglysimplified if performedunder thesynchronousassumption.Afterthe corresponding physical implementation is completed, if therearemismatchesbetweenthe timeconstraintsandthe interconnectdelays among the system modules, they can be easily fixed byinserting the right amount of relay stations. Since every moduleworks accordingly to the latency-insensitive protocol, no modifi-cation in the functionality or the layout of the individual modulesis necessary to reflect any necessary changes in wire latencies.The relevance of this approach for DSM design has been recentlyconfirmed by the first details that Intel has unveiled on the newhyperpipelined NETBURST microarchitecture of its Pentium 4processor. As reported in [12], NETBURST is the first pipelinecontaining instances of a stage dedicated exclusively to handlewire delays: in fact, a so-calleddrive stageis used only to movesignal across the chip without performing any computation and,therefore, can be seen as a physical implementation of a relay sta-tion.

In this paper, we introduce these concepts formally and provethe properties outlined above. Section II discusses backgroundand related work and, particularly, how the synchronous as-sumption distinguishes our design methodology from similarapproaches proposed in the asynchronous design communityduring the past three decades. In Section III, we give the foun-dation of latency-insensitive design by presenting the notion ofpatient processes. At the core of our theory, originally presentedin [13], lies the definition of latency equivalence. Two systemsare latency equivalent if on every channel, they present the samedata streams, i.e., the same ordered sequence of data, but, pos-sibly, with different timing. We show that for patient processes,the notion of latency equivalence is compositional by provingthe following theorems: 1) the intersection of two patient pro-cesses is a patient process; 2) given two pairs of latency equiv-alent patient processes, their pairwise intersections are also la-tency equivalent; and 3) for all pairs of strict processesand patient processes , if is latency equivalent toand is latency equivalent to , then their intersections arelatency equivalent. As a consequence, we can derive the majorresult of our theory: if all processes in a strict system are re-placed by corresponding patient processes, then the resultingsystem is patient and latency equivalent to the original one.

In Section IV, we define the notion of relay station: we illus-trate its main properties and we show how in a system of patient

processes, the communication channels can be segmented by in-troducing relay stations.

Section V discusses the assumption under which a genericstrict system can be transformed into a patient one, i.e., its com-ponents must bestallable. We also delineate how the presenttheory leads to the definition of the latency-insensitive method-ology for digital integrated circuit design: the methodology iscentered on a simple noniterative design flow that can leveragecommon layout and synthesis computer-aided design (CAD)tools as well as state-of-the art formal verification techniques[14]. Naturally, the effectiveness of the latency-insensitive de-sign methodology is strongly related to the ability of main-taining a sufficient communication throughput in the presenceof increased channel latencies. This problem, which is just oneinstance of a more general issue that has to be faced while real-izing integrated circuits with DSM process technologies, is dis-cussed in Section V-D together with some techniques that can beused to handle it. Finally, to experiment the proposed method-ology, we performed a latency-insensitive design of PDLX, anout-of-order microprocessor with speculative execution. Thisdesign experiment is discussed in Section VI.

II. BACKGROUND AND RELATED WORK

The theory of latency-insensitive design is clearly rem-iniscent of many ideas which have been proposed in theasynchronous design community during the past three decades[15], [16]. In particular, the idea of a design methodology whichis inherently modular is already present in the work by Clarkand Molnar [17], [18]. To separate the design of these modulesfrom the design of the system and make the entire processamenable to automation, the modules must be implemented asdelay-insensitive circuits [19], [20]. A delay-insensitive circuitis designed to operate correctly regardless of the delays on itsgates and wires (unbounded delay model) [21]. However, it hasbeen proven that almost no useful delay-insensitive circuits canbe built if one is restricted to a class of simple logic gates [22],[23]. To be able to build complex systems, one must use morecomplex components, which are “externally” delay insensitive,while “internally” are designed by carefully verifying theirtiming and avoiding or tolerating metastability [20], [24], [25].By slightly relaxing the unbounded delay model and allowing“isochronic forks,”1 practicalquasi-delay-insensitivecircuitscan be built using simple logic gates [26]. A further relaxationleads tospeed-independentcircuits, which operate correctlyregardless of gate delays, while wire delays are assumed to benegligible [27]–[29].

In 1985, Van de Sneupschet observed that the decreasing fea-ture size of very large scale integration devices would have leadto “a decrease of the propagation speed of electrical signals rel-ative to the switching speed,” and proposed the use of suitablecommunication protocols to obtain chip designs whose correctoperation is independent of the propagation speed [30]. Thiswork has more than one contact point with the present paper,but differs on the basic fact that leads to the choice of speed-in-dependent circuits over synchronous circuits. Dill has also pro-

1A bounded skew is allowed between the different branches of a net.

CARLONI et al.: THEORY OF LATENCY-INSENSITIVE DESIGN 1061

posed a trace theory for modeling and specifying speed-inde-pendent circuits that is the basis for a hierarchical verificationapproach [28]. Parallels with our paper can also be found insome of the ideas that have been proposed in the field of high-level synthesis to schedule the sequential execution of inter-acting processes under unbounded timing constraints [31], [32].

As we move toward the design of integrated circuits to berealized with DSM technologies, the delays of long intermodulewires are becoming dominant with respect to both the delays ofthe intramodule wires and those of the logic gates. More impor-tantly, intermodule delays are difficult to predict or to controlduring the different phases of the design of a chip, leading to anexacerbation of the timing closure problem. Delay-insensitiveapproaches as well as the latency-insensitive methodology allowthedesigner tospecifyand implement thesystemwhileassumingthat intermodule wire delays may vary arbitrarily. However,while a delay-insensitive system is based on the assumption thatthe delay between two subsequent events on a communicationchannel is completely arbitrary, in the case of a latency-insen-sitive system, this arbitrary delay is forced to be amultiple ofthe clock period. The key point is that this kind ofdiscretizationallows us to leverage well-accepted design methodologies forthe design and validation of synchronous systems. In fact, thebasic distinction between any of the previous asynchronousdesign methodologies and the latency-insensitive approach isessentially that a latency-insensitive system is specified as a syn-chronous system. Notice that we say “specified” because from animplementation viewpoint, a latency-insensitive communicationprotocol can also be realized usinghand-shaking signalingtechniques (such as request/acknowledge protocols), whichare typically asynchronous.2 However, from a specificationpoint of view, each module (as well as the overall system) isviewed as a synchronous system relying on the previously citedsynchronous assumption. Now, to specify a complex system as acollection of modules whose state is updated collectively in one“zero-time” step is naturally simpler than specifying the samesystem as the interaction of many components whose state isupdated following an intricate set of interdependency relations.Furthermore, the synchronous specification allows us to modifythe traditional semicustom design methodology slightly bysimply inserting a step to encapsulate each synchronous modulewithin a so-calledshell process. Finally, the impact is verydifferent also from a validation point of view because simulationis naturally a less complex task for a synchronous system thanan equivalent asynchronous one. In conclusion, the theory oflatency-insensitive design leads to a methodology that can beimplemented on top of a commonly adopted design flow, whileany asynchronous approach forces the designers to use newtools and, more importantly, to think of the digital system in acompletely different way.

III. L ATENCY INSENSITIVITY

To develop our theory formally, we adopt thetagged-signalmodel, which has been proposed recently by Lee and Sangio-

2Here, the communication bandwidth would be limited by the inverse of thelongest of the round trip times between pairs of communicating relay stations.

vanni-Vincentelli, to represent complex systems as collectionsof signals and processes [33].

A. Tagged-Signal Model

Given a set ofvalues and a set oftags , an eventis amember of . Two events aresynchronousif they have thesame tag. Asignal is a set of events. Two signals are syn-chronous if each event in one signal is synchronous with anevent in the other signal and vice versa. Synchronous signalsmust have the same set of tags.

The set of all -tuples of signals is denoted . A processis a subset of . A particular -tuple satisfies the

process if . A -tuple that satisfies a process is calleda behaviorof the process. Thus, a process is a set of possiblebehaviors.3 A composition of processes(also called asystem)

is a new process defined as the intersection oftheir behaviors . Since processes can be definedover different sets of signals, we need to extend the set of signalsover which each process is defined to contain all the signals ofall processes to form the composition. Note that the extensionchanges the behavior of the processes only formally.

Let be an ordered set of inte-gers in the range . The projection of a behavior

onto is .The projection of a process onto is

. A connection is aparticularly simple process, where two (or more) of the signalsin the -tuple are constrained to be identical. For instance,

,with .

An input to a process is an externally imposed con-straint such that is the total set of acceptablebehaviors. The set of all possible inputs is a furthercharacterization of a process. Given a process, if ,then the set of acceptable behaviors is and theprocess does not have input constraints (the process isclosed).Commonly, one considers processes having input signals andoutput signals. In this case, given process, the set of signalscan be partitioned into three disjoint subsets by partitioning theindex set as , where is the orderedset of indexes for the input signals of is the ordered set ofindexes for the output signals andis the ordered set of indexesfor the remaining signals (also called irrelevant signals with re-spect to ). A process isfunctionalwith respect to if forall behaviors and , implies

. Hence, given a function, we can completely characterize a functional processby

the tuple . A process is determinateif for any input, then either or . Otherwise, it is

nondeterminate.In a synchronous system, every signal in the system is

synchronous with every other signal. In atimed system, theset of tags, also calledtimestamps, is a totally ordered set.The ordering among the timestamps of a signalinduces anatural order on the set of events of. A functional process

3For N � 2, processes may also be viewed as a relation between theN

signals ins = (s ; . . . ; s ).


Fig. 1. Strict signals and a stalled signals .

is (strictly) causal if two outputs can only differ at times-tamps that (strictly) follow the timestamps when the inputsproducing these outputs show a difference. More formally,given a metric on the set of -tuples of signals,4

we have the following: a functional processis causal if ;a functional process is strictly causal if

.

B. Informative Events and Stalling Events

A latency-insensitive systemis a synchronous timed systemwhose set of values is equal to , where is the setof informative symbols, which are exchanged among modulesand is a special symbol, representing the absence ofan informative symbol. The absence of an informative symbolmay result from either lack of valid data to transmit or back-pressure, i.e., a request to delay a transmission. From now on,all signals are assumed to be synchronous. The set of timestampsis assumed to be in one-to-one correspondence with the setof natural numbers. An event is calledinformativeif it has aninformative symbol as value.5 An event whose value is asymbol is said to be astalling event(or event).6

Definition III.1: denotes the set of events of signal,while and are the set of informative events and theset of stalling events of, respectively. Theth event ofa signal is denoted . denotes the set of timestampsin signal , while is the set of timestamps correspondingto informative events.

Processes exchange “useful” data by sending and receivinginformative events. Ideally, only informative events should becommunicated among processes. However, in a latency-in-sensitive system, a process may not have data to output at agiven timestamp, thus requiring the output of a stalling event atthat timestamp. Alternatively, it may happen that a down-linkprocess that is not ready to receive new data requests the up-linkprocess to avoid sending them and, as a consequence, the latterreacts by emitting a stalling event (back-pressure).

Definition III.2: The set of all sequences of elements inis denoted by . The length of a sequenceis if it

is finite, otherwise it is infinity. The empty sequence is denotedas and, by definition, . The th term of a sequence isdenoted .

Definition III.3: Function takes asignal and an ordered pair of times-tamps and returns a sequence

4For instance, in [33], it is considered the Cantor metricd(s ; s ) =supf(1=2) j s (t) 6= s (t); t 2 T g.

5We use subscripts to distinguish among the different informative symbols of� : � ; � ; � ; . . ..

6The use of the� event is similar to the role played by theabsencesymbol?in the synchronous language SIGNAL [34].

s.t.7 . The sequence of values of asignal is denoted . The infinite subsequence of values cor-responding to an infinite sequence of events starting fromisdenoted .

For example, considering signal

we have8

and , respectively.To manipulate sequences of values, we define the following

filtering operators.Definition III.4: returns a sequence

s.t.

ifotherwise

Definition III.5: returns a sequences.t.

ifotherwise

For instance, if , thenand . Obviously,

. Latency-insensitive systems areassumed to have a finite horizon over which informative eventsappear, i.e., for each signal, there is a greatest timestamp

that corresponds to the “last” informative event.However, to build our theory, we need to extend the set of sig-nals of a latency-insensitive system over an infinite horizon byadding a set of timestamps such that all events with timestampgreater than have values.

Definition III.6: A signal is strict if and only if all infor-mative events precede all stalling events, i.e., iff there exists a

s.t. and . Asignal that is not strict is said to be delayed (or stalled).

Fig. 1 illustrates the sequences associated to two signals pre-senting ten informative events each: is a strict signal withgreatest timestamp equal to ten, whileis a stalled signal withgreatest timestamp equal to 21.

C. Latency Equivalence

Two signals are latency equivalent if they present the samesequence of informative events, i.e., they are identical exceptfor different delays between two successive informative events.

7Notice that� (s) denotes the value of the event att .8In this paper, we assume8t 2 T (s); 8t 2 T (s); t � t , i � j.


Fig. 2. Sequences of values of three latency equivalent signals.

Definition III.7: Two signals are latency equivalentiff .

Thereference signal of a class of latency equivalent sig-nals is a strict signal obtained by assigning the sequence of in-formative values that characterizes the equivalence class to thefirst timestamps.

For instance, Fig. 2 reports the sequences associated to threesignals that belong to the same latency equivalent class: signal

is also the reference signal of the class.Latency equivalent signals contain the same sequences of in-

formative values, but with different timestamps. Hence, it isuseful to identify their informative events with respect to thecommon reference signal: theordinal of an informative eventcoincides with its position in the reference signal.

Definition III.8: The ordinal of an informative eventis defined as .

Let and be two latency equivalent signals: two infor-mative events and are saidto be corresponding events iff .The slack between two corresponding events is defined as

. Hence, if is strict, the ordinalof an informative event coincides with its position on .Observe that if and are latency equivalent signals, thencorresponding informative events in and have the sameordinals (while they may have different timestamps).

We extend the notion of latency equivalence to behaviors ina componentwise manner.

Definition III.9: Two behaviors andare latency equivalent iff . A

behavior is strict iff every signal isstrict. Every class of latency equivalent behaviors contains onlyone strict behavior. This is called thereference behavior.

Definition III.10: Two processes and are latencyequivalent if every behavior of one is latencyequivalent to some behavior of the other. A processis strictiff every behavior is strict. Every class of latency equiv-alent processes contains only one strict process: thereferenceprocess.

Definition III.11: A signal is latency dominated by an-other signal iff and with

.Hence, referring to the example of Fig. 2, signalis domi-

nated by signal since while . Notice that areference signal is latency dominated by every signal belongingto its equivalence class. Latency dominance is extended to be-haviors and processes as in the case of latency equivalence.

D. Ordering the Set of Informative Events

A total order among events of a behavior is necessary to de-velop our theory. In particular, we introduce an ordering amongevents that is motivated by causality: events that have a smaller

ordinal are ordered before the ones with larger ordinals.9 In ad-dition, to avoid combinational cycles that may be created by pro-cessing events with the same ordinal, we rely on a well-foundedorder over the set of signals. This order in real-life designs cor-responds to input–output combinational dependencies as theycan be found, for instance, in the implementation of Mealy’s fi-nite-state machines. We cast this consideration in the most gen-eral form possible to extend maximally the applicability of ourmethod by giving the following definition.

Definition III.12: Given a behavior ,symbol denotes a well-founded order on its set of signals.The well-founded order induces alexicographic order overthe set of informative events of, s.t. for all pairs of events

with and

(1)

The following function returns the first informative event (insignal of behavior ) following an event with respectto the lexicographic order .

Definition III.13: Given a behavior andan informative event , the functionnextEventisdefined as

A stall movepostpones an informative event of a signal of agiven behavior by one timestamp. The stall move is used to ac-count for long delays along communication channels (i.e., wireson the chip) and to add delays where needed to guarantee func-tional correctness of the design.

Definition III.14: Given an informative eventin a behavior , a stall move

returns a behavior ,s.t. for all

A procrastination effectrepresents the “effect” of a stall moveon other signals of behaviorin correspondence

of events following in the lexicographic order. The pro-cesses will “respond” to the insertion of stalls in some of theirsignals by “delaying” other signals that are causally related tothe stalled signals. Given a behaviorfor each stall move onevents of , we have a corresponding set of behaviors (the pro-crastination effect set).

Definition III.15: A procrastination effect is a point-to-setmap that takes a behaviorresulting from the application of a stall move on eventof behavior and returns a set of behaviors

s.t. iff the fol-lowing three conditions hold:

1) ;

9Think of a strict process where the ordinal is related to the timestamp; theorder implies that past events do not depend on future events.


Fig. 3. Behaviorb = stall(e (s )) is obtained stalling the fifth event of signals of behaviorb.

2) and, where is the timestamp of event

;3) finite s.t.

.

Each behavior in is obtained from by possiblyinserting other stalling events in any signal of, but only at“later” timestamps, i.e., to postpone informative event thatfollows with respect to the lexicographic order .Observe that a procrastination effect returns a behavior thatlatency dominates the original behavior.

E. Compositionality of Patient Processes

We are now ready to define the notion of patient process,which lies at the core of the theory. A patient process can takestall moves on any signal of its behaviors by reacting with theappropriate procrastination effects. Patience is the key conditionfor the intellectual property (IP) blocks to be combinable ac-cording to our method. The following theorems guarantee thatthe notion of latency equivalence of processes for patient pro-cesses is compositional .

Definition III.16: A process is patient iff

Hence, the result of a stall move on one of the events of a patientprocess may not satisfy the process, but one of the behaviors ofthe procrastination effect corresponding to the stall move doessatisfy the process, i.e., if we stall a signal on an input to a func-tional block, the block will be forced to delay some of its outputsor if we request an output signal to be delayed then an appro-priate delay may be added to some of its incoming inputs.

Lemma III.1: Let and be two patient processes. Letbe two behaviors with the same lexicographic

order s.t. . Then, there exists a behavior.

Proof (Constructive):Let andbe the two behaviors with the same

lexicographic order. Since and are latency equivalent, eachevent in has a corresponding event in and vice versa (seeDefinition III.8). Let . Let

be the set of ordinals associated to pairs of corresponding

events of and whose timestamps differ. Define the distancebetween behaviors as

ifotherwise

This distance is reminiscent of the Cantor metric. Thus,and have distance equal to zero if all pairs of corre-

sponding events arealigned.10 In this case, and areidentical, i.e., they are the same behavior that belongs to

. Now, suppose that .In this case, is the smallest ordinal among those whichare associated to unaligned pairs of corresponding events.Without loss of generality, let be thepair of corresponding events whose ordinal is equal toand let . Apply a stall move to to obtain a newbehavior . Obviously,

. Notethat is not necessarily a behavior of . However, since

is patient, there existss.t. . Since, by defini-tion of procrastination effect, , then also

. Sincethe procrastination effect may postpone only events following

in the lexicographic order , then all the pairsof corresponding events of and with ordinal smallerthan are still aligned. Now, there are two possibilities: if

, then one more pair has beenaligned and ; otherwise, we can reduce byone this slack by repeating the same procedure starting frombehavior . In any case, after steps of the procedureoutlined above, we obtain a behavior that satisfies

and s.t. because one more pair ofcorresponding events has been aligned. We say that we haveperformed analignment step.

Now, if , then there are no more unaligned pairs,the two behaviors are identical, and the lemma is proven since

. Instead, if , then we con-sider the next unaligned pair of corresponding events andwe execute a second alignment step. Note that at theth step,while aligning pair with ordinal , we may increase theslack of some of the pairs following in the lexicographicorder, but we keep aligned all the pairs preceding. Duringthis sequence of alignment steps, we obtain two sequences ofbehaviors (one of behaviors in latency equivalent to andone of behaviors in latency equivalent to ) whose distanceis decreasing monotonically. Since bothand contain the

10A pair of corresponding events is saidalignedif the events are synchronousor, according to Definition III.8, if their slack is zero.


Fig. 4. Sketch for proof on compositionality of latency equivalence.

same finite number of informative events,11 the set of pairsof unaligned corresponding events is also finite. The slack ofeach of these pair is also a finite number. At theth step, wehave at most substeps to align , where is the startingslack for . In the worst case, each behaviorobtained duringthe substeps of the alignment step may have slacks of all theremaining unaligned pairs increased by at most(see Defi-nition III.15). Hence, at the end of theth step, has beendecreased by one, while all the slacks of its remaining elementshave been increased by at most , a finite number. Thus,for , the new slacks for the remaining unalignedpairs is . Globally, we perform in the worstcase alignment steps and for each of them we have a finitenumber of substeps. Hence, the two sequences of behaviors arealso finite and the last elements of these sequences do not haveunaligned pairs and, therefore, have distance equal to zero.

Theorem III.1: If and are patient processes, thenis a patient process.Proof: Let be a behavior in

. Consider behaviorsand , s.t. . For all

and for all , let . Since, then and . Let

. Similarly,and . Since is patient there existsa behavior s.t. and since ispatient, there exists a behavior s.t. .Notice that implies that . However, it isnot necessarily the case that . In fact, procrastinationeffects may have misaligned pairs of corresponding infor-mative signals that come after with respectto lexicographic order . Since share the samelexicographic order, by Lemma III.1, there exists a behavior

s.t. . The construction ofgiven in the proof of Lemma III.1 involves only unalignedpairs of corresponding events betweenand and all theseunaligned pairs correspond to informative events that comeafter with respect to lexicographic order . Further,since the number of informative events is finite, the numberof unaligned pairs is also finite. Hence, each signalofis obtained by inserting a finite number of stalling events notearlier than timestamp with .Therefore, by Definition III.15, . Sincewe have already seen that , then is apatient process.

Fig. 4 illustrates the above proof for the case when the twobehaviors are just 1-tuple signals.

11Recall that the number of informative events for every behavior consideredin latency-insensitive designs is finite.

Theorem III.2: For all patient processes , ifand then .

Proof: Let be a behavior in .Latency equivalence implies that there must be behaviors

and such that. Since and are latency equivalent and

and are patient, Lemma III.1 guarantees that there must be alatency equivalent behavior . The other directionof the proof is symmetric.

Therefore, we can replace any process in a system of pa-tient processes by a latency equivalent process, and the resultingsystem will be latency equivalent. A similar theorem holds forreplacing strict processes with patient processes.

Theorem III.3: For all strict processes and all patientprocesses , if and , then

.Proof: The argument that every behavior in has

an equivalent in is as in Theorem III.2. For the otherdirection, let be a behavior in . Latency equivalenceimplies that there must be behaviors and suchthat . Since and are strict, andare also strict. Being latency equivalent, they must, therefore,be equal. Thus, .

This means that we can replace all processes in a system ofstrict processes by corresponding patient processes and the re-sulting system will be latency equivalent to the original one.This is the core of our idea: take a design based on the assump-tion that computation in one functional module and communica-tion among modules “take no time” (synchronous hypothesis),12

i.e., the processes corresponding to the functional modules andtheir composition are strict and replace it with a design wherecommunication does take time (more than one clock cycle) and,as a result, signals are delayed, but without changing the se-quence of informative events observed at the system level, i.e.,with a set of patient processes.

IV. PATIENT PROCESSES ANDRELAY STATIONS

As explained in Section I, one of the goals of the latency-in-sensitive design methodology is to be able to “pipeline” a com-munication channel by inserting an arbitrary amount of memoryelements. In the framework of our theory, this operation corre-sponds to adding some particular processes, calledrelay sta-tions, to the given system. In this section, we give the formaldefinition of a relay station and we show how patient systems(i.e., systems of patient processes) are insensitive to the inser-tion of relay stations. In Section IV-A, we discuss under which

12In other words, communication and computation are completed in one clockcycle.


assumption a generic system can be transformed into a patientsystem.

A. Channels and Buffers

The tagged signal model provides the notion of channel toformalize the composition of processes [33]. Achannelis a con-nection13 constraining two signals to be identical.

Definition IV.1: A channel is aprocess s.t.

As the following lemma formally proves, a channel is not apatient process because it lacks the capacity of storing an eventand delaying its communication between two processes.

Lemma IV.1: A channel is not a patientprocess.

Proof: Let be a behavior of a channeland, without loss of generality, suppose that .

Consider a pair of corresponding informative events inand and . Since

, then and, therefore, .Moreover, implies that ,since . Withoutloss of generality, suppose that and arefollowed by stalling events, i.e., formally,

.Then, consider informative event . ByDefinition III.13, . Now, let

be the behavior obtained byapplying a stall move on . At timestamp , presentsa stalling event, while the event of corresponding tois , which occurs at timestamp .Then, consider any behavior . ByDefinition III.15, since ,then . In par-ticular, and, therefore,

, which, finally, implies that .Hence, and, by Definition III.16,

is not patient.Hence, to formally model communication delays as well as

pipeline stages, we introduce the notion of buffer: a buffer is aprocess relating two signals of a behavior and is definedby means of three parameters: capacity, minimum forward la-tency , and minimum backward latency. A buffer forces sig-nals to be latency equivalent and to satisfy the followingrelationships for all natural numbers.

1) The difference between the amount of information eventsseen at from timestamp zero to timestamp and theamount of informative events seen atfrom timestampzero to timestamp is greater or equal than zero.

2) The difference between the amount of information eventsseen at from timestamp zero to timestampand theamount of informative events seen atfrom timestampzero to timestamp is at most .

13See Section III-A for the definition of connection.

Definition IV.2: A buffer with capacity ,minimum forward latency , and minimum backward la-tency is a process s.t.

iff and

(2)

(3)

By definition, given a pair of indexes , for all, all buffers are latency equivalent. Ob-

serve also that buffer coincides with channeland, therefore, is not a patient process. In particular, we are in-terested in buffers having unitary latencies and we want to estab-lish under which conditions such buffers are patient processes.

Lemma IV.2: If are two signals s.t. and, then

1) s.t. , isthe corresponding event of in ;

2) s.t., where is the corresponding

event of in .

Proof: Let be the corresponding event of .By Definition III.8, . ByDefinition III.12, since , then .For all informative events, with isclearly . Instead, for all informative events

with , we have .However, is clearly the minimum ordinal informativeevent of that follows with respect to the lexi-cographic order and, therefore, by Definition III.13,

. The second relation canbe easily proven using the previous relation. We know that

, where and arecorresponding events. Let be .Then, necessarily, .Since , then, by DefinitionIII.12, . Furthermore, is also the min-imum ordinal event of , which comes after accordingto lexicographic order and, therefore, by Definition III.13,

.Fig. 5 illustrates the previous lemma.Theorem IV.1:Let . For all , is

patient iff .Proof: First, if , then inequalities (2) and (3)

become

(4)

(5)

Only If: By contradiction, we prove that if , thenis not a patient process. Suppose . For all

, let be a behavior of s.t.and . Let

with .


Fig. 5. Sketch to illustrate Lemma IV.2.

Clearly, because inequality (4) does not holdfor since . Further, for all

, we can prove that . In fact,since iff . However,consider that and, since ,then . Further, .Therefore, . Recallthat by Definition III.15 of procrastination effect,

, where is the timestampof event . Hence, in our case,

and . Since ,. This implies

that . Hence,and is not patient.

If : We prove that if , then is patient. Forall , let be a behavior of . Wemust analyze three distinct cases in which we, respectively, stallan informative signal of and with .

1) For all , such that , let. Since

iff inequality (4) does nothold for some . In fact, satisfies the othertwo conditions of Definition IV.2 because andto insert a stalling event on (while remains thesame) cannot induce a violation of inequality (5). Now,suppose first that satisfies also inequality (4) for all

: then, there exists at least a behavior that belongsto and this behavior isbecause . A moreinteresting case is when inequality (4) does not hold:in this case, . Then, consider a behavior

s.t. ,while is obtained from by inserting a stallingevent at timestamp , where is also the timestampof event . Clearly, thisconstruction guarantees that .It remains to be proven that . First, byconstruction, . Then, check whethersatisfy inequalities (4) and (5) for all . First,since and , by Lemma IV.2,

is the corresponding event ofin . Hence,

and, recalling DefinitionIII.8, . Since, byhypothesis, satisfy inequality (4) for all ,then . Compare and , respectively, with and

: has been derived by inserting a at , while

has been derived by inserting a at . Hence, we canderive the following four equations. Further, each term inthese equations can be bounded using the fact thatsatisfy inequalities (4) and (5) for all

(6)

(7)

(8)

(9)

Now, keeping in mind that , it is easy to prove that:

a) using (8) and (6), satisfy (4), ;b) using14 (8) and (7), satisfy (4),

;c) using (9) and (7), satisfy (4), ;d) using (6) and (8), satisfy (5), ;e) using (7) and (9), satisfy (5), ;f) using (7) and (9), satisfy (5), .

Therefore, .2) Consider now ,

where for all . Letand be the corresponding

event of in : then, since and ,by Lemma IV.2, .Now, construct in such a way that

, while is obtainedfrom by inserting a stalling event at timestamp,where . Hence,if , then , else .In both cases, this construction guarantees that

. It remains to be proventhat . First, by construction, .Then, check whether satisfy inequalities (4) and(5) for all . Compare and , respectively, with

and : has been derived by inserting a at

14Recall that� (s ) = � .


Fig. 6. Comparing two possible behaviors of finite buffersB (s ; s ) andB (s ; s ).

, while has been derived by inserting a at .Hence, previous relations (6)–(9) hold also in this case.Now, keeping in mind that here , it is easy to provethat:

a) using (8) and (6), satisfy (4), ;b) using (9) and (6), satisfy (4), ;c) using (9) and (7), satisfy (4), ;d) using (6) and (8), satisfy (5), ;e) using15 (6) and (8), satisfy (5),

;f) using (7) and (9), satisfy (5), .

Therefore, in this case, too.3) Finally, for all let

, where forall . Then, trivially,

.In conclusion, combining all three cases, we have

Hence, is patient.Consider a strict system with strict sig-

nals . As explained in Section III-A, processes canbe defined over different signal sets and to compose them wemay need to formally extend the set of signals of each processto contain all the signals of all processes. However, withoutloss of generality, consider the particular case of composing

processes that are already defined on the samesignals.Hence, any generic behavior of isalso a behavior of iff for all , process

contains a behavior s.t.. In fact, we may assume to derive system by

connecting the processes with channel processes, where and . Fur-

ther, we may also assume to “decompose” any channel processwith an arbitrary number of channel processes

by adding aux-iliary signals, each of them forced to be equal to . Thetheory developed in Section III guarantees that if we replaceeach process with a latency equivalent patientprocess and each channel with a patient buffer ,we obtain a system , which is patient and latency equiva-lent to . In fact,having a patient buffer in a patient systemis equivalent to having a channel in a strict system. Since “de-composing” a channel has no observable effect on astrict system, we are, therefore, free to add an arbitrary number

15Recall that� (s ) = � .

of patient buffers into the corresponding patient system to re-place this channel. Since we use patient buffers with unitary la-tencies, we can distribute them along that long wire on the chipwhich implements in such a way that the wire gets de-composed in segments whose physical lengths can be spannedin a single real clock cycle (as anticipated in Section I).

B. Relay Stations

Lemma IV.3 proves that no behaviors in may con-tain two informative events of , which are synchronous,i.e., there cannot be any timestamp for which bothandpresent an informative event. This implies that the maximumachievable throughput across such a buffer is 0.5, which maybe considered suboptimal. Instead, buffer is the min-imum capacity buffer that is able to “transfer” one informativeunit per timestamp, thus allowing, in the best case, to commu-nicate with maximum throughput equal to one. Fig. 6 comparestwo possible behaviors of these buffers.

Lemma IV.3: is the minimum capacity buffer withs.t. for all , closed intervals of

(10)

Proof: Relation (10) says that is the minimumcapacity buffer with containing a behavior ,where and present consecutive pairs of synchronousinformative events (i.e., the two informative events of each a pairhave the same timestamp) for all , closed intervals of .Notice that the only buffer with having capacity lessthan is . We first show that con-tains at least one behavior satisfying relation (10) and thenwe prove that the same is not true for any behavior of .It is easy to construct an example of such a behavior for any.For instance, consider a behavior s.t.

and thatClearly, and inequalities (4) and (5) (with ) aresatisfied for any . Hence, . Moreover, at alltimestamps , both and present an informa-tive event.

Now, consider . If , combining inequalities (4)and (5), we obtain that


Fig. 7. Example of a behavior of an equalizerE with I = f1; 2; 3g andO = f4; 5; 6g.

Hence, for all behaviors , sig-nals are not only latency equivalent, but also correlatedaccording to a very regular pattern (see Fig. 6), which can besummarized in two properties: 1) there are no two synchronousinformative events in and 2) for all timestamps, informa-tive events appear alternately onand on (possibly, at notconsecutive timestamps). Property 1 is a negation of relation(10).

Definition IV.3: The buffer is called arelay station.

V. LATENCY-INSENSITIVE DESIGN

In this section, we formally present the notion of latency-insensitive design as an application of the concepts previouslyintroduced. To do so, we assume that:

1) the predesigned functional modules are synchronousfunctional processes;

2) the processes are strictly causal;3) the processes belong to a particular class of processes

calledstallable, a weak condition to ask the processes toobey.

The basic ideas are as follows. Composing a set of predesignedsynchronous functional modules in the most efficient way isfairly straightforward if we assume that the synchronous hy-pothesis holds. This composition corresponds to a compositionof strict processes since there isa priori no need of insertingstalling events. However, as we have argued in the introduc-tion, it is very likely that the synchronous hypothesis will notbe valid for communication. If indeed the processes to be com-posed are patient, then adding an appropriate number of relaystations yields a process that is latency equivalent to the strictcomposition. Hence, if we use as the definition of correct be-havior the fact that the sequences of informative events do notchange, the addition of the relay stations solves the problem.However, requiring processes to be patient at the onset is def-initely too demanding from a practical point of view. Still, inpractice, a patient system can bederivedfrom a strict one asfollows: first, we take each strict process and we compose itwith a set of auxiliary processes to obtain an equivalent patientprocess . To be able to do so, all processes must satisfya simple condition (the processes must be stallable) that is for-mally specified in the next section. Then, we put together allpatient processes by connecting them with relay stations. Theset of auxiliary processes implements a “queuing mechanism”across the signal of in such a way that informative events arebuffered and reordered before being passed to: informativeevents having the same ordinal are passed tosynchronously.

In the sequel, we first introduce the formal definition of func-tional processes. Then, we present the simple notion of stallableprocesses and we prove that every stallable process can be en-capsulated into a wrapper process which acts as an interface to-ward a latency-insensitive protocol.

A. Stallable Processes

In the sequel, we consider only strictly causal processes andfor each of them we assume that the well-founded orderof Definition III.12 subsumes the causality relations among itssignals, i.e., formally .

Definition V.1: A process with andis stallable iff for all its behaviors

and for all

Hence, while a patient process tolerates arbitrary distributionsof stalling events among its signals (as long as causality is pre-served), a stallable process demands more regular patterns:symbols can only be inserted synchronously (i.e., with the sametimestamp) on all input signals and this insertion implies thesynchronous insertion of symbols on all output signals at thefollowing timestamp. To assume that a functional process is stal-lable is quite reasonable with respect to a practical implementa-tion. In fact, most hardware systems can be stalled: for instance,consider any sequential logic block that has agated clockora Moore finite state machine with an extra input, which, ifequal to , forces to stay in the current state and to emitatthe next cycle.

B. Encapsulation of Stallable Processes

Our goal is to define a group of functional processes thatcan be composed with a stallable processto derive a patientprocess which is latency equivalent to. We start consideringa process that aligns all the informative events across a set ofchannels.

Definition V.2: An equalizer is a process withand , such that

for all behaviors , wehave that and :

1) ;2) .

The first relation forces the output signals to have stalling eventsonly synchronously, while the second guarantees that at everytimestamp the number of informative events occurred at anyoutput is always equal to the least number of informative eventsseen by any input signal up to that timestamp. Fig. 7 illustratesa possible behavior of an equalizer. Notice how the the presenceof a stalling event at a certain input at a given timestamp does notnecessarily force the presence of a stalling event on all outputs atthe same timestamp. For instance, while the two stalling eventson and at timestamps do force stalling events on alloutput signals at , instead the stalling event present onat

does not result in any timestamp on the output signals: this isdue to the fact that at timestamp, all input signals have seen


Fig. 8. Encapsulation of a stallable processP into a wrapperW (P ).

at least one informative event while no output event have beenoccurred on the output signal up to.

Definition V.3: An extended relay station is a processwith and s.t. signalsare related by inequalities (2) and (3) of Definition IV.2 (with

and ) and

ifotherwise

Definition V.4: A stalling signal generator is aprocess with and s.t.

and

ifotherwise

As illustrated in Fig. 8, any stallable processcan be com-posed with an equalizer, a stalling signal generator, and someextended relay stations to derive a patient process that is latencyequivalent to .

Definition V.5: Let be a stallable process withand . A wrapper

process(or shell process) of is the process withand , which is

obtained by composing with the following processes:

1) an equalizer with and;

2) extended relay stations s.t.and , with

3) astalling signal generator withand .

Theorem V.1:Let be the wrapper process of Defini-tion V.5. Process is a patient processthat is latency equivalent to .

Proof: Throughout the proof, we follow the index notationof Definition V.5.

: Let be abehavior of and

one of . Letbe

the corresponding behavior of . Then, by Definition V.2

of equalizer , and, by definition of relaystation, . Therefore, .

W Patient: Recalling Definition III.16, we need to provethat

Consider first stalling any input signal of : forall and all , let

. Twocases may happen.

1) There exists a signal , s.t.. As a conse-

quence, by Definition V.2, no additional stalling eventsare added at the output of nor, ultimately, on the outputsignals of . Hence, while stall move doesnot affect any other signal, but , still and

(the stall move is “absorbed” bythe equalizer). Therefore, .

2) is a signal of s.t.. In this case, the insertion of a stalling

event on at implies that all the output signals ofequalizer have a stalling event at . Then, by analyzingthe interrelationships among the components of, it iseasy to verify that . In fact, all the output signalsof are forced to have a stalling event at and,similarly, all the output signals to have itat . Hence, must bealso stalled. Then, since move does notaffect any other signal, but . However, since

, the insertionof one stalling event on each of the wrapper outputs atis compatible with the definition of procrastination effectand, therefore, .

Next, consider stalling any output signal of : forall and all , let

. Bydefinition of stall move, and

. Hence, again, . Infact, the insertion of a stalling event on signal at has animpact on signal of , which isconstrainedto stall theinput event occurring timestamps later.16

As a consequence, all the outputs of the stallable processmusthave a stalling event at . While no other stalling events areforced on of at , all the remaining relay stations

must stall their .Hence, because does not affect anyother signal, but . However, , since

, where, then .

Hence, the insertion of one stalling event on each wrapperoutput at is compatiblewith the definition of procrastination effect. Therefore,

.

16Note that, by Definition IV.3,e (s ) must be stalled even though8k 2

[h + 1; h + l � 1]; (� (s ) = �)


C. Latency-Insensitive Design Methodology

By putting together the ideas discussed in the previous sec-tions, we can derive the following guidelines for the definitionof the new design methodology.

1) Begin with a system of stallable processes and channels.2) Encapsulate each stallable process to yield a corre-

sponding wrapper process that is patient and latencyequivalent to the original one.

3) By inserting the required amount of relay stations on eachchannel, the latency of the communication among anypair of processes now can be arbitrarily varied withoutaffecting the overall system functionality.

This approach clearly “orthogonalizes” computation andcommunication: in fact, we can build systems by assemblingfunctional cores (which can be arbitrarily complex as far as theysatisfy the stalling assumption) and wrappers (which interfacethem with the channels, by “speaking” the latency-insensitiveprotocol). While the specific functionality of the system isdistributed in the cores, the wrappers can be automatically gen-erated around them.17 Furthermore, the validation of the systemnow can be efficiently decomposed based on assume-guaranteereasoning and compositional model checking [35]–[38]: eachwrapper is verified assuming a given protocol and the protocolis verified separately.

With regard to the design of digital integrated circuits, thetheory of latency-insensitive design can be used as the formalbasis for defining thelatency-insensitive design methodologythat is centered around the following simple noniterative designflow.

1) The designers design and validate the chip as a collectionof synchronous modules that can be specified with theusual hardware-description languages (HDLs).

2) Each module is automatically encapsulated within a blockof control logic (the wrapper) to make it latency insensi-tive.

3) Traditional logic synthesis and place-and–route steps areapplied.

4) If the presence of unexpectedly large wire delays makesit necessary, the resulting layout is corrected by simplyinserting the right amount of relay stations to meet theclock cycle constraints everywhere.

Notice that the design, layout, and routing of individual moduleswould not need to be changed to reflect any necessary changesin wire latencies during the chip-level layout and wiring process.This may clearly represent a significant advantage for futuresystem-on-a-chip designs, where the designers completing thechip-level integration most likely will not work at the samecompany as the designers of the individual modules. Further-more, since it is based on the synchronous assumption, the ap-proach facilitates the adoption of state-of-the art formal verifi-cation techniques within a new design flow that, for the rest, canbe built using traditional and well-known layout and synthesisCAD tools.

17This is the reason why wrappers are also calledshells: they just “protect” theIP (the pearl) they contain from the “troubles” of the external communicationarchitecture.

D. Impact on System Performance

Naturally, the effectiveness of the latency-insensitive designmethodology is strongly related to the ability of maintaininga sufficient communication throughput in the presence of in-creased channel latencies. However, in the case of integrated cir-cuit design, this is just one instance of a more general problemthat has to be faced while using DSM technologies. In fact, sinceon-chip communication was not an issue with previous processtechnologies, the vast majority of chips that have been realizedover the past two decades are based on architectural models re-lying on low-latency communication to shared global resources.The advantages of such models is that they provide the mostuniform computational framework and the best utilization of thefunctional units. On the other hand, this focus on function ratherthan communication is now seen as the fundamental conceptualroadblock to be overcome in DSM design [7]. In this regard,the latency-insensitive design methodology represents an inter-esting approach due to the inherent separation of communica-tion and computation.

Inserting extra latency stages on a cyclic pipeline does notnecessary translate into a performance hit. For instance, the caseof the Alpha 21264 microprocessor, where the integer unit ispartitioned into two modules and the latency for communicatingbetween them takes an additional clock cycle [39], shows howit is possible to pipeline long wires, thereby increasing their la-tency, while still offering high computational bandwidth. Sim-ilarly, the presence of so-calleddrive stagesin the new hy-perpipelined NETBURST microarchitecture [12] suggests thateven for high-end designs, such as the Pentium 4 micropro-cessor, the insertion of extra stages dedicated exclusively tohandle wire delays may be the result of a precise engineeringchoice. From this point of view, the present paper representsa formal background for the definition of design methodolo-gies that allow an efficient analysis and exploitation of the la-tency/throughput tradeoffs at any level of the design flow.

Furthermore, the latency-insensitive approach can be ex-tended to incorporate other techniques aimed to have theperformance of a design less susceptible to large variations inchannel latency. For instance, a simple technique to make thedesign more robust on this respect consists in ensuring thatthe design specification contains some “slack” in the form ofunneeded pipeline delays. If there is sufficient “slack” latencyaround every cyclic path in the design, then, after the finallayout is derived, the latency-insensitive protocol allows us todistribute this slack in a completely transparent manner to coveran increase in wire latency without any changes to the designor the layout of the modules (a sort ofdynamic retiming).The overall design becomes more robust in the sense that itis less likely that some modules would have to be changed torecover performance lost due to unexpectedly large wire delays.Nonetheless, there will be cases where cyclic paths of lowlatency affecting the overall system throughput are inevitable.In such cases, techniques typically used for microprocessordesign such asspeculationandout-of-order executionmay beembodied within a latency-insensitive protocol. For instance,when a particular data item in not yet computed, while being onthe critical path, it is sometimes possible to “guess” the likely


value of this data and to allow the computation to proceed.Later, this value can be “retracted” if it proves to be incorrect.Such techniques may dramatically increase the overall systemperformance, but they are currently adopted only in high-endmicroprocessor design because they are error-prone and quitedifficult to implement. However, their demand is destined togrow, as wire latencies will keep increasing. On the other hand,these techniques may be rigorously built into a latency-in-sensitive protocol that would allow speculation to break thetight dependency cycles (provided the designer can identify anappropriate guessing strategy). Being part of the protocol, theywould be isolated from the functional specification and theywould enter the design picture in acorrect-by-constructionmanner that would not risk the introduction of a design error orcause previous simulation work to be invalidated.

VI. CASE STUDY—THE PDLX MICROPROCESSOR

To experiment the proposed methodology, we performed a“latency-insensitive design” of PDLX, an out-of-order micro-processor with speculative execution. In the present section, wefirst summarize the architectural specification of PDLX, and,then, we discuss the latency-insensitive design as well as theexperimental results.

A. PDLX Architecture and Instruction Flow

The instruction set of PDLX is the same as the one of the DLXmicroprocessor, described in [11]. The PDLX architecture isbased on an extended version of theTomasulo’s algorithm[40],which combines traditional dynamic scheduling with hardware-based speculative execution. As a consequence, the data pathof PDLX is similar to the one of some of the most advancedmicroprocessor available on the market today [39], [41].

The PDLX architecture is conceptually illustrated by theblock diagram of Fig. 9. At the center of the diagram lies aset of execution units (gray shaded) that operate in parallel.Schematically, the PDLX behavior can be summarized asfollows: the branch processing unitsends the next value ofthe program counter (PC)to the fetch unit, which fetches thecorresponding instruction from theinstruction cacheand passesit to the decode unit. Once instruction decoding is completed,the result arrives to thedispatch unit, which interacts with thereservation stations (RSs)of the different processing units aswell as with thecompletion unitand thesystem register unit.Instructions are fetched, decoded, and dispatched sequentiallyfollowing the order of the program that gets executed. Oncea new decoded instruction arrives, thedispatch unitstartsby assigning it to one of the functional categories (integerarithmetic/logic, floating-point, load, store, branch, ) andthen checks whether the following two conditions are verified.

1) There is one entry available in the reorder buffer withinthecompletion unit.

2) There is an available reservation station at the head of aprocessing unit matching the function category to whichthe instruction belongs.

if one of these conditions is not satisfied thedispatch unitstalls,otherwise instruction is dispatched: this means that the instruc-tion is labeled with atag identifying the reorder buffer entry

which has been assigned to it and, at the end of the execution,will contain the result. Then, operands and opcode of instruc-tion are loaded into the selected reservation station to start theexecution. In fact, the execution starts immediately only if thevalues of the operands which have been read from thesystemregister unitare “currently correct” in the sense that no otherinstruction (previously dispatched and still not completed) isdestined to change the value of one of these operands. If this isindeed the case, for each operand whose value is not yet avail-able thedispatch unitwrites on the corresponding entry in thereservation station the tag that identifies the reorder bufferentry previously assigned to instructionon which the operanddepends. The execution of instructionis procrastined until allcorrect values of its operands arrive at the reservation station.Different instructions may take a different number of clock cy-cles to execute not only due to this procrastination, but also be-cause their executions may present different latencies. When theexecution of an instruction ends, the corresponding processingunit broadcast the result to the reorder buffer entry and to anyreservation station that is awaiting it.

The completion unittracks each instruction from dispatchthrough execution and retires it by removing the result from thebottom entry of the reorder buffer and committing it to updatethe system registers. In-order completion guarantees that thesystem is in the correct state when it is necessary to recover froma mispredicted branch or any exception. In the presence of aconditional branch whose condition cannot be resolved immedi-ately, thebranch processing unitpredicts the “branch target ad-dress” and instruction fetching, dispatching and execution con-tinue from the predicted path. However, these instructions cannot commit and write back results into the system register untilthe prediction has been resolved, i.e., determined to be correct.Instead, when a prediction is determined to be incorrect, the in-struction from the wrong path are flushed from the datapath andthe execution resumes from the correct path.

B. Latency-Insensitive Design of PDLX

We performed a high-level cycle-accurate design of PDLXby using BONeS DESIGNER [42], a CAD tool that providesa powerful modeling and analysis environment for system de-sign. We first defined a synchronous specification of the PDLXand we designed each of of the PDLX modules illustrated inFig. 9, keeping in mind only the following informal rule to makethe process stallable:at each clock cycle, the execution processof a module can always be frozen without affecting its internalstate. Independently from the design of the PDLX modules, wespecified also a latency-insensitive protocol library, which con-tains parameterized components to guide the automatic genera-tion of different kinds of relay stations and wrappers. Finally,we encapsulated each module in a wrapper and we obtainedthe final system. Obviously, this decomposition of the hardwareimplementing the PDLX is not the only possible, let alone thebest, one. Still, while reasonably simple, it presents interestingchallenges to the realization of the proposed latency-insensitivecommunication architecture. In particular, modules such as thefetch unitand thedispatch unitmerge channels coming fromseparate sources, and which are likely to have different laten-cies. Further, each time the predicted value of a conditional


Fig. 9. PDLX microprocessor block diagram: conceptual view.

branch is verified a “feedback path” gets activated from thesystem register through thecompletion unitto thebranch pro-cessing unit.

We specified most PDLX modules based on the assumptionthat they communicate by means of point-to-point channels,whose latencies may be arbitrary. However, due to the particularstructure of a microprocessor such as PDLX and, in particular,to the parallel organization of its execution units, we decided toadopt a different type of communication structure to connect theseveral relay stations, the execution units and the reorder buffer:a pipelined ring. A ring, like a bus, inherently supports broad-cast-based communication: the sender place a packet on the ringwhile the other modules inspect (snoop) it as it goes by to see ifit is relevant to them [43]. In our case, the snooping mechanismis obviously based on identifying the tags associated to the en-tries of theReorder Buffer, as described in Section VI-A. In gen-eral, to keep sequential consistency on a ring is more complexthan on a bus since multiple packets may traverse it simultane-ously. However, the characteristics of our system help us in thisperspective because thecompletion unitguarantees in-order in-struction commitment together with the correct serial progressof the state of the system. The linear point-to-point nature ofa ring allows aggressive pipelining and potentially very highclock rates. A disadvantage is that the communication latencyis high, growing linearly with the number of nodes on the ring.Overall, based on its characteristics, a ring represents an inter-esting design solution in the framework of our design method-ology, where modules are by definition latency insensitive andpipelining is extensively applied.

C. Experimental Results

To test our design, we took some simple numericalpro-grams (permutations, binary search,) and we generated thecorresponding DLX assembler code by using DLXCC, a pub-

licly available DLX compiler [44]. Then, we loaded the assem-bler into the PDLXinstruction cacheand we executed it whilelogging every read/write access to thedata cache. Finally, wecompared the “log file” with the one obtained executing thesame assembler code on the DLX simulator DLXSIM to verifythat the functional behavior was indeed the same.

For each program execution, we computed the total numberof clock cycles necessary to complete the execution of the as-sembler code: this number is equal to , where is thenumber of instruction that have been issued,is the number ofcycles lost due to a stall within the execution unit, andis thenumber of cycles lost due to pipeline latency. Since the PDLXis a single-issue multiprocessor, the instruction throughput

is a quantity less than or equal to one. This quantity can bemultiplied by the system clock frequency to obtain theeffectiveinstruction throughput , which allows us tocompare the execution of the same assembler code on differentPDLX implementations running at different speeds. Fig. 10 il-lustrates the results obtained running three different assemblerprograms: on the axis, we have the instruction throughput onthe bottom chart and the effective instruction on the top chart.In both charts, each discrete point on theaxis corresponds toa different PDLX implementation with a different fixed amountof latency on some communication channels.

For this experiment, we focused on two specific channels onFig. 9: channel between theinstruction memory manage-ment unitand theinstruction cache (I-Cache)and channelbetween thedata memory management unitand thedata cache(D-Cache). We varied the latencies of the two channels as fol-lows: going from left to right on the axis, each of the 18data points represents an implementation case and is labeled as

, where and denote the amounts by which the latenciesof channels and have been increased. In particular, wevaried from zero to five and from zero to two. As expected,the bottom chart confirms that the more we increase the latencies


(a)

(b)

Fig. 10. PDLX performance. (a) Throughput. (b) Effective throughput.

between the two caches and the rest of the system, the higher isthe throughput degradation. It is also clear that for this PDLXimplementation, the impact of increasing the D-Cache latencyby one unit while leaving untouched the I-Cache latency (datapoint ) is more or less equivalent to increase the I-Cachelatency by four units while leaving untouched the D-Cache la-tency (data-point )

The data illustrated in the above chart of Fig. 10 have beenobtained based on the assumption that the wires grouped inchannels and represent the critical path of the overallPDLX design and that, after segmenting them (by inserting relaystations), we could afford to raise the clock frequency appro-priately. Specifically, for each implementation case, we set thesystem clock cycle as . One could arguethat the assumption is too coarse because, for instance, it is un-likely that all the other modules in the design are able to workcorrectly after doubling the clock. However, the main point thatwe want to stress here is that within the present methodology,one may perform an early exploration of the latency/throughputtradeoffs to guide architectural choices based on a rough es-timation of the channel latencies and then keep refining thesechoices at the different stages of the design flow to accommo-date various implementation constraints, while relying alwayson the property of the latency-insensitive communication pro-

tocol. In this regard, it is important to emphasize that all theabove implementations are functionally equivalent by construc-tion, being obtained simply by changing the number of relaystations on the channels and with no need of redesigning anyPDLX module. Furthermore, the insertion of relay stations canbe made at late stages in the design process, after detailed infor-mation can be extracted from the physical layout, to “fix” thosechannels whose latencies are longer than the desired clock cycle.While performing this operation, it is easy to keep an exact trackof the throughput variations.

VII. CONCLUSION

This paper presents the theory of latency-insensitive design.Latency-insensitive designs are synchronous distributed sys-tems composed by functional modules that exchange data oncommunication channels according to a latency-insensitiveprotocol. The protocol guarantees that latency-insensitivesystems, composed of functionally correct modules, behavecorrectly independently of the channel latencies. This allowsus to increase the robustness of a design implementationbecause any delay variations of a channel can be “recovered”by changing the channel latency while the overall systemfunctionality remains unaffected. The protocol works on the


assumption that the functional modules are stallable, a weakcondition to ask the processes to obey.

An important application of the proposed theory is repre-sented by the latency-insensitive methodology to design largedigital integrated circuits with DSM technologies. The method-ology is based on the assumption that the design is built by as-sembling blocks of IPs that have been previously designed andverified. Thanks to the compositionality of the notion of latencyequivalence, this methodology allows us to orthogonalize com-munication and computation, while the timing requirements im-posed by the clock are met by construction. Furthermore, sinceit is based on the synchronous assumption, the approach facili-tates the adoption of formal validation techniques within a newdesign flow that, for the rest, can be built using traditional CADtools.

ACKNOWLEDGMENT

The authors would like to thank L. Lavagno and A. Saldanhafor the discussions that led to the theory of latency-insensitivedesign. They would also like to thank P. Scaglia for his supportand continuous encouragement. Finally, they would also liketo thank the reviewers for their help in improving the overallquality of the paper.

REFERENCES

[1] J. Cong, “Challenges and opportunities for design innovations innanometer technologies,” SRC Design Sciences Concept Paper, Dec.1997.

[2] H. Kapadia and M. Horowitz, “Using partitioning to help convergence inthe standard-cell design automation method,” inProc. Design Automa-tion Conf., June 1999, pp. 592–597.

[3] M. T. Bohr, “Interconnect scaling—The real limiter to high performanceULSI,” in Proc. IEEE Int. Electron Devices Meeting, Dec. 1995, pp.241–244.

[4] R. Ho, K. Mai, H. Kapadia, and M. Horowitz, “Interconnect scaling im-plications for CAD,” inProc. Int. Conf. Computer-Aided Design, Nov.1999, pp. 425–429.

[5] M. T. Bohr, “Silicon trends and limits for advanced microprocessors,”Commun. ACM, vol. 41, no. 3, pp. 80–87, Mar. 1998.

[6] M. J. Flynn, P. Hung, and K. W. Rudd, “Deep-submicron microprocessordesign issues,”IEEE Micro, vol. 19, pp. 11–13, July 1999.

[7] R. Ho, K. Mai, and M. Horowitz, “The future of wires,”Proc. IEEE, vol.89, pp. 490–504, Apr. 2001.

[8] D. Matzke, “Will physical scalability sabotage performance gains?,”IEEE Comput., vol. 8, pp. 37–39, Sept. 1997.

[9] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger, “Clockrate versus ipc: The end of the road for conentional microarchitectures,”in Proc. 27th Annu. Int. Symp. Computer Architecture, June 2000, pp.248–250.

[10] P. M. Kogge,The Architecture of Pipelined Computers. New York:McGraw-Hill, 1981, Advanced Computer Science Series.

[11] J. L. Hennessy and D. A. Patterson,Computer Architecture: A Quanti-tative Approach. San Mateo, CA: Morgan Kaufmann, 1996.

[12] P. Glaskowski, “Pentium 4 (partially) previewed,”Microprocessor Rep.,vol. 14, no. 8, pp. 10–13, Aug. 2000.

[13] L. P. Carloni, K. L. McMillan, and A. L. Sangiovanni-Vincentelli,“Latency insensitive protocols,” inProceedings of the 11th Interna-tional Conference on Computer-Aided Verification, N. Halbwachsand D. Peled, Eds. New York: Springer-Verlag, 1999, vol. 1633, pp.123–133.

[14] L. P. Carloni, K. L. McMillan, A. Saldanha, and A. L. Sangiovanni-Vin-centelli, “A methodology for ‘correct-by-construction’ latency insensi-tive design,” inProc. IEEE Int. Conf. Computer-Aided Design, Nov.1999, pp. 309–315.

[15] A. Davis and S. M. Nowick, “Asynchronous circuit design: Motiva-tion, background, and methods,” inAsynchronous Digital Circuit De-sign, Workshops in Computing, G. Birtwistle and A. Davis, Eds. NewYork: Springer-Verlag, 1995, pp. 1–49.

[16] S. Hauck, “Asynchronous design methodologies: An overview,”Proc.IEEE, vol. 83, no. 1, pp. 69–93, Jan. 1995.

[17] W. A. Clark, “Macromodular computer systems,” inAFIPS ConferenceProceedings: 1967 Spring Joint Computer Conference. New York:Academic, 1967, vol. 30, pp. 335–336.

[18] W. A. Clark and C. E. Molnar, “The promise of macromodular systems,”in Proceedings of the 6th Annual IEEE Computer Society InternationalConference. Piscataway, NJ: IEEE Press, 1972, pp. 309–312.

[19] C. E. Molnar, T.-P. Fang, and F. U. Rosenberger, “Synthesis of delay-insensitive modules,” in1985 Chapel Hill Conference on Very LargeScale Integration, H. Fuchs, Ed. Rockville, MD: Computer Science,1985, pp. 67–86.

[20] F. U. Rosenberger, C. E. Molnar, T. J. Chaney, and T.-P. Fang,“Q-modules: Internally clocked delay-insensitive modules,”IEEETrans. Comput., vol. 37, pp. 1005–1018, Sept. 1988.

[21] J. T. Udding, “A formal model for defining and classifying delay-insen-sitive circuits,”Distrib. Comput., vol. 1, no. 4, pp. 197–204, 1986.

[22] J. A. Brzozowski and J. C. Ebergen, “On the delay-sensitivity of gatenetworks,”IEEE Trans. Comput., vol. 41, pp. 1349–1360, Nov. 1992.

[23] A. J. Martin, “The limitations to delay-insensitivity in asynchronouscircuits,” in Advanced Research in VLSI: Proceedings of the 6th MITConference, W. J. Dally, Ed. Cambridge, MA: MIT Press, 1990, pp.263–278.

[24] J. C. Ebergen, “A formal approach to designing delay-insensitive cir-cuits,” Distrib. Comput., vol. 5, no. 3, pp. 107–119, 1991.

[25] M. B. Josephs and J. T. Udding, “An overview of DI algebra,” inProc.Hawaii International Conference on System Sciences. Los Alamitos,CA: IEEE Comput. Soc. Press, 1993, vol. I.

[26] S. M. Burns, “Performance analysis and optimization of asynchronouscircuits,” Ph.D. dissertation, California Inst. Technol., Pasadena, CA,1991.

[27] P. Beerel and T. H.-Y. Meng, “Automatic gate-level synthesis of speed-independent circuits,” inProc. IEEE Int. Conf. Computer-Aided Design,Nov. 1992, pp. 581–587.

[28] D. L. Dill, “Trace Theory for Automatic Hierarchical Verificationof Speed-Independent Circuits,” inACM Distinguished Disserta-tions. Cambridge, MA: MIT Press, 1989.

[29] A. Kondratyev, M. Kishinevsky, B. Lin, P. Vanbekbergen, and A.Yakovlev, “Basic gate implementation of speed-independent circuits,”in Proc. Design Automation Conf., June 1994, pp. 56–62.

[30] J. L. A. van de Snepscheut,Trace Theory and VLSI Design. Berlin,Germany: Springer-Verlag, 1985, vol. 200, Lecture Notes in ComputerScience.

[31] D. C. Ku and G. De Micheli, “Relative scheduling under timing con-straints,” inProc. Design Automation Conf., June 1990, pp. 59–64.

[32] D. Filo, D. Ku, C. Coelho, and G. De Micheli, “Interface optimization forconcurrent systems under timing constraints,”IEEE Trans. Computer-Aided Design, vol. 13, pp. 268–281, Sept. 1993.

[33] E. A. Lee and A. Sangiovanni-Vincentelli, “A framework for comparingmodels of computation,”IEEE Trans. Computer-Aided Design, vol. 17,pp. 1217–1229, Dec. 1998.

[34] A. Benveniste and P. L Guernic, “Hybrid dynamical systems theory andthe signal language,”IEEE Trans. Automat. Contr., vol. 5, pp. 535–546,May 1990.

[35] E. M. Clarke, D. E. Long, and K. L. McMillan, “Compositional modelchecking,” in Proc. 4th Annu. Symp. Logic in Computer Science,Asilomar, CA, June 1989, pp. 464–475.

[36] K. L. McMillan, “A compositional rule for hardware design refinement,”in Proc. 9th Int. Conf. Computer-Aided Verification, Haifa, Israel, June1997, pp. 24–35.

[37] , “Verification of an implementation of Tomasulo’s algorithmby compositional model checking,” inProc. 10th Int. Conf. Com-puter-Aided Verification, Vancouver, BC, Canada, July 1998, pp.110–121.

[38] T. A. Henzinger, S. Qadeer, and R. K. Rajamani, “You assume, weguarantee: Methodology and case studies,” inProc. 10th Int. Conf.Computer-Aided Verification, Vancouver, BC, Canada, July 1998, pp.440–451.

[39] B. A. Giesekeet al., “A 600 MHz superscalar RISC microprocessor without-of-order execution,”Tech. Dig. IEEE Int. Solid-State Circuits Conf.,pp. 176–177, Feb. 1997.

[40] R. M. Tomasulo, “An efficient algorithm for exploiting multiple arith-metic units,”IBM J. Res. Devel., vol. 11, pp. 25–33, Jan. 1967.


[41] R. E. Kessler, “The alpha 21 264 microprocessor,”IEEE Micro., vol. 19,pp. 24–36, Mar. 1999.

[42] S. J. Schaffer and W. W. LaRue, “BONeS DESIGNER: A graphical en-vironment for discrete-event modeling and simulation,” inProceedingsof the 2nd International Workshop on Modeling, Analysis, and Simula-tion of Computer and Telecommunication Systems. Los Alamitos, CA:IEEE Comput. Soc. Press, 1994, pp. 371–374.

[43] D. E. Culler and J. P. Singh,Parallel Computer Architecture: A Hard-ware/Software Approach. San Mateo, CA: Morgan Kaufmann, 1999.

[44] The DLX Software (1994). [Online]. Available: ftp://max.stan-ford.edu/pub/hennessy-patterson.software

Luca P. Carloni (S’95) received the Laurea degree(summa cum laude) in electrical engineering fromthe University of Bologna, Bologna, Italy, in 1995,and the M.S. degree in electrical engineering andcomputer sciences from the University of California,Berkeley, in 1997. He is currently working towardsthe Ph.D. degree in electrical engineering andcomputer sciences at the same university.

His current research interests include embeddedsystems design, high-level synthesis, logic synthesis,and combinatorial optimization.

Kenneth L. McMillan received the B.S. degree inelectrical engineering from the University of Illinois,Urbana, in 1984, the M.S. degree in electrical engi-neering from Stanford University, Stanford, CA, in1986, and the Ph.D. degree in computer science fromCarnegie Mellon University, Pittsburgh, PA, in 1992.

He has been a Chip Designer, Biomedical Engi-neer, a Member of the Technical Staff at AT&T BellLaboratories, and is currently a Research Scientist atCadence Berkeley Laboratories, Berkeley, CA. Hiscurrent research interests include computer music,

formal verification, and design methodology.

Alberto L. Sangiovanni-Vincentelli (F’83) receivedthe Dott. Ing. degree (summa cum laude) in electricalengineering and computer science from the Politec-nico di Milano, Milan, Italy, in 1971.

He holds the Edgar L. and Harold H. Buttner Chairof Electrical Engineering and Computer Sciences atthe University of California, Berkeley, where he hasbeen on the Faculty since 1976. From 1980 to 1981,he spent a year as a Visiting Scientist with Mathe-matical Sciences Department of the IBM T.J. WatsonResearch Center, Yorktown Heights, NY. In 1987, he

was a Visiting Professor with the Massachusetts Institute of Technology, Cam-bridge. He cofounded Cadence Design Systems (where he is currently the ChiefTechnology Advisor and Member of the Board of Directors), Synopsys, Inc.(where he was Chair of the Technical Advisory Board), and Comsilica, a startupin the wireless communication area (where he is currently the Chairman of theBoard). He also founded the Cadence Berkeley Laboratories and the KawasakiBerkeley Concept Research Center, where he is Chairman of the Board. He wasa Director of ViewLogic and Pie Design Systems. He is currently a Member ofthe Board of Directors of Sonics Inc., Softface, and Accent. He has consultedfor a number of U.S. companies, including IBM, Intel, AT&T, GTE, GE, Harris,Nynex, Teknekron, DEC, HP, Japanese companies, including Kawasaki Steel,Fujitsu, Sony and Hitachi, and European companies, including SGS-ThomsonMicroelectronics, Alcatel, Daimler-Benz, Magneti-Marelli, BMW, Bull. He isthe Scientific Director of the Project on Advanced Research on Architecturesand Design of Electronic Systems, a European Group of Economic Interest. Heis on the Advisory Board of the Lester Center of the Haas School of Businessand of the Center for Western European Studies and a member of the BerkeleyRoundtable of the International Economy. He has authored or coauthored over530 papers and 14 books in the area of design methodologies, large-scale sys-tems, embedded controllers, hybrid systems and tools.

Dr. Sangiovanni-Vincentelli is a Member of the National Academy of En-gineering. He received the Distinguished Teaching Award of the University ofCalifornia in 1981, the Guillemin-Cauer Award in 1982, the Darlington Awardin 1987, and the 1995 Graduate Teaching Award of the IEEE. He was the Tech-nical Program Chairperson of the International Conference on Computer-AidedDesign and is currently General Chair and was also the Executive Vice-Presi-dent of the IEEE Circuits and Systems Society.

Documents

Theory of latency-insensitive design - Computer-Aided ...luca/research/lipTransactions.pdf · delay-insensitive circuits [19], [20]. A delay-insensitive circuit is designed to operate