Buffered Clock Tree Synthesis with Non-Zero Clock Skew Scheduling for Increased Tolerance to Process Parameter Variations

P1: KCU/PCY P2: VTL

Journal of VLSI Signal Processing KL448-04 May 27, 1997 8:54

Journal of VLSI Signal Processing 16, 149–161 (1996)c© 1996 Kluwer Academic Publishers. Manufactured in The Netherlands.

Buffered Clock Tree Synthesis with Non-Zero Clock Skew Schedulingfor Increased Tolerance to Process Parameter Variations∗

JOSE LUIS NEVES AND EBY G. FRIEDMANDepartment of Electrical Engineering University of Rochester

Rochester, NY 14618

Received August 15, 1996; Revised November 20, 1996

Abstract. An integrated top-down design system is presented in this paper for synthesizing clock distributionnetworks for application to synchronous digital systems. The timing behavior of a synchronous digital circuit isobtained from the register transfer level description of the circuit, and used to determine a non-zero clock skewschedule which reduces the clock period as compared to zero skew-based approaches. Concurrently, thepermissiblerangeof clock skew for each local data path is calculated to determine the maximum allowed variation of thescheduled clock skew such that no synchronization failures occur. The choice of clock skew values considersseveral design objectives, such as minimizing the effects of process parameter variations, imposing a zero clockskew constraint among the input and output registers, and constraining the permissible range of each local data pathto a minimum value.

The clock skew schedule and the worst case variation of the primary process parameters are used to determine thehierarchical topology of the clock distribution network, defining the number of levels and branches of the clock treeand the delay associated with each branch. The delay of each branch of the clock tree is physically implementedwith distributed buffers targeted in CMOS technology using a circuit model that integrates short-channel deviceswith the signal waveform shape and the characteristics of the clock tree interconnect. A bottom-up approach forcalculating the worst case variation of the clock skew due to process parameter variations is integrated with thetop-down synthesis system. Thus, the local clock skews and a clock distribution network are obtained which aremore tolerant to process parameter variations.

This methodology and related algorithms have been demonstrated on several MCNC/ISCAS-89 benchmarkcircuits. Increases in system-wide clock frequency of up to 43% as compared with zero clock skew implementationsare shown. Furthermore, examples of clock distribution networks that exploit intentional localized clock skew arepresented which are tolerant to process parameter variations with worst case clock skew variations of up to 30%.

1. Introduction

Most existing digital systems utilize fully synchronoustiming, requiring a reference signal to control thetemporal sequence of operations. Globally distributed

∗This research is based upon work supported by Grant 200484/89.3from CNPq (Conselho Nacional de Desenvolvimento Cient´ıfico eTecnologico—Brasil), the National Science Foundation under GrantNo. MIP-9208165 and Grant No. MIP-9423886, the Army ResearchOffice under Grant No. DAAH04-93–G-0323, and by a grant fromthe Xerox Corporation.

signals, such as clock signals, are used to provide thissynchronous time reference. These signals can domi-nate and limit the performance of VLSI-based digitalsystems. The importance of these global signals is,in part, due to the continuing reduction of feature sizeconcurrent with increasing chip dimensions. Thus in-terconnect delay has become increasingly significant,perhaps of greater importance than active device de-lay. The increased global interconnect delay also leadsto significant differences in clock signal propagationwithin the clock distribution network, calledclockskew, which occurs when the clock signals arrive at the

P1: KCU/PCY P2: VTL


150 Neves and Friedman

storage elements at different times. The clock skewcan be further increased by unintentional factors suchas process parameter variations which may limit themaximum frequency of operation, as well as create raceconditions independent of clock frequency, leading tocircuit failure. Therefore, the design of high perfor-mance, process tolerant clock distribution networks isa critical phase in the synthesis of synchronous VLSIdigital circuits. Furthermore, the design of the clockdistribution network, particularly in high speed appli-cations, requires significant amounts of time, inconsis-tent with the high turnaround in the design of the morecommon data flow elements of digital VLSI circuits.

Several techniques have been developed to improvethe performance and design efficiency of clock dis-tribution networks, such as placing distributed bufferswithin clock tree layouts [1] to control the propaga-tion delay and power consumption characteristics ofthe clock distribution networks, resizing clock netsfor speed optimization and clock path delay bal-ancing [2, 3], perform simultaneous buffer and in-terconnect sizing to optimize for speed and reducepower dissipation [4], using symmetric distribution

Figure 1. Block diagram of the clock tree design cycle integrated with standard IC design flow.

networks, such as H-tree structures [5], to minimizeclock skew, and applying zero-skew clock routing algo-rithms [e.g., 6, 7] to the automated layout of high speedclock distribution networks in cell-based circuits. Ef-fort has also been placed on reducing clock skew due toprocess variations [e.g., 8–10], and on designing clockdistribution networks so as to ensure minimal variationin clock skew [1, 7]. Alternative approaches have beendeveloped for using intentional non-zero clock skewto improve circuit performance and reliability by prop-erly choosing the local clock skews [10–12]. Targetingnon-zero local clock skew, a synthesis methodology hasbeen developed for designing clock distribution net-works capable of accurately producing specific clockpath delays [13, 14]. These clock distribution networksexploit intentional localized clock skew while takinginto account the effects of process parameter variationson the clock path delays.

A design environment is presented in this paperfor efficiently synthesizing distributed buffer, tree-structured clock distribution networks. This methodol-ogy is illustrated in terms of the IC design process cyclein Fig. 1. The IC design cycle typically begins with

38

P1: KCU/PCY P2: VTL


Buffered Clock Tree Synthesis 151

the System Specification phase. The Clock Tree De-sign Cycle utilizes timing information from the LogicDesign phase, such as the minimum and maximumdelay values of the logic blocks and the registers. Thetiming information is used to determine the maximumfrequency of operation of the circuit, the non-zero clockskew schedule, the permissible range of clock skew be-tween any pair of sequentially adjacent registers, andthe minimum clock path delay to each register. Thetopology of the clock tree is designed to enforce theclock skew schedule. The delay of each clock path is ac-curately implemented using repeaters targeting CMOStechnology. Finally, the clock tree is validated by en-suring that the worst case clock path delays caused byprocess parameter variations do not create clock skewvalues outside the allowed permissible range of eachpair of sequentially adjacent registers. Process param-eter information is extensively used in several stages ofthe design environment for ensuring the accuracy of theclock tree. The output of the Clock Tree Design Cycleis a detailed circuit description of the clock distributionnetwork, including the number and geometric size ofeach buffer stage within each branch of the clock tree.

This paper is organized as follows: in Section 2, alocalized clock skew schedule is derived from the ef-fective permissible range of the clock skew for eachlocal data path considering any global clock skew con-straints and process parameter variations. In Section 3,a topology of the clock distribution network is obtained,producing a clock tree with specific delay values as-signed to each branch. The design of circuit structuresfor implementing the individual branch delay values issummarized in Section 4. In Section 5, techniques forcompensating the scheduled local clock skew values toprocess-dependent clock path delay variations are pre-sented. In Section 6, these results are evaluated on a se-ries of circuits, thereby demonstrating performance im-provements and immunity to process parameter varia-tions. Finally, some conclusions are drawn in Section 7.

2. Optimal Clock Skew Scheduling

A synchronous digital circuitC can be modeled asa finite directed multi-graphG(V, E). Each vertexin the graph,v j ∈V , is associated with a register,circuit input, or circuit output. Each edge in thegraph, eij ∈ E, represents a physical connection be-tween verticesvi andv j , with an optional combina-tional logic path between the two vertices. An edge

is a bi-weighted connection representing the maxi-mum (minimum) propagation delayTPD max (TPD min)

between two sequentially adjacent storage elements.The propagation delayTPD includes the register, logic,and interconnect delays of a local data path [13], asdescribed in (1),

TPD = TC−Q + TLogic+ TInt + TSet-up, (1)

whereTC−Q is the time required for the data to leaveRi once it is triggered by a clock pulseCi , TLogic isthe propagation delay through the logic block betweenregistersRi andRj , TInt accounts for the interconnectdelay, andTSet-up is the time to successfully propagateto and latch the data withinRj [15].

A local data pathLij is a set of two vertices connectedby an edge,Lij ={vi , eij , v j } for any vi , v j ∈V . Aglobal data path,Pkl = vk

p→ vl , is a set of alternatingedges and vertices{vk, ek1, v1, e12, . . . ,en−1l , vl }, rep-resenting a physical connection between verticesvk andvl , respectively. A multi-input circuit can be modeledas a single input graph, where each input is connectedto vertexv0 by a zero- weighted edge.Pl(Lij ) is de-fined as the permissible range of a local data pathandPg(Pkl) is the permissible range of a global datapath.

2.1. Timing Constraints

The timing behavior of a circuitC can be describedin terms of two sets of timing constraints, local con-straints and global constraints. The local constraintsare designed to ensure the correct latching of data intothe registers of a local data path. In particular, (2) pre-vents latching the incorrect data signal intoRj by theclock pulse that latched the same data intoRi (prevent-ing double clocking [10, 11]),

TSkew(Lij ) ≥ THold j − TPD(min) + ζij , (2)

whereζij is a safety term to provide some margin in alocal data path against race conditions due to processparameter variations, and (3) guarantees that the datasignal latched inRi is latched intoRj by the followingclock pulse (preventing zero clocking [10, 11]),

TSkew(Lij ) ≤ TCP− TPD(max). (3)

Constraints (2) and (3) are similar to the synchronousconstraints introduced in [11, 12, 16, 17], where

39

P1: KCU/PCY P2: VTL



Figure 2. Permissible range of the clock skew of a local data path.

the clock skewTSkewij = TCDi − TCD j and whereTCDi (TCD j ) is the delay of theI th (Jth) clock path.

Assuming that the minimum and maximum delay ofeach combinational logic block and register are known,a region of valid clock skew is assigned to each localdata path, called the permissible rangePl(Lij ) [13, 18],as shown in Fig. 2. The bounds ofPl(Lij ) are deter-mined from the local constraints, (3) and (4), for agiven clock periodTCP. Also, thewidth of a permis-sible range is defined as the difference between themaximum(TSkewij(max)) and the minimum(TSkewij(min))

clock skew.Satisfying the clock skew constraints of each indi-

vidual local data path does not guarantee that the clockskew between two vertices of a global data pathPkl issatisfied, particularly when there are multiple paralleland feedback paths between the two vertices. Sinceany two registers connected by more than one globaldata path are each driven by a single clock path, theclock skew between these two registers is unique andthe permissible range of every path connecting the tworegisters must contain this clock skew value to ensurethat the circuit will operate correctly. As an example toillustrate that the clock skew between registers must becontained within the permissible range of each globaldata path connecting both registers, consider the cir-cuit illustrated in Fig. 3, where the numbers assignedto the edges are the maximum and minimum propa-gation delay of each local data pathLij , and the reg-ister set-up and hold times are assumed to be zero.Furthermore, the pair of clock skew values associatedwith a vertex are the minimum and maximum clockskew calculated with respect to the origin vertexv0

for a given clock period. The minimum bound of

Figure 3. Matching permissible clock skew ranges by adjusting theclock periodTCP.

Pl(Lij ) is given by (3) and isTSkewij(min) = −TPD minandthe maximum bound ofPl(Lij ) is given by (4) and isTSkewij(max) = TCP−TPD max. Observe that in Fig. 3(a), anon-empty permissible range for each individual localdata path can is obtained with a clock periodTCP = 6time units (tu). However, no clock skew value existsthat is common to the paths connecting verticesv1 andv3. The common value forTSkew13 is only obtainedwhen the clock period is increased to 8 tu.

40

P1: KCU/PCY P2: VTL



To guarantee that a clock skew value exists for anypair of registersvk, vl ∈ V within a global data path,a set of global timing constraints must be satisfied.Complete proofs of the following theorems are foundin [19]. The global timing constraints (4) and (5) areused to calculate the permissible range of any globaldata pathPkl ∈ V , and are based on the permissiblerange of the local data paths within the respective globaldata path. In particular, (4) determines the minimumand maximum clock skew of a global data path withrespect tovk, while (5) constrains the clock skew of twovertices connected by multiple forward and feedbackpaths. These two constraints can be formally stated as:

Theorem 1. For any global data path Pkl ∈ V,clock skew is conserved. Alternatively, the clock skewbetween any two storage elements, vk, vl ∈ V, isthe sum of the clock skews of each local data pathLk1, L12, . . . , Ln−1l , where Lk1, L12, . . . , Ln−1l arethe local data paths within Pkl ,

TSkew(Pkl) = TSkew(Lk1)+ TSkew(L12)+ · · ·+ TSkew(Ll−1l ). (4)

Theorem 2. For any global data path Pkl containingfeedback paths, the clock skew in a feedback path be-tween any two storage elements, sayvm andvn ∈ Pkl ,

is the negative of the clock skew betweenvm andvn inthe forward path,

TSkew(Pkl) = −TSkew(Plk). (5)

In the presence of multiple parallel and/or feedbackpaths connecting any two registersRk and Rl , a per-missible range only exists between these two registersif there is overlap among the permissible ranges of eachindividual parallel and feedback path connecting bothregisters. Furthermore, the upper and lower bounds ofsuch a permissible range are determined from the up-per and lower bounds of the permissible ranges of eachindividual parallel and feedback path. Formally, theconcept of permissible range overlap and the upper andlower bounds of the permissible range of a global datapathPkl can be stated as follows:

Theorem 3. Let Pkl ∈ V be a global data path withina circuit C with m parallel and n feedback paths.Let the two vertices, vk and vl ∈ Pkl , which are notnecessarily sequentially-adjacent, be the origin anddestination of the m parallel and n feedback paths, re-spectively. Also, let Pg(Pkl) be the permissible rangeof the global data path composed of verticesvk and

vl . Pg(Pkl) is a non-empty set of values iff the inter-section of the permissible ranges of each individualparallel and feedback path is a non-empty set, or

Pg(Pkl) =(

m⋂i=1

Pg(Pi

kl

)) ⋂ (n⋂

j=1

Pg(P j

lk

)) 6= ∅.(6)

Theorem 4. Let the two vertices, vk andvl ∈ Pkl , bethe origin and destination of a global data path withm forward and n feedback paths. If Pg(Pkl) 6= ∅, theupper bound of Pg(Pkl) is given by

TSkew(Pkl)max=MIN

{min

1≤i≤m

[TSkew

(Pi

kl

)max

],

min1≤ j≤n

[TSkew

(P j

lk

)min

]}, (7)

and the lower bound of Pg(Pkl) is given by

TSkew(Pkl)min=MAX

{max

1≤i≤m

[TSkew

(Pi

kl

)min

],

max1≤ j≤n

[TSkew

(P j

lk

)max

]}. (8)

Two global timing constraints impose zero clockskew among the I/O storage elements and limit thepermissible clock skew range that can be implementedby the fabrication technology. By constraining theclock skew among the off-chip registers to zero, raceconditions are eliminated among all integrated circuitscontrolled by the same clock source by avoiding thepropagation of a non-zero clock skew beyond the in-tegrated circuit. This condition is represented by thefollowing expression,

TSkew(Pkl)min=MAX

{max

1≤i≤m

[TSkew

(Pi

kl

)min

],

max1≤ j≤n

[TSkew

(P j

lk

)max

]}. (9)

An immediate consequence of (9) is that the clock pathdelay from the clock source to every input and outputregister is equal.

Although the permissible range of a local data pathis theoretically infinite, practical limitations place con-straints on the minimum clock path delays that canbe implemented with a given fabrication technology.These clock path delays determine the minimum clock

41

P1: KCU/PCY P2: VTL



skew that can be assigned to any two vertices in the cir-cuit. These fabrication dependent timing constraints are

|TSkew(Lij )max− TSkew(Lij )min| ≥ C1,

|TSkewij | ≥ C2,(10)

whereC1 andC2 are dependent on the fabrication tech-nology and are a measure of the statistical variation ofthe process parameters.

2.2. Optimal Clock Period

The problem of determining an optimal clock period fora synchronous circuit while exploiting non-zero clockskew has been previously studied [11, 12, 16, 17]. Inthese approaches, clock delays rather than clock skewsare calculated. Therefore, these clock delays cannotbe directly used for determining the permissible rangeof the local clock skews. Thus, there is no process fordetermining the position of the scheduled clock skewwithin the permissible range. A technique to performthis process is described in this paper to schedule theclock skew and to prevent synchronization failures dueto process parameter variations.

The determination of the minimum clock period us-ing permissible ranges is possible by recognizing thatthe width of the permissible range of a local data path isdependent on the clock period [from (3)]. The overlapof permissible ranges guarantees the synchronizationof the data flow between non-adjacent registers con-nected by multiple feedback and/or parallel paths. Thistechnique initially guarantees the existence of a permis-sible range for each local data path and terminates bysatisfying (6) for every data path in the circuit. The dif-ference between the propagation delays of a local datapath Lij defines the minimum clock period necessaryto safely latch data withinLij . The largest differenceamong all the local data paths of the circuit definesthe minimum clock period that can be used to safelylatch data into any local data path. However, as shownin the example depicted in Fig. 3, in the presence offeedback and/or parallel paths, local timing constraintsmay not be sufficient to determine the minimum clockperiod (since certain global timing constraints such as(6) must also be satisfied). Nevertheless, a clock pe-riod always exists that satisfies all the local and globaltiming constraints of a circuit. This clock period isbounded by two terms,TCP min and TCP max, as inde-pendently demonstrated by Deokar and Sapatnekar in[12]. The lower bound of the clock period,TCP min, isthe greatest difference in propagation delay of any local

data pathLij ∈ C,

TCP min= MAX

[minvij∈V

(TPD maxij − TPD minij ),

maxvi∈V

(TPD maxii )

], (11)

and the upper bound of the clock period,TCP max, isthe greatest propagation delay of any local data pathLij ∈ G,

TCP max= MAX[

maxvij∈V

(TPD maxij ),maxvi∈V

(TPD maxii )].

(12)

The second term in (11) and (12) accounts for theself-loop circuit when the output of a register is con-nected to its input through an optional logic block.Since the initial and final registers are the same, theclock skew in a self-loop is zero and the clock periodis determined by the maximum propagation delay ofthe path connecting the output of the register to its in-put. Observe that a clock period equal to the lowerbound exists for circuits without parallel and/or feed-back paths. Furthermore, a clock period equal to theupper bound always exists since the permissible rangeof any local data path in the circuit contains the zeroclock skew value. Although (12) satisfies any local andglobal timing constraints of circuitC, it is possible todetermine a lower clock period that satisfies (6).

Several algorithms for determining the optimal clockperiod while exploiting non-zero clock skew exist.Fishburn [11] introduced this approach with a lin-ear programming-based algorithm that minimizes theclock period while determining a set of clock path de-lays to drive the individual registers within the circuit.In [12], Deokar and Sapatnekar present a graph-basedapproach to achieve a similar goal, followed by an op-timization step to reduce the skew between registerswhile preserving the minimum clock period. Otherworks, such as Sakallah et al. [16] and Szymanski [17],also calculate the optimal clock period and clock pathdelay schedule using linear programming techniques.

A graph-based algorithm is implemented inC to de-termine the minimum clock period and a permissiblerange for each local data path while ensuring that allthe permissible ranges in the circuit satisfy (6) [18, 19].The initial clock period is given by (11) and, the localand global permissible ranges for each local data pathare calculated assuming this clock period. If at leastone data path does not satisfy (6), the clock period isincreased and the permissible ranges are re-calculated.This iterative process continues until (6) is satisfied for

42

P1: KCU/PCY P2: VTL



all global data paths. The primary distinction of thisalgorithm is that the permissible range of each localdata pathPl(Lij ) is determined rather than the individ-ual clock path delays to registersRi and Rj . Fromeach permissible range a clock skew value is chosen asexplained in Section 2.3. This information is crucialfor maximizing the performance of a synchronous cir-cuit while considering the effects of process parametervariations in the design of high speed clock distributionnetworks.

2.3. Selecting Clock Skew Values

Given any two verticesvk, vl ∈ V , the set of validclock skew values betweenvk andvl is given by (6)and bounded by (7) and (8), as described in Section 2.2.In the presence of feedback and/or parallel paths, theresulting permissible rangePg(Pkl) is a sub-set of thepermissible range of each independent global data pathbetweenvk andvl , as exemplified in Fig. 3. However,due to (4),Pg(Pkl) is the sum of the permissible range ofeach local data path for every global data pathPkl con-nectingvk andvl . Therefore, it is necessary to constrainthe permissible range of each local data path to a sub-setof values within its original permissible range. Alter-natively, if Pl(Lij ) is the permissible range of a localdata path within one of the global data paths connectingvk andvl , ρ(Lij ) is a sub-set of values withinPl(Lij )

such thatρ(Lij ) ⊆ Pl(Lij ). This new regionρ(Lij ) isdescribed as theeffective permissible rangeof a localdata path.

An example of an effective permissible range is theparallel path shown in Fig. 3(a). ForTCP = 8 tu,the permissible rangePg(P13) = [−2,−2]. SincePg(P13) = Pl(L12)+Pl(L23), the local data pathsL12

andL23 can only assume clock skew values for whichthe sum is within [−2,−2]. In this case, the permissi-ble range of each local data path is reduced to a singlevalue, orPl(L12) = [1, 1] andPl(L23) = [−3,−3],respectively.

Assume that the clock period of the circuit inFig. 3(a) is now increased from 8 tu to 9 tu. The newpermissible rangePg(P13) = [−2, 0] and the effectivepermissible range of each local data path isρ(L12) =[1, 2], ρ(L23) = [−3,−2], andρ(L13) = [−2, 0], re-spectively. Note that selecting a clock skew value out-side the effective permissible range of a local data pathmay lead to a race condition since (7) is violated. Also,there is no unique solution to the selection of an effec-tive permissible range unlessρ(Lij ) = Pl(Lij ). Forexample,Pl(L12) could be set to [0, 2] andPl(L23)

set to [−2,−2], giving the same permissible rangePg(P13) = [−2, 0]. Therefore, given any two verticesvk, vl ∈ V with feedback and/or parallel paths connect-ing vk andvl , the selection of a clock skew schedulerequires determining the effective permissible rangeρ(Lij ) for each local data path betweenvk andvl , andthe relative position ofρ(Lij ) within Pl(Lij ).

The effective permissible range of a local data pathρ(Lij )may not be unique, leading to multiple solutionsto the clock skew scheduling problem. It is, however,possible to obtain one solution that is most suitable forminimizing the clock period while reducing the possi-bility of race conditions due to the effects of processparameter variations. This solution forρ(Lij ) is derivedfrom the observation that the bounds of the permissi-ble range of any two verticesvk, vl ∈ V (with possiblefeedback and/or parallel paths connectingvk andvl ) aremaximum when determined by (7) and (8), and that thepermissiblePg(Pkl) bounded by (7) and (8) is unique.Therefore, the clock skew scheduling problem can bedivided into two phases. In the first phase, the permis-sible range of each global data path is derived from(6), with bounds given by (7) and (8). In the secondphase, the clock skew schedule is solved by the follow-ing process: 1) the permissible range of a global datapathPg(Pkl) is divided equally among each local datapath belonging to each global data path connecting theverticesvk andvl ; 2) within each global data path eacheffective permissible rangeρ(Lij ) is placed as close aspossible to the upper bound of the original permissiblerangePl(Lij ), thereby minimizing the likelihood of cre-ating any race conditions; and 3) the specific value ofthe clock skew is chosen in the middle of the effectivepermissible range, since no prior information describ-ing the variation of a particular clock skew value mayexist. An algorithm for selecting the clock skew of eachlocal data path was implemented as described in [18,19]. From this clock skew schedule the minimum clockpath delay to each register in the circuit is calculated[19].

Providing independent clock path delays for eachregister is impractical due to the large capacitive loadplaced on the clock source and the inefficient use ofdie area. A tree structured clock distribution networkis more appropriate, where the branching points are se-lected according to the delay of each clock path, therelative physical position of the clocked registers, andthe sensitivity of each local data path to delay varia-tions. Such an approach for determining the structuraltopology of a clock distribution network is describedin the following section.

43

P1: KCU/PCY P2: VTL



3. Clock Tree Topological Design

The topology of a clock tree derived from a clock skewschedule must ensure that the clock path delays are ac-curately implemented while considering the effects ofprocess parameter variations. A tree-structured topol-ogy can be based on the hierarchical description of thecircuit netlist, on implementing a balanced tree with afixed number of branching levels from the clock sourceto each register with a pre-defined number of branchingpoints per node (an example of this approach is a binarytree withn levels for 2n registers with two branchingpoints per node), on reducing the effects of process pa-rameter variations by driving common local data pathsby the same sub-tree, or by implementing each clockpath delay with pre-defined delay segments such thatthe layout area of the clock tree is reduced.

The topology of the clock distribution tree is built bydriving common local data paths by the same sub-treeand by assigning precise delay values to each branch ofthe clock tree such that the skew assignment is satisfied[20]. For this purpose, each clock path delay is parti-tioned into a series of branches, each branch emulatinga precise quantified delay value. Between any twosegments, there is a branching point to other registersor sub-trees of the clock tree, where several brancheswith pre-defined delays are cascaded to provide theappropriate delay between the clock source (or root)and each leaf node. The selection of the branch delayis dependent upon the minimum propagation delay thatcan be implemented for a particular fabrication processand the inverter transconductance (or gain). An exam-ple of the topology of a clock tree is shown in Fig. 4,where the numbers in brackets are the delays assigned

Figure 4. Topology of the clock distribution network.

to each branch and the numbers in parenthesis are theclock skew assignment.

4. Circuit Design of the Clock Tree

The circuit structures are designed to emulate the de-lay values associated with each branch of the clocktree. Special attention is placed on guaranteeing thatthe clock skewbetweenany two clock paths is satisfiedrather than satisfying each individual clock path delay.The successful design of each clock path is primar-ily dependent on two factors: 1) isolating each branchdelay using active elements, specifically CMOS invert-ers, and 2) using repeaters to integrate the inverter andinterconnect delay equations so as to more accuratelycalculate the delay of each clock path.

The interconnect lines are modeled as purely capac-itive lines by inserting inverting buffer repeaters intothe clock path such that the output impedance of eachinverter is significantly greater than the resistance ofthe driven interconnect line [21]. As a consequence,the slope of the input signal of a buffer connectedto a branching point is identical to the slope of theoutput signal of the buffer driving that same branchingpoint [22].

In the existing design methodology [14, 22], the de-lay of a branch is implemented with one or more CMOSinverters, as illustrated in Fig. 5. The delay equationsof each inverter are based on the MOSFETα-powerlaw short-channel I-V model developed by Sakurai andNewton [23].

Each inverter is assumed to be driven by a ramp sig-nal with symmetric rising and falling slopes, selected

Figure 5. Design of a branch delay element.

44

P1: KCU/PCY P2: VTL



such that during discharge (charge), the effects ofthe PMOS (NMOS) transistor can be neglected. Thecapacitive load of an inverter so as to satisfy a specificbranch delaytdi is

CLi = 2IDO

VDD

[tdi −

(1

2− 1− vτ

1+ α)

tTi−1

], (13)

where IDO is the drain current atVGS = VDS =VDD,VDO is the drain saturation voltage atVGS =VDD,Vth is the threshold voltage,α is the velocity sat-uration index,VDD is the power supply,tdi is the delayof an inverter defined at the 50%VDD point of the inputwaveform to the 50%VDD point of the output wave-form, vT = Vth/VDD, andtTi is the transition time ofthe input signal. Note thatCLi is composed of thecapacitance of the driven interconnect line and the to-tal gate capacitance of allbi+1 inverters. Sincetdi isknown, the only unknown in (13) is the transition timeof the input signaltTi (provided by [23]).tTi can be ap-proximated by a ramp shaped waveform, or by linearlyconnecting the points 0.1VDD and 0.9VDD of the outputwaveform. This assumption is accurate as long as theinterconnect resistance is negligible as compared withthe inverter output impedance.

tTi = t0.9− t0.80.8

= CLi VDD

IDO

(0.9

0.8+ VDO

0.8VDDln

10VDO

eVDD

). (14)

For each clock path within the clock tree, the proce-dure to design the CMOS inverters is as follows: 1) theload of the initial trunk of the clock tree is determinedfrom (13), assuming a step input clock signal; 2) theslope of the output signal is calculated from (14) andapplied in (13) to determine the capacitive load of thefollowing branch, permitting the slope of the outputsignal to be calculated; and 3) step 2 is repeated foreach subsequent branch of the clock path. Steps 1–3 areapplied to the remaining clock paths within the clocktree. Observe that if the transition time of the outputsignal of branchbi does not satisfy

tTi ≤ 1(12 − 1−vτ

1+α) (tdi+1− VDDCLi+1

2IDO

), (15)

(13) is no longer valid. The transition timetTi canbe reduced in order to satisfy (15) by increasing theoutput current drive of the inverter in branchbi . How-ever, increasingIDOi would increase the capacitive loadCLi in order to maintain the propagation delaytdi for

branchbi . Therefore, the transition time associatedwith branchbi must be maintained constant as long asthe propagation delaytdi of the branchbi remains thesame. Furthermore, the number of inverters requiredto implement the propagation delaytdi is chosen suchthat (15) is satisfied and the proper polarity of the clocksignal driving branchbi+1 is maintained.

5. Increasing Tolerance to ProcessParameter Variations

Every semiconductor fabrication process can be char-acterized by variations in process parameters. Theseprocess parameter variations along with environmentalvariations, such as temperature, supply voltage, and ra-diation, may compromise both the performance and thereliability of the clock distribution network. A bottom-up approach is presented in this section for verifying theselected clock skew values and correcting for any vari-ations of the clock skew due to process parameter vari-ations that violate the bounds of the permissible range.

5.1. Circuit Design Considerations

Each clock path delay can be modeled as beingcomposed of both a deterministic delay componentand a probabilistic delay component. While the de-terministic component can be characterized with welldeveloped delay models [e.g., 23], the probabilisticcomponent of the clock path delay is dependent uponvariations of the fabrication process and the environ-mental conditions. The variations of the fabricationprocess affect both the active device parameters (e.g.,IDO,Vth, µo) and the passive geometric parameters(e.g., the interconnect width and spacing).

The probabilistic delay component is determined foreach clock path by assuming that the cumulative effectsof the device parameter variations, such as thresholdvoltage and channel mobility, can be collected into asingle parameter characterizing the gain of the inverter,specifically the output current of a CMOS inverterIDO [23]. The minimum and maximum clock path de-lays are calculated considering the minimum and max-imum IDO of each inverter within a branch of the clockdistribution network. The worst case variation of theclock skews is determined from the minimum and max-imum clock path delays of each local data path. If atleast one worst case clock skew value is outside theeffective permissible range of the corresponding localdata path (i.e.,TSkewij 6⊂ ρ(Lij )), a timing constraint

45

P1: KCU/PCY P2: VTL



Figure 6. Example of upper and lower bound clock skew violations.

is violated and the circuit will not work properly, asillustrated in the example shown in Fig. 6.

This violation is passed to the top-down synthesissystem, indicating which bound of the effective per-missible range is violated. The clock skew of at leastone local data pathLij within the system may violatethe upper bound ofρ(Lij ), i.e., TSkewij > TSkewij(max).Observe that ifρ(Lij ) = Pl(Lij ), TSkewij does not sat-isfy (3), shown as regionC in Fig. 2, causing zeroclocking [11]. By increasing the clock periodTCP,the effective permissible clock skew range for eachlocal data path is also increased(TSkewij(max) is in-creased due to monotonicity), permitting those localdata paths previously in regionC to satisfy (3). Thenew clock skew value may also violate the lower boundof a local data path, i.e.,TSkewij < TSkewij(min), whereTSkewij(min) ⊂ ρ(Lij ). Observe that ifρ(Lij ) = Pl(Lij ),

TSkewij does not satisfy (2), shown as regionA in Fig. 2,causing double clocking [11]. This situation can be po-tentially dangerous since the lower bound ofPl(Lij ) isindependent of the clock frequency, causing the circuitto function improperly.

Two compensation techniques are used to preventlower bound violations, depending upon where the ef-fective permissible range of a local data pathρ(Lij )

is located within the absolute permissible range ofthe local data path,Pl(Lij ). If the worst case clockskew is in between the lower bounds ofρ(Lij ) andPl(Lij ),MIN[ Pl(Lij )] < TSkewij < MIN[ ρ(Lij )], theclock periodTCP is increased until the race conditionis eliminated, since the effective permissible range willincrease, due to monotonicity. If the worst case clockskew is less than the lower bound of the permissible

range of the local data path,TSkewij < MIN[ Pl(Lij )],any increase in the clock period will not eliminate thesynchronization failure since (2) is not dependent onthe clock period. To compensate for this violation asafety termζij > 0 is added to the local timing con-straint that defines the lower bound ofPl(Lij ) [see (2)].The clock period is increased and a new clock skewschedule is calculated for this value of the clock pe-riod. The increased clock period is required to obtain aset of effective permissible ranges with widths equal toor greater than the set of effective permissible rangesthat existed before the clock skew violation. Observethat by including the safety termζij , the lower boundof the clock skew of the local data path containing therace condition is shifted to the right (see Fig. 2), mov-ing the new clock skew schedule of the entire circuitaway from the bound violation and removing any raceconditions. This iterative process continues until theworst case variations of the selected clock skews nolonger violate the corresponding effective permissiblerange of each local data path.

6. Simulation Results

The simulation results presented in this section illus-trate the performance improvements obtained by ex-ploiting non-zero clock skew. In order to demonstratethese performance improvements, a set of ISCAS-89sequential circuits is chosen as benchmark circuits.The performance results are illustrated in Table 1.The number of registers and gates within the circuitincluding the I/O registers are shown in Column 2.The upper bound of the clock period assuming zeroclock skewTCP0 is shown in Column 3. The clockperiod obtained with intentional clock skewTCPi isshown in Column 4. The resulting performance gainis shown in Column 5. The clock period obtained withthe constraint of zero clock skew imposed among theI/O registers is shown in Column 6 while the perfor-mance gain with respect to zero I/O skew is shown inColumn 7.

The results shown in Table 1 clearly demonstratereductions of the minimum clock period when inten-tional clock skew is exploited. The amount of reductionis dependent on the characteristics of each circuit, par-ticularly the differences in propagation delay betweeneach local data path. Note also that by constraining theclock skew of the I/O registers to zero, circuit speed canbe improved, although less than if this I/O constraint isnot used.

46

P1: KCU/PCY P2: VTL



Table 1. Performance improvement with non-zero clock skew.

Size TCP0 TCPi TCP

Circuit # register/# gates TSkewij = 0 TSkewij 6= 0 Gain (%) TSkewI/O = 0 Gain (%)

ex1 20/- 11.0 6.3 43.0 7.2 35.0

s27 7/10 9.2 6.6 28.0 9.2 0.0

s298 23/119 16.2 11.6 28.0 11.6 28.0

s344 35/160 28.4 25.6 9.9 25.6 9.9

s386 20/159 19.8 19.8 0.0 19.8 0.0

s444 30/181 18.6 12.2 34.4 12.2 34.4

s510 32/211 19.8 17.3 13.0 17.3 13.0

s938 67/446 27.0 21.4 20.7 25.0 7.4

s1196 45/529 37.0 30.8 16.8 37.0 0.0

s1512 89/780 53.2 43.2 18.8 53.2 0.0

Table 2. Worst case variations in clock skew due to process parameter variations,IDO = 15%.

Simulated skew (ns) Error (%)

Circuit TCP0/TCPi Gain(%)Permissible

rangeSelected

clock skew Nom Worst case Nom Worst case

cdn 1 11/9 18.0 [−8,−2] −3.0 −3.0 −2.10 0.0 30.0

cdn 2 18/15 17.0 [−6.8,−1.4] −4.2 −4.1 −3.3 2.4 21.4

cdn 3 27/18 33.0 [−14, 2.3] 1.1 1.14 1.3 3.6 18.2

Clock distribution networks which exploit inten-tional clock skew and are less sensitive to the effects ofprocess parameter variations are depicted in Table 2.The ratio of the minimum clock period assuming zeroclock skewTCPo to the intentional clock skewTCPi andthe per cent improvement is shown in Columns 2 and 3,respectively. The permissible range most susceptibleto process parameter variations is illustrated in Col-umn 4. The selected clock skew is shown in Column 5.In Columns 6 and 7, respectively, the nominal andmaximum clock skew are depicted, assuming a±15%variation of the drain currentIDO of each inverter. Notethat both the nominal and the worst case value of theclock skew are within the permissible range. The percent variation of clock skew due to the effects of processparameter variations is shown in Columns 8 and 9. Thisresult confirms the claim stated previously that varia-tions in clock skew due to process parameter variationscan be both tolerated and compensated.

7. Conclusions

An integrated top-down, bottom-up approach is pre-sented for synthesizing clock distribution networks

tolerant to process parameter variations. In the top-down phase, the clock skew schedule and permissi-ble ranges of each local data path are calculated whileminimizing the clock period. The process of determin-ing the bounds of the permissible ranges and selectingthe clock skew value for each local data path so as tominimize the effects of process parameter variations isdescribed. Rather than placing limits or bounds on theclock skew variations, this approach guarantees thateach selected clock skew value is within the permis-sible range despite worst case variations of the clockskew. Techniques for designing the topology and theCMOS-based circuit structure of the clock trees are pre-sented. In the bottom-up phase, worst case variations ofclock skew due to process parameter variations are de-termined from the specific clock distribution network.Variations are compensated by the proper choice ofclock skew for each local data path. Results of optimiz-ing the clock skew schedule of several MCNC/ISCAS-89 benchmark circuits are presented. A schedule of theclock skews to make a clock distribution network lesssensitive to process parameter variations is presentedfor several example networks. An 18% improvementin clock frequency with up to a 30% variation in thenominal clock skew, and a 33% improvement in clock

47

P1: KCU/PCY P2: VTL



frequency with up to an 18% variation in the nomi-nal clock skew are demonstrated for several examplecircuits.

References

1. S. Pullela, N. Menezes, J. Omar, and L.T. Pillage, “Skew anddelay optimization for reliable buffered clock trees,”Proceed-ings of the IEEE International Conference on Computer-AidedDesign, pp. 556–562, Nov. 1993.

2. Q. Zhu, W.W.-M. Dai, and J.G. Xi, “Optimal sizing of high-speed clock networks based on distributed RC and lossytransmission line models,”Proceedings of the IEEE Interna-tional Conference on Computer-Aided Design, pp. 628–633,Nov. 1993.

3. J. Cong and K.-S. Leung, “Optimal wiresizing under the dis-tributed elmore delay model,”Proceedings of the IEEE Inter-national Conference on Computer-Aided Design, pp. 634–639,Nov. 1993.

4. J. Cong and C.-K. Koh, “Simultaneous driver and wire sizingfor performance and power optimization,”IEEE Transactionson VLSI Systems, Vol. VLSI-2, No. 4, pp. 408–425, Dec. 1994.

5. H.B. Bakoglu, J.T. Walker, and J.D. Meindl, “A symmetricclock-distribution tree and optimized high-speed interconnec-tions for reduced clock skew in ULSI and WSI circuits,”Pro-ceedings of the IEEE International Conference on ComputerDesign, pp. 118–122, Oct. 1986.

6. T.-H. Chao, Y.-C. Hsu, J.-M. Ho, K.D. Boese, and A.B. Kahng,“Zero skew clock routing with minimum wirelength,”IEEETransactions on Circuits and Systems-II: Analog and Digi-tal Signal Processing, Vol. CAS-39, No. 11, pp. 799–814,Nov. 1992.

7. R.-S. Tsay, “An exact zero-skew clock routing algorithm,”IEEETransactions on Computer-Aided Design of Integrated Circuitsand Systems, Vol. CAD-12, No. 2, pp. 242–249, Feb. 1993.

8. S. Lin and C.K. Wong, “Process-variation-tolerant clock skewminimization,” Proceedings of the IEEE International Confer-ence on Computer-Aided Design, pp. 284–288, Nov. 1994.

9. M. Shoji, “Elimination of process-dependent clock skew inCMOS VLSI,” IEEE Journal of Solid-State Circuits, Vol. SC-21,No. 5, pp. 875–880, Oct. 1986.

10. E.G. Friedman,Clock Distribution Networks in VLSI Circuitsand System, IEEE Press, 1995.

11. J.P. Fishburn, “Clock skew optimization,”IEEE Transactions onComputers, Vol. C-39, No. 7, pp. 945–951, July 1990.

12. R.B. Deokar and S. Sapatnekar, “A graph-theoretic approachto clock skew optimization,”Proceedings of the IEEE Interna-tional Symposium on Circuits and Systems, pp. 407–410, May1994.

13. J.L. Neves and E.G. Friedman, “Design methodology for synthe-sizing clock distribution networks exploiting non-zero localizedclock skew,”IEEE Transactions on VLSI Systems, Vol. VLSI-4,No. 2, pp. 286–291, June 1996.

14. J.L. Neves and E.G. Friedman, “Synthesizing distributed bufferclock trees for high performance ASICs,”Proceedings of theIEEE ASIC Conference, pp. 126–129, Sept. 1994.

15. E.G. Friedman, “Latching characteristics of a CMOS bistableregister,” IEEE Transactions on Circuits and Systems—I:

Fundamental Theory and Applications, Vol. CAS I-40, No. 12,pp. 902–908, Dec. 1993.

16. K.A. Sakallah, T.N. Mudge, and O.A. Olukotun, “CheckTcandminTc: Timing verification and optimal clocking of synchronousdigital circuits,”Proceedings of the IEEE/ACM Design Automa-tion Conference, pp. 111–117, June 1990.

17. T.G. Szymanski, “Computing optimal clock schedules,”Pro-ceedings of the IEEE/ACM Design Automation Conference,pp. 399–404, June 1992.

18. J.L. Neves and E.G. Friedman, “Optimal clock skew schedulingtolerant to process variations,”Proceedings of the ACM/IEEEDesign Automation Conference, pp. 623–628, June 1996.

19. J.L. Neves, “Synthesis of Clock Distribution Networks for HighPerformance VLSI/ULSI-Based Synchronous Digital Systems,”Ph.D. Dissertation, University of Rochester, Dec. 1995.

20. J.L. Neves and E.G. Friedman, “Topological design of clockdistribution networks based on non-zero clock skew specifica-tions,”Proceedings of the IEEE Midwest Symposium on Circuitsand Systems, pp. 461–471, Aug. 1993.

21. S.Dhar and M.A. Franklin, “Optimum buffer circuits for driv-ing long uniform lines,”IEEE Journal of Solid State Circuits,Vol. SC-26, No. 1, pp. 32–40, Jan. 1991.

22. J.L. Neves and E.G. Friedman, “Circuit synthesis of clock dis-tribution networks based on non-zero clock skew,”Proceedingsof the IEEE International Symposium on Circuits and Systems,pp. 4.175–4.178, May 1994.

23. T. Sakurai and A.R. Newton, “Alpha-power law MOSFET modeland its applications to CMOS inverter delay and other formulas,”IEEE Journal of Solid State Circuits, Vol. SC-25, No. 2, pp. 584–594, April 1990.

Jose Luis P.C. Nevesreceived the B.S. degree in Electrical Engi-neering in 1986, and the M.S. degree in Computer Science in 1989from the Federal University of Minas Gerais (UFMG), Brazil. He re-ceived the M.S. and Ph.D. degrees in electrical engineering from theUniversity of Rochester, New York, in 1991 and 1995, respectively.

He was with the Physics Department of the UFMG as an electricalengineer from 1986 to 1987, where he managed the automation ofseveral research laboratories, designing data acquisition equipmentand writing programs for data collect and analysis. He was a Teach-ing and Research Assistant at the University of Rochester from 1990to 1995. He was a computer systems administrator with the Labo-ratory of Respiratory Physiology in the Department of Anesthesiol-ogy, University of Rochester from 1992 to 1996, writing programsfor data collect and analysis, and designing the supporting electronicequipment. He has been with IBM Microelectronics since 1996 as

48

P1: KCU/PCY P2: VTL



an advisory engineer/scientist responsible for developing and imple-menting clock distribution design and synthesis tools. His researchinterests include high performance VLSI/IC design and analysis,timing issues in VLSI design, and CAD tool and methodology de-velopment with application to the design and synthesis of clock dis-tribution networks, low power circuits, and CMOS circuit designtechniques tolerant to process parameter variations.

Dr. Neves received a Doctoral Fellowship from the National Re-search Council (CNPq) Brazil from 1990 to 1994. He is a memberof the Technical Program Committee of ISCAS ’[email protected]

Eby G. Friedman was born in Jersey City, New Jersey in 1957. Hereceived the B.S. degree from Lafayette College, Easton, PA in 1979,and the M.S. and Ph.D. degrees from the University of California,Irvine, in 1981 and 1989, respectively, all in electrical engineering.

He was with Philips Gloeilampen Fabrieken, Eindhoven, TheNetherlands, in 1978 where he worked on the design of bipolardifferential amplifiers. From 1979 to 1991, he was with HughesAircraft Company, rising to the position of manager of the SignalProcessing Design and Test Department, responsible for the designand test of high performance digital and analog IC’s. He has beenwith the Department of Electrical Engineering at the University ofRochester, Rochester, NY, since 1991, where he is an Associate Pro-fessor and Director of the High Performance VLSI/IC Design andAnalysis Laboratory. His current research and teaching interests arein high performance microelectronic design and analysis with ap-plication to high speed portable processors and low power wirelesscommunications.

He has authored two book chapters and many papers in the fieldsof high speed and low power CMOS design techniques, pipeliningand retiming, and the theory and application of synchronous clockdistribution networks, and has edited one book,Clock DistributionNetworks in VLSI Circuits and Systems(IEEE Press, 1995). Dr.Friedman is a Senior Member of the IEEE, a Member of the editorialboard ofAnalog Integrated Circuits and Signal Processing, Chairof the VLSI Systems and Applications CAS Technical Committee,Chair of the VLSI track for ISCAS ’96 and ’97, and a Member of thetechnical program committee of a number of conferences. He was aMember of the editorial board of theIEEE Transactions on Circuitsand Systems II: Analog and Digital Signal Processing, Chair of theElectron Devices Chapter of the IEEE Rochester Section, and a re-cipient of the Howard Hughes Masters and Doctoral Fellowships, anNSF Research Initiation Award, an Outstanding IEEE Chapter Chair-man Award, and a University of Rochester College of EngineeringTeaching Excellence [email protected]

49

Documents

Buffered Clock Tree Synthesis with Non-Zero Clock Skew Scheduling for Increased Tolerance to Process Parameter Variations