

VLSI DESIGN 2001, Vol. 12, No. 2, pp. 167-186
Reprints available directly from the publisher
Photocopying permitted by license only

(C) 2001 OPA (Overseas Publishers Association) N.V. Published by license under the Gordon and Breach Science Publishers imprint.

Automatic FSM Synthesis for Low-power Mixed Synchronous/Asynchronous Implementation

BENGT OELMANN^{a,*}, KALLE TAMMEMÄE^{b}, MARGUS KRUUS^{b} and MATTIAS O'NILS^{a}

^{a} Mid-Sweden University, Department of Information Technology, Sundsvall, Sweden; ^{b} Tallinn Technical University, Department of Computer Engineering, Tallinn, Estonia

(Received 20 June 2000; In final form 3 August 2000)

Power consumption in a synchronous FSM (Finite-State Machine) can be reduced by partitioning it into a number of coupled sub-FSMs where only the part that is involved in a state transition is clocked. Automatic synthesis of a partitioned FSM includes a partitioning algorithm and sub-FSM synthesis to an implementation architecture. In this paper, we first introduce an implementation architecture for partitioned FSMs that uses a gated-clock technique for disabling idle parts of the circuits and asynchronous controllers for communication between the sub-FSMs. We then describe a new transformation procedure for the sub-FSM. The FSM synthesis flow has been automated in a prototype tool that accepts an FSM specification. The tool generates synthesizable RT-level VHDL code with identical cycle-to-cycle input/output behavior in accordance with the specification. An average power reduction of 45% has been obtained for a set of standard FSM benchmarks.

Keywords: Low-power design; FSM decomposition; FSM partitioning; Asynchronous logic; Gated-clock techniques; RTL-synthesis

1. INTRODUCTION

Optimization techniques for low average power consumption in synchronous digital CMOS circuits often attempt to minimize the dynamic power consumption, described as:

$$P = V_{DD}^2 \cdot f \cdot \sum_i \alpha_i C_i$$

where $\alpha_i$ is the probability of a signal transition within a clock period at node $i$, $C_i$ is the switched capacitance at node $i$, $V_{DD}$ is the power supply voltage and $f$ is the clock frequency. Power optimization can be made on all abstraction levels, from IC technology to the system level. When optimizing on the gate level, or even higher abstraction levels, power optimization minimizes the product $\alpha \cdot C$, called effective capacitance. Here, both power supply voltage and clock frequency are often regarded as fixed in the system specification and cannot be affected.

For architectural design, it is possible to reduce the effective capacitance by minimizing the communication over long wires that have high capacitance. Placing the required resources, such as processing units and memories, locally within the module reduces global communication [11]. It is also possible to shut down parts of the design that are idle, which makes the effective capacitance equal to zero during that period. Data path units, such as multipliers and ALUs, which are purely combinatorial logic, are shut down by disabling further changes of the values on the input signals. Here, additional logic is introduced to detect if the unit can be shut down or not. This technique has been described by Alidina et al. in [1] and is called the input-disabling precomputation-based approach. For sequential circuits, gated-clock techniques are used to disable the clock signal to the parts of the design that are idle. For large circuits, such as complex microprocessors, this technique is often referred to as dynamic power management. Here, there are large functional units, such as cache memories and floating point units, with very specific tasks that are shut down when not used. This type of coarse-grained gated-clock technique can be applied manually by the designer thanks to the small number of places where clock-gating is introduced and to the fact that the different units are functionally well separated and therefore easy to identify. In order to use fine-grained clock-gating, a single functional unit is partitioned into several sub-units, each of which is conditionally clocked by a gated clock signal. An automated procedure is needed for synthesizing the original design to a gated-clock implementation optimized for power. The number of places where the clock is gated increases and it becomes less obvious how to partition the unit.

For FSMs, the most common approach to low-power design is to divide the FSM into two or more sub-FSMs where only one of these is active at a time. Both the precomputation-based technique and the clock-gating technique have been used. Dasgupta et al. [8] use the precomputation-based technique for PLA (Programmable Logic Array) implementations of FSMs and reduce the effective capacitance in the transition logic and output logic. Benini et al. [5] detect self-loops, i.e., when the next state is equal to the current state, and the clock is gated under this condition. This approach has been extended in [5, 14] where states that are strongly connected, i.e., that have a high probability of state transitions among them, are placed in the same cluster or super-state, and the state transitions within the super-state can be seen as a self-loop for that super-state. These approaches result in partitioning based on the description given in an STG (State Transition Graph). In the paper by Roy et al. [14], the partitioning is based on state assignment. For FSMs with few or no self-loops, e.g., counters, it is possible to detect smaller FSMs that have self-loops. On higher levels of abstraction, the gated-clock approach has been applied for low-power optimization from high-level specifications of hardware. In [7], the control flow is examined and mutually exclusive sections of the computation are detected; these determine the partitioning of the FSM that controls the execution. In another approach presented by Hwang et al. [9], the FSM is partitioned along with the data path, which leads to an implementation where both the data path and the controller are shut down when idle.

The amount of power that is saved by partitioning the FSM is mainly determined by how well the partitioning algorithm can cluster strongly connected states together in sub-FSMs and by how large the cost is, in terms of power, to make a state transition from one sub-FSM to another. In our work, we have focused on minimizing the cost of making state transitions from one sub-FSM to another. This has led us to a new implementation architecture that is based on a gated-clock technique for shutting down idle sub-FSMs, and asynchronous communication between the sub-FSMs. The two main benefits of having asynchronous communication are as follows: First of all, the power overhead introduced by the circuits handling sub-FSM communication is up to five times lower than for the corresponding synchronous solution [12]. Secondly, a more power-efficient protocol can be employed that, on its own, lowers the power consumption by up to a factor of two compared to existing ones [13].

The outline of the rest of this paper is as follows: The next chapter describes the principles behind gated-clock FSMs, points out the problems with fully synchronous implementation architectures and motivates the asynchronous approach we propose. Chapter 3 presents the decomposition model we use and Chapter 4 describes details on how partitioned FSMs are implemented. Chapter 5 presents experimental results from automatically synthesized FSM benchmark circuits.
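As a small numerical illustration of the dynamic power expression given at the start of this introduction, the following Python sketch evaluates the effective capacitance and the resulting power for a made-up three-node circuit. The function name `dynamic_power` and all node values are assumptions chosen for illustration only; they are not taken from the paper.

```python
def dynamic_power(vdd, freq, alphas, caps):
    """Evaluate P = VDD^2 * f * sum_i(alpha_i * C_i)."""
    effective_cap = sum(a * c for a, c in zip(alphas, caps))
    return vdd ** 2 * freq * effective_cap


# Hypothetical circuit: per-node transition probabilities and capacitances (farads).
alphas = [0.5, 0.1, 0.02]
caps = [10e-15, 50e-15, 200e-15]

p_all = dynamic_power(3.3, 50e6, alphas, caps)

# Shutting down the part of the design behind the last node sets its
# transition probability, and hence its effective capacitance, to zero.
p_gated = dynamic_power(3.3, 50e6, [0.5, 0.1, 0.0], caps)

print(f"all nodes active: {p_all:.3e} W, last node shut down: {p_gated:.3e} W")
```

With supply voltage and clock frequency fixed, the only way the sketch can lower the result is through the products of transition probability and capacitance, which is the quantity the shut-down techniques discussed above target.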

FIGURE 1 Implementation architectures for: (a) asynchronous partitioned FSM, (b) synchronous partitioned FSM, and (c) monolithic FSM.

2. BACKGROUND

The goal of an implementation architecture for partitioned FSMs is to provide an implementation with the same input/output behavior as the FSM specification describes. In our case, the specification is given in STG form for a monolithic FSM with synchronous behavior. The implementation architecture we propose in this paper is similar to the one that we presented earlier in [12]. It also has much in common with the synchronous architecture that is used by Benini et al. [5]. The proposed architecture is depicted in Figure 1a and consists of: (1) a number of sub-FSMs, (2) an equally large number of CCBs (Clock Control Blocks), and (3) AND gates for gating the clock signal. Alternative implementation architectures are a fully synchronous partitioned FSM, shown in Figure 1b, and a monolithic implementation, shown in Figure 1c. The important difference between the two architectures for partitioned FSMs is that the communication between the sub-FSMs that is handled by the CCBs is asynchronous instead of synchronous.

The purpose of the sub-FSM interaction protocol is to control the activation and the deactivation of the sub-FSMs. When a state transition to a state in another sub-FSM takes place, the active sub-FSM generates an event on a communication control signal called a go-signal. This event has the following functional meaning: activate the sub-FSM that contains the destination state of the transition and deactivate the currently active sub-FSM. In general, a sub-FSM may emit many go-signals, one for each external state transition, and it can be activated by one of many incoming state transitions. The CCB that is associated with a sub-FSM enables or disables the clock signal based on go-signals both from its own and other sub-FSMs. In Figure 2, a timing diagram is shown for state transitions between states residing in the two sub-FSMs (FSM0 and FSM1), see Figure 1.

FIGURE 2 Timing diagram for transitions from FSM0 to FSM1 and then back to FSM0.

With asynchronous control for the CCB, we can remove the need for a clock signal to the CCB. For low-power design, it is important to have as small an effective capacitance as possible. The clock signal is the signal with the highest switching activity (twice as high as that of any data signal) and the capacitance added here will significantly contribute to increased power consumption.

The additional control circuitry introduced by the CCBs will naturally introduce additional power dissipation (power overhead). The number of CCBs is equal to the number of sub-FSMs, but only one of them enables the clock at a time. A CCB has three operational modes; they are:

Hand-over: When a transition from one sub-FSM to another takes place. In this mode, the asynchronous CCB is active and responds to the go-signal.

Enable: This is one of two passive modes. The CCB is passive and enables the local clock signal to the sub-FSM. In this mode, the CCB dissipates no power except from the AND gate enabling the clock.

Disable: The CCB is passive and disables the local clock signal. The power consumption comes from switching the input of the AND gate.

In Figure 3, the energy consumption for boththe synchronous and asynchronous CCB is given.

FIGURE 3 Energy consumption (in pJ) in synchronous and asynchronous CCB in different modes (hand-over, enable, disable).

From this figure we can observe two things. The first is that the energy consumption of the asynchronous CCB is lower in all modes. The second is that the difference in energy dissipation between the modes is larger for the asynchronous CCB than for the synchronous CCB. This property is typical for asynchronous circuits, where power is dissipated only when the circuit is active. In a partitioned FSM when there is no hand-over, only one CCB is in enable mode while the rest are in disable mode. In the clock cycle when a hand-over occurs, one CCB is in hand-over mode. The total power consumption in the CCBs can be expressed as:

$$P_{CCB} = \alpha \cdot P_{\text{hand-over}} + (1 - \alpha) \cdot P_{\text{enable}} + (N - 1) \cdot P_{\text{disable}}$$

where $P_{\text{hand-over}}$, $P_{\text{enable}}$, and $P_{\text{disable}}$ are the power consumption for the CCBs in the different modes, $\alpha$ is the probability of a hand-over and $N$ is the number of CCBs. With asynchronous CCBs, the power consumption can be reduced by a factor of five for CCBs in disable mode, which constitute the majority of the CCBs. This will have a significant impact on the total power overhead, especially when the number of CCBs ($N$) is large.

The second advantage of having asynchronous control is that the sub-FSM communication protocol can be made more power efficient. The existing synchronous solutions, e.g. [5], require that two sub-FSMs are clocked simultaneously at hand-over. The power consumption at hand-over will be largest here because two sub-FSMs will be active in this cycle. We have removed this requirement by implementing the CCB as an asynchronous controller. A synchronous controller updates its states only at clock edges. In contrast to this, an asynchronous controller can change state as a response to an input change after only some combinatorial delay. We used this property to design an asynchronous protocol that does not require simultaneous clocking of two sub-FSMs at hand-over. The total power consumption for the sub-FSMs in the synchronous ($P_{\text{sub-fsm,synch}}$) and the asynchronous ($P_{\text{sub-fsm,asynch}}$) case is expressed as:

$$P_{\text{sub-fsm,synch}} = \sum_{i=0}^{N-1} T_i \, P_{\text{sub-fsm},i} + \sum_{i=0}^{N-1} \alpha_i \, P_{\text{sub-fsm},i}$$

$$P_{\text{sub-fsm,asynch}} = \sum_{i=0}^{N-1} T_i \, P_{\text{sub-fsm},i}$$

where $T_i$ is the duty probability for the $i$th sub-FSM, $P_{\text{sub-fsm},i}$ is the internal power consumption of the $i$th sub-FSM, and $\alpha_i$ is the probability of activation of the $i$th sub-FSM.

The total power dissipation in a partitioned FSM is the sum of $P_{CCB}$ and $P_{\text{sub-fsm}}$. Using the proposed asynchronous approach reduces both of these components.
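The two power expressions of this chapter can be evaluated side by side. The Python sketch below is only an illustration: the function names, the number of sub-FSMs and every numerical value are assumptions (arbitrary units), not figures reported in the paper. The asynchronous CCB values are simply chosen so that the disable-mode power is five times lower than in the synchronous case.

```python
def ccb_power(alpha, n, p_handover, p_enable, p_disable):
    """Total CCB power: alpha*P_handover + (1 - alpha)*P_enable + (N - 1)*P_disable."""
    return alpha * p_handover + (1 - alpha) * p_enable + (n - 1) * p_disable


def subfsm_power(duty, p_sub, activation=None):
    """Sum of T_i * P_i; the synchronous case adds the sum of alpha_i * P_i."""
    total = sum(t * p for t, p in zip(duty, p_sub))
    if activation is not None:  # synchronous protocol: extra clocking at hand-over
        total += sum(a * p for a, p in zip(activation, p_sub))
    return total


# Hypothetical values in arbitrary units (not measured data).
N, ALPHA = 4, 0.1
p_ccb_sync = ccb_power(ALPHA, N, p_handover=1.2, p_enable=1.0, p_disable=1.0)
p_ccb_async = ccb_power(ALPHA, N, p_handover=1.0, p_enable=0.2, p_disable=0.2)

duty = [0.7, 0.1, 0.1, 0.1]                  # T_i, duty probability per sub-FSM
p_sub = [10.0, 8.0, 6.0, 4.0]                # internal power per sub-FSM
activation = [0.025, 0.025, 0.025, 0.025]    # alpha_i, activation probability
p_sub_sync = subfsm_power(duty, p_sub, activation)
p_sub_async = subfsm_power(duty, p_sub)

print("CCB power      sync/async:", p_ccb_sync, p_ccb_async)
print("sub-FSM power  sync/async:", p_sub_sync, p_sub_async)
```

Because the disable term is multiplied by N - 1, it dominates the CCB total as the number of sub-FSMs grows, which is why a lower disable-mode power matters most for finely partitioned FSMs.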

3. FSM DECOMPOSITION

In this chapter, the decomposition model we use is presented first. We then make the necessary definitions that will be used for describing the FSM transformation procedures. Implementation of these procedures is discussed in Chapter 4.

In this chapter, we use abstract automata theory as described by Baranov in [3]. There is, however, a small difference in terminology between the work in [3] and other works we refer to in the area of implementation of decomposed FSMs, e.g. [5, 9]. In [3], the initial FSM, which corresponds to a monolithic FSM implementation, is referred to as a source automaton and a sub-FSM is referred to as a component automaton. In this chapter, we use the same abstraction and terminology as in [3]. Elsewhere in this paper, the terminology in most of the referred papers concerning implementation is used.

3.1. Decomposition Model

The source Mealy automaton is defined as a sextuple:

$$A = (S, X, Y, \delta, \lambda, s_0)$$

where $S$ is the set of states, $X$ is the set of binary inputs, $Y$ is the set of binary outputs, $\delta$ is the transition function, $\lambda$ is the output function and $s_0$ is the initial state. The automaton can be represented in the form of a transition table, where every row defines one transition from a source state to a destination state along with a certain output term according to a certain input term.

Let $\pi$ be a partition on the set $S$. The automaton $A$ can be decomposed into a set of component automata where every block $S^m \in \pi$ defines a component automaton:

$$A^m = (S^m, X^m, Y^m, \delta^m, \lambda^m, s_0^m)$$

We call the states in $S^m$ internal states of the component automaton. $X^m$ is the set of input variables at all transitions from the states in $S^m$, and $Y^m$ is the set of output variables at all transitions from the states in $S^m$. $\delta^m$ and $\lambda^m$ are the transition and output functions on the sets $S^m$ and $X^m$. Such a decomposition can be achieved by reordering the groups of transition table rows having the same source state, followed by segmentation according to the blocks of $\pi$.
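The reordering and segmentation of the transition table can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the row layout `(source, input literals, destination, output)`, the `!` prefix for a negated input, and the example automaton are all hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict


def segment_table(rows, partition):
    """Group rows by source state, then split the table along the partition blocks."""
    block_of = {s: m for m, block in enumerate(partition) for s in block}
    components = defaultdict(list)
    for row in sorted(rows, key=lambda r: r[0]):  # stable sort keeps row order per state
        components[block_of[row[0]]].append(row)
    return dict(components)


def input_variables(component_rows):
    """X^m: input variables used on transitions leaving the component's states."""
    return {lit.lstrip("!") for _, cond, _, _ in component_rows for lit in cond}


# Hypothetical source automaton; an empty tuple is an unconditional transition.
rows = [
    ("s3", ("x2",), "s1", "y3"),
    ("s1", ("x1",), "s2", "y1"),
    ("s4", ("x3",), "s3", "y2"),
    ("s1", ("!x1",), "s3", "y2"),
    ("s2", (), "s3", "y1"),
    ("s3", ("!x2",), "s4", "y4"),
    ("s4", ("!x3",), "s4", "y5"),
]
partition = [{"s1", "s2"}, {"s3", "s4"}]

components = segment_table(rows, partition)
for m, comp_rows in components.items():
    print(m, sorted(input_variables(comp_rows)), comp_rows)
```

Note that the segmented rows still point at destination states outside their own block; handling those crossing transitions is the subject of the transformation in Section 3.3.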

3.2. Definitions

In this section, we will define different sets from the component automaton point of view. Let us define $V(s_k)$ to be the set of states from which there are transitions to the state $s_k$; $s_k$ is not included in $V(s_k)$. With $X_h$ we denote the existence of an input valid in the expressions where it is used.

$$V(s_k) = \{ s_j \mid \delta(s_j, X_h) = s_k,\ j \neq k \}$$

Similarly, we define $T(s_k)$, the set of states, $s_k$ not included, to which there are transitions from the state $s_k$.

$$T(s_k) = \{ s_j \mid \delta(s_k, X_h) = s_j,\ j \neq k \}$$


We define two similar sets that are of more importance for the whole component automaton.

$$V(S^m) = \{ s_j \mid \delta(s_j, X_h) = s_k,\ s_j \notin S^m,\ s_k \in S^m \}$$

$$T(S^m) = \{ s_j \mid \delta(s_k, X_h) = s_j,\ s_j \notin S^m,\ s_k \in S^m \}$$

Here, $V(S^m)$ is the set of states in $S$ outside $S^m$ that have transitions to states inside $S^m$, where $s_k$ resides. $T(S^m)$ is the set of states not included in $S^m$ to which there are transitions from the states included in $S^m$.

Let us define the set of states in $S^m$ to which there are transitions from other component automata as:

$$Q(S^m) = \{ s_j \mid \delta(s_k, X_h) = s_j,\ s_j \in S^m,\ s_k \notin S^m \}$$

The position of the sets defined above is depicted in Figure 4.

The transitions to the set $T(S^m)$ originate from another subset of $S^m$, which is denoted $W(S^m)$.

$$W(S^m) = \{ s_k \mid \delta(s_k, X_h) = s_j,\ s_k \in S^m,\ s_j \in T(S^m) \}$$

The position of the sets defined above is depicted in Figure 5.

We will use the shorter denotations $V^m$, $W^m$, $Q^m$, and $T^m$ in the rest of this chapter.

FIGURE 4 The transition sets of a state (left) and component automata (right).

FIGURE 5 Input and output subsets of $S^m$.
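The four sets attached to a block of the partition can be computed directly from the arcs of the state graph. The following Python sketch assumes a hypothetical five-state graph given as `(source, destination)` pairs with input terms omitted; the function name `transition_sets` is likewise an assumption made for illustration.

```python
def transition_sets(arcs, block):
    """Return (V, T, Q, W) for one block S^m of the partition.

    arcs  : iterable of (source, destination) state pairs
    block : set of states forming S^m
    """
    entering = [(s, d) for s, d in arcs if s not in block and d in block]
    exiting = [(s, d) for s, d in arcs if s in block and d not in block]
    V = {s for s, _ in entering}  # outside states with transitions into S^m
    Q = {d for _, d in entering}  # states of S^m entered from other components
    W = {s for s, _ in exiting}   # states of S^m with transitions leaving S^m
    T = {d for _, d in exiting}   # outside states reached from S^m
    return V, T, Q, W


# Hypothetical state graph with three crossing transitions.
arcs = [
    ("s1", "s2"), ("s2", "s3"), ("s3", "s1"), ("s3", "s4"),
    ("s4", "s5"), ("s5", "s4"), ("s5", "s1"), ("s4", "s2"),
]
V, T, Q, W = transition_sets(arcs, {"s1", "s2", "s3"})
print("V =", sorted(V), "T =", sorted(T), "Q =", sorted(Q), "W =", sorted(W))
```

In this example the sets V and Q describe the entering side of the block and W and T the exiting side, mirroring the input and output subsets of Figure 5.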

3.3. Transformation of the Network

In this section, we will present the transformation sequence that results in a modified network description suitable for the implementation architecture we are targeting. In the following presentation, the definitions given in the previous section are used.

The transformation is carried out by the following steps:

1. Replace the transitions from the set $W^m$ to $T^m$ with transitions from $W^m$ to additional states that we call transition states, $G^m$, i.e., replace $\delta(s_j, X_h) = s_k;\ s_j \in W^m,\ s_k \in T^m$ with $\delta'(s_j, X_h) = s_k;\ s_j \in W^m,\ s_k \in G^m$.
   There is a one-to-one mapping between the elements of $T^m$ and $G^m$.
   Let us denote the set of states replacing the transitions originating from $V^m$ with $G^{m-}$. These transitions cause the activation of component $m$.

2. Introduce new unconditional transitions from states in $G^m$ to a single state $d^m$:

   $$\delta(g_i, 1) = d^m;\quad g_i \in G^m$$

3. Introduce new transitions originating from the additional state $d^m$. The new transitions are based on all transitions from the set $Q^m$.
   There is a many-to-one mapping between the elements of $G^{m-}$ and $Q^m$. We define additional inputs, one for every state in $Q^m$:

   $$E^m = \Big\{ e_j \;\Big|\; \bigcup_{s_i \in G^{m-}} \delta(s_i, X_h) = s_j;\ s_j \in Q^m \Big\}$$

   The new transition functions can now be evaluated: for every $\delta(s_j, X_h) = s_k;\ s_j \in Q^m$, transition functions $\delta'(d^m, (e_j, X_h)) = s_k;\ s_j \in Q^m,\ e_j \in E^m$ are added.

4. Introduce additional output functions: for every $\lambda(s_i, X_h) = y_k;\ s_i \in Q^m$, output functions $\lambda'(d^m, (e_i, X_h)) = y_k;\ e_i \in E^m$ are added. As there are as many entering transitions as there are exiting transitions in the network, we can say that there is a one-to-one mapping between the source transitions, additional transitions and output functions.

5. The first transformation step (replacement of $T^m$ with $G^m$) may result in states in the set $Q^m$ which do not have any incoming transitions. Such states are redundant and can be removed, except in the case where the state is an initial state of the network. An example of source decomposition and a transformed network is given in Figure 6.

6. The resulting network of the previous steps has the same behavior as the source automaton when the initial state is properly defined. Let $s_0$ be the initial state of the source automaton. The initial state of the network has to be defined as:

   1. The component containing the state $s_0$ is assigned $s_0$ as its initial state.

   2. Each of the other components is assigned its own d-state as its initial state.
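Steps 1 to 4 above can be sketched as a table rewrite. The Python code below is an illustrative reading of those steps, not the authors' implementation: the row layout `(source, input term, destination, output)`, the naming scheme `g_<state>`, `d<m>` and `e_<state>`, and the example automaton are all assumptions. The idle self-loop of each d-state, the removal of redundant states (step 5) and the assignment of initial states (step 6) are left out for brevity.

```python
def transform(rows, partition):
    """Rewrite a source transition table into per-component tables (steps 1-4)."""
    block_of = {s: m for m, blk in enumerate(partition) for s in blk}
    comps = {m: [] for m in range(len(partition))}

    # Q^m: states entered by a crossing transition from another component.
    entry = {m: set() for m in comps}
    for src, _, dst, _ in rows:
        if block_of[src] != block_of[dst]:
            entry[block_of[dst]].add(dst)

    for src, cond, dst, out in rows:
        m = block_of[src]
        # Step 1: an exiting transition is redirected to a transition state g.
        target = dst if block_of[dst] == m else f"g_{dst}"
        comps[m].append((src, cond, target, out))
        # Steps 3 and 4: the d-state copies the transitions and outputs of
        # every entry state, guarded by that state's e-signal.
        if src in entry[m]:
            comps[m].append((f"d{m}", f"e_{src} & {cond}", target, out))

    for m, table in comps.items():
        # Step 2: unconditional transition from every g-state to the d-state,
        # emitting the e-signal that names the target state.
        g_states = sorted({t for _, _, t, _ in table if t.startswith("g_")})
        for g in g_states:
            table.append((g, "1", f"d{m}", "e_" + g[2:]))
    return comps


# Hypothetical source automaton with crossing transitions into s1, s2 and s4.
rows = [
    ("s1", "x1", "s2", "y1"), ("s1", "!x1", "s3", "y2"),
    ("s2", "1", "s3", "y1"),
    ("s3", "x2", "s1", "y3"), ("s3", "!x2", "s4", "y4"),
    ("s4", "x3", "s5", "y5"), ("s4", "!x3", "s2", "y6"),
    ("s5", "1", "s1", "y2"),
]
network = transform(rows, [{"s1", "s2", "s3"}, {"s4", "s5"}])
for m, table in network.items():
    print(f"component {m}:")
    for row in table:
        print("   ", row)
```

In this reading, the cycle in which the source automaton would sit in an entry state is spent by the leaving component in a g-state and by the activated component in its d-state, so the output produced in that cycle is the one copied onto the d-state transition.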

3.4. Functional Equivalence

The initial condition of the network, described in the previous section, guarantees that only one e-signal is active at a time. The equivalence of the source automaton and the transformed network can be proved. In the proof, the notation of transition tables is used. We will use a reordered and segmented source transition table.

FIGURE 6 Example of source decomposition (top) and a transformed network (bottom).

Proof

1. Transitions inside of the component. There are equivalent rows in the source and transformed transition tables.

2. Exiting transitions of the component. For every exiting transition in the source table, there is a matching transition with the same output in the transformed table but with unique target states in the set $G$. Additionally, there are unconditional transitions from the states in $G$ to a single d-state with a unique output signal in $E$.

   For every transition from the target state of an exiting transition, there is an additional matching transition in the transformed network from the d-state of the target component. This matching transition has the same target state and output vector, and its input term is taken in conjunction with the appropriate signal in $E$.

According to the initial condition, only one component can be in the g-state at a time. Consequently, there can only be one e-signal active at a time. This will uniquely define the transitions to be taken, see Figure 6.

A condition that we call static $G^m$-state occurs when an automaton enters $G$ in the cycle followed by the entrance to $G^m$. This condition requires special considerations for the implementation. It will be described later in Section 4.4.

3.5. Example

Let us decompose the microprogram automaton $A$, given in Figure 7, into a two-component network using state partition. According to the tables, there are three crossing transitions of the partition $\pi = \{\{s_1, s_2, s_3\}, \{s_4, s_5\}\}$, which define the number of g-states. From the tables it can be seen that there are as many e-signals in the network as there are crossing transitions. It can also be seen that inter-component communication is formed by the signals $\{e_1, e_2, e_4\}$, where the index is bound to the target state in the source automaton.

FIGURE 7 Decomposition example: transition tables (initial state, final state, input vectors, output vectors) of automaton $A$ and of components 1 and 2 for the partition $\{\{s_1, s_2, s_3\}, \{s_4, s_5\}\}$.

4. IMPLEMENTATION

In this chapter, we will describe how decomposed FSMs are implemented. First we describe where in the design flow the decomposition takes place. Next, an overall picture of the tool that automatically carries out the decomposition is given. After that, the implementation of the FSM transformation steps presented in Chapter 3 is described. Hardware estimation is used to rank the different partition candidates that are generated by the tool. In Section 4.3, the estimation functions and their parameters are given. In order to have a complete synthesis design flow, additional cells must be added to a standard cell library. The implementation and the design details of the cells are given in Section 4.4.

4.1. Overview

4.1.1. Introduction

The position of the FSM power optimization procedures that have been implemented in the LIFS tool is depicted in Figure 8. FSM power optimization is one step of several synthesis steps in FSM synthesis, which are a part of RTL synthesis. Therefore, it is important that the computational complexity of the power optimization step is kept low in order to keep the total time spent in synthesis low.

FIGURE 8 The LIFS tool positioned in a synthesis-based digital design flow.


An overview of the information flow in LIFS is shown in Figure 9. The FSM description is given as an STG. Currently, we use the KISS2 format from Berkeley [15], but in principle, RT-level VHDL or graphical input could be used, since they all contain the same information. Power consumption in digital CMOS is dominated by the dynamic power consumption, which is highly data-dependent. In order to estimate power consumption, it is necessary to have a set of input data that is, under typical operating conditions, applied to the inputs of the FSM. In LIFS, it is possible either to give these input vectors in a testbench or, when no typical data is available, to specify for each input the probability of it being high (logic one). The power optimization is made for a user-given area constraint. The area constraint is given as the maximum acceptable increase in area relative to a monolithic FSM. This allows the designer to trade circuit area for reduced power consumption. The tool is designed to work in a standard-cell based design flow. In order to make early power and area estimates, data about power and area for three types of cells are needed: a clocked storage element (D flip-flop), a CCB and a gate (2-input NAND gate). The output of the tool is an RT-level VHDL description of the partitioned FSM. This description is normally passed on to a standard logic synthesis tool that produces the gate netlist. Along with the VHDL code, design-specific scripts for logic synthesis are also generated.

The tool is divided into two main parts. The first part collects statistics about the FSM in order to find the state transition probabilities. This part may be omitted from the synthesis run if the transition probabilities are already known from the environment of the FSM or from a previous synthesis run where the STG with probabilities has been generated. The second part is where the actual partitioning takes place.

4.1.2. Statistics Collection

The purpose of the statistics collection methods is to determine the probabilities of state transitions in the STG. The statistics form the basis for the partitioner when clustering states, i.e., grouping the states whose mutual connections have the highest probability. One of the two implemented methods for collecting statistics is used by the tool.

The first method we call profiling. Profiling uses user-supplied input vectors for simulating the FSM. The tool first generates VHDL code that corresponds to the initial STG specification. It also inserts profiling information during the generation of the VHDL code. The state transitions are traced during simulation and the collected statistics are written back to file. The simulation is made in a standard commercial VHDL simulator.

FIGURE 9 Overview of the LIFS tool.

The second method is based on random walk. Here, the state transition probabilities are based on the input probability vectors given by the user. The user specifies, for each of the inputs, the probability of the input being high. Random input vectors are generated from this given input probability vector using a uniform distribution function. The random input vectors are then used to simulate the STG. The simulation is carried out in an STG simulator that is embedded in LIFS. For the design examples that we present later in this paper, the length of the random walk simulation has been set to $n^3$, where $n$ is the total number of arcs in the STG. The simulation for determining the transition probabilities easily becomes very time-consuming for complex FSMs. It is desirable to run this part only once, even if the FSM must be re-synthesized. For that purpose, we have extended the KISS2 format by adding the transition probability for every transition in the FSM specification.
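The random-walk method can be sketched as follows in Python. This is a minimal sketch and not the simulator embedded in LIFS: the STG representation (a dictionary of guarded arcs), the function name `random_walk`, the fixed seed and the three-state example are assumptions made for illustration. Each input is drawn independently by comparing a uniformly distributed random number with the user-given probability of the input being high.

```python
import random
from collections import Counter


def random_walk(stg, start, input_probs, steps, seed=0):
    """Estimate arc probabilities by simulating the STG with random inputs.

    stg         : {state: [(guard, next_state), ...]}, where guard maps an
                  input name to the value it must have (empty = unconditional)
    input_probs : {input name: probability of the input being high}
    """
    rng = random.Random(seed)
    counts = Counter()
    state = start
    for _ in range(steps):
        # Uniform random number below p  <=>  input is high with probability p.
        vector = {name: rng.random() < p for name, p in input_probs.items()}
        for guard, nxt in stg[state]:
            if all(vector[name] == value for name, value in guard.items()):
                counts[(state, nxt)] += 1
                state = nxt
                break
    return {arc: c / steps for arc, c in counts.items()}


# Hypothetical, completely specified STG with one input x.
stg = {
    "a": [({"x": True}, "b"), ({"x": False}, "a")],
    "b": [({}, "c")],
    "c": [({"x": True}, "a"), ({"x": False}, "c")],
}
n_arcs = sum(len(arcs) for arcs in stg.values())
probs = random_walk(stg, "a", {"x": 0.8}, steps=n_arcs ** 3)
for arc, p in sorted(probs.items()):
    print(arc, round(p, 3))
```

The walk length follows the $n^3$ rule quoted above, with $n$ the number of arcs. Storing the resulting probabilities alongside the transitions, as the extended KISS2 format does, avoids repeating the walk on re-synthesis.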

4.1.3. Partitioning

As previously mentioned, the power reduction strategy is to partition the FSM into a number of sub-FSMs. In a partitioned FSM, a state transition can take place inside the sub-FSM or between two different sub-FSMs, which we call a crossing transition. In Chapter 2, we showed that transitions within a sub-FSM dissipate less power than state transitions from one sub-FSM to another. It is also advantageous to have as few sub-FSMs as possible active to reduce the effective capacitance. But at the same time, dividing the FSM into smaller partitions tends to increase the probability of hand-overs occurring. The partitioning algorithm we use is divided into two phases. In the first phase, a cluster representation of all states in the FSM is built. The states are clustered according to a closeness measure that is based on the size of the mutual state transition probabilities between states. In the second phase, clusters from the first phase are grouped and the FSMs are synthesized. Implementation costs, which are estimates of power and area, are the basis for selecting the final partitioned FSM.

Phase 1: Clustering. The input to the clustering algorithm is an STG with arcs, representing state transitions, labelled with state transition probabilities. We use a hierarchical clustering scheme to build a hierarchical cluster representation of the states in the FSM. Hierarchical clustering is a general technique for clustering similar objects together and it has found application in many different fields [10]. The algorithm builds a binary tree as illustrated in Figure 10.
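The clustering phase can be sketched as a plain agglomerative procedure. The sketch below is only an illustration under stated assumptions: the closeness of two clusters is taken to be the summed probability of all arcs running between them in either direction, which is one possible reading of the "mutual state transition probabilities", and the function name and data layout are hypothetical.

```python
def build_cluster_tree(states, prob):
    """Agglomerative clustering of FSM states into a binary tree.

    prob maps (src, dst) -> transition probability. Returns a nested
    tuple whose leaves are state names.
    """
    # Each entry holds (subtree, set of member states).
    clusters = [(s, frozenset([s])) for s in states]

    def closeness(a, b):
        # Assumed measure: probability mass exchanged between a and b.
        return sum(p for (u, v), p in prob.items()
                   if (u in a and v in b) or (u in b and v in a))

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        # Merge the pair of clusters that is most strongly connected.
        i, j = max(pairs, key=lambda ij: closeness(clusters[ij[0]][1],
                                                   clusters[ij[1]][1]))
        (ta, ma), (tb, mb) = clusters[i], clusters[j]
        merged = ((ta, tb), ma | mb)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Hypothetical four-state example.
example_prob = {("s0", "s1"): 0.4, ("s1", "s0"): 0.3,
                ("s2", "s3"): 0.2, ("s3", "s2"): 0.05,
                ("s1", "s2"): 0.05}
tree = build_cluster_tree(["s0", "s1", "s2", "s3"], example_prob)
```

Each internal node of the returned tuple corresponds to one merge, so cutting the tree at a given depth yields the clusters for that cut-level.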

Phase 2: Selection of best partition. From the binary tree built by clustering, it is possible to group the clusters into a large number of combinations that are all candidates for a partitioned FSM. Cutting the cluster tree at a certain level generates the clusters. For example, in Figure 10 the cutting level can be 1, 2, or 3. Cutting at level one, for example, gives a large number of small clusters ({s0}, {s1}, {s2}, {s3}, {s4}) and cutting at level three gives two clusters ({s0, s1, s2, s3}, {s4}). For each cut-level it is also possible to concatenate clusters so that new combinations of clusters are generated.

[Figure 10: STG with arc probabilities labelled P0 to P5 and the binary cluster tree built from the ordering of the summed probabilities.]

FIGURE 10 Example: hierarchical clustering.


    partition select_partition(tree CT, real amax) {
      real Pmin = infinity;
      partition BEST;
      for h = 1 to H
        clusters C = cutlevel(CT, h);
        sort(C);
        partition TMP = {c1}, {c2, ..., cN};      cx in C
        for j = 2 to N-1
          partitioned_fsm PFSM = synthesize(TMP);
          if Pmin > power(PFSM) and area(PFSM) < amax then
            Pmin = power(PFSM);
            BEST = TMP;
          TMP <- {c1}, ..., {cj}, {cj+1, ..., cN};
      return BEST;
    }

    CT is the cluster tree for the FSM.
    Pmin stores the minimum power.
    amax is a user constraint (max. area).
    BEST stores the best partition.
    TMP is the partition candidate.
    PFSM is the partitioned FSM.
    C is the set of clusters at a given cut-level.
    N is the number of clusters in C.
    H is the height of the cluster tree.
    sort sorts the clusters by activity; the cluster with the highest internal activity receives the lowest index.
    power and area are the HW estimation functions.
    synthesize is the FSM synthesis function.
    cutlevel returns the clusters in the cluster tree for a given cut-level.

FIGURE 11 Procedure for selection of best partition.

Empirically, we found a procedure that for every cut-level generates a reduced number of clusters. The procedure, shown in Figure 11, takes a cluster tree and returns the partitioned FSM with the lowest power consumption for a given area constraint. In order to estimate power and area for the partitioned FSM, more details must be known about the implementation. The partitioned FSM, i.e., all sub-FSMs and CCBs, is therefore synthesized. After that, the estimation functions for circuit area and power consumption are applied. The functions for FSM synthesis and hardware estimation are described in more detail below.
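The selection procedure can be sketched in Python as follows. This is an illustrative reading of Figure 11, not a line-by-line transcription: `synthesize`, `power` and `area` are caller-supplied stand-ins for the synthesis and estimation functions, the cut-levels are passed in as pre-sorted cluster lists, and the toy cost functions in the usage part are invented purely to exercise the code.

```python
def select_partition(cut_levels, synthesize, power, area, a_max):
    """Return the lowest-power candidate that meets the area constraint.

    cut_levels: list of cluster lists, one per cut-level, each already
                sorted so that the most active cluster comes first.
    """
    p_min = float("inf")
    best = None
    for clusters in cut_levels:
        n = len(clusters)
        # The j most active clusters stay separate; the rest are merged.
        for j in range(1, n):
            rest = [s for c in clusters[j:] for s in c]
            candidate = [list(c) for c in clusters[:j]] + [rest]
            pfsm = synthesize(candidate)
            p = power(pfsm)
            if p < p_min and area(pfsm) < a_max:
                p_min, best = p, candidate
    return best

# Toy usage with made-up cost models (assumptions, not the paper's):
# power grows with the largest block, area with the number of blocks.
levels = [[["s0"], ["s1"], ["s2"], ["s3"]]]
best = select_partition(
    levels,
    synthesize=lambda partition: partition,
    power=lambda pfsm: max(len(block) for block in pfsm),
    area=lambda pfsm: 10 * len(pfsm),
    a_max=35,
)
```

With these toy costs the fully split candidate has the lowest power but violates the area limit, so the procedure falls back to the three-block candidate.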

4.2. Sub-FSM Transformation

The FSM synthesis takes the clusters of states, given by the partitioning, and generates one sub-FSM for each of these clusters. The synthesis is made according to the transformation steps presented in Chapter 3. For the implementation, the transformation is divided into five steps.

4.2.1. Sub-FSM Communication

The crossing transitions in the STG (see Fig. 13b) are implemented by the CCBs, clock-gating, and structural composition of the sub-FSMs and CCBs in the partitioned FSM. Let us consider sub-FSM $A_m$ and its associated CCB, $CCB_m$. The function of the CCB is to control the gated clock. Sub-FSM $A_m$ is activated by incoming crossing transitions. These transitions are detected by decoding the incoming transition states, denoted $G_m^{in}$. Deactivation of $A_m$ occurs at the same time as another sub-FSM is activated (only one sub-FSM is active at a time). The outgoing crossing transitions from $A_m$ are detected by decoding the transition states $G_m$. Signals decoded from $G_m^{in}$ we call g-signals and signals decoded from $G_m$ we call d-signals, see Figure 12. The detection of an activating crossing transition can only be made based on the transitions of the g-signals. The CCB behavior and implementation are shown in Figure 14.

The CCB is an AFSM that holds the one-bit state variable e, reflecting the state of one crossing transition. The collection of all these state variables gives a global state vector E. This state vector is one-hot encoded, so only one bit is set high at a time. A high value indicates the last active crossing transition. E is decoded and used as input signals to the sub-FSMs.

[Figure 12: $G_m^{in}$ and $G_m$ each pass through a decoder into $CCB_m$, whose output e forms part of $E_m$.]

FIGURE 12 Signal interface of the CCB.

[Figure 13: four panels: (a) partitioning of FSM A, (b) transition state insertion, (c) local state transition insertion, (d) transformed sub-FSMs A1 and A2.]

FIGURE 13 FSM transformation example.

4.2.2. Transition-state Insertion

The implementation architecture we use with asynchronous CCBs, see Figure 1, requires hazard-free g-signals. The g-signals must therefore be decoded from the state variable only. For example, the crossing transitions in Figure 13 are conditioned by the inputs of X. At these locations, where the crossing transition is conditioned by an input signal, the Mealy state transition is transformed to a Moore state transition. In the example given in Figure 13a, the initial machine consists of the set of states S = {s0, s1, s2, s3, s4} and the partitioned machine consists of {S1, S2}, where S1 = {s0, s1} and S2 = {s2, s3, s4}. For every crossing transition we insert the transition states G(S1) = {g2, g3} and G(S2) = {g0}. At this stage, see Figure 13b, the two sub-FSMs are still coupled by the crossing transitions, indicated by the g-signals in the STG. In the actual implementation, these transitions are handled by the hand-over mechanism that involves the CCBs and clock-gating.

A g-state is not added if the source state has no outgoing transition other than the crossing transition (unconditional transition).
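The insertion rule can be sketched structurally in Python. This is a hedged illustration rather than the tool's algorithm: the tuple-based arc format, the naming of a g-state after its destination state, the use of "a single outgoing arc" as the test for an unconditional crossing, and the example arcs themselves are all assumptions chosen to reproduce the sets G(S1) and G(S2) quoted above.

```python
def insert_transition_states(transitions, part_of):
    """Insert g-states on input-conditioned crossing transitions.

    transitions: list of (src, input_cond, dst, output) tuples.
    part_of:     dict mapping each state to its sub-FSM index.
    Returns (new_transitions, g_states); g_states maps a sub-FSM index
    to the set of transition states inserted into that sub-FSM.
    """
    out_degree = {}
    for src, _cond, _dst, _out in transitions:
        out_degree[src] = out_degree.get(src, 0) + 1

    new_transitions, g_states, handovers = [], {}, set()
    for src, cond, dst, out in transitions:
        crossing = part_of[src] != part_of[dst]
        if not crossing or out_degree[src] == 1:
            # Internal arcs and unconditional crossings stay as they are.
            new_transitions.append((src, cond, dst, out))
            continue
        g = "g" + dst[1:]                      # e.g. destination s2 -> g2
        g_states.setdefault(part_of[src], set()).add(g)
        new_transitions.append((src, cond, g, out))   # now an internal arc
        if (part_of[src], g) not in handovers:
            handovers.add((part_of[src], g))
            # Unconditional crossing, decodable from the state alone.
            new_transitions.append((g, "-", dst, None))
    return new_transitions, g_states

# Hypothetical arcs for a five-state machine split as S1 = {s0, s1},
# S2 = {s2, s3, s4}.
arcs = [("s0", "x1", "s1", "z1"), ("s0", "x2", "s2", "z2"),
        ("s1", "x3", "s3", "z3"), ("s1", "x4", "s0", "z4"),
        ("s2", "-", "s3", "z5"), ("s3", "-", "s4", "z6"),
        ("s4", "x5", "s0", "z7"), ("s4", "x6", "s2", "z8")]
parts = {"s0": 1, "s1": 1, "s2": 2, "s3": 2, "s4": 2}
new_arcs, g_sets = insert_transition_states(arcs, parts)
```

After the rewrite every remaining crossing arc leaves a g-state unconditionally, which is the property that lets the g-signals be decoded from the state register only.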


FIGURE 14 1-bit CCB, transition map and gate-level implementation.

4.2.3. Local Transition Insertion

Here, the coupled STGs will be separated. The purpose of having a separate STG for each sub-FSM is that standard synthesis procedures can be used on the STG to get to gate-level implementations. At the occurrence of a crossing transition, there is a state transition to a transition state in the active machine. As a consequence, the sub-FSM containing the destination state is activated. We know that this machine is in one of its transition states. Therefore, all transition states in combination with a global state E act as one of many possible entry states. In the example in Figure 13, the global state vector is E = {E1, E2}, where E1 = {e0} and E2 = {e2, e3}.

4.2.4. Removal of Unreachable States

From Figure 13c, it can be seen that some states do not have incoming transitions. These redundant states are R1 = {s0} and R2 = {s3}, and their function is now, after the two previous steps, located in the transition states.

4.2.5. Setting of Initial States

Each sub-FSM must have an initial state. The initial state, given by the specification of the original machine, will be the reset state of the sub-FSM in which it is located. For all the other sub-FSMs, an arbitrary transition state can be selected as the initial state, see Figure 13d.

4.3. Hardware Estimation

The objective of hardware estimation is to enable ranking of the different partition candidates so that the best partition can be selected. The ranking is based on the implementation costs in terms of power consumption and circuit area. A small number of estimation functions are used, see Table I. The parameters in these functions are based on the technology that is used, data from statistics collection, or they are empirically determined. The parameters and their values are listed in Table II.

The empirically determined parameters are related to details that are not known on the current level of abstraction. For example, the size of the output logic is not known before a gate-level implementation, and the probability for a transition in the state register is not known before state assignment.

4.4. Library Elements

The goal has been to use a standard cell-based design methodology. The output from the tool is a structural description of the partitioned FSM consisting of sub-FSMs and CCBs. The sub-FSM description is an RT-level description that can be fed to any commercial RTL synthesis tool. The CCBs are asynchronous FSMs and standard tools do not in general support synthesis of these circuits. In our approach, we design a 1-bit CCB as a library element on the gate level. We base this CCB on gates from the standard cell library, but for improved performance the CCB can be designed on the transistor level and be included as a cell in the cell library. Various multiple-input CCBs are built based on the 1-bit version. The 1-bit CCB is a controller with two bits in the state variable and can be synthesized under the fundamental mode assumption [16] and with the single input change (SIC) assumption. The transition map and the gate-level solution are given in Figure 14.

TABLE I Energy estimate functions

Function                  Expression

Total sub-FSM energy      $E_{A_m} = E_{DFF} \times (1 + k_{P,\delta}) \times \sum_{m=1}^{n} T_m \times \lceil \log_2 |S_m^*| \rceil$

Total CCB energy          $E_{CCB} = \alpha \times E_{CCB,hand\text{-}over} + (1 - \alpha) \times \left( E_{CCB,enable} + (n - 1) \times E_{CCB,disable} \right)$

Output logic energy       $E_{\lambda} = \left( P_{sc} \sum_{m=1}^{n} T_m \lceil \log_2 |S_m^*| \rceil + \alpha + \sum_{i=1}^{|X|} p_{x_i} \right) \times |Y| \times E_{ND2} \times k_{\lambda}$

Clock net energy          $E_{cknet} = E_{ck,DFF} \times k_{ck} \times \sum_{m=1}^{n} T_m \lceil \log_2 |S_m^*| \rceil$

Partitioned FSM energy    $E_{A^*} = E_{A_m} + E_{CCB} + E_{\lambda} + E_{cknet}$

Total sub-FSM area        $A_{A_m} = A_{DFF} \times (1 + k_{A,\delta}) \times \sum_{m=1}^{n} \lceil \log_2 |S_m^*| \rceil$

Total CCB area            $A_{CCB} = A_{CCB,1bit} \times |G|$

Output logic area         $A_{\lambda} = \left( \sum_{m=1}^{n} \lceil \log_2 |S_m^*| \rceil + |G| + |X| \right) \times |Y| \times A_{ND2} \times k_{\lambda}$

Partitioned FSM area      $A_{A^*} = A_{A_m} + A_{CCB} + A_{\lambda}$

In general, a sub-FSM can be activated by one of many sub-FSMs and deactivated by one of many crossing transitions leaving the sub-FSM. For this, multiple-input CCBs are needed. These are generated from the 1-bit CCB. In this way, we can avoid synthesis of complex asynchronous controllers. The extension to multiple-input CCBs is shown in Figure 15. For the 1-bit CCB, the value of e can be used directly for gating the clock. For the multiple-input CCB, an additional output e_ck must be generated and used for gating the clock.


TABLE II Parameters in estimation functions. Units are pJ for energy dissipation and gate equivalents for area.

Technology parameters
  Parameter          Value             Comment
  $E_{DFF}$          5.20              energy in D flip-flop cell
  $E_{ND2}$          0.96              energy in 2-input NAND cell
  $E_{ck,DFF}$       0.15              energy in clock net per DFF
  $E_{CCB,1bit}$     1.62/0.46/0.19    CCB energy in different modes
  $A_{DFF}$          4.1               area of D flip-flop cell
  $A_{CCB,1bit}$     2.8               area of a 1-bit CCB
  $A_{ND2}$                            area of 2-input NAND cell

Empirical parameters
  Parameter          Value             Comment
  $k_{P,\delta}$     0.2               energy in transition function in units of $E_{DFF}$
  $k_{\lambda}$      0.5               degree of shared logic in output logic
  $P_{sc}$           0.5               probability for a state change in state register
  $k_{ck}$           1.3               energy overhead in clock buffers
  $k_{A,\delta}$     0.2               area of transition function in units of $A_{DFF}$

Statistical parameters
  Parameter          Comment
  $T_m$              duty probability of sub-FSM m
  $\alpha$           probability for a hand-over
  $p_x$              probability for an input transition

[Figure 15: CCB symbols with inputs g and d and outputs e and e_ck.]

FIGURE 15 Multiple-input CCB, (a) structural composition, symbols for (b) 1-input CCB, and (c) m,n-input CCB.

As described in Chapter 3, there are situations where we may have static $G^{in}$-states from a sub-FSM. This condition will prevent the transition on the g-signal, which is needed for triggering the CCB. In this work, we use a transistor-level solution for handling this situation. A special D flip-flop, called GDFF, has been designed and included in the cell library to be used in the sub-FSM state register. The g-input of the CCB is positive edge-triggered and, to avoid the situation of static $G^{in}$-states, it must be guaranteed that all g-signals return to zero before assertion. With the GDFF, we can guarantee that the g-signals, which are decoded from the state register only, are zero during the first half of the clock period. The GDFF has an additional output (GQ) that is the state of the flip-flop gated with the clock signal. The normal Q-output is used for the transition function and the GQ-output is used only for decoding the g-signals. Due to uncertainties in loading conditions for the different nets between the cells after layout, a gate-level implementation of the GDFF function may produce hazards that cannot be accepted for signals to an edge-triggered input (the g-input of the CCB). In Figure 16 we propose a transistor-level solution for the GDFF. In order to attain a glitch-free output, it is important that the flip-flop structure has a small clock-to-output delay. Suitable flip-flops are based on RAM cell structures or CVSL style [17]. These flip-flops can be implemented directly and no special delay matching is required.

[Figure 16: transistor schematic. Characteristic equations: $Q^+ = D$, $GQ^+ = D \wedge CK$.]

FIGURE 16 Transistor-level implementation of GDFF in static CVSL style.

5. EXPERIMENTAL RESULTS

Our tool LIFS, which consists of a partitioning algorithm and a set of transformation rules, has been implemented as a software prototype tool in the Java language. LIFS, together with any standard RTL synthesis tool, forms a complete synthesis path from the STG description of an FSM to its gate-level implementation, where the implementation architecture is our proposed mixed synchronous/asynchronous architecture. In order to demonstrate the effectiveness of the proposed architecture, eight of the MCNC standard benchmarks [18] were tested. The number of states in the benchmarks ranges from 10 to 121. When estimating energy consumption, the input data pattern to the circuit that is to be characterized is important. For FSMs, the sequence of the input vectors will determine the state transition probabilities and consequently determine how the FSM is partitioned. Unfortunately, typical input data is not specified by the MCNC benchmarks, which makes it difficult to compare the results with other reported works. In this work, we have set the input probability vector, used by the STG simulator in LIFS, to 0.5 for all inputs in all FSMs.

This chapter reports the experimental results from LIFS. First we illustrate the partitioning considerations by an example. We then describe the results of the structural decomposition. After that, the energy consumption and circuit area are reported separately for the sub-FSMs, output logic, and CCBs. Also, the total energy and area for the partitioned FSM are compared to its corresponding monolithic FSM implementation. Finally, we compare the timing by looking at the critical paths in the different implementations.

The energy figures were obtained from gate-level power estimations by Power Compiler (Synopsys) and the area estimates are based on the cell area. The timing information was obtained from static timing analysis in Design Compiler (Synopsys). The target technology is a 0.5 µm CMOS standard cell technology. A wire-load model, supplied by the silicon vendor for this specific library [2], has been used.

Table III shows the main characteristics of the benchmark FSMs. It lists, from the top row down, the number of inputs, the number of outputs and the number of states.

TABLE III Characteristics of benchmark FSMs

         bbara   dk512   ex1   keyb   styr   donfile   tma   scf
|X|      4       1       9     7      9      2         7     27
|Y|      2       3       19    2      10     1         6     56
|S|      10      14      20    19     30     24        20    121

The first phase of the partitioning (clustering) concentrates solely on state transition probabilities when grouping the strongly connected states. From the small FSM example given in Figure 17a, we can see that states s0 and s1 have self-loops with high probabilities and they are also, in relation to other states, strongly connected. The actual solution given by LIFS for this FSM was s0 and s1 located in one sub-FSM and the rest of the states located in a second sub-FSM. But only looking at state transition probabilities says very little about the implementation costs. The number of g- and entry-states of a sub-FSM plays an important role in implementation costs. In our example, there are two entry-states, {s0, s1}, and two g-states, {g2, g4}, in A1. Sub-FSM A2 also has two entry-states and two g-states. An increase in the number of g-states, |G|, will increase the size of the transition function and may also require larger state memory in the sub-FSM. For each entry-state, an internal enable signal, defined by E, is required. The number of enable signals, |E|, will influence the fan-in of the logic for both the transition function and the output function, see Figure 17b. In summary, a good partition has sub-FSMs with high probabilities of state transitions within the sub-FSM and a small number of entry- and g-states.

[Figure 17: (a) STG of bbara annotated with transition probabilities, (b) block structure of the implementation. Sets given in the figure: S1 = {s0, s1}, G1 = {g2, g4}, E1 = {e0, e1}, S2 = {s2, s3, s4, s5, s6, s7, s8, s9}, G2 = {g0, g1}, E2 = {e2, e4}, S1* = {S1, G1}, S2* = {S2, G2}.]

FIGURE 17 Example of partitioned FSM (bbara): (a) STG with transition probabilities, and (b) structure of the implementation.

Table IV presents structural information that influences the implementation costs of the partitioned FSMs. Here, T denotes the duty probability of the sub-FSM and |S| denotes the number of original states located in the sub-FSM.

In Table V, the energy consumption and the circuit area for the power-optimized FSMs are presented. The column labelled sub-FSM gives the sum of energy and area for all sub-FSMs. The column labelled output gives the energy and area for the output function, and the column labelled CCB gives the sum of energy and area for all CCBs in the partitioned FSM. The column labelled total A* gives the sum of energy and area, respectively, for the three previously mentioned columns. The next three columns, labelled FSM, output, and total A, contain energy and area for a monolithic implementation. The last column, labelled change, shows the decrease or increase in energy and area for the partitioned FSM in comparison to the corresponding monolithic implementation.
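As a small worked check of how the change column can be read, the following Python sketch computes the relative difference between the partitioned and the monolithic totals for the bbara row of Table V. The formula is our assumption about how the column was derived; it reproduces the printed figures for this row.

```python
def relative_change(partitioned, monolithic):
    """Relative change in percent, rounded to the nearest integer."""
    return round(100.0 * (partitioned - monolithic) / monolithic)

# bbara: total energy in pJ and total area in gate equivalents.
bbara_energy_change = relative_change(5.99, 8.86)
bbara_area_change = relative_change(190, 87)
```

A negative value is a saving and a positive value is an overhead, so bbara trades roughly a third of its energy for slightly more than twice the area.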


TABLE IV Structural information from the decomposition

Each cell lists |E_m|, |G_m|, |S_m| and T_m for sub-FSM A_m, where available.

Benchmark   A1              A2              A3              A4
bbara       2 2 2 0.87      2 2 8 0.13
dk512       3 2 2 0.22      2 2 2 0.19      2 11 0.59
ex1         4 2 2 0.38      2 2 0.19        2 0.19          2 16 0.25
keyb        3 0.72          3 18 0.27
styr        3 0.64          3 29 0.36
donfile     3 0.35          3 0.35          6 2 22 0.31
tma         2 2 5 0.96      2 2 15 0.04
scf         3 0.88          118 0.12

TABLE V Energy consumption E [pJ] and circuit area A [#gate eq] for partitioned FSMs (sub-FSM, output, CCB, total A*) and monolithic FSMs (FSM, output, total A)

Benchmark  sub-FSM       output        CCB           total A*      FSM           output        total A       change
           E      A      E      A      E      A      E      A      E      A      E      A      E      A      E      A
bbara      3.24   146    0.61   11     2.14   33     5.99   190    7.70   78     1.17   9      8.86   87     -32%   +118%
dk512      3.98   155    1.27   22     1.31   50     6.57   227    13.1   79     1.29   11     14.4   90     -54%   +151%
ex1        4.42   277    15.8   398    1.94   68     22.2   734    16.5   205    12.6   152    29.1   358    -25%   +107%
keyb       7.33   360    11.2   149    0.72   32     19.2   541    18.5   220    10.6   96     29.1   316    -34%   +71%
styr       5.54   356    7.47   178    0.72   32     13.7   566    16.8   245    11.9   169    28.8   489    -52%   +16%
donfile    7.00   318    0      0      2.00   66     9.00   384    16.0   148    0      0      16.0   148    -44%   +159%
tma        4.86   241    3.42   89     0.86   33     9.14   363    9.74   166    6.82   78     16.6   244    -45%   +49%
scf        4.54   492    4.77   259    1.03   15     10.3   766    19.6   437    11.7   201    32.1   638    -68%   +20%

For all the benchmarks we have tested, significant power reductions have been obtained. However, there is a large difference in achieved improvement for the different machines. For example, the bbara FSM seems to have good potential for large power reduction. However, since it is small, the power overhead in sub-FSM communication becomes relatively large. For small FSMs, the area increases when partitioned. When using minimum length encoding of the states, the partitioned FSM will always require more flip-flops:

$\lceil \log_2 |S| \rceil < \sum_{m=1}^{n} \lceil \log_2 |S_m| \rceil$
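The flip-flop overhead can be made concrete with a short Python calculation for bbara. It is a sketch under an assumption: the transition states are counted together with the original states in each sub-FSM, so the two sub-FSMs are taken to hold 2 + 2 and 8 + 2 states respectively. The counts in the synthesized circuit may differ once unreachable states are removed.

```python
def flip_flops(num_states):
    """State-register width under minimum-length encoding.

    (n - 1).bit_length() equals ceil(log2(n)) for every integer n >= 1
    and avoids floating-point rounding.
    """
    return (num_states - 1).bit_length()

# bbara: 10 states in the monolithic machine.
monolithic = flip_flops(10)

# Assumed sub-FSM sizes including two g-states each.
sub_fsm_sizes = [2 + 2, 8 + 2]
partitioned = sum(flip_flops(size) for size in sub_fsm_sizes)
```

Under these assumptions the monolithic machine needs four flip-flops while the two sub-FSMs together need six.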

In the case of ex1, we have a large sub-FSM (A4) that is active most of the time. Further decomposition would have increased the area dramatically but with only small power savings as a result. Cases where large power reductions have been obtained are tma and scf. Here, small clusters with high duty probabilities have been identified. For tma, one sub-FSM with five states is active 96% of the time while the other sub-FSM, containing 15 states, is only active 4% of the time. Both tma and scf are large enough so that partitioning them does not introduce excessively large area overhead.

Finally, we present the timing results in Table VI. The column labelled sub-FSM gives the critical path in the sub-FSM and the following column labelled output gives the critical path in the output function for the partitioned FSM. The next two columns labelled FSM and output contain the critical paths for the monolithic implementation. The last two columns labelled change show the decrease or increase in the critical path for the partitioned FSM in comparison to its corresponding monolithic implementation.

In general, one could expect a slight increase of the delay in the output logic for the partitioned FSM. Here, the fan-in will increase with |E| signals. In the case of ex1, where we have four sub-FSMs and |E| = 8, the increase of the complexity of the logic has significant influence on the delay in the output logic. For most of the benchmarks, we can observe only small changes in the delay of the critical path of the sub-FSMs.

TABLE VI Timing, critical paths [ns]

            Partitioned          Monolithic          Change
Benchmark   sub-FSM   output     FSM      output     FSM     output
bbara       6.7       1.9        6.4      1.7        +5%     +12%
dk512       6.8       2.6        7.3      1.7        -7%     +53%
ex1         9.0       17.3       8.7      7.3        +3%     +136%
keyb        9.6       12.2       9.3      9.5        +3%     +28%
styr        12.4      7.7        12.7     7.2        -2%     +7%
donfile     9.3       0          8.1      0          +15%    0
tma         8.5       7.6        10.4     7.2        -18%    +6%
scf         14.5      9.0        12.6     8.8        +15%    +2%

6. CONCLUSIONS

Clock-gating is a common approach to reducing average power consumption in finite-state machines. In this paper, we have presented an automated synthesis flow for a new type of mixed synchronous/asynchronous implementation architecture for gated-clock FSMs.

The advantages of having asynchronous communication between the sub-FSMs are:

- Asynchronous controllers dissipate less power than synchronous controllers when idle.
- A more power-efficient hand-over protocol for communication between the sub-FSMs can be employed.

The effectiveness of the proposed implementation architecture, together with the automated synthesis procedure implemented in a software tool, has been demonstrated on the MCNC FSM benchmarks. Power reductions of up to 68% have been achieved at the cost of an increase in area of 20%. We have found these results encouraging and we also see that further improvements are possible. More attention should be given to improving the partitioning algorithm. For some of the benchmarks, we can observe large increases in both area and power for the output logic. Alternative ways of organizing the logic for the output function will be further investigated.

Acknowledgments

Financial support from the Royal Swedish Academy of Sciences, the Estonian Science Foundation Grant "Multiparadigm System on Chip Design Environment" and the Foundation for Knowledge and Competence Development is gratefully acknowledged.

References

[1] Alidina, M., Monteiro, J., Devadas, S., Ghosh, A. and Papaefthymiou, M., "Precomputation-based sequential logic optimization for low power", Proc. of the IEEE/ACM International Conf. on Computer-Aided Design, pp. 74-81, November, 1994.

[2] Alcatel Microelectronics, Technology and Design Kit Documentation C05M, 1998.

[3] Baranov, S. (1994). Logic Synthesis for Control Automata, Kluwer Academic Publishers, ISBN 0-7923-9458-5.

[4] Benini, L., De Micheli, G. and Vermeulen, F., "Finite-state machine partitioning for low power", Proc. of the IEEE International Symposium on Circuits and Systems, II, 5-8, August, 1998.

[5] Benini, L., Siegel, P. and De Micheli, G. (1994). "Saving power by synthesizing gated clocks for sequential circuits", IEEE Design and Test of Computers, 11, 32-41.

[6] Benini, L. and De Micheli, G. (1996). "Automatic synthesis of low-power gated clock finite-state machines", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 15(6), 630-643.

[7] Benini, L., Vuillod, P., De Micheli, G. and Coelho, C., "Synthesis of low-power selectively-clocked systems from high-level specification", Proc. of the International Symposium on System-level Synthesis, pp. 57-63, November, 1996.

[8] Dasgupta, A. and Ganguly, S., "Divide and conquer: a strategy for synthesis of low power finite state machines", Proc. of the International Conf. on Computer-Aided Design, pp. 740-745, October, 1997.

[9] Hwang, E., Vahid, F. and Hsu, Y.-C., "FSMD functional partitioning for low power", Proc. of Design Automation and Test in Europe, pp. 22-28, March, 1999.

[10] Johnson, S. C. (1967). "Hierarchical clustering schemes", Psychometrika, No. 2, pp. 241-254.

[11] Mehra, R., Guerra, L. and Rabaey, J. (1996). "Low power architectural synthesis and the impact of exploiting locality", Journal of VLSI Signal Processing, 13(2,3), 239-258.

[12] Oelmann, B. and O'Nils, M., "Asynchronous control of low-power gated clock finite-state machines", Proc. of the IEEE International Conf. on Electronics, Circuits and Systems, pp. 915-918, September, 1999.

[13] Oelmann, B. and O'Nils, M., "A low power hand-over mechanism for gated-clock FSMs", Proc. of the European Conf. on Circuit Theory and Design, pp. 118-121, August, 1999.

[14] Roy, S., Banerjee, P. and Sarrafzadeh, M., "Partitioning sequential circuits for low power", Proc. of the 11th International Conf. on VLSI Design, pp. 212-217, January, 1997.

[15] Sentovich, E. M., Singh, K. J., Lavagno, L., Moon, C., Murgai, R., Saldanha, A., Savoj, H., Stephan, P. R., Brayton, R. K. and Sangiovanni-Vincentelli, A. (1992). SIS: A System for Sequential Circuit Synthesis, Electronics Research Laboratory, Memorandum No. UCB/ERL M92/41, Department of Electrical Engineering and Computer Science, University of California, Berkeley.

[16] Unger, S. H. (1969). Asynchronous Sequential Switching Circuits, Wiley & Sons, Inc.

[17] Weste, N. and Eshraghian, K. (1992). Principles of CMOS VLSI Design, A Systems Perspective, 2nd edition, Addison-Wesley Publishing Company, ISBN 0-201-53376-6.

[18] Yang, S. (1991). Logic Synthesis and Optimization Benchmarks User Guide, version 3.0, MCNC Technical Report.

Authors’ Biographies

Bengt Oelmann has been with Ericsson Telecom, Stockholm, Sweden, Mid Sweden University, Sundsvall, Sweden, and Nordic VLSI, Trondheim, Norway. Currently, he is an Associate Professor at Mid Sweden University. His research interests include VLSI implementation techniques and asynchronous logic design and its application to low-power design.

Kalle Tammemäe has served on the faculty of Tallinn Technical University (TTU), Estonia, since 1994. He received a diploma in Computer Engineering in 1981 and a Ph.D. in Engineering in 1997 from TTU, fulfilling part of the Ph.D. studies at the Royal Institute of Technology, Sweden. Currently, he is a part-time Associate Professor in the Department of Computer Engineering at TTU and full-time rector of the Estonian Information Technology College. His research interests include hardware/software co-design, hardware description languages, high-level synthesis, prototyping and control-intensive system design. He is a member of the IEEE Computer Society and ACM.

Margus Kruus has served on the faculty of Information Processing of Tallinn Technical University since 1980. Currently, he is an Associate Professor and head of the Department of Computer Engineering, Tallinn Technical University, Estonia. His research interests include decompositional design methods of digital systems, design-for-testability methods, and computer arithmetic algorithms.

Mattias O'Nils has been with Ericsson Telecom, Stockholm, Mid Sweden University, Sundsvall, and the Royal Institute of Technology, Stockholm, all in Sweden. Currently, he is an Associate Professor in Electrical Engineering at Mid Sweden University. His research interests include hardware/software codesign, interface synthesis, VLSI design methods, video signal processing, and low-power design. His research has resulted in three prototype CAD tools and over 20 research papers. He is a member of the IEEE.
