Lower Power Synthesis - Cho, Jun Dong, Sungkyunkwan Univ. (vada.skku.ac.kr/ClassInfo/lower-power-DS…, PPT file, 2002-04-04)

SungKyunKwan Univ.

VADA Lab.

Clustering Example

• Two-cluster Partition
• Three-cluster Partition


Complexity of Partitioning

In general, computing the optimal partitioning is an NP-complete problem: the best known algorithms take time that is exponential in n=|N| and p, and it is widely believed that no algorithm whose running time is polynomial in n=|N| and p exists (see "Computers and Intractability", M. Garey and D. Johnson, W. H. Freeman, 1979, for details). We therefore need heuristics to get approximate solutions for problems where n is large. The picture below illustrates a larger graph partitioning problem; it was generated using the spectral partitioning algorithm as implemented in the graph partitioning software by Gilbert et al., described below. The partition is N = Nblue U Nblack, with red edges connecting nodes in the two partitions.


Edge Separator and Vertex Separator

Bisecting a graph G=(N,E) can be done in two ways. In the last section, we discussed finding the smallest subset Es of E such that removing Es from E divided G into two disconnected subgraphs G1 and G2, with nodes N1 and N2 respectively, where N1 U N2 = N and N1 and N2 are disjoint and equally large. (If the number of nodes is odd, we obviously cannot make |N1|=|N2|. So we will call Es an edge separator if |N1| and |N2| are sufficiently close; we will be more explicit about how different |N1| and |N2| can be only when necessary.) The edges in Es connect nodes in N1 to nodes in N2. Since removing Es disconnects G, Es is called an edge separator.

The other way to bisect a graph is to find a vertex separator, a subset Ns of N, such that removing Ns and all incident edges from G also results in two disconnected subgraphs G1 and G2 of G. In other words N = N1 U Ns U N2, where all three subsets of N are disjoint, N1 and N2 are equally large, and no edges connect N1 and N2.

The following figure illustrates these ideas. The green edges, Es1, form an edge separator, as do the blue edges, Es2. The red nodes, Ns, are a vertex separator, since removing them and the incident edges (Es1, Es2, and the purple edges) leaves two disjoint subgraphs.

Theorem. (Tarjan, Lipton, "A separator theorem for planar graphs", SIAM J. Appl. Math., 36:177-189, April 1979). Let G=(N,E) be a planar graph. Then we can find a vertex separator Ns, so that N = N1 U Ns U N2 is a disjoint partition of N, |N1| <= (2/3)*|N|, |N2| <= (2/3)*|N|, and |Ns| <= sqrt(8*|N|).


Kernighan and Lin Algorithm

• This algorithm is due to B. Kernighan and S. Lin ("An efficient heuristic procedure for partitioning graphs", The Bell System Technical Journal, pp. 291--308, Feb 1970) and takes O(|N|^3) time per iteration. A more complicated but more efficient implementation, which takes only O(|E|) time per iteration, was presented by C. Fiduccia and R. Mattheyses, "A linear-time heuristic for improving network partitions", Technical Report 82CRD130, General Electric Co., Corporate Research and Development Center, Schenectady, NY, 1982.

• We start with an edge-weighted graph G=(N,E,WE) and a partitioning N = A U B into equal parts: |A| = |B|. Let w(e) = w(i,j) be the weight of edge e=(i,j), where the weight is 0 if no edge e=(i,j) exists. The goal is to find equal-sized subsets X in A and Y in B such that exchanging X and Y reduces the total cost of edges from A to B. More precisely, we let

    T = sum[ a in A and b in B ] w(a,b) = cost of edges from A to B

and seek X and Y such that new_A = (A - X) U Y and new_B = (B - Y) U X has a lower cost new_T. To compute new_T efficiently, we introduce:

    E(a) = external cost of a = sum[ b in B ] w(a,b)
    I(a) = internal cost of a = sum[ a' in A, a' != a ] w(a,a')
    D(a) = cost of a = E(a) - I(a)

and analogously

    E(b) = external cost of b = sum[ a in A ] w(a,b)
    I(b) = internal cost of b = sum[ b' in B, b' != b ] w(b,b')
    D(b) = cost of b = E(b) - I(b)

Then it is easy to show that swapping a in A and b in B changes T to

    new_T = T - ( D(a) + D(b) - 2*w(a,b) ) = T - gain(a,b)

In other words, gain(a,b) = D(a) + D(b) - 2*w(a,b) measures the improvement in the partitioning by swapping a and b. D(a') and D(b') also change to

    new_D(a') = D(a') + 2*w(a',a) - 2*w(a',b)   for all a' in A, a' != a
    new_D(b') = D(b') + 2*w(b',b) - 2*w(b',a)   for all b' in B, b' != b


Kernighan and Lin Algorithm

(0) Compute T = cost of partition N = A U B              ... cost = O(|N|^2)
Repeat
    (1) Compute costs D(n) for all n in N                ... cost = O(|N|^2)
    (2) Unmark all nodes in G                            ... cost = O(|N|)
    (3) While there are unmarked nodes                   ... |N|/2 iterations
        (3.1) Find an unmarked pair (a,b) maximizing gain(a,b)   ... cost = O(|N|^2)
        (3.2) Mark a and b (but do not swap them)        ... cost = O(1)
        (3.3) Update D(n) for all unmarked n, as though a and b had been swapped   ... cost = O(|N|)
    End while
        ... At this point, we have computed a sequence of pairs
        ... (a1,b1), ... , (ak,bk) and
        ... gains gain(1), ..., gain(k)
        ... where k = |N|/2, ordered by the order in which
        ... we marked them
    (4) Pick j maximizing Gain = sum[ i = 1...j ] gain(i)
        ... Gain is the reduction in cost from swapping
        ... (a1,b1),...,(aj,bj)
    (5) If Gain > 0 then
        (5.1) Update A = A - {a1,...,aj} U {b1,...,bj}   ... cost = O(|N|)
        (5.2) Update B = B - {b1,...,bj} U {a1,...,aj}   ... cost = O(|N|)
        (5.3) Update T = T - Gain                        ... cost = O(1)
    End if
Until Gain <= 0
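The steps above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the function names (weight_matrix, cut_cost, kl_pass, kernighan_lin), the dense weight-matrix representation, and the six-node example graph are all assumptions made for this sketch. It follows the O(|N|^3) formulation, not the Fiduccia-Mattheyses variant.

```python
def weight_matrix(n, edges):
    """Symmetric 0/1 weight matrix for an unweighted graph (assumed input format)."""
    w = [[0] * n for _ in range(n)]
    for i, j in edges:
        w[i][j] = w[j][i] = 1
    return w


def cut_cost(w, A, B):
    """T = sum of w(a,b) over a in A, b in B."""
    return sum(w[a][b] for a in A for b in B)


def kl_pass(w, A, B):
    """One pass of steps (1)-(5); returns the new A, new B and the Gain achieved."""
    A, B = list(A), list(B)
    side = {n: 0 for n in A}
    side.update({n: 1 for n in B})
    nodes = A + B
    # Step (1): D(n) = external cost - internal cost
    D = {}
    for n in nodes:
        ext = sum(w[n][m] for m in nodes if side[m] != side[n])
        internal = sum(w[n][m] for m in nodes if side[m] == side[n] and m != n)
        D[n] = ext - internal
    # Steps (2)-(3): greedily mark pairs, updating D as though they were swapped
    un_a, un_b = set(A), set(B)
    pairs, gains = [], []
    while un_a and un_b:
        a, b = max(((x, y) for x in un_a for y in un_b),
                   key=lambda p: D[p[0]] + D[p[1]] - 2 * w[p[0]][p[1]])
        pairs.append((a, b))
        gains.append(D[a] + D[b] - 2 * w[a][b])
        un_a.remove(a)
        un_b.remove(b)
        for x in un_a:
            D[x] += 2 * w[x][a] - 2 * w[x][b]
        for y in un_b:
            D[y] += 2 * w[y][b] - 2 * w[y][a]
    # Step (4): best prefix of the marked pairs
    best_j, best_gain, running = 0, 0, 0
    for j, g in enumerate(gains, start=1):
        running += g
        if running > best_gain:
            best_gain, best_j = running, j
    # Step (5): swap only the first best_j pairs
    if best_gain > 0:
        move_a = {a for a, _ in pairs[:best_j]}
        move_b = {b for _, b in pairs[:best_j]}
        A = [n for n in A if n not in move_a] + sorted(move_b)
        B = [n for n in B if n not in move_b] + sorted(move_a)
    return A, B, best_gain


def kernighan_lin(w, A, B):
    """Repeat passes until a pass yields Gain <= 0."""
    while True:
        A, B, gain = kl_pass(w, A, B)
        if gain <= 0:
            return A, B


# Example (assumed): two triangles {0,1,2} and {3,4,5} joined by the edge (2,3),
# starting from a deliberately poor partition.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
w = weight_matrix(6, edges)
A, B = kernighan_lin(w, [0, 1, 3], [2, 4, 5])
print(sorted(A), sorted(B), cut_cost(w, A, B))
```

In this example the first pass finds the single swap (3,2) with gain 4, and the second pass finds no positive Gain, so the loop stops with the two triangles separated.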


Spectral Partitioning

• This is a powerful but expensive technique, based on techniques introduced by Fiedler in the 1970s, but popularized in 1990 by A. Pothen, H. Simon, and K.-P. Liou, "Partitioning sparse matrices with eigenvectors of graphs", SIAM J. Matrix Anal. Appl., 11:430--452. We will first describe the algorithm, and then give three related justifications for its efficacy. Let G=(N,E) be an undirected, unweighted graph without self edges (i,i) or multiple edges from one node to another. We define two matrices related to this graph.

• Definition. The incidence matrix In(G) of G is an |N|-by-|E| matrix, with one row for each node and one column for each edge.

• Suppose edge e=(i,j). Then column e of In(G) is zero except for the i-th and j-th entries, which are +1 and -1, respectively.

Note that there is some ambiguity in this definition, since G is undirected; writing edge e=(i,j) instead of (j,i) is equivalent to multiplying column e of In(G) by -1. We will see that this ambiguity will not be important to us.

Definition. The Laplacian matrix L(G) of G is an |N|-by-|N| symmetric matrix, with one row and column for each node. It is defined as follows:

    (L(G))(i,j) = degree of node i (number of incident edges)   if i=j
                = -1                                            if i!=j and there is an edge (i,j)
                = 0                                             otherwise
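As a small illustration of the two definitions, the following Python sketch builds In(G) and L(G) for an assumed four-node example graph using plain nested lists. The function names and the example graph are choices made for this sketch, not part of the original notes.

```python
def incidence_matrix(n, edges):
    """|N|-by-|E| matrix: column e has +1 in row i and -1 in row j for e=(i,j)."""
    In = [[0] * len(edges) for _ in range(n)]
    for e, (i, j) in enumerate(edges):
        In[i][e] = 1
        In[j][e] = -1
    return In


def laplacian(n, edges):
    """|N|-by-|N| matrix: degree on the diagonal, -1 for each edge, 0 otherwise."""
    L = [[0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1
        L[j][j] += 1
        L[i][j] = -1
        L[j][i] = -1
    return L


def times_own_transpose(M):
    """Return M * M' for a matrix stored as a list of rows."""
    return [[sum(a * b for a, b in zip(row_r, row_s)) for row_s in M] for row_r in M]


# Example (assumed): a 4-cycle 0-1-2-3-0 with one chord (0,2)
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
In = incidence_matrix(4, edges)
L = laplacian(4, edges)

# Writing every edge as (j,i) instead of (i,j) flips the sign of each column of In(G)
In_flipped = incidence_matrix(4, [(j, i) for i, j in edges])
```

Multiplying In by its own transpose reproduces L for either sign convention, which is why the ambiguity noted above is harmless.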


Spatial Locality: Hardware Partitioning

• The interface logic should be properly partitioned for area and timing reasons. Minimization of global busses leads to lower bus capacitance, and thus lower interconnect power.
• Signal values within the clusters tend to be more highly correlated.
• The data path should be partitioned into approximately equal-sized parts.
• In the DSP area, data paths tend to occupy far more area than the control paths.
• Wiring is still one of the dominant area consumers.
• The method used to identify clusters is based on the eigenvalues and eigenvectors of the Laplacian of the graph.
• The eigenvector corresponding to the second smallest eigenvalue provides a 1-D placement of the nodes which minimizes the mean-squared connection length.


Spectral Partitioning in VLSI placement


Spectral Partitioning in VLSI placement

• Setting the derivative of the Lagrangian, L, to zero gives:

    $(Q - \lambda I)x = 0$

• The solutions to the above equation are those where $\lambda$ is an eigenvalue of Q and x is the corresponding eigenvector.

• The smallest eigenvalue, 0, gives a trivial solution with all nodes at the same point. The eigenvector corresponding to the second smallest eigenvalue minimizes the cost function while giving a non-trivial solution.
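The cost function itself did not survive on these slides, so the following derivation is a sketch under an assumption: that the cost is the standard quadratic placement form $x^T Q x$ with Q symmetric, minimized subject to the normalization $x^T x = 1$. Under that assumption the eigenvalue condition follows in three lines.

```latex
\begin{aligned}
\min_x \; & x^T Q x \quad \text{subject to} \quad x^T x = 1 \\
L(x,\lambda) &= x^T Q x - \lambda\,(x^T x - 1) \\
\frac{\partial L}{\partial x} &= 2Qx - 2\lambda x = 0
  \;\;\Longrightarrow\;\; (Q - \lambda I)\,x = 0 \\
x^T Q x &= \lambda\, x^T x = \lambda
\end{aligned}
```

The last line shows why the eigenvalues are ranked by cost: at a stationary point the cost equals $\lambda$, so the smallest non-trivial eigenvalue gives the cheapest non-trivial placement.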


Key Ideas in Spectral Partitioning


Spectral Partitioning


Spectral Partitioning

The following theorem states some important facts about In(G) and L(G). It introduces us to the idea that the eigenvalues and eigenvectors of L(G) are related to the connectivity of G.

Theorem 1. Given a graph G, its associated matrices In(G) and L(G) have the following properties.

1. L(G) is a symmetric matrix. This means the eigenvalues of L(G) are real, and its eigenvectors are real and orthogonal.
2. Let e=[1,...,1]', where ' means transpose, i.e. the column vector of all ones. Then L(G)*e = 0.
3. In(G)*(In(G))' = L(G). This is independent of the signs chosen in each column of In(G).
4. Suppose L(G)*v = lambda*v, where v is nonzero. Then

       lambda = norm(In(G)'*v)^2 / norm(v)^2,   where norm(z)^2 = sum_i z(i)^2
              = [ sum{all edges e=(i,j)} (v(i)-v(j))^2 ] / [ sum_i v(i)^2 ]

5. The eigenvalues of L(G) are nonnegative:

       0 <= lambda1 <= lambda2 <= ... <= lambdan

6. The number of connected components of G is equal to the number of eigenvalues lambdai equal to 0. In particular, lambda2 != 0 if and only if G is connected.


Spectral Partitioning

    Compute the eigenvector v2 corresponding to lambda2 of L(G)
    for each node n of G
        if v2(n) < 0
            put node n in partition N-
        else
            put node n in partition N+
        endif
    endfor

First we show that this partition is at least reasonable, because it tends to give connected components N- and N+:

Theorem 2. (M. Fiedler, "A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory", Czech. Math. J. 25:619--637, 1975.) Let G be connected, and N- and N+ be defined by the above algorithm. Then N- is connected. If no v2(n) = 0, N+ is also connected.
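The sign-based split on v2 can be sketched in pure Python as below. The notes do not say how v2 is computed; this sketch assumes a shifted power iteration with the all-ones vector projected out, which is only one simple option and is adequate for small connected graphs where lambda2 is strictly smaller than lambda3. The function names, the starting vector, and the iteration count are illustrative choices.

```python
def laplacian(n, edges):
    """Graph Laplacian as a list of rows: degree on the diagonal, -1 per edge."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1
        L[j][j] += 1
        L[i][j] = -1.0
        L[j][i] = -1.0
    return L


def fiedler_vector(L, iters=2000):
    """Approximate v2 by power iteration on (shift*I - L), orthogonal to all ones.

    shift = 2 * max degree is an upper bound on the largest eigenvalue of L,
    so shift*I - L is positive semidefinite and its dominant eigenvector in the
    subspace orthogonal to the all-ones vector is v2.
    """
    n = len(L)
    shift = 2.0 * max(L[i][i] for i in range(n))
    v = [float(i) for i in range(n)]              # assumed deterministic start
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]                 # remove the lambda1 = 0 component
        v = [shift * v[i] - sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v


def spectral_bisection(n, edges):
    """Split nodes by the sign of v2, as in the algorithm above."""
    v2 = fiedler_vector(laplacian(n, edges))
    n_minus = [i for i in range(n) if v2[i] < 0]
    n_plus = [i for i in range(n) if v2[i] >= 0]
    return n_minus, n_plus


# Example (assumed): a chain of 6 nodes; the split falls in the middle.
chain = [(i, i + 1) for i in range(5)]
n_minus, n_plus = spectral_bisection(6, chain)
```

Which half is labelled N- depends on the sign of the computed eigenvector, which is arbitrary; only the split itself is meaningful.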

There are a number of reasons lambda2 is called the algebraic connectivity. Here is another.

Theorem 3. (Fiedler). Let G=(N,E) be a graph, and G1=(N,E1) a subgraph, i.e. with the same nodes and a subset of the edges, so that G1 is "less connected" than G. Then lambda2(L(G1)) <= lambda2(L(G)), i.e. the algebraic connectivity of G1 is also less than or equal to the algebraic connectivity of G.

Motivation for spectral bisection, by analogy with a vibrating string

How does a taut string vibrate when it is plucked? From our background in either physics or music, we know that it has certain modes of vibration or harmonics. If we were to take snapshots of these modes, they would look like this:


Spectral Partitioning


Multilevel Kernighan-Lin

Gc is computed in step (1) of Recursive_partition as follows. We define a matching of a graph G=(N,E) as a subset Em of the edges E with the property that no two edges in Em share an endpoint. A maximal matching is one to which no more edges can be added and remain a matching. We can compute a maximal matching by a simple random algorithm:

    let Em be empty
    mark all nodes in N as unmatched
    for i = 1 to |N|      ... visit the nodes in a random order
        if node i has not been matched,
            choose an edge e=(i,j) where j is also unmatched, and add it to Em
            mark i and j as matched
        end if
    end for

Given a matching, Gc is computed as follows. We let there be a node r in Nc for each edge in Em. Then we construct Ec as follows:

    for r = 1 to |Em|     ... for each node in Nc
        let (i,j) be the edge in Em corresponding to node r
        for each other edge e=(i,k) in E incident on i
            let ek be the edge in Em incident on k, and let rk be the corresponding node in Nc
            add the edge (r,rk) to Ec
        end for
        for each other edge e=(j,k) in E incident on j
            let ek be the edge in Em incident on k, and let rk be the corresponding node in Nc
            add the edge (r,rk) to Ec
        end for
    end for
    if there are multiple edges between pairs of nodes of Nc, collapse them into single edges
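The matching and coarsening steps can be sketched in Python as follows. Two points are assumptions of this sketch and not taken from the pseudocode: nodes left unmatched are carried into the coarse graph as single-node coarse nodes (the pseudocode above only creates coarse nodes for matched edges), and the random visiting order is driven by a seed parameter so that runs are repeatable. Function names are illustrative.

```python
import random


def maximal_matching(n, edges, seed=0):
    """Visit nodes in random order; match each unmatched node to an unmatched neighbour."""
    adj = {i: [] for i in range(n)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    matched = [False] * n
    Em = []
    for i in order:
        if matched[i]:
            continue
        for j in adj[i]:
            if not matched[j]:
                Em.append((i, j))
                matched[i] = matched[j] = True
                break
    return Em


def coarsen(n, edges, Em):
    """Collapse each matched edge into one coarse node and build the coarse edge set."""
    coarse_of = {}
    for r, (i, j) in enumerate(Em):
        coarse_of[i] = coarse_of[j] = r
    next_id = len(Em)
    for i in range(n):
        if i not in coarse_of:        # assumption: unmatched nodes become singletons
            coarse_of[i] = next_id
            next_id += 1
    Ec = set()
    for i, j in edges:
        r, s = coarse_of[i], coarse_of[j]
        if r != s:                    # storing pairs in a set collapses multiple edges
            Ec.add((min(r, s), max(r, s)))
    return coarse_of, sorted(Ec)


# Example (assumed): a 6-cycle
cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
Em = maximal_matching(6, cycle, seed=1)
coarse_of, Ec = coarsen(6, cycle, Em)
```

Each matched edge removes one node from the graph, so the coarse graph has |N| - |Em| nodes.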


Multilevel Kernighan-Lin

Note that we can take node weights into account by letting the weight of a node (i,j) in Nc be the sum of the weights of the nodes i and j. We can similarly take edge weights into account by letting the weight of an edge in Ec be the sum of the weights of the edges "collapsed" into it. Furthermore, we can choose the edge (i,j) which matches j to i in the construction of Nc above to have the largest weight of all edges incident on i; this will tend to minimize the weights of the cut edges. This is called heavy edge matching in METIS, and is illustrated on the right.


Multilevel Kernighan-Lin

Given a partition (Nc+,Nc-) from step (2) of Recursive_partition, it is easily expanded to a partition (N+,N-) in step (3) by associating with each node in Nc+ or Nc- the nodes of N that comprise it. This is again shown below:

Finally, in step (4) of Recursive_partition, the approximate partition from step (3) is improved using a variation of Kernighan-Lin.


Multilevel Spectral Partitioning

Now we turn to the divide-and-conquer algorithm of Barnard and Simon, which is based on spectral partitioning rather than Kernighan-Lin. The expensive part of spectral bisection is finding the eigenvector v2, which requires a possibly large number of matrix-vector multiplications with the Laplacian matrix L(G) of the graph G. The divide-and-conquer approach of Recursive_partition will dramatically decrease the cost. Barnard and Simon perform step (1) of Recursive_partition, computing Gc = (Nc,Ec) from G=(N,E), slightly differently than above: they find a maximal independent subset Nc of N. This means that N contains Nc and E contains Ec, no nodes in Nc are directly connected by edges in E (independence), and no further node of N can be added to Nc without violating independence (maximality).

There is a simple "greedy" algorithm for finding an Nc:

    Nc = empty set
    for i = 1 to |N|
        if node i is not adjacent to any node already in Nc
            add i to Nc
        end if
    end for

This is shown below in the case where G is simply a chain of 9 nodes with nearest neighbor connections, in which case Nc consists simply of every other node of N.
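The greedy construction of Nc is short enough to sketch directly in Python. The function name and the 0-based node numbering are choices made for this sketch; the example is the 9-node chain mentioned in the text.

```python
def greedy_independent_set(n, edges):
    """Add each node, in index order, unless it is adjacent to a node already chosen."""
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    Nc = []
    chosen = set()
    for i in range(n):
        if adj[i].isdisjoint(chosen):
            Nc.append(i)
            chosen.add(i)
    return Nc


# Chain of 9 nodes with nearest-neighbour connections (nodes numbered 0..8)
chain = [(i, i + 1) for i in range(8)]
Nc = greedy_independent_set(9, chain)
print(Nc)  # [0, 2, 4, 6, 8]
```

The result is maximal in the sense that no further node can be added, but the greedy order does not guarantee the largest possible independent set on general graphs.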


hMETIS

• hMETIS is a set of programs for partitioning hypergraphs such as those corresponding to VLSI circuits. The algorithms implemented by hMETIS are based on the multilevel hypergraph partitioning scheme described in [KAKS97].

• hMETIS produces bisections that cut 10% to 300% fewer hyperedges than those cut by other popular algorithms such as PARABOLI, PROP, and CLIP-PROP, especially for circuits with over 100,000 cells and circuits with non-unit cell area.

• It is extremely fast: a single run of hMETIS is faster than a single run of simpler schemes such as FM, KL, or CLIP. Furthermore, because of its very good average cut characteristics, it produces high-quality partitionings in significantly fewer runs. It can bisect circuits with over 100,000 vertices in a couple of minutes on Pentium-class workstations.

• The performance of hMETIS on the new ISPD98 benchmark suite can be found in the paper by Chuck Alpert.

http://www.users.cs.umn.edu/~karypis/metis/metis.html


How good is Recursive Bisection?

• Horst D. Simon and Shang-Hua Teng, Report RNR-93-012, August 1993
• The most commonly used p-way partitioning method is recursive bisection. It first "optimally" divides the graph (mesh) into two equal-sized pieces and then recursively divides the two pieces. We show that, due to the greedy nature and the lack of global information, recursive bisection, in the worst case, may produce a partition that is very far from the optimal one. Our negative result is complemented by two positive ones. First, we show that for some important classes of graphs that occur in practical applications, such as well-shaped finite element and finite difference meshes, recursive bisection is normally within a constant factor of the optimal one. Second, we show that if the balance condition is relaxed so that each block in the partition is bounded by (1+e)n/p, then there exists an approximately balanced recursive partitioning scheme that finds a partition whose cost is within an O(log p) factor of the cost of the optimal p-way partition.


Partitioning Algorithm with Multiple Constraints

1998. 5. 19, Jun Dong Cho


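Charging and Discharging Due to Switching

• Accounts for up to 90% of total power consumption

[Figure: CMOS gate. Labels: PMOS pull-up network, NMOS pull-down network, Vdd, load capacitance C_L, charge, discharge, short circuit + leakage.]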


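Partitioning for Low Power

• Conventional method: the number of edges crossing the cut
• Low power: the number of switching activities on the edges

[Figure: (a) cutting by the number of cut edges; (b) cutting by the number of switching activities. Edges are labeled with switching activities of 0.25 and 0.75.]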


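Min-Cost Flow Algorithm

• A method for sending a given amount of flow to the desired destination at the lowest cost
    – Each arc has a capacity and a cost
• Max-flow min-cut: considers only the number of edges
• Min-cost flow: assigns each edge a weight for its switching activity
    – Cost: switching activity vs. number of edges
    – Capacity: the maximum amount that can flow on an edge
• Capacities are made large so that lower-cost edges are selected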

W S Ci i i ( )1
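To make the min-cost flow idea concrete, here is a small Python sketch that sends a requested amount of flow at minimum total cost using successive shortest augmenting paths with Bellman-Ford. The slides do not say which min-cost flow algorithm is used, so this method, the function name, the (u, v, capacity, cost) arc format, and the example network are all assumptions of this sketch. In the low-power setting the arc cost would play the role of the switching-activity weight.

```python
def min_cost_flow(n, arcs, source, sink, amount):
    """Send up to `amount` units from source to sink at minimum total cost.

    arcs: list of (u, v, capacity, cost) tuples (assumed input format).
    Returns (flow_sent, total_cost).
    """
    graph = [[] for _ in range(n)]      # graph[u] = indices of residual arcs leaving u
    to, cap, cost = [], [], []
    for u, v, c, w in arcs:
        graph[u].append(len(to)); to.append(v); cap.append(c); cost.append(w)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-w)
    inf = float("inf")
    sent = total = 0
    while sent < amount:
        # Bellman-Ford shortest path by cost in the residual network
        dist = [inf] * n
        prev_arc = [-1] * n
        dist[source] = 0
        for _ in range(n - 1):
            changed = False
            for u in range(n):
                if dist[u] == inf:
                    continue
                for e in graph[u]:
                    if cap[e] > 0 and dist[u] + cost[e] < dist[to[e]]:
                        dist[to[e]] = dist[u] + cost[e]
                        prev_arc[to[e]] = e
                        changed = True
            if not changed:
                break
        if dist[sink] == inf:
            break                       # no augmenting path is left
        # Bottleneck capacity along the path (arc e and its reverse are e and e ^ 1)
        push = amount - sent
        v = sink
        while v != source:
            e = prev_arc[v]
            push = min(push, cap[e])
            v = to[e ^ 1]
        # Augment along the path
        v = sink
        while v != source:
            e = prev_arc[v]
            cap[e] -= push
            cap[e ^ 1] += push
            v = to[e ^ 1]
        sent += push
        total += push * dist[sink]
    return sent, total


# Example (assumed): 4 nodes, source 0, sink 3
arcs = [(0, 1, 2, 1), (0, 2, 2, 4), (1, 3, 1, 1), (1, 2, 2, 1), (2, 3, 3, 1)]
result = min_cost_flow(4, arcs, source=0, sink=3, amount=3)
print(result)  # (3, 10)
```

In the example, the cheap arc out of the source is used to its full capacity of 2 before the expensive one is touched, which is the behaviour the cost weights are meant to produce.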


Network and Mincost Flow

[Figure: example flow network in which every edge carries a pair of labels, e.g. 10 / 100, 20 / 10, 15 / 30, 45 / 55, 23 / 11, 100 / 10.]


Graph Transformation Algorithm

• Finds the min-cost flow paths
• The graph must be transformed in order to find the cut
• Topological sort by level

[Figure: graph with nodes arranged in Level 1 through Level 5.]


Graph Transformation Algorithm

• Added nodes and edges

[Figure: nodes at Level (i) and Level (i+1) between Source and Sink. Legend: newly created edge, existing edge, newly created node, existing node.]


Graph Transformation

[Figure: transformed graph with Level 1 through Level 5 placed between the source S and the sink T.]


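Partitioning with constraints

[Equation residue: a double sum over i, j = 1..k of terms in W and C, followed by the constraints below.]

    $A_{lower} \le A_i \le A_{upper}, \quad P_i \le P_{upper}, \quad 1 \le i \le k$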


Algorithm

Input: Flow f, Network
Output: Partition the network into f subnetworks

Step 1: Push the flow into the graph and run the min-cost flow algorithm. If every partition satisfies A_upper or P_upper, stop. Otherwise, increase f (f = f+1) and repeat Step 1 until the upper bounds are satisfied.

Step 2: If there are two partitions p and q that do not satisfy A_lower or P_lower, and

    $P_{lower} \le P_p + P_q \le P_{upper}$
    $A_{lower} \le A_p + A_q \le A_{upper}$

then p and q can be merged; min-cost matching is applied over all possible {p,q} sets to reduce the number of partitions.


6. Logic Level Design


Node Transition Activity


Low Activity XOR Function


GLITCH (Spurious transitions)

• 15-20% of the total power is due to glitching.


Glitches


Hazard Generation in Logic Circuits

• Static hazard: a transient pulse of width w (= the delay of the inverter).
• Dynamic hazard: the transient consists of three edges, two rising and one falling, with w of two units.
• Each input can have several arriving paths.


High-Performance Power Distribution

• (S: Switching probability; C: Capacitance)
• Start with all logic at the lowest power level; then successive iterations of delay calculation, identification of the failing blocks, and powering up are done until either all of the nets pass their delay criteria or the maximum power level is reached.
• Voltage drops in ground and supply wires use up a more serious fraction of the total noise margin.


Logic Transformation

• Use a signal with low switching activity to reduce the activity on a highly active signal.
• Done by the addition of a redundant connection from the gate with low activity (source gate) to the gate with high switching activity (target gate).
• Signals a, b, and g1 have very high switching activity and most of the time its value is zero
• Suppose c and g1 are selected as the source and target of a new connection ` 1 is undetectable, hence the function of the new circuit remains the same.
• Signal c has a long run of zeros, and zero is the controlling value of the AND gate g1, so most of the switching activity at the inputs of g1 will not be seen at the output; thus the switching activity of the gate g1 is reduced.
• The redundant connection in a circuit may result in some irredundant connections becoming redundant.
• By adding ` 1, the connections from c to g3 become redundant.


Logic Transformation


Logic Transformation


Frequency Reduction

◈ Power saving
    Reduces capacitance on the clock network
    Reduces internal power in the affected registers
    Reduces need for muxes (data recirculation)

◈ Opportunity
    Large opportunity for power reduction, dependent on:
        Number of registers gated
        Percentage of time the clock is enabled

◈ Cost
    Testability
    Complicates clock tree synthesis
    Complicates clock skew balancing


GATED-CLOCK D-FLIP-FLOP

• Flip-flops present a large internal capacitance on the internal clock node.
• If the DFF output does not switch, the DFF does not have to be clocked.


Frequency Reduction

[Figure: Clock Gating Example - When D is not equal to Q. Before clock gating: the FSM (inputs clk, reset) generates load_en for the 32-bit register data_reg between data_in and data_out. After clock gating: a LATCH produces load_en_latched, which is combined with clk to form clk_en, the clock of data_reg.]


Frequency Reduction

◈ Clock Gating Example - Before Code

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity nongate is
  port(clk, rst : in  std_logic;
       data_in  : in  std_logic_vector(31 downto 0);
       data_out : out std_logic_vector(31 downto 0));
end nongate;

architecture behave of nongate is
  signal load_en  : std_logic;
  signal data_reg : std_logic_vector(31 downto 0);
  signal count    : integer range 0 to 15;
begin

  -- Modulo-10 counter with synchronous, active-low reset
  FSM : process
  begin
    wait until clk'event and clk='1';
    if rst='0' then
      count <= 0;
    elsif count=9 then
      count <= 0;
    else
      count <= count+1;
    end if;
  end process FSM;

  -- load_en is high for one cycle out of every ten
  enable_logic : process(count,load_en)
  begin
    if(count=9) then
      load_en <= '1';
    else
      load_en <= '0';
    end if;
  end process enable_logic;

  -- The register is clocked on every cycle but loads only when load_en='1'
  datapath : process
  begin
    wait until clk'event and clk='1';
    if load_en='1' then
      data_reg <= data_in;
    end if;
  end process datapath;

  data_out <= data_reg;
end behave;

configuration cfg_nongate of nongate is
  for behave
  end for;
end cfg_nongate;


Frequency Reduction

◈ Clock Gating Example - After Code

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity gate is
  port(clk, rst : in  std_logic;
       data_in  : in  std_logic_vector(31 downto 0);
       data_out : out std_logic_vector(31 downto 0));
end gate;

architecture behave of gate is
  signal load_en, load_en_latched, clk_en : std_logic;
  signal data_reg : std_logic_vector(31 downto 0);
  signal count    : integer range 0 to 15;
begin

  -- Modulo-10 counter with synchronous, active-low reset
  FSM : process
  begin
    wait until clk'event and clk='1';
    if rst='0' then
      count <= 0;
    elsif count=9 then
      count <= 0;
    else
      count <= count+1;
    end if;
  end process FSM;

  enable_logic : process(count,load_en)
  begin
    if(count=9) then
      load_en <= '1';
    else
      load_en <= '0';
    end if;
  end process enable_logic;

  -- Latch that is transparent while clk is low, so load_en_latched
  -- cannot change (and glitch clk_en) while clk is high
  deglitch : process(clk,load_en)
  begin
    if(clk='0') then
      load_en_latched <= load_en;
    end if;
  end process deglitch;

  -- Gated clock: data_reg only sees a clock edge when a load is requested
  clk_en <= clk and load_en_latched;

  datapath : process
  begin
    wait until clk_en'event and clk_en='1';
    data_reg <= data_in;
  end process datapath;

  data_out <= data_reg;
end behave;

configuration cfg_gate of gate is
  for behave
  end for;
end cfg_gate;


Frequency Reduction

◈ Clock Gating Example - Report


Frequency Reduction

◈ 4-bit Synchronous & Ripple counter - code

4-bit Synchronous Counter

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

entity BINARY is
  port ( clk   : in std_logic;
         reset : in std_logic;
         count : buffer UNSIGNED (3 downto 0));
end BINARY;

architecture BEHAVIORAL of BINARY is
begin
  -- All four bits are clocked by clk; reset is asynchronous and active low
  process(reset,clk,count)
  begin
    if (reset = '0') then
      count <= "0000";
    elsif (clk'event and clk = '1') then
      if (count = UNSIGNED'("1111")) then
        count <= "0000";
      else
        count <= count+UNSIGNED'("1");
      end if;
    end if;
  end process;
end BEHAVIORAL;

configuration CFG_BINARY_BLOCK_BEHAVIORAL of BINARY is
  for BEHAVIORAL
  end for;
end CFG_BINARY_BLOCK_BEHAVIORAL;


Frequency Reduction

4-bit Ripple Counter

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

entity RIPPLE is
  port ( clk   : in std_logic;
         reset : in std_logic;
         count : buffer UNSIGNED (3 downto 0));
end RIPPLE;

architecture BEHAVIORAL of RIPPLE is
  signal count0, count1, count2 : std_logic;
begin
  -- Each bit's output serves as the clock of the next bit
  process(count)
  begin
    count0 <= count(0);
    count1 <= count(1);
    count2 <= count(2);
  end process;

  -- Bit 0 toggles on every rising edge of clk
  process(reset,clk)
  begin
    if (reset = '0') then
      count(0) <= '0';
    elsif (clk'event and clk = '1') then
      if (count(0) = '1') then
        count(0) <= '0';
      else
        count(0) <= '1';
      end if;
    end if;
  end process;

  -- Bit 1 toggles on every rising edge of bit 0
  process(reset,count0)
  begin
    if (reset = '0') then
      count(1) <= '0';
    elsif (count0'event and count0 = '1') then
      if (count(1) = '1') then
        count(1) <= '0';
      else
        count(1) <= '1';
      end if;
    end if;
  end process;

  -- Bit 2 toggles on every rising edge of bit 1
  process(reset,count1)
  begin
    if (reset = '0') then
      count(2) <= '0';
    elsif (count1'event and count1 = '1') then
      if (count(2) = '1') then
        count(2) <= '0';
      else
        count(2) <= '1';
      end if;
    end if;
  end process;

  -- Bit 3 toggles on every rising edge of bit 2
  process(reset,count2)
  begin
    if (reset = '0') then
      count(3) <= '0';
    elsif (count2'event and count2 = '1') then
      if (count(3) = '1') then
        count(3) <= '0';
      else
        count(3) <= '1';
      end if;
    end if;
  end process;
end BEHAVIORAL;

configuration CFG_RIPPLE_BLOCK_BEHAVIORAL of RIPPLE is
  for BEHAVIORAL
  end for;
end CFG_RIPPLE_BLOCK_BEHAVIORAL;


Frequency Reduction

◈ 4-bit Synchronous & Ripple counter - Report


Bus-Invert Coding for Low Power I/O

An eight-bit bus on which all eight lines toggle at the same time and which has a high peak (worst-case) power dissipation.
• There are 16 transitions over 16 clock cycles (average 1 transition per clock cycle).


Peak Power Dissipation

An eight-bit bus on which the eight lines toggle at different moments and which has a low peak power dissipation. There are the same 16 transitions over 16 clock cycles and thus the same average power dissipation.


Bus-Invert - Coding for low power

• The Bus-Invert method proposed here uses one extra control bit called invert. By convention, when invert = 0 the bus value will equal the data value. When invert = 1 the bus value will be the inverted data value. The peak power dissipation can then be decreased by half by coding the I/O as follows:

• 1. Compute the Hamming distance (the number of bits in which they differ) between the present bus value (also counting the present invert line) and the next data value.

• 2. If the Hamming distance is larger than n/2, set invert = 1 (and thus make the next bus value equal to the inverted next data value).

• 3. Otherwise, let invert = 0 (and let the next bus value equal the next data value).

• 4. At the receiver side the contents of the bus must be conditionally inverted according to the invert line, unless the data is stored in encoded form (e.g. in a RAM). In any case the value of invert must be transmitted over the bus (the method increases the number of bus lines from n to n + 1).
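The four steps above can be sketched in Python as follows. The function names, the integer representation of bus words, the all-zero initial bus state, and the sample data are assumptions of this sketch. The Hamming distance in step 1 is taken over all n + 1 lines, comparing the present (bus, invert) state with the candidate next state (data, invert = 0).

```python
def bus_invert_encode(words, n):
    """Encode n-bit data words; returns a list of (bus_value, invert) pairs."""
    mask = (1 << n) - 1
    bus, inv = 0, 0                      # assumed initial state of the n + 1 lines
    coded = []
    for d in words:
        # Step 1: Hamming distance, also counting the present invert line
        dist = bin((bus ^ d) & mask).count("1") + inv
        if dist > n / 2:                 # Step 2: send the inverted word
            bus, inv = (~d) & mask, 1
        else:                            # Step 3: send the word unchanged
            bus, inv = d & mask, 0
        coded.append((bus, inv))
    return coded


def bus_invert_decode(bus, inv, n):
    """Step 4: the receiver conditionally inverts according to the invert line."""
    mask = (1 << n) - 1
    return (~bus) & mask if inv else bus


def transitions(coded):
    """Number of lines (data lines plus invert) that toggle at each time slot."""
    prev_bus, prev_inv = 0, 0
    counts = []
    for bus, inv in coded:
        counts.append(bin(prev_bus ^ bus).count("1") + (prev_inv ^ inv))
        prev_bus, prev_inv = bus, inv
    return counts


# Example (assumed data): an 8-bit bus with several worst-case word pairs
words = [0x00, 0xFF, 0x0F, 0xF0, 0xAA, 0x55, 0x3C]
coded = bus_invert_encode(words, 8)
decoded = [bus_invert_decode(b, i, 8) for b, i in coded]
per_slot = transitions(coded)
```

For even n, either the distance is at most n/2 and the word is sent as is, or the inverted word differs from the present state in n + 1 minus that distance lines, which is again at most n/2; so no time slot toggles more than half of the data lines.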


Example

A typical eight-bit synchronous data bus. The transitions between two consecutive time slots are "clean". There are 64 transitions for a period of 16 time slots. This represents an average of 4 transitions per time slot, or 0.5 transitions per bus line per time slot.

SungKyunKwan Univ.

58VADA Lab.

Bus encoding

The same sequence of data coded using the Bus-Invert method. There are now only 53 transitions over a period of 16 time slots. This represents an average of 3.3 transitions per time slot, or 0.41 transitions per bus line per time slot. The maximum number of transitions for any time slot is now 4.

Comparisons

Comparison of unencoded I/O and coded I/O with one or more invert lines. The comparison looks at the average and maximum number of transitions per time-slot, per bus-line per time-slot, and I/O power dissipation for different bus-widths.


Remarks

• The increase in the delay of the data-path: by looking at the power-delay product, which removes the effect of frequency (delay) on power dissipation, a clear improvement is obtained in the form of an absolute lower number of transitions. It is also relatively easy to pipeline the bus activity. The extra pipeline stage and the extra latency must then be considered.

• The increased number of I/O pins: as was mentioned before, ground-bounce is a big problem for simultaneous switching in high-speed designs. That is why modern microprocessors use a large number of Vdd and GND pins. The Bus-Invert method has the side-effect of decreasing the maximum ground-bounce by approximately 50%. Thus circuits using the Bus-Invert method can use a lower number of Vdd and GND pins, and by using the method the total number of pins might even decrease.

• The Bus-Invert method decreases the total power dissipation although both the total number of transitions increases (counting the extra internal transitions) and the total capacitance increases (because of the extra circuitry). This is possible because the transitions get redistributed very nonuniformly, more on the low-capacitance side and less on the high-capacitance side.

References

[1] H. B. Bakoglu, Circuits, Interconnections and Packaging for VLSI, Addison-Wesley, 1990.
[2] T. K. Callaway, E. E. Swartzlander, "Estimating the Power Consumption of CMOS Adders", 11th Symp. on Comp. Arithmetic, pp. 210-216, Windsor, Ontario, 1993.
[3] A. P. Chandrakasan, S. Sheng, R. W. Brodersen, "Low-Power CMOS Digital Design", IEEE Journal of Solid-State Circuits, pp. 473-484, April 1992.
[4] A. P. Chandrakasan, M. Potkonjak, J. Rabaey, R. W. Brodersen, "HYPER-LP: A System for Power Minimization Using Architectural Transformations", ICCAD-92, pp. 300-303, Nov. 1992, Santa Clara, CA.
[5] A. P. Chandrakasan, M. Potkonjak, J. Rabaey, R. W. Brodersen, "An Approach to Power Minimization Using Transformations", IEEE VLSI for Signal Processing Workshop, 1992, CA.
[6] S. Devadas, K. Keutzer, J. White, "Estimation of Power Dissipation in CMOS Combinational Circuits", IEEE Custom Integrated Circuits Conference, pp. 19.7.1-19.7.6, 1990.
[7] D. Dobberpuhl et al., "A 200-MHz 64-bit Dual-Issue CMOS Microprocessor", IEEE Journal of Solid-State Circuits, pp. 1555-1567, Nov. 1992.
[8] R. J. Fletcher, "Integrated Circuit Having Outputs Configured for Reduced State Changes", U.S. Patent no. 4,667,337, May 1987.
[9] D. Gajski, N. Dutt, A. Wu, S. Lin, High-Level Synthesis, Introduction to Chip and System Design, Kluwer Academic Publishers, 1992.
[10] J. S. Gardner, "Designing with the IDT SyncFIFO: the Architecture of the Future", 1992 Synchronous (Clocked) FIFO Design Guide, Integrated Device Technology AN-60, pp. 7-10, 1992, Santa Clara, CA.
[11] A. Ghosh, S. Devadas, K. Keutzer, J. White, "Estimation of Average Switching Activity in Combinational and Sequential Circuits", Proceedings of the 29th DAC, pp. 253-259, June 1992, Anaheim, CA.
[12] J. L. Hennessy, D. A. Patterson, Computer Architecture - A Quantitative Approach, Morgan Kaufmann Publishers, Palo Alto, CA, 1990.
[13] S. Kodical, "Simultaneous Switching Noise", 1993 IDT High-Speed CMOS Logic Design Guide, Integrated Device Technology AN-47, pp. 41-47, 1993, Santa Clara, CA.
[14] F. Najm, "Transition Density, A Stochastic Measure of Activity in Digital Circuits", Proceedings of the 28th DAC, pp. 644-649, June 1991, Anaheim, CA.
[16] A. Park, R. Maeder, "Codes to Reduce Switching Transients Across VLSI I/O Pins", Computer Architecture News, pp. 17-21, Sept. 1992.
[17] Rambus - Architectural Overview, Rambus Inc., Mountain View, CA, 1993.
[18] A. Shen, A. Ghosh, S. Devadas, K. Keutzer, "On Average Power Dissipation and Random Pattern Testability", ICCAD-92, pp. 402-407, Nov. 1992, Santa Clara, CA.
[19] M. R. Stan, "Shift register generators for circular FIFOs", Electronic Engineering, pp. 26-27, February 1991, Morgan Grampian House, London, England.
[20] M. R. Stan, W. P. Burleson, "Limited-weight codes for low power I/O", International Workshop on Low Power Design, April 1994, Napa, CA.
[21] J. Tabor, Noise Reduction Using Low Weight and Constant Weight Coding Techniques, Master's Thesis, EECS Dept., MIT, May 1990.
[22] W.-C. Tan, T. H.-Y. Meng, "Low-power polygon renderer for computer graphics", Int. Conf. on A.S.A.P., pp. 200-213, 1993.
[23] N. Weste, K. Eshraghian, Principles of CMOS VLSI Design, A Systems Perspective, Addison-Wesley Publishing Company, 1988.
[24] R. Wilson, "Low power and paradox", Electronic Engineering Times, p. 38, November 1, 1993.
[25] J. Ziv, A. Lempel, "A universal Algorithm for Sequential Data Compression", IEEE Trans. on Inf. Theory, vol. IT-23, pp. 337-343, 1977.

DesignPower Gate Level Power Model

◈ Switching Power
  – Power dissipated when a load capacitance (gate + wire) is charged or discharged at the driver's output
  – If the technology library contains the correct capacitance value of the cell and if the capacitive_load_unit attribute is specified, then no additional information is needed for switching power modeling
  – Output pin capacitance need not be modeled if the switching power is incorporated into the internal power

$$P_{sw} = \frac{V^2}{2} \sum_{\forall\ nets\ i} \left[ C_i \times TR_i \right]$$


◈ Internal Power
  – Power dissipated internal to a library cell
  – Modeled using an energy lookup table indexed by input transition time and output load
  – Library cells may contain one or more internal energy lookup tables

$$P_{int} = \sum_{\forall\ cells\ i} \left[ E_{int}(outputload,\ inputtransition)_i \times TR_i \right]$$


◈ Leakage Power
  – The leakage power model supports a single value for each library cell
  – State-dependent leakage power is not supported

$$P_{leak} = \sum_{\forall\ cells\ i} P_{leak_i}$$


Operand Isolation

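[Figure: two block diagrams. Before: an FSM drives the enable (EN) of the output register bank; the combinational logic between the n-bit input and the m-bit register bank (Data_out) dissipates significant power. After: a latch (gate G) on the n-bit input, controlled by the FSM, isolates the combinational logic.]

• Combinational logic dissipates significant power when its output is unused

• Inputs to the combinational logic are held stable when the output is unused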

Operand Isolation Example - Diagram

[Figure: before and after operand isolation. Before: the FSM and a latch generate Load_En, Load_En_Latched and Clk_En; ADD computes the 8-bit Data_Add from a and b, MUL computes the 16-bit Data_Mul from Data_Add and c, and the data register, clocked by Clk_En, drives do. After: an additional latch gated by Load_En_Latched holds the 8-bit Iso_Data_Add between ADD and MUL.]

Operand Isolation Example - Before Code

Library IEEE;
Use IEEE.STD_LOGIC_1164.ALL;
Use IEEE.STD_LOGIC_SIGNED.ALL;

Entity Logic is
  Port(
    a, b, c : in std_logic_vector(7 downto 0);
    do : out std_logic_vector(15 downto 0);
    rst : in std_logic;
    clk : in std_logic
  );
End Logic;

Architecture Behave of Logic is
  Signal Count : integer;
  Signal Load_En : std_logic;
  Signal Load_En_Latched : std_logic;
  Signal Clk_En : std_logic;
  Signal Data_Add : std_logic_vector(7 downto 0);
  Signal Data_Mul : std_logic_vector(15 downto 0);
Begin

  Process(clk,rst) -- Counter Logic in FSM
  Begin
    If(clk='1' and clk'event) then
      If(rst='0') then
        Count <= 0;
      Elsif(Count=9) then
        Count <= 0;
      Else
        Count <= Count + 1;
      End If;
    End If;
  End Process;

  Process(Count) -- Enable Logic in FSM
  Begin
    If(Count=9) then
      Load_En <= '1';
    Else
      Load_En <= '0';
    End If;
  End Process;

  Process(clk,Load_En) -- Latch (for Deglitch) Logic
  Begin
    If(clk='0') then
      Load_En_Latched <= Load_En;
    End If;
  End Process;

  Clk_En <= clk and Load_En_Latched;

  Data_Add <= a + b;

  Data_Mul <= Data_Add * c;

  Process(Data_Mul,Clk_En) -- Data Reg Logic
  Begin
    If(Clk_En='1' and Clk_En'event) then
      do <= Data_Mul;
    End If;
  End Process;

End Behave;

Configuration CFG_Logic of Logic is
  for Behave
  End for;
End CFG_Logic;

Operand Isolation Example - After Code

Library IEEE;
Use IEEE.STD_LOGIC_1164.ALL;
Use IEEE.STD_LOGIC_SIGNED.ALL;

Entity Logic1 is
  Port(
    a, b, c : in std_logic_vector(7 downto 0);
    do : out std_logic_vector(15 downto 0);
    rst : in std_logic;
    clk : in std_logic
  );
End Logic1;

Architecture Behave of Logic1 is
  Signal Count : integer;
  Signal Load_En : std_logic;
  Signal Load_En_Latched : std_logic;
  Signal Clk_En : std_logic;
  Signal Data_Add : std_logic_vector(7 downto 0);
  Signal Data_Mul : std_logic_vector(15 downto 0);
  Signal Iso_Data_Add : std_logic_vector(7 downto 0);
Begin

  Process(clk,rst) -- Counter Logic in FSM
  Begin
    If(clk='1' and clk'event) then
      If(rst='0') then
        Count <= 0;
      Elsif(Count=9) then
        Count <= 0;
      Else
        Count <= Count + 1;
      End If;
    End If;
  End Process;

  Process(Count) -- Enable Logic in FSM
  Begin
    If(Count=9) then
      Load_En <= '1';
    Else
      Load_En <= '0';
    End If;
  End Process;

  Process(clk,Load_En) -- Latch (for Deglitch) Logic
  Begin
    If(clk='0') then
      Load_En_Latched <= Load_En;
    End If;
  End Process;

  Clk_En <= clk and Load_En_Latched;

  Data_Add <= a + b;

  Process(Load_En_Latched,Data_Add) -- Latch for Operand Isolation
  Begin
    If(Load_En_Latched='1' and Load_En_Latched'event) then
      Iso_Data_Add <= Data_Add;
    End If;
  End Process;

  Data_Mul <= Iso_Data_Add * c;

  Process(Data_Mul,Clk_En) -- Data Reg Logic
  Begin
    If(Clk_En='1' and Clk_En'event) then
      do <= Data_Mul;
    End If;
  End Process;

End Behave;

Operand Isolation Example - Report

[Figure: power reports for the before code and the after code.]

Precomputation

• Power saving
  – Reduces power dissipation of combinational logic
  – Reduces internal power to precomputed registers
• Opportunity
  – Can be significant, dependent on:
    • percentage of time latch precomputation is successful
• Cost
  – Increases area
  – Impacts circuit timing
  – Increases design complexity
    • number of bits to precompute
  – Testability
    • may generate redundant logic

Precomputation

[Figure: two register-bank architectures. Left: an n-bit input register bank feeds the combinational logic and a p-bit output register bank (Data_out); the entire function is computed. Right: m of the n input bits feed a precomputation block whose 1-bit output drives the enable (EN) of the register bank holding the remaining n - m bits; a smaller function is defined and the enable is precomputed.]

Precomputation

• Before Precomputation Diagram

[Figure: 8-bit inputs a and b are registered on CLK and feed an a > b comparator; its 1-bit result is registered as Data_out.]

Precomputation

• After Precomputation Diagram

[Figure: a(7) and b(7) are registered on CLK every cycle; a 1-bit precomputation signal derived from a(7) and b(7) passes through a latch and gates the clock of the 7-bit registers for a(6:0) and b(6:0); the a > b comparator output is registered as Data_out.]

Precomputation

• Before Precomputation - Report

Precomputation

• After Precomputation - Report

Precomputation Example - Before Code

Library IEEE;
Use IEEE.STD_LOGIC_1164.ALL;

Entity before_precomputation is
  port (
    a, b : in std_logic_vector(7 downto 0);
    CLK : in std_logic;
    D_out : out std_logic
  );
end before_precomputation;

Architecture Behav of before_precomputation is
  signal a_in, b_in : std_logic_vector(7 downto 0);
  signal comp : std_logic;
Begin

  -- Input registers and output register
  process (CLK)
  Begin
    if (CLK = '1' and CLK'event) then
      a_in <= a;
      b_in <= b;
      D_out <= comp;
    end if;
  end process;

  -- 8-bit comparator on the registered inputs
  comp <= '1' when (a_in > b_in) else '0';

end Behav;

Precomputation Example - After Code

Library IEEE;
Use IEEE.STD_LOGIC_1164.ALL;

Entity after_precomputation is
  port (
    a, b : in std_logic_vector(7 downto 0);
    CLK : in std_logic;
    D_out : out std_logic
  );
end after_precomputation;

Architecture Behav of after_precomputation is
  signal a_in, b_in : std_logic_vector(7 downto 0);
  signal pcom, pcom_D : std_logic;
  signal CLK_en, comp : std_logic;
Begin

  -- MSB registers and output register are clocked every cycle
  process (CLK)
  Begin
    if (CLK='1' and CLK'event) then
      a_in(7) <= a(7);
      b_in(7) <= b(7);
      D_out <= comp;
    end if;
  end process;

  -- Precomputed enable: the lower bits matter only when the MSBs are equal
  pcom <= not (a(7) xor b(7));

  -- Latch the enable while CLK is low so the gated clock is glitch-free
  process (CLK, pcom)
  Begin
    if (CLK='0') then
      pcom_D <= pcom;
    end if;
  end process;

  CLK_en <= pcom_D and CLK;

  -- Lower-bit registers use the gated clock
  process (CLK_en)
  Begin
    if (CLK_en='1' and CLK_en'event) then
      a_in(6 downto 0) <= a(6 downto 0);
      b_in(6 downto 0) <= b(6 downto 0);
    end if;
  end process;

  comp <= '1' when (a_in > b_in) else '0';

end Behav;

Peak Power Reduction

• Peak power is related to EMI
• Reducing concurrent switching reduces peak power
  – Adjust the delay within the speed of the system clock in the bus/port driver
  – Consider the power consumption of the delay element
  – While maintaining total power consumption, EMI is improved by peak power reduction

[Figure: total current Itotal versus time t for an n-bit-wide bus. Before peak power reduction: all n bits switch together, giving one large current pulse (energy E1). After peak power reduction: the bit transitions are staggered by delay elements, giving a lower, wider pulse (energy E2).]

$$E = V_{dd} \int_{t_1}^{t_2} I_{total}\, dt$$

Factoring Example

Function: f = ab + bc + cd. The function f is not on the critical path. The signals a, b, c and d all have the same bit width. Signal b is a high-activity net. The two factorings shown are equivalent in both timing and area. Net result: network toggling and power are reduced.

[Figure: two gate-level networks with inputs a, b, c, d and output f, both labeled f = b(a + c) + cd.]
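The equivalence of alternative factorings can be checked by exhaustive enumeration. The sketch below assumes the function is f = ab + bc + cd (the expression that the factoring b(a + c) + cd expands to); the second factoring, ab + c(b + d), is a hypothetical alternative chosen for illustration and is not taken from the slide.

```python
from itertools import product


def f_sop(a, b, c, d):
    """Sum-of-products form: f = ab + bc + cd."""
    return (a and b) or (b and c) or (c and d)


def f_fact1(a, b, c, d):
    """f = b(a + c) + cd: the high-activity net b drives a single gate."""
    return (b and (a or c)) or (c and d)


def f_fact2(a, b, c, d):
    """f = ab + c(b + d): illustrative alternative in which b drives two gates."""
    return (a and b) or (c and (b or d))


# Compare all 16 input combinations.
equivalent = all(
    bool(f_sop(*v)) == bool(f_fact1(*v)) == bool(f_fact2(*v))
    for v in product([False, True], repeat=4)
)
```

Both factorings compute the same function, so the choice between them can be made on switching activity alone: the form in which the busy net fans out to fewer gates toggles less capacitance.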


Block diagram of a low-voltage, high-speed LSI

• The Power Management Processor (PMP) controls the low-Vt circuit using the sleep signal.
• Extend the sleep period as much as possible, because leakage power is reduced during this time.

Operations of low-Vt LSI

The circuit receives a request signal from an I/O device, outputs the results, and waits for the next request signal. During the waiting period, the low-Vt circuit can sleep.

Waking/Sleeping operation

Waking operation Sleeping operation


Creating sleep period: Operation during calculation

• Heavy operations such as voice CODEC and light operations such as data collection can be distributed between the low-Vt circuit and the PMP, and the low-Vt circuit can sleep while the PMP is executing light operations.
• This reduces the power by 10%.

Low Power Logic Gate Resynthesis on Mapped Circuit

Hyun Sang Kim, Jun Dong Cho
School of Electrical and Computer Engineering
Sungkyunkwan University

Transition Probability

• Transition probability: probability of a transition at the output of a gate, given a change at the inputs
• Use signal probabilities
• Example: F = X'Y + XY'
  – Signal prob. of F: Pf = Px(1-Py) + (1-Px)Py
  – Transition prob. of F = 2Pf(1-Pf)
  – Assumption of independence of inputs
• Use BDDs to compute these
• References: Najm'91
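As a small worked example of the two formulas for F = X'Y + XY', the following Python sketch evaluates the signal and transition probabilities directly. The function names and the sample input probabilities are illustrative assumptions; inputs are taken to be independent, as stated on the slide.

```python
def xor_signal_probability(px: float, py: float) -> float:
    """Pf = Px(1 - Py) + (1 - Px)Py for F = X'Y + XY' with independent inputs."""
    return px * (1 - py) + (1 - px) * py


def transition_probability(pf: float) -> float:
    """Probability that the output differs between two independent samples: 2Pf(1 - Pf)."""
    return 2 * pf * (1 - pf)


# Equiprobable inputs give the maximum switching activity.
pf_uniform = xor_signal_probability(0.5, 0.5)
pt_uniform = transition_probability(pf_uniform)

# Skewed inputs (hypothetical values) switch less often.
pf_skewed = xor_signal_probability(0.1, 0.2)
pt_skewed = transition_probability(pf_skewed)
```

With Px = Py = 0.5 the output probability is 0.5 and the transition probability reaches its maximum of 0.5; with Px = 0.1 and Py = 0.2 it drops to about 0.385.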


Technology Mapping

• Implementing a Boolean network in terms of gates from a given library
• Popular technique: tree-based mapping
• Library gates and circuits decomposed into canonical patterns
• Pattern matching and dynamic programming to find the best cover
• NP-complete for general DAG circuits
• Ref: Keutzer'87, Rudell'89
• Idea: high transition probability points are hidden within gates

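Low Power Cell Mapping

• Example of a high switching activity node
• Internal mapping in a complex gate

[Figure: a network with inputs A, B, C, D, internal node Q and output Y, and the same function mapped to a single complex gate so that the high-activity node Q becomes internal to the gate.]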

Signal Probability vs. Power

[Plot: power P(x) ∝ p(x)(1 - p(x)) versus signal probability p(x) from 0.0 to 1.0, with the maximum at p(x) = 0.5 and the regions p(x) < 0.5 and p(x) > 0.5 marked.]

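Spatial Correlation

[Figure: two versions of the same gate network. Without correlation: P(x) = 0.25, P(y) = 0.25, giving P(z) = 0.4375. With correlation: inputs a, b, c with probability 0.5 each (labeled P(b), P(c), P(d) in the figure), P(x) = 0.25, P(y) = 0.25, giving P(z) = 0.375 because x and y share an input.]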
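The two values of P(z) in the figure can be reproduced by enumeration. The sketch below assumes, as an interpretation of the garbled figure rather than a confirmed reading, that x = a AND b, y = b AND c and z = x OR y, with independent inputs of probability 0.5; the helper name `prob` is hypothetical.

```python
from itertools import product


def prob(fn, n_inputs: int) -> float:
    """Exact probability that fn is 1 for independent inputs with P = 0.5."""
    hits = sum(1 for v in product([0, 1], repeat=n_inputs) if fn(*v))
    return hits / 2 ** n_inputs


p_x = prob(lambda a, b: a & b, 2)
p_y = prob(lambda b, c: b & c, 2)

# Treating x and y as independent ignores the shared input b.
p_z_independent = 1 - (1 - p_x) * (1 - p_y)

# Exact value, enumerating the primary inputs a, b, c.
p_z_exact = prob(lambda a, b, c: (a & b) | (b & c), 3)
```

Under these assumptions the independence approximation gives 0.4375 while the exact value is 0.375, matching the two numbers in the figure.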


Low Power Logic Synthesis

[Figure: logic synthesis flow. RTL description → logic equation → technology-independent optimization → technology mapping → connection of gates → resynthesis on mapped circuit → gate-level description, with timing and power analysis tools supporting the flow.]

Technology Mapping

[Figure: three mappings (a), (b), (c) of the same network, with nodes marked h (high switching activity node) or l (low switching activity node).]

Tree Decomposition

[Figure: two decompositions (a) and (b) of an AND tree with output f; (b) is the low-power decomposition. Legend: gate (AND), primary input, critical path, output f.]

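Huffman Algorithm

[Figure: Huffman tree. Leaves x1, x2, x3, x4 with weights 2, 3, 4, 4 and x5 with weight 10; y1 = x1 + x2 = 5, y2 = x3 + x4 = 8, y3 = y1 + y2 = 13, root = y3 + x5 = 23.]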
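The merge order in the figure is what the standard Huffman construction produces for the weights 2, 3, 4, 4, 10. A minimal sketch using Python's `heapq` (the function name is an assumption made for illustration):

```python
import heapq


def huffman_merge_weights(weights):
    """Repeatedly merge the two lightest nodes; return the internal node weights in merge order."""
    heap = list(weights)
    heapq.heapify(heap)
    merged = []
    while len(heap) > 1:
        a = heapq.heappop(heap)   # lightest remaining node
        b = heapq.heappop(heap)   # second lightest
        merged.append(a + b)
        heapq.heappush(heap, a + b)
    return merged


merges = huffman_merge_weights([2, 3, 4, 4, 10])
total_internal_weight = sum(merges)
```

Here `merges` lists the internal nodes 5, 8, 13, 23 in the order they are created, and their sum (49) is the quantity the Huffman construction minimizes.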


Depth-Constrained Decomposition

• Algorithm
• Problem: minimize \sum_{i=1}^{m} p_t(x_i)
• Input: input signal probabilities (p1, p2, ..., pn), height (h), number of leaf nodes (n), fan-in limit per gate (k)
• Output: k-ary tree topology

Begin
  sort (signal probabilities p1, p2, ..., pn);
  while (n != 0)
    if (h > log_k n)
      assign k nodes to level L (= h+1);
      h = h-1, n = n-(k-1);   /* upward */
    else if (h < log_k n)
      assign k nodes to level L (= h+2);   /* the previous level */
      h = h, n = n-(k-1);   /* downward */
    else   /* h = log_k n */
      assign the remaining nodes to level L (= h+1);
      /* complete: all remaining nodes go to level L (= h+1), forming a complete k-ary tree */
  for (bottom level L; L > 1; L--)
    min_edge_weight_matching (nodes in level L);
End

Example

[Figure: step-by-step construction for h = 1, h = 2 and h = 3 over levels L = 0 to L = 3, with leaf probabilities a = 0.1, b = 0.2, c = 0.3, d = 0.4, e = 0.5, f = 0.6. Before matching: x pairs (a, b) = (0.1, 0.2) and y pairs (c, d) = (0.3, 0.4). After matching: x pairs (a, d) = (0.1, 0.4) and y pairs (b, c) = (0.2, 0.3).]

After Decomposition

[Bar chart, K1 = 2: value and improvement ratio (y axis, 0 to 16) versus fan-in and height (x axis; fan-in 6, 10, 20 with heights h = 3 to h = 9), comparing SIS, SIS+OURS and the improvement ratio.]

After Tech. Mapping

[Bar chart, K1 = 3, k2 = 3: power (mW) and improvement ratio (y axis, 0 to 80) versus fan-in and height (x axis; fan-in 6, 10, 15, 20 with heights h = 2 to h = 8), comparing SIS+LEVEL MAP, SIS+OURS+LEVEL MAP and the improvement ratio.]

7. Circuit Level Design


Buffer Chain

• Delay analysis of buffer chain

[Figure: chain of n inverter stages from the input to the load. Stage sizes 1, β, ..., β^(i-2), β^(i-1), ..., β^(n-1); input capacitances C_in, βC_in, ..., β^(i-1)C_in, β^i C_in, ..., C_L = β^n C_in.]

$$\beta = \frac{(W/L)_k}{(W/L)_{k-1}}, \qquad C_L = \beta^n C_{in}, \qquad n = \frac{\ln(C_L/C_{in})}{\ln \beta}$$

$$T_d = \sum_{k=1}^{n} t_k = n \beta t_0 = \beta t_0 \frac{\ln(C_L/C_{in})}{\ln \beta}$$

$$\frac{\partial T_d}{\partial \beta} = 0 \ \Rightarrow\ \beta_{optimum} = e \approx 2.72, \qquad n_{optimum} = \ln(C_L/C_{in})$$

• Delay analysis considering parasitic capacitance, Cp

$$P_k = C_k V_{dd}^2 f, \qquad P_T = \sum_{k=1}^{n} P_k, \qquad Eff = \frac{P_n}{P_T} \quad (\text{typical } \beta: 2 \sim 10)$$

Ck, Pk: total capacitance and power at the output of the stage-k buffer

PT: power consumption of the buffer chain

Pn: power consumption of the load capacitance CL

Eff: power efficiency Pn/PT
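The classical sizing result for a tapered buffer chain (delay-optimal stage ratio of about e, with ln(CL/Cin) stages) can be illustrated numerically. The sketch assumes the simple model T_d = n·β·t_0 with C_L = β^n·C_in and no parasitic capacitance; the function name and the capacitance values are hypothetical.

```python
import math


def buffer_chain(c_load: float, c_in: float, t0: float = 1.0):
    """Pick an integer stage count near ln(CL/Cin) and the matching stage ratio."""
    ratio = c_load / c_in
    n = max(1, round(math.log(ratio)))   # ideal n = ln(CL/Cin), rounded to an integer
    beta = ratio ** (1.0 / n)            # stage ratio so that beta**n == CL/Cin
    delay = n * beta * t0                # T_d = n * beta * t0 (no parasitics)
    return n, beta, delay


# Hypothetical example: drive a load 1000 times the input capacitance.
n_stages, stage_ratio, total_delay = buffer_chain(c_load=1000.0, c_in=1.0)

# For comparison, a single stage sized directly for the load (n = 1, beta = 1000).
single_stage_delay = 1 * 1000.0 * 1.0
```

For a load ratio of 1000 this gives seven stages with a ratio close to e, and a total delay far below that of a single oversized stage. Larger ratios trade some delay for fewer stages and less chain power, which is why practical designs often use a ratio above e.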


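Slew Rate

• Determining rise/fall time

[Figure: input waveform V_in over one period T with rise time t_r and fall time t_f; the short-circuit current I_short flows between t1 and t3 (peak I_max at t2, mean value I_mean) while V_tn < V_in < V_dd + V_tp.]

$$I_{mean} = \frac{2}{T}\left[\int_{t_1}^{t_2} I_{short}(t)\,dt + \int_{t_2}^{t_3} I_{short}(t)\,dt\right] = \frac{4}{T}\int_{t_1}^{t_2} I_{short}(t)\,dt = \frac{4}{T}\int_{t_1}^{t_2} \frac{\beta}{2}\left(V_{in}(t) - V_t\right)^2 dt$$

where $V_{tn} = -V_{tp} = V_t$ and $t_r = t_f = \tau$.

$$P_{SC} = I_{mean} V_{dd} = \frac{\beta}{12}\left(V_{dd} - 2V_t\right)^3 \tau f$$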

Slew Rate (Cont'd)

• Power consumption of short-circuit current in an oscillation circuit

[Figure: oscillation circuit built from inverters between Vdd and ground, with input Vi and output Vo, and the corresponding Vi/Vo waveforms.]

Pass Transistor Logic

• Reducing area/power
  – Macro cells (a large part of the chip area): XOR/XNOR/MUX (primitive) → pass transistor logic
  – Does not use a charge/discharge scheme → appropriate for low-power logic
• Pass transistor logic family
  – CPL (Complementary Pass Transistor Logic)
  – DPL (Dual Pass Transistor Logic)
  – SRPL (Swing Restored Pass Transistor Logic)
• CPL
  – Basic scheme
  – Inverter buffering

[Figure: CPL circuits with inputs A, B and their complements producing AB and its complement; variants with inverter buffering and with a p-MOS latch connected to Vdd.]

Pass Transistor Logic (Cont'd)

• DPL
  – Pass transistor network + dual p-MOS
  – Enables rail-to-rail swing
  – Characteristics
    • Increased input capacitance (delay)
    • Increased driving ability because two ON-paths exist
    • Equals CPL in input loading capacitance
• SRPL
  – Pass transistor network + cross-coupled inverter
  – Restores the logic level
  – Inverter size must not be too big

[Figure: DPL gate with inputs A, B and their complements, and SRPL gate built from an n-MOS CPL network followed by cross-coupled inverters.]

Dynamic Logic

• Uses a precharge/evaluation scheme
• Family
  – Domino logic
  – NORA (NO RAce) logic
• Characteristics
  – Decreased input loading capacitance
  – Power consumption in the precharge clock
  – Increased useless switching in the precharging period
• Basic architecture of Domino logic

[Figure: Domino gate with p-MOS precharge transistor P1 and n-MOS evaluation transistor N1 driven by clk, an N logic block with inputs A and B, input capacitance C_in and load C_L; timing diagram showing the precharge and evaluation phases.]

Input Pin Ordering

• Reorder the equivalent inputs to a transistor based on critical path delays and power consumption
• N-input primitive CMOS logic
  – symmetrical at the function level
  – antisymmetrical at the transistor level
    • capacitance of output stage
    • body effect
• Scheme
  – The signal that has many transitions must be far from the output
  – If it is hard to estimate the switching frequency, determine the pin ordering by considering the paths and the path delay balance from the primary inputs to the transistor inputs
• Example of N-input CMOS logic

[Figure: N-input CMOS gate with inputs A, B, C, D, internal node capacitances C1, C2, C3 and load CL.]

Experiment with a TI gate array: for a 4-input NAND gate in TI's BiCMOS gate array library (with a load of 13 inverters), the delay varies by 20% and the power dissipation by 10% between a good and a bad ordering.

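INPUT PIN Reordering

[Figure: 4-input CMOS NAND gate with p-MOS transistors MPA to MPD connected to VDD, n-MOS transistors MNA to MND in series from the output (load CL) to ground, and internal node capacitances CB, CC, CD; panels (a) to (d) show four switching cases of the inputs A, B, C, D.]

Simulation result (tcycle = 50 ns, tf/tr = 1 ns): 38.4 µW when A is the critical input, 47.2 µW when D is the critical input.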

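Sensitization

• Definition
  – sensitization: input signal that forces an output transition event
  – sensitization vector: the values of the other inputs when one signal is sensitized

$$f[X_i = 1] = f(X_1, \ldots, X_{i-1}, 1, X_{i+1}, \ldots, X_n), \qquad f[X_i = 0] = f(X_1, \ldots, X_{i-1}, 0, X_{i+1}, \ldots, X_n)$$

$$\frac{\partial Y}{\partial X_i} = f[X_i = 1] \oplus f[X_i = 0]$$

• Example

[Figure: gate network with inputs X1, X2, X3 and output Y.]

$$Y = (X_1 + X_2)\, X_3$$

$$\frac{\partial Y}{\partial X_1} = f[X_1 = 1] \oplus f[X_1 = 0] = X_3 \oplus X_2 X_3 = \overline{X_2}\, X_3$$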
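The sensitization vectors of an input can be found by evaluating the Boolean difference f[Xi=1] XOR f[Xi=0] over all values of the other inputs. The sketch below assumes the example function is Y = (X1 + X2)·X3, which is a reconstruction of the garbled slide; the helper name `boolean_difference` is hypothetical.

```python
from itertools import product


def boolean_difference(f, i: int, n: int):
    """Map each assignment of the other n - 1 inputs to f[Xi=1] XOR f[Xi=0]."""
    table = {}
    for rest in product([0, 1], repeat=n - 1):
        hi = f(*rest[:i], 1, *rest[i:])   # input i forced to 1
        lo = f(*rest[:i], 0, *rest[i:])   # input i forced to 0
        table[rest] = hi ^ lo
    return table


def y(x1, x2, x3):
    """Assumed example function Y = (X1 + X2) * X3."""
    return (x1 | x2) & x3


# Boolean difference with respect to X1 (index 0); keys are (X2, X3).
diff_x1 = boolean_difference(y, 0, 3)
sensitization_vectors = [rest for rest, v in diff_x1.items() if v]
```

Under that assumption the only sensitization vector for X1 is (X2, X3) = (0, 1): a transition on X1 reaches Y only when X2 is 0 and X3 is 1.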


Sensitization (Cont'd)

• Considering sensitization in combinational logic: removes unnecessary transitions in the combinational logic

• Considering sensitization in sequential logic: also reduces the power consumption in the flip-flops

[Figure: combinational logic blocks with inputs X1 to Xn and output Y, shown without and with an enable signal E gating the inputs; sequential version in which E gates the input flip-flops (D, Q) clocked by clk.]

TTL-Compatible

• TTL-level signal → CMOS input
• Characteristic curve of CMOS inverter

[Figure: CMOS inverter with Vdd = 3.3 V and its transfer curve Vo versus Vi, with VIL = 0.8 V, VIH = 2.0 V and a switching threshold of 1.4 V; the static currents IDTTL1 and IDTTL2 flow at the TTL input levels, and Ileak = avg(Id1, Id2).]

$$P_{TTL} = N_{TTL}\, V_{dd}\, (I_{DTTL1} + I_{DTTL2})$$

where N_TTL is the number of TTL-compatible input pads.

TTL Compatible (Cont'd)

• CMOS output signal → TTL input
  – Because of the sink current IOL, the CMOS output dissipates a large amount of heat
  – Increased chip operating temperature
  – Increased power consumption of the whole system

[Figure: CMOS output pad driving a TTL input pad across the chip boundaries, with output low voltage VOL and sink current IOL.]

INPUT PIN Reordering

◈ To reduce the power dissipation, one should place the input with low transition density near the ground end.

(a) If MNA turns off, only CL needs to be charged.
(b) If MND turns off, all of CL, CB, CC and CD need to be charged.
(c) If the critical input is rising and placed near the output node, the initial charges of CB, CC and CD are zero and the delay time of CL discharging is less than in (d).
(d) If the critical input is rising and placed near the ground end, the charges of CB, CC and CD must discharge before the charge of CL discharges to zero.

Low-Power Booth Multiplier Design

Sungkyunkwan University, School of Electrical and Computer Engineering
Jin Hyuk Kim, Jun Sung Lee, Jun Dong Cho

Modified Booth Multiplier

• Multibit recoding halves the number of partial products, enabling high-speed multiplication.
• Multiplicand: X, multiplier: Y

Recoded digit = Y2i-1 + Y2i - 2Y2i+1 (Y-1 = 0)

< Generation and operation of recoded digit >

Recoded digit   Operation on X
 0              Add 0 to the partial product
+1              Add X to the partial product
+2              Shift X left one position and add it to the partial product
-1              Add the two's complement of X to the partial product
-2              Take the two's complement of X and shift left one position

Y2i+1  Y2i  Y2i-1   Recoded digit   Operation on X
0      0    0        0               0X
0      0    1       +1              +1X
0      1    0       +1              +1X
0      1    1       +2              +2X
1      0    0       -2              -2X
1      0    1       -1              -1X
1      1    0       -1              -1X
1      1    1        0               0X
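The recoding rule (recoded digit = Y2i-1 + Y2i - 2Y2i+1, with Y-1 = 0) can be checked with a short Python model. This is a behavioral sketch only, not the multiplier hardware; the function names are assumptions, and the operands are the 8-bit values X = -107 and Y = +105.

```python
def booth_recode(y: int, bits: int):
    """Radix-4 Booth digits of a two's-complement multiplier (bits must be even), LSB digit first."""
    y &= (1 << bits) - 1

    def bit(k: int) -> int:
        return 0 if k < 0 else (y >> k) & 1   # Y(-1) is defined as 0

    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1)
            for i in range(bits // 2)]


def booth_multiply(x: int, y: int, bits: int) -> int:
    """Sum of the recoded partial products; digit i carries weight 4**i."""
    return sum(d * x * 4 ** i for i, d in enumerate(booth_recode(y, bits)))


digits = booth_recode(0b01101001, 8)          # Y = +105
result = booth_multiply(-107, 105, 8)         # X = -107
```

Only four partial products are needed for an 8-bit multiplier, one per recoded digit, instead of eight.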


Modified Booth Multiplier - Example

X = 10010101 (-107)
Y = 01101001 (+105)

Bits recoded   Operation   Partial product (sign-extended, shifted)
010            +1          1111111110010101
100            -2          0000001101011000
101            -1          0000011010110000
011            +2          1100101010000000

P = 1101010000011101 (-11235)

Wallace Tree - 4:2 Compressor

[Figure (a): dot diagram of the partial products of X7..X0 and Y7..Y0 reduced in a 1st stage and a 2nd stage to two summands to be added. Legend: zero, bit jumping level, partial product, bit generated by compressor.]

[Figure (b): block diagram. X3..X0 and X7..X4 (4 bits each) with the 8-bit Y feed two 4*8 partial product generators, each followed by 8 4-2 compressors (1st stage, blocks A and B); 11 4-2 compressors form the 2nd stage (block C); a 16-bit adder produces P15..P0.]

Multipliers - Area

• 16-bit Multiplier Area

Multiplier type    Area (mm2)   Gate count
Array              4.2          2,378
Wallace            8.1          2,544
Modified Booth     8.5          3,375

Multipliers - Power

• Average Power Dissipation (16-bit)

Multiplier type    Power (mW)   Logic transitions
Array              43.5         7,224
Wallace            32.0         3,793
Modified Booth     41.3         3,993

Multipliers - Delay

• Worst-Case Delay (16-bit)

Multiplier type    Delay (ns)   Gate delays
Array              92.6         50
Wallace            54.1         35
Modified Booth     45.4         32

Instruction Level Power Analysis

• Estimate the power dissipation of instruction sequences and of a program

• Eb: base cost of individual instructions
  Es: circuit state change effects

$$E_b = \sum_i B_i N_i, \qquad E_s = \sum_{i,j} O_{i,j} N_{i,j}$$

• EM: the overall energy cost of a program

$$E_M = E_b + E_s$$

  Bi: the base cost of a type i instruction
  Ni: the number of type i instructions
  Oi,j: the cost incurred when a type i instruction is followed by a type j instruction
  Ni,j: the number of occurrences in which a type i instruction is immediately followed by a type j instruction
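A small Python sketch shows how E_M = E_b + E_s is evaluated for an instruction sequence. The cost values are those listed for the 4 by 4 multiplier in the RESULT tables; the instruction sequence, the short name `COMP` for the 2's complement instruction, and the assumption that the state-change cost is symmetric (O_ij = O_ji) are illustrative choices, not taken from the slides.

```python
# Base cost B_i in pJ (4 by 4 multiplier); COMP stands for 2's complement.
BASE = {"LOAD": 1.46, "ADD": 0.86, "COMP": 0.77, "SHIFT": 0.29}

# Circuit state effects O_ij in pJ; only the upper triangle is tabulated.
OVERHEAD = {
    ("LOAD", "LOAD"): 0.18, ("LOAD", "ADD"): 1.20,
    ("LOAD", "COMP"): 1.08, ("LOAD", "SHIFT"): 0.73,
    ("ADD", "ADD"): 0.31, ("ADD", "COMP"): 0.49, ("ADD", "SHIFT"): 0.61,
    ("COMP", "COMP"): 0.27, ("COMP", "SHIFT"): 0.34,
    ("SHIFT", "SHIFT"): 0.15,
}


def overhead(i: str, j: str) -> float:
    """O_ij, assuming the tabulated cost applies in either order."""
    return OVERHEAD[(i, j)] if (i, j) in OVERHEAD else OVERHEAD[(j, i)]


def program_energy(sequence):
    """Return (E_b, E_s, E_M) for a list of instruction names."""
    e_b = sum(BASE[op] for op in sequence)
    e_s = sum(overhead(a, b) for a, b in zip(sequence, sequence[1:]))
    return e_b, e_s, e_b + e_s


# Hypothetical instruction sequence.
e_b, e_s, e_m = program_energy(["LOAD", "ADD", "SHIFT", "ADD"])
```

For this sequence the base cost is 3.47 pJ and the circuit state effects add 2.42 pJ, so ignoring the inter-instruction term would underestimate the energy noticeably.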


Instruction ordering

• Develop a technique of operand swapping
• Recoding weight: necessary operation cost of operands

• Wtotal: total recoding weight of the input operand
  Wi: weight of the individual recoded digit i in the Booth multiplier
  Wb: base weight of an instruction
  Winter: inter-operation weight of instructions

$$W_{total} = \sum_i W_i, \qquad W_i = W_b + W_{inter}$$

• Therefore, if an operand has a lower Wtotal, put it in the second input (multiplier).

RESULT

< 4 by 4 multiplier >

Instruction       Base cost [pJ]   Circuit state effects [pJ] when switched
                                   LOAD   ADD    2's complement   SHIFT
LOAD              1.46             0.18   1.20   1.08             0.73
ADD               0.86                    0.31   0.49             0.61
2's complement    0.77                           0.27             0.34
SHIFT             0.29                                            0.15

< 8 by 8 multiplier >

Instruction       Base cost [pJ]   Circuit state effects [pJ] when switched
                                   LOAD   ADD    2's complement   SHIFT
LOAD              3.25             0.40   2.67   2.38             1.63
ADD               1.91                    0.58   1.11             1.44
2's complement    1.72                           0.55             0.78
SHIFT             0.65                                            0.38

< 12 by 12 multiplier >

Instruction       Base cost [pJ]   Circuit state effects [pJ] when switched
                                   LOAD   ADD    2's complement   SHIFT
LOAD              4.81             0.59   3.96   3.57             2.40
ADD               2.83                    1.02   1.63             2.12
2's complement    2.55                           1.00             1.14
SHIFT             0.96                                            0.78

Conclusion

[Figure: two bar charts. Left: Power [pJ] for the 4-bit, 8-bit and 12-bit multipliers and their average, with circuit state effects not considered versus considered. Right: % of instances with circuit state effects for the 4-bit, 8-bit and 12-bit multipliers. Annotations: 4.0% reduction, 12.0% reduction, 9.0% reduction.]

8. Layout Level Design


Device Scaling by a Factor of S

• Constant scaled wire increases coupling capacitance by S and wire resistance by S
• Supply voltage by 1/S, threshold voltage by 1/S, current drive by 1/S
• Gate capacitance by 1/S, gate delay by 1/S
• Global interconnection delay, RC load + parasitics by S
• Interconnect delay: 50-70% of the clock cycle
• Area: 1/S²
• Power dissipation by 1/S to 1/S²
• (P = nCVdd²f, where nC is the sum of capacitance times #transitions)
• SIA (Semiconductor Industry Association): in 2007, physical limitation: 0.1 µm, 20 billion transistors, 10 square centimeters, 12 or 16 inch wafer

Delay Variations at Low-Voltage

• At high supply voltage, the delay increases with temperature (mobility decreases with temperature), while at very low supply voltages the delay decreases with temperature (VT decreases with temperature).

• At low supply voltages, the delay ratio between large and minimum transistor widths W increases by a factor of several.

• Clock trees are delay-balanced by wire snaking in order to avoid clock skew. In this case, at low supply voltages, slight VT variations can significantly modify the delay balancing.

Quarter Micron Challenge

• Computers/peripherals (SOC): 1996 ($50 billion), 1999 ($70 billion)
• Wiring dominates delay: wire R comparable to gate driver R; wire/wire coupling C > C to ground
• Push beyond 0.07 micron
• Quest for area (past), speed-speed (now), power-power-power (future)
• Accelerated increases of clock frequencies
• Signal integrity-based tools
• Design styles (chip + packages)
• System-level design (system partitioning)
• Synthesis with multiple constraints (power, area, timing)
• Partitioning/MCM
• Increasing speed limits complicate clock and power distribution
• Design bounded by wires, vias, via resistance, coupling
• Reverse scaling: adding area/spacing as needed: widening, thickening of wires, metal shielding and noise avoidance - adding metal

CLOCK POWER CONSUMPTION

• Clock power consumption is as large as the logic power. Because the clock signal carries the heaviest load and switches at high frequency, clock distribution is a major source of power dissipation.
• In a microprocessor, 18% of the total power is consumed by clocking.
• Clock distribution is designed as a hierarchical clock tree, according to the decomposition principle.

Power Consumption per block in typical microprocessor


Crosstalk


Solution for Clock Skew• Dynamic Effects on Skew

Capacitance Coupling• Supply Voltage Deviation (Clock

driver and receiver voltage difference)

• Capacitance deviation by circuit operation

• Global and local temperature• Layout Issues: clocks routed first• Must aware of all sources of delay• Increased spacing• Wider wires• Insert buffers• Specialized clock need net

matching• Two approaches: Single Driver, H-

tree driver

• Gated Clocks: The local clocks that are conditionally enabled so that the registers are only clocked during the write cycles. The clock is partitioned in different blocks and each block is clocked with its own clock.

• Gating the clocks to infrequently used blocks does not provide and acceptable level of power savings

• Divide the basic clock frequency to provide the lowest clock frequency needed to different parts of the circuit

• Clock Distribution: large clock buffer waste power. Use smaller clock buffers with a well-balanced clock tree.


PowerPC Clocking Scheme


CLOCK DRIVERS IN THE DEC ALPHA 21164


DRIVER for PADS or LARGE CAPACITANCES

Off-chip power (drivers and pads) is increasing and is very difficult to reduce, because pad and driver sizes cannot be decreased with new technologies.


Layout-Driven Resynthesis for Lower Power


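Low Power Process
• Dynamic Power Dissipation

[Figure: CMOS inverter (Vdd, Vin, Vo) with overlap capacitances C_ovp, C_ovn and drain junction capacitances C_djp, C_djn; drain region with dimensions W and D and junction capacitances C_jb, C_jsw.]

$$P_d = C_L V_{dd}^2 f$$

$$I_{ds} = \frac{\beta}{2}\,(V_{gs} - V_t)^2$$

$$C_{gate} = C_{ox} \sum_{i=1}^{n} (W L)_i$$

$$C_{in} = \sum_{j=1}^{m} (C_{gate})_j$$

$$C_{ov} = C_{GD}\, W$$

$$C_{dj} = C_j A_D + C_{jsw} P_D$$

$$A_D = W D, \qquad P_D = 2\,(W + D)$$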
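As a worked illustration of the dynamic-power slide, the sketch below evaluates the standard relations C_dj = C_j · A_D + C_jsw · P_D, with drain area A_D = W · D and drain perimeter P_D = 2 (W + D), and P_d = C_L · Vdd² · f. The function names and all numeric values are assumptions made for the example; they are not taken from the slides.

```python
def drain_junction_capacitance(c_j: float, c_jsw: float, w: float, d: float) -> float:
    """C_dj = C_j * A_D + C_jsw * P_D for a rectangular drain of size W x D.

    c_j   : junction capacitance per unit area
    c_jsw : sidewall capacitance per unit length
    """
    area = w * d                 # A_D = W * D
    perimeter = 2.0 * (w + d)    # P_D = 2 (W + D)
    return c_j * area + c_jsw * perimeter


def dynamic_power(c_load: float, v_dd: float, freq: float) -> float:
    """P_d = C_L * Vdd^2 * f."""
    return c_load * v_dd ** 2 * freq


# Hypothetical numbers in arbitrary consistent units.
c_dj = drain_junction_capacitance(c_j=1.0, c_jsw=0.5, w=2.0, d=3.0)

# Hypothetical 2 pF load switched at 1 MHz from a 2.0 V supply.
p_d = dynamic_power(c_load=2e-12, v_dd=2.0, freq=1e6)
```

Because the supply voltage enters quadratically, lowering Vdd reduces P_d faster than an equal relative reduction in load capacitance or frequency.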


Crosstalk
• In deep-submicron layouts, some nets connecting modules can be so long that their resistance is comparable to the resistance of the driver.
• Each net in a mixed analog/digital circuit is classified according to its crosstalk sensitivity:
  – 1. Noisy = high-impedance signals that can disturb other signals, e.g., clock signals.
  – 2. High-Sensitivity = high-impedance analog nets; the most noise-sensitive nets, such as the input nets to operational amplifiers.
  – 3. Mid-Sensitivity = low/medium-impedance analog nets.
  – 4. Low-Sensitivity = digital nets that directly affect the analog part in some cells, such as control signals.
  – 5. Non-Sensitivity = the most noise-insensitive nets, such as pure digital nets.
• The crosstalk between two interconnection wires also depends on the frequencies (i.e., signal activities) of the signals traveling on the wires. Recent deep-submicron designs require crosstalk-free channel routing.
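The five net classes above map naturally onto a small enumeration that a router could consult. The sketch below is a hypothetical illustration only: the rule that a Noisy net must not sit on a track adjacent to a High- or Mid-Sensitivity net is an assumption made for the example, and the net names and track order are invented.

```python
from enum import IntEnum
from typing import Dict, List, Tuple


class NetClass(IntEnum):
    """Crosstalk sensitivity classes, numbered as in the slide."""
    NOISY = 1
    HIGH_SENSITIVITY = 2
    MID_SENSITIVITY = 3
    LOW_SENSITIVITY = 4
    NON_SENSITIVITY = 5


# Assumed rule for this sketch: these classes must not neighbour a NOISY net.
PROTECTED = {NetClass.HIGH_SENSITIVITY, NetClass.MID_SENSITIVITY}


def crosstalk_conflicts(tracks: List[str],
                        net_class: Dict[str, NetClass]) -> List[Tuple[str, str]]:
    """Return pairs of nets on adjacent channel tracks that violate the rule."""
    conflicts = []
    for a, b in zip(tracks, tracks[1:]):
        ca, cb = net_class[a], net_class[b]
        if (ca == NetClass.NOISY and cb in PROTECTED) or \
           (cb == NetClass.NOISY and ca in PROTECTED):
            conflicts.append((a, b))
    return conflicts


# Hypothetical channel with four nets, listed top track to bottom track.
classes = {
    "clk": NetClass.NOISY,
    "opamp_in": NetClass.HIGH_SENSITIVITY,
    "ctrl": NetClass.LOW_SENSITIVITY,
    "data": NetClass.NON_SENSITIVITY,
}
bad_order = crosstalk_conflicts(["clk", "opamp_in", "ctrl", "data"], classes)
good_order = crosstalk_conflicts(["clk", "data", "ctrl", "opamp_in"], classes)
```

Reordering the tracks so that insensitive digital nets separate the clock from the operational-amplifier input removes the conflict without adding spacing.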


Power Measure in Layout
• The average dynamic power consumed by a CMOS gate is given below, where C_l is the load capacitance at the output of the node, V_dd is the supply voltage, T_cycle is the global clock period, N is the number of transitions of the gate output per clock cycle, C_g is the load capacitance due to the input capacitance of the fanout gates, and C_w is the load capacitance due to the interconnection tree formed between the driver and its fanout gates.
• $P_{av} = \dfrac{0.5\,V_{dd}^2}{T_{cycle}}\,C_l\,N = \dfrac{0.5\,V_{dd}^2}{T_{cycle}}\,(C_g + C_w)\,N$
• Logic synthesis for low power attempts to minimize $\sum_i C_{g_i} N_i$.
• Physical design for low power tries to minimize $\sum_i C_{w_i} N_i$. Here $C_{w_i} = C_{x_i} + C_{s_i}$, where $C_{x_i}$ is the capacitance of net i due to crosstalk and $C_{s_i}$ is the substrate capacitance of net i.
• For low-power layout applications, power dissipation due to crosstalk is minimized by ensuring that wires carrying high-activity signals are placed sufficiently far from the other wires. Similarly, power dissipation due to substrate capacitance is proportional to the wire length and its signal activity.
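The power measure can be sketched numerically, assuming the average power of a gate is 0.5 · Vdd² / T_cycle · (C_g + C_w) · N and that the physical-design objective is the activity-weighted sum of wire capacitances (crosstalk plus substrate). The function names and every number below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from typing import Iterable, Tuple


def average_power(v_dd: float, t_cycle: float,
                  c_gate: float, c_wire: float, n_transitions: float) -> float:
    """P_av = 0.5 * Vdd^2 / T_cycle * (C_g + C_w) * N."""
    return 0.5 * v_dd ** 2 / t_cycle * (c_gate + c_wire) * n_transitions


def weighted_wire_capacitance(nets: Iterable[Tuple[float, float, float]]) -> float:
    """Layout objective: sum over nets of (C_x + C_s) * N.

    Each net is given as (crosstalk capacitance, substrate capacitance,
    transitions per clock cycle).
    """
    return sum((c_x + c_s) * n for c_x, c_s, n in nets)


# Hypothetical gate: Vdd = 2.0 V, 10 ns cycle, 10 fF gate load, 30 fF wire load,
# output toggling once every two cycles on average.
p_av = average_power(v_dd=2.0, t_cycle=1e-8,
                     c_gate=1e-14, c_wire=3e-14, n_transitions=0.5)

# Two hypothetical nets in arbitrary capacitance units.
objective = weighted_wire_capacitance([(1.0, 2.0, 0.5), (3.0, 1.0, 0.25)])
```

In the second call the first net dominates the objective despite its smaller total capacitance share per unit activity, which is why high-activity nets are the ones worth shortening and spacing out.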


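Low-Power Layout Design Using Dual Supply Voltages

Sungkyunkwan University, School of Electrical and Computer Engineering
Jin Hyuk Kim, Jun Sung Lee, Jun Dong Cho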


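Contents
• Research objective
• Research background
• Clustered Voltage Scaling structure
• Row by Row Power Supply structure
• Mix-And-Match Power Supply structure
• Level Converter structure
• Mix-And-Match Power Supply design flow
• Experimental results
• Conclusion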


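Research Objective and Background
• Propose a dual-voltage layout technique that reduces the power consumption of combinational circuits.
• When dual-voltage cells are used, reduce the extra wiring and the extra tracks that result from placing only cells of the same voltage in each cell row.
• Implement a Level Converter circuit that uses the minimum number of transistors.
• Apply Clustered Voltage Scaling [Usami, ’95], which uses dual voltages while maintaining device performance.
• The proposed Mix-And-Match Power Supply layout structure improves on the existing Row by Row Power Supply [Usami, ’97] layout structure, reducing both power and area.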


Clustered Voltage Scaling
• Generates a low-power netlist

[Figure: netlist of gates G1 to G11 between flip-flops (F/F), with level converters LC1 and LC2. Slack(S_i) = R_i - A_i. Positive slack: S1, S3, S4, S5, S6, S9 > 0. Negative slack: S2, S7, S8, S10, S11 < 0. Legend: VDDL, VDDH, Level Converter.]
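The slack test used on this slide, Slack(S_i) = R_i - A_i, can be sketched as a small static timing pass. The code below assumes that R_i and A_i denote the required and arrival times at the output of gate i, and it only marks gates with positive slack as candidates for VDDL. It is a simplified illustration, not the published Clustered Voltage Scaling algorithm: a real flow would also account for the extra delay at the lower voltage and for level-converter placement. The netlist and the delays are hypothetical.

```python
from typing import Dict, List


def slacks(fanin: Dict[str, List[str]], delay: Dict[str, float],
           t_required: float) -> Dict[str, float]:
    """Slack(S_i) = R_i - A_i for every gate of a combinational DAG."""
    # Build the fanout lists from the fanin lists.
    fanout: Dict[str, List[str]] = {g: [] for g in delay}
    for gate, preds in fanin.items():
        for p in preds:
            fanout[p].append(gate)

    arrival: Dict[str, float] = {}

    def arrive(g: str) -> float:
        # A_i: latest input arrival plus the gate's own delay.
        if g not in arrival:
            arrival[g] = delay[g] + max(
                (arrive(p) for p in fanin.get(g, [])), default=0.0)
        return arrival[g]

    required: Dict[str, float] = {}

    def require(g: str) -> float:
        # R_i: earliest deadline imposed by any fanout; outputs must meet t_required.
        if g not in required:
            required[g] = min(
                (require(s) - delay[s] for s in fanout[g]), default=t_required)
        return required[g]

    return {g: require(g) - arrive(g) for g in delay}


def vddl_candidates(slack: Dict[str, float]) -> List[str]:
    """Gates with positive slack are candidates for the low supply voltage."""
    return sorted(g for g, s in slack.items() if s > 0)


# Hypothetical three-gate netlist: G1 drives G2 and G3; outputs are due at t = 4.
example_fanin = {"G2": ["G1"], "G3": ["G1"]}
example_delay = {"G1": 1.0, "G2": 3.0, "G3": 1.0}
example_slack = slacks(example_fanin, example_delay, t_required=4.0)
candidates = vddl_candidates(example_slack)
```

In this example G1 and G2 lie on the critical path and have zero slack, so only G3 is marked as a VDDL candidate.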


Row by Row Power Supply Structure

[Figure: module of standard-cell rows with VDDH, VDDL, and VSS rails; VDDH cells and VDDL cells are placed in separate rows.]


Mix-And-Match Power Supply Structure

[Figure: module of standard-cell rows with VDDH, VDDL, and VSS rails; VDDH cells and VDDL cells are mixed within the same row.]


Structure Comparison

[Figure: supply arrangement at the module level for three cases: conventional circuit (VDDH), RRPS (VDDH and VDDL), and MAMPS (VDDH and VDDL).]


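Level Converter Structure
• Number of transistors: reduced from 6 (conventional) to 4 (proposed)
• Effective in terms of both power and area

[Figure: conventional and proposed level converter circuits. Labels: IN, OUT, VDDH, VDDL, VSS/VDDL, VSS/VDDH, Vth = 1.5V, Vth = 2.0V.]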


Mix-And-Match Power Supply Design Flow

[Figure: design-flow chart. Boxes: Single voltage netlist; Multiple voltage scaling; Netlist with multiple supply voltage; Physical placement; Assign supply voltage to each cell; Routing; Synthesis timing, power and area. Tools named in the chart: OPUS, Aquarius XO, PowerMill.]


Experimental Results

[Figure: bar charts of total area (%) and total power (%) for the conventional circuit (100), RRPS, and MAMPS. Annotated values: 15% and 10% on the area chart; 47% and 2% on the power chart.]


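Conclusion
• Compared with the single-voltage circuit, a 49.4% power reduction was obtained, at the cost of a 5.6% area overhead.
• Compared with the existing RRPS structure, area is reduced by 10% and power by 2%.
• The proposed Level Converter reduces area by 30% and power by 35% compared with the conventional Level Converter.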


9. CAD tools


Low Power Design Tools
• Transistor Level Tools (accuracy within 5-10% of silicon)
  – SPICE, PowerMill (Epic), ADM (Avanti/Anagram), Lsim Power Analyst (Mentor)
• Logic Level Tools (10-15%)
  – DesignPower and PowerGate (Synopsys), WattWatcher/Gate (Sente), PowerSim (System Sciences), POET (Viewlogic), and QuickPower (Mentor)
• Architectural (RTL) Level Tools (20-25%)
  – WattWatcher/Architect (Sente): 20-25% accuracy
• Behavioral (spreadsheet) Level Tools (50-100%)
  – Active area of academic research


Commercial synthesis systems


Research synthesis systems
A - Architectural synthesis.
L - Logic synthesis.


Low-Power CAD sites

• Alternative System Concepts, Inc.: 7X power reduction through optimization; contact http://www.ee.princeton.edu and Jake Karrfalt at [email protected] or (603) 437-2234. Reduction of glitch and clock power; modeling and optimization of interconnect power; power optimization for data-dominated designs with limited control flow.

• Mentor Graphics QuickPower (http://www.mentorg.com): hierarchical determination of the overall benefit of exchanging blocks for lower-power ones; powering down or disabling blocks when not in use by means of gated clocks; choosing candidates for power-down and calculating the effect of the power-down logic.

• Synopsys's Power Compiler: http://www.synopsys.com/products/power/power_ds

• Sente's WattWatcher/Architect: the first commercial tool operating at the architecture level (20-25% accuracy). http://www.powereda.com

• Behavioral tools: Hyper-LP (optimization) and Explore (estimation) by J. Rabaey


DesignPower (Synopsys)
• DesignPower(TM) provides a single, integrated environment for power analysis in multiple phases of the design process:
  – Early, quick feedback at the HDL or gate level through probabilistic analysis.
  – Improved accuracy through simulation-based analysis for gate level and library exploration.
• DesignPower estimates switching, internal cell, and leakage power. It accepts user-defined probabilities, simulation toggle data, or a combination of both as input. DesignPower propagates switching information through sequential devices, including flip-flops and latches.

• It supports sequential, hierarchical, gated-clock, and multiple-clock designs. For simulation toggle data, it links directly to Verilog and VHDL simulators, including Synopsys' VSS.


10. References


References
[1] Gary K. Yeap, "Practical Low Power Digital VLSI Design", Kluwer Academic Publishers.
[2] Jan M. Rabaey, Massoud Pedram, "Low Power Design Methodologies", Kluwer Academic Publishers.
[3] Abdellatif Bellaouar, Mohamed I. Elmasry, "Low-Power Digital VLSI Design Circuits And Systems", Kluwer Academic Publishers.
[4] Anantha P. Chandrakasan, Robert W. Brodersen, "Low Power Digital CMOS Design", Kluwer Academic Publishers.
[5] Dr. Ralph Cavin, Dr. Wentai Liu, "1996 Emerging Technologies : Designing Low Power Digital Systems"
[6] Muhammad S. Elrabaa, Issam S. Abu-Khater, Mohamed I. Elmasry, "Advanced Low-Power Digital Circuit Techniques", Kluwer Academic Publishers.


References
• [BFKea94] R. Bechade, R. Flaker, B. Kauffmann, et al. "A 32b 66 MHz 1.8W Microprocessor". In IEEE Int. Solid-State Circuit Conference, pages 208-209, 1994.

• [BM95] M. T. Bohr. "Interconnect Scaling - The real limiter to high performance ULSI". In Proceedings of 1995 IEEE International Electron Devices Meeting, pages 241-242, 1995.

• [BSM94] L. Benini, P. Siegel, and G. De Micheli. "Saving Power by Synthesizing Gated Clocks for Sequential Circuits". IEEE Design and Test of Computers, 11(4):32-41, 1994.

• [GH95] S. Ganguly and S. Hojat. "Clock Distribution Design and Verification for PowerPC Microprocessor". In International Conference on Computer-Aided Design, page Issues in Clock Designs, 1995.

• [MGR96] R. Mehra, L. M. Guerra, and J. Rabaey. "Low Power Architecture Synthesis and the Impact of Exploiting Locality". In Journal of VLSI Signal Processing, 1996.
