SungKyunKwan Univ.
1VADA Lab.
Clustering Example
• Two-cluster Partition
• Three-cluster Partition
Complexity of Partitioning
In general, computing the optimal partitioning is an NP-complete problem: the best known algorithms take time that is an exponential function of n=|N| and p, and it is widely believed that no algorithm whose running time is a polynomial function of n=|N| and p exists (see "Computers and Intractability", M. Garey and D. Johnson, W. H. Freeman, 1979, for details). Therefore we need heuristics to get approximate solutions for problems where n is large. The picture below illustrates a larger graph partitioning problem; it was generated using the spectral partitioning algorithm as implemented in the graph partitioning software by Gilbert et al., described below. The partition is N = Nblue U Nblack, with red edges connecting nodes in the two partitions.
Edge Separator and Vertex Separator
Bisecting a graph G=(N,E) can be done in two ways. In the last section, we discussed finding the smallest subset Es of E such that removing Es from E divides G into two disconnected subgraphs G1 and G2, with nodes N1 and N2 respectively, where N1 U N2 = N and N1 and N2 are disjoint and equally large. (If the number of nodes is odd, we obviously cannot make |N1|=|N2|. So we will call Es an edge separator if |N1| and |N2| are sufficiently close; we will be more explicit about how different |N1| and |N2| can be only when necessary.) The edges in Es connect nodes in N1 to nodes in N2. Since removing Es disconnects G, Es is called an edge separator.
The other way to bisect a graph is to find a vertex separator, a subset Ns of N, such that removing Ns and all incident edges from G also results in two disconnected subgraphs G1 and G2 of G. In other words N = N1 U Ns U N2, where all three subsets of N are disjoint, N1 and N2 are equally large, and no edges connect N1 and N2.
The following figure illustrates these ideas. The green edges, Es1, form an edge separator, as do the blue edges, Es2. The red nodes, Ns, are a vertex separator, since removing them and the incident edges (Es1, Es2, and the purple edges) leaves two disjoint subgraphs.
Theorem. (R. Lipton and R. Tarjan, "A separator theorem for planar graphs", SIAM J. Appl. Math., 36:177-189, April 1979). Let G=(N,E) be a planar graph. Then we can find a vertex separator Ns, so that N = N1 U Ns U N2 is a disjoint partition of N, |N1| <= (2/3)*|N|, |N2| <= (2/3)*|N|, and |Ns| <= sqrt(8*|N|).
Kernighan and Lin Algorithm
• The algorithm of B. Kernighan and S. Lin ("An effective heuristic procedure for partitioning graphs", The Bell System Technical Journal, pp. 291--308, Feb 1970) takes O(|N|^3) time per iteration. A more complicated but more efficient implementation, which takes only O(|E|) time per iteration, was presented by C. Fiduccia and R. Mattheyses, "A linear-time heuristic for improving network partitions", Technical Report 82CRD130, General Electric Co., Corporate Research and Development Center, Schenectady, NY, 1982.
• We start with an edge-weighted graph G=(N,E,WE) and a partitioning N = A U B into equal parts: |A| = |B|. Let w(e) = w(i,j) be the weight of edge e=(i,j), where the weight is 0 if no edge e=(i,j) exists. The goal is to find equal-sized subsets X in A and Y in B, such that exchanging X and Y reduces the total cost of edges from A to B. More precisely, we let T = sum[ a in A and b in B ] w(a,b) = cost of edges from A to B and seek X and Y such that new_A = (A - X) U Y and new_B = (B - Y) U X have a lower cost new_T. To compute new_T efficiently, we introduce:
  E(a) = external cost of a = sum[ b in B ] w(a,b)
  I(a) = internal cost of a = sum[ a' in A, a' != a ] w(a,a')
  D(a) = cost of a = E(a) - I(a)
and analogously
  E(b) = external cost of b = sum[ a in A ] w(a,b)
  I(b) = internal cost of b = sum[ b' in B, b' != b ] w(b,b')
  D(b) = cost of b = E(b) - I(b)
Then it is easy to show that swapping a in A and b in B changes T to
  new_T = T - ( D(a) + D(b) - 2*w(a,b) ) = T - gain(a,b)
In other words, gain(a,b) = D(a) + D(b) - 2*w(a,b) measures the improvement in the partitioning by swapping a and b. D(a') and D(b') also change to
  new_D(a') = D(a') + 2*w(a',a) - 2*w(a',b) for all a' in A, a' != a
  new_D(b') = D(b') + 2*w(b',b) - 2*w(b',a) for all b' in B, b' != b
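The definitions of D and gain can be checked numerically. The following is a minimal Python sketch, not part of the original slides: the helper names (`cut_cost`, `d_value`, `gain`) and the 4-node example graph are hypothetical, and edge weights are assumed to be stored in a dictionary keyed by unordered node pairs.

```python
def cut_cost(w, A, B):
    """T = total weight of edges with one endpoint in A and the other in B."""
    return sum(w.get(frozenset((a, b)), 0) for a in A for b in B)

def d_value(w, node, own, other):
    """D(node) = E(node) - I(node): external cost minus internal cost."""
    external = sum(w.get(frozenset((node, x)), 0) for x in other)
    internal = sum(w.get(frozenset((node, x)), 0) for x in own if x != node)
    return external - internal

def gain(w, a, b, A, B):
    """gain(a,b) = D(a) + D(b) - 2*w(a,b) for a in A and b in B."""
    return (d_value(w, a, A, B) + d_value(w, b, B, A)
            - 2 * w.get(frozenset((a, b)), 0))

# Hypothetical example graph (weights chosen only for illustration).
raw = {(0, 1): 1, (0, 2): 3, (1, 3): 2, (2, 3): 1, (0, 3): 1}
w = {frozenset(e): wt for e, wt in raw.items()}
A, B = {0, 1}, {2, 3}

T = cut_cost(w, A, B)          # 6
g = gain(w, 1, 2, A, B)        # 3: swapping 1 and 2 lowers the cut from 6 to 3
```

For every pair (a,b) in this example, recomputing the cut after the swap gives T - gain(a,b), which is the identity stated above.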
Kernighan and Lin Algorithm
(0) Compute T = cost of partition N = A U B ... cost = O(|N|^2)
Repeat
  (1) Compute costs D(n) for all n in N ... cost = O(|N|^2)
  (2) Unmark all nodes in G ... cost = O(|N|)
  (3) While there are unmarked nodes ... |N|/2 iterations
    (3.1) Find an unmarked pair (a,b) maximizing gain(a,b) ... cost = O(|N|^2)
    (3.2) Mark a and b (but do not swap them) ... cost = O(1)
    (3.3) Update D(n) for all unmarked n, as though a and b had been swapped ... cost = O(|N|)
  End while
  ... At this point, we have computed a sequence of pairs
  ... (a1,b1), ... , (ak,bk) and
  ... gains gain(1), ..., gain(k)
  ... where k = |N|/2, ordered by the order in which
  ... we marked them
  (4) Pick j maximizing Gain = sum_{i=1...j} gain(i)
  ... Gain is the reduction in cost from swapping
  ... (a1,b1),...,(aj,bj)
  (5) If Gain > 0 then
    (5.1) Update A = A - {a1,...,aj} U {b1,...,bj} ... cost = O(|N|)
    (5.2) Update B = B - {b1,...,bj} U {a1,...,aj} ... cost = O(|N|)
    (5.3) Update T = T - Gain ... cost = O(1)
  End if
Until Gain <= 0
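Steps (1) through (5) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names `kl_pass` and `kernighan_lin` are hypothetical, weights are stored in a dictionary keyed by unordered node pairs, ties in step (3.1) are broken arbitrarily, and only the first j pairs of the sequence are swapped in step (5).

```python
def kl_pass(w, A, B):
    """One pass of steps (1)-(5); returns (new_A, new_B, Gain)."""
    def wt(x, y):
        return w.get(frozenset((x, y)), 0)

    A, B = set(A), set(B)
    # Step (1): D(n) = external cost - internal cost for every node.
    D = {}
    for own, other in ((A, B), (B, A)):
        for n in own:
            D[n] = (sum(wt(n, x) for x in other)
                    - sum(wt(n, x) for x in own if x != n))

    free_a, free_b = set(A), set(B)          # step (2): all nodes unmarked
    pairs, gains = [], []
    while free_a and free_b:                 # step (3)
        # (3.1) unmarked pair with the largest gain
        g, a, b = max(((D[a] + D[b] - 2 * wt(a, b), a, b)
                       for a in free_a for b in free_b),
                      key=lambda t: t[0])
        # (3.2) mark the pair without swapping it
        free_a.remove(a)
        free_b.remove(b)
        pairs.append((a, b))
        gains.append(g)
        # (3.3) update D for unmarked nodes as though a and b were swapped
        for x in free_a:
            D[x] += 2 * wt(x, a) - 2 * wt(x, b)
        for y in free_b:
            D[y] += 2 * wt(y, b) - 2 * wt(y, a)

    # Step (4): prefix length j with the largest cumulative gain
    best_gain, best_j, running = 0, 0, 0
    for j, g in enumerate(gains, start=1):
        running += g
        if running > best_gain:
            best_gain, best_j = running, j

    # Step (5): swap only the first j pairs (nothing if Gain <= 0)
    for a, b in pairs[:best_j]:
        A.remove(a)
        A.add(b)
        B.remove(b)
        B.add(a)
    return A, B, best_gain

def kernighan_lin(w, A, B):
    """Repeat passes until a pass yields no positive gain."""
    while True:
        A, B, g = kl_pass(w, A, B)
        if g <= 0:
            return A, B

# Hypothetical example: two triangles joined by the single edge (2,3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
w = {frozenset(e): 1 for e in edges}
final_A, final_B = kernighan_lin(w, {0, 1, 3}, {2, 4, 5})
```

Starting from the poor split {0,1,3} / {2,4,5} (cut cost 5), the first pass swaps nodes 3 and 2 and reaches the two triangles, whose cut cost is 1.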
Spectral Partitioning
• This is a powerful but expensive technique, based on techniques introduced by Fiedler in the 1970s, but popularized in 1990 by A. Pothen, H. Simon, and K.-P. Liou, "Partitioning sparse matrices with eigenvectors of graphs", SIAM J. Matrix Anal. Appl., 11:430--452. We will first describe the algorithm, and then give three related justifications for its efficacy. Let G=(N,E) be an undirected, unweighted graph without self edges (i,i) or multiple edges from one node to another. We define two matrices related to this graph.
• Definition. The incidence matrix In(G) of G is an |N|-by-|E| matrix, with one row for each node and one column for each edge. Suppose edge e=(i,j). Then column e of In(G) is zero except for the i-th and j-th entries, which are +1 and -1, respectively.
Note that there is some ambiguity in this definition, since G is undirected; writing edge e=(i,j) instead of (j,i) is equivalent to multiplying column e of In(G) by -1. We will see that this ambiguity will not be important to us.
• Definition. The Laplacian matrix L(G) of G is an |N|-by-|N| symmetric matrix, with one row and column for each node. It is defined as follows.
  (L(G))(i,j) = degree of node i (number of incident edges) if i=j
              = -1 if i!=j and there is an edge (i,j)
              = 0 otherwise
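The two definitions can be made concrete with a small pure-Python sketch. The function names and the 4-node example graph are hypothetical; matrices are plain lists of rows. The sketch also shows that flipping the orientation of every edge, which negates every column of In(G), leaves the product In(G)*(In(G))' unchanged and equal to L(G).

```python
def incidence_matrix(n, edges):
    """|N|-by-|E| matrix: column e=(i,j) has +1 in row i and -1 in row j."""
    In = [[0] * len(edges) for _ in range(n)]
    for e, (i, j) in enumerate(edges):
        In[i][e] = 1
        In[j][e] = -1
    return In

def laplacian(n, edges):
    """|N|-by-|N| matrix: degrees on the diagonal, -1 for each edge (i,j)."""
    L = [[0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1
        L[j][j] += 1
        L[i][j] = -1
        L[j][i] = -1
    return L

def times_transpose(M):
    """Return M * M' for a matrix M stored as a list of rows."""
    return [[sum(x * y for x, y in zip(r, s)) for s in M] for r in M]

# Hypothetical example: a 4-cycle with one chord.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
In = incidence_matrix(4, edges)
L = laplacian(4, edges)
flipped = incidence_matrix(4, [(j, i) for i, j in edges])
same_product = times_transpose(In) == times_transpose(flipped) == L
```

Each row of L sums to zero, because the diagonal degree is cancelled by one -1 per incident edge.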
Spatial Locality: Hardware Partitioning
• The interface logic should be properly partitioned for area and timing reasons. Minimizing global buses leads to lower bus capacitance, and thus lower interconnect power.
• Signal values within the clusters tend to be more highly correlated.
• The data path should be partitioned into parts of approximately equal size.
• In the DSP area, data paths tend to occupy far more area than the control paths.
• Wiring is still one of the dominant area consumers.
• The method used to identify clusters is based on the eigenvalues and eigenvectors of the Laplacian of the graph.
• The eigenvector corresponding to the second smallest eigenvalue provides a 1-D placement of the nodes which minimizes the mean-squared connection length.
Spectral Partitioning in VLSI placement
• Setting the derivative of the Lagrangian, L, to zero gives:
$(Q - \lambda I)\,x = 0$
• The solutions of the above equation are those for which $\lambda$ is an eigenvalue and x is the corresponding eigenvector.
• The smallest eigenvalue, 0, gives a trivial solution with all nodes at the same point. The eigenvector corresponding to the second smallest eigenvalue minimizes the cost function while giving a non-trivial solution.
Key Ideas in Spectral Partitioning
Spectral Partitioning
The following theorem states some important facts about In(G) and L(G). It introduces us to the idea that the eigenvalues and eigenvectors of L(G) are related to the connectivity of G.
Theorem 1. Given a graph G, its associated matrices In(G) and L(G) have the following properties.
1. L(G) is a symmetric matrix. This means the eigenvalues of L(G) are real, and its eigenvectors are real and orthogonal.
2. Let e=[1,...,1]', where ' means transpose, i.e. the column vector of all ones. Then L(G)*e = 0.
3. In(G)*(In(G))' = L(G). This is independent of the signs chosen in each column of In(G).
4. Suppose L(G)*v = lambda*v, where v is nonzero. Then
   lambda = norm(In(G)'*v)^2 / norm(v)^2, where norm(z)^2 = sum_i z(i)^2
          = sum{all edges e=(i,j)} (v(i)-v(j))^2 / sum_i v(i)^2
5. The eigenvalues of L(G) are nonnegative:
   0 <= lambda1 <= lambda2 <= ... <= lambdan
6. The number of connected components of G is equal to the number of eigenvalues lambda_i equal to 0. In particular, lambda2 != 0 if and only if G is connected.
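Property 4 can be verified by hand on a small case. The sketch below is illustrative only: the helper names are hypothetical, and the example is a 3-node path whose Laplacian has the eigenpairs (0, [1,1,1]), (1, [1,0,-1]) and (3, [1,-2,1]).

```python
def laplacian_times(n, edges, v):
    """Compute L(G)*v directly from the edge list."""
    out = [0.0] * n
    for i, j in edges:
        out[i] += v[i] - v[j]
        out[j] += v[j] - v[i]
    return out

def rayleigh_quotient(edges, v):
    """sum over edges of (v(i)-v(j))^2, divided by sum_i v(i)^2."""
    num = sum((v[i] - v[j]) ** 2 for i, j in edges)
    den = sum(x * x for x in v)
    return num / den

path = [(0, 1), (1, 2)]                      # hypothetical 3-node path
eigenpairs = [(0.0, [1, 1, 1]), (1.0, [1, 0, -1]), (3.0, [1, -2, 1])]
checks = [
    laplacian_times(3, path, v) == [lam * x for x in v]
    and rayleigh_quotient(path, v) == lam
    for lam, v in eigenpairs
]
```

All three entries of `checks` are true: each vector is an eigenvector of L(G), and its eigenvalue equals the edge-difference quotient of property 4. The path is connected, so exactly one eigenvalue is 0, in line with property 6.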
Spectral Partitioning
  Compute the eigenvector v2 corresponding to lambda2 of L(G)
  for each node n of G
    if v2(n) < 0
      put node n in partition N-
    else
      put node n in partition N+
    endif
  endfor
First we show that this partition is at least reasonable, because it tends to give connected components N- and N+:
Theorem 2. (M. Fiedler, "A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory", Czech.Math. J. 25:619--637, 1975.) Let G be connected, and N- and N+ be defined by the above algorithm. Then N- is connected. If no v2(n) = 0, N+ is also connected.
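The bisection rule above can be sketched in pure Python. The slides do not say how v2 is computed, so the eigenvector routine here is an assumption: power iteration on c*I - L(G) with c = 2*(maximum degree), projecting out the all-ones vector e at every step. It assumes a connected graph with at least one edge and a start vector that is not orthogonal to v2; the function names are hypothetical.

```python
import math

def fiedler_vector(n, edges, iters=2000):
    """Approximate v2 of L(G) by power iteration on (c*I - L), deflating e."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    c = 2 * max(deg)                     # c >= lambda_n, so c*I - L is PSD
    v = [float(i) for i in range(n)]     # deterministic start (assumption)
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]        # remove the component along e
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        Lv = [deg[i] * v[i] for i in range(n)]
        for i, j in edges:
            Lv[i] -= v[j]
            Lv[j] -= v[i]
        v = [c * v[i] - Lv[i] for i in range(n)]
    mean = sum(v) / n
    return [x - mean for x in v]

def spectral_bisection(n, edges):
    """Put node i in N- if v2(i) < 0, otherwise in N+."""
    v2 = fiedler_vector(n, edges)
    n_minus = {i for i in range(n) if v2[i] < 0}
    n_plus = set(range(n)) - n_minus
    return n_minus, n_plus

# Hypothetical example: two triangles joined by the single edge (2,3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
n_minus, n_plus = spectral_bisection(6, edges)
```

On this example the two parts are the two triangles, both connected, as Theorem 2 predicts. The sign of v2 is arbitrary, so which triangle is labelled N- depends on the start vector. For large graphs a Lanczos-type eigensolver would normally replace the simple power iteration used here.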
There are a number of reasons lambda2 is called the algebraic connectivity. Here is another.
Theorem 3. (Fiedler). Let G=(N,E) be a graph, and G1=(N,E1) a subgraph, i.e. with the same nodes and a subset of the edges, so that G1 is "less connected" than G. Then lambda2(L(G1)) <= lambda2(L(G)), i.e. the algebraic connectivity of G1 is also less than or equal to the algebraic connectivity of G.
Motivation for spectral bisection, by analogy with a vibrating string
How does a taut string vibrate when it is plucked? From our background in either physics or music, we know that it has certain modes of vibration or harmonics. If we were to take snapshots of these modes, they would look like this:
Multilevel Kernighan-Lin
Gc is computed in step (1) of Recursive_partition as follows. We define a matching of a graph G=(N,E) as a subset Em of the edges E with the property that no two edges in Em share an endpoint. A maximal matching is one to which no more edges can be added and remain a matching. We can compute a maximal matching by a simple random algorithm:
  let Em be empty
  mark all nodes in N as unmatched
  for i = 1 to |N| ... visit the nodes in a random order
    if node i has not been matched,
      choose an edge e=(i,j) where j is also unmatched, and add it to Em
      mark i and j as matched
    end if
  end for
Given a matching, Gc is computed as follows. We let there be a node r in Nc for each edge in Em. Then we construct Ec as follows:
  for r = 1 to |Em| ... for each node in Nc
    let (i,j) be the edge in Em corresponding to node r
    for each other edge e=(i,k) in E incident on i
      let ek be the edge in Em incident on k, and let rk be the corresponding node in Nc
      add the edge (r,rk) to Ec
    end for
    for each other edge e=(j,k) in E incident on j
      let ek be the edge in Em incident on k, and let rk be the corresponding node in Nc
      add the edge (r,rk) to Ec
    end for
  end for
  if there are multiple edges between pairs of nodes of Nc, collapse them into single edges
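The matching and coarsening steps can be sketched together in Python. This is an illustration under assumptions that go beyond the pseudocode: nodes left unmatched become singleton coarse nodes (the pseudocode only creates coarse nodes for matched edges), each coarse edge records how many original edges were collapsed into it, and the function names and the fixed random seed are hypothetical.

```python
import random

def maximal_matching(n, edges, seed=0):
    """Visit nodes in random order; match each unmatched node to an unmatched neighbor."""
    adj = {i: [] for i in range(n)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    mate = {}                                  # mate[i] = partner of i in Em
    for i in order:
        if i in mate:
            continue
        candidates = [j for j in adj[i] if j not in mate and j != i]
        if candidates:
            j = rng.choice(candidates)
            mate[i], mate[j] = j, i
    return mate

def coarsen(n, edges, mate):
    """Collapse each matched pair into one coarse node and build Ec."""
    coarse_of, next_id = {}, 0
    for i in range(n):
        if i in coarse_of:
            continue
        coarse_of[i] = next_id
        if i in mate:                          # partner shares the coarse node
            coarse_of[mate[i]] = next_id
        next_id += 1                           # unmatched node: singleton (assumption)
    coarse_edges = {}
    for i, j in edges:
        r, s = coarse_of[i], coarse_of[j]
        if r != s:                             # edges inside a pair disappear
            key = (min(r, s), max(r, s))
            coarse_edges[key] = coarse_edges.get(key, 0) + 1
    return next_id, coarse_edges, coarse_of

# Hypothetical example: two triangles joined by the single edge (2,3).
n = 6
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
mate = maximal_matching(n, edges)
nc, coarse_edges, coarse_of = coarsen(n, edges, mate)
```

The dictionary `coarse_of` maps every node of N to its coarse node, which is what is needed later to expand a partition of Nc back to a partition of N.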
Multilevel Kernighan-Lin
Note that we can take node weights into account by letting the weight of a node (i,j) in Nc be the sum of the weights of the nodes i and j. We can similarly take edge weights into account by letting the weight of an edge in Ec be the sum of the weights of the edges "collapsed" into it. Furthermore, we can choose the edge (i,j) which matches j to i in the construction of Nc above to have the largest weight of all edges incident on i; this will tend to minimize the weights of the cut edges. This is called heavy edge matching in METIS, and is illustrated on the right.
Multilevel Kernighan-Lin
Given a partition (Nc+,Nc-) from step (2) of Recursive_partition, it is easily expanded to a partition (N+,N-) in step (3) by associating with each node in Nc+ or Nc- the nodes of N that comprise it. This is again shown below:
Finally, in step (4) of Recursive_partition, the approximate partition from step (3) is improved using a variation of Kernighan-Lin.
Multilevel Spectral Partitioning
Now we turn to the divide-and-conquer algorithm of Barnard and Simon, which is based on spectral partitioning rather than Kernighan-Lin. The expensive part of spectral bisection is finding the eigenvector v2, which requires a possibly large number of matrix-vector multiplications with the Laplacian matrix L(G) of the graph G. The divide-and-conquer approach of Recursive_partition will dramatically decrease the cost. Barnard and Simon perform step (1) of Recursive_partition, computing Gc = (Nc,Ec) from G=(N,E), slightly differently than above: they find a maximal independent subset Nc of N. This means that N contains Nc and E contains Ec, no nodes in Nc are directly connected by edges in E (independence), and Nc is as large as possible (maximality).
There is a simple "greedy" algorithm for finding an Nc:
  Nc = empty set
  for i = 1 to |N|
    if node i is not adjacent to any node already in Nc
      add i to Nc
    end if
  end for
This is shown below in the case where G is simply a chain of 9 nodes with nearest neighbor connections, in which case Nc consists simply of every other node of N.
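The greedy construction of a maximal independent subset can be sketched in a few lines of Python. The function name is hypothetical and nodes are numbered from 0 rather than 1; the example is a chain of 9 nodes with nearest-neighbor connections.

```python
def greedy_independent_set(n, edges):
    """Add node i to Nc unless it is adjacent to a node already in Nc."""
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    Nc = []
    chosen = set()
    for i in range(n):
        if not adj[i] & chosen:      # no neighbor of i has been chosen yet
            Nc.append(i)
            chosen.add(i)
    return Nc

chain = [(i, i + 1) for i in range(8)]       # chain of 9 nodes: 0-1-2-...-8
Nc = greedy_independent_set(9, chain)        # [0, 2, 4, 6, 8]
```

The result is every other node of the chain. In general the set found this way cannot be extended by any further node, but it is not guaranteed to be the largest independent set of the graph.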
hMETIS
• hMETIS is a set of programs for partitioning hypergraphs such as those corresponding to VLSI circuits. The algorithms implemented by hMETIS are based on the multilevel hypergraph partitioning scheme described in [KAKS97].
• hMETIS produces bisections that cut 10% to 300% fewer hyperedges than those cut by other popular algorithms such as PARABOLI, PROP, and CLIP-PROP, especially for circuits with over 100,000 cells and circuits with non-unit cell area.
• It is extremely fast: a single run of hMETIS is faster than a single run of simpler schemes such as FM, KL, or CLIP. Furthermore, because of its very good average cut characteristics, it produces high quality partitionings in significantly fewer runs. It can bisect circuits with over 100,000 vertices in a couple of minutes on Pentium-class workstations.
• The performance of hMETIS on the new ISPD98 benchmark suite can be found in the paper by Chuck Alpert.
http://www.users.cs.umn.edu/~karypis/metis/metis.html
How good is Recursive Bisection?
• Horst D. Simon and Shang-Hua Teng, Report RNR-93-012, August 1993
• The most commonly used p-way partitioning method is recursive bisection. It first "optimally" divides the graph (mesh) into two equal sized pieces and then recursively divides the two pieces. We show that, due to the greedy nature and the lack of global information, recursive bisection, in the worst case, may produce a partition that is very far from the optimal one. Our negative result is complemented by two positive ones. First, we show that for some important classes of graphs that occur in practical applications, such as well shaped finite element and finite difference meshes, recursive bisection is normally within a constant factor of the optimal one. Secondly, we show that if the balance condition is relaxed so that each block in the partition is bounded by (1+e)n/p, then there exists an approximately balanced recursive partitioning scheme that finds a partition whose cost is within an O(log p) factor of the cost of the optimal p-way partition.
Partitioning Algorithm with Multiple Constraints
May 19, 1998, Jun Dong Cho
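Charging and Discharging due to Switching
• Accounts for up to 90% of total power consumption
[Figure: CMOS gate with a PMOS pull-up network and an NMOS pull-down network between Vdd and ground, driving the load capacitance CL; labels: charge, discharge, short circuit + leakage]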
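Partitioning for Low Power
• Conventional method: the number of edges crossing the cut
• Low power: the number of switching events on the edges
[Figure: (a) cut chosen by the number of cut edges; (b) cut chosen by the number of switching events; edges are labeled with switching activities 0.25 and 0.75]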
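Minimum-Cost Flow Algorithm
• A method for sending a given amount of flow to the desired destination at the lowest cost
  – Each path has a capacity and a cost
• Max-flow min-cut: considers only the number of edges
• Min-cost flow: assigns each edge a weight based on its switching activity
  – Cost: switching activity vs. number of edges
  – Capacity: the maximum amount that can flow through the edge
• Capacities are made larger for lower-cost edges so that they are more likely to be selected
[Equation garbled in source: the edge weight W_i is defined in terms of S_i and C_i]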
Network and Mincost Flow
[Figure: example flow network whose edges carry pairs of numbers such as 10 / 100, 20 / 10, 15 / 30, and 45 / 55]
Graph Transformation Algorithm
• Find the min-cost flow paths
• A graph transformation is needed in order to find the cut
• Topological sort by level
[Figure: graph with nodes arranged in Level 1 through Level 5]
Graph Transformation Algorithm
• Added nodes and edges
[Figure: nodes at Level (i) and Level (i+1) between Source and Sink; legend: newly created edge, existing edge, newly created node, existing node]
Graph Transformation
[Figure: transformed graph with Level 1 through Level 5 between the source S and the sink T]
Partitioning with constraints
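[Equations garbled in source. The recoverable symbols indicate an objective written as a double sum over the k partitions of terms in W and C, with constraints of the form $A_{lower} \le A_i \le A_{upper}$ and $P_i \le P_{upper}$ for $1 \le i \le k$.]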
Algorithm
Input: Flow f, Network
Output: Partition the network into f subnetworks
Step 1: Push flow f into the graph and run the minimum-cost flow algorithm. If every partition satisfies A_upper or P_upper, stop. Otherwise, increment f = f+1 and repeat Step 1 until the upper bounds are satisfied.
Step 2: If there are two partitions p and q that do not satisfy A_lower or P_lower, and
$A_{lower} \le A_p + A_q \le A_{upper}$
$P_{lower} \le P_p + P_q \le P_{upper}$
then p and q can be merged; minimum-cost matching is applied over all possible {p,q} sets to reduce the number of partitions.
6. Logic Level Design
Node Transition Activity
Low Activity XOR Function
GLITCH (Spurious transitions)
• 15-20% of the total power is due to glitching.
Glitches
Hazard Generation in Logic Circuits
• Static hazard: a transient pulse of width w (= the delay of the inverter).
• Dynamic hazard: the transient consists of three edges, two rising and one falling, with w of two units.
• Each input can have several arriving paths.
High-Performance Power Distribution
• (S: Switching probability; C: Capacitance)
• Start with all logic at the lowest power level; then successive iterations of delay calculation, identifying the failing blocks, and powering up are done until either all of the nets pass their delay criteria or the maximum power level is reached.
• Voltage drops in ground and supply wires use up a more serious fraction of the total noise margin.
Logic Transformation
• Use a signal with low switching activity to reduce the activity on a highly active signal.
• This is done by adding a redundant connection from the gate with low activity (source gate) to the gate with high switching activity (target gate).
• Signals a, b, and g1 have very high switching activity and most of the time its value is zero.
• Suppose c and g1 are selected as the source and target of a new connection. ` 1 is undetectable, hence the function of the new circuit remains the same.
• Signal c has a long run of zeros, and zero is the controlling value of the AND gate g1, so most of the switching activity at the inputs of g1 will not be seen at its output; thus the switching activity of gate g1 is reduced.
• The redundant connection in a circuit may result in some irredundant connections becoming redundant.
• By adding ` 1, the connections from c to g3 become redundant.
Frequency Reduction
◈ Power saving
  Reduces capacitance on the clock network
  Reduces internal power in the affected registers
  Reduces need for muxes (data recirculation)
◈ Opportunity
  Large opportunity for power reduction, dependent on:
    Number of registers gated
    Percentage of time clock is enabled
◈ Cost
  Testability
  Complicates clock tree synthesis
  Complicates clock skew balancing
GATED-CLOCK D-FLIP-FLOP
• Flip-flops present a large internal capacitance on the internal clock node.
• If the DFF output does not switch, the DFF does not have to be clocked.
Frequency Reduction
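[Figure: block diagrams before and after clock gating. Before Clock Gating: an FSM (inputs clk, reset) produces load_en, which enables the 32-bit register data_reg (data_in to data_out) clocked by clk. After Clock Gating: load_en passes through a LATCH controlled by clk to give load_en_latched, which is combined with clk to form clk_en, the clock of data_reg.]
Clock Gating Example - When D is not equal to Q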
Frequency Reduction
◈ Clock Gating Example - Before Code

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity nongate is
  port(clk, rst : in  std_logic;
       data_in  : in  std_logic_vector(31 downto 0);
       data_out : out std_logic_vector(31 downto 0));
end nongate;

architecture behave of nongate is
  signal load_en  : std_logic;
  signal data_reg : std_logic_vector(31 downto 0);
  signal count    : integer range 0 to 15;
begin

  -- Counter FSM: counts 0..9 with a synchronous, active-low reset
  FSM : process
  begin
    wait until clk'event and clk='1';
    if rst='0' then count <= 0;
    elsif count=9 then count <= 0;
    else count <= count+1;
    end if;
  end process FSM;

  -- load_en is asserted only while count = 9
  enable_logic : process(count, load_en)
  begin
    if (count=9) then load_en <= '1';
    else load_en <= '0';
    end if;
  end process enable_logic;

  -- Register is clocked every cycle; load_en acts as a data enable
  datapath : process
  begin
    wait until clk'event and clk='1';
    if load_en='1' then data_reg <= data_in;
    end if;
  end process datapath;

  data_out <= data_reg;
end behave;

configuration cfg_nongate of nongate is
  for behave
  end for;
end cfg_nongate;
Frequency Reduction
◈ Clock Gating Example - After Code

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity gate is
  port(clk, rst : in  std_logic;
       data_in  : in  std_logic_vector(31 downto 0);
       data_out : out std_logic_vector(31 downto 0));
end gate;

architecture behave of gate is
  signal load_en, load_en_latched, clk_en : std_logic;
  signal data_reg : std_logic_vector(31 downto 0);
  signal count    : integer range 0 to 15;
begin

  -- Counter FSM: counts 0..9 with a synchronous, active-low reset
  FSM : process
  begin
    wait until clk'event and clk='1';
    if rst='0' then count <= 0;
    elsif count=9 then count <= 0;
    else count <= count+1;
    end if;
  end process FSM;

  -- load_en is asserted only while count = 9
  enable_logic : process(count, load_en)
  begin
    if (count=9) then load_en <= '1';
    else load_en <= '0';
    end if;
  end process enable_logic;

  -- Latch load_en while clk is low so the gated clock cannot glitch
  deglitch : process(clk, load_en)
  begin
    if (clk='0') then load_en_latched <= load_en;
    end if;
  end process deglitch;

  clk_en <= clk and load_en_latched;

  -- Register is clocked only when the gated clock clk_en rises
  datapath : process
  begin
    wait until clk_en'event and clk_en='1';
    data_reg <= data_in;
  end process datapath;

  data_out <= data_reg;
end behave;

configuration cfg_gate of gate is
  for behave
  end for;
end cfg_gate;
Frequency Reduction
◈ Clock Gating Example - Report
Frequency Reduction
◈ 4-bit Synchronous & Ripple counter - code

4-bit Synchronous Counter

Library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

entity BINARY is
  Port ( clk   : In std_logic;
         reset : In std_logic;
         count : BUFFER UNSIGNED (3 downto 0));
end BINARY;

architecture BEHAVIORAL of BINARY is
begin
  -- All four bits are clocked by clk; asynchronous, active-low reset
  process(reset, clk, count)
  begin
    if (reset = '0') then
      count <= "0000";
    elsif (clk'event and clk = '1') then
      if (count = UNSIGNED'("1111")) then
        count <= "0000";
      else
        count <= count + UNSIGNED'("1");
      end if;
    end if;
  end process;
end BEHAVIORAL;

configuration CFG_BINARY_BLOCK_BEHAVIORAL of BINARY is
  for BEHAVIORAL
  end for;
end CFG_BINARY_BLOCK_BEHAVIORAL;
Frequency Reduction
4-bit Ripple Counter

Library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

entity RIPPLE is
  Port ( clk   : In std_logic;
         reset : In std_logic;
         count : BUFFER UNSIGNED (3 downto 0));
end RIPPLE;

architecture BEHAVIORAL of RIPPLE is
  signal count0, count1, count2 : std_logic;
begin
  -- Copy the lower counter bits so they can clock the next stage
  process(count)
  begin
    count0 <= count(0);
    count1 <= count(1);
    count2 <= count(2);
  end process;

  -- Bit 0 toggles on every rising edge of clk
  process(reset, clk)
  begin
    if (reset = '0') then
      count(0) <= '0';
    elsif (clk'event and clk = '1') then
      if (count(0) = '1') then count(0) <= '0';
      else count(0) <= '1';
      end if;
    end if;
  end process;

  -- Bit 1 toggles on every rising edge of bit 0
  process(reset, count0)
  begin
    if (reset = '0') then
      count(1) <= '0';
    elsif (count0'event and count0 = '1') then
      if (count(1) = '1') then count(1) <= '0';
      else count(1) <= '1';
      end if;
    end if;
  end process;

  -- Bit 2 toggles on every rising edge of bit 1
  process(reset, count1)
  begin
    if (reset = '0') then
      count(2) <= '0';
    elsif (count1'event and count1 = '1') then
      if (count(2) = '1') then count(2) <= '0';
      else count(2) <= '1';
      end if;
    end if;
  end process;

  -- Bit 3 toggles on every rising edge of bit 2
  process(reset, count2)
  begin
    if (reset = '0') then
      count(3) <= '0';
    elsif (count2'event and count2 = '1') then
      if (count(3) = '1') then count(3) <= '0';
      else count(3) <= '1';
      end if;
    end if;
  end process;
end BEHAVIORAL;

configuration CFG_RIPPLE_BLOCK_BEHAVIORAL of RIPPLE is
  for BEHAVIORAL
  end for;
end CFG_RIPPLE_BLOCK_BEHAVIORAL;
Frequency Reduction
◈ 4-bit Synchronous & Ripple counter - Report
Bus-Invert Coding for Low Power I/O
An eight-bit bus on which all eight lines toggle at the same time and which has a high peak (worst-case) power dissipation.
• There are 16 transitions over 16 clock cycles (average 1 transition per clock cycle).
Peak Power Dissipation
An eight-bit bus on which the eight lines toggle at different moments and which has a low peak power dissipation. There are the same 16 transitions over 16 clock cycles and thus the same average power dissipation.
Bus-Invert - Coding for low power
• The Bus-Invert method proposed here uses one extra control bit called invert. By convention, when invert = 0 the bus value will equal the data value. When invert = 1 the bus value will be the inverted data value. The peak power dissipation can then be decreased by half by coding the I/O as follows:
• 1. Compute the Hamming distance (the number of bits in which they differ) between the present bus value (also counting the present invert line) and the next data value.
• 2. If the Hamming distance is larger than n/2, set invert = 1 (and thus make the next bus value equal to the inverted next data value).
• 3. Otherwise, let invert = 0 (and let the next bus value equal the next data value).
• 4. At the receiver side the contents of the bus must be conditionally inverted according to the invert line, unless the data is stored encoded as it is (e.g. in a RAM). In any case the value of invert must be transmitted over the bus (the method increases the number of bus lines from n to n + 1).
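Steps 1 to 4 can be sketched in Python as follows. This is an illustrative sketch, not code from the slides: the function names are hypothetical, data words are assumed to be n-bit unsigned integers, and the bus and the invert line are assumed to start at zero.

```python
def hamming(x, y):
    """Number of bit positions in which x and y differ."""
    return bin(x ^ y).count("1")

def bus_invert_encode(words, n=8):
    """Return a list of (bus_value, invert) pairs for the given data words."""
    mask = (1 << n) - 1
    bus, invert = 0, 0                      # assumed initial bus state
    encoded = []
    for data in words:
        # Step 1: distance between present bus (with invert line) and next data
        dist = hamming(bus, data) + invert
        if dist > n / 2:                    # step 2: send inverted data
            bus, invert = (~data) & mask, 1
        else:                               # step 3: send data unchanged
            bus, invert = data, 0
        encoded.append((bus, invert))
    return encoded

def bus_invert_decode(encoded, n=8):
    """Step 4: conditionally invert the bus value at the receiver."""
    mask = (1 << n) - 1
    return [(~bus) & mask if invert else bus for bus, invert in encoded]

def transitions_per_slot(encoded):
    """Line toggles in each time slot, counting the invert line."""
    prev_bus, prev_inv = 0, 0
    counts = []
    for bus, invert in encoded:
        counts.append(hamming(prev_bus, bus) + (prev_inv != invert))
        prev_bus, prev_inv = bus, invert
    return counts

words = [0x00, 0xFF, 0x0F, 0xF0, 0xAA, 0x55]    # hypothetical data sequence
encoded = bus_invert_encode(words)
worst = max(transitions_per_slot(encoded))
```

For this sequence an unencoded bus would toggle all 8 lines in several slots, while the encoded bus never toggles more than n/2 = 4 of its n + 1 lines in any slot, and decoding returns the original words.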
Example
A typical eight-bit synchronous data bus. The transitions between two consecutive time slots are "clean". There are 64 transitions for a period of 16 time slots. This represents an average of 4 transitions per time slot, or 0.5 transitions per bus line per time slot.
Bus encoding
The same sequence of data coded using the Bus-Invert method. There are now only 53 transitions over a period of 16 time slots. This represents an average of 3.3 transitions per time slot, or 0.41 transitions per bus line per time slot. The maximum number of transitions for any time slot is now 4.
Comparisons
Comparison of unencoded I/O and coded I/O with one or more invert lines. The comparison looks at the average and maximum number of transitions per time-slot, per bus-line per time-slot, and I/O power dissipation for different bus-widths.
Remarks
• The increase in the delay of the data path: by looking at the power-delay product, which removes the effect of frequency (delay) on power dissipation, a clear improvement is obtained in the form of an absolute lower number of transitions. It is also relatively easy to pipeline the bus activity. The extra pipeline stage and the extra latency must then be considered.
• The increased number of I/O pins: as was mentioned before, ground bounce is a big problem for simultaneous switching in high-speed designs. That is why modern microprocessors use a large number of Vdd and GND pins. The Bus-Invert method has the side effect of decreasing the maximum ground bounce by approximately 50%. Thus circuits using the Bus-Invert method can use a lower number of Vdd and GND pins, and by using the method the total number of pins might even decrease.
• The Bus-Invert method decreases the total power dissipation even though both the total number of transitions increases (counting the extra internal transitions) and the total capacitance increases (because of the extra circuitry). This is possible because the transitions get redistributed very nonuniformly, more on the low-capacitance side and fewer on the high-capacitance side.
References
[1] H. B. Bakoglu, Circuits, Interconnections and Packaging for VLSI, Addison-Wesley, 1990.
[2] T. K. Callaway, E. E. Swartzlander, "Estimating the Power Consumption of CMOS Adders", 11th Symp. on Comp. Arithmetic, pp. 210-216, Windsor, Ontario, 1993.
[3] A. P. Chandrakasan, S. Sheng, R. W. Brodersen, "Low-Power CMOS Digital Design", IEEE Journal of Solid-State Circuits, pp. 473-484, April 1992.
[4] A. P. Chandrakasan, M. Potkonjak, J. Rabaey, R. W. Brodersen, "HYPER-LP: A System for Power Minimization Using Architectural Transformations", ICCAD-92, pp. 300-303, Nov. 1992, Santa Clara, CA.
[5] A. P. Chandrakasan, M. Potkonjak, J. Rabaey, R. W. Brodersen, "An Approach to Power Minimization Using Transformations", IEEE VLSI for Signal Processing Workshop, pp. , 1992, CA.
[6] S. Devadas, K. Keutzer, J. White, "Estimation of Power Dissipation in CMOS Combinational Circuits", IEEE Custom Integrated Circuits Conference, pp. 19.7.1-19.7.6, 1990.
[7] D. Dobberpuhl et al., "A 200-MHz 64-bit Dual-Issue CMOS Microprocessor", IEEE Journal of Solid-State Circuits, pp. 1555-1567, Nov. 1992.
[8] R. J. Fletcher, "Integrated Circuit Having Outputs Configured for Reduced State Changes", U.S. Patent no. 4,667,337, May 1987.
[9] D. Gajski, N. Dutt, A. Wu, S. Lin, High-Level Synthesis, Introduction to Chip and System Design, Kluwer Academic Publishers, 1992.
[10] J. S. Gardner, "Designing with the IDT SyncFIFO: the Architecture of the Future", 1992 Synchronous (Clocked) FIFO Design Guide, Integrated Device Technology AN-60, pp. 7-10, 1992, Santa Clara, CA.
[11] A. Ghosh, S. Devadas, K. Keutzer, J. White, "Estimation of Average Switching Activity in Combinational and Sequential Circuits", Proceedings of the 29th DAC, pp. 253-259, June 1992, Anaheim, CA.
[12] J. L. Hennessy, D. A. Patterson, Computer Architecture - A Quantitative Approach, Morgan Kaufmann Publishers, Palo Alto, CA, 1990.
[13] S. Kodical, "Simultaneous Switching Noise", 1993 IDT High-Speed CMOS Logic Design Guide, Integrated Device Technology AN-47, pp. 41-47, 1993, Santa Clara, CA.
[14] F. Najm, "Transition Density, A Stochastic Measure of Activity in Digital Circuits", Proceedings of the 28th DAC, pp. 644-649, June 1991, Anaheim, CA.
[16] A. Park, R. Maeder, "Codes to Reduce Switching Transients Across VLSI I/O Pins", Computer Architecture News, pp. 17-21, Sept. 1992.
[17] Rambus - Architectural Overview, Rambus Inc., Mountain View, CA, 1993. Contact [email protected].
[18] A. Shen, A. Ghosh, S. Devadas, K. Keutzer, "On Average Power Dissipation and Random Pattern Testability", ICCAD-92, pp. 402-407, Nov. 1992, Santa Clara, CA.
[19] M. R. Stan, "Shift register generators for circular FIFOs", Electronic Engineering, pp. 26-27, February 1991, Morgan Grampian House, London, England.
[20] M. R. Stan, W. P. Burleson, "Limited-weight codes for low power I/O", International Workshop on Low Power Design, April 1994, Napa, CA.
[21] J. Tabor, Noise Reduction Using Low Weight and Constant Weight Coding Techniques, Master's Thesis, EECS Dept., MIT, May 1990.
[22] W.-C. Tan, T. H.-Y. Meng, "Low-power polygon renderer for computer graphics", Int. Conf. on A.S.A.P., pp. 200-213, 1993.
[23] N. Weste, K. Eshraghian, Principles of CMOS VLSI Design, A Systems Perspective, Addison-Wesley Publishing Company, 1988.
[24] R. Wilson, "Low power and paradox", Electronic Engineering Times, pp. 38, November 1, 1993.
[25] J. Ziv, A. Lempel, "A universal Algorithm for Sequential Data Compression", IEEE Trans. on Inf. Theory, vol. IT-23, pp. 337-343, 1977.
DesignPower Gate Level Power Model
◈ Switching Power
– Power dissipated when a load capacitance (gate + wire) is charged or discharged at the driver's output
– If the technology library contains the correct capacitance value of the cell and if the capacitive_load_unit attribute is specified, then no additional information is needed for switching power modeling
– Output pin capacitance need not be modeled if the switching power is incorporated into the internal power

$$P_{sw} = \frac{V^2}{2} \sum_{\forall\ \mathrm{nets}\ i} \left[ C_i \times TR_i \right]$$
DesignPower Gate Level Power Model
◈ Internal Power
– Power dissipated internal to a library cell
– Modeled using an energy lookup table indexed by input transition time and output load
– Library cells may contain one or more internal energy lookup tables

$$P_{int} = \sum_{\forall\ \mathrm{cells}\ i} \left[ E_{int}(\mathrm{output\ load},\ \mathrm{input\ transition}) \times TR_i \right]$$
DesignPower Gate Level Power Model
◈ Leakage Power
– The leakage power model supports a single value for each library cell
– State-dependent leakage power is not supported

$$P_{leak} = \sum_{\forall\ \mathrm{cells}\ i} P_{leak_i}$$
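The three components of the gate-level model (switching, internal, and leakage power) are plain sums over nets and cells, which the following sketch illustrates. The function names and all numeric values are hypothetical and only show how the sums combine; TR stands for the toggle rate used in the formulas.

```python
def switching_power(nets, vdd):
    """P_sw = (V^2 / 2) * sum over nets of C_i * TR_i.

    nets: iterable of (capacitance_in_farads, toggles_per_second).
    """
    return 0.5 * vdd ** 2 * sum(c * tr for c, tr in nets)


def internal_power(cells):
    """P_int = sum over cells of E_int * TR_i.

    cells: iterable of (energy_per_transition_in_joules, toggles_per_second),
    where the energy is assumed to have been looked up already from the
    library table for the cell's output load and input transition time.
    """
    return sum(e * tr for e, tr in cells)


def leakage_power(cell_leakages):
    """P_leak = sum over cells of the single leakage value per cell."""
    return sum(cell_leakages)


# Hypothetical two-net, two-cell design at 3.3 V
p_sw = switching_power([(1e-12, 1e6), (2e-12, 5e5)], vdd=3.3)
p_int = internal_power([(0.5e-12, 1e6), (0.2e-12, 5e5)])
p_leak = leakage_power([1e-9, 2e-9])
p_total = p_sw + p_int + p_leak
```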
Operand Isolation
[Figure: two block diagrams of an FSM and register bank (EN, D/Q) feeding logic that produces Data_out. The first is marked "Significant Power Dissipation"; the second adds a latch (G) in front of the logic.]
• Combinational logic dissipates significant power when its output is unused
• Inputs to the combinational logic are held stable when the output is unused
Operand Isolation Example - Diagram
[Figure: datapath before and after operand isolation. Before: the 8-bit inputs a and b feed ADD (Data_Add), whose result is multiplied by c in MUL (Data_Mul, 16 bits) and stored in the data register (output do), clocked by Clk_En, which the FSM and latch derive from Load_En, Load_En_Latched, clk and rst. After: an additional latch gated by Load_En_Latched holds Iso_Data_Add between ADD and MUL.]
Operand Isolation Example - Before Code

Library IEEE;
Use IEEE.STD_LOGIC_1164.ALL;
Use IEEE.STD_LOGIC_SIGNED.ALL;

Entity Logic is
  Port(
    a, b, c : in std_logic_vector(7 downto 0);
    do      : out std_logic_vector(15 downto 0);
    rst     : in std_logic;
    clk     : in std_logic
  );
End Logic;

Architecture Behave of Logic is
  Signal Count           : integer;
  Signal Load_En         : std_logic;
  Signal Load_En_Latched : std_logic;
  Signal Clk_En          : std_logic;
  Signal Data_Add        : std_logic_vector(7 downto 0);
  Signal Data_Mul        : std_logic_vector(15 downto 0);
Begin

  Process(clk,rst) -- Counter Logic in FSM
  Begin
    If(clk='1' and clk'event) then
      If(rst='0') then
        Count <= 0;
      Elsif(Count=9) then
        Count <= 0;
      Else
        Count <= Count + 1;
      End If;
    End If;
  End Process;

  Process(Count) -- Enable Logic in FSM
  Begin
    If(Count=9) then
      Load_En <= '1';
    Else
      Load_En <= '0';
    End If;
  End Process;

  Process(clk,Load_En) -- Latch (for Deglitch) Logic
  Begin
    If(clk='0') then
      Load_En_Latched <= Load_En;
    End If;
  End Process;

  Clk_En <= clk and Load_En_Latched;

  Data_Add <= a + b;

  Data_Mul <= Data_Add * c;

  Process(Data_Mul,Clk_En) -- Data Reg Logic
  Begin
    If(Clk_En='1' and Clk_En'event) then
      do <= Data_Mul;
    End If;
  End Process;

End Behave;

Configuration CFG_Logic of Logic is
  for Behave
  End for;
End CFG_Logic;
Operand Isolation Example - After Code

Library IEEE;
Use IEEE.STD_LOGIC_1164.ALL;
Use IEEE.STD_LOGIC_SIGNED.ALL;

Entity Logic1 is
  Port(
    a, b, c : in std_logic_vector(7 downto 0);
    do      : out std_logic_vector(15 downto 0);
    rst     : in std_logic;
    clk     : in std_logic
  );
End Logic1;

Architecture Behave of Logic1 is
  Signal Count           : integer;
  Signal Load_En         : std_logic;
  Signal Load_En_Latched : std_logic;
  Signal Clk_En          : std_logic;
  Signal Data_Add        : std_logic_vector(7 downto 0);
  Signal Data_Mul        : std_logic_vector(15 downto 0);
  Signal Iso_Data_Add    : std_logic_vector(7 downto 0);
Begin

  Process(clk,rst) -- Counter Logic in FSM
  Begin
    If(clk='1' and clk'event) then
      If(rst='0') then
        Count <= 0;
      Elsif(Count=9) then
        Count <= 0;
      Else
        Count <= Count + 1;
      End If;
    End If;
  End Process;

  Process(Count) -- Enable Logic in FSM
  Begin
    If(Count=9) then
      Load_En <= '1';
    Else
      Load_En <= '0';
    End If;
  End Process;

  Process(clk,Load_En) -- Latch (for Deglitch) Logic
  Begin
    If(clk='0') then
      Load_En_Latched <= Load_En;
    End If;
  End Process;

  Clk_En <= clk and Load_En_Latched;

  Data_Add <= a + b;

  Process(Load_En_Latched,Data_Add) -- Latch for Operand Isolation
  Begin
    If(Load_En_Latched='1' and Load_En_Latched'event) then
      Iso_Data_Add <= Data_Add;
    End If;
  End Process;

  Data_Mul <= Iso_Data_Add * c;

  Process(Data_Mul,Clk_En) -- Data Reg Logic
  Begin
    If(Clk_En='1' and Clk_En'event) then
      do <= Data_Mul;
    End If;
  End Process;

End Behave;
Operand Isolation Example - Report
Before Code After Code
Precomputation
• Power saving
– Reduces power dissipation of combinational logic
– Reduces internal power to precomputed registers
• Opportunity
– Can be significant, dependent on:
  • percentage of time latch precomputation is successful
  • number of bits to precompute
• Cost
– Increase area
– Impact circuit timing
– Increase design complexity
– Testability
  • may generate redundant logic
Precomputation
[Figure: two register-bank and logic block diagrams producing Data_out. (1) Entire function is computed. (2) Smaller function is defined, enable is precomputed: the inputs are split into m and n - m bits, and the precomputed 1-bit enable (EN) gates one of the register banks.]
Precomputation
• Before Precomputation Diagram
[Figure: the 8-bit inputs a and b are registered on CLK and feed an a > b comparator; its 1-bit result is registered as Data_out.]
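Precomputation
• After Precomputation Diagram
[Figure: a(7) and b(7) are registered on CLK; a(6:0) and b(6:0) (7 bits each) are registered through a latch-gated clock derived from a(7) and b(7); the a > b comparator output is registered as the 1-bit Data_out.]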
• Before Precomputation - Report
Precomputation
• After Precomputation - Report
Precomputation
Precomputation Example - Before Code

Library IEEE;
Use IEEE.STD_LOGIC_1164.ALL;

Entity before_precomputation is
  port ( a, b  : in std_logic_vector(7 downto 0);
         CLK   : in std_logic;
         D_out : out std_logic);
end before_precomputation;

Architecture Behav of before_precomputation is
  signal a_in, b_in : std_logic_vector(7 downto 0);
  signal comp : std_logic;
Begin
  process (a, b, CLK)
  Begin
    if (CLK = '1' and CLK'event) then
      a_in <= a;
      b_in <= b;
    end if;
    if (a_in > b_in) then
      comp <= '1';
    else
      comp <= '0';
    end if;
    if (CLK'event and CLK='1') then
      D_out <= comp;
    end if;
  end process;
end Behav;
Precomputation Example - After Code

Library IEEE;
Use IEEE.STD_LOGIC_1164.ALL;

Entity after_precomputation is
  port ( a, b  : in std_logic_vector(7 downto 0);
         CLK   : in std_logic;
         D_out : out std_logic);
end after_precomputation;

Architecture Behav of after_precomputation is
  signal a_in, b_in : std_logic_vector(7 downto 0);
  signal pcom, pcom_D : std_logic;
  signal CLK_en, comp : std_logic;
Begin

  process (CLK) -- MSB registers, always clocked
  Begin
    if (CLK='1' and CLK'event) then
      a_in(7) <= a(7);
      b_in(7) <= b(7);
    end if;
  end process;

  -- Precomputed enable: the lower bits only matter when the MSBs are equal
  pcom <= not (a(7) xor b(7));

  process (CLK, pcom) -- Latch (for deglitch) on the enable
  Begin
    if (CLK='0') then
      pcom_D <= pcom;
    end if;
  end process;

  CLK_en <= pcom_D and CLK;

  process (CLK_en) -- Lower-bit registers, clocked only when enabled
  Begin
    if (CLK_en='1' and CLK_en'event) then
      a_in(6 downto 0) <= a(6 downto 0);
      b_in(6 downto 0) <= b(6 downto 0);
    end if;
  end process;

  comp <= '1' when (a_in > b_in) else '0';

  process (CLK) -- Output register
  Begin
    if (CLK='1' and CLK'event) then
      D_out <= comp;
    end if;
  end process;

end Behav;
Peak Power Reduction
• Peak power is related to EMI
• Reducing concurrent switching reduces peak power
– Adjust the delay within the speed of the system clock in the bus/port driver
– Consider the power consumption of the delay element
– While maintaining total power consumption, we improve EMI through peak power reduction
• Before Peak Power Reduction
[Figure: n-bit-wide bus and its total current I_total versus time t; energy E1]
• After Peak Power Reduction
[Figure: n-bit-wide bus with delayed lines and its total current I_total versus time t; energy E2]

$$E = V_{dd} \int_t I_{total}\, dt$$
Factoring Example
• Function: f = ab + bc + cd
• The function f is not on the critical path.
• The signals a, b, c and d are all the same bit width.
• Signal b is a high-activity net.
• The two factorings below are equivalent from both a timing and an area criterion.
• Net result: network toggling and power are reduced.
[Figure: gate networks for the two factorings of f]
f = b(a + c) + cd
f = ab + c(b + d)
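A quick way to convince oneself that two factorings compute the same function is an exhaustive truth-table check. The sketch below assumes the function is f = ab + bc + cd and compares it with the factorings b(a + c) + cd and ab + c(b + d); the second factoring and the function names are assumptions made for illustration.

```python
from itertools import product


def f_sum_of_products(a, b, c, d):
    return (a and b) or (b and c) or (c and d)


def f_factoring_1(a, b, c, d):
    # b(a + c) + cd: the high-activity net b drives a single gate input
    return (b and (a or c)) or (c and d)


def f_factoring_2(a, b, c, d):
    # ab + c(b + d): the high-activity net b drives two gate inputs
    return (a and b) or (c and (b or d))


# Compare all three forms on all 16 input combinations
equivalent = all(
    f_sum_of_products(*v) == f_factoring_1(*v) == f_factoring_2(*v)
    for v in product([False, True], repeat=4)
)
```

Under this assumption the first factoring loads the high-activity net b with fewer gate inputs, which is the kind of difference that reduces network toggling.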
Block diagram of a low-voltage, high-speed LSI
• The Power Management Processor (PMP) controls the low-Vt circuit using the sleep signal.
• Extend the sleep period as much as possible, because leakage power is reduced during this time.
Operations of low-Vt LSI
The LSI receives a request signal from an I/O device, outputs the results, and waits for the next request signal. During the waiting period, the low-Vt circuit can sleep.
Waking/Sleeping operation
Waking operation Sleeping operation
Creating sleep period: Operation during calculation
• Heavy operations such as voice CODEC and light operations such as data collection can be distributed between the low-Vt circuit and the PMP, and the low-Vt circuit can sleep while the PMP is executing light operations.
• This reduces the power by 10%.
Low Power Logic Gate Resynthesis on Mapped Circuit
김현상, 조준동
School of Electrical and Computer Engineering, Sungkyunkwan University
Transition Probability
• Transition probability: probability of a transition at the output of a gate, given a change at the inputs
• Use signal probabilities
• Example: F = X'Y + XY'
– Signal prob. of F: Pf = Px(1-Py) + (1-Px)Py
– Transition prob. of F = 2Pf(1-Pf)
– Assumption of independence of inputs
• Use BDDs to compute these
• References: Najm'91
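The XOR example can be evaluated numerically. The sketch below implements the two formulas from the slide under the stated independence assumption; the function names and sample probabilities are illustrative.

```python
def xor_signal_probability(px: float, py: float) -> float:
    """Signal probability of F = X'Y + XY' for independent inputs."""
    return px * (1 - py) + (1 - px) * py


def transition_probability(pf: float) -> float:
    """Transition probability 2 * Pf * (1 - Pf) of a signal with probability Pf."""
    return 2 * pf * (1 - pf)


pf_balanced = xor_signal_probability(0.5, 0.5)
pf_skewed = xor_signal_probability(0.1, 0.2)
tp_balanced = transition_probability(pf_balanced)
tp_skewed = transition_probability(pf_skewed)
```

With balanced inputs the output switches with the maximum probability of 0.5, while skewed input probabilities lower the switching activity.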
Technology Mapping
• Implementing a Boolean network in terms of gates from a given library
• Popular technique: tree-based mapping
• Library gates and circuits decomposed into canonical patterns
• Pattern matching and dynamic programming to find the best cover
• NP-complete for general DAG circuits
• Ref: Keutzer'87, Rudell'89
• Idea: high transition probability points are hidden within gates
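Low Power Cell Mapping
• Example of high switching activity node
• Internal mapping in complex gate
[Figure: a gate network with inputs A, B, C, D, intermediate node Q and output Y, and the same inputs and output mapped with a complex gate.]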
Signal Probability vs. Power
[Figure: power P(x) ~ p(x)(1 - p(x)) plotted against signal probability p(x) from 0.0 to 1.0, with the regions p(x) < 0.5 and p(x) > 0.5 on either side of 0.5.]
Spatial Correlation
[Figure: two circuits computing z from x and y, with P(x) = 0.25 and P(y) = 0.25. In one, P(z) = 0.4375; in the other, where x and y are derived from shared primary inputs of probability 0.5, P(z) = 0.375.]
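The two values of P(z) can be reproduced by enumeration. The sketch below assumes a particular circuit that is consistent with the numbers in the figure: x = a AND b, y = b AND c, z = x OR y, with independent primary inputs of probability 0.5. The exact gate structure is an assumption, since the figure itself is not recoverable.

```python
from itertools import product


def exact_probability(fn, n_inputs, p=0.5):
    """Probability that fn is 1 over independent inputs, each 1 with probability p."""
    total = 0.0
    for bits in product([0, 1], repeat=n_inputs):
        weight = 1.0
        for bit in bits:
            weight *= p if bit else 1 - p
        if fn(*bits):
            total += weight
    return total


def z_shared(a, b, c):
    x = a & b          # x and y share the input b, so they are correlated
    y = b & c
    return x | y


p_x = exact_probability(lambda a, b: a & b, 2)
# Treating x and y as independent over-estimates P(z)
p_z_independent = 1 - (1 - p_x) * (1 - p_x)
# Enumerating the primary inputs accounts for the shared input
p_z_exact = exact_probability(z_shared, 3)
```

Ignoring the correlation gives 0.4375, while the exact value is 0.375, which is why power estimators that assume independent internal signals can be off.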
Low Power Logic Synthesis
[Figure: synthesis flow from RTL description through logic synthesis to gate-level description. Logic synthesis consists of technology-independent optimization (logic equations), technology mapping (connection of gates), and resynthesis on the mapped circuit, supported by timing and power analysis tools.]
Technology Mapping
[Figure: three mappings (a), (b), (c) of the same network, with nodes labeled h (high switching activity node) or l (low switching activity node).]
Tree Decomposition
[Figure: tree decompositions (a) and (b) of output f using AND gates, with primary inputs and the critical path marked; one decomposition is labeled "Low Power".]
Huffman Algorithm
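[Figure: Huffman tree. Leaves x1 to x4 have weights 2, 3, 4, 4 and leaf x5 has weight 10; internal nodes y1 = 5, y2 = 8, y3 = 13, and the root has weight 23.]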
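The tree in the figure follows from repeatedly merging the two smallest weights. The sketch below is a generic Huffman merge using a heap, applied to the leaf weights 2, 3, 4, 4, 10 read from the figure; the function name is illustrative.

```python
import heapq


def huffman_internal_weights(weights):
    """Merge the two smallest weights until one node remains.

    Returns the weights of the internal nodes in creation order.
    """
    heap = list(weights)
    heapq.heapify(heap)
    internal = []
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        internal.append(a + b)
        heapq.heappush(heap, a + b)
    return internal


internal_nodes = huffman_internal_weights([2, 3, 4, 4, 10])
total_internal_weight = sum(internal_nodes)
```

Because low-weight leaves end up deepest in the tree, the sum of the internal node weights is minimized, which is the property a low-power decomposition exploits when the weights represent switching activity.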
Depth-Constrained Decomposition
• Algorithm
• problem: minimize SUM from i=1 to m of p_t(x_i)
• input: input signal probabilities (p1, p2, ..., pn), height (h), number of leaf nodes (n), fan-in limit per gate (k)
• output: k-ary tree topology
• Begin
• sort (signal probabilities p1, p2, ..., pn);
• while (n != 0)
•   if (h > log_k n)
•     assign k nodes to level L (= h+1);
•     h = h-1, n = n-(k-1); /* upward */
•   else if (h < log_k n)
•     assign k nodes to level L (= h+2); /* the previous level */
•     h = h, n = n-(k-1); /* downward */
•   else (h = log_k n)
•     assign the remaining nodes to level L (= h+1);
•     /* complete: assign all remaining nodes to level L (= h+1) and build a complete k-ary tree */
• for (bottom level L; L > 1; L--)
•   min_edge_weight_matching (nodes in level L);
• End
Example
[Figure: step-by-step construction of the tree for h = 1, h = 2 and h = 3 over levels L = 0 to L = 3. Leaves a, b, c, d have probabilities 0.1, 0.2, 0.3, 0.4 and leaves e, f have 0.5, 0.6. Before matching, (a, b) feed x and (c, d) feed y; after matching, (a, d) feed x and (b, c) feed y.]
After Decomposition
[Figure: bar chart of value and improvement ratio versus fan-in and height for K1 = 2, comparing SIS and SIS+OURS.]
After Tech. Mapping
[Figure: bar chart of power (mW) and improvement ratio versus fan-in and height for K1 = 3, k2 = 3, comparing SIS+LEVEL MAP and SIS+OURS+LEVEL MAP.]
7. Circuit Level Design
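Buffer Chain
• Delay analysis of buffer chain
• Delay analysis considering parasitic capacitance, Cp
[Figure: chain of n buffer stages from the input to the load. Stage i has size β^(i-1) and input capacitance β^(i-1) C_in, and the load is C_L = β^n C_in, where β is the size ratio between successive stages.]

$$\beta = \frac{(W/L)_k}{(W/L)_{k-1}}, \qquad t_{dk} = \beta\, t_0$$
$$T_d = \sum_{k=1}^{n} t_{dk} = n \beta t_0$$
$$\frac{C_L}{C_{in}} = \beta^n \;\Rightarrow\; n = \frac{\ln(C_L/C_{in})}{\ln \beta}$$
$$T_d = \frac{\beta}{\ln \beta}\, t_0 \ln(C_L/C_{in})$$
$$\frac{\partial T_d}{\partial \beta} = 0 \;\Rightarrow\; \beta_{optimum} = e \approx 2.72, \qquad n_{optimum} = \ln(C_L/C_{in})$$

$$P_k = C_k V_{dd}^2 f, \qquad P_T = \sum_{k=1}^{n} P_k, \qquad Eff = \frac{P_n}{P_T}$$

C_k, P_k: total capacitance and power at the output of the stage-k buffer
P_T: power consumption of the buffer chain
P_n: power consumption of the load capacitance C_L
Eff: power efficiency P_n/P_T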
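The delay side of the buffer chain analysis can be checked numerically. The sketch below assumes the usual tapered-buffer model in which each stage is a factor beta larger than the previous one, each stage delay is beta times a unit delay t0, and beta^n = C_L / C_in; the symbol names and the load ratio of 1000 are illustrative assumptions, and parasitic capacitance is ignored.

```python
import math


def buffer_chain(c_load_over_c_in: float, beta: float, t0: float = 1.0):
    """Return (number_of_stages, total_delay) for stage ratio beta.

    Stages are treated as a real number, as in the analytical derivation.
    """
    n = math.log(c_load_over_c_in) / math.log(beta)
    total_delay = n * beta * t0
    return n, total_delay


ratio = 1000.0
n_e, delay_e = buffer_chain(ratio, math.e)   # stage ratio e
n_2, delay_2 = buffer_chain(ratio, 2.0)      # smaller stage ratio
n_4, delay_4 = buffer_chain(ratio, 4.0)      # larger stage ratio
```

In this model the delay is smallest at a stage ratio of e, but it grows only slowly for larger ratios, which is why practical designs often use larger ratios and fewer stages to save power and area.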
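Slew Rate
• Determining rise/fall time
[Figure: input waveform V_in over period T with rise time t_r and fall time t_f, thresholds V_tn and V_dd + V_tp, and the short-circuit current I_short (peak I_max, mean I_mean) flowing between t_1 and t_3 with its peak at t_2.]

$$I_{mean} = \frac{2}{T}\left[\int_{t_1}^{t_2} I_{short}(t)\,dt + \int_{t_2}^{t_3} I_{short}(t)\,dt\right] = \frac{4}{T}\int_{t_1}^{t_2} I_{short}(t)\,dt = \frac{4}{T}\int_{t_1}^{t_2} \frac{\beta}{2}\left(V_{in}(t) - V_t\right)^2 dt$$

where $V_t = V_{tn} = -V_{tp}$ and $\beta = \beta_n = \beta_p$,

$$P_{SC} = I_{mean} V_{dd} = \frac{\beta}{12}\left(V_{dd} - 2V_t\right)^3 \tau f$$

where $\tau = t_r = t_f$.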
Slew Rate (Cont'd)
• Power consumption of short-circuit current in an oscillation circuit
[Figure: two circuits with input Vi and output Vo between Vdd rails, and the Vi and Vo waveforms.]
Pass Transistor Logic
• Reducing area/power
– Macro cells (a large part of the chip area) and XOR/XNOR/MUX primitives: pass-transistor logic
– Does not use a charge/discharge scheme: appropriate for low-power logic
• Pass-transistor logic family
– CPL (Complementary Pass Transistor Logic)
– DPL (Dual Pass Transistor Logic)
– SRPL (Swing Restored Pass Transistor Logic)
• CPL
– Basic scheme
– Inverter buffering
[Figure: CPL gates with inputs A, B and their complements producing AB and its complement; the buffered version adds output inverters and a p-MOS latch tied to Vdd.]
Pass Transistor Logic (Cont'd)
• DPL
– Pass-transistor network + dual p-MOS
– Enables rail-to-rail swing
– Characteristics
  • Increased input capacitance (delay)
  • Increased driving ability from the 2 existing ON-paths
  • Equals CPL in input loading capacitance
• SRPL
– Pass-transistor network + cross-coupled inverters
– Restores the logic level
– Inverter size must not be too big
[Figure: DPL gate with inputs A, B and their complements; SRPL gate built from an n-MOS CPL network followed by cross-coupled inverters.]
Dynamic Logic
• Uses a precharge/evaluation scheme
• Family
– Domino logic
– NORA (NO RAce) logic
• Characteristics
– Decreased input loading capacitance
– Power consumption in the precharge clock
– Increased useless switching in the precharging period
• Basic architecture of Domino logic
[Figure: Domino gate with precharge transistor P1, evaluation transistor N1, an N logic block with inputs A and B, clock clk, capacitances Cin and CL, and a timing diagram showing the precharge and evaluation phases.]
Input Pin Ordering
• Reorder the equivalent inputs to a transistor based on critical path delays and power consumption
• N-input primitive CMOS logic
– symmetrical at the function level
– antisymmetrical at the transistor level
  • capacitance of the output stage
  • body effect
• Scheme
– The signal that has many transitions must be far from the output
– If it is hard to estimate the switching frequency, we must determine the pin ordering considering the path and path delay balance from the primary input to the input of the transistor.
• Example of N-input CMOS logic
[Figure: series transistor stack with inputs A, B, C, D, internal node capacitances C1, C2, C3, and load CL.]
• Experiment with a TI gate array: for a 4-input NAND gate in TI's BiCMOS gate array library (with a load of 13 inverters), the delay varies by 20% and the power dissipation by 10% between a good and a bad ordering.
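INPUT PIN Reordering
[Figure: 4-input NAND gate with pull-up transistors MPA to MPD to VDD, series pull-down transistors MNA to MND, internal node capacitances CB, CC, CD, and load CL; input waveforms for cases (a), (b), (c), (d).]
Simulation result (tcycle = 50 ns, tf/tr = 1 ns): 38.4 uW when A is the critical input, 47.2 uW when D is the critical input.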
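Sensitization
• Definition
– sensitization: an input signal that forces an output transition event
– sensitization vector: the values of the other inputs when one signal is sensitized

$$\frac{\partial Y}{\partial X_i} = f_{X_i=1} \oplus f_{X_i=0}$$
$$f_{X_i=1} = f(X_1, \dots, X_{i-1}, 1, X_{i+1}, \dots, X_n), \qquad f_{X_i=0} = f(X_1, \dots, X_{i-1}, 0, X_{i+1}, \dots, X_n)$$

• Example
[Figure: gate network with inputs X1, X2, X3]

$$Y = (X_1 + X_2) X_3$$
$$\frac{\partial Y}{\partial X_1} = f_{X_1=1} \oplus f_{X_1=0} = X_3 \oplus X_2 X_3 = X_3 \overline{X_2}$$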
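The sensitization vectors of an input can be found by brute force: for every assignment of the other inputs, check whether toggling the chosen input toggles the output. The sketch below does this for the assumed example function Y = (X1 + X2) X3; the helper name and the function itself are illustrative assumptions.

```python
from itertools import product


def sensitization_vectors(f, i, n):
    """Assignments of the other n-1 inputs for which input i controls f.

    Equivalent to the cases where f with input i = 1 differs from
    f with input i = 0 (the XOR of the two cofactors is 1).
    """
    vectors = []
    for rest in product([0, 1], repeat=n - 1):
        low = rest[:i] + (0,) + rest[i:]
        high = rest[:i] + (1,) + rest[i:]
        if f(*low) != f(*high):
            vectors.append(rest)
    return vectors


def y(x1, x2, x3):
    return (x1 | x2) & x3


# Values of (X2, X3) for which a change on X1 propagates to Y
x1_vectors = sensitization_vectors(y, 0, 3)
# Values of (X1, X2) for which a change on X3 propagates to Y
x3_vectors = sensitization_vectors(y, 2, 3)
```

For this function X1 is sensitized only when X2 = 0 and X3 = 1, so transitions on X1 under any other condition cannot reach the output.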
Sensitization (Cont'd)
• Considering sensitization in combinational logic: removes unnecessary transitions in the combinational logic
• Considering sensitization in sequential logic: also reduces the power consumption in the flip-flops
[Figure: combinational logic blocks with inputs X1 to Xn, enable E, and output Y taken through a storage element (Q); a sequential version with D flip-flops clocked by clk.]
TTL-Compatible
• TTL-level signal driving a CMOS input
• Characteristic curve of CMOS inverter
[Figure: CMOS inverter at Vdd = 3.3 V driven by a TTL input, and its Vo versus Vi transfer curve with VIL = 0.8 V, VIH = 2.0 V and 1.4 V marked; drain currents IDTTL1 and IDTTL2, with Ileak = avg(Id1, Id2).]

$$P_{TTL} = N_{TTL}\, V_{dd}\, (I_{DTTL1} + I_{DTTL2})$$

where $N_{TTL}$ is the number of TTL-compatible input pads.
TTL Compatible (Cont'd)
• CMOS output signal driving a TTL input
– Because of the sink current IOL, the CMOS output dissipates a large amount of heat
– Increased chip operating temperature
– Power consumption of the whole system
[Figure: output pad of one chip driving the input pad of another across the chip boundaries, with output-low voltage VOL and sink current IOL.]
INPUT PIN Reordering
◈ To reduce the power dissipation one should place the input with low transition density near the ground end.
(a) If MNA turns off, only CL needs to be charged.
(b) If MND turns off, all of CL, CB, CC and CD need to be charged.
(c) If the critical input is rising and placed near the output node, the initial charge of CB, CC and CD is zero and the delay time of CL discharging is less than in (d).
(d) If the critical input is rising and placed near the ground end, the charge of CB, CC and CD must discharge before the charge of CL discharges to zero.
Low-Power Booth Multiplier Design
School of Electrical and Computer Engineering, Sungkyunkwan University
김진혁, 이준성, 조준동
Modified Booth Multiplier
• Multibit recoding halves the number of partial products, enabling high-speed multiplication.
• Multiplicand: X, multiplier: Y

Recoded digit = Y2i-1 + Y2i - 2 Y2i+1   (Y-1 = 0)

Y2i+1  Y2i  Y2i-1   Recoded digit   Operation on X
  0     0     0          0               0X
  0     0     1         +1              +1X
  0     1     0         +1              +1X
  0     1     1         +2              +2X
  1     0     0         -2              -2X
  1     0     1         -1              -1X
  1     1     0         -1              -1X
  1     1     1          0               0X

Recoded digit   Operation on X
   0            Add 0 to the partial product
  +1            Add X to the partial product
  +2            Shift X left one position and add it to the partial product
  -1            Add the two's complement of X to the partial product
  -2            Take the two's complement of X and shift left one position

< Generation and operation of recoded digit >
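The recoding rule can be exercised directly in software. The sketch below applies Recoded digit = Y2i-1 + Y2i - 2 Y2i+1 to an even-width two's-complement multiplier and then sums the weighted multiples of X; the function names are illustrative, and this is a behavioural model rather than a description of the multiplier hardware.

```python
def booth_radix4_digits(y: int, bits: int):
    """Recoded digits of a two's-complement multiplier, least significant first.

    y is given as an unsigned bit pattern of even width `bits`.
    """
    digits = []
    prev = 0  # Y(-1) = 0
    for i in range(0, bits, 2):
        y_2i = (y >> i) & 1
        y_2i_plus_1 = (y >> (i + 1)) & 1
        digits.append(prev + y_2i - 2 * y_2i_plus_1)
        prev = y_2i_plus_1
    return digits


def booth_multiply(x: int, y_pattern: int, bits: int) -> int:
    """Multiply signed x by the multiplier bit pattern using the recoded digits."""
    digits = booth_radix4_digits(y_pattern, bits)
    # Digit k has weight 4**k, i.e. a left shift by 2k positions
    return sum(d * x * 4 ** k for k, d in enumerate(digits))


digits_105 = booth_radix4_digits(0b01101001, 8)   # Y = +105
product_value = booth_multiply(-107, 0b01101001, 8)
```

An 8-bit multiplier yields only four recoded digits, so four partial products are summed instead of eight.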
Modified Booth Multiplier - Example
   10010101 = X (-107)
   01101001 = Y (+105)

Operation   Bits recoded   Partial product (with sign extension)
   +1           010        1111111110010101
   -2           100        00000011010110
   -1           101        000001101011
   +2           011        1100101010

                           1101010000011101 = P (-11235)
Wallace Tree - 4:2 Compressor
[Figure: (a) dot diagram of the partial products from X0Y0 to X7Y7 reduced in a 1st and a 2nd stage to two summands to be added; legend: zero, bit jumping level, partial product, bit generated by compressor. (b) block diagram: two 4*8 partial product generators (inputs X3..X0 and X7..X4 with the 8-bit Y), each followed by 8 4-2 compressors (1st stage, blocks A and B), then 11 4-2 compressors (2nd stage, block C) and a 16-bit adder producing P15..P0.]
Multipliers - Area

• 16-bit Multiplier Area

Multiplier type    Area (mm2)   Gate count
Array                 4.2          2,378
Wallace               8.1          2,544
Modified Booth        8.5          3,375
Multiplier - Power

• Average Power Dissipation (16-bit)

Multiplier type    Power (mW)   Logic transitions
Array                43.5          7,224
Wallace              32.0          3,793
Modified Booth       41.3          3,993
Multiplier - Delay

• Worst-Case Delay (16-bit)

Multiplier type    Delay (ns)   Gate delays
Array                92.6           50
Wallace              54.1           35
Modified Booth       45.4           32
Instruction Level Power Analysis

• Estimate power dissipation of instruction sequences and power dissipation of a program
• Eb: base cost of individual instructions
  Es: circuit state change effects

$$E_b = \sum_i B_i N_i \qquad E_s = \sum_{i,j} O_{i,j} N_{i,j}$$

• EM: the overall energy cost of a program

$$E_M = E_b + E_s$$

  Bi: the base cost of a type i instruction
  Ni: the number of type i instructions
  Oi,j: the cost incurred when a type i instruction is followed by a type j instruction
  Ni,j: the number of occurrences in which a type i instruction is immediately followed by a type j instruction
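The energy model can be applied to a concrete instruction sequence. The sketch below uses the base costs and circuit state effects reported for the 4 by 4 multiplier in the RESULT tables of these slides; treating the overhead table as symmetric (the cost of i followed by j equals j followed by i), the short name COMP for the 2's complement instruction, and the sample program are assumptions made for illustration.

```python
# Base costs B_i [pJ] from the 4 by 4 multiplier table
BASE_COST = {"LOAD": 1.46, "ADD": 0.86, "COMP": 0.77, "SHIFT": 0.29}

# Circuit state effects O_ij [pJ]; only one triangle is tabulated,
# so symmetry is assumed when looking up the reverse order.
OVERHEAD = {
    ("LOAD", "LOAD"): 0.18, ("LOAD", "ADD"): 1.20,
    ("LOAD", "COMP"): 1.08, ("LOAD", "SHIFT"): 0.73,
    ("ADD", "ADD"): 0.31, ("ADD", "COMP"): 0.49, ("ADD", "SHIFT"): 0.61,
    ("COMP", "COMP"): 0.27, ("COMP", "SHIFT"): 0.34,
    ("SHIFT", "SHIFT"): 0.15,
}


def overhead(prev_op, next_op):
    """Circuit state cost of next_op following prev_op."""
    if (prev_op, next_op) in OVERHEAD:
        return OVERHEAD[(prev_op, next_op)]
    return OVERHEAD[(next_op, prev_op)]


def program_energy(program):
    """Return (E_b, E_s, E_M) in pJ for a list of instruction names."""
    e_b = sum(BASE_COST[op] for op in program)
    e_s = sum(overhead(p, n) for p, n in zip(program, program[1:]))
    return e_b, e_s, e_b + e_s


e_b, e_s, e_m = program_energy(["LOAD", "ADD", "SHIFT", "ADD"])
```

In this example the circuit state effects contribute a substantial share of the total, which is why instruction ordering matters for energy.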
Instruction ordering

• Develop a technique of operand swapping
• Recoding weight: necessary operation cost of operands
• Wtotal: total recoding weight of the input operand
  Wi: weight of individual recoded digit i in the Booth multiplier
  Wb: base weight of an instruction
  Winter: inter-operation weight of instructions

$$W_{total} = \sum_i W_i \qquad W_i = W_b + W_{inter}$$

• Therefore, if an operand has a lower Wtotal, put it in the second input (multiplier).
RESULT

< 4 by 4 multiplier >
                                   Circuit state effects [pJ] when switched
Instruction name   Base cost [pJ]   LOAD    ADD    2's complement   SHIFT
LOAD                   1.46         0.18    1.20       1.08          0.73
ADD                    0.86                 0.31       0.49          0.61
2's complement         0.77                            0.27          0.34
SHIFT                  0.29                                          0.15

< 8 by 8 multiplier >
                                   Circuit state effects [pJ] when switched
Instruction name   Base cost [pJ]   LOAD    ADD    2's complement   SHIFT
LOAD                   3.25         0.40    2.67       2.38          1.63
ADD                    1.91                 0.58       1.11          1.44
2's complement         1.72                            0.55          0.78
SHIFT                  0.65                                          0.38

< 12 by 12 multiplier >
                                   Circuit state effects [pJ] when switched
Instruction name   Base cost [pJ]   LOAD    ADD    2's complement   SHIFT
LOAD                   4.81         0.59    3.96       3.57          2.40
ADD                    2.83                 1.02       1.63          2.12
2's complement         2.55                            1.00          1.14
SHIFT                  0.96                                          0.78
Conclusion
[Figure: two bar charts over 4-bit, 8-bit and 12-bit multipliers (one also showing the average), with axes labeled "Power [pJ]" and "% of instances with circuit state effects" and series "circuit state effects not considered" and "circuit state effects considered". Annotations: 4.0% reduction, 12.0% reduction, 9.0% reduction.]
8. Layout Level Design
Device Scaling by a Factor of S
• Constant scaled wire increases coupling capacitance by S and wire resistance by S
• Supply voltage by 1/S, threshold voltage by 1/S, current drive by 1/S
• Gate capacitance by 1/S, gate delay by 1/S
• Global interconnection delay, RC load + parasitics, by S
• Interconnect delay: 50-70% of clock cycle
• Area: 1/S^2
• Power dissipation by 1/S to 1/S^2
• (P = nCVdd^2 f, where nC is the sum of capacitance times #transitions)
• SIA (Semiconductor Industry Association): in 2007, physical limitation: 0.1 µm, 20 billion transistors, 10 square centimeters, 12 or 16 inch wafer
Delay Variations at Low-Voltage
• At high supply voltage, the delay increases with temperature (mobility is decreasing with temperature) while at very low supply voltages the delay decreases with temperature (VT is decreasing with temperature).
• At low supply voltages, the delay ratio between large and minimum transistor widths W increases by a factor of several.
• Delay balancing of clock trees is based on wire snaking in order to avoid clock skew. In this case, at low supply voltages, slight VT variations can significantly modify the delay balancing.
Quarter Micron Challenge
• Computers/peripherals (SOC): 1996 ($50 billion), 1999 ($70 billion)
• Wiring dominates delay: wire R comparable to gate driver R; wire/wire coupling C > C to ground
• Push beyond 0.07 micron
• Quest for area (past), speed-speed (now), power-power-power (future)
• Accelerated increases of clock frequencies
• Signal integrity-based tools
• Design styles (chip + packages)
• System-level design (system partitioning)
• Synthesis with multiple constraints (power, area, timing)
• Partitioning/MCM
• Increasing speed limits complicate clock and power distribution
• Design bounded by wires, vias, via resistance, coupling
• Reverse scaling: adding area/spacing as needed: widening, thickening of wires, metal shielding & noise avoidance - adding metal
CLOCK POWER CONSUMPTION
• Clock power consumption is as large as the logic power; because the clock signal carries the heaviest load and switches at high frequency, clock distribution is a major source of power dissipation.
• In a microprocessor, 18% of the total power is consumed by clocking.
• Clock distribution is designed as a hierarchical clock tree, according to the decomposition principle.
Power Consumption per block in typical microprocessor
Crosstalk
Solution for Clock Skew
• Dynamic effects on skew
– Capacitance coupling
– Supply voltage deviation (clock driver and receiver voltage difference)
– Capacitance deviation by circuit operation
– Global and local temperature
• Layout issues: clocks routed first
• Must be aware of all sources of delay
• Increased spacing
• Wider wires
• Insert buffers
• Specialized clocks need net matching
• Two approaches: single driver, H-tree driver
• Gated clocks: the local clocks are conditionally enabled so that the registers are only clocked during the write cycles. The clock is partitioned into different blocks and each block is clocked with its own clock.
• Gating the clocks to infrequently used blocks does not provide an acceptable level of power savings
• Divide the basic clock frequency to provide the lowest clock frequency needed to different parts of the circuit
• Clock distribution: large clock buffers waste power. Use smaller clock buffers with a well-balanced clock tree.
PowerPC Clocking Scheme
CLOCK DRIVERS IN THE DEC ALPHA 21164
DRIVER for PADS or LARGE CAPACITANCES
Off-chip power (drivers and pads) is increasing and is very difficult to reduce, as pad and driver sizes cannot be decreased with new technologies.
Layout-Driven Resynthesis for Lower Power
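Low Power Process
• Dynamic power dissipation
[Figure: CMOS inverter (V_in, V_o, V_dd) with overlap capacitances C_ovp, C_ovn and drain junction capacitances C_djp, C_djn; drain region of width W and length D with bottom capacitance C_jb and sidewall capacitance C_jsw.]

$$P_d = C_L V_{dd}^2 f$$
$$I_{ds} = \frac{\beta}{2}\left(V_{gs} - V_t\right)^2$$
$$C_{in} = \sum_{j=1}^{m} C_{gate_j}, \qquad C_{gate} = \sum_{i=1}^{n} C_{ox} W L$$
$$C_{ov} = C_{GD0} W$$
$$C_{dj} = C_j A_D + C_{jsw} P_D, \qquad A_D = W D, \quad P_D = 2(W + D)$$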
Crosstalk
• In deep-submicron layouts, some of the net lengths for connection between modules can be so long that they have a resistance which is comparable to the resistance of the driver.
• Each net in mixed analog/digital circuits is identified depending upon its crosstalk sensitivity:
– 1. Noisy = high impedance signal that can disturb other signals, e.g., clock signals.
– 2. High-Sensitivity = high impedance analog nets; the most noise-sensitive nets, such as the input nets to operational amplifiers.
– 3. Mid-Sensitivity = low/medium impedance analog nets.
– 4. Low-Sensitivity = digital nets that directly affect the analog part in some cells, such as control signals.
– 5. Non-Sensitivity = the most noise-insensitive nets, such as pure digital nets.
• The crosstalk between two interconnection wires also depends on the frequencies (i.e., signal activities) of the signals traveling on the wires. Recently, deep-submicron designs require crosstalk-free channel routing.
Power Measure in Layout
• The average dynamic power consumed by a CMOS gate is given below, where C_l is the load capacitance at the output node, V_dd is the supply voltage, T_cycle is the global clock period, N is the number of transitions of the gate output per clock cycle, C_g is the load capacitance due to the input capacitance of the fanout gates, and C_w is the load capacitance due to the interconnection tree formed between the driver and its fanout gates.
• $P_{av} = \frac{0.5 V_{dd}^2}{T_{cycle}} C_l N = \frac{0.5 V_{dd}^2}{T_{cycle}} (C_g + C_w) N$
• Logic synthesis for low power attempts to minimize $\sum_i C_{gi} N_i$
• Physical design for low power tries to minimize $\sum_i C_{wi} N_i$
• Here C_wi consists of C_xi + C_si, where C_xi is the capacitance of net i due to crosstalk and C_si is the substrate capacitance of net i. For low power layout applications, power dissipation due to crosstalk is minimized by ensuring that wires carrying high-activity signals are placed sufficiently far from other wires. Similarly, power dissipation due to substrate capacitance is proportional to the wirelength of the net and its signal activity.
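The power measure above can be sketched numerically. This is a minimal illustration assuming P_av = (0.5 V_dd^2 / T_cycle)(C_g + C_w) N and the physical-design objective sum_i C_wi N_i; the function names and all values are hypothetical.

```python
def average_dynamic_power(v_dd, t_cycle, c_g, c_w, n):
    """P_av = 0.5 * V_dd^2 / T_cycle * (C_g + C_w) * N for one gate."""
    return 0.5 * v_dd ** 2 / t_cycle * (c_g + c_w) * n


def wire_power_cost(nets):
    """Physical-design objective: sum of C_wi * N_i over nets (c_w, n)."""
    return sum(c_w * n for c_w, n in nets)


# Hypothetical gate in arbitrary consistent units
p_av = average_dynamic_power(v_dd=2.0, t_cycle=4.0, c_g=1.0, c_w=3.0, n=0.5)

# Two hypothetical nets: a short high-activity net and a long low-activity net
cost = wire_power_cost([(1.0, 0.8), (4.0, 0.1)])
```

The cost function shows why a placer aiming at low power would shorten high-activity nets first: each unit of wire capacitance is weighted by the activity of its net.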
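Low-Power Layout Design Using Dual Supply Voltages
Sungkyunkwan University, School of Electrical, Electronic, and Computer Engineering
Jin-Hyuk Kim, Jun-Sung Lee, Jun-Dong Cho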
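Contents
• Research objective
• Research background
• Clustered Voltage Scaling structure
• Row by Row Power Supply structure
• Mix-And-Match Power Supply structure
• Level Converter structure
• Mix-And-Match Power Supply design flow
• Experimental results
• Conclusion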
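Research Objective and Background
• Propose a dual-voltage layout technique that reduces the power consumption of combinational circuits
• When dual-voltage cells are used, reduce the extra wiring and tracks that arise when each cell row holds only cells of the same voltage
• Implement a Level Converter circuit that uses the minimum number of transistors
• Apply Clustered Voltage Scaling [Usami, ’95], which uses dual voltages while maintaining device performance
• The proposed Mix-And-Match Power Supply layout structure improves on the existing Row by Row Power Supply [Usami, ’97] layout structure and reduces both power and area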
Clustered Voltage Scaling
• Generates a low-power netlist
[Figure: example netlist between flip-flops (F/F) with gates G1 to G11 and level converters LC1 and LC2. Slack(S_i) = R_i - A_i. Annotated slacks: S1 > 0, S3 > 0, S4 > 0, S5 > 0, S6 > 0, S9 > 0; S2 < 0, S7 < 0, S8 < 0, S10 < 0, S11 < 0. Legend: VDDL, VDDH, Level Converter.]
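The slack relation Slack(S_i) = R_i - A_i can be sketched in code. This is a simplified illustration, not the published Clustered Voltage Scaling algorithm: it assumes R_i is the required time and A_i the arrival time of gate i, assigns VDDL to gates with positive slack and VDDH otherwise, and marks a level converter wherever a VDDL gate drives a VDDH gate. It does not re-check timing after a gate is slowed by VDDL. All names and values are hypothetical.

```python
def assign_supplies(timing):
    """Map each gate to 'VDDL' if its slack R - A is positive, else 'VDDH'.

    timing: dict of gate name -> (required_time, arrival_time).
    """
    assignment = {}
    for gate, (required, arrival) in timing.items():
        slack = required - arrival
        assignment[gate] = "VDDL" if slack > 0 else "VDDH"
    return assignment


def level_converter_edges(assignment, edges):
    """Return the (driver, load) edges where a VDDL gate drives a VDDH gate."""
    return [(src, dst) for src, dst in edges
            if assignment[src] == "VDDL" and assignment[dst] == "VDDH"]


# Hypothetical three-gate example
timing = {"G1": (10.0, 8.0), "G2": (10.0, 12.0), "G3": (10.0, 9.0)}
edges = [("G1", "G2"), ("G2", "G3"), ("G1", "G3")]
assignment = assign_supplies(timing)
converters = level_converter_edges(assignment, edges)
```

Only the G1 to G2 edge needs a converter here, since it is the only place where a low-swing output feeds a gate on the high supply.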
Row by Row Power Supply Structure
[Figure: module built from standard-cell rows; each row has a VSS rail and a single supply rail (VDDH or VDDL) and holds only VDDH cells or only VDDL cells.]
Mix-And-Match Power Supply Structure
[Figure: module built from standard-cell rows; each row has VDDH, VDDL, and VSS rails, so VDDH cells and VDDL cells are mixed within the same row.]
Structure Comparison
[Figure: three modules side by side: conventional circuit (VDDH only), RRPS (VDDH and VDDL), and MAMPS (VDDH and VDDL).]
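Level Converter Structure
• Number of transistors: reduced from 6 to 4
• Effective in terms of both power and area
[Figure: conventional and proposed level converter circuits. Labels: IN, OUT, VDDL, VDDH; VSS/VDDL with Vth = 1.5 V; VSS/VDDH with Vth = 2.0 V.]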
Mix-And-Match Power Supply Design Flow
[Flowchart; box labels as extracted:]
Physical placement
Assign supply voltage to each cell
Routing
Synthesis timing, power and area
Single voltage netlist
Netlist with multiple supply voltage
Multiple voltage scaling
(OPUS)
(Aquarius XO)
(PowerMill)
Experimental Results
[Figure: bar charts of total area (%) and total power (%) for the conventional circuit (100), RRPS, and MAMPS. Chart annotations: 15% and 10% on the area chart; 47% and 2% on the power chart.]
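Conclusion
• Compared with the single-voltage circuit, power is reduced by 49.4% at the cost of a 5.6% area overhead
• Compared with the existing RRPS structure, area is reduced by 10% and power by 2%
• The proposed Level Converter reduces area by 30% and power by 35% compared with the conventional Level Converter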
9. CAD tools
Low Power Design Tools
• Transistor Level Tools (accuracy within 5-10% of silicon)
– SPICE, PowerMill (Epic), ADM (Avanti/Anagram), Lsim Power Analyst (Mentor)
• Logic Level Tools (10-15%)
– DesignPower and PowerGate (Synopsys), WattWatcher/Gate (Sente), PowerSim (System Sciences), POET (Viewlogic), and QuickPower (Mentor)
• Architectural (RTL) Level Tools (20-25%)
– WattWatcher/Architect (Sente): 20-25% accuracy
• Behavioral (spreadsheet) Level Tools (50-100%)
– Active area of academic research
Commercial synthesis systems
Research synthesis systems
A - Architectural synthesis.
L - Logic synthesis.
Low-Power CAD sites
• Alternative System Concepts, Inc.: 7X power reduction through optimization; contact http://www.ee.princeton.edu and Jake Karrfalt at (603) 437-2234. Reduction of glitch and clock power; modeling and optimization of interconnect power; power optimization for data-dominated designs with limited control flow.
• Mentor Graphics QuickPower: hierarchical way of determining the overall benefit of exchanging blocks for lower-power ones; powering down or disabling blocks when not in use by means of gated clocks; choosing candidates for power-down and calculating the effect of the power-down logic. http://www.mentorg.com
• Synopsys's Power Compiler http://www.synopsys.com/products/power/power_ds
• Sente's WattWatcher/Architect (first commercial tool operating at the architecture level, 20-25% accuracy). http://www.powereda.com
• Behavioral Tool: Hyper-LP (Optimization), Explore (Estimation) by J. Rabaey
DesignPower (Synopsys)
• DesignPower(TM) provides a single, integrated environment for power analysis in multiple phases of the design process:
– Early, quick feedback at the HDL or gate level through probabilistic analysis.
– Improved accuracy through simulation-based analysis for gate level and library exploration.
• DesignPower estimates switching, internal cell and leakage power. It accepts user-defined probabilities, simulation toggle data or a combination of both as input. DesignPower propagates switching information through sequential devices, including flip-flops and latches.
• It supports sequential, hierarchical, gated-clock, and multiple-clock designs. For simulation toggle data, it links directly to Verilog and VHDL simulators, including Synopsys' VSS.
10. References
References
[1] Gary K. Yeap, "Practical Low Power Digital VLSI Design", Kluwer Academic Publishers.
[2] Jan M. Rabaey, Massoud Pedram, "Low Power Design Methodologies", Kluwer Academic Publishers.
[3] Abdellatif Bellaouar, Mohamed I. Elmasry, "Low-Power Digital VLSI Design: Circuits And Systems", Kluwer Academic Publishers.
[4] Anantha P. Chandrakasan, Robert W. Brodersen, "Low Power Digital CMOS Design", Kluwer Academic Publishers.
[5] Dr. Ralph Cavin, Dr. Wentai Liu, "1996 Emerging Technologies: Designing Low Power Digital Systems"
[6] Muhammad S. Elrabaa, Issam S. Abu-Khater, Mohamed I. Elmasry, "Advanced Low-Power Digital Circuit Techniques", Kluwer Academic Publishers.
References
• [BFKea94] R. Bechade, R. Flaker, B. Kauffmann, et al. "A 32b 66 MHz 1.8W Microprocessor". In IEEE Int. Solid-State Circuit Conference, pages 208-209, 1994.
• [BM95] M. T. Bohr. "Interconnect Scaling - The real limiter to high performance ULSI". In Proceedings of 1995 IEEE International Electron Devices Meeting, pages 241-242, 1995.
• [BSM94] L. Benini, P. Siegel, and G. De Micheli. "Saving Power by Synthesizing Gated Clocks for Sequential Circuits". IEEE Design and Test of Computers, 11(4):32-41, 1994.
• [GH95] S. Ganguly and S. Hojat. "Clock Distribution Design and Verification for PowerPC Microprocessor". In International Conference on Computer-Aided Design, page Issues in Clock Designs, 1995.
• [MGR96] R. Mehra, L. M. Guerra, and J. Rabaey. "Low Power Architecture Synthesis and the Impact of Exploiting Locality". In Journal of VLSI Signal Processing, 1996.