Partition-Driven Placement with Simultaneous Level Processing and Global Net Views

K. Zhong and S. Dutt

Department of Electrical Engineering and Computer Science, University of Illinois at Chicago

Zhong & Dutt, UIC, Nov. 2000

Overview

• Problem

• Previous Work

• New Partition-Driven Placement Algorithm (SPADE)

• Experimental Evaluation

• Conclusions and Future Work


Problem

• Placement for Deep Sub-Micron (DSM)

– Very large input size (up to tens of millions)

– More optimization objectives (area, delay, power)

– Various heterogeneous constraints (congestion, crosstalk, heat distribution, etc.)


Major Approaches to Placement

• Three mainstream placement approaches

– Partition-Driven Placement (PDP) (e.g. [Breuer, DAC ‘77], [Huang et al, ISPD ‘97])

– Simulated Annealing (SA) (e.g. [Sun et al, TCAD ‘95])

– Mathematical programming (e.g. [Eisenmann et al, DAC ‘98])

• Global and detailed placement

– NRG [Wang et al, ICCAD ‘97], Snap-On [Yang et al, ISPD ‘00], etc.


Advantages of PDP

• Time-efficient

– divide-and-conquer approach

• Balanced decision with a global view

– top-down placement flow

• Can tackle almost any objective function accurately (up to interconnect length model)

– delay, wire length (WL), power (in iterative improvement, update cost per move)

• Flexibility in tackling multiple constraints

– iterative improvement: check per move


Previous PDP Work

• Sequential level partitioning [Breuer, DAC ‘77]

– regions at the same level are cut sequentially

– may result in sub-optimal wire-length or cutsize

• Terminal propagation [Dunlop et al, TCAD ‘85]

– addresses external connections during partitioning

• Quadrisection [Suaris et al, TCAS ‘88; Huang et al, ISPD ‘97]

– 4-way partitioning better controls wire length in both directions, but run time goes up


New PDP Techniques: Rectify Drawbacks of Prior PDP

• Placer SPADE (Simultaneous level PArtitioning with Distributed nEt views)

• Simultaneous Level Partitioning (SLP): rectifies the prior drawback of sequentially ordered optimization

• Global net views: rectify the prior drawbacks of localized subcircuit views and of the cost and inaccuracy of terminal propagation

• Wire-length based gain computation: rectifies the prior drawback of mincut-based gain (not strictly WL)

• Modified CLIP-FM partitioner [Dutt et al, ICCAD ‘96]

• Maximum row length control

• Post-processing (cell swaps)

Simultaneous Level Partitioning

• Simultaneous partitioning of all regions within the same level

• Cell moves are naturally interleaved across all regions based on gains (as shown in the figure)

• Achieves simultaneous optimization across multiple regions

[Figure: numbered cell moves interleaved across the regions of one level]
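The interleaving described above can be sketched as follows. This is a minimal illustration, not the SPADE implementation: it assumes each region already has a list of candidate moves with fixed gains, whereas an FM-style pass would update gains after every move and enforce balance. The region numbering, cell names, and gain values are hypothetical.

```python
import heapq

def sequential_order(region_moves):
    """Process regions one at a time; inside a region, take moves by decreasing gain."""
    order = []
    for region in sorted(region_moves):
        for gain, cell in sorted(region_moves[region], reverse=True):
            order.append((region, cell, gain))
    return order

def simultaneous_order(region_moves):
    """At every step take the best remaining move over all regions of the level."""
    heap = [(-gain, region, cell)
            for region, moves in region_moves.items()
            for gain, cell in moves]
    heapq.heapify(heap)
    order = []
    while heap:
        neg_gain, region, cell = heapq.heappop(heap)
        order.append((region, cell, -neg_gain))
    return order

# Hypothetical candidate moves: region id -> [(gain, cell), ...]
region_moves = {
    0: [(4, "a"), (1, "b")],   # upper region
    1: [(5, "c"), (3, "d")],   # lower region
}

print([cell for _, cell, _ in sequential_order(region_moves)])
print([cell for _, cell, _ in simultaneous_order(region_moves)])
```

With these numbers the sequential order exhausts region 0 before touching region 1, while the simultaneous order alternates between the two regions because the gains interleave.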


SLP vs. Sequential Level Partitioning

• Sequential level partitioning may not be able to escape local optima

[Figure, three panels. Initial partitioning: nets labeled with weights, cells u and v, pads; Orig Cost = 8. SLP: only the cell in the lower region is moved; New Cost = 1. Sequential: sub-optimal move sequence if the upper region is processed first; New Cost = 3.]


Global Net View vs. Terminal Propagation

• Terminal propagation may be inaccurate for wire length reduction

• With a global net view we can do better (e.g., in the figure, moving left is better because it shrinks the net's bounding box (BB), while moving right expands it)

[Figure: possible moves; the dummy terminal position does not help]


De-coupled Regions: a Caveat

• Suitable for row-based designs

• Property: for a horizontal cut, the WL change due to cell moves in regions on one side of the previous-level cutline does not affect the WL of the subcircuits in regions on the other side

• Sequential partitioning of regions separated by previous-level horizontal cutlines is therefore justified

• Reduced run time at no cost in wire length

[Figure: the two net segments can be shrunk separately; regions spanning cutline c are de-coupled from those spanning c’ by the previous cutline d]


Wire-length Based Gain

• Pin coordinates (x or y) of each net along the direction orthogonal to current cutline are stored in a binary search tree

• SPADE-FM: A cell move can have non-zero gain only when it changes global bounding-boxes of connected nets
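A minimal sketch of bounding-box based gain along one axis, under stated assumptions: a sorted Python list maintained with `bisect` stands in for the binary search tree, each net is treated in isolation, and the class name `NetSpan`, its methods, and the example coordinates are all hypothetical rather than taken from SPADE.

```python
import bisect

class NetSpan:
    """Pin coordinates of one net along one axis, kept in sorted order."""

    def __init__(self, coords):
        self.coords = sorted(coords)

    def span(self):
        return self.coords[-1] - self.coords[0]

    def _index(self, coord):
        i = bisect.bisect_left(self.coords, coord)
        if i == len(self.coords) or self.coords[i] != coord:
            raise ValueError("no pin at coordinate %r" % (coord,))
        return i

    def move_gain(self, old, new):
        """Span reduction if the pin at `old` moves to `new` (positive is better)."""
        n = len(self.coords)
        if n == 1:
            return 0  # a single pin always has zero span
        i = self._index(old)
        # Extremes of the remaining pins once the moving pin is taken out.
        lo_rest = self.coords[1] if i == 0 else self.coords[0]
        hi_rest = self.coords[-2] if i == n - 1 else self.coords[-1]
        new_span = max(hi_rest, new) - min(lo_rest, new)
        return self.span() - new_span

    def apply_move(self, old, new):
        """Commit the move and keep the coordinates sorted."""
        self.coords.pop(self._index(old))
        bisect.insort(self.coords, new)

# Hypothetical net: two pins share the left boundary, one pin sits alone on the right.
net = NetSpan([0, 0, 3, 8])
print(net.move_gain(0, 5))  # a left-boundary pin moves inward, the other still holds the boundary
print(net.move_gain(8, 2))  # the lone right-boundary pin moves inward
```

The first query has zero gain because another pin still defines the left edge of the bounding box; only the pin that alone defines an edge can shrink the span.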


Illustration of Gain Computation

SPADE-FM: gain(u) = gain(w) = 0, since neither move can change the bounding box by itself; only gain(v) = 5L is positive, and all other cells have zero gain as “internal” nodes.

SPADE-PROP: gain(u) = (d' - d)·p(u)·p(w)/p(u) + (d'' - d')·p(x), where p(y) is the probability of y. The gain has two parts: the single-step PROP gain of moving u and w, and the multi-step gain of moving cells not on the boundary of the BB (e.g., x) from the same side as u.

[Figure: net with cells u, v, w, x; labels 3L, 8L, g(v) = 5L; coordinates d, d', d'']


Global Gain Update

• Every move may entail out-of-region update of cell gains

• Total time for such updates per pass is bounded by O(p log p), where p is the number of pins

[Figure: a cell move and the gain updates it makes necessary]


Maximum Row Length Control

• A decisive factor in die-area utilization

• Gradually increase row-balance deviations with partitioning tree levels, up to the maximum allowable

– cannot use the prescribed maximum row-length deviation from the start, as it can freeze moves for future cuts (see figure below)

• Row deviation is assigned inversely proportional to the logarithm of the number of rows of the target regions

[Figure: initial deviation set to the maximum allowed value; once the maximum deviation is reached, further partitioning is badly hampered. Label: deviation available]
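One possible reading of the deviation schedule above, as a sketch only. The slide states the inverse-logarithmic relation but not its constants, so the base-2 logarithm, the use of `max_dev` as the proportionality constant, and the clamp for regions of one or two rows are all assumptions; `row_deviation` is a hypothetical helper name.

```python
import math

def row_deviation(rows_in_region, max_dev):
    """Allowed row-balance deviation for a region spanning `rows_in_region` rows."""
    if rows_in_region < 1:
        raise ValueError("a region must span at least one row")
    # Assumed form: max_dev / log2(rows), clamped so it never exceeds max_dev.
    return max_dev / max(1.0, math.log2(rows_in_region))

# Deviation grows toward max_dev as regions shrink from 16 rows down to 1 row.
for rows in (16, 8, 4, 2, 1):
    print(rows, row_deviation(rows, 8.0))
```

Under these assumptions the top-level regions, which span many rows, get a tight deviation, and the full allowance is only released at the deepest levels, which leaves room for moves in later cuts.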


Local Region Balance Control

• Relaxed local balance but strict row-balance control

• Requiring Local Deviation (from the closest possible balance to 50-50) = Row Deviation overconstrains the problem

• Allow Local Deviation = (a factor > 1) × (Row Deviation), but maintain the overall row deviation


Circuit Partitioning Engine

• CLIP-FM variation (SHRINK-FM) or SHRINK-PROP algorithm at the core

– shrinking the initial gain helps cluster removal

– iterative mode: shrink factor gradually enlarged to get independent gains after most clusters are removed through earlier passes

• Two-level gain tree structure

– local binary search tree for each region

– top-gain cells of local trees sorted into global tree

• Efficient global cell selection strategy

– row-balance violation: search opposite global tree

– local violation: switch to opposite local tree

– tie-breaking: following latest move
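The two-level gain structure can be sketched as below. This is an illustration under simplifying assumptions, not the SPADE data structure: binary heaps from `heapq` stand in for the binary search trees, gains are never updated after insertion, each region is added exactly once, and the balance-driven switching between opposite trees is left out. The class name `TwoLevelGains` and its methods are hypothetical.

```python
import heapq

class TwoLevelGains:
    """Per-region gain heaps plus a global heap holding each region's best cell."""

    def __init__(self):
        self.local = {}   # region -> heap of (-gain, cell)
        self.top = []     # heap of (-gain, cell, region), one entry per non-empty region

    def add_region(self, region, gains):
        """`gains` maps cell -> gain for the cells of one region (call once per region)."""
        heap = [(-gain, cell) for cell, gain in gains.items()]
        heapq.heapify(heap)
        self.local[region] = heap
        self._publish(region)

    def _publish(self, region):
        """Copy the region's current best cell into the global heap."""
        heap = self.local[region]
        if heap:
            neg_gain, cell = heap[0]
            heapq.heappush(self.top, (neg_gain, cell, region))

    def pop_best(self):
        """Remove and return (cell, region, gain) with the highest gain overall."""
        if not self.top:
            return None
        neg_gain, cell, region = heapq.heappop(self.top)
        heapq.heappop(self.local[region])   # the same cell is that region's local top
        self._publish(region)               # promote the region's next-best cell
        return cell, region, -neg_gain

gains = TwoLevelGains()
gains.add_region("r0", {"a": 7, "b": 2})
gains.add_region("r1", {"c": 9, "d": 6})
print(gains.pop_best())
```

The point of the second level is that the global structure holds only one entry per region, so picking the next cell never scans every cell of every region.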


Post-processing

• Intra-row horizontal neighbor swap

• Intra-row clustering based on the ratio of internal to external nets

• Inter-row vertical swap

– some cells have to be shifted due to cell overlap

• Results in about 1-2% improvement

[Figure: horizontal neighbor swap; vertical cell swap]
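A minimal sketch of one pass of intra-row horizontal neighbor swapping. It assumes unit-width cells sitting at integer slots, nets given as lists of cells in the same row, and a cost that is recomputed from scratch after each trial swap; the slides do not specify these details, and the function names are hypothetical.

```python
def hpwl_x(nets, x_of):
    """Sum of horizontal bounding-box spans over all nets."""
    total = 0
    for cells in nets:
        xs = [x_of[c] for c in cells]
        total += max(xs) - min(xs)
    return total

def neighbor_swap_pass(row, nets):
    """One left-to-right pass; `row` lists the cell names in slot order."""
    row = list(row)
    x_of = {cell: slot for slot, cell in enumerate(row)}
    best = hpwl_x(nets, x_of)
    for i in range(len(row) - 1):
        a, b = row[i], row[i + 1]
        x_of[a], x_of[b] = x_of[b], x_of[a]       # trial swap
        cost = hpwl_x(nets, x_of)
        if cost < best:
            row[i], row[i + 1] = b, a             # keep the improving swap
            best = cost
        else:
            x_of[a], x_of[b] = x_of[b], x_of[a]   # undo
    return row, best

# Hypothetical row of four cells with two two-pin nets.
print(neighbor_swap_pass(["a", "b", "c", "d"], [["a", "c"], ["b", "d"]]))
```

Recomputing the full cost per trial keeps the sketch short; an incremental version would only re-evaluate the nets attached to the two swapped cells.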


Experimental Evaluation

• MCNC standard cell benchmarks: up to 100k cells

• Compared with prior methods

– TimberWolf 7.0 [Sun et al, TCAD ‘95]

– FD-98 [Eisenmann et al, DAC ‘98]

– QUAD [Huang et al, ISPD ‘97]

– Snap-On [Yang et al, ISPD ‘00]

• Same number of rows as TimberWolf 7.0

• Part of the IBM-PLACE circuits also tested (ibm11 - ibm15) and compared to iTools [internetCAD]

• Experiments conducted on 550 MHz Pentium-III Linux workstations


Comparison with Previous Methods

Circuit            SPADE-FM     TW 7.0   FD-98   QUAD    Snap-On  SPADE-PROP
primary1           0.74         0.83     0.87    0.9     0.95     0.74
struct             0.291        -        0.338   0.378   -        0.285
primary2           3.13         3.53     3.72    3.68    3.66     3.07
biomed             1.43         1.61     1.78    -       1.84     1.38
industry2          11.9         13.3     14.6    -       14.48    12.07
industry3          35.37        41.53    45.1    -       44.7     35.09
avqsmall           5.59         5.08     4.91    6.29    5.15     5.31
avqlarge           6.16         5.65     5.38    6.59    5.21     5.61
golem3             19.84        22.6     -       -       -        19.64
Total (8/8 ckts)   84.16/64.61  94.13    76.70   -       -        82.91/63.56
Total (5/7 ckts)   15.94/64.32  -        -       17.84   75.99    15.02/63.27
SPADE-FM imprv.    -            10.60%   15.80%  10.70%  15.30%   -
SPADE-PROP imprv.  -            11.92%   17.13%  15.81%  16.74%   -

run time (8 ckts): 15001, 7173, 57920, 18108

run time (6 ckts): 14710, 19034, 18071

scaled time ratio: 1, 0.69, 0.26, 1.16, 1.21

SLP vs Seq.          SPADE-FM  Sequential  WL imprv.
Total WL (6 ckts)    52.86     65.57       19.38%
Total time (6 ckts)  7052      1719        -


Other Experimental Results

• Results for IBM-PLACE Benchmarks

Circuit            SPADE-FM  SPADE-PROP  iTools
ibm11              37.27     36.28       39.76
ibm12              66.52     64.92       69.56
ibm13              42.94     42.4        49.11
ibm14              121.38    121.17      118.8
ibm15              134.68    130.45      130.6
Total WL           402.79    395.22      407.83
imprv. vs. iTools  1.24%     3.10%       -
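The totals and improvement figures for the IBM-PLACE circuits can be re-derived from the per-circuit wire lengths. The short check below assumes that improvement is measured as the relative reduction with respect to the iTools total; the values are copied from the IBM-PLACE results, and the helper name `improvement_pct` is hypothetical.

```python
# Wire lengths from the IBM-PLACE results: (SPADE-FM, SPADE-PROP, iTools)
wl = {
    "ibm11": (37.27, 36.28, 39.76),
    "ibm12": (66.52, 64.92, 69.56),
    "ibm13": (42.94, 42.40, 49.11),
    "ibm14": (121.38, 121.17, 118.80),
    "ibm15": (134.68, 130.45, 130.60),
}

def improvement_pct(ours, theirs):
    """Relative wire-length reduction of `ours` over `theirs`, in percent."""
    return 100.0 * (theirs - ours) / theirs

# Column-wise totals, rounded to the two decimals used in the table.
fm_total, prop_total, itools_total = (
    round(sum(column), 2) for column in zip(*wl.values())
)
fm_imprv = improvement_pct(fm_total, itools_total)
prop_imprv = improvement_pct(prop_total, itools_total)

print(fm_total, prop_total, itools_total)
print(round(fm_imprv, 1), round(prop_imprv, 1))
```

The column sums match the reported Total WL values, and the improvements come out at roughly 1.2% and 3.1%, in line with the reported figures.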

• Trade-off between run time and solution quality of SPADE-FM with 8 and 16 runs for the MCNC suite

Trade-off   SPADE-FM/8  SPADE-FM/16  Best WL  16 vs 8
Total WL    89.65       84.45        82.87    5.81%
Total time  29117       37738        -        1.3 x


Conclusions and Future Work

• Introduced novel concepts of:

– SLP

– global net view

– bounding-box based gain computation

• PDP alone can be competitive (in fact better)

– up to 15.8% better in aggregate result than the state of the art

– among large circuits:

• best-known result for the largest MCNC circuit, golem3

• best-known results for ibm11-ibm13

• Run time is reasonable, but can be reduced

– early-stop per pass

– multilevel clustering

• On-going work

– timing-driven PDP

– multi-constraint PDP (congestion, thermal distribution, multiple objectives)

