Parallelization by SimPLification: A Case Study in VLSI Placement Myung-Chul Kim, Dong-Jin Lee and Igor L. Markov Dept. of EECS, University of Michigan

PAPA2011, University of Michigan



Complexities of Parallel Algorithms & SW

1. Objectives of parallelization
  A. Improve completion time by using multiple cores in parallel
  B. Improve throughput by using stream processing (latency may increase and become less predictable)
  C. Improve power consumption (by decreasing clock rate)
2. Not an objective (a pitfall)
  − Come up with a slow algorithm that is easy to parallelize

■ In this talk: how to accomplish 1.A without 2
  − Take a leading algorithm and speed up its bottlenecks
  − Design a new algorithm that is (a) better, (b) easy to parallelize


CAD Algorithms

■ Sequence of optimizations
  − Subject to Amdahl’s law
  − The more stages there are, the harder it is to parallelize effectively
■ Additional complications
  − Elaborate data structures may entail overhead for parallel access
  − When processing is light, memory bandwidth may become a bottleneck (with 4+ threads)
■ Recommendations
  − A simpler algorithm is often easier to parallelize (fewer stages, simpler data structures)
  − Using standard solvers, e.g., linear algebra, helps reuse previous work on parallelization
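The Amdahl's law constraint on a multi-stage flow can be made concrete with a small calculation. This is a minimal sketch; the function name `amdahl_speedup` and the 60% parallel fraction are illustrative assumptions, not figures from the talk.

```python
def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    """Overall speedup when only `parallel_fraction` of the runtime scales with `n_workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


if __name__ == "__main__":
    # Hypothetical flow in which 60% of the runtime parallelizes perfectly.
    for n in (1, 2, 4, 8):
        print(n, "workers:", round(amdahl_speedup(0.6, n), 2))
```

The serial fraction caps the achievable speedup regardless of core count, which is why every stage left sequential matters.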


Global Placement: Motivation

■ Interconnect lagging in performance while transistors continue scaling
  − Circuit delay, power dissipation and area dominated by interconnect
  − Routing quality highly controlled by placement
■ Circuit size and complexity rapidly increasing
  − Scalable placement algorithm is critical
  − Simplicity, integration with other optimizations

[Figure labels: Unloaded, Coupling, IR drop, RC delay]


Goals in Placement

■ Find good relative ordering of cells
  − Minimize wire length and congestion
  − Maximize timing slack
■ Find good spacing of cells
  − Eliminate wiring congestion problems
  − Provide space for post-placement stages
    – clock trees
    – buffer insertion
    – timing correction
■ Find good global position


[Figure sequence: three cells A, B, C in a row]
■ Optimize relative order
■ To spread ... or not to spread
■ Place to the left ... or to the right
■ Optimize relative order: without whitespace, placement is dominated by ordering

Example of Global Placement (APlace 2.04 from UCSD)

Example of Global Placement (mFar from UCSB)


Placement Formulation

■ Objective: Minimize estimated wirelength
  − Half-perimeter wirelength (HPWL)
  − (max X – min X) + (max Y – min Y)
■ Subject to constraints:
  − Legality: Row-based placement with no overlaps
  − Routability: Limiting local interconnect congestion for successful routing
  − Timing: Meeting performance target of a design
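The HPWL objective above can be computed directly from pin coordinates. A minimal sketch, assuming each net is a list of cell names, every net has at least one pin, and pins sit at cell positions; the names `net_hpwl` and `total_hpwl` are illustrative.

```python
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, float]


def net_hpwl(pins: Iterable[Point]) -> float:
    """Half-perimeter of the bounding box of one net's pins."""
    xs, ys = zip(*pins)
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_hpwl(positions: Dict[str, Point], nets: List[List[str]]) -> float:
    """Sum of per-net HPWL over the whole netlist."""
    return sum(net_hpwl(positions[cell] for cell in net) for net in nets)


if __name__ == "__main__":
    positions = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (1.0, 1.0)}
    nets = [["a", "b", "c"], ["a", "c"]]
    print(total_hpwl(positions, nets))
```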


Quadratic Placement

■ Consider a graph first, not a hypergraph
■ Minimize $\sum_{e_{ij}} \left[(x_i - x_j)^2 + (y_i - y_j)^2\right]$ (the sum is over edges $e_{ij}$)
  − Seems unrelated to $\sum_{e_{ij}} \left(|x_i - x_j| + |y_i - y_j|\right)$ but can still be separated into x- and y-components
■ Physical analogy: Hooke’s law
  − Consider an elastic spring, stretched by $x$
  − Force $F = -kx$ ($k$ is the spring constant)
  − Energy $E = \tfrac{1}{2} k x^2$
  − Our goal: minimize the energy of the system

A system of springs will only settle in a minimum
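The spring analogy can be tried on a one-dimensional example. Setting the derivative of the quadratic objective to zero puts each movable cell at the weighted average of its neighbors, so a few Gauss-Seidel sweeps settle the system. This is a sketch under stated assumptions: `quadratic_place_1d` is an illustrative name, every movable cell has at least one edge, and the graph is connected to at least one fixed pad.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, float]  # (endpoint, endpoint, spring weight)


def quadratic_place_1d(fixed: Dict[str, float], movable: List[str],
                       edges: List[Edge], sweeps: int = 200) -> Dict[str, float]:
    """Minimize sum of w * (x_u - x_v)^2 over edges by Gauss-Seidel sweeps."""
    x = dict(fixed)
    x.update({cell: 0.0 for cell in movable})
    neighbors: Dict[str, List[Tuple[str, float]]] = {c: [] for c in movable}
    for u, v, w in edges:
        if u in neighbors:
            neighbors[u].append((v, w))
        if v in neighbors:
            neighbors[v].append((u, w))
    for _ in range(sweeps):
        for cell in movable:
            total_w = sum(w for _, w in neighbors[cell])
            # Zero net force: the cell sits at the weighted mean of its neighbors.
            x[cell] = sum(w * x[n] for n, w in neighbors[cell]) / total_w
    return x


if __name__ == "__main__":
    pads = {"L": 0.0, "R": 10.0}
    chain = [("L", "a", 1.0), ("a", "b", 1.0), ("b", "R", 1.0)]
    print(quadratic_place_1d(pads, ["a", "b"], chain))
```

With equal spring weights the two cells end up evenly spaced between the pads, which is the minimum-energy configuration.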


Iterative Optimization



Prior Work

■ Ideal placer
  − Low runtime without sacrificing solution quality
  − Simplicity, integration with other optimizations

[Figure: speed vs. solution quality]
  − Quadratic and force-directed: mFAR, Kraftwerk2, FastPlace3
  − Non-convex optimization: mPL6, APlace2, NTUPlace3
  − Ideal placer


Key features of SimPL

■ Flat quadratic placement
■ Primal-dual optimization
  − Closing the gap between upper and lower bounds

[Figure: wirelength vs. iteration]
  − Initial WL optimization
  − Lower-bound solution by linear system solver
  − Upper-bound solution by look-ahead legalization
  − Final legal solution


Common Analytical Placement Flow

[Flowchart] Placement Instance → Initial WL Optimization → Global Placement → Converge? (no: repeat Global Placement; yes: continue) → Legalization and Detailed Placement

SimPL Flow

[Flowchart]
■ Placement Instance
■ Initial WL Optimization: B2B Graph Building → Linear System Solver → WL Converge? (no: repeat; yes: continue)
■ Global Placement: Look-ahead Legalization (Upper-Bound) → Pseudonet Insertion → B2B Graph Building → Linear System Solver (Lower-Bound) → Converge? (no: repeat; yes: continue)
■ Legalization and Detailed Placement

B2B net model [P. Spindler, et al, “Kraftwerk2 - A Fast Force-Directed Quadratic Placement Approach Using an Accurate Net Model,” TCAD 2008]

We delegate final legalization and detailed placement to FastPlace-DP [M. Pan, et al, “An Efficient and Effective Detailed Placement Algorithm”, ICCAD 2005]


SimPL: Look-ahead Legalization

■ Purpose: Produces almost-legal placement (Upper-Bound) while preserving the relative cell ordering given by linear system solver (Lower-Bound)
■ Identify target region
  − Find overflow bin b
  − Create a minimal bin cluster B around b that is wide enough
■ Perform geometric top-down partitioning
  − Find cell-area median (Cc) and whitespace median (CB)
  − Assign cells (Cc) to corresponding partitions (CB)
■ Non-linear scaling
  − Form stripe regions
  − Move cells across stripe regions in-order based on whitespace
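The top-down partitioning step can be illustrated in one dimension. This is a simplified sketch, not the SimPL implementation: it assumes an obstacle-free region, so the whitespace median is the geometric midpoint, and it omits stripe formation and non-linear scaling. The names `spread_1d` and `_partition` are illustrative.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, float, float]  # (name, current x, width)


def spread_1d(cells: List[Cell], lo: float, hi: float) -> Dict[str, float]:
    """Spread cells over [lo, hi] while preserving their left-to-right order."""
    ordered = sorted(cells, key=lambda c: c[1])
    out: Dict[str, float] = {}
    _partition(ordered, lo, hi, out)
    return out


def _partition(cells: List[Cell], lo: float, hi: float, out: Dict[str, float]) -> None:
    if not cells:
        return
    if len(cells) == 1:
        out[cells[0][0]] = (lo + hi) / 2.0
        return
    total = sum(width for _, _, width in cells)
    # Cell-area median: first position where the running area reaches half.
    running, split = 0.0, len(cells) - 1
    for i, (_, _, width) in enumerate(cells):
        running += width
        if running >= total / 2.0:
            split = i + 1
            break
    split = min(split, len(cells) - 1)  # keep both sides non-empty
    # Whitespace median of an obstacle-free region (assumption of this sketch).
    cut = (lo + hi) / 2.0
    _partition(cells[:split], lo, cut, out)
    _partition(cells[split:], cut, hi, out)


if __name__ == "__main__":
    clump = [("a", 5.0, 1.0), ("b", 5.1, 1.0), ("c", 5.2, 1.0), ("d", 5.3, 1.0)]
    print(spread_1d(clump, 0.0, 8.0))
```

Cells that were clumped together come out spread across the region in their original order, which is the property the lower-bound solution relies on.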


SimPL: Look-ahead Legalization (1)

Performing geometric top-down partitioning

[Figure: bin cluster (B) around an overfilled bin, split into B0 and B1, with the cell-area median (Cc) and the whitespace median (CB) marked]

[Figure: non-linear scaling inside B0: obstacle borders and uniform cutlines form stripes; cell ordering is preserved under per-stripe linear scaling]


■Example (adaptec1)

Look-ahead legalization stops when target regions become small enough


SimPL: Using legal locations as anchors

■ Purpose: Gradually perturb the linear system to generate lower-bound solutions with less overlap
■ Anchors and Pseudonets
  − Look-ahead locations used as fixed, zero-area anchors
  − Anchors and original cells connected with 2-pin pseudonets
  − Pseudonet weights grow linearly with iterations
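Because anchors are fixed, a 2-pin pseudonet only adds its weight to the cell's diagonal entry of the linear system and weight times anchor position to the right-hand side. The sketch below shows this on a single cell; the function name `add_pseudonets` and the weight step of 0.5 per iteration are assumptions for illustration, not values from the talk.

```python
from typing import List, Tuple


def add_pseudonets(diag: List[float], rhs: List[float],
                   anchors: List[float], weight: float) -> Tuple[List[float], List[float]]:
    """Return copies of the diagonal and right-hand side with one pseudonet per cell.

    Off-diagonal entries are unchanged because anchors are fixed points.
    """
    new_diag = [d + weight for d in diag]
    new_rhs = [b + weight * a for b, a in zip(rhs, anchors)]
    return new_diag, new_rhs


if __name__ == "__main__":
    # One cell tied to a pad at x=0 with weight 1; its look-ahead anchor is at x=10.
    diag, rhs, anchors = [1.0], [0.0], [10.0]
    for iteration in (0, 2, 6):
        weight = 0.5 * iteration  # hypothetical linear growth rate
        d, b = add_pseudonets(diag, rhs, anchors, weight)
        print(iteration, b[0] / d[0])
```

As the pseudonet weight grows, the solution of the perturbed system moves from the wirelength-optimal location toward the anchor.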


Next illustration: Tug-of-war between low-wirelength and legalized placements

SimPL Iterations on Adaptec1 (1)
[Figure panels: Iteration=0 (Init WL Opt.), Iteration=1 (Upper Bound), Iteration=2 (Lower Bound), Iteration=3 (Upper Bound)]

SimPL Iterations on Adaptec1 (2)
[Figure panels: Iteration=10 (Lower Bound), Iteration=11 (Upper Bound)]

SimPL Iterations on Adaptec1 (3)
[Figure panels: Iteration=30 (Lower Bound), Iteration=31 (Upper Bound)]

Convergence of SimPL

■ Legal solution is formed between two bounds



Empirical Results: ISPD05 Benchmarks

■ Experimental setup
  − Single threaded runs on a 3.2GHz Intel core i7 Quad CPU Q660 Linux workstation
  − HPWL is computed by GSRC Bookshelf Evaluator
■ < 5000 lines of code in C++, including CG solver for sparse linear systems (with Jacobi preconditioner)
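A Jacobi-preconditioned conjugate gradient solver of the kind mentioned above fits in a few lines. This is a generic textbook sketch in Python, not the C++ solver from SimPL; `pcg_jacobi` and its parameters are illustrative names, and the matrix is assumed symmetric positive definite.

```python
from typing import Callable, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def pcg_jacobi(matvec: Callable[[Vector], Vector], diag: Vector, b: Vector,
               tol: float = 1e-10, max_iter: int = 1000) -> Vector:
    """Solve A x = b, with A given by `matvec` and its diagonal by `diag`."""
    x = [0.0] * len(b)
    r = list(b)                                   # residual for the zero start vector
    z = [ri / di for ri, di in zip(r, diag)]      # Jacobi preconditioning step
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:
            break
        ap = matvec(p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


if __name__ == "__main__":
    # Two movable cells in a chain between pads at 0 and 10 (unit weights).
    matvec = lambda v: [2 * v[0] - v[1], -v[0] + 2 * v[1]]
    print(pcg_jacobi(matvec, [2.0, 2.0], [0.0, 10.0]))
```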


[Figure: runtime profile]
  − Initial placement: 8%
  − CG solver: 31%
  − Sparse matrix and B2B net modeling: 8%
  − Look-ahead legalization: 14%
  − Pseudo-net insertion: 1%
  − Post Global Placement: 38%
  − IO: 0%

Speeding Up Placement Using Parallelism

■ SimPL has very few components (5 KLOC)
■ Each bottleneck is amenable to some form of parallelism
  − Thread-level
  − Instruction-level


Parallelism in Conjugate Gradient Solver

■ Coarse-grain row partitioning
  − Implemented using OpenMP 3.0 compiler directives
■ SSE2 (Streaming SIMD Extensions) instructions
  − Process 4 data elements with a single instruction
  − Marginal runtime improvement in SpMxV
■ Reducing memory bandwidth demand of SpMxV
  − CSR (Compressed Sparse Row) format

Y. Saad, “Iterative Methods for Sparse Linear Systems,” SIAM 2003
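Coarse-grain row partitioning over a CSR matrix can be sketched as follows. Each worker computes a contiguous block of output rows, so no two workers write the same entry. The talk's implementation uses OpenMP in C++; this Python version with a thread pool only mirrors the structure and should not be expected to show a speedup. All function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def csr_from_dense(dense: Sequence[Sequence[float]]) -> Tuple[List[float], List[int], List[int]]:
    """Build CSR arrays (values, column indices, row pointers) from a dense matrix."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def spmv_rows(values, col_idx, row_ptr, x, row_start, row_end) -> List[float]:
    """Multiply rows [row_start, row_end) of the CSR matrix by x."""
    return [sum(values[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
            for i in range(row_start, row_end)]


def spmv_parallel(values, col_idx, row_ptr, x, n_threads: int = 2) -> List[float]:
    """Sparse matrix-vector product with one contiguous row block per task."""
    n_rows = len(row_ptr) - 1
    chunk = max(1, -(-n_rows // n_threads))  # ceiling division
    blocks = [(s, min(s + chunk, n_rows)) for s in range(0, n_rows, chunk)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = pool.map(lambda blk: spmv_rows(values, col_idx, row_ptr, x, *blk), blocks)
        return [y for part in parts for y in part]


if __name__ == "__main__":
    a = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
    print(spmv_parallel(*csr_from_dense(a), [1.0, 2.0, 3.0]))
```

CSR stores only the nonzeros plus two index arrays, which is what reduces the memory traffic per multiplication.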


Parallelism in CG Solver - Example



Parallelism in B2B Model Update

■ B2B net model update
  − B2B model is separable
  − Can process the x and y cases in parallel
  − Additionally, split the nets of the netlist into equal groups that can be processed by multiple threads
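The separability of the B2B model can be seen in a short sketch that builds the x and y edge lists of one net as two independent tasks. The edge weight 1/((P−1)·|distance|) is an assumption of this sketch (scaling conventions differ between formulations); it is chosen so that the quadratic cost equals the net's HPWL at the current placement. Function names and the `eps` guard are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple

Edge = Tuple[int, int, float]  # (pin index, pin index, weight)


def b2b_edges(coords: Sequence[float], eps: float = 1e-9) -> List[Edge]:
    """Bound-to-bound edges of one net along one axis."""
    p = len(coords)
    if p < 2:
        return []
    lo = min(range(p), key=coords.__getitem__)
    hi = max(range(p), key=coords.__getitem__)
    if lo == hi:            # all pins coincide; pick any second pin as a bound
        hi = lo + 1

    def weight(i: int, j: int) -> float:
        return 1.0 / ((p - 1) * max(abs(coords[i] - coords[j]), eps))

    edges = [(lo, hi, weight(lo, hi))]
    for k in range(p):
        if k != lo and k != hi:   # connect every inner pin to both boundary pins
            edges.append((lo, k, weight(lo, k)))
            edges.append((hi, k, weight(hi, k)))
    return edges


def b2b_model(pins: Sequence[Tuple[float, float]]) -> Tuple[List[Edge], List[Edge]]:
    """Build the x and y models of a net as two independent tasks."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    with ThreadPoolExecutor(max_workers=2) as pool:
        fx, fy = pool.submit(b2b_edges, xs), pool.submit(b2b_edges, ys)
        return fx.result(), fy.result()


def quadratic_cost(coords: Sequence[float], edges: List[Edge]) -> float:
    return sum(w * (coords[i] - coords[j]) ** 2 for i, j, w in edges)
```

The same pattern extends to the netlist level: nets share no state while their edges are generated, so groups of nets can be handed to separate threads.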


SSE optimization affects Runtime Profile

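[Figure: two runtime-profile pie charts]

Profile with CG solver at 19%
  − CG solver: 19%
  − Sparse matrix and B2B net modeling: 10%
  − Look-ahead legalization: 18%
  − Pseudo-net insertion: 1%
  − Post Global Placement: 46%
  − IO: 1%

Profile with CG solver at 31%
  − CG solver: 31%
  − Sparse matrix and B2B net modeling: 8%
  − Look-ahead legalization: 14%
  − Pseudo-net insertion: 1%
  − Post Global Placement: 38%
  − IO: 0%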


Parallelism in Look-ahead Legalization (1)

■ Look-ahead legalization (LAL) started consuming a significant fraction of overall runtime
■ Top-down geometric partitioning and non-linear scaling (T&N) are amenable to parallelization
  − Top-down partitioning generates an increasing number of subtasks of similar sizes, which can be solved in parallel
  − After each level of T&N on a bin cluster, each thread generates two sub-clusters with similar numbers of cells


Parallelism in Look-ahead Legalization (2)

■ LAL keeps a global queue of bin clusters Q
■ Static partitioning
  − Assign initial bin clusters to available threads such that each thread starts with a similar number of bin clusters
■ Subtask updates
  − Thread ti processes one of the two sub-clusters (for the next level of T&N); the other is added to the global cluster queue Q
■ Dynamic task scheduling
  − When thread ti is idle, it dynamically retrieves clusters from the global cluster queue Q. The number of clusters to be retrieved is N = max(Q.size()/N_threads, 1)
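The scheduling scheme above can be sketched with a shared queue and a completion counter. This is an illustration of the queueing logic only: a cluster is modeled as a cell count that T&N splits in half, and `process_clusters`, `batch_size` and `leaf_size` are hypothetical names. Python threads will not show the speedups reported in the talk.

```python
import threading
from collections import deque
from typing import List


def batch_size(queue_len: int, n_threads: int) -> int:
    """N = max(Q.size() / N_threads, 1), using integer division."""
    return max(queue_len // n_threads, 1)


def process_clusters(initial: List[int], n_threads: int = 4, leaf_size: int = 1) -> List[int]:
    queue: deque = deque()             # global cluster queue Q
    cond = threading.Condition()
    pending = [len(initial)]           # clusters queued or currently being processed
    leaves: List[int] = []

    def worker(local: List[int]) -> None:
        while True:
            for size in local:
                while size > leaf_size:
                    keep = size // 2   # keep one sub-cluster, share the other
                    with cond:
                        queue.append(size - keep)
                        pending[0] += 1
                        cond.notify()
                    size = keep
                with cond:
                    leaves.append(size)
                    pending[0] -= 1
                    if pending[0] == 0:
                        cond.notify_all()
            with cond:                 # idle: fetch a batch from the global queue
                while not queue and pending[0] > 0:
                    cond.wait()
                if pending[0] == 0:
                    return
                n = min(batch_size(len(queue), n_threads), len(queue))
                local = [queue.popleft() for _ in range(n)]

    # Static partitioning: deal the initial clusters round-robin to the threads.
    threads = [threading.Thread(target=worker, args=(initial[i::n_threads],))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return leaves
```

Taking a fraction of the queue rather than a single cluster reduces contention on Q when many clusters are waiting, while the floor of one keeps idle threads busy near the end.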


Empirical Results – Overall Speed-ups

■ Experimental setup
  − Multithreaded runs on an 8-core AMD-based system with four dual-core CPUs and 16 GB RAM
  − Each CPU was an Opteron 880 processor running at 2.4 GHz with 1024 KB cache

Empirical Results – Component Speed-ups



Extending the Routability-driven Placement

■Ongoing work: simultaneous place-and-route



Simultaneous Place-and-Route

■ After Look-Ahead Legalization (LAL) perform Look-Ahead Routing (LAR)
  − Integrate an in-house router through clean API
  − Cell locations in, accurate congestion maps out
  − The placer accounts for congestion in addition to density (slightly modified formulas, almost no extra work)
■ ISPD 2011 contest organized by IBM Research
  − New, large benchmarks
  − Placements evaluated by a common global router


[Figure panels: SimPL, SimPLR]
■ Key metric is #overflows (OF)
■ Also shown: routed WL (RtWL)


Conclusions

■ New flat quadratic placement algorithm: SimPL
  − Novel primal-dual based approach
  − Amenable to integration with physical synthesis
■ Self-contained, compact implementation
  − Fastest among available academic placers
  − Highly competitive solution quality
  − Amenable to parallelism
  − Easy to extend to simultaneous place-and-route

Questions and Answers

Thank you! Time for Questions
