Parallelization by SimPLification: A Case Study in VLSI Placement
Myung-Chul Kim, Dong-Jin Lee and Igor L. Markov
Dept. of EECS, University of Michigan
PAPA2011, University of Michigan

Complexities of Parallel Algorithms & SW

1. Objectives of parallelization
   A. Improve completion time by using multiple cores in parallel
   B. Improve throughput by using stream processing (latency may increase and become less predictable)
   C. Improve power consumption (by decreasing clock rate)
2. Not an objective (a pitfall)
   − Come up with a slow algorithm that is easy to parallelize

■ In this talk: how to accomplish 1.A without 2
  − Take a leading algorithm and speed up its bottlenecks
  − Design a new algorithm that is (a) better, (b) easy to parallelize

CAD Algorithms

■ Sequence of optimizations
  − Subject to Amdahl’s law
  − The more stages, the harder it is to parallelize effectively
■ Additional complications
  − Elaborate data structures may entail overhead for parallel access
  − When processing is light, memory bandwidth may become a bottleneck (with 4+ threads)
■ Recommendations
  − A simpler algorithm is often easier to parallelize (fewer stages, simpler data structures)
  − Using standard solvers, e.g., for linear algebra, helps reuse previous work on parallelization
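The effect of Amdahl's law on a multi-stage flow can be sketched in a few lines. This is an illustrative example, not code from the talk: the function names `amdahl_speedup` and `staged_speedup` are hypothetical, and perfect scaling of the parallel part is assumed.

```python
def amdahl_speedup(parallel_fraction: float, n_threads: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the runtime scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_threads < 1:
        raise ValueError("n_threads must be >= 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_threads)


def staged_speedup(stage_fractions, stage_speedups):
    """Overall speedup of a multi-stage flow, given one speedup factor per stage.

    stage_fractions: share of the original runtime spent in each stage.
    stage_speedups:  speedup achieved in each stage (1.0 = left sequential).
    """
    new_time = sum(f / s for f, s in zip(stage_fractions, stage_speedups))
    return sum(stage_fractions) / new_time
```

A stage that stays sequential caps the overall gain no matter how well the other stages scale, which is why flows with fewer stages tend to be easier to parallelize effectively.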

Global Placement: Motivation

■ Interconnect lagging in performance while transistors continue scaling
  − Circuit delay, power dissipation and area dominated by interconnect
  − Routing quality highly controlled by placement
■ Circuit size and complexity rapidly increasing
  − Scalable placement algorithm is critical
  − Simplicity, integration with other optimizations

[Figure labels: Unloaded, Coupling, IR drop, RC delay]

Goals in Placement

■ Find good relative ordering of cells
  − Minimize wire length and congestion
  − Maximize timing slack
■ Find good spacing of cells
  − Eliminate wiring congestion problems
  − Provide space for post-placement stages
    – clock trees
    – buffer insertion
    – timing correction
■ Find good global position

Optimize Relative Order

[Figure sequence: cells A, B and C in a row]
  − To spread ... or not to spread
  − Place to the left ... or to the right
  − Without whitespace, placement is dominated by ordering

[Figure: Example of Global Placement (APlace 2.04 from UCSD)]

[Figure: Example of Global Placement (mFAR from UCSB)]

Placement Formulation

■ Objective: Minimize estimated wirelength
  − Half-perimeter wirelength (HPWL) of each net
  − HPWL = (max X – min X) + (max Y – min Y)
■ Subject to constraints:
  − Legality: Row-based placement with no overlaps
  − Routability: Limiting local interconnect congestion for successful routing
  − Timing: Meeting performance target of a design
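The HPWL objective above is the half-perimeter of each net's bounding box, summed over all nets. A minimal sketch follows; the function names and the dictionary-based netlist layout are assumptions made for illustration.

```python
def hpwl(pins):
    """Half-perimeter wirelength of one net.

    pins: list of (x, y) pin locations belonging to the net.
    """
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_hpwl(nets, positions):
    """Sum of HPWL over all nets.

    nets:      list of nets, each a list of cell names.
    positions: dict mapping cell name -> (x, y).
    """
    return sum(hpwl([positions[cell] for cell in net]) for net in nets)
```

For example, a net with pins at (0, 0), (3, 4) and (1, 1) has a 3-by-4 bounding box and therefore an HPWL of 7.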

Quadratic Placement

■ Consider a graph first, not a hypergraph
■ Minimize Σ [(xi − xj)² + (yi − yj)²] (the sum is over edges eij)
  − Seems unrelated to Σ (|xi − xj| + |yi − yj|) but can still be separated into x- and y-components
■ Physical analogy: Hooke’s law
  − Consider an elastic spring, stretched by x
  − Force F = −kx (k is the spring constant)
  − Energy E = ½kx² (the constant factor does not change the minimizer)
  − Our goal: minimize the energy of the system

A system of springs will only settle in a minimum
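The separability and the "springs settle in a minimum" statement can be written out as a short derivation. The edge weights w_ij are an assumption added here for generality; setting w_ij = 1 recovers the unweighted sum on the slide.

```latex
\Phi(\mathbf{x},\mathbf{y})
  = \sum_{e_{ij}} w_{ij}\left[(x_i-x_j)^2+(y_i-y_j)^2\right]
  = \underbrace{\sum_{e_{ij}} w_{ij}(x_i-x_j)^2}_{\Phi_x(\mathbf{x})}
  + \underbrace{\sum_{e_{ij}} w_{ij}(y_i-y_j)^2}_{\Phi_y(\mathbf{y})}

\frac{\partial \Phi_x}{\partial x_i}
  = 2\sum_{j:\,e_{ij}} w_{ij}(x_i-x_j) = 0
\quad\Longrightarrow\quad
x_i = \frac{\sum_{j:\,e_{ij}} w_{ij}\,x_j}{\sum_{j:\,e_{ij}} w_{ij}}
```

Each movable cell therefore rests at the weighted average of its neighbors. Stacking these conditions for all movable cells gives a sparse linear system Qx = b, with fixed-cell positions moved to the right-hand side; the y-direction is handled identically and independently.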

Iterative Optimization

Prior Work

■ Ideal placer
  − Low runtime without sacrificing solution quality
  − Simplicity, integration with other optimizations

[Figure: speed vs. solution quality. Quadratic and force-directed placers: mFAR, Kraftwerk2, FastPlace3. Non-convex optimization placers: mPL6, APlace2, NTUPlace3. The ideal placer is marked separately.]

Key features of SimPL

■ Flat quadratic placement
■ Primal-dual optimization
  − Closing the gap between upper and lower bounds

[Figure: wirelength vs. iteration. Labels: Initial WL Opt., Lower-Bound Solution by Linear System Solver, Upper-Bound Solution by Look-ahead Legalization, Final Solution, Final Legal Solution]

Common Analytical Placement Flow

[Flow chart: Placement Instance → Initial WL Optimization → Global Placement → Converge? (no: repeat Global Placement; yes: Legalization and Detailed Placement)]

SimPL Flow

[Flow chart]
Placement Instance
→ Initial WL Optimization: B2B Graph Building → Linear System Solver → WL Converge? (no: repeat; yes: continue)
→ Global Placement: Look-ahead Legalization (Upper-Bound) → Pseudonet Insertion → B2B Graph Building → Linear System Solver (Lower-Bound) → Converge? (no: repeat; yes: continue)
→ Legalization and Detailed Placement

− B2B net model: [P. Spindler, et al., “Kraftwerk2 - A Fast Force-Directed Quadratic Placement Approach Using an Accurate Net Model,” TCAD 2008]
− We delegate final legalization and detailed placement to FastPlace-DP [M. Pan, et al., “An Efficient and Effective Detailed Placement Algorithm,” ICCAD 2005]

SimPL: Look-ahead Legalization

■ Purpose: Produces an almost-legal placement (Upper-Bound) while preserving the relative cell ordering given by the linear system solver (Lower-Bound)
■ Identify target region
  − Find overflow bin b
  − Create a minimal wide-enough bin cluster B around b
■ Perform geometric top-down partitioning
  − Find cell-area median (Cc) and whitespace median (CB)
  − Assign cells (Cc) to corresponding partitions (CB)
■ Non-linear scaling
  − Form stripe regions
  − Move cells across stripe regions in order, based on whitespace
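One level of the geometric partitioning step can be sketched in one dimension. This is a simplified illustration under stated assumptions, not the authors' implementation: cells are (x, area) pairs, free space is a list of disjoint intervals between obstacles, and all function names are hypothetical.

```python
def area_median(cells):
    """x-coordinate at which half of the total cell area lies at or to the left.

    cells: list of (x, area) pairs.
    """
    ordered = sorted(cells)
    half = sum(area for _, area in ordered) / 2.0
    acc = 0.0
    for x, area in ordered:
        acc += area
        if acc >= half:
            return x
    raise ValueError("no cells")


def whitespace_median(free_intervals):
    """x-coordinate that splits the free (non-obstacle) length in half.

    free_intervals: sorted, disjoint (start, end) pairs.
    """
    half = sum(end - start for start, end in free_intervals) / 2.0
    for start, end in free_intervals:
        if end - start >= half:
            return start + half
        half -= end - start
    raise ValueError("no free space")


def partition_cluster(cells, free_intervals):
    """Split cells at Cc and the region at CB; cell order is preserved."""
    c_cell = area_median(cells)
    c_bin = whitespace_median(free_intervals)
    left = [c for c in sorted(cells) if c[0] <= c_cell]
    right = [c for c in sorted(cells) if c[0] > c_cell]
    return c_cell, c_bin, left, right
```

The left group is then confined to the region on the left of CB and the right group to the region on its right, and the same step is applied recursively to each half.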

SimPL: Look-ahead Legalization (1)

Performing geometric top-down partitioning

[Figure labels: overfilled bin, bin cluster (B), cell-area median (Cc), whitespace median (CB), partitions B0 and B1]

SimPL: Look-ahead Legalization (2)

[Figure labels: cell-area median (Cc), whitespace median (CB), partition B0]

[Figure: non-linear scaling of cells 1 to 8 within a partition. Labels: CB, obstacle borders, uniform cutlines, cell ordering, per-stripe linear scaling]

SimPL: Look-ahead Legalization (3)

■ Example (adaptec1)

Look-ahead legalization stops when target regions become small enough

SimPL: Using legal locations as anchors

■ Purpose: Gradually perturb the linear system to generate lower-bound solutions with less overlap
■ Anchors and pseudonets
  − Look-ahead locations used as fixed, zero-area anchors
  − Anchors and original cells connected with 2-pin pseudonets
  − Pseudonet weights grow linearly with iterations
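The pull of a pseudonet can be illustrated for a single movable cell in one dimension. This is a sketch under assumptions: the growth constant `alpha` and both function names are hypothetical, and the weight is treated as a plain per-iteration constant rather than the exact formula used in SimPL.

```python
def pseudonet_weight(iteration, alpha=0.1):
    """Weight that grows linearly with the iteration count (alpha is illustrative)."""
    return alpha * iteration


def lower_bound_position(neighbors, anchor_x, w_anchor):
    """Quadratic-optimal x of one cell: the weighted average of its connections.

    neighbors: list of (x_j, w_j) for the cell's regular net connections.
    anchor_x:  look-ahead legalized location used as a fixed anchor.
    w_anchor:  weight of the 2-pin pseudonet between cell and anchor.
    """
    num = sum(w * x for x, w in neighbors) + w_anchor * anchor_x
    den = sum(w for _, w in neighbors) + w_anchor
    return num / den
```

With one neighbor at x = 0 (weight 1) and an anchor at x = 10, the cell sits at 0 while the pseudonet weight is zero and moves toward 10 as the weight grows, which is the tug-of-war between the low-wirelength and legalized placements.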

Next illustration: Tug-of-war between low-wirelength and legalized placements

SimPL Iterations on Adaptec1 (1)-(3)

[Figure panels: Iteration=0 (Init WL Opt.), Iteration=1 (Upper Bound), Iteration=2 (Lower Bound), Iteration=3 (Upper Bound), Iteration=10 (Lower Bound), Iteration=11 (Upper Bound), Iteration=20 (Lower Bound), Iteration=21 (Upper Bound), Iteration=30 (Lower Bound), Iteration=31 (Upper Bound), Iteration=40 (Lower Bound), Iteration=41 (Upper Bound)]

Convergence of SimPL

■ Legal solution is formed between two bounds

Empirical Results: ISPD05 Benchmarks

■ Experimental setup
  − Single-threaded runs on a 3.2GHz Intel core i7 Quad CPU Q660 Linux workstation
  − HPWL is computed by GSRC Bookshelf Evaluator
  − < 5000 lines of code in C++, including CG solver for sparse linear systems (with Jacobi preconditioner)

Speeding Up Placement Using Parallelism

■ SimPL has very few components (5KLOC)
■ Each bottleneck is amenable to some form of parallelism
  − Thread-level
  − Instruction-level

[Figure: runtime profile]
  Initial placement                     8%
  CG solver                            31%
  Sparse matrix and B2B net modeling    8%
  Look-ahead legalization              14%
  Pseudo-net insertion                  1%
  Post Global Placement                38%
  IO                                    0%
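As a rough illustration of what this profile implies, Amdahl's law can be applied to it. The calculation assumes, purely for illustration, that only the CG solver (31%), B2B net modeling (8%) and look-ahead legalization (14%) are parallelized, that they scale perfectly, and that everything else stays sequential; none of these assumptions is claimed by the slides.

```latex
S(n) = \frac{1}{(1-p) + p/n}, \qquad p = 0.31 + 0.08 + 0.14 = 0.53

S(8) = \frac{1}{0.47 + 0.53/8} = \frac{1}{0.53625} \approx 1.86,
\qquad
\lim_{n\to\infty} S(n) = \frac{1}{0.47} \approx 2.13
```

Under these assumptions the sequential remainder, dominated by post-global-placement work, bounds the overall gain at roughly 2x regardless of thread count.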

Parallelism in Conjugate Gradient Solver

■ Coarse-grain row partitioning
  − Implemented using OpenMP 3.0 compiler intrinsic
■ SSE2 (Streaming SIMD Extensions) instructions
  − Process 4 data elements with a single instruction
  − Marginal runtime improvement in SpMxV
■ Reducing memory bandwidth demand of SpMxV
  − CSR (Compressed Sparse Row) format
    Y. Saad, “Iterative Methods for Sparse Linear Systems,” SIAM 2003
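The ideas on this slide (CSR storage, row-partitioned SpMxV, Jacobi-preconditioned CG) can be sketched in plain Python. This is an illustration, not the authors' C++/OpenMP code: all function names are hypothetical, the matrix is assumed symmetric positive definite with a nonzero diagonal, and Python threads only demonstrate the row partitioning (the interpreter lock prevents a real speedup here).

```python
from concurrent.futures import ThreadPoolExecutor


def csr_from_dense(dense):
    """Build CSR arrays (values, col_idx, row_ptr) from a dense row list."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def spmv_rows(csr, x, lo, hi):
    """y[lo:hi] = A[lo:hi, :] @ x for one contiguous block of rows."""
    values, col_idx, row_ptr = csr
    return [sum(values[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
            for i in range(lo, hi)]


def spmv(csr, x, n_threads=2):
    """Row-partitioned SpMxV: each task owns a disjoint block of output rows."""
    n = len(csr[2]) - 1
    step = -(-n // n_threads)  # ceiling division
    blocks = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    # A pool per call keeps the sketch short; real code would reuse it.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = pool.map(lambda b: spmv_rows(csr, x, *b), blocks)
        return [v for part in parts for v in part]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcg_jacobi(csr, b, tol=1e-10, max_iters=1000, n_threads=2):
    """Conjugate gradient with a Jacobi (diagonal) preconditioner."""
    values, col_idx, row_ptr = csr
    n = len(b)
    diag = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            if col_idx[k] == i:
                diag[i] = values[k]
    x = [0.0] * n
    r = list(b)                                   # residual for x = 0
    z = [ri / di for ri, di in zip(r, diag)]      # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iters):
        if dot(r, r) ** 0.5 <= tol:
            break
        ap = spmv(csr, p, n_threads)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * a for ri, a in zip(r, ap)]
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x
```

Row partitioning needs no locking because each block writes a disjoint slice of the output vector, and CSR keeps each row's nonzeros contiguous in memory, which is what reduces bandwidth demand in SpMxV.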

Parallelism in CG Solver - Example

Parallelism in B2B Model Update

■ B2B net model update
  − B2B model is separable
  − Can process the x and y cases in parallel
  − Additionally, split the nets of the netlist into equal groups that can be processed by multiple threads
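The separability can be sketched as follows. In the bound-to-bound (B2B) model each pin of a net connects to the net's two boundary pins in a given direction; the weight 1/((p−1)·|xi − xj|) used below is the commonly cited form up to a constant-factor convention, and the guard `EPS`, the function names and the data layout are assumptions of this example. Python threads show the task decomposition only.

```python
from concurrent.futures import ThreadPoolExecutor

EPS = 1e-6  # illustrative guard against zero pin distance


def b2b_edges(coords):
    """B2B edges (i, j, weight) for one net in one direction.

    coords: 1D coordinates of the net's pins.
    """
    p = len(coords)
    if p < 2:
        return []
    lo = min(range(p), key=coords.__getitem__)
    hi = max(range(p), key=coords.__getitem__)
    if lo == hi:              # all pins coincide: pick any second pin
        hi = (lo + 1) % p

    def weight(i, j):
        return 1.0 / ((p - 1) * max(abs(coords[i] - coords[j]), EPS))

    edges = [(lo, hi, weight(lo, hi))]
    for k in range(p):
        if k not in (lo, hi):
            edges.append((k, lo, weight(k, lo)))
            edges.append((k, hi, weight(k, hi)))
    return edges


def build_b2b(nets, xs, ys, n_groups=2):
    """Build x- and y-direction edge lists; each (group, direction) is one task.

    nets: list of nets, each a list of cell names; xs, ys: dict cell -> coordinate.
    """
    groups = [nets[g::n_groups] for g in range(n_groups)]

    def one_task(task):
        group, pos = task
        out = []
        for net in group:
            coords = [pos[c] for c in net]
            out.extend((net[i], net[j], w) for i, j, w in b2b_edges(coords))
        return out

    tasks = [(g, xs) for g in groups] + [(g, ys) for g in groups]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(one_task, tasks))
    x_edges = [e for r in results[:n_groups] for e in r]
    y_edges = [e for r in results[n_groups:] for e in r]
    return x_edges, y_edges
```

With these weights, the quadratic cost Σ w·(xi − xj)² of a net equals its extent (max − min) in that direction at the current placement, so the x and y systems together reproduce HPWL while remaining independent of each other.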

SSE optimization affects Runtime Profile

[Figure: two runtime profiles. The "Before" column matches the profile shown earlier.]
  Component                            Before   After
  Initial placement                       8%      5%
  CG solver                              31%     19%
  Sparse matrix and B2B net modeling      8%     10%
  Look-ahead legalization                14%     18%
  Pseudo-net insertion                    1%      1%
  Post Global Placement                  38%     46%
  IO                                      0%      1%

Parallelism in Look-ahead Legalization (1)

■ Look-ahead legalization (LAL) started consuming a significant fraction of overall runtime
■ Top-down geometric partitioning and non-linear scaling (T&N) are amenable to parallelization
  − Top-down partitioning generates an increasing number of subtasks of similar sizes, which can be solved in parallel
  − After each level of T&N on a bin cluster, each thread generates two sub-clusters with similar numbers of cells

Parallelism in Look-ahead Legalization (2)

■ LAL keeps the global queue of bin clusters Q
■ Static partitioning
  − Assign initial bin clusters to available threads such that each thread has a similar number of bin clusters to start
■ Subtask updates
  − Thread ti processes one of the two sub-clusters (for the next level of T&N); the remainder is added to the global cluster queue Q
■ Dynamic task scheduling
  − When thread ti is idle, it dynamically retrieves clusters from the global cluster queue Q. The number of clusters to be retrieved is N = max(Q.size()/N_threads, 1)
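The queue discipline above can be sketched with standard threads. This is a simplified illustration, not the authors' code: clusters are plain lists that are halved until they reach `min_size`, the static initial assignment is replaced by placing all initial clusters in Q, and the names `parallel_partition` and `pending` are hypothetical.

```python
import threading
import time
from collections import deque


def parallel_partition(initial_clusters, n_threads=4, min_size=1):
    """Split clusters top-down with a shared queue and dynamic task retrieval."""
    min_size = max(min_size, 1)
    queue = deque(initial_clusters)      # global cluster queue Q
    lock = threading.Lock()
    pending = [len(queue)]               # clusters queued or being processed
    leaves = []

    def worker():
        while True:
            with lock:
                if pending[0] == 0:      # nothing queued, nothing in flight
                    return
                # N = max(Q.size() / N_threads, 1), limited by what is available
                n = max(len(queue) // n_threads, 1)
                batch = [queue.popleft() for _ in range(min(n, len(queue)))]
            if not batch:
                time.sleep(0.001)        # other threads may still add work
                continue
            for cluster in batch:
                while len(cluster) > min_size:
                    mid = len(cluster) // 2
                    # Keep one sub-cluster, hand the other to the global queue.
                    cluster, other = cluster[:mid], cluster[mid:]
                    with lock:
                        queue.append(other)
                        pending[0] += 1
                with lock:
                    leaves.append(cluster)
                    pending[0] -= 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return leaves
```

Retrieving a share of the queue rather than a single cluster reduces contention on Q when many clusters are waiting, while the lower bound of 1 keeps idle threads busy when few remain.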

Empirical Results – Overall Speed-ups

■ Experimental setup
  − Multithreaded runs on an 8-core AMD-based system with four dual-core CPUs and 16GByte RAM
  − Each CPU was an Opteron 880 processor running at 2.4GHz with 1024KB cache

Empirical Results – Component Speed-ups

Extending the Routability-driven Placement

■ Ongoing work: simultaneous place-and-route

Simultaneous Place-and-Route

■ After Look-Ahead Legalization (LAL), perform Look-Ahead Routing (LAR)
  − Integrate an in-house router through a clean API
  − Cell locations in, accurate congestion maps out
  − The placer accounts for congestion in addition to density (slightly modified formulas, almost no extra work)
■ ISPD 2011 contest organized by IBM Research
  − New, large benchmarks
  − Placements evaluated by a common global router

SimPL and SimPLR

■ Key metric is #overflows (OF)
■ Also shown: routed WL (RtWL)

Conclusions

■ New flat quadratic placement algorithm: SimPL
  − Novel primal-dual based approach
  − Amenable to integration with physical synthesis
■ Self-contained, compact implementation
  − Fastest among available academic placers
  − Highly competitive solution quality
  − Amenable to parallelism
  − Easy to extend to simultaneous place-and-route

Questions and Answers

Thank you! Time for Questions