Alain Darte, Compsys Project: Compilation and Embedded Systems, CNRS, LIP, ENS-Lyon, France. Lattice-Based Memory Allocation. WOG’04, April 25th, 2004. Recent trends in Compiler Construction. Sven Verdoolaege’s PhD Defense. Joint work with Rob Schreiber (HP Labs) and Gilles Villard (CNRS, LIP). References: CASES’03, IEEE Transactions on Computers (to appear).


Alain Darte

Compsys Project: Compilation and Embedded Systems

CNRS, LIP, ENS-Lyon, France

Lattice-Based Memory Allocation

WOG’04, April 25th, 2004. Recent trends in Compiler Construction. Sven Verdoolaege’s PhD Defense.

Joint work with Rob Schreiber (HP Labs) and Gilles Villard (CNRS, LIP).

References: CASES’03, IEEE Transactions on Computers (to appear).


Outline

• Introduction:

– The initial context: PICO, HP Labs software tool for compiling high-level programs (e.g., C code) into NPAs (Non Programmable Accelerators). How to store intermediate results?

– Mathematical tools for high-level program transformations.

– An example of communicating pipelined loops.

• Lattice-based memory allocation.

• Examples of previous work limitations.

• Main results and open questions.


PiCo (Program In, Chip Out)

HP Labs tool for the automatic generation of non-programmable accelerators (NPAs).

[Figure: PICO flow. Program in → PICO compiler and architecture synthesis → VHDL for processors → CAD tools (logic synthesis, physical design) → chip out.]

Input code:

• C code

Other possible inputs: recurrence equations, Matlab, Kahn processes.

Output “code”:

• synthesizable VHDL

• netlists for FPGA

• VLIW code (interface)

Similar tools: MMAlpha (Inria), Atomium (IMEC), Compaan (Leiden).


High-Level Program Optimizations

• Program analysis: dependence analysis, lifetime analysis, footprint analysis, array expansion, array renaming, etc.

• Code and loop transformations: tiling, scheduling, nested loop transformations, modulo scheduling, etc.

Well-established mathematical tools and theory: graph algorithms, polyhedral manipulations, Hermite/Smith forms, integer linear programming, Ehrhart polynomials, etc.

BUT

• Memory optimizations:

– optimization of local memory (intra-loop buffer);

– optimization of inter-loop buffers for communicating NPAs.

No suitable mathematical tools so far.


Example: DCT-like code.

First NPA:

do br = 0, 63
  do bc = 0, 63
    do r = 0, 7
      A(br, bc, r, …) = …
    enddo
  enddo
enddo

pipelined with

Second NPA:

do br = 0, 63
  do bc = 0, 63
    do c = 0, 7
      … = A(br, bc, …, c)
    enddo
  enddo
enddo

Memory for A:

How to schedule the computations?

How to allocate elements of A in local memory so as to reduce its size?

a) Full array: 256K elements.

b) Optimized size = 112 elements (< 2 blocks), with A(br, bc, r, c) mapped to (r mod 4, 16(br+bc) + 2r + c mod 28).

Huge gap!


Outline

• Introduction.

• Lattice-based memory allocation:

– Definition of modular allocations.

– Conflicting indices and critical lattices.

• Examples of limitations of previous work.

• Main results and open questions.


Memory Reduction Problem for Arrays

• Lifetime analysis:

– Schedule of computations ⇒ lifetime for each value (similar to dependence analysis, exact or over-approximated).

• Memory reuse:

– Values simultaneously live should not share the same location (constraints similar to register allocation).

• Restrict to “simple” addressing functions (for code generation):

– canonical linearization, linear mapping in multi-dimensional arrays + wrapping with modulo operations (reuse).

All are special cases of modular memory allocations.

Given a scheduled program (i.e., operations are not reordered), or several communicating programs, find the minimal memory size to store intermediate values and an adequate memory mapping.


Modular Mappings

• Generalization of (rotating) registers in higher dimensions: the value indexed by i is written at the multi-dimensional position Mi mod b, where b is a positive integral vector and M an integral matrix.

Ex: i = (i1,i2) stored at address (2i1+i2 mod 3, i1+i2 mod 6): b = (3,6), size = 18.

Given a schedule and a lifetime analysis, find a valid allocation (M,b) such that the product of the components of b (memory size) is minimized.
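
A minimal sketch of the mapping i ↦ Mi mod b, instantiated on the example above (M with rows (2,1) and (1,1), b = (3,6)). The helper names `modular_address` and `memory_size` are illustrative and do not come from the talk.

```python
from math import prod

def modular_address(M, b, i):
    """Return the multi-dimensional address (M i) mod b of the value indexed by i."""
    return tuple(sum(m * x for m, x in zip(row, i)) % bk
                 for row, bk in zip(M, b))

def memory_size(b):
    """Memory size of a modular mapping: the product of the components of b."""
    return prod(b)

M = ((2, 1), (1, 1))   # integral matrix from the slide's example
b = (3, 6)             # positive integral vector of moduli

print(modular_address(M, b, (1, 2)))
print(memory_size(b))
```

Python's `%` always returns a non-negative result for a positive modulus, so negative indices are also wrapped into the buffer.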

• Generalizes all previous approaches:

– De Greef, Catthoor, De Man (1996-1997): linearizations + 1 modulo

– Lefebvre, Feautrier (1996-1997): successive modulos.

– Wilde, Rajopadhye (1996), Quilleré, Rajopadhye (2000): projections.

– Strout, Carter, Ferrante, Simon (ASPLOS’98): only 1 modulo.

– Thies, Vivien, Sheldon, Amarasinghe (PLDI’01): same.


Our Main Contributions

• We identify the fundamental object to work with:

– The set S of all differences of conflicting indices.

• We show the link with critical lattices:

– Finding the best allocation Mi mod b among ALL possible modular allocations amounts to finding the critical integer lattice for the set S.

• We give guaranteed heuristics to approximate the optimal:

It explains previous work;

It gives new (and better) solutions;

It shows the link with theoretical work on successive minima, basis reduction, Minkowski’s theorems, etc.

[Thies et al., PLDI’01]: There is a need for a technique able “to consider more general storage mappings” and that “would allow variations in the number of array dimensions, while still capturing the directional and modular reuse of the occupancy vector”.


Outline

• Introduction.

• Lattice-based memory allocation.

• Examples of previous work limitations:

– rely on particular linearizations,

– or may wrap along the wrong axis.

• Main results and open questions.


De Greef, Catthoor, and De Man

• Were the first to identify the need for memory reduction techniques for embedded multimedia applications. Patent (1996) for intra- and inter-array memory reuse.

• Inter-array reuse:

– Geometrical heuristics for packing different arrays in a given memory buffer. Will not be discussed here.

• Intra-array memory reuse:

– Consider each original d-dimensional array and its 2^d d! canonical linearizations. (Example in 2D for an NxM array, look at 8 linearizations: Mi+j, Mi-j, -Mi+j, -Mi-j, i+Nj, i-Nj, -i+Nj, -i-Nj).

– Compute the maximal address difference D between two simultaneously live values.

– Select the linearization with smallest distance D and wrap the array modulo (D+1).


De Greef, Catthoor, De Man: Example 1

do i = 1,N
  do j = 1,N
    a(i,j) = ...
    b(i,j) = a(i-1,j)
  enddo
enddo

Column-major order (Fortran-like): i+Nj, maximal distance = N(N-1)+1

Row-major order (C-like): Ni+j, maximal distance = N

Best canonical linearization: Ni+j mod (N+1).

do i = 1,N
  do j = 1,N
    a(Ni+j mod (N+1)) = ...
    b(i,j) = a(Ni+j+1 mod (N+1))
  enddo
enddo

Equivalently, since N = -1 mod (N+1):

do i = 1,N
  do j = 1,N
    a(-i+j mod (N+1)) = ...
    b(i,j) = a(-i+j+1 mod (N+1))
  enddo
enddo


De Greef, Catthoor, De Man: Example 2

Original code:

do i = 1,N
  do j = 1,N
    a(i,j) = ...
    b(i,j) = a(i-1,j)
  enddo
enddo

Rescheduled along the diagonals t = i+j:

do t = 2,2N /* t = i+j */
  do j = max(1,t-N),min(N,t-1)
    a(t-j,j) = ...
    b(t-j,j) = a(t-j-1,j)
  enddo
enddo

Any canonical linearization leads to a distance Θ(N²)!

But the allocation i mod N, or even i, is just fine:

do t = 2,2N /* t = i+j */
  do j = max(1,t-N),min(N,t-1)
    a(t-j) = ...
    b(t-j,j) = a(t-j-1)
  enddo
enddo

How could we have missed this?
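
The two examples can be checked by brute force for a small N. The sketch below assumes a simple lifetime model that is not spelled out in the slides: a(i,j) is live from the step that writes it up to and including the step that executes iteration (i+1,j), or only at its write step when no such iteration exists. All function names are illustrative.

```python
from itertools import product

def conflicts(order):
    """Pairs of simultaneously live values a(i,j) for the code b(i,j) = a(i-1,j).

    Assumed lifetime: from the write step of (i,j) to the step of (i+1,j).
    """
    step = {v: s for s, v in enumerate(order)}
    life = {(i, j): (s, step.get((i + 1, j), s)) for (i, j), s in step.items()}
    vals = list(life)
    return [(u, v) for k, u in enumerate(vals) for v in vals[k + 1:]
            if life[u][0] <= life[v][1] and life[v][0] <= life[u][1]]

def max_distance(pairs, addr):
    """Largest address difference between two conflicting values."""
    return max(abs(addr(u) - addr(v)) for u, v in pairs)

def is_valid(pairs, addr):
    """An allocation is valid if conflicting values never share an address."""
    return all(addr(u) != addr(v) for u, v in pairs)

N = 6
iters = list(product(range(1, N + 1), repeat=2))
row_major = sorted(iters)                                      # Example 1
wavefront = sorted(iters, key=lambda v: (v[0] + v[1], v[1]))   # Example 2

# The 8 canonical linearizations of a 2D N x N array.
canonical = [lambda v, a=a, b=b: a * v[0] + b * v[1]
             for a, b in [(N, 1), (N, -1), (-N, 1), (-N, -1),
                          (1, N), (1, -N), (-1, N), (-1, -N)]]

ex1 = conflicts(row_major)
ex2 = conflicts(wavefront)
d1 = max_distance(ex1, canonical[0])                    # row-major, Example 1
best2 = min(max_distance(ex2, f) for f in canonical)    # best of 8, Example 2

print("Example 1, Ni+j distance:", d1)
print("Example 2, best canonical distance:", best2)
print("Example 2, allocation i valid:", is_valid(ex2, lambda v: v[0]))
```

Under this model the row-major distance in Example 1 is N, while in Example 2 every canonical linearization has a distance that grows quadratically with N, even though the allocation i is valid with only N cells.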


Lefebvre and Feautrier

• Developed in the context of parallelizing compilers:

– a) Eliminate spurious memory dependences thanks to single assignment form; b) Wrap memory back when possible.

• Inter-array reuse:

– Coloring heuristics on array names (as for register allocation).

• Intra-array memory reuse:

– Idea 1: forget about original arrays, focus on original loop indices.

– Idea 2: wrap successively in each dimension with modulos.

From a computational point of view, use classical techniques based on (rational) linear programming.


Lefebvre, Feautrier: Example 1 revisited

do i = 1,N
  do j = 1,N
    a(i,j) = ...
    b(i,j) = a(i-1,j)
  enddo
enddo

Along i, maximal distance = 1 ⇒ i mod 2.

Along j (for a fixed i), maximal distance = N-1 ⇒ j mod N, i.e., j.

Selected allocation (i mod 2, j), with a memory size 2N (note: N+1 in previous solution).

do i = 1,N
  do j = 1,N
    a(i mod 2, j) = ...
    b(i,j) = a(i-1 mod 2, j)
  enddo
enddo


Lefebvre, Feautrier: Example 2 revisited

Original code:

do i = 1,N
  do j = 1,N
    a(i,j) = ...
    b(i,j) = a(i-1,j)
  enddo
enddo

Rescheduled along the diagonals t = i+j:

do t = 2,2N /* t = i+j */
  do j = max(1,t-N),min(N,t-1)
    a(t-j,j) = ...
    b(t-j,j) = a(t-j-1,j)
  enddo
enddo

Along i, maximal distance = N-1 ⇒ i mod N, i.e., i.

Along j (for a fixed i), maximal distance = 0 ⇒ no extra dimension.

Selected allocation i mod N, i.e., i. (Note: order N² in previous solution)

do t = 2,2N /* t = i+j */
  do j = max(1,t-N),min(N,t-1)
    a(t-j) = ...
    b(t-j,j) = a(t-j-1)
  enddo
enddo


Lefebvre, Feautrier: Example 3

do i = 1,N
  do j = 1,N
    a(i,j) = ...
  enddo
enddo

pipelined 1 clock cycle later with

do i = 1,N
  do j = 1,N
    b(i,j) = a(i,j)+...
  enddo
enddo

Along i, maximal distance = 1 ⇒ i mod 2.

Along j (for a fixed i), maximal distance = 1 ⇒ j mod 2.

Selected allocation (i mod 2, j mod 2) and size 4. OK.


Lefebvre, Feautrier: Example 3 (variant)

do t = 2,2N /* t = i+j */
  do j = max(1,t-N),min(N,t-1)
    a(t-j,j) = ...
  enddo
enddo

pipelined 1 clock cycle later with

do t = 2,2N /* t = i+j */
  do j = max(1,t-N),min(N,t-1)
    b(t-j,j) = a(t-j,j)+...
  enddo
enddo

Along i, maximal distance = N-1 ⇒ i mod N.

Along j (for fixed i), max. dist = 0 ⇒ j mod 1.

Corresponding memory size N!

Same if starting with j. FAIL!
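
The successive-modulo mechanism used in the Lefebvre-Feautrier examples can be sketched as follows. The lifetime models are assumptions made for this illustration: in Examples 1 and 2, a(i,j) lives until iteration (i+1,j) executes; in Example 3 and its variant, a value written at step s is read at step s+1. The final mapping (i+j mod 2, j mod 2) is not from the slides; it is only a hand-picked allocation showing that a much smaller buffer than N is valid for the variant.

```python
from itertools import product

def conflict_diffs(order, last_live):
    """Set of differences u - v between simultaneously live values.

    order: iterations (i, j) in execution order, one per step.
    last_live(v, step): last step at which the value written by v is live
    (a modelling assumption supplied by the caller).
    """
    step = {v: s for s, v in enumerate(order)}
    life = {v: (s, last_live(v, step)) for v, s in step.items()}
    return {(u[0] - v[0], u[1] - v[1])
            for u in order for v in order
            if life[u][0] <= life[v][1] and life[v][0] <= life[u][1]}

def successive_modulos(diffs):
    """Modulus in dimension k = 1 + max |d_k| over the conflicting
    differences whose earlier components are all zero."""
    dims = len(next(iter(diffs)))
    return tuple(1 + max(abs(d[k]) for d in diffs if not any(d[:k]))
                 for k in range(dims))

N = 5
iters = list(product(range(1, N + 1), repeat=2))
row_major = sorted(iters)
wavefront = sorted(iters, key=lambda v: (v[0] + v[1], v[1]))

def read_by_next_row(v, step):
    return step.get((v[0] + 1, v[1]), step[v])

def one_cycle(v, step):
    return step[v] + 1

results = {
    "example 1": successive_modulos(conflict_diffs(row_major, read_by_next_row)),
    "example 2": successive_modulos(conflict_diffs(wavefront, read_by_next_row)),
    "example 3": successive_modulos(conflict_diffs(row_major, one_cycle)),
    "example 3 variant": successive_modulos(conflict_diffs(wavefront, one_cycle)),
}
for name, moduli in results.items():
    print(name, moduli)

# Hand-picked skewed mapping for the variant: (i+j mod 2, j mod 2), size 4.
variant = conflict_diffs(wavefront, one_cycle)
skewed_ok = all(((d[0] + d[1]) % 2, d[1] % 2) != (0, 0)
                for d in variant if d != (0, 0))
print("skewed mapping valid:", skewed_ok)
```

With these assumptions the mechanism reproduces the sizes quoted on the slides (2N, N, 4, and N), and the last check illustrates why wrapping along the original axes is the wrong choice for the skewed schedule.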


Outline

• Introduction.

• Lattice-based memory allocation.

• Examples of previous work limitations.

• Main results and open questions:

No way to explain quickly all details, even to experts in lattice theory and reduction theory...

See CASES’03 proceedings, research report (http://perso.ens-lyon.fr/alain.darte), or IEEE TC journal version (to appear).

But I can try to:

» Explain basic concepts of critical lattice and modular allocations.

» Illustrate different mechanisms.

» State results.


There was a Need for a Framework for Memory Reduction Based on Modular Allocations

• Lower bounds:

– Given a lifetime analysis, can we give a lower bound for the best achievable memory size? What is the best modular memory allocation?

• Upper bounds:

– Can we find mechanisms leading to allocations whose corresponding memory size is not arbitrarily bad compared to the lower bound (guaranteed heuristics)?

• Robustness:

– We need a framework that can possibly capture parameters, that does not depend on the basis in which the problem is described, etc. ⇒ Geometrical model.

• Computability:

– We need to make sure the mechanisms are constructive and lead to heuristics (or algorithms) that can be implemented.


Set of Conflicting Index Differences

• Index description:

– Choose an index description for values that are going to share a given array (the allocation will be linear with respect to these indices). Typically, loop indices, array indices, etc.

• Set of conflicting index differences:

– Build the set CS of pairs of conflicting (i.e., simultaneously live) indices (i,j), and the set DS of differences (i-j).

We want: ∀ (i,j) ∈ CS, i ≠ j ⇒ Mi mod b ≠ Mj mod b, or equivalently

∀ d ∈ DS, d ≠ 0 ⇒ Md mod b ≠ 0, or equivalently

Md mod b = 0, d ∈ DS ⇒ d = 0


Admissible and Critical Lattices

• The kernel of (M,b):

– The set Λ = {i | Mi mod b = 0} is a full-dimensional lattice.

– (M,b) is valid iff Λ ∩ DS = {0}, i.e., Λ is an admissible lattice for DS.

• Conversely:

– If A is a basis for Λ, admissible integral lattice for DS, compute the Smith form A = Q1 S Q2 with Q1 and Q2 unimodular, S = diag(b).

– The mapping (M,b) where M is the inverse of Q1 has the kernel Λ, thus is a valid allocation with memory size det(S) = det(Λ).

The modular allocation with smallest memory size corresponds to a critical integer lattice for DS, i.e., an admissible integer lattice for DS with smallest determinant.
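
As a worked instance of the converse construction, the sketch below checks a Smith factorization A = Q1 S Q2 for the lattice with basis (4,3), (8,0) used in the toy example of the talk. The matrices Q1 and Q2 were worked out by hand for this illustration and are only one possible choice; the helper names are hypothetical.

```python
def matmul(X, Y):
    """Product of two small integer matrices given as lists of rows."""
    return [[sum(X[r][k] * Y[k][c] for k in range(len(Y)))
             for c in range(len(Y[0]))] for r in range(len(X))]

def det2(X):
    """Determinant of a 2x2 matrix."""
    return X[0][0] * X[1][1] - X[0][1] * X[1][0]

A = [[4, 8], [3, 0]]       # columns (4,3) and (8,0): a basis of the lattice
Q1 = [[4, -1], [3, -1]]    # unimodular (hand-computed for this example)
S = [[1, 0], [0, 24]]      # Smith form, S = diag(b) with b = (1, 24)
Q2 = [[1, 8], [0, 1]]      # unimodular (hand-computed for this example)
M = [[1, -1], [3, -4]]     # inverse of Q1
b = (1, 24)

def address(i):
    """Modular mapping i -> M i mod b."""
    return tuple(sum(m * x for m, x in zip(row, i)) % bk
                 for row, bk in zip(M, b))

print(matmul(matmul(Q1, S), Q2) == A)   # factorization holds
print(address((4, 3)), address((8, 0))) # both basis vectors lie in the kernel
print(abs(det2(A)), det2(S))            # det of the lattice = memory size
```

The first component of the mapping is taken modulo 1 and is always 0, so the allocation reduces to the one-dimensional mapping 3i-4j mod 24.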


Modular Mappings: Toy Example

Corners: (-1,5), (1,-5), (8,1), (-8,-1)

Critical lattice: basis (4,3), (8,0)

Corresponding allocation (3i-4j mod 24).

Modular Mappings: Toy Example (successive allocations)

Corners: (-1,5), (1,-5), (8,1), (-8,-1)

• Bounding box: (i mod 9, j mod 6), size = 54.

• Successive modulos: (i mod 9, j mod 5), size = 45.

• Skewed bounding box: (i-j mod 8, j mod 6), size = 48.

• Skewed successive modulos: (i-j mod 8, j mod 4), size = 32.

• Better allocation: (i-j mod 7, j mod 4), size = 28.

• Critical lattice: basis (4,3), (8,0). Best allocation (3i-4j mod 24), size = 24.
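
The allocations of the toy example can be checked by brute force. The sketch assumes that DS is the set of all integer points of the closed polygon with the four corners above (an over-approximation chosen for this illustration); the names `in_polygon` and `is_valid` are hypothetical.

```python
from itertools import product
from math import prod

# Corners of the 0-symmetric polygon, listed counterclockwise.
CORNERS = [(-1, 5), (-8, -1), (1, -5), (8, 1)]

def in_polygon(p):
    """True if p lies in the closed convex polygon spanned by CORNERS."""
    x, y = p
    for k in range(len(CORNERS)):
        (x0, y0), (x1, y1) = CORNERS[k], CORNERS[(k + 1) % len(CORNERS)]
        # For a counterclockwise polygon, interior points are to the left
        # of every edge (non-negative cross product).
        if (x1 - x0) * (y - y0) - (y1 - y0) * (x - x0) < 0:
            return False
    return True

# Assumed conflict differences: all nonzero integer points of the polygon.
DS = [p for p in product(range(-8, 9), range(-5, 6))
      if in_polygon(p) and p != (0, 0)]

def is_valid(M, b):
    """(M, b) is valid iff no nonzero d in DS satisfies M d mod b = 0."""
    for d in DS:
        image = [sum(m * x for m, x in zip(row, d)) % bk
                 for row, bk in zip(M, b)]
        if not any(image):
            return False
    return True

ALLOCATIONS = {
    "bounding box":               (((1, 0), (0, 1)), (9, 6)),
    "successive modulos":         (((1, 0), (0, 1)), (9, 5)),
    "skewed bounding box":        (((1, -1), (0, 1)), (8, 6)),
    "skewed successive modulos":  (((1, -1), (0, 1)), (8, 4)),
    "better allocation":          (((1, -1), (0, 1)), (7, 4)),
    "critical lattice":           (((3, -4),), (24,)),
}

for name, (M, b) in ALLOCATIONS.items():
    print(f"{name}: size {prod(b)}, valid = {is_valid(M, b)}")
```

Replacing the moduli by smaller ones, for example (6, 6) with the identity matrix, makes the check fail because the point (6,0) lies in the polygon and maps to address 0.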


Results for 0-Symmetric Convex Bodies

• We work with a 0-symmetric polytope K such that DS ⊆ K.

(actually, we assume that the vector spaces generated by the points in K and the integer points in K are equal: K is full-dimensional)

• Lower bound in terms of volume: Vol(K)/2^n

• Optimal solution found by optimized enumeration + ILP.

• Heuristics exist with memory size at most c_n Vol(K), where c_n depends on the dimension n only ⇒ guaranteed heuristics.

• One heuristic uses exactly Lefebvre-Feautrier mechanism but in a well-chosen basis. Always equivalent (i.e., with same memory size) to a particular linearization (= 1D mapping).

• Another heuristic (Rogers’ principle) works even for arbitrary sets, but equivalent linearization not clear.

• In practice: follow the schedule, when possible...

Reference: Gruber and Lekkerkerker, Geometry of Numbers.
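
For the two-dimensional toy example with corners (-1,5), (1,-5), (8,1), (-8,-1), the volume lower bound can be evaluated directly. This is a small sketch using the shoelace formula; the function name is illustrative.

```python
from fractions import Fraction

def polygon_area(corners):
    """Area by the shoelace formula; corners listed in order around the polygon."""
    n = len(corners)
    twice = sum(corners[k][0] * corners[(k + 1) % n][1]
                - corners[(k + 1) % n][0] * corners[k][1]
                for k in range(n))
    return Fraction(abs(twice), 2)

corners = [(-1, 5), (-8, -1), (1, -5), (8, 1)]   # toy example, in order
n = 2                                            # dimension
area = polygon_area(corners)
lower_bound = area / 2 ** n                      # Vol(K) / 2^n

print("Vol(K) =", area)
print("lower bound =", float(lower_bound))
```

The bound is consistent with the size 24 reported for the critical lattice of this body.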


Remarks on critical lattices

• a) Hard to find the critical lattice, starting from 3D, even for simple bodies. b) Critical integer lattice ≈ critical lattice for large bodies. ⇒ Hard to find the optimal, heuristics needed.

• Lower bound in terms of volume: Δ(K) ≥ Vol(K)/2^n

– If S-S ⊆ K, then all elements in S are mapped to different locations: Δ(K) ≥ Card(S).

– Minkowski’s first theorem: if Λ is a lattice and K is 0-symmetric with Vol(K) > 2^n det(Λ), then K contains a nonzero lattice point of Λ.

• Gauge function: F(x) = inf{λ > 0 | x in λK} is a distance function.

• Successive minima: λ_i(K) = inf{λ > 0 | dim(Vect(λK ∩ Z^n)) ≥ i}.

– Minkowski’s second theorem:

(2^n/n!) det(Λ) ≤ λ_1(K) … λ_n(K) Vol(K) ≤ 2^n det(Λ)


Looking for the optimal solution

• Generate all possible lattices of a given determinant. Avoid duplicates: each lattice is uniquely determined by its Hermite form (triangular matrix).

(Remark: not clear we could do the same for non-equivalent mappings without reasoning with the corresponding lattices.)

• Check that the lattice is admissible for K, either by ILP, or by enumeration if integer points in K can be enumerated.

• For the DCT example:

– in 4D, optimal = 112, there are 86,416,644 lattices to check, it takes roughly 2 days!

– rewritten in 3D, optimal = 112, there are 941,901 lattices to check, it takes roughly 30 minutes.

Feasible only for small sets K and small dimensions.
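
The enumeration can be sketched in two dimensions on the toy example. The code assumes the Hermite form is stored as rows (a, b) and (0, c) with a·c equal to the determinant and 0 ≤ b < c, and that K is the polygon with corners (-1,5), (1,-5), (8,1), (-8,-1), written here as |4x+9y| ≤ 41 and |6x-7y| ≤ 41 (inequalities derived from the corners for this illustration). It is a naive search, not the optimized enumeration of the talk.

```python
from itertools import product

def hermite_forms_2d(det):
    """Yield (a, b, c): the lattice generated by rows (a, b) and (0, c),
    with a * c = det and 0 <= b < c. Each sublattice appears exactly once."""
    for a in range(1, det + 1):
        if det % a == 0:
            c = det // a
            for b in range(c):
                yield (a, b, c)

def in_lattice(p, form):
    """True if p = k*(a, b) + l*(0, c) for some integers k, l."""
    a, b, c = form
    x, y = p
    return x % a == 0 and (y - (x // a) * b) % c == 0

def in_K(p):
    x, y = p
    return abs(4 * x + 9 * y) <= 41 and abs(6 * x - 7 * y) <= 41

# Nonzero integer points of K, enumerated once.
POINTS = [p for p in product(range(-8, 9), range(-5, 6))
          if in_K(p) and p != (0, 0)]

def is_admissible(form):
    """Admissible: the lattice contains no nonzero integer point of K."""
    return not any(in_lattice(p, form) for p in POINTS)

def smallest_admissible(max_det):
    """First (determinant, Hermite form) found when scanning determinants upward."""
    for det in range(1, max_det + 1):
        for form in hermite_forms_2d(det):
            if is_admissible(form):
                return det, form
    return None

print(len(list(hermite_forms_2d(24))), "lattices of determinant 24")
print(smallest_admissible(24))
```

The Hermite form (4, 3, 6) describes the same lattice as the basis (4,3), (8,0) of the toy example, since (0,6) = 2·(4,3) - (8,0). The number of candidates grows quickly with the determinant and the dimension, which is why the slides report hours to days for the DCT example.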


Rogers’ heuristic adapted

• Choose n positive integers ρ_i such that ρ_i is a multiple of ρ_{i+1} and dim(L_i) ≤ i-1, where L_i = Vect(K/ρ_i ∩ Z^n).

• Choose a basis (a_1, …, a_n) of Z^n such that L_i ⊆ Vect(a_1, …, a_{i-1}).

• Define Λ the lattice generated by the vectors ρ_i a_i.

det(Λ) ≤ n! Vol(K)


Heuristic based on K (i.e., lattice)

• Choose n linearly independent integer vectors (a_1, …, a_n).

• Compute F_i(a_i) = inf{ F(y) | y in a_i + Vect(a_1, …, a_{i-1}) }.

• Choose n integers ρ_i such that ρ_i F_i(a_i) > 1.

• Define Λ the lattice generated by the vectors ρ_i a_i.

det(Λ) ≤ (n!)^2 Vol(K) if F_i(a_i) ≤ 1 for all i


Heuristic based on K* (i.e., mapping)

• K* = dual (or polar reciprocal) of K = {y | y.x ≤ 1 for all x in K}

• K** = K, F* related to F, Vol(K) related to Vol(K*), successive minima related, etc.

• Choose n linearly independent integer vectors (c_1, …, c_n).

• Compute F*_i(c_i) = sup{ c_i.x | x in K, c_1.x = … = c_{i-1}.x = 0 }.

• Choose n integers ρ_i such that ρ_i > F*_i(c_i).

• Define the mapping (M,b) with the c_i as rows of M and b = ρ.

det(Λ) ≤ (n!)^2 Vol(K) if F*_i(c_i) ≥ 1 for all i

Dual of the previous heuristic. Exactly Lefebvre-Feautrier in a well-chosen basis.
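
The mapping-based heuristic can be sketched on the toy example with corners (-1,5), (1,-5), (8,1), (-8,-1). For simplicity the sup over K is replaced by a max over the integer points of K, which gives the same moduli here; K is written as |4x+9y| ≤ 41 and |6x-7y| ≤ 41, inequalities derived from the corners for this illustration. The two bases tried are those of the "successive modulos" and "skewed successive modulos" slides.

```python
from itertools import product
from math import prod

def in_K(p):
    x, y = p
    return abs(4 * x + 9 * y) <= 41 and abs(6 * x - 7 * y) <= 41

# Assumed conflict differences: all integer points of K (including 0).
DS = [p for p in product(range(-8, 9), range(-5, 6)) if in_K(p)]

def dot(c, x):
    return sum(ck * xk for ck, xk in zip(c, x))

def successive_modulos(rows):
    """rho_i = 1 + max{c_i.x | x in DS, c_1.x = ... = c_{i-1}.x = 0}."""
    rho = []
    for k, c in enumerate(rows):
        section = [x for x in DS if all(dot(r, x) == 0 for r in rows[:k])]
        rho.append(1 + max(dot(c, x) for x in section))
    return tuple(rho)

def is_valid(rows, rho):
    """Valid iff no nonzero d in DS is mapped to address 0."""
    return not any(all(dot(c, d) % r == 0 for c, r in zip(rows, rho))
                   for d in DS if d != (0, 0))

for rows in [((1, 0), (0, 1)), ((1, -1), (0, 1))]:
    rho = successive_modulos(rows)
    print(rows, "-> b =", rho, "size", prod(rho), "valid", is_valid(rows, rho))
```

The same wrapping mechanism gives size 45 in the canonical basis and size 32 in the skewed basis, which illustrates how much the choice of basis matters.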


Important practical factors

The set DS can be skewed for 3 reasons:

• Skewed iteration sub-domain with respect to full domain.

• Skewed schedule with respect to iteration domain.

• Skewed access function when reasoning with array indices.

In practice, “following” the schedule -- if it is expressed as a basis -- is not too bad.

But, ad-hoc counter-examples can be built. And schedule basis may be hidden in a “linearized” schedule.


Open or On-Going Questions

• How much do we lose if we restrict to 1D mappings?

• How much do we lose, when restricting to modular mappings, compared to MAXLIVE?

• Mixing both Lefebvre-Feautrier (successive modulos) and Quilleré-Rajopadhye (choice of basis) is often ok (i.e., follow the schedule and wrap…). Can we quickly identify when?

• How costly and how good are the heuristics in practice?

• How to handle more general cases (union of polyhedra for conflicting differences, multiple arrays, etc.)?

• Can this be used as a basis for solving the general problem (i.e., find the schedule with minimal memory requirements)?

• Fully implemented in Cl@K: parameters still in progress…