Alain Darte, Compsys Project: Compilation and Embedded Systems, CNRS, LIP, ENS-Lyon, France

1


Lattice-Based Memory Allocation

WOG’04, April 25th, 2004. Recent Trends in Compiler Construction. Sven Verdoolaege’s PhD Defense.

Joint work with Rob Schreiber (HP Labs) and Gilles Villard (CNRS, LIP).

References: CASES’03, IEEE Transactions on Computers (to appear).

2

Outline

• Introduction:

– The initial context: PICO, HP Labs software tool for compiling high-level programs (e.g., C code) into NPAs (Non Programmable Accelerators). How to store intermediate results?

– Mathematical tools for high-level program transformations.

– An example of communicating pipelined loops.

• Lattice-based memory allocation.

• Examples of previous work limitations.

• Main results and open questions.

3

PiCo (Program In Chip Out)

[Figure: PICO (Program In, Chip Out / Code Out) flow diagram, with blocks labeled Architecture Synthesis, Compiler, VHDL for Processors, and CAD Tools (Logic Synthesis, Physical Design).]

Output “code”:

• synthesizable VHDL

• netlists for FPGA

• VLIW code (interface)

Input code:

• C code

Similar tools: MMAlpha (Inria), Atomium (IMEC), Compaan (Leiden)

Other possible inputs: Recurrence equations, Matlab, Kahn processes

HP Labs: automatic generation of non-programmable accelerators (NPAs).

4

High-Level Program Optimizations

• Program analysis: dependence analysis, lifetime analysis, footprint analysis, array expansion, array renaming, etc.

• Code and loop transformations: tiling, scheduling, nested loop transformations, modulo scheduling, etc.

Well-established mathematical tools and theory: graph algorithms, polyhedral manipulations, Hermite/Smith forms, integer linear programming, Ehrhart polynomials, etc.

BUT

• Memory optimizations:

– optimization of local memory (intra-loop buffer);

– optimization of inter-loop buffers for communicating NPAs.

No suitable mathematical tools so far.

5

Example: DCT-like code.

First NPA:

do br = 0, 63
  do bc = 0, 63
    do r = 0, 7
      A(br, bc, r, …) = …
    enddo
  enddo
enddo

pipelined with

Second NPA:

do br = 0, 63
  do bc = 0, 63
    do c = 0, 7
      … = A(br, bc, …, c)
    enddo
  enddo
enddo

How to schedule the computations?

How to allocate elements of A in local memory so as to reduce its size?

Memory for A:

a) Full array: 256K elements.

b) Optimized size: 112 elements (< 2 blocks), with A(br, bc, r, c) mapped to (r mod 4, 16(br+bc) + 2r + c mod 28).

Huge gap!
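The two sizes quoted for the DCT-like example can be checked with a little arithmetic. The sketch below is only an illustration: the name `dct_address` is hypothetical, and it assumes that "mod 28" applies to the whole sum 16(br+bc) + 2r + c. It counts the distinct addresses produced by the mapping; it does not check validity against a schedule, since none is given on this slide.

```python
def dct_address(br, bc, r, c):
    """Hypothetical helper: 2D modular address of A(br, bc, r, c).

    Assumption: "mod 28" applies to the whole sum 16(br+bc) + 2r + c.
    """
    return (r % 4, (16 * (br + bc) + 2 * r + c) % 28)

# Full array: 64 x 64 blocks of 8 x 8 elements.
full_size = 64 * 64 * 8 * 8

# Modular allocation: product of the two moduli.
reduced_size = 4 * 28

# Distinct addresses actually touched by the mapping.
addresses = {dct_address(br, bc, r, c)
             for br in range(64) for bc in range(64)
             for r in range(8) for c in range(8)}

print(full_size, reduced_size, len(addresses))
```

Under these assumptions the full array has 262144 = 256K elements, while the mapping uses 4 × 28 = 112 cells.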

6

Outline

• Introduction.

• Lattice-based memory allocation:

– Definition of modular allocations.

– Conflicting indices and critical lattices.

• Examples of limitations of previous work.


7

Memory Reduction Problem for Arrays

• Lifetime analysis:

– Schedule of computations ⇒ lifetime for each value (similar to dependence analysis, exact or over-approximated).

• Memory reuse:

– Values simultaneously live should not share the same location (constraints similar to register allocation).

• Restrict to “simple” addressing functions (for code generation):

– canonical linearization, linear mapping in multi-dimensional arrays

+ wrapping with modulo operations (reuse).

All are special cases of modular memory allocations.

Given a scheduled program (i.e., operations are not reordered), or several communicating programs, find the minimal memory size to store intermediate values and an adequate memory mapping.

8

Modular Mappings

• Generalization of (rotating) registers in higher dimensions: the value indexed by i is written at the multi-dimensional position Mi mod b, where b is a positive integral vector and M an integral matrix.

Example: i = (i1, i2) stored at address (2i1+i2 mod 3, i1+i2 mod 6), i.e., b = (3,6), size = 18.

Given a schedule and a lifetime analysis, find a valid allocation (M,b) such that the product of the components of b (memory size) is minimized.

• Generalizes all previous approaches:

– De Greef, Catthoor, De Man (1996-1997): linearizations + 1 modulo

– Lefebvre, Feautrier (1996-1997): successive modulos.

– Wilde, Rajopadhye (1996), Quilleré, Rajopadhye (2000): projections.

– Strout, Carter, Ferrante, Simon (ASPLOS’98): only 1 modulo.

– Thies, Vivien, Sheldon, Amarasinghe (PLDI’01): same.
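As a small illustration of the definition above, a modular mapping (M, b) can be evaluated row by row. This is a minimal sketch; the function names `modular_address` and `memory_size` are hypothetical and only serve the example (2i1+i2 mod 3, i1+i2 mod 6) from this slide.

```python
from math import prod

def modular_address(M, b, i):
    """Return the position M·i mod b, one component per row of M."""
    return tuple(sum(m * x for m, x in zip(row, i)) % bk
                 for row, bk in zip(M, b))

def memory_size(b):
    """Memory size of the allocation: product of the components of b."""
    return prod(b)

# Example from the slide: i = (i1, i2) -> (2*i1 + i2 mod 3, i1 + i2 mod 6).
M = [[2, 1],
     [1, 1]]
b = (3, 6)

print(modular_address(M, b, (1, 1)), memory_size(b))
```

A 1D mapping (a linearization wrapped by one modulo) is the special case where M has a single row.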

9

Our Main Contributions

• We identify the fundamental object to work with:

– The set S of all differences of conflicting indices.

• We show the link with critical lattices:

– Finding the best allocation Mi mod b among ALL possible modular allocations amounts to finding a critical integer lattice for the set S.

• We give guaranteed heuristics to approximate the optimal:

– It explains previous work;

– It gives new (and better) solutions;

– It shows the link with theoretical work on successive minima, basis reduction, Minkowski’s theorems, etc.

[Thies et al., PLDI’01]: There is a need for a technique able “to consider more general storage mappings” and that “would allow variations in the number of array dimensions, while still capturing the directional and modular reuse of the occupancy vector”.

10

Outline

• Introduction.


• Examples of previous work limitations:

– rely on particular linearizations,

– or may wrap along the wrong axis.


11

De Greef, Catthoor, and De Man

• Were the first to identify the need for memory reduction techniques for embedded multimedia applications. Patent (1996) for intra- and inter-array memory reuse.

• Inter-array reuse:

– Geometrical heuristics for packing different arrays in a given memory buffer. Will not be discussed here.

• Intra-array memory reuse:

– Consider each original d-dimensional array and its 2^d d! canonical linearizations. (Example in 2D for an NxM array: look at 8 linearizations: Mi+j, Mi-j, -Mi+j, -Mi-j, i+Nj, i-Nj, -i+Nj, -i-Nj.)

– Compute the maximal address difference D between two simultaneously live values.

– Select the linearization with smallest distance D and wrap the array modulo (D+1).
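The 2^d d! canonical linearizations can be enumerated mechanically. The sketch below is an illustration only (the function name `canonical_linearizations` is hypothetical): for each ordering of the dimensions it computes the usual strides, then negates each stride independently. It returns coefficient vectors, so (M, 1) stands for Mi+j.

```python
from itertools import permutations, product

def canonical_linearizations(sizes):
    """Coefficient vectors of the 2^d d! canonical linearizations.

    sizes[k] is the extent of dimension k. For each ordering of the
    dimensions (outermost first), the stride of a dimension is the product
    of the extents of the dimensions nested inside it; each stride may
    then be negated independently.
    """
    d = len(sizes)
    result = []
    for order in permutations(range(d)):
        strides = [0] * d
        stride = 1
        for dim in reversed(order):          # innermost dimension first
            strides[dim] = stride
            stride *= sizes[dim]
        for signs in product((1, -1), repeat=d):
            result.append(tuple(s * c for s, c in zip(signs, strides)))
    return result

# 2D example for an N x M array with N = 3, M = 5.
for coeffs in canonical_linearizations((3, 5)):
    print(coeffs)
```

For the 3 x 5 array this lists the eight vectors (±5, ±1) and (±1, ±3), i.e., ±Mi±j and ±i±Nj.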

12

De Greef, Catthoor, De Man: Example 1

do i = 1,N

do j = 1,N

a(i,j) = ...

b(i,j) = a(i-1,j)

enddo

enddo

[Figure: iteration domain with axes i and j.]

Column-major order (Fortran-like): i+Nj, maximal distance = N(N-1)+1

Row-major order (C-like): Ni+j, maximal distance = N

Best canonical linearization: Ni+j mod (N+1).

do i = 1,N

do j = 1,N

a(Ni+j mod (N+1)) = ...

b(i,j) = a(Ni+j+1 mod (N+1))

enddo

enddo

Equivalently, since N ≡ -1 mod (N+1):

do i = 1,N

do j = 1,N

a(-i+j mod (N+1)) = ...

b(i,j) = a(-i+j+1 mod (N+1))

enddo

enddo
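The wrapped code above can be checked by simulation. The following sketch is only an illustration under stated assumptions: the row a(0, ·) is treated as an input preloaded into the buffer, and the names `reference`, `wrapped`, `f` and `a0` are hypothetical. It runs the original program with a full array and the transformed program with a buffer of N+1 cells, then compares the values stored in b.

```python
def reference(N, f, a0):
    """Original program, with a stored in a full 2D array."""
    a = {(0, j): a0[j] for j in range(1, N + 1)}
    b = {}
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            a[i, j] = f(i, j)
            b[i, j] = a[i - 1, j]
    return b

def wrapped(N, f, a0):
    """Same program, with a(i,j) stored at address (N*i + j) mod (N+1)."""
    size = N + 1
    buf = [None] * size
    for j in range(1, N + 1):                 # assumption: preload row a(0, .)
        buf[(N * 0 + j) % size] = a0[j]
    b = {}
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            buf[(N * i + j) % size] = f(i, j)          # write a(i, j)
            b[i, j] = buf[(N * i + j + 1) % size]      # read a(i-1, j)
    return b

N = 6
f = lambda i, j: 100 * i + j                 # arbitrary distinct values
a0 = {j: -j for j in range(1, N + 1)}        # arbitrary input row
print(reference(N, f, a0) == wrapped(N, f, a0))
```

The read address uses N(i-1)+j ≡ Ni+j+1 mod (N+1), which is why the buffer of N+1 cells never overwrites a value that is still needed.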

13

De Greef, Catthoor, De Man: Example 2

do i = 1,N
  do j = 1,N
    a(i,j) = ...
    b(i,j) = a(i-1,j)
  enddo
enddo

Same computation, scheduled along t = i+j:

do t = 2,2N /* t = i+j */
  do j = max(1,t-N),min(N,t-1)
    a(t-j,j) = ...
    b(t-j,j) = a(t-j-1,j)
  enddo
enddo

With a(i,j) stored in a(i):

do t = 2,2N /* t = i+j */
  do j = max(1,t-N),min(N,t-1)
    a(t-j) = ...
    b(t-j,j) = a(t-j-1)
  enddo
enddo

[Figure: iteration domain with axes i and j.]

Any canonical linearization leads to a distance Θ(N^2)!

But the allocation i mod N, or even i, is just fine!

How could we have missed this?

14

Lefebvre and Feautrier

• Developed in the context of parallelizing compilers:

– a) Eliminate spurious memory dependences thanks to single assignment form; b) Wrap memory back when possible.

• Inter-array reuse:

– Coloring heuristics on array names (as for register allocation).

• Intra-array memory reuse:

– Idea 1: forget about original arrays, focus on original loop indices.

– Idea 2: wrap successively in each dimension with modulos.

From a computational point of view, use classical techniques based on (rational) linear programming.

15

Lefebvre, Feautrier: Example 1 revisited

do i = 1,N

do j = 1,N

a(i,j) = ...

b(i,j) = a(i-1,j)

enddo

enddo

[Figure: iteration domain with axes i and j.]

Along i, maximal distance = 1 ⇒ i mod 2.

Along j (for a fixed i), maximal distance = N-1 ⇒ j mod N, i.e., j.

Selected allocation (i mod 2, j), with a memory size 2N (note: N+1 in previous solution).

do i = 1,N

do j = 1,N

a(i mod 2, j) = ...

b(i,j) = a(i-1 mod 2, j)

enddo

enddo

16

Lefebvre, Feautrier: Example 2 revisited

do i = 1,N

do j = 1,N

a(i,j) = ...

b(i,j) = a(i-1,j)

enddo

enddo

Along i, maximal distance = N-1 ⇒ i mod N, i.e., i.

Along j (for a fixed i), maximal distance = 0 ⇒ no extra dimension.

Selected allocation: i mod N, i.e., i. (Note: order N^2 in previous solution.)

[Figure: iteration domain with axes i and j.]

do t = 2,2N /* t = i+j */
  do j = max(1,t-N),min(N,t-1)
    a(t-j,j) = ...
    b(t-j,j) = a(t-j-1,j)
  enddo
enddo

do t = 2,2N /* t = i+j */
  do j = max(1,t-N),min(N,t-1)
    a(t-j) = ...
    b(t-j,j) = a(t-j-1)
  enddo
enddo

17

Lefebvre, Feautrier: Example 3

do i = 1,N
  do j = 1,N
    a(i,j) = ...
  enddo
enddo

pipelined 1 clock cycle later with

do i = 1,N
  do j = 1,N
    b(i,j) = a(i,j)+...
  enddo
enddo

[Figure: iteration domain with axes i and j.]

Along i, maximal distance = 1 ⇒ i mod 2.

Along j (for a fixed i), maximal distance = 1 ⇒ j mod 2.

Selected allocation (i mod 2, j mod 2) and size 4. OK.

18

Lefebvre, Feautrier: Example 3 (variant)

do t = 2,2N /* t = i+j */
  do j = max(1,t-N),min(N,t-1)
    a(t-j,j) = ...
  enddo
enddo

pipelined 1 clock cycle later with

do t = 2,2N /* t = i+j */
  do j = max(1,t-N),min(N,t-1)
    b(t-j,j) = a(t-j,j)+...
  enddo
enddo

[Figure: iteration domain with axes i and j.]

Along i, maximal distance = N-1 ⇒ i mod N.

Along j (for a fixed i), maximal distance = 0 ⇒ j mod 1.

Corresponding memory size: N.

Same if starting with j. FAIL!
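The failure on this variant can be reproduced with a few lines of code. The sketch below is an illustration under an explicit assumption: each value a(i,j) is written at one step of the skewed schedule and read one step later, so only consecutive iterations conflict. All function names are hypothetical. It computes the conflicting differences, applies successive modulos in the (i, j) coordinates, and then applies the same mechanism in the (t, j) coordinates with t = i+j.

```python
def schedule_order(N):
    """Iterations (i, j) in the skewed order: t = i + j outer, j inner."""
    return [(t - j, j)
            for t in range(2, 2 * N + 1)
            for j in range(max(1, t - N), min(N, t - 1) + 1)]

def conflict_differences(order):
    """Differences of conflicting indices.

    Assumption: a value is live for one step only, so a value conflicts
    only with the one produced at the next step.
    """
    diffs = set()
    for (i1, j1), (i2, j2) in zip(order, order[1:]):
        diffs.add((i2 - i1, j2 - j1))
        diffs.add((i1 - i2, j1 - j2))
    return diffs

def successive_modulos(diffs):
    """Modulus k = 1 + max |d_k| over differences whose earlier coordinates are 0."""
    n = len(next(iter(diffs)))
    moduli, remaining = [], set(diffs)
    for k in range(n):
        moduli.append(1 + max((abs(d[k]) for d in remaining), default=0))
        remaining = {d for d in remaining if d[k] == 0}
    return moduli

def is_valid(moduli, diffs):
    """Valid iff no (nonzero) conflicting difference is mapped to 0."""
    return not any(all(x % m == 0 for x, m in zip(d, moduli)) for d in diffs)

N = 8
DS = conflict_differences(schedule_order(N))
DS_tj = {(i + j, j) for i, j in DS}            # same differences in (t, j)
original_basis = successive_modulos(DS)        # wrap along i, then j
schedule_basis = successive_modulos(DS_tj)     # wrap along t, then j
print(original_basis, schedule_basis)
```

With N = 8 this gives moduli (8, 1) in the (i, j) coordinates, i.e., size N as stated above, but (2, 2) in the (t, j) coordinates, i.e., size 4: the mechanism is the same, only the basis differs.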

19

Outline

• Introduction.


• Examples of previous work limitations.

• Main results and open questions:

No way to explain quickly all details, even to experts in lattice theory and reduction theory...

See CASES’03 proceedings, research report (http://perso.ens-lyon.fr/alain.darte), or IEEE TC journal version (to appear).

But I can try to:

» Explain basic concepts of critical lattice and modular allocations.

» Illustrate different mechanisms.

» State results.

20

There was a Need for a Framework for Memory Reduction Based on Modular Allocations

• Lower bounds:

– Given a lifetime analysis, can we give a lower bound for the best achievable memory size? What is the best modular memory allocation?

• Upper bounds:

– Can we find mechanisms leading to allocations whose corresponding memory size is not arbitrarily bad compared to the lower bound (guaranteed heuristics)?

• Robustness:

– We need a framework that can possibly capture parameters, that does not depend on the basis in which the problem is described, etc. ⇒ Geometrical model.

• Computability:

– We need to make sure the mechanisms are constructive and lead to heuristics (or algorithms) that can be implemented.

21

Set of Conflicting Index Differences

• Index description:

– Choose an index description for values that are going to share a given array (the allocation will be linear with respect to these indices). Typically, loop indices, array indices, etc.

• Set of conflicting index differences:

– Build the set CS of pairs of conflicting (i.e., simultaneously live) indices (i,j), and the set DS of differences (i-j).

We want: ∀ (i,j) ∈ CS, i ≠ j ⇒ Mi mod b ≠ Mj mod b, or equivalently

∀ d ∈ DS, d ≠ 0 ⇒ Md mod b ≠ 0, or equivalently

Md mod b = 0, d ∈ DS ⇒ d = 0
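The validity condition can be tested directly once lifetimes are known. The sketch below is an illustration, not the authors' implementation: lifetimes are modeled as closed intervals (an assumption), the function names are hypothetical, and the data is Example 1 (a(i,j) written at one step and read N steps later, with the last row over-approximated in the same way).

```python
from itertools import combinations

def conflict_differences(lifetimes):
    """Build DS from a map index -> closed lifetime interval (birth, death).

    Two indices conflict when their intervals overlap.
    """
    ds = set()
    for (p, (bp, dp)), (q, (bq, dq)) in combinations(lifetimes.items(), 2):
        if bp <= dq and bq <= dp:
            d = tuple(x - y for x, y in zip(p, q))
            ds.add(d)
            ds.add(tuple(-x for x in d))
    return ds

def is_valid(M, b, ds):
    """(M, b) is valid iff Md mod b = 0 with d in DS implies d = 0."""
    for d in ds:
        image = [sum(m * x for m, x in zip(row, d)) % bk
                 for row, bk in zip(M, b)]
        if any(d) and not any(image):
            return False
    return True

N = 6
# Example 1: a(i, j) is written at step N*(i-1) + (j-1), read N steps later.
lifetimes = {(i, j): (N * (i - 1) + (j - 1), N * (i - 1) + (j - 1) + N)
             for i in range(1, N + 1) for j in range(1, N + 1)}
DS = conflict_differences(lifetimes)

print(is_valid([[1, 0], [0, 1]], (2, N), DS))   # (i mod 2, j), size 2N
print(is_valid([[N, 1]], (N + 1,), DS))         # Ni+j mod (N+1), size N+1
print(is_valid([[N, 1]], (N,), DS))             # Ni+j mod N, too small
```

Under this lifetime model, the first two mappings are valid and the third is rejected because the conflicting difference (1, 0) is mapped to 0.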

22

Admissible and Critical Lattices

• The kernel of (M,b):

– The set Λ = {i | Mi mod b = 0} is a full-dimensional lattice.

– (M,b) is valid iff Λ ∩ DS = {0}, i.e., Λ is an admissible lattice for DS.

• Conversely:

– If A is a basis for Λ, an admissible integer lattice for DS, compute the Smith form A = Q1 S Q2 with Q1 and Q2 unimodular, S = diag(b).

– The mapping (M,b) where M is the inverse of Q1 has the kernel Λ, thus is a valid allocation with memory size det(S) = det(Λ).

The modular allocation with smallest memory size corresponds to a critical integer lattice for DS, i.e., an admissible integer lattice for DS with smallest determinant.
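The link between a mapping and its kernel lattice can be illustrated on the earlier example (2i1+i2 mod 3, i1+i2 mod 6). The sketch below is restricted to a simplifying assumption that is not required in general: M is a 2x2 unimodular matrix, so the kernel is generated by the columns of M^{-1}·diag(b) and no Smith form computation is needed. The function names are hypothetical.

```python
def address(M, b, i):
    """M·i mod b, one component per row of M."""
    return tuple(sum(m * x for m, x in zip(row, i)) % bk
                 for row, bk in zip(M, b))

def kernel_basis_2d(M, b):
    """Basis of the kernel {i | M·i mod b = 0} for a 2x2 unimodular M.

    Assumption: det(M) = ±1, so M^{-1} is integral and the kernel is
    generated by the columns of M^{-1}·diag(b).
    """
    (p, q), (r, s) = M
    det = p * s - q * r
    assert det in (1, -1), "this sketch only handles unimodular M"
    inv = [[s * det, -q * det],
           [-r * det, p * det]]               # M^{-1} = adj(M) / det
    return [(inv[0][k] * b[k], inv[1][k] * b[k]) for k in range(2)]

M = [[2, 1], [1, 1]]
b = (3, 6)
basis = kernel_basis_2d(M, b)
(x1, y1), (x2, y2) = basis
determinant = abs(x1 * y2 - x2 * y1)

# Density check: kernel points inside the box [0, 18) x [0, 18).
kernel_points = [(x, y) for x in range(18) for y in range(18)
                 if address(M, b, (x, y)) == (0, 0)]
print(basis, determinant, len(kernel_points))
```

The determinant of the kernel lattice equals the memory size 3 × 6 = 18, and one point out of 18 in the box belongs to the kernel.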

23

Modular Mappings: Toy Example

Corners: (-1,5), (1,-5), (8,1), (-8,-1)

Critical lattice: basis (4,3), (8,0)

Corresponding allocation (3i-4j mod 24).

24


Corners: (-1,5), (1,-5), (8,1), (-8,-1)

Bounding Box:

(i mod 9, j mod 6) Size = 54

25

Successive modulos:

(i mod 9, j mod 5) Size = 45


Corners: (-1,5), (1,-5), (8,1), (-8,-1)

26

Skewed Bounding Box:

(i-j mod 8, j mod 6) Size = 48


Corners: (-1,5), (1,-5), (8,1), (-8,-1)

27

Skewed successive modulos:



Corners: (-1,5), (1,-5), (8,1), (-8,-1)

28

Better allocation:



Corners: (-1,5), (1,-5), (8,1), (-8,-1)

29

Critical lattice: basis (4,3), (8,0)

Best allocation (3i-4j mod 24).


Corners: (-1,5), (1,-5), (8,1), (-8,-1)
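The allocations listed on the toy example can be checked by enumeration. The sketch below is an illustration under two assumptions: the set of conflicting differences is taken to be all integer points of the parallelogram K with the four corners above, and K is described by the inequalities |4x+9y| ≤ 41 and |6x-7y| ≤ 41, which were derived from those corners. The names are hypothetical.

```python
from math import prod

def in_K(x, y):
    """Parallelogram with corners (-1,5), (1,-5), (8,1), (-8,-1)."""
    return abs(4 * x + 9 * y) <= 41 and abs(6 * x - 7 * y) <= 41

# Nonzero integer points of K (K fits in the box |x| <= 8, |y| <= 5).
POINTS = [(x, y) for x in range(-8, 9) for y in range(-5, 6)
          if (x, y) != (0, 0) and in_K(x, y)]

def is_valid(M, b):
    """Valid iff no nonzero integer point d of K satisfies M·d mod b = 0."""
    for d in POINTS:
        if all(sum(m * x for m, x in zip(row, d)) % bk == 0
               for row, bk in zip(M, b)):
            return False
    return True

CANDIDATES = {
    "bounding box":        ([[1, 0], [0, 1]], (9, 6)),
    "successive modulos":  ([[1, 0], [0, 1]], (9, 5)),
    "skewed bounding box": ([[1, -1], [0, 1]], (8, 6)),
    "critical lattice":    ([[3, -4]], (24,)),
}

for name, (M, b) in CANDIDATES.items():
    print(name, prod(b), is_valid(M, b))
```

All four mappings pass the check, with sizes 54, 45, 48 and 24. Under the same assumptions, 3i-4j mod 23 is rejected, because the corner (1,-5) is mapped to 0.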

30

Results for 0-Symmetric Convex Bodies

• We work with a 0-symmetric polytope K such that DS ⊆ K.

(actually, we assume that the vector spaces generated by the points in K and the integer points in K are equal: K is full-dimensional)

• Lower bound in terms of volume: Vol(K)/2^n

• Optimal solution found by optimized enumeration + ILP.

• Heuristics exist with memory size c_n Vol(K), where c_n depends on the dimension n only ⇒ guaranteed heuristics.

• One heuristic uses exactly Lefebvre-Feautrier mechanism but in a well-chosen basis. Always equivalent (i.e., with same memory size) to a particular linearization (= 1D mapping).

• Another heuristic (Rogers’ principle) works even for arbitrary sets, but equivalent linearization not clear.

• In practice: follow the schedule, when possible...

Reference: Gruber and Lekkerkerker, Geometry of Numbers.

31

Remarks on critical lattices

• a) Hard to find the critical lattice, starting from 3D, even for simple bodies. b) critical integer lattice ≈ critical lattice for large bodies. ⇒ Hard to find the optimal, heuristics needed.

• Lower bound in terms of volume: Δ(K) ≥ Vol(K)/2^n

– If S-S ⊆ K, then all elements in S are mapped to different locations: Δ(K) ≥ Card(S).

– Minkowski’s first theorem: if Λ is a lattice and K is 0-symmetric with Vol(K) ≥ 2^n det(Λ), then K contains a nonzero lattice point of Λ.

• Gauge function: F(x) = inf{λ>0 | x in λK} is a distance function.

• Successive minima: λi(K) = inf{λ>0 | dim(Vect(λK ∩ Z^n)) ≥ i}.

– Minkowski’s second theorem:

(2^n/n!) det(Λ) ≤ λ1(K) … λn(K) Vol(K) ≤ 2^n det(Λ)

32

Looking for the optimal solution

• Generate all possible lattices of a given determinant. Avoid duplicates: each lattice is uniquely determined by its Hermite form (triangular matrix).

(Remark: not clear we could do the same for non-equivalent mappings without reasoning with the corresponding lattices.)

• Check that the lattice is admissible for K, either by ILP, or by enumeration if integer points in K can be enumerated.

• For the DCT example:

– in 4D, optimal = 112, there are 86,416,644 lattices to check, it takes roughly 2 days!

– rewritten in 3D, optimal = 112, there are 941,901 lattices to check, it takes roughly 30 minutes.

Feasible only for small sets K and small dimensions.
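The enumeration described on this slide can be sketched in 2D on the toy example. This is an illustration only, not the optimized enumeration mentioned above: it assumes that K is the parallelogram with corners (-1,5), (1,-5), (8,1), (-8,-1), described by |4x+9y| ≤ 41 and |6x-7y| ≤ 41, and that admissibility is checked by enumerating the integer points of K. Each lattice is represented by a hypothetical triple (a, b, c), standing for the Hermite-form basis (a, b), (0, c) with 0 ≤ b < c.

```python
def in_K(x, y):
    """Parallelogram with corners (-1,5), (1,-5), (8,1), (-8,-1)."""
    return abs(4 * x + 9 * y) <= 41 and abs(6 * x - 7 * y) <= 41

POINTS = [(x, y) for x in range(-8, 9) for y in range(-5, 6)
          if (x, y) != (0, 0) and in_K(x, y)]

def hermite_lattices(det):
    """Yield (a, b, c): basis rows (a, b), (0, c), a*c == det, 0 <= b < c.

    Each integer lattice of the given determinant appears exactly once.
    """
    for a in range(1, det + 1):
        if det % a == 0:
            c = det // a
            for b in range(c):
                yield a, b, c

def contains(lattice, point):
    """Is point = k*(a, b) + m*(0, c) for some integers k, m?"""
    a, b, c = lattice
    x, y = point
    if x % a != 0:
        return False
    k = x // a
    return (y - k * b) % c == 0

def is_admissible(lattice):
    return not any(contains(lattice, p) for p in POINTS)

def critical_determinant(max_det=100):
    """Smallest determinant with an admissible lattice, and those lattices."""
    for det in range(1, max_det + 1):
        found = [l for l in hermite_lattices(det) if is_admissible(l)]
        if found:
            return det, found
    return None

det, found = critical_determinant()
print(det, found)
```

On this small instance the search stops at determinant 24, and the lattice with basis (4,3), (8,0) appears in Hermite form as (4,3), (0,6). The cost grows quickly with the dimension and the size of K, as the DCT figures above indicate.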

33

Rogers’ heuristic adapted

• Choose n positive integers ρi such that ρi is a multiple of ρ(i+1) and dim(Li) ≤ i-1, where Li = Vect(K/ρi ∩ Z^n).

• Choose a basis (a1, …, an) of Z^n s.t. Li ⊆ Vect(a1, …, a(i-1)).

• Define Λ the lattice generated by the vectors ρi ai.

det(Λ) ≤ n! Vol(K)

34

Heuristic based on K (i.e., lattice)

• Choose n linearly independent integer vectors (a1, …, an)

• Compute Fi(ai) = inf{ F(y) | y in ai + Vect(a1, … , ai-1)}.

• Choose n integers ρi such that ρi Fi(ai) > 1.

• Define Λ the lattice generated by the vectors ρi ai.

det(Λ) ≤ (n!)^2 Vol(K) if Fi(ai) ≤ 1 for all i

35

Heuristic based on K* (i.e., mapping)

• K* = dual (or polar reciprocal) of K = {y | y.x ≤ 1 for all x in K}

• K** = K, F* related to F, Vol(K) related to Vol(K*), successive minima related, etc.

• Choose n linearly independent integer vectors (c1, …, cn)

• Compute F*i(ci) = sup{ci.x | x in K, c1.x = … = ci-1.x = 0}

• Choose n integers ρi such that ρi >F*i(ci).

• Define the mapping (M,b) with the ci as rows of M and b=ρ.

det(Λ) ≤ (n!)^2 Vol(K) if F*i(ci) ≥ 1 for all i

Dual of the previous heuristic. Exactly Lefebvre-Feautrier in a well-chosen basis.
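The steps of this heuristic can be sketched in 2D on the toy polygon. This is an illustration under assumptions, not the general algorithm: K is the parallelogram with corners (-1,5), (1,-5), (8,1), (-8,-1), its gauge is written from the inequalities |4x+9y| ≤ 41 and |6x-7y| ≤ 41 derived from those corners, ρi is chosen as the smallest integer above F*i(ci), and all names are hypothetical.

```python
from fractions import Fraction
from math import floor

CORNERS = [(-1, 5), (1, -5), (8, 1), (-8, -1)]

def gauge(x, y):
    """F(x) = inf{λ > 0 | x in λK} for the parallelogram K."""
    return Fraction(max(abs(4 * x + 9 * y), abs(6 * x - 7 * y)), 41)

def successive_moduli_2d(c1, c2):
    """Moduli (ρ1, ρ2) with ρi > F*i(ci), for two independent rows c1, c2."""
    # F*1(c1): maximum of c1·x over K, reached at a corner.
    f1 = max(c1[0] * x + c1[1] * y for x, y in CORNERS)
    # F*2(c2): maximum of c2·x over the segment {x in K | c1·x = 0},
    # whose direction is u = (-c1[1], c1[0]) and whose end point is u / F(u).
    u = (-c1[1], c1[0])
    f2 = abs(c2[0] * u[0] + c2[1] * u[1]) / gauge(*u)
    return floor(f1) + 1, floor(f2) + 1

def is_valid(M, b):
    """Check the mapping on all nonzero integer points of K."""
    for x in range(-8, 9):
        for y in range(-5, 6):
            if (x, y) != (0, 0) and gauge(x, y) <= 1:
                if all((r[0] * x + r[1] * y) % bk == 0 for r, bk in zip(M, b)):
                    return False
    return True

canonical = successive_moduli_2d((1, 0), (0, 1))
skewed = successive_moduli_2d((1, -1), (0, 1))
print(canonical, skewed)
```

With the canonical rows this gives the moduli (9, 5), i.e., the successive modulos (i mod 9, j mod 5) of size 45. With the rows (1,-1), (0,1) it gives (8, 4), i.e., (i-j mod 8, j mod 4) of size 32, which shows how much the choice of basis matters.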

36

Important practical factors

The set DS can be skewed for 3 reasons:

• Skewed iteration sub-domain with respect to full domain.

• Skewed schedule with respect to iteration domain.

• Skewed access function when reasoning with array indices.

In practice, “following” the schedule -- if it is expressed as a basis -- is not too bad.

But, ad-hoc counter-examples can be built. And schedule basis may be hidden in a “linearized” schedule.

37

Open or On-Going Questions

• How much do we lose if we restrict to 1D mappings?

• How much do we lose, when restricting to modular mappings, compared to MAXLIVE?

• Mixing both Lefebvre-Feautrier (successive modulos) and Quilleré-Rajopadhye (choice of basis) is often ok (i.e., follow the schedule and wrap…). Can we quickly identify when?

• How costly and how good are the heuristics in practice?

• How to handle more general cases (union of polyhedra for conflicting differences, multiple arrays, etc.)?

• Can this be used as a basis for solving the general problem (i.e., find the schedule with minimal memory requirements)?

• Fully implemented in Cl@K: parameters still in progress…
