
Memory Reuse Analysis in the Polyhedral Model


D. Wilde ([email protected])
E. E. Dept., B. Y. U., Provo, UT

S. Rajopadhye ([email protected])
IRISA, Rennes, France

Abstract

In the context of developing a compiler for Alpha, a functional data-parallel language based on systems of affine recurrence equations (SAREs), we address the problem of transforming scheduled single-assignment code to multiple-assignment code. We show how the polyhedral model allows us to statically compute the lifetimes of program variables, and thus enables us to derive necessary and sufficient conditions for reusing memory.

1. Introduction

The methodology of automatic systolic array synthesis from Systems of Affine Recurrence Equations (SAREs) has a close bearing on parallelizing compilers and on efficient implementation of functional languages. To study this relationship, we are currently developing a compiler for Alpha [9], a functional, data-parallel language based on SAREs defined over polyhedral index domains. The language semantics directly lead to sequential code based on demand-driven evaluation. However, the resulting context switches can be avoided if the program is transformed into (sequential) imperative single-assignment code (SAC) [13]. This is currently done semi-automatically (the user chooses the transformation, and the system generates the final code automatically) in the framework of a transformation system currently under development.

In the imperative SAC, memory size is of the same order as the iteration space (for example, matrix multiplication will take $\Theta(N^3)$ memory). This is clearly unacceptable, and can be ameliorated by producing multiple-assignment code (MAC). The idea is that memory is reused by allocating multiple index points to the same memory location. This necessitates a lifetime analysis, and in this paper we develop the constraints that the memory allocation must satisfy. They are based on information from the program and the schedule chosen.

The paper is organized as follows. In Section 2 we review the Alpha language, transformation system and compiler, and point out its memory inefficiency. We then present the memory reuse analysis (Section 3), describe a detailed example (Section 4), discuss related work (Section 5) and conclude (Section 6).

2. The Alpha Language, System and Compiler

Alpha is a functional data-parallel language based on SAREs. Variables are statically declared over a polyhedral domain, and represent multidimensional arrays whose shapes are polyhedra. For example, a matrix multiplication program in Alpha is almost identical (not shown due to space constraints) to the textbook formula $C_{ij} = \sum_{k=1}^{N} A_{ik} B_{kj}$, except that the variable domains are declared, and for some minor syntactic sugar.

system matrix_mult : {N | 1<=N}                 -- N is a "size" parameter of the system
       (A : {i,j | 1<=i,j<=N} of integer;       -- two input variables whose domains
        B : {i,j | 1<=i,j<=N} of integer)       -- are NxN squares
returns (C : {i,j | 1<=i,j<=N} of integer);     -- output variable(s)
var                                             -- local variables
  S : {i,j,k | 1<=i,j<=N; 0<=k<=N} of integer;  -- domain of S inferred automatically
let    -- a program consists of one equation for each output and local variable
  S[i,j,k] =           -- equation for S (automatically inferred by serialization)
    case               -- a case statement
      {| k=0}  : 0[];  -- a case branch (called a restriction)
      {| 1<=k} : S[i,j,k-1] + A[i,k] * B[k,j];
    esac;
  C[i,j] = S[i,j,N];   -- equation order is irrelevant
tel;

Figure 1: A serialized Alpha program for matrix multiplication

The summation (written as a reduce in Alpha) allows a high-level specification, but is not practical in terms of implementation. We therefore apply a transformation to serialize it by specifying a temporary variable (say S) to accumulate the partial sums and a "direction" of accumulation (say, in the increasing order of k). The resulting Alpha program is shown in Fig. 1, and illustrates the main syntactic constructs of the language.

2.1. Change of Basis

One of the most important transformations in the Alpha system is the change of basis (COB). It is similar to unimodular loop transformations and includes array alignment, data distribution, etc. in a unified framework. The intuition behind it is as follows. Since an Alpha variable can be viewed as a multidimensional array defined over a polyhedral domain, we should be able to change the "shape" of its domain and construct an equivalent program. When a variable V is transformed, the system must determine: (i) its new domain, (ii) the new case structure of its equation, (iii) the new dependencies for the uses of all variables in the equation for V, and (iv) the new dependencies for uses of V (in all other equations). All of this is done automatically using a polyhedral library [18], and relies on the fact that Alpha is founded on well-defined closure properties of domains, dependencies and transformations. The user specifies the variable to be transformed, and an affine function that admits an integral left inverse for all points in the domain of the transformed variable (the system reports an error if the function is not so invertible, or else finds the inverse automatically). Note that any affine function respecting this constraint can be a valid COB. In practice, and particularly in order to generate imperative code, the COB must respect some additional constraints.

2.2. Compiling Alpha to Imperative Code

The Alpha compiler is semi-automatic and transformational. Currently, the user specifies the COBs to be applied to each variable (the choice of COB is based on a number of analyses as discussed below), states which of the new indices represent time, and the system generates code as follows (see [13] for details). We assume that the COBs are chosen such that certain indices (in a specific order) denote time, and the others (if any) are guaranteed to carry no dependencies.
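To make the left-inverse requirement of Section 2.1 concrete, the following minimal sketch (a hypothetical illustration, not code from the Alpha system) checks that a candidate COB, here the skew (k,i,j) -> (k+i,i,j), admits an integral left inverse, so that every transformed point maps back to a unique original point:

#include <stdio.h>

/* T is the skew (k,i,j) -> (k+i,i,j); L is a candidate integral
 * left inverse. Both matrices are illustrative examples. */
static const int T[3][3] = {{1,1,0},{0,1,0},{0,0,1}};
static const int L[3][3] = {{1,-1,0},{0,1,0},{0,0,1}};

int main(void)
{
    /* Verify LT = Id, the integral left-inverse condition. */
    for (int r = 0; r < 3; r++)
        for (int c = 0; c < 3; c++) {
            int s = 0;
            for (int q = 0; q < 3; q++)
                s += L[r][q] * T[q][c];
            if (s != (r == c)) {
                printf("L is not a left inverse of T\n");
                return 1;
            }
        }
    printf("LT = Id: the COB is invertible on all integer points\n");
    return 0;
}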

#include <math.h>
double atof();
int atoi();
#define min(x,y) ((x)<(y)?(x):(y))
#define max(x,y) ((x)>(y)?(x):(y))
#define INFINITY 0x7fffffff
/* parameters */
#define N 10
void matrix_mult (_A,_B,_C)
int _A[100];
#define A(i,j) _A[(i)+10*(j)-11]
int _B[100];
#define B(i,j) _B[(i)+10*(j)-11]
int _C[100];
#define C(i,j) _C[(i)+10*(j)-11]

(a) The preamble

{ int _S[1100];
#define S(k,i,j) _S[(k)+11*(i)+110*(j)-121]
  int k, j, i;
  /* k:=0 */
  for (i=1; i<=N; i++)
  { for (j=1; j<=N; j++)
    { S(0,i,j) = 0; } }
  for (k=1; k<=N; k++)
  { forall (i=1; i<=N; i++)
    { forall (j=1; j<=N; j++)
      { S(k,i,j) = S(k-1,i,j) + A(i,k) * B(k,j); } } }
  for (i=1; i<=N; i++)
  { for (j=1; j<=N; j++)
    { C(i,j) = S(N,i,j); } }
}

(b) The main code

Figure 2: The C code generated for the matrix multiplication example

From the declaration of each variable, the compiler creates a special preamble that allocates memory for the (bounding boxes of the) domains of all variables (for instance, the macro A(i,j) in Fig. 2.a linearizes the 10x10 bounding box of A onto the array _A). Then, code is generated to visit each point of the union of the domains of all the variables, in a specified order among indices (using a for loop for the indices specified as temporal, and forall loops for the others). Each visit consists of updating the corresponding memory location using the RHS of the equation. The for loops are produced by first "separating" the domains according to their time indices [7], and "sorting" the equations so that whenever an equation group textually precedes another, it is also temporally before the other. This transforms the code into a form called the imperative normal form, and the final code is generated by a special pretty printer (see Fig. 2 for the matrix multiplication example, for N = 10).

As mentioned above, the choice of the COBs is currently the subject of intense research. Though beyond the scope of the paper, we mention some important issues.

Scheduling: There has been much research on the SARE scheduling problem [15, 6], formulated as follows. For each variable V in a SARE, determine a function $t_V(z)$ that gives us the time instant, represented by a k-dimensional time vector, at which $V[z]$ can be computed. The function must respect causality: whenever $V[z]$ depends (directly or indirectly) on $W[z']$, $t_W(z') \prec t_V(z)$ (a brute-force check of this condition for our example is sketched below, after the alignment discussion). Typically, the class of schedules one uses is variable-dependent affine schedules of one or more dimensions (similar to affine-by-statement schedules, but with a subtle difference: Alpha is a functional language, and there is no notion of statement; each variable occurs on the lhs of only one equation). This problem has now received a satisfactory solution, and tools exist for finding asymptotically optimal schedules. Each dimension of such a schedule is specified by a vector (plus a scalar) and provides one row of the COB transformation.

Alignment: Among the variables A, B, C and S of Fig. 1, there are actually nine independent indices; it was mere coincidence that we used only three names, i, j and k. In order to generate code, however, we must place all variables in a common index space, and this is specified by means of alignment. This is similar to HPF's distribution directives, but in Alpha, the functions are affine.
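As promised above, causality can be checked by brute force for the program of Fig. 1 under the one-dimensional schedule $t_S(k,i,j) = k$, $t_A = t_B = 0$ (the maximally parallel schedule used in Section 4). This fixed-size sketch is for intuition only; the scheduling tools cited above establish the condition symbolically, for all N:

#include <stdio.h>
#define N 10

int main(void)
{
    /* S[k,i,j] reads S[k-1,i,j], A[i,k] and B[k,j]; causality demands
     * that each producer's time be strictly before t_S(k,i,j) = k. */
    for (int k = 1; k <= N; k++)
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++) {
                int t = k;         /* t_S(k,i,j) */
                int t_dep = k - 1; /* t_S(k-1,i,j); t_A = t_B = 0 */
                if (!(t_dep < t && 0 < t)) {
                    printf("causality violated at (%d,%d,%d)\n", k, i, j);
                    return 1;
                }
            }
    printf("schedule respects causality on the whole domain\n");
    return 0;
}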

For parallel code, the variable domains are also mapped to processors, by allocation functions. There is a close interplay between the schedule, the alignment and the allocation functions, and the choice of one may affect the others. The schedule achieves a partial alignment (the time indices are common to all variables). Thus each of these three functions gives us partial information about the COB to apply to each variable. Even after these functions are chosen, there may be some degree of freedom.

2.3. Memory Efficiency

The main drawback of the above approach is that the code allocates storage for the bounding box of the declared domains of program variables, and this is memory-inefficient. There are two sources of inefficiency. First, the bounding box is wasteful for non-rectangular domains; this loss is by a constant factor at most (assuming that the number of dimensions is a constant), and moreover, heuristics to reduce the wastage are available [20]. The more serious problem is that because Alpha is a functional language, the declared domains of the variables yield single-assignment code (for example, the matrix multiplication program will require $O(N^3)$ memory).

3. Memory Reuse

We improve the memory efficiency by generating multiple-assignment code (MAC). For each ($n_V$-dimensional) variable V with domain $D_V$, we define a memory allocation function $\mathrm{Mem}_V(z)$ which assigns a memory address to each point $z \in D_V \subseteq \mathbb{Z}^{n_V}$. We consider functions that are (multi)projections, i.e., $\mathrm{Mem}_V(z) = \Pi_V z$, where $\Pi_V$ is an $(n_V - m_V) \times n_V$ matrix. The total memory required will correspond to an $(n_V - m_V)$-dimensional array (an $m_V$-dimensional subspace of $D_V$ is mapped to the same memory location). There are three reasons why we choose this class of functions:

- They are surjective (so many points are allocated to the same address), and correspond to what imperative programmers often use in practice.
- They allow us to formulate and resolve the memory reuse analysis compactly.
- To generate MAC from the previous code, we systematically change every (read as well as write) access, say $V[Mz + m]$, to $V[\Pi_V(Mz + m)]$. We also change the preamble to allocate memory only for the projections of domains. No other change is needed!

Suppose that, for each variable, a k-dimensional affine schedule is given. Hence, the "time" at which $Y[z']$ is computed is the k-dimensional vector $\tau_Y z' + \alpha_Y$, where $\tau_Y$ is a $k \times n_Y$ matrix. Similarly, $\tau_X, \alpha_X$ specifies the k-dimensional schedule for X. Note that these are parallel schedules (an $(n_Y - k)$-dimensional subspace of Y and an $(n_X - k)$-dimensional subspace of X are scheduled simultaneously).

$\Pi_Y$ is characterized by its null space: two points $z'_1, z'_2 \in D_Y$ are mapped to the same memory location iff $(z'_1 - z'_2) \in \mathrm{Null}(\Pi_Y)$. Let $\beta_1, \ldots, \beta_{m_Y}$ be a basis for $\mathrm{Null}(\Pi_Y)$. Note that $\beta_i$ cannot be in the null space of $\tau_Y$, otherwise two points will be written to the same memory location at the same time, leading to a write conflict. In other words, the matrix $\left[\begin{smallmatrix} \tau_Y \\ \Pi_Y \end{smallmatrix}\right]$ must be of full column rank, and hence $k + (n_Y - m_Y) \ge n_Y$. Thus $m_Y \le k$, and no more than k dimensions of the domain of Y can be "removed" (there are examples where one cannot achieve even this). Without loss of generality, let us choose the signs of the $\beta_i$'s such that $\tau_Y \beta_i$ is lexicographically strictly positive: $\tau_Y \beta_i \succ 0$. It follows that for all $z' \in D_Y$, the write into $\mathrm{Mem}_Y(z')$ immediately following that of $Y[z']$ must be one of the $z' + \beta_i$.
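As a preview of the example worked out in Section 4, the following sketch (hypothetical code; the linearization of the footprint is our own choice for illustration) shows the multiprojection $\mathrm{Mem}_S(k,i,j) = (i,j)$ for the variable S of Fig. 1. The basis vector $\beta = (1,0,0)$ spans $\mathrm{Null}(\Pi_S)$, so points differing only in k share an address:

#include <stdio.h>
#define N 10

/* Mem_S(k,i,j) = (i,j): the k dimension lies in Null(Pi_S), so the
 * O(N^3) domain of S is folded onto an O(N^2) array. */
static int mem_S(int k, int i, int j)
{
    (void)k;                      /* projected out: beta = (1,0,0) */
    return (i - 1) + N * (j - 1); /* linearize the (i,j) footprint */
}

int main(void)
{
    printf("Mem_S(3,2,5) = %d\n", mem_S(3, 2, 5));
    printf("Mem_S(4,2,5) = %d (same cell: the two points differ by beta)\n",
           mem_S(4, 2, 5));
    return 0;
}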

Now, for our scheme to work, we must ensure that no value is overwritten before it has been used by all the computations that need it.

Definition 1. We say that a memory allocation function $\mathrm{Mem}_Y(z')$ is valid iff, for any $Y[z']$ written into $\mathrm{Mem}_Y(z')$, the next write into the same memory location occurs after all uses of $Y[z']$ have been executed.

This necessitates a lifetime analysis of the program which, if performed naively, will need enumeration of all the points in the domains of each variable. Our main result is the formulation of necessary and sufficient conditions for this, expressed concisely using the polyhedral model.

3.1. Usage Table

To perform the lifetime analysis, we need information regarding the use of every instance of every variable. Dependencies and usages are dual views of the same information: a dependency is "consumer-centric" (it tells whose result a given point needs), while usage is "producer-centric" (who needs a given point's results). Dependency information is present directly in the program, and usage has to be deduced.

{k,i,j | 1<=k,i,j<=N} : S[k,i,j] -> S[k-1,i,j]
{k,i,j | 1<=k,i,j<=N} : S[k,i,j] -> A[i,k]
{k,i,j | 1<=k,i,j<=N} : S[k,i,j] -> B[k,j]
{i,j | 1<=i,j<=N} : C[i,j] -> S[N,i,j]

(a) The Dependency Table

{k,i,j | 0<=k<=N-1; 1<=i,j<=N} : S[k,i,j] => S[k+1,i,j]
{i,j | 1<=i,j<=N} : A[i,j] => {k,i,j | 1<=k<=N} : S[j,i,k]
{i,j | 1<=i,j<=N} : B[i,j] => {k,i,j | 1<=k<=N} : S[i,k,j]
{k,i,j | k=N; 1<=i,j<=N} : S[k,i,j] => C[i,j]

(b) The Usage Table

Figure 3: Information used for static analysis of the matrix multiplication program

The dependency table produced by the Alpha system for the matrix multiplication program is shown in Fig. 3.a. Each entry has the form $D : X[z] \to Y[Mz + m]$, and is read as, "for all z in D, X at z depends on Y at $Mz + m$". The domain D is a subset of the domain of declaration of X, and is deduced from the context where the dependency occurs.

Next, the system determines the usage table (see Fig. 3.b), which has one entry for each dependency. Each entry has one of two forms. The simple form is $D' : Y[z'] \Rightarrow X[M_1 z' + m']$, read as, "for all $z' \in D'$, $Y[z']$ is used for computing X at $M_1 z' + m'$". The more general form is $D' : Y[z'] \Rightarrow D''(z') : X[M_1 z' + M_2 r + m']$, read as "for all $z' \in D'$, $Y[z']$ is used for computing X at the set of points $M_1 z' + M_2 r + m'$, where r belongs to the polyhedron $D''(z')$". It arises when the original dependency is surjective. In Fig. 3.b, the first and fourth entries are the simple cases and the other two are the general ones.
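This duality can be checked by brute force for a fixed size. The sketch below (a naive enumeration, not the system's symbolic method) inverts the dependency S[k,i,j] -> A[i,k] by scanning the domain of S, and recovers the second usage entry of Fig. 3.b: A[a,b] is used by the N points S[b,a,*]:

#include <stdio.h>
#define N 4

int main(void)
{
    int a = 2, b = 3; /* an arbitrary point of A's domain */
    /* Scan the dependency {k,i,j | 1<=k,i,j<=N} : S[k,i,j] -> A[i,k]
     * and list every consumer of A[a,b]. The polyhedral analysis
     * derives the same set symbolically, as A[i,j] => S[j,i,k]. */
    for (int k = 1; k <= N; k++)
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++)
                if (i == a && k == b)
                    printf("A[%d,%d] is used by S[%d,%d,%d]\n",
                           a, b, k, i, j);
    return 0;
}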

For a dependency table entry $D : X[z] \to Y[Mz]$, the corresponding usage table entry is determined as follows (details may be found in [19]).

- $D' \subseteq D_Y$ is the intersection of $D_Y$ with the image $M(D)$ of D by the dependency M. $D'$ denotes the producers of the Y values that are used by this dependency. Note that in general, $M(D)$ is not convex, so the usage table can be determined automatically only when the dependencies preserve a dense image (called e-unimodular dependencies).
- $M_1$, $M_2$ and $m'$ are calculated by an extension of a standard generalized matrix inversion algorithm [19]. Similar methods are used for dependence analysis [21], communication generation for parallel compilers [17], etc.
- Since the dependency may be surjective, each value may be used by multiple points in $D_X$. Enumeration of these multiple points is achieved by introducing additional indices (the index vector r), and a corresponding domain $D''(z')$. The number of such index vectors is equal to the dimensionality of the intersection of the null space of M (the null space or kernel of a matrix A is the subspace $\{z \mid Az = 0\}$) and the lineality space of D (the smallest affine space containing D). $D''(z')$ is a family of polyhedra, parameterized by $z'$, and is given as the set of points $z' + r$ that satisfy the constraints of D, where $z = M_1 z' + m'$ is a particular solution.

3.2. Constraints on Memory Allocation Functions

We now formulate the constraints that the memory allocation functions must satisfy. Consider a usage table entry $D' : Y[z'] \Rightarrow D''(z') : X[M_1 z' + M_2 r + m']$. For any $z' \in D'$, the time at which $Y[z']$ is computed is $\tau_Y z' + \alpha_Y$, and the time at which this memory location is overwritten is $\tau_Y(z' + \beta_i) + \alpha_Y$, for some $\beta_i$. We must ensure that this does not occur before all the uses of $Y[z']$ have been executed. This is specified formally as follows:

Proposition 1. $\mathrm{Mem}_Y(z') = \Pi_Y z'$ is a valid memory allocation function for Y iff for each usage table entry $D' : Y[z'] \Rightarrow D''(z') : X[M_1 z' + M_2 r + m']$, the following condition holds for each $\beta_i$:

$$\forall z' \in D',\ \forall r \in D''(z'): \quad \tau_X(M_1 z' + M_2 r + m') + \alpha_X \preceq \tau_Y(z' + \beta_i) + \alpha_Y$$

which can be simplified to (writing $\bar{z} = \left[\begin{smallmatrix} z' \\ r \end{smallmatrix}\right]$ and $\bar{M} = [M_1\ M_2]$)

$$\forall \bar{z} \in D'': \quad (\tau_X \bar{M} - \tau_Y [\mathrm{Id}\ 0])\,\bar{z} + \tau_X m' + \alpha_X - \alpha_Y \preceq \tau_Y \beta_i \qquad (1)$$

Observe that we now have constraints that must hold for all points $\bar{z}$ in $D''$. But since the size of $D''$ may be arbitrary, there may be an unbounded number of constraints to be satisfied. However, the fact that $D''$ is a polyhedron allows us to exploit the power of the polyhedral model:

- First, note that Eqn. 1 holds at all $\bar{z} \in D''$ iff it is satisfied by the point(s) in $D''$ that maximize, in the lexicographic order, the (multi)linear cost function $[\tau_X \bar{M} - \tau_Y [\mathrm{Id}\ 0]]\,\bar{z}$. Hence, we have a valid memory allocation function iff for each $\beta_i$ the following holds ($\mathrm{Lmax}$ is the lexicographic maximum):

$$\mathop{\mathrm{Lmax}}_{\bar{z} \in D''}\left( [\tau_X \bar{M} - \tau_Y [\mathrm{Id}\ 0]]\,\bar{z} + \tau_X m' + \alpha_X - \alpha_Y \right) \preceq \tau_Y \beta_i \qquad (2)$$

- Second, observe that for any $\bar{z} \in D''$, $(\tau_X \bar{M} - \tau_Y [\mathrm{Id}\ 0])\,\bar{z} + \tau_X m' + \alpha_X - \alpha_Y$ is the time between the computation of $Y[z']$ and $X[\bar{M}\bar{z} + m']$. Hence the LHS of Eqn. 2 is the maximum lifetime d of Y (with respect to the dependency $D : X[z] \to Y[Mz + m]$). Indeed, as specified by Eqn. 2, it can be solved as a set of k integer linear programming problems whose solution can be determined statically. This leads to the following.

Theorem 1. $\beta$ is a valid memory projection vector for Y with respect to a dependency if and only if it satisfies a finite number of linear constraints:

$$d \preceq \tau_Y \beta \qquad (3)$$

3.3. Memory reuse constraints independent of the usage table

As mentioned above, the usage table can be determined only for a subset of all Alpha programs. Hence it is useful to formulate the constraints that the memory allocation function must satisfy in a manner independent of the usage table. Consider a dependency $D : X[z] \to Y[Mz + m]$. Then, for any $z \in D$, the time between the computation of $X[z]$ and the computation of the value that it needs, $Y[Mz + m]$, is simply $\tau_X z + \alpha_X - \tau_Y(Mz + m) - \alpha_Y$. Hence we have the following definition:

$$d' = \mathop{\mathrm{Lmax}}_{z \in D}\left( \tau_X z + \alpha_X - \tau_Y(Mz + m) - \alpha_Y \right) \qquad (4)$$

Observe that $d'$ is an upper bound on the maximum lifetime of all $z' \in D_Y$ (with respect to a dependency). This definition does not take into account the fact that only a subset of $D_Y$ may be the producers of the Y values (as per this dependency). Hence we can only have sufficient conditions from this definition.

Proposition 2. $\beta$ is a valid memory projection vector for Y with respect to a dependency if it satisfies a finite number of linear constraints:

$$d' \preceq \tau_Y \beta \qquad (5)$$

This constraint may be used even if the dependency is not e-unimodular, and hence allows us to perform reuse analysis for all Alpha programs. Finally, we note that the formulation given in Theorem 1 allows us to develop piecewise linear memory allocation functions. For instance, we could choose different projections for different subdomains of Y.

3.4. Extensions and Variations

We have formulated the constraints (either necessary and sufficient for the case when the dependency is e-unimodular, or just sufficient otherwise) that memory allocation functions must satisfy. We do not address the question of the choice of the projections. Indeed, many different cost functions come into play, and the choice is often not clear and involves many tradeoffs. One obvious cost function is to minimize the memory by (i) maximizing the number of linearly independent $\beta_i$'s (this gives us an order-of-magnitude reduction), and (ii) choosing the $\beta_i$'s so that the "footprint" of the projected domain is minimized. This is similar to the allocation function problem in systolic synthesis, and we expect that the methods can be extended and adapted.

If we are seeking a sequential implementation, then we know that all indices are ultimately going to be interpreted as time. The schedule $\tau$ gives us constraints on only k of them. So we could seek to reduce memory by posing the following problem: what extension of $\tau$ maximizes the number of linearly independent bases that satisfy Eqn. 2?
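As a sanity check of Theorem 1, the validity test can be replayed by brute force for a fixed size. The sketch below (a naive enumeration for intuition only; the point of Eqn. 2 is precisely that the lexicographic maximum can instead be computed statically, as k integer linear programs) computes the maximum lifetime d of S under the one-dimensional schedule $t_S(k,i,j) = k$, $t_C = N+1$ of Section 4, and tests a candidate projection vector $\beta$ against $d \preceq \tau_S \beta$:

#include <stdio.h>
#define N 10
#define MAX(x,y) ((x)>(y)?(x):(y))

int main(void)
{
    int d = 0;

    /* Usage entry S[k,i,j] => S[k+1,i,j] (0 <= k <= N-1):
     * lifetime contribution t_S(k+1,i,j) - t_S(k,i,j). */
    for (int k = 0; k <= N - 1; k++)
        d = MAX(d, (k + 1) - k);

    /* Usage entry S[N,i,j] => C[i,j]: contribution t_C - t_S(N,i,j). */
    d = MAX(d, (N + 1) - N);

    /* Theorem 1: beta is valid iff d <= tau_S . beta, with tau_S = [1 0 0]. */
    int beta[3] = {1, 0, 0};
    int tau_beta = 1 * beta[0] + 0 * beta[1] + 0 * beta[2];
    printf("d = %d, tau_S.beta = %d: %s projection vector\n",
           d, tau_beta, d <= tau_beta ? "valid" : "invalid");
    return 0;
}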

It may also be possible to seek the memory allocation not necessarily to minimize the memory, but to optimize other performance criteria such as cache locality, etc. Variations when a parallel implementation is being considered involve additional cost measures such as communication volume and overheads, and are discussed below.

3.5. Communication Analysis

We have seen how the usage table drives the lifetime analysis. In the context of parallelization, it is the cornerstone of communication optimization. Rajopadhye has proposed a notation, LACS [14], for specifying a wide range of communication activity within the framework of polyhedra and linear/affine index functions. He also showed how a LACS specification could be analyzed to first determine if it is well formed (no write conflicts, etc.) and then to infer many communication patterns (scatters/gathers, broadcasts, reductions, scans, etc.). This enables optimizations such as message vectorization, broadcast elimination, and latency-hiding techniques such as "message prefetching".

If we determine the usage table after applying a space-time COB, so that some indices are interpreted as processors and some as time, we have a "sender-centric" view of the communication, i.e., a LACS specification (actually, LACS allows reductions too, so if we could obtain the "usage table" of an Alpha program before serialization, we could derive the complete LACS specification). Hence, the communication analysis of LACS can be incorporated into the Alpha compiler.

4. A Detailed Example

In the matrix multiplication example, a maximally parallel schedule is $t_A(i,j) = t_B(i,j) = 0$, $t_C(i,j) = N + 1$, and $t_S(k,i,j) = k$. Only S can be considered for memory reduction, since the others are inputs and outputs of the program. To compute the maximum lifetime of variable S under this schedule, we consider the first and fourth entries of Fig. 3:

$$d_1 = \mathrm{Lmax}(t_S(k+1,i,j) - t_S(k,i,j)) = \mathrm{Lmax}((k+1) - k) = 1$$

$$d_4 = \mathop{\mathrm{Lmax}}_{k=N}(t_C(i,j) - t_S(k,i,j)) = \mathop{\mathrm{Lmax}}_{k=N}((N+1) - k) = 1$$

and finally $d = \mathrm{Lmax}(d_1, d_4) = 1$; thus the lifetime of S is 1.

Next, we find a maximally large set (but recall that since the schedule is 1-dimensional, we cannot expect to use more than one of them) of linearly independent vectors $\beta$ such that

$$d \preceq \tau_S \beta, \quad \text{i.e.,} \quad [1] \preceq [1\ 0\ 0]\,\beta$$

Within this feasible space, we may choose, for example, a vector that minimizes the "footprint" of the domain of S, and this yields $\beta = [1, 0, 0]^T$, which means that the k dimension of S may be shared in memory. Finally, we determine a matrix $\Pi$ whose null space is spanned by $\beta$, say $\Pi = \left[\begin{smallmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{smallmatrix}\right]$. Thus the memory allocation function is $\mathrm{Mem}_S(k,i,j) = (i,j)$, and the memory is reduced from $O(N^3)$ to $O(N^2)$. Using these functions, the code generator would produce parallel imperative code almost identical to Fig. 2, except that all occurrences of S(a,b,c) would be replaced by S(b,c), and the declaration would be int _S[100].
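For concreteness, here is a sketch of what the resulting main loop would look like (mirroring Fig. 2, including its forall pseudo-loops; the S(i,j) macro is our assumption, laid out analogously to the A, B and C macros of Fig. 2.a):

{ int _S[100];                    /* O(N^2) instead of O(N^3) */
#define S(i,j) _S[(i)+10*(j)-11]  /* Mem_S(k,i,j) = (i,j) */
  int k, j, i;
  for (i=1; i<=N; i++)
  { for (j=1; j<=N; j++)
    { S(i,j) = 0; } }
  for (k=1; k<=N; k++)
  { forall (i=1; i<=N; i++)
    { forall (j=1; j<=N; j++)
      { S(i,j) = S(i,j) + A(i,k) * B(k,j); } } }  /* in-place accumulation */
  for (i=1; i<=N; i++)
  { for (j=1; j<=N; j++)
    { C(i,j) = S(i,j); } }
}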

An alternative schedule: If we are willing to sacrifice parallelism for a reduction in memory, we can exploit the additional degree of freedom that this gives us. For example, we will show that the following multidimensional schedule minimizes the size of the S variable: $t_A(i,j) = t_B(i,j) = [0,0,0]^T$, $t_C(i,j) = [i,j,N+1]^T$, and $t_S(k,i,j) = [i,j,k]^T$. We recompute the maximum lifetime of S under this schedule:

$$d_1 = \mathrm{Lmax}(t_S(k+1,i,j) - t_S(k,i,j)) = \mathrm{Lmax}([i,j,k+1]^T - [i,j,k]^T) = [0,0,1]^T$$

$$d_4 = \mathop{\mathrm{Lmax}}_{k=N}(t_C(i,j) - t_S(k,i,j)) = \mathop{\mathrm{Lmax}}_{k=N}([i,j,N+1]^T - [i,j,k]^T) = [0,0,1]^T$$

and hence the lifetime of S is $d = \mathrm{Lmax}(d_1, d_4) = [0,0,1]^T$.

Next, we find a maximally large set of orthogonal basis $\beta$-vectors such that

$$d \preceq \tau_S \beta, \quad \text{i.e.,} \quad \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \preceq \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix} \beta$$

A suitable set of basis $\beta$-vectors is $\{[1,0,0]^T, [0,1,0]^T, [0,0,1]^T\}$, which means that all three dimensions of variable S may be projected, and this is achieved by $\Pi = [0\ 0\ 0]$. Thus the memory mapping function is $\mathrm{Mem}_S(k,i,j) = (0)$.

By sacrificing parallelism, and with the proper schedule, the memory allocated to S is reduced to a single scalar. Following the new schedule, and using the new memory mapping function $\mathrm{Mem}_S$, the code generator would produce the totally sequential code of Figure 4:

for (i=1; i<=N; i++)
{ for (j=1; j<=N; j++)
  { S = 0;
    for (k=1; k<=N; k++)
    { S = S + A(i,k) * B(k,j); }
    C(i,j) = S; } }

Figure 4: Memory Optimal Code

We could also consider a tradeoff between these two extremes. For example, we could derive a two-dimensional schedule and a one-dimensional memory allocation.

5. Related Work

Alpha is a specialized functional language, closely related to Crystal [3]. The analysis methods presented here are complementary to the ones used for other conventional functional languages such as SISAL and Haskell with I-structures [1].

Parallelizing compilers face a dual problem to ours. In a loop program, only the flow dependencies are true dependencies; output and anti dependencies arise due to memory reuse, and can be eliminated at the price of more memory. Since the fewer the dependencies, the higher the parallelism in general, there has been considerable work on the problem of eliminating false dependencies by introducing temporary variables [11]. Usually this increases the memory space by a constant factor (the number of temporaries), but techniques such as array expansion [21] may cause order-of-magnitude increases. In most parallelizing compilers, such expansion is often essential in order to obtain parallelism, and is considered worth the price. Nevertheless, it would be interesting to explore how the techniques presented here could be used in such a context.

The work closest to ours is that of Lefebvre and Feautrier [8], who also use the polyhedral model in the PAF parallelizer. It is known [5] that a nested loop program where the loop bounds are affine functions of outer indices (and possibly parameters), and all arrays are accessed through affine functions of the indices, is equivalent to an Alpha program. The techniques and results are similar to ours, and were developed independently (our results were first reported in [19]). There are a few subtle distinctions, however. Scheduling an arbitrary Alpha program is known to be undecidable, whereas PAF is guaranteed that a schedule exists, since the SARE was derived from a sequential program (the original sequential order). Second, they perform the analysis before a single-assignment form is generated, and they pose the question of whether an index that is introduced for array expansion can be safely ignored (given a schedule). It is not clear whether the analysis can reduce the memory originally used by a program.

For compiling SAREs, Mongenet [10] defined the utilization set, but did not give a method to obtain it. She states a constraint similar to Eqn. 1, but does not show how this can be formulated as a linear programming problem.

Chamski [2] does lifetime analysis for an earlier version of Alpha. The model was somewhat more restrictive than ours. First, the schedule was assumed to be full-dimensional (not just k-dimensional). Next, his definition of the lifetime of a variable (although he only considered self-dependencies, the approach can be easily extended) was as follows:

$$d = \max\{z - z' \mid z' \in D_Y,\ z \in M^{-1}z'\}$$

This requires the dependency to be bijective (or at least admit a left inverse). All examples treated by him had only uniform dependencies.

Recently, De Greef and Catthoor have also addressed the memory reuse problem (they also consider allocating different variables to the same memory) in an extension of the polyhedral model [4]. They formulate the conditions on memory reuse (all uses must be over before a value is overwritten) and even formulate an optimization problem to minimize the memory, but they do not give any compact conditions on the space of possible solutions. Related problems are also studied in the high-level synthesis community (see for example [12] and references therein).

The usage table can also be a tool for static analysis of loop programs to determine communication optimizations. Most current parallelizing compilers perform such optimizations by index analysis, often with intelligent pattern matching, but do not use the polyhedral model. For example, most of them will be able to deduce that X[i,j] = X[i-1,j+2] is a translation, and generate communication code to "shift" X by [1,-2]. They are also able to deduce simple broadcasts, e.g., X[i,j] = X[i,i], since the j index is missing on the RHS (strictly speaking, this need not be a broadcast: if the statement were within a loop where i=j, our usage table would detect that it is not). Many similar ideas can be found in the literature, but our usage table provides a concise representation.

6. Conclusion

We have described how the polyhedral model allows us to statically compute the usage table of an Alpha program. It provides a foundation for many kinds of compile-time optimizations. We illustrated this by showing how the constraints for generating memory-efficient multiple-assignment code can be determined in a constructive manner, and indicated that it can also be used in communication optimization.

In the short term, perhaps the biggest beneficiaries of our work will be functional languages, since the compiler can now tap into optimizations that seemed very far removed. It enables them to compete with imperative languages and their parallelizing compilers, albeit for a fairly narrow but important class of programs, while retaining declarative semantics.

There are two main criticisms that one can make about the polyhedral model. The first is that one cannot (without contortions) express algorithms that have dynamic dependencies (iterative methods, pivoting algorithms, etc.). When parallelizing compilers make safe assumptions about the extent of dependencies, the size of the iteration space, etc., they impose very similar restrictions. However, their restrictions are on the analysis methods, not on the programmer, who is shielded: the compiler may not be able to effectively analyze a program with pointers, but this does not mean that the program won't run, just that the best performance will not be attained. In Alpha, such a program can't even be written without contortions. Nevertheless, we contend that this is a reasonable choice, since we are exploring the limits of the analysis techniques enabled by the polyhedral model. Alpha is a research language and is not intended to replace existing languages.

The second criticism is that these methods seem to be overkill. After all, most parallelizing compilers do detect 90% of the common communication patterns, and can generate efficient code. Is the price of the sophisticated analysis worth the (seemingly minimal) returns? Our conviction is that it will be worthwhile in the long run, even if "all" we have is a firm theoretical foundation for the optimizations that work in 90% of the cases.

Acknowledgments: We would like to thank Paul Feautrier for very valuable feedback, and for suggesting that the maximum lifetime could be expressed independently of the usage table. We also thank Francky Catthoor for helpful discussions and the anonymous referees for valuable feedback.

7. References

1. Arvind, R. S. Nikhil, and K. K. Pingali. I-structures: Data structures for parallel computing. ACM Transactions on Programming Languages and Systems, 11(4):598-632, October 1989.

2. Z. Chamski. Generating memory-efficient imperative data structures from systolic programs. Technical Report PI-621, IRISA, Rennes, France, December 1991.

3. Marina C. Chen. A parallel language and its compilation to multiprocessor machines for VLSI. In Principles of Programming Languages. ACM, 1986.

4. E. De Greef, F. Catthoor, and H. De Man. Reducing storage size for static control programs mapped to parallel architectures. Presented at the Dagstuhl Seminar on Loop Parallelization, April 1996.

5. P. Feautrier. Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20(1):23-53, February 1991.

6. Paul Feautrier. Some efficient solutions to the affine scheduling problem, Part II: multidimensional time. Technical Report 78, Laboratoire MASI, Institut Blaise Pascal, October 1992.

7. H. Le Verge, V. Van Dongen, and D. Wilde. La synthèse de nids de boucles avec la bibliothèque polyédrique. In RenPar'6, Lyon, France, June 1994. English version, "Loop Nest Synthesis Using the Polyhedral Library", in IRISA TR 830, May 1994.

8. V. Lefebvre and P. Feautrier. Storage management in parallel programs. In 5th Euromicro Workshop on Parallel and Distributed Processing, pages 181-188, London, January 1997. IEEE. French version presented at RenPar 8, May 1996.

9. Christophe Mauras. ALPHA: un langage équationnel pour la conception et la programmation d'architectures parallèles synchrones. PhD thesis, L'Université de Rennes I, IRISA, Campus de Beaulieu, Rennes, France, December 1989.

10. Catherine Mongenet. Data compiling for systems of uniform recurrence equations. Parallel Processing Letters, 4(3):245-257, 1994.

11. D. A. Padua and M. J. Wolfe. Advanced compiler optimizations for supercomputers. Communications of the ACM, 29(12):1184-1201, December 1986.

12. N. Passos, E. Sha, and L.-F. Chao. Multi-dimensional interleaving for time-and-memory design optimization. In ICCD: International Conference on Computer Design, pages 440-445, Austin, TX, October 1995. IEEE.

13. P. Quinton, S. Rajopadhye, and D. Wilde. Deriving imperative code from functional programs. In 7th Conference on Functional Programming Languages and Computer Architecture, pages 36-44, La Jolla, CA, June 1995. ACM.

14. S. V. Rajopadhye. LACS: A language for affine communication structures. Technical Report 712, IRISA, 35042 Rennes Cedex, April 1993.

15. S. V. Rajopadhye and R. M. Fujimoto. Synthesizing systolic arrays from recurrence equations. Parallel Computing, 14:163-189, June 1990. First presented as [16].

16. S. V. Rajopadhye, S. Purushothaman, and R. M. Fujimoto. On synthesizing systolic arrays from recurrence equations with linear dependencies. In Proceedings, Sixth Conference on Foundations of Software Technology and Theoretical Computer Science, pages 488-503, New Delhi, India, December 1986. Springer Verlag, LNCS 241. Later appeared in Parallel Computing, June 1990.

17. A. Rogers and K. Pingali. Compiling for distributed memory architectures. IEEE Transactions on Parallel and Distributed Systems, 5(3):281-298, March 1994.

18. D. Wilde. A library for doing polyhedral operations. Technical Report PI 785, IRISA, Rennes, France, December 1993. An extended version of the author's MS thesis, Computer Science Dept., Oregon State University, Corvallis, OR, December 1993.

19. D. Wilde and S. Rajopadhye. The power of polyhedra. Technical Report 95-80-8, Oregon State University, Computer Science Dept., Corvallis, OR 97331, August 1995.

20. D. K. Wilde and S. Rajopadhye. Allocating memory arrays for polyhedra. Technical Report RR-2059, INRIA, IRISA, Rennes, July 1993.

21. M. J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.