
Background data management in Subword Parallelizing Compilers: a test case

C. Tenllado†*, P. Op de Beeck†‡, M. Miranda†, G. Deconinck‡, F. Catthoor†‡ and M. Prieto*
† IMEC, Leuven, Belgium

‡ ESAT Lab., Katholieke Universiteit Leuven, Leuven, Belgium
* ArTeCS, Universidad Complutense de Madrid, Madrid, Spain.

Abstract

Parallel processing techniques for compilers have been extensively researched in the past decades. More recent is the work done in the context of back-end compilers for SubWord Parallelism (SLP). This paper analyzes two state-of-the-art compilers and exposes their main problems with respect to SLP code generation for code dominated by background data. We suggest some ways to overcome these problems and analyze their potential benefits. As a fundamental part of our proposed solution direction, we also describe the strong interaction between background and foreground data format organization in the context of energy and performance, focusing on two cost factors inherent to SLP: reordering overhead due to subword communication on the one hand, and data padding in the data memory on the other hand. To our knowledge, the problems that we identify have been ignored both in academic and in industrial compiler support. Nevertheless, they have a high impact on the main cost metrics of the design space, especially for data dominated embedded applications.

Results show that our approach significantly improves the SLP code generation of both the Intel and Larsen compilers.

1 Introduction

In the mid-nineties researchers from the processor architecture community proposed a relatively simple hardware technique that would allow a more efficient use of the already available resources [9]. They realized that in the majority of applications the required bit-widths do not match well with the processor width. Dividing the full width of the data path into smaller chunks would enable a compiler to fill up the otherwise wasted bits with useful data and enable a kind of data parallelism called Subword Level data Parallelism (SLP).

Functional units enhanced with SLP properties require the input data to be presented in a packed fashion. This means that each input register contains several logical data elements which are worked on in parallel. In fact, the full width functional unit can be seen as a collection of parallel processing elements. This is illustrated in Figure 1 for a standard addition. Note that only an extra mux and demux is required to implement SLP in hardware.


Figure 1: A full-width addition (solid box) is subdivided into two half-width additions (dotted boxes) when the DualAdd Control signal is activated and "cuts" the carry signal.
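As a concrete illustration, the following C fragment emulates the DualAdd of Figure 1 in software; the function name dual_add16 and the 32-bit word layout are our own assumptions for illustration, not part of any particular ISA.

#include <stdint.h>

/* Software emulation of the DualAdd in Figure 1: the low and high
 * 16-bit subwords of a and b are added independently, i.e. the carry
 * out of bit 15 is "cut" instead of propagating into bit 16. */
static uint32_t dual_add16(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;         /* low subword sum, boundary carry discarded   */
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16; /* high subword sum, overflow wraps per subword */
    return hi | lo;
}

On real SLP instruction sets this is a single instruction; for example, the SSE2 intrinsic _mm_add_epi16 performs eight such 16-bit additions on one 128-bit register.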

The main advantage of SLP is its flexibility in supporting many different word lengths (for the subwords) on the same datapath. Especially for DSP applications this leads to important opportunities for power efficiency.

The main difference between subword-level and traditional data-level parallelism is the fact that the ordering of the processing elements inside the SLP functional units is fixed. This imposes more restrictions on the search space for parallelization compared to non-SLP architectures. Indeed, for non-SLP architectures the optimal ordering of operations for parallelization is only constrained by data and control flow dependencies, while for SLP architectures data-layout constraints have to be considered as well. If the data layout required by the SLP units does not match the one present in the memory hierarchy, then extra memory accesses and reordering operations may be required, resulting in an energy and cycle overhead.

When this problem appears, current compilers inhibit SLP code generation, even if there is inherent parallelism that could be efficiently exploited if both algorithm and data were treated together in a holistic way, given that these issues are tackled at compile-time. The main contribution of this paper is the clear identification of the nature and cause of this problem. Moreover, a sequence of steps, based on the polyhedral representation of loop nests with affine access functions, is presented as a first insight into overcoming this problem. We do not yet describe the fully systematic compiler techniques that would be required to automate the related compiler support.

The rest of the paper is organized as follows. Some basic definitions and the basic problems that affect efficient SLP code generation are covered in Section 2, while in Section 3 we introduce a very simple example that will help us to illustrate the weaknesses of current SLP compilers. In Section 4 we describe some related work, and in Section 5 we present our approach, which solves the problem. The latter is evaluated in Section 6. Finally, the main conclusions of this research are presented in Section 7.

2 Common definitions and SLP issues

Necessary or otherwise reasonable conditions for a statement to become a candidate for SLP are:

• the operation inside the statement should be available on a SLP functional unit,

• the statement should be inside a loop nest, and

• the statement should access arrays whose elements are smaller than the full bit width.

We will give some definitions that apply to these statements.

Definition 2.1 (subword) Let D be an array accessed in a statement suited for SLP. The elements of D are also referred to as subwords.

Up to two subwords of 16 bits can be processed in parallel on a 32-bit architecture. Likewise, if the subwords are 8-bit data elements, four of those can be grouped together.

Definition 2.2 (superword) The collection of subwords that are operated on in parallel and need to be merged together is called a superword.

Figure 2: Suppose the subwords $SW^b_0$ and $SW^a_1$ need to be assembled by a SubWord Shift Left operation in order to feed the SLP unit properly. This is equivalent to communicating $SW^b_0$ to subPE1 and $SW^a_1$ to subPE0 of the SLP unit.

The subwords that form a superword are referred to as SW0, SW1, ... and their order in the superword is important for the SLP computations. If the size of a superword is smaller than the full bit width, the available subword level parallelism is not fully exploited.

Definition 2.3 (sws) The maximum number of subwords that fit into one superword is defined as the superword size or sws.

Although in practice sws is a power of 2 and a divisor of the full data-path width, there is no good reason for such a restriction in general. For instance, a superword with sws = 3 could be divided into three subwords of 12, 12 and 8 bits respectively in an architecture with a 32-bit datapath.

Definition 2.4 (sub-processing element) For each SLP functional unit there exists a number of predetermined subdivisions into sub-processing elements. At runtime the chosen subdivision is activated with a control signal. Each sub-processing element is labeled subPE0, subPE1, . . . and works on the subwords SW0, SW1, . . . respectively.

Several issues are crucial for efficient SLP code generation in the embedded context.

Subword communication A common problem for any data-level parallelization scheme is communication and synchronization. The communication problem in the context of SLP occurs when the subwords required by an SLP functional unit are located in:

• different superwords, or

• the same superword, but have to be reordered.

What happens is illustrated in Figure 2. Clearly extra operations are needed to combine or reorder subwords. These operations do not come for free and can be avoided or else minimized by a careful mapping of the loop nest.
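To make this cost concrete, the reordering of Figure 2 can be written out for a 32-bit superword holding two 16-bit subwords; the helper below is a hypothetical sketch, not an operation from a specific ISA.

#include <stdint.h>

/* Assemble the superword required in Figure 2 from two packed 32-bit
 * superwords a = (SWa1 : SWa0) and b = (SWb1 : SWb0): SWb0 must move to
 * subword position 1 and SWa1 to position 0.  The two shifts and the OR
 * are exactly the reordering overhead discussed above. */
static uint32_t gather(uint32_t a, uint32_t b)
{
    return (b << 16)   /* SWb0 shifted left into subword position 1  */
         | (a >> 16);  /* SWa1 shifted right into subword position 0 */
}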

Superword layout in background memory What happens when some superwords contain fewer subwords than others? Or, in other words, what happens in those cases where not all of the available sub-processing elements can do useful work?

[Figure 3: panel (a) shows the default row-major layout of array D and its mapping of elements Dj to subwords; panel (b) shows the resulting superword layout in the data cache.]

Figure 3: Padding effect.

To answer this question we focus on a technique called padding, in which we force each superword to have equal size by padding the smaller sized superwords. The effect is visualized in Figure 3(b) and can be summarized as an increase in the number of data elements that need to be transferred from background to foreground memory.

3 Motivating example

To illustrate the efficiency problems of current SLP compilers, we use an example borrowed from [11]. The pseudo-code is given in Listing 1.

Listing 1 An illustrative parallelizable loop nest
for i = 1 to N-1 do
  for j = 1 to N-1 do
    A[i,j] = A[i,j] + B[i-1,j];    (statement 1)
    B[i,j] = A[i,j-1] + B[i,j];    (statement 2)
  end for
end for

Listing 2 Motivating example unrolled and reordered
for i = 1 to N-1 do
  for j = 1 to N-1 by 2 do
    A[i,j]   = A[i,j]   + B[i-1,j];      (statement 1.a)
    A[i,j+1] = A[i,j+1] + B[i-1,j+1];    (statement 1.b)
    B[i,j]   = A[i,j-1] + B[i,j];        (statement 2.a)
    B[i,j+1] = A[i,j]   + B[i,j+1];      (statement 2.b)
  end for
end for

Let us analyze the inherent problems of this algorithm by applying a simplified script of the SLP code generation stage in the conventional compiling process. In order to find candidates for SLP computations, the compiler would unroll the inner loop by a number of steps equal to the sws (we will use sws = 2). In the unrolled loop body it tries to group the statements so that the data accessed in all statements become contiguous. The result is shown in Listing 2. Finally the compiler generates code in which statement x.a is mapped to SW0 and statement x.b to SW1 of the respective superwords.
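For instance, reusing the dual_add16 helper sketched after Figure 1 as a stand-in for a real SLP instruction, and assuming 16-bit elements with rows stored contiguously and superword-aligned (both assumptions, see Section 5.3), the packed form of statements 1.a/1.b would look roughly like:

/* Sketch of the SLP form of statement 1 for sws = 2: one uint32_t
 * superword holds A[i][j] in SW0 (low half) and A[i][j+1] in SW1,
 * where A and B are assumed to be 2-D arrays of int16_t. */
uint32_t *a  = (uint32_t *)&A[i][j];    /* superword (A[i,j],   A[i,j+1])   */
uint32_t *b1 = (uint32_t *)&B[i-1][j];  /* superword (B[i-1,j], B[i-1,j+1]) */
*a = dual_add16(*a, *b1);               /* statements 1.a and 1.b in parallel */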

Unfortunately, data dependencies may cause problems: the element A[i,j], computed in SW0 of the superword generated by statement 1, is needed in SW1 for the SLP version of statement 2. This implies communicating SW0 of superword 1 to SW1 of superword 2.

A further problem has to do with the data layout. Due to the memory architecture, an efficient load of a superword from the background memory leads to the following requirements:

• The subwords that constitute a superword have to be stored contiguously in memory.

• Superwords have to be stored on aligned addresses.

Some platforms, like Intel IA32, allow memory accesses that fetch a non-aligned superword, at the expense of a decrease in performance and an increase in energy consumption. If this kind of access is not available, two normal accesses are needed to fetch a non-aligned superword, provided that the subwords are contiguous in memory; otherwise it can be even worse. Obviously, it is not energy efficient to increase the number of loads/stores, so embedded-oriented compilers avoid SLP generation in these cases.

Normally, compilers align the first element of each array that is going to be vectorized. Even in that case, depending on the dimensions of the array, the elements to be fetched in the inner loop could be misaligned.
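On IA32 this distinction is visible directly in the SSE2 intrinsics: _mm_load_si128 requires a 16-byte-aligned address, while _mm_loadu_si128 accepts any address at a performance and energy penalty. A minimal sketch (the helper name load_superword is ours):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Fetch one 128-bit superword starting at &data[j].  The aligned load
 * faults unless the address is 16-byte aligned; the unaligned load
 * always works but costs extra cycles and energy. */
static __m128i load_superword(const int16_t *data, int j, int aligned)
{
    if (aligned)
        return _mm_load_si128((const __m128i *)&data[j]);  /* requires 16-byte alignment */
    return _mm_loadu_si128((const __m128i *)&data[j]);     /* tolerates any address */
}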

4 Related work

Several academic compilers with SLP extensions are available. One of them is the Lance compiler [10], which transforms the code selection problem into an integer linear program formulation. Tackling SLP at this level leaves the traditional compiler steps like register allocation and instruction scheduling intact.

Another well-known compiler effort came from Larsen and Amarasinghe, who have implemented an SLP module in SUIF [7]. They adopt a heuristic approach which can produce good results in a very reasonable time.

Both approaches should be considered as back-end steps for SLP. Neither of them describes prior loop transformations to enhance the compiler's ability to exploit SLP. In this paper we show how a SLP back-end, and in particular the SLP compiler by Larsen, can produce better code by performing loop transformations in a systematic way.

On the industrial side, the Intel compiler [2] can be considered a major reference for SLP code generation (in particular for IA32 architectures). It offers the possibility to manually vectorize C code using intrinsics, i.e. function calls that are directly translated into native instructions. In this paper we will show that this compiler is also not able to manage the problems observed for the Larsen compiler.

It is interesting to analyze the way these compilers handle the problems described in Section 3. We observed that the Larsen back-end inserts communication operations and extra aligned memory accesses in order to fetch non-aligned superwords. Moreover, we observed that the code generated by Larsen is only vectorized if N is a multiple of the superword size, so that each row of the arrays is aligned in memory.

Furthermore, the generated vector code does not use superwords of the maximum size, due to the fact that the loop starts at one (the second element of each row, which is not aligned). We also observed that the Larsen compiler decides not to vectorize statement 2, due to the necessary communication described for Listing 2.

On the other hand, Intel platforms provide non-aligned memory accesses at the expense of an increase in latency and energy consumption. The only decision taken by the compiler regarding the background memory is to align the first element of the array. Therefore, in a given iteration, it has to check whether the data to be processed are aligned or not, in order to see if the superwords can be efficiently loaded. The Intel compiler splits the loop into several pieces to handle the cases of aligned and misaligned data. This could be a problem in embedded systems if the code size becomes larger than the loop buffer.
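This splitting amounts to peeling scalar iterations until an aligned superword boundary is reached and versioning the remainder. The sketch below is our own reconstruction of that strategy, not the compiler's actual output; IS_ALIGNED, SWS, body_scalar and body_slp are hypothetical helpers.

/* Alignment peeling, roughly as a vectorizing compiler generates it:
 * a scalar prologue up to the first aligned superword, an aligned SLP
 * kernel, and a scalar epilogue for the tail.  Every extra version
 * enlarges the code, which may overflow a small loop buffer. */
int j = 1;
while (j <= N-1 && !IS_ALIGNED(&A[i][j]))   /* scalar prologue  */
    body_scalar(i, j++);
for (; j + SWS - 1 <= N-1; j += SWS)        /* aligned SLP kernel */
    body_slp(i, j);
for (; j <= N-1; j++)                       /* scalar epilogue  */
    body_scalar(i, j);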

Summarizing, the analysis of the SLP code generation methodologies of the Intel and Larsen compilers for the example in Listing 1 shows that:

1. None of the compilers is able to extract communication-free SLP code. Moreover, Larsen avoids the vectorization of statement 2 in order to avoid the communication problem.

2. Regarding the memory layout, the compilers align the first element of the arrays. This makes the resulting SLP codes inefficient in the sense that:

(a) The full superword width is not used, or even non-SLP code is obtained (Larsen compiler).

(b) Code size can increase for some values of the problem size. In those cases checks for aligned data are needed and some non-aligned accesses are used (Intel compiler).

The cost impact of both alternatives will be analyzed in Section 6, where we also evaluate the effect of applying our approach to this motivating example.

5 Polyhedral model for SLP

In this Section we present our approach, which makes use of the polyhedral model. In this paper we restrict ourselves to loop nests with affine and manifest access functions. This is reasonable for a large set of data-dominated applications, and it is the same assumption that is used for data parallel compilers.

In the complete compiling script some access ordering constraints, decided during previous steps, may have to be taken into account. Typically, these constraints only affect the outer loops and therefore they are out of the scope of this paper.

The remaining k loop iterators, together with their domains, define a polyhedron I, which is a bounded region in the iteration space I∞ = Zᵏ. Each point in I represents an instance of the body of the loop nest. The polyhedron I which is scanned by the code illustrated in Listing 1 is visualized in Figure 4.
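For Listing 1 this polyhedron, and the two dependence distances visualized in Figure 4 (A[i,j] flows from statement 1 to statement 2 one j-iteration later, and B[i,j] from statement 2 to statement 1 of the next i-iteration), can be read off directly:

$$\mathcal{I} = \left\{ (i,j) \in \mathbb{Z}^2 \mid 1 \le i \le N-1,\ 1 \le j \le N-1 \right\}, \qquad d_A = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \quad d_B = \begin{pmatrix} 1 \\ 0 \end{pmatrix}.$$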

Figure 4: The iteration space of the original example; vectors dA and dB represent the dependence distances.

5.1 Space-Time Transformation

During the Space-Time transformation we impose no restrictions on the width of the SLP unit of the target platform. Later on we will see how to impose platform-specific constraints.

We formulate the problem of assigning each instance of a statement in Listing 1 to a processor inside the SLP unit. This is dubbed the space mapping. Once this is done, several instances will have been assigned to the same processor in the SLP unit, and a time scheduling has to be decided so that the data dependencies are respected. The whole process is referred to as the Space-Time (S-T) transformation in the literature on data parallel compilation [8].

In its most general form this is expressed as the direct sum I = S ⊕ P [5]. We interpret P as the processor space, and S governs how P is moved through I. For a detailed study we refer to [5, 13].

Several existing techniques can be used to find and explore expressions for S and P [8]. We propose to reuse the algorithm described in [11], which minimizes the number of communication points in the data parallel context. The result of the processor mapping for the two statements in Listing 1 is:

$$\phi_1(\mathbf{i}) = (1\ \ {-1})\,\mathbf{i} = i - j, \qquad \phi_2(\mathbf{i}) = (1\ \ {-1})\,\mathbf{i} + 1 = i - j + 1. \qquad (1)$$
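It is easy to verify that this mapping makes both dependencies internal to a processor: A[i,j−1], produced by statement 1 at iteration (i, j−1), is consumed by statement 2 at (i, j), and B[i−1,j], produced by statement 2 at (i−1, j), is consumed by statement 1 at (i, j):

$$\phi_1(i,\ j-1) = i - (j-1) = i - j + 1 = \phi_2(i,\ j), \qquad \phi_2(i-1,\ j) = (i-1) - j + 1 = i - j = \phi_1(i,\ j).$$

Hence no values have to cross processor boundaries, which is exactly the communication-free property exploited below.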

We should remark that for the considered example a lot of freedom in choosing the time scheduling still exists. One possibility is to map the time variable to i, which gives us the S-T transformation (2) represented in Figure 5.

$$\mathcal{S}_1 = \mathcal{S}_2 = \left\{ \mathbf{s} \mid \mathbf{s} = \begin{pmatrix} 1 \\ 1 \end{pmatrix} z,\ 1 \le z \le n-1 \right\},$$
$$\mathcal{P}_1 = \left\{ \mathbf{p} \mid \mathbf{p} = \begin{pmatrix} 0 \\ -1 \end{pmatrix} y,\ -n+3 \le y \le n-1 \right\}, \qquad (2)$$
$$\mathcal{P}_2 = \left\{ \mathbf{p} \mid \mathbf{p} = \begin{pmatrix} 0 \\ -1 \end{pmatrix} y + \begin{pmatrix} 0 \\ 1 \end{pmatrix},\ -n+2 \le y \le n-2 \right\}.$$

In Equation (2), z represents the time and y the processor. The i-th iteration of a statement can now be split into a processor part and a scheduling part. For the first statement in Listing 1 this gives, using the same notation as in [5],

$$\mathbf{i} = \begin{pmatrix} 0 \\ -1 \end{pmatrix} y + \begin{pmatrix} 1 \\ 1 \end{pmatrix} z, \qquad 1 \le z \le n-1,\ \ -n+3 \le y \le n-1. \qquad (3)$$

Note that this expression is identical to (1). Applying this transformation to the loop nest in the example, we arrive at the algorithm version described in Listing 3. The current method produces only one best solution, but we would like to obtain all good solutions, even those that may not be the best at this stage but could globally perform better (see Section 5.3).

Figure 5: A feasible subdivision of I for the original example.

Listing 3 ST applied on the motivating example
for t = 1 to N-1 do
  A[t,N-1] = A[t,N-1] + B[t-1,N-1];
  for p = t-(N-2) to t-1 do
    A[t,t-p] = A[t,t-p] + B[t-1,t-p];
    B[t,t-p+1] = A[t,t-p] + B[t,t-p+1];
  end for
  B[t,1] = A[t,0] + B[t,1];
end for

5.2 Partitioning

In the previous Section we made no restrictions on the width of the SLP functional unit of the target platform. Now we have to consider the real case. For the example we assume a bit width of 16 for both A and B, hence sws = 2 on a 32-bit architecture. To impose these constraints, the processor space has to be further subdivided, or partitioned, into a processor space Pπ and a schedule Sπ, much the same way as was done for I. All iterations i belonging to the coset (s + sπ) + Pπ are executed in parallel, i.e. all the iteration points in Pπ scheduled at (s + sπ) are executed in parallel. Note that the cardinality of Pπ should be equal to sws, or two in this case.

Again a lot of freedom exists. In a real case the access ordering constraints decided in previous steps of the compiler script may restrict the possibilities. Furthermore, as will be explained later, partitioning strongly influences the data layout. Thus, when several loop nests are taken into account where the same data are being accessed, some constraints can be imposed on the partitioning stage to balance its impact on the cost function of each such loop nest.

We now give two useful definitions.

Definition 5.1 (access function) $F^D_{as}$ is the affine access function F used by the a-th access to array D in statement s.


Figure 6: Partition of the original processor space P. For some s + sπ values only one element is mapped to the Pπ space; this is a border effect.

Definition 5.2 (affiliated with) For each statement t we define the equivalence relation "affiliated with" as

$$\stackrel{t}{\sim}\ :\ \mathcal{I} \to \mathcal{I}\ :\ \mathbf{i} \stackrel{t}{\sim} \mathbf{j} \;\Rightarrow\; \mathbf{i}, \mathbf{j} \in (\mathbf{s}_t + \mathbf{s}^\pi_t) + \mathcal{P}^\pi_t.$$

This tells us that i and j will be executed in parallel. We can now make an important statement in the context of subword level parallelism.

Theorem 5.1 (superword) Data elements $D(F^D_{at}(\mathbf{i}))$ and $D(F^D_{at}(\mathbf{j}))$ belong to the same superword iff $\mathbf{i} \stackrel{t}{\sim} \mathbf{j}$.

For instance, one of the partitioning functions that we will consider in this paper corresponds to a tiling transformation of the inner loop in Listing 3 with tile size equal to the sws and zero offset. This situation is sketched in Figure 6 and is given by

$$\mathcal{P}^\pi_1 = \left\{ \mathbf{p}^\pi \mid \mathbf{p}^\pi = \begin{pmatrix} 0 \\ -1 \end{pmatrix} \kappa,\ 0 \le \kappa < 2 \right\},$$
$$\mathcal{S}^\pi_1 = \left\{ \mathbf{s}^\pi \mid \mathbf{s}^\pi = \begin{pmatrix} 0 \\ -1 \end{pmatrix} 2\kappa_s,\ \frac{-n+2}{2} \le \kappa_s \le \frac{n-2}{2} \right\}, \qquad (4)$$

where κ is the position within the tile and κs the position of the tile in the processor space. Here tile and superword can be used interchangeably. The i-th iteration can now be written as

$$\mathbf{i} = \begin{pmatrix} i \\ j \end{pmatrix} = \begin{pmatrix} 0 \\ -1 \end{pmatrix} (\kappa + 2\kappa_s) + \begin{pmatrix} 1 \\ 1 \end{pmatrix} z, \qquad \mathbf{i} \in \mathcal{I}. \qquad (5)$$

All points generated by Equation (5) while keeping κs and z constant are affiliated (i.e. two points, because κ can be 0 or 1).
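For example, fixing z = 2 and κs = 0 in Equation (5) yields the affiliated pair

$$\mathbf{i} = \begin{pmatrix} z \\ z - \kappa - 2\kappa_s \end{pmatrix}: \qquad \kappa = 0 \Rightarrow \begin{pmatrix} 2 \\ 2 \end{pmatrix}, \qquad \kappa = 1 \Rightarrow \begin{pmatrix} 2 \\ 1 \end{pmatrix},$$

which are executed in parallel on subPE0 and subPE1 respectively.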

5.2.1 Partition function

Equation (5) is an instance of the transformation T given by

$$T = (P, L) \begin{pmatrix} \psi(\kappa, \kappa_s) \\ z \end{pmatrix}, \qquad (6)$$

where ψ(κ, κs) : Z² → Z is the partition function of the line Pπ and (P, L) is the transformation matrix. In this paper we restrict the partition functions to those which can be decomposed as

$$\psi(\kappa, \kappa_s) = \sum_{j=0}^{sws-1} \psi_j(\kappa_s) \prod_{\substack{k=0 \\ k \neq j}}^{sws-1} \frac{\kappa - k}{j - k}, \qquad (7)$$

on the lattice {0, . . . , sws−1} × Z. For sws = 2, equation (7) simplifies to

$$\psi(\kappa, \kappa_s) = \psi_\kappa(\kappa_s), \qquad \kappa \in \{0, 1\}. \qquad (8)$$

The above restriction (7) still allows non-affine partition functions, useful for instance in butterfly computations, while at the same time it can be expressed as a Presburger formula,

$$(\kappa = 0 \wedge \psi_0(\kappa_s)) \vee (\kappa = 1 \wedge \psi_1(\kappa_s)). \qquad (9)$$

This property is necessary to enable analysis techniques that make use of, for instance, the Omega Library [12], which transforms Presburger formulas into polytopes.

The partition function κ + 2κs in Equation (5) can be derived from equations (7) and (8) using

$$\psi_0(\kappa_s) = 2\kappa_s \quad \text{and} \quad \psi_1(\kappa_s) = 2\kappa_s + 1, \qquad (10)$$

and the corresponding transformation T is given by

$$T = \begin{pmatrix} 0 & 1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} \psi(\kappa, \kappa_s) \\ z \end{pmatrix}. \qquad (11)$$
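As a check, substituting (10) into the sws = 2 instance of the Lagrange form (7) indeed recovers the partition function of Equation (5):

$$\psi(\kappa, \kappa_s) = \psi_0(\kappa_s)\,\frac{\kappa - 1}{0 - 1} + \psi_1(\kappa_s)\,\frac{\kappa - 0}{1 - 0} = 2\kappa_s (1 - \kappa) + (2\kappa_s + 1)\,\kappa = \kappa + 2\kappa_s.$$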

5.2.2 Subword communication

We are now able to restate the subword communication problem as discussed in Section 2. We have the following result.

Theorem 5.2 (non communication-free) Given a dependence relation R between statements a and b, to which the transformations Ta and Tb apply, R will result in subword communication iff

$$\exists (\mathbf{i}, \mathbf{j}) \in R :\ \mathbf{i} = T_a|_{\kappa}\ \wedge\ \mathbf{j} = T_b|_{\kappa'}\ \wedge\ \kappa \neq \kappa'. \qquad (12)$$

Note that for SLP these dependences also include false dependencies. This is one of the differences with traditional data-level parallelism.

By now, we know that subword communication translates to data reordering operations that will have to be inserted in the code to swap subwords from one "sub" location to another. Typically subwords are gathered from different superwords and merged together. This means that several superwords have to be accessed, while not all of the subwords are going to be used in the reordering result. Usually, however, these wasteful subwords are later reordered themselves. We have experienced large gains when localizing all these reordering operations, because it enables the script to bring these superwords to foreground memory (e.g. the register file). We will call this step inter-superword optimization; it is related to the layout issues described in the following section.

5.3 Data layout

Although data layout transformations are not strictly required for SLP, it has been sufficiently motivated in the past that they enable the full potential of SLP [4]. In our motivating example (Section 3) we analyzed some of the problems that might occur due to the memory data layout. If data elements belonging to the same superword (defined in Theorem 5.1) are not contiguously stored in memory, then a lot of extra data restructuring and memory access operations are needed. In the worst case, sws accesses are required to construct one superword.

We define two types of data layout: inter and intra superword layout. Intra superword layout places the elements belonging to the same superword contiguously in background memory. Inter superword layout works at a coarser granularity and decides the memory placement of the superwords themselves. We should remark that some degrees of freedom still exist, for both inter and intra superword layouts, that can be left to other stages in the compiler script, e.g. the ordering of the subwords inside the superword.

If Pπ is scanned in the innermost loop, which is enforced, then existing locality algorithms like the one described in [6] can be used to perform a good intra superword layout.
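In practice, intra superword layout boils down to allocating each row with a stride rounded up to a whole number of superwords and a superword-aligned base address. A minimal C sketch, assuming 16-bit subwords and 128-bit (16-byte) superwords as in Section 6; alloc_padded is our own name:

#include <stdint.h>
#include <stdlib.h>

/* Allocate an n x m array of 16-bit subwords so that every row starts
 * on a superword boundary: the row stride is rounded up to a multiple
 * of sws and the base is 16-byte aligned.  The pad elements are the
 * price paid for contiguous, aligned superwords (see Section 5.3.1);
 * element [i][j] lives at p[i * stride + j]. */
static int16_t *alloc_padded(size_t n, size_t m, size_t sws, size_t *stride)
{
    void *p = NULL;
    *stride = (m + sws - 1) / sws * sws;  /* round row length up to whole superwords */
    if (posix_memalign(&p, 16, n * *stride * sizeof(int16_t)) != 0)
        return NULL;
    return (int16_t *)p;
}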

5.3.1 Superword layout problem

Often it is not possible to keep every processing node in Pπ busy all of the time, i.e. holes in some superwords will exist. Usually these instances are border effects, and they occur because I cannot always be perfectly covered with Pπ scheduled according to S ⊕ Sπ. Figure 6 illustrates the problem.

These border effects will partly undo the positive effects of the data layout transformations. This is due to the strict order in which SLP requires its data. Border effects would have to be identified in the code to allow non-SLP operations and to replace superword accesses by explicit subword fetches. But that problem is still largely ignored in the current literature. Furthermore, since the sizes of subwords and superwords are inherently different, this will introduce misalignment in the memory map.

Figure 7: Final layout with padding (a) in the transformed iteration space; and (b) in the data cache.

In this paper we analyze one possible way to overcome this problem: padding I such that all nodes in Pπ appear to be busy all of the time. This results in a bigger array, which will affect the data memory performance, as shown in Figure 7. Additionally, we will use a different space-time and/or partitioning transformation from the set of good solutions coming out of the step in Section 5.1, which may have different communication requirements.
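The overhead introduced by padding can be roughly bounded: each of the O(n) rows of the transformed iteration space gains at most sws − 1 pad elements, so

$$\frac{\text{padded elements}}{\text{useful elements}} \ \lesssim\ \frac{(sws - 1)\,n}{n^2} = O\!\left(\frac{1}{n}\right),$$

which anticipates the 1/n behavior of the relative gains measured in Section 6.2.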

6 Evaluation

The main goal of this Section is to show the important improvements in performance and energy obtained by both the Intel and Larsen compilers when our approach is fully applied. Nevertheless, some experiments are also performed in order to show the potential of this approach when all the parameters (S-T transformation, partitioning and data layout) are explored to find the best solution.

For the Intel case, two versions of the compiler have been analyzed (7.1 and 8.1), using as experimental platform a Pentium M (1.5 GHz, 400 MHz FSB, 2 MB L2 cache, 32 KB data cache).

In the case of the Larsen compiler, all the code versions have been analyzed on a retargetable Very Long Instruction Word (VLIW) compiler/simulator called Configurable Reconfigurable Instruction Set Processor (CRISP) [3]. This environment is based on Trimaran [1] and incorporates several power models, including Cacti for caches [14]. SLP operations have been directly inserted into the source code using intrinsic functions. These are function calls recognized by our compiler as native operations. In our experiments we use an architecture with a data path similar to that of a TI C64 DSP, including the SLP instruction set. The data memory, however, has been adapted to reflect the trend in power-aware embedded systems to use small L1 data caches. We have chosen a cache size of 1 Kbyte and a cache line size of 8 bytes.


6.1 Performance improvements on the Intel and Larsen compilers

Three different versions will be considered. The first is outlined in Listing 1 and will be referred to as original in the following. In version ST, described in Listing 3, only the S-T transformation has been applied, whereas in version ST+P+L we also apply the simple mapping function (4) and a data layout that follows the constraints given in Subsection 5.3.

Obviously all the steps of our approach have been manually performed, and the resulting C codes are further processed using both the Intel and the Larsen compilers as back-ends. For all cases the datapath is 128 bits and the sws is 4. We have used 32-bit integer data, although similar results can be obtained for floating point data. All our versions have been run with increasing n. The problem size is approximately n².

n    original   ST      ST+P+L   original/ST   original/(ST+P+L)
24   2575       2204    2323     1.168         1.108
26   3732       3523    2448     1.059         1.524
32   4970       4475    2948     1.110         1.685
38   7427       7426    4601     1.000         1.614
42   9240       9187    5364     1.005         1.722
46   10918      10560   6214     1.033         1.757
50   12538      13401   7077     0.935         1.771
54   14879      14542   7877     1.023         1.888
58   17176      16949   8761     1.013         1.960
62   19785      20290   10062    0.975         1.966

Table 1: Intel Compiler version 7.1. Cycles.

n    original   ST      ST+P+L   original/ST   original/(ST+P+L)
24   2111       2099    2370     1.005717      0.890717
26   2284       4157    2576     0.549434      0.886645
32   3347       5218    3398     0.641433      0.984991
38   6319       7804    4565     0.809712      1.384227
42   7618       10390   5502     0.733205      1.384587
46   9136       12720   6453     0.718238      1.415775
50   10827      14132   7326     0.766133      1.477886
54   12927      16225   8279     0.796733      1.561420
58   15203      18554   9323     0.819392      1.630698
62   17468      21827   9715     0.800293      1.798044

Table 2: Intel Compiler version 8.1. Cycles.

For the Intel compiler case, we have measured the total number of cycles involved in the processing of each of the three versions using the Time Stamp Counter. The results in Tables 1 and 2 clearly show that:

• High speedups are obtained when the full approach is applied.

• A performance improvement cannot be achieved in all cases by applying only the Space-Time mapping.

• For small problem sizes, and using version 8.1 of the compiler, our approach (with a simple partitioning function) does not perform as well as the methodology applied by the compiler on the original code.

As we could expect, the performance improvements increase with the problem size, reaching speedups of around 95% and 80% for versions 7.1 and 8.1 of the compiler respectively.

In most cases the speedup obtained by applying only the ST transformation is much lower than the potential one, achieved when the partitioning and data layout steps are also performed. This behavior is more dramatic for version 8.1 of the compiler.

The experiments for small problem sizes with version 8.1 of the Intel compiler suggest that some optimizations performed in the compiler script for small loops interfere with some of the steps in our approach. This effect should be studied in more detail.

As the main conclusion we would like to remark that the most relevant improvements are achieved by the partitioning and the improved data layout in background memory, which are propagated to the cache.

n     original   ST       ST+P+L   original/ST   original/(ST+P+L)
4     430        547      90       0.79          6.08
24    15825      9736     2315     1.62          4.20
64    114295     66736    12625    1.71          9.05
128   476389     264432   45761    1.80          10.41

Table 3: Larsen Compiler. Cycles

For the Larsen compiler, the number of cycles to compute the three kernels was measured using CRISP. Only problem sizes for which the Larsen compiler could generate SLP code from the original version were considered.

Table 3 shows a more dramatic impact of our approach on this compiler than for the Intel case. For problem sizes that fit into the cache, speedups around 400% are obtained. These performance improvements rise to 1000% when cache issues are involved.

One of the main reasons for this big impact is that the Larsen compiler only generates SLP code for one statement in the original version. On the other hand, when the full approach is applied, the Larsen compiler generates SLP code for both statements.

Nevertheless, the main conclusions that we have extracted from the Intel experiments are also valid for the Larsen compiler. For instance, we can observe that only a small fraction of the performance improvement is due to the Space-Time mapping. The most relevant improvements are achieved by the partitioning and background memory issues.

6.2 Partitioning and padding effects

Three different partitioning functions have been tested in order to show the impact of this methodological parameter on performance. The explored functions are described in Table 4. None of the compilers could be used as a back-end as such. Instead, code modifications were manually performed for all the experiments.

name   P ⊕ S                          Pπ1 ⊕ Sπ1
TP     (0,−1)ᵀ y + (1,1)ᵀ z           (0,−1)ᵀ (κ + 2κs), 0 ≤ κ < 2
TEP    (0,1)ᵀ y + (1,0)ᵀ z            (0,1)ᵀ (κ + 2κs), 0 ≤ κ < 2
IP     ½(1,−1)ᵀ y + ½(1,1)ᵀ z         ½(1,−1)ᵀ ((1−κ)(κs−1) + κ(2−κs)), 0 ≤ κ < 2

Table 4: Different partitioning functions evaluated.

Version TP (Tile Partitioning) is essentially the simple partitioning that we used as an example in Section 5.2 and analyzed in Section 6.1. It is easily verified that it is communication-free.

Version TEP (Tessellation Partitioning) is what could be called the natural subdivision, because it describes the regular two-by-two tessellation of I, i.e. a tiling applied directly to I. For this version less padding is needed, but it leads to some communication.

The final version IP (Interleaving Partitioning) is somewhat special in that the partition function is non-affine, but the restrictions described in Section 5.2.1 obviously hold. The interesting feature of IP is that it merges subwords from both A and B into one superword, eliminating the padding. Without going into detail, this is achieved by the fraction ½ (see Table 4), which allows us to interleave both arrays.

Note that we have been very careful to generate comparable versions, mainly by applying array privatization (i.e. moving array accesses to foreground memory) after SLP source code generation. Neglecting this dilutes the effect of padding, because extra data accesses would take place which can be removed, while the effect of padding cannot. We have also complemented the SLP parallelizing technique with a locality-improving inter-superword layout transformation. This type of layout transformation was briefly discussed in Section 5.3.

All our versions have been run with increasing n. The problem size is approximately n². We have chosen a cache size of 1 Kbyte and a cache line size of 8 bytes. Deliberately, n was kept small such that the total array size could fit into the first level data cache, again to eliminate unwanted effects.

n    original   TP       TEP      IP       gain TEP→IP (%)
4    7.9        4.6      5.7      2.21     61.28
8    45.03      22.24    29.59    12.87    56.52
12   111.21     53.12    70.77    32.35    54.29
16   206.79     97.24    129.59   60.66    53.19
20   331.78     154.59   206.05   97.79    52.54
24   486.19     225.17   300.17   143.74   52.11

Table 5: Energy and energy gain in L1 data cache prior to inter-superword optimizations.

In Table 5 we present the energy dissipated in the data cache after partitioning, but prior to any inter-superword locality optimizations. Clearly, TEP is the worst (50%), and this is largely because the subword reordering is suboptimally implemented. Recall that inter-superword optimizations are required to bring together all the reordering operations which happen to take place on a group of superwords. After this localization, these superwords can be moved to foreground memories. The effect of this optimization is given in Table 6.

n    original   TP       TEP      IP       gain TP→IP (%)   gain TEP→IP (%)
8    54.04      26.29    25.73    23.16    11.89            10
12   133.45     61.58    60.66    56.61    8.06             6.67
16   248.15     111.57   110.29   104.77   6.1              5
20   398.14     176.28   174.62   167.64   4.9              4
24   583.42     255.68   253.66   245.21   4.1              3.33

Table 6: Final energy and energy gain in L1 data cache.

The results in Table 6 further demonstrate a 1/n relation for the relative gain (e.g. gain TP→IP) of the padding effect in the data cache. Intuitively this stems from the fact that padding is inherently a border effect. For these versions the problem size is square and thus the borders are linear, hence the relative effect is 1/n.

We also notice that the amount of padding is less in TEP compared to TP.

We also report the numbers for the original, non-SLP, code. The gains there are at least a factor of 2, demonstrating the potential of subword level parallelism.

n    TP       TEP      gain TEP→TP (%)
8    116.03   127.46   8.96
12   249.57   277.53   10.07
16   432.37   484.64   10.78
20   664.44   748.8    11.27
24   945.78   1070     11.61

Table 7: Energy and energy gain in register file; increase due to communication.

n    original   TP      TEP     IP     gain TP→IP (%)   gain TEP→IP (%)
8    1525       1189    1133    952    19.93            15.98
12   3184       2177    2215    1920   11.81            13.32
16   5514       4543    4577    3272   27.98            28.51
20   9794       6601    6743    6124   7.23             9.18
24   14854      10311   10621   9792   5.03             7.81

Table 8: Total cycles and cycle gain

The effect of communication is most apparent in the register file. This is because SLP communication is achieved through the register file, and this is even more so after the inter-superword optimization mentioned earlier. In Table 7 we verify that TEP is worse than TP and that the relative gain is rather constant. The reason is that communication happens in each iteration.

Finally, the combined effect of padding and communication is demonstrated in the total number of cycles given in Table 8. For small n, the effect due to padding is most significant, hence TP is the worst. However, for increasing n the effects of communication become more and more dominant and TEP turns out to be the worst case.

7 Conclusions

In this paper we show the problems of current compilers in generating efficient SLP code. We show that potential exists to significantly improve these compilers by better exploiting the data layout in background memory. In most cases, performance improvements cannot be obtained by applying only the ST mapping; the full sequence (ST, partitioning and data layout) has to be applied.

In this context, we explore two main effects which are intrinsic to SLP, namely data reordering and padding. Communication arises when sub-processing elements share subwords. In SLP this means that data reordering operations need to be inserted in the code. We have given the criteria for a communication-free partitioning. Padding is used to eliminate border effects, which arise because not every processing node can be kept busy all of the time.

Applying the complete approach with different partitioning functions, the worst and best SLP versions differ by as much as 50% in data cache energy dissipation. This difference can be made smaller by applying inter-superword optimizations. Communication in the context of SLP is achieved through the register file. The associated extra energy dissipation is constant and limited to 11%. The gain in total cycles is as much as 20% in the worst case. Finally, a cross-over point is present in the overall performance where the effects of communication become dominant over padding.

References

[1] Trimaran compiler. Available at http://www.trimaran.org.

[2] Intel Corp. Intel Architecture Optimization Reference Manual. Available at http://developer.intel.com.

[3] P. Op de Beeck, F. Barat, M. Jayapala, and R. Lauwereins. CRISP: a template for reconfigurable instruction set processors. In Proc. 11th International Conference on Field-Programmable Logic and Applications, pages 296–305, Belfast, Northern Ireland, August 2001.

[4] P. Op de Beeck, M. Miranda, F. Catthoor, and G. Deconinck. Background data organisation for the low-power implementation in real-time of a digital audio broadcast receiver on a SIMD processor. In Proc. ACM/IEEE Conference on Design, Automation and Test in Europe, pages 1144–1145, Munich, Germany, March 2003.

[5] U. Eckhardt and R. Merker. Hierarchical algorithm partitioning at system level for an improved utilization of memory structures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 18(1):14–24, January 1999.

[6] M. Kandemir, A. Choudhary, N. Shenoy, P. Banerjee, and J. Ramanujam. A linear algebra framework for automatic determination of optimal data layouts. IEEE Transactions on Parallel and Distributed Systems, 10(2):115–135, February 1999.

[7] Samuel Larsen and Saman Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. ACM SIGPLAN Notices, 35(5):145–156, 2000.

[8] D. Lavenier, P. Quinton, and S. Rajopadhye. Digital Signal Processing for Multimedia Systems, chapter 23. Parhi and Nishitani Eds., March 1999.

[9] R. B. Lee. Subword parallelism with MAX-2. IEEE Micro, 16(4):51–59, August 1996.

[10] R. Leupers. Code generation for embedded processors. In Proc. ACM/IEEE Conference on Design, Automation and Test in Europe, pages 173–179, 2000.

[11] A. W. Lim and M. S. Lam. Maximizing parallelism and minimizing synchronization with affine transforms. In Proceedings of the Twenty-fourth Annual ACM Symposium on the Principles of Programming Languages, pages 201–214, Paris, France, January 1997.

[12] William Pugh. Counting solutions to Presburger formulas: how and why. In SIGPLAN Conference on Programming Language Design and Implementation, pages 121–134, Orlando, FL, USA, 1994.

[13] Rainer Schaffer, Renate Merker, and Francky Catthoor. Combining background memory management and regular array co-partitioning, illustrated on a full motion estimation kernel. In Proceedings of the 13th International Conference on VLSI Design, page 104. IEEE Computer Society, 2000.

[14] S. Wilton and N. Jouppi. CACTI: an enhanced cache access and cycle time model. IEEE Journal of Solid-State Circuits, 31(5):677–688, May 1996.
