
The Trade-off between Implicit and Explicit Data Distribution in Shared-Memory Programming Paradigms

Dimitrios S. Nikolopoulos, Eduard Ayguadé, Theodore S. Papatheodorou, Constantine D. Polychronopoulos, Jesús Labarta

Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, 1308 West Main Street, Urbana, IL 61801, U.S.A.
Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, c/Jordi Girona 1-3, 08034 Barcelona, Spain
Department of Computer Engineering and Informatics, University of Patras, Rion 26500, Patras, Greece

ABSTRACT

This paper explores previously established and novel methods for scaling the performance of OpenMP on NUMA architectures. The spectrum of methods under investigation includes OS-level automatic page placement algorithms, dynamic page migration, and manual data distribution. The trade-off that these methods face lies between performance and programming effort. Automatic page placement algorithms are transparent to the programmer, but may compromise memory access locality. Dynamic page migration is also transparent, but requires careful engineering of online algorithms to be effective. Manual data distribution, on the other hand, requires substantial programming effort and architecture-specific extensions to OpenMP, but may localize memory accesses in a nearly optimal manner.

The main contributions of the paper are: a classification of application characteristics, which identifies clearly the conditions under which transparent methods are both capable and sufficient for optimizing memory locality in an OpenMP program; and the use of two novel runtime techniques, runtime data distribution based on memory access traces and affinity scheduling with iteration schedule reuse, as competitive substitutes for manual data distribution in several important classes of applications.

Keywords

Data distribution, page migration, operating systems, runtime systems, performance evaluation, OpenMP.

1. INTRODUCTION

The OpenMP Application Programming Interface (API) [20] is a portable model for programming any parallel architecture that provides the abstraction of a shared address space to the programmer. The target architecture may be a small-scale desktop SMP, a ccNUMA supercomputer, a cluster running distributed shared-memory (DSM) middleware or even a multithreaded processor. The purpose of OpenMP is to ease the development of portable parallel code, by enabling incremental construction of parallel programs and hiding the details of the underlying hardware/software interface from the programmer. An OpenMP program can be obtained directly from its sequential counterpart, by adding directives around parts of the code that can be parallelized. Architectural features such as the communication medium, or the mechanisms used to implement and orchestrate parallelism in software, are hidden behind the OpenMP backend and do not place any burden on the programmer. The popularity of OpenMP has risen sharply during the last few years. OpenMP is now considered the de facto standard for programming shared-memory multiprocessors.
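
To make the incremental style concrete, the following minimal sketch (not taken from the benchmarks studied later; the array names and sizes are illustrative) shows how a sequential loop becomes parallel by adding a single directive:

      program axpy
      integer n, i
      parameter (n = 1000000)
      double precision x(n), y(n), alpha
      alpha = 2.0d0
      do i = 1, n
         x(i) = 1.0d0
         y(i) = 2.0d0
      end do
!     The directive below is the only change needed to parallelize
!     the loop; thread creation, work scheduling and synchronization
!     are handled by the OpenMP backend.
!$OMP PARALLEL DO PRIVATE(i) SHARED(x, y, alpha)
      do i = 1, n
         y(i) = y(i) + alpha * x(i)
      end do
!$OMP END PARALLEL DO
      end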

It is rather unfortunate that the ease of use of OpenMP comes at the expense of limited scalability on architectures with non-uniform memory access latency. Since OpenMP provides no means to control the placement of data in memory, it is often the case that data is placed in a manner that forces frequent accesses to remote memory modules through the interconnection network. Localizing memory accesses via proper data distribution is a fundamental performance optimization for NUMA architectures. Soon after realizing this problem, several researchers and vendors proposed and implemented NUMA extensions of OpenMP in two forms: data distribution directives similar to those of data parallel programming languages like HPF; and loop iteration assignment directives that establish explicit affinity relationships between threads and data [1, 2, 6, 11, 13, 16, 23].

Integrating data distribution and preserving simplicity and portability are two conflicting requirements for OpenMP. Data distribution for NUMA architectures is a platform-dependent optimization. It requires significant programming effort, usually equivalent to that of message-passing, while its effectiveness is implementation-dependent. On the other hand, injecting transparent data distribution capabilities into OpenMP without modifying the API is a challenging research problem.

Recently, work by the authors has shown that, at least for the popular class of iterative parallel programs with iterative reference patterns (that is, programs that repeat the same parallel computation for a number of iterations), the OpenMP runtime environment can perform implicit data distribution online, using memory reference tracing and an intelligent page migration algorithm [18]. This argument has been evaluated with a synthetic experiment, in which OpenMP programs that were manually tuned with a priori knowledge of the operating system's page placement algorithm were deliberately executed with alternative page placement algorithms. The experiment has shown that no matter what the initial distribution of data is, an intelligent page migration algorithm can relocate pages accurately and in a timely fashion to match the performance of the best automatic page placement algorithm. It is therefore speculated that OpenMP, and shared-memory programming paradigms in general, can rely on implicit rather than explicit data distribution to achieve memory access locality.

1.1 The problems addressed in this paper

There are several circumstances in which well-tuned manual data distribution and binding of threads to memory modules provide considerably better performance than any automatic page placement algorithm [2, 3]. Given the potential of dynamic page migration as a substitute for data distribution, a vital question that remains open is whether dynamic page migration can provide the performance benefits of manual data distribution in such circumstances. A second open problem is the vulnerability of dynamic page migration to fine granularity and non-iterative or irregular reference patterns, which render online page migration algorithms incapable of localizing memory accesses [2]. Before reverting to manual data distribution in these cases, it is important to investigate whether there exist other automated procedures able to implement the same locality optimizations transparently.

The idea pursued in this paper is to replace manual data distribution and manual assignment of threads to memory modules with a combination of runtime memory management algorithms based on dynamic page migration and loop scheduling techniques that implicitly preserve thread-to-memory affinity relationships. The ultimate objective is to localize memory accesses at least as well as manual distribution does, without modifying the OpenMP API.

1.2 Summary of contributions

We conducted an extensive set of experiments with the OpenMP implementations of the NAS benchmarks and a simple LU decomposition code on the Origin2000 [10]. We evaluated the native OpenMP implementations without any a priori knowledge of the memory management strategy of the underlying platform, thus relying solely on the page placement algorithm of the operating system for proper data distribution; the same implementations modified to use manual data distribution with HPF-like directives; and the same implementations linked with UPMlib [19], our runtime system which implements transparent data distribution using memory access tracing and dynamic page migration.

The NAS benchmarks are popular codes used by the parallel processing community to evaluate scalable architectures. LU is an interesting case where dynamic page migration may actually hurt performance [2] and manual data distribution combined with loop iteration assignment to memory modules appears to be the only option for scaling. In this experiment, we tested an alternative implementation of LU in which, instead of manual data distribution, the inner parallel loop is transformed to exploit memory affinity by reusing an initially computed cyclic loop schedule. This transformation ensures that in every execution of the inner parallel loop each processor accesses the same data, thus preserving a strong thread-to-memory affinity relationship.

The findings of the paper are summarized as follows. For strictly iterative parallel codes, and in situations in which data distribution appears to be necessary to improve scalability, a smart dynamic page migration engine matches or exceeds the performance of manual data distribution, under the assumption that the granularity and the memory access pattern of the program are coarse enough to provide the page migration engine with a sufficiently long time frame for tuning data placement online. In this case, dynamic page migration has several inherent advantages, most notably the ability to scan the entire memory address space instead of only distributed arrays, and the ability to identify boundary cases of pages that may be misplaced with manual data distribution.

In fine-grain iterative codes, our page migration engine performs either on par with or modestly worse than manual data distribution. The same happens with codes that have only a few iterations and some form of irregularity in the memory access pattern. In the first case, whether to activate page migration or not is a matter of cost-benefit analysis, using the granularity of the parallel computation as the driving factor for making a decision online. In the second case, the competitive algorithm of the page migration engine may fail to detect the irregularity early and incur bouncing of pages, which in turn contributes unnecessary overhead. Handling irregular access patterns with an online page migration algorithm remains an issue for further investigation.

In LU, although page migration neither improves nor worsens performance, our findings suggest that it is possible to achieve nearly optimal memory access locality by initially assigning the iterations of the innermost loop to processors in a cyclic manner and reusing this schedule in every iteration of the outermost sequential loop. This transformation works well with a first-touch page placement algorithm and does not require manual data distribution, thread placement, or any other platform-specific OpenMP extension.

Overall, the results make a strong case for sophisticated runtime techniques that transparently implement the performance optimizations required to scale an OpenMP program on a NUMA multiprocessor. The paper contributes to a recently established belief that it is possible to use a flat directive-based programming paradigm and yet obtain scalable performance, thus simplifying the process and reducing the cost of developing efficient parallel programs. We consider our work a step towards identifying a programming model that yields the highest speedup with the least programming effort.

1.3 The rest of this paper

The remainder of this paper is organized as follows. Section 2 gives a brief introduction to explicit and implicit data distribution mechanisms for OpenMP and reviews related work. Section 3 explains in detail our experimental setup. Our results are presented in Section 4. Section 5 summarizes our conclusions and gives some directions for future work.

2. IMPLICIT AND EXPLICIT DATA DISTRIBUTION

The idea of introducing manual data distribution directives in shared-memory programming paradigms originates in earlier work on data parallel programming languages [8, 7]. Data-parallel languages use array distributions as the starting point for the development of parallel programs. Arrays are distributed along one or more dimensions, in order to let processors perform the bulk of the parallel computation on local data and communicate rarely, in a loosely synchronous manner. The communication is handled by the compiler, which has full knowledge of the data owned by each processor.

In order to use a data parallel programming model on a NUMA shared-memory multiprocessor, data distribution directives must be translated into distributions of the virtual memory pages that contain the array elements assigned to each processor. The communication is carried out directly through loads and stores in shared memory. From a performance perspective, the data parallel style of programming is attractive for NUMA architectures, because it explicitly establishes strong affinity links between computation and data.

Both researchers and vendors have implemented data parallel extensions to either platform-specific shared memory programming models or OpenMP [2, 3, 13]. In general, these extensions serve two purposes. The first is the distribution of data at page-level or element-level granularity. At page-level granularity, the unit of data distribution is a virtual memory page. At element-level granularity, the unit of data distribution is an individual array element. Since an element-level array distribution may assign two or more elements originally placed in the same page to different memory modules, the compiler needs to reorganize array subscripts in order to implement the specified mapping of elements. The second purpose of applying data parallel extensions to shared-memory programming models is to explicitly assign loop iterations to specific memory modules for exploiting locality. The objective is to assign each iteration to the memory module that contains all, or at least a good fraction, of the data accessed by the loop body during that iteration. Data-parallel extensions of OpenMP have demonstrated a potential for significant performance improvements in simple codes like LU decomposition and SOR, albeit by modifying the native programming model and sacrificing portability [2, 3, 16].

As part of an effort to provide transparent runtime optimizations for OpenMP, we have developed a runtime system which acts as a local optimizer of memory accesses within an OpenMP program [17, 19]. This runtime system, called UPMlib, monitors the reference rates from each processor to each page in memory and applies competitive page migration algorithms at specific execution points, at which the reference counters reflect accurately the exact memory access pattern of the program. In iterative parallel codes, these points are the ends of the outer iterations of the sequential loop that encloses the parallel computation. The technique is both accurate and effective in relocating early at runtime the pages that concentrate frequent remote accesses. In most cases, the runtime system stabilizes the placement of pages after the execution of the first iteration, and this placement is optimal with respect to the reference pattern, in the sense that each page is placed on the node that minimizes the latency of remote accesses [17].
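
The following sketch conveys the flavor of such a competitive check as it could run at the end of an outer iteration; it is a simplified illustration rather than the actual UPMlib code, and the counter array nref, the threshold and the migrate_page call are hypothetical placeholders:

      subroutine check_migrations(nref, home, npages, nnodes)
!     Simplified sketch of a competitive page migration check.
!     nref(p,nd) is a hypothetical snapshot of the per-node reference
!     counters for page p, home(p) is the node that currently holds
!     page p, and migrate_page is a hypothetical runtime/OS call that
!     moves a page to a target node.
      integer npages, nnodes
      integer nref(npages, nnodes), home(npages)
      integer p, nd, best, thresh
      parameter (thresh = 2)
      do p = 1, npages
         best = home(p)
         do nd = 1, nnodes
            if (nref(p, nd) .gt. nref(p, best)) best = nd
         end do
!        Migrate only if the heaviest remote accessor references the
!        page at least thresh times more often than its home node,
!        so that sporadic remote accesses do not trigger migrations.
         if (best .ne. home(p) .and.
     &       nref(p, best) .ge. thresh * nref(p, home(p))) then
            call migrate_page(p, best)
            home(p) = best
         end if
      end do
      end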

Our dynamic page migration engine can potentially be used in place of manual data distribution, because monitoring of reference counters at well-defined execution points can provide the runtime system with a complete and accurate map of thread-to-memory affinity relationships. This map would otherwise be established with data distribution directives [18]. The runtime system may even identify situations in which manual data distribution places pages in a non-optimal manner, e.g. due to the inability to map the whole address space of a program with distribution directives, or the absence of appropriate distributions for certain access patterns.

The results obtained so far from using UPMlib as a transparent data distribution tool prove that intelligent dynamic page migration algorithms can render OpenMP programs immune to the page placement algorithm of the operating system [18]. This means that programs always perform as well as they do with the best-performing automatic page placement algorithm, no matter how their data is initially placed in memory. Unfortunately, there are also cases in which manual data distribution outperforms the best automatic page placement algorithms used in contemporary operating systems by sizeable margins. It remains to be seen whether a dynamic page migration engine can provide the same advantage over automatic page placement algorithms in such cases. A second restriction of our framework is that it conforms well only to strictly iterative codes. It is questionable whether iterative codes that do not have an iterative access pattern, or non-iterative codes, can benefit from dynamic page migration. In these cases, if data distribution is not an option for the sake of portability, it is more likely that careful scheduling of threads according to the location of data in memory is the most appropriate way to implement the required locality optimizations transparently.

3. EXPERIMENTAL SETUP

We conducted two sets of experiments, one using the NAS benchmarks and one using a simple hand-crafted LU decomposition. The experiments were executed on a 64-processor SGI Origin2000, with MIPS R10000 processors running at 250 MHz, 32 Kbytes of split L1 cache per processor, 4 Mbytes of unified L2 cache per processor and 12 Gbytes of DRAM memory. The experiments were conducted on an idle system, using the IRIX implementation of cpusets for runs with fewer than 64 processors.


3.1 NAS benchmarks

In the first set of experiments we executed five benchmarks from the NAS suite, implemented in OpenMP by researchers at NASA Ames [10]. We used the class A problem sizes, which fit the scale of the system on which we ran the experiments. The benchmarks are BT, SP, CG, FT and MG. BT and SP are complete CFD applications, while CG, MG and FT are small computational kernels. BT and SP solve the Navier-Stokes equations in three dimensions, using different factorization methods. CG approximates the smallest eigenvalue of a large sparse matrix using the conjugate-gradient method. FT computes the Fourier transform of a 3-dimensional matrix. MG computes the solution of a 3-D Poisson equation, using a V-cycle multigrid method. All codes are iterative and repeat the same parallel computation for a number of iterations that correspond to time steps.

The native OpenMP implementations of the benchmarks are tuned by their providers specifically for the Origin2000 memory hierarchy [10]. The codes include a cold-start iteration of the complete parallel computation to distribute data among processors on a first-touch basis¹ [14]. The cold-start iteration actually performs a BLOCK distribution of the dimension of the arrays accessed in the outermost level of parallel loop nests. Its functionality is equivalent to that of a DISTRIBUTE(*,...,*,BLOCK) directive².
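
A cold-start iteration of this kind can be as simple as touching the arrays with the same parallel loop structure as the main computation. The sketch below is illustrative (the array name and extents are not taken from the benchmarks); under first-touch placement it yields a BLOCK distribution of the outermost dimension:

      program coldstart
      integer nx, ny, nz, i, j, k
      parameter (nx = 64, ny = 64, nz = 64)
      double precision u(nx, ny, nz)
!     Each thread initializes the z-slabs it will later compute on,
!     so first-touch placement maps the corresponding pages to the
!     thread's node, i.e. a BLOCK distribution of the z-dimension.
!$OMP PARALLEL DO PRIVATE(i, j, k) SHARED(u)
      do k = 1, nz
         do j = 1, ny
            do i = 1, nx
               u(i, j, k) = 0.0d0
            end do
         end do
      end do
!$OMP END PARALLEL DO
      end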

Since data distribution is already hard-coded in the benchmarks, in order to evaluate the performance of the same benchmarks without manual data distribution we explore two options. The first is to remove the cold-start iteration. Since IRIX's default page placement algorithm happens to be first-touch, IRIX eventually implements the same data placement that a manual BLOCK distribution would otherwise perform statically at array declaration points. The only significant difference is that the benchmarks pay the additional cost of several TLB faults, which occur on the critical path of the parallel computation. The second option is to use the alternative automatic page placement algorithm of IRIX, i.e. round-robin. In that case, we artificially devise a scenario in which the automatic page placement algorithm of the operating system is simply unable to perform proper data distribution. Round-robin is a realistic choice and is being used in many production settings (e.g. NCSA's large-scale Origin2000s) as the default data placement algorithm.

We executed four versions of each benchmark. In all versions, the cold-start iteration of the original code was commented out. The versions are described in the following:

• OpenMP+rr: This is the native OpenMP code, executed with round-robin page placement. This version corresponds to a scenario in which the programmer has absolutely no knowledge of the placement of data in memory and the OS happens to use a non-optimal automatic page placement algorithm. OpenMP+rr serves as a baseline for comparisons.

• OpenMP+ft: This is similar to OpenMP+rr, except that first-touch page placement is used instead of round-robin. In this case, the operating system happens to use the right page placement algorithm. Any additional overhead compared to an implementation with manual data distribution is attributed to TLB fault handling on the critical path of the parallel computation.

• OpenMP+DD: This is the native OpenMP code, extended with !$SGI DISTRIBUTE directives to perform manual data distribution at the declaration points of specific arrays. The pages not included in distributed arrays are distributed by IRIX on a first-touch basis.

• OpenMP+UPMlib: This is the native OpenMP code, executed with round-robin page placement and linked with UPMlib to use our page migration engine.

¹ First-touch is the default page placement algorithm of IRIX, the Origin2000 operating system.
² This was verified experimentally, by obtaining the mappings of pages to physical memory while executing two versions of the benchmarks, one with the cold-start iteration and one with an explicit !$SGI DISTRIBUTE(*,*,*,BLOCK) directive at the declaration points of shared arrays.

Table 1: Data distributions applied to the OpenMP implementations of the NAS benchmarks.

  Benchmark   Distributions
  BT          u, rhs, forcing: BLOCK along the z-direction
  SP          u, rhs, forcing: BLOCK along the z-direction
  CG          z, q, r, x: one-dimensional BLOCK
  FT          u0, u1: one-dimensional BLOCK
  MG          u, v, r: one-dimensional BLOCK

In the OpenMP+DD versions, we used BLOCK distributions for the array dimensions accessed by the outermost index of parallelized loop nests. This implementation is coherent with the HPF implementation of the same benchmarks, originally described in [5]. The distributions were implemented with the SGI compiler's DISTRIBUTE directive [3], the semantics of which are similar to the corresponding HPF distribution statement. As mentioned before, this is equivalent to using the cold-start iteration implemented in the original version of the benchmarks, combined with first-touch page placement. The only difference is that in the latter case, data is distributed in the executable part, rather than in the declaration part. We actually tested both implementations and found no difference in data distribution between the directive-based implementation and the hard-coded implementation. We emphasize that in both implementations, data distribution overhead is not on the critical path, therefore it is not reflected in the measurements.
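
For illustration, a declaration-point distribution in this style (the directive syntax follows Figure 1 and footnote 2; the array shape is illustrative, not the actual benchmark declaration) looks as follows:

      program distribute_example
!     Sketch of a declaration-point BLOCK distribution in the SGI
!     directive style used by the OpenMP+DD versions: the last (z)
!     dimension is distributed BLOCK-wise across the nodes and the
!     remaining dimensions are not distributed.
      double precision u(5, 64, 64, 64)
!$SGI DISTRIBUTE u(*,*,*,BLOCK)
      u(1, 1, 1, 1) = 0.0d0
      end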

Table 1 summarizes the data distributions applied to the benchmarks. Two benchmarks, BT and SP, might benefit from using data redistribution within iterations, in addition to data distribution. This happens because the memory access patterns of these benchmarks have phase changes. In both benchmarks, there is a phase change between the execution of the solver in the y-direction and the execution of the solver in the z-direction, due to the initial placement of data, which exploits spatial locality along the x- and y-directions. We implemented versions of BT and SP in which the arrays were BLOCK-redistributed along the y-dimension at the entry of z_solve and BLOCK-redistributed again along the z-direction at the exit of z_solve. These redistributions were implemented using the REDISTRIBUTE directive. Although theoretically necessary in BT and SP, the redistributions had unacceptable overhead and poor scalability when applied in our implementations³. The results reported in the following are taken from experiments without runtime data redistribution.

In the OpenMP+UPMlib version, the benchmarks use the iterative page migration algorithm outlined in Section 2 and described in more detail in [17, 18]. Page migration is applied at the end of the iterations of the outer loop that encloses the parallel computation, after collecting a map of the memory accesses of the program. This algorithm performs online data distribution. Our runtime system also includes a record-replay algorithm [18], which emulates data redistribution at page-level granularity. Unfortunately, applying record-replay in BT and SP, much like manual data redistribution, introduces overhead which outweighs the benefits from reducing the rate of remote memory accesses.

3.2 LU

The simple LU code shown in Figure 1 (left) is a representative example of a parallel code not easily amenable to dynamic page migration for online memory access locality optimizations. Although LU is iterative, the amount of computation performed in each iteration is progressively reduced, while the data touched by a processor may differ from iteration to iteration, depending on the algorithm that assigns iterations to processors. The iterative page migration algorithm of UPMlib is not of much use in this code. Neither block nor cyclic data distribution can improve memory access locality significantly. A cyclic distribution along the second dimension of a is helpful though, because it distributes the computation evenly across memory modules and, implicitly, processors. In order to have both balanced load and good memory access locality, the code should be modified as shown in the right part of Figure 1. In addition to the cyclic distribution, iterations should be scheduled so that each iteration of the j loop is executed on the node where the j-th column of a is stored. On the Origin2000, this is accomplished with the AFFINITY directive of the SGI compiler.

We have implemented five versions of LU, using a 1400×1400 matrix⁴. Three versions, OpenMP+ft, OpenMP+rr, and OpenMP+DD, are produced in the same manner as the corresponding implementations of the NAS benchmarks described in Section 3.1. The OpenMP+DD version uses cyclic distribution along the second dimension of a, primarily for load balancing. In the same version of LU, we attempted to add the AFFINITY clause shown in Figure 1, in order to ensure that the j-th iteration of the innermost loop is always scheduled for execution on the node where the j-th column of a resides. Unfortunately, the SGI implementation of the AFFINITY clause has high synchronization overhead and flattens the speedup of LU. This phenomenon may be attributed to an outmoded version of the SGI compiler we used.

³ The overhead of data redistribution is most likely attributable to a non-scalable implementation on the Origin2000 [12]. Our results agree with the results in [5].
⁴ We used a dense matrix merely to demonstrate the impact of data distribution.

The fourth version (OpenMP+UPMlib) uses our page migration engine, but instead of using an iterative algorithm, it uses a periodic algorithm which is invoked upon expiration of a timer [17]. The period of the algorithm is manually set to 50 ms, in order for the algorithm to have the opportunity to migrate pages within an iteration of the outermost k loop. Ideally, the mechanism would detect a change of the reference pattern at the beginning of each iteration of the k loop and migrate pages to realign data with the processors that work on the leftmost columns of the submatrix accessed during that iteration. Realistically, since it is not feasible to scan and redistribute the entire array once every 50 ms, the algorithm samples a few pages of the array whenever the timer quantum expires.

The fifth version (shown in Figure 2) follows an alternative path for optimizing memory access locality. In this version, the inner j loop is transformed into a loop iterating from 0 to nprocs-1, where nprocs is the number of processors executing in parallel. Each processor computes locally its own set of iterations to execute. Iterations are assigned to processors in a cyclic manner and during the k-th iteration of the outer loop, each processor executes a subset of the iterations that the same processor executed during the (k-1)-th iteration of the outer loop. For example, assume that n=1024 and the program is executed with 4 processors. In the first iteration of the k loop, processor 0 executes iterations 2, 6, 10, 14, ..., processor 1 executes iterations 3, 7, 11, 15, ..., and so on. In the second iteration of the k loop, processor 0 executes iterations 6, 10, 14, ..., processor 1 executes iterations 7, 11, 15, ..., and so on. We call this version of LU OpenMP+reuse.

The purpose of the cyclic assignment of iterations is to have each processor reuse the data that it touches during the first iteration of the k loop. If the program is executed with a first-touch page placement algorithm, such a transformation achieves good localization of memory accesses. In some sense, the transformation resembles cache-affinity loop scheduling [15]. The difference is that here the transformation is used to exploit cache reuse and, at the same time, localize memory accesses.

We believe that such a transformation can be relatively easy to apply for a restructuring compiler, without requiring a new OpenMP directive. In the worst case, the transformation requires an extension to the SCHEDULE clause of the PARALLEL DO directive, which directs the compiler to compute the initial iteration schedule (cyclic in this case) and reuse it in every subsequent invocation of the loop. One way to do this is to assign names to schedules, e.g. using a SCHEDULE(name:policy,chunk) clause. In the case of LU, this clause would be written as SCHEDULE(lu_schedule:cyclic,1). If the compiler encounters the clause for the first time, it computes the schedule. Otherwise, in every subsequent occurrence of a clause that names lu_schedule, the compiler identifies the precomputed schedule by its name and reuses it. Note that the iteration schedule reuse can be exploited across different instances of the same loop or across different loops.
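
Under this hypothetical extension, the parallel loop of LU would be written roughly as follows; the SCHEDULE syntax shown is the proposed one and is not part of OpenMP or of any existing compiler:

!     Proposed (hypothetical) named-schedule clause: on the first
!     encounter the compiler computes a cyclic schedule with chunk 1
!     and names it lu_schedule; every later encounter reuses the same
!     schedule, so each processor keeps executing the same columns.
!$OMP PARALLEL DO SCHEDULE(lu_schedule:cyclic,1) PRIVATE(i,j)
      do j = k+1, n
         do i = k+1, n
            a(i,j) = a(i,j) - a(i,k)*a(k,j)
         enddo
      enddo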


      program LU
      integer n
      parameter (n=problem_size)
      double precision a(n,n)
      do k=1,n
         do m=k+1,n
            a(m,k)=a(m,k)/a(k,k)
         end do
!$OMP PARALLEL DO PRIVATE(i,j)
         do j=k+1,n
            do i=k+1,n
               a(i,j)=a(i,j)-a(i,k)*a(k,j)
            enddo
         enddo
      enddo
      end

      program LU
      integer n
      parameter (n=problem_size)
      double precision a(n,n)
!$SGI DISTRIBUTE a(*,CYCLIC)
      do k=1,n
         do m=k+1,n
            a(m,k)=a(m,k)/a(k,k)
         end do
!$SGI PARALLEL DO PRIVATE(i,j)
!$SGI& AFFINITY(j)=DATA(a(i,j))
         do j=k+1,n
            do i=k+1,n
               a(i,j)=a(i,j)-a(i,k)*a(k,j)
            enddo
         enddo
      enddo
      end

Figure 1: A simple LU code implemented with plain OpenMP (left) and with data distribution and affinity scheduling (right).

      program LU
      integer n
      parameter (n=problem_size)
      double precision a(n,n)
      integer nprocs
      nprocs = OMP_GET_MAX_THREADS()
      do k=1,n
         do m=k+1,n
            a(m,k)=a(m,k)/a(k,k)
         enddo
!$OMP PARALLEL DO PRIVATE(i,j,myp,jlow),
!$OMP& SHARED(a,k)
         do myp = 0, nprocs-1
            jlow = ((k / nprocs) * nprocs) + 1 + myp
            if (myp .lt. mod(k, nprocs)) jlow = jlow + nprocs
            do j=jlow,n,nprocs
               do i=k+1,n
                  a(i,j) = a(i,j) - a(i,k)*a(k,j)
               enddo
            enddo
         enddo
      enddo
      end

Figure 2: LU with iteration schedule reuse for expressing thread-to-memory affinity.

If the iteration schedule reuse transformation is beyond the capacity of a compiler, it can still be applied using a semi-automatic procedure. The idea is to run one iteration of the program using the BLOCK distribution and another iteration using the CYCLIC distribution and iteration schedule reuse. The two iterations are used as inspectors to probe the performance of the two distributions, by measuring the number of remote memory accesses to a. The probing run will clearly indicate the advantage of the CYCLIC distribution, since the number of remote memory accesses will start decreasing from the second iteration of the k loop.

4. RESULTS

The results from our experiments are summarized in Figures 3, 4, 5 and 6. Figure 3 shows the execution times of the benchmarks on a varying number of processors. Note that the y-axis in the charts is logarithmic and that the minimum and maximum values of the y-axis are adjusted according to the execution times of the benchmarks for the sake of readability. Figure 4 shows the normalized speedup of the different versions of each benchmark. Normalization is performed against the execution time of the OpenMP+rr version. On a given number of processors, the normalized speedups are obtained by dividing the execution time of the OpenMP+rr version by the execution time of the other versions. Note that on two occasions, CG and LU, superlinear speedups occur between 4 and 16 processors due to working set effects. A large fraction of the capacity and conflict cache misses of the 1-processor execution is converted into hits due to the additional cache space provided in the parallel execution.

Figure 5 shows histograms of memory accesses per node, divided into local and remote memory accesses⁵, for three programs: BT, MG and LU. The selected programs are the ones that illustrate the most representative trends revealed by the experiments. The histograms are obtained from instrumented executions of the programs on 32 processors (16 nodes of the Origin2000). The instrumentation calculates the total number of local and remote memory accesses per node, by reading the page reference counters in the entire address space of the programs.

Figure 6 shows the overhead of the page migration engine, normalized to the execution times of the benchmarks on 64 processors. This overhead is computed with timing instrumentation of UPMlib. It must be noted that UPMlib overlaps the execution of page migration with the execution of the program. The runtime system uses a thread which is invoked periodically and runs the page migration algorithm in parallel with the program. Although the overhead of page migrations is masked, the thread used by UPMlib interferes with one or more OpenMP threads at runtime, when the programs use all 64 processors. We conservatively assume that this interference causes a 50-50 processor sharing and we estimate the net overhead of UPMlib as 50% of the total CPU time consumed by the page migration thread.

We comment on the performance of the NAS benchmarks and LU in separate sections.

⁵ Remote in the sense that the node is accessed by processors residing in other nodes.


Figure 3: Execution times. [Charts not reproduced: one panel per benchmark (NAS BT, NAS SP, NAS CG, NAS FT, NAS MG, LU), plotting execution time (log scale) against the number of processors for the OpenMP+rr, OpenMP+ft, OpenMP+DD and OpenMP+UPMlib versions, plus OpenMP+reuse for LU.]


Figure 4: Normalized speedups. [Charts not reproduced: one panel per benchmark (NAS BT, NAS SP, NAS CG, NAS FT, NAS MG, LU), plotting speedup normalized to OpenMP+rr on 1 to 64 processors for the OpenMP+rr, OpenMP+ft, OpenMP+DD and OpenMP+UPMlib versions, plus OpenMP+reuse for LU.]


Figure 5: Per-node memory accesses of BT, MG and LU during their execution on 32 processors (16 nodes of the Origin2000). The memory accesses are divided into local (gray) and remote (black) memory accesses. [Histograms not reproduced: panels NAS BT (OpenMP+rr, OpenMP+DD, OpenMP+UPMlib), NAS MG (OpenMP+rr, OpenMP+DD, OpenMP+UPMlib) and LU (OpenMP+ft, OpenMP+UPMlib, OpenMP+reuse); x-axis: nodes; y-axis: memory accesses.]


4.1 NAS benchmarks

The results from the experiments with the NAS benchmarks show a consistent trend of improvement from using manual data distribution in the OpenMP implementation. The trend is observed clearly from the 16-processor scale and beyond. At this scale, the ratio of remote-to-local memory access latency approaches 2:1 on the Origin2000 and data placement becomes critical. On 16 processors, data distribution reduces execution time by 10%-20%. On 32 or more processors, data distribution improves performance by a wider margin, which averages 34% and in one case (CG) is as much as 81%.

The difference between manual data distribution and automatic round-robin page placement is purely a matter of the memory access locality achieved with the two distributions. Figure 5 shows that in BT, round-robin placement of pages forces 55% of memory accesses to be satisfied remotely. The BLOCK distribution reduces the fraction of remote memory accesses to 25%. Notice that this fraction is still significant enough to justify further optimization. In BT for example, this fraction amounts to approximately 3.3 million remote memory accesses. Assuming that the base local memory access latency costs 300 ns (as stated in the system's technical specifications) and that at the 32-processor scale the remote-to-local memory access ratio is 3.3:1 (accounting for the difference between reads and writes [4]), the savings in execution time from converting all remote memory accesses to local ones are estimated at 2.64 seconds. The actual savings are expected to be more significant, because the localization of memory accesses reduces contention at the memory modules and the network interfaces. Contention accounts for an additional overhead of approximately 50 ns per contending node on the Origin2000 [9].

The difference between manual data distribution and first-touch placement (OpenMP+DD vs. OpenMP+ft) occurs because the former distributes pages in a blocked fashion before the beginning of the computational loop, while the latter performs the same distribution on the fly, during the execution of the first parallel iteration. The OpenMP+ft version pays an additional cost for TLB fault handling on the critical path of the parallel computation. OpenMP+ft and OpenMP+DD incur practically the same number of remote memory accesses (averaging 3.3 and 3.4 million per node respectively).

The behavior of our dynamic page migration engine is explained by classifying the benchmarks in two classes: coarse-grain benchmarks with regular memory access patterns and fine-grain benchmarks with potentially irregular memory access patterns.

4.1.1 Coarse-grain benchmarks with regular memory access patterns

The first class includes benchmarks with a large number of iterations, relatively long execution times (in the order of several hundreds of milliseconds per iteration) and a regular memory access pattern. The term regular refers to the distribution of memory accesses, i.e. the memory accesses, both local and remote, are uniformly distributed among the nodes. BT and SP belong to this class. Figure 5 illustrates the regularity in the access pattern of BT. SP has a very similar memory access pattern.

In these benchmarks, our page migration engine not only matches the performance of manual data distribution, but also outperforms it by a noticeable margin, no less than 10% and in one case (BT on 64 processors) by as much as 30%. This result implies that manual data distribution does not identify accurately the complete memory access pattern of the program. It handles only pages that belong to distributed arrays. This is verified by the chart in the upper right corner of Figure 5. Compared to OpenMP+DD, OpenMP+UPMlib reduces the number of remote memory accesses per node by 31% on average.

The page migration engine can track and optimize the reference pattern of more pages than just those belonging to distributed arrays, including actively shared pages which are initially placed in the wrong node by the operating system, and pages containing array elements or scalar data for which no standard distribution is adequate to improve memory access locality. The capability of accurately relocating these pages is inherent in the page migration engine, simply because the runtime system uses online information obtained from the actual execution, rather than the programmer's understanding of the problem. This capability is also useful whenever manual data distribution misplaces pages with respect to the frequency of accesses. This can happen in boundary cases, e.g. if the array is not page-size aligned in memory and processors contend for pages at the block boundaries, or when there are pages that contain data from both distributed and non-distributed arrays or scalars. Finally, the benchmarks are coarse enough to outweigh the cost of page migration (see Figure 6) and execute enough iterations to compensate for any undesirable page migration effect, e.g. ping-pong [17].

Note that the reduction of remote memory accesses alone is not the only factor accounting for the performance improvements on the 64-processor scale. The OpenMP+DD version of BT, for example, is about 7 seconds slower than the OpenMP+UPMlib version. A rough estimation of the savings from reducing the number of memory accesses yields an improvement of at most 2 seconds. We believe that further improvements are enabled by the alleviation of contention at memory modules and network links. Unfortunately, the Origin2000 lacks the hardware to quantify the effect of contention. Indirect experiments for assessing this effect are a subject of investigation.

4.1.2 Fine-grain benchmarks with potentially irregular memory access patterns

The second class includes fine-grain benchmarks with a small number of iterations and relatively short total execution time (in the order of a few seconds). CG, MG and FT belong to this class. Dynamic page migration performs on par with manual data distribution only in CG. In FT and MG, manual data distribution performs clearly better. Two reasons are likely to explain this trend. The first is the computational granularity. If the execution time per iteration is too short, the page migration engine might not have enough time to migrate poorly placed pages in a timely manner, if at all. The second reason is the memory access pattern of the program itself.


Figure 6: Relative overhead of the page migration engine (black part). The overhead is normalized to the execution time of each benchmark. [Chart not reproduced: one bar per benchmark (BT, SP, CG, FT, MG, LU).]

It might be the case that, due to the access pattern, the competitive page migration criterion of the runtime system causes bouncing of pages between nodes, or is simply unable to migrate pages because the reference rates do not indicate potential receivers. A node may reference a remote page more frequently than any other node, but migrating the page there may not justify the cost.

MG exposes both problems. As shown in Figure 5, MG has an irregular access pattern, in the sense that memory accesses are not distributed uniformly among the nodes. Three nodes (3, 13 and 14) appear to concentrate more memory accesses than the other nodes, while one (node 10) appears to be accessed infrequently. Manual data distribution alleviates this irregularity and results in a roughly balanced distribution of both local and remote memory accesses. Dynamic page migration, on the other hand, reduces the number of remote memory accesses by 36%, but does not deal with the irregularity of the memory access pattern. The pattern remains practically unaltered, albeit with fewer remote memory accesses. The remote memory accesses of the OpenMP+DD version are significantly fewer than those of the OpenMP+UPMlib version. Figure 6 shows that on 64 processors, the interference between the thread that runs UPMlib code and the OpenMP threads accounts for 39% of the total execution time of the benchmark. A closer look at the page migration activity of MG revealed that UPMlib executes about 1400 unnecessary page migrations. These page migrations are for pages that simply ping-pong between nodes. Instead of reducing the number of remote memory accesses, these pages contribute a noticeable overhead which cannot be masked.

To investigate further why our page migration engine is inferior to manual data distribution in MG and FT, we conducted a synthetic experiment in which we scaled both benchmarks by doubling the amount of computation therein. We made this modification without changing the memory access pattern or the problem size of the benchmarks. We simply doubled the number of iterations of the outer sequential loop. The results from these experiments are reported in Figure 7.

Figure 7 shows two diverging trends. In MG, UPMlib seems unable to match the performance of manual data distribution. In fact, the margin between the two appears to be wider in the scaled version of the benchmark. At first glance, this result seems surprising. Timing instrumentation of UPMlib indicates that in the scaled version of MG, the overhead of page migration accounts for 23% of the total execution time, which is a little more than half of the overhead observed in the non-scaled version, as expected. Unfortunately, as Figure 8 shows, the fraction of remote memory accesses of MG seems to increase with page migration.

FT presents a different picture. Although there exists some bouncing of pages, increasing the number of iterations enables stability of page placement by the migration engine in the long term. This means that page ping-pong ends early enough to be almost harmless for performance. The relative overhead of page migration in the scaled version of FT does not exceed 7% of the total execution time.

4.2 LU

We turn our attention to LU, which is the most challenging benchmark for our page migration engine. We remind the reader that a periodic timer-based algorithm is used in LU, instead of the iterative page migration algorithm used in the NAS benchmarks, and that the period is selected to let the page migration engine perform at least a few migrations within a single iteration of the outer k loop. We also note that first-touch is a poor choice for automatic data distribution in LU and a cyclic distribution of pages must be used instead.

As Figures 3 and 4 show, our page migration engine does provide some measurable improvement over first-touch page placement. Figure 5 shows that, in contrast to MG, UPMlib is able to alleviate the irregularity of the memory access pattern which is imposed by first-touch. The reduction of remote memory accesses (more than 40%) is enough to speed up the benchmark. Nevertheless, cyclic data distribution (OpenMP+DD) outperforms our page migration engine, indicating that load balancing is important and that any localization of memory accesses by our page migration engine is not sufficient. Timeliness is the major problem of the page migration algorithm in LU. Although the page migration engine is able to detect the iterative shifts of the memory access pattern, it is unable to migrate pages ahead of time and reduce the impact of these shifts.

The OpenMP+reuse version outperforms the other versions by far, yielding an improvement of more than 50% on 64 processors. The result is encouraging, because the iteration schedule reuse transformation does not require manual data distribution. It relies on first-touch and a simple cyclic distribution of iterations to processors, which is actually computed directly from the loop bounds and therefore locally on each processor. Figure 5 suggests that the iteration schedule reuse transformation is very effective in reducing the number of remote memory accesses. This happens because each processor computes repeatedly on data that the processor touches first during the first iteration of the k loop. The transformation also has a positive effect on cache block reuse.


Figure 7: Execution times and normalized speedup of the scaled MG and FT benchmarks. [Charts not reproduced: panels NAS MG and NAS FT; x-axis: processors; curves: OpenMP+rr, OpenMP+ft, OpenMP+DD, OpenMP+UPMlib.]

Figure 9 shows the number of secondary cache misses and external invalidations during the execution of the various versions of LU on 64 processors. The numbers are aggregates and were collected using the IRIX libperfex interface to the MIPS R10000 hardware counters. It is evident that iteration schedule reuse maximizes cache block reuse, an improvement which is not accomplished by any data distribution algorithm alone. The distinctive feature of the transformation is that it maintains an implicit binding between threads and data, which is persistent across the iterations of LU and enforces both cache and memory affinity.

The iteration schedule reuse transformation seems to be simple enough for a compiler, provided that the compiler is able to detect that the reference pattern of LU moves along the diagonal to rectangular submatrices of progressively smaller size. Since the case for implementing this optimization in the compiler may be weak, a more realistic implementation would describe the transformation as a parameter to the SCHEDULE clause of the OpenMP PARALLEL DO construct, as outlined in Section 3. The interpretation of the clause would be to construct a cyclic schedule for the loop and reuse it whenever the loop is executed.

We also investigated whether the combination of manual cyclic distribution and iteration schedule reuse (i.e. an OpenMP+DD+reuse version) further improves the scalability of LU. Such a trend was not observed in the experiments. The difference in performance between the OpenMP+reuse and the OpenMP+DD+reuse versions of LU was within a range of ±1.5%. This result is instructive and suggests that iteration schedule reuse appears to be the only truly necessary extension to OpenMP for expressing memory affinity relationships.

5. CONCLUSIONS

We conducted a detailed evaluation of alternative methods for improving the scalability of OpenMP on NUMA multiprocessors by localizing memory accesses. The simplest approach is to use the existing automatic page placement algorithms of contemporary operating systems, which include some NUMA-friendly policies such as first-touch. The experiments have shown that this solution is inadequate.


Figure 8: Per-node memory accesses of the scaled version of MG. [Histograms not reproduced: panels OpenMP+rr, OpenMP+DD, OpenMP+UPMlib; x-axis: nodes; y-axis: memory accesses, divided into local and remote.]

Figure 9: Accumulated L2 cache misses and external invalidations in LU. [Chart not reproduced: one group of bars per LU version (ft, UPMlib, DD, reuse) for secondary data cache misses and for external invalidations.]

At the moderate scale of 16 processors, automatic page placement algorithms underperform static data distribution by 10%-20%, while at larger scales their performance shows diminishing returns. The second step in attacking the problem, i.e. introducing data distribution directives in OpenMP, reaches performance levels which verify the common belief that a data parallel programming style is likely to solve the problem of memory access locality on NUMA architectures. Nevertheless, the experiments show that manual data distribution is not a panacea. The same or even better performance can be obtained from a carefully engineered transparent data distribution engine based on dynamic page migration. This is feasible under the constraints of granularity, regularity and repeatability of the memory access pattern. Page migration takes the leading edge in certain cases because it relies on complete and accurate online information for all memory accesses, rather than on the programmer's understanding of the memory access pattern.

The constraints of granularity, repeatability and irregularity are important enough to justify deeper investigation, since several parallel codes have some or all of these properties. Our experiments with LU, a program with a non-repeatable but fairly simple access pattern, gave us a hint that a balanced schedule with affinity links maintained between instances of the same loop is able to localize memory accesses fairly well. This transformation requires a simple and, most importantly, portable extension of the OpenMP API, rather than a full set of data distribution directives. We speculate that several codes might benefit from flexible affinity scheduling strategies that implicitly move computation close to data, rather than data close to computation. It is a matter of further investigation and experimentation to verify this intuition. We are also working on automating this procedure, using an approach similar to the well-known inspector/executor model [22]. The idea is to run a few iterations of the program in inspection mode, in order to probe the effectiveness of different data placement algorithms. In LU for example, the probing phase would alternately test first-touch and cyclic data distribution (the latter emulated with the cyclic assignment of loop iterations to processors) and use a metric (e.g. the number of remote memory accesses) to decide between the two.

The problem of granularity is hard to overcome because the overhead of page migration remains high⁶. It may be worth the effort to design and implement faster hardware data copying engines on future-generation NUMA systems. As an alternative approach, we are investigating schemes for parallelizing the page migration procedure. The idea is to have each processor poll the reference counters of locally mapped pages and forward pages that satisfy the migration criterion to the receiver. We expect that careful inlining of the page migration algorithm in OpenMP threads will reduce the interference between the page migration engine and the program and hopefully provide more opportunities for page migration under tight time constraints.

The problem of dealing with irregularities in the memory access pattern remains to some extent open, in the sense that our page migration engine can at best freeze page migrations upon detecting irregularities, but is still unable to optimize the access pattern in the presence of irregularities. Irregular codes constitute a challenging domain for future work. Non-repeatable access patterns present similar problems. Adaptive codes, for example, have memory access patterns which are a priori unknown and change at runtime in an unpredictable manner [24].

⁶ The cost of moving a page on the Origin2000 is no less than 1 ms.


We are investigating solutions that combine our iterative page migration mechanism with dynamic redistribution of loop iterations and iteration schedule reuse [21]. The idea is to identify at runtime both the irregularity of the memory access pattern and potential load imbalance, and try to handle the two problems simultaneously. This can be done with a synergistic combination of user-level page migration and loop redistribution.

Acknowledgments

This work was supported in part by NSF Grant No. EIA-9975019, the Greek Secretariat of Research and Technology Grant No. 99-566 and the Spanish Ministry of Education Grant No. TIC98-511. The experiments were conducted with resources provided by the European Center for Parallelism of Barcelona (CEPBA).

6. REFERENCES

[1] S. Benkner and T. Brandes. Exploiting Data Locality on Scalable Shared Memory Machines with Data Parallel Programs. In Proc. of the 6th International EuroPar Conference (EuroPar'2000), pages 647-657, Munich, Germany, Aug. 2000.

[2] J. Bircsak, P. Craig, R. Crowell, Z. Cvetanovic, J. Harris, C. Nelson, and C. Offner. Extending OpenMP for NUMA Machines. In Proc. of the IEEE/ACM Supercomputing'2000: High Performance Networking and Computing Conference (SC'2000), Dallas, Texas, Nov. 2000.

[3] R. Chandra, D. Chen, R. Cox, D. Maydan, N. Nedelijkovic, and J. Anderson. Data Distribution Support on Distributed Shared Memory Multiprocessors. In Proc. of the 1997 ACM Conference on Programming Languages Design and Implementation (PLDI'97), pages 334-345, Las Vegas, Nevada, June 1997.

[4] D. Culler, J. P. Singh, and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, 1998.

[5] M. Frumkin, H. Jin, and J. Yan. Implementation of NAS Parallel Benchmarks in High Performance FORTRAN. Technical Report NAS-98-009, NASA Ames Research Center, Sept. 1998.

[6] W. Gropp. A User's View of OpenMP: The Good, The Bad and the Ugly. In Workshop on OpenMP Applications and Tools (WOMPAT'2000), San Diego, California, July 2000.

[7] High Performance FORTRAN Forum. High Performance FORTRAN Language Specification, Version 2.0. Technical Report CRPC-TR92225, Center for Research on Parallel Computation, Rice University, Jan. 1997.

[8] HPF+ Project Consortium. HPF+: Optimizing HPF for Advanced Applications. http://www.par.univie.ac.at/project/hpf+, 1998.

[9] C. Hristea, D. Lenoski, and J. Keen. Measuring Memory Hierarchy Performance on Cache-Coherent Multiprocessors Using Microbenchmarks. In Proc. of the ACM/IEEE Supercomputing'97: High Performance Networking and Computing Conference (SC'97), San Jose, California, Nov. 1997.

[10] H. Jin, M. Frumkin, and J. Yan. The OpenMP Implementation of the NAS Parallel Benchmarks and its Performance. Technical Report NAS-99-011, NASA Ames Research Center, Oct. 1999.

[11] D. Kuck. OpenMP: Past and Future. In Proc. of the Workshop on OpenMP Applications and Tools (WOMPAT'2000), San Diego, California, July 2000.

[12] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proc. of the 24th International Symposium on Computer Architecture (ISCA'97), pages 241-251, Denver, Colorado, June 1997.

[13] J. Levesque. The Future of OpenMP on IBM SMP Systems. In Proc. of the First European Workshop on OpenMP (EWOMP'99), pages 5-6, Lund, Sweden, Oct. 1999.

[14] M. Marchetti, L. Kontothanassis, R. Bianchini, and M. Scott. Using Simple Page Placement Schemes to Reduce the Cost of Cache Fills in Coherent Shared-Memory Systems. In Proc. of the 9th IEEE International Parallel Processing Symposium (IPPS'95), pages 380-385, Santa Barbara, California, Apr. 1995.

[15] E. Markatos and T. LeBlanc. Using Processor Affinity in Loop Scheduling on Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 5(4):379-400, Apr. 1994.

[16] J. Merlin and V. Schuster. HPF-OpenMP for SMP Clusters. In Proc. of the 4th Annual HPF User Group Meeting (HPFUG'2000), Tokyo, Japan, Oct. 2000.

[17] D. Nikolopoulos, T. Papatheodorou, C. Polychronopoulos, J. Labarta, and E. Ayguadé. A Case for User-Level Dynamic Page Migration. In Proc. of the 14th ACM International Conference on Supercomputing (ICS'2000), pages 119-130, Santa Fe, New Mexico, May 2000.

[18] D. Nikolopoulos, T. Papatheodorou, C. Polychronopoulos, J. Labarta, and E. Ayguadé. Is Data Distribution Necessary in OpenMP? In Proc. of the IEEE/ACM Supercomputing'2000: High Performance Networking and Computing Conference (SC'2000), Dallas, Texas, Nov. 2000.

[19] D. Nikolopoulos, T. Papatheodorou, C. Polychronopoulos, J. Labarta, and E. Ayguadé. UPMlib: A Runtime System for Tuning the Memory Performance of OpenMP Programs on Scalable Shared-Memory Multiprocessors. In Proc. of the 5th ACM Workshop on Languages, Compilers and Runtime Systems for Scalable Computers (LCR'2000), LNCS Vol. 1915, pages 85-99, Rochester, New York, May 2000.


[20] OpenMP Architecture Review Board. OpenMP Fortran Application Programming Interface, Version 1.2. http://www.openmp.org, Nov. 2000.

[21] R. Ponnusamy, J. Saltz, and A. Choudhary. Runtime-Compilation Techniques for Data Partitioning and Communication Schedule Reuse. In Proc. of the ACM/IEEE Supercomputing'93: High Performance Networking and Computing Conference (SC'93), pages 361-370, Portland, Oregon, Nov. 1993.

[22] J. Saltz, R. Mirchandaney, and D. Baxter. Runtime Parallelization and Scheduling of Loops. In Proc. of the 1st ACM Symposium on Parallel Algorithms and Architectures (SPAA'89), pages 303-312, Santa Fe, New Mexico, June 1989.

[23] V. Schuster and D. Miles. Distributed OpenMP, Extensions to OpenMP for SMP Clusters. In Proc. of the Workshop on OpenMP Applications and Tools (WOMPAT'2000), San Diego, California, July 2000.

[24] H. Shan, J. P. Singh, R. Biswas, and L. Oliker. A Comparison of Three Programming Models for Adaptive Applications on the Origin2000. In Proc. of the IEEE/ACM Supercomputing'2000: High Performance Networking and Computing Conference (SC'2000), Dallas, Texas, Nov. 2000.
