
2 COMPILERS, TECHNIQUES, AND TOOLS FOR SUPPORTING PROGRAMMING HETEROGENEOUS MANY/MULTICORE SYSTEMS

Pasquale Cantiello, Beniamino Di Martino, and Francesco Moscato

CONTENTS

2.1 Introduction

2.2 Programming Models and Tools for Many/Multicore

2.2.1 OpenMP

2.2.2 Brook for GPUs

2.2.3 Sh

2.2.4 CUDA

2.2.4.1 Memory Management

2.2.4.2 Kernel Creation and Invocation

2.2.4.3 Synchronization

2.2.5 HMPP

2.2.6 OpenCL

2.2.7 OpenACC

2.3 Compilers and Support Tools

Large Scale Network-Centric Distributed Systems, First Edition. Edited by Hamid Sarbazi-Azad and Albert Y. Zomaya. © 2014 John Wiley & Sons, Inc. Published 2014 by John Wiley & Sons, Inc.


2.3.1 RapidMind Multicore Development Platform

2.3.2 OpenMPC

2.3.3 Source-to-Source Transformers

2.3.3.1 CHiLL

2.3.3.2 Cetus

2.3.3.3 ROSE Compiler

2.3.3.4 LLVM

2.4 CALuMET: A Tool for Supporting Software Parallelization

2.4.1 Component-Based Source Code Analysis Architecture

2.4.2 Algorithmic Recognizer Add-on

2.4.3 Source Code Transformer for GPUs

2.5 Conclusion

References

2.1 INTRODUCTION

In the last few years, the continuous growth of processors' clock speed has stopped, and processor improvements follow a different path by multiplying the number of processing units on a chip. Not only systems for scientific applications but also commodity personal computers include multiple multicore CPUs and GPUs. Writing parallel code, or porting existing sequential code to the new architectures, is hard: it is a costly process requiring skilled developers. This chapter presents, after a brief introduction to the different programming models for current many/multicore and GPU systems, a review of the state of the art of compilers and support tools tailored to help programmers in both developing and porting code for many/multicore CPUs and GPUs. The emphasis is on automatic parallelization compilers, and on techniques and tools that perform source-to-source transformation of code to convert sequential code into parallel code. The chapter includes a presentation of a tool developed by the authors that performs static analysis on source code and represents it in a language-neutral way, along with knowledge extracted from it. The tool integrates an algorithmic recognizer to find instances of known algorithms and a transformer to automatically turn sequential code into a parallel version based on libraries for GPUs.

2.2 PROGRAMMING MODELS AND TOOLS FOR MANY/MULTICORE

In recent years, multicore devices have quickly evolved in both architecture and core count. This has motivated software developers to define programming models that are able to decouple code from hardware, because new applications must automatically scale as new architectures and processors are introduced. In addition, adequate programming models can also enhance performance if proper optimization methodologies are enacted.


All major CPU vendors now exploit explicit parallelism in their processors, both to improve power efficiency and to increase performance, but only parallelized applications provide real improvements.

With these new architectures, dealing with data parallelism is appealing because multiple cores can reduce latency when dealing with data accesses, or can execute the same program on distributed data with a SPMD (Single Program, Multiple Data) model.

Unfortunately, multicore hardware is evolving faster than software technologies, and new multicore software standards are arising to cope with the complexity of embedded multicore systems.

The single-thread computational approach is no longer useful for scaling performance on new architectures [14, 20]. The main reason is that recent multicore and manycore systems do not use only the symmetric multiprocessing (SMP) model but are heterogeneous in both architecture and features. Programmers must acknowledge the heterogeneity of hardware and software in order to produce optimized applications.

The main problem with multicore is that, even given the opportunity to increase performance, software usually needs to explicitly exploit multicore features in order to fulfil that potential. Traditional approaches like multithreading force programmers to define proper thread management in order to design parallel algorithms.

The use of SPMD models in parallel programming has been proposed as a programming model of Graphics Processing Units (GPUs) for general-purpose processing. It arises from the demonstration [29] that the OpenGL [33] architecture can be abstracted as a Single Instruction Multiple Data (SIMD) processor. Usually, SIMD processing involves significant memory use. In order to reduce bandwidth waste, graphics hardware now executes small threads that load and store data to local temporary registers and cache memories, trying to exploit computational and data locality through the use of streams (collections of records requiring similar computation) and kernels (functions that can be executed on local streams).

Identifying kernels to execute in parallel on local data is crucial to improving performance. In recent years, several programming models, techniques, and tools have been proposed for general-purpose programming of GPUs.
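The stream/kernel structure described above can be sketched in plain C. This is an illustrative, sequential sketch (the names `record_t`, `add_kernel`, and `run_over_stream` are ours, not from any GPU API): a streaming framework would run the kernel instances in parallel across processing elements, which is possible precisely because each instance touches only its own record.

```c
#include <stddef.h>

/* A record in a stream: every record receives similar computation. */
typedef struct { float a, b, out; } record_t;

/* The "kernel": a small pure function applied independently to each
   record, using only its own local data. */
static void add_kernel(record_t *r) { r->out = r->a + r->b; }

/* The loop a streaming runtime would distribute across processing
   elements; written sequentially here purely to show the structure. */
void run_over_stream(record_t *stream, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        add_kernel(&stream[i]);
}
```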

In the following, some models, languages, and tools for programming many/multicore systems are described. They usually face the problem of heterogeneity of hardware resources by using a high-level language to describe computation, in order to simplify the development of parallel applications and to decouple programming models from hardware. Proper middleware is then used to optimize, compile, and execute programs on the best available target architecture.

2.2.1 OpenMP

A shared memory programming model is useful when developing software for embedded multicores following a SPMD model. Adequate optimization techniques can be used to enhance performance on different architectures. The OpenMP [6, 9] standard works well in decoupling code from hardware. It is based on a set of compiler directives and callable runtime libraries, leaving the base language (C, C++, and FORTRAN) unspecified. The execution model of OpenMP is based on the fork/join model [9]. An OpenMP program


begins execution as a single process (called the master thread) and then defines parallel regions (by means of a parallel directive) that are executed by multiple threads, as shown in the following.

Algorithm 2.1

Master Thread ...
#pragma omp parallel
{
    func();
}
Master Thread continues ...

Threads are synchronized at the end of a region, where the master thread continues execution. This model allows for the execution of parallel regions following a pure SPMD model: func() is assigned for execution to each thread in the OpenMP thread pool and is executed once by each thread.

More complex options allow for dividing loop iterations among threads, for defining shared and local variables in parallel regions, and for defining reduction variables.
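As a hedged sketch of these options (the function name is ours, not from the standard), a work-shared loop with a reduction variable looks like this. A compiler without OpenMP support simply ignores the pragma and runs the loop sequentially with the same result:

```c
#include <stddef.h>

/* Sum an array: iterations of the for loop are divided among the
   threads of the pool, each thread accumulates a private partial sum,
   and OpenMP combines the partial sums at the end (reduction clause). */
double parallel_sum(const double *x, size_t n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; ++i)
        sum += x[i];
    return sum;
}
```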

Parallel regions are optimized by OpenMP compilers in a transparent way, as is shared memory management.

2.2.2 Brook for GPUs

Brook for GPUs [4] is a framework for general-purpose computation on GPUs that exploits a streaming programming model. Brook manages memory via streams, and data-parallel operations on GPUs are specified as calls to kernels. Many-to-one reductions can be implemented on stream elements.

Primitives of the Brook programming language are not embedded in a general-purpose programming language (like C or Fortran). Brook uses its own language (similar to C) to write Brook programs, which are then translated into C and compiled with a native C compiler.

The main Brook primitives are used to manage streams and to define kernels. Streams are declared with an angle-bracket syntax similar to that of arrays (e.g., double v<5,9> declares a two-dimensional stream of doubles). Kernels are associated with special functions, specified by the kernel keyword.

Algorithm 2.2

kernel void sum(float4 a<>, float4 b<>, out float4 c<>) {
    c = a + b;
}

void main() {
    float vA[100], vB[100], vC[100];
    float a<100>, b<100>, c<100>;


    // initialize vA and vB
    streamRead(a, vA);
    streamRead(b, vB);
    sum(a, b, c);
    streamWrite(c, vC);
}

reduce void sum2(float v<>, reduce float red<>) {
    red += v;
}

The code above is an example of stream and kernel definitions. For kernels, input and output streams must be declared explicitly. Brook forces programmers to distinguish between data streamed as input and other arrays because Brook divides kernel invocation among the available GPU processors, making different stream parts available to the kernel instances running on different processors. The streamRead and streamWrite functions are used to copy data from memory to streams and vice versa. The kernel is executed on GPU processors simply by invoking the kernel function.

Kernels are used to apply a function to a set of data that is automatically managed by Brook. In addition, Brook provides a data-parallel reduction method for evaluating a single datum from a set of records. Reductions are typically used for arithmetic sums or matrix products. A reduction accepts a single input stream and produces a smaller output stream. The function sum2 is defined as a reduce function: it produces from the stream v the element red, which contains the sum of all the elements in v.

Brook also provides a collection of stream operators that can be used to manage, manipulate, and organize streams (e.g., grouping elements into new streams, extracting substreams, etc.).

One of the strengths of Brook is the way it manages kernels. As shown in the code above, kernels and streams are associated with processors on GPUs transparently. Users do not have to explicitly split data streams across processors in the code.

2.2.3 Sh

Sh is an open-source metaprogramming language for General Purpose Graphics Processing Units (GPGPUs). The Sh language is built on top of C++, and thus has a similar syntax. Sh code is embedded inside C++, hence no extra compile tools are necessary. In order to generate executables, Sh uses a staged compiler: part of the code is generated when the C++ code is compiled, and the rest is compiled at runtime.

The following code shows an example of a Sh program:

Algorithm 2.3

ShPoint3f point1(0,0,0);
ShMatrix4f Matr;
ShProgram progr = SH_BEGIN_PROGRAM("gpu:stream") {


    ShInputAttrib1f a;
    ShInputPoint3f point2;
    ShOutputPoint3f pointOut = Matr | (point1 + a*normalize(point2));
} SH_END;

ShChannel<ShPoint3f> channel1;
ShChannel<ShAttrib3f> channel2;
ShStream datastream = channel1 & channel2;
channel1 = progr << datastream; // executes progr

Proper directives allow for variable definition. In Sh, all operations are strongly type checked. Because Sh is a metaprogramming language, all legal C++ constructs are allowed. In particular, stream operators are used to assign a stream to a program function and to run it. As in Brook, allocation and optimization of functions is managed by the Sh framework in a transparent way. Streams are composed of channels, which are concatenated to build them.

Sh was defined as a shading language, and it remains highly coupled with pixel shader functions.

2.2.4 CUDA

CUDA (Compute Unified Device Architecture) is a scalable programming model and a software environment for parallel computing on NVIDIA GPUs [28]. It exposes an SPMD programming model where a kernel is run by a large number of threads grouped into blocks.

The specific GPU hardware is abstracted by CUDA PTX, a virtual machine for parallel thread execution. This provides a low-level interface separated from the target device. All the threads within a block form a Cooperative Thread Array (CTA) in the PTX domain. The threads in a CTA run concurrently, communicating through shared memory, while multiple CTAs can only communicate through global memory. The PTX indexing schema assigns a position to each CTA within the grid and a position to each thread within a CTA, as shown in Fig. 2.1. Each thread can thus determine what data to work on based on its block and thread IDs.

The language supported by CUDA is an extension to C, with some features from C++ such as templates and static classes. A kernel is a function compiled into a PTX program that can be executed by all threads in a CTA and that can access GPU memory and shared memory. A kernel is executed in a CUDA context, and one or more contexts can be bound to a single GPU. While it is possible to assign multiple contexts to multiple GPUs, there are no spanning mechanisms or load-balancing techniques to distribute a single context over multiple devices; this must be managed by the host program. Kernel executions and CPU-GPU memory transfers can run asynchronously.

2.2.4.1 Memory Management. Memory allocation on a GPU device is done by calling the cudaMalloc() function, while memory transfers from host to device are enacted by the cudaMemcpy() function.
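A typical host-side allocate/copy/free sequence looks like the following sketch (buffer names and sizes are illustrative; this requires the CUDA toolkit and a device to actually run):

```c
float h_data[256];                 // host buffer
float *d_data = NULL;              // device pointer
size_t bytes = sizeof(h_data);

cudaMalloc((void **)&d_data, bytes);                        // allocate on device
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // host -> device
/* ... launch kernels operating on d_data ... */
cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // device -> host
cudaFree(d_data);                                           // release device memory
```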


Figure 2.1 CUDA grid and block addressing (courtesy of nVIDIA).

Similarly, to free memory that is no longer needed, a call must be made to the cudaFree() function.

2.2.4.2 Kernel Creation and Invocation. In CUDA, a kernel function specifies the code to be executed by all threads, following the SPMD model.

A kernel function must be called with an execution configuration. The execution context of a kernel is given by a Grid of parallel threads, as shown in Fig. 2.1. Each grid contains a certain number of Blocks, addressed with a unique two-dimensional coordinate. Within each block, a three-dimensional coordinate system addresses the threads.

Just before invoking the kernel, the execution configuration parameters, in terms of grid and blocks, must be created. The special <<< and >>> symbol sequences are CUDA extensions to specify the execution configuration parameters.

The parameters to the kernel function are passed with the normal C syntax. Obviously, pointers passed as parameters must only point to device memory space. An example of kernel invocation follows.


Algorithm 2.4

__global__ void KernelFunc(...);
dim3 DimGrid(100, 50);       // 5000 thread blocks
dim3 DimBlock(4, 8, 8);      // 256 threads per block
size_t SharedMemBytes = 64;  // shared memory size
KernelFunc<<<DimGrid, DimBlock, SharedMemBytes>>>(...);

2.2.4.3 Synchronization. Threads in a block can coordinate their execution using the barrier synchronization function __syncthreads(). When a thread calls this function, it is held until all the other threads in the block reach the same location. This can be used to ensure that all threads have completed one phase before beginning the next one.
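A common two-phase pattern is sketched below (an illustrative kernel of our own; it requires nvcc and a CUDA device): each thread first writes its element into shared memory, and the barrier guarantees all writes are visible before any thread reads a neighbor's element.

```c
__global__ void shift_left(const float *in, float *out, int n)
{
    __shared__ float buf[256];                 // one element per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) buf[threadIdx.x] = in[i];       // phase 1: load into shared memory
    __syncthreads();                           // barrier: all loads complete
    if (i < n - 1 && threadIdx.x < blockDim.x - 1)
        out[i] = buf[threadIdx.x + 1];         // phase 2: safely read a neighbor
}
```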

2.2.5 HMPP

HMPP [13] is a Heterogeneous Multicore Parallel Programming environment that was designed to allow for the integration of hardware accelerators. HMPP aims at simplifying the use of accelerators while maintaining code portability. It is based on a set of compiler directives (as in OpenMP [9]), tools, and software runtimes that decouple application code from hardware accelerators.

Basic HMPP directives are used to define functions named codelets. Codelets are pure functions that are suitable for hardware acceleration. Codelets in HMPP are managed by a middleware that chooses the best runtime implementation for the available hardware accelerators. The HMPP runtime is not designed for a specific architecture, and HMPP applications can be compiled with off-the-shelf compilers. Dynamic linking mechanisms are employed to use new or improved codelets without having to recompile the whole application source.

To define codelets, HMPP directives address data exchange between host and devices. In addition, HMPP is able to handle different accelerator targets in the same application and to execute code running on CPUs and other devices simultaneously, if properly compiled codelets exist for the target architectures.

HMPP provides a programming interface based on directives used to annotate the original code with instructions that will be used to produce properly compiled codelets for a hardware accelerator. Applications are first preprocessed and then linked to an HMPP runtime. Codelets for different devices are compiled with third-party tools. An example of codelet definition in HMPP follows.

Algorithm 2.5

#pragma hmpp func codelet, output=outv
void func(int n, float *inv, unsigned int N1[1],
          float *outv, unsigned int N3[1]) {
    int i;


    for (i = 0; i < n-1; i++) {
        outv[i] = inv[i] + inv[i+1];
    }
}

#pragma hmpp func callsite
func(n, inc, N1, outv, N1);

The HMPP codelet is identified by a pragma directive that declares func as a codelet function. Notice that this function follows the pure-function abstraction. The standard requires that, in codelet definitions, each array-based parameter is followed by a parameter (an array itself) containing the dimension of the previous one (e.g., the declarations of inv and outv are, respectively, followed by the declarations of N1 and N3). Several other directives can be used to exploit loop unrolling, tiling, and other optimizations.

Codelets can be executed synchronously or asynchronously with respect to the application running on the main CPU. The HMPP runtime must be invoked for this purpose by defining proper execution directives (callsite). The hardware accelerators on which codelets must run can be chosen explicitly or assigned by the HMPP runtime.

Different data-transfer and synchronization directives can be used in HMPP to implement synchronization patterns and to optimize data transfer from the CPU host to the device. OpenMP and MPI code are also supported, and HMPP is compatible with several GPU devices.

Recent analyses [15] show that HMPP-based programs achieve good performance and speed-up.

In summary, OpenHMPP is a high-level programming paradigm that allows for transparent execution of functions (codelets) on several target architectures. Generation of executable code for codelets can be enacted by using third-party tools, compilers, and libraries. HMPP also supports shared memory systems, automatically exploiting symmetric multiprocessor (SMP) systems, and it is also able to generate communication through MPI interfaces if needed.

2.2.6 OpenCL

OpenCL [32] is an open industry standard that provides a common language, programming interfaces, and hardware abstractions for developing task-parallel and data-parallel applications in heterogeneous environments. The environment consists of a host CPU and any attached OpenCL-compliant devices. OpenCL offers classic compilation and linking features, and it also supports runtime compilation, which allows the execution of accelerated kernels on devices that were unavailable when the application was developed. Runtime compilation in OpenCL makes a developed application independent of device instruction sets: if device vendors change or upgrade the instruction sets of their devices (e.g., when new devices are marketed), old applications can be recompiled and optimized at runtime, exploiting the new devices' potential.


The OpenCL programming model abstracts CPUs, GPUs, and other accelerators as SIMD processing elements (PEs). As with CUDA, computation kernels are associated with PEs with a thread-safe semantics, allowing access to shared memory for only one thread at a time. OpenCL also abstracts memory, defining four types of memory: global, constant, local, and private.

In order to compile code, allocate device memory, and launch kernels, contexts must be created and associated with a device. Memory allocation is associated with a context, not with devices. Resources (memory, number of PEs, etc.) can be reserved for contexts. OpenCL checks whether any device exists with the resources required by a context; devices with inadequate resources are excluded from context allocation.

An example of OpenCL code is shown below (the full code is available at [3]).

Algorithm 2.6

// create a compute context with GPU device
context = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU,
                                  NULL, NULL, NULL);
// create a command queue
...
queue = clCreateCommandQueue(context, device_id, 0, NULL);
...
// create the compute program
program = clCreateProgramWithSource(context, 1,
                                    &fft1D_1024_kernel_src, NULL, NULL);
// create the compute kernel
kernel = clCreateKernel(program, "fft1D_1024", NULL);
// set the arg values
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
...

__kernel void fft1D_1024(__global float2 *in, __global float2 *out,
                         __local float *sMemx, __local float *sMemy) {
    ...
}

In OpenCL, kernels and memory are associated with contexts, which in turn are associated with one or more devices. When creating contexts, OpenCL verifies whether adequate devices exist to allocate the resources associated with the context. Once contexts are allocated, OpenCL programs are compiled at runtime and optimized for the target devices. Operations within kernels are managed using command queues associated with the target devices.

The clCreateContextFromType routine is used to create a new context (it is also possible to specify the type of device to use). Kernels, buffers, and queues are then associated with the context.


2.2.7 OpenACC

The OpenACC [2] approach is represented by a set of compiler directives designed to specify loops and regions of code in standard C, C++, and Fortran that can be offloaded from a host CPU to an attached accelerator.

OpenACC is a nonprofit corporation founded by four companies: CAPS Enterprise, CRAY Inc., the Portland Group Inc. (PGI), and NVIDIA. Their mission was to create a cross-platform API that would easily allow acceleration of applications on many-core and multicore processors using directives. This allows portability across operating systems, host CPUs, and accelerators.

The directives and programming model allow programmers to create high-level host+accelerator programs without all the concerns about the initialization of the accelerator, or data and program transfer between the host and the accelerator.

OpenACC API-enabled compilers and runtimes hide these concerns inside the programming model. The API also allows the programmer to provide additional information to the compilers, including the locality of data to an accelerator and the mapping of loops onto an accelerator.

The API is composed of a collection of compiler directives for the C, C++, and Fortran languages. They apply to the immediately following structured block or loop: a single statement or a compound statement for C and C++, or a single-entry/single-exit sequence of statements for Fortran.

For the C language, the standard form of a directive is as follows:

Algorithm 2.7

#pragma acc directive-name [clause [[,] clause]...] new-line

directive-name is the name of the action that must be applied to the following structured block, and each clause is an optional parameter characterizing the action. Directives fall into the main categories briefly described below.

• Accelerator Compute Constructs: These specify the start of parallel execution on the accelerator device. As an example, for the parallel construct the compiler creates gangs of workers to execute the accelerator parallel region; one worker in each gang begins executing the code in the structured block of the construct. Optional clauses in this construct can control the number of gangs or workers, the asynchrony of the execution, or the specification of copy-in and copy-out data between host and accelerator.

• Data Constructs: The data construct defines data regions (scalars, arrays, or subarrays) to be allocated in the device memory for the duration of the region, and the eventual copy-in and copy-out between host and device memories.

• Loop Constructs: The loop directive applies to the loop immediately following it. It can describe what type of parallelism to use to execute the loop, and declare loop-private data (variables and arrays) and reduction operations if required.


• Cache Directives: The cache directive may appear at the top of (inside of) a loop to specify elements of arrays or subarrays to be fetched into the highest level of the cache for the body of the loop.

• Declare Directive: This can be used in the declaration section of a Fortran block, or following a variable declaration in C/C++, to specify that a variable or array is to be allocated in the device memory for the duration of the block of execution, and to specify whether the data values are to be transferred.

• Executable Directives: The update directive is used to force the update of data in accelerator memory to the values present in host memory, or vice versa. The wait directive causes the program to wait for the completion of an asynchronous activity.
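The data and compute constructs above are commonly combined, as in the following hedged sketch (the function name is ours). The data construct keeps x and y in device memory for the region (x copied in, y copied out), and the parallel loop construct distributes the iterations; a compiler without OpenACC support simply ignores the pragmas and runs the loop sequentially with the same result:

```c
/* Scale a vector on the accelerator: y[i] = a * x[i]. */
void scale_vector(int n, float a, const float *x, float *y)
{
    #pragma acc data copyin(x[0:n]) copyout(y[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i];
    }
}
```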

In addition to the directives, a runtime library with a set of functions is defined in OpenACC. These can be used, for example, to query the system at runtime to discover the number and types of devices present and the available amount of memory, to control the synchronization of execution, or to allocate and free memory chunks.

An example of OpenACC code is reported below, where the operation Y = aX + Y is performed.

Algorithm 2.8

void saxpy_parallel(int n, float a, float *x, float *restrict y) {
    #pragma acc kernels
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

The code is a standard sequential loop, but the directive causes the compiler to generate a kernel and allocate it on the accelerator's processors.

2.3 COMPILERS AND SUPPORT TOOLS

2.3.1 RapidMind Multicore Development Platform

The RapidMind Multicore Development Platform [26, 27] was born as a framework for expressing data-parallel computations from within C++ code, to be executed on a multicore processor. The platform was bought by Intel Corporation and has been embedded in its framework for parallel development. The platform is based on a programming API and on middleware for optimization and analysis of code. RapidMind supports several compilation back-ends and optimizations: it is possible to compile programs to run on NVIDIA and ATI GPUs, Intel and AMD CPUs, Cell BE Blade, and Cell Accelerator.

When programs are compiled, the RapidMind runtime chooses the best available hardware and optimizes the application using the proper back-end.


RapidMind programs are based on three basic types: values, programs, and arrays. Values and Arrays are used to define variables and arrays, respectively. They are identified by proper data types in order to mark data that will be used in code parts to be executed in a parallel environment. Programs are used to identify code to optimize for parallel execution. A program is delimited by proper macros, as shown in the following:

Algorithm 2.9

Program myprogram = RM_BEGIN {
    In<Value1f> v1;
    Out<Value1f> v2;
    Value1ui i;
    RM_FOR (i = 0, i < 5, i++) {
        v2[i] = v1[i] + 1;
    } RM_ENDFOR;
} RM_END;

Programs are contained between the RM_BEGIN and RM_END keywords; v1 and v2 are, respectively, the input and output arrays of the program, and i is a variable used as an index. The program produces v2 as output by executing a for statement that is optimized within the RapidMind Program.

2.3.2 OpenMPC

OpenMPC [22] is a framework that proposes a programming interface able to merge OpenMP directives and API with a set of CUDA-related directives, in order to extend OpenMP features to CUDA in a heterogeneous multicore/GPU environment.

OpenMPC provides programmers with an abstraction of the (more complex) CUDA programming model. It consists of a toolchain where (1) OpenMPC code is parsed by Cetus [23]; (2) the OpenMP code is analyzed to identify possible CUDA kernels; (3) possible kernels are annotated with CUDA directives; and (4) OpenMP kernels are translated into optimized CUDA kernels following a code transformation approach.

To fulfil optimization, a Tuning Engine is used. It performs repeated compilation, execution, and measurement of the produced code in order to collect information to enhance performance.

2.3.3 Source-to-Source Transformers

Under the heading of source-to-source (S2S) transformers fall tools able to convert a software program written in a given source language into a new version in the same language or in a different one, always in source code. Thus, their output is code that is still readable and modifiable by a programmer and that must then be compiled for the target architecture.


One type of S2S transformer used in the past was the cross-compiler, built to provide a new language on an architecture for which a base-language compiler (typically assembly) was already present.

Today, they are conceived mainly with one or more of the following objectives:

• Transform a sequential version of code into a parallel version for a target archi-tecture.

• Transform an already parallel source code written with a particular paradigm (e.g.,OpenMP) to a different paradigm/language (e.g., OpenCL).

• Apply source-code optimizations to regions of code (e.g., loop nests) in order to take advantage of hardware features or to improve data locality.

The process of code transformation is generally not fixed a priori; it can be driven in several ways:

• By users, who can annotate the code regions they want to transform with directives (pragmas) that instruct the transformer regarding what to modify and how.

• Automatically, by the code itself, for example by analyzing data dependences inside loop nests.

• By the size of the data involved, as in a distributed-memory architecture, where execution and communication times must be taken into account when deciding what should be computed on remote nodes. Some systems produce multiple versions of the translated code regions, with software probes that select among them only at runtime, depending on the size of the data and the configuration of the architecture. These are known as auto-tuning systems.

More powerful S2S systems are programmable in a high-level language, allowing users to develop custom analyzers and transformations. Some of them are described below.

2.3.3.1 CHiLL. CHiLL [7] is a source-to-source compiler transformation and code generation framework that can transform sequential loop nests. It provides a scripting language to describe composable compiler transformation recipes [18] such as tiling, striping, and unrolling. CHiLL has served as a base for CUDA-CHiLL [31], which extends the scripting language with commands to drive data movement between compute devices or to extract execution kernels and produce CUDA syntax. Also based on CHiLL is the auto-tuning compiler described in [8], which combines performance analysis with transformations.

2.3.3.2 Cetus. Cetus [23] is a compiler infrastructure for source-to-source transformation. It provides a front-end to parse programs written in C90 (and C99, with some limitations), models the parsed code in an object-oriented hierarchical representation, and exposes a set of Java APIs to access and manipulate it. The user can easily write analysis, transformation, and optimization passes, letting the framework do the parsing and unparsing operations. It does not provide prebuilt transformation routines; the original code structure and statements are preserved. On top of Cetus, tools such as OpenMP-to-GPU [24] have been built to automatically translate standard OpenMP source code into CUDA-based GPU applications.

2.3.3.3 ROSE Compiler. ROSE Compiler [30] is an open-source infrastructure for building source-to-source program transformation and analysis tools for C/C++ and Fortran programs. It is based on the EDG front-end [16] to parse C and C++ and on the Open Fortran Parser [17] for Fortran 90. On top of these parsers, ROSE presents a common object-oriented, open-source intermediate representation (IR) for the three supported languages. This IR includes an abstract syntax tree, symbol tables, and a control flow graph. The class hierarchy of ROSE provides query, visit, and transformation functions to be used by a program written by the user in C++. The IR preserves the original source syntax, with comments and directives. Automatic parallelization using ROSE is described in [25].

2.3.3.4 LLVM. LLVM (Low Level Virtual Machine) [21] is a compiler framework designed to support program analysis and transformation for arbitrary programs, by providing high-level information to compiler transformations at compile time, link time, run time, and in idle time between runs. By defining a low-level code representation in static single assignment (SSA) form, it provides a language-independent representation. The main drawback of this abstraction is the loss of the original control structures after code unparsing (e.g., for() loops are replaced by if() and goto instructions).

2.4 CALUMET: A TOOL FOR SUPPORTING SOFTWARE PARALLELIZATION

This section presents CALuMET (Code Analyser and knowLedge ModEller for Transformation), a prototype tool implementing a component-based architecture designed to assist the user in the process of software migration, by showing knowledge extracted from code and by driving automatic transformation of parallelizable regions of code. It is designed to parse source code files in the C/C++, Java, and Fortran languages and to perform several analyses on them using a component-based approach. Graph structures carrying dependence information are also built and modeled within the representation.

Results are presented to the user within a GUI, but are also emitted in a machine-readable format to enable interoperability. This output can be used in a tool chain for further investigation or for code transformation driven by the extracted knowledge.

2.4.1 Component-Based Source Code Analysis Architecture

The architecture of the CALuMET tool is shown in Fig. 2.2 as a block diagram.

The Graphical User Interface (GUI) is the presentation module. It has been designed to be user friendly, to minimize learning time for the user, and to provide a clear view of the results. In our case, it has been designed as a multipanel view to permit, at any moment, a side-by-side comparison between the object of the view (graph, concepts, etc.) and the source code it was built from. A point-and-click interface with highlighting of related objects between views can be useful to the user.

Figure 2.2 Tool architecture. [Block diagram showing the Graphical User Interface, the Source Code Object Model, the Analysis Engine, the parser adapter and Util components, the C/C++, Fortran, and Java parsers, and the Knowledge Base.]

The SCOM component implements the language-neutral representation of code. The GUI interacts with this module to invoke specific analyses, to extract graphs, and to persist data in the Knowledge Base. Additional modules added to the system can use this component to interface with the representation.

The Analysis Engine implements the core of the analysis functions and acts as a bridge between the SCOM and the parsers. Requests for analysis issued by the GUI through the class methods of the SCOM are performed by this module.

In detail, upon a request from the GUI to parse a file, the analysis engine invokes the corresponding parser through the appropriate adapter, passing it the environment and the path of the file, and receives the results.

It also has the duty of building standard analysis graphs, and it can serialize and deserialize the results of the analysis in a common format. Additional external analyzers can be added as components interfaced with the SCOM.

Parser construction differs among target languages and cannot be done in the same environment as the core components. Parsers are built as external executable programs and installed as plug-in modules into the environment. A complete analysis for a given language can be done in several steps with a tool chain of different programs.

Each parser is invoked as an external process, passing it the pathname of the source file to be analyzed, a set of options related to the analysis to perform, the format of the output files, and their location. Thus, the adapter, along with the parser, acts as an interface between the analysis engine and the source code under analysis.


The Util component contains a set of packages that provide common functionality to the other components of the architecture. It contains, among others, classes to do:

• Graph management: A package exposes base classes for directed graphs, with methods for the typical operations on nodes and edges (add(), remove(), find(), visit(), . . . ).

• Serialization and deserialization: All the functionality needed to persist graphs to and from disk as XML files.

• Graph visualization: Another package exposes classes for graph and multigraph visualization, along with legends. These classes can be inherited and customized for a particular domain, and their visualization can appear inside a panel view.

• Attribute extraction: Because the design allows the model to be extended, a subclassed entity can add attributes that must not be ignored by the other components. Methods are therefore provided to discover and extract them at runtime using reflection.

A parser is a module that directly processes the source code files and performs basic analysis on them. Its output populates the Knowledge Base. For each file, the parser builds the abstract syntax tree (AST) as an in-memory intermediate representation of the code.

By traversing the AST so built, it identifies the nodes relevant to the construction of the analysis graphs for each function, procedure, or method found in the AST. From each procedure, a multigraph is built with these nodes as starting points. Each node holds a reference to the AST node from which it originates.

Parser construction requires a different approach for each language. At present, there is no Swiss-army-knife tool usable for all languages, so different external components have been integrated into the prototype.

One of the objectives that led to the design of the architecture was to allow analysis results to be exchanged not only among the core components but also to and from external components. As an example, a source-to-source transforming processor that modifies sequential code to execute it on a parallel machine can take as input the results of an analysis already done. Knowing that a certain portion of code implements a known algorithm, together with the given host architecture, can help the transformer pick a well-suited implementation of the algorithm for the target platform and transform the code accordingly.

Analysis can be a time-consuming process, so it is wise to do it off-line and only once for each file. Its results can be reused later whenever a new platform must be targeted or a new implementation of an algorithm is written. Therefore, one of the requirements of the architecture was to permit easy interchange of analysis data, and all the results are produced in GXL format [19].

2.4.2 Algorithmic Recognizer Add-on

The modularity and expandability of the architecture have been proven with the integration of an add-on module that performs algorithmic recognition and enriches the knowledge base. This module, previously designed and developed by one of the authors [10], implements a technique for automated recognition of algorithmic concepts in source code [12].

Algorithmic concept recognition is a program comprehension technique for recognizing instances of known algorithms in source code. The recognition strategy is based on hierarchical parsing of algorithmic concepts. Starting from an intermediate representation of the code, basic concepts are recognized first; subsequently, they become components of structured concepts in a hierarchical and/or recursive way. This abstraction process can be modeled as hierarchical parsing using concept recognition rules that act on a description of the concept instances found in the code [11, 12].

2.4.3 Source Code Transformer for GPUs

A further extension to the architecture has been the integration of a source-to-source transformer that, starting from the results of the algorithmic concept recognizer, transforms a code region implementing a known algorithm into a new version that takes advantage of an accelerator device (a GPU, in our case) [5].

The component interfaces not only with the GUI, to let the user express preferences, but mainly with the SCOM, from which it can retrieve the recognized algorithm instances and the references to the related code. The repository must contain, for each recognizable algorithm and for each supported target architecture, one or more alternative implementations, stored in parametric form. The parameters are mapped to the input and output data involved in the algorithm. The user can drive the selection of the code within the repository by setting preferences among alternative implementations. At present, rules have been implemented that map basic linear algebra algorithms to CUBLAS calls [1].

The transformer directly manipulates the intermediate representation of the analyzed source program. Using the references stored in SCOM entities, the abstract syntax tree is modified in the following steps:

• The sub-tree corresponding to the code region is pruned from the AST and, ifdesired, a comment block with the original code is inserted.

• A new sub-tree is generated with the transformed code. If needed (as with GPUs), it also contains memory allocation on the device, memory transfer from the CPU to the device, the library invocation, memory transfer from the device back to the CPU, and memory deallocation.

• This tree is appended to the AST at the removal point, just after the comment block.

After all the transformations on the AST are done, an unparsing operation generates code ready to be compiled on the target platform.

The transformation is guaranteed to be legal only after verifying dependence information on any extra statements, mixed in with the code, that are not part of the recognized concepts.


Using as input the same source code of the sequential C implementation of matrix-matrix multiplication seen before, the produced source code, with the calls to the CUBLAS library, is shown below, assuming the user has chosen that implementation. Commented code blocks have been omitted.

Algorithm 2.10

// .... omitted commented code ...
// --> Added by Transformer ---
void *_dptr_x;
void *_dptr_y;
void *_dptr_z;
const double alpha = 1.0, beta = 0.0;
// Memory allocation
cudaMalloc((void **)&_dptr_x, 10*10*sizeof(double));
cudaMalloc((void **)&_dptr_y, 10*10*sizeof(double));
cudaMalloc((void **)&_dptr_z, 10*10*sizeof(double));
cublasCreate(&handle);
// Data transfer CPU->GPU
cublasSetMatrix(10, 10, sizeof(double), x, 10, _dptr_x, 10);
cublasSetMatrix(10, 10, sizeof(double), y, 10, _dptr_y, 10);
// Matrix x Matrix Multiplication
cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 10, 10, 10,
            &alpha, _dptr_x, 10, _dptr_y, 10, &beta, _dptr_z, 10);
// Data transfer GPU->CPU
cublasGetMatrix(10, 10, sizeof(double), _dptr_z, 10, z, 10);
// Memory deallocation
cublasDestroy(handle);
cudaFree(_dptr_x);
cudaFree(_dptr_y);
cudaFree(_dptr_z);

2.5 CONCLUSION

In this chapter, a review has been presented of the main programming models, tools, and techniques to develop and port code for many/multicore CPUs and GPUs. Compilers, techniques, and source-to-source transformation frameworks to program, or to convert sequential code into parallel code, have also been summarized.

The chapter also presents a tool, developed by the authors, that performs static analysis of source code and represents it in a language-neutral way. It integrates an algorithmic recognizer to find instances of known algorithms in the code, models the extracted knowledge, and drives a source-to-source transformer that converts sequential code into a parallel version using libraries for GPUs.


REFERENCES

[1] CUDA CUBLAS library. http://developer.nvidia.com/cuBLAS (Aug. 2010).

[2] OpenACC corporation. http://www.openacc.org/ (Aug. 2012).

[3] OpenCL FFT example. http://developer.apple.com/library/mac/#samplecode/OpenCL FFT/Introduction/Intro.html (Aug. 2012).

[4] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. “Brook for GPUs: Stream computing on graphics hardware.” In ACM Transactions on Graphics (TOG), Vol. 23, p. 777–786. ACM, 2004.

[5] P. Cantiello and B. Di Martino. “Automatic source code transformation for GPUs based on program comprehension.” In Euro-Par 2011: Parallel Processing Workshops, Vol. 7156 of Lecture Notes in Computer Science, p. 188–197. Springer, Berlin/Heidelberg, 2012.

[6] B. Chapman, G. Jost, and R. Van Der Pas. Using OpenMP: Portable Shared Memory Parallel Programming, Vol. 10. The MIT Press, Cambridge, MA, 2007.

[7] C. Chen, J. Chame, and M. Hall. A Framework for Composing High-Level Loop Transformations. Technical Report 08-897, University of Southern California, 2008.

[8] C. Chun, J. Chame, M. Hall, and J.K. Hollingsworth. “A scalable auto-tuning framework for compiler optimization.” In IEEE International Symposium on Parallel & Distributed Processing, IPDPS. IEEE, 2009.

[9] L. Dagum and R. Menon. “OpenMP: An industry standard API for shared-memory programming.” Computational Science & Engineering, IEEE, 5(1):46–55, 1998.

[10] B. Di Martino. “ALCOR—An algorithmic concept recognition tool to support high level parallel program development.” In Applied Parallel Computing, Vol. 2367 of Lecture Notes in Computer Science, p. 755–755. Springer, Berlin/Heidelberg, 2002.

[11] B. Di Martino. “Algorithmic concept recognition to support high performance code reengineering.” Special Issue on Hardware/Software Support for High Performance Scientific and Engineering Computing of IEICE Transactions on Information and Systems, E87-D:1743–1750, Jul 2004.

[12] B. Di Martino and H.P. Zima. “Support of automatic parallelization with concept comprehension.” Journal of Systems Architecture, 45(6-7):427–439, 1999.

[13] R. Dolbeau, S. Bihan, and F. Bodin. “HMPP: A hybrid multi-core parallel programming environment.” In Workshop on General Purpose Processing on Graphics Processing Units (GPGPU 2007), 2007.

[14] M. Domeika. Software Development for Embedded Multi-core Systems: A Practical Guide Using Embedded Intel Architecture. Newnes, 2008.

[15] S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos. “Auto-tuning a high-level language targeted to GPU codes.” Innovative Parallel Computing (InPar), 2012, IEEE, 978-1-4673-2632-2.

[16] The Edison Design Group. The Edison Design Group C/C++ front-end. http://www.edg.com/ (last accessed 17 Jun 2013).

[17] The Open Fortran Group. The Open Fortran Project. http://fortran-parser.sourceforge.net/ (last accessed 17 Jun 2013).

[18] M. Hall, J. Chame, C. Chen, J. Shin, G. Rudy, and M. Khan. “Loop transformation recipes for code generation and auto-tuning.” In Languages and Compilers for Parallel Computing, Vol. 5898 of Lecture Notes in Computer Science, p. 50–64. Springer, Berlin/Heidelberg, 2010.

[19] R.C. Holt, A. Schurr, S. Elliott Sim, and A. Winter. “GXL: A graph-based standard exchange format for reengineering.” Science of Computer Programming, 60(2):149–170, 2006.

[20] W.-M. Hwu, K. Keutzer, and T.G. Mattson. “The concurrency challenge.” In IEEE Design and Test, Vol. 25, p. 312–320. IEEE, 2008.

[21] C. Lattner. “LLVM: A compilation framework for lifelong program analysis and transformation.” In IEEE International Symposium on Code Generation and Optimization, CGO, p. 75–86. IEEE, 2004.

[22] S. Lee and R. Eigenmann. “OpenMPC: Extended OpenMP programming and tuning for GPUs.” In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, p. 1–11. IEEE Computer Society, 2010.

[23] S.-I. Lee, T. Johnson, and R. Eigenmann. “Cetus—An extensible compiler infrastructure for source-to-source transformation.” In: L. Rauchwerger, ed., Languages and Compilers for Parallel Computing, Vol. 2958 of Lecture Notes in Computer Science, p. 539–553. Springer, Berlin/Heidelberg, 2004.

[24] S. Lee, S.-J. Min, and R. Eigenmann. “OpenMP to GPGPU: A compiler framework for automatic translation and optimization.” SIGPLAN Not., 44:101–110, Feb. 2009.

[25] C. Liao, D. Quinlan, J. Willcock, and T. Panas. “Extending automatic parallelization to optimize high-level abstractions for multicore.” In: M. Muller, B. de Supinski, and B. Chapman, eds., Evolving OpenMP in an Age of Extreme Parallelism, Vol. 5568 of Lecture Notes in Computer Science, p. 28–41. Springer, Berlin/Heidelberg, 2009.

[26] M.D. McCool and B. D’Amora. “Programming using RapidMind on the Cell BE.” In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, p. 222. ACM, 2006.

[27] M.D. McCool, K. Wadleigh, B. Henderson, and H.Y. Lin. “Performance evaluation of GPUs using the RapidMind development platform.” In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, p. 181. ACM, 2006.

[28] NVIDIA. CUDA: Compute Unified Device Architecture. http://www.nvidia.com/cuda/ (last accessed 17 Jun 2013).

[29] M.S. Peercy, M. Olano, J. Airey, and P.J. Ungar. “Interactive multi-pass programmable shading.” In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, p. 425–432. ACM Press/Addison-Wesley Publishing Co., 2000.

[30] D. Quinlan. ROSE Compiler project. http://www.rosecompiler.org/ (last accessed 17 Jun 2013).

[31] G. Rudy, M. Khan, M. Hall, C. Chen, and J. Chame. “A programming language interface to describe transformations and code generation.” In: K. Cooper, J. Mellor-Crummey, and V. Sarkar, eds., Languages and Compilers for Parallel Computing, Vol. 6548 of Lecture Notes in Computer Science, p. 136–150. Springer, Berlin/Heidelberg, 2011.

[32] R. Tsuchiyama, T. Nakamura, T. Iizuka, A. Asahara, and S. Miki. The OpenCL Programming Book. Fixstars Corporation, 2009.

[33] M. Woo, J. Neider, T. Davis, and D. Shreiner. OpenGL Programming Guide: The Official Guide to Learning OpenGL, Version 1.2. Addison-Wesley Longman Publishing Co., 1999.