
R-Stream Parallelizing C Compiler

Power User Guide

For Software Version 3.3.1


Preface

Contacting Reservoir Labs

For technical inquiries:

• Call 212.780.0527

• Send a fax to 212.780.0542

• Send email to [email protected]

• Fill out a web-based inquiry form at: http://reservoir.com/contact_form.php

To report a bug, please open a bug report at https://rstream.reservoir.com/bugzilla/.

Trademarks

Intel and Xeon are registered trademarks of Intel Corporation in the United States and other countries.

Java is a trademark of Oracle, Inc.

NVIDIA is a registered trademark of NVIDIA Corporation in the United States and other countries.

OpenMP is a trademark of the OpenMP Architecture Review Board.

Red Hat and Enterprise Linux are registered trademarks and Fedora is a trademark of Red Hat, Inc.

Tilera is a registered trademark and Tile64 is a trademark of Tilera Corporation.

Ubuntu is a registered trademark of Canonical, Ltd.

All other trademarks belong to their respective owners.

Copyrights

Copyright © 2009-2013 Reservoir Labs, Inc. All rights reserved.

The content of this document is provided for informational use only, is subject to change without notice, and should not be construed as a commitment by Reservoir Labs, Inc.

What’s this Book About and Who’s it for?

This book targets users who want to get the best out of their High-Performance Technical Computing applications quickly, and those who are interested in experimenting interactively with the R-Stream Compiler’s optimization, parallelizing, and mapping capabilities to fine tune the performance of their programs.

See The R-Stream Parallelizing C Compiler Getting Started Guide for instructions on downloading and installing the R-Stream Compiler. You can download The R-Stream Parallelizing C Compiler Getting Started Guide from the Reservoir Labs Distribution web site at: https://reservoir.com/rstream_support

To report problems with or provide feedback on the user documentation, send email to [email protected]. We welcome your feedback on how to improve it.

Conventions of Notation

This manual follows these conventions:

Fixed-spaced text
    Denotes example code or characters you enter on the command line.

Italicized text
    Denotes a term or cross-reference in general text.
    In example code or command line entries, it denotes a variable for which you must specify a value.

Bold text
    Denotes a selection to make.

4 Denotes a tip, hint, or reminder.

6 Denotes a caution or warning.

Page 4: R-Stream Parallelizing C Compiler Power User GuideFor Software Version 3.3.1 Preface Contacting Reservoir Labs For technical inquires: • Call 212.780.0527 • Send a fax to 212.780.0542

Definition of Terms

These terms are introduced here to aid in understanding the information presented in this manual:

Generalized Dependence Graph
    The Polyhedral Mapper has a distinct Intermediate Representation (IR) that is internal to it. This representation is called the Generalized Dependence Graph (GDG).

    Synonymous terms are matrix representation, polyhedral representation, polyhedral form, and Mapper IR. This manual uses the abbreviated term GDG when discussing the Polyhedral Mapper’s Intermediate Representation in general, and the term matrix representation when describing the mathematics.

Lowering
    Translating the parallelized and mapped polyhedral representation (GDG) of the C source code to its multi-assignment form. Lowering occurs during the postmapping stage.

Raising
    Translating original C source code from its Static Single Assignment (SSA) form to its polyhedral representation (GDG). Raising occurs during the premapping stage.


Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

Contacting Reservoir Labs . . . . . . . . . . . . . . ii

Trademarks . . . . . . . . . . . . . . . . . . . . . . ii

Copyrights . . . . . . . . . . . . . . . . . . . . . . . ii

What’s this Book About and Who’s it for? . . . . . . iii

Conventions of Notation . . . . . . . . . . . . . . . iii

Definition of Terms . . . . . . . . . . . . . . . . . . iv

1 Introducing R-Stream 1

R-Stream Architecture . . . . . . . . . . . . . . . . . . . 1

The Polyhedral Model . . . . . . . . . . . . . . . . . . . . 3

R-Stream Advantages and Trade-offs . . . . . . . . . . . . 7

2 Using the R-Stream Compiler 9

Running in Batch Mode . . . . . . . . . . . . . . . . . . . 9

Compiler Flags and Options . . . . . . . . . . . . . 10

Example Batch Mode Session . . . . . . . . . . . . 13

Controlling Compilation . . . . . . . . . . . . . . . 14

Fine Tuning Parallelization . . . . . . . . . . . . . . 18

Running in Interactive Mode . . . . . . . . . . . . . . . . 20

Interactive Mode Commands . . . . . . . . . . . . . 21

Interactive Mode Mapper Options . . . . . . . . . . 22


Example Interactive Session . . . . . . . . . . . . . 24

3 The Polyhedral Mapper 31

Understanding Mapper Basics . . . . . . . . . . . . . . . 31

Common Batch Mode Mapping Options . . . . . . . . . . 32

Passing Options to the Mapper . . . . . . . . . . . . . . . 33

Passing Arguments to Mapper Tactics . . . . . . . . . . . 36

4 Understanding Mapper Tactics 39

Combined multidimensional affine scheduling (as) . . . . 39

Command Synopsis and Arguments . . . . . . . . . 40

CUDA Geom (geom) . . . . . . . . . . . . . . . . . . . . 42

Command Synopsis and Arguments . . . . . . . . . 42

CUDA Placement (new_cudaplacement) . . . . . . . 43

Command Synopsis and Arguments . . . . . . . . . 43

Checks that barrier code is well-formed (checkBarriers) 43

Command Synopsis and Arguments . . . . . . . . . 43

Spatial layout optimization (cm) . . . . . . . . . . . . . . 44

Command Synopsis and Arguments . . . . . . . . . 44

Communication Optimization (CUDA-specific) (commopt) 44

Command Synopsis and Arguments . . . . . . . . . 44

Naive array expansion (ae) . . . . . . . . . . . . . . . . . 45

Command Synopsis and Arguments . . . . . . . . . 46


Promote all variables which are not live-in/live-out and do not involve communication to local arrays on the PE (array) . . . . . . . . . . . . . . . . . . . . . 46

Command Synopsis and Arguments . . . . . . . . . 46

Broadcast elimination (bcast) . . . . . . . . . . . . . . . 46

Command Synopsis and Arguments . . . . . . . . . 46

Communication generation (commgen) . . . . . . . . . . 46

Command Synopsis and Arguments . . . . . . . . . 47

Memory promotion tactic (mempromotion) . . . . . . . 48

Command Synopsis and Arguments . . . . . . . . . 48

Orthogonal tiling (tile) . . . . . . . . . . . . . . . . . . 49

Aliases . . . . . . . . . . . . . . . . . . . . . . . . . 49

Command Synopsis and Arguments . . . . . . . . . 49

Thread generation for CUDA and CSX (thread) . . . . . 51

Command Synopsis and Arguments . . . . . . . . . 51

Tile Scheduling (ts) . . . . . . . . . . . . . . . . . . . . 51

Command Synopsis and Arguments . . . . . . . . . 51

Pseudo Code Generation (c) . . . . . . . . . . . . . . . . 52

Aliases . . . . . . . . . . . . . . . . . . . . . . . . . 52

Command Synopsis and Arguments . . . . . . . . . 52

Communication Optimization (early_commopt) . . . . 52

Command Synopsis and Arguments . . . . . . . . . 52

Flexible thread generation (threadf) . . . . . . . . . . . 53

Command Synopsis and Arguments . . . . . . . . . 53


Late Data Relayout (latedatarelayout) . . . . . . . 54

Command Synopsis and Arguments . . . . . . . . . 54

Naive array contraction (lcontract) . . . . . . . . . . . 56

Command Synopsis and Arguments . . . . . . . . . 56

MaxFloat (maxfloat) . . . . . . . . . . . . . . . . . . . 57

Command Synopsis and Arguments . . . . . . . . . 57

Maxsink (maxsink) . . . . . . . . . . . . . . . . . . . . 57

Command Synopsis and Arguments . . . . . . . . . 57

Memory obfuscation using Ehrhart polynomials (mo) . . . 57

Command Synopsis and Arguments . . . . . . . . . 57

Multi-dimensional placement component (place) . . . . 58

Aliases . . . . . . . . . . . . . . . . . . . . . . . . . 58

Command Synopsis and Arguments . . . . . . . . . 58

Barrier Generation (sync) . . . . . . . . . . . . . . . . . 58

Aliases . . . . . . . . . . . . . . . . . . . . . . . . . 59

Command Synopsis and Arguments . . . . . . . . . 59

DMA generation (dma) . . . . . . . . . . . . . . . . . . . 59

Aliases . . . . . . . . . . . . . . . . . . . . . . . . . 59

Command Synopsis and Arguments . . . . . . . . . 59

Control Persistence (per) . . . . . . . . . . . . . . . . . 60

Command Synopsis and Arguments . . . . . . . . . 60

Combined placement, tile scheduling and tiling (placetile) 60

Aliases . . . . . . . . . . . . . . . . . . . . . . . . . 60


Command Synopsis and Arguments . . . . . . . . . 60

Polyhedral Simplification of Loops (CUDA only) (pol_simplify) 62

Command Synopsis and Arguments . . . . . . . . . 62

Polyhedral Unrolling (pol_unroll) . . . . . . . . . . . 63

Command Synopsis and Arguments . . . . . . . . . 63

Index-Set Unsplitting (fuseSplitted) . . . . . . . . . 63

Command Synopsis and Arguments . . . . . . . . . 66

Cosmetic domain tightening transformations (simp) . . . 66

Aliases . . . . . . . . . . . . . . . . . . . . . . . . . 66

Command Synopsis and Arguments . . . . . . . . . 66

Loop Simplification (simplifyLoop) . . . . . . . . . . 66

Command Synopsis and Arguments . . . . . . . . . 66

Unroll and jam (uj) . . . . . . . . . . . . . . . . . . . . . 66

Command Synopsis and Arguments . . . . . . . . . 67

Simple unrolling (unroll) . . . . . . . . . . . . . . . . 67

Command Synopsis and Arguments . . . . . . . . . 68

Virtual scratchpad mode (virtscratch) . . . . . . . . 68

Command Synopsis and Arguments . . . . . . . . . 68

5 R-Stream Target Architectures 69

SMP/OpenMP Architecture . . . . . . . . . . . . . . . . . 69

Base Mapping Strategy . . . . . . . . . . . . . . . . 70

Interactively Mapping to the Target Architecture . . . 70


Known Limitations . . . . . . . . . . . . . . . . . . 71

6 Machine Models 73

Understanding Machine Model File Structure . . . . . . . 73

Describing System Architecture . . . . . . . . . . . . . . 75

Processors . . . . . . . . . . . . . . . . . . . . . . . 75

Abstract Processors . . . . . . . . . . . . . . . . . . 77

Memory . . . . . . . . . . . . . . . . . . . . . . . . 77

Links . . . . . . . . . . . . . . . . . . . . . . . . . 79

Describing Execution Models . . . . . . . . . . . . . . . . 80

Morphs . . . . . . . . . . . . . . . . . . . . . . . . 80

Topology . . . . . . . . . . . . . . . . . . . . . . . 81

Deriving New Machine Models . . . . . . . . . . . . . . . 82

Modifying a Machine Model File . . . . . . . . . . . 82

Using the smp-config Utility . . . . . . . . . . . 84

7 Programming Guidelines 85

Affine Functions . . . . . . . . . . . . . . . . . . . . . . . 86

Loop Characteristics . . . . . . . . . . . . . . . . . . . . 86

Affine Parametric Loop Bounds . . . . . . . . . . . . . . 86

Affine Array Accesses . . . . . . . . . . . . . . . . . . . 88

Loop Nesting . . . . . . . . . . . . . . . . . . . . . . . . 90

Conditionals . . . . . . . . . . . . . . . . . . . . . . . . . 90

Passing arguments to mappable functions . . . . . . . . . . 92


Mappable vs. Unmappable Data-Dependent Structures . . 92

How to map loops with library calls . . . . . . . . . . . . 93

How to write an image function . . . . . . . . . . . 93

How to map unmappable code . . . . . . . . . . . . . . . 96

8 ARCC, the Auto-Tuner for RCC 101

Basic Operation . . . . . . . . . . . . . . . . . . . . . . . 101

Options . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

Logging and Reporting . . . . . . . . . . . . . . . . . . . 104

Advanced Usage . . . . . . . . . . . . . . . . . . . . . . 105

Running in Production/Consumption Mode . . . . . 106

Meta-Data Syntax . . . . . . . . . . . . . . . . . . . 106

Selecting Auto-Tunable Tactic Options . . . . . . . . 107

Bibliography 111

Index 113


1 Introducing R-Stream

The R-Stream Compiler is a source-to-source compiler. It acts as a high-level compiler (HLC), accepting C programs with mappable regions (identified by the user) as input, and produces parallelized and mapped C programs as output.

Executing the output code requires a Low-Level Compiler (LLC) (such as gcc, icc, or nvcc) to accept the mapped program and to generate code for the individual accelerator engines or host processors, using widely available and relatively generic compiler technologies for instruction selection, instruction scheduling, register allocation, and so on.

R-Stream Architecture

Figure 1.1 on page 2 shows the high-level structure of the R-Stream Compiler. The core of the compiler consists of the frontend; the Intermediate Representation (IR) and scalar optimizer; and the backend, which regenerates the mapped code in a source-level format. The Polyhedral Mapper (hereafter referred to as the Mapper) and the components for translating between the compiler’s IR form and the Mapper’s IR form (GDG), called raising and lowering, are an extension of the compiler.

The R-Stream Compiler uses the Edison Design Group (EDG) frontend to process user source files written in C. After this processing, user code is represented in the R-Stream Compiler’s IR form.

Unlike many existing source-to-source program transformation tools, which tend to retain the source syntax in their intermediate representation, the R-Stream Compiler uses a low-level IR to manipulate programs. This IR is a graph-based representation of the program at the operator level, in static single-assignment (SSA) form. This form provides a strong framework for aggressive scalar compiler optimizations.

Figure 1.1: The R-Stream infrastructure. (The figure shows the frontend, the IR and optimizations, and the target code generator, with mappable code raised into the Polyhedral Mapper and lowered back after mapping.)

By ignoring the source syntax and concentrating on the semantics of the input program, the R-Stream Compiler gains two main advantages:

• Transformations and analyses are much easier to develop because they are insensitive to syntactic restrictions or irregularities of the underlying language(s).

• Retargeting to another source or target language is easier because the compiler is not tied to one specific source or target language.

Following translation to the R-Stream Compiler’s IR form, the program is processed by a Premapping pass before the Mapper is invoked. This pass cleans up and simplifies the input code by employing traditional scalar optimization techniques.

After the premapping optimization pass, the mappable portions of the program are identified and translated, or "Raised", into the intermediate representation (GDG) used by the Mapper. The Mapper operates on this IR, then translates, or "Lowers", the mapped portions of the program back into the R-Stream Compiler’s IR. The mapped portions of the program are then subjected to another round of scalar optimizations in the postmapping optimization pass.

Finally, the R-Stream Compiler writes out the mapped program in a manner consistent with the selected target. Depending on the target, the output may be entirely in C source code, entirely in another target language (currently C-based OpenMP or pthreads), or partitioned such that part of the program is in C source code and part is in another target language.

The Polyhedral Model

The polyhedral model is a mathematical representation of programs. It was created in the late sixties to provide compilers with precise data-dependence analysis and advanced code restructuring capability. While it is a mature model, its use in the production environment became possible only recently owing to research results in code generation.

The representation of programs in current production compilers (for example, IBM XL, Intel ICC, or GCC) is an abstract syntax tree (AST). The AST representation is very close to the syntactic form of the program, the way a programmer writes it. Such a representation is not designed to facilitate transformation of complex programs and is inadequate for this task. A striking example is the concept of statement; while written only once by the programmer and appearing only once in the AST representation, it can be executed multiple times (for example, if it is enclosed inside a loop).

Consider the first statement S1 in the Ring-Roberts Edge Detection Filter shown in Figure 1.2 on page 4. This statement produces a new pixel of the temporary image Ring from the pixels surrounding it in the original image Img. Hence, this statement is executed many times, but because a classical compiler considers it a single entity:

• Only basic transformation schemes are possible.

  For example, loop fusion (which replaces multiple loops with a single one) relies on pattern matching. Fusing the two loops in Figure 1.2 is highly desirable to benefit from reuse of the array Ring. Nevertheless, no current production compiler is able to perform it because the loop bounds do not match.

• Data dependencies are highly over-estimated because the granularity is at the level of the whole statement, rather than the level of statement executions.

  Even if we artificially made the loop bounds match, current production compilers would still be unable to fuse the loops because of data dependencies, despite the fact that only a few statement executions would be blocking the transformation.

Because program restructuring is entirely about modifying the order of the various executions of statements, a more precise representation is needed.

/* Edge Detector for Noisy Images */
/* Ring Blur Filter */
for (i=1; i<length-1; i++)
  for (j=1; j<width-1; j++)
    /* S1 */
    Ring[i][j] = (Img[i-1][j-1] + Img[i-1][j] + Img[i-1][j+1] +
                  Img[i][j-1]   +               Img[i][j+1]   +
                  Img[i+1][j-1] + Img[i+1][j] + Img[i+1][j+1]) / 8;

/* Roberts Edge Detection Filter */
for (i=1; i<length-2; i++)
  for (j=2; j<width-1; j++)
    /* S2 */
    Img[i][j] = abs(Ring[i][j]   - Ring[i+1][j-1]) +
                abs(Ring[i+1][j] - Ring[i][j-1]);

Figure 1.2: Ring-Roberts Edge Detection Filter (left: image before filtering; right: image after filtering)

The polyhedral model is closer to the execution of the program because it provides a way to represent the various executions, called instances, of each statement.

For example, again consider the first statement, S1, in the Ring-Roberts Edge Detection Filter shown in Figure 1.2 on page 4. It is executed once for each pair of values of the surrounding loop counters i and j. In the polyhedral model, a set of values for each outer loop counter of a statement is called an iteration vector. For example, the iteration vector for the statement S1, with the value 4 for counter i and the value 7 for the counter j, would be (4, 7), and it would correspond to the execution of this statement instance:

/* S1, iteration vector (4, 7) */
Ring[4][7] = (Img[3][6] + Img[3][7] + Img[3][8] +
              Img[4][6] +             Img[4][8] +
              Img[5][6] + Img[5][7] + Img[5][8]) / 8;

The set of all possible iteration vectors for a statement is called the iteration domain of the statement. Iteration domains can be expressed in terms of the bounds of the loops that surround statements. For example, for S1 in Figure 1.2, the set of all iteration vectors is the set of all vectors (i, j) such that 1 ≤ i ≤ length − 2 and 1 ≤ j ≤ width − 2.

Using a mathematical notation, we define the iteration domain of S1 in the following way:

    D_S1 = { (i, j) | (i, j) ∈ Z^2, 1 ≤ i ≤ length − 2  ∧  1 ≤ j ≤ width − 2 }
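As a concrete illustration (the numeric values here are chosen for exposition and are not taken from the guide), setting length = 5 and width = 6 gives 1 ≤ i ≤ 3 and 1 ≤ j ≤ 4, so:

    D_S1 = { (1,1), (1,2), (1,3), (1,4),
             (2,1), (2,2), (2,3), (2,4),
             (3,1), (3,2), (3,3), (3,4) }

that is, 3 × 4 = 12 executed instances of S1.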


To easily manipulate the domains, we use a matrix notation of the following general form:

    D = { ~x | ~x ∈ Z^n, A~x + ~a ≥ ~0 }

which, for D_S1, corresponds to:

    D_S1 = { (i, j) | (i, j) ∈ Z^2,

             [  1   0 ]             [     -1     ]
             [ -1   0 ]   ( i )  +  [ length - 2 ]  ≥  ~0  }
             [  0   1 ]   ( j )     [     -1     ]
             [  0  -1 ]             [ width - 2  ]

In the polyhedral model, we manipulate iteration domains such that they embed every execution of the statements, not just their textual information. This provides the compiler with the finest granularity of representation. We use similar matrix representations for array accesses (called access functions) and for the ordering and placement of statement instances across processors (called space-time mapping functions).

The matrix representation enables manipulation of programs through integer linear programming tools. In this mathematical form, a restructuring is simply a solution to a system of constraints. The exact data-dependence information is encoded as the compulsory part of this system of constraints, so any restructuring is correct by construction. Other constraints are included to ensure that the restructuring has desirable properties (for example, parallelism granularity, data locality, SIMDization, and so on).

The fusion of the two main loops of the Ring-Roberts filter in Figure 1.2 on page 4, while respecting the data dependencies (trivial processing for the R-Stream Compiler), resulted in a nearly 50% reduction in cache misses for this code.
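To make the effect of such a fusion concrete, the sketch below shows one legal hand-fused schedule of the two loop nests of Figure 1.2: S2 is delayed by one iteration of i, so every Ring value it reads has already been produced and every Img value it overwrites has already been consumed by S1. This is only an illustration of the kind of schedule a polyhedral mapper can derive, not the code the R-Stream Compiler actually generates.

/* Illustrative fused schedule (not R-Stream output): S2 runs one
 * iteration of i behind S1, which satisfies all data dependencies. */
for (i=1; i<length-1; i++) {
  for (j=1; j<width-1; j++)                       /* S1, row i */
    Ring[i][j] = (Img[i-1][j-1] + Img[i-1][j] + Img[i-1][j+1] +
                  Img[i][j-1]   +               Img[i][j+1]   +
                  Img[i+1][j-1] + Img[i+1][j] + Img[i+1][j+1]) / 8;
  if (i >= 2)                                     /* S2, row i-1 */
    for (j=2; j<width-1; j++)
      Img[i-1][j] = abs(Ring[i-1][j] - Ring[i][j-1]) +
                    abs(Ring[i][j]   - Ring[i-1][j-1]);
}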


R-Stream Advantages and Trade-offs

As a parallelizing, mapping compiler, the R-Stream Compiler has some distinct advantages. For example, the source-to-source approach to compilation provides a high degree of portability. Target architectures possessing a high-quality, optimizing C compiler can use it to immediately compile the code output by the R-Stream Compiler.

By relying on the polyhedral model in its Mapper, the R-Stream Compiler realizes some competitive benefits, enabling it to parallelize a greater number of codes:

• A precise treatment of data dependence allows freer expression of parallelism.

• A rigorous mathematical framework results in mapping that is correct by construction.

• Composability of transformations enables effective searching of the mapping space by varying the order of phases.

However, using the Mapper imposes two limitations:

• Code to be mapped must be written in a somewhat restricted form of C, to correspond to the model of the extended static control program.

• Some scalability issues remain, especially for deeply nested loops and those with particularly complex dependencies.


2 Using the R-Stream Compiler

The R-Stream Compiler provides two modes of operation: batch mode and interactive mode.

You typically use batch mode to map all of your C source code and then automatically compile the newly mapped source code using the low-level compiler for your target system.

You generally use interactive mode when you want to see how the R-Stream Compiler processes your code during particular steps or phases, or to troubleshoot your code when results don’t match your expectations. When satisfied with results, you can either pass the resulting parallelized C source file to your low-level compiler, and compile your entire application manually; or switch to batch mode, applying the same tactics and arguments as used in interactive mode, to map and compile your entire application automatically.

Running in Batch Mode

In batch mode with mapping enabled, the R-Stream Compiler performs these tasks:

• Parses the C source code.

• Performs conventional scalar analysis and optimizations during the premapping stage.

– Conditional constant propagation

– Global value numbering and code motion

– Strength reduction

– Global reassociation

– Dead code elimination

– Function inlining

• Parallelizes mappable source code during the mapping stage.


• Performs conventional scalar analysis and optimizations on the newly parallelized code during the postmapping stage.

• Outputs an optimized C source code file, <filename>.gen.c.

• Outputs one or more parallelized C source code files, <filename>.gen-<target>.c, targeted to run on the specified target architecture.

• Passes the transformed C source code files to the target’s default low-level compiler and linker.

Compiler Flags and Options

To see a list of the available flags and options, enter rcc --help on the command line.

The usage syntax for passing options to the R-Stream Compiler is:

    rcc -[-][option...] [file...]

--ansi

Instructs the compiler to use the ANSI C standard.

-c, --compilation_only

Instructs the compiler to compile but not link.

--compiler=<optimization:options>

Passes the specified options to the specified scalar optimization.

To see the arguments available for any scalar optimization, run rcc --help with the name of the optimization as an argument; for example, rcc --help GVNGCM.

-d [bitflags], --debug=[bitflags]

Enables internal compiler debugging. Valid values for the bitflags are:

1 Dump the IR in internal format after each optimization.


2 Dump the CFGs as dotty graphs after each optimization.

4 Dump the IR as a dotty graph after each optimization.

-D<macro>, --defines=<macro>

Adds the specified macro definitions.

--debugger

Enables BeanShell debugging, so that when the compiler crashes, it enters the BeanShell.

--edg_options=<options>

Passes the specified options to the compiler frontend. Use this flag only at the direction of Reservoir Labs.

-f<feature>, --features=<feature>

Specifies the optimization to run; for example, -fpic, -fPIC, -fno-built-in, -fmpa-only, -ffpga, -fopenmp, and so on.

Many of these optimizations can be disabled using the -fno-<feature> form of this flag.

-g, --produce_debugging_information

Instructs the compiler to generate debugging information in the operating system’s native format.

-h, --help

Prints the list of the compiler’s options and flags to the standard output.

-I<path>, --include_paths=<path>

Specifies the path to search for user-supplied header files.

-L<path>, --lib_paths

Specifies the path to search for user-supplied libraries.

-l<library>, --libraries=<library>

Instructs the compiler to search the specified library. Because a library is searched when the compiler encounters its name, the order of -l operands on the command line is significant.

--log=<path>

Instructs the compiler to write diagnostics to the specified log file.


--map:<options>

Passes the specified options to the Mapper. For details, see page 33.

-march=<arch>

Specifies the target architecture for which to optimize the code.

Valid values are: x86_64-linux, tile64, tilepro, i686-linux, i686-cygwin, and ia64-linux.

--mm=<machine_model>

Specifies the machine model file to use. For details, see Machine Models on page 73.

-O<level>, --optimization_level=<level>

Specifies the optimization level.

-o<path>, --output_file=<path>

Specifies the name of the output file. Defaults to a.out when no file name is specified.

--polymap

Enables the Mapper.

-S Instructs the compiler to generate transformed source code, but not to compile it using the low-level compiler.

--setmm=<key:value>

Overrides the value of the specified key in the machine model file with the new value. For details, see Modifying a Machine Model File on page 82.

--shell

Starts the rstream shell, causing the compiler to enter interactive mode.

--types=<types_file>

Specifies the machine types file to use. For the default installation, these perl script (.pl) files are located in $RSTREAM_HOME/rstream/mm/.

In batch mode, use this switch if you do not supply the -march=<arch> batch mode switch.


In interactive mode, the counterpart to this switch, loadTypes(), is required whenever you cross-compile because the interactive shell cannot access the perl script driver.

-U<macro>, --undefines<macro>

Cancels the previous macro definition.

4 For a list of Mapper options and switches and their descriptions, see The Polyhedral Mapper on page 31.

Example Batch Mode Session

This example session demonstrates how to compile for parallel execution.

4 Make sure the functions in your source code that you want the R-Stream Compiler to parallelize conform to static control program structure. For details, see Programming Guidelines on page 85.

1. In your source code, specify which functions are valid regions for mapping by tagging each with a #pragma rstream map directive. For example:

#pragma rstream map
double f(int n, int c)
{
  int i, j;
  for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
      C[i + c][i][j] = A[2*i][3*j + c] + B[j];
    }
  }
}

In this example, n and c are loop-invariant values, which we call parameters. These parameters must be formal parameters of the function you want the Mapper to map.

(For details on using the mapping pragmas and command-line switches, see Fine Tuning Parallelization on page 18).


2. Enable mapping. On the command line:

• Use the --polymap switch to activate the Mapper.

• Use the --mm=<machine_model> switch to specify to the Mapper which machine model to use for the target architecture.

• Use the -march=<arch> switch to specify to the R-Stream Compiler which ABI (Application Binary Interface) to use for the target architecture.

Valid -march values are x86_64-linux, tile64, tilepro, i686-linux, i686-cygwin, and ia64-linux.

3. Run the compiler this way:

$ rcc --polymap --mm=core2duo -march=x86_64-linux \
      -o myapp myapp.c

Controlling Compilation

This section explains how to:

• Handle header files

• Generate only parallelized C source files

• Specify a nondefault backend compiler

• Exclude the Mapper from compilation

• Control inlining within mappable regions

• Automate compilation using a software configuration and build utility

(For a list of all R-Stream Compiler options, run rcc --help.)


Handling Header Files

By default the R-Stream Compiler expands header (#include) files and prepends their content to corresponding generated output files. This behavior can result in large and hard-to-read generated output files.

You can disable this behavior using the -fcollapseinclude-files command-line switch. This switch directs the R-Stream Compiler to try to re-create the original #include directives in the generated output.

This feature typically works well with standard system headers, but may not work if a header file contains executable code, or when programs are conditionally compiled.
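For example, a batch invocation that both maps the code and asks for collapsed headers might look like the following (the switch is spelled here exactly as printed above; run rcc --help to confirm the spelling on your installation):

$ rcc --polymap --mm=core2duo -march=x86_64-linux \
      -fcollapseinclude-files -S myapp.c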

Generating Parallelized C Source Files Only

Using the -S command-line switch, you can direct the R-Stream Compiler to stop compilation after it generates the mapped files in source code (in C, CUDA, etc.). The default file name for a mapped C file is <filename>.gen.<target>.c. This is useful if you want to modify the parallelized C source code before you compile it for the target machine.

Specifying a Different Backend Compiler

To use a low-level backend compiler other than the default, compile your program as you normally would using the R-Stream Compiler, but include the --backend flag. Options can also be passed to the backend compiler using the --backendoptions flag. For example:

$ rcc --polymap --mm=core2duo -march=x86_64-linux --backend=icc \
      --backendoptions="-O3 -g" foo.c -o foo


Excluding the Mapper from Compilation

If you run rcc without invoking the Mapper, the R-Stream Compiler performs only these steps:

• Parses the C source code.

• Performs scalar analysis and optimizations (for a list of these optimizations, see page 9).

• Outputs an optimized, sequential C source code file, <filename>.gen.c.

• Passes the transformed C source code file to the default low-level compiler and linker.

To compile your C source code for sequential execution, run the R-Stream Compiler without the --polymap, --mm=<machine_model>, and -march=<arch> command-line switches; for example:

$ rcc myapp.c

The R-Stream Compiler generates an optimized version of the original C program (<filename>.gen.c) and passes that file to the default compiler, gcc. Depending on the content of the original C program, the resulting executable may or may not be better optimized than that generated by gcc using the default settings.

Controlling Inlining Within Mappable Regions

Using the #pragma rstream inline directive, you can tag small routines called from within mappable regions (those tagged with a #pragma rstream map directive) that you want the Mapper to attempt to inline. After the Mapper identifies the mappable functions, it performs an inlining phase to inline any valid small routines tagged with #pragma rstream inline, following these rules:


• Functions and function calls tagged with #pragma rstream inline can be inlined.

• Recursive functions cannot be inlined.

(For details on using #pragma rstream map directives, see Fine Tuning Parallelization on page 18.)
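The following minimal sketch illustrates these rules; the function and array names are hypothetical, and the helper is kept small and non-recursive so that it is a valid inlining candidate:

#define N 1024
double in[N], out[N];

/* Hypothetical small, non-recursive helper tagged for inlining into
 * the mappable region below (both live in the same file). */
#pragma rstream inline
static double avg3(double a, double b, double c)
{
  return (a + b + c) / 3.0;
}

#pragma rstream map
void smooth(int n)
{
  int i;
  for (i = 1; i < n - 1; i++) {
    out[i] = avg3(in[i-1], in[i], in[i+1]);
  }
}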

Automating Batch Mode Compilation

The R-Stream Compiler separates source code into sequential and parallel components, such that if a single C source file contains one or more mappable loops, it will output two new C source files. One output file, <filename>.gen.c, will contain the sequential code, and the other output file, <filename>.gen_<target>.c, will contain the mapped code, parallelized for the target architecture.

Following these file-naming guidelines, it is possible to invoke the R-Stream Compiler using an automatic software configuration and build utility, such as make. To do so, in the makefile, add rules and dependencies for the parallelized C source output files, and invoke the R-Stream Compiler on the base C source input files.

For example, here is a simple makefile that builds an application, consisting of a main program and some compute kernel code, for the Mapper component to parallelize for a 64-bit x86 processor running Linux:

CC       = gcc
CFLAGS   = -O3
RCC      = rcc
RCCFLAGS = -S --polymap --mm=core2duo -march=x86_64-linux

SRCS = main.c kernel.c kernel.gen.c
OBJS = main.o kernel.o kernel.gen.o

default: myprog

.c.o:
	$(CC) $(CFLAGS) -c $<

kernel.gen.c: kernel.c
	$(RCC) $(RCCFLAGS) kernel.c

myprog: $(OBJS) $(SRCS)
	$(CC) -o $@ $(OBJS)

4 If your application has multiple compute kernels, each in a separate file, that you need to fine tune differently, you can do so by writing special rules for those files in the application’s makefile.
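For instance, a second kernel that needs different tuning could get its own rule (the file name and the per-kernel option below are hypothetical; any of the mapping switches described in this chapter can be substituted):

# Hypothetical extra rule: map kernel2.c with its own options.
kernel2.gen.c: kernel2.c
	$(RCC) -S --polymap --mm=core2duo -march=x86_64-linux \
	       -fnomap-function=init_tables kernel2.c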

Fine Tuning Parallelization

Using a combination of command-line switches in batch mode and #pragma rstream directives in both batch and interactive modes, you can select with increasing granularity the regions of code you want the Mapper to parallelize.

• You must always enable mapping using these command-line switches:

  --polymap
      Activates the Mapper. If omitted, the Mapper ignores all command-line mapping switches (-fmap*) and #pragma directives embedded in the source code.

  --mm=<machine_model>
      Specifies which machine model .xml file to use. This file describes the target architecture’s system components and its system capabilities and execution models that the Mapper can exploit.

  -march=<arch>
      Identifies the data types that the target architecture supports.

• Using a combination of these command-line switches and pragma directives, you can fine tune which functions the Mapper acts upon:

  -fmap-function=<name>
      Identifies and requests the Mapper to map a valid function, name.
      Overrides -fmapall and all mapping pragma directives.


  -fnomap-function=<name>
      Identifies and requests the Mapper to omit mapping the function, name.
      Overrides -fmapall and all mapping pragma directives.

  -fmapall
      Requests the Mapper to map all functions in the C source file, asserting all are valid mappable regions.
      Does not override any other mapping command-line switch or any pragma directive.

• In both batch and interactive modes, the #pragma rstream directives provide control down to the loop level and enable inlining of eligible function calls:

  #pragma rstream nomap
      Inserted directly before a function description, requests the Mapper to omit mapping the following function. At the function level, overrides -fmapall and the #pragma rstream map directive.
      Inserted directly before a for loop within a function, requests the Mapper to omit mapping the following loop. At the loop level, does not override any mapping command-line switch or any other mapping pragma directive.

  #pragma rstream map
      Inserted directly before a function description, requests the Mapper to map the following valid function.
      Inserted directly before a for loop within a function, identifies and requests the Mapper to map the following valid loop.
      Does not override any mapping command-line switch or any other mapping pragma directive at either the function or loop level.

  #pragma rstream inline
      Inserted directly before a function description that is called from a mappable region, tagged with a #pragma rstream map directive and located within the same file, identifies and requests the Mapper to inline that function into the mappable region.
      Does not override any mapping command-line switch or any -fnomap-function=<name> switch.

Precedence Rules and Guidelines

• When a -fmap-function=<name> switch and a -fnomap-function=<name> switch, both of which specify the same function, appear on the same compiler command line, the one located rightmost on the command line takes precedence.

• To exclude particular functions when -fmapall is specified, use -fnomap-function=<name> switches to specify which functions to exclude.

• Use the -fnomap-function=<name> switch to override a function definition tagged with a #pragma rstream map directive.

• Include -fmap-function=<name> switches to explicitly request the Mapper to map the specified functions.
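As a short illustration of these rules, the following command maps every function in myapp.c except init_tables (the function name is hypothetical); because -fnomap-function is used to exclude functions from -fmapall, that one function is left sequential:

$ rcc --polymap --mm=core2duo -march=x86_64-linux \
      -fmapall -fnomap-function=init_tables -o myapp myapp.c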

Running in Interactive Mode

Interactive mode is a powerful tool for exploring different approaches to mapping code. It enables you to control the compiler in ways that are often unwieldy to execute from the command line in batch mode.

Interactive mode is implemented using the BeanShell interactive framework. BeanShell is a small, Java-language source interpreter that provides a scripting environment. In this environment, interactive mode provides a suite of commands, implemented as Java functions, for interacting with the compiler. From the interactive shell, you can issue these commands directly from the command line or in batch, using a script file.


Compiling a program in interactive mode typically involves the following nine steps (as demonstrated in Example Interactive Session on page 24):

1. Invoke the R-Stream Compiler in interactive mode.

2. Load the appropriate machine model.

3. Bring the code into the compiler.

4. Run premapping optimizations.

5. Pass the optimized code to the Mapper.

6. Run Mapper tactics.

7. Return mapped code to the compiler.

8. Run postmapping optimizations.

9. Emit mapped code.

This interactive process broadly follows the steps taken by the R-Stream Compiler running in batch mode, but it gives you control over all of the optimizations and Mapper phases the R-Stream Compiler executes.

Several commands are provided as a convenience to automate commands that are commonly executed in sequence. Also, it’s not always necessary to complete each step. For example, if you are experimenting with different mappings, returned results can often be interpreted immediately, making it unnecessary to execute the remaining steps and produce a final version of the mapped program.
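As a sketch of that kind of automation, the commands documented in this chapter could be collected in a script file and replayed through the shell. The file name and the exact way of feeding the script to rcc --shell are assumptions; the individual commands are the ones described in the sections that follow:

// map_heat.bsh (hypothetical script file), e.g. fed to the shell as: rcc --shell < map_heat.bsh
loadMM("core2duo");                                                  // select the machine model
toGDG(RSTREAM_HOME + "/rstream/examples/apps/heat_3D/heat_3D.c");   // parse, optimize, and Raise
as();                                                                // affine scheduling tactic
tile();                                                              // orthogonal tiling tactic
run("Lower");                                                        // back to the compiler IR
run("DeSSA");                                                        // out of SSA form
run("EmitC:heat_3D.gen.c");                                          // emit the transformed C file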

Interactive Mode Commands

The interactive shell operates much as an operating system’s command-line shell does. You enter a command at the prompt, and the result is returned to the shell window after the command ends.

To invoke interactive mode, use the command rcc --shell.


Using the help(); command at any time within the shell, you can view the list of all available commands.

A separate help command, mapperhelp();, provides detailed information about the available Mapper commands. As with help();, you can issue mapperhelp(); at the prompt at any time. This help includes a list of the available Mapper tactics, and the command mapperhelp("<tactic>") provides help for the specified tactic.

For example, to get help on the tiling tactic, issue the command mapperhelp("tile");. This command outputs the specified help information to the shell window. Help information for each Mapper tactic typically includes a description of the tactic and the list of options available to it.

Note that a Mapper tactic accepts zero or more arguments. When there are one or more arguments, they are enclosed within quotes (" ") and separated by commas. For example, to call the affine scheduling tactic with no options, issue the command as();. To call the affine scheduling tactic, as, with the maximal_fission and contiguity options, issue the command as("maximal_fission, contiguity");.

Interactive Mode Mapper Options

Though interactive mode provides the same Mapper options that are available in batch mode, you specify them using interactive mode commands, rather than command-line switches.

Some of the interactive mode commands relevant to enabling and using the Mapper are:

codegen()

This function generates pseudocode from the current GDG.

GDG()

This function returns the current GDG for processing (after the Raise phase is run). By default, this is the first GDG that is generated by the Raise phase.

During the Raise phase, the compiler generates a GDG for each function in the source code that is tagged with a #pragma rstream map directive. The general procedure for mapping these tagged functions is to step through and map each GDG, using the setGDG() command to select a specific GDG and the map() command to map it.

For details on the setGDG() command, see page 24. For details on the map() command, see page 23.

To see a complete list of the GDGs generated by the Raise phase, use the GDGs() command.

GDGs()

This function returns the list of generated GDGs (after the Raise phase is run).

loadMM(machine_model)

This command is equivalent to using the --mm=machine_model (see page 12) batch mode command-line switch.

Example: > loadMM("core2duo");

loadTypes(types_file)

This command is equivalent to using the --types=<types_file> (see page 12) batch mode switch. For example:

loadTypes("x86_64-gcc-linux");

map() or map(<options>)

This command runs the composite mapper tactic.

See Passing Options to the Mapper on page 33 for a list of arguments that the composite tactic accepts.

parse(<filename>)

This command can be used to parse a file. This is the same as run("PARSE:filename").

runall()

This command runs all the scalar optimizations necessary before running Raise. It is equivalent to:

run("SSA");
run("CCP");
run("GVNGCM");
run("OSR");
run("DCE");

run(phase)

This command runs an optimization phase. A colon separates the name of the phase from its arguments, and commas separate multiple options. For example:

run("phase:option1,option2,..., optionN")

setGDG(GDG)

This command selects the current or specified GDG for processing by the Mapper.

To select a specific GDG for processing, first run the GDGs() command to get a list of the generated GDGs, then issue this command:

setGDG(GDGs().get(#));

where # is the index number of the GDG you want the Mapper to process.

For more details, see the description of the GDG() command on page 22. (A combined usage example follows the toGDG() entry below.)

toGDG(path)

This command is equivalent to the sequence of commands parse(path); runall(); run("Raise");

For example: toGDG("~/code.c");
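Putting these commands together, a typical sequence for mapping one specific tagged function might look like this (the index 1 is hypothetical; pick it from the list printed by GDGs()):

> toGDG("~/code.c");       // parse, optimize, and Raise
> GDGs();                  // list the GDGs, one per tagged function
> setGDG(GDGs().get(1));   // select the second GDG in the list
> map();                   // run the composite mapper tactic on it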

Example Interactive Session

This example session demonstrates the commands typically used in interactive mode.

We start with a sequential implementation of the 3D heat equation, which essentially consists of multiple Jacobi stencil iterations. The code is located in:

rstream-x.x.x/rstream/examples/apps/heat_3D/heat_3D.c


In the original code, Ai and Bi are globally-defined arrays, and NSPTS, X, Y, and Z are global parameters in the context of the function:

for (s=0; s<NSPTS; s++) {
  for (i=1; i<(X-1); i++) {
    for (j=1; j<(Y-1); j++) {
      for (k=1; k<(Z-1); k++) {
        Bi[i][j][k] = (1-6*R) * Ai[i][j][k] +
                      R * (Ai[i-1][j][k] + Ai[i][j-1][k] +
                           Ai[i][j][k-1] + Ai[i][j][k+1] +
                           Ai[i][j+1][k] + Ai[i+1][j][k]);
  } } }
  for (i=1; i<(X-1); i++) {
    for (j=1; j<(Y-1); j++) {
      for (k=1; k<(Z-1); k++) {
        Ai[i][j][k] = (1-6*R) * Bi[i][j][k] +
                      R * (Bi[i-1][j][k] + Bi[i][j-1][k] +
                           Bi[i][j][k-1] + Bi[i][j][k+1] +
                           Bi[i][j+1][k] + Bi[i+1][j][k]);
  } } }
}

1. Invoke the R-Stream Compiler’s interactive shell:

   > rcc --shell
   [rstreamc-sh ...]

   ___________ RStream shell ___________
   Type ’help();’ for a list of commands
   _____________________________________

   BeanShell 2.0b4 - by Pat Niemeyer ([email protected])

2. Load an appropriate machine model file for the target architecture. In this example, the machine model file that describes Intel’s Core2 Duo architecture:

> loadMM("core2duo");
[ Loading [/home/username/work/rstream-3.1.4/rstream/mm/core2duo-mapper.xml] ]

4 You can use the abbreviated name of the machine model file, in this case core2duo instead of core2duo-mapper.xml.


3. Bring the code into the IR of the R-Stream Compiler.

   > parse(RSTREAM_HOME+"/rstream/examples/apps/heat_3D/heat_3D.c");

heat_3D.c>

In this example, RSTREAM_HOME is a predefined path to /home/username/work/rstream-x.x.x.

4 The prompt changes, indicating the C code file is active.

4. Run a premapping optimization pass. The runall() command is typically used. It runs the same sequence of optimization passes that the compiler runs in batch mode and uses the same default options.

heat_3D.c> runall();

5. Run the Raise phase, which translates the code into the intermediate representation (GDG) used by the Mapper:

heat_3D.c> run("Raise");

This step produces a Generalized Dependence Graph (GDG), which graphically represents statements in the program as nodes and the dependencies between statements as edges. You can print it by issuing the print(GDG()); command.

You can also see what the Mapper is doing by issuing the command codegen();, which outputs the current GDG in the Mapper’s internal pseudocode. The Mapper’s internal pseudocode has this structure:

// _stencil_sweeps
morph smp
processor(s) [PC]
index size 32 bits;
(global) double Bi[*][251][251] input output livein liveout;
(global) double Ai[*][251][251] input output livein liveout;
for (i = 0; i <= 248; i++) {
  for (j = 0; j <= 248; j++) {
    for (k = 0; k <= 248; k++) {
      _stencil_sweeps_0(>Bi[1 + i,1 + j,1 + k],
                        <Ai[1 + i,1 + j,1 + k],
                        <Ai[i,1 + j,1 + k],
                        <Ai[1 + i,j,1 + k],
                        <Ai[1 + i,1 + j,k],
                        <Ai[1 + i,1 + j,2 + k],
                        <Ai[1 + i,2 + j,1 + k],
                        <Ai[2 + i,1 + j,1 + k])
} } }
for (i = 0; i <= 248; i++) {
  for (j = 0; j <= 248; j++) {
    for (k = 0; k <= 248; k++) {
      _stencil_sweeps_1(>Ai[1 + i,1 + j,1 + k],
                        <Bi[1 + i,1 + j,1 + k],
                        <Bi[i,1 + j,1 + k],
                        <Bi[1 + i,j,1 + k],
                        <Bi[1 + i,1 + j,k],
                        <Bi[1 + i,1 + j,2 + k],
                        <Bi[1 + i,2 + j,1 + k],
                        <Bi[2 + i,1 + j,1 + k])
} } }

4 Most mapping steps result from the application of a tactic. Taking a GDG as input, a tactic performs the specified transformation on that code, then generates another GDG. Up to this point in the example session, no polyhedral transformations have been applied to the program code.

Hereafter for this example session, only the first access of each statement is represented in snippets of internal pseudocode.

4 Since the last three commands (parse(), runall(), and run("Raise")) used in this example are frequently executed in succession, they’ve been bundled together in the command toGDG(<filename>).

So, in this example, we could have used

   > toGDG(RSTREAM_HOME+"/rstream/examples/apps/heat_3D/heat_3D.c");

instead of the three separate commands.

6. Call the affine scheduling tactic, as, to start the mapping, then call the codegen(); command to output the internal pseudocode from the transformed GDG to see the results of the transformation.

(For details on the affine scheduling tactic, see Combined multidimensional affine scheduling (as) on page 39.)

heat_3D.c> as();
heat_3D.c> codegen();


// _stencil_sweeps
// DummyCodeGenerator (110ms)
// AffineSchedulingTactic (394ms)
morph smp
processor(s) [PC]
index size 32 bits;
(global) double Bi[*][251][251] input output livein liveout;
(global) double Ai[*][251][251] input output livein liveout;
doall (i = 0; i <= 248; i++) { //p
  doall (j = 0; j <= 248; j++) { //p
    doall (k = 0; k <= 248; k++) { //p
      _stencil_sweeps_0(>Bi[1 + i,1 + j,1 + k], ...);
      _stencil_sweeps_1(>Ai[1 + i,1 + j,1 + k], ...);
} } }

4 Instances of //p at the ends of lines in the pseudocode identify permutable loops.

Note that each of the original for loops has now been converted into a doall loop.

Unlike in traditional AST-based compilers, there is no phase ordering between fusion, skewing, and shifting, because the underlying polyhedral model considers only statements and then schedules their iterations according to the dependencies.

7. Call the tile tactic, tile, to start the tiling step, then output the internal pseudocode from the newly transformed GDG to see the results of this transformation.

(For details on the tile tactic, see Orthogonal tiling (tile) on page 49.)

heat_3D.c> tile();
heat_3D.c> codegen();

(global) double Bi[*][251][251] input output livein liveout;
(global) double Ai[*][251][251] input output livein liveout;
doall (i = 0; i <= 7; i++) {
  doall (j = 0; j <= 7; j++) {
    doall (k = 0; k <= 1; k++) {
      doall (l = 32 * i; l <= min(32 * i + 31, 248); l++) {
        doall (m = 32 * j; m <= min(248, 32 * j + 31); m++) {
          doall (n = 128 * k; n <= min(248, 128 * k + 127); n++) {
            _stencil_sweeps_0(>Bi[1 + l,1 + m,1 + n], ...);
            _stencil_sweeps_1(>Ai[1 + l,1 + m,1 + n], ...);
} } } } } }

Tiling is necessary to extract coarse-grained parallelism from a nest of permutable loops. In practice, intertile and intratile dimensions are created for each tiled loop in the loop nest.

Permutable intertile loops can be skewed further by using the tile scheduling (ts) tactic to expose coarse-grained parallelism, which OpenMP can exploit.

The next three commands perform low-level transformations on the GDG to generate compilable code for the low-level compiler.

8. Run the phase Lower to translate the Mapper IR of the code into the R-Stream Compiler's IR.

heat_3D.c> run("Lower");

9. Run the DeSSA optimization to remove the phi- and sigma-nodes and translate the R-Stream Compiler's IR into multi-assignment form.

heat_3D.c> run("DeSSA");

10. Run the command EmitC to generate the transformed C source code file.

heat_3D.c> run("EmitC:"+RSTREAM_HOME+"/rstream/examples/apps/heat_3D/heat_3D.gen.c");

Using your low-level compiler, compile all of the heat_3D<.*>.c files to generate an executable that will run on the target machine.
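For instance, if gcc is your low-level compiler and your target is OpenMP, a command along these lines should work (a sketch only; the exact flags, include paths, and any R-Stream runtime files you must add depend on your installation and target):

gcc -O3 -fopenmp heat_3D*.c -o heat_3D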


3 The Polyhedral Mapper

The Polyhedral Mapper, an internal component of the R-Stream Compiler, is a powerful loop analysis and transformation framework that is based on the polyhedral model of static control programs. It automatically parallelizes and maps C source code to run on a specified target architecture.

You can run the Mapper in batch mode or in interactive mode.

Understanding Mapper Basics

By default, the R-Stream Compiler performs scalar optimizations in the following order:

SSA Computes the Static Single-Assignment (SSA) representation [CFRZ91, CP95] from the non-SSA quads representation.

CCP Performs Conditional Constant Propagation. [WZ91]

GVNGCM Performs Global Value Numbering and Global Code Motion. [Cli95]

OSR Performs Operator Strength Reduction. [CSV01]

DCE Performs Dead Code Elimination.

DeSSA Translates the SSA form back into non-SSA form.

Supplying the --polymap switch in batch mode to the R-Stream Compiler invokes the Mapper, adding three phases between the DCE and DeSSA optimization passes:

Raise

Translates mappable functions from SSA form into the Mapper's internal representation, called the Generalized Dependence Graph (GDG).

On functions marked for mapping, the R-Stream Compiler performs inlining and identifies which of the functions the Mapper can actually map.

For each mappable function, the R-Stream Compiler determines its loop and induction structure, identifies all of its arrays and affine indexing expressions, and generates a GDG for it.

PolyhedralMapperNew

For each GDG generated in the raising phase, the Mapper performs the appropriate mapping operations. Since the mapping process can be complex, each operation is internally decomposed into multiple Mapper optimizations called tactics.

Normally, the Mapper automatically determines which tactics to run and the arguments to pass them. However, you can alter this default behavior using the --map: switch (for details, see page 33).

Lower

Translates each mapped GDG back into the target API's SSA form. All mappable functions are replaced by their mapped versions.

4 For details on running the Mapper in interactive mode, see Running in Interactive Mode on page 20.

Common Batch Mode Mapping Options

The most common mapping options used on the rcc command line are:

-fopenmp

Generates OpenMP code.

This option is equivalent to using the options --map:no-placement, --map:no-synchronization, and --map:unroll-and-jam.

--map:

This option has two forms. One passes options to the Mapper driver, and the other passes options to a specified Mapper tactic.

--map:<option1,...,optionN>

Passes the specified options to the Mapper driver. For details, see Passing Options to the Mapper on page 33.

--map:<tactic>=<option1,...,optionN>

Passes the specified options to the specified Mapper tactic. For details, see Passing Arguments to Mapper Tactics on page 36.

--mm=<machine_model>

Specifies the machine model file that the Mapper will use. This option is required when enabling the Mapper.

--polymap

Enables the Mapper.

--setmm=<key>:<value>

Overrides the previously specified machine model file with the bindings specified by key and value.

For details, see Machine Models on page 73.
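For example, a typical batch-mode invocation that enables the Mapper for an OpenMP target might look like this (the source file and machine model path are illustrative; the supplied model files are described in Machine Models on page 73):

rcc --polymap --mm=$RSTREAM_HOME/rstream/mm/core2duo-mapper.xml -fopenmp heat_3D.c

To override a single machine model binding, you could additionally pass, for instance, --setmm=proc.cpu.geometry:[4] (key and value illustrative).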

Passing Options to the Mapper

Use the --map:<options> switch to enable and disable various internal options to the Mapper driver and to the default high-level composite tactic.

4 While some options appear to be simple tactics, they are not. Instead, they imply that a set of tactics (possibly target-specific) will be applied.

Some useful options are:

array-expansion

Performs naive array expansion to remove false dependencies in the program.

Default is off.

corrective-array-expansion

Performs array expansion with correction. A more sophisticated algorithm, it performs array expansion only as necessary.

First, it performs scheduling while ignoring all false dependencies, then it checks whether these dependencies were violated. If so, it corrects them by performing array expansion. As a result, array expansion is performed only when extra parallelism is obtainable.

Default is off.

dump-codegen

After the completion of each tactic, outputs the current GDG in the Mapper's internal pseudocode.

dump-GDG

After the completion of each tactic, outputs the internal GDG representation.

dma Performs DMA generation. Use this option only on architectures that provide DMA support.

By default, the Mapper automatically determines whether DMA generation is necessary.

To force the Mapper to perform DMA generation, specify --map:dma. To prevent the Mapper from performing DMA generation, specify --map:no-dma.

logfile=<filename>

Sets the output log file to filename. Normally, the Mapper emits a large quantity of information to aid debugging. Sending this information to a log file avoids cluttering the normal diagnostic output.

multi-buffering

Turns on multibuffering in communication generation.

This option is meaningful only on architectures that require explicit communication.

Default is off.

placetile

Forms tasks (tiling) and distributes them (processor placement) across the target processing elements.

By default, the Mapper automatically determines whether placement is necessary.

synchronization

Performs synchronization generation. For some architectures and runtimes, this operation is unnecessary.¹

By default, the Mapper automatically determines whether synchronization generation is necessary.

To force the Mapper to perform synchronization, specify --map:synchronization. To prevent the Mapper from performing synchronization, specify --map:no-synchronization.

unroll-and-jam

Performs unroll-and-jam optimization.

Performing unroll-and-jam in the Mapper is not always profitable. Profitability depends on how well the low-level compiler can optimize output generated by the R-Stream Compiler.

For example, many compilers, such as icc, already do an excellent job performing optimizations identical or similar to unroll-and-jam, and repeating this optimization in the Mapper may inhibit other low-level optimizations, resulting in slower code.

Default is off.

¹ For OpenMP targets, synchronization is implicitly tied to OpenMP pragmas.

cuda_geom

Activates the CUDA Geom tactic, which lets the user set the CUDA kernel execution configuration parameters and nvcc compiler hints (such as the minimum number of thread blocks per multiprocessor).

Default is off.

max_sink

Activates the max_sink tactic, which expands the iteration spaces so that all statements have the same number of dimensions afterwards.

Default is off.

privatize

Performs privatization on GPUs (promoting variables accessed within a thread to thread-private memory, i.e., registers).

By default, the Mapper automatically determines whether privatization is feasible. However, to prevent the Mapper from performing privatization, specify --map:no-privatize.
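As an illustration, several of these driver options can be combined in one switch (the log file name is hypothetical):

--map:dump-codegen,logfile=mapper.log

This outputs the current pseudocode after each tactic completes and sends the Mapper's diagnostic output to mapper.log.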

Passing Arguments to Mapper Tactics

Use the --map:<tactic>=<options> switch to pass arguments to a specified Mapper tactic.

4 For detailed descriptions of some of the most commonly used Mapper tactics and their arguments, see Understanding Mapper Tactics on page 39.

The Mapper contains an extensive suite of internal tactics that perform various mapping optimizations. These optimizations and the tactics (in parentheses) that execute them are:

Affine scheduling (as)

This tactic is used to extract parallelism.

Array expansion (ae, cae)

These tactics perform array expansion to remove false dependencies.

Array placement (array)

This tactic performs array placement by determining which memories to assign to which array. Use this tactic only when the target architecture has multiple distributed memories.

Communication generation (commgen)

This tactic emits communication code, according to the result of memory promotion.

Composite tactic (map)

This top-level tactic is the default tactic. It runs when mapping is enabled and no tactic is specified on the command line. It determines which other tactics to run, the order in which to run them, and the options to pass each.

See Passing Options to the Mapper on page 33 for the arguments that the composite tactic (map()) accepts.

Dependence analysis (dep)

This tactic performs dependence analysis.

DMA generation (dma)

This tactic performs DMA generation by converting abstract communication primitives into DMA primitives.

Loop simplification (simplifyLoop)

This tactic performs equivalent scheduling to simplify loop nests. The resulting loop nests have the same order of execution as the original, but often with much simpler loop bounds and indexing expressions.

Memory promotion (mempromotion)

This tactic attempts to promote variables from one memory to another to take advantage of the target architecture's memory hierarchy. It also performs data layout optimizations.

Thread generation (threadf)

This tactic performs thread generation and thread partitioning.

Tile scheduling (ts)

Run this tactic after tiling. It reschedules the tiles to obtain coarse-grained parallelism.

Tiling (tile)

This tactic performs tiling (also known as blocking).

Unroll-and-jam (uj)

This tactic performs the unroll-and-jam optimization to increase code reuse.

4 Understanding Mapper Tactics

Mapper Tactics implement Mapper optimizations.

For each Mapper tactic, this chapter describes in detail what the tactic does, command synopses for batch and interactive modes, and user-relevant arguments. Where applicable, it also describes the tactic's mathematical formulation, known limitations, how the tactic affects performance and mapping, and references to relevant external publications and information.

4 Values for all boolean arguments are case sensitive, so the alphabetic forms can be expressed only in lower case. True can be expressed as true, t, yes, y, or 1. False can be expressed as false, f, no, n, or 0.

In this manual, true and false are used in descriptions and examples.

4 Also, for those boolean arguments that default to false, you can specify --<arg> to set it to true. Conversely, for those boolean arguments that default to true, you can specify --no-<arg> to set it to false.

Combined multidimensional affine scheduling (as)

This tactic performs a selective tradeoff of multilevel parallelism, permutability, and locality via affine loop transformations and fusion/distribution. Its primary objective is to extract coarse-grained parallelism that is also suitable for coarse-grained task formation; that task formation is performed by tiling. This tactic also contains experimental options for contiguity, simdization, and spatial reuse of read-after-read dependences. These options should be used carefully in the current release.

Command Synopsis and Arguments

In batch mode, you use command-line switches to pass arguments to a Mapper tactic. In interactive mode, you use commands to do the same.

• Batch mode Pass arguments to the tactic in one of three ways:

--map=as=<arg1,...,argN>
--map:as=<arg1,...,argN>
--map:as:<arg1,...,argN>

• Interactive mode

as();
as(arg1,...,argN);

Valid arguments for as include:

parametric=<bool>

Search for schedules with a parametric shifting component, also known as the Γ part of the schedule (default is false)

lindep=<bool>

Enforce invertibility of the α component of the schedule. Setting this to false will result in potentially non-bijective schedules, for experimental purposes (default is true)

orthant=<bool>

Enforce schedule linear independence with first-orthant constraints. If activated, the output schedules will not contain any loop reversal component (default is false)

coinor=<bool>

Use COIN-OR's CBC as the solver (default is false)

best_solver=<bool>

Use the best available solver (default is true)

maximal_fission=<bool>

Maximally split the GDG into distinct SCCs during scheduling (default is false)

spatial=<bool>

Add spatial read-after-read reuse constraints (EXPERIMENTAL; default is false)

simd=<bool>

Add simdization constraints (EXPERIMENTAL; default is false)

contiguity=<bool>

Add constraints for contiguous memory access optimizations (EXPERIMENTAL; default is false)

small_loops=<bool>

Add constraints to prevent considering small loops as valid parallel and permutable dimensions. In some cases that contain small loops, parallelism is enhanced with this option (default is false)

maxsink_sequential=<bool>

Forces loops with a trip count of 1 to be sequential, improving the quality of the loop parallelism. These unit trip-count loop dimensions are often created by the max_sink tactic (default is false)

use_hidden_context=<bool>

Uses hidden_context information as if it were user annotations. This can improve array expansion of parametric code, scheduling, tile sizes, etc. (default is false)

feautrier_only=<bool>

Use Feautrier's algorithm only (default is false)

par=<int>

Set the maximum degrees of parallelism achieved before switching to maximal fusion (default is 10)

max_phi_coefficient=<int>

Set the maximum absolute value of all the schedule coefficients (default is 10)

scopes=<bool>

Preserve all variable scopes (default is false)
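For example, a batch-mode sketch that bounds the schedule coefficients and enables the small-loops constraints (values illustrative) is:

--map:as=max_phi_coefficient=4,small_loops=true

The same arguments can be supplied interactively using the as(arg1,...,argN); form shown above.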

CUDA Geom (geom)

The CUDA Geom tactic provides flexibility for the user to set the CUDA kernel execution configuration parameters (such as thread dimensions and sizes, and grid dimensions and sizes) and nvcc compiler hints (such as the minimum number of thread blocks per multiprocessor that the user wants for better resource sharing).

Command Synopsis and Arguments

In batch mode, you use command-line switches to pass arguments to a Mapper tactic. In interactive mode, you use commands to do the same.

• Batch mode Pass arguments to the tactic in one of three ways:

--map=geom=<arg1,...,argN>
--map:geom=<arg1,...,argN>
--map:geom:<arg1,...,argN>

• Interactive mode

geom();
geom(arg1,...,argN);

Valid arguments for geom include:

thread_config=<int,int,...>

Sets the targeted CUDA core geometry (i.e., sets the user-defined thread size and dimension for the CUDA kernel launch). {0} indicates "use Mapper-defined sizes" (default is 0)

grid_config=<int,int,...>

Sets the targeted CUDA SM geometry (i.e., sets the user-defined grid size and dimension for the CUDA kernel launch). {0} indicates "use Mapper-defined sizes" (default is 0)

blocks_per_sm=<int>

Gets a user hint, the minimum number of thread blocks per SM, to pass on to nvcc. 0 indicates "no hint" (default is 0)

mainly_cache=<int>

Makes the configurable on-chip memory in GPU chips with compute capability 2.x scratchpad biased or L1 cache biased (not valid for GPU chips with compute capability 1.x or lower). 0 = default configuration, 1 = mainly cache (cache biased), 2 = mainly scratchpad (scratchpad biased) (default is 0)
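For example, to hint two thread blocks per SM to nvcc and bias the configurable on-chip memory toward cache (values illustrative):

--map:geom=blocks_per_sm=2,mainly_cache=1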

CUDA Placement (new_cudaplacement)

The CUDA Placement tactic performs block and thread placement for the CUDA target. Suitable candidate loops are identified and marked as placement loops, whose iterations are distributed across CUDA threads and thread blocks.

Command Synopsis and Arguments

Interactive mode

new_cudaplacement();

The new_cudaplacement tactic has no command-line options or arguments.

Checks that barrier code is well-formed (checkBarriers)

Command Synopsis and Arguments

Interactive mode

checkBarriers();

The checkBarriers tactic has no command-line options or arguments.

Spatial layout optimization (cm)

Modifies the layout of arrays to optimize their spatial locality. Produces copies of arrays when their layout can't be transformed directly.

Command Synopsis and Arguments

In batch mode, you use command-line switches to pass arguments to a Mapper tactic. In interactive mode, you use commands to do the same.

• Batch mode Pass arguments to the tactic in one of three ways:

--map=cm=<arg1,...,argN>
--map:cm=<arg1,...,argN>
--map:cm:<arg1,...,argN>

• Interactive mode

cm();
cm(arg1,...,argN);

Valid arguments for cm include:

unimod=<bool>

Only perform unimodular transformations

Communication Optimization (CUDA-specific) (commopt)

This tactic performs communication optimizations for the CUDA target.

Command Synopsis and Arguments

In batch mode, you use command-line switches to pass arguments to a Mapper tactic. In interactive mode, you use commands to do the same.

• Batch mode Pass arguments to the tactic in one of three ways:

--map=commopt=<arg1,...,argN>
--map:commopt=<arg1,...,argN>
--map:commopt:<arg1,...,argN>

• Interactive mode

commopt();
commopt(arg1,...,argN);

Valid arguments for commopt include:

hoist=<bool>

Enables hoisting of the first communications across the reuse dimension (default is false)

early_opt=<bool>

Only performs early optimizations that do not depend on privatization and placement application (i.e., hoisting of communications and removal of redundant communications) (default is false)

interchange=<bool>

Enables interchanging communication loops on shared memory (default is false)

within_inner=<bool>

Exploit reuse within the same processor (inner) or across processors (no-inner) (default is true)

Naive array expansion (ae)

Turns scalars and lower-dimension arrays into higher-dimension arrays for the sake of removing storage-induced dependencies.

Command Synopsis and Arguments

Interactive mode

ae();

The ae tactic has no command-line options or arguments.

Promote all variables that are not live-in/live-out and do not involve communication to local arrays on the PE (array)

Command Synopsis and Arguments

Interactive mode

array();

The array tactic has no command-line options or arguments.

Broadcast elimination (bcast)

This tactic reduces the number of data broadcasts by using an equivalent schedule. Experimental.

Command Synopsis and Arguments

Interactive mode

bcast();

The bcast tactic has no command-line options or arguments.

Communication generation (commgen)

This tactic inserts communication operations, i.e., memory-to-memory copies, given the information provided by the memory promotion tactic.

Command Synopsis and Arguments

In batch mode, you use command-line switches to pass arguments to a Mapper tactic. In interactive mode, you use commands to do the same.

• Batch mode Pass arguments to the tactic in one of three ways:

--map=commgen=<arg1,...,argN>
--map:commgen=<arg1,...,argN>
--map:commgen:<arg1,...,argN>

• Interactive mode

commgen();
commgen(arg1,...,argN);

Valid arguments for commgen include:

stencils=<bool>

Detect stencil references to compute better shapes for data transfers (default is true)

z_domain_projection=<bool>

Perform Z-domain computations when it is required that the result be exact. If this is turned off, compilation is faster but the result may be incorrect for some applications. (default is true)

enlarge_reads=<bool>

Simplifies the read sets by enlarging them. When this optimization is turned on, it tries to make read sets with nicer shapes by enlarging them slightly. This has the effect of transferring more elements than needed, but possibly requiring fewer transfer operations. (default is false)

generate_all_put_wait=<bool>

Generates a single wait per put memory location when true. Otherwise generates a single global wait for all the puts and reserves a tag for doing this. (default is false)

generate_waits=<bool>

Generates waits. (default is true)

Memory promotion tactic (mempromotion)

This tactic tries to promote variables into scratchpad memory to speed up accesses.

Command Synopsis and Arguments

In batch mode, you use command-line switches to pass arguments to a Mapper tactic. In interactive mode, you use commands to do the same.

• Batch mode Pass arguments to the tactic in one of three ways:

--map=mempromotion=<arg1,...,argN>
--map:mempromotion=<arg1,...,argN>
--map:mempromotion:<arg1,...,argN>

• Interactive mode

mempromotion();
mempromotion(arg1,...,argN);

Valid arguments for mempromotion include:

splitting=<bool>

Split reference classes into smaller classes to further compact memory (default is true)

stencils=<bool>

Detect stencil references to compute better shapes for data footprints (default is true)

O=<int>, optimization_level=<int>

Accuracy of dependence analysis (default is 1)

force_promotion=<bool>

Force the promotion of references even though the memory usage exceeds the available size of the local memory. (default is false)

cross_mem_opt=<bool>

Detects some cross-memory optimization opportunities and partitions references accordingly (CUDA-specific) (default is true)

promote_host_references=<bool>

Promote all references, even those that have been mapped onto the host processor. (default is false)

override_memory_consumption_check=<bool>

Override memory consumption check (default is false)

Orthogonal tiling (tile)

This tactic computes tile sizes that comply with certain constraints and are optimal with respect to certain criteria, which are parameters of this tactic. When tiling is applied several times in a mapping, options can be passed to a specific instance by appending its instance number to the tactic name (for example, tile2).

Aliases OT, tile2, tile3, tile4, tile5, tile6, tile7, tile8

Command Synopsis and Arguments

In batch mode, you use command-line switches to pass arguments to a Mapper tactic. In interactive mode, you use commands to do the same.

• Batch mode Pass arguments to the tactic in one of three ways:

--map=tile=<arg1,...,argN>
--map:tile=<arg1,...,argN>
--map:tile:<arg1,...,argN>

• Interactive mode

tile();
tile(arg1,...,argN);

Valid arguments for tile include:

cstr_fpt=<bool>

Constrains the data footprint to lie in the targeted memory

cstr_gran=<bool>

Constrains tile sizes to comply with a placement plan, by trying to make tile sizes a multiple of the corresponding processor dimension's sizes

cstr_pow=<bool>

Tries to make tile sizes powers of 2

cstr_loadbal=<bool>

Prevents obvious load imbalance

cstr_assoc=<bool>

Reduces associativity-based cache capacity conflicts

lenient=<bool>

Makes grouping less restrictive about which groups are mapped to the PEs. (default is false)

small_loops=<bool>

Views small loop nests as 'big statements' that don't need to be tiled (default is false)

code_size=<string>

Amount of local memory to reserve for code (.text) (default is -1)

sizes=<int,int,...>

User-prescribed sizes

ortho=<bool>

Use a not-too-skewed tiling method (default is false)

scale=<bool>

Use a not-too-skewed tiling method that trades off divisions for strides. (default is false)
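For example, an illustrative combination that requests power-of-two tile sizes and lenient grouping is:

--map:tile=cstr_pow=true,lenient=true

When tiling is applied several times, the same arguments can be directed at one instance through its alias, for example --map:tile2=cstr_pow=true.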

Thread generation for CUDA and CSX (thread)

For a given hierarchical mapping level, this tactic partitions the statements into two sides: host and processing elements. It also generates thread_create() and thread_wait() calls on the host side, according to the execution model defined in the machine model. Usable with CUDA and CSX.

Command Synopsis and Arguments

Interactive mode

thread();

The thread tactic has no command-line options or arguments.

Tile Scheduling (ts)

Converts permutable inter-tile dimensions into sequential and doall dimensions.

Command Synopsis and Arguments

In batch mode, you use command-line switches to pass arguments to a Mapper tactic. In interactive mode, you use commands to do the same.

• Batch mode Pass arguments to the tactic in one of three ways:

--map=ts=<arg1,...,argN>
--map:ts=<arg1,...,argN>
--map:ts:<arg1,...,argN>

• Interactive mode

ts();
ts(arg1,...,argN);

Valid arguments for ts include:

intertile=<bool>

Skew inter-tile dimensions (default is true)

intratile=<bool>

Skew intra-tile dimensions (default is false)

doalls=<int>

Number of exploitable doall dimensions (default is 100)
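For example, to also skew intra-tile dimensions and limit the number of exploitable doall dimensions (values illustrative):

--map:ts=intratile=true,doalls=2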

Pseudo Code Generation (c)

Generates a pseudo-code view of the current GDG.

Aliases codegen

Command Synopsis and Arguments

Interactive mode

c();

The c tactic has no command-line options or arguments.

Communication Optimization (early_commopt)

This tactic performs early communication optimizations.

Command Synopsis and Arguments

Interactive mode

early_commopt();

The early_commopt tactic has no command-line options or arguments.

Flexible thread generation (threadf)

Partitions the code into a hierarchy of master and slave codes and produces operations for initializing code and propagating control between master and slave threads.

Options are usually not necessary for this tactic, which derives its operating mode from the machine model.

Command Synopsis and Arguments

In batch mode, you use command-line switches to pass arguments to a Mapper tactic. In interactive mode, you use commands to do the same.

• Batch mode Pass arguments to the tactic in one of three ways:

--map=threadf=<arg1,...,argN>
--map:threadf=<arg1,...,argN>
--map:threadf:<arg1,...,argN>

• Interactive mode

threadf();
threadf(arg1,...,argN);

Valid arguments for threadf include:

tightly_coupled=<bool>

Use the tightly-coupled execution model for the current morph (default is false)

control_broadcast=<bool>, ctl_bcast=<bool>

Generate collective master-side control operations, rather than a loop over individual operations. This has to match the API available with the target platform. (default is true)

code_broadcast=<bool>, code_bcast=<bool>

Generate collective master-side code operations, rather than a loop over individual operations. This has to match the targeted API. (default is true)

time_sharing=<bool>

Indicates that more than one code can be started/loaded on a given processing element concurrently. (default is true)

code_init_on_slave=<bool>

Indicates that the slaves should run an initialization function right after they are loaded. (default is false)

Late Data Relayout (latedatarelayout)

Tactic used to perform data layout transformations on arrays promoted to a local scratchpad (physical or virtual) to enable vectorization. This is still experimental, as the size of the buffer is likely to increase in an uncontrolled fashion.

Command Synopsis and Arguments

In batch mode, you use command-line switches to pass arguments to a Mapper tactic. In interactive mode, you use commands to do the same.

• Batch mode Pass arguments to the tactic in one of three ways:

--map=latedatarelayout=<arg1,...,argN>
--map:latedatarelayout=<arg1,...,argN>
--map:latedatarelayout:<arg1,...,argN>

• Interactive mode

latedatarelayout();
latedatarelayout(arg1,...,argN);

Valid arguments for latedatarelayout include:

parametric=<bool>

Search for schedules with a parametric shifting component, also known as the Γ part of the schedule (default is false)

lindep=<bool>

Enforce invertibility of the α component of the schedule. Setting this to false will result in potentially non-bijective schedules, for experimental purposes (default is true)

orthant=<bool>

Enforce schedule linear independence with first-orthant constraints. If activated, the output schedules will not contain any loop reversal component (default is false)

coinor=<bool>

Use COIN-OR's CBC as the solver (default is false)

best_solver=<bool>

Use the best available solver (default is true)

maximal_fission=<bool>

Maximally split the GDG into distinct SCCs during scheduling (default is false)

spatial=<bool>

Add spatial read-after-read reuse constraints (EXPERIMENTAL; default is false)

simd=<bool>

Add simdization constraints (EXPERIMENTAL; default is false)

contiguity=<bool>

Add constraints for contiguous memory access optimizations (EXPERIMENTAL; default is false)

small_loops=<bool>

Add constraints to prevent considering small loops as valid parallel and permutable dimensions. In some cases that contain small loops, parallelism is enhanced with this option (default is false)

maxsink_sequential=<bool>

Forces loops with a trip count of 1 to be sequential, improving the quality of the loop parallelism. These unit trip-count loop dimensions are often created by the max_sink tactic (default is false)

use_hidden_context=<bool>

Uses hidden_context information as if it were user annotations. This can improve array expansion of parametric code, scheduling, tile sizes, etc. (default is false)

feautrier_only=<bool>

Use Feautrier's algorithm only (default is false)

par=<int>

Set the maximum degrees of parallelism achieved before switching to maximal fusion (default is 10)

max_phi_coefficient=<int>

Set the maximum absolute value of all the schedule coefficients (default is 10)

scopes=<bool>

Preserve all variable scopes (default is false)

Naive array contraction (lcontract)

Lowers the dimensionality of arrays in order to reuse storage.

Command Synopsis and Arguments

Interactive mode

lcontract();

The lcontract tactic has no command-line options or arguments.

MaxFloat (maxfloat)

Maximally fissions the beta-strings when doing so does not alter the relative statement order. The goal is to avoid artificial loop peeling during code generation. Typically useful after max_sink and scheduling.

Command Synopsis and Arguments

Interactive mode

maxfloat();

The maxfloat tactic has no command-line options or arguments.

Maxsink (maxsink)

Extends all iteration domains to the same number of dimensions by padding the innermost dimensions with zeros (which is always legal). The effect is to give the scheduler more freedom by performing deeper nestings of statements that would otherwise not be possible.

Command Synopsis and Arguments

Interactive mode

maxsink();

The maxsink tactic has no command-line options or arguments.

Memory obfuscation using Ehrhart polynomials (mo)

Command Synopsis and Arguments

Interactive mode

mo();

The mo tactic has no command-line options or arguments.

Multi-dimensional placement component (place)

Aliases mplacement

Command Synopsis and Arguments

In batch mode, you use command-line switches to pass arguments to a Mapper tactic. In interactive mode, you use commands to do the same.

• Batch mode Pass arguments to the tactic in one of three ways:

--map=place=<arg1,...,argN>
--map:place=<arg1,...,argN>
--map:place:<arg1,...,argN>

• Interactive mode

place();
place(arg1,...,argN);

Valid arguments for place include:

has_host=<bool>

Forces placement to behave as if the target morph had a separate host processor (default is true)

small_loops=<bool>

Gives special treatment to loops that fit in the targeted memory (default is false)

Barrier Generation (sync)

Produces all-to-all slave barriers that ensure the parallel program's correctness. The multidimensional barrier tactic generates calls to operations that implement a barrier across processing elements. In most cases, barrier operations are necessary for correctness of the parallel program. However, using the ndbarrier tactic is unnecessary when other mechanisms, such as master/slave synchronization or implicit synchronization (as in OpenMP), are used to generate the required synchronizations.

Aliases ndbarrier

Command Synopsis and Arguments

Interactive mode

sync();

The sync tactic has no command-line options or arguments.

DMA generation (dma)

Transforms communications as generated by Communication Generation into sets of strided DMA commands.

Aliases newdma

Command Synopsis and Arguments

In batch mode, you use command-line switches to pass arguments to a Mapper tactic. In interactive mode, you use commands to do the same.

• Batch mode Pass arguments to the tactic in one of three ways:

--map=dma=<arg1,...,argN>
--map:dma=<arg1,...,argN>
--map:dma:<arg1,...,argN>

• Interactive mode

dma();
dma(arg1,...,argN);

Valid arguments for dma include:

suppress_dma_opt=<bool>

Suppresses the aggregation of DMAs into higher-dimensional DMAs (default is false)

Control Persistence (per)

Enforces execution control persistence among pieces of a code unit.

Command Synopsis and Arguments

Interactive mode

per();

The per tactic has no command-line options or arguments.

Combined placement, tile scheduling and tiling (placetile)

Aliases mplacetile

Command Synopsis and Arguments

In batch mode, you use command-line switches to pass arguments to a Mapper tactic. In interactive mode, you use commands to do the same.

• Batch mode Pass arguments to the tactic in one of three ways:

--map=placetile=<arg1,...,argN>
--map:placetile=<arg1,...,argN>
--map:placetile:<arg1,...,argN>

• Interactive mode

placetile();
placetile(arg1,...,argN);

Valid arguments for placetile include:

cstr_fpt=<bool>

Constrains the data footprint to lie in the targeted memory

cstr_gran=<bool>

Constrains tile sizes to comply with a placement plan, by trying to make tile sizes a multiple of the corresponding processor dimension's sizes

cstr_pow=<bool>

Tries to make tile sizes powers of 2

cstr_loadbal=<bool>

Prevents obvious load imbalance

cstr_assoc=<bool>

Reduces associativity-based cache capacity conflicts

lenient=<bool>

Makes grouping less restrictive about which groups are mapped to the PEs. (default is false)

small_loops=<bool>

Views small loop nests as 'big statements' that don't need to be tiled (default is false)

code_size=<string>

Amount of local memory to reserve for code (.text) (default is -1)

sizes=<int,int,...>

User-prescribed sizes

ortho=<bool>

Use a not-too-skewed tiling method (default is false)

scale=<bool>

Use a not-too-skewed tiling method that trades off divisions for strides. (default is false)

Polyhedral Simplification of Loops (CUDA only) (pol_simplify)

Performs simplification of loop nests based on processor parameters. Processor parameters are a special kind of global parameter whose values are all taken by some processor. Simplifications on these variables require a specific tactic.

Command Synopsis and Arguments

In batch mode, you use command-line switches to pass arguments to a Mapper tactic. In interactive mode, you use commands to do the same.

• Batch mode Pass arguments to the tactic in one of three ways:

--map=pol_simplify=<arg1,...,argN>
--map:pol_simplify=<arg1,...,argN>
--map:pol_simplify:<arg1,...,argN>

• Interactive mode

pol_simplify();
pol_simplify(arg1,...,argN);

Valid arguments for pol_simplify include:

versioning=<bool>

Allow versioning along parameter values (default is false)

diverging=<bool>

Allow diverging threads to be created (versioning below half warp size) (default is false)

maxH=<int>

Maximal number of hyperplanes to unroll. The higher this value, the better the simplification but the longer the compile time. (default is 512)

Polyhedral Unrolling (pol_unroll)

Non-syntactic unrolling of loops with small trip count.

Command Synopsis and Arguments

In batch mode, you use command-line switches to pass arguments to a Mapper tactic. In interactive mode, you use commands to do the same.

• Batch mode Pass arguments to the tactic in one of three ways:

--map=pol_unroll=<arg1,...,argN>
--map:pol_unroll=<arg1,...,argN>
--map:pol_unroll:<arg1,...,argN>

• Interactive mode

pol_unroll();
pol_unroll(arg1,...,argN);

Valid arguments for pol_unroll include:

width=<int>

Maximum loop width considered for unrolling (default is 4)

maxspec=<int>

Maximum specializations for any unrolling (default is 3)

remove_eqs=<bool>

Also remove equalities in iteration domains (default is true)

Index-Set Unsplitting (fuseSplitted)

This tactic serves the purpose of fusing back statements that have been split due to some information not being available at the time of splitting. In particular, one use is to fuse back copy statements split during communication optimization for the CUDA mapper. An example follows. Original code before communication optimization:

doall (i=0; i<=15; i++) {
  doall (j=0; j<=255; j++) {
    doall (k=0; k<=23; k++) {
      copy0(>U2_l[k],<U2[4+j,16*i+k]);
    }
    doall (k=0; k<=15; k++) {
      ...
      copy9(pr_>U2_l_9,<U2[4+j,4+16*i+k]);
      ...
    }
  }
}

Communication Optimization puts the data from U2_l_9 into U2_l[4+k]. This is done before explicit placement is applied, and the domain difference of statement 9 minus statement 13 results in 4 splits (only 2 are necessary):

doall (i=0; i<=15; i++) {
  doall (j=0; j<=255; j++) {
    doall (k=0; k<=15; k++) {
      ...
      copy9(pr_>U2_l_9,<U2[4+j,4+16*i+k]);
      ...
    }
    doall (k=0; k<=15; k++) {
      copy0_13(>U2_l[4+k],<U2_l_9);
    }
    doall (k=0; k<=-16*i+3; k++) {
      copy0_14(>U2_l[k],<U2[4+j,16*i+k]);
    }
    doall (k=-16*i+260; k<=23; k++) {
      copy0_15(>U2_l[k],<U2[4+j,16*i+k]);
    }
    doall (k=max(-16*i+4, 0); k<=3; k++) {
      copy0_16(>U2_l[k],<U2[4+j,16*i+k]);
    }
    doall (k=20; k<=min(-16*i+259, 23); k++) {
      copy0_17(>U2_l[k],<U2[4+j,16*i+k]);
    }
  }
}

After placement and simplification are applied, the code is the following:

doall (i=0; i<=255; i++) {
  ...
  copy9(pr_>U2_l_9,<U2[4+i,4+16*bl.x+th.x]);
  ...
  copy0_13(>U2_l[4+th.x],<U2_l_9);
  if (bl.x==0) {
    copy0_14(>U2_l[th.x],<U2[4+i,16*bl.x+th.x]);
  }
  if (bl.x==15) {
    doall (j=(-th.x+35)/16; j<=(-th.x+23)/16; j++) {
      copy0_15(>U2_l[16*j+th.x],
               <U2[4+i,16*j+16*bl.x+th.x]);
    }
  }
  if (bl.x>=1) {
    copy0_16(>U2_l[th.x],<U2[4+i,16*bl.x+th.x]);
  }
  if (bl.x<=14) {
    doall (j=(-th.x+35)/16; j<=(-th.x+23)/16; j++) {
      copy0_17(>U2_l[16*j+th.x],
               <U2[4+i,16*j+16*bl.x+th.x]);
    }
  }
}

Statements (14, 16) and (15, 17) can each be collapsed into the same statement; the output code is much simpler and performs better:

doall (i=0; i<=255; i++) {
  ...
  copy9(pr_>U2_l_9,<U2[4+i,4+16*bl.x+th.x]);
  ...
  copy0_13(>U2_l[4+th.x],<U2_l_9);
  copy0_14(>U2_l[th.x],<U2[4+i,16*bl.x+th.x]);
  doall (j=(-th.x+35)/16; j<=(-th.x+23)/16; j++) {
    copy0_15(>U2_l[16*j+th.x],
             <U2[4+i,16*j+16*bl.x+th.x]);
  }
}

Command Synopsis and Arguments

Interactive mode

fuseSplitted();

The fuseSplitted tactic has no command-line options or arguments.

Cosmetic domain tightening transformations (simp)

Aliases simplifyInt

Command Synopsis and Arguments

Interactive mode

simp();

The simp tactic has no command-line options or arguments.

Loop Simplification (simplifyLoop)

Simplifies loop bounds and access functions via equivalent schedules.

Command Synopsis and Arguments

Interactive mode

simplifyLoop();

The simplifyLoop tactic has no command-line options or arguments.

Unroll and jam (uj)

This tactic performs optimization following a default model of unroll-and-jam in the polyhedral model. It sets integer elements in an unroll vector for each polyhedral statement. The actual unrolling is done syntactically after code generation and may be prohibited by the subsequent syntactic cost model.

Command Synopsis and Arguments

In batch mode, you use command-line switches to pass arguments to a Mapper tactic. In interactive mode, you use commands to do the same.

• Batch mode Pass arguments to the tactic in one of three ways:

--map=uj=<arg1,...,argN>
--map:uj=<arg1,...,argN>
--map:uj:<arg1,...,argN>

• Interactive mode

uj();
uj(arg1,...,argN);

Valid arguments for uj include:

max_unroll=<int>

Maximum combined amount of unrolling. (default is 64)

max_unroll_per_dimension=<int>

Maximum amount of unrolling per dimension. (default is 64)

innermost=<bool>

Enable/disable unrolling on the innermost loop dimension. (default is true)
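For example, an illustrative invocation that caps the total unrolling and leaves the innermost dimension alone is:

--map:uj=max_unroll=16,innermost=false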

Simple unrolling (unroll)

Unrolls the innermost loops by a prescribed factor.

Command Synopsis and Arguments

In batch mode, you use command-line switches to pass arguments to a Mapper tactic. In interactive mode, you use commands to do the same.

• Batch mode Pass arguments to the tactic in one of three ways:

--map=unroll=<arg1,...,argN>
--map:unroll=<arg1,...,argN>
--map:unroll:<arg1,...,argN>

• Interactive mode

unroll();
unroll(arg1,...,argN);

Valid arguments for unroll include:

factor=<int>

Unroll factor (default is 8)

deepest=<bool>

Only unroll the deepest loop(s) (default is false)
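For example, to unroll only the deepest loops by a factor of 4 (values illustrative):

--map:unroll=factor=4,deepest=true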

Virtual scratchpad mode (virtscratch)

Modifies the machine model to materialize a virtual scratchpad.

Command Synopsis and Arguments

Interactive mode

virtscratch();

The virtscratch tactic has no command-line options or arguments.

5 R-Stream Target Architectures

The R-Stream Compiler supports these target architectures:

• x86 through OpenMP and OCR

• Tilera

• NVIDIA GPU through CUDA

4 Please refer to the release notes for a list of architectures supported in the current release.

SMP/OpenMP Architecture

Machine model file: core2duo-mapper.xml

A Symmetric Multiprocessing (SMP) architecture is characterized by multiple, identical processing elements (PEs), each connected to a single main memory. Common SMP architectures include bus-based systems of several PEs (for example, multiway x86-based servers) as well as large-scale NUMA systems, such as SGI's Altix architectures. More recently, the availability of commodity multicore processors means that a single workstation may qualify as an SMP machine.

The R-Stream Compiler maps code to some¹ SMP architectures via the widely available OpenMP programming environment. OpenMP comprises a set of directives and library routines used to express parallelism in C/C++ and Fortran code. The R-Stream Compiler uses a subset of OpenMP as a target for mapped code.

¹ The R-Stream Compiler directly targets some SMP architectures, such as TILE64.

Base Mapping Strategy

Concentrating on exposing outer loop parallelism is a common guideline for producing performant OpenMP programs. Conceptually, this translates to a set of outer loop iterations running on each of the available PEs on the target SMP system.

Interactively Mapping to the Target Architecture

This section provides the basic steps for mapping a program to run on the SMP/OpenMP architecture.

In interactive mode:

1. If you are cross-compiling to the SMP/OpenMP architecture, you need to load the proper machine types file:

loadTypes("x86_64-gcc-linux");

(See the --types=<path> switch on page 12 for details on machine types files.)

2. Call the affine scheduling (as) tactic:

as();

Affine scheduling results in loop nests that are fully permutable, and thus amenable to tiling.

(See Combined multidimensional affine scheduling (as) on page 39 for details on this tactic.)

3. Call the tiling (tile) tactic.

tile();

In codes where outermost parallelism is not immediately available, tiling is commonly used to expose a coarser degree of parallelism suitable for OpenMP.

(See Orthogonal tiling (tile) on page 49 for details on this tactic.)

4. Call the tile scheduling (ts) tactic, which can often expose more coarse-grained parallelism.

ts();

(See Tile Scheduling (ts) on page 51 for details on this tactic.)

5. Immediately after tile scheduling, call the loop simplification (simplifyLoop) tactic:

simplifyLoop();

then continue to the postmapping stage. (A consolidated sketch of these steps appears after this list.)

(See Loop Simplification (simplifyLoop) on page 66 for details on this tactic.)
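Taken together (with loadTypes needed only when cross-compiling, and assuming the source file has already been parsed and raised to a GDG as in the earlier heat_3D walkthrough), the interactive sequence for an SMP/OpenMP target is:

loadTypes("x86_64-gcc-linux");
as();
tile();
ts();
simplifyLoop();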

Known Limitations

The R-Stream Compiler uses only a subset of OpenMP². Currently, the R-Stream Compiler does not generate annotations for some OpenMP features, such as nested parallelism and reduction annotations.

Also, OpenMP programs that the R-Stream Compiler generates are generally not portable across multiple SMP platforms.

² The current version of OpenMP assumed by the R-Stream Compiler is OpenMP 2.5. The website is www.openmp.org.

6 Machine Models

The R-Stream Compiler software includes machine model files for a wide range of target architectures. (For a list of supported targets, see R-Stream Target Architectures on page 69.) These supplied files are installed in $RSTREAM_HOME/rstream/mm/.

The contents of a machine model file (<target>-mapper.xml) describe to the Mapper the main components of the corresponding target architecture. The Mapper produces a mapping of the applicable regions in the C source code according to the machine model description.

A machine model file plays two major roles in mapping; it describes:

• The system architecture of the target machine in terms of the relevant parts of the system components.

• The execution models and system capabilities that the Mapper can exploit.

This point is particularly important because an architecture may support multiple execution models, but the Mapper cannot derive knowledge of these models from the system description alone. To guide the operation of the Mapper, all execution models must be stated explicitly.

The machine model file provides this information to the Mapper by creating an association between the physical elements of the target system and possible mappings.

Understanding Machine Model File Structure

A machine model is a graph of entities (typically processors, memories, and DMA links). The edges between entities represent relations such as "gets data from" or "can start threads on". The mapping is performed hierarchically by considering views of the machine model (called "morphs") that each present source memories in which data lies, and target processors that should run the mapped function using the data.

Machine model files are written in XML format (for details, visit http://www.w3.org/standards/xml/). Machine model files consist of key:value pairs that describe the elements that make up each of the relevant system components and the available execution models.

Each key in a machine model file denotes a property of an entity in the target machine. Each key is written in path syntax, with each path element delimited by a period. Path elements follow a hierarchy in much the same way that elements in a directory path do.

The hierarchy of path elements is laid out this way:

entity.type.attribute

In this syntax, entity is either a physical system component or a logical system-organization element; for example:

proc.cpu.addressable_unit

morph.smp.PEs_can_synchronize

The contents of the machine model file for the Intel SMP/OpenMP architecture look like this (though this is an abbreviated listing):

<!-- CPUs -->
<entry key="proc.cpu.geometry">[8]</entry>
<entry key="proc.cpu.int_registers">32</entry> <!-- Intel hack -->
...
<!-- Global memory -->
<entry key="mem.global.size">[4G]</entry> <!-- 4GB -->
<entry key="mem.global.cache_level">-1</entry>
...
<!-- L1 cache -->
<entry key="mem.L1.cache_level">1</entry>
<entry key="mem.L1.size">[16K]</entry>
...
<!-- L2 cache for both data and instructions -->
<entry key="mem.L2.cache_level">2</entry>
<entry key="mem.L2.size">[6M]</entry> <!-- 4 x 6M L2 cache -->
...
<!-- I-cache -->
<entry key="mem.I.cache_level">1</entry>
<entry key="mem.I.cache_line_size">32</entry>


...
<!-- Mapping strategy: only one level of mapping is needed. -->
<entry key="morph.smp.backend">OpenMP</entry>
<entry key="morph.smp.host">cpu</entry>
<entry key="morph.smp.topology">[
    cpu->L1, cpu->I,
    L1-> {2-1} L2,       <!-- Two cores share the L2 cache -->
    I-> {2-1} L2,        <!-- Two cores share the L2 cache -->
    L2-> {many-1} global
]</entry>
<entry key="morph.smp.options">[smp]</entry>

Describing System Architecture

The entities described in the machine model file are:

Processors
    Denoted by proc in the entity position of the path hierarchy.

Abstract processors
    Denoted by aproc in the entity position of the path hierarchy.

Memory
    Denoted by mem in the entity position of the path hierarchy.

Links
    Denoted by link in the entity position of the path hierarchy.

Processors

Processor entities (proc) represent computation engines, including scalar processors, SIMD processors, and multiprocessors. The processor components described vary according to the type of machine. For example, an SMP system will typically have one processor type, cpu.

Processor attributes that you may need to modify include:

geometry

[int] Specifies the number of processors of the given type. Each element of the array specifies a number of processors along a dimension of connection.

This attribute drives a number of mapping decisions, such as placement.

Chapter 6. Machine Models 75

Page 88: R-Stream Parallelizing C Compiler Power User GuideFor Software Version 3.3.1 Preface Contacting Reservoir Labs For technical inquires: • Call 212.780.0527 • Send a fax to 212.780.0542

Describing System Architecture

Valid values are integers greater than 0.

For example, an x86-based SMP system with two four-core processors has a geometry of [8]. A 64-core processor with a mesh topology has a geometry of [8,8].

int_registers

[int] Specifies the number of integer registers the processor type has.

fp_registers

[int] Specifies the number of floating-point registers the processor type has.

instruction_size

[int] Specifies the processor’s instruction size in bytes.

SIMD_width

[int] Specifies the width, in bits, of the SIMD unit for the processor type.

SIMD_alignment

[int] Specifies the required alignment, in bits, of the SIMD unit for the processor type.

funit_types

[String, ..., String] Specifies the functional unit types supported by the processor type.

Values supported are memory (MEM), integer ALUs (INT), single-precision floating-point (FP4), and double-precision floating-point (FP8).

funit_issues_per_cycle

[int, ..., int] Specifies the number of corresponding funit_types instructions the processor type can issue per cycle.

When you add or remove a functional unit by resetting funit_types, remember to reset funit_issues_per_cycle accordingly.
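As an illustration, the two attributes are kept in step in a machine model file with paired entries like these (the functional unit names come from the list above; the issue counts are invented for the example, not taken from a supplied model):

<entry key="proc.cpu.funit_types">[MEM,INT,FP4,FP8]</entry>
<entry key="proc.cpu.funit_issues_per_cycle">[2,3,1,1]</entry>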


addressable_unit

[int] The size, in bits, of the addressable unit for the processor type.

For example, a processor that is byte-addressable has a value of 8.

Abstract Processors

Abstract processor entities (aproc) represent computation entities which are composed of other computation entities and, usually, memories. Abstract processors are the way hierarchical parallel machines are defined. During the mapping process, they are abstracted as a processor (which does not have some specific processor attributes such as the number of FP units, SIMD width, etc.).

Abstract processor attributes include:

included_procs

The set of processors (or abstract processors) directly included by this abstract processor.

proper_mems

The set of memories directly accessible to the abstract processor, which can be considered its "own" memories.

parameter_passing

The way parameters are passed to code running on this abstract processor. Can be explicit (there is a separate parameter-passing method) or implicit (parameters are passed as arguments to the function that starts the thread on the abstract processor).
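Putting these attributes together, an abstract processor entry might be sketched as follows; the type name node and the member lists are assumptions made for illustration, not taken from a supplied machine model file:

<entry key="aproc.node.included_procs">[cpu]</entry>
<entry key="aproc.node.proper_mems">[L1,L2]</entry>
<entry key="aproc.node.parameter_passing">implicit</entry>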

Memory

Memory entities (mem) represent data caches, instruction caches, unified data and instruction caches, and main and scratch pad memories.

Common memory types include:


• global

• local

• L1

• L2

• I

Memory attributes include:

size

[int] Specifies the size, in bytes, of the memory type.

bank_names

[string] Specifies the names of the memory banks for the memory type.

cache_line_size

[int] Specifies the size, in bytes, of the cache line for the memory type.

tlb_miss_cost

[int] Specifies the cost of TLB misses, in cycles, for the memory type.

cache_level

[int] Specifies the level of the cache for the memory type.

Valid values for caches are 1 or greater, with 1 signifying the highest level of cache. Non-cache memory (scratchpad) has a value of -1.

speed

[int] Specifies the speed, in cycles, of the memory type.

data

[true|false] Specifies whether the memory type handles data.

instructions

[true|false] Specifies whether the memory type handles instructions.


options

A set of options for memory with special needs:

• host_allocated indicates that data in this memory must be allocated by the parent thread.

• cuda_constant identifies an NVIDIA device constant memory.
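For example, a scratchpad memory that must be allocated by the parent thread could be described with entries like the following sketch (the memory type name local comes from the common types listed above; the use here is illustrative):

<entry key="mem.local.cache_level">-1</entry>      <!-- scratchpad, not a cache -->
<entry key="mem.local.options">[host_allocated]</entry>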

Links

Link entities (link) represent connections between processing elements and memories and require the use of an explicit communication protocol to operate. These protocols include low-level communications, such as DMA; higher-level abstractions, such as MPI; and anything in between.

4 Currently, DMA is the only supported link entity.

The capabilities of link entities are specified by their attributes.

6 Some attributes internal to Reservoir (for example, tag_type, tag_memory, options) are included in the supplied machine model files, but are not described in this manual. In general, you should avoid modifying the value of these attributes.

Link attributes that you may need to modify include:

strided_overhead

[int] Specifies the overhead in cycles per message.

strided_bandwidth

[int] Specifies the bandwidth in bytes per cycle.

indexed_overhead

[int] Specifies the overhead in cycles per message.

indexed_bandwidth

[int] Specifies the bandwidth in bytes per cycle.


preferred_size_multiple

[int] Specifies, in bytes, the preferred transfer size multiple.

Set this attribute to 1 if there is no preferred size.

preferred_alignment

[int] Specifies the preferred alignment, in bits, of data to transfer.

Set this attribute to 1 if there is no preferred alignment.

has_local_strides

[true|false] Specifies whether the link can scatter locally.

asynchronous

[true|false] Specifies whether the link is asynchronous.
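A DMA link entry might therefore be sketched as follows (the type name dma and the numeric values are assumptions made for illustration):

<entry key="link.dma.strided_overhead">100</entry>   <!-- cycles per message -->
<entry key="link.dma.strided_bandwidth">8</entry>    <!-- bytes per cycle -->
<entry key="link.dma.preferred_alignment">128</entry>
<entry key="link.dma.asynchronous">true</entry>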

Describing Execution Models

The execution models and system capabilities that the Mapper can exploit are described in the Mapping Strategy section (see page 74) of the machine model file.

The Mapping Strategy section consists of morph entities that describe the logical properties of machine organization. The morph keys provide the information the Mapper needs to associate a target system's physical components with mapping solutions.

6 Do not modify any of the morph keys in the Mapping Strategy section of the machine model file. Morphs are complex enough to make modification by users inappropriate in most cases. Because the R-Stream Compiler makes implicit mapping decisions based on information encoded by morphs, we strongly advise you to contact Reservoir Labs if you think a morph is insufficient.

Morphs

Morphs represent mapping foci or views for the Mapper. The mapping problem for a target architecture is divided into a set of morphs, which enables the effective decomposition of mapping problems for systems with hierarchical organization and multiple levels of parallelism (for example, a cluster with multicore nodes).

Each morph in a machine model is defined by a root processor, from which the mapping starts, and a set of targeted processing elements (PEs). A host processor is also defined, which executes code that isn't parallelized to the PEs (basically the master thread). In some cases, this role is played by one of the PEs. The way to specify this in the machine model is to set the host processor to the same value as the PEs. Another case is when there is no host processor per se and no PE can run a separate master thread. The way to specify this situation in the machine model is to set the host processor as being the root processor.

Synchronization capabilities between the host and the PEs are also specified in the morph:

PEs_can_synchronize indicates whether the PEs can synchronize with each other, and

host_can_synchronize indicates whether the host can synchronize the PEs.

A topology is associated with the host processor, the PEs, and a list of submorphs.

Topology

A machine topology is an attribute of the morph entity. The topology of a machine is expressed as a directed graph in terms of nodes and edges. Two types of edges are supported:

• Data edges represent possible data connections in the system architecture. The source of a data edge can be viewed as the client of the destination of the data edge, in the sense that it takes its data from it (an L1 cache takes its data from the L2, etc.). Data edges are unlabeled.

• Command edges represent commands that can be sent from one type of entity to another. Each command edge is labeled with the type of command that it supports.


Deriving New Machine Models

The R-Stream Compiler is supplied with a set of machine model files for a variety of target architectures. However, you may want to try different machine model files to experiment with a new target or to modify a program mapping.

R-Stream comes with a tool to gather the main architectural parameters of your SMP (x86 or PowerPC) machine:

• Run the $RSTREAM_HOME/rstream/bin/smp-config tool on the SMP machine to extract this information.

• For any other supported target architecture, you can usually find this information in the vendor's datasheet.

6 Avoid modifying any of the morph keys in the Mapping Strategy section of the machine model file. Doing so can cause problems, either in the Mapper or in the mapped code.

Modifying a Machine Model File

You can quickly override an existing machine model file using the --setmm=[key:value] command-line switch. This switch overrides the value of the associated key in the specified machine model file.

The value component of the key:value pair can be of type:

• byte

• short, int, long

• float, double

• A string

• Arrays of any of the above, delimited by [ and ], with the entries separated by a comma.

82 Chapter 6. Machine Models

Page 95: R-Stream Parallelizing C Compiler Power User GuideFor Software Version 3.3.1 Preface Contacting Reservoir Labs For technical inquires: • Call 212.780.0527 • Send a fax to 212.780.0542

Deriving New Machine Models

4 The format required for values of these data types is described in the XML standard.

The machine model file format supports some abbreviations commonly used for specifying characteristics of computer systems. For example, for keys that require a numeric value, you can use these suffixes when specifying values:

K Applies a multiplier of 2^10 to the associated prefix value.

M Applies a multiplier of 2^20 to the associated prefix value.

G Applies a multiplier of 2^30 to the associated prefix value.

For example, if you want to specify a total cache size of 16 kilobytes, you can write 16K instead of 16384. These abbreviations are provided as a convenience for entering values commonly used in computer systems.

Let's assume you want to recompile your application for a newer version of the Intel Core2Duo architecture, which has a larger L1 cache. Assuming that you'd been using the core2duo machine model, and that the new L1 cache size is 32 KBytes, you'd add the --setmm= switch to the rcc command line like this:

rcc --polymap -march=x86_64-linux --mm=core2duo \
    --setmm=mem.L1.size:[32K] -o myapp myapp.c

The command-line method of modifying machine model files is a quick and convenient way to experiment with minor architectural changes. But you may find it necessary to create a permanent machine model file, perhaps after completing a number of experiments using the command-line switch. In this case, you need to derive your model from one of the existing machine model files. To do so:

1. Select an appropriate base machine model.

2. Copy the original machine model file to a new, unique <target>-mapper.xml file.


3. Edit the new file, changing entries as needed.

4. Use the new machine model file via the --mm command-line switch in batch mode, or via the shell command loadMM in interactive mode. (A sketch of these steps follows.)
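As a sketch of these steps, suppose you base a new target called mytarget on the supplied core2duo model (the target name is invented for illustration, and we assume the --mm switch takes the <target> prefix of the file name, as it does for core2duo):

$ cp $RSTREAM_HOME/rstream/mm/core2duo-mapper.xml \
     $RSTREAM_HOME/rstream/mm/mytarget-mapper.xml
  (edit mytarget-mapper.xml, changing entries as needed)
$ rcc --polymap -march=x86_64-linux --mm=mytarget -o myapp myapp.c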

Using the smp-config Utility

The smp-config utility provides a quick and easy way for you to extract the information you need to update the core2duo-mapper.xml machine model file.

To run the smp-config utility:

• Locally on the target machine, issue this command:

$ $RSTREAM_HOME/rstream/bin/smp-config

• Remotely, copy the utility to the target machine, then issue this command:

$ ssh <user@target> $RSTREAM_HOME/rstream/bin/smp-config

The smp-config utility produces output that looks like this:

-setmm=proc.cpu.geometry:[2]
-setmm=global.size:[3949406k]
-setmm=mem.L2.size:[3M]
-setmm=mem.L2.cache.line.size:64

You can then either update the corresponding keys in the core2duo-mapper.xml file, or supply these lines of output as arguments to the compiler at compile time to dynamically update any out-of-date entries in the machine model description.

4 Before you modify the core2duo-mapper.xml file, we recommend that you save a copy of the original file.

To change the machine model description automatically at compile time with the output generated by smp-config, run the R-Stream Compiler this way:

$ rcc --polymap -march=x86_64-linux --mm=core2duo \
    `smp-config` -o myapp myapp.c


7 Programming Guidelines

This chapter provides general guidelines for coding programs that can take advantage of the Mapper component of the R-Stream Compiler. The Mapper can parallelize mappable regions of code that conform to extended static control program structure. In a static control program, data-dependent control paths are prohibited. In an extended static control program, some data-dependent behavior is allowed (for details, see Mappable vs. Unmappable Data-Dependent Structures on page 92).

This pseudo C syntax summarizes the extended static control program structure:

parameters   ::= M, N, ...
arrays       ::= A, B, C, ...
loop indices ::= i, j, k, ...
e            ::= affine expression of indices and parameters
S            ::= A[e] = f(A1[e1], ..., An[en])
               | for (i = e; i < e; i += N) S
               | if (e) S else S
               | S; S; ...

4 In this structure, parameters may be constants or otherwise fixed values; for example, parameters to a function.

Though currently the Mapper parallelizes only loops, it parallelizes the entire loop, allowing parameterization of some values and parallelization of imperfectly nested loops. This behavior enables the Mapper to parallelize complete linear algebra kernels and even fused or nested kernels.

When programming for parallelization, keep your code very simple. Write a compute kernel that conforms to the extended static control program structure and that looks like a textbook example of a kernel. Do not optimize your code by hand by unrolling or tiling loops, and do not reuse variables for other purposes. Doing so interferes with parallelization. Leave the details of implementation and performance to the R-Stream Compiler.

Following these basic rules will align your programming methodology more closely with the algorithmic and mathematical conception of your application.

Affine Functions

A mappable region of code contains one or more affine functions. Affine functions are composed of parameters or outer loop index variables, combined through addition or multiplication with scalar values, taking the general form

    f(p1, ..., pn) = X1*p1 + ... + Xn*pn + C

where each pi is a parameter or outer loop index variable, and C and each Xi are constants.
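To make this concrete, here are a few index expressions and how they fare against this definition (illustrative fragments; i and j are loop indices and c is a parameter):

A[2*i + 3*j + c]   /* affine: a linear combination of indices and a parameter */
A[i + 1]           /* affine: constant offsets are allowed */
A[i*j]             /* not affine: the product of two loop indices */
A[B[i]]            /* not affine: an indirect access (see How to map unmappable code) */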

Loop Characteristics

The Mapper operates on loops with certain characteristics. It views a loop nest as a region of code. Informally, a mappable region is a function with parameters, where the parameters are variables that remain unchanged within the region; for example:

for (i = 0; i < n; i++) {
    A[i] = B[k] * A[i];
}

which shows a mappable region with two parameters: n and k.

Affine Parametric Loop Bounds

The Mapper can analyze and transform loops with known bounds that are affine functions of parameters and outer loop index variables, and that also have a constant step. However, loop bounds don't need to be compile-time constants. The Mapper can deal with code of the form:


double g(int m) {
    int n = getLimit(m);
    return f(n);
}

double f(int n) {
    int i, j;
    for (i = 0; i < n; i++) {
        for (j = 0; j < i; j++) {
            C[i][j] = A[i][j] * B[j];
        }
    }
}

This example loop nest describes a triangular iteration space. The upper bound of the outer loop is not a compile-time constant, but is determined by an input parameter to the function when the function is called. This means that the Mapper can still work with loop bounds that are known only at runtime (for example, read from a configuration file).

Furthermore, in the inner loop, the loop bound changes for each iteration of the outer loop. However, because the bound is an affine function of the outer loop index variable, the region is still mappable.

For the Mapper to handle a loop nest, the loop counters must be incremented by a constant. If we modify the code in this loop nest example by introducing a call to a function skip(), which randomly modifies the value of loop counter j, such that j is no longer incremented by a constant:

double f(int n) {
    int i, j;
    for (i = 0; i < n; i++) {
        for (j = 0; j < i; j++) {
            C[i][j] = A[i][j] * B[j];
            skip(&j);
        }
    }
}

the Mapper cannot map it. In High Performance Technical Computing (HPTC) codes, it's rare to find a loop that modifies its bound or index variable within its body. Parallelizing and vectorizing compilers typically consider such loops non-transformable.


Affine Array Accesses

Similar to loop bounds, array index expressions must be affine functions of parameters and loop index variables of the enclosing loops. For example, the code shown in the following example accesses an array C that is an affine function of the parameter c and the loop indices i and j. Similarly, the array access to A is an affine function of i, j and c, and the access to B is an affine function of j.

double f(int n, int c) {
    int i, j;
    for (i = 0; i < n; i++) {
        for (j = 0; j < i; j++) {
            C[i+c][i][j] = A[2*i][3*j + c] + B[j];
        }
    }
}

Our previous examples have shown code that uses the C syntax for array access. However, unlike most other compilers, the R-Stream Compiler does not rely on syntactic form to determine whether it can map array accesses. Instead, it can operate on an abstract model of arrays, meaning that if it can determine that a pointer access to memory is made by a parametric affine access function, it can map it.

Consider the matrix multiplication code in the following example, where two-dimensional matrices are accessed using pointer arithmetic and dereferencing:

#define N 1024

void matmult(float *C, float *A, float *B)
{
    int i, j, k;
    float *p;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            *(C + i*N + j) = 0.0;
            p = (A + i*N);
            for (k = 0; k < N; k++) {
                *(C + i*N + j) += *p++ * *(B + k*N + j);
            }
        }
    }
}


The R-Stream Compiler can detect that the pointer accesses via A, B and C are of an affine parametric form. That is, they are all related to the loop counters or to parameters through an affine relation.

When working with this pointer-based form, the dimensionality of abstract arrays is defined by the types of their function parameters. If you want the R-Stream Compiler to view an input/output array as multidimensional, the array's dimensions must be specified explicitly. (Only the leftmost dimension does not need to be specified.)

In the matrix multiplication code example, the Mapper will treat C, A and B as one-dimensional arrays. Their size along their leftmost and only dimension does not need to be specified. The code in the following multidimensional array example represents the same function, except that the Mapper considers arrays C, A and B as two-dimensional, with a rightmost size of 1024.

void matmul(float (*C)[1024], float (*A)[1024], float B[1024][1024])
{
    int i, j, k;
    float *p;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            C[i][j] = 0.0;
            p = &(A[i][0]);
            for (k = 0; k < N; k++) {
                *(C[i]+j) += *p++ * *(B[k] + j);
            }
        }
    }
}

The declarations of arguments A and B are equivalent, although the declaration of argument B provides more information about the size of B to the R-Stream Compiler. Any locally declared array must have a static size.

4 Currently, only constant-sized arrays are supported, so you cannot allocate variable-sized arrays with malloc() or malloc()-like calls from within a mapped function. This limitation will be addressed in future versions of the R-Stream Compiler.


Loop Nesting

The Mapper supports loop nests of arbitrary depth and breadth. For example, the Cholesky code in the following imperfectly nested loop example has a loop nest with a depth of three, and for each outer loop there is a statement between it and the inner loop.

void cholesky(float (*A)[1024], int n)
{
    int i, j, k;

    for (k = 0; k < n; k++) {
        A[k][k] = sqrt(A[k][k]);
        for (i = k+1; i < n; i++) {
            A[i][k] = A[i][k] / A[k][k];
            for (j = k+1; j <= i; j++) {
                A[i][j] = A[i][j] - A[i][k] * A[j][k];
            }
        }
    }
}

6 The depth of nesting is constrained by practicality since deeper nesting creates denser constraint systems and more complex mapping problems for the Mapper. We recommend limiting loop nests to a depth of five to avoid stressing the capabilities of the Mapper. You can try nests of greater depths, but doing so may result in very long compilation times.

Conditionals

The Mapper can also handle loops that contain conditional statements in the body. Consider the simplified QR decomposition code:

void qr(int n) {
    int i, j, k;
    double nrm, s;
    for (k = 0; k < n; k++) {
        /* Compute 2-norm of k-th column w/out under/overflow. */
        nrm = 0.0;
        for (i = k; i < n; i++) {
            nrm = sqrt(nrm*nrm + QR[i][k]*QR[i][k]);
        }
        if (nrm != 0.0) {
            /* Form k-th Householder vector. */
            if (QR[k][k] < 0) {
                nrm = -nrm;
            }
            for (i = k; i < n; i++) {
                QR[i][k] = QR[i][k] / nrm;
            }
            QR[k][k] = QR[k][k] + 1;
            /* Apply transformation to remaining columns. */
            for (j = k+1; j < n; j++) {
                s = 0.0;
                for (i = k; i < n; i++) {
                    s = s + QR[i][k]*QR[i][j];
                }
                s = -s / QR[k][k];
                for (i = k; i < n; i++) {
                    QR[i][j] = QR[i][j] + s * QR[i][k];
                }
            }
        }
        Rdiag[k] = -nrm;
    }
}

This conditional statements example shows more of the Mapper's capabilities:

• The outer loop has two conditionals that it can analyze.

• All of the inner loops have bounds that depend on the outer loop index variable.

• Loop nesting is somewhat complicated by multiple inner loops at the same nesting level.

• Execution of some of the loops depends on a conditional in the outer loop.


The Mapper supports these types of conditionals within loop bodies:

• Affine functions of the surrounding loop indices

• Fixed values

Passing arguments to mappable functions

The current version of R-Stream temporarily requires the user to distinguish data from indices in the arguments passed to a mapped function (and to a blackbox function, defined in the next section). An index is a read-only integer variable whose value is used either as part of array indices or as part of loop bounds. Data is everything else.

Data must be passed by pointer, indices by value. This new and temporary requirement simplifies R-Stream's analysis of the input program.

Finally, all global variables must be passed as arguments to the mappable functions. R-Stream will detect failures to do so and inform you.
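As a sketch of these rules (the function and argument names are invented for illustration): the arrays are data and are passed by pointer, while the problem size is an index and is passed by value.

#define N 1024

#pragma rstream map
void scale_rows(double (*A)[N], double *factors, int n)
{
    int i, j;
    for (i = 0; i < n; i++) {            /* n is an index: read-only, used in a loop bound */
        for (j = 0; j < N; j++) {
            A[i][j] = factors[i] * A[i][j];   /* A and factors are data: passed by pointer */
        }
    }
}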

Mappable vs. Unmappable Data-Dependent Structures

In accordance with the extended static control program structure, the Mapper supports some data-dependent structures:

Supported structures:

• If indices are properly affine, conditionals of this form:

      if (A[i] < B[i])

  or

      if (A[i] < N)

  (All comparator operators are supported.)

Unsupported structures:

• References to struct or union members

• Code containing indirect accesses to arrays: if (A[B[i]])

• Pointer references in general

6 Function calls can be problematic unless they are pure functions. The R-Stream Compiler assumes that all functions have no side effects (are pure). If the source code violates this assumption, the Mapper may generate incorrect code.

How to map loops with library calls

The R-Stream Compiler supports the mapping of codes that contain library calls, but requires some help from the user. Several conditions have to be met for this to work.

• The function itself must be able to be executed on the targeted architecture.

• The user has to let the R-Stream Compiler know how data is accessed during the library function call. More specifically, the user must declare which parts of which arrays are read and written. The user does this by declaring a mappable function, called an "image function", which summarizes the array reads and writes that take place in the library function call. The user then uses directives to indicate which library function the function is an image of. The general idea is that the R-Stream Compiler will pretend, for mapping purposes, that it is mapping the image function, but will produce an output program that does the proper calls to the library function.

Guidelines for writing an image function are given in the next section.

How to write an image function

The image function of function f is a function which abides by the programming style required by the Mapper (affine loop bounds and indexing expressions), and whose (read and write) data accesses subsume those of f. A function is declared as being the image function of function f by writing a special pragma directive right before the definition of the image function:

#pragma rstream map image_of:f

Chapter 7. Programming Guidelines 93

Page 106: R-Stream Parallelizing C Compiler Power User GuideFor Software Version 3.3.1 Preface Contacting Reservoir Labs For technical inquires: • Call 212.780.0527 • Send a fax to 212.780.0542

How to map loops with library calls

Here is an illustration with a call to a 1-D FFT.

FFT example

Let us consider a fictitious program that calls 1-D row FFTs on a 2-D array.

/* computes the FFT of row A
   and puts it in array FFT_A */
extern void fft_row(double* A, double* FFT_A);

void my_function(double (*A)[N]) {
    int i, j;
    double FFT_A[N][N];
    for (i = 0; i < N; i++) {
        fft_row(A[i], FFT_A[i]);
        for (j = 0; j < N; j++) {
            FFT_A[i][j] = 2 * FFT_A[i][j];
        }
    }
}

In order to map function my_function, the user has to let the Mapper know what data accesses happen within a call to fft_row. However complicated they may be, what matters to the Mapper is to have a conservative description of what may be written and read. In the case of the fft_row example, one full row of FFT_A may be written and one row of A may be read.

The user must describe this by declaring an image function for function fft_row, as follows:

/* computes the FFT of row A
   and puts it in array FFT_A */
extern void fft_row(double* A, double* FFT_A);

#pragma rstream map image_of:fft_row
void image_fft(double* A, double* FFT_A) {
    int k;
    for (k = 0; k < N; k++) {
        FFT_A[k] = A[k];
    }
}

#pragma rstream map
void my_function(double (*A)[N]) {
    int i, j;
    double FFT_A[N][N];
    for (i = 0; i < N; i++) {
        fft_row(A[i], FFT_A[i]);
        for (j = 0; j < N; j++) {
            FFT_A[i][j] = 2 * FFT_A[i][j];
        }
    }
}

The R-Stream Compiler will understand that it has to process calls to fft_row as a black box and that the Mapper should pretend that the array accesses that happen during a call to fft_row are those of image_fft. "Processing calls as a blackbox" means that the Mapper will not try to inline the calls, or to modify anything within the function call, even if the code is available to it.

While this mechanism is relatively easy to use, here are a few caveats and precautions you should be aware of.

4 The arguments of the library function and the image function must match.

4 The Mapper partitions computations so that data locality is achieved. For this to work, the data accesses performed by the function should be described in the image function. This is actually not required for correctness if the Mapper doesn't need to generate any form of copies or communications. But not doing this will impact performance in any case, and may introduce correctness issues otherwise.

4 Keep in mind that the image function goes through a compiler, and that the R-Stream Compiler (as a compiler) can simplify your code before it goes to the Mapper. In particular, do not write dead code (code that can easily be removed without changing the semantics of the image function).

4 The Mapper will produce a correct mapping of a blackboxed program if the image function defines a conservative approximation of the array accesses, i.e., if all possible reads and writes (more is allowed) are read and written by the image function.

Note that the image function never gets executed. It is only a convenient way for the user to describe a function in the style that is accepted by the Mapper.

How to map unmappable code

The loop code you are writing may contain snippets of code that cannot comply with the programming style required by the Mapper, because they are too irregular. You can still give your loop code to the Mapper by turning the snippet into a function call (this is called "outlining" the code snippet) and treating the function call as in the previous section.

A typical example is that of an indirection. Consider the following program, in which the indirect access A[B[i]] is too irregular to be mapped as such.

void my_function(float (*A)[N], int* B, int n) {
    int i, j;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            A[B[i]][j] = i+j;
        }
    }
}


The user must outline the indirect access into a function:

void ind(float (*A)[N], int* B, int i, int j) {
    A[B[i]][j] = i+j;
}

void my_function(float (*A)[N], int* B, int n) {
    int i, j;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            ind(A, B, i, j);
        }
    }
}

The outlined function can then be treated exactly as a library function (see How to map loops with library calls above) using blackboxing. In our case, the reads and writes can be safely approximated as:

• Element B[i] is read.

• Any element, from the minimum value of the elements of B to their maximum value, can be written in the j column. If the user doesn't know anything about the values of B, a usually safe assumption is that no array overflow (or underflow) is produced. If A is of size N, we would assume that the values of B[i] are between 0 and N-1. Hence any element A[x][j] with x ∈ [0, N-1] can be written to.

The user has to write a loop that writes to all the elements that may be written, as illustrated hereunder.

void ind(float (*A)[N], int* B, int i, int j) {
    A[B[i]][j] = i+j;
}

#pragma rstream map image_of:ind
void ind_image(float (*A)[N], int* B,
               int i, int j) {
    int k;
    for (k = 0; k < N; k++) {
        A[k][j] = B[i] + 3;
    }
}

#pragma rstream map
void my_function(float (*A)[N], int* B, int n) {
    int i, j;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            ind(A, B, i, j);
        }
    }
}

In practice, this method for "mapping unmappable code" is useful when small snippets of code would prevent large loops from being otherwise mapped. If you have to hide all the loops in a function call, then it is not worth doing, since the Mapper won't modify anything that is inside the function definition.

A set of options, appended to the pragma, informs the Mapper about additional properties of array accesses in a blackbox function:

strong_write:"X,Y"informs the mapper that all the elements of arrays

X and

Y that are declared as written in the image function are defi-nitely written/defined by the function, as opposed to just beingpossibly written.

small_fp:”X,Y”informs the mapper that the number of elements of

98 Chapter 7. Programming Guidelines

Page 111: R-Stream Parallelizing C Compiler Power User GuideFor Software Version 3.3.1 Preface Contacting Reservoir Labs For technical inquires: • Call 212.780.0527 • Send a fax to 212.780.0542

How to map unmappable code

X and

Y accessed at each call to the blackboxed function is small.This is particularly useful when the number of possibly ac-cessed elements (as defined by the image function) is large.
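For instance, in the ind example above, each call accesses only one element of A even though its image function declares a whole column as possibly written, so small_fp applies; strong_write would not, since most of the declared elements are not actually written. A sketch of the annotated directive follows; the exact placement of the option on the pragma line is an assumption on our part.

#pragma rstream map image_of:ind small_fp:"A"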


8 ARCC, the Auto-Tuner for RCC

One of the problems of a high-level compiler is that it relies on a backend compiler, and that the backend compiler's behavior may strongly depend on some relatively unpredictable factors. In addition, analytical performance models are good but do not generally model the entire problem. Hence, a tool for searching in some space of mapping solutions is desirable. However, searching the entire space of mappings is intractable, and unnecessary if you have any characterization of good versus bad mappings.

The approach we are choosing with ARCC is to tune advanced compiler options. Components of the R-Stream Compiler (RCC) mapper are named "tactics". Each option of a tactic defines the space in which it should be searched. This is achieved by defining a specific syntax that the tactics consume and that the external auto-tuner, ARCC, produces.

Basic Operation

ARCC typically does not call RCC directly; instead, it communicates with RCC's tactics. ARCC is invoked with the following command:

$ arcc --build=<cmd> --run=<cmd> --clean=<cmd> [options]

To use ARCC, users simply need to know:

• how to build the target program (e.g., "make foo")

• how to run the target program (e.g., "./foo input.dat")

• how to clean the generated files (e.g., "make clean")

and give that information to ARCC as command-line arguments.

Autotuning with ARCC is done in two steps. In the first step (called production mode), meta-data is produced by RCC's tactics during the first build. The produced meta-data defines the syntax of the tactic's options and the space of values that ARCC can instantiate the options to.

In the second step (called consumption mode), ARCC iteratively instantiates solutions in the search space defined by the meta-data and produces option instances that will be used by RCC. In each iteration, ARCC

• sets up options for the various RCC tactic options,

• (if needed) cleans previously generated files,

• builds the target program,

• runs it, and

• measures its performance.

Each instance is recorded along with the subsequent execution time. After a number of iterations, the program instance that minimizes execution time is built and ARCC returns.

Note that a clean option is necessary when the build system will not rebuild the program if it is seen as up-to-date, as is the case with make. Otherwise, if the build system just overwrites the existing program, the clean option is not needed (i.e., assign it an empty string), as exemplified by the following command:

$ arcc --build="rcc --polymap --mm=core2duo --march=x86_64-linux \
        --backend=\"icc\" --backendoptions=\"-O3 -openmp\" -o foo foo.c" \
       --run="./foo input.dat" --clean=""

Options

ARCC options, which can be obtained with the arcc -help command, are listed below.

Notes:


1. The -help option asks ARCC to print a usage message on the screen.

2. To display what is happening during ARCC's auto-tuning process, use the -verbose option.

3. Users can run the steps of producing the meta-data and consuming it separately. To run the former only, use -produce. Users can customize the auto-tuning search space by modifying the produced meta-data file. To run the latter only, use -consume.

4. For debugging purposes, the -keep option can be used to save all files generated or modified after each build in a subdirectory arcc-codes.

5. Advanced users can selectively decide which of RCC's tactic options are used in the auto-tuning. To view a list of all of RCC's auto-tunable tactic options, use the -list option. To simply tune all of RCC's auto-tunable tactic options, use the -tune-all option. See the Advanced Usage section below.

6. There is a default meta-data file name, which you can optionally override with the -meta-data option.

7. Similarly for log files; see the Logging and Reporting section below.

8. Because the size of the search space can be huge, a brute-force search-space exploration technique (the -exhaustive option) is not always feasible. Hence, heuristic algorithms (the -random and -simplex options) are provided for reducing the search-space size and thus the auto-tuning time. The search algorithm is set by default to the Nelder-Mead simplex method (the -simplex option), which is a popular non-derivative direct search method for optimization. The -random-seed option can be used to control the randomness of the results of the heuristic algorithms.

9. The -max-trials option can be used to limit the maximum number of instances in the search space empirically evaluated by ARCC.


10. The -precise-perf option is specific to RCC. It asks RCC to generate more exact timing information at the level of a mapped function, whereas -rough-perf just measures total execution time. An example that combines several of these options is shown below.
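As an illustrative combination of these options (the values are placeholders, and the exact value-passing syntax for the search options is an assumption on our part):

$ arcc --build="make foo" --run="./foo input.dat" --clean="make clean" \
       -simplex -max-trials=30 -keep -verbose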

Logging and Reporting

ARCC includes logging and reporting mechanisms useful for gathering diagnostic information to track down the tuning results and failures. Logging is done in three different files:

• arcc-main.log: the main log file that records all steps of the auto-tuning process

• arcc-result.log: the log file that records the best results found at each search-space exploration step

• arcc-error.log: the log file that records all encountered failures

Note that the contents of arcc-result.log and arcc-error.log are also recorded in the main log file (arcc-main.log). The purpose of creating additional log files for reporting results and errors is to localize log messages, so that the user need not go through the long content of the main log file. These log file names can be prefixed by the user using the -log option.

Moreover, ARCC also offers -keep to store all files that are generated or modified after each build performed during the consumption mode. This feature is useful for keeping track of RCC-specific generated files such as the mapped codes, the generated executables, the mapper log files, etc. All of the stored files are saved by default in a subdirectory arcc-codes. An example of how the subdirectory structure of arcc-codes looks after tuning a matrix multiplication code is shown below.

% tree arcc-codes


arcc-codes
`-- arcc-run-2011-3-30-16h-34m-47s
    |-- coord-1-2-0
    |   |-- matmult
    |   |-- matmult.gen.c
    |   `-- README
    |-- coord-0-2-0
    |   |-- matmult
    |   |-- matmult.gen.c
    |   `-- README
    |-- coord-0-3-0
    |   |-- matmult
    |   |-- matmult.gen.c
    |   `-- README
    |-- coord-1-3-0
    |   |-- matmult
    |   |-- matmult.gen.c
    |   `-- README
    `-- coord-2-0-1
        |-- matmult
        |-- matmult.gen.c
        `-- README

6 directories, 15 files

In the above example, the auto-tuning built and empirically evaluated five different program instances. Each code instance is represented as a Cartesian coordinate (i.e., coord-X-Y-Z) in the tuning search space. Multiple ARCC runs are recorded separately in subdirectories arcc-codes/arcc-run-<timestamp>. The files matmult and matmult.gen.c are the executable and the mapped code generated by RCC, respectively. Each generated README file contains supplementary information such as the used build/clean/run commands, the program version (represented as a Cartesian coordinate in the search space), the used tactic options, etc.

Advanced Usage

This section is intended for an advanced user who wants to have more precise control over ARCC's auto-tuning configurations and search space.


Running in Production/Consumption Mode

Advanced users may run the production and consumption steps separately and modify the produced meta-data, which is concentrated in a single file (called arcc-meta-data.md by default). For example, the following command can be used to run ARCC in production mode:

$ arcc --build="make foo" --clean="make clean" --produce

Note that a run option is not needed in production mode.

After that, the user can optionally alter the auto-tuning search space by modifying the meta-data file. Details on the syntax of the meta-data file are given in the Meta-Data Syntax section below. With the search space defined in the obtained meta-data file, the user can start the search for the best-performing program instance by running ARCC in consumption mode, using the command below:

$ arcc --build="make foo" --run="./foo input.dat" \
       --clean="make clean" --consume

Meta-Data Syntax

The meta-data file consists of a sequence of tactic-specific meta-data. Each tactic provides:

• A syntactic template for the options it will accept to be auto-tuned. The template contains variables, enclosed by $ signs, which will be instantiated by ARCC's search.

• The set of values allowed for each variable in the search is also provided.

• Finally, constraints on the variables' values can be formulated.

The grammar of the syntax used for declaring a tactic's meta-data is as follows:

<tactic-ID> "{""option" "=" <template-string> ";"( "var" <var-name> "=" "[" <var-val> ( "," <var-val> )* "]" ";" )*[ "constraint" "=" <constraint-exp> ";" ]

"}"


where <template-string> is a string containing to-be-tuned variables enclosed by $ signs, and <constraint-exp> is a set of constraint expressions as defined in the Python scripting language.

For instance, the meta-data for RCC's unroll-and-jam tactic options that work on function "matmult" when targeting an SMP architecture will look like:

UJ_SMP__MATMULT {
    option = "max_unroll=$A$,max_unroll_per_dimension=$B$,innermost=$C$";
    var A = ["1","8","64","256"];
    var B = ["1","4","8","16"];
    var C = ["true","false"];
    constraint = (A>=B);
}

Users can also specify a global constraint across tactics' meta-data. The variables used in a global constraint must be mangled with the tactic ID name. A usage example of a global constraint is given below:

ID_1 {option = "opt1=$A$,opt2=$B$";var A = ["1","2","3","4","5"];var B = ["2","4"];

}ID_2 {

option = "opt1=$A$;var A = ["1","2","3","4"];

}constraint = (ID_1_A==ID_2_A) and (ID_1_B>=ID_2_A);

The goal of enforcing constraints on variables' values is to enable further pruning of the search space. As a concrete example, the unroll factors must respect the machine's register capacity to avoid register spills.

Selecting Auto-Tunable Tactic Options

RCC has its own default set of auto-tunable tactic options that produce the meta-data and define the space of all possible program instances explored during ARCC's auto-tuning process. The dimension and the size of the search space depend on the total number and on the value ranges of the tactic option parameters, respectively. So, it can be beneficial for the user to explicitly select a subset of RCC's auto-tunable tactic options, to limit the search to a manageable size and therefore to reduce the empirical tuning time. Note that there is always a trade-off between tuning speed and tuned program performance. Users should be aware of possible performance changes when selecting auto-tunable options of RCC's tactics.

To view the list of all available auto-tunable tactic options in RCC, the user can use the -list option. The following is the output format of RCC's auto-tunable tactic options list.

% arcc --list
[arcc] List of all autotunable tactic options:
-------------------------------------------------------------------------
| No. | Tactic  | Option  | Used by | Description
|     |         |         | default |
-------------------------------------------------------------------------
| 1.  | tactic1 | option1 | y       | Description on tactic1-option1
| 2.  | tactic1 | option2 | n       | Description on tactic1-option2
| 3.  | tactic1 | option3 | y       | Description on tactic1-option3
| 4.  | tactic2 | option4 | y       | Description on tactic2-option4
| 5.  | tactic2 | option5 | n       | Description on tactic2-option5
| 6.  | tactic3 | option6 | y       | Description on tactic3-option6
-------------------------------------------------------------------------

User selection of RCC's auto-tunable tactic options is done at the RCC level, to maintain a clear separation of knowledge and responsibility between ARCC and RCC. A list of used tactic options can be specified by the user in the makefile and is given to RCC as command-line arguments, as exemplified in the following.

% rcc --map:tactic1=autotune={option1,option2},tactic2=autotune=all input.c

The above example indicates that the options auto-tuned by ARCC are the first two options of tactic1 and all options of tactic2. No options of tactic3 will be autotuned.

For ease of use, the user is not required to specify any auto-tunable tactic options. In this case, ARCC will detect that no meta-data file is generated after the first build during the production mode, and then ARCC will rebuild and decide to use RCC's default list of auto-tunable tactic options to generate the meta-data, which will later be consumed by ARCC during the consumption mode. Optionally, the user can also use the RCC option --map:<tactic>=autotune=default to explicitly select the default auto-tunable option list of a particular tactic. As noted in the Options section above, the -tune-all option of ARCC is essentially equivalent to applying --map:<tactic>=autotune=all to all RCC tactics.


Index

#include files, handling, 15
#pragma rstream inline directive, 16
#pragma rstream map directive, 13
#pragma rstream nomap directive, 19
Affine array accesses
    code example, 88
    described, 88
    matrix multiplication, monodimensional code example, 88
    multidimensional, code example, 89
    pointer accesses, 89
    variable-sized arrays, 89
Affine functions, described, 86
Affine parametric loop bounds
    code example, unmappable loop nest, 87
    described, 86
Affine scheduling tactic (as)
    usage example, 27
--ansi batch mode switch, 10
Array accesses, 6
as() interactive mode command, 27
Automating batch mode compilation, 17
Barrier Generation tactic (sync)
    command arguments, 59
Base mapping strategy
    SMP/OpenMP architectures, 70
Batch mode
    #pragma rstream inline directive, 16, 19
    #pragma rstream map directive, 13, 19
    #pragma rstream nomap directive, 19
    automating compilation, 17
    command line syntax, 10
    compiler flags and options, 10
    example session, 13
    -fcollapseinclude-files switch, 15
    fine tuning parallelization, 18
    -fmap-function=<name> switch, 18, 20
    -fmapall switch, 19
    -fnomap-function=<name> switch, 19, 20
    -g switch, 11
    header files, handling, 15
    --help switch, 11
    -I<path> switch, 11
    inlining function calls, 16, 19
    invoking the compiler, 14
    -L<path> switch, 11
    -l<library> switch, 11
    --log=<path> switch, 11
    --map:<options> switch, 12, 33
    mapping switch precedence guidelines, 20
    mapping, enabling, 14, 18
    -march=<arch> switch, 12, 14, 18
    --mm=<machine_model> switch, 12, 14, 18, 33
    non default backend compiler, specifying, 15
    -O<level> switch, 12
    -o<path> switch, 12
    parallelized C source output files, generating, 10, 15
    --polymap switch, 12, 14, 18, 33
    -S switch, 12, 15
    sequential compilation, specifying, 16
    --setmm=[key:value] switch, 12, 82, 84
    --shell switch, 12
    tasks performed with mapping enabled, 9
    --types=<types_file> switch, 12
    usage, typical, 9
Broadcast elimination tactic (bcast)
    command arguments, 46
-c batch mode switch, 10
Checks that barrier code are well-formed tactic (checkBarriers)
    command arguments, 43
Coarse-grained parallelism
    extracting, 29
codegen() interactive mode command, 22, 26–28
Combined multidimensional affine scheduling (as)
    command arguments, 40
    command synopsis, 40
Combined placement, tile scheduling and tiling (placetile)
    command arguments, 61
    command synopsis, 60
Command edges, 81
Communication generation (commgen)
    command arguments, 47
    command synopsis, 47
Communication Optimization (CUDA-specific) (commopt)
    command arguments, 45
    command synopsis, 44
Communication Optimization tactic (early_commopt)
    command arguments, 52
Compilation, controlling in batch mode, 14
--compiler=<optimization:options> batch mode switch, 10
Compiler modes
    batch, 9
    interactive, 20
Compiling mapped C code with a low-level compiler, 29
Conditionals
    code example, 90
    described, 90
    Mapper capabilities, 91
    supported, 92
Control Persistence tactic (per)
    command arguments, 60
Cosmetic domain tightening transformations tactic (simp)
    command arguments, 66
CUDA Geom (geom)
    command arguments, 42
    command synopsis, 42
CUDA Placement tactic (new_cudaplacement)
    command arguments, 43
-d batch mode switch, 10
Data edges, 81
Data-dependent structures
    function calls, limitation on, 93
    mappable vs unmappable, 92
DeSSA postmapping phase, running, 29
DMA generation (dma)
    command arguments, 60
    command synopsis, 59
--edg_options=<options> batch mode switch, 11
edges
    machine topology, and, 81
    statements, and, 26
EmitC: command, 29
Execution instances, 4
Execution models
    described, 73
    descriptions, 80
Extended static control program structure
    data-dependent structures, supported, 92
    described, 85
-f<feature> batch mode switch, 11
-fcollapseinclude-files batch mode switch, 15
Flexible thread generation (threadf)
    command arguments, 53
    command synopsis, 53
-fmap-function=<name> batch mode switch, 18, 20
-fmapall batch mode switch, 19
-fnomap-function=<name> batch mode switch, 19, 20
GDG() interactive mode command, 22
GDGs() interactive mode command, 23
Generalized Dependence Graph (GDG)
    defined, iv
    described, 26
    functions, mapping, 23, 24
    printing, 26
    raising to, 26
    viewing current, 26
Generating mapped C code for low-level compiler, 29
Header files, handling, 15
help(); interactive mode command, 22
Index-Set Unsplitting tactic (fuseSplitted)
    command arguments, 66
Inlining function calls, controlling, 16, 19
Interactive mapping
    SMP/OpenMP architectures, 70
Interactive mode
    #pragma rstream inline directive, 19
    #pragma rstream map directive, 19
    #pragma rstream nomap directive, 19
    affine scheduling tactic (as), 27
    beanshell environment, and, 20
    bundled commands, 27
    codegen() command, 22, 26
    commands, 21
    compilation steps, typical, 21
    compiling mapped C code, 29
    described, 20
    DeSSA phase, running, 29
    example session, 24
    fine tuning parallelization, 18
    GDG() command, 22
    GDG, printing, 26
    GDG, viewing current, 26
    GDGs() command, 23
    generating mapped C code, 29
    help(); command, 22
    inlining function calls, 16, 19
    invoking the shell, 21, 25
    loading a machine model file, 25
    loadMM() command, 23, 25
    loadTypes() command, 23
    lowering code, 29
    map() command, 23
    mapper options, 22
    mapperhelp(); command, 22
    parse() command, 23, 26
    premapping optimization pass, 26
    raising code to the Mapper’s IR, 26
    rcc --shell interactive mode command, 21
    removing phi- and sigma-nodes from lowered code, 29
    run(phase) command, 24
    runall() command, 23, 26
    setGDG(GDG) command, 24
    tile scheduling tactic (ts), 29
    tile tactic, invoking, 28
    toGDG() command, 24, 27
    usage, typical, 9
Iteration domain, 5
Iteration vector, 5
Late Data Relayout (latedatarelayout)
    command arguments, 55
    command synopsis, 54
loadMM() interactive mode command, 23, 25
loadTypes() interactive mode command, 23
Loop nesting
    code example, 90
    depth, optimal, 90
    described, 90
Loop Simplification tactic (simplifyLoop)
    command arguments, 66
Low-level compiler
    described, 1
    supported, 1
Lowering
    defined, iv
    described, 1
    interactive mode command, 29
Machine model files
    morph component, 80
    abbreviated names, using, 25
    abstract processor attributes, 77
    abstract processor entity, 77
    core2duo-mapper.xml, 69
    deriving new, 83
    described, 73
    execution models, 73, 80
    key values, supported data types, 82
    keys, 74
    link attributes, 79
    link entity, 79
    loading, 25
    memory attributes, 78
    memory entity, 77
    modifying an existing file, 82
    morph component, 81
    morph.topology, described, 81
    path element syntax, 74
    processor attributes, 75
    processor entity, 75
    --setmm=[key:value] switch, 82, 84
    SMP example, 74
    structure of, 73
    system architecture, describing, 75
    system components, described, 74
    value suffixes, 83
Machine models
    deriving new, 82
    described, 73
Machine topology
    command edges, 81
    data edges, 81
    morph.topology, and, 81
Makefile
    automating batch mode compilation, 17
    example of, 17
Manual terminology, iv
--map:<options> batch mode switch, 12, 33
map() interactive mode command, 23
Mapper IR
    described, iv
    lowering from, 29
Mapper options
    interactive mode, 22
    batch mode, 33
mapperhelp(); interactive mode command, 22
Mapping, enabling, 14
-march=<arch> batch mode switch, 12, 14, 18
Matrix representation, iv, 6
MaxFloat tactic (maxfloat)
    command arguments, 57
Maxsink tactic (maxsink)
    command arguments, 57
Memory obfuscation using Ehrhart polynomials tactic (mo)
    command arguments, 57
Memory promotion tactic (mempromotion)
    command arguments, 48
    command synopsis, 48
--mm=<machine_model> batch mode switch, 12, 14, 18, 33
Multi-dimensional placement component (place)
    command arguments, 58
    command synopsis, 58
Naive array contraction tactic (lcontract)
    command arguments, 56
Naive array expansion tactic (ae)
    command arguments, 46
nodes
    machine topology, and, 81
    statements, and, 26
Non default backend compiler, specifying, 15
Ordering and placing statement instances across processors, 6
Orthogonal tiling (tile)
    command arguments, 50
    command synopsis, 49
Output files
    parallelized C source, 15
    sequential C source, 16
Output files, names of, 17
Parallelization, fine tuning, 18
Parallelized C source files, generating, 15
parse() interactive mode command, 23, 26
Permutable intertile loops, skewing, 29
--polymap batch mode switch, 12, 14, 18, 33
Polyhedral form, iv
Polyhedral mapper
    array expansion, 34, 37
    benefits of, 7
    blocking, 38
    coarse-grained parallelism, extracting, 29
    dependency scheduling, 28
    described, 31
    example interactive session, 24
    loop simplification and equivalent scheduling, 37
    pseudocode, structure of, 26
    tactics, 22, 27
    tile scheduling, 29
    tiling, 29, 38
Polyhedral model, 3
    access functions, 6
    described, 3
    execution instances, 4
    history of, 3
    iteration domains, 6
    iteration vector, 5
    matrix representation, 6
    program restructuring, 4
    space-time mapping, 6
    statements, treatment of, 3
Polyhedral representation, iv
Polyhedral Simplification of Loops (CUDA only) (pol_simplify)
    command arguments, 62
    command synopsis, 62
Polyhedral Unrolling (pol_unroll)
    command arguments, 63
    command synopsis, 63
Postmapping optimization pass, 2, 10
Postmapping options, DeSSA, 29
Pragma directives, user-identified mappable code, 13
Premapping optimization pass, 2, 9
print(GDG()) interactive mode command, 26
Program restructuring, 6
Programming guidelines
    affine array accesses, 88
    affine functions, 86
    affine parametric loop bounds, 86
    coding rules, basic, 86
    coding tips, 85
    conditionals, using, 90
    data-dependent structures, mappable vs unmappable, 92
    extended static control program structure, 85
    loop characteristics, mappable, 86
    loop nesting, 90
    static control program structure, 85
Promote all variables which are not live-in/live-out and does not involve communication to local arrays on the PE tactic (array)
    command arguments, 46
Pseudo Code Generation tactic (c)
    command arguments, 52
R-Stream compiler
    flags and options, batch mode, 10
    rcc --help command, 10
    advantages of, 2, 7
    architecture, 1
    batch mode, 9
    command line syntax, 10
    described, 1
    frontend, described, 1
    infrastructure, diagram of, 2
    interactive mode, 20
    intermediate representation, 1
    limitations of, 7
    mapping pragmas, 19
    modes, 9
    output files, 10, 17
    scalar optimizations, 9
Raise phase, running, 26
Raising
    defined, iv
    described, 1
rcc --help command, 10
rcc --shell interactive mode command, 21, 25
Reservoir Labs
    contacting, ii
    reporting bugs, ii
    technical support, ii
run(phase) interactive mode command, 24
run(“Raise”) interactive mode command, 26
runall() interactive mode command, 23, 26
run(“DeSSA”) command, 29
run(“Lower”) command, 29
-S batch mode switch, 12, 15
Scalar optimizations, 9
Sequential execution, compiling for, 16
setGDG(GDG) interactive mode command, 24
--setmm=<key:value> batch mode switch, 12, 82, 84
--shell batch mode switch, 12
Simple unrolling (unroll)
    command arguments, 68
    command synopsis, 68
smp-config utility, 82, 84
SMP/OpenMP target architecture
    base mapping strategy, 70
    described, 69
    interactive mapping, 70
    limitations of, 71
    machine model file, 69
Source code, parsing, 26
Spatial layout optimization (cm)
    command arguments, 44
    command synopsis, 44
SSA form, iv, 1
statements, graphic representation of, 26
Static control program structure, 13
    described, 85
    extended format, 85
Tactics
    affine scheduling tactic (as), 27
    described, 27
    mapperhelp(); command, 22
    tile tactic (tile), 28
Target architectures
    SMP/OpenMP, 71
    supported, 69
Thread generation for CUDA and CSX tactic (thread)
    command arguments, 51
Tile Scheduling (ts)
    command arguments, 51
    command synopsis, 51
tile() interactive mode command, 28
Tiling
    described, 29
    invoking interactively, 28
toGDG() interactive mode command, 24, 27
--types=<types_file> batch mode switch, 12
Unroll and jam (uj)
    command arguments, 67
    command synopsis, 67
Virtual scratchpad mode tactic (virtscratch)
    command arguments, 68