
A PARALLEL COMPILER FOR SequenceL

by

PER ANDERSEN, B.E., M.S.

A DISSERTATION

IN

COMPUTER SCIENCE

Submitted to the Graduate Faculty of Texas Tech University in

Partial Fulfillment of the Requirements for

the Degree of

DOCTOR OF PHILOSOPHY

Approved

Chairperson of the Committee

Accepted

Dean of the Graduate School

August, 2002

Copyright 2002, Per Andersen

ACKNOWLEDGEMENTS

I could not have completed this dissertation without the support and encouragement

of a number of people who I wish to acknowledge here.

Foremost in this group of people is Dr. Daniel Cooke. Dr. Cooke has been a

dedicated advisor, a judicious mentor, and a good friend. Dr. Cooke introduced me to

compiler theory and SequenceL and during a three-year process he provided constant

guidance to my academic work and research projects. He stood by me at times of

difficulty, encouraging me and showing his confidence in me. This dissertation work

would not be in the current form without his insightful input and constructive criticism.

I am very grateful to Dr. Noe Lopez-Benitez, Dr. Milton Smith, and Dr. Richard

Watson, who served on my dissertation committee. I appreciate their intellectual

perspectives and general encouragement over the process of this dissertation work.

I would like to thank my family for putting up with the time I have spent away

from them during this process. At times I may have been near them physically but

mentally I was a thousand miles away. Even so my wife Sue has always given me her full

support and encouragement. Without this support this dissertation would not have taken

place.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS ii

ABSTRACT vii

LIST OF TABLES viii

LIST OF FIGURES ix

CHAPTER

I. INTRODUCTION 1

1.1 An Introduction to SequenceL 4

1.1.1 Regular Construct 7

1.1.2 Irregular Construct 7

1.1.3 Generative Construct 8

1.1.4 Consume-Simplify-Produce 8

1.2 Implied Parallelisms in SequenceL 12

1.2.1 Singleton Computations 13

1.2.2 Parallelisms Involving Indexing 13

1.2.3 Control Flow Parallelisms 15

1.3 Document Overview 16

II. CURRENT STATE OF PARALLEL PROGRAMMING 17

2.1 Message Passing Interface 18

2.1.1 MPI Programming Example 20

2.2 OpenMP 25


2.2.1 OpenMP Programming Example 26

2.3 POSIX Threads 28

2.4 Parallel Programming Languages 29

2.4.1 NESL 31

2.4.1.1 VCODE 32

2.5 Automated Parallel Language Tools 37

2.5.1 PARAMAT 38

2.5.2 PAP 40

2.6 Parallel Architectures 42

III. METHODOLOGY 44

3.1 Lexical Analysis and Syntax Analysis 45

3.1.1 Eliminating Left Recursion 47

3.1.2 Eliminating Common Prefixes 48

3.1.3 Selection Set Generation 49

3.2 Semantic Analysis 55

3.3 Intermediate Code 59

3.3.1 Quadruples and Triples 59

3.4 Code Generation 60

3.4.1 Parallel C Code 62

3.5 Scheduling 65

IV. RESULTS 67

4.1 Proof of Concept 68


4.1.1 General Approach to Mapping SequenceL Constructs 70

4.1.1.1 Regular Constructs 70

4.1.1.2 Irregular Constructs 75

4.1.1.3 Generative Constructs 78

4.1.2 Proof of Concept Through Testing 78

4.2 Intermediate Language 83

4.2.1 Initial Intermediate Language 84

4.2.2 SequenceL Intermediate Language 89

4.2.2.1 SequenceL IC Operations 91

4.2.2.2 Multi-Operand Operations 93

4.2.2.3 Conditional Operation 98

4.2.2.4 Generative Operation 99

4.3 SequenceL Thread Model 100

4.3.1 Dynamic Thread Function Creation 104

4.3.2 Dynamic Thread Functions for Conditional Expression.. .111

4.4 Optimization and Scheduling Issues 116

4.4.1 Granularity 116

4.4.2 Code Restructuring 121

4.4.3 Data Distribution 122

4.4.4 IC Collect Operation 123

4.5 Data Representation 124

4.5.1 Circular Linked List Sequence Representation 124

4.5.2 Sequence Structure Representation 126

V. CONCLUSIONS AND FUTURE RESEARCH 129

5.1 Conclusions 129

5.2 Future Research 132

5.2.1 Preprocessor 132

5.2.2 Optimization 133

5.2.3 Parallel Models 133

5.2.4 Granularity 134

REFERENCES 136

APPENDIX

A. SEQUENCEL GRAMMAR 140

B. AN IMPLEMENTATION GUIDE TO A SEQUENCEL COMPILER 142


ABSTRACT

Procedural languages like C and FORTRAN have historically been the languages

of choice for implementing programs for high performance parallel computers. This

dissertation is an investigation of a high-level nested programming language, SequenceL,

and whether a SequenceL compiler that compiles to parallel code can be developed for a

parallel system. This dissertation has achieved the following results.

• Established a proof of concept that there exists a SequenceL compiler that can

create executable programs that embody the inherent parallelisms and other

implied control structures in SequenceL,

• Developed a new intermediate language capable of representing the meaning of a

SequenceL source program,

• Developed the techniques for spawning threads to dynamically create parallelisms

using a threaded approach, and discovered that the SequenceL language implies a

parallel execution model,

• Identified a number of optimization and performance enhancement opportunities,

• Identified a new SequenceL language requirement for defining nesting and

cardinality typing information for SequenceL data structures.


LIST OF TABLES

3.1 Selection Sets 51

3.2 Triples 60

3.3 Quadruples 60

4.1 Thread Execution Times 108

B. 1 SequenceL Selection Sets 150


LIST OF FIGURES

2.1 Parallel Multiplications by Associative Rule 36

2.2 Syntax Tree 39

2.3 Derivation Tree for PAP 41

3.1 Syntax Checking 54

4.1 Mapping SequenceL Constructs 70

4.2 Mapping Regular Constructs 74

4.3 Interpreter Identified Parallelisms 79

4.4 Matrix Multiply Execution Trace 80

4.5 Gaussian Parallelisms 81

4.6 Quicksort Execution Trace 82

4.7 Object Code Flow Chart 106

4.8 Object Code Flow Chart with Cache Locality 112

4.9 Tree Diagram of a Sequence 116

4.10 Granularity Study 118

4.11 Repeatability of Execution Times 120

4.12 Linked List Sequence 125

B.1 Matrix Multiply 180


CHAPTER I

INTRODUCTION

This report presents the research in developing a compiler for SequenceL, a

nested high-level language that exploits the full extent of parallelisms inherent in a

problem solution. The specific results of the research are as follows.

• Established a proof of concept that there exists a SequenceL compiler that can

create executable programs that embody the inherent parallelisms and other

implied control structures in SequenceL,

• Developed a new intermediate language capable of representing the meaning of a

SequenceL source program,

• Developed the techniques for spawning threads to dynamically create parallelisms

using a threaded approach, and discovered that the SequenceL language implies a

parallel execution model,

• Identified a number of optimization and performance enhancement opportunities,

• Identified a new SequenceL language requirement for defining nesting and

cardinality typing information for SequenceL data structures.

Why should developers consider this high level parallel language when so many

other high level parallel languages have not been particularly successful? The problem

with high level parallel programming languages is that they ultimately force the

developer into coding the data decomposition [CooOO]. This is also the nature of

procedural languages and the motivation for seeking improvements in programming from

high level parallel languages. The hard part of programming is making the implied data

product appear in the programmer's mind when reading or writing the explicit control

structures that produce or process the data product. Something as simple as multiplying

two matrices together can quickly get lost in the coding that is required in a procedural

language like C.

for (i=0; i<=m.rows; i++)
    for (j=0; j<=m.columns; j++)
    {
        s = 0;
        for (k=0; k<=m.length-1; k++)
            { s += m[i][k]*m[k][j]; }
        mr[i][j] = s;
    }

High-level functional languages like NESL [Ble96] and Sisal [Feo] attempt to

redefine the way data structures are realized. The following is an example of matrix

multiplication in NESL.

function matrix_multiply(A,B) =
  {{sum({x*y : x in rowA; y in columnB}) : columnB in transpose(B)}
    : rowA in A}

Note the nesting of the data structures; nesting levels are delineated by curly

brackets. A programmer decides where to locate the curly brackets based on the

parallelisms they wish to exploit. Even for so simple a numerical operation as matrix

multiply, the distribution of the data structure across the control structure is explicitly

defined by the programmer.

Sisal requires programmers to explicitly express loops. The following matrix

multiplication example is written in Sisal; note the explicit use of the "for" looping

constructs.

function dot_product( x, y : array [ real ] returns real )
  for a in x dot b in y
    returns value of sum a * b
  end for
end function

type One_Dim_R = array [ real ];
type Two_Dim_R = array [ One_Dim_R ];

function matrix_mult( x, y_transposed : Two_Dim_R returns Two_Dim_R )
  for x_row in x                  % for all rows of x
  cross y_col in y_transposed     % & all columns of y (rows of y_transposed)
    returns array of dot_product(x_row, y_col)
  end for
end function  % matrix mult

Like NESL, Sisal puts the responsibility for data decomposition onto the

developer through the need to explicitly express the computational loops upon which

concurrent or parallel opportunities are exploited.

For the same numerical method in SequenceL the programmer declares the

matrices and defines the calculation as a row multiplied by a column and the control

structure and parallelisms are implied.

{{matmul(s_1(n,*),s_2(*,m)), where next = {([+([*(s_1(i,*),s_2(*,j))])])} taking [i,j] from cartesian_product([gen([1,...,n]),gen([1,...,m])])} matmul, s_1}

Given the level of complexity in as simple a numerical method as matrix

multiplication, the development of a procedural language algorithm for parallel execution

is just that much harder. (A description of SequenceL matrix multiply is presented in

Appendix B.) It has been estimated that the development of parallel code costs on

average $800 per line of code [Pan]. Even the migration of existing serial code to

parallel execution, a problem of critical interest in many enterprises, may cost anywhere

from $50 to $500 per line of code.

The goal of the SequenceL language [Coo96] is to provide an environment where

the problem solution can be stated at a high level of abstraction - where one can describe

the data product explicitly and have the iterative and parallel program structures that

produce and process the data product generated automatically. In other words, the desire

is for a language that is based upon high-level constructs that permit one to declare data

products directly. In such an environment, the abstraction should be easier for the

problem solver - the problem solver no longer has the difficult task of envisioning the

elusive and implied data product. Rather than having to write the explicit algorithm that

implies the data product, the problem solver explicitly declares the data product.

1.1 An Introduction to SequenceL

The definition of the SequenceL language began in 1991. In 1995 the proof of

Turing-completeness was published [Fri]. The current version of the language was

completed in 1998. Parallelisms implied by the language statements were discovered in

early 1999. Papers introducing the language include [Coo96, Coo98, CooOO]. In this

section these papers are summarized in order to provide an overview of the language

constructs.

SequenceL is a high-level nested language whose fundamental data type is the

sequence. Sequences are collections of integers, reals, strings, identifiers, functions and

computations delineated by square brackets. Reals and integers can be mixed, but string

sequences can only appear in sequences with strings.

[1,2,3]

Sequence

[3]

Singleton

A sequence of one element is called a singleton. The simple sequence listed above

contains three singletons. Sequences can contain sequences. These types of sequences are

called multi-dimensional or nested sequences.

[[1,2,3],[4,5,6],[7,8,9]]

Sequences can contain references to identifiers and functions. The following

sequence contains references to the functions eigen, matmul and max and the input

variables s_1 and s_2.

[eigen,matmul,s_1,max,s_2]

Although not necessary, a language convention adopted for this dissertation is to

begin input variables with s_. Input variables are always constant sequences. Sequences

can also be unbalanced. The following is an example of an unbalanced sequence.

[[1,2,3],[4,5],6]

Unbalanced sequences are normalized before operations are performed on them.

Therefore, normalization by default is performed for every sequence operation when

necessary. For example the following expression does not undergo normalization before

addition.

+([[1,2,3],[4,5,6]])

The result produced by this expression is:

[5,7,9]

This next expression does undergo normalization before the addition.

+([[1,2,3],[4,5],6])

Normalization before the addition results in:

+([[1,2,3],[4,5,4],[6,6,6]])

The result produced by this expression is:

[11,13,13]
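The two normalization results above are consistent with a rule that cyclically repeats the elements of a shorter subsequence until it reaches the length of the longest one. The following C fragment is a minimal sketch of that rule, written for this discussion only (the function name normalize and the fixed lengths are assumptions, not code from the SequenceL compiler).

#include <stdio.h>

#define MAXLEN 3   /* length of the longest subsequence in the example */

/* Cyclically extend a subsequence to the target length, as in
   [4,5] -> [4,5,4] and [6] -> [6,6,6].  Illustrative only. */
void normalize(const int *in, int inlen, int *out, int outlen)
{
    for (int k = 0; k < outlen; k++)
        out[k] = in[k % inlen];
}

int main(void)
{
    int a[] = {1, 2, 3}, b[] = {4, 5}, c[] = {6};
    int na[MAXLEN], nb[MAXLEN], nc[MAXLEN];

    normalize(a, 3, na, MAXLEN);
    normalize(b, 2, nb, MAXLEN);
    normalize(c, 1, nc, MAXLEN);

    for (int k = 0; k < MAXLEN; k++)
        printf("%d ", na[k] + nb[k] + nc[k]);   /* prints 11 13 13 */
    printf("\n");
    return 0;
}

Running the sketch reproduces the [11,13,13] result shown above.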

Computations can also appear in sequences. In the language definition the

functions and operators operate only on sequences and produce as results only sequences.

This expression contains a computation and a constant sequence.

[+([s_2]),[1,2]]

The computation is evaluated in place. If s_2 has the following value:

s_2=[[1,2,3],[4,5],6]

Then the result produced by the expression is:

[[11,13,13],[1,2]]

SequenceL provides the following operators +,-,*,/ in support of addition, subtraction,

multiplication and division.

The fact that function references and computations can appear in a sequence is a

powerful feature of SequenceL and supports the consume-simplify-produce philosophy

of the language. Details on this philosophy will be presented later in this section.

SequenceL has three basic constructs for processing sequences: regular, irregular, and generative constructs [Coo96].

1.1.1 Regular Construct

A regular construct applies an operation in a uniform manner to a normalized non-scalar operand or sequence. For example, given the sequence:

s_1=[1,2,3,4,5]

The following SequenceL addition expression:

+([s_1])

is applied to all the elements of the sequence s_1 in a uniform manner. The result of this

expression is a summation of all five singletons.

[15]

1.1.2 Irregular Construct

An irregular construct applies an operation selectively to a non-scalar operand

based on a conditional expression. Conditional expressions are expressed with the

"when" clause which has a relational operation component and a tme and false

expression component. For example the following SequenceL expression uses the not

equal relational operator <>.

/([s_1(i),x(i)]) when <>([x(i),[0]]) else [ ]

The true expression is:

/([s_1(i),x(i)])

The false expression is the null or empty sequence.

[ ]

In this example only elements in x that are not equal to zero are divided into the

corresponding sub-sequences of s_1, which are selected by index i. Additional relational

operators include <, >, =, <=, >=, which provide for less than, greater than, equal, less than or equal,

and greater than or equal.

1.1.3 Generative Construct

While the regular and irregular constructs reduce their inputs in terms of size and dimension,

the generative operation is an expansion operation. The following expression is a

generative expression.

gen([[1],...,[5]])

This generative expression results in the sequence.

[[1],[2],[3],[4],[5]]

1.1.4 Consume-Simplify-Produce

The SequenceL tableau can best be described as a shared memory area that

contains a complete problem solution in SequenceL. This includes all functions,

operations and sequences. Within the tableau SequenceL expressions are processed using

a consume-simplify-produce strategy. When SequenceL expressions or functions are

referenced they proceed to consume input arguments and undergo a simplification

process before they are evaluated. When no more simplifications can take place the

simplified expressions are evaluated and a result is produced. The result produced will

normally be a sequence. Some functions can produce another function as a result.

SequenceL expressions and function references are also placeholders in the tableau.

Therefore the result produced will replace the evaluated SequenceL expression or

function in the tableau. The examples in this paper use small data sets; memory is the

only limitation on the maximum size of an input data set.

A simple example of this consume-simplify-produce philosophy is illustrated in

the following example. Assume the following sequence appears in a function.

[eigen,matmul,max,s_2,s_1]

Assume that eigen and max are SequenceL functions and each accepts one input

argument. Matmul is also a SequenceL function, but it accepts two input arguments. In

the evaluation of this sequence, s_2 would be consumed by max. Max would then

simplify its expressions working towards producing a result that would replace max in

the sequence. The function matmul would then consume the result that max produced and

the sequence s_1. Matmul would then simplify its terms, producing a result. The function

eigen would finally be referenced and it consumes the sequence produced by matmul,

eigen then simplifies its terms and produces a result. Note that this consume-simplify-

produce process continues until no more functions or operators appear in the tableau. In

this example all function references and input variables have been consumed; the only

expression left is the sequence produced by eigen. This consume-simplify-produce

philosophy is also the case for sequences containing nested computations. For example

the multiply/add operation in matrix multiply is carried out by:

[+([*(s_1(i,*),s_2(*,j))])] taking [i,j] from [[1,1],[1,2],[2,1],[2,2]]

In this example the multiply operation consumes the sequences s_1 and s_2 and simplifies the nested expression, which becomes:

[ [ +([*([s_1(1,*),s_2(*,1)])])
    +([*([s_1(1,*),s_2(*,2)])]) ]
  [ +([*([s_1(2,*),s_2(*,1)])])
    +([*([s_1(2,*),s_2(*,2)])]) ] ]

If we set s_1 and s_2 as follows:

s_1 = [1,2]
s_2 = [[5,6],[7,8]]

The above set of terms are simplified further to:

[ [ +([*([[1],[5,7]])])
    +([*([[1],[6,8]])]) ]
  [ +([*([[2],[5,7]])])
    +([*([[2],[6,8]])]) ] ]

Because the two sequences that are to be multiplied and added together are not the same

dimension, normalization will take place.

[ [ +([*([[1,1],[5,7]])])
    +([*([[1,1],[6,8]])]) ]
  [ +([*([[2,2],[5,7]])])
    +([*([[2,2],[6,8]])]) ] ]

The terms are now ready for evaluation, and a result is produced.

In the above example, the identifiers i and j are indexes used by the index operation

to select sub-sequences of s_1 and s_2. The index construct s_1(i,*) also contains the

wildcard operator * which says select all. The wildcard is useful in selecting larger parts

of a nested or multi-dimensional sequence. For example, given:

s_1 = [[1,2],[3,4],[5,6]]

The expression s_1(1,*) means "select all elements of sub-sequence 1," which results in:

[1,2]

The expression s_1(*,1) means "select only element 1 of all sub-sequences," which

results in:

[1,3,5]

In the nested computation example i and j are assigned values by the taking

expression:

taking [i,j] from [[1,1],[1,2],[2,1],[2,2]]

The taking expression is the only assignment statement in SequenceL. It processes the

sequence specified after "from," in the above expression that sequence is;

[[1,1],[1,2],[2,1],[2,2]]

We will call this sequence the "taking sequence." Taking will assign parts of the taking

sequence to identifiers specified after "taking." In this example there are two identifiers

after "taking." The determination of how to assign parts of the taking sequence to the

identifiers is dependent on the number of identifiers specified. For the above expression

since there are two identifiers two assignments need to be made for each sub-sequence

taken from the taking sequence. The first sub-sequence processed will be [1,1], therefore

identifiers i and j are set to 1 and 1, respectively. If there had been only one identifier,

such as:

taking [i] from [[1,1],[1,2],[2,1],[2,2]]

Then taking would assign i the sub-sequence [1,1]. The next sub-sequence from the taking sequence is

[1,2]. This process of taking sub-sequences from the taking sequence and assigning

values to identifiers continues until the taking sequence has no more sub-sequences left to

process. The process of assigning values to taking identifiers and then using those

identifiers as indexes in an index operation is at the heart of one of the implied

parallelisms that is inherent to SequenceL.
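The following small C sketch is illustrative only (the array name taking_seq and the loop structure are assumptions made for the example, not part of SequenceL or its compiler); it makes the taking rule concrete: with two identifiers each sub-sequence supplies one value per identifier, while with a single identifier the whole sub-sequence is bound to it.

#include <stdio.h>

/* The taking sequence from the example above. */
int taking_seq[4][2] = {{1,1},{1,2},{2,1},{2,2}};

int main(void)
{
    int n;

    /* taking [i,j] from ... : two identifiers, so each sub-sequence
       supplies two assignments per step */
    for (n = 0; n < 4; n++)
        printf("i=%d j=%d\n", taking_seq[n][0], taking_seq[n][1]);

    /* taking [i] from ... : one identifier, so i is bound to the whole
       sub-sequence, e.g. first [1,1], then [1,2], and so on */
    for (n = 0; n < 4; n++)
        printf("i=[%d,%d]\n", taking_seq[n][0], taking_seq[n][1]);

    return 0;
}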

SequenceL is a Turing-complete language [Fri]; a more complete description of the

language can be found in [Coo96, Coo98, CooOO]. A listing of the language grammar can

be found in Appendix A.

1.2 Implied Parallelisms in SequenceL

It is through the mixing of SequenceL's regular, irregular, and generative constructs that the programmer is provided with the tools needed to specify problem solutions. In SequenceL these problem solutions are specified by their data structures; in a procedural language the same problem solution is specified by an interaction between data and control structures. It is the combination of these constructs as well as the execution strategy of SequenceL that supports the language's ability to imply parallelisms. Implied

parallelisms involve one of three SequenceL operations:

a. Computations on singletons,

b. Computations involving indexing, and

c. Control Flow Parallelisms.

12

1.2.1 Singleton Computations

The implied parallelisms associated with singleton computations come from the

fact that a constant sequence consists of singletons. For example, the following sequence

contains six singletons.

[[1,2,3],[4,5,6]]

When this type of sequence is referenced in an operation such as addition.

+([[1,2,3],[4,5,6]])

Simplification would produce the following singleton additions:

[1]+[4] [2]+[5] [3]+[6]

These three additions are data independent and therefore there is no reason why these

three computations cannot take place in parallel. Any SequenceL operation on a set of

sequences can take advantage of parallelism involving singletons. This includes

operations like arithmetic operations and relational operations.
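As a hedged sketch of how such data-independent singleton operations could map onto threads on a shared memory machine (an illustration written for this discussion, not the code generated by the SequenceL compiler; the function name add_singletons and the fixed arrays are assumptions), each addition can be handed to its own POSIX thread:

#include <stdio.h>
#include <pthread.h>

int left[3]  = {1, 2, 3};
int right[3] = {4, 5, 6};
int out[3];

/* each thread performs one data-independent singleton addition */
void *add_singletons(void *arg)
{
    long k = (long)arg;
    out[k] = left[k] + right[k];
    return NULL;
}

int main(void)
{
    pthread_t t[3];
    long k;

    for (k = 0; k < 3; k++)
        pthread_create(&t[k], NULL, add_singletons, (void *)k);
    for (k = 0; k < 3; k++)
        pthread_join(t[k], NULL);

    printf("[%d,%d,%d]\n", out[0], out[1], out[2]);   /* prints [5,7,9] */
    return 0;
}

The three threads write to disjoint elements of out, so no synchronization beyond the joins is needed.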

1.2.2 Parallelisms Involving Indexing

The next type of parallelism is similar to the previous example of parallelism,

except in this case we move up to a higher level of sequence computation. The taking

clause is the mechanism through which parallel operations on sequences are implied. By

distributing indexes to the index operation, computations on the selected sequences can

imply parallelisms. The following is an example of a taking expression.

taking [i] from gen([[1],...,[5]])

In this example the gen expression is evaluated first and results in the sequence:

[[1],[2],[3],[4],[5]]

This sequence becomes the source of values for i, which is assigned by the taking

expression. Note, this is the closest SequenceL comes to an actual assignment statement.

Typically, the different i values are passed by the taking expression to an index operation

which provides sequences for a computation. Consider the following example

+([s_1(i)]) taking i from gen([[1],...,n])

For this example s_1 is defined as:

s_1=[[[1],[2]],[[3],[4]],[[5],[6]]]

The identifier n contains the length of s_1, which equals 3. SequenceL evaluates the

example as follows.

+([s_1(i)]) taking i from [[1],[2],[3]]

This simplifies to:

+([s_1(1)]) +([s_1(2)]) +([s_1(3)])

This simplifies to:

+([[1],[2]]) +([[3],[4]]) +([[5],[6]])

This is evaluated to produce the result:

[3,7,11]


In this example there is no reason why the simplification stages cannot occur in parallel

and that is exactly what happens in the SequenceL execution strategy for implied

parallelisms.

1.2.3 Control Flow Parallelisms

The final source of parallel execution is control flow parallelisms. An earlier

consume-simplify-produce example was given.

[eigen,matmul,s_1,max,s_2]

In this example each function is dependent on a result from the function to its right.

Matmul is dependent on the result produced by max and eigen is dependent on the result

produced by matmul. Therefore, matmul cannot be evaluated until max has been

evaluated and eigen cannot be evaluated until matmul has been evaluated. Given a

situation where there are no dependencies between SequenceL expressions or functions,

parallel execution becomes possible. For example consider the following sequence.

[quick,less,s_1[1],s_1,s_1[1],quick,great,s_1[1],s_1]

Quick is a SequenceL function that accepts one input argument; less and great are

SequenceL functions that accept two input arguments. SequenceL evaluates sequences

from right to left. A control flow analysis of this sequence reveals that the result of the

quick in the middle of the sequence is not needed by any of the functions to its left.

quick,great,s_1[1],s_1

Therefore the above expression can execute at the same time as the expression shown below.

quick,less,s_1[1],s_1


The term control flow parallelism is used here to describe this parallel evaluation process.

The actual parallelisms associated with this type of expression are defined by the

language semantics.

1.3 Document Overview

The research on SequenceL from 1991 to 1998 has focused on the language

definition. Then around 1999 it was discovered that problem solutions implemented in

SequenceL exhibited implied parallelism; this led to the development of a SequenceL

interpreter capable of identifying implied parallelisms. The next phase of this research is

presented in this dissertation. This phase of the SequenceL research is to do a proof of

concept for a SequenceL compiler. This compiler will be capable of producing

executable parallel code from SequenceL source code.

The rest of the dissertation is presented as follows. Chapter II will briefly explore

aspects of the current parallel programming domain. Chapter III lays out the

methodology behind the development of the compiler. Chapter IV will report on the

results of the compiler implementation and Chapter V will provide conclusions.

Appendix B is an implementation guide that lays out the mechanics behind the

development of the compiler. It also provides some background information on some

issues, such as cache locality, that needed to be addressed during compiler development.


CHAPTER II

CURRENT STATE OF PARALLEL PROGRAMMING

A number of different areas of parallel programming have been reviewed in order

to establish some sense of the current state of Parallel Programming in the High

Performance Computing field. There are three areas or approaches to parallel

programming that will be reviewed: explicit parallel languages, high-level languages that are compiled into a parallel form, and the automatic transformation of serial code into

parallel code [DiM96b]. Examples from each of these areas will be presented as

background material. Hardware, although not a primary focus of this research, is of some importance; therefore a brief review of the current state of parallel hardware systems will

also be included. The specific class of parallel machine the SequenceL compiler is

designed for is a shared memory multi-processor that supports POSIX threads.

There seem to be as many different strategies for achieving effective parallel

programming as there are parallel computers. These programming strategies typically

center on the procedural language paradigm since procedural languages like C and

FORTRAN are the most widely used for scientific programming and are available for

most systems [All]. It can be speculated that the resultant large investment in the existing

pool of procedural code and programmers is also a motivation for focusing on procedural languages. Examples of this approach are the Message Passing Interface (MPI) and OpenMP, both of which attempt to equip programmers with the tools they need to develop

parallel applications using procedural languages like C and FORTRAN.


2.1 Message Passing Interface

MPI is a Message Passing Interface standardized by a consortium of vendors,

implementers, and users [MPI]. Typically, MPI consists of a parallel library and some

server code. The server code is designed to manage multiple cooperating processes or

tasks executing on a distributed memory system. Since MPI developers are responsible

for coding all parallel operations, MPI can be described as an explicit approach to parallel

programming. MPI codes can run on either shared memory or distributed memory

parallel architecture, making it very popular with experienced parallel programmers. It

scales very well on most architectures [Leu], provided the level of inter-process

communication is reasonable. Like any parallel application, if the communication level

dominates, the overall efficiency of the system becomes poor [Kum].

Efficiency, E, is defined by the following formula:

E = Execution time using one processor / (Total Parallel Execution time * number of processors)

Total Parallel Execution time = time to compute + time to communicate.

We can see from the above equation that as communication time increases the

total parallel execution time for the multiprocessor increases and therefore the efficiency

drops. Ideally we would like to have the parallel execution time, on the multiprocessor, to

be as close as possible to the sequential time divided by the number of processors. The

expectation is that if one processor executes a program in x seconds then a multiprocessor

with y processors would execute in x/y seconds. For example if a program executes on a


single processor system in 20 seconds, then ideally a parallel version would execute on a

4-processor system in 5 seconds.

The efficiency for this ideal condition would be:

Efficiency = 20/(5*4) = 1.

As communication overhead increases on the multiprocessor system the parallel

execution time increases. If for example the parallel execution time for our 4-processor

system increases to 10 seconds, then efficiency drops to .5. It is easy to see that efficiency

is one way of measuring how much time a parallel system is spending on computations.
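A small C helper makes the arithmetic above explicit (an illustrative sketch only; the function name efficiency is not part of MPI or of this dissertation's code):

#include <stdio.h>

/* Efficiency E = T_serial / (T_parallel * nprocs) */
double efficiency(double t_serial, double t_parallel, int nprocs)
{
    return t_serial / (t_parallel * nprocs);
}

int main(void)
{
    /* the ideal case from the text: 20 s serial, 5 s on 4 processors */
    printf("E = %.2f\n", efficiency(20.0, 5.0, 4));   /* prints E = 1.00 */
    /* communication overhead doubles the parallel time */
    printf("E = %.2f\n", efficiency(20.0, 10.0, 4));  /* prints E = 0.50 */
    return 0;
}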

The MPI model is primitive from a programmer's viewpoint. There is no implicit

data sharing; MPI provides a set of communication library routines that allows a

programmer to set up a process to send a message from one task to another. The message

typically contains some data that the two parallel tasks are sharing. Therefore, the

developer must explicitly program every single point in a parallel application in which

data sharing occurs. The tasks themselves are UNIX processes that can be generated from

the same binary image or from different binary images. Typically, the Single Program

Multiple Data or SPMD model is used to program MPI based applications. This means

that MPI uses one binary image to spawn a UNIX process for each processor in a

multiprocessor system. When there is a single task for each processor, the parallel model

is called a coarse-grain parallel execution model [Nar]. When a SPMD parallel

programming model is used, every processor executes the same binary image; what

differentiates each program is a task identification number assigned by MPI. SPMD


programs should be designed to use this identification number in conditional expressions

so that each task processes a different part of a data structure.

The strength of MPI is also its weakness. MPI, through the use of low level

communication calls, gives the developer complete control over the parallel environment;

this puts all the onus on the developer to create an efficient and reliable parallel

application. It has been documented that MPI program development is both costly and

time consuming [Cha]. The scheduling of tasks or threads of computation within a

parallel application can be a complex task but is not a technical problem with MPI since

MPI gives the developer complete control over all of a task's parallelisms through MPI's

low level message passing calls.

Some manufacturers of shared memory computers, such as Silicon Graphics Inc.

have extended the use of MPI to shared memory systems by bundling MPI on their

shared memory computers.

2.1.1 MPI Programming Example

The following simple but meaningless code will be used to demonstrate the

process behind the development of parallel code using MPI.

#include <stdio.h>

#define SIZE 10    /* must be even number */

int main()
{
    int inputs[SIZE], i, j;
    int result[SIZE/2];
    int total = 0;

    for (i = 0; i < SIZE; i++) {
        inputs[i] = i;
    }

    j = 0;
    for (i = 0; i < SIZE; i = i + 2) {
        result[j] = inputs[i] * inputs[i+1];
        total = total + result[j];
        j++;
    }
    printf("total = %d\n", total);
    return 0;
}

This code takes the values 0,1,...,8,9 and multiplies each pair and then sums the

products. Therefore the following computation takes place.

0*1+2*3+4*5+6*7+8*9=140

Before an MPI programmer attempts to code this simple algorithm a number of

decisions need to be made. First, how many tasks should be executing in parallel? Data

decomposition is always the most difficult step, decomposition has a major impact on

efficiency, and for more complex data stmctures finding an optimal data decomposition

can be a hard problem. Ignoring granularity issues the data decomposition for this

example is simple enough. Since the problem is uniform in nature, 5 processes could be

created with each process getting two array variables to multiply. The next decision the

MPI developer needs to make is how should the 5 tasks be initialized? Should one

process initialize an array and then distribute it to the other four tasks or should each task

do its own initialization? In this case the static nature of the initial data makes the second

choice a good one. Since communication costs are lowered and the four other processes


would probably be idle during initialization anyway. In some cases a broadcast of

initialization data might be a better choice. For this example broadcasting will be

demonstrated. The final step, before displaying the result, is communicating all the

products back to one process. We will assume that summation is a serial operation. One

process is selected to sum the results and display the result. The MPI code is as follows.

#include <stdio.h>
#include <mpi.h>

#define SIZE 10

int main(int argc, char **argv)
{
    int myrank, nsize, x, i, inputs[SIZE], result, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nsize);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (nsize != SIZE/2) return(1);

    if (!myrank)                       /* task 0 initializes the inputs */
        for (i = 0; i < SIZE; i++)
            inputs[i] = i;

    MPI_Bcast(inputs, SIZE, MPI_INT, 0, MPI_COMM_WORLD);

    x = myrank*2;
    result = inputs[x] * inputs[x+1];

    MPI_Reduce(&result, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (!myrank) {
        printf("total = %d\n", total);
    }

    MPI_Finalize();
    return (0);
}


This MPI program is fairly painless since it makes use of the MPI broadcast and

reduce library calls. The MPI broadcast library call, MPI_Bcast, is used to send a

message from one MPI task to a collection of MPI tasks. Since all the MPI tasks are

executing the same binary image, they all execute the MPI_Bcast function call. One MPI

task is selected in the MPI_Bcast argument list to broadcast; by default the rest of the

tasks listen. In this example task 0 is selected to broadcast. Even though each task needs

only part of the array, task 0 broadcasts the entire array to every task. Each task uses its

task identification number (myrank) to determine which part of the array it's responsible

for processing. Upon completing the computation all the tasks call MPI_Reduce. Again

one task is selected through the argument list to listen and sum the data; by default all the

other tasks send their results. Task 0 is selected to reduce all the products to one

summation value. Task 0 then prints the result.

For more complex data sharing or to optimize data sharing, explicit sends and

receives between tasks can be coded. For example the above program can be redone

using the MPI send and receive library calls.

#include <stdio.h>
#include <mpi.h>

#define SIZE 10

int main(int argc, char **argv)
{
    int myrank, nsize, i, total;
    int inputs[SIZE], results[SIZE/2];
    int myresult, myinputs[2];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nsize);    /* cluster size */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* 0 thru N-1 */

    if (nsize != SIZE/2) return(1);

    if (!myrank) {
        for (i = 0; i < SIZE; i++)            /* task 0 initializes the data */
            inputs[i] = i;
        myinputs[0] = inputs[0];              /* task 0 keeps the first pair */
        myinputs[1] = inputs[1];
        for (i = 1; i < nsize; i++)           /* send each task its pair */
            MPI_Send(&inputs[i*2], 2, MPI_INT, i, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(myinputs, 2, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    }

    myresult = myinputs[0] * myinputs[1];

    if (!myrank) {
        results[0] = myresult;
        for (i = 1; i < nsize; i++)
            MPI_Recv(&results[i], 1, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
    } else {
        MPI_Send(&myresult, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    if (!myrank) {
        total = 0;
        for (i = 0; i < nsize; i++)
            total = total + results[i];
        printf("total = %d\n", total);
    }

    MPI_Finalize();
    return (0);
}

In this code, task 0 uses the MPI_Send library routine to explicitly send each task

the part of the array it has been assigned to do computations on. The rest of the tasks use

MPI_Recv to receive the array data in a message sent from task 0. The task id (myrank)

is again used, this time in a conditional statement to control the send/receive operations

between task 0 and the rest of the tasks. When results are available task 0 uses


MPI_Recv to get the results and the rest of the tasks use MPI_Send to send their results.

Since the reduce call was not used task 0 must explicitly sum the results. For even such a

simple algorithm it is evident from the example that the level of complexity can increase

rapidly with MPI. Given the difficulties programmers face when using MPI, OpenMP has

become an attractive alternative.

2.2 OpenMP

OpenMP has been described as a shared memory or distributed shared memory

parallel programming tool [Cha]. Its implementation is at a higher level of abstraction

than MPI although it would be wrong to describe it as a high level language. Silicon

Graphics Inc. pioneered the development of OpenMP in collaboration with other parallel

computer vendors. The OpenMP specification can be found at www.openmp.org; until recently this specification made up the bulk of the information available on OpenMP.

OpenMP can best be described as a thread model for shared memory processor systems

(SMP) implemented through the use of compiler directives. Therefore, OpenMP is not

really a language. On the Silicon Graphics Origin2000 multiprocessor system parallel

program developers can utilize OpenMP in one of two ways. Sequential programs can be

submitted to the OpenMP compiler, which can be requested to automatically add

OpenMP compiler directives to the code or a developer can manually place the directives

in the code. A typical approach is to have the compiler add the parallel directives. The

developer can then go back through the resultant source code and remove the directives

associated with code sections that would not be efficient to execute in parallel [SGI]. This


methodology forces the difficult task of explicitly specifying parallelisms back on to the

developer. OpenMP directives are designed to take advantage of fine-grained

parallelisms in loops. Coarse-grained parallel programming is also available through the

use of parallel regions and work-sharing constructs [Cha]. The advantage of OpenMP is

that the parallelisms are left up to a compiler; if a compiler doesn't support OpenMP the

directives are ignored. The disadvantage is that the compiler directives isolate the

developer from the actual thread implementation; the lack of thread tuning at a low level means the developer cannot easily explicitly specify how shared data will be distributed

and accessed. Being unable to explicitly specify data sharing has led to scaling problems

with some algorithms implemented using OpenMP [Leu].

2.2.1 OpenMP Programming Example

OpenMP compiler directives are typically placed around loops that operate on

large data sets or arrays. For example the following code illustrates the OpenMP directive

to parallelize a "for" loop.

#include <stdio.h>

#define SIZE 10    /* must be even number */

int main()
{
    int inputs[SIZE], i, j;
    int result[SIZE/2];
    int total = 0;

    for (i = 0; i < SIZE; i++) {
        inputs[i] = i;
    }

#pragma omp parallel for shared(inputs,result) private(i,j) reduction(+:total)
    for (i = 0; i < SIZE; i = i + 2) {
        j = i/2;    /* each iteration computes its own slot of result */
        result[j] = inputs[i] * inputs[i+1];
        total = total + result[j];
    }
    printf("total = %d\n", total);
    return 0;
}

In this example the arrays "inputs" and "results" are shared amongst all the

threads created by the OpenMP directive. The variables i and j are private or unique to

each thread created by the OpenMP directive. The reduction directive instructs the

compiler to set up a mechanism in the code to gather all the individual product results

from each thread and sum them together. The sum is stored in "total". Private data is

kept in each processor's local memory; any shared data located on a processor's local

memory can be accessed remotely over the interconnection network by all other

processors. It is this remote memory access to shared data that leads to bottlenecks and

scaling problems on shared memory systems.

This example illustrates both the power and the weaknesses of OpenMP. The

advantage is obvious; any programmer of average ability can quickly identify loops

within their code that are candidates for parallelization. The developer can then place the

directives in the appropriate location so as to direct the compiler to parallelize the loop.

The disadvantage is the fact that the developer must still work hard at specifying data

sharing at a low level in order to avoid the scaling problems encountered with the

automatic placement of the directives by the OpenMP compiler [Leu].


2.3 POSIX Threads

Threads such as POSIX threads are another tool available to developers for

parallelizing codes for SMP systems. POSIX threads or Pthreads are available for all

mainstream UNIX platforms as well as for Microsoft NT. The advantages of Pthreads

are: they are light-weight, since they carry only part of a task structure while the UNIX process that generated the threads manages the rest of the structure; they provide the developer with low-level control over execution; they are universally accepted and implemented in a standard way and therefore portable; and they execute on single processor

systems as well as SMP systems without change. OpenMP will do the same except

parallel programs must be recompiled to eliminate the effect of the thread directives. MPI

requires the installation of the underlying MPI server environment before MPI programs

can run on a single processor system.

On Solaris and IRIX systems Pthreads execute on light-weight processes (LWP).

A LWP can be described as a virtual processor upon which code can execute [Lew].

LWPs are capable of executing from one to many threads per LWP. The developer can

control the number of LWPs created per process and the number of threads executing on

each LWP but the operating system kernel is responsible for scheduling LWPs on

processors.

A classic approach to writing Pthread parallel programs has been to use a coarse

grain thread model where one thread is created for each processor [Nar]. A fine-grained

thread model has been found to have a number of advantages over coarse-grained

models. A fine-grained model has many threads per processor. This makes the fine-

28

grained thread-programming model more adaptable to changes in the number of available

processor. It can handle irregular parallelisms more effectively and is more effective at

load balancing a system [Nar]. The reason the fine-grained model is better at handling

irregular parallelisms and load balancing can be explained with the following example.

Assume that 10 fine-grained threads accomplish the computations done by a single

coarse-grained thread. When the coarse grained thread is executing on a processor at

some point in time it will be forced to yield the processor. While the coarse grained

thread waits to get the processor back it is in a suspended state. The same is true for fine-

grained threads except when one fine-grained thread is suspended the other nine threads

could still be executing on other processors that potentially could have been idle

otherwise.

A disadvantage of the fine-grained model is the increased number of threads

created and the associated thread creation overhead and subsequent load on operating

system kernel resources.
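For comparison with the MPI and OpenMP versions of the pairwise multiply-and-sum example given earlier, the following is a minimal POSIX threads sketch (an illustration written for this discussion, not code from the dissertation; the thread function multiply_pair and the use of one thread per pair are assumptions made for the example):

#include <stdio.h>
#include <pthread.h>

#define SIZE 10               /* must be an even number */

int inputs[SIZE];
int result[SIZE/2];

/* each thread multiplies one pair of inputs */
void *multiply_pair(void *arg)
{
    long j = (long)arg;
    result[j] = inputs[2*j] * inputs[2*j + 1];
    return NULL;
}

int main(void)
{
    pthread_t threads[SIZE/2];
    long j;
    int i, total = 0;

    for (i = 0; i < SIZE; i++)
        inputs[i] = i;

    /* spawn one thread per pair (a fine-grained decomposition) */
    for (j = 0; j < SIZE/2; j++)
        pthread_create(&threads[j], NULL, multiply_pair, (void *)j);

    /* wait for the threads and sum their results serially */
    for (j = 0; j < SIZE/2; j++) {
        pthread_join(threads[j], NULL);
        total += result[j];
    }
    printf("total = %d\n", total);
    return 0;
}

Creating one thread per pair is a deliberately fine-grained decomposition; a coarse-grained version would create one thread per processor and have each thread loop over several pairs.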

2.4 Parallel Programming Languages

Parallel language development has also been a factor in parallel programming,

although parallel languages have not achieved the level of interest and use that MPI and

OpenMP have achieved, they still have their advocates. The problem parallel language

developers face in gaining the acceptance of parallel programmers is the pressure placed

on developers to stay with FORTRAN or C owing to the huge amount of code already

written in C and FORTRAN [Dow]. Unless the language is easy to use, reduces time


spent programming, and executes efficiently, parallel program developers will tend to

ignore it. Therefore the most significant language work done has been in extending

FORTRAN and C [Dow]. Examples of this type of work are FORTRAN 90 and High

Performance FORTRAN (HPF). These two languages are not high-level parallel

programming languages and FORTRAN 90 is not a Parallel language but is the core of

HPF.

FORTRAN 90 brings new extensions to FORTRAN 77, such as:

• Array constructs
• Dynamic memory allocation and automatic variables
• Pointers
• New data types, structures
• New intrinsic functions, including many that operate on vectors and matrices
• New control structures, such as a WHERE statement which uses a logical expression to control array assignments without indexing
• Enhanced procedure interfaces.

After reviewing these features it becomes apparent that FORTRAN 90 is an

attempt to bring FORTRAN 77 up to date with some of the features found in C. The

problem facing FORTRAN 90 is that in an attempt to become more like C it begins to

encounter the same optimization difficulties encountered in C code [Dow]. The use of pointers and dynamic data structures can affect optimization since they introduce data structures that are not clearly identified until runtime. One of FORTRAN's strengths over

languages like C is that it is easier to optimize at compile time [Dow]. The development


of FORTRAN 90 led to the development of HPF. HPF like OpenMP uses directives to

guide the compiler in parallelizing the code. Therefore a FORTRAN 90 program mn

through an HPF compiler will produce the same results as mnning the same program

through a FORTRAN 90 compiler. It is the addition of the HPF directives to the

FORTRAN 90 code, which identify parallel opportunities in the code, that differentiate

FORTRAN 90 from HPF code. What makes HPF different then from OpenMP? The

design of HPF is based on a message-passing model, while OpenMP is based on a shared

memory model. Therefore HPF can take advantage of distributed memory systems such

as the IBM SP. HPF compilers are designed to optimize parallel execution by minimizing

communication between processors; developers must be aware of this and decompose and align the data structures with this in mind.

2.4.1 NESL

NESL is a high-level parallel programming language that comes closest to

SequenceL in terms of the features and objectives for the language. Therefore, it will be

examined in a little more detail. NESL is based on constructs that manipulate sequences or ordered sets. NESL is described by its authors as "a strongly typed,

applicative, data-parallel language" [Ble96]. NESL is a language that uses nested

constructs to achieve parallel execution. In order to encourage programming of parallel

applications NESL does not provide for any looping, although it is pointed out that

looping can be implemented through recursion.


The goals set out for the NESL language by the developers are fourfold:

1. Support parallelism by means of data-parallel constructs, which operate on

sequences or vectors of numbers.

2. Support nested parallelism; any user-defined function can be applied over

a nested sequence in parallel.

3. Support a variety of different parallel architectures.

4. To easily implement parallelisms through the simple use of its constructs.

Each of these goals has been met in some fashion by the language developers except, perhaps, for 3. NESL is actually an interface to another language called VCODE

[Ble96]. VCODE is a portable intermediate language that has been implemented on Cray

Y-MP, IBM SP, Intel Paragon and Connection CM-2 machines, or any serial machine

with a C-compiler. VCODE seems to have restricted NESL to a small set of parallel

computers. A prototype compiler was developed which converted NESL to VCODE and

then the resultant VCODE to Java [Har]. This effort was an attempt to investigate the

possibility of making NESL more portable by using Java since Java is a more widely

used language for parallel computing systems than NESL. The speed of the resultant

NESL/VCODE/Java code using a JDK interpreter was as much as 10 times slower than

native VCODE [Har].

2.4.1.1 VCODE

Since NESL is dependent on VCODE, what is VCODE? The language developers

describe VCODE as a data-parallel intermediate language [Ble94]. VCODE is based on a


stack/heap model. It has a small set of about 50 instructions, which are divided into two

categories of instructions, vector instructions and memory/control instructions. There are

four vector types in VCODE: Boolean vectors, integer vectors, floating-point vectors and

segment descriptor vectors. The segment descriptor vector is used to partition one or

more data vectors into segments. The vector instructions pull vectors off the top of a

stack and return results to the top of the stack. Here is a partial list of VCODE

instructions [Ble94]:

Memory/Control Instructions
  Memory Instructions: copy, pop, load, store, const
  Control Instructions: if-then-else, call, ret

Vector Instructions
  Elementwise Instructions: negate, +, *, =, >, and, not, select
  Permute Instructions: permute, spermute, bpermute, dist
  Scan Instructions: +-scan, max-scan, or-scan
  Segment Descriptor Manipulation Instructions: length, segdes

VCODE was never meant to be a development language; its goal was to provide

compiler designers with a portable intermediate language for the development of high-

level parallel programming languages, like NESL. For example VCODE understands and

supports nested-parallelisms, which is the concept that makes NESL a high-level parallel

programming language. VCODE implements nested parallelisms indirectly through the

use of the segment descriptor vector and segment instructions. A technique called

flattened nested parallelism is used by VCODE to support nested parallelisms for high-


level languages like NESL [Ble94]. Flattened nested parallelism can be implemented in

VCODE because segments are designed to operate independently, and in parallel. By

placing each nested parallel call in its own segment, parallel execution of the nested calls

can take place.
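The following C sketch is illustrative only (VCODE itself is a stack/heap-based intermediate language, and this is not VCODE code); it shows the flattened representation behind this idea: a nested sequence is stored as one flat data vector plus a segment descriptor of segment lengths, and each segment can then be reduced independently.

#include <stdio.h>

/* The flattened nested sequence [[2,1],[7,3,0],[1,2,3]] stored as one data
   vector plus a segment-descriptor vector of segment lengths; a segmented
   sum reduces each segment independently. */
int main(void)
{
    int data[]   = {2, 1, 7, 3, 0, 1, 2, 3};
    int segdes[] = {2, 3, 3};            /* lengths of the three segments */
    int nseg = 3, pos = 0;

    for (int s = 0; s < nseg; s++) {     /* segments are independent, so they
                                            could be reduced in parallel */
        int sum = 0;
        for (int k = 0; k < segdes[s]; k++)
            sum += data[pos++];
        printf("segment %d sum = %d\n", s, sum);   /* prints 3, 10, 6 */
    }
    return 0;
}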

NESL implements its parallelisms by performing operations on sequences; an example of a sequence would be:

[1,-3,2,4]

All elements in a sequence must be of the same type in NESL. Parallelisms are

implemented in one of two ways on sequences, either by applying a function or operator

across a sequence in parallel or through the use of built-in parallel functions which apply

parallelisms across a sequence.

The NESL examples shown are taken directly from the NESL interpreter. A

simple NESL example that implements parallelisms;

{negate(a): a in [3, -4, -9, 5]};

=> [-3, 4, 9, -5] : [int]

The above statement is read as " for each a in the sequence [3, -4, -9, 5], negate,

in parallel, each a". The result is returned after the symbol ^ . The [int] is the type of tiie

result: a sequence of integers. The curly brackets { } delineate the operation that is to be

implemented in parallel. Multiple sequences can also be handled;

{a + b : a in [3, -4, -9, 5]; b in [1, 2, 3, 4]};

=> [4, -2, -6, 9] : [int]


For the above example to work the two sequences must be of the same length.

Normalization in SequenceL allows for sequences of different length to be handled. The

above construct is referred to as an apply-to-each construct in the NESL documentation.

This capability is also available to NESL functions; the following factorial example is

given in the NESL documentation.

function factorial(i) = if (i == 1) then 1 else i*factorial(i-1);

The function construct is defined as:

function name(arguments) = body;

If not declared, int is implied by the compiler as the type to be returned by a function. To

use the factorial function in an apply-to-each construct:

{factorial(x): x in [3, 1, 7]};

=> [6, 1, 5040] : [int]

In this example the factorial function is applied to 3,1 and 7, in parallel. What is

surprising about this example is the use of serial recursion in the function, which defeats

the parallelism that is pursued in the language. What would make more sense is if NESL

expanded 3, 1 and 7 and then did the multiplication. For example;

{prod(x): x in [1,2,3,4,5,6,7,8]};

=> [40320] : int

This would then produce the factorial of 8 in parallel. This is accomplished through the

associative rule for multiplication, which is known to NESL [Ble96]. Using the associative rule for multiplication, parallel execution of multiplication can be

implemented on a parallel machine in logarithmic time using a tree.

              40320
                |
            24 * 1680
           /         \
        2 * 12     30 * 56
        /    \     /     \
      1*2   3*4  5*6    7*8

Figure 2.1 Parallel Multiplications by Associative Rule

The time to do this computation is t*log2(8) = 3t, where t is the time for one computation. To do this kind of parallel computation, associative rules for all operators must be known to NESL. NESL also supports multiple levels of parallelism, for example:

{sum(v):v in [[2,1], [7,3,0], [1,2,3]]};

=> [19] : [int]

In this example NESL performs each sub-sequence addition first in parallel and then

sums the three results in parallel. Again the associative rule, this time for summation, is

implemented on a parallel machine in logarithmic time using a tree.
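A short C sketch of the logarithmic-time tree reduction pictured in Figure 2.1 (a serial illustration of the idea written for this discussion, not NESL or VCODE code): at each level all pairwise products are data independent, so with enough processors each level costs the time of one multiplication.

#include <stdio.h>

/* Pairwise tree reduction of 8 values by multiplication, mirroring Figure 2.1. */
int main(void)
{
    int v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int n = 8;

    while (n > 1) {
        for (int i = 0; i < n/2; i++)  /* all pairs at one level are independent */
            v[i] = v[2*i] * v[2*i + 1];
        n = n / 2;                     /* log2(8) = 3 levels in total */
    }
    printf("%d\n", v[0]);              /* prints 40320 = 8! */
    return 0;
}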

The last construct to be examined is the nested construct; this is the heart of what

makes NESL tick as a parallel programming language. A simple example of a nested

construct might be:

{{x*y : x in A} : y in B};

Note the nesting of the constructs through the use of the curly brackets {{ }}; it is this structure that defines a NESL nested construct. For the above nested construct, assume sequence A consists of [1,2] and sequence B consists of [3,4]; the analysis of this nested construct is as follows:

1) read sequence [3,4] in parallel

2) read sequence [1,2] in parallel and apply to 3 and 4 in parallel.

The result of nesting the constructs is the parallel multiplication of all the elements in

both sequences; a two-processor system would be required for this concurrent

multiplication. This pattern of nesting NESL constructs is followed in the development of

the NESL programs. Understanding how to lay out the data structures, along the lines of the computation, into a series of NESL nested constructs is how parallelisms are achieved

in NESL.

2.5 Automated Parallel Language Tools

The third approach to parallel programming is the use of automatic transformation

tools. One example is Algorithm Recognition. The work of Keßler and DiMartino is presented as examples of this type of approach to parallel programming. One of the goals of

this approach is to translate, in an effective way, serial codes into parallel codes. With the

billions of lines of procedural code already in use, automatic conversion tools have the

potential to save millions of dollars and man-hours, while at the same time preserving the

investment in the original codes. So why has this approach not gained more widespread

use and acceptance? Keßler points out that for some serial algorithms determining how to

distribute the serial data stmctures in parallel for a parallel algorithm is an NP-complete

problem. In addition there are problems dealing with, syntactic variation, algorithmic

37

variation, delocalization (pattem is spread throughout the code) and overiapping

implementations. Also sorting out mntime issues like caching, network contention etc...

propagated by the original serial algorithm is very difficult for an automatic parallelizing

compiler.

2.5.1 PARAMAT

Keßler's PARAMAT tool works by processing a syntax tree of the original source code. Before generating the syntax tree, the program is preprocessed in order to make it as explicit as possible by: inlining all functions, which means replacing all function calls with the associated function code; forward propagation of constant expressions, which means removing all constant assignment statements (for example, given c=3+pi, if pi is a constant then c is also a constant); recognition and replacement of induction variables, i.e., integer variables indexing arrays that are not a loop variable of a surrounding loop [KeP]; and elimination of dead code, which is code that has become isolated and is never executed [DiM]. The idea with PARAMAT is to annotate as many nodes as possible in an abstract syntax tree with a pattern instance. The abstract syntax tree is traversed from left to right in post-order. An example of a syntax tree for matrix multiply is as follows.

for (i=1; i<=n; i++) {
    for (j=1; j<=m; j++)
S1:     c[i][j] = 0.0;
    for (j=1; j<=m; j++)
        for (k=1; k<=r; k++)
S2:         c[i][j] = c[i][j] + a[i][k]*b[k][j];
}

Leaf nodes are trivial to pattern match since they are typically variables or constants. The inner nodes are tested by a pattern matching algorithm that accounts for the effect of child nodes on a given parent node whose pattern is currently being tested. The testing results in the assignment of a pattern instance to the node. For example, the statement at S1 is recognized as a scalar assignment statement and assigned the pattern instance SINIT. The recognizer then moves up the tree to the "for j" loop around S1; this combination is recognized as an array assignment pattern, and the SINIT and the "for j" loop statement are replaced with VINIT, the vector assignment pattern instance.

The syntax tree is:

for i
  for j
    assign
      c[i][j]
      0.0
  for j
    for k
      assign
        c[i][j]
        +
          c[i][j]
          *
            a[i][k]
            b[k][j]

Figure 2.2 Syntax Tree

The code now looks like:

for (i=1; i<=n; i++) {
    VINIT(c[i][1:m], 0.0);
    for (j=1; j<=m; j++)
        for (k=1; k<=r; k++)
S2:         c[i][j] = c[i][j] + a[i][k]*b[k][j];
}

The process then moves over to the multiply statement and recognizes it as such; it then moves up to the addition and recognizes that the child pattern for multiply can be rolled into the add as an add/multiply pattern instance. The next pattern is the dot product, and so on up the syntax tree. This process continues until finally the matrix multiply pattern has been recognized. The matrix multiply pattern is then replaced with a template of matrix multiply (a function call), which is designed to execute in parallel.
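As a hypothetical illustration of this final replacement step, the recognized loop nest would be rewritten as a single call to a parallel template; the name MATMUL_TEMPLATE and its argument order below are invented for illustration and are not PARAMAT's actual interface.

/* After pattern recognition: the whole recognized loop nest (S1, S2 above)
   is replaced by one call to a parallel library template.  The template
   name and signature shown here are illustrative only.                    */
MATMUL_TEMPLATE(c, a, b, n, m, r);   /* c[1:n][1:m] = a[1:n][1:r] * b[1:r][1:m], executed in parallel */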

PARAMAT can recognize 91 nontrivial patterns and has 150 nontrivial templates. PARAMAT does a reasonable job of recognizing patterns but is limited to certain problem domains involving numerical methods. As new domains are explored, new patterns can be added.

2.5.2 PAP

DiMartino's PAP tool is more general in its algorithm recognition and for that reason is more general purpose in terms of the problem domains it can handle. It therefore does not have the performance of PARAMAT in recognizing code. PAP uses the idea of concept recognition rules that recognize concept instances. For example, a concept might be a do or an assign statement at the lowest level, a scalar product at some intermediate level, and matrix multiplication at the highest level. PAP uses the Vienna Fortran Compilation System (VFCS) as a front-end. PAP gets from the VFCS a data dependency graph, a control flow graph, a syntax tree and a symbol table. With this information PAP uses a Prolog-based inference engine and database to build the Base Internal Representation of the program, and it stores this in a Concept Instance Database. This representation is the Program Dependency Graph (PDG), where nodes are statements and edges are control and data dependencies. While the PDG is being built, the HPDG is being created from concepts that reflect what was recognized in the PDG. Figure 2.3 is the derivation tree for matrix multiply and directly corresponds to an HPDG for matrix multiply.

[Figure: derivation tree whose leaves are do and assign statements, built up through Count-loop, Scalar-shift, Scalar-shift-incr-product and Scalar-product concepts to Matrix-vector-multiply and, at the root, Matrix-matrix-multiply.]

Figure 2.3 Derivation Tree for PAP

The PAP recognizer uses a concept database and backtracking methods to generate the HPDG. Unlike PARAMAT, which tries to develop one concept instance, like matrix multiply, PAP maintains a pattern that contains all the sub-concepts, such as the add/multiply concept [DiM]. Implementation of PAP has been limited to parallel systems utilizing the Vienna Fortran Compilation System. PAP itself is written in Prolog and takes advantage of Prolog's inference engine in carrying out its concept recognition. The final output of PAP is the HPDG; the next phase of the PAP research is to automatically generate parallel code from the concept tree.

2.6 Parallel Architectures

From a hardware standpoint, many of the more recent successful parallel architectures have been either shared memory systems (SMS), like the Silicon Graphics Inc. Origin2000, which is a NUMA cache coherent system [Lau], or distributed memory systems (DMS), such as IBM's SP and Beowulf clusters. A quick review of the TOP 500 supercomputer list reveals that 396 are either IBM SPs (DMS), SGI Origins (SMS), Sun HPCs (DMS), Compaq Alphas (DMS) or self-made clusters (DMS). Ideally, any new parallel languages and compilers would be developed with one of these types of systems in mind. Typically, distributed memory architectures are more difficult to program since they are message-passing architectures. They also have fewer sophisticated development tools, such as good parallel debuggers. Shared memory architectures, like the Origin2000, are shipped with vendor-supplied integrated development tools and compilers that can take advantage of all processors using standard program development techniques. In addition, systems like the Origin2000 can also be programmed using message passing if necessary.

A third architecture that is not a parallel architecture but is of importance to the success of any new language and its acceptance is the desktop workstation. A language that can take advantage of parallel supercomputers and single-processor desktop systems without requiring the rewriting of the code has a better chance of acceptance than a language that does not meet this requirement.

Most of the parallel programming tools reviewed in this chapter require the developer to explicitly code or specify the parallelisms. The automated tools like PARAMAT convert serial code to parallel code, and therefore do little to reduce the programming effort associated with coding new applications in parallel, since a serial program is required first. Therefore, tools like PARAMAT target legacy code. High-level programming languages like SequenceL target new application development.


CHAPTER III

METHODOLOGY

This chapter describes the methodologies that guided the development and implementation of the SequenceL compiler. Traditional compiler design involves the following steps [Aho]:

• Lexical Analysis

• Syntax Analysis

• Intermediate Code Generation (Semantic Analysis)

• Optimization

• Code Generation

Each step has its corollary for this compiler implementation, as well as some additional steps.

• Lexical Analysis

• Syntax Analysis

• Intermediate Code Generation

• Scheduling Analysis (Optimization)

• Conversion to C (Code generation)

• Compile to Parallel C

• Runtime Environment


3.1 Lexical Analysis and Syntax Analysis

Rather than using a compiler-compiler to generate the lexical and syntax analyzer, the conciseness of the SequenceL grammar and experience with LL(1) compilation motivated the choice to write an LL(1) parser from scratch. An LL(1) parser is a grammar-driven approach that processes a source program as a string from left to right (the first L) using leftmost derivations (the second L) and has a one-token look-ahead buffer. The lexical analyzer performs a character-by-character scan of the input to identify the language vocabulary; analysis involves identifying tokens and determining if the tokens are legal members of the language. A token is a sequence of characters having a collective meaning [Aho].
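For illustration, a token in such a hand-written lexical analyzer might be represented by a small record like the following; the enumerator names and the get_token interface are illustrative sketches rather than the compiler's actual definitions.

/* Illustrative token record: a token class plus an index into the
   symbol table entry holding the lexeme and its attributes.        */
typedef enum { TOK_ID, TOK_NUM, TOK_PLUS, TOK_STAR,
               TOK_LPAREN, TOK_RPAREN, TOK_END } token_type;

typedef struct {
    token_type type;       /* kind of token                          */
    int        sym_index;  /* pointer (index) into the symbol table  */
} token;

token get_token(void);     /* scans characters and returns the next token */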

As tokens are read they are placed in a symbol table; each token has at least two attributes, a type and a pointer value into the symbol table. Syntax analysis checks the token patterns in the source code to ensure that the code meets the language specification as set down by the language grammar. Before proceeding with the design of the syntax analyzer, the language grammar must be made compliant with the requirements of an LL(1) parser.

Definition: "A context-free grammar is a formal system that describes a language by

specifying how any legal text can be derived from a distinguished symbol called the

axiom, or sentence symbol. It consists of a set of productions, each of which states that a

given symbol can be replaced by a given sequence of symbols" [Onl].

Definition: A selection set is a set of tokens that defines the first legal tokens for a given production. If a production has an epsilon (empty) option, the selection set for the epsilon option will be the first tokens of the productions that can legally follow the production that has the epsilon option.

Definition: A context-free grammar is an LL(1) grammar if and only if any two productions with the same left-hand sides have different selection sets. This means that the grammar must first be context free and that all productions must be uniquely identifiable.

Aho and Ullman in "Principles of Compiler Design," also affectionately known as

the "Dragon Book," present techniques for placing a grammar in a form that is suitable

for LL(1) parsing, without losing the meaning of the original grammar. These steps are:

1. Elimination of Left Recursion; and

2. Elimination of common prefixes; and

3. Selection set determination.

A simple grammar will be introduced in order to provide examples of the methods

used to develop the compiler.

E => E + E | E * E | (E) | id

This grammar is ambiguous, which means that more than one parse tree can be generated from the same expression [Aho]. For example, the expression id + id * id generates two different parse trees. This is because the grammar gives no indication of operator precedence; as a result, a derivation of id + id * id can start with either the + or the *, resulting in two different parse trees. This problem can be eliminated by associating precedence with the operators. Aho and Ullman provide a methodology for eliminating ambiguity. The first step is to set up a production for the elements. An element is a single identifier or a parenthesized expression.

F => (E) | id

The next step is to give multiplication a higher precedence than addition. Therefore a

term for the multiplication operator must be set up.

T => T * F | F

The final step is to set up the production for the addition, which links terms through the +

operator.

E => E + T | T

The final unambiguous grammar is:

E => E + T | T
T => T * F | F
F => (E) | id

3.1.1 Eliminate Left Recursion

Both direct and indirect left recursion must be eliminated. The definitions of direct and indirect left recursion are taken from [Coo02].

Direct left recursion exists if and only if the grammar contains a production of the form:

A ::= Aα

Indirect left recursion exists if and only if a rule is not directly left recursive but can satisfy the derivation:

A =>+ Aα

The following algorithm is used to eliminate left recursion [Coo02].

Input: A left recursive syntax rule in a form wherein left recursive options precede non-left recursive options:

B ::= Bα1 | Bα2 | ... | Bαm | αm+1 | αm+2 | ... | αm+n ;

Output: Two rules, which are not directly left recursive but are equivalent to the input rule.

Procedure: Replace B with two productions:

B  ::= αm+1B1 | αm+2B1 | ... | αm+nB1
B1 ::= α1B1 | α2B1 | ... | αmB1 | ε

The symbol ε signifies the epsilon option or empty production.

Returning to the simple grammar, the elimination of left recursion produces

E  => TE'
E' => +TE' | ε
T  => FT'
T' => *FT' | ε
F  => (E) | id

3.1.2 Eliminate Common Prefixes

Common prefix elimination, or left-factoring as Aho and Ullman describe it, is necessary since LL(1) parsers have a one-token look-ahead buffer. This means that the parser makes decisions on where it is to go next in the parse tree based on looking ahead only one token. If a production has two options with the same first token, the parser is unable to reliably choose the next production to syntax check. The result is the potential false reporting of a syntax error. For example, given:

A => αβ1 | αβ2


If the parser has α in its look-ahead buffer and chooses option αβ1 as the current production but encounters a β2 after consuming the α, a syntax error occurs. The following algorithm eliminates the common prefix problem [Coo02].

Input: A common prefix rule:

B ::= αβ1 | αβ2 | ... | αβm | Xm+1 | Xm+2 | ... | Xm+n ;

Output: Two rules which have no common prefixes.

Procedure: Replace B with two productions:

B  ::= αB1 | Xm+1 | Xm+2 | ... | Xm+n
B1 ::= β1 | β2 | ... | βm

Eliminating common prefixes from the example yields:

A  => αA'
A' => β1 | β2

In effect the A' production introduces a branch in the A production. Now, after accepting the α token, the parser moves to the A' production and checks for either a β1 or a β2, returning true when it encounters a β1 or β2.

For the example grammar there are no common prefixes.

3.1.3 Selection Set Generation

Once left recursion and common prefixes are eliminated, the resultant grammar must be inspected for ε options. If an ε option appears in the grammar, selection sets must be generated. The following steps must be carried out in order to establish the selection sets [Coo02].

1. Determine the FIRST set. This is a set that contains all the initial grammar symbols in a production.

2. Determine the FOLLOW set. There are three steps in generating the FOLLOW sets.

A. Set the FOLLOW set of the start symbol to {$}.

B. For each RHS αAβ, add the non-ε elements of FIRST(β) to FOLLOW(A).

C. For each rule B ::= αAβ, where β is empty or can derive ε, add FOLLOW(B) to FOLLOW(A); repeat until no more changes occur.

3. Determine the SELECTION sets. The SELECTION set of a non-epsilon RHS option is the option's FIRST set. The SELECTION set of an epsilon option is the rule's FOLLOW set.

The following is the SELECTION set table for the simple grammar introduced earlier in the chapter. Because this grammar has epsilon options, SELECTION sets must be generated so that the syntax analyzer can determine what tokens legally follow a production with an epsilon option. $ is used to indicate the end of a sentence. Once the SELECTION sets are generated, the full specification for the LL(1) parser exists. The source code for the syntax analyzer can be taken directly from the SELECTION set table.

Table 3.1 Selection Sets

Production      First      Follow       Selection
E  => TE'       ( id       $ )          ( id
E' => +TE'      +          $ )          +
   => ε                                 $ )
T  => FT'       ( id       + $ )        ( id
T' => *FT'      *          + $ )        *
   => ε                                 + $ )
F  => (E)       (          * + $ )      (
   => id        id                      id

int e() {
    if (t()) {
        if (e'()) return true;
        else return false;
    } else
        return false;
}

int e'() {
    if (token == '+') {
        token = get_token();
        if (t()) {
            if (e'()) return true;
            else return false;
        } else
            return false;
    } else {
        if (token == $ || token == ')') return true;
        else return false;
    }
}

int t() {
    if (f()) {
        if (t'()) return true;
        else return false;
    } else
        return false;
}

int t'() {
    if (token == '*') {
        token = get_token();
        if (f()) {
            if (t'()) return true;
            else return false;
        } else
            return false;
    } else {
        if (token == '+' || token == $ || token == ')') return true;
        else return false;
    }
}

int f() {
    if (token == '(') {
        token = get_token();
        if (e()) {
            if (token == ')') return true;
            else return false;
        } else
            return false;
    } else {
        if (token == id) return true;
        else return false;
    }
}

This pseudo-code is generated directly from the productions and their SELECTION sets. Note that tokens are only consumed when a token is matched.

The following is the call sequence for the syntax analysis of the expression A * B, using the syntax analyzer for the example grammar.

1. e() is called.

2. e() calls t().

3. t() calls f().

4. f() determines the token is an id, gets the next token and returns true.

5. t() calls t'().

6. t'() determines that the token is a *, gets the next token and calls f().

7. f() determines that the token is an id, gets the next token and returns true.

8. t'() calls t'().

9. t'() determines epsilon and returns true to t'().

10. t'() returns true to t().

11. t() returns true to e().

12. e() calls e'().

13. e'() determines epsilon and returns true to e().

14. e() returns true, indicating the expression is syntactically correct.

[Figure: call diagram for syntax checking A * B — the token contains A; f() determines the token is an id and gets the next token; t'() determines the token is * and gets the next token, or takes the epsilon option; e'() takes the epsilon option.]

Figure 3.1 Syntax Checking


3.2 Semantic Analysis

Semantic analysis is typically integrated into the syntax analyzer to create what is known as a one-pass compiler. This means that the source code file is parsed and analyzed for lexical errors, syntax errors, and semantic errors in one pass through the source code. Additionally, the semantic analyzer may generate intermediate code via what are known as semantic actions. The triggers for initiating a semantic action must be placed at appropriate points in the productions.

The semantics of a program are typically captured in a machine-independent code. This code is machine independent because there is no management of memory or register usage defined. One example of a machine-independent intermediate code is quadruples [Aho]. Quadruples consist of four fields: an operation, two operands, and a result field. During intermediate code generation the fields are assigned addresses, which are locations in the symbol table. Not every quadruple generated has all of its fields assigned addresses. Unary operators like x = -y use only one of the operand fields. Jumps put the target in the result field [Aho].
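A quadruple can be pictured as a small record; the following C sketch uses illustrative field names and types (symbol-table indices), not the compiler's actual declaration.

/* Illustrative quadruple record: an operation, two operand fields and a
   result field.  Operands and results are symbol-table locations; fields
   not needed (unary operators, jumps) are simply left unused.            */
typedef struct {
    char op[8];    /* operation, e.g. "+", "*", ":=", "jump"               */
    int  arg1;     /* symbol-table index of the first operand, or -1       */
    int  arg2;     /* symbol-table index of the second operand, or -1      */
    int  result;   /* symbol-table index of the result (often a temporary) */
} quad;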

For the following expression:

A + B

Semantic analysis will generate the following quadruple, or quad, for this expression.

OP    Arg1    Arg2    Result
+     A       B       T1

In order to accomplish the task of generating a quad for an expression, the semantic analyzer must be able to recall information about the expression when the semantic action is invoked. A Semantic Analysis Stack (SAS) is utilized for this purpose. In the syntax analyzer a "push token" instruction is placed wherever a token needs to be placed on the SAS. For example, operands are typically pushed onto the SAS; therefore a "push token" instruction is placed in the syntax analyzer wherever an operand is detected by syntax analysis and needs to be pushed onto the SAS. As semantic actions are triggered in the grammar, the semantic action pops tokens off the SAS and uses them to generate the appropriate quads. For the above expression the SAS will contain A and B when the semantic action is triggered for the addition expression. The semantic action routine will pop the two operands off of the SAS and generate the quadruple, or quad, for addition at that point. The semantic action generates the temporary T1, which is used to store the result of the addition expression. T1 is pushed back onto the SAS, making it available for a semantic action rule associated with whatever expression might use the result of A + B as an operand. Using the example grammar, the semantic action for the addition expression will be:

B := pop(SAS)
A := pop(SAS)
T1 := generate temporary
genquad(+, B, A, T1)
push(T1, SAS)

The following is the syntax code with the semantic actions placed in the locations

that allow for the generation of the appropriate quads for multiplication and addition.

int e() {
    if (t()) {
        if (e'()) return true;
        else return false;
    } else
        return false;
}

int e'() {
    if (token == '+') {
        token = get_token();
        if (t()) {
            op2 = pop(SAS);
            op1 = pop(SAS);
            result = genquad("+", op1, op2);
            push(SAS, result);
            if (e'()) return true;
            else return false;
        } else
            return false;
    } else {
        if (token == $ || token == ')') return true;
        else return false;
    }
}

int t() {
    if (f()) {
        if (t'()) return true;
        else return false;
    } else
        return false;
}

int t'() {
    if (token == '*') {
        token = get_token();
        if (f()) {
            op2 = pop(SAS);
            op1 = pop(SAS);
            result = genquad("*", op1, op2);
            push(SAS, result);
            if (t'()) return true;
            else return false;
        } else
            return false;
    } else {
        if (token == '+' || token == $ || token == ')') return true;
        else return false;
    }
}

int f() {
    if (token == '(') {
        token = get_token();
        if (e()) {
            if (token == ')') return true;
            else return false;
        } else
            return false;
    } else {
        if (token == id) {
            push(token, SAS);
            return true;
        } else
            return false;
    }
}


Note how each time an identifier is correctly identified in f() it is pushed onto the SAS. When the second identifier has been correctly identified, either e'()'s semantic action will generate the quads for addition or t'()'s semantic action will generate the quads for multiplication. Therefore, the semantic actions occur at the following locations in the grammar.

E  => TE'
E' => +T(semantic action)E' | ε
T  => FT'
T' => *F(semantic action)T' | ε
F  => (E) | id

3.3 Intermediate Code

In many compilers source code is translated into an intermediate representation (IR) or intermediate code (IC) before compiling to the final object code. In the NESL/Java implementation the intermediate code is VCODE. One of four kinds of IR codes is often used in compilers [Aho]: postfix notation, syntax trees, quadruples and triples. The advantage of going to an IC, as opposed to going directly to machine code, is that optimization on the IC is generally easier to perform than on machine code.

3.3.1 Quadruples and Triples

Triples, or three-address codes, are made up of expressions involving an operator and two operands. Quadruples are similar to triples except that they have an additional address specified for the result. The address tables for A = X + Y * Z for triples and quadruples are as follows.


Table 3.2 Triples

Address    OP    ARG1    ARG2
(0)        *     Y       Z
(1)        +     X       (0)
(2)        :=    A       (1)

Quadruples are generated with temporary variables; information on these temporary variables is stored in the symbol table during IC generation, see Table 3.3. Whenever information on a temporary is needed during machine code generation it can be easily accessed from the symbol table. Triples do not store information on temporaries in the symbol table; therefore, in order to determine if a temporary is active, the IC must be scanned, see Table 3.2. Note the use of address pointers in the triple table; this makes it more difficult to implement control flow analysis.

Table 3.3 Quadruples

Address    OP    Arg1    Arg2    Result
(0)        *     Y       Z       T1
(1)        +     X       T1      A

3.4 Code Generation

An early decision was made to implement the compiler with C as the final object code for the compiler. This approach has proven successful in the past. One example is the GNU Prolog compiler [Dia], which is as efficient as Quintus Prolog 2.5 [Cod] and only 30% slower than SICStus Prolog. The advantage of compiling to C is that it reduces the complexity of the compiler; optimization will not be implemented in this compiler. The SequenceL compiler will rely on the C compiler's optimization capabilities to handle optimization. Another advantage of using C is the availability of monitoring and debugging tools for C programs; there are no tools designed for monitoring and debugging SequenceL code. Finally, since many parallel programmers code in C, they may find it more acceptable to work with a language that generates parallel C code.

During the proposal phase the question was asked: why not use FORTRAN as the object language? Parallel programmers have often chosen FORTRAN over C because automatic parallelization of C has always been difficult due to pointer arithmetic, irregular control flow and complicated data aggregation [Dow, Ken]. The availability of FORTRAN compilers is the primary reason not to select it: typically, if there is only one compiler on a computer system, it is probably a C compiler. It is also easier to implement threaded programs in C than in FORTRAN [Cha].

Normally, code generation involves the process of taking optimized intermediate code and converting it to assembly or machine language code [Pys]. Some of the difficulties associated with code generation to assembly or machine code include deciding what machine language code to generate, deciding on the order of the computations, and deciding which registers to use [Aho]. When generating C code, a new set of issues must be dealt with, such as what C code should be generated, generating the correct control flow, the use of global variables, variable and function naming conventions, memory allocation, and the dynamic creation of functions and data structures.


3.4.1 Parallel C Code

A pilot research effort was carried out in order to investigate and compare a series of algorithms implemented in SequenceL versus parallel implementations in a procedural language [Coo00]. Matrix Multiplication, Gaussian Elimination and Quicksort were chosen as the algorithms to be implemented. Each of these algorithms presents a different challenge. Matrix Multiply and Gaussian Elimination are examples of problems for which static a priori schedules can be generated. The difference between Matrix Multiply and Gaussian Elimination is the need for Gaussian Elimination to provide intermediate data to the various parallel paths, setting up a communication requirement. The paths of execution can be determined based upon the dimensions of the matrix, in Matrix Multiply, and the number of equations, in Gaussian Elimination. Quicksort requires dynamic scheduling and provided insights into SequenceL's ability to imply parallelisms under dynamic scheduling conditions. Although the pilot research study focused on SequenceL, it also provided insight into the implementation of the algorithms using procedural languages while reflecting on the SequenceL implementations. C/MPI and Java threaded codes were experimented with. Java was never considered for code generation, but experiments with Java's thread model did help in making a decision on how to implement the parallelisms in SequenceL using C.

Given the decision to use C and a shared memory architecture, the choices for the underlying C-based parallel development tool were narrowed down to three:


1. Multi-threading (Pthreads)

2. Message Passing Interface (MPI)

3. OpenMP

MPI is normally thought of as a distributed memory programming model, but it has also been implemented on many distributed shared memory systems such as the SGI Origin2000. MPI uses a UNIX process model for its parallel tasks. This means a UNIX process is spawned for every task that executes in parallel. Data sharing with MPI is complex and requires message passing. The problems with portability and the difficult programming model associated with MPI, coupled with the lack of adequate monitoring and debugging tools, eliminated MPI from consideration for this first SequenceL compiler.

OpenMP is a very attractive choice for the underlying parallel environment. OpenMP has a standard interface and therefore should port well between OpenMP systems. OpenMP is based on a thread model and is capable of data sharing between threads executing on different processors. OpenMP can support fine-grained parallelism through loop parallelism and coarse-grained parallelism using parallel regions and work-sharing constructs. Finally, OpenMP has good tools for monitoring and debugging. The basic problem with OpenMP is that it requires an OpenMP compiler and a system that supports OpenMP. This makes it impossible to port OpenMP programs to a shared memory system that does not support OpenMP. OpenMP codes can be recompiled for single-processor systems, but in reality the OpenMP directives are just ignored and the resultant program is a serial program.


The final decision was to select Pthreads. Pthreads have the following advantages. Pthreads are standardized and widely implemented. Pthreads use a lightweight process model; they are lightweight because they do not require a complete task structure. Instead, Pthreads inherit the task structure from their parent UNIX process. Second, Pthreads can take advantage of shared memory accesses: different threads can access the same memory locations. Pthread programs have the advantage of running on either a single-processor system or a multi-processor system as a multi-threaded program. There are also some disadvantages. Pthreads require more of a programming effort from the developer; this should not be a problem since the SequenceL compiler will be generating the thread code. Pthread thread functions are restricted to just one user-defined argument and no user-definable return values; a pointer to a C structure containing the input arguments and return variable is the recommended method for getting around this problem [Lew]. Pthread functions must also be synchronized with the calling routine so that results from the threaded function are not used before the threaded function completes. Pthreads provide two methods for synchronizing the data returned from a thread function: the semaphore and the thread join. The semaphore works by having the thread function set a semaphore just before it exits; the code that needs the thread result waits for the semaphore to be set at a location just before the results produced by the thread are needed. The thread join works in a similar fashion, except that in this case the thread must completely exit before a result can be used.
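The following minimal Pthreads sketch illustrates the struct-passing convention and the thread join just described; the work_t record and add_worker function are illustrative names only, not part of the generated code.

#include <pthread.h>

/* A thread start routine takes a single void* argument, so the inputs
   and the return slot are bundled into one structure [Lew].            */
typedef struct {
    double a, b;       /* input arguments         */
    double result;     /* filled in by the thread */
} work_t;

static void *add_worker(void *arg)
{
    work_t *w = (work_t *)arg;
    w->result = w->a + w->b;        /* the threaded function's work */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    work_t w = { 2.0, 3.0, 0.0 };

    pthread_create(&tid, NULL, add_worker, &w);
    /* ...other work can proceed concurrently here... */
    pthread_join(tid, NULL);        /* synchronize before using w.result */
    return (int)w.result;
}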


3.5 Scheduling

Task scheduling for parallel systems must be addressed and considered in the design and implementation of the SequenceL compiler. Factors such as task granularity, task allocation and task synchronization are scheduling issues that must be considered. Granularity is defined as the ratio between computation and communication [Kum], or how much computation should or will take place before there is communication between tasks. A fine-grained parallelism has very few computational instructions between communication cycles, while a coarse-grained parallelism has many computational instructions between communication cycles. For example, a coarse-grained parallelism involving 1000 calculations might involve only 10 threads with 100 computations each, while a fine-grained parallelism might involve 1000 threads of execution for the 1000 calculations. The amount of communication overhead must also be considered. For example, if each thread, whether coarse-grained or fine-grained, requires one communication cycle, the coarse-grained example will have 10 communication cycles and the fine-grained model 1000 communication cycles.
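A coarse-grained version of this example might be organized as in the following sketch, which partitions the 1000 computations into 10 threads of 100 each; the names and the stand-in computation are illustrative only.

#include <pthread.h>

#define N        1000
#define NTHREADS   10                 /* grain size: N/NTHREADS computations per thread */

static double data[N];

typedef struct { int lo, hi; } range_t;

static void *chunk_worker(void *arg)
{
    range_t *r = (range_t *)arg;
    for (int i = r->lo; i < r->hi; i++)
        data[i] = data[i] * 2.0;      /* stand-in for the real computation */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    range_t   r[NTHREADS];

    for (int t = 0; t < NTHREADS; t++) {
        r[t].lo = t * (N / NTHREADS);
        r[t].hi = (t + 1) * (N / NTHREADS);
        pthread_create(&tid[t], NULL, chunk_worker, &r[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);   /* one synchronization per thread */
    return 0;
}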

The difficulty with granularity is that communication overhead differs between parallel systems. Therefore it can be very difficult to establish a grain size for complex computations from one parallel system to another. The scheduling methodology for the SequenceL compiler will include the development of simple grain size controls.

The operating system scheduler will handle task or thread allocation. Numerical methods typically have minimal thread scheduling issues, since there are usually not a lot of independent threads executing in a numerical program. A debugger is an example of a program that does execute independent threads: threads to run and monitor a program, threads to keep the GUI alive, performance monitor threads, etc. This type of program would require a time-slicing thread scheduler in order to give all the threads adequate CPU cycles. The typical approach for scheduling threads associated with numerical methods is to schedule them without any difference in priority as they are generated [Lew].

Task or thread synchronization is a key element in using threads. Semaphores are the preferred method for synchronizing threads, but they suffer from race conditions in recursive structures. Therefore the thread join will be used for thread synchronization. The methodologies discussed in this chapter are used to build the SequenceL compiler. The mechanics of implementing these methods are documented in Appendix B.


CHAPTER IV

RESULTS

There are a number of differences between designing a SequenceL compiler and designing a compiler for a procedural language. Typically, the lack of static structures has made high-level functional languages more likely to be interpreted. Procedural languages have more static properties and therefore are more likely to have compilers [Lou]. It is therefore not hard to believe that many of the tools and instruments used to build compilers have been developed for procedural languages. The primary differences between implementing the SequenceL compiler and a procedural language compiler are: the source code is in SequenceL, a high-level nested language; the development of a new intermediate language; an enhanced symbol table; C as the target object code; and the fact that the compiler is built to generate parallel code.

This dissertation has achieved the following results:

• Established a proof of concept that there exists a SequenceL compiler that can create executable programs that embody the inherent parallelisms and other implied control structures in SequenceL,

• Developed a new intermediate language capable of representing the meaning of a SequenceL source program,

• Developed the techniques for spawning threads to dynamically create parallelisms using a threaded approach, and discovered that the SequenceL language implies a parallel execution model,


• Identified a number of optimization and performance enhancement opportunities, and

• Identified a new SequenceL language requirement for defining nesting and cardinality typing information for SequenceL data structures.

This chapter outlines how these five results were achieved. The most basic question this research needed to answer is: can a compiler be built that exploits the inherent parallelisms in the SequenceL language? The answer to this question is yes. What follows in the next few sections are the results of this implementation and a report on the issues that the SequenceL compiler had to address and overcome before it could translate SequenceL source code to parallel procedural code.

4.1 Proof of Concept

To date the language has been implemented as an interpreter in Prolog [Coo00]. Although the SequenceL interpreter can identify implied parallelisms, it cannot actually execute in parallel. The next phase of the SequenceL language development was to build a proof-of-concept compiler. The goal of the SequenceL compiler is to take SequenceL source code and generate parallel executable code. Chapter III describes the formal methods behind the development of the compiler. Chapter IV describes the results of that development process. Specific details on the mechanics of building the compiler are presented in Appendix B. Appendix B describes the translation of the grammar to an LL(1) grammar, the implementation of the syntax analyzer, the symbol table structure, the semantic analyzer, intermediate language generation, code generation and the runtime libraries.


The SequenceL compiler implementation dealt with a number of research issues related to the translation of high-level nested SequenceL code to procedural code. Some of the basic constructs of a procedural language include:

a. type statements,

b. assignment statements,

c. iterative constructs,

d. jump statements, and

e. computational statements.

The procedural programming model forces developers to focus more on how to get things done as opposed to what needs to get done [Mac]. SequenceL has no type statements, jump statements, or iterative constructs. SequenceL has one kind of assignment statement, the taking expression. Unlike procedural languages, SequenceL is designed so that developers can focus on what needs to be done rather than on how to do it.

The only data structure in SequenceL is the sequence. As described in Chapter I, a sequence can contain constant sequences, operations, and functions; sequences can also be nested. The SequenceL compiler has been designed to deal with sequence structures and will generate object code that faithfully preserves the meaning of the original source code.

The SequenceL compiler uses the SequenceL language semantics and seeks out opportunities for parallel execution and exploits them. It does so by examining the data dependencies between intermediate code (IC) statements in the IC table and by monitoring attribute information stored in the symbol tables. Together the symbol table and the IC are key elements in the compiler's ability to generate the C object code that includes the implied parallelisms. The IC statements and the attribute information in the symbol table are taken directly from the SequenceL language constructs during semantic analysis, and it is therefore the SequenceL language constructs that provide the necessary information about the location of the implied parallelisms in a SequenceL program.

4.1.1 General Approach to Mapping SequenceL Constructs

The proof of concept for the SequenceL compiler ultimately led to the definition of a formal method to map SequenceL constructs to multi-threaded constructs. This mapping takes place through the intermediate code and symbol table. There are three SequenceL constructs, regular, irregular and generative, which need to be defined in terms of this mapping. The mapping can be expressed as follows.

[ SequenceL Construct ]  →  [ Symbol Table / Intermediate Code ]  →  [ Multithreaded Code ]

Figure 4.1 Mapping SequenceL Constructs

4.1.1.1 Regular Construct

The SequenceL regular construct was described in Chapter I as a construct that applies an operation in a uniform manner to a set of singletons:

φ(S)

where φ is a SequenceL built-in operator (+, -, *, ...) and S is a sequence. The operator φ can be applied to linear sequences and nested sequences.


A linear sequence is a collection of singletons:

S = {s1, s2, s3, ..., sn}

A nested sequence is a collection of sequences. Given this definition of a sequence, the regular construct for SequenceL is defined as:

(φ, S) =
    s1 φ s2 φ ... φ sn,                        if S = {s1, s2, s3, ..., sn} and every si ∈ S is a singleton
    ∀ t ∈ S1 × S2 × ... × Sn : fork(φ, t),     otherwise

Multithreaded code implements (φ, S) in a way that takes advantage of the array structures it uses to represent sequences. Therefore the multithreaded code uses the following approach for implementing a regular construct. The first level of nesting is defined as follows:

• The singletons [a] and [b] are first level sequences of the sequence [a,b].

• [[a,b],[c,d]] and [[e,f],[g,h]] are first level sequences of [[[a,b],[c,d]], [[e,f],[g,h]]].

The object code implements φ[[[a,b],[c,d]], [[e,f],[g,h]]] as an operation between arrays of first level singletons. For [[[a,b],[c,d]], [[e,f],[g,h]]] the singleton arrays are a1 = a,b,c,d and a2 = e,f,g,h, where

a1 φ a2 = aφe, bφf, cφg, dφh

φ[[[a,b],[c,d]], [[e,f],[g,h]]] = gather(a1 φ a2)

where "gather" maps the resultant array to the cardinality and nesting of a first level sequence. Therefore the regular construct in object code is defined as

gather(A1 φ A2 φ A3 φ ... φ An)


where each array Ai is a first level sequence array containing a set of singletons {a1, a2, ..., am}, and

rj = aj1 φ aj2 φ ... φ ajn,  where j = 1, ..., m and i = 1, ..., n

rj is the result of each set of singleton operations. For every singleton in the first array, A1, a thread is forked. For every thread forked there is a result rj. This result rj in each thread is updated by φ for a given singleton in each subsequent array, i.e., rj φ aj. This continues until all first level sequence arrays have been processed.

(φ, S):
    ∀ t ∈ a1 × a2 × ... × am : fork(φ, t, aj)
    ∀ Ai ∈ {A1, A2, ..., An} ∃ {a1, a2, ..., am} : rj ← rj φ aj
    gather(r1, r2, ..., rm)

The following applies to both definitions of (φ, S). Define

S' = {S1, S2, S3, ..., Sn}

where S' is a set of sequences that contains S1, S2, S3, ..., Sn. The taking expression generates a set of index values that is used to index a sequence:

taking i from S

S(i) = {S1, S2, S3, ..., Sn}

Therefore

S(i) = S'


A regular construct with index variables is defined as

[φ(S')]

or

[φ(S1), φ(S2), φ(S3), ..., φ(Sn)]

The implied parallelism for a regular construct involving indexed sequences can be defined semantically as the Cartesian product of an operation φ and a set of sequences.

During intermediate code generation a symbol for S is placed in the symbol table:

∀ S ∈ ST,  where ST = symbol table

The following is the IC for the SequenceL expressions listed after the semicolons.

from   S   i          ; taking i from S
_M     S   i    t0    ; S(i)
φ      t0       t1    ; φ(S(i))

S is defined as a sequence; the intermediate code has no knowledge of what S might be in terms of dimension. S could be a singleton, a sequence or a nested sequence. This information is not available until runtime. t0 and t1 are temporary sequences that are generated by intermediate code generation to store the results. t0 is defined in the symbol table as being produced by an index operation. Therefore t1 will also be defined in the symbol table as being produced indirectly by an index operation.

Object code generation generates the multithreaded constructs when it detects the index attribute in the symbol table. When the IC code is read during object code generation, the index attribute triggers code generation to build an iterative thread construct to process the set.


[Figure: [φ(S(i))] maps, via the symbol-table set attribute, to an iterative thread construct built from the thread function φ' and the set S'.]

Figure 4.2 Mapping Regular Construct

The multithreaded construct is defined as

do (x = s1', ..., sn')
    φ'(x)

where S' = {s1', ..., sn'}. The iterative construct generates a thread for each element in the set S'. This iterative construct can be viewed as generating the Cartesian product of the thread function φ' and the set S'. The thread function φ' carries out the φ operation as a thread:

CARTESIAN((φ'), (S'))

Therefore multithreaded execution is defined as

∀ t ∈ φ' × S' : fork(t)

The final mapping from SequenceL to IC to multithreaded code for a regular construct is defined as follows:

φ([S(i)])  →  CARTESIAN((φ'), (S'))
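A minimal C sketch of the iterative thread construct this mapping produces is shown below; phi_thread, seq_arg and the fixed element count are illustrative stand-ins for the generated code's actual names and runtime structures, not the compiler's output.

#include <pthread.h>

#define M 5                           /* number of elements in S'            */

typedef struct { int i; } seq_arg;    /* index passed to the thread function */

/* phi': carries out the phi operation on one element of S' as a thread */
static void *phi_thread(void *arg)
{
    seq_arg *s = (seq_arg *)arg;
    /* ...apply phi to the s->i-th element and store its result... */
    (void)s;
    return NULL;
}

int main(void)
{
    pthread_t tid[M];
    seq_arg   s[M];

    /* the "do" construct: fork one thread per element of S',
       i.e. CARTESIAN(phi', S')                                */
    for (int i = 0; i < M; i++) {
        s[i].i = i;
        pthread_create(&tid[i], NULL, phi_thread, &s[i]);
    }
    for (int i = 0; i < M; i++)
        pthread_join(tid[i], NULL);   /* gather: wait for all results */
    return 0;
}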


4.1.1.2 Irregular Construct

The irregular construct mapping follows the same pattern as the regular construct, with some added complexity. The irregular construct can be defined as follows:

[φ1(S1)] when p(S2) else [φ3(S3)]

∀ p ∈ {>, <, >=, <=, <>}

The relational operator p tests the sequence S2. If S2 is not an indexed sequence, then

S2 = {s}

where s is a single sequence. Either the true or the false expression will execute, depending on the outcome of p(s). The true and false expressions are regular constructs. The mapping of this conditional expression to multithreaded code is

[φ1(S1)] when p(s) else [φ3(S3)]
        ↓
p s then φ1 S1 else φ3 S3
        ↓
if p(s)
    CARTESIAN((φ1'), (S1'))
else
    CARTESIAN((φ3'), (S3'))


What of the case where S2 is an indexed sequence? When this occurs, an implied parallelism is associated with the relational operation. The following statement is true of all SequenceL conditional expressions. Given

|S2'| = m : m > 1

then |S1'|, ..., |S3'| = m or 1.

If |S2'| > 1, the relational expression is placed in its own thread function P'. The input variables to P' are the set of index variables:

S1' = (s1_1, ..., s1_m);  S2' = (s2_1, ..., s2_m);  ...;  Sn' = (sn_1, ..., sn_m)

The thread function P' contains the conditional expression:

if p(x2)
    then ...
    else ...

The multithreaded call to P' would be via an iterative loop:

do (x1 = s1_1,...,s1_m;  x2 = s2_1,...,s2_m;  ...;  xn = sn_1,...,sn_m)
    P'(x1, x2, ..., xn)

Therefore a conditional operation with implied parallelisms can be defined as m sets of the function P' and n sequences:

S'' = (P', s1_1, s2_1, ..., sn_1), (P', s1_2, s2_2, ..., sn_2), ..., (P', s1_m, s2_m, ..., sn_m)

or

∀ t ∈ S'' : fork t


The number of sets goes from 1 to n since n relational, true and false expressions can be linked together:

[φ1(S1)] when p(S2) else [φ3(S3)] ... else [φn(Sn)]
        ↓
p S2 then φ1 S1 else φ3 S3 ... else φn Sn
        ↓
∀ t ∈ S'' : fork t

There are two assumptions made about this mapping. The first assumption is that the sequences in the set of sequences S1, S2, ..., Sn are the index variables required by the thread function P'. In the multithreaded object code, the input arguments to the thread function P' are the variables required to resolve these index variables. Therefore, when the index variable s_1(1) is required by P', s_1 and 1 are passed to the thread function and the actual index variable is resolved in the thread function.


The second assumption has to do with the use of a non-indexed variable in a conditional expression. For example, consider the following expression:

s_1(i) when >([s_1(i),[1]]) else [ ]

If i has values from 1 to n, this simplifies to:

s_1(1) when >([s_1(1),[1]]) else [ ]
s_1(2) when >([s_1(2),[1]]) else [ ]
s_1(3) when >([s_1(3),[1]]) else [ ]
...
s_1(n) when >([s_1(n),[1]]) else [ ]

The assumption is that there are n [ ], one for each conditional expression.
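A hypothetical C sketch of the thread function P' for this example is given below; the argument record, the use of a double array for s_1, and the use of 0 as a stand-in for the empty sequence [ ] are simplifications for illustration only, not the generated code.

/* Illustrative thread function P': each thread receives s_1 and one index
   i, resolves s_1(i) itself, evaluates the relational expression, and then
   runs the true or false branch.                                           */
typedef struct {
    double *s_1;      /* the sequence                       */
    int     i;        /* index value for this thread        */
    double  result;   /* result slot for this thread        */
} cond_arg;

static void *P_prime(void *arg)
{
    cond_arg *c = (cond_arg *)arg;
    if (c->s_1[c->i] > 1.0)           /* when >([s_1(i),[1]])                */
        c->result = c->s_1[c->i];     /* true branch: s_1(i)                 */
    else
        c->result = 0.0;              /* false branch: [ ], shown here as 0  */
    return NULL;
}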

4.1.1.3 Generative Construct

The mapping of the generative construct from SequenceL to IC to multithreaded code is as follows:

gen([a,...,b])  →  gen a b t0  →  gen(S)

S is a set containing three elements: a, b and t0.
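For illustration, a runtime helper corresponding to this IC statement might look like the following sketch; gen_seq is an invented name, and only the ascending case is shown (the descending case raises the semantic question discussed in Section 4.1.2).

/* Illustrative helper for the generative construct: fills the temporary
   sequence t0 with the integers from a through b (a <= b assumed here).  */
static int gen_seq(int a, int b, int *t0)
{
    int n = 0;
    for (int v = a; v <= b; v++)
        t0[n++] = v;
    return n;    /* cardinality of the generated sequence */
}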

These three formal methods were developed as a result of the insights gained during intermediate code development and were reinforced by the results from object code generation and testing of the thread model.

4.1.2 Proof of Concept through Testing

A number of SequenceL problem solutions have been compiled and executed to verify that the compiler has met its objectives. Of specific interest are the algorithms that were used in [Coo00]: Matrix Multiply, Gaussian Elimination and Quicksort. These three algorithms were identified in [Coo00] as having certain parallel properties that would present a challenge to a parallel programming language. Both Matrix Multiply and Gaussian Elimination provide a compiler with a priori knowledge of the parallel execution opportunities, while Quicksort has dynamic properties associated with parallel execution that are known only at runtime. The difference between Matrix Multiply and Gaussian Elimination is the need of the latter to communicate intermediate values during execution. The SequenceL interpreter identified the implied parallelisms for Matrix Multiply shown in Figure 4.3. The first version of the SequenceL compiler generated the same trace as the interpreter. The current version of the SequenceL compiler includes some scheduling capabilities; the SequenceL execution trace in Figure 4.4 reflects this.

[Figure: interpreter execution trace for the Matrix Multiply (mm) function, showing the parallelisms identified and the number of processors used.]

Figure 4.3 Interpreter Identified Parallelisms [Coo00]


[Figure: compiler-generated execution trace for the Matrix Multiply (mm) function, showing the number of processors used.]

Figure 4.4 Matrix Multiply Execution Trace

Initially, the Gaussian Elimination program failed on a logic error when executed. This was caused by a semantic inconsistency between the interpreter and the compiler. The following generative construct generates a descending set of sequences:

gen([5,...,1])

For this expression, using the compiler semantics, gen produces

[5,4,3,2,1]

The interpreter produces

[1]

After an adjustment to the compiler semantics, the execution trace in Figure 4.5 matches the interpreter's trace. The Quicksort algorithm provides the most interesting test for the SequenceL compiler. Quicksort involves parallel execution and recursion mixed together. The SequenceL compiler generated program duplicated the findings of the SequenceL interpreter.


[Figure: execution trace for Gaussian Elimination, showing the number of processors used at each step and the resulting matrix [[[1],[0.5],[0.33333],[1]], [[0],[0.08333],[0.083335],[-0.5]], [[0],[0],[0.00555111],[0.1667]]].]

Figure 4.5 Gaussian Parallelisms [Coo00]


[Figure: execution trace for Quicksort, showing the number of processors used at each step: 1, 2, 2, 4, 4, 4, 4.]

Figure 4.6 Quicksort Execution Trace

In the execution trace, Q is the quick function, L is the less function and G is the great function. For this example the quick function has 10 recursive calls, and the less and great functions have 4 each.

These results have led to the conclusion that the SequenceL compiler achieved its goal of generating executable parallel code that exploits the inherent implied parallelisms found in SequenceL. (Future research will include extending the set of algorithms that have been parallelized by the SequenceL compiler.) The following sections describe how this proof-of-concept result was achieved, as well as the new insights and research issues identified for the SequenceL language.


4.2 Intermediate Language

The development of the SequenceL intermediate language was a major step in the compiler's development. While the concept used is not new, the implementation is. What makes the implementation different is that a typical intermediate language is designed to preserve the semantics of a language with some target machine code in mind. SequenceL expressions and functions can be evaluated without side effects because they have no dependent control structures. The SequenceL intermediate language is designed to preserve the advantages of this independent function evaluation model that SequenceL employs, while at the same time providing a bridge to the C object code. The independent function evaluation model means that SequenceL functions and expressions can be evaluated with no side effects associated with implied control structures. An example of a side effect in a procedural language is a global variable assignment. In C, if multiple functions update the same variable with no synchronization mechanisms, data could be lost, causing unpredictable results.

A significant difference in the SequenceL IC is that it has no conditional jump statements or assignment statements, two features fundamental to implementing iterative structures. For example, the following code fragment is a "for" loop from the C programming language.

for(i=0; i < 10; i++) a[i]=0;

This simple statement consists of three assignment statements and a conditional jump. The intermediate language representation might be something like the following.


        :=      i       0
loop1:  <       i       10      t5
        jfalse  t5              exit
        :=      a[i]    0
        +       i       1       i
        jump                    loop1

4.2.1 Initial Intermediate Language

This section describes the initial intermediate language design, which was later abandoned. The description is given here because its development led to some key insights about the SequenceL language constructs. All of the examples of IC given in this section are based on this initial intermediate language. The following section will introduce and describe the current SequenceL intermediate language.

Initially, the intermediate language for the SequenceL compiler was specified with support for conditional jumps and assignment statements. Generating object code that contains iterative structures without the benefit of conditional jump statements and assignment statements in the intermediate language was initially perceived as an unreasonable condition to place on object code generation. This turned out to be untrue. This initial design led to certain insights about the SequenceL language that will be explained in this section, as well as to the development of the current intermediate language.

The reason iterative structures were placed in the initial intermediate language specification is that they are a very important language feature for procedural languages, and particularly important for parallel procedural languages. Parallel numerical methods are of particular interest to many parallel program developers, and many numerical methods employ iterative structures. Matrix multiply is one example.


for (i=0; i<=m.rows; i++)
    for (j=0; j<=m.columns; j++) {
        s = 0;
        for (k=0; k<=m.length-1; k++)
            { s += m[j][k]*m[k][i]; }
        mr[i][j] = s;
    }

Many parallel languages, such as OpenMP, identify iterative constructs in procedural code and mark them for parallelization. Parallelizing iterative structures that have no dependencies is what languages like OpenMP exploit when parallelizing procedural code. An iterative structure with no dependencies means that one iteration does not depend on the result of any other iteration. If these conditions exist, then the iteration can be partitioned and distributed to multiple tasks and executed in parallel. The SequenceL compiler implementation must do the same, but in this case it must translate the SequenceL code, which has no iterative constructs, to a parallel procedural programming model based on threads that does have iterative constructs.

If there are no iterative constructs in SequenceL, how are iterative structures generated in this initial intermediate language?

The taking expression in SequenceL is the source of some of the iterative constructs in the object code. The "taking" expression is SequenceL's only assignment statement. It assigns values to identifiers. Therefore in the following expression i is assigned the value [1].

taking i from [1]


The taking expression has capabilities beyond assigning one value to i; it can assign many values to i. For example, the following expression assigns the values [1] through [5] to i.

taking i from [1,2,3,4,5]

This is not unlike assigning i values in a C expression using a "for" statement.

for(i=1; i<=5; i++)

There are a number of semantic differences between the two statements. The first difference is that the "taking" expression does not assign integer values to i; it assigns sequences to i. Therefore the i identifier in the taking expression is actually assigned the singleton values:

[1],[2],[3],[4],[5]

This difference is significant; for example, the following taking expression assigns i the values [1,2], [3,4] and [5,6].

taking i from [[1,2],[3,4],[5,6]]

The second difference is that SequenceL treats the assignment of values to i as a distribution of i values among copies of the function the taking expression modifies. The C expression assigns i a value after each iteration of the "for" loop. Therefore, the taking expression is not just a method of implementing an iterative construct in SequenceL; it is also a distribution point. The copies of the functions that reference i are a result of this distribution and they execute concurrently. This makes the taking expression a source of implied parallelisms. With these facts in mind it seemed obvious to generate the iterative constructs in the intermediate language using the taking expression as a starting point. Therefore the following SequenceL code:

*([s_1(i),s_2(i)]) taking i from [1,2,3,4,5]

translates to the following initial intermediate code (IC):

        :=      i       1
loop1:  <=      i       5       t5
        jfalse  t5              exit
        *       s_1[i]  s_2[i]  t0[i]
        +       i       1       i
        jump                    loop1

This results in the following C object code.

for(i=1; i<=5; i++) _t0(i) = mult(s_1(i), s_2(i));

The function mult handles sequence multiplications. Parallelization of this "for" loop

would involve partitioning the iterations into threads of execution. Although this

approach appeared promising, closer analysis revealed problems.

SequenceL expressions are typically nested expressions; a one-pass compiler parsing expressions from left to right evaluates these expressions and generates intermediate code in that order. For example, the following SequenceL expression has two computational expressions, highlighted in bold.

[+([s_1(i),s_2(i)]), +([s_3(i),s_4(i)])] taking i from [1,2,3,4,5]

The compiler will generate the initial intermediate code for the first expression followed

by the second.

        :=      i       1
loop1:  <=      i       5       t5
        jfalse  t5              exit
        +       s_1[i]  s_2[i]  t0[i]
        +       s_3[i]  s_4[i]  t1[i]
        +       i       1       i
        jump                    loop1

The C object code might be something like the following:

for(i=1; i<=5; i++) {
    _t0(i) = add(s_1(i), s_2(i));
    _t1(i) = add(s_3(i), s_4(i));
}

This constract can be parallelized by partitioning the loop. The next example

extends the nesting a little further and places a function reference and its input argument

between the two expressions.

[+([s_1(i),s_2(i)]), function2, s_1, +([s_3(i),s_4(i)])] taking i from [1,2,3,4,5]

The IC code for this expression would be:

       :=      i          1
loop1: <=      i          5       t5
       jfalse  t5         exit
       +       s_1[i]     s_2[i]  t0[i]
       call    function2  s_1     t1
       +       s_3[i]     s_4[i]  t2[i]
       +       i          1       i
       jump    loop1

The object C code would be:

for(i=1; i<=5; i++){
    _t0(i)=add(s_1(i), s_2(i));
    _t1=function2(s_1);
    _t2(i)=add(s_3(i), s_4(i));
}


This is an inefficient situation since function2 is called multiple times when only one call

is required. If code relocation were an option then a more acceptable intermediate code

might be as follows:

       :=      i          1
       call    function2  s_1     t1
loop1: <=      i          5       t5
       jfalse  t5         exit
       +       s_1[i]     s_2[i]  t0[i]
       +       s_3[i]     s_4[i]  t2[i]
       +       i          1       i
       jump    loop1

The resultant object C code would be:

_t1=function2(s_1);
for(i=1; i<=5; i++){
    _t0(i)=add(s_1(i), s_2(i));
    _t2(i)=add(s_3(i), s_4(i));
}

An early decision was made to not develop code relocation capabilities for this version of

the compiler. Code relocation involves complex analysis of intermediate code [Aho] and

was deemed beyond the scope of this research. Therefore a better intermediate language

design was needed.

4.2.2 SequenceL Intermediate Language

Certain key insights resulted from the first version of a SequenceL intermediate

language representation. On closer examination of the intermediate code it became

evident that the individual expressions were independent of each other until a result was

needed. This characteristic of SequenceL is also typical of functional programming

languages. It has long been known that pure functional programming languages can


execute in parallel [Ham]. This is because there are no implicit control dependencies

between expressions in a functional language. Functional programming is a style of

programming that emphasizes the evaluation of functional expressions, rather than

execution of commands. Expressions are formed in functional languages by using

functions to combine basic values. A functional programming language only has

identifiers bound to values: no variables, no assignment statements, and no iterative constructs [Fin]. This description applies to SequenceL. The semantics of functional

languages is called reduction semantics. This simply means that expressions in functional

languages are created from functions that are evaluated or reduced in place without side

effects. This is similar to the consume-simplify-produce execution strategy SequenceL

employs, the difference being that the SequenceL simplification process typically expands an

expression as it simplifies it until the expression can be evaluated. When a SequenceL

expression is being evaluated, there are no implied control dependencies that affect other

expressions. This revelation led to a re-examination of the possibilities for an

intermediate language design that did not include conditional jumps and assignment

statements. If SequenceL expressions could be treated in an independent manner, then the

parallelisms associated with these expressions could also be treated in an independent

manner. The intermediate language design grew from this premise.

Given this premise, what are the design requirements for an intermediate language

that must bridge a gap between SequenceL and C object code? The typical approach for

developing an intermediate language is to specify it with a target machine language in

mind [Aho]. The approach taken for the SequenceL intermediate code was to develop it


from the SequenceL perspective and move towards the object code. First and foremost,

the SequenceL expressions and their meanings needed to be preserved in some form.

Second, there must be a mechanism to identify IC statements that potentially can be

parallelized. In addition, SequenceL has a conditional statement that needs to be

considered.

s_1 when >([s_1,[1]]) else [ ]

Does this force a conditional jump operation back into the SequenceL intermediate

language? SequenceL also imposes certain conditions on the intermediate language with

respect to the way SequenceL passes resultant sequences between expressions. The

following sections present classes of intermediate language operations. These

classifications are based on an aspect of the operations' behavior. For example, operations that accept multiple operands are called multi-operand intermediate operations. For the general approach to mapping these classes of IC operations to SequenceL constructs, see section 4.1.1.

4.2.2.1 SequenceL IC Operations

The first requirement, preserving the meaning of the SequenceL expressions,

resulted in the creation of the first class of intermediate language operators. We can call

these operators the SequenceL IC operators. When we examine the earlier examples for

an intermediate language, we can see that the multiplication expression can be executed independently of any other expressions, other than the need for the index and input

variables.

       :=      i       1
loop1: <=      i       5       t5
       jfalse  t5      exit
       *       s_1[i]  s_2[i]  t0[i]
       +       i       1       i
       jump    loop1

This code corresponds to a SequenceL regular construct (see section 4.1.1.1). We can define a simple class of IC operators based on the SequenceL operations that exhibit this behavior. These operations include the arithmetic operators +, *, -, /; the relational operators <, >, <=, >=, <>; the logic operators "and", "or", and "not"; and the SequenceL functions abs, cos, sin, etc. These operators accept a single sequence operand and produce a

single result.

Languages like C allow multiple operands and produce a single result; for example, the following expression is a legal C statement.

t=a+b+c+d+e;

In SequenceL the expression to sum sequences a,b,c,d, and e would be:

+([a,b,c,d,e])

SequenceL expresses the sequences a,b,c,d and e within a sequence before addition is

applied. This means, from the perspective of the intermediate language, that there is an

additional operation that takes place with this type of SequenceL expression, which we

shall call the intermediate language's collective sequence operation. The SequenceL

intermediate language has been given an operator called the "collect sequences" operator.

Therefore, the intermediate code for this SequenceL expression would be:

_seq  a   b  c  d  e  t0
+     t0              t1


Note that the collect sequences statement in the first line has more than one

operand. Because machine code is implemented with only two operands, intermediate

languages typically have only two operands [Aho]. Since the compiler was not compiling

to machine code it was decided that more than two operands would be allowed. An

operand stack was chosen as the method for handling more than two operands. The

operand stack was implemented with no fixed upper limit on the number of operands it

could handle. This IC language design implies that the C object code must also be

designed to deal with any number of operands. There are significant implications

associated with the collect sequences operation with respect to the C object code's internal data representation and performance. These issues will be addressed in section

4.4.
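The runtime routine that implements the collect operation is not listed in this chapter, but the idea can be sketched in C. The fragment below is an assumption-laden illustration rather than the compiler's actual library code: it uses a simplified sequence of doubles and C's stdarg facility to accept any number of operands, concatenating them into one result sequence.

#include <stdarg.h>
#include <stdlib.h>

/* simplified sequence of doubles; the runtime's sequence structure is richer */
typedef struct { double *data; int len; } seq;

/* collect n operand sequences into a single result sequence */
seq collect(int n, ...){
    va_list ap;
    int total = 0, pos = 0;

    va_start(ap, n);                              /* first pass: total length */
    for(int k = 0; k < n; k++) total += va_arg(ap, seq*)->len;
    va_end(ap);

    seq out = { malloc(total * sizeof(double)), total };

    va_start(ap, n);                              /* second pass: copy operands */
    for(int k = 0; k < n; k++){
        seq *s = va_arg(ap, seq*);
        for(int j = 0; j < s->len; j++) out.data[pos++] = s->data[j];
    }
    va_end(ap);
    return out;
}

A call such as collect(3, &a, &b, &c) mirrors the IC design in which the operand stack places no fixed upper limit on the number of operands.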

4.2.2.2 Multi-Operand Operations

Five operations fall into this class of IC operations, which are called "multi-operand" operations: the collective sequence operation, the taking operation, the index operation (or index variable operation), the function call operation, and the function begin operation. The collective sequence operator has already been described. The function call operation and the function begin operation are related. The function call operation references a function and passes input arguments, while the function begin operation specifies a function and its formal parameter list. Functions can only accept sequences and return one sequence. A function can have any number of arguments, which are

treated as operands by IC.


The SequenceL taking expression can also have more than two operands. In the examples so far only one identifier has been used in a taking expression; more can be specified. For example, the following taking expression has three identifiers i, j, and k.

taking [i,j,k] from s_1

The SequenceL IC for this expression would be:

from s_1 i j k

The following is an index operation. This statement makes use of the taking expression

identifiers as index values.

s_1(i,j,k)

The SequenceL IC for this expression would be:

_M s_1 [i j k] t0

The variable index operation has a result field, which it uses to indicate where the result

of the index variable will be stored. Square brackets are shown only to improve

readability.

It is the combination of the taking expression and variable index expression that

makes the taking expression a distribution point for parallelisms. By specifying a variable

index operation as an operand in a computational expression, the index will produce all

the indexed variables concurrently for the simplified expression. For example,

+([s_1(i)]) taking i from [1,2,3]

simplifies to:

+([s_1(1)]) +([s_1(2)]) +([s_1(3)])

If s_1 = [[1,2],[3,4],[5,6]], the simplification process continues and generates:

+([1,2]) +([3,4]) +([5,6])

These three computations can be evaluated in parallel, producing the result:

[3,7,11]

The following SequenceL expression will be translated to IC statements.

*([s_1(i),s_2(i)]) taking i from [1,2,3,4,5]

This code fragment results in the following IC.

from  [1,2,3,4,5]  i
_M    s_1  i        t1
_M    s_2  i        t2
_seq  t1   t2       t3
*     t3            t4

Notice that there is no indication of possible iterative operations, and no indication of

parallelisms. At first glance it appears to be a set of operations on scalars. To identify the

implied parallelisms, the semantic analyzer will mark the identifier i as being an index variable by setting an attribute for i in the symbol table. Any operation that has an index variable as an operand will produce a result that also has an attribute set in the symbol table, identifying it as the result of an operation that uses an index variable. Therefore, in this example t1, t2, t3, and t4 have this attribute set. The result is that the C object code

generator treats the multiply operation as multiple operations and will therefore generate an iterative construct for this SequenceL expression. The trigger for iterative constructs in the object code in the first intermediate language design was the taking operation; in the current version it is the taking identifier. Another way of describing this is that parallelism has shifted from being control-flow driven to being data driven. The IC code from above translates to the following object code. t3 is a queue of sequences containing the indexed variables associated with s_1(i) and s_2(i). This queue is processed from its beginning, or head, until the tail is detected.

t1=select_sequences(s_1,i);
t2=select_sequences(s_2,i);
t3=collect_sequence(t1,t2);
t3->element=t3->head;
while(t3->element != NULL){
    mult(t3->element);
    t3->element=t3->element->next;
}

In the above object code the multiplication function mult is not executed as a thread; to execute the mult function as a thread requires it to be invoked by pthread_create.

t1=select_sequences(s_1,i);
t2=select_sequences(s_2,i);
t3=collect_sequence(t1,t2);
t3->element=t3->head;
while(t3->element != NULL){
    pthread_create(&thr_id[j], NULL, (void *)mult, (void *)t3->element);
    t3->element=t3->element->next;
}

The following expression:

[+([s_1(i),s_2(i)]), function2, s_1, +([s_3(i),s_4(i)])] taking i from [1,2,3,4,5]

now results in the following IC code:

from   [1,2,3,4,5]    i
_M     s_1  i         t1
_M     s_2  i         t2
_seq   t1   t2        t3
+      t3             t4
_call  function2 s_1  t5
_M     s_1  i         t6
_M     s_2  i         t7
_seq   t6   t7        t8
+      t8             t9
_seq   t4  t5  t9     t10

All the operands are in stacks. The stack elements are shown here in order to give the IC

statements a little more meaning for the reader. The C object code for this set of IC

statements would be:

t1=select_sequences(s_1,i);
t2=select_sequences(s_2,i);
t3=collect_sequence(t1,t2);
t3->element=t3->head;
while(t3->element != NULL){
    pthread_create(&thr_id[j], NULL, (void *)add, (void *)t3->element);
    t3->element=t3->element->next;
}
t4=t3;
t5=function2(s_1);
t6=select_sequences(s_1,i);
t7=select_sequences(s_2,i);
t8=collect_sequence(t6,t7);
t8->element=t8->head;
while(t8->element != NULL){
    pthread_create(&thr_id[j], NULL, (void *)add, (void *)t8->element);
    t8->element=t8->element->next;
}
t9=t8;
t10=collect_queues("qsq",t4,t5,t9);

There are now two loops, one for each addition, with the function call in between. This might not appear to be a big improvement, and certainly combining the two loops would be better, but it does reduce the number of calls to function2. Simple code movement where the first and second loops are combined would reduce the number of loop iterations by half.

t5=function2(s_1);
t1=select_sequences(s_1,i);
t2=select_sequences(s_2,i);
t3=collect_sequence(t1,t2);
t3->element=t3->head;
t6=select_sequences(s_1,i);
t7=select_sequences(s_2,i);
t8=collect_sequence(t6,t7);
t8->element=t8->head;
while(t3->element != NULL){
    pthread_create(&thr_id[j], NULL, (void *)add, (void *)t3->element);
    t3->element=t3->element->next;
    pthread_create(&thr_id[j], NULL, (void *)add, (void *)t8->element);
    t8->element=t8->element->next;
}
t4=t3;
t9=t8;
t10=collect_queues("qsq",t4,t5,t9);

Notice that the length of the queue t3 is also used to control the thread generation for t8. Queues that are generated from the same taking expression identifier, i in this case, will be the same length. The object code for the taking expression is not shown in this example; an example of it can be found in section 4.3.

4.2.2.3 Conditional Operation

The next class of SequenceL IC operations is a class of operations that require a

temporary to be used twice as a result. The conditional operation is this type of operation.


One of the features of the SequenceL conditional is that it cannot be nested. The

following is an example of a SequenceL conditional expression.

+([s_1,s_2]) when >([s_1,s_2]) else -([s_1,s_2]) when =([s_1,s_2]) else [ ]

It reads as follows:

if s_1 > s_2 execute s_1+s_2
else if s_1 == s_2 execute s_1-s_2
else []

SequenceL allows SequenceL expressions to produce one result for each expression, and conditional expressions are no exception. Once a condition is true, or the last else expression is reached, a single result is produced. For example, in the above SequenceL conditional expression, if s_1 > s_2 is true the result of the conditional expression is produced by s_1+s_2, but if s_1 > s_2 is false and s_1==s_2 is true, the result of the conditional expression is produced by s_1-s_2. The result of either of these expressions is assigned to the same temporary. The conditional operation is the only statement where a temporary is used multiple times in an assignment statement in the object code.
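In C object code terms this reuse looks roughly like the fragment below. It is only a schematic illustration of the single-temporary idea, using plain integers in place of sequences and ordinary C operators in place of the runtime routines.

#include <stdio.h>

int main(void){
    int s_1 = 5, s_2 = 5, t0;

    /* both the true and the false expressions assign the same temporary t0 */
    if(s_1 > s_2)
        t0 = s_1 + s_2;
    else if(s_1 == s_2)
        t0 = s_1 - s_2;
    else
        t0 = 0;                 /* stands in for the empty sequence [ ] */

    printf("t0 = %d\n", t0);
    return 0;
}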

4.2.2.4 Generative Operation

The final IC operation is the generative operation. The simple generative

operation requires two operands and generates one result. The following SequenceL

expression:

gen([n,...,m])

results in the following generative IC statement:

gen  n  m  t0

The generative statement is a very good example of an expression that can be evaluated

without producing side effects. It uses two bound identifiers, n and m, and produces a

single result.
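Although the runtime implementation of gen is not listed here, a minimal C sketch of such a routine might look as follows; the seq type and the assumption that n <= m are introduced only for this illustration.

#include <stdlib.h>

/* simplified sequence of integers */
typedef struct { int *data; int len; } seq;

/* gen(n, m): build the sequence [n, n+1, ..., m]; assumes n <= m */
seq gen(int n, int m){
    seq out = { malloc((m - n + 1) * sizeof(int)), m - n + 1 };
    for(int k = 0; k < out.len; k++)
        out.data[k] = n + k;                 /* no side effects on n or m */
    return out;
}

For example, gen(1, 5) produces the sequence [1,2,3,4,5], leaving its two bound operands untouched.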

The SequenceL research leading up to this research had clearly shown that

SequenceL has inherent implied parallelisms [Coo00]. In this research, the development of the SequenceL intermediate language supports the contention that SequenceL expressions inherently support their own evaluation in parallel. Even though SequenceL

is not purely functional, it does exhibit certain characteristics of a functional

programming language, the most important being that SequenceL expressions can be

evaluated independently. The thread model covered in the next section will reinforce this

result.

4.3 SequenceL Thread Model

The SequenceL compiler's parallel execution model is based on a thread model,

specifically Pthreads. The reasons for choosing a thread model for the SequenceL

compiler were given in Chapter III. In addition to these reasons, the intermediate

language research has revealed that the SequenceL language is a natural fit for the thread

programming model. This is because of the independence of a SequenceL statement.

There are no implied control structures defined in the SequenceL language that affect functions or expressions being evaluated. Therefore, for the following SequenceL expression:

0([S])

0 is a SequenceL operation or function that has no implied control structures that affect its evaluation, and S is a sequence input argument. Under these conditions 0 can be implemented as a function 0' in object code with S as an input argument. If 0([S]) exhibits implied parallelisms, then the object code thread function 0' can be created.

Thread functions can be invoked by the POSIX threads pthread_create library routine,

which creates threads as a result. This process of generating object code thread functions

from SequenceL expressions is the SequenceL thread model.

For example the following SequenceL code has two multiply operations or

expressions that can be evaluated independently of each other. Only when the results are

needed by the addition operation is there any relationship. Once the results are available

to the addition operation it too can be evaluated in place without consideration of any

other SequenceL expressions.

+([*([s_1,s_2]),*([s_3,s_4])])

Since each of the multiply operations is independent and can be evaluated without affecting any other code, they can be treated as independent functions. This point becomes more significant when implied parallelisms are involved. For example, if the multiply operations involve the taking and index operations, multiple multiply and addition operations will occur. This can be illustrated using the following example.

+([*([s_1(i),s_2(i)]),*([s_3(i),s_4(i)])]) taking i from [1,2,3,4,5]

from   t0        i
_M     s_1  i    t1
_M     s_2  i    t2
_seq   t1   t2   t3
*      t3        t4
_M     s_1  i    t6
_M     s_2  i    t7
_seq   t6   t7   t8
*      t8        t9
_seq   t4   t9   t10
+      t10       t11

The SequenceL compiler's thread model will set up the two multiply functions so that

they are invoked for each sequence produced by the index operation. In this example

there would be 10 multiply operations, 5 for each; this would be followed by 5 additions. Each set of arithmetic operations can be executed in C object code using an iterative construct, with changing values of i controlling the iterations. For each loop iteration a thread is

generated. The compiler will place blocks for the multiply threads just before the iterative

addition where the results from the multiply operations are needed.

t0=sequence_atos("[1,2,3,4,5]");

/* taking expression */
i=(taking_data*)malloc(sizeof(taking_data));
i->from=t0;
i->var=1;
i->num_var=1;
take_thri=(pthread_t*)malloc(sizeof(pthread_t));
pthread_create(take_thri, NULL, (void *)taking, (void*)i);
pthread_join(*take_thri, NULL);

/* first multiply operation */
t1=select_sequences(s_1,i->result);
t2=select_sequences(s_2,i->result);
t3=collect_sequence(t1,t2);
t3->element=t3->head;
thread=0;
while(t3->element != NULL){
    pthread_create(&thr_id_t4[thread++], NULL, (void *)mult, (void *)t3->element);
    t3->element=t3->element->next;
}
t4=t3;

/* second nested multiply operation */
t6=select_sequences(s_1,i->result);
t7=select_sequences(s_2,i->result);
t8=collect_sequence(t6,t7);
t8->element=t8->head;
thread=0;
while(t8->element != NULL){
    pthread_create(&thr_id_t8[thread++], NULL, (void *)mult, (void *)t8->element);
    t8->element=t8->element->next;
}
t9=t8;

/* result t4 is needed for the addition, therefore a join is required */
thread=0;
t4->element=t4->head;
while(t4->element){
    pthread_join(thr_id_t4[thread++], NULL);
    t4->element=t4->element->next;
}

/* result t9 is needed for the addition, therefore a join is required */
thread=0;
t9->element=t9->head;
while(t9->element){
    pthread_join(thr_id_t9[thread++], NULL);
    t9->element=t9->element->next;
}

/* addition operation */
t10=collect_queues("qq",t4,t9);
t10->element=t10->head;
thread=0;
while(t10->element != NULL){
    pthread_create(&thr_id_t11[thread++], NULL, (void *)add, (void *)t10->element);
    t10->element=t10->element->next;
}

The implementation of the SequenceL IC operators as thread functions was fairly straightforward, since they required only one input argument. The pthread_create library routine then uses these thread functions to create the threads. Thread functions that are set up for pthread_create can have only one input argument and no return value. At first this might seem to be a problem, but recall that SequenceL uses a consume-simplify-produce execution strategy; this means that a SequenceL expression consumes its input arguments and produces a result in its place. Therefore, functions that are to be invoked by pthread_create can also return their result in the input argument. With this design approach the SequenceL IC operations were implemented in C object code as runtime library functions. The thread joins (pthread_join) are placed in the object code using an as-needed strategy. Therefore a join will not be set up until a result associated with a thread is needed.
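A minimal C sketch of this convention follows. It is an illustration under assumed names, not the compiler's actual runtime library: the workitem structure stands in for the runtime's sequence record, and the thread routine writes its result back into the argument it was given.

#include <pthread.h>

/* simplified operand record: two inputs and space for the result */
typedef struct { double a, b, result; } workitem;

/* thread function: one void* argument, result returned in place */
void *mult(void *arg){
    workitem *w = (workitem *)arg;
    w->result = w->a * w->b;        /* consume the inputs, produce the result */
    return NULL;
}

/* create one such thread and join only when the result is needed */
void run_one(workitem *w){
    pthread_t id;
    pthread_create(&id, NULL, mult, (void *)w);
    pthread_join(id, NULL);
    /* w->result now holds the product */
}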

4.3.1 Dynamic Thread Function Creation

In the first version of the SequenceL compiler every expression that was

associated with an implied parallelism was executed in object code as a thread. This

model works well for an ideal parallel system where memory access is uniform; however, it does not work as well in the real world. Not every function in SequenceL should have its own thread function. For example, given the following expression,

+([*([s_1(i),s_2(i)])])

The first compiler generated the following C object code fragment.

1.  t0=sequence_atos("[1,2,3,4,5]");
2.  /* taking expression */
3.  i=(taking_data*)malloc(sizeof(taking_data));
4.  i->from=t0;
5.  i->var=1;
6.  i->num_var=1;
7.  take_thri=(pthread_t*)malloc(sizeof(pthread_t));
8.  pthread_create(take_thri, NULL, (void *)taking, (void*)i);
9.  pthread_join(*take_thri, NULL);
10. /* first multiply operation */
11. t1=select_sequences(s_1,i->result);
12. t2=select_sequences(s_2,i->result);
13. t3=collect_sequence(t1,t2);
14. t3->element=t3->head;
15. thread=0;
16. while(t3->element != NULL){
17.     pthread_create(&thr_id_t4[thread++], NULL, (void *)mult, (void *)t3->element);
18.     t3->element=t3->element->next;
19. }
20. t4=t3;
21. /* result t4 is needed for the addition, therefore a join is required */
22. thread=0;
23. t4->element=t4->head;
24. while(t4->element){
25.     pthread_join(thr_id_t4[thread++], NULL);
26.     t4->element=t4->element->next;
27. }
28. /* addition operation */
29. t5->element=t4->head;
30. thread=0;
31. while(t5->element != NULL){
32.     pthread_create(&thr_id_t5[thread++], NULL, (void *)add, (void *)t5->element);
33.     t5->element=t5->element->next;
    }

Figure 4.7 Object Code Flow Chart

(Flow chart of the listing above: values for i are generated using a thread, lines 2-9; input arguments for the thread function are set up, lines 11-14; a first while loop creates a mult thread for each queue element, lines 16-18; a second while loop joins the mult threads, lines 24-26; a third while loop creates an add thread for each queue element, lines 31-33.)

Element is a pointer into a queue containing sequences. The while loop processes each queue element and passes a sequence to pthread_create, which passes the sequence to the function mult, which executes as a thread. If the i values are [1], [2], [3], [4], and [5], the queue t3 contains the following sequences.

[s_1(1),s_2(1)] [s_1(2),s_2(2)] [s_1(3),s_2(3)] [s_1(4),s_2(4)] [s_1(5),s_2(5)]

When the thread model was chosen for this compiler, these were the types of parallelisms that seemed to fit so well with the model. The arithmetic functions mult and add are runtime library functions that accept one input argument and return a result in that argument.

Research into cache locality and its effects on parallel system performance led to

a realization that this simple thread model was not going to be adequate. Philbin et al. discovered in their research that thread scheduling associated with cache locality affects program performance by as much as 50% [Phi]. A study to confirm this was done on the Origin2000. Two series of tests were run. Both tests executed the following nested computation.

+([*([s_1(i),s_2(i)])])

The first test used the same C object code loop structure as the multiply/add example just presented in this section. The code generates all of the multiply threads and then waits for the threads to complete before it generates the threads for the addition operation.

This execution strategy makes no effort to place threads that need results from other

threads on the same processor. Therefore a thread associated with an addition operation,


that needs a result from one of the threads that generates a multiply result, must access

that result from main memory. This thread relationship is called Parent-Parent [Phi]. The

second test generates the threads for the addition operation first. Within each addition

thread a thread for the multiply operation is generated. This creates a Parent-Child

relationship between threads where the multiply thread is a child of the addition thread.

Upon creating the child thread, the parent blocks until the child completes its multiply

operation. The parent then unblocks and uses the result from the child to complete its

addition operation. Philbin et al. showed that this parent-child execution strategy, coupled

with a FIFO based thread scheduler in the operating system, greatly increases the chances

that the result from the multiply will be in cache memory where the parent thread can

access it.
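A bare-bones C sketch of the parent-child arrangement used in the second test is shown below. It is an illustration only: the triple structure and the x*y + z computation are assumptions standing in for the runtime's sequence operations.

#include <pthread.h>

typedef struct { double x, y, z, result; } triple;   /* compute x*y + z */

static void *mult_child(void *arg){
    triple *t = (triple *)arg;
    t->result = t->x * t->y;          /* child leaves its product in place */
    return NULL;
}

static void *add_parent(void *arg){
    triple *t = (triple *)arg;
    pthread_t child;
    pthread_create(&child, NULL, mult_child, arg);
    pthread_join(child, NULL);        /* parent blocks until the child finishes */
    t->result = t->result + t->z;     /* the product is likely still in cache */
    return NULL;
}

With a FIFO-based thread scheduler, the child tends to run soon after its blocked parent, which is what increases the chance that the intermediate result is still in cache when the parent resumes.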

Table 4.1 Thread Execution Times

    Parent-Parent (secs)    Parent-Child (secs)
    33.36                   24.08
    34.12                   23.60
    33.70                   24.22
    33.76                   24.05
    33.32                   23.70
    33.85                   23.87
    33.52                   23.51
    32.41                   23.71
    33.82                   23.23
    32.17                   21.84

    Average 33.403          Average 23.581

Table 4.1 lists the results from the Origin2000 tests. These results clearly show the performance improvements Philbin et al. reported. The table lists execution times for the same amount of work for parent-parent and parent-child thread execution. These results

led to a change in the design of the thread model. To set up a parent-child relationship between nested operations would require code restructuring, which was beyond the scope of this research. Therefore, it was decided that the nested expressions would be placed in the

same thread function. This change required the compiler to create a thread function for

any nested expressions found in a SequenceL function. This process is called dynamic

thread function creation. The dynamic thread function creation process for SequenceL

code was simplified by the knowledge gained from the IC development, which indicate

that SequenceL expressions are inherently independent. To implement the dynamic

thread creation a dependency analysis between results and operands is needed to

determine whether a set of computations in the IC table are nested and can be placed in

the same thread function. Examining temporaries in the IC table to see if a given

statement generates an operand for the next statement is how the dependency analysis

works. For example:

+(*(s_1))

generates the IC statements:

*  s_1  t0
+  t0   t1

These two IC statements have a dependency, the temporary t0. The ability to dynamically generate thread functions for combinations of SequenceL expressions is a new result for the thread model. The following code illustrates the code generated using dynamic thread function creation:

+([*([s_1(i),s_2(i)])]) taking i from [1,2,3,4,5]

1.  t0=sequence_atos("[1,2,3,4,5]");
2.  /* taking expression */
3.  i=(taking_data*)malloc(sizeof(taking_data));
4.  i->from=t0;
5.  i->var=1;
6.  i->num_var=1;
7.  take_thri=(pthread_t*)malloc(sizeof(pthread_t));
8.  pthread_create(take_thri, NULL, (void *)taking, (void*)i);
9.  pthread_join(*take_thri, NULL);
10. /* multiply/add operation */
11. t1=select_sequences(s_1,i->result);
12. t2=select_sequences(s_2,i->result);
13. t3=collect_sequence(t1,t2);
14. t3->element=t3->head;
15. thread=0;
16. while(t3->element != NULL){
17.     pthread_create(&thr_id_t5[thread++], NULL, (void *)_t5agg, (void *)t3->element);
18.     t3->element=t3->element->next;
19. }
20. t5=t3;
21. /* join required before the result of the multiply/add can be used */
22. thread=0;
23. t5->element=t5->head;
24. while(t5->element){
25.     pthread_join(thr_id_t5[thread++], NULL);
26.     t5->element=t5->element->next;
27. }
28. /* the following is the thread function _t5agg */
29. void * _t5agg(sequence *input){
30.     mult(input);
31.     add(input);
32. }

The compiler dynamically creates _t5agg so that the multiply/add operations can take

place in the same thread. The thread function accepts one sequence as an input argument

and retums a result in the input argument. The data flow is illustrated in Figure 4.8. The

first while loop generates all the threads for _t5agg and the second loop waits uses a join

to detect when all the threads have exited. Once all the threads have joined the results are

available. The difference between Figure 4.8 and Figure 4.7 is that Figure 4.7 has two

thread loops one for multiply and one for add. Figure 4.8 has one thread loop for the

thread function _t5agg that contains the multiply/add. The code continues with whatever

expression (not shown) uses the result of the multiply/add operation.
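The dependency analysis that drives this aggregation can be sketched in a few lines of C. The ic_stmt layout below is an assumption made for the illustration; the actual compiler's IC table and symbol table carry more information.

#include <string.h>

/* simplified IC record: an operator, up to two operands, and a result */
typedef struct { char op[8]; char arg1[16]; char arg2[16]; char result[16]; } ic_stmt;

/* two adjacent IC statements can share one thread function when the first
   statement's result is consumed as an operand of the second statement */
int can_aggregate(const ic_stmt *first, const ic_stmt *second){
    return strcmp(first->result, second->arg1) == 0 ||
           strcmp(first->result, second->arg2) == 0;
}

For the example above, the statement producing t0 and the statement consuming t0 satisfy this check, so mult and add are emitted into the single thread function _t5agg.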

4.3.2 Dynamic Thread Functions for Conditional Expressions

Parallelizing conditional expressions also involves dynamic thread function

creation but with some added complexities. The following production is the SequenceL conditional production.

B => T* | T* when R else B

R is the relational expression. T* is one or more terms. The following expression is an

example of a conditional expression.

[s_2] when >([s_1,[1]]) else []

In the expression, when s_1 is greater than [1] the relational is true and s_2 is produced; when the relational is false, the null or empty sequence is produced. This particular conditional example has no indexed variables. Therefore it has no implied parallelisms.


Figure 4.8 Object Code Flow Chart with Cache Locality

(Flow chart of the dynamic thread function listing above: values for i are generated using a thread, lines 2-9; input arguments for the thread function are set up, lines 11-14; one while loop creates a _t5agg thread for each queue element, lines 16-18; a second while loop joins those threads, lines 24-26.)

The next example illustrates implied parallelisms.

*([s_1(i),[1.03]]) when =([s_2(i),[3]]) else *([s_1(i),[1.01]])

In this example, when s_2(i) equals [3], s_1(i)*1.03 is produced; otherwise s_1(i)*1.01 is produced [Coo98]. Assuming i has values from [1] to [n], simplification produces:

*([s_1(1),[1.03]]) when =([s_2(1),[3]]) else *([s_1(1),[1.01]])
*([s_1(2),[1.03]]) when =([s_2(2),[3]]) else *([s_1(2),[1.01]])
*([s_1(3),[1.03]]) when =([s_2(3),[3]]) else *([s_1(3),[1.01]])
...
*([s_1(n),[1.03]]) when =([s_2(n),[3]]) else *([s_1(n),[1.01]])

In this example, some value s_1 is increased by 3% if some corresponding value s_2 equals 3; otherwise the initial s_1 value is increased by 1%. This conditional expression has indexed variables in the true, false, and relational parts of the conditional expression; therefore it exhibits implied parallelisms. The difficulty the thread model has to deal with is that the dynamically created thread function must contain the code to do a complete conditional statement such as:

*([s_1(1),[1.03]]) when =([s_2(1),[3]]) else *([s_1(1),[1.01]])

This statement contains the relational operation as well as the true and false expressions. The code to implement this expression must be placed in a thread function. Each time the index variables are to be tested by the conditional expression, these index variables are

passed to the thread function. For this example the conditional thread function would

contain the pseudo object code:

if(condition(EQUAL, [s_2,3]))
    result=mult([s_1,[1.03]]);
else
    result=mult([s_1,[1.01]]);

The variables s_1 and s_2 are sequences passed to the thread function by the calling routine that invokes the thread. Each time the thread function is invoked by the calling routine, a thread is created. For each thread created, the calling routine passes one index variable for s_1 and one index variable for s_2. Therefore the first pair of index variables passed to the first thread function would be s_1(1),s_2(1), the next pair would be s_1(2),s_2(2), and so on, until all n index variable pairs have been tested in the conditional thread function.
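One possible shape for such a conditional thread function is sketched below in C. The cond_item record and the use of plain doubles in place of sequences are assumptions made for the illustration.

#include <pthread.h>

/* one index-variable pair: s_1(i) and s_2(i), plus room for the result */
typedef struct { double s1_i, s2_i, result; } cond_item;

/* the thread function contains the whole conditional: the relational test,
   the true expression, and the false expression */
static void *cond_thread(void *arg){
    cond_item *w = (cond_item *)arg;
    if(w->s2_i == 3.0)
        w->result = w->s1_i * 1.03;   /* true expression  */
    else
        w->result = w->s1_i * 1.01;   /* false expression */
    return NULL;
}

The calling routine would create one thread with this function for each index-variable pair, exactly as it does for the arithmetic thread functions.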

Conditional thread functions are not required for every conditional expression that

exhibits implied parallelisms. For example, here is the same expression without the index

variable in the relational expression.

*([s_1(i),[1.03]]) when =([s_2,[3]]) else *([s_1(i),[1.01]])

In this example no conditional thread function is created. If the relational is true, then the following true expression is executed as a parallel expression.

*([s_1(i),[1.03]])

If the relational is false, then the following false expression is executed in parallel.

*([s_1(i),[1.01]])

The thread model must also handle conditional expressions that are linked

together through the false expressions. The examples shown involve only expressions of

the following structure.

T1 when R1 else T2

The following structure is also valid:

T1 when R1 else T2 when R2 else T3

There is no limit on the number of conditional expressions that can be linked together.

The thread model uses the following rule to initiate dynamic thread creation for conditionals: the first relational expression encountered that uses an indexed variable forces the rest of the expression into a conditional thread function. Therefore, given

T1 when R1 else T2 when R2 else T3

if T1 uses an indexed variable and R1 does not, then something like the following structure will appear in object code.

if(R1)
    while
        OP1

Here OP1 is a thread function that contains T1. Each time a thread is created using OP1, an indexed variable value is passed to OP1 for the T1 term. If T2, R2, and T3 use indexed variables, then R2 is the first relational expression that uses an index variable, and therefore it triggers the creation of a thread function. This thread function would contain a structure something like:

if(R2)
    T2
else
    T3

If COND1 is the name given to this thread function, then in the object code it is placed in

a thread creation loop.

if(R1)
    while
        OP1
else
    while
        COND1

Each time a thread is created using COND1, the index variables that R2, T2, and T3 need

are passed to the thread.


Even when mixing expressions in a conditional expression, where some expressions use index variables (implied parallelisms) and other expressions do not (non-parallel), the thread function creation process allows for the parallelisms to be exploited in

a straightforward manner. A complete example of conditional operations involving

parallelisms can be found in the Quicksort listings at the end of Appendix B.

4.4 Optimization and Scheduling Issues

This section discusses a number of scheduling and optimization issues that were

identified during compiler development.

4.4.1 Granularity

The primary design objective of the SequenceL compiler was to exploit all

SequenceL implied parallelisms. This included exploiting computational parallelisms at

the singleton level. If a sequence is described using a tree structure, singletons are the leaf nodes (see Figure 4.9). The following sequence has eight singletons, with four singletons in each of two sub-sequences.

[[1,2,3,4],[5,6,7,8]]

Figure 4.9 Tree Diagram of a Sequence

(The root [ ] has two sub-sequence nodes, whose leaves are the singletons [1] through [8].)

The following expression is the addition expression for this sequence.

+([[1,2,3,4],[5,6,7,8]])

+(1,5),+(2,6),+(3,7),+(4,8)

[6,8,10,12]

Each singleton addition is independent of the other additions; therefore, the four additions can be done in parallel. Four threads of execution can be created, with each thread executing one addition. The problem with executing one singleton computation in parallel is that multi-processor systems have overhead issues that need to be addressed. Thread creation time is one such overhead that must be accounted for. The advantage of doing the computations in parallel is speedup: ideally, if the time to execute a single addition takes x seconds, then doing n additions on an n-processor system will still take only x seconds. This is an idealized view of a multi-processor system. Parallel threads of execution typically have overhead associated with them. For example, if thread creation time is 4x seconds, then the total time for the n additions in parallel would be 4nx + x seconds, assuming one addition per thread. Two additions would take 9x seconds; eight

additions would take 33x seconds. The problem with this example is that due to the

thread creation times, the parallel execution times for two and eight additions are greater

than the serial execution times. This is not unusual. Tests were conducted on the Texas

Tech University Origin2000 and indicated that as many as 50,000 multiplications can be

executed in the time it takes to create a thread. The graph in Figure 4.10 shows the results

of this test.


Figure 4.10 Granularity Study

(Execution time versus number of computations, from 100,000 to roughly 1,200,000, for serial execution and for 1, 2, 4, and 8 threads.)

At around 150,000 computations, generating threads to share the computations begins to improve execution times over serial execution times. Returning to the example, if eight additions are divided between two threads, the total execution time would be 8x+4x, or 12x seconds. 12x is better than the 33x seconds for eight threads, but still worse than the 8x seconds it takes to do the eight computations in series. If the number of additions is increased to thirty-two, then the execution time for thirty-two threads with one

addition per thread would be 129x seconds; for two threads it would be 24x seconds and

for serial execution 32x seconds. There are now enough additions to justify two threads

of computation. Before creating a single thread a decision needs to be made by the

compiler about how many computations per thread are required to offset the cost of

thread creation overhead. This is an optimization issue that requires future research.
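The break-even arithmetic in this example can be captured in a few lines of C. The sketch below simply encodes the cost model used above (a thread creation cost of 4x per thread, additions within a thread running serially, threads running in parallel); it is an illustration of the kind of analysis a future compiler might perform, not part of the current compiler.

#include <stdio.h>

/* estimated cost of n additions split across t threads, in units of x */
static double parallel_cost(int n, int t){
    int per_thread = (n + t - 1) / t;    /* ceiling of n/t: additions per thread */
    return 4.0 * t + per_thread;          /* creation cost plus in-thread work   */
}

int main(void){
    int sizes[] = {2, 8, 32};
    for(int k = 0; k < 3; k++){
        int n = sizes[k];
        printf("n=%2d  serial=%2dx  one-per-thread=%5.0fx  two-threads=%4.0fx\n",
               n, n, parallel_cost(n, n), parallel_cost(n, 2));
    }
    return 0;
}

Run as written, this reproduces the figures used above: 9x, 33x, and 129x for one addition per thread, versus 9x, 12x, and 24x for two threads, against serial times of 2x, 8x, and 32x.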


This problem also extends upward from singletons to sequences. In an earlier

example the following parallel computation was presented.

+(s_1(1),s_2(1)) +(s_1(2),s_2(2)) +(s_1(3),s_2(3)) +(s_1(4),s_2(4)) +(s_1(5),s_2(5)) +(s_1(6),s_2(6))

It is true in any case that these can be done in parallel; the question is "Is it worthwhile to do them, or should they be done?" It is possible that executing these sequence additions in parallel is inefficient. Like the singleton example just shown, there may not be enough work in these additions to justify thread creation. It could be that a number of sequence additions need to be grouped together before creating threads for parallel execution is

justified.

For this SequenceL compiler, sequence operations are always executed in parallel.

At the singleton level, singletons are executed in series until a certain threshold level is

reached. The compiler user can set this level. When the threshold is reached the singleton

operations are divided up into groups for parallel threads of execution.
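The threshold rule can be illustrated with a small C sketch. The names, the chunk structure, and the use of plain arrays of doubles in place of sequences are assumptions made for the example; the actual runtime's grouping logic is not reproduced here.

#include <pthread.h>

typedef struct { double *a, *b, *out; int start, end; } chunk;

/* add one group of singletons serially */
static void *add_range(void *arg){
    chunk *c = (chunk *)arg;
    for(int k = c->start; k < c->end; k++)
        c->out[k] = c->a[k] + c->b[k];
    return NULL;
}

/* below the user-set threshold the singletons are added serially;
   above it they are split into nthreads groups, one thread per group */
void add_singletons(double *a, double *b, double *out, int n,
                    int threshold, int nthreads){
    if(n < threshold){
        chunk all = { a, b, out, 0, n };
        add_range(&all);                       /* not enough work for threads */
        return;
    }
    pthread_t id[nthreads];
    chunk     c[nthreads];
    for(int t = 0; t < nthreads; t++){
        c[t] = (chunk){ a, b, out, t * n / nthreads, (t + 1) * n / nthreads };
        pthread_create(&id[t], NULL, add_range, &c[t]);
    }
    for(int t = 0; t < nthreads; t++)
        pthread_join(id[t], NULL);
}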

Creating a thread for every sequence operation makes the parallel code generated

by this compiler fine-grained. A fine-grained parallel model is defined as having many

threads per processor. A coarse-grained model typically has one thread of execution per

processor. The advantages of a fine-grained model are that irregular parallelisms and load

balancing issues are easier to deal with [Nar]. The following graph was generated from

data collected from the Origin2000.


Figure 4.11 Repeatability of Execution Times

(Execution time for each process sample at a fixed problem size, plotted for 8, 16, 32, 64, 256, and 512 threads.)

The Texas Tech University Origin2000 has 56 processors. The problem size for

this study was fixed at 2.048xl0'^ multiplications. This work was divided between 8

threads and run 10 times followed by 16 threads up to 1024 threads. From the graph in

Figure 4.11 it is evident that as the number of threads generated approaches the number of processors, execution times begin to become unpredictable. For example, the 64-thread tests have wildly varying execution times.

Granularity is an optimization issue that needs further development in future

research. Future versions of the SequenceL compiler will have to do some level of

computational analysis based on a given target parallel system and its associated

overhead costs before determining the level of granularity that is appropriate for the

given target system.

4.4.2 Code Restructuring

Another area identified for performance enhancements is the issue of code restructuring. The following expression from section 4.3,

[+([s_1(i),s_2(i)]), function2, s_1, +([s_3(i),s_4(i)])] taking i from [1,2,3,4,5]

resulted in the following C object code.

for(i=1; i<=5; i++){
    _t0(i)=add(s_1(i), s_2(i));
}

_t1=function2(s_1);

for(i=1; i<=5; i++){
    _t2(i)=add(s_3(i), s_4(i));
}

There are two iterative thread creation loops associated with this code, one for each for

loop. It is obvious that the two for loops can be combined into one, but this would require

code movement. Any future optimization component of the compiler will have to be designed to deal with code restructuring. In this example, ideally, the following code

would be generated as a result of code movement.

_t1=function2(s_1);
for(i=1; i<=5; i++){
    _t0(i)=add(s_1(i), s_2(i));
    _t2(i)=add(s_3(i), s_4(i));
}


4.4.3 Data Distribution

Data distribution is another area for optimization. From section 1.1 the following matrix multiply example was presented.

[+([*([s_1(i,*),s_2(*,j)])])] taking [i,j] from [[1,1],[1,2],[2,1],[2,2]]

In this example the multiply operation consumes the sequences s_1 and s_2 and simplifies to:

[ [ +([*([s_1(1,*),s_2(*,1)])])
    +([*([s_1(1,*),s_2(*,2)])]) ]
  [ +([*([s_1(2,*),s_2(*,1)])])
    +([*([s_1(2,*),s_2(*,2)])]) ] ]

If s_1 = [[1,2],[3,4]] and s_2 = [[5,6],[7,8]], then the next simplification produces:

[ [ +([*([[1,2],[5,7]])])
    +([*([[1,2],[6,8]])]) ]
  [ +([*([[3,4],[5,7]])])
    +([*([[3,4],[6,8]])]) ] ]

Note the repetition of rows and columns in the data structure. Problem domains with large data sets using this kind of data distribution approach can quickly consume all of memory. In this example twice as much memory is consumed as is necessary. The question is: should the repetition be allowed to occur? The basic problem with conserving space is that it works against speed. Time and space trade-offs are an important area of

research on parallel systems [Ble]. For this matrix multiply algorithm, if space is

conserved by having only one copy of each row and column in memory, then all

computations that use a given row or column will have to access the same memory

location. As the problem size increases memory locations can become a point of

contention. This problem is described as a memory bottleneck problem in the literature

on studies of scaling problems on shared memory multiprocessor systems [Leu]. Any

future SequenceL compilers will have to address this issue.

4.4.4 IC Collect Operation

The collect sequence operation was described in section 4.2 as an object code

requirement. The collect sequence operation creates a single sequence before any

operation can take place on that sequence. Since thread functions can only accept one

sequence, these two conditions work together in making the thread model easier to

implement for the SequenceL compiler. For example the following expression produces a

single sequence in object code:

*([s_1])

The next expression also produces a single sequence in object code.

*([s_2])

When these two expressions are nested in another expression, such as

+([*([s_1]),*([s_2])]),

each result produced in object code is collected into a single sequence, which is then

passed to the addition operation. The problem with this "object code collect sequences

operation" is that it creates overhead. It takes time to collect the sequences together. The

problem with not collecting sequences together is that all of the thread functions that

carry out computations on sequences will have to be designed to handle an unknown number of arguments, which may be unknown until runtime. For the current compiler,

the collective operation is in use. For future compilers it should be reconsidered.

4.5 Data Representation

The strength of SequenceL is the use of sequences in conjunction with the language constructs to specify an algorithm. This is at the heart of the programming philosophy of describing "what" to do as opposed to "how" to do it. Therefore the design of this SequenceL compiler preserves the sequence data structure in the object code.

4.5.1 Circular Linked List Sequence Representation

The first attempt at defining a C object code representation of a sequence was a

circular linked list approach. The circular linked list approach was very attractive since it

was relatively easy to normalize sequences without requiring additional memory

allocations. For example the following sequence is not normalized.

[[1,2,3],[4,5]]

Normalized, this sequence becomes

[[1,2,3],[4,5,4]]

Note that the normalized sequence is larger than the non-normalized sequence. After

normalization a sequence will always be increased in size. The only time a sequence will

not increase in size due to normalization is if it is already normalized.


Figure 4.12 Linked List Sequence

(Each level of the circular linked list carries a cardinality value; level 3 holds the singleton nodes 1, 2, 3 and 4, 5.)

Figure 4.12 shows the circular linked list structure for this sequence example.

Normalizing the linked list is simply a matter of making sure each cardinality value on a

level is equal to the largest value on that level. Referencing a sequence becomes a matter

of traversing the linked lists reading each list according to the cardinality values.

Therefore, since 3 is the largest cardinality in level 3, then each list at that level requires

that 3 singletons be read. Therefore 1, 2, and 3 are read, followed by 4, 5, and 4 again. The problem with the linked list approach is that every time a sequence is referenced, a linked list traversal is required. This adds significant overhead to the generated programs. Additionally, a new linked list structure has to be created for every result produced.

4.5.2 Sequence Structure Representation

The second approach was driven by the objective of reducing the overhead

associated with the linked list approach and making computations on sequences

equivalent to processing arrays. Arrays and array processing are a key element in many numerical and non-numerical applications [Kum]. The matrix multiply example in Chapter

I is typical of the type of array processing that takes place in numerical methods.

The current runtime sequence data structure is as follows:

typedef union {
    int    *i;
    double *f;
    char  **s;
} numerics;

typedef struct {
    numerics data;
    char *string;
    int  *card;
    int  *nest;
    int  *start;
    int  *end;
    int  *empty;
    int   length;
    int   dimlength;
    char  type;       /* i for int, r for real, and s for string */
} sequence;

The data or singleton information is stored in an array the "numerics data" points to. This

array is allocated as a contiguous memory array. This array can be of type string, real or

integer. The char *string points to a string representation of the sequence, which is used

by normalization. Int card and nest are used to determine if a sequence needs to be

normalized. The int start and end arrays track sub-sequence positions within the string representation; this is used for extracting sub-sequences of a sequence. Int empty is an

array that indicates whether there is a null sequence present and its location in the

sequence. Int dimlength stores the size of the nest, card, start, end and empty arrays. Int

length stores the size of the data array. Char type indicates whether the sequence

singletons are real, integer or string in type. The advantage of this design is that

normalization can be quickly checked. Therefore, if a sequence is normalized then a

computation becomes a simple contiguous memory traversal of the data array.
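As a rough illustration of what that traversal looks like, the fragment below performs an elementwise addition over two already-normalized sequences using a pared-down view of the structure above (only the fields needed here); it is a sketch, not the compiler's runtime code.

/* pared-down view of the runtime structure: only the fields used here */
typedef struct { double *data; int length; } flatseq;

/* elementwise addition over two normalized sequences of equal length:
   a single pass over contiguous memory, no list traversal required */
void add_normalized(const flatseq *x, const flatseq *y, flatseq *out){
    for(int k = 0; k < x->length; k++)
        out->data[k] = x->data[k] + y->data[k];
}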

The problem with this representation is that it is space inefficient. Too much

information is being carried in order to fully describe the sequence. Another problem

with this representation is that normalization involves the dynamic allocation of

additional memory during runtime. Additionally, the lifetime of the normalized structure

lasts only the time it takes to execute the computation that triggered normalization. It is

possible that an input argument to a function could be referenced by numerous

SequenceL statements within a function resulting in repeated normalizations for the input

argument within the function. Therefore, this representation is only an improvement over the

linked list approach if sequences do not require normalization.

Both the current sequence representation and the circular linked list representation

share a common problem: sequences are allocated memory at runtime. This is due to the fact that typing information associated with cardinality and nesting is not available for input arguments until runtime. This causes an unacceptable runtime overhead. What the

compiler needs to improve this situation is dimension and typing information for input


arguments at compile time. The compiler could then do a data flow analysis to determine

the dimension and typing of all subsequent sequences. Additionally, given knowledge of

the sequences at compile time, the data representation for a sequence could be simplified.

Conditional expressions will pose a problem for this data flow analysis approach

since the control flow path is non-deterministic. Functions that have conditional

expressions can produce a result from either the true or false expressions within the conditional. The dimension and typing of any sequence produced by a conditional's true expression may be different from the dimension and typing of the false expression. The result is that functions with conditional expressions can produce sequences of different dimension and type depending on the outcome of a conditional expression at runtime.

Therefore, under these conditions it is difficult for data flow analysis to figure out the

dimension and type of the placeholder for a function. Data flow analysis has been left as

a future research issue.

There are no specific recommendations on the definition of the sequence

representation for future SequenceL compilers. There is a requirement that nesting and

cardinality information be provided at compile time for all input arguments.


CHAPTER V

CONCLUSIONS AND FUTURE RESEARCH

This chapter presents conclusions on the results of this research. Future research

opportunities are presented as well. The initial goal of this research was to develop the

first SequenceL compiler that exploits the inherent parallelisms in SequenceL. The result

of this development process has gone beyond that goal with the development of a general

approach to mapping SequenceL constructs to multi-threaded code. This result is important since it provides an approach to implementing the inherent parallelisms found in SequenceL for all future SequenceL compilers.

5.1 Conclusions

The result of this research is the first SequenceL compiler. This SequenceL

compiler can create executable programs that embody the inherent parallelisms and other

implied control structures in SequenceL. A number of algorithms have been tested using the compiler in order to exercise the full range of SequenceL constructs and implied

parallelisms. Three types of parallelisms are detected and exploited by the SequenceL

compiler: (a) Parallelisms involving singleton operations, (b) Parallelisms involving

indexed sequences, and (c) Control Flow Parallelisms.

The compiler development process has identified a number of insights into the

SequenceL language. A key finding is that SequenceL expressions inherently support their own evaluation in parallel. A formal definition of implied parallelisms for regular, irregular, and generative constructs has been developed. The key to uncovering this

insight was the process of developing the SequenceL intermediate language. The

intermediate language along with the symbol table definition provides a complete

representation of the SequenceL constracts and parallelisms. Although parallel object

code was generated based on a thread model, the intermediate code has no specific

support for the thread model. Therefore, future research projects should consider the

intermediate code as transferable to message-passing parallel programming models as

well as other shared memory parallel programming models.

The thread model was found to be flexible at meeting all the requirements for

implementing the parallel constructs necessary for executing the SequenceL parallelisms.

The function based model that Pthreads uses for implementing the threads allowed for the

development of a dynamic thread function capability. This capability allowed the

compiler to take advantage of the inherent independence found in SequenceL expressions

and functions. This result also supported the conclusion that SequenceL expressions

inherently support their own evaluation in parallel.

During development, the compiler was moved back and forth between two

different single processor systems and a multiprocessor system. The single processor

systems include a Linux Intel system and a Solaris Sparc system. Both the GNU C

compiler and the IRIX C compiler were used to compile the compiler components and the object code generated by the SequenceL compiler. The only issue encountered was the IRIX

compiler's strict adherence to the POSIX thread interface standard. After this was


addressed, no further changes were required in the compiler code due to the underlying

system and its compiler.

A number of optimization and performance enhancements have been identified.

Some, such as cache locality and granularity, have been partially addressed. Others, such as code restructuring and optimizing the collect sequence operation, have not. Before any

of these issues are pursued in future research the question of the SequenceL data

representation in object code needs to be addressed.

The way in which SequenceL data structures are represented in object code
should be redefined. Neither of the two approaches developed was satisfactory. Any data
representation developed will have to compete with procedural codes, which typically use
arrays. Although no specific data structure is being recommended, any data structure that
is developed will need to meet the following requirements. The new SequenceL data
representation must preserve the nesting capability of sequence structures, minimize

overhead, allow for normalization, and allow for compile time memory allocation. By

providing dimension and type information on input variables it might be possible to do a
data flow analysis on the SequenceL code to determine the memory allocation
requirement at compile time. One impediment to this methodology has already been
documented in this report. Conditional statements make it difficult to do data flow
analysis since they can return a sequence from either a true expression or a false
expression. These sequences may not be of the same type or dimension. This problem
creates a fork and two different paths for data flow analysis. This problem will have

to be solved in any future SequenceL research.


5.2 Future Research

A number of possible future research opportunities were raised in this document.

These will be summarized in this section.

5.2.1 Preprocessor

A preprocessor was never developed for this compiler. The SequenceL source

code that was used to test the compiler was specified as if a preprocessor generated it.

The preprocessor has two roles. The first role is to restructure some of the SequenceL

expressions before they go through semantic analysis. The function clause is a good

example. The grammar production that defines the function clause is:

F => V(V*) where next = {B} C

The C and {B} will be reversed by the preprocessor; that way semantic analysis will have

information in the symbol table on the identifiers defined in the C part of the function

clause. The C grammar production is:

C => [ ] | taking [V*] from T

It is the identifiers defined by V* that are required by semantic analysis before {B} is

processed.

The second role of the preprocessor is to allow programmers to use a slightly more
readable SequenceL programming style. For example, the following sequence,

[1,2,3,4,5]

is coded this way by the programmer. The preprocessor will read this and convert the
sequence to:


[[1],[2],[3],[4],[5]].

The first representation defines a singleton in a sequence of singletons without square
brackets; this potentially creates a new data type for SequenceL called a singleton. The
second representation defines a singleton as a sequence of one element.
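To illustrate this second role, the following is a minimal sketch of how such a rewrite could be performed in C. The function name, the fixed-size buffer, and the restriction to numeric elements are assumptions made for illustration; the actual preprocessor was never built.

#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* wrap every bare number in its own brackets, so [1,2,3,4,5]
   becomes [[1],[2],[3],[4],[5]] */
void wrap_singletons(const char *in, char *out)
{
    while (*in != '\0') {
        if (isdigit((unsigned char)*in) || *in == '-') {
            *out++ = '[';
            while (isdigit((unsigned char)*in) || *in == '-' || *in == '.')
                *out++ = *in++;          /* copy the number itself */
            *out++ = ']';
        } else {
            *out++ = *in++;              /* brackets and commas pass through */
        }
    }
    *out = '\0';
}

int main(void)
{
    char buf[128];
    wrap_singletons("[1,2,3,4,5]", buf);
    printf("%s\n", buf);                 /* prints [[1],[2],[3],[4],[5]] */
    return 0;
}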

5.2.2 Optimization

An optimization capability has always been deemed beyond the scope of this

research. It was anticipated that this research would uncover a number of issues that

would have to be addressed before the development of an optimization component could

be considered. For example, the problem with the object code's sequence representation

needs to be resolved before optimization can be considered. The current representation

causes a number of problems for the compiler, the most significant being runtime
overhead. A representation that provides compile time information on dimension and
type will allow, for example, optimization of memory allocation to take

place.

5.2.3 Parallel Models

The thread model was picked for this first compiler because it had the advantages

listed in Chapter III. These include light-weight tasks, portability, a standard interface,

low-level control over parallel execution and data sharing. Ultimately, the compiler

should be developed using the MPI model. The advantage of the MPI model is that it
can execute parallel codes on distributed memory systems, as well as shared memory


systems. The OpenMP/MLP parallel programming model is of particular interest [Taf]. The
advantage of pursuing the OpenMP/MLP parallel programming model is that it would
expose SequenceL to many research issues being pursued by NASA. This would

provide SequenceL with "real" hard problems that would test SequenceL's capabilities

and limits.

5.2.4 Granularity

The SequenceL compiler manages granularity using threshold levels set in the
runtime library thread functions. These thread functions include the arithmetic and
relational functions. Dealing with granularity can be a difficult problem. Both compile
time and runtime analysis for granularity have been researched. Runtime analysis can

create an overhead problem in itself. This leads to a tradeoff between doing a good job of

estimating granularity and the length of time it takes to do the estimate [Ham94].

Compile time analysis suffers from a lack of knowledge about runtime issues such as

contention for resources. One compile time approach is to set up certain strategies to

manage granularity, such as clustering. Clustering is the arranging of parallel tasks to

operate on related sub-collections of data [Loi]. Scheduling associated with granularity

will be an ongoing research effort for every SequenceL compiler developed.
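As an illustration of the threshold mechanism described above, the following sketch shows the general shape a runtime arithmetic routine could take. The routine names, the threshold value, and the argument types are assumptions, not the compiler's actual runtime code.

#include <pthread.h>
#include <stdlib.h>

#define GRAIN_THRESHOLD 64   /* assumed cut-off, not the compiler's actual value */

/* hypothetical element-wise worker used when the queue is large enough */
void *add_element(void *arg) { (void)arg; return NULL; }

void add_all(void **elements, int n)
{
    int i;

    if (n < GRAIN_THRESHOLD) {
        /* too little work to pay thread-creation overhead: run serially */
        for (i = 0; i < n; i++)
            add_element(elements[i]);
    } else {
        /* enough work to justify one thread per queue element */
        pthread_t *thr = malloc(sizeof(pthread_t) * n);
        for (i = 0; i < n; i++)
            pthread_create(&thr[i], NULL, add_element, elements[i]);
        for (i = 0; i < n; i++)
            pthread_join(thr[i], NULL);
        free(thr);
    }
}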

A number of algorithms have been tested on the compiler. This includes the three
algorithms described in this document: Matrix Multiply, Gaussian Elimination, and
Quicksort. More algorithms involving parallelisms need to be tested; these include


algorithms such as sparse matrix multiplication and search algorithms such as branch and

bound [Kum].

The success of this compiler has added to the development of the SequenceL

language. The continued success of the language will be built on this work. It is hoped
that in the near future the compiler can be released to the parallel programming

community. This release would provide the community with a new and effective

language for solving many difficult parallel programming problems.


REFERENCES

[Aho] Aho, A. V. and Ullman, J. D. Principles of Compiler Design, Reading, MA: Addison-Wesley Publishing Co., 1979.

[All] Allan, R.J., Heggarty, J., Goodman, M. and Ward, R.R., Parallel Application Software on High Performance Computers. Survey of Parallel Performance Tools and Debuggers, Daresbury, Warrington, England: Computational Science and Engineering Department, CLRC Daresbury Laboratory, June 16, 1999. Retrieved from http://www.cse.clrc.ac.uk/Activitv/HPCI.

[Ban] Banks, J., Carson, J. S. II and Ngo Sy, J. Getting Started with GPSS/H, Annandale, VA: Wolverine Software Corp., 1989.

[Ble94] Blelloch, G.E., Chatterjee, S., Sipelstein, J. and Zagha, M., VCODE Reference Manual 2.0, 1994. Retrieved from http://www-.cs.cmu.edu/afs/cs.cmu.edu/proiect/scandal/public/papers/vcode-ref.html.

[Ble95] Blelloch, G. E. NESL: A Nested Data-Parallel Language, Version 3.1, CMU-CS-95-170, Pittsburgh, PA: School of Computer Science, Carnegie Mellon University, Sept. 19, 1995.

[Ble96] Blelloch, G. E. and Greiner, J. A Provable Time and Space Efficient Implementation of NESL. Proceedings of International Conference on Functional Programming, Philadelphia, Pennsylvania, May 1996, pp. 213-225.

[Cha] Chandra, R., Dagum, L., Kohr, D., Maydan, D., McDonald, J., and Menon, R. Parallel Programming in OpenMP, San Diego, CA: Morgan Kaufmann Publishers, 2001.

[Cod] Codognet, P. and Diaz, D. wamcc: Compiling Prolog to C, 12th International Conference on Logic Programming, Tokyo, Japan: The MIT Press, 1995.

[Coo96] Cooke, D.E. An Introduction to SEQUENCEL: A Language to Experiment with Nonscalar Constructs, Software Practice and Experience, Vol. 26, No. 11, November 1996, pp. 1205-1246.

[Coo98] Cooke, D.E. "SequenceL Provides a Different Way to View Programming," Computer Languages, 1998, pp. 1-32.

[Coo00] Cooke, D.E. and Andersen, P. Automatic Parallel Control Structures in SequenceL, Software Practice and Experience, Volume 30, Issue 14, November 2000, pp. 1541-1570.


[Coo01] Cooke, D. E. and Andersen, P. Specification of a Parallelizing SequenceL Compiler, Proceedings of the Monterey Formal Methods Workshop, Monterey, CA, June 19, 2001, pp. 37-48.

[Coo02] Cooke, D.E., A Concise Introduction to Computer Language: Design, Experimentation and Paradigms, Pacific Grove, CA: Brooks/Cole Publishers, 2002.

[Dia] Diaz, D. and Codognet, P., GNU Prolog: Beyond compiling Prolog to C, Practical Aspects of Declarative Languages, Boston: 2000.

[DiM96] Di Martino, B. and Keßler, C. W., Program Comprehension Engines for Automatic Parallelization: A Comparative Study, Proc. of 1st Int. Workshop on Software Engineering for Parallel and Distributed Systems, Berlin, Germany: Chapman & Hall, March 25-26, 1996.

[DiM96b] Di Martino, B. and Iannello, G. PAP Recognizer: A Tool for Automatic Recognition of Parallelizable Patterns, 4th Workshop on Program Comprehension, Technische Universität Berlin, Berlin, Germany, March 30-31, 1996.

[Dow] Dowd, K. and Severance, C. High Performance Computing, Sebastopol, CA: O'Reilly & Associates Inc., 1998.

[Feo] John T. Feo, David C. Cann, and Rodney R. Oldehoeft. A Report on the Sisal Language Project. Journal of Parallel and Distributed Computing, 10(4), pp. 349-366, December 1990.

[Fin] Finkel, R., Advanced Programming Language Design, Reading MA: Addison-Wesley Pub Co., December 1995.

[Fri] Friesen, B., The Universality of BagL, Master's Thesis, University of Texas at El Paso, May 1995.

[Ham] Hammond, K and Michaelson, G, Research Directions in Parallel Functional Programming, London: Springer-Verlag, 1999.

[Ham94] Hammond, K., Mattson Jr., J.S. and Peyton Jones, S.L. Conference on Algorithms and Hardware for Parallel Processing, Linz, Austria: Springer-Verlag, September 6-8, 1994.

[IBM] IBM White Paper: Power2 Floating-Point Unit: Architecture and Implementation, retrieved from http://www-l.ibm.com/servers/eserver/pseries/hardware/whitepapers/power/fpu.html, 2002


[Jon] Jones Telecommunications, Software: History and Development, retrieved from http://www.digitalcenturv.com/encyclo/update/software.html June 2002.

[KeP] Keßler, C. W. Pattern-driven Automatic Parallelization. Scientific Programming 5, pp. 251-274, 1996.

[Ken] Kennell, R.L. and Eigenmann, R. Automatic Parallelization of C by Means of Language Transcription, Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing (LCPC-98), Chapel Hill, NC, August 1998.

[Kum] Kumar, V., Grama, A., Gupta, A., Karypis, G. Introduction to Parallel Computing, Redwood City, CA: Benjamin/Cummings Publishing Co., 1994.

[Lam] Lam/MPI Parallel Computing, retrieved from <http://www.mpi.nd.edu/lam/> 2000.

[Lau] Laudon, J. and Lenoski, D., "The SGI Origin: A ccNUMA Highly Scalable Server," Silicon Graphics, Inc. Mountain View, CA., retrieved from <http://www.sgi.com/>, 1999.

[Lew] Lewis, B. and Berg, D. J., Multithreaded Programming With Pthreads, Upper Saddle River, NJ: Prentice Hall PTR/Sun Microsystems Press, 1997.

[Loi] Loidl, H., Trinder, P.W., and Butz. Tuning Task Granularity and Data Locality of Data Parallel GpH Programs, HLPP'01 International Workshop on High-level Parallel Programming and Applications, Orleans, France, 26-27 March 2001.

[Lue] Luecke, G. R. and Lin W. Scalability and Performance of OpenMP and MPI on a 128-Processor SGI Origin 2000, Iowa State University, August 16 2000.

[Mac] Maclennan, B. Principles of Programming Languages, (third edition). New York : Oxford University Press, 1999.

[MPI] The Message Passing Interface (MPI) Standard, retrieved from http://www-unix.mcs.anl.gov/mpi/ May 2002.

[Muc] Muchnick, S.S. Advanced Compiler Design & Implementation, San Francisco, CA: Morgan Kaufmann, 1997.

[Nar] Narlikar, G. J. and Blelloch, G. E. Pthreads for Dynamic and Irregular Parallelism, Proceedings of SC98: High Performance Networking and Computing, Nov. 1998.


[Onl] Online Documents, retrieved from http://www.cs.colorado.edU/~eliuser/elionline4.3/syntax 1 .html. June 2002.

[Pag] Pagh, R. and Pagter, J. Optimal Time-Space Trade-Offs for Non-Comparison-Based Sorting, Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2002), New York: ACM Press, 2002, pp. 9-18.

[Pan] Pancake, C. M., "Those who live by the flop may die by the flop," Keynote Address, 41st International Cray User Group Conf., Minneapolis, MN, 24-28 May 1999.

[Phi] Philbin, J., Edler, J., Anshus, O.J., Douglas, C.C. and Li, K. Thread Scheduling for Cache Locality, Proceedings of the Seventh ACM Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA: ACM, 1996, pp. 60-73.

[Piz] Pizzi, J., A SequenceL Compiler, Master's Thesis, Texas Tech University, 2001.

[SGI] Discussion with Instructor at SGI Global Education, Developer Training for Origin2000, April 20, 2000.

[Taf] Taft, J. MLP Parallelisms and current results on NASA's 512 CPU Origin System, Japan HPC Forum Presentations, Sept. 20-27, 2000, retrieved June 13, 2000 from www.cc.uec.ac.jp/SGI3000.doc/docs/developer_news/hpc_forum/pdf/jim_taft.pdf

[Tor] Torrellas, J., Tucker, A., and Gupta, A., Evaluating the Performance of Cache-Affinity Scheduling in Shared-Memory Multiprocessors, Journal of Parallel and Distributed Computing, vol. 24, no. 2, Feb. 1995, pp. 139-151.

[Tri] Trinder, P.W., Hammond, K., Mattson, Jr. J.S., Partridge, A.S. and Peyton Jones, S.L. GUM: a portable implementation of Haskell, Proceedings of Programming Language Design and Implementation, Philadelphia, PA, May 1996.

[Whi] White, S. A Brief History of Computing, retrieved June 10 2002 from http://www.ox.compsoc.net/~swhite/historv/timeline-LANG.html.


APPENDIX A

SEQUENCEL GRAMMAR


SequenceL Grammar

A => integer | real | string
L => A,L | E,[L] | A
E => [ ] | [L] | s(integer)
V => id
O => + | - | / | * | abs | sqrt | cos | sin | tan | log | mod | reverse | transpose | rotateright | rotateleft | cartesianproduct
M => *,M | T,M | * | T
T => E | [T*] | V | O(T) | gen([T,...,T]) | T(M) | T | V(map(M)) | $
RO => < | > | = | >= | <= | <> | integer | real | var | operator
R => RO(T) | and(R*) | or(R*) | not(R*)
B => T* | T* when R else B
C => [ ] | taking [V*] from T
F => V(V*) where next = {B} C
U => V,U | E,U | V | E
P => {{F*}U}

P is a program. U is the initial tableau. F* in P is a list of callable functions. One to many (+) and zero to many (*) have implied delimiters when necessary, usually commas.


APPENDIX B

AN IMPLEMENTATION GUIDE TO A SEQUENCEL COMPILER


This document is a guide for anyone who wishes to implement a compiler for

SequenceL. The methodologies used to build this compiler are detailed in Chapter III of

the dissertation report and will not be repeated here. A brief overview of some of the

components will be detailed in this document. Issues associated with the generation of

parallel object code will also be detailed. Examples of the intermediate code and C object

code will be presented. The source code files referenced in this document are available on

the CDROM.

In addition to the goal of converting SequenceL source code to parallel C object

code, the compiler had the following objectives:

Extensibility - a framework methodology has been implemented for achieving

extensibility. A modular approach with clear interface definitions is the framework

methodology followed. This approach makes it relatively easy to add or replace modules.

Portability - The implementation of all the SequenceL compiler components in C

provides the compiler with the advantage of having one of the most widely used

programming languages in the world as the only requirement for installing and using the

compiler. To run SequenceL executables, POSIX threads must also be available.

Efficiency - The SequenceL scheduling mechanism and the C compiler's optimization

will provide the SequenceL compiler with its optimization phase.

The components of the SequenceL compiler provide for the usual compilation

functions: lexical analysis, syntax analysis, intermediate language generation, and code

generation. Also provided is a runtime library. The interfaces between compiler


components are generally implemented as function calls. The one exception is the

standard interface between the semantic analysis and code generation; in this instance the

interface is the intermediate language and symbol table.

An early decision was made to keep the compiler dependencies on multiple

languages and tools to a minimum. Experience with other compilers and their associated

installation difficulties was the motivating factor in restricting the compiler

implementation to one language. To force users to install multiple support tools and

packages before the compiler can be installed would probably discourage a number of

potential compiler users. Therefore, this compiler has been completely implemented with

C.

The following terms used in this document are defined here. Function files are

files that contain C source code derived from a SequenceL function and generated by the

SequenceL compiler. Include files are C source code files containing declarations for the

function file and are also generated by the compiler. A SequenceL file contains

SequenceL source code. An input file contains a text string of a SequenceL sequence.

The compiler is designed to create function and include files from a SequenceL

file. The function and include files can then be compiled and linked by a C compiler. The

SequenceL compiler is designed to compile each SequenceL function, appearing in a

SequenceL file, to its own function file. Developers will have the opportunity to analyze

and modify these function files before actually compiling the files to executable code. If

any modifications are made to the function files additional C libraries can be linked in if


required. It is possible that SequenceL developers might want to add additional code for

monitoring purposes in order to obtain intermediate results for a program.

The implementation details described in this document will follow the design

chronology of the compiler development.

B.1 Lexical and Syntax Analyzer Implementation

B.1.1 Lexical Analyzer

The Lexical analyzer reads text strings from a SequenceL source code file. As the

Lexical analyzer reads these strings it identifies tokens, in the strings, by matching

characters using an if-then-else structure that tests for valid tokens. The following code is
a fragment taken from the lexical analyzer code for this SequenceL compiler.

token_ptr = token;

/* remove white spaces */
if(whitespace(**ptr)){
    while(whitespace(**ptr))
        (*ptr)++;
    return WH;
}

/* check for comment line */
c_ptr = *ptr;
c_ptr++;
if(**ptr == '/' && *c_ptr == '*')
    return CM;

if(delimit(**ptr)) {   /* nested if-then-else structure starts here */

    /* check for two character relational operator */
    if(**ptr == '<' && (*c_ptr == '=' || *c_ptr == '>')) {
        *token_ptr = **ptr;
        (*ptr)++;
        token_ptr++;
        *token_ptr = **ptr;
        (*ptr)++;
        token_ptr++;
        *token_ptr = '\0';
        return DL;
    }

The lexical analyzer is designed to eliminate all white spaces and comment lines. Once

the lexical analyzer recognizes a valid token it enters it into the symbol table. The rest of

the lexical analyzer code can be found in seq.c on the CDROM.
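The fragment above relies on two small helper predicates, whitespace and delimit. The actual versions are in seq.c; the following are plausible sketches only, and the exact set of delimiter characters is an assumption.

#include <string.h>

int whitespace(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

int delimit(char c)
{
    /* characters that can begin a delimiter or operator token */
    return c != '\0' && strchr("[](){},<>=+-*/$", c) != NULL;
}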

B.1.2 Syntax Analyzer

The initial SequenceL grammar can be found in Appendix A of the dissertation

report. This grammar completely describes SequenceL. Having chosen to implement

syntax analysis using an LL(1) parser the grammar in Appendix A needs to be put in

LL(1) format. For example, there are common prefixes in the productions L, E,
M, B, and U, and indirectly in T, that must be eliminated. There is also left recursion in the T

production that must also be eliminated.

Additionally, there are implied commas in some productions; these must be made

explicit. For example the P production can include one or many F productions.

P => {{F*}U}

If more than one function appears in a P expression then comma separators are

required between the functions. For example, if F1 and F2 are SequenceL functions then

the expression based on the P production is as follows.

P => {{F1,F2}U}


In addition to the issues listed above the initial grammar did not allow for the

nesting of constants with expressions and functions. This was a mistake and was

discovered and corrected. There were some additional changes made to the grammar to

assist the semantic analyzer, which the preprocessor will manage. The preprocessor was

not built for this SequenceL compiler. The syntax analyzer component of the

preprocessor has been built and tested, but the rest of the preprocessor will be left for

future development. The following changes were made:

B => TO* | TO* when R else B

Is changed to

B => TO* | R then TO* else B

This creates a conditional expression corresponding to a procedural approach. The

original SequenceL production is read as "execute TO* when R is true else execute B";
the new production is read as "when R is true execute TO* else execute B".

F => V(V*) where next = {B} C

Is changed to

F => V(S*) where next(V*) = C {B}

The key change in the above production is reversing C and {B}. This was done because

the identifier information specified in C is used in B and therefore needs to be in the

symbol table before the semantic analysis of B. Also (V*) was added after "where next".

(V*) provides output formatting information for results produced by SequenceL

functions. The complete grammar with the above changes is listed below.


A => integer | real | string
L => A,L | TO,L | A | TO
E => [ ] | [L] | s(integer)
V => id
S => V | V(M)
O => + | - | / | * | abs | sqrt | cos | sin | tan | log | mod | reverse | transpose | rotateright | rotateleft | cartesianproduct
M => all,M | TO,M | all | TO
T => E | V | O(TO) | gen([TO,...,TO]) | $
TO => T | T(M) | T(map(M))
RO => < | > | = | >= | <= | <> | integer | real | var | operator
R => RO(TO) | and(R*) | or(R*) | not(R*) | true | false
B => TO* | R then TO* else B
C => [ ] | from TO taking [V*]
F => V(S*) where next(V*) = C {B}
U => V,U | E,U | V | E
P => {{F*}U}

P is a program, U is the initial tableau. One to many (+) and zero to many (*) have implied delimiters when necessary, usually commas.

With these grammar changes in place the process of placing the grammar into

LL(1) format can begin. Left recursion was already eliminated with the creation of TO,

but there are still a number of common prefixes that need to be eliminated. The grammar

after eliminating the common prefixes is listed below. Note that the wildcard operator has
been changed from * to "all"; this change eliminates the need to do overload processing

on the multiplication operator.

P => {{FF1}U}
F1 => F2F1 | ε
F2 => ,F
F => V(SS2) where next(V3) = C{B}
U => VU1 | EU1
U1 => ,U | ε
V => id
V1 => V2V1 | ε
V2 => ,V
V3 => VV1 | ε
S => VS1
S1 => (M) | ε
S2 => S3S2
S3 => ,S
B => TOT1 | R then TOT1 else B
T1 => T2T1 | ε
T2 => ,TO
C => [ ] | from TO taking [VV1]
E => [E1 | s(integer)
E1 => L] | ]
L => AA1 | TOA1
A => integer | real | string
A1 => ,L | ε
M => allM1 | TOM1
M1 => ,M | ε
T => E | V | O(TO) | gen([TO,...,TO]) | $
TO => TT3
T3 => (T4 | ε
T4 => map(M)) | M)
O => + | - | / | * | abs | sqrt | cos | sin | tan | log | mod | reverse | transpose | rotateright | rotateleft | cartesianproduct
R => RO(TO) | and(RR1) | or(RR1) | not(RR1)
R1 => RR1 | ε
RO => < | > | = | >= | <= | <> | integer | real | var | operator

The above grammar is not an LL(1) grammar yet. The epsilon options appearing

in the grammar must be dealt with. This means selection sets must be generated for the

grammar. The selection set methodology described in Chapter III was used to generate the

following table of selection sets for SequenceL.


Table B.1 SequenceL Selection Sets

p Fl

F2 :

:=

:= :=

:=

F ::= U ::=

::= Ul ::=

::= U2 ::=

::= V ::= VI

:

V2 : V3

:= :z=

:=

••= S ::= SI ::=

::= S2 ::=

S3 ::=

B ::=

Tl ::= ::=

T2 ::=

{{FF1}U} F2FI

,F V(SS2) where next(V3) = C{B} VUl EU2 ,U

,U

Id V2V1

,v VVl

VSl (M)

S3S2

,S

TOTl

R then TOTl else B T2T1

,T0

First Set { J

f

id id [s f

t

id )

?

id

id

(

1

[ s id + - / * abs sqrt cos sin tan log mod reverse transpose roatetright rotateleft cartesianproduct gen$ < > = > = < = < > integer real var operator and or not )

J

Follow Set Start

}

,}

,} }

}

}

( , ] } ) ])

, ] ) )

J

)

, )

}

else }

, else }

Selection Set { •>

} »

id id [s )

} 1

} id

]) ) id \ id ( y

1

) ?

[ s id + - / * abs sqrt cos sin tan log mod reverse transpose roatetright rotateleft cartesianproduct gen$ < > = >=<=<> integer real var operator and or not ) else } )


Table B.1 continued

C ::=

• •"

E ::=

El ::=

L ::=

A ::=

::= Al ::=

::= M ::=

Ml ::=

[] from TO taking [VVl]

[E l s(integer)

L] ] AAl

TOAl integer real string ,L

allMl

T0M2 M

First Set [

from

[ s integer real string [ s id + - / * abs sqrt cos sin tan log mod reverse transpose roatetright rotateleft cartesianproduct gen$

] integer real string [ s id + - / * abs sqrt cos sin tan log mod reverse transpose roatetright rotateleft cartesianproduct gen$ integer real string ?

all [ s id + - / * abs sqrt cos sin tan log mod reverse transpose roatetright rotateleft cartesianproduct gen$ •>

Follow Set {

, ( else taking) ] }

, (else taking) ] }

]

,]

]

)

)

Selection Set [

from

[ S integer real string [ s id + - / * abs sqrt cos sin tan log mod reverse transpose roatetright rotateleft cartesianproduct gen$

] integer real string [ s id + - / * abs sqrt cos sin tan log mod reverse transpose roatetright rotateleft cartesianproduct gen$ integer real string )

] all ; s id + - / * abs sqrt cos sin tan og mod reverse transpose roatetright rotateleft cartesianproduct gen$


Table B.1 continued

::=

M2 ::= ::=

T ::=

::=

TO ::= T3 ::=

::= T4 ::=

0 ::= : :—

::=

,M

E V

0(TO) gen([TO,...,TO]) $

TT3 (T4

mapCM))

M) +

/ *

First Set

[s id + - / * abs sqrt cos sin tan log mod reverse transpose roatetright rotateleft cartesianproduct gen $ [ s id + - / * abs sqrt cos sin tan log mod reverse transpose roatetright rotateleft cartesianproduct gen$

(

map

[ s id + - / * abs sqrt cos sin tan log mod reverse transpose roatetright rotateleft cartesianproduct gen $ all +

/ *

Follow Set

)

(, else taking) ] }

, else taking ) ] } , else taking ) ] }

, else taking ) ] }

(

Selection Set )

9

)

[s

+ - / * abs sqrt cos sin tan log mod reverse transpose roatetright rotateleft cartesianproduct gen $ [ s id + - / * abs sqrt cos sin tan log mod reverse transpose roatetright rotateleft cartesianproduct gen$

( , else taking ) ] }

[ s id + - / * abs sqrt cos sin tan log mod reverse transpose roatetright rotateleft cartesianproduct gen $all + -

/ *


Table B.1 continued

= —

=

=

=

= =

R ::=

Rl

RO ::=

;;=

;;=

::= ::—

;;=

::=

::=

abs sqrt cos sin tan log mod reverse transpose rotateright rotateleft cartesianproduct

RO(TO) and(RRl) or(RRl) not(RRl)

RRl

<

>

= >=

<=

<>

integer real var operator

First Set abs sqrt cos sin tan log mod reverse transpose rotateright rotateleft cartesianproduct

< > = > = < = < > integer real var operator and or not < > = > = < = < > integer real var operator and or not

<

>

>=

<=

<>

integer real var operator

Follow Set

then <> = >=<= o integer real var operator and or not)

)

(

Selection Set abs sqrt cos sin tan log mod reverse transpose rotateright rotateleft cartesianproduct

< > = >=<=<> integer real var operator and or not <> = >=<= o integer real var operator and or not ) <

>

= >=

<=

o integer real var operator

The syntax analyzer is written directly from the selection set table using the

methods described in Chapter III. The following code is the syntax code for the L

production.


L => A,L | TO,L | A | TO

int l()
{
    if(a()){
        if(a1()){
            return true;
        }
        else
            return false;
    }
    else
        if(t0()){
            if(a1()){
                return true;
            }
            return false;
        }
        else
            return false;
}

The complete syntax analyzer code is in syntax.c on the CDROM.

B.1.2.1 Syntax Error Checking

Error reporting by the syntax analyzer occurs whenever a syntax error is detected.

A classic approach with some compilers is to report the syntax error and do some sort of

error recovery; this allows the compiler to continue processing the source code file [Aho].

A number of strategies are available for error recovery, the simplest method being panic

mode. This method will discard symbols until a designated synchronizing token has been

found; in C this might be the semicolon used to terminate statements. A SequenceL
program is an expression of a complete problem solution using nested constructs. This

makes the panic mode method not as effective as it has been found to be in a procedural


language like C. It is possible to introduce terminating symbols into SequenceL but this

idea was rejected. Instead the compiler is designed to stop immediately and generate an

error message when it detects a syntax error. It will then dump to the display system the

SequenceL file up to the point at which the error was detected. This implementation has

worked reasonably well.
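A minimal sketch of this stop-and-dump behaviour is shown below; the function name, its parameters, and the use of stderr are assumptions for illustration rather than the compiler's actual error handler.

#include <stdio.h>
#include <stdlib.h>

void syntax_error(const char *source, long error_pos, const char *expected)
{
    long i;

    fprintf(stderr, "syntax error: expected %s\n", expected);

    /* dump the SequenceL source up to the point where the error was detected */
    for (i = 0; i < error_pos && source[i] != '\0'; i++)
        fputc(source[i], stderr);
    fputc('\n', stderr);

    exit(1);   /* stop immediately; no error recovery is attempted */
}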

B.2 Symbol Table

All symbols, including identifiers, reserved words and temporaries, have an entry in

a symbol table. Through a symbol's type information meaning is given to the symbol.

This compiler creates multiple symbol tables. There is a symbol table for each SequenceL

function encountered in a SequenceL file as well as a global symbol table. The global

symbol table contains reserved words, SequenceL function symbols and any SequenceL
program input variables. Every symbol table entry, except reserved words and SequenceL
function names, is local in scope. By creating a symbol table for each SequenceL

function, local scope of all function identifiers is automatically provided. Function

symbols can appear in both the function symbol tables and the global symbol table. When

a function g is referenced in a function f, the function g's symbol is placed in function f's
symbol table. A symbol table update program runs before object code generation (OCG)
to make sure the attributes of the symbol for function g, in function f's symbol table,
match g's symbol attributes in the global symbol table.

Symbol tables for this compiler are implemented using hash tables with a

maximum table size of 499; each hashed entry in the symbol table has a linked list


capability. If a symbol hashes to an entry already in use the new entry will be linked to

the existing symbol table entry. If a symbol table location is not in use a new entry is

created for that location. This design makes the symbol table size dynamic. Memory is

allocated for symbol table entries only as symbols are added.
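The chained insertion scheme just described can be sketched as follows, using the entry structure listed later in this section. The hash function and the helper name are assumptions, not the code in the compiler sources.

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 499

/* assumed hash function; the compiler's actual hashing may differ */
static unsigned hash_name(const char *name)
{
    unsigned h = 0;
    while (*name)
        h = h * 31 + (unsigned char)*name++;
    return h % TABLE_SIZE;
}

/* insert a symbol, chaining it in front of any entry that hashed to the same slot */
struct entry *insert_symbol(struct entry *table[TABLE_SIZE], const char *name, int type)
{
    unsigned h = hash_name(name);
    struct entry *e = calloc(1, sizeof(struct entry));

    e->name = strdup(name);
    e->type = type;
    e->next = table[h];
    table[h] = e;
    return e;
}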

Initially the symbol attributes consisted of only a name and type but have since
grown to include additional attribute information. The current symbol attributes are listed
in the following C structure taken from the compiler source code.

struct entry {
    char* name;      /* symbol name */
    int type;        /* symbol type information */
    int constype;    /* symbol is a constant */
    int gen;         /* identifier is gen result */
    int taking;      /* symbol is a taking identifier */
    int queue;       /* symbol is a queue */
    int numarg;      /* number of function arguments */
    int formal;      /* symbol is a formal parameter */
    int if_result;   /* symbol is return value for if-then-else construct */
    struct entry *next;  /* pointer to next symbol in chain */
};

The name attribute contains a pointer to a character string, which contains the

symbol's name. For example, the name for reserved word "taking" is "taking". Type is a

numerical value arbitrarily assigned to each symbol. The "taking" symbol has been

assigned the type value 15. Constants will have their constype attribute set by semantic
analysis; this gives the compiler an opportunity to do some compile time memory
allocation for constants during OCG. Taking expression identifiers are associated with
one type of implied parallelism. Therefore, the symbols for these identifiers will have

their taking attribute set by semantic analysis. For example in the following expression

the taking identifier is i.

taking i from gen([1,...,4])

The queue attribute is set when an identifier has been identified by semantic

analysis as a queue. Any expression that uses a taking identifier in an operation will have

the result it produces defined as a queue. Any expression that has a queue as one of its

operands will have the result it produces defined as a queue. The numarg attribute stores
information on the number of input arguments a SequenceL function can accept. The

formal attribute is set for input arguments to a SequenceL function. In the generated C

code the conditional expressions can return a result from either their true or false
expressions; the temporary that returns a result from a conditional expression has its
if_result attribute set. The symbol table is critical to code generation. Without the attribute

information it would be very difficult to identify implied parallelisms.

B.3 Semantic Analyzer

Semantic analysis (SA) converts the SequenceL code to SequenceL intermediate

code (IC) statements. Semantic analysis generates the SequenceL IC statements through a

set of semantic action rules. The decision on where to place the semantic action rules

depends on when a given production is recognized by syntax analysis. The S production

is used here as an example of the semantic analysis code.

S => VS1

The following code is the S production code from syntax.c

int s()
{
    struct entry *p;

    if(v()){
        p=lookup(last,prevtoken);
        create_arg(p);
        if(s1()){
            return true;
        }
        else
            return false;
    }
    else
        return false;
}

The create_arg(p) is the semantic action for function input arguments. The semantic

function create_arg is listed below.

void create_arg(struct entry *p)
{
    struct entry *p1;
    symtbl_stack_TOP *operand_top;

    operand_top=(symtbl_stack_TOP*)malloc(sizeof(symtbl_stack_ELEMENT));
    *operand_top=NULL;
    push_sym(p,operand_top);
    p->formal=1;
    p1=pop_sym(&sas_top);
    gen_quad(lookup(last,"_arg"),operand_top,NULL,0);
    argcount++;
}

The code first sets up the operand stack using the pointer operand_top. The input

argument is on the SAS when this function is called. Therefore the input argument is

popped off the SAS and placed in the operand stack. Next the input argument has its

formal attribute set. The function gen_quad is passed the operator "_arg", the operand

stack and a NULL for the result. The variable argcount is used to update the symbol

attribute numargs for the function whose input arguments are being processed. P1 is
getting a marker off the SAS that was used at one time but is not used now. The rest of
the semantic analyzer implementation follows this pattern. The semantic action functions

can be found in quads.c on the CDROM.

B.4 Intermediate Code Representation

The IC statements are a key interface point in the compiler between

lexical/syntax/semantic analysis and object code generation. Complete details on the

development of the SequenceL intermediate language can be found in Chapter IV.

The following sections will describe the intermediate code by example. As each

example is reviewed a description of each step will be provided.

B.4.1 Evaluation of IC for Matrix Multiply

The following is a listing of the IC statements generated by the compiler for the
SequenceL Matrix Multiply:

/* matrix multiply */
{{ matmul(s_1(n,all),s_2(all,m)) where next(n,m) =
   from cartesianproduct([gen([[1],...,n]),gen([[1],...,m])]) taking [i,j]
   {+([*([s_1(i,all),s_2(all,j)])])} }
 matmul, s_1, s_2 }

1.  _beginfunc :: matmul s_1 s_2
2.  _arg :: s_1 n all
3.  _arg :: s_2 all m
4.  next :: n m
5.  _seq :: 1 :: _t0
6.  gen :: _t0 n :: _t1
7.  _seq :: 1 :: _t2
8.  gen :: _t2 m :: _t3
9.  _seq :: _t1 _t3 :: _t4
10. cartesianproduct :: _t4 :: _t5
11. from :: _t5 i j
12. _M :: s_1 i all :: _t6
13. _M :: s_2 all j :: _t7
14. _seq :: _t6 _t7 :: _t8
15. * :: _t8 :: _t9
16. + :: _t9 :: _t10
17. _endfunc :: :: _t10
18. _call :: matmul s_1 s_2

The lines are numbered for reference purposes. Two colons are used as delimiters

between operator, operands and results and are added by the compiler's IC dump to

display feature. To make the IC listing more readable the operand field includes all the

operands in the operand stack. Therefore on line 15 the operator is *, the operand is _t8,

and the result is stored in _t9. Note the use of temporaries. All temporaries generated by

semantic analysis begin with _t followed by a numerical value.

Before an IC statement can be generated by the semantic analyzer three things are

required: a pointer to the symbol table entry for the operator, a pointer to the operand

stack and a pointer to the symbol table entry for the result. The elements in the operand

stack are pointers to the symbol table entries for the operands. An IC statement that has

no operands or result will have the operand stack address or result address set to NULL in

the IC table. For example _endfunc in line 17 has no operand stack.
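Although the IC table's declaration is not listed in this guide, the description above implies an entry of roughly the following shape; the struct tag and field names are assumptions, while the operand stack and symbol table types are those used in the create_arg code shown earlier.

struct quad {
    struct entry *op;              /* symbol table entry for the operator          */
    symtbl_stack_TOP *operands;    /* operand stack; NULL if there are no operands */
    struct entry *result;          /* symbol table entry for the result;           */
                                   /* NULL if the statement produces no result     */
};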

The semantic actions coupled with the SAS effectively pull apart all the

SequenceL nested operations and lay them out in serial order as IC statements. An

examination of the IC statements for data dependencies is the only way to tell that the IC

statements were generated from a nested structure. For example, the nested computation
in matrix multiply is:

+([*([s_1(i,*),s_2(*,j)])])

Semantic analysis generates the following IC statements for the multiply and add parts of

this expression.

*  _t8  _t9
+  _t9  _t10

_t9 is the result produced by multiplication; _t9 is also an operand for the addition. This

data dependency means the IC statements were generated for a nested SequenceL

expression.

A complete description of the matrix multiply IC listing is presented below along

with a description of each IC statement and what part of the SequenceL Matrix Multiply

program the IC maps back to.

1) _beginfunc :: matmul s_1 s_2
Indicates the beginning of a function called matmul, which accepts two input arguments, s_1 and s_2. Generated by SA from matmul(s_1(n,*),s_2(*,m)).

2) _arg :: s_1 n all
Set n to the number of columns in s_1. Generated by SA from s_1(n,*).

3) _arg :: s_2 all m
Set m to the number of rows in s_2. Generated by SA from s_2(*,m).

4) next :: n m
Set the output format based on n and m. Generated by SA from next(n,m).

5) _seq :: 1 :: _t0
Set up a constant sequence. Generated by SA from [1].

6) gen :: _t0 n :: _t1
Generate a sequence ranging from _t0 to n and store the resultant sequence in _t1. Generated by SA from gen([[1],...,n]).

7) _seq :: 1 :: _t2
Set up a constant sequence. Generated by SA from [1].

8) gen :: _t2 m :: _t3
Generate a sequence ranging from _t2 to m and store the resultant sequence in _t3. Generated by SA from gen([1,...,m]).

9) _seq :: _t1 _t3 :: _t4
Collect the two generated sequences together. Generated by SA from [_t1, _t3]. _t1 and _t3 are the results associated with the two gen IC statements in lines 6 and 8.

10) cartesianproduct :: _t4 :: _t5
Generate the Cartesian product of _t4 and store it in _t5. Generated by SA from cartesianproduct([_t4]). _t4 is the result associated with the collect sequences IC in line 9.

11) from :: _t5 i j
Generate i and j values from _t5. Generated by SA from taking [i,j] from _t5. _t5 is the result associated with the Cartesian product IC in line 10.

12) _M :: s_1 i all :: _t6
Select column sequences from s_1 based on i and store the resultant sequence in _t6. Generated by SA from s_1(i,*).

13) _M :: s_2 all j :: _t7
Select row sequences from s_2 based on j and store the resultant sequence in _t7. Generated by SA from s_2(*,j).

14) _seq :: _t6 _t7 :: _t8
Collect the columns and rows together in _t8. Generated by SA from [_t6, _t7]. _t6 and _t7 are the results associated with the index IC statements in lines 12 and 13.

15) * :: _t8 :: _t9
Multiply the columns and rows and store the resultant sequence in _t9. Generated by SA from *([_t8]).

16) + :: _t9 :: _t10
Add the sequence in _t9 and store the resultant sequence in _t10. Generated by SA from +([_t9]).

17) _endfunc :: :: _t10
Return _t10 to the referencing function. Generated by SA from the } at the end of a function.

18) _call :: matmul s_1 s_2
This IC specifies the initial call to matmul; the result of matmul is returned here. Generated by SA from matmul, s_1, s_2}.

This example will be continued in the section on object code generation.

B.5 Object Code Generation

Every SequenceL program passed to the SequenceL compiler results in the

compiler creating a number of function files and their associated include files. For

example, given the following SequenceL code fragment, and assuming that function2
contains an implied parallelism that returns a result in the temporary identifier _t30:

{{

function1(s_1(n)) where next(n)... },
function2(s_1,s_2(n)) where next(n) ... },
function3(s_1,s_2(n)) where next(n)... }}
function1, s_1 }

The following file open and close cycles occur during OCG.


Open main.c main.h and initialize

Close main.c and main.h and open function1.c function1.h

Close function1.c function1.h open main.c and main.h

Close main.c and main.h and open function2.c function2.h

Open _t30agg.c _t30agg.h

Close _t30agg.c _t30agg.h

Close function2.c function2.h open main.c and main.h

Close main.c and main.h and open function3.c function3.h

Close function3.c function3.h open main.c and main.h

Complete update of main.c and main.h and close

Figure 4

Between each file's open and close statement the function file is being updated with C

code by OCG. Note that function2.c and function2.h remain open when _t30agg.c and
_t30agg.h are open; this is because _t30agg.c and _t30agg.h are being created

dynamically as a result of control flow analysis detecting the implied parallelism. The C
function created in _t30agg.c will be a thread function that will be invoked by the

pthread_create library call from within function2.c. When main.c and main.h are initially

opened, the first part of main.c is written to main.c. If the file open and write fails the

compiler reports an error and exits. The file open and close trace is the result of using


only one file pointer for function files and one file pointer for include files. The current

setting of the two file pointers is controlled by OCG's current location in the IC table.

Only when an implied parallelism has been detected are more than two files open at the

same time.

The initial code written to main.c is as follows.

#include <stdio.h>
#include <pthread.h>
#include <math.h>
#include <semaphore.h>
#include <runtime1.h>

int main(){
#include "main.h"

The very first IC read from the IC table by OCG will always be a _beginfunc.

This IC causes the OCG to close main.c and main.h and open a new C function file and

include file. As OCG reads IC statements it generates C code and writes it to the function

file; C code declarations for identifiers are written to the function's include file. When
OCG reads the _endfunc IC, this IC causes OCG to generate the function's return
statement and close the function file and its include file, reopening main.c and main.h.

The very last IC in the IC table will always be derived from the SequenceL initial

tableau expression that SequenceL uses to begin the consume-simplify-produce execution

process. In the matrix multiply IC table listing this IC is:

_call :: matmul s_1 s_2

This IC causes OCG to write the following C code for matrix multiply to main.c.

s_1=get_input("s_1");
s_2=get_input("s_2");
result=matmul(s_1,s_2);
printf("result = %s\n",result->string);
}

The declarations for s_1, s_2, result and the function prototype for matmul are written to

the file main.h

sequence *s_1;
sequence *s_2;
sequence *result;
sequence *matmul(sequence *, sequence *);

Since the queue flags are not set for s_1, s_2 and result, they are sequences.
Additionally, since s_1 and s_2 are not function names and do not appear within a
SequenceL function, they are input arguments for the SequenceL program. All inputs to a
SequenceL program are read from input files that must have the same name as the input
arguments. The structure of an input file is a sequence written as a text string. Therefore if a sequence
representing a 3x3 matrix is to be loaded into the input variable s_1, the following string
would be placed in a text file called s_1.

[[[1],[2],[3]],[[4],[5],[6]],[[1],[1],[1]]]

Code generation by example is presented in section B.5.2. The compiler code for

code generation is in codgen.c on the CDROM.

B.5.1 Parallel Code Generation

The queue is the fundamental data structure for supporting parallel execution in
the generated C object code. The inspiration for developing the queue for the object code
was the SequenceL execution strategy. For example, in matrix multiply the following
structure is created in the tableau for a 3x3 matrix multiplication [Coo00].

T = [ 2:+([*([[2,4,6],[2,3,1]])])  2:+([*([[2,4,6],[4,5,1]])])  2:+([*([[2,4,6],[6,7,1]])])
      2:+([*([[3,5,7],[2,3,1]])])  2:+([*([[3,5,7],[4,5,1]])])  2:+([*([[3,5,7],[6,7,1]])])
      2:+([*([[1,1,1],[2,3,1]])])  2:+([*([[1,1,1],[4,5,1]])])  2:+([*([[1,1,1],[6,7,1]])]) ]

It is not difficult to see that this structure can be envisioned as a queue, or at least the data

can be easily processed into a queue.

[[2,4,6],[2,3,1]]
[[2,4,6],[4,5,1]]
[[2,4,6],[6,7,1]]
[[3,5,7],[2,3,1]]
[[3,5,7],[4,5,1]]
[[3,5,7],[6,7,1]]
[[1,1,1],[2,3,1]]
[[1,1,1],[4,5,1]]
[[1,1,1],[6,7,1]]

The queue length can be anywhere from zero to whatever system memory will

allow.
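The object code manipulates queues through a head pointer and per-element links (see the generated code in section B.5.2.1, which uses the fields head, next and p). A minimal sketch of such a layout, using the document's sequence type, is shown below; the type names themselves are assumptions.

/* one link in a queue; p points at the sequence stored in this position */
typedef struct queue_element {
    sequence *p;
    struct queue_element *next;    /* next element, NULL at the end of the queue */
} queue_element;

/* the queue itself: object code walks the list starting at head */
typedef struct {
    queue_element *head;
} queue;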

The above queue listing reveals a problem that can occur with some parallel

execution strategies. Note that there are only three columns and three rows in a 3x3

matrix multiply yet the queue has nine columns and nine rows stored in it. Each column

and row is repeated three times. It's easy enough to set up a queue using pointers.

Therefore instead of storing three copies of each column and row in a queue the queue

has three pointers to each column and row. That way only one copy of a column or row is

actually stored in memory. This implementation has its problems as well. It leads to

memory bottlenecks on shared memory systems. For example, if 1000 threads needed
access to that one column a potential memory bottleneck is created. Furthermore, on a
distributed shared memory system such as the SGI Origin2000, chances are that the
matrix will be stored in contiguous memory. Therefore all the columns and rows will be
together in the same area of memory; this makes the bottleneck condition even worse.

Optimization for space versus time is beyond the scope of this research. For now time

gets priority over space for this compiler.

The queues created in the C generated code begin with the taking identifiers. The

taking expression is a known point of distribution for one class of implied parallelisms.

For example, the following taking expression

taking i from gen([1,...,5])

results in a queue in the object code. The queue i will contain the elements:

[1] [2] [3] [4] [5]

When an expression is evaluated that references i the expression produces a result that is

also a queue. For example, given

s_1 = [[[1],[2]],[[3],[4]],[[5],[6]],[[7],[8]],[[9],[10]]]

the expression s_1(i) produces the following result in a queue.


[[1],[2]] [[3],[4]] [[5],[6]] [[7],[8]] [[9],[10]]

If the expression s_1(i) appears in a computation such as +([s_1(i)]) then the result

produced is also a queue.

[3] [7] [11] [15] [19]

At the end of a function the output formatting information associated with the
next(V*) expression will determine the format the queue must be placed in before a
sequence is returned. For this queue example assume the result is linear.

[[3],[7],[11],[15],[19]]

Any operation that utilizes a queue is a potential opportunity for parallelism. The
key to processing the queues in parallel is to set up a loop to pass the queue elements one
at a time to a thread function. Thread functions are invoked by the POSIX thread library
routine pthread_create. Thread functions that are invoked by pthread_create are non-
blocking. Therefore, before the result of a thread function can be used, a block must be
set up that unblocks when the thread function exits. pthread_join is the POSIX thread
library routine that provides the block for thread functions. pthread_join calls are placed
in object code using an as-needed strategy. Therefore, a pthread_join is not placed in
object code until a result is needed from a thread function. This strategy leads to another
type of parallelism.
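Before turning to that, the create-now, join-only-when-needed pattern can be sketched as follows; the worker function, argument array, and thread count are placeholders for illustration, not the compiler's generated code.

#include <pthread.h>

#define NTHREADS 8                 /* placeholder thread count */

/* hypothetical thread function standing in for a runtime library worker */
void *worker(void *arg) { (void)arg; return NULL; }

void run_and_join_when_needed(void *args[NTHREADS])
{
    pthread_t thr[NTHREADS];
    int i;

    /* pthread_create does not block, so all threads start immediately */
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&thr[i], NULL, worker, args[i]);

    /* ... code that does not depend on the workers can be emitted here ... */

    /* the blocking joins are emitted only where the results are first needed */
    for (i = 0; i < NTHREADS; i++)
        pthread_join(thr[i], NULL);
}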


In a nested SequenceL structure, if there is more than one computation at the

same level of nesting, these computations can execute in parallel. For example, the

following is a SequenceL example of two multiply expressions at the same level of

nesting in the nested expression.

+([*([s_1(i),s_2(i)]),*([s_3(i),s_4(i)])])

The results from the two multiply operations are required by the addition operation.
The following pseudo code for this expression will be utilized to explain how the object
code handles nested structures executing in parallel.

for(i=0; i < size; i++)              /* loop 1 */
    _t0(i)=*(s_1(i),s_2(i));         /* threaded execution */

for(i=0; i < size; i++)              /* loop 2 */
    _t1(i)=*(s_3(i),s_4(i));         /* threaded execution */

block here for loop 1
block here for loop 2

for(i=0; i < size; i++)              /* loop 3 */
    _t2(i)=+(_t0(i),_t1(i));         /* threaded execution */

Since the results from loop 1 and 2 are needed in loop 3 a block is placed just

before loop 3. This block will unblock when all the threads generated in loop 1 and 2

have exited. This works very well for balanced nested expressions but runs into problems
with unbalanced nested expressions. The following expression is unbalanced; the first

addition is nested one level below the second addition.

*([*([+([s_1(i),s_2(i)])]),+([s_3(i),s_4(i)])])

The pseudo procedural code for the above expression is:

for(i=0; i < SIZE; i++)              /* loop 1 */
    _t1(i)=+(s_1(i),s_2(i));         /* first addition */

block here for loop 1

for(i=0; i < SIZE; i++)              /* loop 2 */
    _t3(i)=*(_t1(i));                /* inner or second multiply */

for(i=0; i < SIZE; i++)              /* loop 3 */
    _t2(i)=+(s_3(i),s_4(i));         /* second addition */

block here for loop 2 and 3

for(i=0; i < SIZE; i++)              /* loop 4 */
    _t4(i)=*(_t3(i),_t2(i));         /* outer or first multiply */

Even though loop 1 and loop 3 can be executed in parallel, the block on loop 1 just before
loop 2 keeps loop 3 from executing until all of loop 1's threads have exited. The block is

there because loop 2 uses the results from loop 1. The result is a reduction in parallel

execution. A side effect of optimizing the compiler for cache locality has been to reduce

the effects of this problem for simple unbalanced nested expressions. Cache locality is

detailed in B.6.2.1. Cache locality is improved by combining data dependent
computations into the same thread function. Therefore the above code would be
restructured so that the first addition and inner multiply share the same thread function.
The procedural code then becomes:

for(i=0; i < SIZE; i++){             /* loop 1 */
    _t1(i)=+(s_1(i),s_2(i));         /* first addition */
    _t3(i)=*(_t1(i));                /* inner or second multiply */
}

for(i=0; i < SIZE; i++)              /* loop 2 */
    _t2(i)=+(s_3(i),s_4(i));         /* second addition */

block here for loop 1
block here for loop 2

for(i=0; i < SIZE; i++)              /* loop 3 */
    _t4(i)=*(_t3(i),_t2(i));         /* outer or first multiply */

This implementation strategy works well for both balanced and simple unbalanced nested

computations. More complex unbalanced nested computations could require code
restructuring.

The final type of parallelism that code generation must deal with is control flow

parallelisms associated with function calls, or references. For example, given two
functions, function1 and function2, where each receives one argument and each returns
one argument, and given the following SequenceL expression that references function1
and function2:

[function2,s_1,function1,s_1]

The IC statements generated from this SequenceL expression will be:

_call function1 s_1 _t0

_call function2 s_1 _t1

In this example there are no dependencies between the functions. When control flow

analysis detects this kind of IC arrangement it creates two thread functions. OCG then

places the call to function1 in one thread function and the call to function2 in the other

thread function. The result is the two thread functions execute in parallel.
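A sketch of the object code that this arrangement could produce is shown below; the wrapper names, the stub definitions, and the way results are stored are assumptions about what OCG emits, not its actual output.

#include <pthread.h>

typedef struct sequence sequence;     /* opaque stand-in for the runtime type */

/* stubs for illustration; the real functions are generated from SequenceL */
sequence *function1(sequence *s) { return s; }
sequence *function2(sequence *s) { return s; }

sequence *s_1, *_t0, *_t1;

/* one thread wrapper per independent function reference */
void *call_function1(void *arg) { _t0 = function1((sequence *)arg); return NULL; }
void *call_function2(void *arg) { _t1 = function2((sequence *)arg); return NULL; }

void run_calls(void)
{
    pthread_t thr1, thr2;

    pthread_create(&thr1, NULL, call_function1, s_1);
    pthread_create(&thr2, NULL, call_function2, s_1);

    /* both calls run concurrently; join before their results are used */
    pthread_join(thr1, NULL);
    pthread_join(thr2, NULL);
}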

B.5.2 Code Generation by Example

Two examples of code generation will be presented in this section. Together the

two examples demonstrate the code generated for all the SequenceL grammar

productions except "and", "or" and "not". The two examples demonstrate all the classes of


implied parallelisms. The first example will be matrix multiply and the second will be

quicksort. In each example an IC statement is presented first followed by the OCG

generated object code. The appropriate variable and function declarations are placed in

the function include files as OCG generates the object code.

B.5.2.1 Matrix Multiply

The first IC statement encountered will be the _beginfunc statement. This

statement causes OCG to set up the function prototype and formal arguments for the

SequenceL function matmul.

_beginfunc :: matmul s_1 s_2

OCG generates the following code in the function file matmul.c:

#include <stdio.h>
#include <pthread.h>
#include <math.h>
#include <semaphore.h>
#include <runtime1.h>

sequence *matmul(sequence *s_1,sequence *s_2)
{
#include "matmul.h"

This next IC instructs OCG to set up object code that assigns a value to n.

_arg :: s_1 n all

OCG generates:

n=Msizeof(s_1,1,2);


Msizeof is a runtime library routine. It accepts sequence and integer information that
helps it to return a size value associated with the sequence it is passed. The two integer
values indicate how many indexes are involved in the size specification and which index
is of interest. Msizeof will assign to n size information associated with s_1. The 1 in the
argument list indicates that n is the first of two identifiers in the original SequenceL
expression; therefore column information is required. The 2 in the argument list indicates

that there are two identifiers in the original SequenceL index expression.

This next IC instructs OCG to set up object code that assigns a value to m.

_arg :: s_2 all m

OCG generates:

m=Msizeof(s_2,2,2);

In this case Msizeof is informed that m is the second identifier of two. Therefore row information is being requested.

next :: n m

Code generation does not generate anything for the next IC. It will be scanned for when

the _endfunc IC is encountered.

This IC statement sets a temporary to a constant.

_seq :: 1 :: _t0

OCG generates:

_t0=sequence_atos("[1]");

Sequence_atos is a string-to-sequence conversion routine. Constants like "[1]" can be
allocated memory at compile time or at runtime. Here it is allocated at runtime.


This next IC statement is a generative statement.

gen :: _t0 n :: _t1

OCG generates the following statements:

_t1gen.start=_t0;
_t1gen.end=n;
gen_thr_t1=(pthread_t*)malloc(sizeof(pthread_t));
pthread_create(gen_thr_t1, NULL, (void *)gen, (void*)&_t1gen);

All generative constructs are implemented as threads. The argument structure for gen is
predefined. The structure includes a start sequence, an end sequence and a sequence to
store the result of the gen. The result name in the IC is used as part of the name of all
variables associated with the gen. This includes the name of the structure (_t1gen) as well
as the thread id (gen_thr_t1). This strategy of using a result name for variables guarantees
that the names are unique since a result name can be used as a result only once in an IC
table. The runtime library routine gen is set up as a thread function.
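The argument structure for gen is not listed in this guide, but the generated code above implies fields of roughly the following form; the struct tag gen_data is an assumption, while start, end and seq are the field names actually used in the generated statements.

typedef struct {
    sequence *start;   /* first value of the generated range                      */
    sequence *end;     /* last value of the generated range                       */
    sequence *seq;     /* filled in by the gen thread with the resulting sequence */
} gen_data;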

A constant:

_seq :: 1 :: _t2

OCG sets up a constant:

_t2=sequence_atos("[1]");

Another generative construct:

gen :: _t2 m :: _t3

generates

_t3gen.start=_t2;
_t3gen.end=m;
gen_thr_t3=(pthread_t*)malloc(sizeof(pthread_t));
pthread_create(gen_thr_t3, NULL, (void *)gen, (void*)&_t3gen);

This IC collects _t1 and _t3 together in a new sequence _t4.


_seq :: _t1 _t3 :: _t4

OCG generates these statements:

pthread_join(*gen_thr_t1,NULL);
_t1=_t1gen.seq;
pthread_join(*gen_thr_t3,NULL);
_t3=_t3gen.seq;
_t4=collect_sequences("ss",_t1,_t3);

_t1 has its gen flag set; therefore a gen operation was involved in creating the contents of
_t1. Therefore a pthread_join is required for _t1. This also holds for _t3. _t1 and _t3 are
collected together by the collect_sequences runtime library call.

This IC is a Cartesian product statement:

cartesianproduct :: _t4 :: _t5

OCG generates:

_t5=cartesian(_t4);

Except for the multiply, add, subtract and divide functions, all of the SequenceL functions
such as cartesian, abs, cos, sin, etc. are implemented as non-threaded functions in the
runtime library. They accept one sequence and return one sequence.

This IC is a taking statement:

from:: _t5 i j

OCG generates the following statements:

i=(taking_data*)malloc(sizeof(taking_data));
i->from=_t5;
i->var=1;
i->num_var=2;
take_thri=(pthread_t*)malloc(sizeof(pthread_t));
pthread_create(take_thri, NULL, (void *)taking, (void*)i);


j=(taking_data*)malloc(sizeof(taking_data));
j->from=_t5;
j->var=2;
j->num_var=2;
take_thrj=(pthread_t*)malloc(sizeof(pthread_t));
pthread_create(take_thrj, NULL, (void *)taking, (void*)j);
pthread_join(*take_thri,NULL);
pthread_join(*take_thrj,NULL);

A thread is generated for each identifier in the taking clause. In this example i and j are the taking identifiers. A structure has been pre-defined for the input argument of the taking runtime library function. The taking structure requires the following information: the sequence the taking is to be done from, how many identifiers there are, and which identifier the call is being made for. The joins are placed after the last taking identifier's thread is created.
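A minimal sketch of such an argument block is shown below; the field names follow the generated code, while the concrete types are assumptions made so the fragment is self-contained.

/* Hypothetical sketch of the taking argument block described above.
 * The void* payloads stand in for the runtime's sequence and queue types. */
typedef struct {
    void *from;      /* the sequence the taking is done from        */
    int   var;       /* which taking identifier this thread serves  */
    int   num_var;   /* how many taking identifiers there are       */
    void *result;    /* queue of selections produced by the thread  */
} toy_taking_data;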

This IC is an index statement:

_M:: s_1 i all :: _t6

OCG generates:

_t6=selection_queue("sq*",s_1,i->result,all);

The runtime library function selection_queue can take any number of arguments. In this example the format string sq* defines the number of arguments and their types: the s means the first argument, s_1, is a sequence; the q means the second argument, i->result, is a queue; and the * means the last argument is a wildcard. The call returns a queue in _t6; in this matrix multiply example all of the columns of s_1 are stored in _t6.
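The format-string convention can be sketched with standard C varargs as follows; toy_selection is a hypothetical routine, and the void* payloads merely stand in for the runtime's sequence and queue types.

#include <stdarg.h>

/* Hypothetical sketch of the format-string convention: each character in
 * fmt announces the type of the next argument ('s' = sequence, 'q' = queue,
 * '*' = wildcard placeholder such as "all"). */
static void toy_selection(const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    for (const char *p = fmt; *p != '\0'; p++) {
        void *arg = va_arg(ap, void *);   /* every format character has an argument */
        switch (*p) {
        case 's': /* 'arg' is a sequence to select from           */ break;
        case 'q': /* 'arg' is a queue of selector values          */ break;
        case '*': /* wildcard: keep every row/column on this axis */ break;
        default:  break;
        }
        (void)arg;
    }
    va_end(ap);
}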

An IC index statement:

_M:: s_2 all j :: _t7


OCG generates:

_t7=selection_queue("s*q",s_2,all,j->result);

Again selection_queue is called, except this time the second argument is the wildcard; the result of this call is that all of the rows of s_2 are stored in _t7.

The collect queue IC statement:

_seq:: _t6 _t7 :: _t8

OCG generates:

_t8=collect_queues("qq",_t6,_t7);

The runtime library function collect_queues places all the column and row sequences in the queues _t6 and _t7 into a new queue, _t8.

The two arithmetic IC statements will be reviewed together.

*:: _t8 :: _t9
+:: _t9 :: _t10

OCG generates:

_t10=_t8;
_t10_element=_t10->head;
size=queue_size(_t10);
thr_t10=(pthread_t*)malloc(sizeof(pthread_t)*size);
while(_t10_element != NULL){
    pthread_create(&thr_t10[_t10_thread], NULL, (void *)_t10agg, (void*)_t10_element->p);
    _t10_element=_t10_element->next;
    _t10_thread++;
}

Control flow analysis seeks out opportunities to improve cache locality. Tests carried out on the Origin2000 revealed that cache locality can reduce execution times by as much as 50%, a figure supported by the literature [Phi]. Control flow


analysis detects the data dependency between these two IC statements and dynamically creates a single thread function in which to place both. Additionally, OCG must dynamically generate a structure for all the input arguments this new thread function will need. Since _t10 will eventually contain the results of this multiply/add function, _t10 is set up as the control for the loop. _t10 initially points at the input queue _t8. Size is assigned the number of threads that will be created. The variable _t10_element is set up to point at the first column/row sequence in the input queue. Each time the while loop cycles, the pointer _t10_element is moved so that it points at the next column/row sequence. The column/row sequence is passed to the thread function, which returns a result in this same sequence.

The dynamically created thread function _t10agg is listed below.

void* _t10agg(void *arg)
{
#include "_t10agg.h"
    _t8=(sequence*)arg;
    multiply(_t8);
    _t9=_t8;
    add(_t9);
    _t8=_t9;
}

The end function IC statement:

_endfunc:: :: _t10

OCG generates:

do{
    _t10_thread--;
    pthread_join(thr_t10[_t10_thread],NULL);
}while(_t10_thread > 0);
return queue_reduc("qii",_t10,n,m);


}

The queue flag for _t10 has been set, therefore a join is needed before the results stored in _t10 can be used. When all of the threads have joined, the results will be stored in the queue _t10. Once all of the results are available they must be placed back in sequence format. The following IC statement specifies the format:

next:: n m

The queue_reduc runtime library function is passed this format information and restructures the queue into a sequence.
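A minimal sketch of that reshaping step is shown below, assuming the worker results are already available as a flat array in creation order; the real queue_reduc walks the runtime's queue of sequences and rebuilds a nested sequence rather than a C matrix.

#include <stdlib.h>

/* Hypothetical sketch of the reshaping performed by queue_reduc: results
 * produced by the worker threads are laid back out as the n-by-m matrix
 * demanded by the "next:: n m" format. */
static double **toy_queue_reduce(const double *results, int n, int m)
{
    double **mat = malloc(sizeof(double *) * (size_t)n);
    for (int i = 0; i < n; i++) {
        mat[i] = malloc(sizeof(double) * (size_t)m);
        for (int j = 0; j < m; j++)
            mat[i][j] = results[i * m + j];   /* lay results out as n rows of m */
    }
    return mat;
}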

Finally the initial tableau IC is processed and main.c and main.h are updated with the call

to matmul.

The execution trace for the C generated parallel code is shown below.

[Figure B.1 Matrix Multiply: execution trace plotted against the number of processors]

B.5.2.2 Quicksort

The SequenceL quicksort code is as follows.

{{ quick(s_1(n)) where next(n) = [ ]
     { =([[ ],s_1]) then [ ]
       else >([n,[1]]) then
         [$,quick,less,s_1([1]),s_1,s_1([1]),$,quick,great,s_1([1]),s_1]
       else [s_1] },

   less(s_1,s_2(n)) where next(n) = from gen([[1],...,n]) taking [i]
     { <([s_2(i),s_1]) then [s_2(i)] else [ ] },

   great(s_1,s_2(n)) where next(n) = from gen([[1],...,n]) taking [i]
     { >([s_2(i),s_1]) then [s_2(i)] else [ ] }}
   quick, s_1 }

Quicksort in SequenceL clearly illustrates the high-level nature of the language: not counting the #include directives and the contents of the include files, 184 lines of C code were required to execute quicksort. The quicksort IC statements and generated code are listed at the end of this document.

Of interest in quicksort is the code generation associated with the relational operations. In the quick function there are two relational expressions; in the less function and the great function there is one in each. There is also a sequence containing multiple function references, which includes two recursive calls to quick.

The quick function IC statements for the relational operations are on lines 7 and 14

in the IC table listing.

=:: _t1 :: _t2
>:: _t5 :: _t6

An examination of the operands _t1 and _t5 reveals that these two identifiers do not have their symbol queue attribute set, therefore they are sequences. This means OCG will generate non-parallel code for these two relational expressions. For the first IC the code generated is

if(convert_logic(condition(EQUAL,_t1)))

which appears at line 13 in the quick.c file listing. The runtime library routine condition accepts a flag indicating which relational operation is being invoked, equality in this case,


and the sequence upon which the operation takes place, _t1. The condition routine returns a truth sequence of true or false singletons. convert_logic is a runtime library routine that translates the truth sequence returned by condition into something that a C if statement can understand. The second IC causes the generation of

if(convert_logic(condition(GREATER,_t5)))
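Taken together, condition and convert_logic behave roughly as in the following sketch; the int-array truth sequence and the toy_ names are assumptions, since the real routines operate on the runtime's sequence type and its own EQUAL/GREATER/LESS flags.

/* Hypothetical sketch of the condition/convert_logic pairing. */
enum { TOY_EQUAL, TOY_GREATER, TOY_LESS };

/* condition(): apply the relational operator to a pair and record the truth value */
static void toy_condition(int op, const double pair[2], int *truth)
{
    switch (op) {
    case TOY_EQUAL:   *truth = (pair[0] == pair[1]); break;
    case TOY_GREATER: *truth = (pair[0] >  pair[1]); break;
    case TOY_LESS:    *truth = (pair[0] <  pair[1]); break;
    default:          *truth = 0;                    break;
    }
}

/* convert_logic(): collapse a truth sequence into a value a C "if" understands */
static int toy_convert_logic(const int *truth, int len)
{
    for (int i = 0; i < len; i++)
        if (!truth[i])
            return 0;   /* any false singleton makes the whole test false */
    return 1;
}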

Note that the results of the relational operations are stored in _t2 and _t6; these two identifiers are tracked by OCG so that the result of either a true or a false branch will be returned in them. Since _t2 is the result of the first relational, and the second relational only executes if the first is false, ultimately the result in _t6 is passed to _t2. The IC at line 32 performs this function.

_seq:: _t6 :: _t2

Note the tracking of _t2 and _t6 by the conditional operators at lines 8, 11, 15, 28, 31, and 33 in the IC table.

Quick also provides an opportunity to demonstrate the compiler's ability to

generate parallel code for implied control flow parallelism. The following SequenceL

expression contains function references that generate the code for this type of parallelism.

[$,quick,less,s_1([1]),s_1,s_1([1]),$,quick,great,s_1([1]),s_1]

The ICs generated from this expression are listed below.

_seq:: 1 :: _t7
_M:: s_1 _t7 :: _t8
_seq:: 1 :: _t9
_M:: s_1 _t9 :: _t10
_seq:: 1 :: _t11
_M:: s_1 _t11 :: _t12
_call:: great _t12 s_1 :: _t29
_call:: quick _t29 :: _t30


_call:: less _t8 s_1 :: _t31
_call:: quick _t31 :: _t32
_seq:: $ _t32 _t10 $ _t30 :: _t13

Of particular interest are the _call IC statements. When control flow analysis encounters the _call operator it checks the next IC for a data dependency. Upon discovering a data dependency, control flow places both of these function calls into one thread function; it then continues, discovers the same opportunity with the next two _call IC statements, and repeats the process. Finally, when control flow reaches the _seq IC it recognizes that the results of the calls are needed there, so pthread_joins are generated for the two thread functions. The code generated for the _call IC statements is as follows.

_t30args=(_t30call_struct*)malloc(sizeof(_t30call_struct));
_t30args->_t12=_t12;
_t30args->s_1=s_1;
pthread_create(&thr_t30, NULL, (void *)_t30agg, (void*)_t30args);
_t32args=(_t32call_struct*)malloc(sizeof(_t32call_struct));
_t32args->_t8=_t8;
_t32args->s_1=s_1;
pthread_create(&thr_t32, NULL, (void *)_t32agg, (void*)_t32args);

The contents of the two thread functions will be the code that would normally be

generated for a _call IC, plus the code associated with the passing of arguments to a

thread function.

_t30args=(_t30call_struct*)arg;
_t12=_t30args->_t12;
s_1=_t30args->s_1;
_t29=great(_t12,s_1);
_t30=quick(_t29);
_t30args->result=_t30;


The complete listing for _t30agg and _t32agg is in the object code listing.

Unlike the relational operations in quick, the relational operations in less and great do present an opportunity for parallel execution, since the queue attribute is set for _t18 and _t25. A thread function needs to be created that will perform the comparison on each of the elements in the queues. The IC statements for the relational operations in less and great are listed below.

<:: _t18 :: _t19

and

>:: _t25 :: _t26

These two IC statements, along with the fact that the queue attribute is set for the operands, cause OCG to set up two loop structures, one in less.c and one in great.c. These loops generate the threads and pass each thread a sequence to compare. The code generated for the first IC statement is as follows.

while(_t18_element != NULL){
    _t19args=(cond_arguments*)malloc(sizeof(cond_arguments));
    _t19args->s_2=s_2;
    _t19args->n=n;
    _t19args->s_1=s_1;
    _t19args->result=(sequence*)malloc(sizeof(sequence));
    push_queue(_t19args->result,_t19queue);
    _t19args->parameter=_t18_element->p;
    _t19args->i=i_element->p;
    i_element=i_element->next;
    pthread_create(&thr_id[_t19_thread], NULL, (void *)_t19cond, (void*)_t19args);
    _t18_element=_t18_element->next;
    _t19_thread++;
}

The actual relational expressions that execute as a result of either a true or a false outcome are in the new thread functions.


_t19args=(cond_arguments*)arg;
i=_t19args->i;
s_2=_t19args->s_2;
n=_t19args->n;
s_1=_t19args->s_1;
if(convert_logic(condition(LESS,_t19args->parameter)))
{
    _t20=selection_seq("ss",s_2,i);
    copy_seq(_t19args->result,_t20);
}
else{
    _t21=sequence_atos("[ ]");
    copy_seq(_t19args->result,_t21);
}
}

All input variables used by these dynamically created thread functions are passed in a dynamically created input variable structure. The copy_seq routine is a way of updating a previously allocated sequence with another sequence.

B.6 Runtime Setup

B.6.1 Sequence Representation

This topic is covered in Chapter IV.

B.6.2 Scheduling

Scheduling issues have been addressed with respect to control flow parallelisms

in the OCG section. Cache locality issues will be reviewed in the next section.


B.6.2.1 Cache Locality

Cache locality is a problem that occurs on multi-processor systems, which thread

scheduling can address [Phi]. A thread that executes on a processor will store its data in

the processor's cache memory. Any subsequent thread requiring access to the data should

be scheduled to execute on the same processor so it can take advantage of the data

already in cache. It is a complex task to explicitly schedule threads on processors through

an analysis of data dependency and control flow [Phi].

An operating system's thread scheduler typically uses a FIFO protocol to schedule threads [Lew]. The first thread created executes first, followed by the second, and so on. When a thread yields a processor, the next thread on the FIFO queue will be loaded onto the vacated processor for execution [Nar]. It is possible to set up a thread creation strategy so that threads share data with each other through the same cache memory. When an initial thread is created, if that thread immediately creates a child thread, that child thread will be placed next to the parent in the FIFO queue. The result is that when the parent yields a processor the child thread gets the vacated processor, and when the child yields the processor back, the parent thread is returned the same processor. Consider a multiply/add scenario, where one thread does the multiply operation and a second does the addition. The following strategy would then result in data sharing via the same cache: the first thread created will do the addition operation, but before it does so it creates the multiply thread and waits for the multiply to complete before the addition takes place. The result is data sharing through the same cache. For this compiler, instead of creating two threads, the multiply/add are placed in the same thread


and execute serially there. This saves thread creation overhead and has no ill effect on parallelism, since the two operations are serial anyway.
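The two-thread ordering described above can be sketched as follows; the toy_work payload and the stand-in multiply/add bodies are assumptions used only to show the create-then-join order that keeps both operations on the same processor's cache.

#include <pthread.h>

typedef struct { double *data; int len; } toy_work;

static void *toy_multiply(void *arg)
{
    toy_work *w = (toy_work *)arg;
    for (int i = 0; i < w->len; i++)
        w->data[i] *= 2.0;                 /* stand-in multiply step */
    return NULL;
}

static void *toy_add(void *arg)
{
    toy_work *w = (toy_work *)arg;
    pthread_t mul;
    pthread_create(&mul, NULL, toy_multiply, w);   /* child created first ...      */
    pthread_join(mul, NULL);                       /* ... parent waits on it       */
    double sum = 0.0;
    for (int i = 0; i < w->len; i++)
        sum += w->data[i];                 /* add runs over data still in cache    */
    w->data[0] = sum;
    return NULL;
}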

B.6.3 Runtime Library

The runtime library thread functions were initially designed to use a semaphore to synchronize threads. Some developers recommend semaphores over thread joins [Lew], since the blocked routine does not have to wait for the thread function to exit before it can use a result. The problem with semaphores is that they are subject to races. For example, assume that the function function1 created a thread from the thread function thread1 and that function2 also created a thread from thread function thread1. After creating the threads, function1 and function2 block on a semaphore waiting for thread1 to set the semaphore. If the thread created by function1 completes first, function2 may detect the semaphore before function1 and assume that it was its own thread that completed first. This condition occurs in the quicksort program. One solution to this problem is to have every function dynamically create a unique semaphore and pass it to the thread function. Another solution is to create threads with a unique thread id and use pthread_join to wait on the thread. This second method was implemented since it removed all traces of the thread model from the runtime library, making the runtime library a little more general purpose. In its final configuration, granularity capabilities were added to a number of the runtime library routines; this change reintroduced the thread model to the runtime library.
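A minimal sketch of the join-based approach is shown below; worker and call_with_private_join are hypothetical names. Because each caller joins on the pthread_t it created itself, it can only be released by the completion of its own thread, which removes the race described above.

#include <pthread.h>

static void *worker(void *arg) { return arg; }   /* placeholder thread function */

static void *call_with_private_join(void *input)
{
    pthread_t tid;                      /* unique id owned by this caller   */
    void *result = NULL;
    pthread_create(&tid, NULL, worker, input);
    pthread_join(tid, &result);         /* waits on *this* thread only      */
    return result;
}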


The following is a list of all the runtime library routines. Some of the routines are for internal library use; others appear in the generated C code. These functions are important to the development of the compiler in that they provide much of the functionality expressed by SequenceL constructs, such as the taking expression. The code for these routines is in runtimel.c on the CDROM.

These are the sequence normalization functions.

char* norm(sequence*, int, int*);   /* return a normalized character string */
int do_normalize(sequence*);        /* check for normalization requirement */
void normalize(sequence*);          /* normalize a sequence */
int* cardinality(sequence*);        /* return an array containing a sequence's cardinality */

These are the arithmetic, relational, and logic runtime routines.

void* multiply(void*);                  /* threadable multiply function */
void* add(void*);                       /* threadable add function */
void* subtract(void*);                  /* threadable subtract function */
void* divide(void*);                    /* threadable divide function */
sequence* compute_seq(sequence*,int);   /* arithmetic function */
sequence* condition(int, sequence*);    /* relational function */
int convert_logic(sequence*);           /* convert relational result to something C understands */
sequence* logic(int, sequence*);        /* logical operation function */

These are the sequence runtime routines. They typically generate sequences, measure sequences, or return parts of a sequence.

sequence* sequence_atos(char*);                      /* convert a string to a sequence */
sequence* collect_sequences(char*,...);              /* collect sequences */
sequence* get_sequence(sequence*, sequence*, int);   /* get a sequence */
sequence* selection_seq(char*,...);                  /* sequence selection function */
sequence* get_row(sequence*, int);                   /* get a sequence row */
sequence* get_col(sequence*, int);                   /* get a sequence column */
sequence* Msizeof(sequence*,int,int);                /* get a sequence size */
sequence* remove_null(sequence*);                    /* remove empty sequences from a sequence */
sequence* sequence_ntos(int,int*,double*,sequence*); /* convert data to a sequence */

These are the queue runtime routines.


sequence* queue_reduc(char*,...);           /* reduce a queue to a sequence */
sequence* pop_queue(struct queue*);         /* pop a sequence from a queue */
void push_queue(sequence*,struct queue*);   /* push a sequence onto a queue */
int isempty_queue(struct queue*);           /* is the queue empty */
struct queue* selection_queue(char*,...);   /* queue selection function */
struct queue* collect_queues(char*,...);    /* collect queues */
struct queue* copy_queue(struct queue*);    /* copy queue to queue */
int queue_size(struct queue*);              /* return queue size */

The generative and taking routines.

void* taking(taking_data*);   /* taking function */
void* gen(gen_args*);         /* generative function */

Functions such as abs, sqrt, cos, sin, tan, log, and mod are the computational functions that apply some kind of mathematical operation to a sequence and return a result. Functions that manipulate or modify sequences or generate new sequences are reverse, transpose, rotateright, rotateleft, and cartesianproduct. Only cartesianproduct and transpose have been implemented; the rest will be added as needed.

sequence* cartesian(sequence*);   /* Cartesian product function */
sequence* transpose(sequence*);   /* transpose function */

This routine reads in sequences from input files.

sequence* get_input(char*);   /* get a sequence from a file */

These are miscellaneous runtime routines that support the routines listed above.

int** create_2d(int, int);             /* create a two-dimensional integer array */
void stringcpy(char**, char*);         /* copy to a string, expanding the original string size */
void stringncpy(char**, char*, int);   /* copy n chars to a string, expanding the original */
void stringcat(char**, char*);         /* cat a string to a string, expanding the original */
void stringncat(char**, char*, int);   /* cat n chars to a string, expanding the original */

These routines support stack manipulation and are used by the routines that maintain stacks.

int pop_int(TOP_INT*);         /* pop an integer off the stack */
void push_int(int,TOP_INT*);   /* push an integer onto the stack */
int pull_int(TOP_INT*);        /* pop an integer from the bottom of the stack */


int isempty_int(TOP_INT); /* is stack empty */

B.6.4 Linking a SequenceL Program

Once a SequenceL program has been compiled it can be linked. Assuming the runtime library file is in the parent directory of the current directory, the command to link the function files is the following.

cc -I.. -o main *.c ../runtimel.c -lpthread

The result is the SequenceL executable file main.


Quicksort IC Statements

1. _beginfunc:: quick s_1
2. _arg:: s_1 n
3. next:: n
4. null:: null
5. _seq:: null :: _t0
6. _seq:: _t0 s_1 :: _t1
7. =:: _t1 :: _t2
8. then:: :: _t2
9. _seq:: null :: _t3
10. _seq:: _t3 :: _t2
11. else:: :: _t2
12. _seq:: 1 :: _t4
13. _seq:: n _t4 :: _t5
14. >:: _t5 :: _t6
15. then:: :: _t6
16. _seq:: 1 :: _t7
17. _M:: s_1 _t7 :: _t8
18. _seq:: 1 :: _t9
19. _M:: s_1 _t9 :: _t10
20. _seq:: 1 :: _t11
21. _M:: s_1 _t11 :: _t12
22. _call:: great _t12 s_1 :: _t29
23. _call:: quick _t29 :: _t30
24. _call:: less _t8 s_1 :: _t31
25. _call:: quick _t31 :: _t32
26. _seq:: $ _t32 _t10 $ _t30 :: _t13
27. _seq:: _t13 :: _t6
28. else:: :: _t6
29. _seq:: s_1 :: _t14
30. _seq:: _t14 :: _t6
31. _endif:: :: _t6
32. _seq:: _t6 :: _t2
33. _endif:: :: _t2
34. _endfunc:: :: _t2
35. _beginfunc:: less s_1 s_2
36. _arg:: s_1
37. _arg:: s_2 n
38. next:: n
39. _seq:: 1 :: _t15
40. gen:: _t15 n :: _t16
41. from:: _t16 i
42. _M:: s_2 i :: _t17
43. _seq:: _t17 s_1 :: _t18
44. <:: _t18 :: _t19
45. then:: :: _t19
46. _M:: s_2 i :: _t20
47. _seq:: _t20 :: _t19
48. else:: :: _t19
49. _seq:: null :: _t21
50. _seq:: _t21 :: _t19
51. _endif:: :: _t19
52. _endfunc:: :: _t19
53. _beginfunc:: great s_1 s_2
54. _arg:: s_1
55. _arg:: s_2 n
56. next:: n
57. _seq:: 1 :: _t22
58. gen:: _t22 n :: _t23
59. from:: _t23 i
60. _M:: s_2 i :: _t24
61. _seq:: _t24 s_1 :: _t25
62. >:: _t25 :: _t26
63. then:: :: _t26
64. _M:: s_2 i :: _t27
65. _seq:: _t27 :: _t26
66. else:: :: _t26
67. _seq:: null :: _t28
68. _seq:: _t28 :: _t26
69. _endif:: :: _t26
70. _endfunc:: :: _t26
71. _call:: quick s_1 :: result

The code generated for quicksort consists of eight functions, which are listed below.

/* main.c */
#include <stdio.h>
#include <pthread.h>
#include <math.h>
#include <semaphore.h>
#include <runtimel.h>

int main(){
#include "main.h"
    s_1=get_input("s_1");
    result=quick(s_1);
    result=remove_nulls(result);
    printf("result = %s\n",result->string);
}

1. /* quick.c */
2. #include <stdio.h>
3. #include <pthread.h>
4. #include <math.h>
5. #include <semaphore.h>
6. #include <runtimel.h>

7. sequence *quick(sequence *s_1)
8. {
9. #include "quick.h"
10. n=Msizeof(s_1,1,1);
11. _t0=sequence_atos("[ ]");
12. _t1=collect_sequences("ss",_t0,s_1);
13. if(convert_logic(condition(EQUAL,_t1)))
14. {
15. _t3=sequence_atos("[ ]");
16. _t2=_t3;
17. }
18. else{
19. _t4=sequence_atos("[1]");
20. _t5=collect_sequences("ss",n,_t4);
21. if(convert_logic(condition(GREATER,_t5)))
22. {
23. _t7=sequence_atos("[1]");
24. _t8=selection_seq("ss",s_1,_t7);
25. _t9=sequence_atos("[1]");
26. _t10=selection_seq("ss",s_1,_t9);
27. _t11=sequence_atos("[1]");
28. _t12=selection_seq("ss",s_1,_t11);
29. _t30args=(_t30call_struct*)malloc(sizeof(_t30call_struct));
30. _t30args->_t12=_t12;
31. _t30args->s_1=s_1;
32. pthread_create(thr_t30, NULL, (void *)_t30agg, (void*)_t30args);
33. _t32args=(_t32call_struct*)malloc(sizeof(_t32call_struct));
34. _t32args->_t8=_t8;
35. _t32args->s_1=s_1;
36. pthread_create(thr_t32, NULL, (void *)_t32agg, (void*)_t32args);
37. pthread_join(*thr_t30,NULL);
38. pthread_join(*thr_t32,NULL);
39. _t32=_t32args->result;
40. _t30=_t30args->result;
41. _t13=collect_sequences("$ss$s",NULL,_t32,_t10,NULL,_t30);
42. _t6=_t13;
43. }
44. else{
45. _t14=sequence_atos(s_1->string);
46. _t6=_t14;
47. }
48. _t2=_t6;
49. }
50. return _t2;
51. }

1. /* less.c */
2. #include <stdio.h>
3. #include <pthread.h>
4. #include <math.h>
5. #include <semaphore.h>
6. #include <runtimel.h>

7. sequence *less(sequence *s_1,sequence *s_2)
8. {
9. #include "less.h"
10. n=Msizeof(s_2,1,1);
11. _t15=sequence_atos("[1]");
12. _t16gen.start=_t15;
13. _t16gen.end=n;
14. gen_thr_t16=(pthread_t*)malloc(sizeof(pthread_t));
15. pthread_create(gen_thr_t16, NULL, (void *)gen, (void*)&_t16gen);
16. pthread_join(*gen_thr_t16,NULL);
17. _t16=_t16gen.seq;
18. i=(taking_data*)malloc(sizeof(taking_data));
19. i->from=_t16;
20. i->var=1;
21. i->num_var=1;
22. take_thri=(pthread_t*)malloc(sizeof(pthread_t));
23. pthread_create(take_thri, NULL, (void *)taking, (void*)i);
24. pthread_join(*take_thri,NULL);
25. _t17=selection_queue("sq",s_2,i->result);
26. _t18=collect_queues("qs",_t17,s_1);
27. i_element=i->result->head;
28. _t19queue=(struct queue*)malloc(sizeof(struct queue));
29. _t19queue->head=NULL;
30. _t19queue->tail=NULL;
31. _t18_element=_t18->head;
32. size=queue_size(_t18);
33. thr_id=(pthread_t*)malloc(sizeof(pthread_t)*size);
34. while(_t18_element != NULL){
35. _t19args=(cond_arguments*)malloc(sizeof(cond_arguments));
36. _t19args->s_2=s_2;
37. _t19args->n=n;
38. _t19args->s_1=s_1;
39. _t19args->result=(sequence*)malloc(sizeof(sequence));
40. push_queue(_t19args->result,_t19queue);
41. _t19args->parameter=_t18_element->p;
42. _t19args->i=i_element->p;
43. i_element=i_element->next;
44. pthread_create(&thr_id[_t19_thread], NULL, (void *)_t19cond, (void*)_t19args);
45. _t18_element=_t18_element->next;
46. _t19_thread++;
47. }
48. while(_t19_thread--)
49. pthread_join(thr_id[_t19_thread],NULL);
50. _t19=queue_reduc("q",_t19queue);
51. return _t19;
52. }

1. /* great.c */
2. #include <stdio.h>
3. #include <pthread.h>
4. #include <math.h>
5. #include <semaphore.h>
6. #include <runtimel.h>

7. sequence *great(sequence *s_1,sequence *s_2)
8. {
9. #include "great.h"
10. n=Msizeof(s_2,1,1);
11. _t22=sequence_atos("[1]");
12. _t23gen.start=_t22;
13. _t23gen.end=n;
14. gen_thr_t23=(pthread_t*)malloc(sizeof(pthread_t));
15. pthread_create(gen_thr_t23, NULL, (void *)gen, (void*)&_t23gen);
16. pthread_join(*gen_thr_t23,NULL);
17. _t23=_t23gen.seq;
18. i=(taking_data*)malloc(sizeof(taking_data));
19. i->from=_t23;
20. i->var=1;
21. i->num_var=1;
22. take_thri=(pthread_t*)malloc(sizeof(pthread_t));
23. pthread_create(take_thri, NULL, (void *)taking, (void*)i);
24. pthread_join(*take_thri,NULL);
25. _t24=selection_queue("sq",s_2,i->result);
26. _t25=collect_queues("qs",_t24,s_1);
27. i_element=i->result->head;
28. _t26queue=(struct queue*)malloc(sizeof(struct queue));
29. _t26queue->head=NULL;
30. _t26queue->tail=NULL;
31. _t25_element=_t25->head;
32. size=queue_size(_t25);
33. thr_id=(pthread_t*)malloc(sizeof(pthread_t)*size);
34. while(_t25_element != NULL){
35. _t26args=(cond_arguments*)malloc(sizeof(cond_arguments));
36. _t26args->s_2=s_2;
37. _t26args->n=n;
38. _t26args->s_1=s_1;
39. _t26args->result=(sequence*)malloc(sizeof(sequence));
40. push_queue(_t26args->result,_t26queue);
41. _t26args->parameter=_t25_element->p;
42. _t26args->i=i_element->p;
43. i_element=i_element->next;
44. pthread_create(&thr_id[_t26_thread], NULL, (void *)_t26cond, (void*)_t26args);
45. _t25_element=_t25_element->next;
46. _t26_thread++;
47. }
48. while(_t26_thread--)
49. pthread_join(thr_id[_t26_thread],NULL);
50. _t26=queue_reduc("q",_t26queue);
51. return _t26;
52. }

1. /* _t19cond.c */
2. #include <stdio.h>
3. #include <pthread.h>
4. #include <math.h>
5. #include <semaphore.h>
6. #include <runtimel.h>

7. void* _t19cond(void *arg)
8. {
9. #include "_t19cond.h"
10. _t19args=(cond_arguments*)arg;
11. i=_t19args->i;
12. s_2=_t19args->s_2;
13. n=_t19args->n;
14. s_1=_t19args->s_1;
15. if(convert_logic(condition(LESS,_t19args->parameter)))
16. {
17. _t20=selection_seq("ss",s_2,i);
18. copy_seq(_t19args->result,_t20);
19. }
20. else{
21. _t21=sequence_atos("[ ]");
22. copy_seq(_t19args->result,_t21);
23. }
24. }

1. /* _t26cond.c */
2. #include <stdio.h>
3. #include <pthread.h>
4. #include <math.h>
5. #include <semaphore.h>
6. #include <runtimel.h>

7. void* _t26cond(void *arg)
8. {
9. #include "_t26cond.h"
10. _t26args=(cond_arguments*)arg;
11. i=_t26args->i;
12. s_2=_t26args->s_2;
13. n=_t26args->n;
14. s_1=_t26args->s_1;
15. if(convert_logic(condition(GREATER,_t26args->parameter)))
16. {
17. _t27=selection_seq("ss",s_2,i);
18. copy_seq(_t26args->result,_t27);
19. }
20. else{
21. _t28=sequence_atos("[ ]");
22. copy_seq(_t26args->result,_t28);
23. }
24. }

1. /* _t30agg.c */
2. #include <stdio.h>
3. #include <pthread.h>
4. #include <math.h>
5. #include <semaphore.h>
6. #include <runtimel.h>

7. void* _t30agg(void *arg)
8. {
9. #include "_t30agg.h"
10. _t30args=(_t30call_struct*)arg;
11. _t12=_t30args->_t12;
12. s_1=_t30args->s_1;
13. _t29=great(_t12,s_1);
14. _t30=quick(_t29);
15. _t30args->result=_t30;
16. }

1. /* _t32agg.c */
2. #include <stdio.h>
3. #include <pthread.h>
4. #include <math.h>
5. #include <semaphore.h>
6. #include <runtimel.h>

7. void* _t32agg(void *arg)
8. {
9. #include "_t32agg.h"
10. _t32args=(_t32call_struct*)arg;
11. _t8=_t32args->_t8;
12. s_1=_t32args->s_1;
13. _t31=less(_t8,s_1);
14. _t32=quick(_t31);
15. _t32args->result=_t32;
16. }
