Cell Broadband Engine Software Development Kit 2.1 Accelerated Library Framework Programmer’s Guide and API Reference Version 1.1 SC33-8333-01



Note

Before using this information and the product it supports, read the information in “Notices” on page 101.

Second Edition (March 2007)

This edition applies to version 2.1 of the Cell Broadband Engine Software Development Kit and to all

subsequent releases and modifications until otherwise indicated in new editions.

© Copyright International Business Machines Corporation 2006, 2007. All rights reserved.

US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract

with IBM Corp.


Contents

About this publication . . . . . . . . . v
How to send your comments . . . . . . . . . v

Part 1. ALF components . . . . . . . . . 1
Chapter 1. Overview of ALF external components . . . . . . . . . 3
Chapter 2. Compute task . . . . . . . . . 5
Chapter 3. Data transfer list . . . . . . . . . 7
Chapter 4. Work blocks . . . . . . . . . 9
Chapter 5. Buffers . . . . . . . . . 11
  Buffer types . . . . . . . . . 11
  Local memory allocation and address calculations . . . . . . . . . 14
  Memory constraints . . . . . . . . . 16
Chapter 6. Accelerator data partitioning . . . . . . . . . 17
Chapter 7. Synchronization points . . . . . . . . . 19
Chapter 8. Error handling . . . . . . . . . 21

Part 2. ALF API reference . . . . . . . . . 23
Chapter 9. Host API . . . . . . . . . 25
  Basic framework API . . . . . . . . . 25
    alf_handle_t . . . . . . . . . 25
    ALF_ERR_POLICY_T . . . . . . . . . 25
    alf_configure . . . . . . . . . 26
    alf_query_system_info . . . . . . . . . 26
    alf_init . . . . . . . . . 28
    alf_exit . . . . . . . . . 29
    alf_register_error_handler . . . . . . . . . 30
  Compute task API . . . . . . . . . 31
    alf_task_handle_t . . . . . . . . . 31
    alf_task_context_handle_t . . . . . . . . . 31
    alf_task_info_t . . . . . . . . . 31
    alf_task_create . . . . . . . . . 32
    alf_task_context_create . . . . . . . . . 33
    alf_task_context_add_entry . . . . . . . . . 35
    alf_task_context_register . . . . . . . . . 36
    alf_task_query . . . . . . . . . 36
    alf_task_wait . . . . . . . . . 37
    alf_task_destroy . . . . . . . . . 38
  Work block API . . . . . . . . . 39
    Data structures . . . . . . . . . 39
    alf_wb_create . . . . . . . . . 39
    alf_wb_enqueue . . . . . . . . . 40
    alf_wb_add_parm . . . . . . . . . 41
    alf_wb_add_io_buffer . . . . . . . . . 41
    alf_wb_sync . . . . . . . . . 42
    sync_callback_func . . . . . . . . . 44
    alf_wb_sync_wait . . . . . . . . . 44
Chapter 10. Accelerator API . . . . . . . . . 47
  alf_comp_kernel . . . . . . . . . 47
  alf_prepare_input_list . . . . . . . . . 47
  alf_prepare_output_list . . . . . . . . . 48
  ALF_DT_LIST_CREATE . . . . . . . . . 49
  ALF_DT_LIST_ADD_ENTRY . . . . . . . . . 49
Chapter 11. Cell/B.E. architecture platform-dependent API . . . . . . . . . 51
  alf_task_info_t_CBEA . . . . . . . . . 51

Part 3. Programming with ALF . . . . . . . . . 53
Chapter 12. Understand the problem . . . . . . . . . 55
Chapter 13. Data layout and partition design for the ALF implementation on Cell/B.E. . . . . . . . . . 57
Chapter 14. Double buffering on ALF . . . . . . . . . 59
Chapter 15. ALF host application and data transfer lists . . . . . . . . . 61
Chapter 16. Debugging and tuning . . . . . . . . . 63
Chapter 17. Matrix addition example . . . . . . . . . 65
  Partition scheme . . . . . . . . . 66
  Example compute kernel . . . . . . . . . 68
  The main thread and data transfer lists . . . . . . . . . 68
Chapter 18. Matrix transpose example . . . . . . . . . 71
  Partition scheme . . . . . . . . . 71
  Example compute kernel . . . . . . . . . 73
  The main thread and data transfer lists . . . . . . . . . 73
  Debugging and tuning . . . . . . . . . 76
Chapter 19. Vector min-max example . . . . . . . . . 79
  Partition scheme . . . . . . . . . 80
  Task context buffer . . . . . . . . . 80
  Overlapped I/O buffer . . . . . . . . . 81
  Barrier . . . . . . . . . 81
  The code list . . . . . . . . . 81

Part 4. Platform specific constraints for the ALF implementation on Cell/B.E. architecture . . . . . . . . . 87
Chapter 20. SPU resource reserved and used . . . . . . . . . 89
Chapter 21. Memory constraints . . . . . . . . . 91
Chapter 22. Data transfer list limitations . . . . . . . . . 93

Part 5. Compile time options . . . . . . . . . 95

Part 6. Appendixes . . . . . . . . . 97
Appendix. Accessibility features . . . . . . . . . 99
Notices . . . . . . . . . 101
  Trademarks . . . . . . . . . 103
  Terms and conditions . . . . . . . . . 104
Related documentation . . . . . . . . . 105
Glossary . . . . . . . . . 107
Index . . . . . . . . . 109


About this publication

This book provides detailed information regarding the use of the Accelerated

Library Framework APIs. It contains an overview of the Accelerated Library

Framework, detailed reference information about the APIs, and usage information

for programming with the APIs.

For information about the accessibility features of this product, see “Accessibility

features,” on page 99.

Who should use this book

This book is intended for use by accelerated library developers and compute

kernel developers.

Related information

See “Related documentation” on page 105.

How to send your comments

Your feedback is important in helping to provide the most accurate and highest

quality information. If you have any comments about this publication, send your

comments using Resource Link™ at http://www.ibm.com/servers/resourcelink.

Click Feedback on the navigation pane. Be sure to include the name of the book,

the form number of the book, and the specific location of the text you are

commenting on (for example, a page number or table number).


Part 1. ALF components

The Accelerated Library Framework (ALF) application programming interface

(API) provides a set of functions to solve parallel problems on multi-core memory

hierarchy systems. This programmer’s guide addresses the ALF implementation on

the Cell Broadband Engine™ (Cell/B.E.™) architecture.

Overview of ALF

ALF supports the single-program-multiple-data (SPMD) programming style with a

single program running on all allocated accelerator elements at one time. ALF

provides an interface to write data parallel applications without requiring

architecturally dependent code. The ALF APIs are designed to be platform

independent, and currently only the Cell/B.E. implementation is supported.

Features of ALF include data transfer management, parallel task management,

double buffering, and data partitioning.

ALF considers a natural division of labor between the two types of processing

elements in a hybrid system: the host element and the accelerator element. Also,

two different types of tasks are defined in a typical parallel program: the control

task and the compute task. The control task resides on the host element, while the

compute task resides on the accelerator element. The PowerPC® Processing

Element (PPE) is considered the host, and the Synergistic Processor Elements (SPEs) are considered the accelerators. This division of labor enables programmers to

specialize in different parts of a given parallel workload.

ALF defines three different types of work that can be assigned to the following

types of programmers:

Application developer

At the highest level, the application developer programs only at the host

level. Application programmers can use the provided accelerated libraries

without direct knowledge of the inner workings of the hybrid system.

Accelerated library developer

Using the provided ALF APIs, the accelerated library developers provide

the library interfaces to invoke the compute kernels on the accelerators.

Accelerated library developers are responsible for breaking the problem

into the control process running on the host and the compute kernel

running on the accelerators. Accelerated library developers then partition

the input and output into work blocks that ALF can schedule to run on

different accelerators.

Compute kernel developer

At the accelerator level, the compute kernel developer writes optimized

accelerator code. The ALF API provides a common interface for the

compute task to be invoked automatically by the framework.

The ALF APIs were inspired by the observation that many applications targeted for

Cell/B.E. or multi-core computing follow the general usage pattern of breaking up

a set of data into a set of independent tasks, creating a list of data to be computed

by code on the SPE, and then managing the distribution of that data to the various

SPE processes. This type of control process/compute process usage scenario, along

with the corresponding work queue definition, are the fundamental abstractions in


ALF. The framework design also enables a separation of work. Compute kernel

developers focus on the compute process, while the accelerated library developers

focus on the data partitioning strategy. Because the runtime framework handles the

underlying task management, data movement, and error handling, the focus is on

the kernel and the data partitioning, not the direct memory access (DMA) list

creation or the lock management on the work queue.


Chapter 1. Overview of ALF external components

With the provided ALF API, you can create work blocks and put them on a queue; the ALF runtime on the host then assigns the work blocks to the accelerators.

Figure 1 provides an overview of the different external components in the ALF. The

main programming construct is a compute task that is run in parallel on the

accelerators. Different input data is entered into this compute task, and the

accelerators run the task and return the output data based on the given input. To

run the compute tasks in parallel, the input data and the corresponding output

data are divided into separate portions, called work blocks. The accelerators

process the assigned work blocks and send the output, per work block, back to the

host element.

[Figure 1 shows the ALF components: on the host, the main application and acceleration library use the host API and the ALF host runtime to partition input and output data into work blocks on a work queue; on each accelerator, the ALF accelerator runtime invokes the compute kernel of the compute task through the accelerator API.]

Figure 1. Overview of ALF


Chapter 2. Compute task

A compute task is constructed by linking the compute kernel code with the ALF

accelerator runtime code. The ALF accelerator runtime code provides the main

entry point and calls the compute kernel code when input data is ready. The runtime assumes a single default entry point to the compute kernel, named alf_comp_kernel.

When data partition descriptions need to be generated on the accelerators, the

runtime supports APIs that enable you to generate your own data transfer

descriptions on accelerators.


Chapter 3. Data transfer list

A data transfer list contains entries that consist of the data size and a pointer to the

host memory location of the data.

For many applications, the input data for a single compute kernel cannot be stored

contiguously in the host memory. For example, in the case of a multi-dimensional

matrix, the matrix is usually partitioned into smaller submatrixes for the

accelerators to process. For certain data partitioning schemes, the data of a

submatrix is scattered to different memory locations in the data space of the large

matrix. Accelerator memory is usually limited, and the most efficient way to store

the submatrix is contiguously. Data for each row or column of the submatrix is put

together in a contiguous buffer. For input data, the scattered pieces are gathered to the local

memory of the accelerator from scattered host memory locations. With output data,

the above situation is reversed, and the data in the local memory of the accelerator

is scattered to different locations in host memory.

The complexity of data movement patterns can vary. For example, in the case of a

two-dimensional matrix, the movement pattern can be described by a base pointer,

a column width, a row count, and a stride length; while for certain fast Fourier

transform (FFT) kernels, the host addresses of data are derived from very complex

butterfly exchanging paths. To address these complexities, these operations are

represented as a data transfer list. The data in the local memory of the accelerator

is always packed and is organized in the order of the entries in the list. For input

data, the data transfer list describes a data gathering operation. For output data,

the data transfer list describes a scattering operation. See Figure 2 for a diagram of

a data transfer list.

[Figure 2 shows a data transfer list mapping packed entries A through H in accelerator memory to their scattered, out-of-order locations in host memory.]

Figure 2. Data transfer list


Chapter 4. Work blocks

A work block represents related input data, output data, and parameters. The

input and output data are described by corresponding data transfer lists. The

parameters are provided through ALF APIs. Depending on the application, the

data transfer list can either be generated on the host (host data partition) or by the

accelerators (accelerator data partition).

Before calling the compute kernel, the ALF accelerator runtime retrieves the

parameters and the input data based on the input data transfer list from the input

buffer in host memory. After calling the compute kernel, the ALF accelerator

runtime puts the output result back into the host memory. The ALF accelerator

runtime manages the memory of the accelerator to accommodate the input and

output data. The ALF accelerator runtime also supports overlapping data transfers

and computations transparently through double buffering techniques if there is

enough free memory.

Single-use work block

A single-use work block is processed only once. A single-use work block enables

you to generate input and output data transfer lists on either the host or the

accelerator.

Multi-use work block

A multi-use work block is processed repeatedly for a specified number of iterations. Unlike

the single-use work block, the multi-use work block does not enable you to

generate input and output data transfer lists from the host process. For multi-use

work blocks, all input and output data transfer lists must be generated on the

accelerators each time a work block is processed by the ALF runtime. The ALF

runtime passes the parameters, total number of iterations, and current iteration

count to the accelerator data partition subroutines. See Chapter 6, “Accelerator data

partitioning,” on page 17 for more information about single-use work blocks and

multi-use work blocks.


Chapter 5. Buffers

On the accelerator, the ALF accelerator runtime manages the data of the work

blocks for the compute kernel. The compute kernel developers only need to focus

on the organization of data and the actual compute code. Buffer management and

data movement are handled by the ALF accelerator runtime. However, it is still

important that the programmers have a good understanding of the usage of each

buffer and their relationship with the compute task.

Buffer types

The ALF accelerator runtime code provides handles to the following five different

buffers for each instance of a compute task:

Task context buffer

A task context buffer is used by applications that require common persistent data

buffers that can be referenced by all work blocks. It is also useful for merge or

all-reduce operations. It is a shared buffer that is allocated when an instance of the

compute task is started on the accelerator. The task context consists of two optional

sections that are concatenated into a contiguous buffer. One section is for read-only

access, the other section is writable and can be modified by the compute task while

processing the work blocks. If there is a read-only section, it will be placed before

the writable section. See Figure 3 on page 12. The writable task context is returned

to host memory after the ALF runtime code finishes processing a compute task. If

other processes update this buffer in host memory when the task context buffer is

returned to host memory, ALF does not ensure data consistency. To avoid data

coherency problems, create a unique writable context buffer in host memory for

each task instance.


Work block parameter and context buffer

The work block parameter and context buffer serves two purposes:

• It passes work block specific constants or parameters that are passed by value.

• It reserves storage space for the compute task to save the specific context data of the work block.

This buffer can be used by the alf_comp_kernel accelerator routine, the

alf_prepare_input_list accelerator routine, or the alf_prepare_output_list

accelerator routine. The parameters are copied to an internal buffer associated with

the work block data structure in host memory when the alf_wb_add_parm routine is invoked.

Figure 4 gives an illustration of the buffer layout when the task only has dedicated

input and output buffers. The buffers in this case are not guaranteed to be adjacent

in memory.

[Figure 3 shows the task context buffer: the host/task main thread holds one shared read-only context and a separate writable context buffer for each accelerator task instance; each instance processes its stream of work blocks (WB) against its own writable context.]

Figure 3. Task context buffer

[Figure 4 shows a work block with only dedicated input and output buffers: the parameter data, input data, output data, and task context (read-only and writable) sections each occupy their own region of contiguous local memory, referenced by their respective buffer pointers; the regions are not adjacent to one another.]

Figure 4. Work block with only input and output buffers


Work block input data buffer

The work block input data buffer contains the input data for each work block (or

each iteration of a multi-use work block) for the compute kernel. For each instance

of the ALF compute kernel, there is a single contiguous input data buffer.

However, the input buffer can consist of a collection of data from distinct memory

segments set in host memory. These data buffers are gathered into the input data

buffer on the accelerators. The ALF runtime code minimizes performance overhead

by not duplicating input data unnecessarily. When the contents of the work block are constructed by the alf_wb_add_io_buffer routine, only the pointers to the input

data are saved to the internal data structure of the work block. This data is

transferred to the memory of the accelerator when the work block is processed. A

pointer to the contiguous input buffer in the memory of the accelerator is passed

to the compute kernel. For more information about data scattering and gathering,

see Chapter 3, “Data transfer list,” on page 7.

Work block output data buffer

This buffer is used to save the output of the compute kernel. It is a single

contiguous buffer in the memory of the accelerator. Output data can be transferred

to distinct memory segments in host memory. After the compute kernel returns

from processing one work block, the data in this buffer is moved to the host

memory locations specified by the alf_wb_add_io_buffer routine when the work

block is constructed.

Work block overlapped input and output data buffer

This buffer contains both input and output data. It is dynamically allocated for

each work block. However, when this buffer is declared, the buffer organization on

the accelerator is different from the dedicated-buffers case. The input, overlapped, and output buffers are concatenated as one contiguous buffer, where the dedicated input buffer is the first

section, the overlapped buffer follows, and the dedicated output buffer is the third

section. Only two pointers are passed to the compute kernel: the input data pointer

and the output data pointer. The input data pointer points to the beginning of the

contiguous buffer. The output data pointer points to the beginning of the

overlapped data buffer.

There are now two contiguous buffers:

• The dedicated input buffer plus the overlapped buffer for input data

• The overlapped buffer plus the dedicated output buffer for output data

Remember that the two buffers are overlapped, similar to the implementation of

the memmove C runtime function.

Figure 5 on page 14 shows the buffer layout when the task has all three types of

data buffers. The input, overlapped, and output buffers are concatenated to a

single contiguous buffer. If you need a pointer to the dedicated output buffer, you

can calculate it based on the output data pointer and the known size of the

overlapped data buffer.


There can be another special case with the overlapped buffers when there is no

dedicated input buffer. In this special case, the input and output data pointers both point to the beginning of the overlapped buffer. Figure 6 shows the buffer layout

when there is no dedicated input buffer.

For accelerator data partition, all input and output data transfer lists for the

overlapped buffer can be defined through the alf_prepare_input_list function

and the alf_prepare_output_list function. This provides you with the necessary

flexibility to organize data in situations where accelerator memory is limited.

Overlapped buffers allow you to maximize the use of local memory in computing

task scenarios where the temporarily copied input data can be overwritten or the

output buffer is also the input to the computation. For example, to compute C = A

+ B when you know that the input data B can be overwritten by the result C

during the computations, define the data buffers of B and C as one overlapped

buffer. This eliminates the need for a dedicated output buffer for C in local memory, which can save a significant amount of memory. You can then

increase the size of the work blocks to support double buffering if the local

memory is too limited to do that when C is in a dedicated buffer.

Local memory allocation and address calculations

Buffers are allocated according to the size given by the alf_task_info_t data

structure. When the corresponding data transfer lists do not cover the whole

buffer, there could be unused memory regions at the end of the corresponding

section of the buffer.

Task context buffer

All of the data added by the alf_task_context_add_entry

(ALF_TASK_CONTEXT_READ) function is written to local memory starting at the

alf_comp_kernel (p_task_context) address. The size of the data will not exceed

alf_task_info_t.task_context_buffer_read_only_size.

[Figure 5 shows the layout when the dedicated input, overlapped input and output, and dedicated output sections are concatenated into one contiguous local memory buffer; the parameter data and the task context (read-only and writable) occupy their own contiguous regions, referenced by their respective buffer pointers.]

Figure 5. Work block with input data buffer, overlapped input and output data buffer, and output data buffer

[Figure 6 shows the layout with no dedicated input buffer: the input and output buffer pointers both point to the start of the overlapped input and output section, which is followed by the dedicated output section; the parameter data and task context occupy their own contiguous regions.]

Figure 6. Work block with overlapped I/O buffer and no dedicated input buffer


All of the data added by the alf_task_context_add_entry

(ALF_TASK_CONTEXT_WRITABLE) function is written to or retrieved from local

memory starting at the alf_comp_kernel (p_task_context) +

alf_task_info_t.task_context_buffer_read_only_size address. The size of the

data will not exceed alf_task_info_t.task_context_buffer_writable_size.

Work block parameter and context buffer

All of the data added by the alf_wb_add_parm function is written to local memory

starting at the alf_comp_kernel (p_parm_ctx_buffer) address. The size of the data

will not exceed alf_task_info_t.parm_ctx_buffer_size.

Work block input data buffer

All of the data added by the alf_wb_add_io_buffer (ALF_BUFFER_INPUT) function is

written to local memory starting at the alf_comp_kernel (p_input_buffer)

address. The size of the data will not exceed alf_task_info_t.input_buffer_size.

When an accelerator data partition is used, the ALF_DT_LIST_CREATE (io_buffer

offset) address offset value is based on alf_comp_kernel (p_input_buffer). The

size of the data will not exceed alf_task_info_t.input_buffer_size +

alf_task_info_t.overlapped_buffer_size. There can be multiple data transfer lists

to retrieve data from different offsets of the input buffer. These data lists might

target the same location on the host memory. However, if you write to the same

host memory location with different data transfer lists, ALF does not guarantee

data consistency.

Work block output data buffer

All of the data added by the alf_wb_add_io_buffer (ALF_BUFFER_OUTPUT) function

is written to local memory starting at the alf_comp_kernel (p_output_buffer) +

alf_task_info_t.overlapped_buffer_size address. The size of the data will not

exceed alf_task_info_t.output_buffer_size.

When an accelerator data partition is used, the ALF_DT_LIST_CREATE (io_buffer

offset) address offset value is based on alf_comp_kernel (p_output_buffer). The

size of the data will not exceed

alf_task_info_t.overlapped_buffer_size+alf_task_info_t.output_buffer_size.

There can be multiple data transfer lists to write data to host memory. Each of

these data transfer lists can start from a different offset of the output buffer. These

data lists might overlap each other.

Work block overlapped input and output data buffer

All of the data added by the alf_wb_add_io_buffer (ALF_BUFFER_INOUT) function is

written to local memory starting at the alf_comp_kernel (p_output_buffer)

address. The size of the data will not exceed

alf_task_info_t.overlapped_buffer_size.

When an accelerator data partition is used, there are no dedicated APIs for the

overlapped buffer. For the input part, it is combined with the input data buffer

API. For the output part, it is combined with the output data API.


Memory constraints

To make the most efficient use of accelerator memory, the ALF runtime needs to

know the memory usage requirements of the compute task. The ALF runtime

requires that you specify the memory resources each compute task uses. The

runtime can then allocate the requested memory for the compute task.
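As a rough illustration of such a memory budget, the sketch below is a hypothetical check (not an ALF API): it assumes the runtime double-buffers the work block buffers described in alf_task_info_t, and uses the 256 KB Cell/B.E. SPE local store size as the limit. The function name, the 2x factor, and the code_stack_ctx estimate are all assumptions for illustration.

```c
#include <assert.h>

/* Hypothetical budget check (not an ALF API). Assumption: with
 * transparent double buffering the runtime may keep two copies of the
 * work block buffers in accelerator memory, next to code, stack, and
 * the task context. The 256 KB figure is the Cell/B.E. SPE local
 * store size. All sizes are in bytes. */
int fits_local_store(unsigned int parm_ctx, unsigned int in_buf,
                     unsigned int overlap, unsigned int out_buf,
                     unsigned int code_stack_ctx)
{
    const unsigned long long local_store = 256ull * 1024;
    unsigned long long need =
        2ull * (parm_ctx + in_buf + overlap + out_buf) + code_stack_ctx;
    return need <= local_store;
}
```

A task with 64 KB input and 32 KB output buffers fits under these assumptions; doubling both does not.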

Chapter 6. Accelerator data partitioning

When the data partition schemes are complex and require a lot of computing

resources, it can be more efficient to generate the data transfer lists on the

accelerators. This is especially useful if the host computing resources can be used

for other work or if the host does not have enough computing resources to

compute data transfer lists for all of its work blocks.

Data partition subroutines

Accelerated library developers must provide the alf_prepare_input_list

subroutine and the alf_prepare_output_list subroutine to partition the input

and output data and generate the corresponding data transfer lists. The

alf_prepare_input_list is the input data partitioning subroutine and the

alf_prepare_output_list is the output data partitioning subroutine.

Number of data transfer list entries

Because dynamic memory allocation can be inefficient on the accelerator, it is

important to explicitly specify the maximum number of entries that a data

transfer list on the accelerator can contain. The ALF runtime can then allocate

a buffer to hold the data transfer lists before calling the data partition

subroutines. Input and output data transfer lists are generated and used at

different times, so by specifying the size of the larger list, the buffer can be

reused between input and output data transfer lists.

Host memory addresses

The host does not generate the data transfer lists when using accelerator data

partitioning, so the addresses of input and output data buffers must be explicitly

passed to the accelerator through the parameter and context buffer.

Single-use and multi-use work blocks

Based on the characteristics of an application, you can use single-use work blocks

or multi-use work blocks to efficiently implement data partitioning on the

accelerators. For a given task that can be partitioned into N work blocks, the

following illustrates how the two different types of work blocks can be used:

v Single-use work block: N work blocks with only the parameter and context

buffer are created on the host. The input parameters must contain the necessary

information to generate the corresponding input and output data transfer lists

for each small block of data. The ALF runtime calls the alf_prepare_input_list

function, the alf_comp_kernel function, and the alf_prepare_output_list

function sequentially for each work block.

v Multi-use work block: One multi-use work block that is processed N times is

created on the host. The ALF runtime then calls the alf_prepare_input_list

function, the alf_comp_kernel function, and the alf_prepare_output_list

function N times for each multi-use work block.

Figure 7. Single-use work block

Modification of the parameter and context buffer during

multi-use work blocks

The parameter and context buffer of a multi-use work block is shared by multiple

invocations of the alf_prepare_input_list accelerator function and the

alf_prepare_output_list accelerator function. Use care when changing the

contents of this buffer. Because the ALF runtime does double buffering

transparently, it is possible that the current_count arguments for succeeding calls

to the alf_prepare_input_list function, the alf_comp_kernel function, and the

alf_prepare_output_list function are not strictly incremented when a multi-use

work block is processed. Because of this, modifying the parameter and context

buffer according to the current_count in one of the subroutines might cause

unexpected effects to other subroutines when they are called with different

current_count values at a later time.
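The non-monotonic current_count behavior described above can be pictured with a small table; the schedule below is only one possible double-buffered interleaving, invented for illustration, and not a guarantee of what the ALF runtime actually does.

```c
#include <assert.h>

/* Hypothetical double-buffered call schedule for a multi-use work
 * block processed 3 times (counts 0..2). The actual ALF schedule is
 * unspecified; the point is only that the current_count values seen
 * across the three subroutines need not be strictly increasing. */
struct call { const char *fn; int current_count; };

const struct call schedule[] = {
    {"alf_prepare_input_list",  0},
    {"alf_prepare_input_list",  1}, /* next input prefetched early */
    {"alf_comp_kernel",         0},
    {"alf_prepare_input_list",  2},
    {"alf_comp_kernel",         1},
    {"alf_prepare_output_list", 0}, /* count drops back to 0 here */
    {"alf_comp_kernel",         2},
    {"alf_prepare_output_list", 1},
    {"alf_prepare_output_list", 2},
};
```

A subroutine that mutated the shared parameter and context buffer at step 3 (count 2) would already have affected the kernel and output calls still pending for counts 0 and 1.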

Figure 8. Multi-use work block

Chapter 7. Synchronization points

Synchronization points introduce ordering into the processing flow of the work

blocks. Two types of synchronization points are supported: barrier and notify. Both

support synchronous event notification by callback and asynchronous event

notification by query.

Barrier

A barrier ensures that all work blocks enqueued before this point are finished

before new work blocks added after the barrier can be processed on any of the

accelerators. If a callback function is registered to this synchronization point, the

work queue processing will not continue until the callback function returns.

Notify

A notify synchronization point provides a mechanism to query and run a specific

piece of code when this point is reached. The ALF runtime will generate a

notification message that can be queried on the host and it will also invoke the

registered callback function when this synchronization point is reached. However,

it does not ensure the order of work block completion.

Callback and query

Callback and query features are listed below:

v You can do work block-related memory region modification in barrier callbacks

because no work block is processed during that time. For example, the callback

function can write to the input or output data area that might be referred to by

later work blocks.

v Query only returns the current results after the corresponding callback has

returned. This is true for both notify and barrier.

v Only the alf_task_query API is supported in the callback function. Calls to

other APIs might result in errors or undetermined behaviors.

v All callbacks, including notify callbacks, are serialized. This means a new

callback is not allowed when an existing callback has not returned, and the

callbacks will be invoked in the order that the corresponding synchronization

points are added into the work queue.

Chapter 8. Error handling

ALF provides a limited capability to handle runtime errors. Upon encountering an

error, the ALF runtime tries to free up resources, then exits by default. To allow the

accelerated library developers to handle errors in a more graceful manner, you can

register a callback error handler function to the ALF runtime. Depending on the

type of error, the error handler function can direct the ALF runtime to retry the

current operation, stop the current operation, or shut down. These are controlled

by the return values of the callback error handler function.

When several errors happen in a short time or at the same time, the ALF runtime

attempts to invoke the error handler in sequential order.

Possible runtime errors include the following:

v Compute task runtime errors such as bus errors, memory allocation failures,

deadlocks, and others

v Detectable internal data structure corruption errors, which might be caused by

improper data transfers or access boundary issues

v Application detectable/catchable errors

Standard error codes on supported platforms are used for return values when an

error occurs. For this implementation, the standard C/C++ header file, ″errno.h″, is

used. The API definitions in Part 2, “ALF API reference,” on page 23 list the

possible error codes.

Part 2. ALF API reference

Conventions

ALF and alf are the namespace prefixes for ALF. Normal function

prototypes and data structure declarations use all lowercase characters with

underscores (_) separating the words. Macro definitions use all uppercase

characters with underscores separating the words.

Data type assumptions

int This data type is assumed to be signed by default on both the host

and accelerator. The size of this data type is defined by the

Application Binary Interface (ABI) of the architecture. However, the

minimum size of this data type is 32 bits. Note that the actual size of

this data type might differ between the host and the accelerator

architectures.

unsigned int This data type is assumed to be the same size as that of int.

char This data type is not assumed to be signed or unsigned. The size of

this data type, however, must be 8 bits.

long This data type is not used in the API definitions because it might not

be uniformly defined across platforms.

void * The size of this data type is defined by the ABI of the corresponding

architecture and compiler implementation. Note that the actual size of

this data type might differ between the host and accelerator

architectures.

Platform-dependent auxiliary APIs or data structures

The basic APIs and data structures of ALF are designed with cross-platform

portability in mind. Platform-dependent implementation details are not exposed in

the core APIs.

Common data structures

The enumeration type ALF_DATA_TYPE_T defines the data types for data movement

operations between the hosts and the accelerators. The ALF runtime does byte

swapping automatically if the endianness of the host and the accelerators are

different.

ALF_DATA_BYTE For data types that are independent of byte order

ALF_DATA_INT16 For two-byte signed/unsigned integer types

ALF_DATA_INT32 For four-byte signed/unsigned integer types

ALF_DATA_INT64 For eight-byte signed/unsigned integer types

ALF_DATA_FLOAT For four-byte floating-point types

ALF_DATA_DOUBLE For eight-byte floating-point types
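The data types above pair each element with a fixed width so the runtime knows how to swap bytes when host and accelerator endianness differ. The sketch below is not ALF code; it only illustrates the kind of width-based swap the runtime performs for ALF_DATA_INT16/INT32/INT64 (and the same widths for FLOAT/DOUBLE); the function names are invented.

```c
#include <stdint.h>

/* Illustration only (not ALF code): width-based byte swaps of the
 * kind applied when host and accelerator endianness differ.
 * ALF_DATA_BYTE elements are never swapped. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

uint64_t swap64(uint64_t v)
{
    /* Swap each 32-bit half, then exchange the halves. */
    return ((uint64_t)swap32((uint32_t)(v & 0xFFFFFFFFu)) << 32) |
           (uint64_t)swap32((uint32_t)(v >> 32));
}
```

This is why declaring a buffer as ALF_DATA_INT32 rather than ALF_DATA_BYTE matters on mixed-endian configurations: the byte order of each 4-byte element is reversed in transit.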

The constant ALF_NULL_HANDLE is used to indicate a non-initialized handle in the

ALF runtime environment. All handles should be initialized to this value to avoid

ambiguity in code semantics.

ALF runtime APIs that create handles always return results through pointers to

handles. After the API call is successful, the original content of the handle is

overwritten. Otherwise, the content is kept unchanged. ALF runtime APIs that

destroy handles modify the contents of handle pointers and initialize the contents

to ALF_NULL_HANDLE.
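The create/destroy convention above can be mimicked with ordinary pointers. The sketch below uses invented stand-in names (my_handle_t, my_create, my_destroy), not ALF types: a create-type call overwrites the handle only on success, and a destroy-type call resets it to the null value.

```c
#include <stddef.h>
#include <stdlib.h>

/* Illustrative stand-ins (not ALF types) for the handle convention. */
#define MY_NULL_HANDLE NULL
typedef void *my_handle_t;

int my_create(my_handle_t *p_handle, int fail)
{
    if (fail)
        return -1;              /* error: *p_handle is left unchanged */
    *p_handle = malloc(16);     /* success: original content overwritten */
    return *p_handle ? 0 : -1;
}

int my_destroy(my_handle_t *p_handle)
{
    free(*p_handle);
    *p_handle = MY_NULL_HANDLE; /* reset to avoid future misuse */
    return 0;
}
```

Initializing every handle to the null value, as the text recommends, makes it possible to test whether a failed create left the handle untouched.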

Chapter 9. Host API

The host API includes the basic framework API, the compute task API, and the

work block API.

Basic framework API

The following API definitions are the basic framework APIs.

alf_handle_t

This data structure is used as a reference to one instance of the ALF runtime. The

data structure is initialized by calling the alf_init API call and is destroyed by

alf_exit.

Example

{

alf_handle_t half = ALF_NULL_HANDLE; // initialize

// do something here

if(ALF_NULL_HANDLE == half) // and check

{

fprintf(stderr, "The ALF handle is not initialized!\n");

}

}

ALF_ERR_POLICY_T

This is a callback function prototype that can be registered to the ALF runtime for

customized error handling.

Synopsis

ALF_ERR_POLICY_T(*alf_error_handler_t)(void *p_context_data, int

error_type, int error_code, char *error_string)

Parameters

p_context_data [IN] A pointer given to the ALF runtime when the error handler is

registered. The ALF runtime passes it to the error handler

when the error handler is invoked. The error handler can use

this pointer to keep its private data.

error_type [IN] A system-wide definition of error type codes, including the

following:

v ALF_ERR_FATAL: Cannot continue, the framework must shut

down.

v ALF_ERR_EXCEPTION: You can choose to retry or skip the

current operation.

v ALF_ERR_WARNING: You can choose to continue by ignoring

the error.

error_code [IN] A type-specific error code.

error_string [IN] A C string that holds a printable text string that provides

information about the error.

Return values

ALF_ERR_POLICY_RETRY Indicates that the ALF runtime should retry the operation that

caused the error. If a severe error occurs and the ALF runtime

cannot retry this operation, it will report an error and shut

down.

ALF_ERR_POLICY_SKIP Indicates that the ALF runtime should stop the operation that

caused the error and continue processing. If the error is severe

and the ALF runtime cannot continue, it will report an error

and shut down.

ALF_ERR_POLICY_ABORT Indicates that the ALF runtime must stop the operations and

shut down.

ALF_ERR_POLICY_IGNORE Indicates that the ALF runtime will ignore the error and

continue. If the error is severe and the ALF runtime cannot

continue, it will report an error and shut down.

Example

See “alf_register_error_handler” on page 30 for an example of this function.

alf_configure

This function configures the ALF runtime code according to the system

configuration provided. This API must be the first one called in any instance of

applications using the ALF runtime code. Depending on the platforms, ALF might

automatically detect the system configuration or might require the application

developer to provide some strings or a pointer to configuration files where the

ALF runtime code can get information about the system.

Synopsis

int alf_configure(alf_sys_config_t *p_configuration)

Parameters

p_configuration

[IN]

A platform-dependent configuration information place holder that ALF

uses to get the necessary system configuration data. In the current

Cell/B.E. architecture implementation, this argument is not used. A

NULL pointer should be given. On other platforms, specific definitions of

the data structure must be documented accordingly and the caller is

responsible to fill the data structure.

Return values

>= 0 Successful.

< 0 Errors occurred:

v -EINVAL: Invalid input parameter

v -ENODATA: Some system configuration data is not available

v -EBADR: Generic internal errors

alf_query_system_info

This function queries basic configuration information for the specific system on

which ALF is running.

Synopsis

int alf_query_system_info(int query_id, unsigned int *p_result)

Parameters

query_id [IN] A query identification that indicates the item to be queried:

v ALF_INFO_NUM_ACCL_NODES: Returns the number of accelerators in the

system.

v ALF_INFO_CTRL_NODE_MEM_SIZE: Returns the memory size of the host up

to 4 TB, in KB. When the size of memory is more than 4 TB, the total

reported memory size is (ALF_INFO_CTRL_NODE_MEM_SIZE_EXT*4 TB +

ALF_INFO_CTRL_NODE_MEM_SIZE*1KB) bytes. In a system where virtual

memory is supported, this should be the maximum size of one

contiguous memory block that a single user space application can

allocate.

v ALF_INFO_CTRL_NODE_MEM_SIZE_EXT: Returns the memory size of the

host, in units of 4 TB.

v ALF_INFO_ACCL_NODE_MEM_SIZE: Returns the memory size of the

accelerators up to 4 TB, in KB. When the size of memory is more than

4 TB, the total reported memory size is

(ALF_INFO_ACCL_NODE_MEM_SIZE_EXT*4 TB +

ALF_INFO_ACCL_NODE_MEM_SIZE*1KB) bytes. In a system where virtual

memory is supported, this should be the maximum size of one

contiguous memory block that a single user space application can

allocate.

v ALF_INFO_ACCL_NODE_MEM_SIZE_EXT: Returns the memory size of the

accelerators, in units of 4 TB.

v ALF_INFO_CTRL_NODE_ADDR_ALIGN: Returns the basic requirement of

memory address alignment on the host as an exponent of 2. A zero

indicates a byte-aligned address. A 4 indicates alignment on 16-byte boundaries.

v ALF_INFO_ACCL_NODE_ADDR_ALIGN: Returns the basic requirement of

memory address alignment on the accelerator as an exponent of 2. A

zero indicates a byte-aligned address. An 8 indicates alignment on 256-byte

boundaries.

v ALF_INFO_DT_LIST_ADDR_ALIGN: Returns the address alignment of the

data transfer list entries in units of bytes.

p_result [OUT] Pointer to a buffer where the return value of the query is saved. If the

query fails, the result is undefined. If a NULL pointer is provided, the

query value is not returned, but the call returns zero.

Return values

0 Successful, the result of query is returned by p_result if that pointer is

not NULL

< 0 Errors occurred:

v -EINVAL: Unsupported query

v -EPERM: The ALF runtime is not properly configured

v -EBADR: Generic internal errors

Example

{

unsigned long long memsize;

unsigned int nodes, memsize_low, memsize_ext;

alf_configure(NULL);

alf_query_system_info(ALF_INFO_NUM_ACCL_NODES, &nodes);

alf_query_system_info(ALF_INFO_ACCL_NODE_MEM_SIZE, &memsize_low);

alf_query_system_info(ALF_INFO_ACCL_NODE_MEM_SIZE_EXT, &memsize_ext);

memsize = (unsigned long long) memsize_low +

((unsigned long long) memsize_ext << 32);

printf("We have %llu KB memory on 1 of %u accelerator nodes\n",

memsize, nodes);

}
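The two *_ADDR_ALIGN queries report alignment as a base-2 exponent. A caller can turn that exponent into a byte count and an address check with plain C, as in this sketch (no ALF calls; the helper names are invented):

```c
#include <stdint.h>

/* Convert an alignment exponent, as returned by the
 * ALF_INFO_*_ADDR_ALIGN queries, into a byte alignment. */
uint64_t align_bytes(unsigned int exponent)
{
    return 1ull << exponent;
}

/* Check whether an address meets that alignment requirement. */
int is_aligned(uint64_t addr, unsigned int exponent)
{
    return (addr & (align_bytes(exponent) - 1)) == 0;
}
```

For example, an exponent of 4 means 16-byte alignment and an exponent of 8 means 256-byte alignment, matching the descriptions above.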

alf_init

This function initializes the ALF runtime. It allocates accelerator resources and sets

up global data for ALF.

Synopsis

int alf_init(alf_handle_t *p_alf_handle, unsigned int

number_of_accelerators, ALF_STARTUP_POLICY_T policy)

Parameters

p_alf_handle [OUT] A pointer to a buffer that receives the contents of the data

structure that represents an instance of the ALF runtime.

This buffer is initialized with proper data if the call is

successful. Otherwise, the content is not modified.

number_of_accelerators [IN] Specifies the number of accelerators to allocate. When this

parameter is zero, the runtime tries to allocate all available

accelerators. If there are not enough accelerator resources,

the behavior is defined according to the policy parameter.

policy [IN] Defines the behavior of the function if there are not enough

accelerator resources as specified by the

number_of_accelerators parameter. Possible options are:

v ALF_INIT_PERSIST: Waits until the requested accelerators

are available.

v ALF_INIT_COMPROMISE: Obtains all available accelerators

and continues, even if the number of accelerators is less

than requested. Even if more accelerators are available

after this function returns, ALF will not dynamically add

more accelerators to satisfy the initial request.

v ALF_INIT_TRY: Stops, with an error code, if the number of

accelerators can not be satisfied.

Return values

> 0 The actual number of accelerators allocated for the runtime.

0 Not defined.

< 0 Error occurred:

v -EINVAL: Invalid input argument.

v -EPERM: The process does not have sufficient privileges to

fulfill the requirements.

v -ENOMEM: Out of memory or system resource.

v -ENOSYS: The required policy is not supported.

v -EBADR: Generic internal errors.

Example

{

alf_handle_t half = ALF_NULL_HANDLE; // initialize

unsigned int nodes;

int rtn;

alf_configure(NULL);

alf_query_system_info(ALF_INFO_NUM_ACCL_NODES, &nodes);

rtn = alf_init(&half, nodes, ALF_INIT_PERSIST);

if(rtn < 0) // and check

{

fprintf(stderr, "alf_init failed with code %d !\n", rtn);

}

}

alf_exit

This function shuts down the ALF runtime. It frees allocated accelerator resources

and stops all running or pending work queues and tasks, depending on the policy

parameter.

Synopsis

int alf_exit(alf_handle_t *p_alf_handle, ALF_SHUTDOWN_POLICY_T policy)

Parameters

p_alf_handle

[IN/OUT]

A pointer to a buffer that holds the contents of the data structure that

represents an instance of the ALF runtime. On exit, this buffer is set to

ALF_NULL_HANDLE to avoid future misuse if the call is successful.

policy [IN] Defines the shut down behavior:

v ALF_SHUTDOWN_FORCE: Shuts down immediately and stops all unfinished

tasks.

v ALF_SHUTDOWN_WAIT: Waits for all tasks to be processed and then shuts

down.

v ALF_SHUTDOWN_TRY: Returns with a failure if there are unfinished tasks.

Return values

>= 0 The shut down succeeded. The number of unfinished work blocks is

returned.

< 0 The shut down failed:

v -EINVAL: Invalid input argument

v -EBADF: Invalid ALF handle

v -EPERM: Process does not have sufficient privileges to fulfill the

requirements

v -ENOSYS: The required policy is not supported

v -EBUSY: There are still running tasks

v -EBADR: Generic internal errors

Example

{

alf_handle_t half = ALF_NULL_HANDLE;

unsigned int nodes;

int rtn;

alf_configure(NULL);

alf_query_system_info(ALF_INFO_NUM_ACCL_NODES, &nodes);

rtn = alf_init(&half, nodes, ALF_INIT_PERSIST);

// do something here

// and finally

rtn = alf_exit(&half, ALF_SHUTDOWN_WAIT);

// check return now

}

alf_register_error_handler

This function registers a global error handler function to the ALF runtime code. If

an error handler has already been registered, the new one replaces it.

Synopsis

int alf_register_error_handler(alf_handle_t alf_handle, alf_error_handler_t

error_handler_function, void *p_context)

Parameters

alf_handle [IN] A handle to the ALF runtime code.

error_handler_function

[IN]

A pointer to the user-defined error handler function. A NULL

value resets the error handler to the ALF default handler.

p_context [IN] A pointer to the user-defined context data for the error handler

function. This pointer is passed to the user-defined error

handler function when it is invoked.

Return values

0 Successful.

< 0 Errors occurred:

v -EINVAL: Invalid input argument

v -EBADF: Invalid ALF handle

v -EBADR: Generic internal errors

Example

static char my_context[256];

ALF_ERR_POLICY_T my_alf_error_handler(void *p_context_data,

int error_type, int error_code,

char *error_string)

{

if(ALF_ERR_FATAL == error_type)

{

fprintf(stderr, "Fatal error %d : ’%s’\n", error_code,

error_string);

return ALF_ERR_POLICY_ABORT;

}

return ALF_ERR_POLICY_SKIP;

}

int main(void)

{

alf_handle_t half = ALF_NULL_HANDLE;

unsigned int nodes;

int rtn;

alf_configure(NULL);

alf_query_system_info(ALF_INFO_NUM_ACCL_NODES, &nodes);

rtn = alf_init(&half, nodes, ALF_INIT_PERSIST);

rtn = alf_register_error_handler(half, my_alf_error_handler,

my_context);

// check return now

return 0;

}

Compute task API

The following API definitions are the compute task APIs.

alf_task_handle_t

This data structure is a handle to a specific compute task running on the

accelerators. It is created by calling the alf_task_create function and destroyed by

either calling the alf_task_destroy function or when the alf_exit function is

called. Call the alf_task_wait function to wait for the task to finish processing all

queued work blocks. The alf_task_wait API is also an indication to the ALF

runtime that no new work blocks will be added to the work queue of the

corresponding task in the future.

alf_task_context_handle_t

This data structure is a handle to access the task context of one task instance.

Context buffers are only available when the task is created with a nonzero

task_context_buffer_size. The handle is returned by alf_task_context_create.

Context buffer entries can then be added by calling alf_task_context_add_entry.

Then the context buffer is registered to the task instance by calling

alf_task_context_register. This handle will be destroyed by the runtime

automatically when the task has completed or has been destroyed explicitly.

alf_task_info_t

This data structure is used to hold the task creation information for the

alf_task_create function.

For more information about memory usage, see Chapter 5, “Buffers,” on page 11.

typedef struct

{

void * p_task_info;

/* This is a pointer to information regarding compute tasks that are

critical for starting the task on the accelerators. */

unsigned int task_context_buffer_read_only_size;

/* The size of the task context buffer section that should only be

referenced by all work blocks of a specific task on a specific

accelerator. This parameter can be zero if the task does not need

a task context buffer. For each instance of a compute task on an

accelerator, a context buffer will be created. */

unsigned int task_context_buffer_writable_size;

/* The size of the task context buffer that can be written to by all

work blocks of a specific task on a specific accelerator. This

parameter can be zero if the task does not need a task context

buffer. For each instance of a compute task on an accelerator,

a context buffer will be created. */

unsigned int parm_ctx_buffer_size;

/* The size of the parameter and context buffer that the work block needs,

specified in bytes. This parameter can be zero if the work blocks do

not need this buffer. */

unsigned int input_buffer_size;

/* The maximum size of the input data buffer of the work blocks, specified

in bytes. This parameter can be zero if the work blocks do not have

input data or do not need dedicated input buffers. */

unsigned int overlapped_buffer_size;

/* The maximum size allowed for overlapped input and output data buffers

for work blocks, specified in number of bytes. This parameter can be

zero when the overlapped buffer feature is not used. */

unsigned int output_buffer_size;

/* The maximum size of the output data buffer of the work blocks, specified

in bytes. This parameter can be zero if the work blocks do not contain

output data or do not need dedicated output buffers. */

unsigned int dt_list_entries;

/* The maximum number of entries in input and output data transfer lists

when the accelerator data partition is used. This value can be

zero when the host data partition is used. For the ALF runtime to

manage memory resources on the accelerators more efficiently, an

approximate value for this number must be given to the runtime even

when host data partitioning is used. */

unsigned int task_attr;

/* A logical OR of the following values: ALF_TASK_ATTR_PARTITION_ON_ACCEL:

the accelerator functions that are provided by programmers are invoked

to generate data transfer list for input and output data. By default,

the host APIs are used to do data partitioning. */

} alf_task_info_t;

Cell/B.E. architecture specific implementation details

The p_task_info pointer points to an initialized alf_task_info_t_CBEA data

structure.

alf_task_create

This function spawns or schedules the compute task described by the

p_computing_task_info function on all accelerators and enables the addition of

work blocks to the work queue of the task.

ALF uses an SPMD model, so it is possible to create multiple tasks in a batch.

However, the tasks are run sequentially in the order they are created. The

corresponding alf_task_wait should be called to ensure the completion of the

specified task and then alf_task_destroy is called to free the task handle resource.

The runtime automatically spawns new tasks when the currently running task

completes.

For tasks with a task_context_buffer_size larger than zero, the task will not

begin to process work blocks until the task context buffer is assigned to each

instance of the task on an accelerator by invoking alf_task_context_assign.

Synopsis

int alf_task_create(alf_task_handle_t *p_task_handle, alf_handle_t

alf_handle, alf_task_info_t *p_computing_task_info)

Parameters

p_task_handle [OUT] Returns a handle to the created task. The content of the pointer

is not modified if the call fails.

alf_handle [IN] The handle to the ALF runtime.

p_computing_task_info

[IN]

A pointer to a data structure that contains critical information

that the ALF runtime uses to spawn a compute task on the

accelerators. Contents of this data structure are not referenced

by the ALF runtime after the function returns.

Return values

> 0 The number of task instances that have been or will be

spawned.

0 Not defined.

< 0 Errors occurred:

v -EINVAL: Invalid input argument.

v -EBADF: Invalid ALF handle.

v -ENOMEM: Out of memory or system resource.

v -EPERM: The process does not have sufficient privileges to

fulfill the requirements.

v -ENOEXEC: Invalid task image format or description

information.

v -E2BIG: Memory requirement cannot be satisfied.

v -ENOSYS: The required task attribute is not supported.

v -EBADR: Generic internal errors.

Example

{

alf_task_handle_t htask = ALF_NULL_HANDLE;

alf_task_info_t tinfo;

alf_task_info_t_CBEA spe_tsk;

int rtn;

// init ALF ...

memset(&tinfo, 0, sizeof(tinfo));

memset(&spe_tsk, 0, sizeof(spe_tsk));

spe_tsk.spe_task_image = my_spe_task_image;

spe_tsk.max_stack_size = 4096;

tinfo.p_task_info = &spe_tsk;

tinfo.parm_ctx_buffer_size = sizeof(my_parm_data_structure);

tinfo.input_buffer_size = 128*1024;

tinfo.output_buffer_size = 64*1024;

tinfo.dt_list_entries = 128;

tinfo.task_attr = ALF_TASK_ATTR_PARTITION_ON_ACCEL;

rtn = alf_task_create(&htask, half, &tinfo);

if(rtn < 0)

fprintf(stderr, "Failed to create task\n");

else

printf("A total of %d instances are / will be created.\n",

rtn);

}

alf_task_context_create

This function creates a context buffer handle to enable accesses to the
context buffer of the specified task instance.

A task can only have context buffers when it is started with
task_context_buffer_size set to nonzero. The context buffer is on the
accelerator that runs the task instance. For each instance of the task, only
one context buffer handle is created. The programmer calls
alf_task_context_add_entry to add references to host memory locations to the
context buffer. After the entries are added, alf_task_context_register
registers the context buffer handle to the ALF runtime. These host memory
references are copied to the context buffer before the task instance begins to
process work blocks. After the task completes, the contents of the context
buffer are written back to the original locations in the host memory.

Chapter 9. Host API 33

Synopsis

int alf_task_context_create(alf_task_context_handle_t *p_tc_handle,
    alf_task_handle_t task_handle, unsigned int accelerator_index)

Parameters

p_tc_handle [OUT]
    The pointer to a buffer where the created handle is returned. The contents
    are not modified if this call returns an error.
task_handle [IN]
    The handle to the compute task.
accelerator_index [IN]
    The index of the accelerator. This value ranges from zero to the number of
    allowed instances of the task as returned by alf_task_create. If the value
    is zero, the ALF runtime selects one of the instances that does not yet
    have a context buffer allocated. Otherwise, the context buffer is
    allocated for the specific instance that is defined by the internal order
    of ALF when the task is created.

Return values

0     Success.
< 0   Errors occurred:
      - -EINVAL: Invalid input argument.
      - -EPERM: Operation not allowed (one context buffer is already allocated
        for this instance, or the task is not declared as supporting context
        buffers).
      - -EBADF: Invalid task handle.
      - -ENOMEM: Out of memory or system resource.
      - -EBADR: Generic internal errors.

Example

{
    alf_task_info_t tinfo;
    alf_task_info_t_CBEA spe_tsk;
    my_task_context_t *pctx;

    // init ALF ...
    memset(&tinfo, 0, sizeof(tinfo));
    spe_tsk.spe_task_image = my_spe_task_image;
    spe_tsk.max_stack_size = 4096;
    tinfo.p_task_info = &spe_tsk;
    tinfo.task_context_buffer_read_only_size = sizeof(read_only_data);
    tinfo.task_context_buffer_writable_size = sizeof(my_task_context_t);
    tinfo.parm_ctx_buffer_size = sizeof(my_parm_data_structure);
    tinfo.input_buffer_size = 128*1024;
    tinfo.output_buffer_size = 64*1024;
    tinfo.dt_list_entries = 128;
    tinfo.task_attr = ALF_TASK_ATTR_PARTITION_ON_ACCEL;
    rtn = alf_task_create(&htask, half, &tinfo);
    if (rtn < 0)
        fprintf(stderr, "Failed to create task\n");
    else
        printf("A total of %d instances are / will be created.\n", rtn);
    pctx = malloc_align(sizeof(my_task_context_t)*rtn, 128);
    memset(pctx, 0, sizeof(my_task_context_t)*rtn);
    for (i = 0; i < rtn; i++)
    {
        alf_task_context_handle_t hctx;
        alf_task_context_create(&hctx, htask, 0 /* i+1 is ok too */);
        alf_task_context_add_entry(hctx, read_only_data,
            sizeof(read_only_data), ALF_DATA_BYTE, ALF_TASK_CONTEXT_READ);
        alf_task_context_add_entry(hctx, pctx+i, sizeof(my_task_context_t),
            ALF_DATA_BYTE, ALF_TASK_CONTEXT_WRITABLE);
        alf_task_context_register(hctx);
    }
    // ...
}

alf_task_context_add_entry

This function adds an entry to the context buffer of the corresponding task
instance. The entry describes a single piece of data that is transferred in
from the host memory before the task instance starts to process work blocks.
This data is transferred back to the original location when the task instance
finishes normally. For a specific context buffer, further calls to this API
will return an error after the context buffer is registered by calling the
alf_task_context_register function.

Synopsis

int alf_task_context_add_entry(alf_task_context_handle_t tc_handle,
    void *p_address, unsigned int size_of_data, ALF_DATA_TYPE_T data_type,
    ALF_TASK_CONTEXT_T entry_type)

Parameters

tc_handle [IN]
    The task context buffer handle.
p_address [IN]
    The pointer to the data in remote memory.
size_of_data [IN]
    The size of the data in units of the data type.
data_type [IN]
    The type of data. This value is required if data endianness conversion is
    necessary when moving the data.
entry_type [IN]
    The type of entry:
    - ALF_TASK_CONTEXT_READ: Add this entry to the read-only section of the
      context buffer. The content of this entry will not be written back to
      the original location when the task is finished.
    - ALF_TASK_CONTEXT_WRITABLE: Add this entry to the writable section of the
      context buffer. The content of this entry can be modified by the
      accelerator, so it will be written back to the original location when
      the task is finished to ensure consistency.

Return values

0     Success.
< 0   Errors occurred:
      - -EINVAL: Invalid input argument.
      - -EBADF: Invalid task context buffer handle.
      - -ENOBUFS: The size or offset of the entry is outside of the allowed
        range.
      - -EBADR: Generic internal errors.

Example

See an example of this function in “alf_task_context_create” on page 33.


alf_task_context_register

This function registers the given context buffer handle to the corresponding
task instance. It should only be called when all related calls to the
alf_task_context_add_entry function have returned. The task instance will not
begin to process work blocks before the context buffer handle is registered.

Synopsis

int alf_task_context_register(alf_task_context_handle_t tc_handle)

Parameters

tc_handle [IN]
    The task context buffer handle.

Return values

0     Success.
< 0   Errors occurred:
      - -EBADF: Invalid task context buffer handle.
      - -EBADR: Generic internal errors.

Example

See an example of this function in “alf_task_context_create” on page 33.

alf_task_query

This function queries the current status of a task.

Synopsis

int alf_task_query(alf_task_handle_t task_handle, unsigned int
    *p_unfinished_wbs, unsigned int *p_total_wbs)

Parameters

task_handle [IN]
    The task to be queried.
p_unfinished_wbs [OUT]
    A pointer to an integer buffer where the number of unfinished work blocks
    of this task is returned. When a NULL pointer is given, this value is not
    returned.
p_total_wbs [OUT]
    A pointer to an integer buffer where the total number of submitted work
    blocks of this task is returned. When a NULL pointer is given, this value
    is not returned.

Return values

> 1   The task is pending in the task queue; a return value of N means that
      the task is the (N-1)th pending task.
1     The task is currently running.
0     The task is finished and can be safely destroyed.
< 0   Errors occurred:
      - -EINVAL: Invalid input argument.
      - -EBADF: Invalid task handle.
      - -EBADR: Generic internal errors.


Example

{
    unsigned int unfinished;

    rtn = alf_task_create(&htask, half, &tinfo);
    // do things here
    rtn = alf_task_query(htask, &unfinished, NULL);
    if (rtn > 1)
        printf("we are waiting in the queue !\n");
    else if (rtn == 1)
        printf("we are running and %d work blocks pending !\n", unfinished);
    else if (rtn == 0)
        printf("done !\n");
    else
        printf("Why ? \n");
}

alf_task_wait

This function declares that no work blocks will be added to the specified task
and waits for the completion of the spawned task instances on all
accelerators.

When this function is called, new work blocks cannot be added to the work
queue of the specified task; further calls to alf_wb_enqueue will return an
error. This function provides a timeout mechanism that you can use to
implement synchronous or asynchronous work.

Note: The task handle and all of its related resources continue to be valid
after this function finishes. You must call alf_task_destroy to release the
resources associated with the task.

Synopsis

int alf_task_wait(alf_task_handle_t task_handle, int time_out)

Parameters

task_handle [IN/OUT]
    A task handle that is returned by the alf_task_create API.
time_out [IN]
    A timeout value with the following options:
    - > 0: Waits for up to the specified number of milliseconds before a
      timeout error occurs.
    - 0: Checks the status of the accelerator and returns immediately.
    - < 0: Waits until all of the accelerators finish processing.

Return values

> 0   The accelerators are still running, and the number of unfinished work
      blocks is returned. This value is only possible when the time_out
      argument is zero. There is one special case: when all work blocks are
      finished but the runtime is still cleaning up the task environment (for
      example, writing back the context), the return value is 1. This
      indicates that the task should not be destroyed at this time.
0     All of the accelerators finished the job.
< 0   Errors occurred:
      - -EINVAL: Invalid input argument.
      - -EBADF: Invalid task handle.
      - -ESRCH: Already closed task handle.
      - -ETIME: Time out.
      - -EBADR: Generic internal errors.

Example

{
    rtn = alf_task_create(&htask, half, &tinfo);
    // do things here
    rtn = alf_task_wait(htask, 10*1000);
    if (rtn > 0)
        printf("Still running and %d work blocks pending !\n", rtn);
    else if (rtn == 0)
        printf("done !\n");
    else if (-ETIME == rtn)
        printf("Timeout\n");
    else
        printf("Something bad %d\n", rtn);
}

alf_task_destroy

This function destroys the specified task and releases the resources used by
the task. If there are work blocks that are still being processed, this
routine forcibly stops the processing of work blocks. Pending tasks are also
destroyed. To release the task resources without losing the computing results,
first ensure that a call to the alf_task_wait function returns zero.

Synopsis

int alf_task_destroy(alf_task_handle_t *p_task_handle)

Parameters

p_task_handle [IN/OUT]
    The pointer to a task handle that is returned by the alf_task_create API.
    On a successful return, the pointed-to content is set to ALF_NULL_HANDLE.

Return values

>= 0  Success; the number of unfinished work blocks is returned.
< 0   Errors occurred:
      - -EINVAL: Invalid input argument.
      - -EBADF: Invalid task handle.
      - -EBUSY: Resource busy.
      - -ENOSYS: Feature not implemented.
      - -EBADR: Generic internal errors.

Example

{
    rtn = alf_task_create(&htask, half, &tinfo);
    // do things here
    rtn = alf_task_wait(htask, 10*1000);
    if (rtn == 0)
    {
        printf("done !\n");
        alf_task_destroy(&htask);
    }
}

Work block API

The following API definitions are the work block APIs.

Data structures

alf_wb_handle_t
    This data structure is a handle to a work block.
alf_wb_sync_handle_t
    This data structure refers to a synchronization point.

alf_wb_create

This function creates a new work block for the specified compute task. The
work block is added to the work queue of the task. The caller can only update
the contents of a work block before it is added to the work queue. After the
work block is added to the work queue, the life span of the data structure is
determined by the ALF runtime.

The ALF runtime is responsible for releasing any resources allocated for the
work block, and does so after it finishes processing the work block. This
function can only be called before the alf_task_wait function is invoked.
After the alf_task_wait function is called, additional calls to this function
will return an error.

Synopsis

int alf_wb_create(alf_wb_handle_t *p_wb_handle, alf_task_handle_t task_handle,
    ALF_WORK_BLOCK_TYPE_T work_block_type, unsigned int repeat_count)

Parameters

p_wb_handle [OUT]
    The pointer to a buffer where the created handle is returned. The contents
    are not modified if this call fails.
task_handle [IN]
    The handle to the compute task.
work_block_type [IN]
    The type of work block to be created. Choose from the following types:
    - ALF_WB_SINGLE: Creates a single-use work block.
    - ALF_WB_MULTI: Creates a multi-use work block. This work block type is
      only supported when the task is created with the
      ALF_TASK_ATTR_PARTITION_ON_ACCEL attribute.
repeat_count [IN]
    Specifies the number of iterations for a multi-use work block. This
    parameter is ignored when a single-use work block is created.

Return values

>= 0  Success.
< 0   Errors occurred:
      - -EINVAL: Invalid input argument.
      - -EPERM: Operation not allowed.
      - -EBADF: Invalid task handle.
      - -ENOMEM: Out of memory.
      - -EBADR: Generic internal errors.

Example

See “alf_wb_enqueue” for an example of this function.

alf_wb_enqueue

This function adds the work block to the work queue of the specified task
handle. The caller can only update the contents of a work block before it is
added to the work queue. After it is added to the work queue, you cannot
access the wb_handle.

Synopsis

int alf_wb_enqueue(alf_wb_handle_t wb_handle)

Parameters

wb_handle [IN]
    The handle of the work block to be put into the work queue.

Return values

0     Success.
< 0   Errors occurred:
      - -EINVAL: Invalid input argument.
      - -EBADF: Invalid task handle.
      - -EBUSY: An internal resource is occupied.
      - -EBADR: Generic internal errors.

Example

{
    alf_task_create(&htask, half, &tinfo);
    for (X = 0; X < 1024; X += M)
    {
        parm.x = X;
        parm.y = 0;
        alf_wb_create(&hwb, htask, ALF_WB_SINGLE, 1);
        alf_wb_add_parm(hwb, &parm, sizeof(parm), ALF_DATA_BYTE, 0);
        alf_wb_add_io_buffer(hwb, data_a[X], M*N*sizeof(float),
            ALF_DATA_FLOAT, ALF_BUFFER_INPUT);
        alf_wb_add_io_buffer(hwb, &mat_b[X][0], M*N*sizeof(float),
            ALF_DATA_FLOAT, ALF_BUFFER_INPUT);
        alf_wb_add_io_buffer(hwb, &mat_c[X][0], M*N*sizeof(float),
            ALF_DATA_FLOAT, ALF_BUFFER_OUTPUT);
        alf_wb_enqueue(hwb);
    }
    alf_task_wait(htask, -1);
}

alf_wb_add_parm

This function adds the given parameter to the parameter and context buffer of
the work block in the order in which this function is called. The starting
address is at offset zero. The added data is copied to the internal parameter
and context buffer immediately. The relative address of the data can be
aligned as specified. For a specific work block, additional calls to this API
will return an error after the work block is put into the work queue by
calling the alf_wb_enqueue function.

Synopsis

int alf_wb_add_parm(alf_wb_handle_t wb_handle, void *pdata, unsigned int
    size_of_data, ALF_DATA_TYPE_T data_type, unsigned int address_alignment)

Parameters

wb_handle [IN]
    The work block handle.
pdata [IN]
    A pointer to the data to be copied.
size_of_data [IN]
    The size of the data in units of the data type.
data_type [IN]
    The type of data. This value is required if data endianness conversion is
    necessary when moving the data.
address_alignment [IN]
    The address alignment requirement, expressed as an exponent of 2. The
    valid range is from 0 to 8. Zero indicates a byte-aligned address; 8
    indicates alignment on 256-byte boundaries.

Return values

0     Success.
< 0   Errors occurred:
      - -EINVAL: Invalid input argument.
      - -EPERM: Operation not allowed.
      - -EBADF: Invalid work block handle.
      - -ENOBUFS: The size of the data to be added is too large.
      - -EBADR: Generic internal errors.

Example

See “alf_wb_enqueue” on page 40 for an example of this function.
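The exponent-of-2 alignment rule can be illustrated with a small,
self-contained sketch. The helper below is not part of the ALF API; it only
shows how a caller could predict the offset at which sequentially added
parameters land when each call specifies an address_alignment exponent.

```c
#include <assert.h>

/* Illustrative only, not an ALF function: round the current end of the
 * parameter buffer up to the next boundary of (1 << alignment_exp) bytes,
 * mimicking how sequential alf_wb_add_parm calls could pack data.
 * alignment_exp 0 means byte-aligned; 8 means 256-byte-aligned. */
unsigned int pack_offset(unsigned int current_end, unsigned int alignment_exp)
{
    unsigned int align = 1u << alignment_exp;        /* alignment in bytes */
    return (current_end + align - 1) & ~(align - 1); /* round up */
}
```

For example, after adding a 6-byte entry at byte alignment, a second entry
requested with alignment exponent 4 (16-byte boundary) would be placed at
offset 16, leaving a 10-byte gap.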

alf_wb_add_io_buffer

This function adds an entry to the input or output data transfer lists of a
single-use work block. The entry describes a single piece of data transferred
from or to remote memory.

For a specific work block, additional calls to this API return an error after
the work block is put into the work queue by calling the alf_wb_enqueue
function. This function can only be called if the compute task is not created
with the ALF_TASK_ATTR_PARTITION_ON_ACCEL attribute.

Synopsis

int alf_wb_add_io_buffer(alf_wb_handle_t wb_handle, void *p_address,
    unsigned int size_of_data, ALF_DATA_TYPE_T data_type, ALF_BUFFER_TYPE_T
    io_type)

Parameters

wb_handle [IN]
    The work block handle.
p_address [IN]
    A pointer to the data in remote memory.
size_of_data [IN]
    The size of the data in units of the data type.
data_type [IN]
    The type of data. This value is required if data endianness conversion is
    necessary when doing the data movement.
io_type [IN]
    The buffer type:
    - ALF_BUFFER_INPUT: Input buffer.
    - ALF_BUFFER_OUTPUT: Output buffer.
    - ALF_BUFFER_INOUT: Buffer used for both input and output.

Return values

0     Success.
< 0   Errors occurred:
      - -EINVAL: Invalid input argument.
      - -EPERM: Operation not allowed.
      - -EBADF: Invalid work block handle.
      - -E2BIG: The ALF runtime cannot accommodate the number of I/O buffers
        requested.
      - -ENOBUFS: The ALF runtime cannot accommodate the amount of data
        requested.
      - -EBADR: Generic internal errors.

Cell/B.E. architecture implementation details

For this function, the ALF runtime handles the 16 KB DMA limitation
transparently. You must ensure the data is aligned properly because the ALF
runtime will not do data padding or data duplication to satisfy the address
and data size alignment requirements of the memory flow controller (MFC). An
-EINVAL error is returned when the input data does not meet the alignment
requirements.

Example

See “alf_wb_enqueue” on page 40 for an example of this function.
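On the host side you can verify these constraints before enqueuing a work
block. The helper below is an illustrative sketch only, not part of the ALF
API; it encodes the commonly documented MFC DMA rules (transfers of 1, 2, 4,
or 8 bytes must be naturally aligned; larger transfers must be multiples of 16
bytes on 16-byte-aligned addresses).

```c
#include <stdint.h>

/* Illustrative sketch, not an ALF API: returns 1 if a host buffer meets the
 * typical MFC DMA alignment rules, 0 otherwise. */
int mfc_transfer_ok(const void *p, unsigned int size_in_bytes)
{
    uintptr_t addr = (uintptr_t)p;
    if (size_in_bytes == 1 || size_in_bytes == 2 ||
        size_in_bytes == 4 || size_in_bytes == 8)
        return (addr % size_in_bytes) == 0;   /* naturally aligned */
    /* larger transfers: multiple of 16 bytes on a 16-byte boundary */
    return (size_in_bytes % 16 == 0) && (addr % 16 == 0);
}
```

A check like this can turn a runtime -EINVAL into an early, descriptive
failure in the host application.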

alf_wb_sync

This function adds a synchronization point to the current work queue for the
specified task. You can register a callback function for notification of this
synchronization point; the ALF runtime invokes the callback function when the
synchronization condition is met. This API can only be called before
alf_task_wait is invoked. After alf_task_wait is called, further calls to the
function will return an error.

Synopsis

int alf_wb_sync(alf_wb_sync_handle_t *p_sync_handle, alf_task_handle_t
    task_handle, ALF_SYNC_TYPE_T sync_type, int
    (*sync_callback_func)(alf_wb_sync_handle_t sync_handle, void *p_context),
    void *p_context, unsigned int context_size)

Parameters

p_sync_handle [IN/OUT]
    The pointer to a buffer where the handle to the created synchronization
    point is returned.
task_handle [IN]
    The task handle.
sync_type [IN]
    This can be set to one of the following values:
    - ALF_SYNC_BARRIER: When the ALF runtime reaches this synchronization
      point, all work blocks enqueued before this point must be finished
      before any new work blocks added after the synchronization point can be
      processed on any of the accelerators. If a callback function is
      registered to this synchronization point, the work queue will continue
      running only when the callback function returns.
    - ALF_SYNC_NOTIFY: The ALF accelerator runtime will send a notification to
      the ALF host runtime and invoke the registered callback function when
      this synchronization point is reached. However, it does not ensure the
      order of work block completion.
sync_callback_func [IN]
    The pointer to the callback function that will be registered for this
    synchronization point. This parameter can be NULL if you do not want a
    callback function.
p_context [IN]
    A pointer to a context buffer. The pointer to the internal buffer will be
    passed to the callback function if there is a callback function
    registered. The content of the context buffer is copied by value only. A
    NULL value indicates no context buffer.
context_size [IN]
    The size of the context buffer in bytes. Zero indicates no context buffer.

Return values

0     Success.
< 0   Errors occurred:
      - -EINVAL: Invalid input argument.
      - -EPERM: Operation not allowed.
      - -EBADF: Invalid task handle.
      - -ENOMEM: Out of memory.
      - -EBADR: Generic internal errors.

Note: A synchronization point without an associated callback function is
always non-blocking; in this case, use alf_wb_sync_wait to check the status of
the synchronization point. If a callback function is associated with the
synchronization point, it is always blocking, and the ALF runtime will not
assign new work blocks to the accelerators until the callback function has
returned. In either case, alf_wb_sync_wait is always supported.

sync_callback_func

This is the prototype of the callback function for the synchronization point.
The callback function might be invoked in a different thread context than the
main application.

Synopsis

int (*sync_callback_func)(alf_wb_sync_handle_t sync_handle, void *p_context)

Parameters

sync_handle [IN]
    The handle to the synchronization point.
p_context [IN]
    A pointer to the buffer where the programmer-supplied context values are
    duplicated. The contents of this buffer are not kept after the callback
    function returns.

Return values

0     No errors.
< 0   Errors occurred during the callback. An internal error with type
      ALF_ERR_EXCEPTION will be raised.
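A minimal sketch of a user-supplied callback is shown below. The typedefs are
stand-ins for illustration only: the real alf_wb_sync_handle_t comes from the
ALF headers, and the context layout is whatever the application packed into
p_context when calling alf_wb_sync.

```c
#include <stdio.h>

/* Stand-in for the real type from the ALF headers, for illustration only. */
typedef void *alf_wb_sync_handle_t;

/* Hypothetical application-defined context layout. */
typedef struct { int stage; } my_sync_ctx_t;

/* Callback invoked by the ALF runtime when the synchronization point is
 * reached. It may run in a different thread than the main application, so it
 * should touch only the copied context and thread-safe state. */
int my_sync_callback(alf_wb_sync_handle_t sync_handle, void *p_context)
{
    my_sync_ctx_t *ctx = (my_sync_ctx_t *)p_context;
    (void)sync_handle;
    printf("stage %d complete\n", ctx->stage);
    return 0; /* a negative value would raise ALF_ERR_EXCEPTION */
}
```

Such a callback would be registered with something like
alf_wb_sync(&hsync, htask, ALF_SYNC_BARRIER, my_sync_callback, &ctx,
sizeof(ctx)); the runtime passes the callback a pointer to its own copy of
ctx.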

alf_wb_sync_wait

This function waits for the arrival of a synchronization point. The timeout is
given in milliseconds.

Synopsis

int alf_wb_sync_wait(alf_wb_sync_handle_t sync_handle, int time_out)

Parameters

sync_handle [IN]
    The synchronization point handle. This is the value returned from
    alf_wb_sync.
time_out [IN]
    The timeout value in milliseconds:
    - > 0: The function waits for up to time_out milliseconds before a timeout
      error occurs.
    - 0: The function checks the status of the synchronization point and
      returns immediately.
    - < 0: The function waits until the synchronization point is reached.

Return values

> 0   The synchronization point has not been reached. This value is only
      possible when the time_out value is zero.
0     The synchronization operation completed successfully.
< 0   Errors occurred:
      - -EINVAL: Invalid input argument.
      - -EBADF: Invalid task handle.
      - -ETIME: Timed out.
      - -EBADR: Generic internal errors.


Note: The status of a synchronization point can be queried multiple times.
When a synchronization point has been reached, future calls to this API will
always return success until the corresponding task has completed.

Chapter 10. Accelerator API

The following API definitions are the accelerator APIs.

alf_comp_kernel

This is the entry point to the compute kernel. The ALF runtime moves in the
user data and input data before invoking this call.

Synopsis

int alf_comp_kernel(void *p_task_context, void *p_parm_ctx_buffer, void
    *p_input_buffer, void *p_output_buffer, unsigned int current_count,
    unsigned int total_count)

Parameters

p_task_context [IN]
    A pointer to the local memory block where the task context buffer is kept.
p_parm_ctx_buffer [IN]
    A pointer to the local memory block where the parameter and context data
    are kept.
p_input_buffer [IN]
    A pointer to the local memory block where the input data is loaded.
p_output_buffer [IN]
    A pointer to the local memory block where the output data is written.
current_count [IN]
    The current iteration count of multi-use work blocks. This value starts at
    0. For single-use work blocks, this value is always 0.
total_count [IN]
    The total number of iterations of multi-use work blocks. For single-use
    work blocks, this value is always 1.

Return values

0     The computation finished correctly.
< 0   An error occurred during the computation. The error code is passed back
      to the library developer to be handled.

For overlapped I/O buffers, when this API is called, p_input_buffer refers to
the memory region that includes the dedicated input buffer and the overlapped
buffer, and p_output_buffer refers to the memory region that includes the
overlapped buffer and the dedicated output buffer. See Figure 5 on page 14 and
Figure 6 on page 14.
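Because the compute kernel is implemented by the accelerator-side library
writer, an example implementation can be shown self-contained. The sketch
below assumes a hypothetical work block layout, not one mandated by ALF: the
parameter buffer holds an element count and a scale factor, and the input and
output buffers hold floats.

```c
/* Hypothetical parameter layout; the real layout is whatever the host side
 * packed with alf_wb_add_parm. */
typedef struct { unsigned int n; float scale; } my_parm_t;

/* Example compute kernel: scales n input floats into the output buffer.
 * The ALF runtime has already moved the parameter and input data into local
 * memory before this function is invoked. */
int alf_comp_kernel(void *p_task_context, void *p_parm_ctx_buffer,
                    void *p_input_buffer, void *p_output_buffer,
                    unsigned int current_count, unsigned int total_count)
{
    const my_parm_t *parm = (const my_parm_t *)p_parm_ctx_buffer;
    const float *in = (const float *)p_input_buffer;
    float *out = (float *)p_output_buffer;
    unsigned int i;

    (void)p_task_context; (void)current_count; (void)total_count;
    for (i = 0; i < parm->n; i++)
        out[i] = parm->scale * in[i];
    return 0; /* a negative value is reported back as an error */
}
```

A real kernel on the SPE would typically also use SIMD intrinsics on the
16-byte-aligned buffers, but the contract with the runtime is the same.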

alf_prepare_input_list

The ALF runtime calls this function to define the input data partition on the
accelerator.

Because ALF might be doing double buffering, the function should only refer to
the context and memory buffers provided by p_parm_ctx_buffer. This function is
only called if the compute task is created with the alf_task_info_t.task_attr
parameter set to ALF_TASK_ATTR_PARTITION_ON_ACCEL.

Synopsis

int alf_prepare_input_list(void *p_task_context, void *p_parm_ctx_buffer,
    void *p_dt_list_buffer, unsigned int current_count, unsigned int
    total_count)

Parameters

p_task_context [IN]
    A pointer to the local memory block where the task context buffer is kept.
p_parm_ctx_buffer [IN]
    A pointer to the local memory block where the parameter and context of the
    work block are kept. The data partition is based only on these contents.
p_dt_list_buffer [IN]
    A pointer to the buffer where the generated data transfer list is saved.
current_count [IN]
    The current iteration count of multi-use work blocks. This value starts at
    0. For single-use work blocks, this value is always 0.
total_count [IN]
    The total number of iterations of multi-use work blocks. For single-use
    work blocks, this value is always 1.

Return values

0     The computation finished correctly.
< 0   An error occurred during the computation. The error code is passed to
      the library developer to be handled.

Note: This API does not need to be implemented when data partitioning is
performed by the host and the compiler and linker support weak symbols.

alf_prepare_output_list

The ALF runtime calls this function to define the output data partition on the
accelerator.

Because ALF might be doing double buffering, the function should only refer to
the context and memory buffers provided by p_parm_ctx_buffer. This API is
invoked only when the compute task is spawned with the
alf_task_info_t.task_attr parameter set to ALF_TASK_ATTR_PARTITION_ON_ACCEL.

Synopsis

int alf_prepare_output_list(void *p_task_context, void *p_parm_ctx_buffer,
    void *p_dt_list_buffer, unsigned int current_count, unsigned int
    total_count)

Parameters

p_task_context [IN]
    A pointer to the local memory block where the task context buffer is kept.
p_parm_ctx_buffer [IN]
    A pointer to the local memory block where the parameter and context of the
    work block are kept. The data partition is based on these contents.
p_dt_list_buffer [IN]
    A pointer to the buffer where the generated data transfer list is saved.

current_count [IN]
    The current iteration count for multi-use work blocks. This value starts
    at 0. For single-use work blocks, this value is always 0.
total_count [IN]
    The total number of iterations for multi-use work blocks. For single-use
    work blocks, this value is always 1.

Return values

0     The computation finished correctly.
< 0   An error occurred during the computation. The error code is passed to
      the calling program to be handled.

Note: This API does not need to be implemented when data partitioning is
performed by the host and the compiler and linker support weak symbols.

ALF_DT_LIST_CREATE

This macro creates the data transfer list data structure for input or output
data transfers.

Synopsis

ALF_DT_LIST_CREATE(void *p_dt_list_buffer, unsigned int io_buffer_offset)

Parameters

p_dt_list_buffer [IN]
    A pointer to the buffer for the data transfer list data structure.
io_buffer_offset [IN]
    The offset into the input or output buffer in accelerator memory to which
    the data transfer list refers.

Return values

Not specified.

This API generates the data transfer list entries. It might be implemented as
a macro on some platforms. For overlapped I/O buffers, when this API is called
in alf_prepare_input_list, io_buffer_offset refers to the memory region that
includes the dedicated input buffer and the overlapped buffer. When it is
called in alf_prepare_output_list, io_buffer_offset refers to the memory
region that includes the overlapped buffer and the dedicated output buffer.
See Figure 5 on page 14 and Figure 6 on page 14.

ALF_DT_LIST_ADD_ENTRY

This macro fills in a data transfer list entry.

Synopsis

ALF_DT_LIST_ADD_ENTRY(void *p_dt_list_buffer, unsigned int data_size,
    ALF_DATA_TYPE_T data_type, void *p_remote_address)

Parameters

p_dt_list_buffer [IN]
    A pointer to the buffer for the data transfer list data structure.
data_size [IN]
    The size of the data to be transferred, in units of the data type.
data_type [IN]
    The type of data. This parameter is required if data endianness conversion
    is necessary when doing the data movement.
p_remote_address [IN]
    The address of the remote memory.

Note: The Cell/B.E. implementation uses the parameter addr64 remote_address in
place of p_remote_address.

Return values

Not specified.

This API generates the data transfer list entries.

Cell/B.E. architecture implementation details

For this macro, the ALF runtime does not transparently manage the 16 KB limit on
a single transfer. Cell/B.E. architecture programmers must work around that limit
by dividing the entry into multiple entries no larger than 16 KB. Cell/B.E.

architecture also has a data transfer list size limitation of 2048 entries. The ALF

runtime will not do a strict check on these constraints by default due to

performance requirements. However, some compile time options can be enabled to

support strict checks to these constraints. The Cell/B.E. implementation uses the

addr64 remote_address parameter in place of the void *p_remote_address

parameter. For more information about compile time options, see Part 5, “Compile

time options,” on page 95.


Chapter 11. Cell/B.E. architecture platform-dependent API

This API is platform-dependent.

alf_task_info_t_CBEA

This data structure holds the task creation information for alf_task_create on

Cell/B.E. architecture.

typedef struct
{
    spe_program_handle_t *spe_task_image;  /* libspe SPE image handle */
    unsigned int max_stack_size;  /* the maximum stack size the image requests when it is run */
} alf_task_info_t_CBEA;

© Copyright IBM Corp. 2006, 2007


Part 3. Programming with ALF

There are several things to consider when programming with ALF.

Basic structure of an ALF application

The basic structure of an ALF application is shown in Figure 9. On the host, you

initialize the ALF runtime and then create a compute task. After the task is created,

you can begin to add work blocks to the work queue of the task. Then, you can

wait for the task to complete and shut down the ALF runtime to release the

allocated resources. On the accelerator, after an instance of the task is spawned, it

waits for pending work blocks to be added to the work queue. Then the

alf_comp_kernel function is called for each work block. If the partition location

attribute of a task is ALF_TASK_ATTR_PARTITION_ON_ACCEL, then the

alf_prepare_input_list function is called before the invocation of the compute

kernel and the alf_prepare_output_list function is called after the compute

kernel exits.

Figure 9. ALF application structure and process flow


Chapter 12. Understand the problem

The primary class of problems that ALF is well suited to solve is data parallel

problems. To decide if a problem is suitable for ALF, answer the following three

questions. If the answers are all YES, you can use ALF to solve the problem.

1. Can the problem be parallelized? Certain problems might seem to be

inherently serial at first. However, try to find alternative approaches to divide

the problem into sub-problems. One or all of the sub-problems can often be

parallelized.

2. Is the parallel problem SPMD-capable? ALF supports the SPMD

parallel-programming style, where one program runs on all accelerators with a

different set of data for each accelerator. The ALF runtime does not guarantee

on which accelerator a specific work block is processed, nor does it keep the

order of completion based on the queuing order. Each work block is a stateless

computation procedure; no context is preserved across succeeding work blocks.

3. Can the SPMD-parallel problem be supported on the current architecture?

Check that the problem is suitable for the specific architectures that ALF

supports. For example, the Cell/B.E. processor has a local memory size of 256

KB. If the data set of the problem cannot be divided into work blocks that fit

into local storage, you should not use ALF for that problem.


Chapter 13. Data layout and partition design for the ALF

implementation on Cell/B.E.

Data partitioning is crucial to the ALF programming model. Improper data

partitioning and data layout design either prevents ALF from being applicable or

results in degraded performance. Data partition and layout is closely coupled with

the design and implementation of the compute kernel, and they should be

considered simultaneously.

Use the following considerations for data layout and partition design:

v Use the proper size for data partitions.

Often, the local memory or data cache of the accelerator is limited. Performance

can degrade if the partitioned data cannot fit into the available memory. For

example, on Cell/B.E. architecture, if a single block of partitioned data is larger

than 128 KB, it might not be possible to support double buffering on the SPE.

This might result in up to 50% performance loss.

v Minimize the amount of data movement.

A large amount of data movement can cause performance loss in applications.

Improve performance by avoiding unnecessary data movements.

v Simplify data movement patterns.

Although the data transfer list feature of ALF enables flexible data gathering and

scattering patterns, it is better to keep the data movement patterns as simple as

possible. Some good examples are sequential access and using contiguous

movements instead of small discrete movements.

v Avoid data reorganization.

Data reorganization requires extra work. It is better to organize data in a way

that suits the usage pattern of the algorithm than to write extra code to

reorganize the data when it is used.

v Know address alignment limitations on Cell/B.E.


Chapter 14. Double buffering on ALF

When transferring data can be done in parallel with the computation, double

buffering can reduce the time lost to data transfer by overlapping it with the

computation time. The ALF runtime implementation on Cell/B.E. architecture

supports three different kinds of double buffering schemes.

See Figure 10 for an illustration of how double buffering works inside ALF. The

ALF runtime evaluates each work block and decides which buffering scheme is

most efficient. At each decision point, if the conditions are met, that buffering

scheme is used. The ALF runtime first checks if the work block uses the

overlapped I/O buffer. If the overlapped I/O buffer is not used, the ALF runtime

next checks the conditions for the four-buffer scheme, then the conditions of the

three-buffer scheme. If the conditions for neither scheme are met, the ALF runtime

does not use double buffering. If the work block uses the overlapped I/O buffer,

the ALF runtime first checks the conditions for the overlapped I/O buffer scheme,

and if those conditions are not met, double buffering is not used.

These examples use the following assumptions:

1. All SPUs have 256 KB of local memory.

2. 16 KB of memory is used for code and runtime data including stack, the task

context buffer, and the data transfer list. This leaves 240 KB of local storage for

the work block buffers.

3. Transferring data in or out of accelerator memory takes one unit of time and

each computation takes two units of time.

4. The input buffer size of the work block is represented as in_size, the output

buffer size as out_size, and the overlapped I/O buffer size as overlap_size.

5. There are three computations to be done on three inputs, which produce three

outputs.

Buffer schemes

The conditions and decision tree are further explained in the examples below.

v Four-buffer scheme: In the four-buffer scheme, two buffers are dedicated for

input data and two buffers are dedicated for output data. This buffer use is

shown in the Four-buffer scheme section of Figure 10.

Figure 10. ALF double buffering


– Conditions satisfied: The ALF runtime chooses the four-buffer scheme if the

work block does not use the overlapped I/O buffer and the buffer sizes

satisfy the following condition: 2*(in_size + out_size) <= 240 KB.

– Conditions not satisfied: If the buffer sizes do not satisfy the four-buffer

scheme condition, the ALF runtime will check if the buffer sizes satisfy the

conditions of the three-buffer scheme.
v Three-buffer scheme: In the three-buffer scheme, the buffer is divided into three

equally sized buffers of the size max(in_size, out_size). The buffers in this

scheme are used for both input and output as shown in the Three-buffer scheme

section of Figure 10 on page 59. This scheme requires the output data movement

of the previous result to be finished before the input data movement of the next

work block starts, so the DMA operations must be done in order. The advantage

of this approach is that for a specific work block, if the input and output buffer

are almost the same size, the total effective buffer size can be 2*240/3 = 160 KB.

– Conditions satisfied: The ALF runtime chooses the three-buffer scheme if the

work block does not use the overlapped I/O buffer and the buffer sizes

satisfy the following condition: 3*max(in_size, out_size) <= 240 KB.

– Conditions not satisfied: If the conditions are not satisfied, the single-buffer

scheme is used.
v Overlapped I/O buffer scheme: In the overlapped I/O buffer scheme, two

contiguous buffers are allocated as shown in the Overlapped I/O buffer scheme

section of Figure 10 on page 59. The overlapped I/O buffer scheme requires the

output data movement of the previous result to be finished before the input data

movement of the next work block starts.

– Conditions satisfied: The ALF runtime chooses the overlapped I/O buffer

scheme if the work block uses the overlapped I/O buffer and the buffer sizes

satisfy the following condition: 2*(in_size + overlap_size + out_size) <= 240

KB.

– Conditions not satisfied: If the conditions are not satisfied, the single-buffer

scheme is used.
v Single-buffer scheme: If none of the cases outlined above can be satisfied,

double buffering is not used, but performance might not be optimal.

When creating buffers and data partitions, remember the conditions of these

buffering schemes. If your buffer sizes can meet the conditions required for double

buffering, it can result in a performance gain, but double buffering does not double

performance in all cases. When the time periods required by data movements

and computation are significantly different, the problem becomes either I/O-bound

or computing-bound. In this case, enlarging the buffers to allow more data for a

single computation might improve the performance even with a single buffer.


Chapter 15. ALF host application and data transfer lists

One important decision to make is whether to use accelerator data transfer list

generation.

See Figure 9 on page 53 for the flow diagram of ALF applications on the host.

When a large number of accelerators is used in one compute task, or when the
data transfer list is complex, the host might not be able to generate work
blocks as fast as the accelerators can process them. In that case, you can supply the

data needed for data transfer list generation in the parameters of the work block

and use the accelerators to generate the data transfer lists based on these

parameters. You can also start a new task to be queued while another task is

running. You can then prepare the work blocks for the new task before the task

runs. However, the ALF programming model is SPMD, so the newly created task

starts when the current task is finished.


Chapter 16. Debugging and tuning

For easier debugging, first create a simple compute kernel that gives the correct

results, then focus on optimizing the algorithm to get better performance.

Debugging and tuning on the host are simpler than on the accelerator. An

advantage of programming on ALF is scalability. You can debug most applications
using a single accelerator and then verify that everything runs

properly before moving to multiple accelerators. This approach can also help you

recreate bug scenarios where the issue happens only with a specific work block

sequence, because the whole sequence might not be preserved when multiple

accelerators are used.

To improve performance, try any of the following:

v Variants of data partition schemes.

The size of the data partition can have a significant impact on performance. On

memory-constrained architectures like Cell/B.E., this is especially important.

Data partitions that are large compared to the available accelerator memory will

prevent you from using double buffering. Data partitions that are very small in

relation to the available accelerator memory will increase the overhead of

processing work blocks and degrade performance. Try a partition scheme that

requires less than 50% of the total memory size minus the code footprint.
Remember to add the runtime stack size and any other overheads, such as
global data, when sizing the code footprint.

v Accelerator data partitioning.

When there is a large number of accelerators and the work blocks are small or

data layout is complex, the host might become a bottleneck when generating the

data movement descriptions for work blocks. Consider offloading the generation

of data movement descriptions to the accelerators to improve performance.

v Multi-use work blocks.

When there is a large number of small work blocks, the scheduler running on

the host might be heavily loaded and the performance might degrade. In this

case, use multi-use work blocks to reduce the load on the scheduler and

improve the overall performance. Multi-use work blocks actually group several

small work blocks together and allow them to run sequentially on one

accelerator. You might also need to adjust the iteration counts to get the best

performance.

v New data structures.

Data structures and layout schemes can significantly affect the complexity of

data movement and data access speed on the host and accelerator. Typical

considerations include array of structure (AOS) versus structure of array (SOA)

conversion, and row and column transpose. This, however, is closely related to

the algorithm implemented by the compute kernel.

v New algorithms.

The performance of algorithms can differ from architecture to architecture.

Algorithms also affect the way data is organized and moved around. The data

movement can become a performance bottleneck. Using a less advanced

algorithm that requires more simple data movement might improve the overall

performance.


Chapter 17. Matrix addition example

This simple application adds two two-dimensional matrixes and stores the result
in a third matrix. For two-dimensional matrix addition, the

mathematical definition is as follows:

C = A + B, where c(i,j) = a(i,j) + b(i,j), as shown in Figure 11.

Simple solution

In the following analysis, assume the data to be a 1024x512 single-precision

floating point matrix. The following is a piece of plain C code that solves the

problem:

float mat_a[1024][512];

float mat_b[1024][512];

float mat_c[1024][512];

int main(void)

{

int i,j;

for (i=0; i<1024; i++)

for (j=0; j<512; j++)

mat_c[i][j] = mat_a[i][j] + mat_b[i][j];

return 0;

}

The limitation of this simple approach is that it cannot be made to run faster on
a system with many accelerators that can process c(i,j) = a(i,j) + b(i,j) in parallel. There
are parallel programming languages and models that can speed up the program.

Potential solution for parallel speed increase

In general, most matrix math operations can be decomposed into similar

operations on many submatrixes. The operations on these submatrixes can be done

in parallel if there are no dependencies between them. For example, take a

1024x512 matrix and divide the matrix into 128 submatrixes, each of which has

64x64 elements. Then the operation can be done on each 64x64 submatrix in

parallel. In theory, the computation of the 1024x512 matrix addition can be

completed in 1/128th of the time of the simple serialized code.

Figure 11. Two-dimensional matrix addition


Partition scheme

Two-dimensional matrixes are usually represented in two-dimensional arrays in C

code.

The actual memory layout of a two-dimensional array in C code is in

one-dimensional arrays concatenated by the second (or column) index, as shown in

Figure 12 and Figure 13.

In the matrix addition example in Chapter 17, “Matrix addition example,” on page

65, the submatrixes were the basic unit of data. In this C matrix data structure, a

submatrix is part of the whole array as shown in Figure 12 and Figure 13. This

provides the following partition schemes to choose from:

v Partition Scheme A: With this partition scheme, the submatrixes are a part of the

whole column or row of the matrix. One of the submatrixes of ″a[m][n]″ is

defined as ″sa[h][v]″ where h < m and v <= n.

Figure 12. Memory organization of a two-dimensional array ″a[m][n]″, part A

Figure 13. Memory organization of a two-dimensional array ″a[m][n]″, part B


v Partition Scheme B: With this partition scheme, the submatrixes are defined as a

set of adjacent full-length rows of the matrix. One of the submatrixes of

″a[m][n]″ is defined as ″sa[h][v]″ where h < m and v == n.

The data of the submatrix in partition scheme A is collected from disjoint

segments in the data buffer of the matrix. For partition scheme B, the submatrix is

from one contiguous segment of the matrix. Mathematically, this makes no

significant difference, but the data movement in our matrix addition example is

significantly more complex in partition scheme A than in partition scheme B, as

can be seen in the following example code.

float a[m][n], b[m][n], c[m][n];

{

int i,j,k;

float sa[h][v], sb[h][v], sc[h][v];

// Partition Scheme A

for (i=0; i<m; i+=h)

for (j=0; j<n; j+=v)

{

for(k=0; k<h; k++)

{

data_move(&sa[k][0], &a[i+k][j], v*sizeof(float));

data_move(&sb[k][0], &b[i+k][j], v*sizeof(float));

}

call_mat_add_kernel(sa, sb, sc, h,v);

for(k=0; k<h; k++)

data_move(&c[i+k][j], &sc[k][0], v*sizeof(float));

}

// Partition Scheme B

for (i=0; i<m; i+=h)

{

data_move(&sa[0][0], &a[i][0], v*h*sizeof(float));

….......

a(1,1)

a(m,1)

a(1,n)

a(m,n)

…....... a(2,n)

a(1,2)

a(2,1)

Figure 14. Partition scheme A: Data partition of a two-dimensional submatrix

Figure 15. Partition scheme B: Data partition of a two-dimensional submatrix


data_move(&sb[0][0], &b[i][0], v*h*sizeof(float));

call_mat_add_kernel(sa, sb, sc, h,v);

data_move(&c[i][0], &sc[0][0], v*h*sizeof(float));

}

}

Based on the above analysis, partition scheme B is preferred in this matrix addition

example. Remember that this situation might change in some real world scenarios

where large contiguous data movement might not be supported.

Example compute kernel

After the data partition scheme has been defined, implement the compute kernel.

The following is a simple example of a compute kernel:

FILE: my_header.h

typedef struct _add_parms_t

{

unsigned int h;

unsigned int v;

} add_parms_t;

FILE: my_kernel.c

#include <alf_accel.h>

#include "my_header.h"

int alf_comp_kernel(void *p_parm_ctx_buffer,

void *p_input_buffer, void *p_output_buffer,

unsigned int current_count, unsigned int total_count)

{

unsigned int i, cnt;

float *sa, *sb, *sc;

add_parms_t *p_parm = (add_parms_t *) p_parm_ctx_buffer;

cnt = p_parm->h * p_parm->v;

sa = (float *) p_input_buffer;

sb = sa + cnt;

sc = (float *) p_output_buffer;

for(i=0; i<cnt; i++)

sc[i] = sa[i] + sb[i];

return 0;

}

The main thread and data transfer lists

This example shows the implementation of the main thread using the ALF

implementation on Cell/B.E. architecture. To prevent errors caused by address

alignment in the example code of the simple compute kernel in the previous

section, all of the data is aligned when it is defined.

FILE: my_main.c

#include <alf.h>

#include <string.h>

#include "my_header.h"

#define H 16

#define V 512

#define MY_ALIGN(_my_var_def_, _my_al_) _my_var_def_ \

__attribute__((__aligned__(_my_al_)))

MY_ALIGN(float mat_a[1024][512], 128);

MY_ALIGN(float mat_b[1024][512], 128);


MY_ALIGN(float mat_c[1024][512], 128);

spe_program_handle_t spe_matrix_add;

int main(void)

{

alf_handle_t half;

alf_task_handle_t htask;

alf_wb_handle_t hwb;

alf_task_info_t tinfo;

alf_task_info_t_CBEA spe_tsk;

add_parms_t parm;

int i, nodes;

alf_configure(NULL);

alf_query_system_info(ALF_INFO_NUM_ACCL_NODES, &nodes);

alf_init(&half, nodes, ALF_INIT_PERSIST);

spe_tsk.spe_task_image = &spe_matrix_add;

spe_tsk.max_stack_size = 4096;

memset(&tinfo, 0, sizeof(tinfo));

tinfo.p_task_info = &spe_tsk;

tinfo.parm_ctx_buffer_size = sizeof(add_parms_t);

tinfo.input_buffer_size = H*V*2*sizeof(float); //64k

tinfo.output_buffer_size = H*V*sizeof(float); // 32k

tinfo.dt_list_entries = 0; // let the runtime decide this

tinfo.task_attr = 0; // do partition on the host

alf_task_create(&htask, half, &tinfo);

parm.h=H;

parm.v=V;

for(i=0; i<1024; i+=H)

{

alf_wb_create (&hwb, htask, ALF_WB_SINGLE, 1);

alf_wb_add_parm (hwb, &parm, sizeof(parm),

ALF_DATA_BYTE, 0);

alf_wb_add_io_buffer (hwb,&mat_a[i][0], H*V*sizeof(float),

ALF_DATA_FLOAT, ALF_BUFFER_INPUT);

alf_wb_add_io_buffer (hwb,&mat_b[i][0], H*V*sizeof(float),

ALF_DATA_FLOAT, ALF_BUFFER_INPUT);

alf_wb_add_io_buffer (hwb,&mat_c[i][0], H*V*sizeof(float),

ALF_DATA_FLOAT, ALF_BUFFER_OUTPUT);

alf_wb_enqueue(hwb);

}

alf_task_wait(&htask, -1);

alf_task_destroy(&htask);

alf_exit(&half, ALF_SHUTDOWN_WAIT);

return 0;

}


Chapter 18. Matrix transpose example

A two-dimensional matrix transpose is a common operation in matrix

computations and is defined as flipping the columns and rows by swapping the

indexes of matrix elements.

Figure 16 shows the data movement patterns of a two-dimensional matrix

transpose.

Simple solution

This implementation assumes that the data represents a 1024x512 single-precision

floating point matrix. This piece of plain C code performs the two-dimensional

matrix transpose.

float mat_a[1024][512];

float mat_c[512][1024];

int main(void)

{

int i,j;

for (i=0; i<512; i++)

for (j=0; j<1024; j++)

mat_c[i][j] = mat_a[j][i];

return 0;

}

Potential solution for parallel speed increase

Similar to the matrix addition example, matrix transpose can also be decomposed

into transposes on many submatrixes. The operations on these submatrixes can be

done in parallel to increase the speed of the process. The submatrix approach is

used throughout this chapter.

Partition scheme

As with the matrix addition example, the data partition parameters (h and v) must

be determined, but there is one difference between a matrix transpose problem and

matrix addition: as the submatrix is transposed from the source to the destination,

contiguous data movement from the source results in very fragmented data

movements. To address this, select a compromise between the size of one

contiguous data segment and the number of data transfers.

c(i,j) = a(j,i)

Figure 16. Matrix transpose in detail


In Figure 17 and Figure 18, the matrixes ″a[m][n]″ and ″c[n][m]″ are partitioned

into submatrixes, and ″Axy[h][v]″ is transposed to ″Cyx[v][h]″.

The following example program describes the data movement patterns of the

problem. The goal is to maximize the memory usage of the accelerator so that the

size of the submatrixes is constant. The total number of data movements is the

sum of the input and output data movements. In mathematical language, the

problem can be expressed as: Since h * v ≈ Constant, what h and v result in the

minimum h + v? When you look for integer solutions, the optimum result is found

when h and v are equal, or as close as possible.

float a[m][n], c[n][m];

{

int i,j,k;

float sa[h][v], sc[v][h];

for (i=0; i<m; i+=h)

for (j=0; j<n; j+=v)

{

for(k=0; k<h; k++)

data_move(&sa[k][0], &a[i+k][j], v*sizeof(float));

Figure 17. Data partition of matrix transpose, part A

Figure 18. Data partition of matrix transpose, part B


call_mat_transpose_kernel(sa, sc, h,v);

for(k=0; k<v; k++)

data_move(&c[j+k][i], &sc[k][0], h*sizeof(float));

}

}

Example compute kernel

The compute kernel is implemented in C code without SPE Single Instruction

Multiple Data (SIMD) intrinsics for instructional purposes. You can optimize this

code using SIMD and loop unrolling for better performance.

FILE: my_header.h

typedef struct _trans_parms_t

{

unsigned int h;

unsigned int v;

} trans_parms_t;

FILE: my_kernel.c

#include <alf_accel.h>

#include "my_header.h"

int alf_comp_kernel(void *p_parm_ctx_buffer,

void *p_input_buffer, void *p_output_buffer,

unsigned int current_count, unsigned int total_count)

{

unsigned int i, j;

float *sa, *sc;

trans_parms_t *p_parms = (trans_parms_t *)p_parm_ctx_buffer;

sa = (float *) p_input_buffer;

sc = (float *) p_output_buffer;

for(i=0; i< p_parms->h; i++)

for(j=0; j< p_parms->v; j++)

*(sc+j*p_parms->h + i) /* sc[j][i] */

= *(sa+i*p_parms->v +j); /* sa[i][j] */

return 0;

}

The main thread and data transfer lists

There are two approaches to the implementation of the main thread by different

data transfer list generation policies. In the first approach, data transfer list

generation is accomplished on the host. In the second approach, the data transfer

list generation is done on the accelerator with the help of the input parameters.

The following sections compare these two approaches:

v Data transfer lists generated in the host: The following example shows the

implementation of the main thread with the data list generated on the host

using the ALF implementation on Cell/B.E. architecture. The data must be

aligned properly when it is defined.

– FILE: my_main.c

#include <string.h>

#include <alf.h>

#include "my_header.h"

#define H 128

#define V 128


#define MY_ALIGN(_my_var_def_, _my_al_) _my_var_def_ \

__attribute__((__aligned__(_my_al_)))

MY_ALIGN(float mat_a[1024][512], 128);

MY_ALIGN(float mat_c[512][1024], 128);

int main(void)

{

// ... same as before

tinfo.parm_ctx_buffer_size = sizeof(trans_parms_t);

tinfo.input_buffer_size = H*V*sizeof(float); //64k

tinfo.output_buffer_size = H*V*sizeof(float); // 64k

tinfo.dt_list_entries = (H > V) ? H : V; /* max(H, V) */

tinfo.task_attr = 0;

alf_task_create(&htask, half, &tinfo);

parm.h=H;

parm.v=V;

for(X=0; X<1024; X+=H)

for(Y=0; Y<512; Y+=V)

{

alf_wb_create (&hwb, htask, ALF_WB_SINGLE, 1);

alf_wb_add_parm (hwb, &parm, sizeof(parm),

ALF_DATA_BYTE, 0);

for(i=0; i<H; i++)

alf_wb_add_io_buffer (hwb, &mat_a[X+i][Y], V*sizeof(float),

ALF_DATA_FLOAT, ALF_BUFFER_INPUT);

for(j=0; j<V; j++)

alf_wb_add_io_buffer (hwb, &mat_c[Y+j][X], H*sizeof(float),

ALF_DATA_FLOAT, ALF_BUFFER_OUTPUT);

alf_wb_enqueue(hwb);

}

// ... same as before

return 0;

}

– In this solution, all data transfer list generation logic is run on the host. This

is reasonable when there are few work blocks. Otherwise, this can create a

performance bottleneck because the accelerators might be idle while waiting

for the data transfer list generation to finish on the host. Better performance

can be expected from the second solution in this situation, where the

capability of each accelerator is fully used.

v Data transfer lists generated in the accelerator: The following example shows

the implementation of the main thread with the data list generated on the

accelerator using the ALF implementation on Cell/B.E. architecture. This

solution does not call alf_wb_add_io_buffer. To prepare the data transfer list on

the accelerator, the task_attr field in the task information structure must be set

to ALF_TASK_ATTR_PARTITION_ON_ACCEL. Also, more information needs to be

passed to the host in the parameters, so the trans_parms_t data structure is also

expanded to include more information.

– FILE: my_header.h

typedef struct _trans_parms_t

{

unsigned int h;

unsigned int v;

unsigned int DIMX, DIMY;

unsigned int X, Y;

float *p_mat_a, *p_mat_c;

} trans_parms_t;

FILE: my_main.c

#include <alf.h>

#include <string.h>


#include "my_header.h"

#define H 128

#define V 128

#define MY_ALIGN(_my_var_def_, _my_al_) _my_var_def_ \

__attribute__((__aligned__(_my_al_)))

MY_ALIGN(float mat_a[1024][512], 128);

MY_ALIGN(float mat_c[512][1024], 128);

int main(void)

{

// ... same as before

tinfo.parm_ctx_buffer_size = sizeof(trans_parms_t);

tinfo.input_buffer_size = H*V*sizeof(float); //64k

tinfo.output_buffer_size = H*V*sizeof(float); // 64k

tinfo.dt_list_entries = (H > V) ? H : V; /* max(H, V) */

tinfo.task_attr = ALF_TASK_ATTR_PARTITION_ON_ACCEL;

alf_task_create(&htask, half, &tinfo);

parm.h=H;

parm.v=V;

parm.DIMX = 1024;

parm.DIMY = 512;

parm.p_mat_a = &mat_a[0][0];

parm.p_mat_c = &mat_c[0][0];

for(X=0; X<1024; X+=H)

for(Y=0; Y<512; Y+=V)

{

parm.X = X;

parm.Y = Y;

alf_wb_create (&hwb, htask, ALF_WB_SINGLE, 1);

alf_wb_add_parm (hwb, &parm, sizeof(parm),

ALF_DATA_BYTE, 0);

alf_wb_enqueue(hwb);

}

// ... same as before

return 0;

}

Data transfer lists

Below are two accelerator APIs used to generate the data input or output transfer

lists respectively. See the API definition descriptions for more information on the

parameters.

v alf_prepare_input_list

v alf_prepare_output_list

These two macros are used to create the data transfer list structure and to add

the data transfer list entries:

v ALF_DT_LIST_CREATE

v ALF_DT_LIST_ADD_ENTRY

Note that the data transfer list must be initialized with the first macro before

appending an element to it. The following is the implementation for this example:

FILE: my_kernel.c

// omitted the previous kernel section here

int alf_prepare_input_list(void *p_task_context,


void *p_parm_ctx_buffer,

void *p_dt_list_buffer,

unsigned int current_count,

unsigned int total_count)

{

trans_parms_t *p_parm = (trans_parms_t *)p_parm_ctx_buffer;

float *pA;

unsigned int i;

addr64 ea;

pA = p_parm->p_mat_a +

p_parm->DIMY*p_parm->X + p_parm->Y; // mat_a[X][Y]

ALF_DT_LIST_CREATE(p_dt_list_buffer,0);

ea.ui[0] = 0;

for(i=0; i<p_parm->h; i++)

{

ea.ui[1] = (unsigned int)(pA + p_parm->DIMY*i); // mat_a[X+i][Y]

ALF_DT_LIST_ADD_ENTRY(p_dt_list_buffer,

p_parm->v,

ALF_DATA_FLOAT,

ea);

}

return 0;

}

int alf_prepare_output_list(void *p_task_context,

void *p_parm_ctx_buffer,

void *p_dt_list_buffer,

unsigned int current_count,

unsigned int total_count)

{

trans_parms_t *p_parm = (trans_parms_t *)p_parm_ctx_buffer;

float *pC;

unsigned int j;

addr64 ea;

pC = p_parm->p_mat_c +

p_parm->DIMX*p_parm->Y + p_parm->X; // mat_c[Y][X]

ALF_DT_LIST_CREATE(p_dt_list_buffer,0);

ea.ui[0] = 0;

for(j=0; j<p_parm->v; j++)

{

ea.ui[1] = (unsigned int)(pC + p_parm->DIMX*j); // mat_c[Y+j][X]

ALF_DT_LIST_ADD_ENTRY(p_dt_list_buffer,

p_parm->h,

ALF_DATA_FLOAT,

ea);

}

return 0;

}

Debugging and tuning

In the matrix transpose example, tuning the main thread will yield little

improvement, so the focus is on optimizing the compute kernel. An appropriate

strategy is to use SIMD and loop unrolling.

For a more thorough guide on how to optimize SPE code, see the SDK 2.1

Programmer’s Guide, which is listed in “Related documentation” on page 105. The


following example shows the compute kernel from the matrix transpose example

optimized with SPE intrinsics and some loop unrolling:

FILE: my_kernel.c

#include <alf_accel.h>

#include <spu_intrinsics.h> /* for vector types and spu_shuffle */

#include "my_header.h"

int alf_comp_kernel(void *p_parm_ctx_buffer,

void *p_input_buffer, void *p_output_buffer,

unsigned int current_count, unsigned int total_count)

{

unsigned int i, j;

vector float *sa, *sc;

trans_parms_t *p_parm = (trans_parms_t *)p_parm_ctx_buffer;

const vector unsigned char step1_pattern1 =

{0, 1, 2, 3, 16, 17, 18, 19, 4, 5, 6, 7, 20, 21, 22, 23};

const vector unsigned char step1_pattern2 =

{8, 9, 10, 11, 24, 25, 26, 27, 12, 13, 14, 15, 28, 29, 30, 31};

const vector unsigned char step2_pattern1 =

{0, 1, 2, 3, 4, 5, 6, 7, 16, 17, 18, 19, 20, 21, 22, 23};

const vector unsigned char step2_pattern2 =

{8, 9, 10, 11, 12, 13, 14, 15, 24, 25, 26, 27, 28, 29, 30, 31};

vector float f1, f2, f3, f4;

vector float tmp1, tmp2, tmp3, tmp4;

sa = (vector float *) p_input_buffer;

sc = (vector float *) p_output_buffer;

for(i=0; i< p_parm->h; i+=4)

for(j=0; j< p_parm->v; j+=4)

{

f1 = *(sa+(i )*p_parm->v/4 + j/4);

f2 = *(sa+(i+1)*p_parm->v/4 + j/4);

f3 = *(sa+(i+2)*p_parm->v/4 + j/4);

f4 = *(sa+(i+3)*p_parm->v/4 + j/4);

tmp1 = spu_shuffle(f1, f2, step1_pattern1);

tmp2 = spu_shuffle(f1, f2, step1_pattern2);

tmp3 = spu_shuffle(f3, f4, step1_pattern1);

tmp4 = spu_shuffle(f3, f4, step1_pattern2);

f1 = spu_shuffle(tmp1, tmp3, step2_pattern1);

f2 = spu_shuffle(tmp1, tmp3, step2_pattern2);

f3 = spu_shuffle(tmp2, tmp4, step2_pattern1);

f4 = spu_shuffle(tmp2, tmp4, step2_pattern2);

*(sc+(j )*p_parm->h/4 + i/4) = f1;

*(sc+(j+1)*p_parm->h/4 + i/4) = f2;

*(sc+(j+2)*p_parm->h/4 + i/4) = f3;

*(sc+(j+3)*p_parm->h/4 + i/4) = f4;

}

return 0;

}


Chapter 19. Vector min-max example

The current version of ALF has several new features such as the task context

buffer, overlapped I/O buffers, and synchronization points. The example in this

section shows some of these new features.

Understand the problem

The problem is defined as finding the maximum and minimum element value in a

vector sequence: solve a_min = min(a_i,t) and a_max = max(a_i,t), where the a_i,t are

elements of the vector sequence A_t = {a_0,t, a_1,t, ..., a_n-1,t}, t ≤ T, defined as follows:

v A_0 = {a_0,0, a_1,0, ..., a_n-1,0} is the given initial value;

v A_t = F(A_t-1), where a_j,t = c0·a_j,t-1 + c1·a_j+1,t-1 + c2·a_j+2,t-1 + c3·a_j+3,t-1, and

references to a_j,t-1 are defined as zero for all j > n-1.

Simple solution

The sequential code that solves this problem is shown in the example below. From

the definition of F(A), the calculation of a[i] does not rely on previously calculated

values of a[i]. You can overwrite the old value of each a[i] with its new value

during the computation, so the same buffer could be used for all of the

calculations. Also, a boundary condition check would normally be needed as the

index into A approaches n-3, because the indices of the elements used in the

calculation would then fall outside the array. Because boundary condition checks

during the calculation degrade the performance, it is better to avoid them. To

solve this problem, pad the array that stores the vector with three extra

elements that are initialized to zero.

#include <time.h>

#include <stdlib.h>

#define N (1024*1024)

#define T (4096)

#define C0 ((float)0.30)

#define C1 ((float)0.30)

#define C2 ((float)0.20)

#define C3 ((float)0.20)

// define the array to have more elements to remove the need to

// perform boundary condition checks when the calculation approaches a[N-3]

float a[N+3];

float amin, amax;

int main(void)

{

unsigned int i,j;

// init the array for initial value

srand(time(NULL));

for(i=0; i<N; i++)

a[i] = (float)(rand()%16384 - 8192);

for(; i<N+3; i++)

a[i] = (float)0.0;

// init the amin, amax

amin = amax = a[0];


// for the initial A0

for(i=0; i<N; i++)

{

if(amin > a[i]) amin = a[i];

else if(amax < a[i]) amax = a[i];

}

for(j=1; j<=T; j++)

{

// calculate A[t]

for(i=0; i<N; i++)

{

a[i] = C0*a[i] + C1*a[i+1] + C2*a[i+2] + C3*a[i+3];

if(amin > a[i]) amin = a[i];

else if(amax < a[i]) amax = a[i];

}

}

// here is the result

}

Potential solution for parallel speed increase

The calculation of new element values of vector A is based only on old values, so

in theory, the values of these elements can be calculated at the same time for each

new At. In reality, the parallelism of a system is limited, and it is reasonable to

divide the vector into several blocks and compute the blocks in parallel. For the

vector min-max problem:

1. Divide work into multiple blocks

2. Calculate the result for each work block

3. Merge these results when all the calculations are complete

Partition scheme

The vector min-max problem can be solved by dividing the data buffer into work

blocks. The input data is a one-dimensional array, so the input array is divided

into smaller portions for the work blocks. In a parallel programming environment

like ALF, there is no strict control over the order of work block processing, so the

same buffer cannot be reused to store input and output.

In the vector min-max example, two buffers are used to hold the data for the steps

t = 0,2,4,... and t = 1,3,5,... respectively. When calculating step t, the reference data

comes from buffer (t-1) mod 2 and the result is written to buffer t mod 2.

Task context buffer

In the vector min-max example, the task context buffer provides an opportunity to

accelerate the min-max process in a parallel model. Each instance of the task

running on the accelerators will have its own context, so it can save the most

recent min-max values found in the work blocks it processes. Then, after the task is

finished, the host program can do a trivial reduction of the local results to derive

the global results. Another usage of the task context is to communicate the global

parameters that apply to each work block. In this example, the parameters of the

function F() and the partition size can be passed to the accelerator by using the

task context buffer.


Overlapped I/O buffer

For a single work block, the input data can be overwritten during the computation

of the sequential code. The vector min-max example uses the overlapped I/O

buffer to reduce the memory requirement of the work block to make it possible to

enlarge the data partition to about half of the available memory.

Barrier

The ALF programming model provides a barrier synchronization point to ensure

that tasks are completed in the correct order.

In the figure below, you can see that one task depends on data from another task.

The computations of step t require the computations of step t-1 to be finished to

make sure the reference data is updated to the most recent values.

The code list

The following is a complete listing of the code rather than a step-by-step

walkthrough like the previous examples. Note that the code listed here is not

fully optimized, so that it is easier to read.

/***************************************************************

* my_header.h

* shared definitions header

***************************************************************/

typedef struct _my_parm_t

{

float * addr_a; //input/output data address

float * addr_b; //output/input data address

int size; //problem size

unsigned char dummy[16 - 12]; //dummy for alignment

} my_parm_t;

Figure 19. Data partition and task dependency of the vector computation example (diagram omitted: it shows blocks blk[0]..blk[n] of Buf[(t-1)%2] and Buf[t%2] with padding, the F(A) computations from step t-1 to step t, and the barrier between the two steps)

// writable section of task context

typedef struct _my_task_context_w

{

float max; //max float

float min; //min float

unsigned char dummy[16 - 8]; //dummy for alignment

} my_task_context_w;

// read-only section of task context

typedef struct _my_task_context_r

{

// read-only section

float c0; //C0

float c1; //C1

float c2; //C2

float c3; //C3

} my_task_context_r;

/***************************************************************

* spu.c

* Accelerator code

***************************************************************/

#include <alf_accel.h>

#include "my_header.h"

int alf_comp_kernel(volatile void *p_task_context,

volatile void *p_parm_ctx_buffer,

volatile void *p_input_buffer,

volatile void *p_output_buffer,

unsigned int current_count,

unsigned int total_count)

{

my_task_context_r *p_ctx_r = (my_task_context_r *) p_task_context;

my_task_context_w *p_ctx_w =

(my_task_context_w *)((char *)p_task_context+sizeof(my_task_context_r));

my_parm_t *p_parm = (my_parm_t *) p_parm_ctx_buffer;

float *a = (float *)p_input_buffer;

float *b = (float *)p_output_buffer;

float c0 = p_ctx_r->c0;

float c1 = p_ctx_r->c1;

float c2 = p_ctx_r->c2;

float c3 = p_ctx_r->c3;

int size = p_parm->size;

int i;

for(i=0;i<size;i++)

{

b[i]=c0*a[i]+c1*a[i+1]+c2*a[i+2]+c3*a[i+3];

if(b[i]>p_ctx_w->max)

p_ctx_w->max = b[i];

if(b[i]<p_ctx_w->min)

p_ctx_w->min = b[i];

}

return 0;

}

int alf_prepare_input_list(void *p_task_context,

void *p_parm_ctx_buffer,

void *p_dt_list_buffer,

unsigned int current_count,

unsigned int total_count)


{

my_parm_t *p_parm = (my_parm_t *) p_parm_ctx_buffer;

float *addr = p_parm->addr_a;

int size;

int small_size;

int i,left;

addr64 ea;

size = p_parm->size + 4; // do not forget the boundary ones

// now decide the per DT list entry size

// this is because of the 16KB per DMA entry limitation of MFC

small_size = (size)/(16*1024/sizeof(float));

left = size%(16*1024/sizeof(float));

ALF_DT_LIST_CREATE(p_dt_list_buffer,0);

ea.ui[0] = 0; // high 32 bits; set before the loop so the final entry is valid too

for(i=0;i<small_size;i++)

{

ea.ui[1] = (unsigned int)addr+i*16*1024+current_count*size*sizeof(float);

ALF_DT_LIST_ADD_ENTRY(p_dt_list_buffer, 16*1024, ALF_DATA_BYTE, ea);

}

if(left > 0)

{

ea.ui[1] = (unsigned int)addr+small_size*16*1024

+current_count*size*sizeof(float);

ALF_DT_LIST_ADD_ENTRY(p_dt_list_buffer, left, ALF_DATA_FLOAT, ea);

}

return 0;

}

int alf_prepare_output_list(void* p_task_context,

void *p_parm_ctx_buffer,

void *p_dt_list_buffer,

unsigned int current_count,

unsigned int total_count)

{

my_parm_t *p_parm = (my_parm_t *) p_parm_ctx_buffer;

float *addr=p_parm->addr_b;

int size=p_parm->size;

int small_size;

int i,left;

addr64 ea;

// this is because of the 16KB per DMA entry limitation of MFC

small_size=size/(16*1024/sizeof(float));

left=size%(16*1024/sizeof(float));

ALF_DT_LIST_CREATE(p_dt_list_buffer,0);

ea.ui[0] = 0; // high 32 bits; set before the loop so the final entry is valid too

for(i=0;i<small_size;i++)

{

ea.ui[1] = (unsigned int)addr+i*16*1024+current_count*size*sizeof(float);

ALF_DT_LIST_ADD_ENTRY(p_dt_list_buffer, 16*1024, ALF_DATA_BYTE, ea);

}

if(left > 0)

{

ea.ui[1] = (unsigned int)addr+small_size*16*1024

+current_count*size*sizeof(float);

ALF_DT_LIST_ADD_ENTRY(p_dt_list_buffer, left, ALF_DATA_FLOAT, ea);

}

return 0;

}

/***************************************************************

* main.c

* Host code

***************************************************************/

#include <stdio.h>


#include <libmisc.h>

#include <float.h>

#include <stdlib.h>

#include <time.h>

#include <string.h>

#include <alf.h>

#include "my_header.h"

#define N (1024*1024)

#define T (128)

#define BLOCK_SIZE (16*1024)

#define C0 ((float)0.30)

#define C1 ((float)0.30)

#define C2 ((float)0.20)

#define C3 ((float)0.20)

// define the array to have more elements for simplification

// of boundary condition check when the calculation comes to a[N-3]

float a[2][N+4] __attribute__ ((aligned (128)));

float amin, amax;

extern spe_program_handle_t spe_ops;

void prepare_data()

{

int i;

for(i=0; i<N; i++)

{

a[0][i] = (float)(rand()%8192 - 4096);

}

// init the padding

for(i=N; i<N+4; i++)

{

a[1][i] = a[0][i] = (float)0.0f;

}

// init the amin, amax

amin = amax = a[0][0];

// for the initial A0

for(i=0; i<N; i++)

{

if(amin > a[0][i]) amin = a[0][i];

else if(amax < a[0][i]) amax = a[0][i];

}

}

int main(void)

{

alf_handle_t half;

alf_task_handle_t htask;

alf_wb_handle_t hwb;

alf_task_info_t tinfo;

alf_task_info_t_CBEA spe_tsk;

my_parm_t parm __attribute__ ((aligned (128)));

my_task_context_w *pctx_w;

my_task_context_r *pctx_r;

alf_wb_sync_handle_t hsync;

int i,j,instances;

unsigned int nodes;

prepare_data();

alf_configure(NULL);


alf_query_system_info(ALF_INFO_NUM_ACCL_NODES, &nodes);

alf_init(&half, nodes, ALF_INIT_PERSIST);

spe_tsk.spe_task_image = &spe_ops;

spe_tsk.max_stack_size = 4096; // make your best guess :-)

memset(&tinfo, 0, sizeof(tinfo));

tinfo.p_task_info = &spe_tsk;

tinfo.task_context_buffer_read_only_size = sizeof(my_task_context_r);

tinfo.task_context_buffer_writable_size = sizeof(my_task_context_w);

tinfo.parm_ctx_buffer_size = sizeof(my_parm_t);

tinfo.input_buffer_size = 0;

tinfo.overlapped_buffer_size = (BLOCK_SIZE+4)*sizeof(float);

//+4 for adjacent reference values

tinfo.output_buffer_size = 0;

tinfo.dt_list_entries = (BLOCK_SIZE+4)/(16*1024/sizeof(float)) + 1;

tinfo.task_attr = ALF_TASK_ATTR_PARTITION_ON_ACCEL;

instances = alf_task_create(&htask, half, &tinfo);

// prepare the task contexts

pctx_w = malloc_align(sizeof(my_task_context_w)*instances, 4);

pctx_r = malloc_align(sizeof(my_task_context_r), 4);

// init and assign the context buffers

pctx_r->c0=C0;

pctx_r->c1=C1;

pctx_r->c2=C2;

pctx_r->c3=C3;

for(i=0; i<instances; i++)

{

alf_task_context_handle_t hctx;

alf_task_context_create(&hctx, htask, 0 /* i+1 is ok too */);

alf_task_context_add_entry(hctx, pctx_r,sizeof(my_task_context_r),

ALF_DATA_BYTE, ALF_TASK_CONTEXT_READ);

pctx_w[i].min =amin;

pctx_w[i].max =amax;

alf_task_context_add_entry(hctx, &pctx_w[i], sizeof(my_task_context_w),

ALF_DATA_BYTE, ALF_TASK_CONTEXT_WRITABLE);

alf_task_context_register(hctx);

}

// now comes the real computation

for(j=0;j<T;j++)

{

for(i=0; i<N/BLOCK_SIZE; i++)

{

parm.addr_a = &a[j%2][i*BLOCK_SIZE];

parm.addr_b = &a[(j+1)%2][i*BLOCK_SIZE];

parm.size = BLOCK_SIZE;

alf_wb_create (&hwb, htask, ALF_WB_SINGLE, 1);

alf_wb_add_parm (hwb, &parm, sizeof(my_parm_t), ALF_DATA_BYTE, 0);

alf_wb_enqueue(hwb);

}

// add a barrier to make sure the different steps do not overlap

alf_wb_sync(&hsync, htask, ALF_SYNC_BARRIER, NULL, NULL, 0);

}

alf_task_wait(htask, -1);

alf_task_destroy(&htask);

// all done, reduce the max value to get the final result

amax = pctx_w[0].max;


amin = pctx_w[0].min;

for(i=1; i<instances; i++)

{

if(pctx_w[i].max > amax)

amax = pctx_w[i].max;

if(pctx_w[i].min < amin)

amin = pctx_w[i].min;

}

free_align(pctx_w);

free_align(pctx_r);

printf("The maximum element value is %f minimum element value is %f\n",

amax,amin);

alf_exit(&half, ALF_SHUTDOWN_WAIT);

return 0;

}


Part 4. Platform specific constraints for the ALF

implementation on Cell/B.E. architecture


Chapter 20. SPU resource reserved and used

Tags

ALF reserves the MFC DMA tags from 15 to 23 for internal use. SPU applications

should avoid using these reserved tags.

Cache line reservation

ALF uses the cache line reservation feature. For more information on cache line

reservations, refer to the SDK 2.1 Programmer’s Guide, which is listed in “Related

documentation” on page 105. Because only one reservation is allowed at a time in

the Atomic Unit of MFC, a cache line reservation might not be guaranteed when

the execution context returns to ALF. For example, all cache line reservations made

within alf_prepare_input_list, alf_prepare_output_list, and alf_comp_kernel

might not be preserved when these functions return control to the ALF runtime.


Chapter 21. Memory constraints

Local memory

The size of local memory on the accelerator is 256 KB and is shared by code and

data. Memory is not virtualized and is not protected. See Figure 20 for a typical

memory map of an SPU program. There is a runtime stack above the global data

memory section. The stack grows from the higher address to the lower address

until it reaches the global data section. Due to limitations of the programming

languages and compiler/linker tools, you cannot predict the maximum stack usage,

either when you develop the application or when the application is loaded.

requires more memory than was allocated, there will be a stack overflow

exception. When there is a stack overflow, the SPU application is shut down and a

message is sent to the PowerPC Processing Element (PPE).

ALF allocates the work block buffers directly from the memory region above the

runtime stack, as shown in Figure 21 on page 92. This is implemented by moving

the stack pointer (or equivalently by pushing a large amount of data into the

stack). To ALF, the larger the buffer is, the better it can optimize the performance

of a task by using techniques like double buffering. It is better to let ALF allocate

as much memory as possible from the runtime stack. Estimate the size of the stack

and use that value in the alf_task_info_t_CBEA data structure when the task is

created. If the stack size is too small at runtime, a stack overflow occurs and it

causes unexpected exceptions such as incorrect results or a bus error.

Figure 20. SPU local memory map of a common Cell/B.E. application (diagram omitted: from address 0x3FFFF down to 0x00000 the regions are SPU ABI reserved usage, runtime stack, global data, and code text)


Figure 21. SPU local memory map of an ALF application (diagram omitted: from address 0x3FFFF down to 0x00000 the regions are SPU ABI reserved usage, ALF's dynamically managed work block data buffer, runtime stack, ALF global data, and user code + ALF runtime)


Chapter 22. Data transfer list limitations

Data transfer information is used to describe the input, output, and input or

output data movement. The ALF implementation on Cell/B.E. has the following

constraints.

v Data transfer information for a single work block can consist of up to 8 data

transfer lists for each direction of transfer (accelerator memory to host memory

and the reverse).

v Each data transfer list consists of up to 2048 data transfer entries.

v Each entry can describe up to 16 KB of data transfer between the contiguous

area in host memory and accelerator memory.

v All of the entries within the same data transfer list share the same high 32-bits

effective address.

v The local store area described by each entry within the same data transfer list

must be contiguous.

v Transfer size and the low 32 bits of the effective address for each data transfer

entry must be 16 bytes aligned.


Part 5. Compile time options

Several compile time options are available when you build the ALF runtime

library. By changing or enabling these options, internal debug features can be

enabled. These are helpful when debugging the host or compute kernel code.

Enabling these extra debug features can significantly slow down your application

due to the amount of information dumped and the extra error checks, so they

should be disabled when debugging is complete.

alf_config.h

Compile time options are located in the ALF global configuration header file. It is

stored in the same location as the other external header files, for example, in

alf\include\alf_config.h.

_ALF_DEBUG_LEVEL_

#define _ALF_DEBUG_LEVEL_ x // where x is a number between 0 and 9

0 Default value of this macro; no debug information is dumped

1 Enables dumping of textual error information in the host runtime, in addition to

returning the standard error codes

2 Enables dumping of textual error information in both the host and accelerator

runtimes, in addition to returning the standard error codes

>=3 Dumps internal debug trace information

_ALF_CFG_CHECK_DEBUG_

The _ALF_CFG_CHECK_DEBUG_ macro enables strict argument checking. It is

disabled by default. When this option is enabled on Cell/B.E. architecture, all of

the DMA source or destination addresses and transfer size combinations will be

checked for address alignment and size before a DMA request is issued to help

identify bugs related to improper address or size alignment problems. However,

the resulting code size on the SPU will be increased and performance will be

degraded.

#define _ALF_CFG_CHECK_DEBUG_ // disabled by default


Part 6. Appendixes


Appendix. Accessibility features

Accessibility features help users who have a physical disability, such as restricted

mobility or limited vision, to use information technology products successfully.

The following list includes the major accessibility features:

v Keyboard-only operation

v Interfaces that are commonly used by screen readers

v Keys that are tactilely discernible and do not activate just by touching them

v Industry-standard devices for ports and connectors

v The attachment of alternative input and output devices

IBM® and accessibility

See the IBM Accessibility Center at http://www.ibm.com/able/ for more

information about the commitment that IBM has to accessibility.

Notices

This information was developed for products and services offered in the U.S.A.

The manufacturer may not offer the products, services, or features discussed in this

document in other countries. Consult the manufacturer’s representative for

information on the products and services currently available in your area. Any

reference to the manufacturer’s product, program, or service is not intended to

state or imply that only that product, program, or service may be used. Any

functionally equivalent product, program, or service that does not infringe any

intellectual property right of the manufacturer may be used instead. However, it is

the user’s responsibility to evaluate and verify the operation of any product,

program, or service.

The manufacturer may have patents or pending patent applications covering

subject matter described in this document. The furnishing of this document does

not give you any license to these patents. You can send license inquiries, in

writing, to the manufacturer.

For license inquiries regarding double-byte (DBCS) information, contact the

Intellectual Property Department in your country or send inquiries, in writing, to

the manufacturer.

The following paragraph does not apply to the United Kingdom or any other

country where such provisions are inconsistent with local law: THIS

INFORMATION IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND,

EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE

IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR

FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of

express or implied warranties in certain transactions, therefore, this statement may

not apply to you.

This information could include technical inaccuracies or typographical errors.

Changes are periodically made to the information herein; these changes will be

incorporated in new editions of the publication. The manufacturer may make

improvements and/or changes in the product(s) and/or the program(s) described

in this publication at any time without notice.

Any references in this information to Web sites not owned by the manufacturer are

provided for convenience only and do not in any manner serve as an endorsement

of those Web sites. The materials at those Web sites are not part of the materials for

this product and use of those Web sites is at your own risk.

The manufacturer may use or distribute any of the information you supply in any

way it believes appropriate without incurring any obligation to you.

Licensees of this program who wish to have information about it for the purpose

of enabling: (i) the exchange of information between independently created

programs and other programs (including this one) and (ii) the mutual use of the

information which has been exchanged, should contact the manufacturer.

Such information may be available, subject to appropriate terms and conditions,

including in some cases, payment of a fee.

The licensed program described in this information and all licensed material

available for it are provided by IBM under terms of the IBM Customer Agreement,

IBM International Program License Agreement, IBM License Agreement for

Machine Code, or any equivalent agreement between us.

Any performance data contained herein was determined in a controlled

environment. Therefore, the results obtained in other operating environments may

vary significantly. Some measurements may have been made on development-level

systems and there is no guarantee that these measurements will be the same on

generally available systems. Furthermore, some measurements may have been

estimated through extrapolation. Actual results may vary. Users of this document

should verify the applicable data for their specific environment.

Information concerning products not produced by this manufacturer was obtained

from the suppliers of those products, their published announcements or other

publicly available sources. This manufacturer has not tested those products and

cannot confirm the accuracy of performance, compatibility or any other claims

related to products not produced by this manufacturer. Questions on the

capabilities of products not produced by this manufacturer should be addressed to

the suppliers of those products.

All statements regarding the manufacturer’s future direction or intent are subject to

change or withdrawal without notice, and represent goals and objectives only.

The manufacturer’s prices shown are the manufacturer’s suggested retail prices, are

current and are subject to change without notice. Dealer prices may vary.

This information is for planning purposes only. The information herein is subject to

change before the products described become available.

This information contains examples of data and reports used in daily business

operations. To illustrate them as completely as possible, the examples include the

names of individuals, companies, brands, and products. All of these names are

fictitious and any similarity to the names and addresses used by an actual business

enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which

illustrate programming techniques on various operating platforms. You may copy,

modify, and distribute these sample programs in any form without payment to the

manufacturer, for the purposes of developing, using, marketing or distributing

application programs conforming to the application programming interface for the

operating platform for which the sample programs are written. These examples

have not been thoroughly tested under all conditions. The manufacturer, therefore,

cannot guarantee or imply reliability, serviceability, or function of these programs.

CODE LICENSE AND DISCLAIMER INFORMATION:

The manufacturer grants you a nonexclusive copyright license to use all

programming code examples from which you can generate similar function

tailored to your own specific needs.

SUBJECT TO ANY STATUTORY WARRANTIES WHICH CANNOT BE

EXCLUDED, THE MANUFACTURER, ITS PROGRAM DEVELOPERS AND

SUPPLIERS, MAKE NO WARRANTIES OR CONDITIONS EITHER EXPRESS OR

IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES

OR CONDITIONS OF MERCHANTABILITY, FITNESS FOR A PARTICULAR

PURPOSE, AND NON-INFRINGEMENT, REGARDING THE PROGRAM OR

TECHNICAL SUPPORT, IF ANY.

UNDER NO CIRCUMSTANCES IS THE MANUFACTURER, ITS PROGRAM

DEVELOPERS OR SUPPLIERS LIABLE FOR ANY OF THE FOLLOWING, EVEN

IF INFORMED OF THEIR POSSIBILITY:

1. LOSS OF, OR DAMAGE TO, DATA;

2. SPECIAL, INCIDENTAL, OR INDIRECT DAMAGES, OR FOR ANY

ECONOMIC CONSEQUENTIAL DAMAGES; OR

3. LOST PROFITS, BUSINESS, REVENUE, GOODWILL, OR ANTICIPATED

SAVINGS.

SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION OR LIMITATION OF

DIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, SO SOME OR ALL

OF THE ABOVE LIMITATIONS OR EXCLUSIONS MAY NOT APPLY TO YOU.

Each copy or any portion of these sample programs or any derivative work, must

include a copyright notice as follows:

© (your company name) (year). Portions of this code are derived from IBM Corp.

Sample Programs. © Copyright IBM Corp. _enter the year or years_. All rights

reserved.

If you are viewing this information in softcopy, the photographs and color

illustrations may not appear.

Trademarks

The following terms are trademarks of International Business Machines

Corporation in the United States, other countries, or both:

IBM

developerWorks

PowerPC

PowerPC Architecture

Resource Link

Adobe, Acrobat, Portable Document Format (PDF), and PostScript are either

registered trademarks or trademarks of Adobe Systems Incorporated in the United

States, other countries, or both.

Cell Broadband Engine and Cell/B.E. are trademarks of Sony Computer

Entertainment, Inc., in the United States, other countries, or both, and are used under

license therefrom.

Linux® is a trademark of Linus Torvalds in the United States, other countries, or

both.

Other company, product or service names may be trademarks or service marks of

others.

Terms and conditions

Permission for the use of these publications is granted subject to the following

terms and conditions.

Personal Use: You may reproduce these publications for your personal,

noncommercial use provided that all proprietary notices are preserved. You may

not distribute, display or make derivative works of these publications, or any

portion thereof, without the express consent of the manufacturer.

Commercial Use: You may reproduce, distribute and display these publications

solely within your enterprise provided that all proprietary notices are preserved.

You may not make derivative works of these publications, or reproduce, distribute

or display these publications or any portion thereof outside your enterprise,

without the express consent of the manufacturer.

Except as expressly granted in this permission, no other permissions, licenses or

rights are granted, either express or implied, to the publications or any data,

software or other intellectual property contained therein.

The manufacturer reserves the right to withdraw the permissions granted herein

whenever, in its discretion, the use of the publications is detrimental to its interest

or, as determined by the manufacturer, the above instructions are not being

properly followed.

You may not download, export or re-export this information except in full

compliance with all applicable laws and regulations, including all United States

export laws and regulations.

THE MANUFACTURER MAKES NO GUARANTEE ABOUT THE CONTENT OF

THESE PUBLICATIONS. THESE PUBLICATIONS ARE PROVIDED “AS-IS” AND

WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED,

INCLUDING BUT NOT LIMITED TO IMPLIED WARRANTIES OF

MERCHANTABILITY, NON-INFRINGEMENT, AND FITNESS FOR A

PARTICULAR PURPOSE.

Related documentation

All of the documentation listed in this section is available on the ISO image. The

latest versions of some documents may be available from the referenced web pages

or on your system after installing components of the SDK.

Cell/B.E. processor

There is a set of tutorial and reference documentation for the Cell/B.E. processor in

the IBM online technical library at:

http://www.ibm.com/chips/techlib/techlib.nsf/products/Cell_Broadband_Engine

v Cell Broadband Engine Architecture

v Cell Broadband Engine Programming Handbook

v Cell Broadband Engine Registers

v C/C++ Language Extensions for Cell Broadband Engine Architecture

v Synergistic Processor Unit (SPU) Instruction Set Architecture

v SPU Application Binary Interface Specification

v Assembly Language Specification

v Cell Broadband Engine Linux Reference Implementation Application Binary Interface

Specification

Cell/B.E. programming using the SDK

v SDK 2.1 Installation Guide

v SDK 2.1 Programmer’s Guide

v Cell Broadband Engine Programming Tutorial

v SIMD Math Library

v Accelerated Library Framework Programmer’s Guide and API Reference

After you have installed the SDK, you can also find the following PDFs in the

/opt/ibm/cell-sdk/prototype/docs directory:

v SDK Sample Library documentation

v IDL compiler documentation

The following documents are available as downloads from:

http://www.ibm.com/chips/techlib/techlib.nsf/products/Cell_Broadband_Engine

v Cell Broadband Engine Programming Tutorial documentation

v SPE Runtime Management library documentation Version 1.2

v SPE Runtime Management library documentation Version 2.1 (beta)

v SPE Runtime Management library Version 1.2 to Version 2.0 Migration Guide

IBM XL C/C++ Compiler

After you have installed the SDK, you can find the following PDFs in the

/opt/ibmcmp/xlc/8.2/doc directory.

v Getting Started with IBM XL C/C++ Compiler

v IBM XL C/C++ Compiler Language Reference

v IBM XL C/C++ Compiler Programming Guide

v IBM XL C/C++ Compiler Reference

v IBM XL C/C++ Compiler Installation Guide

IBM Full-System Simulator

After you have installed the SDK, you can also find the following PDFs in the

/opt/ibm/systemsim-cell/doc directory.

v IBM Full-System Simulator Users Guide

v IBM Full-System Simulator Command Reference

v Performance Analysis with the IBM Full-System Simulator

v IBM Full-System Simulator BogusNet HowTo

PowerPC Base

The following documents can be found on the developerWorks® Web site at:

http://www.ibm.com/developerworks/eserver/library

v PowerPC Architecture™ Book, Version 2.02

– Book I: PowerPC User Instruction Set Architecture

– Book II: PowerPC Virtual Environment Architecture

– Book III: PowerPC Operating Environment Architecture

v PowerPC Microprocessor Family: Vector/SIMD Multimedia Extension Technology

Programming Environments Manual Version 2.07c

Glossary

Accelerator

General or special purpose processing element in

a hybrid system. An accelerator might have a

multi-level architecture with both host elements

and accelerator elements. An accelerator, as

defined here, is a hierarchy with potentially

multiple layers of hosts and accelerators. An

accelerator element is always associated with one

host. Aside from its direct host, an accelerator

cannot communicate with other processing

elements in the system. The memory subsystem

of the accelerator can be viewed as distinct and

independent from a host. This is referred to as the

subordinate in a cluster collective.

All-reduce operation

Output from multiple accelerators is reduced and

combined into one output.

Compute kernel

Part of the accelerator code that performs a stateless computation task on one piece of input data and generates the corresponding output results.

Compute task

An accelerator execution image that consists of a

compute kernel linked with the accelerated

library framework accelerator runtime library.

Host

A general purpose processing element in a hybrid

system. A host can have multiple accelerators

attached to it. This is often referred to as the

master node in a cluster collective.

Main thread

The main thread of the application. In many

cases, Cell/B.E. architecture programs are

multi-threaded using multiple SPEs running

concurrently. A typical scenario is that the

application consists of a main thread that creates

as many SPE threads as needed and the

application organizes them.

PPE

PowerPC Processor Element. The general-purpose

processor in the Cell/B.E. processor.

SIMD

Single Instruction Multiple Data. Processing in

which a single instruction operates on multiple

data elements that make up a vector data-type.

Also known as vector processing. This style of

programming implements data-level parallelism.

SPE

Synergistic Processor Element. SPEs extend the

PowerPC 64 architecture by acting as cooperative

offload processors (synergistic processors), with

the direct memory access (DMA) and

synchronization mechanisms to communicate

with them (memory flow control), and with

enhancements for real-time management. There

are 8 SPEs on each Cell/B.E. processor.

SPMD

Single Program Multiple Data. A common style of

parallel computing. All processes use the same

program, but each has its own data.

SPU

Synergistic Processor Unit. The part of an SPE

that executes instructions from its local store (LS).

Work block

A basic unit of data to be managed by the

framework. It consists of one piece of the

partitioned data, the corresponding output buffer,

and related parameters. A work block is

associated with a task. A task can have as many

work blocks as necessary.

Work queue

An internal data structure of the accelerated

library framework that holds the lists of work

blocks to be processed by the active instances of

the compute task.

Index

A

Accelerator API 47

address calculations 14

ALF API reference 23

alf_comp_kernel 47

alf_configure 26

ALF_DT_LIST_ADD_ENTRY 49

ALF_DT_LIST_CREATE 49

ALF_ERR_POLICY_T 25

alf_exit 29

alf_handle_t 25

alf_init 28

alf_prepare_input_list 47

alf_prepare_output_list 48

alf_query_system_info 27

alf_register_error_handler 30

alf_task_context_add_entry 35

alf_task_context_create 33

alf_task_context_handle_t 31

alf_task_context_register 36

alf_task_create 32

alf_task_destroy 38

alf_task_handle_t 31

alf_task_info_t 31

alf_task_info_t_CBEA 51

alf_task_query 36

alf_task_wait 37

alf_wb_add_io_buffer 41

alf_wb_add_parm 41

alf_wb_create 39

alf_wb_enqueue 40

alf_wb_sync 43

alf_wb_sync_wait 44

application structure 53

B

basic framework API 25

buffer layouts 11

buffers 11

task context buffer 11, 81

work block input data buffer 11

work block output data buffer 11

work block overlapped input and output data buffer 11

work block parameter and context buffer 11

C

Cell/B.E. architecture platform-dependent API 51

compute kernel 68, 73

compute task 1, 3, 7

compute task API 31

conventions 23

D

data partitioning 17

data structures 39

data transfer list 7, 68, 73, 93

debugging 63, 76

documentation 105

double buffering 59

E

error handling 21

H

host API 25

host memory 17

L

local memory allocation 14

M

main thread 68

matrix addition 65

matrix transpose 71, 72

memory

host memory 91

local memory 91

memory constraints 17, 91

O

overlapped I/O buffer 81

overview of ALF 1

P

partition scheme 80

performance tuning 63

S

SDK documentation 105

SPE 76

sync_callback_func 44

synchronization points 19

barrier 19

notify 19

T

two-dimensional array 66

V

vector min-max 79

W

work block API 39

work blocks 1, 3

multi-use 9, 17

single-use 9, 17

work queue 53


Printed in USA

SC33-8333-01