Elastic Computing: A Framework for Effective Multi-core Heterogeneous Computing


Page 1: Elastic Computing

ELASTIC COMPUTING: A Framework for Effective Multi-core Heterogeneous Computing

Page 2: Elastic Computing

Introduction

- Clear trend towards multi-core heterogeneous systems
- Problem: increased application-design complexity
  - Different resources require different algorithms to execute efficiently
  - Compiler research attempts to compile code for different resources, but is fundamentally limited: compilers can't infer one algorithm from another
- Elastic Computing: an optimization framework with a knowledge base of implementations for different elastic functions
  - Designers call functions that automatically optimize for any system, i.e., designers specify "what" without specifying "how"

[Figure: performance vs. system resources, comparing a single algorithm against the optimal algorithm per resource; a compiler maps quick_sort(...) and bitonic_sort(...) implementations onto a uP and an FPGA.]

Page 3: Elastic Computing

Elastic Function Library: Overview

[Figure: an application's int main(...) calls sort(A, 100); the Elastic Computing Framework consults the Sort Elastic Function's implementations (insertion_sort, quick_sort, bitonic_sort), compares their predicted performance on the available system resources, and dispatches quick_sort(...).]

- Instead of specifying a specific implementation, applications use Elastic Functions
- Elastic Functions contain a knowledge base of implementation and parallelization options
- At run-time, the Elastic Computing Framework determines the best execution decisions, based on the available system resources as well as the function parameters
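The run-time decision step can be sketched as a table lookup over decisions precomputed at installation-time. Everything below (the enum, the table, the threshold of 32) is a hypothetical illustration, not the framework's actual API:

```c
#include <stddef.h>

/* Hypothetical decision table produced at installation-time: for each
 * range of input sizes, the implementation predicted to be fastest. */
typedef enum { IMPL_INSERTION, IMPL_QUICK, IMPL_BITONIC } impl_t;

typedef struct {
    size_t max_n;  /* decision applies to inputs up to this many elements */
    impl_t impl;   /* predicted-fastest implementation for that range */
} decision_t;

static const decision_t decisions[] = {
    { 32,         IMPL_INSERTION },  /* tiny inputs: insertion sort */
    { (size_t)-1, IMPL_QUICK     },  /* everything else: quick sort */
};

/* Run-time dispatch: a single lookup, no measurement. */
impl_t pick_impl(size_t n) {
    size_t i = 0;
    while (n > decisions[i].max_n)
        i++;
    return decisions[i].impl;
}
```

The point of the sketch is that invoking an Elastic Function costs only a lookup at run-time; all measurement happens beforehand.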


Page 5: Elastic Computing

Elastic Function Library: Overview

- If multiple resources are available, the Elastic Computing Framework dynamically parallelizes work across them
- It automatically determines an efficient partitioning of work to resources
- It also determines the most efficient implementation for each resource individually
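A minimal sketch of the partitioning idea, assuming work splits in proportion to each resource's throughput (the real framework plans partitions from IPGs; the function name and rate parameters are illustrative):

```c
#include <stddef.h>

/* Split n elements between two resources in proportion to their
 * measured throughput (elements/sec): a crude stand-in for the
 * framework's IPG-based partitioning. */
void partition_work(size_t n, double rate_a, double rate_b,
                    size_t *n_a, size_t *n_b) {
    *n_a = (size_t)(n * rate_a / (rate_a + rate_b) + 0.5);
    *n_b = n - *n_a;
}
```

For example, a resource that sorts 1,000 elements in the time another sorts 5,000 receives one sixth of the work.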

[Figure: logical execution: a partition step splits the work, then quick_sort(...) runs one piece while bitonic_sort(...) runs the other.]

Page 6: Elastic Computing

[Figure: the partitioned pieces can themselves be partitioned again; nesting partition steps spreads work across additional resources.]

Page 7: Elastic Computing

Elastic Function Library: Overview

Elastic Computing is transparent:

- Applications treat Elastic Computing as a high-performance auto-tuning library of functions
- Elastic Computing determines how to efficiently execute the Elastic Functions on behalf of the application

Page 8: Elastic Computing

Elastic Function Library: Overview

Elastic Computing is transparent and portable:

- Elastic Computing automatically optimizes the Elastic Function execution to the available system resources, even if the application is moved to a different system

Page 9: Elastic Computing

Elastic Function Library: Overview

Elastic Computing is transparent, portable, and adaptive.

[Figure: for sort(A, 100), the performance comparison favors quick sort, so the framework dispatches quick_sort(...); for sort(A, 5), insertion sort wins, so it dispatches insertion_sort(...).]

Elastic Computing also automatically adapts the Elastic Function execution to the application’s input parameters (e.g., sorting 5 elements as opposed to 100)
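The adaptivity can be illustrated with a size-dependent dispatch. The threshold of 16 is a made-up value for this sketch, not one the framework prescribes; insertion sort simply tends to win on tiny arrays and quicksort on large ones:

```c
#include <stddef.h>
#include <stdlib.h>

static void insertion_sort(int *a, size_t n) {
    for (size_t i = 1; i < n; i++) {
        int key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}

static int cmp_int(const void *p, const void *q) {
    int x = *(const int *)p, y = *(const int *)q;
    return (x > y) - (x < y);
}

/* Input-adaptive selection: the dispatch depends on the invocation's
 * parameters, not just on the system. */
void adaptive_sort(int *a, size_t n) {
    if (n <= 16)
        insertion_sort(a, n);             /* e.g., sort(A, 5)   */
    else
        qsort(a, n, sizeof *a, cmp_int);  /* e.g., sort(A, 100) */
}
```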

Page 10: Elastic Computing

Related Work

- Parallel cross-compiling programming languages
  - Examples: CUDA, OpenCL, DirectX, ImpulseC
  - Allow a single code file to describe parallel computation that can compile to numerous devices
- Single-domain adaptable software libraries
  - Examples: FFTW (for FFT) [Frigo 98], ATLAS (for linear algebra) [Whaley 98]
  - Measure the performance of execution alternatives and determine the best way to execute the function for the specific function call and system
- General-purpose adaptable software libraries
  - Examples: PetaBricks [Ansel 09], SPIRAL [Püschel 05]
  - Use custom languages to expose algorithmic/implementation choices to the compiler, and rely on measured performance and learning techniques to determine the best choice
- Heterogeneous run-time partitioning frameworks
  - Examples: Qilin [Luk 09]
  - Uses dynamic compilation to determine a data graph, and relies on measured performance to determine an efficient partitioning of work across heterogeneous resources

Differentiating features of Elastic Computing:
- Allows specification of multiple algorithms for different devices
- Automatically determines efficient partitionings of work between heterogeneous devices
- Supports both multi-core and heterogeneous devices and is not specific to any domain
- Does not require custom programming languages or non-standard compilation
- In most cases, previous work can be used in conjunction with Elastic Computing

Page 11: Elastic Computing

Optimization Steps

- The Elastic Computing Framework performs two optimization steps to determine how to execute an Elastic Function efficiently:
  - Implementation Assessment collects performance information about the different implementation options for an Elastic Function
  - Optimization Planning then analyzes the predicted performance to determine efficient execution decisions
- To reduce run-time overhead, both optimization steps execute at installation-time and save their results to a file
  - May require several minutes to an hour to complete
  - Only needs to occur once per Elastic Function per system
- At run-time, the Elastic Function Execution step looks up the optimization decisions to execute the Elastic Function on behalf of an application

[Figure: installation-time: Implementation Assessment, then Optimization Planning, turn the Elastic Function into Optimization Decisions; run-time: Elastic Function Execution applies those decisions on behalf of the Application.]

Page 12: Elastic Computing

Optimization Steps

- Elastic Functions inform the Elastic Computing Framework of how to execute and optimize a function
  - May be created for nearly any function (e.g., sort, FFT, matrix multiply)
- Elastic Functions contain numerous alternate implementations of the function
  - Implementations may be single-core, multi-core, and/or heterogeneous
  - All implementations adhere to the same input/output parameters, making them interchangeable
- Elastic Functions also contain:
  - Dependent Implementations that specify how to parallelize the function
  - An Adapter to abstract function-specific details from the analysis steps
  - Details discussed later!

Example: the Sort Elastic Function's implementations include Quick Sort (C code), Bitonic Sort (VHDL code), and Merge Sort (CUDA code).


Page 13: Elastic Computing

Optimization Steps

- Implementation Assessment creates performance predictors for the implementations of the Elastic Function
- The performance predictors are called Implementation Performance Graphs (IPGs), which are:
  - Created for each implementation individually
  - Return the estimated execution time of the implementation when given the implementation's invocation parameters
- Example: a quick sort implementation

[Figure: the IPG for Quick Sort maps input parameters to execution time; for the sample invocation below, it predicts an execution time of 1.3 sec.]

Sample invocation:

void main() {
    // Other code...
    int array[10000];
    QuickSort(array);
    // Other code...
}
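An IPG lookup can be sketched as piece-wise linear interpolation over sampled (work metric, time) points. The sample data in the test below is illustrative, matching the slide's 1.3 sec example:

```c
/* A piece-wise linear IPG: sample points, sorted by work metric,
 * mapping work metric to measured execution time; a lookup
 * interpolates between the two surrounding points. */
typedef struct { double work, secs; } ipg_point;

double ipg_estimate(const ipg_point *p, int n, double work) {
    if (work <= p[0].work)
        return p[0].secs;
    for (int i = 1; i < n; i++) {
        if (work <= p[i].work) {
            double t = (work - p[i - 1].work) / (p[i].work - p[i - 1].work);
            return p[i - 1].secs + t * (p[i].secs - p[i - 1].secs);
        }
    }
    return p[n - 1].secs;  /* beyond the last sample: clamp */
}
```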


Page 14: Elastic Computing

Optimization Steps

- Optimization Planning then analyzes the IPGs to predetermine efficient Elastic Function execution decisions
  - Goal: make decisions that minimize the estimated execution time
- Answers two main execution questions:
  - Which implementation is the most efficient for an invocation?
  - How to efficiently partition computation across multiple resources?
- Details discussed later!

[Figure: for a Sort(array) invocation on int array[10000], the IPG for Quick Sort predicts 1.3 sec while the IPG for Bitonic Sort predicts 1.1 sec; Bitonic Sort is estimated to be most efficient at 1.1 sec!]

Page 15: Elastic Computing

Optimization Steps

- The output of Implementation Assessment and Optimization Planning is saved to a file for lookup at run-time
- Applications execute normally until they invoke an Elastic Function
- When an Elastic Function is invoked, the Elastic Function Execution step starts, which then:
  - Looks up the predetermined execution decisions based on the invocation parameters and the availability of system resources
  - Executes the Elastic Function using the predetermined decisions
  - Returns control to the application once the Elastic Function completes


Page 16: Elastic Computing

Design Flow

[Figure: Elastic Function Design: hardware vendors, library designers, and open-source efforts write Elastic Functions against the Elastic Function Interface Specification; Elastic Function Installation compiles them and runs Implementation Assessment and Optimization Planning to produce the Installed Elastic Functions. Application Design: the application developer writes Application Code containing Elastic Function Invocations, which Application Installation compiles into the Application Executable. System Run-time: once the Application is Launched, Application Execution proceeds, with each Elastic Function Invocation handled by Elastic Function Execution.]

Page 17: Elastic Computing

How does it work? Implementation Assessment and Optimization Planning are the main research challenges and the focus of on-going research. Time for details!

Page 18: Elastic Computing

Adapter

- Implementation Assessment creates an Implementation Performance Graph (IPG) for each implementation to predict its execution time from the input parameters
- An IPG is a piece-wise linear graph mapping the input parameters to estimated execution time
- Question: how do we map input parameters onto the x-axis for every Elastic Function?
- Answer: the adapter

[Figure: for Quick Sort, an invocation on int array[10000] maps cleanly onto the IPG's x-axis, predicting 1.3 sec; but for Convolution, the invocation Convolve(a, b) with float a[100] and float b[10000] has two sizes, so which maps to the x-axis?]

Page 19: Elastic Computing

Adapter

- The adapter maps the input/output parameters to a numeric value, called the work metric
  - Essentially an abstraction layer that lets Elastic Computing analyze, and thereby optimize, any type of Elastic Function
  - The developer creates the adapter as part of the Elastic Function
- Rules for the adapter's mapping:
  1. Parameters that map to the same work metric value should require equal execution times
  2. As the work metric value increases, execution time should also increase
- Example: sorting Elastic Function
  - Adapter: set the work metric equal to the number of elements to sort
  - Adheres to Rule 1: sorting the same number of elements generally takes the same time
  - Adheres to Rule 2: sorting more elements generally takes longer

[Figure: the QuickSort(array) invocation on int array[10000] maps to work metric = 10,000, and the Quick Sort IPG predicts an execution time of 1.3 sec.]

Page 20: Elastic Computing

Adapter

- Any work metric mapping that (mostly) adheres to Rules 1 and 2 is a valid adapter
- One technique is to set the mapping equal to the result of an asymptotic analysis of the function's performance
  - Asymptotic analysis yields an equation approximately proportional to execution time
  - Use that equation as the work metric mapping
- Example: convolution Elastic Function
  - Time-domain convolution has asymptotic performance Θ(|a|·|b|)
  - Therefore, set the work metric equal to the product of the lengths of the two input vectors

[Figure: the Convolve(a, b) invocation with float a[100] and float b[10000] maps to work metric = 100 × 10,000 = 1,000,000, and the Convolution IPG predicts an execution time of 1.7 sec.]
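The two adapters described above reduce to one-line mappings. A sketch (the function names are ours, not the framework's interface):

```c
#include <stddef.h>

/* Sort adapter (Rules 1 and 2): work metric = number of elements. */
double sort_work_metric(size_t n_elements) {
    return (double)n_elements;
}

/* Convolution adapter: time-domain convolution is Theta(|a|*|b|),
 * so the work metric is the product of the two input lengths. */
double convolve_work_metric(size_t len_a, size_t len_b) {
    return (double)len_a * (double)len_b;
}
```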

Page 21: Elastic Computing

Implementation Assessment

- Implementation Assessment relies on a heuristic to create IPGs, which:
  - Samples the execution time of the implementation at several work metrics to determine performance
  - Performs statistical analyses on sets of samples to find work metric intervals with linear trends
  - Adapts the sampling process to collect fewer samples in regions with linear trends

[Figure: collected samples (execution time vs. work metric) and the resulting IPG; the heuristic collected fewer samples in the linear regions.]
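The sampling idea can be sketched as follows. The linearity test (does the newest interior sample lie on the chord of its neighbors?) and the stride-doubling policy are our simplification for illustration, not the paper's actual heuristic:

```c
#include <math.h>
#include <stddef.h>

typedef double (*time_fn)(double work);  /* one timed run of the impl */

/* Sample execution time across work metrics; when the samples follow
 * a linear trend, double the stride so linear regions get fewer
 * samples, otherwise shrink it to sample more densely. */
size_t assess(time_fn measure, double w_max,
              double *work, double *secs, size_t cap) {
    double w = 1.0, step = 1.0;
    size_t n = 0;
    while (w <= w_max && n < cap) {
        work[n] = w;
        secs[n] = measure(w);
        n++;
        if (n >= 3) {
            double t = (work[n - 2] - work[n - 3]) /
                       (work[n - 1] - work[n - 3]);
            double pred = secs[n - 3] + t * (secs[n - 1] - secs[n - 3]);
            double rel = fabs(secs[n - 2] - pred) / (fabs(pred) + 1e-12);
            if (rel < 0.05)
                step *= 2.0;   /* linear trend: sample less densely  */
            else if (step > 1.0)
                step /= 2.0;   /* curved region: sample more densely */
        }
        w += step;
    }
    return n;
}

/* Toy workload model for demonstration: perfectly linear cost. */
double linear_cost(double w) { return 2.0 * w; }
```

On a perfectly linear workload the stride doubles every sample, so covering work metrics up to 1,000 takes on the order of ten samples rather than a thousand.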

Page 22: Elastic Computing

Optimization Planning

- Optimization Planning analyzes the IPGs to predetermine efficient execution decisions, and performs two main optimizations:
  - Fastest Implementation Planning predetermines the most efficient implementation for different invocation situations
  - Work Parallelization Planning predetermines how to efficiently parallelize computation
- Fastest Implementation Planning (FIP) creates Function Performance Graphs (FPGs) that allow a single lookup to return the best implementation for an invocation
  - FIP creates an FPG by overlaying the IPGs of the candidate implementations and saving only the lower envelope

[Figure: the candidate implementations (Quick Sort in C, Bitonic Sort in VHDL) have corresponding candidate IPGs; overlaying the IPGs and keeping the lower envelope yields the resulting FPG.]
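The lower-envelope construction can be sketched over a shared grid of work metrics; the per-point times in the test are illustrative, echoing the earlier 1.3 sec vs. 1.1 sec example:

```c
/* An FPG point: the fastest implementation at one work metric, found
 * by overlaying the candidate IPGs and keeping the lower envelope. */
#define N_IMPLS  2
#define N_POINTS 4

typedef struct { double secs; int impl; } fpg_point;

void build_fpg(double est[N_IMPLS][N_POINTS], fpg_point *out) {
    for (int p = 0; p < N_POINTS; p++) {
        out[p].secs = est[0][p];
        out[p].impl = 0;
        for (int i = 1; i < N_IMPLS; i++) {
            if (est[i][p] < out[p].secs) {
                out[p].secs = est[i][p];  /* lower envelope */
                out[p].impl = i;          /* remember the winner */
            }
        }
    }
}
```

A run-time invocation then needs only one FPG lookup to learn both the fastest implementation and its predicted time.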

Page 23: Elastic Computing

Optimization Planning

- Work Parallelization Planning (WPP) analyzes FPGs to determine partitionings of computation that minimize estimated execution time
- Dependent implementations are a type of implementation that uses the WPP results to determine how to efficiently parallelize computation
  - Developers create dependent implementations based on divide-and-conquer algorithms
  - Divide-and-conquer algorithms divide a big instance of a problem into multiple smaller instances, and are common for many types of functions
- Example: merge sort algorithm (a divide-and-conquer algorithm that performs sort)
- Question: how to parallelize computation across resources to maximize performance?
- Answer: determine partitionings that minimize the estimated execution time!

Merge Sort Algorithm:
  Initial call:   Sort( [ 3, 5, 7, 1, 2, 8, 5, 2 ] )
  Partition into: Sort( [ 3, 5, 7, 1, 2 ] ) and Sort( [ 8, 5, 2 ] )
  Nested output:  [ 1, 2, 3, 5, 7 ] and [ 2, 5, 8 ]
  Merge:          return [ 1, 2, 2, 3, 5, 5, 7, 8 ]

Merge Sort Dependent Implementation (pseudocode):

void MergeSortDepImp(input) {
    // Partition input
    [A_in, B_in] = Partition(input);
    // Perform recursive sorts
    In Parallel {
        A_out = sort(A_in);
        B_out = sort(B_in);
    }
    // Merge recursive outputs
    output = Merge(A_out, B_out);
    // Return output
    return output;
}

Page 24: Elastic Computing

Optimization Planning

- WPP uses a sweep-line algorithm to analyze pairs of FPGs and determine an efficient partitioning of computation between them
  - Example: partitioning sort between two resources
  - The algorithm analyzes all pairs of FPGs to consider all possible resource partitionings
  - The result is optimal, assuming the estimated FPG performance is accurate
- Implementation Assessment and Optimization Planning iterate to consider repeated nesting of dependent implementations
  - Repeated nesting of dependent implementations allows for arbitrarily many partitions
- Proposed improvements to WPP consider more parallelization options to allow more efficient parallelization decisions

[Figure: a sweep line across the FPG for Sort on a CPU and the FPG for Sort on an FPGA; both reach 1.2 sec at 1,000 and 5,000 elements respectively, so when sorting 6,000 elements, partition 1,000 to the CPU and 5,000 to the FPGA.]
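A brute-force stand-in for the sweep makes the objective concrete: the two pieces run in parallel, so the finish time of a split is the slower of the two FPG estimates. The toy linear FPGs below are our calibration to the slide's numbers (1,000 elements in 1.2 sec on the CPU, 5,000 in 1.2 sec on the FPGA), not measured data:

```c
#include <math.h>

typedef double (*fpg_fn)(double work);  /* FPG lookup for one resource */

/* Toy linear FPGs calibrated to the slide's example. */
double cpu_fpg(double w)  { return 1.2 * w / 1000.0; }
double fpga_fpg(double w) { return 1.2 * w / 5000.0; }

/* Try every split of `total` work at the given granularity and keep
 * the split minimizing the parallel finish time max(t_a, t_b). */
double best_split(fpg_fn a, fpg_fn b, double total, double step,
                  double *to_a) {
    double best = -1.0;
    for (double w = 0.0; w <= total; w += step) {
        double ta = a(w), tb = b(total - w);
        double finish = ta > tb ? ta : tb;
        if (best < 0.0 || finish < best) {
            best = finish;
            *to_a = w;
        }
    }
    return best;
}
```

With these FPGs the minimum lands where both resources finish together, reproducing the 1,000/5,000 split; WPP's sweep-line reaches the same answer without enumerating every split.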

Page 25: Elastic Computing

Status of Elastic Computing

- The Elastic Computing Framework is working!
  - Consists of over 200 files and 25k lines of code
- 13 Elastic Functions (and 35 implementations) created:
  - Convolution: Circular Convolution, Convolution, 2D Convolution
  - Linear Algebra: Inner Product, Matrix Multiply
  - Image Processing: Mean Filter, Optical Flow, Prewitt Filter, Sum-of-Absolute-Differences
  - Others: Floyd-Warshall, Lattice-Boltzmann, Longest Common Subsequence, and Sort
  - Easy to add new Elastic Functions and implementations
- 5 processing resources supported:
  - Multi-threaded implementations support MPI communication/synchronization features
  - GPU support: any CUDA-supported GPU
  - FPGA support: H101PCIXM, PROCeIII, and PROCStarIII
  - Adding support for new resources requires creating a wrapper for the driver's interface
- Elastic Computing Framework installed on: Alpha, Delta, Elastic, Marvel, Novo-G, and Warp
  - Easy to add new platforms

Page 26: Elastic Computing

Experimental Results

- Results collected on the Elastic system
- The Convolution Elastic Function contains 5 implementations:
  - Single-threaded CPU implementation using the time-domain algorithm
  - Multi-threaded CPU implementation using the time-domain algorithm
  - GPU implementation using the time-domain algorithm
  - FPGA implementation using the frequency-domain algorithm
  - Dependent implementation using overlap-add partitioning

[Figure (left): speedup of the Convolution Elastic Function as more resources are made available, from CPUs only up to 1 FPGA & 3 GPUs, on a scale of 0x to 90x. Figure (right): parallelization decisions for an invocation with work metric = 1,024,000,000; overlap-add partitioning splits the work into 553,512,960 and 470,487,040, the latter splitting again into 235,233,280 twice, each piece assigned a mix of CPUs, FPGAs, and GPUs, with FFT-based convolution on the FPGA and time-domain convolution elsewhere.]

Page 27: Elastic Computing

Experimental Results

- Results collected on Delta, Elastic, Marvel, and Novo-G for 11 Elastic Functions:
  - 2DConv = 2D convolution
  - CConv = circular convolution
  - Conv = 1D convolution
  - FW = Floyd-Warshall
  - Inner = inner product
  - Mean = mean image filter
  - MM = matrix multiply
  - Optical = optical flow
  - Prewitt = Prewitt edge detection
  - SAD = sum of absolute differences
  - Sort = sort

[Figure: per-function speedups on Delta, Elastic, Marvel, and Novo-G, mostly in the 0x to 25x range, with outliers of 49x, 80x, and 117x; per-system averages also shown.]

Page 28: Elastic Computing

Publication List

Elastic Computing publications:

- J. Wernsing and G. Stitt, "Elastic Computing: A Framework for Transparent, Portable, and Adaptive Multi-core Heterogeneous Computing," in LCTES'10: Proceedings of the ACM SIGPLAN/SIGBED 2010 Conference on Languages, Compilers, and Tools for Embedded Systems, pp. 115–124, 2010.
- J. Wernsing and G. Stitt, "A Scalable Performance Prediction Heuristic for Implementation Planning on Heterogeneous Systems," in ESTIMedia'10: 8th IEEE Workshop on Embedded Systems for Real-Time Multimedia, pp. 71–80, 2010.
- J. Wernsing and G. Stitt, "RACECAR: A Heuristic for Automatic Function Specialization on Multi-core Heterogeneous Systems," under review in PPoPP'12: 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2012.
- J. Wernsing and G. Stitt, "Elastic Computing: A Portable Optimization Framework for Hybrid Computers," under review in Parallel Computing Journal (ParCo) Special Issue on Application Accelerators in HPC.

Other publications:

- J. Wernsing, J. Ling, G. Cieslewski, and A. George, "Lightweight Reliable Communications Library for High-Performance Embedded Space Applications," in DSN'07: 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Edinburgh, UK, June 25–28, 2007 (student forum).
- J. Coole, J. Wernsing, and G. Stitt, "A Traversal Cache Framework for FPGA Acceleration of Pointer Data Structures: A Case Study on Barnes-Hut N-body Simulation," in ReConFig'09: International Conference on Reconfigurable Computing and FPGAs, pp. 143–148, 2009.
- J. Fowers, G. Brown, J. Wernsing, and G. Stitt, "A Performance and Energy Comparison of Convolution on GPUs, FPGAs, and Multicore Processors," under review in ACM Transactions on Architecture and Code Optimization (TACO) Special Issue on High-Performance and Embedded Architectures and Compilers.

Page 29: Elastic Computing

Conclusions

- Elastic Computing enables effective multi-core heterogeneous computing by:
  - Providing a framework for designing, reusing, and automatically optimizing computation on multi-core heterogeneous systems
  - Adapting execution decisions to execute efficiently based on the invocation's input parameters and the availability of system resources
  - Abstracting application developers from computation and optimization details
  - Enabling applications to be portable yet efficient across different systems
- Main research challenges:
  - Implementation Assessment, which creates performance predictors for implementations
  - Optimization Planning, which predetermines efficient execution decisions by analyzing the performance predictors
- Proposed improvements:
  - Improve Implementation Assessment to more intelligently sample an implementation when creating an IPG, reducing installation-time overhead without reducing accuracy
  - Improve Optimization Planning to consider more partitioning options, improving efficiency when parallelizing computation

Questions?