Parallel Matlab Based on Telescoping Languages and Data Parallel Compilation
Ken Kennedy, Rice University

Collaborators: Raj Bandyopadhyay, Zoran Budimlic, Arun Chauhan, Daniel Chavarria-Miranda, Keith Cooper, Jason Eckhardt, Mary Fletcher, Rob Fowler, John Garvin, Guohua Jin, Chuck Koelbel, Cheryl McCosh, John Mellor-Crummey, Linda Torczon

http://www.cs.rice.edu/~ken/Presentations/TeleParallelMatlabSC05.pdf
The Programming Problem: A Strategy
• Make it possible for end users to become application developers:
  • users integrate software components using scripting languages (e.g., Matlab, Python, Visual Basic, S+)
  • professional programmers develop the software components
The Programming Problem: An Obstacle
• Achieving high performance:
  • translate scripts and components to a common intermediate language
  • optimize the resulting program using whole-program compilation
Whole-Program Compilation
Problem: long compilation times, even for short scripts!
Problem: expert knowledge on specialization lost
[Diagram: script + component library → translator → global optimizing compiler → code generator]

Telescoping Languages
[Diagram: the component library is fed offline to a compiler generator (which could run for many hours) to produce an L1 compiler that understands library calls as primitives; a script plus the L1 component library then passes through translator, L1 compiler, and code generator to yield an optimized application]
Telescoping Languages: Advantages
• Compile times can be reasonable
• High-level optimizations can be included
• User retains substantive control over language performance
• Generate a new language with user-produced libraries
• Reliability can be improved
Applications
• Matlab SP (Signal Processing)
  • optimizations applied to functions
• Statistical calculations in science and medicine in R and S-PLUS
  • hundred-fold performance improvements
• Library generation and maintenance
• Component integration systems
• Parallelism in Matlab
Library Generator (LibGen)
• Prof. Dan Sorensen (Rice CAAM) maintains ARPACK, a large-scale eigenvalue solver
• He prototypes the algorithms in Matlab, then generates 8 variants in Fortran by hand:
  • {Real, Complex} x {Single, Double} x {Symmetric, Nonsymmetric}
  • special handling for {Dense, Sparse}
• Could this hand-generation step be avoided?
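The eight hand-built variants are exactly a cross product of three two-way choices. As a minimal illustration (the function and label names here are invented, not Sorensen's), a generator could enumerate them:

```python
from itertools import product

def arpack_variants():
    """Enumerate the 8 variant labels as a cross product of
    field, precision, and symmetry (illustrative names only)."""
    fields = ["real", "complex"]
    precisions = ["single", "double"]
    symmetries = ["symmetric", "nonsymmetric"]
    return ["-".join(v) for v in product(fields, precisions, symmetries)]

variants = arpack_variants()
print(len(variants))   # 8
print(variants[0])     # real-single-symmetric
```

A library generator would attach a specialized code body to each label rather than just a name, but the combinatorial structure it must cover is the same.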
LibGen Scaling
[Chart: values from 0 to 1.5 for matrix sizes 362x362, 3200x3200, and 10000x10000, comparing MATLAB 6.1, MATLAB 6.5, ARPACK, and LibGen]
Key Technology: Type Inference
• Translation of Matlab to C or Fortran requires type inference
• At each point in a function, the translator must know types as functions of the input parameter types ("type jump functions")
• Constraint-based type inference algorithm
  • polynomial time
  • can be used to produce different specialized variants based on input types
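To make the idea of a "type jump function" concrete, here is a toy sketch (the table, names, and tiny type lattice are illustrative, not the actual LibGen machinery): a jump function maps input parameter types to the result type, and the same type tuple names a specialized variant.

```python
# Toy "type jump function" for a hypothetical add primitive:
# maps the types of the input parameters to the result type.
JUMP_ADD = {
    ("int", "int"): "int",
    ("int", "double"): "double",
    ("double", "int"): "double",
    ("double", "double"): "double",
    ("double", "complex"): "complex",
    ("complex", "double"): "complex",
    ("complex", "complex"): "complex",
}

def infer_add(t1, t2):
    """Return the result type of add(t1, t2)."""
    return JUMP_ADD[(t1, t2)]

def specialize(fname, arg_types):
    """Name the specialized variant chosen for these input types."""
    return fname + "_" + "_".join(arg_types)

print(infer_add("int", "double"))                # double
print(specialize("add", ("double", "complex")))  # add_double_complex
```

Composing such tables through a function body is what lets the translator know, at every program point, the type of each value as a function of the input parameter types.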
Value of Specialization
[Chart: values from 0 to 25 (baseline 0.072 sec) for the variants sparse-symmetric-real, dense-symmetric-real, dense-symmetric-complex, and dense-nonsymmetric-complex]
Parallelism in Matlab
• Add distributions to Matlab arrays
• Distributions can cross multiple processors
• Use distributions to determine parallelism
• Hide parallelism in library component operations
• Specialize for specific distributions
[Diagram: a 400-element array block-distributed as A(1:100), A(101:200), A(201:300), A(301:400)]
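A block distribution like the one pictured is just arithmetic on indices. A minimal sketch (hypothetical helper, assuming n divisible by p and Matlab-style 1-based global indexing):

```python
def block_owner(i, n, p):
    """Owner and local index of 1-based global index i for an
    n-element array block-distributed over p processors
    (assumes n is divisible by p)."""
    size = n // p
    proc = (i - 1) // size        # 0-based processor id
    local = (i - 1) % size + 1    # 1-based local index
    return proc, local

# A(1:400) over 4 processors: A(101:200) lives on processor 1
print(block_owner(101, 400, 4))  # (1, 1)
print(block_owner(200, 400, 4))  # (1, 100)
```

Richer distributions (generalized block, multipartitioning) replace this closed-form map with per-distribution lookup code, which is why the library hides the map behind component operations.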
Parallel Matlab
D = distributed(n,m,block,block);
A = B + C .* D
A = D(1:n-2,1:m) + D(3:n,1:m)
Where is each operation performed?
Some data must be communicated!
Parallel Matlab Implementation
[Diagram: A block-distributed as A(1:100), A(101:200), A(201:300), A(301:400)]
A(2:399) = (A(1:398) + A(3:400))/2
[Boundary elements exchanged between neighboring blocks: A(100)/A(101), A(200)/A(201), A(300)/A(301)]
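The boundary exchange above can be simulated in a few lines. The following is a toy NumPy model of the owner-computes update, not the Rice runtime: each of p blocks receives one halo element from each neighbor, updates its interior locally, and the stitched-together result matches the sequential stencil.

```python
import numpy as np

def stencil_seq(A):
    """Reference: A(2:n-1) = (A(1:n-2) + A(3:n)) / 2, Matlab-style."""
    B = A.copy()
    B[1:-1] = (A[:-2] + A[2:]) / 2
    return B

def stencil_blocked(A, p):
    """Same update on p block partitions; each block fetches one halo
    element per neighbor (e.g. the A(100)/A(101) exchange above)."""
    n = len(A)
    size = n // p                    # assumes n divisible by p
    blocks = [A[k * size:(k + 1) * size] for k in range(p)]
    pieces = []
    for k in range(p):
        lo = blocks[k - 1][-1:] if k > 0 else np.empty(0)     # left halo
        hi = blocks[k + 1][:1] if k < p - 1 else np.empty(0)  # right halo
        ext = np.concatenate([lo, blocks[k], hi])
        b = blocks[k].copy()
        off = 1 if k > 0 else 0         # block start inside ext
        start = 1 if k == 0 else 0      # skip the global endpoints
        end = size - 1 if k == p - 1 else size
        for j in range(start, end):
            b[j] = (ext[j + off - 1] + ext[j + off + 1]) / 2
        pieces.append(b)
    return np.concatenate(pieces)

A = np.arange(400.0) ** 2
print(np.allclose(stencil_blocked(A, 4), stencil_seq(A)))  # True
```

Only one element crosses each block boundary per step, which is why the library can profitably specialize the data-movement routine per distribution.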
Parallel Matlab Interpreter
[Diagram: the Matlab interpreter invokes computation and data-movement routines on a cluster holding the block-distributed array A(1:100), A(101:200), A(201:300), A(301:400)]
Base Matlab Implementation
[Diagram: ahead of time, the library plus annotations passes through type inference and a code generator, which store type jump functions and specialized variants in a database; when a script is compiled, type inference on the script uses the returned type jump functions, and the code generator draws specialized variants from the database to produce object code]
JIT Implementation
[Diagram: the same library-plus-annotations processing populates the database of type jump functions and specialized variants; the script runs under the Matlab interpreter, which invokes type inference and the code generator just-in-time, drawing specialized variants from the database]
Matlab D Goals
• Build on existing Matlab compilation technology
• Target scalable distributed-memory systems with high efficiency
• Support a rich collection of data distributions
  • lesson of HPF
• Compile user scripts quickly
  • or support interpreter/JIT
dHPF Performance: Efficiency, SP Class 'C'
[Chart: parallel efficiency (0 to 1.2) vs. number of processors (1, 4, 9, 16, 25, 36, 49, 64, 81, 100) for MPI and dHPF]
Requires "multipartitioning" distribution
IMPACT-3D
• HPF application: simulate 3D Rayleigh-Taylor instabilities in plasma using TVD (1334 lines, 45 HPF directives)
• Problem size: 1024 x 1024 x 2048
• Compiled with the HPF/ES compiler:
  • 7.3 TFLOPS on 2048 ES processors, ~45% of peak
• Compiled with dHPF on PSC's Lemieux (Alpha + Quadrics):

  # procs | relative speedup | GFLOPS | % peak
  --------|------------------|--------|-------
  128     | 1.00             | 46.4   | 18.1
  256     | 1.94             | 89.9   | 17.6
  512     | 3.78             | 175.5  | 17.4
  1024    | 7.58             | 352.0  | 17.2
Parallel Matlab Implementation
• Produce parallel versions of functional components
  • for each supported distribution
• Produce data movement components
  • for each distribution pair
This sounds like a lot of work!
Specialization: HPF Legacy
• Library written with template distribution
• Useful distributions specified by library designer (standard + user-defined)
• Examples: multipartitioning, generalized block, adaptive mesh
• Library specialized using HPF compilation technology to exploit parallelism
• Performed at language generation time
Example
• Simple stencil calculation
• Insert explicit move operations
• Perform type and distribution analysis
B = distributed1(n, block, 0)
C = distributed1(n, block, 0)
...
A(2:n-1) = B(1:n-2) + C(3:n)

is translated to:

T1 = get(B, 1, n-2)
T2 = get(C, 3, n-2)
A(2:n-1) = add(T1, T2, 1, n-2)
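The get/add translation can be modeled in a few lines of Python. This is a toy, single-process stand-in for the runtime, and the (blocks, lower bound, upper bound) signature for get is an assumption of this sketch: get gathers a global section of a block-distributed array into a local temporary so that the add is purely local.

```python
import numpy as np

def get(blocks, lb, ub):
    """Gather the global section [lb, ub] (1-based, inclusive)
    of a block-distributed array into a local temporary."""
    flat = np.concatenate(blocks)
    return flat[lb - 1:ub]

n = 16
B = np.arange(1.0, n + 1)        # B(i) = i
C = np.arange(1.0, n + 1) * 10   # C(i) = 10*i
blocksB = np.split(B, 4)         # 4-way block distribution
blocksC = np.split(C, 4)

T1 = get(blocksB, 1, n - 2)      # B(1:n-2)
T2 = get(blocksC, 3, n)          # C(3:n)
A_mid = T1 + T2                  # A(2:n-1) = B(1:n-2) + C(3:n)
print(A_mid[0])                  # B(1) + C(3) = 1 + 30 = 31.0
```

In the real system the temporaries are themselves distributed and only the off-processor pieces move; the point of the translation is that all communication is confined to the get (move) calls.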
Example II: Convert to Fortran

      CALL MPI_INIT(ierr)
      t = gen_template(block, n)
      A = distributed1(n, t, 0)   ! creates a descriptor
      ...
      B = distributed1(n, t, 0)
      C = distributed1(n, t, 0)
      ...
      T1 = distributed1(n-2, t, 1)
      T2 = distributed1(n-2, t, 1)
      call moveB1(B, 1, n-2, T1)  ! B, slb, sub, T1
      call moveB1(C, 3, n, T2)
      call addB1(A, 2, n-1, 1, T1, T2)  ! no communication
      ...
      CALL MPI_FINALIZE(ierr)

Goal: no HPF compilation step
Distributions
• Structure
  • block, cyclic, block-block, generalized block, multipartitioning, etc.
• Parameters
  • template size and offset
[Diagram: array A aligned to a template with offset = 1]
Generating Move
• Generate specialized HPF input version
• Compile to Fortran + MPI

      subroutine moveBlock1(src, slb, sub, so, tar, to, tN)
      integer slb, sub, so, to, tN, i
      double precision src(slb:sub), tar(sub-slb+1)
CHPF$ processors p(number_of_processors())
CHPF$ template temp(tN)
CHPF$ align tar(i) with temp(i+to)
CHPF$ align src(i) with temp(i+so)
CHPF$ distribute temp(block) onto p
      do i = 1, sub-slb+1
         tar(i) = src(i+slb-1)
      enddo
      end
Wrapper Routines
• Convert from descriptors to arrays
call moveBLK1(B, 3, n, T1)
makes a direct call to moveBlock1:
call moveBlock1(B%array_ptr, 3, n, B%tmpl_offset, T1%array_ptr, T1%tmpl_offset, B%tmpl_size)
Performance: Stencils
[Chart: values from 0 to 1.0 on 1, 2, 4, and 8 processors for a 2D stencil in Matlab; 6 arrays, 512 x 512, (BLOCK,BLOCK) distribution; copy cost highlighted]
Replace move with shift?
Eliminating Copies
• Slice descriptors (with overlap areas)

      CALL MPI_INIT(ierr)
      t = gen_template(block, n+1)
      A = distributed1(n, t, 0)   ! creates a descriptor
      ...
      B = distributed1(n, t, 0)
      C = distributed1(n, t, 0)
      ...
      T1 = slicealias(n-2, t, 1)
      T2 = slicealias(n-2, t, 1)
      call shiftB1(B, 1, n-2, T1)  ! T1 points to storage in B
      call shiftB1(C, 3, n, T2)    ! T2 points to storage in C
      call addB1(A, 2, n-1, 1, T1, T2)
      ...
      CALL MPI_FINALIZE(ierr)

No change to addB1
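The move-versus-shift distinction is the difference between copying a section and aliasing it. As a single-address-space analogy only (NumPy views standing in for slicealias descriptors; the real shiftB1 must still communicate off-processor overlap elements):

```python
import numpy as np

def move(A, lb, ub):
    """'Move': copy A(lb:ub), 1-based inclusive, into fresh storage."""
    return A[lb - 1:ub].copy()

def shift(A, lb, ub):
    """'Shift': alias A(lb:ub) without copying (a NumPy view)."""
    return A[lb - 1:ub]

A = np.arange(10.0)
t_move = move(A, 1, 8)
t_shift = shift(A, 1, 8)
print(t_move.base is A)    # False: owns its storage
print(t_shift.base is A)   # True: points into A's storage
```

Both temporaries hold the same values, so a consumer like addB1 needs no change; only the cost of producing them differs, which is exactly the slide's point.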
Summary
• Goal: add generalized distribution-based parallelism to Matlab
• Embed parallelism in library
• Challenge: adding new distributions
• Strategy: Use HPF-style compiler technology on library