Parallel Computer Architecture and Programming
CMU 15-418/15-618, Spring 2020
Lecture 22:
Domain-Specific Programming Systems
Slide acknowledgments:
Pat Hanrahan, Zach Devito (Stanford University)
Jonathan Ragan-Kelley (MIT, Berkeley)
Course themes:
Designing computer systems that scale (running faster given more resources)
Designing computer systems that are efficient (running faster under constraints on resources)
Techniques discussed:
Exploiting parallelism in applications
Exploiting locality in applications
Leveraging hardware specialization (earlier lecture)
Claim: most software uses modern hardware resources inefficiently
▪ Consider a piece of sequential C code
- Call the performance of this code our “baseline
performance”
▪ Well-written sequential C code: ~ 5-10x faster
▪ Assembly language program: maybe another small
constant factor faster
▪ Java, Python, PHP, etc. ??
Credit: Pat Hanrahan
[Chart: Code performance relative to C (single core). Vertical axis: slowdown compared to C++ (0-40x). Languages: Java, Scala, C# (Mono), Haskell, Go, Javascript (V8), Lua, PHP, Python 3, Ruby (JRuby). Benchmarks: NBody, Mandelbrot, Tree Alloc/Dealloc, Power method (compute eigenvalue). C baseline compiled with GCC -O3 (no manual vector optimizations); several bars run off the chart (slowdowns of roughly 40-114x). Data from The Computer Language Benchmarks Game: http://shootout.alioth.debian.org]
Even good C code is inefficient
Recall Assignment 1’s Mandelbrot program
Consider execution on a high-end laptop: quad-core,
Intel Core i7, AVX instructions...
Single core, with AVX vector instructions: 5.8x speedup
over C implementation
Multi-core + hyper-threading + AVX instructions: 21.7x
speedup
Conclusion: basic C implementation compiled with -O3
leaves a lot of performance on the table
Making efficient use of modern
machines is challenging
(proof by assignments 2, 3, and 4)
In our assignments, you only programmed
homogeneous parallel computers.
…And parallelism even in that context was not easy.
Assignment 2: GPU cores only
Assignments 3 & 4: shared memory / message passing
Recall: the need for efficiency is leading to heterogeneous parallel platforms
- Integrated CPU + GPU
- GPU: throughput cores + fixed-function units
- CPU + data-parallel accelerator
- Mobile system-on-a-chip (e.g., Qualcomm Snapdragon SoC): CPU + GPU + media processing
Hardware trend: specialization of execution
▪ Multiple forms of parallelism
- SIMD/vector processing (fine-granularity parallelism: perform the same logic on different data)
- Multi-threading (mitigates inefficiencies/stalls caused by unpredictable data access)
- Multi-core, multiple nodes, multiple servers (varying scales of coarse-granularity parallelism)
▪ Heterogeneous execution capability
- Programmable, latency-centric (e.g., “CPU-like” cores)
- Programmable, throughput-optimized (e.g., “GPU-like” cores)
- Fixed-function, application-specific (e.g., image/video/audio processing)
Motivation for parallelism and specialization: maximize compute capability given constraints on chip area and chip energy consumption.
Result: amazingly high compute capability in a wide range of devices!
Hardware diversity is a huge challenge
▪ Machines with very different performance characteristics
▪ Even worse: different technologies and performance
characteristics within the same machine at different scales
- Within a core: SIMD, multi-threading: fine-
granularity sync and communication
- Across cores: coherent shared memory via fast on-
chip network
- Hybrid CPU+GPU multi-core: incoherent
(potentially) shared memory
- Across racks: distributed memory, multi-stage
network
Variety of programming models to abstract HW
▪ Machines with very different performance characteristics
▪ Worse: different technologies and performance characteristics within the same machine at different scales
- Within a core: SIMD, multi-threading: fine-grained sync and comm
  - Abstractions: SPMD programming (ISPC, CUDA, OpenCL, Metal, Renderscript)
- Across cores: coherent shared memory via fast on-chip network
  - Abstractions: OpenMP pragmas, Cilk, TBB (see the sketch below)
- Hybrid CPU+GPU multi-core: incoherent (potentially) shared memory
  - Abstractions: OpenCL
- Across racks: distributed memory, multi-stage network
  - Abstractions: message passing (MPI, Go, Spark, Legion, Charm++)
Credit: Pat Hanrahan
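To make one of these abstraction layers concrete, below is a minimal sketch (not from the slides) of the “across cores” level: the same data-parallel loop written with an OpenMP pragma. The function name and arrays are made up for illustration.

    #include <vector>

    // Hypothetical example: scale-and-add over two arrays.
    void scale_add(std::vector<float>& y, const std::vector<float>& x, float a) {
        // OpenMP distributes iterations across cores; coherent shared memory
        // makes x and y visible to every thread without explicit communication.
        #pragma omp parallel for
        for (long i = 0; i < (long)y.size(); i++)
            y[i] += a * x[i];
    }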
This is a huge challenge
▪ Machines with very different performance
characteristics
▪ Worse: different performance characteristics
within the same machine at different scales
▪ To be efficient, software must be optimized for
HW characteristics
- Difficult even in the case of one level of one
machine
- Combinatorial complexity of optimizations
when considering a complex machine, or
different machines
- Loss of software portability
Credit: Pat Hanrahan
Open computer science question:
How do we enable programmers to
productively write software that
efficiently uses current and future
heterogeneous, parallel machines?
The [magical] ideal parallel programming language
[Diagram: a triangle with vertices Completeness (applicable to most problems we want to write a program for), Productivity (ease of development), and High Performance (software is scalable and efficient), with “??” at its center.]
Credit: Pat Hanrahan
Successful programming languages (success = widely used)
[Diagram: the same triangle of Completeness, Productivity, and High Performance, with “??” marking where widely used languages fall.]
Credit: Pat Hanrahan
Growing interest in domain-specific programming systems
To realize high performance and productivity, designers are willing to sacrifice completeness.
[Diagram: the same triangle; domain-specific languages (DSLs) and programming frameworks sit between High Performance and Productivity, giving up Completeness.]
Credit: Pat Hanrahan
Domain-specific programming systems
▪ Main idea: raise level of abstraction for expressing
programs
▪ Introduce high-level programming primitives specific
to an application domain
- Productive: intuitive to use, portable across machines,
primitives correspond to behaviors frequently used to solve
problems in targeted domain
- Performant: system uses domain knowledge to provide
efficient, optimized implementation(s)
- Given a machine: system knows what algorithms to use,
parallelization strategies to employ for this domain
- Optimization goes beyond efficient mapping of software
to hardware! The hardware platform itself can be
optimized to the abstractions as well
▪ Cost: loss of generality/completeness
Two domain-specific programming examples
1. Liszt: for scientific computing on meshes
2. Halide: for image processing
What are other domain-specific languages?
(SQL is another good example)
Example 1:
Liszt: a language for solving PDEs on meshes
http://liszt.stanford.edu/
[DeVito et al. Supercomputing ’11, SciDAC ’11]
Slide credit for this section of lecture:
Pat Hanrahan and Zach Devito (Stanford)
What a Liszt program does
val Position    = FieldWithConst[Vertex,Float3](0.f, 0.f, 0.f)
val Temperature = FieldWithConst[Vertex,Float](0.f)
val Flux        = FieldWithConst[Vertex,Float](0.f)
val JacobiStep  = FieldWithConst[Vertex,Float](0.f)

A Liszt program is run on a mesh.
A Liszt program defines, and computes the values of, fields defined on the mesh.
Notes:
Fields are a higher-kinded type
(special function that maps a type to a new type)
Position is a field defined at each mesh vertex.
The field’s value is represented by a 3-vector.
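As a rough analogy (an illustration only, not Liszt's actual implementation), a higher-kinded Field type can be pictured in C++ as a template that, given a mesh element type and a value type, produces a container mapping each element to a value. Vertex and Float3 below are hypothetical stand-ins.

    #include <vector>

    struct Vertex { int id; };            // a mesh element carries an index
    struct Float3 { float x, y, z; };

    template <typename Element, typename T>
    class Field {
        std::vector<T> data;              // one value per mesh element
    public:
        Field(size_t num_elements, T init) : data(num_elements, init) {}
        T&       operator()(Element e)       { return data[e.id]; }
        const T& operator()(Element e) const { return data[e.id]; }
    };

    // Usage mirroring the Liszt declarations above:
    //   Field<Vertex, Float3> Position(num_vertices, Float3{0.f, 0.f, 0.f});
    //   Field<Vertex, float>  Temperature(num_vertices, 0.f);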
Liszt program: heat conduction on mesh
var i = 0;
while ( i < 1000 ) {
  Flux(vertices(mesh)) = 0.f;
  JacobiStep(vertices(mesh)) = 0.f;
  for (e <- edges(mesh)) {
    val v1 = head(e)
    val v2 = tail(e)
    val dP = Position(v1) - Position(v2)
    val dT = Temperature(v1) - Temperature(v2)
    val step = 1.0f/(length(dP))
    Flux(v1) += dT*step
    Flux(v2) -= dT*step
    JacobiStep(v1) += step
    JacobiStep(v2) += step
  }
  i += 1
}
The program computes the values of fields defined on the mesh.
Notes on the loop (a hand-written C++ rendering of this loop appears after these notes):
- Flux and JacobiStep are set to 0.f for all vertices.
- The loop body runs independently for each edge in the mesh.
- Temperature(v2) accesses the value of a field at mesh vertex v2.
- Given an edge, the loop body accesses/modifies field values at the adjacent mesh vertices.
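For contrast, here is a hand-written C++ rendering of the same edge loop (a sketch, not code generated by Liszt), with the mesh topology stored as explicit arrays. The array names (head, tail, Position, ...) are hypothetical. Written this way, the indirect, data-dependent accesses to the per-vertex fields are explicit, which is exactly what makes the loop hard for a general-purpose compiler to parallelize.

    #include <cmath>

    struct Float3 { float x, y, z; };

    void heat_step(int num_edges,
                   const int* head, const int* tail,        // edge -> vertex indices
                   const Float3* Position, const float* Temperature,
                   float* Flux, float* JacobiStep) {
        for (int e = 0; e < num_edges; e++) {
            int v1 = head[e];                    // which vertices this edge touches
            int v2 = tail[e];                    //   is only known from the data
            Float3 dP = { Position[v1].x - Position[v2].x,
                          Position[v1].y - Position[v2].y,
                          Position[v1].z - Position[v2].z };
            float dT   = Temperature[v1] - Temperature[v2];
            float step = 1.0f / std::sqrt(dP.x*dP.x + dP.y*dP.y + dP.z*dP.z);
            Flux[v1]       += dT * step;         // scatter: different edges may
            Flux[v2]       -= dT * step;         //   write the same vertex
            JacobiStep[v1] += step;
            JacobiStep[v2] += step;
        }
    }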
Liszt’s topological operators
Used to access mesh elements relative to some input vertex, edge, face, etc.
Topological operators are the only way to access mesh data in a Liszt program.
Notice how many operators return sets (e.g., “all edges of this face”).
Liszt programming
▪ A Liszt program describes operations on fields of
an abstract mesh representation
▪ Application specifies type of mesh (regular,
irregular) and its topology
▪ Mesh representation is chosen by Liszt (not by the
programmer)
- Based on mesh type, program behavior, and
target machine
Well, that’s interesting. I write a program, and the compiler
decides what data structure it should use based on what
operations my code performs.
Compiling to parallel computers
Recall challenges you have faced in your assignments
1. Identify parallelism
2. Identify data locality
3. Reason about required synchronization
Now consider how to automate this process in
the Liszt compiler.
Key: determining program dependencies
1. Identify parallelism:
Absence of dependencies implies code can be executed in
parallel
2. Identify data locality:
Partition data based on dependencies (localize dependent
computations for faster synchronization)
3. Reason about required synchronization:
Synchronization is needed to respect dependencies (must
wait until the values a computation depends on are
known)
In general programs, compilers are unable to infer
dependencies at global scale: a[f(i)] += b[i] (must
execute f(i) to know if dependency exists across loop
iterations i)
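A small illustration of that point (not from the slides): in the first loop below the access pattern is a fixed stencil, so a compiler can prove that iterations touch distinct output locations; in the second, the write target depends on runtime data, so a cross-iteration dependency cannot be ruled out without executing the indexing computation.

    void stencil(float* a, const float* b, int n) {
        for (int i = 1; i < n - 1; i++)
            a[i] = b[i - 1] + b[i + 1];   // offsets known at compile time
    }

    void indirect(float* a, const float* b, const int* idx, int n) {
        for (int i = 0; i < n; i++)
            a[idx[i]] += b[i];            // a[f(i)] += b[i]: iterations may collide
    }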
Statically analyze code to find stencil of each top-level for loop
- Extract nested mesh element reads
- Extract field operations
for (e <- edges(mesh)) {
  val v1 = head(e)
  val v2 = tail(e)
  val dP = Position(v1) - Position(v2)
  val dT = Temperature(v1) - Temperature(v2)
  val step = 1.0f/(length(dP))
  Flux(v1) += dT*step
  Flux(v2) -= dT*step
  JacobiStep(v1) += step
  JacobiStep(v2) += step
}
...
Liszt is constrained to allow dependency analysis
Liszt infers “stencils”: a “stencil” = the mesh elements accessed in an iteration of a loop = the dependencies for that iteration
Edge 6’s read stencil is D and F
Restrict language for dependency analysis
Language restrictions:
- Mesh elements are only accessed through built-in topological functions: cells(mesh), …
- Single static assignment: val v1 = head(e)
- Data in fields can only be accessed using mesh elements: Pressure(v)
- No recursive functions
These restrictions allow the compiler to automatically infer the stencil for a loop iteration.
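One way to picture what stencil inference produces (an assumed representation for illustration, not Liszt's actual compiler data structures): for each loop iteration, a record of which (field, mesh element) pairs are read and which are written. For the heat-conduction edge loop, the restricted access forms make these sets derivable directly from the code.

    #include <string>
    #include <vector>

    struct Access { std::string field; int element; };   // e.g., {"Flux", v1}

    struct IterationStencil {
        std::vector<Access> reads;
        std::vector<Access> writes;
    };

    // For an edge e with endpoints v1 = head(e), v2 = tail(e), the loop body above yields:
    //   reads  = { {"Position",v1}, {"Position",v2}, {"Temperature",v1}, {"Temperature",v2} }
    //   writes = { {"Flux",v1}, {"Flux",v2}, {"JacobiStep",v1}, {"JacobiStep",v2} }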
Portable parallelism: use dependencies to implement different parallel execution strategies
I’ll discuss two strategies:
- Strategy 1: mesh partitioning
- Strategy 2: mesh coloring
[Figure legend: owned cell vs. ghost cell]
Imagine compiling a Liszt program to the
(entire) Latedays cluster
(multiple nodes, distributed address
space)
How might Liszt distribute a graph across
these nodes?
Distributed memory implementation of Liszt
Mesh + Stencil → Graph → Partition

for (f <- faces(mesh)) {
  rhoOutside(f) = calc_flux(f, rho(outside(f))) +
                  calc_flux(f, rho(inside(f)))
}
Initial Partition (by ParMETIS)
Consider distributed memory implementation
Store region of mesh on each node in a cluster
(Note: ParMETIS is a tool for partitioning
meshes)
Ghost Cells
Each processor also needs data for neighboring
cells to perform computation (“ghost cells”)
Liszt allocates ghost region storage and emits the required communication to implement the topological operators.
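The kind of communication being emitted can be sketched as a halo (ghost-cell) exchange between neighboring partitions. The sketch below is an illustration only, not the code Liszt actually generates; neighbor, boundary_vals, and ghost_vals stand in for whatever the partitioner produced.

    #include <mpi.h>
    #include <vector>

    void exchange_ghosts(int neighbor,                            // rank owning the adjacent partition
                         const std::vector<float>& boundary_vals, // owned values sent to the neighbor
                         std::vector<float>& ghost_vals)          // neighbor's values received locally
    {
        MPI_Sendrecv(boundary_vals.data(), (int)boundary_vals.size(), MPI_FLOAT,
                     neighbor, /*sendtag=*/0,
                     ghost_vals.data(), (int)ghost_vals.size(), MPI_FLOAT,
                     neighbor, /*recvtag=*/0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }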
Imagine compiling a Liszt program to a GPU
(single address space, many tiny threads)
GPU implementation: parallel reductions
for (e <- edges(mesh)) {
  ...
  Flux(v1) += dT*step
  Flux(v2) -= dT*step
  ...
}
Different edges share a vertex:
requires atomic update of per-
vertex field data
In previous example, one region of mesh assigned per processor (or node in
MPI cluster)
On GPU, natural parallelization is one edge per CUDA thread
Threads (each edge assigned to 1 CUDA thread)
Flux field values (per vertex)
GPU implementation: conflict graph
Identify mesh edges with colliding writes
(lines in graph indicate presence of
collision)
Can simply run program once to get this
information.
(results valid for subsequent executions
provided mesh does not change)
Threads (each edge assigned to 1 CUDA thread)
Flux field values (per vertex)
GPU implementation: conflict graph
“Color” nodes in graph such that
no connected nodes have the
same color
Can execute on GPU in parallel,
without atomic operations, by
running all nodes with the same
color in a single CUDA launch.
Threads (each edge assigned to 1 CUDA thread)
Flux field values (per vertex)
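One simple way to compute such a coloring (a sketch of one possible strategy, not necessarily the algorithm Liszt uses): greedily assign each edge the smallest color not already used by another edge incident on either of its vertices. All edges of one color can then update Flux in parallel with no atomics.

    #include <vector>

    std::vector<int> color_edges(int num_edges, int num_vertices,
                                 const std::vector<int>& head,
                                 const std::vector<int>& tail) {
        // For each vertex, remember which colors its incident edges already use.
        std::vector<std::vector<bool>> used(num_vertices);
        std::vector<int> color(num_edges, -1);
        for (int e = 0; e < num_edges; e++) {
            int v1 = head[e], v2 = tail[e];
            auto taken = [&](int v, int c) {
                return c < (int)used[v].size() && used[v][c];
            };
            int c = 0;
            while (taken(v1, c) || taken(v2, c)) c++;        // first free color
            color[e] = c;
            for (int v : {v1, v2}) {
                if ((int)used[v].size() <= c) used[v].resize(c + 1, false);
                used[v][c] = true;
            }
        }
        return color;   // launch one parallel (atomic-free) pass per color value
    }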
Cluster performance of Liszt program
256 nodes, 8 cores per node (message passing implemented using MPI)
Important: performance portability! The same Liszt program also runs with high efficiency on a GPU (results not shown here), but it uses a different algorithm when compiled to the GPU (graph coloring)!
Liszt summary
▪ Productivity:
- Abstract representation of mesh: vertices, edges, faces, fields (concepts that a scientist thinks about already!)
- Intuitive topological operators
▪ Portability:
- Same code runs on a large cluster of CPUs (MPI) and on GPUs (and combinations thereof!)
▪ High performance:
- Language is constrained to allow the compiler to track dependencies
- Dependencies are used for locality-aware partitioning in the distributed memory implementation
- ... and for graph coloring in the GPU implementation
- Compiler knows how to choose different parallelization strategies for different platforms
- Underlying mesh representation can be customized by the system based on usage and platform (e.g., don’t store edge pointers if the code doesn’t need them; choose struct-of-arrays vs. array-of-structs layout for per-vertex fields)
Example 2:
Halide: a domain-specific language for image processing
Jonathan Ragan-Kelley, Andrew
Adams et al.
[SIGGRAPH 2012, PLDI 13]
Halide used in practice
▪ Halide used to implement Android HDR+ app
▪ Halide code used to process all images uploaded to Google Photos
A quick tutorial on high-performance
image processing
What does this C code do?
int WIDTH = 1024;
int HEIGHT = 1024;
float input[(WIDTH+2) * (HEIGHT+2)];
float output[WIDTH * HEIGHT];
float weights[] = {1.0/9, 1.0/9, 1.0/9,
1.0/9, 1.0/9, 1.0/9,
1.0/9, 1.0/9, 1.0/9};
for (int j=0; j<HEIGHT; j++) {
for (int i=0; i<WIDTH; i++) {
float tmp = 0.f;
for (int jj=0; jj<3; jj++)
for (int ii=0; ii<3; ii++)
tmp += input[(j+jj)*(WIDTH+2) + (i+ii)] * weights[jj*3 + ii];
output[j*WIDTH + i] = tmp;
}
}
3x3 box blur
(Zoom view)
3x3 image blur
int WIDTH = 1024;
int HEIGHT = 1024;
float input[(WIDTH+2) * (HEIGHT+2)];
float output[WIDTH * HEIGHT];
float weights[] = {1.0/9, 1.0/9, 1.0/9,
1.0/9, 1.0/9, 1.0/9,
1.0/9, 1.0/9, 1.0/9};
for (int j=0; j<HEIGHT; j++) {
for (int i=0; i<WIDTH; i++) {
float tmp = 0.f;
for (int jj=0; jj<3; jj++)
for (int ii=0; ii<3; ii++)
tmp += input[(j+jj)*(WIDTH+2) + (i+ii)] * weights[jj*3 + ii];
output[j*WIDTH + i] = tmp;
}
}
Total work per image = 9 x WIDTH x HEIGHT
For an NxN filter: N^2 x WIDTH x HEIGHT
Two-pass 3x3 blur
int WIDTH = 1024;
int HEIGHT = 1024;
float input[(WIDTH+2) * (HEIGHT+2)];
float tmp_buf[WIDTH * (HEIGHT+2)];
float output[WIDTH * HEIGHT];
float weights[] = {1.0/3, 1.0/3, 1.0/3};
for (int j=0; j<(HEIGHT+2); j++)
for (int i=0; i<WIDTH; i++) {
float tmp = 0.f;
for (int ii=0; ii<3; ii++)
tmp += input[j*(WIDTH+2) + i+ii] * weights[ii];
tmp_buf[j*WIDTH + i] = tmp;
}
for (int j=0; j<HEIGHT; j++) {
for (int i=0; i<WIDTH; i++) {
float tmp = 0.f;
for (int jj=0; jj<3; jj++)
tmp += tmp_buf[(j+jj)*WIDTH + i] * weights[jj];
output[j*WIDTH + i] = tmp;
}
}
Total work per image = 6 x WIDTH x HEIGHT
For an NxN filter: 2N x WIDTH x HEIGHT
Pass 1: 1D horizontal blur. Pass 2: 1D vertical blur.
Extra storage: tmp_buf is WIDTH x (HEIGHT+2).
3x lower arithmetic intensity than the single-pass 3x3 blur.
[Figure: input is (W+2) x (H+2); tmp_buf is W x (H+2); output is W x H]
Two-pass image blur: locality
int WIDTH = 1024;
int HEIGHT = 1024;
float input[(WIDTH+2) * (HEIGHT+2)];
float tmp_buf[WIDTH * (HEIGHT+2)];
float output[WIDTH * HEIGHT];
float weights[] = {1.0/3, 1.0/3, 1.0/3};
for (int j=0; j<(HEIGHT+2); j++)
for (int i=0; i<WIDTH; i++) {
float tmp = 0.f;
for (int ii=0; ii<3; ii++)
tmp += input[j*(WIDTH+2) + i+ii] * weights[ii];
tmp_buf[j*WIDTH + i] = tmp;
}
for (int j=0; j<HEIGHT; j++) {
for (int i=0; i<WIDTH; i++) {
float tmp = 0.f;
for (int jj=0; jj<3; jj++)
tmp += tmp_buf[(j+jj)*WIDTH + i] * weights[jj];
output[j*WIDTH + i] = tmp;
}
}
Data from input reused three times.
(immediately reused in next two i-loop iterations
after first load, never loaded again.)
- Perfect cache behavior: never load required data
more than once
- Perfect use of cache lines (don’t load
unnecessary data into cache)
Data from tmp_buf reused three times
(but three rows of image data are
accessed in between)
- Never load required data more than
once… if cache has capacity for three
rows of image
- Perfect use of cache lines (don’t load
unnecessary data into cache)
Two pass: loads/stores to tmp_buf are overhead
(this memory traffic is an artifact of the two-pass
implementation: it is not intrinsic to computation
being performed)
Intrinsic bandwidth requirements of
algorithm:
Application must read each element of
input image and must write each element
of output image.
Two-pass image blur, “chunked” (version 1)
int WIDTH = 1024;
int HEIGHT = 1024;
float input[(WIDTH+2) * (HEIGHT+2)];
float tmp_buf[WIDTH * 3];
float output[WIDTH * HEIGHT];
float weights[] = {1.0/3, 1.0/3, 1.0/3};
for (int j=0; j<HEIGHT; j++) {
  for (int j2=0; j2<3; j2++)
    for (int i=0; i<WIDTH; i++) {
      float tmp = 0.f;
      for (int ii=0; ii<3; ii++)
        tmp += input[(j+j2)*(WIDTH+2) + i+ii] * weights[ii];
      tmp_buf[j2*WIDTH + i] = tmp;
    }
  for (int i=0; i<WIDTH; i++) {
    float tmp = 0.f;
    for (int jj=0; jj<3; jj++)
      tmp += tmp_buf[jj*WIDTH + i] * weights[jj];
    output[j*WIDTH + i] = tmp;
  }
}
[Figure: input is (W+2) x (H+2); tmp_buf is W x 3; output is W x H]
Produce 3 rows of tmp_buf (only what’s needed for one row of output).
Total work per row of output:
- step 1: 3 x 3 x WIDTH work
- step 2: 3 x WIDTH work
Total work per image = 12 x WIDTH x
HEIGHT ????
Loads from tmp_buf are cached (assuming tmp_buf fits in cache).
Combine the three rows of tmp_buf to get one row of output.
Only 3 rows of
intermediate buffer
need to be allocated
Two-pass image blur, “chunked” (version 2)
int WIDTH = 1024;
int HEIGHT = 1024;
const int CHUNK_SIZE = 16;   // e.g., 16 rows per chunk
float input[(WIDTH+2) * (HEIGHT+2)];
float tmp_buf[WIDTH * (CHUNK_SIZE+2)];
float output[WIDTH * HEIGHT];
float weights[] = {1.0/3, 1.0/3, 1.0/3};

for (int j=0; j<HEIGHT; j+=CHUNK_SIZE) {
  for (int j2=0; j2<CHUNK_SIZE+2; j2++)
    for (int i=0; i<WIDTH; i++) {
      float tmp = 0.f;
      for (int ii=0; ii<3; ii++)
        tmp += input[(j+j2)*(WIDTH+2) + i+ii] * weights[ii];
      tmp_buf[j2*WIDTH + i] = tmp;
    }
  for (int j2=0; j2<CHUNK_SIZE; j2++)
    for (int i=0; i<WIDTH; i++) {
      float tmp = 0.f;
      for (int jj=0; jj<3; jj++)
        tmp += tmp_buf[(j2+jj)*WIDTH + i] * weights[jj];
      output[(j+j2)*WIDTH + i] = tmp;
    }
}
[Figure: input is (W+2) x (H+2); tmp_buf is W x (CHUNK_SIZE+2); output is W x H]
Produce enough rows of
tmp_buf to produce a
CHUNK_SIZE number of
rows of output
Total work per chunk of output:
(assume CHUNK_SIZE = 16)
- Step 1: 18 x 3 x WIDTH work
- Step 2: 16 x 3 x WIDTH work
Total work per image: (34/16) x 3 x WIDTH x HEIGHT
= 6.4 x WIDTH x HEIGHT
Produce CHUNK_SIZE rows of output
Sized to fit in cache
(capture all
producer-consumer
locality)
Tends toward the ideal 6 x WIDTH x HEIGHT as CHUNK_SIZE is increased!
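Making the arithmetic above explicit (a small helper, not from the slides): per chunk, pass 1 performs (CHUNK_SIZE + 2) x 3 multiplies per column and pass 2 performs CHUNK_SIZE x 3, so the work per output pixel is 3 x (2 x CHUNK_SIZE + 2) / CHUNK_SIZE.

    // Work (multiply-adds) per output pixel as a function of chunk size.
    double work_per_output_pixel(int chunk_size) {
        return 3.0 * (2 * chunk_size + 2) / chunk_size;
    }
    // work_per_output_pixel(1)  == 12.0   (the "version 1" case above)
    // work_per_output_pixel(16) == 6.375  (the ~6.4 figure above)
    // As chunk_size grows, the value approaches the ideal 6.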
Conflicting goals (once again...)
▪ Want to be work efficient (perform fewer operations)
▪ Want to take advantage of locality when present
- Otherwise work-efficient code will be bandwidth bound
- Ideally: bandwidth cost of implementation is very close to
intrinsic cost of algorithm: data is loaded from memory
once and reused as much as needed prior to being
discarded from processor’s cache
▪ Want to execute in parallel (multi-core, SIMD within core)
Optimized C++ code: 3x3 image blur
Good: 10x faster on a quad-core CPU than the original two-pass code
Bad: specific to SSE (not AVX2), CPU code only, and hard to tell what is going on at all!
[The original slide shows the full optimized C++ source. Key features called out:]
- Use of SIMD vector intrinsics
- Modified iteration order: 256x32 block-major iteration (to maximize cache hit rate)
- Multi-core execution (partition image vertically)
- The two passes fused into one: tmp data read from cache
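For reference, a simplified sketch that combines the same ingredients (fused passes, block-major iteration over tiles, multi-core via OpenMP) without explicit SIMD intrinsics. It is an illustration, not the slide's actual SSE code, and it assumes WIDTH and HEIGHT are multiples of the tile size and a 2-pixel input padding as in the earlier examples.

    #include <vector>

    void blur_3x3(const float* in, float* out, int WIDTH, int HEIGHT) {
        const int TW = 256, TH = 32;                  // tile size, as called out above
        #pragma omp parallel for
        for (int ty = 0; ty < HEIGHT; ty += TH) {
            std::vector<float> tmp((TH + 2) * TW);    // per-thread horizontal-blur tile
            for (int tx = 0; tx < WIDTH; tx += TW) {
                // Pass 1 (horizontal blur), only for the rows this tile needs
                for (int y = 0; y < TH + 2; y++)
                    for (int x = 0; x < TW; x++)
                        tmp[y * TW + x] = (in[(ty + y) * (WIDTH + 2) + tx + x] +
                                           in[(ty + y) * (WIDTH + 2) + tx + x + 1] +
                                           in[(ty + y) * (WIDTH + 2) + tx + x + 2]) / 3.0f;
                // Pass 2 (vertical blur), consuming tmp while it is still in cache
                for (int y = 0; y < TH; y++)
                    for (int x = 0; x < TW; x++)
                        out[(ty + y) * WIDTH + tx + x] = (tmp[y * TW + x] +
                                                          tmp[(y + 1) * TW + x] +
                                                          tmp[(y + 2) * TW + x]) / 3.0f;
            }
        }
    }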
// Halide 3x3 blur program definition
Func halide_blur(Func in) {
Func blurx, out;
Var x, y;
blurx(x,y) = (in(x-1, y) + in(x,y) + in(x+1,y)) / 3.0f;
out(x,y) = (blurx(x,y-1) + blurx(x,y) + blurx(x,y+1)) / 3.0f;
return out;
}
Halide blur (algorithm description)
NOTE: execution order and storage are unspecified by the
abstraction. The implementation can evaluate, reevaluate, cache
individual points as desired!
Images are pure functions.
Functions map integer coordinates (in up to a 4D domain) to values (e.g., the colors of the corresponding pixels). Here in, blurx, and out are all functions.
Algorithms are a series of functions (think: pipeline stages).
The value of blurx at coordinate (x,y) is given by an expression accessing three values of in.

// top-level calling code
Image<uint8_t> input = load_image("myimage.png"); // define input image
Func my_program = halide_blur(input); // define pipeline
Image<uint8_t> output = my_program.realize(input.width(), input.height(), input.channels()); // execute pipeline
output.save("myblurredimage.png");
Think of a Halide program as a pipeline
in
blurx
out
// Halide 3x3 blur program definition
Func halide_blur(Func in) {
Func blurx, out;
Var x, y;
blurx(x,y) = (in(x-1, y) + in(x,y) + in(x+1,y)) / 3.0f;
out(x,y) = (blurx(x,y-1) + blurx(x,y) + blurx(x,y+1)) / 3.0f;
return out;
}
Halide schedule describes how to execute a pipeline

// Halide program definition
Func halide_blur(Func in) {
Func blurx, out;
Var x, y, xi, yi;
// the “algorithm description” (what to do)
blurx(x,y) = (in(x-1, y) + in(x,y) + in(x+1,y)) / 3.0f;
out(x,y) = (blurx(x,y-1) + blurx(x,y) + blurx(x,y+1)) / 3.0f;
// “the schedule” (how to do it)
out.tile(x, y, xi, yi, 256, 32).vectorize(xi,8).parallel(y);
blurx.chunk(x).vectorize(x, 8);
return out;
}
When evaluating out, use 2D tiling
order (loops named by x, y, xi, yi).
Use tile size 256 x 32.
Vectorize the xi loop (8-wide)
Use threads to parallelize the y loop
Produce only chunks of blurx at a
time. Vectorize the x (innermost)
loop
// Halide program definition
Func halide_blur(Func in) {
Func blurx, out;
Var x, y, xi, yi;
// the “algorithm description” (what to do)
blurx(x,y) = (in(x-1, y) + in(x,y) + in(x+1,y)) / 3.0f;
out(x,y) = (blurx(x,y-1) + blurx(x,y) + blurx(x,y+1)) / 3.0f;
// “the schedule” (how to do it)
out.tile(x, y, xi, yi, 256, 32).vectorize(xi,8).parallel(y);
blurx.chunk(x).vectorize(x, 8);
return out;
}
void halide_blur(uint8_t* in, uint8_t* out) {
#pragma omp parallel for
for (int y=0; y<HEIGHT; y+=32) { // tile loop
for (int x=0; x<WIDTH; x+=256) { // tile loop
// buffer
uint8_t blurx[34 * 256];
// produce intermediate buffer
for (int yi=0; yi<34; yi++) {
// SIMD vectorize this loop (not shown)
for (int xi=0; xi<256; xi++) {
blurx[yi*256+xi] =
(in[(y+yi-1)*WIDTH+x+xi-1] +
in[(y+yi-1)*WIDTH+x+xi] +
in[(y+yi-1)*WIDTH+x+xi+1]) / 3.0;
}
}
// consume intermediate buffer
for (int yi=0; yi<32; yi++) {
// SIMD vectorize this loop (not shown)
for (int xi=0; xi<256; xi++) {
out[(y+yi)*WIDTH+(x+xi)] =
(blurx[yi*256+xi] +
blurx[(yi+1)*256+xi] +
blurx[(yi+2)*256+xi]) / 3.0;
}
}
} // loop over tiles
} // loop over tiles
}
Halide schedule describes how to execute a pipeline
Given a schedule, Halide carries out
mechanical process of
implementing the specified
schedule
Halide: two domain-specific co-languages
▪ Functional language for describing image processing
operations
▪ Domain-specific language for describing schedules
▪ Design principle: separate the “algorithm specification” from its schedule
- Programmer’s responsibility: provide a high-performance schedule
- Compiler’s responsibility: carry out the mechanical process of generating threads, SIMD instructions, managing buffers, etc.
- Result: enables the programmer to rapidly explore the space of schedules
- (e.g., “tile these loops”, “vectorize this loop”, “parallelize this loop across cores”)
▪ Domain scope:
- All computation on regular N-D coordinate spaces
- Only feed-forward pipelines (includes special support for reductions and fixed recursion depth)
- All dependencies inferable by the compiler
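As a flavor of the reduction support mentioned above, here is a minimal sketch using the open-source Halide C++ API (which differs slightly from the 2012-era syntax on these slides): a bounded reduction domain (RDom) sums each column of an input buffer. The buffer size is made up for illustration.

    #include "Halide.h"

    int main() {
        Halide::Buffer<float> in(64, 64);      // hypothetical 64x64 input
        Halide::Var x;
        Halide::RDom r(0, in.height());        // bounded reduction domain over rows
        Halide::Func col_sum;
        col_sum(x) = 0.0f;                     // pure (initial) definition
        col_sum(x) += in(x, r);                // update definition over the RDom
        Halide::Buffer<float> out = col_sum.realize({in.width()});
        return 0;
    }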
Producer/consumer scheduling primitives
Four basic scheduling primitives shown below
[Figure: four producer/consumer schedules for an in → tmp → blurred pipeline: “Root”, “Inline”, “Sliding Window”, and “Chunked”.]
Producer/consumer scheduling primitives
// Halide program definition
Func halide_blur(Func in) {
Func blurx, out;
Var x, y, xi, yi;
// the “algorithm description” (what to do)
blurx(x,y) = (in(x-1, y) + in(x,y) + in(x+1,y)) / 3.0f;
out(x,y) = (blurx(x,y-1) + blurx(x,y) + blurx(x,y+1)) / 3.0f;
// “the schedule” (how to do it)
blurx.compute_at(ROOT);
return out;
}
void halide_blur(uint8_t* in, uint8_t* out) {
  uint8_t blurx[WIDTH * HEIGHT];
  for (int y=0; y<HEIGHT; y++)
    for (int x=0; x<WIDTH; x++)
      blurx[...] = ...;
  for (int y=0; y<HEIGHT; y++)
    for (int x=0; x<WIDTH; x++)
      out[...] = ...;
}
// Halide program definition
Func halide_blur(Func in) {
Func blurx, out;
Var x, y, xi, yi;
// the “algorithm description” (what to do)
blurx(x,y) = (in(x-1, y) + in(x,y) + in(x+1,y)) / 3.0f;
out(x,y) = (blurx(x,y-1) + blurx(x,y) + blurx(x,y+1)) / 3.0f;
// “the schedule” (how to do it)
blurx.inline();
return out;
}
void halide_blur(uint8_t* in, uint8_t* out) {
  for (int y=0; y<HEIGHT; y++)
    for (int x=0; x<WIDTH; x++)
      out[y*WIDTH+x] = (((in[(y-1)*WIDTH+x-1] +
                          in[(y-1)*WIDTH+x] +
                          in[(y-1)*WIDTH+x+1]) / 3) +
                        ((in[y*WIDTH+x-1] +
                          in[y*WIDTH+x] +
                          in[y*WIDTH+x+1]) / 3) +
                        ((in[(y+1)*WIDTH+x-1] +
                          in[(y+1)*WIDTH+x] +
                          in[(y+1)*WIDTH+x+1]) / 3)) / 3;
}
“Root”:
compute all points of the
producer, then run consumer
(minimal locality)
“Inline”:
re-evaluate the producer at every
use site in the consumer
(maximal locality)
Domain iteration primitives
Specify both order and how
to parallelize
(multi-thread, SIMD vector)
2D blocked
iteration order
Example Halide results
▪ Camera RAW processing pipeline (convert RAW sensor data to an RGB image)
- Original: 463 lines of hand-
tuned ARM NEON assembly
- Halide: 2.75x less code, 5%
faster
▪ Bilateral filter (a common image filtering operation used in many applications)
- Original: 122 lines of C++
- Halide: 34 lines algorithm + 6 lines schedule
- CPU implementation: 5.9x faster
- GPU implementation: 2x faster than hand-written CUDA
Stepping back: what is Halide?
▪ Halide is a DSL for helping good developers optimize
image processing code more rapidly
- Halide doesn’t decide how to optimize a program for a
novice programmer
- Halide provides primitives for a programmer (that has
strong knowledge of code optimization, such as a 418
student) to rapidly express what optimizations the system
should apply
- Halide carries out the nitty-gritty of mapping that strategy
to a machine
Automatically generating Halide schedules
Extend Halide compiler to automatically generate schedule for programmer
- Compiler input: Halide program + size of expected input/output images
[Mullapudi, CMU 2016]
[Chart legend: naive schedule; expert manual schedule (best human-created schedule); automatically generated schedule (no autotuning, ~seconds); automatically generated with auto-tuning (~10 minutes); automatically generated with auto-tuning over 3 days]
“Racing” top Halide programmers
Halide auto-scheduler produced
schedules that were better than
those of expert Google Halide
programmers in two of three
cases (it got beat in one!)
Darkroom/Rigel
▪ Directly synthesize an FPGA implementation of an image processing pipeline from a high-level description (a constrained “Halide-like” language)
[Hegarty 2014, Hegarty 2016]
▪ Goal: ultra high efficiency image processing
Many other recent domain-specific
programming systems
- A DSL for graph-based machine learning computations
- Systems less domain specific than the examples given today, but still designed specifically for data-parallel computations on big data in distributed systems (“Map-Reduce”)
- Frameworks built around the model-view-controller paradigm for web applications
- Also see Green-Marl and Ligra (DSLs for describing operations on graphs)
- Simit: a language for physical simulation [MIT]
Ongoing efforts in many domains...
Domain-specific programming system
development
▪ Can develop DSL as a stand-alone language
- Graphics shading languages
- MATLAB, SQL
▪ “Embed” DSL in an existing generic language
- e.g., C++ library (GraphLab, OpenGL host-side API, Map-
Reduce)
- Liszt syntax above was all valid Scala code
▪ Active research idea:
- Design generic languages that have facilities that assist
rapid embedding of new domain-specific languages
- “What is a good language for rapidly making new DSLs?”
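The core trick behind such embeddings can be sketched in a few lines of C++ (a toy illustration, far simpler than any real system): overloaded operators build a description of the computation rather than executing it, and the library's own “compiler” can then analyze, optimize, and lower that description however it likes.

    #include <memory>

    struct Expr;
    using ExprPtr = std::shared_ptr<Expr>;

    struct Expr {
        enum Kind { VAR, CONST, ADD, MUL } kind;
        double value;        // used when kind == CONST
        ExprPtr lhs, rhs;    // used when kind == ADD or MUL
    };

    inline ExprPtr constant(double v) { return ExprPtr(new Expr{Expr::CONST, v, nullptr, nullptr}); }
    inline ExprPtr variable()         { return ExprPtr(new Expr{Expr::VAR, 0.0, nullptr, nullptr}); }
    inline ExprPtr operator+(ExprPtr a, ExprPtr b) { return ExprPtr(new Expr{Expr::ADD, 0.0, a, b}); }
    inline ExprPtr operator*(ExprPtr a, ExprPtr b) { return ExprPtr(new Expr{Expr::MUL, 0.0, a, b}); }

    // "User code" written in the host language nevertheless only builds an expression tree:
    //   ExprPtr x = variable();
    //   ExprPtr e = x * constant(2.0) + constant(1.0);   // an IR node, not a number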
Summary
▪ Modern machines: parallel and heterogeneous
- The only way to increase compute capability in an energy-constrained world
▪ Most software uses small fraction of peak capability of
machine
- Very challenging to tune programs to these machines
- Tuning efforts are not portable across machines
▪ Domain-specific programming environments trade off generality to achieve productivity, performance, and portability
- Case studies today: Liszt, Halide
- Common trait: languages provide abstractions that make
dependencies known
- Understanding dependencies is necessary but not sufficient:
need domain restrictions and domain knowledge for system
to synthesize efficient implementations