[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (Gregory Diamos, Georgia Tech)

http://cs264.org http://j.mp/h2zN72

Page 1

Dynamic Compilation for Massively Parallel Processors

Gregory Diamos

PhD candidate

Georgia Institute of Technology and NVIDIA Research

April 14, 2011

Page 2

What is an execution model?

Page 3

Goals of programming languages

Programming languages are designed for productivity.

Efficiency is measured in terms of:

1. cost - hardware investment, power consumption, area requirement
2. complexity - application development effort
3. speed - amount of work performed per unit time

Page 4

Goals of processor architecture

Hardware is designed for speed and efficiency.

Page 5

Goals of processor architecture - 2

It is constrained by the limitations of physical devices.

[1] M. Koyanagi, T. Fukushima, and T. Tanaka. "High-Density Through Silicon Vias for 3-D LSIs."
[2] Novoselov et al. "Electric Field Effect in Atomically Thin Carbon Films."
[3] Intel Corp. 22nm test chip.

Page 6

Execution models bridge the gap

Page 7

Goals of execution models

Execution models provide impedance matching between applications and hardware.

Goals:

leverage common optimizations across multiple applications.

limit the impact of hardware changes on software.

ISAs have traditionally been effective execution models.

Page 8

Programming challenges of heterogeneity

The introduction of heterogeneous and multi-core processors changes the hardware/software interface:

[Figure: Intel Nehalem, IBM PowerEN, AMD Fusion, NVIDIA Fermi.]

1. multi-core creates multiple interfaces.
2. heterogeneity creates different interfaces.
3. these increase software complexity.

Page 9

Program the entire processor, not individual cores.(new execution model abstractions are needed)

Page 10

Emerging execution models

Page 11

Bulk-synchronous parallel (BSP)

[1] Leslie Valiant. "A Bridging Model for Parallel Computation."

Page 12

The Parallel Thread eXecution (PTX) Model

PTX defines a kernel as a 2-level grid of bulk-synchronous tasks.
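For concreteness, here is a minimal CUDA sketch of that 2-level hierarchy (an illustration added here, not from the slides): the launch names a grid of CTAs, each CTA is a block of threads sharing memory, and __syncthreads() is the bulk-synchronous barrier within a CTA.

    // Minimal CUDA sketch of the 2-level PTX hierarchy (illustrative only).
    __global__ void scale(float* data, int n) {
        __shared__ float tile[256];                     // per-CTA shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // (CTA, thread) coordinates
        if (i < n) tile[threadIdx.x] = data[i];
        __syncthreads();                                // barrier across this CTA only
        if (i < n) data[i] = 2.0f * tile[threadIdx.x];
    }
    // Launch a grid of ceil(n/256) CTAs with 256 threads each:
    //   scale<<<(n + 255) / 256, 256>>>(d_data, n);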

Page 13

Dynamically translating PTX

Dynamic compilers can transform this parallelism to fit the hardware.

Page 14

Beyond PTX - Data distributions

Page 15

Beyond PTX - Memory hierarchies

[1] Leslie Valiant. "A Bridging Model for Multi-core Computing."
[2] Fatahalian et al. "Sequoia: Programming the Memory Hierarchy."

Page 16

Dynamic compilation / binary translation

Page 17

Binary translation

Page 18

Binary translators are everywhere

If you are running a browser, you are using dynamic compilation.

Page 19

x86 binary translation

Page 20

Low Level Virtual Machines

Compile all programs to a common virtual machine representation (LLVM IR), and keep this representation around.

Perform common optimizations on this IR.

Target various machines by lowering it to an ISA.

Statically or via JIT compilation.

Page 21

Execution model translation

Page 22

Execution model translation

Extend binary translation to execution model translation.

Dynamic compilers can map threads/tasks to the HW.

Page 23

Different core architectures

Can we target these from the same execution model?

What about efficiency?

Page 24

Ocelot

Enables thread-aware compiler transformations.

Page 25

Mapping CTAs to cores - thread fusion

[Figure: original PTX code versus transformed PTX code, with a scheduler block, register spills and restores, and a barrier splitting the fused thread loops.]

Transform threads into loops over the program.

Distribute loops to handle barriers.
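As a rough illustration of thread fusion (a hedged sketch under assumed names, not Ocelot's actual generated code): a CTA becomes a loop over thread ids, and a barrier becomes the boundary between two loops, with values that live across it spilled to a scratch array.

    // Thread fusion sketch: one CPU function runs a whole CTA serially.
    void cta_fused(float* data, int n, int ctaBase, int ctaSize) {
        float spill[256];                          // values live across the barrier (ctaSize <= 256 assumed)
        for (int tid = 0; tid < ctaSize; ++tid) {  // loop for code before the barrier
            int i = ctaBase + tid;
            spill[tid] = (i < n) ? 2.0f * data[i] : 0.0f;
        }
        // The barrier becomes the boundary between the two distributed loops.
        for (int tid = 0; tid < ctaSize; ++tid) {  // loop for code after the barrier
            int i = ctaBase + tid;
            if (i < n) data[i] = spill[(tid + 1) % ctaSize];  // reads another thread's
        }                                                     // value: why the barrier mattered
    }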

Page 26

Mapping CTAs to cores - vectorization

Pack adjacent threads into vector instructions.

Speculate that divergence never occurs, check in case it does.
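A hedged sketch of that speculation, using SSE intrinsics (illustrative; this function and its predicate are assumptions, not the translator's real output): evaluate the branch condition for all packed lanes at once, stay on the vector path when the lanes agree, and fall back to scalar code when they diverge.

    #include <immintrin.h>

    // Four fused threads packed into SSE lanes, with a divergence check.
    // Implements: if (x[i] > 0) y[i] += a * x[i], for i = base..base+3.
    void saxpy4(float* y, const float* x, float a, int base) {
        __m128 vx = _mm_loadu_ps(x + base);                        // threads base..base+3
        int bits  = _mm_movemask_ps(_mm_cmpgt_ps(vx, _mm_setzero_ps()));
        if (bits == 0xF) {                                         // all lanes agree: vector path
            __m128 vy = _mm_loadu_ps(y + base);
            _mm_storeu_ps(y + base, _mm_add_ps(vy, _mm_mul_ps(_mm_set1_ps(a), vx)));
        } else {                                                   // divergence: scalar fallback
            for (int t = 0; t < 4; ++t)
                if (x[base + t] > 0.0f) y[base + t] += a * x[base + t];
        }
    }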

Page 27

Mapping CTAs to cores - multiple instruction streams


Instructions from different threads are independent.

Merge instruction streams and statically schedule them onto functional units.
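A minimal sketch of that idea (assumed example code, not a real scheduler): unrolling the fused thread loop by two interleaves two threads' independent operations, giving a static scheduler or a superscalar core parallel work within a single stream.

    // Interleave the instruction streams of threads tid and tid+1.
    void cta_interleaved(float* data, int n, int ctaBase, int ctaSize) {
        for (int tid = 0; tid + 1 < ctaSize; tid += 2) {  // remainder handling omitted
            int i0 = ctaBase + tid, i1 = ctaBase + tid + 1;
            float a = (i0 < n) ? data[i0] : 0.0f;  // thread tid's stream
            float b = (i1 < n) ? data[i1] : 0.0f;  // thread tid+1's stream
            a = 2.0f * a + 1.0f;                   // independent ops can issue together
            b = 2.0f * b + 1.0f;
            if (i0 < n) data[i0] = a;
            if (i1 < n) data[i1] = b;
        }
    }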

Page 28

PTX analysis

Page 29

Divergence analysis

Page 30

Subkernels


Page 31

Thread frontier analysis

Supporting control flow on SIMD processors requires finding divergent branches and potential re-convergence points.

    if ((cond1() || cond2()) && (cond3() || cond4())) { ... }

Compound conditionals like this compile into short-circuit control flow: a chain of divergent branches (bra cond1(), bra cond2(), bra cond3(), bra cond4()) through blocks B1-B5 between entry and exit.

[Figure: the short-circuit CFG annotated with each block's thread frontier.]

Block Id   Thread Frontiers
B1         {}
B2         {B2 - B3}
B3         {B3 - Exit}
B4         {B4 - Exit}
B5         {B5 - Exit}

[Figure: execution traces for threads T0-T3 comparing re-convergence at thread frontiers with immediate post-dominator re-convergence, including stack events such as "Push B3 on T0", "Push Exit on T1", "Push B5 on T2", "Pop stack, switch to B5 on T2", and "Pop stack, switch to B3 on T0". Thread frontiers re-converge T0 and T2 earlier; post-dominator re-convergence delays T1, T2, and T3 until the immediate post-dominator.]

Compiler analysis can identify immediate post-dominators or thread frontiers as re-convergence points.

Page 32

Consequences of architecturedifferences

Page 33

Degraded performance portability

[Figure: two plots of SGEMM throughput (GFLOPS, y-axis up to 600 and up to 1600) versus matrix dimension N (0-6000, x-axis), each comparing "Fermi SGEMM" against "AMD SGEMM".]

Performance of two OpenCL applications, one tuned for AMD, the other for NVIDIA.

Page 34

Memory traversal patterns

[Figure: memory access patterns across two cycles for a warp of width 4 versus a warp of width 1.]

Thread loops change row-major memory accesses into column-major accesses.
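A small sketch makes the flip concrete (assumed code, for illustration): at step k a warp of T threads touches a[k*T .. k*T+T-1] together, so the hardware coalesces the accesses; after thread fusion, one CPU thread performs all of a single thread's accesses back to back, a stride-T walk.

    // Serialized GPU-style traversal on a CPU: each fused thread walks stride T.
    void traversal(const float* a, float* out, int T, int steps) {
        for (int t = 0; t < T; ++t) {        // fused "threads"
            float sum = 0.0f;
            for (int k = 0; k < steps; ++k)
                sum += a[t + k * T];         // a[t], a[t+T], a[t+2T], ... (column-major-like)
            out[t] = sum;
        }
        // A CPU-tuned version would instead give each thread the contiguous
        // range a[t*steps .. (t+1)*steps) so every walk is unit-stride.
    }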

Page 35

Reduced memory bandwidth on CPUs

[Figure: access patterns optimized for SIMD (GPU) versus optimized for a single-threaded CPU.]

This reduces memory bandwidth by 10x for a memory microbenchmark running on a 4-core CPU.

Page 36

The good news

Page 37

Scaling across three decades of processors

Many existing applications still scale.

[Figure: application scaling across processors, annotated with 12x and 480x speedups.]

A GTX 280 has 40x more peak FLOPS than a Phenom, and 480x more than an Atom.

Page 38

Questions?

Page 39

Databases on GPUs

Page 40

Who cares about databases?

Page 41

What do applications look like?

Page 42

Gobs of data

Page 43

Distributed systems

Page 44

Lots of parallelism

Page 45

What do CPU algorithms look like?

Page 46

B-trees

Page 47

Sequential algorithms

[Figure: a sequential join walking relation 1 and relation 2 with key comparisons (<, =, >) and appending matches to the result relation.]

Page 48

It doesn’t look good

Outlook not so good...

Page 49

Or does it?

Where is the parallelism?

Page 50

Flattened trees

Page 51

Relational algebra

Page 52

A Case Study: Inner Join

Page 53

1. Recursive partitioning

Page 54

2. Block streaming

Blocking into pages, shared memory buffers, and transaction-sized chunks makes memory accesses efficient.
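A hedged CUDA sketch of the blocking idea (buffer size and names are assumptions for illustration): stage a tile in shared memory so global memory moves in large, coalesced, transaction-sized pieces, then operate on the staged tile.

    // Grid-stride streaming of tiles through a shared memory buffer.
    // Assumes a launch with 256 threads per CTA.
    __global__ void stream_blocks(const int* in, int* out, int n) {
        __shared__ int tile[256];                    // one staged tile per CTA
        for (int base = blockIdx.x * 256; base < n; base += gridDim.x * 256) {
            int i = base + threadIdx.x;
            if (i < n) tile[threadIdx.x] = in[i];    // coalesced, transaction-sized load
            __syncthreads();
            if (i < n) out[i] = tile[threadIdx.x];   // the join operator would run on the tile here
            __syncthreads();                         // buffer is reused next iteration
        }
    }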

Page 55

3. Shared memory merging network

A network for join can be constructed, similar to a sorting network.

Page 56

4. Data chunking

Stream compaction packs result data into chunks that can be streamed out of shared memory efficiently.
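A hedged sketch of warp-level stream compaction in CUDA (illustrative; the helper and its arguments are assumptions): a ballot records which lanes produced a result, and a population count over the lower lanes gives each producing thread its densely packed output slot.

    // Pack each lane's result (if any) into a dense chunk; returns the count.
    // Assumes a full, converged warp.
    __device__ int compact_warp(int value, bool keep, int* chunk, int chunkBase) {
        unsigned mask = __ballot_sync(0xFFFFFFFFu, keep);  // one bit per producing lane
        int lane = threadIdx.x & 31;
        int slot = __popc(mask & ((1u << lane) - 1));      // producers below this lane
        if (keep) chunk[chunkBase + slot] = value;         // densely packed write
        return __popc(mask);                               // number of results in this warp
    }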

Page 57

Operator fusion

Page 58

Will it blend?

Page 59

Yes it blends.

Operator        NVIDIA C2050      Phenom 9570
inner-join      26.4-32.3 GB/s    0.11-0.63 GB/s
select          104.2 GB/s        2.55 GB/s
set operators   45.8 GB/s         0.72 GB/s
projection      54.3 GB/s         2.34 GB/s
cross product   98.8 GB/s         2.67 GB/s

Page 60

Questions?

Page 61

Conclusions

Emerging heterogeneous architectures need matching execution model abstractions.

Dynamic compilation can enable portability.

When writing massively parallel codes, consider:

data structures and algorithms.

mapping onto the execution model.

transformations in the compiler/runtime.

processor micro-architecture.

Page 62

Thoughts on open source software

Page 63

Questions?

Contact Me:

[email protected]

Contribute to Harmony, Ocelot, and Vanaheimr:

http://code.google.com/p/harmonyruntime/

http://code.google.com/p/gpuocelot/

http://code.google.com/p/vanaheimr/
