7 Questions for Parallelism
Applications:
1. What are the apps?
2. What are the kernels of apps?
Hardware:
3. What are the HW building blocks?
4. How to connect them?
Programming Model & Systems Software:
5. How to describe apps and kernels?
6. How to program the HW?
Evaluation:
7. How to measure success?
(Inspired by a view of the Golden Gate Bridge from Berkeley)
How do we describe apps and kernels?
Observation 1: use Dwarfs. Dwarfs are of 2 types:
• Libraries: dense matrices, sparse matrices, spectral, combinational, finite state machines
• Patterns/Frameworks: MapReduce; graph traversal, graphical models; dynamic programming; backtracking/branch & bound; N-body; (un)structured grid
Algorithms in the dwarfs can be implemented either as:
• Compact parallel computations within a traditional library
• A compute/communicate pattern implemented as a framework
Computations may be viewed at multiple levels: e.g., an FFT library may be built by instantiating a MapReduce framework, mapping 1D FFTs and then transposing (a generalized reduce).
Composing dwarfs to build apps
Any parallel application of arbitrary complexity may be built by composing parallel and serial components:
• Parallel patterns with serial plug-ins, e.g., MapReduce
• Serial code invoking parallel libraries, e.g., FFT, matrix ops., …
Composition is hierarchical.
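The first composition style above, a parallel pattern with serial plug-ins, can be sketched as follows. This is a minimal illustration with hypothetical names, not the Par Lab's actual framework code: the framework owns the parallelism, and the user supplies only serial map and combine plug-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_reduce(items, map_fn, combine_fn, workers=4):
    # The framework runs the serial map_fn over items in parallel ...
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_fn, items))
    # ... then folds the results with the serial combine_fn.
    return reduce(combine_fn, mapped)

# Serial plug-ins: per-item word histogram and histogram merge.
def histogram(text):
    h = {}
    for word in text.split():
        h[word] = h.get(word, 0) + 1
    return h

def merge(h1, h2):
    out = dict(h1)
    for k, v in h2.items():
        out[k] = out.get(k, 0) + v
    return out

counts = map_reduce(["a b a", "b c", "a"], histogram, merge)
print(counts)  # -> {'a': 3, 'b': 2, 'c': 1}
```

Note that the plug-ins must be independent per item for this to be safe, which is exactly the independence requirement the framework interfaces impose later in the talk.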
Programming the HW: 2 types of programmers, 2 layers
"The right tool for the right time"
Productivity Layer (≈90% of programmers):
• Domain experts / naïve programmers productively build parallel apps using frameworks & libraries
• Frameworks & libraries are composed using the Coordination & Composition (C&C) Language to provide app frameworks
Efficiency Layer (≈10% of programmers): expert programmers build:
• Frameworks: software that supports general structural patterns of computation and communication, e.g., MapReduce
• Libraries: software that supports compact computational expressions, e.g., Sketch for combinational or grid computation
"Bare metal" efficiency is possible at the Efficiency Layer. Effective composition techniques allow the efficiency programmers to be highly leveraged.
Coordination & Composition in a CBIR Application
Parallelism in CBIR (content-based image retrieval) is hierarchical: mostly independent tasks/data, with combining.
[Figure: the feature-extraction pipeline. An output stream of images feeds feature extraction, which is stream-parallel over images and task-parallel over the extraction algorithms (DCT extractor, DWT, Face Recog, …). The DCT extractor is data-parallel, mapping DCT over tiles, then combining via a reduction on the histograms from each tile to output one histogram (feature vector). The per-algorithm results are combined by concatenating feature vectors into an output stream of feature vectors.]
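The two innermost levels of this hierarchy (a data-parallel map over tiles, then a combining reduction on the per-tile histograms) can be sketched as follows. All names here are illustrative, and `tile_feature` is a toy stand-in for a real 2-D DCT extractor:

```python
from concurrent.futures import ThreadPoolExecutor

NBINS = 4  # assumed histogram size for this toy example

def tiles(image, ts):
    # Yield ts x ts tiles of a square 2-D image given as a list of rows.
    n = len(image)
    for i in range(0, n, ts):
        for j in range(0, n, ts):
            yield [row[j:j + ts] for row in image[i:i + ts]]

def tile_feature(tile):
    # Placeholder for the DCT extractor: histogram tile values into bins.
    hist = [0] * NBINS
    for row in tile:
        for v in row:
            hist[v % NBINS] += 1
    return hist

def extract(image, ts=2, workers=4):
    # Data-parallel map of the feature function over tiles ...
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_tile = list(pool.map(tile_feature, tiles(image, ts)))
    # ... then combine: elementwise reduction of the per-tile histograms
    # into one histogram (the image's feature vector).
    feature = [0] * NBINS
    for h in per_tile:
        feature = [a + b for a, b in zip(feature, h)]
    return feature

image = [[0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 0, 1], [3, 0, 1, 2]]
print(extract(image))  # -> [4, 4, 4, 4]
```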
Coordination & Composition Language
A Coordination & Composition language for productivity faces 2 key challenges:
1. Correctness: ensuring independence using decomposition operators, copying, and requirements specifications on frameworks
2. Efficiency: resource management during composition; domain-specific OS/runtime support
Language control features hide core resources, e.g., "map DCT over tiles" in the language becomes a set of DCTs/tiles per core; hierarchical parallelism is managed using OS mechanisms.
Data structures hide memory structures:
• Partitioners on arrays, graphs, and trees produce independent data
• Framework interfaces give independence requirements: e.g., the map-reduce function must be independent, either by copying or by application to a partitioned data object (a set of tiles from a partitioner)
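The two ways a framework can satisfy that independence requirement can be sketched as follows. This is an illustrative toy with hypothetical names, not the actual C&C language: either each task mutates its own deep copy of the data, or a partitioner hands each task a disjoint tile, so tasks never share mutable state.

```python
import copy

def partition(array, nparts):
    # Partitioner on an array: disjoint slices, so writes are independent.
    step = (len(array) + nparts - 1) // nparts
    return [array[i:i + step] for i in range(0, len(array), step)]

def run_independent(data, task, by_copy=False):
    if by_copy:
        # Independence by copying: each task mutates a private copy.
        return [task(copy.deepcopy(data)) for _ in range(2)]
    # Independence by partitioning: each task gets its own tile.
    return [task(tile) for tile in partition(data, 2)]

def double_in_place(xs):
    # A mutating task, safe only because of the copies/partitions above.
    for i in range(len(xs)):
        xs[i] *= 2
    return xs

print(run_independent([1, 2, 3, 4], double_in_place))                # -> [[2, 4], [6, 8]]
print(run_independent([1, 2, 3, 4], double_in_place, by_copy=True))  # -> [[2, 4, 6, 8], [2, 4, 6, 8]]
```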
How do we program the HW? What are the problems?
For parallelism to succeed, we must provide productivity, efficiency, and correctness simultaneously:
• Can't make SW productivity even worse!
• Why do it in parallel if efficiency doesn't matter?
• Correctness is usually considered an orthogonal problem
• Productivity slows if code is incorrect or inefficient
• Correctness and efficiency slow if programming is unproductive
Most programmers are not ready for parallel programming: in IBM SP customer escalations, concurrency bugs are the worst and can take months to fix.
• How do we make ≈90% of today's programmers productive on parallel computers?
• How do we make code written by ≈90% of programmers efficient?
Ensuring Correctness
• Productivity Layer:
  • Enforce independence of tasks using decomposition (partitioning) and copying operators
  • Goal: remove concurrency errors (nondeterminism from execution order, not just low-level data races)
  • E.g., the race-free operations "atomic delete" + "atomic insert" do not compose into an "atomic replace"; higher-level properties are needed, rather than just locks or transactions
• Efficiency Layer: check for subtle concurrency bugs (races, deadlocks, etc.)
  • Mixture of verification and automated directed testing
  • Error detection on frameworks and libraries; some techniques are applicable to third-party software
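The delete + insert pitfall can be made concrete with a small sketch (all names are illustrative). Each operation is race-free on its own because it holds the lock, but their sequential composition is not an atomic replace: between the two calls, another thread can observe a state in which the key is missing entirely.

```python
import threading

table = {"key": "old"}
lock = threading.Lock()
observed_gap = []  # records what a racing thread could see mid-replace

def atomic_delete(k):
    with lock:
        table.pop(k, None)

def atomic_insert(k, v):
    with lock:
        table[k] = v

def broken_replace(k, v):
    # Composing two atomic operations: NOT atomic as a whole.
    atomic_delete(k)
    # A thread scheduled here sees no "key" at all, a state an
    # atomic replace must never expose.
    observed_gap.append("key" not in table)
    atomic_insert(k, v)

broken_replace("key", "new")
print(table, observed_gap)  # -> {'key': 'new'} [True]
```

The gap is recorded deterministically here for illustration; in a real program it would only surface under an unlucky thread interleaving, which is what makes such bugs so hard to find.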
Support Software: What are the problems?
Compilers and operating systems are large, complex, and resistant to innovation:
• It takes a decade for compiler innovations to show up in production compilers
• How long for an idea from SOSP to appear in a production OS?
• Traditional OSes are brittle, insecure memory hogs
• A traditional monolithic OS image uses lots of precious memory, replicated 100s-1000s of times (e.g., AIX uses GBs of DRAM per CPU)
21st Century Code Generation
Problem: generating optimal code is like searching for a needle in a haystack.
[Figure: search space for matmul block sizes — the axes are the block dimensions; the "temperature" (color) is speed.]
New approach: "auto-tuners" first run variations of the program on the computer to heuristically search for the best combination of optimizations (blocking, padding, …) and data structures, then produce C code to be compiled for that computer.
• E.g., PHiPAC (BLAS), ATLAS (BLAS), Spiral (DSP), FFTW
• Can achieve 10X over a conventional compiler
Example: sparse matrix-vector multiply (SpMV) for 3 multicores:
• Fastest SpMV: 2X over OSKI/PETSc on Clovertown, 4X on Opteron
• Optimization space: register blocking, cache blocking, TLB blocking, prefetching/DMA options, NUMA, BCOO vs. BCSR data structures, 16b vs. 32b indices, …
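The first-run search phase can be sketched with a toy autotuner (hypothetical names, far simpler than PHiPAC or ATLAS, which emit specialized C code): time a blocked matrix multiply at several candidate block sizes and keep the fastest for this machine.

```python
import timeit

N = 64  # assumed small problem size so the search runs quickly

def make_matrix(n):
    return [[float(i * n + j) for j in range(n)] for i in range(n)]

def blocked_matmul(a, b, n, bs):
    # Standard loop tiling: iterate over bs x bs blocks for locality.
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        aik = a[i][k]
                        row_c, row_b = c[i], b[k]
                        for j in range(jj, min(jj + bs, n)):
                            row_c[j] += aik * row_b[j]
    return c

def autotune(candidates=(4, 8, 16, 32, 64)):
    # Empirical search: time each variant, keep the best for this machine.
    a, b = make_matrix(N), make_matrix(N)
    timings = {}
    for bs in candidates:
        timings[bs] = min(timeit.repeat(
            lambda: blocked_matmul(a, b, N, bs), number=1, repeat=3))
    best = min(timings, key=timings.get)
    return best, timings

best_bs, timings = autotune()
print("best block size:", best_bs)
```

The winning block size depends on the machine the search runs on, which is the whole point: the tuner measures rather than models the hardware.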
Example: Sparse Matrix * Vector (SpMV)

Name             Clovertown          Opteron             Cell
Chips*Cores      2*4 = 8             2*2 = 4             1*8 = 8
Architecture     4-issue, 2-SSE3,    3-issue, 1-SSE3,    2-VLIW, SIMD,
                 OOO, caches,        OOO, caches,        local store, DMA
                 prefetch            prefetch
Clock Rate       2.3 GHz             2.2 GHz             3.2 GHz
Peak MemBW       21.3 GB/s           21.3 GB/s           25.6 GB/s
Peak GFLOPS      74.6 GF             17.6 GF             14.6 GF (DP Fl. Pt.)
Naïve SpMV       1.0 GF              0.6 GF              --
  (median of many matrices)
Efficiency %     1%                  3%                  --
Autotuned SpMV   1.5 GF              1.9 GF              3.4 GF
Auto Speedup     1.5X                3.2X                ∞ (no naïve version)
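The derived rows of the table can be checked with a little arithmetic; assuming (as the numbers suggest) that "Efficiency %" is naïve throughput over peak GFLOPS and "Auto Speedup" is autotuned over naïve throughput:

```python
# Table values for the two chips that have a naïve baseline.
peak = {"Clovertown": 74.6, "Opteron": 17.6}       # GFLOPS, double precision
naive = {"Clovertown": 1.0, "Opteron": 0.6}        # GF, median over matrices
autotuned = {"Clovertown": 1.5, "Opteron": 1.9}    # GF

for chip in peak:
    eff = 100.0 * naive[chip] / peak[chip]
    speedup = autotuned[chip] / naive[chip]
    print(f"{chip}: efficiency {eff:.1f}%, autotune speedup {speedup:.1f}X")
# Clovertown: ~1.3% efficiency (the table's 1%), 1.5X speedup
# Opteron:    ~3.4% efficiency (the table's 3%), ~3.2X speedup
```

Cell has no naïve SpMV implementation, which is why its efficiency and speedup entries have no finite value.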
Greater productivity and efficiency for SpMV?
• Parallelizing compiler + multicore + caches + prefetching
• Autotuner + multicore + local store + DMA
• Originally, caches were there to improve programmer productivity
• That is not always the case for manycore + autotuner
• It is easier to autotune a single local store + DMA than multilevel caches + HW and SW prefetching
Deconstructing Operating Systems
• Resurgence of interest in virtual machines: a VM monitor is a thin SW layer between the guest OS and the HW
• Future OS: libraries where only the functions needed are linked into the app, on top of a thin hypervisor providing protection and sharing of resources
• Partitioning support for very thin hypervisors, and to allow software full access to hardware within its partition