7 Questions for Parallelism
Applications:
1. What are the apps?
2. What are the kernels of apps?
Hardware:
3. What are the HW building blocks?
4. How to connect them?
Programming Model & Systems Software:
5. How to describe apps and kernels?
6. How to program the HW?
Evaluation:
7. How to measure success?
(Inspired by a view of the Golden Gate Bridge from Berkeley)
How do we describe apps and kernels?
Observation 1: use Dwarfs. Dwarfs are of 2 types:
• Libraries: dense matrices, sparse matrices, spectral, combinational, finite state machines
• Patterns/Frameworks: MapReduce; graph traversal, graphical models; dynamic programming; backtracking/branch & bound; N-body; (un)structured grid
Algorithms in the dwarfs can be implemented either as:
• Compact parallel computations within a traditional library
• A compute/communicate pattern implemented as a framework
Computations may be viewed at multiple levels: e.g., an FFT library may be built by instantiating a MapReduce framework, mapping 1D FFTs and then transposing (a generalized reduce).
Composing dwarfs to build apps
Any parallel application of arbitrary complexity may be built by composing parallel and serial components:
• Parallel patterns with serial plug-ins, e.g., MapReduce
• Serial code invoking parallel libraries, e.g., FFT, matrix ops., …
Composition is hierarchical.
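The first composition style above, a parallel pattern with serial plug-ins, can be sketched as follows. This is a minimal illustration with hypothetical names, not the Par Lab's actual framework code: the framework owns the parallelism, and the user supplies only serial map and combine plug-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_reduce(items, map_fn, combine_fn, workers=4):
    # The framework runs the serial map_fn over items in parallel ...
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_fn, items))
    # ... then folds the results with the serial combine_fn.
    return reduce(combine_fn, mapped)

# Serial plug-ins: per-item word histogram and histogram merge.
def histogram(text):
    h = {}
    for word in text.split():
        h[word] = h.get(word, 0) + 1
    return h

def merge(h1, h2):
    out = dict(h1)
    for k, v in h2.items():
        out[k] = out.get(k, 0) + v
    return out

counts = map_reduce(["a b a", "b c", "a"], histogram, merge)
print(counts)  # -> {'a': 3, 'b': 2, 'c': 1}
```

Note that the plug-ins must be independent per item for this to be safe, which is exactly the independence requirement the framework interfaces impose later in the talk.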
Programming the HW: 2 types of programmers, 2 layers
"The right tool for the right time"
Productivity Layer (≈90% of programmers):
• Domain experts / naïve programmers productively build parallel apps using frameworks & libraries
• Frameworks & libraries are composed using the Coordination & Composition (C&C) Language to provide app frameworks
Efficiency Layer (≈10% of programmers): expert programmers build:
• Frameworks: software that supports general structural patterns of computation and communication, e.g., MapReduce
• Libraries: software that supports compact computational expressions, e.g., Sketch for combinational or grid computation
"Bare metal" efficiency is possible at the Efficiency Layer. Effective composition techniques allow the efficiency programmers to be highly leveraged.
Coordination & Composition in a CBIR Application
Parallelism in CBIR (content-based image retrieval) is hierarchical: mostly independent tasks/data, with combining.
[Figure: the feature-extraction pipeline. An output stream of images feeds feature extraction, which is stream-parallel over images and task-parallel over the extraction algorithms (DCT extractor, DWT, Face Recog, …). The DCT extractor is data-parallel, mapping DCT over tiles, then combining via a reduction on the histograms from each tile to output one histogram (feature vector). The per-algorithm results are combined by concatenating feature vectors into an output stream of feature vectors.]
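The two innermost levels of this hierarchy (a data-parallel map over tiles, then a combining reduction on the per-tile histograms) can be sketched as follows. All names here are illustrative, and `tile_feature` is a toy stand-in for a real 2-D DCT extractor:

```python
from concurrent.futures import ThreadPoolExecutor

NBINS = 4  # assumed histogram size for this toy example

def tiles(image, ts):
    # Yield ts x ts tiles of a square 2-D image given as a list of rows.
    n = len(image)
    for i in range(0, n, ts):
        for j in range(0, n, ts):
            yield [row[j:j + ts] for row in image[i:i + ts]]

def tile_feature(tile):
    # Placeholder for the DCT extractor: histogram tile values into bins.
    hist = [0] * NBINS
    for row in tile:
        for v in row:
            hist[v % NBINS] += 1
    return hist

def extract(image, ts=2, workers=4):
    # Data-parallel map of the feature function over tiles ...
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_tile = list(pool.map(tile_feature, tiles(image, ts)))
    # ... then combine: elementwise reduction of the per-tile histograms
    # into one histogram (the image's feature vector).
    feature = [0] * NBINS
    for h in per_tile:
        feature = [a + b for a, b in zip(feature, h)]
    return feature

image = [[0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 0, 1], [3, 0, 1, 2]]
print(extract(image))  # -> [4, 4, 4, 4]
```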
Coordination & Composition Language
A Coordination & Composition language for productivity faces 2 key challenges:
1. Correctness: ensuring independence using decomposition operators, copying, and requirements specifications on frameworks
2. Efficiency: resource management during composition; domain-specific OS/runtime support
Language control features hide core resources, e.g., "map DCT over tiles" in the language becomes a set of DCTs/tiles per core; hierarchical parallelism is managed using OS mechanisms.
Data structures hide memory structures:
• Partitioners on arrays, graphs, and trees produce independent data
• Framework interfaces give independence requirements: e.g., the map-reduce function must be independent, either by copying or by application to a partitioned data object (a set of tiles from a partitioner)
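The two ways a framework can satisfy that independence requirement can be sketched as follows. This is an illustrative toy with hypothetical names, not the actual C&C language: either each task mutates its own deep copy of the data, or a partitioner hands each task a disjoint tile, so tasks never share mutable state.

```python
import copy

def partition(array, nparts):
    # Partitioner on an array: disjoint slices, so writes are independent.
    step = (len(array) + nparts - 1) // nparts
    return [array[i:i + step] for i in range(0, len(array), step)]

def run_independent(data, task, by_copy=False):
    if by_copy:
        # Independence by copying: each task mutates a private copy.
        return [task(copy.deepcopy(data)) for _ in range(2)]
    # Independence by partitioning: each task gets its own tile.
    return [task(tile) for tile in partition(data, 2)]

def double_in_place(xs):
    # A mutating task, safe only because of the copies/partitions above.
    for i in range(len(xs)):
        xs[i] *= 2
    return xs

print(run_independent([1, 2, 3, 4], double_in_place))                # -> [[2, 4], [6, 8]]
print(run_independent([1, 2, 3, 4], double_in_place, by_copy=True))  # -> [[2, 4, 6, 8], [2, 4, 6, 8]]
```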
How do we program the HW? What are the problems?
For parallelism to succeed, we must provide productivity, efficiency, and correctness simultaneously:
• Can't make SW productivity even worse!
• Why do it in parallel if efficiency doesn't matter?
• Correctness is usually considered an orthogonal problem
• Productivity slows if code is incorrect or inefficient
• Correctness and efficiency slow if programming is unproductive
Most programmers are not ready for parallel programming: in IBM SP customer escalations, concurrency bugs are the worst and can take months to fix.
• How do we make ≈90% of today's programmers productive on parallel computers?
• How do we make code written by ≈90% of programmers efficient?
Ensuring Correctness
• Productivity Layer:
  • Enforce independence of tasks using decomposition (partitioning) and copying operators
  • Goal: remove concurrency errors (nondeterminism from execution order, not just low-level data races)
  • E.g., the race-free operations "atomic delete" + "atomic insert" do not compose into an "atomic replace"; higher-level properties are needed, rather than just locks or transactions
• Efficiency Layer: check for subtle concurrency bugs (races, deadlocks, etc.)
  • Mixture of verification and automated directed testing
  • Error detection on frameworks and libraries; some techniques are applicable to third-party software
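The delete + insert pitfall can be made concrete with a small sketch (all names are illustrative). Each operation is race-free on its own because it holds the lock, but their sequential composition is not an atomic replace: between the two calls, another thread can observe a state in which the key is missing entirely.

```python
import threading

table = {"key": "old"}
lock = threading.Lock()
observed_gap = []  # records what a racing thread could see mid-replace

def atomic_delete(k):
    with lock:
        table.pop(k, None)

def atomic_insert(k, v):
    with lock:
        table[k] = v

def broken_replace(k, v):
    # Composing two atomic operations: NOT atomic as a whole.
    atomic_delete(k)
    # A thread scheduled here sees no "key" at all, a state an
    # atomic replace must never expose.
    observed_gap.append("key" not in table)
    atomic_insert(k, v)

broken_replace("key", "new")
print(table, observed_gap)  # -> {'key': 'new'} [True]
```

The gap is recorded deterministically here for illustration; in a real program it would only surface under an unlucky thread interleaving, which is what makes such bugs so hard to find.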
Support Software: What are the problems?
Compilers and operating systems are large, complex, and resistant to innovation:
• It takes a decade for compiler innovations to show up in production compilers
• How long for an idea from SOSP to appear in a production OS?
• Traditional OSes are brittle, insecure memory hogs
• A traditional monolithic OS image uses lots of precious memory, replicated 100s-1000s of times (e.g., AIX uses GBs of DRAM per CPU)
21st Century Code Generation
Problem: generating optimal code is like searching for a needle in a haystack.
[Figure: search space for matmul block sizes — the axes are the block dimensions; the "temperature" (color) is speed.]
New approach: "auto-tuners" first run variations of the program on the computer to heuristically search for the best combination of optimizations (blocking, padding, …) and data structures, then produce C code to be compiled for that computer.
• E.g., PHiPAC (BLAS), ATLAS (BLAS), Spiral (DSP), FFTW
• Can achieve 10X over a conventional compiler
Example: sparse matrix-vector multiply (SpMV) for 3 multicores:
• Fastest SpMV: 2X over OSKI/PETSc on Clovertown, 4X on Opteron
• Optimization space: register blocking, cache blocking, TLB blocking, prefetching/DMA options, NUMA, BCOO vs. BCSR data structures, 16b vs. 32b indices, …
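The first-run search phase can be sketched with a toy autotuner (hypothetical names, far simpler than PHiPAC or ATLAS, which emit specialized C code): time a blocked matrix multiply at several candidate block sizes and keep the fastest for this machine.

```python
import timeit

N = 64  # assumed small problem size so the search runs quickly

def make_matrix(n):
    return [[float(i * n + j) for j in range(n)] for i in range(n)]

def blocked_matmul(a, b, n, bs):
    # Standard loop tiling: iterate over bs x bs blocks for locality.
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        aik = a[i][k]
                        row_c, row_b = c[i], b[k]
                        for j in range(jj, min(jj + bs, n)):
                            row_c[j] += aik * row_b[j]
    return c

def autotune(candidates=(4, 8, 16, 32, 64)):
    # Empirical search: time each variant, keep the best for this machine.
    a, b = make_matrix(N), make_matrix(N)
    timings = {}
    for bs in candidates:
        timings[bs] = min(timeit.repeat(
            lambda: blocked_matmul(a, b, N, bs), number=1, repeat=3))
    best = min(timings, key=timings.get)
    return best, timings

best_bs, timings = autotune()
print("best block size:", best_bs)
```

The winning block size depends on the machine the search runs on, which is the whole point: the tuner measures rather than models the hardware.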
Example: Sparse Matrix * Vector (SpMV)

Name             Clovertown          Opteron             Cell
Chips*Cores      2*4 = 8             2*2 = 4             1*8 = 8
Architecture     4-issue, 2-SSE3,    3-issue, 1-SSE3,    2-VLIW, SIMD,
                 OOO, caches,        OOO, caches,        local store, DMA
                 prefetch            prefetch
Clock Rate       2.3 GHz             2.2 GHz             3.2 GHz
Peak MemBW       21.3 GB/s           21.3 GB/s           25.6 GB/s
Peak GFLOPS      74.6 GF             17.6 GF             14.6 GF (DP Fl. Pt.)
Naïve SpMV       1.0 GF              0.6 GF              --
  (median of many matrices)
Efficiency %     1%                  3%                  --
Autotuned SpMV   1.5 GF              1.9 GF              3.4 GF
Auto Speedup     1.5X                3.2X                ∞ (no naïve version)
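The derived rows of the table can be checked with a little arithmetic; assuming (as the numbers suggest) that "Efficiency %" is naïve throughput over peak GFLOPS and "Auto Speedup" is autotuned over naïve throughput:

```python
# Table values for the two chips that have a naïve baseline.
peak = {"Clovertown": 74.6, "Opteron": 17.6}       # GFLOPS, double precision
naive = {"Clovertown": 1.0, "Opteron": 0.6}        # GF, median over matrices
autotuned = {"Clovertown": 1.5, "Opteron": 1.9}    # GF

for chip in peak:
    eff = 100.0 * naive[chip] / peak[chip]
    speedup = autotuned[chip] / naive[chip]
    print(f"{chip}: efficiency {eff:.1f}%, autotune speedup {speedup:.1f}X")
# Clovertown: ~1.3% efficiency (the table's 1%), 1.5X speedup
# Opteron:    ~3.4% efficiency (the table's 3%), ~3.2X speedup
```

Cell has no naïve SpMV implementation, which is why its efficiency and speedup entries have no finite value.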
Greater productivity and efficiency for SpMV?
• Parallelizing compiler + multicore + caches + prefetching
• Autotuner + multicore + local store + DMA
• Originally, caches were there to improve programmer productivity
• That is not always the case for manycore + autotuner
• It is easier to autotune a single local store + DMA than multilevel caches + HW and SW prefetching
Deconstructing Operating Systems
• Resurgence of interest in virtual machines: a VM monitor is a thin SW layer between the guest OS and the HW
• Future OS: libraries where only the functions needed are linked into the app, on top of a thin hypervisor providing protection and sharing of resources
• Partitioning support for very thin hypervisors, and to allow software full access to hardware within its partition