61
EE CS Electrical Engineering and Computer Sciences BERKELEY PAR LAB P A R A L L E L C O M P U T I N G L A B O R A T O R Y EE CS Electrical Engineering and Computer Sciences BERKELEY PAR LAB The Parallel Revolution Has Started: Are You Part of the Solution or Part of the Problem? Krste Asanovic, Ras Bodik, Eric Brewer, Jim Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, Dave Patterson, Koushik Sen, David Wessel, and Kathy Yelick UC Berkeley Par Lab June 23, 2010

The Parallel Revolution Has Started: Are You Part of the Solution or Part of the Problem?

  • Upload
    molly

  • View
    40

  • Download
    0

Embed Size (px)

DESCRIPTION

The Parallel Revolution Has Started: Are You Part of the Solution or Part of the Problem?. Krste Asanovic, Ras Bodik, Eric Brewer, Jim Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, Dave Patterson , Koushik Sen, David Wessel, and Kathy Yelick UC Berkeley Par Lab - PowerPoint PPT Presentation

Citation preview

Page 1: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

P A R A L L E L C O M P U T I N G L A B O R A T O R Y

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

The Parallel Revolution Has Started: Are You Part of the Solution

or Part of the Problem?Krste Asanovic, Ras Bodik, Eric Brewer, Jim Demmel,

Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, Dave Patterson, Koushik Sen,

David Wessel, and Kathy YelickUC Berkeley Par Lab

June 23, 2010

Page 2: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

The Transition to Multicore

2

Sequential App Performance

Page 3: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

3

050

100150200250300

1985 1995 2005 2015

Millions of PCs / year

P.S. Multicore Revolution Could Fail

John Hennessy, President, Stanford University:“…when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. … I would be panicked if I were in industry.” “A Conversation with Hennessy & Patterson,” ACM Queue Magazine, 1/07.

100% failure rate of Parallel Computer Companies Ardent, Convex, Encore, Inmos (Transputer), MasPar, nCUBE,

Kendall Square Research, Sequent, Tandem, Thinking Machines What if IT goes from a growth

industry to a replacement industry? If SW can’t effectively use 32, 64, ...

cores per chip => SW no faster on new computer => Only buy if computer wears out

or some people buy cheaper PC (netbook)

Page 4: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

4

Need a Fresh Approach to Parallelism

Berkeley researchers from many backgrounds meeting since Feb. 2005 to discuss parallelism Krste Asanovic, Eric Brewer, Ras Bodik, Jim Demmel, Kurt

Keutzer,John Kubiatowicz, Dave Patterson, Koushik Sen, Kathy Yelick, …

Circuit design, computer architecture, massively parallel computing, computer-aided design, embedded hardware and software, programming languages, compilers, scientific programming, and numerical analysis

Tried to learn from successes in high-performance computing (LBNL) and parallel embedded (BWRC)

Led to “Berkeley View” Tech. Report 12/2006 and new Parallel Computing Laboratory (“Par Lab”)

2008: after open competition, Intel/MS award $10M

Goal: Productive, Efficient, Correct, Portable SW for 100+ cores & scale as core increase every 2 years (!)

Page 5: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Past parallel projects often dominated by hardware/architecture This is the one true way to build computers:

software must adapt to this breakthrough ILLIAC IV, Thinking Machines CM-2, Transputer,

Kendall Square KSR-1, Silicon Graphics Origin 2000 … Or sometimes by programming language

This is the one true way to write programs: hardware must adapt to this breakthrough

Id, Backus Functional Language FP, Occam, Linda, High Performance Fortran, Chapel, X10, Fortress …

Apps usually an afterthought

5

Need a Fresh Approach to Parallelism

Page 6: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Par Lab’s original “bets”Let compelling applications drive research

agendaSoftware platform: data center + mobile clientIdentify common programming patternsProductivity versus efficiency programmersAutotuning and software synthesisBuild-in correctness + power/performance

diagnosticsOS/Architecture support applications, provide

primitives not pre-packaged solutionsFPGA simulation of new parallel architectures:

RAMPCo-located integrated collaborative centerAbove all, no preconceived big idea - see what

works driven by application needs.6

Page 7: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

7

Personal Health

Image Retrieval

Hearing, Music

Speech

Parallel BrowserDesign Patterns/Motifs

Legacy Code

Schedulers

Communication & Synch. PrimitivesEfficiency Language Compilers

Easy to write portable code that runs efficiently on manycore

Legacy OS

Multicore/GPGPU

OS Libraries & Services

ParLab Manycore/RAMPHypervisor

OSArch.

Productivi

ty Layer

Efficienc

y Layer Corre

ctne

ss

Applicatio

ns

Selective Embedded JIT Specialization

Parallel Libraries

Parallel Frameworks

Dynamic Checking

Debuggingwith Replay

Directed Testing

Autotuners

Efficiency Languages

Diag

nosin

g Po

wer/P

erfo

rman

ce

Par Lab Research Overview

Productivity Languages

Page 8: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Par Lab, ~2 years in

How are our bets working out?What big new ideas are emerging?Where are we going in next 3 years?

8

Page 9: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

9

Dominant Application Platforms

Data Center or Cloud (“Server”) Laptop/Handheld (“Mobile Client”) Both together (“Server+Client”)

ParLab-RADLab/AMPLab collaborations Par Lab focuses on mobile clients

But many technologies apply to data center

Shift in Key Platforms:‘90s: Desktop/workstation‘00s: Laptops‘10s: Smartphone/Tablet/TV + Cloud

Implications? Smaller clients, fewer cores/client? More need for intra-task parallelism in data centers?

9

Page 10: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

10

Music and Audio Applications(David Wessel)

Musicians have an insatiable appetite for computation + real-time demands

More channels, instruments, more processing, more interaction!

Jitter free latency must be low (<5 ms) Must be reliable (No clicks!)

Novel Instruments & User Interfaces

New composition and performance systems beyond keyboards

Input devices for Laptop/Handheld Enhanced sound delivery systems using

large microphone and speaker arrays Music Information Retrieval (MIR)

systems for client & cloud

Page 11: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Health Application: Stroke Treatment

(Tony Keaveny)

Stroke treatment time-critical, need supercomputer performance in hospital

Image scan, blood flow analysis, then with stroke, then simulate blood thinner

200,000 cases per year in US20% not treated since too long

afterPotentially, 30% of those benefit

Recent progress: Build simplified 1.5D “Strokeflow” model as test case for ParLab SEJITS approach.

11

Page 12: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Health Application: Stroke Treatment

(Tony Keaveny)

Stroke treatment time-critical, need supercomputer performance in hospital

Image scan, blood flow analysis, then with stroke, then simulate blood thinner

200,000 cases per year in US20% not treated since too long

afterPotentially, 30% of those benefit

Recent progress: Build simplified 1.5D “Strokeflow” model as test case for ParLab SEJITS approach.

12

Page 13: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

13

Content-Based Image Retrieval

(Kurt Keutzer)

Relevance Feedback

ImageDatabase

Query by example

SimilarityMetric

CandidateResults Final Result

Built around Key Characteristics of personal databases Very large number of pictures (>5K) Non-labeled images Many pictures of few people Complex pictures including people, events,

places, and objects

1000’s of images

SVM&Damascene code ~10-100x speedups over existing implementations, 100s downloads. Project expanded to greater set of machine vision applications.

Page 14: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

14

Robust Speech Recognition(Nelson Morgan)

Meeting Diarist Laptops/ Handhelds at

meeting coordinate to create speaker identified, partially transcribed text diary of meeting

Use cortically-inspired manystream spatio-temporal features to tolerate noise

Progress: Parallelized all main ASR componentsNext: Parallelization of diarization code, SEJITS

Page 15: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

15

Parallel Browser (Ras Bodik)

Original goal: Desktop-quality browsing on handhelds (Enabled by 4G networks, better output devices)

Now: Better development environment for new browser-based applications Language for layout and behavior Specifying CSS layout semantics Layout engines via synthesis and autotuning Efficient parallel parsing components

Highlights:Identified 7 major browser-based app classes, and their needs. Developing very high-level (constraints-based) layout programming model to help productivity.

Page 16: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

More Applications:“Beating down our

doors!”New external application collaborators:Computer Vision (Jitendra Malik, Thomas Brox)Computational Finance (Matthew Dixon @UCD)Speech (Dorothea Kolossa @TU Berlin)Natural Language Translation (Dan Klein)Programming multitouch interfaces (Maneesh Agrawala)

Protein Docking (Henry Gabb, Intel)Pediatric MRI (Michael Lustig, Shreyas Vassanwala @Stanford)

16

Page 17: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Fast Pediatric MRI

17

Pediatric MRI is difficult Children cannot keep still or hold breath Must put children under anesthesia for long

exams: risky & costly Need techniques to accelerate MRI

acquisition (sample & multiple sensors) Reconstruction must also be fast, or time

saved in acquisition is lost in compute Current reconstruction time: 2 hours

Non-starter for clinical use Mark Murphy (Par Lab) starts on reconstruction 9/09: Pick SW

Patterns, Good SW Architecture, Good Algorithms, Autotuning Now 1 minute (100X): Fast enough for radiologist to decide Starting 3/10: in use for clinical study by Dr. Shreyas

Vasanawala at Lucille Packard Children's Hospital at Stanford

Page 18: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Types of Programming(or “types of programmer”)

Hardware/OS

Efficiency-Level(MS in CS) C/C++/FORTRAN

assembler

Java/C# Uses hardware/OS primitives, builds programming frameworks (or apps)

Productivity-Level(Some CS courses)

Python/Ruby/Lua

Scala

Uses programming frameworks, writes application frameworks (or apps)

Haskell/OCamL/F#

Domain-Level(No CS courses)

Max/MSP, SQL,CSS/Flash/Silverlight,Matlab, Excel, Rails

Builds app with DSL and/or by customizing app framework

Where & how is parallelism made

visible?

Provides hardware primitives and OS services

Example Languages Example Activities

18

Page 19: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Where to make parallelism visible?

Not in a Domain-Specific Language Should focus on making domain experts

productive Too many domains, new domains, multiple

domains/app Not in a new general-purpose parallel

language An oxymoron? Won’t get adopted. Most big applications written in >1 language.

Par Lab: Pattern-based software components Components use any language or parallel prog.

model Pattern-specific compilation to attain efficiency Flexible, efficient composition of separately

developed software components

19

Page 20: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LABMotifs common across applications

App 1 App 2 App 3

Dense Sparse

Graph Trav.

Berkeley View Motifs (“Dwarfs”)

20

Page 21: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

21

How do compelling apps relate to 12 motifs?

Motif (nee “Dwarf”) Popularity (Red Hot Blue Cool)

Page 22: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

22

Graph-Algorithms

Dynamic-Programming

Dense-Linear-Algebra

Sparse-Linear-Algebra

Unstructured-Grids

Structured-Grids

Model-View-Controller

Iterative-Refinement

Map-Reduce

Layered-Systems

Arbitrary-Static-Task-Graph

Pipe-and-Filter

Agent-and-Repository

Process-Control

Event-Based/Implicit-Invocation

Puppeteer

Graphical-Models

Finite-State-Machines

Backtrack-Branch-and-Bound

N-Body-Methods

Circuits

Spectral-Methods

Monte-Carlo

Applications

Structural Patterns Computational Patterns

Task-ParallelismDivide and Conquer

Data-ParallelismPipeline

Discrete-Event Geometric-DecompositionSpeculation

SPMDData-Par/index-space

Fork/JoinActors

Distributed-ArrayShared-Data

Shared-QueueShared-mapPartitioned Graph

MIMDSIMD

Parallel Execution Patterns

Concurrent Algorithm Strategy Patterns

Implementation Strategy Patterns

Message-PassingCollective-Comm.Transactional memory

Thread-PoolTask-Graph

Data structureProgram structure

Point-To-Point-Sync. (mutual exclusion)collective sync. (barrier)Memory sync/fence

Loop-Par.Task-Queue

Transactions

Thread creation/destructionProcess creation/destruction

Concurrency Foundation constructs (not expressed as patterns)

“Our” Pattern Language (OPL-2010)

(Kurt Keutzer, Tim Mattson)

Structural Patterns

A = M x VComputational Patterns

Refine Towards Implementation

Highlight: ~20 apps architected with patterns, patterns-to-frameworks talk later today.Wide uptake, 2nd ParaPLOP Workshop

Page 23: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

23

Structural Patterns describe Component Composition

Structural patterns describe how components are assembled, e.g. Map-Reduce, Agent-and-Repository, Static-task-graph

Our belief: Any large software application must have a comprehensible software architecture describable as a hierarchy of patterns => hierarchy of components

Page 24: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Mapping Patterns to Hardware

App 1 App 2 App 3

Dense Sparse

Graph Trav.

Multicore GPU “Cloud”Only a few types of hardware platform

24

Page 25: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

High-level pattern constrains space of reasonable low-level

mappings

(Insert latest OPL chart showing path)

25

Page 26: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

“Stovepipes”: pattern-specific and platform-specific

compilers

Multicore GPU “Cloud”

App 1 App 2 App 3

Dense Sparse

Graph Trav.

Allow maximum efficiency and expressibility in stovepipes by avoiding mandatory intermediary layers

26

Page 27: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

27

Autotuning for Code Generation

Search space for block sizes (dense matrix):• Axes are block

dimensions

• Temperature is speed

Problem: generating optimized code is like searching for needle in haystack; use computers rather than humans

Auto-tuning

Auto-parallelization

serialreference

OpenMPComparison

Auto-NUMA

Auto-tuners approach: program generates optimized code and data structures for a “motif” (~kernel)

ParLab autotuners for stencils (e.g., images), sparse matrices, particle/mesh, collectives (e.g., “reduce”)

Page 28: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

SEJITS: “Selective, Embedded, Just-In Time Specialization”

Use modern high-level scripting language (Python, Ruby) for productivity programming

Use language facilities to embed “specializers” that map high-level pattern to efficient low-level code at (develop/install/run)-time

Specializers can incorporate autotuners to generate tuned efficiency-level code for given platform

Efficiency programmers incrementally add specializers for (Pattern-Target) pairs. Fall back to scripting runtime if no specializer exists.

28

Page 29: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

.py

OS/HW

f() @h()

Specializer

.c

PLL

Inte

rp

@g()

SEJITS

Productivity app

.so

cc/ld

$

SEJITS makes tuning decisions per-function (not per-app)

Page 30: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

.py

OS/HW

f() @h()

Specializer

.c

PLL

Inte

rp

@g()

SEJITS

Productivity app

.so

cc/ld

$

SEJITS makes tuning decisions per-function (not per-app)

Selective

Embedded

JIT

Specialization

Page 31: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

SEJITS Main Ideas

1. Specializer == pattern-specific compiler exploit pattern-specific strategies that may

not generalize target specific hardware per pattern

2. Can happen at runtime 3. Productivity Level Language (PLL) program

always valid even without SEJITS support vs. incompatibly extended syntax; inspired

by Dom. Spec. Embed. Lang. vs. DSL argument

4. Specializers can be written in PLL31

Page 32: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Producing Software vs. Producing An Answer

SEJITS delivers adaptive parallel software HW variation: # cores, caches, DRAM, etc. runtime variation: resource availability

(OS+HW) SEJITS is a highly productive way to produce

exactly the code variants you need SEJITS makes research code productive

Exploit full libraries, tools, etc. of PLL Performance competitive with ELL code

=> run non-toy experiments Develop specializer to target new HW feature

=> Test designs with real apps 32

Page 33: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Correctness, Testing, and Debugging

Productivity Layer no low level data races, but non-determinism could still be present specify and verify semantic determinism and atomicity, i.e. there is no

bug due to parallelism• separates parallel correctness from functional correctness

Efficiency Layer data races, deadlocks, memory model related bugs, atomicity

violations, livelocks Active testing: combines static and dynamic analyses so rare

concurrency bugs discovered quickly and precisely Debugging

Record and replay Simplify a buggy concurrent trace

• reduce number of context switches• reduce length of the trace

Concurrent Breakpoints

33

Highlight:• ACM SIGSOFT Distinguished Paper Awards at ICSE 09 and FSE 09

• IFIP TC2 Manfred Paul Best Paper Award at ICSE 10

Page 34: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Key Culprit: Nondeterminism

Determinism key to parallel correctness Same input ==> semantically same output Parallelism is wrong if some schedules give a

correct answer while others don’t Lightweight spec of parallel correctness

Independent of functional specification Can effectively test deterministic specs

New Approach: Automatically infer deterministic specifications by observing sample program runs Result: Recovered previous manual

specifications for most benchmarks 34

Page 35: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

35

Tessellation OS: Space-Time Partitioning + 2-Level Scheduling

1st level: OS determines coarse-grain allocation of resources to jobs over space and time

2nd level: Application schedules component tasks onto available “harts” (hardware thread contexts) using Lithe

TimeSpace

Spac

e2nd-level

Scheduling

Address Space A

Address Space B Tas

k

Tessellation Kernel(Partition Support)

CPUL1

L2Bank

DRAM

DRAM & I/O Interconnect

L1 Interconnect

CPUL1

L2Bank

DRAM

CPUL1

L2Bank

DRAM

CPUL1

L2Bank

DRAM

CPUL1

L2Bank

DRAM

CPUL1

L2Bank

DRAM

Prototype ROS running on RAMP Gold and Nehalem-x86

Page 36: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Software must be adaptive

Environment-adaptive Number of cores, speed of cores, soft errors,

multiprogramming, battery life SPMD/Static load-balancing/static mapping, things of

the past Input-adaptive

Degree of parallelism depends on input parameters Adaptation managed by:

OS resource manager (“how many resources best for this app?”)

Real-time app (“how many resources to meet deadline?”)

Quality-adjustable app (“what’s possible with these resources?”)

Need hardware measurement facilities36

Future: Need to work on

efficient adaptively parallel algorithms.

Page 37: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Par Lab “Multi-Paradigm” Architecture

Single “Fat” ILP-focused Tile Control Processor

Multiple “Thin” Lane Control Processors embedded in vector-thread lane

Core

Tile

Tile-Private L2U$

Fat Tile Control

Processor(ILP)

L1D$

L1I$

Shareable L3$/LL$

Vector-

ThreadLane

Thin Scalar ControlProc

.Vector-

ThreadLane

Thin Scalar ControlProc

.Vector-

ThreadLane

Thin Scalar ControlProc

.

Tile Control Processor, Lane Control Processor, and Vector-Thread microthreads all run the same ISA, but microarchs optimized for different forms of parallelism

Par Lab Architecture research using a “superset” architecture that can model various forms of scalar + vector machine, and various forms of memory hierarchy and interconnect.

37

Page 38: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Hardware Measurement: This we believe

Parallel HW/SW must support performance-portable parallel software

If you expect programmers to continue “Moore’s Law” by doubling amount of portable parallelism in programs every 2 years*, need hardware measurement for them to see how well doing During development inside an IDE During runtime so that app, resource

scheduler, and OS can see and adapt

38

*Shekhar Borkar, Intel, CNET News, 2009

Page 39: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

RAMP GoldRapid accurate simulation of manycore architectural ideas using FPGAs

Initial version models 64 cores of SPARC v8 with shared memory system on $750 board

Hardware FPU, MMU, boots OS.

Cost Performance(MIPS) Simulations per day

SoftwareSimulator $2,000 0.1 - 1 1

RAMP Gold $2,000 + $750 50 - 100 100

Highlight: RAMP Gold in production use (ISCA/HotPar/DAC papers based on RAMP Gold results).

39

Page 40: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Co-located Collaborative Center Approach

Continual collaboration sometimes hard work, but very beneficial. Many ideas emerge from, or are honed by, cross-subproject discussions. E.g. Eigensolvers for Damascene application E.g. Audio framework built on Lithe E.g. Sparse-matrix language for library development

Integration with other components/layers forces real implementations not just paper prototypes Whole stack demo on RAMP Gold

Can’t see how to have made this much progress otherwise

40

Page 41: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Par Lab Stats 16 faculty, 62 PhD students, 4 postdocs 125+ papers

Par Lab cover story, CACM, Oct. ‘09 “The Trouble with Multicore,”

IEEE Spectrum, July ‘10 Workshops / Summer Courses @ UCB

UCB “Bootcamp”: 196 attend 2008, 397 in 2009• Next is August 16-18, 2010; see Par Lab Web Site

2 Founding Companies: Intel and Microsoft 6 Affiliates: National Instruments, NEC, Nokia,

NVIDIA, Oracle, and Samsung

41

Page 42: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Par Lab Summary Multicore challenge hardest for CS in 50 years:

Performance “Moore’s Law” up to programmers! Unveil 2X parallelism/program every 2 years

Software platform: mobile client + cloud Identify common programming patterns to

reveal parallelism and use good software architecture

Target productivity versus efficiency programmers

Autotuning & Specialized Embedded JIT Specialization

Active Testing OS/Architecture support isolation,

measurement FPGA simulation of new parallel architectures:

RAMP No preconceived silver bullet; let’s apps decide

Already many apps with significant speedup

42

Page 43: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Backup Slides & References

Armbrust, M., A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, M. Zaharia, “A View of Cloud Computing,” Communications of the ACM, 53:3, March 2010.

Asanović, K., R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, K. Yelick, "A View of the Parallel Computing Landscape,” Communications of the ACM, 52:10, October 2009.

J. Burnim and K. Sen. Asserting and checking determinism for multithreaded programs. Communications of the ACM, 53(6), June 2010.

Catanzaro, B., A. Fox, K. Keutzer, D. Patterson, B-Y. Su, M.Snir, K. Olukotun, P. Hanrahan, and H. Chafi, “Ubiquitous Parallel Computing from Berkeley, Illinois and Stanford,” IEEE Micro, March/April 2010.

Catanzaro, B., S. Kamil, Y. Lee, K. Asanović, J. Demmel, K. Keutzer, J. Shalf, K. Yelick, and A. Fox, "SEJITS: Getting Productivity and Performance with Selective Embedded JIT Specialization,” 1st Workshop on Programmable Models for Emerging Architecture at the 18th Int’l Conf. on Parallel Architectures and Compilation Techniques, Raleigh, North Carolina, November 2009.

J. A. Colmenares, S. Bird, H. Cook, P. Pearce, D. Zhu, J. Shalf, S. Hofmeyr, K. Asanovic, and J. Kubiatowicz. Resource management in the Tessellation manycore OS. HotPar’10, Berkeley, CA, USA, June 2010.

Tan, Z., A. Waterman, S. Bird, H. Cook, K. Asanović, and D. Patterson, “A Case for FAME: FPGA Architecture Model Execution,” Proc. ISCA, June 2010.

43

Page 44: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Par Lab Apps What are the compelling future workloads?

oNeed apps of future vs. legacy to drive agendao Improve research even if not the real killer apps

Computer Vision: Segment-Based Object Recognition, Poselet-Based Human Detection

Health: MRI Reconstruction, Stroke Simulation Music: 3D Enhancer, Hearing Aid, Novel UI Speech: Automatic Meeting Diary Video Games: Analysis of Smoke 2.0 Demo Computational Finance: Value-at-Risk

Estimation, Crank-Nicolson Option Pricing Parallel Browser: Layout, Scripting Language

44

Page 45: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Developing Parallel Software

Conventional: Measure and recode slow pieces with more threads

Par Lab: Find right SW architecture that reveals parallelism

45

Application Specification

SW Arch. toIdentify Parallelism

Performanceprofile

Not fast enough

Fast enough

Ship it

Thought Experiment:Map SW Arch. to HW Arch.

Write / Debug Code

Page 46: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Correctness, Testing, and Debugging

Active testing: combines static and dynamic analyses so that rare concurrency bugs are discovered quickly and precisely Burnim, Sen “Asserting and Checking Determinism for

Multithreaded Programs”• ACM SIGSOFT Distinguished Paper Award

+ CACM Research Highlight Invitation Naik, Park, Sen, Gay “Effective Static Deadlock

Detection” • ACM SIGSOFT Distinguished Paper Award

Actively control scheduler to force potentially buggy schedules: Data races, Atomicity Violations, Deadlocks

Found parallel bugs in real production OSS code: Apache Commons Collections, Java Collections Framework, Java Swing GUI framework, and Java Database Connectivity (JDBC)

Page 47: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

SHOT Functional Requirements

Standardized Hardware Operation Tracker: SHOT Low latency reads so deployed in production code Can be read by OS and by user apps To be used by virtual machines, must be able to

save and restore as part of context switch Since some counters are per core, SW must read

all counters as if on same clock edge Don’t need to be perfect counts, just consistent:

accuracy ± 1% OK

47

Page 48: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Minimum SHOT Architecture

1. Global real time clock (vs. count clock cycles) Since clock rate varies due to Dynamic Voltage

and Frequency Scaling (DVFS) ~ 100 MHz (fast enough for apps)

2. Count Number instructions retired per core Measure computation throughput

3. Count off-chip memory traffic (incl. prefetching) Key to performance and energy

4. Standard so apps and OS can rely on them Standardized Hardware Measurement as

important as IEEE Floating Point Standard?48

Page 49: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Make productivity programmers efficient, and efficiency programmers productive?

Autotuning problem: The search space is large—taking a lot of cycles to explore and a long time

Search Full Parameter Space More than 180 Days

Using machine learning + few performance counters to democratize autotuning 12 minutes to find solution

As good or even beat the expert designed autotuner! -1% and 16% for a 7-pt Stencil -2% and 15% for a 27-pt Stencil 18% and 50% for dense matrix

Enables even greater range of optimizations than we imagined

Page 50: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Why might we succeed this time?

No Killer Microprocessor to Save Programmers No one is building a faster serial microprocessor For programs to go faster, SW must use parallel HW

New Metrics for Success vs. Linear Speedup Real Time Latency/Responsiveness and/or MIPS/Joule Just need some new killer parallel apps

vs. all legacy SW must achieve linear speedup Necessity: All the Wood Behind One Arrow

Whole industry committed, so more working on it If future growth of IT depends on faster processing at

same price (vs. lowering costs like NetBook)

50

Page 51: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Why might we succeed this time?

Multicore Synergy with Cloud Computing Cloud Computing apps parallel even if client not parallel Manycore is cost-reduction, not radical SW disruption

Vitality of Open Source Software OSS community more quickly embraces advances?

Single-Chip Multiprocessors Enable Innovation Enables inventions that were impractical or

uneconomical when multiprocessors were 100s chips FPGA prototypes shorten HW/SW cycle

Fast enough to run whole SW stack, can change every day vs. every 4 to 5 years when do chips

51

Page 52: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Gesture-Enhanced User Interface

• Using human gestures, motion, body pose to control a video game interface• Natal-like interface using built-in

laptop webcam, not add-on stereo-infrared

• We are collaborating with Jitendra Malik, world-leader in Computer Vision on "Poselets"-based human detection and pose estimation• Algorithmic improvements, GPU implementation: 30x speedup• Detection running near real-time using a $5 webcam

• Enrich the Windows user experience• Using tracking body pose, allow

user to interact with a 3D world: richer than touch-screen provides

Page 53: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Recent Results: Vision Acceleration

Bryan Catanzaro: Parallelizing Computer Vision (image segmentation)

Problem: Malik’s highest quality algorithm was 5.5 minutes / image on new PC

Good SW architecture + talk within Par Lab on to use new algorithms, data structures Bor-Yiing Su, Yunsup Lee, Narayanan Sundaram,

Mark Murphy, Kurt Keutzer, Jim Demmel, Sam Williams Current result: 1.8 seconds / image on manycore ~ 150X speedup

Factor of 10 quantitative change is a qualitative change Malik: “This will revolutionize computer vision.”

53

Page 54: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

RAMP Blue, July 20071008 modified MicroBlaze cores90MHzRTL directly mapped to FPGARuns UPC version of NAS parallel

benchmarks.Message-passing cluster

No MMURequires lots of hardware

21 BEE2 boards / 84 FPGAsDifficult to modifyHigh clock rate but low IPC,

particularly if want to model timing

Page 55: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Alexander’s Pattern Language

Christopher Alexander’s approach to (civil) architecture: A pattern is a generalizable solution to

a recurring problem. "Each pattern describes a problem

which occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over, without ever doing it the same way twice.“ Page x, A Pattern Language, Christopher Alexander

Alexander’s 253 (civil) architectural patterns range from the creation of cities (2. distribution of towns) to particular building problems (232. roof cap)

A pattern language is an organized way of tackling an architectural problem using patterns

Main limitation: It’s about civil not software

architecture!!!

Page 56: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

SEJITS Prototypes

Copperhead (Bryan Catanzaro, Michael Garland) Pattern/target: data parallel programming/GPU PSC finds opportunities to unroll loops, fuse

operations, avoid data structure conversions, .... Asp* + PySKI (Shoaib Kamil, Erin Carson et al.)

Patterns: captured by OO classes & higher-order methods; structured grid, lambda application (map), sparse matrix, others to follow

Target: x86 multicore with OpenMP, or similar LL (Gilad Arnold, Ras Bodík et al.)

Domain-specific language + compiler Patterns: sparse matrix linear algebra

56* “Asp is SEJITS for Python”

Page 57: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

One Decade of SAME(Software Architecture Model Execution)

Median Instructions Simulated/ Benchmark

Median #Cores

Median Instructions Simulated/ Core

ISCA 1998

267M 1 267M

ISCA 2008

825M 16 100M

[ “A Case for FAME”, ISCA 2010]

Page 58: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

58

Dimensions in FAME(FPGA Architecture Model Execution)

Direct: One target cycle executed in one FPGA host cycleDecoupled: One target cycle takes one or more FPGA cycles

Full RTL: Complete RTL of target machine modeledAbstract RTL: Partial/simplified RTL, split functional/timing

Host single-threaded: One target model per host pipelineHost multi-threaded: Multiple target models per host pipeline

Page 59: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Host Multithreading

CPU1

CPU2

CPU3

CPU4Target Model

Multithreading emulation engine reduces FPGA resource use and improves emulator throughput

Hides emulation latencies (e.g., communicating across FPGAs)

Multithreaded Emulation Engine (on FPGA)

+1

2

PC1PC

1PC1PC

1I$ IR GPR1GPR1GPR1GPR1

X

Y

2

D$Single hardware pipeline with multiple copies of CPU state

Page 60: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

60

Composable Resource Management (Lithe + OS)

Tessellation OSLithe User-Level Scheduling

ABIHardware Cores

App 1Module

2Module

3

Module 1

App 2

Page 61: The Parallel Revolution Has Started:  Are You Part of the Solution  or Part of the Problem?

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

61

Resource Management for Real-Time

Tessellation OSLithe User-Level Scheduling

ABIHardware Cores

App 1Module

2Module

1

Real-Time Module 3