The Parallel Revolution Has Started: Are You Part of the Solution or Part of the Problem? Krste Asanovic, Ras Bodik, Eric Brewer, Jim Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, Dave Patterson, Koushik Sen, David Wessel, and Kathy Yelick, UC Berkeley Par Lab
EECSElectrical Engineering and
Computer Sciences BERKELEY PAR LAB
PARALLEL COMPUTING LABORATORY
The Parallel Revolution Has Started: Are You Part of the Solution or Part of the Problem?
Krste Asanovic, Ras Bodik, Eric Brewer, Jim Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, Dave Patterson, Koushik Sen, David Wessel, and Kathy Yelick
UC Berkeley Par Lab
June 23, 2010
The Transition to Multicore
[Figure: Sequential App Performance]
P.S. Multicore Revolution Could Fail
[Figure: Millions of PCs / year, 1985 to 2015; vertical axis 0 to 300]
• John Hennessy, President, Stanford University: “…when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. … I would be panicked if I were in industry.” (“A Conversation with Hennessy & Patterson,” ACM Queue Magazine, 1/07)
• 100% failure rate of Parallel Computer Companies
  • Ardent, Convex, Encore, Inmos (Transputer), MasPar, nCUBE, Kendall Square Research, Sequent, Tandem, Thinking Machines
• What if IT goes from a growth industry to a replacement industry?
  • If SW can’t effectively use 32, 64, ... cores per chip => SW no faster on new computer => Only buy if computer wears out, or some people buy cheaper PC (netbook)
Need a Fresh Approach to Parallelism
• Berkeley researchers from many backgrounds meeting since Feb. 2005 to discuss parallelism
  • Krste Asanovic, Eric Brewer, Ras Bodik, Jim Demmel, Kurt Keutzer, John Kubiatowicz, Dave Patterson, Koushik Sen, Kathy Yelick, …
  • Circuit design, computer architecture, massively parallel computing, computer-aided design, embedded hardware and software, programming languages, compilers, scientific programming, and numerical analysis
• Tried to learn from successes in high-performance computing (LBNL) and parallel embedded (BWRC)
• Led to “Berkeley View” Tech. Report 12/2006 and new Parallel Computing Laboratory (“Par Lab”)
• 2008: after open competition, Intel/MS award $10M
• Goal: Productive, Efficient, Correct, Portable SW for 100+ cores & scale as cores increase every 2 years (!)
Need a Fresh Approach to Parallelism
• Past parallel projects often dominated by hardware/architecture
  • This is the one true way to build computers: software must adapt to this breakthrough
  • ILLIAC IV, Thinking Machines CM-2, Transputer, Kendall Square KSR-1, Silicon Graphics Origin 2000 …
• Or sometimes by programming language
  • This is the one true way to write programs: hardware must adapt to this breakthrough
  • Id, Backus Functional Language FP, Occam, Linda, High Performance Fortran, Chapel, X10, Fortress …
• Apps usually an afterthought
Par Lab’s original “bets”
• Let compelling applications drive research agenda
• Software platform: data center + mobile client
• Identify common programming patterns
• Productivity versus efficiency programmers
• Autotuning and software synthesis
• Built-in correctness + power/performance diagnostics
• OS/Architecture support applications, provide primitives not pre-packaged solutions
• FPGA simulation of new parallel architectures: RAMP
• Co-located integrated collaborative center
• Above all, no preconceived big idea: see what works, driven by application needs.
Par Lab Research Overview
[Figure: Par Lab software stack. Goal: easy to write portable code that runs efficiently on manycore.]
• Layers shown: Applications, Productivity Layer, Efficiency Layer, OS, Arch.
• Vertical side labels: Correctness, Diagnosing Power/Performance
• Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser
• Other component labels: Design Patterns/Motifs, Productivity Languages, Parallel Frameworks, Parallel Libraries, Selective Embedded JIT Specialization, Efficiency Languages, Efficiency Language Compilers, Autotuners, Schedulers, Communication & Synch. Primitives, Legacy Code, Dynamic Checking, Debugging with Replay, Directed Testing, Legacy OS, OS Libraries & Services, Hypervisor, Multicore/GPGPU, ParLab Manycore/RAMP
Par Lab, ~2 years in
• How are our bets working out?
• What big new ideas are emerging?
• Where are we going in next 3 years?
Dominant Application Platforms
• Data Center or Cloud (“Server”)
• Laptop/Handheld (“Mobile Client”)
• Both together (“Server+Client”)
  • ParLab-RADLab/AMPLab collaborations
• Par Lab focuses on mobile clients
  • But many technologies apply to data center
• Shift in Key Platforms:
  • ‘90s: Desktop/workstation
  • ‘00s: Laptops
  • ‘10s: Smartphone/Tablet/TV + Cloud
• Implications? Smaller clients, fewer cores/client? More need for intra-task parallelism in data centers?
Music and Audio Applications (David Wessel)
• Musicians have an insatiable appetite for computation + real-time demands
  • More channels, instruments, more processing, more interaction!
  • Jitter-free latency must be low (<5 ms)
  • Must be reliable (No clicks!)
• Novel Instruments & User Interfaces
  • New composition and performance systems beyond keyboards
  • Input devices for Laptop/Handheld
• Enhanced sound delivery systems using large microphone and speaker arrays
• Music Information Retrieval (MIR) systems for client & cloud
Health Application: Stroke Treatment (Tony Keaveny)
• Stroke treatment time-critical, need supercomputer performance in hospital
• Image scan, blood flow analysis, then with stroke, then simulate blood thinner
• 200,000 cases per year in US
  • 20% not treated since too long after
  • Potentially, 30% of those benefit
• Recent progress: Build simplified 1.5D “Strokeflow” model as test case for ParLab SEJITS approach.
Content-Based Image Retrieval (Kurt Keutzer)
[Figure: retrieval pipeline with stages Query by example, Similarity Metric, Candidate Results, Relevance Feedback, and Final Result, over an Image Database of 1000’s of images]
• Built around Key Characteristics of personal databases
  • Very large number of pictures (>5K)
  • Non-labeled images
  • Many pictures of few people
  • Complex pictures including people, events, places, and objects
• SVM & Damascene code ~10-100x speedups over existing implementations, 100s downloads. Project expanded to greater set of machine vision applications.
Robust Speech Recognition (Nelson Morgan)
• Meeting Diarist
  • Laptops/Handhelds at meeting coordinate to create speaker identified, partially transcribed text diary of meeting
• Use cortically-inspired manystream spatio-temporal features to tolerate noise
• Progress: Parallelized all main ASR components
• Next: Parallelization of diarization code, SEJITS
Parallel Browser (Ras Bodik)
• Original goal: Desktop-quality browsing on handhelds (enabled by 4G networks, better output devices)
• Now: Better development environment for new browser-based applications
  • Language for layout and behavior
  • Specifying CSS layout semantics
  • Layout engines via synthesis and autotuning
  • Efficient parallel parsing components
• Highlights: Identified 7 major browser-based app classes, and their needs. Developing very high-level (constraints-based) layout programming model to help productivity.
More Applications: “Beating down our doors!”
New external application collaborators:
• Computer Vision (Jitendra Malik, Thomas Brox)
• Computational Finance (Matthew Dixon @UCD)
• Speech (Dorothea Kolossa @TU Berlin)
• Natural Language Translation (Dan Klein)
• Programming multitouch interfaces (Maneesh Agrawala)
• Protein Docking (Henry Gabb, Intel)
• Pediatric MRI (Michael Lustig, Shreyas Vasanawala @Stanford)
Fast Pediatric MRI
• Pediatric MRI is difficult
  • Children cannot keep still or hold breath
  • Must put children under anesthesia for long exams: risky & costly
• Need techniques to accelerate MRI acquisition (sample & multiple sensors)
• Reconstruction must also be fast, or time saved in acquisition is lost in compute
  • Current reconstruction time: 2 hours
  • Non-starter for clinical use
• Mark Murphy (Par Lab) starts on reconstruction 9/09: Pick SW Patterns, Good SW Architecture, Good Algorithms, Autotuning
• Now 1 minute (100X): Fast enough for radiologist to decide
• Starting 3/10: in use for clinical study by Dr. Shreyas Vasanawala at Lucile Packard Children's Hospital at Stanford
Types of Programming (or “types of programmer”)
Where & how is parallelism made visible?
Level | Example Languages | Example Activities
Domain-Level (No CS courses) | Max/MSP, SQL, CSS/Flash/Silverlight, Matlab, Excel, Rails | Builds app with DSL and/or by customizing app framework
Productivity-Level (Some CS courses) | Python/Ruby/Lua, Scala, Haskell/OCaml/F# | Uses programming frameworks, writes application frameworks (or apps)
Efficiency-Level (MS in CS) | C/C++/FORTRAN, assembler, Java/C# | Uses hardware/OS primitives, builds programming frameworks (or apps)
Hardware/OS | | Provides hardware primitives and OS services
Where to make parallelism visible?
• Not in a Domain-Specific Language
  • Should focus on making domain experts productive
  • Too many domains, new domains, multiple domains/app
• Not in a new general-purpose parallel language
  • An oxymoron?
  • Won’t get adopted.
  • Most big applications written in >1 language.
• Par Lab: Pattern-based software components
  • Components use any language or parallel prog. model
  • Pattern-specific compilation to attain efficiency
  • Flexible, efficient composition of separately developed software components
Motifs common across applications
[Figure: App 1, App 2, and App 3 drawn over shared motifs labeled Dense, Sparse, and Graph Trav.]
Berkeley View Motifs (“Dwarfs”)
How do compelling apps relate to 12 motifs?
[Figure: Motif (née “Dwarf”) popularity by application; red = hot, blue = cool]
“Our” Pattern Language (OPL-2010) (Kurt Keutzer, Tim Mattson)
[Figure: layered pattern language, starting from Applications and refined towards implementation; annotation: A = M x V]
• Structural Patterns: Pipe-and-Filter, Agent-and-Repository, Process-Control, Event-Based/Implicit-Invocation, Puppeteer, Model-View-Controller, Iterative-Refinement, Map-Reduce, Layered-Systems, Arbitrary-Static-Task-Graph
• Computational Patterns: Graph-Algorithms, Dynamic-Programming, Dense-Linear-Algebra, Sparse-Linear-Algebra, Unstructured-Grids, Structured-Grids, Graphical-Models, Finite-State-Machines, Backtrack-Branch-and-Bound, N-Body-Methods, Circuits, Spectral-Methods, Monte-Carlo
• Concurrent Algorithm Strategy Patterns: Task-Parallelism, Divide and Conquer, Data-Parallelism, Pipeline, Discrete-Event, Geometric-Decomposition, Speculation
• Implementation Strategy Patterns
  • Program structure: SPMD, Data-Par/index-space, Fork/Join, Actors, Loop-Par., Task-Queue
  • Data structure: Distributed-Array, Shared-Data, Shared-Queue, Shared-map, Partitioned Graph
• Parallel Execution Patterns: MIMD, SIMD, Thread-Pool, Task-Graph, Transactions
• Concurrency Foundation constructs (not expressed as patterns): Thread creation/destruction, Process creation/destruction, Message-Passing, Collective-Comm., Transactional memory, Point-To-Point-Sync. (mutual exclusion), collective sync. (barrier), Memory sync/fence
• Highlight: ~20 apps architected with patterns; patterns-to-frameworks talk later today. Wide uptake, 2nd ParaPLOP Workshop.
Structural Patterns describe Component Composition
Structural patterns describe how components are assembled, e.g. Map-Reduce, Agent-and-Repository, Static-task-graph
Our belief: Any large software application must have a comprehensible software architecture describable as a hierarchy of patterns => hierarchy of components
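The idea of a hierarchy of patterns mapping to a hierarchy of components can be sketched in a few lines. This is only an illustration, not Par Lab code: the helper names `pipe_and_filter` and `map_reduce` and the word-count application are hypothetical, chosen to show one structural pattern nested inside another.

```python
from functools import reduce


def pipe_and_filter(*filters):
    """Structural pattern: chain filters so each one's output feeds the next."""
    def run(data):
        for f in filters:
            data = f(data)
        return data
    return run


def map_reduce(mapper, reducer, initial):
    """Structural pattern: independent map over items, then a combining reduce."""
    def run(items):
        return reduce(reducer, map(mapper, items), initial)
    return run


def split_lines(text):
    """First filter: turn raw text into a list of lines."""
    return text.splitlines()


def words_in(line):
    """Mapper: number of words in one line."""
    return len(line.split())


def add(a, b):
    """Reducer: combine two partial counts."""
    return a + b


# Hypothetical application: a pipe-and-filter whose second stage is a map-reduce.
count_words = map_reduce(words_in, add, 0)
app = pipe_and_filter(split_lines, count_words)

total = app("the parallel revolution\nhas started")
print(total)
```

Each component only knows its own pattern, so the map stage could later be handed to a parallel implementation without touching the surrounding pipeline.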
Mapping Patterns to Hardware
[Figure: App 1, App 2, and App 3 over patterns labeled Dense, Sparse, and Graph Trav., mapped onto Multicore, GPU, and “Cloud”]
Only a few types of hardware platform
High-level pattern constrains space of reasonable low-level mappings
(Insert latest OPL chart showing path)
“Stovepipes”: pattern-specific and platform-specific compilers
[Figure: App 1, App 2, and App 3 over patterns labeled Dense, Sparse, and Graph Trav., with separate paths down to Multicore, GPU, and “Cloud”]
Allow maximum efficiency and expressibility in stovepipes by avoiding mandatory intermediary layers
Autotuning for Code Generation
• Problem: generating optimized code is like searching for needle in haystack; use computers rather than humans
• Auto-tuners approach: program generates optimized code and data structures for a “motif” (~kernel)
• ParLab autotuners for stencils (e.g., images), sparse matrices, particle/mesh, collectives (e.g., “reduce”)
[Figure: search space for block sizes (dense matrix); axes are block dimensions, temperature is speed]
[Figure: performance bars labeled serial reference, OpenMP Comparison, Auto-parallelization, Auto-NUMA, Auto-tuning]
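The search over block sizes can be illustrated with a toy autotuner. This is a minimal sketch under stated assumptions, not one of the ParLab autotuners: it uses a pure-Python blocked matrix multiply as the kernel, an exhaustive search over a handful of hypothetical candidate block sizes, and wall-clock time as the only metric.

```python
import time


def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) using square blocking."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return c


def autotune(kernel, candidates, make_args, repeats=2):
    """Time the kernel for every candidate parameter; return (best, timings)."""
    timings = {}
    for cand in candidates:
        best = float("inf")
        for _ in range(repeats):
            args = make_args()
            start = time.perf_counter()
            kernel(*args, cand)
            best = min(best, time.perf_counter() - start)
        timings[cand] = best
    return min(timings, key=timings.get), timings


N = 32  # small problem size so the sketch runs quickly


def make_args():
    """Build fresh input matrices for one timed run."""
    a = [[float(i + j) for j in range(N)] for i in range(N)]
    b = [[float(i - j) for j in range(N)] for i in range(N)]
    return a, b, N


best_block, timings = autotune(blocked_matmul, [4, 8, 16, 32], make_args)
print("best block size:", best_block)
```

The winning block size depends on the machine it runs on, which is the point: the search is repeated per platform rather than fixed by a human.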
SEJITS: “Selective, Embedded, Just-In-Time Specialization”
• Use modern high-level scripting language (Python, Ruby) for productivity programming
• Use language facilities to embed “specializers” that map high-level pattern to efficient low-level code at (develop/install/run)-time
• Specializers can incorporate autotuners to generate tuned efficiency-level code for given platform
• Efficiency programmers incrementally add specializers for (Pattern-Target) pairs. Fall back to scripting runtime if no specializer exists.
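The fallback behavior can be sketched with an ordinary Python decorator. This is an illustration only and not the API of any SEJITS prototype: the registry `SPECIALIZERS`, the decorators `register_specializer` and `specialize`, and the pattern/target names are all hypothetical, and the "specialized" version is plain Python standing in for generated low-level code.

```python
import functools

# Hypothetical registry: (pattern, target) -> specialized implementation.
SPECIALIZERS = {}
# Records which path each call took, for illustration.
TRACE = []


def register_specializer(pattern, target):
    """Efficiency programmer's hook: add a specializer for one (pattern, target) pair."""
    def decorator(impl):
        SPECIALIZERS[(pattern, target)] = impl
        return impl
    return decorator


def specialize(pattern, target):
    """Use a registered specializer if one exists; otherwise run the plain Python function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            impl = SPECIALIZERS.get((pattern, target))
            if impl is None:
                TRACE.append((func.__name__, "fallback"))
                return func(*args, **kwargs)
            TRACE.append((func.__name__, "specialized"))
            return impl(*args, **kwargs)
        return wrapper
    return decorator


@register_specializer("map", "multicore")
def fast_scale(values, factor):
    # Stand-in for generated code; a real specializer would emit and compile C here.
    return [v * factor for v in values]


@specialize("map", "multicore")
def scale(values, factor):
    return list(map(lambda v: v * factor, values))


@specialize("stencil", "gpu")  # no specializer registered, so this stays in Python
def blur(values):
    last = len(values) - 1
    return [
        (values[max(i - 1, 0)] + values[i] + values[min(i + 1, last)]) / 3
        for i in range(len(values))
    ]


doubled = scale([1, 2, 3], 2)
smoothed = blur([3.0, 3.0, 3.0])
print(doubled, smoothed, TRACE)
```

Because the decorated function is still valid Python, the program runs unchanged when no specializer is installed, which is the property the slide calls falling back to the scripting runtime.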
[Figure: SEJITS flow. A productivity app (.py) with functions f(), @g(), and @h() runs on the PLL interpreter; the SEJITS specializer emits .c, which cc/ld turns into a .so, with a cache ($); everything runs on OS/HW. Callouts: Selective, Embedded, JIT, Specialization.]
SEJITS makes tuning decisions per-function (not per-app)
SEJITS Main Ideas
1. Specializer == pattern-specific compiler
  • exploit pattern-specific strategies that may not generalize
  • target specific hardware per pattern
2. Can happen at runtime
3. Productivity Level Language (PLL) program always valid even without SEJITS support
  • vs. incompatibly extended syntax; inspired by Dom. Spec. Embed. Lang. vs. DSL argument
4. Specializers can be written in PLL
Producing Software vs. Producing An Answer
• SEJITS delivers adaptive parallel software
  • HW variation: # cores, caches, DRAM, etc.
  • Runtime variation: resource availability (OS+HW)
• SEJITS is a highly productive way to produce exactly the code variants you need
• SEJITS makes research code productive
  • Exploit full libraries, tools, etc. of PLL
  • Performance competitive with ELL code => run non-toy experiments
  • Develop specializer to target new HW feature => Test designs with real apps
Correctness, Testing, and Debugging
• Productivity Layer
  • no low level data races, but non-determinism could still be present
  • specify and verify semantic determinism and atomicity, i.e. there is no bug due to parallelism
    • separates parallel correctness from functional correctness
• Efficiency Layer
  • data races, deadlocks, memory model related bugs, atomicity violations, livelocks
  • Active testing: combines static and dynamic analyses so rare concurrency bugs discovered quickly and precisely
• Debugging
  • Record and replay
  • Simplify a buggy concurrent trace
    • reduce number of context switches
    • reduce length of the trace
  • Concurrent Breakpoints
• Highlight:
  • ACM SIGSOFT Distinguished Paper Awards at ICSE 09 and FSE 09
  • IFIP TC2 Manfred Paul Best Paper Award at ICSE 10
Key Culprit: Nondeterminism
• Determinism key to parallel correctness
  • Same input ==> semantically same output
  • Parallelism is wrong if some schedules give a correct answer while others don’t
• Lightweight spec of parallel correctness
  • Independent of functional specification
  • Can effectively test deterministic specs
• New Approach: Automatically infer deterministic specifications by observing sample program runs
  • Result: Recovered previous manual specifications for most benchmarks
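A determinism spec of the form "same input ==> semantically same output" can be tested by running the same input under several schedules and comparing results with an equivalence predicate. The sketch below is a simplified illustration, not the Par Lab tooling: schedules are simulated by seeded shuffles rather than real thread interleavings, and `is_deterministic`, `parallel_sum`, and `last_writer` are hypothetical names.

```python
import math
import random


def parallel_sum(values, schedule_seed):
    """Simulated parallel reduction: the schedule decides the order of combination."""
    rng = random.Random(schedule_seed)
    order = list(values)
    rng.shuffle(order)
    total = 0.0
    for v in order:
        total += v
    return total


def last_writer(values, schedule_seed):
    """Simulated race: the result is whichever task happens to write last."""
    rng = random.Random(schedule_seed)
    order = list(values)
    rng.shuffle(order)
    return order[-1]


def is_deterministic(run, inputs, seeds, equivalent):
    """Check that every schedule tried yields an output equivalent to the first one."""
    reference = run(inputs, seeds[0])
    return all(equivalent(reference, run(inputs, s)) for s in seeds[1:])


def close(x, y):
    """Semantic equality for floats: bitwise differences from reordering are allowed."""
    return math.isclose(x, y, rel_tol=1e-9)


data = [0.1 * i for i in range(1, 101)]
seeds = list(range(20))

sum_ok = is_deterministic(parallel_sum, data, seeds, close)
race_ok = is_deterministic(last_writer, data, seeds, close)
print(sum_ok, race_ok)
```

The reduction passes even though floating-point rounding differs between orders, because the predicate captures semantic rather than bitwise equality; the racy computation fails without any functional specification of what the right answer is.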
Tessellation OS: Space-Time Partitioning + 2-Level Scheduling
• 1st level: OS determines coarse-grain allocation of resources to jobs over space and time
• 2nd level: Application schedules component tasks onto available “harts” (hardware thread contexts) using Lithe
[Figure: partitions laid out over space and time; inside a partition, 2nd-level scheduling places tasks within Address Space A and Address Space B. The Tessellation Kernel (Partition Support) sits above six CPUs with L1 caches, an L1 interconnect, L2 banks, DRAM, and a DRAM & I/O interconnect.]
Prototype ROS running on RAMP Gold and Nehalem-x86
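The split between the two levels can be sketched as two small functions. This is a toy model under stated assumptions, not Tessellation or Lithe code: it covers only the space dimension (no time multiplexing), the first level grants harts in request order, and the second level uses round-robin placement; `first_level`, `second_level`, and the job names are hypothetical.

```python
def first_level(total_harts, requests):
    """OS level: grant each job a disjoint set of hart ids, in request order."""
    partitions, next_hart = {}, 0
    for job, wanted in requests.items():
        granted = min(wanted, total_harts - next_hart)
        partitions[job] = list(range(next_hart, next_hart + granted))
        next_hart += granted
    return partitions


def second_level(harts, tasks):
    """Application level: place the job's own tasks on the harts it was granted."""
    if not harts:
        raise ValueError("job was granted no harts")
    placement = {h: [] for h in harts}
    for i, task in enumerate(tasks):
        placement[harts[i % len(harts)]].append(task)
    return placement


# Three jobs ask for 10 harts in total, but only 8 exist.
partitions = first_level(8, {"audio": 2, "browser": 4, "vision": 4})
audio_plan = second_level(partitions["audio"], ["mix", "reverb", "fft"])
print(partitions)
print(audio_plan)
```

The OS never sees individual tasks and the application never sees other jobs' harts, which is the separation of concerns the two-level design aims for.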
Software must be adaptive
• Environment-adaptive
  • Number of cores, speed of cores, soft errors, multiprogramming, battery life
  • SPMD/Static load-balancing/static mapping, things of the past
• Input-adaptive
  • Degree of parallelism depends on input parameters
• Adaptation managed by:
  • OS resource manager (“how many resources best for this app?”)
  • Real-time app (“how many resources to meet deadline?”)
  • Quality-adjustable app (“what’s possible with these resources?”)
• Need hardware measurement facilities
• Future: Need to work on efficient adaptively parallel algorithms.
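Two of the adaptation questions above can be made concrete with a simple performance model. The sketch below assumes Amdahl's law as that model, which is an assumption of this example and not something the slides prescribe; the function names, the timings in milliseconds, and the quality levels are hypothetical.

```python
def predicted_time(serial_time, parallel_fraction, cores):
    """Amdahl's-law estimate of run time on the given number of cores."""
    return serial_time * ((1 - parallel_fraction) + parallel_fraction / cores)


def cores_for_deadline(serial_time, parallel_fraction, deadline, max_cores):
    """Real-time app: smallest core count predicted to meet the deadline, or None."""
    for cores in range(1, max_cores + 1):
        if predicted_time(serial_time, parallel_fraction, cores) <= deadline:
            return cores
    return None


def best_quality(levels, cores, parallel_fraction, deadline):
    """Quality-adjustable app: most expensive level that still fits on these cores."""
    feasible = [
        (cost, name)
        for name, cost in levels.items()
        if predicted_time(cost, parallel_fraction, cores) <= deadline
    ]
    return max(feasible)[1] if feasible else None


# 40 ms of serial work, 90% parallelizable, 12 ms deadline.
needed = cores_for_deadline(40.0, 0.9, 12.0, max_cores=64)
# A 3 ms deadline is below the serial portion (4 ms), so no core count helps.
impossible = cores_for_deadline(40.0, 0.9, 3.0, max_cores=64)
# Serial cost in ms of each hypothetical quality level.
quality = best_quality({"low": 10.0, "medium": 40.0, "high": 120.0}, 8, 0.9, 12.0)
print(needed, impossible, quality)
```

In practice the model's inputs would come from measurement rather than constants, which is why the slide ties adaptation to hardware measurement facilities.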
Par Lab “Multi-Paradigm” Architecture
• Single “Fat” ILP-focused Tile Control Processor
• Multiple “Thin” Lane Control Processors embedded in vector-thread lane
• Tile Control Processor, Lane Control Processor, and Vector-Thread microthreads all run the same ISA, but microarchs optimized for different forms of parallelism
• Par Lab Architecture research using a “superset” architecture that can model various forms of scalar + vector machine, and various forms of memory hierarchy and interconnect.
[Figure: a Tile contains a Core with a Fat Tile Control Processor (ILP), L1I$ and L1D$, several Vector-Thread Lanes each with a Thin Scalar Control Proc., and a Tile-Private L2U$; a Shareable L3$/LL$ sits outside the tile]
Hardware Measurement: This we believe
• Parallel HW/SW must support performance-portable parallel software
• If you expect programmers to continue “Moore’s Law” by doubling amount of portable parallelism in programs every 2 years*, need hardware measurement for them to see how well they are doing
  • During development inside an IDE
  • During runtime so that app, resource scheduler, and OS can see and adapt
*Shekhar Borkar, Intel, CNET News, 2009
RAMP Gold
• Rapid accurate simulation of manycore architectural ideas using FPGAs
• Initial version models 64 cores of SPARC v8 with shared memory system on $750 board
• Hardware FPU, MMU, boots OS.
System | Cost | Performance (MIPS) | Simulations per day
Software Simulator | $2,000 | 0.1 - 1 | 1
RAMP Gold | $2,000 + $750 | 50 - 100 | 100
• Highlight: RAMP Gold in production use (ISCA/HotPar/DAC papers based on RAMP Gold results).
Co-located Collaborative Center Approach
• Continual collaboration sometimes hard work, but very beneficial. Many ideas emerge from, or are honed by, cross-subproject discussions.
  • E.g. Eigensolvers for Damascene application
  • E.g. Audio framework built on Lithe
  • E.g. Sparse-matrix language for library development
• Integration with other components/layers forces real implementations not just paper prototypes
  • Whole stack demo on RAMP Gold
• Can’t see how to have made this much progress otherwise
Par Lab Stats
• 16 faculty, 62 PhD students, 4 postdocs
• 125+ papers
  • Par Lab cover story, CACM, Oct. ‘09
  • “The Trouble with Multicore,” IEEE Spectrum, July ‘10
• Workshops / Summer Courses @ UCB
  • UCB “Bootcamp”: 196 attend 2008, 397 in 2009
    • Next is August 16-18, 2010; see Par Lab Web Site
• 2 Founding Companies: Intel and Microsoft
• 6 Affiliates: National Instruments, NEC, Nokia, NVIDIA, Oracle, and Samsung
Par Lab Summary
• Multicore challenge hardest for CS in 50 years: Performance “Moore’s Law” up to programmers!
  • Unveil 2X parallelism/program every 2 years
• Software platform: mobile client + cloud
• Identify common programming patterns to reveal parallelism and use good software architecture
• Target productivity versus efficiency programmers
• Autotuning & Selective Embedded JIT Specialization
• Active Testing
• OS/Architecture support isolation, measurement
• FPGA simulation of new parallel architectures: RAMP
• No preconceived silver bullet; let apps decide
  • Already many apps with significant speedup
Backup Slides & References
• Armbrust, M., A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, “A View of Cloud Computing,” Communications of the ACM, 53:3, March 2010.
• Asanović, K., R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, and K. Yelick, “A View of the Parallel Computing Landscape,” Communications of the ACM, 52:10, October 2009.
• Burnim, J., and K. Sen, “Asserting and Checking Determinism for Multithreaded Programs,” Communications of the ACM, 53:6, June 2010.
• Catanzaro, B., A. Fox, K. Keutzer, D. Patterson, B-Y. Su, M. Snir, K. Olukotun, P. Hanrahan, and H. Chafi, “Ubiquitous Parallel Computing from Berkeley, Illinois and Stanford,” IEEE Micro, March/April 2010.
• Catanzaro, B., S. Kamil, Y. Lee, K. Asanović, J. Demmel, K. Keutzer, J. Shalf, K. Yelick, and A. Fox, “SEJITS: Getting Productivity and Performance with Selective Embedded JIT Specialization,” 1st Workshop on Programming Models for Emerging Architectures at the 18th Int’l Conf. on Parallel Architectures and Compilation Techniques, Raleigh, North Carolina, November 2009.
• Colmenares, J. A., S. Bird, H. Cook, P. Pearce, D. Zhu, J. Shalf, S. Hofmeyr, K. Asanović, and J. Kubiatowicz, “Resource Management in the Tessellation Manycore OS,” HotPar’10, Berkeley, CA, USA, June 2010.
• Tan, Z., A. Waterman, S. Bird, H. Cook, K. Asanović, and D. Patterson, “A Case for FAME: FPGA Architecture Model Execution,” Proc. ISCA, June 2010.
Par Lab Apps
• What are the compelling future workloads?
  • Need apps of future vs. legacy to drive agenda
  • Improve research even if not the real killer apps
• Computer Vision: Segment-Based Object Recognition, Poselet-Based Human Detection
• Health: MRI Reconstruction, Stroke Simulation
• Music: 3D Enhancer, Hearing Aid, Novel UI
• Speech: Automatic Meeting Diary
• Video Games: Analysis of Smoke 2.0 Demo
• Computational Finance: Value-at-Risk Estimation, Crank-Nicolson Option Pricing
• Parallel Browser: Layout, Scripting Language
Developing Parallel Software
• Conventional: Measure and recode slow pieces with more threads
• Par Lab: Find right SW architecture that reveals parallelism
[Figure: development flowchart with steps Application Specification; SW Arch. to Identify Parallelism; Thought Experiment: Map SW Arch. to HW Arch.; Write / Debug Code; Performance profile; and branches “Not fast enough” and “Fast enough” (Ship it)]
Correctness, Testing, and Debugging
• Active testing: combines static and dynamic analyses so that rare concurrency bugs are discovered quickly and precisely
  • Burnim, Sen “Asserting and Checking Determinism for Multithreaded Programs”
    • ACM SIGSOFT Distinguished Paper Award + CACM Research Highlight Invitation
  • Naik, Park, Sen, Gay “Effective Static Deadlock Detection”
    • ACM SIGSOFT Distinguished Paper Award
• Actively control scheduler to force potentially buggy schedules: Data races, Atomicity Violations, Deadlocks
• Found parallel bugs in real production OSS code: Apache Commons Collections, Java Collections Framework, Java Swing GUI framework, and Java Database Connectivity (JDBC)
SHOT Functional Requirements
• Standardized Hardware Operation Tracker: SHOT
• Low latency reads so deployed in production code
• Can be read by OS and by user apps
• To be used by virtual machines, must be able to save and restore as part of context switch
• Since some counters are per core, SW must read all counters as if on same clock edge
• Don’t need to be perfect counts, just consistent: accuracy ± 1% OK
Minimum SHOT Architecture
1. Global real time clock (vs. count clock cycles)
  • Since clock rate varies due to Dynamic Voltage and Frequency Scaling (DVFS)
  • ~ 100 MHz (fast enough for apps)
2. Count number of instructions retired per core
  • Measure computation throughput
3. Count off-chip memory traffic (incl. prefetching)
  • Key to performance and energy
4. Standard so apps and OS can rely on them
  • Standardized Hardware Measurement as important as IEEE Floating Point Standard?
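The three minimum counters are enough to derive throughput and memory bandwidth from two snapshots. The sketch below is illustrative only: SHOT is a proposal, so the `ShotSample` layout, the choice of bytes as the traffic unit, and the `rates` helper are assumptions of this example; only the ~100 MHz real-time clock comes from the slide.

```python
from dataclasses import dataclass

CLOCK_HZ = 100_000_000  # ~100 MHz global real-time clock, per the slide


@dataclass(frozen=True)
class ShotSample:
    """Hypothetical snapshot of the minimum counter set, read as if on one clock edge."""
    clock_ticks: int      # global real-time clock ticks
    instructions: tuple   # instructions retired, one entry per core
    offchip_bytes: int    # off-chip memory traffic, including prefetches (unit assumed)


def rates(before, after):
    """Derive elapsed seconds, per-core MIPS, and off-chip bandwidth in MB/s."""
    seconds = (after.clock_ticks - before.clock_ticks) / CLOCK_HZ
    per_core_mips = tuple(
        (a - b) / seconds / 1e6
        for a, b in zip(after.instructions, before.instructions)
    )
    bandwidth_mb_s = (after.offchip_bytes - before.offchip_bytes) / seconds / 1e6
    return seconds, per_core_mips, bandwidth_mb_s


start = ShotSample(0, (0, 0), 0)
end = ShotSample(50_000_000, (400_000_000, 100_000_000), 1_000_000_000)
seconds, mips, bandwidth = rates(start, end)
print(seconds, mips, bandwidth)
```

Using the real-time clock rather than cycle counts keeps these rates meaningful when DVFS changes the core frequency between the two snapshots.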
Make productivity programmers efficient, and efficiency programmers productive?
• Autotuning problem: The search space is large, so exploring it takes a lot of cycles and a long time
  • Search Full Parameter Space: More than 180 Days
• Using machine learning + few performance counters to democratize autotuning
  • 12 minutes to find solution
• As good as or even better than the expert-designed autotuner!
  • -1% and 16% for a 7-pt Stencil
  • -2% and 15% for a 27-pt Stencil
  • 18% and 50% for dense matrix
• Enables even greater range of optimizations than we imagined
Why might we succeed this time?
• No Killer Microprocessor to Save Programmers
  • No one is building a faster serial microprocessor
  • For programs to go faster, SW must use parallel HW
• New Metrics for Success vs. Linear Speedup
  • Real Time Latency/Responsiveness and/or MIPS/Joule
  • Just need some new killer parallel apps vs. all legacy SW must achieve linear speedup
• Necessity: All the Wood Behind One Arrow
  • Whole industry committed, so more working on it
  • If future growth of IT depends on faster processing at same price (vs. lowering costs like NetBook)
Why might we succeed this time?
• Multicore Synergy with Cloud Computing
  • Cloud Computing apps parallel even if client not parallel
  • Manycore is cost-reduction, not radical SW disruption
• Vitality of Open Source Software
  • OSS community more quickly embraces advances?
• Single-Chip Multiprocessors Enable Innovation
  • Enables inventions that were impractical or uneconomical when multiprocessors were 100s of chips
• FPGA prototypes shorten HW/SW cycle
  • Fast enough to run whole SW stack, can change every day vs. every 4 to 5 years when doing chips
Gesture-Enhanced User Interface
• Using human gestures, motion, body pose to control a video game interface
  • Natal-like interface using built-in laptop webcam, not add-on stereo-infrared
• We are collaborating with Jitendra Malik, world-leader in Computer Vision, on "Poselets"-based human detection and pose estimation
  • Algorithmic improvements, GPU implementation: 30x speedup
  • Detection running near real-time using a $5 webcam
• Enrich the Windows user experience
  • Using tracked body pose, allow user to interact with a 3D world: richer than touch-screen provides
Recent Results: Vision Acceleration
• Bryan Catanzaro: Parallelizing Computer Vision (image segmentation)
• Problem: Malik’s highest quality algorithm was 5.5 minutes / image on new PC
• Good SW architecture + talk within Par Lab on how to use new algorithms, data structures
  • Bor-Yiing Su, Yunsup Lee, Narayanan Sundaram, Mark Murphy, Kurt Keutzer, Jim Demmel, Sam Williams
• Current result: 1.8 seconds / image on manycore
  • ~ 150X speedup
  • Factor of 10 quantitative change is a qualitative change
• Malik: “This will revolutionize computer vision.”
RAMP Blue, July 2007
• 1008 modified MicroBlaze cores
• 90 MHz
• RTL directly mapped to FPGA
• Runs UPC version of NAS parallel benchmarks.
• Message-passing cluster
• No MMU
• Requires lots of hardware
  • 21 BEE2 boards / 84 FPGAs
• Difficult to modify
• High clock rate but low IPC, particularly if want to model timing
Alexander’s Pattern Language
• Christopher Alexander’s approach to (civil) architecture:
  • A pattern is a generalizable solution to a recurring problem.
  • “Each pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over, without ever doing it the same way twice.” Page x, A Pattern Language, Christopher Alexander
• Alexander’s 253 (civil) architectural patterns range from the creation of cities (2. distribution of towns) to particular building problems (232. roof cap)
• A pattern language is an organized way of tackling an architectural problem using patterns
• Main limitation: It’s about civil not software architecture!!!
SEJITS Prototypes
• Copperhead (Bryan Catanzaro, Michael Garland)
  • Pattern/target: data parallel programming/GPU
  • PSC finds opportunities to unroll loops, fuse operations, avoid data structure conversions, ....
• Asp* + PySKI (Shoaib Kamil, Erin Carson et al.)
  • Patterns: captured by OO classes & higher-order methods; structured grid, lambda application (map), sparse matrix, others to follow
  • Target: x86 multicore with OpenMP, or similar
• LL (Gilad Arnold, Ras Bodík et al.)
  • Domain-specific language + compiler
  • Patterns: sparse matrix linear algebra
* “Asp is SEJITS for Python”
One Decade of SAME (Software Architecture Model Execution)
Conference | Median Instructions Simulated / Benchmark | Median #Cores | Median Instructions Simulated / Core
ISCA 1998 | 267M | 1 | 267M
ISCA 2008 | 825M | 16 | 100M
[“A Case for FAME”, ISCA 2010]
Dimensions in FAME (FPGA Architecture Model Execution)
• Direct: One target cycle executed in one FPGA host cycle
• Decoupled: One target cycle takes one or more FPGA cycles
• Full RTL: Complete RTL of target machine modeled
• Abstract RTL: Partial/simplified RTL, split functional/timing
• Host single-threaded: One target model per host pipeline
• Host multi-threaded: Multiple target models per host pipeline
Host Multithreading
• Multithreading emulation engine reduces FPGA resource use and improves emulator throughput
• Hides emulation latencies (e.g., communicating across FPGAs)
[Figure: a target model of four CPUs (CPU1 to CPU4) mapped onto a Multithreaded Emulation Engine (on FPGA): a single hardware pipeline (PC, I$, IR, GPR, ALU, D$) with multiple copies of CPU state (one PC and one GPR set per target CPU)]
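Host multithreading can be pictured as one shared execution function cycling through several copies of CPU state. The sketch below is a software toy, not RAMP Gold: the two-instruction accumulator ISA, the `TargetCPU` class, and the round-robin policy are hypothetical, and real hardware would interleave to hide latency rather than simply to share code.

```python
class TargetCPU:
    """Per-target architectural state: the only thing replicated per modeled CPU."""

    def __init__(self, program):
        self.pc = 0
        self.acc = 0
        self.program = program


def step(cpu):
    """Shared 'pipeline': execute one instruction for whichever CPU state is passed in."""
    if cpu.pc >= len(cpu.program):
        return False  # this target has finished
    op, arg = cpu.program[cpu.pc]
    if op == "add":
        cpu.acc += arg
    elif op == "mul":
        cpu.acc *= arg
    cpu.pc += 1
    return True


def run_multithreaded(cpus):
    """Interleave targets round-robin through the single pipeline, one instruction per host cycle."""
    host_cycles = 0
    active = True
    while active:
        active = False
        for cpu in cpus:
            if step(cpu):
                host_cycles += 1
                active = True
    return host_cycles


# Four target CPUs, each computing (n * 2) + 1 for n = 1..4.
cpus = [TargetCPU([("add", n), ("mul", 2), ("add", 1)]) for n in range(1, 5)]
cycles = run_multithreaded(cpus)
results = [cpu.acc for cpu in cpus]
print(cycles, results)
```

Only the state objects grow with the number of modeled CPUs; the execution logic exists once, which mirrors how the emulation engine saves FPGA resources.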
Composable Resource Management (Lithe + OS)
[Figure: App 1 and App 2, built from Module 1, Module 2, and Module 3, over layers labeled Lithe User-Level Scheduling, ABI, Tessellation OS, and Hardware Cores]
Resource Management for Real-Time
[Figure: App 1 with Module 1, Module 2, and Real-Time Module 3, over layers labeled Lithe User-Level Scheduling, ABI, Tessellation OS, and Hardware Cores]