Xipeng Shen
Tackling Memory and Concurrency Barriers for Modern Parallel Computing
Computer Science, NCSU
Unprecedented Scale
(sources: SciDAC, IBM)
Y2011: 20 petaflops
Y2020: 1000 petaflops
Exascale Roadmap
Key Factors
(sources: SciDAC, IBM)
• Concurrency
• Memory Performance
Memory Gets More Complex
Tesla K20 (NVIDIA): more than 8 types of memory
Non-volatile memory
3D stacked memory
Hardware Trends
• Rapidly growing node concurrency
• Increasingly complex memory systems
Pivotal Role of Program Optimizers
• Expose & exploit parallelism
• Capitalize on memory systems' potential
Implications to COSMIC
Xipeng Shen, [email protected]
Two Recent Advancements
• 1. Data Placement Optimizations (MICRO’14)
- Memory Performance of GPU
• 2. Principled Speculative Parallelization (ASPLOS’14, ASPLOS’15)
- Execution Concurrency of FSM
a SIMD group (warp)
Graphic Processing Unit (GPU)
• Massive parallelism
• Favorable:
  • computing power
  • cost effectiveness
  • energy efficiency
Complex Memory
Global memory
Texture memory
Shared memory
Constant memory
L1/L2 cache
Read-only cache
Texture cache
Complex Memory
Global memory: coalescing; cache hierarchy
Texture memory: 2D/3D locality; texture cache; read-only (except surface write)
Shared memory: on-chip; bank conflicts
Constant memory: broadcasting; cached; read-only
L1/L2 cache: private/shared
Read-only cache: read-only data
Texture cache: 2D/3D locality; read-only
Data Placement Problem
Global memory
Texture memory
Shared memory
Constant memory
(L1/L2 cache)
(Read-only cache)
(Texture cache)
Which memory should each piece of data (A, B, C, D, …) in a program be placed in? The choice can make a 3X performance difference.
Data Placement Problem
Properties of the placement problem:
• Machine dependent: changes across models/generations
• Input dependent: changes across runs
Options: manual efforts by programmers? Offline autotuning?
PORPLE: Portable Data Placement Engine
Components:
• MSL (memory specification language): takes a mem spec from the architect, aided by microkernels
• PORPLE-C (compiler): takes the original program
• PLACER (placing engine): takes the program input and produces the desired placement, for efficient execution
MSL (Memory Specification Language) [OFFLINE]
• The architect/user, aided by microkernels, writes a mem spec.
• Exposes memory properties formally for automatic analysis.
• Separates out hardware complexities.
Keywords: address1, address2, index1, index2, banks, blockSize, warp, block, grid, sm, core, tpc, die, clk, ns, ms, sec, na, om, ?
  // na: not applicable; om: omitted; ?: unknown; om and ? can be used in all fields
Operators: C-like arithmetic and relational operators, and a scope operator {}
Syntax:
  specList ::= processorSpec memSpec*
  processorSpec ::= die=Integer tpc; tpc=Integer sm; sm=Integer core; end-of-line
  memSpec ::= name id swmng rw dim size blockSize banks latency upperLevels lowerLevels shareScope concurrencyFactor serialCondition; end-of-line
  name ::= String
  id ::= Integer
  swmng ::= Y | N                  // software manageable or not
  rw ::= R | W | RW                // allowed read/write accesses
  dim ::= na | Integer             // special for arrays of a particular dimensionality
  sz ::= Integer[K|M|G|T|ε][E|B]   // E for data elements
  size ::= sz | <sz sz> | <sz sz sz>
  blockSize ::= sz | <sz sz> | <sz sz sz>
  lat ::= Integer[clk|ns|ms|sec]   // clk for clocks
  latency ::= lat | <lat lat>
  upperLevels ::= <[id | name]*>
  lowerLevels ::= <id*>
  shareScope ::= core | sm | tpc | die
  concurrencyFactor ::= <Number Number>
  serialCondition ::= scope{RelationalExpr}
  scope ::= warp | block | grid

A spec captures latency, hierarchy, size, sharing, constraints, serialization condition, …
Serialization Condition
• Key to handling the peculiar properties of some memories
• Insight: the peculiarities are all about when accesses get serialized.
Examples of serialization conditions:
  constant mem (broadcasting): warp{address1 != address2}
  shared mem (bank conflict): block{word1 != word2 && word1 % banks == word2 % banks}
  global mem (coalescing): warp{⌊address1/blockSize⌋ != ⌊address2/blockSize⌋}
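To make the serialization conditions concrete, here is a small Python sketch (illustrative only, not PORPLE's code) that evaluates the three conditions for one warp's or block's accesses; the function names, the 32-bank count, and the 128-byte block size are assumed defaults.

```python
# Illustrative evaluators for MSL-style serialization conditions.

def constant_serialized(addresses):
    # constant memory broadcasts: a warp serializes when its threads
    # touch more than one distinct address
    return len(set(addresses)) > 1

def shared_bank_conflict(words, banks=32):
    # shared memory: two distinct words mapping to the same bank conflict
    seen = {}
    for w in set(words):
        b = w % banks
        if b in seen and seen[b] != w:
            return True
        seen[b] = w
    return False

def global_uncoalesced(addresses, block_size=128):
    # global memory: accesses spanning multiple memory blocks
    # cannot be coalesced into one transaction
    return len({a // block_size for a in addresses}) > 1
```

For example, 32 threads reading the same constant-memory address broadcast in one step, while two distinct addresses serialize.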
Example of Mem Spec (Tesla M2075)
One-time effort by experts. Encapsulates hw complexity (extensibility; portability).
Mem spec of Tesla M2075:
  die = 1 tpc; tpc = 16 sm; sm = 32 core;
  globalMem 8 Y rw na 5375M 128B ? 600clk <L2 L1> <> die <0.1 0.5> warp{⌊address1/blockSize⌋ != ⌊address2/blockSize⌋};
  L1 9 N rw na 16K 128B ? 80clk <> <L2 globalMem> sm ? warp{⌊address1/blockSize⌋ != ⌊address2/blockSize⌋};
  L2 7 N rw na 768K 32B ? 390clk om om die ? warp{⌊address1/blockSize⌋ != ⌊address2/blockSize⌋};
  constantMem 1 Y r na 64K ? ? 360clk <cL2 cL1> <> die ? warp{address1 != address2};
  cL1 3 N r na 4K 64B ? 48clk <> <cL2 constantMem> sm ? warp{⌊address1/blockSize⌋ != ⌊address2/blockSize⌋};
  cL2 2 N r na 32K 256B ? 140clk <cL1> <cL2 constantMem> die ? warp{⌊address1/blockSize⌋ != ⌊address2/blockSize⌋};
  sharedMem 4 Y rw na 48K ? 32 48clk <> <> sm ? block{word1 != word2 && word1%banks == word2%banks};
  … …
PORPLE: Portable Data Placement Engine
PORPLE-C Compiler (source-to-source, OFFLINE)
Takes the original program; produces its access patterns and a staged program.
• Finds data access patterns, complemented by an online profiler
• Makes code placement-agnostic: the code works regardless of data placement
Enable Placement-Agnostic
(a) from global mem.
  // host code
  float *A;
  cudaMalloc(&A, ...);
  cudaMemcpy(A, hA, ...);
  // device code
  x = A[tid];

(b) through read-only cache
  // host code: same as (a)
  // device code
  x = __ldg(&A[tid]);

(c) from constant mem.
  // global declaration
  __constant__ float A[sz];
  // host code
  cudaMemcpyToSymbol(A, hA, ...);
  // device code
  x = A[tid];

(d) from texture mem.
  // host code
  float *A;
  cudaMalloc(&A, ...);
  cudaMemcpy(A, hA, ...);
  texture<float, ...> Atex;
  cudaBindTexture(0, Atex, A);
  // device code
  x = tex1Dfetch(Atex, tid);

(e) from shared mem.
  // host code: same as (a)
  // device code
  __shared__ float s[sz];
  s[localTid] = A[tid];
  __syncthreads();
  x = s[localTid];
Enable Placement-Agnostic
Coarse-grained & fine-grained versioning: code size vs. time overhead

Coarse (host code):
  if (memSpace[A0_id] == TXR && memSpace[A1_id] == SHR)
      kernel_1(...);       // version 1
  else
      kernel_others(...);  // version 2

Fine (code in kernel_others):
  if (memSpace[A1_id] == SHR)
      sA1[...] = _tempA0;
  else
      A1[j] = _tempA0;
PORPLE: Portable Data Placement Engine
PLACER (placing engine) [ONLINE]
Inputs: mem spec, access patterns, staged program, online profile, and the program input.
Output: the desired placement, for efficient execution.
Lightweight On-Line Profiling
• Array sizes
• Lightweight profiling on CPU for irregular programs
  • Profiles only the work of the first thread block
  • At most 10 iterations for a loop
Placer
• Searches for the best placement
  • A placement: how all data are placed in memory
• Key component: performance model M
  Mem perf. = M(a placement, access patterns, mem spec)
Lightweight Performance Model
• Reuse distance for cache miss rate estimation
  • Considers interference on shared caches
• Maximal transfer time among data paths:

  T = max over all data paths p of [ Σ over arrays a through path p ( number of accesses to a × latency of memory j ) / concurrency factor ]
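The model above (maximal transfer time among data paths) can be sketched as a tiny brute-force placer. This is an illustrative assumption-laden sketch, not PORPLE's actual spec or search: the memory table, latencies, concurrency factors, and function names are all made up for the example.

```python
from itertools import product

# assumed memory spec: latency (cycles) and concurrency factor per memory
MEMORIES = {
    "global":   {"latency": 600, "concurrency": 32},
    "shared":   {"latency": 48,  "concurrency": 32},
    "constant": {"latency": 360, "concurrency": 1},
}

def model_time(placement, accesses):
    """placement: {array: memory}; accesses: {array: access count}.
    Returns the transfer time of the bottleneck data path."""
    per_path = {}
    for arr, mem in placement.items():
        spec = MEMORIES[mem]
        cost = accesses[arr] * spec["latency"] / spec["concurrency"]
        per_path[mem] = per_path.get(mem, 0.0) + cost
    return max(per_path.values())

def best_placement(arrays, accesses):
    """Exhaustively score every placement and keep the fastest one."""
    best, best_t = None, float("inf")
    for combo in product(MEMORIES, repeat=len(arrays)):
        placement = dict(zip(arrays, combo))
        t = model_time(placement, accesses)
        if t < best_t:
            best, best_t = placement, t
    return best, best_t
```

With a heavily accessed array A and a lightly accessed array B, the model sends A to the low-latency shared memory and routes B through a different path to keep the bottleneck path short.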
PORPLE in a Whole
Offline: the architect/user, aided by microkernels, writes the mem spec in MSL; PORPLE-C compiles the original program into its access patterns and a staged program.
Online: PLACER combines the mem spec, access patterns, staged program, online profile, and the input to find the desired placement, yielding efficient execution.
More details in MICRO'2014.
Properties of PORPLE
• Easy expansion to new memory: just a new MSL spec
• Good portability to new memory: programs run with a new suitable placement automatically
• Adaptivity to new program inputs: on-the-fly placement with placement-agnostic code
• Generality to regular & irregular programs: static analysis + lightweight online profiling
Evaluation
• 3 generations of NVIDIA GPU: K20c, M2075, C1060
  • Different mem size, cache, latency, etc.
• 14 benchmarks
  • From SDK, SHOC, Rodinia
  • 9 regular, 5 irregular

Regular benchmarks on K20c (baseline: rule-based method from Jang+: TPDS)
Irregular benchmarks on K20c
Portability
• Cross architectures: K20c, M2075, C1060
• Cross inputs: spmv, particlefilter
Runtime Overhead Breakdown
Two Recent Advancements
• 1. Data Placement Optimizations (MICRO'14)
  - Memory Performance of GPU
• 2. Principled Speculative Parallelization (ASPLOS'14, ASPLOS'15)
  - Execution Concurrency of FSM
Finite State Machine (FSM)
Example: check whether a string contains "aabaaabb".
Widely used:
• Web browsers (HTML lexing/parsing)
• Decompression (Huffman decoding)
• Pattern matching (grep)
• Security (intrusion detection)
• Software engineering (model checking)
• Hardware design (VHDL state machines)
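The running example can be written as a small DFA. This Python sketch (illustrative, not from the talk) builds the DFA for a pattern via the KMP failure function and scans the input one symbol at a time:

```python
def build_dfa(pattern, alphabet="ab"):
    """Build a substring-matching DFA: delta[state, char] -> state."""
    states = len(pattern)
    # fail[i]: longest proper prefix of pattern[:i] that is also a suffix
    fail = [0] * (states + 1)
    k = 0
    for i in range(1, states):
        while k and pattern[i] != pattern[k]:
            k = fail[k]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i + 1] = k
    delta = {}
    for s in range(states + 1):
        for c in alphabet:
            if s == states:
                delta[s, c] = states          # accepting sink
            elif c == pattern[s]:
                delta[s, c] = s + 1           # advance on a match
            else:
                k = fail[s]                   # fall back on a mismatch
                while k and c != pattern[k]:
                    k = fail[k]
                delta[s, c] = k + 1 if c == pattern[k] else 0
    return delta, states

def contains(text, pattern="aabaaabb"):
    delta, accept = build_dfa(pattern)
    s = 0
    for c in text:
        s = delta[s, c]
        if s == accept:
            return True
    return False
```

Processing is one table lookup per input symbol, which is exactly what makes the computation look inherently sequential.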
"Embarrassingly Sequential"
input string: a a b a b a b b a a a b a b a a b b a b a a a
thread 1 | thread 2: starting from which state?
—The Landscape of Parallel Computing Research: A View from Berkeley
Speculative Parallelization
input string: a a b a b a b b a a a b a b a a b b a b a a a
thread 1 | thread 2: starting from which state?
• Basic idea
  • Pick a guess to start processing
  • Reprocess upon a wrong guess
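A minimal Python sketch of this basic idea (illustrative; a real implementation runs chunks on separate threads, and `delta` is any FSM transition table):

```python
def run(delta, state, chunk):
    # process one chunk of input from a given start state
    for c in chunk:
        state = delta[state, c]
    return state

def speculative_run(delta, start, text, nchunks, guess):
    """Chunk the input; chunk 0 starts from the true start state, later
    chunks start from guessed states (guess(i) -> state). After the
    "parallel" phase, wrong guesses force reprocessing of their chunk."""
    n = len(text)
    bounds = [(i * n // nchunks, (i + 1) * n // nchunks) for i in range(nchunks)]
    chunks = [text[b:e] for b, e in bounds]
    starts = [start] + [guess(i) for i in range(1, nchunks)]
    ends = [run(delta, s, ch) for s, ch in zip(starts, chunks)]
    # validation: if a guess disagrees with the previous chunk's actual
    # end state, redo that chunk (this is the cost of misspeculation)
    reprocessed = 0
    for i in range(1, nchunks):
        if starts[i] != ends[i - 1]:
            ends[i] = run(delta, ends[i - 1], chunks[i])
            reprocessed += 1
    return ends[-1], reprocessed
```

With a good guesser no chunk is redone and the chunks' work overlaps; with a bad one every chunk is reprocessed and the speedup vanishes, which is why the quality of the guess is the central question.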
Key Open Question
input string: a a b a b a b b a a a b a b a a b b a b a a a
thread 1 | thread 2: starting from which state?
• Previous approaches
  • Exhaustive (e.g., NFA): cost & scalability
  • Selective (e.g., Jones+'09, Prabhu+'10)
  • Tricks: lookback; partial commit
How to make the guess to maximize performance?
Speedup (8 cores) of the state of the art (Prabhu+: PLDI'10) on huff, lexing, str1, str2, pval, xval, div.
Key Open Question
How to make the guess to maximize performance?
• How far should lookback go?
• Which state should be used to start lookback?
• Would starting from multiple states help? If so, which state should be chosen after lookback?
• How to combine lookback and partial commit effectively?
• How to characterize FSMs and their inputs so that the speculation can be FSM- and input-aware?
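The lookback questions above can be made concrete with a tiny sketch (hypothetical names; `delta` is any FSM transition table): start from a guessed state a few symbols before the chunk boundary, and if the FSM converges within that window, the state reached at the boundary no longer depends on the guess.

```python
def lookback_start_state(delta, text, boundary, guess_state, lookback):
    """Replay the last `lookback` symbols before `boundary`, starting
    from `guess_state`, and return the state reached at the boundary."""
    start = max(0, boundary - lookback)
    s = guess_state
    for c in text[start:boundary]:
        s = delta[s, c]
    return s
```

For an FSM whose state depends only on recent input (e.g., "was the last symbol an 'a'?"), a one-symbol lookback already makes every guess converge to the true state; how long that window must be, and which states to try, is exactly what the profiling-based design has to decide per FSM and per input.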
Principled Speculation: a rigorous way to design speculative parallelization.
FSM properties (obtained through offline profiling) + speculative parallelization scheme → benefits & best design.
Performance
Input Sensitivity
Enabling Online Deployment (ASPLOS'2015)
Overhead & Performance
Summary
Design of speculative parallelization of FSM: rigor takes it from ad hoc to systematic.
To be explored…
Memory: Portable Data Placement Opt.
• Interplay w/ data layout
• Beyond GPU
• Emerging memory technology
  • 3D stacked mem
  • non-volatile mem
  • …
• Energy efficiency
Concurrency: Principled Speculation
• Beyond FSM
  • Dyn. programming
  • Other sequential prog.
• Unified framework
• …
Final Takeaways
• Two key factors: Memory & Concurrency
• Software solution is critical & promising
• PORPLE: portability, extensibility, input adaptivity
• Principled speculation: rigor is powerful