Performance Evaluation of Two Emerging Media Processors: VIRAM and Imagine
Leonid Oliker
Future Technologies Group, Computational Research Division, LBNL
www.nersc.gov/~oliker
Sourav Chatterji, Jason Duell, Manikandan Narayanan
Motivation
• Commodity cache-based SMP clusters perform at a small % of peak for memory-intensive problems (esp. irregular problems)
• The “gap” between processor performance and DRAM access times continues to grow (60%/yr vs. 7%/yr)
• Power and packaging are becoming significant bottlenecks
• Better software is improving some problems: ATLAS, FFTW, Sparsity, PHiPAC
• Alternative architectures allow tighter integration of processor & memory
• Can we build HPC systems w/ high-end media processor technology?
  - VIRAM: PIM technology combines embedded DRAM with a vector coprocessor to exploit large bandwidth potential
  - IMAGINE: stream-aware memory supports the large processing potential of SIMD-controlled VLIW clusters
Motivation
• General purpose processors badly suited for data-intensive ops
  - Large caches not useful
  - Low memory bandwidth
  - Superscalar methods of increasing ILP inefficient
  - Power consumption
• Application-specific ASICs
  - Good, but expensive/slow to design
• Solution: general purpose “memory aware” processors
  - Large number of ALUs: to exploit data parallelism
  - Huge memory bandwidth: to keep ALUs busy
  - Concurrency: overlap memory w/ computation
VIRAM Overview
• MIPS core (200 MHz)
• Main memory system
  - 8 banks w/ 13 MB of on-chip DRAM
  - Large 6.4 GBytes/s on-chip peak bandwidth
• Cache-less vector unit
  - Energy-efficient way to express fine-grained parallelism and exploit bandwidth
  - Single issue, in order
• Low power consumption: 2.0 W
• Peak vector performance
  - 1.6/3.2/6.4 Gops
  - 1.6 Gflops (single-precision)
• Fabricated by IBM: taped out 02/2003
• To hide DRAM access latency, load/store and arithmetic instructions are deeply pipelined (15 stages)
• We use a simulator with Cray’s vcc compiler
VIRAM Vector Lanes
• Parallel lane design has advantages in performance, design complexity, and scalability
• Each lane has 2 ALUs (1 for FP) and receives identical control signals
• Vector instructions specify 64-way parallelism; hardware executes 8-way
• 8 KB vector register file partitioned into 32 vector registers
• Variable data widths: 4 lanes for 64-bit, 8 lanes for 32-bit, 16 lanes for 16-bit
• When data width is cut in half, # of elements per register (and peak) doubles
• Limitations: no 64-bit FP & compiler doesn’t generate fused MADD
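The lane and register-file figures on this slide fit together arithmetically. The Python sketch below (function and constant names are illustrative, not part of any VIRAM toolchain) derives the lane count, elements per vector register, and peak integer rate for each data width. It assumes both ALUs in every lane complete one operation per cycle at 200 MHz, and that the narrowest width is 16 bits, as listed in the comparison table's 64/32/16 data widths.

```python
CLOCK_HZ = 200_000_000        # VIRAM clock from the overview slide
REGFILE_BYTES = 8 * 1024      # 8 KB vector register file
NUM_VREGS = 32                # 32 vector registers
LANES_AT_64_BIT = 4           # 4 lanes when operating on 64-bit data
ALUS_PER_LANE = 2             # 2 ALUs per lane (only 1 handles FP)


def viram_shape(width_bits):
    """Return (lanes, elements per vector register, peak Gops) for a data width."""
    lanes = LANES_AT_64_BIT * 64 // width_bits
    bits_per_register = REGFILE_BYTES * 8 // NUM_VREGS
    elems_per_register = bits_per_register // width_bits
    # Assumption: every ALU in every lane completes one op per cycle.
    peak_gops = lanes * ALUS_PER_LANE * CLOCK_HZ / 1e9
    return lanes, elems_per_register, peak_gops


for width in (64, 32, 16):
    lanes, elems, gops = viram_shape(width)
    print(f"{width:2d}-bit: {lanes:2d} lanes, {elems:3d} elems/register, {gops:.1f} Gops peak")
```

At 32 bits this gives 64 elements per register on 8 lanes, consistent with 64-way vector instructions executed 8-way in hardware. With one FP ALU per lane, the same arithmetic gives 8 x 200 MHz = 1.6 Gflops single precision.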
VIRAM Power Efficiency
• Comparable performance with lower clock rate
• Large power/performance advantage for VIRAM from PIM technology and data-parallel execution model
[Figure: MOPS/Watt (log scale, 0.1 to 1000) for Transitive, GUPS, SPMV, Hist, and Mesh on VIRAM, R10K, P-III, P4, Sparc, and EV6]
Stream Processing
• Stream: ordered set of records (homogeneous, arbitrary data type)
• Stream programming: data is streams, computation is kernels
  - Kernel loops through all stream elements (sequential order)
  - Performs a compound (multiword) operation on each stream element
  - Vectors perform a single arithmetic op on each vector element (then store in register)
• Example: stereo depth extraction
• Data and functional parallelism
• High computation rate
• Little data reuse
• Producer-consumer and spatial locality
• Ex: multimedia, signal processing, graphics
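The contrast between a stream kernel and a sequence of vector operations can be sketched in plain Python. This is only an illustration of the two programming styles, not Imagine or VIRAM code: the luminance weights are the textbook values, and the function names are hypothetical.

```python
def luminance_kernel(record):
    """Compound operation applied to one stream record, here an (r, g, b) tuple."""
    r, g, b = record
    return 0.299 * r + 0.587 * g + 0.114 * b


def run_stream(kernel, stream):
    """Stream style: visit records in order and apply the whole kernel to each."""
    return [kernel(record) for record in stream]


def run_vector(reds, greens, blues):
    """Vector style: one arithmetic op per pass, intermediates kept in 'registers'."""
    t0 = [0.299 * r for r in reds]            # vector multiply
    t1 = [0.587 * g for g in greens]          # vector multiply
    t2 = [0.114 * b for b in blues]           # vector multiply
    t3 = [a + b for a, b in zip(t0, t1)]      # vector add
    return [a + b for a, b in zip(t3, t2)]    # vector add


pixels = [(255, 0, 0), (0, 255, 0), (10, 20, 30)]
stream_result = run_stream(luminance_kernel, pixels)
vector_result = run_vector(*zip(*pixels))
print(stream_result == vector_result)
```

Both styles compute the same values; the difference is that the kernel keeps all intermediate values for a record local, while the vector version writes each intermediate result back to a full vector register.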
Imagine Overview
• “Vector VLIW” processor
• Coprocessor to off-chip host processor
• 8 arithmetic clusters controlled in SIMD w/ VLIW instructions
• Central 128 KB Stream Register File (SRF) @ 32 GB/s
  - SRF can overlap computation with memory (double buffering)
  - SRF can reuse intermediate results (producer-consumer locality)
• Stream-aware memory system with 2.7 GB/s off-chip
• 544 GB/s intercluster communication
• Host sends instructions to stream controller (SC); SC issues commands to on-chip modules
Imagine Arithmetic Clusters
• 400 MHz clock, 8 clusters w/ 6 FUs each (48 FUs total)
• Reads/writes streams to SRF
• Each cluster: 3 ADD, 2 MULT, 1 DIV/SQRT, 1 scratch, & 1 comm unit
• 32-bit architecture: subword operations support 16- and 8-bit data (no 64-bit support)
• Local registers on functional units hold 16 words each (total 1.5 KB)
• Clusters receive VLIW-style instructions broadcast from microcontroller
VIRAM and Imagine
• Imagine: order of magnitude higher peak performance
• VIRAM: twice the memory bandwidth, lower power consumption
• Notice the peak Flop/Word ratios
                VIRAM        IMAGINE Memory   IMAGINE SRF
Bwdth GB/s      6.4          2.7              32
Peak Fl 32bit   1.6 GF/s     20 GF/s          20 GF/s
Peak Fl/Wd      1            30               2.5
Speed MHz       200          400
Chip Area       15x18 mm     12x12 mm
Data widths     64/32/16     32/16/8
Transistors     130 x 10^6   21 x 10^6
Pwr Consmp      2 Watts      10 Watts
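The peak Flop/Word row follows from the bandwidth and peak-flop rows. A minimal Python check, assuming a "word" is 32 bits (4 bytes), which is what makes the table's figures come out:

```python
BYTES_PER_WORD = 4  # assumption: 32-bit words, matching the "Peak Fl 32bit" row


def flops_per_word(peak_gflops, bandwidth_gbytes_per_s):
    """Peak flops available per word delivered at peak bandwidth."""
    gwords_per_s = bandwidth_gbytes_per_s / BYTES_PER_WORD
    return peak_gflops / gwords_per_s


# (peak GF/s, bandwidth GB/s) taken from the comparison table
configs = {
    "VIRAM": (1.6, 6.4),
    "IMAGINE memory": (20.0, 2.7),
    "IMAGINE SRF": (20.0, 32.0),
}

for name, (gflops, gbytes) in configs.items():
    print(f"{name:15s} {flops_per_word(gflops, gbytes):5.1f} flops/word")
```

The off-chip Imagine ratio of roughly 30 flops per word is what forces Imagine codes to perform many operations on each word fetched from memory, while VIRAM is balanced at about one flop per word.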
SQMAT Architectural Probe
• Sqmat: scalable synthetic probe; controls computational intensity and vector length
• Imagine stream model requires large # of ops per word to amortize memory references
  - Poor use of SRF, no producer-consumer locality
  - Long streams help hide memory latency, but only 7% of algorithmic peak
• VIRAM: performs well for low ops/word (40% when L=256)
  - Vector pipeline overlaps computation/memory; on-chip DRAM (high bandwidth, low latency)
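The slides do not spell out the probe's definition, so the following Python sketch is an assumption about its shape: L independent N x N matrices, each squared a configurable number of times, so that the repeat count sets the computational intensity and L sets the vector or stream length. The operation count and the function names are illustrative.

```python
def square(m):
    """Return m * m for a square matrix stored as a list of rows."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def sqmat(matrices, repeats):
    """Square each matrix `repeats` times; len(matrices) plays the role of L."""
    results = []
    for m in matrices:
        for _ in range(repeats):
            m = square(m)
        results.append(m)
    return results


def ops_per_word(n, repeats):
    """Flops per word moved, assuming each matrix is loaded once and stored once."""
    flops = repeats * (2 * n ** 3 - n ** 2)   # n^3 multiplies + n^2 (n - 1) adds
    words = 2 * n * n                         # one load and one store per element
    return flops / words


print(sqmat([[[1, 1], [0, 1]]], repeats=2))
print(ops_per_word(3, 1), ops_per_word(3, 10))
```

Under these assumptions a single squaring of a 3x3 matrix gives only 2.5 ops per word, which is the low-intensity regime where VIRAM does well; raising the repeat count moves the probe toward the high-intensity regime that favors Imagine.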
[Figure: 3x3 Matrix Multiply, % of Peak (0% to 50%) vs. Vector/Stream Length (L = 8 to 1024), for VIRAM and IMAGINE]
SQMAT: Performance Crossover
[Figure: cycles (0 to 80000) and MFLOPS (0 to 4500) vs. Vector/Stream Length (L = 8 to 1024), for VIRAM and IMAGINE]
• Large number of ops/word: N^10 where N = 3x3
• Crossover point: L = 64 (cycles), L = 256 (MFlop)
• Imagine's power becomes apparent: almost 4x VIRAM at L = 1024
• Codes at this end of the spectrum greatly benefit from the Imagine architecture
VIRAM/Imagine Optimization
• Example optimization: RGB→YIQ conversion from EEMBC
  - Input format: R1G1B1R2G2B2R3G3B3…
  - Required format: R1R2R3… G1G2G3… B1B2B3…
• Optimization strategy: speed up the slower of computation or memory
  - Restructure computation for better kernel performance (memory is waiting for ALUs)
  - Add more computation for better memory performance (ALUs are memory starved)
• Subtle overlap effects: vector chaining, stream double buffering
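As a reference for what the kernel computes, here is a minimal Python sketch of the layout change and the color conversion. It uses the textbook NTSC coefficients in floating point; the exact (likely fixed-point) constants of the EEMBC kernel are not given in the slides, so treat the numbers as an assumption.

```python
def deinterleave(pixels):
    """Split R1 G1 B1 R2 G2 B2 ... into planar lists R..., G..., B..."""
    return pixels[0::3], pixels[1::3], pixels[2::3]


def rgb_to_yiq(reds, greens, blues):
    """Convert planar RGB to planar YIQ (assumed textbook NTSC coefficients)."""
    ys, is_, qs = [], [], []
    for r, g, b in zip(reds, greens, blues):
        ys.append(0.299 * r + 0.587 * g + 0.114 * b)
        is_.append(0.596 * r - 0.274 * g - 0.322 * b)
        qs.append(0.211 * r - 0.523 * g + 0.312 * b)
    return ys, is_, qs


interleaved = [100, 100, 100, 255, 0, 0]   # one gray pixel, one red pixel
y, i, q = rgb_to_yiq(*deinterleave(interleaved))
print(y, i, q)
```

Each output element needs three multiplies and two adds per component, so the kernel is cheap relative to the data it touches, which is why the balance between memory access and computation decides the optimization strategy.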
VIRAM RGB→YIQ Optimization
VIRAM: poor memory performance
• Strided accesses (~1/2 performance)
- RGBRGBRGB… -- strided loads → RRR…GGG…BBB…
- Only 4 address generators for 8 addresses (sufficient for 64 bit)
• Word operations on byte data (1/4th performance)
• Optimization: replace strided w/ unit-stride access, using in-register shuffle
• Increased computational overhead (packing and unpacking)
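The two access patterns can be emulated in Python to show that they produce the same planar data. This is a behavioral sketch only: the list slice stands in for a strided vector load, the index table stands in for an in-register shuffle, and the chunk size is an arbitrary illustrative value rather than a VIRAM parameter.

```python
def strided_deinterleave(mem):
    """Three stride-3 loads, one per color component."""
    return [mem[c::3] for c in range(3)]


def shuffle_deinterleave(mem, chunk=24):
    """Unit-stride loads of `chunk` values, each followed by a permutation."""
    assert chunk % 3 == 0 and len(mem) % 3 == 0
    planes = [[], [], []]
    for start in range(0, len(mem), chunk):
        reg = mem[start:start + chunk]               # unit-stride load
        n = len(reg) // 3
        # Shuffle pattern: all R positions, then all G, then all B.
        pattern = [3 * i + c for c in range(3) for i in range(n)]
        shuffled = [reg[j] for j in pattern]         # emulated in-register shuffle
        for c in range(3):
            planes[c].extend(shuffled[c * n:(c + 1) * n])
    return planes


data = list(range(30))
print(strided_deinterleave(data) == shuffle_deinterleave(data, chunk=12))
```

The shuffle version touches memory only with contiguous loads and pays for the reordering with extra work in the functional units, which is the trade described on this slide.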
VIRAM RGB→YIQ Results
Used functional units instead of memory to extract components, increasing the computational overhead
[Figure: VIRAM RGB->YIQ, Integer ops (M/sec, 1,900 to 2,500) for small, medium, and large inputs, original vs. optimized]
VIRAM         Kernel (cycles)   Memory (cycles)
Unoptimized   114               95
Optimized     108               17
Chunk size: 64
Imagine RGB→YIQ Optimization
• Imagine bottleneck is computation, due to poor ALU schedule (left): unoptimized, 15 cycles per pixel
• Software pipelining makes VLIW schedule denser (right): optimized, 8 cycles per pixel
Imagine RGB→YIQ Results
[Figure: Imagine RGB->YIQ, Integer ops (M/sec, 0 to 6,000) for small, medium, and large inputs, original vs. software pipelined]
Imagine       Kernel (cycles)   Memory (cycles)
Unoptimized   2153              1167
Optimized     1147              1165
Chunk size: 1024
Optimized kernel takes only ½ the cycles per element
Memory is now the new bottleneck
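The kernel and memory cycle counts from the two results tables can be read with a simple model: if computation and memory transfers overlap perfectly (vector chaining on VIRAM, double buffering on Imagine), the time per chunk is roughly the larger of the two. Perfect overlap is an assumption made here for illustration; the slides note that real overlap effects are subtle.

```python
def bottleneck(kernel_cycles, memory_cycles):
    """Return (limiting resource, cycles per chunk) assuming perfect overlap."""
    if kernel_cycles >= memory_cycles:
        return "kernel", kernel_cycles
    return "memory", memory_cycles


# (kernel cycles, memory cycles) per chunk, copied from the results tables
measurements = {
    ("VIRAM", "unoptimized"): (114, 95),
    ("VIRAM", "optimized"): (108, 17),
    ("Imagine", "unoptimized"): (2153, 1167),
    ("Imagine", "optimized"): (1147, 1165),
}

for (machine, variant), (kernel, memory) in measurements.items():
    limit, cycles = bottleneck(kernel, memory)
    print(f"{machine:8s} {variant:12s} limited by {limit} ({cycles} cycles/chunk)")
```

Under this model the Imagine kernel drops from about 2.1 to about 1.1 cycles per element over a 1024-element chunk, and the limit shifts from the kernel to memory, while VIRAM stays kernel-bound even though its memory time falls from 95 to 17 cycles.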
EEMBC Benchmark
[Figure: Integer ops (G/sec, 0 to 6) and Bandwidth (GB/sec, 0 to 6) for 64K Vector, RGB->YIQ, RGB->CMYK, Autocorr: pulse, and Autocorr: speech; series are VIRAM GOPS, Imagine GOPS, VIRAM GB/sec, and Imagine GB/sec]
• Vec-add: one add/elem, performance limited by memory system
• RGB→(YIQ, CMYK): VIRAM limited by processing (cannot use available bandwidth)
• Grayfilter: difficult to implement efficiently on Imagine (sliding 3x3 window)
• Autocorr: uses short streams; Imagine host latency is high
Benchmark         Width VIR/IMA   Application Area   Remarks
Vec addition      32/32 bits      Microbenchmark     c[i]=a[i]+b[i]
RGB→YIQ           32/32 bits      EEMBC Consumer     Color conversion
RGB→CMYK          16/8 bits       EEMBC Consumer     Color conversion
Gray Filter       16/32 bits      EEMBC Consumer     3x3 convolution
Autocorrelation   16/32 bits      EEMBC Telecom      Dot product
Scientific Kernels: SPMV Performance
• Algorithmic peak: VIRAM 8 ops/cycle, Imagine 32 ops/cycle
• LSHAPE: finite element matrix; LARGEDIS: pseudo-random nnz
• Imagine lacks irregular access; reorder matrix before kernel
• VIRAM better suited for this class of apps (low comp/mem)
Matrix     Rows    NNZ      Metric    VIRAM                     Imagine
                                      CRS    SegSum   Ellpck    CRS    Stream   sEllpck
LSHAPE     1008    6958     % Peak    2.8%   7.4%     31%       1.1%   0.8%     1.2%
                            Cycles    67K    24K      5.6K      40K    48K      38K
                            MFlop/s   44     118      496       136    114      149
LARGEDIS   10000   117820   % Peak    3.2%   8.4%     32%       1.5%   0.6%     6.3%
                            Cycles    802K   567K     641K      742K   1840K    754K
                            MFlop/s   91     135      511       192    77       870
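Two of the storage formats named in the table can be sketched in Python. The code assumes "CRS" is compressed row storage and "Ellpck" is the ELLPACK format (rows padded to a common width); the segmented-sum and stream variants are not shown, and all function names are illustrative.

```python
def spmv_crs(row_ptr, col_idx, vals, x):
    """y = A x with A in compressed row storage: short inner loop per row."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]      # indirect (gather) access to x
        y.append(acc)
    return y


def crs_to_ellpack(row_ptr, col_idx, vals):
    """Pad every row to the longest row's length (zero values, column 0)."""
    n_rows = len(row_ptr) - 1
    width = max((row_ptr[r + 1] - row_ptr[r] for r in range(n_rows)), default=0)
    ell_cols = [[0] * width for _ in range(n_rows)]
    ell_vals = [[0.0] * width for _ in range(n_rows)]
    for r in range(n_rows):
        for j, k in enumerate(range(row_ptr[r], row_ptr[r + 1])):
            ell_cols[r][j] = col_idx[k]
            ell_vals[r][j] = vals[k]
    return ell_cols, ell_vals


def spmv_ellpack(ell_cols, ell_vals, x):
    """y = A x with the inner loop running over all rows (a long vector op)."""
    n_rows = len(ell_vals)
    width = len(ell_vals[0]) if n_rows else 0
    y = [0.0] * n_rows
    for j in range(width):
        for r in range(n_rows):
            y[r] += ell_vals[r][j] * x[ell_cols[r][j]]
    return y


# A = [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
row_ptr, col_idx, vals = [0, 2, 3, 5], [0, 2, 1, 0, 2], [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]
print(spmv_crs(row_ptr, col_idx, vals, x))
print(spmv_ellpack(*crs_to_ellpack(row_ptr, col_idx, vals), x))
```

In CRS the vectorizable loop is only as long as a row (about 7 nonzeros per row for LSHAPE), whereas the padded layout vectorizes across all rows, which is one plausible reading of why the Ellpck column reaches a much higher fraction of peak on VIRAM. Either way the kernel performs only two flops per nonzero while gathering from x, so it sits at the low computation-to-memory end of the spectrum.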
Scientific Kernels: Complex QR Decomposition
• A = QR, with Q orthogonal & R upper triangular
• Blocked Householder variant: rich in level 3 BLAS ops
• Complex elements increase ops/word & locality (1 MUL = 6 ops)
• VIRAM uses CLAPACK port (insertion of vector directives)
• Imagine: complex indexing of matrix stream (each iteration uses a smaller matrix)
• Imagine over 10 GFlops (19x VIRAM): well suited for this architecture
• Low VIRAM performance due to strided access and compiler limitations
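The "1 MUL = 6 ops" figure comes from expanding a complex product into real arithmetic. A small Python sketch of the standard expansion (the function name is illustrative):

```python
REAL_OPS_PER_COMPLEX_MUL = 4 + 2   # 4 real multiplies + 2 real add/subtracts


def complex_mul(a_re, a_im, b_re, b_im):
    """(a_re + i*a_im) * (b_re + i*b_im) written out in real operations."""
    re = a_re * b_re - a_im * b_im   # 2 multiplies, 1 subtract
    im = a_re * b_im + a_im * b_re   # 2 multiplies, 1 add
    return re, im


print(complex_mul(1.0, 2.0, 3.0, 4.0))
print(complex(1, 2) * complex(3, 4))
```

Each operand occupies two words, so a complex multiply performs 6 real operations on 4 input words, compared with 1 operation on 2 words for a real multiply. That higher ops/word ratio is what makes the kernel a better fit for Imagine.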
Complex QR Decomposition
Matrix                             Performance    VIRAM    Imagine
MITRE RT_STRAP (192x96, complex)   % of Peak      34.1%    65.5%
                                   Total Cycles   5189K    712K
                                   MFlops/s       546      10480
Overview
• Significantly different balance of memory organization
• Relative performance depends on computational intensity
• Programming complexity is high for both approaches, although VIRAM is based on established vector technology
• For well-suited applications, the IMAGINE processor can sustain over 10 GFlop/s (simulated results)
• A large number of homogeneous computations is required to saturate IMAGINE, while VIRAM can operate on small vector sizes
• IMAGINE can take advantage of producer-consumer locality
• Both present a significant reduction in power and space
• May be used as coprocessors in future-generation architectures
Next Generation
• CODE: next generation of VIRAM
  – More functional units / faster clock speed
  – Local registers per unit instead of single register file
  – Looking more like Imagine…
• Multi VIRAM architecture – network interface issues?
• Brook: new language for Imagine
  – Eliminate exposure of hardware details (# of clusters)
• Streaming Supercomputer – multi Imagine configuration
  – Streams can be used for functional/data parallelism
• Currently evaluating DIVA architecture