Approximation techniques used for general-purpose algorithms, data-parallel applications, and solid-state memories
Presented by: K M Sabidur Rahman. Date: Apr 28, 2014
Outline
Approximate Computing
Neural Acceleration for General-Purpose Approximate Programs
Approximate Storage in Solid-State Memories
Paraprox: Pattern-Based Approximation for Data Parallel Applications
Approximate Computing
Applicable where some degree of variation or error is acceptable
Example: Video processing
Loss of accuracy is permissible
Better performance given less work
Low power consumption
Domains
Multimedia processing
Machine learning
Gaming
Data mining/analysis
Financial modeling
Statistics
Approximate Computing
Companies that deal with huge volumes of data are interested in more efficient data processing, even at some loss of accuracy
Categorization of approximation
Programmer-based: the programmer writes different approximate versions of a program and a runtime system decides which version to run.
Hardware-based: hardware modifications such as imprecise arithmetic units, register files, or accelerators. Cannot be readily utilized without manufacturing new hardware.
Software-based: Approximation is done on the software level. Each of these solutions works only for a small set of applications.
Neural Acceleration for General-Purpose Approximate Programs
Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze and Doug Burger
Basic concept
A learning-based approach
Select and train a neural network to mimic a region of code
After the learning phase, the compiler replaces the original code with an invocation of the trained neural network
NPU: a low-power accelerator, tightly coupled to the processor pipeline, that accelerates small code regions
Challenges for effective trainable accelerators
A learning algorithm: to accurately and efficiently mimic imperative code
A language and compilation framework: to transform regions of imperative code into neural network evaluations
An architectural interface: to call a neural processing unit (NPU) in place of the original code regions
Neural Acceleration
Annotate an approximate program component
Compile the program
Train a neural network
Execute on a fast Neural Processing Unit (NPU)
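The four steps above can be sketched end to end. Everything below is an illustrative assumption rather than the paper's actual setup: the hot function (a sine stand-in for a region like Sobel), the 1-16-1 network size, and the training hyperparameters are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an annotated hot function; any smooth function with
# logged input/output pairs would do.
def hot_function(x):
    return np.sin(np.pi * x)

X = rng.uniform(-1.0, 1.0, size=(256, 1))    # logged inputs
Y = hot_function(X)                          # logged outputs

# Tiny 1-16-1 tanh network trained by full-batch gradient descent.
W1 = rng.normal(0.0, 0.5, (1, 16)); b1 = np.zeros(16)
W2 = rng.normal(0.0, 0.5, (16, 1)); b2 = np.zeros(1)
lr = 0.2
for _ in range(5000):
    H = np.tanh(X @ W1 + b1)                 # hidden layer
    P = H @ W2 + b2                          # network output
    E = P - Y
    gW2 = H.T @ E / len(X); gb2 = E.mean(0)
    dH = (E @ W2.T) * (1.0 - H ** 2)         # backprop through tanh
    gW1 = X.T @ dH / len(X); gb1 = dH.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

def npu_invoke(x):
    """What the transformed program calls instead of hot_function."""
    return np.tanh(x @ W1 + b1) @ W2 + b2

mse = float(np.mean((npu_invoke(X) - Y) ** 2))
```

After training, `npu_invoke` approximates `hot_function` closely enough for error-tolerant callers, which is the substitution the compiler performs.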
From annotated code to accelerated execution on an NPU-augmented core
Programming
The programmer explicitly annotates functions
This is a common practice in literature
Code Observation
The compiler observes the behavior of the candidate code region by logging its inputs and outputs
The logged input/output pairs constitute the training and validation data for the next step
The compiler uses the collected input/output data to configure and train a neural network that mimics the candidate region
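The observation step can be mimicked in a few lines. A decorator stands in for the compiler's instrumentation here; the `observe` and `candidate` names are hypothetical, not from the paper.

```python
import functools

def observe(log):
    """Record (inputs, output) pairs for every call, as the compiler's
    observation phase does for a candidate region."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args):
            out = fn(*args)
            log.append((args, out))   # one training example per call
            return out
        return inner
    return wrap

training_data = []

@observe(training_data)
def candidate(x, y):                  # hypothetical annotated hot region
    return 3 * x + y

candidate(1, 2)
candidate(4, 5)
```

After representative runs, `training_data` holds the input/output pairs used to train and validate the neural network.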
Execution
The transformed program begins execution on the main core and configures the NPU
The NPU is invoked to perform a neural network evaluation in lieu of executing the original code region
Invoking the NPU is faster and more energy-efficient than executing the original code region
Code Region Criteria
Hot code
Approximability
Well-defined inputs and outputs
Original Sobel code
Parrot-transformed code
Architecture Design for NPU Acceleration
Architecture Design for NPU Acceleration
The CPU-NPU interface consists of three queues, for:
sending and retrieving the configuration
sending the inputs and
retrieving the neural network's outputs
Architecture Design for NPU Acceleration
The ISA is extended with four instructions to access the queues:
enq.c %r: enqueues the value of register r into the config FIFO.
deq.c %r: dequeues a configuration value from the config FIFO into register r.
enq.d %r: enqueues the value of register r into the input FIFO.
deq.d %r: dequeues the head of the output FIFO into register r.
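A software model of the three FIFOs might look as follows. This is a sketch, not the hardware: the class and method names are invented (the methods mirror the four instructions), a lambda stands in for the trained network, and evaluation happens eagerly rather than asynchronously.

```python
from collections import deque

class NPUQueues:
    """Toy model of the config, input, and output FIFOs between CPU and NPU."""
    def __init__(self, net):
        self.config = deque()
        self.inputs = deque()
        self.outputs = deque()
        self.net = net                      # stand-in for the trained network

    def enq_c(self, v):                     # enq.c %r
        self.config.append(v)

    def deq_c(self):                        # deq.c %r
        return self.config.popleft()

    def enq_d(self, v):                     # enq.d %r
        self.inputs.append(v)
        # In hardware the NPU drains the input FIFO and fills the output
        # FIFO asynchronously; here we evaluate immediately.
        self.outputs.append(self.net(self.inputs.popleft()))

    def deq_d(self):                        # deq.d %r
        return self.outputs.popleft()

npu = NPUQueues(net=lambda x: 2 * x + 1)    # toy "neural network"
npu.enq_c(0xA5)                             # send a configuration word
npu.enq_d(3)
npu.enq_d(10)
results = [npu.deq_d(), npu.deq_d(), npu.deq_c()]
```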
Reconfigurable 8-PE NPU
A single processing engine
Benchmarks and Experimental Setup
Benchmarks: FFT, inverse kinematics, triangle intersection, JPEG, K-means, Sobel (annotated one hot function each)
Experimental Setup: MARSSx86
Energy model: McPAT and CACTI
Results: 2.3x Speedup
Results: 3.0x Energy reduction
Limitations
Applicability
Programmer effort
Quality and error control
Approximate Storage in Solid-State Memories
Adrian Sampson, Jacob Nelson, Karin Strauss and Luis Ceze
Basic concept
Mechanisms to enable applications to store data approximately
Improved performance, lifetime, or density of solid-state memories
Two techniques
Reduced-precision writes in multi-level phase-change memory cells
Use of blocks with failed bits to store approximate data
Reduced-precision writes in multi-level phase-change memory cells can be 1.7x faster on average
Failed blocks can improve array lifetime by 23% on average with quality loss under 10%
INTERFACES FOR APPROXIMATE STORAGE
Approximate storage augments memory modules with software-visible precision modes.
When an application needs strict data fidelity, it uses traditional precise storage; the memory then guarantees a low error rate when recovering the data.
When the application can tolerate occasional errors in some data, it uses the memory's approximate mode, in which data recovery errors may occur with non-negligible probability
Phase change memory (PCM)
Merits: non-volatile, almost as fast as DRAM, more scalable, faster than flash
Limitations: needs more time and energy to protect against errors; cells wear out over time and can no longer be used for precise data storage
Approximate storage in PCM
PCM works by storing an analog value (resistance) and quantizing it to expose digital storage.
A larger number of levels per cell requires more time and energy to access.
Approximation improves performance and efficiency
Multi-Level Cell Model
Multi-Level Cell Model
The shaded areas are the target regions for writes to each level
Unshaded areas are guard bands.
The curves show the probability of reading a given analog value after writing one of the levels.
Approximate MLCs decrease guard bands so the probability distributions overlap.
Goal is to increase density or performance at the cost of occasional digital-domain storage errors.
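The guard-band trade-off above can be simulated. This is a hedged sketch: the 4-level cell, the noise widths, and the mapping from guard-band width to write noise are illustrative assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(1)

def write_read_error_rate(level, n_levels=4, write_sigma=0.02, n=20000):
    """Write `level` as an analog value with Gaussian write noise, read it
    back by snapping to the nearest level center, and report the digital
    error rate. A larger write_sigma models narrower guard bands (fewer
    program-and-verify iterations, i.e. faster writes)."""
    centers = (np.arange(n_levels) + 0.5) / n_levels      # centers in [0, 1]
    analog = centers[level] + rng.normal(0.0, write_sigma, n)
    read = np.clip(np.round(analog * n_levels - 0.5), 0, n_levels - 1)
    return float(np.mean(read != level))

precise_err = write_read_error_rate(1, write_sigma=0.02)  # wide guard bands
approx_err = write_read_error_rate(1, write_sigma=0.08)   # relaxed guard bands
```

With tight guard bands the read-back error rate is essentially zero; relaxing them makes the write distributions overlap and a visible fraction of reads land on the wrong level, which is exactly the error the approximate mode tolerates.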
Memory Interface
MLC blocks can be made precise or approximate by adjusting the target threshold of write operations.
The memory array must know which threshold value to use for each write operation.
Memory interface extended to include precision flags
Read operations are identical for approximate and precise memory
USING FAILED MEMORY CELLS
Use blocks with exhausted error-correction resources to store approximate data
Value stored in a particular failed block will consistently exhibit bit errors in the same positions
Prioritized Bit Correction
Example: the mantissa of a floating-point number.
Correct the failed bits that appear in high-order positions within words and leave the lowest-order failed bits uncorrected.
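A minimal sketch of this policy, with made-up bit positions and a made-up correction budget (the real mechanism stores corrections in spare ECC resources):

```python
def flip_stuck_bits(word, stuck_positions):
    """Model a failed block by flipping the stuck bit positions."""
    for p in stuck_positions:
        word ^= 1 << p
    return word

def prioritized_correction(word, correct_bits, stuck_positions, budget):
    """Spend the limited correction budget on the highest-order failed
    bits; the lowest-order failures stay wrong."""
    for p in sorted(stuck_positions, reverse=True)[:budget]:
        word = (word & ~(1 << p)) | (correct_bits & (1 << p))
    return word

original = 0b1011_0110_1100_0001
stuck = [14, 2]                        # one high-order, one low-order failure
corrupted = flip_stuck_bits(original, stuck)
recovered = prioritized_correction(corrupted, original, stuck, budget=1)
error_mask = recovered ^ original      # only the low-order bit remains wrong
```

With a budget of one correction, the high-order bit (large numerical impact) is repaired and only the low-order bit error survives, which matters little for data like a float mantissa.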
Memory Interface
Unlike with the approximate MLC technique, software has no control over blocks' precision state.
To permit safe allocation of approximate and precise data, the memory must inform software of the locations of approximate (i.e., failed) blocks.
As a block fails, the OS adds it to a pool of approximate blocks.
Memory allocators consult this pool when laying out data in memory.
While approximate data can be stored in any block, precise data must be allocated in blocks without failures.
Benchmarks
The main-memory applications: Java programs annotated using the EnerJ approximation-aware type system, which marks some data as approximate and leaves other data precise.
The persistent-storage benchmarks are static data sets that can be stored 100% approximately.
Applications: fft, jmeint, lu, mc, raytr., smm, sor, zxing
Results
Paraprox: Pattern-Based Approximation for Data Parallel Applications
Mehrzad Samadi, Davoud Anoushe Jamshidi, Janghaeng Lee and Scott Mahlke
Paraprox
Pattern-specific approximation methods
Identify different patterns commonly found in data parallel workloads
Use specialized approximation optimization for each pattern
Write software once and use it on a variety of processors
Provide knobs to control the output quality
Paraprox framework
Paraprox framework
Paraprox detects the patterns
Generates approximate kernels with different tuning parameters
The runtime profiles the kernels and tunes the parameters for the best performance
If the user-defined target output quality (TOQ) is violated, the runtime system adjusts by
retuning the parameters and/or
selecting a less aggressive approximate kernel for the next execution
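A greedy sketch of that selection step, assuming kernels are ordered most-aggressive first with the exact kernel as a fallback (the kernel names and quality numbers here are invented):

```python
def tune(kernels, measure_quality, toq):
    """Pick the most aggressive kernel whose measured output quality
    still meets the user-defined target output quality (TOQ)."""
    for k in kernels:
        if measure_quality(k) >= toq:   # first kernel meeting the TOQ wins
            return k
    return kernels[-1]                  # exact kernel always satisfies TOQ

# Hypothetical quality scores reported by the profiler:
profiled_quality = {"aggressive": 0.80, "moderate": 0.93, "exact": 1.00}
chosen = tune(["aggressive", "moderate", "exact"],
              profiled_quality.__getitem__, toq=0.90)
```

Here the most aggressive kernel violates the 0.90 TOQ, so the runtime falls back to the next, less aggressive variant.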
Pattern detection
Map
Scatter/Gather
Reduction
Scan
Stencil
Partition
Patterns
Approximation Optimizations
Map and scatter/gather patterns: approximate memoization
Replaces a function call with a query into a lookup table that returns a pre-computed result
Pre-compute the output of the map or scatter/gather function for a number of representative input sets offline
At runtime, the launched kernel's threads use this lookup table to find the output for all input values
Approximate Memoization
Approximate Memoization
Identify candidate functions
Find the table size
Determine qi for each input
Check for quality; if not satisfied, go back to step 2
Fill the table
Execution
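The offline table-building and runtime lookup phases can be sketched as follows, with `math.sin` standing in for a map-pattern function; the table size and input range are illustrative choices, not Paraprox's tuned values.

```python
import math

def build_table(fn, lo, hi, size):
    """Offline phase: precompute fn at `size` evenly spaced inputs."""
    step = (hi - lo) / (size - 1)
    return [fn(lo + i * step) for i in range(size)], step

def lookup(table, lo, step, x):
    """Runtime phase: replace the call with the nearest precomputed entry."""
    i = round((x - lo) / step)
    return table[max(0, min(len(table) - 1, i))]

table, step = build_table(math.sin, 0.0, math.pi, size=1024)
approx = lookup(table, 0.0, step, 1.0)
err = abs(approx - math.sin(1.0))
```

A larger table (more quantization levels per input) shrinks the error but costs memory; that trade-off is what the quality check in step 4 tunes.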
Stencil and Partition
70% of each image's pixels differ by less than 10% from their neighbors.
Paraprox assumes that adjacent elements in the input array are similar in value.
Rather than access all neighbors within a tile, Paraprox accesses only a subset of them and assumes the rest of the neighbors have the same value
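The effect of that assumption can be shown on a tiny stencil. A minimal sketch, with an exact 3x3 mean filter against a center-based approximation that reads only the center pixel (the image and filter are illustrative, not the paper's benchmarks):

```python
import numpy as np

rng = np.random.default_rng(2)

def mean_blur(img):
    """Exact 3x3 mean filter over the interior pixels."""
    out = img.astype(float).copy()
    for i in range(1, img.shape[0] - 1):
        for j in range(1, img.shape[1] - 1):
            out[i, j] = img[i - 1:i + 2, j - 1:j + 2].mean()
    return out

def center_blur(img):
    """Center-based approximation: assume every neighbor in the 3x3 tile
    equals the center pixel, so the tile mean is the center value itself."""
    return img.astype(float)

img = rng.normal(100.0, 1.0, (32, 32))   # smooth image: neighbors similar
rel_err = float(np.abs(center_blur(img) - mean_blur(img)).mean() / 100.0)
```

Because neighboring pixels really are similar in smooth regions, the approximation reads one value instead of nine yet stays within a few percent of the exact result.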
Approximation of tile
Center-based approach
Row-based approximation schemes
Reduction
Paraprox aims to predict the final result by computing the reduction of a subset of the input data
The data is assumed to be distributed uniformly, so a subset of the data can provide a good representation of the entire array
May need adjustment
For example, instead of finding the minimum of the original array, Paraprox finds the minimum within one half of the array and returns it as the approximate result.
If the data in both subarrays have similar distributions, the minimum of these subarrays will be close to each other and approximation error will be negligible.
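The half-array minimum example above is easy to check numerically; the data here is synthetic uniform noise, chosen only because it satisfies the "similar distributions" assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.uniform(0.0, 1.0, 100_000)

exact_min = data.min()                       # full reduction
approx_min = data[: len(data) // 2].min()    # reduce only half the input
err = float(approx_min - exact_min)          # >= 0: a subset min can't be smaller
```

Halving the work changes the result by a negligible amount when both halves are drawn from the same distribution; skewed data would need the adjustment the slide mentions.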
Scan
Paraprox assumes that the differences between elements in one partition of the input array are similar to those in the other partitions of the same array.
Parallel implementations of the scan pattern break the input array into subarrays and compute the scan result for each of them.
Scan: Implementation
A data-parallel implementation of the scan pattern has three phases:
Phase I scans each subarray.
Phase II scans the sums of all subarrays.
Phase III adds the result of Phase II to each corresponding subarray in the partial scan to generate the final result.
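The three phases above can be written out directly; this is the exact (non-approximate) version, sequential rather than data-parallel, with an arbitrary subarray count.

```python
import numpy as np

def parallel_scan(x, n_sub=4):
    """Three-phase inclusive scan, as described above."""
    subs = np.array_split(x, n_sub)
    partial = [np.cumsum(s) for s in subs]       # Phase I: scan each subarray
    sums = np.cumsum([p[-1] for p in partial])   # Phase II: scan subarray sums
    out = [partial[0]]
    for i in range(1, n_sub):                    # Phase III: add the offsets
        out.append(partial[i] + sums[i - 1])
    return np.concatenate(out)

x = np.arange(1, 17)
result = parallel_scan(x)
```

Paraprox's approximation targets Phase II: rather than the true sum of every subarray, it predicts the offsets from a sampled subarray, so only Phase III's inputs change.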
Scan Approximation
Experimental Setup
Compiler: Clang 3.3
GPU: NVIDIA GTX 560
CPU: Intel Core i7
Benchmarks: NVIDIA SDK, Rodinia
Results: Speedup
Results: Performance comparison
Q&A