Approximation techniques used for general purpose algorithms, data parallel applications and solid-state memories
Presented by: K M Sabidur Rahman Date: Apr 28, 2014



  • Outline

    Approximate Computing

    Neural Acceleration for General-Purpose Approximate Programs

    Approximate Storage in Solid-State Memories

    Paraprox: Pattern-Based Approximation for Data Parallel Applications

    2

  • Approximate Computing

    Applicable where some degree of variation or error is acceptable

    Example: Video processing

    Loss of accuracy is permissible

    Better performance given less work

    Low power consumption

    3

  • Domains

    Multimedia processing

    Machine learning

    Gaming

    Data mining/analysis

    Financial modeling

    Statistics

    4

  • Approximate Computing

    Companies dealing with huge volumes of data are interested in more efficient processing, even at the cost of some accuracy

    5

  • Categorization of approximation

    Programmer-based: the programmer writes different approximate versions of a program and a runtime system decides which version to run.

    Hardware-based: hardware modifications such as imprecise arithmetic units, register files, or accelerators. Cannot be readily utilized without manufacturing new hardware.

    Software-based: Approximation is done on the software level. Each of these solutions works only for a small set of applications.

    6

  • Neural Acceleration for General-Purpose Approximate Programs

    Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze and Doug Burger

    7

  • Basic concept

    A learning-based approach

    Select and train a neural network to mimic a region of code

    After the learning phase, the compiler replaces the original code with an invocation of the trained neural network

    NPU: low power accelerator tightly coupled to the processor pipeline to accelerate small code regions.

    8

  • Challenges for effective trainable accelerators

    A learning algorithm: to accurately and efficiently mimic imperative code.

    A language and compilation framework: to transform regions of imperative code to neural network evaluations.

    An architectural interface: to call a neural processing unit (NPU) in place of the original code regions

    9

  • Neural Acceleration

    Annotate an approximate program component

    Compile the program

    Train a neural network

    Execute on a fast Neural Processing Unit (NPU)

    10

  • From annotated code to accelerated execution on an NPU-augmented core

    11

  • Programming

    The programmer explicitly annotates functions

    This is a common practice in literature

    12

  • Code Observation

    Compiler observes the behavior of the candidate code region by logging its inputs and outputs

    The logged input/output pairs constitute the training and validation data for the next step

    The compiler uses the collected input/output data to configure and train a neural network that mimics the candidate region

    13
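The observation step above can be sketched as a wrapper that logs input/output pairs while the original function runs normally (a minimal Python sketch; the actual compiler instruments the code region itself, and `record_io` is a hypothetical name):

```python
import functools

def record_io(log):
    """Decorator that appends (inputs, output) pairs to `log` while
    the wrapped function executes normally."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args):
            result = func(*args)
            log.append((args, result))  # training/validation data for the NN
            return result
        return wrapper
    return decorator

# Example: observe a candidate region (here, a toy luminance function).
training_data = []

@record_io(training_data)
def luminance(r, g, b):
    return 0.299 * r + 0.587 * g + 0.114 * b

luminance(255, 0, 0)
luminance(0, 255, 0)
# training_data now holds two ((r, g, b), output) pairs
```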

  • Execution

    The transformed program begins execution on the main core and configures the NPU.

    The NPU is invoked to perform a neural network evaluation in lieu of executing the original code region.

    Invoking the NPU is faster and more energy-efficient than executing the original code region.

    14

  • Code Region Criteria

    Hot code

    Approximability

    Well-defined inputs and outputs

    15

  • Original Sobel code

    16

  • Parrot transformed code

    17

  • Architecture Design for NPU Acceleration

    18

  • Architecture Design for NPU Acceleration

    The CPU-NPU interface consists of three queues:

    sending and retrieving the configuration

    sending the inputs and

    retrieving the neural network's outputs.

    19

  • Architecture Design for NPU Acceleration

    The ISA is extended with four instructions to access the queues:

    enq.c %r: enqueues the value of the register r into the config FIFO.

    deq.c %r: dequeues a configuration value from the config FIFO to the register r.

    enq.d %r: enqueues the value of the register r into the input FIFO.

    deq.d %r: dequeues the head of the output FIFO to the register r.

    20
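The semantics of the four queue instructions can be modeled in software (a Python sketch of the three FIFOs; `NPUInterface` and its method names are illustrative, the real interface is four ISA instructions operating on hardware FIFOs):

```python
from collections import deque

class NPUInterface:
    """Software model of the CPU-NPU queue interface."""
    def __init__(self):
        self.config_fifo = deque()   # written by enq.c, read by deq.c
        self.input_fifo = deque()    # written by enq.d
        self.output_fifo = deque()   # read by deq.d

    def enq_c(self, value):  # enq.c %r
        self.config_fifo.append(value)

    def deq_c(self):         # deq.c %r
        return self.config_fifo.popleft()

    def enq_d(self, value):  # enq.d %r
        self.input_fifo.append(value)

    def deq_d(self):         # deq.d %r
        return self.output_fifo.popleft()

npu = NPUInterface()
npu.enq_c(0x2A)                 # send one configuration word
npu.enq_d(1.5)                  # send one input value
npu.output_fifo.append(3.0)     # pretend the NPU produced a result
assert npu.deq_c() == 0x2A and npu.deq_d() == 3.0
```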

  • Reconfigurable 8-PE NPU

    21

  • A Single processing engine

    22

  • Benchmarks and Experimental Setup

    Benchmarks: FFT, inverse kinematics, triangle intersection, JPEG, K-means, Sobel (annotated one hot function each)

    Experimental Setup: MARSSx86

    Energy model: McPAT and CACTI

    23

  • Results: 2.3x Speedup

    24

  • Results: 3.0x Energy reduction

    25

  • Limitations

    Applicability

    Programmer effort and

    Quality and error control

    26

  • Approximate Storage in Solid-State Memories

    Adrian Sampson, Jacob Nelson, Karin Strauss and Luis Ceze

    27

  • Basic concept

    Mechanisms to enable applications to store data approximately

    Improved performance, lifetime, or density of solid-state memories

    28

  • Two techniques

    Reduced-precision writes in multi-level phase-change memory cells

    Use of blocks with failed bits to store approximate data

    Reduced-precision writes in multi-level phase-change memory cells can be 1.7x faster on average

    Failed blocks can improve array lifetime by 23% on average with quality loss under 10%

    29

  • INTERFACES FOR APPROXIMATE STORAGE

    Approximate storage augments memory modules with software-visible precision modes.

    When an application needs strict data fidelity, it uses traditional precise storage; the memory then guarantees a low error rate when recovering the data.

    When the application can tolerate occasional errors in some data, it uses the memory's approximate mode, in which data recovery errors may occur with non-negligible probability.

    30

  • Phase change memory (PCM)

    Merits: non-volatile, almost as fast as DRAM, more scalable, faster than flash

    Limitations: needs more time and energy to protect against errors; cells wear out over time and can no longer be used for precise data storage.

    31

  • Approximate storage in PCM

    PCM works by storing an analog value (resistance) and quantizing it to expose digital storage.

    A larger number of levels per cell requires more time and energy to access.

    Approximation improves performance and efficiency

    32

  • Multi-Level Cell Model

    33

  • Multi-Level Cell Model

    The shaded areas are the target regions for writes to each level

    Unshaded areas are guard bands.

    The curves show the probability of reading a given analog value after writing one of the levels.

    Approximate MLCs decrease guard bands so the probability distributions overlap.

    Goal is to increase density or performance at the cost of occasional digital-domain storage errors.

    34
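The guard-band trade-off above can be illustrated numerically (a toy Python model, not the paper's circuit model): each 2-bit level maps to a target analog point, a write lands within some error bound of the target, and a read snaps to the nearest level. Narrower guard bands tolerate less write error, so looser (faster) writes can read back as the wrong level.

```python
def write(level, max_error, levels=4):
    """Model writing `level` (0..levels-1): the stored analog value lands
    within +/-max_error of the level's target point in [0, 1].
    Here we model the worst case: full drift toward the next level."""
    target = level / (levels - 1)
    return target + max_error

def read(analog, levels=4):
    """A read snaps the analog value to the nearest level."""
    return min(range(levels), key=lambda l: abs(analog - l / (levels - 1)))

# Precise write: tight error bound -> always reads back correctly.
assert all(read(write(l, 0.10)) == l for l in range(4))
# Approximate write: looser bound (fewer write iterations) -> the worst
# case drifts past the reduced guard band and reads as the next level.
assert read(write(1, 0.20)) == 2
```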

  • Memory Interface

    MLC blocks can be made precise or approximate by adjusting the target threshold of write operations.

    The memory array must know which threshold value to use for each write operation.

    Memory interface extended to include precision flags

    Read operations are identical for approximate and precise memory

    37

  • USING FAILED MEMORY CELLS

    Use blocks with exhausted error-correction resources to store approximate data

    Values stored in a particular failed block will consistently exhibit bit errors in the same positions

    38

  • Prioritized Bit Correction

    Example of mantissa in floating point number.

    Correct the bits that appear in high-order positions within words and leave the lowest-order failed bits uncorrected.

    39
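Prioritized correction can be sketched as follows (a Python toy on 8-bit words; the paper applies this idea to a block's limited error-correction resources): given the positions of stuck bits, spend the correction budget on the highest-order ones so the residual error stays confined to low-order bits.

```python
def apply_stuck_bits(word, stuck):
    """Model a failed block: each (position, value) bit is stuck."""
    for pos, val in stuck:
        word = (word & ~(1 << pos)) | (val << pos)
    return word

def prioritized_correct(stuck, budget):
    """Return the stuck bits left uncorrected: the `budget` highest-order
    positions are corrected, the lowest-order failures remain."""
    by_significance = sorted(stuck, key=lambda s: s[0], reverse=True)
    return by_significance[budget:]

stuck = [(7, 0), (5, 1), (1, 0)]                  # bits 7, 5, 1 are stuck
residual = prioritized_correct(stuck, budget=2)   # fix bits 7 and 5
value = 0b10101010
# Only the low-order bit 1 is still wrong, so the numeric error is small.
assert apply_stuck_bits(value, residual) == 0b10101000
```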

  • Memory Interface

    Unlike with the approximate MLC technique, software has no control over a block's precision state.

    To permit safe allocation of approximate and precise data, the memory must inform software of the locations of approximate (i.e., failed) blocks.

    As a block fails, the OS adds it to a pool of approximate blocks.

    Memory allocators consult this set of approximate blocks when laying out data in the memory.

    While approximate data can be stored in any block, precise data must be allocated in memory without failures.

    40
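The allocation policy above can be sketched in a few lines (a Python toy; `BlockAllocator` is an illustrative name, the real mechanism lives in the OS and the memory allocators): failed blocks move into an approximate pool, and precise data is only ever placed in failure-free blocks.

```python
class BlockAllocator:
    """Toy model of the failed-block allocation policy."""
    def __init__(self, num_blocks):
        self.precise_pool = set(range(num_blocks))  # failure-free blocks
        self.approx_pool = set()                    # failed blocks

    def report_failure(self, block):
        """The memory informs the OS that a block has exhausted its
        error-correction resources."""
        self.precise_pool.discard(block)
        self.approx_pool.add(block)

    def allocate(self, approximate):
        # Approximate data may live in any block; prefer failed ones so
        # failure-free blocks stay available for precise data.
        if approximate and self.approx_pool:
            return self.approx_pool.pop()
        return self.precise_pool.pop()

alloc = BlockAllocator(4)
alloc.report_failure(2)
assert alloc.allocate(approximate=True) == 2    # reuses the failed block
assert alloc.allocate(approximate=False) != 2   # precise data avoids it
```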

  • Benchmarks

    The main-memory applications: Java programs annotated using EnerJ, an approximation-aware type system that marks some data as approximate and leaves other data precise.

    The persistent-storage benchmarks are static data sets that can be stored 100% approximately

    Applications: fft, jmeint, lu, mc, raytr. , smm, sor, zxing

    41

  • Results

    42

  • Results

    43

  • Paraprox: Pattern-Based Approximation for Data Parallel Applications

    Mehrzad Samadi, Davoud Anoushe Jamshidi, Janghaeng Lee and Scott Mahlke

    44

  • Paraprox

    Pattern-specific approximation methods

    Identify different patterns commonly found in data parallel workloads

    Use specialized approximation optimization for each pattern

    Write software once and use it on a variety of processors

    Provide knobs to control the output quality

    45

  • Paraprox framework

    46

  • Paraprox framework

    Paraprox detects the patterns

    Generates approximate kernels with different tuning parameters

    The runtime profiles the kernels and tunes the parameters for the best performance.

    If the user-defined target output quality (TOQ) is violated, the runtime system will adjust by

    retuning the parameters and/or

    selecting a less aggressive approximate kernel for the next execution.

    47

  • Pattern detection

    Map

    Scatter/Gather

    Reduction

    Scan

    Stencil and

    Partition.

    48

  • Patterns

    49

  • Approximation Optimizations

    Map and scatter/gather patterns: approximate memoization

    Replaces a function call with a query into a lookup table which returns a pre-computed result

    Pre-compute the output of the map or scatter/gather function for a number of representative input sets offline.

    During runtime, the launched kernel's threads use this lookup table to find the output for all input values.

    50

  • Approximate Memoization

    51

  • Approximate Memoization

    Identify candidate functions

    Find the table size

    Determine qi for each input

    Check for quality; if not satisfied, go back to step 2.

    Fill the Table

    Execution

    52
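The steps above can be sketched for a one-input function (a minimal Python version; the table size is the tuning knob, and Paraprox itself generates GPU kernels rather than Python):

```python
import math

def build_table(func, lo, hi, size):
    """Step 5: precompute outputs for `size` representative inputs."""
    step = (hi - lo) / (size - 1)
    return [func(lo + i * step) for i in range(size)]

def memoized(table, lo, hi):
    """Step 6: at runtime, quantize the input to a table index instead
    of calling the original function."""
    size = len(table)
    def lookup(x):
        i = round((x - lo) / (hi - lo) * (size - 1))
        return table[max(0, min(size - 1, i))]
    return lookup

table = build_table(math.sin, 0.0, math.pi, size=64)
approx_sin = memoized(table, 0.0, math.pi)

# Step 4 (quality check): max error over sample points stays small;
# if it did not, we would go back and enlarge the table.
err = max(abs(approx_sin(x / 100) - math.sin(x / 100)) for x in range(314))
assert err < 0.05
```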

  • Stencil and Partition

    70% of each image's pixels differ by less than 10% from their neighbors.

    Paraprox assumes that adjacent elements in the input array are similar in value.

    Rather than access all neighbors within a tile, Paraprox accesses only a subset of them and assumes the rest of the neighbors have the same value

    53

  • 54

  • 55

  • Approximation of tile

    Center based approach

    Row based approximation schemes

    56
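As an illustration of the center-based scheme (a Python sketch; Paraprox applies this inside generated GPU kernels): a 3x3 box filter reads only the tile's center element and assumes the eight neighbors share its value, reducing nine memory accesses to one.

```python
def box3x3_exact(img, y, x):
    """Precise 3x3 box filter: average all nine neighbors."""
    return sum(img[y + dy][x + dx]
               for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0

def box3x3_center(img, y, x):
    """Center-based approximation: read only the center element and
    assume the neighbors have the same value."""
    return float(img[y][x])

# On smooth data (neighbors within ~10% of each other, as observed for
# typical images) the two results are close.
img = [[10, 10, 11],
       [10, 10, 10],
       [ 9, 10, 10]]
assert abs(box3x3_exact(img, 1, 1) - box3x3_center(img, 1, 1)) < 1.0
```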

  • Reduction

    Paraprox aims to predict the final result by computing the reduction of a subset of the input data

    The data is assumed to be distributed uniformly, so a subset of the data can provide a good representation of the entire array

    May need adjustment

    57

  • 58

  • For example, instead of finding the minimum of the original array, Paraprox finds the minimum within one half of the array and returns it as the approximate result.

    If the data in both subarrays have similar distributions, the minimum of these subarrays will be close to each other and approximation error will be negligible.

    59
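The slide's min example can be written out directly (a Python sketch; sampling half the input is one possible knob setting, and the runtime would retune it if quality drops):

```python
def min_exact(data):
    return min(data)

def min_approx(data):
    """Approximate reduction: reduce only the first half of the array,
    assuming both halves have similar value distributions."""
    return min(data[:len(data) // 2])

data = [52, 17, 83, 19, 44, 21, 90, 18]  # roughly uniform values
exact, approx = min_exact(data), min_approx(data)
assert approx >= exact   # skipping elements can only overshoot a minimum
# When both halves are similarly distributed, the error is negligible.
assert abs(approx - exact) <= 2
```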

  • Scan

    Paraprox assumes that differences between elements in the input array are similar to those in other partitions of the same input array.

    Parallel implementations of the scan pattern break the input array into subarrays and compute the scan result for each of them.

    60

  • Scan

    61

  • Scan : Implementation

    A data parallel implementation of the scan pattern has three phases:

    Phase I scans each subarray.

    Phase II scans the sum of all subarrays.

    Phase III then adds the result of Phase II to each corresponding subarray in the partial scan to generate the final result.

    62
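The three phases above can be written out as a sequential sketch of the exact scan (Python; a real implementation runs Phases I and III in parallel across subarrays, and the approximation replaces some of this work with reused per-subarray results):

```python
from itertools import accumulate

def three_phase_scan(data, num_sub=4):
    """Inclusive prefix sum computed the data-parallel way."""
    size = len(data) // num_sub
    subs = [data[i * size:(i + 1) * size] for i in range(num_sub)]
    # Phase I: scan each subarray independently.
    partial = [list(accumulate(s)) for s in subs]
    # Phase II: scan the subarray sums to get each subarray's offset.
    offsets = [0] + list(accumulate(p[-1] for p in partial))[:-1]
    # Phase III: add each subarray's offset to its partial scan.
    return [v + off for p, off in zip(partial, offsets) for v in p]

data = [3, 1, 4, 1, 5, 9, 2, 6]
assert three_phase_scan(data, num_sub=2) == list(accumulate(data))
```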

  • Scan Approximation

    63

  • Experimental Setup

    Clang 3.3

    GPU - NVIDIA GTX 560

    CPU - Intel Core i7

    Benchmarks - NVIDIA SDK, Rodinia

    64

  • Results: Speedup

    65

  • Results: Performance comparison

    68

  • Q&A

    ? 69

  • 70