Approximation techniques used for general purpose algorithms, data parallel applications and solid-state memories
Presented by: K M Sabidur Rahman Date: Apr 28, 2014



  • Outline

    Approximate Computing

    Neural Acceleration for General-Purpose Approximate Programs

    Approximate Storage in Solid-State Memories

    Paraprox: Pattern-Based Approximation for Data Parallel Applications

    2

  • Approximate Computing

    Applicable where some degree of variation or error is acceptable

    Example: Video processing

    Loss of accuracy is permissible

    Better performance given less work

    Low power consumption

    3

  • Domains

    Multimedia processing

    Machine learning

    Gaming

    Data mining/analysis

    Financial modeling

    Statistics

    4

  • Approximate Computing

    Companies dealing with huge volumes of data are interested in more efficient processing, even at the cost of some accuracy

    5

  • Categorization of approximation

    Programmer-based: the programmer writes different approximate versions of a program and a runtime system decides which version to run.

    Hardware-based: hardware modifications such as imprecise arithmetic units, register files, or accelerators. Cannot be readily utilized without manufacturing new hardware.

    Software-based: Approximation is done on the software level. Each of these solutions works only for a small set of applications.

    6

  • Neural Acceleration for General-Purpose Approximate Programs

    Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze and Doug Burger

    7

  • Basic concept

    A learning-based approach

    Select and train a neural network to mimic a region of code

    After the learning phase, the compiler replaces the original code with an invocation of the trained neural network

    NPU: low power accelerator tightly coupled to the processor pipeline to accelerate small code regions.

    8

  • Challenges for effective trainable accelerators

    A learning algorithm: to accurately and efficiently mimic imperative code.

    A language and compilation framework: to transform regions of imperative code to neural network evaluations.

    An architectural interface: to call a neural processing unit (NPU) in place of the original code regions

    9

  • Neural Acceleration

    Annotate an approximate program component

    Compile the program

    Train a neural network

    Execute on a fast Neural Processing Unit (NPU)

    10

  • From annotated code to accelerated execution on an NPU-augmented core

    11

  • Programming

    The programmer explicitly annotates functions

    This is a common practice in literature

    12

  • Code Observation

    Compiler observes the behavior of the candidate code region by logging its inputs and outputs

    The logged input/output pairs constitute the training and validation data for the next step

    The compiler uses the collected input/output data to configure and train a neural network that mimics the candidate region

    13
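The observation step above can be sketched as a wrapper that logs input/output pairs while the original function runs normally (a minimal Python sketch; the actual compiler instruments the code region itself, and `record_io` is a hypothetical name):

```python
import functools

def record_io(log):
    """Decorator that appends (inputs, output) pairs to `log` while
    the wrapped function executes normally."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args):
            result = func(*args)
            log.append((args, result))  # training/validation data for the NN
            return result
        return wrapper
    return decorator

# Example: observe a candidate region (here, a toy luminance function).
training_data = []

@record_io(training_data)
def luminance(r, g, b):
    return 0.299 * r + 0.587 * g + 0.114 * b

luminance(255, 0, 0)
luminance(0, 255, 0)
# training_data now holds two ((r, g, b), output) pairs
```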

  • Execution

    The transformed program begins execution on the main core and configures the NPU.

    The NPU is invoked to perform a neural network evaluation in lieu of executing the original code region.

    Invoking the NPU is faster and more energy-efficient than executing the original code region.

    14

  • Code Region Criteria

    Hot code

    Approximability

    Well-defined inputs and outputs

    15

  • Original Sobel code

    16

  • Parrot transformed code

    17

  • Architecture Design for NPU Acceleration

    18

  • Architecture Design for NPU Acceleration

    The CPU-NPU interface consists of three queues:

    sending and retrieving the configuration

    sending the inputs and

    retrieving the neural network's outputs.

    19

  • Architecture Design for NPU Acceleration

    The ISA is extended with four instructions to access the queues:

    enq.c %r: enqueues the value of the register r into the config FIFO.

    deq.c %r: dequeues a configuration value from the config FIFO to the register r.

    enq.d %r: enqueues the value of the register r into the input FIFO.

    deq.d %r: dequeues the head of the output FIFO to the register r.

    20
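The semantics of the four queue instructions can be modeled in software (a Python sketch of the three FIFOs; `NPUInterface` and its method names are illustrative, the real interface is four ISA instructions operating on hardware FIFOs):

```python
from collections import deque

class NPUInterface:
    """Software model of the CPU-NPU queue interface."""
    def __init__(self):
        self.config_fifo = deque()   # written by enq.c, read by deq.c
        self.input_fifo = deque()    # written by enq.d
        self.output_fifo = deque()   # read by deq.d

    def enq_c(self, value):  # enq.c %r
        self.config_fifo.append(value)

    def deq_c(self):         # deq.c %r
        return self.config_fifo.popleft()

    def enq_d(self, value):  # enq.d %r
        self.input_fifo.append(value)

    def deq_d(self):         # deq.d %r
        return self.output_fifo.popleft()

npu = NPUInterface()
npu.enq_c(0x2A)                 # send one configuration word
npu.enq_d(1.5)                  # send one input value
npu.output_fifo.append(3.0)     # pretend the NPU produced a result
assert npu.deq_c() == 0x2A and npu.deq_d() == 3.0
```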

  • Reconfigurable 8-PE NPU

    21

  • A Single processing engine

    22

  • Benchmarks and Experimental Setup

    Benchmarks: FFT, inverse kinematics, triangle intersection, JPEG, K-means, Sobel (annotated one hot function each)

    Experimental Setup: MARSSx86

    Energy model: McPAT and CACTI

    23

  • Results: 2.3x Speedup

    24

  • Results: 3.0x Energy reduction

    25

  • Limitations

    Applicability

    Programmer effort and

    Quality and error control

    26

  • Approximate Storage in Solid-State Memories

    Adrian Sampson, Jacob Nelson, Karin Strauss and Luis Ceze

    27

  • Basic concept

    Mechanisms to enable applications to store data approximately

    Improved performance, lifetime, or density of solid-state memories

    28

  • Two techniques

    Reduced-precision writes in multi-level phase-change memory cells

    Use of blocks with failed bits to store approximate data

    Reduced-precision writes in multi-level phase-change memory cells can be 1.7x faster on average

    Failed blocks can improve array lifetime by 23% on average with quality loss under 10%

    29

  • INTERFACES FOR APPROXIMATE STORAGE

    Approximate storage augments memory modules with software-visible precision modes.

    When an application needs strict data fidelity, it uses traditional precise storage; the memory then guarantees a low error rate when recovering the data.

    When the application can tolerate occasional errors in some data, it uses the memory's approximate mode, in which data recovery errors may occur with non-negligible probability.

    30

  • Phase change memory (PCM)

    Merits: non-volatile, almost as fast as DRAM, more scalable, faster than flash

    Limitations: needs more time and energy to protect against errors; cells wear out over time and can no longer be used for precise data storage.

    31

  • Approximate storage in PCM

    PCM works by storing an analog value (resistance) and quantizing it to expose digital storage.

    A larger number of levels per cell requires more time and energy to access.

    Approximation improves performance and efficiency

    32

  • Multi-Level Cell Model

    33

  • Multi-Level Cell Model

    The shaded areas are the target regions for writes to each level

    Unshaded areas are guard bands.

    The curves show the probability of reading a given analog value after writing one of the levels.

    Approximate MLCs decrease guard bands so the probability distributions overlap.

    Goal is to increase density or performance at the cost of occasional digital-domain storage errors.

    34
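The guard-band trade-off above can be illustrated numerically (a toy Python model, not the paper's circuit model): each 2-bit level maps to a target analog point, a write lands within some error bound of the target, and a read snaps to the nearest level. Narrower guard bands tolerate less write error, so looser (faster) writes can read back as the wrong level.

```python
def write(level, max_error, levels=4):
    """Model writing `level` (0..levels-1): the stored analog value lands
    within +/-max_error of the level's target point in [0, 1].
    Here we model the worst case: full drift toward the next level."""
    target = level / (levels - 1)
    return target + max_error

def read(analog, levels=4):
    """A read snaps the analog value to the nearest level."""
    return min(range(levels), key=lambda l: abs(analog - l / (levels - 1)))

# Precise write: tight error bound -> always reads back correctly.
assert all(read(write(l, 0.10)) == l for l in range(4))
# Approximate write: looser bound (fewer write iterations) -> the worst
# case drifts past the reduced guard band and reads as the next level.
assert read(write(1, 0.20)) == 2
```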

  • Memory Interface

    MLC blocks can be made precise or approximate by adjusting the target threshold of write operations.

    The memory array must know which threshold value to use for each write operation.

    Memory interface extended to include precision flags

    Read operations are identical for approximate and precise memory

    37

  • USING FAILED MEMORY CELLS

    Use blocks with exhausted error-correction resources to store approximate data

    Values stored in a particular failed block will consistently exhibit bit errors in the same positions

    38

  • Prioritized Bit Correction

    Example of mantissa in floating point number.

    Correct the bits that appear in high-order positions within words and leave the lowest-order failed bits uncorrected.

    39
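Prioritized correction can be sketched as follows (a Python toy on 8-bit words; the paper applies this idea to a block's limited error-correction resources): given the positions of stuck bits, spend the correction budget on the highest-order ones so the residual error stays confined to low-order bits.

```python
def apply_stuck_bits(word, stuck):
    """Model a failed block: each (position, value) bit is stuck."""
    for pos, val in stuck:
        word = (word & ~(1 << pos)) | (val << pos)
    return word

def prioritized_correct(stuck, budget):
    """Return the stuck bits left uncorrected: the `budget` highest-order
    positions are corrected, the lowest-order failures remain."""
    by_significance = sorted(stuck, key=lambda s: s[0], reverse=True)
    return by_significance[budget:]

stuck = [(7, 0), (5, 1), (1, 0)]                  # bits 7, 5, 1 are stuck
residual = prioritized_correct(stuck, budget=2)   # fix bits 7 and 5
value = 0b10101010
# Only the low-order bit 1 is still wrong, so the numeric error is small.
assert apply_stuck_bits(value, residual) == 0b10101000
```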

  • Memory Interface

    Unlike with the approximate MLC technique, software has no control over a block's precision state.

    To permit safe allocation of approximate and precise data, the memory must inform software of the locations of approximate (i.e., failed) blocks.

    As a block fails, the OS adds it to a pool of approximate blocks.

    Memory allocators consult this set of approximate blocks when laying out data in the memory.

    While approximate data can be stored in any block, precise data must be allocated in memory without failures.

    40
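The allocation policy above can be sketched in a few lines (a Python toy; `BlockAllocator` is an illustrative name, the real mechanism lives in the OS and the memory allocators): failed blocks move into an approximate pool, and precise data is only ever placed in failure-free blocks.

```python
class BlockAllocator:
    """Toy model of the failed-block allocation policy."""
    def __init__(self, num_blocks):
        self.precise_pool = set(range(num_blocks))  # failure-free blocks
        self.approx_pool = set()                    # failed blocks

    def report_failure(self, block):
        """The memory informs the OS that a block has exhausted its
        error-correction resources."""
        self.precise_pool.discard(block)
        self.approx_pool.add(block)

    def allocate(self, approximate):
        # Approximate data may live in any block; prefer failed ones so
        # failure-free blocks stay available for precise data.
        if approximate and self.approx_pool:
            return self.approx_pool.pop()
        return self.precise_pool.pop()

alloc = BlockAllocator(4)
alloc.report_failure(2)
assert alloc.allocate(approximate=True) == 2    # reuses the failed block
assert alloc.allocate(approximate=False) != 2   # precise data avoids it
```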

  • Benchmarks

    The main-memory applications: Java programs annotated using EnerJ, an approximation-aware type system that marks some data as approximate and leaves other data precise.

    The persistent-storage benchmarks are static data sets that can be stored 100% approximately

    Applications: fft, jmeint, lu, mc, raytr. , smm, sor, zxing

    41

  • Results

    42

  • Results

    43

  • Paraprox: Pattern-Based Approximation for Data Parallel Applications

    Mehrzad Samadi, Davoud Anoushe Jamshidi, Janghaeng Lee and Scott Mahlke

    44

  • Paraprox

    Pattern-specific approximation methods

    Identify different patterns commonly found in data parallel workloads

    Use specialized approximation optimization for each pattern

    Write software once and use it on a variety of processors

    Provide knobs to control the output quality

    45

  • Paraprox framework

    46

  • Paraprox framework

    Paraprox detects the patterns

    Generates approximate kernels with different tuning parameters

    The runtime profiles the kernels and tunes the parameters for the best performance.

    If the user-defined target output quality (TOQ) is violated, the runtime system will adjust by

    retuning the parameters and/or

    selecting a less aggressive approximate kernel for the next execution.

    47

  • Pattern detection

    Map

    Scatter/Gather

    Reduction

    Scan

    Stencil and

    Partition.

    48

  • Patterns

    49

  • Approximation Optimizations

    Map and scatter/gather patterns: approximate memoization

    Replaces a function call with a query into a lookup table which returns a pre-computed result

    Pre-compute the output of the map or scatter/gather function for a number of representative input sets offline.

    During runtime, the launched kernel's threads use this lookup table to find the output for all input values.

    50

  • Approximate Memoization

    51

  • Approximate Memoization

    Identify candidate functions

    Find the table size

    Determine qi for each input

    Check for quality; if not satisfied, go back to step 2.

    Fill the Table

    Execution

    52
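The steps above can be sketched for a one-input function (a minimal Python version; the table size is the tuning knob, and Paraprox itself generates GPU kernels rather than Python):

```python
import math

def build_table(func, lo, hi, size):
    """Step 5: precompute outputs for `size` representative inputs."""
    step = (hi - lo) / (size - 1)
    return [func(lo + i * step) for i in range(size)]

def memoized(table, lo, hi):
    """Step 6: at runtime, quantize the input to a table index instead
    of calling the original function."""
    size = len(table)
    def lookup(x):
        i = round((x - lo) / (hi - lo) * (size - 1))
        return table[max(0, min(size - 1, i))]
    return lookup

table = build_table(math.sin, 0.0, math.pi, size=64)
approx_sin = memoized(table, 0.0, math.pi)

# Step 4 (quality check): max error over sample points stays small;
# if it did not, we would go back and enlarge the table.
err = max(abs(approx_sin(x / 100) - math.sin(x / 100)) for x in range(314))
assert err < 0.05
```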

  • Stencil and Partition

    70% of each image's pixels differ by less than 10% from their neighbors.

    Paraprox assumes that adjacent elements in the input array are similar in value.

    Rather than access all neighbors within a tile, Paraprox accesses only a subset of them and assumes the rest of the neighbors have the same value

    53

  • 54

  • 55

  • Approximation of tile

    Center based approach

    Row based approximation schemes

    56
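As an illustration of the center-based scheme (a Python sketch; Paraprox applies this inside generated GPU kernels): a 3x3 box filter reads only the tile's center element and assumes the eight neighbors share its value, reducing nine memory accesses to one.

```python
def box3x3_exact(img, y, x):
    """Precise 3x3 box filter: average all nine neighbors."""
    return sum(img[y + dy][x + dx]
               for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0

def box3x3_center(img, y, x):
    """Center-based approximation: read only the center element and
    assume the neighbors have the same value."""
    return float(img[y][x])

# On smooth data (neighbors within ~10% of each other, as observed for
# typical images) the two results are close.
img = [[10, 10, 11],
       [10, 10, 10],
       [ 9, 10, 10]]
assert abs(box3x3_exact(img, 1, 1) - box3x3_center(img, 1, 1)) < 1.0
```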

  • Reduction

    Paraprox aims to predict the final result by computing the reduction of a subset of the input data

    The data is assumed to be distributed uniformly, so a subset of the data can provide a good representation of the entire array

    May need adjustment

    57

  • 58

  • For example, instead of finding the minimum of the original array, Paraprox finds the minimum within one half of the array and returns it as the approximate result.

    If the data in both subarrays have similar distributions, the minimum of these subarrays will be close to each other and approximation error will be negligible.

    59
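The slide's min example can be written out directly (a Python sketch; sampling half the input is one possible knob setting, and the runtime would retune it if quality drops):

```python
def min_exact(data):
    return min(data)

def min_approx(data):
    """Approximate reduction: reduce only the first half of the array,
    assuming both halves have similar value distributions."""
    return min(data[:len(data) // 2])

data = [52, 17, 83, 19, 44, 21, 90, 18]  # roughly uniform values
exact, approx = min_exact(data), min_approx(data)
assert approx >= exact   # skipping elements can only overshoot a minimum
# When both halves are similarly distributed, the error is negligible.
assert abs(approx - exact) <= 2
```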

  • Scan

    Paraprox assumes that differences between elements in the input array are similar to those in other partitions of the same input array.

    Parallel implementations of the scan pattern break the input array into subarrays and compute the scan result for each of them.

    60

  • Scan

    61

  • Scan : Implementation

    A data parallel implementation of the scan pattern has three phases:

    Phase I scans each subarray.

    Phase II scans the sum of all subarrays.

    Phase III then adds the result of Phase II to each corresponding subarray in the partial scan to generate the final result.

    62
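The three phases above can be written out as a sequential sketch of the exact scan (Python; a real implementation runs Phases I and III in parallel across subarrays, and the approximation replaces some of this work with reused per-subarray results):

```python
from itertools import accumulate

def three_phase_scan(data, num_sub=4):
    """Inclusive prefix sum computed the data-parallel way."""
    size = len(data) // num_sub
    subs = [data[i * size:(i + 1) * size] for i in range(num_sub)]
    # Phase I: scan each subarray independently.
    partial = [list(accumulate(s)) for s in subs]
    # Phase II: scan the subarray sums to get each subarray's offset.
    offsets = [0] + list(accumulate(p[-1] for p in partial))[:-1]
    # Phase III: add each subarray's offset to its partial scan.
    return [v + off for p, off in zip(partial, offsets) for v in p]

data = [3, 1, 4, 1, 5, 9, 2, 6]
assert three_phase_scan(data, num_sub=2) == list(accumulate(data))
```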

  • Scan Approximation

    63

  • Experimental Setup

    Clang 3.3

    GPU - NVIDIA GTX 560

    CPU - Intel Core i7

    Benchmarks - NVIDIA SDK, Rodinia

    64

  • Results: Speedup

    65

  • Results: Performance comparison

    68

  • Q&A

    ? 69

  • 70