Approximate On-Chip Communication


Davide Patti, Ph.D. davide.patti@dieei.unict.it University of Catania, Italy

…in the Previous Episodes

1. The goal of computing was to be the fastest

2. The challenge to maximize MHz hit the ‘power wall’ in the mid-2000s

3. Initial solution: “ok, no problem, let’s optimise for speed and power…”

4. …but, eventually, the dramatically increasing workloads ruined the party…

Why?

• Ever-increasing amount of information

• Industry reports
  – 2010–2020: amount of information will expand by 50x
  – ...number of servers will only grow by a factor of 10!

Emerging RMS Applications

Error-Resilience Property

Forgiving workloads (multimedia, recognition, search) can tolerate imperfect computing. Examples:

• Inexact inputs, derived from noisy and redundant sources (e.g. sensors)
• The human consumer of the results may not discern small variations
• Data/algorithms including statistical/probabilistic computations
• Computations which may be refined with multiple iterations

Approximate Computing: A Third Dimension for Optimization

“Error” or “Feature”?

• Approximation not as a “problem” to deal with, not as a “limitation”, but part of the game

• A neuron spikes when a combination of all the excitation and inhibition it receives makes it reach threshold (around -50 mV)

Approximating at Multiple Levels of the Stack

Hardware level

• Less accurate yet energy-efficient circuits (e.g., simplified adder)

• Tuning the supply voltage

Software level

• Ignore some computations (skip loop iterations, relaxing control dependences)

• Data structures, e.g., reducing vector sizes

• Ignore certain memory accesses replacing them by estimated values
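The first software-level technique above (skipping loop iterations) is often called loop perforation. The following is a minimal sketch of the idea, not code from the talk: the function names and the `stride` knob are hypothetical, chosen only for illustration.

```c
#include <stddef.h>

/* Exact mean: visits every element of v. */
double mean_exact(const double *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return n ? sum / (double)n : 0.0;
}

/* Perforated mean: visits only one element every `stride` positions
 * (hypothetical accuracy knob). stride == 1 gives the exact result;
 * larger strides do less work and return an estimate. */
double mean_perforated(const double *v, size_t n, size_t stride)
{
    double sum = 0.0;
    size_t used = 0;

    if (stride == 0)
        stride = 1; /* guard against an endless loop */

    for (size_t i = 0; i < n; i += stride) {
        sum += v[i];
        used++;
    }
    return used ? sum / (double)used : 0.0;
}
```

With `stride = 2` the loop body runs half as many times; whether the resulting estimate is acceptable depends on how error-resilient the consumer of the result is.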

Current Applications

• Database Querying/Visualization:

• BlinkDB, Facebook’s Presto, M4 from SAP

2B points (70 mins) vs 1M points (3 mins)

• Neural Networks

• Using NN to replace some expensive computation or algorithm

• Approximate NN implementations for inference (e.g., fewer bits to represent weights)

• SqueezeNet, Google’s Neural Machine Translation

Approximate Communication: the NOC Case Study

■ Shared bus
➔ Low area
➔ Poor scalability
➔ High energy consumption

■ Network-on-Chip
➔ Mesh of Routers (in red)
➔ Each Processing Element connected to a Router
➔ Scalability and modularity
➔ Low energy consumption
➔ Increase of design complexity

Communication Overhead

• Interconnection networks consume 10% to 20% of the power in current HPC systems

• Majority due to the network's links

NoC-based design

• More than one-third of the chip's power consumption

Example

for (i=0; i<n; i++) v[i] = f(w[i]);

[Figure: CPU, MI, and Memory blocks]

Example – Load w[i]

[Figure: CPU, MI, and Memory blocks; labels: Address, Data]

Example – Store v[i]

[Figure: CPU, MI, and Memory blocks; label: Data]

Approximate Communication

• Send(data, destination)
• Send(data, destination, reliability_level)

[Figure: Communication Energy vs. Reliability Level]

Communication system “aware” of error resilience, acting on two knobs:

• Voltage Swing (wired)
• Transmission Power (wireless)
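As a rough illustration of the extended `Send(data, destination, reliability_level)` primitive, the sketch below builds a message descriptor that carries the reliability level and picks the knob according to the link type. All type and function names are hypothetical, and only two reliability levels are assumed; nothing is actually transmitted.

```c
#include <stddef.h>

/* Hypothetical two-level reliability scale. */
typedef enum { RL_APPROX = 0, RL_NOMINAL = 1 } reliability_level_t;

/* The two kinds of links and the knob each one exposes. */
typedef enum { LINK_WIRED, LINK_WIRELESS } link_kind_t;
typedef enum { KNOB_VOLTAGE_SWING, KNOB_TX_POWER } knob_t;

/* Descriptor of one send request (illustrative only). */
typedef struct {
    const void         *data;
    size_t              len;
    int                 destination;
    reliability_level_t rl;
} noc_message_t;

/* Send(data, destination, reliability_level): record the level so the
 * communication system can act on it. */
noc_message_t make_send_rl(const void *data, size_t len, int destination,
                           reliability_level_t rl)
{
    noc_message_t m = { data, len, destination, rl };
    return m;
}

/* Send(data, destination): no resilience information, so assume the
 * data must travel at the nominal reliability level. */
noc_message_t make_send(const void *data, size_t len, int destination)
{
    return make_send_rl(data, len, destination, RL_NOMINAL);
}

/* Wired links tune the voltage swing, wireless links the TX power. */
knob_t knob_for_link(link_kind_t kind)
{
    return kind == LINK_WIRED ? KNOB_VOLTAGE_SWING : KNOB_TX_POWER;
}

/* Non-zero when the message may use the low-energy knob setting. */
int uses_low_energy_setting(const noc_message_t *m)
{
    return m->rl == RL_APPROX;
}
```

Defaulting the plain send to the nominal level keeps unannotated code exact; only data explicitly marked as error-resilient is exposed to the low-energy setting.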

Tuning the Link Voltage Swing

• Reliability vs. Energy (1mm bit-line):

• Nominal voltage swing → low BER, high energy
• Low voltage swing → high BER, low energy

Reconfigurable Link

[Figure: 4 × 4 mesh NoC with reconfigurable links. Legend: core = IP Core, NI = Network Interface, R = Router; a Tile groups a core, an NI, and an R; routers are connected by Physical Links.]

HSPICE Link Simulation

• 45 nm CMOS technology (NanGate's Open Cell Library):
  • 10 metal layers
  • 3 mm link line using the seventh metal layer
  • 2 GHz target frequency

Improving energy efficiency in wireless network-on-chip architectures, V Catania, A Mineo, S Monteleone, M Palesi, D Patti, ACM Journal on Emerging Technologies in Computing Systems (JETC) 14 (1), 2018

[Figure: HSPICE link simulation results; annotations: 70% saving, 3% overhead]

Implementation

Packet: Header | Data | Data | Data | Tail

Header fields: Reliability Level | Destination | Other control info
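One way to picture a header that carries the reliability level is a packed bit field. The sketch below is an assumption-laden illustration: the 32-bit header width, the field widths (2 bits for the reliability level, 8 bits for the destination) and the field order are invented for this example and are not taken from the slides.

```c
#include <stdint.h>

/* Assumed field widths inside a 32-bit header flit (hypothetical). */
#define RL_BITS     2u
#define DST_BITS    8u
#define OTHER_BITS  (32u - RL_BITS - DST_BITS)

#define RL_MASK     ((1u << RL_BITS) - 1u)
#define DST_MASK    ((1u << DST_BITS) - 1u)
#define OTHER_MASK  ((1u << OTHER_BITS) - 1u)

#define DST_SHIFT   RL_BITS
#define OTHER_SHIFT (RL_BITS + DST_BITS)

/* Pack reliability level, destination, and other control info into one
 * header word. Values wider than their field are truncated. */
uint32_t header_pack(uint32_t rl, uint32_t dst, uint32_t other)
{
    return (rl & RL_MASK)
         | ((dst & DST_MASK) << DST_SHIFT)
         | ((other & OTHER_MASK) << OTHER_SHIFT);
}

/* Field accessors, as a router could use them to read the reliability
 * level before configuring the outgoing link. */
uint32_t header_rl(uint32_t h)    { return h & RL_MASK; }
uint32_t header_dst(uint32_t h)   { return (h >> DST_SHIFT) & DST_MASK; }
uint32_t header_other(uint32_t h) { return (h >> OTHER_SHIFT) & OTHER_MASK; }
```

Because the level sits in the header, every router along the path can read it once per packet and apply the same setting to the data flits that follow.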


Annotation Example

• Data coming from/delivered to w[i] travel with a reliability level rl

#pragma resilient(w, rl)
for (i=0; i<n; i++)
    v[i] = f(w[i]);

Application Characterization

• How does the imprecision on inputs and internal data reflect on the outputs?

• Classify data structures according to their impact on the outputs

Exploitation:

• Storing less sensitive data on energy efficient memories (low voltage, low refresh rate, ...)

• Optimizing communication of less sensitive data (unreliable communications, lossy compression, ...)

Experiments

• Two voltage swing levels
  – Nominal 1.1 V → BER: 10^-17, Ebit: 512 fJ
  – Low 0.6 V → BER: 10^-6, Ebit: 152 fJ
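The trade-off between the two swing levels can be worked out directly from the figures above. The helper functions below are a small sketch (their names are hypothetical); the only inputs are the energy per bit and BER values quoted on this slide.

```c
/* Energy (in fJ) spent moving `bits` bits at `e_bit_fj` fJ per bit. */
double link_energy_fj(double bits, double e_bit_fj)
{
    return bits * e_bit_fj;
}

/* Expected number of corrupted bits for a given bit error rate. */
double expected_bit_errors(double bits, double ber)
{
    return bits * ber;
}

/* Fraction of energy saved by switching from the nominal to the low
 * swing: 1 - E_low / E_nominal. */
double energy_saving_fraction(double e_nominal_fj, double e_low_fj)
{
    return 1.0 - e_low_fj / e_nominal_fj;
}
```

For example, `energy_saving_fraction(512.0, 152.0)` evaluates to 1 - 152/512 = 0.703125, i.e. about 70% less energy per bit, while `expected_bit_errors(1e6, 1e-6)` shows that the low swing costs roughly one flipped bit per million transmitted.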

• JPEG encoding pipeline (AXBench)

[Figure: JPEG pipeline stages: Level Shift, DCT, Quantize, Entropy Encode; tables: Quantizer Table, Huffman Table]

UINT8* encodeMcu(UINT32 imageFormat, UINT8 *outputBuffer)
{
    levelShift(Y1);
    dct(Y1);
    quantization(Y1, ILqt);
    outputBuffer = huffman(1, outputBuffer);
    return outputBuffer;
}

UINT8* encodeMcu(UINT32 imageFormat, UINT8 *outputBuffer)
{
    #pragma resilient_load(Y1, rl_load)
    levelShift(Y1);
    ...
}

[Figure: JPEG pipeline with Memory; the annotated transfer is labeled rl_load]

UINT8* encodeMcu(UINT32 imageFormat, UINT8 *outputBuffer)
{
    #pragma resilient_store(Y1, rl_store)
    levelShift(Y1);
    ...
}

[Figure: JPEG pipeline with Memory; the annotated transfer is labeled rl_store]

UINT8* encodeMcu(UINT32 imageFormat, UINT8 *outputBuffer)
{
    #pragma resilient(Y1, rl)
    levelShift(Y1);
    ...
}

[Figure: JPEG pipeline with Memory; both annotated transfers are labeled rl]

Approximation Profiles

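[Figure: JPEG pipeline (Level Shift, DCT, Quantize, Entropy Encode, Quantizer Table, Huffman Table) with two memories, Mem 1 and Mem 2]

[Figure: NoC mapping: Level Shifter, DCT, Quantizer, Entropy Encoder, and two memory controllers (MC), each attached to a router (R); memories Mem 1 and Mem 2]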

Evaluation Flow

[Figure: flowchart with the following blocks]

• Application → Resilient data selection → Annotated application
• Resilience level selection
• Full Simulation (MIT Graphite) → Memory reference trace
• NoC architecture → Energy estimation (Noxim) → Communication energy
• Error injection → Perturbed application → Execution → Imprecise results
• Execution → Exact results
• Comparison → Quality metric
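The error injection and comparison steps of the flow can be sketched as follows. This is an illustrative model, not the tooling used in the experiments: it assumes independent bit flips drawn with the C library `rand()`, and it computes the mean squared error, whose square root is the RMSE used as image-diff metric in the plots. Function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Flip each bit of buf independently with probability `ber`
 * (assumed error model). Returns the number of flipped bits. */
size_t inject_bit_errors(uint8_t *buf, size_t len, double ber)
{
    size_t flipped = 0;

    for (size_t i = 0; i < len; i++) {
        for (int b = 0; b < 8; b++) {
            /* u is uniform in [0, 1) */
            double u = (double)rand() / ((double)RAND_MAX + 1.0);
            if (u < ber) {
                buf[i] ^= (uint8_t)(1u << b);
                flipped++;
            }
        }
    }
    return flipped;
}

/* Mean squared error between exact and imprecise results.
 * The RMSE is the square root of this value. */
double mean_squared_error(const uint8_t *exact, const uint8_t *approx,
                          size_t len)
{
    double acc = 0.0;

    for (size_t i = 0; i < len; i++) {
        double d = (double)exact[i] - (double)approx[i];
        acc += d * d;
    }
    return len ? acc / (double)len : 0.0;
}
```

In a flow like the one above, a copy of the data transferred over an approximate link would be passed through `inject_bit_errors` with the BER of the chosen reliability level, the application would run on both copies, and the two outputs would then be compared with the quality metric.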

[Figures: ten configurations of the JPEG pipeline (Level Shift, DCT, Quantize, Entropy Encode, Quantizer Table, Huffman Table, Mem 1, Mem 2), from Conf 0 (gold) to Conf 9. Legend: Nominal (high energy, high reliability), Approx (low energy, low reliability).]

[Figure: Image diff (RMSE, 0 to 0.0006) and Normalized energy (0 to 1) for Configurations 0 to 9]

Sensitivity Analysis

[Figure: Sensitivity (0 to 0.00030) for each data flow: in Y1/levelShift, out Y1/levelShift, in Y1/dct, out Y1/dct, in Y1/quantization, in ILqt/quantization, out Temp/quantization, in Temp/huffman, out outputBuffer/huffman]

[Figure: Normalized energy (0 to 1.20) vs. Image diff (RMSE, 0 to 6.0E-4)]

Next Step: On-Chip Wireless Communications

V. Catania, A. Mineo, S. Monteleone, M. Palesi, and D. Patti, “Improving energy efficiency in wireless network-on-chip architectures,” ACM Journal on Emerging Technologies in Computing Systems, vol. 14, no. 1, 2017.

Tuning Transmitting Power

• High BER as compared to wired NoC
  – 10^-9 vs. 10^-14

• General approach
  – Increasing the transmitting power to compensate for the attenuation introduced by the wireless medium

• Proposed approach
  – Tuning the transmitting power based on the reliability level of the data currently being transmitted

Tunable Transmitting Power

Zigzag antenna modeled with Ansoft HFSS to compute attenuation (16 Gbps)

Variable Power Amplifier

• S. Kaushik, M. Agrawal, H. K. Mondal, S. H. Gade, and S. Deb, “Path loss-aware adaptive transmission power control scheme for energy-efficient wireless noc,” in International Midwest Symposium on Circuits and Systems (MWSCAS), Aug. 2017, pp. 132–135.

• A. Mineo, M. Palesi, G. Ascia, and V. Catania, “Exploiting antenna directivity in wireless noc architectures,” Microprocessors and Microsystems, vol. 43, pp. 59–66, 2016.

Simulation Setup

• Two transmission profiles:

  • (normal) BER 10e-12 → 1.47 pJ/bit

  • (approximate) BER 10e-6 → 1 pJ/bit

• Wireless Interfaces placement same as Memory Controllers (mesh corners)

• 8 × 8 mesh-based NoC architecture simulated by using the Graphite Multicore Simulator with the following parameters:

[Table: Graphite simulation parameters]
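To get a feel for the two transmission profiles listed above, the sketch below estimates the wireless transmission energy when only a fraction of the traffic may use the approximate profile. The per-bit energies are the ones from the setup (1.47 pJ/bit and 1 pJ/bit); the function names and the `approx_fraction` parameter are hypothetical, and the model ignores everything except the per-bit transmission energy.

```c
/* Energy per bit of the two transmission profiles, in pJ. */
#define E_NORMAL_PJ 1.47
#define E_APPROX_PJ 1.00

/* Wireless transmission energy (pJ) for `total_bits` bits when a
 * fraction `approx_fraction` (clamped to [0, 1]) of them is sent with
 * the approximate profile and the rest with the normal one. */
double wireless_energy_pj(double total_bits, double approx_fraction)
{
    if (approx_fraction < 0.0)
        approx_fraction = 0.0;
    if (approx_fraction > 1.0)
        approx_fraction = 1.0;

    double approx_bits = total_bits * approx_fraction;
    double normal_bits = total_bits - approx_bits;

    return approx_bits * E_APPROX_PJ + normal_bits * E_NORMAL_PJ;
}

/* Relative saving with respect to sending everything with the normal
 * profile. */
double wireless_saving_fraction(double approx_fraction)
{
    return 1.0 - wireless_energy_pj(1.0, approx_fraction) / E_NORMAL_PJ;
}
```

Under this simple model the saving grows linearly with the share of annotated traffic, so the benefit depends on how much of the communication involves error-tolerant data.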

Representative Applications

Application: streamcluster
Description: an RMS kernel developed by Princeton University that solves the online clustering problem.
Approximated regions: regions of 256 bytes required for storing the 64 dimensions of each point, each encoded as a floating point value of 4 bytes, for a total of 8192 regions.

Application: canneal
Description: developed by Princeton University, it uses cache-aware simulated annealing (SA) to minimize the routing cost of a chip design.
Approximated regions: the annotation has been performed on the netlist elements, for a total of 160,000 instances of 64-byte netlist elements.

Application: blackscholes
Description: an Intel RMS benchmark that calculates prices for a portfolio of European options analytically with the Black-Scholes partial differential equation.
Approximated regions: two data structures have been annotated: optiondata, a 36-byte floating point structure, and prices (4-byte floating point), for a total of 147,456 bytes and 16,384 bytes, respectively.

Application: radiosity
Description: computes the equilibrium distribution of light in a scene using the hierarchical diffuse radiosity method.
Approximated regions: elemvertex buf.col, a data structure encoding the three RGB components as 4-byte floating point values, and elemvertex buf.vertex, a data structure encoding the 3-dimensional coordinates of each vertex of the polygons describing the 3D model of the scene. Each of these two structures occupies 12 bytes, for a total of 65,535 regions and 786,420 annotated bytes each.

Evaluation Flow

Four scenarios:

1. NoC
2. WiNoC
3. Approx. NoC
4. Approx. WiNoC

Results

∗ All energy values are normalized with respect to the wired NoC energy consumption.

Results – Performance Metrics

Conclusions

• Approximate communication technique for improving the energy efficiency of WiNoC architectures.
  • Dynamic link voltage swing (NoC links)
  • Dynamic transmitting power modulation (wireless communications)

• Pragma-based annotation of the application code
  • Load- and store-induced communications related to error-tolerant data

• Assessment on a set of representative benchmarks
  • Energy saving versus application accuracy trade-off.
  • Up to 30% of total communication energy saving has been observed without any appreciable impact on the accuracy metrics

Future Developments

• Generalize & Automate in order to reduce the required knowledge about the Application

• A methodology to identify approximable communication flows

• Automated choice of the most efficient approximation technique (reduced bits representation, reduced iterations, etc.)

• Automatic exploration loop

Bibliography

• Vincenzo Catania, Andrea Mineo, Salvatore Monteleone, Maurizio Palesi, and Davide Patti. 2016. Cycle-Accurate Network on Chip Simulation with Noxim. ACM Trans. Model. Comput. Simul. 27, 1, Article 4 (August 2016), 25 pages. DOI: https://doi.org/10.1145/2953878

• Improving energy efficiency in wireless network-on-chip architectures, V Catania, A Mineo, S Monteleone, M Palesi, D Patti, ACM Journal on Emerging Technologies in Computing Systems (JETC) 14 (1), 9

• S. Kaushik, M. Agrawal, H. K. Mondal, S. H. Gade, and S. Deb, “Path loss-aware adaptive transmission power control scheme for energy-efficient wireless noc,” in International Midwest Symposium on Circuits and Systems (MWSCAS), Aug. 2017, pp. 132–135.

• C. Roth, H. Bucher, S. Reder, F. Buciuman, O. Sander, and J. Becker. 2013. A SystemC modeling and simulation methodology for fast and accurate parallel MPSoC simulation. In Integrated Circuits and Systems Design (SBCCI), 2013 26th Symposium on. 1–6. DOI:http://dx.doi.org/10.1109/SBCCI.2013.6644853

• S. Deb, K. Chang, M. Cosic, A. Ganguly, P. P. Pande, D. Heo, and B. Belzer, “Enhancing performance of network-on-chip architectures with millimeter-wave wireless interconnects,” in IEEE International Conference on Application-specific Systems Architectures and Processors, 2010, pp. 73–80.

• E. Miller, H. Kasture, G. Kurian, C. Gruenwald, N. Beckmann, C. Celio, J. Eastep, and A. Agarwal, “Graphite: A distributed parallel simulator for multicores,” in High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on. IEEE, 2010, pp. 1–12.
