Design Methodology to Explore Hybrid Approximate Adders

JOURNAL OF RESOURCE MANAGEMENT

AND TECHNOLOGY ISSN NO: 0745-6999

Vol 11, Issue4/ 2020

Page No:267

Design Methodology to Explore Hybrid Approximate Adders for

Energy-Efficient Image and Video Processing Accelerators 1V.Dushyanth Kumar Reddy, 2K.Naga Sankar Reddy

1M.Tech Student, 2Assistant Professor

Department of ECE

Global college Of Engineering and Technology, Kadapa

ABSTRACT

This paper proposes a new design

methodology to explore the state-of-the-art

approximate adders for accelerator architectures

conceived in the realm of multiplier-less multiple

constant multiplication optimization problem. The

proposed methodology is composed of: 1) a search

heuristic to seek faster and feasible approximate

configurations for the architectures under evaluation;

2) low-power techniques regarding hybrid

approximate adders design for accelerators based on

trees of shift-and-add operations; 3) high-

performance evaluation by exploring parallel prefix

adders and low power analysis through the use of the

adder optimized by a commercial synthesis tool in

the precise part of the approximate adders; and 4)

energy efficiency analysis by considering both the

approximate techniques and voltage over scaling

estimation. Furthermore, improvements are proposed

for the state-of-the-art approximate adders under

evaluation in this paper. Two case studies are

considered to assess the proposed methodology: 1)

Gaussian image filter and 2) Sobel operator. The

precise and approximate image filters were described

in very high-speed integrated circuits hardware

description language regarding the proposed

methodology. Results are shown after synthesis to a

45-nm standard cell-based technology, where energy

reductions ranging from 7.7% up to 73.2% were

experienced for multiple levels of quality considering

the applications under analysis.

1. INTRODUCTION

1.1 MOTIVATION

As the scale of integration keeps growing,

more and more sophisticated signal processing

systems are being implemented on a VLSI chip.

These signal processing applications not only

demand great computation capacity but also consume

considerable amount of energy. While performance

and Area remain to be the two major design tolls,

power consumption has become a critical concern in

today’s VLSI system design[]. The need for low-

power VLSI system arises from two main forces.

First, with the steady growth of operating frequency

and processing capacity per chip, large currents have

to be delivered and the heat due to large power

consumption must be removed by proper cooling

techniques. Second, battery life in portable electronic

devices is limited. Low power design directly leads to

prolonged operation time in these portable devices.

Addition usually impacts widely the overall

performance of digital systems and a crucial

arithmetic function. In electronic applications adders

are most widely used. Applications where these are

used are multipliers, DSP to execute various

algorithms like FFT, FIR and IIR. Wherever concept

of multiplication comes adders come in to the picture.

As we know millions of instructions per second are

performed in microprocessors. So, speed of operation

is the most important constraint to be considered

while designing multipliers. Due to device portability

miniaturization of device should be high and power

consumption should be low. Devices like Mobile,

Laptops etc. require more battery backup.

So, a VLSI designer has to optimize these

three parameters in a design. These constraints are

very difficult to achieve so depending on demand or

application some compromise between constraints

has to be made. Ripple carry adders exhibits the most

compact design but the slowest in speed. Whereas

carry look ahead is the fastest one but consumes more

area. Carry select adders act as a compromise

between the two adders. In 2002, a new concept of

hybrid adders is presented to speed up addition

process by Wang et al. that gives hybrid carry look-

ahead/carry select adders design. In 2008, low power

multipliers based on new hybrid full adders is

presented.

1.2 NEED FOR LOW POWER DESIGN

The design of portable devices requires

consideration for peak power consumption to ensure

reliability and proper operation. However, the time

averaged power is often more critical as it is linearly

related to the battery life. There are four sources of

power dissipation in digital CMOS circuits:

switching power, short-circuit power, leakage power

and static power. The following equation describes

these four components of power:




Page No:268

Pswitching is the switching power. For a

properly designed CMOS circuit, this power

component usually dominates, and may account for

more than 90% of the total power. denotes the

transition activity factor, which is defined as the

average number of power consuming transitions that

is made at a node in one clock period. Vs is the

voltage swing, where in most cases it is the same as

the supply voltage, Vdd. CL is the node capacitance.

It can be broken into three components, the gate

capacitance, the diffusion capacitance, and the

interconnect capacitance. The interconnect

capacitance is in general a function of the placement

and routing. fck is the frequency of clock. The

switching power for static CMOS is derived as

follows.

During the low to high output transition, the

path from Vdd to the output node is con-ducting to

charge CL. Hence, the energy provided by the supply

source is

where is the current drawn from the supply. Here, R is the resistance of the path between

the Vdd and the output node. Therefore, the energy can be rewritten as

During the high to low transition, no energy is supplied by the source. Hence, the average power consumed during

one clock cycle is

Eq. (2.4) and Eq. (2.5) estimate the energy and the power of a single gate only. From a system point of view, is

used to account for the actual number of gates switching at a point in time.

1.2.1 LOW VOLTAGE

Power consumption is linearly proportional to

voltage swing (Vs) and supply voltage (Vdd) as

indicated in Eq. (2.5). For most CMOS logic

families, the swing is typically rail-to-rail. Hence,

power consumption is also said to be proportional to

the square of the supply voltage, Vdd. Therefore,

lowering the Vdd is an efficient approach to reduce

both energy and power, presuming that the signal

voltage swing can be freely chosen. This is, however,

at the expense of the delay of circuits. The delay, td,

can be shown to be proportional to

.The exponent is

between 1 and 2. It tends to be closer to 1 for MOS

transistors that are in deep sub-micrometer region,

where carrier velocity saturation may occur.

increases toward 2 for longer channel transistors.

2. LITERATURE SURVEY

Addition is the most common and often used

arithmetic operation on microprocessor, digital signal

processor, especially digital computers. Also, it

serves as a building block for synthesis all other

arithmetic operations. Therefore, regarding the

efficient implementation of an arithmetic unit, the

binary adder structures become a very critical

hardware unit. In any book on computer arithmetic,

someone looks that there exists a large number of

different circuit architectures with different

performance characteristics and widely used in the

practice. Although many researches dealing with the

binary adder structures have been done, the studies

based on their comparative performance analysis are

only a few.

This is about a digital circuit. For an

electronic circuit that handles analog signals

see Electronic mixer.In electronics,

an adder or summer is a digital circuit that

http://en.wikipedia.org/wiki/Electronic_mixer

http://en.wikipedia.org/wiki/Digital_circuit




Page No:269

performs addition of numbers. In

many computers and other kinds of processors,

adders are used not only in the arithmetic logic

unit(s), but also in other parts of the processor, where

they are used to calculate addresses, table indices,

and similar.

Although adders can be constructed for

many numerical representations, such as binary-

coded decimal or excess-3, the most common adders

operate on binary numbers. In cases where two's

complement or ones' complement is being used to

represent negative numbers, it is trivial to modify an

adder into an adder–subtractor. Other signed number

representations require a more complex adder.

Different type of adders as follows

3.EXISTING METHOD

3.1 CARRY SELECT ADDER

DESIGN of area- and power-efficient high-

speed data path logic systems are one of the most

substantial areas of research in VLSI system design.

In digital adders, the speed of addition is limited by

the time required to propagate a carry through the

adder. The sum for each bit position in an elementary

adder is generated sequentially only after the

previous bit position has been summed and a carry

propagated in to the next position .The CSLA is used

in many computational systems to alleviate the

problem of carry propagation delay by independently

generating multiple carries and then select a carry to

generate the sum. However ,the CSLA is not area

efficient because it uses multiple pairs of Ripple

Carry Adders (RCA) to generate partial sum and

carry by considering carry input cin=0 and cin=1,

then the final sum and carry are selected by the

multiplexers (mux).The basic idea of this work is to

use Binary to Excess-1 Converter(BEC) instead of

RCA with cin=1 in the regular CSLA to achieve

lower area and power consumption The main

advantage of this BEC logic comes from the lesser

number of logic gates than the n-bitFull Adder (FA)

structure.

The carry select adder comes in the category

of conditional sum adder. Conditional sum adder

works on some condition. Sum and carry are

calculated by assuming input carry as 1 and 0 prior

the input carry comes. When actual carry input

arrives, the actual calculated values of sum and carry

are selected using a multiplexer. The conventional

carry select adder consists of k/2 bit adder for the

lower half of the bits i.e. least significant bits and for

the upper half i.e. most significant bits (MSB’s) two

k/ bit adders. In MSB adders one adder assumes carry

input as one for performing addition and another

assumes carry input as zero. The carry out calculated

from the last stage i.e. least significant bit stage is

used to select the actual calculated values of output

carry and sum. The selection is done by using a

multiplexer. This technique of dividing adder in to

stages increases the area utilization but addition

operation fastens.

Fig. Regular 16-b SQRT CSLA.

http://en.wikipedia.org/wiki/Addition

http://en.wikipedia.org/wiki/Computer

http://en.wikipedia.org/wiki/Arithmetic_logic_unit

http://en.wikipedia.org/wiki/Arithmetic_logic_unit

http://en.wikipedia.org/wiki/Binary-coded_decimal

http://en.wikipedia.org/wiki/Binary-coded_decimal

http://en.wikipedia.org/wiki/Excess-3

http://en.wikipedia.org/wiki/Binary_numeral_system

http://en.wikipedia.org/wiki/Two%27s_complement

http://en.wikipedia.org/wiki/Two%27s_complement

http://en.wikipedia.org/wiki/Ones%27_complement

http://en.wikipedia.org/wiki/Adder%E2%80%93subtractor

http://en.wikipedia.org/wiki/Signed_number_representations

http://en.wikipedia.org/wiki/Signed_number_representations




Page No:270

3.1.1 MULTIPLEXER

In electronics, a multiplexer (or MUX) is a

device that selects one of several analog or digital

input signals and forwards the selected input into a

single line.[1] A multiplexer of 2n inputs has n select

lines, which are used to select which input line to

send to the output.[2] Multiplexers are mainly used to

increase the amount of data that can be sent over the

network within a certain amount of time and

bandwidth.[1] A multiplexer is also called a data

selector. They are used in CCTV, and almost every

business that has CCTV fitted, will own one of these.

An electronic multiplexer makes it possible for

several signals to share one device or resource, for

example one A/D converter or one communication

line, instead of having one device per input signal.

On the other hand, a demultiplexer (or

demux) is a device taking a single input signal and

selecting one of many data-output-lines, which is

connected to the single input. A multiplexer is often

used with a complementary demultiplexer on the

receiving end.[1]

An electronic multiplexer can be considered

as a multiple-input, single-output switch, and a

demultiplexer as a single-input, multiple-output

switch.[3] The schematic symbol for a multiplexer is

an isosceles trapezoid with the longer parallel side

containing the input pins and the short parallel side

containing the output pin.[4] The schematic on the

right shows a 2-to-1 multiplexer on the left and an

equivalent switch on the right. The wire connects the

desired input to the output.

In digital circuit design, the selector wires

are of digital value. In the case of a 2-to-1

multiplexer, a logic value of 0 would connect to

the output while a logic value of 1 would

connect to the output. In larger multiplexers, the

number of selector pins is equal

to where is the number of inputs.

For example, 9 to 16 inputs would require

no fewer than 4 selector pins and 17 to 32 inputs

would require no fewer than 5 selector pins. The

binary value expressed on these selector pins

determines the selected input pin.

A 2-to-1 multiplexer has a boolean

equation where and are the two inputs, is the

selector input, and is the output:

This truth table shows that

when then but

when then . A straightforward

realization of this 2-to-1 multiplexer would need

2 AND gates, an OR gate, and a NOT gate.

Larger multiplexers are also common and,

as stated above, require selector pins

for inputs. Other common sizes are 4-to-1, 8-

to-1, and 16-to-1. Since digital logic uses binary

values, powers of 2 are used (4, 8, 16) to

maximally control a number of inputs for the

given number of selector inputs.

4-to-1 mux

http://en.wikipedia.org/wiki/Digital_circuit

http://en.wikipedia.org/wiki/Boolean_equation

http://en.wikipedia.org/wiki/Boolean_equation

http://en.wikipedia.org/wiki/File:Multiplexer_2-to-1.svg

http://en.wikipedia.org/wiki/File:Multiplexer_4-to-1.svg




Page No:271

The boolean equation for a 4-to-1 multiplexer is:

MODIFIED REGULAR CSLA USING BEC

The main idea of this work is to use BEC instead of

the RCA with Cin=1in order to reduce the area and

power consumption of the regular CSLA. To replace

the n bit RCA, an n+1 bit BEC is required. A

structure and the function table of a 4 bit BEC are

show in Fig and Table II respectively

.

Figure: 4-bit BEC

Fig. above illustrates how the basic function of the

CSLA is obtained by using 4-bit BEC together with

the mux. One input of the 8:4 mux gets as it input(

(B3, B2, B1, and B0) and another input of the mux is

the BEC output. This produces the two possible

partial results in parallel and the mux is used to select

either the BEC output or the direct inputs according

to the control signal Cin. The importance of the BEC

logic stems from the large silicon area reduction

when the CSLA with large number of bits are

designed. The Boolean expressions of the 4-bit BEC

is listed as (note the functional symbols NOT, &

XOR)

DELAY AND AREA EVALUATION OF

MODIFIED CSLA

The structure of the proposed 16-b SQRT

CSLA using BEC for RCA with Cin=1 to optimize

the area and power is shown in above fig. We again

split the structure into five groups. The delay and

area estimation of each group are shown in Fig




Page No:272

below. The steps leading to the evaluation are given here.

Figure: Delay and area evaluation of modified SQRT CSLA: (a) group2, (b) group3, (c) group4, and (d) group5. H

is a Half Adder.

4. PROPOSED METHOD

THE semiconductor industry faces

challenges at each new Complementary Metal Oxide

Semiconductor (CMOS) technology node. Power

density increase has been experienced due to the

observation that Dennard scaling is not feasible

anymore [1]. Furthermore, power and thermal walls

bring much more effort to designers, so that digital

CMOS design is facing the so-called “Dark Silicon

Era” even considering recent Fin Field-Effect

Transistor (FinFET) technologies [2]. Therefore, the

current and future computing scenario is

characterized by the demand for numerous and

ubiquitous compute-intensive applications in

constrained power budget digital devices. Based on

that, energy-efficient techniques (i.e. maximize the

number of arithmetic operations per energy unit) are

paramount to cope with the previously observed

challenges. According to [3], two trending energy-

efficient techniques are listed as follows: (i)

accelerator-rich architectures based on Application

Specific Integrated Circuits (ASICs) and (ii)

Approximate Computing (AC). Architectural

heterogeneity and the use of ASIC accelerators are

energy-efficient techniques to execute the most

compute-intensive kernels of an application [3]. On

the other hand, the remaining tasks which demand

less energy consumption can be scheduled for

general-purpose processors. As a result, general-

purpose processors’ workload is alleviated due to the

use of energy-efficient specific processing cores.




Page No:273

Power-management schemes can be implemented to

power off accelerators or general-purpose cores when

not in use, thus respecting the power and thermal

constraints. The works in [4] and [5] show that

despite the challenges of architectural integration

introduced by this new design paradigm, these

accelerator-rich architectures play an essential role in

energy efficiency for recent applications. For

example, Hameed et al. [6] show that an ASIC

solution is 500× more energy-efficient than a four

core general-purpose processor when considered

H.264 video coding application. One essential

approach to conceive ASIC implementation of digital

filters and transforms is to implement the

multiplications by constants in the light of MMCM

problem formulated in [7] and [8]. In other words,

these architectures are designed by adopting the use

of additions, subtractions, and shift in an optimized

configuration.

The approximate computing paradigm emerged to

increase performance and to reduce power dissipation

[9]. The critical approach in approximate hardware is

to reduce the computation accuracy in favor of

energy-efficiency. In circuit level design, this is

performed by designing simpler circuits to speed up

the critical path timing and/or to consume less power.

Approximate computing techniques take advantage

of approximation-tolerant applications which do not

need high accuracy all the time but only “good

enough” or “sufficiently good” results for output

perceptual quality. In [10] is stated the following

properties to define an approximation-resilient

application: (i) there is not a golden or accurate

result, but a range of acceptable ones and (ii)

robustness to input noisy data. For example,

multimedia applications (e.g., video coding, audio

filtering, image processing, and so on), highly

demanded by current portable devices, are

intrinsically related to human senses. The multimedia

signals are, in fact, approximation-tolerant

applications, since in [11] is stated that human senses

process analog information and have difficulty to

realize the negative impact of digital approximations.

It means that it is possible to adopt approximate

computing techniques to improve energy efficiency

in multimedia applications by adequately exploring

the user experience at different profiles of quality.

The excellent point of approximate computing is that

this paradigm can be adopted at any abstraction level

from transistor-level up to software application [12].

Furthermore, approximate computing can be an

additive design component for accelerator-rich

architectures. One can consider that the use of

approximate hardware accelerators brings further

energy efficiency improvements [12]. In the

arithmetic layer of abstraction, works in [11] and

[13]–[20] have proposed approximate adders. Adders

are basic building blocks for several compute-

intensive multimedia applications. Therefore,

approximate adders could drive energy efficiency for

recent digital compute-intensive and approximation-

tolerant applications. Based on that, this work

proposes a design methodology to explore state-of-

the-art approximate adders for ASIC implementation

of add-and-shift accelerators for image and video

processing. Previous works in [22] and [23]

examined the use of the state-of-the-art approximate

adders for image filters. To explore approximation

for the architectures, they adopted simulation-based

methodologies in which search heuristics are

implemented to seek for energy-efficient approximate

configurations. The approximate adders taken by the

previously mentioned related works are the

Approximate Mirror Adder (AMA) [13] and the

Error-Tolerant Adder I (ETAI) [11]. They are divided

into precise and approximate parts. In both the works,

only the Ripple Carry Adder (RCA) topology is

explored in the precise block of those approximate

adders. The same observation is valid for the case

study explored in [11]: the precise block of ETA-I is

only implemented with RCA topology. This work

presents four novel contributions in the scope of

approximate computing: 1) A faster search heuristic

and simulation-based methodology to configure

feasible configurations with evaluation of multiple

levels of quality; 2) A high-performance exploration

of approximate hardware accelerators through the use

of PPAs and low power evaluation through the

optimized adder from the commercial synthesis tool

in the precise part of the approximate adders; 3)

Combination of different approximate adders to

compose hybrid adders for the add-and-shift

architectures; Energy-efficiency 4) analysis based on

the approximate configurations and VOS estimation

due to the insertion of PPAs. Results show that our

approach substantially reduces energy consumption

ranging from 7.7% up to 73.2% for different levels of

quality.

BACKGROUND ON APPROXIMATE

ADDERS, PARALLEL PREFIX ADDERS, AND

IMAGE FILTERS

A. Approximate Adders

The approximate adders can be classified as

computational performance- and power-oriented

designs. The former is related to adders divided into

m independent blocks or sub-adders to speed up the

critical path timing. The claim is that, for random and

uniformly distributed pairs of operands, more

extended carry propagation rarely occurs. Based on




Page No:274

that, additional logic is necessary to speculate carry-

in for each sub-adder, since this class of approximate

adder breaks the carry propagation in many parts.

Examples of adders which improve computational

performance are the Error-Tolerant Adder II [15],

Error Tolerant Adder IV [14], and the Almost Correct

Adder (ACA) [16]. This class of approximate adders

is also characterized by the presence of infrequent

and high magnitude sum errors. Therefore, the works

in [16]–[19] proposed accuracy configurable adders

to cope with this error characteristics. On the other

hand, more logic is added to detect and correct the

sum errors.

A different philosophy is to propose power-

oriented adders which generally are divided into two

parts: (i) the least significant approximate part and

(ii) the most significant accurate part. Examples of

power-oriented approximate adders can be observed

in [11], [13], and [20]. The principal idea in the

approximate part is to replace the full adder cells by

simpler adder circuits. Therefore, power reduction is

the main objective of this class of adders. Besides,

these adders also tend to reduce critical path timing,

because in the approximate part there is not carry

propagation scheme. One can observe that the

classical truncation is a type of power-oriented adder

which truncates least significant full adder cells. This

class of approximate adders is also characterized by

the presence of frequent and low magnitude sum

errors. Such errors are of low magnitude because the

bit-width of the approximate part can be controlled

through an approximate parameter k. In this work,

the proposed approach is to explore the power-

oriented a design, this is performed by designing

simpler circuits to speed up the critical path timing

and/or to consume less power. Approximate

computing techniques take advantage of

approximation-tolerant applications which do not

need high accuracy all the time but only “good

enough” or “sufficiently good” results for output

perceptual quality. In [10] is stated the following

properties to define an approximation-resilient

application: (i) there is not a golden or accurate

result, but a range of acceptable ones and (ii)

robustness to input noisy data. For example,

multimedia applications (e.g., video coding, audio

filtering, image processing, and so on), highly

demanded by current portable devices, are

intrinsically related to human senses.

The multimedia signals are, in fact,

approximation-tolerant applications, since in [11] is

stated that human senses process analog information

and have difficulty to realize the negative impact of

digital approximations. It means that it is possible to

adopt approximate computing techniques to improve

energy efficiency in multimedia applications by

adequately exploring the user experience at different

profiles of quality. The excellent point of

approximate computing is that this paradigm can be

adopted at any abstraction level from transistor-level

up to software application [12]. Furthermore,

approximate computing can be an additive design

component for accelerator-rich architectures. One can

consider that the use of approximate hardware

accelerators brings further energy efficiency

improvements [12]. In the arithmetic layer of

abstraction, works in [11] and [13]–[20] have

proposed approximate adders. Adders are basic

building blocks for several compute-intensive

multimedia applications.

Therefore, approximate adders could drive

energy efficiency for recent digital compute-intensive

and approximation-tolerant applications. Based on

that, this work proposes a design methodology to

explore state-of-the-art approximate adders for ASIC

implementation of add-and-shift accelerators for

image and video processing. Previous works in [22]

and [23] examined the use of the state-of-the-art

approximate adders for image filters. To explore

approximation for the architectures, they adopted

simulation-based methodologies in which search

heuristics are implemented to seek for energy-

efficient approximate configurations. The

approximate adders taken by the previously

mentioned related works are the Approximate Mirror

Adder (AMA) [13] and the Error-Tolerant Adder I

(ETAI) [11]. They are divided into precise and

approximate parts. In both the works, only the Ripple

Carry Adder (RCA) topology is explored in the

precise block of those approximate adders. The same

observation is valid for the case study explored in

[11]: the precise block of ETA-I is only implemented

with RCA topology. This work presents four novel

contributions in the scope of approximate computing:

1) A faster search heuristic and simulation-based

methodology to configure feasible configurations

with evaluation of multiple levels of quality;

2) A high-performance exploration of approximate

hardware accelerators through the use of PPAs and

low power evaluation through the optimized adder

from the commercial synthesis tool in the precise part

of the approximate adders;

3) Combination of different approximate adders to

compose hybrid adders for the add-and-shift

architectures; Energy-efficiency

4) analysis based on the approximate configurations

and VOS estimation due to the insertion of PPAs.




Page No:275

BACKGROUND ON APPROXIMATE ADDERS,

PARALLEL PREFIX ADDERS, AND IMAGE

FILTERS

A. Approximate Adders

The approximate adders can be classified as

computational performance- and power-oriented

designs. The former is related to adders divided into

m independent blocks or sub-adders to speed up the

critical path timing. The claim is that, for random and

uniformly distributed pairs of operands, more

extended carry propagation rarely occurs. Based on

that, additional logic is necessary to speculate carry-

in for each sub-adder, since this class of approximate

adder breaks the carry propagation in many parts.

Examples of adders which improve computational

performance are the Error-Tolerant Adder II [15],

Error Tolerant Adder IV [14], and the Almost Correct

Adder (ACA) [16]. This class of approximate adders

is also characterized by the presence of infrequent

and high magnitude sum errors. Therefore, the works

in [16]–[19] proposed accuracy configurable adders

to cope with this error characteristics. On the other

hand, more logic is added to detect and correct the

sum errors. A different philosophy is to propose

power-oriented adders which generally are divided

into two parts: (i) the least significant approximate

part and (ii) the most significant accurate part.

Examples of power-oriented approximate adders can

be observed in [11], [13], and [20]. The principal idea

in the approximate part is to replace the full adder

cells by simpler adder circuits. Therefore, power

reduction is the main objective of this class of adders.

Besides, these adders also tend to reduce critical path

timing, because in the approximate part there is not

carry propagation scheme. One can observe that the

classical truncation is a type of power-oriented adder

which truncates least significant full adder cells. This

class of approximate adders is also characterized by

the presence of frequent and low magnitude sum

errors. Such errors are of low magnitude because the

bit-width of the approximate part can be controlled

through an approximate parameter k. In this work,

the proposed approach is to explore the power-

oriented adders to give priority to power-efficiency.

It is also ratified in [21] which states that the adders

focused on delay reduction cannot be used to explore

power-efficiency in structures which demand the

massive use of additions like the multipliers. Also,

the power-oriented approximate adders enable the

exploration of multiple conventional adder topologies

in the precise part, which is not true for the

approximate adders divided into many blocks.

Therefore, we consider in this study the

exploration of the fifth approximate version of AMA

[13] and the Error-Tolerant Adder I [11], because

they are also explored by related works [22], [23].

The former approximate adder is renamed to “Copy

adder” due to its copy function implemented by the

buffers. The approximate adders which are explored

in this study can be observed in Figure 1. The “Copy

adder” in Figure 1(a) has its k bits long approximate

part implemented by buffers to copy the operand a to

the sum. This procedure has 50% probability of

getting the correct sum for each bit position.

Furthermore, the carry-in estimation for the precise

part is implemented by the more straightforward

assignment of the input operand bit bk−1. This

procedure has 75% probability of getting a correct

carry-in estimation to the precise block. The use of

half adders implements the approximate part of the

ETAI in Figure 1(b). The sum is performed in the

non-conventional direction, (i.e. from the most

significant bit k−1 to the least significant bit position

0). The control logic block is conceived as follows: if

the first carry-generate c is equal to “1,” then all the

remaining least significant sum bits are set to “1.”

Otherwise, the sum result is the one computed by the

propagate signal. The carry-in to the precise part in

ETAI is statically set to “0.” This procedure has 50%

probability of getting the correct carry-in to the

precise part. One can observe that for both the

approximate adders, any conventional adder topology

can be implemented in the precise part. Related

works in [22] and [23] explore the use of the RCA. In

this work, we explore the RCA, the high performance

PPAs, and the low power adder fully optimized by

the commercial tool used in this study. That is why in

the next subsection a brief PPA overview is

developed.




Page No:276

Fig. 1. Approximate adders: (a) Copy adder; (b) ETAI.

Fig. 2. Parallel prefix adder structure.

B. Precise Adders

As previously mentioned, adders are fundamental

building blocks in a great variety of computational

applications. Based on that, many adder topologies

have been proposed to deal with the tradeoff between

power and computational performance. The RCA

topology is characterized to present low values of




Page No:277

power consumption, area, and computational

performance. Depending on the high-performance

application requirements and given that

computational complexity is increasing in nowadays

tasks, the critical approach is to accelerate the adder’s

critical path delay (i.e. carry propagation) at the

expense of higher area and power dissipation. Based

on that, the Parallel Prefix Adders were proposed to

deal with high-performance demands [24]. The carry

propagation structure in the PPA adders is

implemented by simple logic cells which tend to keep

a regular connection. Based on that, the sum

computation can be divided into pre-processing,

prefix computation and post-processing parts, as can

be seen in Figure 2. In the pre-processing part, the

propagate pi (i.e. ai ⊕bi) and generate gi (i.e ai∧bi)

signals are computed based on the input operands ai

and bi . In the prefix computation stage, the carry

computation is accelerated by the parallel

composition of the black cells which implement the

group propagate Pi: j and generate Gi: j signals as

well as the gray cells which compute the carry ci .

Finally, the post-processing step is given by the sum

si = pi ⊕ ci−1. 1) Use of PPA Adders on the Precise

Part of the Approximate Adders: Different

configurations between the black and gray cells can

be obtained. According to the PPAs taxonomy

presented in [25], the different prefix cells

configurations allow tradeoffs among (i) the number

of logic levels, (ii) the maximum fanout, and (iii) the

maximum number of horizontal wire tracks (i.e wire

density). All of these aspects affect the adder delay.

Based on that, PPA topologies proposed in [26]–[30]

are considered in this study. Their main

characteristics concerning logic levels, maximum

fanout, and the maximum number of wiring tracks

are presented in Table I [25]. As can be seen in Table

I, the Brent-Kung adder has the highest number of

levels, while presents low values for both fanout and

wire density. The Sklansky and Kogge-Stone have

the lowest number of logic levels, but the former

gives the highest fanout, and the latter has the worst

wire density due to the highest number of prefix

cells. The Han-Carlson is the hybrid solution between

the Brent-Kung and Kogge-Stone. Therefore, this

adder balance the tradeoff between the number of

logic levels and wiring tracks. The Ladner-Fischer

adder is the hybrid approach between Sklansky and

Brent-Kung so that the tradeoff is balanced between

the number of logic levels and the fanout.

PROPOSED DESIGN METHODOLOGY

This section presents the proposed design

methodology to explore our hybrid approximate

adders. The basic design flow is shown in Figure 3.

The first step is to simulate the application under

evaluation by adopting MatLab or C models

considering both the architectures and approximate

adders. The hybrid approximate adder models are

implemented by an overloaded function in MatLab

and C language. Therefore, add-and-shift filter

structures are developed in simulation scripts. The

function is called to perform the necessary additions,

with the appropriate parameters to select between the

type of hybrid approximate adder, the number of bits

of the approximate part, and so forth. The quality

metrics are generated by considering the use of real

test-cases. The next stage is related to the application

quality evaluation for all the exercised approximate

configurations. After that, different levels of quality

are selected to generate the Register Transfer Level

(RTL) designs automatically. The logic synthesis is

performed for each RTL file, by using the standard

cell technology library files (i.e.,.lib,.lef, cap tables,

and so forth). Then, the mapped gate-level netlist is

created to enable the next step of post-synthesis

simulation. During the simulation, standard cell

technology library files (i.e., the Verilog file with the

behavioral model of the standard cells) and real test

cases are used to capture switching activity which is

saved in Value Change Dump (VCD) or Toggle

Count Format (TCF) files. The Verilog gate-level

netlist, standard cell technology library files, and

VCD or TCF files are then used to estimate power. In

the next subsections the proposed hybrid approximate

adders, the accelerator architectures under analysis,

and the proposed heuristic adopted to perform

simulation are shown.




Page No:278

Fig. 3. Proposed design methodology.

A. Proposed Hybrid Approximate Adders for Shift-and-Add Architectures

The 5 × 5 Gaussian and the 3 × 3 Sobel image filter architectures under evaluation are similar to the ones adopted

Fig. 4. Gaussian image filter architecture.




Page No:279

Fig. 5. Gradient image filter architecture.

by Oliveira et al. [22]. The difference is observed in

the Gaussian architecture, where the partial terms of

the adder tree were reorganized to enable left shift

overlapping regions in the operands. It is performed

to leverage the power efficiency provided by the

proposed hybrid adders and to improve the proposed

search heuristic which will be presented in the next

subsection. The Gaussian and Gradient architectures

can be observed in the Figure 4 and Figure 5,

respectively [22]. One can observe that these

architectures are implemented by the shift operations,

adders, and subtractors. There are two observable

configurations in which the proposed hybrid

approximate adders can be adopted: (i) both the

operands present overlapped number of left shift

operations, (ii) one operand present excessive number

of left shift operations than the other operand. These

aspects can be observed in Figure 6. As can be seen

in Figure 6, if there is overlapping between the

number of left shifts in both the operands, then the

proposed approach considers the use of truncation

adder in the overlapped and least significant region.

The excessive amount of left shift operations in one

of the operands, when compared to the other can be

leveraged with the use of the Copy adder. One can

notice in the examples in Figure 6(a) and (b) that

these procedures do not produce sum errors. The

remaining most significant region in the adder can

further be explored with the state-of-the-art Copy

adder and ETAI (i.e., precise plus approximate

adder). The example in Figure 6(a) shows the

proposed hybrid scheme of copy-copy-truncation.

The operands are left shifted two and four times,

respectively. One can see that there are three

approximation parameters: (i) k1 = 2 which controls

the truncation in the overlapped shift region, (ii) k2 =

2 which controls the copy adder in the excessive shift

region, and (iii) k3 = 4 which controls the

approximate part of the copy adder in the non-shifted

region. Almost the same observations can be made in

Figure 6(b) for the configuration ETAI-

copytruncation. The only difference is that in the

non-shifted region the ETAI is adopted instead of the

Copy adder. For the ETAI, we adopted the

modification proposed by Kang et al. [23]: the use of

OR logic gates instead of XOR. It occurs because

there is not difference regarding produced sum result,

while the former gate has less area than the latter.

Besides, this work also considers the use of carry-in

estimation performed in Copy adder for the ETAI. As

previously mentioned, this procedure has more

probability of getting correct estimation than

statically set the carry-in to “0.” If a given adder of

the architecture has not left shifted operands, then the

approximation is not hybrid (i.e., k1 = 0 and k2 = 0).

On the other hand, the copy adder or ETAI can be

explored in this adder (i.e., k3 ≥ 0).

B. Proposed Search Heuristic

The exhaustive search in simulation-based

methodologies tends to be time consuming or

prohibitive to find the most energy-efficient

configuration. Therefore, the use of search heuristics

is essential in this scenario. As previously shown,

related works in [22] and [23] proposed search

heuristics to seek for energy-efficient accelerators. In

this work, the proposed approach is to first establish

the k1 and k2 parameters




Page No:280

Fig. 6. Proposed hybrid approximate adders. (a) copy-copy-truncation adder. (b) ETAI-copy-truncation adder.

for the overlapped and excessive shift regions. After

that, the next step is to seek different k3 parameters

for all the adders in the tree through an iterative

process. The algorithm for the heuristic is shown in

Algorithm 1. One can observe that the k1 and k2

parameters are fixed, while the k3 is iterated to seek

for different approximate configurations. During the

initialization step from lines 1 to 9, the k1 and k2

parameters are determined solely by the overlapped

regions previously shown in Figure 6. The iteration in

lines 10 to 12 is used to create and initialize the data

structure which stores the k3 values for each adder in

the filter architectures. The search for k3 parameters

is performed in lines 14 to 17. During the search, the

quality metric is stored after running the application

for each k3 parameter value. The application quality

is evaluated, and objective metric is calculated by

considering the use of real test cases.

EXTENSION METHOD

SIMPLE ACCURACY-RECONFIGURABLE ADDER

A. Preliminaries

Fig. 3. (a) Conventional full adder. (b) Carry-out selectable full adder. (c) Carry-in configurable full adder.




Page No:281

An N-bit adder operates on two addends A = (aN ,

aN−1,..., ai,..., a1) and B = (bN , bN−1,..., bi,..., b1).

For bit i, its carry-in is ci−1 and its carry-out is ci .

Defining the carry generate bit gi = ai · bi, propagate

bit pi = ai ⊕ bi , and kill bit ki = ¯ai · b¯i, the

conventional full adder computes the sum si and

carry ci according to

si = pi ⊕ ci−1 (1)

ci = gi + pi · ci−1. (2)

A gate level schematic of a conventional full adder is

provided in Fig. 3(a). A CRA is used to chain N bits

of conventional full adders together. By applying (2)

recursively, one can get

ci = gi + pi gi−1 +···+ g1 i k=2 pk + c0 i k=1 pk. (3)

This equation implies that ci can be computed

directly from g and p of all bits, without waiting for

the c of its lower bits to be computed. This

observation is the basis for CLA adder.

c prdt i = gi. (4)

For the LSB of the higher bit subadder, which is bit i

+ 1, its carry-out ci+1 can be computed using one of

two options: either by the conventional ci+1 = gi+1 +

pi+1 · ci , or by using the carry prediction as

ci+1 = gi+1 + pi+1 · c prdt i = gi+1 + pi+1 · gi. (5)

Fig. 4. Design of SARA

The selection between the two options is realized

using MUXes as in Fig. 4 and the multiplexer (MUX)

selection result is denoted as cî . Comparing (5) with

(3), we can see that the carry prediction is a

truncation-based approximation to carry

computation.1 Therefore, cî can be configured to

either accurate mode or approximation mode, that is

cî ← c prdt i , if approximation mode ci, if accurate

mode. (6)

It should be noted that the carry prediction c prdt i

reuses gi in an existing full adder instead of

introducing an additional dedicated circuit as in [18]

or Fig. 2. This prediction scheme makes a very

simple modification to the conventional full adder, as

shown in Fig. 3(b). One can connect cî to its higher

bit i + 1 to compute both carry ci+1 and sum si+1, as

in GDA [18] and RAP-CLA [19]. We suggest an

improvement over this approach by another simple

change as in Fig. 3(c), where si+1 is based on ci

instead of cî . Such approach can help reduce the

error rate in outputs when an incorrect carry is

propagated. Because the sum keeps accurate and the

carry will not be propagated when addends are

exactly the same. Moreover, out of all four

configurations of sum/carry calculation by

approximate/accurate carry-in, the most meaningful

way is to have sum bit calculated by accurate carry

and make carry bit configurable.

Therefore, sum si+1 is calculated directly by accurate

carry ci without the option of c prdt i . Applying this

in SARA as in Fig. 4, in the approximation mode,

computing sj+1 from c j can still limit the critical

path to be between c prdt i−1 and sj+1, but has higher

accuracy than computing sj+1 from cˆj . Compared to

sum computation in GDA and RAP-CLA, this

technique improves accuracy with almost no

additional overhead. Compared to CRA, the overhead

of SARA is merely the MUXes, which is almost the

minimum possible for configurable adders. Although

sj+1 is calculated by accurate carry c j , its delay can

still be reduced by approximate carry in lower

subadder. In a multibit adder, the delay of sum bit

depends on the carry chain propagated from its lower

bits. In our SARA structure, even when accurate

carry c j is propagated at bit j, the carry chain might

be truncated by approximate carry in other lower bits.




Page No:282

Fig. 5. Implementation of 12-bit adder in (a) CRA and (b) SARA

In Fig. 4, when c prdt i−1 is propagated, the delay of

sj+1 is reduced as its path is shorten to be between bit

i − 1 and j + 1. We can take the 12-bit adder in Fig. 5

as an example. For 12-bit SARA working in

approximate mode, the sum s9 uses the accurate carry

c8 from a lower subadder (bits 5 to 8). But c8 is

propagated from approximate carry c prdt 4 of

another subadder (bits 1 to 4). As shown in the

figure, the delay of s9 in SARA is about six stages.

Compared with the same bit in CRA, the delay of

sum bit s9 in SARA is reduced by three stages.

Similar delay reduction can be observed in other sum

bits (bits 6 to 12). For sums at bits 1 to 5, their delay

is the same as CRA, because they are using an

accurate carry c0 from LSB. As a result, the

maximum delay in 12-bit SARA is reduced, since for

a multibit adder its maximum delay depends on the

longest critical path.

5. SIMULATION RESULT

Simulation result

Rill schematic

POWER REPORT




Page No:283

DELAY REPORT

DESIGN SUMMARY

6. CONCLUSION

This work proposed a novel design flow

methodology to cope with energy efficiency in

CMOS technology. The proposed solution is focused

on exploring approximation in add-and-shift

architectures to reduce power consumption and

increase computational performance. The proposed

search heuristic and hybrid approximate adders

presented substantial energy reductions of up to

73.2% when compared to the precise and baseline

architectures. PPA topologies were explored, to

rescue performance, in addition to RCA-based

approach, where VOS estimation shows additional

dynamic power reduction of up to 19.3%. Area

reduction up to 73.8% is also observed in this study.

The proposed design methodology also enabled a

more comprehensive observation of application

quality by considering average results and variability.

Comparison with state-of-the-art related work is

provided showing the contributions of this work for

low power digital CMOS design and approximate

computing scope. Future work and effort are focused

on giving configurable capabilities to the filter

images under analysis, thus enabling the exploration

of different power-performance profiles during the

execution time.

FUTURE SCOPE

In this paper, we propose an SARA design. It has

significantly lower power/EDP than the latest

previous work when comparing at the same accuracy

level. In addition, SARA has considerable lower area

overhead than almost all the previous works. The

accuracy-power-delay efficiency is further improved

by a DAR technique. We demonstrate the efficiency

of our adder in the applications of multiplication

circuits and DCT computing circuits for image

processing.

REFERENCES

[1] Kuldeep Rawat, Tarek Darwish and Magdy

Bayoumi, “A low power and reduced area Carry

Select Adder”, 45th Midwest Symposium on

Circuits and Systems, vol.1, pp. 467-470, March

2002.

[2] Y. Kim and L.-S. Kim, "64-bit carry-select adder

with reduced area, " Electron. Lett. vol. 37, no. 10,

pp. 614- 615, May 2001.

[3] J. M. Rabaey, Digtal Integrated Circuits-A

Design Perspective.Upper Saddle River, NJ:

Prentice-Hall,2001.

[4] Cadence, "Encounter user guide, " Version 6.2.4,

March 2008.

[5] R. Priya and J. Senthil Kumar, “ Enhanced area

efficient architecture for 128 bit Modified CSLA”,

International Conference on Circuits, Power and

Computing Technologies,2013.

[6] Shivani Parmar and Kirat pal Singh,”Design of

high speed hybrid carry select adder”,IEEE ,2012.

[7] I-Chyn Wey, Cheng-Chen Ho, Yi-Sheng Lin, and

Chien-Chang Peng,” An Area-Efficient Carry

Select Adder Design by Sharing the Common

Boolean Logic Term”, Proceedings of the

International MultiConference of Engineers and

Computer Scientist 2012 Vol II,IMCES

2012,HongKong,March 14-16 2012.

[8] B. Ramkumar and Harish M Kittur,” Low-

Power and Area-Efficient Carry Select Adder”,

IEEE Transactions on Very Large Scale

Integration (VLSI) Systems, VOL. 20, NO. 2,

February 2012.

[8] Ms. S.Manjui, Mr. V. Sornagopae,” An Efficient

SQRT Architecture of Carry Select Adder Design by

Common Boolean Logic”,IEEE, 2013.

[9] Youngjoon Kim and Lee-Sup Kim, “64-bit

carry-select adder with reduced area”, Electronics

Letters, vol.37, issue 10, pp.614-615, May 2001.

[10] Yajuan He, Chip-Hong Chang and Jiangmin Gu,

“An area efficient 64-bit square root Carry-Select

Adder for low power applications”, IEEE

International Symposium on Circuits and

Systems,vol.4, pp.4082-4085,

May 2005.

Documents

Design Methodology to Explore Hybrid Approximate Adders