JOURNAL OF RESOURCE MANAGEMENT
AND TECHNOLOGY ISSN NO: 0745-6999
Vol 11, Issue4/ 2020
Page No:267
Design Methodology to Explore Hybrid Approximate Adders for
Energy-Efficient Image and Video Processing Accelerators
1V. Dushyanth Kumar Reddy, 2K. Naga Sankar Reddy
1M.Tech Student, 2Assistant Professor
Department of ECE
Global College of Engineering and Technology, Kadapa
ABSTRACT
This paper proposes a new design
methodology to explore the state-of-the-art
approximate adders for accelerator architectures
conceived in the realm of multiplier-less multiple
constant multiplication optimization problem. The
proposed methodology is composed of: 1) a search
heuristic to seek faster and feasible approximate
configurations for the architectures under evaluation;
2) low-power techniques regarding hybrid
approximate adders design for accelerators based on
trees of shift-and-add operations; 3) high-
performance evaluation by exploring parallel prefix
adders and low power analysis through the use of the
adder optimized by a commercial synthesis tool in
the precise part of the approximate adders; and 4)
energy efficiency analysis by considering both the
approximate techniques and voltage over-scaling (VOS)
estimation. Furthermore, improvements are proposed
for the state-of-the-art approximate adders under
evaluation in this paper. Two case studies are
considered to assess the proposed methodology: 1) a
Gaussian image filter and 2) the Sobel operator. The
precise and approximate image filters were described
in VHDL (very high-speed integrated circuit hardware
description language) following the proposed
methodology. Results are shown after synthesis to a
45-nm standard cell-based technology, where energy
reductions ranging from 7.7% up to 73.2% were
obtained for multiple levels of quality considering
the applications under analysis.
1. INTRODUCTION
1.1 MOTIVATION
As the scale of integration keeps growing,
more and more sophisticated signal processing
systems are being implemented on a VLSI chip.
These signal processing applications not only
demand great computation capacity but also consume
a considerable amount of energy. While performance
and area remain the two major design goals,
power consumption has become a critical concern in
today’s VLSI system design. The need for low-power
VLSI systems arises from two main forces.
First, with the steady growth of operating frequency
and processing capacity per chip, large currents have
to be delivered and the heat due to large power
consumption must be removed by proper cooling
techniques. Second, battery life in portable electronic
devices is limited. Low power design directly leads to
prolonged operation time in these portable devices.
Addition is a crucial arithmetic function that
strongly affects the overall performance of digital
systems. Adders are among the most widely used
blocks in electronic applications: they appear in
multipliers and in DSP systems that execute
algorithms such as the FFT and FIR and IIR filters.
Wherever multiplication is required, adders come into
the picture. Microprocessors perform millions of
instructions per second, so speed of operation
is the most important constraint to be considered
while designing multipliers. Portable devices also
demand a high degree of miniaturization and low
power consumption, since devices such as mobile
phones and laptops require long battery life.
So, a VLSI designer has to optimize these
three parameters (speed, area, and power) in a design.
These constraints are very difficult to satisfy
simultaneously, so depending on the demand or
application some compromise between them
has to be made. Ripple carry adders have the most
compact design but are the slowest. Carry look-ahead
adders are the fastest but consume more
area. Carry select adders act as a compromise
between the two. In 2002, Wang et al. presented a
new concept of hybrid adders to speed up addition,
giving a hybrid carry look-ahead/carry select adder
design. In 2008, low-power multipliers based on new
hybrid full adders were presented.
1.2 NEED FOR LOW POWER DESIGN
The design of portable devices requires
consideration for peak power consumption to ensure
reliability and proper operation. However, the time
averaged power is often more critical as it is linearly
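related to the battery life. There are four sources of
power dissipation in digital CMOS circuits:
switching power, short-circuit power, leakage power,
and static power. The following equation describes
these four components of power:
$$P_{avg} = P_{switching} + P_{short\text{-}circuit} + P_{leakage} + P_{static}, \qquad P_{switching} = \alpha\, C_L\, V_s\, V_{dd}\, f_{ck}$$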
Pswitching is the switching power. For a
properly designed CMOS circuit, this power
component usually dominates, and may account for
more than 90% of the total power. α denotes the
transition activity factor, which is defined as the
average number of power-consuming transitions
made at a node in one clock period. Vs is the
voltage swing, which in most cases is the same as
the supply voltage, Vdd. CL is the node capacitance.
It can be broken into three components: the gate
capacitance, the diffusion capacitance, and the
interconnect capacitance. The interconnect
capacitance is in general a function of the placement
and routing. fck is the clock frequency. The
switching power for static CMOS is derived as
follows.
During the low-to-high output transition, the
path from Vdd to the output node is conducting to
charge CL. Hence, the energy provided by the supply
source is
$$E = \int_0^{T} V_{dd}\, i(t)\, dt$$
where $i(t) = C_L\, dV_{out}/dt = (V_{dd} - V_{out})/R$ is the current drawn from the supply. Here, R is the resistance of the path between
Vdd and the output node. Therefore, the energy can be rewritten as
$$E = C_L V_{dd} \int_0^{V_s} dV_{out} = C_L V_s V_{dd} \qquad (2.4)$$
During the high-to-low transition, no energy is supplied by the source. Hence, the average power consumed during
one clock cycle is
$$P = \frac{E}{T} = C_L V_s V_{dd} f_{ck} \qquad (2.5)$$
Eq. (2.4) and Eq. (2.5) estimate the energy and the power of a single gate only. From a system point of view, α is
used to account for the actual number of gates switching at a point in time.
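As a small numeric illustration of the switching power discussed above, the following Python sketch assumes the usual form P = α · CL · Vs · Vdd · fck. The function name and the example values (activity factor, capacitance, voltages, clock frequency) are hypothetical and chosen only to show the quadratic effect of a rail-to-rail supply reduction.

```python
def switching_power(alpha, c_load, v_swing, v_dd, f_clk):
    """Switching power in watts, assuming P = alpha * C_L * V_s * V_dd * f_ck.

    alpha   : transition activity factor (transitions per clock period)
    c_load  : node capacitance C_L in farads
    v_swing : voltage swing V_s in volts
    v_dd    : supply voltage V_dd in volts
    f_clk   : clock frequency f_ck in hertz
    """
    return alpha * c_load * v_swing * v_dd * f_clk


# Illustrative (assumed) operating points with a rail-to-rail swing (V_s = V_dd).
p_nominal = switching_power(0.5, 1e-12, 1.0, 1.0, 1e9)
p_scaled = switching_power(0.5, 1e-12, 0.8, 0.8, 1e9)
ratio = p_scaled / p_nominal  # scales with (0.8 / 1.0) ** 2
```

With a rail-to-rail swing, lowering the supply by 20% reduces the switching power by the square of the scaling factor, which is the motivation for the low-voltage techniques discussed next.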
1.2.1 LOW VOLTAGE
Power consumption is linearly proportional to both
the voltage swing (Vs) and the supply voltage (Vdd), as
indicated in Eq. (2.5). For most CMOS logic
families, the swing is typically rail-to-rail. Hence,
power consumption is also said to be proportional to
the square of the supply voltage, Vdd. Therefore,
lowering Vdd is an efficient approach to reduce
both energy and power, presuming that the signal
voltage swing can be freely chosen. This is, however,
at the expense of the delay of circuits. The delay, td,
can be shown to be proportional to
$$t_d \propto \frac{V_{dd}}{(V_{dd} - V_T)^{\gamma}}$$
where $V_T$ is the transistor threshold voltage. The exponent γ is
between 1 and 2. It tends to be closer to 1 for MOS
transistors that are in the deep sub-micrometer region,
where carrier velocity saturation may occur. γ
increases toward 2 for longer-channel transistors.
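The trade-off between energy and delay under supply scaling can be sketched numerically. The snippet below assumes the delay model td ∝ Vdd / (Vdd − VT)^γ and a rail-to-rail swing; the threshold voltage of 0.3 V and the exponent 1.3 are assumed values for illustration only, not figures taken from this work.

```python
def relative_delay(v_dd, v_t, gamma):
    """Delay up to a constant factor: V_dd / (V_dd - V_T) ** gamma."""
    if v_dd <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return v_dd / (v_dd - v_t) ** gamma


def delay_energy_tradeoff(v_nominal, v_scaled, v_t=0.3, gamma=1.3):
    """Return (delay ratio, energy ratio) of the scaled supply vs. the nominal one."""
    delay_ratio = relative_delay(v_scaled, v_t, gamma) / relative_delay(v_nominal, v_t, gamma)
    energy_ratio = (v_scaled / v_nominal) ** 2  # energy ~ C_L * V_dd^2 for rail-to-rail swing
    return delay_ratio, energy_ratio


delay_ratio, energy_ratio = delay_energy_tradeoff(1.0, 0.8)
```

Under these assumptions the scaled design uses less energy per transition but is slower, which is why supply reduction is usually paired with techniques that shorten the critical path.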
2. LITERATURE SURVEY
Addition is the most common and most often used
arithmetic operation in microprocessors, digital signal
processors, and digital computers in general. It also
serves as a building block for the synthesis of all other
arithmetic operations. Therefore, for the
efficient implementation of an arithmetic unit, the
binary adder structure becomes a very critical
hardware unit. Any book on computer arithmetic
shows that there exists a large number of
different circuit architectures with different
performance characteristics that are widely used in
practice. Although much research on
binary adder structures has been done, studies
based on their comparative performance analysis are
only a few.
In electronics,
an adder or summer is a digital circuit that
performs addition of numbers. In
many computers and other kinds of processors,
adders are used not only in the arithmetic logic
unit(s), but also in other parts of the processor, where
they are used to calculate addresses, table indices,
and similar.
Although adders can be constructed for
many numerical representations, such as binary-
coded decimal or excess-3, the most common adders
operate on binary numbers. In cases where two's
complement or ones' complement is being used to
represent negative numbers, it is trivial to modify an
adder into an adder–subtractor. Other signed number
representations require a more complex adder.
The different types of adders are as follows.
3. EXISTING METHOD
3.1 CARRY SELECT ADDER
The design of area- and power-efficient high-speed
data path logic systems is one of the most
substantial areas of research in VLSI system design.
In digital adders, the speed of addition is limited by
the time required to propagate a carry through the
adder. The sum for each bit position in an elementary
adder is generated sequentially only after the
previous bit position has been summed and a carry
propagated into the next position. The carry select
adder (CSLA) is used in many computational systems
to alleviate the problem of carry propagation delay by
independently generating multiple carries and then
selecting a carry to generate the sum. However, the
CSLA is not area efficient because it uses multiple
pairs of Ripple Carry Adders (RCA) to generate partial
sum and carry by considering carry input cin=0 and
cin=1; the final sum and carry are then selected by the
multiplexers (mux). The basic idea of this work is to
use a Binary to Excess-1 Converter (BEC) instead of
the RCA with cin=1 in the regular CSLA to achieve
lower area and power consumption. The main
advantage of this BEC logic comes from the smaller
number of logic gates than the n-bit Full Adder (FA)
structure.
The carry select adder falls in the category
of conditional sum adders. The sum and carry are
calculated in advance for both possible values of the
input carry, 1 and 0, before the input carry arrives.
When the actual carry input arrives, the corresponding
precomputed values of sum and carry
are selected using a multiplexer. The conventional
carry select adder consists of one k/2-bit adder for the
lower half of the bits, i.e. the least significant bits
(LSBs), and two k/2-bit adders for the upper half, i.e.
the most significant bits (MSBs). Of the two MSB
adders, one assumes a carry input of one and the other
assumes a carry input of zero. The carry out calculated
by the least significant bit stage is
used to select the actual values of output
carry and sum. The selection is done using a
multiplexer. Dividing the adder into stages in this way
increases the area but speeds up the addition.
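The selection scheme described above can be illustrated with a short behavioural model. This is a minimal Python sketch, not the gate-level design evaluated in this work: the function names are hypothetical, the adder is split into exactly two halves, and both upper-half candidates are computed before the lower-half carry is known.

```python
def ripple_carry_add(a, b, cin, width):
    """Bit-serial ripple carry adder built from full-adder equations."""
    s, c = 0, cin
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s |= (ai ^ bi ^ c) << i            # full-adder sum bit
        c = (ai & bi) | (c & (ai ^ bi))    # full-adder carry out
    return s, c


def carry_select_add(a, b, k):
    """k-bit carry select adder split into a lower and an upper half."""
    half = k // 2
    mask = (1 << half) - 1
    lo_sum, lo_cout = ripple_carry_add(a & mask, b & mask, 0, half)
    hi_a, hi_b = a >> half, b >> half
    # Both candidates are computed "in parallel" before lo_cout is available.
    sum0, cout0 = ripple_carry_add(hi_a, hi_b, 0, k - half)
    sum1, cout1 = ripple_carry_add(hi_a, hi_b, 1, k - half)
    # Multiplexer: the real carry from the lower half picks one candidate.
    hi_sum, cout = (sum1, cout1) if lo_cout else (sum0, cout0)
    return (hi_sum << half) | lo_sum, cout
```

In hardware the two upper-half adders operate concurrently, so the critical path is roughly one half-width ripple chain plus a multiplexer, at the cost of the duplicated adder.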
Fig. Regular 16-b SQRT CSLA.
3.1.1 MULTIPLEXER
In electronics, a multiplexer (or MUX) is a
device that selects one of several analog or digital
input signals and forwards the selected input into a
single line.[1] A multiplexer of 2^n inputs has n select
lines, which are used to select which input line to
send to the output.[2] Multiplexers are mainly used to
increase the amount of data that can be sent over the
network within a certain amount of time and
bandwidth.[1] A multiplexer is also called a data
selector. Multiplexers are also used in CCTV systems,
so almost every business that has CCTV fitted will
own one.
An electronic multiplexer makes it possible for
several signals to share one device or resource, for
example one A/D converter or one communication
line, instead of having one device per input signal.
On the other hand, a demultiplexer (or
demux) is a device taking a single input signal and
selecting one of many data-output-lines, which is
connected to the single input. A multiplexer is often
used with a complementary demultiplexer on the
receiving end.[1]
An electronic multiplexer can be considered
as a multiple-input, single-output switch, and a
demultiplexer as a single-input, multiple-output
switch.[3] The schematic symbol for a multiplexer is
an isosceles trapezoid with the longer parallel side
containing the input pins and the short parallel side
containing the output pin.[4] In the schematic of a
2-to-1 multiplexer and its equivalent switch, the
select wire connects the
desired input to the output.
In digital circuit design, the selector wires
are of digital value. In the case of a 2-to-1
multiplexer, a logic value of 0 would connect $I_0$ to
the output while a logic value of 1 would
connect $I_1$ to the output. In larger multiplexers, the
number of selector pins is equal
to $\lceil \log_2(n) \rceil$, where $n$ is the number of inputs.
For example, 9 to 16 inputs would require
no fewer than 4 selector pins and 17 to 32 inputs
would require no fewer than 5 selector pins. The
binary value expressed on these selector pins
determines the selected input pin.
A 2-to-1 multiplexer has a boolean
equation where $A$ and $B$ are the two inputs, $S$ is the
selector input, and $Z$ is the output:
$$Z = (A \cdot \overline{S}) + (B \cdot S)$$
The truth table of this equation shows that
when $S = 0$ then $Z = A$, but
when $S = 1$ then $Z = B$. A straightforward
realization of this 2-to-1 multiplexer would need
2 AND gates, an OR gate, and a NOT gate.
Larger multiplexers are also common and,
as stated above, require $\lceil \log_2(n) \rceil$ selector pins
for $n$ inputs. Other common sizes are 4-to-1, 8-to-1,
and 16-to-1. Since digital logic uses binary
values, powers of 2 are used (4, 8, 16) to
maximally control a number of inputs for the
given number of selector inputs.
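The 2-to-1 multiplexer and the selector-pin count can be checked with a few lines of Python. This is a sketch under stated assumptions: the names `mux2` and `select_pins` are hypothetical, `a` and `b` stand for the two data inputs, `s` for the selector, and all signals are single bits (0 or 1).

```python
def mux2(a, b, s):
    """2-to-1 multiplexer on single bits: output is a when s = 0, b when s = 1.

    Uses 2 AND gates, 1 OR gate and 1 NOT gate (modelled as s ^ 1).
    """
    return (a & (s ^ 1)) | (b & s)


def select_pins(n_inputs):
    """Smallest number of selector pins able to address n_inputs inputs."""
    if n_inputs < 1:
        raise ValueError("a multiplexer needs at least one input")
    return (n_inputs - 1).bit_length()  # integer form of ceil(log2(n))
```

For instance, `select_pins(9)` and `select_pins(16)` both give 4, while `select_pins(17)` gives 5, matching the ranges quoted in the text.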
4-to-1 mux
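The boolean equation for a 4-to-1 multiplexer with inputs $A$, $B$, $C$, and $D$, selector inputs $S_1$ and $S_0$, and output $Z$ is:
$$Z = (A \cdot \overline{S_1} \cdot \overline{S_0}) + (B \cdot \overline{S_1} \cdot S_0) + (C \cdot S_1 \cdot \overline{S_0}) + (D \cdot S_1 \cdot S_0)$$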
MODIFIED REGULAR CSLA USING BEC
The main idea of this work is to use a BEC instead of
the RCA with Cin=1 in order to reduce the area and
power consumption of the regular CSLA. To replace
the n-bit RCA, an (n+1)-bit BEC is required. The
structure and the function table of a 4-bit BEC are
shown in Fig. and Table II, respectively.
Figure: 4-bit BEC
The figure above illustrates how the basic function of the
CSLA is obtained by using the 4-bit BEC together with
the mux. One input of the 8:4 mux receives
(B3, B2, B1, and B0) and the other input of the mux is
the BEC output. This produces the two possible
partial results in parallel, and the mux is used to select
either the BEC output or the direct inputs according
to the control signal Cin. The importance of the BEC
logic stems from the large silicon area reduction
when CSLAs with a large number of bits are
designed. The Boolean expressions of the 4-bit BEC
are listed as (note the functional symbols ~ NOT, & AND,
^ XOR):
X0 = ~B0
X1 = B0 ^ B1
X2 = B2 ^ (B0 & B1)
X3 = B3 ^ (B0 & B1 & B2)
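A 4-bit BEC adds one to its input modulo 16, which is what lets it stand in for the RCA with Cin=1. The Python sketch below writes out the standard gate-level incrementer equations (NOT, AND, XOR only) and the mux selection; the function names `bec4` and `bec_mux` are hypothetical and the model is behavioural, not the synthesized circuit.

```python
def bec4(b):
    """4-bit Binary to Excess-1 Converter: returns (b + 1) mod 16 using gate equations."""
    b0, b1, b2, b3 = [(b >> i) & 1 for i in range(4)]
    x0 = b0 ^ 1                # NOT B0
    x1 = b0 ^ b1               # B0 XOR B1
    x2 = b2 ^ (b0 & b1)        # B2 XOR (B0 AND B1)
    x3 = b3 ^ (b0 & b1 & b2)   # B3 XOR (B0 AND B1 AND B2)
    return x0 | (x1 << 1) | (x2 << 2) | (x3 << 3)


def bec_mux(rca_result, cin):
    """8:4 mux: pass the Cin=0 result directly, or its BEC output when Cin=1."""
    return bec4(rca_result) if cin else rca_result
```

Because the BEC needs no carry chain of full adders, it uses fewer gates than a second RCA of the same width, which is the source of the area saving claimed for the modified CSLA.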
DELAY AND AREA EVALUATION OF
MODIFIED CSLA
The structure of the proposed 16-b SQRT
CSLA, which uses a BEC in place of the RCA with Cin=1
to optimize area and power, is shown in the figure
above. We again split the structure into five groups.
The delay and area estimation of each group are
shown in the figure
below. The steps leading to the evaluation are given here.
Figure: Delay and area evaluation of modified SQRT CSLA: (a) group2, (b) group3, (c) group4, and (d) group5. H
is a Half Adder.
4. PROPOSED METHOD
The semiconductor industry faces
challenges at each new Complementary Metal Oxide
Semiconductor (CMOS) technology node. Power
density has increased because Dennard scaling is no
longer feasible [1]. Furthermore, power and thermal
walls demand much more effort from designers, so that
digital CMOS design is facing the so-called “Dark
Silicon Era” even considering recent Fin Field-Effect
Transistor (FinFET) technologies [2]. Therefore, the
current and future computing scenario is
characterized by the demand for numerous and
ubiquitous compute-intensive applications in
digital devices with constrained power budgets. Based
on that, energy-efficient techniques (i.e., those that
maximize the number of arithmetic operations per
energy unit) are paramount to cope with the previously
observed challenges. According to [3], two trending
energy-efficient techniques are: (i)
accelerator-rich architectures based on Application
Specific Integrated Circuits (ASICs) and (ii)
Approximate Computing (AC). Architectural
heterogeneity and the use of ASIC accelerators are
energy-efficient techniques to execute the most
compute-intensive kernels of an application [3]. The
remaining tasks, which demand less energy, can be
scheduled on general-purpose processors. As a result,
the workload of the general-purpose processors is
alleviated by the use of energy-efficient specific
processing cores.
Power-management schemes can be implemented to
power off accelerators or general-purpose cores when
not in use, thus respecting the power and thermal
constraints. The works in [4] and [5] show that,
despite the challenges of architectural integration
introduced by this new design paradigm, these
accelerator-rich architectures play an essential role in
energy efficiency for recent applications. For
example, Hameed et al. [6] show that an ASIC
solution is 500× more energy-efficient than a
four-core general-purpose processor for an
H.264 video coding application. One essential
approach to conceive ASIC implementations of digital
filters and transforms is to implement the
multiplications by constants in the light of the
multiplierless multiple constant multiplication (MMCM)
problem formulated in [7] and [8]. In other words,
these architectures are designed using
additions, subtractions, and shifts in an optimized
configuration.
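To make the shift-and-add idea concrete, the sketch below replaces constant multiplications by shifts, additions, and subtractions. The constants (3, 5, 7) and the three-tap kernel [1, 2, 1]/4 are assumptions chosen for illustration; they are not the coefficient sets optimized in [7] and [8] nor the filters evaluated in this work.

```python
def times_3_5_7(x):
    """Multiply x by the constants 3, 5 and 7 without a multiplier."""
    x3 = (x << 1) + x   # 3x = 2x + x
    x5 = (x << 2) + x   # 5x = 4x + x
    x7 = (x << 3) - x   # 7x = 8x - x
    return x3, x5, x7


def smooth_121(samples):
    """1-D smoothing filter with kernel [1, 2, 1] / 4 using shifts and adds only."""
    out = []
    for i in range(1, len(samples) - 1):
        acc = samples[i - 1] + (samples[i] << 1) + samples[i + 1]
        out.append(acc >> 2)   # divide by 4 with a right shift (truncating)
    return out
```

Every `+` and `-` in such a tree is an adder instance in the accelerator, which is why the choice of adder architecture dominates the energy of these designs.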
The approximate computing paradigm emerged to
increase performance and to reduce power dissipation
[9]. The critical approach in approximate hardware is
to reduce the computation accuracy in favor of
energy-efficiency. In circuit level design, this is
performed by designing simpler circuits to speed up
the critical path timing and/or to consume less power.
Approximate computing techniques take advantage
of approximation-tolerant applications which do not
need high accuracy all the time but only “good
enough” or “sufficiently good” results for output
perceptual quality. The following properties are
stated in [10] to define an approximation-resilient
application: (i) there is no single golden or accurate
result, but a range of acceptable ones, and (ii)
robustness to noisy input data. For example,
multimedia applications (e.g., video coding, audio
filtering, image processing, and so on), highly
demanded by current portable devices, are
intrinsically related to human senses. Multimedia
signals are, in fact, approximation-tolerant,
since it is stated in [11] that human senses
process analog information and have difficulty
perceiving the negative impact of digital approximations.
It means that it is possible to adopt approximate
computing techniques to improve energy efficiency
in multimedia applications by adequately exploring
the user experience at different profiles of quality.
A key advantage of approximate computing is that
this paradigm can be adopted at any abstraction level,
from the transistor level up to the software application [12].
Furthermore, approximate computing can be an
additive design component for accelerator-rich
architectures. One can consider that the use of
approximate hardware accelerators brings further
energy efficiency improvements [12]. In the
arithmetic layer of abstraction, works in [11] and
[13]–[20] have proposed approximate adders. Adders
are basic building blocks for several compute-
intensive multimedia applications. Therefore,
approximate adders could drive energy efficiency for
recent digital compute-intensive and approximation-
tolerant applications. Based on that, this work
proposes a design methodology to explore state-of-
the-art approximate adders for ASIC implementation
of add-and-shift accelerators for image and video
processing. Previous works in [22] and [23]
examined the use of the state-of-the-art approximate
adders for image filters. To explore approximation
for the architectures, they adopted simulation-based
methodologies in which search heuristics are
implemented to seek for energy-efficient approximate
configurations. The approximate adders taken by the
previously mentioned related works are the
Approximate Mirror Adder (AMA) [13] and the
Error-Tolerant Adder I (ETAI) [11]. They are divided
into precise and approximate parts. In both the works,
only the Ripple Carry Adder (RCA) topology is
explored in the precise block of those approximate
adders. The same observation is valid for the case
study explored in [11]: the precise block of ETA-I is
only implemented with RCA topology.
This work presents four novel contributions in the
scope of approximate computing:
1) A faster search heuristic and simulation-based
methodology to find feasible configurations with
evaluation of multiple levels of quality;
2) A high-performance exploration of approximate
hardware accelerators through the use of parallel
prefix adders (PPAs) and low power evaluation through
the optimized adder from the commercial synthesis tool
in the precise part of the approximate adders;
3) Combination of different approximate adders to
compose hybrid adders for the add-and-shift
architectures;
4) Energy-efficiency analysis based on the approximate
configurations and voltage over-scaling (VOS)
estimation due to the insertion of PPAs.
Results show that our approach substantially reduces
energy consumption, with reductions ranging from 7.7%
up to 73.2% for different levels of quality.
BACKGROUND ON APPROXIMATE ADDERS,
PARALLEL PREFIX ADDERS, AND IMAGE
FILTERS
A. Approximate Adders
The approximate adders can be classified as
computational performance- and power-oriented
designs. The former is related to adders divided into
m independent blocks or sub-adders to speed up the
critical path timing. The claim is that, for random and
uniformly distributed pairs of operands, more
extended carry propagation rarely occurs. Based on
that, additional logic is necessary to speculate carry-
in for each sub-adder, since this class of approximate
adder breaks the carry propagation in many parts.
Examples of adders which improve computational
performance are the Error-Tolerant Adder II [15],
Error Tolerant Adder IV [14], and the Almost Correct
Adder (ACA) [16]. This class of approximate adders
is also characterized by the presence of infrequent
and high magnitude sum errors. Therefore, the works
in [16]–[19] proposed accuracy-configurable adders
to cope with these error characteristics, at the cost
of additional logic to detect and correct the
sum errors.
A different philosophy is to propose
power-oriented adders which generally are divided
into two parts: (i) the least significant approximate
part and (ii) the most significant accurate part.
Examples of power-oriented approximate adders can
be observed in [11], [13], and [20]. The principal idea
in the approximate part is to replace the full adder
cells by simpler adder circuits. Therefore, power
reduction is the main objective of this class of adders.
Besides, these adders also tend to reduce critical path
timing, because there is no carry propagation scheme
in the approximate part. One can observe that the
classical truncation is a type of power-oriented adder
which truncates least significant full adder cells. This
class of approximate adders is also characterized by
the presence of frequent and low magnitude sum
errors. Such errors are of low magnitude because the
bit-width of the approximate part can be controlled
through an approximate parameter k.
In this work, the proposed approach is to explore the
power-oriented adders to give priority to power-efficiency.
This choice is supported by [21], which states that adders
focused on delay reduction cannot be used to explore
power-efficiency in structures which demand the
massive use of additions, such as multipliers. Also,
the power-oriented approximate adders enable the
exploration of multiple conventional adder topologies
in the precise part, which is not true for the
approximate adders divided into many blocks.
Therefore, we consider in this study the
exploration of the fifth approximate version of AMA
[13] and the Error-Tolerant Adder I [11], because
they are also explored by related works [22], [23].
The former approximate adder is renamed to “Copy
adder” due to its copy function implemented by the
buffers. The approximate adders which are explored
in this study can be observed in Figure 1.
The “Copy adder” in Figure 1(a) has its k-bit
approximate part implemented by buffers that copy the
operand a to the sum. This procedure has a 50%
probability of getting the correct sum for each bit
position. Furthermore, the carry-in estimation for the
precise part is implemented by the straightforward
assignment of the input operand bit bk−1. This
procedure has a 75% probability of getting a correct
carry-in estimation to the precise block.
The approximate part of the ETAI in Figure 1(b) is
implemented with half adders. The sum is performed in
the non-conventional direction (i.e., from the most
significant bit k−1 to the least significant bit position
0). The control logic block is conceived as follows: if
the first carry-generate c is equal to “1,” then all the
remaining least significant sum bits are set to “1.”
Otherwise, the sum result is the one computed by the
propagate signal. The carry-in to the precise part in
ETAI is statically set to “0.” This procedure has a 50%
probability of getting the correct carry-in to the
precise part.
One can observe that for both the
approximate adders, any conventional adder topology
can be implemented in the precise part. Related
works in [22] and [23] explore the use of the RCA. In
this work, we explore the RCA, the high-performance
PPAs, and the low power adder fully optimized by
the commercial tool used in this study. For this
reason, a brief PPA overview is given in the next
subsection.
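The behaviour of the two approximate adders described above can be modelled in a few lines. This is a behavioural Python sketch under stated assumptions: the function names are hypothetical, operands are unsigned n-bit integers, the k least significant bits form the approximate part, and the precise part is modelled with ordinary integer addition rather than a specific adder topology.

```python
def copy_adder(a, b, n, k):
    """n-bit Copy adder: the k low sum bits are a copy of operand a."""
    low = a & ((1 << k) - 1)                    # buffers copy a[k-1:0] to the sum
    cin = (b >> (k - 1)) & 1 if k > 0 else 0    # carry-in estimated as b[k-1]
    high = (a >> k) + (b >> k) + cin            # precise part (any exact adder)
    return ((high << k) | low) & ((1 << (n + 1)) - 1)


def etai(a, b, n, k):
    """n-bit ETAI: low part scanned from bit k-1 down to bit 0."""
    low = 0
    saturate = False
    for i in range(k - 1, -1, -1):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if saturate or (ai & bi):
            saturate = True                     # first generate seen: set this and lower bits to 1
            low |= 1 << i
        else:
            low |= (ai ^ bi) << i               # otherwise use the propagate signal
    high = (a >> k) + (b >> k)                  # precise part, carry-in fixed to 0
    return ((high << k) | low) & ((1 << (n + 1)) - 1)
```

For example, with a = 10, b = 6, n = 4, and k = 2 the exact sum is 16, while this model returns 18 for the Copy adder and 15 for the ETAI; with k = 0 both reduce to exact addition. In this model the absolute error of either adder stays below 2^k, which is consistent with the low-magnitude error behaviour attributed to power-oriented adders.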
Fig. 1. Approximate adders: (a) Copy adder; (b) ETAI.
Fig. 2. Parallel prefix adder structure.
B. Precise Adders
As previously mentioned, adders are fundamental
building blocks in a great variety of computational
applications. Based on that, many adder topologies
have been proposed to deal with the tradeoff between
power and computational performance. The RCA
topology is characterized by low power consumption,
low area, and low computational performance.
Depending on the requirements of high-performance
applications, and given that the computational
complexity of today's tasks is increasing, the key
approach is to shorten the adder's
critical path delay (i.e., carry propagation) at the
expense of higher area and power dissipation. For this
reason, the Parallel Prefix Adders (PPAs) were proposed to
deal with high-performance demands [24]. The carry
propagation structure in the PPAs is
implemented by simple logic cells which tend to keep
a regular connection. Based on that, the sum
computation can be divided into pre-processing,
prefix computation, and post-processing parts, as can
be seen in Figure 2. In the pre-processing part, the
propagate pi (i.e., ai ⊕ bi) and generate gi (i.e., ai ∧ bi)
signals are computed based on the input operands ai
and bi. In the prefix computation stage, the carry
computation is accelerated by the parallel
composition of the black cells, which implement the
group propagate Pi:j and generate Gi:j signals, as
well as the gray cells, which compute the carry ci.
Finally, the post-processing step is given by the sum
si = pi ⊕ ci−1.
1) Use of PPA Adders on the Precise
Part of the Approximate Adders: Different
configurations between the black and gray cells can
be obtained. According to the PPAs taxonomy
presented in [25], the different prefix cells
configurations allow tradeoffs among (i) the number
of logic levels, (ii) the maximum fanout, and (iii) the
maximum number of horizontal wire tracks (i.e wire
density). All of these aspects affect the adder delay.
Based on that, PPA topologies proposed in [26]–[30]
are considered in this study. Their main
characteristics concerning logic levels, maximum
fanout, and the maximum number of wiring tracks
are presented in Table I [25]. As can be seen in Table
I, the Brent-Kung adder has the highest number of
levels, while presenting low values for both fanout and
wire density. The Sklansky and Kogge-Stone adders have
the lowest number of logic levels, but the former
has the highest fanout, and the latter has the worst
wire density due to the highest number of prefix
cells. The Han-Carlson adder is the hybrid solution between
the Brent-Kung and Kogge-Stone adders. Therefore, this
adder balances the tradeoff between the number of
logic levels and wiring tracks. The Ladner-Fischer
adder is the hybrid approach between Sklansky and
Brent-Kung, so that the tradeoff is balanced between
the number of logic levels and the fanout.
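The three PPA stages described in this subsection (pre-processing, prefix computation, post-processing) can be illustrated with a small bit-level Python model. The sketch below uses the Kogge-Stone prefix pattern as one example topology; the function name, the list-based representation, and the folding of the carry-in into the generate signal of bit 0 are assumptions made for illustration, not the RTL used in this work.

```python
def kogge_stone_add(a: int, b: int, n: int, cin: int = 0) -> int:
    """Bit-level model of an n-bit Kogge-Stone parallel prefix adder."""
    # Pre-processing: per-bit propagate and generate signals.
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]

    # Group signals; the carry-in is folded into the generate of bit 0.
    G, P = g[:], p[:]
    G[0] = g[0] | (p[0] & cin)

    # Prefix computation: log2(n) levels, the span doubles at each level.
    d = 1
    while d < n:
        new_G, new_P = G[:], P[:]
        for i in range(d, n):
            new_G[i] = G[i] | (P[i] & G[i - d])   # group generate
            new_P[i] = P[i] & P[i - d]            # group propagate
        G, P = new_G, new_P
        d *= 2

    # Post-processing: s_i = p_i XOR c_{i-1}, with c_i = G[i] and c_{-1} = cin.
    result, carry = 0, cin
    for i in range(n):
        result |= (p[i] ^ carry) << i
        carry = G[i]
    return result | (G[n - 1] << n)               # carry-out placed at bit n
```

Other topologies such as Brent-Kung or Sklansky would change only the prefix computation loop, which is where the tradeoff among logic levels, fanout, and wiring tracks arises.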
PROPOSED DESIGN METHODOLOGY
This section presents the proposed design
methodology to explore our hybrid approximate
adders. The basic design flow is shown in Figure 3.
The first step is to simulate the application under
evaluation by adopting MATLAB or C models
considering both the architectures and approximate
adders. The hybrid approximate adder models are
implemented by an overloaded function in MATLAB
and C. The add-and-shift filter
structures are then developed in simulation scripts. The
function is called to perform the necessary additions,
with the appropriate parameters to select between the
type of hybrid approximate adder, the number of bits
of the approximate part, and so forth. The quality
metrics are generated by considering the use of real
test-cases. The next stage is related to the application
quality evaluation for all the exercised approximate
configurations. After that, different levels of quality
are selected to generate the Register Transfer Level
(RTL) designs automatically. The logic synthesis is
performed for each RTL file by using the standard
cell technology library files (e.g., .lib, .lef, cap tables,
and so forth). Then, the mapped gate-level netlist is
created to enable the next step of post-synthesis
simulation. During the simulation, standard cell
technology library files (i.e., the Verilog file with the
behavioral model of the standard cells) and real test
cases are used to capture switching activity which is
saved in Value Change Dump (VCD) or Toggle
Count Format (TCF) files. The Verilog gate-level
netlist, standard cell technology library files, and
VCD or TCF files are then used to estimate power. In
the next subsections the proposed hybrid approximate
adders, the accelerator architectures under analysis,
and the proposed heuristic adopted to perform
simulation are shown.
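As a rough illustration of the simulation step, the sketch below shows how a single adder function with selectable parameters could be called from an add-and-shift filter model. It is written in Python rather than the MATLAB or C models mentioned above, and everything in it is hypothetical: the names `approx_add` and `smooth_121`, the `kind` and `k` parameters, and the 1-D kernel [1 2 1], which is not one of the filters evaluated in this work.

```python
def approx_add(a: int, b: int, kind: str = "precise", k: int = 0) -> int:
    """Single entry point for all additions; 'kind' and 'k' select the adder."""
    if kind == "precise" or k == 0:
        return a + b
    if kind == "copy":
        low = a & ((1 << k) - 1)            # approximate part: copy operand a
        carry_in = (b >> (k - 1)) & 1       # carry-in estimated from b[k-1]
        return (((a >> k) + (b >> k) + carry_in) << k) | low
    raise ValueError(f"unknown adder kind: {kind}")


def smooth_121(left: int, center: int, right: int,
               kind: str = "precise", k: int = 0) -> int:
    """Multiplier-less [1 2 1] kernel: left + (center << 1) + right."""
    outer = approx_add(left, right, kind, k)
    return approx_add(outer, center << 1, kind, k)
```

With the inputs (10, 20, 30), the precise model returns 80, while the Copy adder with k = 2 returns 82; a quality metric would then be computed from such differences over real test cases.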
Fig. 3. Proposed design methodology.
A. Proposed Hybrid Approximate Adders for Shift-and-Add Architectures
Fig. 4. Gaussian image filter architecture.
Fig. 5. Gradient image filter architecture.
The 5 × 5 Gaussian and the 3 × 3 Sobel image filter architectures under evaluation are similar to the ones adopted
by Oliveira et al. [22]. The difference is observed in
the Gaussian architecture, where the partial terms of
the adder tree were reorganized to enable left shift
overlapping regions in the operands. It is performed
to leverage the power efficiency provided by the
proposed hybrid adders and to improve the proposed
search heuristic which will be presented in the next
subsection. The Gaussian and Gradient architectures
can be observed in Figure 4 and Figure 5,
respectively [22]. One can observe that these
architectures are implemented by shift operations,
adders, and subtractors. There are two observable
configurations in which the proposed hybrid
approximate adders can be adopted: (i) both
operands have an overlapping number of left shift
operations, and (ii) one operand has more left shift
operations than the other operand. These
aspects can be observed in Figure 6. As can be seen
in Figure 6, if there is overlapping between the
number of left shifts in both the operands, then the
proposed approach considers the use of truncation
adder in the overlapped and least significant region.
The excessive amount of left shift operations in one
of the operands, when compared to the other can be
leveraged with the use of the Copy adder. One can
notice in the examples in Figure 6(a) and (b) that
these procedures do not produce sum errors. The
remaining most significant region in the adder can
further be explored with the state-of-the-art Copy
adder and ETAI (i.e., precise plus approximate
adder). The example in Figure 6(a) shows the
proposed hybrid scheme of copy-copy-truncation.
The operands are left shifted two and four times,
respectively. One can see that there are three
approximation parameters: (i) k1 = 2 which controls
the truncation in the overlapped shift region, (ii) k2 =
2 which controls the copy adder in the excessive shift
region, and (iii) k3 = 4 which controls the
approximate part of the copy adder in the non-shifted
region. Almost the same observations can be made in
Figure 6(b) for the ETAI-copy-truncation
configuration. The only difference is that in the
non-shifted region the ETAI is adopted instead of the
Copy adder. For the ETAI, we adopted the
modification proposed by Kang et al. [23]: the use of
OR logic gates instead of XOR. This is possible because
there is no difference in the produced sum result,
while the former gate has less area than the latter.
Besides, this work also considers the use of the carry-in
estimation performed in the Copy adder for the ETAI. As
previously mentioned, this procedure has a higher
probability of getting a correct estimation than
statically setting the carry-in to “0.” If a given adder of
the architecture has no left-shifted operands, then the
approximation is not hybrid (i.e., k1 = 0 and k2 = 0).
Even so, the Copy adder or ETAI can still be
explored in this adder (i.e., k3 ≥ 0).
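The copy-copy-truncation scheme with its three parameters can be sketched as a bit-level Python model. This is a minimal sketch under stated assumptions: the function names are hypothetical, the operands are unsigned, k1 and k2 are derived directly from the two shift amounts, and in the non-shifted region the less-shifted operand is the one copied to the sum.

```python
def copy_adder(a: int, b: int, k: int) -> int:
    """Copy adder: the k low sum bits copy operand a; carry-in is b[k-1]."""
    if k == 0:
        return a + b
    low = a & ((1 << k) - 1)
    carry_in = (b >> (k - 1)) & 1
    return (((a >> k) + (b >> k) + carry_in) << k) | low


def copy_copy_truncation(x: int, y: int, shift_x: int, shift_y: int,
                         k3: int) -> int:
    """Approximate (x << shift_x) + (y << shift_y) with the hybrid scheme."""
    if shift_x > shift_y:                     # make x the less-shifted operand
        x, y, shift_x, shift_y = y, x, shift_y, shift_x
    k1 = shift_x                              # overlapped shift region (truncation)
    k2 = shift_y - shift_x                    # excessive shift region (copy)
    a, b = x << shift_x, y << shift_y
    low_bits = k1 + k2
    # k1 bits are zero in both operands and k2 bits are zero in b only,
    # so copying a introduces no error in these two regions.
    low = a & ((1 << low_bits) - 1)
    # Remaining most significant region: Copy adder with a k3-bit approximate part.
    high = copy_adder(a >> low_bits, b >> low_bits, k3)
    return (high << low_bits) | low
```

With shifts of two and four, as in Figure 6(a), the model gives k1 = 2 and k2 = 2, and the result is exact whenever k3 = 0.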
Fig. 6. Proposed hybrid approximate adders. (a) Copy-copy-truncation adder. (b) ETAI-copy-truncation adder.
B. Proposed Search Heuristic
The exhaustive search in simulation-based
methodologies tends to be time consuming or
prohibitive for finding the most energy-efficient
configuration. Therefore, the use of search heuristics
is essential in this scenario. As previously shown,
related works in [22] and [23] proposed search
heuristics to seek energy-efficient accelerators. In
this work, the proposed approach is to first establish
the k1 and k2 parameters
for the overlapped and excessive shift regions. After
that, the next step is to seek different k3 parameters
for all the adders in the tree through an iterative
process. The algorithm for the heuristic is shown in
Algorithm 1. One can observe that the k1 and k2
parameters are fixed, while k3 is iterated to seek
different approximate configurations. During the
initialization step from lines 1 to 9, the k1 and k2
parameters are determined solely by the overlapped
regions previously shown in Figure 6. The iteration in
lines 10 to 12 is used to create and initialize the data
structure which stores the k3 values for each adder in
the filter architectures. The search for k3 parameters
is performed in lines 14 to 17. During the search, the
quality metric is stored after running the application
for each k3 parameter value. The application quality
is evaluated, and an objective metric is calculated by
considering the use of real test cases.
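One possible reading of the heuristic's structure is sketched below in Python. It is not Algorithm 1 itself: the function name `search_k3`, the dictionary layout, the `run_application` callback, and the simplification of sweeping one common k3 value over all adders are assumptions made only to show the fixed k1/k2 initialization followed by the k3 iteration.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, Dict[str, int]]


def search_k3(adder_shifts: Dict[str, Tuple[int, int]],
              k3_max: int,
              run_application: Callable[[Config], float]
              ) -> Tuple[Config, List[Tuple[int, float]]]:
    """Fix k1 and k2 from the operand shifts, then sweep k3 and record quality."""
    # Initialization: k1 and k2 depend only on the shift regions of each adder.
    config: Config = {}
    for name, (shift_a, shift_b) in adder_shifts.items():
        config[name] = {
            "k1": min(shift_a, shift_b),      # overlapped shift region
            "k2": abs(shift_a - shift_b),     # excessive shift region
            "k3": 0,                          # approximate part, searched below
        }

    # Search: run the application for each k3 value and store the quality metric.
    results: List[Tuple[int, float]] = []
    for k3 in range(k3_max + 1):
        for name in config:
            config[name]["k3"] = k3
        results.append((k3, run_application(config)))
    return config, results
```

In practice `run_application` would simulate the filter on real test cases and return the chosen quality metric; the recorded pairs can then be used to pick configurations at different quality levels for RTL generation.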
EXTENSION METHOD
SIMPLE ACCURACY-RECONFIGURABLE ADDER
A. Preliminaries
Fig. 3. (a) Conventional full adder. (b) Carry-out selectable full adder. (c) Carry-in configurable full adder.
An N-bit adder operates on two addends $A = (a_N, a_{N-1}, \ldots, a_i, \ldots, a_1)$ and $B = (b_N, b_{N-1}, \ldots, b_i, \ldots, b_1)$.
For bit $i$, its carry-in is $c_{i-1}$ and its carry-out is $c_i$.
Defining the carry generate bit $g_i = a_i \cdot b_i$, propagate
bit $p_i = a_i \oplus b_i$, and kill bit $k_i = \bar{a}_i \cdot \bar{b}_i$, the
conventional full adder computes the sum $s_i$ and
carry $c_i$ according to
$$s_i = p_i \oplus c_{i-1} \tag{1}$$
$$c_i = g_i + p_i \cdot c_{i-1}. \tag{2}$$
A gate level schematic of a conventional full adder is
provided in Fig. 3(a). A CRA is used to chain N bits
of conventional full adders together. By applying (2)
recursively, one can get
$$c_i = g_i + p_i g_{i-1} + \cdots + g_1 \prod_{k=2}^{i} p_k + c_0 \prod_{k=1}^{i} p_k. \tag{3}$$
This equation implies that ci can be computed
directly from g and p of all bits, without waiting for
the c of its lower bits to be computed. This
observation is the basis for CLA adder.
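As a small worked instance of this recursion, the expansion for bit 3 is written out below. It only applies the carry equation $c_i = g_i + p_i \cdot c_{i-1}$ three times and is given for illustration.

```latex
\begin{aligned}
c_3 &= g_3 + p_3 c_2 \\
    &= g_3 + p_3 \left( g_2 + p_2 c_1 \right) \\
    &= g_3 + p_3 g_2 + p_3 p_2 \left( g_1 + p_1 c_0 \right) \\
    &= g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 c_0 .
\end{aligned}
```

Every term on the last line depends only on the generate and propagate bits and on $c_0$, so $c_3$ does not have to wait for $c_1$ or $c_2$.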
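The carry prediction for bit $i$, the most significant bit of a lower-bit subadder, is
$$c_i^{\mathrm{prdt}} = g_i. \tag{4}$$
For the LSB of the higher-bit subadder, which is bit $i+1$, its carry-out $c_{i+1}$ can be computed using one of
two options: either by the conventional $c_{i+1} = g_{i+1} + p_{i+1} \cdot c_i$, or by using the carry prediction as
$$c_{i+1} = g_{i+1} + p_{i+1} \cdot c_i^{\mathrm{prdt}} = g_{i+1} + p_{i+1} \cdot g_i. \tag{5}$$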
Fig. 4. Design of SARA
The selection between the two options is realized
using MUXes as in Fig. 4, and the multiplexer (MUX)
selection result is denoted as $\hat{c}_i$. Comparing (5) with
(3), we can see that the carry prediction is a
truncation-based approximation to carry
computation. Therefore, $\hat{c}_i$ can be configured to
either accurate mode or approximation mode, that is
$$\hat{c}_i \leftarrow \begin{cases} c_i^{\mathrm{prdt}}, & \text{if approximation mode} \\ c_i, & \text{if accurate mode.} \end{cases} \tag{6}$$
It should be noted that the carry prediction $c_i^{\mathrm{prdt}}$
reuses $g_i$ in an existing full adder instead of
introducing an additional dedicated circuit as in [18]
or Fig. 2. This prediction scheme makes a very
simple modification to the conventional full adder, as
shown in Fig. 3(b). One can connect $\hat{c}_i$ to its higher
bit $i+1$ to compute both carry $c_{i+1}$ and sum $s_{i+1}$, as
in GDA [18] and RAP-CLA [19]. We suggest an
improvement over this approach by another simple
change as in Fig. 3(c), where $s_{i+1}$ is based on $c_i$
instead of $\hat{c}_i$. Such an approach can help reduce the
error rate in the outputs when an incorrect carry is
propagated, because the sum remains accurate and the
carry is not propagated when the addends are
exactly the same. Moreover, out of all four
configurations of sum/carry calculation by
approximate/accurate carry-in, the most meaningful
one is to have the sum bit calculated by the accurate carry
and make the carry bit configurable.
Therefore, sum $s_{i+1}$ is calculated directly by the accurate
carry $c_i$ without the option of $c_i^{\mathrm{prdt}}$. Applying this
in SARA as in Fig. 4, in the approximation mode,
computing $s_{j+1}$ from $c_j$ can still limit the critical
path to be between $c_{i-1}^{\mathrm{prdt}}$ and $s_{j+1}$, but has higher
accuracy than computing $s_{j+1}$ from $\hat{c}_j$. Compared to
sum computation in GDA and RAP-CLA, this
technique improves accuracy with almost no
additional overhead. Compared to CRA, the overhead
of SARA is merely the MUXes, which is almost the
minimum possible for configurable adders. Although
$s_{j+1}$ is calculated by the accurate carry $c_j$, its delay can
still be reduced by the approximate carry in a lower
subadder. In a multibit adder, the delay of a sum bit
depends on the carry chain propagated from its lower
bits. In our SARA structure, even when the accurate
carry $c_j$ is propagated at bit $j$, the carry chain might
be truncated by an approximate carry in other lower bits.
Fig. 5. Implementation of 12-bit adder in (a) CRA and (b) SARA
In Fig. 4, when $c_{i-1}^{\mathrm{prdt}}$ is propagated, the delay of
$s_{j+1}$ is reduced, as its path is shortened to be between bit
$i-1$ and $j+1$. We can take the 12-bit adder in Fig. 5
as an example. For the 12-bit SARA working in
approximate mode, the sum $s_9$ uses the accurate carry
$c_8$ from a lower subadder (bits 5 to 8). But $c_8$ is
propagated from the approximate carry $c_4^{\mathrm{prdt}}$ of
another subadder (bits 1 to 4). As shown in the
figure, the delay of $s_9$ in SARA is about six stages.
Compared with the same bit in CRA, the delay of
sum bit $s_9$ in SARA is reduced by three stages.
A similar delay reduction can be observed in other sum
bits (bits 6 to 12). For the sums at bits 1 to 5, the delay
is the same as in CRA, because they use an
accurate carry $c_0$ from the LSB. As a result, the
maximum delay in the 12-bit SARA is reduced, since for
a multibit adder the maximum delay depends on the
longest critical path.
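The carry selection and the sum computation described in this section can be illustrated with a bit-level Python model. It is a functional sketch only: the name `sara_add`, the 0-indexed bit numbering, the uniform subadder width, and the use of the rippled carry for the final carry-out are assumptions, and no timing is modeled.

```python
def sara_add(a: int, b: int, n: int, sub_width: int, approximate: bool) -> int:
    """Functional model of an n-bit SARA split into sub_width-bit subadders."""
    carry_for_sum = 0     # accurate (rippled) carry, always used for the sum bit
    carry_for_carry = 0   # MUX output, used for the next carry computation
    carry_out = 0
    result = 0
    for i in range(n):                              # bit 0 is the LSB here
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g, p = ai & bi, ai ^ bi
        result |= (p ^ carry_for_sum) << i          # sum from the accurate carry
        carry_out = g | (p & carry_for_carry)       # carry of this bit
        at_boundary = (i + 1) % sub_width == 0      # MSB of a subadder
        carry_for_sum = carry_out
        # In approximation mode the MUX selects the prediction g at a boundary.
        carry_for_carry = g if (approximate and at_boundary) else carry_out
    return result | (carry_out << n)
```

For n = 8 and 4-bit subadders in approximation mode, adding 15 and 1 still gives the exact result 16, because the sum of bit 4 uses the accurate carry; adding 31 and 1 gives 0 instead of 32, because the carry into bit 5 is computed from the prediction.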
5. SIMULATION RESULT
[Figures: simulation result; RTL schematic; power report; delay report; design summary]
6. CONCLUSION
This work proposed a novel design flow
methodology to cope with energy efficiency in
CMOS technology. The proposed solution is focused
on exploring approximation in add-and-shift
architectures to reduce power consumption and
increase computational performance. The proposed
search heuristic and hybrid approximate adders
presented substantial energy reductions of up to
73.2% when compared to the precise and baseline
architectures. PPA topologies were explored to
recover performance, in addition to the RCA-based
approach, where VOS estimation shows an additional
dynamic power reduction of up to 19.3%. Area
reduction up to 73.8% is also observed in this study.
The proposed design methodology also enabled a
more comprehensive observation of application
quality by considering average results and variability.
Comparison with state-of-the-art related work is
provided showing the contributions of this work for
low power digital CMOS design and approximate
computing scope. Future work and effort are focused
on giving configurable capabilities to the image
filters under analysis, thus enabling the exploration
of different power-performance profiles at
run time.
FUTURE SCOPE
In this paper, we propose a SARA design. It has
significantly lower power/EDP than the latest
previous work when compared at the same accuracy
level. In addition, SARA has considerably lower area
overhead than almost all the previous works. The
accuracy-power-delay efficiency is further improved
by a DAR technique. We demonstrate the efficiency
of our adder in the applications of multiplication
circuits and DCT computing circuits for image
processing.
REFERENCES
[1] Kuldeep Rawat, Tarek Darwish, and Magdy Bayoumi, "A low power and reduced area Carry Select Adder," 45th Midwest Symposium on Circuits and Systems, vol. 1, pp. 467-470, March 2002.
[2] Y. Kim and L.-S. Kim, "64-bit carry-select adder with reduced area," Electron. Lett., vol. 37, no. 10, pp. 614-615, May 2001.
[3] J. M. Rabaey, Digital Integrated Circuits: A Design Perspective. Upper Saddle River, NJ: Prentice-Hall, 2001.
[4] Cadence, "Encounter user guide," Version 6.2.4, March 2008.
[5] R. Priya and J. Senthil Kumar, "Enhanced area efficient architecture for 128 bit Modified CSLA," International Conference on Circuits, Power and Computing Technologies, 2013.
[6] Shivani Parmar and Kirat pal Singh, "Design of high speed hybrid carry select adder," IEEE, 2012.
[7] I-Chyn Wey, Cheng-Chen Ho, Yi-Sheng Lin, and Chien-Chang Peng, "An Area-Efficient Carry Select Adder Design by Sharing the Common Boolean Logic Term," Proceedings of the International MultiConference of Engineers and Computer Scientists 2012, Vol. II, IMECS 2012, Hong Kong, March 14-16, 2012.
[8] B. Ramkumar and Harish M Kittur, "Low-Power and Area-Efficient Carry Select Adder," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, no. 2, February 2012.
[8] Ms. S.Manjui and Mr. V. Sornagopae, "An Efficient SQRT Architecture of Carry Select Adder Design by Common Boolean Logic," IEEE, 2013.
[9] Youngjoon Kim and Lee-Sup Kim, "64-bit carry-select adder with reduced area," Electronics Letters, vol. 37, issue 10, pp. 614-615, May 2001.
[10] Yajuan He, Chip-Hong Chang, and Jiangmin Gu, "An area efficient 64-bit square root Carry-Select Adder for low power applications," IEEE International Symposium on Circuits and Systems, vol. 4, pp. 4082-4085, May 2005.