Data-Dependent Low-Power 8x8 DCT/IDCT · 2005-02-09

Design and Evaluation of a

Data-Dependent Low-Power 8x8 DCT/IDCT

Cheng-Yu Pai

A Thesis

in

The Department

of

Electrical and Computer Engineering

Presented in Partial Fulfillment of the Requirements

for the Degree of Master of Applied Science (Electrical) at

Concordia University

Montreal, Quebec, Canada

December 2000

© Cheng-Yu Pai, 2000

National Library of Canada / Bibliothèque nationale du Canada

Acquisitions and Bibliographic Services / Acquisitions et services bibliographiques, 395 Wellington Street, Ottawa ON K1A 0N4, Canada

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.



Design and Evaluation of a

Data-Dependent Low-Power 8 x 8 DCT/IDCT

Cheng-Yu Pai¹

Traditional fast Discrete Cosine Transform (DCT)/Inverse DCT (IDCT) algorithms have focused on reducing arithmetic complexity and have fixed run-time complexities regardless of the input. Recently, data-dependent signal processing has been applied to the DCT/IDCT. These algorithms have variable run-time complexities.

A new two-dimensional 8x8 low-power DCT/IDCT design is implemented using VHDL by applying the data-dependent signal-processing concept onto the traditional fixed-complexity fast DCT/IDCT algorithm. To reduce power, the design is based on Loeffler's fast algorithm, which uses a low number of multiplications. On top of that, zero bypassing, data segmentation, input truncation, and hardwired canonical sign-digit (CSD) multipliers are used to reduce the run-time computation, and hence the switching activity and the power.

When synthesized using Canadian Microelectronic Corporation 3-V 0.35 µm CMOSP technology, this FDCT/IDCT design consumes 122.7/124.9 mW at a clock frequency of 40 MHz and a processing rate of 320M samples/sec. With technology scaling to 0.35 µm technology, the proposed design features lower switching capacitance per sample, i.e. it is more power-efficient, than other previously reported high-performance FDCT/IDCT designs.

¹ This work is supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) post-graduate scholarship and NSERC research grants.

Keywords: Data-dependent computation, discrete cosine transform (DCT), inverse discrete cosine transform (IDCT), low power, canonical sign-digit multiplier.

Acknowledgements

I would like to express my deepest and most sincere gratitude toward my supervisors, Dr. Asim J. Al-Khalili and Dr. William E. Lynch. They have given me clear and helpful guidelines throughout my years as a master's student. Above all, I wish to thank them for the great amount of time devoted to me and my work.

I wish to acknowledge the Post-Graduate Scholarship (PGS-A) offered by the Natural Sciences and Engineering Research Council of Canada (NSERC), and the NSERC research grants. This financial support allowed me to concentrate my time and effort on my research.

I would also like to thank my friends Wassim Tout and Wei Wang, and VLSI lab specialist Ted Obuchowicz, for helping me with the technical problems in the simulation environments and for giving me their valuable opinions about the comparison strategy.

Finally, I would like to dedicate this work to my family for their love and support. I thank you all for your patience and your sacrifices. This work is as much yours as it is mine.

Table of Contents

List of Figures
List of Tables
List of Acronyms

1. Introduction
  1.1. Research Motivation
  1.2. Contribution of this Thesis
  1.3. Power Measurement Criteria
  1.4. Thesis Organization

2. Background of FDCT/IDCT
  2.1. Definition of DCT and its Inverse
  2.2. Choices of Algorithms
    2.2.1. Chen's Algorithm Family
    2.2.2. Loeffler's FDCT/IDCT Algorithm
    2.2.3. Jeong's FDCT Algorithm
    2.2.4. Summary and Comparison of Algorithm Complexities
  2.3. Precision Requirements of IDCT
  2.4. Chapter Summary

3. Design Choices for the FDCT/IDCT
  3.1. Data-Dependent Loeffler's FDCT Algorithm
    3.1.1. Data-Dependent Bypassing Logic
    3.1.2. Truncate Some Least-Significant Bits from Input
  3.2. Data-Dependent Loeffler's IDCT Algorithm
  3.3. Transpose Memory Architecture
  3.4. Chapter Summary

4. Multiplier Architectures
  4.1. Survey of Constant Multiplication Schemes
    4.1.1. Modified Booth Multiplier
    4.1.2. Distributed Arithmetic (DA)
    4.1.3. Hardwired Canonical-Sign-Digit (CSD) Wallace-Tree Multiplier
    4.1.4. Pattern-Based CSD Multiplier
  4.2. CSD Multiplier Implementation Procedure
  4.3. Multiplier Synthesis Result
  4.4. Chapter Summary

5. Implementation
  5.1. Hardwired CSD Multiplier Generator
  5.2. IEEE Standard 1180-1990 IDCT Compliant
  5.3. Pipelining Design
  5.4. Chapter Summary

6. Synthesis Results
  6.1. Synthesis Results of the Proposed Design
  6.2. Comparison with past FDCT/IDCT VLSI implementations
  6.3. Chapter Summary

7. Conclusion
  7.1. Summary of Research
  7.2. Conclusion
  7.3. Possible Improvements for Future Research

Bibliography

Appendix A Truncation Test Result
Appendix B Sample Output of CSD Multiplier Generator
Appendix C Source Code of Constant Multiplier Generator
Appendix D IEEE Standard 1180-1990 Compliant Test Program

List of Figures

Figure 1: General block diagram of video compression encoder
Figure 2: 2-D FDCT/IDCT using row-column (separable) method
Figure 3: Loeffler's FDCT algorithm
Figure 4: Loeffler's IDCT algorithm
Figure 5: Jeong's fast FDCT algorithm
Figure 6: Setup for measuring the accuracy of a proposed 8x8 IDCT
Figure 7: Zero Bypassing Multiplier
Figure 8: Multiplication Segmentation
Figure 9: 2-D row-column FDCT with truncation
Figure 10: Test model to measure the effect of truncation
Figure 11: Ping-pong transpose memory
Figure 12: On-the-fly 8x8 Transpose Memory
Figure 13: States of the transpose matrix for different clock cycles
Figure 14: Converting binary number 0110010111 into CSD representation
Figure 15: Hardwired CSD multiplier for multiplying cos(3π/16) with 8-bit unsigned integer
Figure 16: Hardwired CSD multiplier for multiplying cos(3π/16) with 8-bit signed integer
Figure 17: Pipelined kcn block

List of Tables

Table 1: Transfer function of Loeffler's FDCT building blocks
Table 2: Transfer function of Loeffler's IDCT building blocks
Table 3: Complexities of different FDCT algorithms
Table 4: IEEE Standard 1180-1990 IDCT Precision Requirement
Table 5: Truncation errors against the number of truncated bits
Table 6: Comparison of general-purpose multiplication against ROM based multiplication
Table 7: Canonical sign-digit representation of cos(nπ/16)
Table 8: Truth-table of b-1
Table 9: Truth table to simplify sign-extension
Table 10: Comparison of 32-bit CSD Wallace-tree multiplier with 4 different general-purpose multipliers using Xilinx 4052XL-1 FPGA technology
Table 11: IEEE Standard 1180-1990 Compliance for Proposed IDCT
Table 12: Latencies for 1-D FDCT and 1-D IDCT
Table 13: Latencies for 2-D FDCT and 2-D IDCT
Table 14: Process and Specifications of the proposed FDCT/IDCT designs
Table 15: Summary of specifications of several FDCT/IDCT chips
Table 16: Energy Efficiency (Switching Capacitances/Sample in 0.35 µm technology)
Table 17: Truncation errors of test sequences: coke, salesman, and tennis

List of Acronyms

CCITT   International Telegraph and Telephone Consultative Committee
CMC     Canadian Microelectronic Corporation
CMG     Constant Multiplier Generator
CPA     Carry Propagate Adder
CSA     Carry Save Adder
CSD     Canonical Sign-Digit
DA      Distributed Arithmetic
dB      Decibel
DCT     Discrete Cosine Transform
DFT     Discrete Fourier Transform
FDCT    Forward Discrete Cosine Transform
FPGA    Field-Programmable Gate-Array
HDTV    High-definition TV
IDCT    Inverse Discrete Cosine Transform
IEEE    Institute of Electrical and Electronics Engineers
JPEG    Joint Photographic Experts Group
MC      Motion Compensation
ME      Motion Estimation
MHz     Mega-Hertz
MOS     Metal-Oxide Semiconductor
MPEG    Moving Picture Experts Group
MUX     Multiplexer
NMOS    N-type MOS
PMOS    P-type MOS
PSNR    Peak Signal-to-Noise Ratio
ROM     Read-Only Memory
SD      Sign-Digit
SFG     Signal Flow Graph
VLC     Variable-Length Coding
VLSI    Very Large-Scale Integration

Chapter 1

Introduction

1.1. Research Motivation

Waveform compression has been an important research topic, and it has wide industry applications. The term waveform is a generic term that can be applied to speech signals, still images, or video signals. Generally speaking, these waveforms require large storage in physical devices, and require large communication bandwidth to transmit. For example, one hour of color 704x480 frame-size video requires 704x480 (bytes/frame) x 1.5 (for color frames) x 30 (frames/sec) x 60 (sec/min) x 60 (min/hour) ≈ 54.7 GB to store/transmit. That is an enormous amount of data. Due to the nature of these signals, redundancies can be removed by means of waveform compression. In practice, for video signals, one can achieve compression ratios from 40:1 (for high quality) up to 80:1 (for low quality). In other words, at 40:1, one hour of digital video requires only about 1.37 GB to store or transmit.
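The storage figures above can be reproduced with a few lines of arithmetic. The following is a minimal sketch; the variable names are illustrative, and it assumes decimal gigabytes (1 GB = 10^9 bytes), which is what the 54.7 GB figure implies.

```python
# Raw storage for one hour of 704x480 video at 30 frames/s.
# The 1.5 factor accounts for the colour data, as in the text.
WIDTH, HEIGHT = 704, 480
COLOR_FACTOR = 1.5
FRAMES_PER_SECOND = 30
SECONDS_PER_HOUR = 60 * 60

raw_bytes = WIDTH * HEIGHT * COLOR_FACTOR * FRAMES_PER_SECOND * SECONDS_PER_HOUR
raw_gb = raw_bytes / 1e9  # assumption: 1 GB = 10**9 bytes

print(f"uncompressed: {raw_gb:.1f} GB")
for ratio in (40, 80):
    # Size after compression at the quoted high-quality and low-quality ratios.
    print(f"{ratio}:1 -> {raw_gb / ratio:.2f} GB")
```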

The discrete cosine transform (DCT) has been widely used in waveform compression because it features good energy compaction and low computational complexity. It has become an integral part of many waveform compression standards, such as JPEG, MPEG-2, MPEG-4, CCITT Recommendation H.261 and H.263, and HDTV [36].

The DCT, like the Discrete Fourier Transform (DFT), is used to transform the signal to the frequency domain. Unlike the DFT, which uses complex exponentials as basis functions, the DCT uses cosines (real numbers) as basis functions. Since the human audio-visual system is less sensitive to high-frequency harmonics, waveform compression standards use the DCT to transform the signal to the frequency domain and perform compression on the DCT coefficients.

As an example, for video compression, both temporal and spatial redundancies are eliminated as shown in Figure 1. The motion estimation/motion compensation (ME/MC) block is used to reduce temporal redundancies due to high correlation among adjacent frames. The forward DCT (FDCT) together with the quantizer is used to reduce spatial redundancies. Finally, the variable-length coder (VLC) is used to reduce coding redundancies.

Figure 1: General block diagram of video compression encoder (input: uncompressed sequence; output: compressed sequence)

With the advances in communication and VLSI technologies, it is expected that video telephony/conferencing on mobile devices will be more and more common in the future. Because mobile devices operate on battery power, they always have stringent power specifications in order to extend battery life and the time between recharges. Also, to save valuable communication bandwidth, video compression is always performed in these applications. As a result, the DCT chip is an integral part of video communication mobile devices, and the design of a low-power DCT chip is an important problem. In this thesis, a low-power data-dependent DCT/IDCT design is presented to meet this need.

1.2. Contribution of this Thesis

Many earlier fast DCT algorithms are aimed at reducing the number of multiplications because general-purpose multipliers are assumed to be the basic hardware elements for computing the DCT. Later on, other design techniques, such as digital filtering and distributed arithmetic (DA), were also used to compute the DCT [9]. In more recent works, data-dependent DCT algorithms have been introduced in [19]-[21], [28]. Unlike traditional algorithms, which have fixed computation complexity, data-dependent algorithms have variable run-time complexities that depend on the statistical properties of the input data. They may yield fewer or more computations at run time than the fixed-complexity algorithms.

To reduce the power consumption, optimizations are performed at both the algorithmic level and the architectural level. The low-complexity Loeffler's [10] fast FDCT/IDCT algorithm is chosen to reduce the hardware requirement, which in turn reduces power.

The concept of data-dependent signal processing has also been applied to the fixed-complexity Loeffler's algorithm to reduce the switching activities. For both the FDCT and IDCT, zero-bypassing logic is inserted into the circuit to bypass redundant computations. The zero-bypassing logic takes advantage of the high correlation among input data for the FDCT, and the high proportion of zero inputs for the IDCT. Furthermore, the FDCT design also truncates bits from its input to reduce the amount of data to be processed, consequently reducing power consumption. The error introduced by the truncation is also analyzed in the thesis.

Further architectural optimization is performed on the multipliers. Since multiplication is a high-complexity operation compared to addition, the FDCT/IDCT designs use hardwired canonical sign-digit (CSD) Wallace-tree multipliers, because they consume the least power among the multipliers surveyed.

To summarize, the main contributions of this thesis are as follows:

* Introduce a new data-dependent FDCT/IDCT algorithm by merging the data-dependent processing concept with a fast FDCT/IDCT algorithm.

* Empirically study the effect of truncating some least significant bits of the FDCT input to save computation.

* Derive a detailed design procedure for implementing low-power constant-coefficient multipliers.

* Develop a code generator, written in C++, that generates VHDL code of constant multipliers for different specifications.

1.3. Power Measurement Criteria

In VLSI design, it is always difficult to compare one design with another due to differences in process technology (feature size), supply voltage, operating frequency, implementation approach (full-custom, semi-custom, etc.), optimization parameters, and design algorithms/architectures. Depending on the design goal, several comparison methods have been suggested and used, such as A, P, T, PT, AT, AP, etc., where A stands for area, T stands for time (delay), and P stands for power. Unfortunately, these criteria give only rough measures, which do not take process technology into account.

In this thesis, the proposed design is compared with other reported designs by comparing the switching capacitance per sample, which has been used in [19]-[21], [28]. In VLSI design, power can be estimated from the well-known formula:

\[ P = \frac{1}{2}\, p_t\, C_L\, V_{DD}^{2}\, f_{clk} \tag{1} \]

where $P$ is the power, $p_t$ is the switching probability, $C_L$ is the load capacitance (of the DCT/IDCT in this case), $f_{clk}$ is the clock frequency, and $V_{DD}$ is the supply voltage. From equation (1), the switching capacitance is defined as $\frac{1}{2} p_t C_L$, and the switching capacitance per sample is obtained by dividing the switching capacitance by the number of input/output samples per clock cycle. Since switching capacitance is directly proportional to power, this measurement method compares relative energy efficiency rather than absolute values, as AP, PT, etc. do. It indicates how much power (switching capacitance) is required to obtain one output.

The main advantage of this method is that it takes out the effect of different process technologies by performing technology scaling. Thus, to compare a design in one technology with a design in a different technology, technology scaling is first performed on the measured power, then the effects of clock frequency and supply voltage are factored out to obtain the switching capacitance per sample.
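As a rough illustration of factoring out clock frequency and supply voltage, the sketch below rearranges the power formula P = (1/2) p_t C_L V_DD^2 f_clk to recover the switching capacitance per sample. The function name is hypothetical. The example call uses the FDCT figures quoted in the abstract (122.7 mW, 40 MHz, 320M samples/sec, hence 8 samples per clock) and assumes that the 3 V in the technology name is the supply voltage; no technology scaling is applied, so the printed value is not necessarily the figure reported later in the thesis.

```python
def switching_capacitance_per_sample(power_w, vdd_v, f_clk_hz, samples_per_clock):
    """Return (1/2) * p_t * C_L per sample, in farads.

    Rearranged from P = (1/2) * p_t * C_L * Vdd^2 * f_clk.
    """
    switching_capacitance = power_w / (vdd_v ** 2 * f_clk_hz)
    return switching_capacitance / samples_per_clock


# Illustrative inputs only: 3.0 V supply is an assumption, and
# 8 samples per clock follows from 320M samples/s at 40 MHz.
c_per_sample = switching_capacitance_per_sample(
    power_w=122.7e-3, vdd_v=3.0, f_clk_hz=40e6, samples_per_clock=8
)
print(f"{c_per_sample * 1e12:.1f} pF per sample")
```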

1.4. Thesis Organization

The organization of this thesis is as follows: in Chapter 2, the definition of the discrete cosine transform and its inverse and the algorithm used in the proposed design are described. Chapter 3 describes the data-dependent signal processing concept and how it is incorporated into the design. Chapter 4 summarizes the pros and cons of several multiplier architectures, and provides a detailed design procedure for the selected multiplier, the hardwired canonical sign-digit (CSD) Wallace-tree multiplier. Chapter 5 describes the design automation effort made to facilitate the implementation of hardwired multipliers. The IDCT accuracy test result and pipelining design are also described. In Chapter 6, synthesis results of the new FDCT/IDCT designs are reported and compared against previously reported implementations.

Chapter 2

Background of FDCT/IDCT

Since there exist many DCT definitions [38], the forward DCT (FDCT) and its

inverse (IDCT) are defined in Section 2.1 for clarification.

Numerous fast algorithms for both FDCT and IDCT have been reported in the literature. Most of them attempted to minimize the number of additions and multiplications ([1], [8]-[13], [17]-[18], [19], etc.). These algorithms usually take advantage of the symmetry in the cosine basis functions, and the computation complexity is fixed for all input data (data-independent algorithms). Since multiplication requires more hardware and computation time than addition, fewer multiplications imply lower power.

In Section 2.2, several existing fast FDCT/IDCT algorithms are studied and compared. Loeffler's [10] algorithm is chosen to be the fundamental FDCT/IDCT algorithm of the proposed design.

Since the FDCT is always followed by a quantizer, its precision requirement is not high. On the contrary, the IDCT is used to perform the inverse transformation at both the encoder and the decoder, which requires high precision. It needs to conform to IEEE Standard 1180-1990, which is described in Section 2.3.

2.1. Definition of DCT and its Inverse

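The N-point 1-D forward DCT (FDCT) is defined in equation 2:

\[ X(n) = \sqrt{\frac{2}{N}}\, C(n) \sum_{k=0}^{N-1} x(k) \cos\frac{(2k+1)n\pi}{2N}, \qquad n = 0, 1, \ldots, N-1 \tag{2} \]

The N-point 1-D inverse DCT (IDCT) is defined in equation 3:

\[ x(k) = \sqrt{\frac{2}{N}} \sum_{n=0}^{N-1} C(n)\, X(n) \cos\frac{(2k+1)n\pi}{2N}, \qquad k = 0, 1, \ldots, N-1 \tag{3} \]

where $C(n) = 1/\sqrt{2}$ for $n = 0$ and $C(n) = 1$ for $n = 1, 2, \ldots, N-1$.

Similarly, the NxN 2-D FDCT is defined as follows [4]:

\[ X(m,n) = \frac{2}{N}\, C(m)\, C(n) \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x(i,j) \cos\frac{(2i+1)m\pi}{2N} \cos\frac{(2j+1)n\pi}{2N} \tag{4} \]

and the NxN 2-D IDCT is defined as:

\[ x(i,j) = \frac{2}{N} \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} C(m)\, C(n)\, X(m,n) \cos\frac{(2i+1)m\pi}{2N} \cos\frac{(2j+1)n\pi}{2N} \tag{5} \]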

Notice that the 2-D NxN FDCT/IDCT is a separable transformation, which means that it can be obtained by first performing the 1-D N-point DCT/IDCT on the rows, then performing the 1-D N-point DCT/IDCT on the columns, or the other way around. This method of computing the 2-D DCT/IDCT is generally referred to as the row-column method or indirect method. The general block diagram of this method is shown in Figure 2.
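The separability property can be checked numerically. The sketch below is illustrative only: it assumes the orthonormal scaling (a factor of sqrt(2/N) C(n) per 1-D transform, with C(0) = 1/sqrt(2)), uses plain floating point rather than the fixed-point hardware datapath, and all function names are hypothetical. The transposition between the two 1-D passes plays the role of the transpose memory.

```python
import math


def dct_1d(x):
    """N-point forward DCT with the orthonormal scaling sqrt(2/N) * C(n)."""
    n_points = len(x)
    out = []
    for n in range(n_points):
        c_n = 1 / math.sqrt(2) if n == 0 else 1.0
        acc = sum(x[k] * math.cos((2 * k + 1) * n * math.pi / (2 * n_points))
                  for k in range(n_points))
        out.append(math.sqrt(2 / n_points) * c_n * acc)
    return out


def transpose(block):
    return [list(column) for column in zip(*block)]


def dct_2d_row_column(block):
    """1-D DCT on every row, transpose, 1-D DCT again, transpose back."""
    after_rows = [dct_1d(row) for row in block]
    after_columns = [dct_1d(column) for column in transpose(after_rows)]
    return transpose(after_columns)


def dct_2d_direct(block):
    """Direct evaluation of the 2-D definition, used only as a reference."""
    size = len(block)
    out = [[0.0] * size for _ in range(size)]
    for m in range(size):
        for n in range(size):
            c_m = 1 / math.sqrt(2) if m == 0 else 1.0
            c_n = 1 / math.sqrt(2) if n == 0 else 1.0
            acc = 0.0
            for i in range(size):
                for j in range(size):
                    acc += (block[i][j]
                            * math.cos((2 * i + 1) * m * math.pi / (2 * size))
                            * math.cos((2 * j + 1) * n * math.pi / (2 * size)))
            out[m][n] = (2 / size) * c_m * c_n * acc
    return out


# Arbitrary 8x8 test block; both methods should agree to rounding error.
block = [[(3 * i + 5 * j) % 11 for j in range(8)] for i in range(8)]
fast = dct_2d_row_column(block)
slow = dct_2d_direct(block)
max_diff = max(abs(fast[m][n] - slow[m][n]) for m in range(8) for n in range(8))
print(f"largest difference: {max_diff:.2e}")
```

The row-column version needs 16 one-dimensional transforms per 8x8 block, which is why a fast 1-D algorithm plus a transpose buffer is the usual hardware structure.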

Figure 2: 2-D FDCT/IDCT using row-column (separable) method

The row-column method is the most popular method in VLSI implementations ([2]-[7], [9], [14]-[16], etc.). Also, since the 8x8 block size is used by MPEG and other standards, the FDCT/IDCT design presented in this thesis uses the 8x8 block size.

2.2. Choices of Algorithms

Many fast DCT/IDCT algorithms have been reported in the literature. In this section, several fixed-complexity algorithms are reviewed and compared based on their arithmetic complexities. The comparison suggests that Loeffler's FDCT/IDCT algorithm is the most efficient, and it is used as the basis of the proposed design.

2.2.1. Chen's Algorithm Family

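Chen's fast algorithm [1], reported in 1977, is by far the most widely used DCT/IDCT algorithm. It has been used in [2]-[7] and many other papers. It is a fixed-complexity algorithm. The idea of Chen's algorithm is to exploit the symmetry in the DCT/IDCT transformation matrix. The 8-point DCT can be written in matrix form:

\[
\begin{bmatrix} X(0)\\ X(1)\\ X(2)\\ X(3)\\ X(4)\\ X(5)\\ X(6)\\ X(7) \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix}
a & a & a & a & a & a & a & a\\
b & d & e & g & -g & -e & -d & -b\\
c & f & -f & -c & -c & -f & f & c\\
d & -g & -b & -e & e & b & g & -d\\
a & -a & -a & a & a & -a & -a & a\\
e & -b & g & d & -d & -g & b & -e\\
f & -c & c & -f & -f & c & -c & f\\
g & -e & d & -b & b & -d & e & -g
\end{bmatrix}
\begin{bmatrix} x(0)\\ x(1)\\ x(2)\\ x(3)\\ x(4)\\ x(5)\\ x(6)\\ x(7) \end{bmatrix}
\tag{6}
\]

where

\[ a = \cos\frac{\pi}{4},\; b = \cos\frac{\pi}{16},\; c = \cos\frac{\pi}{8},\; d = \cos\frac{3\pi}{16},\; e = \cos\frac{5\pi}{16},\; f = \cos\frac{3\pi}{8},\; g = \cos\frac{7\pi}{16} \]

Since the even rows of the transformation matrix are even-symmetric and the odd rows are odd-symmetric, by exploiting the symmetry and separating even and odd rows, equation (6) can be rewritten as follows:

\[
\begin{bmatrix} X(0)\\ X(2)\\ X(4)\\ X(6) \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix}
a & a & a & a\\
c & f & -f & -c\\
a & -a & -a & a\\
f & -c & c & -f
\end{bmatrix}
\begin{bmatrix} x(0)+x(7)\\ x(1)+x(6)\\ x(2)+x(5)\\ x(3)+x(4) \end{bmatrix},
\qquad
\begin{bmatrix} X(1)\\ X(3)\\ X(5)\\ X(7) \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix}
b & d & e & g\\
d & -g & -b & -e\\
e & -b & g & d\\
g & -e & d & -b
\end{bmatrix}
\begin{bmatrix} x(0)-x(7)\\ x(1)-x(6)\\ x(2)-x(5)\\ x(3)-x(4) \end{bmatrix}
\tag{7}
\]

Similarly, the 1-D IDCT can be rewritten as follows:

\[
\begin{bmatrix} x(0)\\ x(1)\\ x(2)\\ x(3) \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix}
a & c & a & f\\
a & f & -a & -c\\
a & -f & -a & c\\
a & -c & a & -f
\end{bmatrix}
\begin{bmatrix} X(0)\\ X(2)\\ X(4)\\ X(6) \end{bmatrix}
+ \frac{1}{2}
\begin{bmatrix}
b & d & e & g\\
d & -g & -b & -e\\
e & -b & g & d\\
g & -e & d & -b
\end{bmatrix}
\begin{bmatrix} X(1)\\ X(3)\\ X(5)\\ X(7) \end{bmatrix}
\]

\[
\begin{bmatrix} x(7)\\ x(6)\\ x(5)\\ x(4) \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix}
a & c & a & f\\
a & f & -a & -c\\
a & -f & -a & c\\
a & -c & a & -f
\end{bmatrix}
\begin{bmatrix} X(0)\\ X(2)\\ X(4)\\ X(6) \end{bmatrix}
- \frac{1}{2}
\begin{bmatrix}
b & d & e & g\\
d & -g & -b & -e\\
e & -b & g & d\\
g & -e & d & -b
\end{bmatrix}
\begin{bmatrix} X(1)\\ X(3)\\ X(5)\\ X(7) \end{bmatrix}
\tag{8}
\]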
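The even/odd separation can be sanity-checked in software. The sketch below is an assumption-laden illustration rather than Chen's full factorization: it covers only the first step (splitting one 8x8 matrix product into two 4x4 products on sums and differences), it assumes the 1/2 scaling of the orthonormal 8-point DCT, and it uses the conventional coefficient names a = cos(π/4), b = cos(π/16), c = cos(π/8), d = cos(3π/16), e = cos(5π/16), f = cos(3π/8), g = cos(7π/16).

```python
import math

a = math.cos(math.pi / 4)
b = math.cos(math.pi / 16)
c = math.cos(math.pi / 8)
d = math.cos(3 * math.pi / 16)
e = math.cos(5 * math.pi / 16)
f = math.cos(3 * math.pi / 8)
g = math.cos(7 * math.pi / 16)

# 4x4 sub-matrices acting on the sums (even outputs) and differences (odd outputs).
EVEN = [[a, a, a, a], [c, f, -f, -c], [a, -a, -a, a], [f, -c, c, -f]]
ODD = [[b, d, e, g], [d, -g, -b, -e], [e, -b, g, d], [g, -e, d, -b]]


def fdct8_even_odd(x):
    """8-point DCT via two 4x4 products: 32 multiplications instead of 64."""
    sums = [x[k] + x[7 - k] for k in range(4)]
    diffs = [x[k] - x[7 - k] for k in range(4)]
    out = [0.0] * 8
    for r in range(4):
        out[2 * r] = 0.5 * sum(EVEN[r][k] * sums[k] for k in range(4))
        out[2 * r + 1] = 0.5 * sum(ODD[r][k] * diffs[k] for k in range(4))
    return out


def fdct8_direct(x):
    """Reference: orthonormal 8-point DCT straight from the definition."""
    out = []
    for n in range(8):
        c_n = 1 / math.sqrt(2) if n == 0 else 1.0
        acc = sum(x[k] * math.cos((2 * k + 1) * n * math.pi / 16) for k in range(8))
        out.append(0.5 * c_n * acc)
    return out


samples = [12, -7, 30, 4, 0, 25, -18, 9]
chen_max_diff = max(abs(p - q) for p, q in zip(fdct8_even_odd(samples), fdct8_direct(samples)))
print(f"largest difference: {chen_max_diff:.2e}")
```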

2.2.2. Loeffler's FDCT/IDCT Algorithm

Loeffler's 1-D 8-point FDCT algorithm uses only 11 multiplications and 29 additions. The signal flow graph (SFG) of an 8-point 1-D DCT is shown in Figure 3, and the transfer functions of the building blocks are given in Table 1.

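Figure 3: Loeffler's FDCT algorithm [10] (signal flow graph, Stages 1 to 4)

| Symbol | Equation | Effort |
|---|---|---|
| Butterfly | $O_0 = I_0 + I_1$, $O_1 = I_0 - I_1$ | 2 add |
| Rotation (kcn) | $O_0 = I_0\,k\cos\frac{n\pi}{2N} + I_1\,k\sin\frac{n\pi}{2N}$, $O_1 = -I_0\,k\sin\frac{n\pi}{2N} + I_1\,k\cos\frac{n\pi}{2N}$ | 3 mult. + 3 add |
| Scaling | $O = \sqrt{2}\,I$ | 1 mult. |

Table 1: Transfer function of Loeffler's FDCT building blocks [10]

Notice that the second building block (kcn) requires only 3 multiplications and 3 additions, instead of 4 multiplications and 2 additions, when equation 9 is used.

\[
\begin{aligned}
O_0 &= aI_0 + bI_1 = (b - a)I_1 + a(I_0 + I_1)\\
O_1 &= -bI_0 + aI_1 = -(a + b)I_0 + a(I_0 + I_1)
\end{aligned}
\qquad \text{where } a = k\cos\frac{n\pi}{2N},\; b = k\sin\frac{n\pi}{2N}
\tag{9}
\]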
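The saving in the rotation block comes from sharing the product a(I0 + I1) between both outputs, with (b - a) and (a + b) precomputed as constants. A minimal sketch, with hypothetical function names and floating-point arithmetic standing in for the fixed-point hardware:

```python
import math


def rotation_constants(k, n, n_points=8):
    """a = k*cos(n*pi/2N) and b = k*sin(n*pi/2N) for one rotation block."""
    angle = n * math.pi / (2 * n_points)
    return k * math.cos(angle), k * math.sin(angle)


def rotate_4mult(i0, i1, k, n):
    """Straightforward form: 4 multiplications, 2 additions."""
    a, b = rotation_constants(k, n)
    return a * i0 + b * i1, -b * i0 + a * i1


def rotate_3mult(i0, i1, k, n):
    """Shared-product form: 3 multiplications, 3 additions."""
    a, b = rotation_constants(k, n)
    b_minus_a = b - a          # constant, fixed at design time
    a_plus_b = a + b           # constant, fixed at design time
    shared = a * (i0 + i1)     # 1 add, 1 mult, reused by both outputs
    o0 = b_minus_a * i1 + shared   # 1 mult, 1 add
    o1 = -a_plus_b * i0 + shared   # 1 mult, 1 add
    return o0, o1


# Example values only; k = sqrt(2), n = 6 is one rotation used in Loeffler's graph.
ref = rotate_4mult(37, -12, math.sqrt(2), 6)
fast = rotate_3mult(37, -12, math.sqrt(2), 6)
print(ref, fast)
```

Trading one multiplication for one addition is worthwhile here because a multiplier costs far more hardware and power than an adder.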

By reversing the transfer function of each building block shown in Table 1, and reversing the signal-flow direction, it is easy to show that the IDCT has the SFG shown in Figure 4, with the building block transfer functions shown in Table 2. Notice that the Loeffler IDCT algorithm has the same arithmetic complexity as in the FDCT case (11 multiplications and 29 additions). Notice also that division by 2 is counted as no operation, since it can be realized by ignoring the least-significant bit of the value to be divided.

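Figure 4: Loeffler's IDCT algorithm (signal flow graph, Stages 1 to 4)

| Symbol | Equation | Effort |
|---|---|---|
| Butterfly | $O_0 = (I_0 + I_1)/2$, $O_1 = (I_0 - I_1)/2$ | 2 add |
| Rotation (kcn) | $O_0 = I_0\,\frac{1}{k}\cos\frac{n\pi}{2N} - I_1\,\frac{1}{k}\sin\frac{n\pi}{2N}$, $O_1 = I_0\,\frac{1}{k}\sin\frac{n\pi}{2N} + I_1\,\frac{1}{k}\cos\frac{n\pi}{2N}$ | 3 mult. + 3 add |
| Scaling | $O = I/\sqrt{2}$ | 1 mult. |

Table 2: Transfer function of Loeffler's IDCT building blocks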

2.2.3. Jeong's FDCT Algo rithm

Jeong's [13] 8-point FDCT algorithm, reported in 1998, uses 28 additions and 12 multiplications. This algorithm is special because it performs most multiplications at the final stage and requires fewer multiplication stages than other algorithms, so propagation errors occurring in the fixed-point computation can be reduced.

By separating even and odd points in the DCT, this algorithm uses trigonometric identities to reduce the number of multiplications needed to calculate the DCT.

* Even points:

\[ X(2l) = \sum_{k=0}^{N/2-1} \left[ x(k) + x(N-1-k) \right] \cos\frac{(2k+1)\,2l\,\pi}{2N}, \qquad l \in [0, 3] \]

* Odd points:

\[ X(2l+1) = \left[ 2\cos\frac{(2l+1)\pi}{2N} \right]^{-1} \left\{ \sum_{m=0}^{N/4-1} \left[ y(2m-1) + y(2m) \right] \cos\frac{(2l+1)\,2m\,\pi}{N} + \sum_{m=0}^{N/4-1} \left[ y(2m) + y(2m+1) \right] \cos\frac{(2l+1)(2m+1)\pi}{N} \right\} \]

where $y(k) = x(k) - x(N-1-k)$ and $y(-1) = 0$.

The signal flow graph is shown in Figure 5.

Figure 5: Jeong's fast FDCT algorithm [13] (signal flow graph containing a 4-point DCT block)
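The odd-point expression follows from the identity 2 cos A cos B = cos(A + B) + cos(A - B), applied with y(k) = x(k) - x(N-1-k) and y(-1) = 0. The sketch below checks that identity numerically for N = 8. It is a verification aid under stated assumptions, not Jeong's signal flow graph: normalization factors are left out on both sides, and the function names are hypothetical.

```python
import math


def odd_points_direct(x):
    """Odd DCT outputs X(2l+1) from the definition, without normalization."""
    n = len(x)
    y = [x[k] - x[n - 1 - k] for k in range(n // 2)]
    return [
        sum(y[k] * math.cos((2 * k + 1) * (2 * l + 1) * math.pi / (2 * n))
            for k in range(n // 2))
        for l in range(n // 2)
    ]


def odd_points_factored(x):
    """Same outputs with the 1 / (2 cos((2l+1)pi/2N)) factor pulled out."""
    n = len(x)
    y = [x[k] - x[n - 1 - k] for k in range(n // 2)]

    def y_at(k):
        return y[k] if k >= 0 else 0.0  # y(-1) = 0 by convention

    result = []
    for l in range(n // 2):
        acc = 0.0
        for m in range(n // 4):
            acc += (y_at(2 * m - 1) + y_at(2 * m)) * math.cos((2 * l + 1) * 2 * m * math.pi / n)
            acc += (y_at(2 * m) + y_at(2 * m + 1)) * math.cos((2 * l + 1) * (2 * m + 1) * math.pi / n)
        result.append(acc / (2 * math.cos((2 * l + 1) * math.pi / (2 * n))))
    return result


signal = [3, 14, -15, 9, 2, -6, 5, 35]
jeong_max_diff = max(abs(p - q) for p, q in zip(odd_points_direct(signal), odd_points_factored(signal)))
print(f"largest difference: {jeong_max_diff:.2e}")
```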

2.2.4. Summary and Comparison of Algorithm Complexities

Since in VLSI implementation each computation, i.e. addition and multiplication, requires hardware and consumes power, algorithms with fewer additions/multiplications lead to lower power. Also, since multiplication requires more power than addition, one algorithm is better than another if it requires fewer multiplications (for integer operations).

Table 3 summarizes the complexity of several fixed-complexity FDCT algorithms. In [34], Duhamel demonstrated that the theoretical lower bound of an 8-point DCT is 11 multiplications. Since the number of multiplications in Loeffler's [10] algorithm reaches the theoretical lower bound and its number of additions is no worse than that of most other algorithms (only Chen's and Jeong's use fewer), Loeffler's algorithm is chosen.

| Algorithm | Chen [1] | Wang [31] | Lee [11] | Vetterli [32] | Suehiro [33] | Hou [12] | Jeong [13] | Loeffler [10] |
|---|---|---|---|---|---|---|---|---|
| Multiplication | 16 | 13 | 12 | 12 | 12 | 12 | 12 | 11 |
| Add | 26 | 29 | 29 | 29 | 29 | 29 | 28 | 29 |

Table 3: Complexities of different FDCT algorithms (columns 2-7 adapted from Table 1 in [10])

2.3. Precision Requirements of IDCT

In video compression, the precision requirement of the FDCT is not high because it is always followed by heavy quantization. On the contrary, since the IDCT is used for sequence reconstruction, it is important for the IDCT to be computed with high precision.

The IEEE Standard 1180-1990 [27] defines the specification for implementations of the IDCT. The setup for measuring the accuracy of an 8x8 IDCT block is shown in Figure 6.

Figure 6: Setup for measuring the accuracy of a proposed 8x8 IDCT (Figure 2 in [27]). Blocks shown: random number generator, reference 8x8 FDCT (separable, orthogonal, with at least 64-bit floating-point accuracy), round and clip stages, reference 8x8 IDCT, and the proposed IDCT under test.

The standard defines a random number generator that generates numbers within lower and upper bounds (-L and H) inclusive. Based on these random numbers, 10000 8x8 blocks for each of (L=256, H=255), (L=H=5) and (L=H=300) are used as input to the reference FDCT and passed through the setup shown in Figure 6. The error, $e_k(i,j)$, is defined to be the difference between the "test" IDCT output and the "reference" IDCT output, i.e.:

\[ e_k(i,j) = \hat{x}_k(i,j) - x_k(i,j) \]

The standard defines the following terms to measure the error (see Table 4).
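To make the error terms concrete, the sketch below computes them from two sets of 8x8 output blocks. It is an illustration only: the function name and the dictionary keys are hypothetical, the per-pixel terms are reported as their worst case over all 64 pixel positions, and the random-number generator, rounding, and clipping stages of the standard's test setup are not modelled.

```python
def idct_error_metrics(test_blocks, reference_blocks):
    """Compute peak, mean-square, and mean error terms over a set of 8x8 blocks.

    Each argument is a list of 8x8 blocks (lists of lists of pixel values).
    """
    count = len(test_blocks)
    sum_err = [[0.0] * 8 for _ in range(8)]
    sum_sq = [[0.0] * 8 for _ in range(8)]
    peak = 0.0
    for test, ref in zip(test_blocks, reference_blocks):
        for i in range(8):
            for j in range(8):
                err = test[i][j] - ref[i][j]   # e_k(i, j)
                peak = max(peak, abs(err))
                sum_err[i][j] += err
                sum_sq[i][j] += err * err
    return {
        "ppe": peak,
        # Worst per-pixel mean square error and mean error over all (i, j).
        "pmse": max(sum_sq[i][j] / count for i in range(8) for j in range(8)),
        "pme": max(abs(sum_err[i][j]) / count for i in range(8) for j in range(8)),
        # Overall terms average over all 64 pixels and all blocks.
        "omse": sum(map(sum, sum_sq)) / (64 * count),
        "ome": abs(sum(map(sum, sum_err))) / (64 * count),
    }


# Tiny synthetic example: four all-zero reference blocks, one pixel off by 1.
reference = [[[0] * 8 for _ in range(8)] for _ in range(4)]
under_test = [[row[:] for row in blk] for blk in reference]
under_test[0][0][0] = 1
metrics = idct_error_metrics(under_test, reference)
print(metrics)
```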

2.4. Chapter Summary

In this chapter, the FDCT and IDCT are defined. Several fast fixed-complexity FDCT/IDCT algorithms are reviewed and their computational complexities are summarized in Table 3. Since low arithmetic complexity usually implies low power, Loeffler's algorithm is used as the basis of the proposed design.

The IEEE 1180-1990 standard is also described in this chapter. The standard defines the precision requirements of the IDCT, to which the new IDCT design will conform.

In the next chapter, a detailed description is presented to show how the data-dependent concept is integrated into Loeffler's FDCT/IDCT algorithm to make it a data-dependent algorithm.

Table 4: IEEE Standard 1180-1990 IDCT Precision Requirement

Term | Definition | Maximum Magnitude
Peak error (ppe) | $\max |e_k(i,j)|$ | 1
Mean square error for any pixel (pmse) | $pmse(i,j) = \frac{\sum_{k=1}^{10000} e_k^2(i,j)}{10000}$ | 0.06
Overall mean square error (omse) | $omse = \frac{\sum_{i=0}^{7} \sum_{j=0}^{7} \sum_{k=1}^{10000} e_k^2(i,j)}{64 \times 10000}$ | 0.03
Mean error for any pixel (pme) | $pme(i,j) = \frac{\sum_{k=1}^{10000} e_k(i,j)}{10000}$ | 0.015
Overall mean error (ome) | $ome = \frac{\sum_{i=0}^{7} \sum_{j=0}^{7} \sum_{k=1}^{10000} e_k(i,j)}{64 \times 10000}$ | 0.0015

For all-zero input, the proposed IDCT shall generate all-zero output.
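The error terms of Table 4 can be illustrated with a short Python sketch. It is not the measurement program used in this work; the function name `ieee1180_metrics`, the dictionary keys, and the toy data are assumptions made for illustration, and the per-pixel terms are reported as their worst case over the 64 pixel positions.

```python
def ieee1180_metrics(errors):
    """Compute the Table 4 error terms (illustrative sketch).

    errors: list of blocks; each block is an 8x8 list of lists holding
    e_k(i, j) = test IDCT output - reference IDCT output.
    """
    num_blocks = len(errors)
    pixels = [(i, j) for i in range(8) for j in range(8)]

    # Peak error: largest |e_k(i, j)| over all pixels and all blocks.
    ppe = max(abs(block[i][j]) for block in errors for i, j in pixels)

    # Per-pixel mean square error and mean error, averaged over the blocks.
    pmse = {(i, j): sum(b[i][j] ** 2 for b in errors) / num_blocks
            for i, j in pixels}
    pme = {(i, j): sum(b[i][j] for b in errors) / num_blocks
           for i, j in pixels}

    return {
        "ppe": ppe,
        "pmse": max(pmse.values()),               # worst pixel
        "omse": sum(pmse.values()) / 64,          # average over the 64 pixels
        "pme": max(abs(v) for v in pme.values()),  # worst pixel, magnitude
        "ome": abs(sum(pme.values()) / 64),
    }


# Hypothetical toy data: two blocks whose only errors are +1 and -1 at (0, 0).
zero = [[0] * 8 for _ in range(8)]
plus = [row[:] for row in zero]
plus[0][0] = 1
minus = [row[:] for row in zero]
minus[0][0] = -1
metrics = ieee1180_metrics([plus, minus])
```

In a real compliance run the list would hold the 10000 error blocks of one input range, and each returned value would be compared against the corresponding maximum magnitude in Table 4.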

Chapter 3

Design Choices for the FDCT/IDCT

In this chapter, the data-dependent processing concept is applied to Loeffler's FDCT/IDCT algorithm. In Sections 3.1 and 3.2, data-dependent bypassing logic is inserted into Loeffler's FDCT/IDCT algorithms to achieve more power reduction. To further reduce the computational complexity, the least significant bits of the FDCT inputs are truncated. The effect of truncation is studied in detail.

Since the row-column method is used to compute the 2-D FDCT/IDCT by using two 1-D FDCT/IDCT stages with a transpose memory in between (see Figure 2), Section 3.3 studies two transpose memory architectures. The on-the-fly transpose memory architecture is used in this work.

3.1. Data-Dependent Loeffler's FDCT Algorithm

To obtain a power-efficient design, data-dependent processing and truncation techniques are incorporated into Loeffler's FDCT algorithm.

3.1.1. Data-Dependent Bypassing Logic

Loeffler's FDCT algorithm performs several butterfly operations on the inputs (see Figure 3). In general, the inputs are well correlated for the FDCT. Thus, the subtractions used in the butterflies are very likely to produce zeros or small numbers. Since most multiplications are performed in the kcn blocks, and the inputs of the kcn blocks are the results of subtractions, adding zero-bypassing logic in front of each multiplication in the kcn blocks will reduce the number of multiplications. As shown in Figure 7, the zero-bypassing logic only adds the non-zero-detection logic (AND gate), a register, and a multiplexer (MUX) to the circuit. The overhead introduced, in both area and speed, is small compared to the multiplier itself.

[Figure 7 diagram; recoverable label: non-zero detection driving a register that loads when the input is non-zero.]

Figure 7: Zero Bypassing Multiplier

By segmenting the inputs of the multipliers into several smaller chunks (data segmentation), further computational reduction can be achieved by taking advantage of the fact that the inputs of the kcn blocks are very likely to be small numbers, because they are obtained from butterflying highly correlated data. Thus, instead of multiplying x by c directly, the multiplication is done by breaking x into m segments, performing a multiplication on each segment, and then adding the products together with the proper offsets (see Figure 8). The sum of the products is still x×c. By inserting bypassing logic in front of each smaller multiplier, the all-zero segments of small inputs can be bypassed, consequently reducing the switching activity and the power. For example, if x = 00000111b (7d) with two segments, x×c is performed as (0000×c)<<4 + 0111×c. With zero-bypassing logic inserted, 0000×c is bypassed and uses no operation.
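The segmentation idea can be sketched behaviourally in Python. This is only an illustration for unsigned inputs: the function name `segmented_multiply`, its parameters, and the example constant 181 are assumptions, and the model counts how many segment multipliers are exercised rather than describing the actual circuit.

```python
def segmented_multiply(x, c, width=8, segments=2):
    """Multiply unsigned x by constant c one segment at a time.

    Returns (product, multiplications), where multiplications counts the
    segment multipliers that were not bypassed.
    """
    seg_bits = width // segments
    mask = (1 << seg_bits) - 1
    product = 0
    multiplications = 0
    for s in range(segments):
        offset = s * seg_bits
        chunk = (x >> offset) & mask
        if chunk == 0:
            continue  # zero bypass: this segment multiplier stays idle
        product += (chunk * c) << offset  # partial product with its offset
        multiplications += 1
    return product, multiplications


small = segmented_multiply(0b00000111, 181)   # upper segment is bypassed
large = segmented_multiply(0b10010111, 181)   # both segments are used
zero = segmented_multiply(0, 181)             # everything is bypassed
```

With x = 00000111b only the lower segment is multiplied, which mirrors the example in the text.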

Figure 8: Multiplication Segmentation

The choice of segment size affects the probability of zero bypassing. One extreme is that there is only one segment, which is the direct multiplication x×c. The other extreme is that each segment is only one bit, which is essentially a shift-and-add operation. Theoretically, with one-bit segments, one achieves the highest bypassing probability and the lowest amount of multiplication. However, this requires the largest number of additions to combine the partial products into the final product: for n segments, n partial products must be added together. Having more segments also implies more complicated control logic and more delay to produce the final result. Thus, with the trade-off between the probability of bypassing and the segmentation overhead in mind, we decided to use two segments for the FDCT multiplications. This allows bypassing of small numbers while keeping the segmentation overhead small, since there are only two partial products to be added.

3.1.2. Truncate Some Least-Significant Bits from Input

Since the IEEE standard [27] defines only the precision requirements for the IDCT, and since the FDCT is usually followed by quantization, in this thesis some least-significant bits (LSBs) of the FDCT inputs are truncated. Truncating input bits results in less computation and consequently reduces power consumption and increases speed. On the other hand, truncation introduces error at the output. Although some of the error introduced by the truncation will be masked by the heavy quantization that follows the FDCT module, the error still exists. Thus, truncation allows a trade-off between power and error. The goal is to find the best strategy for truncating input bits so that the error is in an acceptable range for the application.

In a 2-D 8x8 FDCT, there are eight 8-point 1-D FDCTs in the first dimension (rows), and eight 8-point 1-D FDCTs in the second dimension (columns). Let Trunc(d,n) denote the number of bits to be truncated from the n-th 1-D FDCT of dimension d, where d = 1 (row), 2 (column) and n = 0...7. The truncation for all eight inputs of any 1-D 8-point FDCT is the same. Figure 9 illustrates the detailed view of the 2-D row-column FDCT with truncation.

[Figure 9 diagram; recoverable labels: dimension 1 (d=1), 1-D FDCT on rows; dimension 2 (d=2), 1-D FDCT on columns; a Trunc(d,n) stage in front of each 1-D FDCT.]

Figure 9: 2-D row-column FDCT with truncation

If we allow truncating at most m bits from each 1-D FDCT, since there are 16 1-D FDCT blocks, there are a total of $(m+1)^{16}$ possible combinations (including no truncation, m=0). Even when m is small, say m=1, there are still 65536 possibilities to be examined. Fortunately, not all combinations are valid from the distortion point of view.

In practice, since human eyes/ears are less sensitive to high frequency signal components, higher frequency FDCT coefficients (larger n) are quantized more heavily than the lower frequency coefficients. This fact suggests that the effect of truncation on higher frequency FDCT coefficients is less than on the lower frequency coefficients. This argument leads to the following equation.

$Trunc(d, n_1) \le Trunc(d, n_2) \quad \text{if } n_1 < n_2$    (10)

Further reduction of the test cases can be achieved due to the fact that the transpose matrix distributes all coefficients computed in each of the first-dimension FDCT modules to all second-dimension FDCT modules. Thus, all first-dimension (d=1) FDCT modules are equally important, i.e.:

$\forall n: Trunc(1, n) = k$, where k is a constant    (11)

Since the truncation error introduced in the first stage affects the entire second stage, to have a more accurate result, k=0 (no truncation at the first-dimension FDCT blocks) is used in the design of the FDCT.

Figure 10: Test model to measure the effect of truncation

To have a quantitative measure of the truncation effect, a standard MPEG-2 encoder is modified as the test model (see Figure 10). By changing Trunc(2,n), different PSNR values are measured. The PSNR values are then compared against the reference: the PSNR with no truncation (Trunc(2,n)=0 for all n). A smaller PSNR difference indicates smaller distortion introduced by truncation. The truncation error is defined as:

Truncation Error = Average PSNR(reference) - Average PSNR(truncation)    (12)

Since the goal is to save power, one combination is better than another if it truncates more bits but has a higher PSNR (smaller truncation error), i.e.

$\sum_{n=0}^{7} Trunc_{case1}(2,n) > \sum_{n=0}^{7} Trunc_{case2}(2,n)$, and $PSNR_{case1} > PSNR_{case2}$    (13)

Three test video sequences (coke, salesman, and tennis) are used to measure the truncation errors. Each sequence has 180 frames and is encoded using pure I-frames at 8 Mb/s. The FDCT is computed using fixed-point calculation with 11-bit precision after the binary point.

To show the effect of truncation, all 165 possible combinations are tested using m=3 (truncate at most 3 bits) and Trunc(1,n)=0 (no truncation for the first-dimension FDCT). The test results (truncation errors) are shown in Appendix A.
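The size of the search space can be checked with a few lines of Python. This sketch only counts patterns; the function name `monotonic_truncation_patterns` is an illustrative assumption. It enumerates the second-dimension patterns Trunc(2,0..7) that are non-decreasing in n, which is the ordering constraint stated above, and compares the count with the unconstrained $(m+1)^{16}$.

```python
from itertools import combinations_with_replacement


def monotonic_truncation_patterns(max_bits, num_modules=8):
    """All patterns (Trunc(2,0), ..., Trunc(2,7)) that are non-decreasing in n.

    combinations_with_replacement over a sorted range yields exactly the
    non-decreasing tuples of the requested length.
    """
    return list(combinations_with_replacement(range(max_bits + 1), num_modules))


patterns = monotonic_truncation_patterns(3)   # m = 3, first dimension fixed to 0
unconstrained_m1 = (1 + 1) ** 16              # m = 1, all 16 modules free
unconstrained_m3 = (3 + 1) ** 16              # m = 3, all 16 modules free
```

For m=3 the list contains 165 patterns, among them the all-ones pattern (1,1,1,1,1,1,1,1), compared with more than four billion unconstrained combinations.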

Table 5 lists, for each total number of truncated bits, the best truncation pattern and its average truncation error compared to all other truncation patterns with the same total number of truncated bits. In this thesis, the truncation pattern Trunc(1,n)=0 and Trunc(2,n)=(1,1,1,1,1,1,1,1) is used in the implementation of the FDCT because its truncation error is moderate (around 0.5 dB).

[Table 5 body not recoverable from the source; columns: Total Truncated Bits (0 to 12), Trunc(2,n), Truncation Error (dB).]

Table 5: Truncation errors against the number of truncated bits

3.2. Data-Dependent Loeffler's IDCT Algorithm

Like the FDCT, the row-column method is used to compute the 2-D IDCT. Due to the heavy quantization in the encoder (for high compression), a high proportion of the coefficients are expected to be zero at the input of the first-dimension IDCT.

One problem with Xanthopoulos's data-dependent DCT designs in [19]-[21] is that they may result in more computation than the fixed-complexity fast algorithms. In the worst case, such as when the input does not satisfy the assumed statistical property, the data-dependent design in [19]-[21] may require as many as 1024 multiplications for a 2-D IDCT, i.e. it degenerates to its base algorithm (direct IDCT computation).

In this work, like the FDCT, zero-bypassing logic is inserted into the IDCT circuit to reduce the amount of computation. Since zero-bypassing logic does not increase the amount of computation, even in the worst situation the data-dependent design has the same complexity as the underlying Loeffler algorithm. In other words, in the worst scenario (none of the bypassing logic active), the data-dependent Loeffler 2-D IDCT algorithm uses 176 multiplications (2 dimensions × 8 rows (columns)/dimension × 11 multiplications/row (column)).

In real life, some of the zero-bypassing logic will be active, and the number of multiplications starts to depend on the distribution of the input data. For instance, if there is one non-zero coefficient in the input of the 1-D IDCT, the data-dependent Loeffler IDCT algorithm requires 0, 2, 5 or 6 multiplications depending on the position of the non-zero input. If the non-zero input is equally likely to be at any of the 8 positions, the algorithm requires only 3.25 multiplications on average. Thus, by applying zero-bypassing logic to Loeffler's IDCT algorithm, the fixed-complexity algorithm is transformed into a data-dependent algorithm. The new 2-D IDCT multiplication lower bound is the same as Xanthopoulos's (0), while the upper bound is significantly reduced from 1024 down to 176.

3.3. Transpose Memory Architecture

There are various ways to transpose an 8x8 matrix in hardware. The trivial way is to have two matrices (as shown in Figure 11). They are used for reading and writing alternately (ping-pong buffering). Two matrices are required since the data arrives row-by-row.

Figure 11: Ping-pong transpose memory

Another way to transpose a matrix is reported in [28]. As shown in Figure 12, only one matrix is required. Data is transposed on the fly by changing the shifting direction (top-to-bottom or left-to-right).

Figure 12: On-the-fly 8 x 8 Transpose Memory [28]

The state of the transposition matrix over the clock cycles is illustrated in Figure 13. To fill up the matrix, from clock cycle 1 to 8, the shifting direction is top-to-bottom. From clock cycle 9 to 16, the shifting direction is left-to-right. From clock cycle 17 to 24, the shifting direction is top-to-bottom. Clock cycle 25 is identical to clock cycle 9, and so on.

[Figure 13 diagram; recoverable panel titles: after 8 cycles, 9th cycle, 10th cycle, 15th cycle, 16th cycle, 17th cycle, 18th cycle, 23rd cycle, 24th cycle.]

Figure 13: States of the transpose matrix for different clock cycles
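The alternating shift directions can be modelled with a small Python simulation. It is a behavioural sketch, not the circuit of [28]: the class name `OnTheFlyTranspose`, the entry and exit sides (data enters at the top or left and leaves at the bottom or right), and the test blocks are assumptions made for illustration.

```python
N = 8


class OnTheFlyTranspose:
    """Single N x N register array whose shift direction flips every N cycles."""

    def __init__(self):
        self.mem = [[None] * N for _ in range(N)]
        self.shift_down = True   # True: top-to-bottom, False: left-to-right
        self.cycle = 0

    def clock(self, vector):
        """Shift one N-element vector in and return the vector shifted out."""
        if self.shift_down:
            out = self.mem[N - 1][:]                    # bottom row leaves
            self.mem = [list(vector)] + self.mem[:-1]   # rows move down
        else:
            out = [row[N - 1] for row in self.mem]      # right column leaves
            self.mem = [[vector[i]] + self.mem[i][:-1] for i in range(N)]
        self.cycle += 1
        if self.cycle == N:                             # change direction
            self.cycle = 0
            self.shift_down = not self.shift_down
        return out


def block(offset):
    """Hypothetical test block with distinct entries."""
    return [[offset + N * r + c for c in range(N)] for r in range(N)]


A, B, C = block(0), block(100), block(200)
tm = OnTheFlyTranspose()
for row in A:
    tm.clock(row)                        # cycles 1-8: fill the matrix
out_a = [tm.clock(row) for row in B]     # cycles 9-16: A leaves while B enters
out_b = [tm.clock(row) for row in C]     # cycles 17-24: B leaves while C enters
```

Under these assumed conventions every vector shifted out is one column of the previously written block, delivered in reversed index order (last column first, last row first); a fixed rewiring of the ports would restore the natural order. The model needs only one N x N array, whereas the ping-pong scheme needs two.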

Since an 8x8 matrix of n-bit elements is built with 64n flip-flops, the area consumption will be large if n is large. In the proposed FDCT/IDCT design, the on-the-fly transposition architecture is used since it requires only 64n flip-flops instead of the 128n flip-flops in the ping-pong case.

3.4. Chapter Summary

In this chapter, the data-dependent Loeffler FDCT/IDCT algorithms are described. Zero-bypassing logic is inserted into the fixed-complexity Loeffler algorithm to convert it into a data-dependent algorithm, on which the new design is based. For the FDCT, the input truncation technique was also analyzed and applied to further reduce the amount of data to be processed, and hence the power consumption. Based on the simulation results, we decided to truncate one bit from the input of the second-dimension FDCT.

The transpose memory architecture has also been studied. The on-the-fly transpose memory reported in [28] is chosen because it requires only half the area of the ping-pong architecture.

Since the multiplier is the fundamental building block of the FDCT/IDCT, in the next chapter different multiplier architectures are analyzed based on low-power criteria.

Chapter 4

Multiplier Architectures

In VLSI implementation, floating-point multipliers are much larger, slower, and consume more power than fixed-point multipliers due to normalization of the mantissa. For this reason, all FDCT/IDCT designs reviewed in this thesis use fixed-point multiplication instead of floating-point multiplication.

Since fixed-point or integer multipliers are larger, slower, and consume more power than adders, the choice of multiplier greatly affects the overall FDCT/IDCT performance and power consumption.

One special note about the multiplications performed in the FDCT/IDCT is that they are all constant multiplications, i.e. one of the operands is a constant. In Section 4.1, several constant multiplication schemes are studied, and the hardwired CSD multiplier is chosen for low-power design. Section 4.2 describes the design procedure of the hardwired CSD multipliers. In Section 4.3, synthesis is performed, and the result indicates that the CSD multipliers indeed consume less power than general-purpose multipliers.

4.1. Survey of Constant Multiplication Schemes

Following is a brief description of the characteristics of different constant multipliers. More detailed descriptions can be found in the references.

4.1.1. Modified Booth Multiplier

The modified Booth multiplier [35] is a popular general-purpose multiplier. Both of its operands are variables that can be changed at run-time. However, in DCT/IDCT multiplications, only one of the operands is variable; the other one is a constant (cos(nπ/16)). Having both operands of the multiplier variable implies more hardware and consequently more power. Thus, the general-purpose modified Booth multiplier is not a good choice for low-power DCT/IDCT design.

4.1.2. Distributed Arithmetic (DA)

Distributed arithmetic (DA) is a bit-serial operation that performs shift-and-add operations to multiply two numbers (one of which is a constant). It replaces the multiplication with additions and a look-up ROM table [14]. The input is used as the index into the ROM, the ROM contains the partial products of multiplying the address with the constant multiplicand, and the partial products are then added by using shift-and-add operations.

The main disadvantage of DA is that it is slow due to its bit-serial nature and the parallel-serial/serial-parallel conversion. This implies that it needs a higher internal clock frequency than parallel processing to do the same work. Moreover, shifting consumes much power because of the high switching activity. In [14] and [15], the authors evaluated the trade-off between performance and power for three multiplication schemes: general-purpose multiplier, pure ROM based, and mixed ROM based (DA).

[Table 6 body not recoverable from the source; columns: Voltage; Multiplier (Delay, Power); Pure ROM (Delay, Power); Mixed ROM (Delay, Power).]

Table 6: Comparison of general-purpose multiplication against ROM-based multiplication [14]

As shown in Table 6, the multiplier-based implementation is slower than the DA-based implementations. However, its power is about 30-50% lower than that of the DA-based implementations because about 85% of the entire DA chip runs at a higher frequency due to its bit-serial nature. As a result, DA is not a good choice for low-power design.

4.1.3. Hardwired Canonical-Sign-Digit (CSD) Wallace-Tree Multiplier

Hardwired multipliers hard-code the constant multiplicand by using only shift-and-add operations. Unlike DA, which performs the shift-and-add operations at run-time, these shifts are hard-wired at design time and consume no power. In other words, hardwired multipliers are simply Wallace-tree carry-save adders. This results in a smaller and more power-efficient multiplier than a general-purpose multiplier.

Further power reduction can be achieved on the fixed multiplicand by not using 2's complement representation, but using radix-2 canonical sign-digit (CSD) representation. By definition, the canonical sign-digit representation is a redundant number system that represents a number with no adjacent non-zero digits. Every number has a unique CSD representation [30]. It represents a number with fewer or equal non-zero digits [II as the algebraic sum/subtraction of several powers of two, i.e.:

$c = \sum_i s_i 2^{-i}$, where $s_i \in \{-1, 0, 1\}$

A procedure to transform a conventional binary number to CSD representation is described in [30]. We have also derived a more intuitive transformation algorithm:

Given an (n+1)-digit binary number $B = B_n B_{n-1} \ldots B_1 B_0$ with $B_n = 0$ and $B_i \in \{0,1\}$ for $i \in [0, n-1]$. The following procedure converts B into the (n+1)-digit radix-2 canonical SD vector $D = D_n D_{n-1} \ldots D_1 D_0$ with $D_n \in \{0,1\}$ and $D_i \in \{0,1,-1\}$ for $i \in [0, n-1]$ such that both vectors D and B represent the same value:

1. If there are consecutive 1's in B, continue to step 2. Otherwise, the resulting number B is in CSD representation (D). End the process.

2. Replace the rightmost (starting from the lowest order $2^0$ end) occurrence of the bit pattern 011...11 with 100...0(-1). This replacement is possible because a run of m consecutive 1's satisfies $2^{m-1} + 2^{m-2} + \cdots + 2^0 = 2^m - 2^0$.

3. Go back to step 1.

Figure 14 shows a step-by-step example that converts the binary number 0110010111b (407 in decimal) into CSD representation. The consecutive 1's to be replaced are shaded in the figure. Applying step 2 three times gives 0110010111 → 011001100T → 011010T00T → 10T010T00T, so the resulting CSD representation of 407d is 10T010T00T, where T denotes -1. As expected, $407 = 2^9 - 2^7 + 2^5 - 2^3 - 2^0$. In this example, the CSD representation reduces the number of non-zero digits from 6 down to 5.

Figure 14: Converting binary number 0110010111 into CSD representation
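The replacement procedure above can be sketched in Python as follows. This is an illustrative re-implementation, not the code used in this work; the function name `to_csd` and the list-of-digits representation (most-significant digit first) are assumptions.

```python
def to_csd(value, num_digits):
    """Convert a non-negative integer to CSD digits, most-significant first.

    num_digits counts the leading zero digit B_n, so value must be
    smaller than 2 ** (num_digits - 1).
    """
    assert 0 <= value < 2 ** (num_digits - 1)
    digits = [(value >> i) & 1 for i in range(num_digits - 1, -1, -1)]
    while True:
        # Step 1: look for the rightmost pair of consecutive 1's.
        end = None
        for i in range(num_digits - 1, 0, -1):
            if digits[i] == 1 and digits[i - 1] == 1:
                end = i
                break
        if end is None:
            return digits  # no consecutive 1's left: digits are in CSD form
        # Step 2: find the start of that run of 1's ...
        start = end
        while start > 0 and digits[start - 1] == 1:
            start -= 1
        # ... and replace 0 1 1 ... 1 by 1 0 0 ... -1.
        digits[start - 1] = 1
        for j in range(start, end):
            digits[j] = 0
        digits[end] = -1
        # Step 3: go back to step 1.


csd_407 = to_csd(407, 10)
value_back = sum(d * 2 ** (len(csd_407) - 1 - i) for i, d in enumerate(csd_407))
```

For the example of Figure 14 the sketch goes through the same three replacements and ends with the digits 1 0 -1 0 1 0 -1 0 0 -1.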

As another example, Table 7 shows the CSD representations of the constant operands (cos(nπ/16)) used in the FDCT/IDCT with 15-bit precision after the binary point (total of 16 bits).

Table 7: Canonical sign-digit representation of cos(nπ/16)

(Bit patterns are listed from $2^0$ down to $2^{-15}$; T denotes -1.)

n | Traditional binary bit pattern | # Non-zero bits | CSD bit pattern | # Non-zero bits | % Bit saving
1 | 0111110110001010 | 9 | 100000T0T0001010 | 5 | 44%
2 | 0111011001000010 | 7 | 1000T0T001000010 | 5 | 29%
3 | 0110101001101110 | 9 | 10T01010100T00T0 | 7 | 22%
4 | 0101101010000010 | 6 | 10T0T01010000010 | 6 | 0%
5 | 0100011100011101 | 8 | 0100100T00100T01 | 6 | 25%
6 | 0011000011111100 | 8 | 010T000100000T00 | 4 | 50%
7 | 0001100011111001 | 8 | 0010T0010000T001 | 5 | 38%

As shown in Table 7, the CSD representation can reduce the number of non-zero bits by up to 50% over the traditional representation. In a hardwired multiplier, each non-zero digit (except the first 3 non-zero digits) in the constant multiplicand requires one extra carry-save adder stage.

Because canonical means no adjacent non-zero digits, any n-bit number can be represented with at most ⌈n/2⌉ non-zero digits, which in turn removes at least half of the carry-save adder stages compared to a general-purpose array multiplier. It can also be shown that CSD generates an average of n/3 additions [40]. Since fewer non-zero bits imply less computation, less switching activity, and less power consumption, the hardwired CSD multiplier is a good choice for low-power design.
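The non-zero digit counts for the cos(nπ/16) constants can be cross-checked numerically. The sketch below assumes that each constant is rounded to 15 fractional bits, and it uses the standard non-adjacent-form recurrence to obtain the CSD digits instead of the pattern-replacement procedure of this section; the function name `csd_digits` is illustrative.

```python
import math


def csd_digits(value):
    """CSD digits in {-1, 0, 1}, least-significant digit first."""
    digits = []
    while value:
        if value & 1:
            # +1 when value % 4 == 1, -1 when value % 4 == 3, which
            # forces the next digit to be zero (no adjacent non-zeros).
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


rows = []
for n in range(1, 8):
    constant = round(math.cos(n * math.pi / 16) * 2 ** 15)
    binary_nonzero = bin(constant).count("1")
    csd_nonzero = sum(1 for d in csd_digits(constant) if d)
    rows.append((n, binary_nonzero, csd_nonzero))
```

Each tuple in `rows` holds n, the number of 1's in the binary pattern, and the number of non-zero CSD digits; for n=6, for instance, the count drops from 8 to 4.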

4.1.4. Pattern-Based CSD Multiplier

The CSD representation uses the minimum number of shift-and-add (S&A) operations when multiplying a constant k with a variable x directly. However, direct multiplication of x × k does not necessarily use the minimum number of S&A operations to compute x × k. In some situations, it is possible to find patterns inside the CSD representation that can be reused to avoid repeated computation. Thus, instead of multiplying x with k directly, x is multiplied with sub-expressions of k, and the partial products are then used to construct the final product. As an example, let k = 11100111b = 100T0100T (231d). Using the CSD representation without pattern searching, 231x requires 4 additions. However, with a pattern-based algorithm, 231x can be represented by (7x<<5)+7x, which requires 3 additions only. Bernstein's algorithm [41], Lefevre's algorithms [39-40], and Potkonjak's algorithm [42] are pattern-based algorithms.
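The 231x example can be written out as two small Python functions. They only check that the direct CSD form and the pattern-based form give the same product; the function names are illustrative, and the shared sub-expression 7x is formed here as 8x - x, which is an assumption about how the pattern would be built.

```python
def times_231_csd(x):
    """Direct CSD form: 231 = 2^8 - 2^5 + 2^3 - 2^0."""
    return (x << 8) - (x << 5) + (x << 3) - x


def times_231_pattern(x):
    """Pattern-based form: 231x = (7x << 5) + 7x, with 7x computed once."""
    seven_x = (x << 3) - x           # shared sub-expression 7x = 8x - x
    return (seven_x << 5) + seven_x  # reuse 7x for both occurrences


results = [(times_231_csd(x), times_231_pattern(x)) for x in range(-8, 9)]
```

The pattern-based form reuses the partial product 7x, which is why that partial product has to be fully resolved before it can be shifted and added again.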

The pattern-based algorithms are very useful for multiplication with very large constants where the patterns can be reused frequently. For example, in encryption/decryption, the constant may have several hundreds or thousands of bits. In such situations, a pattern-based algorithm can reduce the computation significantly. However, for the purpose of the FDCT/IDCT and most DSP applications, the constant word lengths are usually small, and patterns (if any) are reused less frequently.

For pattern reuse, one must obtain the entire partial product, which requires using a carry-save adder (CSA) followed by a carry-propagate adder (CPA). In general, in VLSI implementation, a CPA is slower and consumes more power than a CSA due to carry propagation. The slower speed of the pattern-based algorithm can be compensated by adding pipeline registers after each CSA used for partial product (pattern) computation. The extra power consumption due to the carry propagation in the CPA can be reduced by using other types of adders such as carry-bypass adders or carry-select adders. However, given that the patterns are not reused frequently, the overall power consumption of the pattern-based multiplier is still larger than that of the one without patterns. Since the design criterion of this thesis is power, only CSD multiplication without patterns is considered, and all multipliers used in the FDCT/IDCT are hardwired CSD multipliers.

Notice that the application of the hardwired CSD Wallace-tree multiplier is not restricted to the FDCT/IDCT. It can be used in many other digital signal processing (DSP) applications, such as digital filters, where fixed-coefficient multiplication is required.

4.2. CSD Multiplier Implementation Procedure

To design a hardwired CSD multiplier for multiplying an unsigned variable integer operand (v) with a constant operand, we derived the following steps:

1. Obtain the CSD representation of the constant operand by using the algorithm described in Section 4.1.3.

2. For each non-zero bit position p in the constant operand:

   • For each 1 in the constant operand, place the unsigned variable operand, i.e. perform $v \times 2^p$.

   • For each -1 in the constant operand, invert the bits of the unsigned variable operand, place a 1 at the least-significant bit (2's complement negation), and extend 1's to the left of the most-significant bit (sign extension) of the variable operand, i.e. perform $(-v) \times 2^p$.

3. Simplify the diagram by adding the constant 1's together to avoid redundant computation at run time. By studying the truth table of addition, we found that further optimization can be achieved by using identity 1.1:

   Identity 1.1: Variable bit b plus constant 1 results in sum ¬b and carry b, where ¬b denotes the NOT operation (see Table 8).

   Table 8: Truth table of b+1 (Sum = ¬b, Carry = b)

   b     | 0 | 1
   Sum   | 1 | 0
   Carry | 0 | 1

   This identity allows reduction of one operand to be added at position p by increasing the number of operands to be added at position p+1 by 1. Intelligent use of this identity can reduce the number of carry-save adder (CSA) stages (critical path delay) without introducing any extra hardware.

4. Combine the operands placed in step 2 and the simplified constant 1's (from step 3) with carry-save adders in Wallace-tree form. The output of the final carry-propagate adder is the product of the variable input operand and the constant operand.

To illustrate the above algorithm, Figure 15 shows the procedure of constructing a CSD hardwired multiplier for the constant cos(3π/16) multiplied with an 8-bit unsigned integer. The constant cos(3π/16) is chosen because it contains the most non-zero bits in Table 7. As shown in Figure 15, in step 3, the application of identity 1.1 reduces the depth of the CSA tree from 7 down to 4. As a result, the multiplication of cos(3π/16) with an 8-bit unsigned number has a critical path of only 2 full-adder stages with a 19-bit CPA. Notice that despite the fact that the multiplier uses CSD representation for the constant operand, both the variable operand and the product are still in 2's complement representation.

[Figure 15 diagram; recoverable step titles: step 1, CSD representation of the constant; step 2, placement of the operands at bit positions 7 down to -14; step 3, simplify the 1's, rearrange the operands, and apply identity b+1 => Sum ¬b, Carry b; step 4, construct the multiplier with Wallace-tree carry-save adders.]

Figure 15: Hardwired CSD multiplier for multiplying cos(3π/16) with 8-bit unsigned integer
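Identity 1.1 can be verified exhaustively, since b only takes two values. The short Python sketch below is illustrative (the function name `add_one` is an assumption); it also checks that moving the carry to position p+1 preserves the weighted value, which is what allows one operand at position p to be traded for one operand at position p+1.

```python
def add_one(b):
    """Identity 1.1: bit b plus constant 1 gives (sum, carry) = (NOT b, b)."""
    return 1 - b, b


value_preserved = []
for p in range(4):              # a few example bit positions
    for b in (0, 1):
        s, carry = add_one(b)
        before = b * 2 ** p + 1 * 2 ** p          # b and constant 1 at position p
        after = s * 2 ** p + carry * 2 ** (p + 1)  # NOT b at p, b at p+1
        value_preserved.append(before == after)
```

No adder cell is needed for this step: the sum bit is an inverter on b and the carry bit is b itself.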

Similarly, to multiply a signed 2's complement variable operand (v) having a sign bit (s) with a constant operand, the following procedure is derived:

1. Obtain the CSD representation of the constant operand.

2. For each non-zero bit position p in the constant operand:

   • For each 1 in the constant operand, place the signed variable operand, i.e. perform $v \times 2^p$. Sign-extend towards the left.

   • For each -1 in the constant operand, invert the bits of the signed variable operand v, place a 1 at the least-significant bit (2's complement negation), and extend ¬s (the negated sign bit s) to the left of the most-significant bit (sign extension) of the variable operand, i.e. perform $(-v) \times 2^p$.

3. Simplify the sign-extension bits and constant 1's in the diagram:

   • Let s=0: replace all s with 0 and all ¬s with 1, add all constant 1's together, and obtain a constant value $SE_0$.

   • Let s=1: replace all s with 1 and all ¬s with 0, add all constant 1's together, and obtain a constant value $SE_1$.

   • For each bit at position p, merge $SE_0$ and $SE_1$ together to obtain another value SE using the truth table in Table 9.

   Table 9: Truth table to simplify sign-extension

   [Table 9 body not recoverable from the source.]

   • Remove all sign-extension bits (s or ¬s), and insert SE into the diagram.

   • Like the unsigned case, apply identity 1.1 where suitable.

4. Combine the operands placed in step 2 and the simplified sign-extension bits and constant 1's (from step 3) with carry-save adders in Wallace-tree form. The output of the final carry-propagate adder is the product of the signed 2's complement variable input operand and the constant operand.

Like the unsigned case, a step-by-step illustration of the construction of a CSD hardwired multiplier for the constant operand cos(3π/16) multiplied with an 8-bit signed 2's complement integer is shown in Figure 16.

[Figure 16 diagram; recoverable content: step 1, $\cos(3\pi/16) \approx 2^0 - 2^{-2} + 2^{-4} + 2^{-6} + 2^{-8} - 2^{-11} - 2^{-14}$; step 2, placement of the operands and sign bits at bit positions 7 down to -14; step 3, simplify the sign-extension bits and constant 1's into $SE_0$ (s=0) and $SE_1$ (s=1), remove the sign bits of step 2 and insert SE, and apply identity b+1 => Sum ¬b, Carry b.]

Figure 16: Hardwired CSD multiplier for multiplying cos(3π/16) with 8-bit signed integer
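The arithmetic that the hardwired multiplier implements can be modelled in a few lines of Python. This is a behavioural sketch only: it sums the shifted operands directly and does not model the Wallace tree, the sign-extension simplification, or the final CPA. The names `CSD_TERMS`, `csd_constant`, and `hardwired_multiply` are illustrative, and the term list is taken from the CSD expansion of cos(3π/16) with 15 fractional bits.

```python
import math

FRACTION_BITS = 15
# (digit, k) pairs meaning digit * 2^-k, from the CSD form of cos(3*pi/16).
CSD_TERMS = [(+1, 0), (-1, 2), (+1, 4), (+1, 6), (+1, 8), (-1, 11), (-1, 14)]


def csd_constant(terms, fraction_bits):
    """Integer value of the fixed-point constant described by the CSD terms."""
    return sum(digit * 2 ** (fraction_bits - k) for digit, k in terms)


def hardwired_multiply(v):
    """Multiply signed integer v by the constant using shifts and adds only."""
    total = 0
    for digit, k in CSD_TERMS:
        shift = FRACTION_BITS - k            # fixed wiring, no run-time cost
        # For a -1 digit, negate v as in step 2: invert the bits and add 1.
        operand = v if digit > 0 else (~v) + 1
        total += operand << shift
    return total


constant = csd_constant(CSD_TERMS, FRACTION_BITS)
products = [hardwired_multiply(v) for v in range(-128, 128)]
```

The result is the product scaled by 2^15; seven shifted operands are combined, one per non-zero CSD digit.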

4.3. Multiplier Synthesis Result

To demonstrate that the hardwired CSD Wallace-tree constant multiplier consumes less power and area while offering comparable speed performance, its delay, area and power consumption figures are compared with those of other popular 32-bit general-purpose multipliers.

Since the hardwired CSD multiplier has one constant operand, several CSD multipliers are implemented with the different constant operands used in the FDCT/IDCT (cos(nπ/16) and $\sqrt{2}$). All constants have a 1-bit integer part and a 31-bit fraction part to form a 32-bit fixed-point number. The constants are then multiplied with a 32-bit signed integer (variable operand). All multipliers are synthesized using Xilinx 4052XL-1 FPGA technology.

[Table 10 body not recoverable from the source; columns: 32-bit multiplier figures for the Array Multiplier, Modified Booth Multiplier, Wallace Tree Multiplier, Modified Booth-Wallace Tree Multiplier, and the Proposed Scheme.]

Table 10: Comparison of 32-bit CSD Wallace-tree multiplier with 4 different general-purpose multipliers using Xilinx 4052XL-1 FPGA technology (Columns 1-5 adopted from Table 1 in [36])

As shown in Table 10, the CSD multiplier uses the least amount of area and power (less than half of the power of the array multiplier) while offering speed performance comparable to the other multipliers (around 100 ns). This result agrees with the analysis: the hardwired CSD multiplier is more power efficient than the general-purpose multipliers. Therefore, hardwired CSD Wallace-tree multipliers are used in the FDCT/IDCT designs presented in this thesis.


4.4. Chapter Summary

In this chapter, by analyzing different constant multiplication schemes, a new constant-coefficient multiplier design is presented. The multiplier is based on canonical sign-digit representation with Wallace-tree formation. As shown in the analysis and simulation, the CSD multiplier is both more power and area efficient than a general-purpose multiplier while offering similar speed performance. Consequently, it is used in the FDCT/IDCT design presented in this work. Detailed design procedures for both unsigned and signed integers are also described.

In the next chapter, more implementation details, such as design automation and pipeline design, are presented.

Chapter 5

Implementation

Since the main efforts are concentrated on the arithmetic level (data-dependent algorithm) and the implementation level (hardwired CSD multipliers), we decided to use VHDL to implement the FDCT/IDCT designs. No optimization on the circuit level or technology level is made.

To ensure error-free coding, some design automation effort is made. In Section 5.1, a C++ program that generates VHDL code for the hardwired CSD Wallace-tree multiplier is developed. Similarly, to make the IDCT design compliant with IEEE Std. 1180-1990, in Section 5.2 a Java program is developed that calculates the error figures defined in the IEEE standard [27] for different internal bandwidths. The pipeline designs for both the FDCT and IDCT are also described in this chapter (Section 5.3).

5.1. Hardwired CSD Multiplier Generator

Since the FDCT/IDCT design uses hardwired CSD multipliers, a different multiplier is required for each constant operand and each bandwidth of the variable operand. To save design time and avoid bugs in the coding, it is ideal to generate the constant multipliers through a code generator.

Several constant multiplier generators [40][43-44] have been reported in the literature. All of them are optimized for Xilinx FPGA 4000 and Virtex technologies. To have a technology-independent constant multiplier generator, a C++ program that generates VHDL code for hardwired CSD multipliers is developed. The program is called the constant multiplier generator (CMG). The C++ source code of the generator is listed in Appendix C and on the attached CD.

The CMG is capable of generating VHDL code that multiplies a signed/unsigned variable operand with any positive integer constant multiplicand. The constant operand can have the size of the long type in the C++ language. The CMG takes the following information from the user:

• VHDL entity name.

• Integer value of the constant operand: For a real-number constant operand, use the integer value of the corresponding fixed-point representation. For Intel Pentium® processors running Microsoft Windows® 32, the constant operand is limited to the range from 0 to 2147483647.

• Variable operand: Number of bits of the signed/unsigned variable operand.

• Product Least-Significant-Bit Truncation: This feature is useful for real-number (fixed-point) multiplications. In many situations, not all bits in the fractional part are required. Truncating some least-significant bits from the product results in a smaller, faster, and more power-efficient multiplier. The truncation error has been analyzed in [15].

The generator uses the algorithm described in Section 4.2 to generate VHDL code. At the end of the code generation, it also reports critical statistical information: the number of carry-save adder stages and the number of inverters, half adders, and full adders. This information is useful for power, area, speed, and pipelining analysis.

As an example, for constant operand cos(3π/16) with 15-bit precision multiplied with a 12-bit variable operand and no truncation, the CMG generates the VHDL code shown in Appendix B.
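The CMG inputs described above can be illustrated with a short Python sketch. It is not the thesis's C++ generator: the function names `to_fixed_point`, `to_csd` and `csd_multiply` are hypothetical, rounding to nearest is an assumption (the CMG only receives the resulting integer), and the recoding shown is the standard non-adjacent-form procedure, which may differ in detail from the algorithm of Section 4.2.

```python
import math

def to_fixed_point(value, frac_bits):
    """Scale a real constant to an integer with frac_bits fractional bits.
    Rounding to nearest is an assumption made for this sketch."""
    return int(round(value * (1 << frac_bits)))

def to_csd(n):
    """Canonical signed-digit form of a non-negative integer,
    least-significant digit first, digits in {-1, 0, +1}."""
    digits = []
    while n != 0:
        if n & 1:
            digit = 2 - (n & 3)   # +1 when n mod 4 == 1, -1 when n mod 4 == 3
            n -= digit
        else:
            digit = 0
        digits.append(digit)
        n >>= 1
    return digits

def csd_multiply(x, digits, truncate_bits=0):
    """Multiply x by the CSD constant with shifts, adds and subtracts,
    then drop truncate_bits least-significant bits of the product."""
    product = sum(d * (x << i) for i, d in enumerate(digits))
    return product >> truncate_bits

# Constant operand of the Appendix B example: cos(3*pi/16), 15-bit precision.
constant = to_fixed_point(math.cos(3 * math.pi / 16), 15)
digits = to_csd(constant)
nonzero_digits = sum(1 for d in digits if d != 0)
print(constant, digits, nonzero_digits)
```

Each non-zero digit corresponds to one shifted copy of the variable operand (or of its inverse) entering the carry-save adder array, so the non-zero digit count gives a rough feel for the adder hardware required.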

5.2. IEEE Standard 1180-1990 IDCT Compliance

To ensure the proposed IDCT chip conforms to IEEE Standard 1180-1990, a Java program is developed. The program reads in the data path bandwidths, multiplier precisions, and truncation patterns used in Loeffler's IDCT in each pipeline stage from a file, and calculates the error figures (ppe, pmse, omse, pme, and ome) defined in the standard (see Section 2.3). Again, the source code is listed in Appendix D and in the attached CD. Notice that Java is chosen as the programming language because the long type in Java is a 64-bit integer, which is more suitable for simulating fixed-point arithmetic. In C++, the size of a data type is machine dependent, while in Java the size of a data type is machine independent and fixed.
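The five error figures can be sketched as follows. This is a Python illustration, not the Java program of Appendix D; the function names are hypothetical, blocks are assumed to be flat lists of 64 integers, and the limits are the ones listed in Table 11. The standard's full procedure (10,000 random blocks per input range, sign-inverted data, and the zero-in zero-out check) is left to the caller.

```python
def ieee1180_error_figures(reference_blocks, test_blocks):
    """Return (ppe, pmse, omse, pme, ome) for two equal-length lists of blocks.
    Each block is a flat list of pixel values (64 entries for 8x8)."""
    n_blocks = len(reference_blocks)
    n_pixels = len(reference_blocks[0])
    err_sum = [0] * n_pixels   # per-pixel sum of errors over all blocks
    sq_sum = [0] * n_pixels    # per-pixel sum of squared errors
    ppe = 0                    # peak pixel error
    for ref, out in zip(reference_blocks, test_blocks):
        for k in range(n_pixels):
            e = out[k] - ref[k]
            ppe = max(ppe, abs(e))
            err_sum[k] += e
            sq_sum[k] += e * e
    pmse = max(s / n_blocks for s in sq_sum)             # peak mean square error
    omse = sum(sq_sum) / (n_blocks * n_pixels)           # overall mean square error
    pme = max(abs(s) / n_blocks for s in err_sum)        # peak mean error
    ome = abs(sum(err_sum)) / (n_blocks * n_pixels)      # overall mean error
    return ppe, pmse, omse, pme, ome

def is_compliant(figures):
    """Check the figures against the limits listed in Table 11."""
    ppe, pmse, omse, pme, ome = figures
    return (ppe <= 1 and pmse <= 0.06 and omse <= 0.02
            and pme <= 0.015 and ome <= 0.0015)
```

A bandwidth search such as the one described here would call these two functions once per candidate set of internal bandwidths and keep the narrowest data path that still passes.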

After testing different combinations of internal bandwidths, the first dimension

IDCT produces l ;-bit integedj-bit precision fixed-point output. The second dimension

IDCT produces 14-bit signed integer output (after rounding of 10-bit precision result).

The 2-D IDCT presented in this thesis conforms to the IEEE 1180-1990 Standard. The

compliance test results are shown in Table 11.

Random Data Range       pme       pmse      ome        omse
Limit                   <0.015    <0.06     <0.0015    <0.02
[-300, 300]             0.0121    0.0161    0.00082    0.0108
[-256, 255]             0.0129    0.0144    0.00073    0.0103
[-5, 5]                 0         0         0          0
-[-300, 300]            0.0121    0.0155    0.00081    0.0109
-[-256, 255]            0.0143    0.0163    0.00085    0.0104
-[-5, 5]                0         0         0          0

Zero-in zero-out test passed. |ppe| ≤ 1.

Table 11: IEEE Standard 1180-1990 Compliance for Proposed IDCT

5.3. Pipelining Design

Since the hardwired CSD multiplier is essentially a carry-save adder, and the speed of the carry-save adder is mostly limited by the carry-propagate adder, the speed of the FDCT/IDCT is directly related to the carry-propagate adders. Thus, it is logical to insert pipelining registers after each adder (including the adders in the multipliers). As shown in Figure 17, for kcn blocks in the IDCT, there are 3 pipeline stages (add inputs, multiply, and add product). For kcn blocks in the FDCT, there are 4 stages. The extra stage is required to add the partial products of the segmented multiplications. Therefore, there are 10 pipeline delays (latency) for the 1-D FDCT, and 8 pipeline delays for the 1-D IDCT (see Table 12).

Figure 17: Pipelined kcn block

Latency     FDCT    IDCT
Stage 1     1       1
Stage 2     4       3
Stage 3     4       3
Stage 4     1       1
Total       10      8

Table 12: Latencies for 1-D FDCT and 1-D IDCT

For the transpose memory, the on-the-fly transpose memory architecture is used. From Figure 12, it is clear that the latency is 8 clock cycles because the transposed output can be obtained starting from the 9th clock cycle.

To summarize, the proposed 2-D FDCT has a latency of 28 clock cycles, and the 2-D IDCT has a latency of 24 clock cycles (see Table 13).

Latency              FDCT    IDCT
First Dimension      10      8
Transpose Memory     8       8
Second Dimension     10      8
Total                28      24

Table 13: Latencies for 2-D FDCT and 2-D IDCT

In this chapter, a new constant CSD multiplier generator is introduced. Written in C++, the program generates VHDL code that multiplies a constant integer operand with a signed/unsigned variable operand. Truncation can also be made on the product to reduce hardware, power, and delay.

A Java program is developed to select the internal bandwidth such that the 2-D 8x8 IDCT conforms to the IEEE Standard 1180-1990.

Both FDCT and IDCT designs have also been pipelined to achieve throughput of 1 output/clock cycle. The latency is 28 clock cycles for FDCT, and 24 clock cycles for IDCT.

In the next chapter, the VHDL code of the proposed FDCT/IDCT chip is synthesized using Synopsys with Canadian Microelectronics Corporation (CMC) 3-volt 0.35-µm technology. Synthesis results (power/area/delay) are compared with previous works.

Chapter 6

Synthesis Results

In this chapter, synthesis results of the proposed FDCT/IDCT are presented in Section 6.1. The proposed design is compared with previously reported designs in Section 6.2, using the switching-capacitance per sample criterion described in Section 1.3.

6.1. Synthesis Results of the Proposed Design

The VHDL code of the proposed FDCT/IDCT core is synthesized using Synopsys with Canadian Microelectronics Corporation (CMC) 3-volt 0.35-µm technology. Since the design goal is low power, the compiler constraint is set to minimize the dynamic power consumption (ideally zero). The synthesis result indicates that the proposed FDCT core consumes 122.7mW at 40MHz, and the IDCT core consumes 124.9mW at 40MHz. The detailed specifications of the new FDCT/IDCT design are shown in Table 14.

Only the dynamic power reported by Synopsys is compared with other designs in the next section. In real life, there may be other sources of power consumption, such as leakage power and short-circuit power. Since the leakage power is related to the fabrication, which is not the concern of this paper, it is ignored in the comparison. As for the short-circuit power, it is assumed to be small and negligible, which is usually the case in practice. Its effect can be minimized with proper timing design.

The power measurements are performed under the worst-case condition where the assumed statistical properties do not hold, i.e. under white noise input. In this situation, most of the bypassing logic is not active, and the power consumption is higher. This simulation condition is chosen because in real life, for MPEG-2 video compression, the assumed statistical properties apply only to I-frames, but less so to B-frames and P-frames. For those frames, the redundancies at the input have already been reduced, and the input behaves like white noise.

                                      FDCT            IDCT
Process Technology                    CMC 0.35µm CMOSP technology
Supply Voltage (V)                    3 Volts
Input/Output Numeric System           2's complement signed integer
Input Specification                   8 input/clock cycle
Input Bandwidth (for each input)      9-bit           12-bit
Operating Frequency (MHz)             40 M
Processing Rate (samples/sec)         320 M
Dynamic Power (mW)                    122.6666        124.8587
Leakage Power (nW)                    16.8610         18.6860
Area (reported by Synopsys)           3.2548425       3.2969125
Maximum Pipeline Stage Delay (ns)     24.51           24.43
Latency (clock cycles)                28              24
Throughput                            8 output/clock cycle
Output Bandwidth (for each output)    17-bit          14-bit

Table 14: Process and Specifications of the proposed FDCT/IDCT designs

6.2. Comparison with past FDCT/IDCT VLSI implementations

Many FDCT/IDCT VLSI implementations have been reported in the literature. The specifications of several recent high-performance FDCT/IDCT chips are summarized in Table 15. Due to different process technologies (supply voltage, operating frequency, etc.), implementation approaches (full-custom, semi-custom, etc.), optimization parameters (RTL, transistor level, layout level, etc.), and design algorithms/architectures, comparing different implementations is always a tough job in VLSI design. Also, in some situations, not all measurement figures are reported. As a result, it is very difficult to compare one design with another accurately.

--

I Area (mm') l Supply VoItage (V) Power CIock Implementa tion Process Transistors (mW) Ra te

Toshiba 1994 FDCTIIDCT 123 ](M 1

Toshiba 1996 FDCTfIDCT j221

ATScT lDCT 1371

Xanthopoulos's CDCT [251

Xanthopoulos's FDCT 1281

Table 15: Summary of specifications of several FDCT/IDCT chips

Sarmiento's FDCT 161

In order to compare the proposed design with other works fairly, as in [19]-[21], the switching capacitance per sample (hence power per sample) is calculated and compared. It can be used as an indication of energy efficiency since it is directly proportional to the power consumption required to process each input sample.

As described in Section 1.3, the switching capacitance of each design is obtained by dividing the power by the frequency and the squared voltage. Notice that the switching capacitance per sample is obtained by dividing the switching capacitance by the number of samples per clock cycle.

0.6prn12ML

0.3 prn 2ML, Triple well

0.5 prn

3ML

0.6 Pm, 3ML

0.6 Fm, E/D-MESFET

GaAs

13.33mm7/ 120K

$mm2/ 120K

7rnm2/ 69K

32.2mm2l 5 l K

0.3 5 W at 3.3V, 20OMHz 0.15W ar 2V, lOOMHz

3V

150 MHz

58 MHz

5-43 MHz

2-43 MHz

0-9Vl VT=0,15 10.1V

3V

: -4i6-2 '- at-l'.32V;::

I~MHZ .'

20,7mm2/ l6OK

7W

250mW

1.1-1.9V VTNNTP=0.66/-0.92V

TOX=9.6nm

600 MHz

20.7mm2/ l6OK

1.1-3V VMNTP=0.75/-0.82 V

TOX=14.8nm

.$.38myi -333@?g: - . .14-& .+'

Technology scaling is also performed for all designs to normalize all designs to 0.35µm technology. The scaling factor from 0.35µm (CMC 0.35µm CMOSP) technology to 0.5µm (0.6µm drawn) (CMC CMOSIS5) technology is obtained by performing HSPICE simulations on two inverters, one as the load of another. For both technologies, the power supply is 3 volts with a 40-MHz 3-volt square pulse input. The PMOSs have size L = W_min with W = 4W_min, and the NMOSs have size L = W_min with W = 2W_min, where W_min is the minimum feature size of the corresponding technology. The simulation result indicates that the power consumption is 0.634 mW for 0.35µm technology, and 1.19 mW for 0.5µm technology. Since both circuits are operating at the same voltage and frequency, the ratio between the powers is the ratio between the switching capacitances. For simplicity, 0.5µm and 0.6µm technologies are treated equally, similarly for 0.3µm and 0.35µm technologies. Thus, the switching capacitance in 0.5µm (and 0.6µm) technology will be multiplied by 0.532 to scale to 0.35µm technology. The effect of circuit level optimization, such as the variable threshold voltage used in [22], is ignored since it cannot be quantified correctly.

The switching capacitance per sample is shown in Table 16 after technology normalization. As an example, the switching capacitance per sample of the proposed FDCT design is calculated as

\[ \frac{122.6666 \times 10^{-3}}{320 \times 10^{6} \times 3^{2}} = 42.6\ \text{pF}. \]

For Xanthopoulos's IDCT, which is a 0.5µm design, technology scaling is performed, and the switching capacitance per sample is calculated as

\[ \frac{4.65 \times 10^{-3} \times 0.532}{14 \times 10^{6} \times 1.32^{2}} = 101.6\ \text{pF}, \]

where 0.532 is the technology scaling factor to scale the power of a 0.5µm technology down to 0.35µm technology.
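The two worked figures can be reproduced with a few lines of Python. This is only a sketch of the arithmetic described above: the function name is hypothetical, the Xanthopoulos design is assumed to process one sample per clock cycle (which is what the worked number implies), and the scaling factor is taken as the unrounded ratio of the two simulated inverter powers.

```python
def switching_capacitance_per_sample(power_w, clock_hz, vdd, samples_per_clock, scale=1.0):
    """C = P / (f * V^2), divided by the samples processed per clock cycle and
    multiplied by a technology scaling factor. Result is in farads."""
    return power_w * scale / (clock_hz * vdd ** 2 * samples_per_clock)

# Scaling factor from 0.5um to 0.35um: ratio of the simulated inverter powers.
SCALE_05_TO_035 = 0.634 / 1.19

# Proposed FDCT: 122.6666 mW at 3 V, 40 MHz, 8 samples per clock (320 Msamples/s).
proposed_fdct = switching_capacitance_per_sample(122.6666e-3, 40e6, 3.0, 8)

# Xanthopoulos's design: 4.65 mW at 1.32 V, 14 MHz, scaled from 0.5um.
xanthopoulos = switching_capacitance_per_sample(4.65e-3, 14e6, 1.32, 1, SCALE_05_TO_035)

# Prints values close to the 42.6 pF and 101.6 pF quoted in the text.
print(round(proposed_fdct * 1e12, 1), round(xanthopoulos * 1e12, 1))
```

Using the rounded factor 0.532 instead of 0.634/1.19 changes the second result by about 0.2 pF, which is why the unrounded ratio is used in the sketch.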

As shown in Table 16, the proposed data-dependent FDCT/IDCT designs have the least switching capacitance per sample, i.e. they consume the least amount of power to process each input data sample. Thus, the proposed FDCT/IDCT design is the most power-efficient one among the designs reviewed in this thesis.

Implementation                       Switching Capacitance / Sample (pF)
Toshiba 1994 FDCT/IDCT [23] [24]     85.6 (3.3V design) / 199.8 (2V design)
Toshiba 1996 FDCT/IDCT [22]          82.3
AT&T IDCT [37]                       478.9
Xanthopoulos's FDCT [28]             68.5
Xanthopoulos's IDCT [28]             101.6
Sarmiento's FDCT [6]                 1553.9
Proposed FDCT Design                 42.6
Proposed IDCT Design                 43.4

Table 16: Energy Efficiency (Switching Capacitance/Sample in 0.35µm technology)

In this chapter, the proposed FDCT/IDCT design is synthesized using Synopsys with CMC 3-volt 0.35-µm technology. To compare the proposed design with previous works, the switching capacitance per sample is used. This comparison method permits technology-independent comparison of different DCT/IDCT architectures. From Table 16, it has been shown that the new FDCT/IDCT designs have the smallest switching capacitance per sample, and are the most power-efficient designs.

Chapter 7

Conclusion

7.1. Summary of Research

In this work, a data-dependent low-power FDCT/IDCT design is presented. Low power is achieved by performing optimizations on both algorithm and architectural levels.

Both the FDCT and IDCT designs are built on the low-complexity Loeffler's fast algorithm combined with data-dependent zero-bypassing logic. In the FDCT, to have high zero-bypassing probability, segmented multiplication is used. Also, to reduce the internal bandwidth, hence the amount of data to be processed, the least-significant-bits truncation technique has been employed. The error introduced by truncation is empirically studied.

The multiplier architecture is optimized by developing low-power CSD multipliers. To reduce the possibility of bugs in coding, a C++ program that generates the technology-independent VHDL code for the multiplier is developed. This generator can be used in many other DSP applications where constant multiplication is required.

The FDCT/IDCT designs are coded using VHDL, and synthesized using Synopsys 1998 with CMC 0.35µm CMOSP technology. No transistor-level circuit optimization is done. Operating at 3V and 40MHz, the FDCT design consumes 122.7mW, while the IDCT design consumes 124.9mW. Compared with other recent works, the proposed FDCT/IDCT designs are the most power-efficient ones since they have the least switching capacitance per sample. Low-power operation is achieved through the selection of the low-complexity Loeffler's algorithm, data-dependent zero-bypassing logic, and least-significant-bits truncation.

7.2. Conclusion

From the analysis and simulation results, the following conclusions can be made about this thesis:

• A data-dependent algorithm can reduce the number of operations when bypassing logic is properly inserted. Improper use of the data-dependent algorithm may increase the computation rather than decrease it.

• The hardwired CSD Wallace-tree multiplier is a good choice for low-power design where constant multiplication is required. Its application is not limited to DCT/IDCT. In many non-adaptive signal processing/filter applications, constant multiplications are required. The use of hardwired CSD multipliers can lead to a more power-efficient design.

• Low-power design can be achieved by having optimization at both design time and run time. The design-time optimization is done by carefully choosing a good algorithm that reduces the number of operations. The run-time optimization is achieved by using data-dependent bypassing logic to reduce the switching activity, which is directly proportional to the power consumption.

• The data-dependent low-power design approach is not limited to DCT/IDCT. It can be used in other applications as well where the statistical property of the input is well understood.

7.3. Possible Improvements for Future Research

Following are some recommendations and possible improvements for future research endeavors.

• Study the effect of integrating data-dependent algorithm with other fast algorithms: This thesis is based on Loeffler's fast algorithm. It is chosen because it has the least number of multiplications among the surveyed papers. It would be interesting to know the effect of applying bypassing logic to other fast algorithms to determine the potential of the data-dependent algorithm.

• Study the effect of segmented multiplication: As discussed in Section 3.1.1, a smaller segmentation size leads to higher bypassing probability at the expense of more complicated control logic and more delay. In this work, the multiplications in the FDCT are split into two segments. This choice may not be optimum. A different segmentation strategy may lead to a more power-efficient design.

• Study the truncation effect for P-frames and B-frames: The truncation simulation is performed for I-frames only. It is a good idea to measure the truncation effect on P- and B-frames as well.

• Explore the possibility of using truncation as a means of quantization: Truncation behaves like quantization since both operations reduce numerical precision. Thus, instead of having the 2-D FDCT and quantizer as two separate blocks, it could be possible to merge them together. In such a situation, a sophisticated control algorithm is necessary for adapting the FDCT to different quantization levels (Q-factors).

• More power simulations under different conditions: The power measurements presented in this work are performed under the worst-case condition where the assumed statistical properties do not hold, i.e. under white noise input. In order to get a more accurate power estimation, it is recommended to pass many different real sequences (with I-, B-, and P-frames) as the input of the system, and measure the power consumption.

• Improve the Constant Multiplier Generator (CMG): Several possible improvements can be made on the CMG:

1. Negative constants support: Currently, the CMG supports only multiplication with non-negative integers. In this work, the negative constant coefficients of DCT/IDCT are taken care of by using subtractions instead of additions when the products are used. However, for other applications, if negative constant multiplication is required, the CMG can easily be modified to support multiplying negative integer constants.

2. Carry-save-adder optimization: In some situations, there are common operands to be added in the carry-save adder array for different bit positions. It is possible to share the partial sum of the full/half adders. Unlike the pattern-based algorithm that requires full summation, sharing carry-save adders reduces the hardware and power without increasing the delay. The only drawback of doing so is that the overall design becomes highly irregular due to complex routing caused by sharing wires.

3. Better final-addition support: Currently, at the end of the CSA, a CPA is used. It is possible to reduce the power consumption even further by using a carry-bypass adder or carry-select adder.

4. Support for pattern-based CSD algorithms: As mentioned before, the CMG is designed for DSP applications where the constants are assumed to be small. However, if the constants are large, pattern-based algorithms should reduce the computation significantly, thus reducing the power.
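The segmented multiplication mentioned in the recommendations above, together with zero bypassing, can be modelled behaviourally as in the following Python sketch. It is an illustration under stated assumptions rather than the circuit of Section 3.1.1: the function name is hypothetical, the operand is assumed to be a two's complement value split into an unsigned low segment and a signed high segment, and a segment is bypassed only when it is exactly zero.

```python
def segmented_multiply(x, constant, low_bits):
    """Multiply x by constant as two partial products and report how many
    segment multipliers were actually active (not bypassed).
    Returns (product, active_segments)."""
    low = x & ((1 << low_bits) - 1)   # unsigned low segment
    high = x >> low_bits              # signed high segment (arithmetic shift)
    product = 0
    active = 0
    if low != 0:                      # bypass the low multiplier on a zero segment
        product += low * constant
        active += 1
    if high != 0:                     # bypass the high multiplier on a zero segment
        product += (high * constant) << low_bits
        active += 1
    return product, active

# Small positive inputs activate only the low segment; zero activates none.
for value in (0, 5, 200, -3):
    print(value, segmented_multiply(value, 27246, 6))
```

In this simplified model a small negative input still activates the high segment, because its sign-extension bits are all ones rather than zeros. How the actual design treats that case is not stated in this chapter, so the activity counts here should be read as illustrative only.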

Bibliography

[1] W. H. Chen, C. H. Smith, and S. C. Fralick, "A Fast Computational Algorithm for the Discrete Cosine Transform", IEEE Trans. on Communications, vol. Com-25, no. 9, pp. 1003-1009, September 1977

[2] S. Uramoto, Y. Inoue, A. Takabatake, J. Takeda, Y. Yamashita, H. Terane, and M. Yoshimoto, "A 100-MHz 2-D discrete cosine transform core processor", IEEE J. of solid-state circuits, vol. 27, no. 4, pp. 492-499, April 1992.

[3] Y. F. Jang, J. N. Kao, J. S. Yang, and P. C. Huang, "A 0.8µ 100-MHz 2-D DCT core processor", IEEE trans. on consumer electronics, vol. 40, no. 3, pp. 703-709, August 1994.

[4] A. Madisetti and A. N. Willson, "A 100 MHz 2-D 8x8 DCT/IDCT Processor for HDTV Applications", IEEE Trans. on Circuits and Systems for Video Tech., vol. 5, No. 2, pp. 158-161, April 1995.

[5] T. Masaki, Y. Morimoto, T. Onoye, and I. Shirakawa, "VLSI Implementation of Inverse Discrete Cosine Transform and Motion Compensator for MPEG2 HDTV Video Decoding", IEEE Trans. on Circuits and Systems for Video Tech., vol. 5, No. 5, pp. 387-395, October 1995.

[6] R. Sarmiento, C. Pulido, F. Tobajas, V. Armas, R. E. Chain, J. López, J. M. Nelson, and A. Núñez, "A 600 MHz 2-D DCT processor for MPEG application", Conference Record of the 31st Asilomar Conference on Signals, Systems & Computers 1997, vol. 2, pp. 1527-1531, 1998

[7] M. T. Sun, T. C. Chen, and A. M. Gottlieb, "VLSI Implementation of a 16x16 discrete cosine transform", IEEE trans. on circuits and systems, vol. 36, no. 4, pp. 610-617, April 1989

[8] W. Li, "A new algorithm to compute the DCT and its inverse", IEEE trans. on signal processing, vol. 39, no. 6, pp. 1305-1313, June 1991

[9] D. Slawecki and W. Lee, "DCT/IDCT Processor Design for High Data Rate Image Coding", IEEE Trans. on Circuits and Systems for Video Tech., vol. 2, No. 2, pp. 135-146, June 1993.

[10] C. Loeffler, A. Lightenberg, and G. S. Moschytz, "Practical fast 1-D DCT algorithms with 11-multiplications", ICASSP-89, vol. 2, pp. 988-991, 1989

[11] B. G. Lee, "A new algorithm to compute the discrete cosine transform", IEEE trans. on acoustics, speech, and signal processing, vol. ASSP-32, no. 6, pp. 1243-1245, December 1984

[12] H. S. Hou, "A fast recursive algorithm for computing the discrete cosine transform", IEEE trans. on acoustics, speech, and signal processing, vol. ASSP-35, no. 10, pp. 1455-1461, October 1987

[13] Y. Jeong, I. Lee, H. S. Kim, and K. T. Park, "Fast DCT algorithm with fewer multiplication stages", Electronics Letters, vol. 34, No. 8, pp. 723-724, April 1998.

[14] E. N. Farag and M. I. Elmasry, "Low-power implementation of discrete cosine transform", Proceedings of the Sixth Great Lakes Symposium on VLSI, pp. 174-177, 1996

[15] M. Kuhlmann and K. Parhi, "Power comparison of flow-graph and distributed arithmetic based DCT architectures", Conference Record of the 32nd Asilomar Conference on Signals, Systems & Computers, 1998, vol. 2, pp. 1214-1219, 1998

[16] C. V. Schimpfle, P. Reider, and J. A. Nossek, "A power efficient implementation of the discrete cosine transform", Conference Record of the 31st Asilomar Conference on Signals, Systems & Computers, 1997, vol. 1, pp. 729-733, 1998

[17] S. Masupe and T. Arslan, "Low power DCT implementation approach for VLSI DSP processors", ISCAS '99, vol. 1, pp. 149-152, 1999

[18] S. Masupe and T. Arslan, "Low power DCT implementation approach for CMOS-based DSP processors", Electronics Letters, vol. 34, no. 25, pp. 2392-2394, Dec. 1998

[19] T. Xanthopoulos and A. Chandrakasan, "A low-power DCT core using adaptive bitwidth and arithmetic activity exploiting signal correlations and quantization", Digest of Technical Papers, 1999 Symposium on VLSI Circuits, pp. 11-12, 1999

[20] T. Xanthopoulos and A. Chandrakasan, "A low-power IDCT macrocell for MPEG2 MP@ML exploiting data distribution properties for minimal activity", Digest of Technical Papers, 1998 Symposium on VLSI Circuits, pp. 38-39, 1998

[21] T. Xanthopoulos and A. Chandrakasan, "A low-power IDCT macrocell for MPEG2 MP@ML exploiting data distribution properties for minimal activity", IEEE J. of solid-state circuits, vol. 34, no. 5, pp. 693-703, May 1999

[22] T. Kuroda, T. Fujita, S. Mita, T. Nagamatsu, S. Yoshioka, K. Suzuki, F. Sano, M. Norishima, M. Murota, M. Kako, M. Kinugawa, M. Kakumu, and T. Sakurai, "A 0.9V, 150MHz, 10mW, 4mm², 2-D discrete cosine transform core processor with variable threshold-voltage (VT) scheme", IEEE J. of solid-state circuits, vol. 31, no. 11, pp. 1770-1779, November 1996

[23] M. Matsui, H. Hara, Y. Uetani, L. S. Kim, T. Nagamatsu, Y. Watanabe, A. Chiba, K. Matsuda, and T. Sakurai, "A 200 MHz 13 mm² 2-D DCT macrocell using sense-amplifying pipeline flip-flop scheme", IEEE J. of solid-state circuits, vol. 29, no. 12, pp. 1482-1490, December 1994

[24] M. Matsui, H. Hara, K. Seta, Y. Uetani, L. S. Kim, T. Nagamatsu, T. Shimazawa, S. Mita, G. Otomo, T. Oto, Y. Watanabe, F. Sano, A. Chiba, K. Matsuda, T. Sakurai, "200MHz video compression macrocells using low-swing differential logic", ISSC'94, pp. 76-77, 1993

[25] M. Hamada, T. Terazawa, T. Higashi, S. Kitabayashi, S. Mita, Y. Watanabe, M. Ashino, H. Hara, and T. Kuroda, "Flip-flop selection technique for power-delay trade-off", ISSC'99, pp. 270-271, 1999

[26] T. H. Chen, "A cost-effective 8x8 2-D IDCT core processor with folded architecture", IEEE trans. on consumer electronics, vol. 45, no. 2, pp. 333-339, May 1999

[27] "IEEE Standard Specifications for the Implementation of 8x8 Inverse Discrete Cosine Transform", IEEE Std. 1180-1990, March 1991.

[28] Xanthopoulos, "Low power data-dependent transform video and still image coding", Ph.D. Thesis, M.I.T., February 1999.

[29] E. Feig and S. Winograd, "Fast algorithms for the discrete cosine transform", IEEE trans. on signal processing, 40(9), pp. 2174-2193, September 1992.

[30] K. Hwang, Computer Arithmetic - Principles, Architecture, and Design, John Wiley & Sons, 1979, pp. 149-151.

[31] Z. Wang, "Fast Algorithms for Discrete W-Transform and for the Discrete Fourier Transform", IEEE trans. on acoustics, speech and signal processing, vol. ASSP-32, no. 4, pp. 803-816, August 1984.

[32] M. Vetterli, W. Nussbaumer, "Simple FFT and DCT Algorithms with Reduced Number of Operations", Signal Processing (North Holland), vol. 6, no. 4, pp. 264-275, August 1984

[33] N. Suehiro, M. Hatori, "Fast algorithms for the DFT and other Sinusoidal Transforms", IEEE Trans. on acoustics, speech, and signal processing, vol. ASSP-34, no. 3, pp. 642-664, June 1986

[34] P. Duhamel and H. H'Mida, "New 2^n DCT algorithms suitable for VLSI implementation", Proceedings IEEE international conference on acoustics, speech and signal processing, ICASSP-87, Dallas, pp. 1805-1808, April 1987

[35] K. Hwang, pp. 152-155

[36] S. Shah, A. J. Al-Khalili, and D. Al-Khalili, "Comparison of 32-bit multipliers of various performance measures", Proceedings of the 12th International Conference on Microelectronics, ICM'2000, pp. 75-80, October 31 - November 2, 2000

[37] A. Bhattacharya and S. Haider, "A VLSI implementation of the inverse cosine transform", International J. of Pattern Recognition and AI, 9(2), pp. 303-314, 1995

[38] K. R. Rao and P. Yip, Discrete Cosine Transform - Algorithms, Advantages, Applications, Academic Press, 1990, pp. 10-15

[39] V. Lefèvre, "Multiplication by an integer constant", LIP research report RR1999-06, Laboratoire d'Informatique du Parallélisme, Lyon, France, 1999

[40] F. de Dinechin and V. Lefèvre, "Constant Multipliers for FPGAs", LIP research report RR2000-18, Laboratoire d'Informatique du Parallélisme, Lyon, France, 2000

[41] R. Bernstein, "Multiplication by integer constants", Software - Practice and Experience, 16(7), July 1986, pp. 641-652

[42] M. Potkonjak, M. Srivastava, and A. Chandrakasan, "Multiple Constant Multiplications: Efficient and Versatile Frameworks for Exploring Common Subexpression Elimination", IEEE Trans. on CAD of IC and Systems, vol. 15, no. 2, pp. 151-165, February 1996

[43] Xilinx Corporation, "Constant (k) Coefficient Multiplier Generator for Virtex", Application Note, Version 1.1, March 12, 1999

[44] Xilinx Corporation, "Constant Coefficient Multipliers for XC4000E", Application Note XAPP 054, Version 1.1, December 11, 1996

[45] R. Hartley, "Optimization of Canonical Sign Digit Multipliers for Filter Design", IEEE International Symposium on Circuits and Systems, 1991, vol. 4, pp. 1992-1995, 1991

Appendix A

Truncation Test Result

Table 17 shows the truncation error of 3 test video sequences: coke, salesman, and tennis. The truncation error is defined as:

Truncation Error = Average PSNR(reference) - Average PSNR(truncation)

Each sequence is encoded with pure I-frames, 8 Mb/s and 180 frames. The FDCT is computed with fixed-point calculation with 11-bit precision after the binary point.
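The truncation error defined above can be computed as in the following Python sketch. The function names are hypothetical, frames are assumed to be flat lists of 8-bit pixel values (peak value 255), and PSNR is assumed to be measured against the original, uncompressed frames; the thesis does not state these details in this appendix.

```python
import math

def psnr(original, decoded, peak=255):
    """PSNR in dB between two equal-length sequences of pixel values."""
    mse = sum((o - d) ** 2 for o, d in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf   # identical frames
    return 10 * math.log10(peak ** 2 / mse)

def average_psnr(originals, decoded_frames):
    """Mean PSNR over all frames of a sequence."""
    values = [psnr(o, d) for o, d in zip(originals, decoded_frames)]
    return sum(values) / len(values)

def truncation_error(originals, reference_frames, truncated_frames):
    """Average PSNR(reference) minus Average PSNR(truncation), in dB.
    A larger value means truncation cost more quality."""
    return (average_psnr(originals, reference_frames)
            - average_psnr(originals, truncated_frames))
```

For example, if the reference encoder produces a mean squared error of 1 and the truncated encoder a mean squared error of 4 on the same frame, the truncation error is about 6.02 dB.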

Table columns: Trunc(2^n), Number of Truncated Bits; Tennis (dB); Coke (dB); Salesman (dB); Average of 3 Sequences (dB)

Table 17: Truncation errors of test sequences: coke, salesman, and tennis

a r ch i t e c tu r e S t ruczu rz l o f COS-3-16 Fs component HalfAdder

port (A, 9: i2 Std-Loqic; S m , Cout: out Std-Logic) ; end componenz ;

cornuone3r FullAdcer O , , c i : i 3cd-Logit; Sum, Cocr: ou t Szd-Logic);

ena component;

s i g n o l S û r CO, SI, C i , s 2 , C 2 , 5 3 , C 1 : Srci-Logic-Vector (25 downto O ) ;

s igna l n-m : Std - LoqFc-Vector (25 downto O ) ; s i g n a l ZEXO: Std-Logic; -- cons tan^ s i g n a l ' O ' s i gnc i ONE : Stc-Logic; -- Cozstan t s i g n a l '1'

ZE8O <= ' 0 ' ; ONE <= '1';

-- I n ~ e r t e à i n p u t siqnals: N <= nor P;

-- a i r O Srage 9: -- B i t O Stage 1:

-- B i t O Stage 2: -- B i = O Çcoge 3:

-- B i t 1 Stage O:

ïiA - 0-1: XalfAdder 3 0 x rnac (N ( I l ,N ( -- B i t L Scaqe 1: -- B i t I Sïaçe 2: -- Biz I Srage 3 :

-- Sic 2 Stzge O: -- E i t 2 Stage 1:

HA-1-2: HalfAdder porc map (N ( 2 , CO -- Bi= 2 Scage 2: -- a i t 2 Stage 3 :

-- E i t 3 Stage O: 34-0-3: FuilAdder p o r t map(N( 3),N(

-- B i t 3 Stage 1: -- a i z 3 Stage 2:

Ka--2-3: Holr'P-dder porz rnaplSO( 3 ) , C I ( 23,S2( 3),C2( 3 ) ) ; -- B i t 3 Stage 3:

-- B i r 4 Stzge O : -- Bit 4 Stage 1:

FP--I-?: FullAdder p o r t map(N( 4),A;I( l),CO( 3),Si( 4),CI( 4 ) ) ; -- B i t 4 S t a g e 2 : -- B i t 4 stage 3:

m-3-4: H a l f A d d e r porc ~ p ( S l ( 4) ,C2( 3),S3( 4),C3( 4));

-- Sic 6 S c a q e O : E3--0-6: FullS.dder port rnap(bJ( 61, NI 3 ) , 1 ( C),SO( CO( 6));

-- B i c 6 Stage 1: -- Biz 6 S t a g e 2 : -- B i t 6 S t e g e 3:

-- Sic 7 Sz+çe O : FA-C-7: Fü1IAdcier p o r t - p ( N ( 7 ) , Y f 4)r?1 I),SO( ?),CO( 7 ) ) ;

-- S i r 7 S c a g e 1: -- Bit 7 S t a g e 2: -- Bit 7 S c a g e 3:

-- B i t I I S t a q e O : Fr? - 0-11: F t r l l A d c i e r p o r c .nap(N(LI),N( a ) , P ( 5),S0(11),C0(11)1;

-- E i c II Scage 1: FA - 1-11: F u l l A d d e r porz mâp(P( ~ ) , S O ( I ~ ) , C O ( I O ) , S I ( ~ ~ ) , C ~ ( ~ ~ ) ) ;

-- Bit 11 Sracre 2:

-- Bic 1 3 Stage O: FA O 13: FullAdder p 0 r . L r n a p ( N ( l O ) , P ( 7 ) , P ( 5),~0(13),~0(13));

-- Bit 20 Stage 1:

FA-1-20: FullAdder porc nzp( ONE ,SO(20) ,CO(19) ,S1{20) ,CI (20) ) ; -- B i t 20 Stâge 2:

-- 9it 21 Stage O: fK-0-21: FullP-dder porc nzp(P(1I) ,N( 9 ) , P ( 7) ,S0 (211 ,CO i21) ;

-- Bit 21 Stage 1: FA 1 21: FxlLidder sort mzp( ONE ,S0(21) ,CO(20) ,S1(21) ,C1(21) 1 ;

-- ~it-21 Stage 2: KA-2-21 : 5clfAdder port map CS1 (21) ,CL (2C)) , S2 ( 2 1 ) ) ;

-- Bit 21 Stage 1:

-- Bit 22 Stage 0:
FA_0_22: FullAdder port map(N(10),P( 8), ONE ,S0(22),C0(22));
-- Bit 22 Stage 1:
-- Bit 22 Stage 2:
FA_2_22: FullAdder port map(S0(22),C0(21),C1(21),S2(22),C2(22));
-- Bit 22 Stage 3:

-- Bit 23 Stage 0:
FA_0_23: FullAdder port map(N(11),P( 9), ONE ,S0(23),C0(23));
-- Bit 23 Stage 1:
HA_1_23: HalfAdder port map(S0(23),C0(22),S1(23),C1(23));
-- Bit 23 Stage 2:
-- Bit 23 Stage 3:

-- Bit 24 Stage 0:
-- Bit 24 Stage 1:
HA_1_24: HalfAdder port map(P(10),C0(23),S1(24),C1(24));
-- Bit 24 Stage 2:
-- Bit 24 Stage 3:

end;

-- Statistical Information:
--   Add Stage   : 4
--   # Inverter  : 12
--   # Half adder: 13
--   # Full adder: 48

Appendix C

Source Code of Constant Multiplier Generator

The following is the C++ source code listing for the constant multiplier generator. The code is listed in alphabetical order of the source file names. The header file (.h) always precedes the implementation file (.cpp). The main program is located inside file IntMult.cpp. Note that all code is also included on the attached CD.

using namespace std;

unsigned nReadyAtStage(SignalVector& imt, int curStage);

void getAdderOperand(unsigned nOp, unsigned maxConstInput, SignalVector& imt,
                     SignalVector& opToAdd);

void createHA(vector<SignalVector> &imt, unsigned curBit, unsigned curStage,
              unsigned maxConstOp, ostream& o);
void createFA(vector<SignalVector> &imt, unsigned curBit, unsigned curStage,
              unsigned maxConstOp, ostream& o);

void generate_VHDL_CSA_Body(vector<SignalVector> &imt, unsigned CSA_Stage,
                            unsigned& nHalfAdder, unsigned& nFullAdder,
                            ostream& csa);

using namespace std;

static Signal
    SIGNAL_SUM  (VARIABLE, "S", SUM,   false, NONE, -1, -1),
    SIGNAL_CARRY(VARIABLE, "C", CARRY, false, NONE, -1, -1),
    SIGNAL_SIGN (SIGN, "Sign", 0, false, NONE, -1, -1);

//---------------------------------------------------------------------------
void getAdderOperand(unsigned nOp, unsigned maxConstInput,
                     SignalVector& sv, SignalVector& opToAdd)
{
    unsigned i=0;
    while (nOp>0 && i<sv.size())
        if (sv[i].ID==CARRY || sv[i].ID==SUM)
        {
            nOp--;
            opToAdd.push_back(sv[i]);
            sv.erase(sv.begin()+i);
        }
        else i++;

    if (nOp==0) return;

//------------------------------'--------------------------------------------

void createHA(vector<SignalVector> &sv, unsigned curBit, unsigned curStage,
              unsigned maxConstOp, ostream& o)
{
    SignalVector opToAdd;
    getAdderOperand(2, maxConstOp, sv[curBit], opToAdd);

    SIGNAL_SUM.bitPos=SIGNAL_CARRY.bitPos=curBit;
    SIGNAL_SUM.stage =SIGNAL_CARRY.stage =curStage;

void createFA(vector<SignalVector> &sv,
              unsigned curBit, unsigned curStage, unsigned maxConstOp, ostream& o)
{
    SignalVector opToAdd;
    getAdderOperand(3, maxConstOp, sv[curBit], opToAdd);

    SIGNAL_SUM.bitPos=SIGNAL_CARRY.bitPos=curBit;
    SIGNAL_SUM.stage =SIGNAL_CARRY.stage =curStage;

    sv[curBit].push_back(SIGNAL_SUM);
    if (curBit==sv.size()-1) return;
    sv[curBit+1].push_back(SIGNAL_CARRY);
    return;
}

    // Generating carry-save adder VHDL code
    nHalfAdder=nFullAdder=0;    // Complexity Stat

    bool HA_for_2op = true;
    bool isFirstAdder;
    int nReady;
    for (i=0; i<sv.size(); i++)
    {
        csa << "\n";
        isFirstAdder = true;

        for (j=0; j<CSA_Stage; j++)
        {
            csa << "-- Bit "<<i<<" Stage "<<j<<" : \n";
            if (HA_for_2op)
            {
                switch (nReadyAtStage(sv[i], j))
                {
                    case 0: break;
                    case 1: break;
                    case 2:
                    {
                        if (sv[i].size()==2)
                        {
                            createHA(sv, i, j, (isFirstAdder?2:1), csa);
                            if (j==CSA_Stage-1) HA_for_2op=false;
                            isFirstAdder=false;
                            nHalfAdder++;
                        }
                        break;
                    }
                    default:
                    {
                        createFA(sv, i, j, (isFirstAdder?3:1), csa);
                        if (j==CSA_Stage-1) HA_for_2op=false;
                        isFirstAdder=false;
                        nFullAdder++;
                    }
                }
            }
            else    // HA_for_2op = false
            {
                nReady = nReadyAtStage(sv[i], j);
                if (sv[i].size()==3)
                {
                    if (nReady==2 || nReady==3)
                    {
                        createHA(sv, i, j, (isFirstAdder?2:1), csa);
                        isFirstAdder = false;
                        nHalfAdder++;
                    }
                }
                else if (nReady>=3)
                {
                    createFA(sv, i, j, (isFirstAdder?3:1), csa);
                    isFirstAdder = false;
                    nFullAdder++;
                }
            }
        }
    }
}

void simplifyConstants(vector<SignalVector> &c)
{
    SignalVector::iterator result;
    int nOne, carry=0;
    for (unsigned i=0; i<c.size(); i++)
    {
        nOne=0;
        //count(c[i].begin(),c[i].end(),SIGNAL_ONE,nOne);
        nOne = count(c[i].begin(),c[i].end(),SIGNAL_ONE);

        // Removing constant zeros: No operation
        result = remove(c[i].begin(),c[i].end(),SIGNAL_ZERO);
        c[i].erase(result,c[i].end());

        // Simplify constant ones: adding them together
        result = remove(c[i].begin(),c[i].end(),SIGNAL_ONE);
        c[i].erase(result,c[i].end());

        nOne+=carry;
        carry=nOne/2;
        nOne%=2;
        if (nOne!=0) c[i].push_back(SIGNAL_ONE);
    }
}
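The constant-folding step in simplifyConstants can be looked at in isolation. The sketch below is not part of the thesis code: the function name foldConstantOnes and the use of a plain vector of per-column counts (instead of SignalVector) are assumptions made for illustration. It only mirrors the arithmetic of the listing: the constant ones of a bit column are added to the incoming carry, half of the total moves to the next column, and at most a single constant one stays behind.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from the thesis sources.
// onesPerColumn[i] is the number of constant '1' inputs in bit column i
// (LSB first). The result tells, for every column, whether one constant
// '1' remains after the ones have been added together.
// A carry out of the last column is dropped, as in the listing.
std::vector<bool> foldConstantOnes(const std::vector<unsigned>& onesPerColumn)
{
    std::vector<bool> keepOne(onesPerColumn.size(), false);
    unsigned carry = 0;
    for (std::size_t i = 0; i < onesPerColumn.size(); ++i) {
        unsigned nOne = onesPerColumn[i] + carry;  // ones in this column plus carry-in
        carry = nOne / 2;                          // every pair of ones becomes a carry
        keepOne[i] = (nOne % 2) != 0;              // an odd count leaves a single one
    }
    return keepOne;
}
```

For example, three ones in column 0 and one in column 2 (value 3 + 4 = 7) fold into one constant one in each of columns 0, 1 and 2.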

void create_CSA_Vector(
    vector<SignalVector>& op, vector<SignalVector>& csa, bool isSigned)
{
    int i, j;
    unsigned maxBit=0;
    for (i=0; i<op.size(); i++)
        if (op[i].size()>maxBit) maxBit=op[i].size();

    for (i=op.size()>>1, j=0; i!=0; i>>=1, j++);
        // Get the MSB position of i: Log2(op.size())

    maxBit+=j;      // n m-bit operand will have output of n+m bit

    csa.resize(maxBit);

    Signal signal(VARIABLE, "Op", 0, true, NONE, -1, -1);
    for (signal.ID=0; signal.ID<op.size(); signal.ID++)
        o << "    "<<signal<<" : in Std_Logic_Vector("<<(op[signal.ID].size()-1)<<" downto 0);\n";

    signal.name = "Sum";
    for (signal.ID=1; signal.ID<=2; signal.ID++)
        o << "    "<<signal<<" : out Std_Logic_Vector("<<(nBitOut-1)<<" downto 0);\n";

    o << "  );\n"
      << "end;\n\n";

    // Generate VHDL Architecture Header
    o << "architecture Structural of " << entityName <<" is\n"
      << "  component HalfAdder\n"
      << "    port (A, B: in Std_Logic; Sum, Cout: out Std_Logic);\n"
      << "  end component;\n\n"
      << "  component FullAdder\n"
      << "    port (A, B, Cin: in Std_Logic; Sum, Cout: out Std_Logic);\n"
      << "  end component;\n\n";

    CSA_Stage = (op.size()>=3 ? op.size()-2 : 1);
    SIGNAL_SUM.showID=SIGNAL_CARRY.showID=true;
    o << "  signal" << endl;
    for (i=0; i<=CSA_Stage; i++)
    {
        SIGNAL_SUM.ID = SIGNAL_CARRY.ID = i;
        o << "    "<<SIGNAL_SUM<<", "<<SIGNAL_CARRY;
        if (i!=CSA_Stage)
            o << ",\n";
        else
            o << ": Std_Logic_Vector("<<(nBitOut-1)<<" downto 0);\n\n";
    }

    if (isSigned)
    {
        o << "  signal ";
        for (SIGNAL_SIGN.ID=0; SIGNAL_SIGN.ID<op.size(); SIGNAL_SIGN.ID++)
            o << SIGNAL_SIGN << (SIGNAL_SIGN.ID<op.size()-1 ? ", " : " ");

        o << ": Std_Logic_Vector("<<(op.size()-1)<<" downto 0);\n";
    }

    if (isSigned)
    {
        signal.name = "Op";
        for (i=0; i<op.size(); i++)
        {
            SIGNAL_SIGN.ID = signal.ID = i;

void generate_VHDL_CSA_Tail(
    vector<SignalVector> &imt,
    SignalVector &out1, SignalVector &out2, ostream& o)
{
//--------------------------------------------------------------------------
    // Map internal signals to output
    ostrstream num1, num2;

    Signal signal(VARIABLE, "Sum", 0, true, NONE, -1, -1);

    char *s1 = num1.str(); s1[num1.pcount()]='\0';
    char *s2 = num2.str(); s2[num2.pcount()]='\0';
    o << "\n" << s1 << s2 << "\n";
    o << "end;\n\n";
    o.flush();

//------------------------------------------------------------------------
void generate_VHDL_CSA(
    char* entityName, vector<SignalVector>& op, bool isSigned,
    SignalVector &out1, SignalVector &out2, ostream& o)

    int CSA_Stage;
    unsigned nHalfAdder, nFullAdder;
    vector<SignalVector> csa;

    create_CSA_Vector(op, csa, isSigned);
    generate_VHDL_CSA_Header(entityName, op, isSigned, csa.size(), CSA_Stage, o);
    generate_VHDL_CSA_Body(csa, CSA_Stage, nHalfAdder, nFullAdder, o);
    generate_VHDL_CSA_Tail(csa, out1, out2, o);

HWMult.h

#ifndef _HWMULT_H
#define _HWMULT_H

void HWMult(unsigned nVarBit, bool signedVar, unsigned long constOp,
            vector<SignalVector> &out, ostream& o, ostream& component,
            char* entityName=0, unsigned truncLSB=0,
            bool generateProduct=true, bool byPass=false);

HWMult.cpp

#include "HWMult.h"
#include "CSA.h"
#include "NumberSystem.h"
#include "Nonzero.h"

using namespace std;

static Signal
    SIGNAL_SIGN_P(SIGN, "Sign", 0, false, POSITIVE, -1, -1),
    SIGNAL_SIGN_N(SIGN, "Sign", 0, false, NEGATIVE, -1, -1);

uirsigned nOutEit1, unsigned nOutaFt2, unsigneci out3it2of fset, w-signed CSA-Scage, unslgced nCSlsBFt, bool F~vertedInpuc, ostream& O, ostream& c, bool generatêJroduct, bool bypass)

t o << " 3 e s u l t : ouc Scd-LcgFc-Veczor ("<< (nOuc9itl-1) <<" downto O) \n"; c << " R e s u L c : oct Std-Logic~Vec-,or("<<(nOut3Lt1-1)~~~f downto O)\n";

1 else f O << " Xesrrltl: oilc Std-Loqic-Vector ("<< (nOutSit1-It cc'' downto O) ;\n"

<< " Result2: cur S t c - Loçic~Veccor("<<(nOutSF~2-l)<<" downto O)\nn; c << " R e s ~ i t l : e u t Szci-Logic-Vector ( "<< (nOut8iil-1) <Cs' downto O) ; \nu

<< " Result2: out Scd-LogFcJector ("<< (nOtitBit2-1) <Cg' downto O) \n"; 1

O << " ) ;\n" << "ena;\n\nW;

c << ) ; \n" << " end componenz;\n\n";

/ / Generace VHDL Archiïecrure Header O CC "architecture Structural of " << entityName CC" i s \ n m

<< " port (A, B: in StkLogic; Sum, COUC: out Scd-Logic) ;\nV cc ecd component;\n\nW

<< II component FulXader\n" cc Il porr iA, 9, Cin: in StdLogic; S m , Cour: out Std-Logic) ;\n" < enc cornpone~t; \n\n" ;

L 0 << " Sm << i <<", C" << i; iE (:!=CS-A-Scaqe-l)

O cc ",\nu; else O cc ": Std-Logic-Vecror ("Cc (nCSFBit-l) <<" downto O) ; \n\nn;

1

if ( invertedIn-uz} I O c< " signal N : Std - Logic-Vector ("cc (riVar9i1~-1) cc'* downto O) ; \n\nn; if (signeci'Jzr)

o << " siqnal "<<CIGM-AL-SIGN-3<<", "<<SIGN3L-SIGN-N<<": Std-Logic;\nn; 1

O c< " sicnzl P : S t d - Logic-Vector !"<< (nVarBFt -1) <<" downto O) ; \nW << " signal numl : Std-Logic-Vector ("<< (;?OuiSitI-1) <<Ir domto O) ;\nu << '* sicnal num2 : Scd-Logic-Vector("<c(nOucBit2-l)<<" aownto O);\n\nW;

o << " signal ZERO: Std-Logic; -- Constant signal 'O'\nt' <c " signal ONE : Std-Loçic; -- Zonscanc signal 'll\n";

if (bypzss) O << " signal NonZeraIn: Std-LoqFc;\nW

cc " s i g n a l ZERO-Out : S c d - Logic-Vector ("<< (nOutBit1-1) <cl1 downto O} ;\n\nW;

if (byPass) (

O << " SP: MZ"cCnVzrSFt<<" port ma? (VarI2, NonZeroIn) ; \n" <c " P <= VarIn when (NonZeroIn=' 1' ) else P; \n\nM;

e L s e O € c " P <= Var1n;\n\nu;

if (inverreciInpur) f 0 c< II-- Inverced input siqnals:\nW; //for :i=O; i<nVarBit; itt)

if (signedvar) / / Signeci variable oceranc! O << " " << SImTkL-sIG??-P << " <= P("<<(nVarBir-1) <<") ; \nvf

CC " " << SIGNAL-SIGN-Pl << " <= N ("<< (nVarBit-1) <<") ; \n\nn'; k

void generace-VSCL-Hm-TâiI ( vector<SignzlVector> &imt, vector<SiqnclVectcr> &cct, oscrêarri& O, Uool qenerzreproduct, b o o l byFass)

i ,./-------------------------------------------------------------------------- / / Mep inccxnzl signal~ to ouepur 3 s trçirfzq nu?nl, nm2 ; numl << " n u m l <= " ; nu12 << " m . <= ";

j +t ; if ( (jâ8)==O) ( numl c< "\ri "; nwn2 cc "\n

1 1 nrrrnl<<"; \n"; nu1;i2<<"; ?nW; numl . flush { 1 ; n-~m2. f l u s h ( 1 ;

cher -SI = rruml.szr(); s1[nunl.pcount()]='\O1; char * s 2 = num2.str O ; s2[num2.pcoun~(} ]='\O1; O << "\n" c< sl << ç2 <c "\fi";

O << " ZERO-Ou= <= \"" ; for (i=O; i<ou=.size ( ) ; i--) o<<"O";

if (generatePrcduct i O cc " nun <= Unsigsec(ncml) + Unsigned(nun2) ;\II"; if (byPass)

O c< " R e s c l t <= num whec (NonZeroLn='lr ) else ZSBO0ut;\n\n1'; eLse o c c " R e s u l t <= \n\n" ;

k else I

iE (5yPass) O << " R~sulclC= n u l when (NonZeroIn=' l ' 1 eise Z E R O O u t ; \nt'

<< " Result2C= n m 2 when (NonZerofn=' l' 1 else ZEE?OOuc; \n\nW; else o << " ?.esulzl<= narnl;\nw

<< " ?.esült2<= n 1 a 2 ; \n\n8';

vec~or<SiçnolV2ctor> signZero (sign) , signOne (sign) ; SIgnalVeccor: : i t e r a r o r resulr;

/ / For Sign=l => 3ernove al1 "Sign-N (=O)" & Replace Sign-P with "ln resclt = remove :signOne fi] . bêg in ( ) , signone [il -end ( 1 SIGNPLJ;SIGNNN1 ; siqnOne[i] .erase (result, signone [il .ericO ) ;

reclace (s içnOne fi 1 . b q i n ( ) sLgn0ne [il .*-ONE) ;

1

unsigneà n O u t S i ~ = nVarsic + const3it.sizeO; / / N u m b e r of output bit imt . r l s i z e (nCu=Bir) ; / / Inte-mediate signals sign. res i ze (nOut9it) ; / / Sign and constanc 1's

/ / I n s e r t a l1 Lncermediate signols S i q n z l siçnal;

invercedIzpuc = false; fo r ( I = G ; icconsrBit .size ( 1 ; ii+) t

if (canszBit[ij==l) t

fcz [ j=O; j< (signecVar?~Vor3it-I:nVarBit) ; j++) i signal .bitPos=j ; çiqnal,inv~rted=2OSITIVS; kr[i+j j .push-back(signa1) ;

1

s i g n [il , p u s h b a c k (SIGNPL-ONE) ;

/ / Kerge s i g n / c o n s ~ a n t s t o g e t e r and perform opcimization for constant 1's unsignec mxEepth=O ; f o r ( i = G ; i < i m t . s i z e O ; i++)

if (siçnii] .sFze() !=O) t

if ( s i g r i [ i ] [O]==SIGMAL-CNS &t i C i m c . s i z e 0 - l & & imï[i].sizeO==l & &

irr , i[i-l] . s F z e ( ) c=2)

/ / bit T I => sm-=(nec o i t ) , c a r q r = S i t iinï [ i + l ] .push-bock (imt [ i l [O] ) ; imt[ij [ 0 ] . i n v e r t e d = i m t [ i ] [O] .lnverted==POSITIVE ? NEGATIVE: POSITIVE; i nve r t ed rnpuc = t r u e ;

1 e l s e i n t [ i ] .push-bock(sigr:[FI [ W ;

i

. - I r ( F r n t [ i f .size ( ) > m a x C e g r n f inaxDepth=imt [il .size ( 1 ;

!

CSA - Stage=(rnâxDepcB>3) ? ~axDepth -2 : (maxDepzh>O ? 1 : 0) ; 1

void HWMult(
    unsigned nVarBit, bool signedVar, unsigned long constOp,
    vector<SignalVector> &out, ostream& o, ostream& component,
    char* entityName, unsigned truncLSB, bool generateProduct, bool byPass)
{
    unsigned i, j;

    // Construct Multiplication vector to be used in CSA
    create_HW_Vector(nVarBit, signedVar, constOp, imt, invertedInput,
                     CSA_Stage);

        imt.erase(imt.begin(), imt.begin()+truncLSB);
        cout << "\nAfter truncating "<<truncLSB<<" bits:\n";
    }

    //--------------------------------------------------------------------
    // Generating VHDL Code
    generate_VHDL_HW_Header(nVarBit, signedVar, constOp, entityName,
        imt.size(), imt.size(), 0, CSA_Stage, imt.size(), invertedInput,
        o, component, generateProduct, byPass);

    o.flush();

    //--------------------------------------------------------------------------
    // Generating carry-save adder VHDL code
    unsigned nHalfAdder, nFullAdder;
    generate_VHDL_CSA_Body(imt, CSA_Stage, nHalfAdder, nFullAdder, o);
    o.flush();

    // Generating VHDL tail (end architecture & statistical information)
    generate_VHDL_HW_Tail(imt, out, o, generateProduct, byPass);

using namespace std;

const double pi = 3.14159265358979323846;

int main(int argc, char* argv[])

    unsigned long val;
    cout << "Constant Operand : ";
    cin >> val;

    int nVarBit, truncLSB, bypass;
    int signedVar, generateProduct;
    cout << "# bit of Variable Operand : ";
    cin >> nVarBit;
    cout << "Signed variable operand (0/1) : ";
    cin >> signedVar;
    cout << "# bit truncated at LSB : ";
    cin >> truncLSB;
    cout << "Generate product (0/1) : ";
    cin >> generateProduct;
    cout << "Bypass Zero (0/1) : ";
    cin >> bypass;

    cout << "Entity (file) name : ";
    cin >> entityName;
    if (entityName[0]=='\0')

    cout << "\n\n";
    cout << "Constant Value : "<<val<<"\n\n";

#include <fstream>

using namespace std;

void Nonzero(char* entityName, unsigned nBit, ostream& c)
{
    char fileName[256];
    sprintf(fileName, "%s.vhd", entityName);
    ofstream f(fileName, ios::out);

      << "  port\n"
      << "  (\n"
      << "    D : in Std_Logic_Vector("<<(nBit-1)<<" downto 0);\n"
      << "    NZ: out Std_Logic\n"
      << "  );\n"
      << "end;\n\n";

// Convert unsigned long to a sequence of bit.
// The MSB of the returning bit is always 0
void ulongToBit(unsigned long l, vector<char>& bit);

void binaryToSignDigit(vector<char>& bit);
void optimizeSD(vector<char> &bit);     // Reduce -1's
void ulongToSignDigit(unsigned long l, vector<char>& sd);

void ulongToBooth(unsigned long l, vector<char>& booth);
void BoothToSignDigit(vector<char>& booth, vector<char>& sd);

ostream& printBit(ostream& o, vector<char>& bit);
void showBit(vector<char>& bit);

#endif

// Convert unsigned long to a sequence of bit.
// The MSB of the returning bit is always 0
void ulongToBit(unsigned long l, vector<char>& bit)
{
    bit.clear();
    for (; l!=0; l>>=1)
        bit.push_back(l&1 ? 1 : 0);
}

void showBit(vector<char>& bit)
{
    printBit(cout, bit);
}

    for (int i=bit.size()-1; i>=0; i--)
    {
        o << setw(3) << (int)bit[i];
        if (bit[i]!=0) weight++;
    }
    o << " Weight=" << weight;
    return o;

1

    int start, end;     // Start and end position of consecutive ones

    {
        start=i;
        for (end=i+1; end<nBit; end++)
            if (bit[end]==0) break;

        if (end-start>1)    // More than one 1's
        {
            bit[start]=-1;
            bit[end]=1;
            for (start++; start<end; start++) bit[start]=0;
        }
        i = end-1;

    if (bit[bit.size()-1]==0) bit.erase(bit.end()-1);
}

void ulongToSignDigit(unsigned long l, vector<char>& bit)
{
    ulongToBit(l, bit);
    binaryToSignDigit(bit);
    // optimizeSD(bit);
}
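The run-replacement idea used by binaryToSignDigit (a run of two or more consecutive ones becomes a +1 one position above the run and a -1 at the bottom of the run, with zeros in between) can be sketched as a self-contained routine. This is an illustrative reconstruction under stated assumptions, not the thesis source: the names toBitsWithGuard, recodeRuns and digitsToValue are hypothetical, digits are stored LSB first in a vector of int, and the extra zero at the MSB end follows the "MSB is always 0" comment in the listing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Binary digits of v, LSB first, with one extra 0 appended as a guard MSB
// (assumed from the comment "The MSB of the returning bit is always 0").
std::vector<int> toBitsWithGuard(unsigned long v)
{
    std::vector<int> bit;
    for (; v != 0; v >>= 1) bit.push_back((v & 1UL) ? 1 : 0);
    bit.push_back(0);
    return bit;
}

// Replace every run of two or more consecutive ones by -1 0 ... 0 +1
// (LSB first), which keeps the value because 2^end - 2^start equals the run.
std::vector<int> recodeRuns(std::vector<int> bit)
{
    assert(!bit.empty() && bit.back() == 0);    // the guard digit is required
    for (std::size_t i = 0; i < bit.size(); ++i) {
        if (bit[i] != 1) continue;
        std::size_t end = i + 1;
        while (end < bit.size() && bit[end] == 1) ++end;
        if (end - i > 1) {                      // more than one 1 in the run
            bit[i] = -1;
            for (std::size_t k = i + 1; k < end; ++k) bit[k] = 0;
            bit[end] = 1;
            i = end - 1;                        // the new 1 at 'end' is examined next
        }
    }
    if (bit.size() > 1 && bit.back() == 0) bit.pop_back();  // drop unused guard
    return bit;
}

// Value of an LSB-first signed-digit vector (helper for checking only).
long digitsToValue(const std::vector<int>& d)
{
    long v = 0;
    for (std::size_t i = d.size(); i-- > 0;) v = 2 * v + d[i];
    return v;
}
```

With this sketch, 7 (binary 111) becomes the digits -1, 0, 0, +1 (LSB first), so a multiplication by 7 needs one subtraction instead of two additions.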

static const char toBooth[] = { 0, 1, 1, 2, -2, -1, -1, 0 };

void BoothToSignDigit(vector<char>& booth, vector<char>& sd)
{
    for (int i=0; i<booth.size(); i++)
        switch (booth[i])
        {
            case -2: sd.push_back( 0); sd.push_back(-1); break;
            case -1: sd.push_back(-1); sd.push_back( 0); break;
            case  0: sd.push_back( 0); sd.push_back( 0); break;
            case  1: sd.push_back( 1); sd.push_back( 0); break;
            case  2: sd.push_back( 0); sd.push_back( 1); break;
        }

VHDLSignal.h

#ifndef _VHDL_SIGNAL_H
#define _VHDL_SIGNAL_H

#include <vector>
#include <string>
#include <iostream>
#include <iomanip>
#include <strstream>

using namespace std;

typedef enum { CONSTANT, VARIABLE, SIGN } signalType;
typedef enum { NONE, POSITIVE, NEGATIVE } invertType;
typedef enum { ZERO, ONE, OPEN } constIDType;
typedef enum { INPUT, SUM, CARRY } varIDType;

    Signal() : type(VARIABLE), name(""), ID(0), inverted(NONE), bitPos(-1),
               stage(-1), showID(false) {};

    Signal(signalType t, char* n, unsigned id, bool showid, invertType inv,
           int pos, int stg)

    signalType type;
    char* name;
    unsigned ID;            // ID for the signal
    bool showID;
    invertType inverted;
    int bitPos;             // Bit position (Non-negative integer. -1 will not show
                            // the bit position)
    int stage;              // stage where this signal is generated (-1: input or
                            // constant signal & will not show the stage)

    static int createNewSignal() { unsigned save=idCount++; return save; }
    static int idCount;
};

ostream& operator << (ostream& stream, const Signal& signal);
bool operator==(const Signal &s1, const Signal &s2);
bool operator!=(const Signal &s1, const Signal &s2);
bool operator< (const Signal &s1, const Signal &s2);

const Signal SIGNAL_ZERO(CONSTANT, "ZERO", ZERO, false, NONE, -1, -1),
             SIGNAL_ONE (CONSTANT, "ONE",  ONE,  false, NONE, -1, -1),
             SIGNAL_OPEN(CONSTANT, "\'X\'", OPEN, false, NONE, -1, -1);

using namespace std;

int Signal::idCount=0;

//------------------------------------------------------------------------
ostream& operator << (ostream& stream, const Signal& signal)
{
    stream << signal.name;
    if (signal.type==CONSTANT) return stream;

    if (signal.showID) stream<<signal.ID;

    if (signal.inverted!=NONE)
        stream << (signal.inverted==POSITIVE ? "P" : "N");

    // Variable signal
    if (signal.stage>=0)
        stream << signal.stage;

//------------------------------------------------------------------------
bool operator==(const Signal &s1, const Signal &s2)
{
    if (s1.type!=s2.type) return false;

    if (s1.type==CONSTANT) return (s1.ID==s2.ID);
    return (s1.ID==s2.ID && s1.inverted==s2.inverted && s1.bitPos==s2.bitPos &&
            s1.stage==s2.stage);
}

//------------------------------------------------------------------------
bool operator< (const Signal &s1, const Signal &s2)
{
    if (s1.stage<s2.stage) return true;
    if (s1.type==CONSTANT) return true;
    if (s1.stage==s2.stage)
        if (s1.ID==SUM && s2.ID==CARRY) return true;
    return false;
}

//------------------------------------------------------------------------
void printVSV(vector<SignalVector> &sv, ostream& o)

Appendix D

IEEE Standard 1180-1990 Compliant Test Program

The following is the Java source code listing for the IEEE Standard 1180-1990 compliance test program for the IDCT. It is used to determine the internal bandwidth of the IDCT for both the first-dimension IDCT and the second-dimension IDCT.

The code is listed in alphabetical order of the source file names. The main program is located inside file IEEE_1180_1990.java. Note that all code is also included on the attached CD.

To execute the program, use the following command: java IEEE_1180_1990. The program reads the internal bandwidth configuration from file Setup.txt and performs the tests to check whether the bandwidth yields IEEE 1180-1990 compliance.
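The kind of accuracy check performed by such a compliance test can be sketched as follows. This is a minimal illustration, not the thesis program (which is written in Java): the names Block, ErrorStats, measureErrors and withinLimits are hypothetical, and the numeric limits in withinLimits are the values commonly quoted for IEEE Std 1180-1990 (peak error 1, per-pixel mean square error 0.06, overall mean square error 0.02, per-pixel mean error 0.015, overall mean error 0.0015); they are an assumption here and should be checked against the standard itself. The sketch accumulates the pixel errors between a tested IDCT output and a reference output over a set of 8x8 blocks.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <vector>

struct Block { int px[8][8]; };      // one 8x8 block of reconstructed pixels

struct ErrorStats {
    int    peak;   // largest |error| of any single pixel
    double pmse;   // worst mean square error over the 64 pixel positions
    double pme;    // worst |mean error| over the 64 pixel positions
    double omse;   // mean square error over all pixels
    double ome;    // |mean error| over all pixels
};

// Accumulate the error statistics of 'test' against 'ref' (same block count).
ErrorStats measureErrors(const std::vector<Block>& test,
                         const std::vector<Block>& ref)
{
    assert(!test.empty() && test.size() == ref.size());
    long sum[8][8] = {};
    long sumSq[8][8] = {};
    ErrorStats s = {0, 0.0, 0.0, 0.0, 0.0};

    for (std::size_t b = 0; b < test.size(); ++b)
        for (int i = 0; i < 8; ++i)
            for (int j = 0; j < 8; ++j) {
                const int e = test[b].px[i][j] - ref[b].px[i][j];
                sum[i][j]   += e;
                sumSq[i][j] += static_cast<long>(e) * e;
                s.peak = std::max(s.peak, std::abs(e));
            }

    const double n = static_cast<double>(test.size());
    double total = 0.0, totalSq = 0.0;
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j) {
            s.pmse = std::max(s.pmse, sumSq[i][j] / n);
            s.pme  = std::max(s.pme, std::fabs(sum[i][j] / n));
            total   += sum[i][j];
            totalSq += sumSq[i][j];
        }
    s.omse = totalSq / (64.0 * n);
    s.ome  = std::fabs(total) / (64.0 * n);
    return s;
}

// Limits as commonly quoted for IEEE Std 1180-1990 (assumed, see lead-in).
bool withinLimits(const ErrorStats& s)
{
    return s.peak <= 1 && s.pmse <= 0.06 && s.omse <= 0.02
        && s.pme <= 0.015 && s.ome <= 0.0015;
}
```

A candidate set of internal word lengths would then be accepted only if withinLimits holds for every input range that the test procedure prescribes.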

CSD.java

/* Convert conventional binary number to canonical sign-digit representation
   Algorithm: K. Hwang, Computer Arithmetic, Wiley, 1979, pp. 150
   Coding   : Pai, Cheng-Yu
   Note
     To compile, execute "javac SignDigit.java"
     To run, execute "java SignDigit xxxx",
     where xxxx is the number you wish to convert.
*/

public class CSD {

    public static byte[] toCSD(long l)
    {
        // System.out.println("Integer value = "+l);
        // System.out.println("Integer bits = "+Long.toBinaryString(l));

        // Construct bit array representation of the input
        byte[] b = ("0"+Long.toBinaryString(l)).getBytes();
        for (int i=0, j=b.length-1; i<=j; i++, j--)

        byte[] d = new byte[b.length];
        byte ci=0, ci_1;
        for (int i=0; i<b.length; i++, ci=ci_1)
        {
            if (i==b.length-1)
                ci_1 = (byte)((b[i]+ci>1)?1:0);
            else
                ci_1 = (byte)((b[i]+b[i+1]+ci>1)?1:0);
            d[d.length-i-1] = (byte)(b[i]+ci-2*ci_1);
        }
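The carry-based recoding used in toCSD can be restated as a short C++ sketch. It is an illustration rather than a translation of the thesis file: the function names are hypothetical, the digits are returned LSB first (the Java listing stores them MSB first), and the input is taken as an unsigned value. For each bit b[i] with incoming carry c[i], the next carry is 1 when b[i] + b[i+1] + c[i] exceeds 1, and the output digit is b[i] + c[i] - 2*c[i+1].

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Canonical signed-digit recoding of v; digits in {-1, 0, +1}, LSB first.
// Hypothetical helper written for illustration only.
std::vector<int> toCSD(unsigned long v)
{
    std::vector<int> b;                       // binary digits, LSB first
    for (; v != 0; v >>= 1) b.push_back((v & 1UL) ? 1 : 0);
    b.push_back(0);                           // leading zero, as "0"+bits in the listing

    std::vector<int> d(b.size(), 0);
    int carry = 0;
    for (std::size_t i = 0; i < b.size(); ++i) {
        const int next = (i + 1 < b.size()) ? b[i + 1] : 0;
        const int carryOut = (b[i] + next + carry > 1) ? 1 : 0;
        d[i] = b[i] + carry - 2 * carryOut;   // digit is -1, 0 or +1
        carry = carryOut;
    }
    return d;
}

// Value represented by an LSB-first signed-digit vector.
long csdValue(const std::vector<int>& d)
{
    long v = 0;
    for (std::size_t i = d.size(); i-- > 0;) v = 2 * v + d[i];
    return v;
}

// True if two neighbouring digits are both non-zero (never the case for CSD).
bool hasAdjacentNonZero(const std::vector<int>& d)
{
    for (std::size_t i = 1; i < d.size(); ++i)
        if (d[i] != 0 && d[i - 1] != 0) return true;
    return false;
}
```

Because no two neighbouring digits are non-zero, a constant recoded this way needs at most about half as many add or subtract operations as it has digit positions, which is what makes the representation attractive for the truncated constant multipliers in this appendix.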

. .. .. . -. FDCEjava - _ -

._. . . _ _ - - _ . - - -- 1 .-.. -, . . . . - ,

public c l a s s FDCT {

s c a t i c double s [ l [ = new double i51 [8 1 ; s r z c i c double tmp[] [! = new couble[8] [ a ] ; s t a t i c final ict m a p [ ] = { 0 1 4 , 2 1 6 , 7 , 3 , 5 1 1 ~ ;

s r a t i c void f

s [ s cage t l s [scage+1

f

static ~ i o i d xO, inc ? c l )

L o e f f l e r (acuble AI double BmixusA, docble AplusE, int ç ïsge , i n t

s t ac ic vo id Lrl (inï srage, i n t xO, i n t xl) I ?.

r r n a l Fnc n=l; f i n a l double k=Mach . sqrt (2 1 ; f i n a l double a=k*Math.cos(n*Math.P1/16),

b=K*Kath.sin (niMoth. 01 /16) , BminusA=b-a, Aplusa=a+S;

Lce f f l e r (a, BminusA, ApLusB, s tage, xO, xl) ; 1

stztiî vcid Lclfint sc2qer inc xO, i n t :cl) t

f i n a l i n t n=L; f i n o l double k=l; f i n a l double a=keMaxh. cos (n*Math. PI/l6) ,

8=ktP4ath. sin (nt3Iath. -6) , aminusA=b-s, AplusB=a+b;

L o e f f l e r (a, BminusA, AplcsB, stzge, xO, xl) ; t

stacic void Lc3 tint stzga, F n t xO, int XII I

f i z d i n t n=3; final double k = l ; f i n a l aomle a=kfk!ath. cos (n*Xath- I?I/I6l ,

b=k*Math. sin(n'L4ath. PI/I6), BninusF-=bal p.~lusY=a-n ;

Loef f l e z (a, Brn inusA, A p l u s B , stage, xO, xl) ; I

f o r ( i = O ; i<9; i ~ i ) I

/ / Input m a ~ p i n g f o r ( j = O ; j<9; j++) s [O] [j]=blockCil [ j l ;

/ / Stage 1: B u t t e r f l y for ( j = O ; j<l; j++l

B t i r t e r f l y (a, j,7-j) ;

/ / Scage 2 for ( j = O ; j<2; j t - 1

Eucterfly(l,j,3-j); Lc3 ( L , 4 , 7 ) ; L c L (1,5,6) ;

/ / Stage 3 But=erfly(2,0,1) ; Lrl (2,2,3) ; S x c t e r f l y (2,4,61 ; auczerfly(2,7,51 ;

/ / Stage 4 for ( j = O ; j<4; j + + ) s[41 [jj=s[3] [ j ] ; 3u~tsrfly (3,7,4) ; s[4! [5] = root2 ' s [ 3 ] [ S I ; s [ C l [ O ] = roo t2 * s [ 3 ] [61 ;

/ / Ourput mapping f o r ( j = O ; j<3; jt+)

unp[map[jI j [il = s[41 [ j l ; / *

Systern. out .prinïln i"1D S: " 1 ; f o r ( j = O ; j<5; j++) I

f o r (k=O; k<8; kc+) Syst~m.ouz.print(s[j] [klt", "1 ;

l System.our .pxintln ( ;

I - /

1 ~ / -

Sys=em.ouz .prlntln ( "FDCT ID: " 1 ; for (:=O; i<8; it+) 1

fer ( j = O ; j <8 ; j++: Sysrem.out.princ(tmptil [ j I + " , "1 ;

Syste,.n.out - p r i n t l n ( 1 ; 1 Systern-out -println f) ;

+/ f c r (i=O; i < B ; i++l f

/ / Izpuc rnapping for (j=O; f c 9 ; j + t l s[0: [ j ] = t x p [ i ] [ j ] ;

/ / Stage 4 for ( j = O ; j < l ; j++, s [41 [ j ] = s [ 3 ] [ j ] ; Bucterfly!3,7,41 ; sC41 [51 = r oo t2 * s r31 [51; s [ 4 ] [ 6 ] = roo t2 " s [ 3 ] i61;

/ / O ~ r p u t rnapping for (j=G; j<8 ; Siil ~lockli] [mapl j ] 1 = (shorr) Math. rounci (s 1 4 1 [ j ] ) ;

/ = Syste_m.oi?c . p r i n t I n ("23 S: " 1 ; f o r (j=O; j<5; j-i) (

f o r (k=.;li; k<3; k t t j System.out.prir,rs(s[j! [k]+", " ) ;

Syscem.out . pzintln! ; 1

= /

    static double s[][] = new double[5][8];
    static double tmp[][] = new double[8][8];
    static final int map[]={0,4,2,6,7,3,5,1};

    static void IButterfly(int stage, int x0, int x1)
    {
        s[stage+1][x0] = (s[stage][x0] + s[stage][x1])/2;
        s[stage+1][x1] = (s[stage][x0] - s[stage][x1])/2;
    }

    static void ILoeffler(double C, double DminusC, double DplusC,
                          int stage, int x0, int x1)
    {
        double tmp = C*(s[stage][x0]+s[stage][x1]);
        s[stage+1][x0] = DplusC  * s[stage][x0] - tmp;
        s[stage+1][x1] = DminusC * s[stage][x1] + tmp;
    }

    static void Ic1(int stage, int x0, int x1) {
        final int n = 1;
        final double k = 1;
        final double c = Math.sin(n*Math.PI/16)/k,
                     d = Math.cos(n*Math.PI/16)/k,
                     DminusC = d-c, DplusC = d+c;
        ILoeffler(c, DminusC, DplusC, stage, x0, x1);
    }

    static void Ic3(int stage, int x0, int x1) {
        final int n = 3;
        final double k = 1;
        final double c = Math.sin(n*Math.PI/16)/k,
                     d = Math.cos(n*Math.PI/16)/k,
                     DminusC = d-c, DplusC = d+c;
        ILoeffler(c, DminusC, DplusC, stage, x0, x1);
    }

    static void Ir1(int stage, int x0, int x1) {
        final int n = 1;
        final double k = Math.sqrt(2);
        final double c = Math.sin(n*Math.PI/16)/k,
                     d = Math.cos(n*Math.PI/16)/k,
                     DminusC = d-c, DplusC = d+c;
        ILoeffler(c, DminusC, DplusC, stage, x0, x1);

    public static void idct(short block[][]) {
        int i, j, k;
        final double invRoot2 = 1.0/Math.sqrt(2);

        /*
        System.out.println();
        System.out.println("Dimension 0: ");
        */
        for (i=0; i<8; i++) {
            // Input mapping
            for (j=0; j<8; j++)
                s[0][j] = block[i][map[j]];


            // Stage 3
            for (j=0; j<2; j++)
                IButterfly(2,j,3-j);
            Ic3(2,4,7);
            Ic1(2,5,6);

            // Stage 4
            for (j=0; j<4; j++)
                IButterfly(3,j,7-j);

public class IDCT_Trunc {

    static long s[][] = new long[5][8];
    static long tmp[][] = new long[8][8];
    static final int map[]={0,4,2,6,7,3,5,1};

p rec i {

Frit i; Long factor = ( (long} 1) <Cprec; Lonq CL, subL, surrL; CL = (long1 Math. r o u ~ d ( c " f a c t o r ) ;

~aCtOr) ; s u j L = ( long) Mach. round ( s-;bw =- s-mL= (Lcr-.g) Mazh. ronnd ( smtfzctor) ;

coplidx] [O] = CSD.toCSD!cL) ; COD [ idx] [l] = CSD . coCSO ( s u b L ) ;

p u b l i c staric void iniï-IDCT-Trunc ( i n r p r e c [ 1 1 t

f i n a i d o b l e k [ ] = { L , 1,Math.çqr t (21 1 ; final int 1 ~ [ ] = { 1 , 3 r l l ; double c, d, sub, sun;

I Syscern, o u t , p r i n t l n ("LnFtialize IDCT c o e f f i c i e n t s : " 1 ;

f i n a l cou8 ie F r = 1/Math.sqr= ( 2 ) ; long factor=! (long) 1) <cprec [ 3 I ; long r2L = (long) Nach. round ( i r C f a c t o r 1 ; Fnü-Zoor2 = CSD. toCSD (r25) ;

Sysren. o u t . p r i n t l n ("l/Sqrt (2) =11+r2L1 ; 1

/ * scatic long mult ( n y ~ e sdf 1, long v a l , int t r u n c l f

l o n g resclr=O;

    static long mult(byte sd[], long val, int trunc) {

        long result=0;
        long pp;
        if (val==0) return 0;
        nMul++;

        for (int i=0; i<sd.length; i++) {
            if (sd[i]==0) continue;
            if (sd[i]==1) {
                if (i<trunc) pp=val>>(trunc-i);
                else pp=val<<(i-trunc);
            }
            else    // sd[i]==-1

        return result;

I l 1 stacFc n i d IButterfljr(Fnt stage, int xC, ict xl) 1

( s[stage+ll [x0] = is Cstagei Lx01 - s[stage] [xl] /2; s[stzge+i] [xll = (~[sragel ixO1 - s[stage] [xl]) /2; n4dd+=2 ;

1 scatic void Xutterfly2(irii szaqe, inc xO, int xl)

    static void ILoeffler(byte C[], byte DminusC[], byte DplusC[],
                          int stage, int x0, int x1, int trunc)

        long tmp = mult(C, s[stage][x0]+s[stage][x1], trunc);
        s[stage+1][x0] = mult(DplusC,  s[stage][x0], trunc) - tmp;
        s[stage+1][x1] = mult(DminusC, s[stage][x1], trunc) + tmp;

    static void Ic1(int stage, int x0, int x1, int trunc) {
        ILoeffler(cOp[0][0],cOp[0][1],cOp[0][2],stage,x0,x1,trunc);
    }

    static void Ic3(int stage, int x0, int x1, int trunc) {
        ILoeffler(cOp[1][0],cOp[1][1],cOp[1][2],stage,x0,x1,trunc);
    }

    static void Ir1(int stage, int x0, int x1, int trunc) {
        ILoeffler(cOp[2][0],cOp[2][1],cOp[2][2],stage,x0,x1,trunc);
    }

    static void adjustOffset(long stage[], int offset[]) {
        for (int i=0; i<8; i++) {
            if (offset[i]<0) stage[i]>>=(-offset[i]);
            else if (offset[i]>0) stage[i]<<=offset[i];
        }
    }

    static long calcR(int r) {
        if (r<2) return 0;      // Do nothing
        int i;
        long offset;
        for (i=1, offset=1; i<r-1; i++)
            offset=(offset<<1) | 1;
        // offset = ((long)1)<<(r-2);
        return offset;
    }

public static void F d c t T r u n c (short block[] [ J , int crunc [ l [ i , F n t - - orrsez[J [ I [ j ,inr r o u d t ?

f I n t i, j , k; Lcng rO, rl;

r O = calcR (round [O 1 ) ; r l = calcR (round [l! ) ;

/ * Systen.out .priztln ( 1 ; Sysrs-rn-out . p r i n c h tl'Diner?sicn O: " 1 ;

* / for (i=Q; i < a ; ii-) f

/ / I n p u t m.pging for [ j = O ; j<8; j ~ - )

s [O] [ 5 ] = ~lock[i] [map[j]];

/ / Stage I LSu tce rE l y2 (û, 0,l) ; I rl ( 0 , 2 , 3 , crunc[U] [21) ; i B u c c e r f l y 2 ( 0 , 7 , 4 1 ; s [Il E S ] = rnult (invRoot2, s [O] [ 5 j , crux 101 [3 ! ) ; s[l] [SI = mclt(invRoot2,s [O] (61, trunc[Ol [ 3 l ;

/ / Stage 3 for ( j = O ; j < 4 ; ïi+i / / Rounding

s [ 3 1 [ j l = s12J Ljl i (r0 << (-offset[O] [ 4 ] [jl-round[O]) ) ; .*ddi=4 ;

Ic3 ( 2 , 4 , 7 , c r u n c [ O j [tl) ; T c 1 (2,5,5, ~runc[O] [On ;

a d j u s t O f f s e t ( s [ 3 ] , o f f s e t [O] [ 3 ] ) ;

/ / S t a g e 4 f o r ( j = O ; j c 4 ; j++)

IButterfly2{3, j,7-j};

/ / Stage I ISc tcer f ly2 (O, O, lj ; Tr: - - - (0,2,3, trunc111 12i ; ISuizerfly2(0,7,4) ; s [ l ; [SI = mult(invRocC2,s [ O ] [ S I , tzunc[L! [311; s i l j [ 6 j = mult(inv2cot2,s [O] [ o ' ] , trunc[l] 131) ;

a c j u s t O f f s e t ( s [LI, of f sez [ I l [II 1 ; I

/ / Stage 3 for ( j = O ; j < 4 ; j - i )

s131 [ j ] = sC21 [j] i (rl <C t-offsec[ll t 4 l [jl-round[l]) ) ; -cd+=4 ;

Ic3 (2,4,7,~runc[l] [Il ) ; Ic1(2,5,6, trunc[I] [O] ) ;

/ / Scage 4 f o r ( j = O ; j < 4 ; j t c )

I B u t t o r f ly2 (3, j ,7- j ) ;

static long e[][] = new long[8][8],
            pmse[][] = new long[8][8],
            pme[][] = new long[8][8],
            ome, omse;

static boolean checkError(short xCal[][], short xRef[][])
{
  int i, j, err, e2;
  long eAbs, pmeAbs, omeAbs;

    {
      e[i][j] = err = xCal[i][j] - xRef[i][j];
      e2 = err*err;
      pmse[i][j] += e2;
      pme[i][j] += err;
      ome += err;
      omse += e2;

      SURS += err;
      System.out.print(err+" ");

      eAbs = (err<0 ? -err : err);
      pmeAbs = (pme[i][j]<0 ? -pme[i][j] : pme[i][j]);

        pme ="+pme[i][j]
        ome ="+ome
        omse="+omse

  return false;
}

static void transformBlock(short b1[][], short b2[][], int trunc[][],
                           int offset[][][], int round[])
{
  FDCT.fdct(b1); clipFDCT(b1);

  IDCT.idct(b1); clipIDCT(b1);

  IDCT_Trunc.idctTrunc(b2, trunc, offset, round); clipIDCT(b2);

static boolean checkLH(long L, long H, boolean negatePixel,
                       int trunc[][], int offset[][][], int round[])
{
  short b[][] = new short[8][8],
        b1[][] = new short[8][8],
        b2[][] = new short[8][8];
  int i, j, k;

  // Initialize stat variables
  for (i=0; i<8; i++)
    for (j=0; j<8; j++)
    { e[i][j]=pmse[i][j]=pme[i][j]=0; }
  ome=omse=0;

  IDCT_Trunc.idctTrunc(b1, trunc, offset, round);
  if (!checkZero(b1)) return false;

  IEEE_Random.init(L, H);

    for (j=0; j<8; j++) // Generate random pixel data
      for (k=0; k<8; k++)

    if (!checkError(b1, b2)) {
      System.out.println("Block="+i+", L="+L+", H="+H+
                         ", negate="+negatePixel);
      return false;
    }
  }

  long max;
  int percent;
  System.out.print("PASSED: pme(max)=");
  for (i=0,max=0; i<8; i++)
    for (j=0; j<8; j++)
      if (pme[i][j]>max) max=pme[i][j];

  percent = (int)((max*100.0)/pmeMAX);
  System.out.print((max/10000.0)+" ("+percent+"%), pmse(max)=");

  for (i=0,max=0; i<8; i++)
    for (j=0; j<8; j++)
      if (pmse[i][j]>max) max=pmse[i][j];

  percent = (int)((max*100.0)/pmseMAX);
  System.out.print((max/10000.0)+" ("+percent+"%), ");
  percent = (int)((ome*100.0)/omeMAX);
  System.out.print((ome/(64*10000.0))+" ("+percent+"%), ");

static int getInt(StreamTokenizer s) throws Exception
{
  // Skip tokens until the next number, then return it as an int
  for (int token=s.nextToken(); token!=s.TT_NUMBER; token=s.nextToken());
  return (int)s.nval;
}

static void initSetup(int trunc[][], int offset[][][], int round[])
  throws Exception
{
  int d, i, j;

  IDCT_Trunc.init_IDCT_Trunc(prec);

  {
    for (i=0; i<5; i++)
      for (j=0; j<4; j++)
        offset[d][i][j] = getInt(setup);

    round[d] = getInt(setup);
  }

public static void main(String args[]) throws Exception

  initSetup(trunc, offset, round);

  for (i=0; i<2; i++, negate=!negate)
    for (j=0; j<3; j++)
      if (!checkLH(L[j], H[j], negate, trunc, offset, round)) return;

  System.out.println("All test passed!");
}

public static void init(long l, long h)

  randx = 1;
  z = Double.longBitsToDouble(0x7fffffff);
  s=n;

public static long rand()
{
  long i, j;
  double x;

public static void main(String args[])
{
  long l, h, n;
  n = Long.parseLong(args[0]);
  l = Long.parseLong(args[1]);
  h = Long.parseLong(args[2]);
