A New Number Representation for Hardware Implementation of
DSP Algorithms
Chun Te Ewe
A thesis submitted for the degree of
Doctor of Philosophy of the University of London
and for the Diploma of Membership of the
Imperial College
Department of Electrical and Electronic Engineering
Imperial College of Science, Technology and Medicine
University of London
October 2008
Abstract
Dual FiXed-point (DFX) is a new number representation for digital hardware.
By providing a single exponent bit to select between two fixed-point scalings,
DFX strikes a compromise between conventional fixed-point and floating-point
representations. It has an implementation complexity similar to that of a
fixed-point system together with the improved dynamic range offered by a
floating-point system. The majority of DSP applications do not need the full
dynamic range provided by a floating-point system, but the cost of fixed-point
increases greatly when wide wordlengths are needed.
This thesis presents the definition of DFX as a new number representation,
and its characteristics are compared with those of other common number
representations. Modular designs for its arithmetic operations, supporting
multiple wordlengths and scalings, are presented and made to work alongside
fixed-point. Any finite precision number representation introduces quantisation
errors. Hence a mixed simulation and static analysis technique is presented to
analyse the errors when DFX modules are used.
Utilising a readily available high-level synthesis tool to optimise for area,
the modules created and the error analysis are used in a two-phase simulated
annealing optimisation algorithm. DFX has been shown to be most suitable
when a design can tolerate a large amount of noise and requires a wide dynamic
range of representable numbers.
Acknowledgements
I would like to express a great amount of gratitude to my supervisor, Prof.
Peter Y.K. Cheung and also my co-supervisor, Dr. George A. Constantinides,
for their infinite encouragement, support and understanding throughout my
research.
Altera Corporation and the Overseas Research Students Award Scheme, UK,
have provided financial support for this research.
Thanks go to all my friends and colleagues in the Department of Electrical
and Electronic Engineering at Imperial College London for the wonderful atmosphere
and all their kind support. Doing lab and computing demonstrations gave
some variety to my life at the department.
Special mention to the last crew to man the original Linstead Hall: there
was never a dull moment, and it paved the way for me to meet my special
someone.
That someone is my girlfriend Audrey Ng. She has been with me through
thick and thin while tirelessly motivating me throughout the crucial years.
Having her around reminds me that there is more to life than study. I owe a
great debt of gratitude to her.
Finally, I would like to say a big thanks to both my parents, Ong Kim Kee
and Ewe Poh Teong. They have given their unconditional love and support,
knowing that doing so meant enduring my absence throughout the time
of my studies, during which we could have been geographically closer. This
thesis is dedicated to them.
Contents
Contents 3
List of Figures 8
List of Tables 12
1 Introduction 15
1.1 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2 Background 20
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Question of Software vs Hardware . . . . . . . . . . . . . . . . . 20
2.3 Finite Wordlength Effects . . . . . . . . . . . . . . . . . . . . . 22
2.4 Number Representations . . . . . . . . . . . . . . . . . . . . . . 24
2.4.1 Fixed-point . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.2 Floating-point . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4.3 Logarithmic Number System . . . . . . . . . . . . . . . . 32
2.4.4 Block Floating-Point . . . . . . . . . . . . . . . . . . . . 35
2.4.5 Residue Number System . . . . . . . . . . . . . . . . . . 36
2.4.6 Signed-Digit Number System . . . . . . . . . . . . . . . 38
2.4.7 Rational Arithmetic . . . . . . . . . . . . . . . . . . . . 39
2.4.8 Level-Index . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.9 Comparison and Summary . . . . . . . . . . . . . . . . . 41
2.5 Wordlength and Scaling Optimisation . . . . . . . . . . . . . . . 45
2.5.1 SKK Methodology . . . . . . . . . . . . . . . . . . . . . 46
2.5.2 Interval and Affine analysis . . . . . . . . . . . . . . . . 48
2.5.3 BitSize Tool . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.5.4 Synoptix and RightSize tool . . . . . . . . . . . . . . . . 50
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3 Exponent Recoding & Dual FiXed-point 55
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.2 Exponent Recoding . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.1 Special Cases . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3 Dual FiXed-point . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3.1 Defining the Format . . . . . . . . . . . . . . . . . . . . 61
3.3.2 Characteristics of DFX Number System . . . . . . . . . 65
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4 DFX Modules and Arithmetic Circuits 71
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Algorithm Representation in RightSize . . . . . . . . . . . . . . 73
4.3 Module Design Forethought and Criteria . . . . . . . . . . . . . 75
4.4 Building Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.4.1 Range-Detector . . . . . . . . . . . . . . . . . . . . . . . 77
4.4.2 DFX Encoder and Decoder . . . . . . . . . . . . . . . . 78
4.4.3 DFX Recoder . . . . . . . . . . . . . . . . . . . . . . . . 80
4.4.4 Area and Critical Path Delay Tables . . . . . . . . . . . 84
4.5 DFX Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.5.1 DFX Adder Version I (V1) . . . . . . . . . . . . . . . . . 86
4.5.2 DFX Adder Version II (V2) . . . . . . . . . . . . . . . . 87
4.5.3 Area and Critical Path Delay Tables . . . . . . . . . . . 91
4.6 DFX Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.6.1 DFX Gain Multiplier . . . . . . . . . . . . . . . . . . . . 93
4.6.2 DFX Full Multiplier . . . . . . . . . . . . . . . . . . . . 95
4.6.3 Area and Critical Path Delay Tables . . . . . . . . . . . 97
4.7 Discussion and Further Comparisons . . . . . . . . . . . . . . . 99
4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5 Modelling Noise at the Outputs of a DFX System 106
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.3 DFX Modules Noise Models . . . . . . . . . . . . . . . . . . . . 112
5.3.1 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.3.2 Recoder . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.3.3 Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3.4 Gain Multiplier . . . . . . . . . . . . . . . . . . . . . . . 116
5.3.5 Full Multiplier . . . . . . . . . . . . . . . . . . . . . . . . 119
5.3.6 Error Model Evaluation and Discussion . . . . . . . . . . 120
5.4 Correlated Errors . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.4.1 Estimating the Correlation for the Errors Sources . . . . 125
5.4.2 Rounding Benefits . . . . . . . . . . . . . . . . . . . . . 128
5.5 Profiling: Simulation and Tables . . . . . . . . . . . . . . . . . . 131
5.5.1 Profiling Simulation . . . . . . . . . . . . . . . . . . . . . 131
5.5.2 1-D Profile Table . . . . . . . . . . . . . . . . . . . . . . 132
5.5.3 2-D Profile Table . . . . . . . . . . . . . . . . . . . . . . 135
5.6 Output Noise Estimation . . . . . . . . . . . . . . . . . . . . . . 137
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6 Approach to DFX Parameter Optimisation 140
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.2 RightSize Prerequisites . . . . . . . . . . . . . . . . . . . . . . . 142
6.2.1 User Specified Design Constraints . . . . . . . . . . . . . 142
6.2.2 Representative Floating-point Input . . . . . . . . . . . . 143
6.3 DFX Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.3.1 Applying Synthesis Restrictions on p0 Parameters . . . . 144
6.3.2 Propagating and Conditioning p1 and n Parameters . . . 146
6.3.3 Iterative Algorithm . . . . . . . . . . . . . . . . . . . . . 148
6.4 Area Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.4.1 Evaluation of Area Models . . . . . . . . . . . . . . . . . 153
6.5 Determining the Upper Scaling p1 Parameter . . . . . . . . . . . 153
6.5.1 Simulated Peak Values . . . . . . . . . . . . . . . . . . . 154
6.5.2 Chebyshev’s Inequality . . . . . . . . . . . . . . . . . . . 155
6.6 The Optimisation Problem, Formulated . . . . . . . . . . . . . . 156
6.7 Exploring the Feasibility of DFX . . . . . . . . . . . . . . . . . 157
6.8 Meta-Heuristic Approach to Optimisation Using Simulated An-
nealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.8.1 Background to simulated annealing . . . . . . . . . . . . 163
6.8.2 Optimisation Algorithm . . . . . . . . . . . . . . . . . . 166
6.9 Case Study and Discussion . . . . . . . . . . . . . . . . . . . . . 170
6.9.1 IIR Filter . . . . . . . . . . . . . . . . . . . . . . . . . . 171
6.9.2 Adaptive LMS Filter . . . . . . . . . . . . . . . . . . . . 174
6.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
7 Conclusions & Future Work 181
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Glossary 186
Bibliography 191
List of Figures
2.1 Fixed-point number format 〈n, p〉. . . . . . . . . . . . . . . . . . 26
2.2 Range and precision of two’s complement fixed-point number. . 28
2.3 IEEE 754 single precision floating-point number format. . . . . . 29
2.4 Floating-point adder/subtractor. Figure taken from [Kor02]. . . 31
2.5 IEEE 754 single precision floating-point number format. . . . . . 33
2.6 Adder/Subtractor for logarithm numbers. A fixed-point (FX)
adder is used to perform addition and the ROM contains the
look-up table for either Ψ+ or Ψ− of Equation (2.8). . . . . . . . 35
2.7 Example of Rational Representation format. . . . . . . . . . . . 39
2.8 Level Index/Symmetric Level Index format. . . . . . . . . . . . 40
2.9 An area vs Critical path delay graph for Table 2.3. . . . . . . . 43
2.10 Design flow for RightSize tool (shaded portions are vendor spe-
cific). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.1 Example of number system with Exponent Recoding. . . . . . . 56
3.2 Example of quad fixed-point. . . . . . . . . . . . . . . . . . . . . 60
3.3 Dual FiXed-point (DFX) number: (a) Number format, (b) De-
tailed structure of DFX number format. The symbols • and ◦
represent the binary positions for Num0 and Num1 numbers
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.4 Fully and non-fully represented DFX number. . . . . . . . . . . 64
3.5 Num0 and Num1 range in a DFX Number. . . . . . . . . . . . 66
3.6 Precision of number representations in significant bits as a func-
tion of absolute number value (in dB). The number representa-
tions shown are fixed-point 〈15, 0〉, floating-point E4:M11, and
DFX 〈14,−5, 0〉. . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.7 An area vs critical path delay graph for Table 3.3. . . . . . . . . 70
4.1 The graphical representation of a data-flow graph. . . . . . . . . 75
4.2 DFX Range-Detector Module. . . . . . . . . . . . . . . . . . . . 77
4.3 Input bits the Range-Detector is interested in. . . . . . . . . . . 78
4.4 DFX Encoder block to convert from fixed-point to DFX. Input
is a fixed-point with wordlength nin and binary point pin and
output is a DFX 〈n, p0, p1〉. . . . . . . . . . . . . . . . . . . . . 79
4.5 DFX Decoder block to convert from DFX to fixed-point. Input
is a DFX 〈n, p0, p1〉 and output is a fixed-point with wordlength
(n + (p1 − p0)) and binary point p1. . . . . . . . . . . . . . . . . 80
4.6 DFX Recoder module with the flow of data through the module. 81
4.7 DFX Recoder blocks to convert between two different properly
scaled DFX numbers. The flow of data through the recoder is
shown beneath each block. . . . . . . . . . . . . . . . . . . . . . 82
4.8 DFX Recoder used with fork and delay. . . . . . . . . . . . . 84
4.9 DFX Adder Version I (v1). . . . . . . . . . . . . . . . . . . . . . 86
4.10 DFX Adder Version II (v2). . . . . . . . . . . . . . . . . . . . . 88
4.11 DFX Adder (v2) (a)Pre-Adder and (b)Post-Adder diagram. . . 89
4.12 DFX Gain Multiplier. . . . . . . . . . . . . . . . . . . . . . . . . 93
4.13 DFX Full Multiplier. . . . . . . . . . . . . . . . . . . . . . . . . 95
4.14 Module comparisons with similar dynamic range implemented
in ASIC. The parameters used for each number representation are
shown in Table 4.7. . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.15 Area of DFX modules implemented in ASIC with p1 = 8. . . . . 102
4.16 Sizes of DFX arithmetic modules relative to their fixed-point
equivalent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.1 Noise error of each module modelled as an error injection at the
output. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.2 LSB-side scaling definition. . . . . . . . . . . . . . . . . . . . . . 110
5.3 Probability density function (PDF) of DFX Encoder input signal. . 113
5.4 PDF of the DFX Recoder input and the probabilities of trun-
cation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.5 DFX Adder joint probability distribution table. . . . . . . . . . 117
5.6 PDF of the DFX Gain Multiplier input and the probabilities of
truncation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.7 DFX Full Multiplier joint probability distribution table. (a)
shows the input ranges (Input A : Input B) and (b)-(d) show
the probability of truncations for different input and output
boundary cases. . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.8 Transpose FIR Direct Form type I filter implemented with DFX
modules. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.9 The PDF of DFX signal errors. . . . . . . . . . . . . . . . . . . 125
5.10 Joint probability distribution of the errors. (a)-(d) show the
breakdown for each error combination “x : y” and (e) shows the
complete joint distribution diagram. . . . . . . . . . . . . . . . . 127
5.11 The PDF of DFX signal errors when rounding is used. . . . . . 129
5.12 Joint probability distribution of the errors when rounding is
used. (a)-(d) show the breakdown for each error combination
“x : y” and (e) shows the complete joint distribution diagram. . 129
5.13 Boundary bins. . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.14 An example profile table for a DFX Recoder. . . . . . . . . . . . 133
5.15 The estimated vs simulated SNR for 500 filters. . . . . . . . . . 139
6.1 Linked unit delays. . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.2 Examples of DFX Gain Multiplier output formatting with the
binary points aligned. The shaded bits can be omitted without
introducing errors. . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.3 Estimating the area for a DFX Decoder. . . . . . . . . . . . . . 150
6.4 Estimating the area for a fixed-point Adder. . . . . . . . . . . . 151
6.5 Boundary of wordlength-multiplier using Equation (6.21) with
varying p1 scaling and signal variance. . . . . . . . . . . . . . . 161
6.6 Boundary of wordlength-multiplier with varying maximum prob-
ability of signal overflow, λ, and SNR. . . . . . . . . . . . . . . . 162
6.7 Flow of the DFX optimisation with Simulated Annealing. The
detailed flow diagrams of the Algorithms 6.5 and 6.6 are not
shown and are denoted by broken arrow lines. Refer to the re-
spective Algorithms in page 168. . . . . . . . . . . . . . . . . . . 169
6.8 The optimisation times and the area with respect to the level of
optimisation for a 4th order IIR filter on ASIC. . . . . . . . . . 170
6.9 Case study 4th order IIR filter. . . . . . . . . . . . . . . . . . . 172
6.10 Area of ASIC 4th order IIR filter with lower variance input. . . 173
6.11 Area ratio of ASIC 4th order IIR filter optimised with the pro-
posed two-phase ASA optimisation (ASA2) and fixed-point only
optimisation (FX). . . . . . . . . . . . . . . . . . . . . . . . . . 175
6.12 Case study 1st order LMS filter. . . . . . . . . . . . . . . . . . . 176
6.13 Area of ASIC 2-tap adaptive LMS FIR filter. . . . . . . . . . . . 177
6.14 Area Ratio of ASIC 2-tap adaptive LMS FIR filter optimised
with the proposed two-phase ASA optimisation (ASA2) and
fixed-point only optimisation (FX). . . . . . . . . . . . . . . . . 178
List of Tables
2.1 IEEE 754 floating-point special values. . . . . . . . . . . . . . . 30
2.2 Area and critical path delay for 16-bit arithmetic units taken
from Tables 4.4 to 4.6. These arithmetic units were imple-
mented on an ASIC platform. . . . . . . . . . . . . . . . . . . . 42
2.3 Area and critical path delay (CPD) results for 4-tap FIR filter
with increasing wordlength implemented using UMC 0.13um
High Density Standard Cell Library. . . . . . . . . . . . . . . . 43
2.4 Dynamic range comparisons of 32-bit numbers representations. . 44
3.1 Dynamic range comparison between DFX, fixed-point (FX),
floating-point (FP) and logarithmic number system (LNS) for
32-bit and 16-bit format. . . . . . . . . . . . . . . . . . . . . . . 65
3.2 Comparing the precision errors between DFX and fixed-point.
The DFX parameters are chosen to match a fixed-point 〈15, 0〉
dynamic range of ≈ 90dB and upper limit of 20. . . . . . . . . . 68
3.3 Area and critical path delay result for 4 tap FIR filter with
increasing wordlength implemented using UMC 0.13um High
Density Standard Cell Library. This is an extension of results
in Table 2.3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.1 In and out degrees of nodes used in computational graph G(V, S). 74
4.2 Building block areas and critical path delays (CPD) table. . . . 84
4.3 Scaling of the inputs before the fixed-point adder. . . . . . . . . 90
4.4 Area and critical path delay tables for 16-bit adder comparisons. 91
4.5 Area and critical path delay (CPD) tables for gain multiplier
comparisons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.6 Area and critical path delay (CPD) tables for full multiplier
comparisons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.7 The parameters used to generate results for Figure 4.14. The
dynamic range (DR) is represented in dB. . . . . . . . . . . . . 101
4.8 Comparing fixed-point and DFX arithmetic module implemen-
tations on ASIC (# of cells) with dynamic range fixed at ∼90dB.
Fixed-point is the first result line, where p1 = p0 = 0. . . . . . . 103
5.1 DFX Encoder truncations where the input is a fixed-point 〈nin, pin〉
and the output a DFX 〈n, p0, p1〉 (refer to Fig. 4.4 for the block diagram). . 112
5.2 DFX Recoder where the input is DFX 〈nin, pin0, pin1〉 and out-
put DFX 〈nout, pout0, pout1〉. . . . . . . . . . . . . . . . . . . . . . 114
5.3 DFX Adder inputs (A and B) and output (S) combinations
with their respective output truncations. . . . . . . . . . . . . . 116
5.4 DFX Gain Multiplier input (A) and output (Q) combinations
and their respective output truncations. . . . . . . . . . . . . . . 118
5.5 DFX Full Multiplier inputs (A and B) and output (Q) combi-
nations and their respective output truncations. . . . . . . . . . 119
5.6 Evaluation of error models for truncation scheme with DFX
format 〈14,−5, 2〉 and 〈14,−3, 5〉. . . . . . . . . . . . . . . . . . 121
5.7 Evaluation of error models for rounding scheme with DFX for-
mat 〈14,−5, 2〉 and 〈14,−3, 5〉. . . . . . . . . . . . . . . . . . . 121
5.8 The correlation coefficients of the error sources for the FIR filter
in Figure 5.8 when truncation is used. . . . . . . . . . . . . . . . 124
5.9 The correlation coefficients of the error sources for the FIR filter
in Figure 5.8 when rounding is used. . . . . . . . . . . . . . . . 124
5.10 Combinations of signals X and Y ranges and their error prob-
abilities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.11 Estimate of the correlation coefficients of the error sources for
the FIR filter in Figure 5.8 when truncation is used. . . . . . . . 128
5.12 Comparison between actual and estimated SNR for 159-tap FIR
filter with DFX parameter of increasing wordlength. . . . . . . . 138
6.1 Propagation rules for DFX conditioning. . . . . . . . . . . . . . 146
6.2 Comparison between actual area of DFX modules with the es-
timation by the area models on a Virtex 4 FPGA. . . . . . . . . 154
Chapter 1
Introduction
1.1 Objectives
The pace of integrating applications onto a chip is driven by the continual de-
mand to achieve lower cost while reducing system size and power consumption.
The availability of high density integrated circuits now enables the design and
implementation of sophisticated arithmetic processors employing algorithms
that were considered prohibitively complex in the past.
Most digital signal processing (DSP) algorithms are implemented with the
IEEE 754 floating-point number representation for its wide dynamic range
of representable numbers. Furthermore, a direct floating-point
implementation onto hardware offers the advantage of consistency between the
software and hardware implementations without introducing extra rounding or
truncation errors.
On the other hand, digital very-large-scale integration (VLSI) implemen-
tations of these applications rely on fixed-point approximations with finite
precision which have the advantage of using reduced hardware cost and power
consumption while increasing throughput [IO96]. This is because, generally
speaking, fixed-point DSP devices are far less complex, having fewer gates and
transistors, than an equivalent floating-point system. Although Intel led the
formulation of the IEEE standard for floating-point, it too recognises the
benefits of fixed-point for 3D graphics applications in commercial embedded
devices [Kol04]. Implementations of arithmetic circuits in hardware are often
based on mapping the desired function to an Application Specific Integrated
Circuit (ASIC) or to a Field-Programmable Gate Array (FPGA).
Currently, when a fixed-point DSP implementation meets the needs of an
application, it is usually a better option than floating-point. However,
when the application needs a large dynamic range, an implementation using
only fixed-point may suffer due to the wide wordlengths needed, and the other
familiar option available is floating-point. Fixed-point and floating-point are
both popular but distinct number representations, each with its own strengths
and weaknesses.
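The dynamic-range trade-off above can be made concrete with a back-of-the-envelope sketch. This is illustrative only: it defines dynamic range as 20 log10 of the ratio between the largest and smallest non-zero representable magnitudes, and ignores floating-point denormals.

```python
import math

def fx_dynamic_range_db(n):
    # Ratio of largest to smallest non-zero magnitude for an
    # n-bit two's-complement fixed-point word is about 2**(n-1).
    return 20 * math.log10(2 ** (n - 1))

def fp_dynamic_range_db(exp_bits):
    # Rough figure for a binary floating-point format: with the
    # all-zero and all-one exponent codes reserved, the usable
    # exponent span is 2**exp_bits - 2, so the magnitude ratio
    # is about 2**(2**exp_bits - 2).
    usable_span = 2 ** exp_bits - 2
    return 20 * usable_span * math.log10(2)

print(f"16-bit fixed-point: ~{fx_dynamic_range_db(16):.0f} dB")
print(f"IEEE 754 single (8-bit exponent): ~{fp_dynamic_range_db(8):.0f} dB")
```

Each extra fixed-point bit buys only about 6 dB of range, which is why applications needing a wide dynamic range push fixed-point wordlengths, and hence hardware cost, up quickly.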
The objectives for this research are to:
1. Define a new number representation that bridges between fixed-point
and floating-point number representations. The new number represen-
tation should have the dynamic range capability and the hardware im-
plementation complexity between fixed-point and floating-point number
representations.
2. Design the hardware arithmetic modules needed to realise the new num-
ber representation. These modules should have fully parameterisable
wordlengths and flexible scalings to allow exploration of efficient design
implementations.
3. Understand the noise introduced by the arithmetic circuits designed for
the new number representation. As with every number representation in
the digital domain, the finite wordlength effects of the new number represen-
tation need to be understood. This will enable us to use it to its full
potential.
4. Discover the conditions and design requirements that will benefit from
using the new number representation.
5. Find the parameters of the arithmetic modules to obtain area-optimised
designs that meet a desired user constraint. This process is preferably
automated.
1.2 Overview
The thesis starts in Chapter 2 with a review of the number representations
available to a DSP hardware designer. Also reviewed is the previous and
recent work in the field of wordlength and scaling optimisation.
Chapter 3 then introduces the concept of Exponent Recoding and Dual
FiXed-point. Exponent Recoding is a concept that generalises the common
number representations. From it, the new number representation called Dual
FiXed-point (DFX) is defined and its characteristics discussed and compared
with the common number representations.
Hardware implementations of the arithmetic modules to use DFX, written in
VHDL, are described in Chapter 4. These modules are capable of multiple
wordlengths and scalings and work alongside fixed-point. All the modules are
synthesized onto FPGA and ASIC platforms to compare with implementations
using common number representations.
Error analysis of the DFX arithmetic modules is discussed in Chapter 5. A
hybrid (simulation mixed with static) analytical technique was developed to
estimate the errors introduced. Because the DFX quantisation depends
on the signal’s magnitude, any correlation in the signal will be reflected in the
errors introduced through quantisation; this is dealt with in this chapter.
Chapter 6 shows how DFX can be incorporated into a high-level synthesis tool
called RightSize. It optimises DSP designs to meet user-specified constraints
on output SNR and signal overflow probability to give area-optimised designs.
The two-phase simulated annealing algorithm optimises a design to have DFX
and fixed-point working alongside each other. Comparison is made with fully
fixed-point designs on FPGA and ASIC platforms.
Chapter 7 concludes the thesis and suggests some future work.
1.3 Contributions
The original contributions in this thesis are:
1. Exponent Recoding as a concept that applies a mapping function to
the exponent field of a number. Conventional number representations
are shown to be special cases of an exponent recoding concept. A new
number representation, Dual FiXed-point (DFX), was defined as a special
case of Exponent Recoding by using only a single bit for the exponent.
The characteristics of DFX are compared with other conventional number
representations.
2. The design of hardware modules to perform multiple wordlength and
scaling DFX arithmetic operations. Various steps were taken to minimise
the loss of precision. As these modules are described in a synthesizable
hardware description language (in this case, VHDL), they are readily
synthesizable onto any platform and their input and output ports are
completely parameterisable. Comparisons are made with equivalent im-
plementations of other common number representations on FPGA and
ASIC platforms.
3. Analysis of the errors introduced by using DFX. Due to the existence
of correlation between errors, a novel profiling simulation is employed to
extract the probability distribution function of the signals in a design.
This mixed simulation and static analysis approach quickly estimates
the errors after a single-pass simulation.
4. The error analysis takes into account any correlation amongst errors in-
troduced when using DFX. It is also shown that the correlation amongst
the errors only exists when signal truncation is used but there is no
correlation when rounding is used.
5. The inclusion of DFX into the RightSize high-level synthesis tool to gen-
erate area-optimised designs under user-specified constraints. The two-
phase simulated annealing algorithm optimises a design to have DFX and
fixed-point working alongside each other.
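The single exponent bit at the heart of contribution 1 can be illustrated with a small decoding sketch. The bit layout and the fraction-based value convention used here are assumptions made purely for illustration; the actual DFX format 〈n, p0, p1〉 is defined in Chapter 3.

```python
def dfx_decode(word, n, p0, p1):
    """Decode an n-bit DFX-style word (hypothetical convention).

    Assumed layout: the MSB is the exponent bit E; the remaining
    n-1 bits are a two's-complement mantissa read as a fraction
    in [-1, 1).  E = 0 selects the fine scaling 2**p0 (Num0),
    E = 1 the coarse scaling 2**p1 (Num1).
    """
    e = (word >> (n - 1)) & 1
    m = word & ((1 << (n - 1)) - 1)
    if m >= 1 << (n - 2):              # fold in the mantissa sign bit
        m -= 1 << (n - 1)
    frac = m / (1 << (n - 2))          # two's-complement fraction
    return frac * (2.0 ** (p1 if e else p0))

# The same mantissa pattern denotes a small, finely quantised value
# under Num0 scaling, or a much larger value under Num1 scaling:
n, p0, p1 = 9, -4, 4
print(dfx_decode(0b0_01000000, n, p0, p1))   # E = 0: fine scale
print(dfx_decode(0b1_01000000, n, p0, p1))   # E = 1: coarse scale
```

Flipping the single exponent bit reuses the same mantissa bits at a coarser scaling, which is how one bit of overhead buys the extended dynamic range.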
Chapter 2
Background
2.1 Introduction
To provide an overview and background for this thesis, the various design
choices open to a DSP designer are reviewed in this chapter. The main focus
here is twofold. Firstly, the various possible number representations and
their suitability for computation in the context of real-time DSP algorithm
implementation will be examined and compared. Then the recent research
in the area of wordlength and scaling optimisation for these various number
representations is appraised.
2.2 Question of Software vs Hardware
The main drivers in determining the platform for a DSP algorithm are
unit cost, time-to-market, or both. For projects that are time-critical, design-
ers may choose specialised DSP microprocessors for their easy programmability
and any bug-fixes or upgrades can easily be supported. Due to rapid technol-
ogy advancement, using processors as a platform for DSP makes good business
sense for small-scale production. However, the inherently serial nature of these
processors means that they are inefficient at processing algorithms which have
a large degree of parallelism, resulting in slow execution speed and increased
power consumption. Even if the improvements of DSP microprocessors con-
tinue to follow Moore’s Law so that their density doubles every 18 months,
they may still be unable to keep up with the requirements of some of the more
aggressive DSP algorithms.
Customised circuitry for applications has always outperformed general-purpose
CPUs as resources can be allocated to meet the needs of a specific application.
Two options are explored here: Application-Specific Integrated
Circuits (ASICs) and Field-Programmable Gate Arrays (FPGAs).
Traditionally, designers would choose to develop on an Application-Specific
Integrated Circuit (ASIC) platform when high performance is required to take
advantage of the large amount of parallelism found in many DSP applications.
When designed well, an ASIC can contain just the right mix of functional units
for a particular application. In addition, ASICs do not suffer from the serial
(and often slow and power-hungry) instruction fetch, decode and execute cycle
that is at the heart of any microprocessor. Although the development time
for ASICs is significantly longer, the cheaper production cost may offset the
non-recurring engineering (NRE) cost incurred when sales volumes are high.
A middle ground between microprocessors and ASIC is reconfigurable com-
puting systems such as Field Programmable Gate Arrays (FPGA). Most mod-
ern reconfigurable computing systems typically contain one or more processors
and a reconfigurable fabric. The processors would execute sequential and non-
critical code, while the reconfigurable fabric would ’execute’ code that has been
efficiently mapped to hardware. Like ASICs, reconfigurable computing systems
take advantage of the parallelism achievable in a hardware implementation.
With the improvements in process technology, the throughput of recent FPGAs
has even surpassed that of CPUs [Und04], and their ease of development and short
time-to-market places them between that of an ASIC and processor-based devel-
opment. Another area of research that has had a surge of activity due to the
improvements in process technology is around coarse-grain reconfigurable pro-
cessors such as Crisp [BJA+03], MorphoSys [LSL+00] and RICA [KNM+08].
In essence, these architectures combine a standard processor with an array of
reconfigurable hardware. The reconfigurable hardware would be tailored for
a specific task and, once that task is completed, reconfigured for another task by
the processor. This results in a hybrid architecture aimed at combining the
flexibility of software with the speed of hardware.
One of the benefits of using ASIC or reconfigurable platforms is that designers
have the freedom to customise an algorithm to meet the desired design goals
or constraints. The designer can choose how data is manipulated and which
number system is used to represent it. The ability to finely adjust the data-paths
has often been shown to lead to more efficient designs than processor-based
implementations. Constantinides and Gaffar [Con01, GML+02] have presented
automated means of finely adjusting designs by altering the precision of internal
data-paths to minimise hardware cost. Also, in recent years, tools such as
Altera's DSP Builder and Synplicity's Synplify DSP [Altb, Synb] have been
introduced to port a high-level description in Simulink [Matb] directly to a
Register Transfer Level (RTL) description for synthesis onto ASIC or FPGA
platforms.
2.3 Finite Wordlength Effects
Any practical DSP implementation in hardware will use finite wordlength
numbers and arithmetic, such as those discussed in Section 2.4. As a result,
every signal node or stored value may suffer from truncation/rounding noise
and the possibility of overflow.
Due to the finite precision of the signals in a design, the results of calculations
may need to be further quantised, by either truncating or rounding the lower
bits, to fit the result onto a signal [OW72]. Truncation is easy to perform as
the unwanted bits are simply dropped. Rounding, however, requires an adder
to be inserted in the data-path. Noise is introduced whenever truncation/rounding
is performed, and it appears as low-level noise at the design outputs. Provided
we can tolerate a certain level of noise, we can exploit this to optimise the
hardware area cost. Section 2.5 later describes some of these optimisation
techniques.
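The difference between the two quantisation schemes can be sketched in a few lines of Python (an illustration only, not taken from the thesis; the signals are modelled as plain integers):

```python
def truncate(x: int, bits: int) -> int:
    """Drop the lowest `bits` bits (arithmetic shift): no extra hardware."""
    return x >> bits

def round_half_up(x: int, bits: int) -> int:
    """Add half an LSB before shifting: needs an adder in the data-path."""
    return (x + (1 << (bits - 1))) >> bits

# Quantising 27 (binary 11011) by 2 bits, i.e. dividing by 4:
assert truncate(27, 2) == 6        # 6.75 always truncates downwards
assert round_half_up(27, 2) == 7   # 6.75 rounds to the nearest value
```

Truncation therefore introduces a biased error of up to one LSB, whereas rounding halves the worst-case error at the cost of the extra adder.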
The addition of signal truncation and rounding noise renders a design non-
linear. This nonlinearity is negligible when large signals are involved, and
quantisation noise then becomes the major concern. However, for recursive
filters this nonlinearity can cause limit cycles [Mit98]. Limit cycles are not a
problem for infinite precision number representations [Liu71], but with finite
precision a properly chosen filter structure and coefficients can free the filter
from their effects [Bom94, XB97]. Alternatively, one can determine a bound
on the maximum limit-cycle amplitude [GT88] and choose the level of
quantisation that makes the limit-cycle amplitude acceptably low.
While truncation/rounding noise results from losing the lower bits during
calculations, an overflow happens when the magnitude of a number exceeds
the upper limit of the number representation used. In the case of fixed-point
two's complement representation, an overflow results in a catastrophic increase
in noise caused by the number wrapping around. Furthermore, in recursive
filters, high-level oscillations can exist in an otherwise stable filter due to the
gross nonlinearity associated with the overflow of internal filter calculations
[CMP76, EMT69].
There are several ways to prevent overflow. One method is to force signals
to saturate to either the largest positive or largest negative number of the
representation used. By carefully selecting appropriate signals to perform
saturation arithmetic, the noise injected can be significantly reduced [CCL03].
However, additional hardware is required to implement saturation, and it slows
the design down. Moreover, saturation arithmetic is also prone to instability,
especially under zero-input conditions [SP87]. Another, more obvious, method
is to scale the signals so as to render overflow impossible. Current optimisation
methods derive the peak value of each signal through analytical or simulation
methods, which is then used to scale the signals appropriately. They do not,
however, offer any guarantee that no overflow will occur: overflows may still
occur in designs with feedback loops or under extreme input conditions.
All the effects mentioned depend on the number format used and the
parameters that define it. In all cases, providing extra bits in the data-paths
will reduce each finite wordlength effect. Then again, increasing the wordlength
means additional hardware resources will be needed. A designer has to balance
this trade-off to meet the design objectives and constraints.
2.4 Number Representations
There are many ways of representing data digitally, each with its own
advantages and disadvantages. This section addresses the basic conventional
number representations, namely fixed-point, floating-point and the logarithmic
number system, and also explores other alternative number representations.
Where possible, each number representation’s basic arithmetic operation,
precision and dynamic range will be described. The dynamic range of a number
representation is the range of possible values that it can handle. To quantify
the dynamic range, we define it as the ratio of the largest representable
magnitude over the smallest, generally expressed in decibels, i.e.:

Dynamic range = 20 log10 (largest representable magnitude / smallest representable magnitude) dB    (2.1)
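As a quick sanity check, Equation (2.1) can be evaluated directly; the following Python fragment (an illustration, not part of the thesis) reproduces the 187 dB figure quoted for the 32-bit fixed-point example later in this section:

```python
import math

def dynamic_range_db(largest: float, smallest: float) -> float:
    """Dynamic range as defined in Equation (2.1)."""
    return 20 * math.log10(largest / smallest)

# A <31,0> fixed-point magnitude spans 2**-31 up to ~1:
assert round(dynamic_range_db(1.0, 2**-31)) == 187
```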
As DSP applications are adder- and multiplier-rich, the two main arithmetic
operations (addition and multiplication) are the main focus of this thesis.
To take advantage of the parallelism in custom hardware, all the operations
described are based on bit-parallel arithmetic, where all the bits of a signal are
processed together.
2.4.1 Fixed-point
Fixed-point is a weighted positional number representation commonly employed
in custom implementations of DSP algorithms [Par00], where every operand is
modelled with a fixed-length integer part and fraction part. A fixed-point
number can be treated as an integer, and transforming from one fixed-point
format to another is done simply by bit-shifting, sign-extension and/or
zero-padding. As a result of this ease of format translation, it is possible to
assign each operand in a complex DSP task a unique and minimal wordlength
to minimise overall area, power and/or delay. There is also an abundance of
hardware library support for fixed-point arithmetic [Alt98, Xilb, Syna].
Fixed-point is essentially a number system whose radix point is fixed at a
predetermined position. A binary fixed-point number represents integers,
fractions or a combination of both. This is done by partitioning an n-bit binary
word into two sets: p bits for the integer part and (n − p) bits for the fractional
part. An additional sign bit 'S' is implied when referring to the wordlength for
fixed-point. As the name states, the binary point (or radix point) is fixed at
design time. For the remainder of this thesis, a binary fixed-point number
denoted by 〈n, p〉 has n bits for its wordlength, with the binary point scaled
p bits to the right of the sign bit. Figure 2.1 illustrates an 〈n, p〉 fixed-point
number. The symbol '•' will be used to represent the position of the binary
point throughout this thesis.
Figure 2.1: Fixed-point number format 〈n, p〉.
Representing signed numbers in binary fixed-point is normally done using
two's complement notation for easy addition and subtraction. The value, X,
of a fixed-point number 〈n, p〉 is given by (2.2). We can see that the maximum
representable value depends on the position of the binary point, or in other
words the scaling p. For the rest of this thesis, any mention of fixed-point
refers to the binary fixed-point two's complement representation unless stated
otherwise.

X = −S · 2^p + Σ_{i=0}^{n−1} x_i · 2^{i−(n−p)}    (2.2)
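Equation (2.2) can be exercised with a small Python sketch (illustrative only; the function name and bit ordering are this sketch's own conventions, with the bit list given most significant first):

```python
def fx_value(sign: int, bits: list[int], p: int) -> float:
    """Value of an <n,p> two's-complement fixed-point number (Eq. 2.2).
    `bits` holds x_{n-1} .. x_0; `sign` is the sign bit S."""
    n = len(bits)
    weighted = sum(x << i for i, x in enumerate(reversed(bits)))  # sum x_i * 2^i
    return -sign * 2.0**p + weighted * 2.0**(-(n - p))

# <4,1> examples: range is [-2, 2) in steps of 2^-3
assert fx_value(0, [1, 0, 0, 0], 1) == 1.0     # S.1000 with p = 1
assert fx_value(1, [0, 0, 0, 0], 1) == -2.0    # most negative value, -2^p
assert fx_value(0, [1, 1, 1, 1], 1) == 1.875   # largest value, 2^p - 2^-(n-p)
```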
Arithmetic
Since conventional fixed-point is a binary number system, all arithmetic
operations can be done with straightforward binary operations. For the
addition of n-bit numbers, a serial configuration of n full-adders linked together
forms the basic ripple-carry adder. This simple configuration may use the least
hardware resources, but other adder configurations, such as carry-look-ahead,
carry-skip and carry-save [Kor02], may speed up addition at the expense of
resources. FPGA designs tend to use carry-look-ahead chain adders to quickly
perform addition [Xilc].
Multiplication of fixed-point numbers is more laborious as it involves a
series of additions. In bit-parallel processing, there are two forms of multipliers:
the general full-multiplier and the constant coefficient gain-multiplier. An n-bit
full-multiplier takes two bit-parallel inputs to form the product, usually by
generating n partial products which are then summed together. Gain-multipliers
have a single bit-parallel input, which they scale by a fixed constant. Various
techniques are employed to reduce the number of partial products and/or to
accelerate their addition. Booth recoding [Boo51], for example, reduces the
number of partial products needed for gain-multipliers, and the Wallace tree
[Zim99] quickly sums the partial products.
Precision and Dynamic Range
For an 〈n, p〉 fixed-point number, X, the representable numbers lie in the
range −2^p ≤ X < 2^p, as seen in Figure 2.2. An (n + 1)-bit fixed-point number
has a dynamic range of 20 log10(2^n) dB, which depends solely on the wordlength.
For example, the absolute value of a 〈31, 0〉 fixed-point number lies between 1
and 4.7 × 10^−10, in other words a dynamic range of ≈ 187 dB.

Another property of fixed-point numbers is their uniform precision throughout
the whole representable range, which in the case of the fixed-point number
X above is 2^−(n−p). Therefore, in order to utilise the range and precision
effectively, a signal should be properly scaled so that as many of the available
bits as possible are used.
Figure 2.2: Range and precision of a two's complement fixed-point number.
2.4.2 Floating-point
In recent years, the use of floating-point arithmetic in digital signal processing
has increased dramatically due to the rapid development of hardware technology.
The main advantages of floating-point are its wide dynamic range, which reduces
the risk of overflow, and the improved signal-to-noise ratio of low-level signals.
Also, DSP algorithms are normally designed for use in a floating-point
environment. For example, Matlab [Mata], a popular algorithm exploration
and simulation tool, uses double-precision floating-point as its default datatype.
Unfortunately, the added complexity of performing arithmetic operations makes
floating-point hardware expensive and slower than its fixed-point counterpart.
Therefore, hardware designers often resort to porting their algorithms onto
fixed-point hardware.
In a floating-point number system, a number X is generally represented as

X = sgn(X) × M × 2^E    (2.3)

where sgn(X) returns the sign of the number, M is the mantissa (also sometimes
known as the fraction or significand) and E is the exponent. Typically, the
mantissa is normalised to lie within the interval M ∈ [1/β, 1), where β is the
radix. There are many variants of floating-point, which are normally not
directly compatible with one another. The most popular is the standard defined
in IEEE Standard 754-1985 [IEE85].
IEEE Standard 754-1985
IEEE Standard 754-1985 [IEE85] defines four formats of floating-point num-
bers. The first two are the basic 32-bit single and 64-bit double precision
format. The other two are extended formats used for intermediate calculation
results. Figure 2.3 shows the layout of the 32-bit single precision format where
e = 8bits and m = 23bits.
Figure 2.3: IEEE 754 single precision floating-point number format.
An IEEE 754 floating-point number, X, is given by

X = (−1)^S × 1.M × 2^(E−bias)    (2.4)

The mantissa has a hidden bit '1' implied at the MSB because the mantissa
is normalised to lie within the interval M ∈ [1, 2). To represent numbers less
than one, the exponent is biased by bias = 2^(e−1) − 1. The IEEE standard
reserves some special values, which are summarised in Table 2.1. NaN, short
for Not-a-Number, is produced when a floating-point operation receives invalid
inputs, such as when finding the square root of a negative number. When E = 0,
the mantissa is denormalised and the floating-point number has the value

X = (−1)^S × 0.M × 2^(E−bias_d)    (2.5)

where bias_d = 2^(e−1) − 2. The denormalised-number capability is seldom
included in the design of arithmetic units that follow the IEEE standard [Kor02],
mainly due to the high cost associated with its implementation. Due to the
popularity of the IEEE 754 standard, there are many hardware libraries that
support it, and therefore the remainder of this thesis will refer to it when
mentioning floating-point.
Table 2.1: IEEE 754 floating-point special values.

                M = 0    M ≠ 0
E = 0           0        Denormalised
E = 2^e − 1     ±∞       NaN
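Equations (2.4) and (2.5) can be checked against any standard IEEE 754 decoder. The Python sketch below (illustrative, not from the thesis) decodes a 32-bit single precision pattern by hand and compares it with the interpretation given by the `struct` module; the Inf/NaN cases of Table 2.1 are deliberately ignored:

```python
import struct

def decode_ieee754(word: int) -> float:
    """Decode a 32-bit single precision pattern per Eqs. (2.4)/(2.5).
    Ignores the E = 2^e - 1 special values (Inf/NaN)."""
    s = (word >> 31) & 1
    e = (word >> 23) & 0xFF
    m = word & 0x7FFFFF
    if e == 0:                                    # denormalised: 0.M * 2^(0 - 126)
        return (-1) ** s * (m / 2 ** 23) * 2.0 ** -126
    return (-1) ** s * (1 + m / 2 ** 23) * 2.0 ** (e - 127)

# 0x3FC00000: S=0, E=127, M=0x400000 -> 1.5 * 2^0
assert decode_ieee754(0x3FC00000) == 1.5
# Smallest denormal agrees with the platform decoder:
assert decode_ieee754(0x00000001) == struct.unpack('>f', b'\x00\x00\x00\x01')[0]
```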
Arithmetic
When compared to fixed-point, implementing floating-point arithmetic
operations is more complicated because of the extra exponent field and the
normalised mantissa. When performing addition, the significands of both
operands have to be pre-aligned. After addition, the output's mantissa has to
be re-normalised and its exponent updated to reflect the new value. All the
pre- and post-alignment is done using priority encoders and barrel shifters
[Kor02], which are expensive in terms of hardware area and power consumption,
and they tend to have large combinational delays. A simplified block diagram
of a floating-point adder/subtractor is depicted in Figure 2.4. The multiplication
operation is slightly easier as no input pre-alignment is needed, although the
product of the multiplication needs to be re-normalised and its exponent
re-calculated.

All these extra pre- and post-alignment operations add a significant amount
of overhead circuitry, which translates to increased hardware area, latency
and power consumption when compared to fixed-point. Apart from the hardware
used for pre- and post-alignment, the core arithmetic implementation is the
same as for fixed-point.
Figure 2.4: Floating-point adder/subtractor. Figure taken from [Kor02].
Precision and Dynamic Range
To demonstrate the precision and dynamic range of a floating-point number,
an IEEE single precision (32-bit) floating-point number X is used. Without
considering denormalisation, the floating-point number can take real values in
the range between 2^−127 and 2^128 (≈ 1535 dB). Adjusting the width of the
exponent field changes the dynamic range of the number.

Unlike fixed-point, the precision of a floating-point number varies depending
on its exponent. When compared to a properly scaled fixed-point number of
equal wordlength, floating-point will always be less precise due to the inclusion
of the exponent field.
Format variants
There have been many variants of floating-point proposed by various
researchers, each to meet the demands of their own application. Munafo [Mun96]
gives a detailed summary of many different floating-point variants, but only a
few notable ones are described here.

In a bid to extend the precision of floating-point numbers, Dekker [Dek71]
and Kahan [Kah65] pioneered approaches that essentially double the number
of bits used. Priest went further with his work on arbitrary-precision
floating-point numbers and showed that the computational cost of guaranteeing
accuracy is fairly reasonable [Pri91]. To extend the range, Yokoo introduced an
overflow- and underflow-free representation [Yok92] by not bounding the width
of the exponent field. This is made possible by using a prefix-free encoding
scheme by Hamada [Ham83] to encode the exponent with a self-delimiter,
allowing the exponent size to grow as necessary while sacrificing the size of the
mantissa. Both the precision- and range-extension methods described exist as
software methods.
Representation of decimal numbers has always been problematic in binary
number systems. A densely packed decimal encoding scheme that encodes
3 decimal digits (1,000 combinations) into 10 binary digits (1,024 combinations)
has been proposed by [Cow02] and is being worked into the draft revision of
the IEEE 754 floating-point standard [IEE07]. This is highly useful in the
financial field, where numbers like 0.1 can be represented accurately and
numbers are normally separated by commas in groups of 3 decimal digits.
2.4.3 Logarithmic Number System
In a logarithmic number system (LNS), operations such as multiplication, divi-
sion, powers and roots become easy as they are reduced to performing addition,
subtraction, multiplication and division operations respectively. However, ad-
dition and subtraction operation is more complicated as alignment of radix has
to be performed. Despite this, the LNS has generated considerable amount of
32
interest ever since its introduction, especially for designs with a high multiplier
to adder ratio [MTS95, FML06, CDdD06]. As Swartzlander et. al. pointed
out, LNS is intended to enhance the implementations of specialised applica-
tions and not meant to replace fixed-point or floating-point number systems
[SA75].
Figure 2.5: LNS number format.
A signed LNS number is represented by a sign bit S together with a logarithm
E. The logarithm is encoded with an integer part, i, and a fractional part, f,
as seen in Figure 2.5. The value of X is therefore defined in Equation (2.6)
below. In order to represent numbers smaller than one, negative logarithms
are needed. Therefore, the logarithm field may be two's complemented or
biased in the form "E − bias", where the bias is predetermined by the designer
at design time. In essence, an LNS number is a floating-point number whose
mantissa always equals 1.0.

X = (−1)^S · 2^E          if E is two's complement, or
X = (−1)^S · 2^(E−bias)   if E is biased.    (2.6)
Arithmetic
Since all values in LNS are logarithms, operations such as multiplication and
division simplify to a mere addition and subtraction, as seen in (2.7) for two
inputs A and B. The hardware required for multiplication in LNS is therefore
the same as for fixed-point addition.

log_β(A × B) = E_A + E_B
log_β(A ÷ B) = E_A − E_B    (2.7)
In contrast, LNS additions and subtractions are more complicated, and their
results often suffer from a lack of accuracy. A brute-force solution for this
operation is to use a complete look-up table. However, the size of such a table
is prohibitively large (2^2n × n) for an adder with n-bit input and output
wordlengths [Kor02]. A more common approach is to approximate the result.
It can be shown that addition and subtraction of LNS numbers can be
determined using the following equations:

log2(A + B) = E_A + Ψ+(E_B − E_A)
log2(A − B) = E_A + Ψ−(E_B − E_A)    (2.8)

where Ψ+(z) = log2(1 + 2^z) and Ψ−(z) = log2(1 − 2^z), with the condition
that z = |E_B − E_A| > 0. Both Ψ+(z) and Ψ−(z) can be pre-calculated and
stored in look-up tables, e.g. ROM. Figure 2.6 shows a typical LNS adder or
subtractor data flow taken from [TGJR88], where the addition of the logarithms
is performed by an ordinary fixed-point adder. Each ROM table would be no
larger than 2^n × n, but when n ≥ 20 the size of the ROM becomes prohibitively
large, and therefore several approaches have been suggested and implemented
to reduce the size of the look-up tables. The approach of [TGJR88] is to partition
the 2^n × n table into several smaller tables.
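Equations (2.7) and (2.8) can be sketched in Python (an illustration, not the hardware data flow of Figure 2.6; here Ψ+ is computed directly, whereas in hardware it would be a ROM look-up):

```python
import math

def lns_mul(ea: float, eb: float) -> float:
    """LNS multiplication (Eq. 2.7): just add the logarithms."""
    return ea + eb

def lns_add(ea: float, eb: float) -> float:
    """LNS addition (Eq. 2.8); Psi+ is a ROM look-up in hardware."""
    psi_plus = lambda z: math.log2(1 + 2 ** z)
    return ea + psi_plus(eb - ea)

# 4 * 8 = 32  ->  in log2 terms: 2 + 3 = 5
assert lns_mul(2.0, 3.0) == 5.0
# 4 + 8 = 12  ->  result is log2(12)
assert abs(2 ** lns_add(2.0, 3.0) - 12.0) < 1e-9
```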
Precision and Dynamic Range
As an example to compare with the IEEE floating-point standard, we take a
32-bit wordlength LNS number with an 8-bit integer part, a 23-bit fractional
part and an exponent bias of 2^7. The range of the LNS number is thus between
±2^−128 and ±2^128 (≈ 1541 dB), which is in the same range as a 32-bit
floating-point number. The LNS's dynamic range depends on the size of the
logarithm's integer part. Similar to floating-point, the precision of an LNS
number varies between values, and the precision gets coarser at higher logarithm
values.

Figure 2.6: Adder/subtractor for logarithmic numbers. A fixed-point (FX)
adder is used to perform the addition and the ROM contains the look-up table
for either Ψ+ or Ψ− of Equation (2.8).
2.4.4 Block Floating-Point
Block floating-point (BFP) is an attempt to strike a compromise between fixed-
point and floating-point. It utilises the benefits of dynamic scaling in floating-
point while taking advantage of fixed-point arithmetic operation’s simplicity.
First introduced by Oppenheim, BFP arithmetic has been used in the reali-
sation of digital filters [Opp70, SW86]. BFP has been used in several digital
audio data transmission standards. These audio standards include NICAM
(stereophonic sound system for PAL TV standard) [Bow87] and the audio
part of MUSE (Japanese HDTV standard) [Nim87].
BFP can be considered a special case of floating-point representation [KA96],
where a block of N numbers has a joint scaling factor corresponding to the
largest magnitude of the numbers in the block, i.e.

[x_1 · · · x_N] = [x̂_1 · · · x̂_N] · 2^γ,   where x̂_i = 2^−γ · x_i    (2.9)

The block exponent γ is defined by

γ = ⌊log2 M⌋ + 1 + k    (2.10)

where M = max(|x_1|, · · · , |x_N|) and ⌊x⌋ is the floor function, which returns
the largest integer less than or equal to x. The magnitudes of the block
mantissas x̂_i lie in the interval |x̂_i| ∈ [0, 2^−k]. The constant scaling term k
may be required by some applications to ensure no overflow of internal signals
and the output.
Despite its name, block floating-point is not truly a number representation;
it is a method to extend the dynamic range of a fixed-point algorithm. BFP's
strength is that the block exponent, γ, need only be represented once for a
whole block of numbers, while the main operations can be done normally in
fixed-point. Because the BFP format works on a block of data at a time, a
block of memory is needed to store the inputs and/or results, with the amount
of memory increasing with the number of pipeline stages. Also, the lower bits
of smaller signals get quantised away. [KF00] introduces a variant called
hierarchical BFP which preserves the lower bits.
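The block scaling of Equations (2.9) and (2.10) can be sketched as follows (an illustrative Python fragment, not from the thesis; the function name is this sketch's own):

```python
import math

def block_float(xs: list[float], k: int = 0):
    """Split a block into mantissas and one shared exponent (Eqs. 2.9-2.10)."""
    m = max(abs(x) for x in xs)                # M, the largest magnitude
    gamma = math.floor(math.log2(m)) + 1 + k   # block exponent, Eq. (2.10)
    mantissas = [x * 2.0 ** -gamma for x in xs]  # all |x_hat| <= 2^-k
    return mantissas, gamma

mants, g = block_float([0.75, -3.0, 1.5])
assert g == 2                                  # floor(log2(3)) + 1
assert mants == [0.1875, -0.75, 0.375]         # originals recovered by * 2^g
```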
2.4.5 Residue Number System
With the residue number system (RNS) [ST67], numbers are represented by
their residues with respect to a set of relatively prime moduli. Residue number
systems are inherently integer systems due to the definition of the residue.
An RNS is defined by a set of N integer constants M_1, M_2, ..., M_N referred
to as the moduli. The residue representation of an integer X is given by
(X_N, X_{N−1}, ..., X_1), where X_i is the residue of X modulo M_i. Each
residue X_i is therefore the smallest positive integer remainder of X/M_i:

X_i = X mod M_i = X − M_i · ⌊X/M_i⌋    (2.11)

where ⌊x⌋ returns the largest integer less than or equal to x. There is no
ambiguity in this system when representing integers within the range R given by

0 ≤ R ≤ (∏_{i=1}^{N} M_i) − 1.    (2.12)

As an example, let the set of moduli be {3, 4, 5}. The number 34 can then be
represented by its residues (1, 2, 4).
In conventional computer arithmetic, the addition of two numbers requires
the carries to be propagated from one end of the adder to the other. Arithmetic
operations in RNS are performed on the corresponding residues without carries
or any other interaction between the residues:

(A ± B) = ( (A_N ± B_N) mod M_N, . . . , (A_1 ± B_1) mod M_1 )
(A × B) = ( (A_N × B_N) mod M_N, . . . , (A_1 × B_1) mod M_1 )    (2.13)

Any carry bit terminates at the boundaries between residues, meaning RNS
has the advantage of adding large numbers quickly.
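The carry-free channel structure of Equation (2.13) can be seen in a few lines of Python (illustrative only; function names are this sketch's own), reusing the {3, 4, 5} moduli example from the text:

```python
def to_rns(x: int, moduli: tuple[int, ...]) -> tuple[int, ...]:
    """Residues of x with respect to a set of pairwise-coprime moduli."""
    return tuple(x % m for m in moduli)

def rns_add(a, b, moduli):
    """Each residue channel is added independently: no carry propagation."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, moduli))

M = (3, 4, 5)
assert to_rns(34, M) == (1, 2, 4)                       # example from the text
assert rns_add(to_rns(34, M), to_rns(7, M), M) == to_rns(41, M)
```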
However, the uniqueness of RNS poses various problems. Representing
fractions and performing division in RNS is not trivial, as residues are inherently
integers [And96, LC92]. As RNS is a non-positional number system, sign
determination, magnitude comparison and overflow detection are also non-trivial
[HP94]. Despite this, RNS has been shown to offer area cost savings, high
speed and low power dissipation in DSP applications [FP97, Sto05]. The basic
number systems (fixed-point, floating-point and LNS) can all be represented
in residue form. A residue version of fixed-point has been explained here; the
residue versions of floating-point and LNS are described by [KL97] and [Arn05]
respectively. Because there is no carry digit to propagate, the residue version
of LNS is the quickest means of performing multiplication and division.
2.4.6 Signed-Digit Number System
A signed-digit (SD) number system has redundancy in its representation,
meaning that values are not uniquely represented. Take for example a binary
signed-digit (BSD) number X denoted by the radix-2 representation
(x_n, x_{n−1}, ..., x_0)_BSD. Each digit x_i is drawn from the symmetric digit
set x_i ∈ {−1, 0, 1}. The value of the number is found in a similar fashion to
fixed-point:

X = Σ_{i=0}^{n} x_i · 2^i    (2.14)

For example, A = 9_10 can be written as (01001)_BSD = (0101 1̄)_BSD =
(1 1̄ 001)_BSD, where 1̄ = −1.
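The three redundant encodings of 9 can be checked with a short Python sketch of Equation (2.14) (illustrative only; digits are given most significant first, with −1 standing for the overbarred digit):

```python
def bsd_value(digits: list[int]) -> int:
    """Value of a radix-2 signed-digit number (Eq. 2.14), digits in {-1, 0, 1}."""
    v = 0
    for d in digits:          # Horner evaluation, MSD first
        v = 2 * v + d
    return v

# The three encodings of 9 from the text:
assert bsd_value([0, 1, 0, 0, 1]) == 9      # 01001
assert bsd_value([0, 1, 0, 1, -1]) == 9     # 0101(-1)
assert bsd_value([1, -1, 0, 0, 1]) == 9     # 1(-1)001
```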
Apart from the need to convert to and from integer numbers, the main
disadvantage of signed-digit number systems is the extra bits required to
represent them; hence they are rarely used as a conventional number
representation. Typically, BSD is used as an internal representation of an
arithmetic logic unit or in specially designed circuits. Parhami [Par88]
demonstrated that, provided a BSD number has no repeated strings of 1 or 1̄
in either operand, i.e. a_i × a_{i−1} ≠ 1, the addition will not need any carry
propagation. This is particularly useful to speed up multiplication, division
and square-root operations. Booth [Boo51] uses the BSD number system to
recode the multiplier for high-speed multiplication, reducing the number of
partial products needed for the multiplication operation.
2.4.7 Rational Arithmetic
The number systems described so far are unable to represent all real numbers
exactly. Take for example the real number 1/3, which would have to be
truncated to fit into a fixed-point number, introducing errors. Another way to
represent real numbers is to use fractions. There is no difficulty in representing
rational numbers in hardware; it suffices to have a pair of integers, one for the
numerator and one for the denominator, as shown in Figure 2.7. A real number,
X, in rational arithmetic (RA) is given by

X = (−1)^S × N/D    (2.15)
Figure 2.7: Example of a rational representation format.
The maximum value representable by RA depends on the width of the
numerator field, i, and the precision depends on the width of the denominator
field, (n − i). Matula and Kornerup proposed several variants of the format
described above, among them a floating-slash number system [MK85] where
the representation includes an additional field to denote the width of the
numerator. This allows adjustments to be made to accommodate a number
depending on its magnitude, akin to floating-point. However, since the
wordlength n is fixed, the precision of the number reduces as its magnitude
increases. The authors have also demonstrated several hardware arithmetic
units for RA [KM83, KM88], but there has been little uptake among hardware
designers and researchers. Rational arithmetic has been found in software
implementations such as LEDA [MN99], a collection of C++ libraries for
efficient data types and algorithms.
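The exactness that rational arithmetic buys can be demonstrated with Python's standard `fractions` module (a software example in the spirit of LEDA, not a hardware implementation):

```python
from fractions import Fraction

# 1/3 cannot be represented exactly in binary fixed- or floating-point,
# but a numerator/denominator pair holds it exactly:
third = Fraction(1, 3)
assert 3 * third == 1                      # exact: no accumulated rounding error

# Contrast with binary floating-point, where even 0.1 is inexact:
assert 0.1 + 0.2 != 0.3
assert Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)
```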
2.4.8 Level-Index
In an effort to represent real numbers without overflow/underflow, Clenshaw
and Olver [CO84] proposed representing numbers based on repeated
exponentiation. A non-negative real number, X, may be written in the form

X = e^(e^(···^(e^I)))

where 0 ≤ I < 1 and the exponentiation is repeated L times. The values of L
and I are respectively known as the level and index of the Level-Index (LI)
number representation. The structure of an LI number is shown in Figure 2.8,
and a real number X_LI is mapped in the following way:

X_LI = (−1)^S × φ(L, I)    (2.16)

where the 'generalised exponential function' φ is defined as

φ(L, I) = e^φ(L−1, I)   if L > 0
φ(L, I) = I             if L = 0    (2.17)
Figure 2.8: Level-Index/Symmetric Level-Index number format.
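The recursion of Equation (2.17) translates directly into code; the following Python sketch (illustrative, not a hardware implementation) evaluates the generalised exponential function:

```python
import math

def phi(level: int, index: float) -> float:
    """Generalised exponential function of Eq. (2.17):
    `level` repeated exponentiations applied to `index`."""
    return index if level == 0 else math.exp(phi(level - 1, index))

assert phi(0, 0.5) == 0.5                      # L = 0: just the index
assert phi(1, 0.5) == math.exp(0.5)            # L = 1: e^I
assert abs(phi(2, 0.0) - math.e) < 1e-12       # e^(e^0) = e
```

Even small levels produce enormous values: phi(4, 0.5) already exceeds the range of IEEE single precision, which is what gives LI its near limitless range.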
The authors improved on LI with the symmetric level-index (SLI)
representation [CT88], which represents numbers in the range [0, 1) using the
reciprocal of the generalised exponential function:

X_SLI = (−1)^S × φ(L, I)^r(X)    (2.18)

where

r(X) = +1 if |X| ≥ 1
r(X) = −1 if |X| < 1    (2.19)
The repeated exponentiation in the LI/SLI number system allows a near
limitless range of representable numbers. In particular, with l = 3 bits for
storage of the level (plus one more bit for the reciprocal sign, r(X), in the case
of SLI), the problems of overflow and underflow become "non-existent" [LO90].
In contrast, in the conventional floating-point system, overflow and underflow
can always occur, regardless of the width of the exponent field. On top of that,
computed numbers in SLI maintain considerable precision compared to
floating-point for large numbers. Take for example a number whose magnitude
is 10^1,000,000. Clearly it is beyond the limit of IEEE double precision
floating-point, yet a 64-bit SLI number with L = 3 retains a relative precision
of about 1.6 × 10^−10 [LO90].

The arithmetic operations for LI/SLI are considerably more complicated,
which limits its adoption. There is no known hardware implementation of LI,
only some software implementations such as one in Turbo Pascal [Tur89].
2.4.9 Comparison and Summary
In this section, comparisons are made among the conventional number repre-
sentations (fixed-point (FX), floating-point(FP) and LNS) before summarising.
From the discussions above, the various differences between the number repre-
sentations do not shed enough light to quantify the differences in terms of chip
41
Table 2.2: Area and critical path delay for 16-bit arithmetic units, taken from
Tables 4.4 to 4.6. These arithmetic units were implemented on an ASIC
platform.

Format     Area (cells)
           Adder    Gain Multiplier    Full Multiplier
FX           449         2551               8077
FP          5049         2627               5312
LNS        23490          292                553

Format     Critical path delay (ns)
           Adder    Gain Multiplier    Full Multiplier
FX         2.390         8.740              9.325
FP        13.607        10.881             12.146
LNS       18.842         2.483              5.616
area cost. Results from Tables 4.4 to 4.6 in Chapter 4 (summarised in Table
2.2 for convenience) are quoted in this section for comparison. These tables
show the synthesised chip area cost of the three basic arithmetic operations:
addition, constant gain multiplication and general multiplication.

The floating-point and LNS adders implemented on an ASIC1 are 11 and 52
times larger than the fixed-point adder respectively. Also, the fixed-point adder
is 5.7 and 7.9 times faster than the equivalent floating-point and LNS adders.
LNS may be bad for addition, but its multipliers are the cheapest and quickest:
when compared to fixed-point, the LNS multipliers are on average 11.7 times
smaller and 2.5 times faster in ASIC. Multipliers for fixed-point and
floating-point, however, are similar in terms of area and speed. The adders and
multipliers were synthesised with parameters chosen to match each other's
dynamic range.
As there is no comparative study between number representations on a
system level, a simple 4-tap FIR filter for each number representation with
increasing wordlength is made to compare the number representations. A
diagram of the filter can be found in Figure 4.1(b) and the coefficients used
1UMC 0.13um High Density Standard Cell Library
Table 2.3: Area and critical path delay (CPD) results for the 4-tap FIR filter with increasing wordlength, implemented using the UMC 0.13um High Density Standard Cell Library.

          Fixed-point        Floating-point       LNS
Design    Area      CPD      Area      CPD        Area       CPD
          (cells)   (ns)     (cells)   (ns)       (cells)    (ns)
1           9829     9.9     16898     51.4        16188     39.1
2          11042    10.3     19710     54.3        23295     43.3
3          13067    11.4     21992     58.5        31562     46.9
4          14845    12.0     24854     61.7        42918     49.6
5          17261    13.0     26461     62.9        57261     54.0
6          18198    13.4     29685     63.4        77916     58.5
7          20005    14.4     32481     70.2       107450     59.5
Figure 2.9: Area (cells) versus critical path delay (ns) for the designs in Table 2.3, with one series each for fixed-point, floating-point and the logarithmic number system.
Table 2.4: Dynamic range comparison of 32-bit number representations.

Number representation    Fixed-point    Floating-point    LNS
Dynamic range            187dB          1535dB            1541dB
were randomly chosen. Each signal in the filter is of equal wordlength. For each incremental design, the width of the signals in the filter is increased one bit at a time, and the parameters for each number representation are selected to give a similar dynamic range for each design (refer to Table 4.7, designs 1 to 7). Table 2.3 tabulates the results for the filters and Figure 2.9 plots them. From Figure 2.9, we can see clear separations between the types of number representation: fixed-point results are small and fast while floating-point is large and slow. LNS initially lies between the two, but its last three results fall outside the range of the graph.
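The structure of the benchmark is easy to state in software. The sketch below is a direct-form 4-tap FIR in Python; the coefficient values are hypothetical, since the randomly chosen coefficients of the actual designs are not listed here:

```python
def fir4(samples, coeffs):
    """Direct-form 4-tap FIR: y[t] = c0*x[t] + c1*x[t-1] + c2*x[t-2] + c3*x[t-3]."""
    assert len(coeffs) == 4
    taps = [0.0, 0.0, 0.0, 0.0]          # delay line, initially cleared
    out = []
    for x in samples:
        taps = [x] + taps[:3]            # shift the new sample in
        out.append(sum(c * t for c, t in zip(coeffs, taps)))
    return out

# The impulse response of an FIR filter is its coefficient sequence:
print(fir4([1, 0, 0, 0, 0], [0.5, 0.25, -0.125, 0.0625]))
# → [0.5, 0.25, -0.125, 0.0625, 0.0]
```

In the hardware designs, each multiply and add here becomes a fixed-point, floating-point or LNS arithmetic unit of the chosen wordlength, which is what Table 2.3 measures.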
Comparing the dynamic ranges of the three major number representations in Table 2.4, we can see that for designs that need a large dynamic range the obvious choice is to use floating-point or LNS. Collange suggested using LNS over floating-point when the number of multiplications is greater than the number of additions, but never stated a value [CDdD06], while Coleman suggested a ratio of at least 3:2 before seeing any speed improvements [CC99]. However, most DSP applications do not need such a wide dynamic range. Fang et al. showed that an H.263 video decoder preserved good perceptual quality while using a 'lightweight' 14-bit floating-point format (1 sign-bit, 5 exponent and 8 mantissa bits) without denormalisation [FCR02]. The video decoder with lightweight floating-point was about 3 times smaller than the full IEEE implementation, though it is 2 times larger than a fixed-point version. These results point towards a trend of IEEE-like floating-point formats with reduced wordlength. In [LEK05], 16-bit floating-point instructions for embedded processors are demonstrated, and NVIDIA recently introduced a 16-bit half-precision floating-point type to their Cg language [NVI05].
In the field of FPGAs, implementation of floating-point has been particularly expensive, and researchers have been using parameterised floating-point modules for DSP operators [JL01], [DGL+02]. Techniques to convert from a description of a DSP algorithm to a parameterised floating-point implementation have been presented [GML+02, LGML05]. However, a properly optimised fixed-point design will usually result in smaller area and higher performance when compared with other number representations.
The less conventional number representations surveyed do not provide any form of compromise between dynamic range and hardware implementation complexity. Block floating-point (BFP) is a method to extend the range of a fixed-point algorithm and is not a number representation in itself. An application with BFP needs to register a block of its inputs and outputs, which means that the outputs incur latency delays. Residue and signed-digit number systems are normally used for speeding up internal calculations such as multiplications; their unique nature brings many disadvantages that hinder their use as general number representations. Rational arithmetic and level-index are best suited to special-purpose computing in the software realm.
Among the conventional number representations, fixed-point is currently
the best performer in terms of hardware implementation cost and floating-
point is used when large dynamic range is required. The two formats differ
significantly in terms of hardware implementation cost and at present, there
is no number representation that gives any compromise between the dynamic
range and hardware implementation complexity.
2.5 Wordlength and Scaling Optimisation
One of the main objectives of designers is to find optimal designs to meet the
requirements of an application. Optimality of designs could refer to the area,
critical path delay, throughput and/or power consumption. The wordlength
and scaling parameters of signals can be tweaked by designers to improve these
metrics.
Signal wordlength optimisation has enjoyed considerable attention in the research community. An optimisation procedure can typically be separated into two parts: range analysis and precision analysis. Range analysis determines the dynamic range required by the signals in the system, whereas precision analysis refines the wordlengths of signals needed to meet the performance requirement of a design. The methods available can also be classified into two types, one being fully simulation-based and the other fully analytical; both have their own advantages and disadvantages. There is, however, a growing number of mixed simulation and analytical methods, or hybrid methods, being proposed. All the methods reviewed allow the user to specify a trade-off between numerical accuracy and efficiency in chip area, speed performance and/or power consumption in the implementation.
2.5.1 SKK Methodology
Sung, Kim and Kum have developed a method to determine the optimum wordlengths for DSP algorithms based solely on simulation. Their method measures the performance of a fixed-point algorithm using simulation results and iteratively modifies the wordlengths to find an optimum set that minimises their objective function.
For the range analysis, statistics of each signal x, such as its mean (µx) and standard deviation (σx), are collected via a single-pass simulation [KS94]. Using this information, the authors proposed a statistical range for signal x determined by R(x) = |µx| + k σx, where k is a user-specified integer. For a symmetric, uni-modally distributed signal, the scaling of the signal can therefore be made to accommodate this range. The authors extended this framework to classify signals using their skewness and kurtosis characteristics [KKS98].
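The range analysis above amounts to one pass over recorded samples. A small Python sketch of the idea (the helper names and the mapping from range to integer bits are ours, for illustration):

```python
import math

def statistical_range(samples, k=4):
    """SKK-style range estimate R(x) = |mu| + k*sigma from one simulation pass."""
    n = len(samples)
    mu = sum(samples) / n
    sigma = math.sqrt(sum((s - mu) ** 2 for s in samples) / n)
    return abs(mu) + k * sigma

def integer_bits(r):
    """Bits needed to the left of the binary point to cover roughly [-R, R)."""
    return max(0, math.ceil(math.log2(r))) if r > 0 else 0

samples = [1.0, -1.0, 3.0, -3.0]
r = statistical_range(samples, k=2)   # mu = 0, sigma = sqrt(5), so R = 2*sqrt(5)
print(integer_bits(r))                # → 3
```

A larger k trades a wider (partly wasted) integer part against a lower probability of overflow.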
In [SK94, SK95], the precision analysis starts out with all signals having a large wordlength (64 bits). Each signal's wordlength is reduced individually until it reaches its 'minimum wordlength', at which the design only just meets a user-specified error specification. The set of minimum wordlengths, together with the uniform wordlength that satisfies the error specification, forms the bound for the minimum-hardware-cost optimisation phase. This phase may be done either through an exhaustive search, or by using a heuristic that favours reducing the wordlength of the signals that have the greatest impact on minimising hardware cost. The authors went on to improve their framework in [KS01] in a bid to reduce the number of simulations. Signals around the adders after multiplications are grouped together and optimised as a "signal wordlength group", and their error effects are analysed using standard quantisation noise models for linear, time-invariant systems [OS99].
There are various other works similar to the SKK methodology. Roy et al. proposed a MATLAB-to-fixed-point FPGA implementation flow [RB04]. The main difference is that their precision analysis minimises the wordlength of all units instead of the hardware cost, under the assumption that lower wordlength equals lower hardware cost. This reduces the complexity of the optimisation algorithm and produces quick results. A simulation-only optimisation can be a slow process: the optimisation procedure spends most of its time waiting for the error metric feedback. Also, the resultant design is not guaranteed to function to expectation when a different set of input stimuli is used [CRS+99].
2.5.2 Interval and Affine analysis
As opposed to simulation-only optimisation, techniques derived from interval arithmetic (IA) are a form of static-analysis-based optimisation where the bounds of a signal range and quantisation error are determined analytically. In interval arithmetic, each variable is an interval x = [x̲, x̄], where x̲ and x̄ are the lower and upper bounds respectively. An arithmetic operation on intervals results in another interval. Taking addition for example, the interval of the output is given as x + y = [x̲ + y̲, x̄ + ȳ]. Similar formulas can be derived for other arithmetic operations [Moo66].
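These interval rules are mechanical. A minimal Python sketch (tuples as intervals, helper names ours) also shows the pessimism noted below, since IA forgets that two intervals may refer to the same variable:

```python
def ia_add(x, y):
    """Interval addition: [xl, xu] + [yl, yu] = [xl + yl, xu + yu]."""
    return (x[0] + y[0], x[1] + y[1])

def ia_mul(x, y):
    """Interval multiplication: extrema over all four corner products."""
    p = [x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1]]
    return (min(p), max(p))

x = (1, 3)
neg_x = (-3, -1)
print(ia_add(x, neg_x))   # x - x should be exactly 0, but IA gives (-2, 2)
```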
Benedetti and Perona’s approach to optimisation first uses IA to extract
the range of each signal in the design [BP00]. They introduced a multi-interval
representation to monitor the wordlength growth of each signal which gives a
bound for the range of operating values for each signal. The authors noted
that IA may result in designs that are overly pessimistic, especially when there
is correlation between signals.
Affine arithmetic (AA) [SdF97] has been developed to alleviate some of the problems with IA by preserving the correlation among intervals. For a number x = [x̲, x̄], its affine form is x̂ = x0 + x1ε1 where x0 = (x̄ + x̲)/2, x1 = (x̄ − x̲)/2 and ε1 ∈ [−1, 1]. The x1ε1 term models the uncertainty or variable range of x. The affine forms of basic operations are given in [SdF97]. As an example, addition in AA may be expressed in the form

x̂ + ŷ = (x0 + y0) + Σ(xi + yi)εi, with the sum taken over i = 1, . . . , n.
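Because the εi noise symbols are shared between affine forms, correlations survive operations. A Python sketch of affine addition under these definitions (class and method names are ours; multiplication and division are omitted):

```python
class Affine:
    """Affine form x0 + sum(xi * eps_i), with each eps_i in [-1, 1]."""
    def __init__(self, x0, terms=None):
        self.x0 = x0
        self.terms = dict(terms or {})      # noise-symbol id -> coefficient

    def __add__(self, other):
        t = dict(self.terms)
        for i, c in other.terms.items():
            t[i] = t.get(i, 0.0) + c        # shared symbols add coefficient-wise
        return Affine(self.x0 + other.x0, t)

    def __neg__(self):
        return Affine(-self.x0, {i: -c for i, c in self.terms.items()})

    def interval(self):
        rad = sum(abs(c) for c in self.terms.values())
        return (self.x0 - rad, self.x0 + rad)

x = Affine(2.0, {1: 1.0})        # the interval [1, 3] as 2 + 1*eps1
print((x + (-x)).interval())     # → (0.0, 0.0); plain IA would give (-2, 2)
```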
Fang et al. [FRPC03] first demonstrated the use of AA to perform error analysis on fixed-point and floating-point DSP designs. As a result, an automated optimisation tool, MiniBit, was developed by Lee et al. [LGML05] using AA and simulated annealing for fixed-point designs. The standard affine form shown above is used for the range analysis. For the precision analysis, a quantised version of signal x is given by x̂ = x + uε. In the case of round-to-nearest quantisation of a fixed-point number 〈n, p〉, u = 2^(p−n−1) (0.5 units in the last place, ulp) and ε ∈ [−1, 1]. MiniBit then uses a simulated annealing meta-heuristic which takes the fully analytical precision analysis as feedback to minimise hardware cost.
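The rounding model can be checked numerically. Assuming the 〈n, p〉 convention of this thesis (LSB weight 2^(p−n), hence a half-ulp of 2^(p−n−1)), a Python sketch:

```python
def quantise(x, n, p):
    """Round x to the nearest point on a fixed-point <n, p> grid whose
    LSB weight is 2**(p - n)."""
    lsb = 2.0 ** (p - n)
    return round(x / lsb) * lsb

n, p = 8, 2                        # LSB = 2**-6, half-ulp u = 2**-7
u = 2.0 ** (p - n - 1)
errs = [abs(quantise(i * 0.001, n, p) - i * 0.001) for i in range(1000)]
print(max(errs) <= u)              # → True: |x_hat - x| = |u*eps| <= u
```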
2.5.3 BitSize Tool
The BitSize tool by Gaffar et al. [GML+02, GML04] is an example of a hybrid approach to optimisation. With minimal simulation effort, the tool obtains the information needed to analytically minimise the area cost. It uses the interval arithmetic methods discussed previously to determine the ranges of design signals, and automatic differentiation for the precision analysis.
Automatic differentiation, developed by the applied mathematics commu-
nity [Gri00], is able to obtain the differentials of each variable in an algorithm
as a by-product of a simulation run. BitSize uses these differentials as sensi-
tivity measures for errors induced on a signal due to quantisation. Take for
example a function y = f(x). A small change ∆x in input x would cause a
change ∆y in output y: ∆y ≈ f ′(x)∆x where f ′(x) is the first derivative or
the sensitivity of the output to changes in the input. This approximation holds provided that ∆x ≪ x.
For a differentiable function with n inputs, y = F(x1, x2, . . . , xn), the Taylor series approximates the change ∆y as shown in Equation (2.20) below, ignoring higher-order terms.

∆y ≥ ∆x1(dF/dx1) + ∆x2(dF/dx2) + . . . + ∆xn(dF/dxn)    (2.20)
In BitSize, each input term represents a signal in the design. A forward pass
analyses the differentials for each signal and by specifying a maximum tolerable
error (∆y) the errors of each signal will be bounded by (2.20). The authors in
[GML+02] suggest two ways of partitioning the output error bound between
the signal errors: (1) uniformly ∆xi = ∆y/n, or (2) weighted ∆xi = ∆y×Wi.
The weights could be chosen to reflect the relative cost for the operations at
each signal. Hence a backward pass calculates the wordlengths of signals using
the sensitivities measured and their respective error bounds.
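The forward and backward passes can be sketched directly. Under the assumption that each signal is rounded to nearest (half-ulp error), one plausible backward pass turns each partitioned error budget ∆xi and sensitivity into a number of fractional bits; the helper names and the exact bit-allocation formula are ours, for illustration:

```python
import math

def partition_budget(dy, sens, weights=None):
    """Split an output error budget dy across n signals, uniformly or by
    weights, then pick fractional bits so that half-ulp * |dF/dx_i| <= dx_i."""
    n = len(sens)
    w = weights or [1.0 / n] * n            # uniform: dx_i = dy / n
    assert abs(sum(w) - 1.0) < 1e-9
    dx = [dy * wi for wi in w]              # weighted: dx_i = dy * W_i
    # backward pass: smallest f with 2**-(f + 1) <= dx_i / |dF/dx_i|
    return [math.ceil(-math.log2(2 * d / abs(s))) for d, s in zip(dx, sens)]

# two signals, the second four times more sensitive, sharing a budget of 0.01
print(partition_budget(0.01, [1.0, 4.0]))   # → [7, 9]
```

As expected, the more sensitive signal is assigned more fractional bits for the same share of the budget.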
The methodology described offers a uniform treatment for both fixed-point and floating-point [GML04]. It does not, however, allow for mixed usage of the two number formats.
2.5.4 Synoptix and RightSize tool
Like the interval analysis methods discussed earlier, the Synoptix tool by Constantinides et al. [CCL01] is a fully static-analysis-based optimisation utility for fixed-point designs. As with all fully static analysis methods, Synoptix is limited to linear systems. RightSize improves on Synoptix to optimise non-linear systems through a hybrid approach [Con03].

The design flow of both the Synoptix and RightSize tools, shown in Figure 2.10, takes as input a DSP algorithm built in MathWorks' Simulink [Matb]. Simulink is a graphical programming environment that visualises a DSP algorithm using synchronous data-flow graphs (DFGs) [LM87, RH91], a commonly used means of representing an algorithm. The tools are entirely architecture-independent; the vendor-specific portions of the design flow are shaded in the figure. Written in C++, they make good use of classes, and extending them to incorporate new number representations is fairly straightforward. They both produce a synthesizable structural description ready for vendor tools, and
Figure 2.10: Design flow for the Synoptix/RightSize tools (shaded portions are vendor-specific). A Simulink design, user constraints, area models and (for RightSize only) representative floating-point inputs feed the tool, which emits structural VHDL, a testbench with verification outputs and a makefile; vendor synthesis and a VHDL simulator then produce the completed design, whose outputs are compared against the verification outputs.
a bit-true behavioural VHDL testbench together with a set of expected out-
puts for design verification. Also generated is a ‘makefile’ to automate the
post-synthesis and simulation processes.
Synoptix
The range analysis portion in Synoptix uses the l1-norm scaling [HJ86] on
the transfer function impulse response of the inputs to each signal. Using the
product of the l1-scaling and the input peak values supplied by the user gives
the maximum range of a signal.
In the precision analysis, every quantisation is modelled by an error injec-
tion using quantisation noise models described in [CCL99]. Output errors are
estimated by the weighted sum of the injected errors. The weights are determined using the so-called L2-scaling [HJ86] on the transfer function impulse response from each noise-injection input to the outputs. The estimated error is
compared with the user-supplied output SNR constraint during the wordlength optimisation heuristic described below.
1. Find a uniform wordlength for the design that meets the SNR constraint, and scale all wordlengths up by a factor of 2.

2. Start iterations by reducing each signal's wordlength individually until just before the SNR constraint is violated, and estimate the individual impact on the area. Provided at least one wordlength reduction was possible, the signal that gave the least area cost is nominated to have its wordlength reduced.

3. The nominated signal's wordlength is then reduced by one bit and step 2 is repeated. If no signal was nominated, the iteration terminates, giving the optimised design.
This greedy optimisation heuristic has been shown to give results within 0.7% of the optimum area for a given user constraint [CCL02]. The main disadvantage of this tool is its limitation to LTI designs, which is addressed by the author in the RightSize tool.
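The heuristic of steps 1-3 can be sketched directly. The area and SNR models below are toy stand-ins (in the real tool these come from the area models and the analytical error estimate):

```python
def greedy_optimise(wordlengths, area, meets_snr):
    """Sketch of the Synoptix greedy heuristic: each round, tentatively shrink
    every signal on its own to its minimum feasible width, nominate the one
    giving the least total area, cut it by one bit, and repeat until no
    signal can shrink."""
    wl = dict(wordlengths)                  # signal name -> current wordlength
    while True:
        best_sig, best_area = None, None
        for s in wl:
            trial = dict(wl)
            # reduce s until just before the constraint would be violated
            while trial[s] > 1 and meets_snr({**trial, s: trial[s] - 1}):
                trial[s] -= 1
            if trial[s] < wl[s]:
                a = area(trial)
                if best_area is None or a < best_area:
                    best_sig, best_area = s, a
        if best_sig is None:
            return wl                       # converged: optimised design
        wl[best_sig] -= 1                   # nominated signal loses one bit

res = greedy_optimise({'a': 12, 'b': 12},
                      area=lambda w: sum(w.values()),
                      meets_snr=lambda w: sum(w.values()) >= 20)
print(res)   # → {'a': 8, 'b': 12} with this toy model
```

With this toy model the signals shrink until the constraint binds; ties are broken in favour of the first signal examined, which is why the reduction is not shared evenly.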
RightSize and perturbation analysis
In RightSize [Con03], Constantinides used a perturbation analysis technique to produce a linearised small-signal equivalent model [SS97] of a non-linear system, in order to apply the analytical techniques used to estimate noise in LTI systems.

Using the notation from [Con03], consider an n-input differentiable function Y[t] = f(X1[t], . . . , Xn[t]) where t is the time index. Taking the first-order Taylor approximation, a small perturbation xi[t] of variable Xi[t] causes a perturbation of Y[t] such that y[t] ≈ x1[t](∂f/∂X1) + . . . + xn[t](∂f/∂Xn). This approximation is linear in each xi, although the partial derivative terms vary
with time and are a function of X1, . . . , Xn. Assuming the quantisation er-
rors are sufficiently small, the approximation function is used to make a linear
small signal equivalent of a design. In the RightSize tool, derivative monitors
are used to collect the partial derivatives during a simulation run with user
supplied inputs.
Since the model is linear, if injecting an error of variance σ² into a signal gives an output variance V, then scaling the error variance by ε (i.e. εσ²) scales the output variance by ε as well (i.e. εV). Therefore the output response can be determined analytically by scaling the response obtained from a one-time simulation with a noise of known variance. RightSize analyses the output sensitivity to each signal by injecting a known random unit-variance noise into each signal and observing the output variance as an unscaled sensitivity measure. These sensitivities are then used as weights in the weighted sum of errors, as in the Synoptix tool. The rest of the optimisation uses the same greedy heuristic as Synoptix.
2.6 Summary
This chapter introduces some of the common and not so common number
representations used in digital hardware. Since the choice of data represen-
tation is a vital aspect of any designer’s decision, it has to be treated with
care. Data representation can dramatically determine the overall chip area
and speed performance.
Fixed-point arithmetic is normally better suited for DSP applications than
floating-point arithmetic since good DSP algorithms require high accuracy
(long mantissa), but not the large dynamic signal range provided by floating-
point arithmetic [Wan99]. Floating-point arithmetic provides a large dynamic
range which is usually not required, and the cost in terms of power consump-
tion, execution time, and chip area is much larger than that for fixed-point
arithmetic. Hence, floating-point arithmetic is useful in general-purpose signal
processors, but it is not efficient for application-specific implementations.
Some wordlength and scaling optimisation techniques were also reviewed. Simulation-only techniques have the disadvantages of heavy dependence on the input data stimuli and slow run times. Purely analytical static optimisation techniques tend to suffer from pessimistic results and are not adequate for non-linear systems. The RightSize tool provides a good basis for wordlength optimisation and forms the framework upon which the work described in Chapter 6 of this thesis is based.
Chapter 3
Exponent Recoding & Dual
FiXed-point
3.1 Introduction
Having looked at the various number representation systems for digital hard-
ware in Chapter 2, this chapter details the concept of exponent recoding and a
special case derived from it, a new number representation called Dual FiXed-
point (DFX).
The idea of exponent recoding essentially takes the conventional floating-point representation and applies a recoding function to its exponent field. This function is chosen arbitrarily by the designer at design time, which gives the flexibility to trade hardware implementation complexity against the dynamic range of signals. Exponent recoding also serves as a generalising concept, of which some of the number systems mentioned in Chapter 2 are special cases.
Dual FiXed-point, also a special case of exponent recoding, improves the dynamic range of signals in digital hardware without significantly increasing circuit size. It combines the simplicity of the ordinary fixed-point data representation with the superior dynamic range of the floating-point data representation, without the inherent hardware complexity. Section 3.3 first defines DFX and then introduces its characteristics and properties.
The original contributions of this chapter are:
• the concept of Exponent Recoding and how it relates to existing number
systems as special cases,
• the definition of Dual FiXed-point as a new number system [ECC04],
and
• the characteristics of Dual FiXed-point.
3.2 Exponent Recoding
Figure 3.1: Example of a number system with exponent recoding: an n-bit word comprising a c-bit coded exponent field, Ec, and an m-bit significand field, M.
Definition 3.1. The representation of a real number, X, in the form given by (3.1) will be referred to as a number with exponent recoding (ER).

X = M · β^Φ(Ec)    (3.1)

where the base β ∈ R, M is the significand, Ec is the coded exponent and Φ is the function that recodes the exponent. The base and recoding function are attributes predetermined by the designer. ¤
A number with exponent recoding may be represented in a similar manner to conventional floating-point. Referring to Figure 3.1, the numerical data is represented by an n-bit number which contains two separate fields: the coded exponent field Ec (c bits) and the significand field M (m bits). The exponent recoding concept introduces a recoding function, Φ, applied to the coded exponent field Ec before it is used as an exponent of the base.

The recoding function Φ can be arbitrarily chosen by the designer to meet his/her needs. When the coded exponent is c bits wide, there can be up to 2^c different mapping results for the base exponent. The recoding function adds an extra level of indirection and greater flexibility to the signal precision and range. With a carefully chosen recoding function, the number system can have a very large dynamic range of representable values. Generally, as the number of possible exponent values in a number system increases, the hardware complexity increases as well.

As a note, the term "exponent recoding" has been used in the context of public-key cryptography [SMM05]. This is not to be confused with the exponent recoding concept introduced here, which applies a mapping function to the exponent field.
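Decoding a number under Definition 3.1 is mechanical once Φ is fixed. A Python sketch, with a hypothetical 2-bit recoding table chosen purely for illustration:

```python
def er_value(M, Ec, beta, phi):
    """Decode an exponent-recoded number: X = M * beta**phi(Ec)."""
    return M * beta ** phi(Ec)

# hypothetical recoding: a 2-bit coded exponent mapped to scalings {-8, 0, 4, 16}
table = {0: -8, 1: 0, 2: 4, 3: 16}
phi = lambda e: table[e]

print(er_value(3, 2, 2, phi))   # 3 * 2**4  → 48
print(er_value(3, 0, 2, phi))   # 3 * 2**-8 → 0.01171875
```

Different choices of the table trade precision near zero against reachable magnitude, which is exactly the designer's lever described above.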
3.2.1 Special Cases
Floating-point
Floating-point is a straightforward example of exponent recoding. Taking the IEEE 754 floating-point standard as an example, the base is β = 2 and the recoding function ΦIEEE754 is applied to the exponent field as follows:

ΦIEEE754(Ec) = Ec − (2^(c−1) − 1)   if 0 < Ec < 2^c − 1
             = −(2^(c−1) − 2)       if Ec = 0
             = +∞                   if Ec = 2^c − 1
where c is either 8 or 11 depending on whether it is single or double precision
format.
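This recoding function translates directly into code. A Python sketch (treating the all-ones code as infinity, as above; NaN significands are ignored here):

```python
def phi_ieee754(Ec, c=8):
    """IEEE 754 exponent recoding for a c-bit exponent field
    (c = 8 for single precision, c = 11 for double)."""
    bias = 2 ** (c - 1) - 1
    if Ec == 2 ** c - 1:
        return float('inf')        # all-ones code: infinities (and NaNs)
    if Ec == 0:
        return -(bias - 1)         # subnormals reuse the minimum exponent
    return Ec - bias

print(phi_ieee754(127))            # → 0    (exponent of 1.0 in single precision)
print(phi_ieee754(0))              # → -126
```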
To make the most of floating-point, the IEEE 754 standard normalises its significand. On top of the basic binary operation routines found in fixed-point, floating-point uses barrel shifters for the normalisation routines; these are expensive in hardware, and affect the addition/subtraction operations the most. The wider the exponent field c, the more hardware is required to perform normalisation.
Fixed-point
For the case of fixed-point, the number of exponent bits is c = 0 (i.e. there is no coded exponent bit). The recoding function therefore reduces to a fixed value ΦFX ∈ Z, with the base β = 2. Taking for example a fixed-point number with the format 〈n, p〉 (refer to Figure 2.1 for a diagram), the recoding function is given by:

ΦFX = −(n − p)

As there is no exponent field, there is no need for any kind of normalisation or denormalisation as in floating-point, and this contributes to its hardware simplicity.
Logarithmic Number System
Like conventional floating-point, the logarithmic number system (LNS) treats the exponent field in a similar way. The main difference is that its significand field has zero width (i.e. m = 0), with the whole number used for the exponent field. Here the recoding function is applied to the whole number in the manner

ΦLNS(Ec) = Ec           if Ec is two's complement
ΦLNS(Ec) = Ec − bias    if Ec is biased.

Similar to floating-point, addition and subtraction operations are expensive in hardware. It also suffers from precision loss depending on the algorithm used.
Level-Index
The repeated exponentiation number system, Level-Index (LI), initially proposed by [CO84], may also be generalised with the exponent recoding concept. The generalised exponential function in Section 2.4.8 is a recoding function with the base β = e. The recoding function ΦLI may be reinterpreted as follows:

ΦLI(Ec) = Ec                if 0 ≤ Ec < 1
        = e^ΦLI(Ec − 1)     if Ec ≥ 1

where Ec = L + I × 2^−i (refer to Figure 2.8 for notation). The repeated exponentiation definition of the recoding function gives an LI number plenty of dynamic range, but with extra complexity.
3.2.2 Discussion
As seen from the special cases, the recoding function can take any arbitrary form. For example, when c = 2 we could have each coded exponent value mapped to an arbitrary number, as in Example 3.1.
Example 3.1. In this example, we present a simple quad fixed-point number. Letting c = 2 gives four possible mappings for the coded exponent, Ec, such that

Φ(Ec) = −(n − p0)   if Ec = 0
      = −(n − p1)   if Ec = 1
      = −(n − p2)   if Ec = 2
      = −(n − p3)   if Ec = 3

This gives four different scalings of an ordinary fixed-point number, depicted in Figure 3.2 with their scalings aligned. Although this representation has not been built and tested, the complexity of performing arithmetic with it is expected to lie somewhere between fixed-point and floating-point. ¤
Figure 3.2: Example of quad fixed-point: four scalings of an n-bit signed significand, with binary points p0 to p3 aligned.
From an arithmetic hardware complexity point of view, the more choices the exponent value can take, the more complex the hardware implementation becomes, due to the need to align radix points. However, by adding a single exponent bit to an ordinary fixed-point representation, we are able to get an improved dynamic range without drastically increasing hardware cost. This is the basis for a new number representation, Dual FiXed-point.
3.3 Dual FiXed-point
3.3.1 Defining the Format
First proposed in [ECC04], the (n + 2)-bit Dual FiXed-point number representation consists of a single exponent bit, E, and an (n + 1)-bit significand, M, which includes the sign-bit. The significand is formatted in the same manner as ordinary two's complement fixed-point, with n being the wordlength excluding the sign-bit, and the binary point measured as a displacement from the right of the sign-bit towards the least significant bit (LSB). Here, the recoding function ΦDFX maps the single exponent bit to one of two exponent values. This gives the significand two possible scalings, and hence two ranges, with which to represent a number. Definition 3.2 formally defines the DFX number representation.
Definition 3.2. The representation of a real number, D, in the form given by (3.2) will be referred to as a Dual FiXed-point (DFX) number.

D = X · 2^ΦDFX(E)

ΦDFX(E) = −(n − p0)   if E = 0
        = −(n − p1)   if E = 1          (3.2)

where X is the signed significand, E is the exponent such that E ∈ {0, 1}, and p0 and p1 are binary points with p0 ≤ p1. The structure of DFX is illustrated in Figure 3.3(a). ¤
In order to achieve two different scalings, we define two binary points p0
and p1 with the condition that p0 ≤ p1 at all times. p0 and p1 represent the
displacement of the binary-point from the right side of the sign-bit towards the
LSB. We define p0 as the lower scaling parameter and it is used to scale the
Num0 number (Definition 3.3). Similarly, we define p1 as the upper scaling
Figure 3.3: Dual FiXed-point (DFX) number: (a) number format, with a 1-bit exponent E followed by the (n + 1)-bit signed significand X; (b) detailed structure of the DFX number format. The symbols • and ◦ represent the binary-point positions for Num0 and Num1 numbers respectively.
parameter for the Num1 number. Take note that p0 and p1 are allowed to
lie outside the number representation, i.e. p0, p1 < 0 or p0, p1 > n. A more
detailed DFX format structure is shown in Figure 3.3(b).
Definition 3.3. We define two types of numbers, Num0 and Num1. Num0
numbers lie in the range [−B,B) where B is a boundary (Definition 3.4) and
they are scaled by the lower scaling parameter p0. Num1 numbers lie in the
range (−∞,−B) or [B,∞) and they are scaled by the upper scaling parameter
p1. Hence Num0 is the lower range number and Num1 is the upper range
number.
Definition 3.4. With a view to simplifying the arithmetic units, the boundary is defined to be the next incremental value after the maximum positive number of Num0, i.e.

Boundary, B = 2^p0    (3.3)

¤
Having two different scalings requires a means of deciding the best scaling for a number. Knowing which range a number lies in determines the exponent that the number takes.
Definition 3.5. The choice of the exponent is determined by the value of the real number, D, such that

E = 0   if D ∈ Num0, i.e. −B ≤ D < B
  = 1   if D ∈ Num1, i.e. D < −B or D ≥ B          (3.4)

¤
There are some arithmetic computations, e.g. multiplication and addition, that can give intermediary results which do not conform to Definition 3.5, i.e. the value of the exponent bit does not match the value of the significand, and the number is improperly scaled. For example, take a DFX number Z whose format is 〈10, 2, 6〉, with boundary B = 2^2 = 4. Initially let Z = 6, which means that Z is an upper-range number, Num1, with exponent bit E = 1. Now multiply Z by 0.5. This gives the result Z = 3, which should be a lower-range number, Num0, with E = 0. However, immediately after the multiplication the exponent field has not yet been updated to reflect the change, and therefore at that point in time the number is improperly scaled. For all intents and purposes, a DFX number is expected to be properly scaled, as some of the algorithms used for the arithmetic operations rely on this.
Definition 3.6. A DFX number is said to be properly scaled if the number
strictly follows Definition 3.5. Therefore a properly scaled DFX number with
exponent bit E = 1 will not be in the Num0 number range and a DFX number
with E = 0 will not be in the Num1 number range. ¤
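Definitions 3.3 to 3.6 together fix how a value is encoded. A Python sketch of choosing the exponent bit, and the precision that follows, for a DFX 〈n, p0, p1〉 number (function name ours; this assumes the −(n − p) scaling convention used for fixed-point above):

```python
def dfx_encode(d, n, p0, p1):
    """Pick the exponent bit of a DFX <n, p0, p1> number per Definition 3.5:
    E = 0 (Num0, scaling p0) when -B <= d < B with B = 2**p0, else E = 1
    (Num1, scaling p1). Returns (E, LSB weight of the chosen scaling)."""
    B = 2.0 ** p0
    if -B <= d < B:
        return 0, 2.0 ** (p0 - n)     # fine precision near zero
    return 1, 2.0 ** (p1 - n)         # coarser precision, wider range

# The running example: Z = 6 in DFX <10, 2, 6> has boundary B = 4
print(dfx_encode(6, 10, 2, 6))        # → (1, 0.0625): a Num1 number
print(dfx_encode(3, 10, 2, 6))        # → (0, 0.00390625): after Z * 0.5
```

Re-running this selection after every operation is exactly the re-scaling step that keeps a DFX number properly scaled.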
Figure 3.4: Fully and non-fully represented DFX numbers: (a) a fully represented DFX number, with the Num0 and Num1 binary points aligned; (b) a non-fully represented DFX number, where aligning the binary points leaves a band of non-represented values between the two scalings.
In most cases, all real values within the representable range will be discretised to either the Num0 or Num1 range, as shown in Figure 3.4(a). However, consider the case shown in Figure 3.4(b): the range of values between 2^p0 and 2^(p1−n) has no representation. If a value lies within that range, the discretised DFX number would result in a value of zero, and there would be a noticeable discontinuity between the values in Num0 and Num1. For this reason, the following chapters on arithmetic modules and error analysis require that the DFX number is fully represented.
Definition 3.7. A DFX number is said to be fully represented if and only if
p1 − p0 ≤ n. ¤
A complete definition of a DFX number system requires three parameters: n for the wordlength of the significand, and the two binary points p0 and p1. The notation used from here onwards will be of the form DFX 〈n, p0, p1〉. In fact, a DFX signal is (n + 2) bits wide, with one bit for the sign and another for the exponent; this is not to be confused with the size of the significand without the sign-bit, n, which is what is meant by the wordlength. Do note that in the case where p0 = p1, the DFX representation will revert to ordinary
Table 3.1: Dynamic range comparison between DFX, fixed-point (FX), floating-point (FP) and logarithmic number system (LNS) for 32-bit and 16-bit formats.

Format    Param            Wordlength       Dynamic Range
DFX       〈30, 0, 30〉      32-bit           2^60 ≈ 361dB
DFX       〈30, 14, 26〉     32-bit           2^46 ≈ 276dB
FX        〈31, 0〉          32-bit           2^31 ≈ 187dB
FP        E:8 M:23         32-bit (IEEE)    2^254 ≈ 1529dB
LNS       I:8 F:23         32-bit           2^256 ≈ 1541dB
DFX       〈14, −5, 0〉      16-bit           2^19 ≈ 114dB
FX        〈15, 0〉          16-bit           2^15 ≈ 90dB
FP        E:4 M:11         16-bit           2^14 ≈ 84dB
LNS       I:4 F:11         16-bit           2^16 ≈ 96dB
fixed-point representation, in which case the exponent-bit serves no purpose
and can be discarded. The determination of the optimum choice of the n, p0 and
p1 parameters for a DSP design will be explored in Chapter 6.
3.3.2 Characteristics of DFX Number System
Dynamic Range
The smallest non-zero absolute value of a DFX number is 2^(p0-n) while the
largest absolute value is 2^p1, therefore the dynamic range of a DFX number is
given as

DFX dynamic range = 20 log10( 2^(n+p1-p0) ) dB    (3.5)
Having two possible scalings for a number means that DFX has better range
capability than ordinary fixed-point. This is clearly shown in Table 3.1 for
both the 32-bit and 16-bit formats. The DFX format 〈30, 0, 30〉 is an example
that shows the maximum dynamic range available to a fully-represented
32-bit DFX number. As expected, the fixed-point format has the smallest
dynamic range while floating-point and the logarithmic number system have the
largest. Both 32-bit DFX formats have a dynamic range between fixed-point
and floating-point (and the logarithmic number system). Having 8 bits for the
exponent of 32-bit floating-point and for the integer part of 32-bit LNS
gives them significantly more dynamic range than both fixed-point and DFX.
However, their 16-bit counterparts, with half as many exponent or integer
bits (4 bits), cannot rival the dynamic range of DFX 〈14, −5, 0〉.
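As an illustration, Equation (3.5) and the entries of Table 3.1 can be reproduced in a few lines of Python; `dynamic_range_db` is a hypothetical helper name, not part of any tool described in this thesis.

```python
import math

def dynamic_range_db(n, p0, p1):
    """Dynamic range of DFX <n, p0, p1> per Equation (3.5): the ratio of the
    largest magnitude 2**p1 to the smallest non-zero magnitude 2**(p0 - n)."""
    return 20 * math.log10(2 ** (n + p1 - p0))

print(round(dynamic_range_db(30, 0, 30)))   # DFX <30, 0, 30>: about 361 dB
print(round(dynamic_range_db(14, -5, 0)))   # DFX <14, -5, 0>: about 114 dB
print(round(dynamic_range_db(15, 0, 0)))    # fixed-point <15, 0>: about 90 dB
```

Setting p0 = p1 reduces the formula to the ordinary fixed-point case, matching the FX rows of the table.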
Precision and Finite Errors
Figure 3.5: Num0 and Num1 range in a DFX number. The Num0 range spans (−B, B) with boundary B = 2^p0; the Num1 range extends outward to ±2^p1.
The range and precision of Num0 and Num1 numbers are illustrated in
Figure 3.5. From the DFX definition, the range around zero is the Num0
number range and the outer range is Num1. Although Num1 numbers have
larger magnitude, Num0 numbers have finer precision.
A comparison between the precisions of 16-bit fixed-point, floating-point and
DFX number representations is given in Figure 3.6, where the precision in
significant bits is plotted as a function of absolute number value. For
floating-point, only the normalised representation is shown, hence the
number of significant bits is constant over the dynamic range: an 11-bit
mantissa plus one implicit bit. The exponent in floating-point determines the
dynamic range. For fixed-point, the number of significant bits drops by one
for approximately every 6dB decrease in the absolute number value. We
observe the same rate of decrease for DFX, except that there is a step change
when the absolute value transitions between DFX Num1 and DFX
Figure 3.6: Precision of number representations in significant bits as a function of absolute number value (in dB). The number representations shown are fixed-point 〈15, 0〉, floating-point E4:M11, and DFX 〈14, −5, 0〉.
Num0. At that point, the DFX number begins to provide better precision than
ordinary fixed-point with the same wordlength at low absolute values.
With two different scalings, a number represented in DFX has two different
precisions depending on its value. This means that any error statistics must
depend on the proportion of time the number spends in either the Num0 or the
Num1 range. Error, or quantisation error, is simply the difference between a
quantised number (e.g. fixed-point or DFX) and its equivalent infinite
precision real number. Let the probabilities of the number lying in the Num0
and Num1 ranges be represented by K and (1−K) respectively.
For DSP applications, the error statistics that interest us are the error
mean and error variances. Suffice to say for now that these two error statistics
are necessary to determine the output signal-to-noise ratio (SNR) of a system
which will be detailed in Chapter 5. For a number with DFX 〈n, p0, p1〉, using
Equation (5.3) we get the quantisation error statistics as shown in (3.6).
Error mean      μ_dfx = K μ0 + (1−K) μ1
Error variance  σ²_dfx = K(σ0² + μ0²) + (1−K)(σ1² + μ1²) − μ_dfx²    (3.6)

where μ0 = −(1/2)·2^(p0−n), μ1 = −(1/2)·2^(p1−n), σ0² = (1/12)·2^(2(p0−n)) and σ1² = (1/12)·2^(2(p1−n)).
Table 3.2: Comparing the precision errors between DFX and fixed-point. The DFX parameters are chosen to match a fixed-point 〈15, 0〉 dynamic range of ≈ 90dB and upper limit of 2^0.
DFX Truncation Mean Truncation Var
n p0 p1 K = 90% K = 99% K = 90% K = 99%
15 0 0 -1.53E-05 -1.53E-05 7.76E-11 7.76E-11
14 -1 0 -1.68E-05 -1.54E-05 1.22E-10 8.22E-11
13 -2 0 -1.98E-05 -1.57E-05 3.83E-10 1.10E-10
12 -3 0 -2.59E-05 -1.63E-05 1.59E-09 2.39E-10
11 -4 0 -3.81E-05 -1.75E-05 6.77E-09 7.94E-10
10 -5 0 -6.26E-05 -2.00E-05 2.82E-08 3.09E-09
Using (3.6), Table 3.2 shows the error mean and variance for various DFX
formats with the probability K = 90% and K = 99%. The DFX formats were
chosen to have a dynamic range of approximately 90dB, matching that of
fixed-point 〈15, 0〉. The table shows that the error statistics rely heavily on
the time the number spends in either of the ranges, and that the introduction
of the higher range increases the error mean and variance significantly.
Further widening of the two scalings will aggravate the errors observed.
Fortunately, the extra dynamic range gained from widening the two scalings
may permit shorter wordlengths, which translates to reduced area consumption.
The p0 parameter therefore gives an extra way to trade off area consumption
against error.
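The entries of Table 3.2 can be checked directly from (3.6). The sketch below is a plain Python model; the function name `dfx_error_stats` is illustrative, not from RightSize.

```python
def dfx_error_stats(n, p0, p1, K):
    """Truncation error mean and variance of DFX <n, p0, p1> per Equation
    (3.6); K is the probability of the value lying in the Num0 range."""
    mu0, mu1 = -0.5 * 2 ** (p0 - n), -0.5 * 2 ** (p1 - n)
    var0, var1 = 2 ** (2 * (p0 - n)) / 12, 2 ** (2 * (p1 - n)) / 12
    mu = K * mu0 + (1 - K) * mu1
    var = K * (var0 + mu0**2) + (1 - K) * (var1 + mu1**2) - mu**2
    return mu, var

mu, var = dfx_error_stats(14, -1, 0, K=0.90)   # second row of Table 3.2
print(f"{mu:.3g}, {var:.3g}")                  # -1.68e-05, 1.22e-10
```

With p0 = p1 the variance collapses to the familiar fixed-point truncation variance (1/12)·2^(2(p0−n)), matching the first row of the table.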
Hardware Simplicity
The added dynamic range over fixed-point does not come without a hardware cost
penalty. Consider, for example, ordinary floating-point: the normalising and
denormalising routines for its significand require expensive and slow barrel
shifters. Fortunately, DFX has only two possible scalings, and the shifting
required is known a priori, so it can be done merely with multiplexors (MUXs).
MUXs do add extra chip area compared to ordinary fixed-point, but they are
considerably cheaper and faster than the barrel shifters of floating-point.
To demonstrate this, we extended the 4-tap FIR filter results of Chapter 2
to include DFX. Table 3.3 tabulates these new results and Figure 3.7
illustrates them graphically. As before, the wordlength and scaling parameters
of DFX are chosen so that the dynamic ranges match as closely as possible
those of the other number representation designs (refer to Table 4.7, designs
1 to 7). As expected, the area and speed of the DFX filter implementations lie
between the fixed-point and floating-point implementations.
3.4 Summary
This chapter demonstrated Exponent Recoding (ER), a generalisation under which
common number representations are special cases. An arbitrary mapping
function, chosen at design time, is applied to the exponent field and provides
an extra level of indirection, allowing the signal precision and range to be
manipulated flexibly. From ER, a special case and new number representation
was derived.

Dual FiXed-point (DFX) was defined with only a single exponent bit, hence
only two exponent mapping values. It combines the simplicity of ordinary
fixed-point with extra dynamic range. The two-level precision of DFX and its
error statistics were briefly explored. Area and latency results of FIR
filters synthesized onto ASIC show that DFX designs lie between fixed-point
and floating-point designs.
Table 3.3: Area and critical path delay results for a 4-tap FIR filter with increasing wordlength, implemented using the UMC 0.13um High Density Standard Cell Library. This is an extension of the results in Table 2.3.
Design
Fixed-point Floating-point LNS DFX
Area CPD Area CPD Area CPD Area CPD
(cells) (ns) (cells) (ns) (cells) (ns) (cells) (ns)
1 9829 9.9 16898 51.4 16188 39.1 8756 23.6
2 11042 10.3 19710 54.3 23295 43.3 10472 26.5
3 13067 11.4 21992 58.5 31562 46.9 11623 27.9
4 14845 12.0 24854 61.7 42918 49.6 12366 29.1
5 17261 13.0 26461 62.9 57261 54.0 14811 31.4
6 18198 13.4 29685 63.4 77916 58.5 16312 33.3
7 20005 14.4 32481 70.2 107450 59.5 19342 35.7
Figure 3.7: An area vs critical path delay graph for Table 3.3, comparing fixed-point, floating-point, logarithmic number system and Dual FiXed-point (this thesis).
Chapter 4
DFX Modules and Arithmetic
Circuits
4.1 Introduction
In order to implement a system with the DFX number system, component libraries
must be created to support DFX arithmetic. Apart from the arithmetic modules,
supporting module libraries to handle signal quantisation, encoding and
decoding are also needed. These libraries can then be instantiated by a
synthesis system to create synthesizable hardware description language code,
and have their area consumption modelled to assist the wordlength and scaling
optimisation process (Chapter 6).
The arithmetic modules described here are targeted at the RightSize high-level
synthesis tool by Constantinides [Con03]. RightSize is able to manipulate the
basic arithmetic modules common to many DSP applications: the adder, the
constant coefficient (gain) multiplier and the full general multiplier. Before
proceeding further, the algorithm representation used in RightSize is detailed
in Section 4.2.
All the modules are designed in VHDL [IEE04] following the design objectives
and goals set out in Section 4.3. Sections 4.4 to 4.6 show working designs
that incorporate all the design objectives stated earlier. Each section
includes area and critical path delay comparisons with equivalent arithmetic
modules built with the other conventional number representations: fixed-point,
floating-point and LNS. Less area translates to cheaper hardware, and a lower
critical path delay means faster operation. The fixed-point arithmetic units
are readily available within the synthesis tools used, while the
floating-point and LNS module library used for comparison is the readily
available FPLibrary from [DdD].
Results shown are the place-and-routed results for both FPGA and ASIC
implementations. The FPGAs used are the Xilinx Virtex4 (XC4VLX15) and the
Altera Stratix2 (EP2S15). The Virtex4 series comes in versions with different
features; the XC4VLX15 model is optimised for logic, with 12,288 4-input
LUTs. Apart from LUTs, the XC4VLX15 also has XtremeDSP slices, each
consisting of an 18×18 multiplier, an adder and an accumulator, and block
RAMs for storage. The Stratix2 has similar features to the Virtex4, with the
EP2S15 containing 12,480 Adaptive LUTs (ALUTs). It also has DSP blocks
(similar to the XtremeDSP slices above) and TriMatrix memory blocks (similar
to the block RAMs above). The XtremeDSP and DSP blocks are highly suited to
multiply-accumulate functions, which are heavily used in DSP algorithms.
For the purpose of this thesis, however, only the LUTs and ALUTs of these
devices are used in the synthesis of the arithmetic modules. This enables a
fair comparison between the different hardware implementations without the
assistance of dedicated hardware such as built-in multiplier blocks.
The ASIC designs are implemented with the UMC 0.13um High Density
Standard Cell Library where each cell consumes a square micron of chip area.
The Synplify Pro synthesis flow is used for the FPGAs, followed by each
vendor's place-and-route tools, to obtain the area and critical path delay
results. For the ASIC, Synplify ASIC is used to obtain the area consumption
and critical path delay estimates.
The original contributions for this chapter are:
• design of multiple wordlength and scaling DFX arithmetic modules,
• simple interfacing with ordinary fixed-point,
• comparison made between DFX arithmetic modules and other popular
arithmetic modules on FPGAs and ASIC hardware platforms.
4.2 Algorithm Representation in RightSize
The formal data-flow graph (DFG) representation used in this thesis is
referred to as a computation graph (Definition 4.1).

Definition 4.1. A computation graph G(V, S) is the formal representation of
a design/algorithm. V is a set of graph nodes, each representing an atomic
computation or input/output port, and S ⊂ V × V is a set of directed edges
representing the data flow. An element of S is referred to as a signal. The
type of an atomic computation v ∈ V is given by type(v) in (4.1). The set S
must satisfy the constraints on in-degree and out-degree given in Table 4.1.
For each signal j = (vj1, vj2) ∈ S, vj1 is the driver node of signal j and
vj2 is the driven node.

type(v) : V → {primary in, primary out, adder, full mult, gain mult, delay, fork}    (4.1)

¤
Table 4.1: In and out degrees of nodes used in computational graph G(V, S).
type(v) in-degree(v) out-degree(v)
primary in 0 1
primary out 1 0
adder 2 1
full mult 2 1
gain mult 1 1
delay 1 1
fork 1 ≥ 2
The arithmetic operations currently implemented in RightSize and discussed in
this thesis are: the adder, general multiplication (full mult) and constant
coefficient multiplication (gain mult). The nodes are visualised in graphical
form as shown in Figure 4.1(a). Signals (directed edges) are represented by
arrows showing the direction of data flow. An example DFG is shown in
Figure 4.1(b).
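As a sketch of Definition 4.1 and the degree constraints of Table 4.1, a computation graph can be modelled and checked in a few lines of Python; the names below are illustrative, not RightSize internals.

```python
# Hypothetical model of Definition 4.1: node types with the in-/out-degree
# constraints of Table 4.1, checked over the signal set S.
DEGREES = {  # type: (in-degree, out-degree); fork instead allows >= 2 outputs
    "primary_in": (0, 1), "primary_out": (1, 0), "adder": (2, 1),
    "full_mult": (2, 1), "gain_mult": (1, 1), "delay": (1, 1),
}

def check_graph(types, signals):
    """types: node name -> type name; signals: set of (driver, driven) edges."""
    for v, t in types.items():
        ins = sum(1 for (_, d) in signals if d == v)
        outs = sum(1 for (s, _) in signals if s == v)
        ok = (ins == 1 and outs >= 2) if t == "fork" else (ins, outs) == DEGREES[t]
        if not ok:
            raise ValueError(f"degree violation at node {v} ({t})")
    return True

# A trivial graph: primary_in -> gain_mult -> primary_out
types = {"x": "primary_in", "g": "gain_mult", "y": "primary_out"}
print(check_graph(types, {("x", "g"), ("g", "y")}))  # True
```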
For the purpose of describing a computation graph with multiple wordlengths
and scalings in DFX, we define a DFX annotated computation graph
G(V, S, ADFX) (Definition 4.2).

Definition 4.2. A DFX annotated computation graph G(V, S, ADFX) is a
formal representation of the DFX implementation of a computation graph
G(V, S). ADFX is a tuple (n, p0, p1) of vectors n ∈ N^|S|, p0 ∈ Z^|S|, and
p1 ∈ Z^|S|, with each element in one-to-one correspondence with the elements
of S. Thus for each signal j ∈ S it is possible to refer to the corresponding
nj, pj0 and pj1, and the condition pj0 < pj1 applies. ¤
Figure 4.1: The graphical representation of a data-flow graph. (a) Nodes used in the data-flow graph: primary_in, primary_out, adder, full_mult, gain_mult, delay and fork; (b) example data-flow graph of a 4-tap direct form FIR filter.
4.3 Module Design Forethought and Criteria
Quantisation only at the end
First and foremost, the amount of error introduced should be as small as
possible while staying true to the DFX format specification. Any quantisation,
regardless of the operation and irrespective of the algorithm used, should be
applied only at the end of the module. This means there is no loss in
precision anywhere until the very last moment, which not only reduces the
output precision error but also makes the output error predictable,
simplifying error analysis.
The modules produce truncation errors, as rounding is not performed. Since
truncation is the least area-expensive method of quantisation [Fio98],
rounding has not been implemented for these modules. However, if rounding is
required, an extra addition block will need to be added before the final
output. The decision to round will depend on the various 'guard-bits' from
the inputs and/or result [ECC05]. Depending on the faithfulness of the
rounding performed, the number of guard-bits needed will vary.
Multiple wordlength and scaling
The RightSize tool's original function was to synthesize a fixed-point design
with multiple wordlengths for its signals. In keeping with the tool's original
intent, the DFX arithmetic modules are expected to deal not only with multiple
wordlengths but with multiple scalings as well. The arithmetic modules must
be fully parameterisable to allow an optimising routine to explore the whole
design space without hindrance.
Modularity
A modular design gives a designer the means to build up any system by
connecting the modules together (e.g. as a data-flow graph). The modular
building blocks also give the flexibility to extend DFX with further
arithmetic operations to complement the ones discussed in this chapter. Each
new module type needs a VHDL description, an area consumption model and an
error model to be incorporated into RightSize.
Integration with fixed-point
As DFX can readily interchange with fixed-point (p0 = p1), the module
libraries must be flexible enough to integrate and work alongside ordinary
fixed-point. The inputs and outputs should convert between DFX and ordinary
fixed-point without needing additional hardware in between. At the same time,
modules that encode and decode between DFX and ordinary fixed-point (the
environment interface) should be made as simple as possible to facilitate the
transition.
4.4 Building Blocks
The following modules are the building blocks used in most of the DFX mod-
ules. They are the Range-Detector, Encoder, Decoder and Recoder.
4.4.1 Range-Detector
Figure 4.2: DFX Range-Detector Module. The fixed-point input 〈nin, pin〉 produces the exponent bit E = 0 if −B ≤ Input < B, and E = 1 if Input < −B or Input ≥ B.
The function of the DFX Range-Detector block, shown in Figure 4.2, is to
determine which of the two ranges the tested signal lies in and to generate
the exponent bit, E. The input to this module is a fixed-point number with
format 〈nin, pin〉 (nin being the input wordlength and pin the position of its
binary point). This module is not a node type in RightSize as it is an
internal module used within other DFX modules.

The Range-Detector enforces the equations for DFX defined in (3.3) and
(3.4). The choice of range and boundary allows the operation to be simplified
to the logic operation given in (4.2), a simple sum-of-products expression. If
the tested input belongs to the Num0 range, all the bits above the boundary
will be 0s (for a positive input) or 1s (for a negative input), since the
input is a two's complement number. The bits of interest for detection are
shown in Figure 4.3. This operation can easily be performed by lookup tables
in FPGAs, or with AND and OR gates in an ASIC.
E̅ = ā_nin · ā_(nin−1) · … · ā_(nin−(pin−p0)) + a_nin · a_(nin−1) · … · a_(nin−(pin−p0))    (4.2)

where ā denotes the logical complement of bit a; that is, E = 0 exactly when the bits of interest are all 0s or all 1s.
Figure 4.3: Input bits the Range-Detector is interested in: the bits of input A at positions nin down to p0, i.e. the sign bit and all bits above the Num0 boundary.
4.4.2 DFX Encoder and Decoder
In order to utilise this number system, a method is needed to convert a number
from a known type to DFX. Currently, modules exist to encode to and decode
from two's complement ordinary fixed-point. Typically, the DFX Encoder and
Decoder are used in conjunction with the primary in and primary out nodes
respectively, when transitions between the fixed-point environment and DFX
are required. Apart from that, these modules are also used by DFX Adder v1
for the pre/post-adder re-scalings in Section 4.5.1, and the Encoder is also
used within the DFX Recoder. The VHDL interface entities for these modules
are given below.
ENTITY dfx_Encoder IS
GENERIC( In_n, In_p : INTEGER;
Out_n, Out_p0, Out_p1 : INTEGER );
PORT( dIn : IN std_logic_vector(In_n downto 0);
dOut : OUT std_logic_vector(Out_n+1 downto 0) );
END ENTITY dfx_Encoder;
ENTITY dfx_Decoder IS
GENERIC( In_n, In_p0, In_p1 : INTEGER;
Out_n, Out_p : INTEGER );
PORT( dIn : IN std_logic_vector(In_n+1 downto 0);
dOut : OUT std_logic_vector(Out_n downto 0) );
END ENTITY dfx_Decoder;
Figure 4.4: DFX Encoder block to convert from fixed-point to DFX. (a) Top-level diagram; (b) flow of data through the Encoder. The input is a fixed-point number with wordlength nin and binary point pin and the output is a DFX 〈n, p0, p1〉.
The DFX Encoder (shown in Figure 4.4) takes a fixed-point input with
wordlength nin and binary point pin and converts it to a DFX 〈n, p0, p1〉
output. It first feeds the input into a Range-Detector that determines the
appropriate output range and the output exponent bit; the multiplexor then
chooses the properly scaled signal for the output. Note that ">>" and "<<"
are the binary right and left shift operators respectively, which incur no
hardware cost as they are a matter of wiring, and "mod 2^(n+1)" simply
extracts the least significant (n + 1) bits of the signal.
Figure 4.5 shows the DFX Decoder block. It takes a DFX 〈n, p0, p1〉 input and
outputs a fixed-point number with wordlength (n + (p1 − p0)) and binary point
p1. The input is shifted and scaled according to the exponent bit. Again, the
shifting incurs no hardware cost; the main cost of the decoder is the
multiplexor.
Figure 4.5: DFX Decoder block to convert from DFX to fixed-point. (a) Top-level diagram; (b) flow of data through the Decoder. The input is a DFX 〈n, p0, p1〉 and the output is a fixed-point number with wordlength (n + (p1 − p0)) and binary point p1.
4.4.3 DFX Recoder
Converting a DFX number from one format to another is done by the DFX
Recoder. In a typical fixed-point case, no additional hardware is required
when a signal is changed from one fixed-point format to another (assuming
truncation), as any shifting is done through wiring. In DFX, when the input
and output DFX boundaries differ, a little care is required when performing
the conversion. Figure 4.6 shows the complete DFX Recoder, which consists of
two DFX Encoders and a MUX before the output. The input's Num0 and Num1
ranges can be treated separately as individual fixed-point numbers, so the
two Encoders are aligned to the input's Num0 and Num1 ranges respectively,
with their DFX outputs as 〈nout, pout0, pout1〉. The MUX at the output then
chooses the correct result depending on the input's exponent bit.
The VHDL interface entity for the DFX Recoder is given below:
ENTITY dfx_Recoder IS
GENERIC( In_n, In_p0, In_p1 : INTEGER;
Out_n, Out_p0, Out_p1 : INTEGER );
PORT( dIn : IN std_logic_vector(In_n+1 downto 0);
dOut : OUT std_logic_vector(Out_n+1 downto 0) );
END ENTITY dfx_Recoder;
Figure 4.6: DFX Recoder module with the flow of data through the module. The DFX input 〈nin, pin0, pin1〉 feeds two DFX Encoders, one per input range, and the exponent bit Ein selects the DFX output 〈nout, pout0, pout1〉.
The Recoder shown in Figure 4.6 is the complete version: it works for both
properly and improperly scaled DFX numbers, and the resulting DFX signal will
be properly scaled. Under normal circumstances, where DFX numbers are always
properly scaled, the Recoder can be simplified. Referring to Figure 4.7, the
simplified implementation depends on the DFX boundaries of the input and
output signals. If the input and output boundaries are the same, i.e.
Bin = Bout, no hardware is needed apart from wiring (Figure 4.7(a)).

If the output's boundary is less than the input's boundary, i.e. Bout < Bin,
every input Num1 will become an output Num1 provided the input is properly
scaled. Therefore we can simply feed the input to the output MUX when the
input is a Num1, after performing any necessary shifting. This leaves the
case where the input is a Num0, which can become either an output Num0 or
Num1 and is therefore handled by a DFX Encoder. Referring to Figure 4.7(b),
Figure 4.7: DFX Recoder blocks to convert between two different properly scaled DFX numbers, with the flow of data through the Recoder shown beneath each block. (a) When the DFX output boundary equals the DFX input boundary; (b) when the DFX output boundary is less than the DFX input boundary; (c) when the DFX output boundary is greater than the DFX input boundary.
we can see that this simplification results in one fewer DFX Encoder compared
to the complete Recoder.

Similarly, when the output's boundary is greater than the input's boundary,
i.e. Bout > Bin, every input Num0 will become an output Num0 provided the
input is properly scaled. This time, when the input is a Num0, it is fed
directly to the output MUX after performing the necessary shifting. The DFX
Encoder used in this case is tuned to the input's Num1 range. Figure 4.7(c)
shows this implementation.
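Behaviourally, recoding a properly scaled DFX value amounts to re-quantising it under the new format's boundary; the hardware achieves this with shifts, an Encoder and a MUX rather than arithmetic. A hypothetical Python model:

```python
import math

def dfx_recode(E, m, fin, fout):
    """Decode (E, m) under format fin = (n, p0, p1), then re-encode it under
    format fout -- a behavioural model only; real hardware uses no division."""
    n, p0, p1 = fin
    x = m * 2.0 ** ((p0 if E == 0 else p1) - n)
    n, p0, p1 = fout
    E_out = 0 if -(2.0 ** p0) <= x < 2.0 ** p0 else 1
    return E_out, math.floor(x / 2.0 ** ((p0 if E_out == 0 else p1) - n))

# Raising the boundary from 2**0 to 2**1: the Num1 value 1.5 under <14,0,5>
# becomes a Num0 under <14,1,5>, as described for the Bout > Bin case.
print(dfx_recode(1, 768, (14, 0, 5), (14, 1, 5)))   # (0, 12288)
```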
As RightSize uses multiple DFX wordlengths and scalings for its signals,
there is a need to recode any format changes between signals and after
arithmetic operations. The DFX Recoder is used after the node types
gain mult, full mult, fork and delay. Use of the Recoder for gain mult and
full mult will be explained in their respective sections.
In fixed-point systems, the fork and delay modules normally act as
'pass-throughs', with any shifting of the signals at the output dealt with by
careful wiring; this is not the case for DFX. Figure 4.8(a) shows an example
scenario with different input and output DFX formats for fork and delay, and
Figure 4.8(b) shows their actual implementation with the DFX Recoders
explicitly drawn. The fork example has three outputs X, Y and Z, where Y and
Z share the same boundary (i.e. BY = BZ); since they share the same boundary,
only one Recoder is needed for the two outputs. It should be noted that the
range of the input and outputs of a fork should be the same, hence in this
example pX1 = pY1 = pZ1 = pA1. The case for the delay node should be
self-explanatory.
Figure 4.8: DFX Recoder used with fork and delay. (a) fork and delay example; (b) the same example with the DFX Recoder "R" explicitly shown, where BY = BZ.
4.4.4 Area and Critical Path Delay Tables
Table 4.2 shows the area and critical path delay of the Encoder, Decoder and
Recoder modules. It can be seen that they do not take many resources to
perform the conversions to and from fixed-point (∼0.16% of FPGA chip area).
The most expensive is the Recoder, because of its Range-Detector and output
MUX (∼0.20% of FPGA chip area). The table does not include the
Range-Detector's area and critical path delay because it is never used by
itself and is always integrated into another module, for example the Encoder
and Recoder.
Table 4.2: Building block areas and critical path delays (CPD).

                                        Area and (CPD)
          DFX Parameters         Virtex4      Stratix2      ASIC
Modules                          LUTs (ns)    ALUTs (ns)    Cells (ns)
Encoder   〈19, 5〉→〈14, 0, 5〉     20 (2.535)   15 (2.000 ¦)  228 (1.233)
Decoder   〈14, 0, 5〉→〈19, 5〉     19 (1.827)   19 (2.000 ¦)  213 (0.974)
Recoder   〈14, 0, 5〉→〈14, 1, 5〉  28 (2.731)   16 (2.013)    304 (2.042)

¦ - Speed limited by place and route tool
4.5 DFX Adders
The DFX Adder module performs addition between two DFX numbers, and this
module implements the adder node in RightSize. DFX inputs with different
scalings need to be aligned to a common scaling before the addition can be
performed; this is done in the pre-adder stage. Once the inputs are aligned,
the addition is performed by an ordinary fixed-point adder. If the adder's
output is in DFX, the summation output has to be aligned by the post-adder
stage. The pre- and post-alignment stages are akin to those of a
floating-point adder [Kor02], but unlike floating-point, the number of bits
to shift is known a priori and only multiplexors are used to perform the
necessary shifting instead of expensive barrel shifters. As a result, the DFX
Adder is both smaller and faster than an equivalent floating-point adder.
There are two versions of the adder: DFX Adder Version I (v1) and DFX Adder
Version II (v2). The difference between the two is in the way they perform
pre-addition and post-addition alignment, which may result in different area
consumptions. Which adder gives the lower area consumption depends on its
input and output wordlength and scaling parameters; RightSize chooses the
adder that consumes the least area for the synthesized output. The VHDL
interface entity for Version I is shown below; Version II has an identical
interface.
ENTITY dfx_Adder_v1 IS
GENERIC ( a_n, a_p0, a_p1, b_n, b_p0, b_p1 : INTEGER;
s_n, s_p0, s_p1: INTEGER );
PORT( A : IN std_logic_vector( a_n+1 downto 0);
B : IN std_logic_vector( b_n+1 downto 0);
S : OUT std_logic_vector( s_n+1 downto 0) );
END ENTITY dfx_Adder_v1;
4.5.1 DFX Adder Version I (V1)
Figure 4.9: DFX Adder Version I (v1). (a) Top-level diagram: the Pre-Adder decodes both DFX inputs A and B to a common fixed-point format 〈nfx, pfx〉, a fixed-point adder computes the sum, and the Post-Adder encodes the result back to DFX 〈nS, pS0, pS1〉. (b) Example dataflow diagram.
As shown in the DFX Adder Version I design and example dataflow diagram of
Figure 4.9, the Pre-Adder consists of a DFX Decoder for each input,
converting to an ordinary fixed-point format 〈nfx, pfx〉. The fixed-point
wordlength, nfx, and scaling, pfx, are given by

nfx = max(nA + (pA1 − pA0), nB + (pB1 − pB0))
pfx = max(pA1, pB1)

where max() returns the larger of its two operands.
With the inputs aligned, the summation is done by an ordinary fixed-point
adder. To ensure no overflow of results, the output of the fixed-point adder
is one bit wider than its inputs; the fixed-point adder can therefore be at
most (nfx + 2) bits wide, and its inputs may need to be sign-extended.
Sign-extension requires no additional hardware apart from wire routing. After
summation, the output is fed into a DFX Encoder to produce a DFX output as
necessary.

DFX Adder v1 is the simpler of the two DFX Adders, design-wise. It has
better area consumption than Version II when the cost of the multiplexors is
far greater than the cost of the fixed-point adder. It suffers when the
difference between p1 and p0 gets large, because the ordinary fixed-point
adder has to increase in width, which increases both the area consumption and
the critical path delay. For the size and delay comparison of this adder
block and others, refer to the end of this section.
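The internal fixed-point format of Adder v1, and its sensitivity to the p1 − p0 gap, can be illustrated with a short sketch; `v1_fixed_point_format` is an illustrative name, not a RightSize function.

```python
def v1_fixed_point_format(a, b):
    """Common fixed-point format <nfx, pfx> used inside DFX Adder v1,
    for inputs a = (nA, pA0, pA1) and b = (nB, pB0, pB1)."""
    (nA, pA0, pA1), (nB, pB0, pB1) = a, b
    nfx = max(nA + (pA1 - pA0), nB + (pB1 - pB0))
    pfx = max(pA1, pB1)
    return nfx, pfx

# Widening p1 - p0 directly inflates the internal adder width:
print(v1_fixed_point_format((14, 0, 5), (14, 0, 5)))    # (19, 5)
print(v1_fixed_point_format((14, 0, 12), (14, 0, 5)))   # (26, 12)
```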
4.5.2 DFX Adder Version II (V2)
Figure 4.10 shows the top level and an example dataflow of DFX Adder
Version II, and Figure 4.11 shows its Pre- and Post-Adders. At first glance,
this version seems more complicated. However, for certain input and output
parameters, Version II has an area and speed advantage over Version I. To
achieve this, some design choices and limitations are imposed on this adder.
Firstly, the adder’s inputs, Afx and Bfx, are scaled to either of the two DFX
ranges of output S, i.e. SNum0 (〈nS, pS0〉) or SNum1 (〈nS, pS1〉). Whenever both
input exponents are zero, the inputs are aligned to SNum0. For all other input
exponent combinations, the inputs are aligned SNum1. Table 4.3 summarises
the input alignment combinations. These alignment rules mean that each input
will need only be aligned in three different ways using multiplexors (as seen
Figure 4.10: DFX Adder Version II (v2). (a) Top-level diagram: a Pre-adder block aligns the DFX inputs A and B, a fixed-point adder computes the sum, and a Post-adder block produces the DFX output S, with gBits carrying the (pS1 − pS0) bits shifted out during alignment. (b) Example dataflow diagram.
88
[Figure 4.11: DFX Adder (v2) (a) Pre-Adder and (b) Post-Adder diagrams.]
in Figure 4.11(a)). The 〈nx, px〉 → 〈ny, py〉 operation involves shifting and/or
truncating the signal from 〈nx, px〉 to 〈ny, py〉. Whenever the input exponents
are different, gBits retains the input bits that are shifted out, ensuring no loss of precision if the addition results in a SNum0.
Table 4.3: Scaling of the inputs before the fixed-point adder.

Aexp  Bexp | Afx scaling  Bfx scaling
  0     0  |   SNum0        SNum0
  0     1  |   SNum1        SNum1
  1     0  |   SNum1        SNum1
  1     1  |   SNum1        SNum1
Secondly, the output pS0 is chosen so that when both inputs are in their Num0 ranges, the output will always be a SNum0 (refer to the dataflow diagram in Figure 4.10(b) for clarification). Therefore, when both inputs are in their Num0 ranges, the summation result can be passed straight to the output, and post-adder alignment is only needed when the inputs have different scalings. By letting pS0 = max(pA0, pB0) + 1, the sum is guaranteed to be a SNum0 when both inputs are in their Num0 ranges, but experimental results have shown that this can be a very conservative solution. Provided that we have knowledge of the adder’s inputs, we can obtain a more suitable pS0. Section 6.3 details this method, which uses the joint probability distribution statistics of the inputs.
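The worst-case guarantee behind pS0 = max(pA0, pB0) + 1 can be checked numerically. The sketch below is illustrative Python, not part of the thesis toolflow, and assumes (as a simplification) that a Num0 value with scaling p0 has magnitude below the boundary 2^p0:

```python
import itertools

def num0_bound(p0):
    # Assumed boundary of a Num0 range with scaling p0: |x| < 2**p0.
    return 2.0 ** p0

def sum_fits_num0(pA0, pB0, pS0):
    # Worst-case |a + b| over a and b anywhere in their Num0 ranges.
    return num0_bound(pA0) + num0_bound(pB0) <= num0_bound(pS0)

# pS0 = max(pA0, pB0) + 1 always covers the worst case, since
# 2**pA0 + 2**pB0 <= 2 * 2**max(pA0, pB0) = 2**(max(pA0, pB0) + 1),
# while one bit less never does (the smaller term always overflows it).
for pA0, pB0 in itertools.product(range(-6, 7), repeat=2):
    assert sum_fits_num0(pA0, pB0, max(pA0, pB0) + 1)
    assert not sum_fits_num0(pA0, pB0, max(pA0, pB0))
```

The rule is conservative precisely because it assumes both operands can simultaneously sit at their range maxima; the profiled joint statistics of Section 6.3 relax this.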
Without these conditions, the post-adder block would have to handle 4 different scalings (2 for each input) and up to 8 different output rescalings (the output has 2 different scalings). The restrictions placed on this version of the adder simplify the Post-Adder block, making it fairly similar to the DFX Encoder.
Table 4.4: Area and critical path delay tables for 16-bit adder comparisons.

Adder area:
Format  Params      Virtex4 LUTs (Slices)  Stratix2 ALUTs (ALMs)  ASIC Cells
DFX v1  〈14, 0, 5〉   56 (44)                59 (34)                1223
DFX v2  〈14, 0, 5〉   55 (44)                54 (35)                1089
FX      〈15, 0〉      16 (8)                 16 (8)                 449
FP      E:4 M:11     271 (144)              269 (143)              5049
LNS     I:4 F:11     2127 (1134)            998                    23490

Adder critical path delay (ns):
Format  Params      Virtex4  Stratix2  ASIC
DFX v1  〈14, 0, 5〉   6.524    5.200     8.721
DFX v2  〈14, 0, 5〉   6.588    4.476     7.718
FX      〈15, 0〉      3.026    2.000 ¦   2.390
FP      E:4 M:11     20.733   13.923    13.607
LNS     I:4 F:11     19.277   5.711     18.842

¦ Speed limited by place and route tool
4.5.3 Area and Critical Path Delay Tables
Table 4.4 shows the comparison between the two versions of DFX Adders
and other popular number formats all designed to be 16-bit wide and with
similar dynamic range. As expected, the DFX Adder’s hardware simplicity relative to floating-point means that its area consumption and critical path delay lie between those of fixed-point and floating-point. DFX Adders for FPGAs (Virtex4 and Stratix2) are about 3.5 times larger and 2.0 times slower than equivalent fixed-point adders. In contrast, they are nearly 5.0 times smaller and 3.0 times faster than equivalent floating-point adders. For ASIC, DFX Adders are about 2.5 times larger and 3.5 times slower than fixed-point, while being about 4.5 times smaller and 2.0 times faster than floating-point. The LNS addition was performed using approximations from look-up tables (ROM), and hence more hardware area is dedicated to the additional memory. Because of this, the LNS adder consumes the largest area overall.
In this example implementation, the DFX Adder Version 2 has a slight advantage over Version 1 in terms of both area and critical path delay. This is mainly due to the large difference between p1 and p0, which makes increasing the length of the ordinary fixed-point adder in Version 1 more expensive than providing more multiplexors in Version 2.
It can be seen that the latencies of the FPGA designs are shorter than those of the ASIC. This is mostly because both FPGA families are fabricated on a 90nm process, while the ASIC is synthesized using a 130nm process library.
4.6 DFX Multipliers
Unlike the DFX Adder, DFX Multipliers do not need any pre-alignment stage for their inputs. The inputs can be multiplied together by a fixed-point multiplier regardless of their scalings, and the product from the fixed-point multiplier is fed through a DFX Recoder before being sent to the output. As with the DFX Adder, the multiplier can take any wordlength and precision parameters, even if they are fixed-point.
The FPGA multipliers are implemented with logic blocks only, and not with any embedded multiplier/DSP blocks, to provide a fair comparison alongside the ASIC implementations. These results remain representative, as designers may have limited resources on chip or may be working on devices without embedded multipliers, such as low-cost FPGAs [Xila], [Alta] or CPLDs.
Two types of multipliers are made: a gain multiplier and a full multiplier.
[Figure 4.12: DFX Gain Multiplier. (a) Top Level diagram. (b) Example dataflow.]
4.6.1 DFX Gain Multiplier
The DFX Gain Multiplier (Figure 4.12) takes a DFX input and multiplies it by a constant fixed-point coefficient. This is particularly useful in applications such as filtering, where one of the operands is a constant. This module is used by the gain mult node in RightSize, and its VHDL interface entity is given below. The wordlength and scaling of the multiplier coefficient should be optimised so as not to have any unnecessary leading sign bits or trailing zeros.
ENTITY dfx_Mult_Gain IS
GENERIC ( a_n, a_p0, a_p1, q_n, q_p0, q_p1 : INTEGER;
mult : REAL; mult_n, mult_p : INTEGER );
PORT ( A : in STD_LOGIC_VECTOR( a_n+1 downto 0);
Q : out STD_LOGIC_VECTOR( q_n+1 downto 0) );
END ENTITY dfx_Mult_Gain;
Unlike the DFX Adder, the inputs to the multiplier do not need aligning and can be multiplied together with an ordinary fixed-point multiplier. Since the input is a DFX number, the multiplier product Prod will also be a DFX number. Therefore Prod has to be recoded to the desired output DFX format. Consider the multiplication of a DFX 〈nA, pA0, pA1〉 number with a fixed-point 〈nm, pm〉 constant (nm is the wordlength of the constant multiplier m and pm is its fractional length). The constant multiplier should be optimally scaled, i.e. pm = ⌊log2(m)⌋ + 1. After multiplication, the product will be in an intermediate format 〈n∗, p∗0, p∗1〉, where n∗ = nA + nm, p∗0 = pA0 + pm and p∗1 = pA1 + pm. This DFX signal may be improperly scaled, and a full DFX Recoder would then be needed to convert the output to a properly scaled DFX number. However, by placing some restrictions on the design, the full DFX Recoder is not required. Consider the case m < 1.0: if we can ensure that the product of ANum0 with the coefficient always results in an output QNum0, the Recoder needed is the one shown in Figure 4.7(c), which requires less hardware area. This can be achieved by making sure that the output’s pQ0 ≤ pA0 + pm, provided that the multiplier coefficient is properly optimised. Similarly, the same area reduction is obtained when m > 1.0 by requiring the output’s pQ0 ≥ pA0 + pm, which ensures that the product of ANum1 with the coefficient always results in an output QNum1; the Recoder needed is then the one shown in Figure 4.7(b). These limitations do not compromise the operation of the circuit, and the RightSize tool imposes them to reduce the area consumption (Section 6.3).
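The scaling bookkeeping behind these restrictions can be sketched as a small model. This is illustrative Python, not the thesis toolflow; `coeff_scaling` implements the optimal coefficient scaling pm = ⌊log2(m)⌋ + 1 quoted above, and `full_recoder_needed` encodes the two output-scaling conditions:

```python
import math

def coeff_scaling(m):
    # Optimal scaling of the constant coefficient: pm = floor(log2(|m|)) + 1.
    return math.floor(math.log2(abs(m))) + 1

def product_format(nA, pA0, pA1, nm, m):
    # Intermediate product format: n* = nA + nm, p*0 = pA0 + pm, p*1 = pA1 + pm.
    pm = coeff_scaling(m)
    return nA + nm, pA0 + pm, pA1 + pm

def full_recoder_needed(m, pA0, pQ0):
    # The cheap Recoder suffices when pQ0 <= pA0 + pm (for m < 1.0)
    # or pQ0 >= pA0 + pm (for m > 1.0); otherwise a full Recoder is required.
    pm = coeff_scaling(m)
    if abs(m) < 1.0:
        return not (pQ0 <= pA0 + pm)
    return not (pQ0 >= pA0 + pm)
```

For example, m = 0.3 gives pm = −1, so with pA0 = 0 any output with pQ0 ≤ −1 avoids the full Recoder.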
4.6.2 DFX Full Multiplier
[Figure 4.13: DFX Full Multiplier. (a) Top-level diagram. (b) Example dataflow.]
A DFX Full Multiplier (shown in Figure 4.13(a)) is fairly similar to the gain multiplier, but it multiplies two DFX numbers together. The full mult node of the RightSize tool uses this module, and its VHDL interface entity is given below.
ENTITY dfx_Mult_full IS
GENERIC ( a_n, a_p0, a_p1, b_n, b_p0, b_p1 : INTEGER;
q_n, q_p0, q_p1: INTEGER );
PORT( A : IN std_logic_vector( a_n+1 downto 0);
B : IN std_logic_vector( b_n+1 downto 0);
Q : OUT std_logic_vector( q_n+1 downto 0) );
END ENTITY dfx_Mult_full;
Because both inputs to this multiplier can be DFX, the intermediate product can have up to four different scalings, pI, pII, pIII and pIV, with wordlength n∗ (i.e. a quad fixed-point), where n∗ = nA + nB, pI = pA0 + pB0, pII = pA0 + pB1, pIII = pA1 + pB0 and pIV = pA1 + pB1. If we were to keep the number in this form, further computations down the line would suffer from great complexity; that discussion is beyond the scope of this thesis and has not been explored. Since there are 4 different scalings for the intermediate result, and each may need to be recoded to either of the 2 output scalings, the post-multiplier block may have to handle up to a maximum of 8 different output shifts to recode the output to a properly scaled DFX signal. This operation needs two Recoders and a multiplexor.
Referring to Figure 4.13(a), Recoder 1 assumes that its input (the intermediate result Prod) is a 〈n∗, pI, pII〉 DFX number, while Recoder 2 assumes a 〈n∗, pIII, pIV〉 DFX number. For Recoder 1, the difference between pI and pII is the addition of either pB0 or pB1, and the same holds for Recoder 2’s pIII and pIV. Hence the exponent bit of input B is used as the exponent bit of the intermediate result Prod. The outputs of both Recoders are 〈nQ, pQ0, pQ1〉 numbers, but their inputs differ by either pA0 or pA1. Therefore the select signal for the output MUX is input A’s exponent bit.
The scenario described above is the extreme case, where the post-multiplier block has to cope with 8 different output shifts. As we shall see in Example 4.1, the output of the multiplier may be simplified to only a single Range-Detector and a MUX.
Example 4.1. Consider a full multiplier with the same input and output format 〈n, 0, k〉, where k ∈ Z+. With p0 = 0, the maximum magnitude of the Num0 range is ≤ 1 and the Num1 range is > 1. Referring to the example dataflow in Figure 4.13(b), when both inputs are Num1, the output can never become a Num0 and always shifts right by k bits to a Num1. Similarly, when both inputs are Num0, the output is always a Num0 and no shift is required. When the inputs differ in their scalings, the output will be either a Num0 (shift right by k bits) or a Num1 (no shift required), depending on the value of the product Prod. Therefore, when the input exponents are different, only a k-bit right shift is needed, and a single MUX is capable of doing this together with the help of a Range-Detector to determine the correct range. □
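The shift selection of Example 4.1 can be captured in a few lines. This is an illustrative behavioural model in Python, not the VHDL module; the intermediate-scaling bookkeeping is inferred from the example, and the Range-Detector is modelled with a simple magnitude test against 1:

```python
def full_mult_shift(a_exp, b_exp, prod, k):
    # Right shift (in bits) applied after the fixed-point multiply for a
    # full multiplier whose input and output formats are all <n, 0, k>.
    # Each Num1 input contributes k to the intermediate scaling.
    inter = (k if a_exp else 0) + (k if b_exp else 0)
    if a_exp and b_exp:
        out = k                            # both Num1: output is always a Num1
    elif not (a_exp or b_exp):
        out = 0                            # both Num0: output is always a Num0
    else:
        out = k if abs(prod) > 1.0 else 0  # mixed: Range-Detector decides
    return inter - out                     # shift right by the scaling reduction

assert full_mult_shift(1, 1, 4.0, 5) == 5   # Num1 x Num1: always shift by k
assert full_mult_shift(0, 0, 0.25, 5) == 0  # Num0 x Num0: never shift
assert full_mult_shift(1, 0, 2.0, 5) == 0   # mixed, product > 1: stays Num1
assert full_mult_shift(1, 0, 0.5, 5) == 5   # mixed, product <= 1: k-bit shift
```

Note that in the mixed cases the shift is either 0 or k, which is why one MUX plus a Range-Detector suffices.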
4.6.3 Area and Critical Path Delay Tables
As seen in Table 4.5, the gain multipliers for DFX, fixed-point and floating-point are roughly the same size on all hardware platforms when synthesized. Fixed-point is clearly the quickest of the three, with floating-point being the slowest. The full multiplier areas (Table 4.6), on the other hand, show that fixed-point is the largest, followed by DFX. This is because the mantissa of floating-point is only 11 bits wide, compared to the 16 bits of fixed-point. However, the post-multiplication normalisation of floating-point and the recoding of DFX mean that the fixed-point full multiplier is still the quickest of the three. LNS multiplication is a fairly simple process, as discussed in the background, hence the small area requirement for the LNS gain and full multipliers.
Generally, the DFX multipliers have an area consumption and critical path delay between those of fixed-point and floating-point [ECC04]. It should
Table 4.5: Area and critical path delay (CPD) tables for gain multiplier comparisons.

Gain Multiplier area:
Format  Params      Virtex4 LUTs (Slices)  Stratix2 ALUTs (ALMs)  ASIC Cells
DFX     〈14, 0, 5〉   113 (61)               127 (67)               2561
FX      〈15, 0〉      98 (58)                113 (57)               2551
FP      E:4 M:11     105 (63)               97 (49)                2627
LNS     I:4 F:11     18 (9)                 18 (9)                 292

Gain Multiplier CPD (ns):
Format  Params      Virtex4  Stratix2  ASIC
DFX     〈14, 0, 5〉   10.049   7.010     10.075
FX      〈15, 0〉      7.527    2.000 ¦   8.740
FP      E:4 M:11     11.362   9.370     10.881
LNS     I:4 F:11     3.718    3.201     2.483

¦ Speed limited by place and route tool
Table 4.6: Area and critical path delay (CPD) tables for full multiplier comparisons.

Full Multiplier area:
Format  Params      Virtex4 LUTs (Slices)  Stratix2 ALUTs (ALMs)  ASIC Cells
DFX     〈14, 0, 5〉   256 (140)              265 (139)              7500
FX      〈15, 0〉      257 (114)              279 (140)              8077
FP      E:4 M:11     187 (115)              195 (104)              5312
LNS     I:4 F:11     22 (12)                22 (12)                553

Full Multiplier CPD (ns):
Format  Params      Virtex4  Stratix2  ASIC
DFX     〈14, 0, 5〉   7.453    8.333     11.861
FX      〈15, 0〉      6.785    2.000 ¦   9.325
FP      E:4 M:11     15.343   11.287    12.146
LNS     I:4 F:11     3.954    3.510     5.616

¦ Speed limited by place and route tool
be noted that the multipliers synthesized here on FPGAs are limited to using look-up tables (LUTs), although most modern FPGAs have dedicated built-in multiplier blocks.
4.7 Discussion and Further Comparisons
The area and critical path delay tables shown in the individual module sections earlier were all 16-bit implementations. They provide a glimpse of the differences between the number representations. Figure 4.14 shows the place-and-routed resource usage against varying wordlengths for the arithmetic modules described, for the ASIC implementation. This plot is representative of the general shape of the plots obtained for all other platforms. Parameters for the fixed-point, floating-point and logarithmic number systems were chosen to match the dynamic range of DFX as closely as possible (Table 4.7). In addition, the wordlengths of the floating-point and LNS designs are matched with their DFX equivalents. The LNS formats tested are limited by the LNS library used [CDdD06], but the limited data points are sufficient to show the general trend.
In the case of adders, the area of the DFX Adders lies between the fixed-point and floating-point implementations. We can see a large difference between the floating-point and fixed-point adders; DFX, on the other hand, mirrors fixed-point with just a slight added overhead. The LNS adders are way off the chart, reaching about 50k cells at a wordlength of 18 bits, which is not surprising. Addition in fixed-point is definitely the cheapest in terms of hardware area.
For multipliers, the trends are reversed. In order to maintain the same level of dynamic range as DFX, the fixed-point multipliers are now the most expensive to implement. DFX maintains its position between fixed-point and
[Figure 4.14: Module comparisons with similar dynamic range implemented in ASIC: (a) Adders, (b) Gain multipliers, (c) Full multipliers. The parameters used for each number representation are shown in Table 4.7.]
Table 4.7: The parameters used to generate the results of Figure 4.14. The dynamic range (DR) is given in dB.

        DFX               FX            FP            LNS
Design  n   p0  p1   DR | n   p   DR | E  M   DR  | I  F   DR
  1      9  0   5    84 | 14  0   84 | 4   6   90 | 4   6   96
  2     10  0   5    90 | 15  0   90 | 4   7   90 | 4   7   96
  3     11  0   5    96 | 16  0   96 | 4   8   90 | 4   8   96
  4     12  0   5   102 | 17  0  102 | 4   9   90 | 4   9   96
  5     13  0   5   108 | 18  0  108 | 4  10   90 | 4  10   96
  6     14  0   5   114 | 19  0  114 | 4  11   90 | 4  11   96
  7     15  0   5   120 | 20  0  120 | 4  12   90 | 4  12   96
  8     16  0   5   126 | 21  0  126 | 4  13   90 | 4  13   96
  9     17  0   5   132 | 22  0  132 | 4  14   90 | -   -   -
 10     18  0   5   138 | 23  0  138 | 5  14  187 | -   -   -
 11     19  0   5   144 | 24  0  144 | 5  15  187 | -   -   -
 12     20  0   5   151 | 25  0  151 | 5  16  187 | -   -   -
floating-point. As expected from LNS, its multipliers are cheapest in terms of
hardware area.
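The DR column values in Table 4.7 are consistent with a simple octave count at 20·log10(2) ≈ 6.02 dB per bit. The Python sketch below is a hedged reconstruction of that bookkeeping (the thesis's exact DR definitions are not reproduced here, so the n + p1 − p0 octave count for DFX is an assumption that happens to match the tabulated values):

```python
import math

def dr_db(octaves):
    # Dynamic range of a 2**octaves max/min magnitude ratio, in dB.
    return 20.0 * math.log10(2.0 ** octaves)

def dfx_dr(n, p0, p1):
    # Assumption: DFX reaches p1 - p0 octaves beyond fixed-point of width n.
    return dr_db(n + p1 - p0)

def fx_dr(n, p):
    # Assumption: fixed-point dynamic range depends only on the wordlength n.
    return dr_db(n)

# Spot-check a few rows of Table 4.7.
assert round(dfx_dr(9, 0, 5)) == 84 and round(fx_dr(14, 0)) == 84    # Design 1
assert round(dfx_dr(14, 0, 5)) == 114 and round(fx_dr(19, 0)) == 114  # Design 6
```

This explains why the FX wordlength in each row is the DFX wordlength plus the p1 − p0 = 5 scaling separation.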
Figure 4.15 shows the area of DFX modules synthesized onto ASIC. Their wordlengths n and lower scaling parameters p0 are varied while the upper scaling is kept constant at p1 = 8. We can see that if the wordlength is kept constant, changing from fixed-point to DFX always incurs added hardware cost. Results show similar trends when implemented on FPGAs.
[Figure 4.15: Area of DFX modules implemented in ASIC with p1 = 8. (a) DFX Adder V1. (b) DFX Adder V2. (c) DFX Gain-Multiplier. (d) DFX Full-Multiplier.]
Looking at the results from a different point of view, we now compare DFX arithmetic modules with fixed-point arithmetic modules while keeping the dynamic range consistent at ∼90dB throughout. Fixed-point is chosen for comparison since the DSP community tends to prefer building DSP solutions around a fixed-point number representation [Smi97] for its superior area and speed performance. The first line of Table 4.8 shows the fixed-point 〈15, 0〉 implementation in ASIC of each of the modules, and the lines below it are DFX formats with the same dynamic range. The area results from the table are plotted relative to fixed-point in Figure 4.16. Once again, results show similar trends when implemented on FPGAs.
Given the small number of DFX formats examined, the DFX Adders are
Table 4.8: Comparing fixed-point and DFX arithmetic module implementations on ASIC (# of cells) with the dynamic range fixed at ∼90dB. Fixed-point is the first result line, where p1 = p0 = 0.
DFX Format Adder v1 Adder v2 Gain Mult Full Mult
〈15, 0, 0〉 453 454 2497 8082
〈14,−1, 0〉 1011 988 2514 7574
〈13,−2, 0〉 997 949 2037 6630
〈12,−3, 0〉 985 897 1865 5721
〈11,−4, 0〉 975 847 1692 4845
〈10,−5, 0〉 957 805 1298 4088
〈9,−6, 0〉 945 759 1199 3396
〈8,−7, 0〉 926 705 855 2737
always more expensive than equivalent fixed-point adders. The area saved by reducing the wordlength is not sufficient to cover the added overhead of the DFX pre- and post-alignment. However, the opposite holds for the multipliers: the area reduction of the fixed-point multiplier is more than sufficient to cover the added area of the post-multiplier blocks needed by DFX.
We can further deduce from Figure 4.16 that optimising a design with DFX Adders to reduce area may lead to a non-convex search space (one with more than one minimum point). Keeping the dynamic range constant while increasing the separation between p0 and p1, we can see the Adder’s size increase before gradually decreasing.
[Figure 4.16: Sizes of DFX arithmetic modules (Adder v1, Adder v2, Gain Multiplier, Full Multiplier) relative to their fixed-point equivalents (DFX Area / FX Area).]
4.8 Summary
This chapter introduced the basic arithmetic modules built for DFX in VHDL. At present, these fully parameterisable modules are incorporated into the RightSize tool, and they can be used in any hardware design flow that accepts VHDL library files. Apart from being fully parameterisable, attention has been given to ensuring that any truncation errors will only occur at the end of a module, to reduce the overall output error.
Being fully parameterisable also means that mixing DFX with ordinary fixed-point is done by merely equating the p0 and p1 parameters of the inputs and outputs. The interfacing modules (encoder and decoder) between DFX and ordinary fixed-point are built from the basic building blocks mentioned.
Although all the designs shown perform truncation as the means of quantising the output, rounding can easily be incorporated at the cost of additional hardware [ECC05]. Similarly, the DFX modules are not limited to the ones shown, and other arithmetic operations can be added without difficulty. Typically, the arithmetic operation is performed with an ordinary fixed-point equivalent operator, with the required input and output scalings applied as necessary.
The arithmetic modules were compared with the arithmetic modules of other conventional number representations. DFX arithmetic modules were found to have an area consumption and speed between those of fixed-point and floating-point arithmetic modules. Hence, as we will see in Chapter 6, provided that the parameters are chosen well, the area consumption of DFX can be less than that of fixed-point.
As the majority of application-specific DSP solutions rely on fixed-point, the rest of this work concentrates solely on improving upon fixed-point implementations. From the comparison made between fixed-point and DFX, it can be seen that DFX has an advantage where multiplication is concerned and a certain dynamic range is required.
Chapter 5
Modelling Noise at the Outputs
of a DFX System
5.1 Introduction
Since our emphasis in this work is on DSP solutions, we are interested in the signal-to-noise ratio (SNR), sometimes known as the signal-to-quantisation-noise ratio (SQNR). It is a well-accepted metric in the DSP community for measuring the quality of a finite precision algorithm implementation [Mit98]. Conceptually, the output sequence at each system output resulting from a finite precision implementation can be subtracted from the equivalent sequence resulting from an infinite precision implementation; the difference obtained is known as the finite precision error. The signal-to-noise ratio is therefore the ratio of the output power resulting from a signal with infinite precision over the power of the finite precision error.
The original RightSize synthesis tool takes the SNR as a user constraint to
guide the fixed-point wordlength optimisation procedure [Con03]. This feature
has been extended in this thesis to incorporate DFX. For the purpose of this
thesis, the signal power at each output is fixed since it is determined by a
combination of the input signal statistics and the computation graph G(V, S).
The optimisation procedure discussed in Chapter 6 requires accurate error models as a prerequisite to exploring the different implementations of a DFX-annotated computation graph, G(V, S, ADFX). It is therefore the purpose of this chapter to concentrate on noise modelling.
Oppenheim and Weinstein [OW72] and Liu [Liu71] previously laid down models for quantisation errors, together with their propagation through linear time-invariant (LTI) systems. In their models, an error signal is added to each signal that is truncated or rounded, and the errors are assumed to be uniformly distributed and uncorrelated with one another. These models worked well for uniform wordlength designs, as the SNR typically degrades by approximately 6 dB for each bit of wordlength reduction, and the prediction of errors did not require highly accurate models.
As hardware designers aggressively push to reduce the cost of their designs, they naturally turn to a multiple wordlength design paradigm [CCL01]. With multiple wordlengths, designers have greater flexibility to finely adjust the implementation error power, and the traditional error models are no longer sufficiently accurate. Constantinides introduced an error model to address this issue, using a discretised probability distribution function for the truncated LSB bits [CCL00]. The added scaling parameter in DFX, together with the boundary conditions, makes DFX a scaled number representation, and the error models from [CCL00] cannot be applied directly.
Furthermore, the errors introduced by DFX are correlated with one another. The error signal statistics for DFX depend on the signal magnitude, which in turn depends on the temporal and spatial correlation of the system’s inputs and design respectively. However, there is no correlation between the errors when rounding quantisation is performed, and this will be addressed in Section 5.4.2.
To assist the RightSize tool in modelling the injected errors, a single-pass profiling simulation is performed to populate the probability distribution tables used in estimating the truncation probabilities. With the noise injection models and error correlation accounted for, the output error estimate is obtained using the existing perturbation analysis technique, where each injected noise source is weighted with a sensitivity measure before summing [Con03]. The main original contributions in this chapter are therefore:
• the error models for each DFX module with flexible input and output parameters (i.e. multiple wordlengths and scalings),
• the correlation between the errors under the DFX truncation and rounding schemes,
• the profiling simulation used to obtain the probability distribution tables that model the errors and estimate the correlation coefficient, and
• the estimation of the output error of a DFX-annotated computation graph G(V, S, ADFX), taking into account the correlation of errors when the truncation scheme is used.
5.2 Preliminaries
The error models in this section model the errors introduced into a design/system that uses the DFX arithmetic modules discussed in Chapter 4. The main criterion for these error models is that they have to be as flexible as the fully parameterisable modules. This includes being able to cope with the multiple scaling and wordlength design paradigm without compromising accuracy.
As mentioned in the background (Section 2.5.4), the noise analysis of a fixed-point system used in RightSize is a weighted sum of injected quantisation errors [KS98, Con01]. Our proposed error model follows the same error injection approach, where errors are injected after a DFX module, as illustrated in Figure 5.1(a) (the error injections are shown as broken arrows). Note that these additional inputs and adders are not nodes in the computation graph, but are useful conceptual devices for modelling the errors associated with truncation. Therefore, for every signal j ∈ S in an annotated graph G(V, S, ADFX), an error source ej is added by the signal’s driver node. In an ideal situation, all error sources would be zero and the system would behave as if the error sources were not present. An example of a computation graph with error injections is shown in Figure 5.1(b).
[Figure 5.1: Noise error of each module modelled as an error injection at the output. (a) Error injection after a DFX module. (b) Example computation graph with error injection.]
Given a computation graph G(V, S), let Vo ⊂ V be the set of nodes of type primary out and let k ∈ Vo. Using the perturbation analysis in RightSize (Section 2.5.4), we can obtain the unscaled sensitivity measure, εjk, of the error ej injected into each signal j ∈ S with respect to output k. Assuming the noise powers of the injected errors, σ²j = var(ej), are known and the injected errors are uncorrelated with one another, the output noise power can be determined by the weighted-sum equation (5.1) below. This equation is used by the original RightSize fixed-point wordlength optimisation algorithm to estimate the output noise power.

σ²k = Σ_{j∈S} εjk σ²j .    (5.1)
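Equation (5.1) is a straightforward weighted sum. As an illustrative Python sketch (the signal names and numerical values below are made up for the example):

```python
def output_noise_power(eps_k, sigma2):
    # Equation (5.1): sigma_k^2 = sum over signals j of eps_jk * sigma_j^2,
    # valid when the injected errors are mutually uncorrelated.
    return sum(eps_k[j] * sigma2[j] for j in eps_k)

# Hypothetical sensitivities and injected noise powers for three signals.
eps_k = {"s1": 1.0, "s2": 0.25, "s3": 2.0}
sigma2 = {"s1": 1e-6, "s2": 4e-6, "s3": 5e-7}
assert abs(output_noise_power(eps_k, sigma2) - 3e-6) < 1e-12
```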
To ease the explanation of the rest of this section, an alternative means
of representing scaling called the least significant bit (LSB)-side scaling, p′, is
defined (Definition 5.1). Figure 5.2 shows this alternative scaling as the number
of bits from the right of the binary point to the LSB for both fixed-point and
DFX cases.
[Figure 5.2: LSB-side scaling definition. (a) Fixed-point. (b) Dual FiXed-point.]
Definition 5.1. The least significant bit (LSB)-side scaling vector pair p′0 ∈ Z^|S| and p′1 ∈ Z^|S| for a graph G(V, S, ADFX) consists of elements in one-to-one correspondence with the elements of S. They are the LSB-side scaling representations of the scaling vectors p0 and p1 respectively. For each signal j ∈ S, the elements are p′j0 = nj − pj0 and p′j1 = nj − pj1. Since p1 > p0, it follows that p′0 > p′1. □
Errors are introduced into a system whenever quantisation occurs. In the case of the DFX modules, quantisation occurs whenever there is a right shift in the data path. A two’s complement fixed-point signal with LSB-side scaling p′a truncated to p′b introduces an error with the mean and variance given by (5.2), which uses a discrete distribution for the errors [CCL99]. The equations are derived from the assumption that each combination of the truncated low-end bits is equally likely [Liu71, Tsa74], which holds in practice if the signals have sufficient dynamic range over that bit-width. If rounding is performed instead of truncation, (5.2) still applies, but the error mean becomes zero while the variance remains the same.
As DFX has two levels of precision, more than one truncation/rounding error may occur within each DFX module. Let T be the set of all the possible truncations/roundings that may occur and let T ∈ T. A truncation T of “p′a → p′b” yields the error mean and error variance given by (5.2) if and only if p′a ≥ p′b. When p′a < p′b, a left shift is performed, which involves zero-padding the least significant bits (LSBs); the error mean and error variance are then zero, as there is no precision loss.
mean:     µ = −(1/2) 2^(−n) (2^(pb) − 2^(pa)) = −(1/2) (2^(−p′b) − 2^(−p′a))

variance: σ² = (1/12) 2^(−2n) (2^(2pb) − 2^(2pa)) = (1/12) (2^(−2p′b) − 2^(−2p′a))    (5.2)
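Equation (5.2) can be cross-checked by exhaustively enumerating the equally likely patterns of the dropped bits. The following Python sketch (illustrative, not part of RightSize) does exactly that:

```python
def trunc_stats(pa, pb):
    # Equation (5.2): error mean and variance for truncating from LSB-side
    # scaling p'_a down to p'_b (with p'_a >= p'_b).
    mean = -0.5 * (2.0 ** -pb - 2.0 ** -pa)
    var = (2.0 ** (-2 * pb) - 2.0 ** (-2 * pa)) / 12.0
    return mean, var

def trunc_stats_exhaustive(pa, pb):
    # The truncation error is minus the value of the d = p'_a - p'_b dropped
    # LSBs; every dropped-bit pattern is assumed equally likely.
    errors = [-k * 2.0 ** -pa for k in range(2 ** (pa - pb))]
    mean = sum(errors) / len(errors)
    var = sum((e - mean) ** 2 for e in errors) / len(errors)
    return mean, var

for pa, pb in [(8, 4), (6, 1), (10, 9)]:
    m1, v1 = trunc_stats(pa, pb)
    m2, v2 = trunc_stats_exhaustive(pa, pb)
    assert abs(m1 - m2) < 1e-15 and abs(v1 - v2) < 1e-15
```

When p′a = p′b no bits are dropped and both mean and variance collapse to zero, matching the zero-padding case above.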
For every truncation/rounding T, there is also a corresponding probability of it occurring, PT. From the profiling simulation explained later in Section 5.5.1, we can extract the probabilities of all the truncation sources within each module. We can then determine the overall error mean and error variance using an equation of the form given by (5.3). Again, if rounding is performed, the error mean is zero and the variance is calculated with the zero error means. If there is only one truncation, i.e. |T| = 1, (5.3) reverts to the ordinary fixed-point truncation model.
µerror = Σ_{T∈T} PT µT

σ²error = Σ_{T∈T} PT (σ²T + µ²T) − µ²error    (5.3)
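Equation (5.3) is the standard mean/variance of a mixture of error distributions. A minimal Python sketch (illustrative; in practice the probabilities PT come from the profiling simulation):

```python
def mixture_stats(cases):
    # Equation (5.3): each case is a (P_T, mu_T, var_T) triple; the
    # probabilities over the set T are assumed to sum to one.
    mu = sum(p * m for p, m, _ in cases)
    var = sum(p * (v + m * m) for p, m, v in cases) - mu * mu
    return mu, var

# With a single truncation (|T| = 1) the model reverts to plain fixed-point:
assert mixture_stats([(1.0, -0.25, 0.015625)]) == (-0.25, 0.015625)
# Two equally likely truncations with symmetric means give a zero overall mean:
mu, var = mixture_stats([(0.5, -1.0, 0.0), (0.5, 1.0, 0.0)])
assert mu == 0.0 and var == 1.0
```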
The error models use the probability and joint probability distribution functions of the signals to estimate the injected errors. Therefore, the error models rely on the DFX signals being properly scaled for correct error estimation.
5.3 DFX Modules Noise Models
5.3.1 Encoder
The DFX Encoder module performs one of two forms of quantisation on its ordinary fixed-point input, depending on the input’s magnitude. If the output is in the Num0 range, the output is truncated by TE0, and if the output is in Num1, it is truncated by TE1. The truncations for the Encoder are listed in Table 5.1. To reiterate, B is the boundary of the DFX number (Definition 3.4).
Table 5.1: DFX Encoder truncations where the input is a fixed-point 〈nin, pin〉 and the output a DFX 〈n, p0, p1〉 (refer to Fig. 4.4 for the block diagram).

T_E    Truncation      Condition
T_E0   p′in → p′0      −B ≤ Input < B
T_E1   p′in → p′1      Input < −B or Input ≥ B
Figure 5.3: Probability density function (PDF) of the DFX Encoder input signal, partitioned into the Num1, Num0 and Num1 regions at ±Boundary with truncation probabilities P_TE1, P_TE0 and P_TE1.
The probability density function (PDF) of the input A is shown in Figure 5.3. From the PDF, the probability P_TE0 is the integral of the PDF over the region where the input A is in Num0. Likewise, the probability P_TE1 is the integral of the PDF over the region where the input A is in Num1. Therefore, using (5.3), the modelled error mean and error variance are given by (5.4).
µ_Enc = µ_TE0 P_TE0 + µ_TE1 P_TE1

σ²_Enc = ( (σ²_TE0 + µ²_TE0) P_TE0 + (σ²_TE1 + µ²_TE1) P_TE1 ) − µ²_Enc    (5.4)
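Numerically, the Encoder model (5.4) is a two-component mixture of the per-range truncation statistics (5.2). A hedged Python sketch (function and parameter names are illustrative; p_in, p0 and p1 stand for the effective scalings p′in, p′0 and p′1):

```python
import numpy as np

def trunc_stats(p_a, p_b):
    """Error mean/variance of truncating scaling p'_a to p'_b, per (5.2)."""
    mean = -0.5 * (2.0 ** -p_b - 2.0 ** -p_a)
    var = (1.0 / 12.0) * (2.0 ** (-2 * p_b) - 2.0 ** (-2 * p_a))
    return mean, var

def encoder_error(samples, B, p_in, p0, p1):
    """Mixture model (5.4): weight the Num0/Num1 truncation statistics
    by the empirical probabilities of the input lying in each range."""
    x = np.asarray(samples, dtype=float)
    P0 = np.mean((x >= -B) & (x < B))   # input in Num0 (Table 5.1 condition)
    P1 = 1.0 - P0                       # input in Num1
    m0, v0 = trunc_stats(p_in, p0)
    m1, v1 = trunc_stats(p_in, p1)
    mu = m0 * P0 + m1 * P1
    var = (v0 + m0 ** 2) * P0 + (v1 + m1 ** 2) * P1 - mu ** 2
    return mu, var
```

When all samples fall in one range, the mixture reduces to the single-range statistics of (5.2), as expected.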
5.3.2 Recoder
The error analysis discussed in this section covers the use of the DFX Recoder for the fork and delay nodes. As mentioned in Section 4.4.3, there are three different implementations of the Recoder, depending on the input and output boundaries. The possible truncations for the Recoder are summarised in Table 5.2. The truncation case T_R01 happens only when the input boundary is greater than the output boundary (Bin > Bout); similarly, the truncation case T_R10 happens only when Bin < Bout. When Bin = Bout, there is no change in the lower scaling, hence truncations T_R01 and T_R10 never happen.
Obtaining the probabilities of truncation involves integrating sections of
the input’s PDF. The input PDF for the Recoder is essentially the input
Table 5.2: DFX Recoder truncations where the input is DFX 〈nin, pin0, pin1〉 and the output DFX 〈nout, pout0, pout1〉.

T_R    Input   Output   Truncation        Condition
T_R00  Num0    Num0     p′in0 → p′out0    –
T_R01  Num0    Num1     p′in0 → p′out1    Bin > Bout
T_R10  Num1    Num0     p′in1 → p′out0    Bin < Bout
T_R11  Num1    Num1     p′in1 → p′out1    –
Figure 5.4: PDF of the DFX Recoder input and the probabilities of truncation: (a) for the case when Bin = Bout; (b) for the case when Bin > Bout; (c) for the case when Bin < Bout.
PDF into the fork or the delay node, since these nodes simply pass their inputs to their outputs. Figure 5.4 shows the PDF partitioning for the three different Recoder implementations. By integrating the areas under the graph, we can obtain the probabilities of truncation for each implementation. If a particular truncation cannot happen, the probability of that truncation happening is zero; for example, in the case Bin = Bout, the probabilities P_TR01 = P_TR10 = 0.
With the probabilities of truncation known, we can obtain the error mean
and error variance for the Recoder using Equation (5.3).
5.3.3 Adders
Analysis of the error model for the Adder module begins with analysing all possible input and output combinations. For both versions of the DFX Adder, the errors are introduced only at the end of the module and there is no loss of precision elsewhere. This means that both versions of the DFX Adder can be modelled in the same way. Table 5.3 shows all eight possible input and output combinations of the adder together with their respective truncations.
Ideally, if the inputs had no temporal correlation and were independent of each other, obtaining the probability distributions of the inputs independently would be sufficient to determine the probabilities of truncation. In practice, however, input signals have some degree of correlation, so a joint probability distribution function table of the Adder's inputs is needed. Such a distribution function can be viewed as a graph, with the x-axis for input A and the y-axis for input B (see Figure 5.5(a)). Different patterns on the graph denote the different input combinations. The input boundaries are shown as red and blue lines, and the gold line represents the locus |A| + |B| = BS. The portion between the output boundary lines is where the output is a Num0 number; the output is a Num1 number in the outer portions.
Table 5.3: DFX Adder inputs (A and B) and output (S) combinations with their respective output truncations.

T_A     A      B      S      Truncation                 Condition
T_A000  Num0   Num0   Num0   max(p′A0, p′B0) → p′S0     –
T_A010  Num0   Num1   Num0   max(p′A0, p′B1) → p′S0     (BB − BA) < BS
T_A100  Num1   Num0   Num0   max(p′A1, p′B0) → p′S0     (BA − BB) < BS
T_A110  Num1   Num1   Num0   max(p′A1, p′B1) → p′S0     –
T_A001  Num0   Num0   Num1   max(p′A0, p′B0) → p′S1     (BA + BB) > BS
T_A011  Num0   Num1   Num1   max(p′A0, p′B1) → p′S1     –
T_A101  Num1   Num0   Num1   max(p′A1, p′B0) → p′S1     –
T_A111  Num1   Num1   Num1   max(p′A1, p′B1) → p′S1     –
The probability of each individual truncation is found by integrating the corresponding area shown in Figure 5.5(b). Under certain conditions, some truncations can never occur: when (BB − BA) ≥ BS, the addition of an A in Num0 with a B in Num1 can never result in an S in Num0, so P_TA010 will always be zero. Similarly, when (BA − BB) ≥ BS, the addition of an A in Num1 with a B in Num0 can never result in an S in Num0, and P_TA100 = 0. Lastly, the addition of an A in Num0 with a B in Num0 can never result in an S in Num1 if (BA + BB) ≤ BS, and therefore P_TA001 = 0.
With the probabilities of truncation known, the error mean and error vari-
ance for the Adder may be found using Equation (5.3).
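The region integrals of Figure 5.5(b) can be approximated directly from paired input samples. A Monte Carlo sketch (function and variable names are illustrative; the sample counts stand in for integrating the joint PDF areas):

```python
import numpy as np

def adder_trunc_probs(a, b, BA, BB, BS):
    """Estimate the eight probabilities of Table 5.3 by counting which
    (A, B, S) range combination each sample pair falls into."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    s = a + b
    a1 = np.abs(a) >= BA          # input A in Num1
    b1 = np.abs(b) >= BB          # input B in Num1
    s1 = np.abs(s) >= BS          # output S in Num1
    probs = {}
    for ia in (0, 1):
        for ib in (0, 1):
            for io in (0, 1):
                mask = (a1 == ia) & (b1 == ib) & (s1 == io)
                probs[(ia, ib, io)] = float(np.mean(mask))
    return probs
```

The eight probabilities sum to one, and combinations forbidden by the boundary conditions above simply come out as zero.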
5.3.4 Gain Multiplier
As mentioned earlier in Section 4.6.1, the DFX Gain Multiplier multiplies a DFX number, X, with a fixed-point constant multiplier m〈nm, pm〉. Table 5.4 shows all its possible truncations. Due to the design restrictions imposed (Section 4.6.1), the case T_G01 will only happen if BA > (BQ/|m|) and, similarly, the case T_G10 will only happen if BA < (BQ/|m|).

As this module is a single-input module, the error statistics depend solely on the input's probability distribution. When the coefficient m is multiplied with a DFX input, the input boundary is shifted, and so are the ranges
Figure 5.5: DFX Adder joint probability distribution table. (a) Joint distribution table showing the input and output ranges; the golden line is the output boundary |A| + |B| = BS. (b) The probabilities of truncation for the DFX Adder.
Table 5.4: DFX Gain Multiplier input (A) and output (Q) combinations and their respective output truncations.

T_G    A      Q      Truncation            Condition
T_G00  Num0   Num0   (p′A0 + p′m) → p′Q0   –
T_G01  Num0   Num1   (p′A0 + p′m) → p′Q1   BA > (BQ/|m|)
T_G10  Num1   Num0   (p′A1 + p′m) → p′Q0   BA < (BQ/|m|)
T_G11  Num1   Num1   (p′A1 + p′m) → p′Q1   –
Figure 5.6: PDF of the DFX Gain Multiplier input and the probabilities of truncation: (a) for the case BA > (BQ/|m|); (b) for the case BA < (BQ/|m|).
of the input Num0 and Num1 regions. If |m| > 1, the output boundary, mapped onto the input axis as BQ/|m|, is shifted lower in magnitude, and vice-versa. This leads to the partitioning of the input PDFs shown in Figure 5.6, which is similar to the partitioning of the Recoder's input PDF. As before, the probability of each truncation can be obtained by integrating the appropriate area under the graph. For the case BA > (BQ/|m|), the probability P_TG10 is always zero, and for the case BA < (BQ/|m|), P_TG01 is always zero. The Gain Multiplier error mean and error variance are therefore given by Equation (5.3).
5.3.5 Full Multiplier
Table 5.5 tabulates all eight possible truncations of the Full Multiplier. The product of an A in Num0 and a B in Num0 can never become a Q in Num1 if (BA×BB) ≤ BS; in that case, the probability P_TM001 will be zero. Similarly, P_TM110 will be zero if (BA×BB) ≥ BS, because the product of an A in Num1 and a B in Num1 can never become a Q in Num0.
Table 5.5: DFX Full Multiplier inputs (A and B) and output (Q) combinations and their respective output truncations.

T_M     A      B      Q      Truncation               Condition
T_M000  Num0   Num0   Num0   (p′A0 + p′B0) → p′S0     –
T_M010  Num0   Num1   Num0   (p′A0 + p′B1) → p′S0     –
T_M100  Num1   Num0   Num0   (p′A1 + p′B0) → p′S0     –
T_M110  Num1   Num1   Num0   (p′A1 + p′B1) → p′S0     (BA×BB) < BS
T_M001  Num0   Num0   Num1   (p′A0 + p′B0) → p′S1     (BA×BB) > BS
T_M011  Num0   Num1   Num1   (p′A0 + p′B1) → p′S1     –
T_M101  Num1   Num0   Num1   (p′A1 + p′B0) → p′S1     –
T_M111  Num1   Num1   Num1   (p′A1 + p′B1) → p′S1     –
Since the Full Multiplier is a dual-input module, the joint probability distribution of its inputs is needed to estimate the errors. Unlike the Adder modules, the distribution of errors does not depend on the signs of its inputs and outputs. Taking the log2() of the magnitudes of the inputs, the joint PDF can be simplified as shown in Figure 5.7(a). There are four quadrants for the four different combinations of input ranges (Input A : Input B). Figures 5.7(b)-(d) show the distribution of errors for the different cases of input and output boundaries. By integrating the specific areas on the joint PDF for the probabilities of truncation, the Full Multiplier error mean and error variance can be found using Equation (5.3).
Figure 5.7: DFX Full Multiplier joint probability distribution table. (a) shows the input ranges (Input A : Input B) on log2(|A|) and log2(|B|) axes; (b)-(d) show the probabilities of truncation for the cases (BA·BB) < BQ, (BA·BB) > BQ and (BA·BB) = BQ respectively.
5.3.6 Error Model Evaluation and Discussion
To verify the error models, speech audio samples were used as inputs. The error is the difference between the estimated output and the actual (double precision) output. Tables 5.6 and 5.7 show that, for two arbitrarily chosen DFX formats, the truncation and rounding error models provide error estimates that are within ±3% of the actual error. The experiment was repeated with 100 different DFX formats and the results show that the estimates were within ±5%.
It can be seen from the models above that, in determining the statistics of the injected errors, knowledge of each probability of truncation is
Table 5.6: Evaluation of the error models for the truncation scheme with DFX formats 〈14,−5,2〉 and 〈14,−3,5〉.

〈14,−5,2〉    Estimated Error         Actual Error            Difference
             Mean        Var.        Mean        Var.        Mean     Var.
Encoder      -5.40e-05   5.81e-09    -5.39e-05   5.78e-09    0.20%    0.50%
Recoder      -3.10e-05   4.07e-09    -3.17e-05   4.08e-09    0.61%    -0.32%
Adder        -2.71e-04   8.09e-08    -2.77e-04   8.23e-08    -2.30%   -1.74%
Gain Mult.   -1.57e-04   3.40e-08    -1.55e-04   3.38e-08    1.35%    0.67%
Full Mult.   -3.67e-04   1.40e-07    -3.74e-04   1.36e-07    -1.91%   2.85%

〈14,−3,5〉    Estimated Error         Actual Error            Difference
             Mean        Var.        Mean        Var.        Mean     Var.
Encoder      -1.68e-04   1.85e-07    -1.68e-04   1.86e-07    0.17%    -0.68%
Recoder      -1.46e-04   1.55e-07    -1.47e-04   1.56e-07    -0.60%   0.54%
Adder        -1.06e-03   3.16e-06    -1.07e-03   3.18e-06    -0.97%   -0.48%
Gain Mult.   -5.56e-04   1.20e-06    -5.58e-04   1.21e-06    -0.40%   -1.20%
Full Mult.   -2.17e-03   4.95e-06    -2.20e-03   4.89e-06    -1.38%   1.21%
Table 5.7: Evaluation of the error models for the rounding scheme with DFX formats 〈14,−5,2〉 and 〈14,−3,5〉.

〈14,−5,2〉    Model Var.   Actual Var.   Difference
Encoder      2.18e-09     2.17e-09      0.52%
Recoder      1.92e-09     1.92e-09      -0.32%
Adder        1.36e-08     1.38e-08      -1.75%
Gain Mult.   8.85e-09     8.81e-09      0.50%
Full Mult.   1.83e-08     1.81e-08      1.09%

〈14,−3,5〉    Model Var.   Actual Var.   Difference
Encoder      5.32e-08     5.33e-08      -0.27%
Recoder      5.12e-08     5.11e-08      0.19%
Adder        4.04e-07     4.08e-07      -0.86%
Gain Mult.   2.40e-07     2.42e-07      -0.78%
Full Mult.   6.12e-07     6.20e-07      -1.29%
vital. These probabilities depend on the input and output signal parameters together with their respective probability distributions. Also, since the error statistics of a truncation T within a module depend on the output scaling, it should be noted that the output DFX parameters of each module play a significant part in the error injected after each module.
5.4 Correlated Errors
As discussed in Section 5.2, the noise power at the primary outputs is calculated as the weighted sum of the individual errors injected by signal quantisations, given by Equation (5.1). Summing the noise powers, i.e. the error variances, of the individual errors is a straightforward affair provided that the errors are not correlated with one another. In general, the variance of the sum of n variables is the sum of their pairwise covariances, cov(x, y), as given by (5.5) below. Hence, Equation (5.1) is rewritten to include the correlation between error sources.
var( Σ_{i=1}^{n} X_i ) = Σ_{i=1}^{n} Σ_{j=1}^{n} cov(X_i, X_j)

                       = Σ_{i=1}^{n} var(X_i) + 2 Σ_{i=1}^{n} Σ_{j=i+1}^{n} cov(X_i, X_j)    (5.5)
σ²_k = Σ_{j∈S} ε_jk σ²_j + Σ_{i,j∈S, j≠i} (ε_ik ε_jk)^(1/2) cov(X_i, X_j)    (5.6)
In the case where the error sources are not correlated with one another, the covariances between different error sources are all zero and the result is simply the sum of the error sources' variances. However, as demonstrated by the error models in the previous section, the variance of the error introduced is highly dependent on the probability distributions of the signals. Correlation between the signals exists in practical DSP implementations due to the temporal correlation of the inputs and the spatial correlation which is intrinsic to the design of the computational graph.
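The identity (5.5) is easy to confirm numerically with synthetic correlated error sources (everything in this sketch is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
e = rng.standard_normal((10000, 3))   # three synthetic error sources
e[:, 1] += 0.5 * e[:, 0]              # make two of them correlated

cov = np.cov(e, rowvar=False)         # matrix of cov(X_i, X_j)
var_direct = np.var(e.sum(axis=1), ddof=1)
var_from_cov = cov.sum()              # (5.5): sum over all i, j

# The two agree exactly; ignoring the off-diagonal terms would
# underestimate the variance when the sources are positively correlated.
assert abs(var_direct - var_from_cov) < 1e-8
```

Dropping the off-diagonal covariances here leaves only the sum of the individual variances, which is the uncorrelated special case discussed above.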
The correlation between error sources can be quantified by the Pearson correlation coefficient [SM94], r_x,y (5.7). When r_x,y = 0, the errors x and y are not correlated with each other. When r_x,y = 1 or r_x,y = −1, the errors are fully correlated or fully inversely correlated respectively. Example 5.1 demonstrates the severity of error correlation when calculating the noise power of the primary output error variance with DFX signals.
r_X,Y = r(X, Y) = cov(X, Y) / (σ_X σ_Y)    (5.7)
Figure 5.8: Transposed FIR Direct Form type I filter implemented with DFX modules, with coefficients b0–b3, gain multiplier error sources g3–g0 and adder error sources a2–a0.
Example 5.1. Figure 5.8 shows a transposed FIR Direct Form type I filter with its error sources marked. In this example, the encoder, branching fork and delay nodes do not inject any errors into the final output, so their error injections are omitted. The signals in this filter are truncated to a uniform wordlength and scaling throughout (DFX 〈14,−3, 2〉), and the filter input is a speech sample, to simulate a temporally correlated input. A straightforward addition of the error variances gives a noise power of 0.96×10⁻⁸, but simulation shows that the noise power is actually 1.67×10⁻⁸. This −43% difference is caused by the correlation between the error sources. Table 5.8 tabulates the correlation coefficients between the error sources.
Interestingly, when rounding is used, the error sources do not exhibit any correlation between them, as shown in Table 5.9. Without correlation,
Table 5.8: The correlation coefficients of the error sources for the FIR filter in Figure 5.8 when truncation is used.
Errors g3 g2 g1 g0 a2 a1 a0
g3 1.0 0.482 0.363 0.448 0.406 0.167 0.442
g2 0.482 1 0.241 0.599 0.394 0.169 0.534
g1 0.363 0.241 1 0.221 0.246 0.150 0.213
g0 0.448 0.599 0.221 1 0.373 0.177 0.511
a2 0.406 0.394 0.246 0.373 1 0.254 0.378
a1 0.167 0.169 0.150 0.177 0.254 1 0.168
a0 0.442 0.534 0.213 0.511 0.378 0.168 1
the covariances between the errors are zero (5.7) and the noise power is therefore just the sum of the error variances. The simulated output noise power is found to be about 1.07×10⁻⁸, and the straightforward addition estimate gives 1.11×10⁻⁸ (i.e. a 3.6% error).
Table 5.9: The correlation coefficients of the error sources for the FIR filter in Figure 5.8 when rounding is used.
Errors g3 g2 g1 g0 a2 a1 a0
g3 1 -0.001 -0.007 -0.002 0.000 0.005 0.002
g2 -0.001 1 -0.011 -0.012 -0.001 0.002 -0.005
g1 -0.007 -0.011 1 0.009 0.001 0.000 0.000
g0 -0.002 -0.012 0.009 1 0.001 -0.005 0.004
a2 0.000 -0.001 0.001 0.001 1 -0.006 -0.006
a1 0.005 0.002 0.000 -0.005 -0.006 1 -0.006
a0 0.002 -0.005 0.000 0.004 -0.006 -0.006 1
□
Since truncation is generally the least area-expensive method of quantisation, it is essential that the correlations between the truncation error sources are ascertained. This section introduces a way to approximate the correlation coefficient between the error sources and explores the case when DFX signals are rounded.
5.4.1 Estimating the Correlation for the Error Sources

Taking two DFX signals X and Y, we denote the errors injected into these signals as x and y respectively. In a truncation scheme, the error x lies in the range (−δx0, 0], where δx0 = 2^p′X0, when X is in Num0, or in the range (−δx1, 0], where δx1 = 2^p′X1, when X is in Num1. Figure 5.9 shows the distributions of the x and y errors. The discrete error distribution [CCL99] used for the error models is not used here, to simplify calculations; the uniformly distributed error model of [Liu71] is sufficient.
Figure 5.9: The PDFs of the DFX signal errors: uniform over (−δx0, 0] or (−δx1, 0] for x (Num0 or Num1 respectively), and likewise over (−δy0, 0] or (−δy1, 0] for y.
Let the probabilities of the errors for each combination of the ranges of signals X and Y be given by Table 5.10. These probabilities can be determined from the joint PDF obtained in the profiling simulation described later in Section 5.5.3.
Table 5.10: Combinations of the ranges of signals X and Y and their error probabilities.
X Y Error probability
Num0 Num0 ω00
Num0 Num1 ω01
Num1 Num0 ω10
Num1 Num1 ω11
Estimation of the correlation coefficient in Equation (5.7) begins by obtaining the covariance between the errors (x and y) and their standard deviations. The covariance is found using Equation (5.8). The means and standard deviations of the errors can be found using the techniques described in Section 5.3. Calculating the expectation of the product of the two errors, on the other hand, requires the joint PDF f(x, y), as seen in Equation (5.9).
cov(x, y) = E(xy) − µx µy    (5.8)

E(xy) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x y f(x, y) dx dy    (5.9)
Similar to the joint PDF used for the DFX Adder, Figure 5.10 shows the joint PDF of the errors x and y. Figures 5.10(a)-(d) show the breakdown of the joint PDF for each combination of errors, and Figure 5.10(e) shows the complete joint distribution diagram with the four individual regions marked out. With the assumption that the errors are uniformly spread out within each error combination, the joint PDF f(x, y) is given by (5.10).
f(x, y) = f00  if x ∈ (−δx0, 0] and y ∈ (−δy0, 0]
          f01  if x ∈ (−δx0, 0] and y ∈ (−δy1, −δy0]
          f10  if x ∈ (−δx1, −δx0] and y ∈ (−δy0, 0]
          f11  if x ∈ (−δx1, −δx0] and y ∈ (−δy1, −δy0]    (5.10)

where

f00 = ω00 (δx0 δy0)⁻¹ + f01 + f10 − f11
f01 = ω01 (δx0 δy1)⁻¹ + f11
f10 = ω10 (δx1 δy0)⁻¹ + f11
f11 = ω11 (δx1 δy1)⁻¹
Knowing the joint PDF of the errors, the expectation E(xy) can be shown
to be (5.11). Also, using (5.3), the error means and variances are given by
(5.12) and (5.13).
E(xy) = ∫_{−δy1}^{−δy0} ( ∫_{−δx1}^{−δx0} xy f11 dx + ∫_{−δx0}^{0} xy f01 dx ) dy

      + ∫_{−δy0}^{0} ( ∫_{−δx1}^{−δx0} xy f10 dx + ∫_{−δx0}^{0} xy f00 dx ) dy

      = (1/4) [ ω00 (δx0 δy0) + ω01 (δx0 δy1) + ω10 (δx1 δy0) + ω11 (δx1 δy1) ]    (5.11)
Figure 5.10: Joint probability distribution of the errors. (a)-(d) show the breakdown for each error combination “x : y” and (e) shows the complete joint distribution diagram.
µx = (ω00 + ω01)(−δx0/2) + (ω10 + ω11)(−δx1/2)

µy = (ω00 + ω10)(−δy0/2) + (ω01 + ω11)(−δy1/2)    (5.12)

σ²x = (1/3)( (ω00 + ω01) δ²x0 + (ω10 + ω11) δ²x1 ) − µ²x

σ²y = (1/3)( (ω00 + ω10) δ²y0 + (ω01 + ω11) δ²y1 ) − µ²y    (5.13)
With the error means and the expectation known, the error covariance can be shown to be (5.14). Together with the error means and variances, the correlation coefficient can then be found using (5.7).
cov(x, y) = (1/4)(ω00 ω11 − ω01 ω10)(δx0 − δx1)(δy0 − δy1)    (5.14)
For the errors to be uncorrelated, the covariance between them must be zero, i.e. cov(x, y) = 0. Examining the error covariance equation, we can see that the errors are uncorrelated when ω00 ω11 = ω01 ω10, when δx0 = δx1, or when δy0 = δy1. Through experiments, it is found that the case ω00 ω11 = ω01 ω10 occurs when the signals X and Y are not correlated with each other. The other two conditions confirm that when fixed-point is used for either of the signals, the errors have no correlation (in fixed-point, δ0 = δ1).
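The closed-form estimate follows directly from (5.12)-(5.14) and (5.7). A sketch (names are illustrative, and the ω's are assumed to sum to one):

```python
import math

def error_corr(w00, w01, w10, w11, dx0, dx1, dy0, dy1):
    """Correlation coefficient between two DFX truncation errors, from
    the range probabilities omega and the error widths delta."""
    mx = -0.5 * ((w00 + w01) * dx0 + (w10 + w11) * dx1)                # (5.12)
    my = -0.5 * ((w00 + w10) * dy0 + (w01 + w11) * dy1)
    vx = ((w00 + w01) * dx0**2 + (w10 + w11) * dx1**2) / 3.0 - mx**2   # (5.13)
    vy = ((w00 + w10) * dy0**2 + (w01 + w11) * dy1**2) / 3.0 - my**2
    cov = 0.25 * (w00 * w11 - w01 * w10) * (dx0 - dx1) * (dy0 - dy1)   # (5.14)
    return cov / math.sqrt(vx * vy)
```

The three zero-covariance conditions above can be checked directly: ω00 ω11 = ω01 ω10, δx0 = δx1 or δy0 = δy1 each force the numerator of (5.14) to zero.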
Using the methods in this section, the correlation coefficient estimates for the error sources in Example 5.1 were found and are shown in Table 5.11. They are all within ±4% of the actual correlation coefficients.
Table 5.11: Estimates of the correlation coefficients of the error sources for the FIR filter in Figure 5.8 when truncation is used.
Errors g3 g2 g1 g0 a2 a1 a0
g3 1 0.48698 0.36295 0.42468 0.43505 0.18584 0.45016
g2 0.48698 1 0.25186 0.58503 0.41588 0.19758 0.54779
g1 0.36295 0.25186 1 0.21963 0.26785 0.16327 0.24033
g0 0.42468 0.58503 0.21963 1 0.38986 0.19983 0.51668
a2 0.43505 0.41588 0.26785 0.38986 1 0.26921 0.39682
a1 0.18584 0.19758 0.16327 0.19983 0.26921 1 0.19163
a0 0.45016 0.54779 0.24033 0.51668 0.39682 0.19163 1
5.4.2 Rounding Benefits
The errors introduced by rounding DFX signals were shown not to be correlated with each other (Example 5.1). Using the same notation as in the previous section (Section 5.4.1), Figures 5.11 and 5.12 show the PDFs of the errors x and y, and their joint PDF diagram.
Figure 5.11: The PDFs of the DFX signal errors when rounding is used.
Figure 5.12: Joint probability distribution of the errors when rounding is used. (a)-(d) show the breakdown for each error combination “x : y” and (e) shows the complete joint distribution diagram.
The joint PDF f ′(x, y) for the case of rounding is redefined as (5.15).
f′(x, y) = f′00  if |x| ∈ [0, δx0/2] and |y| ∈ [0, δy0/2],
           f′01  if |x| ∈ [0, δx0/2] and |y| ∈ (δy0/2, δy1/2],
           f′10  if |x| ∈ (δx0/2, δx1/2] and |y| ∈ [0, δy0/2],
           f′11  if |x| ∈ (δx0/2, δx1/2] and |y| ∈ (δy0/2, δy1/2].    (5.15)

where

f′00 = ω00 (δx0 δy0)⁻¹ + f′01 + f′10 − f′11
f′01 = ω01 (δx0 δy1)⁻¹ + f′11
f′10 = ω10 (δx1 δy0)⁻¹ + f′11
f′11 = ω11 (δx1 δy1)⁻¹
When rounding is used, the means of the errors are zero. Hence the error covariance depends only on the expectation E′(xy). As shown below in (5.16), this expectation is zero when rounding is used: the symmetric limits of integration cancel each other out, leaving zero expectation.
E′(xy) = ∫_{−δy1/2}^{−δy0/2} ( ∫_{−δx1/2}^{−δx0/2} f′11 xy dx + ∫_{−δx0/2}^{δx0/2} f′01 xy dx + ∫_{δx0/2}^{δx1/2} f′11 xy dx ) dy

       + ∫_{−δy0/2}^{δy0/2} ( ∫_{−δx1/2}^{−δx0/2} f′10 xy dx + ∫_{−δx0/2}^{δx0/2} f′00 xy dx + ∫_{δx0/2}^{δx1/2} f′10 xy dx ) dy

       + ∫_{δy0/2}^{δy1/2} ( ∫_{−δx1/2}^{−δx0/2} f′11 xy dx + ∫_{−δx0/2}^{δx0/2} f′01 xy dx + ∫_{δx0/2}^{δx1/2} f′11 xy dx ) dy

       = 0    (5.16)
This means that the covariance between the errors is always zero, and the errors are never correlated when rounding is used; Example 5.1 showed an instance of this. However, the added circuit complexity of rounding makes it undesirable, and a rounding scheme for DFX is not implemented in the RightSize tool. Since truncation is used, the system output error estimation therefore has to deal with the correlations between the errors.
5.5 Profiling: Simulation and Tables
5.5.1 Profiling Simulation
The error models introduced in Section 5.3 rely on the probability distributions, or profiles, of the signals within a system. These profile tables must be flexible enough to cater for any possible input and output boundaries of a module. In practice, the signals within a DSP design are typically correlated with one another, especially because of their highly correlated inputs, for example uncompressed sound in audio applications [Mit98] and RGB channels in video [Pra01]. We therefore perform a profiling simulation to obtain the distributions of the signals.
A straightforward approach is to place a histogram counter at every module input to record the distribution of inputs while performing a single-pass simulation of the entire system with a representative input. To extract accurate probabilities from these tables, the histogram would have to be very fine while still accommodating a signal's large dynamic range. For every node v ∈ V : type(v) ≠ primary_out in a computational graph G(V, S), an individual profile is needed. Also, the estimation of the error correlation coefficients discussed in Section 5.4 requires a profile table for each 2-combination of signals s ∈ S. Therefore a great amount of information needs to be gathered, and this rapidly becomes a major computational hurdle as the design/algorithm increases in complexity. Fortunately, there are ways to simplify the tables while retaining their accuracy. There are two types of profile tables: a 1-D Profile Table is sufficient for all models requiring an ordinary PDF, and a 2-D Profile Table for those needing a joint PDF.
5.5.2 1-D Profile Table
The modules that use a 1-D Profile Table are the single-input modules: the DFX Encoder, the DFX Recoder and the DFX Gain Multiplier. The 1-D Profile Table collects data on the input A and, in the case of the Gain Multiplier, also on the output Q.
The decision of which DFX range a number takes ultimately depends on its magnitude. Hence, by discarding the sign information, the profile table is shrunk by half (for 1-D tables) or by three quarters (for 2-D tables). Apart from that, the boundary of a DFX number is always a power of 2, and all the truncation probabilities P_T depend on the boundary value. We exploit this fact and group the histogram values using a logarithmic class division. Using logarithms of base 2, each division is a boundary bin, H(α). For α = αmin, . . . , αmax, we define H(α) = [2^(α−1), 2^α), with the exception H(αmin) = [0, 2^αmin).
The limits αmin and αmax depend on the signal's peak value. For a signal with a simulation-obtained peak value P, the upper boundary bin limit is given by αmax = ⌊log2(P)⌋ + 1. The peak value P is determined during the upper scaling p1 parameter determination step in Section 6.5. As for αmin, it is chosen so that the signal remains fully represented (Definition 3.7) up to a maximum signal wordlength, nmax, defined by the user.
αmin = αmax − nmax (5.17)
Figure 5.13: Boundary bins H(αmin), . . . , H(αmax) over the input magnitude |A|.
To simplify the notation later, we denote by αA the logarithm of the boundary of input A, i.e. αA = log2(BA). Figure 5.13 illustrates the range of the boundary bins for input A. Each boundary bin H(α) has a counter, h(α), counting the number of times the input falls within the range of that boundary bin. For a simulation size of N samples, the counter h(α) is given by (5.18).

h(α) = Σ_{i=1}^{N} 1{Aᵢ ∈ H(α)}    (5.18)
Therefore, the probability of the input being within a range [2^αX, 2^αY) is found by (5.19). This equation is sufficient to gather the probabilities of truncation for the DFX Encoder and DFX Recoder modules. An example for the DFX Recoder is shown in Example 5.2.

P_H = Σ_{α=αX}^{αY} h(α) / N    (5.19)
Example 5.2. As an example, take a DFX Recoder with input A〈nA, pA0, pA1〉 and output X〈nX, pX0, pX1〉 where pA0 < pX0. Since BA < BX, its input PDF and probabilities of truncation match Figure 5.4(b). The profile table can be visualised as in Figure 5.14. The probability of each truncation, found using (5.19), is given below.

Figure 5.14: An example profile table for a DFX Recoder, with regions P_TR00, P_TR01 and P_TR11 spanning the bins from αmin to αmax.
P_TR00 = Σ_{α=αmin}^{αout} h(α)/N        P_TR01 = Σ_{α=αout+1}^{αin} h(α)/N

P_TR10 = 0                               P_TR11 = Σ_{α=αin+1}^{αmax} h(α)/N

□
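A software sketch of the 1-D Profile Table, implementing (5.18) and (5.19) (the data structure and names are illustrative):

```python
import math

def profile_1d(samples, a_min, a_max):
    """Boundary-bin counters h(alpha) of (5.18): H(alpha) = [2^(alpha-1),
    2^alpha), with H(a_min) = [0, 2^a_min). Signs are discarded."""
    h = {a: 0 for a in range(a_min, a_max + 1)}
    for x in samples:
        x = abs(x)
        if x < 2.0 ** a_min:
            h[a_min] += 1
        else:
            # x in [2^(a-1), 2^a)  =>  bin index floor(log2 x) + 1
            h[min(math.floor(math.log2(x)) + 1, a_max)] += 1
    return h

def prob_range(h, a_x, a_y, n):
    """P_H of (5.19): probability of the magnitude lying in bins a_x..a_y."""
    return sum(h[a] for a in range(a_x, a_y + 1)) / n
```

With these two functions, the Recoder probabilities of Example 5.2 are three calls to prob_range over the bin ranges [αmin, αout], [αout+1, αin] and [αin+1, αmax].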
For the DFX Gain Multiplier, an extended boundary bin is required. In addition to the boundary bin counter, each H(α) collects an extra histogram of the output, Q. Using the same logarithmic base 2 class division and range, we denote the output boundary bins as Gα(γ) for each input boundary bin H(α). The range of Gα(γ) is [2^(γ−1), 2^γ), where γ = γmin, . . . , γmax and Gα(γmin) = [0, 2^γmin). Each output boundary bin also has a counter, gα(γ) (5.20). As before, the output boundary bin limits depend on the output signal and are treated in the same way as the limits of the input boundary bins. Also, to simplify the notation, we denote by γQ the logarithm of the boundary of output Q, i.e. γQ = log2(BQ).
gα(γ) = Σ_{i=1}^{N} 1{Qᵢ ∈ Gα(γ)}    (5.20)
Therefore, the probability of an input in boundary bin H(α) producing an output in the Num0 range of Q is given by the function P_Gα(γQ) (5.21).

P_Gα(γQ) = Σ_{γ=γmin}^{γQ} gα(γ) / N    (5.21)
Example 5.3. As an example, take a DFX Gain Multiplier with input A〈nA, pA0, pA1〉 and output Q〈nQ, pQ0, pQ1〉 where BA > (BQ/|m|). The probabilities of truncation shown in Figure 5.6(a) can be obtained using equations (5.19) and (5.21) as shown below.

P_TG00 = Σ_{α=αmin}^{αA} P_Gα(γQ)        P_TG01 = Σ_{α=αmin}^{αA} h(α)/N − P_TG00

P_TG10 = 0                               P_TG11 = Σ_{α=αA+1}^{αmax} h(α)/N

□
5.5.3 2-D Profile Table
The error analysis for the DFX Adder and Full Multiplier modules requires the joint PDF of their inputs. The error correlation coefficient also requires the joint probability of every 2-combination of signals, which can only be determined using their joint PDFs. As its name states, the 2-D Profile Table is a 2-dimensional version of the 1-D Profile Table, collecting the joint probability distribution of two input signals A and B and of the output Q.
As before, the logarithmic base 2 class division is used for the histogram counters, and the boundary bins are now H(α, β). For α = αmin, . . . , αmax and β = βmin, . . . , βmax, each boundary bin H(α, β) has the range ([2^(α−1), 2^α), [2^(β−1), 2^β)), with the exception H(αmin, βmin) = ([0, 2^αmin), [0, 2^βmin)). Each boundary bin also has a counter h(α, β); for inputs A and B with N samples, the counter is given by (5.22). This basic boundary bin is sufficient for the correlation coefficient estimation.
h(α, β) = Σ_{i=1}^{N} 1{(Aᵢ, Bᵢ) ∈ H(α, β)}    (5.22)
For the DFX Adder and Full Multiplier, the 2-D Profile Table with extended boundary bins has an extra array of output boundary bins to collect the histogram data of the output. An output boundary bin Gα,β(γ) for H(α, β) has the range [2^(γ−1), 2^γ), where γ = γmin, . . . , γmax and Gα,β(γmin) = [0, 2^γmin). The counter for each output boundary bin, gα,β(γ), is given by (5.23). As in the previous section, the limits of the input and output boundary bins depend on the signals that they analyse and are specified in the same way. To simplify the notation, the logarithms of the input and output boundaries are denoted αA = log2(BA), βB = log2(BB) and γQ = log2(BQ).

gα,β(γ) = Σ_{i=1}^{N} 1{Qᵢ ∈ Gα,β(γ)}    (5.23)
The probability of inputs A and B being in the range ([2^α_X, 2^α_Y), [2^β_X, 2^β_Y)) is given by (5.24). For the extended version of the boundary bin H(α, β), the probability of the output being a Num0 is given by the function P_{α,β}(γ_Q) in (5.25), where N is the total number of samples. An example of the truncation probabilities for a DFX Adder is shown in Example 5.4.
P_H = Σ_{α=α_X}^{α_Y} Σ_{β=β_X}^{β_Y} h(α, β)/N   (5.24)

P_{α,β}(γ_Q) = Σ_{γ=γ_min}^{γ_Q} g_{α,β}(γ)/N   (5.25)
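The single-pass collection of h(α, β) and g_{α,β}(γ) described above, and the region probability (5.24), can be sketched in Python. The function names and the binning of values by magnitude are illustrative assumptions, not the RightSize implementation:

```python
import math
from collections import defaultdict

def log2_bin(x, bmin):
    """Index of the logarithmic base-2 boundary bin holding |x|.
    Values below 2**bmin collapse into the lowest bin, as in
    H(alpha_min, beta_min)."""
    if abs(x) < 2.0 ** bmin:
        return bmin
    return math.floor(math.log2(abs(x))) + 1  # |x| in [2**(a-1), 2**a)

def profile_2d(samples_a, samples_b, outputs_q, amin, bmin, gmin):
    """One-pass collection of h(alpha, beta) and g_{alpha,beta}(gamma)."""
    h = defaultdict(int)
    g = defaultdict(int)
    for a, b, q in zip(samples_a, samples_b, outputs_q):
        ab = (log2_bin(a, amin), log2_bin(b, bmin))
        h[ab] += 1                                  # counter (5.22)
        g[ab + (log2_bin(q, gmin),)] += 1           # counter (5.23)
    return h, g

def prob_region(h, n, a_lo, a_hi, b_lo, b_hi):
    """P_H, equation (5.24): probability of (A, B) falling in a region."""
    return sum(h[(a, b)] for a in range(a_lo, a_hi + 1)
                         for b in range(b_lo, b_hi + 1)) / n
```

The `defaultdict` keeps the table sparse, so only bins actually hit by the simulation consume memory.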
Example 5.4. Take for example a DFX Adder with inputs A⟨nA, pA0, pA1⟩, B⟨nB, pB0, pB1⟩ and output S⟨nS, pS0, pS1⟩. The probabilities for truncation, as tabulated in Table 5.3, are given below.
P_TA000 = Σ_{α=α_min}^{α_A} Σ_{β=β_min}^{β_B} P_{α,β}(γ_S),   P_TA001 = Σ_{α=α_min}^{α_A} Σ_{β=β_min}^{β_B} h(α, β)/N − P_TA000

P_TA010 = Σ_{α=α_min}^{α_A} Σ_{β=β_B+1}^{β_max} P_{α,β}(γ_S),   P_TA011 = Σ_{α=α_min}^{α_A} Σ_{β=β_B+1}^{β_max} h(α, β)/N − P_TA010

P_TA100 = Σ_{α=α_A+1}^{α_max} Σ_{β=β_min}^{β_B} P_{α,β}(γ_S),   P_TA101 = Σ_{α=α_A+1}^{α_max} Σ_{β=β_min}^{β_B} h(α, β)/N − P_TA100

P_TA110 = Σ_{α=α_A+1}^{α_max} Σ_{β=β_B+1}^{β_max} P_{α,β}(γ_S),   P_TA111 = Σ_{α=α_A+1}^{α_max} Σ_{β=β_B+1}^{β_max} h(α, β)/N − P_TA110

□
5.6 Output Noise Estimation
This section brings together the topics discussed earlier in this chapter to determine the noise at the primary outputs of an annotated graph G(V, S, A_DFX). The profiling simulation (Section 5.5.1) gathers information on the probability distribution of the signals within the system. For every pair of signals i, j ∈ S, we can determine the mean and variance of the individual error sources, and their correlation coefficient, r_{i,j}, using the probability distributions.
Using the output response sensitivity measure, ε_jk, from the perturbation analysis (Section 2.5.4), the error variance observed at primary output k for an error source e_j is given by ε_jk σ²_ej. Therefore, using (5.6) and (5.7), the covariance observed at the output would be r_{i,j} (ε_ik ε_jk σ²_i σ²_j)^(1/2). Hence the final equation for the noise power at primary output k is given by

σ²_k = Σ_{j∈S} σ²_j ε_jk + Σ_{i,j∈S, j≠i} r_{i,j} (ε_ik ε_jk σ²_i σ²_j)^(1/2)   (5.26)
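Equation (5.26) is a weighted sum of the error-source variances and their pairwise covariances. A minimal sketch, with hypothetical argument layouts, might look like:

```python
import math

def output_noise_power(var, eps, r):
    """Noise power at one primary output, equation (5.26).

    var -- variance of each error source e_j
    eps -- output response sensitivity eps_jk of each source
    r   -- r[i][j], correlation coefficient between sources i and j
    """
    n = len(var)
    # Variance terms: sum_j var_j * eps_jk
    power = sum(var[j] * eps[j] for j in range(n))
    # Covariance terms: sum_{i != j} r_ij * (eps_ik eps_jk var_i var_j)^(1/2)
    for i in range(n):
        for j in range(n):
            if i != j:
                power += r[i][j] * math.sqrt(eps[i] * eps[j] * var[i] * var[j])
    return power
```

With `r` equal to the identity matrix (uncorrelated sources) the covariance terms vanish and only the weighted variances remain.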
To test the feasibility of the error estimation, the SNRs of 500 filters were simulated and compared with their estimated SNRs. The set of filters comprises 200 159-tap FIR filters, 200 12th-order IIR filters and 100 4th-order LMS filters (refer to Section 6.9 for a description of the IIR and LMS filters), where the coefficients of the FIR and IIR filters are randomly selected. Similarly, the signal wordlengths and their lower scalings p0 are chosen at random, and only their upper scalings p1 were selected to ensure no overflows. Figure 5.15 displays the estimated SNR against the simulated SNR, and Table 5.12 tabulates some of the individual designs with their results. We can see that the estimated SNR closely matches the simulated results (0.0628 significance level).
Table 5.12: Comparison between actual and estimated SNR for a 159-tap FIR filter with DFX parameters of increasing wordlength.

Design   Actual SNR (dB)   Estimated SNR (dB)   Diff.
FIR 1    41.0              40.2                 -2.1%
FIR 2    53.0              52.2                 -1.5%
FIR 3    64.9              64.2                 -1.0%
5.7 Summary
As part of the progressive steps taken to automate the determination of DFX
parameters in Chapter 6, reliable error models are required to quickly explore
the design space without having to repetitively run simulations.

Figure 5.15: The estimated vs simulated SNR for 500 filters.

This section has introduced the error models for the DFX modules that inflict quantisation errors on a system. The nature of DFX means that the models rely on the probability distributions of their inputs. A single-pass profiling simulation not only determines the probability distributions of the signals for single-input modules, but also the joint probability distributions for two-input DFX modules.
Also, with the knowledge of the joint PDFs of every 2-combination of signals, the correlation coefficients of their error sources can be estimated. This is necessary as the individual error sources in a practical design will be correlated with one another, and we need a weighted sum of the variances and covariances of the error sources. The weights are taken from the output response sensitivities, which are determined by the perturbation analysis of the RightSize tool.
Chapter 6

Approach to DFX Parameter Optimisation
6.1 Introduction
This chapter's main focus is describing an approach to optimising a design with the new DFX number representation. The high-level synthesis tool RightSize, which originally optimises a design using the fixed-point number representation, has been modified to incorporate DFX into its optimisation procedure.

A feature of the original RightSize tool is that users decide the SNR constraint on the primary outputs of a design, which is then used to guide the optimisation routine. This feature is kept, and an additional optional design constraint on the maximum probability of signal overflow is provided to users. This constraint places an upper bound on the probability of overflow for every signal in a design, which is useful when dealing with designs containing feedback loops.
Providing the lower scaling p0, as in DFX, adds an extra degree of freedom to the optimisation problem for the multiple wordlengths and scalings of a design, thus making the problem more difficult than when using fixed-point alone. From the definition of DFX, the lower scaling parameter p0 is allowed to equal p1, and when this happens, the signal is effectively a fixed-point signal. The proposed optimisation procedure successfully obtains the right mix of DFX and fixed-point signal representations for a design.
The optimisation procedure is a hybrid approach, incorporating results from simulation with analytical estimates of the output errors and the area consumption. Optimising the parameters for a multiple wordlength design has been shown to be NP-hard [CW02], and finding an optimal solution will be computationally intensive. Therefore a meta-heuristic optimisation is used, which efficiently searches the configuration space with a two-phase simulated annealing algorithm to decide both the optimum p0 and n parameters.

Conditioning is performed on the DFX parameters of every signal in order to concentrate the search and eliminate futile or infeasible trial configurations. Also, a simple area consumption model is used to provide quick but reliable cost metric feedback to the optimisation routines. Finally, case study results are presented to demonstrate the optimisation procedure.
Therefore the original contributions of this chapter are:

• a maximum overflow probability constraint for a design,

• identification of the design options where DFX could perform better than fixed-point,

• a meta-heuristic optimisation using a two-phase simulated annealing algorithm,

• case studies of an IIR and an LMS filter to demonstrate the use of DFX in a system context and the optimisation heuristic in action.
6.2 RightSize Prerequisites
Apart from the design description, the design flow of RightSize requires user-specified design constraints together with a set of representative inputs (refer to Section 2.5.4 for a description of the RightSize design flow).
6.2.1 User Specified Design Constraints
There are two design constraints that can be specified by the user. First, the user has to provide the acceptable signal-to-noise ratio (SNR) for each of the system's primary outputs. As mentioned earlier, SNR is a well-accepted general metric in the DSP community for measuring the quality of a finite precision algorithm implementation [Mit98]. Therefore RightSize will ensure that all the outputs of the synthesized design meet this SNR bound.

The wide dynamic range available to DFX means that our system is capable of preventing overflows without having to use saturation. Overflows must be prevented at all costs in digital filters, as severe distortions may otherwise occur in the filter output [Jac70, CMP76]. Saturation can be performed whenever overflow may arise, but unless designs meet certain criteria [BL91, Liu98], the filter may never recover from the saturation nonlinearity. Therefore the second design constraint is the optional maximum probability of overflow. When specified, the scaling of each signal is individually determined using the signal's standard deviation and Chebyshev's inequality [GS01] to bound its probability of overflow. If it is not specified, a scaling analysis will be performed to provide sufficient signal scaling based on the maximum simulated peak value. This is explained in further detail in Section 6.5.
6.2.2 Representative Floating-point Input
Since the RightSize tool takes a hybrid approach to optimisation, statistical information on each datapath is gathered via simulation. The optimisation procedure uses this statistical information (such as the probability distributions for the error analysis described in Chapter 5), so the final system will naturally be sensitive to the choice of representative input.

Generally, the input signals should be sufficient to exercise the full dynamic range required of the datapath, otherwise unwanted overflow errors could occur when other data sets are used in practice. Also, the quantisation error produced by the resulting system could violate the user-specified constraints mentioned above when driven by inputs with different statistical properties, although the constraints are guaranteed for the specific set of data provided.
6.3 DFX Conditioning
While running the optimisation, the search for the optimum design parameters may produce parameters that do not meet the design requirements of the DFX modules described in Chapter 4. Also, some of the parameters may be known to be overly pessimistic and can be ruled out from the optimisation. Hence, a fully conditioned DFX annotated computational graph, G(V, S, A_DFX), has parameters that are not overly pessimistic while meeting the design requirements of all DFX modules.

In this section, it is assumed that both the scaling and wordlength parameters for each signal have been pre-specified. An iterative algorithm is proposed at the end of this section to condition any 'ill-conditioned' designs.
6.3.1 Applying Synthesis Restrictions on p0 Parameters
By applying some restrictions to the p0 parameters of certain signals, there are area cost savings to be gained. These signals are the outputs of the DFX Gain Multiplier and DFX Adder V2 arithmetic modules, and of unit delay nodes.

DFX Adder V2

Referring to Section 4.5.2, the DFX Adder V2 is designed with the requirement that inputs A_Num0 and B_Num0 can only result in a Num0 output, S_Num0, in order to reduce the post-adder hardware cost. By using the joint probability distribution of an adder's inputs found through the profiling simulation (Section 5.5.3 and Example 5.4), we can determine the pS0 that meets this requirement for given pA0 and pB0. We also want the smallest possible value for pS0 to utilise the full potential of DFX. While performing DFX conditioning, each adder node in G(V, S, A_DFX) is subjected to Algorithm 6.1.
Algorithm 6.1: Determine the pS0 required for an adder.

Require: 1) The extended 2-D profile table of an adder with inputs A⟨nA, pA0, pA1⟩ and B⟨nB, pB0, pB1⟩. 2) pS1 determined earlier.
Ensure: P_TA001 = 0

  pS0 ⇐ pS1
  repeat
    Determine P_TA001   // For an example, see Example 5.4.
    pS0 ⇐ pS0 − 1
  until P_TA001 ≠ 0
  pS0 ⇐ pS0 + 1
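A Python rendering of the search in Algorithm 6.1 might look as follows; `pta001` stands in for a lookup into the extended 2-D profile table and is an assumed interface, not the RightSize one:

```python
def condition_adder_ps0(p_s1, pta001):
    """Find the smallest output lower scaling pS0 such that the
    probability of a type-TA001 truncation is still zero.

    pta001 -- callable returning P_TA001 for a candidate pS0; in
              RightSize this would be read off the extended 2-D
              profile table (Example 5.4).
    """
    p_s0 = p_s1
    # Walk pS0 down from pS1 while the requirement P_TA001 = 0 holds...
    while pta001(p_s0) == 0:
        p_s0 -= 1
    # ...then step back to the last feasible value.
    return p_s0 + 1
```

The search is linear because P_TA001 is monotone in pS0: lowering pS0 can only add Num0 input combinations whose sum no longer fits the Num0 output range.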
DFX Gain Multiplier

By restricting the DFX Gain Multiplier output's p0 parameter, we can expect hardware cost savings due to the reduced complexity of the output recoder. Therefore, the RightSize tool enforces the restrictions mentioned in Section 4.6.1, which are repeated here for convenience. For a gain multiplier with input A⟨nA, pA0, pA1⟩ and multiplier m⟨nm, pm⟩, the output pQ0 is limited based on the value of m. Algorithm 6.2 is performed on every gain multiplier node when doing DFX conditioning.
Algorithm 6.2: Determine the pQ0 required for a gain multiplier.

Require: The input A⟨nA, pA0, pA1⟩ and multiplier m⟨nm, pm⟩ DFX formats

  if (|m| < 1.0) and (pQ0 < pA0 + pm) then
    pQ0 ⇐ pA0 + pm
  else if (|m| > 1.0) and (pQ0 > pA0 + pm) then
    pQ0 ⇐ pA0 + pm
  else   // m = 1.0 or m = −1.0
    pQ0 ⇐ pA0
  end if
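One possible reading of Algorithm 6.2 is as a clamping rule on pQ0; the treatment of corner cases not listed in the pseudocode (e.g. |m| < 1 with pQ0 already large enough, where pQ0 is left unchanged here) is an assumption of this sketch:

```python
def condition_gain_pq0(m, p_a0, p_m, p_q0):
    """Restrict a gain multiplier output's lower scaling pQ0, following
    the rules of Algorithm 6.2 (illustrative reading, not the
    RightSize implementation)."""
    if abs(m) < 1.0:
        # |m| < 1: the output p0 may not sit below pA0 + pm.
        return max(p_q0, p_a0 + p_m)
    if abs(m) > 1.0:
        # |m| > 1: the output p0 may not sit above pA0 + pm.
        return min(p_q0, p_a0 + p_m)
    # m = 1.0 or m = -1.0: the output format follows the input.
    return p_a0
```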
Unit Delay

When unit delays are linked one after another, there is no reason for the p0 parameter to change from the input to the output. Consider Figure 6.1: when pB0 ≠ pA0, an extra DFX module to recode the output from the first unit delay has to be inserted between the two unit delays. Any change to the DFX signal format apart from a straight wordlength reduction requires additional hardware, which is unnecessary for a chain of unit delays. If there is any need to change the format, it should be done before or after the chain of unit delays, and Algorithm 6.3 is performed during the DFX conditioning stage to aid the synthesis tool by removing unnecessary changes to the p0 parameters.
Figure 6.1: Linked unit delays.
Algorithm 6.3: Determine the output p0 required for a unit delay.

Require: Input A⟨nA, pA0, pA1⟩ DFX format

  if output signal Q drives another unit delay then
    pQ0 ⇐ pA0
  end if
6.3.2 Propagating and Conditioning p1 and n Parameters

Table 6.1: Propagation rules for DFX conditioning.

type(v)       Propagation rules for j ∈ outedge(v)
adder         For inputs A⟨nA, pA0, pA1⟩ and B⟨nB, pB0, pB1⟩:
              p̄j1 = max(pA1, pB1) + 1
              n̄j = p̄j1 + max(p′A0, p′B0)
full mult     For inputs A⟨nA, pA0, pA1⟩ and B⟨nB, pB0, pB1⟩:
              p̄j1 = pA1 + pB1
              n̄j = p̄j1 + (p′A0 + p′B0)
gain mult     For input A⟨nA, pA0, pA1⟩ and multiplier m⟨nm, pm⟩:
              p̄j1 = pA1 + pm
              n̄j = p̄j1 + (p′A0 + p′m)
unit delay    For input A⟨nA, pA0, pA1⟩:
or fork       p̄j1 = pA1
              n̄j = p̄j1 + p′A0
After the restrictions on the p0 parameters have been applied, the remaining p1 and n parameters are propagated from the inputs of each atomic operation through to their outputs, as shown in Table 6.1. Here p̄j1 and n̄j represent the upper limits for the p1 scaling and wordlength parameters, respectively, of the output of an operation. They are determined using the parameters of the operation's input signals. These limits prevent the optimisation routine from considering parameter values that will not yield any benefit.
For the upper scaling parameters p1, this DFX conditioning phase compensates for any overly pessimistic values found during the optimisation phases in Section 6.5. If a pj1 is found to be greater than the propagated p̄j1, then pj1 is set to p̄j1. Similarly, if the wordlength nj is greater than the propagated n̄j, then nj is set to n̄j. The propagated wordlengths are all the sum of p̄j1 together with the maximum LSB-side scaling, ensuring that the output's Num1 number is able to extend its dynamic range fully.

Figure 6.2: Examples of DFX Gain Multiplier output formatting with the binary points aligned. The shaded bits can be omitted without introducing errors.
Consider the example of a DFX Gain Multiplier: Figure 6.2(a) shows the intermediate result before the output is recoded to a desired output format, and Figures 6.2(b)-(d) show its possible output formats. Figure 6.2(b) is an example where the upper scaling factor is pessimistic (pQ1 > pA1 + pm) and the extra shaded bit is merely a sign-extension which can be omitted. Reducing the upper scaling in this manner is guaranteed not to cause any overflow and also reduces the signal's wordlength. The propagated wordlength for the output in Figure 6.2(c) is n̄Q = p̄Q1 + (p′A0 + p′m), and since nQ is greater, there is a pair of extra shaded zero-padding bits that can be omitted without introducing additional errors. Figure 6.2(d) shows an output example that is within the propagated parameters.
6.3.3 Iterative Algorithm
Algorithm 6.4 below is an iterative algorithm to obtain a fully conditioned DFX annotated computational graph G(V, S, A_DFX).

Algorithm 6.4: DFX Conditioning

Input: A DFX annotated computation graph G(V, S, A_DFX)
Output: A properly conditioned DFX annotated computation graph with behaviour identical to the input system

  // First perform conditioning for the p0 parameters...
  repeat
    for all v ∈ V do
      Perform Algorithm 6.1 on v if type(v) = adder
      Perform Algorithm 6.2 on v if type(v) = gain mult
      Perform Algorithm 6.3 on v if type(v) = unit delay
    end for
  until no more changes made

  // ...then condition the p1 and n parameters
  Calculate p̄j1 and n̄j for all signals j ∈ S according to Table 6.1
  repeat
    for all j ∈ S do
      Set pj1 ⇐ p̄j1 if pj1 > p̄j1
      Set nj ⇐ n̄j if nj > n̄j
    end for
    Recalculate n̄j for all affected signals according to Table 6.1
  until no more changes made
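The outer repeat-until loops of Algorithm 6.4 are a fixed-point iteration over node-level rules. A generic sketch, with `condition_node` standing in for Algorithms 6.1-6.3 (an assumed interface), is:

```python
def condition_graph(nodes, condition_node):
    """Apply node-level conditioning rules repeatedly until no rule
    changes any parameter, mirroring the repeat-until structure of
    Algorithm 6.4. Returns the number of passes taken (the final pass
    makes no changes)."""
    changed = True
    passes = 0
    while changed:
        changed = False
        for v in nodes:
            if condition_node(v):   # rule mutates v, reports a change
                changed = True
        passes += 1
    return passes
```

Because each rule only tightens a parameter towards a bound, the iteration terminates: the parameters decrease monotonically and are bounded below.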
6.4 Area Models
The operation of the RightSize tool requires area metric feedback throughout the course of its optimisation process, and it would be very computationally intensive to perform logic synthesis to extract the area metric for each instance. Therefore it is worthwhile to model the area consumption of each module type supported by RightSize at a high level of abstraction. These simple cost models may be evaluated many times throughout the optimisation process with little computational effort.

It is assumed, when constructing the cost model, that dedicated resource binding is used and that designs are resource dominated, such that the area cost of wiring is negligible [Mic94]. With dedicated resource binding, each computational node maps to a physically distinct module element. The construction of the area cost model is greatly simplified with these assumptions, and it is possible to estimate each computational node separately before summing the resulting estimates. It should be noted that, in reality, the logic synthesis performed after the RightSize optimisation is likely to result in some logic optimisation across the boundaries of two connected nodes, giving a lower area. However, experience has shown that these deviations from the area model are small and tend to cancel each other out in large systems, resulting in simply a proportionally slightly smaller area than predicted [Con01].
This section will not explain the details of estimating the area for each individual module, but the following explanation should be sufficient for anyone to recreate the estimation procedure used. Each module can be broken down into a few main parts. For example, a DFX Gain Multiplier (Section 4.6.1) consists of a fixed-point gain multiplier and an output recoder block. The recoder block is in turn made up of a Range-Detector and MUXs for the shifting operations. Therefore, by modelling the basic entities, we can model all the modules that are used in RightSize. The models used in this thesis have been tuned for Virtex4 and Stratix2 FPGAs, and for a UMC 0.13 micron high-density standard cell ASIC library. The estimation models require the DFX annotated computational graph G(V, S, A_DFX) to be conditioned in advance.
Range-Detector (RD)

As mentioned in Section 4.4.1, the Range-Detector is a fairly simple block whose operation depends on the number of bits used for detection. A lookup table, crd[i], contains the size of a Range-Detector, where i is the number of bits used for detection. The lookup table is customised for each hardware platform. The estimated area for the Range-Detector is shown in the example below.

Example 6.1. For the Range-Detector in a DFX Encoder with input A⟨nA, pA⟩ and output X⟨nX, pX0, pX1⟩, the estimated area, Crd, is given by (6.1).

Crd = crd[pX1 − pX0]   where pX1 = pA   (6.1)

□
Multiplexor (MUX)
Multiplexors (MUXs) are used extensively for aligning and shifting signals in DFX, as shown throughout Chapter 4. A lookup table, cmux[i], is used to estimate the area of a MUX, where i is the number of inputs multiplexed. This table is customised for each hardware platform. As a note, when i = 1, the input is 'multiplexed' with a constant ground, or in other words, the MUX is just an AND gate. The example below shows the size of a DFX Decoder, which is made entirely from MUXs.
Figure 6.3: Estimating the area for a DFX Decoder.
Example 6.2. Figure 6.3 shows a DFX Decoder with input A⟨nA, pA0, pA1⟩ and a fixed-point output X⟨nX, pX⟩. For the nA MSBs at the output, 2:1 MUXs are used (i.e. i = 2), but for the lower (nX − nA) bits only AND gates are needed. In this example, when the input is a Num1 the output is the same as the input, but when the input is a Num0 the output is zero (grounded). Therefore, the estimated area for a DFX Decoder, Cdec, is given below in (6.2).

Cdec = cmux[2] · min(nA, nX) + cmux[1] · max(0, nX − nA)   (6.2)

□
Adder
Figure 6.4: Estimating the area for a fixed-point Adder.
The area estimation for the fixed-point adder is reasonably straightforward. FPGAs generally provide good support for fast ripple-carry adder architectures [Kor02]. For example, synthesis tools such as Synplify ASIC from Synplicity, used after RightSize, employ a ripple-carry architecture. The synthesis tools would normally generate a sum-and-carry full-adder for the full width of the adder. However, on some occasions a carry-only full-adder is sufficient, as the sum is not needed by the output, as shown in Figure 6.4. The cost model therefore consists of two coefficients, ca1 and ca2, for the sum-and-carry and carry-only full-adders respectively. These coefficients are customised for the three hardware platforms. The estimated area of the fixed-point adder, CfxAdd, is expressed by (6.3). This model is a modified version of the model proposed in [Con01].

CfxAdd = ca1 · min(nS + 1, pS + 1 + min(p′A, p′B)) + ca2 · max(0, min(p′A, p′B) − p′S)   (6.3)
Gain multiplier

Estimating the area for constant coefficient fixed-point multipliers is significantly more complicated. Typically, a constant gain multiplier is implemented as a series of additions of the partial products generated through recoding schemes such as the classic Booth technique [Boo51]. This makes the area consumption highly dependent on the coefficient value and, in addition, the exact recoding scheme used by the synthesis vendor is known only to the vendor. Ideally, the area estimation would account for any recoding-based implementation, but this is not realised here.

A simple area model is used instead, based on the model proposed in [Con01]. Equation (6.4) estimates the area, CfxGain, for a gain multiplier with input A⟨nA, pA⟩, coefficient multiplier m⟨nm, pm⟩ and output Q⟨nQ, pQ⟩. Through synthesis of several hundred multipliers with varying input and output wordlengths, and varying coefficient values, the coefficients cg1 and cg2 are determined using the least-squares approach for each hardware platform.

CfxGain = cg1 · nA · nm + cg2 · (nA + nm − nQ)   (6.4)
Full multiplier

The area estimation for a fixed-point full multiplier is more predictable than for the gain multiplier. To perform multiplication in parallel, array multipliers [Kor02] are generally used. However, as with the gain multiplier, the exact method used by the individual synthesis vendor is not known, and the logic synthesis would remove any unnecessary logic driving unconnected outputs. Therefore, as with the gain multiplier, a simple area model is used, based on the work in [Con03]. Equation (6.5) estimates the area, CfxMult, for a fixed-point full multiplier with inputs A⟨nA, pA⟩, B⟨nB, pB⟩ and output Q⟨nQ, pQ⟩, using the coefficients cm1 and cm2. These coefficients are also found through the least-squares approach on several hundred synthesized multipliers with varying input and output wordlengths. As before, these coefficients are customised for the three hardware platforms.

CfxMult = cm1 · min(nA, nB) · (max(nA, nB) + 1) + cm2 · (nA + nB − nQ)   (6.5)
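The cost models (6.3)-(6.5) translate directly into small functions. The coefficient values below are placeholders for illustration, not the tuned Virtex4/Stratix2/ASIC figures:

```python
# Placeholder coefficients; the tuned per-platform values are not
# reproduced here.
CA1, CA2 = 1.0, 0.5     # sum-and-carry / carry-only full-adder
CG1, CG2 = 0.6, -0.5    # gain multiplier (least-squares fitted)
CM1, CM2 = 0.7, -0.6    # full multiplier (least-squares fitted)

def c_fx_add(n_s, p_s, pp_a, pp_b, pp_s):
    """Fixed-point adder area, equation (6.3). pp_* are the p' scalings."""
    return (CA1 * min(n_s + 1, p_s + 1 + min(pp_a, pp_b))
            + CA2 * max(0, min(pp_a, pp_b) - pp_s))

def c_fx_gain(n_a, n_m, n_q):
    """Constant-coefficient gain multiplier area, equation (6.4)."""
    return CG1 * n_a * n_m + CG2 * (n_a + n_m - n_q)

def c_fx_mult(n_a, n_b, n_q):
    """Full multiplier area, equation (6.5)."""
    return (CM1 * min(n_a, n_b) * (max(n_a, n_b) + 1)
            + CM2 * (n_a + n_b - n_q))
```

Under the dedicated resource binding assumption, the total area estimate is simply the sum of these per-node costs over the graph.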
6.4.1 Evaluation of Area Models
Table 6.2 compares the actual area of DFX modules on a Virtex4 FPGA with the estimates from the area models in this section. Each module's input and output signal parameters are randomly chosen. We can see that the estimated areas are within ±15% of the actual areas, and the results for the other platforms are similar.
6.5 Determining the Upper Scaling p1 Parameter
Table 6.2: Comparison between the actual area of DFX modules and the estimates from the area models, on a Virtex4 FPGA.

         Adder V1 (LUTs)          Adder V2 (LUTs)
Module   Act.   Est.   Error      Act.   Est.   Error
1        39     34     -12.8%     30     29     -3.3%
2        39     36     -7.7%      31     29     -6.5%
3        56     52     -7.1%      39     38     -2.6%
4        38     35     -7.9%      39     34     -12.8%
5        19     20     5.3%       19     21     10.5%

         Gain Multiplier (LUTs)   Full Multiplier (LUTs)
Module   Act.   Est.   Error      Act.   Est.   Error
1        57     55     -3.9%      109    102    -6.4%
2        76     83     9.7%       170    151    -11.2%
3        82     71     -13.6%     173    181    4.7%
4        50     55     10.0%      185    180    -2.7%
5        54     48     -11.1%     215    227    5.4%

As Num1 is the upper number range of DFX, the upper scaling parameter p1 for each signal must be sufficient to prevent overflow from occurring, and it can be determined independently of the other two DFX parameters. When the optional maximum probability of overflow is specified by the user, the p1 parameters can be determined using Chebyshev's inequality, as discussed in Section 6.5.2. Otherwise, the simulated peak values are used to find the p1 parameters. After this step, the p1 parameters found may be too pessimistic, but this is compensated for when the DFX parameters are propagated and conditioned, as explained in Section 6.3.
6.5.1 Simulated Peak Values

In order to determine the maximum peak value, Ps, of each signal, a simulation run is first performed using the set of user-provided inputs. The standard deviation of the signal, σs, which is used in the next section, is recorded during the same simulation run. Ps is then scaled up by a user-defined 'safety factor' k to provide some guard bits against overflow. Typically k = 4 is used to give 2 overflow guard bits [Con01]. Hence, for signal s, we can derive the upper scaling parameter, ps1, as

ps1 = ps_sim = ⌊log2(kPs)⌋ + 1   (6.6)

where ⌊·⌋ returns the largest integer less than or equal to its argument.

Using the simulated peak values is fine in most cases, provided there is no feedback in the design. Designs with feedback, e.g. IIR filters, may suffer from limit cycles and irrecoverable oscillations when overflow occurs, as mentioned in Section 2.3.
6.5.2 Chebyshev’s Inequality
Using Chebyshev's inequality, the maximum probability of a zero-mean DFX signal overflowing can be determined from the signal's variance and its maximum representable magnitude.

The single-tailed Chebyshev inequality [GS01] states that

Pr[|X − µX| ≥ A] ≤ σ²X / A²   (6.7)

where µX and σ²X are the mean and variance of a variable X.

Letting A be the magnitude of the largest DFX representable number, i.e. A = 2^(ps_chev), the probability of overflow for a zero-mean signal s ∈ S is bounded by

Pr[overflow] = Pr[|s| ≥ 2^(ps_chev)] ≤ σ²s / 2^(2 ps_chev)   (6.8)
Using (6.8), we define the maximum probability of overflow, λ, as

λ = σ²s / 2^(2 ps_chev)   (6.9)

Hence, by rearranging (6.9), we can determine the scaling parameter ps_chev for signal s ∈ S from the standard deviation, σs, collected during the simulation run of the previous section and the user-specified maximum probability of overflow, λ, as shown below.

ps_chev = log2 σs − (1/2) log2 λ   (6.10)
If λ is set too high, overflow is bound to occur. To prevent this, an extra condition is set on the upper scaling parameter according to (6.11). If the maximum probability of overflow constraint is set by the user, the upper scaling ps1 would normally take the value of ps_chev. However, if the scaling parameter found through simulation, ps_sim (6.6), is higher than ps_chev, then ps1 takes the value from the simulation method and a note of this is logged for the user to act on.

ps1 = max(ps_sim, ps_chev)   (6.11)
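Equations (6.6), (6.10) and (6.11) can be combined into one routine; rounding the Chebyshev scaling up to the next integer is an assumption made here for illustration:

```python
import math

def p1_from_peak(peak, k=4):
    """Upper scaling from the simulated peak value, equation (6.6)."""
    return math.floor(math.log2(k * peak)) + 1

def p1_from_chebyshev(sigma, lam):
    """Upper scaling from Chebyshev's inequality, equation (6.10),
    rounded up to an integer scaling (illustrative choice)."""
    return math.ceil(math.log2(sigma) - 0.5 * math.log2(lam))

def upper_scaling(peak, sigma, lam=None, k=4):
    """Combined rule (6.11): use the Chebyshev scaling when a maximum
    overflow probability lam is given, but never fall below the
    simulation-derived scaling."""
    p_sim = p1_from_peak(peak, k)
    if lam is None:
        return p_sim
    return max(p_sim, p1_from_chebyshev(sigma, lam))
```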
6.6 The Optimisation Problem, Formulated

Given a computation graph G(V, S), Section 6.5 described how the upper scaling vector p1 can be found, either using the results from simulation or with the Chebyshev inequality and a user-specified maximum probability of overflow λ. Combining the area models presented in Section 6.4 into a single area measure on G gives a cost metric C_G(n, p0, p1). Together with the error variance model E_k (5.26), combined into a vector E_G(n, p0, p1) with one element per output, this allows the DFX parameter optimisation problem stated in Problem 6.1 to be formulated. Here E denotes the user-specified bound on the output error given by the user's output SNR constraint.
Problem 6.1. Given a computation graph G(V, S), the DFX parameter optimisation problem may be defined as selecting n and p0 such that C_G(n, p0, p1) is minimized subject to (6.12).

n ∈ ℕ^|S|,   p0 ≤ p1,   E_G(n, p0, p1) ≤ E   (6.12)

□
The condition p0 ≤ p1 does not restrict all signals to be purely DFX. Any optimisation procedure should strike a balance between DFX and fixed-point formats for the signals to minimize the area cost.
6.7 Exploring the Feasibility of DFX
Take a DFX signal X with format ⟨n, p0, p1⟩ and let K be the probability of signal X being in the Num0 range. Using Equation (5.3), the error variance of signal X is given by Equation (6.13) below, assuming truncation from infinite precision to simplify the equation. We can see that as K decreases, the error variance increases.

E{e²_DFX} = (1/12) · 2^(−2n) · (K·2^(2p0) + (1−K)·2^(2p1))   (6.13)
Using Chebyshev's inequality (6.7), the maximum probability of a zero-mean DFX signal being in the Num1 range, (1 − K), can be determined from the binary point p0 and the signal's variance, σ². This time, letting A = 2^p0, which is the boundary between the Num0 and Num1 ranges, we can determine the maximum probability of a signal being in the Num1 range and rearrange it to get the lower limit for K (6.14).

(1 − K) = Pr[|X| ≥ 2^p0] ≤ σ² / 2^(2p0)

K ≥ 1 − σ² / 2^(2p0)   (6.14)
Assuming K is at its lowest limit (i.e. K = 1 − σ²/2^(2p0)) and substituting it into (6.13), we get the error variance below.

E{e²_DFX} = (1/12) · 2^(−2n) · ((1 − σ²/2^(2p0))·2^(2p0) + (σ²/2^(2p0))·2^(2p1))
          = (1/12) · 2^(−2n) · (2^(2p0) − σ²·(1 − 2^(2p1)·2^(−2p0)))   (6.15)
Taking the 1st and 2nd derivatives of Equation (6.15), we get

dE/dp0 = (1/6) · 2^(−2n) · ln 2 · (2^(2p0) − σ²·2^(2p1)·2^(−2p0))

d²E/dp0² = (1/3) · 2^(−2n) · (ln 2)² · (2^(2p0) + σ²·2^(2p1)·2^(−2p0))   (6.16)

To obtain the lowest error variance, we examine the stationary point where dE/dp0 = 0. Ignoring negative and imaginary roots, the value of p0 that minimises the error variance at the lowest value of K is

p0 = (1/2)·(p1 + log2 σ)   (6.17)
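The stationary point (6.17) can be checked numerically against a brute-force minimisation of (6.15); the values of n, p1 and σ below are arbitrary illustrative choices:

```python
import math

def dfx_error_variance(n, p0, p1, sigma):
    """Equation (6.15): error variance at the Chebyshev-limited K."""
    k = 1.0 - sigma**2 / 2.0**(2 * p0)
    return (2.0**(-2 * n) / 12.0) * (k * 2.0**(2 * p0)
                                     + (1 - k) * 2.0**(2 * p1))

def best_p0(p1, sigma):
    """Equation (6.17): stationary point of (6.15)."""
    return 0.5 * (p1 + math.log2(sigma))

# Brute-force check that (6.17) really minimises (6.15) over a grid.
n, p1, sigma = 12, 8, 2.0
candidates = [p / 100.0 for p in range(100, 801)]
numeric = min(candidates, key=lambda p0: dfx_error_variance(n, p0, p1, sigma))
analytic = best_p0(p1, sigma)   # 0.5 * (8 + 1) = 4.5
```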
The equation for the p0 parameter above does not guarantee an area-optimised implementation, as it does not take any area metric into account. However, by placing a relationship between fixed-point and DFX implementation costs, we can make a meaningful DFX/fixed-point comparison and develop an understanding of the types of design and conditions for which DFX is best suited.
From Chapter 4, we know that a DFX implementation is always larger than a fixed-point implementation of the same wordlength due to the added overhead cost; a DFX implementation is θ times larger than the equivalent wordlength fixed-point implementation [ECC04]. Therefore a wordlength-multiplier, θ, is introduced such that

m = θn   (6.18)

to equate the area cost of a fixed-point implementation with wordlength m and a DFX implementation with wordlength n. A smaller value of θ is desirable for DFX, as it implies that the DFX implementation's additional overhead cost (on top of an equivalent fixed-point implementation) is relatively cheap.

Using (6.18), the error variance of a fixed-point signal with wordlength m and scaling p is given by (6.19). For the fixed-point scaling p, it makes sense to set p = p1 in order to match the highest representable value of DFX.

E{e²_FX} = (1/12)·2^(−2m+2p) = (1/12)·2^(−2θn+2p1)   (6.19)
For DFX to be a better number representation than fixed-point, we want
the error variance of DFX to be less than the error variance of fixed-point. In
other words, we want inequality (6.20) to be satisfied.

E\{e_{DFX}^2\} \le E\{e_{FX}^2\}    (6.20)
Substituting equations (6.15), (6.17) and (6.19) into inequality (6.20), we
have determined the upper limit, θ̄, for the wordlength-multiplier θ, shown below
in (6.21). Equation (6.21) implies that when θ is below the limit, a DFX
implementation would have a lower error variance than a fixed-point implementation
with equivalent chip area. In other words, the DFX implementation
might have the flexibility to sacrifice error variance to optimise chip area. A
high value of θ̄ gives a DFX implementation more flexibility to absorb the
extra DFX hardware overhead over fixed-point.

\theta < \bar{\theta} = 1 + \frac{1}{2n}\left[2p_1 - \log_2\left(\sigma 2^{p_1+1} - \sigma^2\right)\right]    (6.21)
Examining the inequality above further, we can deduce that the limit θ̄
increases when a signal has a small variance but needs a large p1 (to prevent
overflow). Such signals have high kurtosis [SM94], i.e. the signal values
are distributed close to the mean (assumed zero here) but the distribution has
fat tails. In order to represent these signals, a number representation with a wide
dynamic range is needed. Also, the inverse relationship between the wordlength n and θ̄
means DFX forms a better representation when a design is able to tolerate a large
error, i.e. a low n, since shorter wordlengths lead to larger errors. Figure 6.5
illustrates the limit θ̄ given by Equation (6.21) for the cases n = 10
and n = 30.
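The behaviour of the limit in Equation (6.21) can be sketched numerically; the parameter values below are illustrative only. The limit rises as the signal variance shrinks (for a fixed p1, i.e. a wider dynamic range) and as the wordlength n shrinks, matching the discussion above.

```python
import math

def theta_bar(n, p1, sigma):
    # Equation (6.21): theta_bar = 1 + (1/(2n)) * (2*p1 - log2(sigma*2^(p1+1) - sigma^2))
    return 1.0 + (2 * p1 - math.log2(sigma * 2.0 ** (p1 + 1) - sigma ** 2)) / (2.0 * n)

p1 = 4
# Smaller signal variance (same p1, i.e. wider dynamic range) raises the limit...
assert theta_bar(10, p1, sigma=1e-3) > theta_bar(10, p1, sigma=1e-1)
# ...and so does a shorter wordlength n (a design that tolerates more error).
assert theta_bar(10, p1, sigma=1e-3) > theta_bar(30, p1, sigma=1e-3)
```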
To view the limit θ̄ from a different perspective, we shall use the equation
for p1 found with the Chebyshev inequality (Section 6.5.2). Substituting p1 from
(6.10) into (6.21), we get

\theta < \bar{\theta} = 1 + \frac{1}{2n}\log_2\left(2\lambda^{\frac{1}{2}} - \lambda\right)^{-1}    (6.22)
Taking p1 from (6.10) and p0 from (6.17) and substituting them into the error
variance (6.15), we get the error variance for DFX in terms of the maximum
overflow probability, λ, and the wordlength, n.

E\{e_{DFX}^2\} = \frac{\sigma^2}{12}\left(2\lambda^{-\frac{1}{2}} - 1\right)\cdot 2^{-2n}    (6.23)
Figure 6.5: Boundary of the wordlength-multiplier using Equation (6.21) with varying p1 scaling and signal variance. (Plot omitted: axes are p1, the signal variance σx² and the wordlength-multiplier limit θ̄, with surfaces for n = 10 and n = 30; the limit increases as n decreases.)
Hence, with the SNR defined as the ratio of signal power to noise power (i.e.
SNR = σ²/E), rearranging the equation above gives us

n = \frac{1}{2}\log_2\left(\frac{SNR}{12}\left(2\lambda^{-\frac{1}{2}} - 1\right)\right)    (6.24)
Finally, substituting (6.24) into (6.22) gives (6.25), where the limit θ̄ is
expressed as a function of the maximum overflow probability λ and the SNR. Figure 6.6
illustrates equation (6.25). From the graph, we can see that the θ̄ limit is higher
when the required SNR and the maximum overflow probability are small.

\theta < 1 + \frac{\log_2\left(2\lambda^{\frac{1}{2}} - \lambda\right)^{-1}}{\log_2\left(\frac{SNR}{12}\left(2\lambda^{-\frac{1}{2}} - 1\right)\right)}    (6.25)
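Equation (6.25) is easy to evaluate directly. The sketch below (with SNR as a linear power ratio, converted from dB) reproduces the trend described above: the limit is higher for a low required SNR and a low overflow probability.

```python
import math

def theta_limit(snr, lam):
    # Equation (6.25): 1 + log2((2*sqrt(lam) - lam)^-1) / log2((SNR/12)*(2/sqrt(lam) - 1))
    num = math.log2(1.0 / (2.0 * math.sqrt(lam) - lam))
    den = math.log2((snr / 12.0) * (2.0 / math.sqrt(lam) - 1.0))
    return 1.0 + num / den

def db(x):
    # Convert dB to a linear power ratio.
    return 10.0 ** (x / 10.0)

# A lower required SNR gives a higher limit...
assert theta_limit(db(15), 1e-8) > theta_limit(db(60), 1e-8)
# ...as does a lower maximum overflow probability.
assert theta_limit(db(60), 1e-12) > theta_limit(db(60), 1e-4)
```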
In practice, attempting to ascertain the wordlength multiplier, θ, is not
a trivial task, as it depends on a variety of factors such as the number of
different atomic operations in a design and their input and output signal
parameters. Note also that, because the analysis in this section is based on a single
signal, the inequalities require all signals in the design to have the
same variance and/or the same p1 parameter. The wordlength-multiplier
inequalities discussed in this section serve to provide a glimpse of the possible
scenarios in which DFX is better than fixed-point.

Figure 6.6: Boundary of the wordlength-multiplier with varying maximum probability of signal overflow, λ, and SNR. (Plot omitted: overflow probability from 10⁻²⁰ to 10⁰, SNR from 0 to 250 dB.)
In general, DFX is better suited for a design when the output SNR constraint
is low and when the internal signals need a wide dynamic range. Also,
when a constraint on the maximum overflow probability λ is provided, DFX is
the better choice for low λ. These properties are demonstrated
by the results of the case studies in Section 6.9.
6.8 Meta-Heuristic Approach to Optimisation Using Simulated Annealing

To determine the best wordlength and scaling parameters for the signals, this
section describes an optimisation approach using the simulated annealing meta-heuristic.
As mentioned in this chapter's introduction, the signals in a design
may be either DFX or fixed-point and, looking back at Figure 4.16, a mixture
of DFX and fixed-point signals leads to a non-convex search space.
Simulated annealing is an optimisation algorithm for problems that are NP-hard
[CW02] and have a non-convex search space. After a brief background
on simulated annealing, the two-phase optimisation routine for RightSize is
introduced.
6.8.1 Background to simulated annealing

Simulated annealing (SA) is a flexible optimisation method that is suited to
large-scale combinatorial optimisation problems. Kirkpatrick et al. [KGV83]
were the first to use a generalised form of the Metropolis Monte Carlo scheme
[MRRT53] for optimisation. At about the same time, Cerny [Cer85] and Pincus
[Pin70] independently developed the algorithm further into what is now
known as simulated annealing. SA has been successfully applied to classical
combinatorial optimisation problems, such as the travelling salesman problem,
and to problems concerning the design and layout of very-large-scale integration
(VLSI) circuits. SA differs from standard iterative optimisation methods by
allowing 'uphill' moves – moves that spoil, rather than improve, the temporary
solution [DH96]. Refer to [Haj85, JAMS89, JAMS91] for surveys of the SA
algorithm and its use.
The problems for which SA optimisation is useful are characterised by a very
large discrete search space over which the objective cost function we want to minimise
(or maximise) is non-convex. When the search space is
non-convex, some optimisation algorithms (e.g. the greedy
algorithm) tend to get trapped in a local minimum. Simulated annealing prevents this
by using rules derived from an analogy to the process in which liquids
are cooled to a crystalline form, a process called annealing. Each iterative
step of the SA algorithm replaces the current state by a random state within
its neighbourhood with a probability that is a function of the state cost and
a temperature parameter, T, which gradually 'cools'. When the temperature
is 'hot', a greater number of large random changes to the current state is
allowed, preventing the search from becoming stuck in a local minimum. As
the temperature cools, only incrementally better changes to the current state
are allowed, which narrows the search down towards the global minimum.
A readily available Adaptive Simulated Annealing (ASA) code by Ingber
[Ing93] is adapted for the RightSize optimisation procedure to optimise the DFX
parameters. The fundamental workings of the ASA algorithm are fairly simple.
Firstly, a feasible starting point is chosen to initialise the current state x, its
associated cost, C, is found with a cost function, and the initial temperature is
set to T = T0 (usually a large value) to start the annealing process. State
x is a vector of the parameters we want to optimise. The exponential
temperature 'cooling' schedule of ASA rapidly reduces the temperature
to narrow the search, regardless of the initial value of T0 [Ing93]. The annealing
then consists of the cyclic repetition of the following steps until a
termination criterion is satisfied.
Step 1. Generation and evaluation: A random trial state x′ is generated by
a uniform distribution generator and its associated cost C′ is evaluated by the
cost function. If the trial state x′ fails any user-derived constraint, this step
is repeated until a valid x′ is found. Doing so does not violate the proof that a global
minimum can statistically be obtained with ASA [Ing96].
Step 2. Examination: If C′ < C, the trial state is accepted (i.e. x and
C are set equal to x′ and C′, respectively). If C′ ≥ C, the trial state is
accepted with a probability defined by the Boltzmann probability distribution,
exp(−(C′ − C)/T) [Ing96]. When a trial state is accepted, an acceptance counter
Acc is incremented.
Step 3. Temperature cooling: An iteration counter k is incremented and the
exponential cooling schedule (6.26) is applied to T. Here, the parameters ψ
(Temperature_Ratio_Scale) and ϕ (Temperature_Anneal_Scale) are tuning
parameters supplied by ASA. They are kept at their default values of ψ =
−ln(1.5×10⁻⁵) and ϕ = ln(100.0) [Ing96].

T = T_0\exp\left(-\psi\left(k\,e^{\varphi}\right)^{1/|x|}\right)    (6.26)
The termination criterion is set up so that (1) the number of accepted
states, Acc, is limited to a maximum parameter, Accmax, and (2) the algorithm
is stopped when no improvement to the cost is recorded after 500
iterations. When ASA ends, the final optimised state x and its cost C are
made available.
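A minimal sketch of the loop described in Steps 1–3 is given below. It is illustrative only — it is not Ingber's ASA code, and ASA's constraint checking and re-annealing features are omitted — but it shows the Boltzmann acceptance rule and the exponential cooling schedule (6.26).

```python
import math, random

def anneal(cost, neighbour, x0, T0=10.0,
           psi=-math.log(1.5e-5), phi=math.log(100.0),
           acc_max=5000, stall_limit=500):
    """Simplified SA loop; `cost` and `neighbour` are problem-specific."""
    x, C = list(x0), cost(x0)
    acc = k = stall = 0
    while acc < acc_max and stall < stall_limit:
        x_trial = neighbour(x)                     # Step 1: random trial state
        C_trial = cost(x_trial)
        k += 1
        T = T0 * math.exp(-psi * (k * math.exp(phi)) ** (1.0 / len(x)))  # (6.26)
        # Step 2: always accept improvements; accept uphill moves with
        # probability exp(-(C' - C) / T) (Boltzmann).
        if C_trial < C or random.random() < math.exp(-(C_trial - C) / max(T, 1e-300)):
            stall = 0 if C_trial < C else stall + 1
            x, C, acc = x_trial, C_trial, acc + 1
        else:
            stall += 1
    return x, C

# Toy usage: minimise (x - 7)^2 over the integers.
random.seed(42)
best_x, best_C = anneal(lambda v: (v[0] - 7) ** 2,
                        lambda v: [v[0] + random.choice([-1, 1])], [0])
```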
From the overview above, we can see that the speed of the SA algorithm
depends on three major factors: the maximum number of accepted states,
Accmax; the speed of evaluating the cost function; and the ease of generating valid
states x′ that meet the user constraints in Step 1. These factors can be
managed to trade off the speed of the optimisation against the quality of its
result. Typically, there is a greater chance of securing better results if more
time is spent on the optimisation.
Elaborating on the ease of generating valid states x′: the random state
selection in Step 1 gives rise to the possibility of generating states that
violate the user's constraints, in which case most of the time would be spent looping
within the same step. This would be the case if the SA algorithm tried to
optimise both the parameter vectors n and p0 together to meet the constraints
set up in Problem 6.1. A simple 4th order IIR filter optimisation (as in the
later case study, Section 6.9.1) could take 74 hours to complete on a PC (Intel
Pentium IV 2.4GHz). Hence the proposed two-phase optimisation separately
optimises the lower scaling vector p0 followed by the wordlength vector n.
6.8.2 Optimisation Algorithm

A heuristic approach has been developed to find feasible lower scaling, p0,
and wordlength, n, vectors for a design with small, though not necessarily
optimal, area consumption. Figure 6.7 shows the flow of the optimisation
algorithm. Before performing the heuristic optimisation, some pre-optimisation
is needed. Starting with the computational graph G(V, S), the perturbation
analysis (Section 2.5.4) and profiling simulation (Section 5.5.1) are
first performed to enable the error analysis. Then the upper scaling
vector p1 is determined as in Section 6.5 to meet the user's requirements. The
optimisation heuristic is divided into two phases: Phase 1 attempts to find the
optimum set for p0 while Phase 2 finds the optimum set for n. Phase 1 uses
both Algorithm 6.5 and Algorithm 6.6, while Phase 2 uses only Algorithm 6.5.
Algorithm 6.5 (OptNOnly) optimises the wordlength parameters, n, to minimise
the design cost for a DFX annotated computational graph G(V, S, ADFX)
where both the p0 and p1 parameters are known. The minimum uniform wordlength
for all the signals which satisfies the user constraints is used as the initial
state. SA's iterative loops are then performed until the number of accepted
states reaches Acc = Accmax or the cost keeps repeating. The cost objective
function in this algorithm is determined using the area models presented
in Section 6.4. At the end, this algorithm returns the optimised vector of n
parameters and also the optimised cost of G(V, S, ADFX).
On the other hand, Algorithm 6.6 (OptP0Only) optimises the lower scaling
parameters, p0, for a DFX annotated computational graph G(V, S, ADFX),
where the objective is also to minimise the cost of G(V, S, ADFX). This time,
only the p1 parameters are known and kept constant. To generate the initial
state p0, Equation (6.17) is applied to the lower scaling parameter of every signal
s ∈ S such that

p_0^s = \frac{1}{2}\left(p_1^s + \log_2\sigma_s\right).    (6.27)

This algorithm then performs iterative SA loops until Acc = 5000 or the
cost keeps repeating. Instead of using the area models to determine the
cost, Algorithm 6.5 (OptNOnly) is called for each trial state p′0 with
Accmax = k log2(|S|), where |S| is the number of signals. The parameter k is
a constant that controls the amount of time spent looking for an optimum
solution and thus determines the level of optimisation. OptNOnly's Accmax
is set comparatively smaller than the typical value of 5000 to quickly gauge
the cost of the trial state. Experiments have shown that setting k = 10 is
typically sufficient to gauge the cost of the trial state p′0 without being too
computationally intensive. At the end, this algorithm returns the optimised
vector of p0 parameters.
Therefore, Phase 1 consists of Algorithm 6.6 (OptP0Only), which uses Algorithm
6.5 (OptNOnly) within its iterative loop to give the optimised lower scaling
vector p0. Once Phase 1 is complete, Phase 2 optimises the wordlength
vector n with Algorithm 6.5 (OptNOnly) again, but this time it is allowed to
terminate regularly with Accmax = 5000. Splitting the optimisation into two
phases dramatically reduces the optimisation time: the same 4th order IIR
filter optimisation mentioned at the end of the previous section took just 45
minutes to complete on the same test machine. Figure 6.8 shows the optimisation
times and the resulting area with respect to the level of optimisation k for a
4th order IIR filter.
Algorithm 6.5 (OptNOnly): Optimise wordlength vector n with simulated annealing.
Input: G(V, S) with scaling vectors p0 and p1, and Accmax
Output: Wordlength vector n together with cost C_G(n, p0, p1)
1: Find u, the minimum uniform wordlength satisfying (6.12) with n = u·1
2: Set n ⇐ ku·1
3: Initialise temperature T = T0
4: Set cost C ⇐ C_G(n, p0, p1)
5: repeat
6:   Generate new random vector n′ in the neighbourhood of n
7:   DFX condition graph G(V, S, Adfx) where Adfx = (n′, p0, p1)
8:   Go back to line 6 if (6.12) is not satisfied
9:   Set cost C′ ⇐ C_G(n′, p0, p1)
10:  if (C′ < C) or (random < exp(−(C′ − C)/T)) then   // Accept new state
11:    Set n ⇐ n′, C ⇐ C′ and increment Acc
12:  end if
13:  Decrease T according to cooling schedule (6.26)
14: until (Acc > Accmax) or (cost keeps repeating)
15: return vector n and C
Algorithm 6.6 (OptP0Only): Optimise lower scaling vector p0 with simulated annealing.
Input: G(V, S) with upper scaling vector p1, and Accmax
Output: Lower scaling vector p0
1: Initialise p0 with Eqn. (6.27) ∀s ∈ S
2: Initialise temperature T = T0
3: Set cost C ⇐ (Algorithm 6.5 with p0, and Accmax = k log2(|S|))
4: repeat
5:   Generate new random vector p′0 in the neighbourhood of p0
6:   Set cost C′ ⇐ (Algorithm 6.5 with p′0 and Accmax = k log2(|S|))
7:   if (C′ < C) or (random < exp(−(C′ − C)/T)) then   // Accept new state
8:     Set p0 ⇐ p′0, C ⇐ C′ and increment Acc
9:   end if
10:  Decrease T according to cooling schedule (6.26)
11: until (Acc > Accmax) or (cost keeps repeating)
12: return lower scaling vector p0
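The nesting of the two algorithms can be sketched as follows. The search routines here are deliberately simplified greedy placeholders (the real Algorithms 6.5 and 6.6 use simulated annealing), and the cost function, wordlength bounds and iteration counts are purely illustrative; only the two-phase structure — Phase 1 costing each trial p0 with a short OptNOnly run, Phase 2 re-running OptNOnly in full — mirrors the text.

```python
import math, random

def opt_n_only(p0, p1, acc_max, cost):
    """Placeholder for Algorithm 6.5 (OptNOnly): greedy search over n."""
    n = [16] * len(p0)                        # uniform initial wordlengths
    best = cost(n, p0, p1)
    for _ in range(int(acc_max)):
        trial = [max(2, w + random.choice([-1, 0, 1])) for w in n]
        c = cost(trial, p0, p1)
        if c < best:
            n, best = trial, c
    return n, best

def opt_p0_only(p1, sigmas, cost, k=10, iters=50):
    """Placeholder for Algorithm 6.6 (OptP0Only): each trial p0 is costed
    by a short OptNOnly run with Acc_max = k * log2(|S|)."""
    inner = k * math.log2(len(p1))
    p0 = [0.5 * (p + math.log2(s)) for p, s in zip(p1, sigmas)]   # Eqn (6.27)
    _, best = opt_n_only(p0, p1, inner, cost)
    for _ in range(iters):
        trial = [p + random.choice([-0.5, 0.0, 0.5]) for p in p0]
        _, c = opt_n_only(trial, p1, inner, cost)
        if c < best:
            p0, best = trial, c
    return p0

def two_phase(p1, sigmas, cost):
    p0 = opt_p0_only(p1, sigmas, cost)            # Phase 1
    n, c = opt_n_only(p0, p1, 5000, cost)         # Phase 2: full Acc_max
    return n, p0, c

# Toy cost: area grows with total wordlength; a quadratic penalty pulls p0
# towards the Eqn (6.27) sweet spot for sigma = 0.1.
random.seed(0)
toy = lambda n, p0, p1: sum(n) + sum((a - 0.5 * (b + math.log2(0.1))) ** 2
                                     for a, b in zip(p0, p1))
n, p0, c = two_phase([4, 4, 4], [0.1, 0.1, 0.1], toy)
```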
Figure 6.7: Flow of the DFX optimisation with simulated annealing. (Diagram omitted: pre-optimisation performs the perturbation analysis (Section 2.5.4) and profiling simulation (Section 5.5.1) on the computation graph G(V, S) and determines p1 (Section 6.5); Phase 1 runs Algorithm 6.6 (Opt-p0Only), which calls Algorithm 6.5 (Opt-nOnly) with Accmax = k log2(|S|), to obtain p0; Phase 2 runs Algorithm 6.5 (Opt-nOnly) with Accmax = 5000 to obtain n, yielding the optimised DFX annotated computation graph G(V, S, Adfx).) The detailed flow diagrams of Algorithms 6.5 and 6.6 are not shown and are denoted by broken arrow lines; refer to the respective algorithms on page 168.
Figure 6.8: The optimisation times and the area with respect to the level of optimisation k for a 4th order IIR filter on ASIC. (Plot omitted: area in cells on the left axis, optimisation time in minutes on the right axis, k from 2 to 14.)
6.9 Case Study and Discussion

For the case study, an infinite impulse response (IIR) filter and a least-mean-square
(LMS) adaptive filter were implemented. The IIR filter is frequently used
in audio processing or general filtering of signal frequencies, whereas LMS
adaptive filters are used in communications to compensate for multi-path
distortion or in feedback control applications. Both filters are part of the BDTI
DSP benchmark suite [Ber00] used to measure the performance of DSP processors.

Both designs are optimised with the two-phase ASA optimisation (ASA2)
of Section 6.8.2 with multiple wordlengths and multiple scalings, which may
result in designs with a mix of DFX and fixed-point signals. Both designs are
benchmarked against designs optimised with a completely fixed-point (FX)
optimisation using the multiple wordlength optimisation heuristic by Constantinides
[CCL01], explained in Section 2.5.4.
Each design has been implemented on the same three platforms as in Chapter 4.
To reiterate, the FPGAs are the Xilinx Virtex4
(XC4VLX15) and Altera Stratix2 (EP2S15), and the ASIC designs use the UMC
0.13µm High Density Standard Cell Library. All designs are synthesized using
Synplicity's Synplify Pro (for FPGAs) or Synplify ASIC. The FPGA designs are then
placed and routed using the respective platform vendor's software.
The first test varies the output SNR without any constraint on the maximum
overflow probability; this demonstrates inequality (6.21) and
shows that DFX may be more area-cost-effective when signals have a small variance
and high noise is tolerable, i.e. a low SNR. For this test, two different sets
of input signals with different variances were used.

The second test uses two different fixed output SNRs while varying the
maximum overflow probability λ. This test demonstrates inequality
(6.25) and shows that when the maximum overflow probability λ is provided,
DFX may be more area-cost-effective when λ and the SNR are small. For
all cases, the cost function is minimised and the two tests are performed
on each design.
6.9.1 IIR Filter

The IIR filter used in this case study is a 4th order high-pass filter made from
two cascaded direct-form-II 2nd order filters [OS99]. Figure 6.9 shows its data-flow
graph and its frequency response. This high-pass filter has been
generated using the Filter Design & Analysis tool in Matlab, allowing signals
with normalised frequencies of 0.7π and above to pass through.
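The cascade structure of Figure 6.9(a) can be sketched in software. The coefficients below are hypothetical placeholders (two identical biquads with a double zero at z = 1 and poles at radius 0.5), not the Matlab-generated thesis coefficients; the point is the direct-form-II second-order section, which uses two delay elements per stage.

```python
def df2_biquad(b, a, x, g=1.0):
    """Direct-form-II second-order section. b = (b0, b1, b2), a = (1, a1, a2);
    the input is pre-scaled by the section gain g (the g(1) block in Fig. 6.9)."""
    b0, b1, b2 = b
    _, a1, a2 = a
    w1 = w2 = 0.0                                # the two z^-1 state registers
    y = []
    for xn in x:
        w0 = g * xn - a1 * w1 - a2 * w2          # recursive (feedback) half
        y.append(b0 * w0 + b1 * w1 + b2 * w2)    # feed-forward half
        w1, w2 = w0, w1
    return y

def cascade(sections, x):
    for g, b, a in sections:
        x = df2_biquad(b, a, x, g)
    return x

# Hypothetical high-pass-like sections: the (1 - z^-1)^2 numerator blocks DC.
sections = [(1.0, (1.0, -2.0, 1.0), (1.0, 0.0, 0.25)),
            (1.0, (1.0, -2.0, 1.0), (1.0, 0.0, 0.25))]

dc = cascade(sections, [1.0] * 64)                         # constant input
alt = cascade(sections, [(-1.0) ** i for i in range(64)])  # Nyquist-rate input
```

After the transient decays, the constant (DC) input is blocked entirely, while the Nyquist-rate input passes with gain (4/1.25)² = 10.24, as expected from the per-section transfer function H(z) = (1 − z⁻¹)²/(1 + 0.25z⁻²).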
For the first test, two different 100,000-sample zero-mean audio inputs sampled at
44.1kHz were used. The inputs have different sample variances:
(i) 1.6×10⁻⁵ (lower variance) and (ii) 1.2×10⁻² (higher variance). Both input
samples have also been normalised so that they have the same maximum magnitude,
and therefore the upper scaling parameters used for both inputs are the same.
The optimisation was done with varying output SNR but without a constraint on
the maximum probability of overflow.

Figure 6.9: Case study 4th order IIR filter. (Diagram omitted: (a) cascaded structure with two second order filters; (b) frequency response.)
The cost of the IIR filter optimised with the lower-variance input on ASIC
is illustrated in Figure 6.10(a) with varying output SNR. Results
on the FPGA platforms follow the same trend and are hence not shown. As a
summary, the percentage ratio of the area cost after an ASA2 optimisation
over the area cost after an FX optimisation is plotted in Figures 6.11(a)&(b)
for the inputs with variance (i) 1.6×10⁻⁵ and (ii) 1.2×10⁻² respectively. All three
hardware platforms are shown on scatter plots together
with their respective trend lines obtained through linear regression.
Referring to the discussion in Section 6.7, the DFX number representation
can provide a better cost than fixed-point when the required output SNR is low
and when there are wide-dynamic-range signals. The ASA2 optimisation produces
synthesized designs which are about 3% better than a design from an
FX optimisation when the output SNR is 15dB. The percentage improvement
reduces as the required output SNR is increased. When the higher-variance
input is used, there is hardly any improvement, as a fully fixed-point
design is the most efficient implementation of the filter.

Figure 6.10: Area of the ASIC 4th order IIR filter with the lower variance input. (Plots omitted: (a) varying output SNR; (b) varying overflow probability λ; both panels compare the area in cells of the FX- and ASA2-optimised designs.)
Analysing each signal individually for the case of the lower-variance input,
both ASA2 and FX optimised most signals to fixed-point apart from some
signals near the input. ASA2 optimised these signals to a DFX format, which
accounts for its area cost improvement over the FX optimisation. This is because
the input with lower variance has a wider dynamic range than the input
with higher variance (both sets of inputs have been normalised, so their
maximum absolute values are the same). The ASA2 optimisation took advantage
of this and used DFX on the signals near the input, since DFX has a wider
dynamic range than fixed-point. The other signals do not vary much in
magnitude, and therefore fixed-point is the most efficient way to represent them.
When the higher-variance input was used, there were no area cost savings from
using DFX at the inputs and the whole design was optimised using fixed-point.
For the second test, the larger-variance input was used. This time the
maximum overflow probability λ is varied with the output SNR constraint
fixed to 15dB and 60dB. Figure 6.10(b) shows the synthesized cost on the
ASIC platform when the output SNR constraint is 15dB. The results on the FPGA
platforms again show a similar trend and are not shown. Figures 6.11(c)&(d) show
the percentage ratio of the ASA2 over FX area cost for all three
hardware platforms. They are scatter plots similar to those of the previous
test, with a linear regression trend line for each hardware platform.
From Figure 6.11(c), when the output SNR is 15dB, the extra cost savings
of ASA2 over FX improve as the maximum overflow probability is reduced.
When λ = 10⁻⁴, the ASA2 optimisation gave only a 2% improvement over the FX
optimisation on the Virtex4 FPGA, but this increased to 11% when λ = 10⁻⁸.
Unlike the first test, where only the signals around the input were optimised
to a DFX format, the signals around the gain multiplier are also optimised
to DFX in order to provide the wide dynamic range needed for the required maximum
overflow probability and precision. When the output SNR is 60dB,
Figure 6.11(d) shows a similar trend but with a smaller gradient. These results
conform to the expectation outlined in Section 6.7 that DFX can provide
better area cost efficiency than fixed-point when the output SNR is low and/or
a low maximum overflow probability is needed.
6.9.2 Adaptive LMS Filter

An adaptive filter is an example of a non-linear application with feedback.
Incremental updates are made to the coefficients used in the filter by feeding
back accumulated correction terms. The output of the filter is compared
with a desired filter response to produce the feedback correction terms. This
desired filter response may be known beforehand (e.g. the training sequence
used in GSM mobile telephony). Figure 6.12 shows the structure of the 1st
order LMS filter used in this case study.
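The update loop described above can be sketched as follows — a generic 2-tap LMS with a hypothetical step size mu; the thesis' word-level implementation details are not reproduced. The filter output is compared with the desired response, and the error drives the incremental coefficient updates.

```python
import random

def lms(x, d, taps=2, mu=0.1):
    """2-tap LMS adaptive FIR: returns the final coefficients and the
    per-sample error sequence."""
    w = [0.0] * taps          # adaptive coefficients
    buf = [0.0] * taps        # input delay line
    errs = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]                          # shift in newest sample
        y = sum(wi * bi for wi, bi in zip(w, buf))     # filter output
        e = dn - y                                     # feedback correction term
        w = [wi + mu * e * bi for wi, bi in zip(w, buf)]
        errs.append(e)
    return w, errs

# Toy system identification: recover an unknown 2-tap response h.
random.seed(1)
h = [0.5, -0.3]
x = [random.uniform(-1.0, 1.0) for _ in range(2000)]
d = [h[0] * x[i] + h[1] * (x[i - 1] if i > 0 else 0.0) for i in range(len(x))]
w, errs = lms(x, d)
```

With noise-free data the coefficients converge to h and the error decays towards zero.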
Figure 6.11: Area ratio of the ASIC 4th order IIR filter optimised with the proposed two-phase ASA optimisation (ASA2) and fixed-point only optimisation (FX). (Plots omitted: ASA2/FX area ratio (%) for the ASIC, Stratix2 and Virtex4 platforms; (a) input variance = 1.6×10⁻⁵; (b) input variance = 1.2×10⁻²; (c) output SNR = 15dB; (d) output SNR = 60dB.)
Figure 6.12: Case study 1st order LMS filter. (Diagram omitted: a two-tap adaptive FIR structure with inputs Input and Input_ref.)
The filter has two inputs: the normal Input signal and the desired filter
response Input_ref signal. For the desired input signal, we used the same
audio sample input as for the IIR filter case study in the previous
section. For the normal input, four equally sized segments of the desired input
were passed through four different 2nd order autoregressive filters to distort the
signal. The constant coefficients of each distortion filter are randomly chosen
to give complex-conjugate filter pole pairs with magnitudes in the range (0, 1)
and phases in the range (0, π/2).
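The distortion-filter construction can be sketched directly: a complex-conjugate pole pair r·e^{±jθ} fixes the AR(2) denominator coefficients, since (1 − re^{jθ}z⁻¹)(1 − re^{−jθ}z⁻¹) = 1 − 2r cos θ·z⁻¹ + r²z⁻². The random ranges below follow the text; the actual draws used in the thesis are of course not reproduced.

```python
import math, random

def ar2_from_pole(r, theta):
    """Coefficients (a1, a2) of the all-pole filter 1/(1 + a1*z^-1 + a2*z^-2)
    with poles at r*exp(+/- j*theta): a1 = -2*r*cos(theta), a2 = r^2."""
    return -2.0 * r * math.cos(theta), r * r

def ar2_filter(x, a1, a2):
    """Apply the AR(2) recursion y[n] = x[n] - a1*y[n-1] - a2*y[n-2]."""
    y1 = y2 = 0.0
    out = []
    for xn in x:
        y0 = xn - a1 * y1 - a2 * y2
        out.append(y0)
        y1, y2 = y0, y1
    return out

# One random distortion filter: pole magnitude in (0, 1), phase in (0, pi/2).
random.seed(0)
r = random.uniform(0.05, 0.95)
theta = random.uniform(0.05, math.pi / 2)
a1, a2 = ar2_from_pole(r, theta)
assert 0.0 < a2 < 1.0    # |poles| = sqrt(a2) < 1, so the filter is stable
```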
As with the IIR case study, the first test on the LMS filter varies
the output SNR constraint using two sets of inputs with different variances.
Using the lower-variance input (6.5×10⁻⁵), the area costs from optimising
the LMS filter on ASIC with the ASA2 and FX optimisations are illustrated in
Figure 6.13(a). Yet again, the FPGA designs show a similar trend and are not
shown. Figures 6.14(a)&(b) show the scatter plots of the percentage ratio
of synthesized design area cost using ASA2 over FX optimisation for all three
hardware platforms, together with the respective linear regression trend lines.
When the lower-variance input was used, Figure 6.14(a) shows a greater
percentage improvement compared to the earlier IIR case study. At
15dB, the LMS filter optimised with ASA2 is 30% better than FX on the
Virtex4 platform, compared to only 3% for the IIR filter. As expected, the
percentage improvement reduces as the output SNR requirement increases, and
when the higher-variance input is used (Figure 6.14(b)) the area improvement
of ASA2 over the FX optimisation is markedly reduced. At 3dB, the ASIC and
Virtex4 platforms had only a 4% improvement with ASA2 while the Stratix2
platform had none.

Figure 6.13: Area of the ASIC 2-tap adaptive LMS FIR filter. (Plots omitted: (a) varying output SNR; (b) varying overflow probability λ; both panels compare the area in cells of the FX- and ASA2-optimised designs.)
To account for the major improvements with the lower-variance input, we
looked at the individual signals in the LMS filter. The outputs of the full
multipliers and the signals near the input are the ones optimised to a DFX format
by the ASA2 optimisation. This is because the outputs of the non-linear full
multipliers and the inputs have a wide dynamic range, which is most suited
to DFX. Full multipliers make up a significant amount of chip area in an
LMS filter (about 60% on the ASIC platform) and, since a DFX full multiplier
is cheaper than a fixed-point full multiplier with equivalent dynamic range
(Chapter 4), it is not surprising to see the major area improvement of ASA2
over the FX optimisation.
The results of the second test with varying maximum overflow probability
are shown in Figure 6.13(b) for the implementation on the ASIC platform and
in Figures 6.14(c)&(d) for the synthesized ASA2 over FX percentage ratio. The
trends observed are similar to the IIR case study. As with the IIR case study,
the results of both tests show that using DFX for signals that need a wide
dynamic range may reduce the area cost of a design. Also, when the output
SNR and/or maximum overflow probability are low, mixing the DFX
and fixed-point formats is shown to give a better area cost than a fully
fixed-point optimisation. The ASA2 optimisation is able to optimise a design
to use both DFX and fixed-point efficiently.

Figure 6.14: Area ratio of the ASIC 2-tap adaptive LMS FIR filter optimised with the proposed two-phase ASA optimisation (ASA2) and fixed-point only optimisation (FX). (Plots omitted: ASA2/FX area ratio (%) for the ASIC, Stratix2 and Virtex4 platforms; (a) input variance = 6.5×10⁻⁵; (b) input variance = 1.7×10⁻³; (c) output SNR = 15dB; (d) output SNR = 60dB.)
An area that is lacking in both this case study and the previous one is a comparison
with a floating-point filter implementation. It was a conscious choice
to omit the comparison with floating-point from the start of this work, since
the initial work on arithmetic modules showed that floating-point implementations
were relatively larger than fixed-point and DFX. Furthermore, the majority of
DSP algorithms are implemented in fixed-point. However, for completeness, the
case studies should include floating-point designs, and this is part of the planned
future work.
6.10 Summary

This chapter has demonstrated an optimisation method to incorporate both
DFX and fixed-point number representation signals in a design using a two-phase
ASA algorithm (ASA2). The algorithm uses the error analysis detailed
in Chapter 5 and also a simplified area cost model to quickly analyse each design
iteration generated by the simulated annealing algorithm. After determining
the upper scaling parameters from the user's desired signal overflow constraint,
the optimisation proceeds in two phases: the first phase optimises the
lower scaling parameters and the second phase obtains optimised wordlengths.
The speed of the optimisation can be traded for the quality of the result by adjusting the
amount of time spent looking for the optimum point.
Section 6.7 explored the feasibility of using DFX in a design. Although the
analysis was performed on a single signal, it shows that DFX can provide
better area cost savings than fixed-point provided that a wide-dynamic-range
signal can tolerate a large error. Therefore designs that require a low output
SNR and/or have low signal overflow probability constraints can benefit from
using DFX. This result has been demonstrated by the case studies on the IIR
and LMS filters.
Chapter 7
Conclusions & Future Work
7.1 Summary
Continual integration of applications onto a chip is fuelled by the need to
drive down overall cost while achieving smaller and more efficient designs.
Apart from using smaller chip manufacturing process, a hardware designer
could explore other areas to improve the design efficiency such as the number
representation used in the design. In digital signal processing (DSP) designs,
the two main number representations used for the signal data-paths are fixed-
point and floating-point. As with all number representations, both come with
their own set of advantages and disadvantages. The aim of this thesis is to
introduce an alternative number representation that compromises between the
two number representations for DSP applications. The next aims were to
understand the conditions under which this number representation is best used,
and to develop a technique to discover the optimum parameters for this number
representation in the context of system design.
The new number representation, Dual FiXed-point (DFX), achieves better
dynamic range when compared to ordinary fixed-point. By using only a single
bit for the exponent field, DFX is able to bridge the dynamic range and hard-
ware implementation complexity gap between fixed-point and floating-point.
DFX’s extra dynamic range is possible due to the exponent recoding concept
presented in Chapter 3. Exponent recoding introduces a mapping function
onto the exponent field which is defined at design time. The extra level of in-
direction makes it possible to trade a number representation’s dynamic range
for its hardware implementation complexity. With the exponent recoding con-
cept, common number representations were generalised as special cases.
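As an illustration, decoding a DFX value can be sketched in a few lines of Python. The parameter names p0 and p1 follow the glossary; the exact bit-level packing (a sign bit, a one-bit exponent and an n-bit magnitude field) is a simplifying assumption for this sketch, not the thesis's precise definition.

```python
def dfx_decode(sign, e, mantissa, n, p0, p1):
    # The single exponent bit e selects one of two fixed-point scalings:
    # p0 (the finer Num0 range) or p1 (the coarser Num1 range), where p
    # counts the bits between the sign bit and the binary point.
    p = p0 if e == 0 else p1
    value = mantissa / (1 << (n - p))  # n - p fractional bits
    return -value if sign else value
```

For n = 4, p0 = 1 and p1 = 3, the same magnitude field 12 decodes to 1.5 in the Num0 range but 6.0 in the Num1 range, which is the essence of the dual scaling.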
Having defined the new DFX format, arithmetic hardware had to be designed,
since no existing hardware supported the new representation. For this thesis,
designs were created for the basic arithmetic operations:
addition and multiplication (gain and full). Coded in VHDL, the arithmetic
modules are easily synthesised onto any hardware platform and integrated into
any design flow. In general, the dual-precision nature of DFX forces the need
to align the signals before and/or after an arithmetic operation. This is
similar to floating-point arithmetic operations but the alignment is performed
using multiplexers. Therefore DFX hardware will always have extra hardware
overhead over an equivalent fixed-point arithmetic module.
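The multiplexer-based alignment can be sketched as follows. Operands are signed integer mantissas with one exponent bit each; the boundary check and the truncating right shifts are illustrative assumptions rather than the exact behaviour of the thesis's VHDL modules.

```python
def dfx_add(ma, ea, mb, eb, n, p0, p1):
    """Add two DFX operands (mantissas ma, mb; exponent bits ea, eb)
    with scalings p0 < p1 and n magnitude bits. A hedged sketch only."""
    d = p1 - p0
    # Pre-alignment: when the exponents disagree, shift the finer (Num0)
    # operand right by d to match the coarser Num1 scaling. In hardware
    # this is a fixed shift selected by a multiplexer.
    if ea != eb:
        if ea == 0:
            ma >>= d
        else:
            mb >>= d
        e = 1
    else:
        e = ea
    s = ma + mb
    # Post-alignment: a Num0 sum that crosses the range boundary is
    # promoted to Num1; overflow handling is omitted for brevity.
    if e == 0 and abs(s) >= (1 << n):
        s >>= d
        e = 1
    return s, e
```

With n = 4, p0 = 1, p1 = 3, adding 1.0 (Num0 mantissa 8) and 2.0 (Num1 mantissa 4) yields mantissa 6 in the Num1 range, i.e. 3.0 — the shifts play the role of the alignment multiplexers described above.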
In line with the research objectives, the effect of precision errors introduced
when using DFX was investigated. Unlike fully fixed-point systems, estimat-
ing the signal-to-noise ratio (SNR) at the outputs of a system with DFX is
a non-trivial task due to DFX’s dual precision. The errors introduced in the
system from truncation after arithmetic operations are also correlated with
one another. A quick way to estimate the output errors is needed to provide
feedback to RightSize's optimisation routine, as a computationally intensive
exhaustive simulation approach would not be practical. Analytical error models
for the DFX arithmetic modules and a single-run profiling simulation were
developed for incorporation into the RightSize synthesis tool. Interestingly, the
rounding errors introduced were shown not to correlate with one another.
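Because the rounding-error sources proved uncorrelated, their variances simply add at the output, so an SNR estimate reduces to a sum. A minimal sketch, assuming the path gain seen by each source has already been folded into its injected variance:

```python
import math

def output_snr_db(signal_var, injected_vars):
    # Uncorrelated error sources: the injected noise variances add
    # at the output, so the SNR follows from a single sum.
    noise_var = sum(injected_vars)
    return 10.0 * math.log10(signal_var / noise_var)
```

For example, ten uncorrelated sources of variance 1e-4 against unit signal power give roughly 30 dB.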
When the effects of the errors introduced were understood, the viability of
using DFX was explored and an automated means to determine the optimum
set of signal parameters was the subject of Chapter 6. An analysis showed
that the usefulness of DFX is fairly limited to designs that require low output
SNR and/or designs with low signal overflow probability constraints. In other
words, DFX can provide better area cost savings than fixed-point on
condition that a signal with a wide dynamic range can tolerate a large
error. Therefore, DFX is meant to enhance the implementations of specialised
applications and not meant to replace fixed-point or floating-point number
systems.
Optimisation was done using a two-phase simulated annealing methodology.
The optimisation problem is known to be NP-hard, and the non-convex design
search space made it necessary to adopt this meta-heuristic approach.
Since the robust DFX arithmetic modules can also accept inputs and produce
outputs in fixed-point, the optimised design can mix fixed-point and DFX
signals. The modified RightSize allows designers to constrain their design in
two ways: a minimum SNR for every output and a maximum probability of overflow
for every signal in a design.
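The acceptance rule at the heart of such a simulated annealing search can be sketched generically. The function names, cooling schedule and parameters below are illustrative assumptions, not the ASA2 implementation; the point is the Metropolis-style acceptance of worse moves, which lets the search escape local minima of the non-convex wordlength space.

```python
import math
import random

def anneal(cost, neighbour, x0, t0=1.0, alpha=0.95, iters=2000, seed=0):
    # Generic simulated-annealing loop: accept a worse candidate with
    # probability exp(-delta/T), cooling T geometrically each step.
    rng = random.Random(seed)
    x, c = x0, cost(x0)
    best, bc = x, c
    t = t0
    for _ in range(iters):
        y = neighbour(x, rng)
        cy = cost(y)
        if cy < c or rng.random() < math.exp(-(cy - c) / t):
            x, c = y, cy
            if c < bc:
                best, bc = x, c
        t *= alpha
    return best, bc
```

On a toy one-dimensional cost such as (x − 7)², the loop first wanders while hot and then hill-climbs to the minimum as the temperature falls.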
7.2 Future Work
The work in this thesis could be continued and extended in a variety of direc-
tions.
The first concerns the case studies in Section 6.9. As mentioned before, a
comparison should be made between the ASA2-optimised DFX filters and
optimised floating-point filters. At present, due to the initial assumption that
floating-point implementations are always larger than DFX implementations,
the comparisons with floating-point implementations were not planned for this
thesis. However, there is a lingering doubt as to whether that initial
assumption is always true. Hence, for completeness, the floating-point filters
should be implemented and compared with the ASA2-optimised filters.
Next is the number representation itself. In this thesis, the definition of DFX
and the design of the arithmetic modules are made under the assumption that
there will be no overflow. In some cases, this may lead to overly pessimistic
scalings; reducing the wordlength from the MSB side could cause a signal to
overflow but would also reduce area consumption. Hence, saturation arithmetic could
be applied on selected signals. The obvious place to implement it is to saturate
overflows for the DFX Num1 range. In addition, if the DFX boundary is
raised above the current definition (i.e. B > 2^p0), overflow values in the
Num0 range can also be saturated.
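A sketch of what such saturation would look like for a single fixed-point range (two's-complement style, n magnitude bits, scaling p); this is a simplification of the Num1 case rather than a full DFX saturation module, and the range limits are assumptions of the sketch.

```python
def saturate(x, n, p):
    # Clamp x to the representable range of an n-bit word with p bits
    # between the sign bit and the binary point (so n - p fraction bits).
    f = n - p                       # fractional bits
    hi = ((1 << n) - 1) / (1 << f)  # largest positive value
    lo = -float(1 << p)             # most negative value
    return max(lo, min(hi, x))
```

With n = 4 and p = 1 the representable range is [−2, 1.875], so an overflowing 3.0 clamps to 1.875 instead of wrapping around, avoiding the large-magnitude wrap errors that unchecked overflow would cause.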
The error estimation presented in this thesis relies on running a profiling
simulation over a set of representative input data. Although the profile tables
described in Chapter 5 are sufficient to profile every signal in a design, the
computational cost and memory requirement increase with the design
size. A fully analytical means of estimating the profile tables would reduce the
memory requirement and make the analysis more scalable.
Potential candidates that we could adapt are the polynomial chaos expansion
[WZN04] and the Edgeworth expansion techniques [Hal97].
This thesis’s optimisation routine chooses the optimum set of signal wordlengths
and scaling parameters in a design to minimise the chip area. There are other
optimisation objectives such as speed and power consumption that can be ex-
plored. Some precursor work such as improving the existing arithmetic mod-
ules would be needed. As we have seen from Chapter 4, the DFX arithmetic
modules are all slower than fixed-point without pipelining. It would be inter-
esting to include pipelining for the designs to improve the critical path delay.
Apart from that, the estimation of the delay and power consumption of each
arithmetic module has to be investigated.
Another area for future work is in the arithmetic modules. Designs of the
arithmetic modules could be improved to allow for greater resource sharing.
For example, in a chain of two adders, the post-adder block of the first adder
could be merged with the pre-adder block of the second. Furthermore, we
can make a full custom design of the arithmetic modules which could lead
to quicker and smaller designs. Some initial work has been done by Wang
[Wan04] with promising results. On a separate note, other operations such
as division could be added to the list of DFX arithmetic components. This
would open up avenues to test and implement a wider range of applications
with RightSize.
Following on from that, the final area for future work is to explore other ap-
plications where DFX would be an improvement over fixed-point and floating-
point (and possibly LNS) designs. At present, DFX has been shown to be
better than fixed-point IIR and LMS filters under extreme circumstances. The
search for other applications should not be constrained to the realm of DSP,
as DFX might have a use in other areas. To facilitate the search, DFX may be
incorporated into the design flow of system-level description languages such as
SystemC [CEV07] as this would cut down the development time.
Glossary
⌊x⌋ The floor function which returns the largest integer less than or
equal to x.
Φ() Recoding function for Exponent Recoding.
n Wordlength of a number system, not including the sign-bit.
p The number of bits from the right of the sign-bit to the binary-point
of a fixed-point number system.
p0, p1 The number of bits from the right of the sign-bit to the binary-
point of lower and upper number ranges in a DFX number system.
See Definition 3.2, pg. 61.
p′, p′0, p′1 An alternative representation for p, p0 and p1 known as LSB-side
scaling. See Section 5.2, pg. 108.
B Boundary between Num0 and Num1 of a DFX number. See Defi-
nition 3.4, pg. 62.
T, 𝒯 Truncation; 𝒯 is a set of truncations. See Section 5.2, pg. 108.
p′a → p′b Truncation of a number with p′a scaling to p′b scaling. See
Section 5.2, pg. 108.
µX, σ²X The error mean and error variance of signal X.
PT The probability of truncation T.
cov(x, y) Covariance between variables x and y. See Equation (5.8), pg. 126.
rx,y Correlation coefficient which measures the degree of correlation. It
has a range of [−1, 1].
R The set of all reals.
N The set of all positive integers.
ALUT Adaptive Look-Up Table. The principal building block of the recon-
figurable fabric of Altera Stratix II FPGAs.
ASA2 The two-phase adaptive simulated annealing optimisation algorithm
that produces a DFX annotated computational graph of a system,
optimised for minimum area based on a set of user constraints. See
Section 6.8.2, pg. 166.
ASIC Application Specific Integrated Circuit.
block floating-point (BFP) A scheme for processing blocks of fixed-point
numbers that share a common scaling. See Section 2.4.4, pg. 35.
computational graph A formal representation of an algorithm. See Defi-
nition 4.1, pg. 73.
correlation The strength and direction of a linear relationship between two
random variables.
critical path delay (CPD) The maximum time needed for a change in the
inputs of a system to be reflected in its outputs.
DFX Dual FiXed-point. See Section 3.3, pg. 61.
DFX annotated computational graph A formal representation of the DFX
implementation of a computational graph. See Definition 4.2, pg.
74.
DSP Digital Signal Processing.
dynamic range The range of a number representation. It is quantified as
the ratio of the largest representable magnitude to the smallest and
is generally expressed in dB.
ER Exponent Recoding. See Section 3.2, pg. 56.
error See finite precision error.
finite precision error The difference between the output sequence of a fi-
nite precision implementation and the equivalent sequence from an
infinite precision implementation. This is also known as quantisa-
tion error.
FIR Finite Impulse Response.
fixed-point (FX) A binary number representation. See Section 2.4.1, pg.
25.
floating-point (FP) A binary number representation. See Section 2.4.2,
pg. 28.
FPGA Field Programmable Gate Array.
IEEE754 Standard for Binary Floating-Point Arithmetic.
IIR Infinite Impulse Response.
level-index (LI) A binary number representation. See Section 2.4.8, pg.
40.
LMS Least Mean Square.
logarithmic number system (LNS) A binary number representation. See
Section 2.4.3, pg. 32.
LSB Least significant bit.
LUT Look-Up Table. The principal building block of the reconfigurable
fabric of Xilinx Virtex 4 FPGAs.
MSB Most significant bit.
MUX Multiplexer.
Num0, Num1 The lower and upper number range of DFX numbers. See
Definition 3.3, pg. 62.
PDF Probability density function.
perturbation analysis The process of linearising a non-linear system in or-
der to apply analytical techniques to estimate noise in LTI systems
[Con03].
profiling simulation A step taken during the ASA2 optimisation in order
to determine the probability distributions of the signals within a
system. See Section 5.5.1, pg. 131.
rational arithmetic (RA) A binary number representation. See Section 2.4.7,
pg. 39.
residue number system (RNS) A binary number representation. See Sec-
tion 2.4.5, pg. 36.
RightSize The high-level synthesis tool originally made by Constantinides
[Con03].
rounding The process of reducing the number of bits used to represent a
number by removing one or more least significant bits, rounding to
the nearest even.
scaling The scaling of a signal is determined by the position of the bi-
nary point in a signal representation. The terms ’binary point’ and
’scaling’ are used interchangeably.
simulated annealing (SA) A heuristic search technique based on the an-
nealing in metallurgy [Ing93].
SNR Signal to Noise Ratio.
truncation The process of reducing the number of bits used to represent a
number by ignoring one or more least significant bits.
VHDL Very high speed integrated circuit Hardware Description Language.
Bibliography
[Alta] Altera. Cyclone Series.
http://www.altera.com/products/devices/cyclone/cyc-index.jsp.
[Altb] Altera. DSP Builder.
http://www.altera.com/products/software/products/dsp/dsp-
builder.html.
[Alt98] Altera Corporation, San Jose. Altera Databook, 1998.
[And96] S. Andraos. Fixed point unsigned fractional representation in
residue number system. Circuits and Systems, 1996., IEEE 39th
Midwest symposium on, 1:555–558 vol.1, 18-21 Aug 1996.
[Arn05] M.G. Arnold. The residue logarithmic number system: theory and
implementation. Computer Arithmetic, 2005. ARITH-17 2005.
17th IEEE Symposium on, pages 196–205, 27-29 June 2005.
[Ber00] Berkeley Design Technology, Inc. BDTI DSP kernel benchmarks.
http://www.bdti.com/bdtimark/BDTImark2000.htm, 2000.
[BJA+03] Francisco Barat, Murali Jayapala, Tom Vander Aa, Rudy Lauw-
ereins, Geert Deconinck, and Henk Corporaal. Low power
coarse-grained reconfigurable instruction set processor. In Field-
Programmable Logic and Applications, volume 2778/2003
of Lecture Notes in Computer Science, pages 230–239. Field-
Programmable Logic and Applications, September 2003.
[BL91] P. H. Bauer and L. -J. Leclerc. Asymptotic stability of digital fil-
ters with combinations of overflow and quantization nonlinearities.
Circuits and Systems, 1991., IEEE International Symposium on,
pages 380–383 vol.1, 11-14 Jun 1991.
[Bom94] B. W. Bomar. Low-roundoff-noise limit-cycle-free implementation
of recursive transfer functions on a fixed-point digital signal pro-
cessor. Industrial Electronics, IEEE Transactions on, 41(1):70–78,
Feb 1994.
[Boo51] Andrew D. Booth. A signed binary multiplication technique.
The Quarterly Journal of Mechanics and Applied Mathematics,
4(2):236–240, 1951.
[Bow87] A. J. Bower. Digital two-channel sound for terrestrial television.
IEEE Trans. on Consumer Electronics, CE–33(3):286–296, Au-
gust 1987.
[BP00] A. Benedetti and P. Perona. Bit-width optimization for config-
urable DSP’s by multi-interval analysis. In Signals, Systems and
Computers, 2000. Conference Record of the Thirty-Fourth Asilo-
mar Conference on, volume 1, pages 355–359 vol.1, 2000.
[CC99] J. N. Coleman and E. I. Chester. A 32-bit logarithmic arithmetic
unit and its performance compared to floating-point. In Proceed-
ings of the 14th IEEE Symposium on Computer Arithmetic, pages
142–152, Adelaide, Australia, April 1999.
[CCL99] George A. Constantinides, Peter Y. K. Cheung, and Wayne
Luk. Truncation noise in fixed-point SFGs. Electronic Letters,
35(23):2012–2014, 1999.
[CCL00] George A. Constantinides, Peter Y. K. Cheung, and Wayne Luk.
Roundoff-noise shaping in filter design. In Proc. IEEE Interna-
tional Symposium on Circuits and Systems, May–June 2000.
[CCL01] George A. Constantinides, Peter Y. K. Cheung, and Wayne Luk.
The multiple wordlength paradigm. In Field-Programmable Cus-
tom Computing Machines, 2001. FCCM ’01. The 9th Annual IEEE
Symposium on, pages 51–60, 2001.
[CCL02] George A. Constantinides, Peter Y. K. Cheung, and Wayne Luk.
Optimum wordlength allocation. In Field-Programmable Custom
Computing Machines, 2002. Proceedings. 10th Annual IEEE Sym-
posium on, pages 219–228, 2002.
[CCL03] George A. Constantinides, Peter Y. K. Cheung, and Wayne Luk.
Synthesis of saturation arithmetic architectures. ACM Trans. Des.
Autom. Electron. Syst., 8(3):334–354, 2003.
[CDdD06] S. Collange, J. Detrey, and F. de Dinechin. Floating point or
LNS: Choosing the right arithmetic on an application basis. In
Digital System Design: Architectures, Methods and Tools, 2006.
DSD 2006. 9th EUROMICRO Conference on, pages 197–203, 2006.
[Cer85] V. Cerny. Thermodynamical approach to the traveling salesman
problem: An efficient simulation algorithm. Journal of Optimiza-
tion Theory and Applications, 45(1):41–51, Jan 1985.
[CEV07] B.A. Correa, J.F. Eusse, and J.F. Velez. High level system-on-chip
design using uml and systemc. Electronics, Robotics and Automo-
tive Mechanics Conference, 2007. CERMA 2007, pages 740–745,
Sept. 2007.
[CMP76] T. Claasen, W. Mecklenbrauker, and J. Peek. Effects of quanti-
zation and overflow in recursive digital filters. Acoustics, Speech,
and Signal Processing, IEEE Transactions on, 24(6):517–529, Dec
1976.
[CO84] C. W. Clenshaw and F. W. J. Olver. Beyond floating-point. J.
Assoc. Comput. Mach., 31:319–328, March 1984.
[Con01] George A. Constantinides. High Level Synthesis and Word Length
Optimization of Digital Signal Processing Systems. PhD thesis,
Imperial College of Science, Technology and Medicine, University
of London, London, U.K., September 2001.
[Con03] George A. Constantinides. Perturbation analysis for word-length
optimization. In Field-Programmable Custom Computing Ma-
chines, 2003. FCCM 2003. 11th Annual IEEE Symposium on,
pages 81–90, 9-11 April 2003.
[Cow02] M. Cowlishaw. Densely packed decimal encoding. Computers and
Digital Techniques, IEE Proceedings -, 149(3):102–104, May 2002.
[CRS+99] R. Cmar, L. Rijnders, P. Schaumont, S. Vernalde, and I. Bolsens.
A methodology and design environment for DSP ASIC fixed point
refinement. In Design, Automation and Test in Europe Conference
and Exhibition 1999. Proceedings, pages 271–276, 1999.
[CT88] C. W. Clenshaw and P. R. Turner. The symmetric level-index
system. IMA J. Numerical Analysis, 8:517–526, 1988.
[CW02] George A. Constantinides and G.J. Woeginger. The complexity
of multiple wordlength assignment. Applied Mathematics Letters,
15:137–140(4), February 2002.
[DdD] Jeremie Detrey and Florent de Dinechin. A VHDL library
of parametrisable floating-point and LNS operators for FPGA.
http://www.ens-lyon.fr/LIP/Arenaire/Ware/FPLibrary/.
[Dek71] T. J. Dekker. A floating-point technique for extending the available
precision. Numerische Mathematik, 18(3):224–242, June 1971.
[DGL+02] J. Dido, N. Geraudie, L. Loiseau, O. Payeur, Y. Savaria, and
D. Poirier. A flexible floating-point format for optimizing data-
paths and operators in FPGA based DSPs. pages 50–55, 2002.
[DH96] Ron Davidson and David Harel. Drawing graphs nicely using sim-
ulated annealing. ACM Trans. Graph., 15(4):301–331, 1996.
[ECC04] Chun Te Ewe, Peter Y.K. Cheung, and George A. Constantinides.
Dual FiXed-point: An efficient alternative to floating-point compu-
tation. In Field Programmable Logic and Application: 14th Inter-
national Conference, FPL 2004, pages 200–208, Leuven, Belgium,
August 2004. Springer-Verlag Heidelberg.
[ECC05] C.T. Ewe, P.Y.K. Cheung, and G.A. Constantinides. Error mod-
elling of Dual FiXed-point arithmetic and its application in field
programmable logic. In Field Programmable Logic and Applica-
tions, 2005. International Conference on, pages 124–129, 24-26
Aug. 2005.
[EMT69] P. M. Ebert, J. E. Mazo, and M. G. Taylor. Overflow oscillations in
digital filters. Bell System Technical Journal, 48:2999–3020, Nov.
1969.
[FCR02] Fang Fang, Tsuhan Chen, and Rob A. Rutenbar. Lightweight
floating-point arithmetic: case study of inverse discrete cosine
transform. EURASIP J. Appl. Signal Process., 2002(1):879–892,
2002.
[Fio98] P.D. Fiore. Lazy rounding. In Signal Processing Systems, 1998.
SIPS 98. 1998 IEEE Workshop on, pages 449–458, 8-10 Oct. 1998.
[FML06] Haohuan Fu, Oskar Mencer, and Wayne Luk. Comparing floating-
point and logarithmic number representations for reconfigurable
acceleration. Field Programmable Technology, 2006. FPT 2006.
IEEE International Conference on, pages 337–340, Dec. 2006.
[FP97] W.L. Freking and K.K. Parhi. Low-power FIR digital filters using
residue arithmetic. Signals, Systems & Computers, 1997. Confer-
ence Record of the Thirty-First Asilomar Conference on, 1:739–743
vol.1, 2-5 Nov 1997.
[FRPC03] Clair Fang Fang, R. A. Rutenbar, M. Puschel, and Tsuhan Chen.
Toward efficient static analysis of finite-precision effects in DSP
applications via affine arithmetic modeling. In Design Automation
Conference 2003 (DAC’03) Proceedings, pages 496–501, Leuven,
Belgium, June 2003. Springer–Verlag Heidelberg.
[GML+02] Altaf A. Gaffar, Oskar Mencer, Wayne Luk, Peter Y.K. Cheung,
and Nabeel Shirazi. Floating point bitwidth analysis via automatic
differentiation. In IEEE Conference on Field Programmable Tech-
nology (FPT’02), pages 158–165, Hong Kong, December 2002.
[GML04] Altaf A. Gaffar, Oskar Mencer, and Wayne Luk. Unifying bit-width
optimisation for fixed-point and floating-point designs. Field-
Programmable Custom Computing Machines, 2004. FCCM 2004.
12th Annual IEEE Symposium on, pages 79–88, 20-23 April 2004.
[Gri00] Andreas Griewank. Evaluating derivatives: principles and tech-
niques of algorithmic differentiation. Society for Industrial and
Applied Mathematics, Philadelphia, PA, USA, 2000.
[GS01] Geoffrey R. Grimmett and David R. Stirzaker. Probability and
Random Processes. Oxford University Press, August 2001.
[GT88] B.D.O. Green and L.E. Turner. New limit cycle bounds for digital
filters. Circuits and Systems, IEEE Transactions on, 35(4):365–
374, Apr 1988.
[Haj85] Bruce Hajek. A tutorial survey of theory and applications of simu-
lated annealing. Decision and Control, 1985 24th IEEE Conference
on, 24:755–760, Dec 1985.
[Hal97] Peter G. Hall. The Bootstrap and Edgeworth Expansion. Springer,
1997.
[Ham83] Hozumi Hamada. URR: Universal Representation of Real numbers.
New Generation Computing, 1(2):205–209, 1983.
[HJ86] Roger A. Horn and Charles R. Johnson. Matrix analysis. Cam-
bridge University Press, New York, NY, USA, 1986.
[HP94] C.Y. Hung and B. Parhami. An approximate sign detection method
for residue numbers and its application to rns division. Computers
and Mathematics with Applications, 27:23–35, Feb 1994.
[IEE85] IEEE. IEEE Standard for Binary Floating-Point Arithmetic (IEEE
754). IEEE, 1985.
[IEE04] IEEE. IEEE Standard for VHDL Register Transfer Level (RTL)
Synthesis, 2004.
[IEE07] IEEE. DRAFT Standard for Floating-Point Arithmetic P754, Oct.
2007.
[Ing93] Lester Ingber. Adaptive simulated annealing (ASA).
[ftp.ingber.com: ASA.tar.gz ASA.zip], 1993. McLean, VA,
Lester Ingber Research.
[Ing96] Lester Ingber. Adaptive simulated annealing (ASA): Lessons
learned. Control and Cybernetics, 25:33, 1996.
[IO96] Christopher Inacio and Denise Ombres. The DSP decision: fixed
point or floating? IEEE Spectrum, 33(9):72–74, 1996.
[Jac70] Leland B. Jackson. On the interaction of roundoff noise and dy-
namic range in digital filters. The Bell System Technical Journal,
49(2):159–184, February 1970.
[JAMS89] David S. Johnson, Cecilia R. Aragon, Lyle A. McGeoch, and
Catherine Schevon. Optimization by simulated annealing: An
experimental evaluation. Part i, graph partitioning. Oper. Res.,
37(6):865–892, 1989.
[JAMS91] David S. Johnson, Cecilia R. Aragon, Lyle A. McGeoch, and
Catherine Schevon. Optimization by simulated annealing: An ex-
perimental evaluation. Part ii, graph coloring and number parti-
tioning. Oper. Res., 39(3):378–406, 1991.
[JL01] Allan Jaenicke and W. Luk. Parameterised floating-point arith-
metic on fpgas. Acoustics, Speech, and Signal Processing, 2001.
Proceedings. (ICASSP ’01). 2001 IEEE International Conference
on, 2:897–900 vol.2, 2001.
[KA96] Kari Kalliojärvi and Jaakko Astola. Roundoff errors in block-
floating-point systems. IEEE Trans. on Signal Processing,
44(4):783–790, April 1996.
[Kah65] W. Kahan. Pracniques: Further remarks on reducing truncation
errors. Commun. ACM, 8(1):40, 1965.
[KF00] Shiro Kobayashi and Gerhard P. Fettweis. A hierarchical block-
floating-point arithmetic. J. VLSI Signal Process. Syst., 24(1):19–
30, 2000.
[KGV83] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by
simulated annealing. Science, 220(4598):671–680, 13 May 1983.
[KKS98] Seehyun Kim, Ki-Il Kum, and Wonyong Sung. Fixed-point op-
timization utility for C and C++ based digital signal processing
programs. Circuits and Systems II: Analog and Digital Signal Pro-
cessing, IEEE Transactions on [see also Circuits and Systems II:
Express Briefs, IEEE Transactions on], 45(11):1455–1464, 1998.
[KL97] E. Kinoshita and Ki-Ja Lee. A residue arithmetic extension
for reliable scientific computation. Transactions on Computers,
46(2):129–138, Feb 1997.
[KM83] P. Kornerup and D.W. Matula. Finite precision rational arith-
metic: An arithmetic unit. Transactions on Computers, C-
32(4):378–388, 1983.
[KM88] Peter Kornerup and David W. Matula. An on-line arithmetic unit
for bit-pipelined rational arithmetic. J. Parallel Distrib. Comput.,
5(3):310–330, 1988.
[KNM+08] Sami Khawam, Ioannis Nousias, Mark Milward, Ying Yi, Mark
Muir, and Tughrul Arslan. The reconfigurable instruction cell ar-
ray. IEEE Trans. Very Large Scale Integr. Syst., 16(1):75–85, 2008.
[Kol04] Gopi Kolli. Using fixed-point instead of floating-point for
better 3D performance. Intel Optimizing Center, 2004.
http://devx.com/Intel/Article/16478.
[Kor02] Israel Koren. Computer Arithmetic Algorithms. A K Peters, Nat-
ick, Massachusetts, 2nd edition, 2002.
[KS94] Seehyun Kim and Wonyong Sung. A floating-point to fixed-point
assembly program translator for the TMS 320c25. Circuits and
Systems II: Analog and Digital Signal Processing, IEEE Transac-
tions on [see also Circuits and Systems II: Express Briefs, IEEE
Transactions on], 41(11):730–739, Nov 1994.
[KS98] K.-I. Kum and W. Sung. Word-length optimization for high-level
synthesis of digital signal processing systems. In Signal Processing
Systems, 1998. SIPS 98. 1998 IEEE Workshop on, pages 569–578,
8-10 Oct. 1998.
[KS01] Ki-Il Kum and Wonyong Sung. Combined word-length optimiza-
tion and high-level synthesis of digital signal processing systems.
Computer-Aided Design of Integrated Circuits and Systems, IEEE
Transactions on, 20(8):921–930, Aug 2001.
[LC92] M. Lu and J.-S. Chiang. A novel division algorithm for the residue
number system. Transactions on Computers, 41(8):1026–1032,
Aug 1992.
[LEK05] L. Lacassagne, D. Etiemble, and S.A.O. Kablia. 16-bit floating
point instructions for embedded multimedia applications. Com-
puter Architecture for Machine Perception, 2005. CAMP 2005.
Proceedings. Seventh International Workshop on, pages 198–203,
4-6 July 2005.
[LGML05] Dong-U Lee, Altaf Abdul Gaffar, Oskar Mencer, and Wayne Luk.
MiniBit: bit-width optimization via affine arithmetic. In DAC ’05:
Proceedings of the 42nd annual conference on Design automation,
pages 837–840, New York, NY, USA, 2005. ACM Press.
[Liu71] Bede Liu. Effects of finite word length on the accuracy of digital
filters - A review. IEEE Transactions, CT–18(6):670–677, 1971.
[Liu98] Derong Liu. Lyapunov stability of two-dimensional digital filters
with overflow nonlinearities. Circuits and Systems I: Fundamental
Theory and Applications, IEEE Transactions on [see also Circuits
and Systems I: Regular Papers, IEEE Transactions on], 45(5):574–
577, May 1998.
[LM87] Edward A. Lee and David G. Messerschmitt. Static scheduling of
synchronous data flow programs for digital signal processing. In
IEEE Transactions on Computers, volume C-36, 1987.
[LO90] D. W. Lozier and F. W. J. Olver. Closure and precision in level-
index arithmetic. SIAM Journal on Numerical Analysis, 27:1295–
1304, 1990.
[LSL+00] Ming-Hau Lee, Hartej Singh, Guangming Lu, Nader Bagherzadeh,
Fadi J. Kurdahi, Eliseu M.C. Filho, and Vladimir Castro Alves.
Design and implementation of the morphosys reconfigurable com-
puting processor. The Journal of VLSI Signal Processing,
24(2):147–164, March 2000.
[Mata] Mathworks. Matlab. http://www.mathworks.com.
[Matb] Mathworks. Simulink. http://www.mathworks.com.
[Mic94] Giovanni De Micheli. Synthesis and Optimization of Digital Cir-
cuits. McGraw-Hill Higher Education, 1994.
[Mit98] S. K. Mitra. Digital Signal Processing. McGraw-Hill, New York,
1998.
[MK85] D.W. Matula and P. Kornerup. Finite precision rational arith-
metic: Slash number systems. IEEE Transactions on Computers,
34(1):3–18, 1985.
[MN99] Kurt Mehlhorn and Stefan Näher. LEDA: A Platform for Combi-
natorial and Geometric Computing. Cambridge University Press,
1999.
[Moo66] R. E. Moore. Interval Analysis. Prentice-Hall, 1966.
[MRRT53] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosen-
bluth, and Augusta H. Teller. Equation of state calculations
by fast computing machines. The Journal of Chemical Physics,
21(6):1087–1092, June 1953.
[MTS95] Jean-Michel Muller, Arnaud Tisserand, and Alexandre Scherbyna.
Semi-logarithmic number system. In Proceedings of the 12th IEEE
Symposium on Computer Arithmetic, pages 201 – 207, Bath, Eng-
land, July 1995.
[Mun96] Robert Munafo. Survey of floating-point formats.
http://home.earthlink.net/ mrob/pub/math/floatformats.html,
1996.
[Nim87] Y. Ninomiya et al. An HDTV broadcasting system utilizing a
bandwidth compression technique MUSE. IEEE Trans on Broad-
casting, BC–33(4):130–160, December 1987.
[NVI05] NVIDIA. NVIDIA GPU programming guide.
http://developer.nvidia.com/object/gpu programming guide.html,
2005.
[Opp70] Alan V. Oppenheim. Realization of digital filters using block-
floating-point arithmetic. IEEE Trans. on Audio Electroacoustics,
AE–18(2):130–136, June 1970.
[OS99] Alan V. Oppenheim and R. W. Schafer. Discrete-Time Signal
Processing. Prentice-Hall, Englewood Cliffs, NJ, USA, 2nd edition,
1999.
[OW72] Alan V. Oppenheim and C. J. Weinstein. Effects of finite register
length in digital filtering and the fast fourier transform. Proceedings
of the IEEE, 60(8):957–976, 1972.
[Par88] B. Parhami. Carry-free addition of recoded binary signed-digit
numbers. IEEE Trans. Comput., 37(11):1470–1476, 1988.
[Par00] B. Parhami. Computer Arithmetic: Algorithms and Hardware De-
signs. Oxford University Press, U.K., 2000.
[Pin70] Martin Pincus. A Monte Carlo method for the approximate solu-
tion of certain types of constrained optimization problems. Oper-
ations Research, 18(6):1225–1228, Nov 1970.
[Pra01] William K. Pratt. Digital Image Processing. Wiley, New York,
NY, 3rd edition, 2001.
[Pri91] D.M. Priest. Algorithms for arbitrary precision floating point arith-
metic. Computer Arithmetic, 1991. Proceedings., 10th IEEE Sym-
posium on, pages 132–143, 26-28 Jun 1991.
[RB04] Sanghamitra Roy and Prith Banerjee. An algorithm for
converting floating-point computations to fixed-point in
MATLAB based FPGA design. In Proceedings of the 41st
annual conference on Design automation, pages 484–487. ACM
Press, 2004.
[RH91] I.P. Radivojevic and J. Herath. Executing DSP applications in
a fine-grained dataflow environment. Transactions on Software
Engineering, 17(10):1028–1041, 1991.
[SA75] E.E. Swartzlander and A.G. Alexopoulos. The sign/logarithm
number system. Transactions on Computers, C-24(12):1238–1242,
Dec. 1975.
[Sd88] S. Sridharan and G. Dickman. Block floating point implementa-
tion of digital filters using the DSP56000. Microprocess. Microsyst.,
12(6):299–308, July–Aug. 1988.
[SdF97] J. Stolfi and L. de Figueiredo. Self-Validated Numerical Methods
and Applications. Institute for Pure and Applied Mathematics
(IMPA), Rio de Janeiro, 1997. Monograph for the 21st Brazilian
Mathematics Colloquium (CBM’97), IMPA.
[SK94] Wonyong Sung and Ki-Il Kum. Word-length determination and
scaling software for a signal flow block diagram. Acoustics, Speech,
and Signal Processing, 1994. ICASSP-94., 1994 IEEE Interna-
tional Conference on, ii:II/457–II/460 vol.2, 19-22 Apr 1994.
[SK95] Wonyong Sung and Ki-Il Kum. Simulation-based word-length op-
timization method for fixed-point digital signal processing sys-
tems. Signal Processing, IEEE Transactions on [see also Acous-
tics, Speech, and Signal Processing, IEEE Transactions on],
43(12):3087–3090, Dec 1995.
[SM94] Richard L. Scheaffer and James T. McClave. Probability and Statis-
tics for Engineers. Duxbury Press, 4th edition, April 4 1994.
[Smi97] Steven W. Smith. The Scientist and Engineer’s Guide to Digital
Signal Processing. California Technical Pub, 1997.
[SMM05] N. Smyth, M. McLoone, and J.V. McCanny. Reconfigurable pro-
cessor for public-key cryptography. In Proceedings of the IEEE
Workshop on Signal Processing Systems Design and Implementa-
tion, pages 110–115, Nov. 2005.
[SP87] Puay Kia Sim and K. K. Pang. Decoupling of the overflow and
quantization phenomena in orthogonal biquad recursive digital fil-
ters. Circuits, Systems, and Signal Processing, 6(4):457–470, De-
cember 1987.
[Sri87] S. Sridharan. Implementation of state-space digital filters using
block-floating-point arithmetic. In 1987 IEEE International Con-
ference on Acoustics, Speech and Signal Processing, pages 912–915,
Dallas, TX, USA, April 1987.
[SS97] Adel S. Sedra and Kenneth C. Smith. Microelectronic Circuits.
Oxford University Press Inc, USA, 4th revised edition, July 1997.
[ST67] Nicholas S. Szabo and Richard I. Tanaka. Residue Arithmetic and
Its Application to Computer Technology. McGraw-Hill, New York,
1967.
[Sto05] Thanos Stouraitis. The Electrical Engineering Handbook, chapter
1, Logarithmic and Residue Number Systems for VLSI Arithmetic,
pages 179–190. Academic Press, 2005.
[SW86] S. Sridharan and D. Williamson. Implementation of high-order
direct-form digital filter structures. IEEE Transactions on Cir-
cuits and Systems, CAS-33(8):818–822, August 1986.
[Syna] Synopsys. Design Compiler.
http://www.synopsys.com/products/logic/logic.html.
[Synb] Synplicity. Synplify DSP.
http://www.synplicity.com/products/dsp_solutions.html.
[TGJR88] F. J. Taylor, R. Gill, J. Joseph, and J. Radke. A 20 bit logarithmic
number system processor. IEEE Transactions on Computers,
37(2):190–200, 1988.
[Tsa74] N. Tsao. On the distribution of significant digits and roundoff
errors. Communications of the ACM, 17:267–271, May 1974.
[Tur89] P.R. Turner. A software implementation of SLI arithmetic. In
Proceedings of the 9th IEEE Symposium on Computer Arithmetic,
pages 18–24, Sep 1989.
[Und04] Keith Underwood. FPGAs vs. CPUs: trends in peak floating-point
performance. In Proceedings of the 2004 ACM/SIGDA Interna-
tional Symposium on Field Programmable Gate Arrays, pages
171–180, 2004.
[Wan99] Lars Wanhammar. DSP Integrated Circuits. Elsevier, 1999.
[Wan04] Guixin Wang. ASIC design of dual fixed-point arithmetic. Master’s
thesis, Imperial College London, Dept of Electrical and Electronic
Engineering, September 2004.
[WZN04] Bin Wu, Jianwen Zhu, and F.N. Najm. Dynamic range estimation
for nonlinear systems. In Proceedings of the IEEE/ACM Inter-
national Conference on Computer Aided Design (ICCAD-2004),
pages 660–667, 2004.
[XB97] Guo Fang Xu and T. Bose. Elimination of limit cycles due to
two’s complement quantization in normal form digital filters.
IEEE Transactions on Signal Processing, 45(12):2891–2895, Dec
1997.
[Xila] Xilinx. Spartan series.
http://www.xilinx.com/products/silicon_solutions/fpgas/spartan.
[Xilb] Xilinx. System Generator.
http://www.xilinx.com/ise/optional_prod/system_generator.htm.
[Xilc] Xilinx. Virtex series.
http://www.xilinx.com/products/silicon_solutions/fpgas/virtex/.
[Yok92] H. Yokoo. Overflow/underflow-free floating-point number rep-
resentations with self-delimiting variable-length exponent field.
IEEE Transactions on Computers, 41(8):1033–1039, Aug 1992.
[Zim99] Reto Zimmermann. Lecture notes on Computer Arithmetic: Prin-
ciples, Architectures, and VLSI Design. Integrated Systems Labo-
ratory, ETH Zürich, Switzerland, March 1999.