A New Number Representation for Hardware Implementation of
DSP Algorithms
Chun Te Ewe
A thesis submitted for the degree of
Doctor of Philosophy of the University of London
and for the Diploma of Membership of the
Imperial College
Department of Electrical and Electronic Engineering
Imperial College of Science, Technology and Medicine
University of London
October 2008
Abstract
Dual FiXed-point (DFX) is a new number representation for digital hardware.
By providing a single exponent bit to select between two fixed-point scalings,
DFX strikes a compromise between conventional fixed-point and floating-point
representations. It has an implementation complexity similar to that of a
fixed-point system together with the improved dynamic range offered by a
floating-point system. The majority of DSP applications do not need the full
dynamic range provided by a floating-point system, but the cost of fixed-point
increases greatly when wide wordlengths are needed.
This thesis presents the definition of DFX as a new number representation,
and its characteristics are compared with those of other common number
representations. Modular designs for its arithmetic operations, supporting
multiple wordlengths and scalings, are presented and made to work alongside
fixed-point. Any finite precision number representation introduces quantisation
errors. Hence a mixed simulation and static analysis technique is presented to
analyse the errors when DFX modules are used.
Utilising a readily available high-level synthesis tool to optimise for area,
the modules created and the error analysis are used in a two-phase simulated
annealing optimisation algorithm. DFX has been shown to be most suitable
when a design can tolerate a large amount of noise and requires a wide dynamic
range of representable numbers.
Acknowledgements
I would like to express a great amount of gratitude to my supervisor, Prof.
Peter Y.K. Cheung and also my co-supervisor, Dr. George A. Constantinides,
for their infinite encouragement, support and understanding throughout my
research.
Altera Corporation and the Overseas Research Students Award Scheme, UK,
have provided financial support for this research.
Thanks go to all my friends and colleagues in the Department of Electrical
and Electronic Engineering at Imperial College London for the wonderful atmosphere
and all their kind support. Doing lab and computing demonstrations gave
some variety to my life at the department.
Special mention to the last crew to man the original Linstead Hall: there
was never a dull moment, and it paved the way for me to meet my special
someone.
That someone is my girlfriend Audrey Ng. She has been with me through
thick and thin while tirelessly motivating me throughout the crucial years.
Having her around reminds me that there is more to life than study. I owe a
great debt of gratitude to her.
Finally, I would like to say a big thanks to both my parents, Ong Kim Kee
and Ewe Poh Teong. They have given their unconditional love and support,
knowing that doing so meant enduring my absence throughout the time
of my studies, during which we could have been geographically closer. This
thesis is dedicated to them.
Contents
Contents 3
List of Figures 8
List of Tables 12
1 Introduction 15
1.1 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2 Background 20
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Question of Software vs Hardware . . . . . . . . . . . . . . . . . 20
2.3 Finite Wordlength Effects . . . . . . . . . . . . . . . . . . . . . 22
2.4 Number Representations . . . . . . . . . . . . . . . . . . . . . . 24
2.4.1 Fixed-point . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.2 Floating-point . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4.3 Logarithmic Number System . . . . . . . . . . . . . . . . 32
2.4.4 Block Floating-Point . . . . . . . . . . . . . . . . . . . . 35
2.4.5 Residue Number System . . . . . . . . . . . . . . . . . . 36
2.4.6 Signed-Digit Number System . . . . . . . . . . . . . . . 38
2.4.7 Rational Arithmetic . . . . . . . . . . . . . . . . . . . . 39
2.4.8 Level-Index . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.9 Comparison and Summary . . . . . . . . . . . . . . . . . 41
2.5 Wordlength and Scaling Optimisation . . . . . . . . . . . . . . . 45
2.5.1 SKK Methodology . . . . . . . . . . . . . . . . . . . . . 46
2.5.2 Interval and Affine analysis . . . . . . . . . . . . . . . . 48
2.5.3 BitSize Tool . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.5.4 Synoptix and RightSize tool . . . . . . . . . . . . . . . . 50
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3 Exponent Recoding & Dual FiXed-point 55
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.2 Exponent Recoding . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.1 Special Cases . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3 Dual FiXed-point . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3.1 Defining the Format . . . . . . . . . . . . . . . . . . . . 61
3.3.2 Characteristics of DFX Number System . . . . . . . . . 65
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4 DFX Modules and Arithmetic Circuits 71
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Algorithm Representation in RightSize . . . . . . . . . . . . . . 73
4.3 Module Design Forethought and Criteria . . . . . . . . . . . . . 75
4.4 Building Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.4.1 Range-Detector . . . . . . . . . . . . . . . . . . . . . . . 77
4.4.2 DFX Encoder and Decoder . . . . . . . . . . . . . . . . 78
4.4.3 DFX Recoder . . . . . . . . . . . . . . . . . . . . . . . . 80
4.4.4 Area and Critical Path Delay Tables . . . . . . . . . . . 84
4.5 DFX Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.5.1 DFX Adder Version I (V1) . . . . . . . . . . . . . . . . . 86
4.5.2 DFX Adder Version II (V2) . . . . . . . . . . . . . . . . 87
4.5.3 Area and Critical Path Delay Tables . . . . . . . . . . . 91
4.6 DFX Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.6.1 DFX Gain Multiplier . . . . . . . . . . . . . . . . . . . . 93
4.6.2 DFX Full Multiplier . . . . . . . . . . . . . . . . . . . . 95
4.6.3 Area and Critical Path Delay Tables . . . . . . . . . . . 97
4.7 Discussion and Further Comparisons . . . . . . . . . . . . . . . 99
4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5 Modelling Noise at the Outputs of a DFX System 106
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.3 DFX Modules Noise Models . . . . . . . . . . . . . . . . . . . . 112
5.3.1 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.3.2 Recoder . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.3.3 Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3.4 Gain Multiplier . . . . . . . . . . . . . . . . . . . . . . . 116
5.3.5 Full Multiplier . . . . . . . . . . . . . . . . . . . . . . . . 119
5.3.6 Error Model Evaluation and Discussion . . . . . . . . . . 120
5.4 Correlated Errors . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.4.1 Estimating the Correlation for the Errors Sources . . . . 125
5.4.2 Rounding Benefits . . . . . . . . . . . . . . . . . . . . . 128
5.5 Profiling: Simulation and Tables . . . . . . . . . . . . . . . . . . 131
5.5.1 Profiling Simulation . . . . . . . . . . . . . . . . . . . . . 131
5.5.2 1-D Profile Table . . . . . . . . . . . . . . . . . . . . . . 132
5.5.3 2-D Profile Table . . . . . . . . . . . . . . . . . . . . . . 135
5.6 Output Noise Estimation . . . . . . . . . . . . . . . . . . . . . . 137
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6 Approach to DFX Parameter Optimisation 140
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.2 RightSize Prerequisites . . . . . . . . . . . . . . . . . . . . . . . 142
6.2.1 User Specified Design Constraints . . . . . . . . . . . . . 142
6.2.2 Representative Floating-point Input . . . . . . . . . . . . 143
6.3 DFX Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.3.1 Applying Synthesis Restrictions on p0 Parameters . . . . 144
6.3.2 Propagating and Conditioning p1 and n Parameters . . . 146
6.3.3 Iterative Algorithm . . . . . . . . . . . . . . . . . . . . . 148
6.4 Area Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.4.1 Evaluation of Area Models . . . . . . . . . . . . . . . . . 153
6.5 Determining the Upper Scaling p1 Parameter . . . . . . . . . . . 153
6.5.1 Simulated Peak Values . . . . . . . . . . . . . . . . . . . 154
6.5.2 Chebyshev’s Inequality . . . . . . . . . . . . . . . . . . . 155
6.6 The Optimisation Problem, Formulated . . . . . . . . . . . . . . 156
6.7 Exploring the Feasibility of DFX . . . . . . . . . . . . . . . . . 157
6.8 Meta-Heuristic Approach to Optimisation Using Simulated An-
nealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.8.1 Background to simulated annealing . . . . . . . . . . . . 163
6.8.2 Optimisation Algorithm . . . . . . . . . . . . . . . . . . 166
6.9 Case Study and Discussion . . . . . . . . . . . . . . . . . . . . . 170
6.9.1 IIR Filter . . . . . . . . . . . . . . . . . . . . . . . . . . 171
6.9.2 Adaptive LMS Filter . . . . . . . . . . . . . . . . . . . . 174
6.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
7 Conclusions & Future Work 181
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Glossary 186
Bibliography 191
List of Figures
2.1 Fixed-point number format 〈n, p〉. . . . . . . . . . . . . . . . . . 26
2.2 Range and precision of two’s complement fixed-point number. . 28
2.3 IEEE 754 single precision floating-point number format. . . . . . 29
2.4 Floating-point adder/subtractor. Figure taken from [Kor02]. . . 31
2.5 IEEE 754 single precision floating-point number format. . . . . . 33
2.6 Adder/Subtractor for logarithm numbers. A fixed-point (FX)
adder is used to perform addition and the ROM contains the
look-up table for either Ψ+ or Ψ− of Equation (2.8). . . . . . . . 35
2.7 Example of Rational Representation format. . . . . . . . . . . . 39
2.8 Level Index/Symmetric Level Index format. . . . . . . . . . . . 40
2.9 An area vs Critical path delay graph for Table 2.3. . . . . . . . 43
2.10 Design flow for RightSize tool (shaded portions are vendor spe-
cific). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.1 Example of number system with Exponent Recoding. . . . . . . 56
3.2 Example of quad fixed-point. . . . . . . . . . . . . . . . . . . . . 60
3.3 Dual FiXed-point (DFX) number: (a) Number format, (b) De-
tailed structure of DFX number format. The symbols • and ◦
represent the binary positions for Num0 and Num1 numbers
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.4 Fully and non-fully represented DFX number. . . . . . . . . . . 64
3.5 Num0 and Num1 range in a DFX Number. . . . . . . . . . . . 66
3.6 Precision of number representations in significant bits as a func-
tion of absolute number value (in dB). The number representa-
tions shown are fixed-point 〈15, 0〉, floating-point E4:M11, and
DFX 〈14,−5, 0〉. . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.7 An area vs critical path delay graph for Table 3.3. . . . . . . . . 70
4.1 The graphical representation of a data-flow graph. . . . . . . . . 75
4.2 DFX Range-Detector Module. . . . . . . . . . . . . . . . . . . . 77
4.3 Input bits the Range-Detector is interested in. . . . . . . . . . . 78
4.4 DFX Encoder block to convert from fixed-point to DFX. Input
is a fixed-point with wordlength nin and binary point pin and
output is a DFX 〈n, p0, p1〉. . . . . . . . . . . . . . . . . . . . . 79
4.5 DFX Decoder block to convert from DFX to fixed-point. Input
is a DFX 〈n, p0, p1〉 and output is a fixed-point with wordlength
(n + (p1 − p0)) and binary point p1. . . . . . . . . . . . . . . . . 80
4.6 DFX Recoder module with the flow of data through the module. 81
4.7 DFX Recoder blocks to convert between two different properly
scaled DFX numbers. The flow of data through the recoder is
shown beneath each block. . . . . . . . . . . . . . . . . . . . . . 82
4.8 DFX Recoder used with fork and delay. . . . . . . . . . . . . 84
4.9 DFX Adder Version I (v1). . . . . . . . . . . . . . . . . . . . . . 86
4.10 DFX Adder Version II (v2). . . . . . . . . . . . . . . . . . . . . 88
4.11 DFX Adder (v2) (a)Pre-Adder and (b)Post-Adder diagram. . . 89
4.12 DFX Gain Multiplier. . . . . . . . . . . . . . . . . . . . . . . . . 93
4.13 DFX Full Multiplier. . . . . . . . . . . . . . . . . . . . . . . . . 95
4.14 Module comparisons with similar dynamic range implemented
in ASIC. The parameters used for each number representation are
shown in Table 4.7. . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.15 Area of DFX modules implemented in ASIC with p1 = 8. . . . . 102
4.16 Sizes of DFX arithmetic modules relative to their fixed-point
equivalent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.1 Noise error of each module modelled as an error injection at the
output. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.2 LSB-side scaling definition. . . . . . . . . . . . . . . . . . . . . . 110
5.3 Probability density function (PDF) of DFX Encoder input signal. . 113
5.4 PDF of the DFX Recoder input and the probabilities of trun-
cation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.5 DFX Adder joint probability distribution table. . . . . . . . . . 117
5.6 PDF of the DFX Gain Multiplier input and the probabilities of
truncation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.7 DFX Full Multiplier joint probability distribution table. (a)
shows the input ranges (Input A : Input B) and (b)-(d) show
the probability of truncations for different input and output
boundary cases. . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.8 Transpose FIR Direct Form type I filter implemented with DFX
modules. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.9 The PDF of DFX signal errors. . . . . . . . . . . . . . . . . . . 125
5.10 Joint probability distribution of the errors. (a)-(d) show the
breakdown for each error combination “x : y” and (e) shows the
complete joint distribution diagram. . . . . . . . . . . . . . . . . 127
5.11 The PDF of DFX signal errors when rounding is used. . . . . . 129
5.12 Joint probability distribution of the errors when rounding is
used. (a)-(d) show the breakdown for each error combination
“x : y” and (e) shows the complete joint distribution diagram. . 129
5.13 Boundary bins. . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.14 An example profile table for a DFX Recoder. . . . . . . . . . . . 133
5.15 The estimated vs simulated SNR for 500 filters. . . . . . . . . . 139
6.1 Linked unit delays. . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.2 Examples of DFX Gain Multiplier output formatting with the
binary points aligned. The shaded bits can be omitted without
introducing errors. . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.3 Estimating the area for a DFX Decoder. . . . . . . . . . . . . . 150
6.4 Estimating the area for a fixed-point Adder. . . . . . . . . . . . 151
6.5 Boundary of wordlength-multiplier using Equation (6.21) with
varying p1 scaling and signal variance. . . . . . . . . . . . . . . 161
6.6 Boundary of wordlength-multiplier with varying maximum prob-
ability of signal overflow, λ, and SNR. . . . . . . . . . . . . . . . 162
6.7 Flow of the DFX optimisation with Simulated Annealing. The
detailed flow diagrams of the Algorithms 6.5 and 6.6 are not
shown and are denoted by broken arrow lines. Refer to the re-
spective Algorithms in page 168. . . . . . . . . . . . . . . . . . . 169
6.8 The optimisation times and the area with respect to the level of
optimisation for a 4th order IIR filter on ASIC. . . . . . . . . . 170
6.9 Case study 4th order IIR filter. . . . . . . . . . . . . . . . . . . 172
6.10 Area of ASIC 4th order IIR filter with lower variance input. . . 173
6.11 Area ratio of ASIC 4th order IIR filter optimised with the pro-
posed two-phase ASA optimisation (ASA2) and fixed-point only
optimisation (FX). . . . . . . . . . . . . . . . . . . . . . . . . . 175
6.12 Case study 1st order LMS filter. . . . . . . . . . . . . . . . . . . 176
6.13 Area of ASIC 2-tap adaptive LMS FIR filter. . . . . . . . . . . . 177
6.14 Area Ratio of ASIC 2-tap adaptive LMS FIR filter optimised
with the proposed two-phase ASA optimisation (ASA2) and
fixed-point only optimisation (FX). . . . . . . . . . . . . . . . . 178
List of Tables
2.1 IEEE 754 floating-point special values. . . . . . . . . . . . . . . 30
2.2 Area and critical path delay for 16-bit arithmetic units taken
from Tables 4.4 to 4.6. These arithmetic units were imple-
mented on an ASIC platform. . . . . . . . . . . . . . . . . . . . 42
2.3 Area and critical path delay (CPD) results for 4-tap FIR filter
with increasing wordlength implemented using UMC 0.13um
High Density Standard Cell Library. . . . . . . . . . . . . . . . 43
2.4 Dynamic range comparisons of 32-bit numbers representations. . 44
3.1 Dynamic range comparison between DFX, fixed-point (FX),
floating-point (FP) and logarithmic number system (LNS) for
32-bit and 16-bit format. . . . . . . . . . . . . . . . . . . . . . . 65
3.2 Comparing the precision errors between DFX and fixed-point.
The DFX parameters are chosen to match a fixed-point 〈15, 0〉
dynamic range of ≈ 90dB and upper limit of 20. . . . . . . . . . 68
3.3 Area and critical path delay result for 4 tap FIR filter with
increasing wordlength implemented using UMC 0.13um High
Density Standard Cell Library. This is an extension of results
in Table 2.3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.1 In and out degrees of nodes used in computational graph G(V, S). 74
4.2 Building block areas and critical path delays (CPD) table. . . . 84
4.3 Scaling of the inputs before the fixed-point adder. . . . . . . . . 90
4.4 Area and critical path delay tables for 16-bit adder comparisons. 91
4.5 Area and critical path delay (CPD) tables for gain multiplier
comparisons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.6 Area and critical path delay (CPD) tables for full multiplier
comparisons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.7 The parameters used to generate results for Figure 4.14. The
dynamic range (DR) is represented in dB. . . . . . . . . . . . . 101
4.8 Comparing fixed-point and DFX arithmetic module implemen-
tations on ASIC (# of cells) with dynamic range fixed at ∼90dB.
Fixed-point is the first result line, where p1 = p0 = 0. . . . . . . 103
5.1 DFX Encoder truncations where the input is a fixed-point 〈nin, pin〉
and the output a DFX 〈n, p0, p1〉 (refer to Fig. 4.4 for the block diagram). . 112
5.2 DFX Recoder where the input is DFX 〈nin, pin0, pin1〉 and out-
put DFX 〈nout, pout0, pout1〉. . . . . . . . . . . . . . . . . . . . . . 114
5.3 DFX Adder inputs (A and B) and output (S) combinations
with their respective output truncations. . . . . . . . . . . . . . 116
5.4 DFX Gain Multiplier input (A) and output (Q) combinations
and their respective output truncations. . . . . . . . . . . . . . . 118
5.5 DFX Full Multiplier inputs (A and B) and output (Q) combi-
nations and their respective output truncations. . . . . . . . . . 119
5.6 Evaluation of error models for truncation scheme with DFX
format 〈14,−5, 2〉 and 〈14,−3, 5〉. . . . . . . . . . . . . . . . . . 121
5.7 Evaluation of error models for rounding scheme with DFX for-
mat 〈14,−5, 2〉 and 〈14,−3, 5〉. . . . . . . . . . . . . . . . . . . 121
5.8 The correlation coefficients of the error sources for the FIR filter
in Figure 5.8 when truncation is used. . . . . . . . . . . . . . . . 124
5.9 The correlation coefficients of the error sources for the FIR filter
in Figure 5.8 when rounding is used. . . . . . . . . . . . . . . . 124
5.10 Combinations of signals X and Y ranges and their error prob-
abilities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.11 Estimate of the correlation coefficients of the error sources for
the FIR filter in Figure 5.8 when truncation is used. . . . . . . . 128
5.12 Comparison between actual and estimated SNR for 159-tap FIR
filter with DFX parameter of increasing wordlength. . . . . . . . 138
6.1 Propagation rules for DFX conditioning. . . . . . . . . . . . . . 146
6.2 Comparison between actual area of DFX modules with the es-
timation by the area models on a Virtex 4 FPGA. . . . . . . . . 154
Chapter 1
Introduction
1.1 Objectives
The pace of integrating applications onto a chip is driven by the continual de-
mand to achieve lower cost while reducing system size and power consumption.
The availability of high density integrated circuits now enables the design and
implementation of sophisticated arithmetic processors employing algorithms
that were considered prohibitively complex in the past.
Most digital signal processing (DSP) algorithms are implemented with the
IEEE 754 floating-point number representation for its wide dynamic range
of representable numbers. Furthermore, a direct floating-point
implementation onto hardware offers the advantage of consistency between the
software and hardware implementations without introducing extra rounding or
truncation errors.
On the other hand, digital very-large-scale integration (VLSI) implemen-
tations of these applications rely on fixed-point approximations with finite
precision which have the advantage of using reduced hardware cost and power
consumption while increasing throughput [IO96]. This is because, generally
speaking, fixed-point DSP devices are far less complex, having fewer gates and
transistors, than an equivalent floating-point system. Although Intel led the
formulation of the IEEE standard for floating-point, it too recognises the
benefits of fixed-point for 3D graphics applications in commercial embedded
devices [Kol04]. Implementations of arithmetic circuits in hardware are often
based on mapping the desired function to an Application Specific Integrated
Circuit (ASIC) or to a Field-Programmable Gate Array (FPGA).
Currently, when a fixed-point DSP implementation meets the needs of an
application, it is usually a better option than floating-point. However,
when the application needs a large dynamic range, an implementation using
only fixed-point may suffer due to the wide wordlengths needed, and the other
familiar option available is floating-point. Fixed-point and floating-point are
both popular but distinct number representations, each with its own strengths
and weaknesses.
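The dynamic-range trade-off above can be made concrete with a back-of-the-envelope sketch. This is illustrative only: it defines dynamic range as 20 log10 of the ratio between the largest and smallest non-zero representable magnitudes, and ignores floating-point denormals.

```python
import math

def fx_dynamic_range_db(n):
    # Ratio of largest to smallest non-zero magnitude for an
    # n-bit two's-complement fixed-point word is about 2**(n-1).
    return 20 * math.log10(2 ** (n - 1))

def fp_dynamic_range_db(exp_bits):
    # Rough figure for a binary floating-point format: with the
    # all-zero and all-one exponent codes reserved, the usable
    # exponent span is 2**exp_bits - 2, so the magnitude ratio
    # is about 2**(2**exp_bits - 2).
    usable_span = 2 ** exp_bits - 2
    return 20 * usable_span * math.log10(2)

print(f"16-bit fixed-point: ~{fx_dynamic_range_db(16):.0f} dB")
print(f"IEEE 754 single (8-bit exponent): ~{fp_dynamic_range_db(8):.0f} dB")
```

Each extra fixed-point bit buys only about 6 dB of range, which is why applications needing a wide dynamic range push fixed-point wordlengths, and hence hardware cost, up quickly.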
The objectives for this research are to:
1. Define a new number representation that bridges between fixed-point
and floating-point number representations. The new number represen-
tation should have the dynamic range capability and the hardware im-
plementation complexity between fixed-point and floating-point number
representations.
2. Design the hardware arithmetic modules needed to realise the new num-
ber representation. These modules should have fully parameterisable
wordlengths and flexible scalings to allow exploration of efficient design
implementations.
3. Understand the noise introduced by the arithmetic circuits designed for
the new number representation. As with every number representation in
the digital domain, the finite wordlength effects of the new number represen-
tation need to be understood. This will enable us to use it to its full
potential.
4. Discover the conditions and design requirements that will benefit from
using the new number representation.
5. Find the parameters of the arithmetic modules to obtain area-optimised
designs that meet a desired user constraint. This process is preferably
automated.
1.2 Overview
The thesis starts in Chapter 2 with a review of the number representations
available to a DSP hardware designer. Also reviewed is the previous and
recent work in the field of wordlength and scaling optimisation.
Chapter 3 then introduces the concept of Exponent Recoding and Dual
FiXed-point. Exponent Recoding is a concept that generalises the common
number representations. From it, the new number representation called Dual
FiXed-point (DFX) is defined and its characteristics discussed and compared
with the common number representations.
Hardware implementations of the arithmetic modules to use DFX, written in
VHDL, are described in Chapter 4. These modules are capable of multiple
wordlengths and scalings and work alongside fixed-point. All the modules are
synthesized onto FPGA and ASIC platforms to compare with implementations
using common number representations.
Error analysis of the DFX arithmetic modules is discussed in Chapter 5. A
hybrid (simulation mixed with static) analytical technique was developed to
estimate the errors introduced. Because the DFX quantisation depends
on the signal’s magnitude, any correlation in the signal will be reflected in the
errors introduced through quantisation; this is dealt with in this chapter.
Chapter 6 shows how DFX can be incorporated into a high-level synthesis tool
called RightSize. It optimises DSP designs to meet user-specified constraints
on output SNR and signal overflow probability to give area-optimised designs.
The two-phase simulated annealing algorithm optimises a design to have DFX
and fixed-point working alongside each other. Comparison is made with fully
fixed-point designs on FPGA and ASIC platforms.
Chapter 7 concludes the thesis and suggests some future work.
1.3 Contributions
The original contributions in this thesis are:
1. Exponent Recoding as a concept that applies a mapping function to
the exponent field of a number. Conventional number representations
are shown to be special cases of an exponent recoding concept. A new
number representation, Dual FiXed-point (DFX), was defined as a special
case of Exponent Recoding by using only a single bit for the exponent.
The characteristics of DFX are compared with other conventional number
representations.
2. The design of hardware modules to perform multiple wordlength and
scaling DFX arithmetic operations. Various steps were taken to minimise
the loss of precision. As these modules are described in a synthesizable
hardware description language (in this case, VHDL), they are readily
synthesizable onto any platform and their input and output ports are
completely parameterisable. Comparisons are made with equivalent im-
plementations of other common number representations on FPGA and
ASIC platforms.
3. Analysis of the errors introduced by using DFX. Due to the existence
of correlation between errors, a novel profiling simulation is employed to
extract the probability distribution function of the signals in a design.
This mixed simulation and static analysis approach quickly estimates
the errors after a single-pass simulation.
4. The error analysis takes into account any correlation amongst errors in-
troduced when using DFX. It is also shown that the correlation amongst
the errors only exists when signal truncation is used but there is no
correlation when rounding is used.
5. The inclusion of DFX into the RightSize high-level synthesis tool to gen-
erate area-optimised designs under user-specified constraints. The two-
phase simulated annealing algorithm optimises a design to have DFX and
fixed-point working alongside each other.
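The single exponent bit at the heart of contribution 1 can be illustrated with a small decoding sketch. The bit layout and the fraction-based value convention used here are assumptions made purely for illustration; the actual DFX format 〈n, p0, p1〉 is defined in Chapter 3.

```python
def dfx_decode(word, n, p0, p1):
    """Decode an n-bit DFX-style word (hypothetical convention).

    Assumed layout: the MSB is the exponent bit E; the remaining
    n-1 bits are a two's-complement mantissa read as a fraction
    in [-1, 1).  E = 0 selects the fine scaling 2**p0 (Num0),
    E = 1 the coarse scaling 2**p1 (Num1).
    """
    e = (word >> (n - 1)) & 1
    m = word & ((1 << (n - 1)) - 1)
    if m >= 1 << (n - 2):              # fold in the mantissa sign bit
        m -= 1 << (n - 1)
    frac = m / (1 << (n - 2))          # two's-complement fraction
    return frac * (2.0 ** (p1 if e else p0))

# The same mantissa pattern denotes a small, finely quantised value
# under Num0 scaling, or a much larger value under Num1 scaling:
n, p0, p1 = 9, -4, 4
print(dfx_decode(0b0_01000000, n, p0, p1))   # E = 0: fine scale
print(dfx_decode(0b1_01000000, n, p0, p1))   # E = 1: coarse scale
```

Flipping the single exponent bit reuses the same mantissa bits at a coarser scaling, which is how one bit of overhead buys the extended dynamic range.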
Chapter 2
Background
2.1 Introduction
To provide an overview and background for this thesis, the various design
choices open to a DSP designer are reviewed in this chapter. The main focus
here is twofold. Firstly, the various possible number representations and
their suitability for computation in the context of real-time DSP algorithm
implementation will be examined and compared. Then the recent research
in the area of wordlength and scaling optimisation for these various number
representations is appraised.
2.2 Question of Software vs Hardware
The main drivers in determining the platform for a DSP algorithm are
unit cost, time-to-market, or both. For projects that are time-critical, design-
ers may choose specialised DSP microprocessors for their easy programmability
and any bug-fixes or upgrades can easily be supported. Due to rapid technol-
ogy advancement, using processors as a platform for DSP makes good business
sense for small-scale production. However, the inherently serial nature of these
processors means that they are inefficient at processing algorithms which have
a large degree of parallelism, resulting in slow execution speed and increased
power consumption. Even if the improvements of DSP microprocessors con-
tinue to follow Moore’s Law so that their density doubles every 18 months,
they may still be unable to keep up with the requirements of some of the more
aggressive DSP algorithms.
Customised circuitry for applications has always outperformed general-purpose
CPUs as resources can be allocated to meet the needs of a specific application.
Two options are explored here: Application-Specific Integrated
Circuits (ASICs) and Field-Programmable Gate Arrays (FPGAs).
Traditionally, designers would choose to develop on an Application-Specific
Integrated Circuit (ASIC) platform when high performance is required to take
advantage of the large amount of parallelism found in many DSP applications.
When designed well, an ASIC can contain just the right mix of functional units
for a particular application. In addition, ASICs do not suffer from the serial
(and often slow and power-hungry) instruction fetch, decode and execute cycle
that is at the heart of any microprocessor. Although the development time
for ASICs is significantly longer, the cheaper production cost may offset the
non-recurring engineering (NRE) cost incurred when sales volumes are high.
A middle ground between microprocessors and ASIC is reconfigurable com-
puting systems such as Field Programmable Gate Arrays (FPGA). Most mod-
ern reconfigurable computing systems typically contain one or more processors
and a reconfigurable fabric. The processors would execute sequential and non-
critical code, while the reconfigurable fabric would ’execute’ code that has been
efficiently mapped to hardware. Like ASICs, reconfigurable computing systems
take advantage of the parallelism achievable in a hardware implementation.
With the improvements in process technology, the throughput of recent FPGAs
has even surpassed that of CPUs [Und04], and their ease of development and short
time-to-market places them between that of an ASIC and processor-based devel-
opment. Another area of research that has had a surge of activity due to the
improvements in process technology is around coarse-grain reconfigurable pro-
cessors such as Crisp [BJA+03], MorphoSys [LSL+00] and RICA [KNM+08].
In essence, these architectures combine a standard processor with an array of
reconfigurable hardware. The reconfigurable hardware would be tailored for
a specific task and, once that task is completed, reconfigured for another task by
the processor. This results in a hybrid architecture aimed at combining the
flexibility of software with the speed of hardware.
One of the benefits of using ASIC or reconfigurable platforms is that designers
have the freedom to customise an algorithm to meet the desired design goals
or constraints. The designer can choose how data is manipulated and which
number system is used to represent it. The ability to finely adjust the data-paths
has often been shown to lead to more efficient designs than processor-based
implementations. Constantinides and Gaffar [Con01, GML+02] have presented
automated means of finely adjusting designs by altering the precision of internal
data-paths to minimise hardware cost. Also, in recent years, tools such as
Altera's DSP Builder and Synplicity's Synplify DSP [Altb, Synb] have been
introduced to port a high-level description in Simulink [Matb] directly to a
Register Transfer Level (RTL) description for synthesis onto ASIC or FPGA
platforms.
2.3 Finite Wordlength Effects
Any practical DSP implementation in hardware will use finite wordlength
numbers and arithmetic, such as those discussed in Section 2.4. As a result,
every signal node or stored value may suffer from truncation/rounding noise
and the possibility of overflow.
Due to the finite precision of the signals in a design, the results of calculations
may need to be further quantised, by either truncating or rounding the lower
bits, to fit the result onto a signal [OW72]. Truncation is easy to perform as
the unwanted bits are simply dropped. Rounding, however, requires an adder
to be inserted in the data-path. Noise is introduced whenever truncation/rounding
is performed, and it appears as low-level noise at the design outputs. Provided
we can tolerate a certain level of noise, we can exploit this to optimise the
hardware area cost. Section 2.5 later describes some of these optimisation
techniques.
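The difference between the two quantisation schemes can be sketched in a few lines of Python (an illustration only, not taken from the thesis; the signals are modelled as plain integers):

```python
def truncate(x: int, bits: int) -> int:
    """Drop the lowest `bits` bits (arithmetic shift): no extra hardware."""
    return x >> bits

def round_half_up(x: int, bits: int) -> int:
    """Add half an LSB before shifting: needs an adder in the data-path."""
    return (x + (1 << (bits - 1))) >> bits

# Quantising 27 (binary 11011) by 2 bits, i.e. dividing by 4:
assert truncate(27, 2) == 6        # 6.75 always truncates downwards
assert round_half_up(27, 2) == 7   # 6.75 rounds to the nearest value
```

Truncation therefore introduces a biased error of up to one LSB, whereas rounding halves the worst-case error at the cost of the extra adder.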
The addition of signal truncation and rounding noise renders a design non-
linear. This nonlinearity is negligible when large signals are involved, and
quantisation noise then becomes the major concern. However, for recursive
filters this nonlinearity can cause limit cycles [Mit98]. Limit cycles are not a
problem for infinite precision number representations [Liu71], but with finite
precision a properly chosen filter structure and coefficients can free the filter
from their effects [Bom94, XB97]. Alternatively, one can determine a bound
on the maximum limit-cycle amplitude [GT88] and choose the level of
quantisation that makes the limit-cycle amplitude acceptably low.
While truncation/rounding noise results from losing the lower bits during
calculations, an overflow happens when the magnitude of a number exceeds
the upper limit of the number representation used. In the case of fixed-point
two's complement representation, an overflow results in a catastrophic increase
in noise caused by the number wrapping around. Furthermore, in recursive
filters, high-level oscillations can exist in an otherwise stable filter due to the
gross nonlinearity associated with the overflow of internal filter calculations
[CMP76, EMT69].
There are several ways to prevent overflow. One method is to force signals
to saturate to either the largest positive or largest negative number of the
representation used. By carefully selecting appropriate signals to perform
saturation arithmetic, the noise injected can be significantly reduced [CCL03].
However, additional hardware is required to implement saturation, and it slows
the design down. Moreover, saturation arithmetic is also prone to instability,
especially under zero-input conditions [SP87]. Another, more obvious, method
is to scale the signals so as to render overflow impossible. Current optimisation
methods derive the peak value of each signal through analytical or simulation
methods, which is then used to scale the signals appropriately. They do not,
however, offer any guarantee that no overflow will occur: overflows may still
occur in designs with feedback loops or under extreme input conditions.
All the effects mentioned depend on the number format used and the
parameters that define it. In all cases, providing extra bits in the data-paths
will reduce each finite wordlength effect. Then again, increasing the wordlength
means additional hardware resources will be needed. A designer has to balance
this trade-off to meet the design objectives and constraints.
2.4 Number Representations
There are many ways of representing data digitally, each with its own
advantages and disadvantages. This section addresses the basic conventional
number representations, namely fixed-point, floating-point and the logarithmic
number system, and also explores other alternative number representations.
Where possible, each number representation’s basic arithmetic operation,
precision and dynamic range will be described. The dynamic range of a number
representation is the range of possible values that it can handle. To quantify
the dynamic range, we define it as the ratio of the largest representable
magnitude over the smallest, generally expressed in decibels, i.e.:

Dynamic range = 20 log10 (largest representable magnitude / smallest representable magnitude) dB    (2.1)
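As a quick sanity check, Equation (2.1) can be evaluated directly; the following Python fragment (an illustration, not part of the thesis) reproduces the 187 dB figure quoted for the 32-bit fixed-point example later in this section:

```python
import math

def dynamic_range_db(largest: float, smallest: float) -> float:
    """Dynamic range as defined in Equation (2.1)."""
    return 20 * math.log10(largest / smallest)

# A <31,0> fixed-point magnitude spans 2**-31 up to ~1:
assert round(dynamic_range_db(1.0, 2**-31)) == 187
```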
As DSP applications are adder- and multiplier-rich, the two main arithmetic
operations (addition and multiplication) are the main focus of this thesis.
To take advantage of the parallelism in custom hardware, all the operations
described are based on bit-parallel arithmetic, where all the bits of a signal are
processed together.
2.4.1 Fixed-point
Fixed-point is a weighted positional number representation commonly employed
in custom implementations of DSP algorithms [Par00], where every operand is
modelled with a fixed-length integer part and fraction part. A fixed-point
number can be treated as an integer, and transforming from one fixed-point
format to another is done simply by bit-shifting, sign-extension and/or
zero-padding. As a result of this ease of format translation, it is possible to
assign each operand in a complex DSP task a unique and minimal wordlength
to minimise overall area, power and/or delay. There is also an abundance of
hardware library support for fixed-point arithmetic [Alt98, Xilb, Syna].
Fixed-point is essentially a number system whose radix point is fixed at a
predetermined position. A binary fixed-point number represents integers,
fractions or a combination of both. This is done by partitioning an n-bit binary
word into two sets: p bits for the integer part and (n − p) bits for the fractional
part. An additional sign bit 'S' is implied when referring to the wordlength for
fixed-point. As the name states, the binary point (or radix point) is fixed at
design time. For the remainder of this thesis, a binary fixed-point number
denoted by 〈n, p〉 has n bits for its wordlength, with the binary point scaled
p bits to the right of the sign bit. Figure 2.1 illustrates an 〈n, p〉 fixed-point
number. The symbol '•' will be used to represent the position of the binary
point throughout this thesis.
Figure 2.1: Fixed-point number format 〈n, p〉.
Representing signed numbers in binary fixed-point is normally done using
two's complement notation for easy addition and subtraction. The value, X,
of a fixed-point number 〈n, p〉 is given by (2.2). We can see that the maximum
representable value depends on the position of the binary point, or in other
words the scaling p. For the rest of this thesis, any mention of fixed-point
refers to the binary fixed-point two's complement representation unless stated
otherwise.

X = −S · 2^p + Σ_{i=0}^{n−1} x_i · 2^{i−(n−p)}    (2.2)
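Equation (2.2) can be exercised with a small Python sketch (illustrative only; the function name and bit ordering are this sketch's own conventions, with the bit list given most significant first):

```python
def fx_value(sign: int, bits: list[int], p: int) -> float:
    """Value of an <n,p> two's-complement fixed-point number (Eq. 2.2).
    `bits` holds x_{n-1} .. x_0; `sign` is the sign bit S."""
    n = len(bits)
    weighted = sum(x << i for i, x in enumerate(reversed(bits)))  # sum x_i * 2^i
    return -sign * 2.0**p + weighted * 2.0**(-(n - p))

# <4,1> examples: range is [-2, 2) in steps of 2^-3
assert fx_value(0, [1, 0, 0, 0], 1) == 1.0     # S.1000 with p = 1
assert fx_value(1, [0, 0, 0, 0], 1) == -2.0    # most negative value, -2^p
assert fx_value(0, [1, 1, 1, 1], 1) == 1.875   # largest value, 2^p - 2^-(n-p)
```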
Arithmetic
Since conventional fixed-point is a binary number system, all arithmetic
operations can be done with straightforward binary operations. For the
addition of n-bit numbers, a serial configuration of n full-adders linked together
forms the basic ripple-carry adder. This simple configuration may use the least
hardware resources, but other adder configurations, such as carry-look-ahead,
carry-skip and carry-save [Kor02], may speed up addition at the expense of
resources. FPGA designs tend to use carry-look-ahead chain adders to quickly
perform addition [Xilc].
Multiplication of fixed-point numbers is more laborious as it involves a
series of additions. In bit-parallel processing, there are two forms of multipliers:
the general full-multiplier and the constant coefficient gain-multiplier. An n-bit
full-multiplier takes two bit-parallel inputs to form the product, usually by
generating n partial products which are then summed together. Gain-multipliers
have a single bit-parallel input, which they scale by a fixed constant. Various
techniques are employed to reduce the number of partial products and/or to
accelerate their addition. Booth recoding [Boo51], for example, reduces the
number of partial products needed for gain-multipliers, and the Wallace tree
[Zim99] quickly sums the partial products.
Precision and Dynamic Range
For an 〈n, p〉 fixed-point number, X, the representable numbers lie in the
range −2^p ≤ X < 2^p, as seen in Figure 2.2. An (n + 1)-bit fixed-point number
has a dynamic range of 20 log10(2^n) dB, which depends solely on the wordlength.
For example, the absolute value of a 〈31, 0〉 fixed-point number lies between 1
and 4.7 × 10^−10, in other words a dynamic range of ≈ 187 dB.

Another property of fixed-point numbers is their uniform precision throughout
the whole representable range, which in the case of the fixed-point number
X above is 2^−(n−p). Therefore, in order to utilise the range and precision
effectively, a signal should be properly scaled so that as many of the available
bits as possible are used.
Figure 2.2: Range and precision of a two's complement fixed-point number.
2.4.2 Floating-point
In recent years, the use of floating-point arithmetic in digital signal processing
has increased dramatically due to the rapid development of hardware technology.
The main advantages of floating-point are its wide dynamic range, which reduces
the risk of overflow, and the improved signal-to-noise ratio of low-level signals.
Also, DSP algorithms are normally designed for use in a floating-point
environment. For example, Matlab [Mata], a popular algorithm exploration
and simulation tool, uses double-precision floating-point as its default datatype.
Unfortunately, the added complexity of performing arithmetic operations makes
floating-point hardware expensive and slower than its fixed-point counterpart.
Therefore, hardware designers often resort to porting their algorithms onto
fixed-point hardware.
In a floating-point number system, a number X is generally represented as

X = sgn(X) × M × 2^E    (2.3)

where sgn(X) returns the sign of the number, M is the mantissa (also sometimes
known as the fraction or significand) and E is the exponent. Typically, the
mantissa is normalised to lie within the interval M ∈ [1/β, 1), where β is the
radix. There are many variants of floating-point, which are normally not
directly compatible with one another. The most popular is the standard defined
in IEEE Standard 754-1985 [IEE85].
IEEE Standard 754-1985
IEEE Standard 754-1985 [IEE85] defines four formats of floating-point num-
bers. The first two are the basic 32-bit single and 64-bit double precision
format. The other two are extended formats used for intermediate calculation
results. Figure 2.3 shows the layout of the 32-bit single precision format where
e = 8bits and m = 23bits.
Figure 2.3: IEEE 754 single precision floating-point number format.
An IEEE 754 floating-point number, X, is given by

X = (−1)^S × 1.M × 2^(E−bias)    (2.4)

The mantissa has a hidden bit '1' implied at the MSB because the mantissa
is normalised to lie within the interval M ∈ [1, 2). To represent numbers less
than one, the exponent is biased by bias = 2^(e−1) − 1. The IEEE standard
reserves some special values, which are summarised in Table 2.1. NaN, short
for Not-a-Number, is produced when a floating-point operation receives invalid
inputs, such as when finding the square root of a negative number. When E = 0,
the mantissa is denormalised and the floating-point number has the value

X = (−1)^S × 0.M × 2^(E−bias_d)    (2.5)

where bias_d = 2^(e−1) − 2. The denormalised-number capability is seldom
included in the design of arithmetic units that follow the IEEE standard [Kor02],
mainly due to the high cost associated with its implementation. Due to the
popularity of the IEEE 754 standard, there are many hardware libraries that
support it, and therefore the remainder of this thesis will refer to it when
mentioning floating-point.
Table 2.1: IEEE 754 floating-point special values.

                M = 0    M ≠ 0
E = 0           0        Denormalised
E = 2^e − 1     ±∞       NaN
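Equations (2.4) and (2.5) can be checked against any standard IEEE 754 decoder. The Python sketch below (illustrative, not from the thesis) decodes a 32-bit single precision pattern by hand and compares it with the interpretation given by the `struct` module; the Inf/NaN cases of Table 2.1 are deliberately ignored:

```python
import struct

def decode_ieee754(word: int) -> float:
    """Decode a 32-bit single precision pattern per Eqs. (2.4)/(2.5).
    Ignores the E = 2^e - 1 special values (Inf/NaN)."""
    s = (word >> 31) & 1
    e = (word >> 23) & 0xFF
    m = word & 0x7FFFFF
    if e == 0:                                    # denormalised: 0.M * 2^(0 - 126)
        return (-1) ** s * (m / 2 ** 23) * 2.0 ** -126
    return (-1) ** s * (1 + m / 2 ** 23) * 2.0 ** (e - 127)

# 0x3FC00000: S=0, E=127, M=0x400000 -> 1.5 * 2^0
assert decode_ieee754(0x3FC00000) == 1.5
# Smallest denormal agrees with the platform decoder:
assert decode_ieee754(0x00000001) == struct.unpack('>f', b'\x00\x00\x00\x01')[0]
```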
Arithmetic
When compared to fixed-point, implementing floating-point arithmetic
operations is more complicated because of the extra exponent field and the
normalised mantissa. When performing addition, the significands of both
operands have to be pre-aligned. After addition, the output's mantissa has to
be re-normalised and its exponent updated to reflect the new value. All the
pre- and post-alignment is done using priority encoders and barrel shifters
[Kor02], which are expensive in terms of hardware area and power consumption,
and they tend to have large combinational delays. A simplified block diagram
of a floating-point adder/subtractor is depicted in Figure 2.4. The multiplication
operation is slightly easier as no input pre-alignment is needed, although the
product of the multiplication needs to be re-normalised and its exponent
re-calculated.

All these extra pre- and post-alignment operations add a significant amount
of overhead circuitry, which translates to increased hardware area, latency
and power consumption when compared to fixed-point. Apart from the hardware
used for pre- and post-alignment, the core arithmetic implementation is the
same as for fixed-point.
Figure 2.4: Floating-point adder/subtractor. Figure taken from [Kor02].
Precision and Dynamic Range
To demonstrate the precision and dynamic range of a floating-point number,
an IEEE single precision (32-bit) floating-point number X is used. Without
considering denormalisation, the floating-point number can take real values in
the range between 2^−127 and 2^128 (≈ 1535 dB). Adjusting the width of the
exponent field changes the dynamic range of the number.

Unlike fixed-point, the precision of a floating-point number varies depending
on its exponent. When compared to a properly scaled fixed-point number of
equal wordlength, floating-point will always be less precise due to the inclusion
of the exponent field.
Format variants
There have been many variants of floating-point proposed by various
researchers, each to meet the demands of their own application. Munafo [Mun96]
gives a detailed summary of many different floating-point variants, but only a
few notable ones are described here.

In a bid to extend the precision of floating-point numbers, Dekker [Dek71]
and Kahan [Kah65] pioneered approaches that essentially double the number
of bits used. Priest went further with his work on arbitrary-precision
floating-point numbers and showed that the computational cost of guaranteeing
accuracy is fairly reasonable [Pri91]. To extend the range, Yokoo introduced an
overflow- and underflow-free representation [Yok92] by not bounding the width
of the exponent field. This is made possible by using a prefix-free encoding
scheme by Hamada [Ham83] to encode the exponent with a self-delimiter,
allowing the exponent size to grow as necessary while sacrificing the size of the
mantissa. Both the precision- and range-extension methods described exist as
software methods.
Representation of decimal numbers has always been problematic in binary
number systems. A densely packed decimal encoding scheme that encodes
3 decimal digits (1,000 combinations) into 10 binary digits (1,024 combinations)
has been proposed by [Cow02] and is being worked into the draft revision of
the IEEE 754 floating-point standard [IEE07]. This is highly useful in the
financial field, where numbers like 0.1 can be represented accurately and
numbers are normally separated by commas in groups of 3 decimal digits.
2.4.3 Logarithmic Number System
In a logarithmic number system (LNS), operations such as multiplication, divi-
sion, powers and roots become easy as they are reduced to performing addition,
subtraction, multiplication and division operations respectively. However, ad-
dition and subtraction operation is more complicated as alignment of radix has
to be performed. Despite this, the LNS has generated considerable amount of
32
interest ever since its introduction, especially for designs with a high multiplier
to adder ratio [MTS95, FML06, CDdD06]. As Swartzlander et. al. pointed
out, LNS is intended to enhance the implementations of specialised applica-
tions and not meant to replace fixed-point or floating-point number systems
[SA75].
Figure 2.5: LNS number format.
A signed LNS number is represented by a sign bit S together with a logarithm
E. The logarithm is encoded with an integer part, i, and a fractional part, f,
as seen in Figure 2.5. The value of X is therefore defined in Equation (2.6)
below. In order to represent numbers smaller than one, negative logarithms
are needed. Therefore, the logarithm field may be two's complemented or
biased in the form "E − bias", where the bias is predetermined by the designer
at design time. In essence, an LNS number is a floating-point number whose
mantissa always equals 1.0.

X = (−1)^S · 2^E          if E is two's complement, or
X = (−1)^S · 2^(E−bias)   if E is biased.    (2.6)
Arithmetic
Since all values in LNS are logarithms, operations such as multiplication and
division simplify to a mere addition and subtraction, as seen in (2.7) for two
inputs A and B. The hardware required for multiplication in LNS is therefore
the same as for fixed-point addition.

log_β(A × B) = E_A + E_B
log_β(A ÷ B) = E_A − E_B    (2.7)
In contrast, LNS additions and subtractions are more complicated, and their
results often suffer from a lack of accuracy. A brute-force solution for this
operation is to use a complete look-up table. However, the size of such a table
is prohibitively large (2^2n × n) for an adder with n-bit input and output
wordlengths [Kor02]. A more common approach is to approximate the result.
It can be shown that addition and subtraction of LNS numbers can be
determined using the following equations:

log2(A + B) = E_A + Ψ+(E_B − E_A)
log2(A − B) = E_A + Ψ−(E_B − E_A)    (2.8)

where Ψ+(z) = log2(1 + 2^z) and Ψ−(z) = log2(1 − 2^z), with the condition
that z = |E_B − E_A| > 0. Both Ψ+(z) and Ψ−(z) can be pre-calculated and
stored in look-up tables, e.g. ROM. Figure 2.6 shows a typical LNS adder or
subtractor data flow taken from [TGJR88], where the addition of the logarithms
is performed by an ordinary fixed-point adder. Each ROM table would be no
larger than 2^n × n, but when n ≥ 20 the size of the ROM becomes prohibitively
large, and therefore several approaches have been suggested and implemented
to reduce the size of the look-up tables. The approach of [TGJR88] is to partition
the 2^n × n table into several smaller tables.
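Equations (2.7) and (2.8) can be sketched in Python (an illustration, not the hardware data flow of Figure 2.6; here Ψ+ is computed directly, whereas in hardware it would be a ROM look-up):

```python
import math

def lns_mul(ea: float, eb: float) -> float:
    """LNS multiplication (Eq. 2.7): just add the logarithms."""
    return ea + eb

def lns_add(ea: float, eb: float) -> float:
    """LNS addition (Eq. 2.8); Psi+ is a ROM look-up in hardware."""
    psi_plus = lambda z: math.log2(1 + 2 ** z)
    return ea + psi_plus(eb - ea)

# 4 * 8 = 32  ->  in log2 terms: 2 + 3 = 5
assert lns_mul(2.0, 3.0) == 5.0
# 4 + 8 = 12  ->  result is log2(12)
assert abs(2 ** lns_add(2.0, 3.0) - 12.0) < 1e-9
```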
Precision and Dynamic Range
As an example to compare with the IEEE floating-point standard, we take a
32-bit wordlength LNS number with an 8-bit integer part, a 23-bit fractional
part and an exponent bias of 2^7. The range of the LNS number is thus between
±2^−128 and ±2^128 (≈ 1541 dB), which is in the same range as a 32-bit
floating-point number. The LNS's dynamic range depends on the size of the
logarithm's integer part. Similar to floating-point, the precision of an LNS
number varies between values, and the precision gets coarser at higher logarithm
values.

Figure 2.6: Adder/subtractor for logarithmic numbers. A fixed-point (FX)
adder is used to perform the addition and the ROM contains the look-up table
for either Ψ+ or Ψ− of Equation (2.8).
2.4.4 Block Floating-Point
Block floating-point (BFP) is an attempt to strike a compromise between fixed-
point and floating-point. It utilises the benefits of dynamic scaling in floating-
point while taking advantage of fixed-point arithmetic operation’s simplicity.
First introduced by Oppenheim, BFP arithmetic has been used in the reali-
sation of digital filters [Opp70, SW86]. BFP has been used in several digital
audio data transmission standards. These audio standards include NICAM
(stereophonic sound system for PAL TV standard) [Bow87] and the audio
part of MUSE (Japanese HDTV standard) [Nim87].
BFP can be considered a special case of floating-point representation [KA96],
where a block of N numbers has a joint scaling factor corresponding to the
largest magnitude of the numbers in the block, i.e.

[x_1 · · · x_N] = [x̂_1 · · · x̂_N] · 2^γ,   where x̂_i = 2^−γ · x_i    (2.9)

The block exponent γ is defined by

γ = ⌊log2 M⌋ + 1 + k    (2.10)

where M = max(|x_1|, · · · , |x_N|) and ⌊x⌋ is the floor function, which returns
the largest integer less than or equal to x. The magnitudes of the block
mantissas x̂_i lie in the interval |x̂_i| ∈ [0, 2^−k]. The constant scaling term k
may be required by some applications to ensure no overflow of internal signals
and the output.
Despite its name, block floating-point is not truly a number representation;
it is a method to extend the dynamic range of a fixed-point algorithm. BFP's
strength is that the block exponent, γ, need only be represented once for a
whole block of numbers, while the main operations can be done normally in
fixed-point. Because the BFP format works on a block of data at a time, a
block of memory is needed to store the inputs and/or results, with the amount
of memory increasing with the number of pipeline stages. Also, the lower bits
of smaller signals get quantised away. [KF00] introduces a variant called
hierarchical BFP which preserves the lower bits.
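The block scaling of Equations (2.9) and (2.10) can be sketched as follows (an illustrative Python fragment, not from the thesis; the function name is this sketch's own):

```python
import math

def block_float(xs: list[float], k: int = 0):
    """Split a block into mantissas and one shared exponent (Eqs. 2.9-2.10)."""
    m = max(abs(x) for x in xs)                # M, the largest magnitude
    gamma = math.floor(math.log2(m)) + 1 + k   # block exponent, Eq. (2.10)
    mantissas = [x * 2.0 ** -gamma for x in xs]  # all |x_hat| <= 2^-k
    return mantissas, gamma

mants, g = block_float([0.75, -3.0, 1.5])
assert g == 2                                  # floor(log2(3)) + 1
assert mants == [0.1875, -0.75, 0.375]         # originals recovered by * 2^g
```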
2.4.5 Residue Number System
With the residue number system (RNS) [ST67], numbers are represented by
their residues with respect to a set of relatively prime moduli. Residue number
systems are inherently integer systems due to the definition of the residue.
An RNS is defined by a set of N integer constants M_1, M_2, ..., M_N referred
to as the moduli. The residue representation of an integer X is given by
(X_N, X_{N−1}, ..., X_1), where X_i is the residue of X modulo M_i. Each
residue X_i is therefore the smallest positive integer remainder of X/M_i:

X_i = X mod M_i = X − M_i · ⌊X/M_i⌋    (2.11)

where ⌊x⌋ returns the largest integer less than or equal to x. There is no
ambiguity in this system when representing integers within the range R given by

0 ≤ R ≤ (∏_{i=1}^{N} M_i) − 1.    (2.12)

As an example, let the set of moduli be {3, 4, 5}. The number 34 can then be
represented by its residues (1, 2, 4).
In conventional computer arithmetic, the addition of two numbers requires
the carries to be propagated from one end of the adder to the other. Arithmetic
operations in RNS are performed on the corresponding residues without carries
or any other interaction between the residues:

(A ± B) = ( (A_N ± B_N) mod M_N, . . . , (A_1 ± B_1) mod M_1 )
(A × B) = ( (A_N × B_N) mod M_N, . . . , (A_1 × B_1) mod M_1 )    (2.13)

Any carry bit terminates at the boundaries between residues, meaning RNS
has the advantage of adding large numbers quickly.
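The carry-free channel structure of Equation (2.13) can be seen in a few lines of Python (illustrative only; function names are this sketch's own), reusing the {3, 4, 5} moduli example from the text:

```python
def to_rns(x: int, moduli: tuple[int, ...]) -> tuple[int, ...]:
    """Residues of x with respect to a set of pairwise-coprime moduli."""
    return tuple(x % m for m in moduli)

def rns_add(a, b, moduli):
    """Each residue channel is added independently: no carry propagation."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, moduli))

M = (3, 4, 5)
assert to_rns(34, M) == (1, 2, 4)                       # example from the text
assert rns_add(to_rns(34, M), to_rns(7, M), M) == to_rns(41, M)
```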
However, the uniqueness of RNS poses various problems. Representing
fractions and performing division in RNS is not trivial, as residues are inherently
integers [And96, LC92]. As RNS is a non-positional number system, sign
determination, magnitude comparison and overflow detection are also non-trivial
[HP94]. Despite this, RNS has been shown to offer area cost savings, high
speed and low power dissipation in DSP applications [FP97, Sto05]. The basic
number systems (fixed-point, floating-point and LNS) can all be represented
in residue form. A residue version of fixed-point has been explained here; the
residue versions of floating-point and LNS are described by [KL97] and [Arn05]
respectively. Because there is no carry digit to propagate, the residue version
of LNS is the quickest means of performing multiplication and division.
2.4.6 Signed-Digit Number System
A signed-digit (SD) number system has redundancy in its representation,
meaning that values are not uniquely represented. Take for example a binary
signed-digit (BSD) number X denoted by the radix-2 representation
(x_n, x_{n−1}, ..., x_0)_BSD. Each digit x_i is drawn from the symmetric digit
set x_i ∈ {−1, 0, 1}. The value of the number is found in a similar fashion to
fixed-point:

X = Σ_{i=0}^{n} x_i · 2^i    (2.14)

For example, A = 9_10 can be written as (01001)_BSD = (0101 1̄)_BSD =
(1 1̄ 001)_BSD, where 1̄ = −1.
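The three redundant encodings of 9 can be checked with a short Python sketch of Equation (2.14) (illustrative only; digits are given most significant first, with −1 standing for the overbarred digit):

```python
def bsd_value(digits: list[int]) -> int:
    """Value of a radix-2 signed-digit number (Eq. 2.14), digits in {-1, 0, 1}."""
    v = 0
    for d in digits:          # Horner evaluation, MSD first
        v = 2 * v + d
    return v

# The three encodings of 9 from the text:
assert bsd_value([0, 1, 0, 0, 1]) == 9      # 01001
assert bsd_value([0, 1, 0, 1, -1]) == 9     # 0101(-1)
assert bsd_value([1, -1, 0, 0, 1]) == 9     # 1(-1)001
```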
Apart from the need to convert to and from integer numbers, the main
disadvantage of signed-digit number systems is the extra bits required to
represent them; hence they are rarely used as a conventional number
representation. Typically, BSD is used as an internal representation of an
arithmetic logic unit or in specially designed circuits. Parhami [Par88]
demonstrated that, provided a BSD number has no repeated strings of 1 or 1̄
in either operand, i.e. a_i × a_{i−1} ≠ 1, the addition will not need any carry
propagation. This is particularly useful to speed up multiplication, division
and square-root operations. Booth [Boo51] uses the BSD number system to
recode the multiplier for high-speed multiplication, reducing the number of
partial products needed for the multiplication operation.
2.4.7 Rational Arithmetic
The number systems described so far are unable to represent all real numbers
exactly. Take for example the real number 1/3, which would have to be
truncated to fit into a fixed-point number, introducing errors. Another way to
represent real numbers is to use fractions. There is no difficulty in representing
rational numbers in hardware; it suffices to have a pair of integers, one for the
numerator and one for the denominator, as shown in Figure 2.7. A real number,
X, in rational arithmetic (RA) is given by

X = (−1)^S × N/D    (2.15)
Figure 2.7: Example of a rational representation format.
The maximum value representable by RA depends on the width of the
numerator field, i, and the precision depends on the width of the denominator
field, (n − i). Matula and Kornerup proposed several variants of the format
described above, among them a floating-slash number system [MK85] where
the representation includes an additional field to denote the width of the
numerator. This allows adjustments to be made to accommodate a number
depending on its magnitude, akin to floating-point. However, since the
wordlength n is fixed, the precision of the number reduces as its magnitude
increases. The authors have also demonstrated several hardware arithmetic
units for RA [KM83, KM88], but there has been little uptake among hardware
designers and researchers. Rational arithmetic has been found in software
implementations such as LEDA [MN99], a collection of C++ libraries for
efficient data types and algorithms.
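The exactness that rational arithmetic buys can be demonstrated with Python's standard `fractions` module (a software example in the spirit of LEDA, not a hardware implementation):

```python
from fractions import Fraction

# 1/3 cannot be represented exactly in binary fixed- or floating-point,
# but a numerator/denominator pair holds it exactly:
third = Fraction(1, 3)
assert 3 * third == 1                      # exact: no accumulated rounding error

# Contrast with binary floating-point, where even 0.1 is inexact:
assert 0.1 + 0.2 != 0.3
assert Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)
```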
2.4.8 Level-Index
In an effort to represent real numbers without overflow/underflow, Clenshaw
and Olver [CO84] proposed representing numbers based on repeated
exponentiation. A non-negative real number, X, may be written in the form

X = e^(e^(···^(e^I)))

where 0 ≤ I < 1 and the exponentiation is repeated L times. The values of L
and I are respectively known as the level and index of the Level-Index (LI)
number representation. The structure of an LI number is shown in Figure 2.8,
and a real number X_LI is mapped in the following way:

X_LI = (−1)^S × φ(L, I)    (2.16)

where the 'generalised exponential function' φ is defined as

φ(L, I) = e^φ(L−1, I)   if L > 0
φ(L, I) = I             if L = 0    (2.17)
Figure 2.8: Level-Index/Symmetric Level-Index number format.
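The recursion of Equation (2.17) translates directly into code; the following Python sketch (illustrative, not a hardware implementation) evaluates the generalised exponential function:

```python
import math

def phi(level: int, index: float) -> float:
    """Generalised exponential function of Eq. (2.17):
    `level` repeated exponentiations applied to `index`."""
    return index if level == 0 else math.exp(phi(level - 1, index))

assert phi(0, 0.5) == 0.5                      # L = 0: just the index
assert phi(1, 0.5) == math.exp(0.5)            # L = 1: e^I
assert abs(phi(2, 0.0) - math.e) < 1e-12       # e^(e^0) = e
```

Even small levels produce enormous values: phi(4, 0.5) already exceeds the range of IEEE single precision, which is what gives LI its near limitless range.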
The authors improved on LI with the symmetric level-index (SLI)
representation [CT88], which represents numbers in the range [0, 1) using the
reciprocal of the generalised exponential function:

X_SLI = (−1)^S × φ(L, I)^r(X)    (2.18)

where

r(X) = +1 if |X| ≥ 1
r(X) = −1 if |X| < 1    (2.19)
The repeated exponentiation in the LI/SLI number system allows a near
limitless range of representable numbers. In particular, with l = 3 bits for
storage of the level (plus one more bit for the reciprocal sign, r(X), in the case
of SLI), the problems of overflow and underflow become "non-existent" [LO90].
In contrast, in the conventional floating-point system, overflow and underflow
can always occur, regardless of the width of the exponent field. On top of that,
computed numbers in SLI maintain considerable precision compared to
floating-point for large numbers. Take for example a number whose magnitude
is 10^1,000,000. Clearly it is beyond the limit of IEEE double precision
floating-point, yet a 64-bit SLI number with L = 3 retains a relative precision
of about 1.6 × 10^−10 [LO90].

The arithmetic operations for LI/SLI are considerably more complicated,
which limits its adoption. There is no known hardware implementation of LI,
only some software implementations such as one in Turbo Pascal [Tur89].
2.4.9 Comparison and Summary
In this section, comparisons are made among the conventional number repre-
sentations (fixed-point (FX), floating-point(FP) and LNS) before summarising.
From the discussions above, the various differences between the number repre-
sentations do not shed enough light to quantify the differences in terms of chip
41
Table 2.2: Area and critical path delay for 16-bit arithmetic units, taken from
Tables 4.4 to 4.6. These arithmetic units were implemented on an ASIC
platform.

Format     Area (cells)
           Adder    Gain Multiplier    Full Multiplier
FX           449         2551               8077
FP          5049         2627               5312
LNS        23490          292                553

Format     Critical path delay (ns)
           Adder    Gain Multiplier    Full Multiplier
FX         2.390         8.740              9.325
FP        13.607        10.881             12.146
LNS       18.842         2.483              5.616
area cost. Results from Tables 4.4 to 4.6 in Chapter 4 (summarised in Table
2.2 for convenience) are quoted in this section for comparison. These tables
show the synthesised chip area cost of the three basic arithmetic operations:
addition, constant gain multiplication and general multiplication.

The floating-point and LNS adders implemented on an ASIC1 are 11 and 52
times larger than the fixed-point adder respectively. Also, the fixed-point adder
is 5.7 and 7.9 times faster than the equivalent floating-point and LNS adders.
LNS may be bad for addition, but its multipliers are the cheapest and quickest:
when compared to fixed-point, the LNS multipliers are on average 11.7 times
smaller and 2.5 times faster in ASIC. Multipliers for fixed-point and
floating-point, however, are similar in terms of area and speed. The adders and
multipliers were synthesised with parameters chosen to match each other's
dynamic range.
As there is no comparative study between number representations on a
system level, a simple 4-tap FIR filter for each number representation with
increasing wordlength is made to compare the number representations. A
diagram of the filter can be found in Figure 4.1(b) and the coefficients used
1UMC 0.13um High Density Standard Cell Library
Table 2.3: Area and critical path delay (CPD) results for the 4-tap FIR filter with increasing wordlength, implemented using the UMC 0.13um High Density Standard Cell Library.

          Fixed-point        Floating-point       LNS
Design    Area      CPD      Area      CPD        Area       CPD
          (cells)   (ns)     (cells)   (ns)       (cells)    (ns)
1           9829     9.9     16898     51.4        16188     39.1
2          11042    10.3     19710     54.3        23295     43.3
3          13067    11.4     21992     58.5        31562     46.9
4          14845    12.0     24854     61.7        42918     49.6
5          17261    13.0     26461     62.9        57261     54.0
6          18198    13.4     29685     63.4        77916     58.5
7          20005    14.4     32481     70.2       107450     59.5
Figure 2.9: Area (cells) versus critical path delay (ns) for the designs in Table 2.3, with one series each for fixed-point, floating-point and the logarithmic number system.
Table 2.4: Dynamic range comparison of 32-bit number representations.

Number representation    Fixed-point    Floating-point    LNS
Dynamic range            187dB          1535dB            1541dB
were randomly chosen. Each signal in the filter is of equal wordlength. For each incremental design, the width of the signals in the filter is increased one bit at a time, and the parameters for each number representation are selected to give a similar dynamic range for each design (refer to Table 4.7, designs 1 to 7). Table 2.3 tabulates the results for the filters and Figure 2.9 plots them. From Figure 2.9, we can see clear separations between the types of number representation: fixed-point results are small and fast while floating-point is large and slow. LNS initially lies between the two, but its last three results fall outside the range of the graph.
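The structure of the benchmark is easy to state in software. The sketch below is a direct-form 4-tap FIR in Python; the coefficient values are hypothetical, since the randomly chosen coefficients of the actual designs are not listed here:

```python
def fir4(samples, coeffs):
    """Direct-form 4-tap FIR: y[t] = c0*x[t] + c1*x[t-1] + c2*x[t-2] + c3*x[t-3]."""
    assert len(coeffs) == 4
    taps = [0.0, 0.0, 0.0, 0.0]          # delay line, initially cleared
    out = []
    for x in samples:
        taps = [x] + taps[:3]            # shift the new sample in
        out.append(sum(c * t for c, t in zip(coeffs, taps)))
    return out

# The impulse response of an FIR filter is its coefficient sequence:
print(fir4([1, 0, 0, 0, 0], [0.5, 0.25, -0.125, 0.0625]))
# → [0.5, 0.25, -0.125, 0.0625, 0.0]
```

In the hardware designs, each multiply and add here becomes a fixed-point, floating-point or LNS arithmetic unit of the chosen wordlength, which is what Table 2.3 measures.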
Comparing the dynamic ranges of the three major number representations in Table 2.4, we can see that for designs that need a large dynamic range the obvious choice is to use floating-point or LNS. Collange suggested using LNS over floating-point when the number of multiplications is greater than the number of additions, but never stated a value [CDdD06], while Coleman suggested a ratio of at least 3:2 before seeing any speed improvements [CC99]. However, most DSP applications do not need such a wide dynamic range. Fang et al. showed that an H.263 video decoder preserved good perceptual quality while using a 'lightweight' 14-bit floating-point format (1 sign-bit, 5 exponent and 8 mantissa bits) without denormalisation [FCR02]. The video decoder with lightweight floating-point was about 3 times smaller than the full IEEE implementation, though it is 2 times larger than a fixed-point version. These results point towards a trend of IEEE-like floating-point formats with reduced wordlength. In [LEK05], 16-bit floating-point instructions for embedded processors are demonstrated, and NVIDIA recently introduced a 16-bit half-precision floating-point type to their Cg language [NVI05].
In the field of FPGAs, implementation of floating-point has been particularly expensive, and researchers have been using parameterised floating-point modules for DSP operators [JL01], [DGL+02]. Techniques to convert from a description of a DSP algorithm to a parameterised floating-point implementation have been presented [GML+02, LGML05]. However, a properly optimised fixed-point design will usually result in smaller area and higher performance when compared with other number representations.
The less conventional number representations surveyed do not provide any form of compromise between dynamic range and hardware implementation complexity. Block floating-point (BFP) is a method to extend the range of a fixed-point algorithm and is not a number representation in itself. An application with BFP needs to register a block of its inputs and outputs, which means that the outputs incur latency delays. Residue and signed-digit number systems are normally used for speeding up internal calculations such as multiplications; their unique nature brings many disadvantages that hinder their use as general number representations. Rational arithmetic and level-index are best suited to special-purpose computing in the software realm.
Among the conventional number representations, fixed-point is currently
the best performer in terms of hardware implementation cost and floating-
point is used when large dynamic range is required. The two formats differ
significantly in terms of hardware implementation cost and at present, there
is no number representation that gives any compromise between the dynamic
range and hardware implementation complexity.
2.5 Wordlength and Scaling Optimisation
One of the main objectives of designers is to find optimal designs to meet the
requirements of an application. Optimality of designs could refer to the area,
critical path delay, throughput and/or power consumption. The wordlength
and scaling parameters of signals can be tweaked by designers to improve these
metrics.
Signal wordlength optimisation has enjoyed considerable attention in the research community. An optimisation procedure can typically be separated into two parts: range analysis and precision analysis. Range analysis determines the dynamic range required by the signals in the system, whereas precision analysis refines the wordlengths of signals needed to meet the performance requirement of a design. The methods available can also be classified into two types, one being fully simulation-based and the other fully analytical; both have their own advantages and disadvantages. There is, however, a growing number of mixed simulation and analytical methods, or hybrid methods, being proposed. All the methods reviewed allow the user to specify a trade-off between numerical accuracy and efficiency in chip area, speed performance and/or power consumption in the implementation.
2.5.1 SKK Methodology
Sung, Kim and Kum have developed a method to determine the optimum wordlengths for DSP algorithms based solely on simulation. Their method measures the performance of a fixed-point algorithm using simulation results and iteratively modifies the wordlengths to find an optimum set that minimises their objective function.
For the range analysis, statistics of each signal x, such as its mean (µx) and standard deviation (σx), are collected via a single-pass simulation [KS94]. Using this information, the authors proposed a statistical range for signal x determined by R(x) = |µx| + k σx, where k is a user-specified integer. For a symmetric, uni-modally distributed signal, the scaling of the signal can therefore be made to accommodate this range. The authors extended this framework to classify signals using their skewness and kurtosis characteristics [KKS98].
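The range analysis above amounts to one pass over recorded samples. A small Python sketch of the idea (the helper names and the mapping from range to integer bits are ours, for illustration):

```python
import math

def statistical_range(samples, k=4):
    """SKK-style range estimate R(x) = |mu| + k*sigma from one simulation pass."""
    n = len(samples)
    mu = sum(samples) / n
    sigma = math.sqrt(sum((s - mu) ** 2 for s in samples) / n)
    return abs(mu) + k * sigma

def integer_bits(r):
    """Bits needed to the left of the binary point to cover roughly [-R, R)."""
    return max(0, math.ceil(math.log2(r))) if r > 0 else 0

samples = [1.0, -1.0, 3.0, -3.0]
r = statistical_range(samples, k=2)   # mu = 0, sigma = sqrt(5), so R = 2*sqrt(5)
print(integer_bits(r))                # → 3
```

A larger k trades a wider (partly wasted) integer part against a lower probability of overflow.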
In [SK94, SK95], the precision analysis starts out with all signals having a large wordlength (64 bits). Each signal's wordlength is reduced individually until it reaches its 'minimum wordlength', at which the design only just meets a user-specified error specification. The set of minimum wordlengths, together with the uniform wordlength that satisfies the error specification, forms the bound for the minimum-hardware-cost optimisation phase. This phase may be done either through an exhaustive search, or by using a heuristic that favours reducing the wordlength of the signals that have the greatest impact on minimising hardware cost. The authors went on to improve their framework in [KS01] in a bid to reduce the number of simulations. Signals around the adders after multiplications are grouped together and optimised as a "signal wordlength group", and their error effects are analysed using standard quantisation noise models for linear, time-invariant systems [OS99].
There are various other works similar to the SKK methodology. Roy et al. proposed a MATLAB-to-fixed-point FPGA implementation flow [RB04]. The main difference is that their precision analysis minimises the wordlength of all units instead of the hardware cost, under the assumption that lower wordlength equals lower hardware cost. This reduces the complexity of the optimisation algorithm and produces quick results. A simulation-only optimisation can be a slow process: the optimisation procedure spends most of its time waiting for the error metric feedback. Also, the resultant design is not guaranteed to function to expectation when a different set of input stimuli is used [CRS+99].
2.5.2 Interval and Affine analysis
As opposed to simulation-only optimisation, techniques derived from interval arithmetic (IA) are a form of static-analysis-based optimisation where the bounds of a signal range and quantisation error are determined analytically. In interval arithmetic, each variable is an interval x = [x̲, x̄], where x̲ and x̄ are the lower and upper bounds respectively. An arithmetic operation on intervals results in another interval. Taking addition for example, the interval of the output is given as x + y = [x̲ + y̲, x̄ + ȳ]. Similar formulas can be derived for other arithmetic operations [Moo66].
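These interval rules are mechanical. A minimal Python sketch (tuples as intervals, helper names ours) also shows the pessimism noted below, since IA forgets that two intervals may refer to the same variable:

```python
def ia_add(x, y):
    """Interval addition: [xl, xu] + [yl, yu] = [xl + yl, xu + yu]."""
    return (x[0] + y[0], x[1] + y[1])

def ia_mul(x, y):
    """Interval multiplication: extrema over all four corner products."""
    p = [x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1]]
    return (min(p), max(p))

x = (1, 3)
neg_x = (-3, -1)
print(ia_add(x, neg_x))   # x - x should be exactly 0, but IA gives (-2, 2)
```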
Benedetti and Perona’s approach to optimisation first uses IA to extract
the range of each signal in the design [BP00]. They introduced a multi-interval
representation to monitor the wordlength growth of each signal which gives a
bound for the range of operating values for each signal. The authors noted
that IA may result in designs that are overly pessimistic, especially when there
is correlation between signals.
Affine arithmetic (AA) [SdF97] has been developed to alleviate some of the problems with IA by preserving the correlation among intervals. For a number x = [x̲, x̄], its affine form is x̂ = x0 + x1ε1 where x0 = (x̄ + x̲)/2, x1 = (x̄ − x̲)/2 and ε1 ∈ [−1, 1]. The x1ε1 term models the uncertainty or variable range of x. The affine forms of basic operations are given in [SdF97]. As an example, addition in AA may be expressed in the form

x̂ + ŷ = (x0 + y0) + Σ(xi + yi)εi, with the sum taken over i = 1, . . . , n.
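Because the εi noise symbols are shared between affine forms, correlations survive operations. A Python sketch of affine addition under these definitions (class and method names are ours; multiplication and division are omitted):

```python
class Affine:
    """Affine form x0 + sum(xi * eps_i), with each eps_i in [-1, 1]."""
    def __init__(self, x0, terms=None):
        self.x0 = x0
        self.terms = dict(terms or {})      # noise-symbol id -> coefficient

    def __add__(self, other):
        t = dict(self.terms)
        for i, c in other.terms.items():
            t[i] = t.get(i, 0.0) + c        # shared symbols add coefficient-wise
        return Affine(self.x0 + other.x0, t)

    def __neg__(self):
        return Affine(-self.x0, {i: -c for i, c in self.terms.items()})

    def interval(self):
        rad = sum(abs(c) for c in self.terms.values())
        return (self.x0 - rad, self.x0 + rad)

x = Affine(2.0, {1: 1.0})        # the interval [1, 3] as 2 + 1*eps1
print((x + (-x)).interval())     # → (0.0, 0.0); plain IA would give (-2, 2)
```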
Fang et al. [FRPC03] first demonstrated the use of AA to perform error analysis on fixed-point and floating-point DSP designs. As a result, an automated optimisation tool, MiniBit, was developed by Lee et al. [LGML05] using AA and simulated annealing for fixed-point designs. The standard affine form shown above is used for the range analysis. For the precision analysis, a quantised version of signal x is given by x̂ = x + uε. In the case of round-to-nearest quantisation of a fixed-point number 〈n, p〉, u = 2^(p−n−1) (0.5 units in the last place, ulp) and ε ∈ [−1, 1]. MiniBit then uses a simulated annealing meta-heuristic which takes the fully analytical precision analysis as feedback to minimise hardware cost.
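The rounding model can be checked numerically. Assuming the 〈n, p〉 convention of this thesis (LSB weight 2^(p−n), hence a half-ulp of 2^(p−n−1)), a Python sketch:

```python
def quantise(x, n, p):
    """Round x to the nearest point on a fixed-point <n, p> grid whose
    LSB weight is 2**(p - n)."""
    lsb = 2.0 ** (p - n)
    return round(x / lsb) * lsb

n, p = 8, 2                        # LSB = 2**-6, half-ulp u = 2**-7
u = 2.0 ** (p - n - 1)
errs = [abs(quantise(i * 0.001, n, p) - i * 0.001) for i in range(1000)]
print(max(errs) <= u)              # → True: |x_hat - x| = |u*eps| <= u
```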
2.5.3 BitSize Tool
The BitSize tool by Gaffar et al. [GML+02, GML04] is an example of a hybrid approach to optimisation. With minimal simulation effort, the tool obtains the information needed to analytically minimise the area cost. It uses the interval arithmetic methods discussed previously to determine the ranges of design signals, and automatic differentiation for the precision analysis.
Automatic differentiation, developed by the applied mathematics commu-
nity [Gri00], is able to obtain the differentials of each variable in an algorithm
as a by-product of a simulation run. BitSize uses these differentials as sensi-
tivity measures for errors induced on a signal due to quantisation. Take for
example a function y = f(x). A small change ∆x in input x would cause a
change ∆y in output y: ∆y ≈ f ′(x)∆x where f ′(x) is the first derivative or
the sensitivity of the output to changes in the input. This approximation holds provided that ∆x ≪ x.
For a differentiable function with n inputs, y = F(x1, x2, . . . , xn), the Taylor series approximates the change ∆y as shown in Equation (2.20) below, ignoring higher-order terms.

∆y ≥ ∆x1(dF/dx1) + ∆x2(dF/dx2) + . . . + ∆xn(dF/dxn)    (2.20)
In BitSize, each input term represents a signal in the design. A forward pass
analyses the differentials for each signal and by specifying a maximum tolerable
error (∆y) the errors of each signal will be bounded by (2.20). The authors in
[GML+02] suggest two ways of partitioning the output error bound between
the signal errors: (1) uniformly ∆xi = ∆y/n, or (2) weighted ∆xi = ∆y×Wi.
The weights could be chosen to reflect the relative cost for the operations at
each signal. Hence a backward pass calculates the wordlengths of signals using
the sensitivities measured and their respective error bounds.
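The forward and backward passes can be sketched directly. Under the assumption that each signal is rounded to nearest (half-ulp error), one plausible backward pass turns each partitioned error budget ∆xi and sensitivity into a number of fractional bits; the helper names and the exact bit-allocation formula are ours, for illustration:

```python
import math

def partition_budget(dy, sens, weights=None):
    """Split an output error budget dy across n signals, uniformly or by
    weights, then pick fractional bits so that half-ulp * |dF/dx_i| <= dx_i."""
    n = len(sens)
    w = weights or [1.0 / n] * n            # uniform: dx_i = dy / n
    assert abs(sum(w) - 1.0) < 1e-9
    dx = [dy * wi for wi in w]              # weighted: dx_i = dy * W_i
    # backward pass: smallest f with 2**-(f + 1) <= dx_i / |dF/dx_i|
    return [math.ceil(-math.log2(2 * d / abs(s))) for d, s in zip(dx, sens)]

# two signals, the second four times more sensitive, sharing a budget of 0.01
print(partition_budget(0.01, [1.0, 4.0]))   # → [7, 9]
```

As expected, the more sensitive signal is assigned more fractional bits for the same share of the budget.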
The methodology described offers a uniform treatment for both fixed-point and floating-point [GML04]. It does not, however, allow for mixed usage of the two number formats.
2.5.4 Synoptix and RightSize tool
Like the interval analysis methods discussed earlier, the Synoptix tool by Constantinides et al. [CCL01] is a fully static-analysis-based optimisation utility for fixed-point designs. As with all fully static analysis methods, Synoptix is limited to linear systems. RightSize improves on Synoptix to optimise non-linear systems through a hybrid approach [Con03].

The design flow of both the Synoptix and RightSize tools, shown in Figure 2.10, takes as input a DSP algorithm built in MathWorks' Simulink [Matb]. Simulink is a graphical programming environment that visualises a DSP algorithm using synchronous data-flow graphs (DFGs) [LM87, RH91], a commonly used means of representing an algorithm. The tools are entirely architecture-independent; the vendor-specific portions of the design flow are shaded in the figure. Written in C++, they make good use of classes, and extending them to incorporate new number representations is fairly straightforward. They both produce a synthesizable structural description ready for vendor tools, and
Figure 2.10: Design flow for the Synoptix/RightSize tools (shaded portions are vendor-specific). A Simulink design, user constraints, area models and (for RightSize only) representative floating-point inputs feed the tool, which emits structural VHDL, a testbench with verification outputs and a makefile; vendor synthesis and a VHDL simulator then produce the completed design, whose outputs are compared against the verification outputs.
a bit-true behavioural VHDL testbench together with a set of expected out-
puts for design verification. Also generated is a ‘makefile’ to automate the
post-synthesis and simulation processes.
Synoptix
The range analysis portion in Synoptix uses the l1-norm scaling [HJ86] on
the transfer function impulse response of the inputs to each signal. Using the
product of the l1-scaling and the input peak values supplied by the user gives
the maximum range of a signal.
In the precision analysis, every quantisation is modelled by an error injec-
tion using quantisation noise models described in [CCL99]. Output errors are
estimated by the weighted sum of the injected errors. The weights are determined using the so-called L2-scaling [HJ86] on the transfer function impulse response from each noise-injection input to the outputs. The estimated error is
compared with the user-supplied output SNR constraint during the wordlength optimisation heuristic described below.
1. Find a uniform wordlength for the design that meets the SNR constraint, and scale all wordlengths up by a factor of 2.

2. Start iterations by reducing each signal's wordlength individually until just before the SNR constraint is violated, and estimate the individual impact on the area. Provided at least one wordlength reduction was possible, the signal that gave the least area cost is nominated to have its wordlength reduced.

3. The nominated signal's wordlength is then reduced by one bit and step 2 is repeated. If no signal was nominated, the iteration terminates, giving the optimised design.
This greedy optimisation heuristic has been shown to give results within 0.7% of the optimum area for a given user constraint [CCL02]. The main disadvantage of this tool is its limitation to LTI designs, which is addressed by the author in the RightSize tool.
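The heuristic of steps 1-3 can be sketched directly. The area and SNR models below are toy stand-ins (in the real tool these come from the area models and the analytical error estimate):

```python
def greedy_optimise(wordlengths, area, meets_snr):
    """Sketch of the Synoptix greedy heuristic: each round, tentatively shrink
    every signal on its own to its minimum feasible width, nominate the one
    giving the least total area, cut it by one bit, and repeat until no
    signal can shrink."""
    wl = dict(wordlengths)                  # signal name -> current wordlength
    while True:
        best_sig, best_area = None, None
        for s in wl:
            trial = dict(wl)
            # reduce s until just before the constraint would be violated
            while trial[s] > 1 and meets_snr({**trial, s: trial[s] - 1}):
                trial[s] -= 1
            if trial[s] < wl[s]:
                a = area(trial)
                if best_area is None or a < best_area:
                    best_sig, best_area = s, a
        if best_sig is None:
            return wl                       # converged: optimised design
        wl[best_sig] -= 1                   # nominated signal loses one bit

res = greedy_optimise({'a': 12, 'b': 12},
                      area=lambda w: sum(w.values()),
                      meets_snr=lambda w: sum(w.values()) >= 20)
print(res)   # → {'a': 8, 'b': 12} with this toy model
```

With this toy model the signals shrink until the constraint binds; ties are broken in favour of the first signal examined, which is why the reduction is not shared evenly.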
RightSize and perturbation analysis
In RightSize [Con03], Constantinides used a perturbation analysis technique to produce a linearised small-signal equivalent model [SS97] of a non-linear system, in order to apply the analytical techniques used to estimate noise in LTI systems.

Using the notation from [Con03], consider an n-input differentiable function Y[t] = f(X1[t], . . . , Xn[t]) where t is the time index. Taking the first-order Taylor approximation, a small perturbation xi[t] of variable Xi[t] causes a perturbation of Y[t] such that y[t] ≈ x1[t](∂f/∂X1) + . . . + xn[t](∂f/∂Xn). This approximation is linear in each xi, although the partial derivative terms vary
with time and are a function of X1, . . . , Xn. Assuming the quantisation er-
rors are sufficiently small, the approximation function is used to make a linear
small signal equivalent of a design. In the RightSize tool, derivative monitors
are used to collect the partial derivatives during a simulation run with user
supplied inputs.
Since the model is linear, if injecting an error of variance σ² into a signal gives an output variance V, then scaling the error variance by ε (i.e. εσ²) scales the output variance by ε as well (i.e. εV). Therefore the output response can be determined analytically by scaling the response obtained from a one-time simulation with a noise of known variance. RightSize analyses the output sensitivity to each signal by injecting a known random unit-variance noise into each signal and observing the output variance as an unscaled sensitivity measure. These sensitivities are then used as weights in the weighted sum of errors, as in the Synoptix tool. The rest of the optimisation uses the same greedy heuristic as Synoptix.
2.6 Summary
This chapter introduces some of the common and not so common number
representations used in digital hardware. Since the choice of data represen-
tation is a vital aspect of any designer’s decision, it has to be treated with
care. Data representation can dramatically determine the overall chip area
and speed performance.
Fixed-point arithmetic is normally better suited for DSP applications than
floating-point arithmetic since good DSP algorithms require high accuracy
(long mantissa), but not the large dynamic signal range provided by floating-
point arithmetic [Wan99]. Floating-point arithmetic provides a large dynamic
range which is usually not required, and the cost in terms of power consump-
tion, execution time, and chip area is much larger than that for fixed-point
arithmetic. Hence, floating-point arithmetic is useful in general-purpose signal
processors, but it is not efficient for application-specific implementations.
Some wordlength and scaling optimisation techniques were also reviewed. Simulation-only techniques have the disadvantages of heavy dependence on the input data stimuli and slow run times. Purely analytical static optimisation techniques tend to suffer from pessimistic results and are not adequate for non-linear systems. The RightSize tool provides a good basis for wordlength optimisation and forms the framework upon which the work described in Chapter 6 of this thesis is based.
Chapter 3
Exponent Recoding & Dual
FiXed-point
3.1 Introduction
Having looked at the various number representation systems for digital hard-
ware in Chapter 2, this chapter details the concept of exponent recoding and a
special case derived from it, a new number representation called Dual FiXed-
point (DFX).
The idea of exponent recoding essentially takes the conventional floating-point representation and applies a recoding function to its exponent field. This function is chosen arbitrarily by the designer at design time, which gives the flexibility to trade hardware implementation complexity against the dynamic range of signals. Exponent recoding also serves as a generalising concept, of which some of the number systems mentioned in Chapter 2 are special cases.
Dual FiXed-point, also a special case of exponent recoding, improves the dynamic range of signals in digital hardware without significantly increasing circuit size. It combines the simplicity of the ordinary fixed-point data representation with the superior dynamic range of the floating-point data representation, without the inherent hardware complexity. Section 3.3 first defines DFX and then introduces its characteristics and properties.
The original contributions of this chapter are:
• the concept of Exponent Recoding and how it relates to existing number
systems as special cases,
• the definition of Dual FiXed-point as a new number system [ECC04],
and
• the characteristics of Dual FiXed-point.
3.2 Exponent Recoding
Figure 3.1: Example of a number system with exponent recoding: an n-bit word comprising a c-bit coded exponent field, Ec, and an m-bit significand field, M.
Definition 3.1. The representation of a real number, X, in the form given by (3.1) will be referred to as a number with exponent recoding (ER).

X = M · β^Φ(Ec)    (3.1)

where the base β ∈ R, M is the significand, Ec is the coded exponent and Φ is the function that recodes the exponent. The base and recoding function are attributes predetermined by the designer. ¤
A number with exponent recoding may be represented in a similar manner to conventional floating-point. Referring to Figure 3.1, the numerical data is represented by an n-bit number which contains two separate fields: the coded exponent field Ec (c bits) and the significand field M (m bits). The exponent recoding concept introduces a recoding function, Φ, applied to the coded exponent field Ec before it is used as an exponent of the base.

The recoding function Φ can be arbitrarily chosen by the designer to meet his/her needs. When the coded exponent is c bits wide, there can be up to 2^c different mapping results for the base exponent. The recoding function adds an extra level of indirection and greater flexibility to the signal precision and range. With a carefully chosen recoding function, the number system can have a very large dynamic range of representable values. Generally, as the number of possible exponent values in a number system increases, the hardware complexity increases as well.

As a note, the term "exponent recoding" has been used in the context of public-key cryptography [SMM05]. This is not to be confused with the exponent recoding concept introduced here, which applies a mapping function to the exponent field.
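Decoding a number under Definition 3.1 is mechanical once Φ is fixed. A Python sketch, with a hypothetical 2-bit recoding table chosen purely for illustration:

```python
def er_value(M, Ec, beta, phi):
    """Decode an exponent-recoded number: X = M * beta**phi(Ec)."""
    return M * beta ** phi(Ec)

# hypothetical recoding: a 2-bit coded exponent mapped to scalings {-8, 0, 4, 16}
table = {0: -8, 1: 0, 2: 4, 3: 16}
phi = lambda e: table[e]

print(er_value(3, 2, 2, phi))   # 3 * 2**4  → 48
print(er_value(3, 0, 2, phi))   # 3 * 2**-8 → 0.01171875
```

Different choices of the table trade precision near zero against reachable magnitude, which is exactly the designer's lever described above.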
3.2.1 Special Cases
Floating-point
Floating-point is a straightforward example of exponent recoding. Taking the IEEE 754 floating-point standard as an example, the base is β = 2 and the recoding function ΦIEEE754 is applied to the exponent field as follows:

ΦIEEE754(Ec) = Ec − (2^(c−1) − 1)   if 0 < Ec < 2^c − 1
             = −(2^(c−1) − 2)       if Ec = 0
             = +∞                   if Ec = 2^c − 1
where c is either 8 or 11 depending on whether it is single or double precision
format.
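This recoding function translates directly into code. A Python sketch (treating the all-ones code as infinity, as above; NaN significands are ignored here):

```python
def phi_ieee754(Ec, c=8):
    """IEEE 754 exponent recoding for a c-bit exponent field
    (c = 8 for single precision, c = 11 for double)."""
    bias = 2 ** (c - 1) - 1
    if Ec == 2 ** c - 1:
        return float('inf')        # all-ones code: infinities (and NaNs)
    if Ec == 0:
        return -(bias - 1)         # subnormals reuse the minimum exponent
    return Ec - bias

print(phi_ieee754(127))            # → 0    (exponent of 1.0 in single precision)
print(phi_ieee754(0))              # → -126
```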
To make the most of floating-point, the IEEE 754 standard normalises its significand. On top of the basic binary operation routines found in fixed-point, floating-point uses barrel shifters for the normalisation routines; these are expensive in hardware, and affect the addition/subtraction operations the most. The wider the exponent field c, the more hardware is required to perform normalisation.
Fixed-point
For the case of fixed-point, the number of exponent bits is c = 0 (i.e. there is no coded exponent bit). The recoding function therefore reduces to a fixed value ΦFX ∈ Z, with the base β = 2. Taking for example a fixed-point number with the format 〈n, p〉 (refer to Figure 2.1 for a diagram), the recoding function is given by:

ΦFX = −(n − p)

As there is no exponent field, there is no need for any kind of normalisation or denormalisation as in floating-point, and this contributes to its hardware simplicity.
Logarithmic Number System
Like conventional floating-point, the logarithmic number system (LNS) treats the exponent field in a similar way. The main difference is that its significand field has zero width (i.e. m = 0), with the whole number used for the exponent field. Here the recoding function is applied to the whole number in the manner

ΦLNS(Ec) = Ec           if Ec is two's complement
ΦLNS(Ec) = Ec − bias    if Ec is biased.

Similar to floating-point, addition and subtraction operations are expensive in hardware. It also suffers from precision loss depending on the algorithm used.
Level-Index
The repeated exponentiation number system, Level-Index (LI), initially proposed by [CO84], may also be generalised with the exponent recoding concept. The generalised exponential function in Section 2.4.8 is a recoding function with the base β = e. The recoding function ΦLI may be reinterpreted as follows:

ΦLI(Ec) = Ec                if 0 ≤ Ec < 1
        = e^ΦLI(Ec − 1)     if Ec ≥ 1

where Ec = L + I × 2^−i (refer to Figure 2.8 for notation). The repeated exponentiation definition of the recoding function gives an LI number plenty of dynamic range, but with extra complexity.
3.2.2 Discussion
As seen from the special cases, the recoding function can take any arbitrary form. For example, when c = 2 we could have each coded exponent value mapped to an arbitrary number, as in Example 3.1.
Example 3.1. In this example, we present a simple quad fixed-point number. Letting c = 2 gives four possible mappings for the coded exponent, Ec, such that

Φ(Ec) = −(n − p0)   if Ec = 0
      = −(n − p1)   if Ec = 1
      = −(n − p2)   if Ec = 2
      = −(n − p3)   if Ec = 3

This gives four different scalings of an ordinary fixed-point number, depicted in Figure 3.2 with their scalings aligned. Although this representation has not been built and tested, the complexity of performing arithmetic with it is expected to lie somewhere between fixed-point and floating-point. ¤
Figure 3.2: Example of quad fixed-point: four scalings of an n-bit signed significand, with binary points p0 to p3 aligned.
From an arithmetic hardware complexity point of view, the more choices the exponent value can take, the more complex the hardware implementation becomes, due to the need to align radix points. However, by adding a single exponent bit to an ordinary fixed-point representation, we are able to get an improved dynamic range without drastically increasing hardware cost. This is the basis for a new number representation, Dual FiXed-point.
3.3 Dual FiXed-point
3.3.1 Defining the Format
First proposed in [ECC04], the (n + 2)-bit Dual FiXed-point number representation consists of a single exponent bit, E, and an (n + 1)-bit significand, M, which includes the sign-bit. The significand is formatted in the same manner as ordinary two's complement fixed-point, with n being the wordlength excluding the sign-bit, and the binary point measured as a displacement from the right of the sign-bit towards the least significant bit (LSB). Here, the recoding function ΦDFX maps the single exponent bit to one of two exponent values. This gives the significand two possible scalings, and hence two ranges, with which to represent a number. Definition 3.2 formally defines the DFX number representation.
Definition 3.2. The representation of a real number, D, in the form given by (3.2) will be referred to as a Dual FiXed-point (DFX) number.

D = X · 2^ΦDFX(E)

ΦDFX(E) = −(n − p0)   if E = 0
        = −(n − p1)   if E = 1          (3.2)

where X is the signed significand, E is the exponent such that E ∈ {0, 1}, and p0 and p1 are binary points with p0 ≤ p1. The structure of DFX is illustrated in Figure 3.3(a). ¤
In order to achieve two different scalings, we define two binary points p0
and p1 with the condition that p0 ≤ p1 at all times. p0 and p1 represent the
displacement of the binary-point from the right side of the sign-bit towards the
LSB. We define p0 as the lower scaling parameter and it is used to scale the
Num0 number (Definition 3.3). Similarly, we define p1 as the upper scaling
Figure 3.3: Dual FiXed-point (DFX) number: (a) number format, with a 1-bit exponent E followed by the (n + 1)-bit signed significand X; (b) detailed structure of the DFX number format. The symbols • and ◦ represent the binary-point positions for Num0 and Num1 numbers respectively.
parameter for the Num1 number. Take note that p0 and p1 are allowed to
lie outside the number representation, i.e. p0, p1 < 0 or p0, p1 > n. A more
detailed DFX format structure is shown in Figure 3.3(b).
Definition 3.3. We define two types of numbers, Num0 and Num1. Num0
numbers lie in the range [−B,B) where B is a boundary (Definition 3.4) and
they are scaled by the lower scaling parameter p0. Num1 numbers lie in the
range (−∞,−B) or [B,∞) and they are scaled by the upper scaling parameter
p1. Hence Num0 is the lower range number and Num1 is the upper range
number.
Definition 3.4. With a view to simplifying the arithmetic units, the boundary is defined to be the next incremental value after the maximum positive number of Num0, i.e.

Boundary, B = 2^p0    (3.3)

¤
Having two different scalings requires a means of deciding the best scaling for a number. Knowing which range a number lies in determines the exponent that the number takes.
Definition 3.5. The choice of the exponent is determined by the value of the real number, D, such that

E = 0   if D ∈ Num0, i.e. −B ≤ D < B
  = 1   if D ∈ Num1, i.e. D < −B or D ≥ B          (3.4)

¤
There are some arithmetic computations, e.g. multiplication and addition, that can give intermediary results which do not conform to Definition 3.5, i.e. the value of the exponent bit does not match the value of the significand, and the number is improperly scaled. For example, take a DFX number Z whose format is 〈10, 2, 6〉, with boundary B = 2^2 = 4. Initially let Z = 6, which means that Z is an upper-range number, Num1, with exponent bit E = 1. Now multiply Z by 0.5. This gives the result Z = 3, which should be a lower-range number, Num0, with E = 0. However, immediately after the multiplication the exponent field has not yet been updated to reflect the change, and therefore at that point in time the number is improperly scaled. For all intents and purposes, a DFX number is expected to be properly scaled, as some of the algorithms used for the arithmetic operations rely on this.
Definition 3.6. A DFX number is said to be properly scaled if the number
strictly follows Definition 3.5. Therefore a properly scaled DFX number with
exponent bit E = 1 will not be in the Num0 number range and a DFX number
with E = 0 will not be in the Num1 number range. ¤
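Definitions 3.3 to 3.6 together fix how a value is encoded. A Python sketch of choosing the exponent bit, and the precision that follows, for a DFX 〈n, p0, p1〉 number (function name ours; this assumes the −(n − p) scaling convention used for fixed-point above):

```python
def dfx_encode(d, n, p0, p1):
    """Pick the exponent bit of a DFX <n, p0, p1> number per Definition 3.5:
    E = 0 (Num0, scaling p0) when -B <= d < B with B = 2**p0, else E = 1
    (Num1, scaling p1). Returns (E, LSB weight of the chosen scaling)."""
    B = 2.0 ** p0
    if -B <= d < B:
        return 0, 2.0 ** (p0 - n)     # fine precision near zero
    return 1, 2.0 ** (p1 - n)         # coarser precision, wider range

# The running example: Z = 6 in DFX <10, 2, 6> has boundary B = 4
print(dfx_encode(6, 10, 2, 6))        # → (1, 0.0625): a Num1 number
print(dfx_encode(3, 10, 2, 6))        # → (0, 0.00390625): after Z * 0.5
```

Re-running this selection after every operation is exactly the re-scaling step that keeps a DFX number properly scaled.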
Figure 3.4: Fully and non-fully represented DFX numbers: (a) a fully represented DFX number, with the Num0 and Num1 binary points aligned; (b) a non-fully represented DFX number, where aligning the binary points leaves a band of non-represented values between the two scalings.
In most cases, all real values within the representable range will be discretised to either the Num0 or Num1 range, as shown in Figure 3.4(a). However, consider the case shown in Figure 3.4(b): the range of values between 2^p0 and 2^(p1−n) has no representation. If a value lies within that range, the discretised DFX number would result in a value of zero, and there would be a noticeable discontinuity between the values in Num0 and Num1. For this reason, the following chapters on arithmetic modules and error analysis require that the DFX number is fully represented.
Definition 3.7. A DFX number is said to be fully represented if and only if
p1 − p0 ≤ n. ¤
A complete definition of a DFX number system requires three parameters: n for the wordlength of the significand, and the two binary points p0 and p1. The notation used from here onwards will be of the form DFX 〈n, p0, p1〉. In fact, a DFX signal is (n + 2) bits wide, with one bit for the sign and another for the exponent; this is not to be confused with the size of the significand without the sign-bit, n, which is what is meant by the wordlength. Do note that in the case where p0 = p1, the DFX representation will revert to ordinary
Table 3.1: Dynamic range comparison between DFX, fixed-point (FX), floating-point (FP) and logarithmic number system (LNS) for 32-bit and 16-bit formats.

Format    Param            Wordlength       Dynamic Range
DFX       〈30, 0, 30〉      32-bit           2^60 ≈ 361dB
DFX       〈30, 14, 26〉     32-bit           2^46 ≈ 276dB
FX        〈31, 0〉          32-bit           2^31 ≈ 187dB
FP        E:8 M:23         32-bit (IEEE)    2^254 ≈ 1529dB
LNS       I:8 F:23         32-bit           2^256 ≈ 1541dB
DFX       〈14, −5, 0〉      16-bit           2^19 ≈ 114dB
FX        〈15, 0〉          16-bit           2^15 ≈ 90dB
FP        E:4 M:11         16-bit           2^14 ≈ 84dB
LNS       I:4 F:11         16-bit           2^16 ≈ 96dB
fixed-point representation, in which case the exponent-bit serves no purpose
and can be discarded. The determination of the optimum choice of the n, p0 and
p1 parameters for a DSP design will be explored in Chapter 6.
3.3.2 Characteristics of DFX Number System
Dynamic Range
The smallest non-zero absolute value of a DFX number is 2^(p0-n) while the
largest absolute value is 2^p1, therefore the dynamic range of a DFX number is
given as

DFX dynamic range = 20 log10( 2^(n+p1-p0) ) dB    (3.5)
Having two possible scalings for a number means that DFX has better range
capability than ordinary fixed-point. This is clearly shown in Table 3.1 for
both the 32-bit and 16-bit formats. The DFX format 〈30, 0, 30〉 is an example
that shows the maximum dynamic range available to a fully-represented
32-bit DFX number. As expected, the fixed-point format has the smallest
dynamic range while floating-point and the logarithmic number system have the
largest. Both 32-bit DFX formats have a dynamic range between fixed-point
and floating-point (and the logarithmic number system). Having 8 bits for the
exponent of 32-bit floating-point and for the integer part of 32-bit LNS
gives them significantly more dynamic range than both fixed-point and DFX.
However, their 16-bit counterparts, with half as many exponent or integer
bits (4 bits), cannot rival the dynamic range of DFX 〈14, −5, 0〉.
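As an illustration, Equation (3.5) and the entries of Table 3.1 can be reproduced in a few lines of Python; `dynamic_range_db` is a hypothetical helper name, not part of any tool described in this thesis.

```python
import math

def dynamic_range_db(n, p0, p1):
    """Dynamic range of DFX <n, p0, p1> per Equation (3.5): the ratio of the
    largest magnitude 2**p1 to the smallest non-zero magnitude 2**(p0 - n)."""
    return 20 * math.log10(2 ** (n + p1 - p0))

print(round(dynamic_range_db(30, 0, 30)))   # DFX <30, 0, 30>: about 361 dB
print(round(dynamic_range_db(14, -5, 0)))   # DFX <14, -5, 0>: about 114 dB
print(round(dynamic_range_db(15, 0, 0)))    # fixed-point <15, 0>: about 90 dB
```

Setting p0 = p1 reduces the formula to the ordinary fixed-point case, matching the FX rows of the table.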
Precision and Finite Errors
Figure 3.5: Num0 and Num1 range in a DFX number. The Num0 range spans (−B, B) with boundary B = 2^p0; the Num1 range extends outward to ±2^p1.
The range and precision of Num0 and Num1 numbers are illustrated in
Figure 3.5. From the DFX definition, the range around zero is the Num0
number range and the outer range is Num1. Although Num1 numbers have
larger magnitude, Num0 numbers have finer precision.
A comparison between the precisions of 16-bit fixed-point, floating-point and
DFX number representations is given in Figure 3.6, where the precision in
significant bits is plotted as a function of absolute number value. For
floating-point, only the normalised representation is shown, hence the
number of significant bits is constant over the dynamic range: an 11-bit
mantissa plus one implicit bit. The exponent in floating-point determines the
dynamic range. For fixed-point, the number of significant bits drops by one
for approximately every 6dB decrease in the absolute number value. We
observe the same rate of decrease for DFX, except that there is a step change
when the absolute value transitions between DFX Num1 and DFX
Figure 3.6: Precision of number representations in significant bits as a function of absolute number value (in dB). The number representations shown are fixed-point 〈15, 0〉, floating-point E4:M11, and DFX 〈14, −5, 0〉.
Num0. At that point, the DFX number begins to provide better precision than
ordinary fixed-point with the same wordlength at low absolute values.
With two different scalings, a number represented in DFX has two different
precisions depending on its value. This means that any error statistics must
depend on the proportion of time the number spends in either the Num0 or the
Num1 range. Error, or quantisation error, is simply the difference between a
quantised number (e.g. fixed-point or DFX) and its equivalent infinite
precision real number. Let the probabilities of the number lying in the Num0
and Num1 ranges be represented by K and (1−K) respectively.
For DSP applications, the error statistics that interest us are the error
mean and error variances. Suffice to say for now that these two error statistics
are necessary to determine the output signal-to-noise ratio (SNR) of a system
which will be detailed in Chapter 5. For a number with DFX 〈n, p0, p1〉, using
Equation (5.3) we get the quantisation error statistics as shown in (3.6).
Error mean      μ_dfx = K μ0 + (1−K) μ1
Error variance  σ²_dfx = K(σ0² + μ0²) + (1−K)(σ1² + μ1²) − μ_dfx²    (3.6)

where μ0 = −(1/2)·2^(p0−n), μ1 = −(1/2)·2^(p1−n), σ0² = (1/12)·2^(2(p0−n)) and σ1² = (1/12)·2^(2(p1−n)).
Table 3.2: Comparing the precision errors between DFX and fixed-point. The DFX parameters are chosen to match a fixed-point 〈15, 0〉 dynamic range of ≈ 90dB and upper limit of 2^0.
DFX Truncation Mean Truncation Var
n p0 p1 K = 90% K = 99% K = 90% K = 99%
15 0 0 -1.53E-05 -1.53E-05 7.76E-11 7.76E-11
14 -1 0 -1.68E-05 -1.54E-05 1.22E-10 8.22E-11
13 -2 0 -1.98E-05 -1.57E-05 3.83E-10 1.10E-10
12 -3 0 -2.59E-05 -1.63E-05 1.59E-09 2.39E-10
11 -4 0 -3.81E-05 -1.75E-05 6.77E-09 7.94E-10
10 -5 0 -6.26E-05 -2.00E-05 2.82E-08 3.09E-09
Using (3.6), Table 3.2 shows the error mean and variance for various DFX
formats with the probability K = 90% and K = 99%. The DFX formats were
chosen to have a dynamic range of approximately 90dB, matching that of
fixed-point 〈15, 0〉. The table shows that the error statistics rely heavily on
the time the number spends in either of the ranges, and that the introduction
of the higher range increases the error mean and variance significantly.
Further widening of the two scalings will aggravate the errors observed.
Fortunately, the extra dynamic range gained from widening the two scalings
may permit shorter wordlengths, which translates to reduced area consumption.
The p0 parameter therefore gives an extra way to trade off area consumption
against error.
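The entries of Table 3.2 can be checked directly from (3.6). The sketch below is a plain Python model; the function name `dfx_error_stats` is illustrative, not from RightSize.

```python
def dfx_error_stats(n, p0, p1, K):
    """Truncation error mean and variance of DFX <n, p0, p1> per Equation
    (3.6); K is the probability of the value lying in the Num0 range."""
    mu0, mu1 = -0.5 * 2 ** (p0 - n), -0.5 * 2 ** (p1 - n)
    var0, var1 = 2 ** (2 * (p0 - n)) / 12, 2 ** (2 * (p1 - n)) / 12
    mu = K * mu0 + (1 - K) * mu1
    var = K * (var0 + mu0**2) + (1 - K) * (var1 + mu1**2) - mu**2
    return mu, var

mu, var = dfx_error_stats(14, -1, 0, K=0.90)   # second row of Table 3.2
print(f"{mu:.3g}, {var:.3g}")                  # -1.68e-05, 1.22e-10
```

With p0 = p1 the variance collapses to the familiar fixed-point truncation variance (1/12)·2^(2(p0−n)), matching the first row of the table.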
Hardware Simplicity
The added dynamic range over fixed-point does not come without a hardware cost
penalty. Consider, for example, ordinary floating-point: the normalising and
denormalising routines for its significand require expensive and slow barrel
shifters. Fortunately, DFX has only two possible scalings, and the shifting
required is known a priori, so it can be done merely with multiplexors (MUXs).
MUXs do add extra chip area compared to ordinary fixed-point, but they are
considerably cheaper and faster than the barrel shifters of floating-point.
To demonstrate this, we extended the 4-tap FIR filter results of Chapter 2
to include DFX. Table 3.3 tabulates these new results and Figure 3.7
illustrates them graphically. As before, the wordlength and scaling parameters
of DFX are chosen so that the dynamic ranges match as closely as possible
those of the other number representation designs (refer to Table 4.7, designs
1 to 7). As expected, the area and speed of the DFX filter implementations lie
between the fixed-point and floating-point implementations.
3.4 Summary
This chapter demonstrated Exponent Recoding (ER), a generalisation under which
common number representations are special cases. An arbitrary mapping
function, chosen at design time, is applied to the exponent field and provides
an extra level of indirection, allowing the signal precision and range to be
manipulated flexibly. From ER, a special case and new number representation
was derived.

Dual FiXed-point (DFX) was defined with only a single exponent bit, hence
only two exponent mapping values. It combines the simplicity of ordinary
fixed-point with extra dynamic range. The two-level precision of DFX and its
error statistics were briefly explored. Area and latency results of FIR
filters synthesized onto ASIC show that DFX designs lie between fixed-point
and floating-point designs.
Table 3.3: Area and critical path delay results for a 4-tap FIR filter with increasing wordlength, implemented using the UMC 0.13um High Density Standard Cell Library. This is an extension of the results in Table 2.3.
Design
Fixed-point Floating-point LNS DFX
Area CPD Area CPD Area CPD Area CPD
(cells) (ns) (cells) (ns) (cells) (ns) (cells) (ns)
1 9829 9.9 16898 51.4 16188 39.1 8756 23.6
2 11042 10.3 19710 54.3 23295 43.3 10472 26.5
3 13067 11.4 21992 58.5 31562 46.9 11623 27.9
4 14845 12.0 24854 61.7 42918 49.6 12366 29.1
5 17261 13.0 26461 62.9 57261 54.0 14811 31.4
6 18198 13.4 29685 63.4 77916 58.5 16312 33.3
7 20005 14.4 32481 70.2 107450 59.5 19342 35.7
Figure 3.7: An area vs critical path delay graph for Table 3.3, comparing fixed-point, floating-point, logarithmic number system and Dual FiXed-point (this thesis).
Chapter 4
DFX Modules and Arithmetic
Circuits
4.1 Introduction
In order to implement a system with the DFX number system, component libraries
must be created to support DFX arithmetic. Apart from the arithmetic modules,
supporting module libraries to handle signal quantisation, encoding and
decoding are also needed. These libraries can then be instantiated by a
synthesis system to create synthesizable hardware description language code,
and have their area consumption modelled to assist the wordlength and scaling
optimisation process (Chapter 6).
The arithmetic modules described here are targeted at the RightSize high-level
synthesis tool by Constantinides [Con03]. RightSize is able to manipulate the
basic arithmetic modules common to many DSP applications: the adder, the
constant coefficient (gain) multiplier and the full general multiplier. Before
proceeding further, the algorithm representation used in RightSize is detailed
in Section 4.2.
All the modules are designed in VHDL [IEE04] following the design objectives
and goals set out in Section 4.3. Sections 4.4 to 4.6 show working designs
that incorporate all the design objectives stated earlier. Each section
includes area and critical path delay comparisons with equivalent arithmetic
modules built with the other conventional number representations: fixed-point,
floating-point and LNS. Less area translates to cheaper hardware, and a lower
critical path delay means faster operation. The fixed-point arithmetic units
are readily available within the synthesis tools used, while the
floating-point and LNS module library used for comparison is the readily
available FPLibrary from [DdD].
Results shown are the place-and-routed results for both FPGA and ASIC
implementations. The FPGAs used are the Xilinx Virtex4 (XC4VLX15) and the
Altera Stratix2 (EP2S15). The Virtex4 series comes in versions with different
features; the XC4VLX15 model is optimised for logic, with 12,288 4-input
LUTs. Apart from LUTs, the XC4VLX15 also has XtremeDSP slices, each
consisting of an 18×18 multiplier, an adder and an accumulator, and block
RAMs for storage. The Stratix2 has similar features to the Virtex4, with the
EP2S15 containing 12,480 Adaptive LUTs (ALUTs). It also has DSP blocks
(similar to the XtremeDSP slices above) and TriMatrix memory blocks (similar
to the block RAMs above). The XtremeDSP and DSP blocks are highly suited to
multiply-accumulate functions, which are heavily used in DSP algorithms.
For the purpose of this thesis, however, only the LUTs and ALUTs of these
devices are used in the synthesis of the arithmetic modules. This enables a
fair comparison between the different hardware implementations without the
assistance of dedicated hardware such as built-in multiplier blocks.
The ASIC designs are implemented with the UMC 0.13um High Density
Standard Cell Library where each cell consumes a square micron of chip area.
The Synplify Pro synthesis flow is used for the FPGAs, followed by each
vendor's place-and-route tools, to obtain the area and critical path delay
results. For the ASIC, Synplify ASIC is used to obtain the area consumption
and critical path delay estimates.
The original contributions for this chapter are:
• design of multiple wordlength and scaling DFX arithmetic modules,
• simple interfacing with ordinary fixed-point,
• comparison made between DFX arithmetic modules and other popular
arithmetic modules on FPGAs and ASIC hardware platforms.
4.2 Algorithm Representation in RightSize
The formal data-flow graph (DFG) representation used in this thesis is
referred to as a computation graph (Definition 4.1).

Definition 4.1. A computation graph G(V, S) is the formal representation of
a design/algorithm. V is a set of graph nodes, each representing an atomic
computation or input/output port, and S ⊂ V × V is a set of directed edges
representing the data flow. An element of S is referred to as a signal. The
type of an atomic computation v ∈ V is given by type(v) in (4.1). The set S
must satisfy the constraints on in-degree and out-degree given in Table 4.1.
For each signal j = (vj1, vj2) ∈ S, vj1 is the driver node of signal j and
vj2 is the driven node.

type(v) : V → {primary in, primary out, adder, full mult, gain mult, delay, fork}    (4.1)

¤
Table 4.1: In and out degrees of nodes used in computational graph G(V, S).
type(v) in-degree(v) out-degree(v)
primary in 0 1
primary out 1 0
adder 2 1
full mult 2 1
gain mult 1 1
delay 1 1
fork 1 ≥ 2
The arithmetic operations currently implemented in RightSize and discussed in
this thesis are: the adder, general multiplication (full mult) and constant
coefficient multiplication (gain mult). The nodes are visualised in graphical
form as shown in Figure 4.1(a). Signals (directed edges) are represented by
arrows showing the direction of data flow. An example DFG is shown in
Figure 4.1(b).
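As a sketch of Definition 4.1 and the degree constraints of Table 4.1, a computation graph can be modelled and checked in a few lines of Python; the names below are illustrative, not RightSize internals.

```python
# Hypothetical model of Definition 4.1: node types with the in-/out-degree
# constraints of Table 4.1, checked over the signal set S.
DEGREES = {  # type: (in-degree, out-degree); fork instead allows >= 2 outputs
    "primary_in": (0, 1), "primary_out": (1, 0), "adder": (2, 1),
    "full_mult": (2, 1), "gain_mult": (1, 1), "delay": (1, 1),
}

def check_graph(types, signals):
    """types: node name -> type name; signals: set of (driver, driven) edges."""
    for v, t in types.items():
        ins = sum(1 for (_, d) in signals if d == v)
        outs = sum(1 for (s, _) in signals if s == v)
        ok = (ins == 1 and outs >= 2) if t == "fork" else (ins, outs) == DEGREES[t]
        if not ok:
            raise ValueError(f"degree violation at node {v} ({t})")
    return True

# A trivial graph: primary_in -> gain_mult -> primary_out
types = {"x": "primary_in", "g": "gain_mult", "y": "primary_out"}
print(check_graph(types, {("x", "g"), ("g", "y")}))  # True
```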
For the purpose of describing a computation graph with multiple wordlengths
and scalings in DFX, we define a DFX annotated computation graph
G(V, S, ADFX) (Definition 4.2).

Definition 4.2. A DFX annotated computation graph G(V, S, ADFX) is a
formal representation of the DFX implementation of a computation graph
G(V, S). ADFX is a tuple (n, p0, p1) of vectors n ∈ N^|S|, p0 ∈ Z^|S|, and
p1 ∈ Z^|S|, with each element in one-to-one correspondence with the elements
of S. Thus for each signal j ∈ S it is possible to refer to the corresponding
nj, pj0 and pj1, and the condition pj0 < pj1 applies. ¤
Figure 4.1: The graphical representation of a data-flow graph. (a) Nodes used in the data-flow graph: primary_in, primary_out, adder, full_mult, gain_mult, delay and fork; (b) example data-flow graph of a 4-tap direct form FIR filter.
4.3 Module Design Forethought and Criteria
Quantisation only at the end
First and foremost, the amount of error introduced should be as small as
possible while staying true to the DFX format specification. Any quantisation,
regardless of the operation and irrespective of the algorithm used, should be
applied only at the end of the module. This means there is no loss in
precision anywhere until the very last moment, which not only reduces the
output precision error but also makes the output error predictable,
simplifying error analysis.
The modules produce truncation errors, as rounding is not performed. Since
truncation is the least area-expensive method of quantisation [Fio98],
rounding has not been implemented for these modules. However, if rounding is
required, an extra addition block will need to be added before the final
output. The decision to round will depend on the various 'guard-bits' from
the inputs and/or result [ECC05]. Depending on the faithfulness of the
rounding performed, the number of guard-bits needed will vary.
Multiple wordlength and scaling
The RightSize tool's original function was to synthesize a fixed-point design
with multiple wordlengths for its signals. In keeping with the tool's original
intent, the DFX arithmetic modules are expected to deal not only with multiple
wordlengths but with multiple scalings as well. The arithmetic modules must
be fully parameterisable to allow an optimising routine to explore the whole
design space without hindrance.
Modularity
A modular design gives a designer the means to build up any system by
connecting the modules together (e.g. as a data-flow graph). The modular
building blocks also give the flexibility to extend DFX with further
arithmetic operations to complement the ones discussed in this chapter. Each
new module type needs a VHDL description, an area consumption model and an
error model to be incorporated into RightSize.
Integration with fixed-point
As DFX can readily interchange with fixed-point (p0 = p1), the module
libraries must be flexible enough to integrate and work alongside ordinary
fixed-point. The inputs and outputs should convert between DFX and ordinary
fixed-point without needing additional hardware in between. At the same time,
modules that encode and decode between DFX and ordinary fixed-point (the
environment interface) should be made as simple as possible to facilitate the
transition.
4.4 Building Blocks
The following modules are the building blocks used in most of the DFX mod-
ules. They are the Range-Detector, Encoder, Decoder and Recoder.
4.4.1 Range-Detector
Figure 4.2: DFX Range-Detector Module. The fixed-point input 〈nin, pin〉 produces the exponent bit E = 0 if −B ≤ Input < B, and E = 1 if Input < −B or Input ≥ B.
The function of the DFX Range-Detector block, shown in Figure 4.2, is to
determine which of the two ranges the tested signal lies in and to generate
the exponent bit, E. The input to this module is a fixed-point number with
format 〈nin, pin〉 (nin being the input wordlength and pin the position of its
binary point). This module is not a node type in RightSize as it is an
internal module used within other DFX modules.

The Range-Detector enforces the equations for DFX defined in (3.3) and
(3.4). The choice of range and boundary allows the operation to be simplified
to the logic operation given in (4.2), a simple sum-of-products expression. If
the tested input belongs to the Num0 range, all the bits above the boundary
will be 0s (for a positive input) or 1s (for a negative input), since the
input is a two's complement number. The bits of interest for detection are
shown in Figure 4.3. This operation can easily be performed by lookup tables
in FPGAs, or with AND and OR gates in an ASIC.
E̅ = ā_nin · ā_(nin−1) · … · ā_(nin−(pin−p0)) + a_nin · a_(nin−1) · … · a_(nin−(pin−p0))    (4.2)

where ā denotes the logical complement of bit a; that is, E = 0 exactly when the bits of interest are all 0s or all 1s.
Figure 4.3: Input bits the Range-Detector is interested in: the bits of input A at positions nin down to p0, i.e. the sign bit and all bits above the Num0 boundary.
4.4.2 DFX Encoder and Decoder
In order to utilise this number system, a method is needed to convert a number
from a known type to DFX. Currently, modules exist to encode to and decode
from two's complement ordinary fixed-point. Typically, the DFX Encoder and
Decoder are used in conjunction with the primary in and primary out nodes
respectively, when transitions between the fixed-point environment and DFX
are required. Apart from that, these modules are also used by DFX Adder v1
for the pre/post-adder re-scalings in Section 4.5.1, and the Encoder is also
used within the DFX Recoder. The VHDL interface entities for these modules
are given below.
ENTITY dfx_Encoder IS
GENERIC( In_n, In_p : INTEGER;
Out_n, Out_p0, Out_p1 : INTEGER );
PORT( dIn : IN std_logic_vector(In_n downto 0);
dOut : OUT std_logic_vector(Out_n+1 downto 0) );
END ENTITY dfx_Encoder;
ENTITY dfx_Decoder IS
GENERIC( In_n, In_p0, In_p1 : INTEGER;
Out_n, Out_p : INTEGER );
PORT( dIn : IN std_logic_vector(In_n+1 downto 0);
dOut : OUT std_logic_vector(Out_n downto 0) );
END ENTITY dfx_Decoder;
Figure 4.4: DFX Encoder block to convert from fixed-point to DFX. (a) Top-level diagram; (b) flow of data through the Encoder. The input is a fixed-point number with wordlength nin and binary point pin and the output is a DFX 〈n, p0, p1〉.
The DFX Encoder (shown in Figure 4.4) takes a fixed-point input with
wordlength nin and binary point pin and converts it to a DFX 〈n, p0, p1〉
output. It first feeds the input into a Range-Detector that determines the
appropriate output range and the output exponent bit; the multiplexor then
chooses the properly scaled signal for the output. Note that ">>" and "<<"
are the binary right and left shift operators respectively, which incur no
hardware cost as they are a matter of wiring, and "mod 2^(n+1)" simply
extracts the least significant (n + 1) bits of the signal.
Figure 4.5 shows the DFX Decoder block. It takes a DFX 〈n, p0, p1〉 input and
outputs a fixed-point number with wordlength (n + (p1 − p0)) and binary point
p1. The input is shifted and scaled according to the exponent bit. Again, the
shifting incurs no hardware cost; the main cost of the decoder is the
multiplexor.
Figure 4.5: DFX Decoder block to convert from DFX to fixed-point. (a) Top-level diagram; (b) flow of data through the Decoder. The input is a DFX 〈n, p0, p1〉 and the output is a fixed-point number with wordlength (n + (p1 − p0)) and binary point p1.
4.4.3 DFX Recoder
Converting a DFX number from one format to another is done by the DFX
Recoder. In a typical fixed-point case, no additional hardware is required
when a signal is changed from one fixed-point format to another (assuming
truncation), as any shifting is done through wiring. In DFX, when the input
and output DFX boundaries differ, a little care is required when performing
the conversion. Figure 4.6 shows the complete DFX Recoder, which consists of
two DFX Encoders and a MUX before the output. The input's Num0 and Num1
ranges can be treated separately as individual fixed-point numbers, so the
two Encoders are aligned to the input's Num0 and Num1 ranges respectively,
with their DFX outputs as 〈nout, pout0, pout1〉. The MUX at the output then
chooses the correct result depending on the input's exponent bit.
The VHDL interface entity for the DFX Recoder is given below:
ENTITY dfx_Recoder IS
GENERIC( In_n, In_p0, In_p1 : INTEGER;
Out_n, Out_p0, Out_p1 : INTEGER );
PORT( dIn : IN std_logic_vector(In_n+1 downto 0);
dOut : OUT std_logic_vector(Out_n+1 downto 0) );
END ENTITY dfx_Recoder;
Figure 4.6: DFX Recoder module with the flow of data through the module. The DFX input 〈nin, pin0, pin1〉 feeds two DFX Encoders, one per input range, and the exponent bit Ein selects the DFX output 〈nout, pout0, pout1〉.
The Recoder shown in Figure 4.6 is the complete version: it works for both
properly and improperly scaled DFX numbers, and the resulting DFX signal will
be properly scaled. Under normal circumstances, where DFX numbers are always
properly scaled, the Recoder can be simplified. Referring to Figure 4.7, the
simplified implementation depends on the DFX boundaries of the input and
output signals. If the input and output boundaries are the same, i.e.
Bin = Bout, no hardware is needed apart from wiring (Figure 4.7(a)).

If the output's boundary is less than the input's boundary, i.e. Bout < Bin,
every input Num1 will become an output Num1 provided the input is properly
scaled. Therefore we can simply feed the input to the output MUX when the
input is a Num1, after performing any necessary shifting. This leaves the
case where the input is a Num0, which can become either an output Num0 or
Num1 and is therefore handled by a DFX Encoder. Referring to Figure 4.7(b),
Figure 4.7: DFX Recoder blocks to convert between two different properly scaled DFX numbers, with the flow of data through the Recoder shown beneath each block. (a) When the DFX output boundary equals the DFX input boundary; (b) when the DFX output boundary is less than the DFX input boundary; (c) when the DFX output boundary is greater than the DFX input boundary.
we can see that this simplification results in one fewer DFX Encoder compared
to the complete Recoder.

Similarly, when the output's boundary is greater than the input's boundary,
i.e. Bout > Bin, every input Num0 will become an output Num0 provided the
input is properly scaled. This time, when the input is a Num0, it is fed
directly to the output MUX after performing the necessary shifting. The DFX
Encoder used in this case is tuned to the input's Num1 range. Figure 4.7(c)
shows this implementation.
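Behaviourally, recoding a properly scaled DFX value amounts to re-quantising it under the new format's boundary; the hardware achieves this with shifts, an Encoder and a MUX rather than arithmetic. A hypothetical Python model:

```python
import math

def dfx_recode(E, m, fin, fout):
    """Decode (E, m) under format fin = (n, p0, p1), then re-encode it under
    format fout -- a behavioural model only; real hardware uses no division."""
    n, p0, p1 = fin
    x = m * 2.0 ** ((p0 if E == 0 else p1) - n)
    n, p0, p1 = fout
    E_out = 0 if -(2.0 ** p0) <= x < 2.0 ** p0 else 1
    return E_out, math.floor(x / 2.0 ** ((p0 if E_out == 0 else p1) - n))

# Raising the boundary from 2**0 to 2**1: the Num1 value 1.5 under <14,0,5>
# becomes a Num0 under <14,1,5>, as described for the Bout > Bin case.
print(dfx_recode(1, 768, (14, 0, 5), (14, 1, 5)))   # (0, 12288)
```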
As RightSize uses multiple DFX wordlengths and scalings for its signals,
there is a need to recode any format changes between signals and after
arithmetic operations. The DFX Recoder is used after the node types
gain mult, full mult, fork and delay. Use of the Recoder for gain mult and
full mult will be explained in their respective sections.
In fixed-point systems, the fork and delay modules normally act as
'pass-throughs', with any shifting of the signals at the output dealt with by
careful wiring; this is not the case for DFX. Figure 4.8(a) shows an example
scenario with different input and output DFX formats for fork and delay, and
Figure 4.8(b) shows their actual implementation with the DFX Recoders
explicitly drawn. The fork example has three outputs X, Y and Z, where Y and
Z share the same boundary (i.e. BY = BZ); since they share the same boundary,
only one Recoder is needed for the two outputs. It should be noted that the
range of the input and outputs of a fork should be the same, hence in this
example pX1 = pY1 = pZ1 = pA1. The case for the delay node should be
self-explanatory.
Figure 4.8: DFX Recoder used with fork and delay. (a) fork and delay example; (b) the same example with the DFX Recoder "R" explicitly shown, where BY = BZ.
4.4.4 Area and Critical Path Delay Tables
Table 4.2 shows the area and critical path delay of the Encoder, Decoder and
Recoder modules. It can be seen that they do not take many resources to
perform the conversions to and from fixed-point (∼0.16% of FPGA chip area).
The most expensive is the Recoder, because of its Range-Detector and output
MUX (∼0.20% of FPGA chip area). The table does not include the
Range-Detector's area and critical path delay because it is never used by
itself and is always integrated into another module, for example the Encoder
and Recoder.
Table 4.2: Building block areas and critical path delays (CPD).

                                        Area and (CPD)
          DFX Parameters         Virtex4      Stratix2      ASIC
Modules                          LUTs (ns)    ALUTs (ns)    Cells (ns)
Encoder   〈19, 5〉→〈14, 0, 5〉     20 (2.535)   15 (2.000 ¦)  228 (1.233)
Decoder   〈14, 0, 5〉→〈19, 5〉     19 (1.827)   19 (2.000 ¦)  213 (0.974)
Recoder   〈14, 0, 5〉→〈14, 1, 5〉  28 (2.731)   16 (2.013)    304 (2.042)

¦ - Speed limited by place and route tool
4.5 DFX Adders
The DFX Adder module performs addition between two DFX numbers, and this
module implements the adder node in RightSize. DFX inputs with different
scalings need to be aligned to a common scaling before the addition can be
performed; this is done in the pre-adder stage. Once the inputs are aligned,
the addition is performed by an ordinary fixed-point adder. If the adder's
output is in DFX, the summation output has to be aligned by the post-adder
stage. The pre- and post-alignment stages are akin to those of a
floating-point adder [Kor02], but unlike floating-point, the number of bits
to shift is known a priori and only multiplexors are used to perform the
necessary shifting instead of expensive barrel shifters. As a result, the DFX
Adder is both smaller and faster than an equivalent floating-point adder.
There are two versions of the adder: DFX Adder Version I (v1) and DFX Adder
Version II (v2). The difference between the two is in the way they perform
pre-addition and post-addition alignment, which may result in different area
consumptions. Which adder gives the lower area consumption depends on its
input and output wordlength and scaling parameters; RightSize chooses the
adder that consumes the least area for the synthesized output. The VHDL
interface entity for Version I is shown below; Version II has an identical
interface.
ENTITY dfx_Adder_v1 IS
GENERIC ( a_n, a_p0, a_p1, b_n, b_p0, b_p1 : INTEGER;
s_n, s_p0, s_p1: INTEGER );
PORT( A : IN std_logic_vector( a_n+1 downto 0);
B : IN std_logic_vector( b_n+1 downto 0);
S : OUT std_logic_vector( s_n+1 downto 0) );
END ENTITY dfx_Adder_v1;
4.5.1 DFX Adder Version I (V1)
Figure 4.9: DFX Adder Version I (v1). (a) Top-level diagram: the Pre-Adder decodes both DFX inputs A and B to a common fixed-point format 〈nfx, pfx〉, a fixed-point adder computes the sum, and the Post-Adder encodes the result back to DFX 〈nS, pS0, pS1〉. (b) Example dataflow diagram.
As shown in the DFX Adder Version I design and example dataflow diagram of
Figure 4.9, the Pre-Adder consists of a DFX Decoder for each input,
converting to an ordinary fixed-point format 〈nfx, pfx〉. The fixed-point
wordlength, nfx, and scaling, pfx, are given by

nfx = max(nA + (pA1 − pA0), nB + (pB1 − pB0))
pfx = max(pA1, pB1)

where max() returns the larger of its two operands.
With the inputs aligned, the summation is done by an ordinary fixed-point
adder. To ensure no overflow of results, the output of the fixed-point adder
is one bit wider than its inputs; the fixed-point adder can therefore be at
most (nfx + 2) bits wide, and its inputs may need to be sign-extended.
Sign-extension requires no additional hardware apart from wire routing. After
summation, the output is fed into a DFX Encoder to produce a DFX output as
necessary.

DFX Adder v1 is the simpler of the two DFX Adders, design-wise. It has
better area consumption than Version II when the cost of the multiplexors is
far greater than the cost of the fixed-point adder. It suffers when the
difference between p1 and p0 gets large, because the ordinary fixed-point
adder has to increase in width, which increases both the area consumption and
the critical path delay. For the size and delay comparison of this adder
block and others, refer to the end of this section.
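The internal fixed-point format of Adder v1, and its sensitivity to the p1 − p0 gap, can be illustrated with a short sketch; `v1_fixed_point_format` is an illustrative name, not a RightSize function.

```python
def v1_fixed_point_format(a, b):
    """Common fixed-point format <nfx, pfx> used inside DFX Adder v1,
    for inputs a = (nA, pA0, pA1) and b = (nB, pB0, pB1)."""
    (nA, pA0, pA1), (nB, pB0, pB1) = a, b
    nfx = max(nA + (pA1 - pA0), nB + (pB1 - pB0))
    pfx = max(pA1, pB1)
    return nfx, pfx

# Widening p1 - p0 directly inflates the internal adder width:
print(v1_fixed_point_format((14, 0, 5), (14, 0, 5)))    # (19, 5)
print(v1_fixed_point_format((14, 0, 12), (14, 0, 5)))   # (26, 12)
```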
4.5.2 DFX Adder Version II (V2)
Figure 4.10 shows the top level and an example dataflow of DFX Adder
Version II, and Figure 4.11 shows its Pre- and Post-Adders. At first glance,
this version seems more complicated. However, for certain input and output
parameters, Version II has an area and speed advantage over Version I. To
achieve this, some design choices and limitations are imposed on this adder.
Firstly, the adder’s inputs, Afx and Bfx, are scaled to either of the two DFX
ranges of output S, i.e. SNum0 (〈nS, pS0〉) or SNum1 (〈nS, pS1〉). Whenever both
input exponents are zero, the inputs are aligned to SNum0. For all other input
exponent combinations, the inputs are aligned SNum1. Table 4.3 summarises
the input alignment combinations. These alignment rules mean that each input
will need only be aligned in three different ways using multiplexors (as seen
Figure 4.10: DFX Adder Version II (v2). (a) Top-level diagram: a Pre-adder block aligns the DFX inputs A and B, a fixed-point adder computes the sum, and a Post-adder block produces the DFX output S, with gBits carrying the (pS1 − pS0) bits shifted out during alignment. (b) Example dataflow diagram.
88
[Figure 4.11: DFX Adder (v2) (a) Pre-Adder and (b) Post-Adder diagrams.]
in Figure 4.11(a)). The 〈nx, px〉 → 〈ny, py〉 operation involves shifting and/or
truncating the signal from 〈nx, px〉 to 〈ny, py〉. Whenever the input exponents
are different, gBits retains the input bits that are shifted out, ensuring no loss of precision if the addition results in a SNum0.
Table 4.3: Scaling of the inputs before the fixed-point adder.

Aexp  Bexp | Afx scaling  Bfx scaling
  0     0  |   SNum0        SNum0
  0     1  |   SNum1        SNum1
  1     0  |   SNum1        SNum1
  1     1  |   SNum1        SNum1
Secondly, the output pS0 is chosen so that when both inputs are in their Num0 ranges, the output will always be a SNum0 (refer to the dataflow diagram in Figure 4.10(b) for clarification). Therefore, when both inputs are in their Num0 ranges, the summation result can be passed straight to the output, and post-adder alignment is only needed when the inputs have different scalings. By letting pS0 = max(pA0, pB0) + 1, the sum is guaranteed to be a SNum0 when both inputs are in their Num0 ranges, but experimental results have shown that this can be a very conservative solution. Provided that we have knowledge of the adder’s inputs, we can obtain a more suitable pS0. Section 6.3 details this method, which uses the joint probability distribution statistics of the inputs.
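The worst-case guarantee behind pS0 = max(pA0, pB0) + 1 can be checked numerically. The sketch below is illustrative Python, not part of the thesis toolflow, and assumes (as a simplification) that a Num0 value with scaling p0 has magnitude below the boundary 2^p0:

```python
import itertools

def num0_bound(p0):
    # Assumed boundary of a Num0 range with scaling p0: |x| < 2**p0.
    return 2.0 ** p0

def sum_fits_num0(pA0, pB0, pS0):
    # Worst-case |a + b| over a and b anywhere in their Num0 ranges.
    return num0_bound(pA0) + num0_bound(pB0) <= num0_bound(pS0)

# pS0 = max(pA0, pB0) + 1 always covers the worst case, since
# 2**pA0 + 2**pB0 <= 2 * 2**max(pA0, pB0) = 2**(max(pA0, pB0) + 1),
# while one bit less never does (the smaller term always overflows it).
for pA0, pB0 in itertools.product(range(-6, 7), repeat=2):
    assert sum_fits_num0(pA0, pB0, max(pA0, pB0) + 1)
    assert not sum_fits_num0(pA0, pB0, max(pA0, pB0))
```

The rule is conservative precisely because it assumes both operands can simultaneously sit at their range maxima; the profiled joint statistics of Section 6.3 relax this.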
Without these conditions, the post-adder block would have to handle 4 different scalings (2 for each input) and up to 8 different output rescalings (the output has 2 different scalings). The restrictions placed on this version of the adder simplify the Post-Adder block, making it fairly similar to the DFX Encoder.
Table 4.4: Area and critical path delay tables for 16-bit adder comparisons.

Adder area:
Format  Params      Virtex4 LUTs (Slices)  Stratix2 ALUTs (ALMs)  ASIC Cells
DFX v1  〈14, 0, 5〉   56 (44)                59 (34)                1223
DFX v2  〈14, 0, 5〉   55 (44)                54 (35)                1089
FX      〈15, 0〉      16 (8)                 16 (8)                 449
FP      E:4 M:11     271 (144)              269 (143)              5049
LNS     I:4 F:11     2127 (1134)            998                    23490

Adder critical path delay (ns):
Format  Params      Virtex4  Stratix2  ASIC
DFX v1  〈14, 0, 5〉   6.524    5.200     8.721
DFX v2  〈14, 0, 5〉   6.588    4.476     7.718
FX      〈15, 0〉      3.026    2.000 ¦   2.390
FP      E:4 M:11     20.733   13.923    13.607
LNS     I:4 F:11     19.277   5.711     18.842

¦ Speed limited by place and route tool
4.5.3 Area and Critical Path Delay Tables
Table 4.4 shows the comparison between the two versions of DFX Adders
and other popular number formats all designed to be 16-bit wide and with
similar dynamic range. As expected, the DFX Adder’s hardware simplicity relative to floating-point means that its area consumption and critical path delay lie between those of fixed-point and floating-point. DFX Adders for FPGAs (Virtex4 and Stratix2) are about 3.5 times larger and 2.0 times slower than equivalent fixed-point adders. In contrast, they are nearly 5.0 times smaller and 3.0 times faster than equivalent floating-point adders. For ASIC, DFX Adders are about 2.5 times larger and 3.5 times slower than fixed-point, while being about 4.5 times smaller and 2.0 times faster than floating-point. The LNS addition was performed using approximations from look-up tables (ROM), and hence more hardware area is dedicated to the additional memory. Because of this, the LNS adder consumes the largest area overall.
In this example implementation, the DFX Adder Version 2 has a slight advantage over Version 1 in terms of both area and critical path delay. This is mainly due to the large difference between p1 and p0, which makes increasing the length of the ordinary fixed-point adder in Version 1 more expensive than providing more multiplexors in Version 2.
It can be seen that the latencies of the FPGA designs are shorter than those of the ASIC. This is mostly because both FPGA families are fabricated on a 90nm process, while the ASIC is synthesized using a 130nm process library.
4.6 DFX Multipliers
Unlike the DFX Adder, DFX Multipliers do not need any pre-alignment stage for their inputs. The inputs can be multiplied together by a fixed-point multiplier regardless of their scalings, and the product from the fixed-point multiplier is fed through a DFX Recoder before being sent to the output. As with the DFX Adder, the multiplier can take any wordlength and precision parameters, even if they are fixed-point.
The FPGA multipliers are implemented with logic blocks only, and not with any embedded multiplier/DSP blocks, to provide a fair comparison alongside the ASIC implementations. These results remain representative, as designers may have limited resources on chip or may be working on devices without embedded multipliers, such as low-cost FPGAs [Xila], [Alta] or CPLDs.
Two types of multipliers are made: a gain multiplier and a full multiplier.
[Figure 4.12: DFX Gain Multiplier. (a) Top Level diagram. (b) Example dataflow.]
4.6.1 DFX Gain Multiplier
The DFX Gain Multiplier (Figure 4.12) takes a DFX input and multiplies it by a constant fixed-point coefficient. This is particularly useful in applications such as filtering, where one of the operands is a constant. This module is used by the gain mult node in RightSize, and its VHDL interface entity is given below. The wordlength and scaling of the multiplier coefficient should be optimised so as not to have any unnecessary leading sign bits or trailing zeros.
ENTITY dfx_Mult_Gain IS
GENERIC ( a_n, a_p0, a_p1, q_n, q_p0, q_p1 : INTEGER;
mult : REAL; mult_n, mult_p : INTEGER );
PORT ( A : in STD_LOGIC_VECTOR( a_n+1 downto 0);
Q : out STD_LOGIC_VECTOR( q_n+1 downto 0) );
END ENTITY dfx_Mult_Gain;
Unlike the DFX Adder, the inputs to the multiplier do not need aligning and can be multiplied together with an ordinary fixed-point multiplier. Since the input is a DFX number, the multiplier product Prod will also be a DFX number. Therefore Prod has to be recoded to the desired output DFX format. Consider the multiplication of a DFX 〈nA, pA0, pA1〉 number with a fixed-point 〈nm, pm〉 constant (nm is the wordlength of the constant multiplier m and pm is its fractional length). The constant multiplier should be optimally scaled, i.e. pm = ⌊log2(m)⌋ + 1. After multiplication, the product will be in an intermediate format 〈n∗, p∗0, p∗1〉, where n∗ = nA + nm, p∗0 = pA0 + pm and p∗1 = pA1 + pm. This DFX signal may be improperly scaled, and a full DFX Recoder would then be needed to convert the output to a properly scaled DFX number. However, by placing some restrictions on the design, the full DFX Recoder is not required. Consider the case m < 1.0: if we can ensure that the product of ANum0 with the coefficient always results in an output QNum0, the Recoder needed is the one shown in Figure 4.7(c), which requires less hardware area. This can be achieved by making sure that the output’s pQ0 ≤ pA0 + pm, provided that the multiplier coefficient is properly optimised. Similarly, the same area reduction is obtained when m > 1.0 by requiring the output’s pQ0 ≥ pA0 + pm, which ensures that the product of ANum1 with the coefficient always results in an output QNum1; the Recoder needed is then the one shown in Figure 4.7(b). These limitations do not compromise the operation of the circuit, and the RightSize tool imposes them to reduce the area consumption (Section 6.3).
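The scaling bookkeeping behind these restrictions can be sketched as a small model. This is illustrative Python, not the thesis toolflow; `coeff_scaling` implements the optimal coefficient scaling pm = ⌊log2(m)⌋ + 1 quoted above, and `full_recoder_needed` encodes the two output-scaling conditions:

```python
import math

def coeff_scaling(m):
    # Optimal scaling of the constant coefficient: pm = floor(log2(|m|)) + 1.
    return math.floor(math.log2(abs(m))) + 1

def product_format(nA, pA0, pA1, nm, m):
    # Intermediate product format: n* = nA + nm, p*0 = pA0 + pm, p*1 = pA1 + pm.
    pm = coeff_scaling(m)
    return nA + nm, pA0 + pm, pA1 + pm

def full_recoder_needed(m, pA0, pQ0):
    # The cheap Recoder suffices when pQ0 <= pA0 + pm (for m < 1.0)
    # or pQ0 >= pA0 + pm (for m > 1.0); otherwise a full Recoder is required.
    pm = coeff_scaling(m)
    if abs(m) < 1.0:
        return not (pQ0 <= pA0 + pm)
    return not (pQ0 >= pA0 + pm)
```

For example, m = 0.3 gives pm = −1, so with pA0 = 0 any output with pQ0 ≤ −1 avoids the full Recoder.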
4.6.2 DFX Full Multiplier
[Figure 4.13: DFX Full Multiplier. (a) Top-level diagram. (b) Example dataflow.]
A DFX Full Multiplier (shown in Figure 4.13(a)) is fairly similar to the gain multiplier, but it multiplies two DFX numbers together. The full mult node of the RightSize tool uses this module, and its VHDL interface entity is given below.
ENTITY dfx_Mult_full IS
GENERIC ( a_n, a_p0, a_p1, b_n, b_p0, b_p1 : INTEGER;
q_n, q_p0, q_p1: INTEGER );
PORT( A : IN std_logic_vector( a_n+1 downto 0);
B : IN std_logic_vector( b_n+1 downto 0);
Q : OUT std_logic_vector( q_n+1 downto 0) );
END ENTITY dfx_Mult_full;
Because both inputs to this multiplier can be DFX, the intermediate product can have up to four different scalings, pI, pII, pIII and pIV, with wordlength n∗ (i.e. a quad fixed-point), where n∗ = nA + nB, pI = pA0 + pB0, pII = pA0 + pB1, pIII = pA1 + pB0 and pIV = pA1 + pB1. If we were to keep the number in this form, further computations down the line would suffer from great complexity; that discussion is beyond the scope of this thesis and has not been explored. Since there are 4 different scalings for the intermediate result, and each may need to be recoded to either of the 2 output scalings, the post-multiplier block may have to handle up to a maximum of 8 different output shifts to recode the output to a properly scaled DFX signal. This operation needs two Recoders and a multiplexor.
Referring to Figure 4.13(a), Recoder 1 assumes that its input (the intermediate result Prod) is a 〈n∗, pI, pII〉 DFX number, while Recoder 2 assumes a 〈n∗, pIII, pIV〉 DFX number. For Recoder 1, the difference between pI and pII is the addition of either pB0 or pB1, and the same holds for Recoder 2’s pIII and pIV. Hence the exponent bit of input B is used as the exponent bit of the intermediate result Prod. The outputs of both Recoders are 〈nQ, pQ0, pQ1〉 numbers, but their inputs differ by either pA0 or pA1. Therefore the select signal for the output MUX is input A’s exponent bit.
The scenario described above is the extreme case, where the post-multiplier block has to cope with 8 different output shifts. As we shall see in Example 4.1, the output of the multiplier may be simplified to only a single Range-Detector and a MUX.
Example 4.1. Consider a full multiplier with the same input and output format 〈n, 0, k〉, where k ∈ Z+. With p0 = 0, the maximum magnitude of the Num0 range is ≤ 1 and the Num1 range is > 1. Referring to the example dataflow in Figure 4.13(b), when both inputs are Num1, the output can never become a Num0 and always shifts right by k bits to a Num1. Similarly, when both inputs are Num0, the output is always a Num0 and no shift is required. When the inputs differ in their scalings, the output will be either a Num0 (shift right by k bits) or a Num1 (no shift required), depending on the value of the product Prod. Therefore, when the input exponents are different, only a k-bit right shift is needed, and a single MUX is capable of doing this together with the help of a Range-Detector to determine the correct range. □
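The shift selection of Example 4.1 can be captured in a few lines. This is an illustrative behavioural model in Python, not the VHDL module; the intermediate-scaling bookkeeping is inferred from the example, and the Range-Detector is modelled with a simple magnitude test against 1:

```python
def full_mult_shift(a_exp, b_exp, prod, k):
    # Right shift (in bits) applied after the fixed-point multiply for a
    # full multiplier whose input and output formats are all <n, 0, k>.
    # Each Num1 input contributes k to the intermediate scaling.
    inter = (k if a_exp else 0) + (k if b_exp else 0)
    if a_exp and b_exp:
        out = k                            # both Num1: output is always a Num1
    elif not (a_exp or b_exp):
        out = 0                            # both Num0: output is always a Num0
    else:
        out = k if abs(prod) > 1.0 else 0  # mixed: Range-Detector decides
    return inter - out                     # shift right by the scaling reduction

assert full_mult_shift(1, 1, 4.0, 5) == 5   # Num1 x Num1: always shift by k
assert full_mult_shift(0, 0, 0.25, 5) == 0  # Num0 x Num0: never shift
assert full_mult_shift(1, 0, 2.0, 5) == 0   # mixed, product > 1: stays Num1
assert full_mult_shift(1, 0, 0.5, 5) == 5   # mixed, product <= 1: k-bit shift
```

Note that in the mixed cases the shift is either 0 or k, which is why one MUX plus a Range-Detector suffices.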
4.6.3 Area and Critical Path Delay Tables
As seen in Table 4.5, the gain multipliers for DFX, fixed-point and floating-point are roughly the same size on all hardware platforms when synthesized. Fixed-point is clearly the quickest of the three, with floating-point being the slowest. The full multiplier areas (Table 4.6), on the other hand, show that fixed-point is the largest, followed by DFX. This is because the mantissa of floating-point is only 11 bits wide, compared to the 16 bits of fixed-point. However, the post-multiplication normalisation of floating-point and the recoding of DFX mean that the fixed-point full multiplier is still the quickest of the three. LNS multiplication is a fairly simple process, as discussed in the background, hence the small area requirement for the LNS gain and full multipliers.
Generally, the DFX multipliers have an area consumption and critical path delay between those of fixed-point and floating-point [ECC04]. It should
Table 4.5: Area and critical path delay (CPD) tables for gain multiplier comparisons.

Gain Multiplier area:
Format  Params      Virtex4 LUTs (Slices)  Stratix2 ALUTs (ALMs)  ASIC Cells
DFX     〈14, 0, 5〉   113 (61)               127 (67)               2561
FX      〈15, 0〉      98 (58)                113 (57)               2551
FP      E:4 M:11     105 (63)               97 (49)                2627
LNS     I:4 F:11     18 (9)                 18 (9)                 292

Gain Multiplier CPD (ns):
Format  Params      Virtex4  Stratix2  ASIC
DFX     〈14, 0, 5〉   10.049   7.010     10.075
FX      〈15, 0〉      7.527    2.000 ¦   8.740
FP      E:4 M:11     11.362   9.370     10.881
LNS     I:4 F:11     3.718    3.201     2.483

¦ Speed limited by place and route tool
Table 4.6: Area and critical path delay (CPD) tables for full multiplier comparisons.

Full Multiplier area:
Format  Params      Virtex4 LUTs (Slices)  Stratix2 ALUTs (ALMs)  ASIC Cells
DFX     〈14, 0, 5〉   256 (140)              265 (139)              7500
FX      〈15, 0〉      257 (114)              279 (140)              8077
FP      E:4 M:11     187 (115)              195 (104)              5312
LNS     I:4 F:11     22 (12)                22 (12)                553

Full Multiplier CPD (ns):
Format  Params      Virtex4  Stratix2  ASIC
DFX     〈14, 0, 5〉   7.453    8.333     11.861
FX      〈15, 0〉      6.785    2.000 ¦   9.325
FP      E:4 M:11     15.343   11.287    12.146
LNS     I:4 F:11     3.954    3.510     5.616

¦ Speed limited by place and route tool
be noted that the multipliers synthesized here on FPGAs are limited to using look-up tables (LUTs), although most modern FPGAs have dedicated built-in multiplier blocks.
4.7 Discussion and Further Comparisons
The area and critical path delay tables shown in the individual module sections earlier were all 16-bit implementations. They provide a glimpse of the differences between the number representations. Figure 4.14 shows the place-and-routed resource usage against varying wordlengths for the arithmetic modules described, for the ASIC implementation. This plot is representative of the general shape of the plots obtained for all other platforms. Parameters for the fixed-point, floating-point and logarithmic number systems were chosen to match the dynamic range of DFX as closely as possible (Table 4.7). In addition, the wordlengths of the floating-point and LNS designs are matched with their DFX equivalents. The LNS formats tested are limited by the LNS library used [CDdD06], but the limited data points are sufficient to show the general trend.
In the case of adders, the area of the DFX Adders lies between the fixed-point and floating-point implementations. We can see a large difference between the floating-point and fixed-point adders; DFX, on the other hand, mirrors fixed-point with just a slight added overhead. The LNS adders are way off the chart, reaching about 50k cells at a wordlength of 18 bits, which is not surprising. Addition in fixed-point is definitely the cheapest in terms of hardware area.
For multipliers, the trends are reversed. In order to maintain the same level of dynamic range as DFX, the fixed-point multipliers are now the most expensive to implement. DFX maintains its position between fixed-point and
[Figure 4.14: Module comparisons with similar dynamic range implemented in ASIC: (a) Adders, (b) Gain multipliers, (c) Full multipliers. The parameters used for each number representation are shown in Table 4.7.]
Table 4.7: The parameters used to generate the results of Figure 4.14. The dynamic range (DR) is given in dB.

        DFX               FX            FP            LNS
Design  n   p0  p1   DR | n   p   DR | E  M   DR  | I  F   DR
  1      9  0   5    84 | 14  0   84 | 4   6   90 | 4   6   96
  2     10  0   5    90 | 15  0   90 | 4   7   90 | 4   7   96
  3     11  0   5    96 | 16  0   96 | 4   8   90 | 4   8   96
  4     12  0   5   102 | 17  0  102 | 4   9   90 | 4   9   96
  5     13  0   5   108 | 18  0  108 | 4  10   90 | 4  10   96
  6     14  0   5   114 | 19  0  114 | 4  11   90 | 4  11   96
  7     15  0   5   120 | 20  0  120 | 4  12   90 | 4  12   96
  8     16  0   5   126 | 21  0  126 | 4  13   90 | 4  13   96
  9     17  0   5   132 | 22  0  132 | 4  14   90 | -   -   -
 10     18  0   5   138 | 23  0  138 | 5  14  187 | -   -   -
 11     19  0   5   144 | 24  0  144 | 5  15  187 | -   -   -
 12     20  0   5   151 | 25  0  151 | 5  16  187 | -   -   -
floating-point. As expected from LNS, its multipliers are cheapest in terms of
hardware area.
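The DR column values in Table 4.7 are consistent with a simple octave count at 20·log10(2) ≈ 6.02 dB per bit. The Python sketch below is a hedged reconstruction of that bookkeeping (the thesis's exact DR definitions are not reproduced here, so the n + p1 − p0 octave count for DFX is an assumption that happens to match the tabulated values):

```python
import math

def dr_db(octaves):
    # Dynamic range of a 2**octaves max/min magnitude ratio, in dB.
    return 20.0 * math.log10(2.0 ** octaves)

def dfx_dr(n, p0, p1):
    # Assumption: DFX reaches p1 - p0 octaves beyond fixed-point of width n.
    return dr_db(n + p1 - p0)

def fx_dr(n, p):
    # Assumption: fixed-point dynamic range depends only on the wordlength n.
    return dr_db(n)

# Spot-check a few rows of Table 4.7.
assert round(dfx_dr(9, 0, 5)) == 84 and round(fx_dr(14, 0)) == 84    # Design 1
assert round(dfx_dr(14, 0, 5)) == 114 and round(fx_dr(19, 0)) == 114  # Design 6
```

This explains why the FX wordlength in each row is the DFX wordlength plus the p1 − p0 = 5 scaling separation.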
Figure 4.15 shows the area of DFX modules synthesized onto ASIC. Their wordlengths n and lower scaling parameters p0 are varied while the upper scaling is kept constant at p1 = 8. We can see that if the wordlength is kept constant, changing from fixed-point to DFX always incurs added hardware cost. Results show similar trends when implemented on FPGAs.
[Figure 4.15: Area of DFX modules implemented in ASIC with p1 = 8. (a) DFX Adder V1. (b) DFX Adder V2. (c) DFX Gain-Multiplier. (d) DFX Full-Multiplier.]
Looking at the results from a different point of view, we now compare DFX arithmetic modules with fixed-point arithmetic modules while keeping the dynamic range consistent at ∼90dB throughout. Fixed-point is chosen for comparison since the DSP community tends to prefer building DSP solutions around a fixed-point number representation [Smi97] for its superior area and speed performance. The first line of Table 4.8 shows the fixed-point 〈15, 0〉 implementation in ASIC of each of the modules, and the lines below it are DFX formats with the same dynamic range. The area results from the table are plotted relative to fixed-point in Figure 4.16. Once again, results show similar trends when implemented on FPGAs.
Given the small number of DFX formats examined, the DFX Adders are
Table 4.8: Comparing fixed-point and DFX arithmetic module implementations on ASIC (# of cells) with the dynamic range fixed at ∼90dB. Fixed-point is the first result line, where p1 = p0 = 0.
DFX Format Adder v1 Adder v2 Gain Mult Full Mult
〈15, 0, 0〉 453 454 2497 8082
〈14,−1, 0〉 1011 988 2514 7574
〈13,−2, 0〉 997 949 2037 6630
〈12,−3, 0〉 985 897 1865 5721
〈11,−4, 0〉 975 847 1692 4845
〈10,−5, 0〉 957 805 1298 4088
〈9,−6, 0〉 945 759 1199 3396
〈8,−7, 0〉 926 705 855 2737
always more expensive than equivalent fixed-point adders. The area saved by reducing the wordlength is not sufficient to cover the added overhead of the DFX pre- and post-alignment. However, the opposite holds for the multipliers: the area reduction of the fixed-point multiplier is more than sufficient to cover the added area of the post-multiplier blocks needed by DFX.
We can further deduce from Figure 4.16 that optimising a design with DFX Adders to reduce area may lead to a non-convex search space (one with more than one minimum point). Keeping the dynamic range constant while increasing the separation between p0 and p1, we can see the Adder’s size increase before gradually decreasing.
[Figure 4.16: Sizes of DFX arithmetic modules (Adder v1, Adder v2, Gain Multiplier, Full Multiplier) relative to their fixed-point equivalents (DFX Area / FX Area).]
4.8 Summary
This chapter introduced the basic arithmetic modules built for DFX in VHDL. At present, these fully parameterisable modules are incorporated into the RightSize tool, and they can be used in any hardware design flow that accepts VHDL library files. Apart from being fully parameterisable, attention has been given to ensuring that any truncation errors will only occur at the end of a module, to reduce the overall output error.
Being fully parameterisable also means that mixing DFX with ordinary fixed-point is done by merely equating the p0 and p1 parameters of the inputs and outputs. The interfacing modules (encoder and decoder) between DFX and ordinary fixed-point are built from the basic building blocks mentioned.
Although all the designs shown perform truncation as the means of quantising the output, rounding can easily be incorporated at the cost of additional hardware [ECC05]. Similarly, the DFX modules are not limited to the ones shown, and other arithmetic operations can be added without difficulty. Typically, the arithmetic operation is performed with an ordinary fixed-point equivalent operator, with the required input and output scalings applied as necessary.
The arithmetic modules were compared with the arithmetic modules of other conventional number representations. DFX arithmetic modules were found to have an area consumption and speed between those of fixed-point and floating-point arithmetic modules. Hence, as we will see in Chapter 6, provided that the parameters are chosen well, the area consumption of DFX can be less than that of fixed-point.
As the majority of application-specific DSP solutions rely on fixed-point, the rest of this work concentrates solely on improving upon fixed-point implementations. From the comparison made between fixed-point and DFX, it can be seen that DFX has an advantage where multiplication is concerned and a certain dynamic range is required.
Chapter 5
Modelling Noise at the Outputs
of a DFX System
5.1 Introduction
Since our emphasis in this work is on DSP solutions, we are interested in the signal-to-noise ratio (SNR), sometimes known as the signal-to-quantisation-noise ratio (SQNR). It is a well-accepted metric in the DSP community for measuring the quality of a finite precision algorithm implementation [Mit98]. Conceptually, the output sequence at each system output resulting from a finite precision implementation can be subtracted from the equivalent sequence resulting from an infinite precision implementation; the difference obtained is known as the finite precision error. The signal-to-noise ratio is therefore the ratio of the output power resulting from a signal with infinite precision over the power of the finite precision error.
The original RightSize synthesis tool takes the SNR as a user constraint to
guide the fixed-point wordlength optimisation procedure [Con03]. This feature
has been extended in this thesis to incorporate DFX. For the purpose of this
thesis, the signal power at each output is fixed since it is determined by a
combination of the input signal statistics and the computation graph G(V, S).
The optimisation procedure discussed in Chapter 6 requires accurate error models as a prerequisite to exploring the different implementations of a DFX-annotated computation graph, G(V, S, ADFX). It is therefore the purpose of this chapter to concentrate on noise modelling.
Oppenheim and Weinstein [OW72] and Liu [Liu71] previously laid down models for quantisation errors, together with their propagation through linear time-invariant (LTI) systems. In their models, an error signal is added to each signal that is truncated or rounded, and the errors are assumed to be uniformly distributed and uncorrelated with one another. These models worked well for uniform wordlength designs, as the SNR typically degrades by approximately 6 dB for each bit of wordlength reduction, and the prediction of errors did not require highly accurate models.
As hardware designers aggressively push to reduce the cost of their designs, they naturally turn to a multiple wordlength design paradigm [CCL01]. With multiple wordlengths, designers have greater flexibility to finely adjust the implementation error power, and the traditional error models are no longer sufficiently accurate. Constantinides introduced an error model to address this issue, using a discretised probability distribution function for the truncated LSB bits [CCL00]. The added scaling parameter in DFX, together with the boundary conditions, makes DFX a scaled number representation, and the error models from [CCL00] cannot be applied directly.
Furthermore, the errors introduced by DFX are correlated with one another. The error signal statistics for DFX depend on the signal magnitude, which in turn depends on the temporal and spatial correlation of the system’s inputs and design respectively. However, there is no correlation between the errors when rounding quantisation is performed, and this will be addressed in Section 5.4.2.
To assist the RightSize tool in modelling the injected errors, a single-pass profiling simulation is performed to populate the probability distribution tables used in estimating the truncation probabilities. With the noise injection models and error correlation accounted for, the output error estimate is obtained using the existing perturbation analysis technique, where each injected noise source is weighted with a sensitivity measure before summing [Con03]. The main original contributions in this chapter are therefore:
• the error models for each DFX module with flexible input and output parameters (i.e. multiple wordlengths and scalings),
• the correlation between the errors under the DFX truncation and rounding schemes,
• the profiling simulation used to obtain the probability distribution tables that model the errors and estimate the correlation coefficient, and
• the estimation of the output error of a DFX-annotated computation graph G(V, S, ADFX), taking into account the correlation of errors when the truncation scheme is used.
5.2 Preliminaries
The error models in this section model the errors introduced into a design/system that uses the DFX arithmetic modules discussed in Chapter 4. The main criterion for these error models is that they have to be as flexible as the fully parameterisable modules. This includes being able to cope with the multiple scaling and wordlength design paradigm without compromising accuracy.
As mentioned in the background (Section 2.5.4), the noise analysis of a fixed-point system used in RightSize is a weighted sum of injected quantisation errors [KS98, Con01]. Our proposed error model follows the same error injection approach, where errors are injected after a DFX module, as illustrated in Figure 5.1(a) (the error injections are shown as broken arrows). Note that these additional inputs and adders are not nodes in the computation graph, but are useful conceptual devices for modelling the errors associated with truncation. Therefore, for every signal j ∈ S in an annotated graph G(V, S, ADFX), an error source ej is added by the signal’s driver node. In an ideal situation, all error sources would be zero and the system would behave as if the error sources were not present. An example of a computation graph with error injections is shown in Figure 5.1(b).
[Figure 5.1: Noise error of each module modelled as an error injection at the output. (a) Error injection after a DFX module. (b) Example computation graph with error injection.]
Given a computation graph G(V, S), let Vo ⊂ V be the set of nodes of type primary out and let k ∈ Vo. Using the perturbation analysis in RightSize (Section 2.5.4), we can obtain the unscaled sensitivity measure, εjk, of the error ej injected into each signal j ∈ S with respect to output k. Assuming the noise powers of the injected errors, σ²j = var(ej), are known and the injected errors are uncorrelated with one another, the output noise power can be determined by the weighted-sum equation (5.1) below. This equation is used by the original RightSize fixed-point wordlength optimisation algorithm to estimate the output noise power.

σ²k = Σ_{j∈S} εjk σ²j .    (5.1)
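Equation (5.1) is a straightforward weighted sum. As an illustrative Python sketch (the signal names and numerical values below are made up for the example):

```python
def output_noise_power(eps_k, sigma2):
    # Equation (5.1): sigma_k^2 = sum over signals j of eps_jk * sigma_j^2,
    # valid when the injected errors are mutually uncorrelated.
    return sum(eps_k[j] * sigma2[j] for j in eps_k)

# Hypothetical sensitivities and injected noise powers for three signals.
eps_k = {"s1": 1.0, "s2": 0.25, "s3": 2.0}
sigma2 = {"s1": 1e-6, "s2": 4e-6, "s3": 5e-7}
assert abs(output_noise_power(eps_k, sigma2) - 3e-6) < 1e-12
```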
To ease the explanation of the rest of this section, an alternative means
of representing scaling called the least significant bit (LSB)-side scaling, p′, is
defined (Definition 5.1). Figure 5.2 shows this alternative scaling as the number
of bits from the right of the binary point to the LSB for both fixed-point and
DFX cases.
[Figure 5.2: LSB-side scaling definition. (a) Fixed-point. (b) Dual FiXed-point.]
Definition 5.1. The least significant bit (LSB)-side scaling vector pair p′0 ∈ Z^|S| and p′1 ∈ Z^|S| for a graph G(V, S, ADFX) consists of elements in one-to-one correspondence with the elements of S. They are the LSB-side scaling representations of the scaling vectors p0 and p1 respectively. For each signal j ∈ S, the elements are p′j0 = nj − pj0 and p′j1 = nj − pj1. Since p1 > p0, it follows that p′0 > p′1. □
Errors are introduced into a system whenever quantisation occurs. In the case of the DFX modules, quantisation occurs whenever there is a right shift in the data path. A two’s complement fixed-point signal with LSB-side scaling p′a truncated to p′b introduces an error with the mean and variance given by (5.2), which uses a discrete distribution for the errors [CCL99]. The equations are derived from the assumption that each combination of the truncated low-end bits is equally likely [Liu71, Tsa74], which holds in practice if the signals have sufficient dynamic range over that bit-width. If rounding is performed instead of truncation, (5.2) still applies, but the error mean becomes zero while the variance remains the same.
As DFX has two levels of precision, more than one truncation/rounding error may occur within each DFX module. Let T be the set of all the possible truncations/roundings that may occur and let T ∈ T. A truncation T of “p′a → p′b” yields the error mean and error variance given by (5.2) if and only if p′a ≥ p′b. When p′a < p′b, a left shift is performed, which involves zero-padding the least significant bits (LSBs); the error mean and error variance are then zero, as there is no precision loss.
mean:     µ = −(1/2) 2^(−n) (2^(pb) − 2^(pa)) = −(1/2) (2^(−p′b) − 2^(−p′a))

variance: σ² = (1/12) 2^(−2n) (2^(2pb) − 2^(2pa)) = (1/12) (2^(−2p′b) − 2^(−2p′a))    (5.2)
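Equation (5.2) can be cross-checked by exhaustively enumerating the equally likely patterns of the dropped bits. The following Python sketch (illustrative, not part of RightSize) does exactly that:

```python
def trunc_stats(pa, pb):
    # Equation (5.2): error mean and variance for truncating from LSB-side
    # scaling p'_a down to p'_b (with p'_a >= p'_b).
    mean = -0.5 * (2.0 ** -pb - 2.0 ** -pa)
    var = (2.0 ** (-2 * pb) - 2.0 ** (-2 * pa)) / 12.0
    return mean, var

def trunc_stats_exhaustive(pa, pb):
    # The truncation error is minus the value of the d = p'_a - p'_b dropped
    # LSBs; every dropped-bit pattern is assumed equally likely.
    errors = [-k * 2.0 ** -pa for k in range(2 ** (pa - pb))]
    mean = sum(errors) / len(errors)
    var = sum((e - mean) ** 2 for e in errors) / len(errors)
    return mean, var

for pa, pb in [(8, 4), (6, 1), (10, 9)]:
    m1, v1 = trunc_stats(pa, pb)
    m2, v2 = trunc_stats_exhaustive(pa, pb)
    assert abs(m1 - m2) < 1e-15 and abs(v1 - v2) < 1e-15
```

When p′a = p′b no bits are dropped and both mean and variance collapse to zero, matching the zero-padding case above.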
For every truncation/rounding T, there is also a corresponding probability of it occurring, PT. From the profiling simulation explained later in Section 5.5.1, we can extract the probabilities of all the truncation sources within each module. We can then determine the overall error mean and error variance using an equation of the form given by (5.3). Again, if rounding is performed, the error mean is zero and the variance is calculated with the zero error means. If there is only one truncation, i.e. |T| = 1, (5.3) reverts to the ordinary fixed-point truncation model.
µerror = Σ_{T∈T} PT µT

σ²error = Σ_{T∈T} PT (σ²T + µ²T) − µ²error    (5.3)
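Equation (5.3) is the standard mean/variance of a mixture of error distributions. A minimal Python sketch (illustrative; in practice the probabilities PT come from the profiling simulation):

```python
def mixture_stats(cases):
    # Equation (5.3): each case is a (P_T, mu_T, var_T) triple; the
    # probabilities over the set T are assumed to sum to one.
    mu = sum(p * m for p, m, _ in cases)
    var = sum(p * (v + m * m) for p, m, v in cases) - mu * mu
    return mu, var

# With a single truncation (|T| = 1) the model reverts to plain fixed-point:
assert mixture_stats([(1.0, -0.25, 0.015625)]) == (-0.25, 0.015625)
# Two equally likely truncations with symmetric means give a zero overall mean:
mu, var = mixture_stats([(0.5, -1.0, 0.0), (0.5, 1.0, 0.0)])
assert mu == 0.0 and var == 1.0
```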
The error models use the probability and joint probability distribution functions of the signals to estimate the injected errors. Therefore, the error models rely on the DFX signals being properly scaled for correct error estimation.
5.3 DFX Modules Noise Models
5.3.1 Encoder
The DFX Encoder module performs one of two forms of quantisation on its ordinary fixed-point input, depending on the input’s magnitude. If the output is in the Num0 range, the output is truncated by TE0, and if the output is in Num1, it is truncated by TE1. The truncations for the Encoder are listed in Table 5.1. To reiterate, B is the boundary of the DFX number (Definition 3.4).
Table 5.1: DFX Encoder truncations where the input is a fixed-point 〈nin, pin〉 and the output a DFX 〈n, p0, p1〉 (refer to Fig. 4.4 for the block diagram).

T_E    Truncation      Condition
T_E0   p′in → p′0      −B ≤ Input < B
T_E1   p′in → p′1      Input < −B or Input ≥ B
Figure 5.3: Probability density function (PDF) of the DFX Encoder input signal, partitioned into the Num1, Num0 and Num1 regions at ±Boundary with truncation probabilities P_TE1, P_TE0 and P_TE1.
The probability density function (PDF) of the input A is shown in Figure 5.3. From the PDF, the probability P_TE0 is the integral of the PDF over the region where the input A is in Num0. Likewise, the probability P_TE1 is the integral of the PDF over the region where the input A is in Num1. Therefore, using (5.3), the modelled error mean and error variance are given by (5.4).
µ_Enc = µ_TE0 P_TE0 + µ_TE1 P_TE1

σ²_Enc = ( (σ²_TE0 + µ²_TE0) P_TE0 + (σ²_TE1 + µ²_TE1) P_TE1 ) − µ²_Enc    (5.4)
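Numerically, the Encoder model (5.4) is a two-component mixture of the per-range truncation statistics (5.2). A hedged Python sketch (function and parameter names are illustrative; p_in, p0 and p1 stand for the effective scalings p′in, p′0 and p′1):

```python
import numpy as np

def trunc_stats(p_a, p_b):
    """Error mean/variance of truncating scaling p'_a to p'_b, per (5.2)."""
    mean = -0.5 * (2.0 ** -p_b - 2.0 ** -p_a)
    var = (1.0 / 12.0) * (2.0 ** (-2 * p_b) - 2.0 ** (-2 * p_a))
    return mean, var

def encoder_error(samples, B, p_in, p0, p1):
    """Mixture model (5.4): weight the Num0/Num1 truncation statistics
    by the empirical probabilities of the input lying in each range."""
    x = np.asarray(samples, dtype=float)
    P0 = np.mean((x >= -B) & (x < B))   # input in Num0 (Table 5.1 condition)
    P1 = 1.0 - P0                       # input in Num1
    m0, v0 = trunc_stats(p_in, p0)
    m1, v1 = trunc_stats(p_in, p1)
    mu = m0 * P0 + m1 * P1
    var = (v0 + m0 ** 2) * P0 + (v1 + m1 ** 2) * P1 - mu ** 2
    return mu, var
```

When all samples fall in one range, the mixture reduces to the single-range statistics of (5.2), as expected.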
5.3.2 Recoder
The error analysis discussed in this section covers the use of the DFX Recoder for the fork and delay nodes. As mentioned in Section 4.4.3, there are three different implementations of the Recoder, depending on the input and output boundaries. The possible truncations for the Recoder are summarised in Table 5.2. The truncation case T_R01 happens only when the input boundary is greater than the output boundary (Bin > Bout); similarly, the truncation case T_R10 happens only when Bin < Bout. When Bin = Bout, there is no change in the lower scaling, hence truncations T_R01 and T_R10 never happen.
Obtaining the probabilities of truncation involves integrating sections of
the input’s PDF. The input PDF for the Recoder is essentially the input
Table 5.2: DFX Recoder truncations where the input is DFX 〈nin, pin0, pin1〉 and the output DFX 〈nout, pout0, pout1〉.

T_R    Input   Output   Truncation        Condition
T_R00  Num0    Num0     p′in0 → p′out0    –
T_R01  Num0    Num1     p′in0 → p′out1    Bin > Bout
T_R10  Num1    Num0     p′in1 → p′out0    Bin < Bout
T_R11  Num1    Num1     p′in1 → p′out1    –
Figure 5.4: PDF of the DFX Recoder input and the probabilities of truncation: (a) for the case when Bin = Bout; (b) for the case when Bin > Bout; (c) for the case when Bin < Bout.
PDF into the fork or the delay node, since these nodes simply pass their inputs to their outputs. Figure 5.4 shows the PDF partitioning for the three different Recoder implementations. By integrating the areas under the graph, we can obtain the probabilities of truncation for each implementation. If a particular truncation cannot happen, the probability of that truncation happening is zero; for example, in the case Bin = Bout, the probabilities P_TR01 = P_TR10 = 0.
With the probabilities of truncation known, we can obtain the error mean
and error variance for the Recoder using Equation (5.3).
5.3.3 Adders
Analysis of the error model for the Adder module begins with analysing all possible input and output combinations. For both versions of the DFX Adder, the errors are introduced only at the end of the module and there is no loss of precision elsewhere. This means that both versions of the DFX Adder can be modelled in the same way. Table 5.3 shows all eight possible input and output combinations of the adder together with their respective truncations.
Ideally, if the inputs had no temporal correlation and were independent of each other, obtaining the probability distributions of the inputs independently would be sufficient to determine the probabilities of truncation. In practice, however, input signals have some degree of correlation, so a joint probability distribution function table of the Adder's inputs is needed. Such a distribution function can be viewed as a graph, with the x-axis for input A and the y-axis for input B (see Figure 5.5(a)). Different patterns on the graph denote the different input combinations. The input boundaries are shown as red and blue lines, and the gold line represents the locus |A| + |B| = BS. The portion between the output boundary lines is where the output is a Num0 number; the output is a Num1 number in the outer portions.
Table 5.3: DFX Adder inputs (A and B) and output (S) combinations with their respective output truncations.

T_A     A      B      S      Truncation                 Condition
T_A000  Num0   Num0   Num0   max(p′A0, p′B0) → p′S0     –
T_A010  Num0   Num1   Num0   max(p′A0, p′B1) → p′S0     (BB − BA) < BS
T_A100  Num1   Num0   Num0   max(p′A1, p′B0) → p′S0     (BA − BB) < BS
T_A110  Num1   Num1   Num0   max(p′A1, p′B1) → p′S0     –
T_A001  Num0   Num0   Num1   max(p′A0, p′B0) → p′S1     (BA + BB) > BS
T_A011  Num0   Num1   Num1   max(p′A0, p′B1) → p′S1     –
T_A101  Num1   Num0   Num1   max(p′A1, p′B0) → p′S1     –
T_A111  Num1   Num1   Num1   max(p′A1, p′B1) → p′S1     –
The probability of each individual truncation is found by integrating the corresponding area shown in Figure 5.5(b). Under certain conditions, some truncations can never occur: when (BB − BA) ≥ BS, the addition of an A in Num0 with a B in Num1 can never result in an S in Num0, so P_TA010 will always be zero. Similarly, when (BA − BB) ≥ BS, the addition of an A in Num1 with a B in Num0 can never result in an S in Num0, and P_TA100 = 0. Lastly, the addition of an A in Num0 with a B in Num0 can never result in an S in Num1 if (BA + BB) ≤ BS, and therefore P_TA001 = 0.
With the probabilities of truncation known, the error mean and error vari-
ance for the Adder may be found using Equation (5.3).
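The region integrals of Figure 5.5(b) can be approximated directly from paired input samples. A Monte Carlo sketch (function and variable names are illustrative; the sample counts stand in for integrating the joint PDF areas):

```python
import numpy as np

def adder_trunc_probs(a, b, BA, BB, BS):
    """Estimate the eight probabilities of Table 5.3 by counting which
    (A, B, S) range combination each sample pair falls into."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    s = a + b
    a1 = np.abs(a) >= BA          # input A in Num1
    b1 = np.abs(b) >= BB          # input B in Num1
    s1 = np.abs(s) >= BS          # output S in Num1
    probs = {}
    for ia in (0, 1):
        for ib in (0, 1):
            for io in (0, 1):
                mask = (a1 == ia) & (b1 == ib) & (s1 == io)
                probs[(ia, ib, io)] = float(np.mean(mask))
    return probs
```

The eight probabilities sum to one, and combinations forbidden by the boundary conditions above simply come out as zero.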
5.3.4 Gain Multiplier
As mentioned earlier in Section 4.6.1, the DFX Gain Multiplier multiplies a DFX number, X, with a fixed-point constant multiplier m〈nm, pm〉. Table 5.4 shows all its possible truncations. Due to the design restrictions imposed (Section 4.6.1), the case T_G01 will only happen if BA > (BQ/|m|) and, similarly, the case T_G10 will only happen if BA < (BQ/|m|).

As this module is a single-input module, the error statistics depend solely on the input's probability distribution. When the coefficient m is multiplied with a DFX input, the input boundary is shifted, and so are the ranges
Figure 5.5: DFX Adder joint probability distribution table. (a) Joint distribution table showing the input and output ranges; the golden line is the output boundary |A| + |B| = BS. (b) The probabilities of truncation for the DFX Adder.
Table 5.4: DFX Gain Multiplier input (A) and output (Q) combinations and their respective output truncations.

T_G    A      Q      Truncation            Condition
T_G00  Num0   Num0   (p′A0 + p′m) → p′Q0   –
T_G01  Num0   Num1   (p′A0 + p′m) → p′Q1   BA > (BQ/|m|)
T_G10  Num1   Num0   (p′A1 + p′m) → p′Q0   BA < (BQ/|m|)
T_G11  Num1   Num1   (p′A1 + p′m) → p′Q1   –
Figure 5.6: PDF of the DFX Gain Multiplier input and the probabilities of truncation: (a) for the case BA > (BQ/|m|); (b) for the case BA < (BQ/|m|).
of the input Num0 and Num1 regions. If |m| > 1, the output boundary, mapped onto the input axis as BQ/|m|, is shifted lower in magnitude, and vice-versa. This leads to the partitioning of the input PDFs shown in Figure 5.6, which is similar to the partitioning of the Recoder's input PDF. As before, the probability of each truncation can be obtained by integrating the appropriate area under the graph. For the case BA > (BQ/|m|), the probability P_TG10 is always zero, and for the case BA < (BQ/|m|), P_TG01 is always zero. The Gain Multiplier error mean and error variance are therefore given by Equation (5.3).
5.3.5 Full Multiplier
Table 5.5 tabulates all eight possible truncations of the Full Multiplier. The product of an A in Num0 and a B in Num0 can never become a Q in Num1 if (BA×BB) ≤ BS; in that case, the probability P_TM001 will be zero. Similarly, P_TM110 will be zero if (BA×BB) ≥ BS, because the product of an A in Num1 and a B in Num1 can never become a Q in Num0.
Table 5.5: DFX Full Multiplier inputs (A and B) and output (Q) combinations and their respective output truncations.

T_M     A      B      Q      Truncation               Condition
T_M000  Num0   Num0   Num0   (p′A0 + p′B0) → p′S0     –
T_M010  Num0   Num1   Num0   (p′A0 + p′B1) → p′S0     –
T_M100  Num1   Num0   Num0   (p′A1 + p′B0) → p′S0     –
T_M110  Num1   Num1   Num0   (p′A1 + p′B1) → p′S0     (BA×BB) < BS
T_M001  Num0   Num0   Num1   (p′A0 + p′B0) → p′S1     (BA×BB) > BS
T_M011  Num0   Num1   Num1   (p′A0 + p′B1) → p′S1     –
T_M101  Num1   Num0   Num1   (p′A1 + p′B0) → p′S1     –
T_M111  Num1   Num1   Num1   (p′A1 + p′B1) → p′S1     –
Since the Full Multiplier is a dual-input module, the joint probability distribution of its inputs is needed to estimate the errors. Unlike the Adder modules, the distribution of errors does not depend on the signs of its inputs and outputs. Taking the log2() of the magnitudes of the inputs, the joint PDF can be simplified as shown in Figure 5.7(a). There are four quadrants for the four different combinations of input ranges (Input A : Input B). Figures 5.7(b)-(d) show the distribution of errors for the different cases of input and output boundaries. By integrating the specific areas on the joint PDF for the probabilities of truncation, the Full Multiplier error mean and error variance can be found using Equation (5.3).
Figure 5.7: DFX Full Multiplier joint probability distribution table. (a) shows the input ranges (Input A : Input B) on log2(|A|) and log2(|B|) axes; (b)-(d) show the probabilities of truncation for the cases (BA·BB) < BQ, (BA·BB) > BQ and (BA·BB) = BQ respectively.
5.3.6 Error Model Evaluation and Discussion
To verify the error models, speech audio samples were used as inputs. The error is the difference between the estimated output and the actual (double precision) output. Tables 5.6 and 5.7 show that, for two arbitrarily chosen DFX formats, the truncation and rounding error models provide error estimates that are within ±3% of the actual error. The experiment was repeated with 100 different DFX formats and the results show that the estimates were within ±5%.
It can be seen from the models above that, in determining the statistics of the injected errors, knowledge of each probability of truncation is
Table 5.6: Evaluation of the error models for the truncation scheme with DFX formats 〈14,−5,2〉 and 〈14,−3,5〉.

〈14,−5,2〉    Estimated Error         Actual Error            Difference
             Mean        Var.        Mean        Var.        Mean     Var.
Encoder      -5.40e-05   5.81e-09    -5.39e-05   5.78e-09    0.20%    0.50%
Recoder      -3.10e-05   4.07e-09    -3.17e-05   4.08e-09    0.61%    -0.32%
Adder        -2.71e-04   8.09e-08    -2.77e-04   8.23e-08    -2.30%   -1.74%
Gain Mult.   -1.57e-04   3.40e-08    -1.55e-04   3.38e-08    1.35%    0.67%
Full Mult.   -3.67e-04   1.40e-07    -3.74e-04   1.36e-07    -1.91%   2.85%

〈14,−3,5〉    Estimated Error         Actual Error            Difference
             Mean        Var.        Mean        Var.        Mean     Var.
Encoder      -1.68e-04   1.85e-07    -1.68e-04   1.86e-07    0.17%    -0.68%
Recoder      -1.46e-04   1.55e-07    -1.47e-04   1.56e-07    -0.60%   0.54%
Adder        -1.06e-03   3.16e-06    -1.07e-03   3.18e-06    -0.97%   -0.48%
Gain Mult.   -5.56e-04   1.20e-06    -5.58e-04   1.21e-06    -0.40%   -1.20%
Full Mult.   -2.17e-03   4.95e-06    -2.20e-03   4.89e-06    -1.38%   1.21%
Table 5.7: Evaluation of the error models for the rounding scheme with DFX formats 〈14,−5,2〉 and 〈14,−3,5〉.

〈14,−5,2〉    Model Var.   Actual Var.   Difference
Encoder      2.18e-09     2.17e-09      0.52%
Recoder      1.92e-09     1.92e-09      -0.32%
Adder        1.36e-08     1.38e-08      -1.75%
Gain Mult.   8.85e-09     8.81e-09      0.50%
Full Mult.   1.83e-08     1.81e-08      1.09%

〈14,−3,5〉    Model Var.   Actual Var.   Difference
Encoder      5.32e-08     5.33e-08      -0.27%
Recoder      5.12e-08     5.11e-08      0.19%
Adder        4.04e-07     4.08e-07      -0.86%
Gain Mult.   2.40e-07     2.42e-07      -0.78%
Full Mult.   6.12e-07     6.20e-07      -1.29%
vital. These probabilities depend on the input and output signal parameters together with their respective probability distributions. Also, since the error statistics of a truncation T within a module depend on the output scaling, it should be noted that the output DFX parameters of each module play a significant part in the error injected after each module.
5.4 Correlated Errors
As discussed in Section 5.2, the noise power at the primary outputs is calculated as the weighted sum of the individual errors injected by signal quantisations, given by Equation (5.1). Summing the noise powers, i.e. the error variances, of the individual errors is a straightforward affair provided that the errors are not correlated with one another. In general, the variance of the sum of n variables is the sum of their pairwise covariances, cov(x, y), as given by (5.5) below. Hence, Equation (5.1) is rewritten to include the correlation between error sources.
var( Σ_{i=1}^{n} X_i ) = Σ_{i=1}^{n} Σ_{j=1}^{n} cov(X_i, X_j)

                       = Σ_{i=1}^{n} var(X_i) + 2 Σ_{i=1}^{n} Σ_{j=i+1}^{n} cov(X_i, X_j)    (5.5)
σ²_k = Σ_{j∈S} ε_jk σ²_j + Σ_{i,j∈S, j≠i} (ε_ik ε_jk)^(1/2) cov(X_i, X_j)    (5.6)
In the case where the error sources are not correlated with one another, the covariances between different error sources are all zero and the result is simply the sum of the error sources' variances. However, as demonstrated by the error models in the previous section, the variance of the error introduced is highly dependent on the probability distributions of the signals. Correlation between the signals exists in practical DSP implementations due to the temporal correlation of the inputs and the spatial correlation which is intrinsic to the design of the computational graph.
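The identity (5.5) is easy to confirm numerically with synthetic correlated error sources (everything in this sketch is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
e = rng.standard_normal((10000, 3))   # three synthetic error sources
e[:, 1] += 0.5 * e[:, 0]              # make two of them correlated

cov = np.cov(e, rowvar=False)         # matrix of cov(X_i, X_j)
var_direct = np.var(e.sum(axis=1), ddof=1)
var_from_cov = cov.sum()              # (5.5): sum over all i, j

# The two agree exactly; ignoring the off-diagonal terms would
# underestimate the variance when the sources are positively correlated.
assert abs(var_direct - var_from_cov) < 1e-8
```

Dropping the off-diagonal covariances here leaves only the sum of the individual variances, which is the uncorrelated special case discussed above.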
The correlation between error sources can be quantified by the Pearson correlation coefficient [SM94], r_x,y (5.7). When r_x,y = 0, the errors x and y are not correlated with each other. When r_x,y = 1 or r_x,y = −1, the errors are fully correlated or fully inversely correlated respectively. Example 5.1 demonstrates the severity of error correlation when calculating the noise power of the primary output error variance with DFX signals.
r_X,Y = r(X, Y) = cov(X, Y) / (σ_X σ_Y)    (5.7)
Figure 5.8: Transposed FIR Direct Form type I filter implemented with DFX modules, with coefficients b0–b3, gain multiplier error sources g3–g0 and adder error sources a2–a0.
Example 5.1. Figure 5.8 shows a transposed FIR Direct Form type I filter with its error sources marked. In this example, the encoder, branching fork and delay nodes do not inject any errors into the final output, so their error injections are omitted. The signals in this filter are truncated to a uniform wordlength and scaling throughout (DFX 〈14,−3, 2〉), and the filter input is a speech sample, to simulate a temporally correlated input. A straightforward addition of the error variances gives a noise power of 0.96×10⁻⁸, but simulation shows that the noise power is actually 1.67×10⁻⁸. This −43% difference is caused by the correlation between the error sources. Table 5.8 tabulates the correlation coefficients between the error sources.
Interestingly, when rounding is used, the error sources do not exhibit any correlation between them, as shown in Table 5.9. Without correlation,
Table 5.8: The correlation coefficients of the error sources for the FIR filter in Figure 5.8 when truncation is used.
Errors g3 g2 g1 g0 a2 a1 a0
g3 1.0 0.482 0.363 0.448 0.406 0.167 0.442
g2 0.482 1 0.241 0.599 0.394 0.169 0.534
g1 0.363 0.241 1 0.221 0.246 0.150 0.213
g0 0.448 0.599 0.221 1 0.373 0.177 0.511
a2 0.406 0.394 0.246 0.373 1 0.254 0.378
a1 0.167 0.169 0.150 0.177 0.254 1 0.168
a0 0.442 0.534 0.213 0.511 0.378 0.168 1
the covariances between the errors are zero (5.7) and the noise power is therefore just the sum of the error variances. The simulated output noise power is found to be about 1.07×10⁻⁸, and the straightforward addition estimate gives 1.11×10⁻⁸ (i.e. a 3.6% error).
Table 5.9: The correlation coefficients of the error sources for the FIR filter in Figure 5.8 when rounding is used.
Errors g3 g2 g1 g0 a2 a1 a0
g3 1 -0.001 -0.007 -0.002 0.000 0.005 0.002
g2 -0.001 1 -0.011 -0.012 -0.001 0.002 -0.005
g1 -0.007 -0.011 1 0.009 0.001 0.000 0.000
g0 -0.002 -0.012 0.009 1 0.001 -0.005 0.004
a2 0.000 -0.001 0.001 0.001 1 -0.006 -0.006
a1 0.005 0.002 0.000 -0.005 -0.006 1 -0.006
a0 0.002 -0.005 0.000 0.004 -0.006 -0.006 1
□
Since truncation is generally the least area-expensive method of quantisation, it is essential that the correlations between the truncation error sources are ascertained. This section introduces a way to approximate the correlation coefficient between the error sources and explores the case when DFX signals are rounded.
5.4.1 Estimating the Correlation for the Error Sources

Taking two DFX signals X and Y, we denote the errors injected into these signals as x and y respectively. In a truncation scheme, the error x lies in the range (−δx0, 0], where δx0 = 2^p′X0, when X is in Num0, or in the range (−δx1, 0], where δx1 = 2^p′X1, when X is in Num1. Figure 5.9 shows the distributions of the x and y errors. The discrete error distribution [CCL99] used for the error models is not used here, to simplify calculations; the uniformly distributed error model of [Liu71] is sufficient.
Figure 5.9: The PDFs of the DFX signal errors: uniform over (−δx0, 0] or (−δx1, 0] for x (Num0 or Num1 respectively), and likewise over (−δy0, 0] or (−δy1, 0] for y.
Let the probabilities of the errors for each combination of the ranges of signals X and Y be given by Table 5.10. These probabilities can be determined from the joint PDF obtained in the profiling simulation described later in Section 5.5.3.
Table 5.10: Combinations of the ranges of signals X and Y and their error probabilities.
X Y Error probability
Num0 Num0 ω00
Num0 Num1 ω01
Num1 Num0 ω10
Num1 Num1 ω11
Estimation of the correlation coefficient in Equation (5.7) begins by obtaining the covariance between the errors (x and y) and their standard deviations. The covariance is found using Equation (5.8). The means and standard deviations of the errors can be found using the techniques described in Section 5.3. Calculating the expectation of the product of the two errors, on the other hand, requires the joint PDF f(x, y), as seen in Equation (5.9).
cov(x, y) = E(xy) − µx µy    (5.8)

E(xy) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x y f(x, y) dx dy    (5.9)
Similar to the joint PDF used for the DFX Adder, Figure 5.10 shows the joint PDF of the errors x and y. Figures 5.10(a)-(d) show the breakdown of the joint PDF for each combination of errors, and Figure 5.10(e) shows the complete joint distribution diagram with the four individual regions marked out. With the assumption that the errors are uniformly spread out within each error combination, the joint PDF f(x, y) is given by (5.10).
f(x, y) = f00  if x ∈ (−δx0, 0] and y ∈ (−δy0, 0]
          f01  if x ∈ (−δx0, 0] and y ∈ (−δy1, −δy0]
          f10  if x ∈ (−δx1, −δx0] and y ∈ (−δy0, 0]
          f11  if x ∈ (−δx1, −δx0] and y ∈ (−δy1, −δy0]    (5.10)

where

f00 = ω00 (δx0 δy0)⁻¹ + f01 + f10 − f11
f01 = ω01 (δx0 δy1)⁻¹ + f11
f10 = ω10 (δx1 δy0)⁻¹ + f11
f11 = ω11 (δx1 δy1)⁻¹
Knowing the joint PDF of the errors, the expectation E(xy) can be shown
to be (5.11). Also, using (5.3), the error means and variances are given by
(5.12) and (5.13).
E(xy) = ∫_{−δy1}^{−δy0} ( ∫_{−δx1}^{−δx0} xy f11 dx + ∫_{−δx0}^{0} xy f01 dx ) dy

      + ∫_{−δy0}^{0} ( ∫_{−δx1}^{−δx0} xy f10 dx + ∫_{−δx0}^{0} xy f00 dx ) dy

      = (1/4) [ ω00 (δx0 δy0) + ω01 (δx0 δy1) + ω10 (δx1 δy0) + ω11 (δx1 δy1) ]    (5.11)
Figure 5.10: Joint probability distribution of the errors. (a)-(d) show the breakdown for each error combination “x : y” and (e) shows the complete joint distribution diagram.
µx = (ω00 + ω01)(−δx0/2) + (ω10 + ω11)(−δx1/2)

µy = (ω00 + ω10)(−δy0/2) + (ω01 + ω11)(−δy1/2)    (5.12)

σ²x = (1/3)( (ω00 + ω01) δ²x0 + (ω10 + ω11) δ²x1 ) − µ²x

σ²y = (1/3)( (ω00 + ω10) δ²y0 + (ω01 + ω11) δ²y1 ) − µ²y    (5.13)
With the error means and the expectation known, the error covariance can be shown to be (5.14). Together with the error means and variances, the correlation coefficient can then be found using (5.7).
cov(x, y) = (1/4)(ω00 ω11 − ω01 ω10)(δx0 − δx1)(δy0 − δy1)    (5.14)
For the errors to be uncorrelated, the covariance between them must be zero, i.e. cov(x, y) = 0. Examining the error covariance equation, we can see that the errors are uncorrelated when ω00 ω11 = ω01 ω10, when δx0 = δx1, or when δy0 = δy1. Through experiments, it is found that the case ω00 ω11 = ω01 ω10 occurs when the signals X and Y are not correlated with each other. The other two conditions confirm that when fixed-point is used for either of the signals, the errors have no correlation (in fixed-point, δ0 = δ1).
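The closed-form estimate follows directly from (5.12)-(5.14) and (5.7). A sketch (names are illustrative, and the ω's are assumed to sum to one):

```python
import math

def error_corr(w00, w01, w10, w11, dx0, dx1, dy0, dy1):
    """Correlation coefficient between two DFX truncation errors, from
    the range probabilities omega and the error widths delta."""
    mx = -0.5 * ((w00 + w01) * dx0 + (w10 + w11) * dx1)                # (5.12)
    my = -0.5 * ((w00 + w10) * dy0 + (w01 + w11) * dy1)
    vx = ((w00 + w01) * dx0**2 + (w10 + w11) * dx1**2) / 3.0 - mx**2   # (5.13)
    vy = ((w00 + w10) * dy0**2 + (w01 + w11) * dy1**2) / 3.0 - my**2
    cov = 0.25 * (w00 * w11 - w01 * w10) * (dx0 - dx1) * (dy0 - dy1)   # (5.14)
    return cov / math.sqrt(vx * vy)
```

The three zero-covariance conditions above can be checked directly: ω00 ω11 = ω01 ω10, δx0 = δx1 or δy0 = δy1 each force the numerator of (5.14) to zero.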
Using the methods in this section, the correlation coefficient estimates for the error sources in Example 5.1 were found and are shown in Table 5.11. They are all within ±4% of the actual correlation coefficients.
Table 5.11: Estimates of the correlation coefficients of the error sources for the FIR filter in Figure 5.8 when truncation is used.
Errors g3 g2 g1 g0 a2 a1 a0
g3 1 0.48698 0.36295 0.42468 0.43505 0.18584 0.45016
g2 0.48698 1 0.25186 0.58503 0.41588 0.19758 0.54779
g1 0.36295 0.25186 1 0.21963 0.26785 0.16327 0.24033
g0 0.42468 0.58503 0.21963 1 0.38986 0.19983 0.51668
a2 0.43505 0.41588 0.26785 0.38986 1 0.26921 0.39682
a1 0.18584 0.19758 0.16327 0.19983 0.26921 1 0.19163
a0 0.45016 0.54779 0.24033 0.51668 0.39682 0.19163 1
5.4.2 Rounding Benefits
The errors introduced by rounding DFX signals were shown not to be correlated with each other (Example 5.1). Using the same notation as in the previous section (Section 5.4.1), Figures 5.11 and 5.12 show the PDFs of the errors x and y, and their joint PDF diagram.
Figure 5.11: The PDFs of the DFX signal errors when rounding is used.
Figure 5.12: Joint probability distribution of the errors when rounding is used. (a)-(d) show the breakdown for each error combination “x : y” and (e) shows the complete joint distribution diagram.
The joint PDF f ′(x, y) for the case of rounding is redefined as (5.15).
f′(x, y) = f′00  if |x| ∈ [0, δx0/2] and |y| ∈ [0, δy0/2],
           f′01  if |x| ∈ [0, δx0/2] and |y| ∈ (δy0/2, δy1/2],
           f′10  if |x| ∈ (δx0/2, δx1/2] and |y| ∈ [0, δy0/2],
           f′11  if |x| ∈ (δx0/2, δx1/2] and |y| ∈ (δy0/2, δy1/2].    (5.15)

where

f′00 = ω00 (δx0 δy0)⁻¹ + f′01 + f′10 − f′11
f′01 = ω01 (δx0 δy1)⁻¹ + f′11
f′10 = ω10 (δx1 δy0)⁻¹ + f′11
f′11 = ω11 (δx1 δy1)⁻¹
When rounding is used, the means of the errors are zero. Hence the error covariance depends only on the expectation E′(xy). As shown below in (5.16), this expectation is zero when rounding is used: the symmetric limits of integration cancel each other out, leaving zero expectation.
E′(xy) = ∫_{−δy1/2}^{−δy0/2} ( ∫_{−δx1/2}^{−δx0/2} f′11 xy dx + ∫_{−δx0/2}^{δx0/2} f′01 xy dx + ∫_{δx0/2}^{δx1/2} f′11 xy dx ) dy

       + ∫_{−δy0/2}^{δy0/2} ( ∫_{−δx1/2}^{−δx0/2} f′10 xy dx + ∫_{−δx0/2}^{δx0/2} f′00 xy dx + ∫_{δx0/2}^{δx1/2} f′10 xy dx ) dy

       + ∫_{δy0/2}^{δy1/2} ( ∫_{−δx1/2}^{−δx0/2} f′11 xy dx + ∫_{−δx0/2}^{δx0/2} f′01 xy dx + ∫_{δx0/2}^{δx1/2} f′11 xy dx ) dy

       = 0    (5.16)
This means that the covariance between the errors is always zero, and the errors are never correlated when rounding is used; Example 5.1 showed an instance of this. However, the added circuit complexity of rounding makes it undesirable, and a rounding scheme for DFX is not implemented in the RightSize tool. Since truncation is used, the system output error estimation therefore has to deal with the correlations between the errors.
5.5 Profiling: Simulation and Tables
5.5.1 Profiling Simulation
The error models introduced in Section 5.3 rely on the probability distributions, or profiles, of the signals within a system. These profile tables must be flexible enough to cater for any possible input and output boundaries of a module. In practice, the signals within a DSP design are typically correlated with one another, especially because of their highly correlated inputs, for example uncompressed sound in audio applications [Mit98] and RGB channels in video [Pra01]. We therefore perform a profiling simulation to obtain the distributions of the signals.
A straightforward approach is to place a histogram counter at every module input to record the distribution of inputs while performing a single-pass simulation of the entire system with a representative input. To extract accurate probabilities from these tables, the histogram would have to be very fine while still accommodating a signal's large dynamic range. For every node v ∈ V : type(v) ≠ primary_out in a computational graph G(V, S), an individual profile is needed. Also, the estimation of the error correlation coefficients discussed in Section 5.4 requires a profile table for each 2-combination of signals s ∈ S. Therefore a great amount of information needs to be gathered, and this rapidly becomes a major computational hurdle as the design/algorithm increases in complexity. Fortunately, there are ways to simplify the tables while retaining their accuracy. There are two types of profile tables: a 1-D Profile Table is sufficient for all models requiring an ordinary PDF, and a 2-D Profile Table for those needing a joint PDF.
5.5.2 1-D Profile Table
The modules that use a 1-D Profile Table are the single-input modules: the DFX Encoder, the DFX Recoder and the DFX Gain Multiplier. The 1-D Profile Table collects data on the input A and, in the case of the Gain Multiplier, also on the output Q.
The decision of which DFX range a number takes ultimately depends on its magnitude. Hence, by discarding the sign information, the profile table is shrunk by half (for 1-D tables) or by three quarters (for 2-D tables). Apart from that, the boundary of a DFX number is always a power of 2, and all the truncation probabilities P_T depend on the boundary value. We exploit this fact and group the histogram values using a logarithmic class division. Using logarithms of base 2, each division is a boundary bin, H(α). For α = αmin, . . . , αmax, we define H(α) = [2^(α−1), 2^α), with the exception H(αmin) = [0, 2^αmin).
The limits αmin and αmax depend on the signal's peak value. For a signal with a simulation-obtained peak value P, the upper boundary bin limit is given by αmax = ⌊log2(P)⌋ + 1. The peak value P is determined during the upper scaling p1 parameter determination step in Section 6.5. As for αmin, it is chosen so that the signal remains fully represented (Definition 3.7) up to a maximum signal wordlength, nmax, defined by the user.
αmin = αmax − nmax (5.17)
Figure 5.13: Boundary bins H(αmin), . . . , H(αmax) over the input magnitude |A|.
To simplify the notation later, we denote by αA the logarithm of the boundary of input A, i.e. αA = log2(BA). Figure 5.13 illustrates the range of the boundary bins for input A. Each boundary bin H(α) has a counter, h(α), counting the number of times the input falls within the range of that boundary bin. For a simulation size of N samples, the counter h(α) is given by (5.18).

h(α) = Σ_{i=1}^{N} 1{Aᵢ ∈ H(α)}    (5.18)
Therefore, the probability of the input being within a range [2^αX, 2^αY) is found by (5.19). This equation is sufficient to gather the probabilities of truncation for the DFX Encoder and DFX Recoder modules. An example for the DFX Recoder is shown in Example 5.2.

P_H = Σ_{α=αX}^{αY} h(α) / N    (5.19)
Example 5.2. As an example, take a DFX Recoder with input A〈nA, pA0, pA1〉 and output X〈nX, pX0, pX1〉 where pA0 < pX0. Since BA < BX, its input PDF and probabilities of truncation match Figure 5.4(b). The profile table can be visualised as in Figure 5.14. The probability of each truncation, found using (5.19), is given below.

Figure 5.14: An example profile table for a DFX Recoder, with regions P_TR00, P_TR01 and P_TR11 spanning the bins from αmin to αmax.
P_TR00 = Σ_{α=αmin}^{αout} h(α)/N        P_TR01 = Σ_{α=αout+1}^{αin} h(α)/N

P_TR10 = 0                               P_TR11 = Σ_{α=αin+1}^{αmax} h(α)/N

□
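A software sketch of the 1-D Profile Table, implementing (5.18) and (5.19) (the data structure and names are illustrative):

```python
import math

def profile_1d(samples, a_min, a_max):
    """Boundary-bin counters h(alpha) of (5.18): H(alpha) = [2^(alpha-1),
    2^alpha), with H(a_min) = [0, 2^a_min). Signs are discarded."""
    h = {a: 0 for a in range(a_min, a_max + 1)}
    for x in samples:
        x = abs(x)
        if x < 2.0 ** a_min:
            h[a_min] += 1
        else:
            # x in [2^(a-1), 2^a)  =>  bin index floor(log2 x) + 1
            h[min(math.floor(math.log2(x)) + 1, a_max)] += 1
    return h

def prob_range(h, a_x, a_y, n):
    """P_H of (5.19): probability of the magnitude lying in bins a_x..a_y."""
    return sum(h[a] for a in range(a_x, a_y + 1)) / n
```

With these two functions, the Recoder probabilities of Example 5.2 are three calls to prob_range over the bin ranges [αmin, αout], [αout+1, αin] and [αin+1, αmax].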
For the DFX Gain Multiplier, an extended boundary bin is required. In addition to the boundary bin counter, each H(α) collects an extra histogram of the output, Q. Using the same logarithmic base 2 class division and range, we denote the output boundary bins as Gα(γ) for each input boundary bin H(α). The range of Gα(γ) is [2^(γ−1), 2^γ), where γ = γmin, . . . , γmax and Gα(γmin) = [0, 2^γmin). Each output boundary bin also has a counter, gα(γ) (5.20). As before, the output boundary bin limits depend on the output signal and are treated in the same way as the limits of the input boundary bins. Also, to simplify the notation, we denote by γQ the logarithm of the boundary of output Q, i.e. γQ = log2(BQ).
gα(γ) = Σ_{i=1}^{N} 1{Qᵢ ∈ Gα(γ)}    (5.20)
Therefore, the probability of an input in boundary bin H(α) producing an output in the Num0 range of Q is given by the function P_Gα(γQ) (5.21).

P_Gα(γQ) = Σ_{γ=γmin}^{γQ} gα(γ) / N    (5.21)
Example 5.3. As an example, take a DFX Gain Multiplier with input A〈nA, pA0, pA1〉 and output Q〈nQ, pQ0, pQ1〉 where BA > (BQ/|m|). The probabilities of truncation shown in Figure 5.6(a) can be obtained using equations (5.19) and (5.21) as shown below.

P_TG00 = Σ_{α=αmin}^{αA} P_Gα(γQ)        P_TG01 = Σ_{α=αmin}^{αA} h(α)/N − P_TG00

P_TG10 = 0                               P_TG11 = Σ_{α=αA+1}^{αmax} h(α)/N

□
5.5.3 2-D Profile Table
The error analysis for the DFX Adder and Full Multiplier modules requires the joint PDF of their inputs. The error correlation coefficient also requires the joint probability of every 2-combination of signals, which can only be determined using their joint PDFs. As its name states, the 2-D Profile Table is a 2-dimensional version of the 1-D Profile Table, collecting the joint probability distribution of two input signals A and B and of the output Q.
As before, the logarithmic base 2 class division is used for the histogram counters, and the boundary bins are now H(α, β). For α = αmin, . . . , αmax and β = βmin, . . . , βmax, each boundary bin H(α, β) has the range ([2^(α−1), 2^α), [2^(β−1), 2^β)), with the exception H(αmin, βmin) = ([0, 2^αmin), [0, 2^βmin)). Each boundary bin also has a counter h(α, β); for inputs A and B with N samples, the counter is given by (5.22). This basic boundary bin is sufficient for the correlation coefficient estimation.
h(α, β) = Σ_{i=1}^{N} 1{(Aᵢ, Bᵢ) ∈ H(α, β)}    (5.22)
For the DFX Adder and Full Multiplier, the 2-D Profile Table with extended boundary bins has an extra array of output boundary bins to collect the histogram data of the output. An output boundary bin Gα,β(γ) for H(α, β) has the range [2^(γ−1), 2^γ), where γ = γmin, . . . , γmax and Gα,β(γmin) = [0, 2^γmin). The counter for each output boundary bin, gα,β(γ), is given by (5.23). As in the previous section, the limits of the input and output boundary bins depend on the signals that they analyse and are specified in the same way. To simplify the notation, the logarithms of the input and output boundaries are denoted αA = log2(BA), βB = log2(BB) and γQ = log2(BQ).

gα,β(γ) = Σ_{i=1}^{N} 1{Qᵢ ∈ Gα,β(γ)}    (5.23)
The probability of inputs A and B being in the range ([2^α_X, 2^α_Y), [2^β_X, 2^β_Y)) is given by (5.24). For the extended version of the boundary bin H(α, β), the probability of the output being a Num0 is given by the function P_{α,β}(γ_Q) in (5.25), where N is the total number of samples. An example of the truncation probabilities for a DFX Adder is shown in Example 5.4.
P_H = Σ_{α=α_X}^{α_Y} Σ_{β=β_X}^{β_Y} h(α, β)/N   (5.24)

P_{α,β}(γ_Q) = Σ_{γ=γ_min}^{γ_Q} g_{α,β}(γ)/N   (5.25)
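The single-pass collection of h(α, β) and g_{α,β}(γ) described above, and the region probability (5.24), can be sketched in Python. The function names and the binning of values by magnitude are illustrative assumptions, not the RightSize implementation:

```python
import math
from collections import defaultdict

def log2_bin(x, bmin):
    """Index of the logarithmic base-2 boundary bin holding |x|.
    Values below 2**bmin collapse into the lowest bin, as in
    H(alpha_min, beta_min)."""
    if abs(x) < 2.0 ** bmin:
        return bmin
    return math.floor(math.log2(abs(x))) + 1  # |x| in [2**(a-1), 2**a)

def profile_2d(samples_a, samples_b, outputs_q, amin, bmin, gmin):
    """One-pass collection of h(alpha, beta) and g_{alpha,beta}(gamma)."""
    h = defaultdict(int)
    g = defaultdict(int)
    for a, b, q in zip(samples_a, samples_b, outputs_q):
        ab = (log2_bin(a, amin), log2_bin(b, bmin))
        h[ab] += 1                                  # counter (5.22)
        g[ab + (log2_bin(q, gmin),)] += 1           # counter (5.23)
    return h, g

def prob_region(h, n, a_lo, a_hi, b_lo, b_hi):
    """P_H, equation (5.24): probability of (A, B) falling in a region."""
    return sum(h[(a, b)] for a in range(a_lo, a_hi + 1)
                         for b in range(b_lo, b_hi + 1)) / n
```

The `defaultdict` keeps the table sparse, so only bins actually hit by the simulation consume memory.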
Example 5.4. Take for example a DFX Adder with inputs A⟨nA, pA0, pA1⟩, B⟨nB, pB0, pB1⟩ and output S⟨nS, pS0, pS1⟩. The probabilities for truncation, as tabulated in Table 5.3, are given below.
P_TA000 = Σ_{α=α_min}^{α_A} Σ_{β=β_min}^{β_B} P_{α,β}(γ_S),   P_TA001 = Σ_{α=α_min}^{α_A} Σ_{β=β_min}^{β_B} h(α, β)/N − P_TA000

P_TA010 = Σ_{α=α_min}^{α_A} Σ_{β=β_B+1}^{β_max} P_{α,β}(γ_S),   P_TA011 = Σ_{α=α_min}^{α_A} Σ_{β=β_B+1}^{β_max} h(α, β)/N − P_TA010

P_TA100 = Σ_{α=α_A+1}^{α_max} Σ_{β=β_min}^{β_B} P_{α,β}(γ_S),   P_TA101 = Σ_{α=α_A+1}^{α_max} Σ_{β=β_min}^{β_B} h(α, β)/N − P_TA100

P_TA110 = Σ_{α=α_A+1}^{α_max} Σ_{β=β_B+1}^{β_max} P_{α,β}(γ_S),   P_TA111 = Σ_{α=α_A+1}^{α_max} Σ_{β=β_B+1}^{β_max} h(α, β)/N − P_TA110

□
5.6 Output Noise Estimation
This section brings together the topics discussed earlier in this chapter to determine the noise at the primary outputs of an annotated graph G(V, S, A_DFX). The profiling simulation (Section 5.5.1) gathers information on the probability distribution of the signals within the system. For every pair of signals i, j ∈ S, we can determine the mean and variance of the individual error sources, and their correlation coefficient, r_{i,j}, using the probability distributions.
Using the output response sensitivity measure, ε_jk, from the perturbation analysis (Section 2.5.4), the error variance observed at primary output k for an error source e_j is given by ε_jk σ²_ej. Therefore, using (5.6) and (5.7), the covariance observed at the output would be r_{i,j} (ε_ik ε_jk σ²_i σ²_j)^(1/2). Hence the final equation for the noise power at primary output k is given by

σ²_k = Σ_{j∈S} σ²_j ε_jk + Σ_{i,j∈S, j≠i} r_{i,j} (ε_ik ε_jk σ²_i σ²_j)^(1/2)   (5.26)
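Equation (5.26) is a weighted sum of the error-source variances and their pairwise covariances. A minimal sketch, with hypothetical argument layouts, might look like:

```python
import math

def output_noise_power(var, eps, r):
    """Noise power at one primary output, equation (5.26).

    var -- variance of each error source e_j
    eps -- output response sensitivity eps_jk of each source
    r   -- r[i][j], correlation coefficient between sources i and j
    """
    n = len(var)
    # Variance terms: sum_j var_j * eps_jk
    power = sum(var[j] * eps[j] for j in range(n))
    # Covariance terms: sum_{i != j} r_ij * (eps_ik eps_jk var_i var_j)^(1/2)
    for i in range(n):
        for j in range(n):
            if i != j:
                power += r[i][j] * math.sqrt(eps[i] * eps[j] * var[i] * var[j])
    return power
```

With `r` equal to the identity matrix (uncorrelated sources) the covariance terms vanish and only the weighted variances remain.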
To test the feasibility of the error estimation, the SNRs of 500 filters were simulated and compared with their estimated SNRs. The set of filters comprises 200 159-tap FIR filters, 200 12th-order IIR filters and 100 4th-order LMS filters (refer to Section 6.9 for a description of the IIR and LMS filters), where the coefficients of the FIR and IIR filters are randomly selected. Similarly, the signal wordlengths and their lower scalings p0 are chosen at random, and only their upper scalings p1 were selected to ensure no overflows. Figure 5.15 displays the estimated SNR against the simulated SNR, and Table 5.12 tabulates some of the individual designs with their results. We can see that the estimated SNR closely matches the simulated results (0.0628 significance level).
Table 5.12: Comparison between actual and estimated SNR for a 159-tap FIR filter with DFX parameters of increasing wordlength.

Design   Actual SNR (dB)   Estimated SNR (dB)   Diff.
FIR 1    41.0              40.2                 -2.1%
FIR 2    53.0              52.2                 -1.5%
FIR 3    64.9              64.2                 -1.0%
5.7 Summary
As part of the progressive steps taken to automate the determination of DFX
parameters in Chapter 6, reliable error models are required to quickly explore
the design space without having to repetitively run simulations.

Figure 5.15: The estimated vs simulated SNR for 500 filters.

This section has introduced the error models for the DFX modules that inflict quantisation errors on a system. The nature of DFX means that the models rely on the probability distributions of their inputs. A single-pass profiling simulation not only determines the probability distributions of the signals for single-input modules, but also the joint probability distributions for two-input DFX modules.
Also, with the knowledge of the joint PDFs of every 2-combination of signals, the correlation coefficients of their error sources can be estimated. This is necessary as the individual error sources in a practical design will be correlated with one another, and we need a weighted sum of the variances and covariances of the error sources. The weights are taken from the output response sensitivities, which are determined by the perturbation analysis of the RightSize tool.
Chapter 6

Approach to DFX Parameter Optimisation
6.1 Introduction
This chapter's main focus is describing an approach to optimising a design with the new DFX number representation. The high-level synthesis tool RightSize, which originally optimises a design using the fixed-point number representation, has been modified to incorporate DFX into its optimisation procedure.

A feature of the original RightSize tool is that users decide the SNR constraint on the primary outputs of a design, which is then used to guide the optimisation routine. This feature is kept, and an additional optional design constraint on the maximum probability of signal overflow is provided to users. This constraint places an upper bound on the probability of overflow for every signal in a design, which is useful when dealing with designs containing feedback loops.
Providing the lower scaling p0, as in DFX, adds an extra degree of freedom to the optimisation problem for the multiple wordlengths and scalings of a design, thus making the problem more difficult than when using fixed-point alone. From the definition of DFX, the lower scaling parameter p0 is allowed to equal p1, and when this happens, the signal is effectively a fixed-point signal. The proposed optimisation procedure successfully obtains the right mix of DFX and fixed-point signal representations for a design.
The optimisation procedure is a hybrid approach, incorporating results from simulation with analytical estimates of the output errors and the area consumption. Optimising the parameters for a multiple wordlength design has been shown to be NP-hard [CW02], and finding an optimal solution will be computationally intensive. Therefore a meta-heuristic optimisation is used, which efficiently searches the configuration space with a two-phase simulated annealing algorithm to decide both the optimum p0 and n parameters.

Conditioning is performed on the DFX parameters of every signal in order to concentrate the search and eliminate futile or infeasible trial configurations. Also, a simple area consumption model is used to provide quick but reliable cost metric feedback to the optimisation routines. Finally, case study results are presented to demonstrate the optimisation procedure.
Therefore the original contributions of this chapter are:

• a maximum overflow probability constraint for a design,

• identification of the design options where DFX could perform better than fixed-point,

• a meta-heuristic optimisation using a two-phase simulated annealing algorithm,

• case studies of an IIR and an LMS filter to demonstrate the use of DFX in a system context and the optimisation heuristic in action.
6.2 RightSize Prerequisites
Apart from the design description, the design flow of RightSize requires user-specified design constraints together with a set of representative inputs (refer to Section 2.5.4 for a description of the RightSize design flow).
6.2.1 User Specified Design Constraints
There are two design constraints that can be specified by the user. First, the user has to provide the acceptable signal-to-noise ratio (SNR) for each of the system's primary outputs. As mentioned earlier, SNR is a well-accepted general metric in the DSP community for measuring the quality of a finite precision algorithm implementation [Mit98]. Therefore RightSize will ensure that all the outputs of the synthesized design meet this SNR bound.

The wide dynamic range available to DFX means that our system is capable of preventing overflows without having to use saturation. Overflows must be prevented at all costs in digital filters, as severe distortions may otherwise occur in the filter output [Jac70, CMP76]. Saturation can be performed whenever overflow may arise, but unless designs meet certain criteria [BL91, Liu98], the filter may never recover from the saturation nonlinearity. Therefore the second design constraint is the optional maximum probability of overflow. When specified, the scaling of each signal is individually determined using the signal's standard deviation and Chebyshev's inequality [GS01] to bound its probability of overflow. If it is not specified, a scaling analysis will be performed to provide sufficient signal scaling based on the maximum simulated peak value. This is explained in further detail in Section 6.5.
6.2.2 Representative Floating-point Input
Since the RightSize tool takes a hybrid approach to optimisation, statistical information on each datapath is gathered via simulation. The optimisation procedure uses this statistical information (such as the probability distributions for the error analysis described in Chapter 5), so the final system will naturally be sensitive to the choice of representative input.

Generally, the input signals should be sufficient to exercise the full dynamic range required of the datapath, otherwise unwanted overflow errors could occur when other data sets are used in practice. Also, the quantisation error produced by the resulting system could violate the user-specified constraints mentioned above when driven by inputs with different statistical properties, although the constraints are guaranteed for the specific set of data provided.
6.3 DFX Conditioning
While running the optimisation, the search for the optimum design parameters may produce parameters that do not meet the design requirements of the DFX modules described in Chapter 4. Also, some of the parameters may be known to be overly pessimistic and can be ruled out from the optimisation. Hence, a fully conditioned DFX annotated computational graph, G(V, S, A_DFX), has parameters that are not overly pessimistic while meeting the design requirements of all DFX modules.

In this section, it is assumed that both the scaling and wordlength parameters for each signal have been pre-specified. An iterative algorithm is proposed at the end of this section to condition any 'ill-conditioned' designs.
6.3.1 Applying Synthesis Restrictions on p0 Parameters
By applying some restrictions to the p0 parameters of certain signals, there are area cost savings to be gained. These signals are the outputs of the DFX Gain Multiplier and DFX Adder V2 arithmetic modules, and of unit delay nodes.

DFX Adder V2

Referring to Section 4.5.2, the DFX Adder V2 is designed with the requirement that inputs A_Num0 and B_Num0 can only result in a Num0 output, S_Num0, in order to reduce the post-adder hardware cost. By using the joint probability distribution of an adder's inputs found through the profiling simulation (Section 5.5.3 and Example 5.4), we can determine the pS0 that meets this requirement for given pA0 and pB0. We also want the smallest possible value for pS0 to utilise the full potential of DFX. While performing DFX conditioning, each adder node in G(V, S, A_DFX) is subjected to Algorithm 6.1.
Algorithm 6.1: Determine the pS0 required for an adder.

Require: 1) The extended 2-D profile table of an adder with inputs A⟨nA, pA0, pA1⟩ and B⟨nB, pB0, pB1⟩. 2) pS1 determined earlier.
Ensure: P_TA001 = 0

  pS0 ⇐ pS1
  repeat
    Determine P_TA001   // For an example, see Example 5.4.
    pS0 ⇐ pS0 − 1
  until P_TA001 ≠ 0
  pS0 ⇐ pS0 + 1
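A Python rendering of the search in Algorithm 6.1 might look as follows; `pta001` stands in for a lookup into the extended 2-D profile table and is an assumed interface, not the RightSize one:

```python
def condition_adder_ps0(p_s1, pta001):
    """Find the smallest output lower scaling pS0 such that the
    probability of a type-TA001 truncation is still zero.

    pta001 -- callable returning P_TA001 for a candidate pS0; in
              RightSize this would be read off the extended 2-D
              profile table (Example 5.4).
    """
    p_s0 = p_s1
    # Walk pS0 down from pS1 while the requirement P_TA001 = 0 holds...
    while pta001(p_s0) == 0:
        p_s0 -= 1
    # ...then step back to the last feasible value.
    return p_s0 + 1
```

The search is linear because P_TA001 is monotone in pS0: lowering pS0 can only add Num0 input combinations whose sum no longer fits the Num0 output range.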
DFX Gain Multiplier

By restricting the DFX Gain Multiplier output's p0 parameter, we can expect hardware cost savings due to the reduced complexity of the output recoder. Therefore, the RightSize tool enforces the restrictions mentioned in Section 4.6.1, which are repeated here for convenience. For a gain multiplier with input A⟨nA, pA0, pA1⟩ and multiplier m⟨nm, pm⟩, the output pQ0 is limited based on the value of m. Algorithm 6.2 is performed on every gain multiplier node when doing DFX conditioning.
Algorithm 6.2: Determine the pQ0 required for a gain multiplier.

Require: The input A⟨nA, pA0, pA1⟩ and multiplier m⟨nm, pm⟩ DFX formats

  if (|m| < 1.0) and (pQ0 < pA0 + pm) then
    pQ0 ⇐ pA0 + pm
  else if (|m| > 1.0) and (pQ0 > pA0 + pm) then
    pQ0 ⇐ pA0 + pm
  else   // m = 1.0 or m = −1.0
    pQ0 ⇐ pA0
  end if
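One possible reading of Algorithm 6.2 is as a clamping rule on pQ0; the treatment of corner cases not listed in the pseudocode (e.g. |m| < 1 with pQ0 already large enough, where pQ0 is left unchanged here) is an assumption of this sketch:

```python
def condition_gain_pq0(m, p_a0, p_m, p_q0):
    """Restrict a gain multiplier output's lower scaling pQ0, following
    the rules of Algorithm 6.2 (illustrative reading, not the
    RightSize implementation)."""
    if abs(m) < 1.0:
        # |m| < 1: the output p0 may not sit below pA0 + pm.
        return max(p_q0, p_a0 + p_m)
    if abs(m) > 1.0:
        # |m| > 1: the output p0 may not sit above pA0 + pm.
        return min(p_q0, p_a0 + p_m)
    # m = 1.0 or m = -1.0: the output format follows the input.
    return p_a0
```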
Unit Delay

When unit delays are linked one after another, there is no reason for the p0 parameter to change from the input to the output. Consider Figure 6.1: when pB0 ≠ pA0, an extra DFX module to recode the output from the first unit delay has to be inserted between the two unit delays. Any change to the DFX signal format apart from a straight wordlength reduction requires additional hardware, which is unnecessary for a chain of unit delays. If there is any need to change the format, it should be done before or after the chain of unit delays, and Algorithm 6.3 is performed during the DFX conditioning stage to aid the synthesis tool by removing unnecessary changes to the p0 parameters.
Figure 6.1: Linked unit delays.
Algorithm 6.3: Determine the output p0 required for a unit delay.

Require: Input A⟨nA, pA0, pA1⟩ DFX format

  if output signal Q drives another unit delay then
    pQ0 ⇐ pA0
  end if
6.3.2 Propagating and Conditioning p1 and n Parameters

Table 6.1: Propagation rules for DFX conditioning.

type(v)       Propagation rules for j ∈ outedge(v)
adder         For inputs A⟨nA, pA0, pA1⟩ and B⟨nB, pB0, pB1⟩:
              p̄j1 = max(pA1, pB1) + 1
              n̄j = p̄j1 + max(p′A0, p′B0)
full mult     For inputs A⟨nA, pA0, pA1⟩ and B⟨nB, pB0, pB1⟩:
              p̄j1 = pA1 + pB1
              n̄j = p̄j1 + (p′A0 + p′B0)
gain mult     For input A⟨nA, pA0, pA1⟩ and multiplier m⟨nm, pm⟩:
              p̄j1 = pA1 + pm
              n̄j = p̄j1 + (p′A0 + p′m)
unit delay    For input A⟨nA, pA0, pA1⟩:
or fork       p̄j1 = pA1
              n̄j = p̄j1 + p′A0
After the restrictions on the p0 parameters have been applied, the remaining p1 and n parameters are propagated from the inputs of each atomic operation through to their outputs, as shown in Table 6.1. Here p̄j1 and n̄j represent the upper limits for the p1 scaling and wordlength parameters, respectively, of the output of an operation. They are determined using the parameters of the operation's input signals. These limits prevent the optimisation routine from considering parameter values that will not yield any benefit.
For the upper scaling parameters p1, this DFX conditioning phase compensates for any overly pessimistic values found during the optimisation phases in Section 6.5. If a pj1 is found to be greater than the propagated p̄j1, then pj1 is set to p̄j1. Similarly, if the wordlength nj is greater than the propagated n̄j, then nj is set to n̄j. The propagated wordlengths are all the sum of p̄j1 together with the maximum LSB-side scaling, ensuring that the output's Num1 number is able to extend its dynamic range fully.

Figure 6.2: Examples of DFX Gain Multiplier output formatting with the binary points aligned. The shaded bits can be omitted without introducing errors.
Consider the example of a DFX Gain Multiplier: Figure 6.2(a) shows the intermediate result before the output is recoded to a desired output format, and Figures 6.2(b)-(d) show its possible output formats. Figure 6.2(b) is an example where the upper scaling factor is pessimistic (pQ1 > pA1 + pm) and the extra shaded bit is merely a sign-extension which can be omitted. Reducing the upper scaling in this manner is guaranteed not to cause any overflow and also reduces the signal's wordlength. The propagated wordlength for the output in Figure 6.2(c) is n̄Q = p̄Q1 + (p′A0 + p′m), and since nQ is greater, there is a pair of extra shaded zero-padding bits that can be omitted without introducing additional errors. Figure 6.2(d) shows an output example that is within the propagated parameters.
6.3.3 Iterative Algorithm
Algorithm 6.4 below is an iterative algorithm to obtain a fully conditioned DFX annotated computational graph G(V, S, A_DFX).

Algorithm 6.4: DFX Conditioning

Input: A DFX annotated computation graph G(V, S, A_DFX)
Output: A properly conditioned DFX annotated computation graph with behaviour identical to the input system

  // First perform conditioning for the p0 parameters...
  repeat
    for all v ∈ V do
      Perform Algorithm 6.1 on v if type(v) = adder
      Perform Algorithm 6.2 on v if type(v) = gain mult
      Perform Algorithm 6.3 on v if type(v) = unit delay
    end for
  until no more changes made

  // ...then condition the p1 and n parameters
  Calculate p̄j1 and n̄j for all signals j ∈ S according to Table 6.1
  repeat
    for all j ∈ S do
      Set pj1 ⇐ p̄j1 if pj1 > p̄j1
      Set nj ⇐ n̄j if nj > n̄j
    end for
    Recalculate n̄j for all affected signals according to Table 6.1
  until no more changes made
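The outer repeat-until loops of Algorithm 6.4 are a fixed-point iteration over node-level rules. A generic sketch, with `condition_node` standing in for Algorithms 6.1-6.3 (an assumed interface), is:

```python
def condition_graph(nodes, condition_node):
    """Apply node-level conditioning rules repeatedly until no rule
    changes any parameter, mirroring the repeat-until structure of
    Algorithm 6.4. Returns the number of passes taken (the final pass
    makes no changes)."""
    changed = True
    passes = 0
    while changed:
        changed = False
        for v in nodes:
            if condition_node(v):   # rule mutates v, reports a change
                changed = True
        passes += 1
    return passes
```

Because each rule only tightens a parameter towards a bound, the iteration terminates: the parameters decrease monotonically and are bounded below.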
6.4 Area Models
The operation of the RightSize tool requires area metric feedback throughout the course of its optimisation process, and it would be very computationally intensive to perform logic synthesis to extract the area metric for each instance. Therefore it is worthwhile to model the area consumption of each module type supported by RightSize at a high level of abstraction. These simple cost models may be evaluated many times throughout the optimisation process with little computational effort.

It is assumed, when constructing the cost model, that dedicated resource binding is used and that designs are resource dominated, such that the area cost of wiring is negligible [Mic94]. With dedicated resource binding, each computational node maps to a physically distinct module element. The construction of the area cost model is greatly simplified with these assumptions, and it is possible to estimate each computational node separately before summing the resulting estimates. It should be noted that, in reality, the logic synthesis performed after the RightSize optimisation is likely to result in some logic optimisation across the boundaries of two connected nodes, giving a lower area. However, experience has shown that these deviations from the area model are small and tend to cancel each other out in large systems, resulting in simply a proportionally slightly smaller area than predicted [Con01].
This section will not explain the details of estimating the area for each individual module, but the following explanation should be sufficient for anyone to recreate the estimation procedure used. Each module can be broken down into a few main parts. For example, a DFX Gain Multiplier (Section 4.6.1) consists of a fixed-point gain multiplier and an output recoder block. The recoder block is in turn made up of a Range-Detector and MUXs for the shifting operations. Therefore, by modelling the basic entities, we can model all the modules that are used in RightSize. The models used in this thesis have been tuned for Virtex4 and Stratix2 FPGAs, and for a UMC 0.13 micron high-density standard cell ASIC library. The estimation models require the DFX annotated computational graph G(V, S, A_DFX) to be conditioned in advance.
Range-Detector (RD)

As mentioned in Section 4.4.1, the Range-Detector is a fairly simple block whose operation depends on the number of bits used for detection. A lookup table, crd[i], contains the size of a Range-Detector, where i is the number of bits used for detection. The lookup table is customised for each hardware platform. The estimated area for the Range-Detector is shown in the example below.

Example 6.1. For the Range-Detector in a DFX Encoder with input A⟨nA, pA⟩ and output X⟨nX, pX0, pX1⟩, the estimated area, Crd, is given by (6.1).

Crd = crd[pX1 − pX0]   where pX1 = pA   (6.1)

□
Multiplexor (MUX)
Multiplexors (MUXs) are used extensively for aligning and shifting signals in DFX, as shown throughout Chapter 4. A lookup table, cmux[i], is used to estimate the area of a MUX, where i is the number of inputs multiplexed. This table is customised for each hardware platform. As a note, when i = 1, the input is 'multiplexed' with a constant ground, or in other words, the MUX is just an AND gate. The example below shows the size of a DFX Decoder, which is made entirely from MUXs.
Figure 6.3: Estimating the area for a DFX Decoder.
Example 6.2. Figure 6.3 shows a DFX Decoder with input A⟨nA, pA0, pA1⟩ and a fixed-point output X⟨nX, pX⟩. For the nA MSBs at the output, 2:1 MUXs are used (i.e. i = 2), but for the lower (nX − nA) bits only AND gates are needed. In this example, when the input is a Num1 the output is the same as the input, but when the input is a Num0 the output is zero (grounded). Therefore, the estimated area for a DFX Decoder, Cdec, is given below in (6.2).

Cdec = cmux[2] · min(nA, nX) + cmux[1] · max(0, nX − nA)   (6.2)

□
Adder
Figure 6.4: Estimating the area for a fixed-point Adder.
The area estimation for the fixed-point adder is reasonably straightforward. FPGAs generally provide good support for fast ripple-carry adder architectures [Kor02]. For example, synthesis tools such as Synplify ASIC from Synplicity, used after RightSize, employ a ripple-carry architecture. The synthesis tools would normally generate a sum-and-carry full-adder for the full width of the adder. However, on some occasions a carry-only full-adder is sufficient, as the sum is not needed by the output, as shown in Figure 6.4. The cost model therefore consists of two coefficients, ca1 and ca2, for the sum-and-carry and carry-only full-adders respectively. These coefficients are customised for the three hardware platforms. The estimated area of the fixed-point adder, CfxAdd, is expressed by (6.3). This model is a modified version of the model proposed in [Con01].

CfxAdd = ca1 · min(nS + 1, pS + 1 + min(p′A, p′B)) + ca2 · max(0, min(p′A, p′B) − p′S)   (6.3)
Gain multiplier

Estimating the area for constant coefficient fixed-point multipliers is significantly more complicated. Typically, a constant gain multiplier is implemented as a series of additions of the partial products generated through recoding schemes such as the classic Booth technique [Boo51]. This makes the area consumption highly dependent on the coefficient value and, in addition, the exact recoding scheme used by the synthesis vendor is known only to the vendor. Ideally, the area estimation would account for any recoding-based implementation, but this is not realised here.

A simple area model is used instead, based on the model proposed in [Con01]. Equation (6.4) estimates the area, CfxGain, for a gain multiplier with input A⟨nA, pA⟩, coefficient multiplier m⟨nm, pm⟩ and output Q⟨nQ, pQ⟩. Through synthesis of several hundred multipliers with varying input and output wordlengths, and varying coefficient values, the coefficients cg1 and cg2 are determined using the least-squares approach for each hardware platform.

CfxGain = cg1 · nA · nm + cg2 · (nA + nm − nQ)   (6.4)
Full multiplier

The area estimation for a fixed-point full multiplier is more predictable than for the gain multiplier. To perform multiplication in parallel, array multipliers [Kor02] are generally used. However, as with the gain multiplier, the exact method used by the individual synthesis vendor is not known, and the logic synthesis would remove any unnecessary logic driving unconnected outputs. Therefore, as with the gain multiplier, a simple area model is used, based on the work in [Con03]. Equation (6.5) estimates the area, CfxMult, for a fixed-point full multiplier with inputs A⟨nA, pA⟩, B⟨nB, pB⟩ and output Q⟨nQ, pQ⟩, using the coefficients cm1 and cm2. These coefficients are also found through the least-squares approach on several hundred synthesized multipliers with varying input and output wordlengths. As before, these coefficients are customised for the three hardware platforms.

CfxMult = cm1 · min(nA, nB) · (max(nA, nB) + 1) + cm2 · (nA + nB − nQ)   (6.5)
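The cost models (6.3)-(6.5) translate directly into small functions. The coefficient values below are placeholders for illustration, not the tuned Virtex4/Stratix2/ASIC figures:

```python
# Placeholder coefficients; the tuned per-platform values are not
# reproduced here.
CA1, CA2 = 1.0, 0.5     # sum-and-carry / carry-only full-adder
CG1, CG2 = 0.6, -0.5    # gain multiplier (least-squares fitted)
CM1, CM2 = 0.7, -0.6    # full multiplier (least-squares fitted)

def c_fx_add(n_s, p_s, pp_a, pp_b, pp_s):
    """Fixed-point adder area, equation (6.3). pp_* are the p' scalings."""
    return (CA1 * min(n_s + 1, p_s + 1 + min(pp_a, pp_b))
            + CA2 * max(0, min(pp_a, pp_b) - pp_s))

def c_fx_gain(n_a, n_m, n_q):
    """Constant-coefficient gain multiplier area, equation (6.4)."""
    return CG1 * n_a * n_m + CG2 * (n_a + n_m - n_q)

def c_fx_mult(n_a, n_b, n_q):
    """Full multiplier area, equation (6.5)."""
    return (CM1 * min(n_a, n_b) * (max(n_a, n_b) + 1)
            + CM2 * (n_a + n_b - n_q))
```

Under the dedicated resource binding assumption, the total area estimate is simply the sum of these per-node costs over the graph.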
6.4.1 Evaluation of Area Models
Table 6.2 compares the actual area of DFX modules on a Virtex4 FPGA with the estimates from the area models in this section. Each module's input and output signal parameters are randomly chosen. We can see that the estimated areas are within ±15% of the actual areas, and the results for the other platforms are similar.
6.5 Determining the Upper Scaling p1 Parameter
Table 6.2: Comparison between the actual area of DFX modules and the estimates from the area models, on a Virtex4 FPGA.

         Adder V1 (LUTs)          Adder V2 (LUTs)
Module   Act.   Est.   Error      Act.   Est.   Error
1        39     34     -12.8%     30     29     -3.3%
2        39     36     -7.7%      31     29     -6.5%
3        56     52     -7.1%      39     38     -2.6%
4        38     35     -7.9%      39     34     -12.8%
5        19     20     5.3%       19     21     10.5%

         Gain Multiplier (LUTs)   Full Multiplier (LUTs)
Module   Act.   Est.   Error      Act.   Est.   Error
1        57     55     -3.9%      109    102    -6.4%
2        76     83     9.7%       170    151    -11.2%
3        82     71     -13.6%     173    181    4.7%
4        50     55     10.0%      185    180    -2.7%
5        54     48     -11.1%     215    227    5.4%

As Num1 is the upper number range of DFX, the upper scaling parameter p1 for each signal must be sufficient to prevent overflow from occurring, and it can be determined independently of the other two DFX parameters. When the optional maximum probability of overflow is specified by the user, the p1 parameters can be determined using Chebyshev's inequality, as discussed in Section 6.5.2. Otherwise, the simulated peak values are used to find the p1 parameters. After this step, the p1 parameters found may be too pessimistic, but this is compensated for when the DFX parameters are propagated and conditioned, as explained in Section 6.3.
6.5.1 Simulated Peak Values

In order to determine the maximum peak value, Ps, of each signal, a simulation run is first performed using the set of user-provided inputs. The standard deviation of the signal, σs, which is used in the next section, is recorded during the same simulation run. Ps is then scaled up by a user-defined 'safety factor' k to provide some guard bits against overflow. Typically k = 4 is used to give 2 overflow guard bits [Con01]. Hence, for signal s, we can derive the upper scaling parameter, ps1, as

ps1 = ps_sim = ⌊log2(kPs)⌋ + 1   (6.6)

where ⌊·⌋ returns the largest integer less than or equal to its argument.

Using the simulated peak values is fine in most cases, provided there is no feedback in the design. Designs with feedback, e.g. IIR filters, may suffer from limit cycles and irrecoverable oscillations when overflow occurs, as mentioned in Section 2.3.
6.5.2 Chebyshev’s Inequality
Using Chebyshev's inequality, the maximum probability of a zero-mean DFX signal overflowing can be determined from the signal's variance and its maximum representable magnitude.

The single-tailed Chebyshev inequality [GS01] states that

Pr[|X − µX| ≥ A] ≤ σ²X / A²   (6.7)

where µX and σ²X are the mean and variance of a variable X.

Letting A be the magnitude of the largest DFX representable number, i.e. A = 2^(ps_chev), the probability of overflow for a zero-mean signal s ∈ S is bounded by

Pr[overflow] = Pr[|s| ≥ 2^(ps_chev)] ≤ σ²s / 2^(2 ps_chev)   (6.8)
Using (6.8), we define the maximum probability of overflow, λ, as

λ = σ²s / 2^(2 ps_chev)   (6.9)

Hence, by rearranging (6.9), we can determine the scaling parameter ps_chev for signal s ∈ S from the standard deviation, σs, collected during the simulation run of the previous section and the user-specified maximum probability of overflow, λ, as shown below.

ps_chev = log2 σs − (1/2) log2 λ   (6.10)
If λ is set too high, overflow is bound to occur. To prevent this, an extra condition is set on the upper scaling parameter according to (6.11). If the maximum probability of overflow constraint is set by the user, the upper scaling ps1 would normally take the value of ps_chev. However, if the scaling parameter found through simulation, ps_sim (6.6), is higher than ps_chev, then ps1 takes the value from the simulation method and a note of this is logged for the user to act on.

ps1 = max(ps_sim, ps_chev)   (6.11)
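Equations (6.6), (6.10) and (6.11) can be combined into one routine; rounding the Chebyshev scaling up to the next integer is an assumption made here for illustration:

```python
import math

def p1_from_peak(peak, k=4):
    """Upper scaling from the simulated peak value, equation (6.6)."""
    return math.floor(math.log2(k * peak)) + 1

def p1_from_chebyshev(sigma, lam):
    """Upper scaling from Chebyshev's inequality, equation (6.10),
    rounded up to an integer scaling (illustrative choice)."""
    return math.ceil(math.log2(sigma) - 0.5 * math.log2(lam))

def upper_scaling(peak, sigma, lam=None, k=4):
    """Combined rule (6.11): use the Chebyshev scaling when a maximum
    overflow probability lam is given, but never fall below the
    simulation-derived scaling."""
    p_sim = p1_from_peak(peak, k)
    if lam is None:
        return p_sim
    return max(p_sim, p1_from_chebyshev(sigma, lam))
```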
6.6 The Optimisation Problem, Formulated

Given a computation graph G(V, S), Section 6.5 described how the upper scaling vector p1 can be found, either using the results from simulation or with the Chebyshev inequality and a user-specified maximum probability of overflow λ. Combining the area models presented in Section 6.4 into a single area measure on G gives a cost metric C_G(n, p0, p1). Together with the error variance model E_k (5.26), combined into a vector E_G(n, p0, p1) with one element per output, this allows the DFX parameter optimisation problem stated in Problem 6.1 to be formulated. Here E denotes the user-specified bound on the output error given by the user's output SNR constraint.
Problem 6.1. Given a computation graph G(V, S), the DFX parameter optimisation problem may be defined as selecting n and p0 such that C_G(n, p0, p1) is minimized subject to (6.12).

n ∈ ℕ^|S|,   p0 ≤ p1,   E_G(n, p0, p1) ≤ E   (6.12)

□
The condition p0 ≤ p1 does not restrict all signals to be purely DFX. Any optimisation procedure should strike a balance between DFX and fixed-point formats for the signals to minimize the area cost.
6.7 Exploring the Feasibility of DFX
Take a DFX signal X with format ⟨n, p0, p1⟩ and let K be the probability of signal X being in the Num0 range. Using Equation (5.3), the error variance of signal X is given by Equation (6.13) below, assuming truncation from infinite precision to simplify the equation. We can see that as K decreases, the error variance increases.

E{e²_DFX} = (1/12) · 2^(−2n) · (K·2^(2p0) + (1−K)·2^(2p1))   (6.13)
Using Chebyshev's inequality (6.7), the maximum probability of a zero-mean DFX signal being in the Num1 range, (1 − K), can be determined from the binary point p0 and the signal's variance, σ². This time, letting A = 2^p0, which is the boundary between the Num0 and Num1 ranges, we can determine the maximum probability of a signal being in the Num1 range and rearrange it to get the lower limit for K (6.14).

(1 − K) = Pr[|X| ≥ 2^p0] ≤ σ² / 2^(2p0)

K ≥ 1 − σ² / 2^(2p0)   (6.14)
Assuming K is at its lowest limit (i.e. K = 1 − σ²/2^(2p0)) and substituting it into (6.13), we get the error variance below.

E{e²_DFX} = (1/12) · 2^(−2n) · ((1 − σ²/2^(2p0))·2^(2p0) + (σ²/2^(2p0))·2^(2p1))
          = (1/12) · 2^(−2n) · (2^(2p0) − σ²·(1 − 2^(2p1)·2^(−2p0)))   (6.15)
Taking the 1st and 2nd derivatives of Equation (6.15), we get

dE/dp0 = (1/6) · 2^(−2n) · ln 2 · (2^(2p0) − σ²·2^(2p1)·2^(−2p0))

d²E/dp0² = (1/3) · 2^(−2n) · (ln 2)² · (2^(2p0) + σ²·2^(2p1)·2^(−2p0))   (6.16)

To obtain the lowest error variance, we examine the stationary point where dE/dp0 = 0. Ignoring negative and imaginary roots, the value of p0 that minimises the error variance at the lowest value of K is

p0 = (1/2)·(p1 + log2 σ)   (6.17)
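The stationary point (6.17) can be checked numerically against a brute-force minimisation of (6.15); the values of n, p1 and σ below are arbitrary illustrative choices:

```python
import math

def dfx_error_variance(n, p0, p1, sigma):
    """Equation (6.15): error variance at the Chebyshev-limited K."""
    k = 1.0 - sigma**2 / 2.0**(2 * p0)
    return (2.0**(-2 * n) / 12.0) * (k * 2.0**(2 * p0)
                                     + (1 - k) * 2.0**(2 * p1))

def best_p0(p1, sigma):
    """Equation (6.17): stationary point of (6.15)."""
    return 0.5 * (p1 + math.log2(sigma))

# Brute-force check that (6.17) really minimises (6.15) over a grid.
n, p1, sigma = 12, 8, 2.0
candidates = [p / 100.0 for p in range(100, 801)]
numeric = min(candidates, key=lambda p0: dfx_error_variance(n, p0, p1, sigma))
analytic = best_p0(p1, sigma)   # 0.5 * (8 + 1) = 4.5
```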
The equation for the p0 parameter above does not guarantee an area-optimised implementation, as it does not take any area metric into account. However, by placing a relationship between fixed-point and DFX implementation costs, we can make a meaningful DFX/fixed-point comparison and develop an understanding of the types of design and conditions for which DFX is best suited.
From Chapter 4, we know that a DFX implementation is always larger than a fixed-point implementation of the same wordlength due to the added overhead cost; a DFX implementation is θ times larger than the equivalent wordlength fixed-point implementation [ECC04]. Therefore a wordlength-multiplier, θ, is introduced such that

m = θn   (6.18)

to equate the area cost of a fixed-point implementation with wordlength m and a DFX implementation with wordlength n. A smaller value of θ is desirable for DFX, as it implies that the DFX implementation's additional overhead cost (on top of an equivalent fixed-point implementation) is relatively cheap.

Using (6.18), the error variance of a fixed-point signal with wordlength m and scaling p is given by (6.19). For the fixed-point scaling p, it makes sense to set p = p1 in order to match the highest representable value of DFX.

E{e²_FX} = (1/12)·2^(−2m+2p) = (1/12)·2^(−2θn+2p1)   (6.19)
For DFX to be a better number representation than fixed-point, we want
the error variance of DFX to be less than the error variance of fixed-point. In
other words, we want inequality (6.20) to be satisfied.

E\{e_{DFX}^2\} \le E\{e_{FX}^2\}    (6.20)
Substituting equations (6.15), (6.17) and (6.19) into inequality (6.20), we
have determined the upper limit, θ̄, for the wordlength-multiplier θ, shown below
in (6.21). Equation (6.21) implies that when θ is below the limit, a DFX
implementation would have a lower error variance than a fixed-point implementation
with equivalent chip area. In other words, the DFX implementation
might have the flexibility to sacrifice error variance to optimise chip area. A
high value of θ̄ gives a DFX implementation more flexibility to absorb the
extra DFX hardware overhead over fixed-point.

\theta < \bar{\theta} = 1 + \frac{1}{2n}\left[2p_1 - \log_2\left(\sigma 2^{p_1+1} - \sigma^2\right)\right]    (6.21)
Examining the inequality above further, we can deduce that the limit θ̄
increases when a signal has a small variance but needs a large p1 (to prevent
overflow). Such signals have high kurtosis [SM94], i.e. the signal values
are distributed close to the mean (assumed zero here) but the distribution has
fat tails. In order to represent these signals, a number representation with a wide
dynamic range is needed. Also, the inverse relationship between the wordlength n and θ̄
means DFX forms a better representation when a design is able to tolerate a large
error, i.e. a low n, since shorter wordlengths lead to larger errors. Figure 6.5
illustrates the limit θ̄ given by Equation (6.21) for the cases n = 10
and n = 30.
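The behaviour of the limit in Equation (6.21) can be sketched numerically; the parameter values below are illustrative only. The limit rises as the signal variance shrinks (for a fixed p1, i.e. a wider dynamic range) and as the wordlength n shrinks, matching the discussion above.

```python
import math

def theta_bar(n, p1, sigma):
    # Equation (6.21): theta_bar = 1 + (1/(2n)) * (2*p1 - log2(sigma*2^(p1+1) - sigma^2))
    return 1.0 + (2 * p1 - math.log2(sigma * 2.0 ** (p1 + 1) - sigma ** 2)) / (2.0 * n)

p1 = 4
# Smaller signal variance (same p1, i.e. wider dynamic range) raises the limit...
assert theta_bar(10, p1, sigma=1e-3) > theta_bar(10, p1, sigma=1e-1)
# ...and so does a shorter wordlength n (a design that tolerates more error).
assert theta_bar(10, p1, sigma=1e-3) > theta_bar(30, p1, sigma=1e-3)
```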
To view the limit θ̄ from a different perspective, we shall use the equation
for p1 found with the Chebyshev inequality (Section 6.5.2). Substituting p1 from
(6.10) into (6.21), we get

\theta < \bar{\theta} = 1 + \frac{1}{2n}\log_2\left(2\lambda^{\frac{1}{2}} - \lambda\right)^{-1}    (6.22)
Taking p1 from (6.10) and p0 from (6.17) and substituting them into the error
variance (6.15), we get the error variance for DFX in terms of the maximum
overflow probability, λ, and the wordlength, n.

E\{e_{DFX}^2\} = \frac{\sigma^2}{12}\left(2\lambda^{-\frac{1}{2}} - 1\right)\cdot 2^{-2n}    (6.23)
Figure 6.5: Boundary of the wordlength-multiplier using Equation (6.21) with varying p1 scaling and signal variance. (Plot omitted: axes are p1, the signal variance σx² and the wordlength-multiplier limit θ̄, with surfaces for n = 10 and n = 30; the limit increases as n decreases.)
Hence, with the SNR defined as the ratio of signal power to noise power (i.e.
SNR = σ²/E), rearranging the equation above gives us

n = \frac{1}{2}\log_2\left(\frac{SNR}{12}\left(2\lambda^{-\frac{1}{2}} - 1\right)\right)    (6.24)
Finally, substituting (6.24) into (6.22) gives (6.25), where the limit θ̄ is
expressed as a function of the maximum overflow probability λ and the SNR. Figure 6.6
illustrates equation (6.25). From the graph, we can see that the θ̄ limit is higher
when the required SNR and the maximum overflow probability are small.

\theta < 1 + \frac{\log_2\left(2\lambda^{\frac{1}{2}} - \lambda\right)^{-1}}{\log_2\left(\frac{SNR}{12}\left(2\lambda^{-\frac{1}{2}} - 1\right)\right)}    (6.25)
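Equation (6.25) is easy to evaluate directly. The sketch below (with SNR as a linear power ratio, converted from dB) reproduces the trend described above: the limit is higher for a low required SNR and a low overflow probability.

```python
import math

def theta_limit(snr, lam):
    # Equation (6.25): 1 + log2((2*sqrt(lam) - lam)^-1) / log2((SNR/12)*(2/sqrt(lam) - 1))
    num = math.log2(1.0 / (2.0 * math.sqrt(lam) - lam))
    den = math.log2((snr / 12.0) * (2.0 / math.sqrt(lam) - 1.0))
    return 1.0 + num / den

def db(x):
    # Convert dB to a linear power ratio.
    return 10.0 ** (x / 10.0)

# A lower required SNR gives a higher limit...
assert theta_limit(db(15), 1e-8) > theta_limit(db(60), 1e-8)
# ...as does a lower maximum overflow probability.
assert theta_limit(db(60), 1e-12) > theta_limit(db(60), 1e-4)
```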
In practice, attempting to ascertain the wordlength multiplier, θ, is not
a trivial task, as it depends on a variety of factors such as the number of
different atomic operations in a design and their input and output signal
parameters. Note also that, because the analysis in this section is based on a single
signal, the inequalities require all signals in the design to have the
same variance and/or the same p1 parameter. The wordlength-multiplier
inequalities discussed in this section serve to provide a glimpse of the possible
scenarios in which DFX is better than fixed-point.

Figure 6.6: Boundary of the wordlength-multiplier with varying maximum probability of signal overflow, λ, and SNR. (Plot omitted: overflow probability from 10⁻²⁰ to 10⁰, SNR from 0 to 250 dB.)
In general, DFX is better suited for a design when the output SNR constraint
is low and when the internal signals need a wide dynamic range. Also,
when a constraint on the maximum overflow probability λ is provided, DFX is
the better choice for low λ. These properties are demonstrated
by the results of the case studies in Section 6.9.
6.8 Meta-Heuristic Approach to Optimisation Using Simulated Annealing

To determine the best wordlength and scaling parameters for the signals, this
section describes an optimisation approach using the simulated annealing meta-heuristic.
As mentioned in this chapter's introduction, the signals in a design
may be either DFX or fixed-point and, looking back at Figure 4.16, a mixture
of DFX and fixed-point signals leads to a non-convex search space.
Simulated annealing is an optimisation algorithm for problems that are NP-hard
[CW02] and have a non-convex search space. After a brief background
on simulated annealing, the two-phase optimisation routine for RightSize is
introduced.
6.8.1 Background to simulated annealing

Simulated annealing (SA) is a flexible optimisation method that is suited to
large-scale combinatorial optimisation problems. Kirkpatrick et al. [KGV83]
were the first to use a generalised form of the Metropolis Monte Carlo scheme
[MRRT53] for optimisation. At about the same time, Cerny [Cer85] and Pincus
[Pin70] independently developed the algorithm further into what is now
known as simulated annealing. SA has been successfully applied to classical
combinatorial optimisation problems, such as the travelling salesman problem,
and to problems concerning the design and layout of very-large-scale integration
(VLSI) circuits. SA differs from standard iterative optimisation methods by
allowing 'uphill' moves – moves that spoil, rather than improve, the temporary
solution [DH96]. Refer to [Haj85, JAMS89, JAMS91] for surveys of the SA
algorithm and its use.
The problems for which SA optimisation is useful are characterised by a very
large discrete search space over which the objective cost function we want to minimise
(or maximise) is non-convex. When the search space is
non-convex, some optimisation algorithms (e.g. the greedy
algorithm) tend to get trapped in a local minimum. Simulated annealing prevents this
by using rules derived from an analogy to the process in which liquids
are cooled to a crystalline form, a process called annealing. Each iterative
step of the SA algorithm replaces the current state by a random state within
its neighbourhood with a probability that is a function of the state cost and
a temperature parameter, T, which gradually 'cools'. When the temperature
is 'hot', a greater number of large random changes to the current state is
allowed, preventing the search from becoming stuck in a local minimum. As
the temperature cools, only incrementally better changes to the current state
are allowed, which narrows the search down towards the global minimum.
A readily available Adaptive Simulated Annealing (ASA) code by Ingber
[Ing93] is adapted for the RightSize optimisation procedure to optimise the DFX
parameters. The fundamental workings of the ASA algorithm are fairly simple.
Firstly, a feasible starting point is chosen to initialise the current state x, its
associated cost, C, is found with a cost function, and the initial temperature is
set to T = T0 (usually a large value) to start the annealing process. State
x is a vector of the parameters we want to optimise. The exponential
temperature 'cooling' schedule of ASA rapidly reduces the temperature
to narrow the search, regardless of the initial value of T0 [Ing93]. The annealing
then consists of the cyclic repetition of the following steps until a
termination criterion is satisfied.
Step 1. Generation and evaluation: A random trial state x′ is generated by
a uniform distribution generator and its associated cost C′ is evaluated by the
cost function. If the trial state x′ fails any user-derived constraint, this step
is repeated until a valid x′ is found. Doing so does not violate the proof that a global
minimum can statistically be obtained with ASA [Ing96].
Step 2. Examination: If C′ < C, the trial state is accepted (i.e. x and
C are set equal to x′ and C′, respectively). If C′ ≥ C, the trial state is
accepted with a probability defined by the Boltzmann probability distribution,
exp(−(C′ − C)/T) [Ing96]. When a trial state is accepted, an acceptance counter
Acc is incremented.
Step 3. Temperature cooling: An iteration counter k is incremented and the
exponential cooling schedule (6.26) is applied to T. Here, the parameters ψ
(Temperature_Ratio_Scale) and ϕ (Temperature_Anneal_Scale) are tuning
parameters supplied by ASA. They are kept at their default values of ψ =
−ln(1.5×10⁻⁵) and ϕ = ln(100.0) [Ing96].

T = T_0\exp\left(-\psi\left(k\,e^{\varphi}\right)^{1/|x|}\right)    (6.26)
The termination criterion is set up so that (1) the number of accepted
states, Acc, is limited to a maximum parameter, Accmax, and (2) the algorithm
is stopped when no improvement to the cost is recorded after 500
iterations. When ASA ends, the final optimised state x and its cost C are
made available.
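A minimal sketch of the loop described in Steps 1–3 is given below. It is illustrative only — it is not Ingber's ASA code, and ASA's constraint checking and re-annealing features are omitted — but it shows the Boltzmann acceptance rule and the exponential cooling schedule (6.26).

```python
import math, random

def anneal(cost, neighbour, x0, T0=10.0,
           psi=-math.log(1.5e-5), phi=math.log(100.0),
           acc_max=5000, stall_limit=500):
    """Simplified SA loop; `cost` and `neighbour` are problem-specific."""
    x, C = list(x0), cost(x0)
    acc = k = stall = 0
    while acc < acc_max and stall < stall_limit:
        x_trial = neighbour(x)                     # Step 1: random trial state
        C_trial = cost(x_trial)
        k += 1
        T = T0 * math.exp(-psi * (k * math.exp(phi)) ** (1.0 / len(x)))  # (6.26)
        # Step 2: always accept improvements; accept uphill moves with
        # probability exp(-(C' - C) / T) (Boltzmann).
        if C_trial < C or random.random() < math.exp(-(C_trial - C) / max(T, 1e-300)):
            stall = 0 if C_trial < C else stall + 1
            x, C, acc = x_trial, C_trial, acc + 1
        else:
            stall += 1
    return x, C

# Toy usage: minimise (x - 7)^2 over the integers.
random.seed(42)
best_x, best_C = anneal(lambda v: (v[0] - 7) ** 2,
                        lambda v: [v[0] + random.choice([-1, 1])], [0])
```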
From the overview above, we can see that the speed of the SA algorithm
depends on three major factors: the maximum number of accepted states,
Accmax; the speed of evaluating the cost function; and the ease of generating valid
states x′ that meet the user constraints in Step 1. These factors can be
managed to trade off the speed of the optimisation against the quality of its
result. Typically, there is a greater chance of securing better results if more
time is spent on the optimisation.
Elaborating on the ease of generating valid states x′: the random state
selection in Step 1 gives rise to the possibility of generating states that
violate the user's constraints, in which case most of the time would be spent looping
within the same step. This would be the case if the SA algorithm tried to
optimise both the parameter vectors n and p0 together to meet the constraints
set up in Problem 6.1. A simple 4th order IIR filter optimisation (as in the
later case study, Section 6.9.1) could take 74 hours to complete on a PC (Intel
Pentium IV 2.4GHz). Hence the proposed two-phase optimisation separately
optimises the lower scaling vector p0 followed by the wordlength vector n.
6.8.2 Optimisation Algorithm

A heuristic approach has been developed to find feasible lower scaling, p0,
and wordlength, n, vectors for a design with small, though not necessarily
optimal, area consumption. Figure 6.7 shows the flow of the optimisation
algorithm. Before performing the heuristic optimisation, some pre-optimisation
is needed. Starting with the computational graph G(V, S), the perturbation
analysis (Section 2.5.4) and profiling simulation (Section 5.5.1) are
first performed to enable the error analysis. Then the upper scaling
vector p1 is determined as in Section 6.5 to meet the user's requirements. The
optimisation heuristic is divided into two phases: Phase 1 attempts to find the
optimum set for p0 while Phase 2 finds the optimum set for n. Phase 1 uses
both Algorithm 6.5 and Algorithm 6.6, while Phase 2 uses only Algorithm 6.5.
Algorithm 6.5 (OptNOnly) optimises the wordlength parameters, n, to minimise
the design cost for a DFX annotated computational graph G(V, S, ADFX)
where both the p0 and p1 parameters are known. The minimum uniform wordlength
for all the signals which satisfies the user constraints is used as the initial
state. SA's iterative loops are then performed until the number of accepted
states reaches Acc = Accmax or the cost keeps repeating. The cost objective
function in this algorithm is determined using the area models presented
in Section 6.4. At the end, this algorithm returns the optimised vector of n
parameters and also the optimised cost of G(V, S, ADFX).
On the other hand, Algorithm 6.6 (OptP0Only) optimises the lower scaling
parameters, p0, for a DFX annotated computational graph G(V, S, ADFX),
where the objective is also to minimise the cost of G(V, S, ADFX). This time,
only the p1 parameters are known and kept constant. To generate the initial
state p0, Equation (6.17) is applied to the lower scaling parameter of every signal
s ∈ S such that

p_0^s = \frac{1}{2}\left(p_1^s + \log_2\sigma_s\right).    (6.27)

This algorithm then performs iterative SA loops until Acc = 5000 or the
cost keeps repeating. Instead of using the area models to determine the
cost, Algorithm 6.5 (OptNOnly) is called for each trial state p′0 with
Accmax = k log2(|S|), where |S| is the number of signals. The parameter k is
a constant that controls the amount of time spent looking for an optimum
solution and thus determines the level of optimisation. OptNOnly's Accmax
is set comparatively smaller than the typical value of 5000 to quickly gauge
the cost of the trial state. Experiments have shown that setting k = 10 is
typically sufficient to gauge the cost of the trial state p′0 without being too
computationally intensive. At the end, this algorithm returns the optimised
vector of p0 parameters.
Therefore, Phase 1 consists of Algorithm 6.6 (OptP0Only), which uses Algorithm
6.5 (OptNOnly) within its iterative loop to give the optimised lower scaling
vector p0. Once Phase 1 is complete, Phase 2 optimises the wordlength
vector n with Algorithm 6.5 (OptNOnly) again, but this time it is allowed to
terminate regularly with Accmax = 5000. Splitting the optimisation into two
phases dramatically reduces the optimisation time: the same 4th order IIR
filter optimisation mentioned at the end of the previous section took just 45
minutes to complete on the same test machine. Figure 6.8 shows the optimisation
times and the resulting area with respect to the level of optimisation k for a
4th order IIR filter.
Algorithm 6.5 (OptNOnly): Optimise wordlength vector n with simulated annealing.
Input: G(V, S) with scaling vectors p0 and p1, and Accmax
Output: Wordlength vector n together with cost C_G(n, p0, p1)
1: Find u, the minimum uniform wordlength satisfying (6.12) with n = u·1
2: Set n ⇐ ku·1
3: Initialise temperature T = T0
4: Set cost C ⇐ C_G(n, p0, p1)
5: repeat
6:   Generate new random vector n′ in the neighbourhood of n
7:   DFX condition graph G(V, S, Adfx) where Adfx = (n′, p0, p1)
8:   Go back to line 6 if (6.12) is not satisfied
9:   Set cost C′ ⇐ C_G(n′, p0, p1)
10:  if (C′ < C) or (random < exp(−(C′ − C)/T)) then   // Accept new state
11:    Set n ⇐ n′, C ⇐ C′ and increment Acc
12:  end if
13:  Decrease T according to cooling schedule (6.26)
14: until (Acc > Accmax) or (cost keeps repeating)
15: return vector n and C
Algorithm 6.6 (OptP0Only): Optimise lower scaling vector p0 with simulated annealing.
Input: G(V, S) with upper scaling vector p1, and Accmax
Output: Lower scaling vector p0
1: Initialise p0 with Eqn. (6.27) ∀s ∈ S
2: Initialise temperature T = T0
3: Set cost C ⇐ (Algorithm 6.5 with p0, and Accmax = k log2(|S|))
4: repeat
5:   Generate new random vector p′0 in the neighbourhood of p0
6:   Set cost C′ ⇐ (Algorithm 6.5 with p′0 and Accmax = k log2(|S|))
7:   if (C′ < C) or (random < exp(−(C′ − C)/T)) then   // Accept new state
8:     Set p0 ⇐ p′0, C ⇐ C′ and increment Acc
9:   end if
10:  Decrease T according to cooling schedule (6.26)
11: until (Acc > Accmax) or (cost keeps repeating)
12: return lower scaling vector p0
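The nesting of the two algorithms can be sketched as follows. The search routines here are deliberately simplified greedy placeholders (the real Algorithms 6.5 and 6.6 use simulated annealing), and the cost function, wordlength bounds and iteration counts are purely illustrative; only the two-phase structure — Phase 1 costing each trial p0 with a short OptNOnly run, Phase 2 re-running OptNOnly in full — mirrors the text.

```python
import math, random

def opt_n_only(p0, p1, acc_max, cost):
    """Placeholder for Algorithm 6.5 (OptNOnly): greedy search over n."""
    n = [16] * len(p0)                        # uniform initial wordlengths
    best = cost(n, p0, p1)
    for _ in range(int(acc_max)):
        trial = [max(2, w + random.choice([-1, 0, 1])) for w in n]
        c = cost(trial, p0, p1)
        if c < best:
            n, best = trial, c
    return n, best

def opt_p0_only(p1, sigmas, cost, k=10, iters=50):
    """Placeholder for Algorithm 6.6 (OptP0Only): each trial p0 is costed
    by a short OptNOnly run with Acc_max = k * log2(|S|)."""
    inner = k * math.log2(len(p1))
    p0 = [0.5 * (p + math.log2(s)) for p, s in zip(p1, sigmas)]   # Eqn (6.27)
    _, best = opt_n_only(p0, p1, inner, cost)
    for _ in range(iters):
        trial = [p + random.choice([-0.5, 0.0, 0.5]) for p in p0]
        _, c = opt_n_only(trial, p1, inner, cost)
        if c < best:
            p0, best = trial, c
    return p0

def two_phase(p1, sigmas, cost):
    p0 = opt_p0_only(p1, sigmas, cost)            # Phase 1
    n, c = opt_n_only(p0, p1, 5000, cost)         # Phase 2: full Acc_max
    return n, p0, c

# Toy cost: area grows with total wordlength; a quadratic penalty pulls p0
# towards the Eqn (6.27) sweet spot for sigma = 0.1.
random.seed(0)
toy = lambda n, p0, p1: sum(n) + sum((a - 0.5 * (b + math.log2(0.1))) ** 2
                                     for a, b in zip(p0, p1))
n, p0, c = two_phase([4, 4, 4], [0.1, 0.1, 0.1], toy)
```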
Figure 6.7: Flow of the DFX optimisation with simulated annealing. (Diagram omitted: pre-optimisation performs the perturbation analysis (Section 2.5.4) and profiling simulation (Section 5.5.1) on the computation graph G(V, S) and determines p1 (Section 6.5); Phase 1 runs Algorithm 6.6 (Opt-p0Only), which calls Algorithm 6.5 (Opt-nOnly) with Accmax = k log2(|S|), to obtain p0; Phase 2 runs Algorithm 6.5 (Opt-nOnly) with Accmax = 5000 to obtain n, yielding the optimised DFX annotated computation graph G(V, S, Adfx).) The detailed flow diagrams of Algorithms 6.5 and 6.6 are not shown and are denoted by broken arrow lines; refer to the respective algorithms on page 168.
Figure 6.8: The optimisation times and the area with respect to the level of optimisation k for a 4th order IIR filter on ASIC. (Plot omitted: area in cells on the left axis, optimisation time in minutes on the right axis, k from 2 to 14.)
6.9 Case Study and Discussion

For the case study, an infinite impulse response (IIR) filter and a least-mean-square
(LMS) adaptive filter were implemented. The IIR filter is frequently used
in audio processing or general filtering of signal frequencies, whereas LMS
adaptive filters are used in communications to compensate for multi-path
distortion or in feedback control applications. Both filters are part of the BDTI
DSP benchmark suite [Ber00] used to measure the performance of DSP processors.

Both designs are optimised with the two-phase ASA optimisation (ASA2)
of Section 6.8.2 with multiple wordlengths and multiple scalings, which may
result in designs with a mix of DFX and fixed-point signals. Both designs are
benchmarked against designs optimised with a completely fixed-point (FX)
optimisation using the multiple wordlength optimisation heuristic by Constantinides
[CCL01], explained in Section 2.5.4.
Each design has been implemented on the same three platforms as in Chapter 4.
To reiterate, the FPGAs are the Xilinx Virtex4
(XC4VLX15) and Altera Stratix2 (EP2S15), and the ASIC designs use the UMC
0.13µm High Density Standard Cell Library. All designs are synthesized using
Synplicity's Synplify Pro (for FPGAs) or Synplify ASIC. The FPGA designs are then
placed and routed using the respective platform vendor's software.
The first test varies the output SNR without any constraint on the maximum
overflow probability; this demonstrates inequality (6.21) and
shows that DFX may be more area-cost-effective when signals have a small variance
and high noise is tolerable, i.e. a low SNR. For this test, two different sets
of input signals with different variances were used.

The second test uses two different fixed output SNRs while varying the
maximum overflow probability λ. This test demonstrates inequality
(6.25) and shows that when the maximum overflow probability λ is provided,
DFX may be more area-cost-effective when λ and the SNR are small. For
all cases, the cost function is minimised and the two tests are performed
on each design.
6.9.1 IIR Filter

The IIR filter used in this case study is a 4th order high-pass filter made from
two cascaded direct-form-II 2nd order filters [OS99]. Figure 6.9 shows its data-flow
graph and its frequency response. This high-pass filter has been
generated using the Filter Design & Analysis tool in Matlab, allowing signals
with normalised frequencies of 0.7π and above to pass through.
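The cascade structure of Figure 6.9(a) can be sketched in software. The coefficients below are hypothetical placeholders (two identical biquads with a double zero at z = 1 and poles at radius 0.5), not the Matlab-generated thesis coefficients; the point is the direct-form-II second-order section, which uses two delay elements per stage.

```python
def df2_biquad(b, a, x, g=1.0):
    """Direct-form-II second-order section. b = (b0, b1, b2), a = (1, a1, a2);
    the input is pre-scaled by the section gain g (the g(1) block in Fig. 6.9)."""
    b0, b1, b2 = b
    _, a1, a2 = a
    w1 = w2 = 0.0                                # the two z^-1 state registers
    y = []
    for xn in x:
        w0 = g * xn - a1 * w1 - a2 * w2          # recursive (feedback) half
        y.append(b0 * w0 + b1 * w1 + b2 * w2)    # feed-forward half
        w1, w2 = w0, w1
    return y

def cascade(sections, x):
    for g, b, a in sections:
        x = df2_biquad(b, a, x, g)
    return x

# Hypothetical high-pass-like sections: the (1 - z^-1)^2 numerator blocks DC.
sections = [(1.0, (1.0, -2.0, 1.0), (1.0, 0.0, 0.25)),
            (1.0, (1.0, -2.0, 1.0), (1.0, 0.0, 0.25))]

dc = cascade(sections, [1.0] * 64)                         # constant input
alt = cascade(sections, [(-1.0) ** i for i in range(64)])  # Nyquist-rate input
```

After the transient decays, the constant (DC) input is blocked entirely, while the Nyquist-rate input passes with gain (4/1.25)² = 10.24, as expected from the per-section transfer function H(z) = (1 − z⁻¹)²/(1 + 0.25z⁻²).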
For the first test, two different 100,000-sample zero-mean audio inputs sampled at
44.1kHz were used. The inputs have different sample variances:
(i) 1.6×10⁻⁵ (lower variance) and (ii) 1.2×10⁻² (higher variance). Both input
samples have also been normalised so that they have the same maximum magnitude,
and therefore the upper scaling parameters used for both inputs are the same.
The optimisation was done with varying output SNR but without a constraint on
the maximum probability of overflow.

Figure 6.9: Case study 4th order IIR filter. (Diagram omitted: (a) cascaded structure with two second order filters; (b) frequency response.)
The cost of the IIR filter optimised with the lower-variance input on ASIC
is illustrated in Figure 6.10(a) with varying output SNR. Results
on the FPGA platforms follow the same trend and are hence not shown. As a
summary, the percentage ratio of the area cost after an ASA2 optimisation
over the area cost after an FX optimisation is plotted in Figures 6.11(a)&(b)
for the inputs with variance (i) 1.6×10⁻⁵ and (ii) 1.2×10⁻² respectively. All three
hardware platforms are shown on scatter plots together
with their respective trend lines obtained through linear regression.
Referring to the discussion in Section 6.7, the DFX number representation
can provide a better cost than fixed-point when the required output SNR is low
and when there are wide-dynamic-range signals. The ASA2 optimisation produces
synthesized designs which are about 3% better than a design from an
FX optimisation when the output SNR is 15dB. The percentage improvement
reduces as the required output SNR is increased. When the higher-variance
input is used, there is hardly any improvement, as a fully fixed-point
design is the most efficient implementation of the filter.

Figure 6.10: Area of the ASIC 4th order IIR filter with the lower variance input. (Plots omitted: (a) varying output SNR; (b) varying overflow probability λ; both panels compare the area in cells of the FX- and ASA2-optimised designs.)
Analysing each signal individually for the case of the lower-variance input,
both ASA2 and FX optimised most signals to fixed-point apart from some
signals near the input. ASA2 optimised these signals to a DFX format, which
accounts for its area cost improvement over the FX optimisation. This is because
the input with lower variance has a wider dynamic range than the input
with higher variance (both sets of inputs have been normalised, so their
maximum absolute values are the same). The ASA2 optimisation took advantage
of this and used DFX on the signals near the input, since DFX has a wider
dynamic range than fixed-point. The other signals do not vary much in
magnitude, and therefore fixed-point is the most efficient way to represent them.
When the higher-variance input was used, there were no area cost savings from
using DFX at the inputs and the whole design was optimised using fixed-point.
For the second test, the larger-variance input was used. This time the
maximum overflow probability λ is varied with the output SNR constraint
fixed to 15dB and 60dB. Figure 6.10(b) shows the synthesized cost on the
ASIC platform when the output SNR constraint is 15dB. The results on the FPGA
platforms again show a similar trend and are not shown. Figures 6.11(c)&(d) show
the percentage ratio of the ASA2 over FX area cost for all three
hardware platforms. They are scatter plots similar to those of the previous
test, with a linear regression trend line for each hardware platform.
From Figure 6.11(c), when the output SNR is 15dB, the extra cost savings
of ASA2 over FX improve as the maximum overflow probability is reduced.
When λ = 10⁻⁴, the ASA2 optimisation gave only a 2% improvement over the FX
optimisation on the Virtex4 FPGA, but this increased to 11% when λ = 10⁻⁸.
Unlike the first test, where only the signals around the input were optimised
to a DFX format, the signals around the gain multiplier are also optimised
to DFX in order to provide the wide dynamic range needed for the required maximum
overflow probability and precision. When the output SNR is 60dB,
Figure 6.11(d) shows a similar trend but with a smaller gradient. These results
conform to the expectation outlined in Section 6.7 that DFX can provide
better area cost efficiency than fixed-point when the output SNR is low and/or
a low maximum overflow probability is needed.
6.9.2 Adaptive LMS Filter

An adaptive filter is an example of a non-linear application with feedback.
Incremental updates are made to the coefficients used in the filter by feeding
back accumulated correction terms. The output of the filter is compared
with a desired filter response to produce the feedback correction terms. This
desired filter response may be known beforehand (e.g. the training sequence
used in GSM mobile telephony). Figure 6.12 shows the structure of the 1st
order LMS filter used in this case study.
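The update loop described above can be sketched as follows — a generic 2-tap LMS with a hypothetical step size mu; the thesis' word-level implementation details are not reproduced. The filter output is compared with the desired response, and the error drives the incremental coefficient updates.

```python
import random

def lms(x, d, taps=2, mu=0.1):
    """2-tap LMS adaptive FIR: returns the final coefficients and the
    per-sample error sequence."""
    w = [0.0] * taps          # adaptive coefficients
    buf = [0.0] * taps        # input delay line
    errs = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]                          # shift in newest sample
        y = sum(wi * bi for wi, bi in zip(w, buf))     # filter output
        e = dn - y                                     # feedback correction term
        w = [wi + mu * e * bi for wi, bi in zip(w, buf)]
        errs.append(e)
    return w, errs

# Toy system identification: recover an unknown 2-tap response h.
random.seed(1)
h = [0.5, -0.3]
x = [random.uniform(-1.0, 1.0) for _ in range(2000)]
d = [h[0] * x[i] + h[1] * (x[i - 1] if i > 0 else 0.0) for i in range(len(x))]
w, errs = lms(x, d)
```

With noise-free data the coefficients converge to h and the error decays towards zero.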
Figure 6.11: Area ratio of the ASIC 4th order IIR filter optimised with the proposed two-phase ASA optimisation (ASA2) and fixed-point only optimisation (FX). (Plots omitted: ASA2/FX area ratio (%) for the ASIC, Stratix2 and Virtex4 platforms; (a) input variance = 1.6×10⁻⁵; (b) input variance = 1.2×10⁻²; (c) output SNR = 15dB; (d) output SNR = 60dB.)
Figure 6.12: Case study 1st order LMS filter. (Diagram omitted: a two-tap adaptive FIR structure with inputs Input and Input_ref.)
The filter has two inputs: the normal Input signal and the desired filter
response Input_ref signal. For the desired input signal, we used the same
audio sample input as for the IIR filter case study in the previous
section. For the normal input, four equally sized segments of the desired input
were passed through four different 2nd order autoregressive filters to distort the
signal. The constant coefficients of each distortion filter are randomly chosen
to give complex-conjugate filter pole pairs with magnitudes in the range (0, 1)
and phases in the range (0, π/2).
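The distortion-filter construction can be sketched directly: a complex-conjugate pole pair r·e^{±jθ} fixes the AR(2) denominator coefficients, since (1 − re^{jθ}z⁻¹)(1 − re^{−jθ}z⁻¹) = 1 − 2r cos θ·z⁻¹ + r²z⁻². The random ranges below follow the text; the actual draws used in the thesis are of course not reproduced.

```python
import math, random

def ar2_from_pole(r, theta):
    """Coefficients (a1, a2) of the all-pole filter 1/(1 + a1*z^-1 + a2*z^-2)
    with poles at r*exp(+/- j*theta): a1 = -2*r*cos(theta), a2 = r^2."""
    return -2.0 * r * math.cos(theta), r * r

def ar2_filter(x, a1, a2):
    """Apply the AR(2) recursion y[n] = x[n] - a1*y[n-1] - a2*y[n-2]."""
    y1 = y2 = 0.0
    out = []
    for xn in x:
        y0 = xn - a1 * y1 - a2 * y2
        out.append(y0)
        y1, y2 = y0, y1
    return out

# One random distortion filter: pole magnitude in (0, 1), phase in (0, pi/2).
random.seed(0)
r = random.uniform(0.05, 0.95)
theta = random.uniform(0.05, math.pi / 2)
a1, a2 = ar2_from_pole(r, theta)
assert 0.0 < a2 < 1.0    # |poles| = sqrt(a2) < 1, so the filter is stable
```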
As with the IIR case study, the first test on the LMS filter varies
the output SNR constraint using two sets of inputs with different variances.
Using the lower-variance input (6.5×10⁻⁵), the area costs from optimising
the LMS filter on ASIC with the ASA2 and FX optimisations are illustrated in
Figure 6.13(a). Yet again, the FPGA designs show a similar trend and are not
shown. Figures 6.14(a)&(b) show the scatter plots of the percentage ratio
of synthesized design area cost using ASA2 over FX optimisation for all three
hardware platforms, together with the respective linear regression trend lines.
When the lower-variance input was used, Figure 6.14(a) shows a greater
percentage improvement compared to the earlier IIR case study. At
15dB, the LMS filter optimised with ASA2 is 30% better than FX on the
Virtex4 platform, compared to only 3% for the IIR filter. As expected, the
percentage improvement reduces as the output SNR requirement increases, and
when the higher-variance input is used (Figure 6.14(b)) the area improvement
of ASA2 over the FX optimisation is markedly reduced. At 3dB, the ASIC and
Virtex4 platforms had only a 4% improvement with ASA2 while the Stratix2
platform had none.

Figure 6.13: Area of the ASIC 2-tap adaptive LMS FIR filter. (Plots omitted: (a) varying output SNR; (b) varying overflow probability λ; both panels compare the area in cells of the FX- and ASA2-optimised designs.)
To account for the major improvements with the lower-variance input, we
looked at the individual signals in the LMS filter. The outputs of the full
multipliers and the signals near the input are the ones optimised to a DFX format
by the ASA2 optimisation. This is because the outputs of the non-linear full
multipliers and the inputs have a wide dynamic range, which is most suited
to DFX. Full multipliers make up a significant amount of chip area in an
LMS filter (about 60% on the ASIC platform) and, since a DFX full multiplier
is cheaper than a fixed-point full multiplier with equivalent dynamic range
(Chapter 4), it is not surprising to see the major area improvement of ASA2
over the FX optimisation.
The results of the second test with varying maximum overflow probability
are shown in Figure 6.13(b) for the implementation on the ASIC platform and
in Figures 6.14(c)&(d) for the synthesized ASA2 over FX percentage ratio. The
trends observed are similar to the IIR case study. As with the IIR case study,
the results of both tests show that using DFX for signals that need a wide
dynamic range may reduce the area cost of a design. Also, when the output
SNR and/or maximum overflow probability are low, mixing the DFX
and fixed-point formats is shown to give a better area cost than a fully
fixed-point optimisation. The ASA2 optimisation is able to optimise a design
to use both DFX and fixed-point efficiently.

Figure 6.14: Area ratio of the ASIC 2-tap adaptive LMS FIR filter optimised with the proposed two-phase ASA optimisation (ASA2) and fixed-point only optimisation (FX). (Plots omitted: ASA2/FX area ratio (%) for the ASIC, Stratix2 and Virtex4 platforms; (a) input variance = 6.5×10⁻⁵; (b) input variance = 1.7×10⁻³; (c) output SNR = 15dB; (d) output SNR = 60dB.)
An area that is lacking in both this case study and the previous one is a comparison
with a floating-point filter implementation. It was a conscious choice
to omit the comparison with floating-point from the start of this work, since
the initial work on arithmetic modules showed that floating-point implementations
were relatively larger than fixed-point and DFX. Furthermore, the majority of
DSP algorithms are implemented in fixed-point. However, for completeness, the
case studies should include floating-point designs, and this is part of the planned
future work.
6.10 Summary

This chapter has demonstrated an optimisation method to incorporate both
DFX and fixed-point number representation signals in a design using a two-phase
ASA algorithm (ASA2). The algorithm uses the error analysis detailed
in Chapter 5 and also a simplified area cost model to quickly analyse each design
iteration generated by the simulated annealing algorithm. After determining
the upper scaling parameters from the user's desired signal overflow constraint,
the optimisation proceeds in two phases: the first phase optimises the
lower scaling parameters and the second phase obtains optimised wordlengths.
The speed of the optimisation can be traded for the quality of the result by adjusting the
amount of time spent looking for the optimum point.
Section 6.7 explored the feasibility of using DFX in a design. Although the
analysis was performed on a single signal, it shows that DFX can provide
better area cost savings than fixed-point provided that a wide-dynamic-range
signal can tolerate a large error. Therefore designs that require a low output
SNR and/or have low signal overflow probability constraints can benefit from
using DFX. This result has been demonstrated by the case studies on the IIR
and LMS filters.
Chapter 7
Conclusions & Future Work
7.1 Summary
Continual integration of applications onto a chip is fuelled by the need to
drive down overall cost while achieving smaller and more efficient designs.
Apart from using smaller chip manufacturing process, a hardware designer
could explore other areas to improve the design efficiency such as the number
representation used in the design. In digital signal processing (DSP) designs,
the two main number representations used for the signal data-paths are fixed-
point and floating-point. As with all number representations, both come with
their own set of advantages and disadvantages. The aim of this thesis is to
introduce an alternative number representation that compromises between the
two number representations for DSP applications. The next aims were to
understand the conditions under which this number representation is best used,
and to develop a technique to discover the optimum parameters for this number
representation in the context of system design.
The new number representation, Dual FiXed-point (DFX), achieves better
dynamic range when compared to ordinary fixed-point. By using only a single
bit for the exponent field, DFX is able to bridge the dynamic range and hard-
ware implementation complexity gap between fixed-point and floating-point.
DFX’s extra dynamic range is possible due to the exponent recoding concept
presented in Chapter 3. Exponent recoding introduces a mapping function
onto the exponent field which is defined at design time. The extra level of in-
direction makes it possible to trade a number representation’s dynamic range
for its hardware implementation complexity. With the exponent recoding con-
cept, common number representations were generalised as special cases.
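As an illustration, decoding a DFX value can be sketched in a few lines of Python. The parameter names p0 and p1 follow the glossary; the exact bit-level packing (a sign bit, a one-bit exponent and an n-bit magnitude field) is a simplifying assumption for this sketch, not the thesis's precise definition.

```python
def dfx_decode(sign, e, mantissa, n, p0, p1):
    # The single exponent bit e selects one of two fixed-point scalings:
    # p0 (the finer Num0 range) or p1 (the coarser Num1 range), where p
    # counts the bits between the sign bit and the binary point.
    p = p0 if e == 0 else p1
    value = mantissa / (1 << (n - p))  # n - p fractional bits
    return -value if sign else value
```

For n = 4, p0 = 1 and p1 = 3, the same magnitude field 12 decodes to 1.5 in the Num0 range but 6.0 in the Num1 range, which is the essence of the dual scaling.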
Having defined the new DFX format, arithmetic hardware had to be designed,
since no existing hardware supported the new representation. For this thesis,
designs were created for the basic arithmetic operations:
addition and multiplication (gain and full). Coded in VHDL, the arithmetic
modules are easily synthesised onto any hardware platform and integrated into
any design flow. In general, the dual-precision nature of DFX forces the need
to align the signals before and/or after an arithmetic operation. This is
similar to floating-point arithmetic operations but the alignment is performed
using multiplexers. Therefore DFX hardware will always have extra hardware
overhead over an equivalent fixed-point arithmetic module.
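The multiplexer-based alignment can be sketched as follows. Operands are signed integer mantissas with one exponent bit each; the boundary check and the truncating right shifts are illustrative assumptions rather than the exact behaviour of the thesis's VHDL modules.

```python
def dfx_add(ma, ea, mb, eb, n, p0, p1):
    """Add two DFX operands (mantissas ma, mb; exponent bits ea, eb)
    with scalings p0 < p1 and n magnitude bits. A hedged sketch only."""
    d = p1 - p0
    # Pre-alignment: when the exponents disagree, shift the finer (Num0)
    # operand right by d to match the coarser Num1 scaling. In hardware
    # this is a fixed shift selected by a multiplexer.
    if ea != eb:
        if ea == 0:
            ma >>= d
        else:
            mb >>= d
        e = 1
    else:
        e = ea
    s = ma + mb
    # Post-alignment: a Num0 sum that crosses the range boundary is
    # promoted to Num1; overflow handling is omitted for brevity.
    if e == 0 and abs(s) >= (1 << n):
        s >>= d
        e = 1
    return s, e
```

With n = 4, p0 = 1, p1 = 3, adding 1.0 (Num0 mantissa 8) and 2.0 (Num1 mantissa 4) yields mantissa 6 in the Num1 range, i.e. 3.0 — the shifts play the role of the alignment multiplexers described above.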
In line with the research objectives, the effect of precision errors introduced
when using DFX was investigated. Unlike fully fixed-point systems, estimat-
ing the signal-to-noise ratio (SNR) at the outputs of a system with DFX is
a non-trivial task due to DFX’s dual precision. The errors introduced in the
system from truncation after arithmetic operations are also correlated with
one another. A quick way to estimate the output errors is needed to provide
feedback to RightSize's optimisation routine, as a computationally intensive
exhaustive simulation approach would not be practical. Analytical error models
for the DFX arithmetic modules and a single-run profiling simulation were
developed for incorporation into the RightSize synthesis tool. Interestingly, the
rounding errors introduced were shown not to correlate with one another.
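Because the rounding-error sources proved uncorrelated, their variances simply add at the output, so an SNR estimate reduces to a sum. A minimal sketch, assuming the path gain seen by each source has already been folded into its injected variance:

```python
import math

def output_snr_db(signal_var, injected_vars):
    # Uncorrelated error sources: the injected noise variances add
    # at the output, so the SNR follows from a single sum.
    noise_var = sum(injected_vars)
    return 10.0 * math.log10(signal_var / noise_var)
```

For example, ten uncorrelated sources of variance 1e-4 against unit signal power give roughly 30 dB.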
When the effects of the errors introduced were understood, the viability of
using DFX was explored and an automated means to determine the optimum
set of signal parameters was the subject of Chapter 6. An analysis showed
that the usefulness of DFX is fairly limited to designs that require low output
SNR and/or designs with low signal overflow probability constraints. In other
words, DFX can provide better area cost savings than fixed-point on
condition that a signal with a wide dynamic range can tolerate a large
error. Therefore, DFX is meant to enhance the implementations of specialised
applications and not meant to replace fixed-point or floating-point number
systems.
Optimisation was done using a two-phase simulated annealing methodology.
The optimisation problem is known to be NP-hard, and the non-convex design
search space made it necessary to adopt this meta-heuristic approach.
Since the robust DFX arithmetic modules can also accept inputs and produce
outputs in fixed-point, the optimised design can mix fixed-point and DFX
signals. The modified RightSize allows designers to constrain their design in
two ways: a minimum SNR for every output and a maximum probability of overflow
for every signal in a design.
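The acceptance rule at the heart of such a simulated annealing search can be sketched generically. The function names, cooling schedule and parameters below are illustrative assumptions, not the ASA2 implementation; the point is the Metropolis-style acceptance of worse moves, which lets the search escape local minima of the non-convex wordlength space.

```python
import math
import random

def anneal(cost, neighbour, x0, t0=1.0, alpha=0.95, iters=2000, seed=0):
    # Generic simulated-annealing loop: accept a worse candidate with
    # probability exp(-delta/T), cooling T geometrically each step.
    rng = random.Random(seed)
    x, c = x0, cost(x0)
    best, bc = x, c
    t = t0
    for _ in range(iters):
        y = neighbour(x, rng)
        cy = cost(y)
        if cy < c or rng.random() < math.exp(-(cy - c) / t):
            x, c = y, cy
            if c < bc:
                best, bc = x, c
        t *= alpha
    return best, bc
```

On a toy one-dimensional cost such as (x − 7)², the loop first wanders while hot and then hill-climbs to the minimum as the temperature falls.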
7.2 Future Work
The work in this thesis could be continued and extended in a variety of direc-
tions.
The first concerns the case studies in Section 6.9. As mentioned before, a
comparison should be made between the ASA2-optimised DFX filters and
optimised floating-point filters. At present, due to the initial assumption that
floating-point implementations are always larger than DFX implementations,
the comparisons with floating-point implementations were not planned for this
thesis. However, there is a lingering doubt as to whether that initial
assumption is always true. Hence, for completeness, the floating-point filters
should be implemented and compared with the ASA2-optimised filters.
Next is the number representation itself. In this thesis, the definition of DFX
and the design of the arithmetic modules are made under the assumption that
there will be no overflow. In some cases, this may lead to overly pessimistic
scalings; reducing the wordlength from the MSB side could cause a signal to
overflow but would also reduce area consumption. Hence, saturation arithmetic could
be applied on selected signals. The obvious place to implement it is to saturate
overflows for the DFX Num1 range. In addition, if the DFX boundary is
raised above the current definition (i.e. B > 2^p0), overflow values in the
Num0 range can also be saturated.
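A sketch of what such saturation would look like for a single fixed-point range (two's-complement style, n magnitude bits, scaling p); this is a simplification of the Num1 case rather than a full DFX saturation module, and the range limits are assumptions of the sketch.

```python
def saturate(x, n, p):
    # Clamp x to the representable range of an n-bit word with p bits
    # between the sign bit and the binary point (so n - p fraction bits).
    f = n - p                       # fractional bits
    hi = ((1 << n) - 1) / (1 << f)  # largest positive value
    lo = -float(1 << p)             # most negative value
    return max(lo, min(hi, x))
```

With n = 4 and p = 1 the representable range is [−2, 1.875], so an overflowing 3.0 clamps to 1.875 instead of wrapping around, avoiding the large-magnitude wrap errors that unchecked overflow would cause.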
The error estimation presented in this thesis relies on running a profiling
simulation over a set of representative input data. Although the profile tables
described in Chapter 5 are sufficient to profile every signal in a design, the
computational cost and memory requirement increase with the design
size. A fully analytical means of estimating the profile tables would reduce the
memory requirement and make the analysis more scalable.
Potential candidates that we could adapt are the polynomial chaos expansion
[WZN04] and the Edgeworth expansion techniques [Hal97].
This thesis’s optimisation routine chooses the optimum set of signal wordlengths
and scaling parameters in a design to minimise the chip area. There are other
optimisation objectives such as speed and power consumption that can be ex-
plored. Some precursor work such as improving the existing arithmetic mod-
ules would be needed. As we have seen from Chapter 4, the DFX arithmetic
modules are all slower than fixed-point without pipelining. It would be inter-
esting to include pipelining for the designs to improve the critical path delay.
Apart from that, the estimation of the delay and power consumption of each
arithmetic module has to be investigated.
Another area for future work is in the arithmetic modules. Designs of the
arithmetic modules could be improved to allow for greater resource sharing.
For example, in a chain of two adders, the post-adder block of the first adder
could be merged with the pre-adder block of the second. Furthermore, we
can make a full custom design of the arithmetic modules which could lead
to quicker and smaller designs. Some initial work has been done by Wang
[Wan04] with promising results. On a separate note, other operations such
as division could be added to the list of DFX arithmetic components. This
would open up avenues to test and implement a wider range of applications
with RightSize.
Following on from that, the final area for future work is to explore other ap-
plications where DFX would be an improvement over fixed-point and floating-
point (and possibly LNS) designs. At present, DFX has been shown to be
better than fixed-point IIR and LMS filters under extreme circumstances. The
search for other applications should not be constrained to the realm of DSP,
as DFX might have a use in other areas. To facilitate the search, DFX may be
incorporated into the design flow of system-level description languages such as
SystemC [CEV07] as this would cut down the development time.
Glossary
⌊x⌋ The floor function which returns the largest integer less than or
equal to x.
Φ() Recoding function for Exponent Recoding.
n Wordlength of a number system, not including the sign-bit.
p The number of bits from the right of the sign-bit to the binary-point
of a fixed-point number system.
p0, p1 The number of bits from the right of the sign-bit to the binary-
point of lower and upper number ranges in a DFX number system.
See Definition 3.2, pg. 61.
p′, p′0, p′1 An alternative representation for p, p0 and p1 known as LSB-side
scaling. See Section 5.2, pg. 108.
B Boundary between Num0 and Num1 of a DFX number. See Defi-
nition 3.4, pg. 62.
T, 𝒯 Truncation; 𝒯 is a set of truncations. See Section 5.2, pg. 108.
p′a → p′b Truncation of a number with p′a scaling to p′b scaling. See
Section 5.2, pg. 108.
µX, σ²X The error mean and error variance of signal X.
PT The probability of truncation T.
cov(x, y) Covariance between variables x and y. See Equation (5.8), pg. 126.
rx,y Correlation coefficient which measures the degree of correlation. It
has a range of [−1, 1].
R The set of all reals.
N The set of all positive integers.
ALUT Adaptive Look-Up Table. The principal building block of the recon-
figurable fabric of Altera Stratix II FPGAs.
ASA2 The two-phase adaptive simulated annealing optimisation algorithm
that produces a DFX annotated computational graph of a system,
optimised for minimum area based on a set of user constraints. See
Section 6.8.2, pg. 166.
ASIC Application Specific Integrated Circuit.
block floating-point (BFP) A scheme for processing blocks of fixed-point
numbers that share a common scaling. See Section 2.4.4, pg. 35.
computational graph A formal representation of an algorithm. See Defi-
nition 4.1, pg. 73.
correlation The strength and direction of a linear relationship between two
random variables.
critical path delay (CPD) The maximum time needed for a change in the
inputs of a system to be reflected in its outputs.
DFX Dual FiXed-point. See Section 3.3, pg. 61.
DFX annotated computational graph A formal representation of the DFX
implementation of a computational graph. See Definition 4.2, pg.
74.
DSP Digital Signal Processing.
dynamic range The range of a number representation. It is quantified as
the ratio of the largest representable magnitude to the smallest and
is generally expressed in dB.
ER Exponent Recoding. See Section 3.2, pg. 56.
error See finite precision error.
finite precision error The difference between the output sequence of a fi-
nite precision implementation and the equivalent sequence from an
infinite precision implementation. This is also known as quantisa-
tion error.
FIR Finite Impulse Response.
fixed-point (FX) A binary number representation. See Section 2.4.1, pg.
25.
floating-point (FP) A binary number representation. See Section 2.4.2,
pg. 28.
FPGA Field Programmable Gate Array.
IEEE754 Standard for Binary Floating-Point Arithmetic.
IIR Infinite Impulse Response.
level-index (LI) A binary number representation. See Section 2.4.8, pg.
40.
LMS Least Mean Square.
logarithmic number system (LNS) A binary number representation. See
Section 2.4.3, pg. 32.
LSB Least significant bit.
LUT Look-Up Table. The principal building block of the reconfigurable
fabric of Xilinx Virtex 4 FPGAs.
MSB Most significant bit.
MUX Multiplexer.
Num0, Num1 The lower and upper number range of DFX numbers. See
Definition 3.3, pg. 62.
PDF Probability density function.
perturbation analysis The process of linearising a non-linear system in or-
der to apply analytical techniques to estimate noise in LTI systems
[Con03].
profiling simulation A step taken during the ASA2 optimisation in order
to determine the probability distributions of the signals within a
system. See Section 5.5.1, pg. 131.
rational arithmetic (RA) A binary number representation. See Section 2.4.7,
pg. 39.
residue number system (RNS) A binary number representation. See Sec-
tion 2.4.5, pg. 36.
RightSize The high-level synthesis tool originally made by Constantinides
[Con03].
rounding The process of reducing the number of bits used to represent a
number by removing one or more least significant bits, rounding to
the nearest even.
scaling The scaling of a signal is determined by the position of the bi-
nary point in a signal representation. The terms ’binary point’ and
’scaling’ are used interchangeably.
simulated annealing (SA) A heuristic search technique based on the an-
nealing in metallurgy [Ing93].
SNR Signal to Noise Ratio.
truncation The process of reducing the number of bits used to represent a
number by ignoring one or more least significant bits.
VHDL Very high speed integrated circuit Hardware Description Language.
Bibliography
[Alta] Altera. Cyclone Series.
http://www.altera.com/products/devices/cyclone/cyc-index.jsp.
[Altb] Altera. DSP Builder.
http://www.altera.com/products/software/products/dsp/dsp-
builder.html.
[Alt98] Altera Corporation, San Jose. Altera Databook, 1998.
[And96] S. Andraos. Fixed point unsigned fractional representation in
residue number system. Circuits and Systems, 1996., IEEE 39th
Midwest symposium on, 1:555–558 vol.1, 18-21 Aug 1996.
[Arn05] M.G. Arnold. The residue logarithmic number system: theory and
implementation. Computer Arithmetic, 2005. ARITH-17 2005.
17th IEEE Symposium on, pages 196–205, 27-29 June 2005.
[Ber00] Berkeley Design Technology, Inc. BDTI DSP kernel benchmarks.
http://www.bdti.com/bdtimark/BDTImark2000.htm, 2000.
[BJA+03] Francisco Barat, Murali Jayapala, Tom Vander Aa, Rudy Lauw-
ereins, Geert Deconinck, and Henk Corporaal. Low power
coarse-grained reconfigurable instruction set processor. In Field-
Programmable Logic and Applications, volume 2778/2003
of Lecture Notes in Computer Science, pages 230–239. Field-
Programmable Logic and Applications, September 2003.
[BL91] P. H. Bauer and L. -J. Leclerc. Asymptotic stability of digital fil-
ters with combinations of overflow and quantization nonlinearities.
Circuits and Systems, 1991., IEEE International Symposium on,
pages 380–383 vol.1, 11-14 Jun 1991.
[Bom94] B. W. Bomar. Low-roundoff-noise limit-cycle-free implementation
of recursive transfer functions on a fixed-point digital signal pro-
cessor. Industrial Electronics, IEEE Transactions on, 41(1):70–78,
Feb 1994.
[Boo51] Andrew D. Booth. A signed binary multiplication technique.
The Quarterly Journal of Mechanics and Applied Mathematics,
4(2):236–240, 1951.
[Bow87] A. J. Bower. Digital two-channel sound for terrestrial television.
IEEE Trans. on Consumer Electronics, CE–33(3):286–296, Au-
gust 1987.
[BP00] A. Benedetti and P. Perona. Bit-width optimization for config-
urable DSP’s by multi-interval analysis. In Signals, Systems and
Computers, 2000. Conference Record of the Thirty-Fourth Asilo-
mar Conference on, volume 1, pages 355–359 vol.1, 2000.
[CC99] J. N. Coleman and E. I. Chester. A 32-bit logarithmic arithmetic
unit and its performance compared to floating-point. In Proceed-
ings of the 14th IEEE Symposium on Computer Arithmetic, pages
142–152, Adelaide, Australia, April 1999.
[CCL99] George A. Constantinides, Peter Y. K. Cheung, and Wayne
Luk. Truncation noise in fixed-point SFGs. Electronic Letters,
35(23):2012–2014, 1999.
[CCL00] George A. Constantinides, Peter Y. K. Cheung, and Wayne Luk.
Roundoff-noise shaping in filter design. In Proc. IEEE Interna-
tional Symposium on Circuits and Systems, May–June 2000.
[CCL01] George A. Constantinides, Peter Y. K. Cheung, and Wayne Luk.
The multiple wordlength paradigm. In Field-Programmable Cus-
tom Computing Machines, 2001. FCCM ’01. The 9th Annual IEEE
Symposium on, pages 51–60, 2001.
[CCL02] George A. Constantinides, Peter Y. K. Cheung, and Wayne Luk.
Optimum wordlength allocation. In Field-Programmable Custom
Computing Machines, 2002. Proceedings. 10th Annual IEEE Sym-
posium on, pages 219–228, 2002.
[CCL03] George A. Constantinides, Peter Y. K. Cheung, and Wayne Luk.
Synthesis of saturation arithmetic architectures. ACM Trans. Des.
Autom. Electron. Syst., 8(3):334–354, 2003.
[CDdD06] S. Collange, J. Detrey, and F. de Dinechin. Floating point or
LNS: Choosing the right arithmetic on an application basis. In
Digital System Design: Architectures, Methods and Tools, 2006.
DSD 2006. 9th EUROMICRO Conference on, pages 197–203, 2006.
[Cer85] V. Cerny. Thermodynamical approach to the traveling salesman
problem: An efficient simulation algorithm. Journal of Optimiza-
tion Theory and Applications, 45(1):41–51, Jan 1985.
[CEV07] B.A. Correa, J.F. Eusse, and J.F. Velez. High level system-on-chip
design using uml and systemc. Electronics, Robotics and Automo-
tive Mechanics Conference, 2007. CERMA 2007, pages 740–745,
Sept. 2007.
[CMP76] T. Claasen, W. Mecklenbrauker, and J. Peek. Effects of quanti-
zation and overflow in recursive digital filters. Acoustics, Speech,
and Signal Processing, IEEE Transactions on, 24(6):517–529, Dec
1976.
[CO84] C. W. Clenshaw and F. W. J. Olver. Beyond floating-point. J.
Assoc. Comput. Mach., 31:319–328, March 1984.
[Con01] George A. Constantinides. High Level Synthesis and Word Length
Optimization of Digital Signal Processing Systems. PhD thesis,
Imperial College of Science, Technology and Medicine, University
of London, London, U.K., September 2001.
[Con03] George A. Constantinides. Perturbation analysis for word-length
optimization. In Field-Programmable Custom Computing Ma-
chines, 2003. FCCM 2003. 11th Annual IEEE Symposium on,
pages 81–90, 9-11 April 2003.
[Cow02] M. Cowlishaw. Densely packed decimal encoding. Computers and
Digital Techniques, IEE Proceedings -, 149(3):102–104, May 2002.
[CRS+99] R. Cmar, L. Rijnders, P. Schaumont, S. Vernalde, and I. Bolsens.
A methodology and design environment for DSP ASIC fixed point
refinement. In Design, Automation and Test in Europe Conference
and Exhibition 1999. Proceedings, pages 271–276, 1999.
[CT88] C. W. Clenshaw and P. R. Turner. The symmetric level-index
system. IMA J. Numerical Analysis, 8:517–526, 1988.
[CW02] George A. Constantinides and G.J. Woeginger. The complexity
of multiple wordlength assignment. Applied Mathematics Letters,
15:137–140(4), February 2002.
[DdD] Jeremie Detrey and Florent de Dinechin. A VHDL library
of parametrisable floating-point and LNS operators for FPGA.
http://www.ens-lyon.fr/LIP/Arenaire/Ware/FPLibrary/.
[Dek71] T. J. Dekker. A floating-point technique for extending the available
precision. Numerische Mathematik, 18(3):224–242, June 1971.
[DGL+02] J. Dido, N. Geraudie, L. Loiseau, O. Payeur, Y. Savaria, and
D. Poirier. A flexible floating-point format for optimizing data-
paths and operators in FPGA based DSPs. pages 50–55, 2002.
[DH96] Ron Davidson and David Harel. Drawing graphs nicely using sim-
ulated annealing. ACM Trans. Graph., 15(4):301–331, 1996.
[ECC04] Chun Te Ewe, Peter Y.K. Cheung, and George A. Constantinides.
Dual FiXed-point: An efficient alternative to floating-point compu-
tation. In Field Programmable Logic and Application: 14th Inter-
national Conference, FPL 2004, pages 200–208, Leuven, Belgium,
August 2004. Springer-Verlag Heidelberg.
[ECC05] C.T. Ewe, P.Y.K. Cheung, and G.A. Constantinides. Error mod-
elling of Dual FiXed-point arithmetic and its application in field
programmable logic. In Field Programmable Logic and Applica-
tions, 2005. International Conference on, pages 124–129, 24-26
Aug. 2005.
[EMT69] P. M. Ebert, J. E. Mazo, and M. G. Taylor. Overflow oscillations in
digital filters. Bell System Technical Journal, 48:2999–3020, Nov.
1969.
[FCR02] Fang Fang, Tsuhan Chen, and Rob A. Rutenbar. Lightweight
floating-point arithmetic: case study of inverse discrete cosine
transform. EURASIP J. Appl. Signal Process., 2002(1):879–892,
2002.
[Fio98] P.D. Fiore. Lazy rounding. In Signal Processing Systems, 1998.
SIPS 98. 1998 IEEE Workshop on, pages 449–458, 8-10 Oct. 1998.
[FML06] Haohuan Fu, Oskar Mencer, and Wayne Luk. Comparing floating-
point and logarithmic number representations for reconfigurable
acceleration. Field Programmable Technology, 2006. FPT 2006.
IEEE International Conference on, pages 337–340, Dec. 2006.
[FP97] W.L. Freking and K.K. Parhi. Low-power FIR digital filters using
residue arithmetic. Signals, Systems & Computers, 1997. Confer-
ence Record of the Thirty-First Asilomar Conference on, 1:739–743
vol.1, 2-5 Nov 1997.
[FRPC03] Clair Fang Fang, R. A. Rutenbar, M. Puschel, and Tsuhan Chen.
Toward efficient static analysis of finite-precision effects in DSP
applications via affine arithmetic modeling. In Design Automation
Conference 2003 (DAC’03) Proceedings, pages 496–501, Leuven,
Belgium, June 2003. Springer–Verlag Heidelberg.
[GML+02] Altaf A. Gaffar, Oskar Mencer, Wayne Luk, Peter Y.K. Cheung,
and Nabeel Shirazi. Floating point bitwidth analysis via automatic
differentiation. In IEEE Conference on Field Programmable Tech-
nology (FPT’02), pages 158–165, Hong Kong, December 2002.
[GML04] Altaf A. Gaffar, Oskar Mencer, and Wayne Luk. Unifying bit-width
optimisation for fixed-point and floating-point designs. Field-
Programmable Custom Computing Machines, 2004. FCCM 2004.
12th Annual IEEE Symposium on, pages 79–88, 20-23 April 2004.
[Gri00] Andreas Griewank. Evaluating derivatives: principles and tech-
niques of algorithmic differentiation. Society for Industrial and
Applied Mathematics, Philadelphia, PA, USA, 2000.
[GS01] Geoffrey R. Grimmett and David R. Stirzaker. Probability and
Random Processes. Oxford University Press, August 2001.
[GT88] B.D.O. Green and L.E. Turner. New limit cycle bounds for digital
filters. Circuits and Systems, IEEE Transactions on, 35(4):365–
374, Apr 1988.
[Haj85] Bruce Hajek. A tutorial survey of theory and applications of simu-
lated annealing. Decision and Control, 1985 24th IEEE Conference
on, 24:755–760, Dec 1985.
[Hal97] Peter G. Hall. The Bootstrap and Edgeworth Expansion. Springer,
1997.
[Ham83] Hozumi Hamada. URR: Universal Representation of Real numbers.
New Generation Computing, 1(2):205–209, 1983.
[HJ86] Roger A. Horn and Charles R. Johnson. Matrix analysis. Cam-
bridge University Press, New York, NY, USA, 1986.
[HP94] C.Y. Hung and B. Parhami. An approximate sign detection method
for residue numbers and its application to rns division. Computers
and Mathematics with Applications, 27:23–35, Feb 1994.
[IEE85] IEEE. IEEE Standard for Binary Floating-Point Arithmetic (IEEE
754). IEEE, 1985.
[IEE04] IEEE. IEEE Standard for VHDL Register Transfer Level (RTL)
Synthesis, 2004.
[IEE07] IEEE. DRAFT Standard for Floating-Point Arithmetic P754, Oct.
2007.
[Ing93] Lester Ingber. Adaptive simulated annealing (ASA).
[ftp.ingber.com: ASA.tar.gz ASA.zip], 1993. McLean, VA,
Lester Ingber Research.
[Ing96] Lester Ingber. Adaptive simulated annealing (ASA): Lessons
learned. Control and Cybernetics, 25:33, 1996.
[IO96] Christopher Inacio and Denise Ombres. The DSP decision: fixed
point or floating? IEEE Spectrum, 33(9):72–74, 1996.
[Jac70] Leland B. Jackson. On the interaction of roundoff noise and dy-
namic range in digital filters. The Bell System Technical Journal,
49(2):159–184, February 1970.
[JAMS89] David S. Johnson, Cecilia R. Aragon, Lyle A. McGeoch, and
Catherine Schevon. Optimization by simulated annealing: An
experimental evaluation. Part i, graph partitioning. Oper. Res.,
37(6):865–892, 1989.
[JAMS91] David S. Johnson, Cecilia R. Aragon, Lyle A. McGeoch, and
Catherine Schevon. Optimization by simulated annealing: An ex-
perimental evaluation. Part ii, graph coloring and number parti-
tioning. Oper. Res., 39(3):378–406, 1991.
[JL01] Allan Jaenicke and W. Luk. Parameterised floating-point arith-
metic on fpgas. Acoustics, Speech, and Signal Processing, 2001.
Proceedings. (ICASSP ’01). 2001 IEEE International Conference
on, 2:897–900 vol.2, 2001.
[KA96] Kari Kalliojärvi and Jaakko Astola. Roundoff errors in block-
floating-point systems. IEEE Trans. on Signal Processing,
44(4):783–790, April 1996.
[Kah65] W. Kahan. Pracniques: Further remarks on reducing truncation
errors. Commun. ACM, 8(1):40, 1965.
[KF00] Shiro Kobayashi and Gerhard P. Fettweis. A hierarchical block-
floating-point arithmetic. J. VLSI Signal Process. Syst., 24(1):19–
30, 2000.
[KGV83] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by
simulated annealing. Science, 220(4598):671–680, 13 May 1983.
[KKS98] Seehyun Kim, Ki-Il Kum, and Wonyong Sung. Fixed-point op-
timization utility for C and C++ based digital signal processing
programs. Circuits and Systems II: Analog and Digital Signal Pro-
cessing, IEEE Transactions on [see also Circuits and Systems II:
Express Briefs, IEEE Transactions on], 45(11):1455–1464, 1998.
[KL97] E. Kinoshita and Ki-Ja Lee. A residue arithmetic extension
for reliable scientific computation. Transactions on Computers,
46(2):129–138, Feb 1997.
[KM83] P. Kornerup and D.W. Matula. Finite precision rational arith-
metic: An arithmetic unit. Transactions on Computers, C-
32(4):378–388, 1983.
[KM88] Peter Kornerup and David W. Matula. An on-line arithmetic unit
for bit-pipelined rational arithmetic. J. Parallel Distrib. Comput.,
5(3):310–330, 1988.
[KNM+08] Sami Khawam, Ioannis Nousias, Mark Milward, Ying Yi, Mark
Muir, and Tughrul Arslan. The reconfigurable instruction cell ar-
ray. IEEE Trans. Very Large Scale Integr. Syst., 16(1):75–85, 2008.
[Kol04] Gopi Kolli. Using fixed-point instead of floating-point for
better 3D performance. Intel Optimizing Center, 2004.
http://devx.com/Intel/Article/16478.
[Kor02] Israel Koren. Computer Arithmetic Algorithms. A K Peters, Nat-
ick, Massachusetts, 2nd edition, 2002.
[KS94] Seehyun Kim and Wonyong Sung. A floating-point to fixed-point
assembly program translator for the TMS 320c25. Circuits and
Systems II: Analog and Digital Signal Processing, IEEE Transac-
tions on [see also Circuits and Systems II: Express Briefs, IEEE
Transactions on], 41(11):730–739, Nov 1994.
[KS98] K.-I. Kum and W. Sung. Word-length optimization for high-level
synthesis of digital signal processing systems. In Signal Processing
Systems, 1998. SIPS 98. 1998 IEEE Workshop on, pages 569–578,
8-10 Oct. 1998.
[KS01] Ki-Il Kum and Wonyong Sung. Combined word-length optimiza-
tion and high-level synthesis of digital signal processing systems.
Computer-Aided Design of Integrated Circuits and Systems, IEEE
Transactions on, 20(8):921–930, Aug 2001.
[LC92] M. Lu and J.-S. Chiang. A novel division algorithm for the residue
number system. Transactions on Computers, 41(8):1026–1032,
Aug 1992.
[LEK05] L. Lacassagne, D. Etiemble, and S.A.O. Kablia. 16-bit floating
point instructions for embedded multimedia applications. Com-
puter Architecture for Machine Perception, 2005. CAMP 2005.
Proceedings. Seventh International Workshop on, pages 198–203,
4-6 July 2005.
[LGML05] Dong-U Lee, Altaf Abdul Gaffar, Oskar Mencer, and Wayne Luk.
MiniBit: bit-width optimization via affine arithmetic. In DAC ’05:
Proceedings of the 42nd annual conference on Design automation,
pages 837–840, New York, NY, USA, 2005. ACM Press.
[Liu71] Bede Liu. Effects of finite word length on the accuracy of digital
filters - A review. IEEE Transactions, CT–18(6):670–677, 1971.
[Liu98] Derong Liu. Lyapunov stability of two-dimensional digital filters
with overflow nonlinearities. Circuits and Systems I: Fundamental
Theory and Applications, IEEE Transactions on [see also Circuits
and Systems I: Regular Papers, IEEE Transactions on], 45(5):574–
577, May 1998.
[LM87] Edward A. Lee and David G. Messerschmitt. Static scheduling of
synchronous data flow programs for digital signal processing. In
IEEE Transactions on Computers, volume C-36, 1987.
[LO90] D. W. Lozier and F. W. J. Olver. Closure and precision in level-
index arithmetic. SIAM Journal on Numerical Analysis, 27:1295–
1304, 1990.
[LSL+00] Ming-Hau Lee, Hartej Singh, Guangming Lu, Nader Bagherzadeh,
Fadi J. Kurdahi, Eliseu M.C. Filho, and Vladimir Castro Alves.
Design and implementation of the morphosys reconfigurable com-
puting processor. The Journal of VLSI Signal Processing,
24(2):147–164, March 2000.
[Mata] Mathworks. Matlab. http://www.mathworks.com.
[Matb] Mathworks. Simulink. http://www.mathworks.com.
[Mic94] Giovanni De Micheli. Synthesis and Optimization of Digital Cir-
cuits. McGraw-Hill Higher Education, 1994.
[Mit98] S. K. Mitra. Digital Signal Processing. McGraw-Hill, New York,
1998.
[MK85] D.W. Matula and P. Kornerup. Finite precision rational arith-
metic: Slash number systems. IEEE Transactions on Computers,
34(1):3–18, 1985.
[MN99] Kurt Mehlhorn and Stefan Näher. LEDA: A Platform for Combi-
natorial and Geometric Computing. Cambridge University Press,
1999.
[Moo66] R. E. Moore. Interval Analysis. Prentice-Hall, 1966.
[MRRT53] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosen-
bluth, and Augusta H. Teller. Equation of state calculations
by fast computing machines. The Journal of Chemical Physics,
21(6):1087–1092, June 1953.
[MTS95] Jean-Michel Muller, Arnaud Tisserand, and Alexandre Scherbyna.
Semi-logarithmic number system. In Proceedings of the 12th IEEE
Symposium on Computer Arithmetic, pages 201 – 207, Bath, Eng-
land, July 1995.
[Mun96] Robert Munafo. Survey of floating-point formats.
http://home.earthlink.net/ mrob/pub/math/floatformats.html,
1996.
[Nim87] Y. Ninomiya et al. An HDTV broadcasting system utilizing a
bandwidth compression technique MUSE. IEEE Trans on Broad-
casting, BC–33(4):130–160, December 1987.
[NVI05] NVIDIA. NVIDIA GPU programming guide.
http://developer.nvidia.com/object/gpu programming guide.html,
2005.
[Opp70] Alan V. Oppenheim. Realization of digital filters using block-
floating-point arithmetic. IEEE Trans. on Audio Electroacoustics,
AE–18(2):130–136, June 1970.
[OS99] Alan V. Oppenheim and R. W. Schafer. Discrete-Time Signal
Processing. Prentice-Hall, Englewood Cliffs, NJ, USA, 2nd edition,
1999.
[OW72] Alan V. Oppenheim and C. J. Weinstein. Effects of finite register
length in digital filtering and the fast fourier transform. Proceedings
of the IEEE, 60(8):957–976, 1972.
[Par88] B. Parhami. Carry-free addition of recoded binary signed-digit
numbers. IEEE Trans. Comput., 37(11):1470–1476, 1988.
[Par00] B. Parhami. Computer Arithmetic: Algorithms and Hardware De-
signs. Oxford University Press, U.K., 2000.
[Pin70] Martin Pincus. A Monte Carlo method for the approximate solu-
tion of certain types of constrained optimization problems. Oper-
ations Research, 18(6):1225–1228, Nov 1970.
[Pra01] William K. Pratt. Digital Image Processing. Wiley, New York,
NY, 3rd edition, 2001.
[Pri91] D.M. Priest. Algorithms for arbitrary precision floating point arith-
metic. Computer Arithmetic, 1991. Proceedings., 10th IEEE Sym-
posium on, pages 132–143, 26-28 Jun 1991.
[RB04] Sanghamitra Roy and Prith Banerjee. An algorithm for
converting floating-point computations to fixed-point in
MATLAB based FPGA design. In Proceedings of the 41st
annual conference on Design automation, pages 484–487. ACM
Press, 2004.
[RH91] I.P. Radivojevic and J. Herath. Executing DSP applications in
a fine-grained dataflow environment. Transactions on Software
Engineering, 17(10):1028–1041, 1991.
[SA75] E.E. Swartzlander and A.G. Alexopoulos. The sign/logarithm
number system. Transactions on Computers, C-24(12):1238–1242,
Dec. 1975.
[Sd88] S. Sridharan and G. Dickman. Block floating point implementa-
tion of digital filters using the DSP56000. Microprocess. Microsyst.,
12(6):299–308, July–Aug. 1988.
[SdF97] J. Stolfi and L. de Figueiredo. Self-Validated Numerical Methods
and Applications. Institute for Pure and Applied Mathematics
(IMPA), Rio de Janeiro, 1997. Monograph for the 21st Brazilian
Mathematics Colloquium (CBM’97), IMPA.
[SK94] Wonyong Sung and Ki-Il Kum. Word-length determination and
scaling software for a signal flow block diagram. Acoustics, Speech,
and Signal Processing, 1994. ICASSP-94., 1994 IEEE Interna-
tional Conference on, ii:II/457–II/460 vol.2, 19-22 Apr 1994.
[SK95] Wonyong Sung and Ki-Il Kum. Simulation-based word-length op-
timization method for fixed-point digital signal processing sys-
tems. Signal Processing, IEEE Transactions on [see also Acous-
tics, Speech, and Signal Processing, IEEE Transactions on],
43(12):3087–3090, Dec 1995.
[SM94] Richard L. Scheaffer and James T. McClave. Probability and Statis-
tics for Engineers. Duxbury Press, 4th edition, April 4 1994.
[Smi97] Steven W. Smith. The Scientist and Engineer’s Guide to Digital
Signal Processing. California Technical Pub, 1997.
[SMM05] N. Smyth, M. McLoone, and J.V. McCanny. Reconfigurable pro-
cessor for public-key cryptography. In Proceedings of the IEEE
Workshop on Signal Processing Systems Design and Implementa-
tion, pages 110–115, Nov. 2005.
[SP87] Puay Kia Sim and K. K. Pang. Decoupling of the overflow and
quantization phenomena in orthogonal biquad recursive digital fil-
ters. Circuits, Systems, and Signal Processing, 6(4):457–470, De-
cember 1987.
[Sri87] S. Sridharan. Implementation of state-space digital filters using
block-floating-point arithmetic. In 1987 IEEE International Con-
ference on Acoustics, Speech and Signal Processing, pages 912–915,
Dallas, TX, USA, April 1987.
[SS97] Adel S. Sedra and Kenneth C. Smith. Microelectronic Circuits.
Oxford University Press Inc, USA, 4th revised edition, July 1997.
[ST67] Nicholas S. Szabo and Richard I. Tanaka. Residue Arithmetic and
Its Application to Computer Technology. McGraw-Hill, New York,
1967.
[Sto05] Thanos Stouraitis. The Electrical Engineering Handbook, chapter
1, Logarithmic and Residue Number Systems for VLSI Arithmetic,
pages 179–190. Academic Press, 2005.
[SW86] S. Sridharan and D. Williamson. Implementation of high-order
direct-form digital filter structures. IEEE Transactions on Cir-
cuits and Systems, CAS-33(8):818–822, August 1986.
[Syna] Synopsys. Design Compiler.
http://www.synopsys.com/products/logic/logic.html.
[Synb] Synplicity. Synplify DSP.
http://www.synplicity.com/products/dsp_solutions.html.
[TGJR88] F. J. Taylor, R. Gill, J. Joseph, and J. Radke. A 20 bit logarithmic
number system processor. IEEE Transactions on Computers,
37(2):190–200, 1988.
[Tsa74] N. Tsao. On the distribution of significant digits and roundoff
errors. Communications of the ACM, 17:267–271, May 1974.
[Tur89] P.R. Turner. A software implementation of SLI arithmetic. In
Proceedings of the 9th IEEE Symposium on Computer Arithmetic,
pages 18–24, Sep 1989.
[Und04] Keith Underwood. FPGAs vs. CPUs: trends in peak floating-point
performance. In Proceedings of the 2004 ACM/SIGDA Interna-
tional Symposium on Field Programmable Gate Arrays, pages
171–180, 2004.
[Wan99] Lars Wanhammar. DSP Integrated Circuits. Elsevier, 1999.
[Wan04] Guixin Wang. ASIC design of dual fixed-point arithmetic. Master’s
thesis, Imperial College London, Dept of Electrical and Electronic
Engineering, September 2004.
[WZN04] Bin Wu, Jianwen Zhu, and F.N. Najm. Dynamic range estimation
for nonlinear systems. In Proceedings of the IEEE/ACM Inter-
national Conference on Computer Aided Design (ICCAD-2004),
pages 660–667, 2004.
[XB97] Guo Fang Xu and T. Bose. Elimination of limit cycles due to
two’s complement quantization in normal form digital filters.
IEEE Transactions on Signal Processing, 45(12):2891–2895, Dec
1997.
[Xila] Xilinx. Spartan series.
http://www.xilinx.com/products/silicon_solutions/fpgas/spartan.
[Xilb] Xilinx. System Generator.
http://www.xilinx.com/ise/optional_prod/system_generator.htm.
[Xilc] Xilinx. Virtex series.
http://www.xilinx.com/products/silicon_solutions/fpgas/virtex/.
[Yok92] H. Yokoo. Overflow/underflow-free floating-point number rep-
resentations with self-delimiting variable-length exponent field.
IEEE Transactions on Computers, 41(8):1033–1039, Aug 1992.
[Zim99] Reto Zimmermann. Lecture notes on Computer Arithmetic: Prin-
ciples, Architectures, and VLSI Design. Integrated Systems Labo-
ratory, ETH Zürich, Switzerland, March 1999.