This work was supported by France Telecom CNET, Issy-les-Moulineaux, France, and was performed while the first author was at INT, Evry, France.
EFFICIENT IMPLEMENTATION METHODOLOGY OF FAST FIR FILTERING ALGORITHMS ON DSP

Anissa Zergaïnoh, Institut Galilée, Avenue Jean Baptiste Clément, 93430 Villetaneuse, France, [email protected]
Pierre Duhamel, ENST/SIG, 46 rue Barrault, 75013 Paris, France, [email protected]
Jean Pierre Vidal, INT/SIM, 9, rue Charles Fourier, 91011 Evry, France, [email protected]
ABSTRACT
A class of Finite Impulse Response (FIR) filtering algorithms based either on short Fast Fourier Transforms (FFT) or on short-length FIR filtering algorithms was recently proposed. Besides a significant reduction of the arithmetic complexity, these algorithms present some characteristics which make them useful in many applications, namely a small processing delay (independent of the FIR filter length) as well as a multiply-add based computational structure. These algorithms are presented in a unified framework, thus allowing an easy combination of any of them.
However, a remaining difficulty concerns the implementation of these fast algorithms on Digital Signal Processors (DSP), given the finite resources of a DSP (number of pointers, registers and memory), while retaining as much as possible of the improvement brought by the reduction of the arithmetic complexity. This paper provides an efficient implementation methodology, organizing the algorithm in such a way that memory data access is optimized on a DSP. As a result, our implementation requires a constant number of pointers whatever the algorithm combination. This knowledge is used in a DSP code generator which is able to select the appropriate algorithm meeting the application constraints, as well as to generate automatically an optimized assembly code, using macro-instructions available in a DSP-dependent library.
An improvement of more than 50% in terms of throughput (number of machine cycles per point) compared to the
implementation of the direct convolution is generally achieved.
1. INTRODUCTION
Finite Impulse Response (FIR) filters have found wide applications in various fields where large FIR filters (up to
4000 taps) are required. When the application requires real time processing, a Digital Signal Processor (DSP)
implementation is not always feasible, due to the heavy computational cost. This is why fast algorithms are called for, despite a computational structure that is not well suited to a DSP implementation. Considerable efforts have been spent to reduce the computational requirements of FIR filtering [1], [2], [3], [4]. The most famous algorithm is based on the Fast Fourier Transform (FFT) algorithm [5], [6], [7], [8], [9]. In the earlier versions, the processing block size N is much larger than the FIR filter length L. Since, in a block processing method, the Input/Output delay is on the order of the block size, this is often a severe constraint in real-time processing applications. This is outlined in Figure 1,
evaluating the processing delay, assuming that Te is the sampling period of the Input/Output data. In a fully optimized system, the device is always working, with auxiliary memories buffering the incoming and outgoing data: a first buffer performs the acquisition of an input data block Ai during a period equal to NTe, while the previous data block Ai-1 is processed. A second buffer restores the computation result of the input data block Ai-2.
Hence, the processing delay is given by the time spent for the acquisition of an input data block, plus the time required for performing the computation of the same input data block. Thus, it is evaluated as 2NTe.
[Figure 1 shows the acquisition, processing and restitution steps of consecutive input data blocks Ai-3, ..., Ai+3 along the time axis: each block is acquired during NTe, processed during the next NTe, and restored afterwards, giving a total processing delay of 2NTe. Te: sampling period; N: block processing size; Ak: consecutive input data blocks.]
Fig. 1 Real-time block processing method
Hence, an excessive delay is certainly the main problem with such algorithms. Several other filtering algorithms were proposed in [12] and [13]. They efficiently reduce the arithmetic complexity, but do not take into account the relation between the filtering operations, the arithmetic requirements and the signal processor architecture. Thus, the estimated complexity reduction, as classically measured in terms of number of operations, may be misleading because it does not accurately represent the capability of the processor to efficiently implement the algorithm. The key to the solution of this
problem is to take both hardware and algorithm considerations into account in order to achieve an efficient implementation. These considerations motivated the proposal [14], [15], [16], [17], [18], [19] of a new class of fast FIR filtering algorithms. These algorithms make an efficient use of the Multiplier-Accumulator hardware found in most DSPs [17], by partially keeping the original inner product formulation while reducing the arithmetic complexity, even for a small block size. Both characteristics have the potential for resulting in efficient DSP implementations, compatible with application-oriented constraints. The main building blocks are short-length fast FIR modules in which all multiplications are replaced by decimated sub-filters. These algorithms break the computation of a length-L FIR filter into that of several sub-filters of length L/N < L, in such a way that the arithmetic complexity decreases [18].
The paper is organized as follows: Section 2 analyses the interaction between filtering algorithms and DSP architectures, allowing us to define adapted tools for an efficient implementation. Section 3 briefly explains the method allowing the construction of the basic fast FIR filtering algorithms, based on the Chinese Remainder Theorem [12], [15], [18]. The main filtering algorithms constructed for real and complex interpolation points are presented, leading to two classes of algorithms: those based on short-length FIR algorithms and those based on short FFT's. Section 4 shows that both classes can be mixed [24] to build composite filtering algorithms. An evaluation of the arithmetic complexity is given, and links between architectural and algorithmic considerations are provided. Section 5 proposes a unified methodology for an optimal memory data organization based on an efficient address generation [20], [21], [23], [24]. This is implemented in Section 6 as a Code Generator [21], [23], [24]. Finally, this Code Generator is evaluated through the implementation performance of the resulting filtering algorithms on the ADSP-2100.
2. INTERACTION BETWEEN DIGITAL SIGNAL PROCESSORS AND CLASSICAL
FIR FILTERING ALGORITHMS
An important task consists in analyzing the interaction between the operations available in a programmable DSP and the corresponding operations (arithmetic, transfers, data management) required by the filtering application. In fact, the DSP architecture used for implementing the filtering algorithms should have a significant impact on the choice of the algorithm if one really wants fast execution times. Both the hardware and software architecture of the DSP are the key to achieving an efficient implementation. Consider briefly the most important hardware structures which influence the choice of the actual algorithm.
Current programmable DSPs are based on pipelined operation. The pipeline provides multiple stages through which the data progress at the basic clock cycle rate of the DSP hardware. In pipelined operation, the instruction fetch, decode, execution and storage operations are independent, which allows the executions of successive instructions to overlap.
An example of a five-stage pipeline is given in Figure 2: at each successive stage in the pipeline, an operation (transfer, multiply, add, shift) is performed. This configuration takes advantage of the inherent decomposition of a filtering algorithm into multiple serial operations. The well-known Multiplier-Accumulator structure, described in Figure 3, is very efficient in computing an inner product. A multiply-accumulate instruction fully utilizes the computational bandwidth of the Multiplier-Accumulator, allowing both operands to be processed simultaneously (in a pipelined manner). The data for these operations can be transferred to the Multiplier-Accumulator each cycle via the program and data buses, which access respectively the filter coefficients and the input data. This facilitates single-cycle Multiplication-Accumulation (MAC) when used with repeat instructions. The arithmetic complexity will therefore be evaluated in terms of the number of MACs, corresponding to the number of machine cycles (neglecting the initialization of the pipeline). Thus, a straightforward implementation of filtering equation (1) requires approximately L MACs per output sample.
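As an illustration (our own sketch, not code from the paper), the direct convolution can be written as the single multiply-accumulate loop that a DSP executes in roughly L cycles per output sample:

```python
# Hypothetical sketch: the direct convolution y[n] = sum_{i=0}^{L-1} h[i]*x[n-i],
# written as the multiply-accumulate (MAC) loop a DSP repeats L times per output.
def fir_direct(x, h):
    L = len(h)
    y = []
    for n in range(len(x)):
        acc = 0.0                       # accumulator register
        for i in range(L):
            if n - i >= 0:              # zero initial conditions
                acc += h[i] * x[n - i]  # one MAC per tap
        y.append(acc)
    return y

print(fir_direct([1.0, 2.0, 3.0], [0.5, 0.5]))  # -> [0.5, 1.5, 2.5]
```

The inner loop body is exactly what a single-cycle MAC instruction performs, which is why L machine cycles per output is the baseline cost.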
[Figure 2 shows five instructions progressing through the pipeline stages (fetch, decode, operand reading, execution, storage), one stage per machine cycle, so that the executions of successive instructions overlap.]
Fig. 2 Five-stage pipeline operation
Most programmable DSPs also offer a modulo addressing mode. This addressing mode can be used in conjunction with careful buffer sizing to minimize the cost associated with memory accesses (read/write), thus allowing a significant reduction of the overheads. It is important to note that the actual implementation of circular buffers allows the indexes to wrap around to the other end. Thus, a pointer register determining some physical address is assigned as a pointer to data which are stored in a circular buffer defined in a preliminary set-up. The classical method used to implement a convolution algorithm consists in defining two circular buffers, addressed modulo the FIR filter length, and two pointer registers attributed respectively to each buffer. This permits efficient access to the data (coefficients and samples) without resorting to data shifts between two computations: only the pointer is incremented, and the new data is stored at the appropriate place. This is a very efficient update of the sample delay line by a pointer register. Unfortunately, the number of pointer registers available on most DSPs is limited (a typical value is 8 pointers, e.g. for the ADSP-2100 of Analog Devices). Consequently, one should avoid distributing the arithmetic computations over too many different delay lines. This would result in an excess of computation time, since several transfer operations would be unavoidable to compensate for the lack of pointers.
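The circular-buffer update described above can be sketched as follows (a minimal illustration in Python, with names of our choosing; a real DSP performs the modulo update in the address generator, at no cycle cost):

```python
# Sketch of modulo addressing: a delay line of length L kept in a circular
# buffer; inserting a new sample only moves one pointer, no data shifts.
class DelayLine:
    def __init__(self, L):
        self.buf = [0.0] * L
        self.ptr = 0                                # emulates a pointer register
    def push(self, sample):
        self.buf[self.ptr] = sample                 # overwrite the oldest sample
        self.ptr = (self.ptr + 1) % len(self.buf)   # modulo address update
    def taps(self):
        # newest-first view: x[n], x[n-1], ..., x[n-L+1]
        L = len(self.buf)
        return [self.buf[(self.ptr - 1 - i) % L] for i in range(L)]

d = DelayLine(3)
for s in [1.0, 2.0, 3.0, 4.0]:
    d.push(s)
print(d.taps())  # -> [4.0, 3.0, 2.0]
```

Each delay line consumes one pointer register, which is precisely why distributing the computation over too many delay lines exhausts the pointer resources.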
[Figure 3 shows the general Multiplier-Accumulator structure: a data address generator and a coefficient address generator access the data memory (samples xi) and the coefficient memory (hi); their outputs drive a multiplier followed by an adder/subtracter and an accumulator producing yn.]
Fig. 3 General Multiplier-Accumulator structure
3. FAST FIR FILTERING ALGORITHMS USING SMALL BLOCK PROCESSING
The output of an FIR filter is given by the convolution (eq. (1)) of the infinite input data sequence {x_i} and the finite, fixed coefficient sequence {h_i} of the FIR filter of length L:

$$y_n = \sum_{i=0}^{L-1} h_i\, x_{n-i}, \qquad n = 0, 1, \ldots \qquad (1)$$
A straightforward implementation of equation (1) on a DSP requires L MACs per output sample, which corresponds to a large amount of computation for large filters. Equation (1) can also be written in the z domain (eq. (2)) as the polynomial product of H(z) by X(z), the z transforms of {h_i} and {x_i}. H(z) is a polynomial of finite degree (degree L-1), while X(z) and Y(z) have infinite degree. Therefore, a fast implementation of equation (3) requires some segmentation of the infinite-degree polynomials into blocks. This results classically in the "overlap-add" or "overlap-save" methods.

$$Y(z) = X(z)\, H(z) \qquad (2)$$
$$y_0 + y_1 z^{-1} + \cdots = \left(h_0 + h_1 z^{-1} + \cdots + h_{L-1} z^{-(L-1)}\right)\left(x_0 + x_1 z^{-1} + \cdots\right) \qquad (3)$$
A general procedure for obtaining the classical block formulations as well as the more recent ones is decimation: by decimating each term of equation (3) by a factor N, the filtering equation is turned into a product of two polynomials of finite degree N-1 (eq. (4)), the coefficients of which are themselves polynomials of infinite degree ({X_i}, {Y_i}) or finite degree ({H_i}).

$$Y_0 + Y_1 z^{-1} + \cdots + Y_{N-1} z^{-(N-1)} = \left(H_0 + H_1 z^{-1} + \cdots + H_{N-1} z^{-(N-1)}\right)\left(X_0 + X_1 z^{-1} + \cdots + X_{N-1} z^{-(N-1)}\right) \qquad (4)$$
6
with

$$H_i = H_i(z^N) = \sum_{k=0}^{(L/N)-1} h_{kN+i}\, z^{-kN}, \qquad i = 0, 1, \ldots, N-1 \qquad (5)$$

$$X_i = X_i(z^N) = \sum_{k \ge 0} x_{kN+i}\, z^{-kN}, \qquad i = 0, 1, \ldots, N-1 \qquad (6)$$

$$Y_i = Y_i(z^N) = \sum_{k \ge 0} y_{kN+i}\, z^{-kN}, \qquad i = 0, 1, \ldots, N-1 \qquad (7)$$
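The decimation of eqs. (5)-(7) and the recombination of eq. (4) can be checked numerically; the following Python sketch (our own illustration, with a toy direct convolution as reference) decimates a finite input block and filter by N, computes the N² sub-filter products, and reassembles the output:

```python
# Sketch of eqs. (4)-(7) on finite blocks: polyphase decimation by N, the N^2
# sub-products H_i X_j, and their recombination at positions z^-(i+j) * z^-(kN).
def conv(a, b):
    y = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            y[i + j] += ai * bj
    return y

def polyphase_conv(x, h, N):
    X = [x[i::N] for i in range(N)]    # decimated input components, eq. (6)
    H = [h[i::N] for i in range(N)]    # decimated filter components, eq. (5)
    out_len = len(x) + len(h) - 1
    y = [0.0] * out_len
    for i in range(N):
        for j in range(N):
            s = conv(H[i], X[j])       # one sub-filter of length L/N
            for k, v in enumerate(s):  # coefficient of z^-((i+j) + kN)
                pos = (i + j) + k * N
                if pos < out_len:
                    y[pos] += v
    return y

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
h = [1.0, -1.0, 2.0, 0.5]
print(polyphase_conv(x, h, 2) == conv(x, h))  # -> True
```

As stated in the text, this rearrangement alone does not change the operation count; the gain only appears once the degree-(N-1) polynomial product is computed by a fast algorithm.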
{H_i} being constant polynomials, each individual product H_j X_i involved in a straightforward computation of equation (4) corresponds to the filtering of an infinite sequence by a length-L/N filter. Thus, equation (4) amounts to the computation of N² filters of length L/N, that is N²(L/N) = LN MACs for computing N output data altogether, or L MACs per output point. The arithmetic complexity has not been modified. The reduction of the arithmetic complexity appears when the length-N polynomial product appearing in eq. (4) is computed by some fast algorithm. Suppose the polynomial product of length N is computed with Φ multiplications; the arithmetic complexity becomes ΦL/N MACs for N outputs, which is smaller than LN if Φ < N². The Chinese Remainder Theorem (CRT) is the basis of these fast algorithms (see [12]). Since the formulation allows the use of small N, small block processing is feasible [17]. It turns out that the application of the CRT provides a lower bound on the number of "multiplications" of the type H_j X_i necessary for computing the finite-degree polynomial product (eq. (4)). This minimum is 2N-1 multiplications. Hence, if this minimum were practically attainable, the number of MACs per output sample would be considerably reduced, to (2N-1)L/N² MACs per output. This is explained in more detail in reference [18]. The application of the CRT structures the algorithm into three main parts [18], [19], in which various choices of the parameters N (the decimation factor) and {a_i} (the set of interpolation points) provide different algorithms. The minimum number of multiplications can be practically attained for a small block size N, whereas for larger N the optimal algorithm involves too many additions/subtractions and is too sensitive to numerical errors (quantization noise) to be of practical interest. Whether the interpolation points {a_i} are chosen to be real-valued or complex-valued determines the two main classes of algorithms described in the following section.
These FIR filtering algorithms process altogether a small block of N consecutive filter outputs, in order to take into account the redundancy found in these N successive computations. This method consequently permits a reduction of the arithmetic complexity while partially maintaining (on smaller filters) the classical FIR filtering structure. Furthermore, these algorithms are characterized by a relatively small processing delay, fixed by the decimation factor N and not dependent on the filter length. The different steps used to construct a fast FIR filtering algorithm are summarized in Fig. 4: first an evaluation (pre-processing part), then the short-length filters (filtering part), and finally the reconstruction (post-processing part).
[Figure 4 shows the general three-part structure: X(z) enters a pre-processing block producing N decimated branches 0, ..., N-1 (with z^-N delays), which feed 2N-1 real sub-filters RF(0), ..., RF(2N-2) in parallel; a post-processing block recombines their outputs into Y(z).]
Fig. 4 General structure of a fast FIR filtering algorithm
3.1. BASIC SHORT FILTERING ALGORITHMS F(N_i, N_i)
The basic short fast FIR filtering algorithms are denoted F(N_i, N_j), where N_i denotes the decimation factor on the input and output signals, while N_j denotes the decimation factor on the filter. Only the simplest case N_i = N_j = N is considered here. The number of sub-filters required by the actual basic fast algorithm will be denoted by φ_i. In some cases, this number will be equal to the lower bound (i.e. 2N-1), while being sometimes larger.
Two basic examples are proposed for the decimation factor values N_i = 2 and N_i = 3. The simplest short-length algorithms have been constructed by considering the simplest real-valued interpolation points, i.e. {a_i} = {-1, 0, 1, ∞}. Note here that a_i = ∞ is an abuse of notation denoting the evaluation of the term of highest degree of the polynomial (see [18]).
The reconstruction formulas of the full basic FIR filtering algorithm F(2,2) using {a_0, a_1, a_2} = {0, 1, ∞} as interpolation points are easily obtained (see [18]), and given below:

$$Y_0 = X_0 H_0 + z^{-2} X_1 H_1 \qquad (8)$$

$$Y_1 = (X_0 + X_1)(H_0 + H_1) - X_0 H_0 - X_1 H_1 \qquad (9)$$
The overall organization (see Fig. 5) shows that the short-length FIR filtering algorithm F(2,2) results in a first pre-processing block, followed by 3 (i.e. φ_i = 3) sub-filters (H_0, H_1, H_0+H_1) of length L/2, all in parallel, and finally a post-processing block.
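Eqs. (8)-(9) can be verified numerically on finite blocks; the following Python sketch (our own illustration, with a toy direct convolution as reference) computes the three sub-products and recombines them:

```python
# Sketch of F(2,2) on finite blocks: 3 sub-products instead of 4 (eqs. (8)-(9)).
def conv(a, b):
    y = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            y[i + j] += ai * bj
    return y

def fir_f22(x, h):
    # len(x) and len(h) are assumed even
    X0, X1 = x[0::2], x[1::2]
    H0, H1 = h[0::2], h[1::2]
    A = conv(X0, H0)                               # X0 H0
    B = conv(X1, H1)                               # X1 H1
    C = conv([a + b for a, b in zip(X0, X1)],
             [a + b for a, b in zip(H0, H1)])      # (X0+X1)(H0+H1)
    n = len(x) + len(h) - 1
    y = [0.0] * n
    for k in range(len(A)):
        y[2 * k] += A[k]                           # Y0: X0 H0 term
        y[2 * k + 1] += C[k] - A[k] - B[k]         # Y1, eq. (9)
    for k in range(len(B)):
        if 2 * (k + 1) < n:
            y[2 * (k + 1)] += B[k]                 # Y0: z^-2 X1 H1 term, eq. (8)
    return y

x = [1.0, 2.0, 3.0, 4.0]
h = [0.5, 0.5, 0.5, 0.5]
print(fir_f22(x, h) == conv(x, h))  # -> True
```

The three `conv` calls are exactly the three length-L/2 sub-filters of Fig. 5; on a DSP each would be one MAC loop.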
[Figure 5 shows the F(2,2) structure: X(z) is decimated by 2 into X_0(z²) and X_1(z²) (with z^-1 delays); a pre-processing adder forms X_0+X_1; the three sub-filters H_0(z²), H_1(z²) and H_0(z²)+H_1(z²), each of length L/2, run in parallel; post-processing adders/subtracters and a z^-2 delay recombine their outputs into Y_0(z) and Y_1(z), hence Y(z).]
Fig. 5 Structure of the fast FIR filtering algorithm F(2,2), {a_0, a_1, a_2} = {0, 1, ∞}
Now consider a decimation factor N_i = 3. An optimal basic filtering algorithm F(3,3) requires five interpolation points {a_i} = {0, 1, -1, ∞, ?}. Here, a problem appears, because only four points are really simple in the real domain. An additional value such as 2 introduces further additions and involves computations with a large dynamic range: multiplications by 8 would result in a much larger round-off noise, even when implemented by shifts. Hence, instead of adding a fifth interpolation point (see [1]), it is wiser to apply the previous F(2,2) algorithm recursively (see eqs. (8), (9)):
$$Y_0 + Y_1 z^{-1} + Y_2 z^{-2} = \left(X_0 + X_1 z^{-1} + X_2 z^{-2}\right)\left(H_0 + H_1 z^{-1} + H_2 z^{-2}\right) \qquad (10)$$

in which the degree-2 product is grouped as a product of two degree-1 polynomials, to which F(2,2) applies:

$$Y(z) = \left(\tilde{X}_0 + \tilde{X}_1 z^{-1}\right)\left(\tilde{H}_0 + \tilde{H}_1 z^{-1}\right) \qquad (11)$$
The final filtering equations are given by:

$$Y_0 = X_0 H_0 + z^{-3}\left[(X_1 + X_2)(H_1 + H_2) - X_1 H_1 - X_2 H_2\right] \qquad (12)$$

$$Y_1 = (X_0 + X_1)(H_0 + H_1) - X_1 H_1 - X_0 H_0 + z^{-3} X_2 H_2 \qquad (13)$$

$$Y_2 = (X_0 + X_1 + X_2)(H_0 + H_1 + H_2) - \left[(X_0 + X_1)(H_0 + H_1) - X_1 H_1\right] - \left[(X_1 + X_2)(H_1 + H_2) - X_1 H_1\right] \qquad (14)$$
The initial FIR filter has been broken into 6 (φ_i = 6) sub-filters {H_0, H_1, H_2, H_0+H_1, H_1+H_2, H_0+H_1+H_2} in parallel (see Fig. 6), whose lengths L/3 are smaller than the initial length L. Compared to the optimal algorithm, this is 6 sub-filters rather than 5. This is the price to be paid for obtaining a more precise algorithm, while keeping the overhead (number of additions) small. Even if non-optimal, the algorithm is efficient, since the L MACs required by the direct convolution are reduced to 2L/3 MACs per output point. This may be significant for medium or large length filters. Just like the F(2,2) algorithm, the F(3,3) algorithm performs block processing: it computes altogether a set of three consecutive outputs, thus taking advantage of the redundancy between successive computations. The processing delay (I/O) is small (6 samples) and is independent of the FIR filter length.
Table 1 gives an evaluation of the arithmetic complexity of the different basic F(2,2) and F(3,3) algorithms built for various simple interpolation points. The interpolation points 1 and -1 generate either an addition or a subtraction. The last two F(2,2) algorithms require one addition more, giving a total of 3 + (3/2)(L/2 - 1) additions per output sample, compared to the first and second ones. On some recent DSPs, this additional operation has no consequence, because they are able to compute an addition and a subtraction a ± b efficiently (at a cost of only one machine cycle) at the same time, thanks to the pipeline technique. The number of multiplications, not shown, is equal to the number of MACs in Table 1. The fifth column provides the theoretical savings (25%, 34%), both in terms of multiplications and of MAC operations, compared to the direct convolution. Since this improvement is given as a percentage of the initial computational load, and since it is obtained at the cost of a fixed (and small) number of additions, these savings are asymptotic in terms of DSP cycles when the length of the filter increases, but are reached very quickly. Finally, the last column gives the processing delay, which is proportional to the decimation factor.
[Figure 6 shows the F(3,3) structure: X(z) is decimated by 3 into X_0(z³), X_1(z³) and X_2(z³); pre-processing adders form the input combinations; the six sub-filters H_0(z³), H_1(z³), H_2(z³), H_0(z³)+H_1(z³), H_1(z³)+H_2(z³) and H_0(z³)+H_1(z³)+H_2(z³), each of length L/3, run in parallel; post-processing adders/subtracters and z^-3 delays recombine their outputs into Y_0(z), Y_1(z) and Y_2(z), hence Y(z).]
Fig. 6 Structure of the fast FIR filtering algorithm F(3,3)
Basic filtering algorithm F(N_i, N_i)  | # sub-filters φ_i | # Add/Sub (per point) | # MAC (per point) | Asymptotic savings (MAC) | Processing delay
F(2,2), {a_0, a_1, a_2} = {0, 1, ∞}    | 3 | 2 + (3/2)(L/2 - 1)  | 3L/4 | 25% | 4Te
F(2,2), {a_0, a_1, a_2} = {0, -1, ∞}   | 3 | 2 + (3/2)(L/2 - 1)  | 3L/4 | 25% | 4Te
F(2,2), {a_0, a_1, a_2} = {1, -1, ∞}   | 3 | 3 + (3/2)(L/2 - 1)  | 3L/4 | 25% | 4Te
F(2,2), {a_0, a_1, a_2} = {-1, 0, 1}   | 3 | 3 + (3/2)(L/2 - 1)  | 3L/4 | 25% | 4Te
F(3,3), {F(2,2) with 0, 1, ∞}          | 6 | 10/3 + 2(L/3 - 1)   | 2L/3 | 34% | 6Te
Table 1 Characteristics of the basic short FIR filtering algorithms
3.2. BASIC FILTERING ALGORITHMS USING SHORT FFT's FF(N, N)
Interpolation points can also be chosen in the complex domain. A particular set of interpolation points, taken on the unit circle, {a_i = W_K^i = exp(-2jπi/K), with K = 2N-1}, results in an FFT-based fast FIR filtering algorithm (see [18], [19]) denoted by FF(N, N). The first part requires the computation of a Discrete Fourier Transform (DFT) performed on 2N-1 real points. The second part consists of 2N-1 complex FIR filters in parallel. The third part computes an Inverse DFT (IDFT). The authors of [16] have demonstrated that an additional interpolation point is possible, in order to bring the DFT and IDFT lengths to 2N. Indeed, it is simpler to compute a DFT whose length is a power of 2. For a fast computation, the DFTs are computed through Fast Fourier Transform (FFT) algorithms (see Fig. 7).
[Figure 7 shows the FF(N,N) structure: the block of samples x(n), ..., x(n-N+1) and its z^-N-delayed copy feed a 2N-point DFT; the 2N complex sub-filters CF(0), ..., CF(2N-1) filter the frequency bins in parallel; a 2N-point IDFT reconstructs y(n), ..., y(n-N+1), N output values being ignored.]
Fig. 7 Structure of the fast FIR filtering algorithm using short FFT's
The real-valued DFT computations
Note that, the input samples being real-valued, the DFT output has hermitian symmetry. Hence, only half of the outputs need to be computed. Although some algorithms for computing fast DFTs of real-valued data are available, such as the split-radix or fast Hartley transform algorithms, it is often more efficient to use more classical computations based on complex-valued FFTs of shorter length, which have been optimized by the DSP manufacturer. The classical technique we used for an efficient implementation on any recent DSP is now described.
The 2N-point real input sequence {x(n)} is divided into even and odd indexed sub-sequences, denoted respectively {h(n)} and {g(n)}. A complex sequence {y(n) = h(n) + j g(n)} is built from these two real-valued sequences. The final expression of the DFT of {x(n)} is given by equation (16). The other values are obtained by exploiting the symmetry and redundancy of the DFT of real points (see eq. (17)).
$$X(k) = \sum_{n=0}^{2N-1} x(n)\, w_{2N}^{nk} = \sum_{n=0}^{N-1} h(n)\, w_N^{nk} + w_{2N}^{k} \sum_{n=0}^{N-1} g(n)\, w_N^{nk} \qquad (15)$$

$$X(k) = \left(Y(k) + Y^*(N-k)\right)/2 + j\, w_{2N}^{k}\left(Y^*(N-k) - Y(k)\right)/2 \qquad (16)$$

$$X(k) = X^*(2N-k), \qquad k = N+1, N+2, \ldots, 2N-1 \qquad (17)$$

where w_M = exp(-2jπ/M) and Y(k) denotes the N-point DFT of {y(n)}.
Therefore, the computation of a length-2N real-valued DFT is optimized by computing a single length-N complex DFT, with an additional post-processing which is easily and efficiently implemented (see Fig. 8), since it makes use of the FFT kernel.
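The packing of eqs. (15)-(17) can be sketched as follows (our own Python illustration; a naive DFT stands in for the manufacturer-optimized FFT kernel):

```python
# 2N-point DFT of a real sequence via one N-point complex DFT, eqs. (15)-(17).
import cmath

def dft(y):
    # naive N-point complex DFT (stand-in for an optimized FFT kernel)
    N = len(y)
    return [sum(y[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def real_dft_2N(x):
    N = len(x) // 2
    y = [complex(x[2 * n], x[2 * n + 1]) for n in range(N)]  # y(n)=h(n)+j g(n)
    Y = dft(y)
    X = [0j] * (2 * N)
    for k in range(N + 1):                       # recombination, eq. (16)
        Yk = Y[k % N]
        Ys = Y[(N - k) % N].conjugate()
        w = cmath.exp(-2j * cmath.pi * k / (2 * N))
        X[k] = (Yk + Ys) / 2 + 1j * w * (Ys - Yk) / 2
    for k in range(N + 1, 2 * N):                # hermitian symmetry, eq. (17)
        X[k] = X[2 * N - k].conjugate()
    return X

x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 1.0]
ref = dft([complex(v, 0.0) for v in x])          # direct 2N-point DFT
print(max(abs(a - b) for a, b in zip(real_dft_2N(x), ref)) < 1e-9)  # -> True
```

On a DSP, only the loop implementing eq. (16) is executed in addition to the complex FFT kernel, which is the post-processing evaluated in Table 2.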
A careful analysis of the previous equations shows that some improvement may be obtained if we compute X(k) and X*(N-k) at the same time, because they use the same operands (see [23]). Consequently, some redundant arithmetic computations are eliminated, as well as a number of data transfers (read/write) to memory, which contribute to the execution time of the algorithm. Some evaluations are provided in Table 2: the second column evaluates the cost of the computation of {X(k), k = 0, ..., N}, and the third one deals with the computation of {X(k) and X*(N-k), k = 0, ..., (N/2) - 1} altogether.
Post-processing                    | X(k) alone | X(k) and X*(N-k) altogether
Arithmetic operations (Add/Sub)    | 8N         | 5N
Transfer operations                | 6N         | 3N
Total for N complex points         | 14N        | 8N
Table 2
Moreover, small additional improvements are possible by computing separately the points X(0), X(N) and X(N/2), which are real-valued. The optimized organization of the pre-processing required by the computation of a length-2N real-valued DFT is illustrated in Fig. 8.
[Figure 8 details this optimized organization: the real input block x(n), ..., x(n-N+1) and its z^-N-delayed copy are packed into the complex sequence fed to an N-point complex FFT; butterfly-like recombinations with the twiddle factors w_{2N}^k = exp(-2jπk/(2N)) and a 1/2 scaling produce X(k) and X*(N-k) simultaneously for k = 1, ..., N/2 - 1, the real-valued outputs X(0) and X(N) being computed separately.]
Fig. 8 Optimal organization of the pre-processing module
The complex filtering computation
Since the DFT computation provides complex outputs, the sub-filters fed by the output of each frequency bin are also complex. Due to the hermitian symmetry, only N+1 such sub-filters of length L/N have to be computed. A straightforward implementation of these N+1 complex sub-filters is equivalent to 4(N+1) real-valued sub-filters of the same length. The improvements brought in [19] allow each complex sub-filter to be replaced by 3 real-valued sub-filters (see Fig. 9), reducing 4(N+1) to 3(N+1). Moreover, 2 particular complex sub-filters are real-valued (see [23]), so that the decomposition of Fig. 9 is useless for these 2 sub-filters: removing the 4 unnecessary real sub-filters reduces 3(N+1) to 3N-1 real FIR sub-filters in parallel.
[Figure 9 shows how one complex sub-filter H(z) = Hr(z) + jHi(z) applied to X(z) = Xr(z) + jXi(z) is computed with the 3 real sub-filters Hr(z), Hr(z)+Hi(z) and Hr(z)-Hi(z), whose outputs are combined by additions/subtractions into Yr(z) and Yi(z).]
Fig. 9 Structure of the FIR complex filter
Filtering algorithm FF(N, N) | # real-valued sub-filters (3N-1) | Savings (MAC) | Processing delay (2NTe)
FF(4,4)   | 11 | 31% | 8Te
FF(8,8)   | 23 | 64% | 16Te
FF(16,16) | 47 | 82% | 32Te
Table 3
[Figure 10 details the FIR filtering module: the real-valued bins Real(X(0)) and Real(X(N)) feed purely real sub-filters, while for the other bins the real and imaginary parts Real(X(k)) and Imag(X(k)) feed the three real sub-filter banks with coefficients Real(H), Real(H)+Imag(H) and Real(H)-Imag(H) (taps H(0), ..., H(L/N-1), separated by z^-N delays); the partial results mr(k), mi(k) and mri(k) are passed to the post-processing module.]
Fig. 10 FIR filtering module
A dual procedure is used for the computation of the 2N-point complex inverse FFT with hermitian symmetric input. This amounts to the calculation of a single length-N complex inverse FFT, by the procedure below:

$$x(n) = \frac{1}{2N} \sum_{k=0}^{2N-1} X(k)\, w_{2N}^{-nk}$$

$$X(k+N) = X^*(N-k)$$

$$x(2n) = \frac{1}{2N} \sum_{k=0}^{N-1} \left(X(k) + X^*(N-k)\right) w_N^{-nk}$$

$$x(2n+1) = \frac{1}{2N} \sum_{k=0}^{N-1} \left(X(k) - X^*(N-k)\right) w_{2N}^{-k}\, w_N^{-nk}$$

x(2n) and x(2n+1) being real-valued:

$$x(2n) + j\,x(2n+1) = \frac{1}{2N} \sum_{k=0}^{N-1} \left[\left(X(k) + X^*(N-k)\right) + j\, w_{2N}^{-k} \left(X(k) - X^*(N-k)\right)\right] w_N^{-nk} \qquad (18)$$

$$V(k) = \left(X(k) + X^*(N-k)\right) + j\, w_{2N}^{-k} \left(X(k) - X^*(N-k)\right) \qquad (19)$$

Equation (18) shows that the computation of a complex IFFT with hermitian symmetry is done by one complex IFFT of twice smaller length, plus some pre-processing, as illustrated in Fig. 11. This figure depicts the dual operations of Fig. 8; both flow graphs are transposed of each other.
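Eqs. (18)-(19) can be sketched as follows (our own Python illustration, with naive DFT/IDFT in place of optimized FFT kernels, and the 1/(2N) inverse scaling made explicit):

```python
# Inverse of a hermitian-symmetric 2N-point spectrum via one N-point complex
# inverse DFT, eqs. (18)-(19).
import cmath

def dft(y):
    N = len(y)
    return [sum(y[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def idft(V):
    N = len(V)
    return [sum(V[k] * cmath.exp(2j * cmath.pi * n * k / N) for k in range(N)) / N
            for n in range(N)]

def real_idft_2N(X):
    N = len(X) // 2
    V = [(X[k] + X[N - k].conjugate())
         + 1j * cmath.exp(2j * cmath.pi * k / (2 * N))
              * (X[k] - X[N - k].conjugate())
         for k in range(N)]                       # eq. (19)
    s = idft(V)                                   # N-point complex inverse FFT
    x = [0.0] * (2 * N)
    for n in range(N):                            # eq. (18): unpack even/odd
        x[2 * n] = s[n].real / 2
        x[2 * n + 1] = s[n].imag / 2
    return x

x0 = [1.0, -2.0, 3.0, 0.5, 2.0, 1.0, -1.0, 4.0]
X = dft([complex(v, 0.0) for v in x0])            # hermitian-symmetric spectrum
print(max(abs(a - b) for a, b in zip(real_idft_2N(X), x0)) < 1e-9)  # -> True
```

The pre-processing loop building V(k) is the dual of the recombination of eq. (16), consistent with the transposed flow graphs of Figs. 8 and 11.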
[Figure 11 shows the organization of the post-processing module, dual of Fig. 8: the sub-filter outputs mr(k), mi(k) and mri(k) are recombined with the twiddle factors w_{2N}^k = exp(-2jπk/(2N)) into V(k) and V*(N-k) for k = 0, ..., N/2; an N-point complex inverse FFT then yields Real(S(k)) and Imag(S(k)), i.e. the outputs y(n), ..., y(n-N+1), N values being ignored.]
Fig. 11 Organization of the post-processing module
The overall algorithm
The algorithm FF(N,N) operates on a small block composed of N consecutive samples, requiring the calculation of FFTs of length 2N (see Fig. 8). Whatever the filter length L, it is possible to maintain a low processing delay, given by 2N samples. Obviously, long FFTs will result in more efficient algorithms (as far as the number of operations is concerned), but also in a longer I/O delay. Note also that, when the FFT length increases, multiplications appear in the pre- and post-processing modules.
4. A "LEGO" VIEW OF FAST FIR FILTERING ALGORITHMS
All these basic filtering algorithms, coming from the same general procedure, are hence compatible: whenever some algorithm produces sub-filters, another basic fast algorithm can be applied to the sub-filters as well, and this recursively. This improves the efficiency of the fast algorithm, at the cost of an increased delay.
4.1. GENERAL STRUCTURE OF COMPOSITE FIR FILTERING ALGORITHMS
The basic filtering algorithms have the same computational structure, so it is easily seen that these algorithms can be recursively nested: any basic fast FIR algorithm can be applied to the so-called sub-filters as well, leading to a composite fast FIR filtering algorithm.
Assume that L, the FIR filter length, is composite: L = N_1 N_2 ... N_n L_r. The general composite FIR filtering algorithm, denoted by F(N_1, N_1), ..., F(N_n, N_n), is built from the basic algorithms seen in the previous section. Each filtering algorithm denoted F(N_i, N_i) may be chosen either as F(N_i, N_i) or as FF(N_i, N_i). The procedure consists in applying the first basic algorithm F(N_1, N_1): this algorithm decimates the initial filter by a factor N_1 into φ_1 sub-filters of length L/N_1. The second algorithm F(N_2, N_2) operates on each of the φ_1 sub-filters, generating φ_1 φ_2 sub-filters in parallel, whose length is equal to L/(N_1 N_2). The process is iterated until the last basic algorithm F(N_n, N_n), giving therefore ∏_{i=1}^{n} φ_i sub-filters in parallel, whose length is equal to L/∏_{i=1}^{n} N_i = L_r.
[Figure 12 shows the composite structure obtained by nesting two F(2,2) algorithms: the pre-processing combines the decimated inputs of X_n; the 9 sub-filters h_00, h_01, h_00+h_01, h_10, h_11, h_10+h_11, h_20, h_21, h_20+h_21, each of length L/4, run in parallel; post-processing adders/subtracters and z^-2, z^-4 delays recombine their outputs into Y_n.]
Fig. 12 Structure of the composite fast FIR filtering algorithm F(2,2), F(2,2)
The general composite fast FIR filtering scheme still has a structure similar to that of the basic filtering algorithms: it is made of three parts. The first one groups the pre-processing parts of all the F(N_i, N_i) that are successively applied. The second part consists of ∏_{i=1}^{n} φ_i real-valued FIR sub-filters in parallel. The last one is composed of the individual post-processing stages of the F(N_i, N_i). The pre- and post-processing parts generate and recombine the corresponding data. Figure 12 presents the algorithm obtained by nesting two F(2,2) algorithms, thus resulting in 3 × 3 = 9 decimated FIR sub-filters.
4.2. ARITHMETIC COMPLEXITY
The use of several decompositions decreases the arithmetic complexity considerably (see [5]): it is on the order of (∏_{i=1}^{n} φ_i) L / (∏_{i=1}^{n} N_i)² MACs per output sample. This number does not include the pre- and post-processing operations, which can be quite complicated when FFTs are involved. Note that this complexity remains low when FFTs of short length are used (which is always the case in practical implementations on DSPs if one has a delay constraint). These computations will, however, be included in our later evaluations in terms of DSP cycles. In some cases, the number of MACs can effectively be halved without increasing the number of additions. Table 4 provides an evaluation of the various possibilities obtained by combining the basic algorithms explained above. The first column describes the algorithm, giving the order in which the decompositions are applied. The second column represents the number of FIR sub-filters that are run in parallel. The third column gives, for each algorithm, the asymptotic savings in terms of MACs per output, compared to the direct convolution, which requires L MACs per output point. The last column evaluates the processing delay as a function of the sampling period Te. Note that the asymptotic performances are not influenced by the order of the decomposition. However, this order influences the overhead, i.e. the rate at which the asymptotic gain is reached. This will become clearer later on.
Composite filtering algorithm              # Sub-filters      Asymptotic savings       Processing delay
F(N_1,N_1),...,F(N_n,N_n)                  ∏α_i               1 − (∏α_i)/(∏N_i)²       2(∏N_i)Te

Direct convolution                         1                  0%                       Te
F(2,2)                                     3                  25%                      4Te
F(3,3)                                     6                  34%                      6Te
F(2,2),F(2,2)                              9                  44%                      8Te
F(2,2),F(3,3)                              18                 50%                      12Te
F(3,3),F(2,2)                              18                 50%                      12Te
F(3,3),F(3,3)                              36                 56%                      18Te
F(2,2),F(2,2),F(2,2)                       27                 58%                      16Te
F(2,2),F(2,2),F(3,3)                       54                 63%                      24Te
F(2,2),F(3,3),F(2,2)                       54                 63%                      24Te
F(3,3),F(2,2),F(2,2)                       54                 63%                      24Te
F(3,3),F(3,3),F(2,2)                       108                67%                      36Te
F(2,2),F(3,3),F(3,3)                       108                67%                      36Te
F(3,3),F(2,2),F(3,3)                       108                67%                      36Te
F(3,3),F(3,3),F(3,3)                       216                71%                      54Te
F(2,2),F(2,2),F(2,2),F(2,2)                81                 69%                      32Te
FF(4,4),F(2,2)                             33                 48%                      16Te
FF(4,4),F(2,2),F(2,2)                      99                 61%                      32Te
FF(4,4),F(2,2),F(2,2),F(2,2)               297                70%                      64Te
FF(4,4),F(2,2),F(2,2),F(2,2),F(2,2)        891                78%                      128Te
FF(8,8),F(2,2)                             69                 73%                      32Te
FF(8,8),F(2,2),F(2,2)                      207                79%                      64Te
FF(8,8),F(2,2),F(2,2),F(2,2)               621                85%                      128Te
FF(8,8),F(2,2),F(2,2),F(2,2),F(2,2)        1863               88%                      256Te
FF(16,16),F(2,2)                           141                86%                      64Te
FF(16,16),F(2,2),F(2,2)                    423                89%                      128Te
FF(16,16),F(2,2),F(2,2),F(2,2)             1269               92%                      256Te

Table 4
4.3. PROBLEMS BETWEEN DSP ARCHITECTURE AND FILTERING ALGORITHMS
However, at first glance, if such a strategy is used in our fast algorithms, the number of pointer registers seems to be proportional to the number of sub-filters. For example, the composite FIR filtering algorithm FF(N_1,N_1), F(N_2,N_2), ..., F(N_n,N_n) would require 2(3N_1 − 1)∏_{i=2}^{n} α_i pointer registers for its DSP implementation. Table 5 evaluates, for some filtering algorithms, the number of pointer registers required. It is easily seen that actual DSPs do not satisfy these requirements. This problem was already pointed out in previous papers (see [7]), in which the algorithm was restricted to a single decomposition because of these constraints. Particular attention was devoted to this difficulty in [9], [10], and a solution is given below.
Filtering algorithm                        # Sub-filters      # Pointer registers
F(N_1,N_1),...,F(N_n,N_n)                  ∏α_i               2∏α_i

F(2,2)                                     3                  6
F(3,3)                                     6                  12
F(2,2),F(2,2)                              9                  18
F(2,2),F(3,3)                              18                 36
F(3,3),F(3,3)                              36                 72
F(2,2),F(2,2),F(2,2)                       27                 54
F(2,2),F(2,2),F(3,3)                       54                 108
F(3,3),F(3,3),F(2,2)                       108                216
FF(4,4)                                    11                 22
FF(8,8)                                    23                 46
FF(4,4),F(2,2)                             33                 66
FF(4,4),F(2,2),F(2,2)                      99                 198
FF(4,4),F(2,2),F(2,2),F(2,2)               297                594
FF(16,16),F(2,2)                           141                282

Table 5
5. UNIFIED METHODOLOGY FOR AN EFFICIENT REAL-TIME
IMPLEMENTATION ON DSP
In the previous paragraphs we have seen that, whatever the filtering algorithm F(N_1,N_1),...,F(N_n,N_n) (short length, composite length, or based on short FFTs), it is made of three main parts. The proposed general methodology makes use of this property and of the mathematical structure of nested algorithms. In order to keep as much as possible of the improvement brought by the reduced arithmetic complexity of these filtering algorithms, without exceeding the resources available on DSPs, we propose an efficient data memory management and organization. The technique is based on a precise address generation allowing the use of only two modulo addressing modes for the global algorithm. The basic idea proposed to overcome the heavy pointer requirements consists in defining one delay line for ordering the data and one array for storing the sub-filter coefficients. The number of pointer registers required for computing all sub-filter outputs is thus reduced to only two. This is rather easily obtained because, for a given algorithm combination, all quantities involved are evaluated by simple formulas depending on {N_i}, {α_i} and the sub-filter length. Once this is done, an efficient implementation is obtained by storing the elements to be used in a filtering equation at regularly spaced locations in a delay line. As a first example, let us evaluate the total sizes of the data buffer and of the coefficients array. The new delay line regroups all ∏_{i=1}^{n} α_i data lines, each of length equal to the sub-filter length L_r. Therefore the data buffer size and the coefficients array size are both equal to the product of the number of sub-filters in parallel by the sub-filter length L_r (see eq. (20)):

Size_buffer = Size_array = (∏_{i=1}^{n} α_i) L_r = (∏_{i=1}^{n} α_i)(L / ∏_{i=1}^{n} N_i)     (20)

Note that although FF(N_i,N_i) requires the computation of 3N_i − 1 inner products (i.e. 11 for N_i = 4), the filtering module (see Fig. 10) is structured as if 3N_i − 1 sub-filters were computed. Indeed, this structuring maintains the processing regularity of the proposed implementation. So, in this case, α_i is equal to 3N_i − 1.
5.1. EFFICIENT IMPLEMENTATION OF THE PRE-PROCESSING MODULE
The efficiency of the filtering algorithm depends essentially on the arithmetic complexity and on the data management of the pre-processing module. So far, the basic filtering algorithms F(N_i,N_i) presented in the previous section have been evaluated in terms of arithmetic complexity only. The study presented in this section is devoted to the data memory organization. From the pre-processing parts of the basic filtering algorithms, the method builds the pre-processing module of the composite filtering algorithm without requiring the explicit filtering equations.
The data line must store the L(∏_{i=1}^{n} α_i)/(∏_{i=1}^{n} N_i) most recent combined samples; an updating operation is therefore necessary. Since each of the ∏_{i=1}^{n} α_i data lines requires the update of one sample, the whole delay line of length L(∏α_i)/(∏N_i) requires ∏_{i=1}^{n} α_i combined samples to be updated. The sub-filter coefficients are fixed, so the coefficients array does not require any updating operation; their memory storage is identical to the data storage described in the following.
The method proposed for the actualization of the interleaved samples is based on a progressive filling technique, applied as each basic filtering algorithm F(N_i,N_i) is applied. The pre-processing module combines (arithmetic operations) and organizes data found at equally spaced memory locations, according to the basic algorithm employed. The physical address is determined by the relationship between the decimation factor N_i and the number of sub-filters in parallel.
The initialization step consists in distributing ∏_{i=1}^{n} N_i input data at particular positions in the data delay line, over a segment of total length ∏_{i=1}^{n} α_i. Then the algorithm F(N_i,N_i) processes ∏_{j=i+1}^{n} N_j times a number N_i of sequences, each one of size ∏_{j=1}^{i-1} α_j. The memory is organized according to these quantities, in a block manner. Each block is indexed by three quantities, M(l,k), where M is the size of the block (M = ∏_{j=1}^{i-1} α_j = α_1...α_{i-1}), l is the sequence number (l = 1,...,N_i) and k is a recursion index (k = 1,...,∏_{j=i+1}^{n} N_j). This is illustrated in Fig. 13.
18
1( , )1 1
N1
1( , )N2 11( , )2 1
1 2( , )1 1 1 2( , )N3 1
1 2 3( , )1 1
N1N1 N1 N1
1( , )1 3N 1( , )2
3N 1( , )N N2 3
1( , )1 2
N1
1( , )N2 2
N1
1( , )2 2
N1N1 N1
N1
1( , )1
3N
1( , )2
3N
1( , )N N
2 3
1 2 3( , )2 1
1 2( , )1 2 1 2( , )N3 2
ii 1
n
L Nii 1
n
ii 1
n
/
: Free memory space filled by the application of
: Free memory space filled by the application of
: Free memory space filled by the application of
F N N*( , )1 1
F N N( , )2 2
F N N( , )3 3
Data delay line length
update
Ni : Decimation factor
N N N N 1 2 n... : Input data block processing
: Sub-filters generated byi F N N( , )i i
Step F N NWrite 1 1( ( , ))
Step F N NadRe ( ( , ))1 1
Fig. 13 Optimal memory data organization
The trick used here is that the basic algorithm combination tells in advance how many sub-filters will be used. It is thus possible to store at contiguous locations all data corresponding to the same delay in all sub-filters. Such a "slice" of the delay line is efficiently processed by a cyclic pointer, since the increment needed to fetch the data of a specific filter is the same for all of them. The increments used to fetch or store data are given below.
After the initialization phase, the first basic filtering algorithm F(N_1,N_1) is applied to the distributed input data. The algorithm combines (arithmetic operations) each group of N_1 input sequences according to its pre-processing part; for each group of N_1 input sequences, it thus generates α_1 data (see Fig. 13). These results are stored in the delay line at appropriate, equally spaced places, illustrated by '0' in Fig. 13. The positions of the data to be read are given by the reading steps, and the storage positions by a writing step (see eqs. (21), (22)).
The next basic filtering algorithm F(N_2,N_2) processes ∏_{i=3}^{n} N_i sequences composed of N_2 consecutive blocks α_1(l,k), with, for a given k (k = 1,...,∏_{i=3}^{n} N_i), l = 1,...,N_2. The algorithm generates several blocks α_1α_2(l,k), for all k and l in the new limits: k = 1,...,∏_{i=4}^{n} N_i, l = 1,...,N_3.
Based on the same principle, each following algorithm treats the scheduled data provided by the preceding one and stores its results, equally spaced, in the delay line, and so on until the last basic algorithm F(N_n,N_n).
Assume now that a basic algorithm F(N_i,N_i) is applied. Figure 14 shows the increments required to position the data delay-line pointer register at the correct memory address for reading or writing data.
(Figure: blocks α_1...α_i(l,k) in the data delay line, with the reading steps Step_Read^1 to Step_Read^5 and the writing step Step_Write of F(N_i,N_i). Pointer register = initial physical address + step, modulo the delay-line length.)
Fig. 14 Pointer register increments
The pointer register increments (see Fig. 14) are defined below. Each one is a simple closed-form function of the decimation factors {N_j} and of the sub-filter counts {α_j}. The writing step Step_Write(F(N_i,N_i)), a function of N_i, α_i and ∏_{j=1}^{i-1} α_j, gives the constant spacing at which the α_i results of each combination are stored (21). The first two reading steps coincide: Step_Read^1(F(N_i,N_i)) = Step_Read^2(F(N_i,N_i)) = 1 if i = 1, and ∏_{j=1}^{i-1} α_j if i ≠ 1 (22). The reading steps Step_Read^3(F(N_i,N_i)) (23) and Step_Read^4(F(N_i,N_i)) (24) locate the successive sequences and blocks to be combined; both are piecewise expressions in the N_j and the α_j, with one case for i = 1 and one for i ≠ 1. The last step, Step_Read^5(F(N_i,N_i)) (25), has the same piecewise structure and is used to position the pointer register in order to apply the next algorithm F(N_{i+1},N_{i+1}).
The filter-coefficients memory organization is based on the same method as the data delay line described above, but it is applied only once, since the filter coefficients are fixed. The coefficients memory organization is thus done off-line.
5.2. EFFICIENT IMPLEMENTATION OF THE FILTERING MODULE
The filtering module, made of the FIR sub-filters, is efficiently implemented on DSPs, since the data to be combined in each inner-product computation (MAC) are available at equally spaced positions in memory (see Fig. 15).
(Figure: the combined samples (data delay line, length L_r ∏α_i) and the combined coefficients (coefficients array, length L_r ∏α_i) are read with step ∏α_i and multiplied into the ∏α_i partial convolutions; a memory space is reserved for the post-processing module. Pdata, Pcoefficients, Pconvolutions: initial positions of the data, coefficients and partial-convolutions pointer registers.)
Fig. 15 Partial convolutions module
Data and coefficients are read by means of two pointer registers, one attributed to the data delay line and the other to the coefficients array. The pointer register increment corresponds to the spacing between positions; it is given by a step equal to the number of sub-filters in parallel (see eq. (26)):

Step_conv = N_sub-filters = ∏_{i=1}^{n} α_i     (26)

The ∏_{i=1}^{n} α_i partial convolutions are stored in a buffer. A third pointer register is attributed to the partial-convolutions buffer, so that the post-processing module can generate and recombine the corresponding variables (see Fig. 15).
5.3. EFFICIENT IMPLEMENTATION OF THE POST-PROCESSING MODULE
This module performs the converse task of the pre-processing module (see Fig. 16). The pre-processing part processes data blocks of size N, providing ∏_{i=1}^{n} α_i values. After filtering, the corresponding ∏_{i=1}^{n} α_i outputs must be combined together according to the post-processing parts of the basic algorithms in order to provide the corresponding N output values.
(Figure: an input block of N samples enters the pre-processing module, which delivers ∏α_i values to the sub-filters; the ∏α_i partial convolutions enter the post-processing module, which delivers the N output samples.)
Fig. 16 Relation between the pre-processing and post-processing parts
The proposed method is based on a filling technique similar to that of the pre-processing module: a memory space reserved in the partial-convolutions buffer is filled progressively as each basic post-processing part is applied.
The procedure consists in first applying the post-processing part of the last basic algorithm F(N_n,N_n) to the partial convolutions obtained previously. Thanks to our structuring, the partial convolutions to be combined are available at equally spaced memory positions defined by a step, named Step_Read^1(F(N_n,N_n)), which is equal to the number of sub-filters in parallel (see Fig. 17).
The second stage consists in applying the next basic algorithm F(N_{n-1},N_{n-1}) to the result given by the first stage. The positions of the data to be processed, according to the post-processing of F(N_{n-1},N_{n-1}), are given by a step which is easily computed as a function of the number of sub-filters (see Fig. 17). A similar technique is applied down to the last algorithm F(N_1,N_1), generating the ∏_{i=1}^{n} N_i filtered outputs.
The step values defined below for any basic algorithm F(N_i,N_i) increment the pointer register previously attributed to the partial-convolutions buffer; this incrementation requires no additional machine cycles (see Fig. 17). They are again simple closed-form functions of the {N_k} and {α_k}. The reading step Step_Read^1(F(N_i,N_i)) equals 1 if i = 1 and ∏_{k=1}^{i-1} α_k otherwise (27). The writing step, named Step_Write(F(N_i,N_i)), stores the result of each combination at a precise physical address; it equals ∏_{k=1}^{i-1} α_k for i ≠ 1 (28). A second reading step, Step_Read^2(F(N_i,N_i)), is expressed from the products ∏_{k=1}^{i-1} α_k, ∏_{k=i}^{n} N_k and ∏_{k=i+1}^{n} N_k, with the convention ∏_{k=i}^{n} N_k = 1 if i = n. Finally, the reading step Step_Read^3(F(N_i,N_i)), a function of ∏_{k=i}^{n} N_k and ∏_{k=1}^{i-1} α_k, positions the pointer register in order to apply the following basic algorithm F(N_{i-1},N_{i-1}) (29).
(Figure: successive applications of the post-processing parts of F(N_n,N_n), ..., F(N_{i+1},N_{i+1}), F(N_i,N_i) to the partial-convolutions buffer, showing the free memory spaces, the memory spaces completed by each basic algorithm, the reading steps Step_Read^1 to Step_Read^3, the writing step Step_Write, and the initial position Pconvolutions of the partial-convolutions pointer register.)
Fig. 17 Post-processing module
5.4. HARDWARE RESOURCES EVALUATION
A very useful characteristic of our structuring of the composite filtering algorithm F(N_1,N_1),...,F(N_n,N_n) is that the total number of pointers required is fixed, whatever the number of nested basic short algorithms. The method consists in regrouping all data of the same kind (combined samples, combined coefficients, partial convolutions), each kind being stored in a single buffer. The previous sections on the optimized memory data organization have shown that:
For the PRE-PROCESSING MODULE, one pointer register, named Pdata, is defined for the data delay-line management. This evaluation does not include the unavoidable pointers required for the FFT computation when the first algorithm F(N_i,N_i) is replaced by FF(N_i,N_i).
For the FILTERING MODULE, two pointer registers are defined. The pointer Pcoefficients is attributed to the filter-coefficients array in order to execute the required convolution products. The other one, named Pconvolutions, is attributed to the partial-convolutions buffer for storing the inner-product results.
The POST-PROCESSING MODULE uses the pointer register Pconvolutions already defined in the filtering module; it allows the efficient management of the partial convolutions during the recombination operations.
The incoming input samples are stored in a delay line of appropriate size. The input data management (push and pop) is handled by a pointer register named Pacquisition. Likewise, the output samples are handled by a pointer register named Prestitution. Hence the general structure of the composite algorithm F(N_1,N_1),...,F(N_n,N_n) requires a fixed set of 5 pointer registers (Pdata, Pcoefficients, Pconvolutions, Pacquisition, Prestitution).
6. OPTIMIZED CODE GENERATOR SYSTEM
Although conceptually simple, the general memory data organization is quite intricate. The initializations of the pointers and their increments depend, through simple formulas, on the lengths of the short algorithms and on the number of "multiplies" they require. Rather than implementing these formulas in an ad hoc manner for each algorithm, we decided to build a Code Generator System (CGS) which manages the address generation efficiently. This CGS is written as a high-level-language program and automatically generates an assembly file, after selecting the best combination of small-length building blocks in terms of computing cycles, given the constraints of the real-time application (maximum I/O delay) or the memory occupation (see Fig. 18).
These small building blocks are macro-instructions and are the only DSP-dependent parts. The generated file is written in optimized assembly code and is directly assembled for the chosen DSP. The macros are the pre-processing and post-processing parts of each short fast FIR algorithm, the convolution (for the sub-filters), and the FFTs. Examples are given below:
Macro_Partial_Convolutions (%0, %1, %2, %3, %4, %5, %6)
{
  %0 sub_filter_length,
  %1 data_pointer_register,
  %2 coefficients_sub_filters_pointer_register,
  %3 partial_convolutions_pointer_register,
  %4 coefficients_Step_conv,
  %5 data_Step_conv,
  %6 Step_Write_partial_convolutions,
  Parameterized instructions in DSP assembly code
}

Macro_Pre_Processing_F(N_i,N_i) (%0, %1, %2, %3, %4, %5, %6, %7)
{
  %0 data_pointer_register,
  %1 data_Step_Read^1(F(N_i,N_i)),
  %2 data_Step_Read^2(F(N_i,N_i)),
  %3 data_Step_Read^3(F(N_i,N_i)),
  %4 data_Step_Read^4(F(N_i,N_i)),
  %5 data_Step_Read^5(F(N_i,N_i)),
  %6 data_Step_Write(F(N_i,N_i)),
  %7 input_pointer_register.
  Parameterized instructions in DSP assembly code
}
(Figure: the CGS, developed in C, takes the application constraints (FIR filter length, processing delay, execution time, available memory space), selects the decimation factors N_i and the best combination F(N_1,N_1),...,F(N_n,N_n), and assembles the macro-instructions of the DSP-dependent library into an optimized assembly source file, together with the FIR sub-filter coefficients file and the twiddle-factor files of the DFT/IDFT pre/post-processing when FF(N_i,N_i) is used; the result is then processed by the DSP compiler.)
Fig. 18 Code Generator System
The macros are called by a C-language program, which computes the required parameters, named %0, %1, ... (increments in the delay line, pointers), and evaluates the corresponding number of cycles. The various algorithms are checked in a combinatorial manner, and the generator provides the most efficient one, given the application constraints.
7. IMPLEMENTATION RESULTS
In order to validate the CGS and to evaluate the performance of the FIR filtering algorithms, a set of optimized macro-instructions has been written for the Analog Devices "ADSP-2100" DSP. The experimental results are provided in Table 7.
Table 7 evaluates the number of machine cycles versus the filter length for various implementations of the filtering algorithms on the "ADSP-2100", as provided by the CGS. The graphical representations in Figs. 19a, 19b and 19c compare the execution times of the various filtering algorithms. It can be seen that the improvement is larger for long FIR filters and depends on the block size N. For an FIR filter longer than 350 coefficients, the overheads (data transfers, arithmetic operations, ...) become negligible and therefore have no noticeable influence on the throughput compared with the execution time of the classical convolution algorithm. Note that the combination of the filtering algorithms based on short FFTs with the short-length filtering algorithms yields improved execution timings. This result is partially due to the (deliberately chosen) limitation of the FFT size, imposed by the limited number of pointers; otherwise, longer FFTs would result in more efficient algorithms.
8. CONCLUSION
This paper has described an efficient methodology for implementing the recently proposed fast FIR filtering algorithms on DSPs. This methodology is based on a homogeneous presentation of the whole class of fast FIR algorithms (based on a nesting technique of sub-filters), showing both the common features of their structure and their interplay. This set of algorithms is quite general, since it includes those based on short-length filters as well as those based on short FFTs. The classical algorithms using long FFTs also belong to this class, but have not been considered for implementation, since they often do not meet practical constraints such as the I/O delay. As a result, we could propose a memory data organization which is very economical in terms of DSP resources. The efficient address generation of the data in memory is managed by the code generator system, which automatically generates the assembly code. Figures of merit have been given, showing that the computing time for a 1000-tap FIR filter can be as low as 1/4 of that of a classical computation, with a block size of only 32, hence an I/O delay of 64 samples.
Although described specifically for DSPs, this methodology should also be very efficient for a VLSI implementation, since this algorithm organization has very small hardware requirements and a regular structure. The main characteristics of the algorithms, such as the conservation of the multiply-accumulate structure and the many sub-filters running in parallel at a lower rate than the initial one, should also be very helpful for obtaining precise trade-offs between hardware complexity and I/O throughput.
FIR filtering algorithm                      Asymptotic savings (MAC)    # Machine cycles (per point)
F(N_1,N_1),...,F(N_n,N_n)

Direct convolution                           0%                          L + 9
F(2,2)                                       25%                         15.5 + 3L/4
F(3,3)                                       34%                         22 + 2L/3
F(2,2),F(2,2)                                44%                         28.25 + 9L/16
F(2,2),F(3,3)                                50%                         38.5 + L/2
F(3,3),F(2,2)                                50%                         39 + L/2
F(3,3),F(3,3)                                56%                         53.11 + 4L/9
F(2,2),F(2,2),F(2,2)                         58%                         47.75 + 27L/64
F(2,2),F(2,2),F(3,3)                         63%                         63.16 + 3L/8
F(2,2),F(3,3),F(2,2)                         63%                         64.16 + 3L/8
F(3,3),2F(2,2)                               63%                         65 + 3L/8
2F(3,3),F(2,2)                               67%                         87.27 + L/3
F(2,2),2F(3,3)                               67%                         85.3 + L/3
F(3,3),F(2,2),F(3,3)                         67%                         80.38 + L/3
F(3,3),F(3,3),F(3,3)                         71%                         115 + 8L/27
F(2,2),F(2,2),F(2,2),F(2,2)                  69%                         76.37 + 81L/256
FF(4,4)                                      31%                         94 + 0.687L
FF(8,8)                                      64%                         162.25 + 0.36L
FF(16,16)                                    82%                         164.87 + 0.183L
FF(32,32)                                    91%                         171.93 + 0.092L
FF(4,4),F(2,2)                               48%                         110.25 + 0.51L
FF(4,4),F(2,2),F(2,2)                        61%                         138.75 + 0.386L
FF(4,4),F(2,2),F(2,2),F(2,2)                 70%                         188.31 + 0.290L
FF(4,4),F(2,2),F(2,2),F(2,2),F(2,2)          78%                         264.09 + 0.217L
FF(8,8),F(2,2)                               73%                         167.625 + 0.269L
FF(8,8),F(2,2),F(2,2)                        79%                         189.37 + 0.202L
FF(8,8),F(2,2),F(2,2),F(2,2)                 85%                         237.53 + 0.151L
FF(8,8),F(2,2),F(2,2),F(2,2),F(2,2)          88%                         310.48 + 0.113L
FF(16,16),F(2,2)                             86%                         170.125 + 0.137L
FF(16,16),F(2,2),F(2,2)                      89%                         195 + 0.103L
FF(16,16),F(2,2),F(2,2),F(2,2)               92%                         239.82 + 0.077L
FF(16,16),F(2,2),F(2,2),F(2,2),F(2,2)        94%                         331.39 + 0.058L

Table 7
(Plots: machine cycles per point versus FIR filter length (200 to 1000 coefficients), comparing the direct convolution with FF(4,4), FF(8,8) and FF(16,16) combined with one to four F(2,2) stages, and with the nested F(2,2) algorithms.)
Fig. 19a Actual machine cycles per point versus the FIR filter length
(Plots: machine cycles per point versus FIR filter length for short filters (30 to 80 and 100 to 300 coefficients), comparing the direct convolution with F(2,2), F(3,3), 2F(2,2), F(2,2),F(3,3) and three-stage combinations of F(2,2) and F(3,3).)
Fig. 19b Actual machine cycles per point versus the FIR filter length
(Plots: machine cycles per point versus FIR filter length (100 to 700 coefficients), comparing the direct convolution with two-, three- and four-stage combinations of F(2,2) and F(3,3).)
Fig. 19c Actual machine cycles per point versus the FIR filter length
REFERENCES
[1] Ramesh C.Agarwal, Charles S.Burrus, "Fast One-Dimensional Digital Convolution by Multidimensional Techniques", IEEE Trans. on Acoust. Speech and Signal Processing, Vol. 22, No. 1, pp. 1-10, February 1974.
[2] R.C.Agarwal, C.S.Burrus, "Number theoretic transforms to implement fast digital convolution", Proc. IEEE, Vol. 63, No. 4, April 1975.
[3] R.C.Agarwal, J.W.Cooley, "New algorithms for digital convolution", IEEE Trans. on Acoust. Speech and Signal
processing, Vol 25, No. 5, October 1977, pp. 392-410.
[4] C.S.Burrus, T.W.Parks, "DFT/FFT and convolution Algorithm", Wiley, New York, 1985.
[5] J.W.Cooley, J.W.Tukey, "An algorithm for the calculation of complex Fourier series", Math. of. Comput., Vol.
19, pp. 297-301, April 1965.
[6] H.Nussbaumer, "Nouveaux algorithmes de transformée de Fourier rapide", Traitement du signal, l'onde électrique, Vol. 59, No. 6-7, 1979 (in French).
[7] P.Duhamel, M.Vetterli, "Cyclic convolution of real sequences: Hartley versus Fourier and new schemes", IEEE
proc. ICASSP Tokyo, Japan, Apr. 8-11, 1986, pp. 6.5.1-6.5.4.
[8] P.Duhamel, M.Vetterli, "Improved Fourier and Hartley Transform Algorithms: Application to cyclic convolution
of real data", IEEE Trans. Acous. Speech, Signal processing, Vol. 35, No. 6, June 1987, pp. 818-824.
[9] P.Duhamel, M.Vetterli, "Fast Fourier Transforms: A Tutorial Review and a State of the Art", Signal Processing, Vol. 19, 1990, pp. 259-299.
[10] M.Vetterli, "Running FIR and IIR Filtering using multirate filter bank", IEEE Trans. on ASSP, Vol 36, No. 5,
May 1988, pp.730-738.
[11] D.M.W.Evans, "An improved Digit Reversal Permutation Algorithm for Fast Fourier and Hartley Transforms", IEEE Trans. Acoustics Speech and Signal processing, Vol. ASSP-35, No. 8, IEEE CS press, Los Alamitos, Calif., August 1987, pp. 1120-1125.
[12] R.E.Blahut, "Fast Algorithms for Signal processing", Addison-Wesley, Reading, MA, 1985.
[13] Winograd, "Arithmetic complexity of computation", CBMS-NSF Regional Conf. Series in Applied Mathematics,
SIAM publications, No. 33, 1980.
[14] H.K.Kwan, M.T.Tsim, "High speed 1-D FIR digital filtering architectures using polynomial convolution" proc
ICASSP Dallas 87, USA pp.1863-1866.
[15] P.Duhamel, Z.J.Mou, J.Benesty, "Une représentation unifiée du filtrage rapide fournissant tous les intermédiaires entre traitements temporels et fréquentiels", douzième colloque Gretsi, Juan-les-Pins, France, June 1989, pp. 37-40 (in French).
[16] Z.J.Mou, P.Duhamel, "A unified approach to the Fast FIR Filtering algorithms", IEEE proc. ICASSP, 1988, pp. 1914-1917.
[17] Z.J.Mou, P.Duhamel, "Fast FIR Filtering: Algorithms and Implementations", Signal Processing, December 1987,
pp. 377-384.
[18] Z.J.Mou, P.Duhamel, "Short length FIR filters and their use in fast non recursive filtering", IEEE Trans. ASSP,
1989.
[19] Z.J.Mou, P.Duhamel, J.Benesty, "Fast complex FIR filtering algorithms with applications to real FIR and complex LMS filters", proc. Eusipco 1990, pp. 549-552.
[20] R.Meyer, R.Reng and Schwarz, "Convolution algorithms on DSP processors", IEEE proc. ICASSP 1991, pp.
2193-2196.
[21] A.Zergaïnoh, P.Duhamel, J.P.Vidal, "Efficient implementation of composite length fast FIR filtering on the
"ADSP-2100", IEEE proc. ICASSP Adelaïde 94, Australia, pp. 461-464.
[22] A.Zergaïnoh, P.Duhamel, J.P.Vidal, "Implantation efficace d'algorithmes de filtrage rapide RIF sur ADSP-2100", Conférence Adéquation Algorithmes Architectures, Greco TDSI, AFCET, SEE, CNET, Grenoble, January 1994, pp. 85-92 (in French).
[23] A.Zergaïnoh, P.Duhamel, J.P.Vidal, "DSP implementation of fast FIR filtering algorithm using short FFT's", IEEE proc. ISCAS Seattle 1995, pp. 219-222.
[24] A.Zergaïnoh, P.Duhamel, "Implementation and performance of fast FIR filtering algorithms on DSP", IEEE
VLSI Signal Processing VIII, October 16-18, 1995.
[25] P.Duhamel, "A split radix fast Fourier Transform", Ann. Télécomm., Vol. 40, No. 9-10, pp. 418-494, September-October 1985.
[26] Henrik V.Sorensen, Douglas L.Jones, Michael T.Heideman, C.Sidney Burrus, "Real Valued Fast Fourier
Transform Algorithms", IEEE Trans. on Acoust. Speech and Signal processing, Vol 35, No. 6, pp. 849-863, June
1987.
[27] H.Nussbaumer, "Fast Fourier Transform and convolution algorithms", Springer-Verlag, 2nd edition, 1982.
[28] Analog Devices, "Digital Signal Processing Applications Using the ADSP-2100 Family", Prentice-Hall, Englewood Cliffs, NJ 07632.
[29] ADSP-21020 User's Manual, 1991 Analog Devices Incorporated.