

This work was supported by France Telecom CNET, Issy les Moulineaux, France, and was performed while the first author was at INT, Evry, France.

EFFICIENT IMPLEMENTATION METHODOLOGY OF FAST FIR

FILTERING ALGORITHMS ON DSP

Anissa Zergaïnoh, Institut Galilée, Avenue Jean Baptiste Clément, 93430 Villetaneuse, France ([email protected])
Pierre Duhamel, ENST/SIG, 46 rue Barrault, 75013 Paris, France ([email protected])
Jean Pierre Vidal, INT/SIM, 9, rue Charles Fourier, 91011 Evry, France ([email protected])

ABSTRACT

A class of Finite Impulse Response (FIR) filtering algorithms based either on short Fast Fourier Transforms (FFT) or on short-length FIR filtering algorithms was recently proposed. Besides the significant reduction of the arithmetic complexity, these algorithms present some characteristics which make them useful in many applications, namely a small processing delay (independent of the FIR filter length) as well as a multiply-add based computational structure. These algorithms are presented in a unified framework, thus allowing an easy combination of any of them. However, a remaining difficulty concerns the implementation of the fast algorithms on Digital Signal Processors (DSP), given the finite DSP resources (number of pointers, registers and memory), while keeping as much as possible the improvement brought by the reduction of the arithmetic complexity. This paper provides an efficient implementation methodology, organizing the algorithm in such a way that the memory data accesses are optimized on a DSP. As a result, our implementation requires a constant number of pointers whatever the algorithm combination. This knowledge is used in a DSP code generator which is able to select the appropriate algorithm meeting the application constraints, as well as to generate automatically an optimized assembly code, using macro-instructions available in a DSP-dependent library. An improvement of more than 50% in terms of throughput (number of machine cycles per point) compared to the implementation of the direct convolution is generally achieved.


1. INTRODUCTION

Finite Impulse Response (FIR) filters have found wide applications in various fields where large FIR filters (up to 4000 taps) are required. When the application requires real-time processing, a Digital Signal Processor (DSP) implementation is not always feasible, due to the heavy computational cost. This is why fast algorithms are worth considering, despite a computational structure which is not naturally suited to a DSP implementation. Considerable efforts have been spent to reduce the computational requirements of FIR filtering [1], [2], [3], [4]. The most famous algorithm is based on the Fast Fourier Transform (FFT) algorithm [5], [6], [7], [8], [9]. In the earlier versions, the processing block size N is much larger than the FIR filter length L. Since, in a block processing method, the Input/Output delay is on the order of the block size, this is often a severe constraint in real-time processing applications. This is outlined in Figure 1, which evaluates the processing delay, Te being the sampling period of the Input/Output data. In a fully optimized system, the device is always working, with auxiliary memories buffering the incoming and outgoing data: a first buffer performs the acquisition of some input data block Ai during a period equal to NTe, while the previous data block Ai-1 is processed. A second buffer restores the computation result of the input data block Ai-2. Hence, the processing delay is given by the time spent for the acquisition of an input data block, plus the time required for performing the computation of the same input data block. Thus, it is evaluated as 2NTe.

Fig. 1 Real-time block processing method: the acquisition, processing and restitution of consecutive input data blocks Ak overlap in time, so the processing delay is 2NTe (Te: sampling period, N: block processing size).

Hence, an excessive delay is certainly the main problem with such algorithms. Several other filtering algorithms were proposed in [12] and [13]. They efficiently reduce the arithmetic complexity, but do not take into account the relation between the filtering operations, the arithmetic requirements and the signal processor architecture. Thus the estimated complexity reduction, as classically measured in terms of number of operations, may be misleading, because it does not accurately represent the capability of the processor to efficiently implement the algorithm. The key to the solution of this


problem is to involve both hardware and algorithm considerations in order to achieve an efficient implementation. These considerations motivated the proposal [14], [15], [16], [17], [18], [19] of a new class of fast FIR filtering algorithms. These algorithms make an efficient use of the Multiplier-Accumulator hardware found in most DSP's [17], by partially keeping the original inner product formulation while reducing the arithmetic complexity, even for a small block processing size. Both characteristics have the potential for resulting in efficient DSP implementations, compatible with application-oriented constraints. The main building blocks are short-length fast FIR modules in which all multiplications are replaced by decimated sub-filters. These algorithms break the computation of a length-L FIR filter into that of several sub-filters of length L/N < L, in such a way that the arithmetic complexity decreases [18].

The paper is organized as follows: Section 2 analyses the interaction between filtering algorithms and DSP architectures, allowing us to define the adapted tools for an efficient implementation. Section 3 briefly explains the method allowing the construction of the basic fast FIR filtering algorithms, based on the Chinese Remainder Theorem [12], [15], [18]. The main filtering algorithms constructed for real and complex interpolation points, leading to two classes of algorithms, are presented: those based on short-length FIR algorithms and those based on short FFT's. Section 4 shows that both classes can be mixed [24], building composite filtering algorithms. An evaluation of the arithmetic complexity is given, and links between architecture and algorithm considerations are provided. Section 5 proposes a unified methodology for an optimal memory data organization based on an efficient address generation [20], [21], [23], [24]. This is implemented in Section 6 as a Code Generator [21], [23], [24]. Finally, this Code Generator is evaluated through the implementation performance of the resulting filtering algorithms on the ADSP-2100.

2. INTERACTION BETWEEN DIGITAL SIGNAL PROCESSORS AND CLASSICAL FIR FILTERING ALGORITHMS

An important task consists in analyzing the interaction between the operations available in a programmable DSP and the corresponding operations (arithmetic, transfers, data management) required by the filtering application. In fact, the DSP architecture used for implementing the filtering algorithms should have a significant impact on the choice of the algorithm if one really wants fast execution times. Both the hardware and software architecture of the DSP are the key to achieving an efficient implementation. Consider briefly the most important hardware structures which influence the choice of the actual algorithm.

Current programmable DSP's are based on pipeline operation. The pipeline provides multiple stages through which the data progress at the basic clock cycle rate of the DSP hardware. In pipeline operation, the instruction fetch, decode, execution and storage operations are independent, which allows overall instruction executions to overlap.


An example of a five-stage pipeline operation is given in Figure 2. At each successive stage in the pipeline, an operation (transfer, multiply, add, shift) is performed. This configuration takes advantage of the inherent decomposition of a filtering algorithm into multiple serial operations. The well-known Multiplier-Accumulator structure, described in Figure 3, is very efficient in computing an inner product. Multiply-accumulate instructions fully utilize the computational bandwidth of the Multiplier-Accumulator, allowing both operands to be processed simultaneously (in a pipelined manner). The data for these operations can be transferred to the Multiplier-Accumulator each cycle via the program and data buses, which access the filter coefficients and the input data respectively. This facilitates single-cycle Multiplication-Accumulation (MAC) when used with repeat instructions. The arithmetic complexity will therefore be evaluated in terms of the number of MACs, which corresponds to the number of machine cycles (neglecting the initialization of the pipeline). Thus, a straightforward implementation of the filtering equation (1) requires approximately L MACs per output sample.
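As a concrete reference point, this direct computation can be sketched in a few lines of Python (an illustrative model, not DSP assembly; the inner loop is exactly the L-MAC kernel that a single-cycle Multiplier-Accumulator executes):

```python
def fir_direct(x, h):
    """Direct-form FIR filter of eq. (1): L multiply-accumulates per output sample."""
    L = len(h)
    y = []
    for n in range(len(x)):
        acc = 0.0                          # accumulator register
        for i in range(L):                 # L MAC operations per output
            if n - i >= 0:
                acc += h[i] * x[n - i]     # one multiply-accumulate
        y.append(acc)
    return y
```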

Fig. 2 Five-stage pipeline operation: the fetch, decode, operand-reading, execution and storage stages of successive instructions overlap, so that one instruction completes per machine cycle once the pipeline is full.

Most programmable DSP's also offer a modulo addressing mode. This addressing mode can be used in conjunction with careful buffer sizing to minimize the cost associated with memory accesses (read/write), thus allowing a significant reduction of the overheads. It is important to note that the actual implementation of circular buffers allows the indexes to wrap around to the other end. Thus, a pointer register determining some physical address is assigned as a pointer to data which are stored in a circular buffer defined in a preliminary set-up. The classical method used to implement a classical convolution algorithm consists in defining two circular buffers, sized modulo the FIR filter length, and two pointer registers attributed respectively to each buffer. This permits efficient access to the data (coefficients and samples) without resorting to data shifts between two computations: only the pointer is incremented, and the new data is stored at the appropriate place. This is a very efficient update of the sample delay line by a pointer register. Unfortunately, the number of pointer registers available on most DSP's is limited (a typical value is 8 pointers, e.g. for the ADSP-2100 of


Analog Devices). Consequently, one should avoid distributing the arithmetic computations over too many different delay lines. This would result in an excess of computation time, since several transfer operations are unavoidable to compensate for the lack of pointers.
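The circular-buffer mechanism described above can be modeled in a few lines of Python (a sketch of the modulo addressing idea; the class and method names are ours, not an actual DSP API):

```python
class CircularDelayLine:
    """Sample delay line updated by modulo addressing: one pointer increment per
    new sample and no data shifts (illustrative model of a DSP circular buffer)."""
    def __init__(self, L):
        self.buf = [0.0] * L          # circular buffer sized to the filter length
        self.ptr = 0                  # the single pointer register for this buffer

    def push(self, sample):
        self.buf[self.ptr] = sample                # overwrite the oldest sample
        self.ptr = (self.ptr + 1) % len(self.buf)  # modulo pointer update

    def output(self, h):
        """Inner product with the coefficients, reading samples newest first."""
        L = len(self.buf)
        return sum(h[i] * self.buf[(self.ptr - 1 - i) % L] for i in range(L))
```

Feeding samples and taking the inner product reproduces the direct convolution without ever moving data in memory; only the pointer moves.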

Fig. 3 General Multiplier-Accumulator structure: a data address generator and a coefficient address generator feed the data memory (samples xi) and the coefficient memory (coefficients hi) into a multiplier followed by an add/subtract accumulator producing the output yn.

3. FAST FIR FILTERING ALGORITHMS USING SMALL BLOCK PROCESSING

The output of an FIR filter is given by the convolution (eq. (1)) of the infinite input data sequence {x_i} with the finite fixed coefficient sequence {h_i} of the FIR filter of length L:

y_n = \sum_{i=0}^{L-1} h_i x_{n-i},   n = 0, 1, ...   (1)

A straightforward DSP implementation of equation (1) requires L MACs per output sample, corresponding to a large amount of computation for large filters. Equation (1) can also be written in the z domain (eq. (2)) as the polynomial product of H(z) by X(z), the z transforms of {h_i} and {x_i}. H(z) is a polynomial of finite degree L-1, while X(z) and Y(z) have infinite degree. Therefore, a fast implementation of equation (3) requires some segmentation of the infinite degree polynomials into blocks. This results classically in the "overlap-add" or "overlap-save" methods.

Y(z) = X(z) H(z)   (2)

y_0 + y_1 z^{-1} + ... = (h_0 + h_1 z^{-1} + ... + h_{L-1} z^{-(L-1)}) (x_0 + x_1 z^{-1} + ...)   (3)

A general procedure for obtaining the classical block formulation, as well as the more recent ones, is decimation: by decimating each term of equation (3) by a factor N, the filtering equation is turned into a product of two polynomials of finite degree N-1 (eq. (4)), the coefficients of which are themselves polynomials of infinite degree ({X_i}, {Y_i}) or finite degree ({H_i}):

Y_0 + Y_1 z^{-1} + ... + Y_{N-1} z^{-(N-1)} = (H_0 + H_1 z^{-1} + ... + H_{N-1} z^{-(N-1)}) (X_0 + X_1 z^{-1} + ... + X_{N-1} z^{-(N-1)})   (4)


with

H_i = H_i(z^N) = \sum_{k=0}^{(L/N)-1} h_{kN+i} z^{-kN},   i = 0, 1, ..., N-1   (5)

X_i = X_i(z^N) = \sum_{k=0}^{\infty} x_{kN+i} z^{-kN},   i = 0, 1, ..., N-1   (6)

Y_i = Y_i(z^N) = \sum_{k=0}^{\infty} y_{kN+i} z^{-kN},   i = 0, 1, ..., N-1   (7)
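The decomposition of eqs. (4)-(7) can be checked numerically. The following Python sketch (function and helper names are ours) computes the N^2 decimated sub-filter products X_i H_j and reassembles the output phases, reproducing the direct convolution exactly:

```python
def conv(a, b):
    """Direct linear convolution, used as the reference."""
    y = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            y[i + j] += ai * bj
    return y

def fir_polyphase(x, h, N):
    """Compute x*h through the N^2 decimated sub-filter products of eq. (4)."""
    xs = [x[i::N] for i in range(N)]       # X_i, eq. (6)
    hs = [h[j::N] for j in range(N)]       # H_j, eq. (5)
    y = [0.0] * (len(x) + len(h) - 1)
    for i in range(N):
        for j in range(N):
            # X_i H_j z^{-(i+j)}: output exponent is N*(kx+kh) + i + j
            for k, v in enumerate(conv(xs[i], hs[j])):
                idx = i + j + N * k
                if idx < len(y):
                    y[idx] += v
    return y
```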

{H_i} being constant polynomials, each individual product H_j X_i involved in a straightforward computation of equation (4) corresponds to the filtering of an infinite sequence by a length-L/N filter. Thus, equation (4) amounts to the computation of N^2 filters of length L/N, that is N^2 (L/N) = LN MACs for computing N output data altogether, or L MACs per output point. The arithmetic complexity has not been modified. The reduction of the arithmetic complexity appears when the length-N polynomial product appearing in eq. (4) is computed by some fast algorithm. Suppose the polynomial product of length N is computed with \alpha multiplications; the arithmetic complexity is then \alpha L/N, which is smaller than NL if \alpha < N^2. The Chinese Remainder Theorem (CRT) is the basis of these fast algorithms (see [12]). Since the formulation allows the use of small N, small block processing will be feasible [17]. It turns out that the application of the CRT provides a lower bound on the number of "multiplications" of the type H_j X_i necessary for computing the finite degree polynomial product (eq. (4)). This minimum is 2N-1 multiplications. Hence, if this minimum were practically attainable, the number of MACs per output sample would be considerably reduced, to (2N-1) L/N^2 MACs per output. This is explained in more detail in reference [18]. The application of the CRT structures the algorithm into three main parts [18], [19], in which various choices of the parameters N (the decimation factor) and {a_i} (the set of interpolation points) provide different algorithms. The minimum number of multiplications can be practically attained for a small block size N, whereas for larger N the optimal algorithm involves too many additions/subtractions and is too sensitive to numerical errors (quantization noise) to be of practical interest. Whether the interpolation points {a_i} are chosen to be real-valued or complex-valued determines two main classes of algorithms, described in the following section.
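These operation counts are easy to tabulate; a small Python helper (illustrative, with our own function name) compares the direct convolution with a fast algorithm using \alpha sub-filters:

```python
def mac_per_output(L, N, alpha):
    """MACs per output sample when the length-N polynomial product of eq. (4)
    is computed with alpha multiplications, each a length-L/N sub-filter."""
    return alpha * L / N ** 2

L = 1200                                  # a large FIR filter
direct = mac_per_output(L, 1, 1)          # plain convolution: L MACs per output
f22 = mac_per_output(L, 2, 3)             # alpha = 3 for N = 2: 3L/4 per output
f33 = mac_per_output(L, 3, 6)             # alpha = 6 for N = 3: 2L/3 per output
bound = mac_per_output(L, 3, 2 * 3 - 1)   # CRT lower bound alpha = 2N-1
```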

These FIR filtering algorithms process altogether a small block of N consecutive filter outputs, in order to take into account the redundancy found in these N successive computations. This method consequently permits the reduction of the arithmetic complexity while partially maintaining (on smaller filters) the classical FIR filtering structure. Furthermore, these algorithms are characterized by a relatively small processing delay, fixed by the decimation factor N and independent of the filter length. The different steps to construct a fast FIR filtering algorithm are summarized in Fig. 4: first an evaluation (pre-processing part), then the short-length filters (filtering part), and finally the reconstruction (post-processing part).


Fig. 4 General structure of a fast FIR filtering algorithm: the input X(z) goes through a pre-processing block, 2N-1 real sub-filters RF(0) to RF(2N-2) in parallel, and a post-processing block (with z^{-N} delays) reconstructing Y(z).

3.1. BASIC SHORT FILTERING ALGORITHMS F(N_i, N_i)

The basic short fast FIR filtering algorithms are denoted F(N_i, N_j), where N_i denotes the decimation factor on the input and output signals, while N_j denotes the decimation factor on the filter. Only the simplest case N_i = N_j = N is considered here. The number of sub-filters required by the actual basic fast algorithm will be denoted by \alpha_i. In some cases, this number will be equal to the lower bound (i.e. 2N-1), while being sometimes larger.

Two basic examples are proposed for the decimation factor values N_i = 2 and N_i = 3. The simplest short-length algorithms have been constructed by considering the simplest real-valued interpolation points, i.e. {a_i} = {-1, 0, 1, \infty}. Note here that a_i = \infty is an abuse of notation denoting the evaluation of the term of highest degree of the polynomial (see [18]).

The reconstruction formulas of the full basic FIR filtering algorithm F(2,2) using {a_0, a_1, a_2} = {0, 1, \infty} as interpolation points are easily obtained (see [18]), and given below:

Y_0 = X_0 H_0 + z^{-2} X_1 H_1   (8)

Y_1 = (X_0 + X_1)(H_0 + H_1) - X_0 H_0 - X_1 H_1   (9)

The overall organization (see Fig. 5) shows that the short-length FIR filtering algorithm F(2,2) results in a first pre-processing block, followed by 3 (i.e. \alpha_i = 3) sub-filters (H_0, H_1, H_0 + H_1) of length L/2, all in parallel, and finally a post-processing block.

Fig. 5 Structure of the fast FIR filtering algorithm F(2,2) with {a_0, a_1, a_2} = {0, 1, \infty}: pre-processing of X(z) into X_0(z^2) and X_1(z^2), three length-L/2 sub-filters H_0(z^2), H_1(z^2) and H_0(z^2) + H_1(z^2) in parallel, and post-processing reconstructing Y_0(z), Y_1(z) and Y(z).
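Equations (8) and (9) translate directly into a small block-filtering routine. The Python sketch below (helper names `conv` and `padd` are ours; lists stand in for DSP memory) computes the three half-length sub-filter products and reassembles the two output phases, which can be checked against the direct convolution:

```python
def conv(a, b):
    """Plain polynomial product / linear convolution."""
    y = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            y[i + j] += ai * bj
    return y

def padd(a, b, n=None):
    """Element-wise addition with zero padding up to length n."""
    n = n if n is not None else max(len(a), len(b))
    a = a + [0.0] * (n - len(a))
    b = b + [0.0] * (n - len(b))
    return [p + q for p, q in zip(a, b)]

def fir_f22(x, h):
    """F(2,2): three half-length sub-filters H0, H1, H0+H1 (eqs. (8)-(9))."""
    x0, x1 = x[0::2], x[1::2]                   # decimated input phases X0, X1
    h0, h1 = h[0::2], h[1::2]                   # decimated filter phases H0, H1
    p0 = conv(x0, h0)                           # X0 H0
    p1 = conv(x1, h1)                           # X1 H1
    p01 = conv(padd(x0, x1), padd(h0, h1))      # (X0+X1)(H0+H1)
    n = 1 + max(len(p0), len(p1), len(p01))
    y0 = padd(p0, [0.0] + p1, n)                    # eq. (8): X0H0 + z^-2 X1H1
    y1 = padd(p01, [-v for v in padd(p0, p1)], n)   # eq. (9)
    y = []
    for k in range(n):
        y += [y0[k], y1[k]]                     # re-interleave the output phases
    return y[:len(x) + len(h) - 1]
```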


Now consider a decimation factor N_i = 3. An optimal basic filtering algorithm F(3,3) requires five interpolation points, i.e. {a_i} = {0, 1, -1, \infty} plus a fifth point. Here, a problem appears, because only four points are really simple in the real domain. An additional value such as 2 introduces further additions and involves computations with a large dynamic range: multiplications by 8 would result in a much larger roundoff noise, even when implemented by shifts. Hence, instead of adding a fifth interpolation point (see [1]), it is wiser to apply the previous F(2,2) algorithm recursively (see eqs. (8), (9)):

Y_0 + Y_1 z^{-1} + Y_2 z^{-2} = (X_0 + X_1 z^{-1} + X_2 z^{-2})(H_0 + H_1 z^{-1} + H_2 z^{-2})   (10)

Y(z) = (X_0 + X_1 z)(H_0 + H_1 z)   (11)

The final filtering equations are given by:

Y_0 = X_0 H_0 + z^{-3} [(X_1 + X_2)(H_1 + H_2) - X_1 H_1 - X_2 H_2]   (12)

Y_1 = (X_0 + X_1)(H_0 + H_1) - X_1 H_1 - X_0 H_0 + z^{-3} X_2 H_2   (13)

Y_2 = (X_0 + X_1 + X_2)(H_0 + H_1 + H_2) - (X_0 + X_1)(H_0 + H_1) - (X_1 + X_2)(H_1 + H_2) + 2 X_1 H_1   (14)

The initial FIR filter has been broken into 6 (\alpha_i = 6) sub-filters {H_0, H_1, H_2, H_0 + H_1, H_1 + H_2, H_0 + H_1 + H_2} in parallel (see Fig. 6), whose length L/3 is smaller than the initial one L. Compared to the optimal algorithm, this is 6 sub-filters rather than 5. This is the price to be paid for obtaining a more precise algorithm, while keeping the overhead (number of additions) small. Even if non-optimal, the algorithm is efficient, since the L MACs required by the direct convolution are reduced to 2L/3 MACs per output point. This may be significant for medium or large length filters. Just like the F(2,2) algorithm, the F(3,3) algorithm performs block processing: it computes altogether a set of three consecutive outputs, thus taking advantage of the redundancy between successive computations. The processing delay (I/O) is small (6 samples) and is independent of the FIR filter length.
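Similarly, eqs. (12)-(14) can be prototyped and verified in a few lines (again an illustrative Python sketch with our own helper names, not a DSP implementation):

```python
def conv(a, b):
    y = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            y[i + j] += ai * bj
    return y

def padd(a, b, n=None):
    n = n if n is not None else max(len(a), len(b))
    return [p + q for p, q in zip(a + [0.0] * (n - len(a)),
                                  b + [0.0] * (n - len(b)))]

def neg(a):
    return [-v for v in a]

def fir_f33(x, h):
    """F(3,3): six third-length sub-filters (eqs. (12)-(14))."""
    xs = [x[i::3] for i in range(3)]
    hs = [h[i::3] for i in range(3)]
    p = [conv(xs[i], hs[i]) for i in range(3)]              # Xi Hi
    p01 = conv(padd(xs[0], xs[1]), padd(hs[0], hs[1]))      # (X0+X1)(H0+H1)
    p12 = conv(padd(xs[1], xs[2]), padd(hs[1], hs[2]))      # (X1+X2)(H1+H2)
    p012 = conv(padd(padd(xs[0], xs[1]), xs[2]),
                padd(padd(hs[0], hs[1]), hs[2]))            # (X0+X1+X2)(H0+H1+H2)
    n = 1 + max(map(len, [p012, p01, p12, p[0], p[1], p[2]]))
    y0 = padd(p[0], [0.0] + padd(p12, neg(padd(p[1], p[2]))), n)   # eq. (12)
    y1 = padd(padd(p01, neg(padd(p[0], p[1]))), [0.0] + p[2], n)   # eq. (13)
    y2 = padd(padd(p012, neg(padd(p01, p12))),
              [2.0 * v for v in p[1]], n)                          # eq. (14)
    y = []
    for k in range(n):
        y += [y0[k], y1[k], y2[k]]                          # interleave 3 phases
    return y[:len(x) + len(h) - 1]
```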

Table 1 gives an evaluation of the arithmetic complexity of different basic F(2,2) and F(3,3) algorithms built for various simple interpolation points. The interpolation points 1 and -1 generate either an addition or a subtraction. The last two F(2,2) algorithms require one addition more per output sample than the first and second ones. On some recent DSP's, this additional operation has no consequence, because they are able to compute an addition/subtraction of the form a ± b efficiently (at a cost of only one machine cycle), thanks to the pipeline technique. The number of multiplications, not shown, is equal to the number of MACs in Table 1. The fifth column provides the theoretical savings (25%, 34%), both in terms of multiplications and MAC operations, compared to the direct convolution. Since this improvement is given as a percentage of the initial computational load, and since it is obtained at the cost of a fixed (and small) number of additions, these savings are asymptotic in terms of DSP cycles when the


length of the filter increases, but are very quickly reached. Finally, the last column gives the processing delay, which is proportional to the decimation factor.

Fig. 6 Structure of the fast FIR filtering algorithm F(3,3): pre-processing of X(z) into X_0(z^3), X_1(z^3) and X_2(z^3), six length-L/3 sub-filters H_0(z^3), H_1(z^3), H_2(z^3), H_0(z^3)+H_1(z^3), H_1(z^3)+H_2(z^3) and H_0(z^3)+H_1(z^3)+H_2(z^3) in parallel, and post-processing reconstructing Y_0(z), Y_1(z), Y_2(z) and Y(z).

Basic filtering algorithm F(N_i, N_i)      | # sub-filters \alpha_i | # Add/Sub (per point) | # MAC (per point) | Asymptotic savings (MAC) | Processing delay
F(2,2), {a_0, a_1, a_2} = {0, 1, \infty}   | 3 | ≈ 2    | 3L/4 | 25% | 4Te
F(2,2), {a_0, a_1, a_2} = {0, -1, \infty}  | 3 | ≈ 2    | 3L/4 | 25% | 4Te
F(2,2), {a_0, a_1, a_2} = {1, -1, \infty}  | 3 | ≈ 3    | 3L/4 | 25% | 4Te
F(2,2), {a_0, a_1, a_2} = {-1, 0, 1}       | 3 | ≈ 3    | 3L/4 | 25% | 4Te
F(3,3), {F(2,2) with 0, 1, \infty}         | 6 | ≈ 10/3 | 2L/3 | 34% | 6Te

Table 1 Basic short FIR filtering algorithm characteristics

3.2. BASIC FILTERING ALGORITHMS USING SHORT FFT's FF(N, N)

Interpolation points can also be chosen in the complex domain. A particular set of interpolation points taken on the unit circle, {a_i = W_K^i = exp(-2j\pi i/K), with K = 2N-1}, results in an FFT-based fast FIR filtering algorithm (see [18], [19]), denoted FF(N, N). The first part requires the computation of a Discrete Fourier Transform (DFT) performed on 2N-1 real points. The second part consists of 2N-1 complex FIR filters in parallel. The third part computes an Inverse DFT (IDFT). The authors of [16] have demonstrated that an additional interpolation point is possible, in order to bring the DFT and IDFT lengths to 2N. It is indeed simpler to compute a DFT whose length is a power of 2. For a fast computation, the DFT's are computed through Fast Fourier Transform (FFT) algorithms (see Fig. 7).


Fig. 7 Structure of the fast FIR filtering algorithm using short FFT's: the inputs x(n), ..., x(n-N+1) and their z^{-N} delayed versions feed a 2N-point DFT (pre-processing), followed by 2N complex sub-filters CF(0) to CF(2N-1) in parallel (filtering), and a 2N-point IDFT (post-processing) producing y(n), ..., y(n-N+1); N of the IDFT output values are ignored.

The real-valued DFT computations

Note that, the input samples being real-valued, the DFT output has hermitian symmetry. Hence, only half of the outputs need to be computed. Although some algorithms for computing fast DFTs of real-valued data are available, such as the split-radix or fast Hartley transform algorithms, it is often more efficient to use more classical computations based on complex-valued FFT's of shorter length, which have been optimized by the DSP manufacturer. The classical technique we used for an efficient implementation on any recent DSP is now described.

The 2N-point real input sequence {x(n)} is divided into even and odd indexed sub-sequences, denoted respectively {h(n)} and {g(n)}. A complex sequence {y(n)} is built from these two real-valued sequences. The final expression of the DFT of {x(n)} is given by equation (16). The other values are obtained by exploiting the symmetry and redundancy of the DFT of real points (eq. (17)).

X(k) = \sum_{n=0}^{2N-1} x(n) W_{2N}^{nk} = \sum_{n=0}^{N-1} h(n) W_N^{nk} + W_{2N}^k \sum_{n=0}^{N-1} g(n) W_N^{nk}   (15)

X(k) = (Y(k) + Y^*(N-k))/2 + j W_{2N}^k (Y^*(N-k) - Y(k))/2   (16)

X(k) = X^*(2N-k),   k = N+1, N+2, ..., 2N-1   (17)

Therefore, the computation of a length-2N real-valued DFT is optimized by computing a single length-N complex DFT, with an additional post-processing which is easily and efficiently implemented (see Fig. 8), since it makes use of the FFT kernel.

A careful analysis of the previous equations shows that some improvement may be obtained by computing X(k) and X^*(N-k) at the same time, because they use the same operands (see [23]). Consequently, some redundant arithmetic computations are eliminated, as well as a number of data transfers (read/write) in memory which contribute to the execution time of the algorithm. Some evaluations are provided in Table 2, where the second column evaluates the cost of the computation of {X(k), k = 0, 1, ..., N}, and the third one deals with the joint computation of X(k) and X^*(N-k).


Post-processing                 | X(k) | X(k) and X^*(N-k) altogether
Arithmetic operations (Add/Sub) | 8N   | 5N
Transfer operations             | 6N   | 3N
Total for N complex points      | 14N  | 8N

Table 2
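The packing technique of eqs. (15)-(17) can be prototyped as follows (Python sketch; a plain O(N^2) DFT stands in for the manufacturer-optimized complex FFT kernel, and the function names are ours):

```python
import cmath

def dft(seq):
    """Reference O(n^2) DFT; stands in for an optimized complex FFT kernel."""
    n = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def real_dft_2n(x):
    """2N-point DFT of a real sequence via one N-point complex DFT (eqs. (15)-(17))."""
    N = len(x) // 2
    y = [complex(x[2 * t], x[2 * t + 1]) for t in range(N)]  # pack h(n) + j g(n)
    Y = dft(y)                                   # single length-N complex DFT
    X = [0j] * (2 * N)
    for k in range(N + 1):
        Yk = Y[k % N]
        YNk = Y[(N - k) % N].conjugate()         # Y*(N-k)
        w = cmath.exp(-2j * cmath.pi * k / (2 * N))   # W_2N^k
        X[k] = (Yk + YNk) / 2 + 1j * w * (YNk - Yk) / 2   # eq. (16)
    for k in range(N + 1, 2 * N):
        X[k] = X[2 * N - k].conjugate()          # hermitian symmetry, eq. (17)
    return X
```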

Moreover, small improvements are possible by computing separately the points X(0), X(N) and X(N/2), which are real-valued. The optimized operations for an efficient implementation of the pre-processing required by the computation of a length-2N real-valued DFT are illustrated in Fig. 8.

Fig. 8 Optimal organization of the pre-processing module: the input samples are packed into an N-point complex FFT, followed by the butterflies of eq. (16) with twiddle factors W_{2N}^k = exp(-2j\pi k/(2N)), producing X(k) and X^*(N-k) jointly; the real-valued points X(0) and X(N) are computed separately.

The complex filtering computation

Since the DFT computation provides complex outputs, the sub-filters fed by the output of each frequency bin are also complex. Due to the hermitian symmetry, only N+1 such sub-filters of length L/N have to be computed. A straightforward implementation of these N+1 complex sub-filters is equivalent to 4(N+1) real-valued sub-filters of the same length. The improvements brought in [19] allow this number to be reduced, since one complex sub-filter can be replaced by 3 real-valued sub-filters (see Fig. 9). Moreover, 2 particular complex sub-filters are actually real-valued (see [23]), so that the decomposition of Fig. 9 is useless for these 2 sub-filters, saving 4 more real sub-filters. Altogether, the complexity is reduced to that of 3N-1 (i.e. 3(N+1) - 4) real FIR sub-filters in parallel (see [23]).

Table 3 evaluates the complexity of the full algorithms: the second and third columns represent respectively the number of real sub-filters generated in parallel and the asymptotic savings (in percent) compared to the classical convolution.

Fig. 9 Structure of an FIR complex filter: the complex filtering of X(z) = Xr(z) + jXi(z) producing Y(z) = Yr(z) + jYi(z) is computed with the 3 real sub-filters Hr(z), Hr(z) + Hi(z) and Hr(z) - Hi(z) instead of 4.
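The decomposition of Fig. 9 can be sketched as follows (illustrative Python model with our own function names; since the coefficients are fixed, the combined filters Hr + Hi and Hr - Hi are precomputed once):

```python
def conv(a, b):
    y = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            y[i + j] += ai * bj
    return y

def cfilter_3real(xr, xi, hr, hi):
    """Complex FIR filtering with 3 real sub-filters Hr, Hr+Hi, Hr-Hi (Fig. 9)."""
    hp = [a + b for a, b in zip(hr, hi)]             # Hr + Hi, precomputed once
    hm = [a - b for a, b in zip(hr, hi)]             # Hr - Hi, precomputed once
    m1 = conv([a + b for a, b in zip(xr, xi)], hr)   # (Xr + Xi) Hr
    m2 = conv(xi, hp)                                # Xi (Hr + Hi)
    m3 = conv(xr, hm)                                # Xr (Hr - Hi)
    yr = [p - q for p, q in zip(m1, m2)]             # Yr = Xr Hr - Xi Hi
    yi = [p - q for p, q in zip(m1, m3)]             # Yi = Xr Hi + Xi Hr
    return yr, yi
```

Three real convolutions replace the four of the straightforward complex filter, at the cost of a few additions.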

Filtering algorithms FF(N, N) | # real-valued sub-filters (3N - 1) | Savings (MAC) | Processing delay (2NTe)
FF(4,4)   | 11 | 31% | 8Te
FF(8,8)   | 23 | 64% | 16Te
FF(16,16) | 47 | 82% | 32Te

Table 3

[Figure: FIR filtering module. For each frequency bin, the sequences Real(X_lN(k)) + Imag(X_lN(k)), Imag(X_lN(k)) and Real(X_lN(k)) feed z^{-N} delay lines tapped by the combined coefficients Real(H_m(k)) - Imag(H_m(k)), Real(H_m(k)) + Imag(H_m(k)) and Real(H_m(k)), m = 0,...,L/N - 1; the accumulated outputs are the partial results mr(k), mi(k) and mri(k), for k = 0,...,N.]

Fig. 10 FIR filtering module

A dual procedure is used for the computation of the 2N-point complex inverse FFT with Hermitian-symmetric input. This amounts to the calculation of a single length-N complex inverse FFT by the procedure below:

x(n) = Σ_{k=0}^{2N-1} X(k) w_{2N}^{-nk},   with w_{2N} = exp(-j2π/2N)  (the 1/2N scaling is omitted)

The Hermitian symmetry gives X(k + N) = X*(N - k), so that the even and odd output samples are

x(2n) = Σ_{k=0}^{N-1} ( X(k) + X*(N - k) ) w_N^{-nk}

x(2n + 1) = Σ_{k=0}^{N-1} ( X(k) - X*(N - k) ) w_{2N}^{-k} w_N^{-nk}

Since x(2n) and x(2n + 1) are real valued:

x(2n) + j x(2n + 1) = Σ_{k=0}^{N-1} [ X(k) + X*(N - k) + j w_{2N}^{-k} ( X(k) - X*(N - k) ) ] w_N^{-nk}   (18)

V(k) = X(k) + X*(N - k) + j w_{2N}^{-k} ( X(k) - X*(N - k) )   (19)

Equation (18) shows that the computation of a complex IFFT with Hermitian symmetry reduces to one complex IFFT of half the length, plus some pre-processing, as illustrated in Fig. 11. This figure depicts the dual operations of Fig. 8: both flow graphs are transposes of each other.


[Figure: organization of the post-processing module (transpose of Fig. 8). The sub-filter outputs mr(k), mi(k) and mri(k) are recombined into Real(S(k)) and Imag(S(k)); S(k) and S*(N - k) are then combined, using the twiddle factors W_{2N}^k = exp(-j2πk/2N), into V(0),...,V(N - 1), and an N-point complex inverse FFT delivers the output samples y(n) (N of the computed values being ignored).]

Fig. 11 Organization of post-processing module

The overall algorithm

This algorithm FF(N,N) operates on a small block composed of N consecutive samples, requiring the calculation of FFT's of length 2N (see Fig. 8). Whatever the filter length L, it is possible to maintain a low processing delay, given by 2N. Obviously, longer FFT's result in more efficient algorithms (as far as the number of operations is concerned), but also in a longer I/O delay. Note also that when the FFT length increases, multiplications appear in the pre- and post-processing modules.

4. A "LEGO" VIEW OF FAST FIR FILTERING ALGORITHMS

All these basic filtering algorithms come from the same general procedure and are hence compatible: whenever some algorithm produces sub-filters, another basic fast algorithm can be applied to the sub-filters as well, and this recursively. This improves the efficiency of the fast algorithm, at the cost of an increased delay.

4.1. GENERAL STRUCTURE OF COMPOSITE FIR FILTERING ALGORITHMS

Basic filtering algorithms have the same computational structure, so these algorithms can easily be recursively nested: any basic fast FIR algorithm can be applied to the so-called sub-filters as well, leading to a composite fast FIR filtering algorithm.

Assume that L, the FIR filter length, is composite: L = N_1 N_2 ··· N_n L_r. The general composite FIR filtering algorithm, denoted F(N_1,N_1),...,F(N_n,N_n), is built from the basic algorithms seen in the previous section; each term F(N_i,N_i) may be chosen either as the short-length algorithm F(N_i,N_i) or as the FFT-based algorithm FF(N_i,N_i). The procedure consists in first applying the basic algorithm F(N_1,N_1). This algorithm decimates by a factor equal to N_1, splitting the initial filter into α_1 sub-filters of length L/N_1. The second algorithm F(N_2,N_2) operates on each of the α_1 sub-filters, generating α_1 α_2 sub-filters in parallel, whose length is equal to L/(N_1 N_2). The process is iterated until the last basic algorithm F(N_n,N_n), giving therefore ∏_{i=1}^{n} α_i sub-filters in parallel, whose length is equal to L/∏_{i=1}^{n} N_i = L_r.


[Figure: structure of the composite fast FIR filtering algorithm F(2,2), F(2,2). The pre-processing part combines the input X_n and its delayed versions (z^{-2}, z^{-4}) by additions; the 9 sub-filters of length L/4 use the coefficient polyphase components h00, h01, h10, h11, h20, h21 and the sums h00 + h01, h10 + h11, h20 + h21; the post-processing part recombines the sub-filter outputs by additions and subtractions into Y_n.]

Fig. 12 Structure of composite fast FIR filtering algorithm F(2,2), F(2,2)

The general composite fast FIR filtering scheme still has a structure similar to that of the basic filtering algorithms: it is made of three parts. The first groups the pre-processing parts of all the F(N_i,N_i) that are successively applied. The second consists of ∏_{i=1}^{n} α_i real-valued FIR sub-filters in parallel. The last is composed of the individual post-processing stages of the F(N_i,N_i). The pre- and post-processing parts generate and recombine the corresponding data. Figure 12 presents the algorithm obtained by nesting two F(2,2) algorithms, thus resulting in 3 × 3 = 9 decimated FIR sub-filters.
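The recursive nesting can be sketched in Python for the pure F(2,2) case. This is an illustrative batch version of the polyphase identities Y0 = H0 X0 + z^{-1} H1 X1 and Y1 = (H0+H1)(X0+X1) - H0 X0 - H1 X1; the paper's streaming DSP implementation differs, and the function names are ours:

```python
def convolve(a, b):
    """Direct convolution, used for the terminal sub-filters."""
    y = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            y[i + j] += ai * bj
    return y

def fast_fir(x, h, depth):
    """Apply the F(2,2) decomposition recursively `depth` times.
    Each level replaces one convolution by 3 half-length ones."""
    if depth == 0 or len(x) % 2 or len(h) % 2:
        return convolve(x, h)            # plain sub-filter
    x0, x1 = x[0::2], x[1::2]            # polyphase split of the signal
    h0, h1 = h[0::2], h[1::2]            # polyphase split of the filter
    p0 = fast_fir(x0, h0, depth - 1)                           # H0 X0
    p1 = fast_fir(x1, h1, depth - 1)                           # H1 X1
    p2 = fast_fir([a + b for a, b in zip(x0, x1)],
                  [a + b for a, b in zip(h0, h1)], depth - 1)  # (H0+H1)(X0+X1)
    m = len(p0)
    y = [0.0] * (len(x) + len(h) - 1)    # interleave even/odd output phases
    for i in range(m):
        y[2 * i] += p0[i]                        # Y0 = H0 X0 + z^-1 H1 X1
        if 2 * (i + 1) < len(y):
            y[2 * (i + 1)] += p1[i]
        y[2 * i + 1] = p2[i] - p0[i] - p1[i]     # Y1 = (H0+H1)(X0+X1) - H0 X0 - H1 X1
    return y
```

With depth = 2 this produces the 9 decimated sub-filters of Fig. 12 and matches the direct convolution exactly.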

4.2. ARITHMETIC COMPLEXITY

The use of several decompositions considerably decreases the arithmetic complexity (see [5]), which is on the order of L (∏_{i=1}^{n} α_i) / (∏_{i=1}^{n} N_i)² MAC per output sample. This number does not include the pre- and post-processing operations, which can be quite complicated when FFT's are involved. Note that this complexity remains low when short-length FFT's are used (which is always the case in practical implementations on DSP's under a delay constraint). These computations will however be included in our evaluations in terms of DSP cycles later on. In some cases, the number of MAC can effectively be halved without increasing the number of additions. Table 4 provides an evaluation of the various possibilities obtained by combining the basic algorithms explained above. The first column describes the algorithm, giving the order in which the decompositions are applied. The second column gives the number of FIR sub-filters run in parallel. The third column gives, for each algorithm, the asymptotic savings in terms of MAC per output point compared to the direct convolution, which requires L MAC per output point. The last column evaluates the processing delay as a function of the sample period Te. Note that the asymptotic performances are not influenced by the order of the decompositions. However, this order influences the overhead, i.e. the rate at which the asymptotic gain is reached. This will be clearer later on.
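The bookkeeping above reduces to a few products. This hypothetical helper reproduces the sub-filter count, the asymptotic savings 1 - (∏α_i)/(∏N_i)² and the delay 2(∏N_i)Te of Tables 3 and 4 (for FF(N,N) we use the 3N - 1 effective sub-filters of Table 3; the memory layout of Section 5 pads this to 3(N+1) for regularity):

```python
def alpha(kind, n):
    """Sub-filters generated by one basic algorithm:
    3 for F(2,2), 6 for F(3,3), 3N-1 for FF(N,N)."""
    return 3 * n - 1 if kind == "FF" else {2: 3, 3: 6}[n]

def composite_stats(decomposition):
    """decomposition: list of (kind, N) pairs, e.g. [("F", 2), ("F", 3)].
    Returns (#sub-filters, asymptotic MAC savings, processing delay in Te)."""
    subfilters = 1
    block = 1                    # N = N1 * N2 * ... * Nn
    for kind, n in decomposition:
        subfilters *= alpha(kind, n)
        block *= n
    savings = 1.0 - subfilters / block ** 2   # fraction of MACs saved
    delay = 2 * block                         # processing delay, in Te units
    return subfilters, savings, delay
```

For example, `composite_stats([("F", 2), ("F", 3)])` reproduces the Table 4 row F(2,2), F(3,3): 18 sub-filters, 50% savings, 12 Te.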

Page 15: EFFICIENT IMPLEMENTATION METHODOLOGY OF FAST FIR …mokraoui/FIR-REV.pdf · the DSP architecture used for implementing the filtering algorithms should have a significant impact on

15

Composite filtering algorithms          # Sub-filters   Asymptotic savings           Processing delay
F(N1,N1),...,F(Nn,Nn)                   ∏α_i            1 - (∏α_i)/(∏N_i)²           2(∏N_i)Te
Direct convolution                      1               0%                           Te
F(2,2)                                  3               25%                          4 Te
F(3,3)                                  6               34%                          6 Te
F(2,2), F(2,2)                          9               44%                          8 Te
F(2,2), F(3,3)                          18              50%                          12 Te
F(3,3), F(2,2)                          18              50%                          12 Te
F(3,3), F(3,3)                          36              56%                          18 Te
F(2,2), F(2,2), F(2,2)                  27              58%                          16 Te
F(2,2), F(2,2), F(3,3)                  54              63%                          24 Te
F(2,2), F(3,3), F(2,2)                  54              63%                          24 Te
F(3,3), F(2,2), F(2,2)                  54              63%                          24 Te
F(3,3), F(3,3), F(2,2)                  108             67%                          36 Te
F(2,2), F(3,3), F(3,3)                  108             67%                          36 Te
F(3,3), F(2,2), F(3,3)                  108             67%                          36 Te
F(3,3), F(3,3), F(3,3)                  216             71%                          54 Te
F(2,2), F(2,2), F(2,2), F(2,2)          81              69%                          32 Te
FF(4,4), F(2,2)                         33              48%                          16 Te
FF(4,4), F(2,2), F(2,2)                 99              61%                          32 Te
FF(4,4), F(2,2), F(2,2), F(2,2)         297             70%                          64 Te
FF(4,4), F(2,2), F(2,2), F(2,2), F(2,2) 891             78%                          128 Te
FF(8,8), F(2,2)                         69              73%                          32 Te
FF(8,8), F(2,2), F(2,2)                 207             79%                          64 Te
FF(8,8), F(2,2), F(2,2), F(2,2)         621             85%                          128 Te
FF(8,8), F(2,2), F(2,2), F(2,2), F(2,2) 1861            88%                          256 Te
FF(16,16), F(2,2)                       141             86%                          64 Te
FF(16,16), F(2,2), F(2,2)               423             89%                          128 Te
FF(16,16), F(2,2), F(2,2), F(2,2)       1269            92%                          256 Te

Table 4

4.3. PROBLEMS BETWEEN DSP ARCHITECTURE AND FILTERING ALGORITHMS

However, at first glance, if one wants to use such a strategy in our fast algorithms, the number of pointer registers seems to be proportional to the number of sub-filters. For example, the composite FIR filtering algorithm FF(N_1,N_1), F(N_2,N_2),...,F(N_n,N_n) would require 2(3N_1 - 1) ∏_{i=2}^{n} α_i pointer registers for its DSP implementation. Table 5 evaluates, for some filtering algorithms, the number of pointer registers required. It is easily seen that actual DSP's do not meet these requirements. This problem was already pointed out in previous papers (see [7]), in which the algorithm was restricted to a single decomposition due to these constraints. Particular attention was devoted to this difficulty in [9], [10], and a solution is given below.


Filtering algorithms                  # Sub-filters   # Pointer registers
F(N1,N1),...,F(Nn,Nn)                 ∏α_i            2∏α_i
F(2,2)                                3               6
F(3,3)                                6               12
F(2,2), F(2,2)                        9               18
F(2,2), F(3,3)                        18              36
F(3,3), F(3,3)                        36              72
F(2,2), F(2,2), F(2,2)                27              54
F(2,2), F(2,2), F(3,3)                54              108
F(3,3), F(3,3), F(2,2)                108             216
FF(4,4)                               11              22
FF(8,8)                               33              66
FF(4,4), F(2,2)                       33              66
FF(4,4), F(2,2), F(2,2)               99              198
FF(4,4), F(2,2), F(2,2), F(2,2)       297             594
FF(16,16), F(2,2)                     141             282

Table 5

5. UNIFIED METHODOLOGY FOR AN EFFICIENT REAL-TIME

IMPLEMENTATION ON DSP

In the previous paragraphs, we have seen that, whatever the filtering algorithm F(N_1,N_1),...,F(N_n,N_n) (short length, composite length, based on short FFT's), it is made of 3 main parts. The proposed general methodology makes use of this property and of the mathematical structure of nested algorithms. In order to keep as much as possible of the improvement brought by the reduced arithmetic complexity of these filtering algorithms, without exceeding the resources available on DSP's, we suggest an efficient data memory management and organization. The proposed technique is based on a precise address generation allowing the use of only two modulo addressing modes for the global algorithm. The basic idea proposed to overcome the heavy pointer requirements consists in defining one delay line for ordering data and one array for storing the sub-filter coefficients. The number of pointer registers required for computing all sub-filter outputs is thus reduced to only two. This is rather easily obtained because, for a given algorithm combination, all quantities involved are evaluated by simple formulas depending on {N_i}, {α_i} and {L_i}. Once this is done, an efficient implementation is obtained by storing the elements to be used in a filtering equation at regularly spaced locations in a delay line. As a first example, let us evaluate the total sizes of the data buffer and coefficient array. The new delay line regroups all ∏_{i=1}^{n} α_i data lines, whose length is equal to the sub-filter length L_r. Therefore the data buffer size and the coefficient array size are both equal to the product of the number of sub-filters in parallel by the sub-filter length L_r (see eq. (20)):

Size_buffer = Size_array = (∏_{i=1}^{n} α_i)(L / ∏_{i=1}^{n} N_i) = (∏_{i=1}^{n} α_i) L_r   (20)
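Eq. (20) is easy to check numerically. In this sketch the α_i are passed explicitly, so the padded value used for FF blocks (see the note below) can be supplied directly; the helper name is ours:

```python
def buffer_size(alphas, ns, L):
    """Eq. (20): Size_buffer = Size_array = (prod alpha_i) * (L / prod N_i).
    alphas: sub-filter counts alpha_i of each nested basic algorithm.
    ns:     decimation factors N_i.  L: total FIR filter length."""
    prod_alpha = 1
    for a in alphas:
        prod_alpha *= a
    prod_n = 1
    for n in ns:
        prod_n *= n
    assert L % prod_n == 0, "L must be divisible by N1*...*Nn"
    lr = L // prod_n              # sub-filter length Lr
    return prod_alpha * lr        # size of the delay line and of the array
```

For F(2,2), F(2,2) with L = 64, this gives 9 sub-filters of length 4, i.e. a 144-word delay line and a 144-word coefficient array.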

Page 17: EFFICIENT IMPLEMENTATION METHODOLOGY OF FAST FIR …mokraoui/FIR-REV.pdf · the DSP architecture used for implementing the filtering algorithms should have a significant impact on

17

Note that although FF(N_i,N_i) requires the computation of only 3N_i - 1 inner products (i.e. 3(N_i + 1) - 4), the filtering module (see Fig. 10) is structured as if it computed 3(N_i + 1) sub-filters. Indeed, this structuring maintains the processing regularity of the proposed implementation. So, in this case, α_i is equal to 3(N_i + 1).

5.1. EFFICIENT IMPLEMENTATION OF PRE PROCESSING MODULE

The filtering algorithm efficiency depends essentially on the arithmetic complexity and on the data management of the pre-processing module. So far, the basic filtering algorithms F(N_i,N_i) presented in the previous section have been evaluated in terms of arithmetic complexity. The study presented in this section is devoted to the data memory organization. From the pre-processing parts of the basic filtering algorithms, the method builds the pre-processing module of the composite filtering algorithm, without requiring the explicit filtering equation.

The data line must store the L(∏_{i=1}^{n} α_i)/(∏_{i=1}^{n} N_i) most recent combined samples. Thus, an actualization operation is necessary. Since each of the ∏_{i=1}^{n} α_i data lines requires the update of one sample, ∏_{i=1}^{n} α_i combined samples have to be actualized in the data line. The sub-filter coefficients are fixed, so the coefficients array does not require any updating operation; its memory storage is identical to the data storage described in the following.

The method proposed for the actualization of the interleaved samples is based on a progressive filling technique, applied as each basic filtering algorithm F(N_i,N_i) is processed. The pre-processing module combines (arithmetic operations) and organizes data found at equally spaced memory locations, according to the basic algorithm employed. The physical address is determined by the relationship between the decimation factor N_i and the number of sub-filters in parallel.

The initialization step consists in distributing ∏_{i=1}^{n} N_i input data at particular positions in a segment of the data delay line of length ∏_{i=1}^{n} α_i. Then, the algorithm F(N_i,N_i) processes ∏_{j=i+1}^{n} N_j times a number N_i of sequences, each one of size ∏_{j=1}^{i-1} α_j. The memory is organized according to these quantities, in a block manner. Each block is indexed by three quantities, M(l, k), where M = ∏_{j=1}^{i-1} α_j = α_1···α_{i-1} is the size of the block, l is the sequence number (l = 1,...,N_i) and k is a recursion index (k = 1,...,∏_{j=i+1}^{n} N_j). This is illustrated in Fig. 13.


[Figure: optimal memory data organization. The data delay line, of length L(∏α_i)/(∏N_i), is filled progressively: free memory spaces are successively filled by the application of F(N_1,N_1), F(N_2,N_2) and F(N_3,N_3), producing blocks α_1(l,k), then α_1α_2(l,k), then α_1α_2α_3(l,k), with the increments Step_Write(F(N_1,N_1)) and Step_Read(F(N_1,N_1)). Legend: N_i = decimation factor; N = N_1N_2···N_n = input data block size; α_i = number of sub-filters generated by F(N_i,N_i).]

Fig. 13 Optimal memory data organization

The trick used here is that one knows in advance, from the basic algorithm combination, how many sub-filters will be used. It is thus possible to store at contiguous locations all data corresponding to the same delay in all sub-filters. This "slice" of the delay line is then efficiently processed by a cyclic pointer, since the increment to fetch the data of a specific filter is the same for all of them. The increments to fetch or store data are given below.

After the initialization phase, the first basic filtering algorithm F(N_1,N_1) is applied to the distributed input data delay line. The algorithm combines (arithmetic operations) each set of N_1 input sequences according to its pre-processing part. Thus, for each N_1 input sequences, the algorithm generates α_1 data (see Fig. 13). These results are stored in the delay line at appropriate, equally spaced places, as illustrated in Fig. 13. Access to a particular datum of the delay line is given by a reading step (see eq. (22)); the storage is determined by a writing step (see eq. (21)).

The next basic filtering algorithm F(N_2,N_2) processes ∏_{i=3}^{n} N_i sequences composed of N_2 consecutive blocks α_1(l, k), with, for a given k ∈ {1,...,∏_{i=3}^{n} N_i}, l = 1,...,N_2. The algorithm generates several α_1α_2(l, k) blocks, for all k and l in the given limits: k = 1,...,∏_{i=4}^{n} N_i, l = 1,...,N_3. Based on the same principle, each following algorithm treats the scheduled data provided by the preceding one and stores the results in memory, equally spaced in the delay line, and so on until the last basic algorithm F(N_n,N_n).

Assume that a basic algorithm F(N_i,N_i) is applied. Figure 14 shows the increments required to position the data delay line pointer register at the correct memory address for reading or writing data.


[Figure: data delay line with interleaved blocks α_1···α_i(l, k); the pointer register moves by the increments Step^1_Read(F(N_i,N_i)) to Step^5_Read(F(N_i,N_i)) and Step_Write(F(N_i,N_i)); pointer register = initial physical address + step, modulo the delay line length.]

Fig. 14 Pointer register increments

The pointer register increments (see Fig. 14) are defined below:

Step_Write(F(N_i,N_i)) = (α_i - N_i) ∏_{j=1}^{i-1} α_j   (21)

Step^1_Read(F(N_i,N_i)) = Step^2_Read(F(N_i,N_i)) = 1 if i = 1,  and  ∏_{j=1}^{i-1} α_j if i ≠ 1   (22)

The increments Step^3_Read(F(N_i,N_i)) and Step^4_Read(F(N_i,N_i)), given by the piecewise expressions (23) and (24) in the decimation factors {N_j} and the sub-filter counts {α_j}, reposition the pointer between successive sequences and between successive recursions of the progressive filling.

A last step positions the pointer register in order to apply the next algorithm F(N_{i+1},N_{i+1}): Step^5_Read(F(N_i,N_i)), given by the piecewise expression (25), again a function of {N_j} and {α_j}.

The filter coefficients memory organization is based on the same method as the data delay line described above, but the method is applied only once, since the filter coefficients are fixed. Thus, this memory organization is done off-line.

5.2. EFFICIENT IMPLEMENTATION OF FILTERING MODULE

The filtering module made from FIR sub-filters is efficiently implemented on DSP's since the data to be combined

in each inner product computation (MAC) are available at equally spaced positions in memory (see Fig. 15).


[Figure: partial convolutions module. The data delay line of combined samples and the coefficient array of combined coefficients, both of length L_r ∏_{i=1}^{n} α_i, are scanned with the increment Step = ∏_{i=1}^{n} α_i; the resulting inner products are accumulated in the partial convolutions buffer, part of which is reserved for the post-processing module; new combined samples enter at Step_Write. Pdata, Pcoefficients and Pconvolutions mark the initial positions of the data, coefficient and partial convolutions pointer registers.]

Fig. 15 Partial convolutions module

The data and coefficients are read by means of two pointer registers, one attributed to the data delay line and one to the coefficients array. The pointer register increment corresponds to the spacing between the positions to be combined; it is given by a step depending on the number of sub-filters in parallel (see eq. (26)):

Step_conv = ∏_{i=1}^{n} α_i   (26)

The ∏_{i=1}^{n} α_i partial convolutions are stored in a buffer. A third pointer register is attributed to the partial convolutions buffer, in order to generate and recombine the corresponding variables in the post-processing module (see Fig. 15).
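One illustrative reading of this two-pointer scheme in Python (a software model only: a real DSP performs the modulo increment in hardware, and the orientation of the circular buffer is an assumption here):

```python
def partial_convolutions(delay_line, coeffs, num_subfilters, lr, newest_slice):
    """Compute the num_subfilters inner products with the interleaved layout:
    element (delay d, sub-filter f) lives at index d*num_subfilters + f, and
    the delay line is addressed circularly starting at the newest slice."""
    size = num_subfilters * lr           # eq. (20) buffer size
    out = [0.0] * num_subfilters
    for f in range(num_subfilters):
        ptr = newest_slice * num_subfilters + f   # data pointer start
        acc = 0.0
        for d in range(lr):              # lr MACs, stride = num_subfilters
            acc += delay_line[ptr] * coeffs[d * num_subfilters + f]
            ptr = (ptr + num_subfilters) % size   # modulo addressing
        out[f] = acc
    return out
```

Note that only one data pointer and one coefficient index move, whatever the number of sub-filters: the stride ∏α_i is the same for all of them.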

5.3. EFFICIENT IMPLEMENTATION OF POST PROCESSING MODULE

This module performs the converse task of the pre-processing module (see Fig. 16). The pre-processing part processes data blocks of size N, providing ∏_{i=1}^{n} α_i values. After filtering, the corresponding ∏_{i=1}^{n} α_i outputs must be combined together, according to the post-processing parts of the basic algorithms, in order to provide the corresponding N output values.

[Figure: each input/output data block of N samples is mapped by the pre-processing module to ∏_{i=1}^{n} α_i values for the partial convolutions, and the post-processing module maps the ∏_{i=1}^{n} α_i partial convolutions back to N output samples.]


Fig. 16 Relation between pre processing and post processing parts

The proposed method is based on a filling technique, as in the pre-processing module: a reserved memory space in the partial convolutions buffer is completed progressively as each basic filtering algorithm is applied. The procedure consists in first applying the post-processing part of the last basic algorithm F(N_n,N_n) to the partial convolution outputs obtained previously. Due to our structuring, the partial convolutions to be combined are available at equally spaced memory positions defined by a step, named Step^1_Read(F(N_n,N_n)), which is equal to the number of sub-filters in parallel (see Fig. 17).

The second stage consists in applying the next basic algorithm F(N_{n-1},N_{n-1}) to the result of the first stage. The data positions to be processed, according to the post-processing of F(N_{n-1},N_{n-1}), are given by a step which is easily computed as a function of the number of sub-filters (see Fig. 17). A similar technique is applied up to the last algorithm F(N_1,N_1), generating N = ∏_{i=1}^{n} N_i filtered outputs.

The step values, defined below for any basic algorithm F(N_i,N_i), increment the pointer register previously attributed to the partial convolutions buffer. The incrementation operation does not require additional machine cycles (see Fig. 17).

Step^1_Read(F(N_i,N_i)) = 1 if i = 1,  and  ∏_{k=1}^{i-1} α_k if i ≠ 1   (27)

The writing step, named Step_Write(F(N_i,N_i)), stores the result of the combination at a precise physical address:

Step_Write(F(N_i,N_i)) = ∏_{k=1}^{i-1} α_k, if i ≠ 1   (28)

A second reading step, Step^2_Read(F(N_i,N_i)), repositions the pointer between successive combinations; it is a product expression in the {α_k} and {N_k}, with the convention ∏_{k=i}^{n} N_k = 1 if i = n.

The reading step below positions the pointer register in order to apply the following basic algorithm F(N_{i-1},N_{i-1}):

Step^3_Read(F(N_i,N_i)) = (∏_{k=i}^{n} N_k)(∏_{k=1}^{i-1} α_k)   (29)


[Figure: post-processing module operating on the partial convolutions buffer. The post-processing part of F(N_n,N_n) is applied first, reading operands spaced Step^1_Read(F(N_n,N_n)) apart and writing with Step_Write(F(N_n,N_n)); each following application F(N_i,N_i) reads with Step^1_Read(F(N_i,N_i)), repositions with Step^2_Read and Step^3_Read, and completes the reserved free memory space, the block labels shrinking from α_1···α_n(l, k) down to the N final outputs. Pconvolutions marks the initial position of the partial convolutions pointer register.]

Fig. 17 Post processing module

5.4. HARDWARE RESOURCES EVALUATION

A very useful characteristic of our structuring of the composite filtering algorithm F(N_1,N_1),...,F(N_n,N_n) is that the total number of pointers required is fixed, whatever the number of basic short algorithms which are nested. The method consists in regrouping all data of the same kind (combined samples, combined coefficients, partial convolutions); each kind is stored in a single buffer. The previous sections on the optimized memory data organization show that:

For the PRE-PROCESSING MODULE, one pointer register, named Pdata, has been defined for the data delay line management. This evaluation does not include the unavoidable pointers required for the FFT computation when the first algorithm F(N_1,N_1) is substituted by FF(N_1,N_1).

For the FILTERING MODULE, two pointer registers have been defined. The pointer Pcoefficients is attributed to the filter coefficients array in order to execute the required convolution products. The other one, named Pconvolutions, is attributed to the partial convolutions buffer for storing the inner product results.

The POST-PROCESSING MODULE uses the pointer register Pconvolutions already defined in the filtering module; it allows the efficient management of the partial convolutions for the recombination operations.


Incoming input samples are stored in a delay line of fixed size. The input data management (push and pop) is handled by a pointer register named Pacquisition. Likewise, the output samples are directed by a pointer register named Prestitution. Hence the general structure of the composite algorithm F(N_1,N_1),...,F(N_n,N_n) requires 5 fixed pointer registers (Pdata, Pcoefficients, Pconvolutions, Pacquisition, Prestitution).

6. OPTIMIZED CODE GENERATOR SYSTEM

Although conceptually simple, the general memory data organization is quite intricate. The initializations of the pointers and their increments depend, through simple formulas, on the lengths of the short algorithms and on the number of "multiplies" they require. Rather than implementing these formulas in an ad hoc manner for each algorithm, we decided to build a Code Generator System (CGS) which manages the address generation efficiently. This CGS is written as a high-level language program, and automatically generates an assembly file after selecting the best combination of small-length building blocks in terms of computing cycles, given some constraints of the real-time application (maximum I/O delay) or the memory occupation (see Fig. 18).

These small building blocks are macro-instructions, and are the only DSP-dependent parts. The generated file is written in optimized assembly code and is directly assembled for the chosen DSP. These macros are the pre-processing and post-processing parts of each short fast FIR algorithm, the convolution (for the sub-filters), plus FFT's. Examples are given below:

Macro_Partial_Convolutions (%0, %1, %2, %3, %4, %5, %6)
{
%0: sub_filter_length,
%1: data_pointer_register,
%2: coefficients_sub_filters_pointer_register,
%3: partial_convolutions_pointer_register,
%4: coefficients_Step_conv,
%5: data_Step_conv,
%6: Step_Write_partial_convolutions,
Parametrized instructions in DSP assembly code
}

Macro_Pre_Processing_F(N_i,N_i) (%0, %1, %2, %3, %4, %5, %6, %7)
{
%0: data_pointer_register,
%1: data_Step^1_Read(F(N_i,N_i)),
%2: data_Step^2_Read(F(N_i,N_i)),
%3: data_Step^3_Read(F(N_i,N_i)),
%4: data_Step^4_Read(F(N_i,N_i)),
%5: data_Step^5_Read(F(N_i,N_i)),
%6: data_Step_Write(F(N_i,N_i)),
%7: input_pointer_register.
Parametrized instructions in DSP assembly code
}

[Figure: Code Generator System. A source file developed in C selects the decimation factors N_i and the best combination F(N_1,N_1),...,F(N_n,N_n) adapted to the application constraints (FIR filter length, processing delay, execution time, available memory space). It draws on a library of DSP-dependent macro-instructions written in optimized assembly code, on FIR sub-filter coefficient files, and on twiddle-factor files for the DFT/IDFT and their pre/post-processing when an FF(N_i,N_i) block is chosen. The "C" compiler and the DSP compiler then produce the efficient implementation: an assembly code file for the chosen DSP.]

Fig. 18 Code Generator System

The macros are called by a C language program, which computes the required parameters %0, %1, ... (increments in the delay line, pointers) and evaluates the corresponding number of cycles. Various algorithms are checked in a combinatorial manner, and the generator provides the one which is the most efficient, given the application constraints.
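A toy version of this combinatorial selection (hypothetical: it ranks candidates by asymptotic MACs per output rather than by the measured macro-instruction cycle counts the CGS actually uses, and the BLOCKS table covers only F(2,2) and F(3,3)):

```python
from itertools import product as cartesian

# Candidate basic building blocks: (name, decimation factor N, sub-filters alpha).
BLOCKS = [("F(2,2)", 2, 3), ("F(3,3)", 3, 6)]

def select_best(L, max_delay_te, max_depth=3):
    """Enumerate combinations of basic algorithms, keep those whose
    processing delay 2*prod(Ni)*Te meets the constraint, and return the
    one with the fewest asymptotic MACs per output: L*prod(alpha)/prod(N)^2."""
    best = None
    for depth in range(max_depth + 1):
        for combo in cartesian(BLOCKS, repeat=depth):
            pn = pa = 1
            for _, n, a in combo:
                pn, pa = pn * n, pa * a
            if L % pn or 2 * pn > max_delay_te:
                continue                      # violates length or delay constraint
            macs = L * pa / pn ** 2           # asymptotic MACs per output point
            if best is None or macs < best[0]:
                best = (macs, [name for name, _, _ in combo])
    return best
```

For L = 144 and a 12 Te delay budget, the search settles on a two-level F(2,2), F(3,3) nesting, the 50%-savings row of Table 4.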

7. IMPLEMENTATION RESULTS

In order to validate the CGS and to evaluate the performance of the FIR filtering algorithms, a set of optimized macro-instructions has been written for the Analog Devices "ADSP-2100" DSP. The experimental results are provided in Table 7.


Table 7 evaluates the number of machine cycles versus the filter length for various implementations of the filtering algorithms on the "ADSP-2100", as provided by the CGS. The graphical representations in Fig. 19a, Fig. 19b and Fig. 19c compare the execution times of the different filtering algorithms. It can be seen that the improvement is larger for long FIR filters and depends on the block size N. For an FIR filter longer than 350 coefficients, the overheads (data transfers, arithmetic operations, ...) become negligible and therefore have no noticeable influence on the throughput, compared to the execution time of the classical convolution algorithm. Note that combining the filtering algorithm based on short FFT's with the short-length filtering algorithms yields improved execution times. This result is partially due to the (deliberately chosen) limitation on the FFT size, itself due to the limited number of pointers; otherwise, longer FFT's would result in more efficient algorithms.

8. CONCLUSION

This paper has described an efficient methodology for implementing the recently proposed fast FIR filtering algorithms on DSP's. This methodology is based on a homogeneous presentation of the whole class of fast FIR algorithms (based on a technique of nesting sub-filters), showing both the common features of their structure and their interplay. This set of algorithms is quite general, since it includes the ones based on short-length filters as well as those based on short FFT's. The classical algorithms using long FFT's also belong to this class, but have not been considered for implementation, since they often do not meet practical constraints such as I/O delay. As a result, we could propose a memory data organization which is very economical in terms of DSP resources. The efficient address generation of data in memory is managed by the code generator system, which automatically generates assembly code. Figures of merit have been given, showing that the computing time for a 1000-tap FIR filter can be as low as 1/4 that of a classical computation, with a block size of only 32, hence an I/O delay of 64 samples.

Although described specifically for DSP's, this methodology should also be very efficient for a VLSI implementation, since this algorithm organization has very small hardware requirements and a regular structure. The main characteristics of the algorithms, such as the conservation of multiply-accumulate structures and the many sub-filters running in parallel at a lower rate than the initial one, should also be very helpful for obtaining precise tradeoffs between hardware complexity and I/O throughput.
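The sub-filter structure underlying the whole class can be made concrete with its smallest member. The sketch below is a plain-Python illustration in our own naming (not the paper's DSP code): the F(2,2) decomposition replaces one length-L filter by three length-L/2 sub-filters operating on the polyphase components of the input, which is the source of both the 25% MAC saving and the parallel, lower-rate sub-filters.

```python
def conv(a, b):
    """Reference linear convolution (direct form)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def fir_f22(x, h):
    """F(2,2) fast FIR: three half-length sub-filters instead of four."""
    lx, lh = len(x), len(h)
    x = list(x) + [0.0] * (lx % 2)       # pad to even lengths
    h = list(h) + [0.0] * (lh % 2)
    x0, x1 = x[0::2], x[1::2]            # even/odd polyphase components
    h0, h1 = h[0::2], h[1::2]
    p0 = conv(h0, x0)                                # sub-filter 1: H0 X0
    p1 = conv(h1, x1)                                # sub-filter 2: H1 X1
    p2 = conv([a + b for a, b in zip(h0, h1)],
              [a + b for a, b in zip(x0, x1)])       # sub-filter 3: (H0+H1)(X0+X1)
    y0 = p0 + [0.0]
    for k, v in enumerate(p1):                       # Y0 = H0 X0 + z^-1 H1 X1
        y0[k + 1] += v
    y1 = [p2[k] - p0[k] - p1[k] for k in range(len(p2))]  # Y1 = p2 - p0 - p1
    y = [0.0] * (len(y0) + len(y1))
    y[0::2], y[1::2] = y0, y1            # interleave the polyphase outputs
    return y[:lx + lh - 1]
```

Nesting this decomposition recursively on the three sub-filters yields the deeper F(2,2), F(2,2), ... combinations of Table 7.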


FIR filtering algorithms            Asymptotic savings    # Machine cycles
F(N1,N1), ..., F(Nn,Nn)             (MAC)                 (per point)

direct convolution                   0%                   L + 9
F(2,2)                              25%                   15.5 + 3L/4
F(3,3)                              34%                   22 + 2L/3
F(2,2), F(2,2)                      44%                   28.25 + 9L/16
F(2,2), F(3,3)                      50%                   38.5 + L/2
F(3,3), F(2,2)                      50%                   39 + L/2
F(3,3), F(3,3)                      56%                   53.11 + 4L/9
F(2,2), F(2,2), F(2,2)              58%                   47.75 + 27L/64
F(2,2), F(2,2), F(3,3)              63%                   63.16 + 3L/8
F(2,2), F(3,3), F(2,2)              63%                   64.16 + 3L/8
F(3,3), 2 F(2,2)                    63%                   65 + 3L/8
2 F(3,3), F(2,2)                    67%                   87.27 + L/3
F(2,2), 2 F(3,3)                    67%                   85.3 + L/3
F(3,3), F(2,2), F(3,3)              67%                   80.38 + L/3
F(3,3), F(3,3), F(3,3)              71%                   115 + 8L/27
F(2,2), F(2,2), F(2,2), F(2,2)      69%                   76.37 + 81L/256
FF(4,4)                             31%                   94 + 0.687L
FF(8,8)                             64%                   162.25 + 0.36L
FF(16,16)                           82%                   164.87 + 0.183L
FF(32,32)                           91%                   171.93 + 0.092L
FF(4,4), F(2,2)                     48%                   110.25 + 0.51L
FF(4,4), 2 F(2,2)                   61%                   138.75 + 0.386L
FF(4,4), 3 F(2,2)                   70%                   188.31 + 0.290L
FF(4,4), 4 F(2,2)                   78%                   264.09 + 0.217L
FF(8,8), F(2,2)                     73%                   167.625 + 0.269L
FF(8,8), 2 F(2,2)                   79%                   189.37 + 0.302L
FF(8,8), 3 F(2,2)                   85%                   237.53 + 0.151L
FF(8,8), 4 F(2,2)                   88%                   310.48 + 0.113L
FF(16,16), F(2,2)                   86%                   170.125 + 0.137L
FF(16,16), 2 F(2,2)                 89%                   195 + 0.103L
FF(16,16), 3 F(2,2)                 92%                   239.82 + 0.077L
FF(16,16), 4 F(2,2)                 94%                   331.39 + 0.058L

Table 7
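The cycle-count models of Table 7 lend themselves to an automatic selection step, in the spirit of the CGS choosing the algorithm that meets the application constraints. The following sketch is our own illustration (the constants are copied from a few representative rows of the table; the dictionary keys and helper names are ours, not the CGS internals): it picks the combination with the lowest predicted cycles per point for a given filter length L.

```python
# Cycles-per-point models for a few representative rows of Table 7.
COST_MODELS = {
    "direct":            lambda L: L + 9,
    "F(2,2)":            lambda L: 15.5 + 3 * L / 4,
    "F(3,3),F(3,3)":     lambda L: 53.11 + 4 * L / 9,
    "FF(8,8)":           lambda L: 162.25 + 0.36 * L,
    "FF(16,16),3F(2,2)": lambda L: 239.82 + 0.077 * L,
}

def cheapest(L):
    """Combination minimizing the predicted cycles per point for length L."""
    return min(COST_MODELS, key=lambda name: COST_MODELS[name](L))

print(cheapest(16))    # short filter: plain direct convolution wins
print(cheapest(1000))  # long filter: a deep FFT-based combination wins
```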


[Figure: four panels plotting machine cycles per point (100 to 1000) against FIR filter length (200 to 1000); three panels compare the direct convolution with FF(4,4), FF(8,8) and FF(16,16) combined with one to four nested F(2,2) stages, and the fourth compares it with the pure F(2,2), 2 F(2,2), 3 F(2,2) and 4 F(2,2) nestings.]

Fig. 19a Actual machine cycles per point versus the FIR filter length

[Figure: four panels plotting machine cycles per point against FIR filter length (ranges 30 to 80 and 100 to 300); the panels compare the direct convolution with the short-length decompositions F(2,2), F(3,3), 2 F(2,2) and F(2,2),F(3,3), and with the three-stage combinations F(2,2),F(3,3),F(2,2), F(3,3),2 F(2,2), 2 F(3,3),F(2,2) and F(2,2),2 F(3,3).]

Fig. 19b Actual machine cycles per point versus the FIR filter length


[Figure: four panels plotting machine cycles per point against FIR filter length (ranges 100 to 300 and 300 to 700); the panels compare the direct convolution with F(2,2), F(3,3), 2 F(2,2) and F(2,2),F(3,3), and with the deeper combinations 3 F(3,3), 4 F(2,2), 2 F(3,3), 3 F(2,2), 2 F(2,2),F(3,3), F(3,3),F(2,2),F(3,3).]

Fig. 19c Actual machine cycles per point versus the FIR filter length

REFERENCES

[1] Ramesh C.Agarwal, Charles S.Burrus, "Fast One-Dimensional Digital Convolution by Multidimentional

Techniques", IEEE Trans. on Acoust. Speech and Signal Processing, Vol 22, No. 1,pp. 1-10, February 1974.

[2] R.C.Agarwal, C.S.Burrus, "Number theoric transforms to implement fast digital convolution", IEEE Trans.

Acoust. Speech, Signal processing, Vol. 63, No. 4, April 1975.

[3] R.C.Agarwal, J.W.Cooley, "New algorithms for digital convolution", IEEE Trans. on Acoust. Speech and Signal

processing, Vol 25, No. 5, October 1977, pp. 392-410.

[4] C.S.Burrus, T.W.Parks, "DFT/FFT and convolution Algorithm", Wiley, New York, 1985.

[5] J.W.Cooley, J.W.Tukey, "An algorithm for the calculation of complex Fourier series", Math. of. Comput., Vol.

19, pp. 297-301, April 1965.

[6] H.Nussbaumer, "Nouveaux algorithmes de transformée de Fourier rapide", Traitement du signal, l'onde

électrique, Vol 59, No. 6-7, 1979 (in french).

[7] P.Duhamel, M.Vetterli, "Cyclic convolution of real sequences: Hartley versus Fourier and new schemes", IEEE

proc. ICASSP Tokyo, Japan, Apr. 8-11, 1986, pp. 6.5.1-6.5.4.

Page 29: EFFICIENT IMPLEMENTATION METHODOLOGY OF FAST FIR …mokraoui/FIR-REV.pdf · the DSP architecture used for implementing the filtering algorithms should have a significant impact on

29

[8] P.Duhamel, M.Vetterli, "Improved Fourier and Hartley Transform Algorithms: Application to cyclic convolution

of real data", IEEE Trans. Acous. Speech, Signal processing, Vol. 35, No. 6, June 1987, pp. 818-824.

[9] P.Duhamel, M.Vetterli, "Fast Fourier Transforms: A tutorial Review and a state of the art", IEEE Trans. Acous.

Speech, Signal processing Vol. 19, 1990, pp.259-299.

[10] M.Vetterli, "Running FIR and IIR Filtering using multirate filter bank", IEEE Trans. on ASSP, Vol 36, No. 5,

May 1988, pp.730-738.

[11] D.M.W.Evans, "An improved Digit Reversal Permutation Algorithm for Fast Fourier and Hartley Transforms",

IEEE Trans. Acoustics Speech and Signal processing, Vol ASSP-35, No. 8, IEEE CS press, Los Almitos, Calif.,

August 1987, pp. 1120-1125.

[12] R.E.Blahut, "Fast Algorithms for Signal processing", Addison-Wesley, Reading, MA, 1985.

[13] Winograd, "Arithmetic complexity of computation", CBMS-NSF Regional Conf. Series in Applied Mathematics,

SIAM publications, No. 33, 1980.

[14] H.K.Kwan, M.T.Tsim, "High speed 1-D FIR digital filtering architectures using polynomial convolution" proc

ICASSP Dallas 87, USA pp.1863-1866.

[15] P.Duhamel, Z.J.Mou, J.Benesty, "Une représentation unifiée du filtrage rapide fournissant tous les intermédiaires

entre traitements temporels et fréquentiels", douzième colloque Gretsi-Juan-les-pins France, Juin 1989, pp. 37-40

(in french).

[16] Z.J.Mou, P.Duhamel, "A unified approach to the Fast FIR Filtering algorithms", IEEE proc. ICASSP,

1988,.pp.1914-1917.

[17] Z.J.Mou, P.Duhamel, "Fast FIR Filtering: Algorithms and Implementations", Signal Processing, December 1987,

pp. 377-384.

[18] Z.J.Mou, P.Duhamel, "Short length FIR filters and their use in fast non recursive filtering", IEEE Trans. ASSP,

1989.

[19] Z.J.Mou, P.Duhamel, J.Benesty, "Fast complex FIR filtering algorithms with applications to real FIR and

complex LMS filters, proc. Eusipco 1990, pp. 549-552.

[20] R.Meyer, R.Reng and Schwarz, "Convolution algorithms on DSP processors", IEEE proc. ICASSP 1991, pp.

2193-2196.

[21] A.Zergaïnoh, P.Duhamel, J.P.Vidal, "Efficient implementation of composite length fast FIR filtering on the

"ADSP-2100", IEEE proc. ICASSP Adelaïde 94, Australia, pp. 461-464.

Page 30: EFFICIENT IMPLEMENTATION METHODOLOGY OF FAST FIR …mokraoui/FIR-REV.pdf · the DSP architecture used for implementing the filtering algorithms should have a significant impact on

30

[22] A.Zergaïnoh, P.Duhamel, J.P.Vidal, "Implantation efficace d'algorithmes de filtrage rapide RIF sur ADSP-2100",

Conférence Adéquation Algorithmes Architectures", Greco TDSI, AFCET, SEE, CNET, Grenoble Janvier 94,

pp. 85-92.

[23] A.Zergaïnoh, P.Duhamel, J.P.Vidal, "DSP implementaion of fast FIR filtering algorithm using short FFT's",

IEEE proc ISCAS Seattle 1995, pp. 219-222.

[24] A.Zergaïnoh, P.Duhamel, "Implementation and performance of fast FIR filtering algorithms on DSP", IEEE

VLSI Signal Processing VIII, October 16-18, 1995.

[25] P.Duhamel, "A split radix fast Fourier Transform", Ann. Télécomm, Vol. 40, No. 9-10, pp. 418-494, Septembre-

Octobre 1985.

[26] Henrik V.Sorensen, Douglas L.Jones, Michael T.Heideman, C.Sidney Burrus, "Real Valued Fast Fourier

Transform Algorithms", IEEE Trans. on Acoust. Speech and Signal processing, Vol 35, No. 6, pp. 849-863, June

1987.

[27] H.Nussbaumer, "Fast Fourier Transform and convolution algorithms", Spring-Verlag, 2nd edition 1982.

[28] Analog Devices, "Digital Signal processing Applications Using the ADSP-2100 Familly", Prentice-Hall,

Englewood Cliffs, NJ 07632.

[29] ADSP-21020 User's Manual, 1991 Analog Devices Incorporated.