A presentation report based on Uwe Meyer-Baese, "Digital Signal Processing with Field Programmable Gate Arrays".
FPGA SEMINAR REPORT UNIT-4
CONTENTS
S NO TITLE
1 BINARY ADDERS
2 BINARY MULTIPLIER
3 BINARY DIVIDER
4 FIR FILTERS
5 IIR FILTERS
6 DECIMATION
7 INTERPOLATION
8 MULTISTAGE DECIMATION
9 POLYPHASE DECIMATION
10 FILTER BANKS
11 DIT-FFT ALGORITHM
12 DIF-FFT ALGORITHM
13 ERROR CONTROL CODING
14 CRYPTOGRAPHIC ALGORITHM
15 LMS ALGORITHM
16 DIGITAL UP CONVERTER
17 DIGITAL DOWN CONVERTER
BINARY ADDERS
Addition is the most commonly performed arithmetic operation in digital systems. An adder
is a combinational circuit that combines two arithmetic operands using the rules of addition,
and it is a basic building block of any DSP system. An adder can also perform subtraction if
the subtrahend is supplied in 2's complement form. The following are the various types of adders:
Half Adders
Full Adders
Binary (Multi Bit) Adders
o Ripple Adders
o Carry Look Ahead Adders
o Pipeline Adders
o Modulo Adders
A basic binary N-bit adder/subtractor consists of N full-adders (FA). A full-adder
implements the following Boolean equations.
The sum is defined by:
s_k = x_k XOR y_k XOR c_k
The carry (out) bit is computed with:
c_{k+1} = (x_k AND y_k) OR (x_k AND c_k) OR (y_k AND c_k) = x_k y_k + x_k c_k + y_k c_k
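These two equations can be checked with a small software model. The following Python sketch (function names are ours, purely illustrative) chains full adders bit by bit into an N-bit ripple adder:

```python
def full_adder(x, y, c):
    """One full adder: returns (sum bit, carry-out) per the Boolean equations."""
    s = x ^ y ^ c                        # s_k = x_k XOR y_k XOR c_k
    c_out = (x & y) | (x & c) | (y & c)  # c_{k+1} = x_k y_k + x_k c_k + y_k c_k
    return s, c_out

def ripple_add(x, y, n=8):
    """Add two n-bit numbers bit by bit, rippling the carry through all stages."""
    c = 0
    result = 0
    for k in range(n):
        s, c = full_adder((x >> k) & 1, (y >> k) & 1, c)
        result |= s << k
    return result, c  # n-bit sum and the final carry-out
```

The final carry-out is exactly the signal that limits the speed of a plain ripple adder, which motivates the pipelined structures below.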
PIPELINED ADDERS:
Pipelining is extensively used in DSP solutions due to the intrinsic dataflow regularity of
DSP algorithms. Programmable digital signal processor MACs typically carry at least four
pipelined stages. The processor concurrently:
1) decodes the command,
2) loads the operands into registers,
3) performs the multiplication and stores the product, and
4) accumulates the products.
The pipelining principle can be applied to FPGA designs as well, at little or no additional
cost since each logic element contains a flip-flop, which is otherwise unused, to save routing
resources. With pipelining it is possible to break an arithmetic operation into small primitive
operations, save the carry and the intermediate values in registers, and continue the
calculation in the next clock cycle. Such adders are sometimes called carry save adders
(CSAs) in the literature.
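The carry-save idea can be illustrated in software before looking at the VHDL. A minimal Python sketch (names illustrative; the widths mirror the generics of the pipeline_add entity below) splits the operands into LSB and MSB blocks, saves the LSB carry, and folds it into the MSB sum in the next "stage":

```python
def pipelined_add(x, y, width1=7):
    """Two-stage split addition, modeled combinationally for clarity.

    Stage 1: add the width1 LSBs of x and y, and the MSB parts, separately;
             the LSB carry is 'saved' (in hardware: held in a register).
    Stage 2: add the saved carry into the MSB partial sum.
    """
    mask = (1 << width1) - 1
    r1 = (x & mask) + (y & mask)        # LSB add; carry lands in bit width1
    r2 = (x >> width1) + (y >> width1)  # MSB add; carry from LSBs deferred
    s1 = r1 & mask                      # registered LSB sum
    s2 = r2 + (r1 >> width1)            # stage 2: add the saved carry
    return (s2 << width1) | s1
```

In hardware each stage fits into one clock cycle, so the critical path never spans the full word length.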
The block diagram of pipeline adder is shown in Figure 1.
Figure 1: Block Schematic of Pipeline adder
VHDL CODE FOR PIPELINE ADDER:
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
USE ieee.std_logic_unsigned.ALL;
ENTITY pipeline_add IS
GENERIC (WIDTH : INTEGER := 15;
WIDTH1 : INTEGER := 7; WIDTH2 : INTEGER := 8);
PORT (x,y : IN STD_LOGIC_VECTOR(WIDTH-1 DOWNTO 0);
sum : OUT STD_LOGIC_VECTOR(WIDTH-1 DOWNTO 0);
LSBs_Carry : OUT STD_LOGIC;
clk : IN STD_LOGIC);
END pipeline_add;
ARCHITECTURE struct OF pipeline_add IS
SIGNAL l1, l2, s1 : STD_LOGIC_VECTOR(WIDTH1-1 DOWNTO 0);
SIGNAL r1 : STD_LOGIC_VECTOR(WIDTH1 DOWNTO 0);
SIGNAL l3, l4, r2, s2 : STD_LOGIC_VECTOR(WIDTH2-1 DOWNTO 0);
BEGIN
PROCESS
BEGIN
5
WAIT UNTIL clk = '1';
-- First stage of the adder
l1 <= x(WIDTH1-1 DOWNTO 0);
l2 <= y(WIDTH1-1 DOWNTO 0);
r1 <= ('0' & l1) + ('0' & l2); -- Add LSBs of x and y
l3 <= x(WIDTH-1 DOWNTO WIDTH1);
l4 <= y(WIDTH-1 DOWNTO WIDTH1);
r2 <= l3 + l4; -- Add MSBs of x and y
-- Second stage of the adder
s1 <= r1(WIDTH1-1 DOWNTO 0);
s2 <= r1(WIDTH1) + r2; -- Add the saved LSB carry to the MSB sum
END PROCESS;
LSBs_Carry <= r1(WIDTH1);
sum <= s2 & s1; -- Build a single output word
END struct;
MODULO ADDERS
Modulo adders are the most important building blocks in RNS-DSP designs. They are used
for both additions and, via index arithmetic, for multiplications.
The block diagram of modulo adder is shown in Figure 2.
Figure 2: Block Schematic of Modulo adder
VERILOG CODE FOR MODULO-256 ADDER:
module mod_add (input [7:0] x, input [7:0] y, output [8:0] Sum);
parameter m = 256; // modulus
wire [8:0] x1, x2;
assign x1 = x + y; // 9-bit raw sum, range 0..510
assign x2 = x1 - m; // candidate result after modulo reduction
assign Sum = (x1 >= m) ? x2 : x1; // conditional subtraction of the modulus
endmodule
MODULO-256 ADDER SIMULATION RESULTS:
SUMMARY OF BINARY ADDERS:
Ripple Carry Adders: add bit by bit; the longest delay comes from the ripple of the carry through all stages. Carry-skip, carry-lookahead, conditional-sum, or carry-select adder techniques are employed to reduce this delay.
Adders implemented using modern FPGAs/CPLDs possess very fast dedicated ripple-carry logic, an order of magnitude faster than a carry computed through the general-purpose logic.
In pipelined adders, the number of pieces into which the addition is broken depends on the number of logic elements (LEs) and flip-flops (FFs) in each LAB of the FPGA/CPLD.
For example, in Altera's Cyclone II devices, with an LAB of 16 LEs and 16 FFs per pipeline element, a reasonable choice is a maximum block size of 15 bits. The feasible breakup is shown below:
With one additional pipeline stage we can build adders up to a length of 15 + 16 = 31 bits.
With two pipeline stages we can build adders up to 15 + 15 + 16 = 46 bits.
With three pipeline stages we can build adders up to 15 + 15 + 15 + 16 = 61 bits.
Although the number of flip-flops in one LAB is 16, we need an extra flip-flop for the carry-out, so only the block holding the MSBs can be 16 bits wide.
BINARY MULTIPLIER
Since we always multiply by either 0 or 1, the partial products are always either 0000 or
the multiplicand (1101 in this example).
There are four partial products, which are added to form the result.
We can add them in pairs, using three adders.
Even though the product has up to 8 bits, we can use 4-bit adders if we stagger them
leftwards, like the partial products themselves.
If the multiplicand is k bits and the multiplier is j bits, then
o k*j AND gates are required.
o (j-1) k-bit adders are required.
Example: to multiply 1101 by 111 we require
o 4*3 = 12 AND gates
o (3-1) = 2 four-bit adders.
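The partial-product scheme can be modeled in a few lines of Python (function name illustrative): each multiplier bit ANDs the multiplicand, the result is staggered left, and the staggered partial products are summed:

```python
def binary_multiply(multiplicand, multiplier):
    """Shift-and-add multiplication: partial products are 0 or the multiplicand."""
    product = 0
    j = 0
    while multiplier >> j:
        if (multiplier >> j) & 1:          # partial product is the multiplicand...
            product += multiplicand << j   # ...staggered j places to the left
        # otherwise the partial product is 0 and contributes nothing
        j += 1
    return product
```

For the example above, 1101 x 111 produces three staggered partial products summing to 91.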
A 2*2 BINARY MULTIPLIER
The AND gates produce the partial products.
For a 2-bit by 2-bit multiplier, we can just use two half adders to sum the partial products. In general, though, we'll need full adders.
Here C3-C0 are the product, not carries!
A 4*4 MULTIPLIER CIRCUIT
Here the multiplier and multiplicand are both 4 bits wide.
So to implement the multiplier we need 4*4 = 16 AND gates and three 4-bit adders.
The first adder of each 4-bit adder stage can be a half adder, because cin is always zero
for that position.
VERILOG CODE
module HA(sout, cout, a, b);
output sout, cout;
input a, b;
assign sout = a ^ b;
assign cout = (a & b);
endmodule
module FA(sout, cout, a, b, cin);
output sout, cout;
input a, b, cin;
assign sout = (a ^ b ^ cin);
assign cout = ((a & b) | (a & cin) | (b & cin));
endmodule
module multiply4bits(product,a,b);
output [7:0]product;
input [3:0]a;
input [3:0]b;
assign product[0]=(a[0]&b[0]);
wire x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17;
HA HA1(product[1],x1,(a[0]&b[1]),(a[1]&b[0]));
FA FA1(x2,x3,(a[1]&b[1]),(a[2]&b[0]),x1);
FA FA2(x4,x5,(a[2]&b[1]),(a[3]&b[0]),x3);
HA HA2(x6,x7,(a[3]&b[1]),x5);
HA HA3(product[2],x8,x2,(a[0]&b[2]));
FA FA5(x9,x10,x4,(a[1]&b[2]),x8);
FA FA4(x11,x12,x6,(a[2]&b[2]),x10);
FA FA3(x13,x14,x7,(a[3]&b[2]),x12);
HA HA4(product[3],x15,x9,(a[0]&b[3]));
FA FA8(product[4],x16,x11,(a[1]&b[3]),x15);
FA FA7(product[5],x17,x13,(a[2]&b[3]),x16);
FA FA6(product[6],product[7],x14,(a[3]&b[3]),x17);
endmodule
SIMULATION RESULT
DIVIDERS
Of all four basic arithmetic operations division is the most complex. Consequently, it is
the most time-consuming operation and also the operation with the largest number of different
algorithms to be implemented. For a given dividend (or numerator) N and divisor (or
denominator) D the division produces (unlike the other basic arithmetic operations) two results:
the quotient Q and the remainder R, i.e.,
N/D = Q and R with |R| < D.
However, we may think of division as the inverse process of multiplication, as
demonstrated through the following equation,
N = D Q + R,
It differs from multiplication in many aspects. Most importantly, in multiplication all
partial products can be produced in parallel, while in division each quotient bit is determined
in a sequential trial-and-error procedure.
For example, 234/50 can give Q=5 with R=-16, or Q=4 with R=34. We prefer Q=4 and R=34,
i.e., the quotient that leaves a nonnegative remainder.
RESTORING DIVISION:
We align first the denominator and load the numerator in the remainder register. We
then subtract the aligned denominator from the remainder and store the result in the remainder
register. If the new remainder is positive we set the quotient's LSB to 1; otherwise the
quotient's LSB is set to zero and we must restore the previous remainder value by adding the
denominator back. Finally, we have to realign the quotient and denominator for the next step. This
recalculation of the previous remainder is why such an algorithm is called restoring division.
The main disadvantage of restoring division is that we need two steps to determine one
quotient bit. We can combine the two steps using a nonperforming divider algorithm, i.e., each
time the denominator is larger than the remainder, we do not perform the subtraction. The
number of steps is then reduced by a factor of 2.
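The restoring procedure is easy to model in software. A minimal Python sketch (function name illustrative; operands assumed nonnegative with the quotient fitting in `bits` bits):

```python
def restoring_divide(n, d, bits=8):
    """Restoring division: trial-subtract the denominator each step; if the
    remainder goes negative, restore it by adding d back and emit a 0 bit."""
    q, r = 0, 0
    for k in range(bits - 1, -1, -1):
        r = (r << 1) | ((n >> k) & 1)  # bring down the next numerator bit
        r -= d                          # trial subtraction
        if r < 0:
            r += d                      # restore the previous remainder
            q = (q << 1) | 0
        else:
            q = (q << 1) | 1
    return q, r
```

Note the two operations (subtract, then possibly add back) per quotient bit, which is exactly the overhead the nonperforming and nonrestoring variants remove.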
NONRESTORING DIVISION:
The idea behind nonrestoring division is that if restoring division has computed a negative
remainder, i.e., r_{k+1} = r_k - d_k, then in the next step we would restore r_k by adding d_k
and then subtract the next aligned denominator d_{k+1} = d_k/2. So, instead of adding d_k
followed by subtracting d_k/2, we can just skip the restoring step and proceed with adding
d_k/2 whenever the remainder has (temporarily) a negative value. As a result, the quotient
digits can now be positive or negative, i.e., q_k = +1 or -1, but not zero. We can change this
signed-digit representation later to a two's complement representation. In conclusion, every
time the remainder after an iteration is positive we store a 1 and subtract the aligned
denominator, while for a negative remainder we store a -1 in the quotient register and add the
aligned denominator.
Both quotient and remainder are now in two's complement representation and form a valid
result. If we wish to constrain our results so that both have the same sign, we need to
correct a negative remainder, i.e., for r < 0 we apply
r := r + D and q := q - 1.
Such a nonrestoring divider will now run faster than the nonperforming divider, with
about the same Registered Performance as the restoring divider.
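The signed-digit recursion and the final correction step can be sketched in Python (names illustrative; operands assumed nonnegative with n < d·2^bits):

```python
def nonrestoring_divide(n, d, bits=8):
    """Nonrestoring division: never restore; when the remainder is negative,
    add the aligned denominator instead of subtracting, recording digit -1."""
    r = n
    digits = []
    for k in range(bits - 1, -1, -1):
        dk = d << k                      # aligned denominator
        if r >= 0:
            r -= dk
            digits.append(1)             # remainder positive: digit +1
        else:
            r += dk
            digits.append(-1)            # remainder negative: digit -1
    q = sum(dig << k for dig, k in zip(digits, range(bits - 1, -1, -1)))
    if r < 0:                            # final correction: r := r + D, q := q - 1
        r += d
        q -= 1
    return q, r
```

Each step performs exactly one add or subtract, so the two-operations-per-bit penalty of the restoring scheme disappears.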
FAST DIVIDER DESIGN:
The first fast divider algorithm we wish to discuss is the division through
multiplication with the reciprocal of the denominator D. The reciprocal can, for instance, be
computed via a look-up table for small bit width. The general technique for constructing
iterative algorithms, however, makes use of the Newton method for finding a zero.
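For f(x) = 1/x - D, Newton's method gives the well-known iteration x_{k+1} = x_k (2 - D x_k), which converges quadratically to 1/D. A minimal Python sketch (names illustrative; the denominator is normalized into [0.5, 1) so that x0 = 1 is a valid seed):

```python
import math

def reciprocal(d, steps=6):
    """Newton iteration x <- x*(2 - d*x) for 1/d; assumes d in [0.5, 1)."""
    x = 1.0
    for _ in range(steps):
        x = x * (2.0 - d * x)   # error in (1 - d*x) squares every step
    return x

def divide(n, d):
    """N/D = N * (1/D), with D normalized as d = m * 2**e, m in [0.5, 1)."""
    e = math.frexp(d)[1]
    m = d / (2.0 ** e)
    return n * reciprocal(m) / (2.0 ** e)
```

In hardware the first few bits of 1/D typically come from a small look-up table, and one or two Newton steps refine the result; the float model above just illustrates the convergence.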
ARRAY DIVIDER:
Obviously, as with multipliers, all division algorithms can be implemented in a
sequential, FSM-like, way or in the array form. If the array form and pipelining is desired, a
good option will then be to use the lpm_divide block, which implements an array divider with
the option of pipelining, for a detailed description of the lpm_divide block.
CODE:
module divya2(q,out,a,b);
input [7:0]a;//dividend
input [3:0]b;//divisor
output [3:0]out; // remainder
output [4:0]q;//quotient
wire [3:0]r1,r2,r3,r4;
stage s1(q[4],r1[3:0],{1'b1},a[7:4],b[3:0]);
stage s2(q[3],r2[3:0],q[4],{r1[2:0],a[3]},b[3:0]);
stage s3(q[2],r3[3:0],q[3],{r2[2:0],a[2]},b[3:0]);
stage s4(q[1],r4[3:0],q[2],{r3[2:0],a[1]},b[3:0]);
stage s5(q[0],out[3:0],q[1],{r4[2:0],a[0]},b[3:0]);
endmodule
module stage(q,out,t,a,b); // submodule
input [3:0]a;
input [3:0]b;
input t;
output [3:0]out;
output q;
wire [3:0]c;
cas ca1(out[0],c[0],t,b[0],a[0],t);
cas ca2(out[1],c[1],t,b[1],a[1],c[0]);
cas ca3(out[2],c[2],t,b[2],a[2],c[1]);
cas ca4(out[3],c[3],t,b[3],a[3],c[2]);
not n1(q,out[3]);
endmodule
module cas(out,cout,t,divisor,rin,cin);
input t,divisor,rin,cin;
output cout,out;
wire x;
xor x1(x,t,divisor);
fadd f1(out,cout,x,rin,cin);
endmodule
module fadd(s,cout,a,b,c); //full adder submodule
input a,b,c;
output s,cout;
wire w1,w2,w3;
and a1(w1,a,b);
and a2(w2,b,c);
and a3(w3,c,a);
xor x1(s,a,b,c);
or o1(cout,w1,w2,w3);
endmodule
OUTPUT:
FIR FILTERS
An FIR with constant coefficients is an LTI digital filter. The output of an FIR of order or
length L, to an input time-series x[n], is given by a finite version of the convolution sum,
namely:
y[n] = Σ_{k=0}^{L-1} f[k] x[n-k]
where f[0] ≠ 0 through f[L-1] ≠ 0 are the filter's L coefficients. They also correspond to the
filter's impulse response.
For LTI systems it is sometimes more convenient to express this in the z-domain with
Y(z) = F(z) X(z)
where F(z) is the FIR's transfer function, defined in the z-domain by
F(z) = Σ_{k=0}^{L-1} f[k] z^{-k}
The Lth-order LTI FIR filter is graphically interpreted in Fig. 1. It can be seen to consist of
a tapped delay line, adders, and multipliers. One of the operands presented to each multiplier
is an FIR coefficient, often referred to as a tap weight for obvious reasons.
Fig 1: Direct Form FIR filter
The roots of the polynomial F(z) define the zeros of the filter. The presence of only zeros is
the reason FIRs are sometimes called all-zero filters.
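The direct-form structure of Fig. 1 can be modeled in a few lines of Python (function name illustrative): a tapped delay line is shifted once per sample and the weighted taps are summed:

```python
def fir_direct(f, x):
    """Direct-form FIR: y[n] = sum_{k=0}^{L-1} f[k]*x[n-k] via a tapped delay line."""
    L = len(f)
    taps = [0] * L
    y = []
    for sample in x:
        taps = [sample] + taps[:-1]  # shift the delay line, newest sample first
        y.append(sum(fk * tk for fk, tk in zip(f, taps)))
    return y
```

Feeding an impulse reproduces the coefficient list, confirming that the coefficients are the impulse response.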
FIR FILTER WITH TRANSPOSED STRUCTURE
A variation of the direct FIR model is called the transposed FIR filter. It can be constructed
from the FIR filter in Fig. 1 by:
Exchanging the input and output
Inverting the direction of signal flow
Substituting an adder by a fork, and vice versa
A transposed FIR filter is shown in Fig. 2 and is, in general, the preferred implementation of an
FIR filter. The benefit of this structure is that we do not need an extra shift register for
x[n], and there is no need for an extra pipeline stage after the adder (tree) of the products
to achieve high throughput.
Fig 2: Filter with Transposed Structure
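The transposed structure can likewise be sketched in Python (names illustrative): each register now holds a partial sum of products rather than a delayed input, so the input fans out to all multipliers and no adder tree is needed:

```python
def fir_transposed(f, x):
    """Transposed FIR: registers carry partial sums; no input shift register."""
    L = len(f)
    state = [0] * (L - 1)  # partial-sum registers
    y = []
    for sample in x:
        out = f[0] * sample + state[0]
        for i in range(L - 2):
            # each register absorbs a new product plus the next partial sum
            state[i] = f[i + 1] * sample + state[i + 1]
        state[L - 2] = f[L - 1] * sample
        y.append(out)
    return y
```

Its impulse response matches the direct form, only the internal dataflow differs.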
SYMMETRY IN FIR FILTERS
The center of an FIR's impulse response is an important point of symmetry. It is sometimes
convenient to define this point as the 0th sample instant. Such filter descriptions are
a-causal (centered notation). For an odd-length FIR, the a-causal filter model is given by:
F(z) = Σ_{k=-(L-1)/2}^{(L-1)/2} f[k] z^{-k}
The FIR's frequency response can be computed by evaluating the filter's transfer function
around the periphery of the unit circle, by setting z = e^{jωT}. It then follows that:
F(e^{jωT}) = |F(ω)| e^{jΦ(ω)}
We denote by |F(ω)| the filter's magnitude frequency response, while Φ(ω) denotes the phase
response. Digital filters are more often characterized by phase and magnitude than by the
z-domain transfer function or the complex frequency transform.
Table 1: Four possible linear-phase FIR filters
LINEAR-PHASE FIR FILTERS
Maintaining phase integrity across a range of frequencies is a desired system attribute in
many applications, such as communications and image processing. As a result, designing filters
that exhibit linear phase versus frequency is often mandatory. The standard measure of the
phase linearity of a system is the group delay, defined by:
τ_g(ω) = -dΦ(ω)/dω
A perfectly linear-phase filter has a group delay that is constant over a range of frequencies. It
can be shown that linear-phase is achieved if the filter is symmetric or antisymmetric. A
constant group delay can only be achieved if the frequency response F(ω) is a purely real or
imaginary function. This implies that the filter's impulse response possesses even or odd
symmetry. That is:
f[n] = f[-n] or f[n] = -f[-n]
An odd-length, even-symmetry FIR filter would, for example, have a frequency response given by:
F(ω) = f[0] + 2 Σ_{k=1}^{(L-1)/2} f[k] cos(kω)
which is seen to be a purely real function of frequency. Table 1 summarizes the four possible
choices of symmetry, antisymmetry, even order and odd order. In addition, Table 1 graphically
displays an example of each class of linear-phase FIR.
Fig. 3: Linear-phase filter with reduced number of multipliers
The symmetry properties intrinsic to a linear-phase FIR can also be used to reduce the necessary
number of multipliers L, as shown in Fig. 1. Consider the linear-phase FIR shown in Fig. 3
(even symmetry assumed), which fully exploits coefficient symmetry. Observe that the
symmetric architecture has a multiplier budget per filter cycle exactly half of that found in
the direct architecture shown in Fig. 1 (L/2 versus L multipliers), while the number of adders
remains constant at L-1.
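The folded structure of Fig. 3 can be sketched in Python (names illustrative; even symmetry and odd length assumed): mirrored taps are pre-added before being scaled, so each coefficient is multiplied only once:

```python
def fir_symmetric(f_half, x):
    """Folded even-symmetric FIR of length L = 2*len(f_half) - 1.

    f_half holds f[0] .. f[(L-1)//2]; mirrored taps share one multiplier."""
    L = 2 * len(f_half) - 1
    taps = [0] * L
    y = []
    for sample in x:
        taps = [sample] + taps[:-1]
        acc = f_half[-1] * taps[len(f_half) - 1]            # center tap
        for k in range(len(f_half) - 1):
            acc += f_half[k] * (taps[k] + taps[L - 1 - k])  # pre-add mirror pair
        y.append(acc)
    return y
```

An impulse recovers the full symmetric coefficient set even though only half of it is stored.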
DESIGNING FIR FILTERS
There are two methods for FIR Filter design:
Direct Window Design Method
Equiripple Design Method
DIRECT WINDOW DESIGN METHOD
The discrete Fourier transform (DFT) establishes a direct connection between the frequency and
time domains. Since the frequency domain is the domain of filter definition, the DFT can be
used to calculate a set of FIR filter coefficients that produce a filter that approximates the
frequency response of the target filter. A filter designed in this manner is called a direct FIR
filter. The coefficients of a direct FIR filter are obtained as the inverse DFT of the sampled target frequency response.
Consider a length-16 direct FIR filter design with a rectangular window, shown in Fig. 4a, with
the passband ripple shown in Fig. 4b. Note that the filter provides a reasonable approximation to
the ideal lowpass filter with the greatest mismatch occurring at the edges of the transition band.
The observed ringing is due to the Gibbs phenomenon, which relates to the inability of a
finite Fourier spectrum to reproduce sharp edges. The Gibbs ringing is implicit in the direct
inverse DFT method and can be expected to be about 7% over a wide range of filter orders. To
illustrate this, consider the example filter with length 128, shown in Fig. 4c, with the
passband ripple shown in Fig. 4d. Although the filter length is substantially increased (from
16 to 128), the ringing at the edge still has about the same magnitude. The effects of ringing
can only be suppressed with the use of a data window that tapers smoothly to zero on both
sides. Data windows overlay the FIR's impulse response, resulting in a smoother magnitude
frequency response with an attendant widening of the transition band. If, for instance, a
Kaiser window is applied to the FIR, the Gibbs ringing can be reduced.
Fig. 4: Gibbs phenomenon.(a)Impulse response of FIR lowpass with L=16. (b) Passband of
transfer function L=16. (c)Impulse response of FIR lowpass with L= 128. (d) Passband of
transfer function L= 128.
The most common windows, denoted w[n], are the rectangular, Bartlett (triangular), Hann, Hamming, Blackman, and Kaiser windows.
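The window design procedure can be sketched in Python (names illustrative; a Hamming window and a lowpass with cutoff wc radians/sample are assumed): sample the ideal sin(x)/x response around the center tap and taper it with the window:

```python
import math

def window_design_lowpass(L, wc):
    """Windowed inverse-DFT lowpass design: ideal sinc taps times a Hamming window."""
    mid = (L - 1) / 2
    f = []
    for n in range(L):
        t = n - mid
        # ideal lowpass impulse response sin(wc*t)/(pi*t), with the t = 0 limit
        ideal = wc / math.pi if t == 0 else math.sin(wc * t) / (math.pi * t)
        w = 0.54 - 0.46 * math.cos(2 * math.pi * n / (L - 1))  # Hamming window
        f.append(ideal * w)
    return f
```

The resulting coefficients are symmetric (linear phase) and the DC gain stays close to one, while the window suppresses the Gibbs ringing at the band edge.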
EQUIRIPPLE DESIGN METHOD
A typical filter specification not only includes the passband ωp and stopband ωs edge
frequencies and the ideal gains, but also the allowed deviation (or ripple) from the desired
transfer function. The transition band is most often assumed to be arbitrary in terms of
ripple. A special
class of FIR filter that is particularly effective in meeting such specifications is called the
equiripple FIR. An equiripple design protocol minimizes the maximal deviations (ripple error)
from the ideal transfer function. The equiripple algorithm applies to a number of FIR design
instances. The most popular are:
Lowpass filter design
Hilbert filter, i.e., a unit-magnitude filter that produces a 90° phase shift for all
frequencies in the passband
Differentiator filter that has a linearly increasing frequency magnitude proportional to ω
The equiripple or minimum-maximum algorithm is normally implemented using the Parks-McClellan
iterative method, which produces an equiripple or minimax data fit in the frequency domain.
The length of the polynomial, and therefore the filter, can be estimated for a lowpass with
L ≈ (-10 log10(δp δs) - 13) / (14.6 (ωs - ωp)/(2π)) + 1
where δp is the passband and δs the stopband ripple.
CONSTANT COEFFICIENT FIR DESIGN
The method used for implementing FIR filters in FPGAs is the Constant Coefficient FIR Design
method. The different ways to implement this method are:
Direct design
Transposed form design
Design using Distributed Arithmetic (DA) architecture
DIRECT FIR DESIGN
The direct FIR filter shown in Fig. 1 can be implemented in VHDL using (sequential)
PROCESS statements or by component instantiations of the adders and multipliers. A
PROCESS design provides more freedom to the synthesizer, while component instantiation
gives full control to the designer. To illustrate this, a length-4 FIR will be presented as a
PROCESS design. Although a length-4 FIR is far too short for most practical applications, it is
easily extended to higher orders and has the advantage of a short compiling time. The
linear-phase (therefore symmetric) FIR's impulse response is assumed to be
f[k] = {-1, 3.75, 3.75, -1}.
FOUR-TAP DIRECT FIR FILTER VHDL CODE:
PACKAGE eight_bit_int IS -- User-defined types
SUBTYPE BYTE IS INTEGER RANGE -128 TO 127;
TYPE ARRAY_BYTE IS ARRAY (0 TO 3) OF BYTE;
END eight_bit_int;
LIBRARY work;
USE work.eight_bit_int.ALL;
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
ENTITY fir_srg IS ------> Interface
PORT (clk : IN STD_LOGIC;
x : IN BYTE;
y : OUT BYTE);
END fir_srg;
ARCHITECTURE flex OF fir_srg IS
SIGNAL tap : ARRAY_BYTE := (0,0,0,0);
-- Tapped delay line of bytes
BEGIN
p1: PROCESS ------> Behavioral style
BEGIN
WAIT UNTIL clk = '1';
-- Compute output y with the filter coefficients weight.
-- The coefficients are [-1 3.75 3.75 -1]
y <= 2 * tap(1) + tap(1) + tap(1) / 2 + tap(1) / 4 -- 3.75 * tap(1)
+ 2 * tap(2) + tap(2) + tap(2) / 2 + tap(2) / 4 -- 3.75 * tap(2)
- tap(3) - tap(0);
FOR I IN 3 DOWNTO 1 LOOP
tap(I) <= tap(I-1); -- Tapped delay line: shift one
END LOOP;
tap(0) <= x; -- Input in register 0
END PROCESS;
END flex;
IIR FILTERS
A nonrecursive filter incorporates, as the name implies, no feedback. The impulse response of
such a filter is finite, i.e., it is an FIR filter. A recursive filter, on the other hand has feedback,
and is expected, in general, to have an infinite impulse response, i.e., to be an IIR filter. Figure
4.4a shows filters with separate recursive and nonrecursive parts. A canonical filter is produced
if these recursive and nonrecursive parts are merged together, as shown in Fig. 4.4b. The
transfer function of the filter from Fig. 4.4 can be written as:
F(z) = Y(z)/X(z) = ( Σ_{l=0}^{L-1} b[l] z^{-l} ) / ( 1 - Σ_{l=1}^{L-1} a[l] z^{-l} )
The difference equation for such a system yields:
y[n] = Σ_{l=0}^{L-1} b[l] x[n-l] + Σ_{l=1}^{L-1} a[l] y[n-l]
Comparing this with the difference equation for the FIR filter, we find that the difference
equation for recursive systems depends not only on the L previous values of the input sequence
x[n], but also on the L-1 previous values of y[n].
If we compute the poles and zeros of F(z), we see that the nonrecursive part, i.e., the
numerator of F(z), produces the zeros, while the denominator of F(z) produces the poles.
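The recursive difference equation can be modeled directly in Python (names and the coefficient convention a = [0, a1, a2, ...] are ours, purely illustrative):

```python
def iir_filter(b, a, x):
    """y[n] = sum_l b[l]*x[n-l] + sum_{l>=1} a[l]*y[n-l]: feedback makes it IIR."""
    y = []
    for n in range(len(x)):
        acc = sum(b[l] * x[n - l] for l in range(len(b)) if n - l >= 0)
        acc += sum(a[l] * y[n - l] for l in range(1, len(a)) if n - l >= 0)
        y.append(acc)
    return y
```

With a single feedback coefficient the impulse response decays geometrically and never reaches zero, which is exactly the "infinite impulse response" property.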
FAST IIR FILTER
FIR filter Registered Performance was improved using pipelining. In the case of FIR filters,
pipelining can be achieved at essentially no cost. Pipelining IIR filters, however, is more
sophisticated and is certainly not free. Simply introducing pipeline registers for all adders will,
especially in the feedback path, very likely change the pole locations and therefore the transfer
function of the IIR filter.
The methods that improve IIR filter throughput are:
Look-ahead interleaving in the time domain
Parallel processing
These methods are based on filter architecture or signal flow techniques. These techniques will
be demonstrated with examples. To simplify the VHDL representation of each case, only a first-
order IIR filter will be considered, but the same ideas can be applied to higher-order IIR filters.
TIME-DOMAIN INTERLEAVING
Consider the difference equation of a first-order IIR system, namely
y[n+1] = a y[n] + b x[n].
The output of the first-order system at n+2 can be computed using a look-ahead methodology by
substituting the expression for y[n+1] into the difference equation for y[n+2]. That is
y[n+2] = a^2 y[n] + a b x[n] + b x[n+1].
The equivalent system is shown in Fig. 4.14. This concept can be generalized by applying the
look-ahead transform for (S-1) steps, resulting in:
y[n+S] = a^S y[n] + b Σ_{k=0}^{S-1} a^k x[n+S-1-k]
It can be seen that the sum term defines an FIR filter having coefficients {b, ab, a^2 b, ...,
a^(S-1) b} that can be pipelined using the usual FIR pipelining techniques.
The recursive part of (4.12) can now also be implemented with an S-stage pipelined multiplier
for the coefficient a^S.
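The equivalence of the original recursion and its one-step look-ahead form is easy to verify numerically; a minimal Python sketch (function names illustrative):

```python
def iir_direct(a, b, x):
    """Original first-order recursion y[n] = a*y[n-1] + b*x[n-1], y[0] = 0."""
    y = [0.0]
    for n in range(1, len(x)):
        y.append(a * y[n - 1] + b * x[n - 1])
    return y

def iir_lookahead(a, b, x):
    """Look-ahead form y[n] = a^2*y[n-2] + a*b*x[n-2] + b*x[n-1].

    The feedback loop now spans two delays, leaving room for a pipeline stage."""
    y = [0.0, b * x[0]]
    for n in range(2, len(x)):
        y.append(a * a * y[n - 2] + a * b * x[n - 2] + b * x[n - 1])
    return y
```

Both produce identical output sequences; only the critical feedback path changes.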
The VHDL code shown below implements the IIR filter in look-ahead form.
PACKAGE n_bit_int IS -- User-defined type
SUBTYPE BITS15 IS INTEGER RANGE -2**14 TO 2**14-1;
END n_bit_int;
LIBRARY work;
USE work.n_bit_int.ALL;
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
ENTITY iir_pipe IS
PORT ( x_in : IN BITS15; -- Input
y_out : OUT BITS15; -- Result
clk : IN STD_LOGIC);
END iir_pipe;
ARCHITECTURE fpga OF iir_pipe IS
SIGNAL x, x3, sx, y, y9 : BITS15 := 0;
BEGIN
PROCESS -- Use FFs for input, output and pipeline stages
BEGIN
WAIT UNTIL clk = '1';
x <= x_in;
x3 <= x / 2 + x / 4; -- Compute a*x = 3/4 * x
sx <= x + x3; -- Compute x[n+1] + a*x[n]
y9 <= y / 2 + y / 16; -- Compute a^2*y = 9/16 * y
y <= sx + y9; -- Compute the output
END PROCESS;
y_out <= y; -- Connect register y to the output pins
END fpga;
PARALLEL PROCESSING
In a parallel-processing filter implementation [100], P parallel IIR paths are formed, each
running at a 1/P input sampling rate. They are combined at the output using a multiplexer, as
shown in Fig. 4.18. Because a multiplexer, in general, will be faster than a multiplier and/or
adder, the parallel approach will be faster. Furthermore, each path P has a factor of P more time
to compute its assigned output.
To illustrate, consider again the first-order system y[n+1] = a y[n] + b x[n] and P = 2. The
look-ahead scheme, as in (4.11),
y[n+2] = a^2 y[n] + a b x[n] + b x[n+1],
is now split into even (n = 2k) and odd (n = 2k+1) output sequences, obtaining
y[2k+2] = a^2 y[2k] + a b x[2k] + b x[2k+1]
y[2k+1] = a^2 y[2k-1] + a b x[2k-1] + b x[2k]
where n, k ∈ Z. These two equations are the basis for the following parallel IIR filter FPGA
implementation.
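The even/odd decomposition can be checked in software; a minimal Python sketch (names illustrative) runs the two paths independently, each at half rate, and multiplexes them back together:

```python
def iir_parallel2(a, b, x):
    """P = 2 parallel IIR: even and odd output streams, each spanning two delays."""
    even, odd = [], []
    for n in range(len(x)):
        if n == 0:
            even.append(0.0)                 # y[0] = 0
        elif n == 1:
            odd.append(b * x[0])             # y[1] = b*x[0]
        else:
            path = even if n % 2 == 0 else odd
            # y[n] = a^2*y[n-2] + a*b*x[n-2] + b*x[n-1], within one stream
            path.append(a * a * path[-1] + a * b * x[n - 2] + b * x[n - 1])
    # the output multiplexer interleaves the two streams at the full rate
    return [even[k // 2] if k % 2 == 0 else odd[k // 2] for k in range(len(x))]
```

The interleaved result matches the original full-rate recursion exactly, while each path has two clock periods to compute its next output.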
VHDL CODE:
PACKAGE n_bit_int IS -- User-defined type
SUBTYPE BITS15 IS INTEGER RANGE -2**14 TO 2**14-1;
END n_bit_int;
LIBRARY work;
USE work.n_bit_int.ALL;
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
ENTITY iir_par IS ------> Interface
PORT ( clk, reset : IN STD_LOGIC;
x_in : IN BITS15;
x_e, x_o, y_e, y_o : OUT BITS15;
clk2 : OUT STD_LOGIC;
y_out : OUT BITS15);
END iir_par;
ARCHITECTURE fpga OF iir_par IS
TYPE STATE_TYPE IS (even, odd);
SIGNAL state : STATE_TYPE;
SIGNAL x_even, xd_even : BITS15 := 0;
SIGNAL x_odd, xd_odd, x_wait : BITS15 := 0;
SIGNAL y_even, y_odd, y_wait, y : BITS15 := 0;
SIGNAL sum_x_even, sum_x_odd : BITS15 := 0;
SIGNAL clk_div2 : STD_LOGIC;
BEGIN
Multiplex: PROCESS (reset, clk) --> Split x into even and
BEGIN -- odd samples; recombine y at clk rate
IF reset = '1' THEN -- Asynchronous reset
state <= even;
ELSIF rising_edge(clk) THEN
CASE state IS
WHEN even =>
x_even <= x_in;
x_odd <= x_wait;
clk_div2 <= '1';
y <= y_wait;
state <= odd;
WHEN odd =>
x_wait <= x_in;
clk_div2 <= '0';
y <= y_odd;
y_wait <= y_even;
state <= even;
END CASE;
END IF;
END PROCESS Multiplex;
y_out <= y;
clk2 <= clk_div2;
x_e <= x_even; -- Test signals
x_o <= x_odd;
y_e <= y_even;
y_o <= y_odd;
Arithmetic: PROCESS -- Filter arithmetic at the clk/2 rate; a = 3/4 assumed
BEGIN
WAIT UNTIL clk_div2 = '0';
-- Even stream: y[2k+2] = a^2*y[2k] + a*x[2k] + x[2k+1]
sum_x_even <= x_odd + x_even / 2 + x_even / 4;
y_even <= sum_x_even + y_even / 2 + y_even / 16;
-- Odd stream: y[2k+1] = a^2*y[2k-1] + a*x[2k-1] + x[2k]
xd_even <= x_even;
xd_odd <= x_odd;
sum_x_odd <= xd_even + xd_odd / 2 + xd_odd / 4;
y_odd <= sum_x_odd + y_odd / 2 + y_odd / 16;
END PROCESS Arithmetic;
END fpga;
The design is realized with two PROCESS statements. In the first, PROCESS Multiplex, x is
split into even and odd indexed parts, and the output y is recombined at the clk rate. In addition,
the first PROCESS statement generates the second clock, running at clk/2. The second PROCESS
block implements the filter's arithmetic according to (4.22). The design uses 268 LEs, no
embedded multiplier, and achieves a Registered Performance of 168.12 MHz.
DECIMATION
INTRODUCTION
A frequent task in digital signal processing is to adjust the sampling rate according to the signal
of interest. Systems with different sampling rates are referred to as multirate systems. Two
typical examples of multirate DSP systems are decimation and interpolation. Multirate systems
are sometimes used for sampling-rate conversion, which involves both decimation and
interpolation.
DECIMATION
Decimation can be regarded as the discrete-time counterpart of sampling. Whereas in sampling
we start with a continuous-time signal x(t) and convert it into a sequence of samples x[n], in
decimation we start with a discrete-time signal x[n] and convert it into another discrete-time
signal y[n], which consists of sub-samples of x[n]. Thus, the formal definition of M-fold
decimation, or down-sampling, is
y[n] = x[nM].
In decimation, the sampling rate is reduced from Fs to Fs/M by discarding M-1 samples for
every M samples in the original sequence. A narrow filter followed by a down-sampler is
usually referred to as a decimator.
Fig 1: Block diagram notation of decimation, by a factor of M.
The block diagram notation of the decimation process is depicted in Figure.
An anti-aliasing digital filter precedes the down-sampler to prevent aliasing from
occurring, due to the lower sampling rate. Figure 2 below illustrates the concept of 3-fold
decimation, i.e., M = 3. Here, the samples of x[n] corresponding to n = ..., -2, 1, 4, ... and
n = ..., -1, 2, 5, ... are lost in the decimation process.
In general, the samples of x[n] corresponding to n ≠ kM, where k is an integer, are
discarded in M-fold decimation. Figure 2 shows samples of the decimated signal y[n] spaced
three times wider than the samples of x[n].
In real time, the decimated signal appears at a slower rate than that of the original signal
by a factor of M.
If the sampling frequency of x[n] is Fs, then that of y[n] is Fs/M.
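The filter-plus-down-sampler structure can be sketched in Python (names illustrative; a length-M moving average stands in for a properly designed anti-aliasing lowpass):

```python
def moving_average(x, M):
    """Toy anti-aliasing filter: length-M moving average (illustration only)."""
    return [sum(x[max(0, n - M + 1):n + 1]) / M for n in range(len(x))]

def decimate(x, M):
    """Decimator = lowpass filter followed by the M-fold down-sampler y[n] = x[nM]."""
    return moving_average(x, M)[::M]
```

The slice `[::M]` is exactly the down-sampler: it keeps every Mth sample and discards the rest.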
Fig 2: Decimation of a discrete-time signal by a factor of 3
ANTI ALIASING FILTER
We can reduce the sampling rate only down to the limit called the Nyquist rate, which says
that the sampling rate must be at least twice the bandwidth of the signal in order to avoid
aliasing. Aliasing for a lowpass signal is demonstrated in Fig. 3. Aliasing is irreparable,
and should be avoided at all cost. For a bandpass signal, the frequency band of interest must
fall within an integer band. If fs is the sampling rate and R is the desired down-sampling
factor, then the band of interest must fall between k fs/(2R) and (k+1) fs/(2R) for some
integer k. If it does not, there may be aliasing due to copies from the negative frequency
bands, although the sampling rate may still be higher than the Nyquist rate.
Fig 3: Unaliased and aliased decimation cases.
Fig 4: Decimation of a signal x[n] ↔ X(ω).
DOWN SAMPLER
Down-sampling is the process of reducing the sampling rate of a signal. The down-sampling
factor is usually an integer or a rational fraction greater than one. The sampling rate can be
reduced only down to the limit called the Nyquist rate. A down-sampler with a down-sampling
factor M, where M is a positive integer, develops an output sequence y[n] with a sampling rate
that is (1/M)-th of that of the input sequence x[n].
VHDL CODE
library ieee;
use ieee.std_logic_1164.all;
entity decimator_1 is
port( inseq: in std_logic_vector( 7 downto 0);--input sequence
clk: in std_logic;
reset:in std_logic;
dec_op: out std_logic_vector( 7 downto 0));-- decimated output sequence
end decimator_1;
architecture Behavioral of decimator_1 is
begin
process(clk,inseq)
variable count: integer ;
begin
if reset = '1' then
count := 2; -- count initialized; counts the clock pulses
end if;
if clk = '1' and clk'event then
if (count mod 2 = 0) then -- if count is a multiple of 2,
dec_op <= inseq; -- the input sample is passed to the output
end if;
count := count + 1;
end if;
end process;
end Behavioral;
OUTPUT
Fig: test bench waveform
Fig: simulated output
Input sequence = {8'hff, 8'hfe, 8'hfd, 8'hfc, 8'hfb, 8'hfa}
Output sequence = {8'hff, 8'hfd, 8'hfb}
INTERPOLATION
A frequent task in digital signal processing is to adjust the sampling rate according to the
signal of interest. Systems with different sampling rates are referred to as multirate
systems.
If, after A/D conversion, the signal of interest can be found in a small frequency band
(typically lowpass or bandpass), then it is reasonable to filter with a lowpass or bandpass
filter and to reduce the sampling rate. A narrow filter followed by a down-sampler is usually
referred to as a decimator. Increasing the sampling rate can be useful in the D/A conversion
process, for example. Typically, D/A converters use a first-order sample-and-hold at the
output, which produces a step-like output function. This can be compensated for with an analog
1/sinc(x) compensation filter, but most often a digital solution is more efficient.
We can use, in the digital domain, an expander and an additional filter to get the desired
frequency band. The zeros introduced by the expander produce extra copies of the baseband
spectrum that must first be removed before the signal can be processed by the D/A converter.
For the interpolator, the Noble relation is
F(z)(↑R) = (↑R)F(z^R),
i.e., in an interpolation, putting the filter before the expander results in an R-times
shorter filter.
INTERPOLATION
A process by which the output sampling rate of a signal is increased is known as
interpolation.
An interpolator consists of an up-sampler and an anti-imaging filter:
x(n) -> [up-sample by L] -> v(m) -> [H(z)] -> y(m)
The up-sampling operation simply inserts (L-1) zeros between every two input samples,
producing the intermediate signal
v(m) = x(m/L) for m a multiple of L, and v(m) = 0 otherwise.
The output signal y(m) is obtained by convolving the intermediate signal with the impulse
response h(n):
y(m) = Σ_k h(k) v(m-k)
The spectral property of up-sampling is simple in the z-transform domain: V(z) = X(z^L), so
up-sampling is simply a contraction of the frequency axis by a factor of L.
The original spectrum X(e^jω) over [-π, π].
The original spectrum X(e^jω) over [-5π, 5π].
The up-sampled spectrum.
Interpolation example: for R = 3, x[n] ↔ X(ω) is shown below.
Interpolation in the time domain: here the up-sampling factor is 4, so three zeros are
inserted between every two input samples. Up-sampling is thus an expansion in the time domain.
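The zero-insertion step can be sketched in Python (function name illustrative):

```python
def upsample(x, L):
    """Insert L-1 zeros after each sample: v[m] = x[m/L] when L divides m, else 0."""
    v = []
    for s in x:
        v.append(s)           # keep the original sample
        v.extend([0] * (L - 1))  # insert L-1 zeros
    return v
```

The anti-imaging filter H(z) that follows then removes the spectral copies created by the inserted zeros.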
CODE
entity interpolator is
port(a : in bit_vector(1 to 16);
b : out bit_vector(1 to 32));
end interpolator;
architecture Behavioral of interpolator is
begin
process(a)
begin
for i in 1 to 16 loop
b(2*i-1) <= a(i); -- copy the input sample
b(2*i) <= '0'; -- insert a zero (L = 2 up-sampling)
end loop;
end process;
end Behavioral;
44
MULTISTAGE DECIMATOR
A single decimator stage is repeated (i.e., up to the Pth stage) to obtain the
required output of the multistage decimator.
Block Diagram of Multistage Decimator
If the decimation rate R is large, it can be shown that a multistage design
can be realized with less effort than a single-stage converter. In particular, S stages, each having
a decimation capability of R_k, are designed to have an overall down-sampling rate
of R = R1·R2···RS. Unfortunately, pass-band imperfections, such as ripple deviation, accumulate
from stage to stage. As a result, a pass-band deviation target of εp must normally be tightened
to the order of εp/S to meet the overall system specification. This is obviously a worst-case
assumption, in which all short filters have the maximum ripple at the same frequencies, which
is, in general, too pessimistic. It is often more reasonable to try an initial value near the given
pass-band specification εp, and then selectively reduce it if necessary.
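The overall-rate relation R = R1·R2···RS can be checked for the down-sampling operation itself (filters omitted; an illustrative NumPy sketch):

```python
import numpy as np

# Pure down-sampling commutes across stages: decimating by R1 and then by R2
# selects exactly every (R1*R2)-th sample, i.e., the overall rate R = R1*R2
x = np.arange(48.0)
R1, R2 = 2, 4
y_two_stage = x[::R1][::R2]
y_single = x[::R1 * R2]
print(np.array_equal(y_two_stage, y_single))  # True
```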
MULTISTAGE DECIMATOR DESIGN USING GOODMAN-CAREY HALF-BAND FILTERS:
Goodman and Carey [80] proposed to develop multistage systems based on
the use of CIC and half-band filters. A half-band filter has its pass-band and stop-band
edges located symmetrically about π/2 (ωs = π − ωp), i.e., midway in the baseband. A
half-band filter can therefore be used to change the sampling rate by a factor of two.
If the half-band filter has point symmetry relative to ω = π/2, then all even-indexed
coefficients (except the center tap) become zero.
CIC FILTER:
CIC (cascaded integrator comb) filter is an optimized class of finite impulse response (FIR) filter combined with an interpolator or decimator.
It consists of one or more integrator and comb filter pairs.
For decimating CIC, the input signal is fed through one or more cascaded integrators, then a down sampler which is followed by one or more comb sections.
For an impulse input, the single-stage CIC filter produces a step response at the
integrator output, and the same logic is used for the multistage decimator.
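The equivalence between a single-stage CIC decimator and a moving-average filter can be sketched as follows (illustrative NumPy sketch; R = 8 is an arbitrary choice):

```python
import numpy as np

R = 8
x = np.arange(1.0, 65.0)

# Single-stage CIC decimator: integrator -> downsample by R -> comb (delay 1)
integ = np.cumsum(x)               # integrator: running sum of the input
dec = integ[R - 1::R]              # keep every R-th integrator output
comb = np.diff(dec, prepend=0.0)   # comb at the low rate: y[m] - y[m-1]

# Equivalent moving-average: each output is the sum of R consecutive inputs
ref = x.reshape(-1, R).sum(axis=1)
print(np.allclose(comb, ref))      # True
```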
VHDL PROGRAM:
entity vrb is
Port ( clk : in STD_LOGIC;
x_in : in STD_LOGIC_VECTOR (7 downto 0);
y_out : out STD_LOGIC_VECTOR (8 downto 0));
end vrb;
architecture Behavioral of vrb is
TYPE STATE_TYPE is (hold,sample);
SIGNAL state :STATE_TYPE;
SIGNAL count:integer RANGE 0 to 64;
SIGNAL clk2: STD_LOGIC;
SIGNAL x : STD_LOGIC_VECTOR( 7 DOWNTO 0);
SIGNAL sxtx: STD_LOGIC_VECTOR( 25 DOWNTO 0);
SIGNAL i0 :word26;
SIGNAL i1 :word21;
SIGNAL i2 :word16;
SIGNAL i2d1,i2d2,i2d3,i2d4,c1,c0: word14;
SIGNAL c1d1,c1d2,c1d3,c1d4,c2: word13;
SIGNAL c2d1,c2d2,c2d3,c2d4,c3: word12;
begin
FSM: PROCESS
BEGIN
  WAIT UNTIL clk = '0';
  CASE state IS
    WHEN hold =>                  -- hold for the decimation period
      IF count < 63 THEN
        count <= count + 1;
      ELSE
        state <= sample;
      END IF;
    WHEN sample =>                -- take one output sample
      count <= 0;
      state <= hold;
  END CASE;
END PROCESS FSM;

Sxt: PROCESS (x)                  -- sign-extend the 8-bit input to 26 bits
BEGIN
  sxtx(7 DOWNTO 0) <= x;
  sxtx(25 DOWNTO 8) <= (OTHERS => x(7));
END PROCESS Sxt;

clk2 <= '1' WHEN state = sample ELSE '0';  -- decimated-rate clock enable
-- integrator and comb stages operate on i0..i2 and c0..c3 at the clk and clk2 rates
ANOTHER PROGRAM:
entity newvrb is
port (a : in STD_LOGIC_vector(1 to 32);
bintr : out STD_LOGIC_vector(1 to 16);
cintr : out STD_LOGIC_vector(1 to 8);
d : out STD_LOGIC_vector(1 to 4));
end newvrb;
architecture Behavioral of newvrb is
signal b : STD_LOGIC_vector(1 to 16);
signal c : STD_LOGIC_vector(1 to 8);
begin
process (a, b, c)
begin
  for I in 1 to 16 loop            -- stage 1: keep every second sample (32 -> 16)
    b(I) <= a(2*I);
  end loop;
  for I in 1 to 8 loop             -- stage 2: 16 -> 8
    c(I) <= b(2*I);
  end loop;
  for I in 1 to 4 loop             -- stage 3: 8 -> 4
    d(I) <= c(2*I);
  end loop;
end process;
bintr <= b;
cintr <= c;
end Behavioral;
APPLICATIONS:
During A/D conversion: oversampling to alleviate the stringent requirements on the analog anti-aliasing filter.
During D/A conversion: Filter to remove spectrum images.
Fractional sampling rate conversion.
POLYPHASE DECOMPOSITION
Polyphase decomposition is very useful when implementing decimation or interpolation
in IIR or FIR filters and in filter banks. To illustrate this, consider the polyphase decomposition of
an FIR decimation filter. If we add downsampling by a factor of R to the FIR filter structure
shown in Figure 1, we find that we only need to compute the outputs y[n] at the time instances

n = 0, ±R, ±2R, . . . (1)

Figure 1: Direct form of FIR filter.

It follows that we do not need to compute all sums-of-products f[k]x[n − k] of the convolution. For instance, x[0] only needs to be multiplied by

f[0], f[R], f[2R], . . . . (2)

Besides x[0], these coefficients only need to be multiplied by

x[R], x[2R], . . . . (3)

It is therefore reasonable to split the input signal first into R separate sequences
according to

x_r[n] = x[nR + r], r = 0, 1, . . . , R − 1,

and also to split the filter f[n] into R sequences

f_r[n] = f[nR + r], r = 0, 1, . . . , R − 1.
Figure 2 shows a decimation filter implemented using polyphase decomposition. Such a
decimator can run R times faster than the usual FIR filter followed by a downsampler. The
filters f_r[n] are called polyphase filters because they all have the same magnitude transfer
function; they are separated only by a sample delay, which introduces a phase offset. A final
example illustrates the polyphase decomposition.
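The decomposition can be verified numerically: filtering at the full rate and then downsampling gives the same samples as running the R polyphase subfilters on the input phases (an illustrative NumPy sketch; the filter coefficients are arbitrary):

```python
import numpy as np

R = 2
f = np.array([0.48, 0.84, 0.22, -0.13])   # example length-4 filter (arbitrary values)
x = np.arange(1.0, 17.0)                  # test input x[0..15]

# Reference: full FIR convolution, then keep every R-th output sample
y_ref = np.convolve(f, x)[::R]

# Polyphase: f_r[n] = f[nR + r] and x~_r[n] = x[nR - r] (zero outside the input)
N = len(y_ref)
y_poly = np.zeros(N)
for r in range(R):
    f_r = f[r::R]
    x_r = np.array([x[n * R - r] if 0 <= n * R - r < len(x) else 0.0
                    for n in range(len(x) // R + 1)])
    y_poly += np.convolve(f_r, x_r)[:N]

print(np.allclose(y_ref, y_poly))  # True
```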
EXAMPLE 5.1: POLYPHASE DECIMATOR FILTER
Consider a Daubechies length-4 filter with G(z) and R = 2.
Quantizing the filter to 8 bits of precision results in the following model:
and it follows that
Figure2: Polyphase realization of decimation filter.
The following VHDL code shows the polyphase implementation for DB4.
Figure 3 : Output for the given code
FILTER BANKS
A digital filter bank is a collection of filters having a common input or output. One
common application of the analysis filter bank is spectrum analysis, i.e., to split the input signal
into R different so-called subband signals. The combination of several signals into a common
output signal is called a synthesis filter bank. The analysis filter may be nonoverlapping,
slightly overlapping, or substantially overlapping. Another important characteristic that
distinguishes different classes of filter banks is the bandwidth and spacing of the center
frequencies of the filters. A popular example of a non-uniform filter bank is the octave-spaced
or wavelet filter bank.
UNIFORM DFT FILTER BANK:
In uniform filter banks, all filters have the same bandwidth and sampling rate. In a
maximally decimated, or critically sampled, filter bank, the decimation rate R is equal to the
number of bands K. If the rth band filter is computed from the modulation of a single
prototype filter h[n], according to

h_r[n] = h[n] e^(j2πrn/R), (1)

then it is a uniform DFT filter bank.
FIG.1 R channel filter bank, with a small amount of overlapping
An efficient implementation of the R-channel filter bank can be generated if we use
a polyphase decomposition of the filter and of the input signal x[n]. Because each of these
bandpass filters is critically sampled, we use a decomposition with R polyphase signals
according to

x_k[n] = x[nR + k], k = 0, 1, . . . , R − 1, (2)
h_k[n] = h[nR + k], k = 0, 1, . . . , R − 1. (3)

If we now substitute (2) into (1), we find that all bandpass filters share the same polyphase
filters h_k[n], while the twiddle factors for each filter are different. It is now obvious that the
twiddle multiplication for the rth band filter corresponds to the rth DFT component, with an
input vector of the R polyphase-filtered signals. The computation for the whole analysis bank
can be reduced to filtering with R polyphase filters, followed by a DFT (or FFT) of these R
filtered components. This is obviously much more efficient than direct computation. The
polyphase filter bank for the uniform DFT synthesis bank can be developed as an inverse
operation to the analysis bank. Perfect reconstruction occurs if the convolution of the included
polyphase filters gives a unit sample function, i.e.,

g_k[n] * h_k[n] = δ[n]. (4)
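The polyphase-plus-DFT structure can be cross-checked against direct computation with the modulated band filters (an illustrative NumPy sketch; the prototype h and the sizes R and M are arbitrary choices):

```python
import numpy as np

R = 4
h = np.array([0.5, 0.9, 0.7, 0.3, 0.1, -0.2, 0.05, 0.4])  # prototype (arbitrary)
x = np.arange(1.0, 25.0)
M = 6  # number of decimated output samples to compare

# Direct: r-th band filter h_r[n] = h[n] exp(j 2 pi r n / R), output decimated by R
n = np.arange(len(h))
direct = np.zeros((R, M), dtype=complex)
for r in range(R):
    hr = h * np.exp(2j * np.pi * r * n / R)
    direct[r] = np.convolve(hr, x)[::R][:M]

# Polyphase + DFT: v_k = h_k * x~_k, then an inverse DFT across the R branches
v = np.zeros((R, M), dtype=complex)
for k in range(R):
    hk = h[k::R]
    xk = np.array([x[m * R - k] if 0 <= m * R - k < len(x) else 0.0
                   for m in range(M + len(hk))])
    v[k] = np.convolve(hk, xk)[:M]
poly = R * np.fft.ifft(v, axis=0)
print(np.allclose(direct, poly))  # True
```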
TWO CHANNEL FILTER BANKS:
The input x[n] is split by using lowpass G(z) and highpass H(z) analysis filters.
The resulting signal x[n] is reconstructed using lowpass and highpass synthesis
filters.
Between the analysis and synthesis sections are decimation and interpolation by 2 units.
The construction rule is given by H(z) = G(−z), which defines the filters to be mirrored
pairs. This is a quadrature mirror filter (QMF) bank, because the two filters have mirror
symmetry about π/2.
A perfectly reconstructed signal has the same shape as the original, up to a phase (time)
shift.
Fig.2 Two-channel filter bank
If the signal X(z) is applied to the two-channel filter bank, the
lowpass path X_G(z) and highpass path X_H(z) (after decimation and interpolation by 2) become

X_G(z) = (1/2)[G(z)X(z) + G(−z)X(−z)] (5)
X_H(z) = (1/2)[H(z)X(z) + H(−z)X(−z)] (6)

After multiplication by the synthesis filters G_s(z) and H_s(z) and summation
of the results, we get the reconstruction

X̂(z) = (1/2)[G(z)G_s(z) + H(z)H_s(z)] X(z) + (1/2)[G(−z)G_s(z) + H(−z)H_s(z)] X(−z). (7)

The factor of X(−z) is the aliasing component, while the term at X(z) shows the
amplitude distortion.
PERFECT RECONSTRUCTION:
A perfect reconstruction for a two-channel filter bank is achieved if

1) G(−z)G_s(z) + H(−z)H_s(z) = 0, i.e., the reconstruction is free of aliasing, and
2) G(z)G_s(z) + H(z)H_s(z) = 2z^(−d), i.e., the amplitude distortion has amplitude
one (a pure delay).

A two-channel filter bank is aliasing-free if we choose G_s(z) = H(−z) and H_s(z) = −G(−z).
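That the synthesis choice Gs(z) = H(−z), Hs(z) = −G(−z) cancels aliasing for any lowpass G(z) can be checked with coefficient (polynomial) arithmetic (an illustrative sketch; the example lowpass is arbitrary):

```python
import numpy as np

alt = np.array([1.0, -1.0, 1.0, -1.0])   # multiplying by alt maps F(z) -> F(-z)
g = np.array([1.0, 3.0, 3.0, 1.0]) / 8   # example lowpass filter (arbitrary)
h = g * alt                              # QMF highpass: H(z) = G(-z)

# Alias-cancelling synthesis filters: Gs(z) = H(-z), Hs(z) = -G(-z)
gs = h * alt
hs = -g * alt

# Aliasing term G(-z)Gs(z) + H(-z)Hs(z) must vanish identically
alias = np.convolve(g * alt, gs) + np.convolve(h * alt, hs)
print(np.allclose(alias, 0))  # True
```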
IMPLEMENTING TWO-CHANNEL FILTER BANKS:
POLYPHASE TWO-CHANNEL FILTER BANKS:
In the general case, with two filters G(z) and H(z), we can realize each filter as a
polyphase filter as shown below
Fig.3 Polyphase implementation of the two-channel filter bank
G(z) = G0(z²) + z⁻¹G1(z²), H(z) = H0(z²) + z⁻¹H1(z²). (8)

This does not reduce the hardware effort (2L multipliers and 2(L−1) adders are still used), but
the design can run at twice the usual sampling frequency, 2fs. These four polyphase filters
have only half the length of the original filters.
LIFTING:
Another general approach to constructing fast and efficient two-channel filter banks is the
lifting scheme, introduced by Sweldens and by Herley and Vetterli. The basic idea is the use
of cross-terms (called lifting and dual-lifting), as in a lattice filter, to construct a longer filter
from a short filter, while preserving the perfect reconstruction conditions.
Any (bi)orthogonal wavelet filter bank can be converted into a sequence of lifting and
dual-lifting steps. The number of multipliers and adders required then depends on the number of
lifting steps (more steps give lower complexity), and the savings can reach up to 50% compared
with the direct polyphase implementation.
QMF IMPLEMENTATION:
For the QMF bank we know that

H(z) = G(−z). (9)

But this implies that the polyphase filters are the same (except for the sign), i.e.,

G0(z) = H0(z), G1(z) = −H1(z). (10)

Instead of the four filters, for QMF we only need two filters and an additional butterfly. This
saves about 50%. For the QMF filter we need L real adders and L real multipliers, and the filter
can run at twice the usual input-sampling rate.
ORTHOGONAL FILTER BANKS:
If the highpass and lowpass polynomials are mirror (time-reversed) versions of each other,
we get an orthogonal filter bank. An orthogonal filter pair obeys the conjugate quadrature
filter (CQF) condition, defined by

h[n] = (−1)^n g[L − 1 − n]. (11)

If we use the transposed FIR filter, we need only half the number of multipliers. The
disadvantage is that we cannot benefit from polyphase decomposition to double the speed.
FIG.4. Lattice realization for the orthogonal two-channel filter bank
VHDL CODE:
PACKAGE n_bits_int IS -- User-defined types
SUBTYPE BITS8 IS INTEGER RANGE -128 TO 127;
SUBTYPE BITS9 IS INTEGER RANGE -2**8 TO 2**8-1;
SUBTYPE BITS17 IS INTEGER RANGE -2**16 TO 2**16-1;
TYPE ARRAY_BITS17_4 IS ARRAY (0 TO 3) OF BITS17;
END n_bits_int;
LIBRARY work;
USE work.n_bits_int.ALL;
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
USE ieee.std_logic_unsigned.ALL;
ENTITY db4latti IS ------> Interface
PORT (clk, reset : IN std_logic;
clk2 : OUT std_logic;
x_in : IN BITS8;
x_e, x_o : OUT BITS17;
g, h : OUT BITS9);
END db4latti;
ARCHITECTURE fpga OF db4latti IS
TYPE STATE_TYPE IS (even, odd);
SIGNAL state : STATE_TYPE;
SIGNAL sx_up, sx_low, x_wait : BITS17 := 0;
SIGNAL clk_div2 : std_logic;
SIGNAL sxa0_up, sxa0_low : BITS17 := 0;
SIGNAL up0, up1, low0, low1 : BITS17 := 0;
BEGIN
Multiplex: PROCESS (reset, clk) ----> Split into even and
BEGIN -- odd samples at clk rate
  IF reset = '1' THEN -- Asynchronous reset
    state <= even;
  ELSIF rising_edge(clk) THEN
    CASE state IS
      WHEN even =>
        -- Multiply with 256*s = 124
        sx_up  <= 124 * x_in;
        sx_low <= x_wait;
        state  <= odd;
      WHEN odd =>
        x_wait <= 124 * x_in;
        state  <= even;
    END CASE;
  END IF;
END PROCESS Multiplex;

-- ... first and second lattice stages computing up0, up1, low0, low1,
-- and the output assignments for x_e, x_o, g, and h ...
END fpga;
Computational complexity is reduced.
QMF-based subband coders provide more natural-sounding speech, pitch prediction, and
wider bandwidth than earlier subband coders.
APPLICATIONS:
Accurate channel selection in wireless communications.
Faster convergence and lower complexity in adaptive equalization.
Flexible compression of speech and music
Lower latency and better frequency compensation in hearing aids
More efficient short-time spectral analysis and synthesis
Multi-resolution image compression and wavelet transformations
Reliable automatic speech recognition.
DIT-FFT ALGORITHM
A fast Fourier transform (FFT) is an efficient algorithm for calculating the discrete Fourier
transform of a set of data. A DFT basically decomposes a set of data in the time domain into
different frequency components. The DFT is defined by the following equation:

X(k) = Σ_{n=0}^{N−1} x(n) e^(−j2πkn/N), k = 0, 1, . . . , N − 1

An FFT algorithm uses some interesting properties of the above formula to simplify the
calculations.
COOLEY-TUKEY ALGORITHM
The CooleyTukey algorithm, named after J.W. Cooley and John Tukey, is the most common
fast Fourier transform (FFT) algorithm. It re-expresses the discrete Fourier transform (DFT) of
an arbitrary composite size N = N1N2 in terms of smaller DFTs of sizes N1 and N2, recursively,
in order to reduce the computation time to O(N log N) for highly-composite N (smooth
numbers).
Basically, the computational problem for the DFT is to compute the sequence {X(k)}
of N complex-valued numbers given another sequence of data {x(n)} of length N, according to
the formula

X(k) = Σ_{n=0}^{N−1} x(n) W_N^{kn}, k = 0, 1, . . . , N − 1, with W_N = e^(−j2π/N).

In general, the data sequence x(n) is also assumed to be complex-valued. Similarly, the IDFT
becomes

x(n) = (1/N) Σ_{k=0}^{N−1} X(k) W_N^{−kn}, n = 0, 1, . . . , N − 1.

Since the DFT and IDFT involve basically the same type of computations, our discussion of
efficient computational algorithms for the DFT applies as well to the efficient computation of
the IDFT.
We observe that for each value of k, direct computation of X(k) involves N complex
multiplications (4N real multiplications) and N−1 complex additions (4N−2 real additions).
Consequently, to compute all N values of the DFT requires N² complex multiplications and
N² − N complex additions.
Direct computation of the DFT is basically inefficient, primarily because it does not exploit the
symmetry and periodicity properties of the phase factor W_N. In particular, these two properties
are:

Symmetry:   W_N^{k+N/2} = −W_N^k
Periodicity: W_N^{k+N} = W_N^k

The computationally efficient algorithms described in this section, known collectively as fast
Fourier transform (FFT) algorithms, exploit these two basic properties of the phase factor.
RADIX-2 FFT ALGORITHM
Let us consider the computation of the N = 2^v point DFT by the divide-and-conquer approach.
We split the N-point data sequence into two N/2-point data sequences f1(n) and f2(n),
corresponding to the even-numbered and odd-numbered samples of x(n), respectively, that is,

f1(n) = x(2n), f2(n) = x(2n + 1), n = 0, 1, . . . , N/2 − 1.

Thus f1(n) and f2(n) are obtained by decimating x(n) by a factor of 2, and hence the resulting
FFT algorithm is called a decimation-in-time algorithm.
Now the N-point DFT can be expressed in terms of the DFTs of the decimated sequences as
follows:

X(k) = Σ_{m=0}^{N/2−1} f1(m) W_N^{2mk} + W_N^k Σ_{m=0}^{N/2−1} f2(m) W_N^{2mk}.

But W_N² = W_{N/2}. With this substitution, the equation can be expressed as

X(k) = F1(k) + W_N^k F2(k), k = 0, 1, . . . , N − 1,
where F1(k) and F2(k) are the N/2-point DFTs of the sequences f1(m) and f2(m), respectively.
Since F1(k) and F2(k) are periodic with period N/2, we have F1(k + N/2) = F1(k) and
F2(k + N/2) = F2(k). In addition, the factor W_N^{k+N/2} = −W_N^k. Hence the equations may
be expressed as

X(k) = F1(k) + W_N^k F2(k), k = 0, 1, . . . , N/2 − 1
X(k + N/2) = F1(k) − W_N^k F2(k), k = 0, 1, . . . , N/2 − 1.
We observe that the direct computation of F1(k) requires (N/2)² complex multiplications. The
same applies to the computation of F2(k). Furthermore, there are N/2 additional complex
multiplications required to compute W_N^k F2(k). Hence the computation of X(k) requires
2(N/2)² + N/2 = N²/2 + N/2 complex multiplications. This first step results in a reduction of the
number of multiplications from N² to N²/2 + N/2, which is about a factor of 2 for N large.
Having computed the N/4-point DFTs, we would obtain the N/2-point DFTs F1(k) and F2(k)
by again combining even- and odd-indexed subsequences with the twiddle factors W_{N/2}^k.
The decimation of the data sequence can be repeated again and again until the resulting
sequences are reduced to one-point sequences. For N = 2^v, this decimation can be performed v =
log2 N times. Thus the total number of complex multiplications is reduced to (N/2)log2 N. The
number of complex additions is N log2 N.
The following figure depicts the computation of N = 8 point DFT. We observe that the
computation is performed in tree stages, beginning with the computations of four two-point
DFTs, then two four-point DFTs, and finally, one eight-point DFT. The combination for the
smaller DFTs to form the larger DFT is illustrated in following figure for N = 8.
Figure1 Three stages in the computation of an N = 8-point DFT.
Figure 2 Eight-point decimation-in-time FFT algorithm.
Figure 3 Basic butterfly computation in the decimation-in-time FFT algorithm.
An important observation concerns the order of the input data sequence after it is
decimated (v−1) times. For example, if we consider the case where N = 8, we know that the first
decimation yields the sequence x(0), x(2), x(4), x(6), x(1), x(3), x(5), x(7), and the second
decimation results in the sequence x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7). This shuffling of
the input data sequence has a well-defined order, as can be ascertained from observing the
following figure, which illustrates the decimation of the eight-point sequence.
INPUT DATA INDEX   INDEX BITS   REVERSED BITS   OUTPUT DATA INDEX
       0              000            000                0
       4              100            001                1
       2              010            010                2
       6              110            011                3
       1              001            100                4
       5              101            101                5
       3              011            110                6
       7              111            111                7
Figure TC.3.5 Shuffling of the data and bit reversal.
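The bit-reversed ordering in the table can be generated programmatically (an illustrative Python sketch):

```python
def bit_reverse(i, nbits):
    # reverse the nbits-bit binary representation of index i
    r = 0
    for _ in range(nbits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

# For N = 8 (3 bits), this reproduces the shuffled input order of the table
order = [bit_reverse(i, 3) for i in range(8)]
print(order)  # [0, 4, 2, 6, 1, 5, 3, 7]
```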
ADVANTAGES:
To reduce the computational complexity of the DFT algorithm.
Energy compaction.
Delay can be reduced.
CODE:
module ditfft(clk,sel,yr,yi);
input clk;
input [2:0]sel;
output reg [7:0]yr,yi ;
wire [7:0]y0r,y1r,y2r,y3r,y4r,y5r,y6r,y7r,y0i,y1i,y2i,y3i,y4i,y5i,y6i,y7i;
wire [7:0]x20r,x20i,x21r,x21i,x22r,x22i,x23r,x23i,x24r,x24i,x25r,x25i,x26r,x26i,x27r,x27i;
wire [7:0]x10r,x10i,x11r,x11i,x12r,x12i,x13r,x13i,x14r,x14i,x15r,x15i,x16r,x16i,x17r,x17i;
wire [7:0]x0,x1,x2,x3,x4,x5,x6,x7;
assign x0=8'b10;
assign x1=8'b10;
assign x2=8'b10;
assign x3=8'b10;
assign x4=8'b1;
assign x5=8'b1;
assign x6=8'b1;
assign x7=8'b1;
//stage1
bfly1 s11(x0,x4,x10r,x10i,x11r,x11i);
bfly1 s12(x2,x6,x12r,x12i,x13r,x13i);
bfly1 s13(x1,x5,x14r,x14i,x15r,x15i);
bfly1 s14(x3,x7,x16r,x16i,x17r,x17i);
//stage2
bfly1 s21(x10r,x12r,x20r,x20i,x22r,x22i);
bfly2 s22(x11r,x11i,x13r,x13i,x21r,x21i,x23r,x23i);
bfly1 s23(x14r,x16r,x24r,x24i,x26r,x26i);
bfly2 s24(x15r,x15i,x17r,x17i,x25r,x25i,x27r,x27i);
//stage3
bfly1 s31(x20r,x24r,y0r,y0i,y4r,y4i);
bfly3 s32(x21r,x21i,x25r,x25i,y1r,y1i,y5r,y5i);
bfly2 s33(x22r,x22i,x26r,x26i,y2r,y2i,y6r,y6i);
bfly4 s34(x23r,x23i,x27r,x27i,y3r,y3i,y7r,y7i);
always@(posedge clk)
case(sel)
0:begin yr=y0r; yi=y0i; end
1:begin yr=y1r; yi=y1i; end
2:begin yr=y2r; yi=y2i; end
3:begin yr=y3r; yi=y3i; end
4:begin yr=y4r; yi=y4i; end
5:begin yr=y5r; yi=y5i; end
6:begin yr=y6r; yi=y6i; end
7:begin yr=y7r; yi=y7i; end
endcase
endmodule

module bfly1(x,y,x0r,x0i,x1r,x1i); // butterfly for real inputs, w = 1
input [7:0]x,y;
output[7:0]x1r,x1i,x0r,x0i;
assign x0r=x+y;
assign x0i=8'd0;
assign x1r=x-y;
assign x1i=8'd0;
endmodule

module bfly2(xr,xi,yr,yi,x0r,x0i,x1r,x1i); // butterfly for w = -j
input [7:0]xr,xi,yr,yi;
output [7:0]x0r,x0i,x1r,x1i;
assign x0r=xr+yi;
assign x0i=xi-yr;
assign x1r=xr-yi;
assign x1i=xi+yr;
endmodule

module bfly3(xr,xi,yr,yi,x0r,x0i,x1r,x1i); // butterfly for w = (1-j)*0.707
input [7:0]xr,xi,yr,yi;
output [7:0]x0r,x0i,x1r,x1i;
parameter sht=8'd10; // scale by 707/2**10, approximately 0.707
wire [7:0]p1,p2;
assign p1=(707*yr)>>sht;
assign p2=(707*yi)>>sht;
assign x0r=xr+p1+p2;
assign x0i=xi-p1+p2;
assign x1r=xr-p1-p2;
assign x1i=xi-p2+p1;
endmodule

module bfly4(xr,xi,yr,yi,x0r,x0i,x1r,x1i); // butterfly for w = (-1-j)*0.707
input [7:0]xr,xi,yr,yi;
output [7:0]x0r,x0i,x1r,x1i;
parameter sht=8'd10; // scale by 707/2**10, approximately 0.707
wire [7:0]p1,p2;
assign p1=(707*yr)>>sht;
assign p2=(707*yi)>>sht;
assign x0r=xr-p1+p2;
assign x0i=xi-p1-p2;
assign x1r=xr+p1-p2;
assign x1i=xi+p2+p1;
endmodule
OUTPUT:
Input : 2,1,2,1,2,1,2,1.
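The expected transform of this stimulus can be computed independently of the Verilog model (NumPy cross-check):

```python
import numpy as np

x = np.array([2, 1, 2, 1, 2, 1, 2, 1], dtype=float)
X = np.fft.fft(x)
# Only bins 0 and 4 are non-zero: the sum (12) and the alternating component (4)
print(np.round(X.real).astype(int).tolist())  # [12, 0, 0, 0, 4, 0, 0, 0]
```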
DIF-FFT
FOURIER TRANSFORM
A Fourier transform is a useful analytical tool that is important for many fields of
application in digital signal processing.
In describing the properties of the Fourier transform and inverse Fourier transform, it is
quite convenient to use the concepts of time and frequency.
In image processing applications it plays a critical role.
FAST FOURIER TRANSFORM
The fast Fourier transform was proposed by Cooley and Tukey in 1965.
The fast Fourier transform is a highly efficient procedure for computing the DFT of a
finite series and requires fewer computations than direct evaluation of the DFT.
The FFT is based on decomposition: breaking the transform into smaller transforms and
combining them to get the total transform.
DISCRETE FOURIER TRANSFORM
The DFT pair is given as

X(k) = Σ_{n=0}^{N−1} x(n) e^(−j2πkn/N), k = 0, 1, . . . , N − 1
x(n) = (1/N) Σ_{k=0}^{N−1} X(k) e^(j2πkn/N), n = 0, 1, . . . , N − 1

Baseline for computational complexity:
Each DFT coefficient requires N complex multiplications and N−1 complex additions.
All N DFT coefficients require N² complex multiplications and N(N−1) complex additions.
SYMMETRY AND PERIODICITY PROPERTY
Symmetry:   W_N^{k+N/2} = −W_N^k
Periodicity: W_N^{k+N} = W_N^k
FFT algorithm provides speed increase factors, when compared with direct computation of the DFT, of approximately 64 and 205 for 256 point and 1024
point transforms respectively.
The number of multiplications and additions required to compute an N-point DFT using
the radix-2 FFT are (N/2)log2 N and N log2 N respectively.
EXAMPLE: For N = 64, the number of complex multiplications required using direct
computation is
N² = 64² = 4096.
The number of complex multiplications required using the FFT is
(N/2)log2 N = (64/2)·log2 64 = 192.
Speed improvement factor = 4096/192 = 21.33.
NUMBER OF COMPLEX MULTIPLICATIONS REQUIRED IN DIF- FFT
ALGORITHM
No. of points
in a sequence
x(n), N
Complex
multiplications
in direct
computation of
DFT
=NN =A
Complex
multiplications
in FFT
algorithms
N/2 log2 N = B
Speed
improvement
Factor -A/B
4 16 4 4
8 64 12 5
16 256 32 8
FFT ALGORITHMS
There are basically two types of FFT algorithms.
They are:
Decimation in Time
Decimation in frequency
DECIMATION-IN-FREQUENCY
It is a popular form of the FFT algorithm.
In this approach the output sequence X(k) is divided into smaller and smaller subsequences;
that is why it is named decimation in frequency.
Initially the input sequence x(n) is divided into two sequences x1(n) and x2(n), consisting
of the first N/2 samples of x(n) and the last N/2 samples of x(n), respectively.
RADIX-2 DIF-FFT ALGORITHM
The N-point sequence x(n) is divided into two N/2-point sequences:
the former N/2 points: x1(n) = x(n), n = 0, 1, . . . , N/2 − 1
the latter N/2 points: x2(n) = x(n + N/2), n = 0, 1, . . . , N/2 − 1
The even-indexed bins X(2k) are the N/2-point DFT of x1(n) + x2(n), and the odd-indexed
bins X(2k + 1) are the N/2-point DFT of [x1(n) − x2(n)] W_N^n.
Figure: Decimation-in-frequency algorithm of length 8 for radix-2.
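The two relations above translate directly into a recursive radix-2 DIF implementation (an illustrative NumPy sketch, not the hardware realization):

```python
import numpy as np

def dif_fft(x):
    # Recursive radix-2 decimation-in-frequency FFT (length must be a power of two)
    N = len(x)
    if N == 1:
        return x
    W = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    a = x[:N // 2] + x[N // 2:]          # former + latter halves -> even bins
    b = (x[:N // 2] - x[N // 2:]) * W    # (former - latter) * twiddle -> odd bins
    X = np.empty(N, dtype=complex)
    X[0::2] = dif_fft(a)
    X[1::2] = dif_fft(b)
    return X

x = np.array([2, 1, 2, 1, 2, 1, 2, 1], dtype=complex)
print(np.allclose(dif_fft(x), np.fft.fft(x)))  # True
```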
THE COMPARISON OF DIT AND DIF
The order of samples:
DIT-FFT: the input is in bit-reversed order and the output is in natural order.
DIF-FFT: the input is in natural order and the output is in bit-reversed order.
The butterfly computation:
DIT-FFT: multiplication is done before the additions.
DIF-FFT: multiplication is done after the addition.
Both DIT-FFT and DIF-FFT have identical computational complexity, i.e., for N = 2^L there
are in total L stages, and each has N/2 butterfly computations. Each butterfly computation has
1 multiplication and 2 additions.
A DIT-FFT flow graph can be transposed to a DIF-FFT flow graph and vice versa.
RADIX-2 DIF- FFT ALGORITHM
VERILOG CODE
module diff(clk,sel,yr,yi);
input clk;
input [2:0]sel;
output reg [7:0]yr,yi;
wire [7:0]y0r,y1r,y2r,y3r,y4r,y5r,y6r,y7r,y0i,y1i,y2i,y3i,y4i,y5i,y6i,y7i;
wire [7:0]x20r,x20i,x21r,x21i,x22r,x22i,x23r,x23i,x24r,x24i,x25r,x25i,x26r,x26i,x27r,x27i;
wire [7:0]x10r,x10i,x11r,x11i,x12r,x12i,x13r,x13i,x14r,x14i,x15r,x15i,x16r,x16i,x17r,x17i;
wire [7:0]x0r,x0i,x1r,x1i,x2r,x2i,x3r,x3i,x4r,x4i,x5r,x5i,x6r,x6i,x7r,x7i;
parameter w0r=8'b1;
parameter w0i=8'b0;
parameter w1r=8'b10110101;
parameter w1i=8'b01001011;
parameter w2r=8'b0;
parameter w2i=8'b11111111;
parameter w3r=8'b01001011;
parameter w3i=8'b01001011;
assign x0r=8'b11111111;
assign x0i=8'b00000000;
assign x1r=8'b11011100;
assign x1i=8'b00010101;
assign x2r=8'b11001101;
assign x2i=8'b00000000;
assign x3r=8'b11011100;
assign x3i=8'b00010101;
assign x4r=8'b10101011;
assign x4i=8'b00000000;
assign x5r=8'b00000110;
assign x5i=8'b11101011;
assign x6r=8'b11001101;
assign x6i=8'b00000000;
assign x7r=8'b00000110;
assign x7i=8'b11101011;
//stage1
bfly1 s11(x0r,x0i,x4r,x4i,w0r,w0i,x10r,x10i,x14r,x14i);
bfly3 s12(x1r,x1i,x5r,x5i,w1r,w1i,x11r,x11i,x15r,x15i);
bfly2 s13(x2r,x2i,x6r,x6i,w2r,w2i,x12r,x12i,x16r,x16i);
bfly4 s14(x3r,x3i,x7r,x7i,w3r,w3i,x13r,x13i,x17r,x17i);
//stage2
bfly1 s21(x10r,x10i,x12r,x12i,w0r,w0i,x20r,x20i,x22r,x22i);
bfly2 s22(x11r,x11i,x13r,x13i,w2r,w2i,x21r,x21i,x23r,x23i);
bfly1 s23(x14r,x14i,x16r,x16i,w0r,w0i,x24r,x24i,x26r,x26i);
bfly2 s24(x15r,x15i,x17r,x17i,w2r,w2i,x25r,x25i,x27r,x27i);
//stage3
bfly1 s31(x20r,x20i,x24r,x24i,w0r,w0i,y0r,y0i,y1r,y1i);
bfly1 s32(x22r,x22i,x26r,x26i,w0r,w0i,y2r,y2i,y3r,y3i);
bfly1 s33(x21r,x21i,x25r,x25i,w0r,w0i,y4r,y4i,y5r,y5i);
bfly1 s34(x23r,x23i,x27r,x27i,w0r,w0i,y6r,y6i,y7r,y7i);
always@(posedge clk)
case(sel)
0:begin yr=y0r; yi=y0i; end
1:begin yr=y1r; yi=y1i; end
2:begin yr=y2r; yi=y2i; end
3:begin yr=y3r; yi=y3i; end
4:begin yr=y4r; yi=y4i; end
5:begin yr=y5r; yi=y5i; end
6:begin yr=y6r; yi=y6i; end
7:begin yr=y7r; yi=y7i; end
endcase
endmodule
module bfly1(xr,xi,yr,yi,wr,wi,x0r,x0i,x1r,x1i); // sub module: butterfly for w = 1
input [7:0]xr,xi,yr,yi,wr,wi;
output[7:0]x1r,x1i,x0r,x0i;
assign x0r=xr+yr;   // x0 = x + y
assign x0i=xi+yi;
assign x1r=xr-yr;   // x1 = x - y
assign x1i=xi-yi;
endmodule
module bfly2(xr,xi,yr,yi,wr,wi,x0r,x0i,x1r,x1i); // sub module: butterfly for w = -j
input [7:0]xr,xi,yr,yi,wr,wi;
output [7:0]x0r,x0i,x1r,x1i;
assign x0r=xr+yi;   // x0 = x + (-j)*y
assign x0i=xi-yr;
assign x1r=xr-yi;   // x1 = x - (-j)*y
assign x1i=xi+yr;
endmodule
module bfly3(xr,xi,yr,yi,wr,wi,x0r,x0i,x1r,x1i); // sub module
input [7:0]xr,xi,yr,yi,wr,wi;
output [7:0]x0r,x0i,x1r,x1i;
wire [15:0]p1,p2,p3,p4;
wire [7:0]win,yrn,yin;
wire [8:0]ywr,ywi;
parameter sht=8'd10; // scale by 2**10: 707/1024 is approximately 0.707
assign yrn=~yr+1;
assign yin=yi;
assign win=~wi+1;
assign p1=(yrn*wr)>>sht;
assign p2=(yin*win)>>sht;
assign p3=(yrn*win)>>sht;
assign p4=(yin*wr)>>sht;
assign ywr=(~p1+1)+p2;
assign ywi=p3+p4;
assign x0r=xr+ywr;
assign x0i=xi+ywi;
assign x1r=xr+(~ywr+1);
assign x1i=xi+(~ywi+1);
endmodule
module bfly4(xr,xi,yr,yi,wr,wi,x0r,x0i,x1r,x1i); // sub module
input [7:0]xr,xi,yr,yi,wr,wi;
output [7:0]x0r,x0i,x1r,x1i;
wire [15:0]p1,p2;
wire [7:0]win,yrn,yin;
wire [8:0]ywr,ywi;
parameter sht=8'd10; // scale by 2**10: 707/1024 is approximately 0.707
assign yrn=~yr+1;
assign yin=~yi+1;
assign win=~wi+1;
assign p1=(yrn*win)>>sht;
assign p2=(yin*win)>>sht;
assign ywr=(~p1+1)+p2;
assign ywi=p1+p2;
assign x0r=xr+ywr;
assign x0i=xi+ywi;
assign x1r=xr+(~ywr+1);
assign x1i=xi+(~ywi+1);
endmodule
OUTPUT:
ERROR CONTROL CODES
ERROR DETECTION AND CORRECTION CODES:
When a message is transmitted, it has the potential to get scrambled by noise. This is certainly true
of voice messages, and is also true of the digital messages that are sent to and from computers. Now
even sound and video are being transmitted in this manner. A digital message is a sequence of 0s and
1s which encodes a given message. More data will be added to a given binary message that will help to
detect if an error has been made in the transmission of the message; adding such data is called an error-
detecting code.
More data may also be added to the original message so that errors made in transmission may be
detected, and also to figure out what the original message was from the possibly corrupt message that
was received. This type of code is an error-correcting code. Error detection is the ability to detect errors.
Error correction has an additional feature that enables identification and correction of the errors. Error
detection always precedes error correction. Both can be achieved by having extra, redundant,
or check bits in addition to the data to deduce that there is an error. The original data is
encoded with the redundant bit(s); the new data formed is known as a code word. Coding is the
process of adding redundancy for error detection or correction. It is of two types:
Block codes
Divides the data to be sent into a set of blocks
Extra information attached to each block
Memoryless
Convolutional codes
Treats data as a series of bits, and computes a code over a continuous series
The code computed for a set of bits depends on the current and previous input.
HAMMING CODES:
Hamming Codes are used in detecting and correcting a code. An error-correcting code is an
algorithm for expressing a sequence of numbers such that any errors which are introduced can be
detected and corrected (within certain limitations) based on the remaining numbers. Errors can happen in
a variety of ways. Bits can be added, deleted, or flipped. Errors can happen in fixed or variable codes.
Error-correcting codes are used in CD players, high speed modems, and cellular phones. Error detection
is much simpler than error correction. For example, one or more "check" digits are commonly embedded
in credit card numbers in order to detect mistakes. Hamming codes adopt the parity concept, but
have more than one parity bit. An example of a block code is the (7,4) Hamming code. This is an
error-detecting and error-correcting binary code, which transmits N = 7 bits for every K = 4
source bits.
GENERAL ALGORITHM:
The following general algorithm generates a single-error correcting (SEC) code for any
number of bits.
1. Number the bits starting from 1: bit 1, 2, 3, 4, 5, etc.
2. Write the bit numbers in binary: 1, 10, 11, 100, 101, etc.
3. All bit positions that are powers of two (have only one 1 bit in the binary form of their position)
are parity bits: 1, 2, 4, 8, etc. (1, 10, 100, 1000)
4. All other bit positions, with two or more 1 bits in the binary form of their position, are data bits.
5. Each data bit is included in a unique set of 2 or more parity bits, as determined by the binary
form of its bit position.
ENCODING A (7,4) CODE:
CONSTRUCTION OF G AND H:
The matrix G is called a (canonical) generator matrix of a linear (n,k) code, and H
is called a parity-check matrix. This is the construction of G and H in standard (or
systematic) form. Regardless of form, G and H for linear block codes must satisfy G·H^T = 0, an
all-zeros matrix. For Hamming codes, [7,4,3] = [n, k, d] = [2^m − 1, 2^m − 1 − m, 3] with m = 3.
The parity-check matrix H of a Hamming code is constructed by listing all columns of length m
that are pairwise independent. Thus H is a matrix whose columns are all of the nonzero m-tuples,
where the order of the m-tuples in the columns of the matrix does not matter; in systematic form
the right-hand side is just the (n−k)-identity matrix. G can be obtained from H by taking the
transpose of the left-hand side of H, with the k-identity matrix on the left-hand side of G.
The code generator matrix G and parity-check matrix H are, in systematic form:

G = | 1 0 0 0 1 1 0 |
    | 0 1 0 0 1 0 1 |
    | 0 0 1 0 0 1 1 |
    | 0 0 0 1 1 1 1 |

and

H = | 1 1 0 1 1 0 0 |
    | 1 0 1 1 0 1 0 |
    | 0 1 1 1 0 0 1 |

From the above matrix we have 2^k = 2^4 = 16 code words. The code words of this binary code
can be obtained from c = u·G, with u ∈ F2^4, where F2 is the field with two elements, namely
0 and 1.
Thus the messages are all the 4-tuples (k-tuples). Therefore, (1,0,1,1) gets encoded as
(1,0,1,1,0,1,0).
HAMMING DISTANCE:
Hamming distance = number of bit positions in which two code words differ.
E.g., 10001001 and 10110001 have a distance of 3. If the minimum distance is d, then d single-bit
errors are required to convert one valid code word into another, implying that such an error would
not be detected. In general, to detect k single-bit errors, the minimum Hamming distance must be
D(min) = k + 1. Hence we need code words that have D(min) = 2 + 1 = 3 to detect 2-bit errors.
DECODING OF HAMMING CODE:
The decoding task can be re-expressed as syndrome decoding:

s = H · r

where s is the syndrome, H the parity-check matrix, and r the received vector.
The following two possibilities are:
If the syndrome is zero, that is, all three parity checks agree with the corresponding received bits,
then the received vector is a codeword, and the most probable decoding is given by reading out its first
four bits. Then u is supposed to be the same than u. One can give the situation in that the errors are not
84
detectable. This happens when the error vector is identical to a non null word code. In this case r it is the
sum of two words code and therefore the syndrome is similar to zero. These errors are non-detectable
errors. As there are 2k-1 non-null words code, there are 2k-1 non-detectable errors.
If the syndrome is non-zero, then we are certain that the noise sequence for the present block was non-zero (there is noise in our transmission). Since the received vector is given by r = G^T u + n and H G^T = 0, the syndrome decoding problem is to find the most probable noise vector n satisfying the equation H n = s. Once the error vector is found, the original source sequence is identified.
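A minimal sketch of syndrome decoding for a (7,4) Hamming code, in Python. The parity-check matrix below is one systematic choice consistent with the encoding example (1,0,1,1) -> (1,0,1,1,0,1,0); the report's own H is not reproduced in the text, so treat this layout as an assumption:

```python
import numpy as np

# H = [P^T | I3] for a systematic (7,4) Hamming code (assumed layout)
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])

def syndrome_decode(r):
    """Correct up to one bit error in the received 7-bit vector r."""
    s = H @ r % 2          # syndrome s = H r (mod 2)
    if not s.any():
        return r           # zero syndrome: r is already a codeword
    # non-zero syndrome: the erroneous position is the column of H equal to s
    r = r.copy()
    for i in range(H.shape[1]):
        if np.array_equal(H[:, i], s):
            r[i] ^= 1
            break
    return r

codeword = np.array([1, 0, 1, 1, 0, 1, 0])   # encoding of (1,0,1,1)
r = codeword.copy()
r[2] ^= 1                                    # inject a single-bit error
print(syndrome_decode(r))                    # recovers the codeword
```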
VHDL CODE:
HAMMING ENCODER
entity hamenc is
Port ( datain : in STD_LOGIC_VECTOR (3 downto 0);
p :inout STD_LOGIC_VECTOR(2 downto 0);
hamout : out STD_LOGIC_VECTOR (6 downto 0));
end hamenc;
architecture Behavioral of hamenc is
begin
-- generate check bits; the original listing is truncated here, so the
-- parity equations below are reconstructed to match the example
-- (1,0,1,1) -> (1,0,1,1,0,1,0)
p(0) <= datain(3) xor datain(2) xor datain(0);
p(1) <= datain(3) xor datain(1) xor datain(0);
p(2) <= datain(2) xor datain(1) xor datain(0);
hamout <= datain & p; -- code word = data bits followed by check bits
end Behavioral;
HAMMING DECODER (listing truncated in the source)
The decoder recomputes the parity checks on the received word rxd to form the
syndrome and, when no error is flagged, reads the data bits out directly:
IF syndrome = "000" THEN
dataout(3 DOWNTO 0) <= rxd(6 DOWNTO 3); -- no error: pass data through
87
SIMULATION RESULTS
ENCODER OUTPUT:
DECODER OUTPUT:
88
INFERENCE:
A Hamming code word of n bits consists of m data bits and r parity (or check) bits,
i.e., n = m + r.
It can detect D(min) - 1 errors and can correct (D(min) - 1)/2 errors. Hence to correct k errors we need D(min) = 2k + 1; at least a distance of 3 is needed to correct a single-bit error.
89
CRYPTOGRAPHIC ALGORITHMS FOR FPGAS
Many communication systems use data-stream ciphers to protect relevant information.
The key sequence K is more or less a pseudorandom sequence (known to the sender and the receiver), and with the modulo-2 property of the XOR function the plaintext P can be reconstructed at the receiver side, because
P ⊕ K ⊕ K = P ⊕ 0 = P.
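This modulo-2 property is easy to verify in software; in the Python sketch below, the key-stream bytes are arbitrary values chosen for illustration:

```python
def xor_cipher(data: bytes, key: bytes) -> bytes:
    # Encrypt and decrypt are the same operation:
    # bytewise XOR of the data with the key stream.
    return bytes(d ^ k for d, k in zip(data, key))

plaintext = b"SECRET"
keystream = bytes([0x5A, 0x13, 0x7E, 0x01, 0xC4, 0x99])  # assumed key stream
ciphertext = xor_cipher(plaintext, keystream)
print(xor_cipher(ciphertext, keystream))  # -> b'SECRET', since P ^ K ^ K = P
```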
We shall discuss two encryption algorithms, namely:
Linear Feedback Shift Registers (LFSR) algorithm
Data Encryption Standard (DES)
Neither algorithm requires large tables and both are suitable for an FPGA implementation.
LINEAR FEEDBACK SHIFT REGISTERS ALGORITHM
LFSRs with maximal sequence length are a good approach for an ideal security key,
because they have good statistical properties. In other words, it is difficult to analyze the
sequence in a cryptographic attack, an analysis called cryptanalysis. Because bitwise designs
are possible with FPGAs, such LFSRs are realized more efficiently with FPGAs than with PDSPs.
Two possible realizations of a LFSR of length 8 are shown in Fig. 1.1.
Fig 1.1. Possible realizations of LFSRs. (a) Fibonacci configuration. (b) Galois configuration.
For the XOR LFSR there is always the possibility of the all-zero word, which should
never be reached. If the cycle starts with any nonzero word, the cycle length is always 2^l - 1.
Since an FPGA typically wakes up in the all-zero state, it is sometimes more convenient to use a mirrored or inverted LFSR circuit, substituting the XOR with a NOT XOR, or XNOR, gate; the all-zero word is then a valid pattern and the circuit produces exactly the inverted sequence. Such LFSRs can easily be designed using a PROCESS statement in VHDL, as the following
example shows.
example shows.
90
The following VHDL code implements an LFSR of length 6.
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
ENTITY lfsr IS ------> Interface
PORT ( clk : IN STD_LOGIC;
y : OUT STD_LOGIC_VECTOR(6 DOWNTO 1));
END lfsr;
ARCHITECTURE fpga OF lfsr IS
SIGNAL ff : STD_LOGIC_VECTOR(6 DOWNTO 1)
:= (OTHERS => '0');
BEGIN
PROCESS -- Implement length-6 LFSR with xnor
BEGIN
WAIT UNTIL clk = '1';
-- feedback and shift; the original line is truncated, taps assumed
ff(1) <= NOT (ff(5) XOR ff(6));
FOR I IN 6 DOWNTO 2 LOOP
ff(I) <= ff(I-1);
END LOOP;
END PROCESS;
y <= ff; -- connect state register to output
END fpga;
91
Note that a complete cycle of an LFSR sequence fulfills the three criteria for optimal
length-(2^l - 1) pseudorandom sequences:
1) The number of 1s and 0s in a cycle differs by no more than one.
2) Runs of length k (e.g., a 111 sequence or a 000 sequence) make up a
fraction 1/2^k of all runs.
3) The autocorrelation function C(τ) is constant for τ in [1, n - 1].
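A software model of a Fibonacci LFSR can confirm the maximal sequence length and the balance property. The Python sketch below uses the primitive polynomial x^6 + x^5 + 1 (this tap choice is an assumption; any primitive degree-6 polynomial gives a period of 2^6 - 1 = 63):

```python
def lfsr_cycle(nbits=6, seed=1):
    """Left-shift Fibonacci LFSR; feedback = MSB XOR LSB, i.e. the
    recurrence a[n+6] = a[n+5] XOR a[n] (polynomial x^6 + x^5 + 1)."""
    mask = (1 << nbits) - 1
    state, out = seed, []
    while True:
        out.append(state & 1)                      # output the LSB
        fb = ((state >> (nbits - 1)) ^ state) & 1  # MSB XOR LSB
        state = ((state << 1) | fb) & mask
        if state == seed:                          # back at the start state
            return out

seq = lfsr_cycle()
print(len(seq), seq.count(1), seq.count(0))  # -> 63 32 31
```

The output confirms criterion 1: over one full cycle of 63 bits, the counts of 1s and 0s (32 and 31) differ by exactly one.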
DES BASED ALGORITHM:
The data encryption standard (DES), outlined in Fig. 1.3, is typically used as a block
cipher. By selecting the output feedback mode (OFB), it is also possible to use the modified DES in a data-stream cipher.
Fig 1.3. State machine for a block encryption system (DES)
92
PRINCIPLE:
The DES comprises a finite state machine translating plaintext blocks into ciphertext
blocks. First the block to be substituted is loaded into the state register (32 bits). Next it is
expanded (to 48 bits), combined with the key (also 48 bits), and substituted in eight 6-to-4-bit S-boxes. Finally, permutations of single bits are performed. This cycle may be applied several times (if desired, with a changing key).
In the DES, the key is usually shifted one or two bits so that after 16 rounds the key is
back in its original position. Because the DES can therefore be seen as an iterative application
of the Feistel cipher (shown in Fig. 1.4), the S-boxes need not be invertible. To simplify an
FPGA realization some modifications are useful: the length of the state register is reduced to
25 bits, no expansion is used, and the final permutations listed in Table 1.1 are applied.
Because most FPGAs only have four- to five-input look-up tables (LUTs), S-boxes with five
inputs have been designed.
Fig 1.4. Principle of the Feistel Network
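The key property of the Feistel network, that decryption works even when the round function f is not invertible, can be sketched as follows. The toy round function and key values are arbitrary assumptions for illustration:

```python
def feistel_encrypt(left, right, keys, f):
    # One round maps (L, R) -> (R, L XOR f(R, k)); f need not be invertible.
    for k in keys:
        left, right = right, left ^ f(right, k)
    return left, right

def feistel_decrypt(left, right, keys, f):
    # Undo the rounds in reverse key order using the very same f.
    for k in reversed(keys):
        left, right = right ^ f(left, k), left
    return left, right

def f(x, k):
    # deliberately non-invertible toy round function (assumed)
    return ((x * k) ^ (x >> 3)) & 0xFFFF

keys = [0x3A7, 0x1C4, 0x9F2]
ct = feistel_encrypt(0x1234, 0xABCD, keys, f)
print(feistel_decrypt(*ct, keys, f) == (0x1234, 0xABCD))  # -> True
```

The round structure itself guarantees invertibility: each half needed to recompute f is passed through unchanged.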
A reasonable test for S-boxes is the dependency matrix. This matrix shows, for every
input/output combination, the probability that an output bit changes if an input bit is changed.
With the avalanche effect the ideal probability is 1/2.
Table 1.1 Table for Permutation
93
Since there are 2^5 = 32 possible input vectors for each S-box, the ideal value is 16. A
random generator was used to generate the S-boxes. The reason that some values differ much
from the ideal 16 may lie in the desired inversion.
Though DES was considered secure until the late 1990s, it was successfully cracked in 1997 and repeatedly thereafter, so more complex encryption algorithms such as AES and Triple DES are
used in practice nowadays.
94
FPGA DESIGN OF LMS ALGORITHM
The Widrow-Hoff least mean square (LMS) adaptive algorithm is a practical method for
finding a close approximation to the optimal (Wiener) filter solution in real time.
It is a very simple algorithm and it does not require explicit measurement of the
correlation functions, nor does it involve matrix inversion.
The LMS algorithm is an implementation of the method of steepest descent.
According to this method, the next filter coefficient vector f[n + 1] is equal to the
present filter coefficient vector f[n] plus a change proportional to the negative gradient:
f[n + 1] = f[n] - μ∇[n]
where the parameter μ is the learning factor or step size that controls stability and the
rate of convergence of the algorithm. During each iteration the true gradient is
represented by ∇[n].
The LMS algorithm estimates an instantaneous gradient in a crude but efficient manner
by assuming that the gradient of J = e^2[n] is an estimate of the gradient of the mean-
square error E{e^2[n]}. The relationship between the true gradient ∇[n] and the
estimated gradient is given by the following expression:
∇[n] ≈ ∂e^2[n]/∂f[n] = -2 e[n] x[n]
Therefore the coefficient update equation becomes
f[n + 1] = f[n] + 2 μ e[n] x[n]
95
FIG: LMS configuration
Although the LMS algorithm makes use of gradients of mean-square error functions, it does not require explicit squaring, averaging, or differentiation. The algorithm is simple and generally
easy to implement.
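The update rule above is simple enough to demonstrate in a few lines. The following Python sketch uses the LMS filter to identify an unknown 4-tap FIR system; the system coefficients and step size are arbitrary choices for the demo:

```python
import numpy as np

h = np.array([0.8, -0.4, 0.2, 0.1])   # unknown system (assumed for the demo)
L = len(h)
rng = np.random.default_rng(0)
x = rng.standard_normal(2000)          # input signal
d = np.convolve(x, h)[:len(x)]         # desired signal: unknown-system output

f = np.zeros(L)                        # adaptive coefficients f[n]
mu = 0.01                              # step size

for n in range(L, len(x)):
    xv = x[n:n-L:-1]                   # input vector (x[n], ..., x[n-L+1])
    e = d[n] - f @ xv                  # error e[n] = d[n] - y[n]
    f = f + 2 * mu * e * xv            # f[n+1] = f[n] + 2*mu*e[n]*x[n]

print(np.round(f, 3))                  # converges toward h
```

With noise-free data the error is driven to zero and the coefficients converge to the unknown system response.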
The LMS algorithm is convergent in the mean square if and only if the step-size
parameter satisfies
0 < μ < 2/λmax
where λmax is the largest eigenvalue of the correlation matrix of the input data.
To ensure fast convergence, a more tightly bounded step size is
0 < μ < 2/((L + 1) rxx[0])
where L is the filter order and rxx[0] is the autocorrelation of the input at lag zero.
For higher-order filters the upper bound can be relaxed by a factor of 3.
Normalized LMS
The LMS algorithm discussed so far uses a constant step size proportional to the
stability bound. Obviously this requires knowledge of the signal
statistic, i.e., rxx[0], and this statistic must not change over time.
It is, however, possible that this statistic changes over time, and we then wish to adjust the step size accordingly, i.e., to use a time-varying step-size parameter μ[n].
The normalized step size is given by
μ[n] = μ / (x^T[n] x[n])
If we are concerned that the denominator can temporarily become very small and μ[n] therefore too large, we may add a small constant c to x^T[n] x[n], which yields
μ[n] = μ / (c + x^T[n] x[n])
Therefore the coefficient update equation for the NLMS is
f[n + 1] = f[n] + 2 μ e[n] x[n] / (c + x^T[n] x[n])
where x^T[n] x[n] = ||x[n]||^2 is the squared norm of the input.
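The normalized update can be sketched as a one-line modification of the LMS step. In this Python sketch the system taps, μ, and the regularization constant c are illustrative values:

```python
import numpy as np

def nlms_step(f, xv, d, mu_bar=0.5, c=1e-6):
    """One NLMS iteration: step size normalized by c + x^T x."""
    e = d - f @ xv                           # a-priori error
    f = f + mu_bar * e * xv / (c + xv @ xv)  # normalized coefficient update
    return f, e

# identify a 2-tap system (assumed values) with repeated NLMS steps
h = np.array([0.5, -0.25])
rng = np.random.default_rng(1)
f = np.zeros(2)
for _ in range(500):
    xv = rng.standard_normal(2)
    f, e = nlms_step(f, xv, h @ xv)
print(np.round(f, 3))  # close to [0.5, -0.25]
```

Because the step is scaled by the instantaneous input power, the same μ works regardless of the input signal level, which is exactly the point of the normalization.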
PIPELINED LMS FILTER:
This method is used to increase the throughput of the adaptive system.
The optimal number of pipeline stages can be computed as follows. For the (b x b)-bit multiplier a total of log2(b) stages are needed; for the adder tree an additional log2(L) pipeline stages are sufficient; and one additional stage is needed for the computation of the
error. The coefficient update multiplication requires an additional log2(b) pipeline
stages.
The total number of pipeline stages for maximum throughput is therefore
2 log2(b) + log2(L) + 1
where we have assumed that μ is a power-of-two constant, so that the scaling with μ can be done without the need of additional pipeline stages. If, however, the normalized LMS is
used, then μ[n] will no longer be a constant and, depending on the bit width of μ[n], additional pipeline stages will be required.
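Summing the stage counts listed above gives the total pipeline depth; the bit width and filter length below are example choices:

```python
import math

def lms_pipeline_stages(b, L):
    # multiplier: log2(b), adder tree: log2(L), error computation: 1,
    # coefficient-update multiplier: log2(b)
    return 2 * int(math.log2(b)) + int(math.log2(L)) + 1

print(lms_pipeline_stages(8, 8))   # 2*3 + 3 + 1 = 10 stages
```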
BLOCK TRANSFORMATION USING FFTs:
LMS algorithms that solve the filter coefficient adjustment in a transform domain have been
proposed for two reasons,
The goal of the fast convolution techniques is to lower the computational effort by using a block update and computing the adaptive filter output and
the filter coefficient adjustment in the transform domain with the help of a fast cyclic
convolution algorithm.
The second method, which uses transform-domain techniques, has as its main goal to improve the adaptation rate of the LMS algorithm, because it is possible to find transforms that
allow a decoupling of the modes of the adaptive filter. The coefficient update equation is applied once per block of B samples, and the step size can be reduced to
μB = μ/B
for a block update of B steps each.
Choice of block size: B = L is the optimal choice from the viewpoint of computational complexity; B ≠ L introduces redundant operations in the adaptation and is not optimal.
VHDL CODE:
LIBRARY lpm;
USE lpm.lpm_components.ALL;
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
USE ieee.std_logic_signed.ALL;
ENTITY fir_lms IS
GENERIC (W1 : INTEGER := 8;
W2 : INTEGER := 16;
L : INTEGER := 2 );
PORT ( clk : IN STD_LOGIC;
x_in : IN STD_LOGIC_VECTOR(W1-1 DOWNTO 0);
d_in : IN STD_LOGIC_VECTOR(W1-1 DOWNTO 0);
e_out, y_out : OUT STD_LOGIC_VECTOR(W2-1 DOWNTO 0);
f0_out, f1_out : OUT STD_LOGIC_VECTOR(W1-1 DOWNTO 0));
END fir_lms;
ARCHITECTURE flex OF fir_lms IS
SUBTYPE N1BIT IS STD_LOGIC_VECTOR(W1-1 DOWNTO 0);
SUBTYPE N2BIT IS STD_LOGIC_VECTOR(W2-1 DOWNTO 0);
TYPE ARRAY_N1BIT IS ARRAY (0 TO L-1) OF N1BIT;
TYPE ARRAY_N2BIT IS ARRAY (0 TO L-1) OF N2BIT;
SIGNAL d : N1BIT;
SIGNAL emu : N1BIT;
SIGNAL y, sxty : N2BIT;
SIGNAL e, sxtd : N2BIT;
SIGNAL x, f : ARRAY_N1BIT;
SIGNAL p, xemu : ARRAY_N2BIT;
BEGIN
dsxt: PROCESS (d) -- sign-extend d to a 16-bit value
BEGIN
sxtd(7 DOWNTO 0) <= d;
sxtd(15 DOWNTO 8) <= (OTHERS => d(d'high));
END PROCESS;
-- the remaining statements are reconstructed; the original listing is
-- truncated in the source
store: PROCESS -- tapped delay line and coefficient update
BEGIN
WAIT UNTIL clk = '1';
d <= d_in;
x(0) <= x_in;
x(1) <= x(0);
f(0) <= f(0) + xemu(0)(15 DOWNTO 8); -- f[n+1] = f[n] + mu*e[n]*x[n]
f(1) <= f(1) + xemu(1)(15 DOWNTO 8);
END PROCESS;
MulGen1: FOR I IN 0 TO L-1 GENERATE
FIR: lpm_mult -- p(I) = x(I) * f(I)
GENERIC MAP (LPM_WIDTHA => W1, LPM_WIDTHB => W1,
LPM_REPRESENTATION => "SIGNED", LPM_WIDTHP => W2)
PORT MAP ( dataa => x(I), datab => f(I),
result => p(I));
END GENERATE;
y <= p(0) + p(1); -- filter output
y_out <= y;
sxty <= y;
e <= sxtd - sxty; -- error = desired - output
e_out <= e;
emu <= e(8 DOWNTO 1); -- e * mu with mu = 1/2 realized as a shift (assumed)
MulGen2: FOR I IN 0 TO L-1 GENERATE
FUPDATE: lpm_mult -- xemu(I) = x(I) * emu for the coefficient update
GENERIC MAP (LPM_WIDTHA => W1, LPM_WIDTHB => W1,
LPM_REPRESENTATION => "SIGNED", LPM_WIDTHP => W2)
PORT MAP ( dataa => x(I), datab => emu,
result => xemu(I));
END GENERATE;
f0_out <= f(0);
f1_out <= f(1);
END flex;
99
OUTPUT:
APPLICATIONS:
Interference cancellation
Prediction
Inverse modelling
100
DIGITAL UP CONVERTER
An ideal Software Defined Radio base station would perform all signal processing tasks
in the digital domain. However, current-generation wideband data converters cannot support the
processing bandwidth and dynamic r