A presentation report based on Uwe Meyer-Baese, "Digital Signal Processing with Field Programmable Gate Arrays".
FPGA SEMINAR REPORT UNIT-4
CONTENTS
S NO TITLE
1 BINARY ADDERS
2 BINARY MULTIPLIER
3 BINARY DIVIDER
4 FIR FILTERS
5 IIR FILTERS
6 DECIMATION
7 INTERPOLATION
8 MULTISTAGE DECIMATION
9 POLYPHASE DECIMATION
10 FILTER BANKS
11 DIT-FFT ALGORITHM
12 DIF-FFT ALGORITHM
13 ERROR CONTROL CODING
14 CRYPTOGRAPHIC ALGORITHM
15 LMS ALGORITHM
16 DIGITAL UP CONVERTER
17 DIGITAL DOWN CONVERTER
BINARY ADDERS
Addition is the most commonly performed arithmetic operation in digital systems. An adder
is a combinational circuit that combines two arithmetic operands using the rules of addition,
and it is a basic building block of any DSP system. An adder can also perform subtraction if
the subtrahend is supplied in 2's complement form. The following are the various types of adders:
Half Adders
Full Adders
Binary (Multi Bit) Adders
o Ripple Adders
o Carry Look Ahead Adders
o Pipeline Adders
o Modulo Adders
A basic binary N-bit adder/subtractor consists of N full-adders (FA). A full-adder
implements the following Boolean equations.
The sum is defined by:
s_k = x_k XOR y_k XOR c_k
The carry (out) bit is computed with:
c_{k+1} = (x_k AND y_k) OR (x_k AND c_k) OR (y_k AND c_k) = x_k y_k + x_k c_k + y_k c_k
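These two equations can be checked with a small software model. The following Python sketch (function names are ours, purely illustrative) chains full adders bit by bit into an N-bit ripple adder:

```python
def full_adder(x, y, c):
    """One full adder: returns (sum bit, carry-out) per the Boolean equations."""
    s = x ^ y ^ c                        # s_k = x_k XOR y_k XOR c_k
    c_out = (x & y) | (x & c) | (y & c)  # c_{k+1} = x_k y_k + x_k c_k + y_k c_k
    return s, c_out

def ripple_add(x, y, n=8):
    """Add two n-bit numbers bit by bit, rippling the carry through all stages."""
    c = 0
    result = 0
    for k in range(n):
        s, c = full_adder((x >> k) & 1, (y >> k) & 1, c)
        result |= s << k
    return result, c  # n-bit sum and the final carry-out
```

The final carry-out is exactly the signal that limits the speed of a plain ripple adder, which motivates the pipelined structures below.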
PIPELINED ADDERS:
Pipelining is extensively used in DSP solutions due to the intrinsic dataflow regularity of
DSP algorithms. Programmable digital signal processor MACs typically carry at least four
pipelined stages. The processor concurrently:
1) decodes the command,
2) loads the operands into registers,
3) performs the multiplication and stores the product, and
4) accumulates the products.
The pipelining principle can be applied to FPGA designs as well, at little or no additional
cost since each logic element contains a flip-flop, which is otherwise unused, to save routing
resources. With pipelining it is possible to break an arithmetic operation into small primitive
operations, save the carry and the intermediate values in registers, and continue the
calculation in the next clock cycle. Such adders are sometimes called carry save adders
(CSAs) in the literature.
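The carry-save idea can be illustrated in software before looking at the VHDL. A minimal Python sketch (names illustrative; the widths mirror the generics of the pipeline_add entity below) splits the operands into LSB and MSB blocks, saves the LSB carry, and folds it into the MSB sum in the next "stage":

```python
def pipelined_add(x, y, width1=7):
    """Two-stage split addition, modeled combinationally for clarity.

    Stage 1: add the width1 LSBs of x and y, and the MSB parts, separately;
             the LSB carry is 'saved' (in hardware: held in a register).
    Stage 2: add the saved carry into the MSB partial sum.
    """
    mask = (1 << width1) - 1
    r1 = (x & mask) + (y & mask)        # LSB add; carry lands in bit width1
    r2 = (x >> width1) + (y >> width1)  # MSB add; carry from LSBs deferred
    s1 = r1 & mask                      # registered LSB sum
    s2 = r2 + (r1 >> width1)            # stage 2: add the saved carry
    return (s2 << width1) | s1
```

In hardware each stage fits into one clock cycle, so the critical path never spans the full word length.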
The block diagram of pipeline adder is shown in Figure 1.
Figure 1: Block Schematic of Pipeline adder
VHDL CODE FOR PIPELINE ADDER:
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
USE ieee.std_logic_unsigned.ALL;
ENTITY pipeline_add IS
GENERIC (WIDTH : INTEGER := 15;
WIDTH1 : INTEGER := 7; WIDTH2 : INTEGER := 8);
PORT (x,y : IN STD_LOGIC_VECTOR(WIDTH-1 DOWNTO 0);
sum : OUT STD_LOGIC_VECTOR(WIDTH-1 DOWNTO 0);
LSBs_Carry : OUT STD_LOGIC;
clk : IN STD_LOGIC);
END pipeline_add;
ARCHITECTURE struct OF pipeline_add IS
SIGNAL l1, l2, s1 : STD_LOGIC_VECTOR(WIDTH1-1 DOWNTO 0);
SIGNAL r1 : STD_LOGIC_VECTOR(WIDTH1 DOWNTO 0);
SIGNAL l3, l4, r2, s2 : STD_LOGIC_VECTOR(WIDTH2-1 DOWNTO 0);
BEGIN
PROCESS
BEGIN
5
WAIT UNTIL clk = '1';
-- First stage of the adder
l1 <= x(WIDTH1-1 DOWNTO 0);
l2 <= y(WIDTH1-1 DOWNTO 0);
r1 <= ('0' & l1) + ('0' & l2); -- Add LSBs of x and y
l3 <= x(WIDTH-1 DOWNTO WIDTH1);
l4 <= y(WIDTH-1 DOWNTO WIDTH1);
r2 <= l3 + l4; -- Add MSBs of x and y
-- Second stage of the adder
s1 <= r1(WIDTH1-1 DOWNTO 0);
s2 <= r1(WIDTH1) + r2; -- Add the saved LSB carry to the MSB sum
END PROCESS;
LSBs_Carry <= r1(WIDTH1);
sum <= s2 & s1; -- Build a single output word
END struct;
MODULO ADDERS
Modulo adders are the most important building blocks in RNS-DSP designs. They are used
for both additions and, via index arithmetic, for multiplications.
The block diagram of modulo adder is shown in Figure 2.
Figure 2: Block Schematic of Modulo adder
VERILOG CODE FOR MODULO-256 ADDER:
module mod_add (input [7:0] x, input [7:0] y, output [8:0] Sum);
parameter m = 256; // modulus
wire [8:0] x1, x2;
assign x1 = x + y; // 9-bit raw sum, range 0..510
assign x2 = x1 - m; // candidate result after modulo reduction
assign Sum = (x1 >= m) ? x2 : x1; // conditional subtraction of the modulus
endmodule
MODULO-256 ADDER SIMULATION RESULTS:
SUMMARY OF BINARY ADDERS:
Ripple Carry Adders: add bit by bit; the longest delay comes from the ripple of the carry through all stages. Carry-skip, carry-lookahead, conditional-sum, or carry-select adder techniques are employed to reduce this delay.
Adders implemented using modern FPGAs/CPLDs possess very fast dedicated ripple-carry logic, an order of magnitude faster than a carry computed through the general-purpose logic.
In pipelined adders, the number of pieces into which the addition is broken depends on the number of logic elements (LEs) and flip-flops (FFs) in each LAB of the FPGA/CPLD.
For example, in Altera's Cyclone II devices, with an LAB of 16 LEs and 16 FFs per pipeline element, a reasonable choice is a maximum block size of 15 bits. The feasible breakup is shown below:
With one additional pipeline stage we can build adders up to a length of 15 + 16 = 31 bits.
With two pipeline stages we can build adders up to 15 + 15 + 16 = 46 bits.
With three pipeline stages we can build adders up to 15 + 15 + 15 + 16 = 61 bits.
Although the number of flip-flops in one LAB is 16, we need an extra flip-flop for the carry-out, so only the block holding the MSBs can be 16 bits wide.
BINARY MULTIPLIER
Since we always multiply by either 0 or 1, the partial products are always either 0000 or
the multiplicand (1101 in this example).
There are four partial products, which are added to form the result.
We can add them in pairs, using three adders.
Even though the product has up to 8 bits, we can use 4-bit adders if we stagger them
leftwards, like the partial products themselves.
If the multiplicand is k bits and the multiplier is j bits, then
o k*j AND gates are required.
o (j-1) k-bit adders are required.
Example: to multiply 1101 by 111 we require
o 4*3 = 12 AND gates
o (3-1) = 2 four-bit adders.
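The partial-product scheme can be modeled in a few lines of Python (function name illustrative): each multiplier bit ANDs the multiplicand, the result is staggered left, and the staggered partial products are summed:

```python
def binary_multiply(multiplicand, multiplier):
    """Shift-and-add multiplication: partial products are 0 or the multiplicand."""
    product = 0
    j = 0
    while multiplier >> j:
        if (multiplier >> j) & 1:          # partial product is the multiplicand...
            product += multiplicand << j   # ...staggered j places to the left
        # otherwise the partial product is 0 and contributes nothing
        j += 1
    return product
```

For the example above, 1101 x 111 produces three staggered partial products summing to 91.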
A 2*2 BINARY MULTIPLIER
The AND gates produce the partial products.
For a 2-bit by 2-bit multiplier, we can just use two half adders to sum the partial products. In general, though, we'll need full adders.
Here C3-C0 are the product, not carries!
A 4*4 MULTIPLIER CIRCUIT
Here the multiplier and multiplicand are both 4 bits wide.
So to implement the multiplier we need 4*4 = 16 AND gates and three 4-bit adders.
The first adder of each 4-bit adder stage can be a half adder, because cin is always zero
for that position.
VERILOG CODE
module HA(sout, cout, a, b);
output sout, cout;
input a, b;
assign sout = a ^ b;
assign cout = (a & b);
endmodule
module FA(sout, cout, a, b, cin);
output sout, cout;
input a, b, cin;
assign sout = (a ^ b ^ cin);
assign cout = ((a & b) | (a & cin) | (b & cin));
endmodule
module multiply4bits(product,a,b);
output [7:0]product;
input [3:0]a;
input [3:0]b;
assign product[0]=(a[0]&b[0]);
wire x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17;
HA HA1(product[1],x1,(a[0]&b[1]),(a[1]&b[0]));
FA FA1(x2,x3,(a[1]&b[1]),(a[2]&b[0]),x1);
FA FA2(x4,x5,(a[2]&b[1]),(a[3]&b[0]),x3);
HA HA2(x6,x7,(a[3]&b[1]),x5);
HA HA3(product[2],x8,x2,(a[0]&b[2]));
FA FA5(x9,x10,x4,(a[1]&b[2]),x8);
FA FA4(x11,x12,x6,(a[2]&b[2]),x10);
FA FA3(x13,x14,x7,(a[3]&b[2]),x12);
HA HA4(product[3],x15,x9,(a[0]&b[3]));
FA FA8(product[4],x16,x11,(a[1]&b[3]),x15);
FA FA7(product[5],x17,x13,(a[2]&b[3]),x16);
FA FA6(product[6],product[7],x14,(a[3]&b[3]),x17);
endmodule
SIMULATION RESULT
DIVIDERS
Of all four basic arithmetic operations division is the most complex. Consequently, it is
the most time-consuming operation and also the operation with the largest number of different
algorithms to be implemented. For a given dividend (or numerator) N and divisor (or
denominator) D the division produces (unlike the other basic arithmetic operations) two results:
the quotient Q and the remainder R, i.e.,
N/D = Q and R with |R| < D.
However, we may think of division as the inverse process of multiplication, as
demonstrated through the following equation,
N = D Q + R,
It differs from multiplication in many aspects. Most importantly, in multiplication all
partial products can be produced in parallel, while in division each quotient bit is determined
in a sequential trial-and-error procedure.
For example, 234/50 can give Q=5 with R=-16, or Q=4 with R=34. We prefer Q=4 and R=34,
i.e., the quotient that leaves a nonnegative remainder.
RESTORING DIVISION:
We align first the denominator and load the numerator in the remainder register. We
then subtract the aligned denominator from the remainder and store the result in the remainder
register. If the new remainder is positive we set the quotient's LSB to 1; otherwise the
quotient's LSB is set to zero and we must restore the previous remainder value by adding the
denominator back. Finally, we have to realign the quotient and denominator for the next step. This
recalculation of the previous remainder is why such an algorithm is called restoring division.
The main disadvantage of restoring division is that we need two steps to determine one
quotient bit. We can combine the two steps using a nonperforming divider algorithm, i.e., each
time the denominator is larger than the remainder, we do not perform the subtraction. The
number of steps is then reduced by a factor of 2.
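The restoring procedure is easy to model in software. A minimal Python sketch (function name illustrative; operands assumed nonnegative with the quotient fitting in `bits` bits):

```python
def restoring_divide(n, d, bits=8):
    """Restoring division: trial-subtract the denominator each step; if the
    remainder goes negative, restore it by adding d back and emit a 0 bit."""
    q, r = 0, 0
    for k in range(bits - 1, -1, -1):
        r = (r << 1) | ((n >> k) & 1)  # bring down the next numerator bit
        r -= d                          # trial subtraction
        if r < 0:
            r += d                      # restore the previous remainder
            q = (q << 1) | 0
        else:
            q = (q << 1) | 1
    return q, r
```

Note the two operations (subtract, then possibly add back) per quotient bit, which is exactly the overhead the nonperforming and nonrestoring variants remove.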
NONRESTORING DIVISION:
The idea behind nonrestoring division is that if restoring division has computed a negative
remainder, i.e., r_{k+1} = r_k - d_k, then in the next step we would restore r_k by adding d_k
and then subtract the next aligned denominator d_{k+1} = d_k/2. So, instead of adding d_k
followed by subtracting d_k/2, we can just skip the restoring step and proceed with adding
d_k/2 whenever the remainder has (temporarily) a negative value. As a result, the quotient
digits can now be positive or negative, i.e., q_k = +1 or -1, but not zero. We can change this
signed-digit representation later to a two's complement representation. In conclusion, every
time the remainder after an iteration is positive we store a 1 and subtract the aligned
denominator, while for a negative remainder we store a -1 in the quotient register and add the
aligned denominator.
Both quotient and remainder are now in two's complement representation and form a valid
result. If we wish to constrain our results so that both have the same sign, we need to
correct a negative remainder, i.e., for r < 0 we apply
r := r + D and q := q - 1.
Such a nonrestoring divider will now run faster than the nonperforming divider, with
about the same Registered Performance as the restoring divider.
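The signed-digit recursion and the final correction step can be sketched in Python (names illustrative; operands assumed nonnegative with n < d·2^bits):

```python
def nonrestoring_divide(n, d, bits=8):
    """Nonrestoring division: never restore; when the remainder is negative,
    add the aligned denominator instead of subtracting, recording digit -1."""
    r = n
    digits = []
    for k in range(bits - 1, -1, -1):
        dk = d << k                      # aligned denominator
        if r >= 0:
            r -= dk
            digits.append(1)             # remainder positive: digit +1
        else:
            r += dk
            digits.append(-1)            # remainder negative: digit -1
    q = sum(dig << k for dig, k in zip(digits, range(bits - 1, -1, -1)))
    if r < 0:                            # final correction: r := r + D, q := q - 1
        r += d
        q -= 1
    return q, r
```

Each step performs exactly one add or subtract, so the two-operations-per-bit penalty of the restoring scheme disappears.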
FAST DIVIDER DESIGN:
The first fast divider algorithm we wish to discuss is the division through
multiplication with the reciprocal of the denominator D. The reciprocal can, for instance, be
computed via a look-up table for small bit width. The general technique for constructing
iterative algorithms, however, makes use of the Newton method for finding a zero.
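For f(x) = 1/x - D, Newton's method gives the well-known iteration x_{k+1} = x_k (2 - D x_k), which converges quadratically to 1/D. A minimal Python sketch (names illustrative; the denominator is normalized into [0.5, 1) so that x0 = 1 is a valid seed):

```python
import math

def reciprocal(d, steps=6):
    """Newton iteration x <- x*(2 - d*x) for 1/d; assumes d in [0.5, 1)."""
    x = 1.0
    for _ in range(steps):
        x = x * (2.0 - d * x)   # error in (1 - d*x) squares every step
    return x

def divide(n, d):
    """N/D = N * (1/D), with D normalized as d = m * 2**e, m in [0.5, 1)."""
    e = math.frexp(d)[1]
    m = d / (2.0 ** e)
    return n * reciprocal(m) / (2.0 ** e)
```

In hardware the first few bits of 1/D typically come from a small look-up table, and one or two Newton steps refine the result; the float model above just illustrates the convergence.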
ARRAY DIVIDER:
Obviously, as with multipliers, all division algorithms can be implemented in a
sequential, FSM-like, way or in the array form. If the array form and pipelining is desired, a
good option will then be to use the lpm_divide block, which implements an array divider with
the option of pipelining, for a detailed description of the lpm_divide block.
CODE:
module divya2(q,out,a,b);
input [7:0]a;//dividend
input [3:0]b;//divisor
output [3:0]out; // remainder
output [4:0]q;//quotient
wire [3:0]r1,r2,r3,r4;
stage s1(q[4],r1[3:0],{1'b1},a[7:4],b[3:0]);
stage s2(q[3],r2[3:0],q[4],{r1[2:0],a[3]},b[3:0]);
stage s3(q[2],r3[3:0],q[3],{r2[2:0],a[2]},b[3:0]);
stage s4(q[1],r4[3:0],q[2],{r3[2:0],a[1]},b[3:0]);
stage s5(q[0],out[3:0],q[1],{r4[2:0],a[0]},b[3:0]);
endmodule
module stage(q,out,t,a,b); // submodule
input [3:0]a;
input [3:0]b;
input t;
output [3:0]out;
output q;
wire [3:0]c;
cas ca1(out[0],c[0],t,b[0],a[0],t);
cas ca2(out[1],c[1],t,b[1],a[1],c[0]);
cas ca3(out[2],c[2],t,b[2],a[2],c[1]);
cas ca4(out[3],c[3],t,b[3],a[3],c[2]);
not n1(q,out[3]);
endmodule
module cas(out,cout,t,divisor,rin,cin);
input t,divisor,rin,cin;
output cout,out;
wire x;
xor x1(x,t,divisor);
fadd f1(out,cout,x,rin,cin);
endmodule
module fadd(s,cout,a,b,c); //full adder submodule
input a,b,c;
output s,cout;
wire w1,w2,w3;
and a1(w1,a,b);
and a2(w2,b,c);
and a3(w3,c,a);
xor x1(s,a,b,c);
or o1(cout,w1,w2,w3);
endmodule
OUTPUT:
FIR FILTERS
An FIR with constant coefficients is an LTI digital filter. The output of an FIR of order or
length L, to an input time-series x[n], is given by a finite version of the convolution sum,
namely:
y[n] = Σ_{k=0}^{L-1} f[k] x[n-k]
where f[0] ≠ 0 through f[L-1] ≠ 0 are the filter's L coefficients. They also correspond to the
filter's impulse response.
For LTI systems it is sometimes more convenient to express this in the z-domain with
Y(z) = F(z) X(z)
where F(z) is the FIR's transfer function, defined in the z-domain by
F(z) = Σ_{k=0}^{L-1} f[k] z^{-k}
The Lth-order LTI FIR filter is graphically interpreted in Fig. 1. It can be seen to consist of
a tapped delay line, adders, and multipliers. One of the operands presented to each multiplier
is an FIR coefficient, often referred to as a tap weight for obvious reasons.
Fig 1: Direct Form FIR filter
The roots of the polynomial F(z) define the zeros of the filter. The presence of only zeros is
the reason FIRs are sometimes called all-zero filters.
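The direct-form structure of Fig. 1 can be modeled in a few lines of Python (function name illustrative): a tapped delay line is shifted once per sample and the weighted taps are summed:

```python
def fir_direct(f, x):
    """Direct-form FIR: y[n] = sum_{k=0}^{L-1} f[k]*x[n-k] via a tapped delay line."""
    L = len(f)
    taps = [0] * L
    y = []
    for sample in x:
        taps = [sample] + taps[:-1]  # shift the delay line, newest sample first
        y.append(sum(fk * tk for fk, tk in zip(f, taps)))
    return y
```

Feeding an impulse reproduces the coefficient list, confirming that the coefficients are the impulse response.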
FIR FILTER WITH TRANSPOSED STRUCTURE
A variation of the direct FIR model is called the transposed FIR filter. It can be constructed
from the FIR filter in Fig. 1 by:
Exchanging the input and output
Inverting the direction of signal flow
Substituting an adder by a fork, and vice versa
A transposed FIR filter is shown in Fig. 2 and is, in general, the preferred implementation of an
FIR filter. The benefit of this structure is that we do not need an extra shift register for
x[n], and there is no need for an extra pipeline stage after the adder (tree) of the products
to achieve high throughput.
Fig 2: Filter with Transposed Structure
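The transposed structure can likewise be sketched in Python (names illustrative): each register now holds a partial sum of products rather than a delayed input, so the input fans out to all multipliers and no adder tree is needed:

```python
def fir_transposed(f, x):
    """Transposed FIR: registers carry partial sums; no input shift register."""
    L = len(f)
    state = [0] * (L - 1)  # partial-sum registers
    y = []
    for sample in x:
        out = f[0] * sample + state[0]
        for i in range(L - 2):
            # each register absorbs a new product plus the next partial sum
            state[i] = f[i + 1] * sample + state[i + 1]
        state[L - 2] = f[L - 1] * sample
        y.append(out)
    return y
```

Its impulse response matches the direct form, only the internal dataflow differs.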
SYMMETRY IN FIR FILTERS
The center of an FIR's impulse response is an important point of symmetry. It is sometimes
convenient to define this point as the 0th sample instant. Such filter descriptions are
a-causal (centered notation). For an odd-length FIR, the a-causal filter model is given by:
F(z) = Σ_{k=-(L-1)/2}^{(L-1)/2} f[k] z^{-k}
The FIR's frequency response can be computed by evaluating the filter's transfer function
around the periphery of the unit circle, by setting z = e^{jωT}. It then follows that:
F(e^{jωT}) = |F(ω)| e^{jΦ(ω)}
We denote by |F(ω)| the filter's magnitude frequency response, while Φ(ω) denotes the phase
response. Digital filters are more often characterized by phase and magnitude than by the
z-domain transfer function or the complex frequency transform.
Table 1: Four possible linear-phase FIR filters
LINEAR-PHASE FIR FILTERS
Maintaining phase integrity across a range of frequencies is a desired system attribute in
many applications, such as communications and image processing. As a result, designing filters
that exhibit linear phase versus frequency is often mandatory. The standard measure of the
phase linearity of a system is the group delay, defined by:
τ_g(ω) = -dΦ(ω)/dω
A perfectly linear-phase filter has a group delay that is constant over a range of frequencies. It
can be shown that linear-phase is achieved if the filter is symmetric or antisymmetric. A
constant group delay can only be achieved if the frequency response F(ω) is a purely real or
imaginary function. This implies that the filter's impulse response possesses even or odd
symmetry. That is:
f[n] = f[-n] or f[n] = -f[-n]
An odd-length, even-symmetry FIR filter would, for example, have a frequency response given by:
F(ω) = f[0] + 2 Σ_{k=1}^{(L-1)/2} f[k] cos(kω)
which is seen to be a purely real function of frequency. Table 1 summarizes the four possible
choices of symmetry, antisymmetry, even order and odd order. In addition, Table 1 graphically
displays an example of each class of linear-phase FIR.
Fig. 3: Linear-phase filter with reduced number of multipliers
The symmetry properties intrinsic to a linear-phase FIR can also be used to reduce the necessary
number of multipliers L, as shown in Fig. 1. Consider the linear-phase FIR shown in Fig. 3
(even symmetry assumed), which fully exploits coefficient symmetry. Observe that the
symmetric architecture has a multiplier budget per filter cycle exactly half of that found in
the direct architecture shown in Fig. 1 (L/2 versus L multipliers), while the number of adders
remains constant at L-1.
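The folded structure of Fig. 3 can be sketched in Python (names illustrative; even symmetry and odd length assumed): mirrored taps are pre-added before being scaled, so each coefficient is multiplied only once:

```python
def fir_symmetric(f_half, x):
    """Folded even-symmetric FIR of length L = 2*len(f_half) - 1.

    f_half holds f[0] .. f[(L-1)//2]; mirrored taps share one multiplier."""
    L = 2 * len(f_half) - 1
    taps = [0] * L
    y = []
    for sample in x:
        taps = [sample] + taps[:-1]
        acc = f_half[-1] * taps[len(f_half) - 1]            # center tap
        for k in range(len(f_half) - 1):
            acc += f_half[k] * (taps[k] + taps[L - 1 - k])  # pre-add mirror pair
        y.append(acc)
    return y
```

An impulse recovers the full symmetric coefficient set even though only half of it is stored.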
DESIGNING FIR FILTERS
There are two methods for FIR Filter design:
Direct Window Design Method
Equiripple Design Method
DIRECT WINDOW DESIGN METHOD
The discrete Fourier transform (DFT) establishes a direct connection between the frequency and
time domains. Since the frequency domain is the domain of filter definition, the DFT can be
used to calculate a set of FIR filter coefficients that produce a filter that approximates the
frequency response of the target filter. A filter designed in this manner is called a direct FIR
filter. The coefficients of a direct FIR filter are obtained as the inverse DFT of the sampled target frequency response.
Consider a length-16 direct FIR filter design with a rectangular window, shown in Fig. 4a, with
the passband ripple shown in Fig. 4b. Note that the filter provides a reasonable approximation to
the ideal lowpass filter with the greatest mismatch occurring at the edges of the transition band.
The observed ringing is due to the Gibbs phenomenon, which relates to the inability of a
finite Fourier spectrum to reproduce sharp edges. The Gibbs ringing is implicit in the direct
inverse DFT method and can be expected to be about 7% over a wide range of filter orders. To
illustrate this, consider the example filter with length 128, shown in Fig. 4c, with the
passband ripple shown in Fig. 4d. Although the filter length is substantially increased (from
16 to 128), the ringing at the edge still has about the same magnitude. The effects of ringing
can only be suppressed with the use of a data window that tapers smoothly to zero on both
sides. Data windows overlay the FIR's impulse response, resulting in a smoother magnitude
frequency response with an attendant widening of the transition band. If, for instance, a
Kaiser window is applied to the FIR, the Gibbs ringing can be reduced.
Fig. 4: Gibbs phenomenon.(a)Impulse response of FIR lowpass with L=16. (b) Passband of
transfer function L=16. (c)Impulse response of FIR lowpass with L= 128. (d) Passband of
transfer function L= 128.
The most common windows, denoted w[n], are the rectangular, Bartlett (triangular), Hann, Hamming, Blackman, and Kaiser windows.
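The window design procedure can be sketched in Python (names illustrative; a Hamming window and a lowpass with cutoff wc radians/sample are assumed): sample the ideal sin(x)/x response around the center tap and taper it with the window:

```python
import math

def window_design_lowpass(L, wc):
    """Windowed inverse-DFT lowpass design: ideal sinc taps times a Hamming window."""
    mid = (L - 1) / 2
    f = []
    for n in range(L):
        t = n - mid
        # ideal lowpass impulse response sin(wc*t)/(pi*t), with the t = 0 limit
        ideal = wc / math.pi if t == 0 else math.sin(wc * t) / (math.pi * t)
        w = 0.54 - 0.46 * math.cos(2 * math.pi * n / (L - 1))  # Hamming window
        f.append(ideal * w)
    return f
```

The resulting coefficients are symmetric (linear phase) and the DC gain stays close to one, while the window suppresses the Gibbs ringing at the band edge.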
EQUIRIPPLE DESIGN METHOD
A typical filter specification not only includes the passband ωp and stopband ωs edge
frequencies and the ideal gains, but also the allowed deviation (or ripple) from the desired
transfer function. The transition band is most often assumed to be arbitrary in terms of
ripple. A special
class of FIR filter that is particularly effective in meeting such specifications is called the
equiripple FIR. An equiripple design protocol minimizes the maximal deviations (ripple error)
from the ideal transfer function. The equiripple algorithm applies to a number of FIR design
instances. The most popular are:
Lowpass filter design
Hilbert filter, i.e., a unit-magnitude filter that produces a 90° phase shift for all
frequencies in the passband
Differentiator filter that has a linearly increasing frequency magnitude proportional to ω
The equiripple or minimum-maximum algorithm is normally implemented using the Parks-McClellan
iterative method, which produces an equiripple or minimax data fit in the frequency domain.
The length of the polynomial, and therefore the filter, can be estimated for a lowpass with
L ≈ (-10 log10(δp δs) - 13) / (14.6 (ωs - ωp)/(2π)) + 1
where δp is the passband and δs the stopband ripple.
CONSTANT COEFFICIENT FIR DESIGN
The method used for implementing FIR filters in FPGAs is the Constant Coefficient FIR Design
method. The different ways to implement this method are:
Direct design
Transposed form design
Design using Distributed Arithmetic (DA) architecture
DIRECT FIR DESIGN
The direct FIR filter shown in Fig. 1 can be implemented in VHDL using (sequential)
PROCESS statements or by component instantiations of the adders and multipliers. A
PROCESS design provides more freedom to the synthesizer, while component instantiation
gives full control to the designer. To illustrate this, a length-4 FIR will be presented as a
PROCESS design. Although a length-4 FIR is far too short for most practical applications, it is
easily extended to higher orders and has the advantage of a short compiling time. The
linear-phase (therefore symmetric) FIR's impulse response is assumed to be
f[k] = {-1, 3.75, 3.75, -1}.
FOUR-TAP DIRECT FIR FILTER VHDL CODE:
PACKAGE eight_bit_int IS -- User-defined types
SUBTYPE BYTE IS INTEGER RANGE -128 TO 127;
TYPE ARRAY_BYTE IS ARRAY (0 TO 3) OF BYTE;
END eight_bit_int;
LIBRARY work;
USE work.eight_bit_int.ALL;
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
ENTITY fir_srg IS ------> Interface
PORT (clk : IN STD_LOGIC;
x : IN BYTE;
y : OUT BYTE);
END fir_srg;
ARCHITECTURE flex OF fir_srg IS
SIGNAL tap : ARRAY_BYTE := (0,0,0,0);
-- Tapped delay line of bytes
BEGIN
p1: PROCESS ------> Behavioral style
BEGIN
WAIT UNTIL clk = '1';
-- Compute output y with the filter coefficients weight.
-- The coefficients are [-1 3.75 3.75 -1]
y <= 2 * tap(1) + tap(1) + tap(1) / 2 + tap(1) / 4 -- 3.75 * tap(1)
+ 2 * tap(2) + tap(2) + tap(2) / 2 + tap(2) / 4 -- 3.75 * tap(2)
- tap(3) - tap(0);
FOR I IN 3 DOWNTO 1 LOOP
tap(I) <= tap(I-1); -- Tapped delay line: shift one
END LOOP;
tap(0) <= x; -- Input in register 0
END PROCESS;
END flex;
IIR FILTERS
A nonrecursive filter incorporates, as the name implies, no feedback. The impulse response of
such a filter is finite, i.e., it is an FIR filter. A recursive filter, on the other hand has feedback,
and is expected, in general, to have an infinite impulse response, i.e., to be an IIR filter. Figure
4.4a shows filters with separate recursive and nonrecursive parts. A canonical filter is produced
if these recursive and nonrecursive parts are merged together, as shown in Fig. 4.4b. The
transfer function of the filter from Fig. 4.4 can be written as:
F(z) = Y(z)/X(z) = ( Σ_{l=0}^{L-1} b[l] z^{-l} ) / ( 1 - Σ_{l=1}^{L-1} a[l] z^{-l} )
The difference equation for such a system yields:
y[n] = Σ_{l=0}^{L-1} b[l] x[n-l] + Σ_{l=1}^{L-1} a[l] y[n-l]
Comparing this with the difference equation for the FIR filter, we find that the difference
equation for recursive systems depends not only on the L previous values of the input sequence
x[n], but also on the L-1 previous values of y[n].
If we compute the poles and zeros of F(z), we see that the nonrecursive part, i.e., the
numerator of F(z), produces the zeros, while the denominator of F(z) produces the poles.
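The recursive difference equation can be modeled directly in Python (names and the coefficient convention a = [0, a1, a2, ...] are ours, purely illustrative):

```python
def iir_filter(b, a, x):
    """y[n] = sum_l b[l]*x[n-l] + sum_{l>=1} a[l]*y[n-l]: feedback makes it IIR."""
    y = []
    for n in range(len(x)):
        acc = sum(b[l] * x[n - l] for l in range(len(b)) if n - l >= 0)
        acc += sum(a[l] * y[n - l] for l in range(1, len(a)) if n - l >= 0)
        y.append(acc)
    return y
```

With a single feedback coefficient the impulse response decays geometrically and never reaches zero, which is exactly the "infinite impulse response" property.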
FAST IIR FILTER
FIR filter Registered Performance was improved using pipelining. In the case of FIR filters,
pipelining can be achieved at essentially no cost. Pipelining IIR filters, however, is more
sophisticated and is certainly not free. Simply introducing pipeline registers for all adders will,
especially in the feedback path, very likely change the pole locations and therefore the transfer
function of the IIR filter.
The methods that improve IIR filter throughput are:
Look-ahead interleaving in the time domain
Parallel processing
These methods are based on filter architecture or signal flow techniques. These techniques will
be demonstrated with examples. To simplify the VHDL representation of each case, only a first-
order IIR filter will be considered, but the same ideas can be applied to higher-order IIR filters.
TIME-DOMAIN INTERLEAVING
Consider the difference equation of a first-order IIR system, namely
y[n+1] = a y[n] + b x[n].
The output of the first-order system at n+2 can be computed using a look-ahead methodology by
substituting the expression for y[n+1] into the difference equation for y[n+2]. That is
y[n+2] = a^2 y[n] + a b x[n] + b x[n+1].
The equivalent system is shown in Fig. 4.14. This concept can be generalized by applying the
look-ahead transform for (S-1) steps, resulting in:
y[n+S] = a^S y[n] + b Σ_{k=0}^{S-1} a^k x[n+S-1-k]
It can be seen that the sum term defines an FIR filter having coefficients {b, ab, a^2 b, ...,
a^(S-1) b} that can be pipelined using the usual FIR pipelining techniques.
The recursive part of (4.12) can now also be implemented with an S-stage pipelined multiplier
for the coefficient a^S.
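The equivalence of the original recursion and its one-step look-ahead form is easy to verify numerically; a minimal Python sketch (function names illustrative):

```python
def iir_direct(a, b, x):
    """Original first-order recursion y[n] = a*y[n-1] + b*x[n-1], y[0] = 0."""
    y = [0.0]
    for n in range(1, len(x)):
        y.append(a * y[n - 1] + b * x[n - 1])
    return y

def iir_lookahead(a, b, x):
    """Look-ahead form y[n] = a^2*y[n-2] + a*b*x[n-2] + b*x[n-1].

    The feedback loop now spans two delays, leaving room for a pipeline stage."""
    y = [0.0, b * x[0]]
    for n in range(2, len(x)):
        y.append(a * a * y[n - 2] + a * b * x[n - 2] + b * x[n - 1])
    return y
```

Both produce identical output sequences; only the critical feedback path changes.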
The VHDL code shown below implements the IIR filter in look-ahead form.
PACKAGE n_bit_int IS -- User-defined type
SUBTYPE BITS15 IS INTEGER RANGE -2**14 TO 2**14-1;
END n_bit_int;
LIBRARY work;
USE work.n_bit_int.ALL;
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
ENTITY iir_pipe IS
PORT ( x_in : IN BITS15; -- Input
y_out : OUT BITS15; -- Result
clk : IN STD_LOGIC);
END iir_pipe;
ARCHITECTURE fpga OF iir_pipe IS
SIGNAL x, x3, sx, y, y9 : BITS15 := 0;
BEGIN
PROCESS -- Use FFs for input, output and pipeline stages
BEGIN
WAIT UNTIL clk = '1';
x <= x_in;
x3 <= x / 2 + x / 4; -- Compute a*x = 3/4 * x
sx <= x + x3; -- Compute x[n+1] + a*x[n]
y9 <= y / 2 + y / 16; -- Compute a^2*y = 9/16 * y
y <= sx + y9; -- Compute the output
END PROCESS;
y_out <= y; -- Connect register y to the output pins
END fpga;
PARALLEL PROCESSING
In a parallel-processing filter implementation [100], P parallel IIR paths are formed, each
running at a 1/P input sampling rate. They are combined at the output using a multiplexer, as
shown in Fig. 4.18. Because a multiplexer, in general, will be faster than a multiplier and/or
adder, the parallel approach will be faster. Furthermore, each path P has a factor of P more time
to compute its assigned output.
To illustrate, consider again the first-order system y[n+1] = a y[n] + b x[n] and P = 2. The
look-ahead scheme, as in (4.11),
y[n+2] = a^2 y[n] + a b x[n] + b x[n+1],
is now split into even (n = 2k) and odd (n = 2k+1) output sequences, obtaining
y[2k+2] = a^2 y[2k] + a b x[2k] + b x[2k+1]
y[2k+1] = a^2 y[2k-1] + a b x[2k-1] + b x[2k]
where n, k ∈ Z. These two equations are the basis for the following parallel IIR filter FPGA
implementation.
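The even/odd decomposition can be checked in software; a minimal Python sketch (names illustrative) runs the two paths independently, each at half rate, and multiplexes them back together:

```python
def iir_parallel2(a, b, x):
    """P = 2 parallel IIR: even and odd output streams, each spanning two delays."""
    even, odd = [], []
    for n in range(len(x)):
        if n == 0:
            even.append(0.0)                 # y[0] = 0
        elif n == 1:
            odd.append(b * x[0])             # y[1] = b*x[0]
        else:
            path = even if n % 2 == 0 else odd
            # y[n] = a^2*y[n-2] + a*b*x[n-2] + b*x[n-1], within one stream
            path.append(a * a * path[-1] + a * b * x[n - 2] + b * x[n - 1])
    # the output multiplexer interleaves the two streams at the full rate
    return [even[k // 2] if k % 2 == 0 else odd[k // 2] for k in range(len(x))]
```

The interleaved result matches the original full-rate recursion exactly, while each path has two clock periods to compute its next output.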
VHDL CODE:
PACKAGE n_bit_int IS -- User-defined type
SUBTYPE BITS15 IS INTEGER RANGE -2**14 TO 2**14-1;
END n_bit_int;
LIBRARY work;
USE work.n_bit_int.ALL;
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
ENTITY iir_par IS ------> Interface
PORT ( clk, reset : IN STD_LOGIC;
x_in : IN BITS15;
x_e, x_o, y_e, y_o : OUT BITS15;
clk2 : OUT STD_LOGIC;
y_out : OUT BITS15);
END iir_par;
ARCHITECTURE fpga OF iir_par IS
TYPE STATE_TYPE IS (even, odd);
SIGNAL state : STATE_TYPE;
SIGNAL x_even, xd_even : BITS15 := 0;
SIGNAL x_odd, xd_odd, x_wait : BITS15 := 0;
SIGNAL y_even, y_odd, y_wait, y : BITS15 := 0;
SIGNAL sum_x_even, sum_x_odd : BITS15 := 0;
SIGNAL clk_div2 : STD_LOGIC;
BEGIN
Multiplex: PROCESS (reset, clk) --> Split x into even and
BEGIN -- odd samples; recombine y at clk rate
IF reset = '1' THEN -- Asynchronous reset
state <= even;
ELSIF rising_edge(clk) THEN
CASE state IS
WHEN even =>
x_even <= x_in;
x_odd <= x_wait;
clk_div2 <= '1';
y <= y_wait;
state <= odd;
WHEN odd =>
x_wait <= x_in;
clk_div2 <= '0';
y <= y_odd;
y_wait <= y_even;
state <= even;
END CASE;
END IF;
END PROCESS Multiplex;
y_out <= y;
clk2 <= clk_div2;
x_e <= x_even; -- Test signals
x_o <= x_odd;
y_e <= y_even;
y_o <= y_odd;
Arithmetic: PROCESS -- Filter arithmetic at the clk/2 rate; a = 3/4 assumed
BEGIN
WAIT UNTIL clk_div2 = '0';
-- Even stream: y[2k+2] = a^2*y[2k] + a*x[2k] + x[2k+1]
sum_x_even <= x_odd + x_even / 2 + x_even / 4;
y_even <= sum_x_even + y_even / 2 + y_even / 16;
-- Odd stream: y[2k+1] = a^2*y[2k-1] + a*x[2k-1] + x[2k]
xd_even <= x_even;
xd_odd <= x_odd;
sum_x_odd <= xd_even + xd_odd / 2 + xd_odd / 4;
y_odd <= sum_x_odd + y_odd / 2 + y_odd / 16;
END PROCESS Arithmetic;
END fpga;
The design is realized with two PROCESS statements. In the first, PROCESS Multiplex, x is
split into even and odd indexed parts, and the output y is recombined at the clk rate. In addition,
the first PROCESS statement generates the second clock, running at clk/2. The second PROCESS
block implements the filter's arithmetic according to (4.22). The design uses 268 LEs, no
embedded multiplier, and achieves a Registered Performance of 168.12 MHz.
DECIMATION
INTRODUCTION
A frequent task in digital signal processing is to adjust the sampling rate according to the signal
of interest. Systems with different sampling rates are referred to as multirate systems. Two
typical examples of multirate DSP systems are decimation and interpolation. Multirate systems
are sometimes used for sampling-rate conversion, which involves both decimation and
interpolation.
DECIMATION
Decimation can be regarded as the discrete-time counterpart of sampling. Whereas in sampling
we start with a continuous-time signal x(t) and convert it into a sequence of samples x[n], in
decimation we start with a discrete-time signal x[n] and convert it into another discrete-time
signal y[n], which consists of sub-samples of x[n]. Thus, the formal definition of M-fold
decimation, or down-sampling, is
y[n] = x[nM].
In decimation, the sampling rate is reduced from Fs to Fs/M by discarding M-1 samples for
every M samples in the original sequence. A narrow filter followed by a down-sampler is
usually referred to as a decimator.
Fig 1: Block diagram notation of decimation, by a factor of M.
The block diagram notation of the decimation process is depicted in Figure.
An anti-aliasing digital filter precedes the down-sampler to prevent aliasing from
occurring, due to the lower sampling rate. Figure 2 below illustrates the concept of 3-fold
decimation, i.e., M = 3. Here, the samples of x[n] corresponding to n = ..., -2, 1, 4, ... and
n = ..., -1, 2, 5, ... are lost in the decimation process.
In general, the samples of x[n] corresponding to n ≠ kM, where k is an integer, are
discarded in M-fold decimation. Figure 2 shows samples of the decimated signal y[n] spaced
three times wider than the samples of x[n].
In real time, the decimated signal appears at a slower rate than that of the original signal
by a factor of M.
If the sampling frequency of x[n] is Fs, then that of y[n] is Fs/M.
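The filter-plus-down-sampler structure can be sketched in Python (names illustrative; a length-M moving average stands in for a properly designed anti-aliasing lowpass):

```python
def moving_average(x, M):
    """Toy anti-aliasing filter: length-M moving average (illustration only)."""
    return [sum(x[max(0, n - M + 1):n + 1]) / M for n in range(len(x))]

def decimate(x, M):
    """Decimator = lowpass filter followed by the M-fold down-sampler y[n] = x[nM]."""
    return moving_average(x, M)[::M]
```

The slice `[::M]` is exactly the down-sampler: it keeps every Mth sample and discards the rest.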
Fig 2: Decimation of a discrete-time signal by a factor of 3
ANTI ALIASING FILTER
We can reduce the sampling rate only down to the limit called the Nyquist rate, which says
that the sampling rate must be at least twice the bandwidth of the signal in order to avoid
aliasing. Aliasing for a lowpass signal is demonstrated in Fig. 3. Aliasing is irreparable,
and should be avoided at all cost. For a bandpass signal, the frequency band of interest must
fall within an integer band. If fs is the sampling rate and R is the desired down-sampling
factor, then the band of interest must fall between k fs/(2R) and (k+1) fs/(2R) for some
integer k. If it does not, there may be aliasing due to copies from the negative frequency
bands, although the sampling rate may still be higher than the Nyquist rate.
Fig 3: Unaliased and aliased decimation cases.
Fig 4: Decimation of a signal x[n] ↔ X(ω).
DOWN SAMPLER
Down-sampling is the process of reducing the sampling rate of a signal. The down-sampling
factor is usually an integer or a rational fraction greater than one. The sampling rate can be
reduced only down to the limit called the Nyquist rate. A down-sampler with a down-sampling
factor M, where M is a positive integer, develops an output sequence y[n] with a sampling rate
that is (1/M)-th of that of the input sequence x[n].
VHDL CODE
library ieee;
use ieee.std_logic_1164.all;
entity decimator_1 is
port( inseq: in std_logic_vector( 7 downto 0);--input sequence
clk: in std_logic;
reset:in std_logic;
dec_op: out std_logic_vector( 7 downto 0));-- decimated output sequence
end decimator_1;
architecture Behavioral of decimator_1 is
begin
process(clk,inseq)
variable count: integer ;
begin
if reset = '1' then
count := 2; -- count initialized; counts the clock pulses
end if;
if clk = '1' and clk'event then
if (count mod 2 = 0) then -- if count is a multiple of 2,
dec_op <= inseq; -- the input sample is passed to the output
end if;
count := count + 1;
end if;
end process;
end Behavioral;
OUTPUT
Fig: test bench waveform
Fig: simulated output
Input sequence = {8'hff, 8'hfe, 8'hfd, 8'hfc, 8'hfb, 8'hfa}
Output sequence = {8'hff, 8'hfd, 8'hfb}
INTERPOLATION
A frequent task in digital signal processing is to adjust the sampling rate according to the
signal of interest. Systems with different sampling rates are referred to as multirate
systems.
If, after A/D conversion, the signal of interest can be found in a small frequency band
(typically lowpass or bandpass), then it is reasonable to filter with a lowpass or bandpass
filter and to reduce the sampling rate. A narrow filter followed by a down-sampler is usually
referred to as a decimator. Increasing the sampling rate can be useful in the D/A conversion
process, for example. Typically, D/A converters use a first-order sample-and-hold at the
output, which produces a step-like output function. This can be compensated for with an analog
1/sinc(x) compensation filter, but most often a digital solution is more efficient.
We can use, in the digital domain, an expander and an additional filter to get the desired
frequency band. The zeros introduced by the expander produce extra copies of the baseband
spectrum that must first be removed before the signal can be processed by the D/A converter.
For the interpolator, the Noble relation is
F(z)(↑R) = (↑R)F(z^R),
i.e., in an interpolation, putting the filter before the expander results in an R-times
shorter filter.
INTERPOLATION
A process by which the output sampling rate of a signal is increased is known as
interpolation.
An interpolator consists of an up-sampler and an anti-imaging filter:
x(n) -> [up-sample by L] -> v(m) -> [H(z)] -> y(m)
The up-sampling operation simply inserts (L-1) zeros between every two input samples,
producing the intermediate signal
v(m) = x(m/L) for m a multiple of L, and v(m) = 0 otherwise.
The output signal y(m) is obtained by convolving the intermediate signal with the impulse
response h(n):
y(m) = Σ_k h(k) v(m-k)
The spectral property of up-sampling is simple in the z-transform domain: V(z) = X(z^L), so
up-sampling is simply a contraction of the frequency axis by a factor of L.
The original spectrum X(e^jω) over [-π, π].
The original spectrum X(e^jω) over [-5π, 5π].
The up-sampled spectrum.
Interpolation example: for R = 3, x[n] ↔ X(ω) is shown below.
Interpolation in the time domain: here the up-sampling factor is 4, so three zeros are
inserted between every two input samples. Up-sampling is thus an expansion in the time domain.
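The zero-insertion step can be sketched in Python (function name illustrative):

```python
def upsample(x, L):
    """Insert L-1 zeros after each sample: v[m] = x[m/L] when L divides m, else 0."""
    v = []
    for s in x:
        v.append(s)           # keep the original sample
        v.extend([0] * (L - 1))  # insert L-1 zeros
    return v
```

The anti-imaging filter H(z) that follows then removes the spectral copies created by the inserted zeros.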
CODE
entity interpolator is
port(a : in bit_vector(1 to 16);
b : out bit_vector(1 to 32));
end interpolator;
architecture Behavioral of interpolator is
begin
process(a)
begin
for i in 1 to 16 loop
b(2*i-1) <= a(i); -- copy the input sample
b(2*i) <= '0'; -- insert a zero (L = 2 up-sampling)
end loop;
end process;
end Behavioral;
44
MULTISTAGE DECIMATOR
A single decimator stage is repeated (i.e., up to the Pth stage) to obtain the
required output of the multistage decimator.
Block Diagram of Multistage Decimator
If the decimation rate R is large, it can be shown that a multistage design
can be realized with less effort than a single-stage converter. In particular, S stages, each having
a decimation capability of R_k, are designed to have an overall down-sampling rate
of R = R1·R2···RS. Unfortunately, pass-band imperfections, such as ripple deviation, accumulate
from stage to stage. As a result, a pass-band deviation target of εp must normally be tightened
to the order of εp/S to meet the overall system specification. This is obviously a worst-case
assumption, in which all short filters have the maximum ripple at the same frequencies, which
is, in general, too pessimistic. It is often more reasonable to try an initial value near the given
pass-band specification εp, and then selectively reduce it if necessary.
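The overall-rate relation R = R1·R2···RS can be checked for the down-sampling operation itself (filters omitted; an illustrative NumPy sketch):

```python
import numpy as np

# Pure down-sampling commutes across stages: decimating by R1 and then by R2
# selects exactly every (R1*R2)-th sample, i.e., the overall rate R = R1*R2
x = np.arange(48.0)
R1, R2 = 2, 4
y_two_stage = x[::R1][::R2]
y_single = x[::R1 * R2]
print(np.array_equal(y_two_stage, y_single))  # True
```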
MULTISTAGE DECIMATOR DESIGN USING GOODMAN-CAREY HALF-BAND FILTERS:
Goodman and Carey [80] proposed to develop multistage systems based on
the use of CIC and half-band filters. A half-band filter has its pass-band and stop-band
edges located symmetrically about π/2 (ωs = π − ωp), i.e., midway in the baseband. A
half-band filter can therefore be used to change the sampling rate by a factor of two.
If the half-band filter has point symmetry relative to ω = π/2, then all even-indexed
coefficients (except the center tap) become zero.
CIC FILTER:
CIC (cascaded integrator comb) filter is an optimized class of finite impulse response (FIR) filter combined with an interpolator or decimator.
It consists of one or more integrator and comb filter pairs.
For decimating CIC, the input signal is fed through one or more cascaded integrators, then a down sampler which is followed by one or more comb sections.
For an impulse input, the single-stage CIC filter produces a step response at the
integrator output, and the same logic is used for the multistage decimator.
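The equivalence between a single-stage CIC decimator and a moving-average filter can be sketched as follows (illustrative NumPy sketch; R = 8 is an arbitrary choice):

```python
import numpy as np

R = 8
x = np.arange(1.0, 65.0)

# Single-stage CIC decimator: integrator -> downsample by R -> comb (delay 1)
integ = np.cumsum(x)               # integrator: running sum of the input
dec = integ[R - 1::R]              # keep every R-th integrator output
comb = np.diff(dec, prepend=0.0)   # comb at the low rate: y[m] - y[m-1]

# Equivalent moving-average: each output is the sum of R consecutive inputs
ref = x.reshape(-1, R).sum(axis=1)
print(np.allclose(comb, ref))      # True
```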
VHDL PROGRAM:
entity vrb is
Port ( clk : in STD_LOGIC;
x_in : in STD_LOGIC_VECTOR (7 downto 0);
y_out : out STD_LOGIC_VECTOR (8 downto 0));
end vrb;
architecture Behavioral of vrb is
TYPE STATE_TYPE is (hold,sample);
SIGNAL state :STATE_TYPE;
SIGNAL count:integer RANGE 0 to 64;
SIGNAL clk2: STD_LOGIC;
SIGNAL x : STD_LOGIC_VECTOR( 7 DOWNTO 0);
SIGNAL sxtx: STD_LOGIC_VECTOR( 25 DOWNTO 0);
SIGNAL i0 :word26;
SIGNAL i1 :word21;
SIGNAL i2 :word16;
SIGNAL i2d1,i2d2,i2d3,i2d4,c1,c0: word14;
SIGNAL c1d1,c1d2,c1d3,c1d4,c2: word13;
SIGNAL c2d1,c2d2,c2d3,c2d4,c3: word12;
begin
FSM: PROCESS
BEGIN
  WAIT UNTIL clk = '0';
  CASE state IS
    WHEN hold =>                  -- hold for the decimation period
      IF count < 63 THEN
        count <= count + 1;
      ELSE
        state <= sample;
      END IF;
    WHEN sample =>                -- take one output sample
      count <= 0;
      state <= hold;
  END CASE;
END PROCESS FSM;

Sxt: PROCESS (x)                  -- sign-extend the 8-bit input to 26 bits
BEGIN
  sxtx(7 DOWNTO 0) <= x;
  sxtx(25 DOWNTO 8) <= (OTHERS => x(7));
END PROCESS Sxt;

clk2 <= '1' WHEN state = sample ELSE '0';  -- decimated-rate clock enable
-- integrator and comb stages operate on i0..i2 and c0..c3 at the clk and clk2 rates
ANOTHER PROGRAM:
entity newvrb is
port (a : in STD_LOGIC_vector(1 to 32);
bintr : out STD_LOGIC_vector(1 to 16);
cintr : out STD_LOGIC_vector(1 to 8);
d : out STD_LOGIC_vector(1 to 4));
end newvrb;
architecture Behavioral of newvrb is
signal b : STD_LOGIC_vector(1 to 16);
signal c : STD_LOGIC_vector(1 to 8);
begin
process (a, b, c)
begin
  for I in 1 to 16 loop            -- stage 1: keep every second sample (32 -> 16)
    b(I) <= a(2*I);
  end loop;
  for I in 1 to 8 loop             -- stage 2: 16 -> 8
    c(I) <= b(2*I);
  end loop;
  for I in 1 to 4 loop             -- stage 3: 8 -> 4
    d(I) <= c(2*I);
  end loop;
end process;
bintr <= b;
cintr <= c;
end Behavioral;
APPLICATIONS:
During A/D conversion: oversampling to alleviate the stringent requirements on the analog anti-aliasing filter.
During D/A conversion: Filter to remove spectrum images.
Fractional sampling rate conversion.
POLYPHASE DECOMPOSITION
Polyphase decomposition is very useful when implementing decimation or interpolation
in IIR or FIR filters and in filter banks. To illustrate this, consider the polyphase decomposition of
an FIR decimation filter. If we add downsampling by a factor of R to the FIR filter structure
shown in Figure 1, we find that we only need to compute the outputs y[n] at the time instances

n = 0, ±R, ±2R, . . . (1)

Figure 1: Direct form of FIR filter.

It follows that we do not need to compute all sums-of-products f[k]x[n − k] of the convolution. For instance, x[0] only needs to be multiplied by

f[0], f[R], f[2R], . . . . (2)

Besides x[0], these coefficients only need to be multiplied by

x[R], x[2R], . . . . (3)

It is therefore reasonable to split the input signal first into R separate sequences
according to

x_r[n] = x[nR + r], r = 0, 1, . . . , R − 1,

and also to split the filter f[n] into R sequences

f_r[n] = f[nR + r], r = 0, 1, . . . , R − 1.
Figure 2 shows a decimation filter implemented using polyphase decomposition. Such a
decimator can run R times faster than the usual FIR filter followed by a downsampler. The
filters f_r[n] are called polyphase filters because they all have the same magnitude transfer
function; they are separated only by a sample delay, which introduces a phase offset. A final
example illustrates the polyphase decomposition.
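The decomposition can be verified numerically: filtering at the full rate and then downsampling gives the same samples as running the R polyphase subfilters on the input phases (an illustrative NumPy sketch; the filter coefficients are arbitrary):

```python
import numpy as np

R = 2
f = np.array([0.48, 0.84, 0.22, -0.13])   # example length-4 filter (arbitrary values)
x = np.arange(1.0, 17.0)                  # test input x[0..15]

# Reference: full FIR convolution, then keep every R-th output sample
y_ref = np.convolve(f, x)[::R]

# Polyphase: f_r[n] = f[nR + r] and x~_r[n] = x[nR - r] (zero outside the input)
N = len(y_ref)
y_poly = np.zeros(N)
for r in range(R):
    f_r = f[r::R]
    x_r = np.array([x[n * R - r] if 0 <= n * R - r < len(x) else 0.0
                    for n in range(len(x) // R + 1)])
    y_poly += np.convolve(f_r, x_r)[:N]

print(np.allclose(y_ref, y_poly))  # True
```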
EXAMPLE 5.1: POLYPHASE DECIMATOR FILTER
Consider a Daubechies length-4 filter with G(z) and R = 2.
Quantizing the filter to 8 bits of precision results in the following model:
and it follows that
Figure2: Polyphase realization of decimation filter.
The following VHDL code shows the polyphase implementation for DB4.
Figure 3 : Output for the given code
FILTER BANKS
A digital filter bank is a collection of filters having a common input or output. One
common application of the analysis filter bank is spectrum analysis, i.e., to split the input signal
into R different so-called subband signals. The combination of several signals into a common
output signal is called a synthesis filter bank. The analysis filter may be nonoverlapping,
slightly overlapping, or substantially overlapping. Another important characteristic that
distinguishes different classes of filter banks is the bandwidth and spacing of the center
frequencies of the filters. A popular example of a non-uniform filter bank is the octave-spaced
or wavelet filter bank.
UNIFORM DFT FILTER BANK:
In uniform filter banks, all filters have the same bandwidth and sampling rate. In a
maximally decimated, or critically sampled, filter bank, the decimation rate R is equal to the
number of bands K. If the rth band filter is computed from the modulation of a single
prototype filter h[n], according to

h_r[n] = h[n] e^(j2πrn/R), (1)

then it is a uniform DFT filter bank.
FIG.1 R channel filter bank, with a small amount of overlapping
An efficient implementation of the R-channel filter bank can be generated if we use
a polyphase decomposition of the filter and of the input signal x[n]. Because each of these
bandpass filters is critically sampled, we use a decomposition with R polyphase signals
according to

x_k[n] = x[nR + k], k = 0, 1, . . . , R − 1, (2)
h_k[n] = h[nR + k], k = 0, 1, . . . , R − 1. (3)

If we now substitute (2) into (1), we find that all bandpass filters share the same polyphase
filters h_k[n], while the twiddle factors for each filter are different. It is now obvious that the
twiddle multiplication for the rth band filter corresponds to the rth DFT component, with an
input vector of the R polyphase-filtered signals. The computation for the whole analysis bank
can be reduced to filtering with R polyphase filters, followed by a DFT (or FFT) of these R
filtered components. This is obviously much more efficient than direct computation. The
polyphase filter bank for the uniform DFT synthesis bank can be developed as an inverse
operation to the analysis bank. Perfect reconstruction occurs if the convolution of the included
polyphase filters gives a unit sample function, i.e.,

g_k[n] * h_k[n] = δ[n]. (4)
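The polyphase-plus-DFT structure can be cross-checked against direct computation with the modulated band filters (an illustrative NumPy sketch; the prototype h and the sizes R and M are arbitrary choices):

```python
import numpy as np

R = 4
h = np.array([0.5, 0.9, 0.7, 0.3, 0.1, -0.2, 0.05, 0.4])  # prototype (arbitrary)
x = np.arange(1.0, 25.0)
M = 6  # number of decimated output samples to compare

# Direct: r-th band filter h_r[n] = h[n] exp(j 2 pi r n / R), output decimated by R
n = np.arange(len(h))
direct = np.zeros((R, M), dtype=complex)
for r in range(R):
    hr = h * np.exp(2j * np.pi * r * n / R)
    direct[r] = np.convolve(hr, x)[::R][:M]

# Polyphase + DFT: v_k = h_k * x~_k, then an inverse DFT across the R branches
v = np.zeros((R, M), dtype=complex)
for k in range(R):
    hk = h[k::R]
    xk = np.array([x[m * R - k] if 0 <= m * R - k < len(x) else 0.0
                   for m in range(M + len(hk))])
    v[k] = np.convolve(hk, xk)[:M]
poly = R * np.fft.ifft(v, axis=0)
print(np.allclose(direct, poly))  # True
```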
TWO CHANNEL FILTER BANKS:
The input x[n] is split by using lowpass G(z) and highpass H(z) analysis filters.
The resulting signal x[n] is reconstructed using lowpass and highpass synthesis
filters.
Between the analysis and synthesis sections are decimation and interpolation by 2 units.
The construction rule is given by H(z) = G(−z), which defines the filters to be mirrored
pairs. This is a quadrature mirror filter (QMF) bank, because the two filters have mirror
symmetry about π/2.
A perfectly reconstructed signal has the same shape as the original, up to a phase (time)
shift.
Fig.2 Two-channel filter bank
If the signal X(z) is applied to the two-channel filter bank, the
lowpass path X_G(z) and highpass path X_H(z) (after decimation and interpolation by 2) become

X_G(z) = (1/2)[G(z)X(z) + G(−z)X(−z)] (5)
X_H(z) = (1/2)[H(z)X(z) + H(−z)X(−z)] (6)

After multiplication by the synthesis filters G_s(z) and H_s(z) and summation
of the results, we get the reconstruction

X̂(z) = (1/2)[G(z)G_s(z) + H(z)H_s(z)] X(z) + (1/2)[G(−z)G_s(z) + H(−z)H_s(z)] X(−z). (7)

The factor of X(−z) is the aliasing component, while the term at X(z) shows the
amplitude distortion.
PERFECT RECONSTRUCTION:
A perfect reconstruction for a two-channel filter bank is achieved if

1) G(−z)G_s(z) + H(−z)H_s(z) = 0, i.e., the reconstruction is free of aliasing, and
2) G(z)G_s(z) + H(z)H_s(z) = 2z^(−d), i.e., the amplitude distortion has amplitude
one (a pure delay).

A two-channel filter bank is aliasing-free if we choose G_s(z) = H(−z) and H_s(z) = −G(−z).
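That the synthesis choice Gs(z) = H(−z), Hs(z) = −G(−z) cancels aliasing for any lowpass G(z) can be checked with coefficient (polynomial) arithmetic (an illustrative sketch; the example lowpass is arbitrary):

```python
import numpy as np

alt = np.array([1.0, -1.0, 1.0, -1.0])   # multiplying by alt maps F(z) -> F(-z)
g = np.array([1.0, 3.0, 3.0, 1.0]) / 8   # example lowpass filter (arbitrary)
h = g * alt                              # QMF highpass: H(z) = G(-z)

# Alias-cancelling synthesis filters: Gs(z) = H(-z), Hs(z) = -G(-z)
gs = h * alt
hs = -g * alt

# Aliasing term G(-z)Gs(z) + H(-z)Hs(z) must vanish identically
alias = np.convolve(g * alt, gs) + np.convolve(h * alt, hs)
print(np.allclose(alias, 0))  # True
```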
IMPLEMENTING TWO-CHANNEL FILTER BANKS:
POLYPHASE TWO-CHANNEL FILTER BANKS:
In the general case, with two filters G(z) and H(z), we can realize each filter as a
polyphase filter as shown below
Fig.3 Polyphase implementation of the two-channel filter bank
G(z) = G0(z²) + z⁻¹G1(z²), H(z) = H0(z²) + z⁻¹H1(z²). (8)

This does not reduce the hardware effort (2L multipliers and 2(L−1) adders are still used), but
the design can run at twice the usual sampling frequency, 2fs. These four polyphase filters
have only half the length of the original filters.
LIFTING:
Another general approach to constructing fast and efficient two-channel filter banks is the
lifting scheme, introduced by Sweldens and by Herley and Vetterli. The basic idea is the use
of cross-terms (called lifting and dual-lifting), as in a lattice filter, to construct a longer filter
from a short filter, while preserving the perfect reconstruction conditions.
Any (bi)orthogonal wavelet filter bank can be converted into a sequence of lifting and
dual-lifting steps. The number of multipliers and adders required then depends on the number of
lifting steps (more steps give lower complexity), and the savings can reach up to 50% compared
with the direct polyphase implementation.
QMF IMPLEMENTATION:
For the QMF bank we know that

H(z) = G(−z). (9)

But this implies that the polyphase filters are the same (except for the sign), i.e.,

G0(z) = H0(z), G1(z) = −H1(z). (10)

Instead of the four filters, for QMF we only need two filters and an additional butterfly. This
saves about 50%. For the QMF filter we need L real adders and L real multipliers, and the filter
can run at twice the usual input-sampling rate.
ORTHOGONAL FILTER BANKS:
If the highpass and lowpass polynomials are mirror (time-reversed) versions of each other,
we get an orthogonal filter bank. An orthogonal filter pair obeys the conjugate quadrature
filter (CQF) condition, defined by

h[n] = (−1)^n g[L − 1 − n]. (11)

If we use the transposed FIR filter, we need only half the number of multipliers. The
disadvantage is that we cannot benefit from polyphase decomposition to double the speed.
FIG.4. Lattice realization for the orthogonal two-channel filter bank
VHDL CODE:
PACKAGE n_bits_int IS -- User-defined types
SUBTYPE BITS8 IS INTEGER RANGE -128 TO 127;
SUBTYPE BITS9 IS INTEGER RANGE -2**8 TO 2**8-1;
SUBTYPE BITS17 IS INTEGER RANGE -2**16 TO 2**16-1;
TYPE ARRAY_BITS17_4 IS ARRAY (0 TO 3) OF BITS17;
END n_bits_int;
LIBRARY work;
USE work.n_bits_int.ALL;
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
USE ieee.std_logic_unsigned.ALL;
ENTITY db4latti IS ------> Interface
PORT (clk, reset : IN std_logic;
clk2 : OUT std_logic;
x_in : IN BITS8;
x_e, x_o : OUT BITS17;
g, h : OUT BITS9);
END db4latti;
ARCHITECTURE fpga OF db4latti IS
TYPE STATE_TYPE IS (even, odd);
SIGNAL state : STATE_TYPE;
SIGNAL sx_up, sx_low, x_wait : BITS17 := 0;
SIGNAL clk_div2 : std_logic;
SIGNAL sxa0_up, sxa0_low : BITS17 := 0;
SIGNAL up0, up1, low0, low1 : BITS17 := 0;
BEGIN
Multiplex: PROCESS (reset, clk) ----> Split into even and
BEGIN -- odd samples at clk rate
  IF reset = '1' THEN -- Asynchronous reset
    state <= even;
  ELSIF rising_edge(clk) THEN
    CASE state IS
      WHEN even =>
        -- Multiply with 256*s = 124
        sx_up  <= 124 * x_in;
        sx_low <= x_wait;
        state  <= odd;
      WHEN odd =>
        x_wait <= 124 * x_in;
        state  <= even;
    END CASE;
  END IF;
END PROCESS Multiplex;

-- ... first and second lattice stages computing up0, up1, low0, low1,
-- and the output assignments for x_e, x_o, g, and h ...
END fpga;
Computational complexity is reduced.
QMF-based subband coders provide more natural-sounding speech, pitch prediction, and
wider bandwidth than earlier subband coders.
APPLICATIONS:
Accurate channel selection in wireless communications.
Faster convergence and lower complexity in adaptive equalization.
Flexible compression of speech and music
Lower latency and better frequency compensation in hearing aids
More efficient short-time spectral analysis and synthesis
Multi-resolution image compression and wavelet transformations
Reliable automatic speech recognition.
DIT-FFT ALGORITHM
A fast Fourier transform (FFT) is an efficient algorithm for calculating the discrete Fourier
transform of a set of data. A DFT basically decomposes a set of data in the time domain into
different frequency components. The DFT is defined by the following equation:

X(k) = Σ_{n=0}^{N−1} x(n) e^(−j2πkn/N), k = 0, 1, . . . , N − 1

An FFT algorithm uses some interesting properties of the above formula to simplify the
calculations.
COOLEY-TUKEY ALGORITHM
The CooleyTukey algorithm, named after J.W. Cooley and John Tukey, is the most common
fast Fourier transform (FFT) algorithm. It re-expresses the discrete Fourier transform (DFT) of
an arbitrary composite size N = N1N2 in terms of smaller DFTs of sizes N1 and N2, recursively,
in order to reduce the computation time to O(N log N) for highly-composite N (smooth
numbers).
Basically, the computational problem for the DFT is to compute the sequence {X(k)}
of N complex-valued numbers given another sequence of data {x(n)} of length N, according to
the formula

X(k) = Σ_{n=0}^{N−1} x(n) W_N^{kn}, k = 0, 1, . . . , N − 1, with W_N = e^(−j2π/N).

In general, the data sequence x(n) is also assumed to be complex-valued. Similarly, the IDFT
becomes

x(n) = (1/N) Σ_{k=0}^{N−1} X(k) W_N^{−kn}, n = 0, 1, . . . , N − 1.

Since the DFT and IDFT involve basically the same type of computations, our discussion of
efficient computational algorithms for the DFT applies as well to the efficient computation of
the IDFT.
We observe that for each value of k, direct computation of X(k) involves N complex
multiplications (4N real multiplications) and N−1 complex additions (4N−2 real additions).
Consequently, to compute all N values of the DFT requires N² complex multiplications and
N² − N complex additions.
Direct computation of the DFT is basically inefficient, primarily because it does not exploit the
symmetry and periodicity properties of the phase factor W_N. In particular, these two properties
are:

Symmetry:   W_N^{k+N/2} = −W_N^k
Periodicity: W_N^{k+N} = W_N^k

The computationally efficient algorithms described in this section, known collectively as fast
Fourier transform (FFT) algorithms, exploit these two basic properties of the phase factor.
RADIX-2 FFT ALGORITHM
Let us consider the computation of the N = 2^v point DFT by the divide-and-conquer approach.
We split the N-point data sequence into two N/2-point data sequences f1(n) and f2(n),
corresponding to the even-numbered and odd-numbered samples of x(n), respectively, that is,

f1(n) = x(2n), f2(n) = x(2n + 1), n = 0, 1, . . . , N/2 − 1.

Thus f1(n) and f2(n) are obtained by decimating x(n) by a factor of 2, and hence the resulting
FFT algorithm is called a decimation-in-time algorithm.
Now the N-point DFT can be expressed in terms of the DFTs of the decimated sequences as
follows:

X(k) = Σ_{m=0}^{N/2−1} f1(m) W_N^{2mk} + W_N^k Σ_{m=0}^{N/2−1} f2(m) W_N^{2mk}.

But W_N² = W_{N/2}. With this substitution, the equation can be expressed as

X(k) = F1(k) + W_N^k F2(k), k = 0, 1, . . . , N − 1,
where F1(k) and F2(k) are the N/2-point DFTs of the sequences f1(m) and f2(m), respectively.
Since F1(k) and F2(k) are periodic with period N/2, we have F1(k + N/2) = F1(k) and
F2(k + N/2) = F2(k). In addition, the factor W_N^{k+N/2} = −W_N^k. Hence the equations may
be expressed as

X(k) = F1(k) + W_N^k F2(k), k = 0, 1, . . . , N/2 − 1
X(k + N/2) = F1(k) − W_N^k F2(k), k = 0, 1, . . . , N/2 − 1.
We observe that the direct computation of F1(k) requires (N/2)² complex multiplications. The
same applies to the computation of F2(k). Furthermore, there are N/2 additional complex
multiplications required to compute W_N^k F2(k). Hence the computation of X(k) requires
2(N/2)² + N/2 = N²/2 + N/2 complex multiplications. This first step results in a reduction of the
number of multiplications from N² to N²/2 + N/2, which is about a factor of 2 for N large.
Having computed the N/4-point DFTs, we would obtain the N/2-point DFTs F1(k) and F2(k)
by again combining even- and odd-indexed subsequences with the twiddle factors W_{N/2}^k.
The decimation of the data sequence can be repeated again and again until the resulting
sequences are reduced to one-point sequences. For N = 2^v, this decimation can be performed v =
log2 N times. Thus the total number of complex multiplications is reduced to (N/2)log2 N. The
number of complex additions is N log2 N.
The following figure depicts the computation of N = 8 point DFT. We observe that the
computation is performed in tree stages, beginning with the computations of four two-point
DFTs, then two four-point DFTs, and finally, one eight-point DFT. The combination for the
smaller DFTs to form the larger DFT is illustrated in following figure for N = 8.
Figure1 Three stages in the computation of an N = 8-point DFT.
Figure 2 Eight-point decimation-in-time FFT algorithm.
Figure 3 Basic butterfly computation in the decimation-in-time FFT algorithm.
An important observation concerns the order of the input data sequence after it is
decimated (v−1) times. For example, if we consider the case where N = 8, we know that the first
decimation yields the sequence x(0), x(2), x(4), x(6), x(1), x(3), x(5), x(7), and the second
decimation results in the sequence x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7). This shuffling of
the input data sequence has a well-defined order, as can be ascertained from observing the
following figure, which illustrates the decimation of the eight-point sequence.
INPUT DATA INDEX   INDEX BITS   REVERSED BITS   OUTPUT DATA INDEX
       0              000            000                0
       4              100            001                1
       2              010            010                2
       6              110            011                3
       1              001            100                4
       5              101            101                5
       3              011            110                6
       7              111            111                7
Figure TC.3.5 Shuffling of the data and bit reversal.
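The bit-reversed ordering in the table can be generated programmatically (an illustrative Python sketch):

```python
def bit_reverse(i, nbits):
    # reverse the nbits-bit binary representation of index i
    r = 0
    for _ in range(nbits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

# For N = 8 (3 bits), this reproduces the shuffled input order of the table
order = [bit_reverse(i, 3) for i in range(8)]
print(order)  # [0, 4, 2, 6, 1, 5, 3, 7]
```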
ADVANTAGES:
To reduce the computational complexity of the DFT algorithm.
Energy compaction.
Delay can be reduced.
CODE:
module ditfft(clk,sel,yr,yi);
input clk;
input [2:0]sel;
output reg [7:0]yr,yi ;
wire [7:0]y0r,y1r,y2r,y3r,y4r,y5r,y6r,y7r,y0i,y1i,y2i,y3i,y4i,y5i,y6i,y7i;
wire [7:0]x20r,x20i,x21r,x21i,x22r,x22i,x23r,x23i,x24r,x24i,x25r,x25i,x26r,x26i,x27r,x27i;
wire [7:0]x10r,x10i,x11r,x11i,x12r,x12i,x13r,x13i,x14r,x14i,x15r,x15i,x16r,x16i,x17r,x17i;
wire [7:0]x0,x1,x2,x3,x4,x5,x6,x7;
assign x0=8'b10;
assign x1=8'b10;
assign x2=8'b10;
assign x3=8'b10;
assign x4=8'b1;
assign x5=8'b1;
assign x6=8'b1;
assign x7=8'b1;
//stage1
bfly1 s11(x0,x4,x10r,x10i,x11r,x11i);
bfly1 s12(x2,x6,x12r,x12i,x13r,x13i);
bfly1 s13(x1,x5,x14r,x14i,x15r,x15i);
bfly1 s14(x3,x7,x16r,x16i,x17r,x17i);
//stage2
bfly1 s21(x10r,x12r,x20r,x20i,x22r,x22i);
bfly2 s22(x11r,x11i,x13r,x13i,x21r,x21i,x23r,x23i);
bfly1 s23(x14r,x16r,x24r,x24i,x26r,x26i);
bfly2 s24(x15r,x15i,x17r,x17i,x25r,x25i,x27r,x27i);
//stage3
bfly1 s31(x20r,x24r,y0r,y0i,y4r,y4i);
bfly3 s32(x21r,x21i,x25r,x25i,y1r,y1i,y5r,y5i);
bfly2 s33(x22r,x22i,x26r,x26i,y2r,y2i,y6r,y6i);
bfly4 s34(x23r,x23i,x27r,x27i,y3r,y3i,y7r,y7i);
always@(posedge clk)
case(sel)
0:begin yr=y0r; yi=y0i; end
1:begin yr=y1r; yi=y1i; end
2:begin yr=y2r; yi=y2i; end
3:begin yr=y3r; yi=y3i; end
4:begin yr=y4r; yi=y4i; end
5:begin yr=y5r; yi=y5i; end
6:begin yr=y6r; yi=y6i; end
7:begin yr=y7r; yi=y7i; end
endcase
endmodule

module bfly1(x,y,x0r,x0i,x1r,x1i); // butterfly for real inputs, w = 1
input [7:0]x,y;
output[7:0]x1r,x1i,x0r,x0i;
assign x0r=x+y;
assign x0i=8'd0;
assign x1r=x-y;
assign x1i=8'd0;
endmodule

module bfly2(xr,xi,yr,yi,x0r,x0i,x1r,x1i); // butterfly for w = -j
input [7:0]xr,xi,yr,yi;
output [7:0]x0r,x0i,x1r,x1i;
assign x0r=xr+yi;
assign x0i=xi-yr;
assign x1r=xr-yi;
assign x1i=xi+yr;
endmodule

module bfly3(xr,xi,yr,yi,x0r,x0i,x1r,x1i); // butterfly for w = (1-j)*0.707
input [7:0]xr,xi,yr,yi;
output [7:0]x0r,x0i,x1r,x1i;
parameter sht=8'd10; // scale by 707/2**10, approximately 0.707
wire [7:0]p1,p2;
assign p1=(707*yr)>>sht;
assign p2=(707*yi)>>sht;
assign x0r=xr+p1+p2;
assign x0i=xi-p1+p2;
assign x1r=xr-p1-p2;
assign x1i=xi-p2+p1;
endmodule

module bfly4(xr,xi,yr,yi,x0r,x0i,x1r,x1i); // butterfly for w = (-1-j)*0.707
input [7:0]xr,xi,yr,yi;
output [7:0]x0r,x0i,x1r,x1i;
parameter sht=8'd10; // scale by 707/2**10, approximately 0.707
wire [7:0]p1,p2;
assign p1=(707*yr)>>sht;
assign p2=(707*yi)>>sht;
assign x0r=xr-p1+p2;
assign x0i=xi-p1-p2;
assign x1r=xr+p1-p2;
assign x1i=xi+p2+p1;
endmodule
OUTPUT:
Input : 2,1,2,1,2,1,2,1.
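The expected transform of this stimulus can be computed independently of the Verilog model (NumPy cross-check):

```python
import numpy as np

x = np.array([2, 1, 2, 1, 2, 1, 2, 1], dtype=float)
X = np.fft.fft(x)
# Only bins 0 and 4 are non-zero: the sum (12) and the alternating component (4)
print(np.round(X.real).astype(int).tolist())  # [12, 0, 0, 0, 4, 0, 0, 0]
```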
DIF-FFT
FOURIER TRANSFORM
A Fourier transform is a useful analytical tool that is important for many fields of
application in digital signal processing.
In describing the properties of the Fourier transform and inverse Fourier transform, it is
quite convenient to use the concepts of time and frequency.
In image processing applications it plays a critical role.
FAST FOURIER TRANSFORM
The fast Fourier transform was proposed by Cooley and Tukey in 1965.
The fast Fourier transform is a highly efficient procedure for computing the DFT of a
finite series and requires fewer computations than direct evaluation of the DFT.
The FFT is based on decomposition: breaking the transform into smaller transforms and
combining them to get the total transform.
DISCRETE FOURIER TRANSFORM
The DFT pair is given as

X(k) = Σ_{n=0}^{N−1} x(n) e^(−j2πkn/N), k = 0, 1, . . . , N − 1
x(n) = (1/N) Σ_{k=0}^{N−1} X(k) e^(j2πkn/N), n = 0, 1, . . . , N − 1

Baseline for computational complexity:
Each DFT coefficient requires N complex multiplications and N−1 complex additions.
All N DFT coefficients require N² complex multiplications and N(N−1) complex additions.
SYMMETRY AND PERIODICITY PROPERTY
Symmetry:   W_N^{k+N/2} = −W_N^k
Periodicity: W_N^{k+N} = W_N^k
FFT algorithm provides speed increase factors, when compared with direct computation of the DFT, of approximately 64 and 205 for 256 point and 1024
point transforms respectively.
The number of multiplications and additions required to compute an N-point DFT using
the radix-2 FFT are (N/2)log2 N and N log2 N respectively.
EXAMPLE: For N = 64, the number of complex multiplications required using direct
computation is
N² = 64² = 4096.
The number of complex multiplications required using the FFT is
(N/2)log2 N = (64/2)·log2 64 = 192.
Speed improvement factor = 4096/192 = 21.33.
NUMBER OF COMPLEX MULTIPLICATIONS REQUIRED IN DIF- FFT
ALGORITHM
No. of points
in a sequence
x(n), N
Complex
multiplications
in direct
computation of
DFT
=NN =A
Complex
multiplications
in FFT
algorithms
N/2 log2 N = B
Speed
improvement
Factor -A/B
4 16 4 4
8 64 12 5
16 256 32 8
FFT ALGORITHMS
There are basically two types of FFT algorithms.
They are:
Decimation in Time
Decimation in frequency
DECIMATION-IN-FREQUENCY
It is a popular form of the FFT algorithm.
In this approach the output sequence X(k) is divided into smaller and smaller subsequences;
that is why it is named decimation in frequency.
Initially the input sequence x(n) is divided into two sequences x1(n) and x2(n), consisting
of the first N/2 samples of x(n) and the last N/2 samples of x(n), respectively.
RADIX-2 DIF-FFT ALGORITHM
The N-point sequence x(n) is divided into two N/2-point sequences:
the former N/2 points: x1(n) = x(n), n = 0, 1, . . . , N/2 − 1
the latter N/2 points: x2(n) = x(n + N/2), n = 0, 1, . . . , N/2 − 1
The even-indexed bins X(2k) are the N/2-point DFT of x1(n) + x2(n), and the odd-indexed
bins X(2k + 1) are the N/2-point DFT of [x1(n) − x2(n)] W_N^n.
Figure: Decimation-in-frequency algorithm of length 8 for radix-2.
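The two relations above translate directly into a recursive radix-2 DIF implementation (an illustrative NumPy sketch, not the hardware realization):

```python
import numpy as np

def dif_fft(x):
    # Recursive radix-2 decimation-in-frequency FFT (length must be a power of two)
    N = len(x)
    if N == 1:
        return x
    W = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    a = x[:N // 2] + x[N // 2:]          # former + latter halves -> even bins
    b = (x[:N // 2] - x[N // 2:]) * W    # (former - latter) * twiddle -> odd bins
    X = np.empty(N, dtype=complex)
    X[0::2] = dif_fft(a)
    X[1::2] = dif_fft(b)
    return X

x = np.array([2, 1, 2, 1, 2, 1, 2, 1], dtype=complex)
print(np.allclose(dif_fft(x), np.fft.fft(x)))  # True
```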
THE COMPARISON OF DIT AND DIF
The order of samples:
DIT-FFT: the input is in bit-reversed order and the output is in natural order.
DIF-FFT: the input is in natural order and the output is in bit-reversed order.
The butterfly computation:
DIT-FFT: multiplication is done before the additions.
DIF-FFT: multiplication is done after the addition.
Both DIT-FFT and DIF-FFT have identical computational complexity, i.e., for N = 2^L there
are in total L stages, and each has N/2 butterfly computations. Each butterfly computation has
1 multiplication and 2 additions.
A DIT-FFT flow graph can be transposed to a DIF-FFT flow graph and vice versa.
RADIX-2 DIF- FFT ALGORITHM
VERILOG CODE
module diff(clk,sel,yr,yi);
input clk;
input [2:0]sel;
output reg [7:0]yr,yi;
wire [7:0]y0r,y1r,y2r,y3r,y4r,y5r,y6r,y7r,y0i,y1i,y2i,y3i,y4i,y5i,y6i,y7i;
wire [7:0]x20r,x20i,x21r,x21i,x22r,x22i,x23r,x23i,x24r,x24i,x25r,x25i,x26r,x26i,x27r,x27i;
wire [7:0]x10r,x10i,x11r,x11i,x12r,x12i,x13r,x13i,x14r,x14i,x15r,x15i,x16r,x16i,x17r,x17i;
wire [7:0]x0r,x0i,x1r,x1i,x2r,x2i,x3r,x3i,x4r,x4i,x5r,x5i,x6r,x6i,x7r,x7i;
parameter w0r=8'b1;
parameter w0i=8'b0;
parameter w1r=8'b10110101;
parameter w1i=8'b01001011;
parameter w2r=8'b0;
parameter w2i=8'b11111111;
parameter w3r=8'b01001011;
parameter w3i=8'b01001011;
assign x0r=8'b11111111;
assign x0i=8'b00000000;
assign x1r=8'b11011100;
assign x1i=8'b00010101;
assign x2r=8'b11001101;
assign x2i=8'b00000000;
assign x3r=8'b11011100;
assign x3i=8'b00010101;
assign x4r=8'b10101011;
assign x4i=8'b00000000;
assign x5r=8'b00000110;
assign x5i=8'b11101011;
assign x6r=8'b11001101;
assign x6i=8'b00000000;
assign x7r=8'b00000110;
assign x7i=8'b11101011;
//stage1
bfly1 s11(x0r,x0i,x4r,x4i,w0r,w0i,x10r,x10i,x14r,x14i);
bfly3 s12(x1r,x1i,x5r,x5i,w1r,w1i,x11r,x11i,x15r,x15i);
bfly2 s13(x2r,x2i,x6r,x6i,w2r,w2i,x12r,x12i,x16r,x16i);
bfly4 s14(x3r,x3i,x7r,x7i,w3r,w3i,x13r,x13i,x17r,x17i);
//stage2
bfly1 s21(x10r,x10i,x12r,x12i,w0r,w0i,x20r,x20i,x22r,x22i);
bfly2 s22(x11r,x11i,x13r,x13i,w2r,w2i,x21r,x21i,x23r,x23i);
bfly1 s23(x14r,x14i,x16r,x16i,w0r,w0i,x24r,x24i,x26r,x26i);
bfly2 s24(x15r,x15i,x17r,x17i,w2r,w2i,x25r,x25i,x27r,x27i);
//stage3
bfly1 s31(x20r,x20i,x24r,x24i,w0r,w0i,y0r,y0i,y1r,y1i);
bfly1 s32(x22r,x22i,x26r,x26i,w0r,w0i,y2r,y2i,y3r,y3i);
bfly1 s33(x21r,x21i,x25r,x25i,w0r,w0i,y4r,y4i,y5r,y5i);
bfly1 s34(x23r,x23i,x27r,x27i,w0r,w0i,y6r,y6i,y7r,y7i);
always@(posedge clk)
case(sel)
0:begin yr=y0r; yi=y0i; end
1:begin yr=y1r; yi=y1i; end
2:begin yr=y2r; yi=y2i; end
3:begin yr=y3r; yi=y3i; end
4:begin yr=y4r; yi=y4i; end
5:begin yr=y5r; yi=y5i; end
6:begin yr=y6r; yi=y6i; end
7:begin yr=y7r; yi=y7i; end
endcase
endmodule
module bfly1(xr,xi,yr,yi,wr,wi,x0r,x0i,x1r,x1i); // sub module: butterfly for w = 1
input [7:0]xr,xi,yr,yi,wr,wi;
output[7:0]x1r,x1i,x0r,x0i;
assign x0r=xr+yr;   // x0 = x + y
assign x0i=xi+yi;
assign x1r=xr-yr;   // x1 = x - y
assign x1i=xi-yi;
endmodule
module bfly2(xr,xi,yr,yi,wr,wi,x0r,x0i,x1r,x1i); // sub module: butterfly for w = -j
input [7:0]xr,xi,yr,yi,wr,wi;
output [7:0]x0r,x0i,x1r,x1i;
assign x0r=xr+yi;   // x0 = x + (-j)*y
assign x0i=xi-yr;
assign x1r=xr-yi;   // x1 = x - (-j)*y
assign x1i=xi+yr;
endmodule
module bfly3(xr,xi,yr,yi,wr,wi,x0r,x0i,x1r,x1i); // sub module
input [7:0]xr,xi,yr,yi,wr,wi;
output [7:0]x0r,x0i,x1r,x1i;
wire [15:0]p1,p2,p3,p4;
wire [7:0]win,yrn,yin;
wire [8:0]ywr,ywi;
parameter sht=8'd10; // scale by 2**10: 707/1024 is approximately 0.707
assign yrn=~yr+1;
assign yin=yi;
assign win=~wi+1;
assign p1=(yrn*wr)>>sht;
assign p2=(yin*win)>>sht;
assign p3=(yrn*win)>>sht;
assign p4=(yin*wr)>>sht;
assign ywr=(~p1+1)+p2;
assign ywi=p3+p4;
assign x0r=xr+ywr;
assign x0i=xi+ywi;
assign x1r=xr+(~ywr+1);
assign x1i=xi+(~ywi+1);
endmodule
module bfly4(xr,xi,yr,yi,wr,wi,x0r,x0i,x1r,x1i); // sub module
input [7:0]xr,xi,yr,yi,wr,wi;
output [7:0]x0r,x0i,x1r,x1i;
wire [15:0]p1,p2;
wire [7:0]win,yrn,yin;
wire [8:0]ywr,ywi;
parameter sht=8'd10; // scale by 2**10: 707/1024 is approximately 0.707
assign yrn=~yr+1;
assign yin=~yi+1;
assign win=~wi+1;
assign p1=(yrn*win)>>sht;
assign p2=(yin*win)>>sht;
assign ywr=(~p1+1)+p2;
assign ywi=p1+p2;
assign x0r=xr+ywr;
assign x0i=xi+ywi;
assign x1r=xr+(~ywr+1);
assign x1i=xi+(~ywi+1);
endmodule
OUTPUT:
ERROR CONTROL CODES
ERROR DETECTION AND CORRECTION CODES:
When a message is transmitted, it has the potential to get scrambled by noise. This is certainly true
of voice messages, and is also true of the digital messages that are sent to and from computers. Now
even sound and video are being transmitted in this manner. A digital message is a sequence of 0s and
1s which encodes a given message. More data will be added to a given binary message that will help to
detect if an error has been made in the transmission of the message; adding such data is called an error-
detecting code.
More data may also be added to the original message so that errors made in transmission may be
detected, and also to figure out what the original message was from the possibly corrupt message that
was received. This type of code is an error-correcting code. Error detection is the ability to detect errors.
Error correction has an additional feature that enables identification and correction of the errors. Error
detection always precedes error correction. Both can be achieved by having extra, redundant,
or check bits in addition to the data to deduce that there is an error. The original data is
encoded with the redundant bit(s); the new data formed is known as a code word. Coding is the
process of adding redundancy for error detection or correction. It is of two types:
Block codes
Divides the data to be sent into a set of blocks
Extra information attached to each block
Memoryless
Convolutional codes
Treats data as a series of bits, and computes a code over a continuous series
The code computed for a set of bits depends on the current and previous input.
HAMMING CODES:
Hamming Codes are used in detecting and correcting a code. An error-correcting code is an
algorithm for expressing a sequence of numbers such that any errors which are introduced can be
detected and corrected (within certain limitations) based on the remaining numbers. Errors can happen in
a variety of ways. Bits can be added, deleted, or flipped. Errors can happen in fixed or variable codes.
Error-correcting codes are used in CD players, high speed modems, and cellular phones. Error detection
is much simpler than error correction. For example, one or more "check" digits are commonly embedded
in credit card numbers in order to detect mistakes. Hamming codes adopt the parity concept, but
have more than one parity bit. An example of a block code is the (7,4) Hamming code. This is an
error-detecting and error-correcting binary code, which transmits N = 7 bits for every K = 4
source bits.
GENERAL ALGORITHM:
The following general algorithm generates a single-error correcting (SEC) code for any
number of bits.
1. Number the bits starting from 1: bit 1, 2, 3, 4, 5, etc.
2. Write the bit numbers in binary: 1, 10, 11, 100, 101, etc.
3. All bit positions that are powers of two (have only one 1 bit in the binary form of their position)
are parity bits: 1, 2, 4, 8, etc. (1, 10, 100, 1000)
4. All other bit positions, with two or more 1 bits in the binary form of their position, are data bits.
5. Each data bit is included in a unique set of 2 or more parity bits, as determined by the binary
form of its bit position.
ENCODING A (7,4) CODE:
CONSTRUCTION OF G AND H:
The matrix G is called a (canonical) generator matrix of a linear (n,k) code, and H
is called a parity-check matrix. This is the construction of G and H in standard (or
systematic) form. Regardless of form, G and H for linear block codes must satisfy G·H^T = 0, an
all-zeros matrix. For Hamming codes, [7,4,3] = [n, k, d] = [2^m − 1, 2^m − 1 − m, 3] with m = 3.
The parity-check matrix H of a Hamming code is constructed by listing all columns of length m
that are pairwise independent. Thus H is a matrix whose columns are all of the nonzero m-tuples,
where the order of the m-tuples in the columns of the matrix does not matter; in systematic form
the right-hand side is just the (n−k)-identity matrix. G can be obtained from H by taking the
transpose of the left-hand side of H, with the k-identity matrix on the left-hand side of G.
The code generator matrix G and parity-check matrix H are, in systematic form:

G = | 1 0 0 0 1 1 0 |
    | 0 1 0 0 1 0 1 |
    | 0 0 1 0 0 1 1 |
    | 0 0 0 1 1 1 1 |

and

H = | 1 1 0 1 1 0 0 |
    | 1 0 1 1 0 1 0 |
    | 0 1 1 1 0 0 1 |

From the above matrix we have 2^k = 2^4 = 16 code words. The code words of this binary code
can be obtained from c = u·G, with u ∈ F2^4, where F2 is the field with two elements, namely
0 and 1.
Thus the messages are all the 4-tuples (k-tuples). Therefore, (1,0,1,1) gets encoded as
(1,0,1,1,0,1,0).
HAMMING DISTANCE:
Hamming distance = number of bit positions in which two code words differ.
E.g., 10001001 and 10110001 have a distance of 3. If the minimum distance is d, then d single-bit
errors are required to convert one valid code word into another, implying that such an error would
not be detected. In general, to detect k single-bit errors, the minimum Hamming distance must be
D(min) = k + 1. Hence we need code words that have D(min) = 2 + 1 = 3 to detect 2-bit errors.
DECODING OF HAMMING CODE:
The decoding task can be re-expressed as syndrome decoding:

s = H · r

where s is the syndrome, H the parity-check matrix, and r the received vector.
The following two possibilities are:
If the syndrome is zero, that is, all three parity checks agree with the corresponding received bits,
then the received vector is a codeword, and the most probable decoding is given by reading out its first
four bits. Then u is supposed to be the same than u. One can give the situation in that the errors are not
84
detectable. This happens when the error vector is identical to a non null word code. In this case r it is the
sum of two words code and therefore the syndrome is similar to zero. These errors are non-detectable
errors. As there are 2k-1 non-null words code, there are 2k-1 non-detectable errors.
If the syndrome is non-zero, then we are certain that the noise sequence for the present block was non-zero (there is noise in our transmission). Since the received vector is given by r = G^T u + n and H G^T = 0, the syndrome decoding problem is to find the most probable noise vector n satisfying the equation H n = s. Once the error vector is found, the original source sequence is identified.
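A minimal sketch of syndrome decoding for a (7,4) Hamming code, in Python. The parity-check matrix below is one systematic choice consistent with the encoding example (1,0,1,1) -> (1,0,1,1,0,1,0); the report's own H is not reproduced in the text, so treat this layout as an assumption:

```python
import numpy as np

# H = [P^T | I3] for a systematic (7,4) Hamming code (assumed layout)
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])

def syndrome_decode(r):
    """Correct up to one bit error in the received 7-bit vector r."""
    s = H @ r % 2          # syndrome s = H r (mod 2)
    if not s.any():
        return r           # zero syndrome: r is already a codeword
    # non-zero syndrome: the erroneous position is the column of H equal to s
    r = r.copy()
    for i in range(H.shape[1]):
        if np.array_equal(H[:, i], s):
            r[i] ^= 1
            break
    return r

codeword = np.array([1, 0, 1, 1, 0, 1, 0])   # encoding of (1,0,1,1)
r = codeword.copy()
r[2] ^= 1                                    # inject a single-bit error
print(syndrome_decode(r))                    # recovers the codeword
```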
VHDL CODE:
HAMMING ENCODER
entity hamenc is
Port ( datain : in STD_LOGIC_VECTOR (3 downto 0);
p :inout STD_LOGIC_VECTOR(2 downto 0);
hamout : out STD_LOGIC_VECTOR (6 downto 0));
end hamenc;
architecture Behavioral of hamenc is
begin
-- generate check bits; the original listing is truncated here, so the
-- parity equations below are reconstructed to match the example
-- (1,0,1,1) -> (1,0,1,1,0,1,0)
p(0) <= datain(3) xor datain(2) xor datain(0);
p(1) <= datain(3) xor datain(1) xor datain(0);
p(2) <= datain(2) xor datain(1) xor datain(0);
hamout <= datain & p; -- code word = data bits followed by check bits
end Behavioral;
HAMMING DECODER (listing truncated in the source)
The decoder recomputes the parity checks on the received word rxd to form the
syndrome and, when no error is flagged, reads the data bits out directly:
IF syndrome = "000" THEN
dataout(3 DOWNTO 0) <= rxd(6 DOWNTO 3); -- no error: pass data through
87
SIMULATION RESULTS
ENCODER OUTPUT:
DECODER OUTPUT:
88
INFERENCE:
A Hamming code word of n bits consists of m data bits and r parity (or check) bits,
i.e., n = m + r.
It can detect D(min) - 1 errors and can correct (D(min) - 1)/2 errors. Hence to correct k errors we need D(min) = 2k + 1; at least a distance of 3 is needed to correct a single-bit error.
89
CRYPTOGRAPHIC ALGORITHMS FOR FPGAS
Many communication systems use data-stream ciphers to protect relevant information.
The key sequence K is more or less a pseudorandom sequence (known to the sender and the receiver), and with the modulo-2 property of the XOR function the plaintext P can be reconstructed at the receiver side, because
P ⊕ K ⊕ K = P ⊕ 0 = P.
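This modulo-2 property is easy to verify in software; in the Python sketch below, the key-stream bytes are arbitrary values chosen for illustration:

```python
def xor_cipher(data: bytes, key: bytes) -> bytes:
    # Encrypt and decrypt are the same operation:
    # bytewise XOR of the data with the key stream.
    return bytes(d ^ k for d, k in zip(data, key))

plaintext = b"SECRET"
keystream = bytes([0x5A, 0x13, 0x7E, 0x01, 0xC4, 0x99])  # assumed key stream
ciphertext = xor_cipher(plaintext, keystream)
print(xor_cipher(ciphertext, keystream))  # -> b'SECRET', since P ^ K ^ K = P
```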
We shall discuss two encryption algorithms, namely:
Linear Feedback Shift Registers (LFSR) algorithm
Data Encryption Standard (DES)
Neither algorithm requires large tables and both are suitable for an FPGA implementation.
LINEAR FEEDBACK SHIFT REGISTERS ALGORITHM
LFSRs with maximal sequence length are a good approach for an ideal security key,
because they have good statistical properties. In other words, it is difficult to analyze the
sequence in a cryptographic attack, an analysis called cryptanalysis. Because bitwise designs
are possible with FPGAs, such LFSRs are realized more efficiently with FPGAs than with PDSPs.
Two possible realizations of a LFSR of length 8 are shown in Fig. 1.1.
Fig 1.1. Possible realizations of LFSRs. (a) Fibonacci configuration. (b) Galois configuration.
For the XOR LFSR there is always the possibility of the all-zero word, which should
never be reached. If the cycle starts with any nonzero word, the cycle length is always 2^l - 1.
Since an FPGA typically wakes up in the all-zero state, it is sometimes more convenient to use a mirrored or inverted LFSR circuit, substituting the XOR with a NOT XOR, or XNOR, gate; the all-zero word is then a valid pattern and the circuit produces exactly the inverted sequence. Such LFSRs can easily be designed using a PROCESS statement in VHDL, as the following
example shows.
example shows.
90
The following VHDL code implements an LFSR of length 6.
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
ENTITY lfsr IS ------> Interface
PORT ( clk : IN STD_LOGIC;
y : OUT STD_LOGIC_VECTOR(6 DOWNTO 1));
END lfsr;
ARCHITECTURE fpga OF lfsr IS
SIGNAL ff : STD_LOGIC_VECTOR(6 DOWNTO 1)
:= (OTHERS => '0');
BEGIN
PROCESS -- Implement length-6 LFSR with xnor
BEGIN
WAIT UNTIL clk = '1';
-- feedback and shift; the original line is truncated, taps assumed
ff(1) <= NOT (ff(5) XOR ff(6));
FOR I IN 6 DOWNTO 2 LOOP
ff(I) <= ff(I-1);
END LOOP;
END PROCESS;
y <= ff; -- connect state register to output
END fpga;
91
Note that a complete cycle of an LFSR sequence fulfills the three criteria for optimal
length-(2^l - 1) pseudorandom sequences:
1) The number of 1s and 0s in a cycle differs by no more than one.
2) Runs of length k (e.g., a 111 sequence or a 000 sequence) make up a
fraction 1/2^k of all runs.
3) The autocorrelation function C(τ) is constant for τ in [1, n - 1].
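A software model of a Fibonacci LFSR can confirm the maximal sequence length and the balance property. The Python sketch below uses the primitive polynomial x^6 + x^5 + 1 (this tap choice is an assumption; any primitive degree-6 polynomial gives a period of 2^6 - 1 = 63):

```python
def lfsr_cycle(nbits=6, seed=1):
    """Left-shift Fibonacci LFSR; feedback = MSB XOR LSB, i.e. the
    recurrence a[n+6] = a[n+5] XOR a[n] (polynomial x^6 + x^5 + 1)."""
    mask = (1 << nbits) - 1
    state, out = seed, []
    while True:
        out.append(state & 1)                      # output the LSB
        fb = ((state >> (nbits - 1)) ^ state) & 1  # MSB XOR LSB
        state = ((state << 1) | fb) & mask
        if state == seed:                          # back at the start state
            return out

seq = lfsr_cycle()
print(len(seq), seq.count(1), seq.count(0))  # -> 63 32 31
```

The output confirms criterion 1: over one full cycle of 63 bits, the counts of 1s and 0s (32 and 31) differ by exactly one.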
DES BASED ALGORITHM:
The data encryption standard (DES), outlined in Fig. 1.3, is typically used as a block
cipher. By selecting the output feedback mode (OFB), it is also possible to use the modified DES in a data-stream cipher.
Fig 1.3. State machine for a block encryption system (DES)
92
PRINCIPLE:
The DES comprises a finite state machine translating plaintext blocks into ciphertext
blocks. First the block to be substituted is loaded into the state register (32 bits). Next it is
expanded (to 48 bits), combined with the key (also 48 bits), and substituted in eight 6-to-4-bit S-boxes. Finally, permutations of single bits are performed. This cycle may be applied several times (if desired, with a changing key).
In the DES, the key is usually shifted one or two bits so that after 16 rounds the key is
back in its original position. Because the DES can therefore be seen as an iterative application
of the Feistel cipher (shown in Fig. 1.4), the S-boxes need not be invertible. To simplify an
FPGA realization some modifications are useful: the length of the state register is reduced to
25 bits, no expansion is used, and the final permutations listed in Table 1.1 are applied.
Because most FPGAs only have four- to five-input look-up tables (LUTs), S-boxes with five
inputs have been designed.
Fig 1.4. Principle of the Feistel Network
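The key property of the Feistel network, that decryption works even when the round function f is not invertible, can be sketched as follows. The toy round function and key values are arbitrary assumptions for illustration:

```python
def feistel_encrypt(left, right, keys, f):
    # One round maps (L, R) -> (R, L XOR f(R, k)); f need not be invertible.
    for k in keys:
        left, right = right, left ^ f(right, k)
    return left, right

def feistel_decrypt(left, right, keys, f):
    # Undo the rounds in reverse key order using the very same f.
    for k in reversed(keys):
        left, right = right ^ f(left, k), left
    return left, right

def f(x, k):
    # deliberately non-invertible toy round function (assumed)
    return ((x * k) ^ (x >> 3)) & 0xFFFF

keys = [0x3A7, 0x1C4, 0x9F2]
ct = feistel_encrypt(0x1234, 0xABCD, keys, f)
print(feistel_decrypt(*ct, keys, f) == (0x1234, 0xABCD))  # -> True
```

The round structure itself guarantees invertibility: each half needed to recompute f is passed through unchanged.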
A reasonable test for S-boxes is the dependency matrix. This matrix shows, for every
input/output combination, the probability that an output bit changes if an input bit is changed.
With the avalanche effect the ideal probability is 1/2.
Table 1.1 Table for Permutation
93
Since there are 2^5 = 32 possible input vectors for each S-box, the ideal value is 16. A
random generator was used to generate the S-boxes. The reason that some values differ much
from the ideal 16 may lie in the desired inversion.
Though DES was considered secure until the late 1990s, it was successfully cracked in 1997 and repeatedly thereafter, so more complex encryption algorithms such as AES and Triple DES are
used in practice nowadays.
94
FPGA DESIGN OF LMS ALGORITHM
The Widrow-Hoff least mean square (LMS) adaptive algorithm is a practical method for
finding a close approximation to the optimal (Wiener) filter solution in real time.
It is a very simple algorithm and it does not require explicit measurement of the
correlation functions, nor does it involve matrix inversion.
The LMS algorithm is an implementation of the method of steepest descent.
According to this method, the next filter coefficient vector f[n + 1] is equal to the
present filter coefficient vector f[n] plus a change proportional to the negative gradient:
f[n + 1] = f[n] - μ∇[n]
where the parameter μ is the learning factor or step size that controls stability and the
rate of convergence of the algorithm. During each iteration the true gradient is
represented by ∇[n].
The LMS algorithm estimates an instantaneous gradient in a crude but efficient manner
by assuming that the gradient of J = e^2[n] is an estimate of the gradient of the mean-
square error E{e^2[n]}. The relationship between the true gradient ∇[n] and the
estimated gradient is given by the following expression:
∇[n] ≈ ∂e^2[n]/∂f[n] = -2 e[n] x[n]
Therefore the coefficient update equation becomes
f[n + 1] = f[n] + 2 μ e[n] x[n]
95
FIG: LMS configuration
Although the LMS algorithm makes use of gradients of mean-square error functions, it does not require explicit squaring, averaging, or differentiation. The algorithm is simple and generally
easy to implement.
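The update rule above is simple enough to demonstrate in a few lines. The following Python sketch uses the LMS filter to identify an unknown 4-tap FIR system; the system coefficients and step size are arbitrary choices for the demo:

```python
import numpy as np

h = np.array([0.8, -0.4, 0.2, 0.1])   # unknown system (assumed for the demo)
L = len(h)
rng = np.random.default_rng(0)
x = rng.standard_normal(2000)          # input signal
d = np.convolve(x, h)[:len(x)]         # desired signal: unknown-system output

f = np.zeros(L)                        # adaptive coefficients f[n]
mu = 0.01                              # step size

for n in range(L, len(x)):
    xv = x[n:n-L:-1]                   # input vector (x[n], ..., x[n-L+1])
    e = d[n] - f @ xv                  # error e[n] = d[n] - y[n]
    f = f + 2 * mu * e * xv            # f[n+1] = f[n] + 2*mu*e[n]*x[n]

print(np.round(f, 3))                  # converges toward h
```

With noise-free data the error is driven to zero and the coefficients converge to the unknown system response.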
The LMS algorithm is convergent in the mean square if and only if the step-size
parameter satisfies
0 < μ < 2/λmax
where λmax is the largest eigenvalue of the correlation matrix of the input data.
To ensure fast convergence, a more tightly bounded step size is
0 < μ < 2/((L + 1) rxx[0])
where L is the filter order and rxx[0] is the autocorrelation of the input at lag zero.
For higher-order filters the upper bound can be relaxed by a factor of 3.
Normalized LMS
The LMS algorithm discussed so far uses a constant step size proportional to the
stability bound. Obviously this requires knowledge of the signal
statistic, i.e., rxx[0], and this statistic must not change over time.
It is, however, possible that this statistic changes over time, and we then wish to adjust the step size accordingly, i.e., to use a time-varying step-size parameter μ[n].
The normalized step size is given by
μ[n] = μ / (x^T[n] x[n])
If we are concerned that the denominator can temporarily become very small and μ[n] therefore too large, we may add a small constant c to x^T[n] x[n], which yields
μ[n] = μ / (c + x^T[n] x[n])
Therefore the coefficient update equation for the NLMS is
f[n + 1] = f[n] + 2 μ e[n] x[n] / (c + x^T[n] x[n])
where x^T[n] x[n] = ||x[n]||^2 is the squared norm of the input.
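The normalized update can be sketched as a one-line modification of the LMS step. In this Python sketch the system taps, μ, and the regularization constant c are illustrative values:

```python
import numpy as np

def nlms_step(f, xv, d, mu_bar=0.5, c=1e-6):
    """One NLMS iteration: step size normalized by c + x^T x."""
    e = d - f @ xv                           # a-priori error
    f = f + mu_bar * e * xv / (c + xv @ xv)  # normalized coefficient update
    return f, e

# identify a 2-tap system (assumed values) with repeated NLMS steps
h = np.array([0.5, -0.25])
rng = np.random.default_rng(1)
f = np.zeros(2)
for _ in range(500):
    xv = rng.standard_normal(2)
    f, e = nlms_step(f, xv, h @ xv)
print(np.round(f, 3))  # close to [0.5, -0.25]
```

Because the step is scaled by the instantaneous input power, the same μ works regardless of the input signal level, which is exactly the point of the normalization.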
PIPELINED LMS FILTER:
This method is used to increase the throughput of the adaptive system.
The optimal number of pipeline stages can be computed as follows. For the (b x b)-bit multiplier a total of log2(b) stages are needed; for the adder tree an additional log2(L) pipeline stages are sufficient; and one additional stage is needed for the computation of the
error. The coefficient update multiplication requires an additional log2(b) pipeline
stages.
The total number of pipeline stages for maximum throughput is therefore
2 log2(b) + log2(L) + 1
where we have assumed that μ is a power-of-two constant, so that the scaling with μ can be done without the need of additional pipeline stages. If, however, the normalized LMS is
used, then μ[n] will no longer be a constant and, depending on the bit width of μ[n], additional pipeline stages will be required.
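Summing the stage counts listed above gives the total pipeline depth; the bit width and filter length below are example choices:

```python
import math

def lms_pipeline_stages(b, L):
    # multiplier: log2(b), adder tree: log2(L), error computation: 1,
    # coefficient-update multiplier: log2(b)
    return 2 * int(math.log2(b)) + int(math.log2(L)) + 1

print(lms_pipeline_stages(8, 8))   # 2*3 + 3 + 1 = 10 stages
```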
BLOCK TRANSFORMATION USING FFTs:
LMS algorithms that solve the filter coefficient adjustment in a transform domain have been
proposed for two reasons,
The goal of the fast convolution techniques is to lower the computational effort by using a block update and computing the adaptive filter output and
the filter coefficient adjustment in the transform domain with the help of a fast cyclic
convolution algorithm.
The second method, which uses transform-domain techniques, has as its main goal to improve the adaptation rate of the LMS algorithm, because it is possible to find transforms that
allow a decoupling of the modes of the adaptive filter. The coefficient update equation is applied once per block of B samples, and the step size can be reduced to
μB = μ/B
for a block update of B steps each.
Choice of block size: B = L is the optimal choice from the viewpoint of computational complexity; B ≠ L introduces redundant operations in the adaptation and is not optimal.
VHDL CODE:
LIBRARY lpm;
USE lpm.lpm_components.ALL;
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
USE ieee.std_logic_signed.ALL;
ENTITY fir_lms IS
GENERIC (W1 : INTEGER := 8;
W2 : INTEGER := 16;
L : INTEGER := 2 );
PORT ( clk : IN STD_LOGIC;
x_in : IN STD_LOGIC_VECTOR(W1-1 DOWNTO 0);
d_in : IN STD_LOGIC_VECTOR(W1-1 DOWNTO 0);
e_out, y_out : OUT STD_LOGIC_VECTOR(W2-1 DOWNTO 0);
f0_out, f1_out : OUT STD_LOGIC_VECTOR(W1-1 DOWNTO 0));
END fir_lms;
ARCHITECTURE flex OF fir_lms IS
SUBTYPE N1BIT IS STD_LOGIC_VECTOR(W1-1 DOWNTO 0);
SUBTYPE N2BIT IS STD_LOGIC_VECTOR(W2-1 DOWNTO 0);
TYPE ARRAY_N1BIT IS ARRAY (0 TO L-1) OF N1BIT;
TYPE ARRAY_N2BIT IS ARRAY (0 TO L-1) OF N2BIT;
SIGNAL d : N1BIT;
SIGNAL emu : N1BIT;
SIGNAL y, sxty : N2BIT;
SIGNAL e, sxtd : N2BIT;
SIGNAL x, f : ARRAY_N1BIT;
SIGNAL p, xemu : ARRAY_N2BIT;
BEGIN
dsxt: PROCESS (d) -- sign-extend d to a 16-bit value
BEGIN
sxtd(7 DOWNTO 0) <= d;
sxtd(15 DOWNTO 8) <= (OTHERS => d(d'high));
END PROCESS;
-- the remaining statements are reconstructed; the original listing is
-- truncated in the source
store: PROCESS -- tapped delay line and coefficient update
BEGIN
WAIT UNTIL clk = '1';
d <= d_in;
x(0) <= x_in;
x(1) <= x(0);
f(0) <= f(0) + xemu(0)(15 DOWNTO 8); -- f[n+1] = f[n] + mu*e[n]*x[n]
f(1) <= f(1) + xemu(1)(15 DOWNTO 8);
END PROCESS;
MulGen1: FOR I IN 0 TO L-1 GENERATE
FIR: lpm_mult -- p(I) = x(I) * f(I)
GENERIC MAP (LPM_WIDTHA => W1, LPM_WIDTHB => W1,
LPM_REPRESENTATION => "SIGNED", LPM_WIDTHP => W2)
PORT MAP ( dataa => x(I), datab => f(I),
result => p(I));
END GENERATE;
y <= p(0) + p(1); -- filter output
y_out <= y;
sxty <= y;
e <= sxtd - sxty; -- error = desired - output
e_out <= e;
emu <= e(8 DOWNTO 1); -- e * mu with mu = 1/2 realized as a shift (assumed)
MulGen2: FOR I IN 0 TO L-1 GENERATE
FUPDATE: lpm_mult -- xemu(I) = x(I) * emu for the coefficient update
GENERIC MAP (LPM_WIDTHA => W1, LPM_WIDTHB => W1,
LPM_REPRESENTATION => "SIGNED", LPM_WIDTHP => W2)
PORT MAP ( dataa => x(I), datab => emu,
result => xemu(I));
END GENERATE;
f0_out <= f(0);
f1_out <= f(1);
END flex;
99
OUTPUT:
APPLICATIONS:
Interference cancellation
Prediction
Inverse modelling
100
DIGITAL UP CONVERTER
An ideal Software Defined Radio base station would perform all signal processing tasks
in the digital domain. However, current-generation wideband data converters cannot support the
processing bandwidth and dynamic r