ELEC692 VLSI Signal Processing Architecture
Lecture 7: VLSI Architecture for Block Matching Algorithm for Video Compression
* Part of the notes is taken from the course notes of Prof. Bing Zeng’s ELEC 533
Reference
• P. Pirsch, N. Demassieux, W. Gehrke, “VLSI architecture for Video compression – A survey”, Proceedings of the IEEE, Vol. 83, No. 2, pp. 220-246, Feb. 1995
• T. Komarek, P. Pirsch, “Array Architecture for Block Matching Algorithm”, IEEE Transactions on Circuits and Systems, Vol. 36, No. 10, pp. 1301-1310, Oct. 1989
Interframe Coding/Motion Estimation of Video Sequence
Interframe Transform/Predictive Coding
• Prediction is based on a previously processed frame
• Prediction is accomplished by motion estimation (ME)
• Motion estimation is done in the spatial domain
• The 2-D DCT has to be inside the coding loop, and a 2-D IDCT is needed to convert the frequency domain information back to the spatial domain
Motion Compensated Prediction
Block Matching Method
Search window
Block matching Criterion
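• Mean Square Error (MSE)
$$\mathrm{MSE}(m,n) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\bigl(x_t(i,j) - x_{t-1}(i+m,\,j+n)\bigr)^2$$
• Mean Absolute Difference (MAD)
$$\mathrm{MAD}(m,n) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\bigl|x_t(i,j) - x_{t-1}(i+m,\,j+n)\bigr|$$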
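As a rough illustration of the two matching criteria, the sketch below evaluates both for one pair of equally sized blocks. It assumes blocks are stored as lists of rows of integer pixel values; the function names `mse` and `mad` are hypothetical and not part of the notes.

```python
def mse(block_a, block_b):
    """Mean square error between two equally sized N x N blocks."""
    n = len(block_a)
    total = sum((block_a[i][j] - block_b[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / (n * n)


def mad(block_a, block_b):
    """Mean absolute difference between two equally sized N x N blocks."""
    n = len(block_a)
    total = sum(abs(block_a[i][j] - block_b[i][j])
                for i in range(n) for j in range(n))
    return total / (n * n)
```

MAD replaces the multiplication per pixel with an absolute value, which is one reason it tends to be preferred when the criterion is implemented in hardware.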
Important factors for BM Motion Estimation
• Block size – 8X8, 16X16, variable
• Size of search window
– Depends on frame differences, speed of moving objects, resolution, etc.
• Matching criterion
– Accuracy vs. complexity, use of truncated pixels
• Search strategy
– Full search, hierarchical search, subsampling of motion field
• Hardware consideration
Real time processing for BMA
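• With block size = 16*16, window size = 32*32, and CIF frames at 30 frames/s, we need
$$256\,\frac{\text{ops}}{\text{search}} \times 289\,\frac{\text{searches}}{\text{block}} \times 396\,\frac{\text{blocks}}{\text{frame}} \times 30\,\frac{\text{frames}}{\text{sec}} \approx 879\ \text{Mops/sec}$$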
For CCIR 601 or HDTV, the requirement rises to several GOPS or even tens of GOPS, so full search has to be implemented in dedicated hardware.
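The CIF figure can be re-derived with a few lines of arithmetic. The sketch below assumes a 352 x 288 CIF luminance frame, one operation per pixel comparison, and a window that admits (window - block + 1)^2 candidate positions per block; the function name is hypothetical.

```python
def full_search_ops_per_second(width, height, block, window, fps):
    """Operations per second for full-search block matching."""
    ops_per_search = block * block                    # one op per pixel pair
    searches_per_block = (window - block + 1) ** 2    # candidate positions
    blocks_per_frame = (width // block) * (height // block)
    return ops_per_search * searches_per_block * blocks_per_frame * fps


cif_ops = full_search_ops_per_second(352, 288, 16, 32, 30)
print(cif_ops)  # 878929920, i.e. about 879 Mops/sec
```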
Exhaustive Search Block Matching
• A block of size N X N of the current image (reference block, denoted by X)
• Matched with all the blocks located within a search window (candidate blocks, denoted by Y)
• Maximum displacement – w
• Computing the mean absolute difference (MAD) between the blocks
• Matching distance D is given by
$$D(m,n) = \sum_{i=0}^{N-1}\sum_{j=0}^{N-1}\bigl|x(i,j) - y(i+m,\,j+n)\bigr|, \qquad v = (m,n)\big|_{D_{\min}}$$
v is the motion vector
No. of candidate blocks to be considered: (2w+1)^2
Algorithm to find the motion vector
Dmin = MAXVALUE
Vmin = (0,0)
for m = -w to +w
  for n = -w to +w
    D(m,n) = 0
    for i = 1 to N
      for j = 1 to N
        D(m,n) = D(m,n) + |x(i,j) - y(i+m,j+n)|
      endfor
    endfor
    if D(m,n) < Dmin then
      Dmin = D(m,n)
      Vmin = (m,n)
    endif
  endfor
endfor
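The exhaustive search can be sketched in software as follows. This is a minimal illustration, not a reference implementation: it assumes the reference block `x` is an N x N list of rows and the search window `y` is an (N+2w) x (N+2w) list of rows whose centre corresponds to displacement (0,0), so array indices are offset by w. The function name `full_search` is hypothetical.

```python
def full_search(x, y, w):
    """Return (d_min, (m, n)) for reference block x inside search window y.

    x: N x N reference block.
    y: (N + 2w) x (N + 2w) search window.
    w: maximum displacement in each direction.
    """
    n_size = len(x)
    d_min = float("inf")
    v_min = (0, 0)
    for m in range(-w, w + 1):
        for n in range(-w, w + 1):
            d = 0
            for i in range(n_size):
                for j in range(n_size):
                    # Offset by w so that (m, n) = (0, 0) is the window centre.
                    d += abs(x[i][j] - y[i + m + w][j + n + w])
            if d < d_min:
                d_min, v_min = d, (m, n)
    return d_min, v_min
```

The four nested loops make the (2w+1)^2 * N^2 operation count per block explicit, which is the quantity the array architectures aim to parallelize.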
Dependency graph
Calculating MAD
Calculating s_i(m,n) and s(m,n); calculating Dmin and v
Dependency graph
• The BM algorithm can be described by several different dependency graphs
• Example 1
AD = absolute difference and addition
M = minimum value computation
Dependency graph
• Example 2
Data input
• Line scan and block scan
• Line scan
– TV lines run through as a whole, from the upper to the lower side of the frame
• Block scan
– Quadratic blocks of n X n pixels are run through in a block-line manner
– Well suited if the data are supplied by a memory with block scan output
– Pixels within a block are traversed column by column
– E.g. (3X3)-pixel block
$$\begin{bmatrix} x(1,1) & x(1,2) & x(1,3)\\ x(2,1) & x(2,2) & x(2,3)\\ x(3,1) & x(3,2) & x(3,3) \end{bmatrix}$$
Data are read in the order x(1,1), x(2,1), x(3,1), x(1,2), x(2,2), x(3,2), x(1,3), x(2,3), x(3,3).
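The column-by-column traversal inside a block can be written as a short index generator. This is only a sketch; the function name `block_scan_order` is hypothetical, and indices are 1-based (row, column) pairs to match the x(i,j) notation.

```python
def block_scan_order(n):
    """Return the 1-based (i, j) indices of an n x n block, column by column."""
    # The column index j is the outer loop, so the row index i varies fastest.
    return [(i, j) for j in range(1, n + 1) for i in range(1, n + 1)]


print(block_scan_order(3))
# [(1, 1), (2, 1), (3, 1), (1, 2), (2, 2), (3, 2), (1, 3), (2, 3), (3, 3)]
```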
Mapping BMA onto Systolic Arrays
• Decompose the algorithm into its basic operations and convert it into a form where each result is assigned to a unique variable.
• Formulate it as an n-dimensional dependence graph (DG) of computation nodes and data dependence arcs.
• One straightforward mapping implements a PE dedicated to each node of the DG and a communication link for each edge of the DG.
• A more efficient design with higher processor utilization results if each PE executes the operations of multiple computation nodes.
• This needs a time schedule and an assignment of multiple nodes to a single PE by projection. The PEs need to be programmable to some extent.
Mapping BMA onto Systolic Arrays
• The BMA is defined over a 4-dimensional index space (i,j,m,n)
• The BMA can be decomposed into two parts which are defined over two-dimensional index spaces.
– 1st one spanned by the indices i, j: finding the sum D(m,n)
– 2nd one defined over m and n: the minimum search and the selection of the displacement vector
$$D(m,n) = \sum_{i=1}^{N} D_i(m,n), \qquad D_i(m,n) = \sum_{j=1}^{N}\bigl|x(i,j) - y(i+m,\,j+n)\bigr|$$
$$D_{\min} = \min_{m,n}\{D(m,n)\}, \qquad v = (m,n)\big|_{D_{\min}}$$
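The two-part decomposition can be mirrored in software by separating the accumulation over (i,j) from the minimum search over (m,n). The sketch below is illustrative only: it assumes 0-based list indexing, a search window `y` of size (N+2w) x (N+2w) with displacement (0,0) at its centre, and hypothetical function names.

```python
def row_distortion(x, y, i, m, n, w):
    """Partial distortion D_i(m, n): sum over j for one row i of the block."""
    return sum(abs(x[i][j] - y[i + m + w][j + n + w]) for j in range(len(x)))


def distortion_map(x, y, w):
    """Part 1, over (i, j): D(m, n) as the sum of the row distortions."""
    return {(m, n): sum(row_distortion(x, y, i, m, n, w) for i in range(len(x)))
            for m in range(-w, w + 1) for n in range(-w, w + 1)}


def select_minimum(dmap):
    """Part 2, over (m, n): minimum search and displacement selection."""
    v, d_min = min(dmap.items(), key=lambda item: item[1])
    return d_min, v
```

Keeping the two parts separate is what allows one of them to be mapped onto PEs and the other onto time in the array mappings.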
Transform it into a 2-D array
• D(m,n) is mapped into a 2-D array of PEs
• V(X,Y) is mapped into time
Realistic implementation of 2-D array
• Reduction of the cycle time
– Pipelining of the computation of D(m,n).
• I/O management
– Each of the AD-PEs receives a new value of y(m+i,n+j) at each clock cycle.
• Transmitting N^2 values from an external memory at every cycle is not feasible. We can take advantage of the fact that these values belong to the search window.
• A portion of the search window of size N.(2w+N) is stored in the circuit in a 2-D bank of shift registers, able to shift in the up, down, and right directions.
• Each AD-PE has one of these registers and can at each cycle obtain the value of y(m+i,n+j) that it needs.
• To update this register bank, a new column of 2w+N pixels of the search area is serially entered in the circuit and is inserted in the bank of registers.
• To load a new reference block with a low I/O overhead, double buffering of x(i,j) is required, with the pixels x’(i,j) of a new reference block serially loaded during the computation of the current reference block.
implementation of the 2-D array
2-D array
• An alternative projection of the DG onto the (i,j)-plane provides the architecture AB2
• The current frame data x(i,j) remain fixed in the AD PEs, so they have to be loaded into the array beforehand. Time required = (2w+1)*(2w+1)
Mapping to a 1-D array
• More efficient design with a higher processor utilization if each PE executes the operations of multiple computation nodes
• Mapped to a 1-D array of PEs, which computes in parallel the partial distortion along one row.
• Compute D(m,n) in N cycles
1-D array
• Project the DG along the i-axis onto a one-dimensional signal flow graph.
• Called AB1 array, it has the size of a block
Consecutive computation of all (2w+1)^2 candidate blocks, one per displacement vector, takes N*(2w+1)^2 time instances.
Another way of mapping-search area based
• The dependency graph for computing v(X,Y) is mapped into a 2-D array of (2w+1)^2 PEs, while the dependency graph for computing D(m,n) is mapped into time.
• Each PE, working in parallel, keeps track of a particular distortion computation and sequentially explores the reference block.
• At each cycle, each PE receives a different value of y(m+i,n+j), and all the PEs receive the value of one pixel of the reference block, which is broadcast to the array.
• After N^2 cycles, each of the (2w+1)^2 PEs holds one value of D(m,n) corresponding to a particular displacement (m,n).
• To find the minimum distortion value, find the minimum of each column by downshifting the D(m,n) in the PEs, then find the final minimum value by left-shifting the resulting D(m,n) in the M-PEs.
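A behavioural sketch of the search-area based mapping is given below. It is a software model under stated assumptions, not a description of the actual circuit: one accumulator stands in for each of the (2w+1)^2 PEs, one reference pixel is broadcast per cycle, the search window `y` is an (N+2w) x (N+2w) list of rows, and the column-then-row minimum search is done with plain comparisons rather than shifting. The function name is hypothetical.

```python
def search_area_array(x, y, w):
    """Model of the 2-D search-area array; returns (d_min, (m, n), cycles)."""
    n_size = len(x)
    side = 2 * w + 1
    acc = [[0] * side for _ in range(side)]  # one accumulator per PE
    cycles = 0
    for i in range(n_size):
        for j in range(n_size):
            ref = x[i][j]  # reference pixel broadcast to all PEs this cycle
            for pm in range(side):
                for pn in range(side):
                    # PE (pm, pn) handles displacement (pm - w, pn - w).
                    acc[pm][pn] += abs(ref - y[i + pm][j + pn])
            cycles += 1
    best = None
    for pn in range(side):
        # Minimum within one column of PEs.
        pm_best = min(range(side), key=lambda pm: acc[pm][pn])
        candidate = (acc[pm_best][pn], (pm_best - w, pn - w))
        # Final minimum across the column results.
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best[0], best[1], cycles
```

The returned cycle count equals N^2, since the accumulation phase consumes one reference pixel per cycle regardless of the search range.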
2-D search area based architecture
Part of the search area, of size w.(2w+N), needs to be stored in order to reduce I/O.
1-D search area based architecture
• An array of (2w+1) processing elements executes in N^2 cycles the computation of the distortion D(m,n) corresponding to one line (resp. column) of possible motion vectors.
• This process is repeated sequentially 2w+1 times for computing all the distortions.
Another architecture
• Requires only a sequential data input.
• Dummy data, denoted by dots, are inserted into the stream of reference data to guarantee a regular data flow without any data permutation within the array.
Time required = (2w+1)*(2w+1)*N
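To compare the mappings numerically, the sketch below evaluates the per-block cycle counts quoted in these notes for a given block size N and maximum displacement w. It takes the formulas at face value and ignores pipeline fill, data loading, and the latency of the minimum search; the function name and dictionary labels are hypothetical.

```python
def cycle_counts(n, w):
    """Per-block cycle counts of the array mappings, as quoted in the notes."""
    side = 2 * w + 1
    return {
        "2-D array (AB2)": side * side,
        "1-D array (AB1)": n * side * side,
        "2-D search-area array": n * n,
        "1-D search-area array": n * n * side,
    }


for name, cycles in cycle_counts(16, 8).items():
    print(name, cycles)
```

With N = 16 and w = 8 the two 2-D arrays need a few hundred cycles per block, while the two 1-D arrays need several thousand, which reflects the trade between PE count and time.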