“Efficient Variable Size Template Matching

Using Fast Normalized Cross Correlation on

Multicore Processors”

Durgaprasad Gangodkar, Sachin Gupta, Gurbinder Gill,

Padam Kumar, Ankush Mittal

Department of Electronics and Computer Engineering

INDIAN INSTITUTE OF TECHNOLOGY

Roorkee

India

Contents

1. Introduction

2. NVIDIA’s Compute Unified Device Architecture

3. Normalized and Fast Normalized Cross Correlation

4. Parallel Implementation of Fast Normalized Cross

Correlation

5. Experimental Details and Performance Evaluation

6. Conclusion


1. Introduction

Template matching has applications in image and signal processing, such as image registration, object detection and pattern matching. Given a source image and a template, the matching algorithm finds the location of the template within the image according to a chosen similarity measure.

• Full search (FS) or exhaustive search algorithms consider every pixel in the block to find the best match, which is computationally very expensive.

• Several similarity measures have been proposed. An empirical study found that NCC provides the best performance in all image categories in the presence of various image distortions [9]. NCC is also more robust against image variations such as illumination changes than the widely used SAD (sum of absolute differences) and MAD (mean absolute difference).

• However, NCC is computationally far more expensive than SAD or MAD, which is a significant drawback in real-time applications.

• In this paper we propose a parallel implementation of full-search template matching that uses NCC as the similarity measure together with pre-computed sum-tables [10][11], referred to as FNCC, for high-resolution images on NVIDIA’s general-purpose graphics processing units (GP-GPUs).

2. NVIDIA’s Compute Unified Device Architecture

• GP-GPUs have emerged as front runners for low-cost high-performance computing (HPC) machines.

• GTX280 can provide theoretical peak performance of

around 933 GFLOPs (single precision) and 78 GFLOPs

(double precision).

• A kernel executes a scalar sequential program on a set of

parallel threads. The programmer organizes these threads

into a grid of thread blocks.
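The grid-of-blocks organization can be illustrated with a small host-side simulation. This is only a sketch in Python, not CUDA: `launch` and `scale_kernel` are hypothetical names, and the loops run sequentially where a GPU would run the threads in parallel. The index computation mirrors the usual CUDA expression `blockIdx.x * blockDim.x + threadIdx.x`.

```python
def launch(grid_dim, block_dim, kernel, *args):
    """Simulate a 1-D kernel launch: call the kernel once per (block, thread) pair."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def scale_kernel(block_idx, block_dim, thread_idx, data, factor, out):
    """Scalar per-thread program: each thread handles at most one element."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(data):  # guard: the grid may contain more threads than elements
        out[i] = data[i] * factor


data = list(range(10))
out = [0] * len(data)
# 3 blocks of 4 threads give 12 threads, enough to cover 10 elements.
launch(3, 4, scale_kernel, data, 2, out)
```

The bounds check inside the kernel is what allows the grid size to be rounded up to a whole number of blocks.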

Challenges:

• Higher global memory latency

• Higher CPU – Device data transfer latency

• Limited availability of registers

• Limited high-speed shared memory

• Thread synchronization and dynamic kernel configuration

Main contributions of this paper:

1. Novel strategy for parallel calculation of sum-tables using the prefix-sum algorithm, which optimally uses the high-speed shared memory of the GPU.

2. Adaptation of the kernel configuration to variable-sized templates and efficient use of the shared memory offered by CUDA.

3. Exploitation of the asynchronous nature of kernel calls to optimally distribute computation between host and device.

4. Data parallelism in the algorithms, obtained by dividing computationally intensive tasks for parallel and scalable execution on the multiple cores.

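3. Normalized and Fast Normalized Cross Correlation

• NCC has been commonly used as a metric to evaluate the similarity (or dissimilarity) between two compared images [8][9].

• A template $t$ of size $N_x \times N_y$ is matched with an image $f$ of size $M_x \times M_y$.

• The position $(u, v)$ of the template $t$ in the image $f$ is determined by calculating the NCC value at every step.

• The basic equation for NCC is given in (1):

$$\gamma_{u,v} = \frac{\sum_{x,y} \left(f(x,y) - \bar{f}_{u,v}\right)\left(t(x-u,\,y-v) - \bar{t}\right)}{\sqrt{\sum_{x,y} \left(f(x,y) - \bar{f}_{u,v}\right)^2 \sum_{x,y} \left(t(x-u,\,y-v) - \bar{t}\right)^2}} \tag{1}$$

Here $\bar{f}_{u,v}$ is the mean of $f$ in the region under the template and $\bar{t}$ is the mean of the template.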

• Direct computation of (1) involves on the order of $N_x N_y (M_x - N_x)(M_y - N_y)$ calculations.

• For example, matching a small 16×16 pixel template with a 250×250 pixel image requires more than 14 million calculations.
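A minimal Python sketch of the direct evaluation of (1) at a single position, together with the operation count quoted above. The function names `ncc_at` and `direct_cost` are hypothetical, the image is assumed to be a list of rows indexed as `image[y][x]`, and returning 0.0 for a flat patch is a convention chosen here, not something the slides specify.

```python
import math


def ncc_at(image, template, u, v):
    """NCC of eq. (1) with the template's top-left corner at column u, row v."""
    ny, nx = len(template), len(template[0])
    n = nx * ny
    patch = [row[u:u + nx] for row in image[v:v + ny]]
    f_mean = sum(map(sum, patch)) / n      # mean of the image under the template
    t_mean = sum(map(sum, template)) / n   # mean of the template
    num = f_var = t_var = 0.0
    for y in range(ny):
        for x in range(nx):
            df = patch[y][x] - f_mean
            dt = template[y][x] - t_mean
            num += df * dt
            f_var += df * df
            t_var += dt * dt
    denom = math.sqrt(f_var * t_var)
    return num / denom if denom > 0 else 0.0  # flat patch or template: no defined NCC


def direct_cost(mx, my, nx, ny):
    """Order-of-magnitude count of multiply-adds for exhaustive direct NCC."""
    return nx * ny * (mx - nx) * (my - ny)


cost = direct_cost(250, 250, 16, 16)  # the 16x16 template on a 250x250 image
```

Because both the patch and the template are mean-subtracted and normalized, scaling the template's intensities by a positive gain leaves the score unchanged, which is the robustness to illumination changes mentioned in the introduction.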


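Fast Normalized Cross Correlation (FNCC)

• The mean of the image under the template at position $(u, v)$ is given by (2):

$$\bar{f}_{u,v} = \frac{1}{N_x N_y} \sum_{x=u}^{u+N_x-1} \sum_{y=v}^{v+N_y-1} f(x,y) \tag{2}$$

• The denominator of the NCC equation is calculated using the concept of sum-tables [10][11].

• $s(u,v)$ and $s^2(u,v)$ are sum-tables over the image function and the image energy, respectively.

• The sum-tables of the image function and image energy are computed recursively.

[Equations (1)-(4) of the slide: sum-table recursions, not recoverable from the extracted text]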
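For reference, the sum-table recursions in the formulation of Briechl and Hanebeck [10] are commonly written as below. This is a reconstruction from that standard formulation, not a transcription of the slide, and the boundary convention ($s$ and $s^2$ are zero whenever an index is negative) is an assumption. The last line shows why the two tables suffice for the image part of the denominator of the NCC equation.

```latex
\begin{align*}
s(u,v)   &= f(u,v)   + s(u-1,v)   + s(u,v-1)   - s(u-1,v-1) \\
s^2(u,v) &= f^2(u,v) + s^2(u-1,v) + s^2(u,v-1) - s^2(u-1,v-1) \\
\sum_{x=u}^{u+N_x-1} \sum_{y=v}^{v+N_y-1} f(x,y)
  &= s(u+N_x-1,\, v+N_y-1) - s(u-1,\, v+N_y-1) \\
  &\quad - s(u+N_x-1,\, v-1) + s(u-1,\, v-1) \\
\sum_{x,y} \bigl(f(x,y) - \bar{f}_{u,v}\bigr)^2
  &= \sum_{x,y} f^2(x,y) - \frac{1}{N_x N_y} \Bigl(\sum_{x,y} f(x,y)\Bigr)^2
\end{align*}
```

The window sum over $f^2$ is obtained from $s^2$ in the same way as the window sum over $f$ is obtained from $s$, so each template position needs only a constant number of table lookups for the image term.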

4. Parallel Implementation of Template

Matching

• Although FNCC reduces computation time for low-resolution images, it still takes substantial time for high-resolution images.

• We adopt a two-stage approach to template matching:

– In the first stage we parallelize the computation of the sum-tables.

– In the second stage we parallelize the computation of the normalized cross correlation, using the sum-tables as lookup tables.

Computation of Sum-Tables

• The sum tables are calculated by taking the cumulative sum

over the image points.

• We make use of the parallel prefix-sum algorithm, as shown in the figure.

The figure illustrates the working of the prefix-sum algorithm, in which n/2 threads work in parallel to calculate the prefix sum in O(log n) time.
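The slides do not say which prefix-sum variant is used. The n/2-thread, O(log n) description matches the well-known work-efficient (up-sweep and down-sweep) scan, so the sketch below assumes that variant and simulates it sequentially in Python. The name `blelloch_scan` is hypothetical, and each iteration of an inner `for` loop stands for one GPU thread in that step.

```python
def blelloch_scan(values):
    """Inclusive prefix sum via a simulated work-efficient (up-sweep/down-sweep) scan."""
    n = len(values)
    size = 1
    while size < n:  # pad to the next power of two
        size *= 2
    data = list(values) + [0] * (size - n)

    # Up-sweep (reduce): log2(size) steps, at most size/2 active threads per step.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            data[i] += data[i - stride]
        stride *= 2

    # Down-sweep: clear the root, then push partial sums back down the tree.
    data[size - 1] = 0
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]
            data[i] += left
        stride //= 2

    # The tree produces an exclusive scan; adding the input makes it inclusive.
    return [data[i] + values[i] for i in range(n)]


row = [3, 1, 7, 0, 4, 1, 6, 3]
scanned = blelloch_scan(row)
```

On a GPU the `data` array would typically live in shared memory, with a barrier between successive steps of each sweep.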

• Sum-tables for the template are computed on the host CPU while the GPU is busy calculating the sum-tables for the source image, exploiting the asynchronous nature of kernel calls. This eliminates idling of the host CPU while the device is busy.

• One row is assigned to each thread block.

• The task of each thread, grouped in a block configuration, is decided dynamically by the template size.

• Every thread caches data in shared memory for template images of variable resolution.

• Parallel prefix-sum → transpose → parallel prefix-sum → transpose → sum-table.

• Device pointers are used across a total of four kernels to avoid data transfer latencies.
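The prefix-sum, transpose, prefix-sum, transpose pipeline can be sketched in plain Python as follows. The helper names are hypothetical, `itertools.accumulate` stands in for the parallel row scan, and `window_sum` shows the four-corner lookup that the sum-table makes possible.

```python
from itertools import accumulate


def row_prefix_sums(matrix):
    """Inclusive prefix sum of every row (one row per thread block on the GPU)."""
    return [list(accumulate(row)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def sum_table(image):
    """Row scan -> transpose -> row scan -> transpose."""
    return transpose(row_prefix_sums(transpose(row_prefix_sums(image))))


def window_sum(table, top, left, height, width):
    """Sum of image[top:top+height][left:left+width] from four table lookups."""
    bottom, right = top + height - 1, left + width - 1
    total = table[bottom][right]
    if top > 0:
        total -= table[top - 1][right]
    if left > 0:
        total -= table[bottom][left - 1]
    if top > 0 and left > 0:
        total += table[top - 1][left - 1]
    return total


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
table = sum_table(image)
energy_table = sum_table([[p * p for p in row] for row in image])
```

Transposing between the two scans means the same row-wise scan kernel can be reused for the column direction, which is presumably why the pipeline above contains two transposes.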

Template matching using FNCC

• For a template of size $N_x \times N_y$ pixels we divide the source image into search windows of $2N_x \times 2N_y$ pixels.

• The correlation value is calculated using the sum-tables as lookup tables, moving the template over the referenced search window pixel by pixel until the entire search window is covered.

• Highest Correlation indicates best match

• The task of computing correlation for each search window

is assigned to a single thread.

• The target image is dynamically divided into search

windows according to the x and y dimensions of the

variable sized template such that we get the maximum

number of threads per block.

• Every thread block dynamically caches data such that the shared-memory constraint (16 KB per block) is never violated.
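The per-window work described above can be sketched in Python as follows. This is an illustration under assumptions, not the authors' kernel: `best_match` is a hypothetical name, it scans whatever region it is given (one search window in the scheme above, where each window is handled by one thread), and positions with a flat image patch are skipped. The image term of the denominator comes from sum-table lookups, while the numerator is still accumulated directly.

```python
import math
from itertools import accumulate


def sum_table(img):
    """2-D inclusive prefix sums (scan the rows, then the columns)."""
    rows = [list(accumulate(r)) for r in img]
    cols = [list(accumulate(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def window_sum(table, top, left, h, w):
    bottom, right = top + h - 1, left + w - 1
    total = table[bottom][right]
    if top > 0:
        total -= table[top - 1][right]
    if left > 0:
        total -= table[bottom][left - 1]
    if top > 0 and left > 0:
        total += table[top - 1][left - 1]
    return total


def best_match(image, template):
    """Return (score, row, col) of the highest NCC over all valid positions."""
    th, tw = len(template), len(template[0])
    n = th * tw
    t_mean = sum(map(sum, template)) / n
    t_zero = [[p - t_mean for p in row] for row in template]  # zero-mean template
    t_energy = sum(p * p for row in t_zero for p in row)
    s = sum_table(image)                                      # image function
    s2 = sum_table([[p * p for p in row] for row in image])   # image energy
    best = (-2.0, -1, -1)
    if t_energy <= 1e-12:
        return best  # flat template: NCC undefined everywhere
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            f_sum = window_sum(s, r, c, th, tw)
            f_energy = window_sum(s2, r, c, th, tw) - f_sum * f_sum / n
            if f_energy <= 1e-12:
                continue  # flat patch: skip
            # The image-mean term drops out because t_zero sums to zero.
            num = sum(image[r + i][c + j] * t_zero[i][j]
                      for i in range(th) for j in range(tw))
            score = num / math.sqrt(f_energy * t_energy)
            if score > best[0]:
                best = (score, r, c)
    return best


image = [[1, 2, 1, 0, 3],
         [0, 1, 3, 2, 1],
         [2, 0, 9, 4, 0],
         [1, 3, 5, 8, 2],
         [0, 2, 1, 0, 1]]
template = [[9, 4],
            [5, 8]]
score, row, col = best_match(image, template)
```

Since every window is processed independently, no communication between threads is needed; only the final maximum over all windows has to be combined.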

5. Experimental Details and Performance

Evaluation

• Execution time and speedup of the proposed parallel implementation of the FNCC algorithm were evaluated on a benchmark dataset.

• The sequential code was implemented on an Intel Xeon 3.2 GHz processor with 1 GB of DRAM and 32-bit Windows XP.

• The parallel code was implemented on an NVIDIA GTX 280 with 1 GB of onboard GDDR3 memory, hosted by an Intel Xeon 3.2 GHz processor with 1 GB of DRAM and 32-bit Windows XP.

• For a frame size of 2048x1080 and a template size of 16x16, execution time drops from 4.116 s to 239 ms, a speedup of around 17x.

Image Size    Template Size   CUDA                                                  Sequential     Speedup
in pixels     in pixels       Thread Blocks   Threads Per Block   Time in sec.     Time in sec.

512x512       32x32           5x8             3x2                 0.517             1.372          2.7
              24x32           8x5             2x5                 0.260             1.097          4.3
              24x16           5x6             6x4                 0.047             0.543         11.6
              16x16           5x6             7x6                 0.033             0.406         12.3

1024x1024     32x32           9x16            3x2                 1.311             6.170          4.8
              24x32           16x9            2x5                 0.639             4.773          7.5
              24x16           10x11           6x4                 0.179             2.518         14.1
              16x16           10x11           7x6                 0.121             1.893         15.6

2048x1080     32x32           10x32           3x2                 2.848            13.474          4.8
              24x32           17x17           2x5                 1.261            10.344          8.3
              24x16           11x22           6x4                 0.391             5.551         14.3
              16x16           10x22           7x6                 0.239             4.116         17.3
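The speedup column is the ratio of sequential to CUDA execution time. The short Python check below recomputes it from the times in the table. The variable names are hypothetical. The recomputed ratios agree with the reported values to within about 0.1, a difference that is presumably due to the times having been rounded to three decimals.

```python
# (cuda_seconds, sequential_seconds, reported_speedup), copied from the table
rows = [
    (0.517, 1.372, 2.7), (0.260, 1.097, 4.3), (0.047, 0.543, 11.6), (0.033, 0.406, 12.3),
    (1.311, 6.170, 4.8), (0.639, 4.773, 7.5), (0.179, 2.518, 14.1), (0.121, 1.893, 15.6),
    (2.848, 13.474, 4.8), (1.261, 10.344, 8.3), (0.391, 5.551, 14.3), (0.239, 4.116, 17.3),
]


def speedup(sequential, parallel):
    """Speedup = sequential time / parallel time."""
    return sequential / parallel


recomputed = [speedup(seq, cuda) for cuda, seq, _ in rows]
max_gap = max(abs(r - reported) for r, (_, _, reported) in zip(recomputed, rows))
```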

• As the resolution of the image increases, the speedup obtained also increases, opening up scope for handling high-resolution digital images.

6. Conclusion

• Every thread has been assigned an independent task of computing the correlation for the template, which eliminates inter-thread communication, inter-thread dependencies and synchronization.

• Dynamic arrangement of threads into blocks and grids has been done depending on the size of the template.

• We have also devised an efficient strategy to make use of the faster shared memory to overcome memory access latency.

• The thread configuration is scalable to match low-resolution or high-resolution images and templates of varying size.

• Our future work involves exploring the division of larger templates into smaller sub-templates to further exploit the computational power of multicore processors.

References

1. Ryan, T. W.: The Prediction of Cross-Correlation Accuracy in Digital Stereo-Pair Images. PhD thesis, University of Arizona (1981)

2. Burt, P. J., Yen, C., Xu, X.: Local Correlation Measures for Motion Analysis: A Comparative Study. In: IEEE Conf. Pattern Recognition and Image Processing, pp. 269–274. IEEE Press, Las Vegas (1982)

3. Essannouni, L., Ibn-Elhaj, E., Aboutajdine, D.: Fast Cross-Spectral Image Registration Using New Robust Correlation. In: Journal of Real-Time Image Processing, vol. 1, no. 2, pp. 123-12. Springer (2006)

4. Minoru, M., Kunio, K.: Fast Template Matching Based on Normalized Cross Correlation Using Adaptive Block Partitioning and Initial Threshold Estimation. In: IEEE International Symposium on Multimedia, pp. 196–203. IEEE Press, Taichung, Taiwan (2010)

5. Luo, J., Konofagou, E. E.: A Fast Normalized Cross-Correlation Calculation Method for Motion Estimation. In: IEEE Trans. on Ultrasonics, Ferroelectrics and Frequency Control, vol. 57, no. 6, pp. 1347–1357. (2010)

6. Zhu, S., Ma, K. K.: A New Diamond Search Algorithm for Fast Block Matching Motion Estimation. In: IEEE Trans. Image Processing, vol. 9, no. 2, pp. 287–290. (2000)

7. Tham, J. Y., Ranganath, S., Ranganath, M., Kassim, A. A.: A Novel Unrestricted Center-Biased Diamond Search Algorithm for Block Motion Estimation. In: IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 4, pp. 369–377. (1998)

8. Zhu, C., Lin, X., Chau, L.: Hexagon-Based Search Pattern for Fast Block Motion Estimation. In: IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 5, pp. 349–355. (2002)

9. Lewis, J. P.: Fast Template Matching. In: Vision Interface 95, Canadian Image Processing and Pattern Recognition Society, pp. 120–123. Quebec City, Canada (1995)

10. Briechl, K., Hanebeck, U. D.: Template Matching Using Fast Normalized Cross Correlation. In: SPIE, vol. 4387, no. 95. AeroSense Symposium, Orlando, Florida (2001)

11. NVIDIA CUDA Programming Guide, Version 2.2, pp. 10, 27–35, 75–97. (2009)

12. Hii, A. J. H., Hann, C. E., Chase, J. G., Van Houten, E. E. W.: Fast Normalized Cross Correlation for Motion Tracking Using Basis Functions. In: Journal of Computer Methods and Programs in Biomedicine, vol. 82, no. 2, pp. 144–156. Elsevier (2006)

Thank You
