
Appl. Math. Inf. Sci. 7, No. 2L, 717-722 (2013)

Applied Mathematics & Information Sciences
An International Journal

http://dx.doi.org/10.12785/amis/072L49

Fast Implementation of Scale Invariant Feature Transform Based on CUDA
Meng Lu 1

1 College of Information Science and Engineering, Northeastern University, China

Received: 16 Oct. 2012, Revised: 10 Jan. 2013, Accepted: 13 Jan. 2013
Published online: 1 Jun. 2013

Abstract: Scale-invariant feature transform (SIFT) is an algorithm in computer vision that detects and describes local features in images. Due to its excellent performance, SIFT is widely used in many applications, but its implementation is complicated and time-consuming. To solve this problem, this paper presents a novel acceleration algorithm for SIFT based on the Compute Unified Device Architecture (CUDA). In the algorithm, each step of SIFT is assigned to either the CPU or the GPU according to the step's characteristics and demands, to make full use of computational resources. Experiments show that, compared with the traditional implementation of SIFT, the proposed acceleration algorithm greatly increases computation speed and saves implementation time. Furthermore, the acceleration ratio grows linearly with the number of SIFT keypoints.

Keywords: CUDA acceleration, Scale-Invariant feature transform, Image feature, Feature descriptor

1 Introduction

The Scale-Invariant Feature Transform (SIFT) was published by Lowe in 1999 [1] and improved in 2004 [2]; it is used to detect and describe local image features. SIFT is an excellent feature descriptor because it is invariant to uniform scaling and orientation, and partially invariant to affine distortion and illumination changes. Its applications include object recognition, robotic mapping and navigation, image stitching, 3D modeling, gesture recognition, video tracking, individual identification of wildlife, and match moving.

The Compute Unified Device Architecture (CUDA) is a parallel computing architecture developed by Nvidia for graphics processing. CUDA is the computing engine in Nvidia graphics processing units (GPUs); it can deliver roughly ten times higher computation performance than a CPU and is accessible to software developers through variants of industry-standard programming languages. CUDA has made the acceleration of large-scale computation feasible [3,5,9]. CUDA acceleration is widely used in image processing: Zhao and colleagues derived a correspondence between the offset and the current viewpoint by establishing a mathematical function of the detail-image pixel coordinates and the height values [10].

Wu and colleagues proposed a GPU-aided parallel interpolation algorithm to scale video images in real time [11]. CUDA acceleration is also widely used in the biological sciences: Kim and colleagues proposed a GPU-accelerated method for computing the solvent excluded surface (SES) of a protein molecule in interactive time [12].

The implementation of SIFT is complicated and time-consuming, and it cannot meet the real-time requirements of some applications, such as clinical medical applications. To solve this problem, many scholars have attempted to accelerate the SIFT implementation. Sinha and colleagues presented an acceleration algorithm for SIFT extraction based on OpenGL and the Cg programming language using an Nvidia graphics card. However, they did not complete all computation steps on the GPU, which caused heavy data transfer between GPU and CPU and cost much time [6]. Zhang and colleagues accelerated SIFT on an Intel 8-core CPU and presented optimization techniques to improve the implementation's performance on multi-core systems [7], but this method has not spread widely because it is too expensive for many applications.

This paper presents an acceleration algorithm for SIFT extraction based on CUDA, which takes full advantage of the GPU's abilities in floating-point computation,

∗ Corresponding author e-mail: [email protected]⃝ 2013 NSP

Natural Sciences Publishing Cor.


parallel computation, and memory management to optimize computational resource management and data transfer.

2 SIFT

2.1 Scale-space

Scale-space is a formal theory, originating in physical and biological vision, for handling image structures at different scales by representing an image as a one-parameter family of smoothed images. The main type of scale-space is the linear (Gaussian) scale-space, which is defined by formula (1):

L(x,y,σ) = G(x,y,σ)∗ I(x,y) (1)

where I(x,y) is the image, ∗ denotes convolution, and G(x,y,σ) is the Gaussian filter function:

G(x,y,σ) = (1/(2πσ²)) e^(−(x²+y²)/(2σ²)) (2)

where (x,y) are the image coordinates and σ is the scale level. The Difference of Gaussians (DoG) scale-space can then be defined as:

D(x,y,σ) = (G(x,y,kσ) − G(x,y,σ)) ∗ I(x,y) = L(x,y,kσ) − L(x,y,σ) (3)
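The construction in formulas (1)-(3) can be sketched in a few lines. The following is a minimal NumPy/SciPy illustration of one octave, not the paper's GPU implementation; the function name `dog_pyramid` and the parameters σ = 1.6 and k = √2 are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_pyramid(image, sigma=1.6, k=2 ** 0.5, num_levels=5):
    # L(x, y, sigma) = G(x, y, sigma) * I(x, y), formula (1)
    blurred = [gaussian_filter(image.astype(np.float32), sigma * k ** i)
               for i in range(num_levels)]
    # D(x, y, sigma) = L(x, y, k*sigma) - L(x, y, sigma), formula (3)
    return [blurred[i + 1] - blurred[i] for i in range(num_levels - 1)]
```

Five Gaussian levels yield four DoG layers; a full implementation repeats this per octave after downsampling.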

2.2 Local Keypoint Detection

Once the DoG images have been obtained, keypoints are identified as local minima/maxima of the DoG images across scales. This is done by comparing each pixel in the DoG images to its eight neighbors at the same scale and the nine corresponding neighbors in each of the two adjacent scales. If the pixel value is the maximum or minimum among all compared pixels, it is selected as a candidate keypoint.
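A direct, unoptimized version of this 26-neighbor comparison can be written as follows; this is a plain NumPy sketch for clarity, not the CUDA kernel, and the name `is_extremum` is an illustrative assumption.

```python
import numpy as np

def is_extremum(dog, s, y, x):
    # Stack the 3x3 patches from the layer below, the same layer,
    # and the layer above: 27 values including the center itself.
    cube = np.stack([dog[s + ds][y - 1:y + 2, x - 1:x + 2]
                     for ds in (-1, 0, 1)])
    center = dog[s][y, x]
    # Candidate keypoint if the center is the max or min of all values.
    return center == cube.max() or center == cube.min()
```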

2.3 Keypoint Localization

Scale-space extremum detection produces too many keypoint candidates, some of which are unstable. The next step of the algorithm performs a detailed fit to the nearby data for accurate location, scale, and ratio of principal curvatures. This information allows the rejection of points that have low contrast (and are therefore sensitive to noise) or are poorly localized along an edge.

The interpolation of keypoints is done using the quadratic Taylor expansion of the DoG scale-space function:

D(X) = D + (∂Dᵀ/∂X) X + (1/2) Xᵀ (∂²D/∂X²) X (4)

Then, the location of the extremum, X̂, is determined by taking the derivative of this function with respect to X and setting it to zero:

X̂ = −(∂²D/∂X²)⁻¹ (∂D/∂X) (5)

D(X̂) = D + (1/2) (∂Dᵀ/∂X) X̂ (6)
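Formulas (5) and (6) amount to solving one small linear system per candidate. A minimal sketch follows; the finite-difference gradient and Hessian values shown are hypothetical placeholders, and the function names are assumptions.

```python
import numpy as np

def refine_keypoint(grad, hessian):
    # Formula (5): x_hat = -(d2D/dX2)^(-1) (dD/dX)
    return -np.linalg.solve(hessian, grad)

def contrast_at(d0, grad, offset):
    # Formula (6): D(x_hat) = D + (1/2) (dD/dX)^T x_hat
    return d0 + 0.5 * grad @ offset

# hypothetical finite-difference estimates at a candidate keypoint
g = np.array([0.2, -0.1, 0.05])
H = np.array([[2.0, 0.1, 0.0],
              [0.1, 2.5, 0.0],
              [0.0, 0.0, 1.5]])
off = refine_keypoint(g, H)
```

In practice the fit is recomputed at a neighboring pixel if any offset component exceeds 0.5, and the point is discarded when |D(X̂)| falls below the contrast threshold.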

The DoG function has strong responses along edges, even when the candidate keypoint's location along the edge is poorly determined and therefore unstable under small amounts of noise. Therefore, to increase stability, the Hessian matrix is used to eliminate keypoints that have poorly determined locations but high edge responses.

H = ( Dxx Dxy
      Dxy Dyy )

Tr(H) = Dxx + Dyy = α + β (7)

Det(H) = Dxx Dyy − (Dxy)² = αβ (8)

where α is the larger eigenvalue and β the smaller one. Supposing α = rβ, we obtain formula (9):

R = Tr(H)²/Det(H) = (α + β)²/(αβ) = (rβ + β)²/(rβ²) = (r + 1)²/r (9)

It follows that, for some threshold eigenvalue ratio r_th, if R for a candidate keypoint is larger than (r_th + 1)²/r_th, that keypoint is poorly localized and hence rejected.
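This rejection test needs only the trace and determinant, never the eigenvalues themselves. A minimal sketch (the function name and sample values are assumptions; the default r_th = 10 matches the threshold used in Section 4):

```python
def passes_edge_test(dxx, dyy, dxy, r_th=10.0):
    # Formulas (7)-(9): keep the keypoint only if
    # Tr(H)^2 / Det(H) < (r_th + 1)^2 / r_th.
    tr = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        # Curvatures of opposite sign: not an extremum, reject outright.
        return False
    return tr * tr / det < (r_th + 1) ** 2 / r_th
```

A corner-like point (dxx ≈ dyy) passes, while an edge-like point with one much larger principal curvature fails.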

2.4 Orientation Assignment

Each keypoint is assigned one or more orientations based on local image gradient directions. This is the key step in achieving invariance to rotation, as the keypoint descriptor can be represented relative to this orientation and therefore achieves invariance to image rotation. The gradient magnitude and orientation are computed as:

m(x,y) = √((L(x+1,y) − L(x−1,y))² + (L(x,y+1) − L(x,y−1))²)

θ(x,y) = tan⁻¹((L(x,y+1) − L(x,y−1)) / (L(x+1,y) − L(x−1,y))) (10)

where m(x,y) is the gradient magnitude and θ(x,y) is the orientation.
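Formula (10) uses simple central differences on the smoothed image L. A NumPy sketch (the helper name `grad_mag_ori` is an assumption; `arctan2` is used so the full angular range is preserved):

```python
import numpy as np

def grad_mag_ori(L, y, x):
    # Central differences per formula (10)
    dx = L[y, x + 1] - L[y, x - 1]
    dy = L[y + 1, x] - L[y - 1, x]
    return np.hypot(dx, dy), np.arctan2(dy, dx)
```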

2.5 Keypoint Descriptor

First, a set of orientation histograms is created on 4×4 pixel neighborhoods with 8 bins each. These histograms are computed from the magnitude and orientation values of samples in a 16×16 region around the keypoint, such that each histogram contains samples from a 4×4 sub-region



Fig. 1: Flow chart of the CUDA-based acceleration of the SIFT feature extraction algorithm

of the original neighborhood region. The magnitudes are further weighted by a Gaussian function with σ equal to one half the width of the descriptor window. The descriptor then becomes a vector of all the values of these histograms. Since there are 4×4 = 16 histograms, each with 8 bins, the vector has 128 elements. This vector is then normalized to unit length to enhance invariance to affine changes in illumination.
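The 4×4×8 = 128 layout can be sketched as follows. This is a simplified illustration that omits the Gaussian weighting, trilinear interpolation, and value clipping of the full pipeline; the inputs are assumed to be a 16×16 patch of precomputed gradient magnitudes and orientations, already rotated to the keypoint orientation.

```python
import numpy as np

def sift_descriptor(mags, oris):
    # 4x4 sub-regions, each contributing an 8-bin orientation histogram
    desc = np.zeros((4, 4, 8), dtype=np.float32)
    bins = (((oris % (2 * np.pi)) / (2 * np.pi)) * 8).astype(int) % 8
    for y in range(16):
        for x in range(16):
            desc[y // 4, x // 4, bins[y, x]] += mags[y, x]
    desc = desc.ravel()
    norm = np.linalg.norm(desc)
    # Normalize to unit length for illumination invariance
    return desc / norm if norm > 0 else desc
```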

3 CUDA-Based Acceleration of SIFT Computation

This section presents the full details of the CUDA-based SIFT acceleration method. We take full advantage of the GPU's abilities in floating-point computation, parallel computation, and memory management, and give reasonable roles to the CPU and GPU in the SIFT computational procedure. The flow chart of the CUDA-based SIFT acceleration algorithm is shown in Figure 1.

3.1 DoG Scale-space Construction

First, image data is transferred from host memory to device memory and bound to texture memory. The reasons for choosing texture memory are: (1) Texture memory can use the texture cache, which is optimized for 2D array data on the GPU and performs well under random access. (2) Texture memory can perform bilinear interpolation in hardware to accelerate image processing. (3) When a Gaussian filter is applied to 2D image array data, it is necessary to

check array bounds, and CUDA programs do not handle conditional statements well; texture memory was chosen because it solves this problem efficiently and automatically. The Gaussian filter's parameters are computed on the host based on formula (1), and the results are transferred from the host to device constant memory. Each image is processed in one grid. Each grid is divided into 16×16 blocks. Each block covers one part of the image and is divided into 256 threads. Each thread covers one pixel and its 8 neighbors. Intermediate results are saved in shared memory, and the final result (the Gaussian pyramid) is saved in global memory. Once the Gaussian pyramid is obtained, the DoG scale-space can be constructed based on formula (3). During DoG construction, the results depend only on adjacent layers within the same Gaussian pyramid group and have no influence on the already-obtained Gaussian pyramid. Therefore, the Gaussian pyramid can be divided into different blocks according to groups and layers, and in each block the DoG scale-space is computed by formula (3).
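The grid/block layout described above reduces to simple launch arithmetic on the host side. The following sketch (written in Python for consistency with the other examples; the function name is an assumption) shows how many 16×16 blocks are needed to cover an image at one thread per pixel:

```python
import math

def launch_config(width, height, block_dim=16):
    # Blocks of 16x16 = 256 threads, with enough blocks along each
    # axis to cover the whole image (partial edge blocks included).
    grid_x = math.ceil(width / block_dim)
    grid_y = math.ceil(height / block_dim)
    return (grid_x, grid_y), (block_dim, block_dim)

grid, block = launch_config(640, 480)
```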

3.2 Local Keypoints Detection

Suppose the Gaussian pyramid has O groups, each with S layers. The DoG then has O groups, each with S − 1 layers. Correspondingly, we obtain O groups of local keypoints, each with S − 3 layers. Therefore, the kernel function can be called O times in a loop to detect local keypoints, with each iteration processing the DoG image data for one scale. In local extremum detection, the 26-neighborhood of each candidate pixel must be considered, which greatly increases the amount of calculation. In the present paper, we first perform extremum detection in the 8-neighborhood; experiments [8,9,10] showed that about 95% of candidate pixels can be eliminated this way. We then perform extremum detection in the 26-neighborhood for the remaining pixels. This method greatly improves calculation efficiency and saves time. After the local keypoints are preliminarily obtained, the result is transferred to the host for the further selection discussed in Section 2.3. The result of this further selection is then transferred back to the device and saved in global memory.
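The two-stage filter can be sketched sequentially as below; this is a plain NumPy illustration with strict comparisons (the CUDA version runs the same logic with one thread per pixel), and the function name is illustrative.

```python
import numpy as np

def two_stage_extrema(dog, s):
    d = dog[s]
    keypoints = []
    for y in range(1, d.shape[0] - 1):
        for x in range(1, d.shape[1] - 1):
            c = d[y, x]
            # Stage 1: cheap test against the 8 in-layer neighbors,
            # which eliminates the vast majority of candidates.
            ring = np.delete(d[y - 1:y + 2, x - 1:x + 2].ravel(), 4)
            if not (c > ring.max() or c < ring.min()):
                continue
            # Stage 2: full 26-neighborhood across adjacent scales.
            full = np.concatenate([
                dog[s - 1][y - 1:y + 2, x - 1:x + 2].ravel(), ring,
                dog[s + 1][y - 1:y + 2, x - 1:x + 2].ravel()])
            if c > full.max() or c < full.min():
                keypoints.append((y, x))
    return keypoints
```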

3.3 Keypoint orientation assignment

Each keypoint is processed in one block to calculate the gradient orientation and magnitude by formula (10). Each block is divided into 16×16 threads, and each thread processes one pixel (shown in Figure 2).

3.4 Keypoint Descriptor

The present paper formulates the keypoint descriptor as a 128-dimensional vector. It is similar to the keypoint



Fig. 2: Gradient orientation and magnitude of keypoints

orientation calculation in that each vector is computed in one block, which is divided into 16×16 threads. All 256 threads are divided into 4×4 sub-regions, and each sub-region is arranged into a histogram with eight bins based on the gradient orientation results. Each thread calculates one pixel's contribution to the histogram's bins.

4 Experiments and Results

Workstation’s hardware and software used in theexperiments was shown in table 1.

In the experiment, the present CUDA-based acceleration method was compared with standard SIFT [1,2]. We focused on the computational efficiency and computing time of the two SIFT implementations. Standard SIFT was implemented on the CPU. Test images were selected from the BSDS500 database of the UC Berkeley Computer Vision Group. To verify the correlation between the number of SIFT keypoints and computational efficiency, the comparisons were made at different image resolutions: 300×200, 640×480, 800×640, 1200×800, and 1600×1200. The SIFT parameters were set as follows: (1) the number of groups was determined by the image resolution; (2) 3 layers per group; (3) the DoG threshold was 0.04; (4) the principal-curvature threshold was 10; (5) the keypoint descriptor used 4×4 sub-regions.

Fig. 3: Result of SIFT keypoints at 300×200 image resolution; the number of keypoints detected by the present algorithm is 125

The SIFT keypoint calculation results for the different image resolutions are shown in Figures 3-7. The number of detected keypoints increased with the image resolution.

The comparison of computational efficiency between the present algorithm and standard SIFT is shown in Table 2. The numbers of keypoints detected by the present algorithm and by standard SIFT were almost the same at every image resolution, which means that the present accelerated SIFT method detects accurate keypoints. From Table 2 we can conclude that the higher the image resolution, the more SIFT keypoints are obtained, and the higher the present algorithm's speed-up ratio.

The correlation between the speed-up ratio and the number of SIFT keypoints is shown in Fig. 8. The acceleration ratio is approximately linear in the number of SIFT feature points.

Table 1: Computer’s hardware and software used in theexperiments

Items DetailsCPU Intel Xeon

X3440 2.53G2Memory DDR3

1333MHZ8G

GPU Nvidia Quadro600

Number of multi-processor

2

Number of core in eachprocessor

48

Video memory 1G DDR5Thread maximum ineach block

1024

Data transferring speedbetween host anddevice

3500M/s

Data transferring speedfrom device to device

7800M/s

Operation system Win7Professional



Table 2: Comparison between the present algorithm and standard SIFT at different image resolutions

Items                                Image A    Image B    Image C    Image D     Image E
Image resolution                     300×200    640×480    800×640    1200×800    1600×1200
Keypoints, standard SIFT             127        1746       3186       5761        8672
Time, standard SIFT (ms)             365        1609       2815       5260        9053
Keypoints, present algorithm         125        1781       3175       5736        8677
Time, present algorithm (ms)         135        311        361        426         463
Speed-up ratio                       2.70       5.16       7.78       12.36       19.54

Fig. 4: Result of SIFT keypoints at 640×480 image resolution; the number of keypoints detected by the present algorithm is 1781

Fig. 5: Result of SIFT keypoints at 800×640 image resolution; the number of keypoints detected by the present algorithm is 3175

Fig. 6: Result of SIFT keypoints at 1200×800 image resolution; the number of keypoints detected by the present algorithm is 5736

Fig. 7: Result of SIFT keypoints at 1600×1200 image resolution; the number of keypoints detected by the present algorithm is 8677

Fig. 8: The correlation between the speed-up ratio and the number of SIFT keypoints for the present algorithm; the acceleration ratio is approximately linear in the number of SIFT feature points

5 Conclusion

This paper presented a CUDA-based acceleration algorithm for SIFT feature extraction. The present algorithm greatly enhances SIFT computational speed, and the speed-up ratio increases linearly with the number of SIFT keypoints.

Acknowledgement

This research was supported by the National Natural Science Foundation of China (61101057). The authors are grateful to the anonymous referee for carefully checking the details and for helpful comments that improved this paper.



References

[1] Lowe D. Object Recognition from Local Scale-Invariant Features. International Conference on Computer Vision (ICCV), 1999: 1150-1157.

[2] Lowe D. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 2004, 60(2): 91-110.

[3] Nian H. GPGPU and Image Matching Parallel Algorithm Based on SIFT [D]. XiDian University, 2010.

[4] Wang R, Liang H, Cai XP. Study of SIFT Feature Extraction Algorithm Based on GPU. Modern Electronics Technique, 2010, 15: 41-43.

[5] Zuo HR, Zhang QH, Xu Y. Fast Sobel Edge Detection Algorithm Based on GPU. Opto-Electronic Engineering, 2009, 36(1): 41-43.

[6] Sinha S, Frahm JM, Pollefeys M. Feature Tracking and Matching in Video Using Programmable Graphics Hardware. Available from: http://www.springerlink.com/content/5rv615p24360/.

[7] Zhang Q, Chen Y, Zhang YM. SIFT Implementation and Optimization for Multi-core Systems. IEEE International Symposium on Parallel and Distributed Processing, 2008: 1-8.

[8] Tian W, Xu F, Wang HY, Zhou B. Fast Scale Invariant Feature Transform Algorithm Based on CUDA. Computer Engineering, 2010, 36(8): 219-221.

[9] Gan XB, Wang ZY, Shen L, Zhu Q. Data Layout Pruning on GPU. Applied Mathematics & Information Sciences, 2011, 5(2): 129-138.

[10] Zhao CJ, Wu HR, Gao RH. Realistic and Detail Rendering of Village Virtual Scene Based on Pixel Offset. Applied Mathematics & Information Sciences, 2012, 6(3): 769-775.

[11] Wu TH, Bai BG, Wang P. Parallel Catmull-Rom Spline Interpolation Algorithm for Image Zooming Based on CUDA. Applied Mathematics & Information Sciences, 2013, 7(2): 533-537.

[12] Kim B, Kim KJ, Seong JK. GPU Accelerated Molecular Surface Computing. Applied Mathematics & Information Sciences, 2012, 6(1): 185-194.

Meng Lu received the MS degree in Computer Science from Northeastern University in 2007, and the PhD degree in Computer Science from Northeastern University in 2010. He is currently a lecturer at Northeastern University. His research

interests are in the areas of computer vision, medical image processing, and 3D visualization. He has published research articles in reputed international journals of the mathematical and engineering sciences.
