
IEEE Transactions on Consumer Electronics, Vol. 56, No. 2, May 2010

Contributed Paper
Manuscript received December 22, 2009. Current version published June 29, 2010; electronic version published July 6, 2010.
0098-3063/10/$20.00 © 2010 IEEE

Panoramic Video using Scale-Invariant Feature Transform with Embedded Color-Invariant Values

Oh-Seol Kwon and Yeong-Ho Ha, Senior Member, IEEE

Abstract — The resolutions offered by today’s multimedia vary significantly owing to the development of video technology. For example, there is a huge gap between the resolution of cellular phones as small input devices and beam projectors as large output devices. Thus, panoramic video technology is one method that can convert a small resolution into a large resolution to lend realism and wide vision to a scene. Yet, transforming the resolution of an image requires feature or object matching based on extracting important information from the image, where the scale-invariant feature transform (SIFT) is one of the most robust and widely used methods. However, identifying corresponding points becomes difficult in the case of changing illumination or two surfaces with a similar intensity, as SIFT extracts features using only gray information. Therefore, this paper proposes a method of image stitching based on color-invariant features for automated panoramic videos. Color-invariant features can discount the illumination, highlights, and shadows in a scene, as they include the property of the surface reflectance independent of illumination changes. The effectiveness and accuracy of the feature matching with the proposed algorithm are verified using objects and illuminations in a booth, followed by panoramic videos.1

Index Terms — Panoramic Video, SIFT, Image Mosaicking.

I. INTRODUCTION

Image stitching has recently been attracting interest as an effective way of increasing the restricted field of view of a camera by combining a set of separate images into a single seamless image. This technique has already been widely applied to such areas as video compression, video indexing, image alignment, and panoramic videos [1], [2], where panoramic technology in particular has been applied to lens falloff [3], exposure mismatches [4], and vignettes [5], [6]. As such, this paper focuses on how to derive clues from an image to implement panoramic technology.

Currently, there are two basic ways to determine corresponding points or regions in overlapped images: feature-point matching and block matching. In general, block-matching algorithms correlate regular-sized blocks generated in sequential images, and typical methods include NCC (Normalized Cross-Correlation) [7] and phase correlation using an FFT (Fast Fourier Transform) [8], where the former calculates the intensity similarity between blocks in corresponding images, while the latter considers the cross power spectrum using the FFT properties between images. Meanwhile, feature-point matching algorithms generally use local descriptors representing neighborhoods of feature points extracted from each image. This type of method based on feature points has recently attracted more attention due to its simple geometric transformation based on camera motion and includes such methods as SIFT [9], the Harris Corner Detector [10], the SUSAN Corner Detector [11], and the Morphological Corner Detector [12]. In particular, the SIFT algorithm delivers outstanding performance for feature-point matching that is invariant to image scaling, rotation, and changes in the illumination luminance, opening up a variety of applications [13]. However, since most feature-point matching algorithms are based on gray values, there are some limitations when finding corresponding points. For example, in the case of two local descriptors with different colors yet a similar gray level, the two feature points are regarded as corresponding points when the color information is converted to gray values, resulting in errors in feature detection and image mosaicking. Accordingly, this paper proposes a method of image mosaicking using color-invariant features, where the color-invariant values are determined based on the surface reflectance property of an object. Local descriptors are then determined after detecting features based on the color-invariant values. As a result, images and panoramic videos can be stitched using a modified version of the SIFT algorithm with embedded color-invariant values instead of gray levels. In experiments, the validity of the color-invariant values is demonstrated and applied to a panoramic video from a rotating camera.

Oh-Seol Kwon was with the School of Electrical Engineering and Computer Science, Kyungpook National University, Taegu, Korea. He is now with the School of Arts and Science at New York University, New York, NY 10003 USA (e-mail: [email protected]).

Yeong-Ho Ha is with the School of Electrical Engineering and Computer Science, Kyungpook National University, Taegu, Korea (e-mail: [email protected]).

II. SCALE-INVARIANT FEATURE TRANSFORM

The SIFT algorithm is widely used due to various advantages, including its robustness to rotation, scaling, and changes in luminance. The algorithm consists of the following four steps: scale-space extrema detection, keypoint localization, orientation assignment, and keypoint descriptor computation. In the first step, images are reproduced at different scales, which are defined as octaves. A difference-of-Gaussian (DoG) image with different sigma values is then calculated for each octave, and keypoint candidates are selected as local minima or maxima using a 3×3 mask across adjacent DoG images. In the second step, two methods are used to extract more stable features from the keypoint candidates: the first sets a critical coefficient for smooth regions in the DoG images, while the other uses a Hessian matrix for edge regions. After localizing the keypoints, one or more orientations are assigned to each keypoint location based on the local image gradient directions. In the third step, the orientation is quantized into 36 bins of ten degrees within a 16×16 sample array window. In the last step, a keypoint descriptor is computed based on eight directions aligned in a 4×4 grid. As a result, the descriptor is a 128-element feature vector for each keypoint. A flowchart of the SIFT algorithm is given in Fig. 1. In addition, to reduce the effects of changes in the illumination intensity, the feature vector is modified using unit-length normalization.

Fig. 1. Flowchart of the scale-invariant feature transform.
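To make the first stage concrete, the following minimal sketch (not the authors' implementation; the sigma values, threshold, and function names are illustrative assumptions) builds one octave of DoG images with SciPy and scans a 3×3×3 neighborhood for local extrema:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_extrema(gray, sigmas=(1.6, 2.26, 3.2, 4.52), thresh=0.03):
    """Build one octave of difference-of-Gaussian images and return
    coordinates that are local extrema in a 3x3x3 neighborhood."""
    gray = gray.astype(np.float32) / 255.0
    blurred = [gaussian_filter(gray, s) for s in sigmas]
    dogs = np.stack([b2 - b1 for b1, b2 in zip(blurred, blurred[1:])])
    keypoints = []
    for k in range(1, dogs.shape[0] - 1):              # interior DoG levels only
        for y in range(1, gray.shape[0] - 1):
            for x in range(1, gray.shape[1] - 1):
                cube = dogs[k - 1:k + 2, y - 1:y + 2, x - 1:x + 2]
                v = dogs[k, y, x]
                if abs(v) > thresh and (v == cube.max() or v == cube.min()):
                    keypoints.append((x, y, k))       # candidate before refinement
    return keypoints
```

The later localization, orientation, and descriptor stages then operate on these candidates; in the proposed method the input channel would be the color-invariant value described in Section III rather than the gray image.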

III. GENERATION OF COLOR-INVARIANT FEATURES

Most conventional feature-based image matching methods use a gray channel instead of color information to reduce the complexity. However, the loss of color information can result in certain features not being detected, even in color edge regions. There are two main cases of detection failure as a result of color loss. The first involves boundary regions with a different hue yet similar gray level. For example, in the case of an original image consisting of two regions, a magenta background and a cyan circle, as shown in Fig. 2, the two regions have a different hue yet a similar gray-level value, meaning that no feature will be detected when using methods based on gray levels. The second case is a change in the chromaticity of the illumination in a scene, as shown in Figs. 3(a) and 3(b). Here, the difference in the contrast of the two images can be confirmed by scaling up seven times, as shown in Figs. 3(c) and 3(d). As a result, more features are detected in Fig. 3(c) than in Fig. 3(d), as the white words become reddish under the chromaticity of illuminant A, thereby reducing the contrast between the words and the background.

Fig. 2. Test images with similar intensity yet different hue: (a) original image and (b) gray image for SIFT.

Fig. 3. Local feature under different illuminations: (a) image under D65, (b) image under illuminant A, (c) 7× scaled-up version of (a), and (d) 7× scaled-up version of (b).

In general, images from input devices are affected by the ambient illumination, and changes in the illumination reduce the matching ratio. Thus, if the surface reflectance property of an object could be obtained regardless of the illumination, this would solve the abovementioned problems of missed color edges and reduced contrast. Therefore, this paper proposes the use of color-invariant features and a panoramic video method based on a modified SIFT with embedded color information. Virtual camera responses are first calculated to create the color-invariant values, where the virtual responses are built from non-overlapping sensitivities and illuminants on the Planckian locus. Digital values acquired from a real camera are then converted into virtual responses, and the color-invariant values are calculated by rotating the virtual responses by a constant angle. Finally, image mosaicking and a panoramic video can be created with the color-invariant values embedded in the SIFT algorithm.

A. Method for Creating Color-Invariant Values

In general, a camera response can be represented as follows:

\rho_i^{xy} = \int_w E(\lambda)\,R(\lambda)\,S_i(\lambda)\,d\lambda, \quad i \in \{R, G, B\}  (1)

where $\rho_i^{xy}$ is the camera response at pixel coordinates $xy$, $E(\lambda)$ is the illumination, $R(\lambda)$ is the surface reflectance, and $S_i(\lambda)$ is the camera sensitivity, integrated over the visible range $w$. Eq. (1) can be simplified if it is assumed that the camera sensitivities are delta functions at wavelengths $\lambda_k$:

\rho_k^{xy} = \int_w E(\lambda)\,R(\lambda)\,\delta(\lambda - \lambda_k)\,d\lambda = E(\lambda_k)\,R(\lambda_k).  (2)



The color response of an illuminant at a Planckian locus for any pixel is as follows:

\rho_k = c_1 \lambda_k^{-5} R(\lambda_k)\exp\!\left(-\frac{c_2}{T\lambda_k}\right)  (3)

where the Planckian illuminant is

E(\lambda, T) = c_1 \lambda^{-5}\left[\exp\!\left(\frac{c_2}{\lambda T}\right) - 1\right]^{-1}.  (4)

Here, $c_1 = 3.74183 \times 10^{-16}\,\mathrm{W\,m^2}$ and $c_2 = 1.4388 \times 10^{-2}\,\mathrm{m\,K}$. The surface and illuminant components can be separated by taking the natural logarithm of Eq. (3) [14]:

\ln\rho_k = \ln\!\left(c_1 \lambda_k^{-5} R(\lambda_k)\right) - \frac{c_2}{T\lambda_k}.  (5)

The ratio for each channel is calculated when the color responses for each RGB channel are determined using the same process as follows:

\ln\frac{\rho_R}{\rho_G} = R_R - R_G + \frac{1}{T}\left(E_R - E_G\right)
\ln\frac{\rho_B}{\rho_G} = R_B - R_G + \frac{1}{T}\left(E_B - E_G\right)  (6)

where $R_k = \ln\!\left[c_1 \lambda_k^{-5} R(\lambda_k)\right]$ and $E_k = -c_2/\lambda_k$. Two significant clues can be identified using a simple transform of Eq. (6).

\frac{\ln(\rho_R/\rho_G) - (R_R - R_G)}{\ln(\rho_B/\rho_G) - (R_B - R_G)} = \frac{E_R - E_G}{E_B - E_G}.  (7)

This means that a change in the illuminant has a linear property in the log-chromaticity coordinates, as $R_k$ is independent of the illuminant color. Meanwhile, a change in the surface reflectance also has a linear property in the log-chromaticity coordinates, for the following reason:

\frac{\ln(\rho_R/\rho_G) - \frac{1}{T}(E_R - E_G)}{\ln(\rho_B/\rho_G) - \frac{1}{T}(E_B - E_G)} = \frac{R_R - R_G}{R_B - R_G}.  (8)
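As a numerical illustration of Eqs. (5)-(7), the short sketch below simulates delta-sensitivity responses under Planckian illuminants of different color temperatures and shows that the log-chromaticity points move along a line whose direction does not depend on the surface. The wavelengths and reflectances are made-up values, not ones used in the paper:

```python
import numpy as np

c1, c2 = 3.74183e-16, 1.4388e-2            # Planck constants (W m^2, m K)
lam = np.array([600e-9, 540e-9, 450e-9])   # illustrative peak wavelengths for R, G, B
refl = np.array([0.6, 0.3, 0.5])           # illustrative surface reflectances at lam

def log_chromaticity(T):
    """Return (ln(rho_R/rho_G), ln(rho_B/rho_G)) for a Planckian illuminant of temperature T."""
    rho = c1 * lam**-5 * refl * np.exp(-c2 / (T * lam))   # Eq. (3)
    return np.log(rho[0] / rho[1]), np.log(rho[2] / rho[1])

# As T varies, the points move along a straight line whose direction
# (E_R - E_G, E_B - E_G) is independent of the surface, which is Eq. (7).
pts = np.array([log_chromaticity(T) for T in (2500, 4000, 6500, 10000)])
dirs = np.diff(pts, axis=0)
print(dirs / np.linalg.norm(dirs, axis=1, keepdims=True))  # nearly identical unit vectors
```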

However, since it is impossible to implement these sensitivities using a delta function with a real camera, this paper proposes a trade-off based on non-overlapping virtual sensitivities with a Gaussian distribution.

S_i(\lambda) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\!\left(-\frac{(\lambda - \lambda_i)^2}{2\sigma_i^2}\right), \quad i \in \{R, G, B\}  (9)

where $\lambda_i$ is the peak wavelength and $\sigma_i$ is the standard deviation of the distribution. How a real camera response is transformed into a virtual response is explained later.
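A minimal sketch of how the virtual responses of Eq. (1) could be simulated with the Gaussian sensitivities of Eq. (9); the peak wavelengths, bandwidth, and spectra here are assumptions chosen only for illustration:

```python
import numpy as np

lam = np.arange(400, 701, 5) * 1e-9             # visible wavelengths, 5 nm steps
peaks = np.array([610e-9, 540e-9, 460e-9])      # assumed well-separated peaks for R, G, B
sigma = 15e-9                                    # assumed standard deviation

def virtual_sensitivities():
    """Eq. (9): Gaussian sensitivity curve for each virtual channel."""
    return np.exp(-(lam[None, :] - peaks[:, None])**2 / (2 * sigma**2)) \
           / (np.sqrt(2 * np.pi) * sigma)

def virtual_response(E, R):
    """Eq. (1): rho_i = integral of E(lam) R(lam) S_i(lam) d(lam), per channel."""
    S = virtual_sensitivities()
    return np.trapz(E * R * S, lam, axis=1)

# Example with an equal-energy illuminant and a flat 50% reflectance:
rho = virtual_response(np.ones_like(lam), 0.5 * np.ones_like(lam))
```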

Fig. 4. Test samples and illuminants: (a) Macbeth ColorChecker and (b) five illuminants.

A Macbeth ColorChecker is used to acquire the color-invariant values in the experiment, as shown in Fig. 4(a). Five illuminants are also calculated from the Planckian locus, as shown in Fig. 4(b). The color responses are represented as log-chromaticity coordinates, as shown in Fig. 5. Each mark (+) is a single sample from the Macbeth ColorChecker under an arbitrary illuminant, making a total of 120 marks (24 samples × 5 illuminants). Certain patterns are repeated in the distribution of the responses; in particular, the responses of a single sample change linearly with a change of illuminant, which matches the result of Eq. (7). That is, under the current assumptions, the virtual response of an arbitrary surface reflectance behaves linearly with a change of illuminant.

Fig. 5. Distribution of responses as log-chromaticity coordinates.



A rotation operator is then applied to calculate a color-invariant value from Fig. 5. First, by subtracting the mean color response, the line is moved so that it passes through the origin:

\mu = \frac{1}{nm}\sum_{i=1}^{nm}\rho_i  (10)

M_i = \rho_i - \mu  (11)

where $\rho_i$ is an individual response vector and $\mu$ is the mean of the vectors. This creates a $2 \times nm$ matrix $M$, where $n$ is the number of representative surfaces and $m$ is the number of representative illuminants. The principal orientation is then determined by applying a singular value decomposition to the matrix $M$, and the color-invariant value is obtained by adding back the mean color response after rotating with respect to the selected orientation, as shown in Fig. 6.
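The following sketch shows one plausible reading of Eqs. (10)-(11) and the subsequent rotation: the log-chromaticity responses are mean-centered, the principal direction of illuminant variation is taken from an SVD, and each response is projected onto the orthogonal axis. The variable names and the exact projection convention are assumptions, not the paper's specification:

```python
import numpy as np

def color_invariant_values(chroma):
    """chroma: (2, n*m) array of log-chromaticity coordinates
    (n surfaces under m illuminants).  Returns one invariant value per column."""
    mu = chroma.mean(axis=1, keepdims=True)                 # Eq. (10)
    M = chroma - mu                                         # Eq. (11), 2 x nm matrix
    U, _, _ = np.linalg.svd(M, full_matrices=False)         # principal orientation
    illum_dir = U[:, 0]                                     # direction of illuminant variation
    invariant_dir = np.array([-illum_dir[1], illum_dir[0]]) # orthogonal axis
    # project onto the orthogonal axis, then add back the mean along that axis
    return invariant_dir @ M + float(invariant_dir @ mu)
```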

Fig. 6. Color-invariant values for Macbeth ColorChecker under five illuminants.

B. Transform of Real Responses into Virtual Responses

Since an image captured by a DSC (digital still camera) is not a raw image, it first needs to be transformed from real to raw values. Here, a linear regression model is used for the calibration [15]. The calibrated responses are then transformed into virtual responses, as shown in Fig. 7. First, the proposed virtual response is calculated using the surface reflectance of a Gretag ColorChecker and two assumptions: the non-overlapping sensitivity with a delta function and illuminants on the Planckian locus.

Fig. 7. Flowchart for generating virtual responses from a real image.

The digital RGB values for each patch captured by the DSC are then averaged to reduce camera noise and approximate a uniform illumination. Finally, a polynomial regression with least-squares fitting is applied to find the relationship between the captured RGB digital values and the simulated virtual responses, as follows [16]:

P = V^{T} a  (12)

where $P = [R'_j\ G'_j\ B'_j]_{j=1,\dots,n}$ is the $n \times 3$ matrix of simulated virtual responses, $V$ is the matrix whose columns contain polynomial terms of the captured RGB values of each patch (e.g., $R_j, G_j, B_j, R_jG_j, G_jB_j, R_jB_j, \dots, 1$), and $a = [a_{0,i}\ \dots\ a_{19,i}]^{T}$, $i \in \{R', G', B'\}$, is the matrix of regression coefficients. Here, $n$ is the number of patches in the color chart. The coefficients of the linear equation are obtained through a pseudo-inverse transformation as follows:

a = \left(V V^{T}\right)^{-1} V P.  (13)

As a result, the derived coefficients enable a real image captured by a camera to be converted into a virtual image. Finally, the color-invariant values are normalized to the minimum and maximum values obtained from 1485 Munsell patches and used as input values for the SIFT algorithm.
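A minimal sketch of the least-squares fit in Eqs. (12)-(13); the polynomial term set below is an illustrative subset rather than the paper's exact 20-term expansion, and the function names are assumptions:

```python
import numpy as np

def poly_terms(rgb):
    """Build polynomial terms for one RGB triplet (illustrative subset)."""
    r, g, b = rgb
    return np.array([r, g, b, r*r, g*g, b*b, r*g, g*b, r*b, 1.0])

def fit_coefficients(rgb_patches, virtual_patches):
    """Eq. (13): a = (V V^T)^-1 V P, with V columns = polynomial terms per patch."""
    V = np.stack([poly_terms(p) for p in rgb_patches], axis=1)   # terms x n
    P = np.asarray(virtual_patches)                              # n x 3 virtual responses
    return np.linalg.inv(V @ V.T) @ V @ P                        # terms x 3 coefficients

def to_virtual(rgb, a):
    """Eq. (12): predicted virtual response for one captured RGB triplet."""
    return poly_terms(rgb) @ a
```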

C. Image Mosaicking for Panoramic Video

Frame selection from sequential video frames is the first step in creating a panoramic video. At this point, two images should have an overlapping region, which is identified using phase correlation: the overlap of the two images is indicated by an inverse Fourier transform taken after calculating the cross power spectrum. The SIFT algorithm with embedded color-invariant values then uses the RANSAC (Random Sample Consensus) algorithm [17] to match the descriptors in the two overlapping images.
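As an illustration of the overlap test described above, a standard phase-correlation formulation might look like the following sketch (function and parameter names are illustrative, not taken from the paper):

```python
import numpy as np

def phase_correlation_shift(img1, img2, eps=1e-8):
    """Estimate the translation between two equally sized grayscale frames
    from the peak of the inverse FFT of the normalized cross power spectrum."""
    F1, F2 = np.fft.fft2(img1), np.fft.fft2(img2)
    cross = F1 * np.conj(F2)
    corr = np.fft.ifft2(cross / (np.abs(cross) + eps)).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # wrap shifts larger than half the frame to negative displacements
    if dy > img1.shape[0] // 2: dy -= img1.shape[0]
    if dx > img1.shape[1] // 2: dx -= img1.shape[1]
    return dx, dy, corr.max()     # shift and peak strength (overlap confidence)
```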

Fig. 8. Results of feature matching: (a) SIFT algorithm and (b) proposed method.



Fig. 9. Results of feature matching with real images: (a) scale transformation, (b) rotation transformation, (c)-(d) images including various objects, and (e)-(f) images acquired under different illuminations.

This algorithm also increases the accuracy of the matching by eliminating the outlier corresponding points. The perspective transform [18] for the image matching is then derived from the selected corresponding points as follows:

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} h_0 & h_1 & h_2 \\ h_3 & h_4 & h_5 \\ h_6 & h_7 & 1 \end{bmatrix}\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}  (14)
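The sketch below estimates the matrix of Eq. (14) from matched keypoints using OpenCV's RANSAC-based homography routine and warps one image into the other's frame. The matched point arrays are assumed to come from the descriptor matching described above, and the naive compositing step and all names are illustrative assumptions rather than the paper's implementation:

```python
import cv2
import numpy as np

def stitch_pair(img_ref, img_new, pts_ref, pts_new):
    """Warp img_new into the reference frame with the perspective transform
    of Eq. (14).  pts_ref, pts_new: (k, 2) float32 arrays of matched points."""
    # RANSAC-based homography estimation rejects outlier correspondences [17]
    H, inliers = cv2.findHomography(pts_new.reshape(-1, 1, 2),
                                    pts_ref.reshape(-1, 1, 2),
                                    cv2.RANSAC, 3.0)
    h, w = img_ref.shape[:2]
    # warp onto a canvas wide enough to hold the overlapped pair
    canvas = cv2.warpPerspective(img_new, H, (2 * w, h))
    # naive composite: keep the reference pixels wherever they exist
    canvas[:h, :w] = np.where(img_ref > 0, img_ref, canvas[:h, :w])
    return canvas
```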

IV. EXPERIMENTAL RESULTS

The proposed algorithm was tested for its feature detection and matching ratio. A variety of images from a regular digital camera, a mobile phone camera, and a video camera were used in the experiments. The images were acquired under identical and different illuminants to validate the color-invariant values. While the conventional SIFT method was unable to detect any feature points in Fig. 2, the proposed algorithm detected twelve features around the color edge region, as shown in Fig. 8. The test images used for feature matching are shown in Fig. 9, where scaling and rotation are tested in Figs. 9(a) and 9(b), feature matching of multiple objects under the same illuminant in Figs. 9(c) and 9(d), and feature matching under different illuminants in Figs. 9(e) and 9(f). The pairs on the left are the results with the SIFT algorithm, while the pairs on the right are the results with the proposed algorithm. Illuminants D65 and A in the booth were used to obtain the images under different illuminations. The matching ratio was calculated for each image as follows:

M_R(\%) = \frac{A_p}{T_p} \times 100  (15)

where $A_p$ is the number of accurate pairs and $T_p$ is the total number of pairs in the test image. Fig. 10 shows the matching-ratio results for the six image pairs in Fig. 9.

Fig. 10. Comparison of matching ratio (%) between the SIFT algorithm and the proposed method for the test images.

Fig. 11. Image mosaicking with sudden illumination changes: (a) four test images and (b) resulting image.



The results for the proposed algorithm were all superior to those for the conventional SIFT algorithm. In particular, the proposed algorithm maintained a better matching ratio in spite of a change of illuminant. As a change of illuminant is possible when selecting frames for a panoramic video, the proposed algorithm was tested in the extreme case of a rapid change of illumination in frames taken in the booth. This test used four images taken under two different illuminants, D65 and CWF, as shown in Fig. 11(a). The proposed algorithm worked well for image mosaicking, as shown in Fig. 11(b), and was also effectively applied to image stitching in the case of rotated and shifted images, as shown in Fig. 12. Finally, a wide-view panoramic frame was generated using only panned, fixed-resolution video frames of a real suburban scene, as shown in Fig. 13. Even though the sequence in Fig. 13 included a change of illuminant, the resulting panoramic image is smooth and clear.

Fig. 12. Image stitching for a wide view: (a) input images with resolution of 768×1024 and (b) resulting image.

Fig. 13. Panoramic frame of a suburban scene: (a) input images with resolution of 235×352 and (b) panoramic frame with resolution of 214×1097.



V. CONCLUSIONS

This paper proposed a method for image mosaicking and panoramic video generation using the SIFT algorithm with embedded color-invariant values. Conventional methods are based on gray levels, which makes feature-point extraction difficult in color edge regions. The proposed method facilitates feature-point extraction in color edge regions by using color-invariant values instead of gray values. To create the color-invariant values, a virtual response is used with certain constraints, and a polynomial regression is then used to transform the real camera response into a virtual response. Finally, the proposed algorithm embeds the color-invariant values in the SIFT algorithm. Experiments confirmed that the proposed algorithm increases the matching ratio for images with rapid illuminant changes and generates natural-looking panoramic frames.

REFERENCES

[1] H. Shum and R. Szeliski, “Construction of panoramic mosaics with global and local alignment,” International Journal of Computer Vision,vol. 36, no. 2, pp. 101-130, Feb. 2000.

[2] A. Rav-Acha, Y. Pritch, D. Lischinski, and S. Peleg, “Dynamosaics: Video mosaics with non-chronological time,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, vol. 1, pp. 58-65, Jun. 2005.

[3] J. Park, S. Byun, and B. Lee, “Lens distortion correction using ideal image coordinates,” IEEE Transactions on Consumer Electronics, vol. 55, no. 3, pp. 987-991, Aug. 2009.

[4] M. Uyttendaele, A. Eden, and R. Szeliski, “Eliminating ghosting and exposure artifacts in image mosaics,” Proc. Computer Society Conf. CVPR, vol. 2, pp. 509-516, Dec. 2001.

[5] W. Yu, “Practical anti-vignetting methods for digital cameras,” IEEE Transactions on Consumer Electronics, vol. 50, no. 4, pp. 975-983, Nov. 2004.

[6] S. Kim and M. Pollefeys, “Robust radiometric calibration and vignetting correction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 4, pp. 562-576, Apr. 2008.

[7] J. Hsieh, “Fast stitching algorithm for moving object detection and mosaic construction,” Image and Vision Computing, vol. 22, no. 4, pp. 291-306, Apr. 2004.

[8] B. Reddy and B. Chatterji, “An FFT-based technique for translation, rotation, and scale-invariant image registration,” IEEE Transactions on Image Processing, vol. 5, no. 8, pp. 1266-1271, Aug. 1996.

[9] D. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, Feb. 2004.

[10] C. Harris and M. Stephens, “A combined corner and edge detector,” Proceedings of The Fourth Alvey Vision Conference, Manchester, UK, pp. 147-152, Aug. 1988.

[11] S. Smith and J. Brady, “SUSAN-a new approach to low level image processing,” International Journal of Computer Vision, vol. 23, no. 1, pp. 45-78, May, 1997.

[12] R. Laganière, “Morphological corner detection,” International Conference on Computer Vision, Bombay, India, pp. 280-285, Jan. 1998.

[13] K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1615-1630, Oct. 2005.

[14] G. Finlayson and S. Hordley, “Color constancy at a pixel,” Journal of the Optical Society of America A, vol. 18, no. 2, pp. 253-264, Feb. 2001.

[15] H. Lee, Introduction to Color Imaging Science, Cambridge, UK: Cambridge University Press, 2005.

[16] G. Hong, M. R. Luo, and P. A. Rhodes, “A study of digital camera colorimetric characterization based on polynomial modeling,” Color Research and Application, vol. 26, no. 1, pp. 76-84, Jan. 2001.

[17] M. Fischler and R. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381-395, Jun. 1981.

[18] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge, UK: Cambridge University Press, 2000.

BIOGRAPHIES

Oh-Seol Kwon received his B.S. and M.S. degrees in Electrical Engineering and Computer Science from Kyungpook National University, Korea, in 2002 and 2004, respectively, and his Ph.D. degree in Electronics from the same university in 2008. He has worked as a postdoctoral research fellow at New York University since 2008. His research interests are color signal processing, imaging systems, computer vision, and the human visual system.

Yeong-Ho Ha received the B.S. and M.S. degrees in Electronic Engineering from Kyungpook National University, Taegu, Korea, in 1976 and 1978, respectively, and the Ph.D. degree in Electrical and Computer Engineering from the University of Texas at Austin, Texas, in 1985. In March 1986, he joined the Department of Electronics Engineering of Kyungpook National University as an assistant professor, and he is currently a professor. He is a senior member of the IEEE, a member of the Pattern Recognition Society, and a fellow of IS&T. His main research interests are color image processing, computer vision, and digital signal and image processing.
