
Katedra aplikovanej informatiky
Fakulta Matematiky, Fyziky a Informatiky
Univerzita Komenského, Bratislava

Martin Bujňák

Reconstructing 3D mesh from video sequence

Diplomová práca

BRATISLAVA 2005


Comenius University, Bratislava, Slovakia

Faculty of Mathematics, Physics and Informatics

Department of Applied Informatics

Martin Bujňák

Reconstructing 3D mesh from video sequence

(Master thesis)

Advisor: RNDr. Martin Samuelčík

Bratislava, April 2005


I hereby declare that I have written the submitted master thesis

by myself and I have used only the literature mentioned in the

bibliography.

Bratislava, 29th April 2005 . . . . . . . . . . . . . . . . . . . . . .

Martin Bujňák


I would like to thank my master thesis advisor Martin Samuelčík for his valuable advice, remarks and suggestions.

I would also like to thank my family for their support and their patience during my work.


Abstrakt

My thesis addresses the problem of building a 3D scene from a set of uncalibrated images or from video. The presented algorithm processes the input on-line; it is based on tracking salient points through the input images, finding the geometric relation between pairs of images, and building a projective reconstruction of the captured scene. Using a linear method, the algorithm then recovers the intrinsic camera parameters and transforms the scene into metric space. I further present a new algorithm for computing a dense representation of the scene and its triangulation, and I describe an experiment in which the radial distortion of the camera optics is estimated from two-view geometry.

The algorithm assumes that the camera produces no skew, that the principal point lies at the image centre, and that the pixel aspect ratio equals 1. The focal length of the camera may change during capture.

The output satisfies the assumptions on the input camera, and the reconstructed scene is free of projective distortion.

Keywords: visual modelling, structure and camera motion, dense scene reconstruction, self-calibration, radial lens distortion


Abstract

This thesis aims to create a complete 3D reconstruction of a real scene from an uncalibrated video sequence. My work deals with the image feature correspondence problem, reduced to feature tracking throughout the image sequence; camera tracking, retrieving the camera positions and camera calibration; and finally dense scene reconstruction, represented as a 3D mesh.

Although the input consists of uncalibrated images, the algorithm assumes that the images were taken by a camera with the following restrictions on its intrinsic parameters: zero skew, principal point at the image centre, and aspect ratio of 1. The camera focal length can vary across the sequence. Images must be processed in the order in which they were captured, and the motion between two consecutive frames is assumed to be small.

The main contributions of this work are a simple feature detector and tracker, a novel fast on-line structure-from-motion algorithm based on two-view geometry, dense reconstruction based on a new stereo algorithm, and 3D mesh extraction. I also describe a linear method for calibrating the cameras from the input images alone (self-calibration). An experimental method for detecting radial lens distortion based on two-view geometry is presented as well.

Keywords: structure from motion, uncalibrated video, self-calibration, feature tracking, dense reconstruction, radial lens distortion


Table of Contents

Abstrakt
Abstract
Table of Contents

1 Introduction
  1.1 Motivation and goal of this work
  1.2 Outline of the document

2 Feature tracking
  2.1 Introduction
  2.2 Harris based feature tracker
  2.3 Removing outliers
  2.4 Finding more features
  2.5 Results
    2.5.1 Tracker
    2.5.2 Guided matching

3 Structure and Motion
  3.1 Introduction
    3.1.1 Previous work
    3.1.2 Overview
  3.2 Camera pair
    3.2.1 Quasi-calibrated camera pair
    3.2.2 Sparse scene from camera pair
    3.2.3 Small motion and precision issues
  3.3 Updating the structure and motion
    3.3.1 Merging camera pairs
  3.4 Self-calibration
  3.5 Results

4 Dense reconstruction
  4.1 Introduction
  4.2 Overview
    4.2.1 Volumetric methods
    4.2.2 Stereo methods
  4.3 Novel algorithm
    4.3.1 Design
    4.3.2 Rectification
    4.3.3 Initial disparity map
    4.3.4 Refining disparity map using dynamic programming
    4.3.5 Triangulation
    4.3.6 Multi-view linking
  4.4 Results
    4.4.1 Examples

5 Experiments
  5.1 Radial lens distortion
  5.2 Results

6 Conclusion and future work

A Resources


List of Figures

2.1 Self correlation. Left: a self-correlating feature; bottom left: correlation to two self-correlating features; right: a good feature to track; bottom right: neighbour correlations.

2.2 Principal components marked in red. The only good feature is marked in green.

2.3 A feature that matches perfectly to several features in the second image.

2.4 A point corresponding to the point in the right image must lie on (or, due to noise, near) the line in the left image.

2.5 Pair 1-1 will be added if its correlation exceeds the threshold. Pairs 1-3, 2-1 and 2-3 will not be tested because the lengths of the line segments are too different. If 2-2 (if not missing) is used, 3-1 will not be tested due to the ordering criterion.

2.6 Novel detector: traceable features are marked white. Red features are self-correlating.

2.7 KLT good features: traceable features marked white. Larger motion results in invalid matching on the roofs of the houses.

2.8 Guided matching. Matching is performed on two corresponding epipolar lines.

2.9 More features found using guided matching. Small lines starting at feature points represent feature motion to the previous frame.

3.1 Filled areas denote where the 3D point can be placed. Left image: perpendicular rays give the highest precision (smaller region); right image: the region grows as the distance between the two cameras decreases.

3.2 Merging a new pair and the existing space using common feature points and corresponding 3D points.

3.3 Structure and motion precision progress. Blue: reconstruction from 5 cameras; green: from 8; red: from 10.

3.4 Structure and motion of the real scene. Video was captured at a resolution of 640x480.

3.5 Sparse reconstruction of the scene. The resolution of the input video was 320x240.

4.1 Searching for matching pairs as a path-search problem. A cut of the scene on the left; the occlusion diagram on the right.

4.2 Stereo ambiguities. It is not possible to detect the true matching from these two views.

4.3 A human cannot match pixels in the background. The background must be removed.

4.4 Area visible from both cameras. Black lines define the left camera's visibility and red lines the right camera's visibility. The minimal area is selected.

4.5 Rectification process. The width of the rectified image is the difference between the max X radius and the min X radius. The epipolar line is rotated so that it travels at most 1 pixel on the outer circumference (with max X radius).

4.6 Row of the rectified image. Each row corresponds to some epipolar line and its intersection with the image.

4.7 Refined disparity map. Triangle vertices correspond to points for which a corresponding 3D vertex is known from the scene structure.

4.8 Triangulation of small disparity discontinuities.

4.9 Triangulation merging. Input reconstructions in a) and b), result in c); d) shows triangle orientation, all in one image.

4.10 Rectification example. Original image pair (top) and rectified image pair (bottom).

4.11 Dense reconstruction of a nature scene. Two merged camera pairs were used to obtain the disparity map and 8 other cameras were used for photo-consistency checks.

4.12 Dense reconstruction of a human face. 2 cameras were used to obtain the disparity map and 7 cameras for photo-consistency checks. The scene is rendered as a point cloud; each point is rendered as a 2x2 pixel colored splat.

4.13 Reconstruction process. Frame from the input sequence (top left), feature tracking (top right), structure and motion with the reconstructed object (bottom right), final 3D mesh (bottom left).

4.14 Dense reconstruction of the scene from figure 3.2.2. The scene is reconstructed from two views. Video resolution was 320x240.

4.15 Scene reconstructed using 9 cameras.

4.16 Novel view of an object with many homogeneous regions. The scene is reconstructed from 2 cameras.

5.1 Radial distortion test. Left: original; right: undistorted. The cost function graph for each feature point is below.


Chapter 1

Introduction

1.1 Motivation and goal of this work

In many present-day applications, such as architectural visualization, cultural heritage, medicine, and the movie and computer games industries, a highly detailed, photo-realistic 3D representation of real objects is required. Several methods exist for creating a virtual copy of an existing real object, ranging from modelling by artists to laser scanning.

In this work I take a closer look at one of the most accessible and cheapest ways of reconstructing a 3D model: using video sequences. With such methods the user can freely move a camera around an object or scene and record video. From this video we are able to reconstruct the motion of the camera and a textured 3D scene. Neither the camera position nor the camera settings have to be known a priori.

My approach tracks point features across the video sequence. From the tracked features the algorithm creates two-view geometry using a robust algorithm, and then multiple-view structure and motion is built. Every change of the structure must keep the sparse scene consistent: all re-projected 3D points must lie on their corresponding 2D features (in practice, due to discrete space and noise, we aim for a minimal quadratic error; further in the text these are called the "error aspects"). If the mean error exceeds a threshold, non-linear minimization is performed. Self-calibration is used to restrict the ambiguity of the space from projective to metric. The final 3D reconstruction proceeds in two steps: the first step creates a hypothesis about the scene using a stereo algorithm, and the second step merges it with the global scene using photo-consistency checks.


Note that if we could effectively find the global minimum of the expression

\sum_{i=0}^{m} \sum_{j=0}^{n} d(m_{ij}, P_i X_j)^2,

where m_{ij} is the known 2D re-projection of the unknown 3D point X_j by the unknown camera P_i and d(x, y) is the distance between two 2D points, then all the remaining reconstruction machinery would be unnecessary. Therefore all the effort goes into finding a good initial condition for the numerical minimization methods that minimize this formula.
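As a minimal illustration (not part of the thesis code), the quantity above can be evaluated for a candidate reconstruction in a few lines of numpy; the names `cameras`, `points` and `observations` are hypothetical:

```python
import numpy as np

def reprojection_error(cameras, points, observations):
    """Sum of squared 2D reprojection errors.

    cameras:      list of 3x4 projection matrices P_i
    points:       list of 4-vectors X_j (homogeneous 3D points)
    observations: dict mapping (i, j) -> observed 2D point m_ij
    """
    total = 0.0
    for (i, j), m_ij in observations.items():
        x = cameras[i] @ points[j]          # project X_j with P_i
        x = x[:2] / x[2]                    # homogeneous -> pixel coordinates
        total += float(np.sum((x - m_ij) ** 2))
    return total
```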

1.2 Outline of the document

The algorithms are described in three parts: feature tracking; structure and motion with self-calibration; and dense scene reconstruction. Each part contains an overview and a comparison with existing methods, and concludes with results of my novel approaches.

In my work I focus on feature tracking with my own modification of guided matching, experimental radial lens distortion removal, sequential merging of two-view geometries into a multiple-view reconstruction, and dense reconstruction. Other intermediate processes are described in less detail, with references to complete descriptions, so that the reader can get a complete view of the problem. The thesis closes with a conclusion and future work.


Chapter 2

Feature tracking

2.1 Introduction

A feature point can be defined as a point that can be differentiated from its neighboring points. Feature matching plays a key role in most photo/video based modeling tools, even though it is ill-conditioned at its core. Consider two images of a building with many identical-looking windows. A human finds corresponding windows by counting windows from some edge, or in some similar way; the choice of strategy depends on the scene context, which is complicated to implement. In computer vision we transform this problem into the simpler problem of feature tracking by adding an assumption on the maximal motion between two input frames.

There already exist robust commercial feature tracking packages, like the one described in [5]. For this work we selected the free KLT feature tracking toolkit [6] and developed a new feature tracker similar to KLT.
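For orientation, here is a minimal sketch of KLT-style tracking between two consecutive frames using OpenCV's detector and pyramidal Lucas-Kanade tracker; this is in the spirit of the toolkit mentioned above, not the thesis implementation, and all parameter values are assumptions:

```python
import cv2

def track_features(prev_gray, next_gray):
    # Detect corners that are good to track (Shi-Tomasi / Harris response).
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                  qualityLevel=0.01, minDistance=7)
    # Track them into the next frame with pyramidal Lucas-Kanade.
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray,
                                                 pts, None)
    ok = status.ravel() == 1
    return pts[ok].reshape(-1, 2), nxt[ok].reshape(-1, 2)
```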

2.2 Harris based feature tracker

My feature tracker uses the Harris point feature detector [7]. Feature points are detected on two neighboring images from the sequence. Similarly to KLT, the algorithm selects features that are good to track. Since small motion between neighboring images is assumed, the algorithm removes every feature that could be mistaken for a neighboring feature point in the same image; further in the text this is referred to as self-correlation (see figure 2.1, left).


Figure 2.1: Self correlation. Left: a self-correlating feature; bottom left: correlation to two self-correlating features; right: a good feature to track; bottom right: neighbour correlations.

Figure 2.2: Principal components marked in red. The only good feature is marked in green.

Each feature point is extended by its orientation, defined by the two principal axes of the covariance matrix formed from a small region surrounding the feature point (see figures 2.2 and 2.1, top left). Features with one principal component much bigger than the other are removed; this is typical for edges.

The features remaining in the two images are then matched using zero-mean normalized cross-correlation (ZNCC). I modified ZNCC to take feature orientation into account, by changing to a polar coordinate system. Due to image noise and discrete sampling, a feature's orientation can change slightly when the image is rotated; to handle this, we also rotate the feature during the correlation process.
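For reference, a minimal numpy sketch of plain ZNCC between two equally sized patches; the orientation-aware polar variant described above is not shown:

```python
import numpy as np

def zncc(patch_a, patch_b):
    """Zero-mean normalized cross-correlation of two same-size patches.

    Returns a value in [-1, 1]; 1 means a perfect match.
    """
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:          # flat patch, correlation undefined
        return 0.0
    return float(np.dot(a, b) / denom)
```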


Figure 2.3: A feature that matches perfectly to several features in the second image.

2.3 Removing outliers

From both the KLT and my feature matching algorithms we get well-correlating feature pairs. On real images we noticed that these pairs sometimes do not point to the same object in the scene (see figure 2.3). Such feature pairs (further in the text called outliers; similarly, good correspondences are called inliers) have to be removed, as in later stages they may cause various errors. In my approach I use the RANSAC paradigm [8] to find the two-view geometry, i.e. the epipolar geometry described by the fundamental matrix. All inliers satisfy the epipolar constraint

m'^\top F m = 0,   (2.1)

where m and m' are two corresponding feature points and F is the fundamental matrix.

Using this constraint the algorithm eliminates almost all outliers, but unfortunately some will persist, because a bad match can also satisfy the epipolar constraint: this occurs whenever the second feature point lies on the epipolar line of the first one (see figure 2.4). Such bad matches can be removed using another view. In my algorithm I assume that some outliers may persist, and thus the subsequent algorithms must be able to deal with outliers and filter them out.
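As an illustration of this step, robust fundamental-matrix estimation with RANSAC is available off the shelf; a hedged sketch using OpenCV, where the matched point arrays `pts1`, `pts2` are assumed to come from the tracker and the thresholds are assumptions:

```python
import cv2
import numpy as np

def filter_outliers(pts1, pts2):
    """Estimate F with RANSAC and keep only the inlier matches."""
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC,
                                     ransacReprojThreshold=1.0,
                                     confidence=0.99)
    inliers = mask.ravel() == 1
    return F, pts1[inliers], pts2[inliers]
```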

2.4 Finding more features

After the two-view geometry is obtained, guided searching can be performed. The fundamental matrix restricts the search region for each point in the first image to a line in the second image (see figure 2.4).


Figure 2.4: A point corresponding to the point in the right image must lie on (or, due to noise, near) the line in the left image.

Figure 2.5: Pair 1-1 will be added if its correlation exceeds the threshold. Pairs 1-3, 2-1 and 2-3 will not be tested because the lengths of the line segments are too different. If 2-2 (if not missing) is used, 3-1 will not be tested due to the ordering criterion.

In my approach I find the epipolar lines for each feature. Guided matching is performed along these lines, taking the ordering constraint into account: two features are correlated only if the length of the epipolar line segment from the previous feature point is similar to the length on the corresponding epipolar line (see figure 2.5).

From all matches the algorithm calculates a new fundamental matrix using the normalized 8-point algorithm [9], which is based on linear least-squares methods. This method does not distribute the error perfectly, as pointed out in [9]. Therefore the 8-point result is used as the initial estimate for a nonlinear numerical minimization of

cost(F) = \sum_i d(m_i, \hat{m}_i)^2 + d(m'_i, \hat{m}'_i)^2,

where \hat{m}'^\top_i F \hat{m}_i = 0, m_i and m'_i are the i-th corresponding feature pair, and d(x, y) is the distance between two 2D points.

The cost function is minimized using the Levenberg-Marquardt algorithm [11].
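A hedged sketch of such a refinement using SciPy's Levenberg-Marquardt solver; for simplicity the residual here is the symmetric point-to-epipolar-line distance rather than the exact cost above, and the rank-2 constraint on F is ignored:

```python
import numpy as np
from scipy.optimize import least_squares

def epipolar_residuals(f, pts1, pts2):
    """Symmetric point-to-epipolar-line distances for F = f.reshape(3, 3)."""
    F = f.reshape(3, 3)
    ones = np.ones((len(pts1), 1))
    m1 = np.hstack([pts1, ones])        # homogeneous points, image 1
    m2 = np.hstack([pts2, ones])        # homogeneous points, image 2
    l2 = m1 @ F.T                       # epipolar lines in image 2
    l1 = m2 @ F                         # epipolar lines in image 1
    num = np.abs(np.sum(m2 * l2, axis=1))   # algebraic error |m2^T F m1|
    d2 = num / np.hypot(l2[:, 0], l2[:, 1])
    d1 = num / np.hypot(l1[:, 0], l1[:, 1])
    return np.concatenate([d1, d2])

def refine_F(F0, pts1, pts2):
    res = least_squares(epipolar_residuals, F0.ravel(),
                        args=(pts1, pts2), method='lm')
    return res.x.reshape(3, 3)
```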


Figure 2.6: Novel detector: traceable features are marked white. Red features are self-correlating.

2.5 Results

2.5.1 Tracker

My feature detector does not differ greatly from the KLT detector. Compared with KLT, I introduced feature orientation, perform feature correlation in color space, and use a different criterion for selecting features that are good to track. Taking orientation into account means this feature tracker does not lose features when the camera is rotated; for small camera rotations even KLT works fine. Because feature correlation is performed on color images, my algorithm is able to track more features when they can be differentiated by color.

My criterion for selecting traceable features plays the main role in scenes like the one in figure 2.6 (compare with figure 2.7), where KLT selects features that are not suitable for tracking and cause many bad matches. For small motion, both KLT and my feature tracker give the same results in comparable time.

2.5.2 Guided matching

The guided matching algorithm finds matches that satisfy the epipolar constraint (equation 2.1). If two matched features lie on their epipolar lines, the epipolar constraint is satisfied even if they are outliers. My algorithm performs guided matching in a way similar to dense stereo algorithms; the difference is that it runs on a smaller number of features instead of on all image pixels.

Figure 2.7: KLT good features: traceable features marked white. Larger motion results in invalid matching on the roofs of the houses.

Imposing the ordering constraint and the length criterion dramatically reduces the number of outliers. The computational complexity stays O(nm) in the worst case, for n features in the first and m features in the second image, but thanks to the length criterion the algorithm runs in expected linear time. The new criteria reduce the number of expensive cross-correlation tests, so their count is mn in the worst case but also linear in expectation. See figures 2.8 and 2.9.


Figure 2.8: Guided matching. Matching is performed on two corresponding epipolar lines.

Figure 2.9: More features found using guided matching. Small lines starting at feature points represent feature motion to the previous frame.


Chapter 3

Structure and Motion

3.1 Introduction

3.1.1 Previous work

Structure and motion is defined as the reconstruction of the camera motion (the positions of the cameras) together with a sparse reconstruction of the structure of the scene. Note that it is desirable to find this using only the knowledge of image feature correspondences.

In past years many approaches for retrieving structure and motion from an image sequence have been proposed. The most similar one, in terms of how input and output are defined, is the approach presented by Marc Pollefeys [1]. Pollefeys searches for two good initial frames; from the initial camera pair the structure of the scene is reconstructed, and the structure and motion is then completed from the 2D-3D correspondences. The quality of the reconstruction depends on the selection of the initial frames. Even so, this method is considered the most robust today. The main difference of my approach from Pollefeys' method is that I process the input on-line and create structure even if it is not yet accurate. Each new frame is classified by whether it can improve the quality of the reconstruction; if some part of the scene can be improved, the structure and motion is updated. In contrast to Pollefeys' method, my method estimates the position of a new camera not only from 2D-3D correspondences but also from two-view geometry. A sequential approach has also been proposed in [16], but unlike that algorithm my approach does not require a quasi-Euclidean initialization.

A different approach was proposed by Kanade et al. [2], who use a perspective factorization method. This method requires every feature to be known in all views. From such features the measurement matrix is built, and this matrix is then factorized into P and X (see equation 3.1):

\begin{pmatrix}
\lambda_{11} \begin{pmatrix} x_{11} \\ y_{11} \\ 1 \end{pmatrix} & \cdots & \lambda_{1n} \begin{pmatrix} x_{1n} \\ y_{1n} \\ 1 \end{pmatrix} \\
\vdots & & \vdots \\
\lambda_{m1} \begin{pmatrix} x_{m1} \\ y_{m1} \\ 1 \end{pmatrix} & \cdots & \lambda_{mn} \begin{pmatrix} x_{mn} \\ y_{mn} \\ 1 \end{pmatrix}
\end{pmatrix} = PX   (3.1)

The main problem is to find the projective depths \lambda_{ij}. A complete description of the method and algorithm can be found in [3].

There exist many other methods where more assumptions or markers in the scene are required; many of them are described in [4].

3.1.2 Overview

From the previous stages we have pairs of matching features and the two-view geometry of the last and the new frame of the image sequence; note that frames are added sequentially. From the relation between the views and the feature correspondences we want to create the structure of the scene and the motion of the camera.

Images are processed as they come, and the existing scene is refined if information from a new frame leads to a more precise scene. This information is obtained from the two-view projective reconstruction (described in section 3.2). I calculate a weight for each 3D point, telling how big the region is in which the 3D point can be placed while the reprojection error stays within thresholds (see figure 3.1). All measurements are carried out in image space, so the algorithm works in projective space. The structure and motion is built sequentially by merging each camera pair with the previous structure using common features; there must be enough, at least 4, common feature points. Introducing image-based measurements allows us to measure the amount of motion parallax between image pairs; during the merging step I use this motion to assign weights to the common feature points.


Figure 3.1: Filled areas denote where the 3D point can be placed. Left image: perpendicular rays give the highest precision (smaller region); right image: the region grows as the distance between the two cameras decreases.

3.2 Camera pair

3.2.1 Quasi-calibrated camera pair

The projective camera pair is created using the epipolar geometry known from the previous step. The canonical pair is defined as

P_1 = [ I_{3x3} | 0_3 ],
P_2 = [ [e_{12}]_\times F_{12} + e_{12} a^\top | o e_{12} ],   (3.2)

where F_{12} is the fundamental matrix from image 1 to image 2 and e_{12} is the epipole. Note that o and a are free parameters, and changing them leaves the epipolar geometry of the camera pair unchanged [4]: o determines the global scale of the reconstruction and a the position of the reference plane. Thus o can simply be set to one. My algorithm finds a such that camera P_2 satisfies the calibration conditions: zero skew, principal point at the image centre, and varying focal length. Note that at least 3 cameras are needed to perform full calibration under the input assumptions; therefore further in the text I refer to this camera pair as quasi-calibrated.
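A small numpy sketch of this construction with the common choice o = 1 and a = 0 (hedged: this is the plain canonical pair, not the quasi-calibrating choice of a that the thesis tunes):

```python
import numpy as np

def canonical_pair(F12, e12):
    """Canonical projective camera pair from F and the epipole in image 2.

    F12 : 3x3 fundamental matrix; e12 : epipole with F12.T @ e12 = 0.
    """
    ex = np.array([[0.0, -e12[2], e12[1]],
                   [e12[2], 0.0, -e12[0]],
                   [-e12[1], e12[0], 0.0]])          # [e12]_x skew matrix
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = np.hstack([ex @ F12, e12.reshape(3, 1)])    # a = 0, o = 1
    return P1, P2
```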

3.2.2 Sparse scene from camera pair

Having the projection matrices allows us to calculate a 3D position for each feature pair. Usually this is done by triangulation. Due to noise and the discrete image space, the sight lines may not intersect perfectly. The 3D position of a feature pair can be calculated so that the distance between the reprojected 3D point and the matching 2D points is minimal:

d(m, P_1 M)^2 + d(m', P_2 M)^2,

where m, m' are the corresponding 2D feature points, both corresponding to M, P_1, P_2 are the camera matrices of the pair, and d(x, y) is the distance between two 2D points. Many methods for obtaining the optimal 3D position are described in [4]. In my work I find the 3D position M by minimizing the formula

cost(M) = \sum_i d(\overrightarrow{A_i m}, M).

The summands are the distances of the unknown point M from the sight lines (A_i is a camera centre and m the corresponding point in 3D on the projection plane). Such a point can be computed using the least squares method. If the reprojected 3D point is too far from any of its 2D observations, the feature is considered an outlier.
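The least-squares point closest to a set of sight lines has a closed form; a minimal numpy sketch, where `centers` are the camera centres A_i and `dirs` the unit direction vectors of the sight lines:

```python
import numpy as np

def nearest_point_to_lines(centers, dirs):
    """Least-squares 3D point minimizing summed squared distance to lines.

    Each line i passes through centers[i] with unit direction dirs[i].
    Solves sum_i (I - d_i d_i^T) X = sum_i (I - d_i d_i^T) A_i.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, d in zip(centers, dirs):
        Pn = np.eye(3) - np.outer(d, d)   # projector onto plane normal to d
        A += Pn
        b += Pn @ c
    return np.linalg.solve(A, b)
```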

3.2.3 Small motion and precision issues

Sometimes it is not possible to calculate the projection matrices. This occurs when no motion was made, or when virtual parallax occurred; in that case the epipolar geometry (the fundamental matrix) is poorly estimated. To avoid this, the algorithm performs 2D measurements and skips the frame if the median length of the motion vectors is smaller than a threshold. Virtual parallax, caused by pure rotation around an axis passing through the focal point combined with pure zooming, can be detected by thresholding the eigenvalues of the fundamental matrix.

Even if the fundamental matrix is well defined, discrete space and noise can leave too much freedom in placing a 3D point (see figure 3.1), and the error can be enormous when the camera motion is small. For that case I use an image-based measure (weight) for each 3D feature saying how precise the estimate of the 3D point is (the volume of the intersection of the sight lines). Similarly, if the median weight is smaller than a threshold, the frame is skipped. Note that a photo-consistent 3D point lies in the intersection of all sight lines of all cameras from which the 3D point is visible.

Also note that skipped frames are kept in memory for feature tracking purposes.
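A tiny sketch of the frame-skipping test on the median motion-vector length (the threshold value is an assumption):

```python
import numpy as np

def enough_motion(pts_prev, pts_curr, min_median_px=2.0):
    """Skip a frame when the median feature displacement is too small."""
    lengths = np.linalg.norm(pts_curr - pts_prev, axis=1)
    return np.median(lengths) >= min_median_px
```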

3.3 Updating the structure and motion

In this section I describe how camera pairs are merged with the existing reconstruction.

Figure 3.2: Merging a new pair and the existing space using common feature points and corresponding 3D points.

My algorithm merges a new camera pair with the existing structure and motion using their common camera. Merging is performed so that the best of the old and the new scene is used. The algorithm also calculates the re-projection error of the 3D space for each camera; if the mean error exceeds a given threshold value, nonlinear minimization, i.e. bundle adjustment [12], is performed. This problem can be solved efficiently by taking the sparsity of the problem into account [13].

3.3.1 Merging camera pairs

For the merging process we have a new camera pair P_1, P_2 in canonical form and the existing reconstruction. Let P be the last camera in the existing motion structure; P corresponds to camera P_1 in the new pair. Merging the pairs means transforming both P_1 and P_2 so that P_1 becomes equal to P. After that, P_2 will not yet be correctly placed, as it can differ in the position of the reference plane and in the scale factor (see figure 3.2).


In ideal conditions we can express the homography transformation that fixes the P_2 camera, from the known 3D space and the common correspondences, as

Y = HX,    X = H^{-1} Y,

where X and Y are corresponding 3D points in the new and the existing structure.

From four 3D points we are able to calculate H or H^{-1}. In practice, due to the error aspects, we need a robust approach, as not all 3D points are suitable for calculating such a homography. In our approach we select a bundle of the points that have the biggest weight (see section 3.2.3). The features are divided into two groups by their weights: "new space is better" and "current space is better". Points in the first group are used to calculate H and those in the second to calculate H^{-1}. The homographies are calculated using the RANSAC paradigm [8], and all measurements are carried out in 2D.
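For illustration, a 4x4 space homography can be estimated linearly from 3D point correspondences, analogously to the familiar 2D DLT; a hedged numpy sketch assuming at least 5 correspondences in general position (normalization and the RANSAC loop are omitted):

```python
import numpy as np

def homography_3d(X, Y):
    """Estimate 4x4 H with Y_i ~ H X_i from >= 5 3D correspondences (DLT).

    X, Y : (n, 4) arrays of homogeneous 3D points.
    """
    rows = []
    for x, y in zip(X, Y):
        # Y ~ HX gives w' * (H_k . X) - y_k * (H_4 . X) = 0 for k = 0, 1, 2.
        for k in range(3):
            row = np.zeros(16)
            row[4 * k:4 * k + 4] = y[3] * x    # coefficient of row H_k
            row[12:16] -= y[k] * x             # coefficient of row H_4
            rows.append(row)
    A = np.asarray(rows)
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(4, 4)                # null vector, up to scale
```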

After transforming P_2 with H, we can merge the 3D structure of the new pair into the existing structure. Already known 3D points are merged with the new points and recalculated as described in section 3.2.2.

To minimize the accumulated error, the mean 3D-to-2D re-projection error is calculated; if it exceeds the threshold value, the algorithm performs bundle adjustment.

3.4 Self-calibration

Until now we did not care about the intrinsic camera parameters. The reconstructed space and camera poses are locked by photo-consistency constraints (the reprojection error is small), but such a reconstruction is not unique. Now we recover the camera intrinsic parameters using only the images; this is called self-calibration. Many techniques for performing self-calibration are described in [4].

Let X be any 3D point of the reconstruction, P any camera and m the corresponding 2D feature in camera P. For any homography H_{4x4} we get

m = PX,
m = (PH)(H^{-1}X).   (3.3)

It means that we can transform both the cameras and the 3D points so that the reprojection error stays unchanged. Without loss of generality we can assume that H does not shear, rotate, translate or scale; these components are interesting only if we want to align the reconstruction with some existing space. Now the only component we care about is the projective part of the homography, the only part that can transform the plane at infinity. Such a homography can be described as follows:

H = \begin{pmatrix}
k^{-1} & 0 & 0 & 0 \\
0 & k^{-1} & 0 & 0 \\
0 & 0 & k^{-1} & 0 \\
ak^{-1} & bk^{-1} & ck^{-1} & k
\end{pmatrix},   (3.4)

where a, b, c, k are unknowns.

The camera matrix can be factorized into an upper triangular 3x3 calibration matrix (3.5), a 3x3 rotation matrix and a 3x1 translation vector:

K = \begin{pmatrix}
a_x & s & x_0 \\
0 & a_y & y_0 \\
0 & 0 & 1
\end{pmatrix},   (3.5)

where a_x, a_y encode the focal length, a_x : a_y is the aspect ratio, s is the skew, and [x_0, y_0] is the principal point.
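This factorization can be computed with an RQ decomposition; a hedged sketch using SciPy (cameras are only defined up to scale, so K is normalized at the end):

```python
import numpy as np
from scipy.linalg import rq

def factor_camera(P):
    """Split a 3x4 camera P ~ K [R | t] into intrinsics K, rotation R, t."""
    K, R = rq(P[:, :3])                 # P[:, :3] = K @ R, K upper triangular
    S = np.diag(np.sign(np.diag(K)))    # force a positive diagonal on K
    K, R = K @ S, S @ R                 # S @ S = I keeps the product intact
    t = np.linalg.solve(K, P[:, 3])     # P[:, 3] = K t
    K /= K[2, 2]                        # fix the overall scale
    return K, R, t
```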

My algorithm finds a homography H such that all cameras transformed with H have calibration matrices as assumed: skew equal to zero, principal point at the image centre, aspect ratio equal to 1, and varying focal length. The key to finding such a homography lies in the projection of the absolute conic. The absolute conic can be represented using the dual absolute quadric [14]. One of the most important properties of the absolute quadric is that it is invariant to similarity transformations. Another property leads directly to the key for finding H: the projection of the dual absolute quadric is directly related to the intrinsic camera parameters [1]:

K K^\top \sim P \Omega^{*}_{\infty} P^\top,   (3.6)

where P is the 3x4 projection matrix and \Omega^{*}_{\infty} is the dual absolute quadric.

Since the images are independent of the projective basis of the reconstruction, equation (3.6) is always valid, and constraints on the intrinsics can be translated into constraints on the absolute quadric [1]. With our assumptions, K can be found from equation (3.6) by solving a linear system with one cubic constraint [15].


Using H, I transform the whole structure and all cameras as shown in equation (3.3). The re-projection error thus stays unchanged while the cameras satisfy the "real" conditions.

3.5 Results

The RANSAC paradigm, in both the two-view and the merging-to-n-view processes, makes the algorithm robust to the presence of outliers. Tests on synthetically generated data showed that the algorithm can deal with up to 30% outliers (for 100 feature correspondences). Radial lens distortion influences the 3D scene, but the 3D-to-2D re-projection error stays under 1 pixel. Adding noise with a Gaussian distribution to the images causes problems for the linear algorithms. Having more cameras means that the 3D points are estimated from more 2D correspondences, so the noise is somewhat suppressed and the structure does not change dramatically. Because the camera projection matrix is calculated only from the fundamental matrix, numerical minimization on the fundamental matrix is essential in this case.

For noise with radius under 1 pixel, neither the camera motion nor the scene structure changed dramatically: the mean residual error (further in the text, the error), measured in pixel space, stays under 0.5 pixels for 100 corresponding points and 3 cameras. With 4 cameras the error dropped under 0.3 pixels, and adding more cameras did not change it as rapidly. Noise with a radius of 2-3 pixels caused the camera motion to be poorly estimated; the reprojected structure, compared to ground truth, was within 2 pixels measured in image space, and ten cameras were able to drop the error under 1.2 pixels. The change in structure and motion is visualized in figure 3.3. Numerical minimization, on both structure and motion, found a new motion and structure with error under 0.8 pixels. For noise with radius 3 pixels and above, the cameras were estimated poorly even after numerical minimization; for such cases it would be better to calculate the projection matrix from 3-, 4- or more-view image constraints.

Tests on real data give good results even for a small-resolution camera. Figure 3.5 shows a sparse reconstruction of a scene captured by a digital camera at 320x240; another example is in figure 3.4.

Figure 3.3: Structure and motion precision progress. Blue: reconstruction from 5 cameras; green: from 8; red: from 10.

Although quasi-calibration of the pairs is not required, my experience shows that it helps in the merging processes. Merging quasi-calibrated pairs means the sparse 3D space is not so distorted by perspective: such 3D points are near their true positions and the space is more uniformly distributed. We explain the better results by this better uniformity of the distribution of the space.


Figure 3.4: Structure and motion of the real scene. Video was captured at a resolution of 640x480.

Figure 3.5: Sparse reconstruction of the scene. The resolution of the input video was 320x240.


Chapter 4

Dense reconstruction

4.1 Introduction

Until now we have worked with sparse data. Sparse data, in combination with the extrinsic and intrinsic camera parameters from the previous process, can be used to align a virtual world with the real world; we can then simply render the virtual scene from the known camera position and merge it with the original image. The sparsity of the reconstruction and the presence of wrong 3D points are not a problem for applications like these, since the reconstructed 3D points are used only for aligning the worlds, which can be done by the user. If there were occluders in front of the virtual objects, the real and rendered images would have to be merged by the user, because the depths of the pixels in the real image are not known.

For visualization or cultural heritage purposes a sparse reconstruction is insufficient; a dense reconstruction is preferred.

4.2 Overview

In recent years many algorithms for dense reconstruction have been proposed [17]. In this section I describe the approaches that are relevant to mine.

4.2.1 Volumetric methods

Volumetric methods assume that a bounded area containing the objects of interest is known. The 3D space is then filled with 3D points/voxels such that the projections of these points/voxels onto the cameras have the same color, i.e. they must be photo-consistent [19]. In practice it is not that simple, as visibility and many other aspects of the 3D points have to be known. There are many voxel based methods; a more detailed overview of volumetric methods can be found in [29].

In my approach I use some ideas from two voxel based algorithms: voxel coloring [25] and space carving [19]. Voxel coloring searches for 3D points that are photo-consistent with all cameras that can see them. Unlike voxel coloring, space carving starts with some initial reconstruction of the scene, for example a cube, and carves away photo-inconsistent points. Recall that the bounds of the area of interest have to be known.

The mathematical background and theory can be found in [19]. Note that both voxel based methods need to know the visibility or occlusion of voxels to return good results; [26] and [19] deal with how to traverse the space to get a good reconstruction for any camera motion. Quality, and also computational complexity, depends on the resolution of the voxel space.

4.2.2 Stereo methods

Unlike volumetric methods, stereo methods use image space to generate the 3D space, using per-pixel correspondences. Stereo algorithms use the epipolar constraint (2.1) to restrict the correspondence search to 1D. In the case of a calibrated stereo rig, the two cameras lie in a configuration where the second camera is purely translated with respect to the first one. Note that in such a configuration the epipoles lie at infinity, and the epipolar lines are therefore parallel. Consider the case where the position of the second camera is obtained by pure translation along the X axis of the image space of the first camera. In such a configuration the matching process is performed in rows, which brings several advantages: (1) two matching pairs can be expressed as the signed distance between them (called the disparity), which saves memory, and (2) we access neighboring blocks of memory, which reduces cache misses and leads to better performance. In general, if we know the fundamental matrix, we can warp the space so that the epipoles lie at infinity too; this process is called rectification. For details see section 4.3.2.

The goal is to find which subset of points on one epipolar line matches which subset on the corresponding epipolar line in the other image; in rectified images we work with the rectified image scan lines. Finding such subsets can be transformed into a path-search problem and solved using dynamic programming (see figure 4.1).

Figure 4.1: Searching for matching pairs as a path-search problem. A cut of the scene on the left; the occlusion diagram on the right.

Dynamic programming makes it possible to incorporate further constraints, such as preserving the order of neighboring pixels, bidirectional uniqueness of matches, and occlusion detection [23], [24]. Although many other algorithms are available, I selected the dynamic programming approach because it provides a good trade-off between quality and speed. Note that each row/line can be matched independently, and thus in parallel.

Having more images offers more constraints. [27] rectifies all images against the first view and, instead of directly finding matches, searches for the correct depth. This is a good idea, since disparity changes from view to view while depth serves as a common search index. For wide-baseline cameras, [28] presents a probabilistic method with good mathematical background and results. Other algorithms, such as [21] or the one in [1], merge disparity maps from several camera pairs to obtain a very dense depth map.

A more detailed description of the stereo algorithm and of my implementation follows in the next sections.

4.3 Novel algorithm

The biggest disadvantages of the volumetric methods are that the scene bounding area and the physical properties of the material have to be known, that occlusions are hard to handle, and that the result is sensitive to noise. Moreover, the already known scene structure is not used, and it is hard to introduce the epipolar constraint. On the other hand, most stereo algorithms were built to extract a disparity map without any knowledge of the scene structure and camera motion; for these, only the fundamental matrix has to be known.

Figure 4.2: Stereo ambiguities. It is not possible to detect the true matching from these two views.

4.3.1 Design

It is not possible to uniquely reconstruct a 3D scene from two views (see figure 4.2). If there are homogeneous regions of a single color, we cannot know which two pixels match in the real world; there are many photo-consistent solutions, and even the human brain is not able to find the correct one (see figure 4.3). Because of that, all we can do is use heuristics to get the "most real" looking scene. Remember, though, that we can use more than two views and thus have more information. In this work a scene hypothesis is built from two views and then merged with the global scene, where hypotheses are either broken or confirmed. If there is some uncertainty in the two-view reconstruction, all pixels that cannot be reconstructed uniquely are removed. Since this is not easy to detect, we also need to remove all invalid (photo-inconsistent) 3D points in the merging process.

There are many algorithms that try to reconstruct the scene from only two views. They can achieve better two-view results, but at the cost of speed. This partial solution is filtered and merged with the global space, which is also refined to achieve better precision.

Figure 4.3: A human cannot match pixels in the background. The background must be removed.

The requirements on the algorithm were:

- To use the already known scene structure: these 3D points are photo-consistent and have already passed many tests, so the surface passes near them.

- To use the camera motion: we can extract the two-view geometry from it and also use it for photo-consistency checks.

- To allow the algorithm to be distributed over several CPUs/GPUs to achieve a better response time; the aim is real time.

- To let the user define scene constraints, like "this is a plane" or "this is an empty area". With real-time or near real-time response, such an approach can be a very powerful and precise photo modeling tool.

In this work I use the dynamic programming approach, since with it I can exploit all the constraints and all these requirements are met.

4.3.2 Rectification

The rectification process warps an image pair so that the epipolar lines coincide with the image scan lines. We can achieve this by finding a transformation that moves the epipoles to infinity. The warping needs to be done so that all pixels of the image persist and the size of the rectified image is minimal. Finding such a transformation seems complicated, until we realize that after transforming the epipoles to infinity the epipolar lines become parallel: given two corresponding epipolar lines, we can transfer the image pixels lying on them to rows. Doing this for every pair of epipolar lines yields an image that satisfies the rectification condition. This method can also handle the cases where an epipole lies inside the image.

Figure 4.4: Area visible from both cameras. Black lines define the left camera's visibility and red lines the right camera's visibility. The minimal area is selected.

In my work I extract a bilinearly filtered image segment under each epipolar line and transfer it to the output rectified image. My algorithm is similar to the one used in [1]; the difference is in the way the epipolar lines are selected. A simple description follows:

- Enumeration of the resolution of the resulting rectified images. The image height corresponds to the number of epipolar lines we use. First, the area visible in both images is found (see figure 4.4). In my implementation I rotate the first epipolar line around its epipole by an angle calculated so that no pixel compression occurs (see figure 4.5); the angle is calculated in both images and the minimal one is used. The image width corresponds to the difference between the outer (max X) and inner (min X) radii of the circles passing through the most distant and the nearest image pixel (see figure 4.5). Note that for camera pairs where the epipole lies inside the image, it suffices to select any epipolar line and rotate it by 180°.

- The rows of the rectified image are built from the intersections between the epipolar lines and the images. The intersection of an epipolar line and an image can be empty, a single pixel, or a line segment:

  1. Empty intersection: this case cannot occur, since we process only the visible part of the image.

  2. Single pixel: a bilinearly filtered pixel is read from the original image and stored into the row.

  3. Line segment: a line rasterization algorithm is used to traverse the pixels in the input image. Each traversed pixel is read and placed into the row of the rectified image (see figure 4.6). Note that, to increase the quality of the rectified image, we perform bilinear filtering when reading the pixels.

Figure 4.5: Rectification process. The width of the rectified image is the difference between the max X radius and the min X radius. The epipolar line is rotated so that it travels at most 1 pixel on the outer circumference (with max X radius).

Note that the lines are rasterized in the direction from the position of the epipole towards the outer radius. For configurations with the epipole inside the image we use an oriented fundamental matrix. This can be calculated from any fundamental matrix by correcting its orientation, using one known corresponding feature pair, as follows: let l = Fm and l' = F^\top m'; if the sign of l · m' differs from the sign of l' · m, then the sign of F must be changed.

The back transformation from the rectified image to the normal image can be expressed as: y is the angle to the first visible epipolar line, and x is the distance from the epipole decreased by the inner radius.

4.3.3 Initial disparity map

Consider two rectified images. We know that for any selected point we can find (if it is not occluded) its corresponding point on the same scan line in the second image. The speed of the algorithm depends on the search region; in my thesis I use the already known structure and feature correspondences to find it.

Figure 4.6: Row of the rectified image. Each row corresponds to some epipolar line and its intersection with the image.

After the two input images from two different views are rectified, we find all common feature points and their matching (note that the feature points need to be transferred into rectified space). Delaunay triangulation is then applied in the first rectified image to triangulate these points (see figure 4.7). The triangulation is transferred to the second image by replacing the vertices with their corresponding vertices. Since the feature matching is known, the algorithm knows the disparities of all these vertices; note that the disparity stored in the vertices is known with sub-pixel precision.

In the next step the algorithm rasterizes the triangles into a matrix of float values, such that each point inside a triangle A, B, C with disparities D(A), D(B), D(C) gets the disparity uD(A) + vD(B) + wD(C), where u + v + w = 1 and u, v, w are the barycentric coordinates of the point.

In preprocessing, the algorithm marks the triangles that correlate with the corresponding triangle in the second view, and also the triangles for which the ratio of the triangle areas is too far from 1.

If we have constraints from the user, such as "this is a plane", then we rasterize this plane into the disparity matrix.
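A minimal sketch of the barycentric interpolation used when rasterizing a triangle's disparities (the scanline rasterizer and point-in-triangle handling are omitted):

```python
import numpy as np

def barycentric_disparity(p, A, B, C, dA, dB, dC):
    """Interpolate disparity at 2D point p inside triangle (A, B, C)."""
    # Solve p = u*A + v*B + w*C with u + v + w = 1.
    T = np.array([[A[0] - C[0], B[0] - C[0]],
                  [A[1] - C[1], B[1] - C[1]]])
    u, v = np.linalg.solve(T, np.asarray(p, float) - np.asarray(C, float))
    w = 1.0 - u - v
    return u * dA + v * dB + w * dC
```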

4.3.4 Refining disparity map using dynamic programming

The disparity refinement process uses the two rectified images (left and right), the initial disparity map, and the known structure and motion. Since we work in rectified space, each scanline can be treated alone. For each pixel in the left scan line the algorithm searches for the corresponding pixel in the right scan line. The initial disparity map gives an estimate of the position of the corresponding pixel; because the initial disparity map can differ from the true disparity, the algorithm searches in a neighborhood of the estimated position.

Algorithms can be described in few steps :

� Detect homogenous area and reduce noise - using mean �lter - in both scanlines.

• Correlate each pixel from the left scan line with all pixels within a predefined radius around the estimated position in the right scan line. A bigger number means a better match. Correlation under a predefined threshold is clamped to zero.

• Build a grid graph of size image_width × search_range.

• Weight each edge in the graph as follows:

  – Vertical and horizontal edges have weights equal to zero.

  – The edge between nodes (i, j) and (i+1, j+1) has weight equal to the correlation between the i-th and j-th pixels of the left and right scan lines.

  – Increase the weights of those diagonal edges where the i-th and j-th pixels lie in triangles marked as correlating and j is the choice of the initial disparity.

  – Reconstruct the 3D position from the hypothetical match and test photo consistency; decrease the weight if it is not consistent. In this step constraints like "this is an empty area" can be tested.

  – Zero the weights of those diagonal edges where the i-th and j-th pixels lie in triangles where the ratio of triangle areas is not near 1.

• Find the most expensive path using dynamic programming (a longest-path search on this graph).
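The longest-path search can be sketched as follows. This is a minimal illustration, not the thesis implementation: corr[i, j] is assumed to already hold the adjusted weight of the diagonal edge for pixel i of the left scan line and candidate j inside the search range.

import numpy as np

def refine_scanline(corr):
    # Dynamic programming over the grid graph: horizontal and vertical
    # moves cost nothing, a diagonal move earns corr[i, j]. The most
    # expensive corner-to-corner path gives the matching.
    n, m = corr.shape
    score = np.zeros((n + 1, m + 1))
    # 0: advance pixel only, 1: advance candidate only, 2: diagonal match
    move = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cands = (score[i - 1, j],
                     score[i, j - 1],
                     score[i - 1, j - 1] + corr[i - 1, j - 1])
            move[i, j] = int(np.argmax(cands))
            score[i, j] = cands[move[i, j]]
    # Backtrack: a diagonal move assigns candidate j to pixel i.
    match = np.full(n, -1)
    i, j = n, m
    while i > 0 and j > 0:
        if move[i, j] == 2:
            match[i - 1] = j - 1
            i, j = i - 1, j - 1
        elif move[i, j] == 0:
            i -= 1
        else:
            j -= 1
    return match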

An example of a refined disparity map is shown in figure 4.7.

4.3.5 Triangulation

The next step is dense point cloud triangulation. This is done by interconnecting neighboring pixels in image space; the 2D positions are then swapped with the 3D coordinates. We need to handle two types of discontinuities: (1) occlusions and (2) small discontinuities caused by noise and the camera's discrete sampling.


Figure 4.7: Refined disparity map. Triangle vertices correspond to points for which we have a corresponding 3D vertex from the scene structure.


Figure 4.8: Triangulation of small disparity discontinuities.

The 2nd type is present as small, single-point discontinuities with a known neighborhood. These could be removed by interpolating neighboring values, but we do not need to, because the triangulation handles it. Triangulation of type (2) is shown in figure 4.8.

Occlusions and big discontinuities are not triangulated; this space is left undefined in the hope that it can be completed from other views, as sketched below. We also remove a triangle if its normal is almost perpendicular to the camera view direction (for example at 85°).
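A sketch of this triangulation step; the discontinuity threshold max_jump (in disparity units) is illustrative, and the perpendicular-normal test would run after lifting the vertices to 3D:

import numpy as np

def triangulate_grid(disp, max_jump=3.0):
    # Connect neighboring pixels of the disparity map into triangles,
    # skipping quads that span a large discontinuity (occlusion).
    # Returns index triples into the flattened pixel grid.
    h, w = disp.shape
    tris = []
    for y in range(h - 1):
        for x in range(w - 1):
            quad = disp[y:y + 2, x:x + 2]
            if quad.max() - quad.min() > max_jump:
                continue  # big discontinuity: leave the area undefined
            a, b = y * w + x, y * w + x + 1
            c, d = (y + 1) * w + x, (y + 1) * w + x + 1
            tris.append((a, b, c))
            tris.append((b, d, c))
    return np.asarray(tris)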

Reconstruction noise can be suppressed by smoothing the disparity map. Smoothing should be performed with respect to edges.

4.3.6 Multi-view linking

From a two-view reconstruction we get a dense point cloud with its triangulation. Areas that are occluded in both cameras are not reconstructed.

The linking process consists of:

• Refinement of the common space.

• Merging with the new space.

• Update of the triangulation.

The common space is identified using matching transitivity between the new and old space. All common points are recalculated as in section 3.2.2. Detection of outliers and inconsistencies is similar to the method described in [1].

The new space is merged with respect to the triangulation as follows (a sketch of the photo-consistency test used by these rules follows the list):

• If two non-parallel triangles intersect, then these triangles are subdivided at this intersection (a point or an edge). The new triangles are segmented by triangle normal orientation. All occluded triangles are removed. The idea of this algorithm is similar to the gift-wrapping algorithm (illustration in figure 4.9).


Figure 4.9: Triangulation merging. Input reconstructions in a) and b), the result in c). d) shows the triangle orientations, all in one image.

• If one triangle occludes another triangle and these triangles do not intersect, then:

  – The two triangles define the same part of the object and are shifted due to noise. This is detected by thresholding the triangles' distance in 3D. The space is refined by calculating a weight for each triangle vertex saying how precisely the triangle is calculated. The vertices are then recalculated similarly as in section 3.2.2.

  – The occluded area is photo inconsistent.

  – The occluder is photo inconsistent.

  – Both triangles are photo consistent and outside the testing threshold; the triangles are marked and stay unchanged.

• If one triangle lies on the second, then the algorithm removes the one which is completely covered by the other.
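The photo-consistency test used above can be sketched as follows; the nearest-pixel sampling, the RGB deviation threshold tol, and the omission of visibility handling are illustrative simplifications:

import numpy as np

def photo_consistent(X, cameras, images, tol=20.0):
    # Project the homogeneous 3D point X into every camera (3x4
    # matrices P), sample the color, and accept the point when the
    # sampled colors agree. (Occlusion handling is omitted here.)
    samples = []
    for P, img in zip(cameras, images):
        x = P @ X
        u, v = int(round(x[0] / x[2])), int(round(x[1] / x[2]))
        if 0 <= v < img.shape[0] and 0 <= u < img.shape[1]:
            samples.append(img[v, u].astype(float))
    return len(samples) >= 2 and np.std(samples, axis=0).max() < tol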

There are many other sub-cases, but most of them will never occur, since the initial triangulation is already filtered of those triangles which are invalid.


Figure 4.10: Rectification example. Original image pair (top) and rectified image pair (bottom).

In the linking process the algorithm removes only those triangles which are occluded or photo inconsistent.

4.4 Results

My rectification algorithm unwraps the image so that no pixel compression occurs. We could save image space by removing the unused space (see figure 4.6), but then we would break the vertical continuity of the image: vertical lines would not be continuous. An example of a rectified image is in figure 4.10.

Computation of the initial disparity is not a time-expensive process, and using it we reduced the search region, which sped up the algorithm, since the numerical complexity of the dynamic programming approach is O(width × 2 × search_radius). In our observations 4% of the image resolution was sufficient; for an image with resolution 640x480 this is around a 20-pixel radius.

Segmentation of homogeneous regions removes noise, and thus homogeneous regions stay continuous, since discontinuities concentrate in occluded areas. In the case where triangles in both views correlated well, the algorithm supports the nodes of the grid that were estimated by the initial process.


This means that the planar solution is supported. On the other hand, if the ratio of triangle areas in the two views has changed a lot, then the space behind these triangles differs a lot from the planar solution. Testing photo consistency leads to a performance drop; on the other hand, it removes photo-inconsistent paths.

With 10 cameras for the photo-consistency check and 2 selected cameras for space reconstruction, the algorithm was able to calculate a float-precision disparity map from 640x480 input in 3 seconds on an AMD AthlonXP 2800+ CPU with 512 MB RAM. All stages were included: rectification, initial disparity guess, disparity refinement, and dense map extraction. See the examples in section 4.4.1.

The triangulation that is built in 2D space fills gaps, but the reconstruction is affected by noise. Smoothing the disparity maps in combination with a median filter suppresses noise but also deforms edges. It would be better to use some point cloud approximation method.

4.4.1 Examples

Each of these examples was captured by a standard digital hand-held camera at resolution 640x480. Reconstruction time was around 2.5 seconds for each pair of cameras. Merging of two reconstructions was done in under one second.


Figure 4.11: Dense reconstruction of nature. Two merged camera pairs were used to obtain the disparity map and 8 other cameras were used for photo-consistency checks.

Figure 4.12: Dense reconstruction of a human face. 2 cameras were used to obtain the disparity map and 7 cameras for photo-consistency checks. The scene is rendered as a point cloud; each point is rendered as a 2x2-pixel colored splat.


Figure 4.13: Reconstruction process. Frame from the input sequence (top-left), feature tracking (top-right), structure and motion with the reconstructed object (bottom-right), final 3D mesh (bottom-left).

Figure 4.14: Dense reconstruction of the scene from figure 3.2.2. The scene is reconstructed from two views. Video resolution was 320x240.


Figure 4.15: Scene reconstructed using 9 cameras.

Figure 4.16: Novel view of an object with many homogeneous regions. The scene is reconstructed from 2 cameras.


Chapter 5

Experiments

5.1 Radial lens distortion

Optical distortion of the camera lens can move 2D points far from their original position, even by more than 10 pixels. In this experiment I take into account only radial lens distortion, which can be approximated as

λ = 1 + κ(x^2 + y^2),
x1 = λx,
y1 = λy,

where κ is the unknown distortion factor.

My algorithm is similar to [10]. In [10] the authors modified the distortion equation, which allowed them to modify the linear algorithm for calculating the fundamental matrix so that it calculates the distortion too. The new equation for the F matrix returns more roots (around 10), and all must be tested. It was also noted that the change of the radial distortion equation creates many local minima around the global minimum.

Our radial lens distortion removal algorithm searches for κ directly by minimizing

Σ_i d(m'_i, F m_i)^2 + d(m_i, F^T m'_i)^2,

where m_i, m'_i are corresponding 2D feature points, F is the fundamental matrix, and d(l, x) is the distance between a line and a 2D point. For each κ we un-distort the feature positions and find a new fundamental matrix. Features are scaled to fit the window. κ is found using simulated annealing [18].
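As a sketch, the cost for a trial κ can be evaluated as below. This is a plain linear 8-point estimate without normalisation; treating the model as mapping observed to corrected points, the window scaling, and all names are illustrative assumptions, not the thesis implementation.

import numpy as np

def undistort(pts, kappa):
    # pts: Nx2 feature positions, centred and scaled to the window.
    r2 = (pts ** 2).sum(axis=1, keepdims=True)
    return pts * (1.0 + kappa * r2)

def linear_F(u, u2):
    # Linear 8-point estimate of F from homogeneous Nx3 point sets.
    A = np.column_stack([u2[:, 0:1] * u, u2[:, 1:2] * u, u])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    U, s, Vt = np.linalg.svd(F)                  # enforce rank 2
    return U @ np.diag([s[0], s[1], 0.0]) @ Vt

def epipolar_cost(kappa, m, m2):
    # Undistort both point sets, refit F, and sum the symmetric
    # squared point-to-epipolar-line distances.
    u = np.column_stack([undistort(m, kappa), np.ones(len(m))])
    u2 = np.column_stack([undistort(m2, kappa), np.ones(len(m2))])
    F = linear_F(u, u2)
    l2, l1 = u @ F.T, u2 @ F                     # lines in image 2 / 1
    d2 = (l2 * u2).sum(1) ** 2 / (l2[:, 0] ** 2 + l2[:, 1] ** 2)
    d1 = (l1 * u).sum(1) ** 2 / (l1[:, 0] ** 2 + l1[:, 1] ** 2)
    return (d1 + d2).sum()

κ can then be minimized over this cost, for example with simulated annealing as above, or with a simple scan such as min(np.linspace(-0.2, 0.2, 81), key=lambda k: epipolar_cost(k, m, m2)).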


Figure 5.1: Radial distortion test. Left: original; right: undistorted. The cost function graph for each feature point is below.

The algorithm assumes that κ is equal for two neighboring cameras. κ for the i-th camera is approximated by averaging the values obtained from both pairs (with the (i-1)-th and (i+1)-th cameras). Experiments showed that this algorithm does not find a good κ for small motion, and similarly when the radial distortion is higher, |κ| > 0.2.

5.2 Results

The radial distortion algorithm was tested on grid patterns and on real images. I did not aim to find a perfect calibration, because global nonlinear minimization, such as bundle adjustment, can correct radial distortion too. The numerical complexity of the method depends on the number of tested feature pairs, because the fundamental matrix is recalculated in each iteration. Our experience shows that a linear estimation of the fundamental matrix suffices. For 200 feature points we estimate the radial distortion in under 1 second on a machine with a 2.8 GHz AMD CPU.

For a grid pattern with resolution 512x512 the error against the ground truth was 6 pixels at the image corners before correction and less than 1.64 pixels (measured on feature points) after it. See figure 5.1.


Chapter 6

Conclusion and future work

This thesis presented a sequential approach for creating calibrated motion and structure from an uncalibrated video sequence, with a dense 3D reconstruction of the space. Sequential processing allows us to process the input video directly from the camera stream. The biggest advantage of processing the stream is that we can skip storing it to disk and compressing it, which leads to better quality (due to the uncompressed transfer).

Because there is always noise in the images, it is not good to calculate the camera position from only two views. Therefore it would be better in the future to improve the camera projection matrix calculation by using more images, perhaps with a factorization approach. My experience with a real camera also showed that if the principal point is not in the image centre, then the scene stays skewed even after self-calibration. For cheap hand-held cameras the principal point cannot be expected to lie exactly at the image centre. Allowing the principal point to be constant (but off-centre) or varying leads to a non-linear self-calibration algorithm [4].

Since there are always many ambiguities in the dense reconstruction process, I recommend an algorithm whose pipeline the user can control. This is possible and effective when the algorithm responds to user interaction promptly. The slowest part of my reconstruction pipeline is the dense reconstruction algorithm. Because it works in rectified space, several scan lines can be processed in parallel, which makes it possible to reduce the response time.

More work also needs to be done on the triangulation process. It would be better to extract a dense point cloud and generate the surface using point cloud approximation algorithms.


Appendix A

Resources

Results from my work, together with some additional materials, are included on the enclosed compact disc.

There you can find:

• The text of this master thesis in PDF format.

• The paper submitted to CESCG 2005 in Budmerice.

• The paper submitted to ŠVK 2005, FMFI UK, Bratislava.

• Images with results from various parts of the algorithm.

• Videos of the reconstructed scenes.

• Test sets: generated, captured, and standard video and images.

• Papers relevant to my thesis, most of them referenced in the bibliography.

• Source code of some parts.

Online resources

More recent versions, updates and results can be found on the homepage of DataExpert, s.r.o., http://www.dataexpert.sk.


Bibliography

[1] M. Pollefeys, L. Van Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, R. Koch. Visual modeling with a hand-held camera. International Journal of Computer Vision 59(3), 207-232, 2004.

[2] M. Han, T. Kanade. Creating 3D Models with Uncalibrated Cameras. Proceedings of the IEEE Computer Society Workshop on the Application of Computer Vision (WACV 2000), December 2000.

[3] P. Sturm, B. Triggs. A Factorization Based Algorithm for Multi-Image Projective Structure and Motion. 4th European Conference on Computer Vision, Cambridge, England, April 1996, pp. 709-720.

[4] R. Hartley, A. Zisserman. Multiple View Geometry in Computer Vision. Second Edition. Cambridge University Press, UK, March 2004.

[5] A. W. Fitzgibbon, A. Zisserman. Automatic Camera Tracking. Robotics Research Group, Department of Engineering Science, University of Oxford, UK.

[6] S. Birchfield. KLT: An Implementation of the Kanade-Lucas-Tomasi Feature Tracker. Stanford University, http://vision.stanford.edu/~birch

[7] C. Harris, M. Stephens. A combined corner and edge detector. Fourth Alvey Vision Conference, pp. 147-151, 1988.

[8] M. Fischler, R. Bolles. Random Sample Consensus: A Paradigm for Model Fitting. Communications of the ACM, 24(6), 381-395, 1981.

[9] R. Hartley. In defense of the eight-point algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(6):580-593, June 1997.

[10] A. W. Fitzgibbon. Simultaneous linear estimation of multiple view geometry and lens distortion. Department of Engineering Science, University of Oxford, UK.

[11] W. Press, S. Teukolsky, W. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 1992.


[12] B. Triggs, P. McLauchlan, R. Hartley, A. Fitzgibbon. Bundle Adjustment - A Modern Synthesis. Vision Algorithms: Theory and Practice, Springer Verlag, 298-375, 2000.

[13] M. I. A. Lourakis, A. A. Argyros. The Design and Implementation of a Generic Sparse Bundle Adjustment Software Package Based on the Levenberg-Marquardt Algorithm. Institute of Computer Science, FORTH, Heraklion, Crete, Greece, August 2004.

[14] B. Triggs. The Absolute Quadric. Proc. 1997 Conference on Computer Vision and Pattern Recognition, IEEE Computer Society Press, pp. 609-617, 1997.

[15] M. Pollefeys, R. Koch, L. Van Gool. Self-Calibration and Metric Reconstruction in spite of Varying and Unknown Intrinsic Camera Parameters. International Journal of Computer Vision, Kluwer Academic Publishers, Boston, 1998.

[16] P. Beardsley, A. Zisserman, D. Murray. Sequential Updating of Projective and Affine Structure from Motion. International Journal of Computer Vision 23(3), Jun-Jul 1997, pp. 235-259.

[17] M. Pollefeys. 3D Photography. comp290-89, Fall 2004, University of North Carolina.

[18] V. Kvasnička, J. Pospíchal, P. Tiňo. Evolučné algoritmy (Evolutionary Algorithms). Slovak Technical University, Bratislava, 2000.

[19] K. N. Kutulakos, S. M. Seitz. A Theory of Shape by Space Carving. Proc. Seventh Int'l Conf. on Computer Vision, vol. 1, pp. 307-314, 1999.

[20] V. Kolmogorov, R. Zabih. Multi-Camera Scene Reconstruction via Graph Cuts. Proc. Seventh European Conf. on Computer Vision, 2002.

[21] G. Zeng, S. Paris, L. Quan, M. Lhuillier. Surface Reconstruction by Propagating 3D Stereo Data in Multiple 2D Images. Department of Computer Science, HKUST, Clear Water Bay, Kowloon, Hong Kong.

[22] M. Lhuillier, L. Quan. A Quasi-Dense Approach to Surface Reconstruction from Uncalibrated Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3), March 2005.

[23] I. Cox, S. Hingorani, S. Rao. A Maximum Likelihood Stereo Algorithm. Computer Vision and Image Understanding, 63(3), 1996.

[24] S. Birchfield, C. Tomasi. Depth Discontinuities by Pixel-to-Pixel Stereo. International Journal of Computer Vision, 35(3): 269-293, December 1999.


[25] S. M. Seitz, C. R. Dyer. Photorealistic Scene Reconstruction by Voxel Coloring. International Journal of Computer Vision, 35(2), 151-173, 1999.

[26] W. B. Culbertson, T. Malzbender, G. Slabaugh. Generalized Voxel Coloring. Workshop on Vision Algorithms: Theory and Practice, Corfu, Greece, 1999.

[27] M. Okutomi, T. Kanade. A multiple baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15, 1993.

[28] C. Strecha, R. Fransens, L. Van Gool. Wide-baseline Stereo from Multiple Views: a Probabilistic Account. ESAT-PSI, University of Leuven, Belgium.

[29] C. R. Dyer. Volumetric scene reconstruction from multiple views. In: Foundations of Image Understanding, 2001, pp. 469-489.