16
International Journal of Computer Vision 30(1), 27–42 (1998) c 1998 Kluwer Academic Publishers. Manufactured in The Netherlands. The Tractability of Segmentation and Scene Analysis MARTIN C. COOPER IRIT, University of Toulouse III, 31062 Toulouse, France Received March 4, 1997; Revised April 7, 1998; Accepted April 23, 1998 Abstract. One of the fundamental problems in computer vision is the segmentation of an image into semantically meaningful regions, based only on image characteristics. A single segmentation can be determined using a linear number of evaluations of a uniformity predicate. However, minimising the number of regions is shown to be an NP-complete problem. We also show that the variational approach to segmentation, based on minimising a criterion combining the overall variance of regions and the number of regions, also gives rise to an NP-complete problem. When a library of object models is available, segmenting the image becomes a problem of scene analysis. A sufficient condition for the reconstruction of a 3D scene from a 2D image to be solvable in polynomial time is that the scene contains no cycles of mutually occluding objects and that no range information can be deduced from the image. It is known that relaxing the no cycles condition renders the problem NP-complete. We show that relaxing the no range information condition also produces an NP-complete problem. Keywords: segmentation, scene analysis, computational complexity, NP-completeness 1. Tractability of Vision Problems Segmentation is perhaps the most basic operation in low-level vision. Important applications are encoun- tered in, among other areas, the analysis of satellite imagery, medical imagery, geology and metallurgy (Borisenko et al., 1987; Fu and Mui, 1981; Haralick and Shapiro, 1985; Pal and Pal, 1993). When a library of object models is available, the segmentation problem falls in the domain of scene analysis. Different formal specifications of the segmentation problem exist. When choosing among them it is criti- cal to know which problems are tractable and which are intractable. Apparently small changes in the problem specification can change a problem solvable in polyno- mial time into an NP-complete problem. The way in which a problem is expressed is often as important as the algorithm which is used to solve it. Establishing the borderline between those segmenation problems which are solvable in polynomial time and those which are intractable is clearly of great practical importance to computer vision engineers. Before embarking on an analysis of the tractability of different versions of the segmentation problem, it is interesting to try to identify what makes certain vision problems easy and others intractable. Certain problems are easy because they can be solved by application of local operations only. The detec- tion of local features, such as edge segments, cor- ners, discontinuities of curvature and zeros of curvature clearly fall into this category. In fact, this class of easy problems can be enlarged to include all problems of unconstrained detection. The identification and locali- sation of sets of local features which could correspond to partially visible rigid objects, occluded by unknown objects of arbitrary shape, is another example of un- constrained detection. This problem, although far from trivial, is solvable in polynomial time. If, for example, three local features in the image are sufficient to de- termine the three-dimensional position and orientation of an object, then an exhaustive search over all triples of pairs of matching local features in the image and the object model will solve it. The hypothesise and test approach to the recognition of partially occluded

The Tractability of Segmentation and Scene Analysis

Embed Size (px)

Citation preview

Page 1: The Tractability of Segmentation and Scene Analysis

P1: GRN

International Journal of Computer Vision kl654-02-cooper October 13, 1998 16:58

International Journal of Computer Vision 30(1), 27–42 (1998)c© 1998 Kluwer Academic Publishers. Manufactured in The Netherlands.

The Tractability of Segmentation and Scene Analysis

MARTIN C. COOPERIRIT, University of Toulouse III, 31062 Toulouse, France

Received March 4, 1997; Revised April 7, 1998; Accepted April 23, 1998

Abstract. One of the fundamental problems in computer vision is the segmentation of an image into semanticallymeaningful regions, based only on image characteristics. A single segmentation can be determined using a linearnumber of evaluations of a uniformity predicate. However, minimising the number of regions is shown to be anNP-complete problem. We also show that the variational approach to segmentation, based on minimising a criterioncombining the overall variance of regions and the number of regions, also gives rise to an NP-complete problem.

When a library of object models is available, segmenting the image becomes a problem of scene analysis. Asufficient condition for the reconstruction of a 3D scene from a 2D image to be solvable in polynomial time is thatthe scene contains no cycles of mutually occluding objects and that no range information can be deduced from theimage. It is known that relaxing the no cycles condition renders the problem NP-complete. We show that relaxingthe no range information condition also produces an NP-complete problem.

Keywords: segmentation, scene analysis, computational complexity, NP-completeness

1. Tractability of Vision Problems

Segmentation is perhaps the most basic operation inlow-level vision. Important applications are encoun-tered in, among other areas, the analysis of satelliteimagery, medical imagery, geology and metallurgy(Borisenko et al., 1987; Fu and Mui, 1981; Haralickand Shapiro, 1985; Pal and Pal, 1993). When a libraryof object models is available, the segmentation problemfalls in the domain of scene analysis.

Different formal specifications of the segmentationproblem exist. When choosing among them it is criti-cal to know which problems are tractable and which areintractable. Apparently small changes in the problemspecification can change a problem solvable in polyno-mial time into an NP-complete problem. The way inwhich a problem is expressed is often as important asthe algorithm which is used to solve it. Establishing theborderline between those segmenation problems whichare solvable in polynomial time and those which areintractable is clearly of great practical importance tocomputer vision engineers.

Before embarking on an analysis of the tractabilityof different versions of the segmentation problem, it isinteresting to try to identify what makes certain visionproblems easy and others intractable.

Certain problems are easy because they can be solvedby application of local operations only. The detec-tion of local features, such as edge segments, cor-ners, discontinuities of curvature and zeros of curvatureclearly fall into this category. In fact, this class of easyproblems can be enlarged to include all problems ofunconstrained detection. The identification and locali-sation of sets of local features which could correspondto partially visible rigid objects, occluded by unknownobjects of arbitrary shape, is another example of un-constrained detection. This problem, although far fromtrivial, is solvable in polynomial time. If, for example,three local features in the image are sufficient to de-termine the three-dimensional position and orientationof an object, then an exhaustive search over all triplesof pairs of matching local features in the image andthe object model will solve it. The hypothesise andtest approach to the recognition of partially occluded

Page 2: The Tractability of Segmentation and Scene Analysis

P1: GRN

International Journal of Computer Vision kl654-02-cooper October 13, 1998 16:58

28 Cooper

objects is based on the search for matching sets of localfeatures (Ayache and Faugeras, 1986). An orthogonalapproach to the same problem is an exhaustive searchin transformation space (Cass, 1992).

To resume, unconstrained detection problems aretractable because the search space is not combinatorial.

What makes the interpretation of multi-object im-ages intractable in the most general case is the presenceof constraints between the interpretations of differentparts of the image, meaning that the interpretation mustbe treated as a whole and cannot be decomposed into in-dependent non-combinatorial subproblems. The mostbasic physical constraint that two solid objects cannotoccupy the same physical space, and hence do not in-tersect, is sufficient to make the interpretation of multi-object images NP-complete (see Section 7).

The line drawing labelling problem is a classic ex-ample of a combinatorial problem in vision, since thelabel for one line may influence the possible labels forall other lines in the drawing. Huffman (1971) andClowes [1971] gave constraints which must be satis-fied by a semantic labelling of the lines in a draw-ing of polyhedra with trihedral vertices. Determiningwhether a drawing has a legal global labelling accord-ing to the Huffman-Clowes scheme has been shown tobe NP-complete (Kirousis and Papadimitriou, 1988).Some drawings which can be labelled according to thisscheme are not physically realisable (Sugihara, 1986),but the realisability problem has also been shown to beNP-complete (Kirousis and Papadimitriou, 1988).

We call the problem of reconstructing a three-dimensional scene from a two-dimensional image, sub-ject to constraints concerning the physical realisabilityof the scene,constrained reconstruction.

If, in general, constrained reconstruction can giverise to intractable problems, this is not always the case.More restrictive (Dendris et al., 1994; Kirousis, 1990;Kirousis and Papadimitriou, 1988) or less restrictive(Cooper, 1997) versions of the Huffman-Clowes la-belling scheme exist which have 0/1/all constraints,a class of tractable constraints (Cooper et al., 1994;Kirousis, 1993). A constraint satisfaction problem with0/1/all constraints can be viewed as a generalisation ofthe well known tractable problem 2SAT to multi-valuedlogics, and is solvable in polynomial time. Knowl-edge of vanishing points in an image is also suffi-cient to convert the Huffman-Clowes labelling prob-lem into a polynomially solvable problem (Parodi andTorre, 1994). This is again achieved by a reductionto 2SAT.

The interpretation of multi-object images also hasan important tractable version. A single interpretationof an image composed of known flat objects occludingeach other can be found in low-degree polynomial time,provided that the objects do not form cycles in depth.Furthermore, a data structure combining all valid inter-pretations (of which there may be a exponential num-ber) can also be determined in low-degree polynomialtime (Cooper, 1992).

A simplifying operation which can often convert anintractable problem into an easy problem is dimen-sionality reduction. A problem which is intractable ona two-dimensional image may become tractable on aone-dimensional image (a string). On the other hand,applying an optimisation criterion may render an oth-erwise tractable problem intractable. For example, inthis paper we show that finding a single valid segmen-tation of a two-dimensional image becomes intractablewhen a global optimisation criterion is applied, and thatthe segmentation problem becomes easy again when anoptimisation criterion is applied to a one-dimensionalimage.

In conclusion, whereas the unconstrained detec-tion of features and sets of features is tractable, thereconstruction of a three-dimensional scene from atwo-dimensional image subject to constraints on thephysical realisability of the scene is, in general, an NP-hard problem. However, interesting tractable versionsof the same problem often exist in which constraintshave been either tightened or slackened. An optimi-sation criterion is an artificial global constraint whichcan render a tractable problem NP-hard.

2. Definitions

In the first part of this paper we study the problemof segmenting an image independently of any seman-tic information. The optimal segmentation found willalmost always be an oversegmentation, in the sensethat regions correspond to parts of objects rather thanwhole objects, due to the presence of surface-normaldiscontinuities, shadows, specular reflections or sur-face markings. On the other hand, constrast failure bet-ween two objects whose projections are adjacent in theimage leads to the inevitable merging of two unrelatedregions.

Thresholding methods (Lim and Lee, 1990) areclearly efficient but inadequate in many applicationssince they ignore the relative spatial disposition of

Page 3: The Tractability of Segmentation and Scene Analysis

P1: GRN

International Journal of Computer Vision kl654-02-cooper October 13, 1998 16:58

The Tractability of Segmentation and Scene Analysis 29

pixels. Edge-based methods ignore the values of pix-els which are not close to sharp discontinuities in theimage. Both of these approaches are based on in-herently local operations. A more global approach isto attempt to fit a piecewise smooth function to theimage while simultaneously minimising the numberof regions (Beaulieu and Goldberg, 1989) or the to-tal length of region boundaries (Mumford and Shah,1989; Nitzberg et al., 1991). We begin with a discreteversion of the segmentation problem based on a uni-formity predicate, before showing that the complexityresults obtained can be generalised to the variationalformulation of the optimal segmentation problem.

The following definition of a valid segmentation,based on a uniformity predicate for regions, is adaptedfrom Pavlidas (1977).

Let X be a grid, on which is defined an imagewith pixel valuesI (p) for all points p ∈ X. A one-dimensional grid is represented by

X = {i : 1≤ i ≤ N}and a two-dimensional square grid by

X = {(i, j ) : 1≤ i ≤ N and 1≤ j ≤ N}

Definition 2.1. A uniformity predicate Uassigns atruth valueU (Y) to each non-empty subsetY of thegrid X as a function only of the values ofI (p) forp ∈ Y, and satisfies the subset property:

(Z ⊆ Y ∧ Z 6= ∅ ∧U (Y))⇒ U (Z)

Furthermore,U (Y)= true for all Y ⊆ X such that|Y| =1.

Definition 2.2. A segmentationof X according to theuniformity predicateU is a partition ofX into regionsX1, . . . , Xr such that

(i) eachXi is connected(ii) U (Xi ) = true for i = 1, . . . , r(iii) U (Xi ∪ X j ) = false for all pairs of adjacent re-

gionsXi , X j .

Note that, by the definition of a partition,X1 ∪· · · ∪ Xr = X and Xi ∩ X j = ∅ for all i 6=j . Different definitions of a valid segmentation re-sult from the different possible definitions of con-nected, such as 4-connected or 8-connected in twodimensional images. A regionXi is connected if be-tween each pair of pixelsp,q ∈ Xi , there is a path

p = p1, p2, . . . , pt =q of pixels such that, for eachi ∈ {1, . . . , t − 1}, pi and pi+1 are adjacent. In thedefinition of 4-connectedness, pixel (i, j ) is only adja-cent to pixels (i −1, j ), (i +1, j ), (i, j −1), (i, j +1),whereas in the definition of 8-connectedness, (i, j ) isalso adjacent to pixels (i − 1, j − 1), (i − 1, j + 1),(i + 1, j − 1), (i + 1, j + 1). In this paper we assumethat connected means 4-connected. 8-connectednessprovides unsatisfactory results when trying to min-imise the number of regions in a segmentation. Forexample, a chessboard is composed of only two 8-connected regions, the black squares and the whitesquares.

An image may have an exponential number of validsegmentations. It is therefore natural to try to discrim-inate between them by means of some optimisationcriterion.

Definition 2.3. A minimal segmentationof an imageis such that the number of regions is minimal over allvalid segmentations.

This criterion is derived from the assumption thatsimpler interpretations are more likely. Condition (iii)in Definition 2.2 only ensures that a segmentation isa partition of the image into coherent regions whichlocally minimises the number of regions, in that notwo adjacent regions can be merged to form a coherentregion.

To illustrate the utility of the global minimisationcriterion, consider the image shown in Fig. 1(a) in theform of a matrix of grey levels. Assuming thatU is theuniformity predicate

U (X) = true iff for all pointsp,q ∈ X

|I (p)− I (q)| ≤1 (1)

Figure 1(b) shows a minimal segmentation. It is clearthat this segmentation is preferable to the valid seg-mentation in Fig. 1(c) which is non-optimal.

Figure 1. (a) An image, (b) a minimal segmentation, and (c), (d)two non-minimal segmentations.

Page 4: The Tractability of Segmentation and Scene Analysis

P1: GRN

International Journal of Computer Vision kl654-02-cooper October 13, 1998 16:58

30 Cooper

Figure 2. (a) A one-dimensional image, (b) a segmentation of thisimage, (c) the greedy segmentation, and (d) comparison of the greedysegmentation and a minimal segmentation.

Even in one dimension, an image can have an expo-nential number of valid segmentations, as the followingexample shows. LetU be the same simple uniformitypredicate defined in (1). In a valid segmentation of theimage in Fig. 2(a), sequences of 0’s and sequences of2’s must be separated by a boundary, but the boundariescan lie either to the left or to the right of the 1. For ex-ample, one possible segmentation is given in Fig. 2(b),but there are clearly an exponential number of validsegmentations.

It should be noted that the number of minimal seg-mentations may also be exponential. This is the casein the one-dimensional example of Fig. 2, in which allvalid segmentations (according to Definition 2.2) areminimal. It should also be noted that there is no guar-antee that the correct segmentation will be minimal.For example, Fig. 1(d) shows a non-minimal segmen-tation of the image in Fig. 1(a). This segmentation isjust as plausible as the minimal segmentation shown inFig. 1(b).

In the presence of Gaussian noise, a simple thresh-old on the difference between pixel values of a region,as embodied by the uniformity predicate given by (1),is too naive as a means of distinguishing between im-age regions that do or do not have an acceptable planarapproximation. Furthermore, it is clearly desirable tobe able to distinguish between acceptable approxima-tions by optimising a functional taking into account thecloseness of the approximation. Modelling the image I

as a piecewise constant function with added Gaussiannoise, the maximum likelihood criterion is to minimisethe total variance

Var=r∑

i=1

∑p∈Xi

(I (p)− µi )2

whereX1, . . . , Xr is a partition of the image into con-nected regions and whereµi is the mean value of pixelsin regionXi . To simultaneously minimise the numberof regionsr , we can minimise Var+ λr for some con-stantλ. This leads to the following definition.

Definition 2.4. An optimal segmentationof an im-age is a partition of the image into connected regionsX1, . . . , Xr which minimises Var+ λr .

Note that an optimal segmentation does not neces-sarily satisfy Definition 2.2 of a segmentation for anygiven uniformity predicateU . It is clear that Defini-tion 2.4 can be generalised by approximating the imageby any piecewise smooth functions, such as polyno-mials or splines, instead of constant functions. Thesimplicity criterion can be generalised to include notonly the numberr of regions, but also, for example, thesmoothness and the length of region boundaries.

3. Polyomial Segmentation Algorithms

For one-dimensional images, a simple greedy algo-rithm is sufficient to find a minimal segmentation inpolynomial time. Given an imageI (1), . . . , I (N) wefirst find the largest value ofi ≤ N such that

U ({1, 2, . . . , i }) = true

Then, if i < N, we apply recursively the same al-gorithm to the remaining imageI (i + 1), . . . , I (N).For example, the result of this algorithm on the one-dimensional image of Fig. 2(a) is given in Fig. 2(c).

Let NU be the number of evaluations of the uni-formity predicateU by this greedy algorithm.NU isclearly a linear function ofN, the length of the image.

Let SG represent the segmentation found by thisgreedy algorithm. We will show thatSG is minimal.Suppose, on the contrary, thatSM is a minimal segmen-tation composed of fewer regions thanSG. Supposethat the regions inSG end at pixelsi1, . . . , i r and theregions inSM end at pixelsj1, . . . , js, wheres < r .SinceSM is composed of fewer regions thanSG, it must

Page 5: The Tractability of Segmentation and Scene Analysis

P1: GRN

International Journal of Computer Vision kl654-02-cooper October 13, 1998 16:58

The Tractability of Segmentation and Scene Analysis 31

contain a boundaryjk which is further to the right thanthe corresponding boundaryi k. Let k be the first indexfor which jk > i k. Then the regionRG ending ati k inSG must be a proper subset of the regionRM endingat jk in SM . This is illustrated in Fig. 2(d). Further-more, RG ∪ {i k + 1} ⊆ RM . By the subset property,U (RG ∪ {i k + 1})=U (RM)= true, which contradictsthe choice ofi k by the greedy algorithm.

An obvious question is whether an optimal segmen-tation of a one-dimensional image, according to Defini-tion 2.4, can also be found in polynomial time. In fact, itis possible to find an optimal segmentation in quadratictime, using a dynamic programming algorithm.

To see this, for eachi ∈ [0, N], let SOPT(i ) repre-sent the optimal segmentation of the truncated imageI (1), . . . , I (i ) and let Score(i ) be its score given byVar+ λr . If the last region inSOPT(i ) is [k+1, i ], then

Score(i ) = Score(k)+ Var({I (k+ 1), . . . , I (i )})+ λ

where

Var({I (k+ 1), . . . , I (i )}) =i∑

t=k+1

(I (t)− µ)2

=i∑

t=k+1

I (t)2− (i − k)µ2

with µ = 1i−k

∑it=k+1 I (t).

Thus,SOPT(i ) is given by

SOPT(i ) = SOPT(k) ∪ {[k+ 1, i ]}

wherek ∈ [0, i − 1] is such that

Score(k)+ Var({I (k+ 1), . . . , I (i )})+ λ

is minimal. We can now use a standard dynamic pro-gramming algorithm to determineSOPT(N) in O(N2)

time andO(N) space.It is clear that the one-dimensional case is much sim-

pler than the two-dimensional problem. A single, notnecessarily minimal or optimal, segmentation in twodimensions can be found in polynomial time, in factusing only a linear number of evaluations of the uni-formity predicate, by the following merging algorithm.

We maintain an index REG(p) = region containingthe pixel p.

SEG BY MERGING:

{First segment the image into individual pixels}for all pixels p do REG(p) := {p};for all pairs of neighbouring pixelsp,q do

if REG(p) and REG(q) are not the same regionthen if REG(p)∪REG(q) satisfiesU

then merge REG(p) and REG(q)

The order in which the pairs of pixels (p,q) are exa-mined is unspecified. Different orders may producedifferent segmentations.

SEG BY MERGING grows but never shrinks re-gions. If at some point REG(p)∪ REG(q) does notsatisfy U , then we know, by the subset property,that R∪ S cannot satisfyU , for any pair of regionsR⊇REG(p), S⊇REG(q). This shows that, for ev-ery pair of pixelsp,q, there is no need to test twicewhether REG(p)∪ REG(q) satisfiesU , since anygrowing of REG(p) or REG(q) cannot change the valueof U (REG(p)∪REG(q)) from false to true.

SEG BY MERGING clearly makesO(n) evalua-tions of the uniformity predicateU , wheren is thenumber of pixels in the image. Merging two regionsrequires updating REG(r ) for all pixels r in one ofthe two regions. Despite this, the total number of up-dates is bounded above byn log2 n provided that wealways update the pixels in the smaller of the two re-gions. Indeed, for each pixelr , the index REG(r ) willbe updated a maximum of log2 n times: each timeREG(r ) is updated, the size of the region REG(r ) atleast doubles, and the maximum size of REG(r ) isbounded above byn, the number of pixels in the image.The time complexity of SEGBY MERGING is thusO(nc+ n logn) wherec is the cost of testing whetherREG(p)∪REG(q) satisfiesU .

Split-and-merge algorithms, which start with an ini-tial segmentation into regions of sizek pixels, arenot only more robust to noise but have also beenfound in practice to be more efficient by a constantfactor (Pavlidas, 1977). Our intention in presentingSEG BY MERGING is simply to show formally thatan efficient algorithm to find a single segmentation doesexist.

4. Minimal Segmentation in Two Dimensions

In this section we show that finding a valid 2D seg-mentation with the minimum number of regions isNP-complete. We will then use the same construction

Page 6: The Tractability of Segmentation and Scene Analysis

P1: GRN

International Journal of Computer Vision kl654-02-cooper October 13, 1998 16:58

32 Cooper

to prove the NP-completeness of finding an optimalsegmentation (according to Definition 2.4) of a two-dimensional image. Since complexity theory (Gareyand Johnson, 1979; Papadimitriou, 1994) only ap-plies to decision problems, we express minimal two-dimensional segmentation as a decision problem.

MIN2DSEG: Given anN× N imageI , a uniformitypredicateU and a constantk, does there exist a validsegmentationSof I using this predicateU such thatShas at mostk regions?

We will show the NP-completeness of MIN2DSEGin the case that the uniformity predicate is given by (1),above. Other uniformity predicates may give rise toproblems which are polynomially solvable.

There is a polynomial equivalence between thedecision problem MIN2DSEG and the problem ofdetermining the minimal number of regions in allvalid segmentations ofI , since the latter problem isequivalent to solving MIN2DSEG for each ofk =1, 2, . . . , N2.

It is clear that MIN2DSEG is in NP, since we canverify in polynomial time that a given segmentation isvalid and consists of at mostk regions. To prove thatMIN2DSEG is NP-complete we give a polynomial re-duction from PLANAR 3SAT to MIN2DSEG. SincePLANAR 3SAT is NP-complete (Lichtenstein, 1982),any polynomial-time algorithm for MIN2DSEG wouldthus provide polynomial-time algorithms for all prob-lems in NP.

An instance of 3SAT is a set of Boolean variables{vi }and a set of disjunctions of 3 literals{Dj }. A solution isan assignment of truth values to the variables{vi } suchthat all the disjunctions are simultaneously satisfied.An instance of PLANAR 3SAT is an instance of 3SATthat can be represented by a planar graphG. There is avertex inG for each variablevi and for each disjunctionDj ; there is an edge linkingvi andDj if and only if vi

or¬vi is one of the three literals inDj .

Theorem 4.1. MIN2DSEG is NP-complete.

Proof: The theorem follows from the following poly-nomial reduction of PLANAR 3SAT to MIN2DSEG.

Given an arbitrary instance of PLANAR 3SAT, wewill construct an imageI and a valuek such thatI has avalid segmentation consisting of at mostk regions if andonly if this instance of PLANAR 3SAT is satisfiable.

The imageI is built from three types of construc-tions: one for each variablevi , one for each disjunctionDj and one to negate a variablevi to produce¬vi .

Figure 3. (a) A variable construction and its two possible mini-mal segmentations, (b) the branching out of a variable constructiontowards several other constructions.

We choose the uniformity predicate to be

U (X) = true iff for all pointsp,q ∈ X

|I (p)− I (q)| ≤ 1

In other words, a region is uniform if its pixel valueslie within 1 of each other.

The construction representing the variablevi isshown on the left of Fig. 3(a). The shaded area in all fig-ures represents a background area with a constant pixelvalue of 4. Such a value guarantees that the backgroundis segmented independently of the foreground, and canthus be ignored, since the foreground only contains pix-els of value 0, 1 or 2. The construction on the left ofFig. 3(a) has only two minimal segmentations (shownon the right of Fig. 3(a)) which we associate, as indi-cated, with the two possibilitiesvi = true andvi = false.The segmentation corresponding tovi = true retains thevalue 2 whose influence is propagated to all disjunc-tionsDj containingvi . Figure 3(b) shows how a singleconstruction of the type shown in Fig. 3(a) can fork outtowards many disjunctions.

On the left of Fig. 4(a) is the construction to negatea variable. The input region at the left is connectedto the construction for variablev. The critical pixelsare those of value 1 which have neighbouring pixels ofvalue 0 and 2; they can be in the same region as eitherthe 0’s or the 2’s. There are only two minimal segmen-tations and these are shown on the right of Fig. 4(a).If v= false, then the output region at the right of theconstruction contains a 2, which by convention corre-sponds to the value true. Ifv= true then the outputregion contains a 0 corresponding to the value false.

Page 7: The Tractability of Segmentation and Scene Analysis

P1: GRN

International Journal of Computer Vision kl654-02-cooper October 13, 1998 16:58

The Tractability of Segmentation and Scene Analysis 33

Figure 4. (a) A negation construction and its two possible minimalsegmentations, and (b) an example of a non-minimal segmentation.

Forcing the input and output regions to both containthe value 0, for example, produces a non-minimal seg-mentation as illustrated in Fig. 4(b).

The construction for the disjunctionv1 ∨ v2 ∨ v3 isshown in Fig. 5. (In fact, eight copies are shown.) Theconstruction is connected at the top to the constructions(of the type shown in Fig. 3(a)) for the three variablesv1, v2, v3 as indicated. The construction contains twocopies of the negation construction of Fig. 4(a). Theshaded area has a pixel value of 4 and the white “paths”have a pixel value of 1, except for a single 2, as shown.There are seven minimal segmentations correspondingto the seven combinations of values ofv1, v2, v3 whichsatisfy v1∨ v2∨ v3. All eight cases are shown, cor-responding to the eight possible assignments of truthvalues tov1, v2, v3. If v1 = v2 = v3 = false, then thebest we can do is the segmentation shown at the bottomright of Fig. 5 which has one more region than the otherseven minimal segmentations shown in Fig. 5.

We now have all the necessary constructions toreduce a graphG corresponding to an instance of

Figure 5. The disjunction construction forv1 ∨ v2 ∨ v3 and exam-ples of minimal segmentations for each of the eight combinations ofvalues forv1, v2, v3.

Page 8: The Tractability of Segmentation and Scene Analysis

P1: GRN

International Journal of Computer Vision kl654-02-cooper October 13, 1998 16:58

34 Cooper

Figure 6. (a) Two negation constructions separate the variable cons-truction forvi and the disjunction construction forD j containingvi ,and (b) a single negation construction separates the variable construc-tion for vi and the disjunction construction forD j containing¬vi .

PLANAR 3SAT into an image to be segmented. Eachvertex of G corresponding to a variablevi is trans-formed into a variable construction, as illustrated inFig. 3(a). Each vertex corresponding to a disjunctionDj is transformed into a disjunction construction, asillustrated in Fig. 5. For each negated variable¬vi inDj , we add a negation construction, as illustrated inFig. 4(a), to the input region corresponding to¬vi in thedisjunction construction forDj . For each non-negatedvariablevi in Dj , we add two negation constructions,to guarantee independence between different disjunc-tion constructions. This double negation does not alterthe logical value ofvi . This is illustrated in Fig. 6.

Let b denote the number of background regionsin the image,n the number of Boolean variablesvi ,m the number of disjunctionsDj and p the num-ber of negation constructions (not counting thosewithin disjunction constructions). Define the constantkas

k = b+ 2n+ 4m+ 3p

Each variable construction adds at least two regionsto the total number of regions. Each disjunction con-struction adds at least four regions to the total number ofregions. Each negation construction adds at least threeregions to the total number of regions. The constructedimage has a valid segmentation ofk regions only if eachvariable construction has a minimal segmentation, asillustrated in Fig. 3(a), each negation construction hasa minimal segmentation, as illustrated in Fig. 4(a), andeach disjunction construction has a minimal segmen-tation, as illustrated in Fig. 5.

The assignment of truth values corresponding tosuch a segmentation satisfies each of the disjunctionsDj and is hence a solution to the instance of PLANAR3SAT. Conversely, from a solution for the instance ofPLANAR 3SAT, we can construct a segmentation ofthe image consisting of exactlyk regions.

Since this reduction from PLANAR 3SAT toMIN2DSEG is clearly polynomial, we can deduce thatMIN2DSEG is NP-complete. 2

The above proof usesd = 5 brightness levels. Itis an open problem whether MIN2DSEG remains NP-complete ford= 3 or 4. MIN2DSEG is trivially solv-able in polynomial time ford = 2, since there is onlyone non-trivial segmentation. The above proof makesuse of a uniformity predicate which specifies that notwo pixel values differ by more than a given constantvalue. Simpler uniformity predicates, such as thresh-olding at fixed levels, uniquely determine a segmenta-tion, and hence the minimal segmentation can again befound in polynomial time (Pavlidas, 1977).

In MIN2DSEG the fitting of a piecewise planar func-tion to the image is achieved using a uniformity pred-icate and the only optimisation criterion is the min-imisation of the number of regions. In the variationalapproach, embodied in OPT2DSEG below, functionfitting and the minimisation of the number of regionsare carried out simultaneously by the optimisation of asingle combined functional.

OPT2DSEG: Given anN×N imageI and constantsλ, c, does there exist a partition of the image into con-nected regionsX1, . . . , Xr such that Var+ λr ≤ c?

Theorem 4.2. OPT2DSEG is NP-complete.

Proof: We can use the same reduction fromPLANAR 3SAT as in the proof of Theorem 4.1, ex-cept that, because of the new optimisation criterionVar+ λr , the constructions need to be slightly modi-fied to avoid spurious alternative segmentations. It isnecessary to make the constructions vary as a functionof λ, otherwise for very small values ofλ the optimalsegmentation would be an oversegmentation into manyregions of zero variance and for very large values ofλ

the optimal segmentation would be an undersegmenta-tion comprising a single region of high variance.

The negation construction of Fig. 4(a) is stretchedin a horizontal direction so that the single-pixel areasof value 0 or 2 now containdλ/4e pixels. The con-structions in Figs. 3(a) and 5 are similarly modified so

Page 9: The Tractability of Segmentation and Scene Analysis

P1: GRN

International Journal of Computer Vision kl654-02-cooper October 13, 1998 16:58

The Tractability of Segmentation and Scene Analysis 35

that the single-pixel areas of value 0 or 2 now contain5dλ/4e + 10 pixels. Double negation constructions,as shown on the left-hand side of Fig. 6, are added inorder to limit the size of regions of pixel value 1 to amaximum of 19 pixels.

With these modifications, for the class of imagesconstructed in the proof of Theorem 4.1, the optimalsegmentation is identical to the minimal segmentationfor all values ofλ ≥ 3. 2

5. Discussion of the Tractability of Segmentation

To put into perspective the importance of Theorems 4.1and 4.2, we should remember that NP-completenessresults only prove that no algorithm exists whichhas polynomial worst case time complexity (assum-ing P6=NP). They tell us nothing about the existenceof either heuristicswith worst case polynomial timecomplexity or algorithms withaverage casepolyno-mial time complexity. (See Tsotsos (1990) and Kube(1991) for a more detailed discussion of the relevanceof applying computational complexity results to visualproblems.)

We should also ask ourselves whether the segmenta-tion problems studied in this paper are the most appro-priate abstractions of the diverse image interpretationproblems encountered in practice. Two fundamental as-sumptions about images and the scenes they representare hidden in the definitions of minimal and optimalsegmentations:

1. semantic entities (such as objects) tend to have auniform aspect, implying that regions should satisfythe uniformity predicate or can be approximated bya constant function.

2. an interpretation involving fewer semantic entities,and hence fewer regions, is a priori more likely.

Segmentation is an exclusively region-based ap-proach, a philosophy that we have tried to preservein this paper. However, it is also possible to incor-porate in the optimisation function other criteria con-cerning, for example, the boundaries between regions.These could be based on fundamental assumptions,such as the assumption that object boundaries are morelikely to be smooth and to represent an important dif-ference between pixel values on either side of theboundary.

It should be noted that the condition that regionsbe connected is only one possible criterion for region

coherence. In scene analysis a partially occluded objectis often visible in two or more unconnected areas of theimage. It is possible to imagine alternative criteria forregion coherence, based on, for example, the boundaryshape of the union of possibly unconnected areas.

It is, of course, very doubtful whether generalisingthe problem of optimal segmentation in any of theseways would render it solvable in polynomial time. Analternative optimisation criterion is to minimise the to-tal length of region boundaries in the segmentation.However, Eades and Rappaport (1993) have provedthe NP-hardness of the related problem of finding aminimal separating polygon for two given finite sets ofpoints in the plane. A separating polygon which sep-aratesS from T is a simple polygonal circuitP suchthat every point ofS lies inside or on the boundary ofP while every point ofT lies outsideP. A minimalseparating polygon is such that the sum of the lengthsof the edges ofP is minimal.

An obvious direction for future research is whetherrelaxing the definition of an optimal segmentation willproduce a tractable problem, while keeping the essen-tial characteristics of the problem. For example, asa compromise between minimising a global functionsuch as the number of regions (an NP-complete prob-lem) and having no optimisation function (a problemsolvable in polynomial time), is to apply a greedy algo-rithm which finds the most likely region, according toa local criterion such as maximising its size, subtractsout of the image the pixels which belong to this region,and recursively calls the same algorithm on the remain-ing image. The largest possible uniform region can bedetermined in polynomial time for simple uniformitypredicates of the form

U (X) = true iff for all pointsp,q ∈ X

|I (p)− I (q)| ≤ k

by thresholding and merging operations.

6. Model-Based Segmentation

The segmentation of an image using a library of objectmodels is often known as scene analysis. In fact, theproblem is not only to segment the image but also torecognise and locate objects in a 3D scene from theinformation contained in its projection into a 2D im-age. The loss of information in passing from the 3Dscene to the 2D image is compensated by knowledgeof some or all of the objects which may occur in thescene.

Page 10: The Tractability of Segmentation and Scene Analysis

P1: GRN

International Journal of Computer Vision kl654-02-cooper October 13, 1998 16:58

36 Cooper

Compared with the recognition of isolated objects,the interpretation of multi-object images presents twoextra difficulties:

1. greater ambiguity in the recognition of individualobjects due to partial occlusion by nearer objects orcontrast failure with objects projecting into adjacentregions.

2. possible combinatorial ambiguity due to the pres-ence of many objects, none of which can be recog-nised unambiguously.

Most workers have only concerned themselves withthe first and most immediate of these two problems.It is known that the recognition of a partially visi-ble object independently of the rest of the scene isessentially a polynomial-time problem (Cass, 1992).Examples of polynomial-time techniques that havebeen successfully employed are hypothesise and test(Ayache and Faugeras, 1986), hashing (Lamdan et al.,1988), clustering in transformation space (Cass, 1992)and multi-resolution (Neveu et al., 1986). These meth-ods assume that the partially visible object is occludedby unknown objects. If all the occluding objects can bereliably identified and their projections subtracted outof the image, then this can only increase the reliabil-ity of the recognition of the partially occluded object(Cooper, 1988; Ullmann, 1992). Unfortunately, whenthere is ambiguity, both in the correct segmentation ofthe image and in the semantic labelling of regions, thisleads to a combinatorial problem which in the mostgeneral case is intractable, as we will demonstrate inthe following section.

In order to analyse the complexity of scene analysis,we need a formal definition of a valid interpretation,which we call a reconstruction. Many constraints canbe applied in scene analysis based on knowledge ofphysical laws such as the law of gravity or on high-level knowledge of the possible positions of differentobjects (such as blackboards may be attached to wallsbut never to ceilings). Most collections of objects in 3Dspace are highly unlikely or physically impossible. Werestrict ourselves to applying the most basic physicalconstraint, that two solid objects cannot occupy thesame space.

Definition 6.1. Given a setM of 3D object modelsand a setT of 3D transformations, aphysically possiblesceneis a setS of transformed objectst (m) (m ∈ M

andt ∈ T), such that for allt1(m1), t2(m2) ∈ S,

t1(m1) ∩ t2(m2) = ∅

Definition 6.2. Given an imageI , a projection trans-formationP, a setM of 3D object models and a setTof 3D transformations, areconstructionof the imageIis a setR⊆ M × T such that

S= {t (m) : (m, t) ∈ R}is a physically possible scene which projects intoI ,i.e.,

P(S) = I

and the projection of each element ofS is visible in atleast one pixel of the imageI .

We can now state the decision problem correspond-ing to the reconstruction of a 3D scene from a 2D image.

RECONSTRUCTION: Given an imageI , a projec-tion transformationP, a setM of 3D object modelsand a setT of 3D transformations, does there exist areconstruction ofI ?

It is known that RECONSTRUCTION is solvablein polynomial time when scenes cannot contain cy-cles of mutually occluding objects and when the onlydepth cue in the image is occlusion. There are no depthcues, for example, if the image was produced by anorthographic projection or if the scene is a pile of flatobjects. In fact, if a single reconstruction can be foundin polynomial time, then a compact description ofallreconstructions can also be determined in polynomialtime (Cooper, 1992).

A sufficient condition for RECONSTRUCTION tobe polynomially solvable is that the set of possibleimages can be generated by first calculating the setP(T(M)) of projections of transformed objects andthen displaying elements ofP(T(M)) one behind theother in the image array. An image generated by thismechanism can be interpreted without backtrackingby finding the foremost objectP(t (m)), subtractingP(t (m)) out of the image and then applying the samealgorithm recursively to the remaining image (Cooper,1988, 1992). Unfortunately, there are two reasons whythis generation mechanism may be inaccurate:

(i) certain physically possible images cannot be gen-erated, since cycles of mutually occluding objects

Page 11: The Tractability of Segmentation and Scene Analysis

P1: GRN

International Journal of Computer Vision kl654-02-cooper October 13, 1998 16:58

The Tractability of Segmentation and Scene Analysis 37

cannot be generated by displaying one object be-hind another;

(ii) physically impossible images may be generatedsince the non-intersection of transformed objectsis not tested.

It is already known that RECONSTRUCTION is anNP-complete problem when the set of possible scenesis enlarged to include cycles of mutually occluding ob-jects. In fact, this problem even remains NP-completewhen the correct segmentation of the image is known(Cooper, 1992). In the following section, we completethe analysis of this problem by showing that, even whenobjects cannot form cycles, RECONSTRUCTION isNP-complete when the image contains depth cues suchas perspective, shadows or range values. Intractabilityis due to the non-intersection constraint on transformedobjects. It is curious to note that it is the introductionof depth information and an extra constraint to reducethe number of possible interpretations that renders theproblem intractable. We will show that if depth infor-mation can be deduced from images, then an algorithmto solve RECONSTRUCTION could also be used tosolve jigsaw puzzles. We first need to show that theproblem of solving jigsaw puzzles is NP-complete.

7. The Jigsaw Problem

A special case of the reconstruction of a scene oc-curs when we know that there is no occlusion. In otherwords, the objects fit together to cover the whole scenewithout overlapping, in the same way that the piecesof a jigsaw fit together to form a picture. This problemis of more practical importance in one dimension thanin two, in applications such as speech or text analy-sis. Fortunately, in one dimension the problem can besolved in polynomial time using a dynamic program-ming algorithm.

Definition 7.1. A jigsaw is an imageI , a set of ob-jects M , a set of transformationsT and a projectionoperationP.

Note that we assume that the image to be recon-structed is given. A solution to a jigsaw is a reconstruc-tion, according to Definition 6.2, in which the projec-tions of no two transformed objects overlap.

Definition 7.2. A solution to a jigsaw(I ,M, T, P) isa setSof transformed objectst (m) ∈ T(M) such that

1. for all t1(m1), t2(m2) ∈ S

P(t1(m1)) ∩ P(t1(m1)) = ∅

2. P(S) = I .

In a children’s jigsaw puzzle all pieces must be usedexactly once. Even without this extra constraint, theproblem of reconstructing a jigsaw is NP-complete intwo dimensions.

Definition 7.3. A minimal solution to a jigsaw(I ,M, T, P) is a solution of minimal cardinality.

We now state two decision problems concerning jig-saws.

JIGSAW: Does the jigsaw (I ,M, T, P) have a solu-tion?

MINIMAL JIGSAW: Does the jigsaw (I ,M, T, P)have a solution of cardinality less than or equal tok?

We will show that JIGSAW is NP-complete in twodimensions, but that MINIMAL JIGSAW is solvablein polynomial time in one dimension.

When a transformed objectt (m) is projected intoa one-dimensional image of lengthn, we assume thatthe result is connected. We denote its left-most pixelby start(t (m)) and its right-most pixel byend(t (m)).Transformed objectst (m)which overlap the end of theimage, of lengthn, are considered to end at pixeln.

Definition 7.4. A partial jigsaw solution is a listt1(m1), . . . , tr (mr )of transformed objects which matchexactly the image on pixels 1, . . . , i and which satisfy

1. for q = 1, . . . , r − 1, start(tq+1(mq+1)) =end(tq(mq))+ 1

2. end(tr (mr )) = i .

The following dynamic programming algorithm DP-JIGSAW solves MINIMAL JIGSAW in one dimensionby calculatingD(i ), for i = 0, . . . ,n, where

D(i ) =

minimum number of objects in a partial

jigsaw solutiont1(m1), . . . , tr (mr )

such thattr (mr ) ends at pixeli∞ if no such partial solution exists

Page 12: The Tractability of Segmentation and Scene Analysis

P1: GRN

International Journal of Computer Vision kl654-02-cooper October 13, 1998 16:58

38 Cooper

DP-JIGSAW:

Initialise all D(i ), i = 1, . . . ,n, to infinity andD(0) to 0.

for i := 1 ton dofor all transformed objectst (m) ending at

pixel i doif t (m) matches the imagethenD(i ) := min(D(i ),

1+ D(max(0, i − len(t (m)))))

len(t (m)) is the length of transformed objectt (m).The complexity of DP-JIGSAW will be dominated bythe time to find all correspondences between objectsand the image. These could be determined before-hand by classic linear-time pattern matching algorithms(Aho et al., 1974). A dynamic programming algorithmwhich allows for local distortions in the appearance oft (m) in the image, by minimising the string-edit dis-tance betweent (m) and the image, has been widelyused to find correspondences between object and im-age contours (Ansari and Delp, 1990; Gupta andMalakapalli, 1990).

Theorem 7.5. JIGSAW is NP-complete.

Proof: Given a putative solutionS to a jigsaw(I ,M, T, P), it is simple to verify in polynomial timewhether S is a solution. Thus, to demonstrate NP-completeness it is sufficient to give a polynomial re-duction from 3SAT to JIGSAW.

Let D1∧· · ·∧Dm be any instance of 3SAT onn vari-ables. We will construct a jigsaw (I ,M, T, P) whichhas a solution iff this instance of 3SAT has a solution.

We chooseT to be the set of all rigid transforma-tions in a plane parallel to the image plane andP anorthographic projection. This implies that objects arenot dilated by their projection into the image.

The setM of objects is illustrated in Fig. 7. Thereare two categories of objects: “horizontal” objects(Figs. 7(a), (b), and (c)) and “vertical” objects (Fig. 7(d)and (e)). For each (i, j ) such thatvi is a literal inDj ,there are objects of the kind illustrated in Fig. 7(a). Foreach (i, j ) such that¬vi is a literal inDj , there are ob-jects of the kind illustrated in Fig. 7(b). The characters“ i, j ” are part of the surface markings of the object.Their purpose is to restrict the possible placement ofthe object to a unique position in the image, the po-sition at which the same pattern “i, j ” occurs in theimage. Horizontal objects fall into two classes: either

Figure 7. A set of jigsaw pieces.

they have teeth which project towards the right or theyhave teeth which project towards the left. Teeth facingright in row i correspond to the value true forvi ; teethfacing left correspond to the value false.

When j = n, the tooth on the right-hand side of thesix objects in Figs. 7(a) and (b) is absent. This is simplydue to the fact that these objects will always occur onthe very right-hand side of the jigsaw and hence donot need teeth to interlock with another object on theirright.

The image is chosen so that the horizontal and ver-tical objects must be locked together to form a gridpattern. One such grid pattern is shown in Fig. 8 forthe instanceD1 ∧ D2 ∧ D3 of 3SAT, where

D1 = v3 ∨ v4 ∨ v5

D2 = ¬v1 ∨ v3 ∨ ¬v4

D3 = ¬v2 ∨ ¬v3 ∨ v4

The j th column of this image corresponds to disjunc-tion Dj , and thei th row corresponds to variablevi .

If either vi or ¬vi is one of the literals inDj , thenattached to the intersection of thei th row and thej thcolumn there is an extra square, marked by an asterisk,

Page 13: The Tractability of Segmentation and Scene Analysis

P1: GRN

International Journal of Computer Vision kl654-02-cooper October 13, 1998 16:58

The Tractability of Segmentation and Scene Analysis 39

Figure 8. A jigsaw composed of a selection of the pieces illustratedin Fig. 7.

as shown in Fig. 8. Furthermore, just above this asteriskthere is the pattern “i, j ”, as shown.

The variablevi is considered to be assigned the valuetrue if all objects in rowi of the solution to the jigsawhave right-facing teeth, and false if all objects in rowi have left-facing teeth. The vertical objects whichinterlock with these horizontal objects are shown inFigs. 7(d) and (e). The direction of their teeth havebeen chosen to ensure that objects in a given rowi faceeither all left or all right. This conserves the truth valueof vi throughout the row.

The purpose of the vertical objects in columnj is notonly to propagate the same truth value along the lengthof each row, but also to “count” the number of literalsof Dj which are false. In fact, the length of the tailwhich hangs from the bottom of the vertical object atthe intersection of rowi and columnj is at leastL + 1units whereL is the number of literalsvk or¬vk in Dj

such thatk ≤ i and such that this literal takes on thevalue false. Since, by design of the image, the verticalobject at the intersection of rown and columnj musthave a tail of length 3, the maximum number of literalsin Dj which are false is 2.

Figure 9 illustrates a solution to the jigsaw ofFig. 8. The corresponding assignment of truth values

Figure 9. A solution to the jigsaw of Fig. 8.

to v1, . . . , vn is v1 = v2= false;v3 = v4 = v5 = true,since rows 1 and 2 are comprised only of left-facingpieces and rows 3, 4, 5 only of right-facing pieces.Each disjunctionDj contains at least one literal whichtakes on the value true. Suppose thatvi (or ¬vi ) isthe first such literal inDj . Then the asterisk at theintersection of rowi and column j is accounted forby the hiorizontal object on the right of Fig. 7(a) (orFig. 7(b)).

The other two asterisks in columnj , correspondingto the other two literals ofDj , are accounted for by ver-tical objects shown in Fig. 7(d). These vertical objectshave a tail which is longer by one unit compared withthe length of the tail of the vertical object just above itin the same column. All other vertical objects have atail of the same length as the vertical object just aboveit. At most two vertical objects of the kind shown inFig. 7(d) can occur in the same columnj , since thelength of the tail can only increase from 1 to 3 unitswithin column j . This forces at least one of the literalsin Dj to be true so that at least one of the three aster-isks in columnj can be accounted for by a horizontalobject.

The shaded area of the image is accounted for bycopies of shaded objects not actually shown in Fig. 7but which are also members of the object set. These

Page 14: The Tractability of Segmentation and Scene Analysis

P1: GRN

International Journal of Computer Vision kl654-02-cooper October 13, 1998 16:58

40 Cooper

“filler” objects are of a different colour to the horizontaland vertical objects which interlock to form the gridpattern. They are such that the shaded area can alwaysbe constructed from them.

A solution to the jigsaw constructed as above froman instanceD1∧· · ·∧Dm of 3SAT provides a solutionto this instance of 3SAT. Conversely, given a solutionto this instance of 3SAT, we can construct a solution tothe corresponding jigsaw. 2

The above NP-completeness proof of JIGSAW canbe adapted so that the image and all the objects are auniform colour. The surface markings “i, j ” are re-placed by unique teeth patterns. Thus, the hardness ofthe problem is not due to the correspondence betweenobjects and image, but rather due to the constraint thatpieces interlock perfectly. This shows that JIGSAW isNP-complete even when the image is not available.

Theorem 7.5 seems to contradict the fact that jig-saw puzzles of thousands of pieces can be completedin a few hours by a child. Jigsaw puzzles are speciallydesigned so that they can be solved without backtrack-ing. In a well-designed jigsaw, when two pieces suc-cessfully interlock and there is continuity in the picturealong their join, we can be sure that these two piecesreally do fit together.

The proof of Therem 7.5 can easily be adapted toprove the following result.

Theorem 7.6. RECONSTRUCTION is NP-completeeven when scenes cannot contain cycles of mutuallyoccluding objects.

Proof: We chooseP to be any perspective projectionwhich is non-orthographic. It is sufficient to employ thesame polynomial transformation from 3SAT as in theproof of Theorem 7.5. The objects are assumed to beof non-negligible depth and to have a surface marking,such as a chessboard pattern, which allows us to deter-mine their distance from the camera given knowledgeof the perspective projectionP. This depth informa-tion constrains the transformed objects to interlock toform a solution to the jigsaw.

Any polynomial-time algorithm for RECON-STRUCTION could thus be used to solve 3SAT.2

Theorem 7.6 holds whenever we have a source ofdepth information, other than occlusion. This depthinformation may be directly available, for example ifI is a range image, or may be deduced indirectly; forexample from knowledge of the perspective projection

and the positions of light sources together with analysisof shadows in the image.

The motivation for minimising the number of regionsin a segmentation was that an interpretation consistingof fewer objects is more likely and that fewer objects inthe scene generally implies fewer regions in the image.If we have models of all the objects which may occurin the scene, then it is possible to minimise the numberof objects in the interpretation rather than the numberof regions.

Unfortunately, knowledge of all objects which maybe visible in the image is not sufficient to guaranteebeing able to find the optimal segmentation in poly-nomial time. Indeed, the problem of minimising thenumber of objects in an interpretation is NP-complete.We will show that this is true even without thedepth information which makes RECONSTRUCTIONNP-complete.

Definition 7.7. Given an imageI , a projection trans-formation P, a setM of 3D object models and a setT of 3D transformations, aminimal reconstructionofthe imageI is a reconstruction of minimal cardinalityamong valid reconstructions ofI .

MINRECONSTRUCTION: Given an imageI , a pro-jection transformationP, a setM of 3D object models,a setT of 3D transformations and a constantk, doesthere exist a reconstruction ofI of cardinality at mostk?

Theorem 7.8. MINRECONSTRUCTION is NP-complete even when transformed objects cannot formcycles and the image is formed by an orthographic pro-jection.

Proof: Firstly, observe that MINRECONSTRUC-TION is in NP, since we can verify in polynomial timewhether a putative reconstruction is valid and consistsof at mostk elements.

To complete the proof it is sufficient to consider thereconstruction of the class of images described in theproof of Theorem 7.5, each image representing an in-stance of 3SAT, and to choosek to be 2mn+ n + b,wherem is the number of disjunctions,n the num-ber of variables andb the number of objects requiredto fill the background. The solutions to this class ofjigsaw problems are exactly the minimal reconstruc-tions of the corresponding images. An image mayhave many reconstructions involving overlapping ob-jects, but by choice of the objects and the images,

Page 15: The Tractability of Segmentation and Scene Analysis

P1: GRN

International Journal of Computer Vision kl654-02-cooper October 13, 1998 16:58

The Tractability of Segmentation and Scene Analysis 41

all these alternative reconstructions have cardinalitygreater thank. Thus, any polynomial-time algorithmfor MINRECONSTRUCTION could be used to solve3SAT in polynomial time. 2

8. Conclusion

The theoretical results presented in this paper can beconsidered as a step towards determining the boundarybetween those visual tasks for which reliable and effi-cient algorithms exist and those visual tasks for whichsuch algorithms are impossible.

A segmentation of a two-dimensional image can befound in polynomial time. A segmentation involvingthe minimum number of regions can be found in poly-nomial time for a one-dimensional image, but this prob-lem is NP-complete for two-dimensional images evenwhen pixels can take on only five different brightnessvalues. An alternative formulation of the same prob-lem of approximating an image by a piecewise constantfunction while minimising the number of regions, us-ing a variational approach, has also been shown to besolvable in polynomial time in one dimension but NP-complete in two dimensions.

A sufficient condition for the reconstruction of a3D scene from a 2D image to be solvable in polyno-mial time is that no range information can be deducedfrom the image and that the set of objects which mayoccur in the scene are all known and cannot projectinto cycles of mutually occluding objects in the im-age. Both the no range information condition and theno cycles condition are necessary, since otherwise theproblem is NP-complete. Minimising the number ofobjects in an interpretation also renders the problemNP-complete.

Solving a jigsaw puzzle is NP-complete in two di-mensions, but a solution to a one-dimensional jigsawpuzzle which employs the minimum number of piecescan be found in polynomial time.

These results amply demonstrate that not all combi-natorial problems are intractable. Indeed, intractabilityresults can often point us in the right direction to findinteresting versions of the same problems which aresolvable in polynomial time.

References

Aho, A.V., Hopcroft, J.E., and Ullman, J.D., 1974.The Designand Analysis of Computer Algorithms. Addison-Wesley Publish-ing Company.

Ansari, N. and Delp, E.J., 1990. Partial shape recognition: Alandmark-based approach.IEEE Trans. Pattern Analysis and Ma-chine Intelligence, 12(5):470–483.

Ayache, N. and Faugeras, O.D. 1986. HYPER: A new-approach forthe recognition and positioning of two-dimensional objects.IEEETrans. Pattern Analysis and Machine Intelligence, 8(1):44–54.

Beaulieu, J.-M. and Goldberg, M. 1989. Hierarchy in picture seg-mentation: A stepwise optimization approach.IEEE Trans. Pat-tern Analysis and Machine Intelligence, 11(2):150–163.

Borisenko, V.I., Zlatopol’skii, A.A., and Muchnik, I.B. 1987. Imagesegmentation (state-of-the-art survey).Automation and RemoteControl, 48(7):837–879.

Cass, T.A. 1992. Polynomial-time object recognition in the pres-ence of clutter, occlusion, and uncertainty. InProc. 2nd EuropeanConference on Computer Vision, Springer-Verlag, pp. 834–842.

Clowes, M.B. 1971. On seeing things.Artificial Intelligence, 2:79–116.

Cooper, M.C. 1988. Efficient systematic analysis of occlusion.Pat-tern Recognition Letters, 7:259–264.

Cooper, M.C. 1992.Visual Occlusion and the Interpretation of Am-biguous Pictures. Ellis Horwood: Chichester, UK.

Cooper, M.C. 1997. Interpreting line drawings of curved objectswith tangential edges and surfaces.Image and Vision Computing,15:263–276.

Cooper, M.C., Cohen, D.A., and Jeavons, P.G. 1994. Characterisingtractable constraints.Artificial Intelligence, 65:347–361.

Dendris, N.D., Kalafatis, I.A., and Kirousis, L.M. 1994. An efficientparallel algorithm for geometrically characterising drawings of aclass of 3-D objects.Journal of Mathematical Imaging and Vision,4:375–387.

Eades, P. and Rappaport, D. 1993. The complexity of comput-ing minimum separating polygons.Pattern Recognition Letters,14(9):715–718.

Fu, K.S. and Mui, J.K. 1981. A survey on image segmentation.Pat-tern Recognition, 13:3–16.

Garey, M.R. and Johnson, D.S. 1979.Computers and Intractability—A Guide to the Theory of NP-Completeness. W.H. Freeman: SanFrancisco.

Gupta, L. and Malakapalli, K. 1990. Robust partial shape classifica-tion using invariant breakpoints and dynamic alignment.PatternRecognition, 23(10):1103–1111.

Haralick, R.M. and Shapiro, L.G. 1985. Survey: Image segmentationtechniques.Computer Vision, Graphics and Image Processing,(29):100–132.

Huffman, D.A. 1971. Impossible objects as nonsense sentences. InMachine Intelligence, B. Meltzer and D. Michie (Eds.), EdinburgUniversity Press, vol. 6, pp 295–323.

Kirousis, L.M. 1990. Effectively labeling planar projections of poly-hedra.IEEE Trans. on Pattern Analysis and Machine Intelligence,12(2):123–130.

Kirousis, L.M. 1993. Fast parallel constraint satisfaction.ArtificialIntelligence, 64:147–160.

Kirousis, L.M. and Papadimitriou, C.H. 1988. The complexity ofrecognising polyhedral scenes.Journal of Computer and SystemScience, 37(1):14–38.

Kube, P.R. 1991. Unbounded visual search is not both biologi-cally plausible and NP-Complete.Behavioral and Brain Sciences,4:768–770.

Lamdan, Y., Schwartz, J.T., and Wolfson, H.J. 1988. Object recog-nition by affine invariant matching. InProc. IEEE Computer Soc.Conf. Computer Vision and Pattern Recognition, pp. 335–344.

Page 16: The Tractability of Segmentation and Scene Analysis

P1: GRN

International Journal of Computer Vision kl654-02-cooper October 13, 1998 16:58

42 Cooper

Lichtenstein, D. 1982. Planar formulae and their uses.SIAM J. Com-puting, 11:329–343.

Lim, Y.M. and Lee, S.U. 1990. On the color image segmentationalgorithm based on the thresholding and the fuzzy c-means tech-niques.Pattern Recognition, 23(9):935–952.

Mumford, D. and Shah, J. 1989. Optimal approximations ofpiecewise smooth functions and associated variational problems.Communications in Pure and Applied Mathematics, 42:577–685.

Neveu, C.F., Dyer, C.R., and Chin, R.T. 1986. Two-dimensionalobject recognition using multiresolution models.Computer Vision,Graphics and Image Processing, 34:52–65.

Nitzberg, M., Mumford, D., and Shiota, T. 1991. Filtering, segmen-tation and depth.Lecture Notes in Computer Science, vol. 662,Springer-Verlag.

Pal, N.R. and Pal, S.K. 1993. A review on image segmentation tech-niques.Pattern Recognition, 26:1277–1294.

Papadimitriou, C.H. 1994.Computational Complexity. Addison-Wesley: Reading, MA.

Parodi, P. and Torre, V. 1994. On the complexity of labeling pers-pective projections of polyhedral scenes.Artificial Intelligence,70:239–276.

Pavlidas, T. 1977.Structural Pattern Recognition. Springer-Verlag.Sugihara, K. 1986.Machine Interpretation of Line Drawings. MIT

Press: Cambridge, MA.Tsotsos, J.K. 1990. Analyzing vision at the complexity level.Behav-

ioral and Brain Sciences, 13:423–469.Ullmann, J.R. 1992. Analysis of 2-D occlusion by subtracting

out. IEEE Trans. on Pattern Analysis and Machine Intelligence,14(4):485–489.