Pentland, A. 1997 Content-based indexing of images and video. Phil. Trans. R. Soc. Lond. B 352, 1283-1290. doi: 10.1098/rstb.1997.0111

Content-based indexing of images and video


ALEX PENTLAND

The Media Laboratory, Massachusetts Institute of Technology, 20 Ames Street, Cambridge, MA 02139, USA ([email protected])

SUMMARY

By representing image content using probabilistic models of an object's appearance we can obtain semantics-preserving compression of the image data. Such compact representations of an image's salient features allow rapid computer searches of even large image databases. Examples are shown for databases of face images, a video of American sign language (ASL), and a video of facial expressions.

1. INTRODUCTION: THE PROBLEM

The traditional model for the search of stored images and video, both as a model of human memory and as a model for computerized search, has been to create propositional annotations that describe the content of the image, and then enter these annotations into a standard database or semantic net. The images themselves are not really part of the memory or database; they are only referenced by computer text strings or mental propositions.

The problem with this approach is that the old saying 'a picture is worth 1000 words' is an understatement. In most images there are literally hundreds of objects that could be referenced, and each imaged object has a long list of attributes. Even worse, spatial relationships are important in understanding image content, so that complete annotation of an image with n objects each with m attributes requires O(n²m²) database entries. And if we must also consider relations among images, such a memory indexing model quickly becomes intractable.

During the last few years, however, there has been a major change in thinking about image databases. The key conceptual breakthrough was the idea of content-based indexing, allowing images and audio to be searched directly instead of relying on keyword (or propositional) annotation. The Massachusetts Institute of Technology (MIT) 'Photobook' system (Pentland & Picard 1994; Pentland et al. 1994), together with efforts such as the IBM 'query-by-image-content' system (Faloutsos et al. 1994) or the ISS project in Singapore (Smoliar & Zhang 1994), demonstrated that such a search is both possible and useful. The idea has since become the focus of dozens of special journal issues, workshops, and the like. There have also been considerable commercial successes with this technology: Photobook-like technology is now available as the IBM Ultimedia Manager system and as the Virage Engine (Virage). Consequently, simple examples of content-based indexing are now widely available on a variety of platforms and as an option within several traditional database systems.

In this paper I will review the Photobook concept of content-based indexing, as developed by Professors Picard, Sclaroff, and myself and originally described in an earlier paper (Pentland et al. 1996), and give several examples drawn from recent work by my students and myself. For a full account of our research, including current papers, some computer code, and on-line demonstrations, see our web site at http://www-white.media.mit.edu/vismod.

(a) The problem: semantic indexing of image content

The problem is that to make a user- and purpose-independent image database we must annotate everything in the images and all the relations between them. Text databases avoid this problem by using strings of characters (e.g., words) that are a consistent encoding of the database's semantic content. Thus, questions about the database's semantic content can be answered by simply comparing sets of text strings. Because this search is efficient, users can search for their answers at query time rather than having to pre-annotate everything.

To accomplish the same thing for image databases, we must be able to efficiently compare the images themselves, to see if they have the same, or more generally similar, semantic content. There is, of course, a trade-off between how much work you do at input time and how much you do at query time. For instance, one could try to precompute the answers to all possible queries, so that no search would be required. Alternatively, one could search the raw images themselves, repeating all of the low-level image processing tasks for each query.

For image databases there is a compelling argument for employing a pre-purposive 'iconic' level of representation. It does not make sense to try to precompute a 'general purpose', completely symbolic representation of image content, because the number of possibly interesting geometric relations is combinatorially explosive. Consequently, the output of our precomputation must be image-like data structures where the geometric relationships remain implicit. On the other hand, it does make sense to precompute as much as is possible, because low-level image operations are so expensive.

These precomputed image primitives must play a role similar to that of letters and words in a database query sentence. The user can use them to describe 'interesting' or 'significant' visual events, and then let the computer search for instances of similar events. For instance, the user should be able to select a video clip of a lush waterfall, and be able to ask for other video sequences in which more of the same 'stuff' occurs. The computer would then examine the precomputed decomposition of the waterfall sequence, and characterize it in terms of texture-like primitives such as spatial and temporal energy. It could then search the precomputed decomposition of other video clips to find places where there is a similar distribution of primitives.

Alternatively, the user might circle a 'thing' like a person's face, and ask the computer to track that person within the video clip, or ask the computer to find other images where the same person appears. In this case the computer would characterize the person's two-dimensional image appearance in terms of primitives such as edge geometry and the distribution of normalized intensity, and then either track this configuration of features over time or search other images for similarly-arranged conjunctions of the same features.

These two types of semantic indexing, using texture-like descriptions of 'stuff' and object-like descriptions of 'things', constitute the two basic types of image search operation in our system. These two types of description seem to be fundamentally different in human vision, and correspond roughly to the distinction between mass nouns and count nouns in language. Note that both types of image query can operate on the same image primitives, e.g., the energy in different band-pass filters, but they differ in how they group these primitives for comparison. The 'stuff' comparison method pools the primitives without regard to detailed local geometry, while the 'things' method preserves local geometry.

2. SEMANTICS-PRESERVING IMAGE COMPRESSION

The ability to search at query time for instances of the same or similar image events depends on the two conditions discussed below.

(i) There must be a similarity metric for comparing objects or image properties (e.g. shape, texture, colour, object relationships) that matches human judgements of similarity. This is not to say that the computation must somehow mimic the human visual system, but rather that computer and human judgements of similarity must be generally correlated. Without this, the images that the computer finds will not be those desired by the human user.

(ii) The search must be efficient enough to be interactive. A search that requires minutes per image is simply not useful in a database with millions of images. Furthermore, interactive search speed makes it possible for users to recursively refine a search by selecting examples from the currently retrieved images and using these to initiate a new select-sort-display cycle. Thus, users can iterate a search to quickly 'zero in on' what they are looking for.

Consequently, we believe that the key to solving the image database problem is semantics-preserving image compression: compact representations that preserve essential image similarities. This concept is related to some of the 'semantic bandwidth compression' ideas put forth in the context of image compression (Picard 1992). Image coding has utilized semantics primarily through efforts to compute a compact image representation by exploiting knowledge about the content of the image. A simple example of semantic bandwidth compression is coding the people in a scene using a model specialized for people, and then using a different model to code the background.

In the image database application, compression is no longer the singular goal. Instead, it is important that the coding representation be (i) 'perceptually complete', and (ii) 'semantically meaningful'. The first criterion will typically require a measure of perceptual similarity. Measures of similarity on the coefficients of the coded representation should roughly correlate with human judgements of similarity on the original images.

The definition of 'semantically meaningful' is that the representation gives the user direct access to the parts of the image content that are important for their application. That is, it should be easy to map the coefficients that represent the image to 'control knobs' that the user finds important. For instance, if the user wishes to search among faces, it should be easy to provide control knobs that allow selection of facial expressions or selection of features such as moustaches or glasses. If the user wishes to search among textures, then it should be easy to select features such as periodicity, orientation, or roughness.

Having a semantics-preserving image compression method allows you to quickly search through a large number of images because the representations are compact. It also allows you to find those images that have perceptually similar content by simply comparing the coefficients of the compressed image code. Thus, in our view the image database problem requires development of semantics-preserving image compression methods.

(a) Algorithm design

How can one design 'semantics-preserving image compression' algorithms for particular objects or textures? The general theoretical approach is to use probabilistic modelling of low-level, two-dimensional representations of regions of the image data that correspond to objects or textures of interest.

To perform such modelling, one first transforms portions of the image into a low-dimensional coordinate system that preserves the general perceptual quality of the target object's image, and then uses standard statistical methods, such as expectation maximization of a Gaussian mixture model, to learn the range of appearance that the target exhibits in that new coordinate system. The result is a very simple, neural-net-like representation of the target class's appearance, which can be used to detect occurrences of the class, to compactly describe its appearance, and to efficiently compare different examples from the same class.
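To make the learning step concrete, here is a minimal numpy sketch of expectation maximization for a Gaussian mixture fitted to low-dimensional appearance coefficients. It is illustrative only, not the authors' code; the function names, the component count k, and the regularization constant are assumptions.

```python
# Minimal EM for a Gaussian mixture appearance model (illustrative sketch).
# `features` holds low-dimensional coefficients, one row per example image.
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)) / norm

def fit_gmm(features, k=3, iters=50, reg=1e-6):
    """Fit a k-component Gaussian mixture to N x d feature vectors by EM."""
    n, d = features.shape
    rng = np.random.default_rng(0)
    means = features[rng.choice(n, k, replace=False)]   # initialize at data points
    covs = np.array([np.cov(features.T) + reg * np.eye(d)] * k)
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        resp = np.stack([w * gaussian_pdf(features, m, c)
                         for w, m, c in zip(weights, means, covs)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and covariances
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ features) / nk[:, None]
        for j in range(k):
            diff = features - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(d)
    return weights, means, covs
```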

As different parts of the image have different characteristics, we must use a variety of representations, each tuned for a specific type of image content. For instance, to represent images of 'things', which requires preservation of detailed geometric relations, we use the Karhunen-Loève transform (KLT), which is also called principal components analysis (PCA). It is a classic mathematical result that the KLT provides an optimally-compact linear basis with respect to root mean square error for a given class of signal. For characterization of texture classes, we use an approach based on the Wold decomposition. This transform separates 'structured' and 'random' texture components, allowing extremely efficient encoding of textured regions, e.g., by using the KLT on the separated components, while preserving their perceptual qualities (Picard & Kabir 1993).
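As an illustration of the KLT/PCA step, the following sketch computes an eigenbasis from flattened, prealigned training images and uses it to encode and reconstruct. The names and the SVD route are my assumptions, not the paper's implementation.

```python
# Sketch of the KLT/PCA step, assuming images are prealigned and
# flattened into row vectors of a single N x D data matrix.
import numpy as np

def klt_basis(images, n_components):
    """Return the mean image and the top principal directions (D x n)."""
    mean = images.mean(axis=0)
    centered = images - mean
    # SVD of the centered data gives the eigenvectors of the covariance
    # matrix without forming the D x D covariance explicitly.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components].T

def encode(image, mean, basis):
    """Project an image onto the KLT basis: the compact code."""
    return basis.T @ (image - mean)

def decode(coeffs, mean, basis):
    """Reconstruct an approximation of the image from its coefficients."""
    return mean + basis @ coeffs
```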

Given several examples of a target class, Ω, in such a low-dimensional representation, it is straightforward to model the probability distribution function, p(x|Ω), of its image-level features (x), as a mixture of Gaussians, thus obtaining a low-dimensional, parametric appearance model for the target class (Moghaddam & Pentland 1995). Once the target class's probability distribution function (PDF) has been learned, we can use Bayes' rule to perform 'maximum a posteriori' (MAP) detection and recognition.
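A minimal sketch of the MAP decision itself, assuming learned log-densities for the target class and for the background, and an assumed prior (all names here are illustrative):

```python
# Hedged sketch of MAP classification with learned class densities.
# `loglik_target` and `loglik_background` stand in for the learned
# mixture densities p(x | Omega) and p(x | not-Omega).
import numpy as np

def map_decision(x, loglik_target, loglik_background, prior_target=0.5):
    """Return True when the MAP estimate assigns x to the target class."""
    log_post_t = loglik_target(x) + np.log(prior_target)
    log_post_b = loglik_background(x) + np.log(1.0 - prior_target)
    return log_post_t > log_post_b
```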

The use of parametric appearance models to characterize the PDF of an object's appearance in the image is a generalization of the idea of view-based representation, as advocated by Ullman & Basri (1991) and Poggio & Edelman (1990). As originally developed, the idea of view-based recognition was to accurately describe the spatial structure of the target object by interpolating between various views. However, in order to describe natural objects such as faces or hands, we have found it necessary to extend the notion of 'view' to include characterizing the range of geometric and feature variation, as well as the likelihoods associated with such variation.

This approach is typified by our face recognition research (Turk & Pentland 1991; Moghaddam & Pentland 1995), which uses linear combinations of eigenvectors to describe a space of target appearances, and then characterizes the PDF of the target's appearance within that space. This method has been shown to be very powerful for detection and recognition of human faces, hands, and facial expressions (Moghaddam & Pentland 1995). Other researchers have used extensions of this basic method to recognize industrial objects and household items (Murase & Nayar 1994).

(b) Comparison with other approaches

During the last few years many researchers have proposed a variety of image indexing methods, based on shape, colour, or combinations of such indices (Faloutsos et al. 1994; Smoliar & Zhang 1994). The general approach is to calculate an approximately invariant statistic, such as a colour histogram or invariants of shape moments, and use that to stratify or partition the image database. Such partitioning allows users to limit the search space when looking for a particular image, and has proven to be quite useful for small image databases.

The difference between these methods and ours is that they emphasize computing a discriminant that can reject many false matches, whereas ours can encode the image data to the accuracy required to retain 'all' of its perceptually salient aspects. Generally speaking, the coefficients these earlier efforts have produced are not sufficiently meaningful to reconstruct the perceptually salient features of the image. For instance, one cannot reconstruct an image region from its moment invariants or its colour histogram. In contrast, the models we present use coefficients which allow reconstruction. Figure 1 shows three reconstructions using appearance, shape, and texture descriptions of image content.

In our view, the problem with using invariants or discriminants is that significant semantic information is irretrievably lost. For instance, do we really want our database to think that apples, Ferraris, and tongues are 'the same' just because they have the same colour histogram? Discriminants give a way to limit search space, but do not answer 'looks like' questions except within constrained data sets. In contrast, when the coefficients provide a perceptually complete representation of the image information, then things the database thinks are 'the same' actually look the same.


Figure 1. Images reconstructed from coefficients used for database search: (a) 30 coefficients, (b) 100 coefficients, and (c) 60 coefficients. From Pentland et al. (1996).


Another important consequence of representational completeness is that we can ask a wide range of questions about the image, rather than being limited to only a few predefined questions. For instance, it requires only a few matrix multiplications per image to calculate indices such as colour histograms or moment invariants from our coefficients. The point is that if you start with a relatively complete representation, then you are not limited in the types of questions you can ask; whereas, if you start by calculating discriminants, then you are limited to queries about those particular measures only.

3. THE PHOTOBOOK SYSTEM

Photobook is a computer system that allows the user to browse large image databases quickly and efficiently, using both text annotation information in an artificial intelligence database and by having the computer search the images directly based on their content (Pentland & Picard 1994; Pentland et al. 1996). This allows people to search in a flexible and intuitive manner, using semantic categories and analogies, e.g., 'show me images with text annotations similar to those of this image but shot in Boston', or visual similarities, e.g., 'show me images that have the same general appearance as this one'.

Interactive image browsing is accomplished using a Motif interface. This interface allows the user to first select the category of images they wish to examine, e.g., pictures of white males over 40 years of age, or images of mechanic's tools, or cloth samples for curtains. Photobook then presents the user with the first screenful of these images (see figure 4); the rest of the images can be viewed by 'paging' through them one screen at a time.

Users most frequently employ Photobook by selecting one (or several) of the currently-displayed images, and asking it to sort the entire set of images in terms of their similarity to the selected image or set of images. By selecting several example images the user is providing information about the distribution of visual parameters that characterize items of interest. Photobook uses such multiple examples to make an improved estimate of the search parameters. Photobook then re-presents the images to the user, now sorted by similarity to the selected images. The select-sort-redisplay cycle typically takes less than 1 s. When searching for a particular item, users quickly scan the newly-displayed images, and initiate a new select-sort-redisplay cycle every 2 or 3 s.
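Because each database image is stored only as its compact coefficient vector, one select-sort cycle can reduce to a single vectorized distance pass, as in the sketch below. Pooling multiple examples into a mean query vector is a simplifying assumption here; Photobook's actual estimation of search parameters is richer.

```python
# Sketch of the select-sort-redisplay step (hypothetical names).
import numpy as np

def sort_by_similarity(codes, selected_indices):
    """codes: N x d coefficient matrix; selected_indices: user's picks.
    Multiple examples are pooled into a mean query vector, a crude way
    to estimate the distribution of parameters the user cares about."""
    query = codes[selected_indices].mean(axis=0)
    distances = np.linalg.norm(codes - query, axis=1)
    return np.argsort(distances)          # most similar first
```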

(a) An image database example: faces

One of the clearest examples of content-based indexing is face recognition, and the Photobook face recognition system was certified by the US Army as being extremely accurate (Phillips et al. 1997). Face recognition within this system is accomplished by first determining the PDF for face images within a low-dimensional eigenspace calculated directly from contrast-normalized image data. Knowledge of this distribution then allows the face and facial features to be precisely located, and compared along meaningful dimensions. The following gives a brief description of how this is accomplished within the Photobook system (for additional detail see Moghaddam & Pentland (1995)).

(b) Face and feature detection

The standard detection paradigm in image processing is that of normalized correlation or template matching. However, this approach is only optimal in the case of a deterministic signal embedded in white Gaussian noise. When we begin to consider a target class detection problem, e.g., finding a generic human face or a human hand in a scene, we must incorporate the underlying probability distribution of the object of interest. Subspace or eigenspace methods, such as the KLT and PCA, are particularly well-suited to such a task since they provide a compact and parametric description of the object's appearance and also automatically identify the essential components of the underlying statistical variability.

In particular, the eigenspace formulation leads to a powerful alternative to standard detection techniques such as template matching or normalized correlation. The reconstruction error, or residual, of the KLT expansion is an effective indicator of a match. The residual error is easily computed using the projection coefficients and the original signal energy. This detection strategy is equivalent to matching with a linear combination of eigentemplates and allows for a greater range of distortions in the input signal (including lighting, and moderate rotation and scale). Some of the low-order eigentemplates for a human face are shown in figure 2. In a statistical signal detection framework, the use of eigentemplates has been shown to be orders of magnitude better than standard matched filtering (Moghaddam & Pentland 1995).
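The residual computation can be sketched in a few lines; with an orthonormal eigentemplate basis, no explicit reconstruction is needed (names here are hypothetical, not the system's own):

```python
# Sketch of detection by KLT reconstruction error ("residual"), computed
# from the projection coefficients and the signal energy alone.
import numpy as np

def residual_error(window, mean, basis):
    """window: flattened m*n patch; basis: D x k orthonormal eigenvectors.
    Returns ||x - x_hat||^2, where x_hat is the projection of the
    mean-subtracted patch onto the eigentemplate subspace."""
    x = window - mean
    coeffs = basis.T @ x
    # For an orthonormal basis: ||x||^2 = ||coeffs||^2 + residual
    return float(x @ x - coeffs @ coeffs)
```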

Using this approach the target detection problem can be reformulated from the point of view of a MAP estimation problem. In particular, given the visual field, estimate the position (and scale) of the subimage which is most representative of a specific target class Ω. Computationally this is achieved by sliding an m-by-n observation window throughout the image and at each location computing the likelihood that the given observation x is an instance of the target class Ω, i.e., p(x|Ω). After this probability map is computed, the location corresponding to the highest likelihood can be selected as the MAP estimate of the target location.
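A single-scale sketch of this sliding-window search follows; the scan over scale and the efficient computation of the likelihood map are omitted, and the function names are assumptions.

```python
# Sliding-window localization sketch: score every m x n window with the
# learned likelihood p(x | Omega) and take the argmax as the MAP location.
import numpy as np

def locate_target(image, m, n, log_likelihood):
    """image: 2D array; log_likelihood: maps a flattened window to
    log p(x | Omega). Returns the (row, col) of the best window."""
    best, best_pos = -np.inf, (0, 0)
    rows, cols = image.shape
    for i in range(rows - m + 1):
        for j in range(cols - n + 1):
            score = log_likelihood(image[i:i + m, j:j + n].ravel())
            if score > best:
                best, best_pos = score, (i, j)
    return best_pos
```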


Figure 2. The first eight eigenfaces.


(c) The face processor

This MAP-based face finder has been employed as the basic building block of an automatic face recognition system. The block diagram of the face finder system is shown in figure 3. It consists of object detection and alignment, contrast normalization, and feature extraction, followed by recognition and, optionally, facial coding. Figure 3b-e illustrates the operation of the detection and alignment stage on a natural image containing a human face.

The first step in this process is illustrated in figure 3c, where the MAP estimate of the position and scale of the face is indicated by the cross-hairs and bounding box. Once these regions have been identified, the estimated scale and position are used to normalize for translation and scale, yielding a standard 'head-in-the-box' format image (figure 3d). A second feature detection stage operates at this fixed scale to estimate the position of four facial features: the left and right eyes, the tip of the nose and the centre of the mouth (figure 3e). Once the facial features have been detected, the face image is warped to align the geometry and shape of the face with that of a canonical model. Then the facial region is extracted, by applying a fixed mask, and subsequently normalized for contrast. This geometrically aligned and normalized image is then projected onto the set of eigenfaces shown in figure 2.

The projection coefficients obtained by comparison of the normalized face and the eigenfaces form a feature vector which accurately describes the appearance of the face. This feature vector can therefore be used for facial recognition, as well as for facial image coding. Figure 4 shows a typical result when using the eigenface feature vector for face recognition. The image in the upper left is the one to be recognized and the remainder are the most similar faces in the database, ranked by facial similarity left to right, and top to bottom. The top three matches in this case are images of the same person taken a month apart and at different scales. The recognition accuracy of this system, defined as the percent correct rank-one matches, is 99% (Moghaddam & Pentland 1995).
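The rank-one accuracy quoted above might be measured as in the sketch below. This version uses plain nearest-neighbour matching in eigenface coefficient space, whereas the actual system also models the PDF of within-class variation; all names are illustrative.

```python
# Sketch of the rank-one accuracy measure: a probe counts as correct when
# its nearest gallery neighbour has the same identity.
import numpy as np

def rank_one_accuracy(probe_codes, probe_ids, gallery_codes, gallery_ids):
    correct = 0
    for code, identity in zip(probe_codes, probe_ids):
        nearest = np.argmin(np.linalg.norm(gallery_codes - code, axis=1))
        correct += (gallery_ids[nearest] == identity)
    return correct / len(probe_ids)
```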

4. VIDEO DATABASE EXAMPLES: HUMAN EXPRESSION AND GESTURE

Given the rapid progress that has occurred with databases of static images, researchers are now beginning to turn their attention to the problem of video databases. Interestingly, from a practical point of view the problem of video databases is largely equivalent to the problem of interpreting human behaviour in video. That is, most video is about people, and the identity and behaviour of the people in the video are the most important elements of its content.

Again, the general theoretical approach we have taken to interpretation of video is that of MAP interpretation on low-level, two-dimensional representations of regions of the image data (Pentland 1996). As illustrated by the face database example above, the appearance of a target class Ω, e.g., the probability distribution function p(x|Ω) of its image-level features x, can be characterized by use of a low-dimensional parametric appearance model. Once such a PDF has been learned, it is straightforward to use it in a MAP estimator in order to detect and recognize target classes. Behaviour recognition is accomplished in a similar manner: these parametric appearance models are tracked over time, and their time evolution, p(x(t)|Ω), characterized probabilistically to obtain a spatio-temporal behaviour model (Pentland 1996). Incoming spatio-temporal data can then be compared to the spatio-temporal PDF of each of the various behaviour models using elastic matching methods, such as dynamic time warping (Darrell & Pentland 1993) or hidden Markov modelling (Starner & Pentland 1995).
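For concreteness, a textbook dynamic-time-warping distance between two parameter trajectories is sketched below. It stands in for, but is not, the elastic matching described by Darrell & Pentland (1993).

```python
# Minimal dynamic time warping sketch for comparing an observed sequence
# of appearance-model parameters against a stored behaviour model.
import numpy as np

def dtw_distance(seq_a, seq_b):
    """seq_a: T1 x d, seq_b: T2 x d parameter trajectories.
    Returns the cost of the best elastic alignment between them."""
    t1, t2 = len(seq_a), len(seq_b)
    cost = np.full((t1 + 1, t2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            # allow match, insertion, or deletion at each step
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[t1, t2]
```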


Figure 3. (a) The face processing system, (b) original image, (c) position and scale estimate, (d) normalized head image, (e) position of facial features.


(a) Example: gesture recognition

The first example of this approach is recognition of human hand gestures. To address this problem we modelled the person's hand motion as a Markov device, with internal states that have their own particular distribution of appearance and interstate transition probabilities. Because the internal states of a human are not directly observable, they must be determined through an indirect estimation process, using the person's movements as measurements. One efficient and robust method of accomplishing this is to use the Viterbi recognition methods developed for use with hidden Markov models (HMMs) (Rabiner & Juang 1986).
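A log-domain Viterbi decoder for one such gesture model might look like the following sketch; this is a standard implementation, not the ASL system's code (see Rabiner & Juang 1986 for the full treatment).

```python
# Log-domain Viterbi sketch: recover the most likely hidden state path
# for one gesture model, given per-frame observation log-likelihoods.
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """log_pi: initial state log-probs (S,); log_A: transition log-probs
    (S x S); log_B: observation log-likelihoods per frame (T x S)."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A   # S x S: from-state x to-state
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    # trace the best path backwards from the most likely final state
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(delta.max())
```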

This general approach is similar to that taken by the speech recognition community. The difference in our approach is that internal state is not thought of as being just words or sentences; the internal states can also be actions or even intentions. Moreover, the input is not just audio filterbanks: facial appearance, body movement, and vocal characteristics (such as pitch) are also used to infer the user's internal state. One good example that employs this approach to behaviour recognition is our system for reading American sign language (ASL).

The ASL reader is a real-time system that performs 99% accurate classification of a forty-word subset of ASL. Thad Starner is shown using this system in figure 5. The accurate classification performance of this system is particularly impressive because in ASL the hand movements are rapid and continuous, and exhibit large coarticulation effects. (For additional detail see Starner & Pentland (1995).)


Figure 4. Searching for similar faces in a database, using the Photobook image database tool (Pentland et al. 1996).

Figure 5. Real-time reading of American sign language, with Thad Starner doing the signing.


(b) Example: expression recognition

A second example is the recognition of facial expression. We have addressed this problem by learning spatio-temporal motion-energy appearance models for the face. That is, for each facial expression we have created a motion-energy appearance model that expresses the amount and direction of motion that one would expect to see at each point on the face. These simple, biologically-plausible motion-energy models can be used for expression recognition by comparing the motion energy observed for a particular face to an 'average' motion-energy model for each expression. To classify an expression one compares the observed facial motion energy with each of these models, and then picks the expression with the most similar pattern of motion.
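A sketch of this classification rule follows, assuming each expression's 'average' model is stored as a motion-energy template; the Euclidean matching score and all names are my assumptions, not the method of Essa & Pentland.

```python
# Sketch of expression classification with motion-energy templates: the
# observed motion-energy pattern is matched to the nearest stored template.
import numpy as np

def classify_expression(observed_energy, templates):
    """observed_energy: flattened motion-energy image; templates: dict
    mapping expression name -> template array of the same shape."""
    return min(templates,
               key=lambda name: np.linalg.norm(observed_energy - templates[name]))
```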

This method of expression recognition has been applied to a database of 52 image sequences of eight subjects making various expressions. In each image sequence the motion energy was measured, compared to each of the models, and the expression classified, generating the confusion matrix shown in figure 6. This figure shows just one incorrect classification, giving an overall recognition rate of 98.0%. (For additional details see Essa & Pentland (1994, 1995).)

5. CONCLUSION

The Photobook system is an interactive tool for browsing and searching images and image sequences. The key idea behind this suite of tools is semantics-preserving image compression, which reduces images to a small set of perceptually significant coefficients. Compact semantics-preserving representations of image and video content can be created by learning an appearance model of a target class, Ω, in a low-dimensional representation, such as is produced by the KLT. Once the probability distribution function p(x|Ω) of the target class has been learned, we can use Bayes' rule to perform MAP detection and recognition. The same approach can be extended to time by use of methods such as dynamic time warping and hidden Markov models.

The work described here was funded by British Telecom. I thank co-authors and collaborators Rosalind Picard, Stan Sclaroff, Irfan Essa, Fang Liu, Baback Moghaddam, Matthew Turk, Thad Starner, Bradley Horowitz, and Tom Minka for their contributions. Portions of this paper have appeared in Pentland (1996) and Pentland et al. (1996), and in the 1996 Image Understanding Workshop Proceedings (San Francisco: Morgan Kaufmann). Papers and technical reports on all aspects of this technology are available at http://www-white.media.mit.edu/vismod or by anonymous FTP from whitechapel.media.mit.edu.

REFERENCES

Darrell, T. & Pentland, A. 1993 Space-time gestures. In IEEE Conf. on Computer Vision and Pattern Recognition, New York, pp. 335-340.

Essa, I. & Pentland, A. 1994 A vision system for observing and extracting facial action parameters. In IEEE Conf. on Computer Vision and Pattern Recognition, Seattle, WA, pp. 76-83.

Essa, I. & Pentland, A. 1995 Facial expression recognition using a dynamic model and motion energy. In IEEE Int. Conf. on Computer Vision, Cambridge, MA, pp. 360-367.

Faloutsos, C., Barber, R., Equitz, W., Flickner, M., Hafner, J., Niblack, W. & Petkovic, D. 1994 Efficient and effective querying by image content. J. Intell. Inform. Syst. 3, 231-262.

Jones, M. & Poggio, T. 1995 Model-based matching of line drawings by linear combinations of prototypes. In IEEE Int. Conf. on Computer Vision, Cambridge, MA, pp. 368-375.

Moghaddam, B. & Pentland, A. 1995 Probabilistic visual learning for object detection. In IEEE Int. Conf. on Computer Vision, Cambridge, MA, pp. 786-793.

Murase, H. & Nayar, S. 1994 Visual learning and recognition of 3D objects from appearance. Int. J. Comp. Vision 14, 5-24.

Pentland, A. 1996 Smart rooms, smart clothes. Sci. Am. 274(4), 68-76.


expression:   smile   surprise   anger   disgust   raise brow
template:
smile           12        0         0        0         0
surprise         0       10         0        0         0
anger            0        0         9        0         0
disgust          0        0         1       10         0
raise brow       0        0         0        0         8
success        100%     100%       90%     100%      100%

Figure 6. Results of facial expression recognition using spatio-temporal motion energy models. Each column gives the classification results for the sequences of one expression; the rows are the motion-energy templates. This result is based on 12 image sequences of smile and 10 image sequences each of surprise, anger, disgust, and raise eyebrow. The success rate for each expression is shown in the bottom row. The overall recognition rate is 98.0%.


Pentland, A. & Picard, R. W. 1994 Video and image semantics: advanced tools for telecommunications. IEEE Multimedia 1(2), 73-75.

Pentland, A., Picard, R. W., Sclaroff, S. et al. 1996 Photobook: tools for content-based manipulation of image databases. Int. J. Comp. Vision 18(3), 233-254.

Phillips, P. J., Moon, H., Rauss, P. & Rizvi, S. A. 1997 The FERET September 1996 database and evaluation procedure. In Proc. 1st Int. Conf. on Audio- and Video-based Biometric Person Authentication, Crans-Montana, Switzerland, 12-14 March 1997.

Picard, R. W. 1992 Random field texture coding. In Soc. for Information Display Int. Symp. Digest, vol. XXXIII, pp. 685-688.

Picard, R. W. & Kabir, T. 1993 Finding similar patterns in large image databases. In Proc. ICASSP, Minneapolis, MN, vol. V, pp. 161-164.

Picard, R. W. & Liu, F. 1994 A new Wold ordering for image similarity. In Proc. ICASSP, Adelaide, Australia.

Poggio, T. & Edelman, S. 1990 A network that learns to recognize three-dimensional objects. Nature 343, 263-266.

Rabiner, L. & Juang, B. 1986 An introduction to hidden Markov models. IEEE ASSP Magazine, pp. 4-16.

Smoliar, S. & Zhang, H. 1994 Content-based video indexing and retrieval. IEEE Multimedia 1(2), 62-72.

Starner, T. & Pentland, A. 1995 Visual recognition of American sign language using hidden Markov models. In Proc. Int. Workshop on Automatic Face- and Gesture-Recognition, Zurich, Switzerland.

Turk, M. & Pentland, A. 1991 Eigenfaces for recognition. J. Cognitive Neurosci. 3(1), 71-86.

Ullman, S. & Basri, R. 1991 Recognition by linear combinations of models. IEEE Trans. Pattern Analysis and Machine Intelligence 13, 992-1006.

Virage. http://www.virage.com.
