



1

Video processing for indexing and retrieval

Georges Quénot, Multimedia Information Retrieval Group

Winter School on Multimedia Search Engines

Laboratoire d'Informatique de Grenoble

October 8, 2007

2

Tutorial outline

• Media Specificity
• Segmentation
• Indexing and Retrieval
• MPEG-7
• Conclusion

3

Media Specificity

4

Media specificity

• Coherent and synchronized combination of simpler media

– Image (animated)
– Audio
– Text (closed captions, subtitles)
– Others (Virtual Reality, ...)
– Lots of formats, resolutions and qualities

• Stream aspect
– Temporal structure

• Passage retrieval from several media
• Various levels of semantic gap and uncertainties
• Contextual information


5

Animated image

• Interlaced and progressive schemes: images and fields
– Image/field size
– Image/field rate

• Compression
– Image by image: spatial redundancy
– Image sequence: spatial and temporal redundancies
– From 28 kbit/s (176 x 144 x 12.5 Hz, sequence coding, RealMedia) up to 28 Mbit/s (720 x 288 x 50 Hz, interlaced image coding, DV) or 270 Mbit/s (704 x 288 x 50 Hz, uncompressed, 10-bit D1)

• Very variable quality

6

Image interlacing

• Compromises resolution versus continuity
• Complicates image sequence processing
• Applies only to “full resolution” streams

7

Audio

• From 8 kHz up to 48 kHz sample rates
• Mono, stereo or more (5.1, ...)
• Multiple audio streams (multilingual DVDs)
• Compressed from ~5 kbit/s up to 256 kbit/s (1.41 Mbit/s uncompressed, audio CD) for stereo streams
• Very variable quality

8

Text

• Subtitles or source speech transcripts
– Good-quality text inserted as separate streams
– Accurate and relevant
– When available...

• + Closed captions
– Text actually contained in the image stream
– Hard to recover, high miss and error rates

• + Field text (text in the scene)
– Even worse...

• + Automatic Speech Recognition
– Text actually contained in the audio stream
– Significant word error rate, but usable


9

Other stream types

• Animated Artifacts streams (MPEG-4)
– Audio, text and image substreams
– May contain accurate and relevant semantic information
– Could be used for indexing and retrieval
– Currently exotic and not widely used

• Musical Instrument Digital Interface (MIDI) streams
• Speech synthesis data streams
• ...

10

Applications

• Search
– General (web, ...)
– Domain specific (medical, military, ...)

• Filtering
– Technological (or scientific or political, ...) survey
– Offending content

• Metrics
– Impact estimation for commercials (spots, logos)

• Copyright check

• Domains
– Archives, web, ...
– Conference or meeting records, ...

11

Video Segmentation

12

Video Segmentation

• Image Segmentation
– Shot segmentation
– Object segmentation and tracking
– Camera and object motion
– Key frame extraction

• Audio Segmentation
– Silence / music / noise / speech / ...
– Male / female
– High quality / telephone / ...
– Speaker / known speaker

• Sub-shot (micro-) segmentation
• Story / topic (macro-) segmentation


13

Shot segmentation

• Direct image comparison or matching
– Sum of squared differences, ...
– With or without motion compensation

• Descriptor extraction and comparison
– Color moments or histograms (a minimal sketch follows this list)
– Texture, shapes, points of interest, ...

• Other methods
– Rough contour tracking, ...

• Compressed-domain methods
– DC image comparison
– Motion vectors, ratio of forward versus backward vectors
– Bit rate variation, ratio of predicted versus intra-coded blocks
– Specific to the media encoding; fast but of moderate quality
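A minimal sketch of cut detection by histogram comparison, assuming frames come in as H x W x 3 uint8 arrays; the bin count and the L1 threshold are arbitrary illustration values, and a real detector would add motion compensation and gradual-transition handling:

```python
import numpy as np

def frame_histogram(frame, bins=16):
    """Concatenated per-channel histograms, each normalized to sum to 1."""
    hists = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    return np.concatenate(hists) / frame[..., 0].size

def detect_cuts(frames, threshold=0.4):
    """Declare a cut before frame i when the L1 histogram difference
    with frame i-1 exceeds the (data-dependent) threshold."""
    cuts, prev = [], None
    for i, frame in enumerate(frames):
        h = frame_histogram(frame)
        if prev is not None and np.abs(h - prev).sum() > threshold:
            cuts.append(i)
        prev = h
    return cuts
```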

14

Shot segmentation

• “Cuts” versus gradual transitions

• Separate methods:
– Cuts: search for discontinuities in images or descriptors
– Others: ad hoc methods, specific searches for wipes, dissolves, block changes, ...
– Plus: filtering of photographic flashes, ...
– Need for fusion of the detectors’ outputs

• Integrated multi-resolution methods:
– Filtering of descriptor time derivatives at various time scales
– Search for peaks with sophisticated filtering:
» Peak location → transition (center) location
» Scale of the highest peak → transition duration

15

Object segmentation and camera motion

• Classical still-image object segmentation (using color, texture, regions, contours, ...), plus:

• Extraction of objects moving relative to a background, and of the relative camera/background motion:
– Objects are identified as not following the background motion
– Background motion is identified by applying a (parametric) motion model to the part of the image excluding mobile objects
– This reciprocal dependency can be broken using iterative methods if the background occupies a large enough part of the images
– Several alternative motion models have to be considered
– Speed versus accuracy compromises (MPEG vectors versus full optical flow computation)

• Combination of still-image and motion-based object segmentation
• Bonus: mosaic images or 3D views of the background

16

Iterative simultaneous extraction of mobile objects and of relative camera motion

1. Start with the assumption of no mobile objects
2. Estimate motion parameters from the whole image
3. Estimate the probability of belonging to the background for each pixel
4. Re-estimate motion parameters from the pixels with a high probability of belonging to the background
5. Re-estimate the probability of belonging to the background for each pixel
6. Iterate 4-5 until convergence or a given iteration count

• No need to make any binary decision
• Provides both camera motion and background segmentation (see the reduced sketch below)
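A reduced sketch of this loop, assuming two grayscale float images and, for clarity, a purely translational model instead of a full parametric one; the Gaussian weighting stands in for the background probability of steps 3 and 5:

```python
import numpy as np

def estimate_background_motion(img0, img1, n_iter=5, sigma=4.0):
    """Weighted least-squares estimate of a global translation (dx, dy)
    between two grayscale float images, with soft background weights."""
    gy, gx = np.gradient(img0)                 # spatial gradients
    gt = img1 - img0                           # temporal difference
    A = np.stack([gx.ravel(), gy.ravel()], axis=1)
    b = -gt.ravel()
    w = np.ones(b.shape)                       # step 1: no mobile objects
    for _ in range(n_iter):
        # steps 2/4: solve A d = b in the weighted least-squares sense
        AtWA = A.T @ (w[:, None] * A)
        AtWb = A.T @ (w * b)
        d = np.linalg.solve(AtWA, AtWb)
        # steps 3/5: residual of the motion constraint -> background weight
        r = A @ d - b
        w = np.exp(-r ** 2 / (2 * sigma ** 2))
    return d, w.reshape(img0.shape)            # motion and soft background map
```

Because the weights stay soft throughout, no binary decision is ever made, matching the remark above.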


17

Background / camera motion models

7 degrees of freedom (MPEG-7 descriptors); parametric camera model, 7 parameters:

• Translation: track, boom, dolly
• Rotation: tilt, pan, roll
• Zoom or focal length

18

One- or two-level camera motion search

• Direct “3D” camera motion search, or:
• “3D” camera motion search from a previously extracted “2D” image motion model

• Parametric image motion models (a sketch applying two of them follows the table):

Geometric model | Params | x'                  | y'
Translational   | 2      | x + a               | y + b
Similitude      | 4      | ax - by + c         | bx + ay + d
Affine          | 6      | ax + by + c         | dx + ey + f
Homographic     | 8      | (ax+by+c)/(gx+hy+1) | (dx+ey+f)/(gx+hy+1)
Quadratic       | 12     | ax²+by²+cxy+dx+ey+f | gx²+hy²+ixy+jx+ky+l
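As a small illustration of the table, a sketch applying two of the models to pixel coordinates; parameter names follow the table's columns and the example values are invented:

```python
import numpy as np

def affine(p, x, y):
    """x' = ax+by+c, y' = dx+ey+f (6 parameters)."""
    a, b, c, d, e, f = p
    return a * x + b * y + c, d * x + e * y + f

def homographic(p, x, y):
    """x' = (ax+by+c)/(gx+hy+1), y' = (dx+ey+f)/(gx+hy+1) (8 parameters)."""
    a, b, c, d, e, f, g, h = p
    w = g * x + h * y + 1.0
    return (a * x + b * y + c) / w, (d * x + e * y + f) / w

# Example: a slight zoom (scale 1.05) with a small translation.
x, y = np.meshgrid(np.arange(4), np.arange(4))
xp, yp = affine([1.05, 0.0, 2.0, 0.0, 1.05, -1.0], x, y)
```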

19

Techniques for Motion Field Extraction

Method                                  | Type                                             | Description
Phase Correlation Motion Estimation     | Block-based (relatively fast)                    | Using the shift property of the Fourier transform, fairly precise motion of image blocks can be calculated with acceptable speed
Block Matching for Intensity Difference | Block-based (fast, depending on search strategy) | Motion fields extracted at the block level by finding the best matching area that minimizes the total intensity difference (squared or absolute)
Ready-made Motion Vectors (e.g. MPEG)   | Block-based (available)                          | Assuming the motion vectors represent the approximate intensity flow, they can be read directly from the MPEG stream and used
Dense Optical Flow Calculation          | Pixel or sub-pixel level (very slow)             | Exhaustive pixel correspondence search between frames, giving a dense motion field for each pixel

20

Background / camera relative motion search

• No background motion (simple check)

• Motion without parallax (two-level search):
– Rotations (3) and focal length
– Search for a homographic transform
– Mosaicing (panoramic view) and mobile objects

• Motion with parallax (one-level search):
– Rotations (3), translations (3) and focal length
– “Motion and structure from motion”: the “paraperspective decomposition” method of Poelman and Kanade (1993)
– Three-dimensional view of the background

• Irregular motion (crowd, waves, ...)


21

Camera motion, aim1mb08.mpg document, 2992-3137 sequence

22

Panoramic view, aim1mb08.mpg document, 2992-3137 sequence

23

Mobile object segmentation and tracking, aim1mb08.mpg, 1671-1715 sequence

24

Motion with parallax: tracking of feature points, aim1mb08.mpg, 2860-2879 sequence


25

Motion with parallax

26

Three-dimensional scene structure

27

Key frame extraction

• Choice of one or more representative key frames per continuous shot
• First or center frame
• Frame with low motion, following a zoom, ...
• Frame with high contrast
• Weighted combination of criteria (a toy scoring sketch follows this list)
• Panoramic view when available
• Frame showing a best view of an object (or a frontal face)
• Multiple key frame selection according to content change detection and/or important feature detection
• Maximization of inter-shot key frame dissimilarity
• Used for content display (browsing) or indexing
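A toy sketch of the weighted-criteria idea, scoring frames by high contrast minus high motion; the weights and the crude luminance and motion measures are illustration choices, not the tutorial's method:

```python
import numpy as np

def key_frame_index(frames, w_motion=1.0, w_contrast=1.0):
    """Pick the frame maximizing w_contrast * contrast - w_motion * motion."""
    grays = [f.mean(axis=2) for f in frames]             # crude luminance
    motion = [0.0] + [np.abs(g1 - g0).mean()
                      for g0, g1 in zip(grays, grays[1:])]
    contrast = [g.std() for g in grays]
    scores = [w_contrast * c - w_motion * m
              for m, c in zip(motion, contrast)]
    return int(np.argmax(scores))
```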

28

Audio Segmentation

• Silence / music / noise / speech / language / male / female / high quality / telephone / speaker / known speaker / emotion / ...

• Feature extraction: spectral analysis (MFCC, LPC, ...) on 10-20 ms windows

• Class modeling or clustering using Gaussian mixture densities (a minimal sketch follows)

• Known classes (music, speech, male, anger, ...) or classes to be built (one for each unknown speaker)
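A minimal sketch with one Gaussian mixture model per known class; librosa and scikit-learn, the 13 MFCCs, the 16 components and the file names are assumptions made for illustration:

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=16000):
    """13 MFCCs over ~20 ms windows with a 10 ms hop, one row per frame."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.020 * sr),
                                hop_length=int(0.010 * sr)).T

# Train one model per known class from labeled example files.
gmm_speech = GaussianMixture(n_components=16).fit(mfcc_frames("speech.wav"))
gmm_music = GaussianMixture(n_components=16).fit(mfcc_frames("music.wav"))

# Classify each frame of a new recording by log-likelihood.
X = mfcc_frames("unknown.wav")
is_speech = gmm_speech.score_samples(X) > gmm_music.score_samples(X)
```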


29

Micro-segmentation

• Segmenting into units shorter than a shot
• May be hierarchical

• Change in background / camera relative motion
– Start / stop of a zoom or a pan

• Change in object motion
– Start, stop or direction change of objects
– Appearance or disappearance of objects/persons

• Partial / local transitions
– Small image appearance, disappearance or change

• Speaker or topic change
• Useful for key frame selection and content analysis

30

Macro-segmentation

• Segmenting into units longer than a shot
• May be hierarchical
• Not necessarily aligned with image or audio transitions
• Generally according to semantic changes, like a switch of topic within a TV news program

• Use of various clues:
– Visual or audio jingles, black or blue frames
– Topic detection and tracking from the audio transcription
– Pattern detection from the audio transcript
– Detection of text or small image appearance or change

• Useful for determining appropriate boundaries of retrieved passages

31

Indexing and Retrieval

32

The “semantic gap” problem

[Figure: the “Léna” test image seen two ways: as semantic labels (Face, Woman, Hat, Léna, ...) and as a raw array of pixel intensity values (..., 99, 110, 125, 134, ...); what links one to the other?]


33

“Signal” and “semantic” levels

• Semantics (as opposed to signal):
– “Abstract” concepts and relations
– Symbolic representations (also signal)
– Successive levels of abstraction from the “signal / physical / concrete / objective” level to the “semantic / conceptual / symbolic / abstract / subjective” level
– Gap between the signal and semantic levels (“red” versus “700-600 nm”)
– Somewhat artificial distinction
– Intermediate levels difficult to understand
– Search at the signal level, at the semantic level, or with a combination of both

34

Content-based search

• Aspects:
– Signal: arrays of numbers (“low level”)
– Semantic: concepts or keywords (“high level”)

• Search:
– Semantic → semantic: classical for text
– Semantic → signal: which images correspond to a concept?
– Signal → signal: which images contain a part of another image?
– Signal → semantic: which concepts are associated with an image?

• Approaches:
– Bottom-up: signal → semantic
– Top-down: semantic → signal
– Combination of both

35

Document representation

• Compression: encoding and decoding
• Indexing: characterization of the contents

[Diagram: two parallel pipelines]
– Compression: Documents → Compression → Representation → Decompression → Documents* (JPEG, GIF, PNG, MJPEG, DV, MPEG-1, MPEG-2, MPEG-4)
– Indexing: Documents → Indexing → Representation → Search → Reference (MPEG-7)

36

Problems

• Choice of a representation model
• Indexing method and index organization
• Choice and implementation of the search engine
• Very high data volume
• Need for manual intervention


37

Representation models

• Semantic level:
– Keywords, word groups, concepts (thesaurus)
– Conceptual graphs (concepts and relations)

• Signal level:
– Feature vectors
– Sets of interest points

• Intermediate level:
– Transcription of the audio track
– Sets of key frames
– Mixed and structured representations, levels of detail
– Application domain specificities

• Standards (MPEG-7)

38

Indexing methods and index organization

• Build representations from document contents
• Extract features for each document or document part:
– Signal level: automatic processing
– Semantic level: more complex, manual to automatic

• Globally organize the features for the search:
– Sort, classify, weight, tabulate, format, ...

• Application domain specificities
• Problem of the quality versus cost compromise

39

Choice and implementation of the search engine

• Search for the “best correspondence” between a query and the documents

• Semantic → semantic:
– Logical, vector-space and probabilistic models
– Keywords, word groups, concepts, conceptual graphs, ...

• Signal → signal:
– Color, texture, points of interest, ...
– Images, thumbnails, pieces of images, sketches, ...

• Semantic → signal:
– Correspondence evaluated during the indexing phase (in general)

• Search with mixed queries

40

Representation and correspondence

• Representation, descriptors:
– Color
– Texture
– Points of interest
– Shape
– Motion, ...

• Associated similarity measures:
– Dependent upon the descriptor contents and the application context

• Structured organization
• Data volume and processing speed


41

Color descriptors

• Moments
• Histograms
• Color spaces
• Subjective corrections

42

Color moments

• Means:
mR = (ΣR)/N, mG = (ΣG)/N, mB = (ΣB)/N

• Means + variances + covariances:
mRR = (Σ(R−mR)²)/N, mGB = (Σ(G−mG)(B−mB))/N, ...

• Higher moments: mRRR, mRGB, ...

• Moments associated with spatial components:
mRX = (Σ(R−mR)(X−mX))/N, mRGX, mBXY, ...

• Advantages: easy and fast to compute, very compact descriptors
• Disadvantages: low selectivity and accuracy
• Can be computed on image blocks (a small sketch follows)
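A small sketch of the formulas above on an H x W x 3 array; keeping the 3 means plus the 6 distinct (co)variance terms is one option among the many listed:

```python
import numpy as np

def color_moments(img):
    """3 means (mR, mG, mB) + 6 distinct (co)variance terms (mRR, mRG, ...)."""
    rgb = img.reshape(-1, 3).astype(float)
    means = rgb.mean(axis=0)
    centered = rgb - means
    cov = centered.T @ centered / rgb.shape[0]
    return np.concatenate([means, cov[np.triu_indices(3)]])
```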

43

Histograms

• One-dimensional:
– Component by component for color images, or a single histogram for monochrome images
– The space of intensity values is decomposed into complementary intervals (bins)
– For each interval, one computes the proportion of the image pixels whose intensity falls in it:
H[k] = ( Σ ((I >= Ck) && (I < Ck+1)) )/N for the [Ck, Ck+1[ interval (bin)
– The vector of these proportions is the descriptor
– Three one-dimensional histograms for color images (the H[k] formula is written out in the sketch below)
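The H[k] formula above, written out for one channel; the bin count and intensity range are typical values, not prescribed by the slide:

```python
import numpy as np

def histogram_1d(I, n_bins=16, lo=0, hi=256):
    """H[k] = (number of pixels with C[k] <= I < C[k+1]) / N."""
    C = np.linspace(lo, hi, n_bins + 1)
    return np.array([np.mean((I >= C[k]) & (I < C[k + 1]))
                     for k in range(n_bins)])
```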

44

Histograms

• Multidimensional:
– The space of intensity values is multidimensional (generally three-dimensional)
– The space of intensity values is decomposed into complementary parallelepipeds (bins)
– For each parallelepiped, one computes the proportion of the image pixels whose intensity falls in it
– The parallelepipeds are themselves organized in a multidimensional array with the same number of dimensions: H[r][g][b]
– This array is the descriptor (it can be seen as a vector by linearization)


45

Histograms

• Can be computed on the whole image
• Can be computed by blocks:
– One (mono- or multidimensional) histogram per image block
– The descriptor is the concatenation of the histograms of the different blocks (see the sketch below)
– Typically 4 x 4 complementary blocks, but non-symmetrical and/or non-complementary choices are also possible, for instance 2 x 2 + full image center

• Size problem → only a few bins per dimension, or a lot of bins in total
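A sketch of the typical 4 x 4 block variant; the integer tiling is approximate when the image size is not divisible by the block count:

```python
import numpy as np

def block_histograms(gray, blocks=4, n_bins=16):
    """Concatenate one normalized histogram per block (blocks x blocks grid)."""
    H, W = gray.shape
    descr = []
    for by in range(blocks):
        for bx in range(blocks):
            tile = gray[by * H // blocks:(by + 1) * H // blocks,
                        bx * W // blocks:(bx + 1) * W // blocks]
            h = np.histogram(tile, bins=n_bins, range=(0, 256))[0]
            descr.append(h / tile.size)
    return np.concatenate(descr)
```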

46

Correlograms

• Parallelepipeds/bins are taken in the Cartesian product of the color space by itself: six components H(r1,g1,b1,r2,g2,b2) (or only four components if the color space is projected onto only two dimensions: H(u1,v1,u2,v2))

• Bi-color values are taken according to a distribution of the image point couples:
– At a given distance one from the other
– And/or in one or more given directions

• Allows for representing relative spatial relationships between colors (a reduced sketch follows)
• Large data volumes
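A reduced correlogram sketch for one distance and one (horizontal) direction; the quantization to 4 levels per channel is an arbitrary choice to keep the array small:

```python
import numpy as np

def correlogram(img, d=4, levels=4):
    """Joint histogram of quantized color pairs at horizontal offset d."""
    q = (img.astype(int) * levels) // 256                # quantize each channel
    code = q[..., 0] * levels * levels + q[..., 1] * levels + q[..., 2]
    c1, c2 = code[:, :-d].ravel(), code[:, d:].ravel()   # point couples
    n = levels ** 3
    H = np.zeros((n, n))
    np.add.at(H, (c1, c2), 1)
    return H / H.sum()
```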

47

Ditherization

• Objective: smooth the quantization effect associated with the large size of bins (typically 4 x 8 x 4 for RGB)

• Principle:
– Decompose the image into small blocks (typically 4 x 4 pixels)
– Shift the intensity values according to the location within the block, to spread the range of the values inside homogeneous regions
– The proportion that splits between adjacent bins then varies more continuously with the mean intensity value

Offset table for a 4 x 4 block (applied in the sketch below):
t[0][0] = -8; t[0][1] =  0; t[0][2] = -6; t[0][3] =  2;
t[1][0] =  4; t[1][1] = -4; t[1][2] =  6; t[1][3] = -2;
t[2][0] = -5; t[2][1] =  3; t[2][2] = -7; t[2][3] =  1;
t[3][0] =  7; t[3][1] = -1; t[3][2] =  5; t[3][3] = -3;
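A sketch applying the offset table above to a grayscale image before quantization; the bin count is again a typical value:

```python
import numpy as np

# The slide's 4 x 4 offset table.
T = np.array([[-8,  0, -6,  2],
              [ 4, -4,  6, -2],
              [-5,  3, -7,  1],
              [ 7, -1,  5, -3]])

def dither_then_quantize(gray, n_bins=16):
    """Shift intensities by their within-block offset, then bin them."""
    H, W = gray.shape
    offsets = np.tile(T, (H // 4 + 1, W // 4 + 1))[:H, :W]
    shifted = np.clip(gray.astype(int) + offsets, 0, 255)
    return (shifted * n_bins) // 256                 # bin index per pixel
```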

48

Normalization

• Objective: become more robust against illumination changes before extracting the descriptors

• Gain and offset normalization: enforce given mean and variance values by applying the same affine transform to all the color components

• Histogram equalization: enforce an as-flat-as-possible histogram for the luminance component by applying the same increasing and continuous function to all the color components

• Subjective normalization: enforce a normalization similar to the one performed by the human visual system: “global” and highly non-linear

A small sketch of the first two normalizations follows.
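The sketch below is one way to realize the first two normalizations; the target mean and variance, and the use of a per-pixel luminance gain to apply the equalization function to all components, are illustration choices:

```python
import numpy as np

def gain_offset(img, target_mean=128.0, target_std=48.0):
    """Same affine transform applied to all color components."""
    m, s = img.mean(), img.std() + 1e-9
    return np.clip((img - m) * (target_std / s) + target_mean, 0, 255)

def equalize_luminance(img):
    """Flatten the luminance histogram, scaling all components alike."""
    luma = img.mean(axis=2)                       # crude luminance
    hist, bins = np.histogram(luma, 256, (0, 256))
    cdf = hist.cumsum() / luma.size               # increasing, continuous map
    lut = np.interp(np.arange(256), bins[:-1], 255 * cdf)
    gain = lut[luma.astype(int)] / (luma + 1e-9)
    return np.clip(img * gain[..., None], 0, 255)
```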


49

Correspondence functions for color

• Vectors of moments:
– Euclidean distance: search for exact similarity
– Angle between vectors: search for similarity with robustness to illumination changes

• Histograms:
– Euclidean distance: search for exact similarity
– Robustness to illumination changes can only be obtained by an intensity normalization pre-processing
– Earth mover’s distance: compute the cost of transforming one histogram into another, with a flat penalty for passing from one bin to another
– Histograms by blocks: sum of the smaller block-to-block distances only (typically 8 out of 16); permits a search with only a portion of an image (see the sketch below)

• Correlograms:
– Euclidean distance, with or without intensity normalization
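A sketch of the “8 smallest of 16” rule for block histograms, assuming h1 and h2 are concatenations of 16 equal-length block histograms as built earlier:

```python
import numpy as np

def partial_block_distance(h1, h2, n_blocks=16, keep=8):
    """Sum the `keep` smallest block-to-block distances, so that a query
    can match an image sharing only part of its content."""
    d = [np.linalg.norm(a - b)
         for a, b in zip(np.array_split(h1, n_blocks),
                         np.array_split(h2, n_blocks))]
    return np.sort(d)[:keep].sum()
```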

50

Color spaces

• Linear:
– RGB: red, green, blue
– YUV: luminance, chrominances (L – red, L – blue); a reference conversion is given below
– LAB: luminance, “blue – yellow”, “green – red”

• Non-linear:
– HSV: Hue, Saturation, Value
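For reference, a common (BT.601-style) RGB-to-YUV conversion; the exact coefficients vary between standards, so treat these as one representative choice:

```latex
\begin{aligned}
Y &= 0.299\,R + 0.587\,G + 0.114\,B \\
U &\approx 0.492\,(B - Y) \\
V &\approx 0.877\,(R - Y)
\end{aligned}
```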

51

Texture descriptors

• Computed on the luminance component only
• Rather fuzzy concept
• Frequential composition or local variability
• Fourier transforms
• Gabor filters
• Neural filters
• Co-occurrence matrices
• Many possible combinations
• Feature vector
• Associated correspondence functions
• Normalization possible

52

Gabor transforms

(Circular) Gabor filter of direction θ, of wavelength λ and of extension σ, and the energy of the image through this filter. [The slide's equations are not recoverable from the extraction; a standard reconstruction follows.]
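A hedged reconstruction, assuming the standard complex form of a circular Gabor filter; the slide's own notation may have differed:

```latex
g_{\theta,\lambda,\sigma}(x,y)
  = \exp\!\left(-\frac{x^{2}+y^{2}}{2\sigma^{2}}\right)
    \exp\!\left(\frac{2\pi i}{\lambda}\,(x\cos\theta + y\sin\theta)\right),
\qquad
E(\theta,\lambda) = \sum_{x,y} \bigl|\,(I * g_{\theta,\lambda,\sigma})(x,y)\,\bigr|^{2}
```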


53

Gabor transforms

“Separable” formulation: [equation not recoverable from the extraction]

54

Gabor transforms

Linear combination coefficients: [equation not recoverable from the extraction]

55

Gabor transforms

Simplified expressions: [equation not recoverable from the extraction]

56

Gabor transforms

[Figure: geometry of the circular and elliptic Gabor filters, showing the wavelength λ, the orientation θ, and the extension σ (circular) or σλ and σθ (elliptic)]


57

Gabor transforms

• Circular:
– Scale λ, angle θ, variance σ
– σ a multiple of λ, typically σ = 1.25 λ (the “same number” of wavelengths whatever the λ value)

• Elliptic:
– Scale λ, angle θ, variances σλ and σθ
– σλ and σθ multiples of λ, typically σλ = 0.8 λ and σθ = 1.6 λ

• 2 independent variables:
– Scale λ: N values (typically 4 to 8) on a logarithmic scale (typical ratio of √2 to 2)
– Angle θ: P values (typically 8)
– N·P elements in the descriptor (a bank-of-filters sketch follows)
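A bank-of-filters sketch following these parameter choices (σ = 1.25 λ, scales in ratio √2, 8 angles); the base wavelength, the fixed 31 x 31 kernel and the use of scipy are assumptions:

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(lam, theta, size=31):
    """Circular complex Gabor kernel with sigma = 1.25 * lambda."""
    sigma = 1.25 * lam
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)
    carrier = np.exp(2j * np.pi * (x * np.cos(theta) + y * np.sin(theta)) / lam)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return envelope * carrier

def gabor_descriptor(gray, n_scales=4, n_angles=8):
    """N x P vector of filter energies (N scales x P orientations)."""
    feats = []
    for s in range(n_scales):
        lam = 4.0 * 2 ** (s / 2)          # consecutive scales in ratio sqrt(2)
        for p in range(n_angles):
            k = gabor_kernel(lam, np.pi * p / n_angles)
            resp = fftconvolve(gray, k, mode="same")
            feats.append(np.mean(np.abs(resp) ** 2))   # energy through the filter
    return np.array(feats)
```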

58

Descriptors of points of interest

• “High curvature” points or “corners”
• “Singular” points of the I[i][j] surface
• Extracted using various filters:
– Computation of the spatial derivatives at a given scale
– Convolution with derivatives of Gaussians
– Construction of invariants by an appropriate combination of these various derivatives
– Each point is selected and then represented by the set of values of these invariants

• The set of selected points of interest is topologically organized (relations between neighboring points)
• The structure is irregular and the size of the description depends upon the image contents
• Descriptions are large

59

Correspondence functions for points of interest

• Generally very complex functions
• Relaxation methods (a much-reduced sketch follows this list):
– Randomly choose a point in the description of the query image
– Compare the neighborhood of this point to all the neighborhoods of all the points of the candidate document
– Amongst those that are “close” in the sense of the spatial relations and the values of the associated attributes, do a complementary search to see if the neighboring points are also “close” in the same sense
– Propagate the correspondence between “close” points by following the point topologies in the query and candidate images
– Find the best possible global correspondence respecting these topologies and preserving close characteristics for the points in correspondence
– Globally evaluate (quantify) the quality of the correspondence
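A much-reduced sketch of the idea: nearest-descriptor matching followed by a neighborhood-consistency score; the full relaxation method propagates matches through the point topology rather than just counting local agreement:

```python
import numpy as np

def match_points(desc_q, pts_q, desc_c, pts_c, radius=20.0):
    """desc_*: (n, d) invariant vectors; pts_*: (n, 2) point positions."""
    d = np.linalg.norm(desc_q[:, None, :] - desc_c[None, :, :], axis=2)
    cand = d.argmin(axis=1)               # best candidate per query point
    score = 0
    for i, j in enumerate(cand):
        # spatial neighbors of point i in the query image
        ni = np.linalg.norm(pts_q - pts_q[i], axis=1) < radius
        # do their matches also land near the match of i?
        nj = np.linalg.norm(pts_c[cand] - pts_c[j], axis=1) < radius
        score += np.sum(ni & nj) - 1      # ignore the point itself
    return cand, score                    # matches and global quality
```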

60

Correspondence functions for points of interest

• Very costly method, both in representation volume and in computation time for the correspondence function
• But very accurate and selective
• Allows for retrieving an image from a portion of it by searching for a partial correspondence
• Can be made robust to rotations by choosing appropriate invariants
• Can be made robust to scale transforms by using multi-scale representations (even more costly)
• Usable only on small to medium image collections (~1,000-10,000 images)


61

Correspondence functions for points of interest

Example of an image pair involving a large scale change due to the use of a zoom. The scale factor between the images is 6. The common portion represents less than 17% of the image.

62

Shape descriptors

• Extraction of shapes by image processing techniques: homogeneous regions obtained by iterative growing, or segmented from motion
• Vector representation (a sequence of vectors producing a curve; the curve may be closed or not)
• Representation by parametric curves (splines)
• Representation by frequential decomposition
• Possible scale or rotation invariance (generally at the level of the correspondence function)
• Potentially several shapes in a single image

63

Correspondence functions for shapes

• Possible normalization for scale and rotation
• Search for a piece of curve within another curve (relaxation method again)
• Search for an “optimal” alignment between two vector representations
• Search for invariants in the spline parameter sets (curvature extrema, for instance)
• Search for a similar frequential composition
• Quantitative similarity measure between shapes
• Global similarity measure between images: average of the similarity measures for the best shape matches

64

Motion descriptors

• Extraction of the motion of each pixel, or of the matching between pixels of consecutive images
• Statistics on these motions:
– Global average motion: rotation, translation, zoom, ...
– Average and variance of the motion
– Distribution: histogram or texture of the motion vector field
– Segmentation of the background and the mobile objects: number, size and speed of mobile objects (or evaluation of the possibility to detect them)

• Camera motion
• Background structure (mosaicing, 3D scene)
• Description of the objects (color, shape, texture)


65

Correspondence function for motion

• Similar statistics
• Similar camera motion
• Similar background (color, shape, texture)
• Similar mobile objects (color, shape, texture)
• Euclidean distances, possibly after normalization
• Correspondence functions associated with the attributes used for the background and the segmented objects
• Global correspondence built from the various correspondences between the elements

66

Use of several types of descriptors

• Several types of descriptors: choice according to the target application or to the query type
• Several correspondence functions for each type of descriptor: choice according to the target application or to the target query type (invariances that are desired or not, for instance)
• Combination of the descriptions
• Combination of the correspondence functions
• Combination with descriptions from the semantic level

67

Fusion of representations (early)

• For all vector descriptions (of fixed size), whatever their origin
• Possibility to concatenate the various descriptors into a unique mixed descriptor → normalization problem
• Possibility to reduce the dimension of the resulting vector (and/or of each original vector) in order to keep only the most relevant information:
– Principal Component Analysis
– Neural networks
– Learning is needed (representative data and process)

• Less information, faster once learning is done
• Euclidean distance on the shortened vector (a minimal sketch follows)
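A minimal early-fusion sketch, assuming two per-image descriptor matrices and scikit-learn for the PCA; the 32 kept dimensions are an arbitrary choice (and must not exceed the number of training images or features):

```python
import numpy as np
from sklearn.decomposition import PCA

def early_fusion(color_feats, texture_feats, n_dims=32):
    """Concatenate two descriptor matrices (n_images x d_i each), normalize
    per feature, then keep the n_dims most informative PCA dimensions."""
    X = np.hstack([color_feats, texture_feats])        # unique mixed descriptor
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)  # the normalization problem
    pca = PCA(n_components=n_dims).fit(X)              # learning on representative data
    return pca, pca.transform(X)
```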

68

Fusion of the correspondence functions (late)

• Each correspondence function generally produces a quantitative value that estimates a similarity
• It is always possible to come back to the case in which the values are between 0 and 1 and represent a relevance
• In order to fuse the results from several functions, we may use (see the sketch below):
– A weighted sum
– A weighted product (weighted sum of the logarithms)
– The minimum value
– A classifier (SVM, neural network, ...)

• Problem of the choice of the weights and/or of the classifier training
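A sketch of the three parameter-free fusion rules above, on scores already mapped to [0, 1]; the classifier option is omitted:

```python
import numpy as np

def fuse(scores, weights, mode="sum"):
    """Fuse relevance scores in [0, 1] from several correspondence functions."""
    s, w = np.asarray(scores, float), np.asarray(weights, float)
    if mode == "sum":                     # weighted sum
        return float(np.sum(w * s) / np.sum(w))
    if mode == "product":                 # weighted product = weighted sum of logs
        return float(np.exp(np.sum(w * np.log(np.clip(s, 1e-9, 1.0))) / np.sum(w)))
    return float(np.min(s))               # minimum value
```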


69

Computation of the relevance

• Euclidean distance, angle between vectors
• Comparison of a query vector to all the vectors in the database (no pre-selection; see the sketch below)
• “Small” number of dimensions (< 10): clustering techniques, hierarchical search
• “Medium” number of dimensions (~10+): methods based on space partitioning
• “Large” number of dimensions (>> 10): no known method faster than a full linear scan
• Reduction of the number of dimensions by Principal Component Analysis
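A sketch of the full linear scan using the angle between vectors; nothing here is faster than O(N), in line with the remark above:

```python
import numpy as np

def rank_by_angle(query, database):
    """Full linear scan: cosine similarity (angle) of the query against
    every database vector; returns indices, most relevant first."""
    q = query / np.linalg.norm(query)
    D = database / np.linalg.norm(database, axis=1, keepdims=True)
    return np.argsort(-(D @ q))
```

70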

User interface

• Classical interface for the part of the query given at the semantic level (e.g. text input for keywords)
• Plus the possibility to define a query at the signal level:
– Query by example: one or several images or video segments, initially given or selected during relevance feedback
– Library of signal elements: colors, textures, shapes (which could be entered as sketches)
– Possibility to define a relative importance for the various signal (or semantic) features available
– Possibility to define a fusion method for the correspondence functions (sum, product, min, ...)
– The system can also make these choices by analysis of the relevance feedback
– Link between signal and semantics

71

MPEG-7

72

MPEG-7

• “Multimedia Content Description Interface”
• All levels, all applications, all domains, ...


73

MPEG-7

• Descriptors (D), Description Schemes (DS) and Description Definition Language (DDL), plus ...


75

MPEG-7 parts

1. MPEG-7 Systems - the tools that are needed to prepare MPEG-7 Descriptions for efficient transport and storage, to allow synchronization between content and descriptions, and tools related to managing and protecting intellectual property
2. MPEG-7 Description Definition Language - the language for defining new Description Schemes and perhaps eventually also new Descriptors
3. MPEG-7 Audio - the Descriptors and Description Schemes dealing with (only) Audio descriptions
4. MPEG-7 Visual - the Descriptors and Description Schemes dealing with (only) Visual descriptions
5. MPEG-7 Multimedia Description Schemes - the Descriptors and Description Schemes dealing with generic features and multimedia descriptions
6. MPEG-7 Reference Software - a software implementation of relevant parts of the MPEG-7 Standard
7. MPEG-7 Conformance - guidelines and procedures for testing conformance of MPEG-7 implementations

76

MPEG-7 Systems

• Tools that are needed to prepare MPEG-7 Descriptions for efficient transport and storage, and to allow synchronization between content and descriptions

• Tools related to managing and protecting intellectual property.

• Defines the terminal architecture and the normative interfaces


77

MPEG-7 Description Definition Language

• “... a language that allows the creation of new Description Schemes and, possibly, Descriptors. It also allows the extension and modification of existing Description Schemes.”

• XML Schema Language has been selected to provide the basis for the DDL. As a consequence of this decision, the DDL can be broken down into the following logical normative components:
– The XML Schema structural language components
– The XML Schema datatype language components
– The MPEG-7-specific extensions

78

MPEG-7 Audio

• Audio description framework (which includes the scale tree and low-level descriptors)
• Sound effect description tools
• Instrumental timbre description tools
• Spoken content description
• Uniform silence segment
• Melodic descriptors to facilitate query-by-humming

79

MPEG-7 Visual

• Color
• Texture
• Shape
• Motion
• Localization
• Others

80

MPEG-7 / Dublin Core Index Structure


81

Conclusion

82

Conclusion

• Models for content representation are available
– Both at signal and semantic levels
– How to fill them? (indexing)
– How to exploit them? (search)
– Still room for improvements

• The semantic gap is still largely open:
– Signal level: low-level, automatic, not so useful
– Semantic level: high-level, manual, slow and costly
– Two significant exceptions:
» People detection and recognition
» Speech transcription
– Need for “AI-style” content understanding techniques
– Need to manage general and domain-specific knowledge bases, world representations