8.2 Deriving Video Semanticspi4.informatik.uni-mannheim.de/pi4.data/content/courses/2001-ws/... · camera panning or zooming or fast object movements lead to a high ECR for successive

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-1

© W

olfgang Effelsberg,

Ralf S

teinmetz

8.2 Derivin

g V

ideo

Sem

antics

8.2.1C

ut D

etection

A very sim

ple and very basic scheme for the

investigation of video semantics is cu

t detectio

n. A

cut

denotes the border between tw

o sho

tsin a m

ovie. D

uring a shot the camera film

s continuously.

One distinguishes h

ard cu

ts and soft cu

ts(fade out,

fade in, resolve,....).

The m

ost important use of cut detection is to

decompose a video into sm

aller units which can then be

analyzed. A sam

ple application is the use of shots as the “atom

ic” units for storage and retrieval in a video archive.

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-2

© W

olfgang Effelsberg,

Ralf S

teinmetz

Cu

t Detectio

n w

ith C

olo

r Histo

gram

s

The sim

plest approach to cut detection is by means of

color histograms: If the color histogram

s of two

successive video frames i and i+

1 differ at least by a threshold value T

, a hard cut is detected.

Let H(r,g,b,i) be the value of the histogram

for a color triple (r,g,b) in a fram

e i, i.e., the number of pixels in

frame i that have color (r,g,b). A

cut is detected if and only if:

Ti

bg

rH

ib

gr

Hb

gr

≥+

−∑

,,

2))

1,

,,

()

,,

,(

(

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-3

© W

olfgang Effelsberg,

Ralf S

teinmetz

Exam

ple: C

ut D

etection

with

Gray-

scale Histo

gram

Differen

ces

fra

me

#

cut

cut

cut

dis

solv

e

intensity histogramm difference

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-4

© W

olfgang Effelsberg,

Ralf S

teinmetz

Typ

ical Detectio

n E

rrors

If we use color histogram

differences, the rate of successful detection of hard cuts is 90%

to 98% for a

typical video.

This m

ethod, however, fails w

hen the colors change significantly betw

een two adjacent fram

es even if no cut is present. F

alse hits are m

uch

mo

re com

mo

n th

an

misses.

Exam

ples

•T

he light is switched on in a room

,•

an explosion occurs,

•the cam

era pans very quickly.

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-5

© W

olfgang Effelsberg,

Ralf S

teinmetz

Cu

t Detectio

n w

ith th

e E

dg

e Ch

ang

e Ratio

Generally, the edges in the first fram

e after a cut differ largely from

the edges in the last frame before the cut.

Thus the E

CR

can be used to detect hard cuts.

Let EC

Ri be the edge change ration betw

een frame i

and frame i+1. A

cut is detected if and only if

TE

CR

i ≥

A draw

back of this approach is that it requires the com

putation of motion com

pensation for a video: fast cam

era panning or zooming or fast object m

ovements

lead to a high EC

R for successive fram

es even if there is no cut.

Object m

otion or camera operations can be

distinguished from a hard cut since they alw

ays last for several fram

es.

where T

is a threshold.

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-6

© W

olfgang Effelsberg,

Ralf S

teinmetz

Exam

ple: C

ut D

etection

with

EC

R

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-7

© W

olfgang Effelsberg,

Ralf S

teinmetz

Detectin

g S

oft C

uts

A fade betw

een successive scenes is much harder to

detect than a hard cut. One possibility is to try to detect

a characteristic behaviour of the edg

e chan

ge ratio

(EC

R) in the area of fade outs, fade ins and resolves.

Exam

ple:

During a reso

lve, the edges of the old shot disappear w

ith a constant ratio. At the sam

e time, the edges of the

new shot gradually appear. W

e observe a characteristic pattern of the E

CR

:

dissolve

sequence of frame

K

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-8

© W

olfgang Effelsberg,

Ralf S

teinmetz

Fad

e-ins an

d F

ade-o

uts (1)

In a similar fashion, fade-ins and fade-outs in a video

can be determined: W

hen fading out, the number of

pixels located on an edge will be zero in the last fram

e of the fade. W

hen fading in, the number of pixels

located on an edge is zero in the first frame.

Exam

ple:

EC

R during a fade-in and and a fade-out.

fade-infade-out

sequence of frames

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-9

© W

olfgang Effelsberg,

Ralf S

teinmetz

Co

uld

We D

etect So

ft Cu

ts With

Co

lor

Histo

gram

s?

It is obvious that soft transitions between co

lor

histo

gram

sare frequent in a video, and can have

many reasons. T

hus, it practically impossible to detect

soft cuts with color histogram

differences.

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-10

© W

olfgang Effelsberg,

Ralf S

teinmetz

8.2.2 Actio

n In

tensity

The intensity of action in a video shot is an interesting

parameter. F

or example, it can be used to distinguish

between different genres of T

V broadcasts, such as

newscasts vs. m

usic clips.

Action intensity can easily be estim

ated by means of

mo

tion

vectors:

one computes the average absolute

value of all vectors in each frame of the shot; this

includes both the motion of objects w

ithin the scene and cam

era operations. If the video was com

pressed using m

otion vectors, this physical parameter can be

computed w

ith minim

um cost.

The ed

ge ch

ang

e ratio E

CR

can also be used as an indicator for action: long and relatively static shots have a low

EC

R w

hile shots with a lot of m

otion have a high E

CR

.

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-11

© W

olfgang Effelsberg,

Ralf S

teinmetz

8.2.3D

etection

of C

amera O

peratio

ns

The notion of a cam

era operation encompasses

panning, zooming, etc. It is characteristic for cam

era operations that they affect all pixels of a fram

e in a sim

ilar, deterministic m

anner.

Exam

ple 1

When the cam

era pans, all pixels move into the sam

e direction, by the sam

e amount, betw

een two fram

es.

Exam

ple 2

When the cam

era zooms in, all pixels (except the one

in the center of the frame) m

ove from the center to the

boundaries.

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-12

© W

olfgang Effelsberg,

Ralf S

teinmetz

Alg

orith

m fo

r the D

etection

of C

amera

Op

eration

s

Alg

orith

mD

etect-Cam

era-Op

eration

s•

Use the m

otion vectors of the underlying com

pression algorithm (e.g., M

PE

G-2), or com

pute the optical flow

for a sequence of frames.

•D

etermine if the direction and the length of the

motion vectors fit to the schem

e of a pre-defined cam

era operation.

The detection of cam

era operations works w

ell when

the shot under consideration contains no or little object m

otion. When cam

era operation and object motion

appear simultaneously (w

hich is quite comm

on in practice), autom

atic detection becomes very difficult.

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-13

© W

olfgang Effelsberg,

Ralf S

teinmetz

Derivin

g a P

ano

ramic Im

age fro

m

Cam

era Mo

tion

When cam

era motion can be detected autom

atically, it is also possible to com

pute a panoramic im

age from a

panning operation.

Exam

ple

Com

putation of panoramic im

ages from a video that

was recorded by a cam

era placed on a person‘s head (S

teve Mann, M

IT M

edia Lab, 1996)

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-14

© W

olfgang Effelsberg,

Ralf S

teinmetz

Pan

oram

ic Imag

e Gen

eration

We w

ant to generate a panoramic im

age of ourenvironm

ent from a fixed position.

Parallax effect:

If the camera m

oves objects atdifferent distances appear shifted against each other.->

the camera m

ust not change position.

camera m

oves

The easy approach: a cylin

drical p

ano

rama. T

he cam

era only rotates around its vertical axis. Images are

mapped onto a virtual cylinder around the cam

era position.

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-15

© W

olfgang Effelsberg,

Ralf S

teinmetz

x

y

z1

input image

panorama

Cylin

drical P

ano

ramas (1)

Map the

image to cylindrical coordinates.

22

/

)/

arctan(

zx

yv

zx

�

+= =

�v

Note that the focal-length z has to be know

n for the

coordinate transform.

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-16

© W

olfgang Effelsberg,

Ralf S

teinmetz

Cylin

drical P

ano

ramas (2)

x

y

�

v

Image exam

ple

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-17

© W

olfgang Effelsberg,

Ralf S

teinmetz

Cylin

drical P

ano

ramas (3)

Transform

ed images have to be aligned by m

atching the im

age content in the overlapping areas.

∑+

+−

=y

xy

xy

xt

yt

xg

yx

ft

tE

rr,

2|),

()

,(

|)

,(

)(

)(

42

2N

OM

NO

≈

We conclude that w

e need faster algorithms.

The m

inimum

can be found by an exhaustive search over ( tx ;ty ). H

owever, com

putational complexity for

image size

NxN

pixels and search rangeM

xMpixels is:

Find translation vector ( tx ;ty ) by m

inimizing the m

atching error betw

een image f and im

age g:

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-18

© W

olfgang Effelsberg,

Ralf S

teinmetz

Cylin

drical P

ano

ramas (4)

Pel-R

ecursive M

otio

n E

stimatio

n:

For sim

plicity consider the one-dim

ensional case. Let g(x) be an exact copy of f(x), displaced by a constant t.

f(x)

g(x)=f(x+

t)

tf(x0)-g(x0)

x

brightnessWith the spatialderivative of f, w

e obtain t as:x fd d

)(

d d)

()

(

0

00

xx f

xf

xg

t−

=

How

ever, this only works for sm

all displacements t.

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-19

© W

olfgang Effelsberg,

Ralf S

teinmetz

Cylin

drical P

ano

ramas (5)

Hierarch

ical Mo

tion

Estim

ation

Alg

orith

m•

Scale the original im

age down by a constant factor to

build a pyramid of the im

age at different resolutions.•

Do m

otion estimation on the low

est resolution layer.•

Scale the obtained m

otion model upw

ard to the next resolution level and refine the m

otion model.

scale + refinement

scale + refinement

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-20

© W

olfgang Effelsberg,

Ralf S

teinmetz

Cylin

drical P

ano

ramas (6)

Exam

ple result of a cylindrical panorama:

optical axisvirtual im

age plane

horizontal image axis

Disad

vantag

es of cylin

drical p

ano

ramas

•D

istortion of straight lines to curved lines•

Focal length (zoom

) has to be known or estim

ated•

Rotation axis has to be perpendicular to the optical axis

and to the horizontal image axis:

•S

tanding on a hill and looking down is not possible

•C

amera m

ay not be rotated around horizontal axis (but m

athematical extension to spherical

panoramas

is possible).

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-21

© W

olfgang Effelsberg,

Ralf S

teinmetz

Fu

ll-Persp

ective Pan

oram

as (1)

Assum

e that your environment is painted on a single,

large plane (environment painted on a glass plane).

Describe the m

otion that the solid plane can perform in

space.

+

=

y x

t t

y x

aa

aa

y x

2221

1211

' '

1'

21

1211

++

++

=y

bx

b

ty

ax

ax

x

1'

21

2221

++

++

=y

bx

b

ty

ax

ay

y

rotate, scaletranslate

3D:

perspective transformation (affine

+ 2 additional

parameters for perspective projection)

2D:

affinem

otion (translation, rotation, scaling)

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-22

© W

olfgang Effelsberg,

Ralf S

teinmetz

Fu

ll-Persp

ective Pan

oram

as (2)

Think of our cam

era taking images of the glass plane

from different orientations. T

o align the images, the

relative transformation betw

een the images has to be

estimated.

Exhaustive search for the param

eters is notpossible because the param

eter space is very large (8 param

eters).

Possible approach for p

arameter estim

ation

:•

Find an estim

ate with a low

er-dimensional m

odel (e.g., translatorial only to coarsely com

pensate m

otion)•

Determ

ine an initial estimate for the perspective

model (see next slide)

•U

se the gradient-descent technique for fine alignm

ent.

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-23

© W

olfgang Effelsberg,

Ralf S

teinmetz

Fu

ll Persp

ective Pan

oram

as (3)

Determ

ine in

itial estimatio

nof the perspective m

odel:

1.S

earch for four blocks with characteristic features.

2.E

ven though we have com

pensated translatorial m

otion the features will not m

atch perfectly. Refine

matching positions of the blocks in the second im

age by block-m

atching.In general, the motion vectors w

ill be different for the four blocks.

3.T

hese four motion vectors (w

ith two param

eters each) can be used to determ

ine the eigthparam

eters of the perspective m

otion model (eight equations for eight

unknowns).

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-24

© W

olfgang Effelsberg,

Ralf S

teinmetz

Fu

ll-Persp

ective Pan

oram

as (4)

Fin

ealig

nm

ent

1.S

tart with the param

eter vector estimated in last step:

2

,

|)'

,'(

),

(|

)(

∑−

=y

x

yx

gy

xf

pE

rr&

)(

1i

ii

pE

rrp

p&

&&

∇−

=+

);

;;

;;

;;

(2

122

2112

110

bb

tt

aa

aa

py

x=

initial estim

ate

error fu

nctio

n (h

ardto

calculate)

0p &

2.C

onsider the matching-error depending on the

transformation param

eters:

3.D

o a gradient descent search to find a better match.

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-25

© W

olfgang Effelsberg,

Ralf S

teinmetz

Fu

ll-Persp

ective Pan

oram

as (5)

Imag

e examp

le

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-26

© W

olfgang Effelsberg,

Ralf S

teinmetz

8.2.4T

ext Detectio

n

Go

alE

xtraction of text appearing in a video. Motivation: text

is rich in semantics.

We distinguish betw

een artificially generated text (such as a m

ovie title) and text appearing in a scene.

Alg

orith

m•

Detect regions of the fram

e containing text, •

cut them out,

•run an O

CR

algorithm (optical character

recognition.

Ch

aracteristics of g

enerated

text in a vid

eo•

monochrom

e•

rigid•

in the foreground•

the letters have a minim

um and m

aximum

size •

either stationary or moving linearly

•high contrast to background

•appears repeatedly in successive fram

es•

letters appear in groups.

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-27

© W

olfgang Effelsberg,

Ralf S

teinmetz

Text S

egm

entatio

n (1)

Text is m

on

och

rom

e

orig

inal fram

e of a vid

eoseg

men

tation

into

regio

ns

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-28

© W

olfgang Effelsberg,

Ralf S

teinmetz

Text S

egm

entatio

n (2)

Apply thresholds for the size of the letters and the

contrast.

previo

us resu

ltafter

thresh

old

ing

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-29

© W

olfgang Effelsberg,

Ralf S

teinmetz

Text S

egm

entatio

n (3)

The text m

ust be either stationary or moving linearly

(here: horizontally).

after app

lication

of

the ru

le of h

orizo

ntal

mo

tion

previo

us resu

lt

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-30

© W

olfgang Effelsberg,

Ralf S

teinmetz

Exp

erimen

tal Resu

lts

test video

n

um

ber

of fram

es n

um

ber o

f letters

successfu

l fin

din

gs

title sequ

ences

7372 6423

99%

advertisem

ents

6858 1065

99%

new

scast 18624

1411 97%

(from the dissertation of R

ainer Lienhart, Praktische

Informatik

IV, U

. Mannheim

)

Detection of text regions in a video

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-31

© W

olfgang Effelsberg,

Ralf S

teinmetz

8.2.5F

ace Detectio

n

Go

alD

etection of areas in a video frame that show

the frontal view

of a human face.

Ap

pro

ach•

Construct a neural netw

ork for the detection of texture

•T

rain this network w

ith thousands of faces, where

the line between the eyes as w

ell as an orthogonal line tow

ards the tip of the nose are marked

•P

rocess an unknown im

age with search areas in

varying sizes:-

pre-processing / filtering / standardization of the illum

ination-

test on the hypothesis “face” with the trained

neural network.

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-32

© W

olfgang Effelsberg,

Ralf S

teinmetz

Alg

orith

m ”F

ace Detectio

n”

Pyram

id o

f the

inp

ut im

age

Search

area (20x20 p

ixels)S

tand

ardizatio

n

of lu

min

ance

Stan

dard

ization

o

f histo

gram

subsampling

Pre-p

rocessin

g

Stan

dard

ization

o

f histo

gram

Neu

ron

al netw

ork

Inp

ut in

to

the n

etwo

rk

20 x 20p

ixels

Recep

tive field

s

Layer o

f h

idd

en

neu

ron

s

Resu

lt of

the n

etwo

rk

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-33

© W

olfgang Effelsberg,

Ralf S

teinmetz

Mu

ltiple D

etection

A G

raduate Course on

Multim

edia Technology

8. Autom

atic Content A

nalysis8.2-34

© W

olfgang Effelsberg,

Ralf S

teinmetz

Visu

alization

of R

esults

Documents

8.2 Deriving Video Semanticspi4.informatik.uni-mannheim.de/pi4.data/content/courses/2001-ws/... · camera panning or zooming or fast object movements lead to a high ECR for successive