Advanced topics in databases V. Megalooikonomou Generic Multimedia Indexing (slides are based on notes by C. Faloutsos)


Page 1: Advanced topics in databases

Advanced topics in databases

V. Megalooikonomou

Generic Multimedia Indexing

(slides are based on notes by C. Faloutsos)

Page 2: Advanced topics in databases

General Overview

Multimedia Indexing
  Spatial Access Methods (SAMs)
    k-d trees
    Point Quadtrees
    MX-Quadtree
    z-ordering
    R-trees
  Generic Multimedia Indexing

Page 3: Advanced topics in databases

Multimedia Indexing – Detailed outline

Generic Multimedia Indexing
  Problem definition
  Distance function
  Similarity queries – Types
  Requirements (ideal method)
  Basic idea, Lower-bounding
  GEMINI approach
  Applications
    1-D Time sequences
    2-D Color images

Page 4: Advanced topics in databases

Generic Multimedia Indexing - problem

Given a database of multimedia objects
Design fast search algorithms that locate objects that match a query object, exactly or approximately
Objects:
  1-d time sequences
  Digitized voice or music
  2-d color images
  2-d or 3-d gray scale medical images
  Video clips
E.g.: “Find companies whose stock prices move similarly”

Page 6: Advanced topics in databases

Generic Multimedia Indexing - problem

1st step: provide a measure for the distance between two objects
Distance function D():
  Given two objects O_A, O_B, the distance (= dis-similarity) of the two objects is denoted by D(O_A, O_B)
  E.g., Euclidean distance (square root of the sum of squared differences) of two equal-length time series
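The Euclidean distance named on this slide can be sketched in a few lines. This is only an illustration; the function name `euclidean_distance` and the sample values are assumptions, not part of the original notes.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Square root of the sum of squared differences of two equal-length series."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two toy 3-point series that differ only in the last value (by 4).
d = euclidean_distance([1.0, 2.0, 3.0], [1.0, 2.0, 7.0])
print(d)  # prints 4.0
```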

Page 8: Advanced topics in databases

Types of Similarity Queries

Similarity queries are classified into:
  Whole match queries:
    Given a collection of N objects O_1, …, O_N and a query object Q, find the data objects that are within distance ε from Q
  Sub-pattern match:
    Given a collection of N objects O_1, …, O_N, a query (sub-)object Q, and a tolerance ε, identify the parts of the data objects that match the query Q

[Figure: time sequences S1, …, Sn (day 1 to 365) mapped to points F(S1), …, F(Sn) in a feature space with axes avg and std]
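A minimal sketch of a sub-pattern match by sequential scanning, assuming time series as objects and Euclidean distance: every window of the data series that has the length of the query is compared against the query. The names `subpattern_matches` and `eps` (the tolerance) are hypothetical.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def subpattern_matches(series, q, eps):
    """Offsets where the window series[off:off + len(q)] is within eps of q."""
    m = len(q)
    return [off for off in range(len(series) - m + 1)
            if dist(series[off:off + m], q) <= eps]


series = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
query = [2.0, 3.0, 2.0]
offsets = subpattern_matches(series, query, eps=0.5)
print(offsets)  # prints [2]
```

A whole match query is the special case where the query and the data objects have the same length, so there is only one window per object.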

Page 12: Advanced topics in databases

Types of Similarity Queries

Additional types of queries:
  K-Nearest Neighbor queries:
    Given a collection of N objects O_1, …, O_N and a query object Q, find the K most similar data objects to Q
  All pairs queries (or ‘spatial joins’):
    Given a collection of N objects O_1, …, O_N, find all pairs of objects that are within distance ε from each other

[Figure: time sequences S1, …, Sn (day 1 to 365) mapped to points F(S1), …, F(Sn) in a feature space with axes avg and std]
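The query types can be written down as sequential scans, which is the baseline that the indexing methods in these notes try to beat. This is a sketch under the assumption of vector objects and Euclidean distance; all function names are illustrative.

```python
import math
from itertools import combinations


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def range_query(objects, q, eps):
    """Whole match: indices of the objects within distance eps of q."""
    return [i for i, o in enumerate(objects) if dist(o, q) <= eps]


def knn_query(objects, q, k):
    """K-nearest-neighbor: indices of the k objects closest to q."""
    order = sorted(range(len(objects)), key=lambda i: dist(objects[i], q))
    return order[:k]


def all_pairs_query(objects, eps):
    """All pairs ('spatial join'): index pairs within distance eps of each other."""
    return [(i, j) for i, j in combinations(range(len(objects)), 2)
            if dist(objects[i], objects[j]) <= eps]


db = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.0, 2.0]]
in_range = range_query(db, [0.0, 0.0], 1.5)
nearest = knn_query(db, [0.0, 0.0], 2)
pairs = all_pairs_query(db, 1.5)
print(in_range, nearest, pairs)  # prints [0, 1] [0, 1] [(0, 1)]
```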

Page 15: Advanced topics in databases

Ideal method – requirements

Fast: sequential scanning, with a distance calculation against each and every object, is too slow for large databases
“Correct”: no false dismissals. False alarms are acceptable. Why? (A post-processing step that computes the actual distance D() can discard them.)
Small space overhead
Dynamic: easy to insert, delete, and update objects

Page 16: Advanced topics in databases

Approach Outline

Use k feature extraction functions to map objects into k-dimensional space (applying a mapping F())

Use highly fine-tuned database SAMs (Spatial Access Methods) like R-trees to accelerate the search (by pruning out large portions of the database that are not promising)…

Page 20: Advanced topics in databases

Basic idea

Focus on ‘whole match’ queries
  Given a collection of N objects O_1, …, O_N, a distance/dis-similarity function D(O_i, O_j), and a query object Q, find the data objects that are within distance ε from Q
Sequential scanning? May be too slow, for the following reasons:
  Distance computation is expensive (e.g., edit distance in DNA strings)
  The database size N may be huge
Faster alternative?

Page 21: Advanced topics in databases

Basic idea

Faster alternative:
  Step 1: a ‘quick and dirty’ test to quickly discard the vast majority of non-qualifying objects
  Step 2: use of SAMs to achieve faster-than-sequential searching
Example:
  Database of yearly stock price movements
  Euclidean distance function (n is the sequence length):

  $$D(S, Q) = \left( \sum_{i=1}^{n} (S[i] - Q[i])^2 \right)^{1/2}$$

  Characterize each sequence with a single number (‘feature’)
  Or use two or more features

Page 22: Advanced topics in databases

Basic idea - illustration

A query with tolerance ε becomes a sphere with radius ε

[Figure: time sequences S1, …, Sn (day 1 to 365) mapped to points F(S1), …, F(Sn) in a 2-d feature space with axes Feature1 and Feature2]

Page 24: Advanced topics in databases

Basic idea – caution!

The mapping F() from objects to k-d points should not distort the distances
  D(): distance of two objects
  Df(): distance of the corresponding feature vectors
Ideally, perfect preservation of distances
In practice, a guarantee of no false dismissals
How? If the distance in f-space matches or underestimates the distance between two objects in the original space

Page 25: Advanced topics in databases

Basic idea – Lower bounding

Let O_1, O_2 be two objects with distance function D(), and let F(O_1), F(O_2) be their feature vectors with distance function Df().

To guarantee no false dismissals for whole match queries, the feature extraction function F() should satisfy:

Df(F(O_1), F(O_2)) ≤ D(O_1, O_2)

for every pair of objects O_1, O_2

Page 28: Advanced topics in databases

Lower bounding - Proof

Let Q be the query object, O the qualifying object, and ε the tolerance.
Prove: if object O qualifies, it will be retrieved by a range query in the f-space
  Or: D(Q, O) ≤ ε ⇒ Df(F(Q), F(O)) ≤ ε
  However, Df(F(Q), F(O)) ≤ D(Q, O) ≤ ε
What about ‘all-pairs’ queries? (‘spatial join’ on f-space)
What about ‘nearest-neighbor’ queries?
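One common way to answer nearest-neighbor queries with a lower-bounding feature distance is sketched below; it is offered as an illustration, not necessarily the answer the lecture gave. The idea: take the K nearest neighbors in f-space, use the largest of their actual distances as a tolerance, run a range query in f-space with that tolerance, and refine with the actual distance. The feature used here (mean scaled by sqrt(n)) and all names are assumptions.

```python
import heapq
import math
import random


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def feature(series):
    """Hypothetical 1-d feature: sqrt(n) * mean, which lower-bounds dist()."""
    return [sum(series) / math.sqrt(len(series))]


def knn_lower_bound(objects, q, k):
    fq = feature(q)
    fdist = [dist(feature(o), fq) for o in objects]
    # Step 1: k nearest neighbors in feature space (cheap distances only).
    first = heapq.nsmallest(k, range(len(objects)), key=lambda i: fdist[i])
    # Step 2: at least k objects lie within eps, so the true k-NN do too.
    eps = max(dist(objects[i], q) for i in first)
    # Step 3: range query in feature space; lower bounding means no false
    # dismissals (the tiny slack only guards against rounding).
    candidates = [i for i in range(len(objects)) if fdist[i] <= eps + 1e-9]
    # Step 4: refine the candidates with the actual distance.
    return heapq.nsmallest(k, candidates, key=lambda i: dist(objects[i], q))


rng = random.Random(0)
db = [[rng.uniform(0, 10) for _ in range(16)] for _ in range(200)]
q = [rng.uniform(0, 10) for _ in range(16)]
result = knn_lower_bound(db, q, 5)
brute = sorted(range(len(db)), key=lambda i: dist(db[i], q))[:5]
print(result == brute)
```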

Page 30: Advanced topics in databases

GEneric Multimedia object INdexIng

GEMINI approach:
  1. Determine distance function D()
  2. Find one or more numerical feature-extraction functions (to provide a ‘quick and dirty’ test)
  3. Prove that Df() lower-bounds D() to guarantee no false dismissals
  4. Use a SAM (e.g., R-tree) to store and retrieve k-d feature vectors
!!! The methodology focuses on the speed of search only; not on the quality of the results, which relies on the distance function
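The four steps can be sketched as a small filter-and-refine class. This is a simplified illustration: a plain Python list stands in for the SAM (a real system would use an R-tree or similar), and the class name `GeminiIndex`, the feature `scaled_mean`, and the test data are all assumptions.

```python
import math
import random


class GeminiIndex:
    """Filter-and-refine range search; a plain list stands in for the SAM."""

    def __init__(self, D, F, Df):
        self.D, self.F, self.Df = D, F, Df
        self.objects = []   # original objects
        self.features = []  # their k-d feature vectors

    def insert(self, obj):
        self.objects.append(obj)
        self.features.append(self.F(obj))

    def range_query(self, q, eps):
        fq = self.F(q)
        # Filter: cheap test in feature space (an R-tree would prune here).
        candidates = [i for i, f in enumerate(self.features)
                      if self.Df(f, fq) <= eps]
        # Refine: discard false alarms with the actual distance.
        hits = [i for i in candidates if self.D(self.objects[i], q) <= eps]
        return hits, len(candidates)


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def scaled_mean(series):
    # sqrt(n) * mean; its absolute difference lower-bounds euclid().
    return [sum(series) / math.sqrt(len(series))]


rng = random.Random(1)
index = GeminiIndex(D=euclid, F=scaled_mean, Df=euclid)
for _ in range(500):
    index.insert([rng.gauss(0, 1) for _ in range(32)])
query = [rng.gauss(0, 1) for _ in range(32)]
eps = 7.0
hits, n_candidates = index.range_query(query, eps)
scan = [i for i, o in enumerate(index.objects) if euclid(o, query) <= eps]
print(len(hits), n_candidates, hits == scan)
```

Because the feature distance never exceeds the actual distance, the filter step can only produce false alarms, so the refined result equals that of a full sequential scan.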

Page 31: Advanced topics in databases

Generic Multimedia Object Indexing

Applications:
  1-d time sequences
  2-d color images
Problems to solve:
  How to apply the lower-bounding lemma
  ‘Curse of Dimensionality’ (time sequences)
  ‘Cross-talk’ of features (color images)

Page 35: Advanced topics in databases

1-D Time Sequences

Distance function: Euclidean distance
Find features that:
  Preserve/lower-bound the distance
  Carry as much information as possible (reduce false alarms)
If we are allowed to use only one feature, what would this be? The average.
… extending it…
  The average of the 1st half, of the 2nd half, of the 1st quarter, etc.
  Coefficients of the Fourier transform (DFT), wavelet transform, etc.
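A sketch of the "averages" idea, assuming equal-length segments: each segment mean is scaled by the square root of the segment length, which is what makes the Euclidean distance between feature vectors a lower bound of the actual distance (by the Cauchy-Schwarz inequality). The scaling and the function name `segment_mean_features` are assumptions added for this illustration.

```python
import math
import random


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def segment_mean_features(series, segments):
    """Means of equal-length segments, each scaled by sqrt(segment length)."""
    n = len(series)
    if n % segments != 0:
        raise ValueError("length must be divisible by the number of segments")
    m = n // segments
    return [math.sqrt(m) * sum(series[s * m:(s + 1) * m]) / m
            for s in range(segments)]


rng = random.Random(2)
x = [rng.uniform(0, 100) for _ in range(64)]
y = [rng.uniform(0, 100) for _ in range(64)]
actual = euclid(x, y)
# 1 segment = overall average, 2 = halves, 4 = quarters, 8 = eighths.
bounds = [euclid(segment_mean_features(x, s), segment_mean_features(y, s))
          for s in (1, 2, 4, 8)]
print(actual, bounds)
```

With nested segments the bound can only get tighter as more features are used, which means fewer false alarms at the cost of a higher-dimensional index.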

Page 37: Advanced topics in databases

1-D Time Sequences

Show that the distance in feature space lower-bounds the actual distance
What about DFT?
  Parseval’s Theorem: DFT preserves the energy of the signal as well as the distances between two signals.
    D(x, y) = D(X, Y), where X and Y are the Fourier transforms of x and y
  If we keep the first k ≤ n coefficients of DFT we lower-bound the actual distance:

  $$D_f(F(x), F(y))^2 = \sum_{f=0}^{k-1} |X_f - Y_f|^2 \le \sum_{f=0}^{n-1} |X_f - Y_f|^2 = \sum_{i=0}^{n-1} |x_i - y_i|^2 = D(x, y)^2$$
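The lower-bounding property of truncated DFT coefficients can be checked numerically. The sketch below uses a naive O(n^2) DFT with 1/sqrt(n) normalization, an assumption chosen so that Parseval's theorem holds without an extra constant; the random test signals are made up.

```python
import cmath
import math
import random


def dft(x):
    """Naive O(n^2) DFT with 1/sqrt(n) normalization (unitary transform)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            / math.sqrt(n) for f in range(n)]


def dist(a, b):
    # abs() handles both real samples and complex coefficients.
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(a, b)))


rng = random.Random(3)
x = [rng.uniform(-1, 1) for _ in range(32)]
y = [rng.uniform(-1, 1) for _ in range(32)]
X, Y = dft(x), dft(y)
d_time = dist(x, y)                      # D(x, y)
d_freq = dist(X, Y)                      # D(X, Y), equal by Parseval
d_feat = [dist(X[:k], Y[:k]) for k in (1, 2, 4, 8, 32)]  # first k coefficients
print(d_time, d_freq, d_feat)
```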

Page 38: Advanced topics in databases

1-D Time Sequences

Response time improves as the transform concentrates more of the energy of the signal
DFT concentrates the energy for a large class of signals, the colored noises
Colored noises: skewed energy spectrum that drops as O(f^{-b})
  Energy spectrum (or power spectrum) of a signal is the square of the amplitude |X_f| as a function of the frequency f
  b = 2: random walks or brown noise (very predictable)
  b > 2: black noises
  b = 1: pink noise
  b = 0: white noise (completely unpredictable)
Colored noises appear even in images (photographs)

Page 40: Advanced topics in databases

2-D color images

Image features for Content Based Image Retrieval (CBIR):
  Low Level:
    Color – color histograms
    Texture – directionality, granularity, contrast
    Shape – turning angle, moments of inertia, pattern spectrum
    Position – 2D strings method
    …etc
  Object Level:
    Regions

Page 41: Advanced topics in databases

2-D color images – Color histograms

Each color image – a 2-d array of pixels
Each pixel – 3 color components (R,G,B)
h colors – each color denoting a point in 3-d color space (as high as 2^24 colors)
For each image compute the h-element color histogram – each component is the percentage of pixels that are most similar to that color
The histogram of image I is defined as:
  For a color C_i, H_{C_i}(I) represents the number of pixels of color C_i in image I
  OR: For any pixel in image I, H_{C_i}(I) represents the probability of that pixel having color C_i.
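A minimal sketch of a normalized color histogram, assuming pixels are given as a flat list of (R, G, B) tuples with 8-bit components and that each channel is cut into `levels` equal ranges (so h = levels^3 bins). This uniform binning is an assumption for illustration; real systems often cluster colors instead.

```python
from collections import Counter


def color_histogram(pixels, levels=2):
    """Fraction of pixels per color bin; h = levels**3 bins in total."""
    counts = Counter(
        (r * levels // 256, g * levels // 256, b * levels // 256)
        for r, g, b in pixels
    )
    hist = [0.0] * (levels ** 3)
    for (rb, gb, bb), c in counts.items():
        # Bin index: r_bin * levels^2 + g_bin * levels + b_bin.
        hist[(rb * levels + gb) * levels + bb] = c / len(pixels)
    return hist


# Four toy pixels: two reddish, one blue, one green.
image = [(255, 0, 0), (250, 10, 5), (0, 0, 255), (10, 200, 30)]
hist = color_histogram(image)
print(hist)  # prints [0.0, 0.25, 0.25, 0.0, 0.5, 0.0, 0.0, 0.0]
```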

Page 42: Advanced topics in databases

2-D color images – Color histograms

Usually cluster similar colors together and choose one representative color for each ‘color bin’

Most commercial CBIR systems include color histogram as one of the features (e.g., QBIC of IBM)

No spatial information (the histogram ignores where the colors occur in the image)

Page 43: Advanced topics in databases

Color histograms - distance

One method to measure the distance between two histograms x and y is:

$$d_h^2(x, y) = (x - y)^t A (x - y) = \sum_{i=1}^{h} \sum_{j=1}^{h} a_{ij} (x_i - y_i)(x_j - y_j)$$

where the color-to-color similarity matrix A has entries a_ij that describe the similarity between color i and color j
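The quadratic-form histogram distance can be sketched directly from its definition. The 3-bin similarity matrix below is made up for illustration (bright red and pink are treated as similar, blue as unrelated); only the formula itself comes from the slide.

```python
import math


def quadratic_form_distance(x, y, A):
    """sqrt((x - y)^t A (x - y)) for histograms x, y and similarity matrix A."""
    d = [xi - yi for xi, yi in zip(x, y)]
    total = sum(A[i][j] * d[i] * d[j]
                for i in range(len(d)) for j in range(len(d)))
    return math.sqrt(max(total, 0.0))  # clamp tiny negative rounding errors


# Toy bins: 0 = bright red, 1 = pink, 2 = blue; similarity values are made up.
A = [[1.0, 0.8, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
all_red = [1.0, 0.0, 0.0]
all_pink = [0.0, 1.0, 0.0]
all_blue = [0.0, 0.0, 1.0]
d_red_pink = quadratic_form_distance(all_red, all_pink, A)
d_red_blue = quadratic_form_distance(all_red, all_blue, A)
print(d_red_pink, d_red_blue)
```

The cross terms a_ij make an all-red image closer to an all-pink image than to an all-blue one, which a plain bin-by-bin Euclidean distance would not capture.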

Page 44: Advanced topics in databases

Color histograms – lower bounding

Two obstacles for using color histograms as feature vectors in GEMINI:
  ‘Dimensionality curse’ (h is large, e.g., 64 or 128)
  Distance function is quadratic
    It involves all cross terms (‘cross-talk’ among features)
      - expensive to compute
      - precludes the use of SAMs

[Figure: histograms x and q over, e.g., 64 colors; neighboring bins such as bright red, pink, and orange illustrate the cross-talk]

Page 45: Advanced topics in databases

Color histograms – lower bounding

1st step: define the distance function between two color images: D() = d_h()
2nd step: find numerical features (one or more) whose Euclidean distance lower-bounds d_h()
If we are allowed to use one numerical feature to describe the color image, what should it be?
  Avg. amount for each color component (R,G,B):

  $$\bar{x} = (R_{avg}, G_{avg}, B_{avg})^t$$

  where

  $$R_{avg} = (1/P) \sum_{p=1}^{P} R(p)$$

  and similarly for G and B
  P is the number of pixels in the image, and R(p) is the red component (intensity) of the p-th pixel

Page 46: Advanced topics in databases

Color histograms – lower bounding

Given the average color vectors $\bar{x}$ and $\bar{y}$ of two images, we define d_avg() as the Euclidean distance between the 3-d average color vectors:

$$d_{avg}^2(\bar{x}, \bar{y}) = (\bar{x} - \bar{y})^t (\bar{x} - \bar{y}) = \sum_{i=1}^{3} (\bar{x}_i - \bar{y}_i)^2$$

3rd step: prove that the feature distance d_avg() lower-bounds the actual distance d_h()
Main idea of approach:
  First, a filtering step using the average (R,G,B) color
  Then, a more accurate matching using the full h-element histogram
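The average-color feature and its distance can be sketched as follows, assuming an image is a flat list of (R, G, B) tuples. The function names and the two toy images are illustrative.

```python
import math


def average_color(pixels):
    """(R_avg, G_avg, B_avg) over a flat list of (R, G, B) pixels."""
    P = len(pixels)
    return tuple(sum(p[c] for p in pixels) / P for c in range(3))


def d_avg(x_bar, y_bar):
    """Euclidean distance between two 3-d average color vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_bar, y_bar)))


img1 = [(255, 0, 0), (255, 0, 0), (0, 0, 255), (0, 0, 255)]  # half red, half blue
img2 = [(255, 0, 0), (0, 0, 255), (0, 0, 255), (0, 0, 255)]  # mostly blue
x_bar = average_color(img1)
y_bar = average_color(img2)
distance = d_avg(x_bar, y_bar)
print(x_bar, y_bar, distance)
```

Only three numbers per image are compared in the filtering step, so a SAM over 3-d points can be used before the full histogram distance is evaluated on the surviving candidates.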

Page 47: Advanced topics in databases

Color auto-correlogram

Pick any pixel p1 of color C_i in the image I
At distance k away from p1, pick another pixel p2
What is the probability that p2 is also of color C_i?

[Figure: image I with a red pixel P1 and a pixel P2 at distance k from it; is P2 also red?]

Page 48: Advanced topics in databases

Color auto-correlogram

The auto-correlogram of image I for color C_i, distance k:

$$\gamma_{C_i}^{(k)}(I) \equiv \Pr[\, |p_1 - p_2| = k,\ p_2 \in I_{C_i} \mid p_1 \in I_{C_i} \,]$$

Integrates both color information and spatial information.
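A brute-force sketch of the auto-correlogram for one color and one distance. It reads the definition as a conditional probability (given a pixel of the color, how likely is a pixel at chessboard distance exactly k to have the same color), estimated by counting pixel pairs inside the image. The image encoding (list of strings, one character per pixel) and the function names are assumptions.

```python
def d8(p, q):
    """Chessboard distance between pixel coordinates p = (row, col) and q."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def auto_correlogram(image, color, k):
    """Fraction of same-colored pixels among those at D8 distance k from a pixel of `color`."""
    rows, cols = len(image), len(image[0])
    same = total = 0
    for r1 in range(rows):
        for c1 in range(cols):
            if image[r1][c1] != color:
                continue
            # Visit the square ring of positions at D8 distance exactly k.
            for r2 in range(r1 - k, r1 + k + 1):
                for c2 in range(c1 - k, c1 + k + 1):
                    if d8((r1, c1), (r2, c2)) != k:
                        continue
                    if 0 <= r2 < rows and 0 <= c2 < cols:
                        total += 1
                        same += image[r2][c2] == color
    return same / total if total else 0.0


# Two 4x4 images with identical histograms (8 red, 8 blue pixels each).
blocks = ["RRBB", "RRBB", "RRBB", "RRBB"]
checker = ["RBRB", "BRBR", "RBRB", "BRBR"]
g_blocks = auto_correlogram(blocks, "R", 1)
g_checker = auto_correlogram(checker, "R", 1)
print(g_blocks, g_checker)
```

The two images are indistinguishable by their color histograms, yet their auto-correlogram values differ because the red pixels are spatially clustered in one and interleaved in the other.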

Page 49: Advanced topics in databases

Color auto-correlogram

Page 50: Advanced topics in databases

Implementations

Pixel Distance Measures
  Use D8 distance (also called chessboard distance):

  $$D_8(p, q) = \max(|p_x - q_x|, |p_y - q_y|)$$

  Choose distance k = 1, 3, 5, 7
Computation complexity:
  Histogram: O(n^2)
  Correlogram: O(134 · n^2)

Page 51: Advanced topics in databases

Implementations

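Features Distance Measures:
  If D(f(I1) - f(I2)) is small, I1 and I2 are considered similar.
  Example: f(a)=1000, f(a’)=1050; f(b)=100, f(b’)=150
  For histogram:

  $$|I - I'|_h = \sum_{i \in [m]} \frac{|h_{C_i}(I) - h_{C_i}(I')|}{1 + h_{C_i}(I) + h_{C_i}(I')}$$

  For correlogram:

  $$|I - I'|_\gamma = \sum_{i \in [m],\, k \in [d]} \frac{|\gamma_{C_i}^{(k)}(I) - \gamma_{C_i}^{(k)}(I')|}{1 + \gamma_{C_i}^{(k)}(I) + \gamma_{C_i}^{(k)}(I')}$$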
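The example values on this slide show why the distance is normalized: both pairs differ by 50 in absolute terms, but the relative measure treats the difference between 100 and 150 as much larger than the one between 1000 and 1050. A small sketch, with the function name `relative_l1` chosen for illustration:

```python
def relative_l1(f1, f2):
    """Sum of |a - b| / (1 + a + b) over corresponding feature components."""
    return sum(abs(a - b) / (1 + a + b) for a, b in zip(f1, f2))


d_a = relative_l1([1000], [1050])  # 50 / 2051
d_b = relative_l1([100], [150])    # 50 / 251
print(d_a, d_b)
```

The same function applies to histograms (one component per color) and to correlograms (one component per color and distance k), since both formulas have the same form.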

Page 52: Advanced topics in databases

Color Histogram vs Correlogram

If there is no difference between the query and the target images, both methods have good performance.

[Figure: query image (512 colors) and the top five results (1st to 5th) for the correlogram method and for the histogram method]

Page 53: Advanced topics in databases

Color Histogram vs Correlogram

The correlogram method is more stable to color change than the histogram method.

[Figure: query image and target image]

Rank of the target: correlogram method: 1st; histogram method: 48th

Page 54: Advanced topics in databases

Color Histogram vs Correlogram

The correlogram method is more stable to large appearance change than the histogram method.

[Figure: query image and target image]

Rank of the target: correlogram method: 1st; histogram method: 31st

Page 55: Advanced topics in databases

Color Histogram vs Correlogram

The correlogram method is more stable to contrast & brightness change than the histogram method.

[Figure: one target image and four query images (Query 1 to Query 4)]

Ranks of the target (C: correlogram, H: histogram), in the order they appear on the slide:
  C: 178th, H: 230th
  C: 1st, H: 1st
  C: 1st, H: 3rd
  C: 5th, H: 18th

Page 56: Advanced topics in databases

Color Histogram vs Correlogram

The color correlogram describes the global distribution of local spatial correlations of colors.
It’s easy to compute
It’s more stable than the color histogram method

Page 57: Advanced topics in databases

Multimedia Indexing – Conclusions

GEMINI is a popular method
Whole matching problem
Should pay attention to:
  Distance functions
  Feature extraction functions
  Lower bounding
  Particular application
Sub-pattern matching?