Multimodal pattern matching algorithms and applications


Xavier Anguera Telefonica Research

Outline

•  Introduction
•  Partial sequence matching
    – U-DTW algorithm
•  Music/video online synchronization
    – MuViSync prototype
•  Video copy detection

Partial Sequence Matching Using an Unbounded Dynamic Time Warping Algorithm

Xavier Anguera, Robert Macrae and Nuria Oliver

Telefonica Research, Barcelona, Spain

Proposed challenge
•  Given one or several audio signals, we want to find and align recurring acoustic patterns.

Proposed challenge
•  We could use the ASR/phonetic output and search for symbol repetitions
    PROS:
    – It is easy to apply; the ASR takes care of any time warping
    CONS:
    – ASR is language dependent and requires training
    – We introduce additional sources of error (acoustic conditions, OOVs)
    – It can be very slow and not embeddable
•  Automatic motif discovery directly in the speech signal
    – Training-free, language independent and resilient to some noises

[Figure: two routes. ASR/phonetization → symbolic representation → symbol alignment, versus direct acoustic alignment. Outputs: alignment locations and scores]

Areas of application
•  Improve ASR by disambiguation over several repetitions (Park and Glass, 2005)
•  Pattern-based speech recognition – flat modelling (Zweig and Nguyen, 2010)
•  Acoustic summarization (Muscariello, 2009)
•  Musical structure analysis (Müller, 2007)
•  Server-less mobile voice search (Anguera, 2010)

Automatic motif discovery
•  The goal is to avoid going to text and therefore be more robust to errors
•  There is a good deal of applicable work in this area:
    – Biomedicine, in matching DNA sequences (converting the speech signals into symbol strings)
    – Directly from real-valued multidimensional samples using DTW-like algorithms
        •  Müller'07, Muscariello'09, Park'05, Zweig'10
        •  Most need to compute the whole cost matrix a priori

Dynamic Time Warping - DTW
•  The DTW algorithm computes the optimal alignment between two time series $X_U, X_V \in \Phi^D$:

$X_U = (u_1, \ldots, u_m, \ldots, u_M)$
$X_V = (v_1, \ldots, v_n, \ldots, v_N)$

[Figure: two time series aligned by DTW. Image by Daniel Lemire]

Dynamic Time Warping (II)
•  The optimal alignment can be found in O(MN) complexity using dynamic programming.
•  We need to define a cost function between any two elements in the series and build a distance matrix:

$d : \Phi^D \times \Phi^D \rightarrow \mathbb{R}_{\geq 0}$

Where usually (Euclidean distance):

$d(u_m, v_n) = \lVert u_m - v_n \rVert$

Warping function: $F = c(1), \ldots, c(K)$, where $c(k) = (i(k), j(k))$

[Figure: distance matrix with a warping path. Image by Tsanko Dyustabanov]

Warping constraints
For speech signals some constraints are usually applied to the warping function F:
– Monotonicity:
    $i(k-1) \leq i(k)$, $j(k-1) \leq j(k)$
– Continuity (i.e. local constraints):
    $i(k) - i(k-1) \leq 1$, $j(k) - j(k-1) \leq 1$

[Figure: local path constraints around cell (m, n), with predecessors (m-1, n-1) and (m-1, n) labelled]

$D(m,n) = \min \{ D(m-1,n),\ D(m,n-1),\ D(m-1,n-1) \} + d(u_m, v_n)$

Sakoe, H. and Chiba, S. (1978) Dynamic programming algorithm optimization for spoken word recognition, IEEE Trans. on Acoust., Speech, and Signal Process., ASSP-26, 43-49.
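The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `dtw` and `euclidean` are chosen here, frames are assumed to be tuples of floats, and the boundary condition (paths start at the first frames and end at the last frames) is hard-coded.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(u, v, dist=euclidean):
    """Accumulated cost D(M, N) of the best start-to-end alignment of u and v."""
    M, N = len(u), len(v)
    # D[m][n] holds the best accumulated cost ending at (m, n); row/column 0 are padding.
    D = [[math.inf] * (N + 1) for _ in range(M + 1)]
    D[0][0] = 0.0
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            best_prev = min(D[m - 1][n], D[m][n - 1], D[m - 1][n - 1])
            D[m][n] = best_prev + dist(u[m - 1], v[n - 1])
    return D[M][N]
```

Both loops visit every cell once, which is where the O(MN) complexity comes from.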

Warping constraints (II)
– Boundary condition:
    $i(1) = 1$, $j(1) = 1$, $i(K) = M$, $j(K) = N$
  i.e. DTW needs prior knowledge of the start-end alignment points.
– Global constraints

[Figure: global constraint regions. Image from Keogh and Ratanamahatana]

DTW Dynamic Programming




DTW main problem
•  The boundary condition constrains the time series to be aligned from start to end
    – We need a modification to DTW that allows common pattern discovery in reference and query signals regardless of the sequences' other content

Alternative proposals
•  Meinard Müller's path extraction for music [1]
    – Needs to pre-compute the complete cost matrix.
•  Alex Park's Segmental DTW [2]
    – Needs to pre-compute the complete cost matrix; very computationally expensive afterwards.
•  Armando Muscariello's word discovery algorithm [3]
    – Searches for patterns locally; does not check all possible starting points.

[1] M. Müller, "Information Retrieval for Music and Motion", Springer, New York, USA, 2007.
[2] A. Park et al., "Towards unsupervised pattern discovery in speech," in Proc. ASRU'05, Puerto Rico, 2005.
[3] A. Muscariello et al., "Audio keyword extraction by unsupervised word discovery," in Proc. INTERSPEECH'09, 2009.

Unbounded-DTW Algorithm
•  U-DTW is a modification to DTW that is fast and accurate in finding recurring patterns
•  We call it unbounded because:
    – The start-end positions of both segments are not constrained
    – Multiple matching segments can be found with a single pass of the algorithm
    – It minimizes the computational cost of comparing two multidimensional time series

U-DTW Cost function and matching length
•  Given two sequences to be matched, $U = (u_1, u_2, \ldots, u_M)$ and $V = (v_1, v_2, \ldots, v_N)$, we use the normalized inner product (cosine) similarity:

$s(m,n) = \cos\theta = \frac{\langle u_m, v_n \rangle}{\lVert u_m \rVert \, \lVert v_n \rVert}$

   Values range in [-1, 1]; the higher, the closer
•  We look for matching sequences with a minimum length $L_{min}$ (set at 400 ms in our experiments)
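As a small illustration of the similarity and of the minimum length, here is a sketch in Python. The function names are made up for this example, and the 10 ms frame step used to convert the 400 ms minimum length into frames is an assumption borrowed from the parameterization described in the experimental setup.

```python
import math

def cosine_similarity(u_m, v_n):
    """s(m, n): inner product of two frames divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u_m, v_n))
    norm = math.sqrt(sum(a * a for a in u_m)) * math.sqrt(sum(b * b for b in v_n))
    return dot / norm if norm > 0 else 0.0  # all-zero frames get similarity 0 (a choice made here)

def min_length_frames(l_min_ms=400, frame_step_ms=10):
    """Minimum matching length expressed in frames (assumed frame step)."""
    return l_min_ms // frame_step_ms
```

Because the similarity is bounded in [-1, 1], accumulated scores divided by path length stay in the same range, which makes a fixed pruning threshold meaningful.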

U-DTW global/local constraints
•  No global constraints are applied, in order to allow matching of any segment in either sequence
•  Local constraints are set to allow warping up to 2X

[Figure: local constraints around cell (m, n), with predecessors (m-1, n-2), (m-1, n-1) and (m-2, n-1)]

$D(m,n) = \max \{ D(m-1,n-2),\ D(m-1,n-1),\ D(m-2,n-1) \} + s(m,n)$

U-DTW computational savings
•  Computational savings are achieved because:
    1.  We sample the distance/similarity matrix only at certain possible matching start points (by setting synchronization points)
    2.  Dynamic programming is done forward, pruning out low-similarity paths

Synchronization points
•  Only certain (m,n) positions in the matrix are analyzed for possible matching segments
    – Selected so as not to lose any matching segment
    – Selected to optimize the computational cost
•  Two methods are followed: horizontal and diagonal bands

[Figure: synchronization point layouts on the U-V matrix for both methods. Labels: τh, 2τh, τd, λ, π/4, (m,n)]

U-DTW Dynamic Programming

Forward dynamic programming
•  For each position (m,n), 3 possible forward paths are considered: (m+1, n+2), (m+1, n+1) and (m+2, n+1)
•  A path is extended forward to (m',n') if and only if:
    – Its normalized global similarity is above a pruning threshold:

      $S(m',n') = \frac{D(m,n) + s(m',n')}{M(m,n) + 1} \geq Thr_{prun}$

    – $S(m',n')$ is greater than that of any previous path at that location
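The forward extension with pruning can be sketched as follows. This is an illustrative reading of the slide rather than the authors' code: `forward_extend` is a hypothetical name, `sim` is assumed to be a precomputed similarity matrix (the real algorithm computes similarities lazily), and M(m,n) is assumed to be the number of cells on the path reaching (m,n).

```python
def forward_extend(sim, start, thr_prune):
    """Grow paths forward from a synchronization point.

    Returns {cell: (D, length)} with the accumulated similarity D and the
    path length of the best surviving path reaching each cell.
    """
    n_rows, n_cols = len(sim), len(sim[0])
    best = {start: (sim[start[0]][start[1]], 1)}
    frontier = [start]
    while frontier:
        reached = []
        for m, n in frontier:
            D, length = best[(m, n)]
            for dm, dn in ((1, 2), (1, 1), (2, 1)):  # the three forward moves
                m2, n2 = m + dm, n + dn
                if m2 >= n_rows or n2 >= n_cols:
                    continue
                total = D + sim[m2][n2]
                S = total / (length + 1)  # normalized global similarity
                previous = best.get((m2, n2))
                previous_S = previous[0] / previous[1] if previous else float("-inf")
                # Extend only above the threshold and only if better than earlier arrivals.
                if S >= thr_prune and S > previous_S:
                    best[(m2, n2)] = (total, length + 1)
                    reached.append((m2, n2))
        frontier = list(dict.fromkeys(reached))  # drop duplicates, keep order
    return best
```

Since every move increases m, the loop always terminates; cells that never pass the threshold are simply never visited, which is the source of the savings.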



Backward path algorithm
•  When a possible matching segment is found in the forward pass, the same search is run backwards, starting from the originating synchronization point (SP).
   The same procedure is followed as in the forward path, with steps to (m-1, n-2), (m-1, n-1) and (m-2, n-1)



Computational savings example

[Figure: matrix comparing two recordings of the word "Barcelona"]

Experimental setup
•  We asked 23 people to record 47 words from 6 categories, 5 iterations each:
    Nature, Cities, People, Events, Family, Monuments

    $X_{U,V}[n,i], \quad i = 1 \ldots 5, \quad n = 1 \ldots 47$

•  Simple energy-based trimming eliminates non-speech regions
•  We simulate acoustic context by attaching different start-end audio sequences to $X_{U,V}$.

Experimental setup (II)
•  Signals are parameterized with 10 MFCC every 10 ms
•  Each word $X_U$ is compared to all words $X_V$ from the same speaker (234 comparisons) and the closest one is retrieved:

$\arg\min_{m,j} D(X_U[n,i], X_V[m,j]), \quad (n,i) \neq (m,j)$

   We get a hit if m = n, a miss otherwise
•  Tests were performed on an Ubuntu Linux PC @ 2.4 GHz.
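The hit/miss protocol amounts to a leave-one-out nearest-neighbour search per speaker. A minimal sketch, assuming the recordings of one speaker are stored in a dictionary keyed by (word, iteration) and that `distance` is any sequence distance (the name and data layout are hypothetical):

```python
def retrieval_accuracy(recordings, distance):
    """Percentage of recordings whose closest other recording is the same word.

    recordings: {(word, iteration): features} for a single speaker.
    """
    hits = 0
    for key, feats in recordings.items():
        # Compare against every other recording, i.e. (n, i) != (m, j).
        others = [(distance(feats, f), k) for k, f in recordings.items() if k != key]
        _, (best_word, _) = min(others)
        hits += best_word == key[0]  # hit if m = n
    return 100.0 * hits / len(recordings)
```

With 47 words and 5 iterations, each recording is compared with the 234 others, matching the count on the slide.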

Comparing systems
•  Standard DTW
    – Compares the sequences without any added acoustic context (i.e. with prior knowledge of the start-end points)
•  Segmental DTW (Park and Glass, 2005)
    – Minimum segment length of 500 ms
    – Band size of 70 ms, 50% overlap
    – Used 2 distances: Euclidean and 1 - inner product

Performance evaluation
Used metrics:
–  Accuracy: percentage of words correctly matched ($X_U$ and $X_V$ are different iterations of the same word).

   $Acc = \frac{\sum \text{correct matches}}{\text{all matches}} \cdot 100$

–  Average processing time per sequence pair ($X_U$-$X_V$), excluding parameterization

   $Time = \frac{\sum time(D(X_U[n,i], X_V[m,j]))}{\#matches} \cdot 100$

–  Average ratio of frame-pair distances computed within each sequence-pair cost matrix.

   $Ratio = \frac{\sum computed(d(X_U[n,i], X_V[m,j]))}{MN} \cdot 100$

Results

Algorithm                      Accuracy   Avg. time   Ratio
Segmental DTW w/ Eucl.         80.61%     82.7 ms     1
Segmental DTW w/ inner prod.   74.62%     86.7 ms     1
U-DTW horiz. bands             89.53%     10.6 ms     0.51
U-DTW diag. bands              89.34%     9.0 ms      0.42
Standard DTW                   95.42%     0.6 ms      1

Effect of the Cutout Threshold

Conclusions and future work
•  We propose a novel algorithm called U-DTW for unconstrained pattern discovery in speech
•  We show it is faster and more accurate than existing alternatives
•  We are starting to test the algorithm for unrestricted audio summarization

MuViSync: Audiovisual Music Synchronization

Xavier Anguera, Robert Macrae and Nuria Oliver

People enjoy listening to their favorite music everywhere…

…at home, …

…on the go, …

…or at a party with friends

Users increasingly have a personal mp3 music collection…

…but it usually contains 'only' music.

What if you could watch the video clip of any of your songs while listening to it?

You could go to sites like YouTube…

…but the audio quality is much worse than in your mp3…

What if you could listen to your high quality mp3 music while watching the video clips?

MuViSync: Music and Video Synchronization system

[Figure: personal music and a video clip (streaming or local) feeding MuViSync]

MuViSync synchronizes audio and video from two different sources and plays them together in sync

Application scenarios
•  Watch your favorite music on TV
    – Personal music synchronization with video clips, either local or streamed
•  Watch your music on your iPhone
    – Personal music synchronization by streaming the video to the iPhone
•  Identify and watch any music
    – Combined with songID technology, either at home or on the go.

MuViSync application
•  We have developed a prototype application for Windows/Mac, and soon for iPhone.

Alignment algorithm requirements
•  Perform an alignment between the mp3 music and the video's audio track
•  Initially only partial knowledge is available from both sources (live recording or buffering)
•  Alignment has to be done online and in real time
•  Emphasis is needed on user satisfaction when playing the video.

Application testbed
•  We use 320 music videos (YouTube) + their corresponding mp3 files
•  A supervised ground-truth alignment was performed using offline DTW and checking for consistency
•  Audio is processed every 100 ms (200 ms window) and chroma features are extracted

MuViSync online alignment algorithm
1.  Initial path discovery
    – Both signals (audio and video) are buffered, features are extracted and an initial alignment is found
2.  Real-time online alignment
    – An incremental alignment is computed
3.  Alignment post-processing to ensure a smooth playback of the aligned video.

[Figure: block diagram. Audio + feature extraction (ta) and feature extraction (tv) feed 1) initial path discovery and 2) real-time alignment, which outputs the alignment]

Initial path discovery (online mp3 playback + video buffering)

[Figure: timeline of the audio available from the video and the audio from the mp3 file, marking the sync request and the end of video buffering]

Initial path discovery
•  A segment of the audio and the buffered video are checked for alignment using forward-DTW
•  The global similarity D(m,n) at each location (m,n) is normalized by the length of the optimum path to that location, giving D'(m,n)
•  At each step, all paths with D'(m,n) < Dave(*,n) are pruned.
•  The initial alignment is selected when only one path survives or the sync time is reached.
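One pruning step of this procedure can be sketched as below. It is only an interpretation of the slide: Dave(*,n) is assumed to be the mean of the normalized similarities of all paths alive at step n, and the dictionary layout and the name `prune_paths` are hypothetical.

```python
def prune_paths(paths):
    """Keep the paths whose normalized similarity D' is not below the average.

    paths: {path_id: (D, length)} for the paths alive at the current step.
    """
    normalized = {pid: D / length for pid, (D, length) in paths.items()}
    average = sum(normalized.values()) / len(normalized)
    return {pid: paths[pid] for pid, score in normalized.items() if score >= average}
```

Applying this step repeatedly shrinks the set of candidates until a single path survives or the sync time is reached, whichever comes first.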



[Figure: animation frames relating the audio being played from the mp3 to the audio time alignment buffer (about 1 s)]

Real-time online alignment
•  Starting from the initial alignment we iteratively compute:
    1.  A locally optimum forward path for L steps, p1…pL, using a) local constraints (no dynamic programming)
    2.  A backward (standard) DTW from pL to p1 using b) local constraints
    3.  Add the initial L/2 steps to the final path, and restart 1) from pL/2 until the playback ends

Real-time online alignment

[Figure: the three steps of one iteration: 1) forward locally best path with L=8, from p1 to pL; 2) standard DTW between pL and p1; 3) the new starting point p1 is moved forward]

Alignment postprocessing
•  Alignment estimates every 100 ms are not enough to drive 25/30 fps video
•  Interpolating the points and averaging over 5 seconds gives the projection estimate for the current playback position
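One possible reading of this post-processing step is sketched below: alignment points are linearly interpolated to frame-rate resolution, and the audio-to-video offset is averaged over the last 5 seconds. The function names, the (t_audio, t_video) point format in seconds, and the choice of averaging the offset are assumptions made for this example.

```python
def interpolate(path, t_audio):
    """Linearly interpolate the video time for an audio time inside the path."""
    for (a0, v0), (a1, v1) in zip(path, path[1:]):
        if a0 <= t_audio <= a1:
            w = (t_audio - a0) / (a1 - a0)
            return v0 + w * (v1 - v0)
    raise ValueError("t_audio lies outside the aligned range")

def projected_video_time(path, t_now, fps=25.0, window_s=5.0):
    """Smoothed video position for the current audio time t_now.

    path: alignment points (t_audio, t_video), sorted by t_audio; t_now must
    lie inside the aligned range.
    """
    n_frames = int(window_s * fps)
    times = [t_now - k / fps for k in range(n_frames)]
    # Offsets sampled at the video frame rate over the averaging window.
    offsets = [interpolate(path, t) - t for t in times if path[0][0] <= t <= path[-1][0]]
    return t_now + sum(offsets) / len(offsets)
```

Averaging trades a little responsiveness for stability: a single noisy 100 ms estimate moves the projected position by only a small fraction of its error.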

Experiments
•  We use 320 videos + mp3 files, aligned using offline DTW and manually checked for consistency.
•  Accuracy is computed as the % of songs whose average error is below a given threshold (in ms).

[Figure: average accuracy @ 100 ms for different video buffer lengths]

Experiments

Video Duplicate Detection

Xavier Anguera and Pere Obrador

Let's say you're looking for the Bush attack video…

…and you get 11,100 results.

…after 40 minutes...

of watching many of the returned videos, you notice that many are similar, i.e. near duplicates

27% on average on YouTube [Wu et al., 2007]
12% on average on YouTube [Anguera et al., 2009]

Near duplicate (NDVC) definition
•  Identical or approximately identical videos that differ in some feature:
    – file formats, encoding parameters
    – photometric variations (color, lighting changes)
    – overlays (caption, logo, audio commentary)
    – editing operations (frames added/removed)
    – semantic similarity

NDVC are videos that are "essentially the same"

Near duplicates (NDVC) vs. video copies
•  These two concepts are not clearly distinguished in the literature.
•  Video copy: exact video segment, with some transformations applied to it
•  Near duplicate: similar videos on the same topic (different viewpoints, semantically similar videos, …)

In our research we address video copy detection

Examples of video copies

Use Scenarios: Copyright law enforcement
•  Detection of copyright-infringing videos on online video sharing sites
•  In a recent study we found that on average 12% of search results on YouTube are copies of the same video

Use Scenarios: Video forensics for illegal activities
•  Discover illegal content hidden within other videos
•  Currently police forces usually have to manually scroll through ALL materials in pederasty cases searching for evidence.

Use Scenarios: Database management
•  Database management/optimization and help with searches over historic contents
•  Video excerpts used several times

Use Scenarios: Advertisement detection and management
•  Advertisement detection/identification
•  Programming analysis

Use Scenarios: Information overload reduction
•  Improved (more diverse) video search results by clustering all video duplicates.

[Figure: search results for "George Bush" before and after clustering]

Steps in Video Duplicate detection
1.  Indexing of the reference videos
    A.  Obtain features representing the video
    B.  Store these features in a scalable manner
2.  Search of queries within the reference set

[Figure: OFFLINE: reference videos → feature extraction → references indexing → features database. ONLINE: query video → feature extraction → search for duplicates in the features database]

Ways to approach near-duplicate video detection
•  Local features
    – Extracted from selected frames in the videos
    – Focus on local characteristics within those frames
•  Global features
    – Extracted from selected frames or from the whole video
    – Focus on overall characteristics

Local features
•  They build on previous knowledge in image copy detection / near-duplicate detection
•  Steps:
    – Keyframes are first extracted from the videos at regular intervals or by detecting shots
    – Local features are obtained for these keyframes:
        •  SIFT
        •  SURF
        •  HARRIS
        •  …

Global Features
•  Features are extracted either from the whole video or from keyframes by looking at the overall image (not at particular points).

In our work we extract them from the whole video

Multimodal video copy detection
•  Most works use only video/image information
    – They prefer local features for their robustness
•  We introduce audio information by combining global features from both the audio and video tracks
•  We are also experimenting with fusing local features with global features (work in progress)

Multimodal global features
•  We use features based on the changes in the data, which are more robust to transformations
•  Video:
    – Hue + saturation interframe change
    – Lightest and darkest centroid interframe distance
•  Audio:
    – Bayesian information criterion (BIC) between adjacent segments
    – Cross-BIC between adjacent segments
    – Kullback-Leibler divergence (KL2) between adjacent segments

Hue+Saturation interframe change
1.  Transform the colorspace from RGB to HSV (Hue+Saturation+Value)
2.  For each pair of consecutive frames, compute their HS histograms and the intersection of the two histograms
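The two steps can be illustrated with the standard library alone. The sketch below uses the common definition of histogram intersection (sum of bin-wise minima of normalized histograms), which may differ from the exact formula on the original slide; the bin counts and function names are assumptions.

```python
import colorsys

def hs_histogram(pixels, h_bins=8, s_bins=8):
    """Normalized Hue-Saturation histogram of a frame.

    pixels: non-empty iterable of (r, g, b) tuples with components in [0, 1].
    """
    hist = [0.0] * (h_bins * s_bins)
    count = 0
    for r, g, b in pixels:
        h, s, _value = colorsys.rgb_to_hsv(r, g, b)  # step 1: RGB -> HSV
        hi = min(int(h * h_bins), h_bins - 1)
        si = min(int(s * s_bins), s_bins - 1)
        hist[hi * s_bins + si] += 1.0
        count += 1
    return [x / count for x in hist]

def histogram_intersection(h1, h2):
    """1.0 for identical normalized histograms, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

Computing the intersection for every pair of consecutive frames yields one value per frame transition, i.e. a one-dimensional feature stream for the whole video.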

Lightest and darkest centroid interframe distance
1.  Find the lightest and darkest regions in each frame and obtain their centroids
2.  Compute the Euclidean distance between the corresponding centroids of each two adjacent frames, obtaining two global feature streams

Acoustic features
•  Compute some acoustic distance between adjacent acoustic segments

[Figure: adjacent segments A and B, modelled by GMM A, GMM B and a joint GMM A+B]

Acoustic features (II)
•  Likelihood-based metrics:
    – Bayesian Information Criterion (BIC)
    – Cross-BIC
•  Model distance metrics:
    – Kullback-Leibler divergence (KL2)

Acoustic features (III)
•  For example, the Bayesian Information Criterion (BIC) output:

[Figure: example BIC output]
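To give a feel for the BIC-based feature, here is a deliberately simplified sketch. It replaces the GMMs of the slide with single one-dimensional Gaussians and uses the usual delta-BIC form (likelihood gain of two models over one, minus a complexity penalty); the penalty weight `lam` and the function names are assumptions, not the system's configuration.

```python
import math

def variance(x):
    """Maximum-likelihood variance of a list of floats."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)

def delta_bic(a, b, lam=1.0):
    """Positive when segments a and b are better modelled separately.

    a, b: lists of scalar features with non-zero variance.
    """
    n_a, n_b = len(a), len(b)
    n = n_a + n_b
    gain = 0.5 * (n * math.log(variance(a + b))
                  - n_a * math.log(variance(a))
                  - n_b * math.log(variance(b)))
    penalty = lam * 0.5 * 2 * math.log(n)  # 2 extra parameters: one mean, one variance
    return gain - penalty
```

Sliding this comparison along the audio track gives one value per segment boundary; the feature responds to changes in the signal rather than to its absolute content.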

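Search for full copies
•  For each video-query pair we compute the correlation of each feature pair
•  We then find the positions with high similarity (peaks).

[Figure: the reference and the possible copy each go through an FFT; the results are multiplied (X) and passed through an IFFT, and peaks are found in the output]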
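The FFT, multiply, IFFT, find-peaks chain can be sketched without external libraries. This is a didactic version: a production system would use an optimized FFT library, and the function names, the zero-padding strategy and the single-peak search are choices made for this example.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two (inverse is unnormalized)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def cross_correlation(reference, query):
    """corr[d] = sum over n of reference[n + d] * query[n], computed via FFT."""
    size = 1
    while size < len(reference) + len(query):  # zero-pad to avoid wrap-around
        size *= 2
    R = fft([complex(v) for v in reference] + [0j] * (size - len(reference)))
    Q = fft([complex(v) for v in query] + [0j] * (size - len(query)))
    product = [r * q.conjugate() for r, q in zip(R, Q)]
    return [c.real / size for c in fft(product, inverse=True)]

def best_delay(reference, query):
    """Delay of the query inside the reference with the highest correlation."""
    corr = cross_correlation(reference, query)
    return max(range(len(reference)), key=lambda d: corr[d])
```

Going through the frequency domain costs O(n log n) instead of the O(n^2) of a direct sliding comparison, which matters with over 100 hours of reference video.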

Multimodal fusion
•  When multiple modalities are available, fusion is performed on the correlations

Output score
•  The resulting score is computed as the weighted sum of the different modalities' normalized dot products at the found peak
•  Automatic weights are obtained via

Finding subsegments of the query
•  The previously described algorithm assumes that the whole query matches a portion of the reference videos
•  To avoid this restriction, a modification to the algorithm first splits the query into overlapping 20 s segments
•  By accumulating the resulting peaks for each segment we can obtain the main delay and its segment
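The segment-and-accumulate idea can be sketched as a simple voting scheme. The hop between segments (and hence the amount of overlap) is not given on the slide, so the 10-unit hop used below is an assumption, as are the function names and the use of one peak per segment.

```python
from collections import Counter

def split_segments(n_frames, seg_len, hop):
    """(start, end) indices of overlapping segments covering the query."""
    starts = range(0, max(n_frames - seg_len, 0) + 1, hop)
    return [(s, min(s + seg_len, n_frames)) for s in starts]

def main_delay(segment_peaks):
    """Vote on the reference-to-query offset.

    segment_peaks: list of (segment_start, peak_position_in_reference).
    Returns the most frequent offset and the number of segments voting for it.
    """
    votes = Counter(peak - start for start, peak in segment_peaks)
    offset, count = votes.most_common(1)[0]
    return offset, count
```

Segments that truly belong to the copied portion agree on the same offset, while spurious peaks from the rest of the query scatter their votes.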

Algorithm performance evaluation
•  To test the algorithm we used the MUSCLE-VCD database:
    – Over 100 hours of reference videos from the SoundVision group (Netherlands)
    – 2 test sets
        •  ST1: 15 query videos where the whole query is considered
        •  ST2: 3 videos with 21 segments appearing in the reference database

http://www-roc.inria.fr/imedia/civr-bench/benchMuscle.html

MUSCLE-VCD transformation examples

Evaluation metrics
•  We use the same metrics as in the MUSCLE-VCD benchmark tests

Evaluation metrics (II)
•  We also use the more standard precision and recall metrics
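For reference, precision and recall over detected copies can be computed as in the sketch below. The representation of a detection as a (query, reference) pair is an assumption made for the example, not the benchmark's file format.

```python
def precision_recall(detected, ground_truth):
    """Precision and recall of a set of detected (query, reference) pairs."""
    detected, ground_truth = set(detected), set(ground_truth)
    true_positives = len(detected & ground_truth)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

Precision penalizes false alarms and recall penalizes missed copies, so reporting both shows how the detection threshold trades one for the other.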

Evaluation results

Evaluation results histogram for ST1

YouTube reranking application
•  We downloaded all the videos returned when searching for the top 20 most viewed and 20 most visited videos
•  We applied multimodal copy detection and grouped all near duplicates

YouTube reranking test
•  Results show that some videos have multiple clear copies that can boost their ranking once clustered

Thanks for your attention

xanguera@tid.es
Website: http://www.xavieranguera.com/
LinkedIn: http://es.linkedin.com/in/xanguera
Twitter: http://twitter.com/xanguera
