Clustering)and)Synchronizing)) MulGHCameraVideo)) viaAudio ...njb/research/mcs-slides.pdf ·...

Nicholas J. Bryan, Paris Smaragdis*, and Gautham Mysore*

Stanford University | CCRMA *Advanced Technology Labs | Adobe Systems

Clustering and Synchronizing MulG-‐Camera Video

via Audio FingerprinGng

CCRMA DSP Seminar, November 13th 2012

Outline I IntroducGon

II Proposed Method

-‐ Non-‐Linear Transform

-‐ Time-‐Difference-‐Of-‐Arrival EsGmaGon

-‐ Clustering

-‐ SynchronizaGon Refinement

-‐ Efficient ComputaGon III EvaluaGon IV Conclusions

IntroducGon

•  IdenGfy and synchronize mulGple videos of the same event

Video & Audio

MoGvaGon

•  ProliferaGon of mobile devices

•  MulGple videos of a single event common – Moments in history – Weddings, concerts, speeches, film sets

•  Desired to easily edit video together – Grouping/Clustering (Manual) – SynchronizaGon (Manual, Hardware)

•  Dual System Workflow •  1 Videographer •  1 Sound Engineer

• MulG-‐Camera Workflow •  2+ Videographer •  1+ Sound Engineer

TradiGonal Video Capture

Crowd-‐Sourced MulG-‐Camera Video

•  1 Wedding ≈ 300 guest ≈ 100 smartphones/cameras ≈ 10+ videos of “I do”

•  1 concert ≈ 15,000 people ≈ 5,000 smartphones ≈ 100+ of video clips/song ≈ 1000+ video clips/concert

•  1 presidenGal speech ≈ 200,000 people ≈ 70,000 smartphones ≈ 10,000+ videos

Demo Video

•  Taylor Swih’s “Fearless”

II Proposed Method

-‐ Clustering

General Approach

•  Use audio – Typically more “global” – Allows visually disjoint video

•  Time-‐difference-‐of-‐arrival esGmaGon – For each pair of clips in collecGon, compute Gme offset which best synchronizes the given pair using standard correlaGon

– Use correlaGon signals to decide if the two files should match or not

Problems

•  ComputaGonally expensive

•  No accurate (straighnorward) clustering method •  Not robust

Audio FingerprinGng

•  Short-‐duraGon signatures via feature extracGon •  Finds idenGcal (or similar) matches of unknown clip with DB

•  Hash fingerprints for fast search and retrieval •  Shazam, SoundHound, Philips, Gracenote, etc. •  See [Wang 2003] & [Haitsma and Kalker 2003]

Audio FingerprinGng for MulG-‐Camera

•  Slightly different problem – Group all clips in DB (mulGple matching) – Time synchronize all clips within each group

•  Audio-‐fingerprinGng for mulG-‐camera – Principal of most methods yield sync offset – Robust and fast! –  IniGal work over the last few years [Shrestha et al. 2007] & [Kennedy and Naaman 2009]

Proposed Method

1.  Non-‐Linear Transform (FingerprinGng Step) 2.  Time-‐Difference-‐Of-‐Arrival EsGmaGon 3.  Clustering 4.  SynchronizaGon Refinement 5.  Efficient ComputaGon

II Proposed Method

-‐ Clustering

Non-‐Linear (Landmark) Transform

•  Convert Gme-‐domain audio signal into a high-‐dimensional, sparse, binary landmark signal

x(t) 2 R

Landmark Transform

tL(t, h)

L(t, ·) 2 {0, 1}N

Landmarks

•  Spectral peak pairs as landmarks [Wang 2003] – Short-‐Gme Fourier transform – Landmark = [f1, f2, Δt] + absolute Gme offset – Place each landmark in appropriate locaGon in

L(t1, h) = 1

. . . t1

L(t, h)

(t1, h = [f1t1 , f

2t2 , t2 � t1])

•  With a large number of peaks, peak pairs are created in a limited Gme-‐frequency range

Landmarks as ConstellaGons Fr

Time (secs)

Spectrogram

1 2 3 4 5 60

Student Version of MATLAB

0 1 2 3 4 5 6 70

4Spectral Peak Onsets

Time (secs)

•  Short-‐Gme Fourier transform •  Leaky integrator peak detector for each FFT bin

Simple Frequency Peak Detector

if |X(f)| > �f

�f = �f � (1� e�1/(⌧ffs))�f

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

Leaky Peak Detector λf

Time (Sec)

|X(f)|λfPeaks

�f = |X(f)|

Leaky Peak Detector

//Peak Onset

II Proposed Method

-‐ Clustering

Time-‐Difference-‐Of-‐Arrival EsGmaGon

•  Pairwise cross-‐correlaGon method – Correlate each track with each other – Find argmax for offset –  i.e. Matched filter

Rij(t) =P1

⌧=�1 xi(⌧)xj(t+ ⌧)

•  Landmark cross-‐correlaGon

•  Time-‐Difference-‐Of-‐Arrival EsGmaGon

Landmark Cross-‐CorrelaGon

ˆtij = argmaxt RLi,Lj (t)

RLi,Lj (t) =P1

⌧=�1 Li(⌧)TLj(t+ ⌧)

Time-‐Difference-‐Of-‐Arrival EsGmaGon

(a) Normalized absolute Gme-‐domain cross-‐correlaGon.

(b) Normalized landmark cross-‐correlaGon.

Rxi,xj (t)

RLi,Lj (t)

II Proposed Method

-‐ Clustering

Clustering

•  AgglomeraGve Clustering –  IniGalize each clip as a separate cluster and merged into successively larger clusters

– Merge most confidence matches first •  Confidence as funcGon of stats from best potenGal sync •  Reject unconfident merges based on decision rules

•  Maximum of correlaGon •  Mean and variance of cross-‐correlaGon •  Percentage of total matching landmarks in the overlap region •  Overall Gme range defined by the set of matching landmarks •  Overlap region length •  Ignore overly common landmarks (i.e. 60Hz)

Merge Decision Rules

•  Groups w/pairwise sync offset and confidence scores

Clustering Output

t B DB 0 -11.5D 11.5 0

Offsets (seconds) S B DB - 23D 23 -

Confidence Score

t A C EA 0 -5 10C 5 0 -E -10 - 0

Offsets (seconds) Confidence Score

S A C EA - 30 20C 30 - -E 20 - -

II Proposed Method

-‐ Clustering

SynchronizaGon Refinement

•  Refinement is required for clusters of three or more if: 1.  Inconsistent pairwise TDOA esGmates do not saGsfy all

triangle equaliGes within a cluster 2.  One or more TDOA esGmates within any cluster is

unknown caused by non-‐overlapping clips

tAC 6= tAB + tBC

Implied by other esGmates

(a) Case 1 (a) Case 2

Slightly off

Greedy Match-‐and-‐Merge

1.  Find the most confident TDOA esGmate within the cluster in terms of or similar confidence score.

2.  Merge the landmark signals and . First Gme shih by and then mulGply or add the two signals together (depending on the desired effect).

3.  Update the remaining TDOA esGmates and confidence scores to respect the file merge.

4.  Repeat unGl all files within the cluster are merged. 29

tijRLi,Lj

Li LjLj tij

Greedy Match-‐and-‐Merge Graphically

ABC ABCD

(a) IniGal Clusters (b) IteraGon 1

(c) IteraGon 2 (d) IteraGon 3

II Proposed Method

-‐ Clustering

Efficient ComputaGon

•  Leverage knowledge of landmark signal and perform “sparse” cross-‐correlaGon in a special way (fingerprinGng)

•  Use some form of associaGve array, map, or dicGonary to store landmarks and compute all pairwise correlaGons –  Direct arrays –  Binary tree –  Hash table

Map Structure I

•  Create map structure of all landmarks –  Key = (f1, f2, Δt) –  Value = (FileID, AbsoluteTimeOffset)

•  Matching files will have idenGcal landmark •  Difference between AbsoluteTimeOffset of gives sync

t1 t2 t3 t4 t5

A,t1 E,t5

C,t3 A,t1

B,t2 D,t4

Map Structure II

•  Convert map structure to pairwise correlaGons •  For each landmark, compute all pairwise Gme differences

and store in the appropriate pairwise correlaGon

Δt = 4

A vs. E

C vs. A

Δt = 2

B vs. D

Δt = 2

A,t1 E,t5

C,t3 A,t1

B,t2 D,t4

General ComputaGonal Benefit

•  Naïve pairwise correlaGons 1.  pairwise correlaGons, number of files 2.  Each correlaGon , samples in file

•  DrasGcally reduces the computaGonal cost 1.  Eliminates pairwise correlaGons for clips that don’t match 2.  Makes each pairwise correlaGon faster

•  Computes correlaGon computaGon for only the salient parts (landmarks) of audio

P !2(P�2)!

O(N log(N))

Ideal Case

1.  Pairwise Comparisons –  All landmarks are unique its group – Only performs pairwise correlaGons within each group –  For large # groups/small # clips, this is savings huge

2.  Single pairwise correlaGon – Only correlate points with matching landmarks, no computaGon for 0s

–  Ideal case with no false posiGve matches results in a cost, with = number of matching landmarks

O(M) M

II Proposed Method

-‐ Clustering

EvaluaGon Metrics

•  Performance measures –  Precision, Recall, F1 score –  Computed on pairwise matches of final clusters

•  ComputaGonal cost –  Compute Gme (seconds) –  Throughput (seconds processed/seconds of compute Gme)

•  Benchmark –  Comparison to commercial mulG-‐camera sohware Plural Eyes

Precision, Recall, and F1

•  Precision –  fracGon of esGmated pairwise merges retrieved that are correct

•  Recall –  fracGon of correct pairwise merges retrieved

•  F1 score –  harmonic mean of precision and recall

-‐ A-‐E A-‐B A-‐D E-‐B E-‐D B-‐D

A-‐E A-‐C C-‐E

B-‐D

P = 2/6 R = 2/5 F1 = 8/22

EsGmated Clusters Ground Truth Clusters

2PR/(P +R)

Datasets

•  Speech (180 clips from film set) –  Average length 20-‐40 seconds –  54 clusters of one file –  54 clusters of two files –  6 clusters of three files

•  Music (23 clips from live music concerts) –  Average length 3-‐5 minutes –  1 cluster of 7 files –  2 clusters of 8 files

40 All audio files are downsampled to common sample rate of 8kHz for efficiency

Precision, Recall, and F1 Results

•  As expected from using the feature extracGon of [Wang 2003]

ComputaGonal Cost

42 Rough Timing on MacBook Pro, OSX 10.6.8, 2.66 GHz Intel Core i7, unopGmized C++.

Speech Music Speech + Music

Proposed 47.0 41.1 90.1Traditional 1550 197 3600

(a) Computation time (s).

Speech Music Speech + Music

Proposed 164.6 146.5 152.7Traditional 5.0 30.5 3.9

(b) Throughput (s/s).

≈ linear not linear

Benchmark (Speech Dataset)

•  Accuracy Measures – Proposed method F1 ≈ 99% – Plural Eyes 2.1.0 F1 ≈ 95%

•  ComputaGonal Cost – Proposed method ≈ 3 minutes – Plural Eyes 1.2.0 ≈ 6 hours – Plural Eyes 2.1.0 ≈ 2 hours – Plural Eyes 2.1.0 (hard) ≈ 10 hours

II Proposed Method

-‐ Clustering

Future Work & Research DirecGons

•  Video analog to photo “sGtching” – Crowd-‐sourced mulG-‐camera video – Easily change both video and audio viewpoint

•  Denoising/improving audio quality from groups

•  SpaGal audio processing – Use for Gme delay esGmaGon – Large-‐scale beamforming, direcGonal listening, etc.

Conclusions

•  Method of clustering and sync of mulG-‐camera videos using audio – Non-‐Linear Transform – Time-‐Difference-‐Of-‐Arrival EsGmaGon – Clustering – SynchronizaGon Refinement – Efficient ComputaGon

•  Fast and accuracy

References

•  Jaap Haitsma and Ton Kalker, “A Highly Robust Audio FingerprinGng System With an Efficient Search Strategy,” Journal of New Music Research , vol. 32, no. 2, 2003.

•  A.L. Wang, “An Industrial-‐Strength Audio Search Algorithm,” in Proc. 4th Int. Symposium on Music InformaGon Retrieval (ISMIR) , October 2003.

•  P. Shrestha, M. Barbieri, and H. Weda, “SynchronizaGon of mulG-‐camera video recordings based on audio,” in Proc. 15th Intl. Conf. on MulGmedia , 2007.

•  L. Kennedy and M. Naaman, “Less talk, more rock: automated organizaGon of community-‐contributed collecGons of concert videos,” in Proc. 18th Int. Conf. on World Wide Web , 2009.

•  D. Ellis (2009). “Robust Landmark-‐Based Audio FingerprinGng”, h|p://labrosa.ee.columbia.edu/matlab/fingerprint

Demo Video

•  Dave Ma|hews Band’s “Everyday”

Nicholas J. Bryan, Paris Smaragdis*, and Gautham Mysore*

Stanford University | CCRMA *Advanced Technology Labs | Adobe Systems

Clustering and Synchronizing MulG-‐Camera Video

via Audio FingerprinGng

CCRMA DSP Seminar, November 13th 2012

Clustering)and)Synchronizing)) MulGHCameraVideo)) viaAudio ...njb/research/mcs-slides.pdf ·...

Documents

synchronizing LEDs

Synchronizing with SmarterMail

Synchronizing Core Data With Rails

Synchronizing Relay

Generator Synchronizing and Protection

Synchronizing apparatus

LSC - Synchronizing identities @ Loadays 2010

Generator Synchronizing

Technical Note: NTP time servers Synchronizing … · Technical Note: NTP time servers Synchronizing Windows Computers 1 | Synchronizing Windows computers Purpose: The purpose of

Paper 8 NJB Vol 14 2003 - African Journals OnLine

INTRODUCTION TO SYNCHRONIZING AUTOMATIC SYNCHRONIZING … · Manual, 2) Manual with permissive relay (synch check), and 3) Automatic synchronizing. Manual Synchronizing Manual synchronizing

TIMING, SYNCHRONIZING & ADJUSTING ELECTRICAL

INTRODUCTION TO SYNCHRONIZING AUTOMATIC …

njb info dezember

NJB Info März 2012

Njb 1997 534 538 goettsh

ALL STAR ONLINE REGISTRATION - NJB

Methods of defining generator synchronizing power … · Methods of defining generator synchronizing power during low ... excitation parameters ... Methods of defining generator synchronizing

GPS Synchronizing Unit_TM P594

Synchronizing 5G Mobile Networks