
Nicholas J. Bryan, Paris Smaragdis*, and Gautham Mysore*

Stanford University | CCRMA *Advanced Technology Labs | Adobe Systems

Clustering and Synchronizing Multi-Camera Video via Audio Fingerprinting

CCRMA DSP Seminar, November 13th 2012

Outline

I Introduction

II Proposed Method

- Non-Linear Transform

- Time-Difference-Of-Arrival Estimation

- Clustering

- Synchronization Refinement

- Efficient Computation

III Evaluation

IV Conclusions

Introduction

•  Identify and synchronize multiple videos of the same event

[Figure: video and audio clips A, B, C, D, E grouped into clusters {B, D} and {A, C, E}.]

Motivation

•  Proliferation of mobile devices

•  Multiple videos of a single event are common
   – Moments in history
   – Weddings, concerts, speeches, film sets

•  Desire to easily edit video together
   – Grouping/clustering (manual)
   – Synchronization (manual, hardware)

Traditional Video Capture

•  Dual-System Workflow
   – 1 videographer
   – 1 sound engineer

•  Multi-Camera Workflow
   – 2+ videographers
   – 1+ sound engineers

Crowd-Sourced Multi-Camera Video

•  1 wedding ≈ 300 guests ≈ 100 smartphones/cameras ≈ 10+ videos of “I do”

•  1 concert ≈ 15,000 people ≈ 5,000 smartphones ≈ 100+ video clips/song ≈ 1000+ video clips/concert

•  1 presidential speech ≈ 200,000 people ≈ 70,000 smartphones ≈ 10,000+ videos

Demo Video

•  Taylor Swift’s “Fearless”


II Proposed Method

General Approach

•  Use audio
   – Typically more “global”
   – Allows visually disjoint video

•  Time-difference-of-arrival estimation
   – For each pair of clips in the collection, compute the time offset that best synchronizes the pair using standard correlation
   – Use the correlation signals to decide whether the two files match

Problems

•  Computationally expensive
•  No accurate (straightforward) clustering method
•  Not robust

Audio Fingerprinting

•  Short-duration signatures via feature extraction
•  Finds identical (or similar) matches of an unknown clip in a database (DB)
•  Hash fingerprints for fast search and retrieval
•  Shazam, SoundHound, Philips, Gracenote, etc.
•  See [Wang 2003] & [Haitsma and Kalker 2003]

Audio Fingerprinting for Multi-Camera

•  Slightly different problem
   – Group all clips in the DB (multiple matching)
   – Time-synchronize all clips within each group

•  Audio fingerprinting for multi-camera
   – The principle behind most methods yields a sync offset
   – Robust and fast!
   – Initial work over the last few years: [Shrestha et al. 2007] & [Kennedy and Naaman 2009]

Proposed Method

1.  Non-Linear Transform (Fingerprinting Step)
2.  Time-Difference-Of-Arrival Estimation
3.  Clustering
4.  Synchronization Refinement
5.  Efficient Computation



Non-Linear (Landmark) Transform

•  Convert the time-domain audio signal into a high-dimensional, sparse, binary landmark signal

[Figure: the landmark transform maps $x(t) \in \mathbb{R}$ to $L(t, h)$, where $L(t, \cdot) \in \{0, 1\}^N$ at each time $t$.]

Landmarks

•  Spectral peak pairs as landmarks [Wang 2003]
   – Short-time Fourier transform
   – Landmark = [f1, f2, Δt] + absolute time offset
   – Place each landmark in the appropriate location in $L(t, h)$

$L(t_1, h) = 1$ for $(t_1,\; h = [f^{1}_{t_1}, f^{2}_{t_2}, t_2 - t_1])$

Landmarks as Constellations

•  With a large number of peaks, peak pairs are created in a limited time-frequency range

[Figure: two panels, “Spectrogram” and “Spectral Peak Onsets”; axes: Time (secs) vs. Frequency (kHz).]
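The pairing of spectral peaks into landmarks can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `make_landmarks` and the limits `max_dt` and `max_df` (the size of the "limited time-frequency range") are assumptions introduced here, and peaks are assumed to be given as (time frame, frequency bin) integers.

```python
def make_landmarks(peaks, max_dt, max_df):
    """Pair each peak with later peaks inside a limited time-frequency range.

    peaks: iterable of (time_frame, freq_bin) tuples.
    Returns a list of ((f1, f2, dt), t1): the landmark key plus the
    absolute time offset of its first peak.
    max_dt and max_df are hypothetical range limits (frames, bins).
    """
    peaks = sorted(peaks)  # sort by time, then frequency
    landmarks = []
    for n, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[n + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break  # later peaks are even further away in time
            if dt > 0 and abs(f2 - f1) <= max_df:
                landmarks.append(((f1, f2, dt), t1))
    return landmarks
```

Limiting the range keeps the number of pairs per peak bounded, so the landmark signal stays sparse even when many peaks are detected.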


Simple Frequency Peak Detector

•  Short-time Fourier transform
•  Leaky integrator peak detector for each FFT bin

Leaky Peak Detector

if $|X(f)| > \lambda_f$
    $\lambda_f = |X(f)|$    // Peak Onset
else
    $\lambda_f = \lambda_f - (1 - e^{-1/(\tau_f f_s)})\,\lambda_f$

[Figure: “Leaky Peak Detector λf”; Amplitude vs. Time (sec); curves |X(f)| and λf, with detected peaks marked.]
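The leaky peak detector for a single FFT bin can be sketched in a few lines. This is an illustrative sketch only: the function name and arguments are assumptions, and `rate` stands in for the slide's f_s (taken here to be the rate at which the bin magnitude is updated).

```python
import math


def leaky_peak_onsets(magnitudes, tau, rate):
    """Return indices where |X(f)| exceeds the decaying threshold lambda_f.

    magnitudes: sequence of |X(f)| values for one FFT bin over time.
    tau: decay time constant in seconds (hypothetical parameter name).
    rate: update rate of the detector in Hz (assumed meaning of f_s).
    """
    leak = 1.0 - math.exp(-1.0 / (tau * rate))
    threshold = 0.0
    onsets = []
    for n, magnitude in enumerate(magnitudes):
        if magnitude > threshold:
            threshold = magnitude          # peak onset: reset the threshold
            onsets.append(n)
        else:
            threshold -= leak * threshold  # leak toward zero
    return onsets
```

A larger `tau` makes the threshold decay more slowly, so fewer onsets are reported after a strong peak.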



Time-Difference-Of-Arrival Estimation

•  Pairwise cross-correlation method
   – Correlate each track with every other track
   – Find the argmax for the offset
   – i.e., a matched filter

$R_{ij}(t) = \sum_{\tau=-\infty}^{\infty} x_i(\tau)\, x_j(t+\tau)$
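A direct, unoptimized sketch of this time-domain estimate, assuming finite signals that are zero outside their support. The function names are illustrative and not from the original work; a practical version would use an FFT-based correlation instead of the double loop.

```python
def cross_correlation(x_i, x_j):
    """Return {t: R_ij(t)} with R_ij(t) = sum over tau of x_i[tau] * x_j[t + tau]."""
    result = {}
    for t in range(-(len(x_i) - 1), len(x_j)):
        total = 0.0
        for tau, value in enumerate(x_i):
            if 0 <= t + tau < len(x_j):
                total += value * x_j[t + tau]
        result[t] = total
    return result


def estimate_offset(x_i, x_j):
    """Return the lag t that maximizes R_ij(t)."""
    r = cross_correlation(x_i, x_j)
    return max(r, key=r.get)
```

With this sign convention, an event at sample τ in clip i appears at sample t + τ in clip j, so a negative estimate means the event occurs earlier in clip j.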

Landmark Cross-Correlation

•  Landmark cross-correlation

$R_{L_i,L_j}(t) = \sum_{\tau=-\infty}^{\infty} L_i(\tau)^T L_j(t+\tau)$

•  Time-difference-of-arrival estimation

$\hat{t}_{ij} = \arg\max_t R_{L_i,L_j}(t)$
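Because each landmark vector is binary, the inner product of two of them is simply the number of landmarks they share. A minimal sketch under that observation, assuming each landmark signal is stored as a dictionary from time frame to the set of landmark indices h that are set to 1 (this storage layout and the function names are assumptions made for illustration):

```python
from collections import Counter


def landmark_cross_correlation(L_i, L_j):
    """Return a Counter mapping lag t to R_{Li,Lj}(t).

    L_i, L_j: dict mapping time frame -> set of active landmark indices.
    """
    r = Counter()
    for tau, active_i in L_i.items():
        for frame_j, active_j in L_j.items():
            shared = len(active_i & active_j)  # binary inner product
            if shared:
                r[frame_j - tau] += shared
    return r


def estimate_landmark_offset(L_i, L_j):
    """Return the lag with the most matching landmarks, or None if no match."""
    r = landmark_cross_correlation(L_i, L_j)
    return max(r, key=r.get) if r else None
```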

Time-Difference-Of-Arrival Estimation

[Figure: (a) normalized absolute time-domain cross-correlation $R_{x_i,x_j}(t)$; (b) normalized landmark cross-correlation $R_{L_i,L_j}(t)$.]



Clustering

•  Agglomerative clustering
   – Initialize each clip as a separate cluster, then merge into successively larger clusters
   – Merge the most confident matches first
      •  Confidence is a function of statistics from the best potential sync
      •  Reject unconfident merges based on decision rules

[Figure: clips A, B, C, D, E merged step by step into clusters {B, D} and {A, C, E}.]

Merge Decision Rules

•  Maximum of the correlation
•  Mean and variance of the cross-correlation
•  Percentage of total matching landmarks in the overlap region
•  Overall time range defined by the set of matching landmarks
•  Overlap region length
•  Ignore overly common landmarks (e.g., 60 Hz)
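The merge order can be illustrated with a small union-find sketch. It is a simplification under stated assumptions: a single hypothetical threshold `min_score` stands in for the full set of decision rules, and `pair_scores` is assumed to hold one confidence score per clip pair.

```python
def cluster_clips(clips, pair_scores, min_score):
    """Agglomerative clustering: merge the most confident pairs first.

    clips: iterable of clip ids.
    pair_scores: dict mapping (clip_a, clip_b) -> confidence score.
    min_score: hypothetical rejection threshold standing in for the
               merge decision rules.
    """
    parent = {clip: clip for clip in clips}

    def find(clip):
        while parent[clip] != clip:
            parent[clip] = parent[parent[clip]]  # path halving
            clip = parent[clip]
        return clip

    ranked = sorted(pair_scores.items(), key=lambda kv: kv[1], reverse=True)
    for (a, b), score in ranked:
        if score < min_score:
            break  # all remaining matches are less confident: reject
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for clip in parent:
        groups.setdefault(find(clip), []).append(clip)
    return sorted(sorted(group) for group in groups.values())
```

For example, scores of 30 for (A, C), 23 for (B, D), and 20 for (A, E) with a threshold of 10 give the groups {A, C, E} and {B, D}.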

Clustering Output

•  Groups with pairwise sync offsets and confidence scores

Group {B, D}

Offsets t (seconds)
       B       D
B      0     -11.5
D     11.5     0

Confidence score S
       B     D
B      -    23
D     23     -

Group {A, C, E}

Offsets t (seconds)
       A     C     E
A      0    -5    10
C      5     0     -
E    -10     -     0

Confidence score S
       A     C     E
A      -    30    20
C     30     -     -
E     20     -     -



Synchronization Refinement

•  Refinement is required for clusters of three or more clips if:
   1.  Pairwise TDOA estimates are inconsistent, i.e., they do not satisfy all triangle equalities within a cluster
   2.  One or more TDOA estimates within a cluster are unknown because the clips do not overlap

[Figure: (a) Case 1: $t_{AC} \neq t_{AB} + t_{BC}$ (slightly off); (b) Case 2: an unknown estimate implied by other estimates.]
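Both cases can be checked mechanically from a table of pairwise offsets. The helpers below are an illustrative sketch, not part of the original method; they assume offsets are antisymmetric (t_BA = -t_AB) and that missing pairs are simply absent from the dictionary.

```python
def get_offset(offsets, a, b):
    """Look up t_ab, using t_ba = -t_ab; return None if unknown."""
    if a == b:
        return 0.0
    if (a, b) in offsets:
        return offsets[(a, b)]
    if (b, a) in offsets:
        return -offsets[(b, a)]
    return None


def implied_offset(offsets, a, c, via):
    """Case 2: estimate t_ac as t_ab + t_bc through a third clip b = via."""
    t_ab = get_offset(offsets, a, via)
    t_bc = get_offset(offsets, via, c)
    if t_ab is None or t_bc is None:
        return None
    return t_ab + t_bc


def triangle_error(offsets, a, b, c):
    """Case 1: return |t_ac - (t_ab + t_bc)|, or None if an estimate is missing."""
    t_ac = get_offset(offsets, a, c)
    implied = implied_offset(offsets, a, c, via=b)
    if t_ac is None or implied is None:
        return None
    return abs(t_ac - implied)
```

With t_AC = -5 and t_AE = 10 known and t_CE unknown, the implied value is t_CE = t_CA + t_AE = 15.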

Greedy Match-and-Merge

1.  Find the most confident TDOA estimate $\hat{t}_{ij}$ within the cluster in terms of $R_{L_i,L_j}$ or a similar confidence score.

2.  Merge the landmark signals $L_i$ and $L_j$. First time-shift $L_j$ by $\hat{t}_{ij}$, then multiply or add the two signals together (depending on the desired effect).

3.  Update the remaining TDOA estimates and confidence scores to respect the file merge.

4.  Repeat until all files within the cluster are merged.
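The four steps can be sketched on sparse landmark lists. This is a simplified illustration under several assumptions: each file is a list of (landmark, time) pairs, the confidence score is the raw count of matching landmarks at the best lag, merging uses the "add" variant, and all function names are hypothetical.

```python
from collections import Counter


def best_offset(lm_i, lm_j):
    """Return (t_ij, votes): the lag with the most matching landmarks."""
    times_i = {}
    for h, t in lm_i:
        times_i.setdefault(h, []).append(t)
    votes = Counter()
    for h, t in lm_j:
        for t_i in times_i.get(h, ()):
            votes[t - t_i] += 1
    if not votes:
        return None, 0
    return max(votes.items(), key=lambda kv: kv[1])


def greedy_match_and_merge(landmarks):
    """landmarks: dict file_id -> list of (landmark, time).

    Returns dict: surviving group id -> {file_id: start time on the
    merged timeline}.
    """
    merged = {fid: list(lm) for fid, lm in landmarks.items()}
    shifts = {fid: {fid: 0} for fid in landmarks}
    while len(merged) > 1:
        best = None
        ids = sorted(merged)
        for n, i in enumerate(ids):          # step 1: most confident pair
            for j in ids[n + 1:]:
                t_ij, votes = best_offset(merged[i], merged[j])
                if votes and (best is None or votes > best[0]):
                    best = (votes, i, j, t_ij)
        if best is None:
            break                            # nothing left to match
        _, i, j, t_ij = best
        # Step 2: shift j onto i's timeline and add the two signals.
        merged[i].extend((h, t - t_ij) for h, t in merged.pop(j))
        for fid, shift in shifts.pop(j).items():
            shifts[i][fid] = shift - t_ij
        # Step 3 happens implicitly: offsets are recomputed next iteration.
    return shifts
```

Because the merged signal carries landmarks from every file already absorbed, a clip that overlaps only one member of the group can still be placed on the common timeline.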

Greedy Match-and-Merge Graphically

[Figure: (a) initial clusters A, B, C, D; (b) iteration 1: A, BC, D; (c) iteration 2: ABC, D; (d) iteration 3: ABCD.]



Efficient Computation

•  Leverage knowledge of the landmark signal and perform a “sparse” cross-correlation in a special way (fingerprinting)

•  Use some form of associative array, map, or dictionary to store landmarks and compute all pairwise correlations
   – Direct arrays
   – Binary tree
   – Hash table

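Map Structure I

•  Create a map structure of all landmarks
   – Key = (f1, f2, Δt)
   – Value = (FileID, AbsoluteTimeOffset)

•  Matching files will have identical landmarks
•  The difference between the AbsoluteTimeOffset values of matching landmarks gives the sync offset

[Figure: files A, B, C, D, E with landmark times t1, t2, t3, t4, t5; map entries hold the values {(A, t1), (E, t5)}, {(C, t3), (A, t1)}, and {(B, t2), (D, t4)}.]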

Map Structure II

•  Convert the map structure to pairwise correlations
•  For each landmark, compute all pairwise time differences and store them in the appropriate pairwise correlation

[Figure: map entries {(A, t1), (E, t5)}, {(C, t3), (A, t1)}, and {(B, t2), (D, t4)} become the pairwise correlations A vs. E (Δt = 4), C vs. A (Δt = 2), and B vs. D (Δt = 2).]
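The two map-structure steps can be sketched with a hash table. This is an illustrative sketch rather than the authors' C++ code: Python dictionaries play the role of the map, and the function names and the convention of ordering each file pair alphabetically are assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_landmark_map(files):
    """Map each key (f1, f2, dt) to a list of (file_id, absolute_time)."""
    index = defaultdict(list)
    for file_id, landmarks in files.items():
        for key, absolute_time in landmarks:
            index[key].append((file_id, absolute_time))
    return index


def pairwise_correlations(index):
    """Turn the map into sparse pairwise correlations.

    Returns dict: (file_a, file_b) -> Counter of time difference -> count,
    with file_a < file_b and time difference = time_b - time_a.
    """
    correlations = defaultdict(Counter)
    for entries in index.values():
        for (fa, ta), (fb, tb) in combinations(entries, 2):
            if fa == fb:
                continue  # same file: not a cross-file match
            if fa > fb:
                fa, ta, fb, tb = fb, tb, fa, ta
            correlations[(fa, fb)][tb - ta] += 1
    return correlations
```

File pairs that never share a landmark never appear in the result, which is where the savings over all-pairs correlation come from.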

General Computational Benefit

•  Naïve pairwise correlations
   1.  $\frac{P!}{2(P-2)!}$ pairwise correlations, $P$ = number of files
   2.  Each correlation is $O(N \log N)$, $N$ = samples in file

•  Drastically reduces the computational cost
   1.  Eliminates pairwise correlations for clips that don’t match
   2.  Makes each pairwise correlation faster

•  Computes the correlation for only the salient parts (landmarks) of the audio
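As a worked instance of the naïve pair count, using the clip counts of the two evaluation datasets (180 speech clips and 23 music clips):

```latex
\[
\frac{P!}{2\,(P-2)!} = \frac{P(P-1)}{2}, \qquad
P = 180 \;\Rightarrow\; 16{,}110 \text{ pairs}, \qquad
P = 23 \;\Rightarrow\; 253 \text{ pairs}.
\]
```

The count grows quadratically in the number of files, which is why skipping non-matching pairs matters most for large collections.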

Ideal Case

1.  Pairwise comparisons
   – All landmarks are unique to their group
   – Only performs pairwise correlations within each group
   – For many groups with few clips each, the savings are huge

2.  Single pairwise correlation
   – Only correlate points with matching landmarks; no computation for 0s
   – The ideal case with no false-positive matches results in an $O(M)$ cost, with $M$ = number of matching landmarks



Evaluation Metrics

•  Performance measures
   – Precision, recall, F1 score
   – Computed on pairwise matches of the final clusters

•  Computational cost
   – Compute time (seconds)
   – Throughput (seconds processed/seconds of compute time)

•  Benchmark
   – Comparison to the commercial multi-camera software Plural Eyes

Precision, Recall, and F1

•  Precision
   – Fraction of estimated pairwise merges that are correct

•  Recall
   – Fraction of correct pairwise merges retrieved

•  F1 score
   – Harmonic mean of precision and recall, $2PR/(P + R)$

[Figure: estimated clusters vs. ground truth clusters for clips A, B, C, D, E.]

Estimated pairwise merges: A-E, A-B, A-D, E-B, E-D, B-D

Ground truth pairwise merges: A-E, A-C, C-E, B-D

P = 2/6, R = 2/5, F1 = 8/22
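The pairwise metrics can be computed directly from two clusterings. The sketch below uses the plain pairwise definition, in which single-file clusters contribute no pairs; whether the original evaluation treats such clusters differently is not stated here, so that choice is an assumption, as are the function names.

```python
from itertools import combinations


def cluster_pairs(clusters):
    """Return the set of unordered file pairs that share a cluster."""
    return {
        frozenset(pair)
        for cluster in clusters
        for pair in combinations(sorted(cluster), 2)
    }


def pairwise_scores(estimated, truth):
    """Return (precision, recall, f1) computed over pairwise merges."""
    est_pairs = cluster_pairs(estimated)
    true_pairs = cluster_pairs(truth)
    correct = len(est_pairs & true_pairs)
    precision = correct / len(est_pairs) if est_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For a separate toy case, estimating {A, B, C}, {D, E} against a ground truth of {A, B}, {C, D, E} yields four estimated pairs, four true pairs, and two in common, so all three scores are 0.5.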

Datasets

•  Speech (180 clips from a film set)
   – Average length 20-40 seconds
   – 54 clusters of one file
   – 54 clusters of two files
   – 6 clusters of three files

•  Music (23 clips from live music concerts)
   – Average length 3-5 minutes
   – 1 cluster of 7 files
   – 2 clusters of 8 files

All audio files are downsampled to a common sample rate of 8 kHz for efficiency.

Precision, Recall, and F1 Results

•  As expected from using the feature extraction of [Wang 2003]

Computational Cost

(a) Computation time (s).

              Speech    Music    Speech + Music
Proposed        47.0     41.1              90.1
Traditional     1550      197              3600

(b) Throughput (s/s).

              Speech    Music    Speech + Music
Proposed       164.6    146.5             152.7
Traditional      5.0     30.5               3.9

Annotations: proposed ≈ linear; traditional not linear.

Rough timing on a MacBook Pro, OS X 10.6.8, 2.66 GHz Intel Core i7, unoptimized C++.

Benchmark (Speech Dataset)

•  Accuracy measures
   – Proposed method: F1 ≈ 99%
   – Plural Eyes 2.1.0: F1 ≈ 95%

•  Computational cost
   – Proposed method ≈ 3 minutes
   – Plural Eyes 1.2.0 ≈ 6 hours
   – Plural Eyes 2.1.0 ≈ 2 hours
   – Plural Eyes 2.1.0 (hard) ≈ 10 hours



Future Work & Research Directions

•  Video analog to photo “stitching”
   – Crowd-sourced multi-camera video
   – Easily change both video and audio viewpoint

•  Denoising/improving audio quality from groups

•  Spatial audio processing
   – Use for time delay estimation
   – Large-scale beamforming, directional listening, etc.

Conclusions

•  Method for clustering and synchronizing multi-camera videos using audio
   – Non-Linear Transform
   – Time-Difference-Of-Arrival Estimation
   – Clustering
   – Synchronization Refinement
   – Efficient Computation

•  Fast and accurate

References

•  J. Haitsma and T. Kalker, “A Highly Robust Audio Fingerprinting System With an Efficient Search Strategy,” Journal of New Music Research, vol. 32, no. 2, 2003.

•  A. L. Wang, “An Industrial-Strength Audio Search Algorithm,” in Proc. 4th Int. Symposium on Music Information Retrieval (ISMIR), October 2003.

•  P. Shrestha, M. Barbieri, and H. Weda, “Synchronization of multi-camera video recordings based on audio,” in Proc. 15th Intl. Conf. on Multimedia, 2007.

•  L. Kennedy and M. Naaman, “Less talk, more rock: automated organization of community-contributed collections of concert videos,” in Proc. 18th Int. Conf. on World Wide Web, 2009.

•  D. Ellis, “Robust Landmark-Based Audio Fingerprinting,” 2009, http://labrosa.ee.columbia.edu/matlab/fingerprint

Demo Video

•  Dave Matthews Band’s “Everyday”
