Automatic Transcription System of Kashino et al. MUMT 611 Doug Van Nort


Objective

• To give an overview of this particular technique for automatic transcription
  – Original implementation:
    • ICMC 1993

Introduction

• Sound Source Separation System
  – Extracting a sound source in the presence of multiple sources
• Physical vs. Perceptual sound source
  – Physical: actual source itself
  – Perceptual: what humans hear as a single source
    • Ex: Piano, Loudspeaker

Perceptual Sound Source Separation

• Creating system which simulates human perceptual system

• Extraction of parameters based on perceptual model, grouping of parameters based on certain criteria

This PSSS System

• Kashino et al. – U. of Tokyo

• OPTIMA: Organized Processing Towards Intelligent Music Scene Analysis

• First to use human auditory separation rules

This PSSS System

• Suppose: Input = mono audio signal, output = multiple MIDI channels (and graphic display)
• Given signal S(t), comprised of a mix of M sound sources
  – Assume S(t) = {F1(t), …, FL(t)}
    • Where Fj(t) = {pj(t), fj(t), psij(t)}
      – pj = power of spectral peak
      – fj = freq of spectral peak
      – psij = bandwidth of spectral peak
• Wish to:
  – Extract Fj(t) from S(t)
  – Cluster Fj(t) into groups which (ultimately) represent different sound sources

System Overview

• Extraction of Frequency Components
  – Analysis first taken
    • All signals are 16 bit / 48 kHz
    • Bank of 2nd-order IIR bandpass filters (log freq scale) implemented
  – Peak Selection/Tracking:
    • “Pinching plane” method
      – Regression planes, calculated via least squares
        » In other words, minimization of the sum of squares in the z direction (power), leaving x and y (time and freq) fixed
        » Normal vector for each plane calculated. Angle between the normals gives psij(t), direction vector gives fj(t), pj(t)
      – First regression plane analysis sets the threshold by which other potential peaks are measured
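The least-squares step above can be sketched as follows. This is only a minimal illustration, not the OPTIMA implementation: the function names are hypothetical, and fitting one plane to each flank of a peak along the frequency axis is an assumption about how the two planes are placed. It fits z = a*x + b*y + c by minimizing error in z only, then measures the angle between two plane normals.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_plane(points):
    """Least-squares fit of z = a*x + b*y + c to (x, y, z) samples.

    Only the residual in z (power) is minimized; x (time) and
    y (frequency) are treated as fixed.
    """
    n = float(len(points))
    sx = sum(x for x, _, _ in points)
    sy = sum(y for _, y, _ in points)
    sz = sum(z for _, _, z in points)
    sxx = sum(x * x for x, _, _ in points)
    sxy = sum(x * y for x, y, _ in points)
    syy = sum(y * y for _, y, _ in points)
    sxz = sum(x * z for x, _, z in points)
    syz = sum(y * z for _, y, z in points)

    # Normal equations for the unknowns (a, b, c).
    lhs = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxz, syz, sz]
    d = det3(lhs)
    if abs(d) < 1e-12:
        raise ValueError("points do not determine a unique plane")

    # Cramer's rule: replace one column at a time with the right-hand side.
    solution = []
    for col in range(3):
        m = [row[:] for row in lhs]
        for r in range(3):
            m[r][col] = rhs[r]
        solution.append(det3(m) / d)
    return tuple(solution)


def plane_normal(a, b):
    """Normal vector of the plane a*x + b*y - z + c = 0."""
    return (a, b, -1.0)


def angle_between(n1, n2):
    """Angle in radians between two normal vectors."""
    dot = sum(u * v for u, v in zip(n1, n2))
    norms = math.sqrt(sum(u * u for u in n1)) * math.sqrt(sum(v * v for v in n2))
    return math.acos(max(-1.0, min(1.0, dot / norms)))
```

In this sketch a wider angle between the two normals corresponds to steeper flanks, which is one way such an angle could relate to the bandwidth psij(t); the exact mapping used in the original system is not given in these slides.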

Pinching Planes

Bottom Up Clustering of Freq Components

• Grouping freq components based on perceptual criteria
• Goal is to group sounds humans hear as one
• Calculations made for harmonic mistuning and onset asynchrony between pairwise freq components, then evaluated for probability of auditory separation
  – Probability functions based on approximations of psychoacoustic experiments
• Given prob functions p1 and p2, the integrated prob of auditory separation is given by m = 1-(1-p1)(1-p2)
  – This follows from Dempster's rule of combination
  – m is used as the distance measure in clustering
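A minimal sketch of the combination m = 1-(1-p1)(1-p2), assuming p1 and p2 are already available as pairwise probabilities in [0, 1]. The function names and the nested-list matrix layout are illustrative choices, not taken from the original system.

```python
def integrated_separation_probability(p1, p2):
    """Combine two separation probabilities as m = 1 - (1 - p1)(1 - p2)."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return 1.0 - (1.0 - p1) * (1.0 - p2)


def separation_distance_matrix(mistuning_probs, asynchrony_probs):
    """Build the pairwise distance matrix m[i][j] used for clustering.

    Both inputs are square nested lists: entry [i][j] is the probability
    that components i and j are heard as separate, judged by harmonic
    mistuning and by onset asynchrony respectively.
    """
    n = len(mistuning_probs)
    return [
        [
            integrated_separation_probability(
                mistuning_probs[i][j], asynchrony_probs[i][j]
            )
            for j in range(n)
        ]
        for i in range(n)
    ]
```

Either cue alone can push m toward 1: if one probability is 1, m is 1 regardless of the other, and m is 0 only when both cues indicate no separation.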

Clustering for Source Identification

• Identify sound sources by global characteristics of clusters
  – Goal is to group sounds based on same source (thus uses direct signal attributes apart from any psychoacoustic metric of determination)
  – If a cluster contains a single note we’re good

Clustering for Source Identification

• Uses distance function to determine source
  – D = c1fp + c2fq + c3ta + c4ts
    • Where:
      – fp = peak power ratio of second harmonic to fundamental component
      – fq = peak power ratio of third harmonic to fundamental component
      – ta = attack time
      – ts = sustain time
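A rough sketch of how a weighted distance of this form might be evaluated. Two things here are assumptions rather than statements from the slides: that c1 to c4 are tunable weights, and that each term is the absolute difference of the corresponding feature between two clusters. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClusterFeatures:
    """Global features of one cluster (names follow the slide's symbols)."""
    fp: float  # peak power ratio, second harmonic / fundamental
    fq: float  # peak power ratio, third harmonic / fundamental
    ta: float  # attack time
    ts: float  # sustain time


def source_distance(a, b, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted distance D = c1*fp + c2*fq + c3*ta + c4*ts.

    ASSUMPTION: each feature term is taken as the absolute difference
    between the two clusters; the slides do not spell this out.
    """
    c1, c2, c3, c4 = weights
    return (c1 * abs(a.fp - b.fp)
            + c2 * abs(a.fq - b.fq)
            + c3 * abs(a.ta - b.ta)
            + c4 * abs(a.ts - b.ts))
```

Clusters with a small D would then be attributed to the same source. The weights are among the free parameters that need tuning.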

Tone Model Based Processing

• Unit of input is a “processing scope”
  – Processing scope consists of one cluster, or several if they share a freq component
  – A tone model is a 2D matrix with each row being a freq component over time (columns represent time). Each element is a 2D vector of normalized power and freq.
  – “Mixture hypotheses” generated for each tone model, and matched with a processing scope to find the closest fit
  – Distance function minimizes power difference at given time/freq location
  – Effective in recognizing chords
  – But, is model based
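The matching idea can be illustrated with a simplified sketch. It assumes the processing scope and every tone model are already time-aligned matrices of the same shape, with each element a (normalized power, freq) pair, and it uses a sum of squared power differences as the distance. Both choices are assumptions for illustration, and the generation of mixture hypotheses is left out entirely.

```python
def power_distance(scope, model):
    """Sum of squared power differences at matching time/freq locations.

    Rows are frequency components, columns are time frames, and each
    element is a (normalized_power, frequency) pair.
    """
    if len(scope) != len(model) or any(
        len(scope_row) != len(model_row)
        for scope_row, model_row in zip(scope, model)
    ):
        raise ValueError("scope and model must have the same shape")
    return sum(
        (scope_power - model_power) ** 2
        for scope_row, model_row in zip(scope, model)
        for (scope_power, _), (model_power, _) in zip(scope_row, model_row)
    )


def closest_tone_model(scope, models):
    """Return the name of the tone model with the smallest power distance."""
    return min(models, key=lambda name: power_distance(scope, models[name]))
```

With a dictionary such as `{"flute": ..., "piano": ...}` (hypothetical labels), `closest_tone_model` returns the label whose power profile best fits the scope.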

Automatic Tone Modeling

• Automatic acquisition of tone models from analysed signal
  – Based on "old-plus-new heuristic" [Bregman 90]
    • If part of a complex sound can be interpreted as a continuation of an old sound, it is; what remains is perceived as a new sound

Hierarchy of Perceptual Sound Events

A Few Problems and Limitations

• Octave = no good
• Psychoacoustic Models
  – Not tested over large enough group
• Detuning
  – May not leave enough space for variance in real instruments (2.6% in prob function)
• Lots of free parameters
  – Seemingly a lot of tuning involved

Conclusion

• Works well for 3-note polyphony
  – Anssi Klapuri claim: 18 note range, works for flute, piano, trumpet
• Groundbreaking in that it used a perceptual system model
  – Based on auditory scene analysis
• Lots of free parameters
  – Seemingly a lot of tuning involved
