_________________________ Speech and Music Discrimination using Gaussian Mixture Model
Seminar Program
Project Team: Dr. Deep Sen (Supervisor); CHOI Arthur, Tsz Kin (3015809); CHENG Derek, Ka Chun (3015631)

_________________________ Speech and Music Discrimination using GMM

_________________________ Motivations
- Much research has been done on HMMs, but comparatively little on GMMs
- GMMs reduce complexity compared with HMMs
- Our feature extraction methods will reduce complexity
- Search and storage of multimedia files are still under development
- Fits the University requirement

_________________________ Applications
- Audio database indexing
- Automatic bandwidth allocation
- Broadcast browsing
- Intelligent signal processing
- Intelligent audio coding
- Audio file compression
- Audio clip editing

_________________________ Approaches
- Deterministic signals can be analysed as completely specified functions of time
- Non-deterministic signals must be analysed probabilistically
[Tele3013 notes]

_________________________ Procedures
1. Read a signal
2. Segment it into small frames
3. Extract features from each frame
4. Classify each frame

_________________________ Feature Extraction

_________________________ Classification

_________________________ [Figure labels: silence, music, speech]

_________________________ Segmentation
Reasons
- Get a better estimation result
- Achieve real-time behavior
Problems and solutions
- Frames too big: classification accuracy decreases
- Frames too small: feature extraction accuracy decreases
- Chosen frame size: ~20 ms
[Figure: music signal]
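The segmentation step can be sketched as follows. This is a minimal illustration, assuming non-overlapping frames and an 8 kHz sampling rate; the function name `split_into_frames` and the test signal are hypothetical and not taken from the project code.

```python
def split_into_frames(samples, sample_rate, frame_ms=20.0):
    """Split a sequence of samples into consecutive non-overlapping frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len  # the trailing partial frame is dropped
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# Hypothetical input: one second of silence sampled at 8 kHz.
signal = [0.0] * 8000
frames = split_into_frames(signal, sample_rate=8000)
print(len(frames), len(frames[0]))  # 50 160
```

Overlapping frames or a window function could be added in the same place if the features need smoother estimates.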

_________________________ 4 Hz modulation energy
Speech energy has a characteristic energy modulation peak around the 4 Hz syllabic rate. [Houtgast & Steeneken 1985]
Reasons
- Accurately separates speech signals from music signals (~94%)
- Easy to implement in Matlab
- Novel and robust
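One simple way to estimate this feature is sketched below. It is an assumption-laden illustration rather than the method used in the project: the short-time energy envelope is computed on 20 ms frames, its mean is removed, and a single-bin DFT measures the share of envelope energy at 4 Hz. All function names and the amplitude-modulated test tones are hypothetical.

```python
import math


def frame_energies(samples, frame_len):
    """Short-time energy of consecutive non-overlapping frames."""
    n_frames = len(samples) // frame_len
    return [sum(s * s for s in samples[i * frame_len:(i + 1) * frame_len])
            for i in range(n_frames)]


def modulation_energy_ratio(envelope, frame_rate, mod_freq=4.0):
    """Share of the mean-removed envelope energy found at mod_freq (single-bin DFT)."""
    mean = sum(envelope) / len(envelope)
    centred = [e - mean for e in envelope]
    total = sum(c * c for c in centred)
    if total == 0.0:
        return 0.0
    step = 2.0 * math.pi * mod_freq / frame_rate
    re = sum(c * math.cos(step * n) for n, c in enumerate(centred))
    im = sum(c * math.sin(step * n) for n, c in enumerate(centred))
    return 2.0 * (re * re + im * im) / (len(centred) * total)


def am_tone(mod_freq, sample_rate=8000, seconds=2.0, carrier=440.0):
    """Hypothetical test signal: a tone whose amplitude is modulated at mod_freq."""
    n = int(sample_rate * seconds)
    return [(1.0 + 0.8 * math.cos(2.0 * math.pi * mod_freq * t / sample_rate))
            * math.sin(2.0 * math.pi * carrier * t / sample_rate) for t in range(n)]


FRAME_LEN = 160  # 20 ms at 8 kHz, so the envelope is sampled at 50 Hz
ratio_4hz = modulation_energy_ratio(frame_energies(am_tone(4.0), FRAME_LEN), 50.0)
ratio_10hz = modulation_energy_ratio(frame_energies(am_tone(10.0), FRAME_LEN), 50.0)
print(ratio_4hz > ratio_10hz)
```

A band-pass filter centred on 4 Hz applied to the envelope would be a common alternative to the single-bin DFT used here.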

_________________________ [Figure: music signal and speech signal]

_________________________ [Figure: energy vs. time for a music signal and a speech signal]

_________________________ Zero-Crossing Count (ZCC)
The zero-crossing count is the total number of times that a signal crosses the x-axis over a given time interval.
- Speech signals: high ZCC
- Music signals: low ZCC
Reasons
- The ZCC of a speech signal is significantly higher than that of a music signal
- Very easy to implement in Matlab
- Mature and robust
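A minimal sketch of the zero-crossing count for one frame is shown below. It assumes that a sample equal to zero is treated as positive, which is one convention among several; the function name is hypothetical.

```python
def zero_crossing_count(frame):
    """Count sign changes between consecutive samples (zero counts as positive)."""
    signs = [s >= 0 for s in frame]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


frame = [0.5, -0.2, -0.4, 0.1, 0.3, -0.6]
print(zero_crossing_count(frame))  # 3
```

Dividing the count by the frame duration gives a zero-crossing rate that can be compared across frame sizes.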


_________________________ Spectral Roll-off Point
The spectral roll-off point measures the skewness of the spectrum.
Reasons
- Music usually has more energy in the high-frequency range
- Useful later for separating different kinds of speech

_________________________ Spectral Roll-off Point
Spectral roll-off point = SR, where [defining equation not preserved]
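A commonly used definition, assumed here because the slide's own equation is not available, takes SR as the smallest frequency bin below which a fixed fraction of the total spectral power lies. The sketch below uses a fraction of 0.95, which is an assumption, together with a naive DFT that is only suitable for short frames. The function names are hypothetical.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (adequate for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def spectral_rolloff_bin(power, fraction=0.95):
    """Smallest bin index at which the cumulative power reaches `fraction` of the total."""
    target = fraction * sum(power)
    running = 0.0
    for k, p in enumerate(power):
        running += p
        if running >= target:
            return k
    return len(power) - 1


N = 64
low_tone = [math.sin(2.0 * math.pi * 4 * t / N) for t in range(N)]
two_tones = [math.sin(2.0 * math.pi * 4 * t / N) + math.sin(2.0 * math.pi * 20 * t / N)
             for t in range(N)]
print(spectral_rolloff_bin(power_spectrum(low_tone)))   # 4
print(spectral_rolloff_bin(power_spectrum(two_tones)))  # 20
```

The bin index converts to a frequency in Hz as `k * sample_rate / N`.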

_________________________ [Figure: power vs. frequency for a music signal and a speech signal]

_________________________ Entropy Modulation
Music appears to be more ordered than a speech signal. [J. Pinquier, J.L. Rouas, R. Andre-Obrecht 2002]
- Higher entropy means less order
- Higher dynamism means a higher rate of change
Reasons
- Accurately separates speech signals from music signals (~90%)
- Novel and robust

_________________________ [Figure: music signal and speech signal]

_________________________ [J. Ajmera, I.A. McCowan, H. Bourlard 2002]

_________________________ [Slide labels: instantaneous entropy; average entropy; average instantaneous entropy]
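The entropy quantities named above can be illustrated with a short sketch. It assumes that each frame is described by a probability vector (for example class posteriors) and that the average is taken over a trailing window of frames; both choices, and the function names, are assumptions made for illustration only.

```python
import math


def instantaneous_entropy(probs):
    """Shannon entropy (in bits) of one probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def average_entropy(frames_probs, window):
    """Mean instantaneous entropy over a trailing window of frames."""
    h = [instantaneous_entropy(p) for p in frames_probs]
    averaged = []
    for n in range(len(h)):
        start = max(0, n - window + 1)
        averaged.append(sum(h[start:n + 1]) / (n + 1 - start))
    return averaged


print(instantaneous_entropy([0.25, 0.25, 0.25, 0.25]))        # 2.0
print(average_entropy([[0.5, 0.5], [1.0, 0.0]], window=2))    # [1.0, 0.5]
```

A uniform probability vector gives the maximum entropy and a one-hot vector gives zero, which matches the reading that higher entropy corresponds to less order.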

_________________________ Pulse Metric
The beat of a piece of music is one of its clearest features. [K.D. Martin, E.D. Scheirer, B.L. Vercoe 1998]

_________________________ Other Features
- Spectral centroid
- Spectral flux
- Silence ratio
- Short-time energy ratio
- Volume dynamic change
- Number of segments
- Segment duration
- etc.

_________________________ Introduction to Gaussian Mixture Model (GMM)
- Differentiates speech and music from a sound source
- Used for speech processing, mostly for speech recognition, speaker identification and voice conversion
- Models densities and represents general spectral features

Why did we choose GMM?
- Low complexity
- Rate independence
- Bit scalability
- Short computation time

What is a Gaussian Mixture Model?
- A Gaussian Mixture Model consists of a set of local Gaussian models and an integrating network
- Different Gaussian distributions represent different domains of the feature space and have different output characteristics
- A GMM describes a complex system using a combination of all the Gaussian clusters, instead of using a single model

Gaussian mixtures or clusters
- Used to describe a complex system instead of using a single model
- Represent a dataset by a set of means and covariances

Gaussian Mixture Model
A Gaussian Mixture Model is represented by:
$$p(\mathbf{x}) = \sum_{i=1}^{M} w_i \, b_i(\mathbf{x})$$
where
- $\mathbf{x}$ is the P-dimensional input vector
- $w_i$, $i = 1, \dots, M$, are the mixture weights
- $b_i(\mathbf{x})$ are the component densities

Clustering
- Clustering is a technique from pattern classification
- It is a technique for grouping samples
- Each P-dimensional feature vector is considered as a point in space, and all points near one another are clustered together

Clustering
[Figure] Grey circles represent the variance of each distribution

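Gaussian component density
Each component density is a P-variate Gaussian function of the form:
$$b_i(\mathbf{x}) = \frac{1}{(2\pi)^{P/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^{T} \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right)$$
where
- $\boldsymbol{\mu}_i$ is the mean vector
- $\Sigma_i$ is the covariance matrix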

Covariance matrix
- Indicates the dispersion of the distribution
- In mathematics, it is defined as the matrix whose $ij$-th element is the covariance of $x_i$ and $x_j$, $i, j = 1, \dots, d$

Covariance matrix
- The diagonal components of the covariance matrix are the variances of the individual random variables
- The off-diagonal components are the covariances of two random variables, $x_i$ and $x_j$
- It is a symmetric matrix
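These properties can be checked on a small example. The sketch below estimates a sample covariance matrix (dividing by N - 1, which is an assumption about the estimator) from a list of d-dimensional vectors; the function name and the data are hypothetical.

```python
def covariance_matrix(samples):
    """Sample covariance matrix (divides by N - 1) of a list of d-dimensional vectors."""
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are needed")
    d = len(samples[0])
    means = [sum(x[i] for x in samples) / n for i in range(d)]
    return [[sum((x[i] - means[i]) * (x[j] - means[j]) for x in samples) / (n - 1)
             for j in range(d)]
            for i in range(d)]


cov = covariance_matrix([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
print(cov)  # [[1.0, 2.0], [2.0, 4.0]]
```

The diagonal holds the two variances (1.0 and 4.0) and the off-diagonal entries are equal, as expected for a symmetric matrix.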

Full covariance matrix
- The most powerful Gaussian model, as it fits the data best
Drawbacks
- Needs a lot of data to estimate its parameters
- Costly in high-dimensional feature spaces

Diagonal covariance matrix
- Good compromise between quality and model size
- Gaussian components can act together to model the overall probability density function
- A combination of diagonal-covariance components is capable of modelling the correlations between feature vector elements
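With a diagonal covariance matrix the determinant is the product of the variances and the quadratic form is a sum of per-dimension terms, so the density can be evaluated in time linear in the dimension. The following is a minimal sketch under that assumption; the function name is hypothetical.

```python
import math


def log_gaussian_diag(x, mean, variances):
    """Log of a P-variate Gaussian density with a diagonal covariance matrix."""
    log_det = sum(math.log(v) for v in variances)
    mahalanobis = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mean, variances))
    return -0.5 * (len(x) * math.log(2.0 * math.pi) + log_det + mahalanobis)


# At the mean of a 2-D unit-variance Gaussian the density equals 1 / (2 * pi).
density_at_mean = math.exp(log_gaussian_diag([0.0, 0.0], [0.0, 0.0], [1.0, 1.0]))
print(density_at_mean)
```

Working in the log domain avoids numerical underflow when the feature dimension is large.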

Review of the Gaussian mixture density
- The mixture weights must satisfy the conditions $w_i \ge 0$ and $\sum_{i=1}^{M} w_i = 1$
- Three sets of parameters compose the Gaussian mixture density: mean vectors, covariance matrices and mixture weights
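A mixture density built from these three parameter sets can be sketched as follows. The sketch assumes diagonal covariance matrices (stored as lists of variances) and checks the weight constraint before evaluating; the function names and parameter values are hypothetical.

```python
import math


def gaussian_diag(x, mean, variances):
    """P-variate Gaussian density with a diagonal covariance matrix."""
    norm = math.sqrt(math.prod(2.0 * math.pi * v for v in variances))
    exponent = -0.5 * sum((a - m) ** 2 / v for a, m, v in zip(x, mean, variances))
    return math.exp(exponent) / norm


def gmm_density(x, weights, means, variances):
    """Weighted sum of component densities; weights must be >= 0 and sum to 1."""
    if any(w < 0.0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must be non-negative and sum to 1")
    return sum(w * gaussian_diag(x, m, v) for w, m, v in zip(weights, means, variances))


# Two identical components with equal weights reduce to a single Gaussian.
value = gmm_density([0.0, 0.0], [0.5, 0.5],
                    [[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]])
print(value)
```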

Expectation-maximization (EM)
- Estimates the mean vectors, covariance matrices and mixture weights
- Iteratively updates the distribution of each Gaussian component and the conditional probabilities

Idea of Expectation-maximization
Instead of starting with a random configuration of all components and improving upon this configuration with expectation-maximization, we start with the optimal one-component mixture. We then repeat two steps until convergence:
i) Insert a new component
ii) Apply EM until convergence

Convergence Theorem
Because the sequence of likelihoods is monotonically increasing and bounded, the likelihood converges to a local maximum.
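The monotonic behavior can be observed numerically with a small EM loop. The sketch below is an illustration under stated assumptions, not the project's implementation: it fits a one-dimensional, two-component GMM to synthetic data, and the function names, initial values and data are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_step(data, weights, means, variances):
    """One EM iteration for a 1-D GMM.

    Returns the updated parameters and the log-likelihood of the data
    under the parameters that were passed in.
    """
    k = len(weights)
    resp_sum = [0.0] * k
    x_sum = [0.0] * k
    xx_sum = [0.0] * k
    log_lik = 0.0
    for x in data:
        # E-step: responsibility of each component for this sample.
        joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        log_lik += math.log(total)
        for i in range(k):
            r = joint[i] / total
            resp_sum[i] += r
            x_sum[i] += r * x
            xx_sum[i] += r * x * x
    # M-step: re-estimate weights, means and variances from the responsibilities.
    new_weights = [r / len(data) for r in resp_sum]
    new_means = [sx / r for sx, r in zip(x_sum, resp_sum)]
    new_vars = [sxx / r - m * m for sxx, r, m in zip(xx_sum, resp_sum, new_means)]
    return new_weights, new_means, new_vars, log_lik


random.seed(0)
data = ([random.gauss(-2.0, 1.0) for _ in range(100)]
        + [random.gauss(3.0, 1.0) for _ in range(100)])
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
log_liks = []
for _ in range(30):
    weights, means, variances, ll = em_step(data, weights, means, variances)
    log_liks.append(ll)
print(all(b >= a - 1e-9 for a, b in zip(log_liks, log_liks[1:])))
```

A practical implementation would usually add a variance floor and stop once the log-likelihood gain falls below a threshold.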

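EM algorithm
Let $L_k$ denote the log-likelihood of the dataset under the k-component mixture $f_k$.
1. Compute the optimal one-component mixture $f_1$. Set $k = 1$.
2. Find the optimal new component $\phi(\mathbf{x}; \theta^{*})$ and the corresponding mixing weight $\alpha^{*}$,
$$\{\theta^{*}, \alpha^{*}\} = \arg\max_{\theta, \alpha} \sum_{n=1}^{N} \log\left[ (1 - \alpha) f_k(\mathbf{x}_n) + \alpha \, \phi(\mathbf{x}_n; \theta) \right]$$
while keeping $f_k$ fixed.
3. Set $f_{k+1}(\mathbf{x}) = (1 - \alpha^{*}) f_k(\mathbf{x}) + \alpha^{*} \phi(\mathbf{x}; \theta^{*})$ and $k = k + 1$.
4. Update $f_k$ using EM until convergence.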
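Step 3 can be illustrated with a short sketch. It assumes the usual mixing rule in which the existing weights are scaled by (1 - alpha) and the new component receives weight alpha, so the weights still sum to one. The function name, the one-dimensional parameters and the value of alpha are hypothetical.

```python
def insert_component(weights, means, variances, new_mean, new_variance, alpha):
    """Add one component to a 1-D mixture, rescaling the old weights by (1 - alpha)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    new_weights = [(1.0 - alpha) * w for w in weights] + [alpha]
    return new_weights, means + [new_mean], variances + [new_variance]


weights, means, variances = insert_component(
    [1.0], [0.0], [4.0], new_mean=3.0, new_variance=1.0, alpha=0.25)
print(weights)  # [0.75, 0.25]
```

After the insertion, all parameters of the enlarged mixture would be refined with EM as in step 4.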

Speech/music discrimination using GMM
An interesting feature of the GMM is that the component densities of the mixture may represent:
- Different phonetic events, when modelling speech
- Different portions of the sound, when modelling the spectra of sound from musical instruments
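One way to use two such models for discrimination is to label each frame with the class whose GMM gives the higher likelihood. The sketch below assumes a single scalar feature per frame and two already-trained models; the parameter values, names and decision rule are hypothetical illustrations, not results from this project.

```python
import math


def gmm_log_density(x, gmm):
    """Log-density of a 1-D GMM given as a list of (weight, mean, variance) tuples."""
    return math.log(sum(w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
                        for w, m, v in gmm))


def classify_frames(features, speech_gmm, music_gmm):
    """Label each frame with the class whose model gives the higher log-density."""
    labels = []
    for x in features:
        is_speech = gmm_log_density(x, speech_gmm) > gmm_log_density(x, music_gmm)
        labels.append("speech" if is_speech else "music")
    return labels


# Hypothetical model parameters for one scalar feature.
speech_gmm = [(0.6, 2.0, 0.5), (0.4, 3.5, 1.0)]
music_gmm = [(1.0, -1.0, 1.0)]
labels = classify_frames([2.2, -0.8, 3.9], speech_gmm, music_gmm)
print(labels)  # ['speech', 'music', 'speech']
```

Summing the log-densities over several consecutive frames before comparing would give a smoother decision than labelling each frame independently.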

Achievements
- Identified an optimized frame size
- Obtained robust features
- Performed a few tests
- Implemented some Matlab code
- Studied Gaussian Mixture Models (GMMs) and some of their mathematical expressions

Next year planning
- Comprehensive and more in-depth research on GMMs
- Model the sound source based on GMMs
- Evaluate the effect of noise
- Matlab implementation for speech/music separation

Next year planning
- Investigate a novel classification method: Support Vector Machine (SVM)
- Differentiate male and female speech
- Differentiate classical and non-classical music
- Produce a final thesis report

_________________________ Resources
- Internet, Microsoft Sound Recorder, Matlab
- Neural Networks for Pattern Recognition (Bishop 1996)
- Processing and Perception of Speech and Music (Morgan 2000)
- Research papers

_________________________ Management Plan
- Dec-Feb 04: Matlab implementations; investigate noise effect; research on Support Vector Machine; experiments
- Jan 05: Separating classical and non-classical music
- Feb 05: Separating male and female speech
- Mar-Jun 05: Separate chamber music and orchestral music; separate baby speech (if time permits)

Ben Gold, Nelson Morgan, Speech and Audio Signal Processing: Processing and Perception of Speech and Music (2000), John Wiley & Sons, Inc., USA.
Joseph F. Hair, Jr., Rolph E. Anderson, Ronald L. Tatham, William C. Black, Multivariate Data Analysis, 4th Edition (1995), Prentice-Hall International, Inc., USA.
Keinosuke Fukunaga, Computer Science and Scientific Computing: Introduction to Statistical Pattern Recognition, 2nd Edition (1990), Academic Press, Inc., California, USA. ISBN 0-12-269851-7
Marty J. Schmidts, Understanding and Using Statistics (1975), D.C. Heath and Company, Canada. ISBN 0-669-94490-4
Norman L. Johnson, Samuel Kotz, Distributions in Statistics: Continuous Univariate Distributions, vol. 1 (1970), Houghton Mifflin Company, Boston, USA.
Richard A. Johnson, Dean W. Wichern, Applied Multivariate Statistical Analysis (1992), Prentice-Hall, Inc., New Jersey, USA. ISBN 0-13-041400-X
Richard J. Harris, A Primer of Multivariate Statistics (1975), Academic Press Inc., New York, USA. ISBN 0-12-327250-5
Thomas D. Rossing, The Science of Sound (1982), Addison-Wesley Publishing Company Inc., USA. ISBN 0-201-06505-3
Thomas D. Rossing, Neville H. Fletcher, Principles of Vibration and Sound (1995), Springer-Verlag New York Inc. ISBN 0-387-94336-6
K. El-Maleh, M. Klein, G. Petrucci, P. Kabal, Speech/music discrimination for multimedia applications (2000), in ICASSP'00.
T. Houtgast, H.J.M. Steeneken, A review of the MTF concept in room acoustics (1985), J. Acoust. Soc. Am. 77, 1069-1077.
J. Ajmera, I. McCowan, H. Bourlard, Robust HMM-based speech/music segmentation (2002), in Proceedings of ICASSP-02.
J.J. Burred, A. Lerch, Hierarchical Automatic Audio Signal Classification (2004), Journal of the Audio Engineering Society.
J. Pinquier, J. Rouas, R. Andre-Obrecht, Robust speech / music classification in audio documents (2002), 7th International Conference on Spoken Language Processing (ICSLP), pp. 2005-2008.
K.D. Martin, E.D. Scheirer, B.L. Vercoe, Music Content Analysis through Models of Audition (1998), ACM Multimedia'98 Workshop on Content Processing of Music for Multimedia Applications, Bristol, UK.

Thank you
