
Speech and Music Classification Using Hybrid Form of Spectrogram and Fourier Transformation

Piyawat Neammalai, Suphakant Phimoltares, and Chidchanok Lursinsap
Advanced Virtual and Intelligent Computing (AVIC) Center

Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Thailand

Email: [email protected], [email protected], [email protected]

Abstract—This paper presents a feature extraction technique for classifying speech and music audio data. A combination of image processing and signal processing is used to classify the audio data. There are three main steps. First, the audio data is segmented and transformed into a spectrogram image, and image processing methods are applied to find the salient characteristics of the spectrogram image. The next step transforms the salient spectrogram image using the 2-dimensional Fourier Transform and then calculates the energy of the signal at specific frequencies to form the feature vector. Finally, in the classification process, a Support Vector Machine is used as a binary classifier. The method is tested on an audio database containing 510 instances of 1.5 seconds each. The experimental results show that acceptable classification accuracy is achieved by our proposed technique.

Index Terms—Speech music classification, Spectrogram, Fourier Transform

I. INTRODUCTION

In the past few years, the study of speech and music classification has become an interesting research topic. Several applications have been developed using speech and music classification [1], such as efficient coding of multimedia sources, automatic speech recognition (ASR), and automatic classification, indexing, archiving, and retrieval of information from large databases. The key process in these applications is to classify the audio data as either speech or music. Audio classification consists of two major parts: (i) feature extraction and (ii) data classification. Many previous studies have addressed speech and music classification; for example, Scheirer and Slaney [2] developed and evaluated a speech and music discrimination system for automatic speech recognition (ASR) to distinguish the speech segments of a soundtrack. For the feature extraction process, there are two categories of effective features: time-domain features and frequency-domain features. Time-domain features are extracted directly from the samples of the audio signal, such as zero-crossings (ZC), root mean square (RMS) [3], short-time energy (STE), and spectrum flux [4]. Frequency-domain features, in contrast, focus on the magnitude spectral content, such as mel-frequency cepstral coefficients (MFCC) [5] and short-time fundamental frequency [6]. For the classification process, several classifiers have been used, such as multilayer neural networks [7], Support Vector Machines (SVM) [8], k-nearest neighbors [2], and Hidden Markov Models (HMM) [9].

To date, relatively few studies have used the time-frequency domain, which considers the characteristics of audio data in time and frequency simultaneously. For example, S. Nilufar et al. proposed methods using visual perception of the time-frequency representation of the audio signal, combining the wavelet transform and multiple kernel learning to select the optimal subbands for discriminating audio classes [10]. There are also many spectrogram-based studies of audio classification. For instance, G. Yu and J. Slotine proposed a method to classify multiple musical instruments by treating the spectrogram as a texture image, with a feature extraction scheme based on time-frequency matching [11]. Y. Costa et al. compared two textural descriptors, GLCM and LBP, dividing the spectrogram into several different zones that are independently classified by different classifiers; their experimental results showed that the LBP-based approach achieved a high overall recognition rate [12]. J. Dennis, H. D. Tran, and H. Li proposed classifying sound events by mapping the spectrogram into a higher-dimensional vector space and extracting the intensity distribution as the features [13], which yields robust classification accuracy in mismatched conditions.

In this paper, we propose a feature extraction technique relying on a hybrid form of the spectrogram and the Fourier transformation to classify speech and music audio data. There are three proposed steps. First, the audio data is segmented and represented as a spectrogram image using the Short Time Fourier Transform. Then, image processing methods consisting of thresholding and median filtering are used to enhance the spectrogram image. Next, a set of features is generated from spectrogram texture analysis using the local binary pattern (LBP) in a local region of the spectrogram image and combined with a feature extracted from the Fourier transformation to form an effective set of feature vectors. Finally, two effective classification approaches for audio classification are investigated, namely the Support Vector Machine (SVM) [13], [11] and the Multilayer Perceptron neural network (MLP) [14], to measure the accuracy achievable with the feature vectors extracted by our method. The benefit of the SVM is that the instances can be transformed into a feature space by a kernel function, in which case the hyperplane corresponds to a non-linear boundary in the input space.



The rest of the paper is organized as follows. The proposed method is presented in Section II. Experimental results and discussion are presented in Section III. Finally, the conclusion is given in Section IV.

II. PROPOSED METHODS

Since our method relies on the visual representation of the audio data as a spectrogram, the observations above motivated us to design a corresponding feature extraction method. The proposed method consists of three main steps, as shown in Fig. 1: (i) preprocessing, in which the audio data is segmented, represented as a spectrogram image, and its salient characteristics are determined by thresholding and median filtering; (ii) feature extraction, in which a texture descriptor, the local binary pattern, is combined with the signal energy calculated at specific frequencies from the Fourier transform of the processed spectrogram image to form the feature vector; and (iii) classification, in which either an SVM or an MLP is applied as the classifier. Each part is described in the following subsections.

Fig. 1: Overview of the proposed method

A. Preprocessing

1) Gray-scale Spectrogram Generation: In the first step, each audio file from the dataset described in Section III-A, with a sampling rate of 22050 Hz, a sample size of 16 bits, a bit rate of 352 kbps, and one channel, is segmented into non-overlapping windows of 1.5 seconds. Next, each segment is normalized to the [-1, 1] interval and transformed by the Short Time Fourier Transform (STFT) as follows:

X(\tau, k) = \sum_{n=0}^{N-1} x(n)\, \omega(n - \tau)\, e^{-jnk}    (1)

where x(n) denotes the short-time signal to be transformed, \omega(n) is a window function of length N, and k corresponds to the frequency f(k) = k f_s / N, where f_s is the sampling frequency. The linear spectrogram is the squared magnitude of the STFT, given as S_{Linear}(\tau, k) = |X(\tau, k)|^2. Since human perception of sound is logarithmic, the log-spectrogram is defined as S(\tau, k) = \log(S_{Linear}(\tau, k)) [10]. In our experiment, the STFT used for spectrogram generation employs a Hamming window of width 23.22 ms (512 samples) with an overlap of 11.60 ms (256 samples). The time-frequency matrix obtained from the STFT is then normalized into a gray-scale image with values scaled to the [0, 1] interval:

S(\tau, k) = \frac{S(\tau, k) - S_{min}}{S_{max} - S_{min}}    (2)

where S(\tau, k) denotes the normalized spectrogram image corresponding to the audio data. The spectrograms generated for speech and music are shown in Figs. 2 and 3, respectively.

Fig. 2: Spectrogram image of speech audio data

Fig. 3: Spectrogram image of music audio data
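As a concrete illustration of this preprocessing step, the following Python sketch (our own, using NumPy and SciPy; the paper does not specify an implementation language) computes the normalized gray-scale log-spectrogram of one segment according to (1) and (2):

import numpy as np
from scipy.signal import stft

def gray_spectrogram(segment, fs=22050):
    """Normalized log-spectrogram of one 1.5-second audio segment."""
    # Scale the segment to the [-1, 1] interval.
    segment = segment / np.max(np.abs(segment))
    # STFT with a 512-sample Hamming window and 256-sample overlap,
    # as stated in Section II-A.1.
    freqs, times, X = stft(segment, fs=fs, window='hamming',
                           nperseg=512, noverlap=256)
    # The linear spectrogram is the squared magnitude; take the logarithm
    # (a small constant avoids log of zero).
    S = np.log(np.abs(X) ** 2 + 1e-12)
    # Normalize to the [0, 1] gray-scale range, Eq. (2).
    S = (S - S.min()) / (S.max() - S.min())
    return freqs, S

The frequency axis freqs is kept alongside the image so that the zoning step described in Section II-B can select the 0-4 kHz band.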


2) Binary Image Generation and Noise Reduction: To prepare the speech and music audio data in a suitable condition, a binary image is generated by thresholding the gray-scale spectrogram image, followed by noise reduction. The basic thresholding is given by:

g(x, y) = \begin{cases} 1, & f(x, y) > T \\ 0, & f(x, y) \le T \end{cases}    (3)

where T denotes the global threshold value and f(x, y) is the intensity at point (x, y) of the gray-scale normalized spectrogram image obtained by (2). In this experiment, the global threshold T is set to 0.8, which corresponds to a gray-scale intensity of 204. After the binary image is generated, filtering is applied to reduce noise. A median filter operates by replacing each sample with the median of the values in its neighbourhood. For each binary image, we apply three runs of median filtering to reduce the noise in the image, as used in a speech spectrogram segmentation algorithm [15]. The binary images for speech and music are shown in Figs. 4 and 5, respectively.

Fig. 4: Binary image of speech

Fig. 5: Binary image of music
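A minimal sketch of this binarization and noise-reduction step, assuming SciPy's median filter with a 3x3 kernel (the kernel size is not stated in the paper), could look as follows:

import numpy as np
from scipy.ndimage import median_filter

def binarize_and_denoise(gray_spec, threshold=0.8, runs=3):
    """Threshold the gray-scale spectrogram, Eq. (3), and apply
    three runs of median filtering to suppress noise."""
    binary = (gray_spec > threshold).astype(np.uint8)
    for _ in range(runs):
        # The 3x3 neighbourhood is an assumption; the paper only states
        # that three runs of median filtering are used.
        binary = median_filter(binary, size=3)
    return binary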

After the preprocessing step, the binary image representing the salient characteristics of the spectrogram image is ready for the next step, feature extraction. In the next subsection, we describe the details of feature extraction, which hybridizes a texture feature and a signal-energy feature to form the feature vectors.

B. Feature Extraction

The spectrogram image obtained from the previous section contains the distinctive patterns of the speech and music signals. Speech and music differ in several respects: the spectrum of a music signal contains more harmonics than that of a speech signal, the music spectra change more slowly than the speech spectra, and the energy of a speech signal is concentrated at low frequencies [10]. Furthermore, the energy distribution of a speech signal is not continuous along the frequency and time axes. These observations suggest that hybridizing spectrogram texture analysis with the signal energy obtained via the Fourier transform will form an effective feature vector for speech/music classification.

1) Zoning Spectrogram Image: The method called spectrogram image zoning has achieved good results for music genre classification [16]. In addition, the textures of spectrogram images of the same audio data type do not differ at the low frequencies [12]. In other words, the rationale behind zoning the spectrogram is that audio signals of the same data type lead to similar intensities in the spectrogram image. Thus, in this paper, we separate the spectrogram image in order to extract location information of the audio data.

Speech audio data contain many silent intervals, and most of the energy is located between 0 kHz and 4 kHz, whereas music audio data contain few silent intervals, have continuous energy peaks over short time spans, and exhibit fewer frequency variations [17]. We therefore separate the spectrogram image band from 0 kHz to 4 kHz and apply texture feature extraction to obtain local features.

Fig. 6: Zoning spectrogram image
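Given the frequency axis returned by the STFT, the zoning step amounts to cropping the spectrogram rows up to 4 kHz; the sketch below assumes the freqs vector produced by the spectrogram code in Section II-A:

def zone_low_band(gray_spec, freqs, f_max=4000.0):
    """Keep only the spectrogram rows from 0 kHz to 4 kHz."""
    return gray_spec[freqs <= f_max, :]

With a sampling rate of 22050 Hz and a 512-sample window, this keeps roughly the first 93 frequency bins.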

2) Texture Feature Extraction: To extract features from the gray-scale normalized spectrogram image obtained at the previous stage, we employ the Local Binary Pattern (LBP) method. The LBP was presented by Ojala et al. [18] as a model for describing texture. An important characteristic of the LBP is the difference between the pixel at the centre of a window and each pixel in its neighbourhood. The gray value of the center pixel and the gray values of its circular neighbourhood are denoted by g_c and g_p (p = 0, ..., P-1), respectively. The texture is characterized by the joint difference distribution T of the center pixel and its neighbourhood pixels as follows:

T \approx (s(g_0 - g_c), s(g_1 - g_c), \ldots, s(g_{P-1} - g_c))    (4)


where

s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}    (5)

The LBP value is obtained by multiplying each s(g_p - g_c) by a binomial factor 2^p. Finally, an image texture is transformed into a unique LBP_{P,R} number:

LBP_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^p    (6)

where R denotes the distance between the center pixel and a neighbourhood pixel in the considered region of the image.

Considering a binary code obtained from (6), Ojala et al. [18] introduced a uniformity measure corresponding to the number of transitions in the binary code from 0 to 1 or vice versa. A binary code is considered a uniform pattern if the number of transitions is less than or equal to 2.

To reduce the dimensionality of the feature vector, a histogram over the uniform binary codes is built to represent the features, giving a 59-dimensional feature vector. This descriptor is called u2, making the LBP definition LBP^{u2}_{P,R}. In our experiments, we choose P = 8 and R = 1.

Fig. 7: LBP operator, where C is the center pixel and P_u, u ∈ {0, ..., 7}, are the neighbour pixels
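One way to obtain the 59-bin LBP^{u2}_{8,1} histogram is scikit-image's local_binary_pattern with the non-rotation-invariant uniform mapping; this particular mapping to the u2 descriptor is our assumption rather than a detail given in the paper:

import numpy as np
from skimage.feature import local_binary_pattern

def lbp_u2_histogram(gray_zone, P=8, R=1):
    """59-dimensional LBP^{u2}_{P,R} histogram of the zoned spectrogram."""
    codes = local_binary_pattern(gray_zone, P, R, method='nri_uniform')
    # 'nri_uniform' yields P*(P-1) + 3 = 59 distinct labels for P = 8.
    n_bins = P * (P - 1) + 3
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins))
    return hist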

3) 2-Dimensional Fourier Transform: After the binarization of the spectrogram image in the spatial domain, the binary image is converted to the spectral domain by the 2-D Discrete Fourier Transform (DFT):

F(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, e^{-j2\pi(ux/M + vy/N)},  u \in \{0, 1, \ldots, M-1\},  v \in \{0, 1, \ldots, N-1\}    (7)

where f(x, y) is a digital image of size M x N; f(x, y) can be recovered from F(u, v) by the 2-D Inverse Discrete Fourier Transform (IDFT):

f(x, y) = \frac{1}{MN} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} F(u, v)\, e^{j2\pi(ux/M + vy/N)},  u \in \{0, 1, \ldots, M-1\},  v \in \{0, 1, \ldots, N-1\}    (8)

A direct evaluation of the DFT requires O((MN)^2) multiplications and additions, so the Fast Fourier Transform (FFT) is used instead, reducing the complexity to O(MN \log_2 MN). Thus, we implement the FFT to obtain the power spectrum, or spectral density, of the binary image at frequency (u, v) as follows:

P(u, v) = |F(u, v)|^2    (9)

To visualize the centered spectrum, the power spectrum image is displayed with the zero frequency (0, 0) located at the center and the frequency gradually increasing toward the edges and corners.
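A short sketch of the power-spectrum computation of (9), with the centered version produced only for visualization:

import numpy as np

def power_spectrum(binary_image):
    """2-D FFT power spectrum of the binary image, Eq. (9)."""
    F = np.fft.fft2(binary_image)
    P = np.abs(F) ** 2
    # fftshift moves the zero frequency (0, 0) to the image center,
    # which is how Figs. 8 and 9 are displayed.
    return F, P, np.fft.fftshift(P)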

4) Energy Signal: Consider the 2-D Fourier power spectra of speech and music audio data obtained from the previous stage, as shown in Figs. 8 and 9, respectively. The two spectra clearly differ in appearance along the vertical axis.

This observation leads us to derive a 1-D energy signal for the speech and music data from their power spectra. The 1-D energy signal can be taken directly from the 2-D Fourier power spectrum at zero horizontal frequency by the following equation.

X(v) = |F(0, v)|^2    (10)

Next, we use Parseval's theorem to calculate the total energy in the frequency domain as follows:

Fig. 8: Centered 2-D Fourier power spectrum (log-amplitude)of speech

Fig. 9: Centered 2-D Fourier power spectrum (log-amplitude)of music

P(X) = \frac{1}{N} \sum_{k=0}^{N-1} |X(k)|^2    (11)


where P(X) is the total energy of the 1-D energy signal X and X(k) is the Discrete Fourier Transform of a signal of length N. The energy signals of speech and music audio data, X(v), are shown in Fig. 10.

Fig. 10: The energy signal of speech and music at zerohorizontal frequency

Lastly, the input vector comprises 60 features divided into two parts: i) 59 features obtained from the histogram described in Section B-2 and ii) 1 feature, the total energy calculated by Equation (11).
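Under our reading of (10), in which the first FFT axis is the horizontal frequency, the final 60-dimensional feature vector can be assembled as in the sketch below; the axis convention and the helper names are illustrative assumptions:

import numpy as np

def total_energy(F):
    """Total energy, Eq. (11), of the 1-D energy signal X(v) = |F(0, v)|^2
    taken at zero horizontal frequency, Eq. (10)."""
    X = np.abs(F[0, :]) ** 2                    # 1-D energy signal, Eq. (10)
    Xk = np.fft.fft(X)                          # DFT of that signal
    return np.sum(np.abs(Xk) ** 2) / len(X)     # Parseval sum, Eq. (11)

def feature_vector(lbp_hist, F):
    """59 LBP histogram bins plus one total-energy feature."""
    return np.concatenate([lbp_hist, [total_energy(F)]])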

C. Classification

Following the overview of the proposed method, the next stage is the classification process, in which each input vector obtained from feature extraction is classified as either speech or music. To classify the input data, we use the SVM and the MLP, both of which are effective binary classifiers.

1) Support Vector Machine: A support vector machine (SVM) is a discriminative classifier that attempts to find the optimal hyperplane separating two classes of data. Suppose S = {(x_1, y_1), ..., (x_N, y_N)} is a training set in R^n, where x_i is an input vector, y_i ∈ {−1, 1} is its class label, and R^n denotes the n-dimensional space. To find the optimal hyperplane, the support vectors of the dataset maximize the margin, which is the distance between the hyperplane and the support vectors, as follows:

\min \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1    (12)

The solution to the SVM optimization problem is given by the Lagrange function:

L(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)    (13)

subject to the constraints \sum_{i=1}^{N} \alpha_i y_i = 0 and 0 \le \alpha_i \le C, where C is the upper bound of the Lagrange multipliers \alpha_i. The kernel function is K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle, where \phi maps the input space into a high-dimensional feature space.
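For reference, an SVM with a third-order polynomial kernel and otherwise standard parameters, as used in Section III-B, can be trained with scikit-learn along these lines (a sketch; variable names are illustrative):

from sklearn.svm import SVC

def train_svm(X_train, y_train):
    """SVM with a third-order polynomial kernel and default parameters.
    X_train is the (n_samples, 60) feature matrix and y_train holds the
    speech/music labels."""
    clf = SVC(kernel='poly', degree=3)
    return clf.fit(X_train, y_train)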

2) Multilayer Perceptron: The structure of a multilayer perceptron (MLP) neural network consists of three layers: an input layer, a hidden layer, and an output layer. The input layer is composed of 60 neurons, corresponding to the number of features. The number of hidden neurons is chosen between the number of input neurons and the number of output neurons. The output layer consists of two neurons, according to the number of classes. In our experiment, 10 hidden neurons are used in the hidden layer.

The back-propagation algorithm is the basic learning algorithm for training an MLP and operates in two phases, a forward phase and a backward phase. The Levenberg-Marquardt (LM) method is a computationally efficient training function for back-propagation neural networks that avoids computing the Hessian matrix explicitly. The LM algorithm, derived from the steepest descent method and Newton's method, is given by

\Delta w = (J^T J + \mu I)^{-1} J^T e    (14)

where w is the weight vector, I is the identity matrix, \mu is the combination coefficient, J is the Jacobian matrix containing the first derivatives of the error vector with respect to the weights, and e is the error vector.
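A minimal NumPy sketch of the LM step in (14); this computes only the weight increment and assumes the Jacobian J and error vector e are provided by the surrounding training loop:

import numpy as np

def lm_step(J, e, mu):
    """Levenberg-Marquardt increment, Eq. (14): (J^T J + mu*I)^{-1} J^T e."""
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + mu * np.eye(n), J.T @ e)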

III. EXPERIMENTAL RESULTS AND DISCUSSION

A. Dataset

To evaluate the proposed technique, we use an audio dataset consisting of 128 files from the Music Speech Collection, separated into two categories, speech and music, each of which contains 64 files. Each audio file is 30 seconds long, with a sampling rate of 22050 Hz, 16 bits per sample, and a bit rate of 352 kbps. All audio files have one channel. The speech dataset contains male and female speakers in quiet and slightly noisy conditions. The music dataset contains a variety of music genres, e.g., pop, jazz, dance, and rock.

To create the training dataset, 77 files were randomly selected from the speech and music datasets. Each file was randomly segmented into 10 segments of 1.5 seconds, giving a total of 770 training instances. For the testing dataset, 51 files were randomly selected from the remaining data, giving a total of 510 testing instances. To avoid bias, segments from the same audio file were never placed in both the training set and the test set.

B. Classification Results

This experiment compares the accuracy of the SVM and the MLP using our features. For the SVM classifier, a third-order polynomial kernel with standard parameters was used. For the MLP, the number of hidden nodes was 10. The experimental results, shown in Table I, indicate that our proposed feature extraction method performs well for speech and music classification. Traditional features based on ZCR and STE are compared with ours. Although the overall classification accuracy of our method is higher, the classification accuracy for music data is higher for the ZCR and STE features when the MLP is used as the classifier. The rationale behind this slight drop in classification accuracy for music audio data is that the spectrogram images for the music data were collected over a short time and include noise, so some of these images were misclassified. This situation did not affect the SVM, however.

This paper has studied feature extraction using the characteristics of the spectrogram image to classify the audio data type. The classification accuracy, however, depends on


TABLE I: Classification Accuracy Results

Classification Accuracy (in %)

            ZCR & STE features          Proposed features
Classifier  Speech  Music  Overall      Speech  Music  Overall
SVM         78.28   93.03  85.56        92.00   95.77  93.68
MLP         78.16   92.71  85.43        92.4    89.23  90.82

several factors. For instance, regarding the segment length of the audio data, a shorter segment is more effective for multimedia retrieval [10], whereas a longer segment is more suitable for obtaining high classification accuracy; our method divides the audio data into segments of 1.5 seconds, and the classification accuracy was acceptable. Therefore, the proposed method can be effectively applied to speech and music applications.

IV. CONCLUSION

In this paper, we proposed a technique for classifying speech and music audio data using a hybrid form of feature vectors built from spectrogram texture features and the 2-dimensional Fourier transformation, which was used to calculate the signal energy at a specific frequency. The LBP was used to extract texture features of the spectrogram image, which were combined with the total energy of the signal. Two well-known classifiers, the SVM and the MLP, were considered. The experimental results indicated that acceptable classification accuracy was achieved while relatively few features were used to classify speech and music data.

REFERENCES

[1] O. Mubarak, E. Ambikairajah, and J. Epps, “Novel features for effective speech and music discrimination,” in Engineering of Intelligent Systems, 2006 IEEE International Conference on, 2006, pp. 1–5.

[2] E. Scheirer and M. Slaney, “Construction and evaluation of a robust multifeature speech/music discriminator,” in Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on, vol. 2, Apr 1997, pp. 1331–1334.

[3] C. Panagiotakis and G. Tziritas, “A speech/music discriminator based on RMS and zero-crossings,” Multimedia, IEEE Transactions on, vol. 7, no. 1, pp. 155–166, Feb 2005.

[4] L. Lu, H.-J. Zhang, and H. Jiang, “Content analysis for audio classification and segmentation,” Speech and Audio Processing, IEEE Transactions on, vol. 10, no. 7, pp. 504–516, Oct 2002.

[5] H. Harb and L. Chen, “Robust speech music discrimination using spectrum's first order statistics and neural networks,” in Signal Processing and Its Applications, 2003. Proceedings. Seventh International Symposium on, vol. 2, July 2003, pp. 125–128.

[6] T. Zhang and C.-C. Kuo, “Audio content analysis for online audiovisual data segmentation and classification,” Speech and Audio Processing, IEEE Transactions on, vol. 9, no. 4, pp. 441–457, May 2001.

[7] B. Matityaho and M. Furst, “Neural network based model for classification of music type,” in Electrical and Electronics Engineers in Israel, 1995., Eighteenth Convention of, March 1995, pp. 4.3.4/1–4.3.4/5.

[8] S. Sadjadi, S. Ahadi, and O. Hazrati, “Unsupervised speech/music classification using one-class support vector machines,” in Information, Communications and Signal Processing, 2007 6th International Conference on, Dec 2007, pp. 1–5.

[9] J. Ajmera, I. McCowan, and H. Bourlard, “Speech/music segmentation using entropy and dynamism features in a HMM classification framework,” Speech Communication, vol. 40, no. 3, pp. 351–363, 2003. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167639302000870

[10] S. Nilufar, N. Ray, M. K. I. Molla, and K. Hirose, “Spectrogram based features selection using multiple kernel learning for speech/music discrimination,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, March 2012, pp. 501–504.

[11] G. Yu and J.-J. Slotine, “Audio classification from time-frequency texture,” in Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, April 2009, pp. 1677–1680.

[12] Y. Costa, L. Oliveira, A. Koerich, and F. Gouyon, “Comparing textural features for music genre classification,” in Neural Networks (IJCNN), The 2012 International Joint Conference on, June 2012, pp. 1–6.

[13] J. Dennis, H. D. Tran, and H. Li, “Spectrogram image feature for sound event classification in mismatched conditions,” Signal Processing Letters, IEEE, vol. 18, no. 2, pp. 130–133, Feb 2011.

[14] C. Costa, J. D. Valle Jr., and A. Koerich, “Automatic classification of audio data,” in Systems, Man and Cybernetics, 2004 IEEE International Conference on, vol. 1, Oct 2004, pp. 562–567.

[15] R. Steinberg and D. O'Shaughnessy, “Segmentation of a speech spectrogram using mathematical morphology,” in Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, March 2008, pp. 1637–1640.

[16] Y. Costa, L. Oliveira, A. Koerich, F. Gouyon, and J. Martins, “Music genre classification using LBP textural features,” Signal Processing, vol. 92, no. 11, pp. 2723–2737, 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0165168412001478

[17] K.-P. Han, Y.-S. Park, S.-G. Jeon, G.-C. Lee, and Y.-H. Ha, “Genre classification system of TV sound signals based on a spectrogram analysis,” Consumer Electronics, IEEE Transactions on, vol. 44, no. 1, pp. 33–42, Feb 1998.

[18] T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, no. 7, pp. 971–987, Jul 2002.