

SN Computer Science (2021) 2:442 https://doi.org/10.1007/s42979-021-00750-1

SN Computer Science

ORIGINAL RESEARCH

Improving Indian Spoken‑Language Identification by Feature Selection in Duration Mismatch Framework

Aarti Bakshi¹ · Sunil Kumar Kopparapu²

Received: 10 December 2020 / Accepted: 14 June 2021 / Published online: 3 September 2021 © The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd 2021

Abstract
This paper presents a novel duration normalized feature selection technique and a two-step modified hierarchical classifier to improve the accuracy of spoken language identification (SLID) for Indian languages under duration mismatched conditions. Feature selection averages random forest-based importance vectors of openSMILE features computed for utterances of different durations. Although this improves the SLID system's accuracy for mismatched training and testing durations, performance is significantly reduced for short-duration utterances. A cascade of inter-family and intra-family classifiers with an additional class is therefore used to correct false language family estimates. An All India Radio data set with nine Indian languages and different utterance durations was used as speech material. Experimental results showed that 150 optimal features with the proposed modified hierarchical classifier achieved the highest accuracies of 96.9% and 84.4% for 30 s and 0.2 s utterances under matched train-test durations, and accuracies of 98.3% and 61.9% for 15 s and 0.2 s test durations when trained on 30 s utterances. Comparative analysis showed a significant improvement in accuracy over several SLID systems in the literature.

Keywords Spoken language identification · Indian language · Feature selection · Classifier fusion

Introduction

Spoken language identification (SLID) can be defined as the task of automatically identifying the language a person spoke by analyzing a, typically short, speech utterance. State-of-the-art SLID systems exhibit satisfactory performance on long-duration utterances. A data set with sufficiently long utterances may be available for training the model; however, the available test utterances may be significantly shorter ( ≤ 3 s), a problem known as duration mismatch. The mismatch between train and test utterance duration is an emerging issue in spoken language identification, which has many applications, such as vernacular call centers to assist customers, assisting farmers in their native language, and many other personalized services. However, reducing the utterance duration drastically degrades SLID performance because short utterances carry less language-discriminating information. Short-duration utterances and duration mismatch (duration variability) have therefore become critical issues that need to be addressed urgently.

The languages spoken in India are distributed across four major language families; most languages belong to either the (a) Indo-Aryan or (b) Dravidian family. Most Indian languages share a generic structure of phoneme sounds; however, some language-specific phonemes and phoneme combinations help differentiate one language from another (see [1]). Like all languages, every spoken Indian regional language has multiple dialects and accents that vary with geographical area. The similarity in phoneme structure across languages and the variability within a language make spoken Indian language identification (SLID) very challenging.

In a multilingual country like India, it is difficult for the speaker of a language ( L1 ) to acquire the same pronunciation, accents, and language-specific phonemic characteristics of a non-native ( L2 ) language. Indian languages have a generic structure of sound distribution,

* Aarti Bakshi, [email protected]

Sunil Kumar Kopparapu, [email protected]

¹ UMIT, SNDT University, Mumbai, India
² TCS Research, TATA Consultancy Services, Yantra Park, Thane, India


keeping the structure as it is; the modern Indo-Aryan and Dravidian languages have their own specific acoustic, prosodic, and phonotactic features, which differentiate them from each other [1]. Prosodic features of speech are related to rhythm, intonation, phoneme duration, pitch, and tone. Phonotactic features of speech are related to the rules governing syllable structure and consonant/vowel sequences. Acoustic features are low-level representations that, depending on need, can be combined with prosodic and phonotactic features to develop a language identification model.

This paper proposes two approaches to address duration mismatched conditions in an SLID system for Indian languages. In the first approach, a novel duration normalized feature selection (DNFS) is used to select a subset of the features extracted using the openSMILE toolkit. The proposed DNFS works better under mismatched duration conditions and achieves a better accuracy score than the entire openSMILE feature set. In the second approach, a two-step modified hierarchical classifier is applied to reduce inter-family confusion in the SLID system: in the first step, the language families (Indo-Aryan and Dravidian) are classified, and in the second step, the individual languages of each family are recognized. Different classifiers, namely ANN, SVM, RF, and fusions of ANN, SVM, and RF, are studied in both approaches. The combination of duration independent features and the modified hierarchical classifier is also explored to identify the language from very short utterances under matched and mismatched conditions.

The rest of this paper is structured as follows: related work is presented in "Review of Related Work"; the proposed methodologies in "Methodology"; the speech corpus development and experimental setup in "Training and Test Material"; the experimental results in "Results and Discussion"; a comparative study in "Comparison with Previous Methods"; and the conclusion in "Conclusion".

Review of Related Work

Over the years, several language-independent acoustic features, such as Shifted Delta Cepstral coefficients (SDC) [2, 3], Mel Frequency Cepstral Coefficients (MFCC) [4, 5], Linear Predictive Coefficients (LPC) [3], and Perceptual Linear Prediction (PLP) [6], have been reported to perform better for same train-test duration utterances. Although a probabilistic linear discriminant analysis (PLDA) based i-vector with a modified prior estimation technique [7] and an exemplar-based technique [8] were reported to improve the performance of SLID systems under duration mismatched conditions, the improvement was not significant, especially for short-duration utterances [9, 10]. GMM-UBM based features are designed to compensate for duration mismatched conditions, but these features are computationally expensive [11]. A combination of several features may improve SLID performance, but the gains are limited by the curse of dimensionality.

Feature selection (FS) is the process of selecting language-discriminating features to improve an SLID system under duration mismatched conditions while reducing computational complexity [12, 13]. Several FS techniques, such as the genetic algorithm [14], the estimation of distribution algorithm (EDA), and greedy search [15], have been reported in the literature. Chowdhary et al. [16] presented a grey wolf optimizer (GWO) FS algorithm for selecting optimum features to improve SLID; three texture descriptors, completed local binary pattern (CLBP), local binary pattern histogram Fourier (LBPHF), and the discrete wavelet transform, were calculated from spectrogram images, and GWO selected 156 features from the fusion of all 203 features, with an overall accuracy of 95.96%. Das et al. [3] reported a nature-inspired FS algorithm combining the Binary Bat Algorithm (BBA) and the Late Acceptance Hill-Climbing (LAHC) algorithm for improving SLID by selecting relevant features from MFCC, LPC, i-vector, x-vector, and fusions of MFCC + DWT and MFCC + GFCC; optimum feature sets of 972 and 1141 features for the IITM and IIT-H data sets gave accuracies of 92.35% and 100%, with computation times of 158 and 182 min, respectively. Guha et al. [6] reported a hybrid Harmony Search-Naked Mole-Rat (HS-NMR) algorithm for improving SLID; optimum sets of 612, 413, and 527 features were selected from Mel-Spectrogram and Relative Spectral Transform-Perceptual Linear Prediction (RASTA-PLP) features for the CSS10, VoxForge, and INDIC TTS data sets, respectively, with computation times of 236, 47, and 115 min and accuracies of 99.89% on CSS10, 98.22% on VoxForge, and 99.75% on INDIC TTS. Most of these FS algorithms are employed for duration-matched conditions and may not be suitable for duration mismatched conditions and short-duration utterances [3, 16].

State-of-the-art SLID systems have used Vector Quantization (VQ), Gaussian Mixture Models (GMM) [4, 17], Support Vector Machines (SVM) [18–20], Hidden Markov Models (HMM) [21], Artificial Neural Networks (ANN) [4, 21, 22], and Random Forests (RF) [3, 6]. Modern end-to-end language recognition models based on deep learning (DL) improve performance at the cost of much larger data requirements and do not perform well on small data sets. Under duration-matched conditions, DL algorithms such as bidirectional long short-term memory (BLSTM) and recurrent neural networks (RNN) have reported excellent performance for short-duration language identification [23]; under duration mismatched conditions, their performance degrades. This issue can be alleviated by selecting relevant features that can adapt to duration mismatched train-test conditions.


Methodology

In this section, our work is presented in three subsections: feature extraction, feature selection, and the modified hierarchical classifier. The proposed methodology for the experimental setup is given in Fig. 1.

Feature Extraction

Feature extraction is an essential step in analyzing the signal; it produces a parametric representation of the speech signal. Different feature extraction techniques, such as MFCC, LPC, PLP, and openSMILE, are used to extract features that carry relevant information and help discriminate between languages.

In this paper, we use openSMILE [24] features to represent the overall characteristics of a speech utterance. Note that openSMILE features have been used effectively in speech emotion recognition. We used the publicly available openSMILE toolkit [25] to extract speech features from the spoken utterances in all our experiments. The feature set consists of 1582 features extracted from each spoken utterance, as shown in Fig. 2.

The feature set contains 34 low-level descriptors (LLDs) appended with 34 delta coefficients. The LLDs consist of PCM loudness, 15 MFCC features, 8 log Mel frequency bands, 8 LSP frequencies, the fundamental frequency envelope, and the probability of voicing; with 21 functionals, this yields ( 34 × 2 × 21 ) 1428 features. Four pitch-based LLDs and their four delta coefficient contours form the second set of features. The pitch-based LLDs include the smoothed fundamental frequency contour, local and delta jitter, and shimmer. Applying 19 functionals to the pitch-based LLDs yields ( 4 × 2 × 19 ) 152 features. Two more descriptors included in the feature set are the number of fundamental frequency ( F0 ) onsets and the total duration of the input. These are supra-segmental features: low-level descriptors such as pitch and energy are calculated for every speech frame, and mean, standard deviation, quartiles, skewness, kurtosis, and percentiles are used as functionals.
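As a toy illustration of how such functionals condense a frame-level LLD contour into utterance-level statistics, consider the sketch below. This helper is ours, not part of openSMILE, and covers only a small subset of the 21 functionals:

```python
import statistics as st

def lld_functionals(frames):
    """Apply a few openSMILE-style functionals to one LLD contour.

    `frames` is a list of per-frame values for a single low-level
    descriptor (e.g. loudness). Returns utterance-level statistics.
    """
    q1, q2, q3 = st.quantiles(frames, n=4)        # quartiles
    mu, sd = st.mean(frames), st.pstdev(frames)   # amean, stddev
    skew = sum((x - mu) ** 3 for x in frames) / (len(frames) * sd ** 3)
    return {"amean": mu, "stddev": sd, "skewness": skew,
            "quartile1": q1, "quartile2": q2, "quartile3": q3}
```

In the full feature set, each of the 34 LLDs and their deltas would pass through 21 such functionals, giving the 1428-feature block described above.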

Proposed Duration Normalized Feature Selection (DNFS)

Selecting the most discriminative features helps speed up classifier training and improves the system's robustness. Under mismatched utterance duration conditions, increasing the difference between the lengths of the training and test utterances decreases the system's identification accuracy. We propose a random forest-based DNFS algorithm to improve identification accuracy under mismatched utterance duration conditions. A random forest fits a number of decision tree classifiers on various sub-samples of the data set and uses averaging to improve identification accuracy while controlling over-fitting.
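A minimal pure-Python sketch of this duration normalized averaging idea is given below. The function name is ours; in practice, the per-duration importance vectors would come from the trained random forests (e.g. scikit-learn's `feature_importances_`):

```python
def duration_normalized_selection(importance_by_duration, top_k=150):
    """Average per-duration feature importance vectors and rank features.

    importance_by_duration: dict mapping utterance duration (s) to a
    feature importance vector from that duration's random forest model.
    Returns (indices of the top_k features, averaged importance vector).
    """
    normalized = []
    for vec in importance_by_duration.values():
        total = sum(vec)
        normalized.append([v / total for v in vec])   # sums to 1 per duration
    n = len(normalized)
    avg = [sum(col) / n for col in zip(*normalized)]  # duration normalized
    ranked = sorted(range(len(avg)), key=lambda k: avg[k], reverse=True)
    return ranked[:top_k], avg
```

With `top_k=150`, this reproduces the size of the optimum feature set used in our experiments.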

Let \(n_{tree}\) represent the number of trees and let \(n_f = 1582\) be the number of input features.

The DNFS algorithm can be enumerated as follows. Let \((X_t, Y_t)\) be the set of 1582-dimensional openSMILE feature vectors and 9-dimensional label vectors for the t-second data set. For each pair \((X_t, Y_t)\), a random forest model with an ensemble of \(n_{tree}\) decision trees is trained. A random subset of the 1582 features is selected in each decision tree, and a

Fig. 1 Proposed feature selection method with modified hierarchical classifier employed for language identification


successive binary split at each node is performed based on the most important feature to achieve the overall nine-class classification. The importance of the kth feature for estimating the best possible binary split at a node is calculated using the Gini impurity index as

\[
I_{node}(k) = w_k G_k - w_k^{left} G_k^{left} - w_k^{right} G_k^{right},
\tag{1}
\]

where \(G(\cdot)\) is the Gini impurity index, calculated as

\[
G = \sum_{i=1}^{9} f_i (1 - f_i),
\]

\(w_k\) is the weighted number of samples at a node and at its left and right splits using the kth feature, and \(f_i\) is the frequency of the ith label at the node. Based on the node-level feature importance, the importance of the kth feature in a decision tree is calculated as

\[
I_{tree}(k) = \frac{\sum_{\forall i \in nodes} I_{node}(i)}{\text{number of nodes}}.
\tag{2}
\]

Similarly, the importance of each feature is calculated in all \(n_{tree}\) decision trees of the random forest model. These tree-based importance values for each feature are normalized to obtain the overall feature importance vector \(I_t\) for the t-second data set as

\[
I_t(k) = \frac{I_{tree}(k)}{\sum_{j=1}^{n_{tree}} I_{tree_j}(k)}, \quad 1 \le k \le n_f.
\tag{3}
\]
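The node-level quantities in Eqs. (1)–(2) can be sketched with two small helpers (illustrative functions, ours; label frequencies and node weights are assumed precomputed):

```python
def gini(label_freqs):
    """Gini impurity G = sum_i f_i * (1 - f_i) over class label frequencies."""
    return sum(f * (1 - f) for f in label_freqs)

def node_importance(w, g, w_left, g_left, w_right, g_right):
    """I_node = w*G - w_left*G_left - w_right*G_right (weighted impurity drop)."""
    return w * g - w_left * g_left - w_right * g_right
```

A split that sends all samples of each class to one side drives both child impurities to zero, so the node importance equals the parent's weighted impurity.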

Similarly, normalized feature importance vectors are calculated for the different utterance durations. Finally, a duration normalized importance for each feature is calculated as

\[
I_{avg}(k) = \frac{I_t(k)}{\sum_{\forall t} I_t(k)}, \quad 1 \le k \le n_f.
\tag{4}
\]

The importance values for all features are in the range 0–1, and the sum of all feature importance values is equal to 1. The most important feature has a value close to 1, and the least important feature has a value close to 0. The set of features with high importance values is selected as the duration normalized features for the improved SLID system. Figure 3 shows the DNFS process with the output score fusion technique.

A Modified Hierarchical Classifier System

In a conventional hierarchical classifier, the language family of the spoken utterance is identified first (inter-family classification). Second, a separate multi-class classifier for each language family is used to identify the individual languages (intra-family classification), reducing inter-family confusion. In hierarchical classifiers, poor performance of the family classifier decreases the identification accuracy of the entire SLID system. To alleviate this issue, we propose a modification to the conventional hierarchical classifier.

The first step is to group the spoken utterances during training according to two language families, namely

Fig. 2 Steps involved in openSMILE feature extraction


Indo-Aryan and Dravidian. A binary language family classifier is trained to correctly identify the language family of the spoken utterance. In the second step, each spoken utterance is labeled with its individual language, and a separate multi-class language classifier is trained for each language family. During the training of each multi-class language classifier, examples of spoken utterances from the other language family are included, and an extra class is added to identify these false positives. Although this requires an additional class in each multi-class language classifier, it further reduces inter-family confusion by learning to reject the false positive estimates of the binary language family classifier.

During testing, the test speech utterance is first presented to the binary language family classifier to identify the language family. The appropriate multi-class classifier is then selected to identify the individual language of the test utterance, as shown in Fig. 4. If a multi-class classifier identifies the test utterance as belonging to the other language family, it can be concluded that the binary language family classifier made an error. The test utterance is then directly presented to the multi-class language classifier of the other family to identify the individual language correctly.
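The two-step routing with fallback described above can be sketched structurally as follows. This is our sketch, not the authors' code; the classifier objects are hypothetical and only assumed to expose a `predict` method:

```python
OTHER = "OTHER"  # extra class flagging a wrong family decision

class ModifiedHierarchicalClassifier:
    """Binary family classifier followed by per-family language classifiers,
    each trained with an extra OTHER class for out-of-family utterances."""

    def __init__(self, family_clf, language_clfs):
        self.family_clf = family_clf        # predict(x) -> family name
        self.language_clfs = language_clfs  # {family: clf}; predict(x) -> language or OTHER

    def predict(self, x):
        family = self.family_clf.predict(x)
        language = self.language_clfs[family].predict(x)
        if language == OTHER:
            # The family classifier was wrong: reroute to the other family.
            other = next(f for f in self.language_clfs if f != family)
            language = self.language_clfs[other].predict(x)
        return language
```

The extra OTHER class is what lets an intra-family classifier veto the family decision instead of being forced to pick a wrong in-family language.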

The effectiveness of the proposed DNFS method is evaluated for matched and mismatched training and test utterance durations in the first approach. In the second approach, the hierarchical classifier is modified to alleviate inter-family confusion in the SLID system. Taking advantage of both approaches, DNFS is applied separately to each of the multi-class language classifiers to further improve the SLID system's performance.

Training and Test Material

Speech Corpus

We have built our own data set [26], which consists of a total of 9 languages: 5 from the Indo-Aryan family and 4 from the Dravidian family. It is to be noted that

Fig. 3 Proposed duration normalized feature selection process


languages having the same root language are more likely to be confused among themselves. The data set consists of studio-quality news speech recordings in 9 Indian languages scraped from the All India Radio portal. The original news recordings were manually segmented into 30 s utterances. A total of 100 speech utterances per language (900 in total), sampled at 16 kHz, forms the data set. Additionally, the initial 30 s utterances were manually segmented into smaller durations of 15, 10, 5, 3, 1, 0.5 and 0.2 s to form varying-duration data sets. In all, there are 8 data sets with different duration speech segments (see Table 1). All speech utterances were listened to carefully, and any segment with music, silence, or unwanted voice was filtered out. This speech corpus has utterances by newsreaders, both male and female (equal

Fig. 4 Proposed functional diagram of a modified hierarchical classifier

Table 1 Number of audio utterances for different languages and duration

Duration   30 s   15 s   10 s   5 s    3 s    1 s     0.5 s   0.2 s

As         100    200    300    600    1002   3004    6003    15,015
Bn         100    200    300    600    1000   3000    6010    15,021
Gj         100    200    300    601    1002   3007    6055    15,031
Hn         100    200    300    600    1000   3000    6041    15,081
Mr         100    200    300    601    1002   3009    6032    15,058
Kn         100    200    300    600    1002   3007    6020    15,035
Ml         100    201    302    605    1009   3027    6112    15,122
Tm         100    201    302    605    1009   3025    6013    15,043
Tl         100    200    300    604    1008   3021    6033    15,046
Total      900    1802   2704   5416   9034   27,100  54,139  135,452


in number), on varying sets of topics. The five Indo-Aryan family languages in the data set are Assamese (As), Bengali (Bn), Gujarati (Gj), Hindi (Hn), and Marathi (Mr), while the four Dravidian family languages are Kannada (Kn), Malayalam (Ml), Tamil (Tm), and Telugu (Tl).

Experimental Setup

All experiments were carried out on an Intel Core i7-1065G7 CPU @ 1.30 GHz with 16 GB RAM, using 5-fold cross-validation throughout. In the matched condition, 4 parts of a data set were used for training and the remaining part of the same data set for testing. In the mismatched condition, 4 parts of one utterance duration data set ( 80% of its spoken utterances) were used to train a classifier, while 1 part ( 20% of spoken utterances) of each of the remaining utterance duration data sets (not used in training) was used for testing. For example, as shown in Table 2, 80% of the 0.2 s data was used to train a classifier, and 20% of the 0.5, 1, 3, 5, 10, 15, and 30 s data was used for testing.
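The resulting grid of evaluation runs can be sketched as follows (our reading of the protocol; names and fold indices are illustrative):

```python
from itertools import product

DURATIONS = [0.2, 0.5, 1, 3, 5, 10, 15, 30]   # seconds
N_FOLDS = 5

def evaluation_runs():
    """Yield (train_duration, train_folds, test_duration, held_out_fold).

    Matched condition: train and test durations are equal, using disjoint
    folds of the same data set. Mismatched condition: train on 4 folds of
    one duration, test on the held-out fold of a different duration.
    """
    for train_dur, test_dur in product(DURATIONS, DURATIONS):
        for held_out in range(N_FOLDS):
            train_folds = [f for f in range(N_FOLDS) if f != held_out]
            yield train_dur, train_folds, test_dur, held_out
```

Each (train, test) duration pair thus contributes five runs, whose accuracies are averaged across folds.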

The metric used for the evaluation of classifiers is accuracy, the ratio of correctly detected utterances to the total utterances tested:

\[
\text{Accuracy} = \frac{\text{correctly detected utterances}}{\text{total utterances}}.
\]

Three separate classifiers are used: support vector machine (SVM), multi-layer perceptron trained with backpropagation (ANN), and random forest (RF). In addition, output score fusion was applied to improve identification, resulting in two more classifiers, ANN+SVM and ANN+SVM+RF. The ANN used ReLU activation in fully connected layers, 0.1 regularization, 200 epochs, and the Adam optimizer; the number of hidden layers and neurons was optimized over several experiments. The SVM used a Gaussian kernel and one-vs-all (OvA) decomposition, with other parameters experimentally optimized. The RF used 1000 trees with four attributes at each split.

Results and Discussion

Performance of the System Using 1582 openSMILE Features for Duration Matched and Mismatched Condition

The baseline SLID system’s identification performance using 1582 features is shown in Table 3. In matched con-ditions with 30 s duration, the SVM classifier shows the best performance at 99.1% , followed by ANN with 98.9% , and RF with 98.1% . Table 3 (last two columns) presents the results of two output score fusion techniques. The ANN + SVM + RF performs best with the best accuracy of 99.9% followed by ANN + SVM fusion with 99.4% , show-ing improvement higher accuracy than the best individual classifier 0.8% and 0.3% , respectively. The improvement in accuracy is more significant with decreasing utterance duration. It can be observed that the overall performance of the system decreases significantly with decreasing utterance duration. In matched condition with 0.2 s, ANN + SVM +

Table 2 Mismatched condition data distribution with 0.2 s training and remaining data sets for testing

Duration (s)     0.2    0.5    1      3      5      10     15     30

Fold 1 (train)   20%    x      x      x      x      x      x      x
Fold 2 (train)   20%    x      x      x      x      x      x      x
Fold 3 (train)   20%    x      x      x      x      x      x      x
Fold 4 (train)   20%    x      x      x      x      x      x      x
Fold 5 (test)    x      20%    20%    20%    20%    20%    20%    20%

(x = not used)

Table 3 Performance of the baseline SLID system with 1582 features and different classifiers under matched condition

Utt (s)   ANN (A)   RF (R)   SVM (S)   A+S    A+S+R

30        98.1      98.1     99.1      99.4   99.6
15        98.6      98.9     98.8      99.0   99.5
10        98.4      98.8     98.7      99.2   99.4
5         97.7      97.8     97.8      98.0   98.6
3         98.3      96.1     98.6      98.8   98.9
1         95.7      88.6     87.2      96.0   96.5
0.5       91.5      80.5     67.0      91.9   92.6
0.2       75.6      69.1     37.7      76.2   78.6


RF classifier shows the highest accuracy of 78.6%, followed by ANN + SVM with 76.2%, and the lowest is SVM with 37.7%. This shows that the baseline system does not perform well for short utterance durations.

In mismatched conditions, the performance of the baseline SLID system decreases further. Among the individual classifiers, RF showed the best accuracy under mismatched conditions, as shown in Table 4. Each row and column forms a pair of training and testing data sets, respectively, and the diagonal values represent the matched condition for different utterance durations. Although the ANN + SVM + RF classifier's accuracy is slightly better than RF's, it also suffers severely under mismatched conditions, as shown in Table 5. All other individual and fusion classifiers were tested and found to be less accurate than those reported here. As discussed earlier, the baseline SLID system's performance deteriorates under mismatched conditions irrespective of the classifier.

Optimum Feature Selection Using Proposed DNFS Algorithm

The feature selection method is used to improve the performance of the system. The relevant feature groups are the logarithmic power of the Mel frequency bands (logMelBand), line spectral pair frequencies (lspFreq), Mel frequency cepstral coefficients (MFCC), loudness as normalized intensity (PCM-loudness), and local shimmer (shimmerLocal). These features use functionals such as amean, linregc, pctlrange, percentile, quartile, stddev, and skewness. The goal of this phase is to select the most important features using duration normalized FS. This process generates feature sets of 25, 50, 100, 125, and 150 features from the original 1582 openSMILE features, as shown in Table 6.
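Given the DNFS importance ranking, these nested subsets can be produced with a one-line sketch (ours):

```python
def nested_feature_subsets(ranked_indices, sizes=(25, 50, 100, 125, 150)):
    """Top-N feature index subsets from an importance ranking.

    Because each subset is a prefix of the same ranking, the subsets are
    nested: the 25-feature set is contained in the 50-feature set, etc.
    """
    return {n: ranked_indices[:n] for n in sizes}
```

This nesting explains why the larger sets described below always contain the feature groups of the smaller ones.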

It is to be noted that the top 25 and 50 feature sets are related to logMelFreqBand-sma and their functionals; the top 100 features additionally include logMelFreqBand-sma-de, lspFreq-sma, lspFreq-sma-de, and mfcc-sma and their functionals; the top 125 features additionally include mfcc-sma-de and their functionals; and the top 150 features further contain mfcc-sma-de, pcm-loudness-sma, pcm-loudness-sma-de, and shimmerLocal-sma and their functionals, as shown in Table 6.

The performance of the different feature sets (25, 50, 100, 125, 150) was evaluated using three classifiers (ANN, SVM, and RF) and two output score fusion techniques (ANN+SVM and ANN+SVM+RF). The overall accuracies of these feature sets for 30 s and 5 s utterance durations using ANN are shown in Tables 7 and 8, respectively.

Here, the 30 s and 5 s utterance durations are used to train the classifiers, and the remaining utterance durations are used for testing. All classifiers performed best for all reduced feature sets (25, 50, 100, 125, 150) when trained on 30 s durations. An incremental trend was observed for all classifiers

Table 4 Performance of the baseline SLID system with 1582 features and RF classifier under mismatched condition

Train (s)   Test (s)
            30     15     10     5      3      1      0.5    0.2

30          99.1   61.9   61.0   58.3   54.1   43.7   30.6   15.4
15          57.2   98.8   64.5   62.3   56.3   48.9   35.1   16.2
10          63.1   63.8   98.7   64.4   57.2   52.5   36.7   18.9
5           68.2   69.1   66.8   97.8   60.1   56.1   39.5   23.1
3           69.4   71.4   70.2   66.0   96.1   58.0   41.2   25.2
1           73.1   73.7   73.4   67.9   63.1   88.8   43.1   30.7
0.5         77.0   76.3   75.2   70.4   63.8   60.4   80.5   40.2
0.2         36.8   36.9   37.8   39.6   39.9   42.2   44.1   69.1

Table 5 Performance of baseline SLID system with 1582 features and ANN + SVM + RF classifier under mismatched condition

Train (s)   Test (s)
            30     15     10     5      3      1      0.5    0.2

30          99.4   63.1   62.2   59.2   55.4   44.6   32.6   16.4
15          58.5   99.0   65.8   63.5   57.3   46.1   36.3   16.8
10          64.5   64.7   99.2   65.3   57.3   46.1   36.3   16.8
5           69.3   70.2   67.7   98.0   61.2   57.3   40.4   23.9
3           70.3   72.4   71.3   68.1   98.8   59.2   42.1   25.8
1           74.1   74.6   73.9   68.9   64.2   96.5   43.7   32.5
0.5         78.0   76.7   76.4   71.2   64.7   61.4   91.9   41.7
0.2         37.8   37.5   39.4   40.1   40.8   43.6   44.7   76.2


trained by the 5 and 30 s utterance durations. The performance of the SLID system under the duration mismatched condition was evaluated with all reduced feature sets (25, 50, 100, 125, 150); the best results, obtained with the 150-feature optimum set, are presented in this paper.

To examine the effect of the proposed feature selection technique, the high-dimensional features are plotted and visualized in a two-dimensional plot, as shown in Fig. 5. t-SNE, a dimensionality reduction technique, is used to transform the high-dimensional features into two dimensions. PCA is applied as a pre-processor to reduce the feature dimension to 20, followed by t-SNE (perplexity = 30, number of iterations = 250) to reduce the features further to two dimensions. Figure 5 shows t-SNE plots for the 1582 openSMILE features and the optimum 150 duration normalized openSMILE features, with the first t-SNE feature on the horizontal axis and the second on the vertical axis. Features corresponding to each language are represented by different colors (1-As, 2-Bn, 3-Gj, 4-Hn, 5-Kn, 6-Ml, 7-Mr, 8-Tm, 9-Tl). Each language cluster is more distinct for the duration normalized features than for the original openSMILE features, indicating an improved feature subset for classification.
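The visualization pipeline described above can be sketched with scikit-learn. The random feature matrix and labels below are stand-ins for the openSMILE features and the nine language labels, not the paper's data; the paper additionally caps t-SNE at 250 iterations, while this sketch relies on the library default.

```python
# Sketch of the visualization pipeline: PCA to 20 dimensions,
# then t-SNE (perplexity = 30) down to 2 for a scatter plot.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1582))    # stand-in for openSMILE feature vectors
y = rng.integers(1, 10, size=300)   # stand-in language labels 1..9

X20 = PCA(n_components=20).fit_transform(X)         # pre-processing step
X2 = TSNE(n_components=2, perplexity=30,
          init="pca", random_state=0).fit_transform(X20)
print(X2.shape)  # two coordinates per utterance, one color per language
```

The resulting `X2` array gives the horizontal and vertical coordinates plotted in Fig. 5, with points colored by language label.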

Table 6 Optimum duration normalized features for the SLID system

#    Feature components                     Functionals
50   logMelFreqBand-sma[0]–[7]              amean, linregc2, pctlrange0–1, linregerrQ, percentile1.0, 99.0, quartile1,2,3, stddev
12   logMelFreqBand-sma-de[0]–[5],[7]       linregerrA, linregerrQ, pctlrange0–1, percentile1.0, stddev
20   lspFreq-sma[0],[1],[4],[6]             amean, percentile1.0, 99.0, quartile1,2,3, skewness, upleveltime75
3    lspFreq-sma-de[6]                      iqr2–3, linregerrA, quartile1,3
35   mfcc-sma[0]–[3],[5]–[8]                amean, linregc2, pctlrange0–1, percentile1.0, 99.0, quartile1,2,3
7    mfcc-sma-de[0],[1],[6]                 percentile1.0, linregerrA, linregerrQ, stddev
1    pcm-loudness-sma                       amean
10   pcm-loudness-sma-de                    iqr1–2, iqr1–3, iqr2–3, linregerrA, linregerrQ, pctlrange0–1, percentile1.0, 99.0, quartile1,3, stddev
10   pcm-loudness-sma                       iqr1–3, linregc2, linregerrA, linregerrQ, pctlrange0–1, percentile1.0, 99.0, quartile1,2,3, stddev
2    shimmerLocal-sma                       quartile1,2

Table 7 Language identification accuracy using different feature sets (30 s data set used for training)

             Number of features
Utt Dur (s)  25     50     100    125    150
30           96.1   99.5   97.3   94.6   97.7
15           98.3   98.7   98.9   98.9   90.7
10           98.9   98.7   99.4   98.9   98.9
5            97.2   98.7   98.9   98.9   98.7
3            90.6   98.5   96.1   95.9   98.6
1            92.3   97.8   98.3   97.8   97.8
0.5          81.8   91.2   95.0   93.3   94.4
0.2          44.2   49.5   48.1   28.2   47.8

Table 8 Language identification accuracy using different feature sets (5 s data set used for training)

             Number of features
Utt Dur (s)  25     50     100    125    150
30           83.8   92.4   87.2   87.0   94.2
15           90.3   96.1   95.6   94.6   96.5
10           91.0   98.0   96.5   95.9   98.9
5            96.8   98.4   99.0   98.9   99.2
3            90.8   97.8   95.3   95.8   99.1
1            91.0   98.0   96.5   95.9   98.9
0.5          79.1   91.8   87.8   90.0   94.6
0.2          44.8   50.7   47.0   47.6   46.4


Influence of feature selection on SLID accuracy for duration mismatched condition

The DNFS, as described in "Proposed Duration Normalized Feature Selection (DNFS)", is used to alleviate the short utterance duration and mismatched condition issues in the baseline system. Increasing the number of features from 1 (the most important feature) to 1582 (all features) showed that the first 150 most important features are optimum for the SLID system under both matched and mismatched conditions. The performance of the SLID systems using the 150 optimum duration normalized features was tested for matched conditions, and all classifiers showed performance similar to the baseline SLID system for different utterance durations. In mismatched conditions, the performance of the SLID system using the optimum features showed significant improvement. Tables 9 and 10 compare the effect of the optimum duration normalized features using three individual classifiers and two fusion techniques for the SLID system trained and tested with data sets of varying utterance duration.

It is noticeable that, despite discarding 90% of the initial features, the 150 optimum features compare well with the performance obtained using all 1582 features. Incremental trends are observed in identification accuracy for different utterance durations. The SLID system's accuracy clearly improves with the selection of duration normalized features, especially for mismatched conditions. A significant saving in the computational time required for feature extraction was also observed.
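The duration normalized selection behind these results can be sketched as follows, assuming one random-forest importance vector is computed per utterance-duration data set, the vectors are averaged, and the top-k features are kept. The function name and the synthetic data are illustrative, not the paper's code or corpus.

```python
# Minimal sketch of duration normalized feature selection: fit one
# random forest per duration data set, average the importance
# vectors, and keep the k highest-ranked feature indices.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_duration_normalized(datasets, k=150):
    """datasets: list of (X, y) pairs, one per utterance duration."""
    importances = []
    for X, y in datasets:
        rf = RandomForestClassifier(n_estimators=100, random_state=0)
        rf.fit(X, y)
        importances.append(rf.feature_importances_)
    avg = np.mean(importances, axis=0)   # duration normalized importance
    return np.argsort(avg)[::-1][:k]     # indices of the k best features

rng = np.random.default_rng(0)
datasets = [(rng.normal(size=(120, 40)), rng.integers(0, 3, 120))
            for _ in range(3)]           # three mock duration data sets
top = select_duration_normalized(datasets, k=10)
print(top)  # ranked indices of the 10 most important features
```

In the paper's setting, `datasets` would hold the openSMILE feature matrices for each of the eight duration data sets and `k` would be 150.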

Figure 6 shows the time required to compute the features for different utterance durations; it grows with the length of the utterance. For 30 s and 0.2 s utterances, the computational time required is 2.14 s and 15.67 ms, respectively.

Influence of Feature Selection with Modified Hierarchical Classifier on SLID Accuracy for Duration Matched and Mismatched Condition

The results of the proposed modified hierarchical classifier model using the DNFS algorithm for the matched condition are compared in Table 11. We computed the feature importance for the Indo-Aryan and Dravidian language families on the different-duration data sets for the proposed modified hierarchical classifier using the DNFS algorithm. A total of 25 optimum features were selected from the original 1582 features separately for the Indo-Aryan and Dravidian language families, and the weighted average recognition accuracy over the two families' results is shown in Table 11.
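The weighted average over the two family results can be computed as a sample-count-weighted mean; the accuracies and utterance counts below are illustrative placeholders, not values from the paper.

```python
# Weighted average of family-level accuracies, weighted by the
# number of test utterances in each language family.
def weighted_accuracy(acc_counts):
    """acc_counts: list of (accuracy, n_test_samples) per family."""
    total = sum(n for _, n in acc_counts)
    return sum(a * n for a, n in acc_counts) / total

# e.g. Indo-Aryan: 98.0% over 500 utterances, Dravidian: 96.0% over 400
print(round(weighted_accuracy([(98.0, 500), (96.0, 400)]), 2))  # 97.11
```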

It is clear from Table 12 that the DNFS method and the proposed modified hierarchical classifier model again perform very well under the mismatched condition. As discussed earlier, separate 25-feature optimum sets were used to train the modified hierarchical classifier for the Indo-Aryan and Dravidian language families, and the weighted average accuracy was calculated from the two families' results. The accuracy improvement of the modified hierarchical classifier model is significantly higher than that obtained with the 150 reduced features (see Table 9); in particular, it improves recognition accuracy for short durations. This adds strength to our claim that the proposed feature selection and modified hierarchical classifier methods effectively identify the spoken Indian language from short-duration utterances under mismatched conditions.

Fig. 5 t-SNE visualization (perplexity = 30, 250 iterations, pre-processed by PCA to 20 dimensions) for (a) the 1582 openSMILE features and (b) the selected 150 duration normalized openSMILE features for 30 s utterances

The next series of experiments evaluates the effectiveness of DNFS with the proposed modified hierarchical classifier and the output score fusion technique across the mismatched condition, as shown in Table 13. It is observed from Tables 9 and 10 that the accuracy of the classifiers degrades when long utterance samples (1, 3, 5, 10, 15 and 30 s) are used for training and 0.2 s short utterances are used for testing. This may be because languages from different families get confused for very short utterances. However, the proposed modified hierarchical classifier using reduced features improved the SLID system's recognition accuracy when long utterances are used for training and short utterances for testing within a particular language family.
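The two-step decision with the additional reject class can be sketched as follows. The classifier callables and label names are hypothetical stand-ins; in the paper, each stage is itself an ANN/SVM/RF ensemble.

```python
# Sketch of the modified hierarchical decision: an inter-family
# classifier first picks Indo-Aryan vs Dravidian, then the matching
# intra-family classifier names the language. Each intra-family
# classifier carries an extra "other" class that rejects utterances
# routed to the wrong family, triggering a fallback to the other one.
OTHER = "other"

def hierarchical_predict(x, family_clf, intra_clfs):
    family = family_clf(x)            # step 1: estimate language family
    label = intra_clfs[family](x)     # step 2: language within that family
    if label == OTHER:                # reject class fired: wrong family
        fallback = next(f for f in intra_clfs if f != family)
        label = intra_clfs[fallback](x)
    return label

# Toy stand-ins: the family classifier errs on this utterance, and the
# intra-family reject class recovers the Dravidian language.
clfs = {"indo_aryan": lambda x: OTHER, "dravidian": lambda x: "Tamil"}
print(hierarchical_predict(None, lambda x: "indo_aryan", clfs))  # Tamil
```

This illustrates why the extra class matters: without it, a wrong inter-family decision would force a wrong language label with no chance of recovery.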

Over the years, several researchers have proposed different feature extraction, fusion techniques, and classification algorithms in the field of language identification. Recently, a few feature selection algorithms have been proposed to optimize the feature sets for Indian languages. All these algorithms focused on enhancing the recognition accuracy for fixed-duration utterances, and very few efforts have been made to improve the recognition accuracy on speech samples of varying utterance duration. This paper proposed DNFS and a modified hierarchical classifier to compensate for the duration-mismatched condition and improve recognition accuracy for short durations for nine Indian languages using eight different-duration data sets. The proposed algorithm is language-independent and can achieve higher classification performance on Indian and foreign languages. However, for different data sets, the optimum feature sets need to be recalculated as per the algorithm reported in "Proposed Duration Normalized Feature Selection (DNFS)".

Table 9 (a) ANN, (b) RF, (c) SVM with 150 reduced features under mismatched condition

(a) ANN
Train (s)   Test (s)
            30     15     10     5      3      1      0.5    0.2
30          97.7   97.8   97.3   94.2   86.0   63.3   44.8   22.0
15          90.7   90.7   99.7   96.5   90.7   70.1   49.8   24.3
10          98.9   99.9   99.3   98.9   92.2   73.6   51.7   24.8
5           98.7   99.8   99.9   99.2   94.4   80.2   56.5   27.3
3           98.6   99.3   99.5   99.1   98.6   86.1   63.7   30.0
1           97.8   99.5   99.4   99.5   97.4   95.5   81.3   36.0
0.5         94.4   94.9   94.8   94.6   92.5   94.7   88.9   51.3
0.2         47.8   45.8   45.8   46.4   45.6   53.1   62.6   75.1

(b) RF
Train (s)   Test (s)
            30     15     10     5      3      1      0.5    0.2
30          97.2   96.7   95.9   92.3   85.0   60.5   41.6   24.2
15          88.2   88.2   99.4   96.5   88.2   65.9   44.9   24.4
10          98.9   99.8   98.7   97.6   89.7   69.1   47.3   25.0
5           98.9   99.7   99.6   98.0   92.7   76.0   53.0   26.5
3           98.7   98.7   98.5   97.7   96.0   79.9   58.3   28.8
1           96.2   97.3   97.2   96.9   92.9   89.5   77.0   37.8
0.5         89.3   90.0   90.0   90.3   84.1   89.0   81.1   53.8
0.2         59.5   58.7   58.9   59.1   59.1   62.4   66.1   68.0

(c) SVM
Train (s)   Test (s)
            30     15     10     5      3      1      0.5    0.2
30          97.1   97.5   97.1   95.3   89.5   66.9   41.0   13.2
15          92.0   92.0   99.2   98.2   92.0   70.9   42.5   12.3
10          98.7   99.6   99.0   98.7   93.1   74.2   45.9   13.2
5           98.6   99.5   99.5   98.6   94.7   81.2   52.3   17.0
3           98.1   98.3   97.8   97.6   96.6   85.3   59.4   22.7
1           78.0   79.6   97.5   78.7   76.5   77.8   64.5   28.7
0.5         57.0   57.9   57.6   59.1   55.5   59.7   51.9   32.2
0.2         18.6   23.9   24.1   23.9   25.0   25.0   25.4   31.3


Comparison with Previous Methods

A comparison of our system with other systems proposed in the literature for very short utterance durations in the duration-matched condition is shown in Table 14. It is observed that openSMILE features with output score fusion obtained best accuracies of 98.6% and 78.6% for 5 s and 0.2 s, respectively. With the proposed modified hierarchical classifier using DNFS, the best accuracy is 99.8% for 5 s and 84.1% for 0.2 s. Our proposed system performs better for very short utterance durations than the methods mentioned above.

We compared our proposed DNFS with other methods from the literature for the duration-matched condition, as shown in Table 15. In all cases, our proposed system shows the best recognition accuracy. A comprehensive study by Fernando et al. [29] used an i-vector + BLSTM to compensate for mismatched duration conditions and reported 66.8% accuracy when tested with a 1 s data set. In comparison, the proposed DNFS algorithm and the modified hierarchical classifier yielded 68.8% and 84.7% accuracy, respectively, when trained on the 30 s data set and tested with the 1 s data set.

Table 10 (a) ANN+SVM, (b) ANN+SVM+RF with 150 reduced features under mismatched condition

(a) ANN+SVM
Train (s)   Test (s)
            30     15     10     5      3      1      0.5    0.2
30          98.6   98.3   97.8   95.9   90.0   67.6   45.6   25.0
15          92.7   92.6   99.8   98.3   92.7   71.5   49.9   24.9
10          99.3   99.9   99.3   99.4   93.0   74.9   52.1   25.5
5           99.3   99.8   99.9   99.6   95.0   81.9   57.1   27.8
3           99.4   99.5   99.8   99.5   99.2   87.1   64.6   30.8
1           98.4   99.6   99.6   99.8   98.4   96.1   81.5   38.5
0.5         95.4   94.9   95.6   95.6   92.9   95.7   89.0   54.4
0.2         60.0   59.2   59.2   59.6   59.9   62.7   62.6   76.1

(b) ANN+SVM+RF
Train (s)   Test (s)
            30     15     10     5      3      1      0.5    0.2
30          99.3   99.0   98.7   96.8   90.7   68.8   46.3   25.9
15          93.6   93.4   100    99.0   93.8   72.6   50.4   25.6
10          99.8   99.9   100    99.8   93.2   75.4   52.4   25.9
5           99.7   100    100    100    95.8   82.8   57.5   28.3
3           99.8   100    100    99.9   99.8   88.0   65.1   31.6
1           99.0   99.9   100    100    98.9   96.8   82.3   39.1
0.5         96.1   95.4   96.5   96.7   93.7   96.4   89.8   55.1
0.2         60.4   59.9   59.8   60.2   60.5   63.6   67.4   77.0

Fig. 6 Computation time required for reduced feature extraction

Table 11 Performance of the SLID system with the modified hierarchical classifier and reduced features under matched condition

Utt (s)   ANN (A)   RF (R)   SVM (S)   A+S    A+S+R
30        95.9      98.5     95.7      98.1   99.4
15        97.6      96.9     99.0      99.0   99.5
10        99.3      99.1     99.1      99.3   99.8
5         98.5      98.0     98.5      98.8   99.8
3         97.8      98.2     97.5      98.1   99.0
1         95.6      91.1     75.2      95.7   96.6
0.5       85.6      80.8     55.3      85.9   86.7
0.2       81.4      75.5     60.4      83.2   84.1


Table 12 (a) ANN, (b) RF, (c) SVM with reduced features under mismatched condition using modified hierarchical classifier

(a) ANN
Train (s)   Test (s)
            30     15     10     5      3      1      0.5    0.2
30          95.9   97.6   82.2   96.6   70.8   65.4   75.0   57.8
15          93.9   97.6   80.1   97.3   66.4   62.2   69.7   56.2
10          89.4   96.2   82.7   97.2   71.1   65.7   75.4   58.0
5           67.1   90.8   81.6   98.5   68.1   63.9   71.5   57.8
3           74.2   78.7   76.5   85.3   69.4   65.0   72.8   58.8
1           56.1   66.5   67.2   63.8   69.8   63.0   75.3   53.2
0.5         50.5   54.6   67.2   54.0   77.7   67.8   85.6   53.6
0.2         58.5   47.4   53.4   47.5   58.1   67.7   50.4   81.4

(b) RF
Train (s)   Test (s)
            30     15     10     5      3      1      0.5    0.2
30          86.2   90.1   83.0   95.7   72.9   82.1   65.5   46.9
15          86.4   90.4   83.1   96.3   72.6   81.1   65.7   45.8
10          86.0   90.1   82.8   96.0   72.2   80.6   65.5   45.9
5           87.5   91.8   84.1   98.0   73.0   81.4   66.3   45.1
3           77.2   79.7   75.3   83.2   68.9   76.5   62.9   46.5
1           71.0   67.9   73.4   63.6   81.3   91.1   73.4   43.7
0.5         63.7   60.9   66.0   56.9   73.2   63.8   80.8   47.3
0.2         47.7   46.2   48.9   44.1   52.7   52.5   52.9   67.7

(c) SVM
Train (s)   Test (s)
            30     15     10     5      3      1      0.5    0.2
30          88.0   94.8   82.6   92.5   74.7   50.9   36.1   35.4
15          89.6   99.0   82.0   93.1   73.2   49.6   35.5   33.2
10          88.0   94.4   82.9   93.4   74.5   50.3   36.7   34.6
5           86.8   90.0   84.2   98.5   72.7   50.0   37.9   34.4
3           84.9   78.3   90.1   81.0   97.5   48.1   37.8   36.1
1           60.9   64.0   58.5   60.4   57.0   70.8   40.7   34.9
0.5         50.6   51.4   49.9   52.6   47.8   42.0   55.3   37.8
0.2         50.2   56.0   45.6   45.8   45.5   40.2   34.8   60.4

Table 13 (a) ANN+SVM, (b) ANN+SVM+RF with reduced features under mismatched condition using modified hierarchical classifier

(a) ANN+SVM
Train (s)   Test (s)
            30     15     10     5      3      1      0.5    0.2
30          99.4   97.8   97.1   96.6   96.0   84.5   74.4   58.8
15          94.6   99.5   97.3   97.4   96.2   83.9   70.2   59.5
10          92.7   96.4   99.8   97.4   96.0   84.3   75.0   58.2
5           72.1   91.7   93.1   99.8   94.5   84.9   71.0   58.5
3           75.5   79.4   81.2   85.6   99.0   81.5   70.8   59.2
1           58.0   67.7   63.3   64.2   68.3   96.6   75.3   53.4
0.5         54.7   56.8   56.6   56.7   53.9   65.2   86.7   53.6
0.2         55.6   53.2   50.5   48.5   49.7   52.4   51.3   83.5

(b) ANN+SVM+RF
Train (s)   Test (s)
            30     15     10     5      3      1      0.5    0.2
30          98.9   98.3   97.6   97.1   96.7   84.7   76.0   61.9
15          95.5   99.2   97.7   97.6   96.4   84.6   73.1   62.5
10          94.2   96.8   99.5   97.7   96.2   84.7   75.7   62.9
5           81.6   92.0   93.5   99.1   95.0   85.3   72.6   62.7
3           76.0   79.9   81.7   86.2   98.9   82.0   73.5   61.8
1           59.7   68.2   64.1   64.8   69.2   96.0   76.0   59.9
0.5         54.8   59.0   57.3   57.0   55.3   66.5   86.1   62.4
0.2         58.6   54.3   51.1   48.6   50.9   53.0   53.1   84.1


Conclusion

Indian language identification is crucial for vernacular call centers to automatically route incoming customer calls to the respective language experts. This paper proposed a novel duration normalized feature selection and a two-step modified hierarchical classifier for spoken language identification of Indian languages across different utterance durations. Each utterance was represented using 1582 features extracted with the openSMILE toolkit. Random forest-based models were developed to calculate an importance vector for the features, and the optimum 150 duration normalized features were obtained by averaging the importance vectors over the different-duration data sets. These features improved the SLID system's accuracy under train-test duration-mismatched conditions, but the accuracy dropped with decreasing utterance duration. A two-step modified hierarchical classifier, a cascade of inter-family and intra-family classifiers, was proposed to solve this issue. An additional class was added to each intra-family classifier to reject a falsely estimated language family. All experiments were evaluated using the All India Radio data set developed by us; the data set was carefully processed to generate eight small-duration databases. The results showed that the combination of duration normalized features and the modified hierarchical classifier improved accuracy for short utterance durations and mismatched conditions. We achieved an accuracy of 84.1% for the very short utterance duration of 0.2 s with the same-duration train-test data set, and 61.9% when the classifier was trained with the 30 s data set and tested with the 0.2 s data set.

Recent work [30] evidences that, for deep neural network architectures, increasing the number of hidden layers increases the number of hyperparameters, which makes the DL model more complex. Hence, it requires more execution time and computational power; moreover, it degrades the performance of the system in the duration-mismatched condition. The proposed feature selection algorithm achieved better results with less execution time and computational power.

Table 14 Comparison of our baseline system and the proposed system with other published work for short-duration utterances

Author                       Classifier/method   Features used    Duration (s)   Accuracy (%)
Koolagudi et al. [27]        ANN                 MFCC+SDC+P+E     5              86.7
                                                                  2.5            83.4
                                                                  1.25           78.5
Adeeba and Hussain [23]      BLSTM               MFCC+GFCC        0.4            50
Zazo et al. [28]             LSTM-RNN            i-vector         0.5            50
Own baseline system          ANN+SVM+RF          openSMILE        5              98.6
                                                                  3              98.9
                                                                  1              96.5
                                                                  0.5            92.6
                                                                  0.2            78.6
Proposed modified            ANN+SVM+RF          DNFS             5              99.8
hierarchical classifier                                           3              99.0
                                                                  1              96.6
                                                                  0.5            86.7
                                                                  0.2            84.1

Table 15 Comparison of the proposed DNFS algorithm with other published work for the duration-matched condition

Author                 Feature selection method    Database                   Duration (s)    Accuracy (%)
Chowdhury et al. [16]  GWO using ANN               Indic TTS Database-IITM    Not mentioned   96.6
Guha et al. [6]        HS-NRM using RF             Indic TTS Database-IITM    10              99.7
Das and Roy [3]        BBA-LAHC using RF           Indic TTS Database-IITM    5               92.3
                                                   IIT Hyderabad              5               100
The proposed method    DNFS using ANN+SVM+RF       Own database               10              100
                                                   Own database               5               100


However, since the proposed method is new, there is scope for improvement, and the analysis can be extended to well-known Indian-language data sets. The IIIT-H and IIIT-M data sets are freely available, with fixed utterance durations and varying speech samples. The problem with these data sets is the non-availability of short-utterance speech samples, so we cannot use them in their existing format; to meet our requirement of short utterance durations, the speech samples would have to be segmented. Our future work aims at a comprehensive analysis of the proposed work using open-source data sets. To the best of our knowledge, this is the first work focused on implementing a feature selection algorithm for duration-mismatched conditions, especially short-duration utterances.

Declarations

Conflict of interest The authors declare that they have no conflict of interest.

References

1. Aarti B, Kopparapu SK. Spoken Indian language identification: a review of features and databases. Sadhana (Acad Proc Eng Sci). 2018;43:4.

2. Torres-Carrasquillo PA, Singer E, Kohler MA, Greene RJ, Reynolds DA, Deller JR. Approaches to language identification using Gaussian mixture models and shifted delta cepstral features. In: INTERSPEECH. 2002.

3. Das H, Roy P. Bottleneck feature-based hybrid deep autoencoder approach for Indian language identification. Arab J Sci Eng. 2020. https://doi.org/10.1007/s13369-020-04430-9.

4. China Bhanja C, Laskar MA, Laskar RH. A pre-classification-based language identification for Northeast Indian languages using prosody and spectral features. Circuits Syst Signal Process. 2019;38(5):2266. https://doi.org/10.1007/s00034-018-0962-x.

5. Koolagudi SG, Rastogi D. Spoken language identification using spectral features. In: Contemporary Computing. IC3 2012. Communications in Computer and Information Science. https://doi.org/10.1007/978-3-642-32129-0_52.

6. Guha S, Das A, Singh PK, Ahmadian A, Senu N, Sarkar R. Hybrid feature selection method based on harmony search and naked mole-rat algorithms for spoken language identification from audio signals. IEEE Access. 2020;8:182868. https://doi.org/10.1109/ACCESS.2020.3028121.

7. Travadi R, Segbroeck MV, Narayanan SS. Modified-prior i-vector estimation for language identification of short duration utterances. In: INTERSPEECH. 2014.

8. Wang M, Song Y, Jiang B, Dai L, McLoughlin I. Exemplar based language recognition method for short-duration speech segments. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. 2013. pp. 7354–8. https://doi.org/10.1109/ICASSP.2013.6639091.

9. Dehak N, Torres-Carrasquillo P, Reynolds D, Dehak R. Language recognition via i-vectors and dimensionality reduction. In: INTERSPEECH. 2011.

10. Poddar A, Sahidullah M, Saha G. Performance comparison of speaker recognition systems in presence of duration variability. In: 2015 Annual IEEE India Conference (INDICON). 2015. pp. 1–6. https://doi.org/10.1109/INDICON.2015.7443464.

11. Bakshi A, Kopparapu SK. GMM supervector approach for spoken Indian language identification for mismatch utterance length. Bull Electr Eng Inform. 2021;10(2):1114. https://doi.org/10.11591/eei.v10i2.2861.

12. Kasongo SM, Sun Y. A deep learning method with filter based feature engineering for wireless intrusion detection system. IEEE Access. 2019;7:38597.

13. Arruti A, Cearreta I, Alvarez A, Lazkano E, Sierra B. Feature selection for speech emotion recognition in Spanish and Basque: on the use of machine learning to improve human-computer interaction. PLOS One. 2014;9(10):1. https://doi.org/10.1371/journal.pone.0108975.

14. Wutzl B, Leibnitz K, Rattay F, Kronbichler M, Murata M, Golaszewski SM. Genetic algorithms for feature selection when classifying severe chronic disorders of consciousness. PLOS One. 2019;14(7):1. https://doi.org/10.1371/journal.pone.0219683.

15. Venkatesh B, Anuradha J. A review of feature selection and its methods. Cybernet Inf Technol. 2019;19.

16. Chowdhury AA, Borkar VS, Birajdar GK. Indian language identification using time-frequency image textural descriptors and GWO-based feature selection. J Exp Theor Artif Intell. 2020;32(1):111. https://doi.org/10.1080/0952813X.2019.1631392.

17. Kumar VR, Vydana HK, Vuppala AK. Significance of GMM-UBM based modelling for Indian language identification. Procedia Comput Sci. 2015;54:231.

18. Campbell WM, Campbell JP, Reynolds DA, Singer E, Torres-Carrasquillo PA. Support vector machines for speaker and language recognition. Comput Speech Lang. 2006;20(2):210. https://doi.org/10.1016/j.csl.2005.06.003.

19. Sengupta D, Saha G. Automatic recognition of major language families in India. In: 2012 4th International Conference on Intelligent Human Computer Interaction (IHCI). 2012. pp. 1–4. https://doi.org/10.1109/IHCI.2012.6481844.

20. Aarti B, Kopparapu SK. Spoken Indian language classification using ANN and multi-class SVM. In: 2018 International Conference on Advances in Communication and Computing Technology (ICACCT). 2018. pp. 213–8. https://doi.org/10.1109/ICACCT.2018.8529569.

21. Jothilakshmi S, Ramalingam V, Palanivel S. A hierarchical language identification system for Indian languages. Digit Signal Process. 2012;22(3):544. https://doi.org/10.1016/j.dsp.2011.11.008.

22. Aarti B, Kopparapu SK. Spoken Indian language classification using artificial neural network: an experimental study. In: 2017 4th International Conference on Signal Processing and Integrated Networks (SPIN). 2017. pp. 424–30. https://doi.org/10.1109/SPIN.2017.8049987.

23. Adeeba F, Hussain S. Native language identification in very short utterances using bidirectional long short-term memory network. IEEE Access. 2019;7:17098.

24. Schuller B, Steidl S, Batliner A, Burkhardt F, Devillers L, Müller C, Narayanan S. The INTERSPEECH 2010 paralinguistic challenge. In: Proc. INTERSPEECH. 2010.

25. Eyben F, Wöllmer M, Schuller B. openSMILE: the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia (MM '10). New York, NY: Association for Computing Machinery; 2010. pp. 1459–62. https://doi.org/10.1145/1873951.1874246.


26. Bakshi A, Kopparapu SK. Spoken Indian language identification. 2020. https://doi.org/10.21227/xm4q-s210.

27. Koolagudi S, Bharadwaj A, Murthy S, et al. Dravidian language classification from speech signal using spectral and prosodic features. Int J Speech Technol. 2017;20:1005. https://doi.org/10.1007/s10772-017-9466-5.

28. Zazo R, Lozano-Diez A, Gonzalez-Dominguez J, Toledano DT, Gonzalez-Rodriguez J. Language identification in short utterances using long short-term memory (LSTM) recurrent neural networks. PLOS One. 2016;11:1. https://doi.org/10.1371/journal.pone.0146917.

29. Fernando S, Sethu V, Ambikairajah E, Epps J. Bidirectional modelling for short duration language identification. In: INTERSPEECH. 2017.

30. Justus D, Brennan J, Bonner S, McGough AS. Predicting the computational cost of deep learning models. CoRR. 2018. http://arxiv.org/abs/1811.11880.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.