
Speaker Diarization and Identification Using Machine Learning
Enhong Deng, REU Student

Graduate Mentor: Abhinav Dixit, Faculty Advisors: Andreas Spanias, Visar Berisha

SenSIP Center, School of ECEE, Arizona State University
Department of Speech and Hearing Science, Arizona State University
Sensor Signal and Information Processing Center, http://sensip.asu.edu
SenSIP Algorithms and Devices REU

ABSTRACT

Speaker diarization identifies speakers in long speech recordings.

Form speech segments and remove undesired noise and unvoiced sections.

Form i-vectors from features extracted from the speech segments.

Train a machine learning model on the extracted features.

Classify new speech segments according to speaker identity.

MOTIVATION

Speaker Recognition: track the active speaker in a conversation with multiple speakers.

Audio Indexing: detect the change of speakers as a pre-processing step for automatic transcription.

Information Retrieval: examine the contributions of speakers in speech recordings.

MACHINE LEARNING ALGORITHM

Voice-Activity Detection (VAD): identifies non-speech sounds and retains only the actual speech.
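
As a rough illustration only (not the exact detector used in this work), a simple energy-based VAD can be sketched in Python with librosa: frames whose RMS energy falls below a threshold are treated as non-speech and discarded. The frame sizes and threshold are assumptions.

    import numpy as np
    import librosa

    def simple_vad(path, frame_len=1024, hop_len=512, threshold_db=-35.0):
        # Load the recording and compute per-frame RMS energy.
        y, sr = librosa.load(path, sr=None)
        rms = librosa.feature.rms(y=y, frame_length=frame_len,
                                  hop_length=hop_len, center=False)[0]
        rms_db = librosa.amplitude_to_db(rms, ref=np.max(rms))
        voiced = rms_db > threshold_db          # True for speech-like frames
        frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop_len)
        return frames[:, voiced], sr            # keep only the voiced frames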

I-Vectors: identity information is extracted using MFCC features, and low-dimensional i-vectors are formed to represent the utterances in the speech.
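
For illustration, per-segment MFCC features can be extracted with librosa as sketched below; the i-vector step itself additionally requires a trained UBM / total-variability model (typically from a dedicated speaker-recognition toolkit) and is not reproduced here. The 16 kHz sampling rate and 20 coefficients are assumptions.

    import librosa

    def extract_mfcc(path, n_mfcc=20):
        # Returns one MFCC vector per analysis frame (rows = frames, columns = coefficients).
        y, sr = librosa.load(path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        return mfcc.T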


ACKNOWLEDGEMENT

This project was funded in part by the National Science Foundation under Grant No. CNS 1659871, REU Site: Sensors, Signal and Information Processing Devices and Algorithms.

PROBLEM STATEMENT

Perform both supervised and unsupervised speaker diarization in a telephone conversation.

Distinguish between male and female speakers to answer the question "Who speaks when?"

REFERENCES

[1] J. H. L. Hansen and T. Hasan, "Speaker Recognition by Machines and Humans: A Tutorial Review," IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 74-99, Nov. 2015.
[2] H. Song, M. Willi, J. J. Thiagarajan, V. Berisha, and A. Spanias, "Triple Network with Attention for Speaker Diarization," in Proc. Interspeech, 2018.
[3] http://multimedia.icsi.berkeley.edu/speaker-diarization/

RESULTS

Unsupervised learning: 98.5% accuracy in clustering. All data is partitioned into three clusters, and each cluster represents a speaker class.

Supervised learning: 97.7% accuracy in classification. 75% of the data is used to train an SVM model, and the remaining 25% is used to test the trained model.

METHODS

Support Vector Machines (SVM)

Given labeled data, an SVM can be trained to develop a model capable of distinguishing among different classes.

The trained SVM model predicts the identity of the speaker in new speech data.
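
A minimal scikit-learn sketch of this supervised step, assuming X is a matrix of per-segment feature vectors (e.g., i-vectors) and y holds the speaker labels; the 75/25 split follows the RESULTS section, while the placeholder data and the RBF kernel are assumptions.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 40))      # placeholder per-segment features (e.g., i-vectors)
    y = rng.integers(0, 3, size=200)    # placeholder speaker labels

    # 75% of the data trains the SVM; the remaining 25% tests it.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    clf = SVC(kernel="rbf")             # kernel choice is an assumption
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("classification accuracy:", accuracy_score(y_test, y_pred))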

K-Means Clustering

With a known number of groups k, k centroids are randomly chosen.

K-means then partitions the data into k clusters.
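
A minimal sketch of this unsupervised step with scikit-learn, assuming the same kind of per-segment feature matrix X and k = 3 clusters (one per speaker class, per the RESULTS section); the placeholder data stands in for real features.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 40))              # placeholder per-segment feature vectors

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    cluster_ids = kmeans.fit_predict(X)         # one cluster label ("speaker") per segment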

In the confusion matrix, each column corresponds to the true class and each row corresponds to the predicted class.

Correct predictions are shown in green blocks, and false predictions are shown in red.
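
For illustration, the confusion matrix can be computed with scikit-learn as sketched below; note that scikit-learn's convention places true classes on the rows and predicted classes on the columns, the transpose of the layout described above. The labels are placeholders.

    from sklearn.metrics import confusion_matrix

    # Placeholder labels; in practice these come from the SVM test split.
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 0, 1, 2, 2, 2]
    cm = confusion_matrix(y_true, y_pred)   # rows = true class, columns = predicted class
    print(cm)                               # diagonal entries are correct predictions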