Simon Dobrišek
Speaker De-identification using Phoneme Re-synthesis Techniques
COST ACTION IC1206 DE-ID, 1st Training School, 7-11 October 2015, Limassol, Cyprus
Lecture Overview
• The research context
• Speaker de-identification issues
• The use of existing speech technologies for speaker de-identification
• Speaker de-identification performance measures
• A speaker de-identification system using phoneme re-synthesis technique
The Research Context
• Research activities of LUKS
• Speech biometrics research results
• Speech recognition research results
• Speech synthesis research results
Research activities
• Automatic speech recognition
• Artificial speech synthesis
• Emotion recognition from speech and faces
• Spoken language interfaces
• Biometric recognition of faces, speakers, palmprints, …
• Smart surveillance systems
Fields of application
• Spoken language interfaces
• Hands-free communication systems
• Ambient intelligent systems
• Biometric-based identity management systems
• Smart audio-visual surveillance systems
International cooperation
• Universities of Zagreb, Erlangen, Brno, Oslo, Vienna, Groningen, Perth, Malta, Sheffield …
• Fraunhofer Institute for Computer Graphics Research IGD
• International Biometric Group
• SAFRAN Morpho
• INTERPOL
• The Hastings Center
• Alpineon Ltd., TAB Systems Inc.
• Information Commissioner
• UN Special Rapporteur on Privacy
Speech biometrics research results
• Speaker state and emotion recognition using an HMM-based feature extraction method.
• Implementations and improvements of our own GMM-UBM-based and i-vector-based speaker recognition systems.
• Best performance at the speaker recognition competition held in conjunction with the IAPR ICB'13 conference on the MOBIO database.
• Ranked among the top 10 out of 9000 submitted systems in the NIST 2014 Speaker Recognition i-vector Machine Learning Challenge.
Speech recognition research results
• The proposal of multiresolution MFCC features.
• Using context dependent phone transition acoustic models for speech recognition (diphones, bi-diphones, tri-diphones, …).
• Incremental Finite-state transducer construction and redundancy reduction for pronunciation dictionary modelling.
• An edit-distance model for timed-strings.
• Time- and acoustic-mediated alignment algorithms for speech recognition performance measure.
Speech synthesis research results
• The first fully-developed TD-PSOLA-based Slovenian text-to-speech engine.
• A corpus-based concatenative speech synthesis system for the Slovenian language.
• An HMM-based speech synthesis system for the Slovenian language.
• Adaptation methods for training TTS acoustic models for under-resourced spoken languages from foreign language data.
• An HMM-based emotional speech synthesis system for the Slovenian language.
Speaker De-identification Issues
• Why are we developing speaker de-identification technologies?
• To protect anonymity while preserving intelligibility of the pure depersonalized linguistic information in the speech.
• To remove redundant paralinguistic information on the speaker from the speech.
• To enable the development of privacy-enhanced technologies (PETs), such as privacy-preserving personal verification systems.
Speaker de-identification issues
• Why do we need to protect anonymity?
• To protect witnesses in investigations, whistleblowers and journalists’ sources.
• To protect privacy of confidential medical information about the health of individuals.
• To protect freedom of expression and democracy.
Speaker de-identification issues
• Why would we like to remove redundant paralinguistic information from the speech?
• To potentially improve the robustness of speaker independent speech recognizers.
• To potentially improve the robustness of speaker-state recognizers and other voice-enabled user interfaces.
• To remove the information that is not necessary for the given application and poses a threat to privacy of the speaker.
Example
• Applications that use the Google Speech API send a raw or FLAC-encoded audio signal to Google’s data centers.
• These voice recordings are (most probably) stored in Google’s data centers and pose a threat to the privacy of the users.
• Voice conversion techniques for speaker de-identification could be used in this case to protect the privacy of the users.
The Google Full-duplex Speech API
• The Google full-duplex speech API is used via the usual HTTPS connection.
Anti speaker-recognition camouflage
• Some independent researchers are experimenting with different anti-surveillance solutions.
• Something similar may happen in the field of automatic speaker recognition.
The use of existing speech technologies for speaker de-identification
• State-of-the-art techniques of speaker de-identification commonly belong to one of the two following groups:
i. the group of voice-degradation approaches, or
ii. the group of voice-conversion approaches.
• Existing speech-to-text (STT) and text-to-speech (TTS) systems can be used for this purpose as well.
Speaker transformation using existing acoustical speech models
• Speakers can be transformed to target speakers by using existing STT and TTS acoustical models and by establishing a correspondence/mapping between them.
• GMM-, HMM- or corpus-based acoustical speech models can be used for this purpose in different combinations.
Modelling human speech communication process
» The human speech communication process is usually modelled as a sequence of stochastic mappings of random processes.
[sis],[d],[o:],[b],[@],[r],[d],[a:],[n],[Z],[E],[l],[i:],[m],[sie]
/d/,/o/,/b/,/@/,/r/,/d/,/a/,/n/,/Z/,/E/,/l/,/i/,/m/
dober,dan,želim
Dober dan želim !
<item>dober<tag>kakovost=pozitivno</tag></item> <item>dan<tag>čas=24h</tag></item> <item>želim<tag>…
An FST-based STT model
[Figure labels: acoustic-phonetic transducers, pronunciation dictionary transducer, grammar acceptor, token.]
An FST-based STT model
» A cascade of finite-state transducers can be merged into one large unified finite-state model that represents the human speech communication system.
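Merging a cascade into one model amounts to transducer composition. The sketch below is only an illustration under simplifying assumptions: the transducers are epsilon-free and unweighted, the dictionary-based representation and the names `compose` and `transduce` are hypothetical, and the two toy transducers (acoustic labels to phones, phones to upper-case symbols) are invented for the example rather than taken from the lecture.

```python
from collections import deque

def compose(t1, t2):
    """Compose two epsilon-free transducers (output of t1 feeds input of t2).

    A transducer is a dict with 'start', 'finals' (a set of states) and
    'arcs' (state -> list of (input, output, next_state)).
    """
    start = (t1["start"], t2["start"])
    arcs, finals = {}, set()
    queue, seen = deque([start]), {start}
    while queue:
        state = queue.popleft()
        q1, q2 = state
        if q1 in t1["finals"] and q2 in t2["finals"]:
            finals.add(state)
        for i1, o1, n1 in t1["arcs"].get(q1, []):
            for i2, o2, n2 in t2["arcs"].get(q2, []):
                if o1 == i2:  # arcs match when t1's output equals t2's input
                    nxt = (n1, n2)
                    arcs.setdefault(state, []).append((i1, o2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return {"start": start, "finals": finals, "arcs": arcs}

def transduce(t, symbols):
    """Return every output sequence the transducer accepts for the input."""
    results = []

    def walk(state, pos, out):
        if pos == len(symbols):
            if state in t["finals"]:
                results.append(out)
            return
        for i, o, n in t["arcs"].get(state, []):
            if i == symbols[pos]:
                walk(n, pos + 1, out + [o])

    walk(t["start"], 0, [])
    return results

# Toy single-state transducers (hypothetical labels).
acoustic_to_phone = {"start": 0, "finals": {0},
                     "arcs": {0: [("X1", "d", 0), ("X2", "a", 0), ("X3", "n", 0)]}}
phone_to_symbol = {"start": 0, "finals": {0},
                   "arcs": {0: [("d", "D", 0), ("a", "A", 0), ("n", "N", 0)]}}

cascade = compose(acoustic_to_phone, phone_to_symbol)
decoded = transduce(cascade, ["X1", "X2", "X3"])
```

Real recognizers use weighted transducers with epsilon transitions and dedicated toolkits; the point here is only that the composed model maps acoustic labels directly to the final symbols in one pass.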
Text-to-speech model
[Figure: processing stages: text, text preprocessing, G2P conversion, duration modelling, intonation modelling, signal generation. Language dependent knowledge: G2P rules and dictionaries, intrinsic and extrinsic rules, intonation models/rules, acoustic speech models.]
Establishing correspondence between models
[Figure: acoustic-phonetic source transducers paired with two target models: a GMM-based voice-conversion target model and a corpus-based concatenative TTS target model.]
A unified bi-directional FST-based acoustical model
[Figure: a transducer with states 1 to 4 and arcs labelled X1:d, X2:a, X3:m, X4:O, X5:r, X6:b, X7:n, X7:i; example symbol sequence: …,d,a,r,i,O,a,r,i,n,…; annotation: k-anonymity/l-diversity.]
Speaker De-identification Performance Measures
i. De-identification efficacy assessment
Can I recognize the original speaker?
ii. Intelligibility assessment
Is the de-identified speech still intelligible?
iii. Naturalness assessment
Does the de-identified speech sound natural?
iv. Other performance measures
Is important non-personal information still present?
Can identifying information be re-linked by a trusted party? …
Assessing the performance of speaker de-identification
• Automatic speech recognizers can be used to assess the intelligibility of the de-identified speech.
• Automatic speaker recognizers can be used to assess the efficacy of speech de-identification.
• Objective measures that are used for assessing the naturalness of TTS systems can be used to assess the naturalness of the de-identified speech.
• Emotion and speaker-state recognizers can be used to verify the presence of useful harmless paralinguistic information in the de-identified speech.
Assessing the performance of speaker de-identification
• Amazon Mechanical Turk or a similar crowdsourcing approach can be used to assess the performance measures mentioned above.
Are the two speakers the same?
How intelligible is the de-identified speech?
Does the de-identified speech sound natural?
Is important non-personal paralinguistic information still present?
A SPEAKER DE-IDENTIFICATION SYSTEM
• A phoneme re-synthesis system
• System evaluation setup and results
i. Intelligibility assessment
ii. De-identification efficacy assessment
• Shortcomings, improvements and further experiments
• Conclusions
Motivation
• To develop an experimental speaker de-identification system based on phoneme recognition and synthesis (re-synthesis) using our existing components.
• To experiment with a technique different from the existing classical voice conversion and voice degradation techniques.
• To explore the possibility of developing a unified bi-directional FST-based acoustical model.
Phone re-synthesis system
• Phoneme recognition module:
Context-dependent HMM-based bi-diphone acoustical models and a phonetic bigram language model.
• Phoneme (re-)synthesis modules:
HMM-based and PSOLA-based synthesizers built from the recordings of the two different target speakers
Context-dependent phonetic units
[Figure: an utterance along the time axis t with its orthographic transcription, phonetic transcription and time segmentation, together with the monophone, left bi-phone, right bi-phone and triphone units and their HMM models.]
Phone transition phonetic units
[Figure: an utterance along the time axis t with its orthographic transcription, phonetic transcription and time segmentation, together with the diphone, left bi-diphone and right bi-diphone units and their HMM models.]
Speech segmentation aspects of phone transition acoustical modelling
• Speech unit annotation confusion arises when dealing with phone transition models.
“left” - [sis l, eh, f, t, sil]
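One source of the annotation confusion is that a sequence of n phones yields only n-1 phone transition units, so unit boundaries no longer coincide with phone boundaries. The sketch below illustrates this, assuming a silence-padded phone sequence for the word "left" and a hypothetical `a-b` naming scheme for the transition units; neither the padding labels nor the naming are taken from the lecture.

```python
def to_diphones(phones):
    """Pair each phone with its successor to form phone transition units."""
    return [f"{left}-{right}" for left, right in zip(phones, phones[1:])]

# Assumed silence-padded phone sequence for the word "left".
phones = ["sil", "l", "eh", "f", "t", "sil"]
diphones = to_diphones(phones)

# Six phones give five transition units.
unit_count = len(diphones)
```

Each unit spans from somewhere inside one phone to somewhere inside the next, which is why a phone-level time segmentation cannot be reused directly for transition models.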
Speech segmentation aspects of phone transition acoustical modelling
• What speech signal segmentation can we expect when dealing with bi-phone and diphone acoustical models?
Training diphone acoustical model
• Diphone acoustical model initialization
Training diphone acoustical model
• Restructuring a diphone acoustical model
Establishing a correspondence between the models
[Figure: two target models: an HMM-based TTS target model and a diphone TD-PSOLA-based TTS target model.]
De-identifying speech
• From an input speech signal, the phoneme recognizer provides hypotheses about the (di)phone sequence and the durations of the individual phone units.
• The TD-PSOLA-based TTS system uses both types of information to generate artificial speech.
• The HMM-based TTS uses only the hypotheses on phone sequences, as prosody is generated using pre-trained statistical phone-level language models.
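The difference between the two synthesis paths can be sketched as follows. This is an illustration only: the `PhoneHypothesis` type, the function names and the example durations are hypothetical, and the actual recognizer output format and synthesizer interfaces are not described in the lecture.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PhoneHypothesis:
    """One recognized phone unit with its hypothesized duration."""
    phone: str
    duration_ms: int

def psola_input(hyps: List[PhoneHypothesis]) -> List[Tuple[str, int]]:
    """TD-PSOLA path: keeps both the phone identities and the durations."""
    return [(h.phone, h.duration_ms) for h in hyps]

def hmm_input(hyps: List[PhoneHypothesis]) -> List[str]:
    """HMM path: keeps only the phone sequence; the synthesizer supplies prosody."""
    return [h.phone for h in hyps]

# Hypothetical recognizer output for a short segment.
hyps = [PhoneHypothesis("d", 60), PhoneHypothesis("a", 110), PhoneHypothesis("n", 80)]

for_psola = psola_input(hyps)
for_hmm = hmm_input(hyps)
```

In both paths only symbolic information crosses from recognizer to synthesizer, so the waveform of the original speaker is never passed on.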
Results - Intelligibility assessment
• 28 test sentences from the GOPOLIS database:
7 male and 7 female speakers uttering 2 different sentences (with 5-8 words).
• 56 (2x28) de-identified speech recordings:
using two different speech synthesizers.
• 26 evaluators (13 males and 13 females) transcribed the de-identified speech recordings.
Intelligibility assessment
• Each evaluator transcribed 14 (2x7) randomly selected de-identified recordings of different test speakers.
• Each evaluator listened to each sentence only once.
• A total of 364 (26x14) transcriptions were obtained.
• Word error rates (WER) of the manual test transcriptions were analyzed.
• Phone error rates (PER) of the phone recognizer were obtained from the automatic phone transcriptions.
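Both WER and PER are commonly computed as the edit distance between a reference and a hypothesis token sequence, divided by the reference length. The sketch below assumes this standard definition with unit costs for substitutions, insertions and deletions; the lecture does not state the exact scoring procedure, and the example transcriptions are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)

# WER: tokens are words (an evaluator missed the last word).
wer = error_rate("dober dan želim".split(), "dober dan".split())

# PER: tokens are phones (the recognizer substituted one phone).
per = error_rate(["d", "a", "n"], ["d", "o", "n"])
```

The same function serves both measures; only the tokenization changes. Note that an error rate defined this way can exceed 1 when the hypothesis contains many insertions.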
Intelligibility assessment
• Word error rates of the listening tests were compared to the phone error rates of the phone recognizer.
• Points on the same vertical line correspond to transcriptions of the same test sentence by different evaluators.
[Figure annotation: PER = 0.5, WER = 0]
Intelligibility assessment
Observing average WER in relation to the PER for the two different speech synthesizers.
Intelligibility assessment
• The average WER and PER over all test utterances were computed separately for each speaker gender.
• Intelligibility of the de-identified speech seems to depend on the speaker’s gender.
Gender   WER (HMM)   WER (DIF)   PER
female   0.44        0.29        0.23
male     0.23        0.13        0.14
De-identification efficacy assessment
• The use of an automatic state-of-the-art text-independent i-vector-based speaker recognition system.
• The same test speaker identities were used as in the intelligibility test.
• The target speaker recordings were selected from our test database that was not used for building the system.
• Utterances approximately 12 seconds long were used for the speaker recognition tests.
Baseline performance of a speaker recognition system
• 8,832 genuine and 138,240 impostor verification attempts were conducted using the non-de-identified speech recordings.
• A TAR of 77.5% at 0.1% FAR and an EER of 2.36% were achieved.
[Figure: verification rate (0 to 1) versus false accept rate (log scale, 10^-3 to 10^0); curves: ASR, Random.]
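Operating points such as TAR at a fixed FAR and the EER can be derived from the lists of genuine and impostor verification scores. The sketch below shows one simple way to do this, assuming that higher scores mean a better match and using a coarse threshold sweep over the observed scores; the function names and the toy scores are hypothetical, and the lecture does not describe how its figures were computed.

```python
def far_at(threshold, impostor):
    """Fraction of impostor attempts accepted at this threshold."""
    return sum(s >= threshold for s in impostor) / len(impostor)

def tar_at(threshold, genuine):
    """Fraction of genuine attempts accepted at this threshold."""
    return sum(s >= threshold for s in genuine) / len(genuine)

def tar_at_far(genuine, impostor, target_far):
    """Highest TAR among thresholds whose FAR does not exceed target_far."""
    best = 0.0
    for t in sorted(set(genuine) | set(impostor)):
        if far_at(t, impostor) <= target_far:
            best = max(best, tar_at(t, genuine))
    return best

def equal_error_rate(genuine, impostor):
    """Approximate EER: mean of FAR and FRR where the two are closest."""
    best_gap, eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        far = far_at(t, impostor)
        frr = 1.0 - tar_at(t, genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy scores, invented for illustration.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]

tar_zero_far = tar_at_far(genuine, impostor, 0.0)
eer = equal_error_rate(genuine, impostor)
```

For scale, with 138,240 impostor attempts a FAR of 0.1% corresponds to about 138 false accepts, and a TAR of 77.5% over 8,832 genuine attempts corresponds to roughly 6,845 correct accepts.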
Efficacy of the de-identification procedure
• Speakers were enrolled with the natural speech recordings from our test database.
• Test data include only recordings of the de-identified speech.
[Figure: verification rate (0 to 1) versus false accept rate (log scale, 10^-3 to 10^0); curves: HMM, DIF, Random.]
Efficacy of the de-identification procedure
• In the second experiment, we tested the de-identified recordings of the two speakers that were used for building the two speech synthesizers.
Speaker 1: original, de-identified
Speaker 2: original, de-identified
[Figure: verification rate (0 to 1) versus false accept rate (log scale, 10^-3 to 10^0); curves: DIF, HMM, Random.]
Shortcomings, improvements and further experiments
• Performance of the system strongly depends on the accuracy of the phoneme recognizer.
• The naturalness of the de-identified speech should be improved.
• Different speakers cannot be distinguished from the de-identified speech (the voice is always the same).
• The synthesized speech could be transformed to reflect some broader characteristics of the input speaker.
Shortcomings, improvements and further experiments
• To distinguish different speakers, the identity of the original speaker could be mapped to the index of a selected target speaker using locality-sensitive hashing.
• The direct correspondence between the components of the STT and TTS acoustical models could be established.
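The locality-sensitive hashing idea could look roughly like the sketch below. It is a guess at one possible realization, not the method proposed in the lecture: it assumes the original speaker is represented by a fixed-length embedding (for example an i-vector), uses random-hyperplane hashing, and reduces the hash to a target speaker index with a modulo. All function names and the example vector are hypothetical.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (normal vectors) of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_bucket(vector, hyperplanes):
    """Hash a vector to an integer: one bit per hyperplane side."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

def target_speaker_index(vector, hyperplanes, n_targets):
    """Map the hash bucket to one of n_targets synthetic voices."""
    return lsh_bucket(vector, hyperplanes) % n_targets

planes = make_hyperplanes(dim=4, n_bits=8, seed=42)
speaker_vector = [0.3, -1.2, 0.7, 0.1]  # stand-in for a speaker embedding

index = target_speaker_index(speaker_vector, planes, n_targets=4)
# A positively scaled copy falls on the same side of every hyperplane.
index_scaled = target_speaker_index([2 * x for x in speaker_vector], planes, n_targets=4)
```

With such a mapping, recordings of the same speaker would tend to receive the same target voice, so speakers stay distinguishable after de-identification. Whether the mapping leaks too much about the original identity would need a separate analysis.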
Conclusions
• A relatively novel approach to developing a speaker de-identification system was presented.
• A robust diphone speech recognizer and two different speech synthesizers were combined to build the speaker de-identification system.
• Intelligibility of the de-identified speech was assessed using human evaluators and its efficacy evaluated using a state-of-the-art speaker recognition system.
• The proposed system does not require a full-fledged error-free speech recognition system.
Thank you for your attention!
Questions?