Speaker De-identification using Phoneme Re …costic1206.uvigo.es/sites/default/files/Training_School...speaker de-identification •State-of-the-art techniques of speaker de-identification

Simon Dobrišek

Speaker De-identification using Phoneme Re-synthesis

Techniques

COST ACTION IC1206 DE-ID, 1st Training School, 7-11 October 2015, Limassol, Cyprus

COST ACTION IC1206 DE-ID 1st Training School 7-11 October 2015, Limassol, Cyprus

Lecture Overview

• The research context

• Speaker de-identification issues

• The use of existing speech technologies for speaker de-identification

• Speaker de-identification performance measures

• A speaker de-identification system using phoneme re-synthesis technique


The Research Context

• Research activities of LUKS

• Speech biometrics research results

• Speech recognition research results

• Speech synthesis research results


Research activities

• Automatic speech recognition

• Artificial speech synthesis

• Emotion recognition from speech and faces

• Spoken language interfaces

• Biometric recognition of faces, speakers, palmprints, …

• Smart surveillance systems


Fields of application

• Spoken language interfaces

• Hands-free communication systems

• Ambient intelligent systems

• Biometric-based identity management systems

• Smart audio-visual surveillance systems


International cooperation

• Universities of Zagreb, Erlangen, Brno, Oslo, Vienna, Groningen, Perth, Malta, Sheffield …

• Fraunhofer Institute for Computer Graphics Research IGD

• International Biometric Group

• SAFRAN Morpho

• INTERPOL

• The Hastings Center

• Alpineon Ltd., TAB Systems Inc.

• Information Commissioner

• UN Special Rapporteur on Privacy


Speech biometrics research results

• Speaker state and emotion recognition using an HMM-based feature extraction method.

• Implementations and improvements of our own GMM-UBM-based and i-vector-based speaker recognition systems.

• Best performance at the speaker recognition competition held in conjunction with the IAPR ICB'13 conference on the MOBIO database.

• Ranked among the top 10 out of 9000 submitted systems in the NIST 2014 Speaker Recognition i-vector Machine Learning Challenge.


Speech recognition research results

• The proposal of multiresolution MFCC features.

• Using context dependent phone transition acoustic models for speech recognition (diphones, bi-diphones, tri-diphones, …).

• Incremental Finite-state transducer construction and redundancy reduction for pronunciation dictionary modelling.

• An edit-distance model for timed-strings.

• Time- and acoustic-mediated alignment algorithms for speech recognition performance measure.


Speech synthesis research results

• The first fully-developed TD-PSOLA-based Slovenian text-to-speech engine.

• A corpus-based concatenative speech synthesis system for the Slovenian language.

• An HMM-based speech synthesis system for the Slovenian language.

• Adaptation methods for training TTS acoustic models for under-resourced spoken languages from foreign language data.

• An HMM-based emotional speech synthesis system for the Slovenian language.


Speaker De-identification Issues

• Why we are developing speaker de-identification technologies?

• To protect anonymity while preserving intelligibility of the pure depersonalized linguistic information in the speech.

• To remove redundant paralinguistic information on the speaker from the speech.

• To enable the development of privacy-enhanced technologies (PETs), such as privacy-preserving personal verification systems.


• Why do we need to protect anonymity?

• To protect witnesses in investigations, whistleblowers and journalists’ sources.

• To protect privacy of confidential medical information about the health of individuals.

• To protect freedom of expression and democracy.

Speaker de-identification issues


• Why would we like to remove redundant paralinguistic information from the speech?

• To potentially improve the robustness of speaker independent speech recognizers.

• To potentially improve the robustness of speaker-state recognizers and other voice-enabled user interfaces.

• To remove the information that is not necessary for the given application and poses a threat to privacy of the speaker.

Speaker de-identification issues


• Applications that use the Google Speech API send raw or FLAC encoded audio signal to Google’s data centers.

• These voice recordings are (most probably) stored in Google’s data centers and pose a threat to privacy of the users.

• Voice conversion techniques for speaker de-identification could be used in this case to protect the privacy of the users.

Example


• The Google full-duplex speech API is used via the usual HTTPS connection.

The Google Full-duplex Speech API


• Some independent researchers are experimenting with different anti-surveillance solutions.

• Something similar may happen in the field of automatic speaker recognition.

Anti speaker-recognition camouflage


The use of existing speech technologies for speaker de-identification

• State-of-the-art techniques of speaker de-identification commonly belong to one of the two following groups:

i. the group of voice-degradation approaches, or

ii. the group of voice-conversion approaches.

• Existing speech-to-text (STT) and text-to speech (TTS) can be used for this purpose as well.


Speaker transformation using existing acoustical speech models

• Speakers can be transformed to target speakers by using existing STT and TTS acoustical models and by establishing a correspondence/mapping between them.

• GMM-, HMM- or corpus-based acoustical speech models can be used for this purpose in different combinations.


» Homan speech communication process is usually modelled as a sequence of stochastic mappings of random processes.

[sis],[d],[o:],[b],[@],[r],[d],[a:],[n],[Z],[E],[l],[i:],[m],[sie]

/d/,/o/,/b/,/@/,/r/,/d/,/a/,/n/,/Z/,/E/,/l/,/i/,/m/

dober,dan,želim

Dober dan želim !

<item>dober<tag>kakovost=pozitivno</tag></item> <item>dan<tag>čas=24h</tag></item> <item>želim<tag>…

Modelling human speech communication process


A FST-based STT model

Pronunciation dictionary transducer

Grammar acceptor

Acoustic-phonetic transducers

token

token


» A cascade of finite-state transducers can be merged into one huge unified finite-state model that represent a model of human speech communication system.

A FST-based STT model


Language dependent knowledge

Text preprocessing

G2P conversion

Duration modelling

Intonation modelling

Signal generation

Intrinsic and extrinsic rules

Intonation models/rules

Acoustic speech models

G2P rules, dictionaries

Text

Text-to-speech model


Establishing correspondence between models

Acoustic-phonetic transducers Acoustic-phonetic source transducers

GMM-based voice-conversion target model Corpus-based concatenative TTS target model


A unified bi-directional FST-based acoustical model

1

2

4

3

X1:d

X2:a

X3:m

X4:O X5:r

X7:n X6:b

X7:i

…,d,a,r,i,O,a,r,i,n,…

k-anonymity/l-diversity


Speaker De-identification Performance Measures

i. De-identification efficacy assessment

Can I recognize the original speaker?

ii. Intelligibility assessment

Is the de-identified speech still intelligible?

iii. Naturalness assessment

Does the de-identified speech sound natural?

iv. Other performance measures

Is an important non-personal information still present?

Can identifying information be re-linked by a trusted party? …


Assessing the performance of speaker de-identification

• Automatic speech recognizers can be used to assess the intelligibility of the de-identified speech.

• Automatic speaker recognizers can be used to assess the efficacy of speech de-identification.

• Objective measures that are used for assessing the naturalness of TTS systems can be used to asses the naturalness of the de-identified speech.

• Emotion and speaker-state recognizers can be used to verify the presence of useful harmless paralinguistic information in the de-identified speech.


Assessing the performance of speaker de-identification

• Amazon Mechanical Turk or similar crowdsourcing approach can be utilized to assess the mentioned performance measures.

Are the two speakers the same?

How intelligible is the de-identified speech?

Does the de-identified speech sound natural?

Is an important non-personal paralinguistic information still present?


A SPEAKER DE-IDENTIFICATION SYSTEM

• A phoneme re-synthesis system

• System evaluation setup and results

i. Intelligibility assessment

ii. De-identification efficacy assessment

• Shortcomings, improvements and further experiments

• Conclusions


Motivation

• To develop an experimental speaker de-identification system based on phoneme recognition and synthesis (re-synthesis) using our existing components.

• To experiment with a technique different from the existing classical voice conversion and voice degradation techniques.

• To explore the possibility to develop a unified bi-directional FST-based finite state transducer.


Phone re-synthesis system

• Phoneme recognition module:

Context-dependent HMM-based bi-diphone acoustical models and a phonetic bigram language model.

• Phoneme (re-)synthesis modules:

HMM-based and PSOLA-based synthesizers built from the recordings of the two different target speakers


Context-dependent phonetic units

Right bi-phones ... Left bi-phones … Triphones …...

Time segmentation …..

Phonetic transcription…………….. Orthographic transcription .......

HMM models ..

Monophones ....

t


Phone transition phonetic units

Diphones …………. Right bi-diphones.. Left bi-diphones..

HMM models ...

Time segmentation …..

Phonetic transcription…………….. Orthographic transcription .......

t


Speech segmentation aspects of phone transition acoustical modelling

• Speech unit annotation confusion arises when dealing with phone transition models.

“left” - [sis l, eh, f, t, sil]


Speech segmentation aspects of phone transition acoustical modelling

• What speech signal segmentation can we expect when dealing with bi-phone and diphone acoustical models?


Training diphone acoustical model

• Diphone acoustical model initialization


Training diphone acoustical model

• Restructuring a diphone acoustical model


Establishing a correspondence between the models

HMM-based TTS target model Diphone TD-PSOLA-based TTS target model


De-identifying speech

• The phoneme recognizer provides hypotheses about (di)phone sequence and the durations of the individual phone units from an input speech signal.

• The TD-PSOLA-based TTS system uses both types of information for generating an artificial speech.

• The HMM-based TTS uses only the hypotheses on phone sequences, as prosody is generated using pre-trained statistical phone-level language models.


Results - Intelligibility assessment

• 28 test sentences from the GOPOLIS database:

7 male and 7 female speakers uttering 2 different sentences (with 5-8 words).

• 56 (2x28) de-identified speech recordings:

using two different speech synthesizers.

• 26 evaluators (13 males and 13 females) transcribed the de-identified speech recordings.


Intelligibility assessment

• Each evaluator transcribed 14 (2x7) randomly selected de-identified recordings of different test speakers.

• Each evaluator listened to each sentence only once.

• A total of 364 (26x14) transcriptions were obtained.

• Word error rates (WER) of the manual test transcriptions were analyzed.

• Phone error rates (PER) of the phone recognizer were obtained from the automatic phone transcriptions.



• Word error rates of the listening tests were compared to the phone error rates of the phone recognizer.

• Points on the vertical lines match the transcriptions of different evaluators of the same test sentence.

PER = 0.5 WER= 0



Observing average WER in relation to the PER for the two different speech synthesizers.



• The average WER and PER for all test utterances, depending on speaker’s gender, were observed.

• Intelligibility of the de-identified speech seems to be speaker’s gender dependent.

GENDER WER HMM WER DIF PER

female 0,44 0,29 0,23

male 0,23 0,13 0,14


De-identification efficacy assessment

• The use of an automatic state-of-the-art text-independent i-vector-based speaker recognition system.

• The same test speaker identities were used as in the intelligibility test.

• The target speaker recordings were selected from our test database that was not used for building the system.

• Approx. 12 seconds long utterances were used for speaker recognition tests.


Baseline performance of a speaker recognition system

• 8,832 genuine and 138,240 impostor verification attempts were conducted using the non-de-identified speech recordings.

• A TAR of 77.5% at 0.1% FAR and an EER of 2.36% was achieved.

10-3

10-2

10-1

100

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

False Accept Rate

Verif

ica

tio

n R

ate

ASR

Random


Efficacy of the de-identification procedure

• Speakers were enrolled with the natural speech recordings from our test database.

• Test data include only recordings of the de-identified speech.

10-3

10-2

10-1

100

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

False Accept Rate

Verif

icati

on

Rate

HMM

Random

DIF


Efficacy of the de-identification procedure

• In the second experiment, we tested the de-identified recordings of the two speakers that were used for building the two speech synthesizers.

Speaker 1

Original

De-identified

Speaker 2

Original

De-identified 10-3

10-2

10-1

100

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

False Accept Rate

Verif

icati

on

Rate

DIF

random

HMM


Shortcomings, improvements and further experiments

• Performance of the system strongly depends on the accuracy of the phoneme recognizer.

• The naturalness of the de-identified speech should be improved.

• Different speakers cannot be distinguished from the de-identified speech (the voice is always the same).

• The synthesized speech could be transformed to reflect some broader characteristics of the input speaker.


Shortcomings, improvements and further experiments

• To distinguish different speakers, the identity of the original speaker could be mapped to the index of a selected target speaker using locality-sensitive hashing.

• The direct correspondence between the components of the STT and TTS acoustical models could be established.


Conclusions

• A relatively novel approach to developing a speaker de-identification system was presented.

• A robust diphone speech recognizer and two different speech synthesizers were combined to build the speaker de-identification system.

• Intelligibility of the de-identified speech was assessed using human evaluators and its efficacy evaluated using a state-of-the-art speaker recognition system.

• The proposed system does not require a full-fledged error-free speech recognition system.


Thank you for your attention!

Questions?

Documents

Speaker De-identification using Phoneme Re …costic1206.uvigo.es/sites/default/files/Training_School...speaker de-identification •State-of-the-art techniques of speaker de-identification