Voice Transformation
Project by: Asaf Rubin, Michael Katz
Under the guidance of: Dr. Izhar Levner


Page 1:

Voice Transformation

Project by:

Asaf Rubin

Michael Katz

Under the guidance of:

Dr. Izhar Levner

Page 2:

Objective

Page 3:

Contents

• Conversion Scheme
  - Analysis: speech production model, preprocessing analysis
  - Transformation
  - Synthesis
• Results, Conclusions & Future plans

Page 4:

Conversion Scheme

[Block diagram: Source orator → Speech Analysis → Source parameters → Transformation function → Target parameters → Speech Synthesis → Target orator.]

Requires robust parameterization of speech.

Transformation is done on-line, based upon previous off-line data coordination, via codebooks, histogram equalization, or neural networks.

Page 5:

Conversion Scheme Analysis
Speech Production Model

[Block diagram: a voiced/unvoiced switch selects the excitation, which drives the Glottal Pulse Model G(z) (voiced branch), the Vocal Tract Model V(z) and the Radiation Model R(z).]

Vocal Tract Model: a linear all-pole filter, varying slowly in time relative to the pitch period:

$V(z) = \frac{G}{1 - \sum_{k=1}^{P} a_k z^{-k}}$

Radiation Model: simulates the lip radiation. A differentiation filter with constant parameters.

Voiced excitation: an impulse train with the pitch period, passed through the glottal pulse model.

Unvoiced excitation: white random noise.

This model was derived from the analytical solution of the acoustic speech model equations.
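To make the source-filter model concrete, here is a minimal synthesis sketch in Python (not the project's code: the glottal pulse model G(z) and radiation R(z) are omitted, and the resonator parameters below are purely illustrative). A frame is produced by driving the all-pole filter V(z) with either a pitch-period impulse train or white noise.

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_frame(lpc, gain, n_samples, pitch_period=None):
    """Excite V(z) = G / (1 - sum_k a_k z^-k) with a voiced or unvoiced source.

    lpc: prediction coefficients a_1..a_P; pitch_period in samples
    (None means unvoiced, white-noise excitation)."""
    if pitch_period:                       # voiced: impulse train at the pitch period
        excitation = np.zeros(n_samples)   # (glottal pulse shaping omitted in this sketch)
        excitation[::pitch_period] = 1.0
    else:                                  # unvoiced: white random noise
        excitation = np.random.randn(n_samples)
    # denominator of V(z): 1 - a_1 z^-1 - ... - a_P z^-P
    a = np.concatenate(([1.0], -np.asarray(lpc, dtype=float)))
    return lfilter([gain], a, excitation)

# toy example: a single resonance near 500 Hz at 8 kHz, 30 ms voiced frame
r, w = 0.97, 2 * np.pi * 500 / 8000
lpc = [2 * r * np.cos(w), -r * r]          # a_1, a_2 of a 2-pole resonator
frame = synthesize_frame(lpc, gain=1.0, n_samples=240, pitch_period=80)
```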

Page 6:

Conversion Scheme Analysis
Source Parameters Estimation

[Flow: source signal → signal cleaning → phoneme segmentation → pitch estimation → LPC estimation → LSP conversion → glottal pulse estimation → global parameters estimation → source parameters.]

Signal cleaning: noise reduction through use of the signal's energy and zero-crossing computation.

Phoneme segmentation: manual or semi-automatic (using energy, zero-crossing and pitch), or automatic using Hidden Markov Models.

Pitch estimation: evaluation of each phoneme's pitch contour.

LPC estimation: calculation of the Linear Prediction Coefficients set for each phoneme.

LSP conversion: calculation of the Line Spectrum Pairs corresponding to each work frame.

Glottal pulse parameters estimation: calculation for the corresponding work frames of the phoneme.

Global parameters estimation: phoneme characteristics such as duration and global LSP.
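As an illustration of the energy and zero-crossing cues used for signal cleaning and the semi-automatic segmentation, a small sketch (the frame length and hop below are assumed values, not the project's):

```python
import numpy as np

def short_time_features(x, frame_len=240, hop=120):
    """Per-frame energy and zero-crossing count, the cues used for
    signal cleaning and semi-automatic phoneme segmentation."""
    x = np.asarray(x, dtype=float)
    energies, zero_crossings = [], []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        energies.append(np.sum(frame ** 2))
        zero_crossings.append(int(np.sum(np.abs(np.diff(np.sign(frame))) > 0)))
    return np.array(energies), np.array(zero_crossings)

# frames with low energy and a high zero-crossing count are treated as noise/silence
```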

Page 7:

Conversion Scheme Transformation
Transformation Function

[Flow: the source parameters (pitch, glottal pulse parameters, LSP, duration) are mapped to the target parameters via a GPP/pitch transformation, the source and target LSP codebooks, and a duration transformation; the phoneme LSP is matched against the source codebook using the distance measure.]

• Find the source codeword closest to the phoneme's LSP (given the distance measure). There is a one-to-one correspondence between source and target codeword entries.

• Transform the phoneme's duration according to the average source and target durations of the corresponding codeword (see the sketch after this list):

$D_T = D_S \cdot \frac{\bar{D}_{TCB,n}}{\bar{D}_{SCB,n}}$

• For each work frame, transform the LSP through secondary one-to-one source-target LSP codebooks, corresponding to the n-th codewords of the primary books.

• For each work frame, transform the pitch and energy through histogram equalization, using the source and target histograms of the n-th codeword.

• The residue is substituted by the one corresponding to the target LSP, obtained via the secondary codebooks.
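A minimal sketch of the codeword lookup and the duration transform above, assuming a Euclidean distance measure and illustrative variable names (source_codebook holds one LSP centroid per row):

```python
import numpy as np

def nearest_codeword(lsp, source_codebook):
    """Index n of the source codeword closest to the phoneme's LSP."""
    d = np.sum((np.asarray(source_codebook) - np.asarray(lsp)) ** 2, axis=1)
    return int(np.argmin(d))            # squared Euclidean distance

def transform_duration(d_source, n, mean_dur_source_cb, mean_dur_target_cb):
    """D_T = D_S * mean(D_TCB,n) / mean(D_SCB,n)."""
    return d_source * mean_dur_target_cb[n] / mean_dur_source_cb[n]
```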

Page 8:

Conversion Scheme Transformation
Codebooks Creation – Training Stage

[Flow: source & target utterances → phoneme coordination → phoneme parameter calculation → source LSP quantization → target LSP clustering and duration averaging → primary codebook; work-frame coordination → work-frame parameter calculation → pitch & energy histograms → secondary codebooks.]

For each phoneme the LSP and duration are extracted. Given identical source and target utterances, the phoneme coordination is done manually (with the aid of a preliminary phoneme segmentation) or using HMM.

For each work frame of every phoneme the LSP, residue, pitch and energy are extracted. Vector quantization is performed on the source phonemes' LSP, clustering similar phonemes.

The LSP of the target phonemes corresponding to the source LSP in each quantization region are clustered to obtain the primary codebook, with the centroids of the phonemes' LSP as codewords.

Averaging the durations of the source and target phonemes in each quantization region gives the codebook for the phonemes' durations.

The source-target coordination at the work-frame level is achieved using Dynamic Time Warping; thus, for each primary codeword, the paired LSP of the corresponding phonemes establish the secondary codebook.

For each primary codeword, the pitch and energy information of every work frame of the corresponding phonemes is used to create the source and target histograms. The normalized residues corresponding to the paired LSP are kept as well.

Page 9:

Conversion Scheme Synthesis
Target Speech Production

[Flow: target parameters (LSP, duration, pitch, GPP) → excitation generation → duration control → LPC conversion → synthesis filter V(z) → target speech.]

For each phoneme:

• The excitation for each work frame is (according to the model) either an impulse pair with the given pitch and energy (voiced), or the residue interpolated/decimated to a two-pitch length.

• The work frames are linearly interpolated according to the duration.

• The speech is produced by exciting the prediction filter with the corresponding coefficients.

Page 10:

Vocal Results

[Audio examples (the original slide contained embedded sound clips): vocal coding of the source (S1, S2) and target (T1, T2) utterances, and source-to-target conversions. Legend: no codebook / phoneme codebook / clustered codebook; excitation variants: non-modified pitch excitation, modified pitch excitation, residue excitation.]

Page 11:

Vocal Results (continued)

[Audio examples (embedded sound clips): further conversions between source and target speakers 1 and 2, with the same legend (no codebook / phoneme codebook / clustered codebook) and the same excitation variants (non-modified pitch, modified pitch, residue).]

Page 12:

Conclusions

• The parametric approach with codebooks attains waveform coding of about 5600 bps.

• The training-stage phoneme clustering allows global parameter (pitch, duration) conversion and balances between a global work-frame search and single-phoneme correspondence.

• LSP conversion alone fails to capture significant voice characteristics.

• The quality difference between conversions based upon the Euclidean and Itakura-Saito distances is insignificant.

Page 13:

Future plans

• The parametric approach limits the optimum conversion to 5600 bps quality.

• Improve the parametric model (GPP), or use non-parametric conversion with a residue codebook (CELP).

• A better clustering method (other than VQ) may improve global parameter conversion as well as phoneme recognition.

• Improve the LSP transformation and interpolation.

Page 14:

Conversion Scheme Transformation
Dynamic Time Warping

For a given phoneme, we set:

• The work-frame parameters (LPC or LSP) of the target along the i axis and of the source along the j axis.

• The node cost as the distance between the corresponding source and target parameters.

[Figure: DTW grid with target frame indices 0..I on one axis and source frame indices 0..J on the other.]

DTW determines the optimal "least-cost" path through the grid, minimizing the sum of the visited nodes' costs. Path constraints, imposed to avoid distortion, force time to advance with a limited stretching/contraction ratio. The optimal path determines the desired alignment through node pairs, as sketched below.
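A minimal DTW sketch with the three standard local moves; the project's specific slope/path constraints are not reproduced here:

```python
import numpy as np

def dtw_align(source_frames, target_frames,
              dist=lambda a, b: float(np.sum((a - b) ** 2))):
    """Return the least-cost alignment path [(i_target, j_source), ...]."""
    I, J = len(target_frames), len(source_frames)
    cost = np.full((I + 1, J + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            node = dist(target_frames[i - 1], source_frames[j - 1])
            cost[i, j] = node + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack from (I, J) to (0, 0) along the cheapest predecessors
    path, i, j = [], I, J
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```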

Page 15:

Conversion Scheme Transformation
Vector Quantization

VQ subdivides the space into quantization regions, each represented by a code vector:

• code vectors $\{Y_i\}_{i=1}^{N}$
• quantization regions $\{V_i\}_{i=1}^{N}$
• $d$: a distance measure (Euclidean or I-S)

$Q(X) = Y_j \iff X \in V_j$, i.e. $d(X, Y_j) \le d(X, Y_i)$ for all $i$.

Given a training sequence of LSP $\{x_i\}_{i=1}^{M}$, we find the $\{Y_i\}_{i=1}^{N}$ and $\{V_i\}_{i=1}^{N}$ which result in the smallest average distance:

$D = \frac{1}{M} \sum_{i=1}^{M} d\big(x_i, Q(x_i)\big)$

We use the LBG algorithm, with PNN initialization for the Euclidean distance and random initialization for the I-S distance.
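A simplified LBG sketch for the Euclidean case (codebook splitting followed by centroid refinement); the PNN initialization and the I-S variant mentioned above are omitted, and n_codewords is assumed to be a power of two:

```python
import numpy as np

def lbg(training, n_codewords, n_iter=20, eps=1e-3):
    """Grow a codebook by splitting, then refine the centroids (Euclidean distance)."""
    training = np.asarray(training, dtype=float)        # shape (M, dim)
    codebook = np.mean(training, axis=0, keepdims=True)  # start from the global centroid
    while len(codebook) < n_codewords:
        # split every codeword into a slightly perturbed pair
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):
            d = np.linalg.norm(training[:, None, :] - codebook[None, :, :], axis=2)
            labels = np.argmin(d, axis=1)                # quantization regions V_i
            for k in range(len(codebook)):               # move codewords to centroids
                members = training[labels == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)
    return codebook
```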

Page 16:

Conversion Scheme Analysis
LSP Conversion

Given the LPC, define:

$A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}$

$P(z) = A(z) + z^{-(p+1)} A(z^{-1})$
$Q(z) = A(z) - z^{-(p+1)} A(z^{-1})$

The LSP are the positive angles of the roots of $P(z)$ and $Q(z)$.

[Diagram: relation between the LPC (poles of V(z)) and the LSP (roots of P(z) and Q(z) on the unit circle).]

For a stable vocal filter the roots of P and Q lie on the unit circle and are interleaved. Close P-Q root pairs correspond to dominant formants (vocal filter poles).

Advantages: robustness to errors; support of inter-vector operations.
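A sketch of the LPC-to-LSP conversion by finding the roots of P(z) and Q(z) numerically (a production implementation would usually use a Chebyshev-series root search instead):

```python
import numpy as np

def lpc_to_lsp(lpc):
    """LSP = positive angles (0 < w < pi) of the roots of
    P(z) = A(z) + z^-(p+1) A(z^-1) and Q(z) = A(z) - z^-(p+1) A(z^-1),
    with A(z) = 1 - sum_{k=1}^{p} a_k z^{-k}."""
    a = np.concatenate(([1.0], -np.asarray(lpc, dtype=float)))   # coefficients of A(z)
    a_rev = a[::-1]                                              # coefficients of z^-(p+1) A(z^-1)
    p_poly = np.concatenate((a, [0.0])) + np.concatenate(([0.0], a_rev))
    q_poly = np.concatenate((a, [0.0])) - np.concatenate(([0.0], a_rev))
    roots = np.concatenate((np.roots(p_poly), np.roots(q_poly)))
    w = np.angle(roots)
    # drop the trivial roots at 0 and pi, keep the p line spectrum frequencies
    return np.sort(w[(w > 1e-6) & (w < np.pi - 1e-6)])
```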

Page 17:

Conversion Scheme Transformation
Speech Distance Measures

Euclidean: the squared distance between the source and target LSP:

$d_{Lsp}\big(Lsp^{(1)}, Lsp^{(2)}\big) = \sum_{i=1}^{P} \big( Lsp^{(1)}_i - Lsp^{(2)}_i \big)^2$

Itakura-Saito (gain normalized), in matrix notation:

$d_{IS}(\alpha_1, \alpha_2) = \frac{\alpha_2^{t} R_1 \alpha_2}{\alpha_1^{t} R_1 \alpha_1} - 1$

where the $\alpha_i$ are the LPC (error-filter) vectors and $R_1$ is the covariance matrix of the process $1/A_1(z)$ excited by normalized white noise.

Motivation: the error variance of any random process $Y$ passed through the error filter $A(z) = 1 - \sum_{k=1}^{P} a_k z^{-k}$ is $E[e^2] = \alpha^{t} R_Y \alpha$.
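A sketch of both distance measures; here the alpha vectors are the full error-filter coefficient vectors [1, -a_1, ..., -a_P], and R_1 is approximated from a truncated impulse response of 1/A_1(z) (the truncation length is an assumption):

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.signal import lfilter

def lsp_distance(lsp1, lsp2):
    """Squared Euclidean distance between two LSP vectors."""
    return float(np.sum((np.asarray(lsp1) - np.asarray(lsp2)) ** 2))

def itakura_saito(a1, a2, n_ir=512):
    """Gain-normalized I-S distance (a2'R1a2 / a1'R1a1 - 1), with R1 the
    autocorrelation matrix of 1/A1(z) driven by unit-variance white noise."""
    a1 = np.asarray(a1, dtype=float)          # A1(z) coefficients [1, -a_1, ..., -a_P]
    a2 = np.asarray(a2, dtype=float)          # A2(z) coefficients, same length
    impulse = np.zeros(n_ir)
    impulse[0] = 1.0
    h = lfilter([1.0], a1, impulse)           # truncated impulse response of 1/A1(z)
    p = len(a1) - 1
    r = np.array([np.dot(h[:n_ir - k], h[k:]) for k in range(p + 1)])
    R1 = toeplitz(r)                          # (P+1) x (P+1) autocorrelation matrix
    return float(a2 @ R1 @ a2) / float(a1 @ R1 @ a1) - 1.0
```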

Page 18:

Conversion Scheme Transformation
Histogram Equalization

[Figure: source pitch histogram and target pitch histogram (pitch period roughly 50 to 120 samples).]

Given the histograms we calculate the source and target histogram equalization functions:

$T(n) = \sum_{k=0}^{n} H_s(k)$  and  $G(n) = \sum_{k=0}^{n} H_t(k)$

Given the source pitch value $p_s$, the target pitch value is calculated by:

$p_t = G^{-1}\big(T(p_s)\big)$
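A sketch of the mapping p_t = G^-1(T(p_s)) using empirical cumulative histograms; the bin range below (pitch periods of 50 to 120 samples, as in the figure) is illustrative:

```python
import numpy as np

def make_pitch_mapping(source_pitches, target_pitches, bins=np.arange(50, 122, 2)):
    """Build the mapping p_t = G^-1(T(p_s)) from the source and target
    pitch histograms, T and G being the cumulative histogram functions."""
    hs, _ = np.histogram(source_pitches, bins=bins)
    ht, _ = np.histogram(target_pitches, bins=bins)
    T = np.cumsum(hs) / max(hs.sum(), 1)          # source equalization function
    G = np.cumsum(ht) / max(ht.sum(), 1)          # target equalization function
    G = G + np.arange(len(G)) * 1e-9              # make strictly increasing for interp
    centers = (bins[:-1] + bins[1:]) / 2.0

    def map_pitch(p_source):
        u = np.interp(p_source, centers, T)       # T(p_s)
        return float(np.interp(u, G, centers))    # G^-1(.)

    return map_pitch
```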

Page 19:

Conversion Scheme Analysis

Hidden Markov Models

Page 20:

Conversion Scheme Analysis
Pitch Estimation

[Flow: speech utterance → segmentation → (for each segment) initialization → calculation → determination → pitch and voiced/unvoiced decision → next segment.]

Segmentation: segments of constant length and overlap; for each segment the pitch value is determined.

Initialization: set 2 adjacent sub-segments of an arbitrary minimal length (the candidate pitch period).

Calculation: increase the sub-segments' length, calculate their cross-correlation for each length, and stop at an arbitrary maximal length.

Determination: the pitch period is the length of the sub-segment with the maximal cross-correlation value; the cross-correlation must exceed a given threshold, otherwise the segment is classified as unvoiced.
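A sketch of the described search, with the pitch period taken as the sub-segment length that maximizes the normalized cross-correlation of two adjacent sub-segments; the length limits and the voicing threshold below are illustrative:

```python
import numpy as np

def estimate_pitch(segment, min_lag=40, max_lag=160, threshold=0.5):
    """Pitch period = the length L maximizing the normalized cross-correlation
    between two adjacent length-L sub-segments; below threshold => unvoiced (None)."""
    segment = np.asarray(segment, dtype=float)
    best_lag, best_corr = None, -1.0
    for lag in range(min_lag, min(max_lag, len(segment) // 2) + 1):
        a = segment[:lag]                 # first sub-segment
        b = segment[lag:2 * lag]          # adjacent sub-segment of the same length
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
        corr = np.dot(a, b) / denom if denom > 0 else 0.0
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    if best_corr < threshold:
        return None                       # classified as unvoiced
    return best_lag
```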

Page 21:

Conversion Scheme Analysis
LPC Estimation

[Flow: signal and pitch → segmentation → pre-emphasis → windowing → calculation → LPC.]

Segmentation: a work frame is a segment of twice the pitch period duration for voiced speech, or of constant duration for unvoiced speech. The segments are overlapped by half.

Pre-emphasis: a constant-parameter HPF compensating for the spectral tilt due to the lip radiation.

Windowing: multiply each segment by a Hamming window. Overlapping Hamming windows sum to an approximately rectangular weighting.

Calculation: the gain and the denominator coefficients are estimated using linear prediction methods.

[Figure: the signal's FFT and the spectral envelope of V(f).]
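A sketch of the calculation step: pre-emphasis, Hamming windowing, then LPC via the autocorrelation method and the Levinson-Durbin recursion (the order and pre-emphasis coefficient are assumed values):

```python
import numpy as np

def estimate_lpc(frame, order=10, preemph=0.97):
    """Pre-emphasis, Hamming window, then autocorrelation + Levinson-Durbin.
    Returns (a_1..a_P, gain) for V(z) = G / (1 - sum_k a_k z^-k)."""
    x = np.asarray(frame, dtype=float)
    x = np.append(x[0], x[1:] - preemph * x[:-1])         # pre-emphasis HPF
    x = x * np.hamming(len(x))                             # Hamming window
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order)                                    # predictor coefficients
    err = r[0] + 1e-12                                     # guard against silent frames
    for i in range(order):                                 # Levinson-Durbin recursion
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err    # reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[:i] = a[:i] - k * a[i - 1::-1][:i]
        a, err = a_new, err * (1.0 - k * k)
    return a, np.sqrt(err)                                 # gain from the residual energy
```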

Page 22:

Conversion Scheme Analysis
GPP Estimation

We use 2 methods of residue coding:

• Full residue preservation: obtained by passing the speech segment through the prediction error filter $\frac{1}{V(z)}$.

• Residue's energy only: the excitation is just a pitch train (voiced) or noise (unvoiced).
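A sketch of the full-residue method: inverse filtering the segment with the prediction error filter, using LPC and gain as estimated above:

```python
import numpy as np
from scipy.signal import lfilter

def extract_residue(frame, lpc, gain):
    """Pass the segment through the prediction error filter (the inverse of
    V(z)) to obtain the full residue signal."""
    a = np.concatenate(([1.0], -np.asarray(lpc, dtype=float)))   # A(z) = 1 - sum a_k z^-k
    return lfilter(a, [gain], np.asarray(frame, dtype=float))    # residue = A(z)/G applied to speech

# the second method keeps only the residue's energy:
# energy = np.sum(extract_residue(frame, lpc, gain) ** 2)
```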
