2002
VIU Oct 2007 : Speaker Recognition 1 F. Schiel
Florian Schiel, Venice International University
Oct 2007
Speaker Recognition = Speaker Identification, Speaker Verification
Agenda
• See the Context
• Speech Recognition vs. Speaker Recognition
• Speaker Identification vs. Speaker Verification
• Speaker Recognition: Basics
• Speaker Verification using HMM
• Discussion
• and then ...
General Approach to Authentication
• Three general ways to perform authentication:
  - proof of knowledge (e.g. password),
  - proof of possession (e.g. chip card),
  - proof of property (biometrics),
  and their combinations
• Biometrics: physiologically based vs. behaviourally based
• Biometric features: fingerprint, iris scan, facial scan, hand geometry, signature, voice
from U. Türk 2007
Biometric Features: General Requirements
• universal: can be found in every user
• unique: even for identical twins
• measurable: does not require human evaluation
• robust to short-term and long-term variability
• low dimensionality
• robust to changing environment
• robust to impersonation
from U. Türk 2007
Taxonomy of Speech Processing
[Diagram: Natural Language Processing (NLP) and Spoken Language Processing (SLP); fields shown: lexica, syntax/parsing, spellers, search/indexing, semantics, terminology, thesaurus, dialogue systems, speech identification, speech synthesis, speaker recognition, speech recognition, forensics]
Speech Recognition
"Decode the spoken content from the acoustic signal"
Speaker Recognition
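"Determine the identity of a speaker from the acoustic signal"
[Diagram: ASR uses speech models and outputs the spoken text (e.g. "Sehr geehrter ..", German for "Dear .."); SI/SV uses speaker characteristics (for SV also the claimed identity) and outputs an ID or "Accepted" / "Rejected"]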
Speaker Recognition
Speaker Verification
• Authentication against a claimed identity
• Result is binary: "accept" / "reject"
• Scaling: effort is independent of the number of participants
• Accuracy: depends on the size of the enrolment data
Speaker Identification
• Identification among a limited number of participants
• Result is the speaker identity
• Scaling: effort increases linearly with the number of participants
• Accuracy: depends on
  + the size of the enrolment data
  + the number of participants
[Diagram: verification outcomes
  identity correct, accept: correct
  identity correct, reject: false reject
  identity wrong, reject: correct
  identity wrong, accept: false accept]
[Plot: identification correctness (up to 100 %) vs. number of participants N]
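The difference in scaling between the two tasks can be sketched in Python. This is a toy illustration, not the method of the slides. Each enrolled speaker is assumed to be a single diagonal Gaussian over feature vectors, and the speakers `anna` and `ben`, their models and the threshold are all hypothetical. Identification scores all N models and picks the best one. Verification scores only the model of the claimed identity.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy speaker model: (means, variances) of one diagonal Gaussian.
Model = Tuple[List[float], List[float]]


def avg_log_likelihood(frames: Sequence[Sequence[float]], model: Model) -> float:
    """Average per-frame log-likelihood of the feature vectors under a diagonal Gaussian."""
    means, variances = model
    total = 0.0
    for x in frames:
        for xi, mu, var in zip(x, means, variances):
            total += -0.5 * (math.log(2 * math.pi * var) + (xi - mu) ** 2 / var)
    return total / len(frames)


def identify(frames: Sequence[Sequence[float]], models: Dict[str, Model]) -> str:
    """Speaker identification: score every enrolled model, so effort grows linearly with N."""
    return max(models, key=lambda name: avg_log_likelihood(frames, models[name]))


def verify(frames: Sequence[Sequence[float]], models: Dict[str, Model],
           claimed_id: str, threshold: float) -> bool:
    """Speaker verification: score only the claimed model, so effort is independent of N."""
    return avg_log_likelihood(frames, models[claimed_id]) > threshold


models = {"anna": ([0.0, 1.0], [1.0, 1.0]), "ben": ([3.0, -1.0], [1.0, 1.0])}
frames = [[0.1, 0.9], [-0.2, 1.2]]
print(identify(frames, models))              # anna
print(verify(frames, models, "anna", -5.0))  # True
print(verify(frames, models, "ben", -5.0))   # False
```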
Speaker Verification
• Applications:
  – Access control
  – Verification of identity via the phone
  – Automatic teller machines
  – Password resetting
  – Banking: identity for new accounts etc.
  – Protection against theft (cars ...)
Speaker Identification
• Applications:
  – Forensics
  – Police work
  – Automatic user settings
  – Speaker classification: advertising
Speaker Verification: Doddington's Zoo (1)
User = registered speaker, Impostor = non-registered speaker
• Goats: users who are often wrongly rejected (increasing 'false reject' errors)
• Lambs: users who are easily imitated (increasing 'false accept' errors)
• Sheep: users who 'behave' (neither goats nor lambs)
• Wolves: particularly successful impostors (increasing 'false accept' errors)
from Doddington 1998
Speaker Verification: Doddington's Zoo (2)
Wolves may make zero-effort or active impostor attempts to break into an SV system.
Problem: speaker verification databases contain no data from active impostor attempts by wolves, so most technical evaluations are based on unrealistic data!
Technical Speech Processing
[Diagram: analog signal → anti-aliasing filter → A/D → digital signal → highpass → feature detection → vectors (m1 … mN at t = 10, 20, …) → decoder → symbols]
Symbols:
• Text
• Action
• Semantics
e.g. "Call Richard!", "Radio off!", "216"
Speaker Verification: Basics (1)
[Diagram: anti-aliasing filter → A/D → highpass → feature detection → verification → "Accept" / "Reject"; the claimed identity (given by PIN, fingerprint or ASR) selects the ID whose speaker model is used for verification]
Speaker Verification: Basics (2)
[Diagram: anti-aliasing filter → A/D → highpass → feature detection → verification → "Accept" / "Reject"]
• Anti-aliasing filter: analog low-pass filter that passes only $f < f_{sam}/2$, to avoid aliasing effects
• + analog-digital converter (A/D)
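A short numerical check shows why the cutoff must lie at $f_{sam}/2$. In the plain-Python sketch below, the 8 kHz sampling rate and the 5 kHz tone are illustrative choices. Without an anti-aliasing filter, a 5 kHz tone sampled at 8 kHz yields exactly the same samples as a 3 kHz tone. It folds back into the band below $f_{sam}/2$ and can no longer be distinguished.

```python
import math


def alias_frequency(f: float, f_sam: float) -> float:
    """Apparent frequency of a tone at f after sampling at f_sam (folded into [0, f_sam/2])."""
    f_mod = f % f_sam
    return f_sam - f_mod if f_mod > f_sam / 2 else f_mod


def sample_cosine(f: float, f_sam: float, n_samples: int) -> list:
    """Sample cos(2*pi*f*t) at the instants t = n / f_sam."""
    return [math.cos(2 * math.pi * f * n / f_sam) for n in range(n_samples)]


f_sam = 8000.0
print(alias_frequency(5000.0, f_sam))  # 3000.0
high = sample_cosine(5000.0, f_sam, 16)
low = sample_cosine(3000.0, f_sam, 16)
# The two sampled sequences coincide up to rounding error.
print(all(abs(a - b) < 1e-9 for a, b in zip(high, low)))  # True
```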
Speaker Verification: Basics (3)
Features:
• speaker specific
• robust against noise
• partly long term
[Diagram: extraction of speaker characteristics (feature computation): a 25 ms window is moved along the signal and yields a feature vector m1 … mN at each step (10, 20, 30, 40, …); chain: anti-aliasing filter → A/D → highpass → feature detection → verification → "Accept" / "Reject"]
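The windowing step can be sketched as follows. The 25 ms window comes from the slide. The 10 ms shift is an assumption read from the 10, 20, 30, 40 tick marks. The Hamming taper and the log-energy value per frame are illustrative stand-ins for a real speaker-specific feature vector m1 … mN.

```python
import math
from typing import List


def frame_signal(samples: List[float], f_sam: int,
                 win_ms: float = 25.0, shift_ms: float = 10.0) -> List[List[float]]:
    """Cut the signal into overlapping Hamming-weighted windows of win_ms, shifted by shift_ms."""
    win = int(round(f_sam * win_ms / 1000))
    shift = int(round(f_sam * shift_ms / 1000))
    hamming = [0.54 - 0.46 * math.cos(2 * math.pi * n / (win - 1)) for n in range(win)]
    frames = []
    for start in range(0, len(samples) - win + 1, shift):
        frames.append([s * w for s, w in zip(samples[start:start + win], hamming)])
    return frames


def log_energy(frame: List[float]) -> float:
    """Toy one-dimensional feature: log energy of a frame (small floor avoids log(0))."""
    return math.log(sum(s * s for s in frame) + 1e-10)


f_sam = 8000
signal = [math.sin(2 * math.pi * 440 * n / f_sam) for n in range(f_sam)]  # 1 s test tone
frames = frame_signal(signal, f_sam)
print(len(frames), len(frames[0]))  # 98 200
features = [log_energy(fr) for fr in frames]
```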
Speaker Verification: Basics (4)
[Diagram: anti-aliasing filter → A/D → highpass → feature detection → vector sequence S (m1 … mN at t = 10, 20, …) → verification against the speaker model of the claimed ID → decision]
• p(S | ID) > threshold → "Accept"
• p(S | ID) < threshold → "Reject"
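In practice the decision is usually taken in the log domain. As a simplifying assumption, treat frames as independent: p(S | ID) is then a product of per-frame likelihoods, and that product underflows for utterances of realistic length. The sketch below uses hypothetical per-frame likelihoods and a hypothetical threshold. It compares the raw product with the length-normalised sum of logs.

```python
import math
from typing import List


def naive_product(frame_likelihoods: List[float]) -> float:
    """p(S | ID) as a plain product of per-frame likelihoods (numerically fragile)."""
    p = 1.0
    for x in frame_likelihoods:
        p *= x
    return p


def avg_log_score(frame_likelihoods: List[float]) -> float:
    """log p(S | ID) divided by the number of frames, computed as a sum of logs."""
    return sum(math.log(x) for x in frame_likelihoods) / len(frame_likelihoods)


def decide(score: float, threshold: float) -> str:
    """Decision rule of the slide, applied to log scores: accept above the threshold."""
    return "Accept" if score > threshold else "Reject"


likelihoods = [1e-4] * 1000  # hypothetical: 1000 frames (about 10 s at a 10 ms shift)
print(naive_product(likelihoods))      # 0.0 (underflow)
score = avg_log_score(likelihoods)     # about -9.21
print(decide(score, threshold=-10.0))  # Accept
```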
Speaker Verification: Tuning
• Error types depend strongly on the threshold:
  - high security: false accept low, false reject high
  - user friendly: false reject low, false accept high
[Plot: false accept and false reject rates vs. threshold; the two curves cross at the Equal Error Rate]
• Both errors increase with:
  - channel disturbance
  - crosstalk
  - noise
  - room acoustics
• Solution:
  - multiple enrolments
  - adaptive learning
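The trade-off and the Equal Error Rate can be computed by sweeping the threshold over the scores of genuine (user) and impostor trials. This is a minimal sketch with made-up scores. It uses the rule "accept if the score exceeds the threshold" and only tries thresholds equal to observed scores.

```python
from typing import List, Tuple


def error_rates(genuine: List[float], impostor: List[float], t: float) -> Tuple[float, float]:
    """False accept rate and false reject rate when scores > t are accepted."""
    far = sum(s > t for s in impostor) / len(impostor)  # impostors wrongly accepted
    frr = sum(s <= t for s in genuine) / len(genuine)   # users wrongly rejected
    return far, frr


def equal_error_rate(genuine: List[float], impostor: List[float]) -> Tuple[float, float]:
    """Return (threshold, EER) at the candidate threshold where FAR and FRR are closest."""
    def gap(t: float) -> float:
        far, frr = error_rates(genuine, impostor, t)
        return abs(far - frr)

    best = min(sorted(set(genuine + impostor)), key=gap)
    far, frr = error_rates(genuine, impostor, best)
    return best, (far + frr) / 2


genuine = [2.0, 2.5, 3.0, 1.0]    # hypothetical user scores
impostor = [0.0, 0.5, 1.5, -1.0]  # hypothetical impostor scores
print(equal_error_rate(genuine, impostor))  # (1.0, 0.25)
```

Raising the threshold moves the operating point towards "high security" (lower FAR, higher FRR). Lowering it moves towards "user friendly".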
Speaker Verification: Score Normalisation (1)
Problem: How to set the optimal threshold?
HMMs yield likelihoods $p(O \mid \lambda)$:
O: observation = sequence of features
λ: speaker model
Bayes:
$$P(\lambda \mid O) = \frac{p(O \mid \lambda)\, P(\lambda)}{P(O)}$$
but $p(O \mid \lambda)$ and $P(O)$ depend on various factors
Speaker Verification: Score Normalisation (2)
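Solution: Bayesian decision rule (accept if):
$$C_{FR}\, P(\lambda \mid O) > C_{FA}\, P(\bar\lambda \mid O)$$
with $C_{FR}$, $C_{FA}$: cost functions (cost of a false reject, cost of a false accept) and $\bar\lambda$: model of the non-speaker (impostor).
Using Bayes, $P(\lambda \mid O) = \frac{p(O \mid \lambda)\, P(\lambda)}{P(O)}$, and taking the log of both sides ($P(O)$ cancels), this leads to:
$$\log p(O \mid \lambda) - \log p(O \mid \bar\lambda) > \log \frac{C_{FA}\, P(\bar\lambda)}{C_{FR}\, P(\lambda)} = \text{threshold}$$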
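The threshold on the right-hand side of the rule can be evaluated directly. The small sketch below uses illustrative cost and prior values. Equal costs and equal priors give a threshold of 0. Making false accepts more expensive, which corresponds to the "high security" setting, raises the threshold.

```python
import math


def bayes_threshold(c_fa: float, c_fr: float, p_speaker: float) -> float:
    """log( C_FA * P(lambda_bar) / (C_FR * P(lambda)) ) from the Bayesian decision rule."""
    return math.log((c_fa * (1.0 - p_speaker)) / (c_fr * p_speaker))


def accept(log_p_speaker: float, log_p_background: float, threshold: float) -> bool:
    """Accept if log p(O|lambda) - log p(O|lambda_bar) exceeds the threshold."""
    return log_p_speaker - log_p_background > threshold


print(bayes_threshold(1.0, 1.0, 0.5))   # 0.0
print(bayes_threshold(10.0, 1.0, 0.5))  # about 2.30: false accepts cost 10x more
print(accept(-120.0, -123.0, bayes_threshold(10.0, 1.0, 0.5)))  # True (ratio = 3.0)
```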
Speaker Verification: Score Normalisation (3)
Often assumed: costs are equal and speakers are uniformly distributed, i.e. $P(\lambda) = 1/N$ and $P(\bar\lambda) = (N-1)/N$, which gives:
$$\log p(O \mid \lambda) - \log p(O \mid \bar\lambda) > \log (N - 1)$$
N: number of users and impostors
$p(O \mid \bar\lambda)$ is estimated using a world or cohort model:
• world model: speaker model trained on all speakers
• cohort model: speaker model trained on a group of the most competitive speakers (wolves)
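Here is a toy version of world-model normalisation. Each model is a one-dimensional Gaussian standing in for the HMMs of the slides; the means, variances and test frames are hypothetical. The broader world model plays the role of $p(O \mid \bar\lambda)$. The threshold $\log(N-1)$ grows with the population size N, so the same utterance can be accepted in a small population and rejected in a larger one.

```python
import math
from typing import List, Tuple

Gaussian = Tuple[float, float]  # hypothetical 1-D model: (mean, variance)


def log_likelihood(frames: List[float], model: Gaussian) -> float:
    """log p(O | model) for independent frames under a 1-D Gaussian."""
    mu, var = model
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var) for x in frames)


def verify_with_world_model(frames: List[float], speaker: Gaussian,
                            world: Gaussian, n: int) -> Tuple[float, bool]:
    """Return the log-likelihood ratio and the decision against log(N - 1)."""
    llr = log_likelihood(frames, speaker) - log_likelihood(frames, world)
    return llr, llr > math.log(n - 1)


speaker = (0.0, 1.0)  # claimed speaker model
world = (0.0, 4.0)    # world model: same mean, broader spread
frames = [0.0, 0.1, -0.1]
print(verify_with_world_model(frames, speaker, world, n=5))   # ratio about 2.07 > log 4: True
print(verify_with_world_model(frames, speaker, world, n=10))  # ratio about 2.07 < log 9: False
```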
Speaker Verification: Enrolment
• Fixed, pre-specified sentence, e.g. "My voice is my password"
  - Enrolment: speak the sentence 3–5 times
  - Remarks: the sentence may be intercepted and played back
• Fixed, selectable sentence, e.g. maiden name of grandmother
  - Enrolment: speak the sentence 3–5 times
  - Remarks: additional security by content
• Changing number triplets, e.g. fifteen, thirty-nine, seventy-three
  - Enrolment: speak each number 3–5 times
  - Remarks: high security through many possible combinations
• System generates a new sentence for each verification
  - Enrolment: speak each phoneme 3–5 times
  - Remarks: elaborate enrolment, high processing effort, very high security
Speaker Verification: HMM types
Method, HMM model, security and accuracy:
• pre-specified sentence: linear model
• recombination of segments taken from enrolment data: piecewise linear model
• modeling without time structure: ergodic model
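The model types differ mainly in which state transitions they allow. The sketch below builds transition matrices in plain Python; the state counts and probabilities are arbitrary. A linear (left-to-right) model only allows staying in a state or moving to the next one, so it follows the time structure of one fixed sentence. An ergodic model allows every transition and therefore has no time structure. A piecewise linear model would chain linear pieces built from enrolment segments and is not shown here.

```python
from typing import List

Matrix = List[List[float]]


def linear_transitions(n_states: int, p_stay: float = 0.5) -> Matrix:
    """Left-to-right (linear) HMM: self-loop or step to the next state; last state absorbs."""
    a = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        a[i][i] = p_stay
        a[i][i + 1] = 1.0 - p_stay
    a[-1][-1] = 1.0
    return a


def ergodic_transitions(n_states: int) -> Matrix:
    """Ergodic HMM: every state can follow every state (uniform here for simplicity)."""
    return [[1.0 / n_states] * n_states for _ in range(n_states)]


for row in linear_transitions(3):
    print(row)
# [0.5, 0.5, 0.0]
# [0.0, 0.5, 0.5]
# [0.0, 0.0, 1.0]
```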
Speaker Verification: Features (1)
Variable signal characteristics:
• often required: telephone band 300–3300 Hz (higher resonances cut off)
• changing channel characteristics, caused by transmission line, handset, distance to mouth
• static and intermittent noise
• user: health, intoxication, fatigue
Speaker Verification: Features (2)
Candidates determined by physiology:
• fundamental frequency, average
• waveform of the vocal folds, shimmer, jitter, irregularities
• formants: average and dynamics
• places of articulation: fricatives, plosives
• nasal cavity resonance
• sub-glottal resonance
Speaker Verification: Features (3)
Candidates determined by behaviour:
• voiced/unvoiced ratio
• fundamental frequency, dynamics
• syllable rate, pause/speech ratio
• dialectal features: vowel quality
Candidates determined by speech technology:
• Linear Prediction Coefficients (LPC)
• filter bank, Bark filter bank, Mel filter bank
• Cepstrum, Mel-Cepstrum
• (derivatives with respect to time)
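The time derivatives of a feature sequence are commonly approximated by a regression over neighbouring frames: $d_t = \sum_{n=1}^{K} n\,(c_{t+n} - c_{t-n}) \,/\, (2 \sum_{n=1}^{K} n^2)$. The sketch below implements that formulation. The half-width K = 2 and the edge handling, which repeats the first and last frame, are assumptions and do not come from the slides.

```python
from typing import List


def delta(features: List[List[float]], k: int = 2) -> List[List[float]]:
    """Regression-based time derivative of a sequence of feature vectors.

    d_t = sum_{n=1..k} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..k} n^2),
    with the first/last frame repeated beyond the edges.
    """
    t_max = len(features) - 1
    denom = 2 * sum(n * n for n in range(1, k + 1))
    deltas = []
    for t in range(len(features)):
        d = [0.0] * len(features[t])
        for n in range(1, k + 1):
            ahead = features[min(t + n, t_max)]
            behind = features[max(t - n, 0)]
            for i in range(len(d)):
                d[i] += n * (ahead[i] - behind[i])
        deltas.append([x / denom for x in d])
    return deltas


# A coefficient rising by 1 per frame has a derivative of 1 away from the edges.
ramp = [[float(t)] for t in range(6)]
print(delta(ramp))  # [[0.5], [0.8], [1.0], [1.0], [0.8], [0.5]]
```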
Speaker Verification: Road Map
[Timeline: 1990, today, 2010, 2020]
• Access control for security areas
• Authentication via telephone
• Devices "recognize" their user
• Speaker profiles on chip cards
• Access control for keyboardless PDAs
• Authentication in the background
• Public speaker profiles
• Automatic alcohol test in the vehicle
Thank You!