A Phonotactic-Semantic Paradigm for Automatic Spoken Document Classification
Bin MA and Haizhou LI
Institute for Infocomm Research
Singapore
2 ACM SIGIR August 15-19, 2005 Bin MA
Agenda
• Spoken Document Classification & Related Works
• Phonotactic-semantic Approach
• Voice Tokenization with Acoustic Words
• Bag-of-Sounds Representation
• Language Identification Classifiers with SVM and LSA
• Conclusion
Spoken Document Classification & Related Works
• Spoken Document Retrieval (SDR) is the task of retrieving excerpts from a large collection of spoken documents based on a user’s request.
  – Automatic spoken document classification (SDC) is an important topic in SDR;
  – Conventionally approached by integrating automatic speech recognition (ASR) technologies and text information retrieval (IR).
• Most SDC efforts so far have been devoted to two paradigms:
  – lexical-semantic
  – n-gram phonotactic
• lexical-semantic
  – Convert the spoken documents into text transcripts of lexical words;
  – The transcripts are typically generated by a large-vocabulary continuous speech recognizer (LVCSR);
  – Text categorization (TC) techniques are then applied to the automatic transcripts to derive semantic classes.
Outstanding problems: Homophone, Out-of-Vocabulary (OOV), Multilinguality.

Its major limitation is its lexical choice.
Spoken Document Classification & Related Works
• n-gram phonotactic
  – Use n-gram phonotactics, i.e. the rules governing the sequences of allowable phonemes, instead of lexical words to represent the lexical constraints that are imposed by semantic domains;
  – Enhance robustness against speech recognition errors.
Outstanding problems: Semantic Abstraction, Multilinguality.

Its major shortcoming is that it does not exploit the global phonotactics in the larger context of a spoken document.
Spoken Document Classification & Related Works
Phonotactic-semantic Approach
• Spoken document classification (SDC) is more complex than text categorization (TC).
  – In TC, we usually derive the lexical vocabulary from the running text.
  – For spoken documents, an additional tokenization step is needed to convert the sound wave into a sequence of phonetic units, such as words or phonemes.
• Two issues:
  – the definition of the tokenization unit, and
  – the choice of vocabulary.
• Definition of tokenization unit
  – Traditionally, the lexical words or phonemes of a specific language are used.
  – We propose a set of universal acoustic words (AWs): language-independent, self-organized, phoneme-like units.
  – We treat the documents in all languages equally, with the same set of AWs.
  – AWs can be learned from a multilingual training corpus using a data-driven approach.
Phonotactic-semantic Approach
• Choice of vocabulary
  – Use bag-of-sounds statistics over AWs, instead of bag-of-words statistics over lexical words, to derive high-level semantic characteristics from a spoken document.
  – The bag-of-sounds concept is analogous to the bag-of-words paradigm originally formulated in the context of information retrieval (IR) and text categorization (TC).
  – A spoken document is then represented by a high-dimensional vector derived from term-frequency statistics.
Phonotactic-semantic Approach
                                Lexical constraint     Latent semantics      Outstanding problems
Lexical-semantic approach       Lexical word           bag-of-words vector   1. Homophone  2. OOV  3. Multilinguality
n-gram phonotactic approach     n-local phonotactics   —                     1. Multilinguality  2. Semantic Abstraction
Phonotactic-semantic approach   n-local phonotactics   bag-of-sounds vector  —
Phonotactic-semantic Approach
• Three fundamental components for SDC
  – A voice tokenizer, i.e. a speech recognizer front-end, which segments a spoken document into acoustic tokens;
  – A statistical language model, which captures the statistics of semantic domain information;
  – A classifier, which categorizes a spoken document using the statistical language model.
Phonotactic-semantic Approach
[Figure: levels of tokenization units — word, phoneme, frame]
Voice Tokenization with Acoustic Words
• Segment an utterance into Q consecutive segments in a maximum-likelihood manner
  – minimizing an overall distortion with dynamic programming;
• Cluster all segments into T classes with the k-means algorithm
  – speech segments in the same class are acoustically similar;
• Train one HMM for each class
  – establishing T acoustic segment models to represent the overall acoustic space of all languages.
Voice Tokenization – Acoustic segment modeling (ASM)
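The first of the three steps above, dynamic-programming segmentation, can be sketched in code. This is a minimal illustration only: the function name is hypothetical and the squared-error distortion around each segment mean is an assumption, since the slides do not fix a distortion measure.

```python
import numpy as np

def segment_utterance(frames, Q):
    """Split a (N, d) frame sequence into Q consecutive segments by dynamic
    programming, minimizing the total within-segment distortion (sum of
    squared distances of frames to their segment mean)."""
    N = len(frames)
    # cumulative sums let dist(i, j) be computed in O(1)
    csum = np.cumsum(frames, axis=0)
    csq = np.cumsum((frames ** 2).sum(axis=1))

    def dist(i, j):  # distortion of frames[i:j], j > i
        n = j - i
        s = csum[j - 1] - (csum[i - 1] if i > 0 else 0)
        sq = csq[j - 1] - (csq[i - 1] if i > 0 else 0)
        return sq - (s @ s) / n   # sum ||x - mean||^2 over the segment

    INF = float("inf")
    D = np.full((Q + 1, N + 1), INF)   # D[q][j]: best cost of q segments over frames[:j]
    back = np.zeros((Q + 1, N + 1), dtype=int)
    D[0][0] = 0.0
    for q in range(1, Q + 1):
        for j in range(q, N + 1):
            for i in range(q - 1, j):
                if D[q - 1][i] == INF:
                    continue
                c = D[q - 1][i] + dist(i, j)
                if c < D[q][j]:
                    D[q][j], back[q][j] = c, i
    # trace back the Q+1 segment boundaries
    bounds, j = [N], N
    for q in range(Q, 0, -1):
        j = back[q][j]
        bounds.append(j)
    return list(reversed(bounds))
```

Across a corpus, the resulting segments would then be clustered into T classes (e.g. k-means on segment means) and one HMM trained per class, as the slide describes.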
Voice Tokenization – Phonetically-bootstrapped ASM
• Add phonetic constraints in segmentation
  – use a large amount of labeled speech data from a few well-studied languages;
  – train language-specific phone models;
  – choose some models to form a set of T models for bootstrapping;
• Phonetically label the multilingual training utterances
  – use the T models to decode all training utterances;
  – keep the recognized sequences as “true” labels;
• Re-train models
  – force-align and segment all utterances based on the “true” labels;
  – group all speech segments of a specific label into a class;
  – use these segments to re-train an HMM.
• Bag-of-sounds is analogous to bag-of-words;
• An acoustic word (AW) is an n-gram over the T acoustic tokens, so the AW vocabulary size is W = T^n;
• A spoken document is described as a count vector of AWs: each element is the count of one AW, and the vector dimension is the AW vocabulary size W;
• Capture local phonotactics with lexical constraints within AWs;
• Capture global phonotactics with co-occurrences of AWs.

Bag-of-Sounds Representation
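The bag-of-sounds count vector described above can be sketched as follows; the function name and the way each n-gram is encoded as a single AW index are illustrative assumptions.

```python
import numpy as np

def bag_of_sounds(tokens, T, n=2):
    """Count vector over acoustic words (AWs), where an AW is an n-gram of
    the T acoustic tokens; the vector has dimension W = T**n."""
    W = T ** n
    v = np.zeros(W, dtype=int)
    for i in range(len(tokens) - n + 1):
        idx = 0
        for t in tokens[i:i + n]:
            idx = idx * T + t   # encode the n-gram as a single AW index
        v[idx] += 1
    return v
```

For example, with T = 2 tokens and bigram AWs, the decoded sequence [0, 1, 1, 0] yields a 4-dimensional vector counting the bigrams (0,1), (1,1), and (1,0).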
Language Identification
• National Institute of Standards and Technology (NIST) 1996 Language Recognition Evaluation (LRE) database.
• 12 languages : Arabic, English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, and Vietnamese.
• Linguistic Data Consortium (LDC) CallFriend corpus as the training data:
  – 40 30-minute conversations;
  – 12,000 30-second training sessions for each language.
• 1,492 30-second speech sessions from the 1996 NIST LRE database as the test data.
[Figure: SLID system — a spoken utterance passes through the universal voice tokenizer (VT); per-language models LM-1 (English), LM-2 (Chinese), …, LM-L (French) feed the language classifier, which outputs the hypothesized language]
Language Identification
SVM Classifier with Feature Extraction
• SVMlight V6.01 from http://svmlight.joachims.org/
• Work with a linear-kernel SVM;
• Feature dimension: with T = 128 acoustic tokens and bigram AWs, W = T^2 = 16,384;
• L*(L-1)/2 pair-wise binary SVMs;
• The class that gains the most winning votes takes all.
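The max-wins voting over L*(L-1)/2 pair-wise classifiers can be sketched as below. The dictionary of decision functions is a hypothetical stand-in for trained SVMlight models; only the voting logic is illustrated.

```python
from itertools import combinations
from collections import Counter

def one_vs_one_vote(x, binary_classifiers, labels):
    """Max-wins voting over L*(L-1)/2 pairwise binary classifiers.
    binary_classifiers maps a label pair (a, b) to a decision function
    f(x) whose sign picks a (>= 0) or b (< 0)."""
    votes = Counter()
    for a, b in combinations(labels, 2):
        f = binary_classifiers[(a, b)]
        votes[a if f(x) >= 0 else b] += 1
    return votes.most_common(1)[0][0]   # the class with most winning votes takes all
```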
• Count-trimming (CT)
  – remove AWs that have very low frequency;
  – remove AWs that occur in too few documents.
• Mutual Information (MI)
  – between class membership Y and a particular AW’s presence X;
  – MI indicates the contribution of an AW’s presence to semantic classification:

  MI(X,Y) = Σ_{x∈X} Σ_{y∈Y} p(x,y) log [ p(x,y) / (p(x) p(y)) ]

  where X ranges over the AW’s presence/absence and Y over class membership/non-membership.

SVM Classifier with Feature Extraction
• Separation Margin (SM)
  – With a linear-kernel SVM, f(c) = a^T c + b, where a = {a_1, a_2, …, a_W};
  – The margin is inversely proportional to ||a||;
  – Features with higher |a_j| are more influential in determining the width of the separation margin.

• Feature Weighting
  – A document d is a count vector c_d = {c_{1,d}, c_{2,d}, …, c_{W,d}}^T;
  – ĉ_{w,d} = c_{w,d} · idf(w) / ( Σ_{w=1..W} c_{w,d}^2 )^{1/2};
  – idf(w) = log( D / d(w) ), where D is the number of documents and d(w) the number of documents containing AW w.

SVM Classifier with Feature Extraction
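The feature weighting above can be sketched with NumPy, under the assumption (matching the formula as reconstructed) that the normalizing denominator is the Euclidean norm of each document's raw count vector.

```python
import numpy as np

def weight_features(C):
    """tf-idf-style weighting of a W x D term-document count matrix:
    c_hat[w, d] = c[w, d] * idf(w) / ||c[:, d]||,  idf(w) = log(D / d(w)),
    where d(w) is the number of documents containing AW w."""
    W, D = C.shape
    dw = (C > 0).sum(axis=1)             # document frequency of each AW
    idf = np.log(D / np.maximum(dw, 1))  # guard against AWs seen in no document
    norms = np.sqrt((C ** 2).sum(axis=0))
    return (C * idf[:, None]) / np.maximum(norms, 1e-12)[None, :]
```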
[Figure: SLID error rate (%) vs. acoustic word vocabulary size (500 to 14,000), for feature selection by SM, MI, and CT]
SVM Classifier with Feature Extraction
SLID error rate comparison among three feature selection techniques
SVM Classifier with Feature Extraction
[Figure: SLID error rate (%) vs. number of training sessions per language (100 to 10,000)]
Effect of training corpus size
LSA Classifier with SVD
• Singular Value Decomposition (SVD)
  – Term-document matrix H: W × D;
  – SVD: H = U S V^T, with U: W × R, V: D × R, S: R × R = diag(s_1, …, s_R);
  – Retain the top R singular values in matrix S.

• Latent Semantic Analysis (LSA)
  – Document similarity: g(c_i, c_j) = cos(v_i S, v_j S) = v_i S^2 v_j^T / ( ||v_i S|| ||v_j S|| );
  – Distance: k(c_i, c_j) = k(v_i, v_j) = cos^{-1} g(c_i, c_j);
  – Fold-in of a new document c_p: v_p = c_p^T U S^{-1}.
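The SVD, fold-in, and similarity steps can be sketched with NumPy on a toy term-document matrix; the matrix values and the choice R = 2 are illustrative assumptions.

```python
import numpy as np

# H: toy W x D term-document matrix of bag-of-sounds counts
H = np.array([[2., 0., 1.],
              [0., 3., 1.],
              [1., 1., 0.]])
U, s, Vt = np.linalg.svd(H, full_matrices=False)
R = 2                                    # retain the top R singular values
U, s, Vt = U[:, :R], s[:R], Vt[:R, :]
S = np.diag(s)
V = Vt.T                                 # each row v_d is a document in LSA space

def fold_in(c_p):
    """Project a new document count vector into LSA space: v_p = c_p^T U S^{-1}."""
    return c_p @ U @ np.linalg.inv(S)

def lsa_cosine(v_i, v_j):
    """Similarity g(c_i, c_j) = cos(v_i S, v_j S)."""
    a, b = v_i @ S, v_j @ S
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```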
• LSA Classifier I – k-nearest neighbor
  – Assign the test vector to the class of its nearest training vector: l̂ = arg min_l k(v_p, v_l);
  – The distance k(v_i, v_j) induces a likelihood p(v_i | v_j).

• LSA Classifier II – mixture modeling
  – Model each class l by a mixture of M components:
    p(v_i | l) = Σ_{m=1..M} p(v_{l,m}) p(v_i | v_{l,m});
  – Estimate the mixture by maximizing the likelihood of the class’s training documents, Π_{d=1..|D_l|} p(v_d | l);
  – Classify: l̂ = arg max_l p(v_p | l) = arg max_l Σ_{m=1..M} p(v_{l,m}) p(v_p | v_{l,m}).

LSA Classifier with SVD
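A minimal sketch of Classifier I's nearest-neighbor decision. The arccosine-of-cosine distance stands in for k(·,·) and, as a simplifying assumption, omits the S weighting used in the slides' similarity measure.

```python
import numpy as np

def knn_classify(v_p, class_vectors):
    """LSA Classifier I: assign the test vector v_p to the class of its
    nearest training vector under an angular distance."""
    best, best_d = None, float("inf")
    for label, vecs in class_vectors.items():
        for v in vecs:
            cos = (v_p @ v) / (np.linalg.norm(v_p) * np.linalg.norm(v))
            d = np.arccos(np.clip(cos, -1.0, 1.0))  # k = cos^-1 of the similarity
            if d < best_d:
                best, best_d = label, d
    return best
```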
[Figure: SLID error rate (%) vs. number of mixtures M (1 to 1,024), for LSA Classifier II]
Effect of Mixture Number M (LSAC-II)
LSA Classifier with SVD
LSA Classifier with SVD
#Sessions            1,000   2,000   6,000   12,000
LSAC-I Error (%)     19.8    16.5    15.2    14.8
SVMC Error (%)       18.2    16.2    14.4    13.9

Effect of training data size in LSAC-I & SVMC

            P-PRLM   P-PRLM & Score Fusion   LSAC-II   SVMC
Error (%)   22.0     17.0                    14.9      13.9

Benchmark of different models
Conclusion
• Non-lexical approach to spoken document tokenization
  – Universal acoustic words (AWs): language-independent, self-organized, phoneme-like units;
  – Data-driven approach to learn them from a multilingual training corpus.
• Phonotactic-semantic paradigm to model
  – local phonotactics within an acoustic word (AW);
  – global phonotactics in a bag-of-sounds vector.
• Thank you!