Sparse Representation
Shih-Hsiang Lin (林士翔 )
1. T. N. Sainath, et al., “Bayesian Compressive Sensing for Phonetic Classification,” ICASSP 2010
2. T. N. Sainath, et al., “Sparse Representation Phone Identification Features for Speech Recognition,” IBM T.J. Watson Research Center, Tech. Rep., 2010
3. T. N. Sainath, et al., “Sparse Representation Features for Speech Recognition,” INTERSPEECH 2010
4. T. N. Sainath, et al., “Sparse Representations for Text Categorization,” INTERSPEECH 2010
5. V. Goel, et al., “Incorporating Sparse Representation Phone Identification Features in Automatic Speech Recognition using Exponential Families,” INTERSPEECH 2010
6. D. Kanevsky, et al., “An Analysis of Sparseness and Regularization in Exemplar-Based Methods for Speech Classification,” INTERSPEECH 2010
7. A. Sethy, et al., “Data Selection for Language Modeling Using Sparse Representations,” INTERSPEECH 2010
Introduction
Related Work
◦ Supervised Summarizers
◦ Unsupervised Summarizers
Risk Minimization Framework
Experiments
Conclusion and Future Work
Outline
Sparse representation (SR) has become a popular technique for efficient representation and compression of signals
Introduction
y = Hβ   s.t.   ||β||₁ < ε
H ∈ R^(m×n) is constructed consisting of possible examples of the signal
β ∈ R^(n×1) is a weight vector whose elements reflect the importance of the corresponding training samples
A sparseness condition is enforced on β, such that it selects a small number of examples from H to describe y
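As a rough illustration of this formulation, the minimal sketch below solves the same kind of problem with scikit-learn's Lasso as a stand-in l1 solver (the papers use other solvers such as ABCS); the matrix H, the test vector y, and the penalty weight alpha are made-up, illustrative values.

```python
# Sketch: sparse representation of a test vector y over an exemplar dictionary H.
# H (m x n) stacks n training exemplars as columns; y is an m-dimensional test vector.
import numpy as np
from sklearn.linear_model import Lasso

m, n = 40, 200
rng = np.random.default_rng(0)
H = rng.standard_normal((m, n))            # columns = training exemplars (illustrative data)
beta_true = np.zeros(n)
beta_true[[3, 57, 120]] = [1.0, -0.5, 2.0]
y = H @ beta_true                          # test vector built from a few exemplars

# l1-regularized least squares (up to scaling): min ||y - H beta||_2^2 + alpha * ||beta||_1
solver = Lasso(alpha=0.05, fit_intercept=False, max_iter=10000)
solver.fit(H, y)
beta = solver.coef_

print("non-zero weights:", np.flatnonzero(beta))   # only a few exemplars are selected
```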
One benefit of SR is that for a given test example, SR adaptively selects the relevant support vectors from the training set H
It has also shown success in face recognition over linear SVM and 1-NN methods
◦ SVMs select a sparse subset of relevant training examples (support vectors) and use these supports to characterize “all” examples in the test set
◦ kNNs characterize a test point by selecting a small number of k points from the training set which are closest to the test vector, voting on the class that has the highest occurrence from these k samples (sketch below)
Introduction (Cont.)
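For contrast with SR's adaptive exemplar selection, here is a minimal kNN baseline sketch using scikit-learn's KNeighborsClassifier; the data, the number of classes, and k are illustrative assumptions.

```python
# Sketch: kNN baseline - the k closest training points vote on the class of a test point.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X_train = rng.standard_normal((300, 13))     # illustrative training features
y_train = rng.integers(0, 3, size=300)       # 3 classes (illustrative labels)
x_test = rng.standard_normal((1, 13))        # one test point

knn = KNeighborsClassifier(n_neighbors=5)    # k = 5 nearest neighbors vote
knn.fit(X_train, y_train)
print("kNN prediction:", knn.predict(x_test)[0])
```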
We can think of SR techniques as a kind of exemplar-based method
Issues
◦ Why sparse representations?
◦ What type of regularization?
◦ Construction of H?
◦ Choice of Dictionary?
◦ Choice of sampling?
Introduction (Cont.)
Why sparse representations
(Figure: columns of the dictionary H labeled by class, e.g. C1 C1 C1 C2 C2 C2 C2)
Sparseness constraint: min_β ||β||₁ s.t. y = Hβ
l1 norm: min_β ||y − Hβ||₂² + λ||β||₁
◦ This constraint can be modeled as a Laplacian prior
◦ LASSO, Bayesian Compressive Sensing (BCS)
Impose a combination of an l1 and l2 constraint: min_β ||y − Hβ||₂² + λ₁||β||₁ + λ₂||β||₂²
◦ Elastic Net
◦ Cyclic Subgradient Projections (CSP)
Impose a semi-Gaussian constraint: min_β ||y − Hβ||₂² + λ||β||₁²
◦ Approximate Bayesian Compressive Sensing (ABCS)
What type of regularization
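To make the regularizers concrete, here is a hedged sketch comparing ridge (l2), LASSO (l1), and elastic net (l1 + l2) on the same H and y with scikit-learn; ABCS and CSP have no off-the-shelf implementation here and are omitted, and the data and penalty weights are illustrative.

```python
# Sketch: comparing regularizers for beta on the same exemplar matrix H and test vector y.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(2)
H = rng.standard_normal((40, 200))                        # illustrative exemplar matrix
y = H[:, [5, 17, 90]] @ np.array([1.0, 0.7, -0.4])        # y built from a few columns of H

for name, model in [("RR (l2)", Ridge(alpha=1.0, fit_intercept=False)),
                    ("LASSO (l1)", Lasso(alpha=0.05, fit_intercept=False, max_iter=10000)),
                    ("Elastic Net", ElasticNet(alpha=0.05, l1_ratio=0.5,
                                               fit_intercept=False, max_iter=10000))]:
    model.fit(H, y)
    beta = model.coef_
    sparsity = np.mean(np.abs(beta) < 1e-6)               # fraction of (near-)zero weights
    print(f"{name:12s} fraction of zero weights: {sparsity:.2f}")
```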
As the size of H increases up to 1,000, the error rates of the RR and SR both decrease
◦ showing the benefit of including multiple training examples when making a classification decision
There is no difference in error between the RR and SR techniques
◦ suggesting that regularization does not provide any extra benefit
What type of regularization (Cont.)
The plot shows that the coefficients for the RR method are the least sparse
The LASSO technique has the sparsest values
What type of regularization (Cont.)
A randomly selected classification frame y in TIMIT and an H of size 200
Accuracy decreases when a high degree of sparseness is enforced
Thus, it appears that using a combination of l1 and l2 constraints on β does not force unnecessary sparseness and offers the best performance
What type of regularization (Cont.)
The traditional CS implementation represents y as a linear combination of samples in H (i.e. y = Hβ)
Many pattern recognition algorithms have shown that better performance can be achieved by a nonlinear mapping of the feature set to a higher-dimensional space
Construction of H
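A hedged sketch of the idea of building H from nonlinearly mapped features; the specific mapping below (RBF similarities to a set of anchor frames) is an illustrative assumption, not the mapping used in the papers.

```python
# Sketch: building H from a nonlinear mapping phi(.) of the original features.
# phi maps a frame to its RBF similarities to a few anchor frames (assumed, illustrative mapping).
import numpy as np

def phi(x, anchors, sigma=1.0):
    """Map a feature vector x to RBF similarities with each anchor point."""
    d2 = np.sum((anchors - x) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(3)
train = rng.standard_normal((200, 13))                    # 200 training frames, 13-dim features (illustrative)
anchors = train[rng.choice(200, size=50, replace=False)]  # 50 anchors -> mapped space is higher-dimensional

# Columns of H are the mapped training exemplars; y is mapped the same way.
H = np.stack([phi(x, anchors) for x in train], axis=1)    # shape (50, 200)
y = phi(rng.standard_normal(13), anchors)                 # mapped test frame
```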
(Table: Performance for different H)
Construction of H (Cont.)
The matrix H = [H₁, H₂, ..., H_C] is constructed consisting of possible examples of the signal
Each H_i could represent features from a different class in the training set
Given a test feature vector y, the goal of SRs is to solve the following equation for β:

y = Hβ   s.t.   ||β||₁ < ε

Each element of β in some sense characterizes how well the corresponding column of H represents the feature vector y
We can make a classification decision for y by choosing the class from H that has the maximum support of β elements
SR formulation for classification
Ideally, all non-zero entries of β should correspond to the entries in H with the same class as y
However, due to noise and modeling errors, β might have non-zero values for more than one class
We can compute the l2 norm for all β entries within a specific class, and choose the class with the largest l2 norm support
◦ Let δ_i(β) ∈ R^N be a vector whose entries are zero except for the entries in β corresponding to class i
◦ The decision rule is then i* = max_i ||δ_i(β)||₂
SR formulation for classification (Cont.)
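A minimal sketch of this decision rule, assuming β has already been computed (e.g. by an l1 solver) and that the class label of every column of H is known; the numbers are made up.

```python
# Sketch: classification decision from the sparse weights beta.
# labels[i] is the class of column i of H; delta_i(beta) keeps only the class-i entries.
import numpy as np

beta = np.array([0.9, 0.1, -0.05, 0.0, 0.6, 0.02])   # illustrative sparse weights
labels = np.array([0, 0, 1, 1, 0, 2])                # class label of each column of H

classes = np.unique(labels)
support = {c: np.linalg.norm(beta[labels == c]) for c in classes}   # ||delta_i(beta)||_2
best = max(support, key=support.get)                 # i* = argmax_i ||delta_i(beta)||_2
print(support, "-> predicted class:", best)
```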
Given a test feature vector y, we first find a sparse β as a solution of y = Hβ. We then compute p_spif as

p_spif = H_phnid β² / ||H_phnid β²||₁   (β² taken element-wise)

We can then use these as input features for recognition
Phone Identification Features
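A hedged sketch of computing such a phone-identification vector, assuming the reconstruction above: each column of H carries a phone label, β is squared element-wise, pooled per phone through an indicator matrix H_phnid, and normalized to sum to one. All names and values are illustrative.

```python
# Sketch: pooling squared sparse weights into a per-phone vector p_spif.
# phone_labels[i] is the phone index of column i of H; beta is the sparse solution for one frame.
import numpy as np

num_phones = 4
beta = np.array([0.8, 0.1, 0.0, -0.3, 0.5, 0.0])      # illustrative sparse weights
phone_labels = np.array([0, 0, 1, 2, 2, 3])           # phone label of each column of H

# H_phnid: one-hot indicator of the phone label of each column of H (num_phones x num_columns)
H_phnid = np.eye(num_phones)[phone_labels].T

p_spif = H_phnid @ (beta ** 2)          # pool squared weights by phone
p_spif = p_spif / p_spif.sum()          # l1-normalize so the vector behaves like a posterior
print(p_spif)
```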
Phone Identification Features (Cont.)
Success of the sparse representation features depends heavily on a good choice of H
Pooling together all training data from all classes into H will make the number of columns of H large (typically millions of frames)
◦ this will make solving for β intractable
Therefore, we should have a good strategy to select H from a large sample set
◦ Seeding H from Nearest Neighbors (sketch below)
  For each y, we find a neighborhood of k closest points to y in the training set
  These k neighbors become the entries of H
  k is chosen to be large enough to ensure that β is sparse and that all training examples are not chosen from the same class
Choice of Dictionary H
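A minimal sketch of nearest-neighbor seeding using scikit-learn's NearestNeighbors; the training matrix, the feature dimension, and k are illustrative assumptions.

```python
# Sketch: seed H for one test frame y with its k nearest training frames.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(4)
train_frames = rng.standard_normal((5000, 40))   # illustrative training set (frames x features)
y = rng.standard_normal(40)                      # one test frame
k = 200                                          # large enough that beta can stay sparse

nn = NearestNeighbors(n_neighbors=k).fit(train_frames)
_, idx = nn.kneighbors(y.reshape(1, -1))
H = train_frames[idx[0]].T                       # columns of H = the k closest training frames
```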
This approach is computationally feasible on small vocabulary tasks, but not for large vocabulary tasks
◦ Using a Trigram Language Model (sketch below)
  Ideally only a small subset of Gaussians are typically evaluated at a given frame
  The training data belonging to this small subset can be used to seed H
  For each test frame y, we decode the data using a trigram language model and find the best aligned Gaussian at each frame
  For each Gaussian, we compute the 4 other closest Gaussians to this Gaussian
  We seed H with the training data aligning to these top 5 Gaussians
◦ Using a Unigram / no Language Model
  increase variability between the Gaussians used to seed H
Choice of Dictionary H (Cont.)
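A hedged sketch of this seeding strategy, assuming a frame-to-Gaussian alignment from a trigram-LM decode and the Gaussian means of the acoustic model are already available; all arrays below are illustrative stand-ins, and plain Euclidean distance between means is an assumed closeness measure.

```python
# Sketch: seed H with training frames aligned to the best Gaussian and its 4 closest neighbors.
import numpy as np

rng = np.random.default_rng(5)
train_frames = rng.standard_normal((50000, 40))       # training frames (illustrative)
frame_to_gauss = rng.integers(0, 10000, size=50000)   # Gaussian index each frame aligns to (illustrative)
gmm_means = rng.standard_normal((10000, 40))          # Gaussian means of the acoustic model (illustrative)

best = 1234                                           # best aligned Gaussian for the current test frame
d = np.linalg.norm(gmm_means - gmm_means[best], axis=1)
top5 = np.argsort(d)[:5]                              # best Gaussian plus its 4 closest Gaussians

mask = np.isin(frame_to_gauss, top5)
H = train_frames[mask].T                              # columns = training frames aligned to the top 5 Gaussians
```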
◦ Enforcing Unique Phonemes
  all of these Gaussians might come from the same phoneme when using the above approaches
  find the 5 closest Gaussians relative to the best aligned one such that the phoneme identities of these Gaussians are unique (e.g. “AA”, “AE”, “AW”, etc.)
◦ Using Gaussian Means (sketch below)
  The above approaches of seeding H use actual examples from the training set, which is computationally expensive
  We can instead seed H from Gaussian means
  At each frame we use a trigram LM to find the best aligned Gaussian
  Then we find the 499 closest Gaussians to this top Gaussian, and use the means from these 500 Gaussians to seed H
Choice of Dictionary H (Cont.)
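A hedged sketch of the Gaussian-means variant: instead of pulling actual training frames, stack the means of the best aligned Gaussian and its 499 closest neighbors as the columns of H. The means, the counts, and the Euclidean distance are illustrative assumptions; the best aligned Gaussian would come from the trigram-LM decode.

```python
# Sketch: seed H with the means of the 500 Gaussians closest to the best aligned Gaussian.
import numpy as np

rng = np.random.default_rng(6)
gmm_means = rng.standard_normal((10000, 40))   # means of all Gaussians in the acoustic model (illustrative)
best = 1234                                    # best aligned Gaussian for this frame (from decoding)

d = np.linalg.norm(gmm_means - gmm_means[best], axis=1)
closest = np.argsort(d)[:500]                  # the best aligned Gaussian plus its 499 closest neighbors
H = gmm_means[closest].T                       # columns of H = selected Gaussian means
```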
Choice of Dictionary H (Cont.)
Using the Maximum Support as a metric is too hard of a decision, as β from other classes is often non-zero
Making a softer decision by using the l2 norm of δ_i(β) offers higher accuracy
Using the residual error ||y − H δ_i(β)||₂ offers the lowest accuracy
◦ When β_i is reduced to 0, the residual reduces to ||y||₂, which is a very small number and might not offer good distinguishability from class residuals in which β_i is high
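A small sketch comparing the three decision metrics on made-up values: maximum support, per-class l2 norm of δ_i(β), and the class residual ||y − H δ_i(β)||₂ (lower is better for the residual); H, the labels, and β are illustrative.

```python
# Sketch: three ways to turn the sparse weights beta into a class decision.
import numpy as np

rng = np.random.default_rng(7)
H = rng.standard_normal((30, 8))
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2])       # class of each column of H (illustrative)
beta = np.array([0.7, 0.4, 0.0, 0.1, -0.05, 0.0, 0.02, 0.0])
y = H @ beta

def delta(beta, labels, c):
    """Keep only the beta entries belonging to class c (zero elsewhere)."""
    out = np.zeros_like(beta)
    out[labels == c] = beta[labels == c]
    return out

for c in np.unique(labels):
    d = delta(beta, labels, c)
    print(c,
          "max support:", np.max(np.abs(d)),
          "l2 support:", np.linalg.norm(d),
          "residual:", np.linalg.norm(y - H @ d))
```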
SR for Text Categorization
◦ 18,000 news documents / 20 classes
◦ TF features