Sparse Representation
Shih-Hsiang Lin (林士翔 )
1. T. N. Sainath, et al., “Bayesian Compressive Sensing for Phonetic Classification,” ICASSP 2010
2. T. N. Sainath, et al., “Sparse Representation Phone Identification Features for Speech Recognition,” IBM T.J. Watson Research Center, Tech. Rep., 2010
3. T. N. Sainath, et al., “Sparse Representation Features for Speech Recognition,” INTERSPEECH 2010
4. T. N. Sainath, et al., “Sparse Representations for Text Categorization,” INTERSPEECH 2010
5. V. Goel, et al., “Incorporating Sparse Representation Phone Identification Features in Automatic Speech Recognition using Exponential Families,” INTERSPEECH 2010
6. D. Kanevsky, et al., “An Analysis of Sparseness and Regularization in Exemplar-Based Methods for Speech Classification,” INTERSPEECH 2010
7. A. Sethy, et al., “Data Selection for Language Modeling Using Sparse Representations,” INTERSPEECH 2010
Introduction
Related Work
◦ Supervised Summarizers
◦ Unsupervised Summarizers
Risk Minimization Framework
Experiments
Conclusion and Future Work
Outline
Sparse representation (SR) has become a popular technique for efficient representation and compression of signals
Introduction
y = Hβ   s.t.   ||β||₁ < ε
H ∈ R^(m×n) is constructed consisting of possible examples of the signal
β ∈ R^(n×1) is a weight vector whose elements reflect the importance of the corresponding training samples
A sparseness condition is enforced on β, such that it selects a small number of examples from H to describe y
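As a rough illustration of this formulation, the minimal sketch below solves the same kind of problem with scikit-learn's Lasso as a stand-in l1 solver (the papers use other solvers such as ABCS); the matrix H, the test vector y, and the penalty weight alpha are made-up, illustrative values.

```python
# Sketch: sparse representation of a test vector y over an exemplar dictionary H.
# H (m x n) stacks n training exemplars as columns; y is an m-dimensional test vector.
import numpy as np
from sklearn.linear_model import Lasso

m, n = 40, 200
rng = np.random.default_rng(0)
H = rng.standard_normal((m, n))            # columns = training exemplars (illustrative data)
beta_true = np.zeros(n)
beta_true[[3, 57, 120]] = [1.0, -0.5, 2.0]
y = H @ beta_true                          # test vector built from a few exemplars

# l1-regularized least squares (up to scaling): min ||y - H beta||_2^2 + alpha * ||beta||_1
solver = Lasso(alpha=0.05, fit_intercept=False, max_iter=10000)
solver.fit(H, y)
beta = solver.coef_

print("non-zero weights:", np.flatnonzero(beta))   # only a few exemplars are selected
```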
One benefit of SR is that for a given test example, SR adaptively selects the relevant support vectors from the training set H
It has also shown success in face recognition over linear SVM and 1-NN methods
◦ SVMs select a sparse subset of relevant training examples (support vectors) and use these supports to characterize “all” examples in the test set
◦ kNNs characterize a test point by selecting a small number of k points from the training set which are closest to the test vector, voting on the class that has the highest occurrence from these k samples (sketch below)
Introduction (Cont.)
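For contrast with SR's adaptive exemplar selection, here is a minimal kNN baseline sketch using scikit-learn's KNeighborsClassifier; the data, the number of classes, and k are illustrative assumptions.

```python
# Sketch: kNN baseline - the k closest training points vote on the class of a test point.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X_train = rng.standard_normal((300, 13))     # illustrative training features
y_train = rng.integers(0, 3, size=300)       # 3 classes (illustrative labels)
x_test = rng.standard_normal((1, 13))        # one test point

knn = KNeighborsClassifier(n_neighbors=5)    # k = 5 nearest neighbors vote
knn.fit(X_train, y_train)
print("kNN prediction:", knn.predict(x_test)[0])
```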
We can think of SR techniques as a kind of exemplar-based method
Issues
◦ Why sparse representations?
◦ What type of regularization?
◦ Construction of H?
◦ Choice of Dictionary?
◦ Choice of sampling?
Introduction (Cont.)
Why sparse representations
(Figure: columns of the dictionary H labeled by class, e.g. C1 C1 C1 C2 C2 C2 C2)
Sparseness constraint: min_β ||β||₁ s.t. y = Hβ
l1 norm: min_β ||y − Hβ||₂² + λ||β||₁
◦ This constraint can be modeled as a Laplacian prior
◦ LASSO, Bayesian Compressive Sensing (BCS)
Impose a combination of an l1 and l2 constraint: min_β ||y − Hβ||₂² + λ₁||β||₁ + λ₂||β||₂²
◦ Elastic Net
◦ Cyclic Subgradient Projections (CSP)
Impose a semi-Gaussian constraint: min_β ||y − Hβ||₂² + λ||β||₁²
◦ Approximate Bayesian Compressive Sensing (ABCS)
What type of regularization
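To make the regularizers concrete, here is a hedged sketch comparing ridge (l2), LASSO (l1), and elastic net (l1 + l2) on the same H and y with scikit-learn; ABCS and CSP have no off-the-shelf implementation here and are omitted, and the data and penalty weights are illustrative.

```python
# Sketch: comparing regularizers for beta on the same exemplar matrix H and test vector y.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(2)
H = rng.standard_normal((40, 200))                        # illustrative exemplar matrix
y = H[:, [5, 17, 90]] @ np.array([1.0, 0.7, -0.4])        # y built from a few columns of H

for name, model in [("RR (l2)", Ridge(alpha=1.0, fit_intercept=False)),
                    ("LASSO (l1)", Lasso(alpha=0.05, fit_intercept=False, max_iter=10000)),
                    ("Elastic Net", ElasticNet(alpha=0.05, l1_ratio=0.5,
                                               fit_intercept=False, max_iter=10000))]:
    model.fit(H, y)
    beta = model.coef_
    sparsity = np.mean(np.abs(beta) < 1e-6)               # fraction of (near-)zero weights
    print(f"{name:12s} fraction of zero weights: {sparsity:.2f}")
```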
As the size of H increases up to 1,000, the error rates of the RR and SR both decrease
◦ showing the benefit of including multiple training examples when making a classification decision
There is no difference in error between the RR and SR techniques
◦ suggesting that regularization does not provide any extra benefit
What type of regularization (Cont.)
The plot shows that the coefficients for the RR method are the least sparse
The LASSO technique has the sparsest values
What type of regularization (Cont.)
A randomly selected classification frame y in TIMIT and an H of size 200
Accuracy decreases when a high degree of sparseness is enforced
Thus, it appears that using a combination of l1 and l2 constraints on β does not force unnecessary sparseness and offers the best performance
What type of regularization (Cont.)
The traditional CS implementation represents y as a linear combination of samples in H (i.e. y = Hβ)
Many pattern recognition algorithms have shown that better performance can be achieved by a nonlinear mapping of the feature set to a higher-dimensional space
Construction of H
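A hedged sketch of the idea of building H from nonlinearly mapped features; the specific mapping below (RBF similarities to a set of anchor frames) is an illustrative assumption, not the mapping used in the papers.

```python
# Sketch: building H from a nonlinear mapping phi(.) of the original features.
# phi maps a frame to its RBF similarities to a few anchor frames (assumed, illustrative mapping).
import numpy as np

def phi(x, anchors, sigma=1.0):
    """Map a feature vector x to RBF similarities with each anchor point."""
    d2 = np.sum((anchors - x) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(3)
train = rng.standard_normal((200, 13))                    # 200 training frames, 13-dim features (illustrative)
anchors = train[rng.choice(200, size=50, replace=False)]  # 50 anchors -> mapped space is higher-dimensional

# Columns of H are the mapped training exemplars; y is mapped the same way.
H = np.stack([phi(x, anchors) for x in train], axis=1)    # shape (50, 200)
y = phi(rng.standard_normal(13), anchors)                 # mapped test frame
```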
(Table: Performance for different H)
Construction of H (Cont.)
The matrix H = [H₁, H₂, ..., H_C] is constructed consisting of possible examples of the signal
Each H_i could represent features from a different class in the training set
Given a test feature vector y, the goal of SRs is to solve the following equation for β:

y = Hβ   s.t.   ||β||₁ < ε

Each element of β in some sense characterizes how well the corresponding column of H represents the feature vector y
We can make a classification decision for y by choosing the class from H that has the maximum support of β elements
SR formulation for classification
Ideally, all non-zero entries of β should correspond to the entries in H with the same class as y
However, due to noise and modeling errors, β might have non-zero values for more than one class
We can compute the l2 norm for all β entries within a specific class, and choose the class with the largest l2 norm support
◦ Let δ_i(β) ∈ R^N be a vector whose entries are zero except for the entries in β corresponding to class i
◦ The decision rule is then i* = max_i ||δ_i(β)||₂
SR formulation for classification (Cont.)
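A minimal sketch of this decision rule, assuming β has already been computed (e.g. by an l1 solver) and that the class label of every column of H is known; the numbers are made up.

```python
# Sketch: classification decision from the sparse weights beta.
# labels[i] is the class of column i of H; delta_i(beta) keeps only the class-i entries.
import numpy as np

beta = np.array([0.9, 0.1, -0.05, 0.0, 0.6, 0.02])   # illustrative sparse weights
labels = np.array([0, 0, 1, 1, 0, 2])                # class label of each column of H

classes = np.unique(labels)
support = {c: np.linalg.norm(beta[labels == c]) for c in classes}   # ||delta_i(beta)||_2
best = max(support, key=support.get)                 # i* = argmax_i ||delta_i(beta)||_2
print(support, "-> predicted class:", best)
```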
Given a test feature vector y, we first find a sparse β as a solution of y = Hβ. We then compute p_spif as

p_spif = H_phnid β² / ||H_phnid β²||₁   (β² taken element-wise)

We can then use these as input features for recognition
Phone Identification Features
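A hedged sketch of computing such a phone-identification vector, assuming the reconstruction above: each column of H carries a phone label, β is squared element-wise, pooled per phone through an indicator matrix H_phnid, and normalized to sum to one. All names and values are illustrative.

```python
# Sketch: pooling squared sparse weights into a per-phone vector p_spif.
# phone_labels[i] is the phone index of column i of H; beta is the sparse solution for one frame.
import numpy as np

num_phones = 4
beta = np.array([0.8, 0.1, 0.0, -0.3, 0.5, 0.0])      # illustrative sparse weights
phone_labels = np.array([0, 0, 1, 2, 2, 3])           # phone label of each column of H

# H_phnid: one-hot indicator of the phone label of each column of H (num_phones x num_columns)
H_phnid = np.eye(num_phones)[phone_labels].T

p_spif = H_phnid @ (beta ** 2)          # pool squared weights by phone
p_spif = p_spif / p_spif.sum()          # l1-normalize so the vector behaves like a posterior
print(p_spif)
```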
Phone Identification Features (Cont.)
Success of the sparse representation features depends heavily on a good choice of H
Pooling together all training data from all classes into H will make the number of columns of H large (typically millions of frames)
◦ this will make solving for β intractable
Therefore, we should have a good strategy to select H from a large sample set
◦ Seeding H from Nearest Neighbors (sketch below)
  For each y, we find a neighborhood of k closest points to y in the training set
  These k neighbors become the entries of H
  k is chosen to be large enough to ensure that β is sparse and that all training examples are not chosen from the same class
Choice of Dictionary H
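A minimal sketch of nearest-neighbor seeding using scikit-learn's NearestNeighbors; the training matrix, the feature dimension, and k are illustrative assumptions.

```python
# Sketch: seed H for one test frame y with its k nearest training frames.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(4)
train_frames = rng.standard_normal((5000, 40))   # illustrative training set (frames x features)
y = rng.standard_normal(40)                      # one test frame
k = 200                                          # large enough that beta can stay sparse

nn = NearestNeighbors(n_neighbors=k).fit(train_frames)
_, idx = nn.kneighbors(y.reshape(1, -1))
H = train_frames[idx[0]].T                       # columns of H = the k closest training frames
```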
This approach is computationally feasible on small vocabulary tasks, but not for large vocabulary tasks
◦ Using a Trigram Language Model (sketch below)
  Ideally only a small subset of Gaussians are typically evaluated at a given frame
  The training data belonging to this small subset can be used to seed H
  For each test frame y, we decode the data using a trigram language model and find the best aligned Gaussian at each frame
  For each Gaussian, we compute the 4 other closest Gaussians to this Gaussian
  We seed H with the training data aligning to these top 5 Gaussians
◦ Using a Unigram / no Language Model
  increase variability between the Gaussians used to seed H
Choice of Dictionary H (Cont.)
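A hedged sketch of this seeding strategy, assuming a frame-to-Gaussian alignment from a trigram-LM decode and the Gaussian means of the acoustic model are already available; all arrays below are illustrative stand-ins, and plain Euclidean distance between means is an assumed closeness measure.

```python
# Sketch: seed H with training frames aligned to the best Gaussian and its 4 closest neighbors.
import numpy as np

rng = np.random.default_rng(5)
train_frames = rng.standard_normal((50000, 40))       # training frames (illustrative)
frame_to_gauss = rng.integers(0, 10000, size=50000)   # Gaussian index each frame aligns to (illustrative)
gmm_means = rng.standard_normal((10000, 40))          # Gaussian means of the acoustic model (illustrative)

best = 1234                                           # best aligned Gaussian for the current test frame
d = np.linalg.norm(gmm_means - gmm_means[best], axis=1)
top5 = np.argsort(d)[:5]                              # best Gaussian plus its 4 closest Gaussians

mask = np.isin(frame_to_gauss, top5)
H = train_frames[mask].T                              # columns = training frames aligned to the top 5 Gaussians
```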
◦ Enforcing Unique Phonemes
  all of these Gaussians might come from the same phoneme when using the above approaches
  find the 5 closest Gaussians relative to the best aligned one such that the phoneme identities of these Gaussians are unique (e.g. “AA”, “AE”, “AW”, etc.)
◦ Using Gaussian Means (sketch below)
  The above approaches of seeding H use actual examples from the training set, which is computationally expensive
  We can instead seed H from Gaussian means
  At each frame we use a trigram LM to find the best aligned Gaussian
  Then we find the 499 closest Gaussians to this top Gaussian, and use the means from these 500 Gaussians to seed H
Choice of Dictionary H (Cont.)
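A hedged sketch of the Gaussian-means variant: instead of pulling actual training frames, stack the means of the best aligned Gaussian and its 499 closest neighbors as the columns of H. The means, the counts, and the Euclidean distance are illustrative assumptions; the best aligned Gaussian would come from the trigram-LM decode.

```python
# Sketch: seed H with the means of the 500 Gaussians closest to the best aligned Gaussian.
import numpy as np

rng = np.random.default_rng(6)
gmm_means = rng.standard_normal((10000, 40))   # means of all Gaussians in the acoustic model (illustrative)
best = 1234                                    # best aligned Gaussian for this frame (from decoding)

d = np.linalg.norm(gmm_means - gmm_means[best], axis=1)
closest = np.argsort(d)[:500]                  # the best aligned Gaussian plus its 499 closest neighbors
H = gmm_means[closest].T                       # columns of H = selected Gaussian means
```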
Choice of Dictionary H (Cont.)
Using the Maximum Support as a metric is too hard of a decision, as β from other classes is often non-zero
Making a softer decision by using the l2 norm of δ_i(β) offers higher accuracy
Using the residual error ||y − H δ_i(β)||₂ offers the lowest accuracy
◦ When β_i is reduced to 0, the residual reduces to ||y||₂, which is a very small number and might not offer good distinguishability from class residuals in which β_i is high
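A small sketch comparing the three decision metrics on made-up values: maximum support, per-class l2 norm of δ_i(β), and the class residual ||y − H δ_i(β)||₂ (lower is better for the residual); H, the labels, and β are illustrative.

```python
# Sketch: three ways to turn the sparse weights beta into a class decision.
import numpy as np

rng = np.random.default_rng(7)
H = rng.standard_normal((30, 8))
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2])       # class of each column of H (illustrative)
beta = np.array([0.7, 0.4, 0.0, 0.1, -0.05, 0.0, 0.02, 0.0])
y = H @ beta

def delta(beta, labels, c):
    """Keep only the beta entries belonging to class c (zero elsewhere)."""
    out = np.zeros_like(beta)
    out[labels == c] = beta[labels == c]
    return out

for c in np.unique(labels):
    d = delta(beta, labels, c)
    print(c,
          "max support:", np.max(np.abs(d)),
          "l2 support:", np.linalg.norm(d),
          "residual:", np.linalg.norm(y - H @ d))
```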
SR for Text Categorization
◦ 18,000 news documents / 20 classes
◦ TF features