
Page 1: Sparse Representation

Sparse Representation

Shih-Hsiang Lin (林士翔)

1. T. N. Sainath, et al., “Bayesian Compressive Sensing for Phonetic Classification,” ICASSP 2010
2. T. N. Sainath, et al., “Sparse Representation Phone Identification Features for Speech Recognition,” IBM T.J. Watson Research Center, Tech. Rep., 2010
3. T. N. Sainath, et al., “Sparse Representation Features for Speech Recognition,” INTERSPEECH 2010
4. T. N. Sainath, et al., “Sparse Representations for Text Categorization,” INTERSPEECH 2010
5. V. Goel, et al., “Incorporating Sparse Representation Phone Identification Features in Automatic Speech Recognition using Exponential Families,” INTERSPEECH 2010
6. D. Kanevsky, et al., “An Analysis of Sparseness and Regularization in Exemplar-Based Methods for Speech Classification,” INTERSPEECH 2010
7. A. Sethy, et al., “Data Selection for Language Modeling Using Sparse Representations,” INTERSPEECH 2010

Page 2: Sparse Representation

2

Introduction
Related Work
◦ Supervised Summarizers
◦ Unsupervised Summarizers
Risk Minimization Framework
Experiments
Conclusion and Future Work

Outline

Page 3: Sparse Representation

3

Sparse representation (SR) has become a popular technique for efficient representation and compression of signals

Introduction

y = Hβ s.t. ||β||_1 ≤ ε

◦ H ∈ R^(m×n) is constructed consisting of possible examples of the signal
◦ β ∈ R^(n×1) is a weight vector whose elements reflect the importance of the corresponding training samples

[Diagram: y = Hβ]

A sparseness condition is enforced on β, such that it selects a small number of examples from H to describe y
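As a rough illustration (not the papers' setup), the l1-constrained problem above can be approximated with an off-the-shelf LASSO solver; the dictionary H, test vector y, and regularization weight below are made-up stand-ins.

```python
# Minimal sketch: recover a sparse beta such that y ≈ H @ beta.
# H, y, and alpha are illustrative stand-ins, not values from the papers.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
m, n = 40, 200                     # feature dimension m, number of training exemplars n
H = rng.standard_normal((m, n))    # dictionary: one column per training example
beta_true = np.zeros(n)
beta_true[rng.choice(n, size=5, replace=False)] = rng.standard_normal(5)
y = H @ beta_true                  # test vector built from a few exemplars

lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=10000)
lasso.fit(H, y)                    # minimizes (1/(2m))||y - H beta||_2^2 + alpha*||beta||_1
beta = lasso.coef_
print("non-zero weights:", np.count_nonzero(np.abs(beta) > 1e-6))
```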

Page 4: Sparse Representation

4

One benefit of SR is that, for a given test example, SR adaptively selects the relevant support vectors from the training set H

It has also shown success in face recognition over linear SVM and 1-NN methods
◦ SVMs select a sparse subset of relevant training examples (support vectors) and use these supports to characterize “all” examples in the test set
◦ kNNs characterize a test point by selecting a small number of k points from the training set which are closest to the test vector, and voting on the class that has the highest occurrence among these k samples

Introduction (Cont.)

Page 5: Sparse Representation

5

We can think of SR techniques as a kind of exemplar-based method

Issues
◦ Why sparse representations?
◦ What type of regularization?
◦ Construction of H?
◦ Choice of Dictionary?
◦ Choice of sampling?

Introduction (Cont.)

Page 6: Sparse Representation

Why sparse representations

6

[Figure: dictionary columns labeled by class: C1 C1 C1 C2 C2 C2 C2]

Page 7: Sparse Representation

7

l1 norm
◦ This constraint can be modeled as a Laplacian prior
◦ LASSO, Bayesian Compressive Sensing (BCS): min_β ||y − Hβ||_2^2 s.t. ||β||_1 ≤ ε

Impose a combination of an l1 and l2 constraint
◦ Elastic Net: min_β ||y − Hβ||_2^2 + λ_1 ||β||_1 + λ_2 ||β||_2^2
◦ Cyclic Subgradient Projections (CSP): find β with ||y − Hβ||_2^2 ≤ ε_0 and ||β||_1 ≤ ε_1

Impose a semi-Gaussian constraint
◦ Approximate Bayesian Compressive Sensing (ABCS): min_β ||y − Hβ||_2^2 s.t. ||β||_1^2 ≤ ε

What type of regularization

y = Hβ s.t. a sparseness constraint on β (e.g. ||β||_1 ≤ ε)
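To make the behaviour of these regularizers concrete, the sketch below fits ridge (l2 only), LASSO (l1), and elastic net (l1 + l2) to the same synthetic y = Hβ problem and counts non-zero weights; ABCS and CSP are not reproduced here, and all data and regularization strengths are arbitrary choices.

```python
# Sketch comparing how different regularizers affect the sparsity of beta.
# Data and regularization strengths are arbitrary illustrative choices.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(1)
m, n = 40, 200
H = rng.standard_normal((m, n))
beta_true = np.zeros(n)
beta_true[:5] = 1.0
y = H @ beta_true + 0.01 * rng.standard_normal(m)

models = {
    "ridge (l2 only)":       Ridge(alpha=1.0, fit_intercept=False),
    "lasso (l1)":            Lasso(alpha=0.05, fit_intercept=False, max_iter=10000),
    "elastic net (l1 + l2)": ElasticNet(alpha=0.05, l1_ratio=0.5,
                                        fit_intercept=False, max_iter=10000),
}
for name, model in models.items():
    beta = model.fit(H, y).coef_
    print(f"{name:24s} non-zeros: {np.count_nonzero(np.abs(beta) > 1e-6)}")
```

Typically the ridge weights are dense while the LASSO and elastic-net weights keep only a handful of exemplars, mirroring the comparison on the next slides.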

Page 8: Sparse Representation

8

As the size of H increases up to 1,000, the error rates of the RR and SR methods both decrease
◦ showing the benefit of including multiple training examples when making a classification decision

There is no difference in error between the RR and SR techniques
◦ suggesting that regularization does not provide any extra benefit

What type of regularization (Cont.)

Page 9: Sparse Representation

9

The plot shows that the coefficients for the RR method are the least sparse

The LASSO technique has the sparsest values

What type of regularization (Cont.)

A randomly selected classification frame y in TIMIT and an H of size 200

Page 10: Sparse Representation

10

Accuracy decreases when a high degree of sparseness is enforced

Thus, it appears that using a combination of a sparseness and a regularization constraint on β does not force unnecessary sparseness and offers the best performance

What type of regularization (Cont.)

Page 11: Sparse Representation

11

The traditional CS implementation represents y as a linear combination of samples in H, i.e. y = Hβ

Many pattern recognition algorithms have shown that better performance can be achieved by a nonlinear mapping of the feature set to a higher dimensional space

Construction of H
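The slides do not specify the nonlinear mapping; as one hedged possibility, the sketch below expands both the training exemplars and the test vector with random Fourier features (an RBF kernel approximation) before stacking them into H.

```python
# Sketch: map features nonlinearly to a higher-dimensional space before
# building H. RBFSampler (random Fourier features) is only one possible
# mapping; the papers' actual choice of nonlinearity is not shown here.
import numpy as np
from sklearn.kernel_approximation import RBFSampler

rng = np.random.default_rng(2)
train = rng.standard_normal((500, 13))   # e.g. 500 training frames, 13-dim features
y_raw = rng.standard_normal((1, 13))     # one test frame

mapper = RBFSampler(gamma=0.5, n_components=400, random_state=0).fit(train)
H = mapper.transform(train).T            # columns of H live in the expanded space
y = mapper.transform(y_raw).ravel()      # test vector mapped the same way
print(H.shape, y.shape)                  # (400, 500) and (400,)
```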

Page 12: Sparse Representation

12

[Table: performance for different H]

Construction of H (Cont.)

Page 13: Sparse Representation

13

The matrix H = [H_1, H_2, …, H_C] is constructed consisting of possible examples of the signal

Each H_i ⊂ H could represent features from a different class i in the training set

Given a test feature vector y, the goal of SR is to solve the following equation for β:

  min_β ||y − Hβ||_2 s.t. ||β||_1 ≤ ε

Each element of β in some sense characterizes how well the corresponding column of H represents the feature vector y

We can make a classification decision for y by choosing the class H_i from H that has the maximum size of β elements

SR formulation for classification

Page 14: Sparse Representation

14

Ideally, all non-zero entries of β should correspond to the entries in H with the same class as y

However, due to noise and modeling errors, β might have non-zero values for more than one class

We can compute the l2 norm for all β entries within a specific class, and choose the class with the largest l2 norm support
◦ Let δ_i(β) ∈ R^N be a vector whose entries are zero except for the entries in β corresponding to class i
◦ The classification decision is i* = arg max_i ||δ_i(β)||_2

SR formulation for classification (Cont.)
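A minimal sketch of this decision rule; the class labels of H's columns and the solved β are illustrative placeholders.

```python
# Sketch of the classification rule: keep only the beta entries belonging to
# class i (delta_i(beta)) and pick the class with the largest l2 norm.
# The labels and beta values are illustrative placeholders.
import numpy as np

labels = np.array([0, 0, 1, 1, 1, 2, 2])              # class of each column of H
beta = np.array([0.0, 0.1, 0.8, 0.4, 0.0, 0.05, 0.0])

def classify_by_support(beta, labels):
    classes = np.unique(labels)
    # delta_i(beta): beta with all entries zeroed except those of class i
    norms = {c: np.linalg.norm(np.where(labels == c, beta, 0.0)) for c in classes}
    return max(norms, key=norms.get), norms

pred, norms = classify_by_support(beta, labels)
print(pred, norms)   # class 1 wins here: its support carries the most weight
```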

Page 15: Sparse Representation

15

Given a test feature vector y, we first find a sparse β as a solution of y = Hβ (subject to ||β||_1 ≤ ε). We then compute p_spif as

  p_spif = H_phnid β

where H_phnid is an indicator matrix that maps each column of H to its phone identity

We can then use these as input features for recognition

Phone Identification Features
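The sketch below assumes p_spif is obtained by pooling the β weights per phone through an indicator matrix H_phnid; the exact pooling and normalization used in the papers may differ, so treat this only as an illustration of the idea.

```python
# Sketch of phone identification features: pool the sparse weights beta by
# phone class via an indicator matrix H_phnid. The pooling and normalization
# here are assumptions for illustration, not the papers' exact recipe.
import numpy as np

phone_labels = np.array([0, 0, 1, 2, 2, 2])      # phone class of each column of H
beta = np.array([0.6, 0.1, 0.0, 0.3, 0.0, 0.2])  # sparse solution of y ≈ H beta

n_phones = phone_labels.max() + 1
H_phnid = np.zeros((n_phones, len(beta)))
H_phnid[phone_labels, np.arange(len(beta))] = 1.0   # column j indicates its phone

p_spif = H_phnid @ np.abs(beta)
p_spif /= p_spif.sum()            # assumed normalization so entries sum to 1
print(p_spif)                     # per-phone score used as a feature for recognition
```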

Page 16: Sparse Representation

16

Phone Identification Features (cont.)

Page 17: Sparse Representation

17

Success of the sparse representation features depends heavily on a good choice of H

Pooling together all training data from all classes into H will make the number of columns of H large (typically millions of frames)
◦ this will make solving for β intractable

Therefore, we should have a good strategy to select H from a large sample set
◦ Seeding H from Nearest Neighbors (see the sketch below)
 For each y, we find a neighborhood of the k closest points to y in the training set
 These k neighbors become the entries of H
 k is chosen to be large to ensure that β is sparse and that all training examples are not chosen from the same class

Choice of Dictionary H
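A minimal sketch of the nearest-neighbor seeding above, assuming a generic training matrix, Euclidean distance, and an arbitrary k.

```python
# Sketch: for a test frame y, seed H with its k nearest training frames.
# The training data, k, and the Euclidean metric are illustrative choices.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
train = rng.standard_normal((10000, 40))         # training frames
train_labels = rng.integers(0, 49, size=10000)   # e.g. a phone label per frame
y = rng.standard_normal(40)                      # one test frame

k = 200                                          # large enough that beta stays sparse
nn = NearestNeighbors(n_neighbors=k).fit(train)
_, idx = nn.kneighbors(y.reshape(1, -1))
H = train[idx[0]].T                              # columns of H = the k closest frames
H_labels = train_labels[idx[0]]                  # kept for the later class decision
print(H.shape, np.unique(H_labels).size)
```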

Page 18: Sparse Representation

18

This approach is computationally feasible on small vocabulary tasks, but not for large vocabulary tasks

◦ Using a Trigram Language Model
 Ideally, only a small subset of Gaussians is typically evaluated at a given frame
 The training data belonging to this small subset can be used to seed H
 For each test frame y, we decode the data using a trigram language model and find the best aligned Gaussian at each frame
 For each such Gaussian, we compute the 4 other closest Gaussians to this Gaussian
 We seed H with the training data aligning to these top 5 Gaussians (see the sketch below)
◦ Using a Unigram / no Language Model
 This increases variability between the Gaussians used to seed H

Choice of Dictionary H (Cont.)
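The sketch below assumes the best-aligned Gaussian for the frame is already known from a trigram-LM decode, and measures Gaussian closeness by the Euclidean distance between means (the slides do not specify the distance measure); all arrays are placeholders.

```python
# Sketch of trigram-LM seeding: take the best-aligned Gaussian plus its 4
# closest neighbours (closeness by distance between means, an assumption)
# and pool the training frames aligned to those 5 Gaussians into H.
# gauss_means, train_frames, train_align, and best_gauss are placeholders.
import numpy as np

rng = np.random.default_rng(4)
n_gauss, dim = 1000, 40
gauss_means = rng.standard_normal((n_gauss, dim))     # one mean per Gaussian
train_frames = rng.standard_normal((50000, dim))      # training data
train_align = rng.integers(0, n_gauss, size=50000)    # Gaussian aligned to each frame

best_gauss = 17                                       # from the trigram-LM decode (placeholder)
dists = np.linalg.norm(gauss_means - gauss_means[best_gauss], axis=1)
top5 = np.argsort(dists)[:5]                          # best Gaussian + its 4 closest
mask = np.isin(train_align, top5)
H = train_frames[mask].T                              # seed H with their aligned frames
print(H.shape)
```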

Page 19: Sparse Representation

19

◦ Enforcing Unique Phonemes
 All of these Gaussians might come from the same phoneme when using the above approaches
 Instead, we find the 5 closest Gaussians relative to the best aligned one such that the phoneme identities of these Gaussians are unique (e.g. “AA”, “AE”, “AW”, etc.)
◦ Using Gaussian Means
 The above approaches of seeding H use actual examples from the training set, which is computationally expensive
 We can instead seed H from Gaussian means
 At each frame we use a trigram LM to find the best aligned Gaussian
 Then we find the 499 closest Gaussians to this top Gaussian, and use the means of these 500 Gaussians to seed H (a sketch follows this slide)

Choice of Dictionary H (Cont.)
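A sketch of the Gaussian-mean variant, which differs from the previous sketch in that the Gaussian means themselves become the columns of H; the model and the best-aligned Gaussian index are placeholders.

```python
# Sketch: seed H directly from Gaussian means instead of training frames.
# Take the best-aligned Gaussian of the frame plus the 499 closest Gaussians
# and use those 500 means as the columns of H. All arrays are placeholders.
import numpy as np

rng = np.random.default_rng(5)
gauss_means = rng.standard_normal((10000, 40))        # means of all Gaussians in the model
best_gauss = 42                                       # best aligned Gaussian for this frame

dists = np.linalg.norm(gauss_means - gauss_means[best_gauss], axis=1)
top500 = np.argsort(dists)[:500]                      # includes best_gauss itself
H = gauss_means[top500].T                             # 40 x 500 dictionary of means
print(H.shape)
```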

Page 20: Sparse Representation

20

Choice of Dictionary H (Cont.)

Page 21: Sparse Representation

21

Using the Maximum Support as a metric is too hard of a decision, as β from other classes is often non-zero

Making a softer decision by using the l2 norm of δ_i(β) offers higher accuracy

Using the residual error ||y − H δ_i(β)||_2 offers the lowest accuracy
◦ When δ_i(β) = 0, the residual reduces to ||y||_2, which is a very small number and might not offer good distinguishability from class residuals in which β is high
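A small sketch computing the three decision rules side by side on placeholder data.

```python
# Sketch comparing the three decision rules: maximum support, l2 norm of
# delta_i(beta), and the class residual ||y - H @ delta_i(beta)||_2.
# H, y, beta, and the labels are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(6)
labels = np.array([0, 0, 0, 1, 1, 1])        # class of each column of H
H = rng.standard_normal((8, 6))
beta = np.array([0.7, 0.2, 0.0, 0.1, 0.0, 0.05])
y = H @ beta

def delta(beta, labels, c):
    return np.where(labels == c, beta, 0.0)  # zero out entries of other classes

classes = np.unique(labels)
max_support = labels[np.argmax(np.abs(beta))]                      # hard decision
l2_rule = max(classes, key=lambda c: np.linalg.norm(delta(beta, labels, c)))
residual_rule = min(classes, key=lambda c: np.linalg.norm(y - H @ delta(beta, labels, c)))
print(max_support, l2_rule, residual_rule)
```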

SR for Text Categorization
◦ 18,000 news documents / 20 classes
◦ TF features