
Sparse modeling 1

Shiro Ikeda

The Institute of Statistical Mathematics

26 June 2015

Theme

Information processing with sparse modeling

▶ It is becoming more popular in many fields.

▶ It will be a standard method.

▶ Compressed sensing is an important keyword.

Today’s topic

▶ What is sparsity?

▶ How can we use it?

▶ What is the difficulty?

Sparsity-based information processing

What type of processing

▶ Model selection

▶ Compression

▶ Clustering

▶ Denoising

▶ Image recognition

▶ Data analysis

Fields

▶ Statistics

▶ Machine learning

▶ Information theory

▶ Optimization theory

▶ Signal processing

▶ Measurement technology

Domestic projects

Figure: MEXT Grant-in-Aid for Scientific Research on Innovative Areas (2013-2018), Initiative for High-Dimensional Data-Driven Science through Deepening of Sparse Modeling.

Theory of sparse modeling: Sparsity and Linear equation

Theory of sparse modeling

▶ Sparsity and Linear equation
▶ Sparsity of data
▶ Uniqueness of the sparse solution
▶ Relaxed problem
▶ Noisy observation

Summary

What is sparsity?

A high-dimensional vector has many zero components:

x = (x1, · · · , xn)^T, xi ∈ ℜ.

y is a function of x,

y = f(x),

and only a small number of the components of x contribute to y. This is the assumption of sparsity.

▶ A harmonic sound has many zeros in the frequency domain.

▶ There are many genes, but only a small number of them are related to a specific disease.

▶ A movie is a sequence of images, but the number of pixels that change from one frame to the next is not large.

Linear equation

y = f(x)

The simplest case is a linear equation. Let y = (y1, · · · , ym)^T be a function of the n-dimensional real vector x = (x1, · · · , xn)^T:

y_i = \sum_j a_{ij} x_j, \quad i = 1, \cdots, m.

By defining A = (aij),

y = Ax.

Assume A is known, and our problem is to estimate x from y.

Linear equation: m = n

If m = n and A−1 exists, x = A−1y

y = Ax = \begin{pmatrix} -1 & 2 & -1 \\ 3 & -1 & 2 \\ -1 & 1 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}.

When (y1, y2, y3)T is observed, x is computed as follows.

x = A^{-1}y = \frac{1}{9} \begin{pmatrix} 3 & 3 & -3 \\ 5 & 2 & 1 \\ -2 & 1 & 5 \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix}.
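This can be checked numerically; a minimal sketch in Python/NumPy, where the test vector x_true is an arbitrary choice, not from the slides:

```python
import numpy as np

# The invertible 3x3 matrix from the example above.
A = np.array([[-1.0,  2.0, -1.0],
              [ 3.0, -1.0,  2.0],
              [-1.0,  1.0,  1.0]])

x_true = np.array([0.0, 1.0, 0.0])   # arbitrary test vector (an assumption)
y = A @ x_true

# With m = n and A invertible, x is recovered exactly.
x_hat = np.linalg.solve(A, y)        # solves Ax = y without forming A^{-1}
assert np.allclose(x_hat, x_true)

# The explicit inverse matches the (1/9)-scaled matrix on the slide.
assert np.allclose(np.linalg.inv(A) * 9, [[3, 3, -3], [5, 2, 1], [-2, 1, 5]])
```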

Linear equation: m < n

If m, the dimension of y, is smaller than n, the dimension of x, the system is under-determined: there are infinitely many solutions x that satisfy the equation.

\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} -1 & 2 & -1 \\ 3 & -1 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}.

When y1 = 2, y2 = −1, solving the two linear equations gives the following line:

(x1, x2, x3)^T = (−3t, t + 1, 5t)^T.

Any point on this line satisfies the equation.

Linear equation: m < n

Suppose x is known to be sparse. The point on the line where the solution is sparsest is t = 0, giving x = (0, 1, 0)^T.

Sparse solution

We could solve the equation by assuming the solution is sparse.
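A minimal numerical check of this example: every point on the line fits the equation, and counting nonzeros shows t = 0 is the sparsest (the grid of t values below is arbitrary):

```python
import numpy as np

A = np.array([[-1.0, 2.0, -1.0],
              [ 3.0, -1.0, 2.0]])
y = np.array([2.0, -1.0])

# The solution line from the slide: x(t) = (-3t, t + 1, 5t).
for t in (-1.0, -0.5, 0.0, 0.5, 1.0):
    x = np.array([-3 * t, t + 1, 5 * t])
    assert np.allclose(A @ x, y)                    # every t satisfies y = Ax
    print(t, np.count_nonzero(np.abs(x) > 1e-12))   # nonzeros: minimal at t = 0
```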

Single Pixel Camera

Figure: The single-pixel camera, a project at Rice University.

Single Pixel Camera: Problem

Recent digital cameras have a large number of pixels, but this camera has a single pixel. A single-pixel camera uses many micro mirrors to collect the image.

(a) Image for the camera. (b) A pattern of the micro mirrors.

Single Pixel Camera: Compressed Sensing

x is an image and we would like to observe (sense) it. Eventually, we want to reconstruct the image from the observations. A “single” observation of x is the inner product between a row vector of A, that is, a^(l) = (a_l1, · · · , a_ln), and x. More precisely, a single observation is equivalent to seeing the following y_l:

y_l = a^(l) x.

After observing y_1, · · · , y_m, we want to reconstruct x. The vector y is the collection of the y_l:

y = Ax.

Donoho (2006). “Compressed sensing,” IEEE Trans. Information Theory, 52(4), 1289-1306.
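A rough simulation of this sensing model, under the assumption (not stated on the slide) that each mirror pattern is an independent random 0/1 row:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1024, 512                      # pixels and number of mirror patterns

# Sparse "scene": 234 positive components, as in the later simulation slide.
x = np.zeros(n)
x[rng.choice(n, size=234, replace=False)] = rng.random(234)

# Hypothetical mirror patterns: each row a^(l) is a random 0/1 vector.
A = rng.integers(0, 2, size=(m, n)).astype(float)

y = A @ x                             # y_l = a^(l) x, one photodetector reading each
print(y.shape)                        # (512,) observations of a 1024-pixel scene
```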

Single Pixel Camera: How it works

By repeating the observation while changing the patterns of the mirrors, we obtain the following linear equation:

y = Ax.

When A and y are known, we want to compute x.

Single Pixel Camera: Simulation

How it works

We set m = 512, n = 1024, so the linear equation is under-determined. x is sparse: only 234 of the 1024 components are positive. The sparse solution is shown below.

(a) Recorded image. (b) Reconstructed image.

Information processing based on sparsity

The simulation works, but we want to know whether it works in reality. We explain it from the following viewpoints.

▶ Do the data have sparsity?

▶ Can we obtain the solution if the data have sparsity?

▶ How do we handle noise?

▶ How do we compute it?

Sparse modeling is a new field in which mathematical theory, applied mathematics, and data analysis are all involved.

Theory of sparse modeling: Sparsity of data

Sparsity

▶ In information theory, there are cases where x can be generated by design. This is not the case in data analysis.

▶ We want to know whether it is reasonable to assume x is sparse.

▶ In big data analysis, people sometimes assume the data have sparsity.

▶ In genomic data, there are many genes, but only a small number of them are related to a disease.

▶ Sound and music data also have sparsity.

Sparsity of sound data

Flute

(a) Flute sound: waveforms s1(t), s2(t), s3(t) (amplitude −1 to 1, time 0 to 2 s). (b) Spectrogram.

Sparsity of sound data

Acoustic sound

(a) Acoustic sound: waveforms s1(t), s2(t) (amplitude −1 to 1, time 0 to 2 s). (b) Spectrogram.

Sparsity of image data

Wavelet transform

(a) Original image (b) Wavelet coefficients

Sparsity of image data

Wavelet transform

Figure: Distribution of coefficients (log of normalized coefficients vs. coefficient index).

Sparsity of image data

Removing small coefficients (49.86%)

(a) Reconstructed image (b) Wavelet coefficients
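The experiment behind these reconstruction slides can be reproduced in outline with PyWavelets; in this sketch the random image, the db2 wavelet, the decomposition level, and the 10% keep fraction are all placeholder choices, not the slides' settings:

```python
import numpy as np
import pywt  # PyWavelets

img = np.random.rand(256, 256)        # stand-in for the slide's image
coeffs = pywt.wavedec2(img, 'db2', level=4)
arr, slices = pywt.coeffs_to_array(coeffs)

# Keep only the largest 10% of wavelet coefficients; zero the rest.
thresh = np.quantile(np.abs(arr), 0.90)
arr[np.abs(arr) < thresh] = 0.0

coeffs_t = pywt.array_to_coeffs(arr, slices, output_format='wavedec2')
rec = pywt.waverec2(coeffs_t, 'db2')  # reconstruction from sparse coefficients
```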

Sparsity of image data

Removing small coefficients (31.92%)

(a) Reconstructed image (b) Wavelet coefficients

Sparsity of image data

Removing small coefficients (15.45%)

(a) Reconstructed image (b) Wavelet coefficients

Sparsity of image data

Removing small coefficients (9.03%)

(a) Reconstructed image (b) Wavelet coefficients

Sparsity of image data

Removing small coefficients (4.03%)

(a) Reconstructed image (b) Wavelet coefficients

Sparsity of image data

Change the basis

We can change the basis linearly. The Fourier and wavelet transforms in the previous examples are written with an n × n unitary transform Φ as

z = Φx,

where z is the transformed representation. The problem

y = Ax

can be handled even if x is not sparse but z is sparse, because we can rewrite it as

y = Ax = AΦ^{-1}z = Bz.
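A sketch of this change of variables, with an orthonormal DCT standing in for the unitary Φ (the slides use Fourier and wavelet transforms; any unitary Φ behaves the same way):

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)
n, m = 64, 32

# Orthonormal DCT matrix as a stand-in for the unitary transform Phi.
Phi = dct(np.eye(n), axis=0, norm='ortho')     # z = Phi @ x
assert np.allclose(Phi @ Phi.T, np.eye(n))     # unitary: Phi^{-1} = Phi^T

A = rng.standard_normal((m, n))
B = A @ Phi.T                                  # B = A Phi^{-1}

x = rng.standard_normal(n)
z = Phi @ x
assert np.allclose(A @ x, B @ z)               # y = Ax = Bz
```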

Sparsity of image data

The number of changing points in the image is small (apply a Laplacian filter and show the square of the output value).

(a) image (b) Filter output

Sparsity of data

▶ It is not clear what type of sparsity the data have.

▶ In many cases, a sparse representation is obtained after a proper transformation (linear or nonlinear).

▶ It is important to find a proper representation.

Theory of sparse modeling: Uniqueness of the sparse solution

Compressed sensing

Consider the following under-determined linear equation

y = Ax

where m < n and y and A are known. If x is sparse and we compute the sparsest solution, is it unique? This depends on the characteristics of the matrix A.

Norm

The definitions of the norms used in the following analysis:

0 norm: ∥x∥ℓ0 = |{i ; x_i ≠ 0}|
1 norm: ∥x∥ℓ1 = ∑_i |x_i|
2 norm: ∥x∥ℓ2 = (∑_i x_i²)^{1/2}

Definition of a norm

▶ If ∥x∥ = 0 then x = 0.
▶ ∥ax∥ = |a|∥x∥.
▶ ∥x + y∥ ≤ ∥x∥ + ∥y∥.

To be precise, the 0 norm is not a norm (it fails the homogeneity property ∥ax∥ = |a|∥x∥), but for convenience we call it the 0 norm.
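As a quick sketch, these definitions in NumPy with an arbitrary example vector; the last line checks the homogeneity failure mentioned above:

```python
import numpy as np

x = np.array([0.0, 1.5, 0.0, -2.0])

l0 = np.count_nonzero(x)            # "0 norm": number of nonzeros -> 2
l1 = np.sum(np.abs(x))              # 1 norm -> 3.5
l2 = np.sqrt(np.sum(x ** 2))        # 2 norm -> 2.5

# The 0 norm violates homogeneity: ||2x||_0 == ||x||_0, not 2 * ||x||_0.
assert np.count_nonzero(2 * x) == l0
```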

Compressed sensing

Assume the solution x has only S non-zero components. Consider the following problem.

P0: ℓ0 optimization

min ∥x∥ℓ0 subject to y = Ax.

Compressed sensing

Consider the condition under which P0 has a unique solution.

Definition: spark

Pick k column vectors from A; spark(A) is the minimum k for which the chosen column vectors can be linearly dependent. 2 ≤ spark(A) ≤ m + 1 holds.

A sufficient condition for P0 to have a unique solution

If the following holds, x0 is the sparsest solution:

∥x0∥ℓ0 < spark(A) / 2.
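spark(A) has no efficient general algorithm; a brute-force sketch for tiny matrices (exponential in n, for illustration only), tested on the earlier 2 × 3 example:

```python
import numpy as np
from itertools import combinations

def spark(A, tol=1e-10):
    """Smallest number of columns of A that are linearly dependent."""
    m, n = A.shape
    for k in range(1, n + 1):
        for cols in combinations(range(n), k):
            if np.linalg.matrix_rank(A[:, cols], tol=tol) < k:
                return k
    return n + 1  # all columns independent (only possible when n <= m)

A = np.array([[-1.0, 2.0, -1.0],
              [ 3.0, -1.0, 2.0]])
print(spark(A))  # 3: no column is zero, no pair is parallel, any 3 in R^2 are dependent
```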

RIP (Restricted isometry property)

Another way to characterize the condition under which P0 has a unique solution is based on the RIP (restricted isometry property).

Definition: RIP

Assume x has S non-zero components. If there exists a δ that satisfies the following inequality for all x with ∥x∥ℓ0 = S, then A has RIP(S, δ):

(1 − δ)∥x∥ℓ2 ≤ ∥Ax∥ℓ2 ≤ (1 + δ)∥x∥ℓ2.
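Computing δ exactly requires checking all size-S supports, which is combinatorial. A Monte Carlo sketch that only lower-bounds it by sampling supports, written against the (unsquared) inequality above:

```python
import numpy as np

def rip_delta_lower_bound(A, S, trials=2000, seed=0):
    """Monte Carlo lower bound on delta_S: sample S-column submatrices and
    check how far their singular values deviate from 1."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    delta = 0.0
    for _ in range(trials):
        T = rng.choice(n, size=S, replace=False)
        s = np.linalg.svd(A[:, T], compute_uv=False)
        delta = max(delta, s[0] - 1.0, 1.0 - s[-1])  # worst stretch / shrink
    return delta

A = np.random.default_rng(1).standard_normal((128, 512)) / np.sqrt(128)
print(rip_delta_lower_bound(A, S=8))  # small for scaled Gaussian matrices
```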

RIP and P0

ℓ0 recovery

Let S ≥ 1. Suppose A satisfies the RIP with δ2S < 1. If y = Ax holds for an x with ∥x∥ℓ0 ≤ S, then the following problem has a unique solution:

min ∥x∥ℓ0 subject to y = Ax.

Theory of sparse modeling: Relaxed problem

Relax ℓ0 norm optimization

P0: ℓ0 optimization

min ∥x∥ℓ0 subject to y = Ax.

It is difficult to solve this problem when n is large.

▶ It is not possible to take the derivative of the ℓ0 norm.

▶ Checking all combinations of the components of x is computationally hard.

Relax ∥x∥ℓ0 to ∥x∥ℓ1 and consider the following problem.

P1: ℓ1 optimization

min ∥x∥ℓ1 subject to y = Ax.

∥x∥ℓ1 = ∑_i |x_i| is convex, so we can apply standard optimization techniques, as sketched below.
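P1 can be solved as a linear program by introducing auxiliary variables u with −u ≤ x ≤ u; a sketch with scipy.optimize.linprog, tested on the earlier 2 × 3 example:

```python
import numpy as np
from scipy.optimize import linprog

def l1_min(A, y):
    """Basis pursuit: min ||x||_1 subject to Ax = y, as an LP in (x, u)."""
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(n)])   # minimize sum(u)
    A_ub = np.block([[ np.eye(n), -np.eye(n)],      #  x - u <= 0
                     [-np.eye(n), -np.eye(n)]])     # -x - u <= 0
    A_eq = np.hstack([A, np.zeros((m, n))])         # Ax = y
    bounds = [(None, None)] * n + [(0, None)] * n   # x free, u >= 0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * n),
                  A_eq=A_eq, b_eq=y, bounds=bounds)
    return res.x[:n]

A = np.array([[-1.0, 2.0, -1.0], [3.0, -1.0, 2.0]])
y = np.array([2.0, -1.0])
print(np.round(l1_min(A, y), 6))   # (0, 1, 0): the sparsest point on the line
```

Here ℓ1 minimization recovers the same x = (0, 1, 0)^T that ℓ0 minimization found, consistent with the equivalence stated next.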

Relaxing ℓ0 recovery

Suppose x is S-sparse (S ≥ 1) and A satisfies the RIP with δ2S ≤ √2 − 1.

Then the solutions of the following two problems are the same.

min ∥x∥ℓ1 subject to y = Ax,

min ∥x∥ℓ0 subject to y = Ax.

If x is sparse and A has good properties, the ℓ0 optimization problem can be solved by ℓ1 optimization.

Candès (2008). “The restricted isometry property and its implications for compressed sensing,” Comptes Rendus Mathematique, 346(9-10), 589-592.

Related research

▶ The ℓ1 recovery problem is theoretically guaranteed to find the optimal solution if x is sparse.

▶ These results depend on the characteristics of A.

▶ There is a gap between ℓ1 recovery and ℓ0 recovery.

Theory of sparse modeling: Noisy observation

Regression analysis and LASSO

The observation process is modeled as

y = Ax.

But in reality, we have noise:

y = Ax + e.

This is a regression problem. LASSO was proposed in 1996 as a method that uses sparsity for regression.

LASSO

y = Ax+ e

min_x ∥y − Ax∥²ℓ2 subject to ∥x∥ℓ1 ≤ s.

The number of non-zero components of x changes depending on s. As s increases, the number of non-zero components increases up to n; as s decreases, it decreases to 1.

Equivalent problem with a Lagrange multiplier

min_x [ ∥y − Ax∥²ℓ2 + λ∥x∥ℓ1 ]

For each λ ≥ 0 there exists an s ≥ 0 for which the two problems become equivalent.
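A sketch using scikit-learn's Lasso on synthetic data; note that sklearn's objective divides the squared error by 2m, so its alpha corresponds to λ/(2m) in the notation above:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
m, n = 50, 200
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[:5] = [3.0, -2.0, 4.0, 1.5, -1.0]       # 5-sparse ground truth (assumed)
y = A @ x_true + 0.1 * rng.standard_normal(m)  # noisy observation y = Ax + e

# sklearn minimizes ||y - Ax||^2 / (2m) + alpha * ||x||_1.
model = Lasso(alpha=0.1, fit_intercept=False)
model.fit(A, y)
print(np.count_nonzero(model.coef_))           # only a few nonzeros survive
print(np.flatnonzero(model.coef_)[:10])        # mostly the true support {0,...,4}
```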

LASSO

Likelihood function:

p(y|x; A) = (2πσ²)^{−m/2} exp( −∥y − Ax∥²ℓ2 / (2σ²) )

Prior distribution of x (a Laplace prior):

p(x; µ) = (µ/2)^n exp( −µ∥x∥ℓ1 )

LASSO

Posterior distribution of x:

p(x|y; A, µ) ∝ exp( −∥y − Ax∥²ℓ2 / (2σ²) − µ∥x∥ℓ1 )

MAP estimate:

x̂ = argmax_x [ −∥y − Ax∥²ℓ2 / (2σ²) − µ∥x∥ℓ1 ]
  = argmin_x [ ∥y − Ax∥²ℓ2 / (2σ²) + µ∥x∥ℓ1 ]
  = argmin_x [ ∥y − Ax∥²ℓ2 + λ∥x∥ℓ1 ], with λ = 2σ²µ.

LASSO is a MAP estimate.

References

Compressed sensing

Donoho (2006). “Compressed sensing,” IEEE Trans. Information Theory, 52(4), 1289-1306.

LASSO

Tibshirani (1996). “Regression shrinkage and selection via the Lasso,” J. R. Statist. Soc. B, 58(1), 267-288.

Osborne, Presnell, & Turlach (2000). “On the Lasso and its dual,” J. Comp. and Graph. Stat., 9, 319-337.

A simulated Single Pixel Camera

Under-determined linear equation.

Sparsity of image data

Image is approximately sparse

(a) Original image (b) Wavelet coefficients

Sparsity of image data

This image is not sparse in the pixel domain, but it is sparse in the wavelet basis. Even if x is not sparse, if z is sparse we can change the representation as follows:

Ax = AΦ^{-1}z = Bz.

Also, the image is only approximately sparse, not strictly sparse. Thus we treat the problem as a noisy regression:

y = Bz + e.

We use the LASSO.

Sparsity of image data

min_z ∥y − Bz∥²ℓ2 + λ∥z∥ℓ1

Solve this as a LASSO problem. After solving it, z is used to recover the image as x = Φ^{-1}z. We varied λ and solved the problem, as sketched below.
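A minimal sketch of one standard solver for this objective, ISTA (iterative soft-thresholding); this choice is an assumption for illustration, not necessarily the solver used for these slides:

```python
import numpy as np

def ista(B, y, lam, iters=500):
    """Iterative soft-thresholding for min_z ||y - Bz||_2^2 + lam * ||z||_1."""
    L = np.linalg.norm(B, 2) ** 2               # squared spectral norm of B
    z = np.zeros(B.shape[1])
    for _ in range(iters):
        g = z - B.T @ (B @ z - y) / L           # gradient step on the quadratic term
        z = np.sign(g) * np.maximum(np.abs(g) - lam / (2 * L), 0.0)  # soft threshold
    return z
```

Larger λ drives more coefficients of z to zero, which is the behavior seen in the reconstructions that follow.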

Sparsity of image data

Recovery with LASSO (λ = 10000)

(a) Recovered image (b) Wavelet coefficients

Sparsity of image data

Recovery with LASSO (λ = 100)

(a) Recovered image (b) Wavelet coefficients

Sparsity of image data

Recovery with LASSO (λ = 1)

(a) Recovered image (b) Wavelet coefficients

Sparsity of image data

Image is approximately sparse

(c) Recovered image (d) Wavelet coefficients

Sparsity of image data

Recovery with LASSO (λ = 10000)

(a) Recovered image (b) Wavelet coefficients

Sparsity of image data

Recovery with LASSO (λ = 100)

(a) Recovered image (b) Wavelet coefficients

Sparsity of image data

Recovery with LASSO (λ = 1)

(a) Recovered image (b) Wavelet coefficients

Sparsity of image data

Image is approximately sparse

(a) Recovered image (b) Wavelet coefficients

Sparsity of image data

Recovery with LASSO (λ = 10000)

(a) Recovered image (b) Wavelet coefficients

Sparsity of image data

Recovery with LASSO (λ = 100)

(a) Recovered image (b) Wavelet coefficients

Sparsity of image data

Recovery with LASSO (λ = 1)

(a) Recovered image (b) Wavelet coefficients

Summary

Information processing with sparse modeling

▶ A lot of data have sparsity.

▶ If the solution is sparse enough, the optimal solution is unique.

▶ Even if the data is noisy, we can still apply sparsity based methods.

▶ There are many interesting topics in theory and applications.
