15
arXiv:0907.3574v1 [cs.IT] 21 Jul 2009 Message Passing Algorithms for Compressed Sensing David L. Donoho Department of Statististics Stanford University [email protected] Arian Maleki Department of Electrical Engineering Stanford University [email protected] Andrea Montanari Department Electrical Engineering and Department of Statistics Stanford University [email protected] Abstract—Compressed sensing aims to undersample certain high- dimensional signals, yet accurately reconstruct them by exploiting signal characteristics. Accurate reconstruction is possible when the object to be recovered is sufficiently sparse in a known basis. Currently, the best known sparsity-undersampling tradeoff is achieved when reconstructing by convex optimization – which is expensive in important large-scale applications. Fast iterative thresholding algorithms have been intensively studied as alternatives to convex optimization for large-scale problems. Un- fortunately known fast algorithms offer substantially worse sparsity- undersampling tradeoffs than convex optimization. We introduce a simple costless modification to iterative thresholding making the sparsity-undersampling tradeoff of the new algorithms equiv- alent to that of the corresponding convex optimization procedures. The new iterative-thresholding algorithms are inspired by belief propagation in graphical models. Our empirical measurements of the sparsity-undersampling tradeoff for the new algorithms agree with theoretical calculations. We show that a state evolution formalism correctly derives the true sparsity- undersampling tradeoff. There is a surprising agreement between earlier calculations based on random convex polytopes and this new, apparently very different theoretical formalism. I. I NTRODUCTION AND OVERVIEW Compressed sensing refers to a growing body of techniques that ‘undersample’ high-dimensional signals and yet recover them accu- rately [1], [2]. Such techniques make fewer measurements than tra- ditional sampling theory demands: rather than sampling proportional to frequency bandwidth, they make only as many measurements as the underlying ‘information content’ of those signals. However, as compared with traditional sampling theory, which can recover signals by applying simple linear reconstruction formulas, the task of signal recovery from reduced measurements requires nonlinear, and so far, relatively expensive reconstruction schemes. One popular class of reconstruction schemes uses linear programming (LP) methods; there is an elegant theory for such schemes promising large improvements over ordinary sampling rules in recovering sparse signals. However, solving the required LPs is substantially more expensive in applica- tions than the linear reconstruction schemes that are now standard. In certain imaging problems, the signal to be acquired may be an image with 10 6 pixels and the required LP would involve tens of thousands of constraints and millions of variables. Despite advances in the speed of LP, such problems are still dramatically more expensive to solve than we would like. This paper develops an iterative algorithm achieving reconstruction performance in one important sense identical to LP-based reconstruc- tion while running dramatically faster. We assume that a vector y of n measurements is obtained from an unknown N -vector x0 according to y = Ax0, where A is the n × N measurement matrix n<N . 
Starting from an initial guess x 0 =0, the first order approximate message passing (AMP) algorithm proceeds iteratively according to: x t+1 = ηt (A z t + x t ) , (1) z t = y Ax t + 1 δ z t1 η t (A z t1 + x t1 ). (2) Here ηt ( · ) are scalar threshold functions (applied componentwise), x t R N is the current estimate of x0, and z t R n is the current residual. A denotes transpose of A. For a vector u = (u(1),...,u(N )), u〉≡ P N i=1 u(i)/N . Finally η t ( s )= ∂s ηt ( s ). Iterative thresholding algorithms of other types have been popular among researchers for some years, the focus being on schemes of the form x t+1 = ηt (A z t + x t ) , (3) z t = y Ax t . (4) Such schemes can have very low per-iteration cost and low storage requirements; they can attack very large scale applications, - much larger than standard LP solvers can attack. However, [3]-[4] fall short of the sparsity-undersampling tradeoff offered by LP reconstruction [3]. Iterative thresholding schemes based on [3], [4] lack the crucial term in [2] – namely, 1 δ z t1 η t (A z t1 + x t1 )is not included. We derive this term from the theory of belief propagation in graph- ical models, and show that it substantially improves the sparsity- undersampling tradeoff. Extensive numerical and Monte Carlo work reported here shows that AMP, defined by eqns [1], [2] achieves a sparsity-undersampling tradeoff matching the theoretical tradeoff which has been proved for LP-based reconstruction. We consider a parameter space with axes quantifying sparsity and undersampling. In the limit of large dimensions N,n, the parameter space splits in two phases: one where the MP approach is successful in accurately reconstructing x0 and one where it is unsuccessful. References [4], [5], [6] derived regions of success and failure for LP-based recovery. We find these two os- tensibly different partitions of the sparsity-undersampling parameter space to be identical. Both reconstruction approaches succeed or fail over the same regions, see Figure 1. Our finding has extensive empirical evidence and strong theoretical support. We introduce a state evolution formalism and find that it accurately predicts the dynamical behavior of numerous observables of the AMP algorithm. In this formalism, the mean squared error of reconstruction is a state variable; its change from iteration to iteration is modeled by a simple scalar function, the MSE map. When this map has nonzero fixed points, the formalism predicts that AMP will not successfully recover the desired solution. The MSE map depends on the underlying sparsity and undersampling ratios, and can develop nonzero fixed points over a region of sparsity/undersampling

Message Passing Algorithms for Compressed SensingarXiv:0907.3574v1 [cs.IT] 21 Jul 2009 Message Passing Algorithms for Compressed Sensing David L. Donoho Department of Statististics

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Message Passing Algorithms for Compressed SensingarXiv:0907.3574v1 [cs.IT] 21 Jul 2009 Message Passing Algorithms for Compressed Sensing David L. Donoho Department of Statististics

arX

iv:0

907.

3574

v1 [

cs.IT

] 21

Jul

200

9

Message Passing Algorithmsfor Compressed Sensing

David L. DonohoDepartment of Statististics

Stanford [email protected]

Arian MalekiDepartment of Electrical Engineering

Stanford [email protected]

Andrea MontanariDepartment Electrical Engineering

and Department of StatisticsStanford University

[email protected]

Abstract—Compressed sensing aims to undersample certain high-dimensional signals, yet accurately reconstruct them by exploiting signalcharacteristics. Accurate reconstruction is possible when the object tobe recovered is sufficiently sparse in a known basis. Currently, the bestknown sparsity-undersampling tradeoff is achieved when reconstructingby convex optimization – which is expensive in important large-scaleapplications.

Fast iterative thresholding algorithms have been intensively studiedas alternatives to convex optimization for large-scale problems. Un-fortunately known fast algorithms offer substantially worse sparsity-undersampling tradeoffs than convex optimization.

We introduce a simple costless modification to iterative thresholdingmaking the sparsity-undersampling tradeoff of the new algorithms equiv-alent to that of the corresponding convex optimization procedures. Thenew iterative-thresholding algorithms are inspired by belief propagationin graphical models.

Our empirical measurements of the sparsity-undersamplingtradeofffor the new algorithms agree with theoretical calculations. We showthat a state evolution formalism correctly derives the true sparsity-undersampling tradeoff. There is a surprising agreement between earliercalculations based on random convex polytopes and this new,apparentlyvery different theoretical formalism.

I. I NTRODUCTION AND OVERVIEW

Compressed sensing refers to a growing body of techniques that‘undersample’ high-dimensional signals and yet recover them accu-rately [1], [2]. Such techniques make fewer measurements than tra-ditional sampling theory demands: rather than sampling proportionalto frequency bandwidth, they make only as many measurementsasthe underlying ‘information content’ of those signals. However, ascompared with traditional sampling theory, which can recover signalsby applying simple linear reconstruction formulas, the task of signalrecovery from reduced measurements requires nonlinear, and so far,relatively expensive reconstruction schemes. One popularclass ofreconstruction schemes uses linear programming (LP) methods; thereis an elegant theory for such schemes promising large improvementsover ordinary sampling rules in recovering sparse signals.However,solving the required LPs is substantially more expensive inapplica-tions than the linear reconstruction schemes that are now standard. Incertain imaging problems, the signal to be acquired may be animagewith 106 pixels and the required LP would involve tens of thousandsof constraints and millions of variables. Despite advancesin the speedof LP, such problems are still dramatically more expensive to solvethan we would like.

This paper develops an iterative algorithm achieving reconstructionperformance in one important senseidentical toLP-based reconstruc-tion while running dramatically faster. We assume that a vector y of nmeasurements is obtained from an unknownN -vectorx0 accordingto y = Ax0, whereA is the n × N measurement matrixn < N .Starting from an initial guessx0 = 0, the first order approximate

message passing(AMP) algorithm proceeds iteratively according to:

xt+1 = ηt(A∗zt + xt) , (1)

zt = y − Axt +1

δzt−1〈η′

t(A∗zt−1 + xt−1)〉 . (2)

Hereηt( · ) are scalarthresholdfunctions (applied componentwise),xt ∈ R

N is the current estimate ofx0, and zt ∈ Rn is the

current residual.A∗ denotes transpose ofA. For a vectoru =(u(1), . . . , u(N)), 〈u〉 ≡ PN

i=1 u(i)/N . Finally η′t( s ) = ∂

∂sηt( s ).

Iterative thresholding algorithms of other types have beenpopularamong researchers for some years, the focus being on schemesofthe form

xt+1 = ηt(A∗zt + xt) , (3)

zt = y − Axt. (4)

Such schemes can have very low per-iteration cost and low storagerequirements; they can attack very large scale applications, - muchlarger than standard LP solvers can attack. However, [3]-[4] fall shortof the sparsity-undersampling tradeoff offered by LP reconstruction[3].

Iterative thresholding schemes based on [3], [4] lack the crucialterm in [2] – namely,1

δzt−1〈η′

t(A∗zt−1 + xt−1)〉 is not included.

We derive this term from the theory of belief propagation in graph-ical models, and show that it substantially improves the sparsity-undersampling tradeoff.

Extensive numerical and Monte Carlo work reported here showsthat AMP, defined by eqns [1], [2] achieves a sparsity-undersamplingtradeoff matching the theoretical tradeoff which has been provedfor LP-based reconstruction. We consider a parameter spacewithaxes quantifying sparsity and undersampling. In the limit of largedimensionsN, n, the parameter space splits in twophases: one wherethe MP approach is successful in accurately reconstructingx0 andone where it is unsuccessful. References [4], [5], [6] derived regionsof success and failure for LP-based recovery. We find these two os-tensibly different partitions of the sparsity-undersampling parameterspace to beidentical. Both reconstruction approaches succeed or failover the same regions, see Figure 1.

Our finding has extensive empirical evidence and strong theoreticalsupport. We introduce astate evolutionformalism and find that itaccurately predicts the dynamical behavior of numerous observablesof the AMP algorithm. In this formalism, the mean squared errorof reconstruction is a state variable; its change from iteration toiteration is modeled by a simple scalar function, theMSE map. Whenthis map has nonzero fixed points, the formalism predicts that AMPwill not successfully recover the desired solution. The MSEmapdepends on the underlying sparsity and undersampling ratios, and candevelop nonzero fixed points over a region of sparsity/undersampling

Page 2: Message Passing Algorithms for Compressed SensingarXiv:0907.3574v1 [cs.IT] 21 Jul 2009 Message Passing Algorithms for Compressed Sensing David L. Donoho Department of Statististics

space. The region is evaluated analytically and found to coincide veryprecisely (ie. within numerical precision) with the regionover whichLP-based methods are proved to fail. Extensive Monte Carlo testingof AMP reconstruction finds the region where AMP fails is, to withinstatistical precision, the same region.

In short we introduce a fast iterative algorithm which is found toperform as well as corresponding linear programming based methodson random problems. Our findings are supported from simulationsand from a theoretical formalism.

Remarkably, the success/failure phases of LP reconstruction werepreviously found by methods in combinatorial geometry; we givehere what amounts to a very simple formula for the phase boundary,derived using a very different and seemingly elegant theoreticalprinciple.

A. Underdetermined Linear Systems

Let x0 ∈ RN be the signal of interest. We are interested in

reconstructing it from the vector of measurementsy = Ax0, withy ∈ R

n, for n < N . For the moment, we assume the entriesAij ofthe measurement matrix are independent and identically distributednormalN(0, 1/n).

We consider three canonical models for the signalx0 and threenonlinear reconstruction procedures based on linear programming.

+: x0 is nonnegative, with at mostk entries different from0.Reconstruct by solving the LP: minimize

PNi=1 xi subject tox ≥ 0,

andAx = y.

±: x0 has as many ask nonzero entries. Reconstruct by solving theminimumℓ1 norm problem: minimize||x||1, subject toAx = y. Thiscan be cast as an LP.

�: x0 ∈ [−1, 1]N , with at mostk entries in the interior(−1, 1).Reconstruction by solving the LP feasibility problem: find any vectorx ∈ [−1, +1]N with Ax = y.

Despite the fact that the systems are underdetermined, under certainconditions onk, n, N these procedures perfectly recoverx0. Thistakes place subject to asparsity-undersampling tradeoffnamely anupper bound on the signal complexityk relative ton andN .

B. Phase Transitions

The sparsity-undersampling tradeoff can most easily be describedby taking a large-system limit. In that limit, we fix parameters (δ, ρ)in (0, 1)2 and let k, n, N → ∞ with k/n → ρ and n/N → δ.The sparsity-undersampling behavior we study is controlled by (δ, ρ),with δ the undersampling fraction andρ a measure of sparsity (withlargerρ corresponding to more complex signals).

The domain(δ, ρ) ∈ (0, 1)2 has two phases, a ‘success’ phase,where exact reconstruction typically occurs, and a ‘failure’ phasewere exact reconstruction typically fails. More formally,for eachchoice ofχ ∈ {+,±, �} there is a functionρCG(·; χ) whose graphpartitions the domain into two regions. In the ‘upper’ region, whereρ > ρCG(δ;χ), the corresponding LP reconstructionx1(χ) fails torecoverx0, in the following sense: ask, n, N → ∞ in the largesystem limit withk/n → ρ andn/N → δ, the probability of exactreconstruction{x1(χ) = x0} tends to zero exponentially fast. In the‘lower’ region, whereρ < ρCG(δ; χ), LP reconstruction succeeds torecoverx0, in the following sense: ask, n, N → ∞ in the largesystem limit withk/n → ρ andn/N → δ, the probability of exactreconstruction{x1(χ) = x0} tends to one exponentially fast. Werefer to [4], [5], [7], [6] for proofs and precise definitionsof thecurvesρCG(·; χ).

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

δ

ρ

Fig. 1. The phase transition lines for reconstructing sparse non-negativevectors (problem+, red), sparse signed vectors (problem±, blue) and vectorswith entries in[−1, 1] (problem�, green). Continuous lines refer to analyticalpredictions from combinatorial geometry or the state evolution formalisms.Dashed lines present data from experiments with the AMP algorithm, withsignal lengthN = 1000 and T = 1000 iterations. For each value ofδ, weconsidered a grid ofρ values, at each value, generating50 random problems.The dashed line presents the estimated 50th percentile of the response curve.At that percentile, the root mean square error afterT iterations obeysσT ≤10−3 in half of the simulated reconstructions.

The three functionsρCG( · ; +), ρCG( · ;±), ρCG( · ; �) are shownin Figure 1; they are the red, blue, and green curves, respectively.The orderingρCG(δ; +) > ρCG(δ;±) (red > blue) says that knowingthat a signal is sparse and positive is more valuable than onlyknowing it is sparse. Both the red and blue curves behave asρCG(δ; +,±) ∼ (2 log(1/δ))−1 asδ → 0; surprisingly large amountsof undersampling are possible, if sufficient sparsity is present. Incontrast,ρCG(δ; �) = 0 (green curve) forδ < 1/2 so the bounds[−1, 1] are really of no help unless we use a limited amount ofundersampling, i.e. by less than a factor of two.

Explicit expressions forρCG(δ; +,±) are given in [4], [5]; theyare quite involved and use methods from combinatorial geometry.By Finding 1 below, they agree to within numerical precisionto thefollowing formula:

ρSE(δ; χ) = maxz≥0

(1 − (κχ/δ)

ˆ(1 + z2)Φ(−z) − zφ(z)

˜

1 + z2 − κχ

ˆ(1 + z2)Φ(−z) − zφ(z)

˜)

, (5)

where κχ = 1, 2 respectively forχ = +, ±. This formula, aprincipal result of this paper, uses methods unrelated to combinatorialgeometry.

C. Iterative Approaches

Mathematical results for the large-system limit correspond wellto application needs. Realistic modern problems in spectroscopyand medical imaging demand reconstructions of objects withtensof thousands or even millions of unknowns. Extensive testing ofpractical convex optimizers in these problems [8] has shownthat thelarge system asymptotic accurately describes the observedbehaviorof computed solutions to the above LPs. But the same testing showsthat existing convex optimization algorithms run slowly onthese largeproblems, taking minutes or even hours on the largest problems ofinterest.

Many researchers have abandoned formal convex optimization,turning to fast iterative methods instead [9], [10], [11].

Page 3: Message Passing Algorithms for Compressed SensingarXiv:0907.3574v1 [cs.IT] 21 Jul 2009 Message Passing Algorithms for Compressed Sensing David L. Donoho Department of Statististics

The iteration [1]-[2] is very attractive because it does notrequirethe solution of a system of linear equations, and because it doesnot require explicit operations on the matrixA; it only requiresthat one apply the operatorsA and A∗ to any given vector. In anumber of applications - for example Magnetic Resonance Imaging- the operatorsA which make practical sense are not really Gaussianrandom matrices, but rather random sections of the Fourier transformand other physically-inspired transforms [2], [12]. Such operators canbe applied very rapidly using FFTs, rendering the above iterationextremely fast. Provided the process stops after a limited number ofiterations, the computations are very practical.

The thresholding functions{ηt( · )}t≥0 in these schemes dependon both iteration and problem setting. In this paper we considerηt( · ) = η(·; λσt, χ), whereλ is a threshold control parameter,χ ∈{+,±, �} denotes the setting, andσ2

t = AvejE{(xt(j)− x0(j))2}

is the mean square error of the current current estimatext (in practicean empirical estimate of this quantity is used).

For instance, in the case of sparse signed vectors (i.e. problemsetting±), we apply soft thresholdingηt(u) = η(u; λσ,±), where

η(u; λσ,±) =

8<:

(u − λσ) if u ≥ λσ,(u + λσ) if u ≤ −λσ,0 otherwise,

(6)

where we dropped the argument± to lighten notation. Notice thatηt depends on the iteration numbert only through the mean squareerror (MSE)σ2

t .

D. Heuristics for Iterative Approaches

Why should the iterative approach work, i.e. why should itconverge to the correct answerx0? The case± has been mostdiscussed and we focus on that case for this section. Imaginefirstof all that A is an orthogonal matrix, in particularA∗ = A−1.Then the iteration [1]-[2] stops in 1 step, correctly findingx0. Next,imagine thatA is an invertible matrix; [13], has shown that a relatedthresholding algorithm with clever scaling ofA∗ and clever choice ofthreshold, will correctly findx0. Of course both of these motivationalobservations assumen = N , so we are not reallyundersampling.

We sketch a motivational argument for thresholding in the trulyundersampled casen < N which is statistical, which has beenpopular with engineers [12] and which leads to a proper ‘psychology’for understanding our results. Consider the operatorH = A∗A − I ,and note thatA∗y = x0 + Hx0. If A were orthogonal, we wouldof course haveH = 0, and the iteration would, as we have seenimmediately succeed in one step. IfA is a Gaussian random matrixand n < N , then of courseA is not invertible andA∗ is not A−1.Instead ofHx0 = 0, in the undersampled caseHx0 behaves as akind of noisy random vector, i.e.A∗y = x0 + noise. Now x0 issupposed to be a sparse vector, and, one can see, thenoise termis accurately modeled as a vector with i.i.d. Gaussian entries withvariancen−1‖x0‖2

2.In short, the first iteration gives us a ‘noisy’ version of thesparse

vector we are seeking to recover. The problem of recovering asparsevector from noisy measurements has been heavily discussed [14] andit is well understood that soft thresholding can produce a reductionin mean-squared error when sufficient sparsity is present and thethreshold is chosen appropriately. Consequently, one anticipates thatx1 will be closer tox0 thanA∗y.

At the second iteration, one hasA∗(y − Ax1) = x0 + H(x0 −x1). Naively, the matrixH does not correlate withx0 or x1, andso we might pretend thatH(x0 − x1) is again a Gaussian vectorwhose entries have variancen−1||x0 − x1||22. This ‘noise level’ is

0 0.5−0.2

−0.15

−0.1

−0.05

0

0.05

x

Ψ(x

)−x

(a)

0 0.5−0.1

−0.08

−0.06

−0.04

−0.02

0

0.02

x

Ψ(x

)−x

(b)

0 0.5−0.2

−0.15

−0.1

−0.05

0

0.05

x

Ψ(x

)−x

(c)

0 0.5−0.1

−0.08

−0.06

−0.04

−0.02

0

0.02

x

Ψ(x

)−x

(d)

0 0.5

−0.03

−0.02

−0.01

0

0.01

x

Ψ(x

)−x

(e)

0 0.5−0.1

−0.08

−0.06

−0.04

−0.02

0

0.02

x

Ψ(x

)−x

(f)

Fig. 2. Development of fixed points for formal MSE evolution.Here we plotΨ(σ2)− σ2 whereΨ( · ) is the MSE map forχ = + (left column),χ = ±(center column) andχ = � (right column),δ = 0.1 (upper row,χ ∈ {+,±}),δ = 0.55 (upper row,χ = �), δ = 0.4 (lower row,χ ∈ {+,±}) andδ = 0.75 (lower row,χ = �). A crossing of the y-axis corresponds to afixed point of Ψ. If the graphed quantity is negative for positiveσ2 , Ψ hasno fixed points forσ > 0. Different curves correspond to different values ofρ: whereρ is respectively less than, equal to and greater thanρSE. In eachcase,Ψ has a stable fixed fixed point at zero forρ < ρSE, and no otherfixed points, an unstable fixed point at zero forρ = ρSE and devlops twofixed points atρ > ρSE. Blue curves correspond toρ = ρSE(δ; χ), green toρ = 1.05 · ρSE(δ; χ), red toρ = 0.95 · ρSE(δ; χ).

smaller than at iteration zero, and so thresholding of this noise canbe anticipated to produce an even more accurate result at iterationtwo; and so on.

There is a valuable digital communications interpretationof thisprocess. The vectorw = Hx0 is the cross-channel interferenceor mutual access interference(MAI), i.e. the noiselike disturbanceeach coordinate ofA∗y experiences from thepresenceof all theother ‘weakly interacting’ coordinates. The thresholdingiterationsuppresses this interference in the sparse case by detecting the many‘silent’ channels and setting them a priori to zero, producing aputatively better guess at the next iteration. At that iteration, theremaining interference is proportional not to the size of the estimand,but instead to the estimation error, i.e. it is caused by theerrors inreconstructing all the weakly interacting coordinates; these errors areonly a fraction of the sizes of the estimands and so the error issignificantly reduced at the next iteration.

E. State Evolution

The above ‘sparse denoising’/‘interference suppression’heuristic,does agree qualitatively with the actual behavior one can observein sample reconstructions. It is very tempting to take it literally.Assuming it is literally true that the MAI is Gaussian and independentfrom iteration to iteration, we can can formally track the evolution,from iteration to iteration, of the mean-squared error.

This gives a recursive equation for theformal MSE, i.e. the MSEwhich would be true if the heuristic were true. This takes theform

σ2t+1 = Ψ(σ2

t ) , (7)

Ψ(σ2) ≡ E

nˆη

`X +

σ√δZ; λσ

´− X

˜2o

. (8)

Here expectation is with respect to independent random variablesZ ∼ N(0, 1) andX, whose distribution coincides with the empirical

Page 4: Message Passing Algorithms for Compressed SensingarXiv:0907.3574v1 [cs.IT] 21 Jul 2009 Message Passing Algorithms for Compressed Sensing David L. Donoho Department of Statististics

distribution of the entries ofx0. We use soft thresholding (6) if thesignal is sparse and signed, i.e. ifχ = ±. In the case of sparse non-negative vectors,χ = +, we will let η(u; λσ, +) = max(u−λσ, 0).Finally, for χ = �, we let η(u; �) = sign(u) min(|u|, 1). Calcula-tions of this sort are familiar from the theory of soft thresholding ofsparse signals; see the Supplement for details.

We call Ψ : σ2 7→ Ψ(σ2) the MSE map.

Definition I.1. Given implicit parameters(χ, δ, ρ, λ, F ), with F =FX the distribution of the random variableX. State Evolutionis therecursive map (one-dimensional dynamical system):σ2

t 7→ Ψ(σ2t ).

Implicit parameters(χ, δ, ρ, λ, F ) stay fixed during the evolution.Equivalently, the full state evolves by the rule

(σ2t ; χ, δ, ρ, λ, FX) 7→ (Ψ(σ2

t ); χ, δ, ρ, λ, FX) .

Parameter space is partitioned into two regions:

Region (I): Ψ(σ2) < σ2 for all σ2 ∈ (0, EX2]. Here σ2t → 0 as

t → ∞: the SE converges to zero.

Region (II): The complement of Region (I). Here, the SE recursiondoesnot evolve toσ2 = 0.

The partitioning of parameter space induces a notion of sparsitythreshold, the minimal sparsity guarantee needed to obtainconver-gence of the formal MSE:

ρSE(δ; χ, λ, FX) ≡ sup {ρ : (δ, ρ, λ, FX) ∈ Region (I)} . (9)

The subscriptSE stands for State Evolution. Of course,ρSE depends onthe caseχ ∈ {+,±, �}; it also seems to depend also on the signaldistributionFX ; however, an essential simplification is provided by

Proposition I.2. For the three canonical problemsχ ∈ {+,±, �},any δ ∈ [0, 1], and any random variableX with the prescribed spar-sity and bounded second moment,ρSE(δ; χ, λ, FX) is independent ofFX .

Independence fromF allows us to writeρSE(δ; χ, λ) for thesparsity thresholds. The proof of this statement is sketched below,along with the derivation of a more explicit expression. Adopt thenotation

ρSE(δ; χ) = supλ≥0

ρSE(δ; χ, λ). (10)

High precision numerical evaluations of such expression uncovers thefollowing very suggestive

Finding 1. For the three canonical problemsχ ∈ {+,±, �}, andfor any δ ∈ (0, 1)

ρSE(δ;χ) = ρCG(δ; χ) . (11)

In short, the formal MSE evolves to zero exactly over the sameregion of (δ, ρ) phase space as does the phase diagram for thecorresponding convex optimization!

F. Failure of standard iterative algorithms

If we trusted that formal MSE truly describes the evolution of theiterative thresholding algorithm, Finding 1 would imply that iterativethresholding allows to undersample just as aggressively insolvingunderdetermined linear systems as the corresponding LP.

Finding 1 gives new reason to hope for a possibility that has alreadyinspired many researchers over the last five years: the possibility offinding a very fast algorithm that replicates the behavior ofconvexoptimization in settings+,±, �.

Unhappily the formal MSE calculation does not describe thebehavior of iterative thresholding:

1. State Evolution does not predict the observed properties ofiterativethresholding algorithms.

2. Iterative thresholding algorithms, even when optimally tuned, donot achieve the optimal phase diagram.

In [3], two of the authors carried out an extensive empiricalstudyof iterative thresholding algorithms. Even optimizing over the freeparameterλ and the nonlinearityη the phase transition was observedat significantly smaller values ofρ than those observed for LP-basedalgorithms.

Numerical simulations also show very clearly that the MSE mapdoes notdescribe the evolution of the actual MSE under iterativethresholding. The mathematical reason for this failure is quite simple.After the first iteration, the entries ofxt become strongly dependent,and State Evolution does not predict the moments ofxt.

G. Message Passing Algorithm

The main surprise of this paper is that this failure is not theend ofthe story. We now consider a modification of iterative thresholdinginspired by message passing algorithms for inference in graphicalmodels [16], and graph-based error correcting codes [17], [18].These are iterative algorithms, whose basic variables (‘messages’) areassociated to directed edges in a graph that encodes the structure ofthe statistical model. The relevant graph here is a completebipartitegraph overN nodes on one side (‘variable nodes’), andn on theothers (‘measurement nodes’). Messages are updated according tothe rules

xt+1i→a = ηt

“ X

b∈[n]\a

Abiztb→i

”, (12)

zta→i = ya −

X

j∈[p]\i

Aajxtj→a , (13)

for each(i, a) ∈ [N ] × [n]. We will refer to this algorithm1 as toMP.

MP has one important drawback with respect to iterative thresh-olding. Instead of updatingN estimates, at each iterations we needto updateNn messages, thus increasing significantly the algorithmcomplexity. On the other hand, it is easy to see that the right-handside of eqn [12] depends weakly on the indexa (only one outof n terms is excluded) and that the right-hand side of eqn [12]depends weakly oni. Neglecting altogether this dependence leads tothe iterative thresholding equations [3], [4]. A more careful analysisof this dependence leads to corrections of order one in the high-dimensional limit. Such corrections are however fully captured bythe last term on the right hand side of eqn [2], thus leading totheAMP algorithm. Statistical physicists would call this the ‘Onsagerreaction term’; see [24].

H. State Evolution is Correct for MP

Although AMP seems very similar to simple iterative thresholding[3]-[4], SE accurately describes its properties, but not those of thestandard iteration. As a consequence of Finding 1, properlytunedversions of MP-based algorithms are asymptotically as powerful asLP reconstruction.

1For earlier applications of MP to compressed sensing see [19], [20], [21].Relations between MP and LP were explored in a number of papers, see forinstance [22], [23], albeit from a different perspective.

Page 5: Message Passing Algorithms for Compressed SensingarXiv:0907.3574v1 [cs.IT] 21 Jul 2009 Message Passing Algorithms for Compressed Sensing David L. Donoho Department of Statististics

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.90

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

δ

ρComparison of Different Algorithms

ISTL

1AMP

Fig. 3. Observed phase transitions of reconstruction algorithms. Algorithmsstudied include iterative soft and hard thresholding, orthogonal matchingpursuit, and related. Parameters of each algorithm are tuned to achieve thebest possible phase transition [3]. Reconstructions signal length N = 1000.Iterative thresholding algorithms usedT = 1000 iterations. Phase transitioncurve displays the value ofρ = k/n at which success rate is 50%.

We have conducted extensive simulation experiments with AMP,and more limited experiments with MP, which is computationallymore intensive (for details see the complementary material). Theseexperiments show that the performance of the algorithms canbeaccurately modeled using the MSE map. Let’s be more specific.

According to SE, performance of the AMP algorithm is predictedby tracking the evolution of the formal MSEσ2

t via the recursion[7]. Although this formalism is quite simple, it is accuratein the highdimensional limit. Corresponding to the formal quantitiescalculatedby SE are the actual quantities, so of course to the formal MSEcorresponds the true MSEN−1‖xt − x0‖2

2. Other quantities can becomputed in terms of the stateσ2

t as well: for instance the true falsealarm rate(N − k)−1#{i : xt(i) 6= 0 and x0(i) = 0} is predictedvia the formal false alarm rateP{ηt(X + δ−1/2σtZ) 6= 0|X =0}. Analogously, the true missed-detection ratek−1#{i : xt(i) =0 and x0(i) 6= 0} is predicted by the formal missed-detection rateP{ηt(X + δ−1/2σtZ) = 0|X 6= 0}, and so on.

Our experiments establish agreement of actual and formal quanti-ties.

Finding 2. For the AMP algorithm, and large dimensionsN, n, weobserve

I. SE correctly predicts the evolution of numerous statistical prop-erties ofxt with the iteration numbert. The MSE, the number ofnonzeros inxt, the number of false alarms, the number of misseddetections, and several other measures all evolve in way that matchesthe state evolution formalism to within experimental accuracy.

II. SE correctly predicts the success/failure to converge to thecorrect result. In particular, SE predicts no convergence when ρ >ρSE(δ; χ, λ), and convergence ifρ < ρSE(δ; χ, λ). This is indeedobserved empirically.

Analogous observations were made for MP.

I. Optimizing the MP Phase Transition

An inappropriately tuned version of MP/AMP will not performwell compared to other algorithms, for example LP-based recon-

structions. However, SE provides a natural strategy to tuneMP andAMP (i.e. to choose the free parameterλ): simply use the valueachieving the maximum in eqn [10]. We denote this value byλχ(δ),χ ∈ {+,±, �}, and refer to the resulting algorithms as tooptimallytuned MP/AMP(or sometimes MP/AMP for short). They achieve theState Evolution phase transition:

ρSE(δ; χ) = ρSE(δ;χ, λχ(δ)).

An explicit characterization ofλχ(δ), χ ∈ {+,±} can be found inthe next section.

We summarize below the properties of optimally tuned AMP/MPwithin the SE formalism.

Theorem I.3. For δ ∈ [0, 1], ρ < ρSE(δ; χ), and any associatedrandom variableX, the formal MSE of optimally-tuned AMP/MPevolves to zero under SE. Viceversa, ifρ > ρSE(δ; χ), the formalMSE does not evolve to zero. Further, forρ < ρSE(δ; χ), there existsb = b(δ, ρ) > 0 with the following property. Ifσ2

t denotes the formalMSE aftert SE steps, then, for allt ≥ 0

σ2t ≤ σ2

0 exp(−bt). (14)

II. D ETAILS ABOUT THE MSE MAPPING

In this section, we sketch the proof of Proposition I.2: the iterativethreshold does not depend on the details of the signal distribution.Further, we show how to derive the explicit expression forρSE(δ; χ),χ ∈ {+,±}, given in the introduction.

A. Local Stability Bound

The state evolution thresholdρSE(δ; χ, λ) is the supremum of allρ’s such that the MSE mapΨ(σ2) lies below theσ2 line for allσ2 > 0. SinceΨ(0) = 0, for this to happen it must be true that thederivative of the MSE map atσ2 = 0 smaller than or equal to1. Weare therefore led to define the following ‘local stability’ threshold:

ρLS(δ; χ, λ) ≡ sup

ρ :

dσ2

˛˛σ2=0

< 1

ff. (15)

The above argument implies thatρSE(δ; χ, λ) ≤ ρLS(δ; χ, λ).Considering for instanceχ = +, we obtain the following expres-

sion for the first derivative ofΨ

dσ2=

„1

δ+ λ2

«E Φ

“√δ

σ(X − λσ)

”− λ√

δE φ

“√δ

σ(X − λσ)

”,

where φ(z) is the standard Gaussian density atz and Φ(z) =R z

−∞ φ(z′) dz′ is the Gaussian distribution.Evaluating this expression asσ2 ↓ 0, we get the local stability

threshold forχ = +:

ρLS(δ; χ, λ) =1 − (κχ/δ)

ˆ(1 + z2)Φ(−z) − zφ(z)

˜

1 + z2 − κχ

ˆ(1 + z2)Φ(−z) − zφ(z)

˜˛˛˛z=λ

√δ

,

whereκχ is the same as in [5]. Notice thatρLS(δ; +, λ) depends onthe distribution ofX only through its sparsity (i.e. it is independentof FX ).

B. Tightness of the Bound and Optimal Tuning

We argued that dΨdσ2

˛σ2=0

< 1 is necessary for the MSE map toconverge to0. This condition turns out to be sufficient because thefunction σ2 7→ Ψ(σ2) is concave onR+. This indeed yields

σ2t+1 ≤ dΨ

dσ2

˛˛σ2=0

σ2t , (16)

Page 6: Message Passing Algorithms for Compressed SensingarXiv:0907.3574v1 [cs.IT] 21 Jul 2009 Message Passing Algorithms for Compressed Sensing David L. Donoho Department of Statististics

which implies exponential convergence to the correct solution [14].In particular we have

ρSE(δ; χ, λ) = ρLS(δ; χ, λ) , (17)

whenceρSE(δ; χ, λ) is independent ofFX as claimed.To prove σ2 7→ Ψ(σ2) is concave, one proceeds by computing

its second derivative. For instance, in the caseχ = +, one needs todifferentiate the expression given above for the first derivative. Weomit details but point out two useful remark:(i) The contributiondue toX = 0 vanishes;(ii) Since a convex combination of concavefunctions is also concave, it is sufficient to consider the case in whichX = x∗ deterministically.

As a byproduct of this argument we obtain explicit expressionsfor the optimal tuning parameter, by maximizing the local stabilitythreshold

λ+(δ) =1√δ

arg maxz≥0

(1 − (κχ/δ)

ˆ(1 + z2)Φ(−z) − zφ(z)

˜

1 + z2 − κχ

ˆ(1 + z2)Φ(−z) − zφ(z)

˜)

.

Before applying this formula in practice, please read the importantnotice in Supplemental Information.

III. D ISCUSSION

A. Relation with Minimax Risk

Let F±ǫ denote the class of probability distributionsF supported

on (−∞,∞) with P{X 6= 0} ≤ ǫ, and letη(x; λ,±) denote thesoft-threshold function [6] with threshold valueλ. The minimax risk[14] is defined as

M±(ǫ) ≡ infλ≥0

supF∈F±

ǫ

EF {[η(X + Z; λ,±) − X]2} , (18)

with λ±(ǫ) the optimal λ. The optimal SE phase transition andoptimal SE threshold obey

δ = M±(ρδ) , ρ = ρSE(δ;±). (19)

An analogous relation holds between the positive caseρSE(δ; +),and the minimax threshold riskM+ whereF is constrained to bea distribution on[0,∞). Exploiting [19], Supporting Informationproves that

ρCG(δ) = ρSE(δ)(1 + o(1)), δ → 0.

B. Other Message Passing Algorithms

The nonlinearityη( · ) in AMP eqns [1], [2] might be chosendifferently. For sufficiently regular such choices, the SE formalismmight predict evolution of the MSE. One might hope to use SE todesign ‘better’ threshold nonlinearities.

The threshold functions used here are such that the MSE mapσ2 7→ Ψ(σ2) is monotone and concave. As a consequence, the phasetransition lineρSE(δ; χ) for optimally tuned AMP is independent ofthe empirical distribution of the vectorx0. State Evolution may beinaccurate without such properties.

Where SE is accurate, it offers limited room for improvementover the results here. IfρSE denotes a (hypothetical) phase transitionderived by SE withany nonlinearitywhatsoever, Supporting Infor-mation exploits [19] to prove

ρSE(δ;χ) ≤ ρSE(δ; χ)(1 + o(1)), δ → 0 , χ ∈ {+,±} .

In the limit of high undersampling, the nonlinearities studied hereoffer essentially unimprovable SE phase transitions. Our reconstruc-tion experiments also suggest that other nonlinearities yield littleimprovement over thresholds used here.

C. Universality

The SE-derived phase transitions are not sensitive to the detaileddistribution of coefficient amplitudes. Empirical resultsin SupportingInformation find similar insensitivity of observed phase transitions forMP.

Gaussianity of the measurement matrixA can be relaxed; Sup-porting Information finds that other random matrix ensembles exhibitcomparable phase transitions.

In applications, one often uses very large matricesA which arenever explicitly represented, but only applied as operators; examplesinclude randomly undersampled partial Fourier transforms. Support-ing Information finds that observed phase transitions for MPin thepartial Fourier case are comparable to those for randomA.

ACKNOWLEDGEMENTS

A. Montanari was partially supported by the NSF CAREERaward CCF-0743978 and the NSF grant DMS-0806211, and thanksMicrosoft Research New England for hospitality during completion ofthis work. A. Maleki was partially supported by NSF DMS-050530.

REFERENCES

[1] D. L. Donoho, “Compressed Sensing,”IEEE Transactions on InformationTheory, Vol. 52, pp. 489-509, April 2006.

[2] E. Candes, J. Romberg, T. Tao, “Robust uncertainty principles: Exactsignal reconstruction from highly incomplete frequency information,”IEEE Transactions on Information Theory,Vol. 52, No. 2, pp. 489-509,February 2006.

[3] A. Maleki, D. L. Donoho, “Optimally Tuned Iterative ThresholdingAlgorithms,” submitted toIEEE journal on selected areas in signalprocessing, 2009.

[4] D. L. Donoho, “High-Dimensional centrally symmetric polytopes withneighborliness proportional to dimension,”Discrete and ComputationalGeometry, Vol 35,No. 4, pp. 617-652, 2006.

[5] D. L. Donoho, J. Tanner, “Neighborliness of randomly-projected simplicesin high dimensions,”Proceedings of the National Academy of Sciences,Vol. 102, No. 27, p. 9452-9457, 2005.

[6] D. L. Donoho, J. Tanner, “Counting faces of randomly projected hyper-cubes and orthants, with applications,” ArXiv.

[7] D. L. Donoho, J. Tanner, “Counting faces of randomly projected polytopeswhen the projection radically lowers dimension,”J. Amer. Math. Soc., Vol.22, pp. 1-53, 2009.

[8] D. L. Donoho, J.Tanner, “Observed universality of phasetransitions inhigh dimensional geometry, with implication for modern data analysisand signal processing,” Phil. Trans. A, 2009.

[9] K. K. Herrity, A. C. Gilbert, and J. A. Tropp, “Sparse approximation viaiterative thresholding,”Proc. ICASSP, Vol. 3, pp. 624-627, Toulouse, May2006.

[10] J. A. Tropp, A. C. Gilbert,“ Signal recovery from randommeasurementsvia orthogonal matching pursuit,”IEEE Transactions Information Theory,53(12),pp. 4655-4666, 2007.

[11] P. Indyk, M. Ruzic, “Near optimal sparse recovery in theℓ1 norm,” In49th Annual Symposium on Foundations of Computer Science, pp. 199-207, Philadelphia, PA, October 2008.

[12] M. Lustig, D. L. Donoho, J. M. Santos, J. M. Pauly, “Compressedsensing MRI,”IEEE Signal Processing Magazine, 2008.

[13] I. Daubechies, M. Defrise and C. De Mol, “An iterative thresholdingalgorithm for linear inverse problems with a sparsity constraint,” Com-munications on Pure and Applied Mathematics, Vol. 75, pp. 1412-1457,2004.

[14] D. L. Donoho, I. M. Johnstone, “Minimax risk overℓp balls,” Prob. Th.and Rel. Fields, Vol. 99, pp. 277-303, 1994.

[15] D. L. Donoho, I. M. Johnstone, “Ideal spatial adaptation via waveletshrinkage,” Biometrica, Vol. 81, pp. 425-455, 1994.

[16] J. Pearl, Probabilistic reasoning in intelligent systems: networks ofplausible inference, Morgan Kaufmann, San Francisco, 1988.

[17] R. G. Gallager, Low-Density Parity-Check Codes, MITPress, Cambridge, Massachusetts, 1963, Available online at:http://web./gallager/www/pages/ldpc.pdf.

Page 7: Message Passing Algorithms for Compressed SensingarXiv:0907.3574v1 [cs.IT] 21 Jul 2009 Message Passing Algorithms for Compressed Sensing David L. Donoho Department of Statististics

[18] T. J. Richardson, R. Urbanke, Modern coding theory,Cambridge University Press, Cambridge,2008, Available online at:http://lthcwww.epfl.ch/mct/index.php.

[19] Y. Lu, A. Montanari, B. Prabhakar, S. Dharmapurikar andA. Kabbani,“Counter Braids: a novel counter architecture for per-flow measurement,”SIGMETRICS, Annapolis, June 2008.

[20] S. Sarvotham, D. Baron and R. Baraniuk, “Compressed sensing recon-struction via belief propagation,” Preprint, 2006.

[21] F. Zhang, H. Pfister,“On the iterative decoding of high-rate LDPC codeswith applications in compressed sensing,” arXiv:0903.2232v2, 2009.

[22] M. J. Wainwright, T. S. Jaakkola and A. S. Willsky,“MAP estimationvia agreement on trees: message-passing and linear programming,” IEEETransactions on Information Theory,Vol. 51, No. 11, pp. 3697-3717.

[23] M. Bayati, D. Shah and M. Sharma, “ Max-Product for maximum weightmatching: convergence, correctness, and LP duality,”IEEE Transactionson Information Theory, Vol. 54, No. 3, pp. 1241-1251, 2008.

[24] D. J. Thouless, P. W. Anderson and R. G. Palmer, “ Solution of solvablemodel of a spin glass,”Phil. Mag., Vol. 35, pp. 593-601, 1977.

[25] D. L. Donoho and I.M. Johnstone and J.C. Hoch and A.S. Stern,“Maximum Entropy and the Nearly Black Object,”Journal of the RoyalStatistical Society, Series B (Methodological), Vol. 54, pp. 41-81, 1992.

[26] D. L. Donoho, J. Tanner, “Phase transition as sparse sampling theorems,”IEEE Transactions on Information Theory,submitted for publication.

[27] P. J. Bickel, “Minimax estimation of the mean of a normaldistributionsubject to doing well at a point,” in Recent Advances in Statistics: Papersin Honor of Herman Chernoff on His Sixtieth Birthday, Academic Press,511-528, 1983.

[28] B. Efron and T. Hastie and I. Johnstone and R. Tibshirani, “Least angleregression,”Annals of Statistics, Vol.32, pp. 407-492, 2004.

APPENDIX

A. Important Notice

Readers familiar with the literature of thresholding of sparse signalswill want to know that an implicit rescaling is needed to matchequations from that literature with equations here. Specifically, inthe traditional literature, one is used to seeing expressions η(x; λσ)in cases whereσ is the standard deviation of an underlying normaldistribution. This means the thresholdλ is specified in standarddeviations, so many people will immediately understand values likeof λ = 2, 3 etc in terms of their false alarm rates. In the main text,the expressionη(x; λσ) appears numerous times, but note thatσ isnot the standard deviation of the relevant normal distribution; instead,the standard deviation of that normal isτ = σ/

√δ. It follows that

λ in the main text is calibrated differently from the wayλ would becalibrated in other sources, differing by aδ-dependent scale factor.If we let λsd

SE denote the quantityλSE appropriately rescaled sothat it is in units of standard deviations of the underlying normaldistribution, then the needed conversion to sd units is

λsdSE = λSE ·

√δ. (20)

B. A summary of notation

The main paper will be referred as DMM throughout this note.All the notations are consistent with the notations used in DMM. Wewill use repeatedly the notationǫ = δρ.

C. State Evolution Formulas

In the main text we mentionedρSE(δ; χ, λ,FX) is independent ofFX . We also mentioned a few formulas forρSE(δ; χ). The goal ofthis section is to explain the calculations involved in deriving theseresults. First, recall the expression for the MSE map

Ψ(σ2) = E

n`η(X +

σ√δZ; λσ, χ) − X

´2o

. (21)

We denote by∂1η and∂2η the partial derivatives ofη with respectto its first and second arguments. Using Stein’s lemma and thefact

that ∂21η(x; y, χ) = 0 almost everywhere, we get

dσ2=

1

δE

n∂1η(X +

σ√δZ; λσ)2

o+

λ

σE

nˆη(X +

σ√δZ; λσ) − X

˜∂2η(X +

σ√δZ; λσ)

o, (22)

where we dropped the dependence ofη( · ) on the constraintχ tosimplify the formula.

1) Caseχ = +: In this case we haveX ≥ 0 almost surely andthe threshold function is

η(x;λσ) =

(x − λσ) if x ≥ λσ,0 otherwise.

As a consequence∂1η(x; λσ) = −∂2η(x; λσ) = I(x ≥ λσ) (almosteverywhere). This yields

dσ2=

„1

δ+ λ2

«E Φ

“√δ

σ(X − λσ)

− λ√δ

E φ“√

δ

σ(X − λσ)

”.

As σ ↓ 0, we haveΦ“√

δσ

(X−λσ)”→ 1 andφ

“√δ

σ(X−λσ)

”→ 0

if X > 0. Therefore,

dσ2

˛˛0

=

„1

δ+ λ2

«ρδ +

„1

δ+ λ2

«(1 − ρδ)Φ(−λ

√δ)

− λ√δ(1 − ρδ)φ(−λ

√δ) .

The local stability thresholdρLS(δ; +, λ) is obtained by settingdΨdσ2

˛0

= 1.In order to prove the concavity ofσ2 7→ Ψ(σ2) first notice that

a convex combination of concave functions is concave and so it issufficient to show the concavity in the caseX = x ≥ 0 determin-istically. Next notice that, in the casex = 0, dΨ

dσ2 is independent of

σ2. A a consequence, it is sufficient to proved2Ψx

d(σ2)2≤ 0 where

δdΨx

dσ2=

`1 + λ2δ

´Φ

“√δ

σ(x − λσ)

”− λ

√δ φ

“√δ

σ(x − λσ)

”.

Using Φ′(u) = φ(u) andφ′(u) = −uφ(u), we get

δd2Ψx

d(σ2)2= − x

2σ3

1 +

λδ

σx

ffφ

“√δ

σ(x − λσ)

”< 0 (23)

for x > 0.2) Caseχ = ±: HereX is supported on(−∞,∞) with P{X 6=

0} ≤ ǫ = ρδ. Recall the definition of soft threshold

η(x;λσ) =

8<:

(x − λσ) if x ≥ λσ,(x + λσ) if x ≤ −λσ,0 otherwise.

As a consequence∂1η(x;λσ) = I(|x| ≥ λσ) and ∂2η(x; λσ) =−sign(x)I(|x| ≥ λσ). This yields

dσ2=

„1

δ+ λ2

«E

“√δ

σ(X − λσ)

”+

Φ“−

√δ

σ(X + λσ)

”o

− λ√δ

E

“√δ

σ(X − λσ)

”+ φ

“√δ

σ(X + λσ)

”o.

Page 8: Message Passing Algorithms for Compressed SensingarXiv:0907.3574v1 [cs.IT] 21 Jul 2009 Message Passing Algorithms for Compressed Sensing David L. Donoho Department of Statististics

By letting σ ↓ 0 we get

dσ2

˛˛0

=

„1

δ+ λ2

«ρδ +

„1

δ+ λ2

«(1 − ρδ) 2 Φ(−λ

√δ)

− λ√δ(1 − ρδ) 2φ(−λ

√δ) ,

which yields the local stability thresholdρLS(δ;±, λ) by dΨdσ2

˛0

= 1.Finally the proof of the concavity ofσ2 7→ Ψ(σ2) is completely

analogous to the caseχ = +.3) Caseχ = �: Finally consider the case ofX supported on

[−1, +1] with P{X 6∈ {+1,−1}} ≤ ǫ. In this case we proposed thefollowing nonlinearity,

η(x) =

8<:

+1 if x > +1,x if −1 ≤ x ≤ +1,−1 if x ≤ −1.

Notice that the nonlinearity does not depend on any thresholdparameter. Since∂1η(x) = I(x ∈ [−1, +1]),

dσ2=

1

δP

nX +

σ√δZ ∈ [−1, +1]

o

=1

δE

“√δ

σ(1 − X)

”− Φ

“−

√δ

σ(1 + X)

”o.

As σ ↓ 0 we get

dσ2

˛˛0

=1

2δ(1 + ρδ) ,

whence the local stability conditiondΨdσ2

˛0

< 1 yields ρLS(δ; �) =

(2 − δ−1)+.Concavity ofσ2 7→ Ψ(σ2) immediately follows from the fact that

Φ(√

δσ

(1−x)) is non-increasing inσ for x ≤ 1 andΦ(−√

δσ

(1+x))is non-decreasing forx ≥ −1. Using the combinatorial geometryresult of [6] we get

Theorem A.1. For any δ ∈ [0, 1],

ρCG(δ; �) = ρSE(δ; �) = ρLS(δ; �) = max˘0, 2 − δ−1¯

. (24)

D. Relation to Minimax Thresholding

1) Minimax Thresholding Policy:We denote byF+ǫ the collection

of all CDF’s supported in[0,∞) and with F (0) ≥ 1 − ǫ, and byF±

ǫ the collection of all CDF’s supported in(−∞,∞) and withF (0+) − F (0−) ≥ 1 − ǫ. For χ ∈ {+,±}, define the minimaxthreshold MSE

M∗(ǫ; χ) = infλ

supF∈Fχ

ǫ

EF

˘η(X + Z; λ, χ) − X)2

¯, (25)

whereEF denote expectation with respect to the random variableXwith distribution F , and η(x;λ) = sign(x)(|x| − λ)+ for χ = ±and η(x;λ) = (x − λ)+ for χ = +. Minimax Thresholdingwasdiscussed for the caseχ = + in [25] and forχ = ± in [14], [15].

This machinery gives us a way to look at the results derived abovein very commonsense terms. Suppose we knowδ and ρ but notthe distributionF of X. Let’s consider what threshold one mightuse, and ask at each given iteration of SE, the threshold whichgives us the best possible control of the resulting formal MSE. Thatbest possible thresholdλt is by definition the minimax threshold atnonzero fractionǫ = ρδ, appropriately scaled by the effective noiselevel τ = σ/

√δ,

λt = λ∗(ρ · δ; χ) · σ/√

δ,

where χ ∈ {+,±} depending on the case at hand. Note that thisthreshold does not depend onF . It depends on iteration only through

the effective noise level at that iteration. The guarantee we then get forthe formal MSE is the minimax threshold risk, appropriatelyscaledby the square of the effective noise level:

MSE ≤ M∗(ρδ; χ) · τ 2 = M∗(ρδ;χ)σ2

δ, . (26)

for χ ∈ {+,±}. This guarantee gives us a reduction in MSE overthe previous iteration if and only if the right-hand side in Eq. (26) issmaller thanσ2, i.e. if and only if

M∗(ρδ;χ) < δ , χ ∈ {+,±}.

In short, we can use state evolution with the minimax threshold,appropriately scaled by effective noise level, and we get a guaran-teed fractional reduction in MSE at each iteration, with fractionalimprovement

ωMM(δ, ρ;χ) = (1 − M∗(ρδ;χ)/δ); (27)

hence the formal SE evolution is bounded by:

σ2t ≤ ωMM(δ, ρ; χ)t · EX2, t = 1, 2, . . . . (28)

Results analogous to those of the main text hold for this minimaxthresholding policy. That is, we can define a minimax thresholdingphase transition such that below that transition, state evolution withminimax thresholding converges:

ρMM(δ; χ) = sup{ρ : M∗(ρδ;χ) < δ}; χ ∈ {+,±}.

Theorem A.2. Under SE with the minimax thresholding policydescribed above, for each(δ, ρ) in (0, 1)2 obeyingρ < ρMM(δ; χ),and for every marginal distributionF ∈ Fχ

ǫ , the formal MSE evolvesto zero, with dynamics bounded by (27)- (28).

2) Relating Optimal Thresholding to Minimax Thresholding:Animportant difference between the optimal threshold definedin themain text and the minimax threshold is thatλχ = λχ(δ) dependsonly on the assumedδ – no specificρ need be chosen while minimaxthresholding as defined above requires that one specify bothδ andρ. However, since the methodology is seemingly pointless above theminimax phase transition, one might think to specifyρ = ρMM(δ; χ).This new thresholdλMM(δ; χ) = λ∗(δρMM(δ); χ) then requires nospecification ofρ. As it turns out, the SE threshold coincides withthis new threshold.

Theorem A.3. For χ ∈ {+,±} and δ ∈ [0, 1]

M∗(ρδ;χ) = δ if and only if ρ = ρSE(δ; χ) . (29)

Let λχ(δ) denote the minimax threshold defined in the main text, andlet λsd

χ (δ) denote denote the same quantity expressed in sd units (20).Then

λsdχ (δ) = λχ(ρδ), ρ = ρSE(δ; χ), χ ∈ {+,±}

Proof: It is convenient to introduce the following explicit nota-tion for the MSE map:

Ψ(σ2; δ, λ, F ) = EF

n`η(X +

σ√δZ; λσ) − X

´2o

, (30)

whereZ ∼ N(0, 1) is independent ofX, and X ∼ F . As above,we drop the dependency of the threshold function onχ ∈ {+,±}Sinceη(ax;aλ) = a η(x;λ) for any positivea, we have the scaleinvariance

Ψ(σ2; δ, λ, F, χ) =σ2

δΨ(1; 1, λ

√δ, Sδ1/2/σF ), (31)

Page 9: Message Passing Algorithms for Compressed SensingarXiv:0907.3574v1 [cs.IT] 21 Jul 2009 Message Passing Algorithms for Compressed Sensing David L. Donoho Department of Statististics

where(SaF )(x) = F (x/a) is the operator that takes the CDF of anrandom variableX and returns the CDF of the random variableaX.

Define

J(δ, ρ; χ) = infλ≥0

supF∈Fχ

ǫ

supσ2∈(0,EF {X2}]

1

σ2Ψ(σ2; δ, λ, F, χ) , (32)

where ǫ ≡ ρδ. It follows from the definition ofSE threshold thatρ < ρSE(δ; χ) if and only if J(δ, ρ; χ) < 1. We first notice that byconcavity ofσ2 7→ Ψ(σ2; δ, λ, F, χ), we have

J(δ, ρ; χ) = infλ

supF∈Fχ

ǫ

supσ2>0

1

σ2Ψ(σ2; δ, λ, F, χ) (33)

=1

δinfλ

supF∈Fχ

ǫ

supσ2>0

Ψ(1; 1, λ√

δ, Sδ1/2/σF ) (34)

=1

δinfλ

supF∈FP

ǫ

Ψ(1; 1, λ, F ) (35)

where the second identity follows from the invariance property andthe third from the observation thatSaFχ

ǫ = Fχǫ for any a > 0.

Comparing with the definition (25), we finally obtain

J(δ, ρ; χ) =1

δM∗(δρ;χ) . (36)

Thereforeρ < ρSE(δ; χ) if and only δ > M∗(δρ;χ), which impliesthe thesis.

E. Convergence Rate of State Evolution

The optimal thresholding policy described in the main text is thesame as using the minimax thresholding policy but instead assumingthe most pessimistic possible choice ofρ – the largestρ that canpossibly make sense. In contrast minimax thresholding isρ-adaptive,and can use a smaller threshold where it would be valuable. Below theSE phase transition, both methods will converge, so what’s different?

Note thatλSE(δ; χ) andλMM (δ, ρ; χ) are dimensionally different;λMM is in standard deviation units. ConvertingλSE into sd units by(20), we haveλsd

SE = λSE · δ1/2. Even after this calibration, we findthat methods will generally use different thresholds, i.e.if ρ < ρSE,

λMM(δ, ρ; χ) 6= λsdSE (δ; χ), χ ∈ {+,±}.

In consequence, the methods may have different rates of convergence.Define the worst-case threshold MSE

MSE(ǫ, λ;χ) = supF∈Fχ

ǫ

EF

˘η(X + Z; λ) − X)2

¯

and setMSE(δ, ρ; χ) = MSE(δρ, λsd

SE (δ, χ); χ).

This is the MSE guarantee achieved by usingλsdSE (δ) when in fact

(δ, ρ) is the case. Now by definition of minimax threshold MSE,

MSE(δ, ρ; χ) ≥ M∗(δρ;χ); (37)

the inequality is generally strict. The convergence rate ofoptimalAMP under SE was described implicitly in the main text. We cangive more precise information using this notation. Define

ωSE(δ, ρ; χ) = (1 − MSE(δ, ρ; χ)/δ);

Then we have for the formal MSE of AMP

σ2t ≤ ωSE(δ, ρ;χ)t · EX2, t = 1, 2, 3, . . .

In the main text, the same relation was written in terms ofexp(−bt),with b > 0; here we see that we may takeb(δ, ρ) = − log(ωSE(δ, ρ)).

Explicit evaluation of thisb requires evaluation of the worst-casethersholding riskMSE(ǫ, λ). Now by (37) we have

ωSE(δ, ρ; χ) ≥ ωMM(δ, ρ; χ),

generally with strict inequality; so by using theρ-adaptive thresholdone gets better speed guarantees.

F. Rigorous Asymptotic Agreement of SE and CG

In this section we prove

Theorem A.4. For $\chi \in \{+,\pm\}$,
\[
\lim_{\delta \to 0}\ \frac{\rho_{\rm CG}(\delta;\chi)}{\rho_{\rm SE}(\delta;\chi)} = 1. \tag{38}
\]

In words, $\rho_{\rm CG}(\delta;\chi)$ is the phase transition computed by combinatorial geometry (polytope theory) and $\rho_{\rm SE}(\delta;\chi)$ the one obtained by state evolution: they are rigorously equivalent in the highly undersampled limit ($\delta \to 0$). In the main text we could only observe that they agree numerically.

1) Properties of the minimax threshold: We summarize here several known properties of the minimax threshold (25), which provide useful information about the behavior of SE.

The extremal $F$ achieving the supremum in Eq. (25) is known. In the case $\chi = +$, it is a two-point mixture
\[
F^{+}_{\epsilon} = (1-\epsilon)\,\delta_0 + \epsilon\,\delta_{\mu_{+}(\epsilon)}. \tag{39}
\]
In the signed case $\chi = \pm$, it is a three-point symmetric mixture
\[
F^{\pm}_{\epsilon} = (1-\epsilon)\,\delta_0 + \frac{\epsilon}{2}\big(\delta_{\mu_{\pm}(\epsilon)} + \delta_{-\mu_{\pm}(\epsilon)}\big). \tag{40}
\]

Precise asymptotic expressions for $\mu_{\chi}(\epsilon)$ are available. In particular, for $\chi \in \{+,\pm\}$,
\[
\mu_{\chi}(\epsilon) = \sqrt{2\log(\epsilon^{-1})}\,\big(1 + o(1)\big) \quad \text{as } \epsilon \to 0. \tag{41}
\]
We also know that
\[
M^{*}(\epsilon;\chi) = 2\,\epsilon\log(\epsilon^{-1})\,\big(1 + o(1)\big) \quad \text{as } \epsilon \to 0. \tag{42}
\]
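For intuition, the sketch below (Python; our own illustration, not code from the paper) approximates $M^{*}(\epsilon)$ numerically by restricting the supremum to the point-mass family (39)-(40) on a grid and estimating the soft-thresholding risk by Monte Carlo, then recovers $\rho_{\rm SE}(\delta) = \sup\{\rho : M^{*}(\delta\rho) < \delta\}$ by bisection. The grid ranges, sample size, and the test value of $\delta$ are arbitrary choices, and the result is only a rough approximation.

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.standard_normal(50_000)            # shared normal sample for all Monte Carlo risks

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def worst_risk(lam, eps, mu_grid):
    """sup over the point-mass family (39)-(40) of E_F{(eta(X+Z;lam) - X)^2}, on a mu grid."""
    r_mu = np.mean((soft(mu_grid[:, None] + Z[None, :], lam) - mu_grid[:, None]) ** 2, axis=1)
    r_0 = np.mean(soft(Z, lam) ** 2)       # risk contribution of the atom at X = 0
    return np.max((1 - eps) * r_0 + eps * r_mu)

def M_star(eps,
           lam_grid=np.linspace(0.5, 4.0, 30),
           mu_grid=np.linspace(0.0, 10.0, 50)):
    """Rough numerical approximation of the minimax soft-thresholding MSE M*(eps)."""
    return min(worst_risk(lam, eps, mu_grid) for lam in lam_grid)

def rho_SE(delta, tol=1e-3):
    """Bisection for sup{rho : M*(delta*rho) < delta}, cf. Eq. (36)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if M_star(delta * mid) < delta else (lo, mid)
    return lo

delta = 0.25
# second value: the small-delta asymptotic implied by (42); only a rough guide at moderate delta
print(rho_SE(delta), 1.0 / (2.0 * np.log(1.0 / delta)))
```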

2) Proof of Theorem A.4: Combining Theorem A.3 and Eq. (42), we get
\[
\rho_{\rm SE}(\delta;\chi) \sim \frac{1}{2\log(\delta^{-1})}, \qquad \delta \to 0, \tag{43}
\]
with correction terms that can be given explicitly. Now, we know rigorously from [26] that the LP-based phase transitions satisfy a similar relationship:

Theorem A.5 (Donoho and Tanner [26]). For $\chi \in \{+,\pm\}$,
\[
\rho_{\rm CG}(\delta;\chi) \sim \frac{1}{2\log(\delta^{-1})}, \qquad \delta \to 0. \tag{44}
\]

Combining this with Eq. (43), we get Theorem A.4.

G. Rigorous Asymptotic Optimality of Soft Thresholding

The discussion in the main text alluded to the possibility of improving on soft thresholding. Here we give a more formal discussion. We work in the situations $\chi \in \{+,\pm\}$. Let $\tilde{\eta}$ denote an arbitrary nonlinearity with tuning parameter $\lambda$ (for a concrete example, think of hard thresholding). We can define the minimax MSE for this nonlinearity in the natural way:
\[
\tilde{M}(\epsilon;\chi) = \inf_{\lambda}\;\sup_{F \in \mathcal{F}^{\chi}_{\epsilon}}\;\mathbb{E}_F\big\{\big(\tilde{\eta}(X+Z;\lambda) - X\big)^2\big\}; \tag{45}
\]


there is a corresponding minimax threshold $\tilde{\lambda}(\epsilon;\chi)$. We can deploy the minimax threshold in AMP by setting $\epsilon = \rho\delta$ and rescaling the threshold by the effective noise level $\tau_t = \sigma_t/\sqrt{\delta}$:
\[
\text{actual threshold at iteration } t = \tilde{\lambda}(\epsilon;\chi)\cdot\tau_t = \tilde{\lambda}(\rho\delta;\chi)\cdot\sigma_t/\sqrt{\delta}.
\]
Under state evolution, this is guaranteed to reduce the MSE provided
\[
\tilde{M}(\rho\delta;\chi) < \delta.
\]

In that case we get full evolution to zero. It makes sense to define the minimax phase transition:
\[
\tilde{\rho}_{\rm SE}(\delta;\chi) = \sup\{\rho : \tilde{M}(\rho\delta;\chi) < \delta\}, \qquad \chi \in \{+,\pm\}.
\]
Whatever $F$ may be, for $(\delta,\rho)$ with $\rho < \tilde{\rho}_{\rm SE}(\delta)$, SE evolves the formal MSE of $\tilde{\eta}$ to zero.

It is tempting to hope that some very special nonlinearity can do substantially better than soft thresholding. At least for the minimax phase transition, this is not so:

Theorem A.6. Let $\tilde{\rho}_{\rm SE}(\delta;\chi)$ be the minimax phase transition computed under the State Evolution formalism for the cases $\chi \in \{+,\pm\}$ with some scalar nonlinearity $\tilde{\eta}$. Let $\rho_{\rm SE}(\delta;\chi)$ be the phase transition calculated in the main text for soft thresholding with the corresponding optimal $\lambda$. Then for $\chi \in \{+,\pm\}$,
\[
\lim_{\delta \to 0}\ \frac{\tilde{\rho}_{\rm SE}(\delta;\chi)}{\rho_{\rm SE}(\delta;\chi)} \le 1.
\]

In words, no other nonlinearity can outperform soft thresholding in the limit of extreme undersampling, in the sense of minimax phase transitions. This is best understood using a notion from the main text. We there said that the parameter space $(\delta,\rho,\lambda,F)$ can be partitioned into two regions: Region (I), where zero is the unique fixed point of the MSE map and is a stable fixed point, and its complement, Region (II). Theorem A.6 says that the range of $\rho$ guaranteeing membership in Region (I) cannot be dramatically expanded by using a different nonlinearity.

1) Some results on Minimax Risk: The proof depends on some known results about the minimax MSE when we are allowed to choose not just the threshold but also the nonlinearity. For $\chi \in \{+,\pm\}$, define the minimax MSE
\[
M^{**}(\epsilon;\chi) = \inf_{\tilde{\eta}}\;\sup_{F \in \mathcal{F}^{\chi}_{\epsilon}}\;\mathbb{E}_F\big\{\big(\tilde{\eta}(X+Z) - X\big)^2\big\}. \tag{46}
\]

Here the minimization is over all measurable functions $\tilde{\eta}: \mathbb{R} \to \mathbb{R}$. This minimax MSE was discussed for the case $\chi = +$ in [25] and for $\chi = \pm$ in [27], [14], [15]. It is known that
\[
M^{**}(\epsilon;\chi) \sim 2\,\epsilon\log(\epsilon^{-1}), \qquad \epsilon \to 0. \tag{47}
\]

H. Proof of Theorem A.6

Evidently, any specific nonlinearity cannot do better than the minimax risk:
\[
\tilde{M}(\epsilon;\chi) \ge M^{**}(\epsilon;\chi).
\]

Consequently, if we put
\[
\rho^{**}(\delta;\chi) = \sup\{\rho : M^{**}(\delta\rho;\chi) < \delta\},
\]
then
\[
\tilde{\rho}_{\rm SE}(\delta;\chi) \le \rho^{**}(\delta;\chi).
\]

From (47) and the last two displays we conclude
\[
\tilde{\rho}_{\rm SE}(\delta;\chi) \le \frac{1}{2\log(1/\delta)}\,\big(1+o(1)\big) \sim \rho_{\rm SE}(\delta;\chi), \qquad \delta \to 0.
\]
Theorem A.6 is proven.

I. Data Generation

For a given algorithm with a fully specified parameter vector, we conduct one phase-transition measurement experiment as follows. We fix a problem suite, i.e. a matrix ensemble and a coefficient distribution for generating problem instances $(A, x_0)$. We also fix a grid of $\delta$ values in $[0,1]$, typically 30 values equispaced between 0.02 and 0.99. Subordinate to this grid, we consider a series of $\rho$ values. Two cases arise frequently:

• Focused Search design. 20 values between $\rho_{\rm CG}(\delta;\chi) - 1/10$ and $\rho_{\rm CG}(\delta;\chi) + 1/10$, where $\rho_{\rm CG}$ is the theoretically expected phase transition derived from combinatorial geometry (according to the case $\chi$).

• General Search design. 40 values equispaced between 0 and 1.

We then have a (possibly non-Cartesian) grid of $(\delta,\rho)$ values in the parameter space $[0,1]^2$. At each $(\delta,\rho)$ combination we take $M$ problem instances; in our case $M = 20$. We also fix a measure of success; see below.

Once we specify the problem size $N$, the experiment is fully specified: we set $n = \lceil \delta N\rceil$ and $k = \lceil \rho n\rceil$, generate $M$ problem instances, and obtain $M$ algorithm outputs $\hat{x}_i$ and $M$ success indicators $S_i$, $i = 1,\dots,M$.

A problem instance $(y, A, x_0)$ consists of an $n \times N$ matrix $A$ from the given matrix ensemble and a $k$-sparse vector $x_0$ from the given coefficient ensemble. Then $y = Ax_0$. The algorithm is called with the problem instance $(y, A)$ and produces a result $\hat{x}$. We declare success if
\[
\frac{\|x_0 - \hat{x}\|_2}{\|x_0\|_2} \le {\rm tol},
\]
where ${\rm tol}$ is a given parameter, in our case $10^{-4}$; the variable $S_i$ indicates success on the $i$-th Monte Carlo realization. To summarize all $M$ Monte Carlo repetitions, we set $S = \sum_i S_i$.

The result of such an experiment is a dataset of tuples $(N, n, k, M, S)$, each tuple giving the results at one combination $(\rho,\delta)$. The meta-information describing the experiment is the specification of the algorithm with all its parameters, the problem suite, and the success measure with its tolerance.
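As an illustration of this protocol, here is a minimal Python sketch of a problem-instance generator and the success indicator for the standard suite; the routine `solver(y, A)` in the usage comment is a hypothetical placeholder for whichever algorithm is being tested, and the values of $N$, $\delta$, $\rho$ are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(0)

def problem_instance(N, delta, rho, signed=True):
    """One instance (y, A, x0): USE matrix, unit-amplitude k-sparse x0, y = A x0."""
    n = int(np.ceil(delta * N))
    k = int(np.ceil(rho * n))
    A = rng.standard_normal((n, N))
    A /= np.linalg.norm(A, axis=0)                 # columns uniform on the unit sphere (USE)
    x0 = np.zeros(N)
    support = rng.choice(N, size=k, replace=False)
    x0[support] = rng.choice([1.0, -1.0], size=k) if signed else 1.0
    return A @ x0, A, x0

def success(x0, xhat, tol=1e-4):
    """Success indicator S_i: relative l2 error below tol."""
    return np.linalg.norm(x0 - xhat) / np.linalg.norm(x0) <= tol

# Usage with a hypothetical reconstruction routine `solver(y, A)`:
# S = 0
# for _ in range(20):                              # M = 20 Monte Carlo instances
#     y, A, x0 = problem_instance(N=1000, delta=0.5, rho=0.2)
#     S += success(x0, solver(y, A))
```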

J. Estimating Phase Transitions

From such a dataset we find the location of the phase transition as follows. Corresponding to each fixed value of $\delta$ in our grid, we have a collection of tuples $(N, n, k, M, S)$ with $n/N = \delta$ and varying $k$. Pretending that our random number generator produces truly independent random numbers, the result $S$ of one experiment is binomial ${\rm Bin}(\pi, M)$, where the success probability $\pi \in [0,1]$. Extensive prior experiments show that this probability varies from 1 when $\rho$ is well below $\rho_{\rm CG}$ to 0 when $\rho$ is well above $\rho_{\rm CG}$. In short, the success probability
\[
\pi = \pi(\rho \,|\, \delta; N).
\]

We define the finite-$N$ phase transition as the value of $\rho$ at which the success probability is 50%:
\[
\pi(\rho \,|\, \delta; N) = \frac{1}{2} \quad \text{at } \rho = \rho(\delta).
\]

This notion is well known in biometrics, where the 50% point of the dose-response curve is called the LD50. (Actually we have the implicit dependence $\rho(\delta) \equiv \rho(\delta \,|\, N, {\rm tol})$; the tolerance in the success definition has a (usually slight) effect, as does the problem size $N$.)

To estimate the phase transition from data, we model the dependence of the success probability on $\rho$ using generalized linear models


(GLMs). We take a $\delta$-constant slice of the dataset, obtaining triples $(k, M, S(k,n,N))$, and model $S(k,n,N) \sim {\rm Bin}(\pi_k, M)$, where the success probabilities obey a generalized linear model with logistic link
\[
{\rm logit}(\pi) = a + b\rho,
\]
where $\rho = k/n$; in biometric language, we are modeling that the dose-response probability, with $\rho$ as the 'complexity dose', follows a logistic curve.

In terms of the fitted parameters $a, b$, we have the estimated phase transition
\[
\rho(\delta) = -a/b,
\]
and the estimated transition width
\[
w(\delta) = 1/b.
\]
Note that, actually,
\[
\rho(\delta) = \rho(\delta \,|\, N, {\rm tol}), \qquad w(\delta) = w(\delta \,|\, N, {\rm tol}).
\]

We may be able to see the phase transition and its width varying with $N$ and with the success tolerance.

Because we make only $M$ measurements in our Monte Carlo experiments, these results are subject to sampling fluctuations. Confidence statements for $\rho(\delta)$ can be made using standard statistical software.
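A minimal sketch of this fit, using `statsmodels` as one possible GLM implementation; the $\rho$ grid and success counts below are made-up illustrative numbers, not data from the paper.

```python
import numpy as np
import statsmodels.api as sm

def fit_transition(rho, S, M):
    """Fit logit(pi) = a + b*rho to binomial counts S out of M, and return
    the estimated transition rho(delta) = -a/b and width w(delta) = 1/b."""
    endog = np.column_stack([S, M - S])              # (successes, failures) per design point
    exog = sm.add_constant(np.asarray(rho, float))   # columns: intercept a, slope b
    res = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
    a, b = res.params
    return -a / b, 1.0 / b

# Illustrative delta-slice: M = 20 instances at each of 9 rho values.
rho = np.linspace(0.15, 0.35, 9)
S = np.array([20, 20, 19, 18, 12, 6, 2, 0, 0])
print(fit_transition(rho, S, M=20))
```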

K. Tuning of Algorithms

The procedure so far gives us, for each fully specified combination of algorithm parameters $\Lambda$ and each problem suite $\mathcal{S}$, a dataset $(\Lambda, \mathcal{S}, \delta, \rho(\delta;\Lambda,\mathcal{S}))$. When an algorithm has such parameters, we can define, for each fixed $\delta$, the value of the parameters which gives the highest transition:
\[
\rho_{\rm opt}(\delta;\mathcal{S}) = \max_{\Lambda}\ \rho(\delta;\Lambda,\mathcal{S}),
\]
with associated optimal parameters $\Lambda_{\rm opt}(\delta;\mathcal{S})$. When the results of the algorithm depend strongly on the problem suite as well, we can also tune to optimize worst-case performance across suites, getting the minimax transition
\[
\rho_{\rm MM}(\delta) = \max_{\Lambda}\ \min_{\mathcal{S}}\ \rho(\delta;\Lambda,\mathcal{S}),
\]
and corresponding minimax parameters $\Lambda_{\rm MM}(\delta)$. This procedure was followed in [3] for a wide range of popular algorithms. Figure 3 of the main text presents the observed minimax transitions.

L. Results: Empirical Phase Transition

Figure 4 (which is a complete version of Figure 3 in the main text) compares observed phase transitions of several algorithms, including AMP. We considered what was called in [3] the standard suite, with these choices:

• Matrix ensemble: Uniform Spherical Ensemble (USE); each column of $A$ is drawn uniformly at random from the unit sphere in $\mathbb{R}^n$.

• Coefficient ensemble: the vector $x_0$ has $k$ nonzeros in random locations, with constant amplitude of the nonzeros. If $\chi = +$, $x_0(i) \in \{0, +1\}$; in the other cases, $x_0(i) \in \{+1, 0, -1\}$ (with equiprobable positive and negative entries).

For each algorithm we generated an appropriate grid of $(\delta,\rho)$ and created $M = 20$ independent problem instances at each grid point, i.e. independent realizations of the vector $x_0$ and the measurement matrix $A$.

For AMP we used a focused search design, centered around $\rho_{\rm CG}(\delta)$. To reconstruct $x_0$, we ran $T = 1000$ AMP iterations and report the mean square error at the final iteration.

Fig. 4. Observed phase transitions for 6 algorithms, and $\rho_{\rm SE}$ (plot of $\rho$ versus $\delta$; curves for IHT, IST, Tuned TST, LARS, OMP, $\ell_1$, and AMP). AMP: the method introduced in the main text. IST: Iterative Soft Thresholding. IHT: Iterative Hard Thresholding. TST: a class of two-stage thresholding algorithms including Subspace Pursuit and CoSaMP. OMP: Orthogonal Matching Pursuit. Note that the $\ell_1$ curve coincides with the state evolution transition $\rho_{\rm SE}$, a theoretical calculation. The other curves show empirical results.

For the other algorithms, we used the general search design described above. For more details about the observed phase transitions we refer the reader to [3].

The calculation of the phase transition curve of AMP takes around 36 hours on a single Pentium 4 processor.

Observed phase transitions for other coefficient ensembles and matrix ensembles are discussed below in Sections O and P.

M. Example of the Interference Heuristic

In the main text, our motivation of the SE formalism used the assumption that the mutual access interference term ${\rm MAI}_t = (A^{*}A - I)(x^t - x_0)$ is marginally nearly Gaussian, i.e. the distribution function of the entries of the MAI vector is approximately Gaussian.

As we mentioned, this heuristic motivates the definition of the MSE map. It is easy to prove that the heuristic is valid at the first iteration; but for the validity of SE, it must continue to hold at every iteration until the algorithm stops. Figure 5 presents a typical example. In this example we consider the USE matrix ensemble and the Rademacher coefficient ensemble; $N$ is set to the small problem size 2000 and $(\delta,\rho) = (0.9, 0.52)$. The algorithm is tracked across 90 iterations. Each panel exhibits a linear trend, indicating approximate Gaussianity. The slope decreases with the iteration count; the slope is the square root of the MSE, and its decrease indicates that the MSE is evolving towards zero. More interestingly, Figure 6 shows the QQ plot of the MAI noise for the partial Fourier matrix ensemble. Coefficients here are again from the Rademacher ensemble, and $(N,\delta,\rho) = (16384, 0.5, 0.35)$.
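For concreteness, here is a minimal Python sketch (our own illustration, not code from the paper) of the quantities plotted in Figures 5-6: the MAI vector and the quantile pairs for a QQ plot against the standard normal. Here `A`, `x_t`, and `x0` stand for a real-valued measurement matrix, the current AMP iterate, and the true signal.

```python
import numpy as np
from scipy.stats import norm

def mai(A, x_t, x0):
    """Mutual access interference vector MAI_t = (A*A - I)(x_t - x0), for real-valued A."""
    d = x_t - x0
    return A.T @ (A @ d) - d

def qq_pairs(v):
    """Standard-normal quantiles versus sorted entries of v, as in Figs. 5-6.
    Approximate linearity indicates marginal Gaussianity; per the text, the slope
    of the fitted line is the square root of the MSE."""
    p = (np.arange(1, v.size + 1) - 0.5) / v.size
    return norm.ppf(p), np.sort(v)
```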

N. Testing Predictions of State Evolution

The last section gave an illustration tracking the actual evolution of the AMP algorithm; it showed that the State Evolution heuristic is qualitatively correct.

We now consider predictions made by SE and their quantitative match with empirical observations. We consider predictions of four observables:


Fig. 5. QQ plots tracking the marginal distribution of the mutual access interference (MAI). Panels (a)-(i): iterations $10, 20, \dots, 90$. Each panel shows the QQ plot of MAI values versus the normal distribution (standard normal quantile on the horizontal axis, quantile of the input sample on the vertical axis) in blue, and in red (mostly obscured) points along a straight line. Approximate linearity indicates approximate normality. Decreasing slope with increasing iteration number indicates decreasing standard deviation as the iterations progress.

Fig. 6. QQ plots tracking the marginal distribution of the MAI. Matrix ensemble: partial Fourier. Panels (a)-(i): iterations $30, 60, \dots, 270$. For other details, see Fig. 5.

• MSE on zeros and MSE on non-zeros:
\[
{\rm MSE}_Z = \mathbb{E}\big[x(i)^2 \,\big|\, x_0(i) = 0\big], \qquad {\rm MSE}_{NZ} = \mathbb{E}\big[(x(i) - x_0(i))^2 \,\big|\, x_0(i) \neq 0\big] \tag{48}
\]
• Missed detection rate and false alarm rate:
\[
{\rm MDR} = \mathbb{P}\big[x(i) = 0 \,\big|\, x_0(i) \neq 0\big], \qquad {\rm FAR} = \mathbb{P}\big[x(i) \neq 0 \,\big|\, x_0(i) = 0\big] \tag{49}
\]

We illustrate the calculation of MDR; the other quantities are computed similarly. Let $\epsilon = \delta\rho$, and suppose that the entries of $x_0$ are either $0$, $+1$, or $-1$, with $\mathbb{P}\{x_0(i) = \pm 1\} = \epsilon/2$.

Fig. 7. Comparison of State Evolution predictions against observations, $\delta = 0.3$, $\rho = 0.15$. Panels (a)-(d): ${\rm MSE}_{NZ}$, ${\rm MSE}_Z$, MDR, FAR, each as a function of the iteration number. Curve in red: theoretical prediction. Curve in blue: mean observable. Two curves are present in each panel; however, except for the lower left panel, the blue curve (empirical data) is obscured by the red curve. The two curves are in close agreement in all panels.

Fig. 8. Comparison of State Evolution predictions against observations, $\delta = 0.5$, $\rho = 0.2$. For details, see Figure 7.

Then, with $Z \sim {\rm N}(0,1)$,
\begin{align}
\mathbb{P}\big[x(i) = 0 \,\big|\, x_0(i) \neq 0\big] &= \mathbb{P}\Big[\eta\big(1 + \tfrac{\sigma}{\sqrt{\delta}}Z;\,\lambda\sigma\big) = 0\Big] \notag\\
&= \mathbb{P}\Big[1 + \tfrac{\sigma}{\sqrt{\delta}}Z \in (-\lambda\sigma,\,\lambda\sigma)\Big] \notag\\
&= \mathbb{P}\big[Z \in (a, b)\big] \tag{50}
\end{align}
with $a = (-\lambda - 1/\sigma)\sqrt{\delta}$ and $b = (\lambda - 1/\sigma)\sqrt{\delta}$. In short, the calculation merely requires classical properties of the normal distribution, and the three other quantities require similar properties. As discussed in the main text, SE makes an iteration-by-iteration prediction of $\sigma_t$; in order to calculate predictions of MDR, FAR, ${\rm MSE}_{NZ}$ and ${\rm MSE}_Z$, the parameters $\epsilon$ and $\lambda$ are also needed.
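A direct transcription of this calculation in Python; the FAR formula below is our own analogous derivation under the same Gaussian heuristic (a zero entry survives thresholding when $|\sigma Z/\sqrt{\delta}| > \lambda\sigma$, i.e. $|Z| > \lambda\sqrt{\delta}$), not a formula stated explicitly in the text.

```python
import numpy as np
from scipy.stats import norm

def predicted_mdr(delta, sigma, lam):
    """SE prediction of the missed detection rate for unit-amplitude nonzeros, Eq. (50):
    P[Z in (a, b)] with a = (-lam - 1/sigma)*sqrt(delta), b = (lam - 1/sigma)*sqrt(delta)."""
    a = (-lam - 1.0 / sigma) * np.sqrt(delta)
    b = ( lam - 1.0 / sigma) * np.sqrt(delta)
    return norm.cdf(b) - norm.cdf(a)

def predicted_far(delta, sigma, lam):
    """Analogous false alarm rate prediction: P[|Z| > lam*sqrt(delta)] (independent of sigma)."""
    return 2.0 * norm.sf(lam * np.sqrt(delta))
```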


Fig. 9. Comparison of State Evolution predictions against observations for $\delta = 0.7$, $\rho = 0.36$. For details, see Figure 7.

We compared the state evolution predictions with the actual values by a Monte Carlo experiment. We chose the triples $(\delta,\rho,N)$: $(0.3, 0.15, 5000)$, $(0.5, 0.2, 4000)$, $(0.7, 0.36, 3000)$. We again used the standard problem suite (USE matrix and unit-amplitude nonzeros). At each combination of $(\delta,\rho,N)$, we generated $M = 200$ random problem instances from the standard problem suite and ran the AMP algorithm for a fixed number of iterations. We computed the observables at each iteration. For example, the empirical missed detection rate is estimated by

\[
\widehat{\rm MDR}(t) = \frac{\#\{i : x^t(i) = 0 \text{ and } x_0(i) \neq 0\}}{\#\{i : x_0(i) \neq 0\}}.
\]

We averaged the observable trajectories across the $M$ Monte Carlo realizations, producing empirical averages.
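The per-iteration empirical observables can be computed directly from an iterate; a minimal sketch, assuming a current iterate `x_t` and the true signal `x0`:

```python
import numpy as np

def empirical_observables(x_t, x0):
    """Empirical MSE_Z, MSE_NZ, MDR, FAR of an iterate x_t, following (48)-(49)."""
    zero, nonzero = (x0 == 0), (x0 != 0)
    return {
        "MSE_Z":  np.mean(x_t[zero] ** 2),
        "MSE_NZ": np.mean((x_t[nonzero] - x0[nonzero]) ** 2),
        "MDR":    np.mean(x_t[nonzero] == 0),
        "FAR":    np.mean(x_t[zero] != 0),
    }
```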

The results for the three cases are presented in Figures 7, 8, and 9. Shown in each display are curves indicating both the theoretical prediction and the empirical averages. In the case of the upper row and the lower left panel, the two curves are so close that one cannot easily tell that two curves are in fact being displayed.

O. Coefficient Universality

SE displays invariance of the evolution results with respect to the coefficient distribution of the nonzeros. What happens in practice?

We studied the invariance of AMP results as we varied the distribution of the nonzeros in $x_0$. We considered the problem $\chi = \pm$ and used the following distributions for the non-zero entries of $x_0$:

• Uniform in $[-1, +1]$;
• Rademacher (uniform in $\{+1, -1\}$);
• Gaussian;
• Cauchy.

In this study, $N = 2000$, and we considered $\delta = 0.1, 0.3$. For each value of $\delta$ we considered 20 equispaced values of $\rho$ in the interval $[\rho_{\rm CG}(\delta;\pm) - 1/10,\ \rho_{\rm CG}(\delta;\pm) + 1/10]$, running each time $T = 1000$ AMP iterations. The data are presented in Figure 10.

Each plot displays the fraction of successes ($S/M$) as a function of $\rho$, together with a fitted success probability; i.e., in terms of success probabilities, the curves display $\pi(\rho)$. In each case four curves and four sets of data points are displayed, corresponding to the four ensembles.

Fig. 10. Comparison of success probabilities for different coefficient ensembles; success rate versus $\rho$. Left window: $\delta = 0.10$; right window: $\delta = 0.3$. Red: unit-amplitude (sign) coefficients. Blue: uniform $[-1,1]$. Green: Gaussian. Black: Cauchy. Points: observed success fractions. Curves: logistic fit.

The four datasets are visually quite similar, and it is apparent that a considerable degree of invariance is indeed present.

P. Matrix Universality

The Discussion section in the main text referred to evidence that our results are not limited to the Gaussian distribution.

We conducted a study of AMP where everything was the same as in Figure 1 above, except that the matrix ensemble could change. We considered three such ensembles: USE (columns i.i.d. uniformly distributed on the unit sphere), Rademacher (entries i.i.d. $\pm 1$ equiprobable), and partial Fourier (randomly select $n$ rows from the $N \times N$ Fourier matrix). We only considered the case $\chi = \pm$. Results are shown in Fig. 11 and compared to the theoretical phase transition for $\ell_1$.

Q. Timing Results

In actual applications, AMP runs rapidly. We first describe a study comparing AMP to the LARS algorithm

[28]. LARS is appropriate for comparison because, among the iterative algorithms previously proposed, its phase transition is closest to the $\ell_1$ transition; so it comes closest to duplicating the AMP sparsity-undersampling tradeoff.

Each algorithm proceeds iteratively and needs a stopping rule. In both cases, we stopped the calculations when the relative fidelity measure exceeded $0.999$, i.e. when $\|y - Ax^t\|_2/\|y\|_2 < 0.001$.

In our study, we used the partial Fourier matrix ensemble with unit amplitudes for the nonzero entries of the signal $x_0$. We considered a range of problem sizes $(N, n, k)$ and in each case averaged timing results over $M = 20$ problem instances. Table I presents the timing results.

In all situations studied, AMP is substantially faster than LARS. There are a few very sparse situations, i.e. where $k$ is in the tens or low hundreds, in which LARS performs relatively well, losing the race by less than a factor of 3. However, as the complexity of the objects increases, so that $k$ is several hundred or even one thousand, LARS is beaten by factors of 10 or more.


Fig. 11. Observed phase transitions for different matrix ensembles ($\rho$ versus $\delta$), case $\chi = \pm$. Red: Uniform Spherical Ensemble (Gaussian with normalized column lengths). Magenta: Rademacher ($\pm 1$ equiprobable). Green: partial Fourier. Blue: $\rho_{\ell_1}$.

TABLE I
TIMING COMPARISON OF AMP AND LARS. AVERAGE TIMES IN CPU SECONDS.

    N      n      k     AMP    LARS
  4096    820    120    0.19    0.7
  8192   1640    240    0.34    3.45
 16384   3280    480    0.72   19.45
 32768   1640    160    2.41    7.28
 16384    820     80    1.32    1.51
  8192    820    110    0.61    1.91
 16384   1640    220    1.1     5.5
 32768   3280    440    2.31   23.5
  4096   1640    270    0.12    1.22
  8192   3280    540    0.22    5.45
 16384   6560   1080    0.45   27.3
 32768   1640    220    6.95   17.53

(For very large $k$, AMP has a decisive advantage. When the matrix $A$ is dense, LARS requires at least $c_1\cdot k\cdot n\cdot N$ operations, while AMP requires at most $c_2\cdot n\cdot N$ operations. Here $c_2 = \log\big((\mathbb{E}X^2)/\sigma_T^2\big)/b$ is a bound on the number of iterations, and $(\mathbb{E}X^2)/\sigma_T^2$ is the relative improvement in MSE over $T$ iterations. Hence in terms of flops we have
\[
\frac{{\rm flops(LARS)}}{{\rm flops(AMP)}} \ge \frac{k\,b(\delta,\rho)}{\log\big((\mathbb{E}X^2)/\sigma_T^2\big)}.
\]
The logarithmic dependence of the denominator is very weak, so very roughly this ratio scales directly with $k$.)

We also studied AMP's ability to solve very large problems. We conducted a series of trials with increasing $N$ in a case where $A$ and $A^{*}$ can be applied rapidly, without using ordinary matrix storage and matrix operations; specifically, the partial Fourier ensemble. For the nonzeros of the signal $x_0$ we chose unit amplitudes.
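To illustrate why the partial Fourier ensemble allows this matrix-free operation, here is a minimal sketch of how such an operator might be set up (our own assumption of a convenient construction, not the paper's code): $A$ selects a random subset `rows` of the rows of the unitary $N\times N$ DFT matrix, so $A$ and $A^{*}$ are applied with one FFT each in $O(N\log N)$ time.

```python
import numpy as np

def make_partial_fourier(N, n, rng=np.random.default_rng(0)):
    """Return functions applying A and its adjoint A* for a random-row partial Fourier matrix."""
    rows = rng.choice(N, size=n, replace=False)

    def A(x):                      # y = A x: FFT, then keep the selected rows
        return np.fft.fft(x, norm="ortho")[rows]

    def At(y):                     # x = A* y: place y on the selected rows, then inverse FFT
        z = np.zeros(N, dtype=complex)
        z[rows] = y
        return np.fft.ifft(z, norm="ortho")

    return A, At
```

Each AMP iteration then costs two FFTs plus $O(N)$ work, which is the $N\log N$ per-iteration scaling discussed below.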

We considered the fixed choice $(\delta,\rho) = (1/6, 1/8)$ and $N$ ranging from 1K (K $= 1024$) to 256K in powers of 2. At each signal length $N$ we generated $M = 10$ random problem instances and measured CPU times (on a single Pentium 4 processor) and iteration counts for AMP in each instance. We considered four stopping rules, based on MSE targets $\sigma^2$, $\sigma^2/2$, $\sigma^2/4$, and $\sigma^2/8$, where $\sigma^2 = 12\cdot 10^{-5}$. We then averaged the timing results over the $M = 10$ randomly generated problem instances.

Fig. 12. Iteration counts versus signal length $N$. Different curves show results for different stopping rules. Horizontal axis: signal length $N$. Vertical axis: number of iterations $T$. Blue, green, red, and aqua curves correspond to stopping thresholds (normalized MSE) of $1.53\cdot10^{-5}$, $3.05\cdot10^{-5}$, $6.10\cdot10^{-5}$, and $1.22\cdot10^{-4}$ respectively, i.e. $12\cdot10^{-5}\cdot 2^{-\ell}$ with $\ell = 3, 2, 1, 0$. Each doubling of accuracy costs about 5 iterations.

Figure 12 presents the number of iterations as a function of the problem size and the accuracy level. According to the SE formalism, this should be a constant independent of $N$ at each fixed $(\delta,\rho)$, and we see that this is indeed the case for AMP: the number of iterations is close to constant for all large $N$. Also according to the SE formalism, each additional iteration produces a proportional reduction in formal MSE, and indeed in practice each increment of 5 AMP iterations reduces the actual MSE by about half.

Figure 13 presents CPU time as a function of the problem size and the accuracy level. Since we are using the partial Fourier ensemble, the cost of applying $A$ and $A^{*}$ is proportional to $N\log(N)$; this is much less than what we would expect for the cost of applying a general dense matrix. We see that AMP execution time indeed scales very favorably with $N$ in this case; to the eye, the timing seems practically linear in $N$. The timing results show that each doubling of $N$ produces essentially a doubling of execution time, and each doubling of accuracy costs about 30% more computation time.


Fig. 13. CPU time scaling with $N$. Different curves show results for different stopping rules. Horizontal axis: signal length $N$. Vertical axis: CPU time (seconds). Blue, green, red, and aqua curves correspond to stopping thresholds (normalized MSE) of $1.53\cdot10^{-5}$, $3.05\cdot10^{-5}$, $6.10\cdot10^{-5}$, and $1.22\cdot10^{-4}$ respectively.