

THE SPECTRAL NORM OF RANDOM

INNER-PRODUCT KERNEL MATRICES

By

Zhou Fan and Andrea Montanari

Stanford University

Technical Report No. 2015-14 July 2015

This research was supported in part by National Science Foundation grants CCF 1319979 and DMS 1106627 and Air Force Office of Scientific Research grant FA9550-13-1-0036.

Department of Statistics STANFORD UNIVERSITY

Stanford, California 94305-4065

http://statistics.stanford.edu


THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL MATRICES

ZHOU FAN¹ AND ANDREA MONTANARI¹,²

Abstract. We study the spectra of p×p random matrices K with off-diagonal (i, j) entry equal to n^{−1/2} k(X_i^T X_j/n^{1/2}), where the X_i are the rows of a p×n matrix with i.i.d. entries and k is a scalar function. It is known that under mild conditions, as n and p increase proportionally, the empirical spectral measure of K converges to a deterministic limit µ. We prove that if k is a polynomial and the distribution of the entries of X_i is symmetric and satisfies a general moment bound, then K is the sum of two components, the first with spectral norm converging to ‖µ‖ (the maximum absolute value of the support of µ) and the second a perturbation of rank at most two. In certain cases, including when k is an odd polynomial function, the perturbation is 0 and the spectral norm ‖K‖ converges to ‖µ‖. If the entries of X_i are Gaussian, we also prove that ‖K‖ converges to ‖µ‖ for a large class of odd non-polynomial functions k. In general, the perturbation may contribute spike eigenvalues to K outside of its limiting support, and we conjecture that they have deterministic limiting locations as predicted by a deformed GUE model. Our study of such matrices is motivated by the analysis of statistical thresholding procedures to estimate sparse covariance matrices from multivariate data, and our results imply an asymptotic approximation to the spectral norm error of such procedures when the population covariance is the identity.

1. Introduction

Let X ∈ R^{p×n} be a random matrix with i.i.d. entries of zero mean and unit variance, and let X_1^T, . . . , X_p^T denote the rows of X. We study in this paper random matrices K(X) ∈ R^{p×p} having entries (K(X))_{ii'} = (1/√n) k(X_i^T X_{i'}/√n) for all i ≠ i' and (K(X))_{ii} = 0 for all i, for a function k : R → R. Our main results pertain to the spectral properties of K(X), specifically its spectral norm, in the asymptotic regime of random matrix theory where n, p → ∞ proportionally with p/n → γ ∈ (0, ∞), and k is a fixed function independent of n and p. The study of such matrices K(X) in this regime was initiated by Xiuyuan Cheng and Amit Singer in [9]. Following [9], we will call k a “kernel function” and K(X) a “random inner-product kernel matrix”.

Our study of this random matrix model is motivated by the following statistical application: Suppose Y_1, . . . , Y_n ∈ R^p represent n i.i.d. observations of a random vector with mean zero and unknown covariance matrix Σ ∈ R^{p×p}, and we wish to estimate Σ from these observations. If p is of comparable size to n, then the standard sample covariance matrix Σ̂ = (1/n) ∑_{i=1}^n Y_i Y_i^T is a poor estimator of Σ, and in general one cannot hope to estimate Σ with high accuracy. However, if it is known a priori that Σ is sufficiently sparse, i.e. most of its off-diagonal entries are zero, then it may still be possible to estimate Σ accurately under this assumption. A popular procedure for performing this estimation is to apply elementwise hard-thresholding to Σ̂, i.e. to preserve those entries of Σ̂ that are greater than τ in magnitude and to set the remaining entries to 0, for some threshold level τ := τ(n, p) [4, 13]. Modifications of this procedure that apply continuous thresholding functions

¹Department of Statistics, Stanford University. ²Department of Electrical Engineering, Stanford University.
E-mail addresses: [email protected], [email protected].
ZF is supported by a Hertz Foundation Fellowship and an NDSEG Fellowship (DoD, Air Force Office of Scientific Research, 32 CFR 168a). AM is partially supported by NSF grants CCF-1319979 and DMS-1106627 and the AFOSR grant FA9550-13-1-0036.


elementwise to Σ̂ have also been proposed and studied [25]. Such procedures have found application not only in problems where covariance estimation is the end goal, but also as subroutines of other statistical methods that require covariance estimation as an intermediate step, including procedures for sparse principal components analysis [18, 10] and sparse linear discriminant analysis [26].

If the true covariance matrix Σ is truly sparse in a suitable sense, and if the threshold level is set to τ = c√(log p / n) for a sufficiently large constant c > 0, then it may be shown that the spectral norm error of such an estimator converges to zero as n, p → ∞ [4, 13], and furthermore that the rate of convergence is minimax optimal over certain classes of sparse matrices [5]. However, these results do not provide an accurate estimate of how large one should expect this error to be for any specific choice of threshold level τ. To study this question, it is more natural to consider the asymptotic regime in which hard-thresholding is performed at the level τ = c/√n for a constant c, so that τ is of the same order as the typical “noise level” of the off-diagonal entries of Σ̂. Then the entries of this thresholded covariance estimator are precisely given by (1/√n) k(X_i^T X_{i'}/√n), where X_i^T = (Y_{1i}, . . . , Y_{ni}) and k(x) is a thresholding function (e.g. k(x) = x·1{|x| ≥ c} in the case of hard-thresholding). If the true covariance matrix Σ is Id_{p×p}, the p×p identity matrix, and the Y_{ji} are in fact i.i.d., then we are led to the matrix model K(X) studied in this paper. As the diagonal of Σ̂ (thresholded or not) converges to Id_{p×p} in spectral norm under weak conditions when Σ = Id_{p×p}, we define K(X) with zero diagonal, so that the spectral norm ‖K(X)‖ is asymptotically equivalent to the spectral norm error of the thresholded covariance estimator. One of our main results, Theorem 2.10 below, will imply that for an odd and continuously differentiable thresholding function k, if the data has standard Gaussian distribution and Σ = Id_{p×p}, then the spectral norm error of the thresholded covariance estimator converges almost surely to a positive deterministic constant in this asymptotic regime. Our theorem characterizes the dependence of this constant on the thresholding function k. For a given problem of size n and p and threshold level τ, this provides an asymptotic approximation to the error of the covariance thresholding estimator. We believe that many of the conditions of Theorem 2.10, such as the normality of the data and the continuous differentiability of the thresholding function k, may be relaxed.
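As a concrete illustration of this correspondence, the following short numerical sketch (ours, not part of the original report) forms the hard-thresholded sample covariance at level τ = c/√n for Gaussian data with Σ = Id_{p×p}; the sizes n, p, the constant c, and the random seed are arbitrary illustrative choices.

```python
import numpy as np

# Sketch (ours): hard-thresholding covariance estimator at level tau = c/sqrt(n),
# with Sigma = Id and Gaussian data; n, p, c, and the seed are illustrative.
rng = np.random.default_rng(0)
n, p, c = 3000, 1000, 1.5
Y = rng.standard_normal((n, p))            # rows Y_1, ..., Y_n, covariance Id

S = Y.T @ Y / n                            # sample covariance (p x p)
tau = c / np.sqrt(n)
S_thr = S * (np.abs(S) >= tau)             # elementwise hard-thresholding

# Off the diagonal, S_thr equals (1/sqrt(n)) k(X_i^T X_{i'} / sqrt(n)) with
# k(x) = x * 1{|x| >= c}; the spectral norm error is driven by these entries.
print("spectral norm error:", np.linalg.norm(S_thr - np.eye(p), 2))
```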

Whether ‖K(X)‖ converges is also a natural question to ask from the perspective of random matrix theory, and it was posed as an open question in [9]. For the identity kernel k(x) = x, K(X) is equal to the sample covariance matrix (1/n)XX^T, excluding the diagonal. Under weak moment conditions on the entries x_{ij} of X, it is easily verified when k(x) = x that ‖K(X) − ((1/n)XX^T − Id_{p×p})‖ → 0 a.s. as n, p → ∞. Hence, when k(x) = x, many asymptotic properties of the spectrum of K(X), including the limit of its empirical spectral measure and the limit of its largest and smallest eigenvalues, match those of the matrix (1/n)XX^T translated by −1. The asymptotic spectral behavior of (1/n)XX^T in the large n and p limit is, by now, well-understood. The empirical spectral measure of (1/n)XX^T converges weakly a.s. to a deterministic limit µ_{MP,γ} with compact support, known as the Marcenko-Pastur law [23]. Almost-sure convergence of the largest eigenvalue of (1/n)XX^T to the upper endpoint of the support of µ_{MP,γ} was proven in [16] assuming certain moment conditions, and these conditions were later weakened in [34] to existence of the fourth moment of x_{ij}. Almost-sure convergence of the smallest eigenvalue of (1/n)XX^T to the lower endpoint of the support of µ_{MP,γ}, when γ < 1, was shown in [1]. The fluctuations of the extremal eigenvalues of (1/n)XX^T around their almost sure limits are also understood; for instance, assuming that x_{ij} ∼ N(0, 1), it was shown in [17] that, after appropriate centering and rescaling, the distribution of the largest eigenvalue of (1/n)XX^T converges weakly to the Tracy-Widom law of order 1, first introduced in [30] as the limiting distribution of the largest eigenvalue of the Gaussian Orthogonal Ensemble. This result has been extended to more general distributions of x_{ij} satisfying exponential decay conditions in [24].

For the case of a general kernel function k, the main theorem of [9] establishes that, if x_{ij} ∼ N(0, 1), then the empirical spectral measure of K(X) also converges weakly a.s. to a deterministic limit, under certain mild assumptions on k. This limit distribution, which we denote as µ_{a,ν,γ} (c.f. Definition 2.3 below), depends on k through its orthogonal decomposition in the Hermite polynomial basis of L²(q(x)dx), where q(x) = (1/√(2π)) e^{−x²/2} is the standard Gaussian density. When k(x) = x, µ_{a,ν,γ} coincides with the Marcenko-Pastur law µ_{MP,γ} translated by −1. Quite remarkably, if k has no linear component in its Hermite polynomial decomposition, then the limiting spectral distribution µ_{a,ν,γ} is an appropriately scaled version of Wigner's semicircle law. In the general setting, a characterization of the limiting measure µ_{a,ν,γ} was given in [9] in terms of an implicit equation in its Stieltjes transform, restated as eq. (1) below. This result was shown to hold for more general distributions of x_{ij} having moments of all orders in Theorem 3 of [11].

We observe that the limiting spectral distribution µ_{a,ν,γ} of K(X) is in fact the additive free convolution, as defined by Dan Voiculescu in [32], of a scaled semicircle law and a scaled and translated Marcenko-Pastur law. As such, it is also the limiting spectral distribution of a random matrix W + V, for W ∈ C^{p×p} a real symmetric or complex Hermitian Wigner matrix and V ∈ R^{p×p} a deterministic diagonal matrix whose empirical spectral measure converges to this scaled and translated Marcenko-Pastur law [33, 12]. Such matrices have been referred to as “deformed Wigner matrices” or “Wigner matrices with external source” in the random matrix theory literature, and they have been the subject of much recent study. For instance, if W is distributed as the Gaussian Unitary Ensemble and V does not have eigenvalues outside of the support of its limiting spectrum, then the results of [6] and [22] imply that in the limit of large n and p, no eigenvalues of W + V fall outside of supp(µ_{a,ν,γ}) + (−ε, ε) a.s. for any ε > 0, and in particular, the largest and smallest eigenvalues of W + V converge to the upper and lower boundaries of supp(µ_{a,ν,γ}). More generally, if V has a finite number of fixed “spike” eigenvalues outside of its limiting support, then W + V may have corresponding spike eigenvalues outside of supp(µ_{a,ν,γ}) as well, and the conditions for existence and the limiting locations of these spike eigenvalues of W + V are fully characterized in [6]. Under various assumptions on W and V, more detailed results regarding the fluctuations of the spike eigenvalues of W + V and the eigenvalues at the edges of supp(µ_{a,ν,γ}) have also been obtained, see e.g. [27, 7, 20].

In comparison, the spectral norm and the largest and smallest eigenvalues of the random inner-product kernel matrix K(X) are not well-understood when the kernel function k is nonlinear. In previous work, Lemma 4.2 of [9] showed that E‖K(X)‖ ≤ O_d(n^{1/4}) when the kernel function k(x) := k_n(x) is the degree-d orthogonal polynomial with respect to the distribution of X_i^T X_{i'}/√n, and the authors conjectured that the true size of E‖K(X)‖ in this case should be O_d(1). Proposition 6.2 of [10] used Gaussian concentration of measure results and a covering argument to show that, when the x_{ij} have Gaussian distribution, ‖K(X)‖ ≤ O_τ(1) with high probability when k(x) = sgn(x)(|x| − τ)_+ is the soft-thresholding function at level τ > 0. We are not aware of existing results that establish whether ‖K(X)‖, λ_max(K(X)), and λ_min(K(X)) converge to the respective quantities sup{|x| : x ∈ supp(µ_{a,ν,γ})}, sup{x : x ∈ supp(µ_{a,ν,γ})}, and inf{x : x ∈ supp(µ_{a,ν,γ})} for any nonlinear kernel function k.

In this paper, we provide an analysis of the spectral norm ‖K(X)‖ for the case where k is a polynomial function and the distribution of x_{ij} is symmetric and satisfies Assumption 2.1 below. Our results imply (Corollary 2.7) that ‖K(X)‖ converges a.s. to ‖µ_{a,ν,γ}‖ := sup{|x| : x ∈ supp(µ_{a,ν,γ})} either when x_{ij} ∼ (1/2)δ_{−1} + (1/2)δ_1 or when the Hermite polynomial decomposition of k has no degree-2 component. The latter condition holds, in particular, if k is an odd polynomial function. More generally (Theorem 2.6), we exhibit a decomposition K(X) = K̃(X) + R̃(X), where the empirical spectral distribution of K̃(X) converges weakly a.s. to µ_{a,ν,γ}, ‖K̃(X)‖ converges a.s. to ‖µ_{a,ν,γ}‖, and R̃(X) is a rank-two perturbation matrix whose two nonzero eigenvalues converge to fixed quantities. In the limit of large n and p, R̃(X) may contribute “spike” eigenvalues to K(X) that fall outside of supp(µ_{a,ν,γ}) and in particular may be larger in magnitude than ‖µ_{a,ν,γ}‖, and we conjecture that these spike eigenvalues have deterministic limiting locations that match those of a deformed Wigner model as characterized in [6].

We believe that our main result may be extended to certain classes of non-polynomial functions k via polynomial approximation arguments, although the details are non-trivial. As one instance of such an extension, we show that if k is an odd and continuously differentiable function whose derivative k'(x) grows at most exponentially in x, then ‖K(X)‖ converges to ‖µ_{a,ν,γ}‖ when the x_{ij} have Gaussian distribution (Theorem 2.10). Such an extension is important for the applicability of our results to the statistical covariance estimation application previously discussed, and further relaxations of the conditions imposed on k and the distribution of x_{ij} are an interesting avenue for future work.

Let us remark that in order for the empirical spectrum of a kernel random matrix to converge to a deterministic limit, the entries of the matrix must be scaled appropriately as n and p increase. When k is nonlinear, scalings inside and outside of the kernel function k are not interchangeable and can lead to very different asymptotic behaviors. When the x_{ij} are independent with unit variance, X_i^T X_{i'} is of typical size O(n) for i = i' and O(√n) for i ≠ i'. In [14], Noureddine El Karoui first explored the spectral behavior of kernel random matrices in the large n and p limit under the scaling (K(X))_{ii'} = k(X_i^T X_{i'}/n) for a fixed function k. Under this scaling, k is applied to diagonal entries of size O(1) and to off-diagonal entries of size O(1/√n). Theorem 2.1 of [14] implies that if k is locally smooth at 0 and 1, then ‖K(X) − M(X)‖ → 0 for the simpler matrix

M(X) = (k(0) + k''(0)/(2n)) 1_p 1_p^T + k'(0) (XX^T)/n + (k(1) − k(0) − k'(0)) Id_{p×p}.

In particular, the off-diagonal entries of M(X) depend on k only through its Taylor expansion at 0. As a consequence of this result, the limiting spectral distributions of K(X) and M(X) are identical, and given by shifting and rescaling the Marcenko-Pastur law. Furthermore, the largest and smallest eigenvalues of K(X) and M(X) have the same almost sure limits. It was concluded in [14] that the asymptotic spectral properties of K(X) are “essentially linear”. In contrast, we remove the matrix diagonal and apply the scaling (K(X))_{ii'} = (1/√n) k(X_i^T X_{i'}/√n) to the off-diagonal entries, again for a fixed function k. This scaling applies k to off-diagonal entries of typical size O(1) and yields the more interesting “nonlinear” behavior discovered in [9]. Consideration of this scaling is strongly motivated by the covariance thresholding application previously discussed.
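The following tiny Monte Carlo check (ours, with an arbitrary dimension and seed) illustrates the typical sizes quoted above for the two scalings.

```python
import numpy as np

# Sketch (ours): typical sizes of the inner products under the two scalings.
rng = np.random.default_rng(1)
n = 10000
Xi, Xj = rng.standard_normal(n), rng.standard_normal(n)

print("X_i^T X_i / n       =", Xi @ Xi / n)            # ~ 1 (diagonal argument in [14])
print("X_i^T X_j / sqrt(n) =", Xi @ Xj / np.sqrt(n))   # O(1) (off-diagonal argument here)
print("X_i^T X_j / n       =", Xi @ Xj / n)            # O(1/sqrt(n)) (off-diagonal in [14])
```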

A formal statement of our definitions, assumptions, theorems, and conjectures, as well as a high-level outline of the proof, is provided in Section 2. The proof of our main result regarding polynomial kernel functions k is given in Sections 3–5, with details deferred to two appendices. Finally, the proof of our extension to odd and continuously differentiable kernel functions in the Gaussian case is given in Section 6.

2. Main theorems and discussion

2.1. Background and statement of results. Let X = (x_{ij} : 1 ≤ i ≤ p, 1 ≤ j ≤ n) ∈ R^{p×n} be a matrix whose entries x_{ij} are a collection of independent and identically distributed real random variables, satisfying the following conditions:

Assumption 2.1.
(1) E[x_{ij}] = 0 and E[x_{ij}²] = 1.
(2) E[|x_{ij}|^k] ≤ k^{αk} for all k ≥ 2 and some α > 0.
(3) The distribution of x_{ij} is symmetric, i.e. x_{ij} and −x_{ij} are equal in law.

Let X_i^T = (x_{ij} : 1 ≤ j ≤ n) ∈ R^n denote the ith row of X.

Definition 2.2. For a kernel function k : R → R and X ∈ R^{p×n}, the random inner-product kernel matrix associated to k and X is given by K_{n,p}(X) = (k_{ii'} : 1 ≤ i, i' ≤ p) ∈ R^{p×p} with entries

k_{ii'} = (1/√n) k(X_i^T X_{i'}/√n) for i ≠ i', and k_{ii'} = 0 for i = i'.
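For reference, the following is a minimal numerical sketch (ours, not the authors' code) of how K_{n,p}(X) in Definition 2.2 can be formed; the kernel, sizes, and threshold level are illustrative choices.

```python
import numpy as np

def kernel_matrix(X, k):
    # Sketch (ours) of Definition 2.2: off-diagonal entries k(X_i^T X_{i'}/sqrt(n))/sqrt(n),
    # zero diagonal.
    p, n = X.shape
    G = X @ X.T / np.sqrt(n)
    K = k(G) / np.sqrt(n)
    np.fill_diagonal(K, 0.0)
    return K

# Example usage with a hard-thresholding kernel at an illustrative level c = 1.
rng = np.random.default_rng(2)
p, n = 500, 1500
X = rng.standard_normal((p, n))
K = kernel_matrix(X, lambda x: x * (np.abs(x) >= 1.0))
print("||K_{n,p}(X)|| =", np.linalg.norm(K, 2))
```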

Throughout, we will use capital variables to denote matrices and vectors, and lowercase variables to denote (random or deterministic) scalars. We will use i, i', i_1, i_2, . . . to denote indices in {1, . . . , p} and j, j', j_1, j_2, . . . to denote indices in {1, . . . , n}. The notation K_{n,p} refers to a matrix whose definition depends on n and p; k_{ii'} refers to the (i, i') entry of K_{n,p}, and for convenience we suppress the dependence of k_{ii'} on n, p.

Assumption 2.1 specifies conditions on the distribution of x_{ij}. The moment condition in part (2) of Assumption 2.1 was also assumed in the analysis of the spectral norm of standard sample covariance matrices in [16], and in particular, it is satisfied by any sub-Gaussian or sub-exponential random variable (c.f. Definitions 5.7 and 5.13 of [31]). It is possible to weaken this assumption using truncation arguments, and we have not made an attempt to do so here. Part (3) of Assumption 2.1, that the distribution of x_{ij} is symmetric, is required for our subsequent combinatorial arguments, although it is probably not a necessary condition for our main results to hold.

The study of matrices K_{n,p}(X) in Definition 2.2 in the asymptotic regime of large n and p was initiated by Xiuyuan Cheng and Amit Singer in [9], in which it was shown that the empirical spectral distribution of K_{n,p}(X) converges weakly a.s. to a deterministic limit. To describe this limit, recall that for a probability measure µ on R, its Stieltjes transform is the function m : C^+ → C^+ given by

m(z) = ∫ µ(dλ)/(λ − z),

where C^+ = {z ∈ C : Im z > 0} is the upper-half complex plane. The measure µ is uniquely determined by its Stieltjes transform and may be recovered from the inversion formula

µ([a, b]) = lim_{ε→0} (1/π) ∫_a^b Im m(λ + iε) dλ

for any a, b that are continuity points of µ. (See Theorem B.8 of [2].)

Definition 2.3. For ν, γ > 0 and a ∈ [−√ν, √ν], let m(z) := m_{a,ν,γ}(z) be the unique solution to the equation

−1/m(z) = z + a (1 − 1/(1 + aγ m(z))) + γ(ν − a²) m(z),    z ∈ C^+    (1)

with m(z) ∈ C^+. Let µ_{a,ν,γ} be the measure on R having Stieltjes transform m(z), let supp(µ_{a,ν,γ}) denote its support, and let ‖µ_{a,ν,γ}‖ = sup{|x| : x ∈ supp(µ_{a,ν,γ})}.
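Definition 2.3 is implicit, but m(z), and hence the density and support of µ_{a,ν,γ}, can be approximated numerically. The sketch below (ours, not from the paper) solves eq. (1) by a damped fixed-point iteration and applies the Stieltjes inversion formula; the grid, damping, imaginary offset, and kernel parameters are arbitrary illustrative choices, and convergence of this iteration is not guaranteed in general.

```python
import numpy as np

def stieltjes_m(z, a, nu, gamma, iters=3000, damp=0.5):
    # Damped fixed-point iteration for eq. (1):
    #   -1/m = z + a*(1 - 1/(1 + a*gamma*m)) + gamma*(nu - a^2)*m,   m(z) in C+.
    m = 1j * np.ones_like(z)                      # start in the upper half-plane
    for _ in range(iters):
        rhs = z + a * (1 - 1 / (1 + a * gamma * m)) + gamma * (nu - a**2) * m
        m = damp * m + (1 - damp) * (-1 / rhs)
    return m

# Density of mu_{a,nu,gamma} via Stieltjes inversion: density(x) ~ Im m(x + i*eta) / pi.
a, nu, gamma = 0.9, 1.81, 5.0        # e.g. k = 0.9*h_1 + h_2, so a = 0.9, nu = 0.81 + 1
x = np.linspace(-8.0, 8.0, 1601)
dens = stieltjes_m(x + 1e-3j, a, nu, gamma).imag / np.pi
print("rough support radius ||mu_{a,nu,gamma}|| ~", np.abs(x)[dens > 1e-2].max())
```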

It was shown in [9] that m(z) is well-defined and that µ_{a,ν,γ} is a probability measure with continuous density and compact support. (Note that in our notation, n and p are reversed from their definitions in [9], and γ = lim_{n,p→∞} p/n corresponds to 1/γ in [9].) The following result establishes weak convergence of the empirical spectral measure of K_{n,p}(X) to µ_{a,ν,γ}; it is implied by Theorem 3 of [11] and Remark 3.2 and Lemma C.2 of [9]. The result was first shown in [9] in the case where x_{ij} ∼ N(0, 1) and was extended to more general distributions of x_{ij} in [11] via Lindeberg's swapping argument.

Definition 2.4. Let q(x) = (1/√(2π)) e^{−x²/2} be the standard Gaussian density. Let {h_d}_{d=0}^∞ be the Hermite polynomials orthonormal with respect to L²(q(x)dx), i.e. h_d is of degree d and, for ξ ∼ N(0, 1), E[h_d(ξ) h_{d'}(ξ)] = 1 if d = d' and 0 if d ≠ d'.
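For a concrete kernel, the coefficients a_d = E[k(ξ)h_d(ξ)] appearing in Theorems 2.5 and 2.6 below can be computed by Gaussian quadrature. The sketch below (ours, not from the paper) uses numpy's probabilists' Hermite polynomials; the quadrature order and the test kernel k(x) = x³ (for which a_1 = 3, a_3 = √6, and ν = E[ξ⁶] = 15) are illustrative choices.

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermegauss, hermeval

def hermite_coeffs(k, D):
    # Sketch (ours): a_d = E[k(xi) h_d(xi)], xi ~ N(0,1), for the orthonormal
    # Hermite polynomials h_d = He_d / sqrt(d!) of Definition 2.4.
    x, w = hermegauss(200)                 # quadrature for the weight e^{-x^2/2}
    w = w / np.sqrt(2 * np.pi)             # normalize to the standard Gaussian density q(x)
    out = []
    for d in range(D + 1):
        coef = np.zeros(d + 1)
        coef[d] = 1.0
        h_d = hermeval(x, coef) / np.sqrt(factorial(d))
        out.append(np.sum(w * k(x) * h_d))
    return np.array(out)

# Test kernel k(x) = x^3: expect a_1 = 3, a_3 = sqrt(6), nu = E[xi^6] = 15.
a = hermite_coeffs(lambda x: x ** 3, D=6)
print("a_1 =", a[1], " a_3 =", a[3], " nu =", np.sum(a[1:] ** 2))
```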

Theorem 2.5 ([9, 11]). Suppose k : R → R satisfies E[k(ξ)] = 0 and E[k(ξ)²] < ∞ for ξ ∼ N(0, 1). Let k(x) = ∑_{d=1}^∞ a_d h_d(x) be the expansion of k in the Hermite polynomial basis, where convergence of the sum is in the sense of L²(q(x)dx), and let a = a_1 = E[ξ k(ξ)] and ν = ∑_{d=1}^∞ a_d² = E[k(ξ)²]. Let X ∈ R^{p×n} be a random matrix with i.i.d. entries having finite moments of all orders, and let X_i^T denote the ith row of X. Suppose that |E[k(ξ)²] − E[k(X_1^T X_2/√n)²]| → 0 as n → ∞, let K_{n,p}(X) be as in Definition 2.2 with kernel function k, and let λ_1, . . . , λ_p be the eigenvalues of K_{n,p}(X). Then, as n, p → ∞ with p/n → γ ∈ (0, ∞),

(1/p) ∑_{i=1}^p δ_{λ_i} ⇒ µ_{a,ν,γ},

where (1/p) ∑_{i=1}^p δ_{λ_i} is the empirical spectral measure of K_{n,p}(X), µ_{a,ν,γ} is the measure in Definition 2.3, and the convergence holds weakly a.s.

This weak convergence result does not necessarily imply convergence of ‖K_{n,p}(X)‖ to ‖µ_{a,ν,γ}‖, as it only ensures that there are at most o(n) eigenvalues of K_{n,p}(X) falling outside of the support of µ_{a,ν,γ}. The following theorem, which is the first main result of this paper, examines this question of convergence of spectral norm in the case where the kernel k(x) is a polynomial function.

Theorem 2.6. Suppose k(x) = ∑_{d=1}^D a_d h_d(x), where h_d is the Hermite polynomial of degree d as in Definition 2.4 (so k(x) is a polynomial function with E[k(ξ)] = 0 for ξ ∼ N(0, 1)). Let a = a_1 = E[ξ k(ξ)] and ν = a_1² + . . . + a_D² = E[k(ξ)²]. Let X satisfy Assumption 2.1, let K_{n,p}(X) be as in Definition 2.2 with kernel function k, and suppose n, p → ∞ with p/n → γ. Then K_{n,p}(X) = K̃_{n,p}(X) + R̃_{n,p}(X), where

(1) lim_{n,p→∞} ‖K̃_{n,p}(X)‖ = ‖µ_{a,ν,γ}‖ a.s.

(2) R̃_{n,p}(X) is of rank at most two. Specifically, letting V_{n,p}(X) = (v_i : 1 ≤ i ≤ p) ∈ R^p be the vector with entries v_i = ∑_{j=1}^n (x_{ij}² − 1)/√n and denoting 1_p = (1, . . . , 1) ∈ R^p,

R̃_{n,p}(X) = (a_2/(n√2)) (V_{n,p}(X) 1_p^T + 1_p V_{n,p}(X)^T).
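The rank-two matrix R̃_{n,p}(X) in part (2) is inexpensive to form directly. The following sketch (ours, with arbitrary sizes, a_2 treated as a given constant, and Laplace entries so that E[x_{ij}⁴] = 6) constructs it and compares its two nonzero eigenvalues with the limit identified in Corollary 2.8 below.

```python
import numpy as np

# Sketch (ours): the rank-two matrix R~_{n,p}(X) of Theorem 2.6.
rng = np.random.default_rng(3)
p, n, a2 = 2000, 1000, 1.0
X = rng.laplace(scale=1 / np.sqrt(2), size=(p, n))     # symmetric, unit variance, E[x^4] = 6

v = (X ** 2 - 1).sum(axis=1) / np.sqrt(n)              # entries v_i of V_{n,p}(X)
ones = np.ones(p)
R = (a2 / (n * np.sqrt(2))) * (np.outer(v, ones) + np.outer(ones, v))

eig = np.linalg.eigvalsh(R)
print("nonzero eigenvalues of R~:", eig[0], eig[-1])
print("Corollary 2.8 limit: +/-", a2 * (p / n) * np.sqrt((6 - 1) / 2))
```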

Note that in this theorem, K̃_{n,p}(X) also has µ_{a,ν,γ} as its limiting spectral measure, since this limit is unaffected by perturbations of finite rank. (See Theorem A.43 of [2].) Hence this theorem characterizes K_{n,p}(X) as the sum of two components, the first of which has spectral norm converging to ‖µ_{a,ν,γ}‖ as expected, and the second of which might contribute additional spike eigenvalues to K_{n,p}(X) that may be larger in magnitude than ‖µ_{a,ν,γ}‖. This theorem has the following immediate corollaries.

Corollary 2.7. Under the assumptions of Theorem 2.6, if a_2 = 0 or if x_{ij} ∼ (1/2)δ_{−1} + (1/2)δ_1, then lim_{n,p→∞} ‖K_{n,p}(X)‖ = ‖µ_{a,ν,γ}‖ almost surely. In particular, if k is an odd polynomial function, i.e. k(−x) = −k(x), then lim_{n,p→∞} ‖K_{n,p}(X)‖ = ‖µ_{a,ν,γ}‖ almost surely.

Proof. If a_2 = 0 or x_{ij} ∼ (1/2)δ_{−1} + (1/2)δ_1, then R̃_{n,p}(X) = 0, so the result follows from Theorem 2.6. If k is an odd function, then a_2 = 0. □

Corollary 2.8. Under the assumptions of Theorem 2.6, if a_2 ≠ 0 and x_{ij} is not distributed as (1/2)δ_{−1} + (1/2)δ_1, then the two non-zero eigenvalues of R̃_{n,p}(X) converge a.s. to ±a_2 γ √((E[x_{ij}⁴] − 1)/2).

Proof. Letting V := V_{n,p}(X) = (v_i : 1 ≤ i ≤ p) be as in Theorem 2.6, we may compute Tr R̃_{n,p}(X) = (a_2 √2/n) ∑_{i=1}^p v_i and Tr R̃_{n,p}(X)² = (a_2²/n²)((∑_{i=1}^p v_i)² + p‖V‖₂²). If λ_1 and λ_2 are the two non-zero eigenvalues of R̃_{n,p}(X), then this implies λ_1 + λ_2 = (a_2 √2/n) ∑_{i=1}^p v_i and λ_1 λ_2 = ((λ_1 + λ_2)² − λ_1² − λ_2²)/2 = (a_2²/(2n²))((∑_{i=1}^p v_i)² − p‖V‖₂²), so λ_1 and λ_2 are the roots of the equation

λ² − ((a_2 √2/n) ∑_{i=1}^p v_i) λ + (a_2²/(2n²))((∑_{i=1}^p v_i)² − p‖V‖₂²) = 0.

By the law of large numbers, lim_{n,p→∞} (1/n) ∑_{i=1}^p v_i = 0 and lim_{n,p→∞} p‖V‖₂²/n² = γ²(E[x_{ij}⁴] − 1) almost surely. Since the roots of a polynomial are continuous in its coefficients, the result follows. □

Corollary 2.9. Under the assumptions of Theorem 2.6,

lim sup_{n,p→∞} ‖K_{n,p}(X)‖ ≤ ‖µ_{a,ν,γ}‖ + |a_2| γ √((E[x_{ij}⁴] − 1)/2) < ∞

almost surely.

Proof. As ‖K_{n,p}(X)‖ ≤ ‖K̃_{n,p}(X)‖ + ‖R̃_{n,p}(X)‖, this follows from Theorem 2.6 and Corollary 2.8. □

Corollary 2.7 implies that if k(x) is an odd polynomial function, then ‖K_{n,p}(X)‖ → ‖µ_{a,ν,γ}‖. In the case where the x_{ij} have Gaussian distribution, we extend this conclusion to more general odd kernel functions in our second main result.

Theorem 2.10. Suppose k : R → R is an odd function, i.e. k(−x) = −k(x), and that it is continuously differentiable with lim sup_{|x|→∞} log|k'(x)|/|x| < ∞. Let k(x) = ∑_{d=1}^∞ a_d h_d(x) be the expansion of k in the Hermite polynomial basis of Definition 2.4, where convergence of the sum is in the sense of L²(q(x)dx), and let a = a_1 = E[ξ k(ξ)] and ν = ∑_{d=1}^∞ a_d² = E[k(ξ)²] for ξ ∼ N(0, 1). Let X ∈ R^{p×n} have entries x_{ij} i.i.d. ∼ N(0, 1), let K_{n,p}(X) be as in Definition 2.2 with kernel function k, and let n, p → ∞ with p/n → γ. Then, almost surely,

lim_{n,p→∞} ‖K_{n,p}(X)‖ = ‖µ_{a,ν,γ}‖.

Note that, by the assumptions of Theorem 2.10, |k(x)| ≤ C e^{C|x|} for some constant C > 0 and all x ∈ R, so ν = E[k(ξ)²] < ∞ and the Hermite polynomial decomposition of k is well-defined.
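In the covariance thresholding application, Theorem 2.10 gives the limiting spectral norm error once a = a_1 and ν = E[k(ξ)²] are known. The sketch below (ours, not from the paper) computes these two quantities by Gaussian quadrature for a smooth odd thresholding function; the particular function and the level c are arbitrary illustrative choices, and the resulting (a, ν, γ) are then to be plugged into eq. (1) of Definition 2.3.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Sketch (ours): Hermite data (a, nu) for an odd, continuously differentiable
# thresholding function; the function and the level c are illustrative choices.
c = 1.5
k = lambda x: x * (1 - np.exp(-(x / c) ** 2))   # odd, C^1, slowly growing derivative

x, w = hermegauss(200)                          # Gauss-Hermite nodes/weights, weight e^{-x^2/2}
w = w / np.sqrt(2 * np.pi)                      # normalize to the N(0,1) density
a1 = np.sum(w * x * k(x))                       # a = a_1 = E[xi * k(xi)]
nu = np.sum(w * k(x) ** 2)                      # nu = E[k(xi)^2]
print("a =", a1, " nu =", nu, " -> plug (a, nu, gamma = p/n) into eq. (1)")
```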

2.2. Discussion and conjectures. We observe in the following proposition that the limiting measure µ_{a,ν,γ} of Definition 2.3 is the additive free convolution [32] of a semicircle law and the Marcenko-Pastur law translated by −1 and scaled by a. Recall that for a probability measure µ with Stieltjes transform m(z), there exist η, M > 0 such that z ↦ −1/m(z) is injective on {z ∈ C^+ : Im z > η Re z, |z| > M}. Denoting the inverse of z ↦ −1/m(z) on this domain as S(z), the Voiculescu R-transform of µ is given by the function R(z) = S(1/z) − 1/z. For two probability measures µ and ν on R with R-transforms R_µ and R_ν, their free additive convolution µ ⊞ ν is the probability measure whose R-transform is given by R_{µ⊞ν}(z) = R_µ(z) + R_ν(z) [32, 21, 3].

Proposition 2.11. Let µ_sc be the semicircle law with density

µ_sc(dx) = (1/(2πγ(ν − a²))) √(4γ(ν − a²) − x²) · 1_{[−2√(γ(ν−a²)), 2√(γ(ν−a²))]}(x) dx.

Let µ_{MP,shift} be the Marcenko-Pastur law with parameter γ, translated by −1 and scaled by a. (If γ ≤ 1 and a > 0, then µ_{MP,shift} has density

µ_{MP,shift}(dx) = (√((γ + 2√γ − x/a)(x/a − γ + 2√γ)) / (2πγ(x + a))) · 1_{[aγ−2a√γ, aγ+2a√γ]}(x) dx.

If γ > 1 and a > 0, then µ_{MP,shift} is the mixture (1 − 1/γ)δ_{−a} + (1/γ) µ̃_{MP,shift}, where µ̃_{MP,shift} has the above density. If a = 0, then µ_{MP,shift} = δ_0.) Then µ_{a,ν,γ} given in Definition 2.3 is equal to the free additive convolution µ_sc ⊞ µ_{MP,shift}.

Proof. The semicircle law µ_sc has Stieltjes transform

m_sc(z) = −(1/(2γ(ν − a²))) (z − √(z² − 4γ(ν − a²))),

and when a > 0, µ_{MP,shift} has Stieltjes transform

m_{MP,shift}(z) = (−γ − z/a + √((z/a − γ)² − 4γ)) / (2γ(z + a)).

Here, the branch of the square-root function is chosen to have positive imaginary part. (See Lemmas 2.11 and 3.11 of [2].) The first equation implies −1/m_sc(z) = z + γ(ν − a²) m_sc(z). When a = 0, this is precisely eq. (1) defining µ_{a,ν,γ}, so µ_sc = µ_{a,ν,γ}. When a > 0, the second equation implies −1/m_{MP,shift}(z) = z + a(1 − 1/(1 + aγ m_{MP,shift}(z))), and the R-transforms of µ_sc and µ_{MP,shift} are given by R_sc(z) = γ(ν − a²)z and R_{MP,shift}(z) = −a(1 − 1/(1 − aγz)), respectively. Then the R-transform of µ_sc ⊞ µ_{MP,shift} is given by R_{µ_sc⊞µ_{MP,shift}}(z) = −a(1 − 1/(1 − aγz)) + γ(ν − a²)z, so the Stieltjes transform m_{µ_sc⊞µ_{MP,shift}}(z) satisfies

z = −a(1 − 1/(1 + aγ m(z))) − γ(ν − a²) m(z) − 1/m(z),

at least on an open sub-domain of C^+. This agrees with eq. (1) defining the Stieltjes transform of µ_{a,ν,γ}, and as the Stieltjes transform of any measure is analytic on C^+, these functions must agree on all of C^+. Then µ_{a,ν,γ} = µ_sc ⊞ µ_{MP,shift}. □

In particular, if a = 0, then the measure µ_{a,ν,γ} is the semicircle law µ_sc, and if a² = ν, then it is the translated and scaled Marcenko-Pastur law µ_{MP,shift}. (See also Remarks 3.6 and 3.7 of [9].) By Remark 2.2 of [6], when γ ≤ 1 or a = 0, the support of µ_{a,ν,γ} must be a single interval, and when γ > 1 and a > 0, it must either be a single interval or the union of two disjoint intervals. We observe that when a = 0, the measure µ_{a,ν,γ} is symmetric, and hence Theorem 2.6 implies the almost sure convergence of both the largest and smallest eigenvalues of K̃_{n,p}(X) to ±‖µ_{a,ν,γ}‖. In general, µ_{a,ν,γ} is not symmetric, and Theorem 2.6 provides information on only one of these two eigenvalues. Furthermore, when µ_{a,ν,γ} has two disjoint intervals of support, Theorem 2.6 does not describe whether K̃_{n,p}(X) has eigenvalues in between these two intervals of support. We conjecture that, in fact, K̃_{n,p}(X) has no eigenvalues outside of the limiting support, in the following sense.

Conjecture 2.12. Under the assumptions of Theorem 2.6, for any ε > 0,

lim_{n,p→∞} P[∃ i ∈ {1, . . . , p} : dist(λ_i(K̃_{n,p}(X)), supp(µ_{a,ν,γ})) > ε] = 0,

where dist(x, C) = inf{|x − y| : y ∈ C}.

Turning to the rank-two matrix R̃_{n,p}(X) of Theorem 2.6, we note that R̃_{n,p}(X) may contribute “spike” eigenvalues to K_{n,p}(X) that fall outside of the support of the limiting spectral measure µ_{a,ν,γ}. The entries of R̃_{n,p}(X) and K̃_{n,p}(X) are dependent, but let us momentarily consider the simpler matrix W + V, where W is a Wigner Hermitian matrix with limiting empirical spectral measure µ_sc and V is a deterministic diagonal matrix with limiting empirical spectral measure µ_{MP,shift}, for µ_sc and µ_{MP,shift} as defined in Proposition 2.11. Then by [33], the empirical spectral measure of W + V converges to µ_{a,ν,γ}. Suppose, in addition, that V has at most two “spike”


[Figure 1 appears here: twelve panels (a)–(l), each a histogram of eigenvalues (horizontal axis: Eigenvalues; vertical axis: Frequency); see the caption below.]

Figure 1. Simulation results are shown for the empirical spectra and extremal eigenvalues of matrices K_{n,p}(X), for various kernel functions, distributions of the matrix entries x_{ij}, and parameters a, ν, and γ. The histogram of eigenvalues of K_{n,p}(X) is shown in red, with the (scaled) density function of the limiting empirical spectral measure µ_{a,ν,γ} shown in black. In settings where Conjecture 2.13 predicts the presence of spike eigenvalues of K_{n,p}(X), the theoretical predictions for the locations of these spikes according to Conjecture 2.13 are shown as black crosses, and the locations of the corresponding largest and/or smallest empirically observed eigenvalues of K_{n,p}(X) are indicated by red arrows. Panels (a)–(f) correspond to the kernel function k(x) = h_2(x) + h_3(x). Panels (g)–(l) correspond to the kernel function k(x) = 0.9h_1(x) + h_2(x). Panels (a), (d), (g), and (j) correspond to p = 4000, n = 12000, and γ = 1/3. Panels (b), (e), (h), and (k) correspond to p = 10000, n = 2000, and γ = 5. Panels (c), (f), (i), and (l) correspond to p = 10000, n = 1000, and γ = 10. Panels (a)–(c) and (g)–(i) take x_{ij} ∼ N(0, 1), so E[x_{ij}⁴] = 3, and panels (d)–(f) and (j)–(l) take x_{ij} ∼ Laplace(0, 1/√2), so E[x_{ij}⁴] = 6.


eigenvalues outside of supp(µ_{MP,shift}), and these eigenvalues are fixed and equal to ±a_2 γ √((E[x_{ij}⁴] − 1)/2), as given in Corollary 2.8. Then Theorem 8.1 of [6] implies that W + V has at most two “spike” eigenvalues outside of supp(µ_{a,ν,γ}) and gives a precise characterization of when such spikes occur and the locations at which they appear. We conjecture that this characterization, summarized below, holds also for the spike eigenvalues of K_{n,p}(X), even though K_{n,p}(X) is not a deformed Wigner matrix and R̃_{n,p}(X) is not independent of K̃_{n,p}(X).

Conjecture 2.13. Consider the setup of Theorem 2.6, and suppose a² < ν. Let S = supp(µ_{MP,shift}) and define H : R \ S → R by H(z) = z − γ(ν − a²) m_{MP,shift}(z), where µ_{MP,shift} is the scaled and translated Marcenko-Pastur law as in Proposition 2.11 and m_{MP,shift}(z) is its Stieltjes transform. Let λ_1, λ_2 = ±a_2 γ √((E[x_{ij}⁴] − 1)/2). If λ_1 ∉ S and H'(λ_1) > 0, then H(λ_1) ∉ supp(µ_{a,ν,γ}) and there is one eigenvalue of K_{n,p}(X) that converges a.s. to H(λ_1). Similarly, if λ_2 ∉ S and H'(λ_2) > 0, then H(λ_2) ∉ supp(µ_{a,ν,γ}) and there is one eigenvalue of K_{n,p}(X) that converges a.s. to H(λ_2). The remaining eigenvalues of K_{n,p}(X) are, for any ε > 0, within an ε-neighborhood of supp(µ_{a,ν,γ}), in the sense of Conjecture 2.12.
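The prediction of Conjecture 2.13 is straightforward to evaluate numerically from the closed form of m_{MP,shift}(z) in the proof of Proposition 2.11. The sketch below (ours, not from the paper) does so for one of the settings of Figure 1; the conditions λ ∉ S and H'(λ) > 0 are not re-checked in the code.

```python
import numpy as np

def m_MP_shift(z, a, gamma):
    # Stieltjes transform of mu_{MP,shift} (Proposition 2.11), just above the real
    # axis; the square-root branch is taken with positive imaginary part.
    z = complex(z) + 1e-9j
    s = np.sqrt((z / a - gamma) ** 2 - 4 * gamma)
    if s.imag < 0:
        s = -s
    return (-gamma - z / a + s) / (2 * gamma * (z + a))

def H(z, a, nu, gamma):
    # H(z) = z - gamma*(nu - a^2)*m_{MP,shift}(z); valid spike map when z lies
    # outside supp(mu_{MP,shift}) and H'(z) > 0 (not re-checked here).
    return (z - gamma * (nu - a ** 2) * m_MP_shift(z, a, gamma)).real

# One Figure 1 setting: k = 0.9*h_1 + h_2 (a = 0.9, a_2 = 1, nu = 1.81),
# Laplace entries (E[x^4] = 6), gamma = 10.
a, a2, nu, gamma, Ex4 = 0.9, 1.0, 1.81, 10.0, 6.0
lam = a2 * gamma * np.sqrt((Ex4 - 1) / 2)
print("predicted spike locations:", H(lam, a, nu, gamma), H(-lam, a, nu, gamma))
```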

Figure 1 depicts simulation results of the empirical spectrum of K_{n,p}(X) for the kernel functions k(x) = h_2(x) + h_3(x) and k(x) = 0.9h_1(x) + h_2(x), various settings of γ = p/n, and Gaussian and Laplace distributions for x_{ij}. For the parameter combinations for which Conjecture 2.13 predicts the presence of spike eigenvalues in the limiting spectrum of K_{n,p}(X), we in fact observe such spike eigenvalues in simulation. The predicted spike locations are shown together with the empirically observed locations, and there is close agreement in all cases.

2.3. Outline of proof. In the remainder of the paper, we prove Theorems 2.6 and 2.10. Our proof of Theorem 2.6 follows three high-level steps:

(1) If z_1, . . . , z_n are i.i.d. random variables with mean zero, variance one, and zero third moment, we show that

√(d!) h_d((∑_{i=1}^n z_i)/√n) ≈ √(1/n^d) ∑_{j_1,...,j_d=1; j_1≠j_2≠...≠j_d}^n ∏_{i=1}^d z_{j_i},    (2)

where h_d is the degree-d orthonormal Hermite polynomial as in Definition 2.4. We note that √(d!) h_d on the left side of eq. (2) is the monic Hermite polynomial of degree d, i.e. √(d!) h_d(x) = x^d + . . . where the leading term has coefficient 1, and that ((∑_{i=1}^n z_i)/√n)^d equals the right side of eq. (2) except without the restriction that the indices of summation j_1, . . . , j_d are distinct. The terms of the summation in which the indices j_1, . . . , j_d are not distinct are essentially cancelled out by the lower degree terms of √(d!) h_d(x). Intuition for this cancellation comes from the observation that if z_1, . . . , z_n i.i.d. ∼ (1/2)δ_{−1} + (1/2)δ_1, and if √(d!) h_d is replaced by the monic degree-d orthogonal polynomial with respect to the distribution of (∑_{i=1}^n z_i)/√n, then eq. (2) is actually an exact equality. This follows from the well-known fact that

p_d(z_1, . . . , z_n) := ∑_{j_1,...,j_d=1; j_1≠j_2≠...≠j_d}^n ∏_{i=1}^d z_{j_i}

satisfies E[p_d(z_1, . . . , z_n) p_{d'}(z_1, . . . , z_n)] = 0 for any d ≠ d'. When z_1, . . . , z_n are not Bernoulli-distributed and h_d is the Hermite polynomial, then eq. (2) is only an approximation, where the right side of eq. (2) may be considered as a first-order term of the left side. In Proposition 3.1, we quantify the error of this approximation by also computing the second-order term, which is of size O(1/√n) with high probability, and showing that the third and higher-order terms in this approximation are of size O(1/n) with high probability.

(2) As k(x) = ∑_{d=1}^D a_d h_d(x), the decomposition in step (1) above yields a corresponding decomposition K_{n,p}(X) = Q_{n,p}(X) + R_{n,p}(X) + S_{n,p}(X) of the kernel random matrix, where Q, R, and S correspond to the first-order, second-order, and third-and-higher-order terms, respectively, of the decompositions in step (1) of the Hermite polynomials h_1, . . . , h_D. The bulk of our argument lies in establishing that lim sup_{n,p→∞} ‖Q_{n,p}(X)‖ ≤ ‖µ_{a,ν,γ}‖ almost surely. To prove this result, we use the moment method [15, 16]: For even integers l, ‖Q_{n,p}(X)‖^l ≤ Tr Q_{n,p}(X)^l, and a sufficiently tight upper bound may be obtained by taking l := l(n) ≍ log n. Since µ_{a,ν,γ} is the free additive convolution of a semicircle law with a scaled and translated Marcenko-Pastur law (Proposition 2.11), it is the limiting empirical spectral measure of a deformed GUE matrix of the form M_{n,p} = √(γ(ν − a²)/p) W_p + (a/n) V_{n,p}, as n, p → ∞ with p/n → γ, where W_p is a p×p GUE matrix and V_{n,p} is an independent p×p sample covariance matrix based on n samples and having zero diagonal. We employ combinatorial arguments to upper-bound the quantity E[Tr Q_{n,p}(X)^l] using E[Tr M_{n,p}^l] for a suitable choice of n and p, and we bound the latter quantity using the known convergence result lim_{n,p→∞} ‖M_{n,p}‖ = ‖µ_{a,ν,γ}‖.

(3) Finally, we analyze the remainder matrices R_{n,p}(X) and S_{n,p}(X) from the decomposition in step (2) above. It is easily shown that lim sup_{n,p→∞} ‖S_{n,p}(X)‖ = 0. For R_{n,p}(X), we may write R_{n,p}(X) = ∑_{d=2}^D R_{n,p,d}(X), where R_{n,p,d}(X) is the contribution from the Hermite polynomial h_d of degree d. (The linear polynomial h_1 does not have such a remainder term in the decomposition from step (1).) Again using a moment bound, we show that lim sup_{n,p→∞} ‖R_{n,p,d}(X)‖ = 0 for each d ≥ 3. For d = 2, we show that lim_{n,p→∞} ‖R_{n,p,2}(X) − R̃_{n,p}(X)‖ = 0, where R̃_{n,p}(X) is the rank-two matrix in Theorem 2.6. This establishes that K_{n,p}(X) = K̃_{n,p}(X) + R̃_{n,p}(X), where lim sup_{n,p→∞} ‖K̃_{n,p}(X)‖ ≤ ‖µ_{a,ν,γ}‖. The reverse inequality lim inf_{n,p→∞} ‖K̃_{n,p}(X)‖ ≥ ‖µ_{a,ν,γ}‖ follows immediately from Theorem 2.5, hence establishing Theorem 2.6.

Step (1) above is carried out in Section 3. Step (2) is carried out in Section 4, with many details of the combinatorial argument deferred to Appendix A and the estimate of E[Tr M_{n,p}^l] deferred to Appendix B. Step (3) and the conclusion of the proof of Theorem 2.6 are carried out in Section 5.

In Section 6, we prove Theorem 2.10 using Theorem 2.6. Our argument uses an approximation, in a suitable sense, of the kernel function k(x) by a polynomial function q(x). The spectral norm of the kernel matrix corresponding to q(x) is easily estimated by Theorem 2.6, and we use a concentration of measure argument to control the spectral norm of the kernel matrix corresponding to the remainder r(x) = k(x) − q(x). The concentration of measure argument relies on the construction of a certain dyadic covering net of the unit ball in R^p and is inspired by a similar argument of Rafal Latala [19].

3. Decomposition of Hermite polynomials of sums of IID random variables

The goal of this section is to prove the following proposition, which exhibits a three-term decomposition of a Hermite polynomial applied to a normalized sum (1/√n) ∑_{j=1}^n z_j of i.i.d. random variables with zero mean, unit variance, and zero third moment.

Proposition 3.1. Let Z = (z_j : 1 ≤ j ≤ n) ∈ R^n, where the entries z_j are a collection of independent and identically distributed random variables such that E[z_j] = 0, E[z_j²] = 1, E[z_j³] = 0, and E[|z_j|^l] < ∞ for each l ≥ 1. Let h_d denote the orthonormal Hermite polynomial of degree d as in Definition 2.4. Define

q_{d,n}(Z) = √(1/(n^d d!)) ∑_{j_1,...,j_d=1; j_1≠j_2≠...≠j_d}^n ∏_{i=1}^d z_{j_i},    (3)

r_{d,n}(Z) = 0 for d = 1, and

r_{d,n}(Z) = √(1/(n^d d!)) (d choose 2) ∑_{j_1,...,j_{d−1}=1; j_1≠j_2≠...≠j_{d−1}}^n ((z_{j_1}² − 1) ∏_{i=2}^{d−1} z_{j_i}) for d ≥ 2,    (4)

s_{d,n}(Z) = h_d((1/√n) ∑_{j=1}^n z_j) − q_{d,n}(Z) − r_{d,n}(Z).    (5)

Then, for each d ≥ 1 and any α, β > 0, P[|s_{d,n}(Z)| > n^{−1+α}] < n^{−β} for all sufficiently large n (i.e. for n ≥ N, where N may depend on α, β, d, and the distribution of z_j).
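As a sanity check (ours, not part of the proof), the decomposition of Proposition 3.1 can be verified numerically for a small degree: for d = 3 the distinct-index sums in eqs. (3) and (4) reduce to power sums, and the remainder s_{3,n}(Z) should decay roughly like 1/n. The distribution and sample sizes below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def remainder_d3(n):
    # s_{3,n}(Z) = h_3(S) - q_{3,n}(Z) - r_{3,n}(Z), computed via power sums.
    z = rng.laplace(scale=1 / np.sqrt(2), size=n)   # symmetric, unit variance, zero 3rd moment
    p1, p2, p3 = z.sum(), (z ** 2).sum(), (z ** 3).sum()
    S = p1 / np.sqrt(n)
    h3 = (S ** 3 - 3 * S) / np.sqrt(6.0)                           # orthonormal h_3
    q3 = (p1 ** 3 - 3 * p2 * p1 + 2 * p3) / np.sqrt(6.0 * n ** 3)  # eq. (3), distinct-index sum
    r3 = 3 * ((p2 - n) * p1 - (p3 - p1)) / np.sqrt(6.0 * n ** 3)   # eq. (4), (3 choose 2) = 3
    return abs(h3 - q3 - r3)

for n in [1000, 4000, 16000]:
    print(n, remainder_d3(n))   # should decay roughly like 1/n
```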

Our proof of Proposition 3.1 proceeds via induction on d, using the three-term recurrence satisfied by the Hermite polynomials h_d, and it is presented at the end of this section. By Lemma 3.4 below, P[|q_{d,n}(Z)| > n^α] < n^{−β} and P[|r_{d,n}(Z)| > n^{−1/2+α}] < n^{−β} for any α, β > 0 and all sufficiently large n. Hence Proposition 3.1 may be interpreted as decomposing h_d((1/√n) ∑_{j=1}^n z_j) as the sum of an “O(1) term”, an “O(1/√n) term”, and an “O(1/n) term”. Corresponding to this decomposition of the Hermite polynomials, let us consider the following decomposition of the kernel inner-product matrix.

Definition 3.2. Suppose k(x) = ∑_{d=1}^D a_d h_d(x), as in Theorem 2.6. Then let Q_{n,p}(X) = (q_{ii'} : 1 ≤ i, i' ≤ p) ∈ R^{p×p}, with entries

q_{ii'} = (1/√n) ∑_{d=1}^D a_d q_{d,n}(x_{i1}x_{i'1}, . . . , x_{in}x_{i'n}) for i ≠ i', and q_{ii'} = 0 for i = i',

where q_{d,n} is defined as in eq. (3). Let R_{n,p} ∈ R^{p×p} and S_{n,p} ∈ R^{p×p} be defined analogously using the functions r_{d,n} and s_{d,n} from eqs. (4) and (5), respectively, in place of q_{d,n}.

Remark 3.3. By Definitions 2.2 and 3.2 and the definition of s_{d,n} in eq. (5), it is evident that

K_{n,p}(X) = Q_{n,p}(X) + R_{n,p}(X) + S_{n,p}(X),

where K_{n,p}(X) is as in Theorem 2.6.

Lemma 3.4. Suppose z_1, . . . , z_n are i.i.d. random variables, with E[|z_j|^l] < ∞ for all l ≥ 1. Let p_1, . . . , p_d : R → R be any polynomial functions such that E[p_i(z_j)] = 0 for each i = 1, . . . , d. Then for any α, β > 0,

P[ n^{−d/2} |∑_{j_1,...,j_d=1; j_1≠j_2≠...≠j_d}^n ∏_{i=1}^d p_i(z_{j_i})| > n^α ] < n^{−β}

for all sufficiently large n (i.e. n ≥ N, where N may depend on α, β, d and the distribution of z_j).


Proof. Fix α, β > 0. Let

f(z_1, . . . , z_n) = n^{−d/2} |∑_{j_1,...,j_d=1; j_1≠j_2≠...≠j_d}^n ∏_{i=1}^d p_i(z_{j_i})|,

and let l be an even integer such that αl > β. Then

P[f(z_1, . . . , z_n) > n^α] ≤ E[f(z_1, . . . , z_n)^l] / n^{αl},

and it suffices to show E[f(z_1, . . . , z_n)^l] ≤ C for a constant C independent of n. Note that

E[f(z_1, . . . , z_n)^l] = n^{−ld/2} ∑_{j_1^1,...,j_d^1=1; j_1^1≠...≠j_d^1}^n · · · ∑_{j_1^l,...,j_d^l=1; j_1^l≠...≠j_d^l}^n E[ ∏_{i=1}^d ∏_{k=1}^l p_i(z_{j_i^k}) ].

For each term of the above sum, if there is some j such that j = j_i^k for exactly one pair of indices i ∈ {1, . . . , d} and k ∈ {1, . . . , l}, then the expectation of that term is 0, as E[p_i(z_j)] = 0 and z_j is independent of z_1, . . . , z_{j−1}, z_{j+1}, . . . , z_n. Hence, for terms in the sum with non-zero expectation, there are at most ld/2 distinct values of j_i^k. Then the number of such terms is at most C_{l,d} (n choose ld/2) for a combinatorial constant C_{l,d} not depending on n. Also, the magnitude of the expectation of each such term is at most C' < ∞ for a constant C' depending on l, d, the polynomials p_1, . . . , p_d, and the absolute moments of z_j (which are finite by assumption). As (n choose ld/2) ≤ n^{ld/2}, the result follows. □

Proof of Proposition 3.1. Let S = (1/√n) ∑_{j=1}^n z_j. It will be notationally convenient to work with the monic Hermite polynomials h̄_d = h_d √(d!), so that h̄_d has leading coefficient 1. Let us accordingly define q̄_{d,n} = q_{d,n} √(d!), r̄_{d,n} = r_{d,n} √(d!), and s̄_{d,n} = s_{d,n} √(d!). Then

h̄_d(S) = q̄_{d,n}(Z) + r̄_{d,n}(Z) + s̄_{d,n}(Z),

and we wish to show that for any α, β > 0, P[|s̄_{d,n}(Z)| > n^{−1+α}] < n^{−β} for all sufficiently large n.

We proceed by induction on d. Note that h̄_0(x) = 1, h̄_1(x) = x, and h̄_2(x) = x² − 1. Then for d = 1, h̄_1(S) = S = q̄_{1,n}(Z), and for d = 2,

h̄_2(S) = S² − 1 = n^{−1} (∑_{j_1,j_2=1; j_1≠j_2}^n z_{j_1} z_{j_2} + ∑_{j=1}^n (z_j² − 1)) = q̄_{2,n}(Z) + r̄_{2,n}(Z).

Hence the proposition holds with s̄_{1,n}(Z) = s̄_{2,n}(Z) = 0.

Let us assume by induction that the proposition holds for d − 1 and d. Recall that the monic Hermite polynomials satisfy the three-term recurrence h̄_{d+1}(x) = x h̄_d(x) − d h̄_{d−1}(x) (c.f. eq. (5.5.8) of [28]). We may compute

S q̄_{d,n}(Z) = n^{−(d+1)/2} ∑_{j=1}^n z_j ∑_{j_1,...,j_d=1; j_1≠...≠j_d}^n ∏_{i=1}^d z_{j_i}

= n^{−(d+1)/2} ( ∑_{j_1,...,j_{d+1}=1; j_1≠...≠j_{d+1}}^n ∏_{i=1}^{d+1} z_{j_i} + d ∑_{j_1,...,j_d=1; j_1≠...≠j_d}^n z_{j_1}² ∏_{i=2}^d z_{j_i} )

= q̄_{d+1,n}(Z) + (2/(d+1)) r̄_{d+1,n}(Z) + (d(n − d + 1)/n) q̄_{d−1,n}(Z),

and

S r̄_{d,n}(Z) = n^{−(d+1)/2} (d choose 2) ∑_{j=1}^n z_j ∑_{j_1,...,j_{d−1}=1; j_1≠...≠j_{d−1}}^n ((z_{j_1}² − 1) ∏_{i=2}^{d−1} z_{j_i})

= n^{−(d+1)/2} (d choose 2) ( ∑_{j_1,...,j_d=1; j_1≠...≠j_d}^n ((z_{j_1}² − 1) ∏_{i=2}^d z_{j_i}) + ∑_{j_1,...,j_{d−1}=1; j_1≠...≠j_{d−1}}^n ((z_{j_1}³ − z_{j_1}) ∏_{i=2}^{d−1} z_{j_i}) + (d − 2) ∑_{j_1,...,j_{d−1}=1; j_1≠...≠j_{d−1}}^n ((z_{j_1}² − 1) z_{j_2}² ∏_{i=3}^{d−1} z_{j_i}) )

= ((d − 1)/(d + 1)) r̄_{d+1,n}(Z) + n^{−(d+1)/2} (d choose 2) ∑_{j_1,...,j_{d−1}=1; j_1≠...≠j_{d−1}}^n ((z_{j_1}³ − z_{j_1}) ∏_{i=2}^{d−1} z_{j_i}) + n^{−(d+1)/2} (d choose 2)(d − 2) ∑_{j_1,...,j_{d−1}=1; j_1≠...≠j_{d−1}}^n ((z_{j_1}² − 1)(z_{j_2}² − 1) ∏_{i=3}^{d−1} z_{j_i}) + (d(n − d + 2)/n) r̄_{d−1,n}(Z).

Substituting these expressions into the three-term recurrence,

h̄_{d+1}(S) = S (q̄_{d,n}(Z) + r̄_{d,n}(Z) + s̄_{d,n}(Z)) − d (q̄_{d−1,n}(Z) + r̄_{d−1,n}(Z) + s̄_{d−1,n}(Z)) = q̄_{d+1,n}(Z) + r̄_{d+1,n}(Z) + s̄_{d+1,n}(Z)

for

s̄_{d+1,n}(Z) := −(d(d − 1)/n) q̄_{d−1,n}(Z) + n^{−(d+1)/2} (d choose 2) ∑_{j_1,...,j_{d−1}=1; j_1≠...≠j_{d−1}}^n ((z_{j_1}³ − z_{j_1}) ∏_{i=2}^{d−1} z_{j_i}) + n^{−(d+1)/2} (d choose 2)(d − 2) ∑_{j_1,...,j_{d−1}=1; j_1≠...≠j_{d−1}}^n ((z_{j_1}² − 1)(z_{j_2}² − 1) ∏_{i=3}^{d−1} z_{j_i}) − (d(d − 2)/n) r̄_{d−1,n}(Z) + S s̄_{d,n}(Z) − d s̄_{d−1,n}(Z) =: I + II + III + IV + V + VI.

Fix α, β > 0. Note that E[z_j] = 0, E[z_j² − 1] = 0, and E[z_j³ − z_j] = 0, so by Lemma 3.4,

max( P[|I| > n^{−1+α/2}], P[|II| > n^{−1+α/2}], P[|III| > n^{−1+α/2}], P[|IV| > n^{−1+α/2}] ) < n^{−2β}

for all large n. By the induction hypothesis, P[|s̄_{d,n}| > n^{−1+α/4}] < n^{−2β}/2 for all large n, and also P[|S| > n^{α/4}] < n^{−2β}/2 for all large n by Lemma 3.4 (applied to the simple case where d = 1 and p_1(x) = x). Then P[|V| > n^{−1+α/2}] < n^{−2β} for all large n. Similarly, the induction hypothesis implies P[|VI| > n^{−1+α/2}] < n^{−2β} for all large n. Putting this together,

P[|s̄_{d+1,n}(Z)| > n^{−1+α}] ≤ P[|I + II + III + IV + V + VI| > 6n^{−1+α/2}] ≤ P[|I| > n^{−1+α/2}] + . . . + P[|VI| > n^{−1+α/2}] < 6n^{−2β} < n^{−β}

for all large n, completing the induction. □

4. Bounding the dominant term ‖Q_{n,p}(X)‖

In this section, we establish the following result.

Proposition 4.1. Let Q_{n,p}(X) be as in Definition 3.2. Let µ_{a,ν,γ} be as in Definition 2.3, with a, ν, γ as specified in Theorem 2.6. Then lim sup_{n,p→∞} ‖Q_{n,p}(X)‖ ≤ ‖µ_{a,ν,γ}‖ almost surely.

Our proof uses the moment method. The following definition of a multi-labeling of an l-graph will correspond to the primary combinatorial object of interest in the subsequent analysis.

Definition 4.2. For any integer l ≥ 2, an l-graph is a graph consisting of a single cycle with 2l vertices and 2l edges, with the vertices alternatingly denoted as p-vertices and n-vertices.

We will consider the vertices of the l-graph to be ordered by picking an arbitrary p-vertex as the first vertex and ordering the remaining vertices according to a traversal along the cycle. A vertex V “follows” or “precedes” another vertex W if V comes after or before W, respectively, in this ordering, and the last vertex of the cycle (which is an n-vertex) is followed by the first p-vertex.

Definition 4.3. A multi-labeling of an l-graph is an assignment of a p-label in {1, 2, 3, . . .} to each p-vertex and an ordered tuple of n-labels in {1, 2, 3, . . .} to each n-vertex, such that the following conditions are satisfied:

(1) The p-label of each p-vertex is distinct from those of the two p-vertices immediately preceding and following it in the cycle.

(2) The number d_s of n-labels in the tuple for each sth n-vertex satisfies 1 ≤ d_s ≤ D, and these d_s n-labels are distinct.

(3) For each distinct p-label i and distinct n-label j, there are an even number of edges in the cycle (possibly 0) whose p-vertex endpoint is labeled i and whose n-vertex endpoint has label j in its tuple.

A (p, n)-multi-labeling is a multi-labeling with all p-labels in {1, . . . , p} and all n-labels in {1, . . . , n}.

Figure 2 shows an example of a (3, 5)-multi-labeling of an l-graph, for l = 4 and D = 3. In Definition 4.3 and the subsequent combinatorial arguments, D corresponds to the degree of the polynomial k in Theorem 2.6, which we will always treat as a fixed quantity. We will always consider p-labels to be distinct from n-labels, even though (for notational convenience) we use the same label set {1, 2, 3, . . .} for both.

A key bound on the number of possible distinct p-labels and n-labels that appear in a multi-labeling of an l-graph is provided by the following lemma.

Lemma 4.4. Suppose a multi-labeling of an l-graph has d_1, . . . , d_l n-labels on the first through lth n-vertices, respectively, and suppose that it has m total distinct p-labels and n-labels. Then m ≤ (l + ∑_{s=1}^l d_s)/2 + 1.

We defer the proof of Lemma 4.4 to Appendix A. The quantity (l + ∑_{s=1}^l d_s)/2 + 1 − m will appear in many of our subsequent combinatorial lemmas, and we give it a name.

Figure 2. A (3, 5)-multi-labeling of an l-graph, for l = 4 and D = 3. p-vertices are depicted with a circle and n-vertices are depicted with a square.

Definition 4.5. Suppose a multi-labeling of an l-graph has d_1, . . . , d_l n-labels on the first through lth n-vertices, respectively, and suppose that it has m total distinct p-labels and n-labels. The excess of the multi-labeling is ∆ := (l + ∑_{s=1}^l d_s)/2 + 1 − m.

Definition 4.6. Two multi-labelings of an l-graph are equivalent if there is a permutation π_p of {1, 2, 3, . . .} and a permutation π_n of {1, 2, 3, . . .} such that one labeling is the image of the other upon applying π_p to all of its p-labels and π_n to all of its n-labels. For any fixed l, the equivalence classes under this relation will be called multi-labeling equivalence classes.

Note that Lemma 4.4 implies that the excess ∆ of a multi-labeling is always nonnegative. Under these definitions, the number of distinct p-labels, the number of distinct n-labels, the numbers of n-labels d_1, . . . , d_l for each of the l n-vertices, and the excess ∆ are equivalence class properties, i.e. they are the same for all labelings in the same multi-labeling equivalence class. The motivation for Definition 4.3 of a multi-labeling is provided by the following lemma.

Lemma 4.7. Let Q_{n,p}(X) be as in Proposition 4.1, and let l ≥ 2 be an even integer. Let C denote the set of all multi-labeling equivalence classes for an l-graph. For each multi-labeling equivalence class L ∈ C, let ∆(L) be the excess, r(L) the number of distinct p-labels, and d_1(L), . . . , d_l(L) the number of n-labels on the first to lth n-vertices, respectively. Then, with α > 0 as in Assumption 2.1 and with the convention 0⁰ = 1,

E[Tr Q_{n,p}(X)^l] ≤ n ∑_{L∈C} ((12∆(L))^{12α}/n)^{∆(L)} (p/n)^{r(L)} ∏_{s=1}^l (a_{d_s(L)}/(d_s(L)!)^{1/2}).    (6)

Proof. By Definition 3.2, letting i_{l+1} := i_1 for notational convenience,

E[Tr Q_{n,p}(X)^l] = ∑_{i_1,...,i_l=1; i_1≠i_2, i_2≠i_3, ..., i_l≠i_1}^p E[ ∏_{s=1}^l q_{i_s i_{s+1}} ]

= ∑_{i_1,...,i_l=1; i_1≠i_2, ..., i_l≠i_1}^p n^{−l/2} E[ ∏_{s=1}^l ( ∑_{d=1}^D a_d √(1/(n^d d!)) ∑_{j_1,...,j_d=1; j_1≠j_2≠...≠j_d}^n ∏_{a=1}^d x_{i_s j_a} x_{i_{s+1} j_a} ) ]

= ∑_{i_1,...,i_l=1; i_1≠i_2, ..., i_l≠i_1}^p ∑_{d_1,...,d_l=1}^D ∑_{j_1^1,...,j_{d_1}^1=1; j_1^1≠...≠j_{d_1}^1}^n · · · ∑_{j_1^l,...,j_{d_l}^l=1; j_1^l≠...≠j_{d_l}^l}^n n^{−(l+∑_{s=1}^l d_s)/2} ( ∏_{s=1}^l a_{d_s}/(d_s!)^{1/2} ) E[ ∏_{s=1}^l ∏_{a=1}^{d_s} x_{i_s j_a^s} x_{i_{s+1} j_a^s} ].


Note that as x_{ij} and −x_{ij} are equal in law by Assumption 2.1, E[x_{ij}^c] = 0 for any positive odd integer c. Hence, if any x_{ij} appears an odd number of times in the expression ∏_{s=1}^l ∏_{a=1}^{d_s} x_{i_s j_a^s} x_{i_{s+1} j_a^s}, then, as x_{ij} is independent of x_{i'j'} if j ≠ j' or i ≠ i', E[∏_{s=1}^l ∏_{a=1}^{d_s} x_{i_s j_a^s} x_{i_{s+1} j_a^s}] = 0 for such a term. We identify the combination of the sums above, over the remaining non-zero terms, as the sum over all possible (p, n)-multi-labelings of an l-graph. Here, the first sum over i_1, . . . , i_l is over all choices of p-labels, with condition (1) in Definition 4.3 that any two consecutive p-vertices have distinct labels being imposed by the constraints i_1 ≠ i_2, i_2 ≠ i_3, . . . , i_l ≠ i_1 in the sum. The sum over d_1, . . . , d_l is over all possible choices of the number of n-labels in the n-label tuple for each n-vertex, and the sum over j_1^s, . . . , j_{d_s}^s is over all possible choices of d_s n-labels for the sth n-vertex, with condition (2) in Definition 4.3 that the n-labels for each n-vertex are distinct being imposed by the constraint that j_1^s, . . . , j_{d_s}^s are distinct. The product expression ∏_{s=1}^l ∏_{a=1}^{d_s} x_{i_s j_a^s} x_{i_{s+1} j_a^s} then corresponds to a product, over all n-vertices, all d_s n-labels for that n-vertex, and both p-vertices immediately preceding and immediately following that n-vertex, of x_{ij}, where j ∈ {1, . . . , n} is the n-label and i ∈ {1, . . . , p} is the p-label of the p-vertex. The condition that x_{ij} for each distinct pair (i, j) appears an even number of times, so that this term has non-zero expectation, is precisely condition (3) in Definition 4.3. Thus, to summarize,

E[Tr Q_{n,p}(X)^l] = ∑_{l-graph (p,n)-multi-labelings} n^{−(l+∑_{s=1}^l d_s)/2} ( ∏_{s=1}^l a_{d_s}/(d_s!)^{1/2} ) E[ ∏_{s=1}^l ∏_{a=1}^{d_s} x_{i_s j_a^s} x_{i_{s+1} j_a^s} ],

where d_1, . . . , d_l are the numbers of n-labels for the first through lth n-vertices, respectively.

Consider a fixed (p, n)-multi-labeling and write∏ls=1

∏dsa=1 xisjsaxis+1jsa =

∏nj=1

∏pi=1 x

bijij , where

bij is the number of times xij appears as a term in this product. Note that each bij is even (possibly

0). As E[x2ij ] = 1, E[|xij |k] ≤ kαk, and xij is independent of xi′j′ if j′ 6= j or i′ 6= i,

E

[l∏

s=1

ds∏a=1

xisjsaxis+1jsa

]=

∏i,j:bij>2

E[xbijij

]≤

∏i,j:bij>2

bαbijij ≤

∑i,j:bij>2

bij

α∑i,j:bij>2 bij

,

where the last inequality holds with the convention 00 = 1 if bij ≤ 2 for all (i, j). We show in Lemma

A.6 that∑

i,j:bij>2 bij ≤ 12∆, so E[∏l

s=1

∏dsa=1 xisjsaxis+1jsa

]≤ (12∆)12α∆. Eq. (6) then follows

upon noting that each (p, n)-multi-labeling with r distinct p-labels and m − r distinct n-labels

has p!(p−r)!

n!(n−m+r)! ≤ nm

( pn

)r(p, n)-multi-labelings in its equivalence class, and n−

l+∑ls=1 ds2

+m =

n1−∆. �

To prove Proposition 4.1, it remains to control the upper bound in eq. (6). We do so by comparingthis quantity to an analogous quantity for a deformed GUE matrix, specified in the followingdefinition.

Definition 4.8. For n, p ≥ 1, ν, γ > 0, and a ∈ [−√ν,√ν], let Wp = (wii′ : 1 ≤ i, i′ ≤ p) ∈ Cp×p

be distributed according to the GUE, i.e. {wii : 1 ≤ i ≤ p}∪{√

2 Rewii′ ,√

2 Imwii′ : 1 ≤ i < i′ ≤ p}are i.i.d. N (0, 1), and wii′ = wi′i for i > i′. Let Vn,p ∈ Rp×p be standard real Wishart-distributedwith n degrees of freedom and zero diagonal, i.e. Vn,p = ZZT − diag(‖Zi‖22) where Z := Zn,p =

(zij : 1 ≤ i ≤ p, 1 ≤ j ≤ n) ∈ Rp×n, zijiid∼ N (0, 1), and diag(‖Zi‖22) ∈ Rp×p denotes the diagonal

matrix whose ith diagonal entry is the squared Euclidean norm of the ith row of Z. Take Vn,p and

Wp to be independent, and let Mn,p =√

γ(ν−a2)p Wp + a

nVn,p ∈ Cp×p.

As n, p→∞ with pn → γ, the limiting spectral distribution of Mn,p is given by the free additive

convolution of a scaled semicircle law and a scaled and translated Marcenko-Pastur law. We have

Page 20: THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL … · translated Marcenko-Pastur law [33, 12]. Such matrices have been referred to as \deformed Wigner matrices" or \Wigner matrices

18 THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL MATRICES

chosen the scaling factors for Wp and Vn,p so that this free convolution is exactly the measure µa,ν,γin Definition 2.3. It follows from the results of [6] that, in fact, a norm convergence result holds forMn,p, i.e. limn,p→∞ ‖Mn,p‖ = ‖µa,ν,γ‖, from which we may deduce the following Proposition.

Proposition 4.9. Let Mn,p be as in Definition 4.8. Suppose l is an even integer and n, p, l → ∞with p

n → γ and l ≤ C log n for some constant C > 0. Then, for any ε > 0,

E[‖Mn,p‖l] ≤ (‖µa,ν,γ‖+ ε)l

for all sufficiently large n.

The proof of Proposition 4.9 is deferred to Appendix B. (Let us remark that our combinatorialargument would be somewhat simplified if, in Definition 4.8, we could replace the GUE matrix bya GOE matrix, but we cannot find a rigorous statement of the result limn,p→∞ ‖Mn,p‖ = ‖µa,ν,γ‖a.s. for the GOE case in the literature.) As 1

pE[TrM ln,p] ≤ E[‖Mn,p‖l], our strategy for proving

Proposition 4.1 will be to show that the upper bound in eq. (6) can in turn be bounded aboveusing the quantity E[TrM l

n,p], for some choices of p and n. To analyze the quantity E[TrM ln,p], we

consider the following notion of a simple-labeling of an l-graph.

Definition 4.10. A simple-labeling of an l-graph is an assignment of a p-label in {1, 2, 3, . . .}to each p-vertex and either one n-label in {1, 2, 3, . . .} or the empty label ∅ to each n-vertex, suchthat the following conditions are satisfied:

(1) The p-label of each p-vertex is distinct from those of the two p-vertices immediately precedingand following it in the cycle.

(2) For each distinct p-label i and distinct non-empty n-label j, there are an even number ofedges in the cycle (possibly 0) such that its p-vertex endpoint is labeled i and its n-vertexendpoint is labeled j.

(3) For any two distinct p-labels i and i′, the number of occurrences (possibly 0) of the threeconsecutive labels i, ∅, i′ on a p-vertex, its following n-vertex, and its following p-vertex,respectively, is equal to the number of occurrences of the three consecutive labels i′, ∅, i.

A (p,n)-simple-labeling is a simple-labeling with all p-labels in {1, . . . , p} and all non-emptyn-labels in {1, . . . , n}.

Analogous to Lemma 4.4, the following lemma provides a key bound on the number of possibledistinct p-labels and n-labels that appear in a simple-labeling of an l-graph.

Lemma 4.11. Suppose a simple-labeling of an l-graph has k n-vertices with non-empty label and

m total distinct p-labels and distinct non-empty n-labels. Then m ≤ l+k2 + 1.

The proof of Lemma 4.11 is deferred to Appendix A. We may then define the excess of asimple-labeling, analogous to Definition 4.5, and note that the excess is always nonnegative.

Definition 4.12. Suppose a simple-labeling of an l-graph has k n-vertices with non-empty labeland m total distinct p-labels and distinct non-empty n-labels. The excess of the simple-labeling is

∆ := l+k2 + 1− m.

Definition 4.13. Two simple-labelings of an l-graph are equivalent if there is a permutation πpof {1, 2, 3, . . .} and a permutation πn of {1, 2, 3, . . .} such that one labeling is the image of the otherupon applying πp to all of its p-labels and πn to all of its n-labels. (The empty n-label remainsempty under any such permutation πn.) For any fixed l, the equivalence classes under this relationwill be called simple-labeling equivalence classes.

Motivation for Definition 4.10 of a simple labeling is provided by the following lemma, whichgives a lower bound for the quantity E[TrM l

n,p] analogous to the upper bound for E[TrQn,p(X)l]in Lemma 4.7.

Page 21: THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL … · translated Marcenko-Pastur law [33, 12]. Such matrices have been referred to as \deformed Wigner matrices" or \Wigner matrices

THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL MATRICES 19

Lemma 4.14. Let Mn,p be as in Definition 4.8, and let l ≥ 2 be an even integer. Let C denotethe set of all simple-labeling equivalence classes for an l-graph. For each simple-labeling equivalenceclass L ∈ C, let ∆(L) be its excess, k(L) be the number of n-vertices with non-empty label, and r(L)be the number of distinct p-labels. Then, with the convention 00 = 1,

E[TrM ln,p] ≥ n

(p− lp

)l ( n− ln

)l∑L∈C

(1

n

)∆(L)( pn

)r(L)− l−k(L)2

|a|k(L)(γ(ν − a2))l−k(L)

2 . (7)

Proof. By Definition 4.8, letting il+1 := i1 for notational convenience,

E[TrM l

n,p

]= E

Tr

(√γ(ν − a2)

pWp +

a

nVn,p

)l=

p∑i1,...,il=1

E

[l∏

s=1

(√γ(ν − a2)

pwisis+1 +

a

nvisis+1

)]

=

p∑i1,...,il=1

∑S⊆{1,...,l}

(an

)|S|(γ(ν − a2)

p

) l−|S|2

E

[∏s∈S

visis+1

]E

[∏s/∈S

wisis+1

]

=∑

S⊆{1,...,l}

p∑i1,...,il=1

is 6=is+1∀s∈S

n−l+|S|

2

(p

n

)− l−|S|2

a|S|(γ(ν − a2))l−|S|

2 E

[∏s∈S

visis+1

]E

[∏s/∈S

wisis+1

]

=∑

S⊆{1,...,l}

p∑i1,...,il=1

is 6=is+1∀s∈S

∑(js:s∈S)∈{1,...,n}|S|

n−l+|S|

2

(p

n

)− l−|S|2

a|S|(γ(ν − a2))l−|S|

2 E

[∏s∈S

zisjszis+1js

]E

[∏s/∈S

wisis+1

].

In the fourth line above, we restricted the summation to is 6= is+1 ∀s ∈ S, as vii = 0 for eachi = 1, . . . , p by Definition 4.8.

Let us write∏s∈S zisjszis+1js =

∏pi=1

∏nj=1 z

cijij where cij is the number of times zij appears in this

product, and let us write∏s/∈S wisis+1 =

∏pi=1w

aiiii

∏p−1i=1

∏pi′=i+1w

aii′ii′ w

bii′i′i , where aii′ and bii′ are

the numbers of times wii′ and wi′i appear in this product, respectively. Recall {zij : 1 ≤ i ≤ p, 1 ≤j ≤ n} iid∼ N (0, 1), so E[

∏pi=1

∏nj=1 z

cijij ] 6= 0 only if each cij is even (possibly zero), in which case this

quantity is at least 1. Similarly, recall that {wii : 1 ≤ i ≤ p} ∪ {√

2 Rewii′ ,√

2 Imwii′ : 1 ≤ i < i′ ≤p} iid∼ N (0, 1), and wii′ = wi′i for i > i′. If w = reiθ such that

√2 Rew,

√2 Imw

iid∼ N (0, 1), then

r and θ are independent with r2 ∼ χ22/2 and θ ∼ Unif[0, 2π). Then E[wawb] = E[ra+b]E[ei(a−b)θ]

for all nonnegative integers a, b, and this is 0 if a 6= b and at least 1 if a = b ≥ 0. Hence

E[∏pi=1w

aiiii

∏p−1i=1

∏pi′=i+1w

aii′ii′ w

bii′i′i ] = 0 unless aii′ = bii′ for each i′ > i and aii is even (possibly

zero) for each 1 ≤ i ≤ p, in which case this quantity is also at least 1.The above arguments imply, in particular, that E

[∏s/∈S wisis+1

]= 0 unless l − |S| is even. As l

is even by assumption, l − |S| is even if and only if |S| is also even, in which case a|S| = |a||S| ≥ 0.Hence each term of the sum in the above expression for E[TrM l

n,p] is nonnegative, so a lower bound

Page 22: THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL … · translated Marcenko-Pastur law [33, 12]. Such matrices have been referred to as \deformed Wigner matrices" or \Wigner matrices

20 THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL MATRICES

is obtained if we further restrict the summation to is 6= is+1 ∀s ∈ {1, . . . , l}, i.e.

E[TrM l

n,p

]≥

∑S⊆{1,...,l}

p∑i1,...,il=1

i1 6=i2,i2 6=i3,...,is 6=i1

∑(js:s∈S)∈{1,...,n}|S|

n−l+|S|

2

(p

n

)− l−|S|2

|a||S|(γ(ν − a2))l−|S|

2 E

[∏s∈S

zisjszis+1js

]E

[∏s/∈S

wisis+1

].

We identify the combination of these three sums, over the nonzero terms, as a sum over all (p, n)-simple-labelings of an l-graph. Here, the first sum over S is over all choices of the subset of then-vertices that have non-empty label. The second sum over i1, . . . , il is over all choices of p-labels,with condition (1) in Definition 4.10 that any two consecutive p-vertices have distinct labels beingimposed by the constraints i1 6= i2, i2 6= i3, . . . , il 6= i1 in the sum. The last sum over (js : s ∈ S)is over all choices of n-labels for the n-vertices that have nonempty label. The product expression∏s∈S zisjszis+1js then corresponds to a product, over all n-vertices with non-empty label and both

p-vertices immediately preceding and following that n-vertex, of zij , where j ∈ {1, . . . , n} is then-label of the n-vertex and i ∈ {1, . . . , n} is the p-label of the p-vertex. Similarly, the productexpression

∏s/∈S wisis+1 corresponds to a product, over all n-vertices with empty label, of wii′ ,

where i and i′ are the p-labels of the p-vertices immediately preceding and immediately followingthis n-vertex, respectively. The condition that zij for each distinct pair (i, j) appears an evennumber of times, so that E[

∏s∈S zisjszis+1js ] 6= 0, is precisely condition (2) in Definition 4.10, and

the condition that each wii′ appears the same number of times as wi′i, so that E[∏s/∈S wisis+1 ] 6= 0,

is precisely condition (3) in Definition 4.10. As E[∏s∈S zisjszis+1js ]E[

∏s/∈S wisis+1 ] ≥ 1 whenever

this quantity is nonzero, this implies

E[TrM l

n,p

]≥

∑l-graph (p,n)-simple-labelings

n−l+k

2

(p

n

)− l−k2

|a|k(γ(ν − a2))l−k

2 ,

where k = |S| is the number of n-vertices in the simple-labeling with non-empty label. Anysimple labeling with r distinct p-labels and at most m− r distinct non-empty n-labels has at most

n!(n−m+r)!

p!(p−r)! ≥ nm

(pn

)r (p−lp

)l (n−lp

)llabelings in its equivalence class (where we have used

m− r ≤ l and r ≤ l). The desired result then follows upon identifying n1−∆ = n−l+k

2+m. �

With Lemmas 4.7 and 4.14 established, the remainder of our proof of Proposition 4.1 involves acomparison of the upper bound in eq. (6) of Lemma 4.7 and the lower bound in eq. (7) of Lemma4.14. The intuition for the comparison is the following: From Lemmas 4.4 and 4.11, we know thatthe excesses satisfy ∆(L) ≥ 0 and ∆(L) ≥ 0 in eqs. (6) and (7), respectively. Provided that the

number of labelings with excesses ∆(L) and ∆(L) do not increase too rapidly as ∆(L) and ∆(L)increase, we expect that when n is large, the dominant contributions to the sums in eqs. (6) and(7) come from the labelings with zero excess. It may be shown that if we take any multi-labelingequivalence class L with excess ∆(L) = 0 and replace the labels of any n-vertex having more thanone n-label with the empty label, then this mapping yields a valid simple-labeling equivalence class

L with excess ∆(L) = 0, and furthermore,∑L:L maps to L

∏ls=1

ads (L)

(ds(L)!)1/2 = |a||k(L)|(ν − a2)l−k(L)

2 .

Hence, this mapping yields a direct correspondence between terms in eq. (6) with excess ∆(L) = 0

and terms in eq. (7) with excess ∆(L) = 0. To handle the terms in eq. (6) where ∆(L) 6= 0, we willextend this mapping to multi-labeling equivalence classes L with positive excess. We do this in thecase a 6= 0, and the properties of this mapping that we will need are summarized in the followingproposition.

Page 23: THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL … · translated Marcenko-Pastur law [33, 12]. Such matrices have been referred to as \deformed Wigner matrices" or \Wigner matrices

THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL MATRICES 21

Proposition 4.15. Suppose a 6= 0 and l ≥ 2. Let C denote the set of all multi-labeling equivalenceclasses of an l-graph, and let C denote the set of all simple-labeling equivalence classes of an l-graph. For L ∈ C, let ∆(L) be its excess and r(L) be the number of distinct p-labels, and for

L ∈ C, let ∆(L) be its excess, r(L) be the number of distinct p-labels, and k(L) be the number of

n-vertices with non-empty label. Then there exists a map ϕ : C → C such that, for some constantsC1, C2, C3, C4 > 0 depending only on D,

(1) For all L ∈ C, r(L) = r(L),

(2) For all L ∈ C, ∆(ϕ(L)) ≤ C1∆(L), and

(3) For any L ∈ C and ∆0 ≥ 0,

∑L∈ϕ−1(L)

∆(L)=∆0

l∏s=1

|ads(L)|(ds(L)!)1/2

≤(√

ν

|a|

)C2∆0

|a|k(L)(ν − a2)l−k(L)

2 lC3+C4∆0 .

This proposition allows us to control all terms in the sum in eq. (6), including those with positiveexcess, by terms in the sum in eq. (7), thus bypassing the need to directly control how the numberof terms in the sum for eq. (6) corresponding to each value of ∆(L) grows with ∆(L). The proofof this proposition and the explicit construction of the map ϕ require some detailed combinatorialarguments, which we defer to Appendix A. Using this result, we may complete the proof ofProposition 4.1 in the case a 6= 0.

Proof of Proposition 4.1 (Case a 6= 0). For any ε > 0 and even integer l ≥ 2,

P [‖Qn,p(X)‖ > (1 + ε)‖µa,ν,γ‖] ≤ P[TrQn,p(X)l > ((1 + ε)‖µa,ν,γ‖)l

]≤ E[TrQn,p(X)l]

(1 + ε)l‖µa,ν,γ‖l.

By Lemma 4.7, Definition 4.5, and Proposition 4.15,

E[TrQn,p(X)l] ≤ n∑L∈C

((12( l+Dl2 + 1))12α

n

)∆(L) ( pn

)r(L)(

l∏s=1

|ads(L)|(ds(L)!)1/2

)

= n∑L∈C

l+Dl2

+1∑∆0=

⌈∆(L)C1

⌉∑

L∈ϕ−1(L)

∆(L)=∆0

((12( l+Dl2 + 1))12α

n

)∆0 ( pn

)r(L)l∏

s=1

|ads(L)|(ds(L)!)1/2

≤ n∑L∈C

( pn

)r(L)l+Dl

2+1∑

∆0=⌈

∆(L)C1

⌉(

(12( l+Dl2 + 1))12α

n

)∆0

(√ν

|a|

)C2∆0

|a|k(L)(ν − a2)l−k(L)

2 lC3+C4∆0

≤ nlC3(l+Dl

2 + 2)∑L∈C

( pn

)r(L)|a|k(L)(ν − a2)

l−k(L)2

(12( l+Dl2 + 1)

)12α(√

ν|a|

)C2

lC4

n

∆(L)C1

,

Page 24: THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL … · translated Marcenko-Pastur law [33, 12]. Such matrices have been referred to as \deformed Wigner matrices" or \Wigner matrices

22 THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL MATRICES

where the last line holds for all sufficiently large n if l ∼M log n, for any M > 0. Let

n =

n1C1(

12( l+lD2 + 1)) 12αC1

(√ν|a|

)C2C1 l

C4C1

,and let p = b npn c. Then nlC3

(l+Dl

2 + 2)≤ n2 and ( pn)r(L) ≤ ( pn)r(L)(1 + ε

4)l for l ∼ M log n, anyM > 0, and all sufficiently large n, so

E[TrQn,p(X)l] ≤ n2(1 + ε

4

)l∑L∈C

(1

n

)∆(L)( pn

)r(L)

|a|k(L)(ν − a2)l−k(L)

2 .

On the other hand, by Lemma 4.14,(1− ε

4

)ln∑L∈C

(1

n

)∆(L)( pn

)r(L)

|a|k(L)(ν − a2)l−k(L)

2 ≤ E[TrM l

n,p

]for all sufficiently large n. Since p

n → γ and l ∼ MC1 log n if l ∼ M log n, Proposition 4.9 implies

E[TrM ln,p] ≤ pE[‖Mn,p‖l] ≤ p

(‖µa,ν,γ‖

(1 + ε

4

))lfor all large n. Thus

P [‖Qn,p(X)‖ > (1 + ε)‖µa,ν,γ‖] ≤ n2 p

n

( (1 + ε

4

)2(1− ε

4

)(1 + ε)

)l.

Taking l ∼M log n with M > 0 sufficiently large such that M log(1+

ε4)

2

(1− ε4)(1+ε)< −4 (which is possible

for any sufficiently small ε > 0), this implies P [‖Qn,p(X)‖ > (1 + ε)‖µa,ν,γ‖] < 1n2 for all large n.

Then lim supn,p→∞ ‖Qn,p(X)‖ ≤ (1 + ε)‖µa,ν,γ‖ a.s., by the Borel-Cantelli lemma. This holds forall sufficiently small ε > 0, so the proposition follows. �

Proposition 4.1 in the case a = 0 may be easily established from the a 6= 0 case via the followingcontinuity argument.

Lemma 4.16. For the measure µa,ν,γ in Definition 2.3, ‖µa,ν,γ‖ is continuous in (a, ν, γ).

Proof. Let Wp and Vn,p be as in Definition 4.8, and let Mn,p(a, ν, γ) =√

γ(ν−a2)p Wp + a

nVn,p.

For any fixed (a, ν, γ) and (ai, νi, γi)∞i=1 such that limi→∞(ai, νi, γi) = (a, ν, γ), by Lemma B.1,

‖Mn,p(a, ν, γ)‖ → ‖µa,ν,γ‖, ‖Mn,p(ai, νi, γi)‖ → ‖µai,νi,γi‖ for each i, and lim supn,p→∞1√p‖Wp‖ <

∞ and lim supn,p→∞1n‖Vn,p‖ < ∞ on an event having probability 1. Then on this event, for each

i,

|‖µai,νi,γi‖ − ‖µa,ν,γ‖| ≤ lim supn,p→∞

|‖Mn,p(ai, νi, γi)‖ − ‖Mn,p(a, ν, γ)‖|

≤ lim supn,p→∞

‖Mn,p(ai, νi, γi)−Mn,p(a, ν, γ)‖

≤∣∣∣∣√γi(νi − a2

i )−√γ(ν − a2)

∣∣∣∣ lim supn,p→∞

1√p‖Wp‖+ |a− ai| lim sup

n,p→∞

1

n‖Vn,p‖,

which implies limi→∞ ‖µai,νi,γi‖ = ‖µa,ν,γ‖. �

Proof of Proposition 4.1 (Case a = 0). Suppose k(x) is a polynomial function such that the coef-ficient of the linear term in its Hermite polynomial decomposition is zero. For any a > 0, letka(x) = k(x) + ax, and let Qn,p,a(X) be the matrix as defined in Definition 3.2 for the ker-nel function ka. Then Qn,p,a(X) = Qn,p(X) + a

nVn,p(X), where Vn,p(X) has zero diagonal and

Page 25: THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL … · translated Marcenko-Pastur law [33, 12]. Such matrices have been referred to as \deformed Wigner matrices" or \Wigner matrices

THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL MATRICES 23

equals XXT off of the diagonal. By Proposition 4.1 for the a 6= 0 case, established above,lim supn,p→∞ ‖Qn,p,a(X)‖ ≤ ‖µa,(ν+a2),γ‖. As a consequence of the main theorem in [16] and the ob-

servation limn,p→∞ suppi=1

(‖Xi‖2n − 1

)= 0, which follows easily under the Assumption 2.1, we have

lim supn,p→∞ ‖ 1nVn,p(X)‖ ≤ Cγ for some constant Cγ > 0. This implies lim supn,p→∞ ‖Qn,p(X)‖ ≤

‖µa,(ν+a2),γ‖ − aCγ for any a > 0, and the desired result follows from Lemma 4.16 upon takinga→ 0. �

5. Analyzing the remainder matrices Rn,p(X) and Sn,p(X)

To conclude the proof of Theorem 2.6, we analyze in this section the remainder matrices Rn,p(X)and Sn,p(X) of Definition 3.2.

Lemma 5.1. As n, p→∞ with pn → γ ∈ (0,∞), limn,p→∞ ‖Sn,p(X)‖ = 0 almost surely.

Proof. Note that ‖Sn,p(X)‖ ≤ ‖Sn,p(X)‖F ≤ pmax1≤i,i′≤p |sii′ | where ‖ · ‖F is the Frobenius norm.

By Definition 3.2 and Proposition 3.1, for any 1 ≤ i, i′ ≤ p and α > 0, |sii′ | ≤ n−32

+α∑Dd=1 |ad| with

probability at least 1− n−4, for all large n. Then by a union bound, pmax1≤i,i′≤p |sii′ | ≤ pn−32

with probability at least 1− p2n−4. As pn → γ ∈ (0,∞), taking any α < 1

2 and ε > 0, this implies

P[‖Sn,p(X)‖ > ε] ≤ p2n−4 for all large n, so lim supn,p→∞ ‖Sn,p(X)‖ ≤ ε a.s. by the Borel-Cantellilemma. As ε > 0 is arbitrary, the result follows. �

Definition 5.2. For d ≥ 2, let Rn,p,d(X) := (rii′ : 1 ≤ i, i′ ≤ p) ∈ Rp×p, with entries

rii′ =

(d2

)√d!n−

d+12

n∑j1,...,jd−1=1

j1 6=j2 6=... 6=jd−1

((x2ij1x

2i′j1 − 1)

d−1∏a=2

xijaxi′ja

)i 6= i′

0 i = i′.

Note that Rn,p(X) in Definition 3.2 is given by Rn,p(X) =∑D

d=2 adRn,p,d(X).

Lemma 5.3. As n, p→∞ with pn → γ ∈ (0,∞), limn,p→∞ ‖Rn,p,d(X)‖ = 0 a.s. for any d ≥ 3.

Proof. Letting i7 := i1 for notational convenience, note that

E[TrRn,p,d(X)6

]=

p∑i1,...,i6=1

i1 6=i2,i2 6=i3,...,i6 6=i1

E

[6∏s=1

risis+1

]

=

(d2

)6(d!)3

n−3(d+1)p∑

i1,...,i6=1

i1 6=i2,i2 6=i3,...,i6 6=i1

n∑j11 ,...,j

1d−1=1

j11 6=j12 6=... 6=j1d−1

. . .n∑

j61 ,...,j6d−1=1

j61 6=j62 6=... 6=j6d−1

E

[6∏s=1

((x2isjs1

x2is+1js1

− 1)

d−1∏a=2

xisjsaxis+1jsa

)].

Consider the term

E

[6∏s=1

((x2isjs1

x2is+1js1

− 1)d−1∏a=2

xisjsaxis+1jsa

)]. (8)

Suppose that there is some j∗ ∈ {1, . . . , n} such that jsa = j∗ for exactly one pair of indices(s, a) ∈ {1, . . . , 6} × {1, . . . , d − 1}. If a = 1, then the only term of the product in eq. (8) thatcontains a variable equal to xij∗ for any i is (x2

isjs1x2is+1js1

−1), and if a ≥ 2, then the only two terms

that contain a variable equal to xij∗ for any i are xisjsa and xis+1jsa . In either case, as is 6= is+1,

Page 26: THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL … · translated Marcenko-Pastur law [33, 12]. Such matrices have been referred to as \deformed Wigner matrices" or \Wigner matrices

24 THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL MATRICES

this implies eq. (8) is zero by independence of the entries of X. Hence, in order for eq. (8) to benonzero, each distinct j∗ that appears in the product must appear as jsa for at least two pairs (s, a).

Suppose, next, that there is some i∗ ∈ {1, . . . , p} such that is = i∗ for exactly one index s ∈{1, . . . , 6}. Consider the label j∗ = js−1

a for any a ∈ {2, . . . , d− 1} (identifying s− 1 ≡ 6 if s = 1).Such a label j∗ exists when d ≥ 3. If j∗ 6= jsa′ for all a′ ∈ {1, . . . , d − 1}, then the variable xi∗j∗appears exactly once in the product in eq. (8), and so eq. (8) is zero. If j∗ = js1, then the variablexi∗j∗ appears twice, once as the term xisjs−1

aand once in the term (x2

isjs1x2is+1js1

− 1). The product

of these terms is x3i∗j∗xis+1j∗ − xi∗j∗ , and as E[x3

i∗j∗ ] = 0 and E[xi∗j∗ ] = 0, this implies eq. (8) is

again zero. Hence, in order for eq. (8) to be nonzero, js−1a must equal jsa′ for some a′ ≥ 2.

Consider first the case where there are at most 4 distinct values in {i1, . . . , i6}. Note that if eq.

(8) is nonzero, then there are at most 6(d−1)2 = 3d − 3 distinct values of jsa in the product, by our

previous argument. There are at most p4n3d−3Cd total choices of indices (i1, . . . , i6) and (jsa : 1 ≤a ≤ d − 1, 1 ≤ s ≤ 6) such that |{i1, . . . , i6}| ≤ 4 and |{jsa : 1 ≤ a ≤ d − 1, 1 ≤ s ≤ 6}| ≤ 3d − 3,where Cd is a combinatorial constant depending on d but not on n and p. Then

p∑i1,...,i6=1

i1 6=i2,i2 6=i3,...,i6 6=i1|{i1,...,i6}|≤4

n∑j11 ,...,j

1d−1=1

j11 6=... 6=j1d−1

. . .n∑

j61 ,...,j6d−1=1

j61 6=... 6=j6d−1

E

[6∏s=1

((x2isjs1

x2is+1js1

− 1)d−1∏a=2

xisjsaxis+1jsa

)]≤ Cd,αp4n3d−3

for a constant Cd,α depending on d and the value of α in Assumption 2.1.If there are 5 distinct values in {i1, . . . , i6}, then either is = is+2 for some s or is = is+3 for some

s (where s + 2 and s + 3 are taken modulo 6), with the remaining indices all distinct. Supposewithout loss of generality that i2 and i3 are distinct from {i1, i4, i5, i6}. Then, letting i∗ := i2and j∗ := j1

2 , we note that i2 is the unique index in {i1, . . . , i6} equal to i∗, so for eq. (8) to benonzero, we must have j∗ = j2

a for some a ≥ 2, by our previous argument. The same argumentapplied to i∗ := i3 then shows that we must have j∗ = j3

a′ for some a′ ≥ 2. Then there are atleast three pairs (s, a) for which jsa = j∗. This implies that there are at most 3d− 4 distinct valuesof jsa (as if there were 3d − 3 distinct values and each value must equal jsa for at least two pairs(s, a), then no such value can equal jsa for more than two pairs). There are at most p5n3d−4C ′d totalchoices of indices (i1, . . . , i6) and (jsa : 1 ≤ a ≤ d − 1, 1 ≤ s ≤ 6) such that |{i1, . . . , i6}| = 5 and|{jsa : 1 ≤ a ≤ d − 1, 1 ≤ s ≤ 6}| ≤ 3d − 4, where C ′d is a combinatorial constant depending on dbut not on n and p. Then

p∑i1,...,i6=1

i1 6=i2,i2 6=i3,...,i6 6=i1|{i1,...,i6}|=5

n∑j11 ,...,j

1d−1=1

j11 6=... 6=j1d−1

. . .

n∑j61 ,...,j

6d−1=1

j61 6=... 6=j6d−1

E

[6∏s=1

((x2isjs1

x2is+1js1

− 1)d−1∏a=2

xisjsaxis+1jsa

)]≤ C ′d,αp5n3d−4

for a constant C ′d,α.

Finally, if |{i1, . . . , i6}| = 6, then letting j∗ = j12 and i∗ = i2, our previous argument implies

that for eq. (8) to be nonzero, there must be some a ≥ 2 such that j∗ = j2a. Similarly, for each

s = 3, . . . , 6, there must be some a ≥ 2 such that j∗ = jsa. Then there are exactly 6 pairs (s, a)

such that jsa = j∗, so the number of distinct values of jsa is at most 6(d−1)−62 + 1 = 3d − 5. There

are at most p6n3d−5C ′′d total choices of indices (i1, . . . , i6) and (jsa : 1 ≤ a ≤ d− 1, 1 ≤ s ≤ 6) suchthat |{i1, . . . , i6}| = 6 and |{jsa : 1 ≤ a ≤ d− 1, 1 ≤ s ≤ 6}| ≤ 3d− 5, where C ′′d is a combinatorial

Page 27: THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL … · translated Marcenko-Pastur law [33, 12]. Such matrices have been referred to as \deformed Wigner matrices" or \Wigner matrices

THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL MATRICES 25

constant depending on d but not on n and p. Then

p∑i1,...,i6=1

i1 6=i2,i2 6=i3,...,i6 6=i1|{i1,...,i6}|=6

n∑j11 ,...,j

1d−1=1

j11 6=... 6=j1d−1

. . .n∑

j61 ,...,j6d−1=1

j61 6=... 6=j6d−1

E

[6∏s=1

((x2isjs1

x2is+1js1

− 1)d−1∏a=2

xisjsaxis+1jsa

)]≤ C ′′d,αp6n3d−5

for a constant C ′′d,α. Putting this together,

E[TrRn,p,d(X)6

]≤Cd,α,γn2

for some constant Cd,α,γ depending also on γ. Then for any ε > 0,

P [‖Rn,p,d(X)‖ > ε] ≤E[TrRn,p,d(X)6]

ε6≤Cd,α,γε6n2

,

so the Borel-Cantelli lemma implies lim supn,p→∞ ‖Rn,p,d(X)‖ ≤ ε almost surely. As this holds forall ε > 0, the result follows. �

Lemma 5.4. Let Rn,p,d(X) be as in Definition 5.2, and let Rn,p(X) be as in Theorem 2.6. As

n, p→∞ with pn → γ ∈ (0,∞), limn,p→∞ ‖a2Rn,p,2(X)− Rn,p(X)‖ → 0.

Proof. Let Tn,p(X) = a2Rn,p,2(X)− Rn,p(X). Then Tn,p(X) has entries

tii′ =

{a2√

2n−

32∑n

j=1

((x2ijx

2i′j − 1)− (x2

ij − 1)− (x2i′j − 1)

)i 6= i′

0 i = i′

=

{a2√

2n−

32∑n

j=1(x2ij − 1)(x2

i′j − 1) i 6= i′

0 i = i′.

Thus, excluding the diagonal, Tn,p(X) equals a2√2n−

32Y Y T where Y = (yij) ∈ Rp×n and yij = x2

ij−1.

By Assumption 2.1 and the main theorem of [16], 1n‖Y Y

T ‖ converges to a finite limit almost surely,

so n−32 ‖Y Y T ‖ → 0. For any ε > 0 and any i,

P

n− 32

n∑j=1

y2ij − E[y2

ij ] ≥ ε

≤ E[(∑n

j=1 y2ij − E[y2

ij ])4]

ε4n6≤ C(ε, α)

n4

for some constant C(ε, α) > 0 depending on ε and α in Assumption 2.1. Then

P

pmaxi=1

n−32

n∑j=1

y2ij ≥ ε

≤ pC(ε, α)

n4≤ C(ε, α, γ)

n3

for all large n, and hence∥∥∥Tn,p(X)− a2√

2n−

32Y Y T

∥∥∥→ 0 almost surely by the Borel-Cantelli lemma,

implying ‖Tn,p(X)‖ → 0 almost surely. �

We may now conclude the proof of Theorem 2.6.

Proof of Theorem 2.6. By Remark 3.3, Kn,p(X) = Qn,p(X) + Rn,p(X) + Sn,p(X). As Rn,p(X) =∑Dd=2 adRn,p,d(X), where Rn,p,d(X) is as in Definition 5.2, Proposition 4.1 and Lemmas 5.1, 5.3,

and 5.4 imply lim supn,p→∞ ‖Kn,p(X) − Rn,p(X)‖ ≤ ‖µa,ν,γ‖. On the other hand, Theorem 2.5

implies, for any ε > 0, limn,p→∞1p

∑pi=1 1{|λi(Kn,p(X))| ∈ [‖µa,ν,γ‖ − ε, ‖µa,ν,γ‖]} > cp for some

constant c := c(ε) > 0 a.s., so in particular, lim infn,p→∞max(λ3(Kn,p(X)),−λp−2(Kn,p(X))) ≥

Page 28: THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL … · translated Marcenko-Pastur law [33, 12]. Such matrices have been referred to as \deformed Wigner matrices" or \Wigner matrices

26 THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL MATRICES

‖µa,ν,γ‖ a.s., where λ3(Kn,p(X)) and λp−2(Kn,p(X)) are the third-largest and third-smallest eigen-

values of Kn,p(X). As Rn,p(X) has rank two, Weyl’s eigenvalue inequality implies λ3(Kn,p(X)) ≤λmax(Kn,p(X) − Rn,p(X)) and −λp−2(Kn,p(X)) ≤ −λmin(Kn,p(X) − Rn,p(X)), hence establishing

lim infn,p→∞ ‖Kn,p(X)−Rn,p(X)‖ ≥ ‖µa,ν,γ‖ almost surely. This concludes the proof of the theorem

upon setting Kn,p(X) = Kn,p(X)− Rn,p(X). �

6. Extension to odd kernel functions for Gaussian observations

In this section, we prove Theorem 2.10. Our proof will rely on the following two results, thefirst giving a polynomial approximation of the kernel function k and the second providing a generalconcentration inequality that will allow us to handle the remainder term from this polynomialapproximation.

Theorem 6.1 ([8]). Suppose w(x) is an even, lower semi-continuous function on R with 1 ≤w(x) < ∞, such that logw(x) is a convex function of log x. Let Cw be the class of continuous

functions on R such that lim|x|→∞f(x)w(x) = 0 for all f ∈ Cw, and suppose Cw contains all polynomial

functions. If∫∞

1logw(x)x2 dx =∞, then for any f ∈ Cw and ε > 0, there exists a polynomial P such

that |f(x)− P (x)| < εw(x) for all x ∈ R.

Proof. See the first Theorem on page 956 of [8]. �

Proposition 6.2. Let k be an odd and differentiable function with |k′(x)| ≤ eβ|x| for some β > 0

and all x ∈ R. Let X ∈ Rp×n have entries xijiid∼ N (0, 1), and let Kn,p(X) be the kernel inner-

product matrix with kernel k as in Definition 2.2. Suppose p, n→∞ with pn → γ. Then there exist

constants Nβ,γ , Cβ,γ > 0, depending only on β and γ, such that

P[‖Kn,p(X)‖ > Cβ,γ ] ≤Cβ,γn2

for all n ≥ Nβ,γ.

Assuming the above Proposition, let us first prove Theorem 2.10.

Proof of Theorem 2.10. By the given conditions, there exists β > 0 such that lim|x|→∞|k′(x)|eβ|x|

= 0.

Applying Theorem 6.1 with w(x) = eβ|x|, for any ε > 0, there exists a polynomial q such that

|k′(x) − q(x)| < εeβ|x| for all x ∈ R. As k is an odd function, k′ is even, so we may take q tobe an even polynomial function. (Otherwise, take the polynomial to be 1

2(q(x) + q(−x)).) Let

q(x) =∫ x

0 q(x)dx for all x ∈ R, and let r(x) = k(x) − q(x). Then q is an odd polynomial

function, r is hence also an odd function, and |r′(x)| < εeβ|x| by construction. Let Qn,p(X) bethe kernel inner-product matrix with kernel function q as in Definition 2.2, and let Rn,p(X) bethe kernel inner-product matrix with kernel function r(x), so that Kn,p(X) = Qn,p(X) +Rn,p(X).

By Proposition 6.2, there exists a constant Cβ,γ > 0 such that P[‖Rn,p(X)‖ > εCβ,γ ] <Cβ,γn2 , for

all sufficiently large n, so lim supn,p→∞ ‖Rn,p(X)‖ ≤ εCβ,γ almost surely. On the other hand, ifq(x) = a0,ε+a1,εh1(x)+ . . .+aD,εhD(x) where h1, . . . , hD are the orthonormal Hermite polynomialsas in Definition 2.4, then aj,ε = 0 for all even j since q is an odd function, and hence, in particular,a2 = 0 and E[q(ξ)] = 0 for ξ ∼ N (0, 1). Then by Theorem 2.6, ‖Qn,p(X)‖ → ‖µaε,νε,γ‖ where

aε = a1,ε and νε =∑D

d=1 a2d,ε. Hence

‖µaε,νε,γ‖ − εCβ,γ ≤ lim infn,p→∞

‖Kn,p(X)‖ ≤ lim supn,p→∞

‖Kn,p(X)‖ ≤ ‖µaε,νε,γ‖+ εCβ,γ

for any ε > 0. Note that |k(x) − q(x)| ≤ εβ e

β|x| for all x ∈ R, so by the dominated convergence

theorem, limε→0 E[(k(ξ) − q(ξ))2] = 0 for ξ ∼ N (0, 1). Then aε → a and νε → ν as ε → 0, where

Page 29: THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL … · translated Marcenko-Pastur law [33, 12]. Such matrices have been referred to as \deformed Wigner matrices" or \Wigner matrices

THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL MATRICES 27

a and ν are defined as in Theorem 2.5 for the kernel function k. By Lemma 4.16, this implieslimε→0 ‖µaε,νε,γ‖ → ‖µa,ν,γ‖, and hence limn,p→∞ ‖Kn,p(X)‖ = ‖µa,ν,γ‖. �

In the remainder of this section, we prove Proposition 6.2. Our proof relies on the coveringbound ‖Kn,p(X)‖ = supy∈Rp:‖y‖≤1 y

TKn,p(X)y ≤ C supy∈Dp2 yTKn,p(X)y for an appropriate choice

of covering set Dp2 ⊂ {y ∈ Rp : ‖y‖ ≤ 1} and sufficiently large constant C. The specific construction

of Dp2 and method of bounding supy∈Dp2 y

TKn,p(X)y are inspired by a similar argument in [19].

Definition 6.3. Let G(β) ⊂ Rp×n × Rp×n be the set of pairs (X,X ′) of p × n matrices such that‖X‖ ≤ √p+ 2

√n, ‖X ′‖ ≤ √p+ 2

√n, and for all l = 1, . . . , p,

1

p

p∑i=1i 6=l

exp

(16β√n|XT

i Xl|)≤ 3e256β2

,

1

p

p∑i=1i 6=l

exp

(16β√n|X ′i

TXl|)≤ 3e256β2

,

1

p

p∑i=1i 6=l

exp

(16β√n|XT

i X′l |)≤ 3e256β2

,

1

p

p∑i=1i 6=l

exp

(16β√n|X ′i

TX ′l |)≤ 3e256β2

,

Lemma 6.4. Let X,X ′ ∈ Rp×n with xij , x′ij

iid∼ N (0, 1). Then for all sufficiently large p (i.e.

p > p0(β)),

P[(X,X ′) /∈ G(β)] ≤ 4e−n2 +

484e384β2

p2+ 4pe−

n8 .

Proof. By Corollary 5.35 of [31], P[‖X‖ > √p+ 2

√n]≤ 2e−

n2 , and similarly for X ′.

For ξ ∼ N (0, 1) and c > 0, E[ec|ξ|] ≤ E[ecξ] + E[e−cξ] = 2ec2

2 , and Var[ec|ξ|] ≤ E[e2c|ξ|] ≤ 2e2c2 .

Then for ξ1, . . . , ξpiid∼ N (0, 1), denoting f(ξ) = ec|ξ| − E[ec|ξ|],

P

[1

p

p∑i=1

ec|ξi| > 3ec2

2

]≤ P

[1

p

p∑i=1

f(ξi) > ec2

2

]≤ e−3c2

p6E

( p∑i=1

f(ξi)

)6

≤ e−3c2

p6

(pE[f(ξi)

6]

+ 15p2E[f(ξi)

4]E[f(ξi)

2]

+ 15p3E[f(ξi)

2]3)

<121e3c2

p3,

where the last line holds for all p ≥ p0(c). For any i 6= l, (XTi Xl, Xl)

L= (‖Xl‖ξi, Xl) where

ξi ∼ N (0, 1) is independent of Xl. Hence

P

1

p− 1

p∑i=1i 6=l

exp

(16β√n|XT

i Xl|)> 3e

128β2‖Xl‖2

n

∣∣∣∣∣Xl

< 121e768β2‖Xl‖

2

n

p3

Page 30: THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL … · translated Marcenko-Pastur law [33, 12]. Such matrices have been referred to as \deformed Wigner matrices" or \Wigner matrices

28 THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL MATRICES

for all p ≥ p0

(β, ‖Xl‖√

n

). Then

P

1

p− 1

p∑i=1i 6=l

exp

(16β√n|XT

i Xl|)> 3e256β2

∣∣∣∣∣‖Xl‖ ≤√

2n

< 121e1536β2

p3

for all p ≥ p0(β), so using the chi-squared tail bound P[‖Xl‖2 > 2n] ≤ e−n8 ,

P

1

p− 1

p∑i=1i 6=l

exp

(16β√n|XT

i Xl|)> 3e256β2

≤ 121e1536β2

p3+ e−

n8 .

The same argument holds for the analogous sums with X ′iTXl, X

Ti X

′l , and X ′i

TX ′l in place of XTi Xl,

and the result follows by a union bound. �

Lemma 6.5. Let y, z ∈ Rp satisfy ‖y‖ ≤ 1 and ‖z‖ ≤ 1. Under the setup of Proposition 6.2,

let F (X) = zTKn,p(X)y. Then E[et(F (X)−F (X′))

1{(X,X ′) ∈ G(β)}]≤ 2 exp

(Cβ,γ‖y‖∞t2√

n

)for all

n ≥ Nβ,γ, for constants Nβ,γ , Cβ,γ > 0 depending only on β and γ.

Proof. Consider F as a function from Rpn to R. Then

∇XlF (X) = ∇Xl

p∑i=1

p∑i′=1i′ 6=i

1√nk

(XTi Xi′√n

)ziyi′

=

p∑i=1i 6=l

1

nk′(XTi Xl√n

)(ziyl + yizl)X

Ti

=ylnvTz X +

zlnvTy X,

for vy, vz ∈ Rp with (vy)i = k′(XTi Xl√n

)yi1{i 6= l} and (vz)i = k′

(XTi Xl√n

)zi1{i 6= l}. Then

‖∇F (X)‖2 =

p∑l=1

‖∇XlF (X)‖2

=

p∑l=1

2y2l

n2‖X‖2‖vz‖2 +

2z2l

n2‖X‖2‖vy‖2

=4‖X‖2

n2

p∑i=1

p∑l=1l 6=i

k′(XTi Xl√n

)2

z2i y

2l

≤ 4‖X‖2

n2

pmaxi=1

p∑l=1l 6=i

k′(XTi Xl√n

)2

y2l

≤ 4‖X‖2

n2

pmaxi=1

p∑l=1l 6=i

k′(XTi Xl√n

)4

1/2 p∑

l=1l 6=i

y4l

1/2

Page 31: THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL … · translated Marcenko-Pastur law [33, 12]. Such matrices have been referred to as \deformed Wigner matrices" or \Wigner matrices

THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL MATRICES 29

≤ 4‖X‖2‖y‖∞n2

pmaxi=1

p∑l=1l 6=i

k′(XTi Xl√n

)4

1/2 p∑

l=1l 6=i

y2l

1/2

≤ 4‖X‖2‖y‖∞n2

pmaxi=1

p∑l=1l 6=i

k′(XTi Xl√n

)4

1/2

.

For θ ∈[0, π2

], let Xθ = X ′ cos θ +X sin θ. Then

‖Xθ‖2 ≤ (‖X ′ cos θ‖+ ‖X sin θ‖)2 ≤ 2‖X ′‖2(cos θ)2 + 2‖X‖2(sin θ)2 ≤ 2 max(‖X‖2, ‖X ′‖2),

sop∑l=1l 6=i

k′(

(Xθ)Ti (Xθ)l√n

)4

≤p∑l=1l 6=i

exp

(4β|(Xθ)

Ti (Xθ)l|√n

)

=

p∑l=1l 6=i

exp

(4β|(X ′i cos θ +Xi sin θ)T (X ′l cos θ +Xl sin θ)|√

n

)

≤p∑l=1l 6=i

exp

(4β(|XT

i Xl|+ |X ′iTXl|+ |XT

i X′l |+ |X ′i

TX ′l |)√n

)

p∑l=1l 6=i

exp

(16β|XT

i Xl|√n

)1/4 p∑

l=1l 6=i

exp

(16β|X ′i

TXl|√n

)1/4

p∑l=1l 6=i

exp

(16β|XT

i X′l |√

n

)1/4 p∑

l=1l 6=i

exp

(16β|X ′i

TX ′l |√n

)1/4

,

where the last line follows from Holder’s inequality. Hence for any (X,X ′) ∈ G(β),

‖∇F (Xθ)‖2 ≤Cβ,γ‖y‖∞n1/2

all n ≥ Nβ,γ and some constants Nβ,γ , Cβ,γ > 0 depending on β and γ. Then

E[et(F (X)−F (X′))

1{(X,X ′) ∈ G(β)}]

= E

[exp

(2

π

∫ π2

0

πt

2

d

dθF (Xθ)dθ

)1{(X,X ′) ∈ G(β)}

]

≤ E

[2

π

∫ π2

0exp

(πt

2

d

dθF (Xθ)

)dθ 1{(X,X ′) ∈ G(β)}

]

=2

π

∫ π2

0E[exp

(πt

2∇F (Xθ)

T Xθ

)1{(X,X ′) ∈ G(β)}

]dθ,

Page 32: THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL … · translated Marcenko-Pastur law [33, 12]. Such matrices have been referred to as \deformed Wigner matrices" or \Wigner matrices

30 THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL MATRICES

where the third line follows from Jensen’s inequality, Xθ = −X ′ sin θ + X cos θ in the fourth line,and the inner-product ∇F (Xθ)

T Xθ represents the vector inner-product in Rpn. Noting that Xθ

and Xθ are independent and both equal in law to X, we may first condition on Xθ and use the

Cauchy-Schwarz inequality and E[ec|(Xθ)ij |

]≤ E

[ec(Xθ)ij

]+ E

[e−c(Xθ)ij

]≤ 2e

c2

2 to obtain

E[et(F (X)−F (X′))

1{(X,X ′) ∈ G(β)}]

=2

π

∫ π2

0E

[E

[exp

(πt

2∇F (Xθ)

T Xθ

)1{(X,X ′) ∈ G(β)}

∣∣∣∣∣Xθ

]]dθ

≤ 2

π

∫ π2

0E

E[exp(πt∇F (Xθ)

T Xθ

) ∣∣∣∣∣Xθ

] 12

E

[1{(X,X ′) ∈ G(β)}

∣∣∣∣∣Xθ

] 12

dθ≤ 4

π

∫ π2

0E

[exp

(π2t2‖∇F (Xθ)‖2

4

)E[1{(X,X ′) ∈ G(β)}

∣∣∣∣Xθ

] 12

]dθ

=4

π

∫ π2

0E

E[exp

(π2t2‖∇F (Xθ)‖2

2

)1{(X,X ′) ∈ G(β)}

∣∣∣∣∣Xθ

] 12

dθ≤ 4

π

∫ π2

0E[exp

(π2t2‖∇F (Xθ)‖2

2

)1{(X,X ′) ∈ G(β)}

] 12

≤ 2 exp

(π2Cβ,γ‖y‖∞t2

4n1/2

).

Definition 6.6. For m = dlog2 pe, let Dp2 =

{y ∈ Rp : ‖y‖ ≤ 1, ∀i, y2

i ∈ {0, 1, 12 ,

14 ,

18 , . . . ,

12m+3 }

}.

For each l = 0, 1, . . . ,m + 3, let πl : Dp2 → Dp

2 be defined by (πl(y))i = yi1{y2i ≥ 2−l}, and let

πl\l−1 : Dp2 → Dp

2 be defined by (πl\l−1(y))i = yi1{y2i = 2−l}.

Lemma 6.7. For any x ∈ Rp with ‖x‖ ≤ 1, there exists y ∈ Dp2 such that ‖x− y‖ < 9

20 , and hence

‖M‖ ≤ 10 supy∈Dp2 yTMy for any symmetric M ∈ Rp×p.

Proof. If x = 0 or x2i = 1 for some i, then we may take y = x. Otherwise, for each i = 1, . . . , p, if

2−l ≤ x2i < 2−l+1 for some l ∈ {1, . . . ,m+ 3}, let yi = 2−

l2 sign(xi), and if x2

i < 2−m−3, let yi = 0.Then ‖y‖ ≤ ‖x‖ ≤ 1 and y ∈ Dp

2, and

‖x− y‖2 =∑

i:x2i≥2−m−3

(xi − yi)2 +∑

i:x2i<2−m−3

x2i

≤∑

i:x2i≥2−m−3

(1− 1√

2

)2

x2i +

∑i:x2

i<2−m−3

x2i

≤(

1− 1√2

)2

+

(1−

(1− 1√

2

)2) ∑i:x2

i<2−m−3

x2i

≤(

1− 1√2

)2

+1

8

(1−

(1− 1√

2

)2)

<

(9

20

)2

Page 33: THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL … · translated Marcenko-Pastur law [33, 12]. Such matrices have been referred to as \deformed Wigner matrices" or \Wigner matrices

THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL MATRICES 31

establishing the first claim. The second claim then follows from Lemma 5.4 in [31]. �

Lemma 6.8. For some constant C > 0, m = dlog2 pe, and all l ∈ {0, 1, . . . ,m + 3}, log |{πl(y) :y ∈ Dp

2}| ≤ C(m+ 4− l)2l.

Proof. For any l ∈ {0, 1, . . . ,m},

|{πl\l−1(y) : y ∈ Dp2}| ≤

2l∑k=0

(p

k

)2k,

as there are at most 2l non-zero entries of πl\l−1(y), and for each non-zero entry there are two

choices of sign. Using(pk

)≤( epk

)k, and noting that k 7→ (2ep)kk−k is monotonically increasing over

k ∈ [0, 2p] and that 2l ≤ 2p for l ≤ m, this implies

log |{πl\l−1(y) : y ∈ Dp2}| ≤ log

(1 + 2l

(2ep

2l

)2l)≤ log

(1 + 2l

(2e2m−l

)2l)≤ C(m− l + 1)2l

for a constant C > 0. For l ∈ {m+ 1,m+ 2,m+ 3}, we use the bound |{πl\l−1(y) : y ∈ Dp2}| ≤ 3p,

as each coordinate of πl\l−1(y) takes one of three values. Then

log |{πl\l−1(y) : y ∈ Dp2}| ≤ C2m ≤ C(m+ 4− l)2l

for a constant C > 0. Then

log |πl(y)| ≤l∑

j=0

log |πj\j−1(y)| ≤ Cl∑

j=0

(m+ 4− j)2j = C(m+ 4− l)l∑

j=0

2j + C

l−1∑k=0

k∑j=0

2j

≤ 2C(m+ 4− l)2l + 2C2l ≤ C ′(m+ 4− l)2l

for a constant C ′ > 0. �

Lemma 6.9. Under the setup of Proposition 6.2, let m = dlog2 pe. Then there are constantsC,Cβ,γ , Nβ,γ > 0 such that for any l ∈ {0, 1, . . . ,m+ 3}, any t > 0, and any n ≥ Nβ,γ,

P

[supy∈Dp2

πl(y)TKn,p(X)πl\l−1(y) > t and (X,X ′) ∈ G(β)

]≤ 2e

C(m+4−l)2l− t2√

2ln4Cβ,γ ,

and for any l ∈ {1, . . . ,m+ 3}, any t > 0, and any n ≥ Nβ,γ,

P

[supy∈Dp2

πl−1(y)TKn,p(X)πl\l−1(y) > t and (X,X ′) ∈ G(β)

]≤ 2e

C(m+4−l)2l− t2√

2ln4Cβ,γ .

Proof. Letting m = dlog2 pe and j = l − 1 or l,

P

[supy∈Dp2

πj(y)Kn,p(X)πl\l−1(y) > t and (X,X ′) ∈ G(β)

]

= P

[sup

y∈{πl(x):x∈Dp2}πj(y)Kn,p(X)πl\l−1(y) > t and (X,X ′) ∈ G(β)

]≤ eC(m+4−l)2l sup

y∈{πl(x):x∈Dp2}P[πj(y)Kn,p(X)πl\l−1(y) > t and (X,X ′) ∈ G(β)

]≤ eC(m+4−l)2le−λt sup

y∈{πl(x):x∈Dp2}E[eλπj(y)Kn,p(X)πl\l−1(y)

1{(X,X ′) ∈ G(β)}],

where the second line follows from πl\l−1(y) = πl\l−1(πl(y)) and πl−1(y) = πl−1(πl(y)), the thirdline follows from a union bound and Lemma 6.8 for some constant C > 0, and the fourth line

Page 34: THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL … · translated Marcenko-Pastur law [33, 12]. Such matrices have been referred to as \deformed Wigner matrices" or \Wigner matrices

32 THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL MATRICES

follows from Markov’s inequality for any λ > 0. Let Λ be the set of all diagonal matrices in Rp×pwith all diagonal entries in {−1, 1}. Note that (X,X ′) ∈ G(β) if and only if (X,DX ′) ∈ G(β) forall D ∈ Λ. Then, conditional on X and (X,X ′) ∈ G(β), X ′ is equal in law to DX ′ for D uniformlydistributed over Λ, and

E[Kn,p(X′)|X, (X,X ′) ∈ G(β)]

= E[Kn,p(DX′)|X, (X,X ′) ∈ G(β)]

= E[E[Kn,p(DX′)|X ′, X, (X,X ′) ∈ G(β)]|X, (X,X ′) ∈ G(β)]

= 0,

where the last line follows from E[Kn,p(DX′)] = 0 for any fixed X ′ ∈ Rp×n as the kernel function

k is odd. Hence by Jensen’s inequality, for any y ∈ Dp2,

E[e−λπj(y)Kn,p(X′)πl\l−1(y)

∣∣∣X, (X,X ′) ∈ G(β)]≥ 1,

and so

E[eλπj(y)Kn,p(X)πl\l−1(y)

1{(X,X ′) ∈ G(β)}]

= E[eλπj(y)Kn,p(X)πl\l−1(y)

∣∣∣(X,X ′) ∈ G(β)]P[(X,X ′) ∈ G(β)]

≤ E[eλπj(y)Kn,p(X)πl\l−1(y)E

[e−λyKn,p(X′)πl\l−1(y)

∣∣∣X, (X,X ′) ∈ G(β)] ∣∣∣(X,X ′) ∈ G(β)

]P[(X,X ′) ∈ G(β)]

= E[eλπj(y)(Kn,p(X)−Kn,p(X′))πl\l−1(y)

∣∣∣(X,X ′) ∈ G(β)]P[(X,X ′) ∈ G(β)]

= E[eλπj(y)(Kn,p(X)−Kn,p(X′))πl\l−1(y)

1{(X,X ′) ∈ G(β)}].

Noting that ‖πl\l−1(y)‖∞ ≤ 2−l2 , Lemma 6.5 implies this last quantity is at most 2 exp

(Cβ,γλ

2√

2ln

)for all y ∈ Dp

2 and n ≥ Nβ,γ . Setting λ = t√

2ln2Cβ,γ

gives the desired result. �

Proof of Proposition 6.2. Let m = dlog2 pe, let Dp2, πl, and πl\l−1 be as in Definition 6.6, and let

C,Cβ,γ > 0 be the constants in Lemma 6.9. We may assume without loss of generality C ≥ 3. Foreach l = 0, . . . ,m+ 3, let

tl =

√8CCβ,γ(m+ 4− l)2

l4

n14

.

Let X ′ be an independent copy of X. Then by Lemma 6.9, for all n ≥ Nβ,γ and each l = 0, . . . ,m+3,

P

[supy∈Dp2

πl(y)TKn,p(X)πl\l−1(y) > tl and (X,X ′) ∈ G(β)

]≤ 2e−C(m+4−l)2l ,

and for each l = 1, . . . ,m+ 3,

P

[supy∈Dp2

πl−1(y)TKn,p(X)πl\l−1(y) > tl and (X,X ′) ∈ G(β)

]≤ 2e−C(m+4−l)2l .

Note

2m+3∑l=0

tl ≤2√

8CCβ,γ

n14

m+3∑l=0

(m+ 4− l)2l4 =

2√

8CCβ,γ

n14

m+3∑l=0

l∑j=0

2j4 ≤

C ′β,γ2m4

n14

≤ C ′′β,γ

for some constants C ′β,γ , C′′β,γ > 0, while

supy∈Dp2

yTKn,p(X)y = supy∈Dp2

(m+3∑l=0

πl(y)TKn,p(X)πl\l−1(y) +m+3∑l=1

πl\l−1(y)TKn,p(X)πl−1(y)

)

Page 35: THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL … · translated Marcenko-Pastur law [33, 12]. Such matrices have been referred to as \deformed Wigner matrices" or \Wigner matrices

THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL MATRICES 33

≤m+3∑l=0

supy∈Dp2

πl(y)TKn,p(X)πl\l−1(y) +

m+3∑l=1

supy∈Dp2

πl−1(y)TKn,p(X)πl\l−1(y).

Then

P

[supy∈Dp2

yTKn,p(X)y > C ′′β,γ and (X,X ′) ∈ G(β)

]

≤m+3∑l=0

P

[supy∈Dp2

πl(y)TKn,p(X)πl\l−1(y) > tl and (X,X ′) ∈ G(β)

]

+m+3∑l=1

P

[supy∈Dp2

πl−1(y)TKn,p(X)πl\l−1(y) > tl and (X,X ′) ∈ G(β)

]

≤ 4

m+3∑l=0

e−C(m+4−l)2l < 4(m+ 4)e−3m <C ′ log p

p3

for some constant C ′ > 0 and all n ≥ Nβ,γ . By Lemma 6.4, this implies

P

[supy∈Dp2

yTKn,p(X)y > C ′′β,γ

]<C ′ log p

p3+ P[(X,X)′ /∈ G(β)] ≤

C ′′′β,γn2

for all n ≥ N ′β,γ and some constants C ′′′β,γ , N′β,γ > 0. The desired result then follows from Lemma

6.7. �

Appendix A. Combinatorial results

This appendix contains the proofs of a number of combinatorial lemmas used in Section 4, aswell as the proof of Proposition 4.15 and the explicit construction of the map ϕ in that proposition.We restate any lemmas previously stated in Section 4 using their original numbering.

Lemma 4.11. Suppose a simple-labeling of an l-graph has k n-vertices with non-empty label and

m total distinct p-labels and distinct non-empty n-labels. Then m ≤ l+k2 + 1.

Proof. Let I = {1, . . . , p} and J = {1, . . . , n}, and consider an undirected graph G on the vertexset I tJ (the disjoint union of I and J with n+p elements, treating elements of I and the elementsof J as distinct). Let G have an edge between i, i′ ∈ I if there are two consecutive p-vertices ofthe l-graph having p-labels i and i′ in the simple-labeling and such that the n-vertex between themhas empty label, and let G have an edge between i ∈ I and j ∈ J if there are two consecutivevertices of the l-graph such that the p-vertex has label i and the n-vertex has label j. The numberof vertices of G incident to at least one edge is m, and G must be connected, so it has at least m−1edges. Note that an edge in G between i, i′ ∈ I corresponds to at least two consecutive pairs ofp-vertices in the l-graph such that the n-vertex between them has empty label, by condition (3) of

Definition 4.10, so the number of such edges is at most l−k2 . Similarly, an edge in G between i ∈ I

and j ∈ J corresponds to at least two pairs of consecutive n and p-vertices of the l-graph such thatthe n-vertex has non-empty label, by condition (2) of 4.10. Hence, the number of such edges is at

most 2k2 . Then m− 1 ≤ l+k

2 . �

Lemma A.1. In any multi-labeling of an l-graph, each distinct n-label that appears must appearon at least two n-vertices.

Proof. Suppose that an n-label j appears only once. The two p-vertices preceding and followingthat n-vertex must have distinct labels, say i1 and i2, by condition (1) of Definition 4.3. Then

Page 36: THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL … · translated Marcenko-Pastur law [33, 12]. Such matrices have been referred to as \deformed Wigner matrices" or \Wigner matrices

34 THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL MATRICES

exactly one edge in the l-graph has p-vertex endpoint labeled i1 and n-vertex endpoint having labelj (and similarly for i2 and j), contradicting condition (3) of Definition 4.3. �

Lemma A.2. Suppose l = 2 or l = 3. Then for any multi-labeling of the l-graph, all l p-labels aredistinct, and all l n-vertices have the same tuple of n-labels, up to reordering.

Proof. That all l p-labels are distinct is a consequence of condition (1) of Definition 4.3. Then byconditions (2) and (3) of Definition 4.3, the n-vertices immediately preceding and following eachp-vertex must have the same tuple of n-labels, up to reordering, and hence all l n-vertices has thesame tuple of n-labels. �

Lemma A.3. In a multi-labeling of an l-graph with l ≥ 4, suppose a p-vertex V is such that itsp-label appears on no other p-vertices. Let the n-vertex preceding V be U , the p-vertex preceding Ube T , the n-vertex following V be W , and the p-vertex following W be X.

(1) If T and X have different p-labels, then the graph obtained by deleting V and W andconnecting U to X is an (l − 1)-graph with valid multi-labeling.

(2) If T and X have the same p-label, then the graph obtained by deleting U , V , W , and X andconnecting U to the n-vertex after X is an (l − 2)-graph with valid multi-labeling.

Proof. First consider case (1). As T and X have distinct p-labels, it remains true that no twoconsecutive p-vertices in the (l− 1)-graph have the same p-label, so condition (1) of Definition 4.3holds. Condition (2) of Definition 4.3 clearly still holds as well. If V has p-label i and W hasn-labels (j1, . . . , jd), then U has n-labels (j1, . . . , jd) as well, up to reordering, by conditions (2)and (3) of Definition 4.3 for the original l-graph and the fact that V is the only p-vertex with labeli. Then in the (l − 1)-graph obtained by deleting V and W , the number of edges with p-vertexendpoint labeled i and n-vertex endpoint having label js for any s = 1, . . . , d is zero, and thenumber of edges with p-vertex endpoint labeled i′ and n-vertex endpoint having label j′ is the sameas in the original l-graph for all other pairs (i′, j′). Thus condition (3) of Definition 4.3 still holdsas well, so the (l − 1)-graph still has a valid multi-labeling.

Now consider case (2). X and the p-vertex after X must have different p-labels in the originall-graph, by condition (1) of Definition 4.3. As T and X have the same p-label, this implies T andthe p-vertex after X must have different p-labels, so condition (1) of Definition 4.3 still holds in the(l−2)-graph. Condition (2) of Definition 4.3 clearly still holds in the (l−2)-graph as well. SupposeV has p-label i1, T and X have p-label i2, and W has n-labels (j1, . . . , jd). Then by conditions (2)and (3) of Definition 4.3 for the original l-graph and the fact that V is the only p-vertex with labeli1, U must also have n-labels (j1, . . . , jd), up to reordering. Then in the (l − 2)-graph obtainedby deleting U , V , W , and X, the number of edges with p-vertex endpoint labeled i1 and n-vertexendpoint having label js for any s = 1, . . . , d is zero, the number of edges with p-vertex endpointlabeled i2 and n-vertex endpoint having label js for any s = 1, . . . , d is two less than in the originall-graph, and the number of edges with p-vertex endpoint labeled i′ and n-vertex endpoint havinglabel j′ is the same as in the original l-graph for all other pairs (i′, j′). Hence condition (3) ofDefinition 4.3 still holds as well, so the (l − 2)-graph still has a valid multi-labeling. �

Lemma 4.4. Suppose a multi-labeling of an l-graph has d1, . . . , dl n-labels on the first throughlth n-vertices, respectively, and suppose that it has m total distinct p-labels and n-labels. Then

m ≤ l+∑ls=1 ds2 + 1.

Proof. We induct on l. For l = 2, a multi-labeling must have d1 = d2 and m = d1 + 2, and forl = 3, a multi-labeling must have d1 = d2 = d3 and m = d1 + 3, by Lemma A.2. The result is theneasily verified in these two cases.

Suppose by induction that the result holds for l − 2 and l − 1, and consider a multi-labeling ofan l-graph with d1, . . . , dl and m as specified, and l ≥ 4. If each distinct p-label which appears in

Page 37: THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL … · translated Marcenko-Pastur law [33, 12]. Such matrices have been referred to as \deformed Wigner matrices" or \Wigner matrices

THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL MATRICES 35

the labeling appears at least twice, then there are at most l2 distinct p-labels. Lemma A.1 implies

there are at most∑ls=1 ds

2 distinct n-labels, so m ≤ l+∑ls=1 ds2 , establishing the result.

Thus, suppose that some p-vertex V has a label that appears exactly once. Let the n-vertexpreceding V be U , the p-vertex preceding U be T , the n-vertex following V be W , and the p-vertexfollowing W be X. If T and X have different p-labels, follow procedure (1) in Lemma A.3 to obtaina multi-labeling of an (l − 1)-graph. This multi-labeling now has m− 1 total distinct p-labels and

n-labels, and so the induction hypothesis implies m− 1 ≤ l−1+∑ls=1 ds−d2 + 1 where d is the number

of n-labels of the deleted n-vertex W . Hence m ≤ l+∑ls=1 ds2 − d+1

2 + 2 ≤ l+∑ls=1 ds2 + 1.

If T and X have the same p-label, follow procedure (2) of Lemma A.3 to obtain a multi-labelingof an (l − 2)-graph. This multi-labeling has at least m − d − 1 and at most m − 1 total distinctp-labels and n-labels, where d is the number of n-labels of the deleted n-vertex W . The induction

hypothesis implies m − d − 1 ≤ l−2+∑ls=1 ds−2d2 + 1, so m ≤ l+

∑ls=1 ds2 + 1. This completes the

induction in both cases, establishing the desired result. �

Lemma A.4. Suppose a multi-labeling of an l-graph has excess ∆, d1, . . . , dl n-labels on the firstthrough lth n-vertices, respectively, and at most l

2 distinct p-labels. Then at most 6∆ − 6 of the∑ls=1 ds total n-labels are such that that distinct n-label appears three or more times in the labeling.

Proof. Observe that if m total distinct p-labels and n-labels appear in the labeling, and at most l2 of

these are p-labels, then the labeling has at least m− l2 distinct n-labels. If c is the number of distinct

n-labels that appear exactly twice, then, as each distinct n-label appears at least twice by Lemma

A.1, a pigeonhole argument implies 2c+ 3(m− l

2 − c)≤∑l

s=1 ds, so c ≥ 3m− 3l2 −

∑ls=1 ds. Then

the n-labels which appear exactly twice account for at least 6m − 3l − 2∑l

s=1 ds of the∑l

s=1 dstotal n-labels, implying that at most 3l + 3

∑ls=1 ds − 6m = 6∆− 6 total n-labels remain. �

Remark A.5. Lemma A.4 implies that if an l-graph has at most l2 distinct p-labels, then it has

excess ∆ ≥ 1.

Lemma A.6. Suppose a multi-labeling of an l-graph has excess ∆. For each p-label i and n-labelj, let bij be the number of pairs of consecutive vertices in the l-graph such that the p-vertex of thepair is labeled i and the n-vertex of the pair has label j in its tuple. Then

∑i,j:bij>2 bij ≤ 12∆.

Proof. We induct on l. For l = 2 or 3, we must have bij = 2 for all (i, j) by Lemma A.2, and ∆ ≥ 0by Lemma 4.4, so the result holds.

Suppose the result holds for l − 2 and l − 1, and consider a multi-labeling of an l-graph withl ≥ 4. If each p-label that appears in the labeling appears at least twice, then there are at mostl2 distinct p-labels, so Lemma A.4 implies that at most 6∆ of the total n-labels are such that thatdistinct n-label appears at least three times. Note that for any distinct n-label j that appears onlytwice, bij = 2 or bij = 0 for all p-labels i, by conditions (1) and (3) of Definition 4.3. For anydistinct n-label j that appears at least three times,

∑i:bij>2 bij is exactly twice the total number

of appearances of j as an n-label (as the p-vertices to the left and right of any n-vertex containinglabel j each contributes 1 to this sum). Then

∑i,j:bij>2 bij ≤ 12∆ by Lemma A.4.

Now suppose that some p-vertex V has a label i that appears exactly once in the labeling.Consider the (l− 1)-graph or (l− 2)-graph obtained by following either procedure (1) or procedure(2) of Lemma A.3. In the (l − 1)-graph, bij = 0 for n-labels j that appear on the deleted n-vertices U and W , whereas bij = 2 for such n-labels j in the original l-graph, and the valuesbi′j′ for all other pairs (i′, j′) are the same in the two graphs. Hence

∑i,j:bij>2 bij is the same

in the (l − 1)-graph as in the l-graph. Then the induction hypothesis implies∑

i,j:bij>2 bij ≤

12(l−1+

∑ls=1 ds−d2 + 1− (m− 1)

)≤ 12∆, where d is the number of n-labels on the deleted n-vertex

W (or U) and m is the total number of distinct n and p-labels in the original l-graph.

Page 38: THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL … · translated Marcenko-Pastur law [33, 12]. Such matrices have been referred to as \deformed Wigner matrices" or \Wigner matrices

36 THE SPECTRAL NORM OF RANDOM INNER-PRODUCT KERNEL MATRICES

In the case of the (l−2)-graph, suppose the deleted n-vertex W (or U) had d n-labels, of which d′

appear on an n-vertex other than W and U . If j is a distinct n-label that does not appear on W orU , then clearly bij is the same in the (l−2)-graph and the original l-graph for all i. If j is one of thed−d′ distinct n-labels that appear only on W and U , then bij = 0 or 2 in both the (l−2)-graph andthe original l-graph for all i. If j is one of the other d′ distinct n-labels, then in deleting U , V , W ,and X, we may have reduced bij by 2 for at most two distinct i-labels (corresponding to the i-labelsof V and X). This implies

∑i:bij>2 bij reduces by at most 8 for this j, with the maximal reduction

occurring if bij = 4 for both of these distinct i-labels in the original l-graph. Then by the induction

hypothesis,∑

i,j:bij>2 bij − 8d′ ≤ 12(l−2+

∑ls=1 ds−2d2 + 1− (m− 1− (d− d′)

), as the (l − 2)-graph

has m−1−(d−d′) total distinct n and p-labels. Then∑

i,j:bij>2 bij ≤ 12(l+

∑ls=1 ds2 + 1−m− d′

)+

8d′ ≤ 12∆, so the result holds in this case as well, completing the induction. �

This concludes the proof of all combinatorial lemmas used in Section 4. The remainder of thisappendix establishes Proposition 4.15 and provides the explicit construction of the map ϕ in thatproposition.

Definition A.7. In an l-graph with a multi-labeling, an n-vertex is single if it has only one n-label.It is a good single if it is single and if its n-label does not appear on any other n-vertex that isnot single. Otherwise, it is a bad single.

Definition A.8. In an l-graph with a (p, n)-multi-labeling, a pair (V, V ′) of distinct (not necessarilyconsecutive) n-vertices is a good pair if the following conditions hold:

(1) V and V ′ have the same tuple of n-labels, up to reordering,(2) V has at least two n-labels (as does V ′),(3) Each distinct n-label appearing on V and V ′ does not appear on any other n-vertices besides

V and V ′.

Remark A.9. The p-labels of the two p-vertices preceding and following V must be distinct, bycondition (1) of Definition 4.3, and similarly for the p-labels of the two p-vertices preceding andfollowing V ′. As each n-label for V and V ′ does not appear on any other vertices besides V and V ′,condition (3) of Definition 4.3 requires that, in fact, the two p-labels of the p-vertices preceding andfollowing V are the same as those of the p-vertices preceding and following V ′ (but not necessarilyin the same order).

Definition A.10. Suppose (V, V ′) is a good pair of n-vertices. Let the p-vertices preceding andfollowing V be U and W , respectively, and let the p-vertices preceding and following V ′ be U ′ andW ′, respectively. Then the good pair (V, V ′) is proper if U has the same label as W ′ and U ′ hasthe same label as W , and it is improper if U has the same label as U ′ and W has the same labelas W ′.

Definition A.11. The label-simplifying map is the map from (p, n)-multi-labelings of an l-graph to (p, n + 1)-simple-labelings of an l-graph, defined by the following procedure:

(1) While there exists an improper good pair of n-vertices (V, V ′), iterate the following: Let W be the p-vertex following V and W ′ be the p-vertex following V ′, and reverse the sequence of vertices starting at W and ending at W ′ along with their labels. (I.e., swap W with W ′, the n-vertex following W with the n-vertex preceding W ′, etc.)
(2) For each n-vertex in a good pair, relabel it with the empty label.
(3) For each n-vertex that is neither a good single nor part of a good pair, relabel it with the single label n + 1.

Remark A.12. In the case where there are multiple improper good pairs in step (1) of this procedure, it will not be important for our later arguments in which order the pairs (V, V ′) are selected.


For concreteness, we may always select {V, V ′} to be the improper good pair whose sorted n-label-tuple is smallest lexicographically, and we may take V to come before V ′ in the l-graph cycle.
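To make the reversal in step (1) of Definition A.11 concrete, the following is a small illustrative sketch (in Python; the list representation of the l-graph cycle is an assumption made only for this illustration, not a construction used in the proofs). It reverses the contiguous segment of a cyclic vertex sequence between the positions of W and W ′, carrying the labels along with the vertices, exactly as in the swap described above.

```python
def reverse_segment(cycle, i, j):
    """Reverse, in place, the contiguous segment of `cycle` starting at index i
    (the p-vertex W) and ending at index j (the p-vertex W'), inclusive, together
    with whatever labels each entry carries.  Indices are taken modulo len(cycle),
    so the segment may wrap around the l-graph cycle."""
    n = len(cycle)
    length = (j - i) % n + 1
    idx = [(i + t) % n for t in range(length)]   # positions of W, ..., W'
    vals = [cycle[k] for k in idx]
    for k, v in zip(idx, reversed(vals)):        # W swaps with W', etc.
        cycle[k] = v
    return cycle

# toy usage on a cycle of 8 abstract vertices, reversing positions 2..5
print(reverse_segment(list(range(8)), 2, 5))  # [0, 1, 5, 4, 3, 2, 6, 7]
```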

Lemma A.13. The following are true for the label-simplifying map in Definition A.11:

(1) Step (1) of the procedure in Definition A.11 always terminates in a valid (p, n)-multi-labeling with no improper good pairs.
(2) The image of any (p, n)-multi-labeling under the map is a valid (p, n + 1)-simple-labeling.
(3) If two multi-labelings are equivalent, then their image simple-labelings are also equivalent.

Proof. Clearly each reversal in step (1) of the procedure in Definition A.11 preserves condition (2) of Definition 4.3 as well as the number of good pairs and n-labels of each good pair. As W and W ′ have the same p-label because (V, V ′) is improper, it also preserves conditions (1) and (3) of Definition 4.3, so the labeling of the vertices after each such reversal is still a valid (p, n)-multi-labeling. Each time an improper good pair (V, V ′) is identified and a reversal is performed, V and V ′ become consecutive n-vertices in the l-graph, and the pair (V, V ′) becomes a proper good pair. As V and V ′ are consecutive, they must remain consecutive under each subsequent reversal that is performed, so their properness is preserved. Hence the procedure must terminate after a number of iterations at most the total number of good pairs in the multi-labeling, and the final multi-labeling is such that all good pairs are proper. This establishes (1).

To prove (2), note that the image labeling under the label-simplifying map has either one n-label or the empty label for each n-vertex. Condition (1) of Definition 4.10 holds for the image labeling by condition (1) of Definition 4.3 for multi-labelings, as the p-labels are preserved under steps (2) and (3) of the procedure in Definition A.11. As all good pairs in the multi-labeling obtained after applying step (1) of the procedure are proper, and step (2) of the procedure maps their labels to the empty label, condition (3) of Definition 4.10 holds for the image simple-labeling. Finally, note that if j is an n-label appearing on good single vertices in the multi-labeling, then condition (2) of Definition 4.10 holds in the image labeling for this j and all p-labels i by condition (3) in Definition 4.3 for multi-labelings. For the new n-label n + 1 created under the label-simplifying map, note that for each distinct p-label i, there must be an even number of total edges in the l-graph cycle with p-endpoint labeled i. Of these, there must be an even number with n-endpoint j for any good single label j, by the above argument, and there must also be an even number with n-endpoint belonging to a good pair in the multi-labeling since these edges must come in pairs. Hence the number of remaining edges adjacent to any p-vertex with label i must also be even. These are precisely the edges with p-vertex endpoint labeled i and n-vertex endpoint labeled n + 1 in the image labeling, so condition (2) of Definition 4.10 holds for the new n-label n + 1 and all p-labels i as well. Hence the image labeling is a valid (p, n + 1)-simple-labeling, establishing (2).

For (3), note that two p-vertices have the same label in the multi-labeling if and only if they have the same label in the image simple-labeling, so equivalence of p-labels is preserved under the map. Furthermore, equivalent multi-labelings have the same good pairs of n-vertices and the same good single n-vertices, and the good single n-vertices are divided into the same sets of vertices that share a common label. Hence equivalence of n-labels is also preserved under the map, so (3) holds. □

Definition A.14. Let C and C̄ be the sets of all multi-labeling equivalence classes and simple-labeling equivalence classes, respectively, of an l-graph. For L ∈ C and any multi-labeling in class L, let L̄ ∈ C̄ be the equivalence class of its image simple-labeling under the label-simplifying map of Definition A.11. Then define ϕ : C → C̄ by ϕ(L) = L̄.

By parts (2) and (3) of Lemma A.13, the above construction of ϕ is well-defined. The remainder of this appendix shows that ϕ satisfies the conditions of Proposition 4.15.

Lemma A.15. Suppose a multi-labeling of an l-graph has excess ∆. Then at most 42∆ pairs of consecutive p-vertices are such that their pair of p-labels appears (in some order) on three or more pairs of consecutive p-vertices.


Proof. We induct on l. For l = 2 and 3, there are zero pairs of consecutive p-vertices satisfying the condition of the lemma, and ∆ ≥ 0 by Lemma 4.4, so the result holds.

Suppose by induction that the result holds for l − 2 and l − 1, and consider a multi-labeling of an l-graph with l ≥ 4. Let d1, . . . , dl be the number of n-labels on the first through lth n-vertices, respectively, and let m be the number of distinct p-labels and n-labels. First suppose each distinct p-label which appears in the labeling appears at least twice. Then there are at most l/2 distinct p-labels. Lemma A.4 implies that there are at least l − 6∆ n-vertices such that all of their n-labels appear exactly twice. For any such n-vertex W and n-label j for W, consider the other n-vertex W ′ also having n-label j. The two p-vertices preceding and following W must have the same pair of labels as the two p-vertices preceding and following W ′, by conditions (1) and (3) of Definition 4.3. This implies that there are at most 6∆ pairs of p-labels which appear (in some order) on only one consecutive pair of p-vertices.

On the other hand, the number of distinct p-labels in the multi-labeling is at most one more than the number of distinct pairs of p-labels appearing on pairs of consecutive p-vertices. This is easily seen by considering the undirected graph with vertices {1, . . . , p}, having an edge between i, i′ ∈ {1, . . . , p} if and only if some consecutive pair of p-vertices have labels i and i′. The edges of this graph must form a single connected component, so the number of vertices adjacent to at least one edge (which is the number of distinct p-labels appearing in the multi-labeling) is at most one more than the number of edges. Note that by Lemma A.1, there are at most (∑_{s=1}^l d_s)/2 distinct n-labels in the multi-labeling, so there are at least m − (∑_{s=1}^l d_s)/2 distinct p-labels. Hence, there are at least m − (∑_{s=1}^l d_s)/2 − 1 = l/2 − ∆ distinct pairs of consecutive p-labels. By our previous argument, at least l/2 − 7∆ of these pairs appear on at least two pairs of consecutive p-vertices. If c distinct pairs of p-labels appear on exactly two pairs of consecutive p-vertices, then, since the l-graph cycle has l pairs of consecutive p-vertices in total, a pigeonhole argument gives 2c + 3(l/2 − 7∆ − c) ≤ l, so c ≥ l/2 − 21∆. These account for at least l − 42∆ pairs of consecutive p-vertices, implying that at most 42∆ pairs of consecutive p-vertices have a pair of p-labels appearing three or more times. This establishes the result in this case.

Now suppose that there is some p-vertex, V, whose p-label appears only once in the labeling. Consider the (l − 1)-graph or (l − 2)-graph obtained by following either procedure (1) or procedure (2) of Lemma A.3. Note that this (l − 1)-graph or (l − 2)-graph has the same number of pairs of consecutive p-vertices such that their pair of p-labels appears on three or more pairs of consecutive p-vertices as in the original l-graph, since no such pair in the original l-graph could contain the p-label of V. On the other hand, our proof of Lemma 4.4 shows that this (l − 1)-graph or (l − 2)-graph has excess less than or equal to the excess of the original l-graph. Then the desired result follows from the induction hypothesis. □
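As a concrete illustration of the quantity bounded in Lemma A.15 (purely for intuition, not used in the argument), the count can be computed directly from the cyclic sequence of p-labels. The following is a minimal sketch in Python; it assumes we are given just the p-labels in cyclic order, one per p-vertex.

```python
from collections import Counter

def pairs_with_frequent_label_pairs(p_labels):
    """Count the consecutive pairs of p-vertices whose unordered pair of p-labels
    appears (in some order) on three or more consecutive pairs around the cycle.
    `p_labels` lists the p-labels in cyclic order, one entry per p-vertex."""
    l = len(p_labels)
    pairs = [frozenset((p_labels[t], p_labels[(t + 1) % l])) for t in range(l)]
    counts = Counter(pairs)
    return sum(1 for pr in pairs if counts[pr] >= 3)

# toy usage: the pairs {1,2} and {3,4} each occur three times around this cycle
print(pairs_with_frequent_label_pairs([1, 2, 1, 2, 3, 4, 3, 4]))  # 6
```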

The next lemma represents a key insight into the structure of the multi-labelings defined in Definition 4.3. It indicates that in any multi-labeling with small excess ∆, most of the non-single n-vertices must belong to a good pair, and in particular, if ∆ = 0, then all non-single n-vertices belong to good pairs.

Lemma A.16. Suppose a multi-labeling of an l-graph has excess ∆ and k single n-vertices. Then there are at least (l − k)/2 − 48∆ good pairs of n-vertices.

Proof. Let the multi-labeling have d1, . . . , dl n-labels for the first through lth n-vertices, respectively, and m total distinct p-labels and n-labels. We induct on l. If l = 2, then Lemma A.2 implies d1 = d2, m = d1 + 2, and ∆ = 0. If d1 = d2 = 1, then k = 2 and there are no good pairs, and if d1 = d2 ≥ 2, then k = 0 and there is one good pair. Hence the result holds. If l = 3, then Lemma A.2 implies d1 = d2 = d3, m = d1 + 3, and ∆ = (d1 − 1)/2. If d1 = d2 = d3 = 1, then k = 3, ∆ = 0, and there are no good pairs. If d1 = d2 = d3 ≥ 2, then k = 0, ∆ ≥ 1/2, and there are still no good pairs. In either case, the result also holds.


Assume by induction that the result holds for l − 2 and l − 1, and consider a multi-labeling of an l-graph with l ≥ 4. First suppose each distinct p-label which appears in the labeling appears at least twice. Then there are at most l/2 distinct p-labels. As there are l − k non-single n-vertices, by Lemmas A.1 and A.4, there are at least l − k − 6∆ non-single n-vertices such that each n-label of that vertex appears exactly twice. Let V be one such n-vertex. Suppose that V has two n-labels j1 and j2 that occur on two different n-vertices W1 and W2 respectively, in addition to V. Then by conditions (1) and (3) of Definition 4.3, the three pairs of consecutive p-vertices around V, W1, and W2 must have the same pair of (distinct) p-labels. By Lemma A.15, there are at most 42∆ such n-vertices V. Now suppose that all n-labels of V reappear on a single other n-vertex W1, but W1 has some additional n-label j not appearing on V. Then either all additional n-labels of W1 appear at least three times, or there is some n-label j appearing on W1 and a single other n-vertex W2, but not on V. In the former case, the number of such vertices W1 is at most 6∆ by Lemma A.4. As V is the unique n-vertex sharing an n-label with W1 that appears exactly twice, this implies the number of such vertices V is also at most 6∆. In the latter case, the three pairs of p-vertices around V, W1, and W2 must have the same pair of p-labels, so by Lemma A.15, the number of such vertices V is at most 42∆. Combining the results from all of these cases, there are then at least l − k − 96∆ non-single n-vertices V whose labels all appear exactly twice, on one other n-vertex V ′, and such that V ′ has no additional n-labels. These pairs (V, V ′) form at least (l − k)/2 − 48∆ good pairs, so the conclusion holds.

Now suppose there is some p-vertex, V, whose p-label appears only once in the labeling. Let the n-vertex preceding V be U, the p-vertex preceding U be T, the n-vertex following V be W, and the p-vertex following W be X. By conditions (2) and (3) of Definition 4.3 and the fact that the p-label of V appears only once, U and W must have the same tuple of n-labels, up to reordering. Consider four cases:

(1) T and X have different p-labels, and U and W are single.
(2) T and X have different p-labels, and U and W each have d ≥ 2 n-labels.
(3) T and X have the same p-label, and U and W are single.
(4) T and X have the same p-label, and U and W each have d ≥ 2 n-labels.

In cases (1) and (2), remove V and W, and connect U to X. Lemma A.3 implies that the resulting graph is an (l − 1)-graph with a valid multi-labeling. In case (1), this (l − 1)-graph has k − 1 single n-vertices, ∑_{s=1}^l d_s − 1 total n-labels, and m − 1 total distinct p-labels and n-labels. Then by the induction hypothesis, it has at least

((l − 1) − (k − 1))/2 − 48(((l − 1) + (∑_{s=1}^l d_s − 1))/2 + 1 − (m − 1)) = (l − k)/2 − 48∆

good pairs. Note that in case (1), this (l − 1)-graph must have the same number of good pairs as the original l-graph, so the desired result holds.

In case (2), the (l − 1)-graph has k single n-vertices, ∑_{s=1}^l d_s − d total n-labels, and m − 1 distinct p-labels and n-labels. By the induction hypothesis, as d ≥ 2 by assumption, it has at least

((l − 1) − k)/2 − 48(((l − 1) + (∑_{s=1}^l d_s − d))/2 + 1 − (m − 1)) > (l − k)/2 − 48∆ + 1

good pairs. This (l − 1)-graph can have at most one more good pair than the original l-graph. (It has exactly one more good pair if the removed n-vertex W had a tuple of n-labels that occurred exactly three times on three different n-vertices in the original l-graph.) So the number of good pairs in the original l-graph is at least (l − k)/2 − 48∆ and the conclusion holds in this case as well.

In cases (3) and (4), remove U, V, W, and X, and connect T to the n-vertex after X. Lemma A.3 implies that the resulting graph is an (l − 2)-graph with a valid multi-labeling. In case (3), this (l − 2)-graph has k − 2 single n-vertices, ∑_{s=1}^l d_s − 2 total n-labels, and either m − 2 distinct p-labels and n-labels if the removed n-vertices had an n-label appearing only those two times, or m − 1 distinct p-labels and n-labels otherwise. Suppose the former. Then, by the induction hypothesis, this (l − 2)-graph has at least

((l − 2) − (k − 2))/2 − 48(((l − 2) + (∑_{s=1}^l d_s − 2))/2 + 1 − (m − 2)) = (l − k)/2 − 48∆

good pairs, and it has the same number of good pairs as the original l-graph. If, instead, the (l − 2)-graph has m − 1 distinct p-labels and n-labels, then it can have at most one more good pair than the original l-graph. (It has exactly one more good pair if the (l − 2)-graph has a pair of n-vertices having the n-label of the removed vertices U and W, and this pair now forms a good pair.) But in this case, the (l − 2)-graph has at least

((l − 2) − (k − 2))/2 − 48(((l − 2) + (∑_{s=1}^l d_s − 2))/2 + 1 − (m − 1)) > (l − k)/2 − 48∆ + 1

good pairs, so the original l-graph must have at least (l − k)/2 − 48∆ good pairs in this case as well.

Finally, in case (4), the (l − 2)-graph has k single n-vertices, ∑_{s=1}^l d_s − 2d total n-labels, and at least m − d − 1 and at most m − 1 distinct p-labels and n-labels. If it has exactly m − d − 1 distinct p-labels and n-labels, then we must have removed a good pair, and by the induction hypothesis, the (l − 2)-graph has at least

((l − 2) − k)/2 − 48(((l − 2) + (∑_{s=1}^l d_s − 2d))/2 + 1 − (m − d − 1)) = (l − k)/2 − 48∆ − 1

good pairs. Hence the original l-graph had at least (l − k)/2 − 48∆ good pairs. If, instead, the (l − 2)-graph has m − c − 1 distinct p-labels and n-labels for 0 ≤ c < d (so that d − c distinct n-labels in the removed pair of n-vertices U and W appear more than just those two times), then U and W cannot be a good pair in the original l-graph, and the (l − 2)-graph can have at most d − c more good pairs than the l-graph, one for each distinct n-vertex label of U and W that appeared more than twice in the l-graph. The (l − 2)-graph has at least

((l − 2) − k)/2 − 48(((l − 2) + (∑_{s=1}^l d_s − 2d))/2 + 1 − (m − c − 1)) > (l − k)/2 − 48∆ + d − c

good pairs, which implies that the original l-graph had at least (l − k)/2 − 48∆ good pairs.

This completes the induction in all cases, so the conclusion holds for all l. □

Remark A.17. In the context of Theorem 2.5 and Lemma 4.7, if a = 0, then only multi-labeling equivalence classes with no single n-vertices contribute to the sum in eq. (6). By Lemma A.16, the multi-labelings with no single n-vertices and excess ∆ = 0, which comprise the dominant term of this sum, must be such that all n-vertices belong to good pairs. Then there are exactly (∑_{s=1}^l d_s)/2 distinct n-labels in such multi-labelings, which in turn implies by the definition of ∆ that there are exactly l/2 + 1 distinct p-labels. By Remark A.9, this identifies the sequence of p-labels appearing along the l-graph cycle as a traversal of a tree with l/2 edges in the complete graph having vertices {1, . . . , p}, where each edge of the tree is traversed exactly once in each direction. This corresponds exactly to Wigner paths in the combinatorial proof of the semicircle law for Wigner matrices and explains the emergence of the semicircle law as the limit measure µ_{a,ν,γ} in Theorem 2.5 in the case where a = 0.

Remark A.18. A statement analogous to Lemma A.16 for the single n-vertices does not hold, i.e., it is not true in general that if a multi-labeling of an l-graph has excess ∆ and k single n-vertices, then at least k − O(∆) of these are good singles.

Lemma A.19. Suppose a multi-labeling of an l-graph has excess ∆. Then there are at most 2∆ good pairs of n-vertices such that the two vertices in the pair are consecutive in the l-graph cycle and the p-label of the p-vertex between them appears at least twice in the labeling.

Proof. Suppose that (V, V ′) is such a pair, W is the p-vertex between them, and W has p-label i. If i appears on any p-vertex that is not between two consecutive n-vertices forming a good pair, then change the p-label of W to a new p-label not yet appearing in the multi-labeling, and do this for every such p-vertex W with label i (picking a different new p-label each time). If i only appears on p-vertices between consecutive n-vertices forming good pairs, and there are c such p-vertices including W, then change the p-labels of c − 1 of these p-vertices to c − 1 new labels not yet appearing in the multi-labeling. Note that changing the p-label of any p-vertex between two consecutive n-vertices forming a good pair to a new p-label not yet appearing in the labeling cannot violate any of the conditions of Definition 4.3, so the resulting labeling is still a valid multi-labeling. If x is the number of good pairs satisfying the condition of the lemma, then we have added at least x/2 distinct new p-labels to the multi-labeling. If there were originally m distinct n-labels and p-labels and d1, . . . , dl n-labels on the first through lth n-vertices, respectively, then Lemma 4.4 implies m + x/2 ≤ (l + ∑_{s=1}^l d_s)/2 + 1, so x ≤ 2∆. □

Definition A.20. In a multi-labeling of an l-graph, a distinct p-label i that appears in the multi-labeling is a connector if, among all n-vertices that are adjacent to any p-vertex with label i, exactly two of them are bad singles and the remainder of them are either good singles or part of good pairs. Two bad single n-vertices are connected if they are these two n-vertices corresponding to a connector i. A sequence of bad single n-vertices W1, . . . , Wa is a connected cycle if W1 is connected to W2, W2 is connected to W3, etc., and Wa is connected to W1.

Note that in the above definition, “connector” refers to a distinct p-label i, not to any specific p-vertex having label i, and any two “connected” bad single n-vertices are adjacent to p-vertices having some connector label i but these p-vertices may be distinct vertices in the l-graph. Each bad single n-vertex may be connected to at most two other bad single n-vertices (where the connectors are the p-labels of the p-vertices adjacent to that bad single n-vertex), and hence this notion of connectedness partitions the set of bad single n-vertices into connected components that are either individual vertices, linear chains, or cycles. The motivation for the above definition comes from the observation that if two bad single n-vertices are connected, then they must have the same n-label, as follows from condition (3) of Definition 4.3 and the fact that n-labels appearing on good singles and good pairs must be distinct from those appearing on n-vertices that are neither good singles nor good pairs.
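Since each bad single n-vertex is connected to at most two others, the partition just described is easy to compute mechanically. The following is a small illustrative sketch (Python; representing the connections as an explicit edge list is an assumption made only for this illustration, and the degenerate case of two vertices joined by two distinct connectors is ignored). It classifies each connected component as an isolated vertex, a chain, or a cycle.

```python
from collections import defaultdict

def classify_components(vertices, connections):
    """`vertices` lists the bad single n-vertices and `connections` lists unordered
    pairs (v, w) of connected bad singles.  As every vertex has at most two
    neighbours, each component is an isolated vertex, a chain, or a cycle."""
    adj = defaultdict(set)
    for v, w in connections:
        adj[v].add(w)
        adj[w].add(v)
    seen, counts = set(), {"isolated": 0, "chain": 0, "cycle": 0}
    for v in vertices:
        if v in seen:
            continue
        comp, stack = set(), [v]          # depth-first search over the component
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u])
        seen |= comp
        if len(comp) == 1:
            counts["isolated"] += 1
        elif all(len(adj[u]) == 2 for u in comp):
            counts["cycle"] += 1
        else:
            counts["chain"] += 1
    return counts

# toy usage: a 3-cycle {1,2,3}, a chain {4,5}, and an isolated vertex 6
print(classify_components([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (3, 1), (4, 5)]))
# {'isolated': 1, 'chain': 1, 'cycle': 1}
```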

Lemma A.21. Suppose a multi-labeling of an l-graph has excess ∆ and k single n-vertices, of which k′ are good single and k − k′ are bad single. Then at least k − k′ − (288D + 2)∆ distinct p-labels are connectors, and there are at most (192D + 1)∆ connected cycles of bad single n-vertices.

Proof. Suppose the multi-labeling is a (p, n)-multi-labeling. Construct an undirected multi-graph G on p vertices labeled {1, . . . , p}, with each edge of G having one label in {1, . . . , n}, such that the following is true: Corresponding to each n-label j of each n-vertex V in the multi-labeling, if V is preceded and followed by p-vertices having labels i1 and i2, there is an edge between vertices i1 and i2 of G with label j. (Hence, if the multi-labeling has ∑_{s=1}^l d_s total n-labels, then G has ∑_{s=1}^l d_s total edges.) Condition (3) of Definition 4.3 states that for any n-label j, each vertex of G has even degree in the sub-graph consisting of only edges with label j.

We will sequentially remove the edges of G corresponding to the good pair and good single n-vertices of the multi-labeled l-graph, until only the edges of G corresponding to the n-vertices that are not good pairs or good singles remain. At any stage of this removal process, let us call a vertex of G “active” if there is at least one edge still adjacent to that vertex. Let us define a “component” as the set of active vertices that may be reached by traversing the remaining edges of G from a particular active vertex. (Hence a component of G is a connected component, in the standard sense, that contains at least two vertices.) We will keep track of the quantity

M = #{active vertices} + #{distinct edge labels} − #{components}.

Note that initially, if the l-graph multi-labeling has m distinct p-labels and n-labels, then m is also the number of active vertices plus the number of distinct edge labels of G. Also, G initially has only one component, so M = m − 1. Let us now remove the edges of G corresponding to the good pairs of the l-graph. If an n-vertex of a good pair has d n-labels, then the good pair corresponds to 2d edges between a single pair of vertices in G, having d distinct edge labels that do not occur elsewhere in G. Hence removing these 2d edges of G removes d distinct edge labels, and if this also changes the connectivity structure of G, then either the number of components increases by 1, the number of components stays the same but the number of active vertices decreases by 1, or the number of components decreases by 1 and the number of active vertices decreases by 2. In all of these cases, upon removing these 2d edges from G, M decreases by at most d + 1. Then after removing all edges of G corresponding to good pairs,

M ≥ m − 1 − (∑_{s=1}^l d_s − k)/2 − (l − k)/2 = k − ∆,

as there are at most (∑_{s=1}^l d_s − k)/2 distinct n-vertex labels for the good pairs and at most (l − k)/2 good pairs.

Let us now remove the edges of G corresponding to the good single n-vertices in the l-graph. Let j be an n-label that appears on a good single n-vertex, and consider removing the edges of G with label j one at a time. As each vertex of G has even degree in the subgraph of edges of G with label j, when the first such edge is removed, the number of components and active vertices cannot change. Subsequently, the removal of each additional edge might increase the number of components by 1, keep the number of components the same and decrease the number of active vertices by 1, or decrease the number of components by 1 and the number of active vertices by 2. When the last such edge is removed, there are no longer any edges with label j by the definition of a good single, so the number of distinct edge labels decreases by 1. Hence removing all edges with label j decreases M by at most the number of such edges, and M ≥ k − k′ − ∆ after removing the edges corresponding to all k′ good singles.

Call the resulting graph G′. Note that every vertex of G′ still has even degree in the sub-graph consisting of edges with label j, for any j, and in particular, every active vertex of G′ has degree at least two. Then by Definition A.20, a p-label i of the l-graph is a connector if and only if vertex i has degree exactly two in G′, in which case the edges adjacent to i in G′ must have the same label j, and the n-vertices with label j in the l-graph are the bad singles connected by connector i. A connected cycle of bad single n-vertices in the l-graph corresponds to the edges of a cycle of (necessarily distinct) vertices in G′ with degree exactly two.

The number of distinct edge labels that remain in G′ is the number of distinct n-labels in the original l-graph multi-labeling that appear on n-vertices that are neither good singles nor part of good pairs, which by Lemma A.16 is at most 96D∆. Hence the number of active vertices minus the number of components of G′ is at least k − k′ − (96D + 1)∆, by our lower bound on M. The number of total edges in G′ is at most k − k′ + 96D∆, with k − k′ of them corresponding to bad single n-vertices of the l-graph and at most 96D∆ of them corresponding to non-single n-vertices that are not part of good pairs. Then the total vertex degree of G′ is at most 2(k − k′ + 96D∆). As each active vertex in G′ has degree at least two, this implies there are at most k − k′ + 96D∆ active vertices. Then there are at most (192D + 1)∆ components in G′, and hence at most (192D + 1)∆ connected cycles of bad single n-vertices in the l-graph. Furthermore, if x active vertices in G′ have degree exactly two in G′, then as there are at least k − k′ − (96D + 1)∆ active vertices, a pigeonhole argument implies 2x + 4(k − k′ − (96D + 1)∆ − x) ≤ 2(k − k′ + 96D∆), so x ≥ k − k′ − (288D + 2)∆. Hence there are at least k − k′ − (288D + 2)∆ connectors in the l-graph multi-labeling. □
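The bookkeeping quantity M used in the preceding proof can also be evaluated mechanically at any stage of the edge-removal process. The following is a minimal sketch (Python, illustrative only; edges of G are assumed to be given as (i1, i2, j) triples recording the two p-label endpoints and the n-label of the edge).

```python
def bookkeeping_M(edges):
    """Compute M = #{active vertices} + #{distinct edge labels} - #{components}
    for a multigraph whose remaining edges are (i1, i2, j) triples.  A vertex is
    active if some remaining edge is adjacent to it; components are taken over
    the active vertices, as in the proof of Lemma A.21."""
    active, labels, parent = set(), set(), {}

    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i1, i2, j in edges:
        labels.add(j)
        for v in (i1, i2):
            active.add(v)
            parent.setdefault(v, v)
        parent[find(i1)] = find(i2)   # merge the two endpoints' components

    components = len({find(v) for v in active})
    return len(active) + len(labels) - components

# toy usage: a good pair contributes two parallel edges carrying one label
print(bookkeeping_M([(1, 2, 1), (1, 2, 1), (2, 3, 2)]))  # 3 + 2 - 1 = 4
```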

Proof of Proposition 4.15. For notational convenience, let C denote a positive constant that may depend on D and that may change from instance to instance. Let ϕ be as defined in Definition A.14. As ϕ preserves the p-labels, up to reordering, clearly condition (1) of Proposition 4.15 holds.

To verify condition (2), let L ∈ C be any multi-labeling equivalence class. Let ϕ(L) have k n-vertices with non-empty label. This means L has k n-vertices that do not belong to a good pair. These vertices have at least k total n-labels in L, implying that there are at most ∑_{s=1}^l d_s − k total n-labels on the good pair vertices, where d1, . . . , dl are the number of n-labels on the first through lth n-vertices, respectively, in L. These good pair vertices account for at most (∑_{s=1}^l d_s − k)/2 distinct n-labels in L, and these are mapped to the empty label under the label-simplifying map. Furthermore, by Lemma A.16, there are at most 2C∆(L) n-vertices that are not single and that also do not belong to a good pair in L, and hence these have at most 2CD∆(L) additional distinct n-labels in L that are mapped to the new n-label n + 1 under the label-simplifying map. Note that any bad single n-vertex has an n-label that is the same as one of these 2CD∆(L) distinct n-labels (otherwise it is a good single by definition), and the n-label of any good single n-vertex is preserved under the label-simplifying map. Hence, if m is the number of total distinct p-labels and n-labels in L and m̄ is the number of total distinct p-labels and non-empty n-labels in ϕ(L), then, as the set of distinct p-labels is unchanged under the label-simplifying map, this implies m̄ ≥ m − (∑_{s=1}^l d_s − k)/2 − 2CD∆(L), so ∆(ϕ(L)) = (l + k)/2 + 1 − m̄ ≤ (2CD + 1)∆(L). Hence condition (2) holds.

It remains to verify condition (3). Fix L̄ ∈ C̄, and let the “canonical simple-labeling” in the class L̄ be the one in which each ith new p-label that appears in the l-graph cycle is label i and each jth new n-label that appears in the l-graph cycle is label j. Note that as there are at most l distinct p-labels and l distinct n-labels, the canonical simple-labeling is an (l, l)-simple-labeling of the l-graph. For notational convenience, let us still denote this canonical simple-labeling by L̄ when the meaning is clear. Consider the below (non-determined) procedure that constructs a multi-labeling from L̄.

(1) Choose an n-label in {1, . . . , l} to be the “new label”, or assume there is no new label. (n-vertices in L̄ with the new label will be the ones that are neither good singles nor part of good pairs in the multi-labeling.)
(2) For all p-vertices, copy its p-label from L̄ to the multi-labeling, and for all n-vertices with non-empty n-label that is not the “new label” in L̄, copy its n-label from L̄ to the multi-labeling.
(3) Among n-vertices having the new label in L̄, choose a subset S of them that will correspond to the non-single n-vertices that are not part of good pairs.
(4) For each n-vertex in S, choose the size of its n-label tuple in the multi-labeling to be between 2 and D, and pick the n-labels for that tuple.
(5) For each n-vertex with the new label but not belonging to S, pick one n-label for that n-vertex in the multi-labeling.
(6) For all n-vertices with empty label in L̄, pair them up into good pairs for the multi-labeling.
(7) Let G be the set of good pairs (V, V ′) in the multi-labeling that are consecutive n-vertices in the l-graph, and such that the p-label of the p-vertex between them appears at least twice. Choose an ordered subset of G. For each (V, V ′) in this ordered subset, if W is the p-vertex between V and V ′, choose some other p-vertex W ′ having the same p-label as W, and either reverse the sequence of vertices from W to W ′ or reverse the sequence of vertices from W ′ to W.
(8) For each good pair in the multi-labeling, choose the size of its n-label tuple to be between 2 and D, pick the n-labels for the first vertex of the good pair, and pick a permutation of these n-labels for the second vertex of the good pair.

The above procedure is non-determined in the sense that there are many ways to perform each of the above steps, and hence many different multi-labelings may be the output of the procedure for a single canonical simple-labeling L̄. (The above procedure may, in addition, construct labelings that are not valid multi-labelings according to Definition 4.3, but those will be irrelevant for our argument.) We claim that for any L ∈ ϕ^{−1}(L̄), there is a multi-labeling in class L that may be constructed from L̄ according to the above procedure, and furthermore that this multi-labeling is an (l, Dl)-multi-labeling such that, for each good pair, the n-labels of the first n-vertex in the pair are the smallest n-labels not yet appearing in the labeling and are in sorted order.

To verify this claim, note that the steps of the above procedure “invert” the label-simplifying map in Definition A.11. The “new label” chosen in step (1) above corresponds to the label n + 1 that is given to n-vertices that are neither good singles nor part of good pairs in the multi-labeling under the label-simplifying map (except, of course, if we take L̄ to be the canonical simple-labeling equivalent to the output of the label-simplifying map, then the new label is no longer n + 1 in L̄). If no new label is chosen, this implies that the multi-labeling we construct has all of its n-vertices being either a good single or part of a good pair. Steps (2)–(6) and (8) above invert the process by which p-labels and n-labels are mapped from the multi-labeling to the simple-labeling under steps (2) and (3) of Definition A.11 for the label-simplifying map. Note that if step (1) of the label-simplifying map in Definition A.11 is performed on any multi-labeling, each reversal that is performed causes an additional good pair (V, V ′) of n-vertices that were not consecutive to become consecutive in the l-graph, and they remain consecutive under each subsequent reversal. Hence, starting with the final multi-labeling in which all good pairs are proper, we may perform this sequence of reversals in reverse order to recover the original multi-labeling. So step (7) above inverts step (1) of Definition A.11. This establishes that, for any multi-labeling equivalence class L ∈ ϕ^{−1}(L̄), there exists some multi-labeling in class L that may be constructed by the above procedure from L̄. The above procedure picks new n-labels for the non-good-single n-vertices of this multi-labeling in steps (4), (5), and (8), but since it does not specify which n-labels are picked, there is an (l, Dl)-multi-labeling equivalent to it that may also be constructed by the above procedure, and such that for each of its good pairs, the n-labels of the first n-vertex in the pair are the smallest n-labels not yet appearing in the labeling and are in sorted order. This establishes our claim.

Thus, to verify condition (3) of the proposition, it suffices to upper-bound the number of ways in which each of the above steps (1)–(8) may be performed while still ensuring that the resulting multi-labeling is a valid (l, Dl)-multi-labeling satisfying Definition 4.3, having excess ∆0, and such that the first n-vertex of each good pair has n-labels that are the smallest n-labels not yet appearing in the labeling and are in sorted order. As there are at most l distinct n-labels in L̄, there are at most l + 1 ways of choosing the new label or choosing no new label in step (1). There is only one way of performing step (2). By Lemma A.16, for a multi-labeling with excess ∆0, there can be at most C∆0 n-vertices that are not single and also do not belong to a good pair. Hence, to obtain a multi-labeling with excess ∆0, we may restrict our selection of S in step (3) to be of size at most C∆0, so there are at most (l + 1)^{C∆0} ways of choosing S in step (3). To perform step (4), for each vertex in S, we may first choose the number of n-labels d between 1 and D, and then there are at most (Dl)^d ways of choosing the n-labels for that vertex.

For step (5), suppose that k single n-vertices and k′ good single n-vertices have been identified in steps (1)–(4). In step (5) we must assign n-labels to each of the k − k′ n-vertices that are bad singles. Recall the notions of connectors and connected bad singles from Definition A.20. By Lemma A.21, there are at least k − k′ − C∆0 connectors. By condition (3) of Definition 4.3, any two of the k − k′ bad single n-vertices connected by a connector must be given the same n-label in the multi-labeling. Hence, going through the connectors one-by-one, each successive connector constrains the n-label of one more bad single n-vertex, unless that connector closes a connected cycle of such vertices. But as there are at most C∆0 total connected cycles by Lemma A.21, this implies that the number of bad single n-vertices that we can freely label is at most C∆0 (rather than the naive bound of at most k − k′). Then there are at most (Dl)^{C∆0} ways to perform step (5).

For step (6), recall from Remark A.9 that the pairs of p-vertices surrounding the two n-vertices of each good pair must have the same pair of p-labels. By Lemma A.15, for all but at most C∆0 of the n-vertices with empty label, this pairing is uniquely determined. Then there are at most (C∆0)^{C∆0} ways of performing the pairing in step (6). For step (7), Lemma A.19 shows that |G| ≤ 2∆0. Then there are at most 2∆0 ways of choosing each successive element in the ordered subset of G, and at most 2l ways of choosing the other p-vertex W ′ as well as which half of the cycle to reverse for each such added element. As the size of the ordered subset is also at most 2∆0, and we may choose to not add any more elements to the ordered subset at any point, there are at most (4∆0 l + 1)^{2∆0} ways of performing step (7). Finally, for step (8), for each good pair we may first choose the number of n-labels d between 2 and D. As we are requiring that the first vertex of the pair have the smallest n-labels not yet appearing in the multi-labeling and in sorted order, this determines uniquely the choice of n-labels for this first vertex in the good pair. We may then choose the permutation of these labels for the second vertex in the pair from one of d! choices.


Combining the above arguments, we obtain the bound

∑_{L ∈ ϕ^{−1}(L̄), ∆(L) = ∆0} ∏_{s=1}^l |a_{d_s(L)}| / (d_s(L)!)^{1/2}
  ≤ (l + 1) ∑_S |a_1|^{k(L̄)−|S|} (∑_{d=2}^D (Dl)^d |a_d| / (d!)^{1/2})^{|S|} (Dl)^{C∆0} (C∆0)^{C∆0} (4∆0 l + 1)^{2∆0} (∑_{d=2}^D d! a_d² / d!)^{(l−k(L̄))/2}
  ≤ (l + 1)(Cl)^{C∆0} |a|^{k(L̄)} (ν − a²)^{(l−k(L̄))/2} ∑_S |a|^{−|S|} (∑_{d=2}^D (Dl)^d |a_d| / (d!)^{1/2})^{|S|},

where the summation over S represents the sum over all possible sets S selected by step (3) of the procedure above. As |S| ≤ C∆0 by our preceding argument, this implies

|a|^{−|S|} (∑_{d=2}^D (Dl)^d |a_d| / (d!)^{1/2})^{|S|} ≤ (Cl)^{C∆0} |a|^{−|S|} √D (∑_{d=2}^D a_d² / d!)^{|S|/2} ≤ √D (Cl)^{C∆0} (√ν / |a|)^{C∆0}.

The sum is over at most (l + 1)^{C∆0} possible sets S, so this verifies condition (3) of the proposition upon noting that √D (l + 1)^{C∆0+1} (Cl)^{C∆0} ≤ l^{C3 + C4∆0} for some constants C3, C4 > 0 and all l ≥ 2. □

Appendix B. Moments of a deformed GUE matrix

In this Appendix, we prove Proposition 4.9. Recall Definition 4.8 of M_{n,p}, W_p, V_{n,p}, and Z_{n,p}. Throughout this section, we will write simply p and n for the dimension parameters appearing there, and we will suppress the dependence of W, V, and Z on n and p.

Lemma B.1. Suppose n, p → ∞ with p/n → γ. Then ‖M_{n,p}‖ → ‖µ_{a,ν,γ}‖ almost surely.

Proof. Recall M_{n,p} = √(γ(ν − a²)/p) W + (a/n)V, where V = ZZ^T − D and D = diag(‖Z_i‖₂²). We will apply Proposition 8.1 of [6], conditional on V. The empirical spectral distribution of (1/n)ZZ^T converges weakly a.s. to µ_{MP,γ}, the Marcenko-Pastur law with parameter γ. By a standard chi-squared tail bound and a union bound, for any ε > 0, P[max_{1≤i≤p} |‖Z_i‖₂² − n| > εn] ≤ 2p e^{−nε²/8}. Then the Borel-Cantelli lemma implies ‖(1/n)D − I‖ → 0 a.s., and hence the empirical spectral distribution of (a/n)V converges weakly a.s. to the translated and scaled Marcenko-Pastur law µ_{MP,shift} of Proposition 2.11. Furthermore, µ_{MP,γ} has compact support, and the maximal distance between an eigenvalue of (1/n)ZZ^T and the support of µ_{MP,γ} converges to 0 a.s. by the results of [34] and [1]. Hence the same is true of µ_{MP,shift} and (a/n)V.

Let V = OΛO^T where O is the real orthogonal matrix that diagonalizes V. Then the spectrum of M_{n,p} is the same as that of √(γ(ν − a²)/p) O^T W O + (a/n)Λ. By the above argument, conditional on V, (a/n)Λ is a non-random diagonal matrix whose empirical spectral distribution converges weakly to µ_{MP,shift} and such that the maximal distance between any of its diagonal entries and supp(µ_{MP,shift}) converges to 0 (a.s. in V). Furthermore, O^T W O is still distributed as the GUE by unitary invariance. Hence the conditions of [6] are satisfied with no spike eigenvalues, so, conditional on V, Proposition 8.1 of [6] implies that the spectral norm of √(γ(ν − a²)/p) O^T W O + (a/n)Λ converges a.s. to sup{|x| : x ∈ supp(µ_sc ⊞ µ_{MP,shift})}. As this convergence holds a.s. in V and the limit does not depend on V, it holds a.s. unconditionally as well. The result then follows from Proposition 2.11. □
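As a numerical sanity check on Lemma B.1 (not part of the proof), one can simulate the deformed model directly. The sketch below (Python/NumPy) draws W from the GUE with the normalization used in this appendix, namely real N(0, 1) diagonal entries and off-diagonal entries whose real and imaginary parts have variance 1/2, forms M_{n,p} = √(γ(ν − a²)/p) W + (a/n)(ZZ^T − D), and prints its spectral norm for one realization; the finite dimensions and the values of a and ν are arbitrary choices for the illustration, and the limiting value ‖µ_{a,ν,γ}‖ itself is not computed here.

```python
import numpy as np

def sample_M(n, p, a, nu, rng):
    """Draw one realization of M_{n,p} = sqrt(gamma*(nu - a^2)/p)*W + (a/n)*(Z Z^T - D),
    with W a p x p GUE matrix, Z a p x n matrix of i.i.d. N(0,1) entries,
    D = diag(||Z_i||_2^2), and gamma = p/n."""
    gamma = p / n
    A = rng.standard_normal((p, p)) + 1j * rng.standard_normal((p, p))
    W = (A + A.conj().T) / 2.0                      # off-diagonal Re, Im ~ N(0, 1/2)
    W[np.diag_indices(p)] = rng.standard_normal(p)  # real N(0, 1) diagonal
    Z = rng.standard_normal((p, n))
    V = Z @ Z.T - np.diag(np.sum(Z**2, axis=1))     # Z Z^T - D
    return np.sqrt(gamma * (nu - a**2) / p) * W + (a / n) * V

rng = np.random.default_rng(0)
M = sample_M(n=1000, p=500, a=0.5, nu=1.0, rng=rng)
print(np.linalg.norm(M, ord=2))  # spectral norm of this realization of M_{n,p}
```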


Lemma B.2. Suppose n, p → ∞ with p/n → γ, let l := l(n) be such that l(n)/n → 0, and let B_n be any event. Then there exist positive constants C := C_{a,ν,γ} and c := c_{a,ν,γ} such that E[‖M_{n,p}‖^l 1{B_n}] ≤ C^l P[B_n] + e^{−cn} for all large n.

Proof. Note

‖M_{n,p}‖ ≤ √(γ(ν − a²)/p) ‖W‖ + (|a|/n) ‖ZZ^T‖ + (|a|/n) max_{1≤i≤p} ‖Z_i‖₂².

By Corollary 2.3.5 of [29], there exist positive constants A and B such that for all t ≥ A,

P[‖W‖ > t√p] ≤ A e^{−Btp}.

By Corollary 5.35 of [31], for all t ≥ 0,

P[‖ZZ^T‖ > (√n + √p + √(tn))²] ≤ P[‖Z‖ > √n + √p + √(tn)] ≤ 2e^{−tn/2}.

By a standard chi-squared tail bound and a union bound, for all t ≥ 0,

P[max_{1≤i≤p} ‖Z_i‖₂² > n + n√t] ≤ p e^{−tn/8}.

Hence, for all t ≥ A and sufficiently large n,

P[‖M_{n,p}‖ > t√(γ(ν − a²)) + |a|(1.1 + √γ + √t)² + |a|(1 + √t)] ≤ A e^{−Btp} + 2e^{−tn/2} + p e^{−tn/8}.

So there exist constants C, ε > 0 depending on a, ν, γ such that, for all t ≥ C and sufficiently large n, P[‖M_{n,p}‖ > t] ≤ e^{−εtn}. Then we may write

E[‖M_{n,p}‖^l 1{B_n}] = E[‖M_{n,p}‖^l 1{B_n} 1{‖M_{n,p}‖ ≤ C}] + E[‖M_{n,p}‖^l 1{B_n} 1{‖M_{n,p}‖ > C}]
  ≤ C^l P[B_n] + ∫_{C^l}^∞ P[‖M_{n,p}‖^l > t] dt
  = C^l P[B_n] + ∫_C^∞ P[‖M_{n,p}‖ > s] · l s^{l−1} ds
  ≤ C^l P[B_n] + l ∫_C^∞ e^{−εsn + (l−1) log s} ds
  ≤ C^l P[B_n] + l ∫_C^∞ e^{−(εn−l)s} ds
  = C^l P[B_n] + (l/(εn − l)) e^{−(εn−l)C}

for all large n. As l = o(n), the result follows upon setting c = Cε/2. □

Lemma B.3. Suppose n, p → ∞ with p/n → γ. Then E[‖M_{n,p}‖] → ‖µ_{a,ν,γ}‖.

Proof. By Lemma B.1, ‖M_{n,p}‖ → ‖µ_{a,ν,γ}‖ almost surely. Then lim inf E[‖M_{n,p}‖] ≥ E[lim inf ‖M_{n,p}‖] = ‖µ_{a,ν,γ}‖ by Fatou's lemma, so it suffices to show lim sup E[‖M_{n,p}‖] ≤ ‖µ_{a,ν,γ}‖ + ε for all ε > 0. Let B_n = {‖M_{n,p}‖ > ‖µ_{a,ν,γ}‖ + ε/2}. Then

E[‖M_{n,p}‖] = E[‖M_{n,p}‖ 1{B_n^C}] + E[‖M_{n,p}‖ 1{B_n}] ≤ ‖µ_{a,ν,γ}‖ + ε/2 + E[‖M_{n,p}‖ 1{B_n}].

Lemma B.1 implies P[B_n] → 0, so Lemma B.2 (with l = 1) implies E[‖M_{n,p}‖ 1{B_n}] → 0 as well. Then E[‖M_{n,p}‖] ≤ ‖µ_{a,ν,γ}‖ + ε for all large n, as desired. □

Lemma B.4. Suppose F : R^d → R is L-Lipschitz on a set G ⊆ R^d, i.e. |F(x) − F(y)| ≤ L‖x − y‖₂ for all x, y ∈ G. Let ξ ∼ N(0, I_d). Then there exists a function F̄ : R^d → R such that F̄(x) = F(x) for all x ∈ G, |F̄(x) − F̄(y)| ≤ L‖x − y‖₂ for all x, y ∈ R^d, and, for all ∆ > 0,

P[F(ξ) − EF(ξ) ≥ ∆ + |EF(ξ) − EF̄(ξ)| and ξ ∈ G] ≤ e^{−∆²/(2L²)}.


Proof. Let F̄(x) = inf_{x′∈G}(F(x′) + L‖x − x′‖₂). Note that if x ∈ G, then F(x) ≤ F(x′) + L‖x − x′‖₂ for all x′ ∈ G, so F̄(x) = F(x). Also, for any x, y ∈ R^d and ε > 0, there exists x′ ∈ G such that F̄(x) ≥ F(x′) + L‖x − x′‖₂ − ε. Then by definition, F̄(y) ≤ F(x′) + L‖y − x′‖₂, so F̄(y) − F̄(x) ≤ L‖y − x′‖₂ − L‖x − x′‖₂ + ε ≤ L‖x − y‖₂ + ε. Similarly, F̄(x) − F̄(y) ≤ L‖x − y‖₂ + ε. This holds for all ε > 0, so |F̄(x) − F̄(y)| ≤ L‖x − y‖₂. Finally, applying Gaussian concentration of measure for the Lipschitz function F̄,

P[F(ξ) − EF(ξ) ≥ ∆ + |EF(ξ) − EF̄(ξ)| and ξ ∈ G]
  = P[F̄(ξ) ≥ ∆ + |EF(ξ) − EF̄(ξ)| + EF(ξ) and ξ ∈ G]
  ≤ P[F̄(ξ) ≥ ∆ + EF̄(ξ)] ≤ e^{−∆²/(2L²)}. □
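The extension used in Lemma B.4 is the classical construction F̄(x) = inf_{x′∈G}(F(x′) + L‖x − x′‖₂), and it is easy to experiment with numerically. The following is a small illustrative sketch (Python/NumPy; the finite point set standing in for G and the particular function F are assumptions made only for this illustration).

```python
import numpy as np

def lipschitz_extension(F, G_pts, L):
    """Return Fbar(x) = min over x' in G_pts of (F(x') + L * ||x - x'||_2), the
    L-Lipschitz extension of F from the finite point set G_pts (shape (m, d))
    to all of R^d; Fbar agrees with F on G_pts when F is L-Lipschitz there."""
    G_pts = np.asarray(G_pts, dtype=float)
    F_vals = np.array([F(x) for x in G_pts])

    def Fbar(x):
        dists = np.linalg.norm(G_pts - np.asarray(x, dtype=float), axis=1)
        return float(np.min(F_vals + L * dists))

    return Fbar

# toy usage: F(x) = ||x||_2, which is 1-Lipschitz, restricted to a cube of points
rng = np.random.default_rng(1)
G_pts = rng.uniform(-1.0, 1.0, size=(200, 3))
Fbar = lipschitz_extension(lambda x: np.linalg.norm(x), G_pts, L=1.0)
print(Fbar(G_pts[0]) - np.linalg.norm(G_pts[0]))  # ~0: the extension agrees with F on G
print(Fbar(np.array([5.0, 0.0, 0.0])))            # a finite value far outside G
```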

Lemma B.5. Suppose n, p → ∞ with p/n → γ, and let ε > 0. Then there exist c := c_{a,ν,γ} > 0 and N := N_{a,ν,γ,ε} > 0 and a set G := G_{n,p} ⊂ R^{p×n} with P[Z ∈ G] ≥ 1 − 2e^{−n/2}, such that for all t > ε and n > N,

P[‖M_{n,p}‖ ≥ ‖µ_{a,ν,γ}‖ + t and Z ∈ G] ≤ e^{−cnt²}.

Proof. Recall M_{n,p} = √(γ(ν − a²)/p) W + (a/n)(ZZ^T − diag(‖Z_i‖₂²)). Denote

W = ((w_ii)_{1≤i≤p}, (√2 Re w_ij, √2 Im w_ij)_{1≤i<j≤p}) ∈ R^{p²},

so that the entries of W and Z are i.i.d. N(0, 1). Let f : R^{p²+np} → R and f_v : R^{p²+np} → R for v ∈ C^p be given by f(W, Z) = ‖M_{n,p}‖ and f_v(W, Z) = v*M_{n,p}v, so that f(W, Z) = sup_{v∈C^p: ‖v‖₂=1} |f_v(W, Z)|. Note that

f_v(W, Z) = √(γ(ν − a²)/p) (∑_{i=1}^p w_ii |v_i|² + ∑_{1≤i<j≤p} w_ij v̄_i v_j + w̄_ij v_i v̄_j) + (a/n) ∑_{1≤i<j≤p} (v̄_i v_j + v_i v̄_j) Z_i^T Z_j
  = √(γ(ν − a²)/p) (∑_{i=1}^p w_ii |v_i|² + ∑_{1≤i<j≤p} (Re w_ij)(v̄_i v_j + v_i v̄_j) + i (Im w_ij)(v̄_i v_j − v_i v̄_j)) + (a/n) ∑_{1≤i<j≤p} (v̄_i v_j + v_i v̄_j) Z_i^T Z_j,

∂f_v(W, Z)/∂w_ii = √(γ(ν − a²)/p) |v_i|²,
∂f_v(W, Z)/∂(√2 Re w_ij) = √(2γ(ν − a²)/p) Re(v̄_i v_j),
∂f_v(W, Z)/∂(√2 Im w_ij) = −√(2γ(ν − a²)/p) Im(v̄_i v_j),
∇_{Z_i} f_v(W, Z) = (2a/n) ∑_{j=1, j≠i}^p Re(v̄_i v_j) Z_j.

Then, for any v ∈ C^p such that ‖v‖₂ = 1,

‖∇f_v(W, Z)‖₂² = (γ(ν − a²)/p) (∑_{i=1}^p |v_i|⁴ + 2 ∑_{1≤i<j≤p} |v_i v_j|²) + (4a²/n²) ∑_{i=1}^p ‖∑_{j=1, j≠i}^p Re(v̄_i v_j) Z_j‖₂²
  ≤ (γ(ν − a²)/p) (∑_{i=1}^p |v_i|²)² + (4a²/n²) ∑_{i=1}^p |v_i|² ‖Z‖² ‖v‖₂²
  = γ(ν − a²)/p + 4a²‖Z‖²/n².

Take G = {Z ∈ R^{p×n} : ‖Z‖ ≤ 2√n + √p}. Then by Corollary 5.35 of [31], P[Z ∉ G] ≤ 2e^{−n/2}. As R^{p²} × G is convex, the above inequality implies f_v(W, Z) is L-Lipschitz on R^{p²} × G for L := (γ(ν − a²)/p + 4a²(2√n + √p)²/n²)^{1/2} = O(n^{−1/2}). Then

f(W, Z) − f(W ′, Z′) ≤ sup_{v∈C^p: ‖v‖₂=1} (|f_v(W, Z)| − |f_v(W ′, Z′)|) ≤ sup_{v∈C^p: ‖v‖₂=1} |f_v(W, Z) − f_v(W ′, Z′)| ≤ L‖(W, Z) − (W ′, Z′)‖₂

for all W, W ′ ∈ R^{p²} and Z, Z′ ∈ G, so f is also L-Lipschitz on R^{p²} × G.

Let f̄ : R^{p²+np} → R be the L-Lipschitz extension of f on R^{p²} × G given by Lemma B.4. Note that

|Ef(W, Z) − Ef̄(W, Z)| = |E[(f(W, Z) − f̄(W, Z)) 1{Z ∉ G}]| ≤ E|f(W, Z) 1{Z ∉ G}| + E|f̄(W, Z) 1{Z ∉ G}|.

Lemma B.2 (with l = 1) implies E|f(W, Z) 1{Z ∉ G}| = E[‖M_{n,p}‖ 1{Z ∉ G}] = o(1). As f̄ is L-Lipschitz,

|f̄(W, Z)| ≤ |f̄(0, 0)| + L‖(W, Z)‖₂ = |f(0, 0)| + L‖(W, Z)‖₂ = L‖(W, Z)‖₂.

Let A_n = {‖(W, Z)‖₂ ≤ √(2(p² + np))}. As ‖(W, Z)‖₂² is chi-squared distributed with p² + np degrees of freedom, a standard tail bound gives P[‖(W, Z)‖₂² ≥ p² + np + t] ≤ e^{−t²/(8(p²+np))}. Then

E[‖(W, Z)‖₂² 1{A_n^C}] = ∫_{p²+np}^∞ P[‖(W, Z)‖₂² ≥ p² + np + t] dt ≤ ∫_{p²+np}^∞ e^{−t²/(8(p²+np))} dt = 2√(p² + np) ∫_{√(p²+np)/2}^∞ e^{−s²/2} ds ∼ 4e^{−(p²+np)/8}.

This implies

E|f̄(W, Z) 1{Z ∉ G}| ≤ E[|f̄(W, Z)| 1{Z ∉ G} 1{A_n}] + E[|f̄(W, Z)| 1{Z ∉ G} 1{A_n^C}] ≤ L√(2(p² + np)) P[Z ∉ G] + L E[‖(W, Z)‖₂² 1{A_n^C}]^{1/2} = o(1).

Then |Ef(W, Z) − Ef̄(W, Z)| = o(1), so Lemmas B.3 and B.4 imply, for all t > ε and all sufficiently large n (i.e. n > N_{a,ν,γ,ε} independent of t),

P[‖M_{n,p}‖ ≥ ‖µ_{a,ν,γ}‖ + t and Z ∈ G] ≤ P[‖M_{n,p}‖ − E‖M_{n,p}‖ ≥ t − ε/2 + |Ef(W, Z) − Ef̄(W, Z)| and Z ∈ G] ≤ e^{−(t−ε/2)²/(2L²)} ≤ e^{−t²/(8L²)}.

The result follows upon noting that L = O(n^{−1/2}). □

Proof of Proposition 4.9. Let c > 0 and G ⊂ R^{p×n} be as given by Lemma B.5. Then, for any ε > 0,

E[‖M_{n,p}‖^l 1{Z ∈ G}] ≤ (‖µ_{a,ν,γ}‖ + ε)^l + E[‖M_{n,p}‖^l 1{‖M_{n,p}‖ ≥ ‖µ_{a,ν,γ}‖ + ε} 1{Z ∈ G}]
  = (‖µ_{a,ν,γ}‖ + ε)^l + ∫_{(‖µ_{a,ν,γ}‖+ε)^l}^∞ P[‖M_{n,p}‖^l ≥ t and Z ∈ G] dt
  = (‖µ_{a,ν,γ}‖ + ε)^l + ∫_{‖µ_{a,ν,γ}‖+ε}^∞ P[‖M_{n,p}‖ ≥ s and Z ∈ G] · l s^{l−1} ds
  ≤ (‖µ_{a,ν,γ}‖ + ε)^l + l ∫_ε^∞ e^{−cns²} (‖µ_{a,ν,γ}‖ + s)^{l−1} ds

for all sufficiently large n, where we have applied Lemma B.5. Note that

l ∫_ε^∞ e^{−cns²} (‖µ_{a,ν,γ}‖ + s)^{l−1} ds ≤ l ∫_ε^∞ e^{−cns² + l(‖µ_{a,ν,γ}‖ + s)} ds
  = l e^{l‖µ_{a,ν,γ}‖ + l²/(4cn)} ∫_ε^∞ e^{−cn(s − l/(2cn))²} ds
  = (l e^{l‖µ_{a,ν,γ}‖ + l²/(4cn)} / √(2cn)) ∫_{√(2cn)(ε − l/(2cn))}^∞ e^{−t²/2} dt
  ∼ (l e^{l‖µ_{a,ν,γ}‖ + l²/(4cn)} / (2cn(ε − l/(2cn)))) e^{−cn(ε − l/(2cn))²} → 0

for l = O(log n), so E[‖M_{n,p}‖^l 1{Z ∈ G}] ≤ (‖µ_{a,ν,γ}‖ + ε)^l + o(1). On the other hand, P[Z ∉ G] ≤ 2e^{−n/2} by Lemma B.5, so Lemma B.2 implies E[‖M_{n,p}‖^l 1{Z ∉ G}] = o(1) for l = O(log n). Hence E[‖M_{n,p}‖^l] ≤ (‖µ_{a,ν,γ}‖ + ε)^l + o(1). As ε > 0 was arbitrary, this proves the desired result. □

References

[1] Z. D. Bai and Y. Q. Yin. Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. The Annals of Probability, 21(3):1275–1294, 1993.
[2] Zhidong Bai and Jack W. Silverstein. Spectral analysis of large dimensional random matrices. Springer, 2010.
[3] Hari Bercovici and Dan Voiculescu. Free convolution of measures with unbounded supports. Indiana University Mathematics Journal, 42(3):733–773, 1993.
[4] Peter J. Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577–2604, 2008.
[5] T. Tony Cai and Harrison H. Zhou. Optimal rates of convergence for sparse covariance matrix estimation. The Annals of Statistics, 40(5):2389–2420, 2012.
[6] Mireille Capitaine, Catherine Donati-Martin, Delphine Féral, and Maxime Février. Free convolution with a semicircular distribution and eigenvalues of spiked deformations of Wigner matrices. Electronic Journal of Probability, 16(64):1750–1792, 2011.
[7] Mireille Capitaine and Sandrine Péché. Fluctuations at the edges of the spectrum of the full rank deformed GUE. Probability Theory and Related Fields, pages 1–45, 2015.
[8] Lennart Carleson. On Bernstein's approximation problem. Proceedings of the American Mathematical Society, 2(6):953–961, 1951.
[9] Xiuyuan Cheng and Amit Singer. The spectrum of random inner-product kernel matrices. Random Matrices: Theory and Applications, 2(4), 2013.
[10] Yash Deshpande and Andrea Montanari. Sparse PCA via covariance thresholding. arXiv preprint arXiv:1311.5179, 2013.
[11] Yen Do and Van Vu. The spectrum of random kernel matrices: Universality results for rough and varying kernels. Random Matrices: Theory and Applications, 2(3), 2013.
[12] Ken Dykema. On certain free product factors via an extended matrix model. Journal of Functional Analysis, 112(1):31–60, 1993.
[13] Noureddine El Karoui. Operator norm consistent estimation of large-dimensional sparse covariance matrices. The Annals of Statistics, 36(6):2717–2756, 2008.
[14] Noureddine El Karoui. The spectrum of kernel random matrices. The Annals of Statistics, 38(1):1–50, 2010.
[15] Zoltán Füredi and János Komlós. The eigenvalues of random symmetric matrices. Combinatorica, 1(3):233–241, 1981.
[16] Stuart Geman. A limit theorem for the norm of random matrices. The Annals of Probability, 8(2):252–261, 1980.
[17] Iain M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics, 29(2):295–327, 2001.
[18] Robert Krauthgamer, Boaz Nadler, and Dan Vilenchik. Do semidefinite relaxations really solve sparse PCA? arXiv preprint arXiv:1306.3690, 2013.
[19] Rafał Latała. Some estimates of norms of random matrices. Proceedings of the American Mathematical Society, 133(5):1273–1282, 2005.
[20] Ji Oon Lee and Kevin Schnelli. Edge universality for deformed Wigner matrices. arXiv preprint arXiv:1407.8015, 2014.
[21] Hans Maassen. Addition of freely independent random variables. Journal of Functional Analysis, 106(2):409–438, 1992.
[22] Camille Male. The norm of polynomials in large random and deterministic matrices. Probability Theory and Related Fields, 154(3-4):477–532, 2012.
[23] Vladimir A. Marcenko and Leonid Andreevich Pastur. Distribution of eigenvalues for some sets of random matrices. Sbornik: Mathematics, 1(4):457–483, 1967.
[24] Natesh S. Pillai and Jun Yin. Universality of covariance matrices. The Annals of Applied Probability, 24(3):935–1001, 2014.
[25] Adam J. Rothman, Elizaveta Levina, and Ji Zhu. Generalized thresholding of large covariance matrices. Journal of the American Statistical Association, 104(485):177–186, 2009.
[26] Jun Shao, Yazhen Wang, Xinwei Deng, and Sijian Wang. Sparse linear discriminant analysis by thresholding for high dimensional data. The Annals of Statistics, 39(2):1241–1265, 2011.
[27] Tatyana Shcherbina. On universality of local edge regime for the deformed Gaussian unitary ensemble. Journal of Statistical Physics, 143(3):455–481, 2011.
[28] Gábor Szegő. Orthogonal Polynomials. American Mathematical Society, 1939.
[29] Terence Tao. Topics in Random Matrix Theory. American Mathematical Society, 2012.
[30] Craig A. Tracy and Harold Widom. On orthogonal and symplectic matrix ensembles. Communications in Mathematical Physics, 177(3):727–754, 1996.
[31] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing, pages 210–268. Cambridge University Press, 2012.
[32] Dan Voiculescu. Addition of certain non-commuting random variables. Journal of Functional Analysis, 66(3):323–346, 1986.
[33] Dan Voiculescu. Limit laws for random matrices and free products. Inventiones mathematicae, 104(1):201–220, 1991.
[34] Y. Q. Yin, Z. D. Bai, and P. R. Krishnaiah. On the limit of the largest eigenvalue of the large dimensional sample covariance matrix. Probability Theory and Related Fields, 78(4):509–521, 1988.