Sub-Gaussian Estimators of the Mean of a Random Matrix with Entries Possessing Only Two Moments
Stas Minsker
University of Southern California
July 21, 2016
ICERM Workshop
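Simple question: how to estimate the mean?
Assume that $X_1, \ldots, X_n$ are i.i.d. $\mathcal{N}(\mu, \sigma_0^2)$.
Problem: construct $\mathrm{CI}_{\mathrm{norm}}(\alpha)$ for $\mu$ with coverage probability $\geq 1 - 2\alpha$.
Solution: compute $\hat\mu_n := \frac{1}{n}\sum_{j=1}^n X_j$ and take
$$\mathrm{CI}_{\mathrm{norm}}(\alpha) = \left[\hat\mu_n - \sigma_0\sqrt{2}\sqrt{\frac{\log(1/\alpha)}{n}},\ \hat\mu_n + \sigma_0\sqrt{2}\sqrt{\frac{\log(1/\alpha)}{n}}\right].$$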
Coverage is guaranteed since
$$\Pr\left(\left|\hat\mu_n - \mu\right| \geq \sigma_0\sqrt{\frac{2\log(1/\alpha)}{n}}\right) \leq 2\alpha.$$
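As a minimal sketch of the interval above, assuming $\sigma_0$ is known, the following Python computes $\mathrm{CI}_{\mathrm{norm}}(\alpha)$ for a simulated Gaussian sample. The function name `ci_norm`, the seed, and the sample parameters are illustrative choices and do not come from the talk.

```python
import math
import random


def ci_norm(xs, sigma0, alpha):
    """Interval mean +/- sigma0 * sqrt(2 * log(1/alpha) / n)."""
    n = len(xs)
    mean = sum(xs) / n
    half_width = sigma0 * math.sqrt(2.0 * math.log(1.0 / alpha) / n)
    return mean - half_width, mean + half_width


# Hypothetical data: 400 draws from N(1, 2^2); the seed is fixed for reproducibility.
rng = random.Random(0)
sample = [rng.gauss(1.0, 2.0) for _ in range(400)]
lo, hi = ci_norm(sample, sigma0=2.0, alpha=0.05)
print(f"CI_norm(0.05) = [{lo:.3f}, {hi:.3f}]")
```

The half-width shrinks like $1/\sqrt{n}$ and grows only like $\sqrt{\log(1/\alpha)}$ as the confidence level increases.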
Example: how to estimate the mean?
P. J. Huber (1964): "...This raises a question which could have been asked already by Gauss, but which was, as far as I know, only raised a few years ago (notably by Tukey): what happens if the true distribution deviates slightly from the assumed normal one?"
Going back to our question: what if $X_1, \ldots, X_n$ are i.i.d. copies of $X \sim \Pi$ such that
$$\mathbb{E}X = \mu, \quad \mathrm{Var}(X) \leq \sigma_0^2?$$
Problem: construct a CI for $\mu$ with coverage probability $\geq 1 - \alpha$ such that for any $\alpha$
$$\mathrm{length}(\mathrm{CI}(\alpha)) \leq (\text{absolute constant}) \cdot \mathrm{length}(\mathrm{CI}_{\mathrm{norm}}(\alpha)).$$
No additional assumptions on $\Pi$ are imposed.
Remark: the guarantee for the sample mean $\hat\mu_n = \frac{1}{n}\sum_{j=1}^n X_j$ is unsatisfactory, since Chebyshev's inequality only gives
$$\Pr\left(\left|\hat\mu_n - \mu\right| \geq \sigma_0\sqrt{\frac{1/\alpha}{n}}\right) \leq \alpha.$$
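To see how much is lost, the short sketch below compares the deviation threshold from the Chebyshev-type bound, $\sigma_0\sqrt{(1/\alpha)/n}$, with the Gaussian threshold $\sigma_0\sqrt{2\log(1/\alpha)/n}$. The function names and the values of $\alpha$ are illustrative assumptions.

```python
import math


def chebyshev_half_width(sigma0, n, alpha):
    """Deviation threshold sigma0 * sqrt((1/alpha) / n) for the sample mean."""
    return sigma0 * math.sqrt((1.0 / alpha) / n)


def gaussian_half_width(sigma0, n, alpha):
    """Deviation threshold sigma0 * sqrt(2 * log(1/alpha) / n) from the normal case."""
    return sigma0 * math.sqrt(2.0 * math.log(1.0 / alpha) / n)


for alpha in (0.1, 0.01, 0.001):
    ratio = chebyshev_half_width(1.0, 100, alpha) / gaussian_half_width(1.0, 100, alpha)
    print(f"alpha={alpha}: Chebyshev / Gaussian threshold ratio = {ratio:.2f}")
```

The ratio does not depend on $n$ or $\sigma_0$ and grows without bound as $\alpha \to 0$, which is why no absolute constant can relate the two lengths.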
Does the solution exist?
Answer (somewhat unexpected?): Yes!
Construction: [A. Nemirovski, D. Yudin '83; N. Alon, Y. Matias, M. Szegedy '96; R. Oliveira, M. Lerasle '11]
Split the sample into $k = \lfloor \log(1/\alpha) \rfloor + 1$ groups $G_1, \ldots, G_k$ of size $\simeq n/k$ each, with $G_1 = \{X_1, \ldots, X_{|G_1|}\}, \ldots, G_k = \{X_{n-|G_k|+1}, \ldots, X_n\}$. Compute the group means and take their median:
$$\hat\mu_j := \frac{1}{|G_j|}\sum_{X_i \in G_j} X_i, \quad j = 1, \ldots, k, \qquad \hat\mu_* = \hat\mu_*(\alpha) := \mathrm{median}(\hat\mu_1, \ldots, \hat\mu_k).$$
Claim:
$$\Pr\left(\left|\hat\mu_* - \mu\right| \geq 7.7\,\sigma_0\sqrt{\frac{\log(e/\alpha)}{n}}\right) \leq \alpha.$$
Then take
$$\mathrm{CI}(\alpha) = \left[\hat\mu_* - 7.7\,\sigma_0\sqrt{\frac{\log(e/\alpha)}{n}},\ \hat\mu_* + 7.7\,\sigma_0\sqrt{\frac{\log(e/\alpha)}{n}}\right].$$
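The construction above can be sketched in a few lines of Python. This is a minimal illustration, assuming contiguous groups whose sizes differ by at most one; the names `median_of_means` and `mom_ci`, the Pareto test data, and the value `sigma0=1.5` (an assumed upper bound on the standard deviation of that test distribution) are not from the talk.

```python
import math
import random
import statistics


def median_of_means(xs, alpha):
    """Median of k = floor(log(1/alpha)) + 1 group means."""
    n = len(xs)
    k = min(n, math.floor(math.log(1.0 / alpha)) + 1)
    # Contiguous groups; integer arithmetic keeps every group non-empty.
    bounds = [i * n // k for i in range(k + 1)]
    means = [sum(xs[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]
    return statistics.median(means)


def mom_ci(xs, sigma0, alpha):
    """Interval median-of-means +/- 7.7 * sigma0 * sqrt(log(e/alpha) / n)."""
    center = median_of_means(xs, alpha)
    half_width = 7.7 * sigma0 * math.sqrt(math.log(math.e / alpha) / len(xs))
    return center - half_width, center + half_width


# Hypothetical heavy-tailed data: Pareto with shape 2.5 (finite variance, true mean 5/3).
rng = random.Random(1)
sample = [rng.paretovariate(2.5) for _ in range(1000)]
print(mom_ci(sample, sigma0=1.5, alpha=0.05))
```

Note that the estimator depends on $\alpha$ through the number of groups, so a different confidence level requires recomputing the median.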
Idea of the proof:
[Figure: the group means $\hat\mu_1, \ldots, \hat\mu_k$ and the true mean $\mu$ shown as points.]
$|\hat\mu_* - \mu| \geq s \implies$ at least half of the events $\{|\hat\mu_j - \mu| \geq s\}$ occur.
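One way to make this idea quantitative is sketched below. The constants are deliberately unoptimized, so the numbers differ from the 7.7 in the claim and the number of groups differs from the choice above. Assume for simplicity that $k$ divides $n$, so that each group mean has variance at most $\sigma_0^2 k / n$.

$$
\begin{aligned}
&\text{Chebyshev with } s = 2\sigma_0\sqrt{k/n}: && \Pr\left(|\hat\mu_j - \mu| \geq s\right) \leq \frac{\sigma_0^2 k}{n s^2} = \frac{1}{4}, \\
&\text{median} \Rightarrow \text{half of the groups}: && \Pr\left(|\hat\mu_* - \mu| \geq s\right) \leq \Pr\left(\sum_{j=1}^k \mathbf{1}\{|\hat\mu_j - \mu| \geq s\} \geq \frac{k}{2}\right), \\
&\text{Hoeffding for independent indicators}: && \Pr\left(\sum_{j=1}^k \mathbf{1}\{|\hat\mu_j - \mu| \geq s\} \geq \frac{k}{2}\right) \leq \exp\left(-2k\left(\tfrac{1}{2} - \tfrac{1}{4}\right)^2\right) = e^{-k/8}.
\end{aligned}
$$

Taking $k \geq 8\log(1/\alpha)$ makes the last bound at most $\alpha$, and then $s = 2\sigma_0\sqrt{k/n}$ is of order $\sigma_0\sqrt{\log(1/\alpha)/n}$.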
Improve the constant?
O. Catoni's estimator (2012), "Generalized truncation": let $\alpha > 0$, let $\psi$ satisfy
$$-\log(1 - x + x^2/2) \leq \psi(x) \leq \log(1 + x + x^2/2),$$
and define $\hat\mu$ via
$$\sum_{j=1}^n \psi\left(\theta(X_j - \hat\mu)\right) = 0.$$
Truncation $\tau(x) = (|x| \wedge 1)\,\mathrm{sign}(x)$ satisfies a weaker inequality
$$-\log(1 - x + x^2) \leq \tau(x) \leq \log(1 + x + x^2).$$
Intuition: for small $\theta > 0$,
$$\sum_{j=1}^n \psi\left(\theta(X_j - \hat\mu)\right) \simeq \sum_{j=1}^n \theta(X_j - \hat\mu) = 0 \implies \hat\mu \simeq \frac{1}{n}\sum_{j=1}^n X_j.$$
The following holds: set $\theta_* = \sqrt{\frac{2\log(1/\alpha)}{n}}\,\frac{1}{\sigma_0}$. Then
$$|\hat\mu - \mu| \leq \left(\sqrt{2} + o(1)\right)\sigma_0\sqrt{\frac{\log(1/\alpha)}{n}}$$
with probability $\geq 1 - 2\alpha$.
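A minimal sketch of solving the defining equation numerically follows. It assumes the particular choice of $\psi$ that attains the upper bound for $x \geq 0$ and the lower bound for $x < 0$ (one admissible option, increasing and odd), and it uses plain bisection, which is an implementation choice of this sketch rather than something stated in the talk. The names `psi` and `catoni_mean` are hypothetical.

```python
import math


def psi(x):
    """Widest function allowed by the two-sided logarithmic bound (increasing, odd)."""
    if x >= 0:
        return math.log(1.0 + x + 0.5 * x * x)
    return -math.log(1.0 - x + 0.5 * x * x)


def catoni_mean(xs, sigma0, alpha, n_iter=200):
    """Root in m of sum_j psi(theta * (x_j - m)) = 0, found by bisection."""
    n = len(xs)
    theta = math.sqrt(2.0 * math.log(1.0 / alpha) / n) / sigma0

    def score(m):
        return sum(psi(theta * (x - m)) for x in xs)

    # score is decreasing in m, with score(min) >= 0 >= score(max).
    lo, hi = min(xs), max(xs)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# One gross outlier moves the sample mean to 10 but barely moves the estimator.
data = [0.0] * 99 + [1000.0]
print(sum(data) / len(data), catoni_mean(data, sigma0=1.0, alpha=0.05))
```

Because $\psi$ grows only logarithmically, a single extreme observation contributes a bounded amount to the score, unlike in the sample mean.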
Extensions to higher dimensions
A natural question: is it possible to extend the presented techniques to the multivariate mean?
Motivation: PCA
Genes mirror geography within Europe, J. Novembre et al., Nature 2008.
Mathematical framework:
$$Y_1, \ldots, Y_n \in \mathbb{R}^d \text{ i.i.d.}, \quad \mathbb{E}Y_j = 0, \quad \mathbb{E}Y_j Y_j^T = \Sigma.$$
Goal: construct $\hat\Sigma$, an estimator of $\Sigma$ such that $\|\hat\Sigma - \Sigma\|_{\mathrm{Op}}$ is small.
Sample covariance
$$\tilde\Sigma_n = \frac{1}{n}\sum_{j=1}^n Y_j Y_j^T$$
is very sensitive to outliers.
[Figure: excerpt from J. Novembre et al., Nature 456 (6 November 2008), Figure 1, "Population structure within Europe". (a) A statistical summary of genetic data from 1,387 Europeans based on principal component axes PC1 and PC2, rotated to emphasize the similarity to the geographic map of Europe. (b) A magnification of the area around Switzerland showing differentiation by language. (c) Median genetic correlation between pairs of individuals as a function of geographic distance between their populations.]
Good explanation for non-experts: https://faculty.washington.edu/tathornt/SISG2015/lectures/assoc2015session05.pdf
Naive approach: apply the "median trick" (or Catoni's estimator) coordinatewise. This makes the bound dimension-dependent.
Better approach: replace the usual median by the geometric median,
$$x_* = \mathrm{med}(x_1, \ldots, x_k) := \operatorname*{argmin}_{y \in \mathbb{R}^d} \sum_{j=1}^k \|y - x_j\|.$$
Still some issues:
1. does not work well for small sample sizes;
2. yields bounds in the wrong norm.
Alternatives: Tyler's M-estimator, Maronna's M-estimator; guarantees are limited to special classes of distributions.
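The geometric median has no closed form, so it is computed iteratively. The sketch below uses Weiszfeld's algorithm with the Euclidean norm; this solver is a choice made for illustration and is not mentioned in the talk, and the function name, iteration cap, and tolerances are assumptions.

```python
import numpy as np


def geometric_median(points, n_iter=200, eps=1e-12, tol=1e-10):
    """Weiszfeld iterations for argmin_y sum_j ||y - x_j||_2."""
    x = np.asarray(points, dtype=float)
    y = x.mean(axis=0)  # start from the coordinatewise mean
    for _ in range(n_iter):
        # Clip distances away from zero to avoid division by zero at a data point.
        dist = np.maximum(np.linalg.norm(x - y, axis=1), eps)
        weights = 1.0 / dist
        y_new = (weights[:, None] * x).sum(axis=0) / weights.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y


# Three coincident points and one far outlier: the minimizer stays at the cluster.
pts = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [100.0, 100.0]]
print(geometric_median(pts))
```

Each iteration is a weighted average with weights inversely proportional to the current distances, so distant points are progressively down-weighted.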
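Matrix functions
$f: \mathbb{R} \mapsto \mathbb{R}$, $A = A^T = U \Lambda U^T$, then
$$f(A) = U f(\Lambda) U^T, \qquad f(\Lambda) = f\begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_d \end{pmatrix} = \begin{pmatrix} f(\lambda_1) & & \\ & \ddots & \\ & & f(\lambda_d) \end{pmatrix}.$$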
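In code, this definition amounts to one symmetric eigendecomposition. The following is a small sketch using `numpy.linalg.eigh`; the helper name `matrix_function` is hypothetical.

```python
import numpy as np


def matrix_function(f, a):
    """f(A) = U f(Lambda) U^T for a symmetric matrix A; f acts on the eigenvalues."""
    a = np.asarray(a, dtype=float)
    eigvals, u = np.linalg.eigh(a)  # a = u @ diag(eigvals) @ u.T
    return (u * f(eigvals)) @ u.T   # scales column i of u by f(eigvals[i])


a = np.array([[2.0, 1.0], [1.0, 2.0]])
print(matrix_function(np.square, a))  # agrees with a @ a
```

Here `f` must accept a NumPy array of eigenvalues and apply the scalar function elementwise.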
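Construction of the estimator
$X \in \mathbb{R}^{d \times d}$ is a symmetric random matrix, $X_1, \ldots, X_n \in \mathbb{R}^{d \times d}$ are i.i.d. copies of $X$, $\mathbb{E}\|X\|_F^2 < \infty$.
No additional assumptions.
Let $\psi$ satisfy $-\log(1 - x + x^2/2) \leq \psi(x) \leq \log(1 + x + x^2/2)$ and let $\theta > 0$; define
$$\hat\Sigma_n = \frac{1}{n\theta}\sum_{j=1}^n \psi(\theta X_j).$$
For example, if $X_j = Y_j Y_j^T$, we get
$$\hat\Sigma_n = \frac{1}{n\theta}\sum_{j=1}^n \psi\left(\theta Y_j Y_j^T\right).$$
Intuition: for small $\theta$, $\psi(\theta x) \simeq \theta x$, hence
$$\hat\Sigma_n \simeq \text{sample mean} + o(\theta).$$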
Note that
$$\psi\left(\theta Y_j Y_j^T\right) = \psi\left(\theta\|Y_j\|_2^2\right)\frac{Y_j}{\|Y_j\|_2}\frac{Y_j^T}{\|Y_j\|_2}$$
is easy to compute.
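The rank-one identity for $\psi(\theta Y_j Y_j^T)$ means that no eigendecomposition is needed in the covariance case. The sketch below assumes the odd choice $\psi(x) = \mathrm{sign}(x)\log(1 + |x| + x^2/2)$, which satisfies the two-sided bound; the function names are hypothetical and $\theta$ is left as a user-supplied parameter.

```python
import numpy as np


def psi(x):
    """sign(x) * log(1 + |x| + x^2 / 2), one admissible choice of psi."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x * x)


def robust_covariance(y, theta):
    """(1 / (n * theta)) * sum_j psi(theta * Y_j Y_j^T), using the rank-one identity."""
    y = np.asarray(y, dtype=float)  # shape (n, d), one observation per row
    n = y.shape[0]
    sq_norms = np.einsum("ij,ij->i", y, y)
    # Weight psi(theta * ||Y_j||^2) / ||Y_j||^2; a zero row contributes nothing.
    safe = np.where(sq_norms > 0, sq_norms, 1.0)
    weights = np.where(sq_norms > 0, psi(theta * sq_norms) / safe, 0.0)
    return (y.T * weights) @ y / (n * theta)


rng = np.random.default_rng(0)
y = rng.standard_normal((200, 3))
print(robust_covariance(y, theta=0.05))
```

Observations with a large norm receive a weight that decays like $\log(\theta\|Y_j\|_2^2)/\|Y_j\|_2^2$, while for small $\theta\|Y_j\|_2^2$ the weight is close to $\theta$ and the estimator is close to the sample covariance.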
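Theorem (M., 2016)
$X_1, \ldots, X_n$ are i.i.d. Assume that $\sigma^2 \geq \|\mathbb{E}X^2\|$. Let $\theta = \sqrt{\frac{2\log(d/\alpha)}{n}}\,\frac{1}{\sigma}$; then
$$\left\|\hat\Sigma_n - \mathbb{E}X\right\| \leq \sigma\sqrt{\frac{2\log(d/\alpha)}{n}}$$
with probability $\geq 1 - 2\alpha$.
For example, in covariance estimation $\sigma^2 = \left\|\mathbb{E}\,\|Y\|_2^2\, Y Y^T\right\|$.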
Compare to:
Theorem (Matrix Bernstein inequality, Tropp '11)
$X, X_1, \ldots, X_n \in \mathbb{R}^{d \times d}$ are i.i.d., $\sigma_0^2 = \left\|\mathbb{E}(X - \mathbb{E}X)^2\right\|$, $\|X\| \leq M$. Then for all $0 < \alpha < 1$,
$$\left\|\frac{1}{n}\sum_{j=1}^n X_j - \mathbb{E}X\right\| \leq \max\left(2\sigma_0\sqrt{\frac{\log(d/\alpha)}{n}},\ \frac{4}{3}\,\frac{M\log(d/\alpha)}{n}\right)$$
with probability $\geq 1 - 2\alpha$.
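To get a feel for the two bounds, the sketch below evaluates them for given parameters. The function names and the example values are hypothetical, and the two variance parameters differ: the robust bound uses $\sigma^2 \geq \|\mathbb{E}X^2\|$, while matrix Bernstein uses the centered $\sigma_0^2$ together with the almost-sure bound $M$.

```python
import math


def robust_bound(sigma, n, d, alpha):
    """sigma * sqrt(2 * log(d / alpha) / n), with sigma^2 >= ||E X^2||."""
    return sigma * math.sqrt(2.0 * math.log(d / alpha) / n)


def bernstein_bound(sigma0, m, n, d, alpha):
    """max(2 * sigma0 * sqrt(log(d/alpha) / n), (4/3) * M * log(d/alpha) / n)."""
    log_term = math.log(d / alpha)
    return max(2.0 * sigma0 * math.sqrt(log_term / n), (4.0 / 3.0) * m * log_term / n)


# Illustrative values only: the second Bernstein term dominates when M is large.
for m in (1.0, 100.0, 10000.0):
    print(m, robust_bound(1.0, 1000, 50, 0.05), bernstein_bound(1.0, m, 1000, 50, 0.05))
```

The robust bound has no term depending on $M$, which is the point of the comparison: it requires no almost-sure bound on $\|X\|$.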
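Further improvements: $X_j \mapsto X_j + S$,
$$\hat\Sigma(S) = S + \underbrace{\frac{1}{n\theta}\sum_{j=1}^n \psi\left(\theta(X_j - S)\right)}_{\simeq\, \mathbb{E}X - S}.$$
"Ideal choice" $S = \mathbb{E}X$ is unavailable $\implies$ use the initial estimator $\hat\Sigma_n$ in place of $S$.
Iterate...
$$S_\infty = S_\infty + \underbrace{\frac{1}{n\theta}\sum_{j=1}^n \psi\left(\theta(X_j - S_\infty)\right)}_{=\,0}$$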
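The iteration can be sketched as a plain fixed-point loop. This is an illustration under assumptions: $\psi(x) = \mathrm{sign}(x)\log(1 + |x| + x^2/2)$ applied as a matrix function, a fixed number of iterations, and no convergence check; the names `psi_matrix` and `iterate_shift` are hypothetical, and convergence is not guaranteed by this sketch.

```python
import numpy as np


def psi_matrix(a):
    """Apply psi(x) = sign(x) * log(1 + |x| + x^2 / 2) to a symmetric matrix."""
    vals, u = np.linalg.eigh(a)
    return (u * (np.sign(vals) * np.log1p(np.abs(vals) + 0.5 * vals**2))) @ u.T


def iterate_shift(xs, theta, n_iter=50):
    """Repeat S <- S + (1 / (n * theta)) * sum_j psi(theta * (X_j - S))."""
    xs = np.asarray(xs, dtype=float)  # shape (n, d, d), each slice symmetric
    # Initial estimator: the unshifted (S = 0) version.
    s = np.mean([psi_matrix(theta * x) for x in xs], axis=0) / theta
    for _ in range(n_iter):
        s = s + np.mean([psi_matrix(theta * (x - s)) for x in xs], axis=0) / theta
    return s


m = np.array([[2.0, 1.0], [1.0, 3.0]])
print(iterate_shift([m, m, m], theta=0.5))  # approaches m when all samples are equal
```

Each step requires $n$ eigendecompositions of $d \times d$ matrices, so the shifted estimator is noticeably more expensive than the one-shot version.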
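Theorem (M., 2016)
Assume that $\sigma_0^2 \geq \|\mathbb{E}(X - \mathbb{E}X)^2\|$. Let $\theta = \sqrt{\frac{2\log(d/\alpha)}{n}}\,\frac{1}{\sigma_0}$, and let $S_\infty$ solve
$$\frac{1}{n\theta}\sum_{j=1}^n \psi\left(\theta(X_j - S_\infty)\right) = 0.$$
Assume that $n$ is large enough ($n \gtrsim d^3$). Then $S_\infty$ exists and
$$\left\|S_\infty - \mathbb{E}X\right\| \leq C\sigma_0\sqrt{\frac{\log(d/\alpha)}{n}}$$
with probability $\geq 1 - \alpha$.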
Numerical results
$Y_1, \ldots, Y_n \in \mathbb{R}^{100}$,
$$\Sigma = \mathrm{diag}(10, 5, 1, \ldots, 1) \in \mathbb{R}^{100 \times 100},$$
$Y_{i,j} \sim$ symmetric Pareto-type distribution with 4 moments.
Histograms over 500 replications: $n = 100$.
[Figure: histograms (horizontal axis: error, vertical axis: frequency) of the sample covariance error $\|S_n - \Sigma\|/\|\Sigma\|$ and the robust estimator error $\|\hat\Sigma_n - \Sigma\|/\|\Sigma\|$.]
Histograms over 500 replications: $n = 100$.
[Figure: histograms (horizontal axis: error, vertical axis: frequency) of $\|u_1(\hat\Sigma_n)u_1(\hat\Sigma_n)^T - u_1(\Sigma)u_1(\Sigma)^T\|$ for the robust covariance estimator and $\|u_1(S_n)u_1(S_n)^T - u_1(\Sigma)u_1(\Sigma)^T\|$ for the sample covariance estimator.]
Histograms over 500 replications: $n = 1000$.
[Figure: histograms (horizontal axis: error, vertical axis: frequency) of the robust estimator error $\|\hat\Sigma_n - \Sigma\|/\|\Sigma\|$ and the sample covariance error $\|S_n - \Sigma\|/\|\Sigma\|$.]
Histograms over 500 replications: $n = 1000$.
[Figure: histograms (horizontal axis: error, vertical axis: frequency) of $\|u_1(\hat\Sigma_n)u_1(\hat\Sigma_n)^T - u_1(\Sigma)u_1(\Sigma)^T\|$ for the robust covariance estimator and $\|u_1(S_n)u_1(S_n)^T - u_1(\Sigma)u_1(\Sigma)^T\|$ for the sample covariance estimator.]
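Matrix Completion
Observe some entries of the ratings matrix
$$A_0 = \begin{array}{c|cccc} & \text{movie 1} & \text{movie 2} & \ldots & \text{movie } n \\ \hline \text{user 1} & * & * & \ldots & * \\ \vdots & \ldots & \ldots & \ldots & \vdots \\ \text{user } k & * & * & \ldots & * \end{array}$$
Question: can we predict the unobserved entries?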
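$$\mathcal{X} = \left\{ e_j(d)\,e_k^T(d),\ 1 \leq j \leq d,\ 1 \leq k \leq d \right\}.$$
$X_1, \ldots, X_n$ is an independent sample from $\Pi := \mathrm{Unif}(\mathcal{X})$, and the observations $Y_j$, $j = 1, \ldots, n$, have the form
$$Y_j = \mathrm{tr}(X_j^T A_0) + \xi_j \quad \text{("noisy matrix entry")},$$
where $\xi_j$, $j = 1, \ldots, n$, is additive noise.
$\mathbb{E}(YX) = \frac{1}{d^2}A_0$, hence a natural estimator of $A_0$ is
$$\hat A = \frac{d^2}{n}\sum_{j=1}^n Y_j X_j.$$
Incorporate the low rank assumption:
$$\hat A_\tau = \operatorname*{argmin}_{A \in \mathbb{R}^{d \times d}} \left[\frac{\|A - \hat A\|_F^2}{d^2} + \tau\|A\|_1\right].$$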
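Both steps have direct implementations. The sketch below assumes that $\|A\|_1$ denotes the nuclear norm (sum of singular values), in which case the penalized problem is solved in closed form by soft-thresholding the singular values of $\hat A$ at level $\tau d^2/2$. The function names and the row/column encoding of the sampled entries are hypothetical.

```python
import numpy as np


def naive_estimator(rows, cols, ys, d):
    """A_hat = (d^2 / n) * sum_j Y_j X_j, where X_j = e_row e_col^T."""
    a_hat = np.zeros((d, d))
    # np.add.at accumulates correctly when the same entry is sampled more than once.
    np.add.at(a_hat, (np.asarray(rows), np.asarray(cols)), np.asarray(ys, dtype=float))
    return a_hat * d * d / len(ys)


def nuclear_norm_denoise(a_hat, tau):
    """argmin_A ||A - a_hat||_F^2 / d^2 + tau * ||A||_* via singular value shrinkage."""
    d = a_hat.shape[0]
    u, s, vt = np.linalg.svd(a_hat, full_matrices=False)
    s_shrunk = np.maximum(s - 0.5 * tau * d * d, 0.0)
    return (u * s_shrunk) @ vt


a_hat = naive_estimator(rows=[0, 1, 1], cols=[0, 1, 0], ys=[1.0, 2.0, 0.5], d=2)
print(nuclear_norm_denoise(a_hat, tau=0.1))
```

Singular values below the threshold are set to zero, which is how the penalty produces a low-rank estimate.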
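What if the noise $\xi_j$ is heavy-tailed (only $\mathrm{Var}(\xi_j) < \infty$)?
Replace $\hat A$ with a "robust" estimator
$$\hat R = \frac{d^2}{n\theta}\sum_{j=1}^n \psi\left(\theta Y_j H(X_j)\right)$$
and
$$\hat R_\tau = \operatorname*{argmin}_{A \in \mathbb{R}^{d \times d}} \left[\frac{\|A - \hat R\|_F^2}{d^2} + \tau\|A\|_1\right].$$
Here, $H(X) = \begin{pmatrix} 0 & X \\ X^T & 0 \end{pmatrix}$ is the so-called self-adjoint dilation.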
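The dilation turns any matrix into a symmetric one, so that matrix functions such as $\psi$ can be applied to it. A minimal sketch follows; the function name is hypothetical.

```python
import numpy as np


def dilation(x):
    """Self-adjoint dilation H(X) = [[0, X], [X^T, 0]] of an m-by-n matrix."""
    x = np.asarray(x, dtype=float)
    m, n = x.shape
    return np.block([[np.zeros((m, m)), x], [x.T, np.zeros((n, n))]])


x = np.array([[1.0, 2.0], [3.0, 4.0]])
h = dilation(x)
print(np.linalg.eigvalsh(h))                # eigenvalues of H(X)
print(np.linalg.svd(x, compute_uv=False))   # singular values of X
```

For a square matrix the eigenvalues of $H(X)$ are the singular values of $X$ together with their negatives, so the operator norm is preserved: $\|H(X)\| = \|X\|$.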
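Theorem (M., 2016)
Take
$$\tau = \mathrm{Const} \cdot \sqrt{\frac{t + \log(2d)}{nd}};$$
then
$$\frac{1}{d^2}\left\|\hat R_\tau - H(A_0)\right\|_F^2 \leq \left(\frac{1 + \sqrt{2}}{2}\right)^2 \frac{d \cdot 2\,\mathrm{rank}(A_0)}{n}\sqrt{t + \log(2d)}$$
with probability $\geq 1 - e^{-t}$.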
Thank you for your attention!