Coresets and Sketches for
High Dimensional Subspace Approximation Problems
Morteza Monemizadeh
TU Dortmund
Joint work with: D. Feldman, C. Sohler, D. Woodruff
SODA 2010
Unbounded Precision
Insertion-only Streaming:
Input: $P = \{p_1, p_2, \dots, p_n\} \subseteq \mathbb{R}^d$, $j \ge 0$
[Figure: the stream of points, with the head of the stream separating seen points from unseen points]
Subspace Problem
Find a j-subspace $F$:
$OPT = \min_{F \subseteq \mathbb{R}^d} cost(P, F) = \min_{F \subseteq \mathbb{R}^d} \sum_{p_i \in P} dist(p_i, F)$
[Figure: points $p_1, \dots, p_6$ and their Euclidean distances to $F$]
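The cost function above can be sketched in a few lines of NumPy. This is only an illustration: representing a j-subspace by a `(d, j)` matrix with orthonormal columns, and the names `dist_to_subspace` and `cost`, are assumptions made here, not notation from the talk.

```python
import numpy as np

def dist_to_subspace(points, basis):
    """Euclidean distance from each row of `points` to span(basis).

    `basis` is assumed to be a (d, j) array with orthonormal columns;
    a (d, 0) array stands for the 0-dimensional subspace (the origin).
    """
    points = np.asarray(points, dtype=float)
    basis = np.asarray(basis, dtype=float)
    residual = points - (points @ basis) @ basis.T  # remove the part inside F
    return np.linalg.norm(residual, axis=1)

def cost(points, basis, weights=None):
    """cost(P, F): (optionally weighted) sum of the distances dist(p_i, F)."""
    d = dist_to_subspace(points, basis)
    return float(d.sum() if weights is None else np.dot(weights, d))

P = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, -2.0]])
x_axis = np.array([[1.0], [0.0]])       # the line F = x-axis in R^2
print(cost(P, x_axis))                  # distances are |y|: 4 + 0 + 2
print(cost(P, np.zeros((2, 0))))        # j = 0: distances to the origin
```

The optional `weights` argument is there because the weighted point sets that appear later in the talk (including negative weights) use the same sum.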
Subspace Approximation
PTAS:
Find a j-subspace $F'$ such that
$cost(P, F') \le (1 + \epsilon) \cdot OPT$
Simple Cases
$cost(P, F) = \sum_{p_i \in P} dist(p_i, F)$
j = 0: 1-median
j : PCA/SVD
Machine Learning
LSI, PageRank, HITS
Collaborative Filtering, Recommendation Systems
Clustering
k-median
Simple Cases
Linear regression
Nonlinear regression
$j = d - 1$:
Shape-fitting
Known Before
Coresets (Har-Peled)
Dynamic Programming (Arora, Mitchell)
$d = O(1)$: Low-dimensions
$d = O(n)$: High-dimensions
Dimensionality Reduction (Indyk, Rabani, …)
$d = O(1)$: Low-dimensions
Simple PTAS
PTAS: $O(nd \cdot |\Gamma|)$
Centroid Set: $\Gamma = \{F_1, F_2, \dots, F_i, \dots, F_{|\Gamma|}\}$
$\exists F_i : cost(P, F_i) \le (1 + \epsilon) \cdot OPT$
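A minimal sketch of this simple PTAS, assuming a centroid set is already given as a list of orthonormal bases: evaluate the cost of every candidate and keep the cheapest. The candidate set built below (180 lines through the origin in the plane) is purely hypothetical test data, not the construction from the talk.

```python
import numpy as np

def cost(P, B):
    """Sum of Euclidean distances from the rows of P to span(B), B orthonormal (d, j)."""
    return float(np.linalg.norm(P - (P @ B) @ B.T, axis=1).sum())

def best_in_centroid_set(P, Gamma):
    """Scan all |Gamma| candidates; each cost evaluation touches all n points."""
    costs = [cost(P, B) for B in Gamma]
    i = int(np.argmin(costs))
    return i, costs[i]

def line(theta):
    """Orthonormal basis of the line through the origin at angle theta."""
    return np.array([[np.cos(theta)], [np.sin(theta)]])

# Hypothetical candidate set: 180 lines through the origin in R^2.
angles = np.linspace(0.0, np.pi, 180, endpoint=False)
Gamma = [line(t) for t in angles]

# Synthetic points scattered around the line with direction (2, 1).
rng = np.random.default_rng(0)
t = rng.normal(size=200)
P = np.outer(t, [2.0, 1.0]) / np.sqrt(5.0) + 0.05 * rng.normal(size=(200, 2))

i, c = best_in_centroid_set(P, Gamma)
print(angles[i], c)
```

The running time is dominated by the double loop over points and candidates, which is where the $|\Gamma|$ factor in the bound comes from.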
PTAS
Weak Coreset:
$S = \{s_1, s_2, \dots, s_i, \dots, s_{|S|}\}$
$\forall F_i \in \Gamma : |cost(S, F_i) - cost(P, F_i)| \le \epsilon \cdot cost(P, F_i)$
$|S| = O(|\Gamma|)$
$|\Gamma| = 2^{poly(j/\epsilon)}$
PTAS: $O(d \cdot |\Gamma|^2) = O(d \cdot 2^{poly(j/\epsilon)})$
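The weak coreset condition can be checked directly when the centroid set is small enough to enumerate. The sketch below assumes a weighted point set `(S, w)` and a list `Gamma` of orthonormal bases; `is_weak_coreset` is a hypothetical helper name, and the toy data only illustrates the definition.

```python
import numpy as np

def cost(P, B, w=None):
    """(Weighted) sum of distances from the rows of P to span(B)."""
    d = np.linalg.norm(P - (P @ B) @ B.T, axis=1)
    return float(d.sum() if w is None else np.dot(w, d))

def is_weak_coreset(P, S, w, Gamma, eps):
    """True if |cost(S,F) - cost(P,F)| <= eps * cost(P,F) for every F in Gamma."""
    return all(abs(cost(S, B, w) - cost(P, B)) <= eps * cost(P, B) for B in Gamma)

# Toy data: P repeats three base points three times each, so the base
# points with weight 3 reproduce every cost (up to rounding).
base = np.array([[1.0, 2.0], [3.0, -1.0], [-2.0, 0.5]])
P = np.repeat(base, 3, axis=0)
Gamma = [np.array([[1.0], [0.0]]),
         np.array([[0.0], [1.0]]),
         np.array([[1.0], [1.0]]) / np.sqrt(2.0)]

print(is_weak_coreset(P, base, np.full(3, 3.0), Gamma, eps=1e-9))  # weights 3
print(is_weak_coreset(P, base, np.ones(3), Gamma, eps=0.1))        # weights 1
```

With unit weights the second call undercounts every cost by a factor of three, so it fails the check for any small $\epsilon$.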
Tools
Weak Coreset, Centroid Set
Coreset Construction
Assumptions:
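$d = 2$, $j = 1$
Fix a 1-subspace (line): $F_i$
Have a 1-subspace (line) $F^*$: $cost(P, F^*) \le O(1) \cdot OPT$
GOAL: with probability $\ge 1 - \delta$,
$\exists Q \subseteq \mathbb{R}^d : |cost(P, F_i) - cost(Q, F_i)| \le \epsilon \cdot cost(P, F^*) \le O(1) \cdot \epsilon \cdot OPT \le O(\epsilon) \cdot cost(P, F_i)$
$|Q| = O(\log(1/\delta) / \epsilon^2)$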
1st Try
[Figure: the line $F_i$]
Sampling u.a.r. or even non-uniformly:
$E[cost(Q, F_i)] = cost(P, F_i)$
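The unbiasedness claim can be illustrated with a small simulation. The sketch assumes the usual reweighting for sampling with replacement: a point drawn with probability $q_i$ in a sample of size $m$ gets weight $1/(m \cdot q_i)$. That weighting and the helper name `sampled_cost` are assumptions made for this example; the slide only states the expectation.

```python
import numpy as np

def dist(P, B):
    """Distances from the rows of P to span(B), B orthonormal (d, j)."""
    return np.linalg.norm(P - (P @ B) @ B.T, axis=1)

def sampled_cost(P, B, probs, m, rng):
    """Draw m points i.i.d. with probabilities `probs`; weight each by 1/(m*probs[i])."""
    idx = rng.choice(len(P), size=m, p=probs)
    return float(np.sum(dist(P[idx], B) / (m * probs[idx])))

rng = np.random.default_rng(1)
P = rng.normal(size=(50, 2))
x_axis = np.array([[1.0], [0.0]])
uniform = np.full(len(P), 1.0 / len(P))        # sampling u.a.r.

exact = float(dist(P, x_axis).sum())
estimates = [sampled_cost(P, x_axis, uniform, 10, rng) for _ in range(4000)]
print(exact, np.mean(estimates), np.std(estimates))
```

The average of the estimates approaches the exact cost, while the spread of a single estimate stays large; unbiasedness alone says nothing about how concentrated one sample is.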
2nd Try
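[Figure: lines $F^*$ and $F_i$ with weighted points $(p_1, 1), \dots, (p_j, 1)$ and $(\bar p_1, -1), (\bar p_1, +1), \dots, (\bar p_j, -1), (\bar p_j, +1)$]
Points $(\bar p_j, +1)$: $cost(\bar P, F_i) = \sum_{p_j \in P} dist(\bar p_j, F_i)$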
Points $(p_j, 1)$ and $(\bar p_j, -1)$: $cost(P, F_i) - cost(\bar P, F_i)$
All weighted points together: $cost(P, F_i)$
$E[cost(S, F_i)] = cost(P, F_i) - cost(\bar P, F_i)$
$cost(S, F_i) \le |cost(P, F_i) - cost(\bar P, F_i)| \le cost(P, F^*)$
[Figure: sampled points reweighted as $(p_j, 2)$ and $(\bar p_j, -2)$]
Chernoff Bounds
$|S| = O(\log(1/\delta) / \epsilon^2)$
$|cost(S, F_i) - (cost(P, F_i) - cost(\bar P, F_i))| \le \epsilon \cdot cost(P, F^*)$
$E[cost(S, F_i)] = cost(P, F_i) - cost(\bar P, F_i)$
$cost(S, F_i) \le |cost(P, F_i) - cost(\bar P, F_i)| \le cost(P, F^*)$
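One way to realize a sample with these two properties is sketched below. It rests on assumptions the slides do not spell out: that $\bar p_j$ is the orthogonal projection of $p_j$ onto $F^*$, and that points are drawn with probability proportional to $dist(p_j, F^*)$. Under those assumptions each sampled pair $(p, +w), (\bar p, -w)$ contributes a term whose magnitude is at most $cost(P, F^*)$ by the triangle inequality, which is the boundedness a Chernoff-type bound needs. The function name `difference_estimate` is hypothetical.

```python
import numpy as np

def dist(P, B):
    """Distances from the rows of P to span(B), B orthonormal (d, j)."""
    return np.linalg.norm(P - (P @ B) @ B.T, axis=1)

def line(theta):
    return np.array([[np.cos(theta)], [np.sin(theta)]])

def difference_estimate(P, B_star, B_i, m, rng):
    """Estimate cost(P, F_i) - cost(P_bar, F_i) from m sampled pairs (p, p_bar)."""
    P_bar = (P @ B_star) @ B_star.T        # assumed: p_bar = projection of p onto F*
    d_star = dist(P, B_star)               # dist(p, F*) = |p - p_bar|
    c_star = d_star.sum()                  # cost(P, F*), assumed to be nonzero
    probs = d_star / c_star                # sample proportionally to dist(p, F*)
    idx = rng.choice(len(P), size=m, p=probs)
    # Pair (p, +w), (p_bar, -w) with w = 1 / (m * prob); `terms` holds m * contribution.
    terms = (dist(P[idx], B_i) - dist(P_bar[idx], B_i)) / probs[idx]
    return float(terms.mean()), terms, float(c_star)

rng = np.random.default_rng(2)
P = rng.normal(size=(500, 2))
B_star, B_i = line(0.3), line(1.0)
est, terms, c_star = difference_estimate(P, B_star, B_i, 20000, rng)

P_bar = (P @ B_star) @ B_star.T
true_diff = float(dist(P, B_i).sum() - dist(P_bar, B_i).sum())
print(est, true_diff, np.abs(terms).max(), c_star)
```

Because every term lies in $[-cost(P, F^*), +cost(P, F^*)]$, the sample mean concentrates around the true difference at a rate that depends only on the sample size, not on $n$.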
Recursion
[Figure: lines $F^*$ and $F_i$, the origin $0$, and weighted points $(\bar p_1, +1), \dots, (\bar p_j, +1)$, $(\bar{\bar p}_j, +1)$, $(\bar{\bar p}_j, -1)$]
$cost(\bar P, F_i) = \sum_{p_j \in P} dist(\bar p_j, F_i)$
The origin with weight $n$, $(0, n)$: $cost(\vec 0 \times n, F_i)$
Points $(\bar p_j, +1)$ and $(\bar{\bar p}_j, -1)$: $cost(\bar P, F_i) - cost(\vec 0 \times n, F_i)$
$E[cost(S', F_i)] = cost(\bar P, F_i) - cost(\vec 0 \times n, F_i)$
$cost(S', F_i) \le O(1) \cdot cost(P, F^*)$
[Figure: sampled points reweighted as $(\bar{\bar p}_j, +2)$ and $(\bar p_j, +2)$]
$|S| = O(\log(1/\delta) / \epsilon^2)$
$cost(P, F_i) = cost(\vec 0 \times n, F_i) + (cost(\bar P, F_i) - cost(\vec 0 \times n, F_i)) + (cost(P, F_i) - cost(\bar P, F_i))$
$|cost(S', F_i) - (cost(\bar P, F_i) - cost(\vec 0 \times n, F_i))| \le \epsilon \cdot cost(P, F^*)$
$|cost(S, F_i) - (cost(P, F_i) - cost(\bar P, F_i))| \le \epsilon \cdot cost(P, F^*)$
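The two sampling bounds combine by the triangle inequality. The short derivation below assumes that the final weighted set is the union of $S$, $S'$ and the origin with weight $n$, and that the cost of a union is the sum of the costs; the slides suggest this but do not state it explicitly.

```latex
\begin{align*}
&\bigl|\, cost(\vec 0 \times n, F_i) + cost(S', F_i) + cost(S, F_i) - cost(P, F_i) \,\bigr| \\
&\quad = \bigl|\, [\, cost(S', F_i) - (cost(\bar P, F_i) - cost(\vec 0 \times n, F_i)) \,]
        + [\, cost(S, F_i) - (cost(P, F_i) - cost(\bar P, F_i)) \,] \,\bigr| \\
&\quad \le \epsilon \cdot cost(P, F^*) + \epsilon \cdot cost(P, F^*)
  = 2\epsilon \cdot cost(P, F^*)
\end{align*}
```

Since $cost(P, F^*) \le O(1) \cdot OPT \le O(1) \cdot cost(P, F_i)$, the total error is $O(\epsilon) \cdot cost(P, F_i)$, and by a union bound both sampling bounds hold together with probability at least $1 - 2\delta$.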
Strong Coreset
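$S = \{s_1, s_2, \dots, s_i, \dots, s_{|S|}\}$
$\forall F_i \in \mathbb{R}^d : |cost(S, F_i) - cost(P, F_i)| \le \epsilon \cdot cost(P, F_i)$
$|S| = O(d j^{O(j^2)} \cdot \epsilon^{-2} \cdot \log n)$
Stream: $|S| = O\left(d \left(\frac{j \cdot 2^{\sqrt{\log n}}}{\epsilon^2}\right)^{poly(j)}\right)$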
Centroid Set
In time $n \cdot 2^{poly(j/\epsilon)}$
$|\Gamma| = 2^{poly(j/\epsilon)}$
Centroid Set: $\Gamma = \{F_1, F_2, \dots, F_i, \dots, F_{|\Gamma|}\}$
$\exists F_i : cost(P, F_i) \le (1 + \epsilon) \cdot OPT$
Centroid Set Construction
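Bounded Precision
[Figure: the points $p_1, p_2, \dots, p_i, \dots, p_n$ as the rows of a matrix $A[n, d]$]
$A[i, j] \in \{-M, \dots, +M\}$
Stream S: …, (i, j, -5), …, (i, j, +10), … : $|S| = poly(n, M)$
Each update changes one entry: $A[i, j] - 5$, $A[i, j] + 10$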
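The update model can be sketched as follows. The first function replays the updates into an explicit matrix; the second maintains a linear sketch `Sk @ A` under the same updates without storing `A`. The random sign matrix `Sk` and the 0-based indices are assumptions made for illustration only; this is a generic linear sketch, not necessarily the one used in the talk.

```python
import numpy as np

def apply_stream(n, d, stream):
    """Replay updates (i, j, delta) into an explicit n x d integer matrix A."""
    A = np.zeros((n, d), dtype=np.int64)
    for i, j, delta in stream:
        A[i, j] += delta
    return A

def sketch_stream(Sk, d, stream):
    """Maintain Sk @ A under the same updates, storing only a k x d matrix."""
    SA = np.zeros((Sk.shape[0], d), dtype=np.int64)
    for i, j, delta in stream:
        SA[:, j] += delta * Sk[:, i]   # column j of Sk @ A changes by delta * Sk[:, i]
    return SA

n, d, k = 6, 3, 2
stream = [(0, 1, -5), (4, 2, 7), (0, 1, +10), (2, 0, 3)]   # 0-based (i, j, delta)

rng = np.random.default_rng(3)
Sk = rng.choice(np.array([-1, 1]), size=(k, n))            # assumed random sign sketch

A = apply_stream(n, d, stream)
SA = sketch_stream(Sk, d, stream)
print(A[0, 1])                      # -5 + 10
print(np.array_equal(SA, Sk @ A))   # the sketch is linear in the updates
```

Linearity is what makes such a sketch work with both positive and negative updates: the order of the updates does not matter, only their sum.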
Bounded Precision
1-pass streaming algorithm
Space: $\tilde O(n j^4 \cdot \log(nd)) / \epsilon^5$
Time: $M^{poly(j/\epsilon)}$
Open Problems
Coreset size:
$|S| = O(d j^{O(j^2)} \cdot \epsilon^{-2} \cdot \log n)$
Stream: $|S| = O\left(d \left(\frac{j \cdot 2^{\sqrt{\log n}}}{\epsilon^2}\right)^{poly(j)}\right)$
PTAS: $O(nd \cdot poly(j/\epsilon) + (n + d) \cdot 2^{poly(j/\epsilon)})$
What other classes of Clustering?
Thanks!