Blake Foster, Sridhar Mahadevan, Rui Wang
A GPU-based Approximate SVD Algorithm
University of Massachusetts Amherst
Introduction
Singular Value Decomposition (SVD)
• 𝑨: 𝑚 × 𝑛 matrix (𝑚 ≥ 𝑛)
• 𝑼, 𝑽: orthogonal matrices
• 𝜮: diagonal matrix
Exact SVD can be computed in 𝑶(𝒎𝒏𝟐) time.
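For reference, the exact decomposition above is one library call; a minimal NumPy sketch (an illustrative choice, not the implementation discussed in this talk):

```python
import numpy as np

# Exact (thin) SVD of an m x n matrix A with m >= n: A = U @ diag(s) @ Vt.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# U and V have orthonormal columns; s is non-negative and sorted descending.
assert np.allclose(U @ np.diag(s) @ Vt, A)
```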
Introduction
Approximate SVD
• 𝑨: 𝑚 × 𝑛 matrix, whose intrinsic dimensionality is far less than 𝑛
It is usually sufficient to compute the 𝑘 (≪ 𝑛) largest singular values and the corresponding singular vectors.
Approximate SVD can be computed in 𝑶(𝒎𝒏𝒌) time (compared to 𝑶(𝒎𝒏𝟐) for the full SVD).
Introduction
Sample-based Approximate SVD
• Construct a basis 𝑽 by directly sampling or taking linear combinations of rows from 𝑨.
• An approximate SVD can be extracted from the projection of 𝑨 onto 𝑽 .
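The two bullets above can be sketched in a few lines of NumPy. This hypothetical `sample_based_svd` uses random linear combinations of rows (a random-projection basis, not this talk's cosine tree) purely to illustrate the "basis, then project, then extract" pattern:

```python
import numpy as np

def sample_based_svd(A, k, rng):
    # Build a basis V from k random linear combinations of the rows of A.
    m, n = A.shape
    Omega = rng.standard_normal((k, m))
    V, _ = np.linalg.qr((Omega @ A).T)          # n x k orthonormal basis
    # Extract an SVD from the projection of A onto V.
    AV = A @ V                                   # m x k
    Up, s, Wt = np.linalg.svd(AV, full_matrices=False)
    return Up, s, V @ Wt.T                       # A ~= Up @ diag(s) @ (V Wt.T).T

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 10)) @ rng.standard_normal((10, 50))  # rank 10
U, s, V = sample_based_svd(A, 10, rng)
assert np.allclose(U @ np.diag(s) @ V.T, A)      # exact when rank(A) <= k
```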
Sampling Methods
• Length-squared [Frieze et al. 2004, Deshpande et al. 2006]
• Random projection [Friedland et al. 2006, Sarlos et al. 2006, Rokhlin et al. 2009]
Introduction
QUIC-SVD [Holmes et al. NIPS 2008]
• Sampled-based approximate SVD.
• Progressively constructs a basis 𝑽 using a cosine tree.
• Results have a controllable L2 error; basis construction adapts to the matrix and the desired error.
Our Goal
Map QUIC-SVD onto the GPU (Graphics Processing Units)
• Powerful data-parallel computation platform.
• High computation density, high memory bandwidth.
• Relatively low-cost.
NVIDIA GTX 580: 512 cores, 1.6 TFLOPS, 3 GB memory, 200 GB/s bandwidth, $549
Related Work
OpenGL-based: rasterization, framebuffers, shaders
• [Galoppo et al. 2005]
• [Bondhugula et al. 2006]
CUDA-based:
• Matrix factorizations [Volkov et al. 2008]
• QR decomposition [Kerr et al. 2009]
• SVD [Lahabar et al. 2009]
• Commercial GPU linear algebra package [CULA 2010]
Our Work
A CUDA implementation of QUIC-SVD.
Up to 𝟔~𝟕 times speedup over an optimized CPU version.
A partitioned version that can process large, out-of-core matrices on a single GPU.
• The largest matrices we tested are 22,000 × 22,000 (double-precision floating point).
Overview of QUIC-SVD
[Holmes et al. 2008]
Overview of QUIC-SVD
Start from a root node that owns the entire matrix (all rows).
Overview of QUIC-SVD
Select a pivot row by importance sampling the squared magnitudes of the rows.
Overview of QUIC-SVD
Compute the inner product of every row with the pivot row (this is why it is called a cosine tree).
Overview of QUIC-SVD
Based on the inner products, the rows are partitioned into two subsets (those whose inner products are closer to zero, and the rest), each represented by a child node.
Overview of QUIC-SVD
Compute the row mean (average) of each subset and add it to the current basis set (kept orthogonalized).
Overview of QUIC-SVD
Repeat until the estimated L2 error is below a threshold. The basis set now provides a good low-rank approximation.
Overview of QUIC-SVD
Monte Carlo Error Estimation:
Input: any subset of rows 𝑨𝑠, the basis 𝑽, and 𝛿 ∈ [0,1]
Output: an estimate 𝜖̂ such that, with probability at least 1 − 𝛿, the L2 error satisfies ‖𝑨 − 𝑨𝑽𝑽𝑻‖² ≤ 𝜖̂
Termination criterion: stop when 𝜖̂ ≤ 𝜖‖𝑨‖² for a user-specified relative error 𝜖.
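A hedged NumPy sketch of the idea (hypothetical helper; only the sampled mean estimate is shown, while the paper's confidence bound for the 1 − 𝛿 guarantee is omitted):

```python
import numpy as np

def estimate_sq_error(A, V, num_samples, rng):
    # Sample rows of A, measure each sample's squared residual after
    # projection onto the basis V (orthonormal columns), and scale the
    # mean up to estimate ||A - A V V^T||^2.
    m = A.shape[0]
    idx = rng.integers(0, m, size=num_samples)   # uniform row sample
    rows = A[idx]
    residual = rows - (rows @ V) @ V.T           # component outside span(V)
    return m * np.mean(np.sum(residual**2, axis=1))
```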
Overview of QUIC-SVD
5 Steps:
1. Select a leaf node with maximum estimated L2 error;
2. Split the node and create two child nodes;
3. Calculate and insert the mean vector of each child to the basis set 𝑉 (replacing the parent vector), orthogonalized;
4. Estimate the error of newly added child nodes;
5. Repeat steps 1-4 until the termination criterion is satisfied.
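The split and basis-update steps above can be sketched in NumPy (hypothetical helper names; the error estimation, the priority queue, and the replacement of the parent's basis vector are omitted, and a median-of-|inner product| threshold stands in for the closer-to-zero split):

```python
import numpy as np

def split_node(A, rows, rng):
    # Pick a pivot by importance-sampling squared row magnitudes, then
    # partition rows by |inner product with the pivot| (simplified split).
    sq = np.sum(A[rows] ** 2, axis=1)
    pivot = rows[rng.choice(len(rows), p=sq / sq.sum())]
    dots = np.abs(A[rows] @ A[pivot])
    median = np.median(dots)
    return rows[dots <= median], rows[dots > median]

def add_mean_to_basis(A, rows, V):
    # Orthogonalize the subset's row mean against the basis V (orthonormal
    # columns) and append it if a nonzero component remains.
    v = A[rows].mean(axis=0)
    v = v - V @ (V.T @ v)
    norm = np.linalg.norm(v)
    return np.hstack([V, (v / norm)[:, None]]) if norm > 1e-12 else V
```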
GPU Implementation of QUIC-SVD
The computation cost is mostly in constructing the tree.
Most expensive steps:
• Computing vector inner products and row means
• Basis orthogonalization
Parallelize Vector Inner Products and Row Means
Compute all inner products and row means in parallel.
Assuming a large matrix, there are enough rows to engage all GPU threads.
An index array is used to point to the physical rows.
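In NumPy terms the same computation is a single matrix-vector product plus a mean reduction (illustrative only; `index` stands in for the index array, and on the GPU each output element maps to one thread):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((1000, 64))
index = rng.choice(1000, size=300, replace=False)  # rows owned by one node
pivot = A[index[0]]

dots = A[index] @ pivot            # all inner products with the pivot row
row_mean = A[index].mean(axis=0)   # row mean of the subset

assert dots.shape == (300,)
```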
Parallelize Gram Schmidt Orthogonalization
Classical Gram-Schmidt: assume 𝑣1, 𝑣2, …, 𝑣𝑛 are in the current basis set (i.e. they are orthonormal vectors); to add a new vector 𝑟:

𝑟 ← 𝑟 − proj(𝑟, 𝑣1) − proj(𝑟, 𝑣2) − … − proj(𝑟, 𝑣𝑛), where proj(𝑟, 𝑣) = (𝑟 ⋅ 𝑣)𝑣
𝑣𝑛+1 = 𝑟 / ‖𝑟‖

This is easy to parallelize (each projection is independently calculated from the original 𝑟), but it has poor numerical stability.
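One classical Gram-Schmidt step in NumPy (hypothetical helper name): all projections use the original 𝑟, so they collapse into one matrix product that parallelizes trivially.

```python
import numpy as np

def classical_gs_step(V, r):
    # V: n x k with orthonormal columns. All k projections are computed
    # against the original r in one product (parallel-friendly).
    r = r - V @ (V.T @ r)
    return r / np.linalg.norm(r)
```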
Parallelize Gram Schmidt Orthogonalization
Modified Gram Schmidt:
This has good numerical stability, but is hard to parallelize, as the computation is sequential!
𝑟1 = 𝑟 − proj(𝑟, 𝑣1),
𝑟2 = 𝑟1 − proj(𝑟1, 𝑣2),
…
𝑟𝑛 = 𝑟𝑛−1 − proj(𝑟𝑛−1, 𝑣𝑛),
𝑣𝑛+1 = 𝑟𝑛 / ‖𝑟𝑛‖
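The same step in modified form (hypothetical helper name): each projection uses the partially orthogonalized vector, so the loop is inherently sequential.

```python
import numpy as np

def modified_gs_step(V, r):
    # Stable but sequential: each subtraction depends on the previous one.
    for i in range(V.shape[1]):
        r = r - (r @ V[:, i]) * V[:, i]
    return r / np.linalg.norm(r)
```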
Parallelize Gram Schmidt Orthogonalization
Our approach: Blocked Gram Schmidt
This hybrid approach combines the advantages of the previous two – numerically stable, and GPU-friendly.
With the current basis 𝑢1, …, 𝑢𝑛 grouped into blocks of size 𝑚, a new vector 𝑣 is orthogonalized block by block:

𝑣(1) = 𝑣 − proj(𝑣, 𝑢1) − proj(𝑣, 𝑢2) − … − proj(𝑣, 𝑢𝑚)
𝑣(2) = 𝑣(1) − proj(𝑣(1), 𝑢𝑚+1) − … − proj(𝑣(1), 𝑢2𝑚)
…
𝑣(𝑛/𝑚) = 𝑣(𝑛/𝑚−1) − proj(𝑣(𝑛/𝑚−1), 𝑢𝑛−𝑚+1) − … − proj(𝑣(𝑛/𝑚−1), 𝑢𝑛)
𝑢𝑘+1 = 𝑣(𝑛/𝑚) / ‖𝑣(𝑛/𝑚)‖
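A minimal sketch of the blocked step (hypothetical helper name; the block size is an arbitrary illustrative choice): classical GS within each block (one parallel-friendly matrix product), with blocks applied in sequence for modified-GS-style stability.

```python
import numpy as np

def blocked_gs_step(V, r, block=4):
    # V: n x k with orthonormal columns, processed block columns at a time.
    for start in range(0, V.shape[1], block):
        Vb = V[:, start:start + block]
        r = r - Vb @ (Vb.T @ r)   # classical GS within the block
    return r / np.linalg.norm(r)
```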
Extract the Final Result
Input: matrix 𝑨, basis set 𝑽
Output: 𝑼,𝜮, 𝑽, the best approximate SVD of 𝑨 within the subspace 𝑽 .
Algorithm:
• Compute 𝑨𝑽 , then (𝑨𝑽 )𝑻𝑨𝑽 , and SVD 𝑼′𝜮′𝑽′𝑻 = (𝑨𝑽 )𝑻𝑨𝑽
• Let 𝑽 = 𝑽 𝑽′, 𝜮 = (𝜮′)𝟏/𝟐, 𝑼 = (𝑨𝑽 )𝑽′𝜮−𝟏.
• Return 𝑼,𝜮, 𝑽.
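The extraction written out in NumPy (hypothetical helper name): the only dense SVD is of the 𝑘 × 𝑘 matrix (𝑨𝑽)ᵀ(𝑨𝑽), and since that matrix is symmetric positive semi-definite its left and right singular vectors coincide, so 𝑼′ serves as 𝑽′.

```python
import numpy as np

def extract_svd(A, V):
    AV = A @ V                                # m x k projection of A onto V
    Up, s2, _ = np.linalg.svd(AV.T @ AV)      # k x k SVD; s2 = (Sigma')
    sigma = np.sqrt(s2)                       # Sigma = (Sigma')^{1/2}
    U = AV @ Up / sigma                       # U = (A V) V' Sigma^{-1}
    return U, sigma, V @ Up                   # V <- V V'
```

(Assumes 𝑨𝑽 has full rank 𝑘 so the division by sigma is safe.)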
Note: the final dense SVD involves only a 𝒌 × 𝒌 matrix.
Partitioned QUIC-SVD
The advantage of the GPU is only apparent for large-scale problems, e.g. matrices larger than 10,000 × 10,000.
For dense matrices, this will soon become an out-of-core problem.
We modified our algorithm to use a matrix partitioning scheme, so that each partition can fit in GPU memory.
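A sketch of the partitioning idea (hypothetical helper, not the talk's CUDA code): per-row quantities are accumulated one fixed-size row block at a time, so only one block needs to be resident in GPU memory at once.

```python
import numpy as np

def blockwise_dots_and_norms(A, pivot, block_rows):
    # Stream A through memory in partitions of block_rows rows, computing
    # inner products with the pivot row and squared row norms per block.
    m = A.shape[0]
    dots = np.empty(m)
    norms = np.empty(m)
    for start in range(0, m, block_rows):
        blk = A[start:start + block_rows]          # one partition A_i
        dots[start:start + block_rows] = blk @ pivot
        norms[start:start + block_rows] = np.sum(blk**2, axis=1)
    return dots, norms
```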
Partitioned QUIC-SVD
Partition the matrix into fixed-size blocks
𝑨𝟏, 𝑨𝟐, …, 𝑨𝒔
Partitioned QUIC-SVD
The termination criterion is still guaranteed.
Implementation Details
Main implementation done in CUDA.
Use CULA library to extract the final SVD (this involves processing a 𝑘 × 𝑘 matrix, where 𝑘 ≪ 𝑛).
Selection of the splitting node is done using a priority queue, maintained on the CPU side.
Source code will be made available on our site http://graphics.cs.umass.edu
Performance
Test environment:
• CPU: Intel Core i7 2.66 GHz (8 hyper-threads), 6 GB memory
• GPU: NVIDIA GTX 480
Test data:
• Given a rank 𝑘 and matrix size 𝑛, we generate an 𝑛 × 𝑘 left factor and a 𝑘 × 𝑛 right factor, both filled with random numbers; their product is the test matrix.
Result accuracy:
• Both CPU and GPU versions use double-precision floats; termination error 𝜖 = 10⁻¹², Monte Carlo estimation 𝛿 = 10⁻¹².
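The test-matrix construction above, in a few lines of NumPy (hypothetical helper name):

```python
import numpy as np

def make_test_matrix(n, k, rng):
    # n x k left factor times k x n right factor: an n x n matrix of rank <= k.
    return rng.standard_normal((n, k)) @ rng.standard_normal((k, n))

A = make_test_matrix(200, 10, np.random.default_rng(3))
```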
Performance
Comparisons:
1. Matlab's svds;
2. CPU-based QUIC-SVD (optimized, multi-threaded, implemented with the Intel Math Kernel Library);
3. GPU-based QUIC-SVD;
4. Tygert SVD (approximate SVD based on random projection, implemented in Matlab).
Performance – Running Time
Performance – Speedup Factors
Performance – Speedup Factors
Performance – CPU vs. GPU QUIC-SVD
(Charts: running time and speedup.)
Performance – GPU QUIC-SVD vs. Tygert SVD
(Charts: running time and speedup.)
Integration with Manifold Alignment Framework
Conclusion
A fast GPU-based implementation of an approximate SVD algorithm.
Reasonable (up to 6~7×) speedup over an optimized CPU implementation and over Tygert SVD.
Our partitioned version allows processing out-of-core matrices.
Future Work
Support for sparse matrices.
Application in solving large graph Laplacians.
Components from this project can become building blocks for other algorithms, such as random projection trees and diffusion wavelets.
Acknowledgement
Alexander Gray
PPAM reviewers
NSF Grant FODAVA-1025120