Fast Algorithms for Analyzing Massive Data
Alexander Gray, Georgia Institute of Technology
www.fast-lab.org
The FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
1. Alexander Gray: Assoc Prof, Applied Math + CS; PhD CS
2. Arkadas Ozakin: Research Scientist, Math + Physics; PhD Physics
3. Dongryeol Lee: PhD student, CS + Math
4. Ryan Riegel: PhD student, CS + Math
5. Sooraj Bhat: PhD student, CS
6. Nishant Mehta: PhD student, CS
7. Parikshit Ram: PhD student, CS + Math
8. William March: PhD student, Math + CS
9. Hua Ouyang: PhD student, CS
10. Ravi Sastry: PhD student, CS
11. Long Tran: PhD student, CS
12. Ryan Curtin: PhD student, EE
13. Ailar Javadi: PhD student, EE
14. Anita Zakrzewska: PhD student, CS
+ 5-10 MS students and undergraduates
7 tasks of machine learning / data mining
1. Querying: spherical range-search O(N), orthogonal range-search O(N), nearest-neighbor O(N), all-nearest-neighbors O(N²)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N²), kernel conditional density estimation O(N³)
3. Classification: decision tree, nearest-neighbor classifier O(N²), kernel discriminant analysis O(N²), support vector machine O(N³), Lp SVM
4. Regression: linear regression, LASSO, kernel regression O(N²), Gaussian process regression O(N³)
5. Dimension reduction: PCA, non-negative matrix factorization, kernel PCA O(N³), maximum variance unfolding O(N³); Gaussian graphical models, discrete graphical models
6. Clustering: k-means, mean-shift O(N²), hierarchical (FoF) clustering O(N³)
7. Testing and matching: MST O(N³), bipartite cross-matching O(N³), n-point correlation 2-sample testing O(Nⁿ), kernel embedding
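As a toy illustration of why several of these methods carry an O(N²) cost, a naive kernel density estimate must evaluate a kernel between every query point and every data point. A minimal Python sketch (illustrative only, not the FASTlab implementations):

```python
import math
import random

def kde_naive(query_points, data, h):
    """Naive 1-D Gaussian kernel density estimate with bandwidth h.
    Evaluating at all N data points costs N*N kernel evaluations."""
    norm = 1.0 / (len(data) * h * math.sqrt(2 * math.pi))
    estimates = []
    for x in query_points:
        s = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
        estimates.append(norm * s)
    return estimates

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(500)]
dens = kde_naive(data, data, h=0.3)  # 500 x 500 = 250,000 kernel evaluations
print(len(dens))
```

The density is higher near the mode than in the tails, but the quadratic cost is what the tree- and transform-based methods below attack.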
7 tasks of machine learning / data mining
1. Querying: spherical range-search O(N), orthogonal range-search O(N), nearest-neighbor O(N), all-nearest-neighbors O(N²)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N²), kernel conditional density estimation O(N³), submanifold density estimation [Ozakin & Gray, NIPS 2010] O(N³), convex adaptive kernel estimation [Sastry & Gray, AISTATS 2011] O(N⁴)
3. Classification: decision tree, nearest-neighbor classifier O(N²), kernel discriminant analysis O(N²), support vector machine O(N³), Lp SVM, non-negative SVM [Guan et al., 2011]
4. Regression: linear regression, LASSO, kernel regression O(N²), Gaussian process regression O(N³)
5. Dimension reduction: PCA, non-negative matrix factorization, kernel PCA O(N³), maximum variance unfolding O(N³); Gaussian graphical models, discrete graphical models, rank-preserving maps [Ouyang & Gray, ICML 2008] O(N³), isometric separation maps [Vasiloglou, Gray & Anderson, MLSP 2009] O(N³), isometric NMF [Vasiloglou, Gray & Anderson, MLSP 2009] O(N³), functional ICA [Mehta & Gray, 2009], density preserving maps [Ozakin & Gray, in prep] O(N³)
6. Clustering: k-means, mean-shift O(N²), hierarchical (FoF) clustering O(N³)
7. Testing and matching: MST O(N³), bipartite cross-matching O(N³), n-point correlation 2-sample testing O(Nⁿ), kernel embedding
The “7 Giants” of Data (computational problem types)
[Gray, Indyk, Mahoney, Szalay, in National Acad of Sci Report on Analysis of Massive Data, in prep]
1. Basic statistics: means, covariances, etc.
2. Generalized N-body problems: distances, geometry
3. Graph-theoretic problems: discrete graphs
4. Linear-algebraic problems: matrix operations
5. Optimizations: unconstrained, convex
6. Integrations: general dimension
7. Alignment problems: dynamic prog, matching
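Problem type 2, generalized N-body problems, covers computations over all pairs (or higher tuples) of points. As a concrete instance, all-nearest-neighbors done naively costs O(N²) distance evaluations; a minimal sketch:

```python
def all_nearest_neighbors(points):
    """Naive all-nearest-neighbors: for each point, scan every other
    point -- O(N^2) distance computations in total."""
    nn = []
    for i, p in enumerate(points):
        best_j, best_d = None, float("inf")
        for j, q in enumerate(points):
            if i == j:
                continue
            d = sum((a - b) ** 2 for a, b in zip(p, q))  # squared distance
            if d < best_d:
                best_j, best_d = j, d
        nn.append(best_j)
    return nn

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(all_nearest_neighbors(points))  # -> [1, 0, 3, 2]
```

The tree-based strategies in the next list reduce this to O(N) or O(N log N) by pruning whole regions of space at once.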
7 general strategies
1. Divide and conquer / indexing (trees)
2. Function transforms (series)
3. Sampling (Monte Carlo, active learning)
4. Locality (caching)
5. Streaming (online)
6. Parallelism (clusters, GPUs)
7. Problem transformation (reformulations)
1. Divide and conquer
• Fastest approach for:
– nearest neighbor, range search (exact) ~O(log N) [Bentley 1970], all-nearest-neighbors (exact) O(N) [Gray & Moore, NIPS 2000], [Ram, Lee, March, Gray, NIPS 2010], anytime nearest neighbor (exact) [Ram & Gray, SDM 2012], max inner product [Ram & Gray, under review]
– mixture of Gaussians [Moore, NIPS 1999], k-means [Pelleg & Moore, KDD 1999], mean-shift clustering O(N) [Lee & Gray, AISTATS 2009], hierarchical clustering (single linkage, friends-of-friends) O(N log N) [March & Gray, KDD 2010]
– nearest neighbor classification [Liu, Moore, Gray, NIPS 2004], kernel discriminant analysis O(N) [Riegel & Gray, SDM 2008]
– n-point correlation functions ~O(N^(log n)) [Gray & Moore, NIPS 2000], [Moore et al., Mining the Sky 2000], multi-matcher jackknifed npcf [March & Gray, under review]
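In one dimension, this divide-and-conquer / indexing idea reduces to binary search on a sorted array, which already gives exact ~O(log N) nearest-neighbor queries. A toy sketch of that one-dimensional analogue (not the cited multi-tree algorithms):

```python
import bisect

def nearest_1d(sorted_data, x):
    """Exact 1-D nearest neighbor in O(log N): binary-search for the
    insertion point, then compare the (at most two) bracketing values."""
    i = bisect.bisect_left(sorted_data, x)
    candidates = sorted_data[max(0, i - 1):i + 1]
    return min(candidates, key=lambda v: abs(v - x))

data = sorted([0.2, 1.5, 3.7, 4.1, 9.0])
print(nearest_1d(data, 3.95))  # -> 4.1
```

Higher-dimensional trees (kd-trees, ball trees) generalize the same prune-half-the-data step to spatial regions.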
3-point correlation
VIRGO simulation data, N = 75,000,000 (biggest previous: 20K)
naive: 5×10⁹ sec (~150 years); multi-tree: 55 sec (exact)
n = 2: O(N)
n = 3: O(N^(log 3))
n = 4: O(N²)
3-point correlation
Naive O(Nⁿ) (estimated) vs. single-bandwidth [Gray & Moore 2000, Moore et al. 2000] vs. multi-bandwidth (new) [March & Gray, in prep 2010]; each speedup is relative to the column before it. 10⁶ points, galaxy simulation data.

                              Naive         Single bandwidth       Multi-bandwidth
2-point cor., 100 matchers:   2.0×10⁷ s     352.8 s (56,000×)      4.96 s (71.1×)
3-point cor., 243 matchers:   1.1×10¹¹ s    891.6 s (1.23×10⁸×)    13.58 s (65.6×)
4-point cor., 216 matchers:   2.3×10¹⁴ s    14,530 s (1.58×10¹⁰×)  503.6 s (28.8×)
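For reference, the naive O(Nⁿ) baseline in the table's first column amounts, for n = 2, to testing every pair of points against a distance "matcher". A minimal sketch, with a single [r_lo, r_hi) separation bin standing in for a matcher:

```python
import itertools
import math

def two_point_count(points, r_lo, r_hi):
    """Count pairs whose separation falls in [r_lo, r_hi) -- the naive
    O(N^2) 2-point correlation kernel that the tree algorithms accelerate."""
    count = 0
    for p, q in itertools.combinations(points, 2):
        d = math.dist(p, q)
        if r_lo <= d < r_hi:
            count += 1
    return count

points = [(0, 0), (1, 0), (0, 1), (10, 10)]
print(two_point_count(points, 0.5, 1.5))  # -> 3
```

The multi-matcher algorithms amortize the tree traversal across many such bins at once, which is where the extra 28-71x in the last column comes from.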
2. Function transforms
• Fastest approach for:
– Kernel estimation (low-ish dimension): dual-tree fast Gauss transforms (multipole/Hermite expansions) [Lee, Gray, Moore, NIPS 2005], [Lee & Gray, UAI 2006]
– KDE and GP (kernel density estimation, Gaussian process regression) (high-D): random Fourier functions [Lee & Gray, in prep]
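"Random Fourier functions" here refers to approximating a shift-invariant kernel by an explicit, finite-dimensional random feature map whose inner products concentrate around the kernel value. A toy 1-D sketch for a unit-bandwidth Gaussian kernel (illustrative only, not the cited method's code):

```python
import math
import random

def rff_map(x, ws, bs):
    """Map scalar x to D random Fourier features so that
    z(x).z(y) approximates exp(-(x-y)^2 / 2)."""
    D = len(ws)
    return [math.sqrt(2.0 / D) * math.cos(w * x + b) for w, b in zip(ws, bs)]

random.seed(1)
D = 2000
ws = [random.gauss(0.0, 1.0) for _ in range(D)]       # frequencies from the kernel's spectral density
bs = [random.uniform(0.0, 2 * math.pi) for _ in range(D)]  # random phases

x, y = 0.3, 0.8
zx, zy = rff_map(x, ws, bs), rff_map(y, ws, bs)
approx = sum(a * b for a, b in zip(zx, zy))
exact = math.exp(-((x - y) ** 2) / 2.0)
print(approx, exact)  # the two values agree to within O(1/sqrt(D))
```

Once the data are mapped this way, KDE and GP reduce to linear-time operations in the D-dimensional feature space instead of O(N²)/O(N³) kernel-matrix work.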
3. Sampling
• Fastest approach for (approximate):
– PCA: cosine trees [Holmes, Gray, Isbell, NIPS 2008]
– Kernel estimation: bandwidth learning [Holmes, Gray, Isbell, NIPS 2006], [Holmes, Gray, Isbell, UAI 2007], Monte Carlo multipole method (with SVD trees) [Lee & Gray, NIPS 2009]
– Nearest-neighbor: distance-approximate: spill trees with random projections [Liu, Moore, Gray, Yang, NIPS 2004]; rank-approximate: [Ram, Ouyang, Gray, NIPS 2009]
Rank-approximate NN:
• Best meaning-retaining approximation criterion in the face of high-dimensional distances
• More accurate than LSH
3. Sampling
• Active learning: the sampling can depend on previous samples
– Linear classifiers: rigorous framework for pool-based active learning [Sastry & Gray, AISTATS 2012]
• Empirically allows reduction in the number of objects that require labeling
• Theoretical rigor: unbiasedness
4. Caching
• Fastest approach for (using disk):
– Nearest-neighbor, 2-point: disk-based tree algorithms in Microsoft SQL Server [Riegel, Aditya, Budavari, Gray, in prep]
• Builds kd-tree on top of built-in B-trees
• Fixed-pass algorithm to build the kd-tree
No. of points   MLDB (dual-tree)   Naive
40,000          8 seconds          159 seconds
200,000         43 seconds         3,480 seconds
2,000,000       297 seconds        80 hours
10,000,000      29 min 27 sec      74 days
20,000,000      58 min 48 sec      280 days
40,000,000      112 min 32 sec     2 years
5. Streaming / online
• Fastest approach for (approximate, or streaming):
– Online learning/stochastic optimization: just use the current sample to update the gradient
• SVM (squared hinge loss): stochastic Frank-Wolfe [Ouyang & Gray, SDM 2010]
• SVM, LASSO, et al.: noise-adaptive stochastic approximation [Ouyang & Gray, in prep, on arXiv], accelerated non-smooth SGD [Ouyang & Gray, under review]
– faster than SGD
– solves the step-size problem
– beats all existing convergence rates
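For intuition, plain stochastic gradient descent on the squared hinge loss updates the weights from one sample at a time, exactly as described above. A minimal sketch (illustrative only; not the cited stochastic Frank-Wolfe or noise-adaptive methods):

```python
import random

def sgd_squared_hinge(data, labels, lr=0.1, epochs=50):
    """Plain SGD on the squared hinge loss (1 - y(w.x + b))^2 for
    margins < 1: one sample per gradient step."""
    w, b = [0.0, 0.0], 0.0
    idx = list(range(len(data)))
    for _ in range(epochs):
        random.shuffle(idx)
        for i in idx:
            x, y = data[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:  # gradient is -2(1 - margin) * y * x
                g = 2.0 * (1.0 - margin)
                w = [wj + lr * g * y * xj for wj, xj in zip(w, x)]
                b += lr * g * y
    return w, b

random.seed(0)
data = [(1.0, 1.0), (2.0, 1.5), (-1.0, -1.0), (-2.0, -0.5)]
labels = [1, 1, -1, -1]
w, b = sgd_squared_hinge(data, labels)
preds = [1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1 for x in data]
print(preds)  # correctly classifies this separable toy set
```

Because each step touches only one sample, the cost per update is independent of N, which is what makes the streaming setting workable.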
6. Parallelism
• Fastest approach for (using many machines):
– KDE, GP, n-point: distributed trees [Lee & Gray, SDM 2012], 6000+ cores; [March et al., in prep for Gordon Bell Prize 2012], 100K cores?
• Each process owns the global tree and its local tree
• First log p levels built in parallel; each process determines where to send data
• Asynchronous averaging; provable convergence
– SVM, LASSO, et al.: distributed online optimization [Ouyang & Gray, in prep, on arXiv]
• Provable theoretical speedup for the first time
7. Transformations between problems
• Change the problem type:
– Linear algebra on kernel matrices → N-body inside conjugate gradient [Gray, TR 2004]
– Euclidean graphs → N-body problems [March & Gray, KDD 2010]
– HMM as graph → matrix factorization [Tran & Gray, in prep]
• Optimizations: reformulate the objective and constraints:
– Maximum variance unfolding: SDP via Burer-Monteiro convex relaxation [Vasiloglou, Gray, Anderson, MLSP 2009]
– Lq SVM, 0<q<1: DC programming [Guan & Gray, CSDA 2011]
– L0 SVM: mixed integer nonlinear program via perspective cuts [Guan & Gray, under review]
– Do reformulations automatically [Agarwal et al., PADL 2010], [Bhat et al., POPL 2012]
• Create new ML methods with desired computational properties:
– Density estimation trees: nonparametric density estimation, O(N log N) [Ram & Gray, KDD 2011]
– Local linear SVMs: nonlinear classification, O(N log N) [Sastry & Gray, under review]
– Discriminative local coding: nonlinear classification, O(N log N) [Mehta & Gray, under review]
Software
• For academic use only: MLPACK
– Open source, C++, written by students
– Data must fit in RAM; distributed version in progress
• For institutions: Skytree Server
– First commercial-grade high-performance machine learning server
– Fastest, biggest ML available: up to 10,000x faster than existing solutions (on one machine)
– V.12, April 2012-ish: distributed, streaming
– Connects to stats packages, Matlab, DBMS, Python, etc.
– www.skytreecorp.com
– Colleagues: email me to try it out: agray@cc.gatech.edu