Fast Algorithms for Analyzing Massive Data
Alexander Gray, Georgia Institute of Technology
www.fast-lab.org
The FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
1. Alexander Gray: Assoc Prof, Applied Math + CS; PhD CS
2. Arkadas Ozakin: Research Scientist, Math + Physics; PhD Physics
3. Dongryeol Lee: PhD student, CS + Math
4. Ryan Riegel: PhD student, CS + Math
5. Sooraj Bhat: PhD student, CS
6. Nishant Mehta: PhD student, CS
7. Parikshit Ram: PhD student, CS + Math
8. William March: PhD student, Math + CS
9. Hua Ouyang: PhD student, CS
10. Ravi Sastry: PhD student, CS
11. Long Tran: PhD student, CS
12. Ryan Curtin: PhD student, EE
13. Ailar Javadi: PhD student, EE
14. Anita Zakrzewska: PhD student, CS
+ 5-10 MS students and undergraduates
7 tasks of machine learning / data mining
1. Querying: spherical range-search O(N), orthogonal range-search O(N), nearest-neighbor O(N), all-nearest-neighbors O(N²)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N²), kernel conditional density estimation O(N³)
3. Classification: decision tree, nearest-neighbor classifier O(N²), kernel discriminant analysis O(N²), support vector machine O(N³), Lp SVM
4. Regression: linear regression, LASSO, kernel regression O(N²), Gaussian process regression O(N³)
5. Dimension reduction: PCA, non-negative matrix factorization, kernel PCA O(N³), maximum variance unfolding O(N³); Gaussian graphical models, discrete graphical models
6. Clustering: k-means, mean-shift O(N²), hierarchical (FoF) clustering O(N³)
7. Testing and matching: MST O(N³), bipartite cross-matching O(N³), n-point correlation 2-sample testing O(Nⁿ), kernel embedding
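As a toy illustration of why several of these methods carry an O(N²) cost, a naive kernel density estimate must evaluate a kernel between every query point and every data point. A minimal Python sketch (illustrative only, not the FASTlab implementations):

```python
import math
import random

def kde_naive(query_points, data, h):
    """Naive 1-D Gaussian kernel density estimate with bandwidth h.
    Evaluating at all N data points costs N*N kernel evaluations."""
    norm = 1.0 / (len(data) * h * math.sqrt(2 * math.pi))
    estimates = []
    for x in query_points:
        s = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
        estimates.append(norm * s)
    return estimates

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(500)]
dens = kde_naive(data, data, h=0.3)  # 500 x 500 = 250,000 kernel evaluations
print(len(dens))
```

The density is higher near the mode than in the tails, but the quadratic cost is what the tree- and transform-based methods below attack.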
7 tasks of machine learning / data mining
1. Querying: spherical range-search O(N), orthogonal range-search O(N), nearest-neighbor O(N), all-nearest-neighbors O(N²)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N²), kernel conditional density estimation O(N³), submanifold density estimation [Ozakin & Gray, NIPS 2010] O(N³), convex adaptive kernel estimation [Sastry & Gray, AISTATS 2011] O(N⁴)
3. Classification: decision tree, nearest-neighbor classifier O(N²), kernel discriminant analysis O(N²), support vector machine O(N³), Lp SVM, non-negative SVM [Guan et al., 2011]
4. Regression: linear regression, LASSO, kernel regression O(N²), Gaussian process regression O(N³)
5. Dimension reduction: PCA, non-negative matrix factorization, kernel PCA O(N³), maximum variance unfolding O(N³); Gaussian graphical models, discrete graphical models, rank-preserving maps [Ouyang & Gray, ICML 2008] O(N³), isometric separation maps [Vasiloglou, Gray & Anderson, MLSP 2009] O(N³), isometric NMF [Vasiloglou, Gray & Anderson, MLSP 2009] O(N³), functional ICA [Mehta & Gray, 2009], density preserving maps [Ozakin & Gray, in prep] O(N³)
6. Clustering: k-means, mean-shift O(N²), hierarchical (FoF) clustering O(N³)
7. Testing and matching: MST O(N³), bipartite cross-matching O(N³), n-point correlation 2-sample testing O(Nⁿ), kernel embedding
The “7 Giants” of Data (computational problem types)
[Gray, Indyk, Mahoney, Szalay, in National Acad of Sci Report on Analysis of Massive Data, in prep]
1. Basic statistics: means, covariances, etc.
2. Generalized N-body problems: distances, geometry
3. Graph-theoretic problems: discrete graphs
4. Linear-algebraic problems: matrix operations
5. Optimizations: unconstrained, convex
6. Integrations: general dimension
7. Alignment problems: dynamic prog, matching
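Problem type 2, generalized N-body problems, covers computations over all pairs (or higher tuples) of points. As a concrete instance, all-nearest-neighbors done naively costs O(N²) distance evaluations; a minimal sketch:

```python
def all_nearest_neighbors(points):
    """Naive all-nearest-neighbors: for each point, scan every other
    point -- O(N^2) distance computations in total."""
    nn = []
    for i, p in enumerate(points):
        best_j, best_d = None, float("inf")
        for j, q in enumerate(points):
            if i == j:
                continue
            d = sum((a - b) ** 2 for a, b in zip(p, q))  # squared distance
            if d < best_d:
                best_j, best_d = j, d
        nn.append(best_j)
    return nn

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(all_nearest_neighbors(points))  # -> [1, 0, 3, 2]
```

The tree-based strategies in the next list reduce this to O(N) or O(N log N) by pruning whole regions of space at once.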
7 general strategies
1. Divide and conquer / indexing (trees)
2. Function transforms (series)
3. Sampling (Monte Carlo, active learning)
4. Locality (caching)
5. Streaming (online)
6. Parallelism (clusters, GPUs)
7. Problem transformation (reformulations)
1. Divide and conquer
• Fastest approach for:
– nearest neighbor, range search (exact) ~O(log N) [Bentley 1970], all-nearest-neighbors (exact) O(N) [Gray & Moore, NIPS 2000], [Ram, Lee, March, Gray, NIPS 2010], anytime nearest neighbor (exact) [Ram & Gray, SDM 2012], max inner product [Ram & Gray, under review]
– mixture of Gaussians [Moore, NIPS 1999], k-means [Pelleg & Moore, KDD 1999], mean-shift clustering O(N) [Lee & Gray, AISTATS 2009], hierarchical clustering (single linkage, friends-of-friends) O(N log N) [March & Gray, KDD 2010]
– nearest neighbor classification [Liu, Moore, Gray, NIPS 2004], kernel discriminant analysis O(N) [Riegel & Gray, SDM 2008]
– n-point correlation functions ~O(N^(log n)) [Gray & Moore, NIPS 2000], [Moore et al., Mining the Sky 2000], multi-matcher jackknifed npcf [March & Gray, under review]
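In one dimension, this divide-and-conquer / indexing idea reduces to binary search on a sorted array, which already gives exact ~O(log N) nearest-neighbor queries. A toy sketch of that one-dimensional analogue (not the cited multi-tree algorithms):

```python
import bisect

def nearest_1d(sorted_data, x):
    """Exact 1-D nearest neighbor in O(log N): binary-search for the
    insertion point, then compare the (at most two) bracketing values."""
    i = bisect.bisect_left(sorted_data, x)
    candidates = sorted_data[max(0, i - 1):i + 1]
    return min(candidates, key=lambda v: abs(v - x))

data = sorted([0.2, 1.5, 3.7, 4.1, 9.0])
print(nearest_1d(data, 3.95))  # -> 4.1
```

Higher-dimensional trees (kd-trees, ball trees) generalize the same prune-half-the-data step to spatial regions.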
3-point correlation
VIRGO simulation data, N = 75,000,000 (biggest previous: 20K)
naive: 5×10⁹ sec (~150 years); multi-tree: 55 sec (exact)
n = 2: O(N)
n = 3: O(N^(log 3))
n = 4: O(N²)
3-point correlation
Naive O(Nⁿ) (estimated) vs. single-bandwidth [Gray & Moore 2000, Moore et al. 2000] vs. multi-bandwidth (new) [March & Gray, in prep 2010]; each speedup is relative to the column before it. 10⁶ points, galaxy simulation data.

                              Naive         Single bandwidth       Multi-bandwidth
2-point cor., 100 matchers:   2.0×10⁷ s     352.8 s (56,000×)      4.96 s (71.1×)
3-point cor., 243 matchers:   1.1×10¹¹ s    891.6 s (1.23×10⁸×)    13.58 s (65.6×)
4-point cor., 216 matchers:   2.3×10¹⁴ s    14,530 s (1.58×10¹⁰×)  503.6 s (28.8×)
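For reference, the naive O(Nⁿ) baseline in the table's first column amounts, for n = 2, to testing every pair of points against a distance "matcher". A minimal sketch, with a single [r_lo, r_hi) separation bin standing in for a matcher:

```python
import itertools
import math

def two_point_count(points, r_lo, r_hi):
    """Count pairs whose separation falls in [r_lo, r_hi) -- the naive
    O(N^2) 2-point correlation kernel that the tree algorithms accelerate."""
    count = 0
    for p, q in itertools.combinations(points, 2):
        d = math.dist(p, q)
        if r_lo <= d < r_hi:
            count += 1
    return count

points = [(0, 0), (1, 0), (0, 1), (10, 10)]
print(two_point_count(points, 0.5, 1.5))  # -> 3
```

The multi-matcher algorithms amortize the tree traversal across many such bins at once, which is where the extra 28-71x in the last column comes from.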
2. Function transforms
• Fastest approach for:
– Kernel estimation (low-ish dimension): dual-tree fast Gauss transforms (multipole/Hermite expansions) [Lee, Gray, Moore, NIPS 2005], [Lee & Gray, UAI 2006]
– KDE and GP (kernel density estimation, Gaussian process regression) (high-D): random Fourier functions [Lee & Gray, in prep]
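"Random Fourier functions" here refers to approximating a shift-invariant kernel by an explicit, finite-dimensional random feature map whose inner products concentrate around the kernel value. A toy 1-D sketch for a unit-bandwidth Gaussian kernel (illustrative only, not the cited method's code):

```python
import math
import random

def rff_map(x, ws, bs):
    """Map scalar x to D random Fourier features so that
    z(x).z(y) approximates exp(-(x-y)^2 / 2)."""
    D = len(ws)
    return [math.sqrt(2.0 / D) * math.cos(w * x + b) for w, b in zip(ws, bs)]

random.seed(1)
D = 2000
ws = [random.gauss(0.0, 1.0) for _ in range(D)]       # frequencies from the kernel's spectral density
bs = [random.uniform(0.0, 2 * math.pi) for _ in range(D)]  # random phases

x, y = 0.3, 0.8
zx, zy = rff_map(x, ws, bs), rff_map(y, ws, bs)
approx = sum(a * b for a, b in zip(zx, zy))
exact = math.exp(-((x - y) ** 2) / 2.0)
print(approx, exact)  # the two values agree to within O(1/sqrt(D))
```

Once the data are mapped this way, KDE and GP reduce to linear-time operations in the D-dimensional feature space instead of O(N²)/O(N³) kernel-matrix work.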
3. Sampling
• Fastest approach for (approximate):
– PCA: cosine trees [Holmes, Gray, Isbell, NIPS 2008]
– Kernel estimation: bandwidth learning [Holmes, Gray, Isbell, NIPS 2006], [Holmes, Gray, Isbell, UAI 2007], Monte Carlo multipole method (with SVD trees) [Lee & Gray, NIPS 2009]
– Nearest-neighbor: distance-approximate: spill trees with random projections [Liu, Moore, Gray, Yang, NIPS 2004]; rank-approximate: [Ram, Ouyang, Gray, NIPS 2009]
Rank-approximate NN:
• Best meaning-retaining approximation criterion in the face of high-dimensional distances
• More accurate than LSH
3. Sampling
• Active learning: the sampling can depend on previous samples
– Linear classifiers: rigorous framework for pool-based active learning [Sastry & Gray, AISTATS 2012]
• Empirically allows reduction in the number of objects that require labeling
• Theoretical rigor: unbiasedness
4. Caching
• Fastest approach for (using disk):
– Nearest-neighbor, 2-point: disk-based tree algorithms in Microsoft SQL Server [Riegel, Aditya, Budavari, Gray, in prep]
• Builds kd-tree on top of built-in B-trees
• Fixed-pass algorithm to build the kd-tree
No. of points   MLDB (dual-tree)   Naive
40,000          8 seconds          159 seconds
200,000         43 seconds         3,480 seconds
2,000,000       297 seconds        80 hours
10,000,000      29 min 27 sec      74 days
20,000,000      58 min 48 sec      280 days
40,000,000      112 min 32 sec     2 years
5. Streaming / online
• Fastest approach for (approximate, or streaming):
– Online learning/stochastic optimization: just use the current sample to update the gradient
• SVM (squared hinge loss): stochastic Frank-Wolfe [Ouyang & Gray, SDM 2010]
• SVM, LASSO, et al.: noise-adaptive stochastic approximation [Ouyang & Gray, in prep, on arXiv], accelerated non-smooth SGD [Ouyang & Gray, under review]
– faster than SGD
– solves the step-size problem
– beats all existing convergence rates
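For intuition, plain stochastic gradient descent on the squared hinge loss updates the weights from one sample at a time, exactly as described above. A minimal sketch (illustrative only; not the cited stochastic Frank-Wolfe or noise-adaptive methods):

```python
import random

def sgd_squared_hinge(data, labels, lr=0.1, epochs=50):
    """Plain SGD on the squared hinge loss (1 - y(w.x + b))^2 for
    margins < 1: one sample per gradient step."""
    w, b = [0.0, 0.0], 0.0
    idx = list(range(len(data)))
    for _ in range(epochs):
        random.shuffle(idx)
        for i in idx:
            x, y = data[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:  # gradient is -2(1 - margin) * y * x
                g = 2.0 * (1.0 - margin)
                w = [wj + lr * g * y * xj for wj, xj in zip(w, x)]
                b += lr * g * y
    return w, b

random.seed(0)
data = [(1.0, 1.0), (2.0, 1.5), (-1.0, -1.0), (-2.0, -0.5)]
labels = [1, 1, -1, -1]
w, b = sgd_squared_hinge(data, labels)
preds = [1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1 for x in data]
print(preds)  # correctly classifies this separable toy set
```

Because each step touches only one sample, the cost per update is independent of N, which is what makes the streaming setting workable.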
6. Parallelism
• Fastest approach for (using many machines):
– KDE, GP, n-point: distributed trees [Lee & Gray, SDM 2012], 6000+ cores; [March et al., in prep for Gordon Bell Prize 2012], 100K cores?
• Each process owns the global tree and its local tree
• First log p levels built in parallel; each process determines where to send data
• Asynchronous averaging; provable convergence
– SVM, LASSO, et al.: distributed online optimization [Ouyang & Gray, in prep, on arXiv]
• Provable theoretical speedup for the first time
7. Transformations between problems
• Change the problem type:
– Linear algebra on kernel matrices → N-body inside conjugate gradient [Gray, TR 2004]
– Euclidean graphs → N-body problems [March & Gray, KDD 2010]
– HMM as graph → matrix factorization [Tran & Gray, in prep]
• Optimizations: reformulate the objective and constraints:
– Maximum variance unfolding: SDP via Burer-Monteiro convex relaxation [Vasiloglou, Gray, Anderson, MLSP 2009]
– Lq SVM, 0<q<1: DC programming [Guan & Gray, CSDA 2011]
– L0 SVM: mixed integer nonlinear program via perspective cuts [Guan & Gray, under review]
– Do reformulations automatically [Agarwal et al., PADL 2010], [Bhat et al., POPL 2012]
• Create new ML methods with desired computational properties:
– Density estimation trees: nonparametric density estimation, O(N log N) [Ram & Gray, KDD 2011]
– Local linear SVMs: nonlinear classification, O(N log N) [Sastry & Gray, under review]
– Discriminative local coding: nonlinear classification, O(N log N) [Mehta & Gray, under review]
Software
• For academic use only: MLPACK
– Open source, C++, written by students
– Data must fit in RAM; distributed version in progress
• For institutions: Skytree Server
– First commercial-grade high-performance machine learning server
– Fastest, biggest ML available: up to 10,000x faster than existing solutions (on one machine)
– V.12, April 2012-ish: distributed, streaming
– Connects to stats packages, Matlab, DBMS, Python, etc.
– www.skytreecorp.com
– Colleagues: email me to try it out: agray@cc.gatech.edu