DOE ASCR Applied Mathematics Principal Investigators' (PI) Meeting, Rockville, MD, September 11-12, 2017

Productive and extreme-scale graph computations enabled by GraphBLAS
Ariful Azad and Aydın Buluç, Computational Research Division, Lawrence Berkeley National Laboratory

Abstract

We develop several classes of graph algorithms using linear-algebraic (GraphBLAS) primitives. Our in-house Combinatorial BLAS (CombBLAS) library enabled rapid development of bipartite graph matching, reverse Cuthill-McKee ordering, triangle counting, connected components, and Markov clustering algorithms that scale to thousands of cores on modern supercomputers. These algorithms in turn empower key science applications, including protein family detection and sparse linear solvers.
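To give a flavor of this linear-algebraic formulation, the following is a minimal single-node SciPy sketch of triangle counting as a masked sparse matrix-matrix product, in the spirit of reference 5. It is illustrative only, not the CombBLAS implementation.

```python
import numpy as np
import scipy.sparse as sp

def count_triangles(A: sp.spmatrix) -> int:
    """Count triangles of an undirected graph from its symmetric 0/1
    adjacency matrix A, via the masked-SpGEMM identity
    #triangles = sum(L .* (L @ U)), where L and U are the strict
    lower and upper triangles of A."""
    L = sp.tril(A, k=-1).tocsr()
    U = sp.triu(A, k=1).tocsr()
    # (L @ U)[i, j] counts wedges through a common neighbor; masking
    # with L keeps only wedges whose endpoints are themselves adjacent,
    # so every triangle is counted exactly once.
    return int((L @ U).multiply(L).sum())

# Toy usage: a 4-cycle plus one chord contains exactly two triangles.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
A = sp.lil_matrix((4, 4))
for i, j in edges:
    A[i, j] = A[j, i] = 1
print(count_triangles(A))  # -> 2
```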



Motivation

❄ Why graphs? Graph computation drives many applications in biology and scientific computing.
❄ Why GraphBLAS? The diversity and rapid evolution of applications, architectures, and algorithms motivates us to isolate a small number of graph kernels entrusted with delivering high performance.
❄ Expected impacts: (a) a better understanding of data and computational patterns; (b) rapid development of high-performance applications.

Approach

We develop graph and machine learning algorithms using linear-algebraic primitives, in two thrusts:
1. Develop communication-avoiding and work-efficient primitives (Aydın's poster).
2. Design algorithms using the optimized primitives (a sketch of this style follows Fig 1 below).

Fig 1. Algorithmic chain to be developed. GraphBLAS primitives, in increasing arithmetic intensity: sparse matrix-sparse vector product (SpMSpV), sparse matrix-dense vector product (SpMV), sparse matrix times multiple dense vectors (SpMM), sparse-sparse matrix product (SpGEMM), and sparse-dense matrix product (SpDM³). These primitives support higher-level combinatorial and machine learning algorithms: shortest paths (all-pairs, single-source, temporal); graph clustering (Markov cluster, peer pressure, spectral, local); centrality (PageRank, betweenness, closeness); classification (support vector machines, logistic regression); dimensionality reduction (NMF, PCA); and miscellaneous kernels such as connectivity, traversal (BFS), independent sets (MIS), and graph matching.
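As a sketch of thrust 2, here is a minimal single-node SciPy rendering of level-synchronous BFS built on repeated matrix-vector products. A dense NumPy vector stands in for the sparse frontier that a true SpMSpV (as in CombBLAS) would use, and the masking step plays the role of the GraphBLAS mask.

```python
import numpy as np
import scipy.sparse as sp

def bfs_levels(A: sp.csr_matrix, source: int) -> np.ndarray:
    """Level-synchronous BFS expressed as repeated sparse matrix-vector
    products (the SpMSpV pattern). A is a 0/1 adjacency matrix in CSR
    form, symmetric for an undirected graph. Returns BFS levels,
    with -1 for unreachable vertices."""
    n = A.shape[0]
    levels = np.full(n, -1)
    frontier = np.zeros(n)
    frontier[source] = 1.0
    level = 0
    while frontier.any():
        levels[frontier > 0] = level
        reached = A.T @ frontier          # expand the frontier: one matvec
        # GraphBLAS-style mask: keep only still-unvisited vertices.
        frontier = np.where(levels == -1, reached, 0.0)
        level += 1
    return levels

# Toy usage: the path graph 0-1-2 gives levels [0, 1, 2] from source 0.
A = sp.csr_matrix(np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float))
print(bfs_levels(A, 0))  # -> [0 1 2]
```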

Results

(1) Distributed-memory graph matching
✓ A suite of parallel algorithms developed, covering approximate-weight perfect matching, maximal cardinality matching, and maximum cardinality matching (MCM), using sparse matrix-sparse vector multiply and an inverted index.
✓ Scales to several thousands of cores.
✓ Impact: removes the sequential-ordering bottleneck from SuperLU and STRUMPACK.

Fig. (a) Classes of matching algorithms; (b) performance of MCM on NERSC/Edison (time in seconds vs. number of cores, 16 to 2048): 12x-18x speedups from an ~80x increase in cores on ljournal, cage15, road_usa, nlpkkt200, hugetrace, delaunay_n24, and HV15R.
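The toy SciPy sketch below shows the round-based propose/accept pattern that underlies vectorized parallel matching. It is a plain greedy maximal matching, not the tree-grafting MCM algorithm of reference 1; the loop bodies stand in for the SpMSpV and inverted-index steps of the distributed algorithms.

```python
import numpy as np
import scipy.sparse as sp

def maximal_matching(A: sp.csr_matrix):
    """Round-based greedy maximal matching on a bipartite graph.
    A[i, j] = 1 iff row vertex i is adjacent to column vertex j.
    Each round, unmatched rows propose an unmatched neighbor and
    columns resolve conflicts, mimicking the frontier-style rounds of
    the parallel algorithms. Returns mate arrays (-1 = unmatched)."""
    A = A.tocsr()
    nr, nc = A.shape
    mate_row = np.full(nr, -1)
    mate_col = np.full(nc, -1)
    while True:
        proposals = np.full(nr, -1)
        for i in np.flatnonzero(mate_row == -1):
            cols = A.indices[A.indptr[i]:A.indptr[i + 1]]
            cols = cols[mate_col[cols] == -1]    # unmatched neighbors only
            if cols.size:
                proposals[i] = cols.min()        # propose smallest index
        if (proposals == -1).all():
            break                                # no proposals left: maximal
        for j in np.unique(proposals[proposals >= 0]):
            i = np.flatnonzero(proposals == j).min()  # column accepts one proposer
            mate_row[i], mate_col[j] = j, i
    return mate_row, mate_col
```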

(2) Distributed-memory Markov clustering (HipMCL)
✓ Uses sparse matrix-matrix multiply, connected components, and k-select algorithms.
✓ Scales up to 136K cores on NERSC/Cori.
✓ Impact: reduced a 45-day clustering job to about an hour; clustered massive networks with up to 70B edges.

Fig. Running time of HipMCL versus the original MCL (van Dongen) as a function of the number of nodes (24 cores/node).

Table. Performance of HipMCL with respect to the previous shared-memory approach; an X marks networks the shared-memory MCL could not cluster.

|V|      |E|     HipMCL (cores)    MCL (shared memory)
69 M     12 B    1.66 hr (24K)     45 days
70 M     68 B    2.41 hr (136K)    X
282 M    37 B    3.23 hr (136K)    X
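For concreteness, here is a minimal single-node sketch of the MCL iteration that HipMCL distributes: expansion is an SpGEMM, inflation is an elementwise power followed by column normalization, and pruning keeps the iterate sparse (HipMCL's distributed k-select performs a more refined pruning). The parameter defaults are illustrative, not HipMCL's.

```python
import numpy as np
import scipy.sparse as sp

def colnorm(M: sp.spmatrix) -> sp.csc_matrix:
    """Rescale columns to sum to one (guarding empty columns)."""
    s = np.asarray(M.sum(axis=0)).ravel()
    s[s == 0.0] = 1.0
    return sp.csc_matrix(M.multiply(1.0 / s.reshape(1, -1)))

def mcl(A: sp.spmatrix, inflation: float = 2.0, iters: int = 20,
        prune: float = 1e-4) -> sp.csc_matrix:
    """Toy Markov clustering: iterate expansion (SpGEMM), inflation,
    pruning, and renormalization. Clusters are read off as connected
    components of the converged matrix, which is why HipMCL also needs
    a distributed connected-components kernel."""
    M = colnorm(A + sp.eye(A.shape[0]))        # self-loops, column-stochastic
    for _ in range(iters):
        M = M @ M                              # expansion: SpGEMM
        M = M.power(inflation)                 # inflation
        M.data[M.data < prune] = 0.0           # prune tiny entries
        M.eliminate_zeros()
        M = colnorm(M)
    return M
```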

Conclusions and Future Work

✓ The GraphBLAS-based approach separates computational kernels from high-level algorithms and applications.
✓ It significantly boosts the productivity of applications.
✓ It has resulted in highly scalable algorithms for matching, ordering, connected components, maximal independent sets, and graph clustering.

Future directions and impacts:
❄ Short term: a GraphBLAS-compliant library with optimized in-node performance; explore new domains.
❄ Medium term: communication-avoiding algorithms for GraphBLAS primitives targeting future exascale systems; enable new high-level algorithms.
❄ Long term: easy-to-use graph libraries for biology, scientific computing, and machine learning.

Areas in which we can help

Areas needing high-performance graph computation:
✓ Machine learning
✓ Computational biology
✓ Scientific computing
✓ Quantum computing

Areas in which we need help

✓ Biology and other domains, to understand applications
✓ Programming languages and libraries (UPC, GASNet)
✓ Efficient file I/O (HDF5)

References


1. Azad, Buluç, and Pothen. Computing maximum cardinality matchings in parallel on bipartite graphs via tree-grafting. TPDS, 2017.
2. Azad, Jacquelin, Buluç, and Ng. The reverse Cuthill-McKee algorithm in distributed-memory. IPDPS, 2017.
3. Azad, Ballard, Buluç, Demmel, Grigori, Schwartz, Toledo, and Williams. Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. SIAM Journal on Scientific Computing, 2016.
4. Azad and Buluç. Distributed-memory algorithms for maximum cardinality matching in bipartite graphs. IPDPS, 2016.
5. Azad, Buluç, and Gilbert. Parallel triangle counting and enumeration using matrix algebra. IPDPSW, 2015.
