DOE ASCR Applied Mathematics Principal Investigators' (PI) Meeting, Rockville, MD, September 11-12, 2017

Productive and extreme-scale graph computations enabled by GraphBLAS
Ariful Azad and Aydın Buluç, Computational Research Division, Lawrence Berkeley National Laboratory

Abstract

We develop several classes of graph algorithms using linear-algebraic (GraphBLAS) primitives. Our in-house Combinatorial BLAS (CombBLAS) library enabled rapid development of bipartite graph matching, reverse Cuthill-McKee ordering, triangle counting, connected components, and Markov clustering algorithms that scale to thousands of cores on modern supercomputers. These algorithms in turn empower key science applications, including protein family detection and sparse linear solvers.
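To give a flavor of this linear-algebraic formulation, the following is a minimal single-node SciPy sketch of triangle counting as a masked sparse matrix-matrix product, in the spirit of reference 5. It is illustrative only, not the CombBLAS implementation.

```python
import numpy as np
import scipy.sparse as sp

def count_triangles(A: sp.spmatrix) -> int:
    """Count triangles of an undirected graph from its symmetric 0/1
    adjacency matrix A, via the masked-SpGEMM identity
    #triangles = sum(L .* (L @ U)), where L and U are the strict
    lower and upper triangles of A."""
    L = sp.tril(A, k=-1).tocsr()
    U = sp.triu(A, k=1).tocsr()
    # (L @ U)[i, j] counts wedges through a common neighbor; masking
    # with L keeps only wedges whose endpoints are themselves adjacent,
    # so every triangle is counted exactly once.
    return int((L @ U).multiply(L).sum())

# Toy usage: a 4-cycle plus one chord contains exactly two triangles.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
A = sp.lil_matrix((4, 4))
for i, j in edges:
    A[i, j] = A[j, i] = 1
print(count_triangles(A))  # -> 2
```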



Motivation

❄ Why graphs? Graph computation drives many applications in biology and scientific computing.
❄ Why GraphBLAS? The diversity and rapid evolution of applications, architectures, and algorithms motivates us to isolate a small number of graph kernels entrusted with delivering high performance.
❄ Expected impacts: (a) a better understanding of data and computational patterns; (b) rapid development of high-performance applications.

Approach

We develop graph and machine learning algorithms using linear-algebraic primitives, in two thrusts:
1. Develop communication-avoiding and work-efficient primitives (Aydın's poster).
2. Design algorithms using the optimized primitives (a sketch of this style follows Fig 1 below).

Fig 1. Algorithmic chain to be developed. GraphBLAS primitives, in increasing arithmetic intensity: sparse matrix-sparse vector product (SpMSpV), sparse matrix-dense vector product (SpMV), sparse matrix times multiple dense vectors (SpMM), sparse-sparse matrix product (SpGEMM), and sparse-dense matrix product (SpDM³). These primitives support higher-level combinatorial and machine learning algorithms: shortest paths (all-pairs, single-source, temporal); graph clustering (Markov cluster, peer pressure, spectral, local); centrality (PageRank, betweenness, closeness); classification (support vector machines, logistic regression); dimensionality reduction (NMF, PCA); and miscellaneous kernels such as connectivity, traversal (BFS), independent sets (MIS), and graph matching.
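As a sketch of thrust 2, here is a minimal single-node SciPy rendering of level-synchronous BFS built on repeated matrix-vector products. A dense NumPy vector stands in for the sparse frontier that a true SpMSpV (as in CombBLAS) would use, and the masking step plays the role of the GraphBLAS mask.

```python
import numpy as np
import scipy.sparse as sp

def bfs_levels(A: sp.csr_matrix, source: int) -> np.ndarray:
    """Level-synchronous BFS expressed as repeated sparse matrix-vector
    products (the SpMSpV pattern). A is a 0/1 adjacency matrix in CSR
    form, symmetric for an undirected graph. Returns BFS levels,
    with -1 for unreachable vertices."""
    n = A.shape[0]
    levels = np.full(n, -1)
    frontier = np.zeros(n)
    frontier[source] = 1.0
    level = 0
    while frontier.any():
        levels[frontier > 0] = level
        reached = A.T @ frontier          # expand the frontier: one matvec
        # GraphBLAS-style mask: keep only still-unvisited vertices.
        frontier = np.where(levels == -1, reached, 0.0)
        level += 1
    return levels

# Toy usage: the path graph 0-1-2 gives levels [0, 1, 2] from source 0.
A = sp.csr_matrix(np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float))
print(bfs_levels(A, 0))  # -> [0 1 2]
```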

Results

(1) Distributed-memory graph matching
✓ A suite of parallel algorithms developed, covering approximate-weight perfect matching, maximal cardinality matching, and maximum cardinality matching (MCM), using sparse matrix-sparse vector multiply and an inverted index.
✓ Scales to several thousands of cores.
✓ Impact: removes the sequential-ordering bottleneck from SuperLU and STRUMPACK.

Fig. (a) Classes of matching algorithms; (b) performance of MCM on NERSC/Edison (time in seconds vs. number of cores, 16 to 2048): 12x-18x speedups from an ~80x increase in cores on ljournal, cage15, road_usa, nlpkkt200, hugetrace, delaunay_n24, and HV15R.
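The toy SciPy sketch below shows the round-based propose/accept pattern that underlies vectorized parallel matching. It is a plain greedy maximal matching, not the tree-grafting MCM algorithm of reference 1; the loop bodies stand in for the SpMSpV and inverted-index steps of the distributed algorithms.

```python
import numpy as np
import scipy.sparse as sp

def maximal_matching(A: sp.csr_matrix):
    """Round-based greedy maximal matching on a bipartite graph.
    A[i, j] = 1 iff row vertex i is adjacent to column vertex j.
    Each round, unmatched rows propose an unmatched neighbor and
    columns resolve conflicts, mimicking the frontier-style rounds of
    the parallel algorithms. Returns mate arrays (-1 = unmatched)."""
    A = A.tocsr()
    nr, nc = A.shape
    mate_row = np.full(nr, -1)
    mate_col = np.full(nc, -1)
    while True:
        proposals = np.full(nr, -1)
        for i in np.flatnonzero(mate_row == -1):
            cols = A.indices[A.indptr[i]:A.indptr[i + 1]]
            cols = cols[mate_col[cols] == -1]    # unmatched neighbors only
            if cols.size:
                proposals[i] = cols.min()        # propose smallest index
        if (proposals == -1).all():
            break                                # no proposals left: maximal
        for j in np.unique(proposals[proposals >= 0]):
            i = np.flatnonzero(proposals == j).min()  # column accepts one proposer
            mate_row[i], mate_col[j] = j, i
    return mate_row, mate_col
```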

(2) Distributed-memory Markov clustering (HipMCL)
✓ Uses sparse matrix-matrix multiply, connected components, and k-select algorithms.
✓ Scales up to 136K cores on NERSC/Cori.
✓ Impact: reduced a 45-day clustering job to about an hour; clustered massive networks with up to 70B edges.

Fig. Running time of HipMCL versus the original MCL (van Dongen) as a function of the number of nodes (24 cores/node).

Table. Performance of HipMCL with respect to the previous shared-memory approach; an X marks networks the shared-memory MCL could not cluster.

|V|      |E|     HipMCL (cores)    MCL (shared memory)
69 M     12 B    1.66 hr (24K)     45 days
70 M     68 B    2.41 hr (136K)    X
282 M    37 B    3.23 hr (136K)    X
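For concreteness, here is a minimal single-node sketch of the MCL iteration that HipMCL distributes: expansion is an SpGEMM, inflation is an elementwise power followed by column normalization, and pruning keeps the iterate sparse (HipMCL's distributed k-select performs a more refined pruning). The parameter defaults are illustrative, not HipMCL's.

```python
import numpy as np
import scipy.sparse as sp

def colnorm(M: sp.spmatrix) -> sp.csc_matrix:
    """Rescale columns to sum to one (guarding empty columns)."""
    s = np.asarray(M.sum(axis=0)).ravel()
    s[s == 0.0] = 1.0
    return sp.csc_matrix(M.multiply(1.0 / s.reshape(1, -1)))

def mcl(A: sp.spmatrix, inflation: float = 2.0, iters: int = 20,
        prune: float = 1e-4) -> sp.csc_matrix:
    """Toy Markov clustering: iterate expansion (SpGEMM), inflation,
    pruning, and renormalization. Clusters are read off as connected
    components of the converged matrix, which is why HipMCL also needs
    a distributed connected-components kernel."""
    M = colnorm(A + sp.eye(A.shape[0]))        # self-loops, column-stochastic
    for _ in range(iters):
        M = M @ M                              # expansion: SpGEMM
        M = M.power(inflation)                 # inflation
        M.data[M.data < prune] = 0.0           # prune tiny entries
        M.eliminate_zeros()
        M = colnorm(M)
    return M
```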

Conclusions and Future Work

✓ The GraphBLAS-based approach separates computational kernels from high-level algorithms and applications.
✓ It significantly boosts the productivity of applications.
✓ It has resulted in highly scalable algorithms for matching, ordering, connected components, maximal independent sets, and graph clustering.

Future directions and impacts:
❄ Short term: a GraphBLAS-compliant library with optimized in-node performance; explore new domains.
❄ Medium term: communication-avoiding algorithms for GraphBLAS primitives targeting future exascale systems; enable new high-level algorithms.
❄ Long term: easy-to-use graph libraries for biology, scientific computing, and machine learning.

Areas in which we can help

Areas needing high-performance graph computation:
✓ Machine learning
✓ Computational biology
✓ Scientific computing
✓ Quantum computing

Areas in which we need help

✓ Biology and other domains, to understand applications
✓ Programming languages and libraries (UPC, GASNet)
✓ Efficient file I/O (HDF5)

References


1. Azad, Buluç, and Pothen. Computing maximum cardinality matchings in parallel on bipartite graphs via tree-grafting. TPDS, 2017.
2. Azad, Jacquelin, Buluç, and Ng. The reverse Cuthill-McKee algorithm in distributed-memory. IPDPS, 2017.
3. Azad, Ballard, Buluç, Demmel, Grigori, Schwartz, Toledo, and Williams. Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. SIAM Journal on Scientific Computing, 2016.
4. Azad and Buluç. Distributed-memory algorithms for maximum cardinality matching in bipartite graphs. IPDPS, 2016.
5. Azad, Buluç, and Gilbert. Parallel triangle counting and enumeration using matrix algebra. IPDPSW, 2015.
