Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Building Robust Scientific Codes
Matthew Knepley
Computation InstituteUniversity of Chicago
Scientific Computing in the Americas:The Challenge of Massive Parallelism
Valparaiso, Chile, January 2011
M. Knepley Robust PASI ’11 1 / 231
Tools and Infrastructure
Outline
1 Tools and InfrastructureVersion ControlConfiguration and BuildPETScnumpysympypetsc4pyPyCUDAFEniCS
2 GPU Computing
3 Linear Algebra and Solvers
4 First Lab: Assembling a Code
5 Second Lab: Debugging and Performance Benchmarking
M. Knepley Robust PASI ’11 2 / 231
Tools and Infrastructure
What I Need From You
Tell me if you do not understandTell me if an example does not workSuggest better wording or figuresFollowup problems at [email protected]
M. Knepley Robust PASI ’11 3 / 231
mailto:[email protected]
Tools and Infrastructure
Ask Questions!!!
Helps me understand what you are missing
Helps you clarify misunderstandings
Helps others with the same question
M. Knepley Robust PASI ’11 4 / 231
Tools and Infrastructure
How We Can Help at the Tutorial
Point out relevant documentationQuickly answer questionsHelp installGuide design of large scale codesAnswer email at [email protected]
M. Knepley Robust PASI ’11 5 / 231
mailto:[email protected]
Tools and Infrastructure
How We Can Help at the Tutorial
Point out relevant documentationQuickly answer questionsHelp installGuide design of large scale codesAnswer email at [email protected]
M. Knepley Robust PASI ’11 5 / 231
mailto:[email protected]
Tools and Infrastructure
How We Can Help at the Tutorial
Point out relevant documentationQuickly answer questionsHelp installGuide design of large scale codesAnswer email at [email protected]
M. Knepley Robust PASI ’11 5 / 231
mailto:[email protected]
Tools and Infrastructure
How We Can Help at the Tutorial
Point out relevant documentationQuickly answer questionsHelp installGuide design of large scale codesAnswer email at [email protected]
M. Knepley Robust PASI ’11 5 / 231
mailto:[email protected]
Tools and Infrastructure
New Model for Scientific Software
Simplifying Parallelization of Scientific Codesby a Function-Centric Approach in Python
Jon K. Nilsen, Xing Cai, Bjorn Hoyland, and Hans Petter Langtangen
Python at the application levelnumpy for data structurespetsc4py for linear algebra and solversPyCUDA for integration (physics) and assembly
M. Knepley Robust PASI ’11 6 / 231
http://arxiv.org/abs/1002.0705http://arxiv.org/abs/1002.0705
Tools and Infrastructure
New Model for Scientific Software
Application
FFC/SyFieqn. definitionsympy symbol
ics
numpyda
tast
ruct
ures
petsc4py
solve
rs
PyCUDA
integration/assembly
PETScCUDA
OpenCL
Figure: Schematic for a generic scientific applicationM. Knepley Robust PASI ’11 7 / 231
Tools and Infrastructure
What is Missing from this Scheme?
Unstructured graph traversalIteration over cells in FEM
Use a copy via numpy, use a kernel via Queue
(Transitive) Closure of a vertexUse a visitor and copy via numpy
Depth First SearchHell if I know
Logic in computationLimiters in FV methods
Can sometimes use tricks for branchless logic
Flux Corrected Transport for shock capturingMaybe use WENO schemes which can be branchless
Boundary conditionsRestrict branching to PETSc C numbering and assembly calls
Audience???M. Knepley Robust PASI ’11 8 / 231
Tools and Infrastructure
What is Missing from this Scheme?
Unstructured graph traversalIteration over cells in FEM
Use a copy via numpy, use a kernel via Queue
(Transitive) Closure of a vertexUse a visitor and copy via numpy
Depth First SearchHell if I know
Logic in computationLimiters in FV methods
Can sometimes use tricks for branchless logic
Flux Corrected Transport for shock capturingMaybe use WENO schemes which can be branchless
Boundary conditionsRestrict branching to PETSc C numbering and assembly calls
Audience???M. Knepley Robust PASI ’11 8 / 231
Tools and Infrastructure
What is Missing from this Scheme?
Unstructured graph traversalIteration over cells in FEM
Use a copy via numpy, use a kernel via Queue
(Transitive) Closure of a vertexUse a visitor and copy via numpy
Depth First SearchHell if I know
Logic in computationLimiters in FV methods
Can sometimes use tricks for branchless logic
Flux Corrected Transport for shock capturingMaybe use WENO schemes which can be branchless
Boundary conditionsRestrict branching to PETSc C numbering and assembly calls
Audience???M. Knepley Robust PASI ’11 8 / 231
Tools and Infrastructure
What is Missing from this Scheme?
Unstructured graph traversalIteration over cells in FEM
Use a copy via numpy, use a kernel via Queue
(Transitive) Closure of a vertexUse a visitor and copy via numpy
Depth First SearchHell if I know
Logic in computationLimiters in FV methods
Can sometimes use tricks for branchless logic
Flux Corrected Transport for shock capturingMaybe use WENO schemes which can be branchless
Boundary conditionsRestrict branching to PETSc C numbering and assembly calls
Audience???M. Knepley Robust PASI ’11 8 / 231
Tools and Infrastructure
What is Missing from this Scheme?
Unstructured graph traversalIteration over cells in FEM
Use a copy via numpy, use a kernel via Queue
(Transitive) Closure of a vertexUse a visitor and copy via numpy
Depth First SearchHell if I know
Logic in computationLimiters in FV methods
Can sometimes use tricks for branchless logic
Flux Corrected Transport for shock capturingMaybe use WENO schemes which can be branchless
Boundary conditionsRestrict branching to PETSc C numbering and assembly calls
Audience???M. Knepley Robust PASI ’11 8 / 231
Tools and Infrastructure
What is Missing from this Scheme?
Unstructured graph traversalIteration over cells in FEM
Use a copy via numpy, use a kernel via Queue
(Transitive) Closure of a vertexUse a visitor and copy via numpy
Depth First SearchHell if I know
Logic in computationLimiters in FV methods
Can sometimes use tricks for branchless logic
Flux Corrected Transport for shock capturingMaybe use WENO schemes which can be branchless
Boundary conditionsRestrict branching to PETSc C numbering and assembly calls
Audience???M. Knepley Robust PASI ’11 8 / 231
Tools and Infrastructure
What is Missing from this Scheme?
Unstructured graph traversalIteration over cells in FEM
Use a copy via numpy, use a kernel via Queue
(Transitive) Closure of a vertexUse a visitor and copy via numpy
Depth First SearchHell if I know
Logic in computationLimiters in FV methods
Can sometimes use tricks for branchless logic
Flux Corrected Transport for shock capturingMaybe use WENO schemes which can be branchless
Boundary conditionsRestrict branching to PETSc C numbering and assembly calls
Audience???M. Knepley Robust PASI ’11 8 / 231
Tools and Infrastructure
What is Missing from this Scheme?
Unstructured graph traversalIteration over cells in FEM
Use a copy via numpy, use a kernel via Queue
(Transitive) Closure of a vertexUse a visitor and copy via numpy
Depth First SearchHell if I know
Logic in computationLimiters in FV methods
Can sometimes use tricks for branchless logic
Flux Corrected Transport for shock capturingMaybe use WENO schemes which can be branchless
Boundary conditionsRestrict branching to PETSc C numbering and assembly calls
Audience???M. Knepley Robust PASI ’11 8 / 231
Tools and Infrastructure Version Control
Outline
1 Tools and InfrastructureVersion ControlConfiguration and BuildPETScnumpysympypetsc4pyPyCUDAFEniCS
M. Knepley Robust PASI ’11 9 / 231
Tools and Infrastructure Version Control
Location and Retrieval“Where’s the Tarball”
Version ControlMercurial, Git, Subversion
HostingBitBucket, GitHub, Launchpad
Community involvementarXiv, PubMed
M. Knepley Robust PASI ’11 10 / 231
http://mercurial.selenic.comhttp://git-scm.comhttp://subversion.tigris.orghttp://bitbucket.orghttp://github.comhttps://launchpad.nethttp://arXiv.orghttp://www.ncbi.nlm.nih.gov/pubmed
Tools and Infrastructure Version Control
Distributed Version Control
CVS/SVN manage a single repositoryVersioned dataLocal copy for modification and checkin
Mercurial manages many repositoriesIdentified by URLsNo one Master
Repositories communicate by ChangeSetsUse push and pull to move changesetsCan move arbitrary changes with patch queues
M. Knepley Robust PASI ’11 11 / 231
Tools and Infrastructure Version Control
Project Workflow
User
Figure: Single Repository
M. Knepley Robust PASI ’11 12 / 231
Tools and Infrastructure Version Control
Project Workflow
Master
User A User B
Figure: Master Repository with User Clones
M. Knepley Robust PASI ’11 13 / 231
Tools and Infrastructure Version Control
Project Workflow
Master Release
User Bugfix
Figure: Project with Release and Bugfix Repositories
M. Knepley Robust PASI ’11 14 / 231
Tools and Infrastructure Configuration and Build
Outline
1 Tools and InfrastructureVersion ControlConfiguration and BuildPETScnumpysympypetsc4pyPyCUDAFEniCS
M. Knepley Robust PASI ’11 15 / 231
Tools and Infrastructure Configuration and Build
Configuration and Build“It won’t run on my iPhone”
PortabilityPETSc BuildSystem, autoconf
DependenciesDoes this work with UnsupportedGradStudentAMG?
Configurable buildBuild must integrate with the configuration systemCMake, SCons
M. Knepley Robust PASI ’11 16 / 231
http://petsc.cs.iit.edu/petsc/BuildSystemhttp://www.gnu.org/software/autoconfhttp://www.cmake.orghttp://www.scons.org
Tools and Infrastructure Configuration and Build
BuildSystem
Provides tools for Configuration and Build
Dependency tracking and analysisPackage management and hierarchyLibrary of standard testsStandard build rulesAutomatic package build and integration
http://petsc.cs.iit.edu/petsc/BuildSystemhttp://petsc.cs.iit.edu/petsc/SimpleConfigure
M. Knepley Robust PASI ’11 17 / 231
http://petsc.cs.iit.edu/petsc/BuildSystemhttp://petsc.cs.iit.edu/petsc/BuildSystem
Tools and Infrastructure Configuration and Build
ConfigureModules
BuildSystem.config.base configures a specific functionality
Entry points:setupHelp()setupDependencies()configure()
Builtin capabilities:Preprocessing, compilation, linking, runningManages languagesChecks for executables
Output types:Define, typedef, or prototypeMake macro or ruleSubstitution (old-style)
M. Knepley Robust PASI ’11 18 / 231
Tools and Infrastructure Configuration and Build
ConfigureFramework
BuildSystem.config.framework manages the configure run
Manages configure modulesDependencies with DAG, require()Options tableInitialization, run, cleanup
OutputsConfigure headers and logMake variable and rulesPickled configure tree
M. Knepley Robust PASI ’11 19 / 231
Tools and Infrastructure Configuration and Build
ConfigureThird Party Packages
BuildSystem.config.package manages other packages
BuildSystem/config/packages/* examples (MPI, FIAT, etc.)Standard location and install hooksStandard header and library testsUniform interface for parameter retrievalSpecial support for GNU packages
M. Knepley Robust PASI ’11 20 / 231
Tools and Infrastructure Configuration and Build
ConfigureBuild Integration
A module can declare a dependency using:
fw = s e l f . frameworks e l f . mpi = fw . requ i re ( ’ con f i g . packages . MPI ’ , s e l f )
so that MPI is configured before self. Information is retrieved duringconfigure():
i f s e l f . mpi . found :inc lude . extend ( s e l f . mpi . i nc lude )l i b s . extend ( s e l f . mpi . l i b )
M. Knepley Robust PASI ’11 21 / 231
Tools and Infrastructure Configuration and Build
ConfigureBuild Integration
A module can declare a dependency using:
fw = s e l f . frameworks e l f . mpi = fw . requ i re ( ’ con f i g . packages . MPI ’ , s e l f )
so that MPI is configured before self. Information is retrieved duringconfigure():
i f s e l f . mpi . found :inc lude . extend ( s e l f . mpi . i nc lude )l i b s . extend ( s e l f . mpi . l i b )
M. Knepley Robust PASI ’11 21 / 231
Tools and Infrastructure Configuration and Build
ConfigureBuild Integration
A build system can acquire the information using:
c lass ConfigReader ( s c r i p t . S c r i p t ) :def _ _ i n i t _ _ ( s e l f ) :
impor t RDictargDB = RDict . RDict (None , None , 0 , 0)argDB . saveFilename = os . path . j o i n ( ’ path ’ , ’ RDict . db ’ )argDB . load ( )s c r i p t . S c r i p t . _ _ i n i t _ _ ( s e l f , argDB = argDB )r e t u r n
def getMPIModule ( s e l f ) :s e l f . setup ( )fw = s e l f . loadConf igure ( )mpi = fw . requ i re ( ’ con f i g . packages . MPI ’ , None )r e t u r n mpi
M. Knepley Robust PASI ’11 22 / 231
Tools and Infrastructure Configuration and Build
Make
GNU Make automates a package build
Has a single predicate, older-than
Executes shell code for actions
PETSc has support forconfiguration integrationautomatic compilation
AlternativesSConsCMake
M. Knepley Robust PASI ’11 23 / 231
http://www.gnu.org/makehttp://www.scons.orghttp://www.cmake.org
Tools and Infrastructure Configuration and Build
builder
Simple replacement for GNU make
Excellent configure integration
User-defined predicates
Dependency analysis and tracking
Python actions
Support for test execution
M. Knepley Robust PASI ’11 24 / 231
Tools and Infrastructure Configuration and Build
builderTwo Interfaces
The simple interface handles the entire build:./config/builder.py
A more flexible front end allows finer control:./config/builder2.py help [command]./config/builder2.py clean./config/builder2.py stubs fortran./config/builder2.py build [src/snes/interface/snesj.c]./config/builder2.py check [src/snes/examples/tutorials/ex10.c]
M. Knepley Robust PASI ’11 25 / 231
Tools and Infrastructure Configuration and Build
Testing“They are identical in the eyeball norm”
Unit testscppUnit
Regression testsbuildbot
BenchmarksCigma
M. Knepley Robust PASI ’11 26 / 231
http://sourceforge.net/projects/cppunithttp://buildbot.nethttp://www.geodynamics.org/cig/software/packages/cs/cigma
Tools and Infrastructure PETSc
Outline
1 Tools and InfrastructureVersion ControlConfiguration and BuildPETScnumpysympypetsc4pyPyCUDAFEniCS
M. Knepley Robust PASI ’11 27 / 231
Tools and Infrastructure PETSc
How did PETSc Originate?
PETSc was developed as a Platform forExperimentation
We want to experiment with differentModelsDiscretizationsSolversAlgorithms
which blur these boundaries
M. Knepley Robust PASI ’11 28 / 231
http://amzn.com/0521602866
Tools and Infrastructure PETSc
The Role of PETSc
Developing parallel, nontrivial PDE solvers thatdeliver high performance is still difficult and re-quires months (or even years) of concentratedeffort.
PETSc is a toolkit that can ease these difficul-ties and reduce the development time, but it isnot a black-box PDE solver, nor a silver bullet.— Barry Smith
M. Knepley Robust PASI ’11 29 / 231
http://www.mcs.anl.gov/~bsmith
Tools and Infrastructure PETSc
Advice from Bill Gropp
You want to think about how you decompose your datastructures, how you think about them globally. [...] If youwere building a house, you’d start with a set of blueprintsthat give you a picture of what the whole house lookslike. You wouldn’t start with a bunch of tiles and say.“Well I’ll put this tile down on the ground, and then I’llfind a tile to go next to it.” But all too many people try tobuild their parallel programs by creating the smallestpossible tiles and then trying to have the structure oftheir code emerge from the chaos of all these littlepieces. You have to have an organizing principle ifyou’re going to survive making your code parallel.
(http://www.rce-cast.com/Podcast/rce-28-mpich2.html)
M. Knepley Robust PASI ’11 30 / 231
http://www.rce-cast.com/Podcast/rce-28-mpich2.html
Tools and Infrastructure PETSc
What is PETSc?
A freely available and supported researchcode for the parallel solution of nonlinearalgebraic equations
FreeDownload from http://www.mcs.anl.gov/petscFree for everyone, including industrial users
SupportedHyperlinked manual, examples, and manual pages for all routinesHundreds of tutorial-style examplesSupport via email: [email protected]
Usable from C, C++, Fortran 77/90, Matlab, Julia, and Python
M. Knepley Robust PASI ’11 31 / 231
http://www.mcs.anl.gov/petscmailto:[email protected]
Tools and Infrastructure PETSc
What is PETSc?
Portable to any parallel system supporting MPI, including:Tightly coupled systems
Cray XT6, BG/Q, NVIDIA Fermi, K ComputerLoosely coupled systems, such as networks of workstations
IBM, Mac, iPad/iPhone, PCs running Linux or Windows
PETSc HistoryBegun September 1991Over 60,000 downloads since 1995 (version 2)Currently 400 per month
PETSc Funding and SupportDepartment of Energy
SciDAC, MICS Program, AMR Program, INL Reactor ProgramNational Science Foundation
CIG, CISE, Multidisciplinary Challenge Program
M. Knepley Robust PASI ’11 32 / 231
Tools and Infrastructure PETSc
Timeline
1991 1995 2000 2005 2010
PETSc-1
MPI-1MPI-2
PETSc-2 PETSc-3Barry
BillLois
SatishDinesh
HongKrisMatt
VictorDmitry
LisandroJedShri
Peter
M. Knepley Robust PASI ’11 33 / 231
Tools and Infrastructure PETSc
What Can We Handle?
PETSc has run implicit problems with over 500 billion unknownsUNIC on BG/P and XT5PFLOTRAN for flow in porous media
PETSc has run on over 290,000 cores efficientlyUNIC on the IBM BG/P Jugene at JülichPFLOTRAN on the Cray XT5 Jaguar at ORNL
PETSc applications have run at 23% of peak (600 Teraflops)Jed Brown on NERSC EdisonHPGMG code
M. Knepley Robust PASI ’11 34 / 231
https://hpgmg.org/
Tools and Infrastructure PETSc
What Can We Handle?
PETSc has run implicit problems with over 500 billion unknownsUNIC on BG/P and XT5PFLOTRAN for flow in porous media
PETSc has run on over 290,000 cores efficientlyUNIC on the IBM BG/P Jugene at JülichPFLOTRAN on the Cray XT5 Jaguar at ORNL
PETSc applications have run at 23% of peak (600 Teraflops)Jed Brown on NERSC EdisonHPGMG code
M. Knepley Robust PASI ’11 34 / 231
https://hpgmg.org/
Tools and Infrastructure PETSc
What Can We Handle?
PETSc has run implicit problems with over 500 billion unknownsUNIC on BG/P and XT5PFLOTRAN for flow in porous media
PETSc has run on over 290,000 cores efficientlyUNIC on the IBM BG/P Jugene at JülichPFLOTRAN on the Cray XT5 Jaguar at ORNL
PETSc applications have run at 23% of peak (600 Teraflops)Jed Brown on NERSC EdisonHPGMG code
M. Knepley Robust PASI ’11 34 / 231
https://hpgmg.org/
Tools and Infrastructure numpy
Outline
1 Tools and InfrastructureVersion ControlConfiguration and BuildPETScnumpysympypetsc4pyPyCUDAFEniCS
M. Knepley Robust PASI ’11 35 / 231
Tools and Infrastructure numpy
numpy
numpy is ideal for building Python data structures
Supports multidimensional arraysEasily interfaces with C/C++ and FortranHigh performance BLAS/LAPACK and functional operationsPython 2 and 3 compatibleUsed by petsc4py to talk to PETSc
M. Knepley Robust PASI ’11 36 / 231
http://numpy.scipy.org
Tools and Infrastructure sympy
Outline
1 Tools and InfrastructureVersion ControlConfiguration and BuildPETScnumpysympypetsc4pyPyCUDAFEniCS
M. Knepley Robust PASI ’11 37 / 231
Tools and Infrastructure sympy
sympy
sympy is useful for symbolic manipulation
Interacts with numpyDerivatives and integralsSeries expansionsEquation simplificationSmall and open source
M. Knepley Robust PASI ’11 38 / 231
http://sympy.scipy.org
Tools and Infrastructure sympy
sympyExample of Series Transform
Create the shifted polynomial
order∑i=0
cii!
(x − a)i
def cons t ruc tSh i f t edPo lynomia l ( order ) :from sympy impor t Symbol , c o l l e c t , d i f f , l i m i tfrom sympy impor t f a c t o r i a l as fc = [ Symbol ( ’ c ’+ s t r ( i ) ) f o r i i n range ( order ) ]g = sum ( [ c [ i ] * ( x−a ) * * i / f ( i ) f o r i i n range ( order ) ] )# Convert to a monomialg = c o l l e c t ( g . expand ( ) , x )r e t u r n c , g
M. Knepley Robust PASI ’11 39 / 231
Tools and Infrastructure sympy
sympyExample of Series Transform
Here is the shifted polynomial for order 5:c0 - a*c1 + c2*a**2/2 - c3*a**3/6 + c4*a**4/24+ x*(c1 - a*c2 + c3*a**2/2 - c4*a**3/6)+ x**2*(c2/2 - a*c3/2 + c4*a**2/4)+ x**3*(c3/6 - a*c4/6)+ c4*x**4/24
M. Knepley Robust PASI ’11 40 / 231
Tools and Infrastructure sympy
sympyExample of Series Transform
Construct matrix transform from
order∑i=0
cii!
(x − a)i toorder∑i=0
cii!
x i
def cons t ruc tTrans fo rmMat r i x ( order = 5 ) :from sympy impor t d i f f , l i m i tc , g = cons t ruc tSh i f t edPo lynomia l ( order , debug )M = [ ]f o r o i n range ( order ) :
exp = g . d i f f ( x , o ) . l i m i t ( x , 0)M. append ( [ exp . d i f f ( c [ p ] ) f o r p i n range ( order ) ] )
r e t u r n M
M. Knepley Robust PASI ’11 41 / 231
Tools and Infrastructure sympy
sympyExample of Series Transform
Here is the transform matrix M:1 −a a22 −
a36
a424
0 1 −a a22 −a36
0 0 1 −a a220 0 0 1 −a0 0 0 0 1
M. Knepley Robust PASI ’11 42 / 231
Tools and Infrastructure petsc4py
Outline
1 Tools and InfrastructureVersion ControlConfiguration and BuildPETScnumpysympypetsc4pyPyCUDAFEniCS
M. Knepley Robust PASI ’11 43 / 231
Tools and Infrastructure petsc4py
petsc4py
petcs4py provides Python bindings for PETSc
Provides ALL PETSc functionality in a Pythonic wayLogging using the Python with statement
Can use Python callback functionsSNESSetFunction(), SNESSetJacobian()
Manages all memory (creation/destruction)
Visualization with matplotlib
M. Knepley Robust PASI ’11 44 / 231
http://code.google.com/p/petsc4py/http://matplotlib.sourceforge.net
Tools and Infrastructure petsc4py
petsc4py Installation
Automaticpip install -install-options=-user petscp4yUses $PETSC_DIR and $PETSC_ARCHInstalled into $HOME/.localNo additions to PYTHONPATH
From Sourcevirtualenv python-envsource ./python-env/bin/activateNow everything installs into your proxy Python environmenthg clone https://petsc4py.googlecode.com/hgpetsc4py-devARCHFLAGS="-arch x86_64" python setup.py sdistARCHFLAGS="-arch x86_64" pip installdist/petsc4py-1.1.2.tar.gzARCHFLAGS only necessary on Mac OSX
M. Knepley Robust PASI ’11 45 / 231
Tools and Infrastructure petsc4py
petsc4py Examples
externalpackages/petsc4py-1.1/demo/bratu2d/bratu2d.py
Solves Bratu equation (SNES ex5) in 2D
Visualizes solution with matplotlib
src/ts/examples/tutorials/ex8.py
Solves a 1D ODE for a diffusive process
Visualize solution using -vec_view_draw
Control timesteps with -ts_max_steps
M. Knepley Robust PASI ’11 46 / 231
http://www.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/src/snes/examples/tutorials/ex5.c.html
Tools and Infrastructure PyCUDA
Outline
1 Tools and InfrastructureVersion ControlConfiguration and BuildPETScnumpysympypetsc4pyPyCUDAFEniCS
M. Knepley Robust PASI ’11 47 / 231
Tools and Infrastructure PyCUDA
PyCUDA and PyOpenCL
Python packages by Andreas Klöcknerfor embedded GPU programming
Handles unimportant details automaticallyCUDA compile and caching of objectsDevice initializationLoading modules onto card
Excellent Documentation & Tutorial
Excellent platform for MetaprogrammingOnly way to get portable performanceRoad to FLAME-type reasoning about algorithms
M. Knepley Robust PASI ’11 48 / 231
http://mathema.tician.de/aboutmehttp://documen.tician.de/pycudahttp://arxiv.org/abs/0911.3456
Tools and Infrastructure PyCUDA
Code Template
$ { pb . globalMod ( isGPU ) } vo id kerne l ( $ { pb . g r i dS ize ( isGPU ) } f l o a t * output ) {
$ { pb . g r idLoopSta r t ( isGPU , load , s to re ) }$ { pb . threadLoopStar t ( isGPU , blockDimX ) }f l o a t G[ $ { dim * dim } ] = { $ { ’ , ’ . j o i n ( [ ’ 3.0 ’ ] * ( dim * dim ) ) } } ;f l o a t K [ $ { dim * dim } ] = { $ { ’ , ’ . j o i n ( [ ’ 3.0 ’ ] * ( dim * dim ) ) } } ;f l o a t product = 0 . 0 ;const i n t Oof fse t = g r i d I d x *$ { numThreads } ;
/ / Cont ract G and K% f o r n i n range ( numLocalElements ) :% f o r alpha i n range ( dim ) :% f o r beta i n range ( dim ) :
product += G[ $ { gIdx } ] * K [ $ { k Idx } ] ;% endfor% endfor% endfor
output [ Oof fse t+ idx ] = product ;$ { pb . threadLoopEnd ( isGPU ) }$ { pb . gridLoopEnd ( isGPU ) }r e t u r n ;
} M. Knepley Robust PASI ’11 49 / 231
Tools and Infrastructure PyCUDA
Rendering a Template
We render code template into strings using a dictionary of inputs.
args = { ’ dim ’ : s e l f . dim ,’ numLocalElements ’ : 1 ,’ numThreads ’ : s e l f . threadBlockSize }
kernelTemplate = s e l f . getKernelTemplate ( )gpuCode = kernelTemplate . render ( isGPU = True , * * args )cpuCode = kernelTemplate . render ( isGPU = False , * * args )
M. Knepley Robust PASI ’11 50 / 231
Tools and Infrastructure PyCUDA
GPU Source Code
__global__ vo id kerne l ( f l o a t * output ) {const i n t g r i d I d x = b lock Idx . x + b lock Idx . y * gr idDim . x ;const i n t i dx = th read Idx . x + th read Idx . y * 1 ; / / This i s ( i , j )f l o a t G[ 9 ] = { 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 } ;f l o a t K [ 9 ] = { 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 } ;f l o a t product = 0 . 0 ;const i n t Oof fse t = g r i d I d x * 1 ;
/ / Cont ract G and Kproduct += G[ 0 ] * K [ 0 ] ;product += G[ 1 ] * K [ 1 ] ;product += G[ 2 ] * K [ 2 ] ;product += G[ 3 ] * K [ 3 ] ;product += G[ 4 ] * K [ 4 ] ;product += G[ 5 ] * K [ 5 ] ;product += G[ 6 ] * K [ 6 ] ;product += G[ 7 ] * K [ 7 ] ;product += G[ 8 ] * K [ 8 ] ;ou tput [ Oof fse t+ idx ] = product ;r e t u r n ;
}
M. Knepley Robust PASI ’11 51 / 231
Tools and Infrastructure PyCUDA
CPU Source Code
vo id kerne l ( i n t numInvocations , f l o a t * output ) {f o r ( i n t g r i d I d x = 0; g r i d I d x < numInvocations ; ++ g r i d I d x ) {
f o r ( i n t i = 0 ; i < 1 ; ++ i ) {f o r ( i n t j = 0 ; j < 1 ; ++ j ) {
const i n t i dx = i + j * 1 ; / / This i s ( i , j )f l o a t G[ 9 ] = { 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 } ;f l o a t K [ 9 ] = { 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 } ;f l o a t product = 0 . 0 ;const i n t Oof fse t = g r i d I d x * 1 ;
/ / Cont ract G and Kproduct += G[ 0 ] * K [ 0 ] ;product += G[ 1 ] * K [ 1 ] ;product += G[ 2 ] * K [ 2 ] ;product += G[ 3 ] * K [ 3 ] ;product += G[ 4 ] * K [ 4 ] ;product += G[ 5 ] * K [ 5 ] ;product += G[ 6 ] * K [ 6 ] ;product += G[ 7 ] * K [ 7 ] ;product += G[ 8 ] * K [ 8 ] ;ou tput [ Oof fse t+ idx ] = product ;
}}
}r e t u r n ;
}
M. Knepley Robust PASI ’11 52 / 231
Tools and Infrastructure PyCUDA
Creating a Module
CPU:
# Output kerne l and C support codes e l f . outputKernelC ( cpuCode )s e l f . w r i t e M a k e f i l e ( )out , er r , s ta tus = s e l f . executeShellCommand ( ’make ’ )\ end { minted }
\ b i gsk ip
GPU:\ begin { minted } { python }from pycuda . compi ler impor t SourceModule
mod = SourceModule ( gpuCode )s e l f . ke rne l = mod. ge t_ func t i on ( ’ ke rne l ’ )s e l f . kerne lRepor t ( s e l f . kernel , ’ ke rne l ’ )
M. Knepley Robust PASI ’11 53 / 231
Tools and Infrastructure PyCUDA
Executing a Module
impor t pycuda . d r i v e r as cudaimpor t pycuda . a u t o i n i t
blockDim = ( s e l f . dim , s e l f . dim , 1)s t a r t = cuda . Event ( )end = cuda . Event ( )g r i d = s e l f . c a l c u l a t e G r i d (N, numLocalElements )s t a r t . record ( )f o r i i n range ( i t e r s ) :
s e l f . ke rne l ( cuda . Out ( output ) ,b lock = blockDim , g r i d = g r i d )
end . record ( )end . synchronize ( )gpuTimes . append ( s t a r t . t i m e _ t i l l ( end ) *1 e−3/ i t e r s )
M. Knepley Robust PASI ’11 54 / 231
Tools and Infrastructure FEniCS
Outline
1 Tools and InfrastructureVersion ControlConfiguration and BuildPETScnumpysympypetsc4pyPyCUDAFEniCS
M. Knepley Robust PASI ’11 55 / 231
Tools and Infrastructure FEniCS
FIAT
Finite Element Integrator And Tabulator by Rob Kirby
http://www.fenics.org/fiat
FIAT understandsReference element shapes (line, triangle, tetrahedron)Quadrature rulesPolynomial spacesFunctionals over polynomials (dual spaces)Derivatives
User can build arbitrary elements specifying the Ciarlet triple (K ,P,P ′)
FIAT is part of the FEniCS project, as is the PETSc Sieve module
M. Knepley Robust PASI ’11 56 / 231
http://www.fenics.org/fiat
Tools and Infrastructure FEniCS
FFC
FFC is a compiler for variational forms by Anders Logg.
Here is a mixed-form Poisson equation:
a((τ,w), (σ, u)) = L((τ,w)) ∀(τ,w) ∈ V
where
a((τ,w), (σ, u)) =∫
Ωτσ −∇ · τu + w∇ · u dx
L((τ,w)) =∫
Ωwf dx
M. Knepley Robust PASI ’11 57 / 231
Tools and Infrastructure FEniCS
FFCMixed Poisson
shape = " t r i a n g l e "
BDM1 = Fin i teE lement ( " Brezzi−Douglas−Mar in i " , shape , 1 )DG0 = Fin i teE lement ( " Discont inuous Lagrange " , shape , 0 )
element = BDM1 + DG0( tau , w) = TestFunct ions ( element )( sigma , u ) = T r i a l F u n c t i o n s ( element )
a = ( dot ( tau , sigma ) − d iv ( tau ) * u + w* d iv ( sigma ) ) * dx
f = Funct ion (DG0)L = w* f * dx
M. Knepley Robust PASI ’11 58 / 231
Tools and Infrastructure FEniCS
FFC
Here is a discontinuous Galerkin formulation of the Poisson equation:
a(v ,u) = L(v) ∀v ∈ V
where
a(v ,u) =∫
Ω∇u · ∇v dx
+∑
S
∫S− < ∇v > ·[[u]]n − [[v ]]n· < ∇u > −(α/h)vu dS
+
∫∂Ω−∇v · [[u]]n − [[v ]]n · ∇u − (γ/h)vu ds
L(v) =∫
Ωvf dx
M. Knepley Robust PASI ’11 59 / 231
Tools and Infrastructure FEniCS
FFCDG Poisson
DG1 = Fin i teE lement ( " Discont inuous Lagrange " , shape , 1 )v = TestFunct ions (DG1)u = T r i a l F u n c t i o n s (DG1)f = Funct ion (DG1)g = Funct ion (DG1)n = FacetNormal ( " t r i a n g l e " )h = MeshSize ( " t r i a n g l e " )a = dot ( grad ( v ) , grad ( u ) ) * dx− dot ( avg ( grad ( v ) ) , jump ( u , n ) ) * dS− dot ( jump ( v , n ) , avg ( grad ( u ) ) ) * dS+ alpha / h* dot ( jump ( v , n ) + jump ( u , n ) ) * dS− dot ( grad ( v ) , jump ( u , n ) ) * ds− dot ( jump ( v , n ) , grad ( u ) ) * ds+ gamma/ h* v *u* ds
L = v * f * dx + v *g* ds
M. Knepley Robust PASI ’11 60 / 231
Tools and Infrastructure FEniCS
Big Picture
Usability is paramountNeed community by-inNeed complete workflow
Leverage existing systemsAdoption is much easier with the familiararXiv, package managers
M. Knepley Robust PASI ’11 61 / 231
http://arXiv.org
GPU Computing
Outline
1 Tools and Infrastructure
2 GPU ComputingFEM-GPUPETSc-GPU
3 Linear Algebra and Solvers
4 First Lab: Assembling a Code
5 Second Lab: Debugging and Performance Benchmarking
M. Knepley Robust PASI ’11 62 / 231
GPU Computing
Collaborators
Dr. Andy Terrel (FEniCS)Dept. of Computer Science, University of TexasTexas Advanced Computing Center, University of Texas
Prof. Andreas Klöckner (PyCUDA)Courant Institute of Mathematical Sciences, New York University
Dr. Brad Aagaard (PyLith)United States Geological Survey, Menlo Park, CA
Dr. Charles Williams (PyLith)GNS Science, Wellington, NZ
M. Knepley Robust PASI ’11 63 / 231
http://andy.terrel.us/Professional/http://mathema.tician.de/aboutme/http://profile.usgs.gov/baagaardhttp://w3.geodynamics.org/cig/Members/willic3
GPU Computing FEM-GPU
Outline
2 GPU ComputingFEM-GPUPETSc-GPU
M. Knepley Robust PASI ’11 64 / 231
GPU Computing FEM-GPU
What are the Benefits for current PDE Code?
Low Order FEM on GPUs
Analytic Flexibility
Computational Flexibility
Efficiency
http://www.bitbucket.org/aterrel/flamefem
M. Knepley Robust PASI ’11 65 / 231
http://www.bitbucket.org/aterrel/flamefem
GPU Computing FEM-GPU
What are the Benefits for current PDE Code?
Low Order FEM on GPUs
Analytic Flexibility
Computational Flexibility
Efficiency
http://www.bitbucket.org/aterrel/flamefem
M. Knepley Robust PASI ’11 65 / 231
http://www.bitbucket.org/aterrel/flamefem
GPU Computing FEM-GPU
What are the Benefits for current PDE Code?
Low Order FEM on GPUs
Analytic Flexibility
Computational Flexibility
Efficiency
http://www.bitbucket.org/aterrel/flamefem
M. Knepley Robust PASI ’11 65 / 231
http://www.bitbucket.org/aterrel/flamefem
GPU Computing FEM-GPU
What are the Benefits for current PDE Code?
Low Order FEM on GPUs
Analytic Flexibility
Computational Flexibility
Efficiency
http://www.bitbucket.org/aterrel/flamefem
M. Knepley Robust PASI ’11 65 / 231
http://www.bitbucket.org/aterrel/flamefem
GPU Computing FEM-GPU
Analytic FlexibilityLaplacian
∫T∇φi(x) · ∇φj(x)dx (1)
element = F in i teE lement ( ’ Lagrange ’ , te t rahedron , 1)v = TestFunct ion ( element )u = T r i a l F u n c t i o n ( element )a = inner ( grad ( v ) , grad ( u ) ) * dx
M. Knepley Robust PASI ’11 66 / 231
GPU Computing FEM-GPU
Analytic FlexibilityLaplacian
∫T∇φi(x) · ∇φj(x)dx (1)
element = F in i teE lement ( ’ Lagrange ’ , te t rahedron , 1)v = TestFunct ion ( element )u = T r i a l F u n c t i o n ( element )a = inner ( grad ( v ) , grad ( u ) ) * dx
M. Knepley Robust PASI ’11 66 / 231
GPU Computing FEM-GPU
Analytic FlexibilityLinear Elasticity
14
∫T
(∇~φi(x) +∇T ~φi(x)
):(∇~φj(x) +∇~φj(x)
)dx (2)
element = VectorElement ( ’ Lagrange ’ , te t rahedron , 1)v = TestFunct ion ( element )u = T r i a l F u n c t i o n ( element )a = inner (sym( grad ( v ) ) , sym( grad ( u ) ) ) * dx
M. Knepley Robust PASI ’11 67 / 231
GPU Computing FEM-GPU
Analytic FlexibilityLinear Elasticity
14
∫T
(∇~φi(x) +∇T ~φi(x)
):(∇~φj(x) +∇~φj(x)
)dx (2)
element = VectorElement ( ’ Lagrange ’ , te t rahedron , 1)v = TestFunct ion ( element )u = T r i a l F u n c t i o n ( element )a = inner (sym( grad ( v ) ) , sym( grad ( u ) ) ) * dx
M. Knepley Robust PASI ’11 67 / 231
GPU Computing FEM-GPU
Analytic FlexibilityFull Elasticity
14
∫T
(∇~φi(x) +∇T ~φi(x)
): C :
(∇~φj(x) +∇~φj(x)
)dx (3)
element = VectorElement ( ’ Lagrange ’ , te t rahedron , 1)cElement = TensorElement ( ’ Lagrange ’ , te t rahedron , 1 ,
( dim , dim , dim , dim ) )v = TestFunct ion ( element )u = T r i a l F u n c t i o n ( element )C = C o e f f i c i e n t ( cElement )i , j , k , l = i nd i ces ( 4 )a = sym( grad ( v ) ) [ i , j ] *C[ i , j , k , l ] * sym( grad ( u ) ) [ k , l ] * dx
Currently broken in FEniCS release
M. Knepley Robust PASI ’11 68 / 231
GPU Computing FEM-GPU
Analytic FlexibilityFull Elasticity
14
∫T
(∇~φi(x) +∇T ~φi(x)
): C :
(∇~φj(x) +∇~φj(x)
)dx (3)
element = VectorElement ( ’ Lagrange ’ , te t rahedron , 1)cElement = TensorElement ( ’ Lagrange ’ , te t rahedron , 1 ,
( dim , dim , dim , dim ) )v = TestFunct ion ( element )u = T r i a l F u n c t i o n ( element )C = C o e f f i c i e n t ( cElement )i , j , k , l = i nd i ces ( 4 )a = sym( grad ( v ) ) [ i , j ] *C[ i , j , k , l ] * sym( grad ( u ) ) [ k , l ] * dx
Currently broken in FEniCS release
M. Knepley Robust PASI ’11 68 / 231
GPU Computing FEM-GPU
Analytic FlexibilityFull Elasticity
14
∫T
(∇~φi(x) +∇T ~φi(x)
): C :
(∇~φj(x) +∇~φj(x)
)dx (3)
element = VectorElement ( ’ Lagrange ’ , te t rahedron , 1)cElement = TensorElement ( ’ Lagrange ’ , te t rahedron , 1 ,
( dim , dim , dim , dim ) )v = TestFunct ion ( element )u = T r i a l F u n c t i o n ( element )C = C o e f f i c i e n t ( cElement )i , j , k , l = i nd i ces ( 4 )a = sym( grad ( v ) ) [ i , j ] *C[ i , j , k , l ] * sym( grad ( u ) ) [ k , l ] * dx
Currently broken in FEniCS release
M. Knepley Robust PASI ’11 68 / 231
GPU Computing FEM-GPU
Form Decomposition
Element integrals are decomposed into analytic and geometric parts:
∫T ∇φi(x) · ∇φj(x)dx (4)
=∫T∂φi (x)∂xα
∂φj (x)∂xα dx (5)
=∫Tref
∂ξβ∂xα
∂φi (ξ)∂ξβ
∂ξγ∂xα
∂φj (ξ)∂ξγ|J|dx (6)
=∂ξβ∂xα
∂ξγ∂xα |J|
∫Tref
∂φi (ξ)∂ξβ
∂φj (ξ)∂ξγ
dx (7)
= Gβγ(T )K ijβγ (8)
Coefficients are also put into the geometric part.
M. Knepley Robust PASI ’11 69 / 231
GPU Computing FEM-GPU
Form Decomposition
Additional fields give rise to multilinear forms.
∫T φi(x) ·
(φk (x)∇φj(x)
)dA (9)
=∫T φ
βi (x)
(φαk (x)
∂φβj (x)∂xα
)dA (10)
=∫Tref φ
βi (ξ)φ
αk (ξ)
∂ξγ∂xα
∂φβj (ξ)
∂ξγ|J|dA (11)
=∂ξγ∂xα |J|
∫Tref φ
βi (ξ)φ
αk (ξ)
∂φβj (ξ)
∂ξγdA (12)
= Gαγ(T )K ijkαγ (13)
The index calculus is fully developed by Kirby and Logg inA Compiler for Variational Forms.
M. Knepley Robust PASI ’11 70 / 231
http://www.fenics.org/pub/documents/ffc/papers/ffc-toms-2005.pdf
GPU Computing FEM-GPU
Form Decomposition
Isoparametric Jacobians also give rise to multilinear forms
∫T ∇φi(x) · ∇φj(x)dA (14)
=∫T∂φi (x)∂xα
∂φj (x)∂xα dA (15)
=∫Tref
∂ξβ∂xα
∂φi (ξ)∂ξβ
∂ξγ∂xα
∂φj (ξ)∂ξγ|J|dA (16)
= |J|∫Tref φkJ
βαk
∂φi (ξ)∂ξβ
φlJγαl
∂φj (ξ)∂ξγ
dA (17)
= Jβαk Jγαl |J|
∫Tref φk
∂φi (ξ)∂ξβ
φl∂φj (ξ)∂ξγ
dA (18)
= Gβγkl (T )Kijklβγ (19)
A different space could also be used for Jacobians
M. Knepley Robust PASI ’11 71 / 231
GPU Computing FEM-GPU
Weak Form Processing
from f f c . ana l ys i s impor t analyze_formsfrom f f c . compi ler impor t compute_ir
parameters = f f c . defau l t_parameters ( )parameters [ ’ r ep resen ta t i on ’ ] = ’ tensor ’ana l ys i s = analyze_forms ( [ a , L ] , { } , parameters )i r = compute_ir ( ana lys is , parameters )
a_K = i r [ 2 ] [ 0 ] [ ’AK ’ ] [ 0 ] [ 0 ]a_G = i r [ 2 ] [ 0 ] [ ’AK ’ ] [ 0 ] [ 1 ]
K = a_K . A0 . astype (numpy . f l o a t 3 2 )G = a_G
M. Knepley Robust PASI ’11 72 / 231
GPU Computing FEM-GPU
Computational Flexibility
We generate different computations on the fly,
and can changeElement Batch Size
Number of Concurrent Elements
Loop unrolling
Interleaving stores with computation
M. Knepley Robust PASI ’11 73 / 231
GPU Computing FEM-GPU
Computational FlexibilityBasic Contraction
G K
Figure: Tensor Contraction Gβγ(T )K ijβγM. Knepley Robust PASI ’11 74 / 231
GPU Computing FEM-GPU
Computational FlexibilityBasic Contraction
G Kthread 0
Figure: Tensor Contraction Gβγ(T )K ijβγM. Knepley Robust PASI ’11 74 / 231
GPU Computing FEM-GPU
Computational FlexibilityBasic Contraction
G Kthread 0
thread 5
Figure: Tensor Contraction Gβγ(T )K ijβγM. Knepley Robust PASI ’11 74 / 231
GPU Computing FEM-GPU
Computational FlexibilityBasic Contraction
G Kthread 0
thread 5
thread 15
Figure: Tensor Contraction Gβγ(T )K ijβγM. Knepley Robust PASI ’11 74 / 231
GPU Computing FEM-GPU
Computational FlexibilityElement Batch Size
G0G1G2G3
Kthread 0
thread 5
thread 15
Figure: Tensor Contraction Gβγ(T )K ijβγM. Knepley Robust PASI ’11 75 / 231
GPU Computing FEM-GPU
Computational FlexibilityElement Batch Size
G0G1G2G3
K
thread
0
thread 5
thread 15
Figure: Tensor Contraction Gβγ(T )K ijβγM. Knepley Robust PASI ’11 75 / 231
GPU Computing FEM-GPU
Computational FlexibilityElement Batch Size
G0G1G2G3
K
thre
ad0
thread 5
thread 15
Figure: Tensor Contraction Gβγ(T )K ijβγM. Knepley Robust PASI ’11 75 / 231
GPU Computing FEM-GPU
Computational FlexibilityElement Batch Size
G0G1G2G3
K
thre
ad0
thread
5
thread 15
Figure: Tensor Contraction Gβγ(T )K ijβγM. Knepley Robust PASI ’11 75 / 231
GPU Computing FEM-GPU
Computational FlexibilityConcurrent Elements
G00G01G02G03
G10G11G12G13
Kthread 0
thread 5
thread 15
thread 16
thread 21
thre
ad31
Figure: Tensor Contraction Gβγ(T )K ijβγM. Knepley Robust PASI ’11 76 / 231
GPU Computing FEM-GPU
Computational FlexibilityConcurrent Elements
G00G01G02G03
G10G11G12G13
K
threa
d 0
thread 5
thread 15
thread 16thread 21
thre
ad31Figure: Tensor Contraction Gβγ(T )K ijβγ
M. Knepley Robust PASI ’11 76 / 231
GPU Computing FEM-GPU
Computational FlexibilityConcurrent Elements
G00G01G02G03
G10G11G12G13
Kth
read
0
thread 5
thread 15
thread 16thread 21
threa
d 31
Figure: Tensor Contraction Gβγ(T )K ijβγM. Knepley Robust PASI ’11 76 / 231
GPU Computing FEM-GPU
Computational FlexibilityConcurrent Elements
G00G01G02G03
G10G11G12G13
Kth
read
0
threa
d 5
thread 15
thread 16thread 21
thread 31
Figure: Tensor Contraction Gβγ(T )K ijβγM. Knepley Robust PASI ’11 76 / 231
GPU Computing FEM-GPU
Computational FlexibilityLoop Unrolling
/ * G K c o n t r a c t i o n : u n r o l l = f u l l * /E [ 0 ] += G[ 0 ] * K [ 0 ] ;E [ 0 ] += G[ 1 ] * K [ 1 ] ;E [ 0 ] += G[ 2 ] * K [ 2 ] ;E [ 0 ] += G[ 3 ] * K [ 3 ] ;E [ 0 ] += G[ 4 ] * K [ 4 ] ;E [ 0 ] += G[ 5 ] * K [ 5 ] ;E [ 0 ] += G[ 6 ] * K [ 6 ] ;E [ 0 ] += G[ 7 ] * K [ 7 ] ;E [ 0 ] += G[ 8 ] * K [ 8 ] ;
M. Knepley Robust PASI ’11 77 / 231
GPU Computing FEM-GPU
Computational FlexibilityLoop Unrolling
/ * G K c o n t r a c t i o n : u n r o l l = none * /f o r ( i n t b = 0; b < 1; ++b ) {
const i n t n = b * 1 ;f o r ( i n t alpha = 0; alpha < 3; ++alpha ) {
f o r ( i n t beta = 0; beta < 3; ++beta ) {E [ b ] += G[ n*9+ alpha *3+ beta ] * K [ alpha *3+ beta ] ;
}}
}
M. Knepley Robust PASI ’11 78 / 231
GPU Computing FEM-GPU
Computational FlexibilityInterleaving stores
/ * G K c o n t r a c t i o n : u n r o l l = none * /f o r ( i n t b = 0; b < 4; ++b ) {
const i n t n = b * 1 ;f o r ( i n t alpha = 0; alpha < 3; ++alpha ) {
f o r ( i n t beta = 0; beta < 3; ++beta ) {E [ b ] += G[ n*9+ alpha *3+ beta ] * K [ alpha *3+ beta ] ;
}}
}/ * Store c o n t r a c t i o n r e s u l t s * /elemMat [ Eo f f se t + idx +0] = E [ 0 ] ;elemMat [ Eo f f se t + idx +16] = E [ 1 ] ;elemMat [ Eo f f se t + idx +32] = E [ 2 ] ;elemMat [ Eo f f se t + idx +48] = E [ 3 ] ;
M. Knepley Robust PASI ’11 79 / 231
GPU Computing FEM-GPU
Computational FlexibilityInterleaving stores
n = 0;f o r ( i n t alpha = 0; alpha < 3; ++alpha ) {
f o r ( i n t beta = 0; beta < 3; ++beta ) {E += G[ n*9+ alpha *3+ beta ] * K [ alpha *3+ beta ] ;
}}/ * Store c o n t r a c t i o n r e s u l t * /elemMat [ Eo f f se t + idx +0] = E;n = 1; E = 0 . 0 ; / * con t rac t * /elemMat [ Eo f f se t + idx +16] = E;n = 2; E = 0 . 0 ; / * con t rac t * /elemMat [ Eo f f se t + idx +32] = E;n = 3; E = 0 . 0 ; / * con t rac t * /elemMat [ Eo f f se t + idx +48] = E;
M. Knepley Robust PASI ’11 80 / 231
GPU Computing FEM-GPU
Code Template
$ { pb . globalMod ( isGPU ) } vo id kerne l ( $ { pb . g r i dS ize ( isGPU ) } f l o a t * output ) {
$ { pb . g r idLoopSta r t ( isGPU , load , s to re ) }$ { pb . threadLoopStar t ( isGPU , blockDimX ) }f l o a t G[ $ { dim * dim } ] = { $ { ’ , ’ . j o i n ( [ ’ 3.0 ’ ] * ( dim * dim ) ) } } ;f l o a t K [ $ { dim * dim } ] = { $ { ’ , ’ . j o i n ( [ ’ 3.0 ’ ] * ( dim * dim ) ) } } ;f l o a t product = 0 . 0 ;const i n t Oof fse t = g r i d I d x *$ { numThreads } ;
/ / Cont ract G and K% f o r n i n range ( numLocalElements ) :% f o r alpha i n range ( dim ) :% f o r beta i n range ( dim ) :
product += G[ $ { gIdx } ] * K [ $ { k Idx } ] ;% endfor% endfor% endfor
output [ Oof fse t+ idx ] = product ;$ { pb . threadLoopEnd ( isGPU ) }$ { pb . gridLoopEnd ( isGPU ) }r e t u r n ;
} M. Knepley Robust PASI ’11 81 / 231
GPU Computing FEM-GPU
Rendering a Template
We render code template into strings using a dictionary of inputs.
args = { ’ dim ’ : s e l f . dim ,’ numLocalElements ’ : 1 ,’ numThreads ’ : s e l f . threadBlockSize }
kernelTemplate = s e l f . getKernelTemplate ( )gpuCode = kernelTemplate . render ( isGPU = True , * * args )cpuCode = kernelTemplate . render ( isGPU = False , * * args )
M. Knepley Robust PASI ’11 82 / 231
GPU Computing FEM-GPU
GPU Source Code
__global__ vo id kerne l ( f l o a t * output ) {const i n t g r i d I d x = b lock Idx . x + b lock Idx . y * gr idDim . x ;const i n t i dx = th read Idx . x + th read Idx . y * 1 ; / / This i s ( i , j )f l o a t G[ 9 ] = { 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 } ;f l o a t K [ 9 ] = { 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 } ;f l o a t product = 0 . 0 ;const i n t Oof fse t = g r i d I d x * 1 ;
/ / Cont ract G and Kproduct += G[ 0 ] * K [ 0 ] ;product += G[ 1 ] * K [ 1 ] ;product += G[ 2 ] * K [ 2 ] ;product += G[ 3 ] * K [ 3 ] ;product += G[ 4 ] * K [ 4 ] ;product += G[ 5 ] * K [ 5 ] ;product += G[ 6 ] * K [ 6 ] ;product += G[ 7 ] * K [ 7 ] ;product += G[ 8 ] * K [ 8 ] ;ou tput [ Oof fse t+ idx ] = product ;r e t u r n ;
}
M. Knepley Robust PASI ’11 83 / 231
GPU Computing FEM-GPU
CPU Source Code
vo id kerne l ( i n t numInvocations , f l o a t * output ) {f o r ( i n t g r i d I d x = 0; g r i d I d x < numInvocations ; ++ g r i d I d x ) {
f o r ( i n t i = 0 ; i < 1 ; ++ i ) {f o r ( i n t j = 0 ; j < 1 ; ++ j ) {
const i n t i dx = i + j * 1 ; / / This i s ( i , j )f l o a t G[ 9 ] = { 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 } ;f l o a t K [ 9 ] = { 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 , 3 . 0 } ;f l o a t product = 0 . 0 ;const i n t Oof fse t = g r i d I d x * 1 ;
/ / Cont ract G and Kproduct += G[ 0 ] * K [ 0 ] ;product += G[ 1 ] * K [ 1 ] ;product += G[ 2 ] * K [ 2 ] ;product += G[ 3 ] * K [ 3 ] ;product += G[ 4 ] * K [ 4 ] ;product += G[ 5 ] * K [ 5 ] ;product += G[ 6 ] * K [ 6 ] ;product += G[ 7 ] * K [ 7 ] ;product += G[ 8 ] * K [ 8 ] ;ou tput [ Oof fse t+ idx ] = product ;
}}
}r e t u r n ;
}
M. Knepley Robust PASI ’11 84 / 231
GPU Computing FEM-GPU
Creating a Module
CPU:
# Output kerne l and C support codes e l f . outputKernelC ( cpuCode )s e l f . w r i t e M a k e f i l e ( )out , er r , s ta tus = s e l f . executeShellCommand ( ’make ’ )\ end { minted }
\ b i gsk ip
GPU:\ begin { minted } { python }from pycuda . compi ler impor t SourceModule
mod = SourceModule ( gpuCode )s e l f . ke rne l = mod. ge t_ func t i on ( ’ ke rne l ’ )s e l f . kerne lRepor t ( s e l f . kernel , ’ ke rne l ’ )
M. Knepley Robust PASI ’11 85 / 231
GPU Computing FEM-GPU
Executing a Module
impor t pycuda . d r i v e r as cudaimpor t pycuda . a u t o i n i t
blockDim = ( s e l f . dim , s e l f . dim , 1)s t a r t = cuda . Event ( )end = cuda . Event ( )g r i d = s e l f . c a l c u l a t e G r i d (N, numLocalElements )s t a r t . record ( )f o r i i n range ( i t e r s ) :
s e l f . ke rne l ( cuda . Out ( output ) ,b lock = blockDim , g r i d = g r i d )
end . record ( )end . synchronize ( )gpuTimes . append ( s t a r t . t i m e _ t i l l ( end ) *1 e−3/ i t e r s )
M. Knepley Robust PASI ’11 86 / 231
GPU Computing FEM-GPU
Element Matrix Formation
Element matrix K is now made up of small tensorsContract all tensor elements with each the geometry tensor G(T )
3 00 0
0 -10 0
1 10 0
-4 -40 0
0 40 0
0 00 0
0 0-1 0
0 00 3
0 01 1
0 00 0
0 04 0
0 0-4 -4
1 01 0
0 10 1
3 33 3
-4 0-4 0
0 00 0
0 -40 -4
-4 0-4 0
0 00 0
-4 -40 0
8 44 8
0 -4-4 -8
0 44 0
0 04 0
0 40 0
0 00 0
0 -4-4 -8
8 44 8
-8 -4-4 0
0 00 0
0 -40 -4
0 0-4 -4
0 44 0
-8 -4-4 0
8 44 8
M. Knepley Robust PASI ’11 87 / 231
GPU Computing FEM-GPU
Mapping GαβK ijαβ to the GPUProblem Division
For N elements, map blocks of NL elements to each Thread Block (TB)Launch grid must be gx × gy = N/NLTB grid will depend on the specific algorithmOutput is size Nbasis × Nbasis × NL
We can split a TB to work on multiple, NB, elements at a timeNote that each TB always gets NL elements, so NB must divide NL
M. Knepley Robust PASI ’11 88 / 231
GPU Computing FEM-GPU
Mapping GαβK ijαβ to the GPUKernel Arguments
__global__vo id in teg ra teJacob ian ( f l o a t * elemMat ,
f l o a t * geometry ,f l o a t * a n a l y t i c )
geometry: Array of G tensors for each element
analytic: K tensor
elemMat: Array of E = G : K tensors for each element
M. Knepley Robust PASI ’11 89 / 231
GPU Computing FEM-GPU
Mapping GαβK ijαβ to the GPUMemory Movement
We can interleave stores with computation, or wait until the endWaiting could improve coalescing of writes
Interleaving could allow overlap of writes with computation
Also need toCoalesce accesses between global and local/shared memory(use moveArray())
Limit use of shared and local memory
M. Knepley Robust PASI ’11 90 / 231
GPU Computing FEM-GPU
Memory Bandwidth
Superior GPU memory bandwidth is due to both
bus width and clock speed.
CPU GPUBus Width (bits) 64 512Bus Clock Speed (MHz) 400 1600Memory Bandwidth (GB/s) 3 102Latency (cycles) 240 600
Tesla always accesses blocks of 64 or 128 bytes
M. Knepley Robust PASI ’11 91 / 231
GPU Computing FEM-GPU
Mapping GαβK ijαβ to the GPUReduction
Choose strategies to minimize reductions
Only reductions occur in summation for contractionsSimilar to the reduction in a quadrature loop
Strategy #1: Each thread uses all of K
Strategy #2: Do each contraction in a separate thread
M. Knepley Robust PASI ’11 92 / 231
GPU Computing FEM-GPU
Strategy #1TB Division
Each thread computes an entire element matrix, so that
blockDim = (NL/NB,1,1)
We will see that there is little opportunity to overlap computation andmemory access
M. Knepley Robust PASI ’11 93 / 231
GPU Computing FEM-GPU
Strategy #1Analytic Part
Read K into shared memory (need to synchronize before access)
__shared__ f l o a t K [ $ { dim * dim * numBasisFuncs * numBasisFuncs } ] ;
$ { fm . moveArray ( ’K ’ , ’ a n a l y t i c ’ ,dim * dim * numBasisFuncs * numBasisFuncs , ’ ’ , numThreads ) }
__syncthreads ( ) ;
M. Knepley Robust PASI ’11 94 / 231
GPU Computing FEM-GPU
Strategy #1Geometry
Each thread handles NB elementsRead G into local memory (not coalesced)Interleaving means writing after each thread does a singleelement matrix calculation
f l o a t G[ $ { dim * dim * numBlockElements } ] ;
i f ( i n t e r l e a v e d ) {const i n t Gof fse t = ( g r i d I d x *$ { numLocalElements } + idx ) * $ { dim * dim } ;f o r n i n range ( numBlockElements ) :
$ { fm . moveArray ( ’G ’ , ’ geometry ’ , dim * dim , ’ Gof fse t ’ ,blockNumber = n* numLocalElements / numBlockElements ,localBlockNumber = n , isCoalesced = False ) }
endfor} e lse {
const i n t Gof fse t = ( g r i d I d x *$ { numLocalElements / numBlockElements } + idx )*$ { dim * dim * numBlockElements } ;
$ { fm . moveArray ( ’G ’ , ’ geometry ’ , dim * dim * numBlockElements , ’ Gof fse t ’ ,isCoalesced = False ) }
}
M. Knepley Robust PASI ’11 95 / 231
GPU Computing FEM-GPU
Strategy #1Output
We write element matrices out contiguously by TB
const i n t matSize = numBasisFuncs * numBasisFuncs ;const i n t Eo f f se t = g r i d I d x * matSize * numLocalElements ;
i f ( i n t e r l e a v e d ) {const i n t elemOff = idx * matSize ;__shared__ f l o a t E [ matSize * numLocalElements / numBlockElements ] ;
} e lse {const i n t elemOff = idx * matSize * numBlockElements ;__shared__ f l o a t E [ matSize * numLocalElements ] ;
}
M. Knepley Robust PASI ’11 96 / 231
GPU Computing FEM-GPU
Strategy #1Contraction
matSize = numBasisFuncs * numBasisFuncsi f i n t e r l ea v eS t o re s :
f o r b i n range ( numBlockElements ) :# Do 1 c o n t r a c t i o n f o r each thread__syncthreads ( ) ;fm . moveArray ( ’E ’ , ’ elemMat ’ ,
matSize * numLocalElements / numBlockElements ,’ Eo f f se t ’ , numThreads , blockNumber = n , isLoad = 0)
e lse :# Do numBlockElements co n t r ac t i ons f o r each thread__syncthreads ( ) ;fm . moveArray ( ’E ’ , ’ elemMat ’ ,
matSize * numLocalElements ,’ Eo f f se t ’ , numThreads , isLoad = 0)
M. Knepley Robust PASI ’11 97 / 231
GPU Computing FEM-GPU
Strategy #2TB Division
Each thread computes a single element of an element matrix, so that
blockDim = (Nbasis,Nbasis,NB)
This allows us to overlap computation of another element in the TBwith writes for the first.
M. Knepley Robust PASI ’11 98 / 231
GPU Computing FEM-GPU
Strategy #2Analytic Part
Assign an (i , j) block of K to local memoryNB threads will simultaneously calculate a contraction
const i n t Kidx = th read Idx . x + th read Idx . y *$ { numBasisFuncs } ; / / This i s ( i , j )const i n t i dx = Kidx + th read Idx . z *$ { numBasisFuncs * numBasisFuncs } ;const i n t Ko f f se t = Kidx *$ { dim * dim } ;f l o a t K [ $ { dim * dim } ] ;
% f o r alpha i n range ( dim ) :% f o r beta i n range ( dim ) :
K[ $ { k Idx } ] = a n a l y t i c [ Ko f f se t +$ { k Idx } ] ;% endfor% endfor
M. Knepley Robust PASI ’11 99 / 231
GPU Computing FEM-GPU
Strategy #2Geometry
Store NL G tensors into shared memoryInterleaving means writing after each thread does a singleelement calculation
const i n t Gof fse t = g r i d I d x *$ { dim * dim * numLocalElements } ;__shared__ f l o a t G[ $ { dim * dim * numLocalElements } ] ;
$ { fm . moveArray ( ’G ’ , ’ geometry ’ , dim * dim * numLocalElements ,’ Gof fse t ’ , numThreads ) }
__syncthreads ( ) ;
M. Knepley Robust PASI ’11 100 / 231
GPU Computing FEM-GPU
Strategy #2Output
We write element matrices out contiguously by TBIf interleaving stores, only need a single productOtherwise, need NL/NB, one per element processed by a thread
const i n t matSize = numBasisFuncs * numBasisFuncs ;const i n t Eo f f se t = g r i d I d x * matSize * numLocalElements ;
i f ( i n t e r l e a v e d ) {f l o a t product = 0 . 0 ;const i n t elemOff = idx * matSize ;
} e lse {f l o a t product [ numLocalElements / numBlockElements ] ;const i n t elemOff = idx * matSize * numBlockElements ;
}
M. Knepley Robust PASI ’11 101 / 231
GPU Computing FEM-GPU
Strategy #2Contraction
i f i n t e r l ea v eS t o re s :f o r n i n range ( numLocalElements / numBlockElements ) :
# Do 1 c o n t r a c t i o n f o r each thread__syncthreads ( )# Do coalesced w r i t e o f element mat r i xelemMat [ Eo f f se t + idx + n* numThreads ] = product
e lse :# Do numLocalElements / numBlockElements c on t r ac t i ons# save r e s u l t s i n product [ ]f o r n i n range ( numLocalElements / numBlockElements ) :
elemMat [ Eo f f se t + idx + n* numThreads ] = product [ n ]
M. Knepley Robust PASI ’11 102 / 231
GPU Computing FEM-GPU
ResultsGTX 285
M. Knepley Robust PASI ’11 103 / 231
GPU Computing FEM-GPU
ResultsGTX 285
M. Knepley Robust PASI ’11 103 / 231
GPU Computing FEM-GPU
ResultsGTX 285, 2 Simultaneous Elements
M. Knepley Robust PASI ’11 103 / 231
GPU Computing FEM-GPU
ResultsGTX 285, 2 Simultaneous Elements
M. Knepley Robust PASI ’11 103 / 231
GPU Computing FEM-GPU
ResultsGTX 285
M. Knepley Robust PASI ’11 104 / 231
GPU Computing FEM-GPU
ResultsGTX 285
M. Knepley Robust PASI ’11 104 / 231
GPU Computing FEM-GPU
ResultsGTX 285, 2 Simultaneous Elements
M. Knepley Robust PASI ’11 104 / 231
GPU Computing FEM-GPU
ResultsGTX 285, 2 Simultaneous Elements
M. Knepley Robust PASI ’11 104 / 231
GPU Computing FEM-GPU
ResultsGTX 285
M. Knepley Robust PASI ’11 105 / 231
GPU Computing FEM-GPU
ResultsGTX 285
M. Knepley Robust PASI ’11 105 / 231
GPU Computing FEM-GPU
ResultsGTX 285, 2 Simultaneous Elements
M. Knepley Robust PASI ’11 105 / 231
GPU Computing FEM-GPU
ResultsGTX 285, 2 Simultaneous Elements
M. Knepley Robust PASI ’11 105 / 231
GPU Computing FEM-GPU
ResultsGTX 285, 2 Simultaneous Elements
M. Knepley Robust PASI ’11 106 / 231
GPU Computing FEM-GPU
ResultsGTX 285, 2 Simultaneous Elements
M. Knepley Robust PASI ’11 106 / 231
GPU Computing FEM-GPU
ResultsGTX 285, 4 Simultaneous Elements
M. Knepley Robust PASI ’11 106 / 231
GPU Computing FEM-GPU
ResultsGTX 285, 4 Simultaneous Elements
M. Knepley Robust PASI ’11 106 / 231
GPU Computing FEM-GPU
PerformanceInfluence of Element Batch Sizes
M. Knepley Robust PASI ’11 107 / 231
GPU Computing FEM-GPU
PerformanceInfluence of Element Batch Sizes
M. Knepley Robust PASI ’11 108 / 231
GPU Computing FEM-GPU
PerformanceInfluence of Code Structure
M. Knepley Robust PASI ’11 109 / 231
GPU Computing FEM-GPU
PerformanceInfluence of Code Structure
M. Knepley Robust PASI ’11 110 / 231
GPU Computing PETSc-GPU
Outline
2 GPU ComputingFEM-GPUPETSc-GPU
M. Knepley Robust PASI ’11 111 / 231
GPU Computing PETSc-GPU
Thrust
Thrust is a CUDA library of parallel algorithms
Interface similar to C++ Standard Template Library
Containers (vector) on both host and device
Algorithms: sort, reduce, scan
Freely available, part of PETSc configure (-with-thrust-dir)
Included as part of CUDA 4.0 installation
M. Knepley Robust PASI ’11 112 / 231
http://code.google.com/p/thrust/
GPU Computing PETSc-GPU
Cusp
Cusp is a CUDA library forsparse linear algebra and graph computations
Builds on data structures in Thrust
Provides sparse matrices in several formats (CSR, Hybrid)
Includes some preliminary preconditioners (Jacobi, SA-AMG)
Freely available, part of PETSc configure (-with-cusp-dir)
M. Knepley Robust PASI ’11 113 / 231
http://code.google.com/p/cusp-library/
GPU Computing PETSc-GPU
VECCUDA
Strategy: Define a new Vec implementation
Uses Thrust for data storage and operations on GPU
Supports full PETSc Vec interface
Inherits PETSc scalar type
Can be activated at runtime, -vec_type cuda
PETSc provides memory coherence mechanism
M. Knepley Robust PASI ’11 114 / 231
http://code.google.com/p/thrust/
GPU Computing PETSc-GPU
Memory Coherence
PETSc Objects now hold a coherence flag
PETSC_CUDA_UNALLOCATED No allocation on the GPUPETSC_CUDA_GPU Values on GPU are currentPETSC_CUDA_CPU Values on CPU are currentPETSC_CUDA_BOTH Values on both are current
Table: Flags used to indicate the memory state of a PETSc CUDA Vec object.
M. Knepley Robust PASI ’11 115 / 231
GPU Computing PETSc-GPU
MATAIJCUDA
Also define new Mat implementations
Uses Cusp for data storage and operations on GPU
Supports full PETSc Mat interface, some ops on CPU
Can be activated at runtime, -mat_type aijcuda
Notice that parallel matvec necessitates off-GPU data transfer
M. Knepley Robust PASI ’11 116 / 231
http://code.google.com/p/cusp-library/
GPU Computing PETSc-GPU
Solvers
Solvers come for FreePreliminary Implementation of PETSc Using GPU,
Minden, Smith, Knepley, 2010
All linear algebra types work with solvers
Entire solve can take place on the GPUOnly communicate scalars back to CPU
GPU communication cost could be amortized over several solves
Preconditioners are a problemCusp has a promising AMG
M. Knepley Robust PASI ’11 117 / 231
http://www.mcs.anl.gov/uploads/cels/papers/P1787.pdf
GPU Computing PETSc-GPU
Installation
PETSc only needs# Turn on CUDA--with-cuda# Specify the CUDA compiler--with-cudac=’nvcc -m64’# Indicate the location of packages# --download-* will also work soon--with-thrust-dir=/PETSc3/multicore/thrust--with-cusp-dir=/PETSc3/multicore/cusp# Can also use double precision--with-precision=single
M. Knepley Robust PASI ’11 118 / 231
GPU Computing PETSc-GPU
ExampleDriven Cavity Velocity-Vorticity with Multigrid
ex50 -da_vec_type seqcusp-da_mat_type aijcusp -mat_no_inode # Setup types-da_grid_x 100 -da_grid_y 100 # Set grid size-pc_type none -pc_mg_levels 1 # Setup solver-preload off -cuda_synchronize # Setup run-log_summary
M. Knepley Robust PASI ’11 119 / 231
Linear Algebra and Solvers
Outline
1 Tools and Infrastructure
2 GPU Computing
3 Linear Algebra and SolversVector AlgebraMatrix AlgebraAlgebraic SolversSNESDAPCFieldSplit
4 First Lab: Assembling a Code
5 Second Lab: Debugging and Performance BenchmarkingM. Knepley Robust PASI ’11 120 / 231
Linear Algebra and Solvers Vector Algebra
Outline
3 Linear Algebra and SolversVector AlgebraMatrix AlgebraAlgebraic SolversSNESDAPCFieldSplit
M. Knepley Robust PASI ’11 121 / 231
Linear Algebra and Solvers Vector Algebra
Vector Algebra
What are PETSc vectors?
Fundamental objects representingsolutionsright-hand sidescoefficients
Each process locally owns a subvector of contiguous global data
M. Knepley Robust PASI ’11 122 / 231
Linear Algebra and Solvers Vector Algebra
Vector Algebra
How do I create vectors?
VecCreate(MPI_Comm, Vec*)
VecSetSizes(Vec, PetscIntn, PetscInt N)
VecSetType(Vec, VecType typeName)
VecSetFromOptions(Vec)
Can set the type at runtime
M. Knepley Robust PASI ’11 123 / 231
Linear Algebra and Solvers Vector Algebra
Vector Algebra
A PETSc Vec
Supports all vector space operationsVecDot(), VecNorm(), VecScale()
Has a direct interface to the valuesVecGetArray(), VecGetArrayF90()
Has unusual operationsVecSqrtAbs(), VecStrideGather()
Communicates automatically during assemblyHas customizable communication (PetscSF,VecScatter)
M. Knepley Robust PASI ’11 124 / 231
Linear Algebra and Solvers Vector Algebra
Parallel AssemblyVectors and Matrices
Processes may set an arbitrary entryMust use proper interface
Entries need not be generated locallyLocal meaning the process on which they are stored
PETSc automatically moves data if necessaryHappens during the assembly phase
M. Knepley Robust PASI ’11 125 / 231
Linear Algebra and Solvers Vector Algebra
Vector Assembly
A three step processEach process sets or adds valuesBegin communication to send values to the correct processComplete the communication
VecSetValues ( Vec v , Pe tsc In t n , Pe tsc In t rows [ ] ,PetscScalar values [ ] , InsertMode mode)
Mode is either INSERT_VALUES or ADD_VALUESTwo phases allow overlap of communication and computation
VecAssemblyBegin(Vecv)VecAssemblyEnd(Vecv)
M. Knepley Robust PASI ’11 126 / 231
Linear Algebra and Solvers Vector Algebra
One Way to Set the Elements of a Vector
VecGetSize ( x , &N) ;MPI_Comm_rank (PETSC_COMM_WORLD, &rank ) ;i f ( rank == 0) {
va l = 0 . 0 ;f o r ( i = 0 ; i < N; ++ i ) {
VecSetValues ( x , 1 , &i , &val , INSERT_VALUES ) ;va l += 10 .0 ;
}}/ * These rou t i nes ensure t h a t the data i s
d i s t r i b u t e d to the other processes * /VecAssemblyBegin ( x ) ;VecAssemblyEnd ( x ) ;
M. Knepley Robust PASI ’11 127 / 231
Linear Algebra and Solvers Vector Algebra
A Better Way to Set the Elements of a Vector
VecGetOwnershipRange ( x , &low , &high ) ;va l = low * 1 0 . 0 ;f o r ( i = low ; i < high ; ++ i ) {
VecSetValues ( x , 1 , &i , &val , INSERT_VALUES ) ;va l += 10 .0 ;
}/ * No data w i l l be communicated here * /VecAssemblyBegin ( x ) ;VecAssemblyEnd ( x ) ;
M. Knepley Robust PASI ’11 128 / 231
Linear Algebra and Solvers Vector Algebra
Selected Vector Operations
Function Name OperationVecAXPY(Vec y, PetscScalar a, Vec x) y = y + a ∗ xVecAYPX(Vec y, PetscScalar a, Vec x) y = x + a ∗ yVecWAYPX(Vec w, PetscScalar a, Vec x, Vec y) w = y + a ∗ xVecScale(Vec x, PetscScalar a) x = a ∗ xVecCopy(Vec y, Vec x) y = xVecPointwiseMult(Vec w, Vec x, Vec y) wi = xi ∗ yiVecMax(Vec x, PetscInt *idx, PetscScalar *r) r = maxriVecShift(Vec x, PetscScalar r) xi = xi + rVecAbs(Vec x) xi = |xi |VecNorm(Vec x, NormType type, PetscReal *r) r = ||x ||
M. Knepley Robust PASI ’11 129 / 231
Linear Algebra and Solvers Vector Algebra
Working With Local Vectors
It is sometimes more efficient to directly access local storage of a Vec.PETSc allows you to access the local storage with
VecGetArray(Vec, double *[])
You must return the array to PETSc when you finishVecRestoreArray(Vec, double *[])
Allows PETSc to handle data structure conversionsCommonly, these routines are fast and do not involve a copy
M. Knepley Robust PASI ’11 130 / 231
Linear Algebra and Solvers Vector Algebra
VecGetArray in C
Vec v ;PetscScalar * ar ray ;Pe tsc In t n , i ;
VecGetArray ( v , &ar ray ) ;VecGetLocalSize ( v , &n ) ;PetscSynchron izedPr in t f (PETSC_COMM_WORLD,
" F i r s t element o f l o c a l a r ray i s %f \ n " , a r ray [ 0 ] ) ;PetscSynchronizedFlush (PETSC_COMM_WORLD) ;f o r ( i = 0 ; i < n ; ++ i ) {
a r ray [ i ] += ( PetscScalar ) rank ;}VecRestoreArray ( v , &ar ray ) ;
M. Knepley Robust PASI ’11 131 / 231
Linear Algebra and Solvers Vector Algebra
VecGetArray in F77
# inc lude " f i n c l u d e / petsc . h "
Vec v ;PetscScalar ar ray ( 1 )PetscOf fse t o f f s e tPe tsc In t n , iPetscErrorCode i e r r
c a l l VecGetArray ( v , array , o f f s e t , i e r r )c a l l VecGetLocalSize ( v , n , i e r r )do i =1 ,n
ar ray ( i + o f f s e t ) = ar ray ( i + o f f s e t ) + rankend doc a l l VecRestoreArray ( v , array , o f f s e t , i e r r )
M. Knepley Robust PASI ’11 132 / 231
Linear Algebra and Solvers Vector Algebra
VecGetArray in F90
# inc lude " f i n c l u d e / petsc . h90 "
Vec v ;PetscScalar p o i n t e r : : a r ray ( : )Pe tsc In t n , iPetscErrorCode i e r r
c a l l VecGetArrayF90 ( v , array , i e r r )c a l l VecGetLocalSize ( v , n , i e r r )do i =1 ,n
ar ray ( i ) = ar ray ( i ) + rankend doc a l l VecRestoreArrayF90 ( v , array , i e r r )
M. Knepley Robust PASI ’11 133 / 231
Linear Algebra and Solvers Vector Algebra
VecGetArray in Python
wi th v as a :f o r i i n range ( len ( a ) ) :
a [ i ] = 5 .0* i
M. Knepley Robust PASI ’11 134 / 231
Linear Algebra and Solvers Vector Algebra
DMDAVecGetArray in C
DM da ;Vec v ;DMDALocalInfo * i n f o ;PetscScalar * * ar ray ;
DMDAVecGetArray ( da , v , &ar ray ) ;f o r ( j = in fo−>ys ; j < in fo−>ys+ in fo−>ym; ++ j ) {
f o r ( i = in fo−>xs ; i < in fo−>xs+ in fo−>xm; ++ i ) {u = x [ j ] [ i ] ;uxx = ( 2 . 0 * u − x [ j ] [ i −1] − x [ j ] [ i + 1 ] ) * hydhx ;uyy = ( 2 . 0 * u − x [ j −1][ i ] − x [ j + 1 ] [ i ] ) * hxdhy ;f [ j ] [ i ] = uxx + uyy ;
}}DMDAVecRestoreArray ( da , v , &ar ray ) ;
M. Knepley Robust PASI ’11 135 / 231
Linear Algebra and Solvers Matrix Algebra
Outline
3 Linear Algebra and SolversVector AlgebraMatrix AlgebraAlgebraic SolversSNESDAPCFieldSplit
M. Knepley Robust PASI ’11 136 / 231
Linear Algebra and Solvers Matrix Algebra
Matrix Algebra
What are PETSc matrices?
Fundamental objects for storing stiffness matrices and JacobiansEach process locally owns a contiguous set of rowsSupports many data types
AIJ, Block AIJ, Symmetric AIJ, Block Matrix, etc.Supports structures for many packages
MUMPS, Spooles, SuperLU, UMFPack, DSCPack
M. Knepley Robust PASI ’11 137 / 231
Linear Algebra and Solvers Matrix Algebra
How do I create matrices?
MatCreate(MPI_Comm, Mat*)
MatSetSizes(Mat, PetscIntm, PetscInt n, M, N)
MatSetType(Mat, MatType typeName)
MatSetFromOptions(Mat)
Can set the type at runtime
MatSeqAIJPreallocation(Mat, PetscIntnz, const PetscInt nnz[])
MatXAIJPreallocation(Mat, bs, dnz[], onz [], dnzu[], onzu[])
MatSetValues(Mat, m, rows[], n, cols [], values [], InsertMode)
MUST be used, but does automatic communication
M. Knepley Robust PASI ’11 138 / 231
Linear Algebra and Solvers Matrix Algebra
Matrix Polymorphism
The PETSc Mat has a single user interface,Matrix assembly
MatSetValues()
Matrix-vector multiplicationMatMult()
Matrix viewingMatView()
but multiple underlying implementations.AIJ, Block AIJ, Symmetric Block AIJ,DenseMatrix-Freeetc.
A matrix is defined by its interface, not by its data structure.
M. Knepley Robust PASI ’11 139 / 231
Linear Algebra and Solvers Matrix Algebra
Matrix Assembly
A three step processEach process sets or adds valuesBegin communication to send values to the correct processComplete the communication
MatSetValues(Matm, m, rows[], n, cols [], values [], mode)
mode is either INSERT_VALUES or ADD_VALUESLogically dense block of values
Two phase assembly allows overlap of communication andcomputation
MatAssemblyBegin(Matm, MatAssemblyType type)MatAssemblyEnd(Matm, MatAssemblyType type)type is either MAT_FLUSH_ASSEMBLY or MAT_FINAL_ASSEMBLY
M. Knepley Robust PASI ’11 140 / 231
Linear Algebra and Solvers Matrix Algebra
One Way to Set the Elements of a MatrixSimple 3-point stencil for 1D Laplacian
v [ 0 ] = −1.0; v [ 1 ] = 2 . 0 ; v [ 2 ] = −1.0;i f ( rank == 0) {
f o r ( row = 0; row < N; row++) {co ls [ 0 ] = row−1; co ls [ 1 ] = row ; co ls [ 2 ] = row +1;i f ( row == 0) {
MatSetValues (A,1 ,& row ,2 ,& co ls [1 ] ,& v [ 1 ] , INSERT_VALUES ) ;} e lse i f ( row == N−1) {
MatSetValues (A,1 ,& row ,2 , cols , v , INSERT_VALUES ) ;} e lse {
MatSetValues (A,1 ,& row ,3 , cols , v , INSERT_VALUES ) ;}
}}MatAssemblyBegin (A, MAT_FINAL_ASSEMBLY ) ;MatAssemblyEnd (A, MAT_FINAL_ASSEMBLY ) ;
M. Knepley Robust PASI ’11 141 / 231
Linear Algebra and Solvers Matrix Algebra
A Better Way to Set the Elements of a MatrixSimple 3-point stencil for 1D Laplacian
v [ 0 ] = −1.0; v [ 1 ] = 2 . 0 ; v [ 2 ] = −1.0;MatGetOwnershipRange (A,& s t a r t ,&end ) ;f o r ( row = s t a r t ; row < end ; row++) {
co ls [ 0 ] = row−1; co ls [ 1 ] = row ; co ls [ 2 ] = row +1;i f ( row == 0) {
MatSetValues (A,1 ,& row ,2 ,& co ls [1 ] ,& v [ 1 ] , INSERT_VALUES ) ;} e lse i f ( row == N−1) {
MatSetValues (A,1 ,& row ,2 , cols , v , INSERT_VALUES ) ;} e lse {
MatSetValues (A,1 ,& row ,3 , cols , v , INSERT_VALUES ) ;}
}MatAssemblyBegin (A, MAT_FINAL_ASSEMBLY ) ;MatAssemblyEnd (A, MAT_FINAL_ASSEMBLY ) ;
M. Knepley Robust PASI ’11 142 / 231
Linear Algebra and Solvers Matrix Algebra
Why Are PETSc Matrices That Way?
No one data structure is appropriate for all problemsBlocked and diagonal formats provide performance benefitsPETSc has many formatsMakes it easy to add new data structures
Assembly is difficult enough without worrying about partitioningPETSc provides parallel assembly routinesHigh performance still requires making most operations localHowever, programs can be incrementally developed.MatPartitioning and MatOrdering can help
Matrix decomposition in contiguous chunks is simpleMakes interoperation with other codes easierFor other ordering, PETSc provides “Application Orderings” (AO)
M. Knepley Robust PASI ’11 143 / 231
Linear Algebra and Solvers Algebraic Solvers
Outline
3 Linear Algebra and SolversVector AlgebraMatrix AlgebraAlgebraic SolversSNESDAPCFieldSplit
M. Knepley Robust PASI ’11 144 / 231
Linear Algebra and Solvers Algebraic Solvers
Solver Types
Explicit:Field variables are updated using local neighbor information
Semi-implicit:Some subsets of variables are updated with global solvesOthers with direct local updates
Implicit:Most or all variables are updated in a single global solve
M. Knepley Robust PASI ’11 145 / 231
Linear Algebra and Solvers Algebraic Solvers
Linear SolversKrylov Methods
Using PETSc linear algebra, just add:KSPSetOperators(KSPksp, MatA, MatM, MatStructure flag)KSPSolve(KSPksp, Vecb, Vecx)
Can access subobjectsKSPGetPC(KSPksp, PC*pc)
Preconditioners must obey PETSc interfaceBasically just the KSP interface
Can change solver dynamically from the command line-ksp_type bicgstab
M. Knepley Robust PASI ’11 146 / 231
Linear Algebra and Solvers Algebraic Solvers
Nonlinear Solvers
Using PETSc linear algebra, just add:SNESSetFunction(SNESsnes, Vecr, residualFunc, void *ctx)SNESSetJacobian(SNESsnes, MatA, MatM, jacFunc, void *ctx)SNESSolve(SNESsnes, Vecb, Vecx)
Can access subobjectsSNESGetKSP(SNESsnes, KSP*ksp)
Can customize subobjects from the cmd lineSet the subdomain preconditioner to ILU with -sub_pc_type ilu
M. Knepley Robust PASI ’11 147 / 231
Linear Algebra and Solvers Algebraic Solvers
Basic Solver Usage
Use SNESSetFromOptions() so that everything is set dynamicallySet the type
Use -snes_type (or take the default)Set the preconditioner
Use -npc_snes_type (or take the default)Override the tolerances
Use -snes_rtol and -snes_atolView the solver to make sure you have the one you expect
Use -snes_viewFor debugging, monitor the residual decrease
Use -snes_monitorUse -ksp_monitor to see the underlying linear solver
M. Knepley Robust PASI ’11 148 / 231
Linear Algebra and Solvers Algebraic Solvers
3rd Party Solvers in PETSc
Complete table of solvers1 Sequential LU
ILUDT (SPARSEKIT2, Yousef Saad, U of MN)EUCLID & PILUT (Hypre, David Hysom, LLNL)ESSL (IBM)SuperLU (Jim Demmel and Sherry Li, LBNL)MatlabUMFPACK (Tim Davis, U. of Florida)LUSOL (MINOS, Michael Saunders, Stanford)
2 Parallel LUMUMPS (Patrick Amestoy, IRIT)SPOOLES (Cleve Ashcroft, Boeing)SuperLU_Dist (Jim Demmel and Sherry Li, LBNL)
3 Parallel CholeskyDSCPACK (Padma Raghavan, Penn. State)MUMPS (Patrick Amestoy, Toulouse)CHOLMOD (Tim Davis, Florida)
4 XYTlib - parallel direct solver (Paul Fischer and Henry Tufo, ANL)M. Knepley Robust PASI ’11 149 / 231
http://www.mcs.anl.gov/petsc/petsc-as/documentation/linearsolvertable.html
Linear Algebra and Solvers Algebraic Solvers
3rd Party Preconditioners in PETSc
Complete table of solvers1 Parallel ICC
BlockSolve95 (Mark Jones and Paul Plassman, ANL)2 Parallel ILU
PaStiX (Faverge Mathieu, INRIA)3 Parallel Sparse Approximate Inverse
Parasails (Hypre, Edmund Chow, LLNL)SPAI 3.0 (Marcus Grote and Barnard, NYU)
4 Sequential Algebraic MultigridRAMG (John Ruge and Klaus Steuben, GMD)SAMG (Klaus Steuben, GMD)
5 Parallel Algebraic MultigridPrometheus (Mark Adams, PPPL)BoomerAMG (Hypre, LLNL)ML (Trilinos, Ray Tuminaro and Jonathan Hu, SNL)
M. Knepley Robust PASI ’11 149 / 231
http://www.mcs.anl.gov/petsc/petsc-as/documentation/linearsolvertable.html
Linear Algebra and Solvers SNES
Outline
3 Linear Algebra and SolversVector AlgebraMatrix AlgebraAlgebraic SolversSNESDAPCFieldSplit
M. Knepley Robust PASI ’11 150 / 231
Linear Algebra and Solvers SNES
Flow Control for a PETSc Application
Timestepping Solvers (TS)
Preconditioners (PC)
Nonlinear Solvers (SNES)
Linear Solvers (KSP)
FunctionEvaluation Postprocessing
JacobianEvaluation
ApplicationInitialization
Main Routine
PETSc
M. Knepley Robust PASI ’11 151 / 231
Linear Algebra and Solvers SNES
SNES Paradigm
The SNES interface is based upon callback functionsFormFunction(), set by SNESSetFunction()
FormJacobian(), set by SNESSetJacobian()
When PETSc needs to evaluate the nonlinear residual F (x),Solver calls the user’s function
User function gets application state through the ctx variablePETSc never sees application data
M. Knepley Robust PASI ’11 152 / 231
Linear Algebra and Solvers SNES
Topology Abstractions
DMDAAbstracts Cartesian grids in any dimensionSupports stencils, communication, reorderingNice for simple finite differences
DMMeshAbstracts general topology in any dimensionAlso supports partitioning, distribution, and global ordersAllows aribtrary element shapes and discretizations
M. Knepley Robust PASI ’11 153 / 231
Linear Algebra and Solvers SNES
Assembly Abstractions
DMAbstracts the logic of multilevel (multiphysics) methodsManages allocation and assembly of local and global structuresInterfaces to PCMG solver
PetscSectionAbstracts functions over a topologyManages allocation and assembly of local and global structuresWill merge with DM somehow
M. Knepley Robust PASI ’11 154 / 231
Linear Algebra and Solvers SNES
SNES Function
User provided function calculates the nonlinear residual:
PetscErrorCode ( * func ) (SNES snes , Vec x , Vec r , vo id * c t x )
x: The current solutionr: The residual
ctx: The user context passed to SNESSetFunction()Use this to pass application information, e.g. physical constants
M. Knepley Robust PASI ’11 155 / 231
Linear Algebra and Solvers SNES
SNES Jacobian
User provided function calculates the Jacobian:
PetscErrorCode ( * func ) (SNES snes , Vec x , Mat * J , Mat *M, vo id * c t x )
x: The current solutionJ: The JacobianM: The Jacobian preconditioning matrix (possibly J itself)
ctx: The user context passed to SNESSetJacobian()Use this to pass application information, e.g. physical constants
Alternatively, you can usematrix-free finite difference approximation, -snes_mffinite difference approximation with coloring, -snes_fd
M. Knepley Robust PASI ’11 156 / 231
Linear Algebra and Solvers SNES
SNES Variants
Picard iteration
Line search/Trust region strategies
Quasi-Newton
Nonlinear CG/GMRES
Nonlinear GS/ASM
Nonlinear Multigrid (FAS)
Variational inequality approaches
M. Knepley Robust PASI ’11 157 / 231
Linear Algebra and Solvers SNES
Finite Difference Jacobians
PETSc can compute and explicitly store a Jacobian via 1st-order FDDense
Activated by -snes_fdComputed by SNESDefaultComputeJacobian()
Sparse via colorings (default)Coloring is created by MatFDColoringCreate()Computed by SNESDefaultComputeJacobianColor()
Can also use Matrix-free Newton-Krylov via 1st-order FDActivated by -snes_mf without preconditioningActivated by -snes_mf_operator with user-definedpreconditioning
Uses preconditioning matrix from SNESSetJacobian()
M. Knepley Robust PASI ’11 158 / 231
Linear Algebra and Solvers SNES
SNES ExampleDriven Cavity
Velocity-vorticity formulationFlow driven by lid and/or bouyancyLogically regular grid
Parallelized with DMDA
Finite difference discretizationAuthored by David Keyes
$PETSC_DIR/src/snes/examples/tutorials/ex19.c
M. Knepley Robust PASI ’11 159 / 231
http://www.mcs.anl.gov/petsc/petsc-current/src/snes/examples/tutorials/ex19.c.html
Linear Algebra and Solvers SNES
Driven Cavity Application Context
typedef s t r u c t {/ *−−−−− basic a p p l i c a t i o n data −−−−−* /PetscReal l i d _ v e l o c i t y ;PetscReal p r a n d t lPetscReal grashof ;PetscBool draw_contours ;
} AppCtx ;
$PETSC_DIR/src/snes/examples/tutorials/ex19.c
M. Knepley Robust PASI ’11 160 / 231
http://www.mcs.anl.gov/petsc/petsc-current/src/snes/examples/tutorials/ex19.c.html
Linear Algebra and Solvers SNES
Driven Cavity Residual Evaluation
Residual (SNES snes , Vec X, Vec F , vo id * p t r ) {AppCtx * user = ( AppCtx * ) p t r ;
/ * l o c a l s t a r t i n g and ending g r i d po in t s * /Pe tsc In t i s t a r t , iend , j s t a r t , jend ;PetscScalar * f ; / * l o c a l vec to r data * /PetscReal grashof = user−>grashof ;PetscReal p r a n d t l = user−>p r a n d t l ;PetscErrorCode i e r r ;
/ * Code to communicate non loca l ghost po i n t data * /VecGetArray (F , & f ) ;/ * Code to compute l o c a l f u n c t i o n components * /VecRestoreArray (F , & f ) ;r e t u r n 0 ;
}
$PETSC_DIR/src/snes/examples/tutorials/ex19.c
M. Knepley Robust PASI ’11 161 / 231
http://www.mcs.anl.gov/petsc/petsc-current/src/snes/examples/tutorials/ex19.c.html
Linear Algebra and Solvers SNES
Better Driven Cavity Residual Evaluation
ResLocal (DMDALocalInfo * in fo ,PetscScalar * * x , PetscScalar * * f , vo id * c tx )
{f o r ( j =