
  • Building Robust Scientific Codes

    Matthew Knepley

    Computation Institute, University of Chicago

    Scientific Computing in the Americas: The Challenge of Massive Parallelism

    Valparaiso, Chile, January 2011

    M. Knepley Robust PASI ’11 1 / 231

  • Tools and Infrastructure

    Outline

    1 Tools and Infrastructure: Version Control, Configuration and Build, PETSc, numpy, sympy, petsc4py, PyCUDA, FEniCS

    2 GPU Computing

    3 Linear Algebra and Solvers

    4 First Lab: Assembling a Code

    5 Second Lab: Debugging and Performance Benchmarking

    M. Knepley Robust PASI ’11 2 / 231

  • Tools and Infrastructure

    What I Need From You

    Tell me if you do not understand
    Tell me if an example does not work
    Suggest better wording or figures
    Followup problems at [email protected]

    M. Knepley Robust PASI ’11 3 / 231


  • Tools and Infrastructure

    Ask Questions!!!

    Helps me understand what you are missing

    Helps you clarify misunderstandings

    Helps others with the same question

    M. Knepley Robust PASI ’11 4 / 231

  • Tools and Infrastructure

    How We Can Help at the Tutorial

    Point out relevant documentation
    Quickly answer questions
    Help install
    Guide design of large scale codes
    Answer email at [email protected]

    M. Knepley Robust PASI ’11 5 / 231


  • Tools and Infrastructure

    New Model for Scientific Software

    Simplifying Parallelization of Scientific Codes by a Function-Centric Approach in Python

    Jon K. Nilsen, Xing Cai, Bjorn Hoyland, and Hans Petter Langtangen

    Python at the application level
    numpy for data structures
    petsc4py for linear algebra and solvers
    PyCUDA for integration (physics) and assembly

    M. Knepley Robust PASI ’11 6 / 231

    http://arxiv.org/abs/1002.0705

  • Tools and Infrastructure

    New Model for Scientific Software

    Figure: Schematic for a generic scientific application: the Application layer sits on FFC/SyFi and sympy (equation definition and symbolics), numpy (data structures), petsc4py (solvers), and PyCUDA (integration/assembly), which in turn sit on PETSc, CUDA, and OpenCL.

    M. Knepley Robust PASI ’11 7 / 231

  • Tools and Infrastructure

    What is Missing from this Scheme?

    Unstructured graph traversal: Iteration over cells in FEM
      Use a copy via numpy, use a kernel via Queue
    (Transitive) Closure of a vertex
      Use a visitor and copy via numpy
    Depth First Search
      Hell if I know
    Logic in computation: Limiters in FV methods
      Can sometimes use tricks for branchless logic (see the sketch below)
    Flux Corrected Transport for shock capturing
      Maybe use WENO schemes, which can be branchless
    Boundary conditions
      Restrict branching to PETSc C numbering and assembly calls
    Audience???

    M. Knepley Robust PASI ’11 8 / 231
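    Where the slide mentions tricks for branchless logic, a minimal numpy sketch of one such trick follows; the minmod limiter here is only an illustrative choice, not something prescribed by the slides.

    import numpy as np

    # A branchless minmod limiter: sign arithmetic replaces if/else tests,
    # so the whole computation stays inside vectorized numpy kernels.
    def minmod(a, b):
        s = 0.5 * (np.sign(a) + np.sign(b))          # 0 when the slopes disagree in sign
        return s * np.minimum(np.abs(a), np.abs(b))  # otherwise the smaller magnitude

    print(minmod(np.array([1.0, -2.0, 3.0]), np.array([0.5, 1.0, 4.0])))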


  • Tools and Infrastructure Version Control

    Outline

    1 Tools and Infrastructure: Version Control, Configuration and Build, PETSc, numpy, sympy, petsc4py, PyCUDA, FEniCS

    M. Knepley Robust PASI ’11 9 / 231

  • Tools and Infrastructure Version Control

    Location and Retrieval: “Where’s the Tarball”

    Version Control: Mercurial, Git, Subversion

    Hosting: BitBucket, GitHub, Launchpad

    Community involvement: arXiv, PubMed

    M. Knepley Robust PASI ’11 10 / 231

    http://mercurial.selenic.com
    http://git-scm.com
    http://subversion.tigris.org
    http://bitbucket.org
    http://github.com
    https://launchpad.net
    http://arXiv.org
    http://www.ncbi.nlm.nih.gov/pubmed

  • Tools and Infrastructure Version Control

    Distributed Version Control

    CVS/SVN manage a single repository
      Versioned data
      Local copy for modification and checkin

    Mercurial manages many repositories
      Identified by URLs
      No one Master

    Repositories communicate by ChangeSets
      Use push and pull to move changesets
      Can move arbitrary changes with patch queues

    M. Knepley Robust PASI ’11 11 / 231

  • Tools and Infrastructure Version Control

    Project Workflow

    Figure: Single Repository (User)

    M. Knepley Robust PASI ’11 12 / 231

  • Tools and Infrastructure Version Control

    Project Workflow

    Figure: Master Repository with User Clones (Master; User A, User B)

    M. Knepley Robust PASI ’11 13 / 231

  • Tools and Infrastructure Version Control

    Project Workflow

    Figure: Project with Release and Bugfix Repositories (Master, Release, User, Bugfix)

    M. Knepley Robust PASI ’11 14 / 231

  • Tools and Infrastructure Configuration and Build

    Outline

    1 Tools and Infrastructure: Version Control, Configuration and Build, PETSc, numpy, sympy, petsc4py, PyCUDA, FEniCS

    M. Knepley Robust PASI ’11 15 / 231

  • Tools and Infrastructure Configuration and Build

    Configuration and Build: “It won’t run on my iPhone”

    Portability: PETSc BuildSystem, autoconf

    Dependencies: Does this work with UnsupportedGradStudentAMG?

    Configurable build: the build must integrate with the configuration system (CMake, SCons)

    M. Knepley Robust PASI ’11 16 / 231

    http://petsc.cs.iit.edu/petsc/BuildSystem
    http://www.gnu.org/software/autoconf
    http://www.cmake.org
    http://www.scons.org

  • Tools and Infrastructure Configuration and Build

    BuildSystem

    Provides tools for Configuration and Build

    Dependency tracking and analysis
    Package management and hierarchy
    Library of standard tests
    Standard build rules
    Automatic package build and integration

    http://petsc.cs.iit.edu/petsc/BuildSystem
    http://petsc.cs.iit.edu/petsc/SimpleConfigure

    M. Knepley Robust PASI ’11 17 / 231


  • Tools and Infrastructure Configuration and Build

    Configure: Modules

    BuildSystem.config.base configures a specific functionality

    Entry points:
      setupHelp()
      setupDependencies()
      configure()

    Builtin capabilities:
      Preprocessing, compilation, linking, running
      Manages languages
      Checks for executables

    Output types:
      Define, typedef, or prototype
      Make macro or rule
      Substitution (old-style)

    M. Knepley Robust PASI ’11 18 / 231

  • Tools and Infrastructure Configuration and Build

    Configure: Framework

    BuildSystem.config.framework manages the configure run

    Manages configure modules
      Dependencies with DAG, require()
      Options table
      Initialization, run, cleanup

    Outputs
      Configure headers and log
      Make variables and rules
      Pickled configure tree

    M. Knepley Robust PASI ’11 19 / 231

  • Tools and Infrastructure Configuration and Build

    Configure: Third Party Packages

    BuildSystem.config.package manages other packages

    BuildSystem/config/packages/* examples (MPI, FIAT, etc.)
    Standard location and install hooks
    Standard header and library tests
    Uniform interface for parameter retrieval
    Special support for GNU packages

    M. Knepley Robust PASI ’11 20 / 231

  • Tools and Infrastructure Configuration and Build

    Configure: Build Integration

    A module can declare a dependency using:

    fw = self.framework
    self.mpi = fw.require('config.packages.MPI', self)

    so that MPI is configured before self. Information is retrieved during configure():

    if self.mpi.found:
      include.extend(self.mpi.include)
      libs.extend(self.mpi.lib)

    M. Knepley Robust PASI ’11 21 / 231


  • Tools and Infrastructure Configuration and Build

    Configure: Build Integration

    A build system can acquire the information using:

    class ConfigReader(script.Script):
      def __init__(self):
        import RDict
        argDB = RDict.RDict(None, None, 0, 0)
        argDB.saveFilename = os.path.join('path', 'RDict.db')
        argDB.load()
        script.Script.__init__(self, argDB = argDB)
        return

      def getMPIModule(self):
        self.setup()
        fw = self.loadConfigure()
        mpi = fw.require('config.packages.MPI', None)
        return mpi

    M. Knepley Robust PASI ’11 22 / 231

  • Tools and Infrastructure Configuration and Build

    Make

    GNU Make automates a package build

    Has a single predicate, older-than

    Executes shell code for actions

    PETSc has support for
      configuration integration
      automatic compilation

    Alternatives: SCons, CMake

    M. Knepley Robust PASI ’11 23 / 231

    http://www.gnu.org/make
    http://www.scons.org
    http://www.cmake.org

  • Tools and Infrastructure Configuration and Build

    builder

    Simple replacement for GNU make

    Excellent configure integration

    User-defined predicates

    Dependency analysis and tracking

    Python actions

    Support for test execution

    M. Knepley Robust PASI ’11 24 / 231

  • Tools and Infrastructure Configuration and Build

    builder: Two Interfaces

    The simple interface handles the entire build:
      ./config/builder.py

    A more flexible front end allows finer control:
      ./config/builder2.py help [command]
      ./config/builder2.py clean
      ./config/builder2.py stubs fortran
      ./config/builder2.py build [src/snes/interface/snesj.c]
      ./config/builder2.py check [src/snes/examples/tutorials/ex10.c]

    M. Knepley Robust PASI ’11 25 / 231

  • Tools and Infrastructure Configuration and Build

    Testing: “They are identical in the eyeball norm”

    Unit tests: cppUnit

    Regression tests: buildbot

    Benchmarks: Cigma

    M. Knepley Robust PASI ’11 26 / 231

    http://sourceforge.net/projects/cppunit
    http://buildbot.net
    http://www.geodynamics.org/cig/software/packages/cs/cigma

  • Tools and Infrastructure PETSc

    Outline

    1 Tools and Infrastructure: Version Control, Configuration and Build, PETSc, numpy, sympy, petsc4py, PyCUDA, FEniCS

    M. Knepley Robust PASI ’11 27 / 231

  • Tools and Infrastructure PETSc

    How did PETSc Originate?

    PETSc was developed as a Platform for Experimentation

    We want to experiment with different Models, Discretizations, Solvers, and Algorithms

    which blur these boundaries

    M. Knepley Robust PASI ’11 28 / 231

    http://amzn.com/0521602866

  • Tools and Infrastructure PETSc

    The Role of PETSc

    Developing parallel, nontrivial PDE solvers that deliver high performance is still difficult and requires months (or even years) of concentrated effort.

    PETSc is a toolkit that can ease these difficulties and reduce the development time, but it is not a black-box PDE solver, nor a silver bullet.
    — Barry Smith

    M. Knepley Robust PASI ’11 29 / 231

    http://www.mcs.anl.gov/~bsmith

  • Tools and Infrastructure PETSc

    Advice from Bill Gropp

    You want to think about how you decompose your data structures, how you think about them globally. [...] If you were building a house, you’d start with a set of blueprints that give you a picture of what the whole house looks like. You wouldn’t start with a bunch of tiles and say, “Well, I’ll put this tile down on the ground, and then I’ll find a tile to go next to it.” But all too many people try to build their parallel programs by creating the smallest possible tiles and then trying to have the structure of their code emerge from the chaos of all these little pieces. You have to have an organizing principle if you’re going to survive making your code parallel.

    (http://www.rce-cast.com/Podcast/rce-28-mpich2.html)

    M. Knepley Robust PASI ’11 30 / 231


  • Tools and Infrastructure PETSc

    What is PETSc?

    A freely available and supported researchcode for the parallel solution of nonlinearalgebraic equations

    Free
      Download from http://www.mcs.anl.gov/petsc
      Free for everyone, including industrial users

    Supported
      Hyperlinked manual, examples, and manual pages for all routines
      Hundreds of tutorial-style examples
      Support via email: [email protected]

    Usable from C, C++, Fortran 77/90, Matlab, Julia, and Python

    M. Knepley Robust PASI ’11 31 / 231


  • Tools and Infrastructure PETSc

    What is PETSc?

    Portable to any parallel system supporting MPI, including:
      Tightly coupled systems: Cray XT6, BG/Q, NVIDIA Fermi, K Computer
      Loosely coupled systems, such as networks of workstations: IBM, Mac, iPad/iPhone, PCs running Linux or Windows

    PETSc History
      Begun September 1991
      Over 60,000 downloads since 1995 (version 2)
      Currently 400 per month

    PETSc Funding and Support
      Department of Energy: SciDAC, MICS Program, AMR Program, INL Reactor Program
      National Science Foundation: CIG, CISE, Multidisciplinary Challenge Program

    M. Knepley Robust PASI ’11 32 / 231

  • Tools and Infrastructure PETSc

    Timeline

    Figure: PETSc timeline, 1991 to 2010: PETSc-1, PETSc-2, and PETSc-3 releases alongside MPI-1 and MPI-2, with developers joining over time (Barry, Bill, Lois, Satish, Dinesh, Hong, Kris, Matt, Victor, Dmitry, Lisandro, Jed, Shri, Peter)

    M. Knepley Robust PASI ’11 33 / 231

  • Tools and Infrastructure PETSc

    What Can We Handle?

    PETSc has run implicit problems with over 500 billion unknowns
      UNIC on BG/P and XT5
      PFLOTRAN for flow in porous media

    PETSc has run on over 290,000 cores efficiently
      UNIC on the IBM BG/P Jugene at Jülich
      PFLOTRAN on the Cray XT5 Jaguar at ORNL

    PETSc applications have run at 23% of peak (600 Teraflops)
      Jed Brown on NERSC Edison
      HPGMG code

    M. Knepley Robust PASI ’11 34 / 231

    https://hpgmg.org/


  • Tools and Infrastructure numpy

    Outline

    1 Tools and Infrastructure: Version Control, Configuration and Build, PETSc, numpy, sympy, petsc4py, PyCUDA, FEniCS

    M. Knepley Robust PASI ’11 35 / 231

  • Tools and Infrastructure numpy

    numpy

    numpy is ideal for building Python data structures

    Supports multidimensional arrays
    Easily interfaces with C/C++ and Fortran
    High performance BLAS/LAPACK and functional operations
    Python 2 and 3 compatible
    Used by petsc4py to talk to PETSc
    A short usage sketch follows below.

    M. Knepley Robust PASI ’11 36 / 231

    http://numpy.scipy.org
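    A minimal numpy sketch of the points above (multidimensional arrays plus a BLAS-backed operation); the array sizes here are arbitrary illustrations.

    import numpy as np

    # A 3D array of cell data, reshaped and fed to a BLAS-backed matrix product
    data = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
    A = data.reshape(6, 4)          # a view, no copy
    B = np.random.rand(4, 5)
    C = A.dot(B)                    # BLAS dgemm under the hood
    print(C.shape)                  # (6, 5)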

  • Tools and Infrastructure sympy

    Outline

    1 Tools and Infrastructure: Version Control, Configuration and Build, PETSc, numpy, sympy, petsc4py, PyCUDA, FEniCS

    M. Knepley Robust PASI ’11 37 / 231

  • Tools and Infrastructure sympy

    sympy

    sympy is useful for symbolic manipulation

    Interacts with numpy
    Derivatives and integrals
    Series expansions
    Equation simplification
    Small and open source
    A short usage sketch follows below.

    M. Knepley Robust PASI ’11 38 / 231

    http://sympy.scipy.org
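    A minimal sympy sketch of the capabilities listed above (differentiation, integration, series expansion, simplification); the particular expression is just an illustration.

    import sympy as sp

    x = sp.Symbol('x')
    f = sp.sin(x) * sp.exp(x)
    print(sp.diff(f, x))                              # derivative
    print(sp.integrate(f, x))                         # antiderivative
    print(sp.series(f, x, 0, 4))                      # series expansion about x = 0
    print(sp.simplify(sp.sin(x)**2 + sp.cos(x)**2))   # -> 1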

  • Tools and Infrastructure sympy

    sympy: Example of Series Transform

    Create the shifted polynomial

      ∑_{i=0}^{order} (c_i / i!) (x − a)^i

    def constructShiftedPolynomial(order):
      from sympy import Symbol, collect, diff, limit
      from sympy import factorial as f
      c = [Symbol('c'+str(i)) for i in range(order)]
      g = sum([c[i]*(x - a)**i/f(i) for i in range(order)])
      # Convert to a monomial
      g = collect(g.expand(), x)
      return c, g

    M. Knepley Robust PASI ’11 39 / 231

  • Tools and Infrastructure sympy

    sympy: Example of Series Transform

    Here is the shifted polynomial for order 5:c0 - a*c1 + c2*a**2/2 - c3*a**3/6 + c4*a**4/24+ x*(c1 - a*c2 + c3*a**2/2 - c4*a**3/6)+ x**2*(c2/2 - a*c3/2 + c4*a**2/4)+ x**3*(c3/6 - a*c4/6)+ c4*x**4/24

    M. Knepley Robust PASI ’11 40 / 231

  • Tools and Infrastructure sympy

    sympy: Example of Series Transform

    Construct the matrix transform from

      ∑_{i=0}^{order} (c_i / i!) (x − a)^i   to   ∑_{i=0}^{order} (c_i / i!) x^i

    def constructTransformMatrix(order = 5):
      from sympy import diff, limit
      c, g = constructShiftedPolynomial(order)
      M = []
      for o in range(order):
        exp = g.diff(x, o).limit(x, 0)
        M.append([exp.diff(c[p]) for p in range(order)])
      return M

    M. Knepley Robust PASI ’11 41 / 231

  • Tools and Infrastructure sympy

    sympy: Example of Series Transform

    Here is the transform matrix M:

      1   -a   a^2/2   -a^3/6   a^4/24
      0    1   -a       a^2/2   -a^3/6
      0    0    1      -a        a^2/2
      0    0    0       1       -a
      0    0    0       0        1

    M. Knepley Robust PASI ’11 42 / 231
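    A small check of the matrix above; this is a sketch assuming the closed form M[i][j] = (−a)^(j−i)/(j−i)! that the displayed matrix suggests, and the direct construction is only for illustration, not part of the original code.

    from sympy import Symbol, factorial

    a = Symbol('a')
    order = 5
    # Entry (i, j) of the transform matrix, read off from the slide above
    M = [[(-a)**(j - i) / factorial(j - i) if j >= i else 0
          for j in range(order)] for i in range(order)]
    print(M[0])   # [1, -a, a**2/2, -a**3/6, a**4/24]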

  • Tools and Infrastructure petsc4py

    Outline

    1 Tools and Infrastructure: Version Control, Configuration and Build, PETSc, numpy, sympy, petsc4py, PyCUDA, FEniCS

    M. Knepley Robust PASI ’11 43 / 231

  • Tools and Infrastructure petsc4py

    petsc4py

    petsc4py provides Python bindings for PETSc

    Provides ALL PETSc functionality in a Pythonic way
      Logging using the Python with statement

    Can use Python callback functions
      SNESSetFunction(), SNESSetJacobian()

    Manages all memory (creation/destruction)

    Visualization with matplotlib

    A minimal usage sketch follows below.

    M. Knepley Robust PASI ’11 44 / 231

    http://code.google.com/p/petsc4py/
    http://matplotlib.sourceforge.net
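    A minimal petsc4py sketch along the lines above; it is a sketch assuming a standard petsc4py install, solving a small tridiagonal system with KSP.

    import sys
    import petsc4py
    petsc4py.init(sys.argv)
    from petsc4py import PETSc

    n = 10
    A = PETSc.Mat().createAIJ([n, n], nnz=3)   # sparse AIJ matrix, 3 nonzeros per row
    for i in range(n):
        A.setValue(i, i, 2.0)
        if i > 0:     A.setValue(i, i - 1, -1.0)
        if i < n - 1: A.setValue(i, i + 1, -1.0)
    A.assemble()

    b = A.createVecRight(); b.set(1.0)
    x = A.createVecLeft()

    ksp = PETSc.KSP().create()                 # Krylov solver, configurable from the command line
    ksp.setOperators(A)
    ksp.setFromOptions()
    ksp.solve(b, x)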

  • Tools and Infrastructure petsc4py

    petsc4py Installation

    Automatic
      pip install --install-options=--user petsc4py
      Uses $PETSC_DIR and $PETSC_ARCH
      Installed into $HOME/.local
      No additions to PYTHONPATH

    From Source
      virtualenv python-env
      source ./python-env/bin/activate
      Now everything installs into your proxy Python environment
      hg clone https://petsc4py.googlecode.com/hg petsc4py-dev
      ARCHFLAGS="-arch x86_64" python setup.py sdist
      ARCHFLAGS="-arch x86_64" pip install dist/petsc4py-1.1.2.tar.gz
      ARCHFLAGS only necessary on Mac OSX

    M. Knepley Robust PASI ’11 45 / 231

  • Tools and Infrastructure petsc4py

    petsc4py Examples

    externalpackages/petsc4py-1.1/demo/bratu2d/bratu2d.py

    Solves Bratu equation (SNES ex5) in 2D

    Visualizes solution with matplotlib

    src/ts/examples/tutorials/ex8.py

    Solves a 1D ODE for a diffusive process

    Visualize solution using -vec_view_draw

    Control timesteps with -ts_max_steps

    M. Knepley Robust PASI ’11 46 / 231

    http://www.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/src/snes/examples/tutorials/ex5.c.html

  • Tools and Infrastructure PyCUDA

    Outline

    1 Tools and Infrastructure: Version Control, Configuration and Build, PETSc, numpy, sympy, petsc4py, PyCUDA, FEniCS

    M. Knepley Robust PASI ’11 47 / 231

  • Tools and Infrastructure PyCUDA

    PyCUDA and PyOpenCL

    Python packages by Andreas Klöckner for embedded GPU programming

    Handles unimportant details automatically
      CUDA compile and caching of objects
      Device initialization
      Loading modules onto card

    Excellent Documentation & Tutorial

    Excellent platform for Metaprogramming
      Only way to get portable performance
      Road to FLAME-type reasoning about algorithms

    A minimal PyCUDA sketch follows below.

    M. Knepley Robust PASI ’11 48 / 231

    http://mathema.tician.de/aboutme
    http://documen.tician.de/pycuda
    http://arxiv.org/abs/0911.3456
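    A minimal PyCUDA sketch of the compile-and-launch workflow described above, close to the standard PyCUDA tutorial example; the kernel and array size are illustrative only.

    import numpy as np
    import pycuda.autoinit                  # device initialization handled for us
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void scale(float *a) {
      int idx = threadIdx.x + blockIdx.x*blockDim.x;
      a[idx] *= 2.0f;
    }
    """)
    scale = mod.get_function("scale")

    a = np.random.randn(256).astype(np.float32)
    a_gpu = cuda.mem_alloc(a.nbytes)
    cuda.memcpy_htod(a_gpu, a)              # host -> device
    scale(a_gpu, block=(256, 1, 1), grid=(1, 1))
    result = np.empty_like(a)
    cuda.memcpy_dtoh(result, a_gpu)         # device -> host
    print(np.allclose(result, 2*a))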

  • Tools and Infrastructure PyCUDA

    Code Template

    ${pb.globalMod(isGPU)} void kernel(${pb.gridSize(isGPU)} float *output) {
      ${pb.gridLoopStart(isGPU, load, store)}
      ${pb.threadLoopStart(isGPU, blockDimX)}
      float G[${dim*dim}] = {${','.join(['3.0']*(dim*dim))}};
      float K[${dim*dim}] = {${','.join(['3.0']*(dim*dim))}};
      float product = 0.0;
      const int Ooffset = gridIdx*${numThreads};
      // Contract G and K
      %for n in range(numLocalElements):
      %for alpha in range(dim):
      %for beta in range(dim):
      product += G[${gIdx}]*K[${kIdx}];
      %endfor
      %endfor
      %endfor
      output[Ooffset+idx] = product;
      ${pb.threadLoopEnd(isGPU)}
      ${pb.gridLoopEnd(isGPU)}
      return;
    }

    M. Knepley Robust PASI ’11 49 / 231

  • Tools and Infrastructure PyCUDA

    Rendering a Template

    We render code template into strings using a dictionary of inputs.

    args = {'dim':              self.dim,
            'numLocalElements': 1,
            'numThreads':       self.threadBlockSize}

    kernelTemplate = self.getKernelTemplate()
    gpuCode = kernelTemplate.render(isGPU = True,  **args)
    cpuCode = kernelTemplate.render(isGPU = False, **args)

    M. Knepley Robust PASI ’11 50 / 231
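    The render(...) calls above look like the Mako templating API; the following standalone sketch assumes Mako and a toy template, purely to illustrate how one template yields both GPU and CPU variants.

    from mako.template import Template

    kernelTemplate = Template("""
    % if isGPU:
    __global__ void kernel(float *output) {
    % else:
    void kernel(int numInvocations, float *output) {
    % endif
      float G[${dim*dim}];
    }
    """)
    print(kernelTemplate.render(isGPU=True,  dim=3))
    print(kernelTemplate.render(isGPU=False, dim=3))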

  • Tools and Infrastructure PyCUDA

    GPU Source Code

    __global__ void kernel(float *output) {
      const int gridIdx = blockIdx.x + blockIdx.y*gridDim.x;
      const int idx = threadIdx.x + threadIdx.y*1;  // This is (i,j)
      float G[9] = {3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0};
      float K[9] = {3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0};
      float product = 0.0;
      const int Ooffset = gridIdx*1;
      // Contract G and K
      product += G[0]*K[0];
      product += G[1]*K[1];
      product += G[2]*K[2];
      product += G[3]*K[3];
      product += G[4]*K[4];
      product += G[5]*K[5];
      product += G[6]*K[6];
      product += G[7]*K[7];
      product += G[8]*K[8];
      output[Ooffset+idx] = product;
      return;
    }

    M. Knepley Robust PASI ’11 51 / 231

  • Tools and Infrastructure PyCUDA

    CPU Source Code

    void kernel(int numInvocations, float *output) {
      for (int gridIdx = 0; gridIdx < numInvocations; ++gridIdx) {
        for (int i = 0; i < 1; ++i) {
          for (int j = 0; j < 1; ++j) {
            const int idx = i + j*1;  // This is (i,j)
            float G[9] = {3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0};
            float K[9] = {3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0};
            float product = 0.0;
            const int Ooffset = gridIdx*1;
            // Contract G and K
            product += G[0]*K[0];
            product += G[1]*K[1];
            product += G[2]*K[2];
            product += G[3]*K[3];
            product += G[4]*K[4];
            product += G[5]*K[5];
            product += G[6]*K[6];
            product += G[7]*K[7];
            product += G[8]*K[8];
            output[Ooffset+idx] = product;
          }
        }
      }
      return;
    }

    M. Knepley Robust PASI ’11 52 / 231

  • Tools and Infrastructure PyCUDA

    Creating a Module

    CPU:

    # Output kernel and C support code
    self.outputKernelC(cpuCode)
    self.writeMakefile()
    out, err, status = self.executeShellCommand('make')

    GPU:

    from pycuda.compiler import SourceModule

    mod = SourceModule(gpuCode)
    self.kernel = mod.get_function('kernel')
    self.kernelReport(self.kernel, 'kernel')

    M. Knepley Robust PASI ’11 53 / 231

  • Tools and Infrastructure PyCUDA

    Executing a Module

    import pycuda.driver as cuda
    import pycuda.autoinit

    blockDim = (self.dim, self.dim, 1)
    start = cuda.Event()
    end = cuda.Event()
    grid = self.calculateGrid(N, numLocalElements)
    start.record()
    for i in range(iters):
      self.kernel(cuda.Out(output), block = blockDim, grid = grid)
    end.record()
    end.synchronize()
    gpuTimes.append(start.time_till(end)*1e-3/iters)

    M. Knepley Robust PASI ’11 54 / 231

  • Tools and Infrastructure FEniCS

    Outline

    1 Tools and Infrastructure: Version Control, Configuration and Build, PETSc, numpy, sympy, petsc4py, PyCUDA, FEniCS

    M. Knepley Robust PASI ’11 55 / 231

  • Tools and Infrastructure FEniCS

    FIAT

    Finite Element Integrator And Tabulator by Rob Kirby

    http://www.fenics.org/fiat

    FIAT understands
      Reference element shapes (line, triangle, tetrahedron)
      Quadrature rules
      Polynomial spaces
      Functionals over polynomials (dual spaces)
      Derivatives

    User can build arbitrary elements by specifying the Ciarlet triple (K, P, P′)

    FIAT is part of the FEniCS project, as is the PETSc Sieve module
    A short usage sketch follows below.

    M. Knepley Robust PASI ’11 56 / 231

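    A minimal FIAT sketch matching the list above; this is a sketch assuming a recent FIAT where ufc_simplex and Lagrange are available, and the element order and points are arbitrary choices.

    from FIAT import ufc_simplex, Lagrange

    cell = ufc_simplex(2)                 # reference triangle
    element = Lagrange(cell, 2)           # P2 Lagrange element
    points = [(0.0, 0.0), (0.5, 0.0), (1.0/3.0, 1.0/3.0)]
    tab = element.tabulate(1, points)     # basis values and first derivatives
    print(tab[(0, 0)].shape)              # (number of basis functions, number of points)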

  • Tools and Infrastructure FEniCS

    FFC

    FFC is a compiler for variational forms by Anders Logg.

    Here is a mixed-form Poisson equation:

    a((τ, w), (σ, u)) = L((τ, w))   ∀ (τ, w) ∈ V

    where

    a((τ, w), (σ, u)) = ∫_Ω τ · σ − (∇ · τ) u + w (∇ · σ) dx

    L((τ, w)) = ∫_Ω w f dx

    M. Knepley Robust PASI ’11 57 / 231

  • Tools and Infrastructure FEniCS

    FFC: Mixed Poisson

    shape = "triangle"

    BDM1 = FiniteElement("Brezzi-Douglas-Marini", shape, 1)
    DG0  = FiniteElement("Discontinuous Lagrange", shape, 0)

    element = BDM1 + DG0
    (tau, w)   = TestFunctions(element)
    (sigma, u) = TrialFunctions(element)

    a = (dot(tau, sigma) - div(tau)*u + w*div(sigma))*dx

    f = Function(DG0)
    L = w*f*dx

    M. Knepley Robust PASI ’11 58 / 231

  • Tools and Infrastructure FEniCS

    FFC

    Here is a discontinuous Galerkin formulation of the Poisson equation:

    a(v, u) = L(v)   ∀ v ∈ V

    where

    a(v, u) = ∫_Ω ∇u · ∇v dx
            + ∑_S ∫_S −⟨∇v⟩ · [[u]]_n − [[v]]_n · ⟨∇u⟩ − (α/h) v u dS
            + ∫_∂Ω −∇v · [[u]]_n − [[v]]_n · ∇u − (γ/h) v u ds

    L(v) = ∫_Ω v f dx

    M. Knepley Robust PASI ’11 59 / 231

  • Tools and Infrastructure FEniCS

    FFC: DG Poisson

    DG1 = FiniteElement("Discontinuous Lagrange", shape, 1)
    v = TestFunction(DG1)
    u = TrialFunction(DG1)
    f = Function(DG1)
    g = Function(DG1)
    n = FacetNormal("triangle")
    h = MeshSize("triangle")
    a = dot(grad(v), grad(u))*dx \
      - dot(avg(grad(v)), jump(u, n))*dS \
      - dot(jump(v, n), avg(grad(u)))*dS \
      + alpha/h*dot(jump(v, n), jump(u, n))*dS \
      - dot(grad(v), jump(u, n))*ds \
      - dot(jump(v, n), grad(u))*ds \
      + gamma/h*v*u*ds

    L = v*f*dx + v*g*ds

    M. Knepley Robust PASI ’11 60 / 231

  • Tools and Infrastructure FEniCS

    Big Picture

    Usability is paramount
      Need community buy-in
      Need a complete workflow

    Leverage existing systems
      Adoption is much easier with the familiar: arXiv, package managers

    M. Knepley Robust PASI ’11 61 / 231

    http://arXiv.org

  • GPU Computing

    Outline

    1 Tools and Infrastructure

    2 GPU Computing: FEM-GPU, PETSc-GPU

    3 Linear Algebra and Solvers

    4 First Lab: Assembling a Code

    5 Second Lab: Debugging and Performance Benchmarking

    M. Knepley Robust PASI ’11 62 / 231

  • GPU Computing

    Collaborators

    Dr. Andy Terrel (FEniCS)
      Dept. of Computer Science, University of Texas
      Texas Advanced Computing Center, University of Texas

    Prof. Andreas Klöckner (PyCUDA)
      Courant Institute of Mathematical Sciences, New York University

    Dr. Brad Aagaard (PyLith)
      United States Geological Survey, Menlo Park, CA

    Dr. Charles Williams (PyLith)
      GNS Science, Wellington, NZ

    M. Knepley Robust PASI ’11 63 / 231

    http://andy.terrel.us/Professional/
    http://mathema.tician.de/aboutme/
    http://profile.usgs.gov/baagaard
    http://w3.geodynamics.org/cig/Members/willic3

  • GPU Computing FEM-GPU

    Outline

    2 GPU Computing: FEM-GPU, PETSc-GPU

    M. Knepley Robust PASI ’11 64 / 231

  • GPU Computing FEM-GPU

    What are the Benefits for current PDE Code?

    Low Order FEM on GPUs

    Analytic Flexibility

    Computational Flexibility

    Efficiency

    http://www.bitbucket.org/aterrel/flamefem

    M. Knepley Robust PASI ’11 65 / 231


  • GPU Computing FEM-GPU

    Analytic Flexibility: Laplacian

        ∫_T ∇φ_i(x) · ∇φ_j(x) dx   (1)

    element = FiniteElement('Lagrange', tetrahedron, 1)
    v = TestFunction(element)
    u = TrialFunction(element)
    a = inner(grad(v), grad(u))*dx

    M. Knepley Robust PASI ’11 66 / 231


  • GPU Computing FEM-GPU

    Analytic Flexibility: Linear Elasticity

        (1/4) ∫_T (∇φ_i(x) + ∇^T φ_i(x)) : (∇φ_j(x) + ∇^T φ_j(x)) dx   (2)

    element = VectorElement('Lagrange', tetrahedron, 1)
    v = TestFunction(element)
    u = TrialFunction(element)
    a = inner(sym(grad(v)), sym(grad(u)))*dx

    M. Knepley Robust PASI ’11 67 / 231


  • GPU Computing FEM-GPU

    Analytic Flexibility: Full Elasticity

        (1/4) ∫_T (∇φ_i(x) + ∇^T φ_i(x)) : C : (∇φ_j(x) + ∇^T φ_j(x)) dx   (3)

    element = VectorElement('Lagrange', tetrahedron, 1)
    cElement = TensorElement('Lagrange', tetrahedron, 1,
                             (dim, dim, dim, dim))
    v = TestFunction(element)
    u = TrialFunction(element)
    C = Coefficient(cElement)
    i, j, k, l = indices(4)
    a = sym(grad(v))[i, j]*C[i, j, k, l]*sym(grad(u))[k, l]*dx

    Currently broken in FEniCS release

    M. Knepley Robust PASI ’11 68 / 231


  • GPU Computing FEM-GPU

    Form Decomposition

    Element integrals are decomposed into analytic and geometric parts:

    ∫_T ∇φ_i(x) · ∇φ_j(x) dx                                                      (4)
      = ∫_T ∂φ_i(x)/∂x_α ∂φ_j(x)/∂x_α dx                                          (5)
      = ∫_{T_ref} ∂ξ_β/∂x_α ∂φ_i(ξ)/∂ξ_β ∂ξ_γ/∂x_α ∂φ_j(ξ)/∂ξ_γ |J| dx            (6)
      = ∂ξ_β/∂x_α ∂ξ_γ/∂x_α |J| ∫_{T_ref} ∂φ_i(ξ)/∂ξ_β ∂φ_j(ξ)/∂ξ_γ dx            (7)
      = G^{βγ}(T) K^{ij}_{βγ}                                                     (8)

    Coefficients are also put into the geometric part.

    M. Knepley Robust PASI ’11 69 / 231
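    A small numpy sketch of the final contraction G^{βγ}(T) K^{ij}_{βγ} in (8); the tensor shapes (a P1 triangle: 3 basis functions, 2 dimensions) and the random values are illustrative assumptions only.

    import numpy as np

    nbf, dim = 3, 2                           # P1 triangle: 3 basis functions, 2D
    K = np.random.rand(nbf, nbf, dim, dim)    # analytic tensor K^{ij}_{beta gamma}
    G = np.random.rand(dim, dim)              # geometry tensor G^{beta gamma}(T)
    E = np.einsum('bg,ijbg->ij', G, K)        # element matrix E^{ij}
    print(E.shape)                            # (3, 3)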

  • GPU Computing FEM-GPU

    Form Decomposition

    Additional fields give rise to multilinear forms.

    ∫_T φ_i(x) · (φ_k(x) ∇φ_j(x)) dA                                              (9)
      = ∫_T φ^β_i(x) (φ^α_k(x) ∂φ^β_j(x)/∂x_α) dA                                 (10)
      = ∫_{T_ref} φ^β_i(ξ) φ^α_k(ξ) ∂ξ_γ/∂x_α ∂φ^β_j(ξ)/∂ξ_γ |J| dA               (11)
      = ∂ξ_γ/∂x_α |J| ∫_{T_ref} φ^β_i(ξ) φ^α_k(ξ) ∂φ^β_j(ξ)/∂ξ_γ dA               (12)
      = G^{αγ}(T) K^{ijk}_{αγ}                                                    (13)

    The index calculus is fully developed by Kirby and Logg in A Compiler for Variational Forms.

    M. Knepley Robust PASI ’11 70 / 231

    http://www.fenics.org/pub/documents/ffc/papers/ffc-toms-2005.pdf

  • GPU Computing FEM-GPU

    Form Decomposition

    Isoparametric Jacobians also give rise to multilinear forms

    ∫_T ∇φ_i(x) · ∇φ_j(x) dA                                                      (14)
      = ∫_T ∂φ_i(x)/∂x_α ∂φ_j(x)/∂x_α dA                                          (15)
      = ∫_{T_ref} ∂ξ_β/∂x_α ∂φ_i(ξ)/∂ξ_β ∂ξ_γ/∂x_α ∂φ_j(ξ)/∂ξ_γ |J| dA            (16)
      = |J| ∫_{T_ref} φ_k J^{βα}_k ∂φ_i(ξ)/∂ξ_β φ_l J^{γα}_l ∂φ_j(ξ)/∂ξ_γ dA      (17)
      = J^{βα}_k J^{γα}_l |J| ∫_{T_ref} φ_k ∂φ_i(ξ)/∂ξ_β φ_l ∂φ_j(ξ)/∂ξ_γ dA      (18)
      = G^{βγ}_{kl}(T) K^{ijkl}_{βγ}                                              (19)

    A different space could also be used for Jacobians

    M. Knepley Robust PASI ’11 71 / 231

  • GPU Computing FEM-GPU

    Weak Form Processing

    from ffc.analysis import analyze_forms
    from ffc.compiler import compute_ir

    parameters = ffc.default_parameters()
    parameters['representation'] = 'tensor'
    analysis = analyze_forms([a, L], {}, parameters)
    ir = compute_ir(analysis, parameters)

    a_K = ir[2][0]['AK'][0][0]
    a_G = ir[2][0]['AK'][0][1]

    K = a_K.A0.astype(numpy.float32)
    G = a_G

    M. Knepley Robust PASI ’11 72 / 231

  • GPU Computing FEM-GPU

    Computational Flexibility

    We generate different computations on the fly, and can change:
      Element Batch Size
      Number of Concurrent Elements
      Loop unrolling
      Interleaving stores with computation

    M. Knepley Robust PASI ’11 73 / 231

  • GPU Computing FEM-GPU

    Computational Flexibility: Basic Contraction

    Figure: Tensor Contraction G^{βγ}(T) K^{ij}_{βγ} (thread labels 0, 5, 15 shown in the original animation)

    M. Knepley Robust PASI ’11 74 / 231

  • GPU Computing FEM-GPU

    Computational Flexibility: Element Batch Size

    Figure: Tensor Contraction G^{βγ}(T) K^{ij}_{βγ} with a batch of geometry tensors G0-G3 contracted against K (thread labels 0, 5, 15 shown in the original animation)

    M. Knepley Robust PASI ’11 75 / 231

  • GPU Computing FEM-GPU

    Computational Flexibility: Concurrent Elements

    Figure: Tensor Contraction G^{βγ}(T) K^{ij}_{βγ} with two sets of geometry tensors, G00-G03 and G10-G13 (thread labels 0 through 31 shown in the original animation)

    M. Knepley Robust PASI ’11 76 / 231

  • GPU Computing FEM-GPU

    Computational Flexibility: Loop Unrolling

    /* G K contraction: unroll = full */
    E[0] += G[0]*K[0];
    E[0] += G[1]*K[1];
    E[0] += G[2]*K[2];
    E[0] += G[3]*K[3];
    E[0] += G[4]*K[4];
    E[0] += G[5]*K[5];
    E[0] += G[6]*K[6];
    E[0] += G[7]*K[7];
    E[0] += G[8]*K[8];

    M. Knepley Robust PASI ’11 77 / 231

  • GPU Computing FEM-GPU

    Computational Flexibility: Loop Unrolling

    /* G K contraction: unroll = none */
    for (int b = 0; b < 1; ++b) {
      const int n = b*1;
      for (int alpha = 0; alpha < 3; ++alpha) {
        for (int beta = 0; beta < 3; ++beta) {
          E[b] += G[n*9+alpha*3+beta] * K[alpha*3+beta];
        }
      }
    }

    M. Knepley Robust PASI ’11 78 / 231

  • GPU Computing FEM-GPU

    Computational Flexibility: Interleaving Stores

    /* G K contraction: unroll = none */
    for (int b = 0; b < 4; ++b) {
      const int n = b*1;
      for (int alpha = 0; alpha < 3; ++alpha) {
        for (int beta = 0; beta < 3; ++beta) {
          E[b] += G[n*9+alpha*3+beta] * K[alpha*3+beta];
        }
      }
    }
    /* Store contraction results */
    elemMat[Eoffset+idx+0]  = E[0];
    elemMat[Eoffset+idx+16] = E[1];
    elemMat[Eoffset+idx+32] = E[2];
    elemMat[Eoffset+idx+48] = E[3];

    M. Knepley Robust PASI ’11 79 / 231

  • GPU Computing FEM-GPU

    Computational Flexibility: Interleaving Stores

    n = 0;
    for (int alpha = 0; alpha < 3; ++alpha) {
      for (int beta = 0; beta < 3; ++beta) {
        E += G[n*9+alpha*3+beta] * K[alpha*3+beta];
      }
    }
    /* Store contraction result */
    elemMat[Eoffset+idx+0] = E;
    n = 1; E = 0.0; /* contract */
    elemMat[Eoffset+idx+16] = E;
    n = 2; E = 0.0; /* contract */
    elemMat[Eoffset+idx+32] = E;
    n = 3; E = 0.0; /* contract */
    elemMat[Eoffset+idx+48] = E;

    M. Knepley Robust PASI ’11 80 / 231


  • GPU Computing FEM-GPU

    Element Matrix Formation

    Element matrix K is now made up of small tensors. Contract each tensor element with the geometry tensor G(T).

    Figure: The element matrix K displayed as a grid of small tensors (2×2 blocks of integer entries), each to be contracted with G(T)

    M. Knepley Robust PASI ’11 87 / 231

  • GPU Computing FEM-GPU

    Mapping G^{αβ} K^{ij}_{αβ} to the GPU: Problem Division

    For N elements, map blocks of NL elements to each Thread Block (TB)
      Launch grid must be gx × gy = N/NL
      TB grid will depend on the specific algorithm
      Output is size Nbasis × Nbasis × NL

    We can split a TB to work on multiple, NB, elements at a time
      Note that each TB always gets NL elements, so NB must divide NL

    A short arithmetic sketch of this bookkeeping follows below.

    M. Knepley Robust PASI ’11 88 / 231
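    A small Python sketch of the bookkeeping above; the values of N, NL, NB, and Nbasis are made-up examples, chosen so the divisions come out even.

    # Illustrative problem-division arithmetic (values are arbitrary examples)
    N, NL, NB, Nbasis = 100352, 128, 4, 3

    numTB = N // NL                      # one thread block per NL elements
    gx = 28; gy = numTB // gx            # launch grid with gx*gy == N/NL
    assert gx * gy == numTB
    assert NL % NB == 0                  # NB must divide NL

    outputPerTB = Nbasis * Nbasis * NL   # element-matrix entries produced per thread block
    print(numTB, (gx, gy), outputPerTB)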

  • GPU Computing FEM-GPU

    Mapping G^{αβ} K^{ij}_{αβ} to the GPU: Kernel Arguments

    __global__
    void integrateJacobian(float *elemMat,
                           float *geometry,
                           float *analytic)

    geometry: Array of G tensors for each element

    analytic: K tensor

    elemMat: Array of E = G : K tensors for each element

    M. Knepley Robust PASI ’11 89 / 231

  • GPU Computing FEM-GPU

    Mapping G^{αβ} K^{ij}_{αβ} to the GPU: Memory Movement

    We can interleave stores with computation, or wait until the end
      Waiting could improve coalescing of writes
      Interleaving could allow overlap of writes with computation

    Also need to
      Coalesce accesses between global and local/shared memory (use moveArray())
      Limit use of shared and local memory

    M. Knepley Robust PASI ’11 90 / 231

  • GPU Computing FEM-GPU

    Memory Bandwidth

    Superior GPU memory bandwidth is due to both bus width and clock speed.

                              CPU    GPU
    Bus Width (bits)           64    512
    Bus Clock Speed (MHz)     400   1600
    Memory Bandwidth (GB/s)     3    102
    Latency (cycles)          240    600

    Tesla always accesses blocks of 64 or 128 bytes

    M. Knepley Robust PASI ’11 91 / 231
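    A quick arithmetic check of the table above; this sketch assumes peak bandwidth is roughly bus width (in bytes) times effective clock rate, which reproduces the ~3 GB/s and ~102 GB/s figures.

    # Peak bandwidth ~= (bus width in bytes) * (effective clock rate)
    cpu_gb_s = (64 / 8) * 400e6 / 1e9      # 3.2 GB/s
    gpu_gb_s = (512 / 8) * 1600e6 / 1e9    # 102.4 GB/s
    print(cpu_gb_s, gpu_gb_s)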

  • GPU Computing FEM-GPU

    Mapping G^{αβ} K^{ij}_{αβ} to the GPU: Reduction

    Choose strategies to minimize reductions
      The only reductions occur in the summation for contractions
      Similar to the reduction in a quadrature loop

    Strategy #1: Each thread uses all of K

    Strategy #2: Do each contraction in a separate thread

    M. Knepley Robust PASI ’11 92 / 231

  • GPU Computing FEM-GPU

    Strategy #1: TB Division

    Each thread computes an entire element matrix, so that

      blockDim = (NL/NB, 1, 1)

    We will see that there is little opportunity to overlap computation and memory access

    M. Knepley Robust PASI ’11 93 / 231

  • GPU Computing FEM-GPU

    Strategy #1: Analytic Part

    Read K into shared memory (need to synchronize before access)

    __shared__ float K[${dim*dim*numBasisFuncs*numBasisFuncs}];
    ${fm.moveArray('K', 'analytic',
                   dim*dim*numBasisFuncs*numBasisFuncs, '', numThreads)}
    __syncthreads();

    M. Knepley Robust PASI ’11 94 / 231

  • GPU Computing FEM-GPU

    Strategy #1: Geometry

    Each thread handles NB elements
      Read G into local memory (not coalesced)
      Interleaving means writing after each thread does a single element matrix calculation

    float G[${dim*dim*numBlockElements}];

    if (interleaved) {
      const int Goffset = (gridIdx*${numLocalElements} + idx)*${dim*dim};
      %for n in range(numBlockElements):
      ${fm.moveArray('G', 'geometry', dim*dim, 'Goffset',
                     blockNumber = n*numLocalElements/numBlockElements,
                     localBlockNumber = n, isCoalesced = False)}
      %endfor
    } else {
      const int Goffset = (gridIdx*${numLocalElements/numBlockElements} + idx)
                          *${dim*dim*numBlockElements};
      ${fm.moveArray('G', 'geometry', dim*dim*numBlockElements, 'Goffset',
                     isCoalesced = False)}
    }

    M. Knepley Robust PASI ’11 95 / 231

  • GPU Computing FEM-GPU

    Strategy #1: Output

    We write element matrices out contiguously by TB

    const int matSize = numBasisFuncs*numBasisFuncs;
    const int Eoffset = gridIdx*matSize*numLocalElements;

    if (interleaved) {
      const int elemOff = idx*matSize;
      __shared__ float E[matSize*numLocalElements/numBlockElements];
    } else {
      const int elemOff = idx*matSize*numBlockElements;
      __shared__ float E[matSize*numLocalElements];
    }

    M. Knepley Robust PASI ’11 96 / 231

  • GPU Computing FEM-GPU

    Strategy #1: Contraction

    matSize = numBasisFuncs*numBasisFuncs
    if interleaveStores:
      for b in range(numBlockElements):
        # Do 1 contraction for each thread
        __syncthreads();
        fm.moveArray('E', 'elemMat',
                     matSize*numLocalElements/numBlockElements,
                     'Eoffset', numThreads, blockNumber = n, isLoad = 0)
    else:
      # Do numBlockElements contractions for each thread
      __syncthreads();
      fm.moveArray('E', 'elemMat',
                   matSize*numLocalElements,
                   'Eoffset', numThreads, isLoad = 0)

    M. Knepley Robust PASI ’11 97 / 231

  • GPU Computing FEM-GPU

    Strategy #2: TB Division

    Each thread computes a single element of an element matrix, so that

      blockDim = (Nbasis, Nbasis, NB)

    This allows us to overlap computation of another element in the TB with writes for the first.

    M. Knepley Robust PASI ’11 98 / 231

  • GPU Computing FEM-GPU

    Strategy #2: Analytic Part

    Assign an (i, j) block of K to local memory
      NB threads will simultaneously calculate a contraction

    const int Kidx = threadIdx.x + threadIdx.y*${numBasisFuncs};  // This is (i,j)
    const int idx = Kidx + threadIdx.z*${numBasisFuncs*numBasisFuncs};
    const int Koffset = Kidx*${dim*dim};
    float K[${dim*dim}];
    %for alpha in range(dim):
    %for beta in range(dim):
    K[${kIdx}] = analytic[Koffset+${kIdx}];
    %endfor
    %endfor

    M. Knepley Robust PASI ’11 99 / 231

  • GPU Computing FEM-GPU

    Strategy #2: Geometry

    Store NL G tensors into shared memory
      Interleaving means writing after each thread does a single element calculation

    const int Goffset = gridIdx*${dim*dim*numLocalElements};
    __shared__ float G[${dim*dim*numLocalElements}];

    ${fm.moveArray('G', 'geometry', dim*dim*numLocalElements,
                   'Goffset', numThreads)}
    __syncthreads();

    M. Knepley Robust PASI ’11 100 / 231

  • GPU Computing FEM-GPU

    Strategy #2: Output

    We write element matrices out contiguously by TB
    If interleaving stores, only need a single product
    Otherwise, need NL/NB, one per element processed by a thread

    const int matSize = numBasisFuncs*numBasisFuncs;
    const int Eoffset = gridIdx*matSize*numLocalElements;

    if (interleaved) {
      float product = 0.0;
      const int elemOff = idx*matSize;
    } else {
      float product[numLocalElements/numBlockElements];
      const int elemOff = idx*matSize*numBlockElements;
    }

    M. Knepley Robust PASI ’11 101 / 231

  • GPU Computing FEM-GPU

    Strategy #2: Contraction

    if interleaveStores:
      for n in range(numLocalElements/numBlockElements):
        # Do 1 contraction for each thread
        __syncthreads()
        # Do coalesced write of element matrix
        elemMat[Eoffset + idx + n*numThreads] = product
    else:
      # Do numLocalElements/numBlockElements contractions
      # save results in product[]
      for n in range(numLocalElements/numBlockElements):
        elemMat[Eoffset + idx + n*numThreads] = product[n]

    M. Knepley Robust PASI ’11 102 / 231

  • GPU Computing FEM-GPU

    Results: GTX 285
    [Performance plots (pp. 103-106): baseline runs, and runs with 2 and 4 simultaneous elements]

    Performance
    [Plots (pp. 107-110): influence of element batch sizes and influence of code structure]

    M. Knepley Robust PASI ’11 103-110 / 231

  • GPU Computing PETSc-GPU

    Outline

    2 GPU Computing
        FEM-GPU
        PETSc-GPU

    M. Knepley Robust PASI ’11 111 / 231

  • GPU Computing PETSc-GPU

    Thrust

    Thrust is a CUDA library of parallel algorithms

    Interface similar to C++ Standard Template Library

    Containers (vector) on both host and device

    Algorithms: sort, reduce, scan

    Freely available, part of PETSc configure (-with-thrust-dir)

    Included as part of CUDA 4.0 installation

    M. Knepley Robust PASI ’11 112 / 231

    http://code.google.com/p/thrust/

  • GPU Computing PETSc-GPU

    Cusp

    Cusp is a CUDA library for sparse linear algebra and graph computations

    Builds on data structures in Thrust

    Provides sparse matrices in several formats (CSR, Hybrid)

    Includes some preliminary preconditioners (Jacobi, SA-AMG)

    Freely available, part of PETSc configure (-with-cusp-dir)

    M. Knepley Robust PASI ’11 113 / 231

    http://code.google.com/p/cusp-library/

  • GPU Computing PETSc-GPU

    VECCUDA

    Strategy: Define a new Vec implementation

    Uses Thrust for data storage and operations on GPU

    Supports full PETSc Vec interface

    Inherits PETSc scalar type

    Can be activated at runtime, -vec_type cuda

    PETSc provides memory coherence mechanism

    M. Knepley Robust PASI ’11 114 / 231

    http://code.google.com/p/thrust/
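    As a rough sketch of the runtime switch described above (error checking is omitted, and object destruction signatures differ slightly between PETSc releases), the same vector code runs on the CPU by default and on the GPU when launched with -vec_type cuda:

    #include <petscvec.h>

    int main(int argc, char **argv)
    {
      Vec       x;
      PetscReal norm;

      PetscInitialize(&argc, &argv, NULL, NULL);
      VecCreate(PETSC_COMM_WORLD, &x);
      VecSetSizes(x, PETSC_DECIDE, 1000000);
      VecSetFromOptions(x);        /* honors -vec_type cuda at runtime */
      VecSet(x, 1.0);
      VecNorm(x, NORM_2, &norm);   /* executed on the GPU when the cuda type is active */
      PetscPrintf(PETSC_COMM_WORLD, "norm = %g\n", (double) norm);
      VecDestroy(&x);
      PetscFinalize();
      return 0;
    }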

  • GPU Computing PETSc-GPU

    Memory Coherence

    PETSc Objects now hold a coherence flag

    PETSC_CUDA_UNALLOCATED   No allocation on the GPU
    PETSC_CUDA_GPU           Values on GPU are current
    PETSC_CUDA_CPU           Values on CPU are current
    PETSC_CUDA_BOTH          Values on both are current

    Table: Flags used to indicate the memory state of a PETSc CUDA Vec object.

    M. Knepley Robust PASI ’11 115 / 231

  • GPU Computing PETSc-GPU

    MATAIJCUDA

    Also define new Mat implementations

    Uses Cusp for data storage and operations on GPU

    Supports full PETSc Mat interface, some ops on CPU

    Can be activated at runtime, -mat_type aijcuda

    Notice that parallel matvec necessitates off-GPU data transfer

    M. Knepley Robust PASI ’11 116 / 231

    http://code.google.com/p/cusp-library/
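    A minimal fragment illustrating the same runtime activation for matrices (N is assumed to be defined elsewhere; the option name follows the slide and has changed across PETSc versions):

    Mat A;

    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N);
    MatSetFromOptions(A);   /* -mat_type aijcuda selects the GPU format at runtime */
    /* assemble with MatSetValues()/MatAssemblyBegin()/MatAssemblyEnd() as usual */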

  • GPU Computing PETSc-GPU

    Solvers

    Solvers come for free:
    Preliminary Implementation of PETSc Using GPU,
    Minden, Smith, Knepley, 2010

    All linear algebra types work with solvers

    Entire solve can take place on the GPU
        Only communicate scalars back to CPU

    GPU communication cost could be amortized over several solves

    Preconditioners are a problem
        Cusp has a promising AMG

    M. Knepley Robust PASI ’11 117 / 231

    http://www.mcs.anl.gov/uploads/cels/papers/P1787.pdf

  • GPU Computing PETSc-GPU

    Installation

    PETSc only needs:
    # Turn on CUDA
    --with-cuda
    # Specify the CUDA compiler
    --with-cudac='nvcc -m64'
    # Indicate the location of packages
    # --download-* will also work soon
    --with-thrust-dir=/PETSc3/multicore/thrust
    --with-cusp-dir=/PETSc3/multicore/cusp
    # Can also use double precision
    --with-precision=single

    M. Knepley Robust PASI ’11 118 / 231

  • GPU Computing PETSc-GPU

    Example: Driven Cavity Velocity-Vorticity with Multigrid

    ex50 -da_vec_type seqcusp
         -da_mat_type aijcusp -mat_no_inode   # Setup types
         -da_grid_x 100 -da_grid_y 100        # Set grid size
         -pc_type none -pc_mg_levels 1        # Setup solver
         -preload off -cuda_synchronize       # Setup run
         -log_summary

    M. Knepley Robust PASI ’11 119 / 231

  • Linear Algebra and Solvers

    Outline

    1 Tools and Infrastructure

    2 GPU Computing

    3 Linear Algebra and Solvers
        Vector Algebra
        Matrix Algebra
        Algebraic Solvers
        SNES
        DA
        PCFieldSplit

    4 First Lab: Assembling a Code

    5 Second Lab: Debugging and Performance Benchmarking

    M. Knepley Robust PASI ’11 120 / 231

  • Linear Algebra and Solvers Vector Algebra

    Outline

    3 Linear Algebra and Solvers
        Vector Algebra
        Matrix Algebra
        Algebraic Solvers
        SNES
        DA
        PCFieldSplit

    M. Knepley Robust PASI ’11 121 / 231

  • Linear Algebra and Solvers Vector Algebra

    Vector Algebra

    What are PETSc vectors?

    Fundamental objects representing
        solutions
        right-hand sides
        coefficients

    Each process locally owns a subvector of contiguous global data

    M. Knepley Robust PASI ’11 122 / 231

  • Linear Algebra and Solvers Vector Algebra

    Vector Algebra

    How do I create vectors?

    VecCreate(MPI_Comm, Vec*)

    VecSetSizes(Vec, PetscInt n, PetscInt N)

    VecSetType(Vec, VecType typeName)

    VecSetFromOptions(Vec)

    Can set the type at runtime

    M. Knepley Robust PASI ’11 123 / 231
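    Putting the calls above together, a minimal creation sketch (error checking omitted) looks like:

    Vec x;

    VecCreate(PETSC_COMM_WORLD, &x);
    VecSetSizes(x, PETSC_DECIDE, 100);   /* let PETSc choose the local size, global size 100 */
    VecSetFromOptions(x);                /* e.g. -vec_type seq, mpi, or cuda */

    Leaving the type to VecSetFromOptions() is what allows the same code to use the GPU vector class shown earlier without recompilation.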

  • Linear Algebra and Solvers Vector Algebra

    Vector Algebra

    A PETSc Vec

    Supports all vector space operations
        VecDot(), VecNorm(), VecScale()

    Has a direct interface to the values
        VecGetArray(), VecGetArrayF90()

    Has unusual operations
        VecSqrtAbs(), VecStrideGather()

    Communicates automatically during assembly
    Has customizable communication (PetscSF, VecScatter)

    M. Knepley Robust PASI ’11 124 / 231

  • Linear Algebra and Solvers Vector Algebra

    Parallel Assembly: Vectors and Matrices

    Processes may set an arbitrary entry
        Must use the proper interface

    Entries need not be generated locally
        Local meaning the process on which they are stored

    PETSc automatically moves data if necessary
        Happens during the assembly phase

    M. Knepley Robust PASI ’11 125 / 231

  • Linear Algebra and Solvers Vector Algebra

    Vector Assembly

    A three step process
        Each process sets or adds values
        Begin communication to send values to the correct process
        Complete the communication

    VecSetValues(Vec v, PetscInt n, PetscInt rows[], PetscScalar values[], InsertMode mode)

    Mode is either INSERT_VALUES or ADD_VALUES
    Two phases allow overlap of communication and computation

    VecAssemblyBegin(Vec v)
    VecAssemblyEnd(Vec v)

    M. Knepley Robust PASI ’11 126 / 231

  • Linear Algebra and Solvers Vector Algebra

    One Way to Set the Elements of a Vector

    VecGetSize(x, &N);
    MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
    if (rank == 0) {
      val = 0.0;
      for (i = 0; i < N; ++i) {
        VecSetValues(x, 1, &i, &val, INSERT_VALUES);
        val += 10.0;
      }
    }
    /* These routines ensure that the data is
       distributed to the other processes */
    VecAssemblyBegin(x);
    VecAssemblyEnd(x);

    M. Knepley Robust PASI ’11 127 / 231

  • Linear Algebra and Solvers Vector Algebra

    A Better Way to Set the Elements of a Vector

    VecGetOwnershipRange(x, &low, &high);
    val = low*10.0;
    for (i = low; i < high; ++i) {
      VecSetValues(x, 1, &i, &val, INSERT_VALUES);
      val += 10.0;
    }
    /* No data will be communicated here */
    VecAssemblyBegin(x);
    VecAssemblyEnd(x);

    M. Knepley Robust PASI ’11 128 / 231

  • Linear Algebra and Solvers Vector Algebra

    Selected Vector Operations

    Function Name                                     Operation
    VecAXPY(Vec y, PetscScalar a, Vec x)              y = y + a*x
    VecAYPX(Vec y, PetscScalar a, Vec x)              y = x + a*y
    VecWAXPY(Vec w, PetscScalar a, Vec x, Vec y)      w = y + a*x
    VecScale(Vec x, PetscScalar a)                    x = a*x
    VecCopy(Vec x, Vec y)                             y = x
    VecPointwiseMult(Vec w, Vec x, Vec y)             w_i = x_i * y_i
    VecMax(Vec x, PetscInt *idx, PetscReal *r)        r = max x_i
    VecShift(Vec x, PetscScalar r)                    x_i = x_i + r
    VecAbs(Vec x)                                     x_i = |x_i|
    VecNorm(Vec x, NormType type, PetscReal *r)       r = ||x||

    M. Knepley Robust PASI ’11 129 / 231
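    For example, a DAXPY-style update followed by a norm (a small sketch assuming x and y are already created and assembled):

    PetscScalar a = 2.0;
    PetscReal   norm;

    VecAXPY(y, a, x);             /* y <- y + 2*x */
    VecNorm(y, NORM_2, &norm);    /* norm = ||y||_2 */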

  • Linear Algebra and Solvers Vector Algebra

    Working With Local Vectors

    It is sometimes more efficient to directly access the local storage of a Vec.
    PETSc allows you to access the local storage with

    VecGetArray(Vec, double *[])

    You must return the array to PETSc when you finish
        VecRestoreArray(Vec, double *[])

    Allows PETSc to handle data structure conversions
    Commonly, these routines are fast and do not involve a copy

    M. Knepley Robust PASI ’11 130 / 231

  • Linear Algebra and Solvers Vector Algebra

    VecGetArray in C

    Vec          v;
    PetscScalar *array;
    PetscInt     n, i;

    VecGetArray(v, &array);
    VecGetLocalSize(v, &n);
    PetscSynchronizedPrintf(PETSC_COMM_WORLD,
      "First element of local array is %f\n", array[0]);
    PetscSynchronizedFlush(PETSC_COMM_WORLD);
    for (i = 0; i < n; ++i) {
      array[i] += (PetscScalar) rank;
    }
    VecRestoreArray(v, &array);

    M. Knepley Robust PASI ’11 131 / 231

  • Linear Algebra and Solvers Vector Algebra

    VecGetArray in F77

    #include "finclude/petsc.h"

          Vec            v
          PetscScalar    array(1)
          PetscOffset    offset
          PetscInt       n, i
          PetscErrorCode ierr

          call VecGetArray(v, array, offset, ierr)
          call VecGetLocalSize(v, n, ierr)
          do i=1,n
            array(i+offset) = array(i+offset) + rank
          end do
          call VecRestoreArray(v, array, offset, ierr)

    M. Knepley Robust PASI ’11 132 / 231

  • Linear Algebra and Solvers Vector Algebra

    VecGetArray in F90

    #include "finclude/petsc.h90"

          Vec            v
          PetscScalar, pointer :: array(:)
          PetscInt       n, i
          PetscErrorCode ierr

          call VecGetArrayF90(v, array, ierr)
          call VecGetLocalSize(v, n, ierr)
          do i=1,n
            array(i) = array(i) + rank
          end do
          call VecRestoreArrayF90(v, array, ierr)

    M. Knepley Robust PASI ’11 133 / 231

  • Linear Algebra and Solvers Vector Algebra

    VecGetArray in Python

    with v as a:
        for i in range(len(a)):
            a[i] = 5.0*i

    M. Knepley Robust PASI ’11 134 / 231

  • Linear Algebra and Solvers Vector Algebra

    DMDAVecGetArray in C

    DM              da;
    Vec             v;
    DMDALocalInfo  *info;
    PetscScalar   **array;

    DMDAVecGetArray(da, v, &array);
    for (j = info->ys; j < info->ys+info->ym; ++j) {
      for (i = info->xs; i < info->xs+info->xm; ++i) {
        u       = x[j][i];
        uxx     = (2.0*u - x[j][i-1] - x[j][i+1])*hydhx;
        uyy     = (2.0*u - x[j-1][i] - x[j+1][i])*hxdhy;
        f[j][i] = uxx + uyy;
      }
    }
    DMDAVecRestoreArray(da, v, &array);

    M. Knepley Robust PASI ’11 135 / 231

  • Linear Algebra and Solvers Matrix Algebra

    Outline

    3 Linear Algebra and Solvers
        Vector Algebra
        Matrix Algebra
        Algebraic Solvers
        SNES
        DA
        PCFieldSplit

    M. Knepley Robust PASI ’11 136 / 231

  • Linear Algebra and Solvers Matrix Algebra

    Matrix Algebra

    What are PETSc matrices?

    Fundamental objects for storing stiffness matrices and Jacobians
    Each process locally owns a contiguous set of rows
    Supports many data types
        AIJ, Block AIJ, Symmetric AIJ, Block Matrix, etc.
    Supports structures for many packages
        MUMPS, Spooles, SuperLU, UMFPack, DSCPack

    M. Knepley Robust PASI ’11 137 / 231

  • Linear Algebra and Solvers Matrix Algebra

    How do I create matrices?

    MatCreate(MPI_Comm, Mat*)

    MatSetSizes(Mat, PetscInt m, PetscInt n, PetscInt M, PetscInt N)

    MatSetType(Mat, MatType typeName)

    MatSetFromOptions(Mat)

    Can set the type at runtime

    MatSeqAIJSetPreallocation(Mat, PetscInt nz, const PetscInt nnz[])

    MatXAIJSetPreallocation(Mat, bs, dnz[], onz[], dnzu[], onzu[])

    MatSetValues(Mat, m, rows[], n, cols[], values[], InsertMode)

    MUST be used, but does automatic communication

    M. Knepley Robust PASI ’11 138 / 231
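    A hedged sketch combining creation and preallocation for the 1D Laplacian used in the examples below (N is the global size; calling both the sequential and parallel preallocation routines is a common way to cover whichever matrix type ends up active; error checking omitted):

    Mat A;

    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N);
    MatSetFromOptions(A);                            /* e.g. -mat_type aij */
    MatSeqAIJSetPreallocation(A, 3, NULL);           /* at most 3 nonzeros per row */
    MatMPIAIJSetPreallocation(A, 3, NULL, 1, NULL);  /* diagonal and off-diagonal estimates */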

  • Linear Algebra and Solvers Matrix Algebra

    Matrix Polymorphism

    The PETSc Mat has a single user interface,
        Matrix assembly
            MatSetValues()
        Matrix-vector multiplication
            MatMult()
        Matrix viewing
            MatView()
    but multiple underlying implementations.
        AIJ, Block AIJ, Symmetric Block AIJ,
        Dense
        Matrix-Free
        etc.

    A matrix is defined by its interface, not by its data structure.

    M. Knepley Robust PASI ’11 139 / 231

  • Linear Algebra and Solvers Matrix Algebra

    Matrix Assembly

    A three step process
        Each process sets or adds values
        Begin communication to send values to the correct process
        Complete the communication

    MatSetValues(Mat A, m, rows[], n, cols[], values[], mode)

    mode is either INSERT_VALUES or ADD_VALUES
        Logically dense block of values

    Two phase assembly allows overlap of communication and computation

    MatAssemblyBegin(Mat A, MatAssemblyType type)
    MatAssemblyEnd(Mat A, MatAssemblyType type)
    type is either MAT_FLUSH_ASSEMBLY or MAT_FINAL_ASSEMBLY

    M. Knepley Robust PASI ’11 140 / 231

  • Linear Algebra and Solvers Matrix Algebra

    One Way to Set the Elements of a Matrix: Simple 3-point stencil for 1D Laplacian

    v[0] = -1.0; v[1] = 2.0; v[2] = -1.0;
    if (rank == 0) {
      for (row = 0; row < N; row++) {
        cols[0] = row-1; cols[1] = row; cols[2] = row+1;
        if (row == 0) {
          MatSetValues(A, 1, &row, 2, &cols[1], &v[1], INSERT_VALUES);
        } else if (row == N-1) {
          MatSetValues(A, 1, &row, 2, cols, v, INSERT_VALUES);
        } else {
          MatSetValues(A, 1, &row, 3, cols, v, INSERT_VALUES);
        }
      }
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    M. Knepley Robust PASI ’11 141 / 231

  • Linear Algebra and Solvers Matrix Algebra

    A Better Way to Set the Elements of a Matrix: Simple 3-point stencil for 1D Laplacian

    v[0] = -1.0; v[1] = 2.0; v[2] = -1.0;
    MatGetOwnershipRange(A, &start, &end);
    for (row = start; row < end; row++) {
      cols[0] = row-1; cols[1] = row; cols[2] = row+1;
      if (row == 0) {
        MatSetValues(A, 1, &row, 2, &cols[1], &v[1], INSERT_VALUES);
      } else if (row == N-1) {
        MatSetValues(A, 1, &row, 2, cols, v, INSERT_VALUES);
      } else {
        MatSetValues(A, 1, &row, 3, cols, v, INSERT_VALUES);
      }
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    M. Knepley Robust PASI ’11 142 / 231

  • Linear Algebra and Solvers Matrix Algebra

    Why Are PETSc Matrices That Way?

    No one data structure is appropriate for all problems
        Blocked and diagonal formats provide performance benefits
        PETSc has many formats
        Makes it easy to add new data structures

    Assembly is difficult enough without worrying about partitioning
        PETSc provides parallel assembly routines
        High performance still requires making most operations local
        However, programs can be incrementally developed
        MatPartitioning and MatOrdering can help

    Matrix decomposition in contiguous chunks is simple
        Makes interoperation with other codes easier
        For other orderings, PETSc provides “Application Orderings” (AO)

    M. Knepley Robust PASI ’11 143 / 231

  • Linear Algebra and Solvers Algebraic Solvers

    Outline

    3 Linear Algebra and Solvers
        Vector Algebra
        Matrix Algebra
        Algebraic Solvers
        SNES
        DA
        PCFieldSplit

    M. Knepley Robust PASI ’11 144 / 231

  • Linear Algebra and Solvers Algebraic Solvers

    Solver Types

    Explicit:
        Field variables are updated using local neighbor information

    Semi-implicit:
        Some subsets of variables are updated with global solves
        Others with direct local updates

    Implicit:
        Most or all variables are updated in a single global solve

    M. Knepley Robust PASI ’11 145 / 231

  • Linear Algebra and Solvers Algebraic Solvers

    Linear Solvers: Krylov Methods

    Using PETSc linear algebra, just add:
        KSPSetOperators(KSP ksp, Mat A, Mat M, MatStructure flag)
        KSPSolve(KSP ksp, Vec b, Vec x)

    Can access subobjects
        KSPGetPC(KSP ksp, PC *pc)

    Preconditioners must obey PETSc interface
        Basically just the KSP interface

    Can change solver dynamically from the command line
        -ksp_type bicgstab

    M. Knepley Robust PASI ’11 146 / 231
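    A minimal solve sketch using the calls above (A, b, and x are assumed to be assembled already; the MatStructure argument was removed from KSPSetOperators() in later PETSc releases):

    KSP ksp;
    PC  pc;

    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A, SAME_NONZERO_PATTERN);
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCJACOBI);    /* or leave it to -pc_type on the command line */
    KSPSetFromOptions(ksp);     /* honors -ksp_type bicgstab, -ksp_monitor, ... */
    KSPSolve(ksp, b, x);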

  • Linear Algebra and Solvers Algebraic Solvers

    Nonlinear Solvers

    Using PETSc linear algebra, just add:
        SNESSetFunction(SNES snes, Vec r, residualFunc, void *ctx)
        SNESSetJacobian(SNES snes, Mat A, Mat M, jacFunc, void *ctx)
        SNESSolve(SNES snes, Vec b, Vec x)

    Can access subobjects
        SNESGetKSP(SNES snes, KSP *ksp)

    Can customize subobjects from the cmd line
        Set the subdomain preconditioner to ILU with -sub_pc_type ilu

    M. Knepley Robust PASI ’11 147 / 231
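    A corresponding nonlinear solve sketch (FormFunction, FormJacobian, the user context, and the vectors/matrix are assumed to be defined by the application; the Jacobian callback signature differs between PETSc releases):

    SNES snes;
    KSP  ksp;

    SNESCreate(PETSC_COMM_WORLD, &snes);
    SNESSetFunction(snes, r, FormFunction, &user);
    SNESSetJacobian(snes, J, J, FormJacobian, &user);
    SNESGetKSP(snes, &ksp);      /* customize the inner linear solver if desired */
    SNESSetFromOptions(snes);    /* honors -snes_type, -snes_monitor, ... */
    SNESSolve(snes, NULL, x);    /* NULL right-hand side; x holds the initial guess and the solution */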

  • Linear Algebra and Solvers Algebraic Solvers

    Basic Solver Usage

    Use SNESSetFromOptions() so that everything is set dynamically

    Set the type
        Use -snes_type (or take the default)
    Set the preconditioner
        Use -npc_snes_type (or take the default)
    Override the tolerances
        Use -snes_rtol and -snes_atol
    View the solver to make sure you have the one you expect
        Use -snes_view
    For debugging, monitor the residual decrease
        Use -snes_monitor
        Use -ksp_monitor to see the underlying linear solver

    M. Knepley Robust PASI ’11 148 / 231

  • Linear Algebra and Solvers Algebraic Solvers

    3rd Party Solvers in PETSc

    Complete table of solvers

    1 Sequential LU
        ILUDT (SPARSEKIT2, Yousef Saad, U of MN)
        EUCLID & PILUT (Hypre, David Hysom, LLNL)
        ESSL (IBM)
        SuperLU (Jim Demmel and Sherry Li, LBNL)
        Matlab
        UMFPACK (Tim Davis, U. of Florida)
        LUSOL (MINOS, Michael Saunders, Stanford)

    2 Parallel LU
        MUMPS (Patrick Amestoy, IRIT)
        SPOOLES (Cleve Ashcroft, Boeing)
        SuperLU_Dist (Jim Demmel and Sherry Li, LBNL)

    3 Parallel Cholesky
        DSCPACK (Padma Raghavan, Penn. State)
        MUMPS (Patrick Amestoy, Toulouse)
        CHOLMOD (Tim Davis, Florida)

    4 XYTlib - parallel direct solver (Paul Fischer and Henry Tufo, ANL)

    M. Knepley Robust PASI ’11 149 / 231

    http://www.mcs.anl.gov/petsc/petsc-as/documentation/linearsolvertable.html

  • Linear Algebra and Solvers Algebraic Solvers

    3rd Party Preconditioners in PETSc

    Complete table of solvers

    1 Parallel ICC
        BlockSolve95 (Mark Jones and Paul Plassman, ANL)

    2 Parallel ILU
        PaStiX (Faverge Mathieu, INRIA)

    3 Parallel Sparse Approximate Inverse
        Parasails (Hypre, Edmund Chow, LLNL)
        SPAI 3.0 (Marcus Grote and Barnard, NYU)

    4 Sequential Algebraic Multigrid
        RAMG (John Ruge and Klaus Steuben, GMD)
        SAMG (Klaus Steuben, GMD)

    5 Parallel Algebraic Multigrid
        Prometheus (Mark Adams, PPPL)
        BoomerAMG (Hypre, LLNL)
        ML (Trilinos, Ray Tuminaro and Jonathan Hu, SNL)

    M. Knepley Robust PASI ’11 149 / 231

    http://www.mcs.anl.gov/petsc/petsc-as/documentation/linearsolvertable.html

  • Linear Algebra and Solvers SNES

    Outline

    3 Linear Algebra and Solvers
        Vector Algebra
        Matrix Algebra
        Algebraic Solvers
        SNES
        DA
        PCFieldSplit

    M. Knepley Robust PASI ’11 150 / 231

  • Linear Algebra and Solvers SNES

    Flow Control for a PETSc Application

    [Diagram: the Main Routine, with Application Initialization and Postprocessing,
    drives the PETSc solver hierarchy (Timestepping Solvers (TS), Nonlinear Solvers (SNES),
    Linear Solvers (KSP), Preconditioners (PC)), which calls back to user-supplied
    Function Evaluation and Jacobian Evaluation]

    M. Knepley Robust PASI ’11 151 / 231

  • Linear Algebra and Solvers SNES

    SNES Paradigm

    The SNES interface is based upon callback functions
        FormFunction(), set by SNESSetFunction()
        FormJacobian(), set by SNESSetJacobian()

    When PETSc needs to evaluate the nonlinear residual F(x),
        the solver calls the user's function
        the user function gets application state through the ctx variable
        PETSc never sees application data

    M. Knepley Robust PASI ’11 152 / 231

  • Linear Algebra and Solvers SNES

    Topology Abstractions

    DMDA
        Abstracts Cartesian grids in any dimension
        Supports stencils, communication, reordering
        Nice for simple finite differences

    DMMesh
        Abstracts general topology in any dimension
        Also supports partitioning, distribution, and global orders
        Allows arbitrary element shapes and discretizations

    M. Knepley Robust PASI ’11 153 / 231

  • Linear Algebra and Solvers SNES

    Assembly Abstractions

    DM
        Abstracts the logic of multilevel (multiphysics) methods
        Manages allocation and assembly of local and global structures
        Interfaces to the PCMG solver

    PetscSection
        Abstracts functions over a topology
        Manages allocation and assembly of local and global structures
        Will merge with DM somehow

    M. Knepley Robust PASI ’11 154 / 231

  • Linear Algebra and Solvers SNES

    SNES Function

    User provided function calculates the nonlinear residual:

    PetscErrorCode (*func)(SNES snes, Vec x, Vec r, void *ctx)

    x:   The current solution
    r:   The residual
    ctx: The user context passed to SNESSetFunction()
         Use this to pass application information, e.g. physical constants

    M. Knepley Robust PASI ’11 155 / 231
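    A toy residual callback matching this prototype (the context struct and the residual formula are purely illustrative):

    typedef struct { PetscReal lambda; } AppCtx;   /* hypothetical application data */

    PetscErrorCode FormFunction(SNES snes, Vec x, Vec r, void *ctx)
    {
      AppCtx      *user = (AppCtx *) ctx;
      PetscScalar *xx, *rr;
      PetscInt     i, n;

      VecGetLocalSize(x, &n);
      VecGetArray(x, &xx);    /* x is only read here */
      VecGetArray(r, &rr);
      for (i = 0; i < n; ++i) rr[i] = xx[i]*xx[i] - user->lambda;  /* F_i(x) = x_i^2 - lambda */
      VecRestoreArray(x, &xx);
      VecRestoreArray(r, &rr);
      return 0;
    }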

  • Linear Algebra and Solvers SNES

    SNES Jacobian

    User provided function calculates the Jacobian:

    PetscErrorCode (*func)(SNES snes, Vec x, Mat *J, Mat *M, void *ctx)

    x:   The current solution
    J:   The Jacobian
    M:   The Jacobian preconditioning matrix (possibly J itself)
    ctx: The user context passed to SNESSetJacobian()
         Use this to pass application information, e.g. physical constants

    Alternatively, you can use
        a matrix-free finite difference approximation, -snes_mf
        a finite difference approximation with coloring, -snes_fd

    M. Knepley Robust PASI ’11 156 / 231
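    A matching toy Jacobian callback for the residual sketched earlier, written against the prototype on this slide (the exact signature, e.g. an extra MatStructure argument or Mat passed by value, varies between PETSc releases):

    PetscErrorCode FormJacobian(SNES snes, Vec x, Mat *J, Mat *M, void *ctx)
    {
      PetscScalar *xx, v;
      PetscInt     i, start, end;

      VecGetOwnershipRange(x, &start, &end);
      VecGetArray(x, &xx);
      for (i = start; i < end; ++i) {
        v = 2.0*xx[i-start];                               /* d(x_i^2 - lambda)/dx_i */
        MatSetValues(*M, 1, &i, 1, &i, &v, INSERT_VALUES);
      }
      VecRestoreArray(x, &xx);
      MatAssemblyBegin(*M, MAT_FINAL_ASSEMBLY);
      MatAssemblyEnd(*M, MAT_FINAL_ASSEMBLY);
      return 0;
    }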

  • Linear Algebra and Solvers SNES

    SNES Variants

    Picard iteration

    Line search/Trust region strategies

    Quasi-Newton

    Nonlinear CG/GMRES

    Nonlinear GS/ASM

    Nonlinear Multigrid (FAS)

    Variational inequality approaches

    M. Knepley Robust PASI ’11 157 / 231

  • Linear Algebra and Solvers SNES

    Finite Difference Jacobians

    PETSc can compute and explicitly store a Jacobian via 1st-order FD

    Dense
        Activated by -snes_fd
        Computed by SNESDefaultComputeJacobian()
    Sparse via colorings (default)
        Coloring is created by MatFDColoringCreate()
        Computed by SNESDefaultComputeJacobianColor()

    Can also use Matrix-free Newton-Krylov via 1st-order FD
        Activated by -snes_mf without preconditioning
        Activated by -snes_mf_operator with user-defined preconditioning
        Uses preconditioning matrix from SNESSetJacobian()

    M. Knepley Robust PASI ’11 158 / 231

  • Linear Algebra and Solvers SNES

    SNES Example: Driven Cavity

    Velocity-vorticity formulation
    Flow driven by lid and/or buoyancy
    Logically regular grid
        Parallelized with DMDA
    Finite difference discretization
    Authored by David Keyes

    $PETSC_DIR/src/snes/examples/tutorials/ex19.c

    M. Knepley Robust PASI ’11 159 / 231

    http://www.mcs.anl.gov/petsc/petsc-current/src/snes/examples/tutorials/ex19.c.html

  • Linear Algebra and Solvers SNES

    Driven Cavity Application Context

    typedef struct {
      /*----- basic application data -----*/
      PetscReal lid_velocity;
      PetscReal prandtl;
      PetscReal grashof;
      PetscBool draw_contours;
    } AppCtx;

    $PETSC_DIR/src/snes/examples/tutorials/ex19.c

    M. Knepley Robust PASI ’11 160 / 231

    http://www.mcs.anl.gov/petsc/petsc-current/src/snes/examples/tutorials/ex19.c.html

  • Linear Algebra and Solvers SNES

    Driven Cavity Residual Evaluation

    Residual(SNES snes, Vec X, Vec F, void *ptr) {
      AppCtx *user = (AppCtx *) ptr;
      /* local starting and ending grid points */
      PetscInt       istart, iend, jstart, jend;
      PetscScalar   *f;       /* local vector data */
      PetscReal      grashof = user->grashof;
      PetscReal      prandtl = user->prandtl;
      PetscErrorCode ierr;

      /* Code to communicate nonlocal ghost point data */
      VecGetArray(F, &f);
      /* Code to compute local function components */
      VecRestoreArray(F, &f);
      return 0;
    }

    $PETSC_DIR/src/snes/examples/tutorials/ex19.c

    M. Knepley Robust PASI ’11 161 / 231

    http://www.mcs.anl.gov/petsc/petsc-current/src/snes/examples/tutorials/ex19.c.html

  • Linear Algebra and Solvers SNES

    Better Driven Cavity Residual Evaluation

    ResLocal(DMDALocalInfo *info, PetscScalar **x, PetscScalar **f, void *ctx)
    {
      for (j =