Compilers and Tools at the University of Houston
Barbara Chapman University of Houston October 22, 2010
High Performance Computing and Tools Group http://www.cs.uh.edu/~hpctools
Where is Houston?
Houston is in the State of Texas on the Texas Gulf Coast
Conference Venue
Houston is Space City, the Bayou City, and Texas Cowboy Country
HOUSTON FACTS
It’s a well-known fact that everyone in Houston wears cowboy hats and boots, rides a horse to work, talks with a drawl, chews tobacco, and has an oil well in his back yard...
Actually, Houston is one of the most international, modern, and cosmopolitan cities in the United States of America. It is the fourth most populous city in the nation (following New York, Los Angeles, and Chicago). In terms of geography, Houston is the largest city in the nation, covering 618 square miles.
Of course, it is most famous for its basketball players
Agenda
OpenUH Support for OpenMP Programming
Heterogeneous Systems Today
OpenMP as a Potential Uniform API for Heterogeneous Systems
The OpenMP Shared Memory API
High-level directive-based multithreaded programming:
- The user makes the strategic decisions; the compiler figures out the details
- Threads communicate by sharing variables
- Synchronization orders accesses and prevents data conflicts
- Structured programming reduces the likelihood of bugs
#pragma omp parallel
#pragma omp for schedule(dynamic)
for (I = 0; I < N; I++) {
    NEAT_STUFF(I);
}
/* implicit barrier here */
OpenUH Compiler Infrastructure

Based on the Open64 compiler infrastructure. Source code with OpenMP directives passes through:
- FRONTENDS (C/C++, Fortran 90, OpenMP)
- IPA (Inter-Procedural Analyzer)
- OMP_PRELOWER (preprocessing of OpenMP)
- LNO (Loop Nest Optimizer)
- LOWER_MP (transformation of OpenMP)
- WOPT (global scalar optimizer)
- CG (Itanium, Opteron, Pentium), producing object files
- WHIRL2C & WHIRL2F (IR-to-source option), emitting source code with runtime library calls for compilation by a native compiler
- Linking against a portable OpenMP runtime library to produce executables

Frontends under reconstruction: OpenMP, CAF, UPC, CUDA (joint work with Tsinghua University and the Chinese Academy of Sciences)
OpenMP Implementation: All Tasks
Part of computation of gradient of hydrostatic pressure in POP code
Standard Runtime Execution Model (c stands for chunk)
Dataflow Execution Model associated with translated code
“Implementing OpenMP using Dataflow Execution Model for Data Locality and Efficient Parallel Execution”, Weng and Chapman, HIPS-7, IPDPS, 2002.
[Figure: control flow graphs for OpenMP constructs, each with a single entry and a single exit. A: an OpenMP Single construct; B: an OpenMP Sections construct with a flush inside; C: an OpenMP For construct; D: an OpenMP Critical construct. Barriers close each construct; edges are classified as sequential, parallel, or conflict edges, and M marks the must-take attribute.]
Collector API: OpenMP Performance Monitoring Interface
- OpenMP ARB-sanctioned performance monitoring interface for OpenMP
- Performance tools communicate with the OpenMP runtime library through the collector interface
- Designed to support statistical sampling
- Supports tracing with extensions
[Figure: the performance tool communicates with the compiler-translated OpenMP program through the Collector API.]
The Infrastructure of the Dragon Analysis Tool

[Figure: the Dragon tool browser sits on top of the Open64 compiler. The Front End and IPL feed program information through the IPA link phase into a program database; LNO contributes data dependence and array section information; CFG_IPL exports the control flow graph and call graph, which the Dragon executable renders via VCG (.vcg, .ps, .bmp), with feedback to WOPT/CG.]
- Scientific programmers are adapting code for GPUs; currently an intensive manual effort
- Participation in CCSM / CAM / HOMME, an important climate application for DOE that iterates over q physics tracers
- Six boundary exchange operations in HOMME

[Figure: HOMME cubed-sphere elements and domain decomposition; each element has horizontal extents np and nv, and nlev vertical levels.]
Exploiting Heterogeneous Systems
- Computations are spread over hundreds of subroutines
- Need to find strategies to map elements to nodes with different heterogeneous resources
- Implies changes in element data structure layouts and distribution that affect many routines
- Merge routines/tasks that need to be mapped to devices
HOMME Callgraph profile
Element state data structure:

type, public :: elem_state_t
   real (kind=real_kind) :: v(nv,nv,2,nlev,timelevels)
   real (kind=real_kind) :: T(np,np,nlev,timelevels)
   real (kind=real_kind) :: lnps(np,np,timelevels)
   real (kind=real_kind) :: ps_v(nv,nv,timelevels)
   real (kind=real_kind) :: phis(np,np)
   real (kind=real_kind) :: Q(nv,nv,nlev,qsize_d,timelevels)
   real (kind=real_kind) :: phi(np,np,nlev)
   real (kind=real_kind) :: grad_lnps(nv,nv,2)
   real (kind=real_kind) :: eta_dot_dpdn(nv,nv,nlevp)
   real (kind=real_kind) :: T_v(nv,nv,nlev)
   real (kind=real_kind) :: zeta(nv,nv,nlev)
   real (kind=real_kind) :: omega_p(nv,nv,nlev)
   real (kind=real_kind) :: div(nv,nv,nlev,timelevels)
end type elem_state_t
HOMME Kernel — OpenMP for multiple devices / GPU group management (2 GPUs): resources are allocated per group on each GPU, data is loaded as early as possible, calls are asynchronous, results are saved back onto the host, and resources are released on the GPUs when done.

!$omp do private(ie,dev)
do dev = 1, 2
   if (dev .eq. 1) then
      !$hmpp <cudagroup> allocate
      !$hmpp <cudagroup> hmpp_resident advancedload, args[::rdx;…;rmetdetp]
   else
      !$hmpp <cudagroup2> allocate
      !$hmpp <cudagroup2> hmpp_resident22 advancedload, args[::rdx;::rdy;::Dvv…;metdet;::rmetdetp]
   endif
   do ie = (dev-1)*(nete/2)+1, (dev-1)*(nete/2)+(nete/2), 1
      if (dev .eq. 1) then
         !$hmpp <cudagroup> hmpp_resident callsite, asynchronous
         call divergence_sphere5d_hmpp_tuned_resident(ie, qsize, nlev, nv, divdp4d_omp)
         !$hmpp <cudagroup> hmpp_resident delegatedstore, args[divdp4d_omp]
         !$hmpp <cudagroup> hmpp_resident synchronize
         if (ie .eq. 6) then
            !$hmpp <cudagroup> release
         endif
      end if ! end if dev.eq.1
      if (dev .eq. 2) then
         …
      end if ! end if dev.eq.2
   end do ! second outermost loop
end do
!$omp end do nowait
!$omp end parallel
Should we Extend OpenMP?

Extending OpenMP directives for heterogeneous computing:
- Can help avoid the complexities of multiple APIs
- Potentially reduces code modification and supports portability
- Maintenance benefit
- Might require non-trivial extensions to current OpenMP
- Must satisfy the needs of technical computing, general-purpose computing, and also embedded computing
- A number of companies are exploring this intensively
Accelerator Region (PGI) — Example

void foo(char A[], char B[], char C[], int nrows, int ncols)
{
    #pragma omp acc_region shared-copy(C), shared-const(A,B)
    {
        for (int i = 0; i < nrows; ++i)
            for (int j = 0; j < ncols; j += NLANES)
                for (int k = 0; k < NLANES; ++k) {
                    int index = (i * ncols) + j + k;
                    C[index] = A[index] + B[index];
                }
    } // end accelerate region

    print2d(C, nrows, ncols);
}
Summary

Enormous potential for joint research in areas of parallel programming models and their implementation:
- Programming interface
- Compiler
- Runtime
- Tools