Compilers and Tools at the University of Houston
Barbara Chapman University of Houston October 22, 2010
High Performance Computing and Tools Group http://www.cs.uh.edu/~hpctools
Where is Houston?
Houston is in the State of Texas on the Texas Gulf Coast
Conference Venue
Houston is Space City, the Bayou City, and Texas Cowboy Country
HOUSTON FACTS
It’s a well-known fact that everyone in Houston wears cowboy hats and boots, rides a horse to work, talks with a drawl, chews tobacco, and has an oil well in his back yard...
Actually, Houston is one of the most international, modern, and cosmopolitan cities in the United States of America. It is the fourth most populous city in the nation (following New York, Los Angeles, and Chicago). In terms of geography, Houston is the largest city in the nation, covering 618 square miles.
Of course, it is most famous for its basketball players
Agenda
OpenUH Support for OpenMP Programming
Heterogeneous Systems Today
OpenMP as a Potential Uniform API for Heterogeneous Systems
The OpenMP Shared Memory API
High-level directive-based multithreaded programming:
- The user makes the strategic decisions; the compiler figures out the details
- Threads communicate by sharing variables
- Synchronization orders accesses and prevents data conflicts
- Structured programming reduces the likelihood of bugs
#pragma omp parallel
#pragma omp for schedule(dynamic)
for (I = 0; I < N; I++) {
    NEAT_STUFF(I);
}
/* implicit barrier here */
OpenUH Compiler Infrastructure

Based on the Open64 compiler infrastructure. Source code with OpenMP directives passes through:
- FRONTENDS (C/C++, Fortran 90, OpenMP)
- IPA (Inter-Procedural Analyzer)
- OMP_PRELOWER (preprocessing of OpenMP)
- LNO (Loop Nest Optimizer)
- LOWER_MP (transformation of OpenMP)
- WOPT (global scalar optimizer)
- CG (Itanium, Opteron, Pentium), producing object files
- WHIRL2C & WHIRL2F (IR-to-source option), emitting source code with runtime library calls for compilation by a native compiler
- Linking against a portable OpenMP runtime library to produce executables

Frontends under reconstruction: OpenMP, CAF, UPC, CUDA (joint work with Tsinghua University and the Chinese Academy of Sciences)
OpenMP Implementation: All Tasks
Part of computation of gradient of hydrostatic pressure in POP code
Standard Runtime Execution Model (c stands for chunk)
Dataflow Execution Model associated with translated code
“Implementing OpenMP using Dataflow Execution Model for Data Locality and Efficient Parallel Execution”, Weng and Chapman, HIPS-7, IPDPS, 2002.
[Figure: control flow graphs for OpenMP constructs, each with a single entry and a single exit. A: an OpenMP Single construct; B: an OpenMP Sections construct with a flush inside; C: an OpenMP For construct; D: an OpenMP Critical construct. Barriers close each construct; edges are classified as sequential, parallel, or conflict edges, and M marks the must-take attribute.]
Collector API: OpenMP Performance Monitoring Interface
- OpenMP ARB-sanctioned performance monitoring interface for OpenMP
- Performance tools communicate with the OpenMP runtime library through the collector interface
- Designed to support statistical sampling
- Supports tracing with extensions
[Figure: the performance tool communicates with the compiler-translated OpenMP program through the Collector API.]
The Infrastructure of the Dragon Analysis Tool

[Figure: the Dragon tool browser sits on top of the Open64 compiler. The Front End and IPL feed program information through the IPA link phase into a program database; LNO contributes data dependence and array section information; CFG_IPL exports the control flow graph and call graph, which the Dragon executable renders via VCG (.vcg, .ps, .bmp), with feedback to WOPT/CG.]
- Scientific programmers are adapting code for GPUs; currently an intensive manual effort
- Participation in CCSM / CAM / HOMME, an important climate application for DOE that iterates over q physics tracers
- Six boundary exchange operations in HOMME

[Figure: HOMME cubed-sphere elements and domain decomposition; each element has horizontal extents np and nv, and nlev vertical levels.]
Exploiting Heterogeneous Systems
- Computations are spread over hundreds of subroutines
- Need to find strategies to map elements to nodes with different heterogeneous resources
- Implies changes in element data structure layouts and distribution that affect many routines
- Merge routines/tasks that need to be mapped to devices
HOMME Callgraph profile
Element state data structure:

type, public :: elem_state_t
   real (kind=real_kind) :: v(nv,nv,2,nlev,timelevels)
   real (kind=real_kind) :: T(np,np,nlev,timelevels)
   real (kind=real_kind) :: lnps(np,np,timelevels)
   real (kind=real_kind) :: ps_v(nv,nv,timelevels)
   real (kind=real_kind) :: phis(np,np)
   real (kind=real_kind) :: Q(nv,nv,nlev,qsize_d,timelevels)
   real (kind=real_kind) :: phi(np,np,nlev)
   real (kind=real_kind) :: grad_lnps(nv,nv,2)
   real (kind=real_kind) :: eta_dot_dpdn(nv,nv,nlevp)
   real (kind=real_kind) :: T_v(nv,nv,nlev)
   real (kind=real_kind) :: zeta(nv,nv,nlev)
   real (kind=real_kind) :: omega_p(nv,nv,nlev)
   real (kind=real_kind) :: div(nv,nv,nlev,timelevels)
end type elem_state_t
HOMME Kernel — OpenMP for multiple devices / GPU group management (2 GPUs): resources are allocated per group on each GPU, data is loaded as early as possible, calls are asynchronous, results are saved back onto the host, and resources are released on the GPUs when done.

!$omp do private(ie,dev)
do dev = 1, 2
   if (dev .eq. 1) then
      !$hmpp <cudagroup> allocate
      !$hmpp <cudagroup> hmpp_resident advancedload, args[::rdx;…;rmetdetp]
   else
      !$hmpp <cudagroup2> allocate
      !$hmpp <cudagroup2> hmpp_resident22 advancedload, args[::rdx;::rdy;::Dvv…;metdet;::rmetdetp]
   endif
   do ie = (dev-1)*(nete/2)+1, (dev-1)*(nete/2)+(nete/2), 1
      if (dev .eq. 1) then
         !$hmpp <cudagroup> hmpp_resident callsite, asynchronous
         call divergence_sphere5d_hmpp_tuned_resident(ie, qsize, nlev, nv, divdp4d_omp)
         !$hmpp <cudagroup> hmpp_resident delegatedstore, args[divdp4d_omp]
         !$hmpp <cudagroup> hmpp_resident synchronize
         if (ie .eq. 6) then
            !$hmpp <cudagroup> release
         endif
      end if ! end if dev.eq.1
      if (dev .eq. 2) then
         …
      end if ! end if dev.eq.2
   end do ! second outermost loop
end do
!$omp end do nowait
!$omp end parallel
Should we Extend OpenMP?

Extending OpenMP directives for heterogeneous computing:
- Can help avoid the complexities of multiple APIs
- Potentially reduces code modification and supports portability
- Maintenance benefit
- Might require non-trivial extensions to current OpenMP
- Must satisfy the needs of technical computing, general-purpose computing, and also embedded computing
- A number of companies are exploring this intensively
Accelerator Region (PGI) — Example

void foo(char A[], char B[], char C[], int nrows, int ncols)
{
    #pragma omp acc_region shared-copy(C), shared-const(A,B)
    {
        for (int i = 0; i < nrows; ++i)
            for (int j = 0; j < ncols; j += NLANES)
                for (int k = 0; k < NLANES; ++k) {
                    int index = (i * ncols) + j + k;
                    C[index] = A[index] + B[index];
                }
    } // end accelerate region

    print2d(C, nrows, ncols);
}
Summary

Enormous potential for joint research in areas of parallel programming models and their implementation:
- Programming interface
- Compiler
- Runtime
- Tools