

Performance Tool Integration in Programming Environments for GPU Acceleration: Experiences with TAU and HMPP

Allen D. Malony 1,2, Shangkar Mayanglambam 1, Sameer Shende 1,2, Matt Sottile 1

1 Computer and Information Science Department, University of Oregon; 2 ParaTools, Inc.

Laurent Morin, Stephane Bihan, Francois Bodin

CAPS Entreprise

This work was supported by a 2009 NVIDIA Professor Partner award to Prof. Allen D. Malony.

[Figure: TAUcuda events — timeline across CPU and GPU showing the begin and end timestamps recorded for a measured region "A", the corresponding event record, the "A" TAU context, and the time spent waiting when the events are finalized.]

Introduction

Parallel programming environments targeting GPU accelerators hide the complexity of working with raw devices by allowing the application developer to work with libraries, special language constructs, or directives to a compiler. The benefit for the programmer is a higher-level abstraction for accelerator programming and protection of their software investment, since the environment takes the responsibility for translating the program to work with different acceleration backends.

The challenge for accelerator programming environments is to provide high-level support and flexibility without sacrificing delivered performance. For optimization of GPU-accelerated applications, tools must 1) be able to measure the performance of GPU computations, and 2) be integrated with the high-level programming framework to generate important performance events and metadata for representing performance results to the user.

We have developed an approach (called TAUcuda) to measure the performance of GPU computations programmed using CUDA and to integrate this information with application performance data captured with the TAU Performance System. To address the high-level programming aspect, we have integrated TAU/TAUcuda with the HMPP Workbench. The design methodology includes an instrumentation strategy whereby HMPP automatically inserts calls to the TAU/TAUcuda measurement interfaces in its runtime system and in HMPP-translated code to capture a performance picture of the resulting application execution.

TAUcuda API

void tau_cuda_init(int argc, char **argv);
  Called when the application starts
  Initializes data structures and checks GPU status

void tau_cuda_exit();
  Called before any thread exits at the end of the application
  Outputs all CUDA profile data for each thread of execution

void *tau_cuda_stream_begin(char *event, cudaStream_t stream);
  Called before the CUDA statements to be measured
  Creates a new CUDA event profile object and returns a handle to be used in the matching end call

void tau_cuda_stream_end(void *handle);
  Called immediately after the CUDA statements to be measured
  Inserts a CUDA event into the stream identified by the handle

vector<Event> tau_cuda_update();
  Checks for completed CUDA events on all streams
  Non-blocking; returns the number completed on each stream

int tau_cuda_update(cudaStream_t stream);
  Same as tau_cuda_update(), but for a particular stream
  Non-blocking; returns the number completed on the stream

vector<Event> tau_cuda_finalize();
  Waits for all CUDA events to complete on all streams
  Blocking; returns the number completed on each stream

int tau_cuda_finalize(cudaStream_t stream);
  Same as tau_cuda_finalize(), but for a particular stream
  Blocking; returns the number completed on the stream
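A minimal usage sketch of the API above is shown below, wrapping a single kernel launch on the default stream. The kernel saxpy_kernel, its launch configuration, and the header name tau_cuda.h are illustrative assumptions; only the tau_cuda_* calls listed above are assumed to exist.

#include <cuda_runtime.h>
#include "tau_cuda.h"   /* assumed header exposing the TAUcuda API listed above */

/* Hypothetical kernel, used only to show where the instrumentation calls go. */
__global__ void saxpy_kernel(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main(int argc, char **argv)
{
    tau_cuda_init(argc, argv);                 /* set up TAUcuda, check GPU status */

    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    /* Measure the statements between begin and end as one named event
       on the default stream (stream 0). */
    void *h = tau_cuda_stream_begin((char *)"saxpy", 0);
    saxpy_kernel<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    tau_cuda_stream_end(h);

    tau_cuda_update();                         /* non-blocking check for completed events */
    tau_cuda_finalize();                       /* block until all events have completed */

    cudaFree(x);
    cudaFree(y);
    tau_cuda_exit();                           /* write per-thread CUDA profile output */
    return 0;
}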

HMPP-TAU

HMPP

CPU-GPU Scenarios

TAUcuda Methodology

[Figure: HMPP compilation and execution workflow — HMPP-annotated source code is handled by the HMPP preprocessor and HMPP compiler; the application source code is built by a standard compiler into the host application; the CUDA generator produces CUDA source code, which the CUDA compiler turns into CUDA code; at run time the HMPP runtime and CUDA driver coordinate execution across the CPU and GPU, producing trace and profile output.]
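To make the HMPP-annotated source code at the top of this workflow concrete, the sketch below shows the general shape of an HMPP codelet and callsite directive targeting CUDA. The codelet name, argument clauses, and array sizes are illustrative, and the exact clause syntax can differ between HMPP Workbench versions.

#define N 1024
#define M 1024

/* General form of an HMPP codelet directive; clause details are illustrative. */
#pragma hmpp matvec codelet, target=CUDA, args[vout].io=inout
void matvec(int n, int m, float a[N][M], float x[M], float vout[N])
{
    for (int i = 0; i < n; i++) {
        float s = 0.0f;
        for (int j = 0; j < m; j++)
            s += a[i][j] * x[j];
        vout[i] = s;
    }
}

static float a[N][M], x[M], vout[N];

int main(void)
{
    /* ... initialize a and x ... */

    /* The callsite directive tells HMPP to offload this call; the HMPP
       runtime manages the CPU-GPU data transfers and the kernel launch. */
    #pragma hmpp matvec callsite
    matvec(N, M, a, x, vout);

    return 0;
}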

http://tau.uoregon.edu http://caps-entreprise.com

Game of Life Performance

Matrix-Vector Multiply
  Two codelets allow overlap of data transfer and computation
  Demonstrates profiling and tracing

One Stream Tests

[Figure: Trace view of the matrix-vector multiply run — main thread execution with kernel execution and data transfers overlapping; Codelet 1 and Codelet 2 data transfers and kernel executions appear on separate timelines.]
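The per-stream measurement behind this overlap can be sketched with the TAUcuda API as follows; the kernel codelet_kernel, the buffer handling, and the header name are hypothetical, and the stream-specific progress check uses the int tau_cuda_update(cudaStream_t) variant from the API listing.

#include <cuda_runtime.h>
#include "tau_cuda.h"   /* assumed TAUcuda header, as in the earlier sketch */

/* Hypothetical kernel standing in for an HMPP-generated codelet body. */
__global__ void codelet_kernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

/* Instrument two codelets on separate streams so their transfers and kernels
   can overlap; each region becomes a named TAUcuda event. (Host buffers would
   need to be pinned for the transfers to truly overlap with computation.) */
void run_two_codelets(float *h1, float *h2, float *d1, float *d2, int n)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    void *e1 = tau_cuda_stream_begin((char *)"codelet1", s1);
    cudaMemcpyAsync(d1, h1, n * sizeof(float), cudaMemcpyHostToDevice, s1);
    codelet_kernel<<<(n + 255) / 256, 256, 0, s1>>>(d1, n);
    tau_cuda_stream_end(e1);

    void *e2 = tau_cuda_stream_begin((char *)"codelet2", s2);
    cudaMemcpyAsync(d2, h2, n * sizeof(float), cudaMemcpyHostToDevice, s2);
    codelet_kernel<<<(n + 255) / 256, 256, 0, s2>>>(d2, n);
    tau_cuda_stream_end(e2);

    tau_cuda_update(s1);    /* non-blocking per-stream progress check */
    tau_cuda_update(s2);
    tau_cuda_finalize();    /* wait for all events on both streams */

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}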

[Figure: HMPP-TAU instrumentation workflow — the HMPP-annotated application passes through the TAU instrumenter/TAU compiler and the HMPP compiler; the CUDA generator applies TAUcuda instrumentation so that the CUDA compiler builds a TAUcuda-instrumented CUDA codelet library; a generic compiler then links against the TAU and HMPP libraries to produce the TAU-instrumented HMPP application.]