View
4
Download
0
Category
Preview:
Citation preview
TAUcudaevents
PerformanceToolIntegra0oninProgrammingEnvironmentsforGPUAccelera0on:ExperienceswithTAUandHMPP
AllenD.Malony1,2,ShangkarMayanglambam1,SameerShende1,2,Ma3So5le1
1ComputerandInforma6onScienceDepartment,UniversityofOregon,2ParaTools,Inc.
LaurentMorin,StephaneBihan,FrancoisBodin
CAPSEntreprise
This work was supported by a 2009 NVIDIA Professor Partner award to Prof. Allen D. Malony.
CPU
GPU
begintimestamp
endtimestamp
A
A event record
}
"A" TAU context # time waiting finalize
( )Parallelprogrammingenvironments targe6ngGPUacceleratorshide thecomplexityof working with raw devices by allowing the applica6on developer to work withlibraries,speciallanguageconstructs,ordirec6vestoacompiler. Thebenefitfortheprogrammerisahigher‐levelabstrac6onforacceleratorprogrammingandprotec6onof their soKware investment, since the environment takes the responsibility fortransla6ngtheprogramtoworkwithdifferentaccelera6onbackends.
The challenge for accelerator programming environments is to provide high‐levelsupportandflexibilitywithoutsacrificingdeliveredperformance.Forop6miza6onofGPU‐acceleratedapplica6ons,toolsmust1)beabletomeasureperformanceofGPUcomputa6ons,and2)be integratedwith thehigh‐levelprogramming framework togenerate important performance events and meta data for represen6ngperformanceresultstotheuser.
Wehavedeveloped an approach (called TAUcuda) tomeasure theperformanceofGPU computa6ons programmed using CUDA and integrate this informa6on withapplica6on performance data captured with the TAU Performance System. Toaddress thehigh‐levelprogrammingaspect,wehave integratedTAU/TAUcudawiththe HMPP Workbench. The design methodology includes an instrumenta6onstrategy whereby HMPP automa6cally inserts calls to the TAU/TAUcudameasurementinterfacesinitsrun6mesystemandHMPP‐translatedcodetocaptureaperformancepictureoftheresul6ngapplica6onexecu6on.
Introduction void tau_cuda_init(int argc, char **argv); To be called when the application starts Initializes data structures and checks GPU status void tau_cuda_exit() To be called before any thread exits at end of application All the CUDA profile data output for each thread of execution void* tau_cuda_stream_begin(char *event, cudaStream_t stream); Called before CUDA statements to be measured Returns handle which should be used in the end call Create new CUDA event profile object is created void tau_cuda_stream_end(void * handle); Called immediately after CUDA statements to be measured Inserts a CUDA event into the stream identified by the handle vector<Event> tau_cuda_update(); Checks for completed CUDA events on all streams Non-blocking and returns # completed on each stream int tau_cuda_update(cudaStream_t stream); Same as tau_cuda_update() except for a particular stream Non-blocking and returns # completed on the stream vector<Event> tau_cuda_finalize(); Waits for all CUDA events to complete on all streams Blocking and returns # completed on each stream int tau_cuda_finalize(cudaStream_t stream); Same as tau_cuda_finalize() except for a particular stream Blocking and returns # completed on the stream
TAUcuda API
HMPP-TAU
HMPP
CPU-GPU Scenarios TAUcuda Methodology
Trace
Profile
HMPPannotedsourcecode
HMPPCompiler
ApplicaConsourcecode
StandardCompiler
HostapplicaCon
CUDAsourcecode
CUDACompiler
CUDAcode
HMPPRunCme
HMPPPreprocessor
CUDAdriver
CUDAGenerator
GPU CPU
http://tau.uoregon.edu http://caps-entreprise.com
Game of Life Performance
Matrix-Vector Multiply Two codelets allow
overlap of data transfer and computation
Demonstrates profiling and tracing
One Stream Tests
Main thread execution
Kernel execution and data transfers
overlapping
Codelet 2 data transfers
Codelet 1 data transfers
Codelet 2 kernel
execution
Codelet 1 kernel
execution
TAU instrumented HMPP application
TAUcompiler
HMPPcompiler
Genericcompiler
TAUInstrumenter
TAU lib
TAU instrumented
HMPP application
CUDA codelet library
CUDAcompiler
CUDAgeneratorTAUcudainstrumentaCon
HMPP annotated application
HMPP lib
CUDA Codelet TAUcuda-
instrumented
Recommended