Performance Tool Integraon in Programming Environments for ... · GPU‐accelerated applicaons,...

TAUcudaevents

PerformanceToolIntegra0oninProgrammingEnvironmentsforGPUAccelera0on:ExperienceswithTAUandHMPP

AllenD.Malony1,2,ShangkarMayanglambam1,SameerShende1,2,Ma3So5le1

1ComputerandInforma6onScienceDepartment,UniversityofOregon,2ParaTools,Inc.

LaurentMorin,StephaneBihan,FrancoisBodin

CAPSEntreprise

This work was supported by a 2009 NVIDIA Professor Partner award to Prof. Allen D. Malony.

begintimestamp

endtimestamp

A event record

"A" TAU context # time waiting finalize

( )Parallelprogrammingenvironments targe6ngGPUacceleratorshide thecomplexityof working with raw devices by allowing the applica6on developer to work withlibraries,speciallanguageconstructs,ordirec6vestoacompiler. Thebenefitfortheprogrammerisahigher‐levelabstrac6onforacceleratorprogrammingandprotec6onof their soKware investment, since the environment takes the responsibility fortransla6ngtheprogramtoworkwithdifferentaccelera6onbackends.

The challenge for accelerator programming environments is to provide high‐levelsupportandflexibilitywithoutsacrificingdeliveredperformance.Forop6miza6onofGPU‐acceleratedapplica6ons,toolsmust1)beabletomeasureperformanceofGPUcomputa6ons,and2)be integratedwith thehigh‐levelprogramming framework togenerate important performance events and meta data for represen6ngperformanceresultstotheuser.

Wehavedeveloped an approach (called TAUcuda) tomeasure theperformanceofGPU computa6ons programmed using CUDA and integrate this informa6on withapplica6on performance data captured with the TAU Performance System. Toaddress thehigh‐levelprogrammingaspect,wehave integratedTAU/TAUcudawiththe HMPP Workbench. The design methodology includes an instrumenta6onstrategy whereby HMPP automa6cally inserts calls to the TAU/TAUcudameasurementinterfacesinitsrun6mesystemandHMPP‐translatedcodetocaptureaperformancepictureoftheresul6ngapplica6onexecu6on.

Introduction void tau_cuda_init(int argc, char **argv);   To be called when the application starts   Initializes data structures and checks GPU status void tau_cuda_exit()   To be called before any thread exits at end of application   All the CUDA profile data output for each thread of execution void* tau_cuda_stream_begin(char *event, cudaStream_t stream);   Called before CUDA statements to be measured   Returns handle which should be used in the end call   Create new CUDA event profile object is created void tau_cuda_stream_end(void * handle);   Called immediately after CUDA statements to be measured   Inserts a CUDA event into the stream identified by the handle vector<Event> tau_cuda_update();   Checks for completed CUDA events on all streams   Non-blocking and returns # completed on each stream int tau_cuda_update(cudaStream_t stream);   Same as tau_cuda_update() except for a particular stream   Non-blocking and returns # completed on the stream vector<Event> tau_cuda_finalize();   Waits for all CUDA events to complete on all streams   Blocking and returns # completed on each stream int tau_cuda_finalize(cudaStream_t stream);   Same as tau_cuda_finalize() except for a particular stream   Blocking and returns # completed on the stream

TAUcuda API

HMPP-TAU

CPU-GPU Scenarios TAUcuda Methodology

Profile

HMPPannotedsourcecode

HMPPCompiler

ApplicaConsourcecode

StandardCompiler

HostapplicaCon

CUDAsourcecode

CUDACompiler

CUDAcode

HMPPRunCme

HMPPPreprocessor

CUDAdriver

CUDAGenerator

GPU CPU

http://tau.uoregon.edu http://caps-entreprise.com

Game of Life Performance

Matrix-Vector Multiply   Two codelets allow

overlap of data transfer and computation

  Demonstrates profiling and tracing

One Stream Tests

Main thread execution

Kernel execution and data transfers

overlapping

Codelet 2 data transfers

Codelet 1 data transfers

Codelet 2 kernel

execution

Codelet 1 kernel

execution

TAU instrumented HMPP application

TAUcompiler

HMPPcompiler

Genericcompiler

TAUInstrumenter

TAU lib

TAU instrumented

HMPP application

CUDA codelet library

CUDAcompiler

CUDAgeneratorTAUcudainstrumentaCon

HMPP annotated application

HMPP lib

CUDA Codelet TAUcuda-

instrumented

Performance Tool Integraon in Programming Environments for ... · GPU‐accelerated applicaons,...

Documents

GPU Architecture & Implications - Computer Scienceskadron/cuda_asplos08_tutorial/4-GPU-architecture.pdf · GPU Architecture CUDA provides a parallel programming model The Tesla GPU

GPU Developments 2017 · GPU Developments 2017 ... and • •

Fully scripted builds and connuous integraon have become ... · Fully scripted builds and connuous integraon have become more mainstream in the past years. In this talk I want to

Predictable+Integraon++ of+Safety4Cri/cal+So6ware++ … · 2013-04-08 · Predictable+Integraon++ of+Safety4Cri/cal+So6ware++ on+COTS4based+Embedded+Systems+ + Marco+Caccamo+ University+of+Illinois+

GPU, GP-GPU, GPU computing

Experiences with Grid Integraon of Wind Power in Sri Lanka

Con$nued(from(last$me:(( Race(Detec$on(in(Cilk(Computaons(

Virtualized SatCom Networks and Mul^- Domain Integraon with … · H2020 VITAL Project Coordinator Virtualized SatCom Networks and Mul^-Domain Integraon with 5G: Architectural Perspec^ves

Experiences with Grid Integraon of Wind Power in Sri Lanka€¦ · Experiences with Grid Integraon of Wind Power in Sri Lanka ... • Power quality issues will be addressed complying

Genomics)of)Gene)Regulaon)2:) Conservaon,)Integraon)of)features,)Assays… · 10/29/15 1 Genomics)of)Gene)Regulaon)2:) Conservaon,)Integraon)of)features,)Assays,) Issues)in)interpretaon)

Successful Co-ordinaon of Integraon Tes1ng of Global

Best Practices GPU-Based Video Processing | GTC 2013on-demand.gputechconf.com/...Based-Video-Processing... · GPU-Based Video Processing Optimal GPU Upload GPU Processing GPU Readback

Integraonbetweenins(tu(onalrepository, researcher ... · Integraonbetweenins(tu(onalrepository, researcherdirectoryandlibrarycatalog* usingORCIDandNextLEnju July9th,2015.OpenRepositories*2015

Unit Commitment for Sustainable Integraon of Large‐Scale ......Unit Commitment for Sustainable Integraon of Large‐Scale Wind Power and Responsive Demand ©Marija Ilic ECE and EPP,

GPU Acceleration for Seismic Interpretation Algorithmsdeveloper.download.nvidia.com/GTC/...GTC2012-GPU-Acceleration-Sei… · GPU Acceleration for Seismic ... GPU Acceleration for

Heterogeneous Integraon Roadmap Presented by · PDF fileHeterogeneous Integraon Roadmap Presented by William (Bill ) Chen ... , 2016 hhp:// 7. thESTC Grenoble, France September 13

Multi-GPU MapReduce on GPU Clusters

Welfare’Regimes’in’anAgeof Austerity’ · 2013-05-20 · Compeng ’Conjectures’ Covergence(hypothesis: Economic’globalizaon ’and’ integraon ’on’the’one’hand’and’the’ﬁnancial’and’

GPU GPU and Supercomputer - University of Rochester€¦ · 4/16/2015 9 Outline: GPU GPGPU and CUDA CPU + GPU SuperComputer (TianHe 1A) Latest Trends GPU accelerated supercomputer

Heterogeneous Integraon Roadmap Presented by William (Bill ...eps.ieee.org/images/files/Roadmap/Heterogeneous... · Heterogeneous Integraon Deﬁned Heterogeneous Integraon refers