How to Overlap Data Transfers in CUDA C/C++
Posted on December 13, 2012 by Mark Harris (http://devblogs.nvidia.com/parallelforall/author/mharris/). Tagged: Concurrency, CUDA C/C++, Memory, Streams.
In our last CUDA C/C++ post (http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/) we discussed how to transfer data
efficiently between the host and device. In this post, we discuss how to overlap data transfers with computation on the host, computation on the
device, and in some cases other data transfers between the host and device. Achieving overlap between data transfers and other operations requires
the use of CUDA streams, so first let’s learn about streams.
CUDA Streams
A stream in CUDA is a sequence of operations that execute on the device in the order in which they are issued by the host code. While operations within
a stream are guaranteed to execute in the prescribed order, operations in different streams can be interleaved and, when possible, they can even run
concurrently.
The default stream
All device operations (kernels and data transfers) in CUDA run in a stream. When no stream is specified, the default stream (also called the “null
stream”) is used. The default stream is different from other streams because it is a synchronizing stream with respect to operations on the device: no
operation in the default stream will begin until all previously issued operations in any stream on the device have completed, and an operation in the
default stream must complete before any other operation (in any stream on the device) will begin.
Let’s look at some simple code examples that use the default stream, and discuss how operations progress from the perspective of the host as well as
the device.
cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice);
increment<<<1, N>>>(d_a);
cudaMemcpy(a, d_a, numBytes, cudaMemcpyDeviceToHost);
In the code above, from the perspective of the device, all three operations are issued to the same (default) stream and will execute in the order that
they were issued. From the perspective of the host, the implicit data transfers are blocking or synchronous transfers, while the kernel launch is
asynchronous. Since the host-to-device data transfer on the first line is synchronous, the CPU thread will not reach the kernel call on the second line
until the host-to-device transfer is complete. Once the kernel is issued, the CPU thread moves to the third line, but the transfer on that line cannot
begin due to the device-side order of execution.
The asynchronous behavior of kernel launches from the host’s perspective makes overlapping device and host computation very simple. We can modify
the code to add some independent CPU computation as follows.
cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice);
increment<<<1, N>>>(d_a);
myCpuFunction(b);
cudaMemcpy(a, d_a, numBytes, cudaMemcpyDeviceToHost);
In the above code, as soon as the increment() kernel is launched on the device the CPU thread executes myCpuFunction(), overlapping its
execution on the CPU with the kernel execution on the GPU. Whether the host function or device kernel completes first doesn’t affect the subsequent
device-to-host transfer, which will begin only after the kernel completes. From the perspective of the device, nothing has changed from the previous
example; the device is completely unaware of myCpuFunction().
Non-default streams
Non-default streams in CUDA C/C++ are declared, created, and destroyed in host code as follows.
cudaStream_t stream1;
cudaError_t result;
result = cudaStreamCreate(&stream1);
result = cudaStreamDestroy(stream1);
To issue a data transfer to a non-default stream we use the cudaMemcpyAsync (http://docs.nvidia.com/cuda/cuda-runtime-
api/index.html#group__CUDART__MEMORY_1gf2810a94ec08fe3c7ff9d93a33af7255)() function, which is similar to the cudaMemcpy()
function discussed in the previous post, but takes a stream identifier as a fifth argument.
result = cudaMemcpyAsync(d_a, a, N, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync() is non-blocking on the host, so control returns to the host thread immediately after the transfer is issued. There are
cudaMemcpy2DAsync (http://docs.nvidia.com/cuda/cuda-runtime-
api/index.html#group__CUDART__MEMORY_1g7f182f1a8b8750c7fa9e18aeb280d31c)() and cudaMemcpy3DAsync
(http://docs.nvidia.com/cuda/cuda-runtime-api/index.html#group__CUDART__MEMORY_1gb5854dd48800fe861864ae8784762a46)()
variants of this routine which can transfer 2D and 3D array sections asynchronously in the specified streams.
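As a sketch, a 2D transfer of a tile of floats might look like the following; the pointer, pitch, and dimension names (d_tile, h_tile, dPitch, width, height) are illustrative, not from the example code.

```cuda
// Asynchronously copy a width x height tile of floats to the device in stream1.
// dPitch is the row stride in bytes of the device allocation (e.g. from
// cudaMallocPitch); the host tile is assumed to be densely packed.
cudaMemcpy2DAsync(d_tile, dPitch,                  // destination, dest pitch
                  h_tile, width * sizeof(float),   // source, source pitch
                  width * sizeof(float), height,   // width in bytes, height in rows
                  cudaMemcpyHostToDevice, stream1);
```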
To issue a kernel to a non-default stream we specify the stream identifier as a fourth execution configuration parameter (the third execution
configuration parameter allocates shared device memory, which we’ll talk about later; use 0 for now).
increment<<<1, N, 0, stream1>>>(d_a);
Synchronization with streams
Since all operations in non-default streams are non-blocking with respect to the host code, you will run across situations where you need to synchronize
the host code with operations in a stream. There are several ways to do this. The “heavy hammer” way is to use cudaDeviceSynchronize
(http://docs.nvidia.com/cuda/cuda-runtime-api/index.html#group__CUDART__DEVICE_1g32bdc6229081137acd3cba5da2897779)(),
which blocks the host code until all previously issued operations on the device have completed. In most cases this is overkill, and can really hurt
performance due to stalling the entire device and host thread.
The CUDA stream API has multiple less severe methods of synchronizing the host with a stream. The function cudaStreamSynchronize
(http://docs.nvidia.com/cuda/cuda-runtime-api/index.html#group__CUDART__STREAM_1geb3b2f88b7c1cff8b67a998a3a41c179)
(stream) can be used to block the host thread until all previously issued operations in the specified stream have completed. The function
cudaStreamQuery (http://docs.nvidia.com/cuda/cuda-runtime-
api/index.html#group__CUDART__STREAM_1ge78feba9080b59fe0fff536a42c13e6d)(stream) tests whether all operations issued to the
specified stream have completed, without blocking host execution. The functions cudaEventSynchronize
(http://docs.nvidia.com/cuda/cuda-runtime-api/index.html#group__CUDART__EVENT_1g08241bcf5c5cb686b1882a8492f1e2d9)
(event) and cudaEventQuery (http://docs.nvidia.com/cuda/cuda-runtime-
api/index.html#group__CUDART__EVENT_1gf8e4ddb569b1da032c060f0c54da698f)(event) act similarly to their stream counterparts, except
that their result is based on whether a specified event has been recorded rather than whether a specified stream is idle. You can also synchronize
operations within a single stream on a specific event using cudaStreamWaitEvent (http://docs.nvidia.com/cuda/cuda-runtime-
api/index.html#group__CUDART__STREAM_1g80c62c379f0c3ed8afe31fd0a31ad8a2)(event) (even if the event is recorded in a different stream, or on a
different device!).
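As a brief sketch of these synchronization calls in action (reusing the increment kernel from earlier; error checking is omitted for brevity):

```cuda
cudaStream_t stream1;
cudaEvent_t event1;
cudaStreamCreate(&stream1);
cudaEventCreate(&event1);

cudaMemcpyAsync(d_a, a, numBytes, cudaMemcpyHostToDevice, stream1);
increment<<<1, N, 0, stream1>>>(d_a);
cudaEventRecord(event1, stream1);   // record an event after the kernel

// Poll the event without blocking the host thread
while (cudaEventQuery(event1) == cudaErrorNotReady) {
    // ... do independent host work here ...
}

cudaStreamSynchronize(stream1);     // block until all work in stream1 completes
cudaEventDestroy(event1);
cudaStreamDestroy(stream1);
```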
Overlapping Kernel Execution and Data Transfers
Earlier we demonstrated how to overlap kernel execution in the default stream with execution of code on the host. But our main goal in this post is to
show you how to overlap kernel execution with data transfers. There are several requirements for this to happen.
- The device must be capable of "concurrent copy and execution". This can be queried from the deviceOverlap field of a cudaDeviceProp struct, or from the output of the deviceQuery sample included with the CUDA SDK/Toolkit. Nearly all devices with compute capability 1.1 and higher have this capability.
- The kernel execution and the data transfer to be overlapped must both occur in different, non-default streams.
- The host memory involved in the data transfer must be pinned (http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/) memory.
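A minimal sketch of checking the first and third requirements might look like this (the device number and printf message are illustrative):

```cuda
#include <stdio.h>

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);    // query device 0
if (!prop.deviceOverlap)
    printf("Device cannot overlap copy and execution\n");

// Pinned host memory is required for asynchronous transfers
float *a;
cudaMallocHost((void**)&a, numBytes);
// ... use a in cudaMemcpyAsync calls ...
cudaFreeHost(a);
```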
So let’s modify our simple host code from above to use multiple streams and see if we can achieve any overlap. The full code for this example is
available on Github (https://github.com/parallel-forall/code-samples/blob/master/series/cuda-cpp/overlap-data-transfers/async.cu). In the modified
code, we break up the array of size N into chunks of streamSize elements. Since the kernel operates independently on all elements, each of the
chunks can be processed independently. The number of (non-default) streams used is nStreams=N/streamSize. There are multiple ways to
implement the domain decomposition of the data and processing; one is to loop over all the operations for each chunk of the array as in this example
code.
for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes, cudaMemcpyHostToDevice, stream[i]);
  kernel<<<streamSize/blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
  cudaMemcpyAsync(&a[offset], &d_a[offset], streamBytes, cudaMemcpyDeviceToHost, stream[i]);
}
Another approach is to batch similar operations together, issuing all the host-to-device transfers first, followed by all kernel launches, and then all
device-to-host transfers, as in the following code.
for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  cudaMemcpyAsync(&d_a[offset], &a[offset],
                  streamBytes, cudaMemcpyHostToDevice, stream[i]);
}

for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  kernel<<<streamSize/blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
}

for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  cudaMemcpyAsync(&a[offset], &d_a[offset],
                  streamBytes, cudaMemcpyDeviceToHost, stream[i]);
}
Both asynchronous methods shown above yield correct results, and in both cases dependent operations are issued to the same stream in the order in
which they need to be executed. But the two approaches perform very differently depending on the specific generation of GPU used. On a Tesla C1060
(compute capability 1.3) running the test code (from Github) gives the following results.
Device : Tesla C1060
Time for sequential transfer and execute (ms): 12.92381
  max error: 2.3841858E-07
Time for asynchronous V1 transfer and execute (ms): 13.63690
  max error: 2.3841858E-07
Time for asynchronous V2 transfer and execute (ms): 8.84588
  max error: 2.3841858E-07
On a Tesla C2050 (compute capability 2.0) we get the following results.
Device : Tesla C2050
Time for sequential transfer and execute (ms): 9.984512
  max error: 1.1920929e-07
Time for asynchronous V1 transfer and execute (ms): 5.735584
  max error: 1.1920929e-07
Time for asynchronous V2 transfer and execute (ms): 7.597984
  max error: 1.1920929e-07
Here the first time reported is the sequential transfer and kernel execution using blocking transfers, which we use as a baseline for asynchronous
speedup comparison. Why do the two asynchronous strategies perform differently on different architectures? To decipher these results we need to
understand a bit more about how CUDA devices schedule and execute tasks. CUDA devices contain engines for various tasks, which queue up
operations as they are issued. Dependencies between tasks in different engines are maintained, but within any engine all external dependencies are
lost; tasks in each engine’s queue are executed in the order they are issued. The C1060 has a single copy engine and a single kernel engine. A timeline
for the execution of our example code on a C1060 is shown in the following diagram.
[Diagram: execution timeline of the sequential and asynchronous versions of the example code on the C1060]
In the schematic we assume that the time required for the host-to-device transfer, kernel execution, and device-to-host transfer are approximately the
same (the kernel code was chosen in order to achieve this). As expected for the sequential kernel, there is no overlap in any of the operations. For the
first asynchronous version of our code the order of execution in the copy engine is: H2D stream(1), D2H stream(1), H2D stream(2), D2H stream(2), and
so forth. This is why we do not see any speed-up when using the first asynchronous version on the C1060: tasks were issued to the copy engine in an
order that precludes any overlap of kernel execution and data transfer. For version two, however, where all the host-to-device transfers are issued
before any of the device-to-host transfers, overlap is possible as indicated by the lower execution time. From our schematic, we expect the execution
of asynchronous version 2 to be 8/12 of the sequential version, or about 8.7 ms, which is confirmed in the timing results given previously.
On the C2050, two features interact to cause a behavior difference from the C1060. The C2050 has two copy engines, one for host-to-device transfers
and another for device-to-host transfers, as well as a single kernel engine. The following diagram illustrates execution of our example on the C2050.
Having two copy engines explains why asynchronous version 1 achieves good speed-up on the C2050: the device-to-host transfer of data in stream[i]
does not block the host-to-device transfer of data in stream[i+1] as it did on the C1060 because there is a separate engine for each copy direction on
the C2050. The schematic predicts the execution time to be cut in half relative to the sequential version, and this is roughly what our timing results
showed.
But what about the performance degradation observed in asynchronous version 2 on the C2050? This is related to the C2050's ability to concurrently run
multiple kernels. When multiple kernels are issued back-to-back in different (non-default) streams, the scheduler tries to enable concurrent
execution of these kernels and as a result delays a signal that normally occurs after each kernel completion (which is responsible for kicking off the
device-to-host transfer) until all kernels complete. So, while there is overlap between host-to-device transfers and kernel execution in the second
version of our asynchronous code, there is no overlap between kernel execution and device-to-host transfers. The schematic predicts an overall time for
the asynchronous version 2 to be 9/12 of the time for the sequential version, or 7.5 ms, and this is confirmed by our timing results.
A more detailed description of the example used in this post is available in CUDA Fortran Asynchronous Data Transfers
(http://www.pgroup.com/lit/articles/insider/v3n1a4.htm). The good news is that for devices with compute capability 3.5 (the K20 series), the Hyper-Q
feature eliminates the need to tailor the launch order, so either approach above will work. We will discuss using Kepler features in a future post, but
for now, here are the results of running the sample code on a Tesla K20c GPU. As you can see, both asynchronous methods achieve the same speedup
over the synchronous code.
Device : Tesla K20c
Time for sequential transfer and execute (ms): 7.101760
  max error: 1.1920929e-07
Time for asynchronous V1 transfer and execute (ms): 3.974144
  max error: 1.1920929e-07
Time for asynchronous V2 transfer and execute (ms): 3.967616
  max error: 1.1920929e-07
Summary
This post and the previous one (http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/) discussed how to optimize data
transfers between the host and device. The previous post focused on how to minimize the time for executing such transfers, and this post introduced
streams and how to use them to mask data transfer time by concurrently executing copies and kernels.
In a post dealing with streams I should mention that while using the default stream is convenient for developing code—synchronous code is simpler—
eventually your code should use non-default streams. This is especially important when writing libraries. If code in a library uses the default stream,
there is no chance for the end user to overlap data transfers with library kernel execution.
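For example, a library routine can accept a stream from the caller instead of implicitly using the default stream; this sketch (with a hypothetical scaleKernel and libraryScale, not from any real library) illustrates the pattern:

```cuda
__global__ void scaleKernel(float *d_data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= factor;
}

// Hypothetical library entry point: the caller chooses the stream, so the
// library's kernel can overlap with the caller's own transfers and kernels.
void libraryScale(float *d_data, int n, float factor, cudaStream_t stream)
{
    scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n, factor);
}
```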
Now you know how to move data efficiently between the host and device, so we’ll look at how to access data efficiently from within kernels in the next
post (http://devblogs.nvidia.com/parallelforall/how-access-global-memory-efficiently-cuda-c-kernels/).
∥∀
About Mark Harris
Mark is Chief Technologist for GPU Computing Software at NVIDIA. Mark has fifteen years of experience developing software for GPUs, ranging from graphics and games, to physically-based simulation, to parallel algorithms and high-performance computing. Mark has been using GPUs for general-purpose computing since before they even supported floating point arithmetic. While a Ph.D. student at UNC he recognized this nascent trend and coined a name for it: GPGPU (General-Purpose computing on Graphics Processing Units), and started GPGPU.org to provide a forum for those working in the field to share and discuss their work.