How to Overlap Data Transfers in CUDA C/C++
Posted on December 13, 2012 by Mark Harris (http://devblogs.nvidia.com/parallelforall/author/mharris/). Tagged: Concurrency, CUDA C/C++, Memory, Streams.
In our last CUDA C/C++ post (http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/) we discussed how to transfer data
efficiently between the host and device. In this post, we discuss how to overlap data transfers with computation on the host, computation on the
device, and in some cases other data transfers between the host and device. Achieving overlap between data transfers and other operations requires
the use of CUDA streams, so first let’s learn about streams.
CUDA Streams
A stream in CUDA is a sequence of operations that execute on the device in the order in which they are issued by the host code. While operations within
a stream are guaranteed to execute in the prescribed order, operations in different streams can be interleaved and, when possible, they can even run
concurrently.
The default stream
All device operations (kernels and data transfers) in CUDA run in a stream. When no stream is specified, the default stream (also called the “null
stream”) is used. The default stream is different from other streams because it is a synchronizing stream with respect to operations on the device: no
operation in the default stream will begin until all previously issued operations in any stream on the device have completed, and an operation in the
default stream must complete before any other operation (in any stream on the device) will begin.
Let’s look at some simple code examples that use the default stream, and discuss how operations progress from the perspective of the host as well as
the device.
cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice);
increment<<<1, N>>>(d_a);
cudaMemcpy(a, d_a, numBytes, cudaMemcpyDeviceToHost);
In the code above, from the perspective of the device, all three operations are issued to the same (default) stream and will execute in the order that
they were issued. From the perspective of the host, the implicit data transfers are blocking or synchronous transfers, while the kernel launch is
asynchronous. Since the host-to-device data transfer on the first line is synchronous, the CPU thread will not reach the kernel call on the second line
until the host-to-device transfer is complete. Once the kernel is issued, the CPU thread moves to the third line, but the transfer on that line cannot
begin due to the device-side order of execution.
The asynchronous behavior of kernel launches from the host’s perspective makes overlapping device and host computation very simple. We can modify
the code to add some independent CPU computation as follows.
cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice);
increment<<<1, N>>>(d_a);
myCpuFunction(b);
cudaMemcpy(a, d_a, numBytes, cudaMemcpyDeviceToHost);
In the above code, as soon as the increment() kernel is launched on the device the CPU thread executes myCpuFunction(), overlapping its
execution on the CPU with the kernel execution on the GPU. Whether the host function or device kernel completes first doesn’t affect the subsequent
device-to-host transfer, which will begin only after the kernel completes. From the perspective of the device, nothing has changed from the previous
example; the device is completely unaware of myCpuFunction().
Non-default streams
Non-default streams in CUDA C/C++ are declared, created, and destroyed in host code as follows.
cudaStream_t stream1;
cudaError_t result;
result = cudaStreamCreate(&stream1);
result = cudaStreamDestroy(stream1);
To issue a data transfer to a non-default stream we use the cudaMemcpyAsync (http://docs.nvidia.com/cuda/cuda-runtime-
api/index.html#group__CUDART__MEMORY_1gf2810a94ec08fe3c7ff9d93a33af7255)() function, which is similar to the cudaMemcpy()
function discussed in the previous post, but takes a stream identifier as a fifth argument.
result = cudaMemcpyAsync(d_a, a, N, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync() is non-blocking on the host, so control returns to the host thread immediately after the transfer is issued. There are
cudaMemcpy2DAsync (http://docs.nvidia.com/cuda/cuda-runtime-
api/index.html#group__CUDART__MEMORY_1g7f182f1a8b8750c7fa9e18aeb280d31c)() and cudaMemcpy3DAsync
(http://docs.nvidia.com/cuda/cuda-runtime-api/index.html#group__CUDART__MEMORY_1gb5854dd48800fe861864ae8784762a46)()
variants of this routine which can transfer 2D and 3D array sections asynchronously in the specified streams.
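As a sketch, a 2D transfer of a tile of floats might look like the following; the pointer, pitch, and dimension names (d_tile, h_tile, dPitch, width, height) are illustrative, not from the example code.

```cuda
// Asynchronously copy a width x height tile of floats to the device in stream1.
// dPitch is the row stride in bytes of the device allocation (e.g. from
// cudaMallocPitch); the host tile is assumed to be densely packed.
cudaMemcpy2DAsync(d_tile, dPitch,                  // destination, dest pitch
                  h_tile, width * sizeof(float),   // source, source pitch
                  width * sizeof(float), height,   // width in bytes, height in rows
                  cudaMemcpyHostToDevice, stream1);
```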
To issue a kernel to a non-default stream we specify the stream identifier as a fourth execution configuration parameter (the third execution
configuration parameter allocates shared device memory, which we’ll talk about later; use 0 for now).
increment<<<1, N, 0, stream1>>>(d_a);
Synchronization with streams
Since all operations in non-default streams are non-blocking with respect to the host code, you will run across situations where you need to synchronize
the host code with operations in a stream. There are several ways to do this. The “heavy hammer” way is to use cudaDeviceSynchronize
(http://docs.nvidia.com/cuda/cuda-runtime-api/index.html#group__CUDART__DEVICE_1g32bdc6229081137acd3cba5da2897779)(),
which blocks the host code until all previously issued operations on the device have completed. In most cases this is overkill, and can really hurt
performance due to stalling the entire device and host thread.
The CUDA stream API has multiple less severe methods of synchronizing the host with a stream. The function cudaStreamSynchronize
(http://docs.nvidia.com/cuda/cuda-runtime-api/index.html#group__CUDART__STREAM_1geb3b2f88b7c1cff8b67a998a3a41c179)
(stream) can be used to block the host thread until all previously issued operations in the specified stream have completed. The function
cudaStreamQuery (http://docs.nvidia.com/cuda/cuda-runtime-
api/index.html#group__CUDART__STREAM_1ge78feba9080b59fe0fff536a42c13e6d)(stream) tests whether all operations issued to the
specified stream have completed, without blocking host execution. The functions cudaEventSynchronize
(http://docs.nvidia.com/cuda/cuda-runtime-api/index.html#group__CUDART__EVENT_1g08241bcf5c5cb686b1882a8492f1e2d9)
(event) and cudaEventQuery (http://docs.nvidia.com/cuda/cuda-runtime-
api/index.html#group__CUDART__EVENT_1gf8e4ddb569b1da032c060f0c54da698f)(event) act similarly to their stream counterparts, except
that their result is based on whether a specified event has been recorded rather than whether a specified stream is idle. You can also synchronize
operations within a single stream on a specific event using cudaStreamWaitEvent (http://docs.nvidia.com/cuda/cuda-runtime-
api/index.html#group__CUDART__STREAM_1g80c62c379f0c3ed8afe31fd0a31ad8a2)(event) (even if the event is recorded in a different stream, or on a
different device!).
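As a brief sketch of these synchronization calls in action (reusing the increment kernel from earlier; error checking is omitted for brevity):

```cuda
cudaStream_t stream1;
cudaEvent_t event1;
cudaStreamCreate(&stream1);
cudaEventCreate(&event1);

cudaMemcpyAsync(d_a, a, numBytes, cudaMemcpyHostToDevice, stream1);
increment<<<1, N, 0, stream1>>>(d_a);
cudaEventRecord(event1, stream1);   // record an event after the kernel

// Poll the event without blocking the host thread
while (cudaEventQuery(event1) == cudaErrorNotReady) {
    // ... do independent host work here ...
}

cudaStreamSynchronize(stream1);     // block until all work in stream1 completes
cudaEventDestroy(event1);
cudaStreamDestroy(stream1);
```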
Overlapping Kernel Execution and Data Transfers
Earlier we demonstrated how to overlap kernel execution in the default stream with execution of code on the host. But our main goal in this post is to
show you how to overlap kernel execution with data transfers. There are several requirements for this to happen.
- The device must be capable of "concurrent copy and execution". This can be queried from the deviceOverlap field of a cudaDeviceProp struct, or from the output of the deviceQuery sample included with the CUDA SDK/Toolkit. Nearly all devices with compute capability 1.1 and higher have this capability.
- The kernel execution and the data transfer to be overlapped must both occur in different, non-default streams.
- The host memory involved in the data transfer must be pinned (http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/) memory.
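A minimal sketch of checking the first and third requirements might look like this (the device number and printf message are illustrative):

```cuda
#include <stdio.h>

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);    // query device 0
if (!prop.deviceOverlap)
    printf("Device cannot overlap copy and execution\n");

// Pinned host memory is required for asynchronous transfers
float *a;
cudaMallocHost((void**)&a, numBytes);
// ... use a in cudaMemcpyAsync calls ...
cudaFreeHost(a);
```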
So let’s modify our simple host code from above to use multiple streams and see if we can achieve any overlap. The full code for this example is
available on Github (https://github.com/parallel-forall/code-samples/blob/master/series/cuda-cpp/overlap-data-transfers/async.cu). In the modified
code, we break up the array of size N into chunks of streamSize elements. Since the kernel operates independently on all elements, each of the
chunks can be processed independently. The number of (non-default) streams used is nStreams=N/streamSize. There are multiple ways to
implement the domain decomposition of the data and processing; one is to loop over all the operations for each chunk of the array as in this example
code.
for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes, cudaMemcpyHostToDevice, stream[i]);
  kernel<<<streamSize/blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
  cudaMemcpyAsync(&a[offset], &d_a[offset], streamBytes, cudaMemcpyDeviceToHost, stream[i]);
}
Another approach is to batch similar operations together, issuing all the host-to-device transfers first, followed by all kernel launches, and then all
device-to-host transfers, as in the following code.
for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  cudaMemcpyAsync(&d_a[offset], &a[offset],
                  streamBytes, cudaMemcpyHostToDevice, stream[i]);
}

for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  kernel<<<streamSize/blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
}

for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  cudaMemcpyAsync(&a[offset], &d_a[offset],
                  streamBytes, cudaMemcpyDeviceToHost, stream[i]);
}
Both asynchronous methods shown above yield correct results, and in both cases dependent operations are issued to the same stream in the order in
which they need to be executed. But the two approaches perform very differently depending on the specific generation of GPU used. On a Tesla C1060
(compute capability 1.3) running the test code (from Github) gives the following results.
Device : Tesla C1060
Time for sequential transfer and execute (ms): 12.92381
  max error: 2.3841858E-07
Time for asynchronous V1 transfer and execute (ms): 13.63690
  max error: 2.3841858E-07
Time for asynchronous V2 transfer and execute (ms): 8.84588
  max error: 2.3841858E-07
On a Tesla C2050 (compute capability 2.0) we get the following results.
Device : Tesla C2050
Time for sequential transfer and execute (ms): 9.984512
  max error: 1.1920929e-07
Time for asynchronous V1 transfer and execute (ms): 5.735584
  max error: 1.1920929e-07
Time for asynchronous V2 transfer and execute (ms): 7.597984
  max error: 1.1920929e-07
Here the first time reported is the sequential transfer and kernel execution using blocking transfers, which we use as a baseline for asynchronous
speedup comparison. Why do the two asynchronous strategies perform differently on different architectures? To decipher these results we need to
understand a bit more about how CUDA devices schedule and execute tasks. CUDA devices contain engines for various tasks, which queue up
operations as they are issued. Dependencies between tasks in different engines are maintained, but within any engine all external dependencies are
lost; tasks in each engine’s queue are executed in the order they are issued. The C1060 has a single copy engine and a single kernel engine. A timeline
for the execution of our example code on a C1060 is shown in the following diagram.
[Diagram: execution timeline of the sequential and asynchronous versions of the example code on the C1060]
In the schematic we assume that the time required for the host-to-device transfer, kernel execution, and device-to-host transfer are approximately the
same (the kernel code was chosen in order to achieve this). As expected for the sequential kernel, there is no overlap in any of the operations. For the
first asynchronous version of our code the order of execution in the copy engine is: H2D stream(1), D2H stream(1), H2D stream(2), D2H stream(2), and
so forth. This is why we do not see any speed-up when using the first asynchronous version on the C1060: tasks were issued to the copy engine in an
order that precludes any overlap of kernel execution and data transfer. For version two, however, where all the host-to-device transfers are issued
before any of the device-to-host transfers, overlap is possible as indicated by the lower execution time. From our schematic, we expect the execution
of asynchronous version 2 to be 8/12 of the sequential version, or about 8.7 ms, which is confirmed in the timing results given previously.
On the C2050, two features interact to cause a behavior difference from the C1060. The C2050 has two copy engines, one for host-to-device transfers
and another for device-to-host transfers, as well as a single kernel engine. The following diagram illustrates execution of our example on the C2050.
Having two copy engines explains why asynchronous version 1 achieves good speed-up on the C2050: the device-to-host transfer of data in stream[i]
does not block the host-to-device transfer of data in stream[i+1] as it did on the C1060 because there is a separate engine for each copy direction on
the C2050. The schematic predicts the execution time to be cut in half relative to the sequential version, and this is roughly what our timing results
showed.
But what about the performance degradation observed in asynchronous version 2 on the C2050? This is related to the C2050's ability to concurrently run
multiple kernels. When multiple kernels are issued back-to-back in different (non-default) streams, the scheduler tries to enable concurrent
execution of these kernels and as a result delays a signal that normally occurs after each kernel completion (which is responsible for kicking off the
device-to-host transfer) until all kernels complete. So, while there is overlap between host-to-device transfers and kernel execution in the second
version of our asynchronous code, there is no overlap between kernel execution and device-to-host transfers. The schematic predicts an overall time for
the asynchronous version 2 to be 9/12 of the time for the sequential version, or 7.5 ms, and this is confirmed by our timing results.
A more detailed description of the example used in this post is available in CUDA Fortran Asynchronous Data Transfers
(http://www.pgroup.com/lit/articles/insider/v3n1a4.htm). The good news is that for devices with compute capability 3.5 (the K20 series), the Hyper-Q
feature eliminates the need to tailor the launch order, so either approach above will work. We will discuss using Kepler features in a future post, but
for now, here are the results of running the sample code on a Tesla K20c GPU. As you can see, both asynchronous methods achieve the same speedup
over the synchronous code.
Device : Tesla K20c
Time for sequential transfer and execute (ms): 7.101760
  max error: 1.1920929e-07
Time for asynchronous V1 transfer and execute (ms): 3.974144
  max error: 1.1920929e-07
Time for asynchronous V2 transfer and execute (ms): 3.967616
  max error: 1.1920929e-07
Summary
This post and the previous one (http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/) discussed how to optimize data
transfers between the host and device. The previous post focused on how to minimize the time for executing such transfers, and this post introduced
streams and how to use them to mask data transfer time by concurrently executing copies and kernels.
In a post dealing with streams I should mention that while using the default stream is convenient for developing code—synchronous code is simpler—
eventually your code should use non-default streams. This is especially important when writing libraries. If code in a library uses the default stream,
there is no chance for the end user to overlap data transfers with library kernel execution.
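For example, a library routine can accept a stream from the caller instead of implicitly using the default stream; this sketch (with a hypothetical scaleKernel and libraryScale, not from any real library) illustrates the pattern:

```cuda
__global__ void scaleKernel(float *d_data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= factor;
}

// Hypothetical library entry point: the caller chooses the stream, so the
// library's kernel can overlap with the caller's own transfers and kernels.
void libraryScale(float *d_data, int n, float factor, cudaStream_t stream)
{
    scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n, factor);
}
```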
Now you know how to move data efficiently between the host and device, so we’ll look at how to access data efficiently from within kernels in the next
post (http://devblogs.nvidia.com/parallelforall/how-access-global-memory-efficiently-cuda-c-kernels/).
∥∀
About Mark Harris
Mark is Chief Technologist for GPU Computing Software at NVIDIA. Mark has fifteen years of experience developing software for GPUs, ranging from graphics and games, to physically-based simulation, to parallel algorithms and high-performance computing. Mark has been using GPUs for general-purpose computing since before they even supported floating point arithmetic. While a Ph.D. student at UNC he recognized this nascent trend and coined a name for it: GPGPU (General-Purpose computing on Graphics Processing Units), and started GPGPU.org to provide a forum for those working in the field to share and discuss their work.