Thur., Nov 14, 2013
Pin Yi Tsai
WEEKLY REPORT
OUTLINE
• Current Work
• Compute Integral Image – computeByRow
  – Using shared memory
  – Using register
  – Result
• CUDA Memory Architecture
USING SHARED MEMORY
• Scope: block
• Shared memory stores the values of the previous row
• Computing row by row for img[*][y] and img[*][y+1]:
• Time t: compute img[*][y] + shared memory[*], then store the result back into shared memory[*]
• Time t+1: compute img[*][y+1] + shared memory[*]
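The scheme above can be sketched as a kernel in which each thread owns one column and the shared array carries the running sum of the previous row (kernel name, launch configuration, and in-place update are assumptions, not the author's exact code):

```cuda
// Sketch of the shared-memory variant: one thread per column;
// prevRow[threadIdx.x] holds the accumulated value of the row above.
__global__ void computeByRowShared(float *img, int width, int height)
{
    extern __shared__ float prevRow[];   // one float per column in the block
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= width) return;

    prevRow[threadIdx.x] = 0.0f;         // no row above row 0
    for (int y = 0; y < height; ++y) {
        // time t: img[*][y] + shared memory[*]
        float sum = img[y * width + x] + prevRow[threadIdx.x];
        prevRow[threadIdx.x] = sum;      // store back for time t+1
        img[y * width + x] = sum;        // write the partial integral in place
    }
}
```

Note that each thread only ever reads and writes its own shared-memory slot, so no `__syncthreads()` is needed here — which is exactly why the register variant on the next slide can drop shared memory altogether.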
USING REGISTER
• Scope: thread
• One row, one thread
• Why not one pixel, one thread? To avoid the cost of __syncthreads()
• A register stores the value of the previous pixel
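A minimal sketch of the register variant, under the same assumptions as above (kernel name and launch configuration are hypothetical): each thread scans one row, keeping the running sum of the previous pixel in a register, so neither shared memory nor `__syncthreads()` is required.

```cuda
// Sketch of the register variant: one thread per row; the running
// sum of the previous pixel lives in a register.
__global__ void computeByRowRegister(float *img, int width, int height)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    if (y >= height) return;

    float sum = 0.0f;                  // value of the previous pixel
    for (int x = 0; x < width; ++x) {
        sum += img[y * width + x];     // prefix sum along the row
        img[y * width + x] = sum;      // write the partial integral in place
    }
}
```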
RESULT
• 16x16
• Serial version: 0.006336 ms
• Parallel version: 5.88559e-39 ms

======== Profiling result:
Time(%)      Time  Calls      Avg      Min      Max  Name
  55.69   18.91us      1  18.91us  18.91us  18.91us  computeByRow(float*, int, int)
  25.84    8.78us      1   8.78us   8.78us   8.78us  computeByColumn(float*, int, int)
  12.91    4.38us      2   2.19us   2.18us   2.21us  [CUDA memcpy DtoH]
   5.56    1.89us      2    944ns    928ns    960ns  [CUDA memcpy HtoD]
RESULT (CONT.)
• 640x480
• Serial version: 5.1607 ms
• Parallel version: 4.40496 ms
======== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
  66.37    2.19ms      1    2.19ms    2.19ms    2.19ms  computeByRow(float*, int, int)
  12.75  419.74us      2  209.87us  209.28us  210.46us  [CUDA memcpy HtoD]
  11.74  386.43us      2  193.22us  191.04us  195.39us  [CUDA memcpy DtoH]
   9.15  301.24us      1  301.24us  301.24us  301.24us  computeByColumn(float*, int, int)
CUDA MEMORY ARCHITECTURE
The End