Thur., Nov 14, 2013
Pin Yi Tsai
WEEKLY REPORT
OUTLINE
• Current Work
• Compute Integral Image – computeByRow
  – Using shared memory
  – Using register
  – Result
• CUDA Memory Architecture
USING SHARED MEMORY
• Scope: block
• Shared memory stores the values of the previous row
• Computing row by row for img[*][y] and img[*][y+1]:
• Time t: compute img[*][y] + shared memory[*], then store the result back into shared memory[*]
• Time t+1: compute img[*][y+1] + shared memory[*]
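The scheme above can be sketched as a kernel in which each thread owns one column and the shared array carries the running sum of the previous row (kernel name, launch configuration, and in-place update are assumptions, not the author's exact code):

```cuda
// Sketch of the shared-memory variant: one thread per column;
// prevRow[threadIdx.x] holds the accumulated value of the row above.
__global__ void computeByRowShared(float *img, int width, int height)
{
    extern __shared__ float prevRow[];   // one float per column in the block
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= width) return;

    prevRow[threadIdx.x] = 0.0f;         // no row above row 0
    for (int y = 0; y < height; ++y) {
        // time t: img[*][y] + shared memory[*]
        float sum = img[y * width + x] + prevRow[threadIdx.x];
        prevRow[threadIdx.x] = sum;      // store back for time t+1
        img[y * width + x] = sum;        // write the partial integral in place
    }
}
```

Note that each thread only ever reads and writes its own shared-memory slot, so no `__syncthreads()` is needed here — which is exactly why the register variant on the next slide can drop shared memory altogether.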
USING REGISTER
• Scope: thread
• One row, one thread
• Why not one pixel, one thread? To avoid the cost of __syncthreads()
• A register stores the value of the previous pixel
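A minimal sketch of the register variant, under the same assumptions as above (kernel name and launch configuration are hypothetical): each thread scans one row, keeping the running sum of the previous pixel in a register, so neither shared memory nor `__syncthreads()` is required.

```cuda
// Sketch of the register variant: one thread per row; the running
// sum of the previous pixel lives in a register.
__global__ void computeByRowRegister(float *img, int width, int height)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    if (y >= height) return;

    float sum = 0.0f;                  // value of the previous pixel
    for (int x = 0; x < width; ++x) {
        sum += img[y * width + x];     // prefix sum along the row
        img[y * width + x] = sum;      // write the partial integral in place
    }
}
```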
RESULT
• 16x16
• Serial version: 0.006336 ms
• Parallel version: 5.88559e-39 ms

======== Profiling result:
Time(%)      Time  Calls      Avg      Min      Max  Name
  55.69   18.91us      1  18.91us  18.91us  18.91us  computeByRow(float*, int, int)
  25.84    8.78us      1   8.78us   8.78us   8.78us  computeByColumn(float*, int, int)
  12.91    4.38us      2   2.19us   2.18us   2.21us  [CUDA memcpy DtoH]
   5.56    1.89us      2    944ns    928ns    960ns  [CUDA memcpy HtoD]
RESULT (CONT.)
• 640x480
• Serial version: 5.1607 ms
• Parallel version: 4.40496 ms
======== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
  66.37    2.19ms      1    2.19ms    2.19ms    2.19ms  computeByRow(float*, int, int)
  12.75  419.74us      2  209.87us  209.28us  210.46us  [CUDA memcpy HtoD]
  11.74  386.43us      2  193.22us  191.04us  195.39us  [CUDA memcpy DtoH]
   9.15  301.24us      1  301.24us  301.24us  301.24us  computeByColumn(float*, int, int)
CUDA MEMORY ARCHITECTURE
The End