16bit 3D Convolution Implementation SSE + OpenMP Benchmarking on Penryn

Copyright © 2007 Intel Corporation.


Dr. Zvi Danovich, Senior Application Engineer

January 2008

Copyright © 2008 Intel Corporation.

Agenda

– Mathematics of 3D convolution
– Main idea of SSE implementation of 1D convolution
– Basic routine of algorithm: 2D convolution – 1 line
– Main routine of algorithm: 3D convolution – line by line
– Adding OpenMP, benchmarking, conclusions


3D convolution – what is it?

3D convolution (with a 3x3x3 kernel K) is computed for each pixel P as

$$P^{3D}_{l_0,m_0,n_0} = \sum_{l=l_0-1}^{l_0+1}\ \sum_{m=m_0-1}^{m_0+1}\ \sum_{n=n_0-1}^{n_0+1} K_{l-l_0+1,\,m-m_0+1,\,n-n_0+1}\; p_{l,m,n}$$

where p are the source pixels and K are the convolution kernel values.

In other words, each new pixel is the sum of 27 products of source pixel values with the corresponding kernel values inside the kernel cube:

[Figure: the 3x3x3 kernel cube drawn as three 3x3 planes of products K·p; P is the sum of all 27 products.]
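The definition above can be written down directly as a scalar reference, which is handy for checking any vectorized version. This is a minimal sketch, not the deck's code: the function name `conv3d_at`, the flat kernel layout (27 values, slice-major) and the volume layout (slice by slice, line by line) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: unnormalized 3D convolution of one interior pixel.
 * vol : 16-bit volume, assumed layout vol[(l * nm + m) * nn + n]
 * nm  : lines per slice, nn : pixels per line
 * K   : 27 kernel values, assumed layout K[(a * 3 + b) * 3 + c], a,b,c in 0..2
 * (l0, m0, n0) must not lie on the border of the volume. */
int32_t conv3d_at(const int16_t *vol, size_t nm, size_t nn,
                  const int32_t *K, size_t l0, size_t m0, size_t n0)
{
    assert(l0 >= 1 && m0 >= 1 && n0 >= 1 && m0 + 1 < nm && n0 + 1 < nn);

    int32_t sum = 0;
    for (size_t a = 0; a < 3; ++a)           /* slices l0-1 .. l0+1 */
        for (size_t b = 0; b < 3; ++b)       /* lines  m0-1 .. m0+1 */
            for (size_t c = 0; c < 3; ++c) { /* pixels n0-1 .. n0+1 */
                size_t l = l0 + a - 1, m = m0 + b - 1, n = n0 + c - 1;
                sum += K[(a * 3 + b) * 3 + c] * (int32_t)vol[(l * nm + m) * nn + n];
            }
    return sum;
}
```

The result is kept in 32 bits because a sum of 27 products of 16-bit pixels does not fit into 16 bits in general.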


Recombination from 1D convolutions

If 1D convolution is defined as

$$P^{1D}_{n_0} = \sum_{n=n_0-1}^{n_0+1} K_{n-n_0+1}\, p_n = K_0\, p_{n_0-1} + K_1\, p_{n_0} + K_2\, p_{n_0+1}$$

then the final line of the 3D convolution is

$$P^{3D}_{l_0,m_0} = \sum_{l=l_0-1}^{l_0+1}\ \sum_{m=m_0-1}^{m_0+1} P^{1D}_{l,m}$$

where $P^{1D}_{l,m}$ is the 1D convolution of line m of plane l with the kernel row $K_{l-l_0+1,\,m-m_0+1,\,\cdot}$. That is, 3D convolution can be presented as a double sum of 9 1D convolutions: 3 planes with 3 lines in each plane.
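The recombination can be illustrated with a short scalar sketch: one helper computes a 3-tap 1D convolution, and the 3D result is the double sum over 3 planes and 3 lines. The names `conv1d_at` and `conv3d_from_1d` and the flat memory layouts are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 1D convolution of interior pixel n0 of one line with a kernel row k[0..2]:
 * K0 * p[n0-1] + K1 * p[n0] + K2 * p[n0+1]. */
static int32_t conv1d_at(const int16_t *line, const int32_t *k, size_t n0)
{
    return k[0] * line[n0 - 1] + k[1] * line[n0] + k[2] * line[n0 + 1];
}

/* 3D convolution of one interior pixel as a double sum of nine 1D convolutions.
 * Assumed layouts: vol[(l * nm + m) * nn + n], K[(a * 3 + b) * 3 + c]. */
int32_t conv3d_from_1d(const int16_t *vol, size_t nm, size_t nn,
                       const int32_t *K, size_t l0, size_t m0, size_t n0)
{
    assert(l0 >= 1 && m0 >= 1 && n0 >= 1 && m0 + 1 < nm && n0 + 1 < nn);

    int32_t sum = 0;
    for (size_t a = 0; a < 3; ++a)          /* 3 planes */
        for (size_t b = 0; b < 3; ++b) {    /* 3 lines in each plane */
            const int16_t *line = vol + ((l0 + a - 1) * nm + (m0 + b - 1)) * nn;
            const int32_t *row = K + (a * 3 + b) * 3;   /* matching kernel row */
            sum += conv1d_at(line, row, n0);
        }
    return sum;
}
```

Each of the nine calls uses its own kernel row, which is why the vectorized routine later needs a separate set of k-, kc, k+ values per line.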





Main part of algorithm: 1D convolution – idea of implementation

Let us start from 3 sequential QUADs of a source line and multiply all three by the different K (kernel) values (denoted k-, kc, k+).

[Figure: source pixels p with indices -4..7 (three QUADs); multiplication gives three rows of products k-·p, kc·p and k+·p for the same indices; PALIGNR selects one QUAD from the k- row and one from the k+ row.]

Using PALIGNR, select the QUAD shifted left for the products with k- (pixels -1..2) and the QUAD shifted right for the products with k+ (pixels 1..4). Sum them up with the unshifted QUAD of products with kc (pixels 0..3):

$$P_0 = k_-\, p_{-1} + k_c\, p_0 + k_+\, p_1$$
$$P_1 = k_-\, p_0 + k_c\, p_1 + k_+\, p_2$$
$$P_2 = k_-\, p_1 + k_c\, p_2 + k_+\, p_3$$
$$P_3 = k_-\, p_2 + k_c\, p_3 + k_+\, p_4$$

The resulting sums are the convolution expressions for the central QUAD!
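The selection by PALIGNR can be modeled in plain C on arrays of four 32-bit lanes. This is a scalar sketch for illustration only, not SSE code: `alignr_lanes` is a hypothetical helper that mimics what `_mm_alignr_epi8(hi, lo, 4 * shift)` does for 32-bit lanes, and `conv1d_quad` is an assumed name.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PALIGNR on 32-bit lanes: view hi:lo as 8 lanes
 * (lo = lanes 0..3, hi = lanes 4..7) and take 4 lanes starting at `shift`. */
static void alignr_lanes(int32_t out[4], const int32_t hi[4],
                         const int32_t lo[4], int shift)
{
    assert(shift >= 0 && shift <= 4);
    int32_t cat[8];
    for (int i = 0; i < 4; ++i) { cat[i] = lo[i]; cat[i + 4] = hi[i]; }
    for (int i = 0; i < 4; ++i) out[i] = cat[i + shift];
}

/* 1D convolution of the central QUAD `cur` (pixels 0..3), given the
 * neighbouring QUADs `prev` (pixels -4..-1) and `next` (pixels 4..7).
 * Only the product QUADs that are actually selected are computed here. */
void conv1d_quad(int32_t P[4], const int32_t prev[4], const int32_t cur[4],
                 const int32_t next[4], int32_t km, int32_t kc, int32_t kp)
{
    int32_t prev_km[4], cur_km[4], cur_kc[4], cur_kp[4], next_kp[4];
    int32_t left[4], right[4];

    for (int i = 0; i < 4; ++i) {
        prev_km[i] = km * prev[i];
        cur_km[i]  = km * cur[i];
        cur_kc[i]  = kc * cur[i];
        cur_kp[i]  = kp * cur[i];
        next_kp[i] = kp * next[i];
    }
    alignr_lanes(left, cur_km, prev_km, 3);   /* k- * p[-1..2] */
    alignr_lanes(right, next_kp, cur_kp, 1);  /* k+ * p[1..4]  */

    for (int i = 0; i < 4; ++i)
        P[i] = left[i] + cur_kc[i] + right[i];
}
```

The multiplications are done once per QUAD and the neighbours are obtained by lane selection, so no unaligned loads of the source line are needed.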





Basic routine of algorithm: 2D convolution – 1 line

The main loop treats sequential EIGHTs of 16-bit pixels for 3 adjacent lines (unrolled inside 1 step). 1D convolution (in 32-bit form) is computed for the 2 QUADs of each EIGHT, and the results for the 3 lines are summed up, thereby forming the 2D convolution results.

To avoid using "if"s in the main loop, the very first step is separated into a prolog part, which is simpler than the general step.

Below is the description of the computations for 1 line (of the 3 lines) in a general main loop step.

It starts by loading an EIGHT of 16-bit source pixels and unpacking them into 2 32-bit QUADs:

[Figure: load an EIGHT of 16-bit source pixels p0..p7; two shuffles produce 16-bit layouts that are equivalent to the first unpacked 32-bit QUAD (p0..p3) and the second unpacked 32-bit QUAD (p4..p7).]
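The unpacking step can be modeled byte by byte. The sketch below assumes that the "Shuffle" is a byte shuffle that fills the upper half of every 32-bit lane with zeros, i.e. that the pixels are treated as non-negative values; the slide does not spell this out. The names `shuffle_bytes` and `unpack_eight` and both masks are hypothetical, and a little-endian host is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a byte shuffle (PSHUFB / _mm_shuffle_epi8): a mask byte
 * with the top bit set yields 0, otherwise its low 4 bits select a source byte. */
static void shuffle_bytes(uint8_t out[16], const uint8_t src[16],
                          const uint8_t mask[16])
{
    for (int i = 0; i < 16; ++i)
        out[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
}

/* Unpack an EIGHT of 16-bit pixels into two QUADs of 32-bit values. */
void unpack_eight(int32_t q0[4], int32_t q1[4], const uint16_t px[8])
{
    /* Each 32-bit lane = two pixel bytes followed by two zero bytes. */
    static const uint8_t lo_mask[16] = { 0,  1, 0x80, 0x80,  2,  3, 0x80, 0x80,
                                         4,  5, 0x80, 0x80,  6,  7, 0x80, 0x80};
    static const uint8_t hi_mask[16] = { 8,  9, 0x80, 0x80, 10, 11, 0x80, 0x80,
                                        12, 13, 0x80, 0x80, 14, 15, 0x80, 0x80};
    const uint16_t probe = 1;
    assert(*(const uint8_t *)&probe == 1);   /* little-endian host assumed */

    uint8_t src[16], a[16], b[16];
    memcpy(src, px, sizeof src);
    shuffle_bytes(a, src, lo_mask);          /* first QUAD:  p0..p3 */
    shuffle_bytes(b, src, hi_mask);          /* second QUAD: p4..p7 */
    memcpy(q0, a, sizeof a);
    memcpy(q1, b, sizeof b);
}
```

A 16-bit value followed by 16 zero bits is the same bit pattern as the zero-extended 32-bit value, which is presumably the "Equivalence" shown in the figure.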


Multiply the 2 QUADs obtained by the unpacking above by three different K values (denoted k-, kc, k+), resulting in 6 product QUADs. Treat them together with the 2 similar product QUADs saved at the previous loop step.

[Figure: source pixels 0..7 with their k-, kc and k+ product QUADs (multiplication: SSE4 mullo_epi32), next to the k- and k+ product QUADs of pixels -4..-1 saved from the previous step; colored frames mark the sums (1), (2) and (Prev) listed below.]

Using PALIGNR, select the appropriate QUADs and start or continue forming 3 sum QUADs:

– (1) RED frame: 2D convolution of the 1st source QUAD; it will be finalized and stored at the end of the current step,

– (2) GREEN frame: 2D convolution of the 2nd source QUAD; it will be finalized and stored at the end of the next step or of the epilog,

– (Prev) YELLOW frame: 2D convolution of the previous 2nd source QUAD; it will be finalized and stored at the end of the current step.

Therefore, at the end of the current step, 2 resulting 2D convolution QUADs, the PREVIOUS 2nd and the CURRENT 1st, will be stored.


Basic routine of algorithm: 2D convolution – 1 line: finalizing

As already mentioned, each step treats and sums up data from 3 adjacent lines: it performs the computations from the previous slides for the 2 other lines as well, with the corresponding sets of kernel components.

The prolog step does not include the PREVIOUS sum computation and therefore does not store it.

The epilog step includes the computation and store of the very last 2D convolution QUAD, which is fully similar to the PREVIOUS computation in a regular step.

Finally, the above routine builds ONE 32-bit line of 2D convolution result points.
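The step, prolog and epilog described above can be sketched as a scalar model that carries the saved product QUADs and the unfinished sum from step to step. This is an illustrative sketch under several assumptions, not the original routine: it handles one source line per call and adds into the output (so three calls with different kernel rows give one 2D line), it takes pixels that are already unpacked to 32 bits, it treats pixels outside the line as zero, and all names (`quad`, `carry`, `conv_line_add`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int32_t v[4]; } quad;   /* scalar stand-in for one XMM register */

static quad load(const int32_t *p)  { quad q; for (int i = 0; i < 4; ++i) q.v[i] = p[i]; return q; }
static quad mul(quad a, int32_t k)  { for (int i = 0; i < 4; ++i) a.v[i] *= k; return a; }
static quad add(quad a, quad b)     { for (int i = 0; i < 4; ++i) a.v[i] += b.v[i]; return a; }
static void accumulate(int32_t *out, quad q) { for (int i = 0; i < 4; ++i) out[i] += q.v[i]; }

/* PALIGNR model: 4 lanes of hi:lo starting at lane `shift` (0..4). */
static quad alignr(quad hi, quad lo, int shift)
{
    quad r;
    for (int i = 0; i < 4; ++i) {
        int j = i + shift;
        r.v[i] = (j < 4) ? lo.v[j] : hi.v[j - 4];
    }
    return r;
}

/* State carried between steps: k- and k+ products of the last 2nd QUAD,
 * and its unfinished sum (still missing the k+ term). */
typedef struct { quad km, kp, pending; } carry;

/* RED and GREEN work, shared by the prolog and the general step. */
static void red_green(int32_t *out, quad q0, quad q1, carry *c,
                      int32_t km, int32_t kc, int32_t kp)
{
    quad q0_km = mul(q0, km), q1_km = mul(q1, km);
    quad q0_kp = mul(q0, kp), q1_kp = mul(q1, kp);

    /* RED: 1st QUAD, finalized within this step. */
    quad red = add(add(alignr(q0_km, c->km, 3), mul(q0, kc)),
                   alignr(q1_kp, q0_kp, 1));
    accumulate(out, red);

    /* GREEN: 2nd QUAD, finalized at the next step or in the epilog. */
    c->pending = add(alignr(q1_km, q0_km, 3), mul(q1, kc));
    c->km = q1_km;
    c->kp = q1_kp;
}

/* Adds the 1D convolution of `src` (n pixels, n a multiple of 8) to `out`. */
void conv_line_add(int32_t *out, const int32_t *src, size_t n,
                   int32_t km, int32_t kc, int32_t kp)
{
    assert(n >= 8 && n % 8 == 0);
    const quad zero = {{0, 0, 0, 0}};
    carry c = { zero, zero, zero };

    /* Prolog: first EIGHT, there is no PREVIOUS QUAD to finish. */
    red_green(out, load(src), load(src + 4), &c, km, kc, kp);

    /* General steps, free of "if"s. */
    for (size_t x = 8; x < n; x += 8) {
        quad q0 = load(src + x), q1 = load(src + x + 4);
        /* PREV (YELLOW): add the k+ term that had to wait for pixel x. */
        accumulate(out + x - 4, add(c.pending, alignr(mul(q0, kp), c.kp, 1)));
        red_green(out + x, q0, q1, &c, km, kc, kp);
    }

    /* Epilog: finish the very last QUAD, like PREV in a regular step. */
    accumulate(out + n - 4, add(c.pending, alignr(zero, c.kp, 1)));
}
```

In this model the k+ product of the first QUAD is computed twice per step for brevity; a real implementation would compute each product QUAD once and reuse it.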





Main routine of algorithm: 3D convolution – line by line

To build the full 3D convolution stack, this routine runs over the lines (inner loop) of all slices (outer loop).

For each source line, it computes three 32-bit 2D convolution lines, based on the previous, current and next slices, using the "2D convolution – 1 line" routine described above.

[Figure: slices -1 (previous), 0 (current) and 1 (next), each contributing lines -1, 0 and 1 to a 2D convolution; the three 2D convolution lines are then summed up.]

The resulting 3D convolution line is built by summing up these 3 lines, normalizing by an arithmetic shift and converting the result to 16 bits as follows:

[Figure: QUADs 0..3 and 4..7 of the three 2D convolution lines (labeled Line -1, Line 0 and Line +1) are summed into a 32-bit 3D convolution EIGHT; after the shift the values actually fit in 16 bits; packs_epi32 packs them into the final 16-bit 3D convolution EIGHT, which is then stored.]
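The final sum, shift and pack can be sketched in scalar form. The sketch assumes that the normalization shift is a caller-supplied bit count and models the per-lane behaviour of `packs_epi32` (signed saturation to 16 bits); the names `saturate16` and `finish_line` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Signed saturation of a 32-bit value to 16 bits, which is what
 * PACKSSDW (_mm_packs_epi32) does for every lane. */
static int16_t saturate16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Sum three 32-bit 2D convolution lines, normalize by a right shift and
 * convert to 16 bits. Note: right-shifting a negative value is
 * implementation-defined in C; common compilers shift arithmetically. */
void finish_line(int16_t *dst, const int32_t *a, const int32_t *b,
                 const int32_t *c, size_t n, int shift)
{
    assert(shift >= 0 && shift < 32);
    for (size_t i = 0; i < n; ++i) {
        int32_t sum = a[i] + b[i] + c[i];
        dst[i] = saturate16(sum >> shift);
    }
}
```

If the kernel is scaled so that the shifted result already fits in 16 bits, as the figure suggests, the saturation never triggers and the pack is a pure narrowing.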





Parallelizing by OpenMP and benchmarking

To parallelize the above algorithm with OpenMP over the outer (slices) loop, three 32-bit working lines are allocated for each thread.

Benchmarks with and without OpenMP on a 2-way HPTN machine (8 cores):

3 runs (equivalent of 3D gradient computation), SSE only vs. SSE+OpenMP:
Serial/SSE = ~3, SSE/(SSE+OpenMP) = ~5.5, Serial/(SSE+OpenMP) = ~16.3

10 runs, SSE only vs. SSE+OpenMP:
Serial/SSE = ~3, SSE/(SSE+OpenMP) = ~6.3, Serial/(SSE+OpenMP) = ~18.6

The SSE speed-up (3x) is close to the theoretical limit for vector operations on four 32-bit values!

The additional OpenMP speed-up (5.5x-6.3x) brings the overall speed-up to 16.3x-18.6x!
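The per-thread working lines described in this section can be sketched as follows. This is a scalar illustration of the parallel structure only, assuming a pool of three 32-bit lines per thread that is indexed by the thread number; `conv2d_line` is a hypothetical plain-C stand-in for the SSE "2D convolution – 1 line" routine, border pixels are left untouched, and the memory layouts are assumptions. Without OpenMP enabled the pragma is ignored and the code runs serially.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_max_threads(void) { return 1; }   /* serial fallbacks */
static int omp_get_thread_num(void)  { return 0; }
#endif

/* Hypothetical stand-in for the SSE line routine: 2D convolution of the
 * interior of line m of one slice with one 3x3 kernel plane K2 (9 values). */
static void conv2d_line(int32_t *work, const int16_t *slice, size_t nn,
                        size_t m, const int32_t *K2)
{
    for (size_t n = 1; n + 1 < nn; ++n) {
        int32_t s = 0;
        for (size_t b = 0; b < 3; ++b)
            for (size_t c = 0; c < 3; ++c)
                s += K2[b * 3 + c] * slice[(m + b - 1) * nn + (n + c - 1)];
        work[n] = s;
    }
}

/* 3D convolution of the interior of a volume (nl slices, nm lines, nn pixels),
 * parallelized over slices. Returns 0 on success, -1 if allocation fails. */
int conv3d_omp(int16_t *dst, const int16_t *src, size_t nl, size_t nm,
               size_t nn, const int32_t *K, int shift)
{
    assert(shift >= 0 && shift < 32);
    size_t nthreads = (size_t)omp_get_max_threads();
    int32_t *pool = malloc(nthreads * 3 * nn * sizeof *pool);
    if (!pool) return -1;

    #pragma omp parallel for
    for (long l = 1; l < (long)nl - 1; ++l) {
        /* Three private 32-bit working lines for this thread. */
        int32_t *work = pool + (size_t)omp_get_thread_num() * 3 * nn;
        for (size_t m = 1; m + 1 < nm; ++m) {
            for (size_t a = 0; a < 3; ++a)   /* previous, current, next slice */
                conv2d_line(work + a * nn,
                            src + ((size_t)l + a - 1) * nm * nn, nn, m, K + a * 9);
            for (size_t n = 1; n + 1 < nn; ++n) {
                int32_t s = (work[n] + work[nn + n] + work[2 * nn + n]) >> shift;
                if (s > INT16_MAX) s = INT16_MAX;   /* saturate like packs_epi32 */
                if (s < INT16_MIN) s = INT16_MIN;
                dst[((size_t)l * nm + m) * nn + n] = (int16_t)s;
            }
        }
    }
    free(pool);
    return 0;
}
```

Slices are independent of each other, so the only shared state that needs separating is the set of working lines, which is why one private set per thread is enough.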
