Exploiting Both Pipelining and Data Parallelism with SIMD RA
Yongjoo Kim*, Jongeun Lee**, Jinyong Lee*, Toan Mai**, Ingoo Heo* and Yunheung Paek*
*Seoul National University**UNIST (Ulsan National Institute of Science & Technology)
ARC March 21, 2012Hong Kong
Reconfigurable Architecture
Reconfigurable architecture:
- High performance
- Flexible (cf. ASIC)
- Energy efficient (cf. GPU)
Source: ChipDesignMag.com
Coarse-Grained Reconfigurable Architecture
Coarse-grained RA:
- Word-level granularity
- Dynamic reconfigurability
- Simpler to compile
[Figure: execution model — a main processor coupled with a CGRA, main memory, and a DMA controller; example CGRAs: MorphoSys, ADRES]
Application Mapping
Place and route the DFG onto the PE array mapping space. The mapping must satisfy several constraints:
- Nodes must be mapped onto PEs that have the right functionality
- Data transfer between nodes must be guaranteed
- Resource consumption should be minimized for performance
[Figure: compilation flow — the application passes through a front-end to an IR; a partitioner splits it into sequential code, handled by conventional C compilation, and loops, which go through DFG generation and place & route (mapping for CGRA, driven by the <DFG> and <CGRA> architecture parameters); an extended assembler combines the assembly and configuration into executable + configuration]
Modulo scheduling-based mapping
Software Pipelining
[Figure: a 5-node DFG (nodes 0-4 operating on A[i], B[i], C[i]) software-pipelined onto a 2x2 PE array (PE0-PE3); successive iterations start every 2 cycles, i.e. II = 2 cycles. II: initiation interval]
Problem: Scalability
Large-scale CGRAs suffer from several problems:
- Lack of parallelism: general applications have limited ILP, and unrolling inflates the configuration size
- Placement and routing must search a very large mapping space, so compilation time skyrockets
As a result, CGRAs remain at 4x4, or 8x8 at the most.
Overview
Background
SIMD Reconfigurable Architecture (SIMD RA)
Mapping on SIMD RA
Evaluation
SIMD Reconfigurable Architecture
- Consists of multiple identical parts, called cores
- Cores are identical so that configurations can be reused
- At least one load-store PE in each core
[Figure: cores 1-4 connected through a crossbar switch to memory banks 1-4]
Advantages of SIMD RA
- More iterations are executed in parallel, scaling with the PE array size
- Short compilation time, thanks to the small per-core mapping space
- Achieves a denser scheduled configuration, hence higher utilization and performance
Constraint: the loop must not have loop-carried dependences.
[Figure: timeline comparison — one large core runs iterations 0-5 one after another, while four small cores (cores 1-4) run iterations 0-11 four at a time, completing more iterations in the same time]
Overview
Background
SIMD Reconfigurable Architecture (SIMD RA)
Bank Conflict Minimization in SIMD RA
Evaluation
Problems of SIMD RA mapping
- A new mapping problem: iteration-to-core mapping
- The iteration mapping affects performance: combined with the data mapping, it determines the number of bank conflicts
for (i = 0; i < 15; i++) { B[i] = A[i] + B[i]; }
Mapping schemes
For the 15-iteration loop above, mapped onto cores 1-4:
- Iteration-to-core mapping
  - Sequential: core 1 gets iter. 0-3, core 2 iter. 4-7, core 3 iter. 8-11, core 4 iter. 12-14
  - Interleaving: core 1 gets iter. 0, 4, 8, 12; core 2 iter. 1, 5, 9, 13; core 3 iter. 2, 6, 10, 14; core 4 iter. 3, 7, 11
- Data mapping (across the four crossbar-connected banks)
  - Sequential: elements are laid out contiguously, e.g. bank 1 begins A[0], A[1], A[2], A[3], A[4], A[5], … and the banks together hold A[0]-A[14] followed by B[0]-B[14] in order
  - Interleaving: elements are distributed round-robin, e.g. bank 1 holds A[0], A[4], A[8], A[12], B[1], B[5], B[9], B[13]; bank 2 holds A[1], A[5], A[9], A[13], B[2], B[6], B[10], B[14]; bank 3 holds A[2], A[6], A[10], A[14], B[3], B[7], B[11]; bank 4 holds A[3], A[7], A[11], B[0], B[4], B[8], B[12]
for (i = 0; i < 15; i++) { B[i] = A[i] + B[i]; }
Interleaving data placement
- With interleaving data placement, interleaved iteration assignment is better than sequential iteration assignment.
- Weak with strided accesses: a strided load (e.g. A[2i] instead of A[i]) reduces the number of utilized banks and increases bank conflicts.
[Figure: interleaved iteration assignment over the interleaved bank layout; a configuration that loads A[i] spreads its accesses over all four banks, while one that loads A[2i] hits only half of them]
Sequential data placement
- By itself, it cannot work well with SIMD mapping: it causes frequent bank conflicts.
- Data tiling fixes this by (i) modifying array base addresses and (ii) rearranging data in the local memory.
- Sequential iteration assignment with data tiling suits SIMD mapping.
[Figure: after data tiling, bank 1 holds A[0]-A[3] and B[0]-B[3], bank 2 holds A[4]-A[7] and B[4]-B[7], bank 3 holds A[8]-A[11] and B[8]-B[11], and bank 4 holds A[12]-A[14] and B[12]-B[14], so with sequential iteration assignment each core's loads stay within its own bank]
Summary of Mapping Combinations Analysis
Two of the four combinations have strong advantages:
- Interleaved iteration, interleaved data mapping: simple data management, but weak for strided accesses
- Sequential iteration, sequential data mapping (with data tiling): more robust against bank conflicts, but incurs data rearranging overhead
Experimental Setup
- Sets of loop kernels from OpenCV, multimedia, and SPEC2000 benchmarks
- Target system:
  - Two CGRA sizes: 4x4 and 8x4
  - 2x2 cores, each with one load-store PE and one multiplier PE
  - Mesh + diagonal connections between PEs
  - Full crossbar switch between PEs and local memory banks
Configuration Size
- Compared with non-SIMD mapping: "Original" is the previous non-SIMD mapping; "SIMD" is our approach (interleaving-interleaving mapping)
- Configuration size is reduced by 61% on the 4x4 CGRA and by 79% on the 8x4 CGRA
Runtime
[Chart: runtime normalized to the 4x4 Original, split into stall and non-stall time, comparing Original (with unrolling factors up to U8) against SIMD on 4x4 and 8x4 CGRAs for Swim1, Swim2, Swim3, Laplace, Wavelet, CalcHarris, CvtColor, DotProduct, Gaussian, Erode, and the average; the average bars are annotated 29% (4x4) and 32% (8x4)]
Conclusion
- Presented the SIMD reconfigurable architecture, which exploits data parallelism and instruction-level parallelism at the same time
- Advantages:
  - Scales well to a large number of PEs
  - Alleviates the growth in compilation time
  - Increases performance and reduces configuration size
Thank you!
Core size
- For a large loop, a small core might not be a good match
- Merge multiple cores ⇒ macrocore
- No hardware modification required
[Figure: cores 1-4 over the crossbar switch and banks 1-4; cores are paired into Macrocore 1 and Macrocore 2]
SIMD RA mapping flow
[Flow: check the SIMD requirement (on failure, fall back to traditional mapping) → select core size → iteration mapping with implicit array placement (interleaved-interleaved or sequential with data tiling) → operation mapping via modulo scheduling → data tiling; if scheduling fails, increase II and repeat; if scheduling fails and MaxII < II, increase the core size]