Matching Memory Access Patterns and Data Placement for NUMA Systems

Zoltán Majó, Thomas R. Gross
Computer Science Department, ETH Zurich, Switzerland
Non-uniform memory architecture

[Figure: two-processor NUMA system. Processor 0 (cores 0–3) and Processor 1 (cores 4–7), each with a memory controller (MC), an interconnect link (IC), and locally attached DRAM. A thread T accesses data that may reside in either DRAM.]

Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles
Remote memory accesses: bandwidth 6.3 GB/s, latency 310 cycles

Key to good performance: data locality

All data based on experimental evaluation of the Intel Xeon 5500 (Hackenberg [MICRO ’09], Molka [PACT ’09])
Data locality in multithreaded programs

[Chart: remote memory references / total memory references [%] for the NAS Parallel Benchmarks cg.B, lu.C, ft.B, ep.C, bt.B, sp.B, is.B, and mg.C (y-axis: 0–60%).]
Outline

Automatic page placement
Memory access patterns of matrix-based computations
Matching memory access patterns and data placement
Evaluation
Conclusions
Automatic page placement

Current OS support for NUMA: first-touch page placement, which often results in a high number of remote accesses.
Alternative: data address profiling and profile-based page placement; data address profiling is supported in hardware on many architectures. A minimal sketch of the idea follows.
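As an illustration (not the authors' implementation), here is a minimal sketch of profile-based placement on Linux using move_pages(2) from libnuma: given per-page access counts collected by a profiler, each page is migrated to the NUMA node whose threads access it most. The page_profile structure and its fields are hypothetical; only the migration call is a real API (link with -lnuma).

    #include <numaif.h>   /* move_pages(2), from libnuma */
    #include <stdio.h>

    /* Hypothetical profile record: per-page access counts per NUMA node,
       as a data-address profiler might produce them. */
    struct page_profile {
        void *page;        /* page-aligned address */
        long  counts[2];   /* accesses from threads on node 0 and node 1 */
    };

    /* Migrate every profiled page to the node whose threads access it most. */
    static void place_pages(struct page_profile *prof, unsigned long n)
    {
        for (unsigned long i = 0; i < n; i++) {
            int   target = prof[i].counts[1] > prof[i].counts[0] ? 1 : 0;
            int   status = 0;
            void *page   = prof[i].page;
            if (move_pages(0 /* this process */, 1, &page, &target,
                           &status, MPOL_MF_MOVE) != 0)
                perror("move_pages");
        }
    }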
Profile-based page placement
Based on the work of Marathe et al. [JPDC 2010, PPoPP 2006]

[Figure: thread T0 runs on Processor 0, T1 on Processor 1. Profile: page P0 is accessed 1000 times by T0, page P1 is accessed 3000 times by T1; accordingly, P0 is placed in Processor 0's DRAM and P1 in Processor 1's DRAM.]
Automatic page placement

Comparison: first-touch vs. profile-based page placement
Machine: 2-processor, 8-core Intel Xeon E5520
Benchmarks: subset of the NAS Parallel Benchmarks with a high fraction of remote accesses
8 threads with a fixed thread-to-core mapping
Profile-based page placement

[Chart: performance improvement over first-touch [%] for cg.B, lu.C, bt.B, ft.B, and sp.B (y-axis: 0–25%).]
Inter-processor data sharing

[Figure: thread T0 runs on Processor 0, T1 on Processor 1. Profile: P0 is accessed 1000 times by T0; P1 is accessed 3000 times by T1; P2 is accessed 4000 times by T0 and 5000 times by T1. P2 is inter-processor shared: wherever it is placed, some of its accesses are remote.]
Inter-processor data sharing

[Chart: inter-processor shared heap relative to total heap (shared heap / total heap [%]) for cg.B, lu.C, bt.B, ft.B, and sp.B (y-axis: 0–60%).]
Inter-processor data sharing

[Chart: inter-processor shared heap relative to total heap (left axis, 0–60%) and performance improvement over first-touch (right axis, 0–30%) for cg.B, lu.C, bt.B, ft.B, and sp.B.]
Automatic page placement

Profile-based page placement is often ineffective; the reason is inter-processor data sharing.
Inter-processor data sharing is a program property, so we take a detailed look at program memory access patterns.
Scope: loop-parallel programs with OpenMP-like parallelization, matrix processing, NAS BT.
Matrix processing

Process m[NX][NY] sequentially:

    for (i = 0; i < NX; i++)
      for (j = 0; j < NY; j++)
        // access m[i][j]
Matrix processing

Process m[NX][NY] x-wise parallel:

    #pragma omp parallel for
    for (i = 0; i < NX; i++)
      for (j = 0; j < NY; j++)
        // access m[i][j]

[Figure: the NX rows of m are divided into eight contiguous blocks, processed by threads T0–T7 from top to bottom.]
Thread scheduling

Remember: fixed thread-to-core mapping.

[Figure: threads T0–T3 run on the cores of Processor 0, threads T4–T7 on the cores of Processor 1.]

One possible way to obtain such a mapping is sketched below.
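The talk does not say how the threads are pinned. As one possible realization (an assumption, not necessarily the setup used in the paper), each OpenMP thread can pin itself to the core with its own thread number, assuming cores 0–3 belong to Processor 0 and cores 4–7 to Processor 1 (compile with -fopenmp):

    #define _GNU_SOURCE
    #include <omp.h>
    #include <pthread.h>
    #include <sched.h>

    /* Pin OpenMP thread k to core k, so that T0-T3 run on Processor 0 and
       T4-T7 on Processor 1 under the assumed core numbering. */
    static void pin_threads(void)
    {
        #pragma omp parallel
        {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(omp_get_thread_num(), &set);
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        }
    }

With GCC's OpenMP runtime the same effect can also be obtained by setting GOMP_CPU_AFFINITY=0-7 in the environment.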
Matrix processing

Process m x-wise parallel (same loop as above, under the fixed thread-to-core mapping):

[Figure: the matrix is allocated row-block-wise so that each thread's rows reside on the processor the thread runs on; all accesses are local.]
Matrix processing

Process m[NX][NY] y-wise parallel:

    for (i = 0; i < NX; i++)
      #pragma omp parallel for
      for (j = 0; j < NY; j++)
        // access m[i][j]

[Figure: the NY columns of m are divided into eight blocks, processed by threads T0–T7 from left to right; the matching allocation places one half of the matrix at Processor 0 and the other half at Processor 1.]
Example: NAS BT

Time-step iteration over m[NX][NY]:

    for (t = 0; t < TMAX; t++) {
      x_wise();
      y_wise();
    }

[Figure: x_wise() processes m in row blocks (one per thread T0–T7), y_wise() processes m in column blocks.]
Example: NAS BT

Because x_wise() and y_wise() alternate in every time step, an allocation that is appropriate for both access patterns is not possible.

Result: inter-processor shared heap 35%, remote accesses 19%.
Solution?

1. Adjust data placement: the high overhead of runtime data migration cancels the benefit.
2. Adjust iteration scheduling: limited by data dependences.
3. Adjust data placement and iteration scheduling together.
API

Library for data placement: a set of common data distributions (a sketch of the assumed interface follows below).
Affinity-aware loop iteration scheduling: an extension to GCC's OpenMP implementation.
Example use case: NAS BT.
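To make the later call sites easier to read, here is a sketch of what the library interface might look like. Only the names distr_t, block_exclusive_distr, and distribute_to appear in the talk; the exact types and signatures below are assumptions.

    #include <stddef.h>   /* size_t */

    /* Assumed shape of the data-placement library (signatures are guesses). */
    typedef struct distr distr_t;   /* opaque handle describing a distribution */

    /* Describe a blocked-exclusive distribution of 'data' (total 'size' bytes)
       with the given block size in bytes (here: half a row of the matrix). */
    distr_t *block_exclusive_distr(void *data, size_t size, size_t block_size);

    /* Apply the distribution: place each block on its designated NUMA node. */
    void distribute_to(distr_t *distr);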
Use-case: NAS BT

Remember: BT has two incompatible access patterns, repeated x-wise and y-wise access to the same data.
Idea: a data placement that accommodates both access patterns.

Blocked-exclusive data placement
[Figure: m is split into four quadrants; two quadrants are allocated at Processor 0 and the two diagonally opposite ones at Processor 1.]

A sketch of how such a placement could be realized is given below.
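As an illustration only, here is a hypothetical sketch of what block_exclusive_distr()/distribute_to() might do underneath on Linux, using mbind(2) from libnuma to bind each half-row to one of the two nodes. It assumes m is page-aligned and that a half-row is a whole number of pages (with 1024 doubles per row, a half-row is exactly one 4 KiB page); real code must round regions to page boundaries. This is not the authors' library.

    #include <numaif.h>   /* mbind(2), from libnuma; link with -lnuma */

    enum { NX = 1024, NY = 1024 };   /* illustrative sizes */

    /* Bind each half-row of m so that diagonally opposite quadrants
       end up on the same processor (node 0 or node 1). */
    static void distribute_blocked_exclusive(double (*m)[NY])
    {
        for (int i = 0; i < NX; i++) {
            int upper = (i < NX / 2);
            unsigned long left_node  = upper ? 1UL << 0 : 1UL << 1;  /* node bitmasks */
            unsigned long right_node = upper ? 1UL << 1 : 1UL << 0;
            mbind(&m[i][0],    NY / 2 * sizeof m[0][0], MPOL_BIND,
                  &left_node,  8 * sizeof left_node,  MPOL_MF_MOVE);
            mbind(&m[i][NY/2], NY / 2 * sizeof m[0][0], MPOL_BIND,
                  &right_node, 8 * sizeof right_node, MPOL_MF_MOVE);
        }
    }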
Use-case: NAS BT

    distr_t *distr;
    distr = block_exclusive_distr(m, sizeof(m), sizeof(m[0]) / 2);
    distribute_to(distr);

    for (t = 0; t < TMAX; t++) {
      x_wise();
      y_wise();
    }

where x_wise() is originally a single parallel loop:

    #pragma omp parallel for
    for (i = 0; i < NX; i++)
      for (j = 0; j < NY; j++)
        // access m[i][j]
x_wise()

The matrix is processed in two steps:
Step 1: left half (NY/2 columns), all accesses local.
Step 2: right half (NY/2 columns), all accesses local.

[Figure: each half of m spans one quadrant allocated at Processor 0 and one at Processor 1; the row blocks of each half are assigned to threads T0–T7 so that every access is local.]
Use-case: NAS BT

x_wise() is split into two loops, one per half of the matrix; each half uses the iteration schedule under which its accesses stay local (one half schedule(static), the other the schedule(static-inverse) extension):

    #pragma omp parallel for schedule(static-inverse)
    for (i = 0; i < NX; i++)
      for (j = 0; j < NY/2; j++)
        // access m[i][j]

    #pragma omp parallel for schedule(static)
    for (i = 0; i < NX; i++)
      for (j = NY/2; j < NY; j++)
        // access m[i][j]

The data distribution and the time-step loop remain as shown above.
Matrix processing

Process m x-wise parallel with schedule(static):

    #pragma omp parallel for schedule(static)
    for (i = 0; i < NX; i++)
      for (j = 0; j < NY; j++)
        // access m[i][j]

Under schedule(static) the row blocks are assigned to the threads in order:

    T0: m[0 .. NX/8 - 1][*]
    T1: m[NX/8 .. 2*NX/8 - 1][*]
    T2: m[2*NX/8 .. 3*NX/8 - 1][*]
    T3: m[3*NX/8 .. 4*NX/8 - 1][*]
    T4: m[4*NX/8 .. 5*NX/8 - 1][*]
    T5: m[5*NX/8 .. 6*NX/8 - 1][*]
    T6: m[6*NX/8 .. 7*NX/8 - 1][*]
    T7: m[7*NX/8 .. NX - 1][*]
static vs. static-inverse

schedule(static) assigns the row blocks m[0 .. NX/8-1][*], ..., m[7*NX/8 .. NX-1][*] to threads T0, ..., T7 in order, as shown above. schedule(static-inverse) assigns the same blocks to the threads in an inverted order, so that each block is processed by a thread running on the other processor:

    #pragma omp parallel for schedule(static-inverse)
    for (i = 0; i < NX; i++)
      for (j = 0; j < NY; j++)
        // access m[i][j]
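schedule(static-inverse) is the authors' extension to GCC's OpenMP implementation and is not part of standard OpenMP. As an illustration only, here is a hand-rolled equivalent under the assumption that static-inverse simply reverses the block-to-thread assignment of schedule(static); it is a sketch, not the authors' runtime (compile with -fopenmp).

    #include <omp.h>

    enum { NX = 1024, NY = 1024 };   /* illustrative sizes */

    /* Thread t processes the row block that thread (T-1-t) would receive
       under schedule(static): T0 takes the last block, T7 the first. */
    static void x_wise_inverse(double m[NX][NY])
    {
        #pragma omp parallel
        {
            int t  = omp_get_thread_num();
            int T  = omp_get_num_threads();
            int b  = T - 1 - t;                  /* inverted block index   */
            int lo =  b      * NX / T;           /* first row of the block */
            int hi = (b + 1) * NX / T;           /* one past the last row  */
            for (int i = lo; i < hi; i++)
                for (int j = 0; j < NY; j++)
                    m[i][j] += 1.0;              /* stand-in for the real work */
        }
    }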
y_wise()

The matrix is processed in two steps:
Step 1: upper half (NX/2 rows), all accesses local.
Step 2: lower half (NX/2 rows), all accesses local.

[Figure: each half of m spans one quadrant allocated at Processor 0 and one at Processor 1; in each step the column blocks are assigned to threads T0–T7 so that every access is local.]

A sketch of the resulting two-step loop structure follows.
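Analogously to x_wise(), y_wise() can be split into two steps. The structure below is a sketch: which half needs the inverted schedule depends on the blocked-exclusive placement, and here the lower half is assumed to be the inverted one (compile with -fopenmp).

    enum { NX = 1024, NY = 1024 };   /* illustrative sizes */

    /* Two-step y_wise() sketch: split the row range into the upper and the
       lower half; each half parallelizes over columns with the schedule that
       keeps its accesses local. */
    static void y_wise(double m[NX][NY])
    {
        for (int i = 0; i < NX / 2; i++) {       /* step 1: upper half */
            #pragma omp parallel for schedule(static)
            for (int j = 0; j < NY; j++)
                m[i][j] += 1.0;                  /* stand-in for the real work */
        }
        for (int i = NX / 2; i < NX; i++) {      /* step 2: lower half */
            #pragma omp parallel for  /* schedule(static-inverse): authors' GCC extension */
            for (int j = 0; j < NY; j++)
                m[i][j] += 1.0;
        }
    }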
Outline

Profile-based page placement
Memory access patterns
Matching data distribution and iteration scheduling
Evaluation
Conclusions
Evaluation

[Chart: performance improvement over first-touch [%] for cg.B, lu.C, bt.B, ft.B, and sp.B, comparing profile-based allocation with the program transformations (y-axis: 0–25%).]
Scalability

Machine: 4-processor, 32-core Intel Xeon E7-4830

[Chart: performance improvement over first-touch [%] for cg.C, lu.C, bt.C, ft.C, and sp.C (y-axis: 0–250%).]
Conclusions

Automatic data placement is (still) limited: alternating memory access patterns and inter-processor data sharing defeat it.
Solution: match memory access patterns and data placement.
A simple API is a practical solution that works today, with ample opportunities for further improvement.
Thank you for your attention!