Matching Memory Access Patterns and Data Placement for NUMA Systems. Zoltán Majó, Thomas R. Gross. Computer Science Department, ETH Zurich, Switzerland.


Page 1: Matching Memory Access Patterns and Data Placement for NUMA Systems

Matching Memory Access Patterns and Data Placement for NUMA Systems

Zoltán Majó, Thomas R. Gross

Computer Science Department, ETH Zurich, Switzerland

Page 2

Non-uniform memory architecture

[Diagram: two processors, each with four cores (Processor 0: Cores 0-3, Processor 1: Cores 4-7), a memory controller (MC), an interconnect link (IC), and locally attached DRAM]

Page 3

Non-uniform memory architecture

Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles

[Diagram: a thread T accessing data held in its own processor's DRAM]

All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO '09], Molka [PACT '09])

Page 4

Non-uniform memory architecture

Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles
Remote memory accesses: bandwidth 6.3 GB/s, latency 310 cycles

[Diagram: a thread T accessing data held in the other processor's DRAM via the interconnect]

Key to good performance: data locality

All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO '09], Molka [PACT '09])

Page 5

Data locality in multithreaded programs

[Chart: remote memory references as a percentage of total memory references (0-60%) for the NAS Parallel Benchmarks cg.B, lu.C, ft.B, ep.C, bt.B, sp.B, is.B, mg.C]


Page 7

Outline

- Automatic page placement
- Memory access patterns of matrix-based computations
- Matching memory access patterns and data placement
- Evaluation
- Conclusions

Page 8

Automatic page placement

- Current OS support for NUMA: first-touch page placement, which often leaves a high number of remote accesses
- Data address profiling: supported in hardware on many architectures, enables profile-based page placement

Page 9

Profile-based page placement
Based on the work of Marathe et al. [JPDC 2010, PPoPP 2006]

[Diagram: the profile shows page P0 accessed 1000 times by thread T0 and page P1 accessed 3000 times by thread T1; each page is then placed in the DRAM of the processor running the thread that accesses it most]

Page 10

Automatic page placement

- Compare: first-touch vs. profile-based page placement
- Machine: 2-processor, 8-core Intel Xeon E5520
- Subset of NAS PB: programs with a high fraction of remote accesses
- 8 threads with fixed thread-to-core mapping

Page 11

Profile-based page placement

[Chart: performance improvement over first-touch (0-25%) for cg.B, lu.C, bt.B, ft.B, sp.B]


Page 13

Inter-processor data sharing

[Diagram: as before, page P0 (accessed 1000 times by T0) and page P1 (accessed 3000 times by T1) can each be placed locally; page P2 is accessed 4000 times by T0 and 5000 times by T1, so no single placement is local for both threads]

P2: inter-processor shared


Page 15

Inter-processor data sharing

[Chart: inter-processor shared heap relative to total heap (0-60%) for cg.B, lu.C, bt.B, ft.B, sp.B]


Page 17

Inter-processor data sharing

[Chart: for cg.B, lu.C, bt.B, ft.B, sp.B, the inter-processor shared heap relative to total heap (0-60%, left axis) plotted against the performance improvement over first-touch (0-30%, right axis)]


Page 19

Automatic page placement

- Profile-based page placement is often ineffective; the reason is inter-processor data sharing
- Inter-processor data sharing is a program property
- Detailed look: program memory access patterns
  - Loop-parallel programs with OpenMP-like parallelization
  - Matrix processing, e.g., NAS BT

Page 20

Matrix processing

Process m[NX][NY] sequentially:

for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

Page 21

Matrix processing

Process m[NX][NY] x-wise parallel: the sequential loop nest

for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

becomes

#pragma omp parallel for
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

[Diagram: the NX rows of m are divided into eight horizontal blocks, processed top to bottom by threads T0-T7]

Page 22

Thread scheduling

Remember: fixed thread-to-core mapping

[Diagram: threads T0-T7 pinned to the eight cores, T0-T3 on one processor and T4-T7 on the other]

Page 23

Matrix processing

Process m x-wise parallel:

#pragma omp parallel for
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

[Diagram: the rows of m[NX][NY] divided among T0-T7; one half of the matrix is allocated at Processor 0 and the other half at Processor 1, matching the threads that process them]

Page 24

Matrix processing

Process m x-wise parallel:

#pragma omp parallel for
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

Process m y-wise parallel:

for (i=0; i<NX; i++)
  #pragma omp parallel for
  for (j=0; j<NY; j++)
    // access m[i][j]

[Diagram: in the y-wise pattern, the NY columns of m[NX][NY] are divided among threads T0-T7, while the allocation still splits the matrix row-wise between Processor 0 and Processor 1]

Page 25

Example: NAS BT

Time-step iteration over m[NX][NY]:

for (t=0; t<TMAX; t++) {
  x_wise();
  y_wise();
}

[Diagram: x_wise() divides the rows of m among T0-T7; y_wise() divides the columns of the same matrix among T0-T7]

Page 26

Example: NAS BT

for (t=0; t<TMAX; t++) {
  x_wise();
  y_wise();
}

Because the same matrix is accessed row-wise and column-wise in alternation, an appropriate allocation is not possible with either row-based or column-based placement.

Result:
- Inter-processor shared heap: 35%
- Remote accesses: 19%

Page 27

Solution?

1. Adjust data placement: the high overhead of runtime data migration cancels the benefit
2. Adjust iteration scheduling: limited by data dependences
3. Adjust data placement and iteration scheduling together

Page 28

API

- Library for data placement: a set of common data distributions
- Affinity-aware loop iteration scheduling: an extension to the GCC OpenMP implementation
- Example use case: NAS BT

Page 29

Use case: NAS BT

- Remember: BT has two incompatible access patterns, with repeated x-wise and y-wise access to the same data
- Idea: a data placement that accommodates both access patterns

Blocked-exclusive data placement:

[Diagram: m[NX][NY] is split into four blocks, two allocated at Processor 0 and two at Processor 1]

Page 30

Use case: NAS BT

distr_t *distr;
distr = block_exclusive_distr(m, sizeof(m), sizeof(m[0])/2);
distribute_to(distr);

for (t=0; t<TMAX; t++) {
  x_wise();
  y_wise();
}

Page 31

Use case: NAS BT

distr_t *distr;
distr = block_exclusive_distr(m, sizeof(m), sizeof(m[0])/2);
distribute_to(distr);

for (t=0; t<TMAX; t++) {
  x_wise();
  y_wise();
}

where x_wise() contains the parallel loop nest:

#pragma omp parallel for
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

Page 32

x_wise()

Matrix processed in two steps:
- Step 1: left half (columns 0 .. NY/2 - 1), all accesses local
- Step 2: right half (columns NY/2 .. NY - 1), all accesses local

[Diagram: each NY/2-wide half of m is split row-wise between Processor 0 and Processor 1; within each step, every thread T0-T7 accesses only rows allocated at its own processor]

Page 33

Use case: NAS BT

distr_t *distr;
distr = block_exclusive_distr(m, sizeof(m), sizeof(m[0])/2);
distribute_to(distr);

for (t=0; t<TMAX; t++) {
  x_wise();
  y_wise();
}

The single loop nest in x_wise()

#pragma omp parallel for
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

is split into two loop nests, one per matrix half:

#pragma omp parallel for
for (i=0; i<NX; i++)
  for (j=0; j<NY/2; j++)
    // access m[i][j]

#pragma omp parallel for
for (i=0; i<NX; i++)
  for (j=NY/2; j<NY; j++)
    // access m[i][j]

Page 34

Use case: NAS BT

distr_t *distr;
distr = block_exclusive_distr(m, sizeof(m), sizeof(m[0])/2);
distribute_to(distr);

for (t=0; t<TMAX; t++) {
  x_wise();
  y_wise();
}

The two halves of x_wise() get affinity-aware schedules, one with schedule(static-inverse) and one with schedule(static), so that in each step every thread processes the rows allocated at its own processor:

#pragma omp parallel for schedule(static-inverse)
for (i=0; i<NX; i++)
  for (j=0; j<NY/2; j++)
    // access m[i][j]

#pragma omp parallel for schedule(static)
for (i=0; i<NX; i++)
  for (j=NY/2; j<NY; j++)
    // access m[i][j]

Page 35

Matrix processing

Process m x-wise parallel:

#pragma omp parallel for schedule(static)
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

[Diagram: the rows of m[NX][NY] divided into eight contiguous blocks, processed top to bottom by threads T0-T7]

Page 36

Matrix processing

Process m x-wise parallel:

#pragma omp parallel for schedule(static)
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

With schedule(static), the iteration space is divided into eight contiguous row blocks:

m[0 .. NX/8 - 1][*]
m[NX/8 .. 2*NX/8 - 1][*]
m[2*NX/8 .. 3*NX/8 - 1][*]
m[3*NX/8 .. 4*NX/8 - 1][*]
m[4*NX/8 .. 5*NX/8 - 1][*]
m[5*NX/8 .. 6*NX/8 - 1][*]
m[6*NX/8 .. 7*NX/8 - 1][*]
m[7*NX/8 .. NX - 1][*]

Page 37

static vs. static-inverse

With schedule(static), thread Tk is assigned the k-th row block:

T0: m[0 .. NX/8 - 1][*]
T1: m[NX/8 .. 2*NX/8 - 1][*]
T2: m[2*NX/8 .. 3*NX/8 - 1][*]
T3: m[3*NX/8 .. 4*NX/8 - 1][*]
T4: m[4*NX/8 .. 5*NX/8 - 1][*]
T5: m[5*NX/8 .. 6*NX/8 - 1][*]
T6: m[6*NX/8 .. 7*NX/8 - 1][*]
T7: m[7*NX/8 .. NX - 1][*]

#pragma omp parallel for schedule(static)
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

With schedule(static-inverse), the same row blocks are assigned to the threads in a permuted order, so that each thread gets the block allocated at its own processor:

#pragma omp parallel for schedule(static-inverse)
for (i=0; i<NX; i++)
  for (j=0; j<NY; j++)
    // access m[i][j]

Page 38

y_wise()

Matrix processed in two steps:
- Step 1: upper half (rows 0 .. NX/2 - 1), all accesses local
- Step 2: lower half (rows NX/2 .. NX - 1), all accesses local

[Diagram: each NX/2-high half of m is split column-wise between Processor 0 and Processor 1; within each step, threads T0-T3 and T4-T7 access only columns allocated at their own processor]

Page 39

Outline

- Profile-based page placement
- Memory access patterns
- Matching data distribution and iteration scheduling
- Evaluation
- Conclusions

Page 40

Evaluation

[Chart: performance improvement over first-touch (0-25%) for cg.B, lu.C, bt.B, ft.B, sp.B, comparing profile-based allocation against the program transformations]


Page 43

Scalability

Machine: 4-processor, 32-core Intel Xeon E7-4830

[Chart: performance improvement over first-touch (0-250%) for cg.C, lu.C, bt.C, ft.C, sp.C]


Page 45

Conclusions

- Automatic data placement is (still) limited by alternating memory access patterns and inter-processor data sharing
- Match memory access patterns and data placement
- Simple API: a practical solution that works today
- Ample opportunities for further improvement

Page 46

Thank you for your attention!