Matching Memory Access Patterns and Data Placement for NUMA Systems

Zoltán Majó, Thomas R. Gross
Computer Science Department, ETH Zurich, Switzerland
Non-uniform memory architecture

[Figure: two-processor NUMA system. Processor 0 (cores 0–3) and Processor 1 (cores 4–7), each with a memory controller (MC), an interconnect link (IC), and locally attached DRAM. A thread T accesses data that may reside in either DRAM.]

Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles
Remote memory accesses: bandwidth 6.3 GB/s, latency 310 cycles

Key to good performance: data locality

All data based on experimental evaluation of the Intel Xeon 5500 (Hackenberg [MICRO ’09], Molka [PACT ’09])
Data locality in multithreaded programs

[Chart: remote memory references / total memory references [%] for the NAS Parallel Benchmarks cg.B, lu.C, ft.B, ep.C, bt.B, sp.B, is.B, and mg.C (y-axis: 0–60%).]
Outline

Automatic page placement
Memory access patterns of matrix-based computations
Matching memory access patterns and data placement
Evaluation
Conclusions
Automatic page placement

Current OS support for NUMA: first-touch page placement, which often results in a high number of remote accesses.
Alternative: data address profiling and profile-based page placement; data address profiling is supported in hardware on many architectures. A minimal sketch of the idea follows.
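As an illustration (not the authors' implementation), here is a minimal sketch of profile-based placement on Linux using move_pages(2) from libnuma: given per-page access counts collected by a profiler, each page is migrated to the NUMA node whose threads access it most. The page_profile structure and its fields are hypothetical; only the migration call is a real API (link with -lnuma).

    #include <numaif.h>   /* move_pages(2), from libnuma */
    #include <stdio.h>

    /* Hypothetical profile record: per-page access counts per NUMA node,
       as a data-address profiler might produce them. */
    struct page_profile {
        void *page;        /* page-aligned address */
        long  counts[2];   /* accesses from threads on node 0 and node 1 */
    };

    /* Migrate every profiled page to the node whose threads access it most. */
    static void place_pages(struct page_profile *prof, unsigned long n)
    {
        for (unsigned long i = 0; i < n; i++) {
            int   target = prof[i].counts[1] > prof[i].counts[0] ? 1 : 0;
            int   status = 0;
            void *page   = prof[i].page;
            if (move_pages(0 /* this process */, 1, &page, &target,
                           &status, MPOL_MF_MOVE) != 0)
                perror("move_pages");
        }
    }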
Profile-based page placement
Based on the work of Marathe et al. [JPDC 2010, PPoPP 2006]

[Figure: thread T0 runs on Processor 0, T1 on Processor 1. Profile: page P0 is accessed 1000 times by T0, page P1 is accessed 3000 times by T1; accordingly, P0 is placed in Processor 0's DRAM and P1 in Processor 1's DRAM.]
Automatic page placement

Comparison: first-touch vs. profile-based page placement
Machine: 2-processor, 8-core Intel Xeon E5520
Benchmarks: subset of the NAS Parallel Benchmarks with a high fraction of remote accesses
8 threads with a fixed thread-to-core mapping
Profile-based page placement

[Chart: performance improvement over first-touch [%] for cg.B, lu.C, bt.B, ft.B, and sp.B (y-axis: 0–25%).]
Inter-processor data sharing

[Figure: thread T0 runs on Processor 0, T1 on Processor 1. Profile: P0 is accessed 1000 times by T0; P1 is accessed 3000 times by T1; P2 is accessed 4000 times by T0 and 5000 times by T1. P2 is inter-processor shared: wherever it is placed, some of its accesses are remote.]
Inter-processor data sharing

[Chart: inter-processor shared heap relative to total heap (shared heap / total heap [%]) for cg.B, lu.C, bt.B, ft.B, and sp.B (y-axis: 0–60%).]
Inter-processor data sharing

[Chart: inter-processor shared heap relative to total heap (left axis, 0–60%) and performance improvement over first-touch (right axis, 0–30%) for cg.B, lu.C, bt.B, ft.B, and sp.B.]
Automatic page placement

Profile-based page placement is often ineffective; the reason is inter-processor data sharing.
Inter-processor data sharing is a program property, so we take a detailed look at program memory access patterns.
Scope: loop-parallel programs with OpenMP-like parallelization, matrix processing, NAS BT.
Matrix processing

Process m[NX][NY] sequentially:

    for (i = 0; i < NX; i++)
      for (j = 0; j < NY; j++)
        // access m[i][j]
Matrix processing

Process m[NX][NY] x-wise parallel:

    #pragma omp parallel for
    for (i = 0; i < NX; i++)
      for (j = 0; j < NY; j++)
        // access m[i][j]

[Figure: the NX rows of m are divided into eight contiguous blocks, processed by threads T0–T7 from top to bottom.]
Thread scheduling

Remember: fixed thread-to-core mapping.

[Figure: threads T0–T3 run on the cores of Processor 0, threads T4–T7 on the cores of Processor 1.]

One possible way to obtain such a mapping is sketched below.
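The talk does not say how the threads are pinned. As one possible realization (an assumption, not necessarily the setup used in the paper), each OpenMP thread can pin itself to the core with its own thread number, assuming cores 0–3 belong to Processor 0 and cores 4–7 to Processor 1 (compile with -fopenmp):

    #define _GNU_SOURCE
    #include <omp.h>
    #include <pthread.h>
    #include <sched.h>

    /* Pin OpenMP thread k to core k, so that T0-T3 run on Processor 0 and
       T4-T7 on Processor 1 under the assumed core numbering. */
    static void pin_threads(void)
    {
        #pragma omp parallel
        {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(omp_get_thread_num(), &set);
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        }
    }

With GCC's OpenMP runtime the same effect can also be obtained by setting GOMP_CPU_AFFINITY=0-7 in the environment.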
Matrix processing

Process m x-wise parallel (same loop as above, under the fixed thread-to-core mapping):

[Figure: the matrix is allocated row-block-wise so that each thread's rows reside on the processor the thread runs on; all accesses are local.]
Matrix processing

Process m[NX][NY] y-wise parallel:

    for (i = 0; i < NX; i++)
      #pragma omp parallel for
      for (j = 0; j < NY; j++)
        // access m[i][j]

[Figure: the NY columns of m are divided into eight blocks, processed by threads T0–T7 from left to right; the matching allocation places one half of the matrix at Processor 0 and the other half at Processor 1.]
Example: NAS BT

Time-step iteration over m[NX][NY]:

    for (t = 0; t < TMAX; t++) {
      x_wise();
      y_wise();
    }

[Figure: x_wise() processes m in row blocks (one per thread T0–T7), y_wise() processes m in column blocks.]
Example: NAS BT

Because x_wise() and y_wise() alternate in every time step, an allocation that is appropriate for both access patterns is not possible.

Result: inter-processor shared heap 35%, remote accesses 19%.
Solution?

1. Adjust data placement: the high overhead of runtime data migration cancels the benefit.
2. Adjust iteration scheduling: limited by data dependences.
3. Adjust data placement and iteration scheduling together.
API

Library for data placement: a set of common data distributions (a sketch of the assumed interface follows below).
Affinity-aware loop iteration scheduling: an extension to GCC's OpenMP implementation.
Example use case: NAS BT.
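To make the later call sites easier to read, here is a sketch of what the library interface might look like. Only the names distr_t, block_exclusive_distr, and distribute_to appear in the talk; the exact types and signatures below are assumptions.

    #include <stddef.h>   /* size_t */

    /* Assumed shape of the data-placement library (signatures are guesses). */
    typedef struct distr distr_t;   /* opaque handle describing a distribution */

    /* Describe a blocked-exclusive distribution of 'data' (total 'size' bytes)
       with the given block size in bytes (here: half a row of the matrix). */
    distr_t *block_exclusive_distr(void *data, size_t size, size_t block_size);

    /* Apply the distribution: place each block on its designated NUMA node. */
    void distribute_to(distr_t *distr);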
Use-case: NAS BT

Remember: BT has two incompatible access patterns, repeated x-wise and y-wise access to the same data.
Idea: a data placement that accommodates both access patterns.

Blocked-exclusive data placement
[Figure: m is split into four quadrants; two quadrants are allocated at Processor 0 and the two diagonally opposite ones at Processor 1.]

A sketch of how such a placement could be realized is given below.
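As an illustration only, here is a hypothetical sketch of what block_exclusive_distr()/distribute_to() might do underneath on Linux, using mbind(2) from libnuma to bind each half-row to one of the two nodes. It assumes m is page-aligned and that a half-row is a whole number of pages (with 1024 doubles per row, a half-row is exactly one 4 KiB page); real code must round regions to page boundaries. This is not the authors' library.

    #include <numaif.h>   /* mbind(2), from libnuma; link with -lnuma */

    enum { NX = 1024, NY = 1024 };   /* illustrative sizes */

    /* Bind each half-row of m so that diagonally opposite quadrants
       end up on the same processor (node 0 or node 1). */
    static void distribute_blocked_exclusive(double (*m)[NY])
    {
        for (int i = 0; i < NX; i++) {
            int upper = (i < NX / 2);
            unsigned long left_node  = upper ? 1UL << 0 : 1UL << 1;  /* node bitmasks */
            unsigned long right_node = upper ? 1UL << 1 : 1UL << 0;
            mbind(&m[i][0],    NY / 2 * sizeof m[0][0], MPOL_BIND,
                  &left_node,  8 * sizeof left_node,  MPOL_MF_MOVE);
            mbind(&m[i][NY/2], NY / 2 * sizeof m[0][0], MPOL_BIND,
                  &right_node, 8 * sizeof right_node, MPOL_MF_MOVE);
        }
    }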
Use-case: NAS BT

    distr_t *distr;
    distr = block_exclusive_distr(m, sizeof(m), sizeof(m[0]) / 2);
    distribute_to(distr);

    for (t = 0; t < TMAX; t++) {
      x_wise();
      y_wise();
    }

where x_wise() is originally a single parallel loop:

    #pragma omp parallel for
    for (i = 0; i < NX; i++)
      for (j = 0; j < NY; j++)
        // access m[i][j]
x_wise()

The matrix is processed in two steps:
Step 1: left half (NY/2 columns), all accesses local.
Step 2: right half (NY/2 columns), all accesses local.

[Figure: each half of m spans one quadrant allocated at Processor 0 and one at Processor 1; the row blocks of each half are assigned to threads T0–T7 so that every access is local.]
Use-case: NAS BT

x_wise() is split into two loops, one per half of the matrix; each half uses the iteration schedule under which its accesses stay local (one half schedule(static), the other the schedule(static-inverse) extension):

    #pragma omp parallel for schedule(static-inverse)
    for (i = 0; i < NX; i++)
      for (j = 0; j < NY/2; j++)
        // access m[i][j]

    #pragma omp parallel for schedule(static)
    for (i = 0; i < NX; i++)
      for (j = NY/2; j < NY; j++)
        // access m[i][j]

The data distribution and the time-step loop remain as shown above.
Matrix processing

Process m x-wise parallel with schedule(static):

    #pragma omp parallel for schedule(static)
    for (i = 0; i < NX; i++)
      for (j = 0; j < NY; j++)
        // access m[i][j]

Under schedule(static) the row blocks are assigned to the threads in order:

    T0: m[0 .. NX/8 - 1][*]
    T1: m[NX/8 .. 2*NX/8 - 1][*]
    T2: m[2*NX/8 .. 3*NX/8 - 1][*]
    T3: m[3*NX/8 .. 4*NX/8 - 1][*]
    T4: m[4*NX/8 .. 5*NX/8 - 1][*]
    T5: m[5*NX/8 .. 6*NX/8 - 1][*]
    T6: m[6*NX/8 .. 7*NX/8 - 1][*]
    T7: m[7*NX/8 .. NX - 1][*]
static vs. static-inverse

schedule(static) assigns the row blocks m[0 .. NX/8-1][*], ..., m[7*NX/8 .. NX-1][*] to threads T0, ..., T7 in order, as shown above. schedule(static-inverse) assigns the same blocks to the threads in an inverted order, so that each block is processed by a thread running on the other processor:

    #pragma omp parallel for schedule(static-inverse)
    for (i = 0; i < NX; i++)
      for (j = 0; j < NY; j++)
        // access m[i][j]
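schedule(static-inverse) is the authors' extension to GCC's OpenMP implementation and is not part of standard OpenMP. As an illustration only, here is a hand-rolled equivalent under the assumption that static-inverse simply reverses the block-to-thread assignment of schedule(static); it is a sketch, not the authors' runtime (compile with -fopenmp).

    #include <omp.h>

    enum { NX = 1024, NY = 1024 };   /* illustrative sizes */

    /* Thread t processes the row block that thread (T-1-t) would receive
       under schedule(static): T0 takes the last block, T7 the first. */
    static void x_wise_inverse(double m[NX][NY])
    {
        #pragma omp parallel
        {
            int t  = omp_get_thread_num();
            int T  = omp_get_num_threads();
            int b  = T - 1 - t;                  /* inverted block index   */
            int lo =  b      * NX / T;           /* first row of the block */
            int hi = (b + 1) * NX / T;           /* one past the last row  */
            for (int i = lo; i < hi; i++)
                for (int j = 0; j < NY; j++)
                    m[i][j] += 1.0;              /* stand-in for the real work */
        }
    }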
y_wise()

The matrix is processed in two steps:
Step 1: upper half (NX/2 rows), all accesses local.
Step 2: lower half (NX/2 rows), all accesses local.

[Figure: each half of m spans one quadrant allocated at Processor 0 and one at Processor 1; in each step the column blocks are assigned to threads T0–T7 so that every access is local.]

A sketch of the resulting two-step loop structure follows.
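Analogously to x_wise(), y_wise() can be split into two steps. The structure below is a sketch: which half needs the inverted schedule depends on the blocked-exclusive placement, and here the lower half is assumed to be the inverted one (compile with -fopenmp).

    enum { NX = 1024, NY = 1024 };   /* illustrative sizes */

    /* Two-step y_wise() sketch: split the row range into the upper and the
       lower half; each half parallelizes over columns with the schedule that
       keeps its accesses local. */
    static void y_wise(double m[NX][NY])
    {
        for (int i = 0; i < NX / 2; i++) {       /* step 1: upper half */
            #pragma omp parallel for schedule(static)
            for (int j = 0; j < NY; j++)
                m[i][j] += 1.0;                  /* stand-in for the real work */
        }
        for (int i = NX / 2; i < NX; i++) {      /* step 2: lower half */
            #pragma omp parallel for  /* schedule(static-inverse): authors' GCC extension */
            for (int j = 0; j < NY; j++)
                m[i][j] += 1.0;
        }
    }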
Outline

Profile-based page placement
Memory access patterns
Matching data distribution and iteration scheduling
Evaluation
Conclusions
Evaluation

[Chart: performance improvement over first-touch [%] for cg.B, lu.C, bt.B, ft.B, and sp.B, comparing profile-based allocation with the program transformations (y-axis: 0–25%).]
Scalability

Machine: 4-processor, 32-core Intel Xeon E7-4830

[Chart: performance improvement over first-touch [%] for cg.C, lu.C, bt.C, ft.C, and sp.C (y-axis: 0–250%).]
Conclusions

Automatic data placement is (still) limited: alternating memory access patterns and inter-processor data sharing defeat it.
Solution: match memory access patterns and data placement.
A simple API is a practical solution that works today, with ample opportunities for further improvement.
Thank you for your attention!