1
Sector Cache on the SPARCVIIIfx I On-demand split of the shared L2 cache. I User controlled mapping between accesses and sectors. Isolation of thrashing accesses. Select and keep useful data in cache. Our Goals Assess applicability of the sector cache to optimize HPC applications. I Study potential optimization strategies. I Help users find good sector cache optimizations. I Aim for as much automation as possible. Issues I Low-level compile-time API. I Hard to predict impact on performance. I Little support from compiler and performance analysis tools. Current Sector Cache API double myarray[NSIZE]; double otherarray[NSIZE]; void mywork(void) { int i; double sum = 0; #pragma statement cache_sector_size 1 11 #pragma statement cache_subsector_assign myarray for(i = 2; i < NSIZE-2; i++) { // myarray in sector 1 sum += myarray[i-2] + myarray[i-1] + myarray[i] + myarray[i+1] + myarray[i+2] + otherarray[i]; } } Results I efficient cache optimization of classical HPC benchmarks. I framework for locality analysis and optimization of HPC applications. I on-going automation, already little user action required. Framework Overview Hotspot detection Structures/Scope setup DWARF reader Binary instrumentation Locality analysis Code modification Locality Measurements Reuse Distance: for a memory access, the number of unique memory locations touched since the previous access to the same location. an architecture-independent measure closely related to the cache misses triggered by an application. use it to measure consequences of isolating one structure by itself with the sector cache. Reuse exemple Access : load 0x10 load 0x20 load 0x30 load 0x10 load 0x20 load 0x10 load 0x30 Distance : 2 2 1 2 Density Distance Validating the Framework: a toy example 1.75MB 3.5MB 7MB 0 2 · 10 6 4 · 10 6 6 · 10 6 8 · 10 6 1 · 10 7 1.2 · 10 7 1.4 · 10 7 M1 M2 M3 Reuse Distances of Each Structures 1 2 3 4 5 6 7 8 9 10 11 0 2 4 6 8 10 12 14 16 18 20 22 24 Sector Size Cache Misses Reduction (%) M 1 M 2 M 3 Selected by framework Optimizations for all Sector Configurations Optimizing the NAS Parallel Benchmarks Benchmark Function Isolated Variables Sector Size Miss Reduction (%) Runtime Reduction (%) CG conj_grad p (1,11) 19 10 LU ssor a,b,c,d (2,10) 48 8 blts ldz,ldy,ldx,d 75 10 buts d,udx,udy,udz 18 3 jacld a,b,c,d 64 14 jacu a,b,c,d 57 6 Acknowledgements Part of the results were obtained by early access to the K computer at the RIKEN AICS. This work was supported by the JSPS Grant-in-Aid for JSPS Fellows Number 24.02711. RIKEN Advanced Institute for Computational Science Programming Environment Research Team [email protected]

Sector Cache Optimizations for the K Computerperarnau/pdfs/aics13poster.pdf · Sector Cache Optimizations for the K Computer Swann Perarnau, Mitsuhisa Sato RIKEN AICS, University

Embed Size (px)

Citation preview

Page 1: Sector Cache Optimizations for the K Computerperarnau/pdfs/aics13poster.pdf · Sector Cache Optimizations for the K Computer Swann Perarnau, Mitsuhisa Sato RIKEN AICS, University

Sector Cache Optimizations for the K Computer

Swann Perarnau, Mitsuhisa Sato

RIKEN AICS, University of Tsukuba

Sector Cache on the SPARCVIIIfx

IOn-demand split of the shared L2 cache.IUser controlled mapping between accesses and sectors.→ Isolation of thrashing accesses.→ Select and keep useful data in cache.

Our Goals

Assess applicability of the sector cache to optimize HPC applications.I Study potential optimization strategies.IHelp users find good sector cache optimizations.IAim for as much automation as possible.

Issues

I Low-level compile-time API.IHard to predict impact on performance.I Little support from compiler and performance analysis tools.

Current Sector Cache API

double myarray[NSIZE];double otherarray[NSIZE];

void mywork(void){

int i;double sum = 0;

#pragma statement cache_sector_size 1 11#pragma statement cache_subsector_assign myarray

for(i = 2; i < NSIZE-2; i++){

// myarray in sector 1sum += myarray[i-2] + myarray[i-1] +

myarray[i] + myarray[i+1] +myarray[i+2] + otherarray[i];

}}

Results

I efficient cache optimization of classical HPC benchmarks.I framework for locality analysis and optimization of HPC applications.I on-going automation, already little user action required.

Framework Overview

Hotspotdetection

Structures/Scopesetup

DWARF reader

Binaryinstrumentation

Locality analysis

Codemodification

Locality Measurements

Reuse Distance: for a memory access, the number of uniquememory locations touched since the previous access to the samelocation.→ an architecture-independent measure closely related to the cachemisses triggered by an application.→ use it to measure consequences of isolating one structure by itselfwith the sector cache.

Reuse exemple

Access :

load 0x10load 0x20load 0x30load 0x10load 0x20load 0x10load 0x30

Distance :

∞∞∞2212

Den

sity

Distance

Validating the Framework: a toy example

1.75MB 3.5MB 7MB ∞

0

2 · 106

4 · 106

6 · 106

8 · 106

1 · 107

1.2 · 107

1.4 · 107M1M2M3

Reuse Distances of Each Structures

1 2 3 4 5 6 7 8 9 10 11

024681012141618202224

Sector Size

Cache

Miss

esReductio

n(%

)

M1M2M3

Selected by framework

Optimizations for all Sector Configurations

Optimizing the NAS Parallel Benchmarks

Benchmark Function Isolated Variables Sector Size Miss Reduction (%) Runtime Reduction (%)CG conj_grad p (1,11) 19 10

LU

ssor a,b,c,d

(2,10)

48 8blts ldz,ldy,ldx,d 75 10buts d,udx,udy,udz 18 3jacld a,b,c,d 64 14jacu a,b,c,d 57 6

Acknowledgements

Part of the results were obtained by early access to the K computer at the RIKENAICS. This work was supported by the JSPS Grant-in-Aid for JSPS Fellows Number24.02711.

RIKEN Advanced Institute for Computational Science Programming Environment Research Team [email protected]