9
www.bsc.es Extreme-Scale Performance Tools - SC12 workshop Judit Gimenez ([email protected]) BSC Tools Detecting and analyzing application structure with clustering

BSC Tools - VI-HPS :: Home€¦ · Clustering Identification of computation structure – CPU burst = region between consecutive runtime calls • Described with performance hardware

  • Upload
    lammien

  • View
    215

  • Download
    0

Embed Size (px)

Citation preview

Page 1: BSC Tools - VI-HPS :: Home€¦ · Clustering Identification of computation structure – CPU burst = region between consecutive runtime calls • Described with performance hardware

www.bsc.es

Extreme-Scale Performance Tools - SC12 workshop

Judit Gimenez ([email protected])

BSC ToolsDetecting and analyzing application structure

with clustering

Page 2: BSC Tools - VI-HPS :: Home€¦ · Clustering Identification of computation structure – CPU burst = region between consecutive runtime calls • Described with performance hardware

Clustering

Identification of computation structure– CPU burst = region between consecutive runtime calls

• Described with performance hardware counters• Associated with call stack data

Using DBSCAN density-cluster algorithm– Data not necessarily Gaussian– Two parameters: Epsilon (search radius), MinPoints

Automatic Detection of Parallel Applications Computation Phases. (IPDPS 2009)

Noise point

Cluster points

Page 3: BSC Tools - VI-HPS :: Home€¦ · Clustering Identification of computation structure – CPU burst = region between consecutive runtime calls • Described with performance hardware

Clustering

Identification of computation structure– CPU burst = region between consecutive runtime calls

• Described with performance hardware counters• Associated with call stack data

Using DBSCAN density-cluster algorithm– Data not necessarily Gaussian– Two parameters: Epsilon (search radius), MinPoints

Automatic Detection of Parallel Applications Computation Phases. (IPDPS 2009)

Page 4: BSC Tools - VI-HPS :: Home€¦ · Clustering Identification of computation structure – CPU burst = region between consecutive runtime calls • Described with performance hardware

Outputs

Scatter Plot of Clustering Metrics Clusters Distribution Along Time

Cluster Statistics Code Linking

All counters in a single run

Page 5: BSC Tools - VI-HPS :: Home€¦ · Clustering Identification of computation structure – CPU burst = region between consecutive runtime calls • Described with performance hardware

Using clusters to understand apps behavior (GROMACS)Instructions imbalance

IPC Imbalance 64 procs

256 procs

Page 6: BSC Tools - VI-HPS :: Home€¦ · Clustering Identification of computation structure – CPU burst = region between consecutive runtime calls • Described with performance hardware

Identifying main code regions

WRF

HydroCCG-POP

instr. vs. cluster

Page 7: BSC Tools - VI-HPS :: Home€¦ · Clustering Identification of computation structure – CPU burst = region between consecutive runtime calls • Described with performance hardware

0.0

0.2

0.4

0.6

0.8

1.0

1.21

2

3

4

5

67

8

9

10

11

Cluster 1_L Cluster 1 Cluster 1_R

PM_CYC

PM_GRP_CMPL

PM_GCT_EMPTY_CYC

PM_GCT_EMPTY_IC_MISS

PM_GCT_EMPTY_BR_MPRED

PM_CMPLU_STALL_LSU

PM_CMPLU_STALL_REJECT

PM_CMPLU_STALL_ERAT_MISS

PM_CMPLU_STALL_DCACHE_MISS

PM_CMPLU_STALL_FXU

PM_CMPLU_STALL_DIV

PM_CMPLU_STALL_FPU

PM_CMPLU_STALL_FDIV

PM_CMPLU_STALL_OTHER

PM_INST_CMPL

PM_INST_DISP

PM_LD_REF_L1

PM_ST_REF_L1

PM_LD_MISS_L1

PM_ST_MISS_L1

PM_LD_MISS_L1_LSU0

PM_LD_MISS_L1_LSU1

PM_L1_WRITE_CYC

PM_DATA_FROM_MEM

PM_DATA_FROM_L2

PM_DTLB_MISS

PM_ITLB_MISS

PM_LSU0_BUSY

PM_LSU1_BUSY

PM_LSU_FLUSH

PM_LSU_LDF

PM_FPU_FIN

PM_FPU_FSQRT

PM_FPU_FDIV

PM_FPU_FMA

PM_FPU0_FIN

PM_FPU1_FIN

PM_FPU_STF

PM_LSU_LMQ_SRQ_EMPTY_CYC

PM_HV_CYC

PM_1PLUS_PPC_CMPL

PM_TB_BIT_TRANS

PM_FLUSH_BR

PM_BR_MPRED_TA

PM_GCT_EMPTY_SRQ_FULL

PM_FXU_FIN

PM_FXU_BUSY

PM_IOPS_CMPL

PM_GCT_FULL_CYC

1E+ 001E+ 021E+ 041E+ 061E+ 081E+ 10

Cluster 1Cluster 2

Cou

nter

Val

ue CPI stack - Detailed View

0.

0.2

0.4

0.6

0.8

1.

1.2

Cluster1_L

Cluster1

Cluster1_R

Cluster2_L

Cluster2

Cluster2_R

Other StallsStall by FPU Basic LatencyStall by FDIV/FSQRTStall by FXU Basic LatencyStall by Div/MTSPR/MFSPRStall by LSU Basic LatencyStall by D-cache MissOther RejectStall by XlateOther GCT Stalls (flush, etc.)Branch Mispredict PenaltyI-cache Miss PenaltyTotal Group Complete Cycles

Hardware counters projection

Performance Data Extrapolation in Parallel Codes. (ICPADS 2010)

Page 8: BSC Tools - VI-HPS :: Home€¦ · Clustering Identification of computation structure – CPU burst = region between consecutive runtime calls • Described with performance hardware

Quality driven clustering algorithms

Hierarchy of cluster granularities

Automatic SPMD quality criteria

Automatic Refinement of Parallel Application Structure Detection (LSPP 2012)

Page 9: BSC Tools - VI-HPS :: Home€¦ · Clustering Identification of computation structure – CPU burst = region between consecutive runtime calls • Described with performance hardware

Conclusions

Performance analytics: Data analytics applied to raw performance data– From data to insight, huge room for research

Clustering enables focusing the analysis and opens many different uses– Correlation of sampled data to generate instantaneous metric

evolution– Separate speed factors per cluster on predictive simulations with

Dimemas– Track application evolution using the clustering scatter plots

www.bsc.es/paraver