POWER8 Performance Analysis

Satish Kumar Sadasivam Senior Performance Engineer, Master Inventor

IBM Systems and Technology Labs

Join the conversation at #OpenPOWERSummit

Overview

• POWER8 Overview

• Introduction to Performance Monitoring

• Performance Monitoring Features in POWER8

  • What’s new in POWER8?

• POWER8 Pipeline

• CPI Stack overview – Stall Accounting Model

• Performance analysis

  • CPI analysis

  • Data source analysis

  • Prefetch control & Prefetch effectiveness

• Application level performance analysis

  • Marked event profiling & performance analysis

• Microarchitecture bottleneck analysis

  • Core bottleneck analysis using trace tool and scroll pipe


POWER8 Processor


Improvements over POWER7


Cache Improvements


Cache Bandwidths


Memory Organization


Performance Instrumentation in P8

• Hardware Performance Monitoring is critical to enable performance evaluation of applications/programs on complex performance cores such as POWER8

• POWER8 provides advanced instrumentation capabilities in two layers

  • Core Instrumentation

  • Nest level Instrumentation


Core Level Performance Monitoring

Nest Level Performance Monitoring

Core Level Performance Monitoring

Key to root-causing performance bottlenecks at core or thread level

Facilitates monitoring of:

• Core pipeline efficiency – frontend, branch prediction, execution units, schedulers, etc.

• Behavior metrics – stalls, execution rates, utilizations, thread prioritization & resource sharing

Enables understanding and optimization of application performance at processor and compiler level.


Nest Level Instrumentation

Instrumentation at:

• L3 Cache

• Interconnect Fabric

• Memory channels/controller

Information is provided at per-core and chip level (as against thread level for core-level counters)

Significance & Usefulness:

• Bandwidth analysis

• Key for analyzing performance in virtualized cloud environments

• Can be used to monitor memory and chip-level characteristics so that cloud capacity is provisioned effectively


What’s new in POWER8?

Enhanced CPI Stack Cycle Accounting Model

Hotness Table

Branch History Rolling Buffer

Event-Based Branches

Prefetch effectiveness events

Additional Events to capture & analyze hardware level performance issues


POWER8 Microarchitecture


POWER8 Core Pipeline


Front end stalls: cycles in which a thread’s GCT was empty, i.e. the pipeline was empty for that thread.

Back end stalls: cycles in which the thread had GCT entries but no completion occurred.
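
The front end / back end split can be illustrated with a small sketch. It assumes a per-cycle view of one thread in which we know how many Global Completion Table (GCT) entries the thread holds and whether a group completed that cycle. The function name, the trace format and the sample values are hypothetical and only serve to show the classification rule.

```python
from collections import Counter


def classify_cycle(gct_entries: int, group_completed: bool) -> str:
    """Classify one cycle of one thread using the two stall definitions."""
    if gct_entries == 0:
        # Nothing in flight for this thread: the pipeline is empty.
        return "front_end_stall"
    if not group_completed:
        # Work is in flight, but nothing completed this cycle.
        return "back_end_stall"
    return "completion"


# Hypothetical trace: (GCT entries held by the thread, group completed?)
trace = [(0, False), (3, False), (3, True), (0, False), (2, False)]
summary = Counter(classify_cycle(n, done) for n, done in trace)
print(dict(summary))
```

Every cycle falls into exactly one of the three classes, which is what lets a completion-based accounting model add up to the total cycle count.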

POWER8 Group Formation

Group formation:

• Instructions are formed into groups for dispatch and completion tracking after Instruction Fetch.

• Thread priority logic selects up to 8 instructions from the Instruction buffers for group formation in each cycle

• Group formation driven by group formation rules

Global Completion Table (GCT)

Completion based performance bottleneck analysis


CPI Analysis

The cycles-per-instruction (CPI) stack presents a picture of a typical instruction’s lifespan from fetch to completion

Provides information to narrow down the bottleneck point(s) in the processor pipeline

POWER8 features a completion-based CPI stack accounting model

Time spent in execution is split into:

• Group completion cycles

• Stall cycles
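
As a rough illustration of how a CPI stack is derived from raw counts, the sketch below divides each cycle-accounting event by the number of completed instructions. The five event names are the top-level components named in this deck. The `cycles` and `instructions` keys and all counter values are hypothetical, and the sketch assumes those five components together account for all cycles.

```python
def cpi_stack(counts: dict, instructions: int) -> dict:
    """Return each component's CPI contribution (event cycles / instructions)."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return {event: cycles / instructions for event, cycles in counts.items()}


# Hypothetical counter readings for one run.
instructions = 1_000_000
cycles = 2_000_000
components = {
    "PM_CMPLU_STALL": 1_200_000,      # completion stalls
    "PM_GCT_NOSLOT_CYC": 300_000,     # completion table empty
    "PM_GRP_CMPL": 400_000,           # group completion cycles
    "PM_CMPLU_STALL_THRD": 60_000,    # thread blocked
    "PM_NTCG_ALL_FIN": 40_000,        # waiting to complete
}

stack = cpi_stack(components, instructions)
total_cpi = cycles / instructions

for event, cpi in sorted(stack.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{event:22s} {cpi:.3f}  ({cpi / total_cpi:.0%} of CPI)")
```

Sorting by contribution puts the dominant component first, which is usually the one to drill into next.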


POWER8 CPI Stack

Cycles

• Completion Stalls

  • Stall due to BR or CR

    • Stall due to Branch

    • Stall due to CR

  • Stall due to Fixed-Point

    • Stall due to Fixed-Point Long

    • Stall due to Fixed-Point (other)

  • Stall due to Vector/Scalar

    • Stall due to Vector

      • Stall due to Vector Long

      • Stall due to Vector (other)

    • Stall due to Scalar

      • Stall due to Scalar Long

      • Stall due to Scalar (other)

    • Stall due to Vector/Scalar (other)

  • Stall due to LSU

    • Stall due to Dcache Miss

    • Stall due to LSU Reject

    • Stall due to Store Finish

    • Stall due to Load Finish

    • Stall due to Store Forward

    • Stall due to Load/Store (other)

  • Stall due to Next-to-Complete Flush

• Waiting to Complete

• Thread Blocked

  • Blocked due to LWSync

  • Blocked due to HWSync

  • Blocked due to ECC Delay

  • Blocked due to Flush

  • Blocked due to COQ Full

  • Thread Blocked (other)

• Completion Table Empty

  • Completion Table Empty due to IC Miss

    • Completion Table Empty due to IC L3 Miss

    • Completion Table Empty due to IC Miss (other)

  • Completion Table Empty due to Branch Mispredict

  • Completion Table Empty due to Branch Mispredict + IC Miss

  • Completion Table Empty – Dispatch Held

    • Dispatch Held due to Mapper

    • Dispatch Held due to Store Queue

    • Dispatch Held due to Issue Queue

    • Dispatch Held (other)

  • Completion Table Empty (other)

• Completion Cycles
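
The "(other)" entries in the stack can be read as whatever remains of a parent component once its named children are subtracted. The sketch below shows that arithmetic for the LSU stall group. It assumes the remainder is obtained by subtraction; the helper name and all counter values are hypothetical, while the event names are the LSU stall events listed in this deck.

```python
def other_bucket(parent_cycles: int, child_cycles: dict) -> int:
    """Cycles of the parent not explained by any named child component."""
    remainder = parent_cycles - sum(child_cycles.values())
    # Counters collected in separate runs may not add up exactly,
    # so clamp a slightly negative remainder to zero.
    return max(remainder, 0)


# Hypothetical readings for the LSU stall group.
lsu_total = 800_000  # PM_CMPLU_STALL_LSU
lsu_children = {
    "PM_CMPLU_STALL_DCACHE_MISS": 450_000,
    "PM_CMPLU_STALL_REJECT": 120_000,
    "PM_CMPLU_STALL_STORE": 90_000,
    "PM_CMPLU_STALL_LOAD_FINISH": 60_000,
    "PM_CMPLU_STALL_ST_FWD": 30_000,
}

lsu_other = other_bucket(lsu_total, lsu_children)
print("Stall due to Load/Store (other):", lsu_other)
```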

CPI Stack – LSU Stalls


An Example of CPI Stack


[Figure: CPI Stack. Chart comparing Prefetch OFF and Prefetch ON; CPI axis from 0.000 to 3.000. Components shown: PM_CMPLU_STALL, PM_NTCG_ALL_FIN, PM_CMPLU_STALL_THRD, PM_GCT_NOSLOT_CYC, PM_GRP_CMPL]

CPI Stack – Detailed Stall Distribution


[Figure: Completion Stall Components. Chart comparing Prefetch OFF and Prefetch ON; CPI axis from 0.000 to 4.000. Components shown: PM_CMPLU_STALL_BRU_CRU, PM_CMPLU_STALL_FXU, PM_CMPLU_STALL_VSU, PM_CMPLU_STALL_VECTOR, PM_CMPLU_STALL_SCALAR, PM_CMPLU_STALL_NTCG_FLUSH, PM_CMPLU_STALL_LSU, PM_CMPLU_STALL_DCACHE_MISS, PM_CMPLU_STALL_REJECT, PM_CMPLU_STALL_STORE, PM_CMPLU_STALL_LOAD_FINISH, PM_CMPLU_STALL_ST_FWD]

Data Source Analysis

Analysis of application data accesses across the cache and memory hierarchy is key to understanding the following:

• Performance limiting factors & resource requirements of the application

• Scaling capabilities (in multi-threaded scenarios)
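
One simple way to look at data sources is to express each level of the hierarchy as a share of all demand loads that missed the L1. The sketch below assumes per-source counts have already been collected; the source labels and the counts are hypothetical placeholders, not POWER8 event names.

```python
def source_shares(counts: dict) -> dict:
    """Fraction of L1 load misses satisfied by each data source."""
    total = sum(counts.values())
    if total == 0:
        return {source: 0.0 for source in counts}
    return {source: n / total for source, n in counts.items()}


# Hypothetical counts of L1 load misses by where the data came from.
loads_by_source = {
    "L2": 700,
    "L3": 200,
    "local_memory": 80,
    "remote_memory": 20,
}

shares = source_shares(loads_by_source)
for source, share in shares.items():
    print(f"{source:14s} {share:.1%}")
```

A growing share from memory, especially remote memory, as threads are added is the kind of signal that points at a scaling limit.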

Cache hierarchy latencies:


Prefetch Controls

Prefetch effects:

• Positive:

  • Brings data closer to the core

  • Reduces memory access stalls

• Possible negative effects:

  • Extra bandwidth consumption, choking other applications' memory accesses

  • Cache pollution

  • Increased power consumption

POWER8 supports prefetches at the L1 and L3 levels

DSCR Register (Power ISA v2.07)


DPFD: Default Prefetch Depth

SSE: Store Stream Enable

SNSE: Stride-N Stream Enable

LSD: Load Stream Disable

URG: Depth Attainment Urgency
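
To make the DSCR fields concrete, here is a minimal decoding sketch. The bit positions are an assumption based on my reading of Power ISA v2.07 (DPFD in the three least-significant bits, then SSE, SNSE, LSD, and a three-bit URG field above them) and should be checked against the ISA document before use. The function name and the example value are hypothetical.

```python
# Assumed layout: field name -> (shift from least-significant bit, width in bits).
DSCR_FIELDS = {
    "DPFD": (0, 3),  # Default Prefetch Depth
    "SSE": (3, 1),   # Store Stream Enable
    "SNSE": (4, 1),  # Stride-N Stream Enable
    "LSD": (5, 1),   # Load Stream Disable
    "URG": (6, 3),   # Depth Attainment Urgency
}


def decode_dscr(value: int) -> dict:
    """Split a DSCR value into the named fields of the assumed layout."""
    return {
        name: (value >> shift) & ((1 << width) - 1)
        for name, (shift, width) in DSCR_FIELDS.items()
    }


# Hypothetical value: URG=5, LSD=1, SNSE=0, SSE=1, DPFD=3.
example = 0b101_1_0_1_011
fields = decode_dscr(example)
print(fields)
```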

Studying Prefetch Effectiveness

POWER8 provides performance events to study the prefetch effectiveness

Counters record, at the time a prefetched cache line is evicted from the cache, whether that line was used or left unused

Counters available:

• MEPF Metrics are used to evaluate the Prefetch effectiveness in POWER8
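
Since the counters distinguish prefetched lines that were used from those evicted unused, one natural summary is the fraction of prefetched lines that were used. The sketch below assumes two such counts are available; the parameter names and values are hypothetical and do not correspond to specific POWER8 event names.

```python
def prefetch_usefulness(used_lines: int, unused_lines: int) -> float:
    """Fraction of evicted prefetched lines that had been used."""
    total = used_lines + unused_lines
    if total == 0:
        return 0.0  # no prefetched lines were evicted during the interval
    return used_lines / total


# Hypothetical counts taken over one measurement interval.
ratio = prefetch_usefulness(used_lines=900, unused_lines=100)
print(f"useful prefetches: {ratio:.0%}")
```

A low ratio suggests the prefetcher is mostly consuming bandwidth and polluting the cache, which is a reason to revisit the DSCR settings for that workload.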


Application Profiling Tools

Marked Event Profiling:

• Pinpoints performance-inhibiting behavior/bottlenecks to a specific instruction in application code

Why necessary?

• Non-marked events are best suited to study performance metrics

• In an out-of-order (OOO) superscalar multiple-issue processor, the profile data from non-marked events can only indicate the code “region” responsible for performance bottlenecks

• Code “region” granularity can range from a few to tens of instructions.


Example of Marked Event profiling


Marked Events – a non-exhaustive list

• PM_MRK_LD_MISS_L1

• PM_MRK_LD_MISS_L1_CYC

• PM_MRK_BR_MPRED_CMPL

• PM_MRK_BR_TAKEN_CMPL

• PM_MRK_DATA_FROM_MEM

• PM_MRK_LSU_REJECT

• PM_MRK_STCX_FAIL

• PM_MRK_GRP_IC_MISS

• PM_MRK_DTLB_MISS

• PM_MRK_ST_FWD

• PM_MRK_LSU_FLUSH

• PM_MRK_LSU_FLUSH_ULD

• PM_MRK_LSU_FLUSH_UST


Microarchitecture Analysis

Deep-dive analysis to root-cause performance inhibitors at processor pipeline stages.

Tools used:

• Itrace

• Cycle Accurate Simulator


[Workflow diagram. Boxes shown: Trace application with valgrind; Generate qtrace; simppc; Scrollpipe; Microarchitecture Stats; Analyze & Optimize Application code]

Tools for Microarchitecture Analysis

• IBM SDK for Linux on Power

• IBM POWER8 Functional Simulator (systemsim)

• Valgrind framework provides application/program tracing capabilities (itrace)

• POWER8 Performance Simulator (sim_ppc)

https://www-304.ibm.com/webapp/set2/sas/f/lopdiags/sdklop.html







Thank You!

