POWER8 Performance Analysis

Satish Kumar Sadasivam
Senior Performance Engineer, Master Inventor
IBM Systems and Technology Labs

Join the conversation at #OpenPOWERSummit



Overview

• POWER8 Overview
• Introduction to Performance Monitoring
• Performance Monitoring Features in POWER8
• What’s new in POWER8?
• POWER8 Pipeline
• CPI Stack overview – Stall Accounting Model
• Performance analysis
    • CPI analysis
    • Data source analysis
    • Prefetch control & prefetch effectiveness
• Application level performance analysis
    • Marked event profiling & performance analysis
• Microarchitecture bottleneck analysis
    • Core bottleneck analysis using trace tool and scroll pipe


POWER8 Processor


Improvements over POWER7


Cache Improvements


Cache Bandwidths


Memory Organization


Performance Instrumentation in P8

• Hardware performance monitoring is critical for evaluating the performance of applications/programs on complex, high-performance cores such as POWER8
• POWER8 provides advanced instrumentation capabilities in two layers:
    • Core-level instrumentation
    • Nest-level instrumentation


Core Level Performance Monitoring

Key to root-causing performance bottlenecks at the core or thread level

Facilitates monitoring of:

• Core pipeline efficiency – frontend, branch prediction, execution units, schedulers, etc.
• Behavior metrics – stalls, execution rates, utilizations, thread prioritization & resource sharing

Enables understanding and optimization of application performance at processor and compiler level.


Nest Level Instrumentation

Instrumentation at:

• L3 cache
• Interconnect fabric
• Memory channels/controller

Information is provided at the per-core and chip level (as opposed to the thread level for core-level counters).

Significance & usefulness:

• Bandwidth analysis
• Key for analyzing performance in virtualized cloud environments
• Can be used to monitor memory and chip-level characteristics so that cloud capacity can be provisioned effectively
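As a rough illustration of the bandwidth analysis mentioned above, the sketch below converts a count of cache-line transfers seen over a sampling interval into GB/s. The function name and the sample numbers are hypothetical. The 128-byte value is the POWER8 cache line size, but real nest counters may count in other units, so treat this as arithmetic only.

```python
CACHE_LINE_BYTES = 128  # POWER8 cache line size


def bandwidth_gbps(line_transfers: int, interval_s: float) -> float:
    """Return bandwidth in GB/s (1 GB = 1e9 bytes) for a number of line transfers."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return line_transfers * CACHE_LINE_BYTES / interval_s / 1e9


if __name__ == "__main__":
    # Hypothetical sample: 50 million line transfers observed in one second.
    print(f"{bandwidth_gbps(50_000_000, 1.0):.2f} GB/s")
```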


What’s new in POWER8?

Enhanced CPI Stack Cycle Accounting Model

Hotness Table

Branch History Rolling Buffer

Event-Based Branches

Prefetch effectiveness events

Additional Events to capture & analyze hardware level performance issues


POWER8 Microarchitecture


POWER8 Core Pipeline


Front end stalls: cycles in which a thread’s Global Completion Table (GCT) was empty, i.e. the pipeline was empty for that thread.

Back end stalls: cycles in which the thread had GCT entries but no completion occurred.
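The two definitions above amount to a per-cycle classification. A minimal sketch, assuming a hypothetical per-cycle trace of (GCT entries, groups completed) pairs for one thread; the function names and the trace format are illustrative and are not PMU output:

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_cycle(gct_entries: int, groups_completed: int) -> str:
    """Label one cycle of one thread using the front end / back end definitions."""
    if groups_completed > 0:
        return "completion"
    if gct_entries == 0:
        return "front_end_stall"  # GCT empty: nothing in the pipeline for this thread
    return "back_end_stall"       # GCT has entries but nothing completed


def summarize(trace: Iterable[Tuple[int, int]]) -> Counter:
    """Count cycles per category for a trace of (gct_entries, groups_completed)."""
    return Counter(classify_cycle(g, c) for g, c in trace)


if __name__ == "__main__":
    # Hypothetical four-cycle trace.
    print(summarize([(0, 0), (2, 0), (3, 1), (0, 0)]))
```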


POWER8 Group Formation

Group formation:

• After instruction fetch, instructions are formed into groups for dispatch and completion tracking
• Thread priority logic selects up to 8 instructions per cycle from the instruction buffers for group formation
• Group formation is driven by group formation rules

Global Completion Table (GCT)

Completion based performance bottleneck analysis
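A toy sketch of group formation, using only the "up to 8 instructions" limit from the slide. The actual POWER8 group formation rules are not given here, so the rule that a branch closes a group, and the mnemonic-prefix test for branches, are hypothetical stand-ins.

```python
from typing import Callable, List

MAX_GROUP_SIZE = 8  # from the slide: up to 8 instructions selected per cycle


def form_groups(
    instructions: List[str],
    is_branch: Callable[[str], bool] = lambda op: op.startswith("b"),
) -> List[List[str]]:
    """Split an instruction stream into groups of at most MAX_GROUP_SIZE."""
    groups: List[List[str]] = []
    current: List[str] = []
    for op in instructions:
        current.append(op)
        # Hypothetical rule: a branch, or a full group, closes the current group.
        if is_branch(op) or len(current) == MAX_GROUP_SIZE:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    print(form_groups(["add", "ld", "bne", "mul"]))
```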


CPI Analysis

• The cycles-per-instruction (CPI) stack presents a picture of a typical instruction’s lifespan from fetch to completion
• Provides information to narrow down the bottleneck point(s) in the processor pipeline
• POWER8 features a completion-based CPI stack accounting model
• Time spent in execution is split into:
    • Group completion cycles
    • Stall cycles
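The split described above can be written as simple arithmetic: total CPI is cycles divided by completed instructions, and each part of the stack is that part's cycles divided by the same instruction count. A minimal sketch with hypothetical counts and an illustrative function name:

```python
def cpi_split(total_cycles: int, completion_cycles: int, instructions: int) -> dict:
    """Split CPI into a completion part and a stall part."""
    if instructions <= 0:
        raise ValueError("instruction count must be positive")
    stall_cycles = total_cycles - completion_cycles
    return {
        "cpi": total_cycles / instructions,
        "completion_cpi": completion_cycles / instructions,
        "stall_cpi": stall_cycles / instructions,
    }


if __name__ == "__main__":
    # Hypothetical counts: 3000 cycles, 1000 of them completion cycles, 2000 instructions.
    print(cpi_split(3000, 1000, 2000))
```

The two parts add up to the total CPI by construction, which is what lets the stack be read as a breakdown.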


POWER8 CPI Stack

Cycles

• Completion Stalls
    • Stall due to BR or CR
        • Stall due to Branch
        • Stall due to CR
    • Stall due to Fixed-Point
        • Stall due to Fixed-Point Long
        • Stall due to Fixed-Point (other)
    • Stall due to Vector/Scalar
        • Stall due to Vector
            • Stall due to Vector Long
            • Stall due to Vector (other)
        • Stall due to Scalar
            • Stall due to Scalar Long
            • Stall due to Scalar (other)
        • Stall due to Vector/Scalar (other)
    • Stall due to LSU
        • Stall due to Dcache Miss
        • Stall due to LSU Reject
        • Stall due to Store Finish
        • Stall due to Load Finish
        • Stall due to Store Forward
        • Stall due to Load/Store (other)
    • Stall due to Next-to-Complete Flush
• Waiting to Complete
• Thread Blocked
    • Blocked due to LWSync
    • Blocked due to HWSync
    • Blocked due to ECC Delay
    • Blocked due to Flush
    • Blocked due to COQ Full
    • Thread Blocked (other)
• Completion Table Empty
    • Completion Table Empty due to IC Miss
        • Completion Table Empty due to IC L3 Miss
        • Completion Table Empty due to IC Miss (other)
    • Completion Table Empty due to Branch Mispredict
    • Completion Table Empty due to Branch Mispredict + IC Miss
    • Completion Table Empty – Dispatch Held
        • Dispatch Held due to Mapper
        • Dispatch Held due to Store Queue
        • Dispatch Held due to Issue Queue
        • Dispatch Held (other)
    • Completion Table Empty (other)
• Completion Cycles
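One way to read the "(other)" entries in the stack is as the remainder of a parent component after its named children are subtracted. That derivation is an assumption here, not something the slide states, and the cycle counts and key names below are hypothetical.

```python
def other_component(parent_cycles: int, child_cycles: dict) -> int:
    """Cycles of a parent component not explained by its named children."""
    remainder = parent_cycles - sum(child_cycles.values())
    # Counters are sampled independently, so clamp small negative remainders to zero.
    return max(remainder, 0)


if __name__ == "__main__":
    # Hypothetical LSU stall breakdown, in cycles.
    lsu_children = {"dcache_miss": 600, "lsu_reject": 100, "store_finish": 50}
    print(other_component(1000, lsu_children))
```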


CPI Stack – LSU Stalls


An Example of CPI Stack

[Figure: “CPI Stack” chart comparing Prefetch OFF and Prefetch ON (axis range 0.000 to 3.000). Legend: PM_CMPLU_STALL, PM_NTCG_ALL_FIN, PM_CMPLU_STALL_THRD, PM_GCT_NOSLOT_CYC, PM_GRP_CMPL]
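A comparison like the Prefetch OFF vs. Prefetch ON chart can be built by normalizing each event's cycle count by the instruction count and then differencing the two runs. The sketch below uses the five event names from the legend; the counts and instruction totals are hypothetical, not measurements from the talk.

```python
EVENTS = [
    "PM_CMPLU_STALL",
    "PM_NTCG_ALL_FIN",
    "PM_CMPLU_STALL_THRD",
    "PM_GCT_NOSLOT_CYC",
    "PM_GRP_CMPL",
]


def cpi_stack(cycle_counts: dict, instructions: int) -> dict:
    """Per-event CPI contribution: event cycles divided by completed instructions."""
    return {e: cycle_counts[e] / instructions for e in EVENTS}


def stack_delta(off: dict, on: dict) -> dict:
    """Change in each component going from the OFF run to the ON run."""
    return {e: on[e] - off[e] for e in EVENTS}


if __name__ == "__main__":
    # Hypothetical counts for two runs of the same 1000-instruction workload.
    off = cpi_stack(dict(zip(EVENTS, [2000, 100, 50, 250, 500])), 1000)
    on = cpi_stack(dict(zip(EVENTS, [1000, 100, 50, 250, 500])), 1000)
    for event, delta in stack_delta(off, on).items():
        print(f"{event:22s} {delta:+.3f}")
```

A negative delta for a component means that component shrank with prefetch enabled.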


CPI Stack – Detailed Stall Distribution

[Figure: “Completion Stall Components” chart comparing Prefetch OFF and Prefetch ON (axis range 0.000 to 4.000). Legend: PM_CMPLU_STALL_BRU_CRU, PM_CMPLU_STALL_FXU, PM_CMPLU_STALL_VSU, PM_CMPLU_STALL_VECTOR, PM_CMPLU_STALL_SCALAR, PM_CMPLU_STALL_NTCG_FLUSH, PM_CMPLU_STALL_LSU, PM_CMPLU_STALL_DCACHE_MISS, PM_CMPLU_STALL_REJECT, PM_CMPLU_STALL_STORE, PM_CMPLU_STALL_LOAD_FINISH, PM_CMPLU_STALL_ST_FWD]


Data Source Analysis

Analysis of application data accesses across the cache and memory hierarchy is key to understanding:

• Performance limiting factors & resource requirements of the application
• Scaling capabilities (in multi-threaded scenarios)

[Table: cache hierarchy latencies]
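A data source breakdown is, at its simplest, the share of loads served from each level of the hierarchy. The sketch below assumes per-source load counts have already been collected; the source labels and counts are hypothetical and do not correspond to specific PMU event names.

```python
def source_breakdown(loads_by_source: dict) -> dict:
    """Fraction of loads served by each data source."""
    total = sum(loads_by_source.values())
    if total == 0:
        return {src: 0.0 for src in loads_by_source}
    return {src: count / total for src, count in loads_by_source.items()}


if __name__ == "__main__":
    # Hypothetical counts of L1-miss loads by where the data came from.
    shares = source_breakdown({"L2": 600, "L3": 300, "memory": 100})
    for src, share in shares.items():
        print(f"{src:8s} {share:6.1%}")
```

A workload whose share from memory grows as threads are added is a candidate for the scaling limits mentioned above.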


Prefetch Controls

Prefetch effects:

• Positive:
    • Brings data closer to the core
    • Reduces memory access stalls
• Possible negative effects:
    • Extra bandwidth consumption, choking other applications’ memory accesses
    • Cache pollution
    • Increased power consumption

POWER8 supports L1- and L3-level prefetches.

DSCR register (Power ISA v2.07) fields:

• DPFD: Default Prefetch Depth
• SSE: Store Stream Enable
• SNSE: Stride-N Stream Enable
• LSD: Load Stream Disable
• URG: Depth Attainment Urgency
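The DSCR fields listed above can be unpacked from a register value with shifts and masks. This is a minimal sketch: the bit positions (DPFD in the three least-significant bits, followed by SSE, SNSE, LSD and a three-bit URG field) are an assumption based on Power ISA v2.07 and should be verified against the ISA document. The function name is illustrative.

```python
# Assumed layout, with bit 0 as the least-significant bit: (shift, width).
# Verify these positions against Power ISA v2.07 before relying on them.
DSCR_FIELDS = {
    "DPFD": (0, 3),  # Default Prefetch Depth
    "SSE": (3, 1),   # Store Stream Enable
    "SNSE": (4, 1),  # Stride-N Stream Enable
    "LSD": (5, 1),   # Load Stream Disable
    "URG": (6, 3),   # Depth Attainment Urgency
}


def decode_dscr(value: int) -> dict:
    """Extract the prefetch-control fields from a DSCR value."""
    return {
        name: (value >> shift) & ((1 << width) - 1)
        for name, (shift, width) in DSCR_FIELDS.items()
    }


if __name__ == "__main__":
    # Hypothetical value: DPFD=7, SNSE=1, URG=4.
    print(decode_dscr(0x117))
```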


Studying Prefetch Effectiveness

• POWER8 provides performance events for studying prefetch effectiveness
• When a prefetched cache line is evicted, the counters record whether that line was used or went unused
• Counters available:
• MEPF metrics are used to evaluate prefetch effectiveness in POWER8
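Given counts of prefetched lines that were used and unused at eviction, one simple effectiveness measure is the used fraction. The counter values and the function name below are hypothetical; the slide does not give the event names or a formula.

```python
def prefetch_usefulness(used_lines: int, unused_lines: int) -> float:
    """Fraction of evicted prefetched lines that were used before eviction."""
    total = used_lines + unused_lines
    if total == 0:
        return 0.0  # no prefetched lines were evicted in the interval
    return used_lines / total


if __name__ == "__main__":
    # Hypothetical counts: 750 prefetched lines used, 250 evicted unused.
    print(f"{prefetch_usefulness(750, 250):.0%} of prefetched lines were used")
```

A low fraction suggests the prefetcher is mostly adding bandwidth use and cache pollution for this workload.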


Application Profiling tools

Marked Event Profiling:

• Pinpoints performance-inhibiting behavior/bottlenecks to a specific instruction in application code

Why necessary?

• Non-marked events are best suited to studying performance metrics
• In an out-of-order (OOO), superscalar, multiple-issue processor, profile data from non-marked events can only indicate the code “region” responsible for a performance bottleneck
• Code “region” granularity can range from a few to tens of instructions
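Because a marked event is tied to the specific instruction that caused it, profiling reduces to counting samples per instruction address. A minimal sketch, assuming the sampled addresses have already been extracted from a profile; the addresses and function name are hypothetical:

```python
from collections import Counter
from typing import Iterable, List, Tuple


def hottest_instructions(sample_addresses: Iterable[int], top: int = 3) -> List[Tuple[int, int]]:
    """Return the `top` instruction addresses with the most marked-event samples."""
    return Counter(sample_addresses).most_common(top)


if __name__ == "__main__":
    # Hypothetical sampled instruction addresses for one marked event.
    samples = [0x1000, 0x1004, 0x1000, 0x1008, 0x1000, 0x1004]
    for address, count in hottest_instructions(samples, top=2):
        print(f"{address:#x}: {count} samples")
```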


Example of Marked Event profiling


Marked Events – a non-exhaustive list

• PM_MRK_LD_MISS_L1
• PM_MRK_LD_MISS_L1_CYC
• PM_MRK_BR_MPRED_CMPL
• PM_MRK_BR_TAKEN_CMPL
• PM_MRK_DATA_FROM_MEM
• PM_MRK_LSU_REJECT
• PM_MRK_STCX_FAIL
• PM_MRK_GRP_IC_MISS
• PM_MRK_DTLB_MISS
• PM_MRK_ST_FWD
• PM_MRK_LSU_FLUSH
• PM_MRK_LSU_FLUSH_ULD
• PM_MRK_LSU_FLUSH_UST


Microarchitecture Analysis

Deep-dive analysis to root-cause performance inhibitors at the processor pipeline stages.

Tools used:

• Itrace
• Cycle Accurate Simulator

[Figure: analysis workflow. Trace application with valgrind → Generate qtrace → simppc → Scrollpipe and Microarchitecture Stats → Analyze & Optimize Application code]


Tools for Microarchitecture Analysis

• IBM SDK for Linux on Power
• IBM POWER8 Functional Simulator (systemsim)
• Valgrind framework provides application/program tracing capabilities (itrace)
• POWER8 Performance Simulator (sim_ppc)

https://www-304.ibm.com/webapp/set2/sas/f/lopdiags/sdklop.html


Thank You!
