POWER8 Performance Analysis
Satish Kumar Sadasivam, Senior Performance Engineer, Master Inventor
IBM Systems and Technology Labs
Join the conversation at #OpenPOWERSummit 1
Overview
POWER8 Overview
• Introduction to Performance Monitoring
• Performance Monitoring Features in POWER8
• What’s new in POWER8?
• POWER8 Pipeline
• CPI Stack overview – Stall Accounting Model
Performance analysis
• CPI analysis
• Data source analysis
• Prefetch control & Prefetch effectiveness
Application level performance analysis
• Marked event profiling & performance analysis
Microarchitecture bottleneck analysis
• Core bottleneck analysis using trace tool and scroll pipe
POWER8 Processor
Improvements over POWER7
Cache Improvements
Cache Bandwidths
Memory Organization
Performance Instrumentation in P8
• Hardware Performance Monitoring is critical to enable performance evaluation of applications/programs on complex performance cores such as POWER8
• POWER8 provides advanced instrumentation capabilities in two layers
  • Core Instrumentation
  • Nest level Instrumentation
Core Level Performance Monitoring
Nest Level Performance Monitoring
Core Level Performance Monitoring
Key to root-causing performance bottlenecks at the core or thread level
Facilitates monitoring of
• Core pipeline efficiency – frontend, branch prediction, execution units, schedulers, etc.
• Behavior metrics – stalls, execution rates, utilization, thread prioritization & resource sharing
Enables understanding and optimization of application performance at processor and compiler level.
Nest Level Instrumentation
Instrumentation at
• L3 Cache
• Interconnect Fabric
• Memory channels/controller
Information is provided at per-core and chip level (as opposed to thread level for core-level counters)
Significance & Usefulness:
• Bandwidth analysis
• Key for analyzing performance in virtualized cloud environments
• Can be used to monitor memory and chip-level characteristics so that cloud resources can be provisioned effectively
What’s new in POWER8?
Enhanced CPI Stack Cycle Accounting Model
Hotness Table
Branch History Rolling Buffer
Event-Based Branches
Prefetch effectiveness events
Additional Events to capture & analyze hardware level performance issues
POWER8 Microarchitecture
POWER8 Core Pipeline
Front end stalls: cycles in which a thread’s GCT was empty, i.e. the pipeline was empty for that thread.
Back end stalls: cycles in which the thread had GCT entries but no completion occurred.
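To make the two definitions concrete, the sketch below labels each cycle of a per-thread trace. The trace format (GCT occupancy plus a completion flag per cycle), the function names, and the sample values are all hypothetical; on real hardware this accounting is done by the performance monitoring unit, not in software.

```python
from collections import Counter

def classify_cycle(gct_entries: int, completed: bool) -> str:
    """Label one cycle for one thread using the two stall definitions."""
    if gct_entries == 0:
        return "front_end_stall"   # GCT empty: nothing in flight for this thread
    if not completed:
        return "back_end_stall"    # entries in flight, but no completion this cycle
    return "completion"

def summarize(trace):
    """Count cycles per label for a list of (gct_entries, completed) pairs."""
    return Counter(classify_cycle(g, c) for g, c in trace)

# Hypothetical 6-cycle trace, invented for illustration only.
trace = [(0, False), (0, False), (2, False), (2, True), (1, False), (1, True)]
summary = summarize(trace)
print(dict(summary))
```

The order of the checks mirrors the definitions: an empty GCT is tested first, so a cycle only counts as a back end stall when the thread actually has work in flight.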
POWER8 Group Formation
Group formation:
• After instruction fetch, instructions are formed into groups for dispatch and completion tracking.
• Thread priority logic selects up to 8 instructions from the instruction buffers for group formation in each cycle.
• Group formation is driven by group formation rules.
Global Completion Table (GCT)
Completion-based performance bottleneck analysis
CPI Analysis
The cycles-per-instruction (CPI) stack presents a picture of a typical instruction’s lifespan from fetch to completion
Provides information to narrow down the bottleneck point(s) in the processor pipeline
POWER8 features a completion-based CPI stack accounting model
Time spent in execution is split into:
• Group completion cycles
• Stall cycles
POWER8 CPI Stack
Cycles
  Completion Stalls
    Stall due to BR or CR
      Stall due to Branch
      Stall due to CR
    Stall due to Fixed-Point
      Stall due to Fixed-Point Long
      Stall due to Fixed-Point (Other)
    Stall due to Vector/Scalar
      Stall due to Vector
        Stall due to Vector Long
        Stall due to Vector (other)
      Stall due to Scalar
        Stall due to Scalar Long
        Stall due to Scalar (other)
      Stall due to Vector/Scalar (other)
    Stall due to LSU
      Stall due to Dcache Miss
      Stall due to LSU Reject
      Stall due to Store Finish
      Stall due to Load Finish
      Stall due to Store Forward
      Stall due to Load/Store (other)
    Stall due to Next-to-Complete Flush
  Waiting to Complete
  Thread Blocked
    Blocked due to LWSync
    Blocked due to HWSync
    Blocked due to ECC Delay
    Blocked due to Flush
    Blocked due to COQ Full
    Thread Blocked (other)
  Completion Table Empty
    Completion Table Empty due to IC Miss
      Completion Table Empty due to IC L3 Miss
      Completion Table Empty due to IC Miss (other)
    Completion Table Empty due to Branch Mispredict
    Completion Table Empty due to Branch Mispredict + IC Miss
    Completion Table Empty – Dispatch Held
      Dispatch Held due to Mapper
      Dispatch Held due to Store Queue
      Dispatch Held due to Issue Queue
      Dispatch Held (other)
    Completion Table Empty (Other)
  Completion Cycles
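The stack is hierarchical, and many branches end in an "(other)" entry. A minimal sketch of how such a tree could be represented is shown below, assuming that each "(other)" entry is the parent's cycles minus the sum of its measured children. That remainder rule is an assumption of this sketch, as are the `StackNode` class, the helper names, and every number used; none of them come from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class StackNode:
    """One CPI stack component with its measured cycle count."""
    name: str
    cycles: int
    children: List["StackNode"] = field(default_factory=list)

def other_cycles(node: StackNode) -> int:
    """Cycles of the parent not explained by its measured children."""
    return node.cycles - sum(child.cycles for child in node.children)

def cpi_rows(node: StackNode, instructions: int, depth: int = 0) -> List[Tuple[int, str, float]]:
    """Flatten the tree into (depth, name, CPI contribution) rows."""
    rows = [(depth, node.name, node.cycles / instructions)]
    for child in node.children:
        rows.extend(cpi_rows(child, instructions, depth + 1))
    if node.children:
        # Derived remainder entry, not a measured counter.
        rows.append((depth + 1, node.name + " (other)",
                     other_cycles(node) / instructions))
    return rows

# Invented cycle counts for one branch of the stack, for illustration only.
lsu = StackNode("Stall due to LSU", 600, [
    StackNode("Stall due to Dcache Miss", 400),
    StackNode("Stall due to LSU Reject", 50),
    StackNode("Stall due to Store Finish", 30),
])
rows = cpi_rows(lsu, instructions=1000)
for depth, name, cpi in rows:
    print("  " * depth + f"{name}: {cpi:.3f}")
```

Dividing every node by the same instruction count keeps the contributions additive, so the children of a node plus its remainder sum to the node's own CPI contribution.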
CPI Stack – LSU Stalls
An Example of CPI Stack
[Figure: CPI Stack, Prefetch OFF vs. Prefetch ON; y axis from 0.000 to 3.000. Stacked components: PM_CMPLU_STALL, PM_NTCG_ALL_FIN, PM_CMPLU_STALL_THRD, PM_GCT_NOSLOT_CYC, PM_GRP_CMPL]
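A chart like this one can be derived from raw counter values by dividing each component by the number of completed instructions. The sketch below assumes the five events named in the chart are treated as cycle counts, and that the instruction count comes from a separate instructions-completed counter that the slide does not name. All counts are invented for illustration and are not the values plotted in the original figure.

```python
TOP_LEVEL = [
    "PM_CMPLU_STALL",
    "PM_NTCG_ALL_FIN",
    "PM_CMPLU_STALL_THRD",
    "PM_GCT_NOSLOT_CYC",
    "PM_GRP_CMPL",
]

def cpi_components(counts: dict, instructions: int) -> dict:
    """Per-component CPI contribution: counter value / instructions completed."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return {event: counts[event] / instructions for event in TOP_LEVEL}

def total_cpi(components: dict) -> float:
    """Height of the stacked bar."""
    return sum(components.values())

# Invented counter values for two runs of the same workload.
instructions = 1_000_000
prefetch_off = {"PM_CMPLU_STALL": 1_800_000, "PM_NTCG_ALL_FIN": 100_000,
                "PM_CMPLU_STALL_THRD": 50_000, "PM_GCT_NOSLOT_CYC": 150_000,
                "PM_GRP_CMPL": 400_000}
prefetch_on = dict(prefetch_off, PM_CMPLU_STALL=900_000)

off = cpi_components(prefetch_off, instructions)
on = cpi_components(prefetch_on, instructions)
print(f"CPI off={total_cpi(off):.3f} on={total_cpi(on):.3f}")
```

Comparing the two dictionaries component by component shows which part of the stack shrinks when a setting such as prefetching is changed.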
CPI Stack – Detailed Stall Distribution
[Figure: Completion Stall Components, Prefetch OFF vs. Prefetch ON; y axis from 0.000 to 4.000. Components: PM_CMPLU_STALL_BRU_CRU, PM_CMPLU_STALL_FXU, PM_CMPLU_STALL_VSU, PM_CMPLU_STALL_VECTOR, PM_CMPLU_STALL_SCALAR, PM_CMPLU_STALL_NTCG_FLUSH, PM_CMPLU_STALL_LSU, PM_CMPLU_STALL_DCACHE_MISS, PM_CMPLU_STALL_REJECT, PM_CMPLU_STALL_STORE, PM_CMPLU_STALL_LOAD_FINISH, PM_CMPLU_STALL_ST_FWD]
Data Source Analysis
Analysis of application data accesses across the cache & memory hierarchy is key to understanding the following:
• Performance-limiting factors & resource requirements of the application
• Scaling capabilities (in multi-threaded scenarios)
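One simple form of data source analysis is to compute what share of demand loads was satisfied from each level of the hierarchy. The sketch below assumes per-source load counts have already been collected; the source labels and counts are hypothetical, and mapping them to specific PMU events is left open because the slides do not list those events.

```python
def source_shares(counts: dict) -> dict:
    """Fraction of loads satisfied by each data source."""
    total = sum(counts.values())
    if total == 0:
        return {source: 0.0 for source in counts}
    return {source: n / total for source, n in counts.items()}

# Invented load counts per data source, for illustration only.
loads = {"L2": 600, "L3": 300, "local memory": 80, "remote memory": 20}
shares = source_shares(loads)
for source, share in shares.items():
    print(f"{source}: {share:.1%}")
```

A growing share for the memory sources as thread count rises is one hint that scaling is limited by the memory hierarchy rather than by the core.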
Cache hierarchy latencies:
Prefetch Controls
Prefetch effects:
• Positive: brings data closer to the core and reduces memory access stalls
• Possible negative effects: extra bandwidth consumption (choking other applications’ memory accesses), cache pollution, increased power consumption
POWER8 supports prefetching at the L1 and L3 levels
DSCR Register (Power ISA v2.07)
DPFD: Default Prefetch Depth
SSE: Store Stream Enable
SNSE: Stride-N Stream Enable
LSD: Load Stream Disable
URG: Depth Attainment Urgency
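The five fields above can be packed into a register value with plain shifts and masks. The sketch below assumes the following layout, counted from the least significant bit: DPFD in bits 0 to 2, SSE in bit 3, SNSE in bit 4, LSD in bit 5, URG in bits 6 to 8. This is my reading of Power ISA v2.07 and should be checked against the ISA document before use; the helper names are hypothetical, and other DSCR fields are ignored.

```python
# (shift, width) per field, least significant bit = 0.
# Assumed layout; verify against Power ISA v2.07 before relying on it.
DSCR_FIELDS = {
    "DPFD": (0, 3),  # Default Prefetch Depth
    "SSE":  (3, 1),  # Store Stream Enable
    "SNSE": (4, 1),  # Stride-N Stream Enable
    "LSD":  (5, 1),  # Load Stream Disable
    "URG":  (6, 3),  # Depth Attainment Urgency
}

def encode_dscr(**fields: int) -> int:
    """Pack named field values into a single integer."""
    value = 0
    for name, field_value in fields.items():
        shift, width = DSCR_FIELDS[name]
        if not 0 <= field_value < (1 << width):
            raise ValueError(f"{name}={field_value} does not fit in {width} bit(s)")
        value |= field_value << shift
    return value

def decode_dscr(value: int) -> dict:
    """Unpack an integer into the named fields."""
    return {name: (value >> shift) & ((1 << width) - 1)
            for name, (shift, width) in DSCR_FIELDS.items()}

example = encode_dscr(DPFD=7, SSE=1)
print(hex(example), decode_dscr(example))
```

Keeping the layout in one table means a correction to a bit position only has to be made in one place.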
Studying Prefetch Effectiveness
POWER8 provides performance events to study the prefetch effectiveness
Counters record, at the time a prefetched cache line is evicted from the cache, whether that line was used or not
Counters available:
• MEPF Metrics are used to evaluate the Prefetch effectiveness in POWER8
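Given counts of prefetched lines that were used and not used before eviction, a simple effectiveness ratio follows directly. The sketch below assumes the two counts have already been read from the relevant counters; the function name and the numbers are hypothetical, and this ratio is only one plausible way to summarize the data, not necessarily the metric the slides had in mind.

```python
def prefetch_effectiveness(used_lines: int, unused_lines: int) -> float:
    """Fraction of evicted prefetched lines that were used before eviction."""
    total = used_lines + unused_lines
    if total == 0:
        return 0.0  # no prefetched lines were evicted in the interval
    return used_lines / total

# Invented counts, for illustration only.
ratio = prefetch_effectiveness(used_lines=750, unused_lines=250)
print(f"{ratio:.0%} of evicted prefetched lines were used")
```

A low ratio suggests the prefetcher is mostly polluting the cache and consuming bandwidth for this workload, which ties back to the negative effects listed for prefetching.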
Application Profiling tools
Marked Event Profiling:
• Pinpoint performance inhibiting behavior/bottlenecks to specific instruction in application code
Why necessary?
• Non-marked events are best suited to study performance metrics
• In an out-of-order (OOO) superscalar multiple-issue processor, the profile data from non-marked events can only indicate the code “region” responsible for performance bottlenecks
• Code “region” granularity can range from a few to tens of instructions.
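Because marked events are attributed to a specific instruction, a profile can be reduced to a per-address histogram. The sketch below assumes samples arrive as (instruction address, event name) pairs; that sample format, the addresses, and the helper name are hypothetical, while the event names are taken from the marked events listed in this deck.

```python
from collections import Counter

def hottest_instructions(samples, event, top=3):
    """Return the instruction addresses with the most samples for one event."""
    hits = Counter(addr for addr, name in samples if name == event)
    return hits.most_common(top)

# Invented samples, for illustration only.
samples = [
    (0x10000A40, "PM_MRK_LD_MISS_L1"),
    (0x10000A40, "PM_MRK_LD_MISS_L1"),
    (0x10000A40, "PM_MRK_LD_MISS_L1"),
    (0x10000A58, "PM_MRK_LD_MISS_L1"),
    (0x10000B10, "PM_MRK_BR_MPRED_CMPL"),
]
ranking = hottest_instructions(samples, "PM_MRK_LD_MISS_L1")
for addr, count in ranking:
    print(f"{addr:#x}: {count} samples")
```

With non-marked events the same histogram would smear across neighboring addresses, which is the "region" granularity problem described in the bullets.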
Example of Marked Event profiling
Marked Events – a non-exhaustive list
PM_MRK_LD_MISS_L1
PM_MRK_LD_MISS_L1_CYC
PM_MRK_BR_MPRED_CMPL
PM_MRK_BR_TAKEN_CMPL
PM_MRK_DATA_FROM_MEM
PM_MRK_LSU_REJECT
PM_MRK_STCX_FAIL
PM_MRK_GRP_IC_MISS
PM_MRK_DTLB_MISS
PM_MRK_ST_FWD
PM_MRK_LSU_FLUSH
PM_MRK_LSU_FLUSH_ULD
PM_MRK_LSU_FLUSH_UST
Microarchitecture Analysis
Deep-dive analysis to root-cause performance inhibitors at processor pipeline stages.
Tools used:
• Itrace
• Cycle Accurate Simulator
[Figure: analysis workflow. Boxes: Trace application with valgrind; Generate qtrace; sim_ppc; Scrollpipe; Microarchitecture Stats; Analyze & Optimize Application code]
Tools for Microarchitecture Analysis
IBM SDK for Linux on Power
IBM POWER8 Functional Simulator (systemsim)
Valgrind framework provides application/program tracing capabilities (itrace)
POWER8 Performance Simulator (sim_ppc)
https://www-304.ibm.com/webapp/set2/sas/f/lopdiags/sdklop.html
Thank You!