Using CodeAnalyst on Red Hat Enterprise Linux to Understand Performance on AMD Servers
Name: Sanjay Rao, D. John Shakshober
Date: May 10, 2007
AMD CodeAnalyst (CA) profiling of various user applications running on RHEL5 GA.
System Configurations
● Tyan AMD: 8 CPUs (4 sockets, dual core), 1 dual QLA2342 Fibre Channel, 28 15k RPM disks on an HP Enterprise Virtual Array 4000, dual-path MPIO
McCalpin Stream Benchmark
● Copy bandwidth: 1 GB per stream; 1, 2, 4, and 8 streams
● With and without NUMA
● Measure IPC, L2 cache, and bus traffic
Oracle OLTP workload
● Random 2k I/Os (50% read / 50% write), sequential writes to logs, EXT3
● Vary user count, tune SGA to saturate 8 CPUs, using EXT3 direct and async I/O
● Number of transactions / minute (tpm)
● Run with and without large pages (HugeTLBfs)
● Measure IPC, translation buffer misses
Tyan AMD64 NUMA Memory Layout
[Diagram: four sockets (S1 to S4), each with two cores (C0, C1) and its own memory. Two panels show a process on S1C0: one with interleaved memory across S1 to S4, one with non-interleaved (NUMA) memory. Note on diagram: 1 hop to any memory bank.]
McCalpin Streams Copy Bandwidth (1, 2, 4, 8 streams)
[Chart: Rate (MB/s) vs. No. of Streams (1, 2, 4, 8). Series: Non-NUMA, NUMA, % Difference (secondary axis).]
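To make the copy-bandwidth metric concrete, here is a minimal Python sketch of a copy measurement. It is not the McCalpin STREAM code used for these slides. The function name `copy_bandwidth_mb_s`, the buffer size, and the best-of-N timing are assumptions chosen for illustration. As in STREAM's Copy kernel, the rate counts both the bytes read and the bytes written.

```python
import time


def copy_bandwidth_mb_s(n_bytes: int, repeats: int = 5) -> float:
    """Return the best-of-N copy rate in MB/s (bytes read + bytes written)."""
    src = bytearray(n_bytes)
    dst = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # bulk copy of the whole buffer
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    best = max(best, 1e-9)  # guard against a zero reading on a coarse timer
    # Each copied byte is read once and written once, so count 2 * n_bytes.
    return (2 * n_bytes) / best / 1e6


if __name__ == "__main__":
    # Hypothetical buffer size; the slides used 1 GB per stream.
    print(f"{copy_bandwidth_mb_s(64 * 1024 * 1024):.0f} MB/s")
```

An interpreter-level copy like this only illustrates how the rate is derived; absolute numbers are not comparable to the compiled benchmark results on the chart.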
IPC Comparison – McCalpin Streams
Data Access Comparison – McCalpin Streams
Instruction Comparison – McCalpin Streams
L2 Cache Comparison – McCalpin Streams
CA used to monitor CPU and data access stalls w/ a complex database workload, Oracle 10G
Oracle OLTP workload
● Random 2k I/Os (50% read / 50% write), sequential writes to logs, EXT3
● Vary user count, tune SGA to saturate 8 CPUs, using EXT3 direct and async I/O
● Number of transactions / minute (tpm)
● Run with and without large pages (HugeTLBfs)
● Measure IPC, translation buffer misses
The Translation Lookaside Buffer (TLB) is a small CPU cache of recently used virtual to physical address mappings
TLB misses are extremely expensive on today's very fast, pipelined CPUs
Large memory applications can incur high TLB miss rates
HugeTLBs permit memory to be managed in very large segments
AMD64
● Standard page: 4KB
● Default huge page: 2MB
● 512:1 difference
File system mapping interface
Ideal for databases
● E.g., TLB can fully map a 2GB Oracle SGA w/ 1024 TLB entries
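The page-size arithmetic behind these figures can be checked with a short Python sketch. The helper name `pages_needed` is hypothetical; the sizes are the 4KB and 2MB AMD64 page sizes and the 2GB SGA quoted on the slide.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(region_bytes: int, page_bytes: int) -> int:
    """Number of pages (and so address mappings) needed to cover a region."""
    return -(-region_bytes // page_bytes)  # ceiling division


sga_bytes = 2 * GIB  # the 2GB Oracle SGA from the slide

print(pages_needed(sga_bytes, 4 * KIB))  # 524288 mappings with 4KB pages
print(pages_needed(sga_bytes, 2 * MIB))  # 1024 mappings with 2MB huge pages
print((2 * MIB) // (4 * KIB))            # 512, the page-size ratio
```

With 4KB pages the same SGA needs 512 times as many mappings, which is why a large shared memory region puts so much more pressure on the TLB.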
HugeTLBFS
[Diagram: virtual address space mapped to physical memory through the TLB. TLB: 128 data, 128 instruction.]
Oracle 10G OLTP Performance (tpm k), 4k vs 2MB huge pages
[Chart: Trans / min, DTLB Accesses, IC Misses, and L2 Misses. Series: RHEL5, RHEL5 with huge pages, % Difference (secondary axis).]
Data Access – DTLB Assessment Comparison – Oracle Workload
Instruction Cycle Comparison – Oracle Workload
L2 Cache Comparison – Oracle Workload
IPC Comparison – Oracle Workload
RHEL and AMD CodeAnalyst w/ OProfile
Runs w/ standard RHEL oprofile (install sysstat)
Download CA rpm from AMD developer page
GUI allows for easy data collection of
● Cycles, retired inst profile, IPC calculation
● Data Cache access (both I and D)
● Memory subsystem performance
● NUMActl at OS, L2 references
● Translation buffer analysis (TLB)
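The derived metrics named above reduce to simple ratios of event counts. The Python sketch below shows one way to compute them from exported counts. The function names and the sample numbers are hypothetical and do not come from these runs. The percent-difference formula, relative to the baseline run, is an assumption about how the charts were derived.

```python
def ipc(retired_instructions: float, cpu_cycles: float) -> float:
    """Instructions per cycle: retired instructions divided by CPU cycles."""
    if cpu_cycles <= 0:
        raise ValueError("cpu_cycles must be positive")
    return retired_instructions / cpu_cycles


def misses_per_1k_instructions(misses: float, retired_instructions: float) -> float:
    """Miss events (e.g. DTLB or L2) normalized per 1000 retired instructions."""
    if retired_instructions <= 0:
        raise ValueError("retired_instructions must be positive")
    return 1000.0 * misses / retired_instructions


def percent_difference(baseline: float, value: float) -> float:
    """Change of value relative to baseline, in percent (assumed definition)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (value - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical event counts, for illustration only.
    print(ipc(retired_instructions=4.2e9, cpu_cycles=7.0e9))
    print(misses_per_1k_instructions(misses=3.0e6, retired_instructions=4.2e9))
    print(percent_difference(baseline=200.0, value=220.0))
```

Normalizing misses per 1000 instructions keeps runs of different lengths comparable, which matters when comparing configurations such as 4KB pages against huge pages.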