Evaluating the Impact of Simultaneous Multithreading on Network Servers
Using Real Hardware
Yaoping Ruan, Princeton University
Vivek Pai, Princeton University
Erich Nahum, IBM T.J. Watson
John Tracey, IBM T.J. Watson
SIGMETRICS’05 http://www.cs.princeton.edu/~yruan 2
Motivation
Network servers
  • Throughput matters
  • Hardware intensive
Simultaneous Multithreading (SMT)
  • Processor support for high throughput
  • Simulated since the mid-90s
  • Now available in Intel Xeon/Pentium 4 (Hyper-Threading) and IBM POWER5
How Does SMT Work?
  • Simultaneous execution of multiple jobs
  • Higher utilization of functional units
[Figure: functional-unit usage over cycles for Job 1 on processor 1, Job 2 on processor 2, and Jobs 1 & 2 together on an SMT processor. Colored blocks are functional units currently in use.]
SMT Architecture
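  • Appears as multiple processors to the OS and applications
  • Duplicated resource: architectural state registers (#1 and #2)
  • Shared resource: pipeline execution units, cache hierarchy
[Diagram: two sets of architectural state registers above the shared pipeline execution units and cache hierarchy, connected through the system bus to main memory.]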
Contributions
Detailed analysis of multiple real hardware platforms and server packages
  • Includes previously ignored OS overheads
Micro-architectural performance analysis
  • Demonstrates dominance of memory hierarchy
Comparison with simulation studies
  • Explains why SMT provides relatively small benefits on real hardware
  • Overly aggressive memory simulation yielded higher expected benefits
Outline
  • Background
  • Measurement methodology
  • Throughput & improvement
  • Micro-architectural performance
  • Discussion
Measurements Overview
Metrics
  • Server throughput
  • Throughput improvements (relative speedups)
  • Architectural features (CPI, miss ratio, etc.)
Multiple configurations
  • Hardware platforms (clock speed, cache, etc.)
  • Server software (Apache, Flash, TUX, etc.)
  • Kernel configuration (uniprocessor and multiprocessor)
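The three metrics named above can be sketched as simple ratios. This is a minimal illustration, not the authors' tooling: the function names are hypothetical, and the inputs are assumed to be raw throughput figures and hardware event counts collected elsewhere.

```python
def throughput_improvement(baseline_mbps: float, smt_mbps: float) -> float:
    """Relative speedup of the SMT run over the baseline run, in percent."""
    if baseline_mbps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (smt_mbps - baseline_mbps) / baseline_mbps * 100.0


def cycles_per_instruction(cycles: int, instructions: int) -> float:
    """CPI: elapsed CPU cycles divided by retired instructions."""
    if instructions <= 0:
        raise ValueError("instruction count must be positive")
    return cycles / instructions


def miss_ratio(misses: int, accesses: int) -> float:
    """Fraction of accesses that missed (cache or TLB)."""
    if accesses <= 0:
        raise ValueError("access count must be positive")
    return misses / accesses
```

A negative result from throughput_improvement means the SMT-enabled run was slower than its baseline.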
Hardware Platforms
Three models of Xeon processors: 2.0GHz, 3.06GHz, and 3.06GHz with L3
  • L3 cache: 1MB on the 3.06GHz L3 model only
  • Memory latency: 220 ~ 350 cycles
  • The models differ in clock rate and cache; L1/L2 cache sizes, main memory, buses, and number of threads per processor are the same
Web Servers
5 Web server packages
  • Apache-MP: multi-process
  • Apache-MT: multi-thread
  • Flash: event-driven
  • TUX: in-kernel
  • Haboob: Java server, staged multi-thread model
Benchmark
  • SPECweb96 and SPECweb99
System Configuration
5 configuration labels: # CPUs, SMT on/off, kernel type

Label    # CPUs   SMT   Kernel
1P-UP    1        off   Uniprocessor
1P-MP    1        off   Multiprocessor
2T       1        on    Multiprocessor
2P       2        off   Multiprocessor
4T       2        on    Multiprocessor

(T = number of threads, P = number of processors)
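The five labels can be written down as data, which makes the comparison pairs used in later slides explicit. This is a sketch under assumptions: the Config class and BASELINES mapping are hypothetical names, SMT is taken to expose two hardware threads per processor (as on the Hyper-Threading Xeons described here), and the 2P configuration is assumed to run the multiprocessor kernel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    label: str
    cpus: int     # physical processors
    smt: bool     # SMT (Hyper-Threading) enabled
    kernel: str   # "UP" (uniprocessor) or "MP" (multiprocessor)

    @property
    def hardware_threads(self) -> int:
        # Assumption: two hardware threads per processor when SMT is on.
        return self.cpus * (2 if self.smt else 1)


CONFIGS = [
    Config("1P-UP", cpus=1, smt=False, kernel="UP"),
    Config("1P-MP", cpus=1, smt=False, kernel="MP"),
    Config("2T", cpus=1, smt=True, kernel="MP"),
    Config("2P", cpus=2, smt=False, kernel="MP"),  # kernel type assumed
    Config("4T", cpus=2, smt=True, kernel="MP"),
]

# Each SMT configuration and the non-SMT baselines it is compared against.
BASELINES = {"2T": ["1P-UP", "1P-MP"], "4T": ["2P"]}

for cfg in CONFIGS:
    print(cfg.label, cfg.hardware_threads, cfg.kernel)
```

Comparing 2T against both 1P-MP and 1P-UP separates the gain from SMT itself from the cost of running the multiprocessor kernel.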
Outline
  • Background
  • Measurement methodology
  • Throughput & improvement
    • Single processor
    • Dual-processor
  • Micro-architectural performance
  • Discussion
Throughput Evaluation
[Figure: Apache-MP, 3.06GHz. Throughput (Mb/s, axis 0 to 1200) for 1P-UP, 1P-MP, and 2T (w/ SMT) on a single processor, and for 2P and 4T (w/ SMT) on a dual-processor. Comparisons marked: 2T vs. 1P-UP, 2T vs. 1P-MP, 4T vs. 2P.]
Improvement on Single Processor
2T: 2 threads, multiprocessor kernel
1P-MP: 1 thread, multiprocessor kernel
[Figure: 2T vs. 1P-MP. Throughput improvement (%, axis -10 to 40) for Apache-MP, Apache-MT, Flash, TUX, and Haboob on the 2.0GHz, 3.06GHz, and 3.06GHz L3 platforms.]
Improvement on Single Processor
2T: 2 threads, multiprocessor kernel
1P-UP: 1 thread, uniprocessor kernel
[Figure: 2T vs. 1P-UP. Throughput improvement (%, axis -10 to 40) for Apache-MP, Apache-MT, Flash, TUX, and Haboob on the 2.0GHz, 3.06GHz, and 3.06GHz L3 platforms. Annotation: kernel overhead.]
Improvement on Dual-processor
4T: 4 threads (2 processors, 2T/processor)
2P: 2 physical processors (SMT disabled)
[Figure: 4T vs. 2P. Throughput improvement (%, axis -20 to 40) for Apache-MP, Apache-MT, Flash, TUX, and Haboob on the 2.0GHz, 3.06GHz, and 3.06GHz L3 platforms.]
  • 2.0GHz & 3.06GHz with L3 are better
  • Memory is still the bottleneck
Micro-architectural Analysis
Use OProfile
  • In-house patch to measure extra events
About 25 performance events
  • Cache miss/hit
  • TLB miss/hit
  • Branches
  • Pipeline stall, clear, etc.
  • Bus utilization
L1 Instruction Cache Miss Rate
[Figure: L1 instruction cache miss rate (axis 0% to 20%) for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1P-UP, 1P-MP, 2T (SMT), 2P, and 4T (SMT).]
L2 Cache Miss Rate
  • Instruction & data unified
  • Lower rate in SMT due to higher L1 misses
[Figure: L2 cache miss rate (axis 0% to 10%) for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1P-UP, 1P-MP, 2T (SMT), 2P, and 4T (SMT).]
Putting Events Together
[Figure: Apache-MP. Cycles per instruction (CPI, axis 0 to 16) for 1P-UP, 1P-MP, 2T, 2P, and 4T, broken down into work, L1 miss, L2 miss, ITLB, DTLB, branch, clear, and buffer. Callouts mark work, L1 miss, L2 miss, and others.]
Non-overlapped CPI
L1/L2 miss penalty dominates
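One way to read a non-overlapped CPI breakdown is to charge every event its full penalty and divide by the instruction count, as if no stalls overlapped. The sketch below assumes that reading; the function name, event names, and penalty values are hypothetical placeholders, not figures from the slides.

```python
def non_overlapped_cpi(event_counts: dict, penalties: dict, instructions: int) -> dict:
    """Per-event CPI contribution, assuming stall cycles never overlap.

    event_counts: event name -> number of occurrences
    penalties:    event name -> cycles charged per occurrence (assumed values)
    """
    if instructions <= 0:
        raise ValueError("instruction count must be positive")
    return {
        name: count * penalties[name] / instructions
        for name, count in event_counts.items()
    }


# Hypothetical counts and penalties, for illustration only.
counts = {"l1_miss": 100, "l2_miss": 10, "branch_mispredict": 5}
cycles_per_event = {"l1_miss": 18, "l2_miss": 300, "branch_mispredict": 20}
breakdown = non_overlapped_cpi(counts, cycles_per_event, instructions=1000)
print(breakdown, sum(breakdown.values()))
```

Because real stalls can overlap, the sum of these contributions is an upper bound on the stall part of the measured CPI rather than an exact decomposition.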
Measuring Bus Utilization
Event: FSB_DATA_ACTIVITY
  • CPU cycles when the bus is busy
Normalized to CPU speed
  • Comparable across all CPU clock rates
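The normalization can be sketched as a ratio of bus-busy cycles to total CPU cycles in the measurement interval. This assumes the event count is already expressed in CPU cycles, as the slide states; the function and parameter names are hypothetical.

```python
def bus_utilization_percent(bus_busy_cycles: float, clock_hz: float,
                            elapsed_seconds: float) -> float:
    """Share of CPU cycles during which the bus was busy, in percent."""
    total_cycles = clock_hz * elapsed_seconds
    if total_cycles <= 0:
        raise ValueError("measurement interval must contain cycles")
    return bus_busy_cycles / total_cycles * 100.0


# Hypothetical counts: the same utilization on two different clock rates.
slow = bus_utilization_percent(2.0e8, clock_hz=2.0e9, elapsed_seconds=1.0)
fast = bus_utilization_percent(3.06e8, clock_hz=3.06e9, elapsed_seconds=1.0)
print(slow, fast)
```

Dividing by the cycle count rather than by wall-clock time is what makes the 2.0GHz and 3.06GHz platforms comparable.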
Bus Utilization Results
  • 2.0GHz & 3.06GHz L3 have fewer data transfer cycles
    • Lower memory latency in 2.0GHz & 3.06GHz with L3
  • Coefficient of correlation between bus utilization & speedups: 0.62 ~ 0.95
[Figure: Apache-MP. Bus utilization (%, axis 0 to 20) for 1P-UP, 1P-MP, 2T, 2P, and 4T on the 2.0GHz, 3.06GHz, and 3.06GHz L3 platforms.]
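A correlation coefficient between bus utilization and speedup can be computed as a plain Pearson coefficient. The slides do not say which estimator was used, so Pearson is an assumption here, and the sample values are hypothetical.

```python
import math


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-server values, for illustration only.
bus_utilization = [4.0, 7.5, 9.0, 12.0, 15.5]
speedup_percent = [2.0, 6.0, 5.5, 11.0, 14.0]
print(pearson(bus_utilization, speedup_percent))
```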
Outline
  • Background
  • Measurement parameters
  • Throughput speedup
  • Micro-architectural performance
  • Discussion
    • Compare to simulation
    • Other Web workloads
SMT Performance on Web Servers
[Figure: Throughput improvement (axis -10% to 100%) for simulation, multiprocessor kernel, uniprocessor kernel, and dual processor.]
Compare to Simulation
              Simulation              Measurement
              Size      Miss rate     Size      Miss rate
L1-I          128 KB    2.0%          12 KB     17%
L1-D          128 KB    3.6%          8 KB      5.7%
L2            16 MB     1.4%          512 KB    3.9%
Mem latency   90 cycles               220 ~ 350 cycles
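A back-of-envelope reading of this table multiplies the miss rates by the memory latency to estimate main-memory stall cycles per L1 data access. This is a rough sketch, not the authors' analysis: it assumes the L2 miss rate is local (misses per L2 access), uses only the L1-D row, and ignores L2 hit latency and any overlap between misses. The function name is hypothetical.

```python
def memory_stall_per_l1_access(l1_miss_rate: float, l2_miss_rate: float,
                               mem_latency_cycles: float) -> float:
    """Estimated main-memory stall cycles per L1 access.

    Assumes l2_miss_rate is a local rate (L2 misses / L2 accesses).
    """
    return l1_miss_rate * l2_miss_rate * mem_latency_cycles


# Values taken from the table above (L1-D row, L2 row, memory latency).
simulated = memory_stall_per_l1_access(0.036, 0.014, 90)
measured_low = memory_stall_per_l1_access(0.057, 0.039, 220)
measured_high = memory_stall_per_l1_access(0.057, 0.039, 350)

print(f"simulated: {simulated:.4f} cycles per access")
print(f"measured:  {measured_low:.4f} to {measured_high:.4f} cycles per access")
```

Under these assumptions the measured hardware pays roughly an order of magnitude more memory stall per access than the simulated configuration.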
Processor Development Trend
                    1996            2000            2003
Simulated models:   62-cycle mem    90-cycle mem    90-cycle mem
                    32 KB L1        128 KB L1       64 KB L1
                    256 KB L2       16384 KB L2     16384 KB L2
Actual processors:  74-cycle mem    94-cycle mem    350-cycle mem
                    16 KB L1        16 KB L1        8-12 KB L1
                    256 KB L2       512 KB L2       512 KB L2
SMT on SPECweb99
SPECweb99 results in paper
  • Dynamic + static
  • Multiple programs
    • CGI requests, user profile logging, etc.
Speedup very close to static-only workloads
  • No more negative speedups in Flash
  • May be due to better sharing of resources among different programs
Summary
More realistic speedup evaluation of SMT
  • 3 processors, 5 servers, 2 kernels
  • Exposed factors not previously examined
  • 5~15% speedup in our best cases
Detailed analysis of memory hierarchy impact on SMT performance
  • All other architecture overheads secondary
  • Reasons why simulation results were overly optimistic
Thank you
http://www.cs.princeton.edu/~yruan
Future Work
  • Ways of improving Simultaneous Multithreading performance
  • Server performance on POWER5
  • Using execution-driven simulation for deeper understanding
  • Study Chip Multiprocessor (CMP)
    • Intel, AMD, and IBM
Pipeline Clears (per Byte)
Conditions when the whole pipeline needs to be flushed
[Figure: Pipeline clears per byte (axis 0.00 to 0.30) for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1T-UP, 1T-MP, 2T, 2P, and 4T.]