Evaluating the Impact of Simultaneous Multithreading on Network Servers Using Real Hardware

Yaoping Ruan, Princeton University
Vivek Pai, Princeton University
Erich Nahum, IBM T.J. Watson
John Tracey, IBM T.J. Watson

SIGMETRICS’05 http://www.cs.princeton.edu/~yruan

Motivation

Network servers
  Throughput matters
  Hardware intensive

Simultaneous Multithreading (SMT)
  Processor support for high throughput
  Simulated since mid-90s
  Now available: Intel Xeon/Pentium 4 (Hyper-Threading), IBM POWER5


How Does SMT Work?

Simultaneous execution of multiple jobs
Higher utilization of functional units

[Figure: functional-unit usage over cycles (direction of data flow) for Job 1 on Processor 1, Job 2 on Processor 2, and Jobs 1 & 2 on an SMT processor. Colored blocks are functional units currently in use.]
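
The utilization argument can be illustrated with a toy model. This is only a sketch and does not model any real processor: it assumes a hypothetical core with a fixed number of identical functional units, and jobs described only by how many units they need in each step. The function names and the sample jobs are made up for illustration, and each job list must be non-empty.

```python
def utilization_alone(job, units):
    """Fraction of unit-cycles used when one job runs alone on one processor."""
    return sum(job) / (len(job) * units)


def utilization_smt(job_a, job_b, units):
    """Fraction of unit-cycles used when two jobs share one SMT processor."""
    if any(need > units for need in job_a + job_b):
        raise ValueError("a single step cannot need more units than exist")
    ia = ib = cycles = 0
    while ia < len(job_a) or ib < len(job_b):
        free = units
        # Job A issues first; its step always fits because need <= units.
        if ia < len(job_a):
            free -= job_a[ia]
            ia += 1
        # Job B issues in the same cycle only if enough units are left over.
        if ib < len(job_b) and job_b[ib] <= free:
            free -= job_b[ib]
            ib += 1
        cycles += 1
    return (sum(job_a) + sum(job_b)) / (cycles * units)


# Hypothetical jobs: number of functional units needed at each step.
job1 = [2, 1, 3, 1]
job2 = [1, 2, 1, 2]

print("job 1 alone:", utilization_alone(job1, 4))
print("job 2 alone:", utilization_alone(job2, 4))
print("both on SMT:", utilization_smt(job1, job2, 4))
```

In this toy case both jobs finish in the same number of cycles that either would take alone, so the shared processor keeps more of its units busy. Real workloads also contend for caches and the bus, which this sketch deliberately ignores.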


SMT Architecture

Appears as multiple processors to the OS and applications.

[Figure: block diagram of an SMT processor. Duplicated resource: Architectural State Registers #1 and #2. Shared resource: Pipeline Execution Units, Cache Hierarchy. Also shown: System Bus, Main Memory.]


Contributions

Detailed analysis of multiple real hardware platforms and server packages
  Includes previously ignored OS overheads

Micro-architectural performance analysis
  Demonstrates dominance of memory hierarchy

Comparison with simulation studies
  Explains why SMT provides relatively small benefits on real hardware
  Overly aggressive memory simulation yielded higher expected benefits


Outline

Background
Measurement methodology
Throughput & improvement
Micro-architectural performance
Discussion


Measurements Overview

Metrics
  Server throughput
  Throughput improvements (relative speedups)
  Architectural features (CPI, miss ratio, etc.)

Multiple configurations
  Hardware platforms (clock speed, cache, etc.)
  Server software (Apache, Flash, TUX, etc.)
  Kernel configuration (uniprocessor and multiprocessor)
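
The "throughput improvement (relative speedup)" metric can be written out as a short calculation. The sketch below assumes the usual definition, the percentage change in throughput relative to the non-SMT baseline; the function name and the sample throughput values are hypothetical and are not results from the study.

```python
def throughput_improvement(baseline_mbps, smt_mbps):
    """Percentage change in throughput of the SMT run relative to the baseline."""
    if baseline_mbps <= 0:
        raise ValueError("baseline throughput must be positive")
    return 100.0 * (smt_mbps - baseline_mbps) / baseline_mbps


# Hypothetical throughputs in Mb/s, for illustration only.
print(throughput_improvement(800, 880))   # SMT run is faster than the baseline
print(throughput_improvement(800, 760))   # SMT run is slower: negative speedup
```

A negative value means that enabling SMT reduced throughput for that server and platform.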


Hardware Platforms

Three models of Xeon processors: 2.0GHz, 3.06GHz, 3.06GHz L3
  L3 cache: 1MB on the 3.06GHz L3 model
  Mem latency (cycles): 220 at 2.0GHz, 350 at 3.06GHz
  Differences between models: clock rate, cache

L1/L2 cache sizes, main memory, buses and # threads/processor are the same


Web Servers

5 Web server packages
  Apache-MP: multi-process
  Apache-MT: multi-thread
  Flash: event-driven
  TUX: in-kernel
  Haboob: Java server, staged multi-thread model

Benchmark
  SPECweb96 and SPECweb99


System Configuration

5 configuration labels
  # CPUs, SMT on/off, kernel type

          1P-UP           1P-MP           2T              2P              4T
# CPUs    1               1               1               2               2
SMT       off             off             on              off             on
Kernel    uniprocessor    multiprocessor  multiprocessor  multiprocessor  multiprocessor

(T – # threads, P – # processors)
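
The five configuration labels can be captured as a small data structure, which makes the later comparisons (2T vs. 1P-MP, 2T vs. 1P-UP, 4T vs. 2P) explicit. This is a sketch: the class and field names are hypothetical, and it assumes two hardware threads per processor when SMT is on, as the 2T and 4T labels imply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    label: str
    cpus: int      # number of physical processors
    smt: bool      # SMT enabled?
    kernel: str    # "uniprocessor" or "multiprocessor"

    @property
    def hardware_threads(self):
        # Assumption: SMT exposes two threads per physical processor.
        return self.cpus * (2 if self.smt else 1)


CONFIGS = {
    c.label: c
    for c in [
        Config("1P-UP", 1, False, "uniprocessor"),
        Config("1P-MP", 1, False, "multiprocessor"),
        Config("2T", 1, True, "multiprocessor"),
        Config("2P", 2, False, "multiprocessor"),
        Config("4T", 2, True, "multiprocessor"),
    ]
}

# (SMT configuration, baseline) pairs used in the throughput comparisons.
COMPARISONS = [("2T", "1P-MP"), ("2T", "1P-UP"), ("4T", "2P")]

for smt_label, base_label in COMPARISONS:
    smt_cfg, base_cfg = CONFIGS[smt_label], CONFIGS[base_label]
    print(smt_label, "vs.", base_label,
          "threads:", smt_cfg.hardware_threads, "vs.", base_cfg.hardware_threads,
          "kernel change:", smt_cfg.kernel != base_cfg.kernel)
```

Only the 2T vs. 1P-UP pair also changes the kernel type, so that comparison includes the cost of the multiprocessor kernel as well as the effect of SMT.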


Outline

Background
Measurement methodology
Throughput & improvement
  Single processor
  Dual-processor
Micro-architectural performance
Discussion


Throughput Evaluation

[Figure: bar chart, Apache-MP, 3.06GHz. Y-axis: Throughput (Mb/s), 0 to 1200. X-axis: 1P-UP, 1P-MP, 2T w/ SMT (single processor); 2P, 4T w/ SMT (dual-processor). Comparisons marked: 2T vs. 1P-UP, 2T vs. 1P-MP, 4T vs. 2P.]


Improvement on Single Processor

2T: 2 threads, multiprocessor kernel
1P-MP: 1 thread, multiprocessor kernel

[Figure: bar chart, 2T vs. 1P-MP. Y-axis: Throughput improvement (%), -10 to 40. X-axis: Apache-MP, Apache-MT, Flash, TUX, Haboob. Series: 2.0GHz, 3.06GHz, 3.06GHz L3.]


Improvement on Single Processor

2T: 2 threads, multiprocessor kernel
1P-UP: 1 thread, uniprocessor kernel

[Figure: bar chart, 2T vs. 1P-UP. Y-axis: Throughput improvement (%), -10 to 40. Annotation: Kernel overhead.]


Improvement on Dual-processor

4T: 4 threads (2 processors, 2T/Processor)
2P: 2 physical processors (SMT disabled)

[Figure: bar chart, 4T vs. 2P. Y-axis: Throughput improvement (%), -20 to 40. X-axis: Apache-MP, Apache-MT, Flash, TUX, Haboob.]

2.0GHz & 3.06GHz with L3 are better
Memory is still the bottleneck


Micro-architectural Analysis

Use Oprofile
  In-house patch to measure extra events

About 25 performance events
  Cache miss/hit
  TLB miss/hit
  Branches
  Pipeline stall, clear, etc.
  Bus utilization


L1 Instruction Cache Miss Rate

[Figure: bar chart. Y-axis: miss rate, 0% to 20%. X-axis: 1P-UP, 1P-MP, 2T (SMT), 2P, 4T (SMT).]


L2 Cache Miss Rate

Instruction & data unified
Lower rate in SMT due to higher L1 misses

[Figure: bar chart. Y-axis: miss rate, 0% to 10%. X-axis: 1P-UP, 1P-MP, 2T (SMT), 2P, 4T (SMT).]


Putting Events Together

[Figure: stacked bar chart, Apache-MP. Y-axis: Cycles per Instruction (CPI), 0 to 16. X-axis: 1P-UP, 1P-MP, 2T, 2P, 4T. Components: work, L1 Miss, L2 Miss, ITLB, DTLB, Branch, Clear, Buffer. Labeled segments: work, L1 Miss, L2 Miss, others.]

Non-overlapped CPI
L1/L2 miss penalty dominates
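
One common reading of "non-overlapped CPI" is that each event is charged its full penalty, as if no stalls overlapped, and the charges are added to the base work CPI. The sketch below follows that reading, which is an assumption about the method rather than a statement of it. The function name, event rates and penalties are hypothetical placeholders; only the 350-cycle memory latency is taken from the slides.

```python
def non_overlapped_cpi(work_cpi, events_per_instr, penalty_cycles):
    """Sum the base CPI and each event's full penalty, assuming no overlap.

    events_per_instr: event name -> average events per retired instruction
    penalty_cycles:   event name -> cycles charged per event
    Returns (total CPI, per-component breakdown).
    """
    breakdown = {"work": work_cpi}
    for name, rate in events_per_instr.items():
        breakdown[name] = rate * penalty_cycles[name]
    return sum(breakdown.values()), breakdown


# Hypothetical inputs, for illustration only.
rates = {"L1 Miss": 0.05, "L2 Miss": 0.01, "Branch": 0.002}
penalties = {"L1 Miss": 18, "L2 Miss": 350, "Branch": 20}

total, parts = non_overlapped_cpi(1.0, rates, penalties)
for name, cpi in parts.items():
    print(f"{name:8s} {cpi:6.2f}  ({100 * cpi / total:4.1f}% of total)")
print(f"{'total':8s} {total:6.2f}")
```

Because real stalls partly overlap, a sum of this kind is an upper bound on the measured CPI; its use is in showing which component contributes most.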


Measuring Bus Utilization

Event: FSB_DATA_ACTIVITY
  CPU cycles when the bus is busy

Normalized to CPU speed
  Comparable across all CPU clock rates
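
The normalization can be sketched as dividing the busy-cycle count by the total CPU cycles in the measurement interval, which yields a percentage that does not depend on the clock rate. The function name and the sample counts are hypothetical; only the event name and the clock rates come from the slides.

```python
def bus_utilization_percent(bus_busy_cpu_cycles, elapsed_seconds, cpu_hz):
    """Share of CPU cycles during which the bus was busy, in percent."""
    total_cycles = elapsed_seconds * cpu_hz
    if total_cycles <= 0:
        raise ValueError("measurement interval must be positive")
    return 100.0 * bus_busy_cpu_cycles / total_cycles


# Hypothetical FSB_DATA_ACTIVITY counts over a 10-second run.
print(bus_utilization_percent(3_060_000_000, 10, 3_060_000_000))  # 3.06GHz part
print(bus_utilization_percent(2_000_000_000, 10, 2_000_000_000))  # 2.0GHz part
```

Both hypothetical runs give the same percentage even though the raw counts differ, which is the point of normalizing to CPU speed.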


Bus Utilization Results

2.0GHz & 3.06GHz L3 have fewer data transfer cycles
  Lower memory latency in 2.0GHz & 3.06GHz with L3

Coefficient of correlation between bus utilization & speedups: 0.62 ~ 0.95

[Figure: bar chart, Apache-MP. Y-axis: Bus Utilization (%), 0 to 20. X-axis: 1P-UP, 1P-MP, 2T, 2P, 4T.]
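
A coefficient of correlation between two measured series, such as bus utilization and speedup, is typically the Pearson coefficient. The sketch below assumes that definition and computes it from paired samples. The function name is hypothetical and the sample data are invented solely to exercise the code; they are not measurements from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented (bus utilization %, speedup %) pairs, one per server package.
utilization = [6.0, 9.0, 11.0, 14.0, 17.0]
speedup = [2.0, 5.0, 4.0, 9.0, 10.0]
print(round(pearson(utilization, speedup), 2))
```

Values near 1 or -1 indicate a strong linear relationship between the two series, and values near 0 indicate little linear relationship.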



Outline

Background
Measurement parameters
Throughput speedup
Micro-architectural performance
Discussion
  Compare to simulation
  Other Web workloads


SMT Performance on Web Servers

[Figure: bar chart. Y-axis: Throughput improvement, -10% to 100%. Bars: Simulation, Multiprocessor kernel, Uniprocessor kernel, Dual-processor.]


Compare to Simulation

               Simulation              Measurement
               Size      Miss rate     Size      Miss rate
L1-I           128 KB    2.0%          12 KB     17%
L1-D           128 KB    3.6%          8 KB      5.7%
L2             16 MB     1.4%          512 KB    3.9%
Mem latency    90 cycles               220 ~ 350 cycles
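
A rough way to see the gap in this table is to multiply the L2 miss rate by the memory latency, which gives the average memory stall cycles charged per L2 access. This is a back-of-the-envelope sketch, not a calculation from the study: it assumes the miss rates are per access to that cache level and ignores any overlap between misses. The function name is hypothetical; the numbers are the ones in the table.

```python
def stall_cycles_per_l2_access(l2_miss_rate, mem_latency_cycles):
    """Average memory stall cycles per L2 access, ignoring overlap."""
    return l2_miss_rate * mem_latency_cycles


simulated = stall_cycles_per_l2_access(0.014, 90)
measured_low = stall_cycles_per_l2_access(0.039, 220)
measured_high = stall_cycles_per_l2_access(0.039, 350)

print(f"simulation:  {simulated:.2f} cycles")
print(f"measurement: {measured_low:.2f} to {measured_high:.2f} cycles")
print(f"ratio:       {measured_low / simulated:.1f}x to {measured_high / simulated:.1f}x")
```

Under these assumptions the measured hardware pays several times more memory stall per L2 access than the simulated configuration.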


Processor Development Trend

                      1996         2000         2003
Simulated models:
  Mem latency         62 cycles    90 cycles    90 cycles
  L1                  32 KB        128 KB       64 KB
  L2                  256 KB       16384 KB     16384 KB
Actual processors:
  Mem latency         74 cycles    94 cycles    350 cycles
  L1                  16 KB        16 KB        8-12 KB
  L2                  256 KB       512 KB       512 KB


SMT on SPECweb99

SPECweb99 results in paper
  Dynamic + static
  Multiple programs
    • CGI requests, user profile logging, etc.

Speedup very close to static-only workloads
  No more negative speedups in Flash
  May be due to better sharing of resources among different programs


Summary

More realistic speedup evaluation of SMT
  3 processors, 5 servers, 2 kernels
  Exposed factors not previously examined
  5~15% speedup in our best cases

Detailed analysis of memory hierarchy impact on SMT performance
  All other architecture overheads secondary
  Reasons why simulation results were overly optimistic

Thank you

http://www.cs.princeton.edu/~yruan


Future Work

Ways of improving Simultaneous Multithreading performance
  Server performance on POWER5
  Using execution-driven simulation for deeper understanding

Study Chip Multiprocessor (CMP)
  Intel, AMD, and IBM


Pipeline Clears (per Byte)

Conditions when the whole pipeline needs to be flushed

[Figure: bar chart. Y-axis: pipeline clears per byte, 0.00 to 0.30. X-axis: 1T-UP, 1T-MP, 2T, 2P, 4T.]
