
June 4, 2005 MoBS 2005 1 Sampling and Stability in TCP/IP Workloads Lisa Hsu, Ali Saidi, Nathan Binkert Prof. Steven Reinhardt University of Michigan


Background

  • During networking experiments, some runs would inexplicably get no bandwidth
  • Searched high and low for what was “wrong”
      • Simulator bug?
      • Benchmark bug?
      • OS bug?
  • Answer: none of the above

The Real Answer

  • Simulation Methodology!?
      • Tension between speed and accuracy in simulation
      • Want to capture representative portions of simulation WITHOUT running the entire application
      • Solution: Fast functional simulation
  • So what’s the problem here?

TCP Tuning

  • TCP tunes itself to the performance of the underlying system
  • Sets its send rate based on perceived end-to-end bandwidth
      • Performance of network
      • Performance of receiver
  • During checkpointing simulation, TCP had tuned to the performance of a meaningless system
  • After switching to detailed simulation, the dramatic change in underlying system performance disrupted flow

Timing Dependence

  • The degree to which an application’s performance depends upon execution timing (e.g. memory latencies)
  • Three classes:
      • Non-timing dependent (like SPEC2000)
      • Weakly timing dependent (like multithreaded)
      • Strongly timing dependent

Strongly Timing Dependent

[Diagram: execution path for a packet from the application. Perceived bandwidth high: send it now! Perceived bandwidth low: wait until later.]

Application execution depends on stored feedback state from the underlying system (like TCP/IP workloads)

Correctness Issue

[Diagram: the same execution path (perceived bandwidth high: send it now; perceived bandwidth low: wait until later), shown across a Functional Simulation phase and a Detailed Simulation phase and labeled MEANINGLESS.]

Need to...

[Diagram: the same execution path (perceived bandwidth high: send it now; perceived bandwidth low: wait until later).]

  • Perceived bandwidth reflects that of the configuration under test
  • Safe to take data!!

Goals

  • More rigorous characterization of this phenomenon
  • Determine severity of this tuning problem across a variety of networking workloads
      • Network link latency sensitivity?
      • Benchmark type sensitivity?
      • Functional CPU performance sensitivity?

M5 Simulator

  • Network targeted full system simulator
  • Real NIC model
      • National Semiconductor DP83820 GigE Ethernet Controller
  • Boots Linux 2.6
      • Uses Linux 2.6 driver for DP83820
  • All systems (and link) modeled in a single process
      • Synchronization between systems managed by a global tick frequency

Operating Modes

Mode                          Used for           Wall clock speed   Simulated CPU speed
Pure Functional (PF)          Checkpointing      Very fast          1 or 8 IPC, 1-cycle memory
Functional with Caches (FC)   Cache warmup       Fast               1 IPC + blocking caches (<< 1 IPC)
Detailed (D)                  Data measurement   Very slow          OoO superscalar, non-blocking caches

[Slide callouts rank the simulated CPU speeds: 1 IPC + blocking caches (<< 1 IPC) is SLOWEST; OoO superscalar with non-blocking caches is FASTER; the 1 or 8 IPC, 1-cycle memory configurations are marked FASTER and FASTEST.]

Benchmarks

  • 2 system client/server configuration
      • Netperf
          • Stream: a transmit microbenchmark
          • Maerts: a receive microbenchmark
      • SPECWeb99
  • NAT configuration (3 system config)
      • Netperf maerts with a NAT gateway between client and server

Experimental Configuration

[Diagram: a System Under Test (sender/NAT/receiver) connected by a link to a Drive System (receiver/sender; x2 if NAT). Diagram labels: PF8; CHECKPOINTING: PF1/PF8; CACHE WARMUP: FC1 cache; MEASUREMENT: D.]

“Graph Theory”

  • Tuning periods after CPU model changes?
  • How long do they last?
  • Which graph minimizes Detailed modeling time necessary?
  • Effects of checkpointing PF width?

Netperf Maerts

[Figure: two panels, "Detailed" and "FC->Detailed", plotting bandwidth (Gbps) against millions of cycles for checkpointing width=1 and width=8. Annotations: "PF checkpoints loaded, transition to D or FC"; "FC cache warmup ends, transition to D"; "Known achievable bandwidth by each system configuration"; "Tuning period" (twice); "No tuning!"; "bears brunt of tuning time"; "COV 1.66%" and "COV .5%".]

Takeaways:

1) Shift from “high performance” CPU to lower causes more drastic tuning periods

2) Shift from lower performance to higher has more gentle transition
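
The COV figures on this slide suggest one way to make "the tuning period is over" concrete: treat bandwidth as stable once its coefficient of variation (standard deviation divided by mean) over a sliding window falls below a threshold. The sketch below is only an illustration of that idea, not the procedure used in the talk. The function names, the window length, the 2% threshold and the sample values are all assumptions.

```python
from statistics import mean, pstdev


def coefficient_of_variation(samples):
    """Return stddev / mean of the samples (population standard deviation)."""
    avg = mean(samples)
    if avg == 0:
        raise ValueError("COV is undefined when the mean is zero")
    return pstdev(samples) / avg


def first_stable_index(samples, window=5, threshold=0.02):
    """Return the start index of the first window whose COV is <= threshold.

    Returns None if no window qualifies. Window and threshold are
    hypothetical defaults chosen for illustration only.
    """
    for start in range(len(samples) - window + 1):
        chunk = samples[start:start + window]
        # Skip all-zero windows: bandwidth of zero is never "stable" here.
        if mean(chunk) > 0 and coefficient_of_variation(chunk) <= threshold:
            return start
    return None


# Hypothetical bandwidth samples in Gbps: a zero-bandwidth dip, a ramp,
# then steady state. These are made-up numbers, not data from the talk.
gbps = [4.0, 0.0, 0.0, 0.5, 1.5, 2.4, 2.5, 2.5, 2.5, 2.5, 2.5]
stable_from = first_stable_index(gbps, window=5, threshold=0.02)
```

With these made-up samples, measurement would begin at the first window whose samples all sit near 2.5 Gbps; everything before it would be discarded as tuning time.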

Netperf Stream

  • Why no tuning periods?
      • Because it is SENDER limited!
      • Change in performance is local: no feedback from network or receiver required
      • Thus changes in send rate can be immediate

[Figure: two panels, "FC->Detailed" and "Detailed", plotting bandwidth (Gbps) against millions of cycles for width = 1 and width = 8.]

NAT Netperf Maerts

[Figure: two panels, "FC->Detailed" and "Detailed", plotting bandwidth (Gbps) against millions of cycles for width = 1 and width = 8.]

[Diagram: sender, NAT, receiver. NAT = System Under Test; CPU changes applied here.]

The “pipe” is changing. This feedback takes longer to receive in TCP because it is not explicit, and may ruin the simulation.

TCP Kernel Parameters

TCP RULES:

  • pouts may NOT exceed cwnds
  • bytes(pouts) may NOT exceed sndwnds

[Figure: "Detailed Kernel Params", plotting pouts, cwnds and sndwnds against millions of cycles (0.5 to 90.5); left axis in packets (0 to 300), right axis 37490 to 37590. One region is annotated "Deadlock?".]

  • pouts: unACKed packets in flight
  • cwnds: congestion window (in packets)
      • Reflects state of the network pipe
  • sndwnds: available receiver buffer space (in bytes)
      • Reflects receiver’s ability to receive

Solved in real world by TCP timeouts, but would take much too long to simulate
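
The two rules on this slide can be read as a single admission check that the sender applies before transmitting each packet. The following is a minimal sketch of that check, not kernel code: `SenderState`, `bytes_in_flight` and `may_send` are hypothetical names introduced here, and the numbers in the usage lines are made up.

```python
from dataclasses import dataclass


@dataclass
class SenderState:
    pouts: int            # unACKed packets in flight
    cwnd: int             # congestion window, in packets
    sndwnd: int           # available receiver buffer space, in bytes
    bytes_in_flight: int  # bytes carried by the unACKed packets


def may_send(state, packet_bytes):
    """Return True only if sending one more packet keeps both rules satisfied."""
    # Rule 1: packets in flight may not exceed the congestion window.
    fits_cwnd = state.pouts + 1 <= state.cwnd
    # Rule 2: bytes in flight may not exceed the send window.
    fits_sndwnd = state.bytes_in_flight + packet_bytes <= state.sndwnd
    return fits_cwnd and fits_sndwnd


# Hypothetical states: unconstrained, network limited, and receiver limited.
unconstrained = SenderState(pouts=10, cwnd=20, sndwnd=64000, bytes_in_flight=15000)
network_limited = SenderState(pouts=20, cwnd=20, sndwnd=64000, bytes_in_flight=30000)
receiver_limited = SenderState(pouts=10, cwnd=20, sndwnd=15000, bytes_in_flight=15000)
```

If either window is exhausted and no ACK arrives to reopen it, `may_send` stays false, which is one way to picture the stall that the slide labels a possible deadlock.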

SPECWeb99

[Figure: two panels, "Detailed" and "FC->Detailed", plotting bandwidth (Gbps) against millions of cycles for width = 1 and width = 8.]

  • Much more complex than Netperf
  • Harder to understand fundamental interactions
  • Speculations in paper, but understanding this more deeply is definitely future work

What About Link Delay?

[Figure: "Maerts Link Delay Comparison", plotting bandwidth (Gbps) against millions of cycles (10 to 1090) for zero delay and 400us delay.]

[Figure: "400us Delay Kernel Parameters", plotting pouts and cwnds (packets, 0 to 300) against millions of cycles (0.5 to 1100).]

  • TCP algorithm: cwnd can increase only upon receipt of an ACK packet
  • Ramp-up of cwnd is limited by RTT
  • KEY POINT: tuning time is sensitive to RTT
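
A rough way to see why tuning time scales with RTT: if the window can grow only when ACKs return, then each growth step costs one round trip. The sketch below assumes the idealized slow-start model in which cwnd doubles once per RTT, with no losses and no receiver limit. The function names and the example RTT and window values are assumptions, not figures from the talk.

```python
def slow_start_rounds(initial_cwnd, target_cwnd):
    """Count the round trips needed for cwnd to reach target_cwnd.

    Assumes cwnd doubles once per RTT (idealized slow start, no loss).
    """
    if initial_cwnd < 1 or target_cwnd < 1:
        raise ValueError("windows must be at least 1 packet")
    rounds = 0
    cwnd = initial_cwnd
    while cwnd < target_cwnd:
        cwnd *= 2
        rounds += 1
    return rounds


def ramp_up_time_us(initial_cwnd, target_cwnd, rtt_us):
    """Return the ramp-up time in microseconds: rounds multiplied by RTT."""
    return slow_start_rounds(initial_cwnd, target_cwnd) * rtt_us


# Hypothetical comparison: same target window, short RTT versus long RTT.
short_rtt = ramp_up_time_us(1, 256, rtt_us=10)
long_rtt = ramp_up_time_us(1, 256, rtt_us=800)
```

The number of rounds is fixed by the window sizes, so under this model the ramp-up time grows in direct proportion to the RTT.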

Conclusions

  • TCP/IP workloads require a tuning period relative to the network RTT when receiver limited
  • Sender-limited workloads are generally not problematic
  • Some cases lead to unstable system behavior
  • Tips for minimizing tuning time:
      • “Slow” fast forwarding CPU
      • Try different switchover points
      • Use fast-ish cache warmup period to bear brunt of transition

Future Work

  • Identify other strongly timing dependent workloads (feedback directed optimization?)
  • Examine SPECWeb behavior further
  • Further investigate protocol interactions that cause zero bandwidth periods
      • Hopefully lead to more rigorous avoidance method

Questions?

Non-Timing Dependent

[Diagram: execution path through a memory access, with outcomes labeled Perfect Cache, HIT, MISS and L1.]

Single-threaded, application only execution (like SPEC2000)

Weakly Timing Dependent

[Diagram: execution path at a memory access. Perfect cache: continue. L1 miss: idle loop. RAM access: schedule different thread.]

Application execution tied to OS decisions (like multi-threaded apps)

Basic TCP Overview

  • Congestion Control Algorithm
      • Match send rate to the network’s ability to receive it
  • Flow Control Algorithm
      • Match send rate to the receiver’s ability to receive it
  • Overall goal:
      • Send data as fast as possible without overwhelming system, which would effectively cause slowdown

Congestion Control

  • Feedback in the form of
      • Time Outs
      • Duplicate ACKs
  • Feedback dictates Congestion Window parameter
      • Limits the number of unACKed packets out at a given time (i.e. send rate)

Congestion Control cont.

  • Slow Start
      • Congestion window starts at 1; every ACK received grows it by 1, so the window doubles each RTT (exponential increase)
  • Additive Increase, Multiplicative Decrease (AIMD)
      • ACKs grow the window by about 1 packet per RTT; losses perceived by DupACK halve the window
  • Timeout recovery
      • Upon timeout, go back to slow start
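
The three mechanisms on this slide can be sketched as window-update functions. This follows the common textbook form (as described in RFC 5681) rather than any particular kernel: +1 packet per ACK below the slow-start threshold, +1/cwnd per ACK above it, halving on a DupACK-detected loss, and a reset to 1 on timeout. The function names and the floor of 2 packets on the threshold are simplifying assumptions.

```python
def on_ack(cwnd, ssthresh):
    """Return the new congestion window (packets, may be fractional) after one ACK."""
    if cwnd < ssthresh:
        return cwnd + 1.0        # slow start: +1 per ACK, doubles per RTT
    return cwnd + 1.0 / cwnd     # additive increase: about +1 per RTT


def on_dup_ack_loss(cwnd):
    """Multiplicative decrease: return (new_cwnd, new_ssthresh)."""
    half = max(cwnd / 2.0, 2.0)
    return half, half


def on_timeout(cwnd):
    """Timeout recovery: return (new_cwnd, new_ssthresh), restarting slow start."""
    return 1.0, max(cwnd / 2.0, 2.0)


def after_one_rtt(cwnd, ssthresh):
    """Apply one ACK per packet in the current window (one idealized round trip)."""
    for _ in range(int(cwnd)):
        cwnd = on_ack(cwnd, ssthresh)
    return cwnd


# Idealized traces: slow start from 1 packet, and one RTT of additive increase.
slow_start_trace = [1.0]
for _ in range(3):
    slow_start_trace.append(after_one_rtt(slow_start_trace[-1], ssthresh=64.0))
avoidance_step = after_one_rtt(10.0, ssthresh=10.0)
```

Because every update is driven by an ACK, none of these functions runs faster than the round trip allows, which is why a window tuned under one simulated system takes several RTTs to re-tune under another.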

Flow Control

  • Feedback in the form of explicit TCP header notifications
      • Receiver tells sender how much kernel buffer space it has available
  • Feedback dictates send window parameter
      • Limits the amount of unACKed data out at any given time

Results

Zero Link Delay

Non Timing Dependent

  • Single threaded, application only simulation (like SPEC2000)
  • The execution timing does not affect the commit order of instructions
  • Architectural state generated by a fast functional simulator would be the same as that generated by a detailed simulator

Weakly Timing Dependent

  • Applications whose performance is tied to OS decisions
      • Multi-threaded (CMP, SMT, etc.)
  • Execution timing effects like cache hits and misses, memory latencies, etc. can affect scheduling decisions
  • However, these execution path variations are all valid and do not pose a correctness problem

Strongly Timing Dependent

  • Workloads that explicitly tune themselves to the performance of the underlying system
  • Tuning to an artificially fast system affects system performance
  • When switching to detailed simulation, you may get meaningless results