High-Fidelity Latency Measurements in Low-Latency Networks
Ramana Rao Kompella
Myungjin Lee (Purdue), Nick Duffield (AT&T Labs – Research)
Low Latency Applications
Many important data center applications require low end-to-end latencies (microseconds):
- High Performance Computing – loses parallelism
- Cluster Computing, Storage – lose performance
- Automated Trading – loses arbitrage opportunities
Cloud applications (Recommendation Systems, Social Collaboration):
- All-up SLAs of 200 ms [AlizadehSigcomm10]
- The SLA covers backend computation time as well as the network, so network latencies have little budget
Latency Measurements Are Needed
[Figure: packets cross ToR switches, edge routers, and core routers; with an end-to-end latency of 1 ms, which router causes the problem?]
At every router, high-fidelity measurements are critical to localize root causes.
Once the root cause is localized, operators can fix it by rerouting traffic, upgrading links, or performing detailed diagnosis.
Measurement within a router is necessary.
Vision: Knowledge Plane
[Figure: a knowledge plane sits above the data center network. Latency measurements are pushed up from the network and pulled through a query interface; services such as SLA diagnosis, routing/traffic engineering, and scheduling/job placement send queries and receive responses.]
Contributions Thus Far…
Aggregate Latency Estimation
- Lossy Difference Aggregator – Sigcomm 2009
- FineComb – Sigmetrics 2011
- mPlane – ReArch 2009
Differentiated Latency Estimation (per-flow latency measurements at every hop)
- Multiflow Estimator – Infocom 2010
- Reference Latency Interpolation – Sigcomm 2010
- RLI across Routers – Hot-ICE 2011
- Delay Sketching – under review at Sigcomm 2011
Scalable Query Interface (per-packet latency measurements)
- MAPLE – under review at Sigcomm 2011
1) PER-FLOW MEASUREMENTS WITH REFERENCE LATENCY INTERPOLATION [SIGCOMM 2010]
Obtaining Fine-Grained Measurements
- Native router support (SNMP, NetFlow): no latency measurements.
- Active probes and tomography: too many probes (~10,000 Hz) required, wasting bandwidth.
- Expensive high-fidelity measurement boxes: the London Stock Exchange uses Corvil boxes, but they cannot be placed ubiquitously.
- Recent work: LDA [Kompella09Sigcomm] computes average latency/variance accurately within a switch; a good start, but may not be sufficient to diagnose flow-specific problems.
From Aggregates to Per-Flow
[Figure: delay over time at a switch queue during a measurement interval; some packets see large delays and others small delays, while only the average latency is reported.]
Observation: there is a significant difference in average latencies across flows at a router.
Goal of this paper: how to obtain per-flow latency measurements in a scalable fashion?
Measurement Model
[Figure: packets enter the router at ingress interface I and leave at egress interface E; latency is measured between the two interfaces.]
Assumption: time synchronization between router interfaces.
Constraint: regular packets cannot be modified to carry timestamps, since that would require intrusive changes to the router forwarding path.
Naïve Approach
- For each flow key, store a timestamp for each packet at I and at E.
- After a flow stops sending, I sends its packet timestamps to E.
- E computes individual packet delays.
- E aggregates average latency, variance, etc. for each flow.
Problem: high communication cost; at 10 Gbps, a few million packets per second must be shipped. Sampling reduces communication, but also reduces accuracy. A minimal sketch follows the figure.
[Figure: worked example with two flows; per-packet delays are the egress timestamps minus the ingress timestamps, giving an average delay of 22/2 = 11 for one flow and 32/2 = 16 for the other.]
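A minimal sketch of the naive scheme: the ingress ships every packet's timestamp to the egress, which subtracts to get per-packet delays and averages them per flow. The numbers below are illustrative only, not the ones in the figure.

```python
def per_flow_avg_delay(ingress_ts, egress_ts):
    # Per-packet delays are egress minus ingress timestamps; communication
    # cost grows with the number of packets shipped from I to E.
    delays = [e - i for i, e in zip(ingress_ts, egress_ts)]
    return sum(delays) / len(delays)

print(per_flow_avg_delay([10, 13], [20, 25]))  # -> 11.0
```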
A (Naïve) Extension of LDA
Maintain an LDA (with many counters) for each flow of interest.
Problem: (potentially) high communication cost, proportional to the number of flows.
[Figure: one LDA per flow at the ingress and at the egress, each holding a packet count and a sum of timestamps; the two sides must coordinate to produce per-flow latencies.]
Key Observation: Delay Locality
[Figure: delay over time; packets with delays D1, D2, D3 fall into short windows whose mean delays are WD1, WD2, WD3.]
True mean delay = (D1 + D2 + D3) / 3
Localized mean delay = (WD1 + WD2 + WD3) / 3
How close is the localized mean delay to the true mean delay as the window size varies?
Key Observation: Delay Locality
[Scatter plot: local mean delay per key vs. true mean delay per key (ms), on data sets from a real router and synthetic queueing models, with the global mean shown for comparison. Shorter windows track the true per-flow delay closely: RMSRE = 0.054 at 0.1 ms, 0.16 at 10 ms, and 1.72 at 1 s.]
Exploiting Delay Locality
- Reference packets are injected regularly at the ingress I: special packets carrying an ingress timestamp.
- They provide reference delay values (a substitute for window averages).
- These are used to approximate the latencies of regular packets.
[Figure: delay over time; reference packets carrying ingress timestamps are interspersed with the regular traffic.]
RLI Architecture
Component 1: Reference Packet Generator
- Injects reference packets regularly at the ingress.
Component 2: Latency Estimator
- Estimates packet latencies and updates per-flow statistics.
- Estimates directly at the egress, with no extra state maintained at the ingress (reduces storage and communication overheads).
[Figure: the reference packet generator sits at ingress I; regular packets and reference packets R carrying the ingress timestamp travel to egress E, where the latency estimator runs.]
Component 1: Reference Packet Generator
Question: when should a reference packet be injected?
- Idea 1 (1-in-n): inject one reference packet every n packets. Problem: low accuracy under low utilization.
- Idea 2 (1-in-τ): inject one reference packet every τ seconds. Problem: bad when short-term delay variance is high.
- Our approach: dynamic injection based on utilization. High utilization → low injection rate; low utilization → high injection rate.
The adaptive scheme works better than fixed-rate schemes (details in the paper); an illustrative sketch follows.
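A minimal sketch of utilization-driven injection, assuming a simple linear mapping from utilization to the inter-reference gap; the bounds and the mapping are illustrative assumptions, not the adaptive algorithm described in the RLI paper.

```python
def reference_gap(utilization, min_gap=10, max_gap=1000):
    # Number of regular packets to let pass before the next reference packet.
    # High utilization -> large gap (low injection rate); low utilization ->
    # small gap (high injection rate). Linear mapping and bounds are assumptions.
    utilization = min(max(utilization, 0.0), 1.0)
    return int(min_gap + utilization * (max_gap - min_gap))

# e.g. reference_gap(0.1) -> 109, reference_gap(0.9) -> 901
```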
Component 2: Latency Estimator
Question 1: how to estimate latencies using reference packets?
Solution: different estimators are possible (a sketch follows the figure below):
- Use only the delay of the left reference packet (RLI-L).
- Use linear interpolation between the left and right reference packets (RLI).
- Other non-linear estimators are possible (e.g., shrinkage).
[Figure: delay over time; the arrival times and delays of the left (L) and right (R) reference packets are known, while only the arrival time of a regular packet is known. Its delay is read off the linear interpolation line between L and R, which introduces some error relative to the true delay.]
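A minimal sketch of the two estimators, assuming each reference packet is represented by its (arrival time, delay) pair at the egress; the function name is illustrative.

```python
def estimate_delay(arrival_time, left_ref, right_ref=None):
    # left_ref / right_ref: (arrival_time, delay) of the reference packets that
    # bracket the regular packet.
    t_left, d_left = left_ref
    if right_ref is None:
        return d_left                        # RLI-L: reuse the left reference delay
    t_right, d_right = right_ref
    if t_right == t_left:
        return d_left
    frac = (arrival_time - t_left) / (t_right - t_left)
    return d_left + frac * (d_right - d_left)   # RLI: linear interpolation

# e.g. estimate_delay(5.0, (0.0, 10.0), (10.0, 20.0)) -> 15.0
```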
Component 2: Latency Estimator (continued)
[Figure: an interpolation buffer of per-flow counters (C1, C2, C3) at the egress. Counters are updated when the right reference packet arrives, and the average latency C2 / C1 is reported when a flow is exported.]
Question 2: how to compute per-flow latency statistics?
Solution: maintain 3 counters per flow at the egress:
- C1: number of packets
- C2: sum of packet delays
- C3: sum of squares of packet delays (for estimating variance)
To minimize state, any flow selection strategy can be used to maintain counters for only a subset of flows. A sketch of the counter updates follows the figure below.
[Figure: example update; a packet's estimated delay (e.g., 4) and its square (16) are added to the selected flow's counters, with a flow selection strategy deciding which flows are tracked.]
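A minimal sketch of the per-flow counter updates at the egress; the dictionary-based state and function names are illustrative, not the paper's implementation.

```python
from collections import defaultdict

# Egress-side state: C1 = packet count, C2 = sum of estimated delays,
# C3 = sum of squared estimated delays.
flow_stats = defaultdict(lambda: [0, 0.0, 0.0])

def update_flow(flow_key, estimated_delay):
    c = flow_stats[flow_key]
    c[0] += 1
    c[1] += estimated_delay
    c[2] += estimated_delay ** 2

def export_flow(flow_key):
    # Mean and variance of a flow's delays, derived from C1, C2, C3 on export.
    c1, c2, c3 = flow_stats.pop(flow_key)
    mean = c2 / c1
    variance = c3 / c1 - mean ** 2
    return mean, variance
```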
Experimental Setup
Data sets:
- No public data center traces with timestamps are available.
- Real router traces with synthetic workloads: WISC.
- Real backbone traces with synthetic queueing: CHIC and SANJ.
Simulation tool: the open-source NetFlow software YAF, extended to support the reference packet injection mechanism and to simulate a queueing model with a RED active queue management policy.
Experiments were run at different link utilizations.
Accuracy under High Link Utilization
[CDF of the relative error of per-flow latency estimates: the median relative error is 10-12%.]
Comparison with Other Solutions
[Average relative error vs. link utilization, with a packet sampling rate of 0.1%: RLI's error is 1-2 orders of magnitude lower than the alternatives.]
Overhead of RLI
- Bandwidth overhead is low: less than 0.2% of link capacity.
- Impact on packet loss is small: the packet loss difference with and without RLI is at most 0.001%, at around 80% utilization.
Summary
- A scalable architecture to obtain high-fidelity per-flow latency measurements between router interfaces.
- Achieves a median relative error of 10-12%.
- Obtains 1-2 orders of magnitude lower relative error than existing solutions.
- Measurements are obtained directly at the egress side.
Contributions Thus Far…
Aggregate Latency Estimation
- Lossy Difference Aggregator – Sigcomm 2009
- FineComb – Sigmetrics 2011
- mPlane – ReArch 2009
Differentiated Latency Estimation
- Multiflow Estimator – Infocom 2010
- Reference Latency Interpolation – Sigcomm 2010
- RLI across Routers – Hot-ICE 2011
- Virtual LDA – under review at Sigcomm 2011
Scalable Query Interface
- MAPLE – under review at Sigcomm 2011
2) SCALABLE PER-PACKET LATENCY MEASUREMENT ARCHITECTURE (UNDER REVIEW AT SIGCOMM 2011)
MAPLE Motivation
- LDA and RLI are ossified in their aggregation level: not suitable for obtaining arbitrary sub-population statistics.
- The delay of a single packet may be important.
Key goal: how to enable a flexible and scalable architecture for packet latencies?
MAPLE Architecture
[Figure: each router has a timestamp unit, a packet latency store (1), and a query engine (2). A packet P1 timestamped T1 at Router A is recorded with delay D1 at Router B; a central monitor issues a query Q(P1) and receives the answer A(P1).]
Timestamping is not strictly required: MAPLE can also work with RLI-estimated latencies.
Packet Latency Store (PLS)
Challenge: how to store packet latencies in the most efficient manner?
Naïve idea: hash tables, which do not scale well:
- At a minimum, they require a label (32 bits) plus a timestamp (32 bits) per packet.
- To avoid collisions, a large number of hash table entries is needed (~147 bits/packet for a collision rate of 1%).
Can we do better?
Our Approach
Idea 1: cluster packets
- Packet delays typically take a few dominant values.
- Cluster packets into equivalence classes and associate one delay value with each cluster.
- Choose cluster centers such that the error is small.
Idea 2: provision storage
- Naïvely, we can use one Bloom filter per cluster (Partitioned Bloom Filter).
- We propose a new data structure, the Shared-Vector Bloom Filter (SVBF), that is more efficient.
Selecting Representative Delays
Approach 1: logarithmic delay selection
- Divide the delay range into logarithmic intervals, e.g., 0.1-10,000 μs into 0.1-1 μs, 1-10 μs, …
- Simple to implement, with bounded relative error, but accuracy may not be optimal.
Approach 2: dynamic clustering
- A k-means (k-medians) clustering formulation.
- Minimizes the average absolute error of packet latencies (minimizes the total distance of latencies to their centers).
Approach 3: hybrid clustering
- Split the centers equally across static and dynamic selection: the best of both worlds.
A sketch of logarithmic selection follows.
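A minimal sketch of logarithmic delay selection, assuming one representative per decade and the geometric midpoint of each interval as its center; both choices are illustrative assumptions, not the paper's exact parameters.

```python
import math

def log_centers(d_min=0.1, d_max=10_000.0, per_decade=1):
    # Split [d_min, d_max] microseconds into logarithmic intervals and use the
    # geometric midpoint of each interval as its representative delay.
    n = int(math.ceil(math.log10(d_max / d_min))) * per_decade
    edges = [d_min * 10 ** (i / per_decade) for i in range(n + 1)]
    return [math.sqrt(lo * hi) for lo, hi in zip(edges, edges[1:])]

def nearest_center(delay, centers):
    return min(centers, key=lambda c: abs(c - delay))

# e.g. log_centers() -> centers near 0.32, 3.2, 32, 316, 3162 microseconds
```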
K-means
Goal: determine k centers every measurement cycle.
- This can be formulated as a k-means clustering problem.
Problem 1: running k-means is typically hard.
- The basic algorithm has O(n^(k+1) log n) running time, and heuristics (Lloyd's algorithm) are also complicated in practice.
Solution: sampling and streaming algorithms.
- Use sampling to reduce n to pn.
- Use a streaming k-medians algorithm (approximate but sufficient).
Problem 2: centers cannot be found and cluster membership recorded at the same time.
Solution: pipelined implementation.
- Use the previous interval's centers as an approximation for the current interval.
A simplified sketch follows.
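A simplified sketch of the sampling-plus-clustering step, using a plain Lloyd-style k-means on a packet sample in place of the streaming k-medians algorithm; the function names, sample sizes, and delay distribution are illustrative assumptions.

```python
import random

def pick_centers(delay_samples, k, iters=20):
    # Plain Lloyd-style k-means on a sample of packet delays; a simplified
    # stand-in for the streaming k-medians algorithm of [CharikarSTOC03].
    centers = random.sample(delay_samples, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for d in delay_samples:
            nearest = min(range(k), key=lambda j: abs(d - centers[j]))
            buckets[nearest].append(d)
        centers = [sum(b) / len(b) if b else centers[j]
                   for j, b in enumerate(buckets)]
    return sorted(centers)

# Pipelined use: centers picked from a sample of one epoch classify the
# packets of a later epoch (illustrative numbers; delays in microseconds).
epoch_i_sample = [random.expovariate(1 / 50) for _ in range(2000)]
centers = pick_centers(epoch_i_sample, k=8)
```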
Streaming k-Medians [CharikarSTOC03]
[Figure: in hardware, the packet stream is sampled (np packets in epoch i) and inserted into the storage data structure, which is flushed to DRAM/SSD after every epoch for archival. In software, an online clustering stage reduces the sample to O(k log(np)) intermediate centers at epoch i+1, and an offline clustering stage produces the k centers used to classify packets in epoch i+2.]
Naïve: Partitioned BF (PBF)
[Figure: one Bloom filter per cluster (c1…c4). Insertion: the packet's latency is matched in parallel against the centers, and bits in the matching cluster's Bloom filter are set by hashing the packet contents. Lookup: all Bloom filters are queried; a filter reports a match when all of its hashed bit positions are 1.]
Problems with PBF
Provisioning is hard:
- Cluster sizes are not known a priori, so BF sizes are over- or under-estimated.
Lookup complexity is higher:
- The data structure must be re-partitioned every cycle.
- Multiple random locations in the bitmap must be looked up (one per hash function) for every filter.
Shared-Vector Bloom Filter
[Figure: a single shared bit vector organized into words whose length equals the number of centers. Insertion: the packet's latency is matched in parallel against the centers (e.g., c2); each hash of the packet contents locates a word, and within that word the bit at the matched center's offset is set to 1. Lookup: each hash locates a word that is read in bulk; ANDing the words yields a bit vector whose set positions identify the candidate centers.]
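A minimal sketch of an SVBF along the lines of the figure above, assuming SHA-1 based hashing and a Python integer to model each k-bit word; the sizes, hash choices, and names are illustrative, not the paper's implementation.

```python
import hashlib

class SVBF:
    """Shared-Vector Bloom Filter sketch: one bit vector made of k-bit words,
    where k is the number of delay cluster centers."""
    def __init__(self, num_words, num_centers, num_hashes=9):
        self.k = num_centers
        self.num_words = num_words
        self.num_hashes = num_hashes
        self.words = [0] * num_words          # each entry models one k-bit word

    def _word_indexes(self, pkt_key):
        # One word index per hash function, derived from the packet contents.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{pkt_key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_words

    def insert(self, pkt_key, center_id):
        # Set the bit at offset center_id inside every hashed word.
        for w in self._word_indexes(pkt_key):
            self.words[w] |= 1 << center_id

    def lookup(self, pkt_key):
        # Bulk-read each hashed word and AND them; surviving bit positions are
        # the candidate centers (possibly several, due to false positives).
        acc = (1 << self.k) - 1
        for w in self._word_indexes(pkt_key):
            acc &= self.words[w]
        return [c for c in range(self.k) if acc & (1 << c)]

svbf = SVBF(num_words=1 << 16, num_centers=50)
svbf.insert("pkt-0xdead", center_id=7)
print(svbf.lookup("pkt-0xdead"))              # -> [7], plus any false positives
```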
Comparing PBF and SVBF
PBF
- Lookup is not easily parallelizable.
- Provisioning is hard since the number of packets per BF is not known a priori.
SVBF
+ A single Bloom filter is used.
+ Burst reads at the length of a word.
COMB [Hao10Infocom]
+ A single BF with groups of hash functions.
- More memory usage than SVBF, and burst reads are not possible.
Comparing Storage Needs
(for the same classification failure rate of 1% and 50 centers, k = 50)

Data structure | # of hash functions | Capacity (bits/entry) | Insertion | Lookup | Note
Hash table     | 1                   | 147                   | 1         | 1      | stores only the latency value (no label)
PBF            | 9                   | 12.8                  | 9         | 450    | provisioning is hard (12.8 only if cardinality is known beforehand)
COMB           | 7                   | 12.8                  | 14        | 77     | alternate combinations exist
SVBF           | 9                   | 12.8                  | 9         | 27     | burst reads; provisioning is easy
Tie-Breaking Heuristic
- Bloom filters have false positives, and a lookup searches across all centers, so multiple centers may return a match.
- The tie-breaking heuristic returns the group with the highest cardinality: a counter per center stores the number of packets that matched it (the cluster cardinality).
- This works well in practice, especially for skewed delay distributions. A minimal sketch follows.
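A minimal sketch of the tie-break, assuming the per-center packet counts are kept in a dictionary; the names and numbers are illustrative.

```python
def resolve_multi_match(candidate_centers, cluster_cardinality):
    # Return the candidate center whose cluster holds the most packets.
    return max(candidate_centers, key=lambda c: cluster_cardinality.get(c, 0))

print(resolve_multi_match([3, 7], {3: 120, 7: 4500}))  # -> 7
```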
Estimation Accuracy
[CDF of the absolute error (μs) of per-packet latency estimates.]
Accuracy of Aggregates
[CDF of the relative error of aggregate latency estimates.]
MAPLE Architecture (revisited)
[Figure: the query engine (2) at each router answers queries Q(P1) from the central monitor with A(P1).]
Query Interface
Assumption: the path of a packet is known.
- It can be determined from forwarding tables; in OpenFlow-enabled networks, the controller has this information.
Query answer: a latency estimate plus a type: (1) Match, (2) Multi-Match, (3) No-Match.
Query Bandwidth
Query method 1: query using a packet hash
- Hashed over invariant fields in the packet header.
- High query bandwidth for aggregate latency statistics (e.g., flow-level latencies).
Query method 2: query using the flow key and IP identifier (IPID)
- Supports range search to reduce the query bandwidth overhead.
- Inserts: hash on the flow key and IPID.
- Queries: send a flow key and ranges of contiguous IPIDs (a sketch follows the figure).
[Figure: an example query message for flow f1 with contiguous IPID blocks 1-5 and 20-35.]
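A minimal sketch of building the range part of such a query message from the IPIDs observed for one flow; the function name is illustrative.

```python
def ipid_ranges(ipids):
    # Compress the sorted IPIDs observed for one flow into contiguous
    # [first, last] blocks, so a query carries a flow key plus a few ranges
    # instead of one entry per packet.
    ranges = []
    for ipid in sorted(ipids):
        if ranges and ipid == ranges[-1][1] + 1:
            ranges[-1][1] = ipid
        else:
            ranges.append([ipid, ipid])
    return ranges

print(ipid_ranges([1, 2, 3, 4, 5, 20, 21, 35]))  # -> [[1, 5], [20, 21], [35, 35]]
```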
Query Bandwidth Compression
[CDF of the compression ratio: the median compression per flow reduces query bandwidth by 90%.]
Storage
OC-192 interface:
- About 5 million packets per second, i.e., roughly 60 Mbit/s of storage traffic; assuming 10% utilization, about 6 Mbit/s.
- DRAM (16 GB): about 40 minutes of packets.
- SSD (256 GB): about 10 hours, enough time for diagnosis.
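As a rough check of these numbers (assuming roughly 12.8 bits stored per packet, as in the storage comparison above): 5 M packets/s × 12.8 bits ≈ 64 Mbit/s, close to the quoted 60 Mbit/s. At that rate, 16 GB of DRAM (≈128 Gbit) holds about 2,100 s, i.e., 35-40 minutes of packets, and a 256 GB SSD about 9-10 hours.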
Summary
- RLI and LDA are ossified in their aggregation level.
- Proposed MAPLE as a mechanism to compute measurements across arbitrary sub-populations.
- MAPLE relies on clustering dominant delay values.
- A novel SVBF data structure reduces storage and lookup complexity.
Conclusion
- Many applications demand low latencies; network operators need high-fidelity tools for latency measurements.
- Proposed RLI for fine-grained per-flow measurements.
- Proposed MAPLE to store per-packet latencies in a scalable way and compose latency aggregates across arbitrary sub-populations.
- Many other solutions (papers on my web page).
Sponsors
CNS – 1054788: NSF CAREER: Towards a Knowledge Plane for Data Center Networks
CNS – 0831647: NSF NECO: Architectural Support for Fault Management
Cisco Systems: Designing Router Primitives for Monitoring Network Health