Hybrid network traffic engineering system
(HNTES)
Zhenzhen Yan, M. Veeraraghavan, Chris Tracy
University of Virginia / ESnet
June 23, 2011
Please send feedback/comments to: [email protected], [email protected], [email protected]
This work was carried out as part of a sponsored research project funded by the US DOE ASCR program office under grant DE-SC002350
Outline
• Problem statement
• Solution approach
– HNTES 1.0 and HNTES 2.0 (ongoing)
• ESnet-UVA collaborative work
• Future work: HNTES 3.0 and integrated network
Project web site: http://www.ece.virginia.edu/mv/research/DOE09/index.html
Problem statement
• A hybrid network is one that supports both IP-routed and circuit services on:
– separate networks, as in ESnet4, or
– an integrated network
• A hybrid network traffic engineering system (HNTES) is one that moves data flows between these two services as needed
– it engineers the traffic to use the service type appropriate to the traffic type
Two reasons for using circuits
1. Offer scientists rate-guaranteed connectivity
– necessary for low-latency/low-jitter applications such as remote instrument control
– provides low-variance throughput for file transfers
2. Isolate science flows from general-purpose flows
Reason                         Circuit scope
Rate-guaranteed connections    End-to-end (inter-domain)
Science flow isolation         Per provider (intra-domain)
Role of HNTES
• HNTES is a network management system; if proven, it would be deployed in networks that offer both IP-routed and circuit services
Outline
• Problem statement
• Solution approach
– Tasks executed by HNTES
– HNTES architecture
– HNTES 1.0 vs. HNTES 2.0
– HNTES 2.0 details
• ESnet-UVA collaborative work
• Future work: HNTES 3.0 and integrated network
Three tasks executed by HNTES
1. Heavy-hitter flow identification
2. Circuit provisioning
3. Flow redirection (PBR configuration)
(in HNTES 1.0, these tasks are executed online, upon flow arrival)
HNTES architecture
1. Offline flow analysis populates the MFDB
2. RCIM reads MFDB and programs routers to port mirror packets from MFDB flows
3. Router mirrors packets to FMM
4. FMM asks IDCIM to initiate circuit setup as soon as it receives packets from the router corresponding to one of the MFDB flows
5. IDCIM communicates with the IDC, which sets up the circuit and a PBR entry for flow redirection onto the newly established circuit
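The five steps above can be sketched as an event-driven loop. This is a minimal illustrative sketch, not the HNTES implementation: the function names (`fmm_on_mirrored_packet`, `idcim_setup_circuit`), the in-memory data structures, and the IP addresses are all hypothetical stand-ins for the FMM, IDCIM, and MFDB modules.

```python
# Hypothetical sketch of the HNTES 1.0 control flow (steps 3-5 above).

MFDB = {("10.0.0.1", "10.0.0.2")}   # monitored (src, dst) flows from offline analysis

circuits = {}                        # flow -> circuit id, filled via the IDC interface

def idcim_setup_circuit(flow):
    """Step 5: IDCIM asks the IDC to set up a circuit (and PBR entry)."""
    circuits[flow] = f"circuit-{len(circuits) + 1}"
    return circuits[flow]

def fmm_on_mirrored_packet(src, dst):
    """Steps 3-4: FMM receives a mirrored packet; if it belongs to an
    MFDB flow that has no circuit yet, trigger circuit setup."""
    flow = (src, dst)
    if flow in MFDB and flow not in circuits:
        return idcim_setup_circuit(flow)
    return circuits.get(flow)

# A mirrored packet from a monitored flow triggers setup exactly once.
print(fmm_on_mirrored_packet("10.0.0.1", "10.0.0.2"))  # circuit-1
print(fmm_on_mirrored_packet("10.0.0.1", "10.0.0.2"))  # circuit-1 (reused)
print(fmm_on_mirrored_packet("10.9.9.9", "10.0.0.2"))  # None (not in MFDB)
```

Note the idempotence: repeated mirrored packets from the same flow reuse the existing circuit, which matters because the router keeps mirroring packets while the roughly one-minute IDC setup is in progress.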
HNTES 1.0
Heavy-hitter flows
• Dimensions
– size (bytes): elephant and mice
– rate: cheetah and snail
– duration: tortoise and dragonfly
– burstiness: porcupine and stingray
Kun-chan Lan and John Heidemann, A measurement study of correlations of Internet flow characteristics. ACM Comput. Netw. 50, 1 (January 2006), 46-62.
HNTES 1.0 vs. HNTES 2.0

                                   HNTES 1.0                  HNTES 2.0
                                   (tested on ANI testbed)
Dimension of heavy-hitter flow     Duration                   Size
Circuit granularity                Circuit for each flow      Circuit carries multiple flows
Heavy-hitter flow identification   Online                     Offline
Circuit provisioning               Online                     Offline
Flow redirection (PBR config.)     Online                     Offline

HNTES 1.0 logic: the IDC circuit setup delay is about 1 minute, so circuits can be used only for long-DURATION flows; hence the focus on DYNAMIC (online) circuit setup.
Rationale for HNTES 2.0
• Why the change in focus?
– Size is the dominant dimension of heavy-hitter flows in ESnet
– Large-sized (elephant) flows have a negative impact on mice flows and on jitter-sensitive real-time audio/video flows
– Individual circuits do not need to be assigned to elephant flows
– The flow monitoring module is impractical if all data packets from heavy-hitter flows are mirrored to HNTES
HNTES 2.0 solution
• Task 1: offline algorithm for elephant flow identification; adds/deletes flows from the MFDB
• Nightly analysis of the MFDB for new flows (also offline)
– Task 2: IDCIM initiates provisioning of rate-unlimited static MPLS LSPs for new flows, if needed
– Task 3: RCIM configures PBR in routers for new flows
• HNTES 2.0 does not use the FMM

MFDB: Monitored Flow Database
IDCIM: IDC Interface Module
RCIM: Router Control Interface Module
FMM: Flow Monitoring Module
HNTES 2.0: use rate-unlimited static MPLS LSPs
• With rate-limited LSPs: if the PNNL router needs to send elephant flows to 50 other ESnet routers, the 10 GigE interface has to be shared among 50 LSPs
• A low per-LSP rate would decrease elephant flow file transfer throughput
• With rate-unlimited LSPs, science flows enjoy the full interface bandwidth
• Given the low arrival rate of science flows, the probability of two elephant flows simultaneously sharing link resources, though non-zero, is small; even when this happens, theoretically, each should receive a fair share
• No micromanagement of circuits per elephant flow
• Rate-unlimited virtual circuits are feasible with MPLS technology
• Removes the need to estimate circuit rate and duration
[Figure: PNNL-located ESnet PE router connected over 10 GigE to the PNWG-cr1 ESnet core router, with LSPs 1 through 50 to site PE routers]
HNTES 2.0 Monitored flow database (MFDBv2)

Flow analysis table:
Row number | Source IP address | Destination IP address | Is the source a data door? (0 or 1) | Is the destination a data door? (0 or 1) | Day 1 | Day 2 | ... | Day 30
(each per-day column holds the total transfer size; if on a day the total transfer size between the node pair is < 1 GB, 0 is listed)

Existing circuits table:
Row number | Ingress Router ID | Egress Router ID

Identified elephant flows table:
Row number | Source IP address | Destination IP address | Ingress Router ID | Egress Router ID | Circuit number
HNTES 2.0 Task 1: Flow analysis table

• Definition of "flow": source/destination IP address pair (ports are not used)
• Add the sizes for a flow from all flow records over one period, say one day
• Add flows with total size > threshold (e.g., 1 GB) to the flow analysis table
• Enter 0 if a flow's size on any day after it first appears is < threshold
• Enter NA for all days before the flow first appears with size above the threshold
• Sliding window: number of days
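The per-day aggregation behind the flow analysis table might be sketched as below. This is an illustrative sketch, not OFAT: the record format (a tuple of day, source IP, destination IP, byte count), the function names, and the example data are all assumptions; only the flow definition (IP pair, no ports) and the 1 GB daily threshold come from the slides.

```python
# Hypothetical sketch: sum bytes per (src, dst) pair per day, then
# record the daily total if it exceeds the threshold, else 0.
from collections import defaultdict

GB = 10**9
THRESHOLD = 1 * GB   # 1 GB daily threshold, as in the slides

def daily_totals(records):
    """records: iterable of (day, src_ip, dst_ip, bytes).
    A 'flow' is a source/destination IP pair; ports are ignored."""
    totals = defaultdict(int)            # (day, src, dst) -> bytes
    for day, src, dst, nbytes in records:
        totals[(day, src, dst)] += nbytes
    return totals

def flow_analysis_entries(records):
    """Per-day entry: total size if > threshold, else 0."""
    table = defaultdict(dict)            # (src, dst) -> {day: size or 0}
    for (day, src, dst), size in daily_totals(records).items():
        table[(src, dst)][day] = size if size > THRESHOLD else 0
    return table

records = [
    (1, "a", "b", 8 * GB), (1, "a", "b", 3 * GB),   # day 1: 11 GB total
    (2, "a", "b", int(0.4 * GB)),                   # day 2: below threshold
    (1, "c", "d", int(0.2 * GB)),
]
print(flow_analysis_entries(records)[("a", "b")])   # {1: 11000000000, 2: 0}
```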
HNTES 2.0 Task 1Identified elephant flows table
• Sort flows in flow analysis table by a metric• Metric: weighted sum of
– persistency measure– size measure
• Persistency measure: Percentage of days in which size is non-zero out of the days for which data is available
• Size measure: Average per-day size measure (for days in which data is available) divided by max value (among all flows)
• Set threshold for weighted sum metric and drop flows whose metric is smaller than threshold
• Limits number of rows in identified elephant flows table
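The ranking metric above can be sketched in a few lines. The weights (0.5/0.5) and the metric threshold (0.3) are illustrative placeholders, not values from the project; the flow names and per-day sizes are made up.

```python
# Hypothetical sketch of the weighted-sum ranking metric.

def persistency(day_sizes):
    """Fraction of observed days with non-zero size (0 = below 1 GB)."""
    return sum(1 for s in day_sizes if s > 0) / len(day_sizes)

def avg_size(day_sizes):
    """Average per-day size over the days for which data is available."""
    return sum(day_sizes) / len(day_sizes)

def rank_flows(flows, w_persist=0.5, w_size=0.5, metric_threshold=0.3):
    """flows: {flow_id: [per-day sizes]}. Returns flows whose weighted
    metric clears the threshold, highest metric first."""
    max_avg = max(avg_size(s) for s in flows.values())   # normalizer
    scored = {
        f: w_persist * persistency(s) + w_size * avg_size(s) / max_avg
        for f, s in flows.items()
    }
    keep = [(f, m) for f, m in scored.items() if m >= metric_threshold]
    return sorted(keep, key=lambda fm: fm[1], reverse=True)

# "A" transfers daily; "B" is big but rare; "C" is small and rare.
flows = {"A": [10, 10, 10, 10], "B": [40, 0, 0, 0], "C": [1, 0, 0, 0]}
print(rank_flows(flows))   # [('A', 1.0), ('B', 0.625)] -- C is dropped
```

The persistent daily flow ranks first even though the rare flow moved the same total bytes, which is the behavior the persistency term is there to produce.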
Sensitivity analysis
• Size threshold, e.g., 1 GB
• Period for summation of sizes, e.g., 1 day
• Sliding window, e.g., 30 days
• Value for the weighted sum metric
Is HNTES 2.0 sufficient?
• Will depend on the persistency measure
– if many new elephant flows appear each day, a complementary online solution is needed
• Online Flow Monitoring Module (FMM)
Outline
• Problem statement
• Solution approach
– HNTES 1.0 and HNTES 2.0 (ongoing)
• ESnet-UVA collaborative work
– Netflow data analysis
– Validation of Netflow-based size estimation
– Effect of elephant flows
  • SNMP measurements
  • OWAMP data analysis
– GridFTP transfer log data analysis
• Future work: HNTES 3.0 and integrated network
Netflow data analysis
• Zhenzhen Yan coded OFAT (Offline Flow Analysis Tool) and an R program for IP address anonymization
• Chris Tracy is executing OFAT on ESnet Netflow data and running the anonymization R program
• Chris will provide UVA the flow analysis table with anonymized IP addresses
• UVA will analyze the flow analysis table with R programs and create the identified elephant flows table
• If the persistency measure is high, the offline solution is suitable; if not, HNTES 3.0 and the FMM are needed!
Findings: NERSC-mr2, April 2011 (one month data)
Persistency measure = ratio of (number of days on which the flow size > 1 GB) to (number of days since the flow first appeared)
Total number of flows = 2281
Number of flows that had > 1 GB transfers every day = 83
Data doors
• Number of flows from NERSC data doors = 84 (3.7% of flows)
• Mean persistency ratio of data door flows = 0.237
• Mean persistency ratio of non-data door flows = 0.197
• The new-flows graph is right-skewed: is offline good enough? (just one month; more months' data analysis is needed)
• The persistency measure is also right-skewed: online may be needed
Validation of size estimation from Netflow data
• Hypothesis– Flow size from concatenated Netflow
records for one flow can be multiplied by 1000 (since the ESnet Netflow sampling rate is 1 in 1000 packets) to estimate actual flow size
23
Experimental setup
• GridFTP transfers of 100 MB, 1 GB, and 10 GB files
• sunn-cr1 and chic-cr1 Netflow data used
Chris Tracy set up this experiment
Flow size estimation experiments
• Workflow inner loop (executed 30 times):
– obtain the initial value of the firewall counters at the sunn-cr1 and chic-cr1 routers
– start a GridFTP transfer of a file of known size
– from GridFTP logs, determine the data connection TCP port numbers
– read the firewall counters at the end of the transfer
– wait 300 seconds for Netflow data to be exported
• Repeat the experiment 400 times for 100 MB, 1 GB, and 10 GB file sizes
Chris Tracy ran the experiments
Create log files
• Filter out GridFTP flows from the Netflow data
• For each transfer, find the packet counts and byte counts from all the flow records and add them
• Multiply by 1000 (1-in-1000 sampling rate)
• Output the byte and packet counts from the firewall counters
• Size-accuracy ratio = size computed from Netflow data divided by size computed from firewall counters
Chris Tracy wrote scripts to create these log files and gave UVA the files for analysis
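The scaling and ratio computation described above reduces to a few lines. This is a sketch of the arithmetic only, not Chris Tracy's scripts; the byte counts in the example are invented for illustration.

```python
# Hypothetical sketch of the size-accuracy computation: sampled Netflow
# byte counts for one transfer are summed across flow records, scaled by
# the 1-in-1000 packet sampling rate, and divided by the firewall-counter
# byte count for the same transfer.

SAMPLING_RATE = 1000   # ESnet Netflow samples 1 in 1000 packets

def netflow_size_estimate(record_bytes):
    """record_bytes: byte counts from all Netflow records of one flow."""
    return SAMPLING_RATE * sum(record_bytes)

def size_accuracy_ratio(record_bytes, firewall_bytes):
    """Ratio of the Netflow-based estimate to the firewall ground truth."""
    return netflow_size_estimate(record_bytes) / firewall_bytes

# A 1 GB transfer whose sampled records saw roughly 1/1000 of the bytes:
print(netflow_size_estimate([600_000, 410_000]))            # 1010000000
print(size_accuracy_ratio([600_000, 410_000], 10**9))       # 1.01
```

A ratio near 1 (as in the table on the next slide) supports the hypothesis that the 1000x scaling recovers the actual transfer size.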
Size-accuracy ratio
             Netflow records from         Netflow records from
             Chicago ESnet router         Sunnyvale ESnet router
             Mean      Std. dev.          Mean      Std. dev.
100 MB       0.949     0.2780             1.0812    0.3073
1 GB         0.996     0.1708             1.032     0.1653
10 GB        0.990     0.0368             0.999     0.0252

• Sample mean shows a size-accuracy ratio close to 1
• Standard deviation is smaller for larger files
• Dependence on traffic load
• Sample size = 50
Zhenzhen Yan analyzed log files
Outline
• Problem statement
• Solution approach
– HNTES 1.0 and HNTES 2.0 (ongoing)
• ESnet-UVA collaborative work
– Netflow data analysis
– Validation of Netflow-based size estimation
– Effect of elephant flows
  • SNMP measurements
  • OWAMP data analysis
– GridFTP log analysis
• Future work: HNTES 3.0 and integrated network
Effect of elephant flows on link loads
• SNMP link load, averaged over 30 sec
• Five 10 GB GridFTP transfers
• Dashed lines: rest of the traffic load

[Plots: SUNN-cr1 and CHIC-cr1 interface SNMP load; scale markers at 2.5 Gb/s, 10 Gb/s, and 1 minute]
Chris Tracy
OWAMP (one-way ping)
• One-Way Active Measurement Protocol (OWAMP)
– 9 OWAMP servers across Internet2 (72 pairs)
– system clocks are synchronized
– the "latency hosts" (nms-rlat) are dedicated to OWAMP only
– 20 packets per second on average (10 for IPv4, 10 for IPv6) for each OWAMP server pair
– raw data for 2 weeks obtained for all pairs
Study of "surges" (consecutive elevated OWAMP delays on a 1-minute basis)

• Steps:
– find the 10th-percentile delay b across the 2-week data set
– find the 10th-percentile delay i for each minute
– if i > n × b, the minute is considered a surge point (n = 1.1, 1.2, 1.5)
– consecutive surge points are combined into a single surge
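The surge-detection steps above can be sketched as follows. This is an illustrative sketch, not the study's analysis code: the percentile method, the per-minute list-of-samples input format, and the example delays are all assumptions.

```python
# Hypothetical sketch of percentile-based surge detection:
# baseline b = 10th percentile over the whole data set,
# i = per-minute 10th percentile, a minute is a surge point
# when i > n*b, and consecutive surge points merge into one surge.

def pct10(values):
    """10th percentile, taken as the sorted value at floor(0.1*(len-1))."""
    s = sorted(values)
    return s[int(0.1 * (len(s) - 1))]

def find_surges(per_minute_delays, n=1.2):
    """per_minute_delays: one list of delay samples per minute.
    Returns surges as (start_minute, end_minute), inclusive."""
    baseline = pct10([d for minute in per_minute_delays for d in minute])
    surge_minutes = [m for m, minute in enumerate(per_minute_delays)
                     if pct10(minute) > n * baseline]
    surges, start = [], None
    for m in surge_minutes:
        if start is None:
            start = prev = m
        elif m == prev + 1:           # consecutive surge point: extend
            prev = m
        else:                         # gap: close the current surge
            surges.append((start, prev))
            start = prev = m
    if start is not None:
        surges.append((start, prev))
    return surges

# Minutes 2-3 have elevated delays; they merge into a single surge.
delays = [[5.0] * 10, [5.1] * 10, [9.0] * 10, [8.0] * 10, [5.0] * 10]
print(find_surges(delays))   # [(2, 3)]
```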
Study of surges cont.
• Sample absolute values of 10th-percentile delays:

                          CHIC-LOSA  CHIC-KANS  KANS-HOUS  HOUS-LOSA  LOSA-SALT
10th percentile           29 ms      5 ms       6.7 ms     16.1 ms    7.3 ms
>1.1×(10th percentile)    31 ms      5.9 ms     7.3 ms     17.5 ms    8.5 ms
>1.2×(10th percentile)    34 ms      6.3 ms     8 ms       19 ms      9.5 ms
>1.5×(10th percentile)    NA         NA         NA         23.9 ms    11.6 ms
PDF of surge duration
• One surge lasted for 200 minutes
• The median surge duration is 34 minutes
95th percentile per minute
                           CHIC-LOSA  CHIC-KANS  KANS-HOUS  HOUS-LOSA  LOSA-SALT
10th percentile of 2 weeks 29 ms      5 ms       6.7 ms     16.1 ms    7.3 ms
>1.2×(10th percentile)     33 ms      6.4 ms     8 ms       18.7 ms    9.3 ms
>1.5×(10th percentile)     50 ms      8.1 ms     18.8 ms    23.9 ms    11.5 ms
>2×(10th percentile)       58 ms      11 ms      18.8 ms    40.7 ms    NA
>3×(10th percentile)       84 ms      17 ms      NA         53.8 ms    NA
Max of 95th percentile     119.8 ms   50.5 ms    NA         86.7 ms    NA

• The 95th-percentile delay per minute reached 4.13 (CHIC-LOSA), 10.1 (CHIC-KANS), and 5.4 (HOUS-LOSA) times the one-way propagation delay
Future work: determine cause(s) of surges
• Host (OWAMP server) issues?
– in addition to OWAMP pings, the OWAMP server pushes measurements to the Measurement Archive at IU
• Interference from BWCTL at the HP LAN switch within the PoP?
– correlate BWCTL logs with OWAMP delay surges
• Router buffer buildups due to elephant flows?
– correlate Netflow data with OWAMP delay surges
• If none of the above, then the surges are due to router buffer buildups resulting from multiple simultaneous mice flows
GridFTP data analysis findings
            Size (bytes)             Duration (sec)   Throughput
Minimum     100003680                0.25             1.2 Mbps
Median      104857600                2.5              348 Mbps
Maximum     96790814720 (≈ 90 GB)    9952             4.3 Gbps

• All GridFTP transfers larger than 100 MB from NERSC GridFTP servers: one month (Sept. 2010)
• Total number of transfers: 124236
• Data from GridFTP logs
Throughput of GridFTP transfers
• Total number of transfers: 124236
• Most transfers get about 50 MB/s, or 400 Mb/s
Variability in throughput for files of the same size
Throughput (bits/s)
Minimum       7.579e+08
1st quartile  1.251e+09
Median        1.499e+09
Mean          1.625e+09
3rd quartile  1.947e+09
Maximum       3.644e+09

• There were 145 file transfers of size 34359738368 bytes (approx. 34 GB)
• The IQR (inter-quartile range) measure of variance is 695 Mbps
• Need to determine the other endpoint and consider time
Outline
• Problem statement
• Solution approach
– HNTES 1.0 and HNTES 2.0 (ongoing)
• ESnet-UVA collaborative work
• Future work: HNTES 3.0 and integrated network
HNTES 3.0
• Online flow detection
– packet-header-based schemes
– payload-based schemes
– machine-learning schemes
• For ESnet
– data door IP address based: mirror 0-length (SYN) segments to trigger PBR entries (if a full mesh of LSPs exists) and LSP setup (if not a full mesh)
– PBR can be configured only after finding the other end's IP address (the data door is one end)
– "real-time" analysis of Netflow data
• Needs validation by examining patterns within each day
HNTES in an integrated network
• Set up two queues on each ESnet physical link, each rate-limited
• Two approaches:
1. Use different DSCP taggings
– general purpose: rate-limited at 20% of capacity
– science network: rate-limited at 80% of capacity
2. IP network + MPLS network
– general purpose: same as approach 1
– science network: full mesh of MPLS LSPs mapped to the 80% queue
Ack: Inder Monga
Comparison
• In the first solution, there is no easy way to achieve load balancing of science flows
• Second solution:
– MPLS LSPs are rate-unlimited
– use SNMP measurements to measure the load on each of these LSPs
– obtain the traffic matrix
– run an optimization to load-balance science flows by rerouting LSPs to use the whole topology
– science flows will enjoy higher throughput than in the first solution because the TE system can periodically readjust the routing of LSPs
Discuss integration with IDC
• IDC-established LSPs have rate policing at the ingress router
• Not suitable for HNTES-redirected science flows
• Add a third queue for this category
Discussion with Chin Guok
Summary
• HNTES 2.0 focus
– elephant (large-sized) flows
– offline detection
– rate-unlimited static MPLS LSPs
– offline setting of policy-based routes for flow redirection
• HNTES 3.0
– online PBR configuration
– requires a flow monitoring module to receive port-mirrored packets from routers and execute online flow redirection after identifying the other end
• HNTES operation in an integrated network