High statistics ping results


Created: May 14, 1999; last updated by Les Cottrell on February 27, 2000


    Introduction

To understand better how to interpret PingER results we decided to make a series of one-off high statistics ping measurements with shorter time frames than the normal PingER measurements, on both LAN and various WAN paths. The idea is to look at the frequency distributions and the time variations for various types of networks in the LAN and WAN environments, and to correlate the results with the topology, routes and known performance issues. Our goal is also to compare these results with results from other high statistics delay measurements.

    Unless otherwise noted, the pings were sent at one second intervals with a timeout

    of 20 seconds and a payload (including the 8 ICMP protocol bytes) of 100 bytes.
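For reference, a measurement with these parameters might be scripted along the following lines (a minimal Python sketch, assuming a Linux iputils ping; the target host is a placeholder and the flag spellings vary between ping implementations):

import re
import subprocess

HOST = "example.net"  # placeholder target, not one of the hosts studied here

# Flags assume Linux iputils ping: -c count, -i 1 second between pings,
# -W 20 second timeout, -s 92 data bytes, i.e. a 100 byte payload once
# the 8 ICMP protocol bytes are included.
proc = subprocess.run(
    ["ping", "-c", "1000", "-i", "1", "-W", "20", "-s", "92", HOST],
    capture_output=True, text=True)

# Pull the RTT in msec. out of each reply line, e.g. "... time=1.35 ms".
rtts = [float(m.group(1)) for m in re.finditer(r"time=([0-9.]+) ms", proc.stdout)]
if rtts:
    print(f"{len(rtts)} replies, min RTT = {min(rtts):.2f} msec.")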

    Pings between hosts at same site

To understand the behavior of ping at a single site we used the NIKHEF ping client (since it has a resolution of 1 usec.) running under Redhat 5.2 Linux on an Intel 400 MHz Pentium host (doris) at SLAC to ping various other hosts at SLAC separated

    from doris by various network devices with various interface speeds from 10 Mbps

    to 1 Gbps. Doris, the ping client, is connected to the network via a shared 10Mbps

    hub that is connected to a 10Mbps edge switch port. The table below shows the

    hosts pinged (i.e. acting as ping servers) together with their hardware and software

    configurations and the connection between doris and the server. The edge switches

    are Cisco Catalyst 5000s, the core switches are Cisco Catalyst 6500s, the farm and

    server switches are Cisco Catalyst 5500s, and the core routers are Cisco Catalyst

    8500s.

Server name | Server hardware | Server OS     | Server interface speed | Network connection devices & speeds
mercury     | Sun Ultra 5     | Solaris 5.6   | 10 Mbps HDX shared     | Same shared 10 Mbps hub
charon      | Sun Ultra 1     | Solaris 5.6   | 10 Mbps HDX shared     | 10 Mbps to edge switch (cgb3), 10 Mbps to doris


bronco001   | Sun Ultra 5     | Solaris 5.6   | 100 Mbps FDX switched  | 100 Mbps to farm switch, 1 Gbps to core switch, 1 Gbps to core router, 1 Gbps to core switch, 100 Mbps to edge switch, 10 Mbps to doris
mailbox     | Sun Ultra 5     | Solaris 5.6   | 100 Mbps FDX switched  | 100 Mbps to server switch, 1 Gbps to core switch, 1 Gbps to core router, 1 Gbps to core switch, 100 Mbps to edge switch, 10 Mbps to doris
grouse      | Sun Sparc 1+    | SunOS 4.1.3.1 | 10 Mbps HDX shared     | 10 Mbps to edge switch, 100 Mbps to core switch, 1 Gbps to core router, 1 Gbps to core switch, 100 Mbps to edge switch, 10 Mbps to doris

A simple model to understand the median or minimum ping response times for an unloaded local area network and lightly loaded hosts is to ignore the hubs (a hub inserts about 1 bit time delay) and the cable lengths (for a site with short cable runs the propagation delay is negligible). Note that some of the structure seen in the frequency distributions is an artifact of the logarithmically increasing bin widths (10 usec. for RTT >= 1 msec. and < 10 msec., 100 usec. for RTT >= 10 msec. and < 100 msec., and 1 msec. for RTT >= 100 msec.) and does not show up when using equally spaced linear RTT bins. The double peak in the frequency distribution for the two hosts on the same subnet is also a binning effect and does not show up when using linear bin widths. On the other hand, by measuring the wire-time difference between packets entering and leaving the server, the double peak seen in the low RTT "peak" of the distribution for the two hosts on the same shared hub is found to be caused by the ping server. For another example of a pathological RTT distribution caused by a ping server, see PingER Measurement Pathologies.
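As a rough illustration of the simple model above, the sketch below sums per-link store-and-forward serialization delays along the doris-to-grouse path from the table; the header byte counts and the store-and-forward assumption are illustrative assumptions rather than values from the text, and host and device processing times would add to the result.

# Illustrative store-and-forward model of the minimum RTT. Assumes a
# 100 byte ICMP payload plus 20 IP + 14 Ethernet header bytes, and that
# each device fully receives a frame before forwarding it.
FRAME_BITS = (100 + 20 + 14) * 8

# One direction of the doris -> grouse path, in Mbps per link (from the table).
LINKS_MBPS = [10, 100, 1000, 1000, 100, 10]

def one_way_us(frame_bits, links_mbps):
    # Serialization delay per link is frame_bits / rate; with the rate in
    # Mbps, bits / Mbps conveniently comes out in microseconds.
    return sum(frame_bits / mbps for mbps in links_mbps)

# Roughly 2 x one-way serialization, i.e. about 0.5 msec. for this path.
print(f"model minimum RTT ~ {2 * one_way_us(FRAME_BITS, LINKS_MBPS):.0f} usec.")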


    Pings between 2 hosts on the same shared 10 Mbps hub

    The following plots are for the Linux host to a Sun Ultra 5 (mercury) running

    Solaris 5.6. The first plot shows the ping RTT in msec. for about 260,000 100 byte

    pings started May 30, 1999.

The second plot shows the frequency histogram of the ping RTTs with log scales. From measurements of the "wire time" using NetXray running on a separate Windows NT host on the same hub as the ping server/responder host (mercury), we verified that the two peaks around log10(RTT) = 0.5 are a function of the ping server/responder host itself.

Pings between 2 hosts on the same subnet but different ports on the same switch

The following plots are from the Linux host to a Sun Ultra 1 (charon) running Solaris 5.6. The hosts are on the same Cisco Catalyst 5000 switch but on different 10 Mbps shared ports. The pings were started on May 30, 1999 at 12:38:20 PST and the packet loss was about 0.08%. The first plot shows the behavior of about 260,000 ping RTTs as a function of time.

The second plot shows the frequency distribution of the pings.

Pings between 2 hosts at the same site but on different subnets

The following plots are for 500,000 ping RTTs between a Linux Redhat 5.2 host (doris) and a Sun Sparc 1+ (grouse) running SunOS, and between the same Linux host and a Surveyor host running on a Pentium II under FreeBSD. The grouse pings were started on May 30, 1999. The two hosts are on separate subnets


    and are separated by 4 switches and a router. The first plot shows the time

    variation of the ping RTT.

The second plot is the frequency histogram of the ping RTT. The blue line shows the cumulative distribution function (CDF). The data was binned into 2 different bin widths to provide a reasonable number of counts in the higher RTT bins: 0.1 msec. bins are shown in magenta and extend out to 10 msec., and 1 msec. bins run from 10 to 100 msec. The counts in the 1 msec. wide bins are normalized to the 0.1 msec. wide bins by dividing the count in the 1 msec. bins by 10. A simple power series fit to the data between RTT 2.3 msec. and 61 msec. is also shown as a black line.

The distribution has a sharp peak with a median at 1.35 msec. and an Inter Quartile Range (IQR) of 0.2 msec. There is also a high RTT tail.
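The variable-width binning and normalization can be sketched as follows (assuming numpy and a file of RTT samples in msec., one per line; both are assumptions for illustration):

import numpy as np

rtt_ms = np.loadtxt("rtt_ms.txt")  # hypothetical input: one RTT (msec.) per line

fine_edges = np.arange(0.0, 10.0 + 0.1, 0.1)      # 0.1 msec. bins out to 10 msec.
coarse_edges = np.arange(10.0, 100.0 + 1.0, 1.0)  # 1 msec. bins from 10 to 100

fine_counts, _ = np.histogram(rtt_ms, bins=fine_edges)
coarse_counts, _ = np.histogram(rtt_ms, bins=coarse_edges)

# Dividing the 1 msec. counts by the ratio of bin widths (10) puts both
# histograms on the same per-0.1-msec. scale, as in the plot.
coarse_norm = coarse_counts / 10.0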

The third plot in this subsection shows the time variation of the ping RTT for 306,000 pings between the Linux host and the SLAC Surveyor host.


The final plot in this subsection shows the frequency distribution of the ping RTTs between the Linux host and the SLAC Surveyor host. The blue line shows the cumulative distribution function (CDF). The data is binned into 3 different bin widths. The black dots are for bins with a width of 0.1 msec. and are for RTT < 1 msec. The magenta dots are for bin widths of 1 msec. and are for RTTs < 10 msec. The green dots have bin widths of 10 msec. and cover the entire range of data. The binned data is normalized to the 0.1 msec. bins by dividing the counts in the 1 msec. bins by 10 and the counts in the 10 msec. bins by 100. The black line is a simple power series fit to the data between 2.3 msec. and 61 msec. inclusive.

The distribution exhibits a sharp peak with a median at 0.9 msec., an IQR of 0.06 msec., and a high RTT tail. There are also secondary peaks at 10 msec. and 2.4 msec.
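The quoted medians and IQRs can be reproduced from the raw samples along these lines (a sketch; the input file and the tail cut-off are assumptions, since the text does not define where the tail starts):

import numpy as np

rtt_ms = np.loadtxt("rtt_ms.txt")  # hypothetical input: one RTT (msec.) per line

q25, median, q75 = np.percentile(rtt_ms, [25, 50, 75])
iqr = q75 - q25
# Assumed tail definition: more than 3 IQRs above the upper quartile.
tail_fraction = np.mean(rtt_ms > q75 + 3 * iqr)
print(f"median = {median:.2f} msec., IQR = {iqr:.2f} msec., "
      f"tail fraction = {tail_fraction:.4%}")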

    Pings between 2 ESnet sites

ESnet sites have excellent connectivity, with low packet loss and a high speed, well-provisioned backbone that they connect to. Thus they provide an example of "how good it can get". The ESnet operations center is at LBNL, and SLAC is an ESnet site with, at the time of the measurements below, a T3 interconnect to the ESnet ATM backbone cloud. The SLAC link to ESnet was also lightly loaded, with peaks measured over 5 minutes only reaching about 50% utilization for the period of interest.

The ping distribution for an extensive (500K samples) measurement between a host at SLAC (minos.slac.stanford.edu) and a host at ESnet at LBNL (hershey.es.net) is seen below, starting at 9:01am on April 23, 1999 and ending at 3:59am on April 29, 1999. The pings were separated by 1 second and the timeout was 20 seconds. It can be seen that there is a narrow (IQR = 1 msec.) peak at 4 msec. with a very long tail extending out to beyond 750 msec. The black line is a fit to a power series with the parameters shown.


If one plots this data on a log-log plot (see below) then it can be seen that there are two time scales (4-18 msec. and 18-1000 msec.) with quite different behaviors. The bulk of the data (99.8%) falls in the 4-18 msec. region. In the 4-18 msec. region (the magenta points) the data falls off as y ~ A * RTT^-6.6, whereas beyond 18 msec. (the blue points) it falls off as y ~ B * RTT^-1.7. The parameters of the fits are shown in the chart. Note that in the 4-18 msec. region the data are histogrammed in 1 msec. bins, whereas beyond that they are histogrammed in 10 msec. bins, and the 2 y scales are adjusted appropriately (the one for the wider bins beyond 18 msec. is a factor of 10 greater than the other). The green points are not used in the fits and are the data histogrammed in 1 msec. bins for the range 19 msec. to 55 msec. The power law exponent in the 4-18 msec. region is that exhibited by very chaotic processes such as fully developed turbulence or the stock market, whereas the data beyond 18 msec. is more characteristic of heavy-tailed or long-range similarity behavior. A guess is that the transition at about 20 msec. reflects a change from delays caused by simple queueing to delays caused by router processing; this needs more work to substantiate.
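Such power law fits can be made by a linear least squares fit in log-log space, sketched below (the input file, the uniform 1 msec. binning and the least squares choice are assumptions; the original fitting details are not given in the text):

import numpy as np

rtt_ms = np.loadtxt("rtt_ms.txt")  # hypothetical input: one RTT (msec.) per line

edges = np.arange(4.0, 1000.0 + 1.0, 1.0)  # 1 msec. bins over 4-1000 msec.
counts, _ = np.histogram(rtt_ms, bins=edges)
centers = 0.5 * (edges[:-1] + edges[1:])

def fit_power_law(x, y):
    # y ~ A * x**(-k) is a straight line in log-log space, so fit
    # log10(y) = intercept + slope * log10(x) by least squares.
    mask = y > 0
    slope, intercept = np.polyfit(np.log10(x[mask]), np.log10(y[mask]), 1)
    return 10.0 ** intercept, -slope

low = centers < 18.0
A, k_low = fit_power_law(centers[low], counts[low])
B, k_high = fit_power_law(centers[~low], counts[~low])
print(f"exponents: -{k_low:.1f} (4-18 msec.), -{k_high:.1f} (beyond 18 msec.)")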


The autocorrelation function for the first 64,000 RTTs (there was no packet loss in this period) is shown below. It can be seen that in general there is a very weak correlation.
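The autocorrelation can be computed along the following lines (a sketch, assuming the RTT series is in a file with one sample per line; the estimator normalizes the autocovariance at each lag by the overall variance):

import numpy as np

rtt = np.loadtxt("rtt_ms.txt")[:64000]  # first 64,000 samples, as above

def autocorrelation(x, max_lag):
    # Normalized autocovariance at each lag, valid for an evenly spaced
    # series (here one ping per second with no losses).
    x = x - x.mean()
    var = np.dot(x, x)
    return np.array([np.dot(x[:x.size - lag], x[lag:]) / var
                     for lag in range(1, max_lag + 1)])

acf = autocorrelation(rtt, max_lag=100)
print(f"lag-1 autocorrelation = {acf[0]:.3f}")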

The pathchar information for the path from SLAC to hershey.es.net is shown below:

>pathchar -q 64 hershey.es.net
pathchar to hershey.es.net (198.128.1.11)
mtu limitted to 8192 bytes at local host
doing 64 probes at each of 64 to 8192 by 260
0 FLORA03.SLAC.Stanford.EDU (134.79.16.55)

    | 77 Mb/s, 462 us (1.77 ms)

    1 RTR-CGB5.SLAC.Stanford.EDU (134.79.19.3)

    | 294 Mb/s, 218 us (2.43 ms)

    2 RTR-CGB6.SLAC.Stanford.EDU (134.79.135.6)


    | 18 Mb/s, 276 us (6.53 ms)

    3 RTR-DMZ.SLAC.Stanford.EDU (134.79.111.4)

    | ?? b/s, -85 us (2.44 ms)

    4 ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.18)

    -> 192.68.191.18 (1)

| ?? b/s, 1.42 ms (5.13 ms)
5 lbl1-atms.es.net (134.55.24.11)

    | 245 Mb/s, 71 us (5.54 ms)

    6 esnet-lbl.es.net (134.55.23.66)

    | 9.7 Mb/s, 95 us (12.5 ms)

    7 hershey.es.net (198.128.1.11)

    7 hops, rtt 4.91 ms (12.5 ms), bottleneck 9.7 Mb/s, pipe 42418 byte

    Pings between International sites

CERN, at the time of the measurements below, shared an 8 Mbps link across the Atlantic with the World Health Organization, IN2P3 in France and Switch (the Swiss academic network). The shared trans-Atlantic link reached over 80% utilization for a 5 minute period during the measurement period and was normally the bottleneck. The loading on the link is seen below. The green represents the average (over 30 minutes) traffic to Switzerland and the blue is the average (over 30 minutes) traffic to the U.S. The dark green and magenta are the 5 minute maxima. The ping measurements below were for the consecutive days labelled Sun, Mon, Tue, Wed in the utilization graph below.

To better understand the behavior of ping Round Trip Time (RTT) in the WAN, we pinged CERN (ping.cern.ch) from SLAC (minos.slac.stanford.edu) every second with a timeout of 20 seconds for 260K pings between 8:36am Sunday May 9 and 10:35am Wednesday May 12, 1999 (PDT). The packet loss for these measurements was about 0.053%. The distribution of the RTT is seen in the chart below.


The distribution shows a lot of structure. First there is a sharp peak at about 224 msec. with a width of 9.5 msec. (90% of the peak is contained within this width). On the high RTT side of the peak several smaller peaks are seen, together with a long tail. If we look at the individual RTTs in the high RTT tail beyond 260 msec. then we get the chart shown below:

The clusters of points for Tuesday May 11 also show up in the Surveyor data, as shown in the graphs below:

Of particular interest is the cluster around 18:00 hours on Tuesday May 11. The ping

RTT and loss data for this period is shown in the chart below. The loss is calculated by looking for missing ping sequence numbers. The routes are obtained from Surveyor measurements, which use traceroute to measure the routes about every 15 minutes.
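That calculation can be sketched as follows (the sequence numbers below are synthetic, purely for illustration):

# Gaps in the echoed-back ICMP sequence numbers mark lost pings; a long
# run of consecutive misses marks a break in connectivity. seqs is a
# hypothetical sorted list of the sequence numbers that were echoed back.
def loss_and_breaks(seqs):
    sent = seqs[-1] - seqs[0] + 1
    breaks = [(prev + 1, cur - prev - 1)  # (first missing, run length)
              for prev, cur in zip(seqs, seqs[1:]) if cur - prev > 1]
    lost = sum(run for _, run in breaks)
    return lost / sent, breaks

# Example: a single 169-packet break, like the one at about 18:10 hours.
seqs = list(range(0, 1000)) + list(range(1169, 2000))
loss, breaks = loss_and_breaks(seqs)
print(f"loss = {loss:.2%}; breaks = {breaks}")  # one 169 second outage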


    There is a clear change in behavior starting at about 18:10 hours and stopping at

    about 19:20 hours. At the start of this period there is a loss of 169 consecutive ping

    packets (or a break in connectivity of 169 seconds, since the pings are sent at one

    second intervals, while the network routing converges to a new route), and at the

end a further loss of 36 consecutive ping packets. Apart from this period, the route (as measured by traceroute) to CERN is from SLAC to ESnet to the New York Sprint NAP, then to West Orange in New Jersey and thence back to Chicago to the STAR TAP and on to CERN. During the period from 18:10 hours to 19:20 hours, the route is from SLAC to ESnet to BBN, going via New York and London to Geneva; it is more congested (hence the increase in packet loss), but avoids the trip back from New Jersey to Chicago (and so saves an extra 30 msec. in the round trip). The complete routes can be seen below:


The ping RTT data for the cluster around 1:00am on May 11, 1999 can be seen in more detail in the chart below. In the chart it can be seen that there is a complete loss of connectivity (i.e. no pings were responded to) of about 14 minutes, starting at about 1:07am and lasting until about 1:21am. After this, performance looks fairly normal. Prior to the

    loss of connectivity, there are periods of longer RTT (almost double) followed by

    shorter losses of connectivity. For CERN to SLAC, Surveyor shows a change from

    the normal route at 1:00am and 1:15am returning to the normal route at 1:35am.

    For SLAC to CERN, Surveyor shows a change in route at 0:56am returning to the

    normal route at the next measurement at 1:23am. The alternate routes are limited

    to the SLAC site. This cluster is coincident with problems occurring as a result of


    making changes to a core switch at SLAC.

The cluster around 7:15am on May 11, 1999, shown in more detail below, is actually 3 sudden changes in RTT from about 220 msec. to about 525 msec. and back after 1 to 2 minutes, with top-hat-shaped RTT peaks at about 7:14am to 7:16am, 7:19am to 7:20am, and 7:23am to 7:24am. Surveyor traceroute samples did not coincide with any of these peaks and saw no route changes. Only one packet was lost in the period shown below. The black line is a moving average over 10 seconds, inserted to help the eye discern the top-hat peaks.
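Such a moving average is straightforward to compute; a sketch, assuming one RTT sample per second so that a 10-point window spans 10 seconds:

import numpy as np

rtt = np.loadtxt("rtt_ms.txt")  # hypothetical RTT series, one sample per second

# 10-point moving average smooths the series enough to make the
# top-hat steps stand out against the per-ping jitter.
window = 10
smoothed = np.convolve(rtt, np.ones(window) / window, mode="valid")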

Surveyor also does not indicate any route changes for the clusters around 14:00 hours on May 11, 1999, or around 15:00 hours or 18:30 hours on May 10, 1999.

    The pathchar information for the normal path from SLAC to CERN is shown

    below:

    >pathchar -q 64 ping.cern.ch

    pathchar to dxcoms.cern.ch (137.138.28.176)

    mtu limitted to 8192 bytes at local host

    doing 64 probes at each of 64 to 8192 by 260


0 FLORA03.SLAC.Stanford.EDU (134.79.16.55)
| 162 Mb/s, 369 us (1.14 ms)
1 RTR-CORE1.SLAC.Stanford.EDU (134.79.19.2)
| 115 Mb/s, 281 us (2.28 ms)
2 RTR-CGB6.SLAC.Stanford.EDU (134.79.159.12)
| 19 Mb/s, 242 us (6.29 ms)
3 RTR-DMZ.SLAC.Stanford.EDU (134.79.111.4)
| ?? b/s, -100 us (2.29 ms)
4 ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.18)
-> 192.68.191.18 (1)
| ?? b/s, 31.1 ms (64.4 ms)
5 nynap1-atms.es.net (134.55.24.9)
| 914 Mb/s, 118 us (64.7 ms)
6 1-sprint-nap.cw.net (192.157.69.11)
-> 192.157.69.11 (1)
| 1997 Mb/s, 1.72 ms (68.2 ms)
7 core4-hssi6-0-0.WestOrange.cw.net (204.70.10.225)
| 591 Mb/s, 9.52 ms (87.4 ms)
8 bordercore4.WillowSprings.cw.net (166.48.34.1)
-> 166.48.34.1 (2)
| 86 Mb/s, 1.13 ms (90.4 ms)
9 cern-cwe.WillowSprings.cw.net (166.48.34.6)
-> 166.48.34.6 (3)
| 130 Mb/s, 59.9 ms (211 ms)
10 cernh9-ar1-chicago.cern.ch (192.65.184.166)
-> 192.65.184.166 (2)
| ?? b/s, 356 us (211 ms)
11 cgate2.cern.ch (192.65.185.1)
| 2634 Mb/s, 135 us (211 ms)
12 cgate1-dmz.cern.ch (192.65.184.65)
-> 192.65.184.65 (3)
| 551 Mb/s, 327 us (212 ms)
13 r513-c-rci47-15-gb0.cern.ch (128.141.211.41)
-> 128.141.211.41 (1)
| 15 Mb/s, -225 us (216 ms)
14 dxcoms.cern.ch (137.138.28.176)
14 hops, rtt 210 ms (216 ms), bottleneck 15 Mb/s, pipe 425545 byte



Excerpted from High statistics ping results: http://www.slac.stanford.edu/comp/net/wan-mon/ping-hi-stat.html
