
Network Coordinates in the Wild

Jonathan Ledlie, Paul Gardner, and Margo Seltzer
Harvard School of Engineering and Applied Sciences and Aelitis
{jonathan,margo}@eecs.harvard.edu, [email protected]

Abstract

Network coordinates provide a mechanism for selecting and placing servers efficiently in a large distributed system. This approach works well as long as the coordinates continue to accurately reflect network topology. We conducted a long-term study of a subset of a million-plus node coordinate system and found that it exhibited some of the problems for which network coordinates are frequently criticized, for example, inaccuracy and fragility in the presence of violations of the triangle inequality. Fortunately, we show that several simple techniques remedy many of these problems. Using the Azureus BitTorrent network as our testbed, we show that live, large-scale network coordinate systems behave differently than their tame PlanetLab and simulation-based counterparts. We find higher relative errors, more triangle inequality violations, and higher churn. We present and evaluate a number of techniques that, when applied to Azureus, efficiently produce accurate and stable network coordinates.

1 Introduction

The performance of many Internet applications, such as distributed hash tables, web caches, and overlay networks, relies on accurate latency estimation between participants (e.g., [11, 29]). Researchers propose acquiring these measurements using various techniques, from proxy measurement [14, 28] to landmark binning [25] to decentralized network embeddings (e.g., [12, 21, 23, 26, 30]). In a network embedding, a subset of inter-node latency measurements is embedded into a low-dimensional metric space. Each node maintains a network coordinate, such that the metric distance between two coordinates in the abstract space predicts real-world latencies. This paper examines the performance of Internet-scale network embeddings through the study of a subset of a million-node live coordinate system.

Although network coordinates have attractive properties for latency prediction on the Internet, they have been criticized for requiring expensive maintenance and having prediction accuracy significantly worse than direct measurement methods such as Meridian [31]. At the very least, critics say that network coordinates are an unproven idea and unlikely to work in practice because Internet routing policies cause too many triangle inequality violations [33]. Supporters respond with claims that accuracies are reasonable (8–15%), and they have demonstrated that coordinate maintenance can be built on top of existing application communication. They support these claims with simulations and small-scale live deployments on PlanetLab [9, 17, 22, 26].

This paper provides the missing piece of the debate: data and analysis of a truly large-scale and long-running network coordinate system. The Azureus file-sharing network [1], which runs a million-node network coordinate system, is the main artifact for our analysis and experimentation. This work is the result of a collaboration between the Azureus team (Gardner) and a team from Harvard (Ledlie, Seltzer). Gardner contacted the Harvard team because Azureus was exhibiting some of the difficulties that Ledlie et al. had addressed in earlier work with a PlanetLab-based coordinate system [17]. We merged the techniques from Ledlie’s previous work into the test branch of the Azureus code, used by approximately ten thousand clients.

While our previous techniques did work “in the wild,” Azureus continued to experience unsatisfactorily high errors. This occurred because its gossip pattern stifled convergence: as all coordinate maintenance is “piggybacked” on other traffic, each coordinate became heavily skewed to small segments of the network and failed to become globally accurate. We created a simple new technique called neighbor decay that smoothly manages these skewed neighbor sets while retaining the appealing zero-maintenance property of Azureus’ coordinates. With these techniques in place, Azureus’ coordinates and, by inference, Internet-scale coordinate systems in general, can now tackle a basic goal: quickly and efficiently optimizing anycast decisions based on correct latency estimates. Because even with these approaches Internet-scale coordinates are still partially untamed, we isolated and analyzed a set of major remaining impediments.

The contributions of this work are:

• Improvements to the live Azureus coordinate system, producing a 43% improvement in accuracy and a four order-of-magnitude improvement in stability. The new coordinates optimize DHT traversal, helping the application pick physically close nodes; this trims lookup delay by 33% compared to the most direct logical path.

• A new technique for managing neighbors in coordinate systems where all gossip is “piggybacked” on existing traffic — i.e. where there are zero maintenance messages.

• A new, large-scale latency matrix providing a valuable new portal into Internet behavior. Previous large matrices were between DNS servers and did not capture latencies between actual nodes [9, 31].

• Evidence why Internet-scale latency estimation with coordinates works. We find the intrinsic dimensionality of large-scale systems to be less than previous work, which studied smaller networks [30], and we show why the world flattens into near-planar Euclidean coordinates.

• Analysis of five major barriers to accuracy: churn, drift, intrinsic error, corruption, and latency variance. We present techniques for lowering these barriers and show how latency variance requires a fundamentally new approach to latency prediction.

In Section 2, we explain why practitioners, such as the Azureus developers, use network coordinates in large-scale deployments and review Azureus’ network coordinate algorithm. In Section 3, we use a dense latency matrix to analyze the characteristics of Azureus’ latency distribution, determining its intrinsic dimensionality and the extent of its triangle inequality violations. In Section 4, we describe three techniques integrated into the Azureus code. In Section 5, we review metrics for evaluating coordinate systems. In Section 6, we examine the live performance of Azureus through three methods: (a) Azureus clients we ran on PlanetLab, (b) crawling instrumented clients run by approximately ten thousand Azureus users, and (c) an application-level benchmark: using coordinates to optimize DHT hop selection. In Section 7, we examine five primary causes of the remaining difference between the current live accuracy and what appears to be achievable based on simulation results. In Section 8, we review the approaches for estimating latencies in large distributed systems. In Section 9, we conclude.

2 Background

Azureus is currently one of the most popular clients for BitTorrent, a file sharing protocol [7]. For a given file, the protocol embodies four main roles: an initial seeder, new seeders, a tracker, and peers. Initial seeders, new seeders, and peers are all transient clients; trackers are typically web servers. The initial seeder is the source of the file. It divides the file into small pieces, creates a metadata description of the file, and sends this description to the tracker. Peers discover this file description through some out-of-band mechanism (e.g., a web page) and then begin looking for pieces of the file. Peers contact the tracker to bootstrap their knowledge of other peers and seeds. The tracker returns randomized subsets of other peers and seeds. Initially, only the initial seeder has pieces, but soon peers are able to exchange missing pieces with each other, typically using a tit-for-tat scheme. Once a peer acquires all of the pieces for a file, it becomes a new seeder. This collection of clients actively sharing a file is called a swarm. In Azureus, file descriptors and other metadata are stored in a DHT, in which all clients participate, and any node can be assigned the role of tracker if it is or is near the root of the hash of a given file’s descriptor. In practice, there can be many possible trackers from which to choose for a particular file and even more possible clients for a given piece. A key challenge in efficiently implementing this protocol is providing a simple method for node selection, an example of anycast.

Distributed systems developers are beginning to use network coordinates as a mechanism to support anycast. The Azureus developers use them for two distinct purposes: (a) to optimize DHT traversal and (b) to select nearby nodes for application-level congestion monitoring. We are currently testing another coordinate-based optimization: biasing the set of nodes the tracker returns to be nearby the caller. Bindal et al. show in simulation how these locally-biased swarms reduce download times and inter-ISP traffic [3]. Future plans call for using network coordinates to optimize media streaming over Azureus.

We worked with the Azureus developers to analyze and improve the coordinates maintained by their system, which contains more than a million clients. We were able to modify the Azureus code internals and watch its behavior on a subset of the network because approximately ten thousand Azureus users run a plugin that automatically upgrades their version to the latest CVS release. According to the Azureus developers, the clients who use the latest release exhibit normal user characteristics, so we expect that our results generalize to the larger system.

2.1 Vivaldi

Azureus uses the Vivaldi network coordinate update algorithm [9]. The Vivaldi algorithm calculates coordinates as the solution to a spring relaxation problem. The measured latencies between nodes are modeled as the extensions of springs between massless bodies. A network embedding with a minimum error is found as the low-energy state of the spring system.

Figure 1 shows how a new observation, consisting of a remote node’s coordinate x_j, its confidence w_j, and a latency measurement l_ij between the two nodes, i and j, is used to update a local coordinate. The confidence, w_i, quantifies how accurate a coordinate is believed to be. Note that confidence increases as it approaches 0.


VIVALDI(x_j, w_j, l_ij)
1  w_s = w_i / (w_i + w_j)
2  ε = | ‖x_i − x_j‖ − l_ij | / l_ij
3  α = c_e × w_s
4  w_i = (α × ε) + ((1 − α) × w_i)
5  δ = c_c × w_s
6  x_i = x_i + δ × (‖x_i − x_j‖ − l_ij) × u(x_i − x_j)

Figure 1: Vivaldi update algorithm.

The algorithm first calculates the sample confidence w_s (Line 1) and the relative error ε (Line 2). The relative error ε expresses the accuracy of the coordinate in comparison to the true network latency. Second, node i updates its confidence w_i with an exponentially-weighted moving average (EWMA) (Line 4). The weight α for the EWMA is set according to the sample confidence w_s (Line 3). Also based on the sample confidence, δ dampens the change applied to the coordinate (Line 5). As a final step, the coordinate is updated in Line 6 (u is the unit vector). Constants c_e and c_c affect the maximum impact an observation can have on the confidence and the coordinate, respectively.
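A minimal sketch of this update step in Python may make the figure concrete. The values chosen for c_e and c_c are assumptions for illustration, and we read u(x_i − x_j) as the unit vector pointing from x_i toward x_j, so that an embedded distance larger than the measured latency pulls the nodes together:

```python
import math

CE, CC = 0.25, 0.25  # assumed caps on confidence/coordinate movement

def vivaldi_update(xi, wi, xj, wj, lij):
    """One Vivaldi step: returns node i's updated coordinate and confidence.

    xi, xj: equal-length coordinate vectors; wi, wj: confidences
    (lower is better); lij: measured RTT in ms.
    """
    dist = math.dist(xi, xj)                 # ||x_i - x_j||
    ws = wi / (wi + wj)                      # sample confidence (Line 1)
    eps = abs(dist - lij) / lij              # relative error (Line 2)
    alpha = CE * ws                          # EWMA weight (Line 3)
    wi = alpha * eps + (1 - alpha) * wi      # confidence update (Line 4)
    delta = CC * ws                          # damping factor (Line 5)
    if dist == 0:                            # coincident nodes: no defined direction
        return xi, wi
    u = [(b - a) / dist for a, b in zip(xi, xj)]  # unit vector from x_i toward x_j
    xi = [a + delta * (dist - lij) * ud for a, ud in zip(xi, u)]  # Line 6
    return xi, wi
```

Movement is proportional both to the size of the prediction error and to how confident the remote node is relative to the local one.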

Height is an alternative to a purely Euclidean distance metric. With height, the distance between nodes is measured as their Euclidean distance plus a height “above” the hypercube that models the latency penalty of network access links, such as DSL lines [9].

Each node successively refines its coordinate through periodic updates with other nodes in its neighbor set. In Azureus, the information used to maintain the network coordinate system is entirely piggybacked on existing messages, such as routing table heartbeats. While this does mean the coordinates induce no additional overhead (beyond 24 bytes per message for four dimensions, height, and confidence), it also means that the algorithm needed to be modified to function passively. In Section 4.2, we describe a technique we developed to incorporate information from highly transient neighbors.

3 Latencies in the Wild

Before we examine the accuracy with which Internet-scale latencies can be embedded into a coordinate space, we compare latencies in Azureus to those in other networks to gain insight into the causes of error in Internet-scale embeddings. We generate a dense latency matrix of a subset of Azureus and compare it to PlanetLab and to the MIT King data set, a square matrix containing the median latencies between 1740 DNS servers collected using the King method [9, 14]. Researchers found PlanetLab and MIT King can be reduced to low dimensional coordinates with ≤ 10% median error [9, 17]. We examine three characteristics: inter-node round trip times, violations of the triangle inequality, and intrinsic dimensionality.

Figure 2: A comparison of round-trip times shows that Azureus spreads across a range one order-of-magnitude larger than MIT King, based on inter-DNS latencies. This larger spread tends to lead to lower accuracy embeddings.

3.1 Collection

We instrumented clients that we ran on PlanetLab to record the application-level latency between them and the rest of the network, creating a dense latency matrix. These clients ran on 283 PlanetLab nodes for 24 days starting on July 19th 2006, collecting 9.5 × 10^7 latency measurements to 156,658 Azureus nodes. To reduce these raw measurements into a dense latency matrix, we used the following process: first, we summarized each edge with the median round trip time for this edge, discarding edges with fewer than a minimum number of samples (4); second, we discarded all nodes that had fewer than half of the maximum number of edges (280). This process resulted in a 249 × 2902 matrix with 91% density, where 83% of the entries were the median of at least ten samples. We derived the PlanetLab data set from the Azureus matrix by simply selecting out its subset of hosts.

3.2 Round Trip Times

In Figure 2, we illustrate the distribution of inter-node round trip times between nodes in the three data sets. The King measurements were limited to a maximum of 800ms. The data exhibit one important characteristic: spread. The application-level, Azureus round trip times spread across four orders-of-magnitude, while the inter-DNS, King data set spreads across three. In theory, this is not a harbinger of higher embedding error; in practice, however, as Hong et al. have shown, the error between nodes whose distance is near the middle of the latency distribution tends to be the lowest [32]: with longer tails to this distribution, there are more edges to be inaccurate. (We found ICMP measurements exhibit a similarly wide distribution; see § 7.5.) This wide spread is a warning sign that Azureus will have higher error than a system with a narrower round trip time distribution.

Page 4: Network Coordinates in the Wild

Figure 3: In all three data sets, over half of all node pairs fail the Tang/Crovella triangle inequality test, because there exists a third node between the nodes in the pair that produces a shorter path than the direct path between the two nodes. A large fraction of these violating pairs have paths that are significantly faster.

3.3 Violations of the Triangle Inequality

Network coordinate embeddings that use Euclidean distances make the assumption that the triangle inequality is not violated to a great extent by a large fraction of pairs of nodes. The triangle inequality states that for any triangle the length of a given side must be less than the sum of the other two sides but greater than the difference between the two sides, i.e., the sides must be able to form a triangle. When the latencies between node triples cannot form a triangle, they are said to violate the triangle inequality. Nodes with large and frequent violations tend to be the ones with the largest individual prediction error, and their existence decreases overall accuracy (see [17] and § 7.3).

We use a method from Tang and Crovella to examine the severity of triangle inequality violations [30]. This method normalizes the severity of each violation, permitting an all-pairs comparison. For each node pair, we find the shortest path between the two that passes through a third node. Thus, for all pairs of nodes i and j, we find the best alternative path through a node k and normalize by the latency between i and j:

    rpl = min_k ( (d(i, k) + d(k, j)) / d(i, j) )

Figure 3 illustrates the cumulative distribution of this quantity, the relative path length. Note that any fraction below 1 is a violation: there exists a path through an alternative node that is faster than the direct path. 83% of the Azureus pairs, 85% of MIT King, and 68% of the PlanetLab subset violate the triangle inequality. In contrast to earlier work that examined several small-scale data sets [30], we find the fraction of pairs with the largest violations to be quite large: Tang and Crovella found only 10% of nodes had an alternative path that is ≥ 20% faster; here 37% of Azureus pairs and 22% of MIT King pairs exhibit this large level of violation.

Figure 4: Scree plots suggest the inherent dimensionality of MIT King, PlanetLab, and Azureus datasets is small. Two synthetic matrices of five and ten dimensions are included for comparison.

We examined the cause of the large fraction of pairs with very low rpl (< 0.1) in Azureus. We found that only a few nodes were members of many of these low rpl pairs. What distinguished these nodes — and what was the cause of their frequent participation in triangle inequality violations — was that their delay to non-PlanetLab nodes was atypically large, on the order of seconds, while their delay to other PlanetLab nodes remained typical (less than a second). In effect, this extended one side of the triangles these nodes participated in: d(i, j) became large while d(i, k) and d(k, j) did not. Because PlanetLab nodes that exhibited this behavior were co-located, we conjecture that the Azureus traffic to non-PlanetLab sites was being artificially limited at site gateways, while traffic to PlanetLab nodes avoided this traffic shaping. Rather than being a construct of the PlanetLab environment, this effect, leading to bi- or multi-modal latency distributions, will be the norm for at least some participants in Internet-scale applications that use well-known ports and consume a large amount of bandwidth, such as Azureus, because some sites will limit traffic and some will not. Like the round trip time spread, Azureus’ violations foreshadow a higher embedding error.
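Computing the relative path length over a latency matrix is straightforward; a sketch, assuming a dense square NumPy array d of pairwise RTTs (our actual matrix required the filling steps described in § 3.4):

```python
import numpy as np

def relative_path_lengths(d):
    """rpl[i, j] = min over k (k != i, j) of (d[i,k] + d[k,j]) / d[i,j].

    d: dense n x n matrix of round-trip times. Entries below 1
    indicate triangle inequality violations.
    """
    n = d.shape[0]
    rpl = np.full((n, n), np.nan)
    for i in range(n):
        detour = d[i][:, None] + d            # detour[k, j] = d[i,k] + d[k,j]
        detour[i, :] = np.inf                 # disallow k == i
        detour[np.diag_indices(n)] = np.inf   # disallow k == j
        best = detour.min(axis=0)             # best alternative path to each j
        rpl[i] = np.divide(best, d[i], out=np.full(n, np.nan), where=d[i] > 0)
    return rpl
```

The fraction of entries with rpl < 0.8 corresponds to the "≥ 20% faster" detours cited above.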

3.4 Dimensionality

Network coordinates would be less useful if a large number of dimensions were needed to capture the inter-node latencies of the Internet. Tang and Crovella used Principal Component Analysis (PCA) to hint at the number of dimensions required to encompass this information for several small data sets [30]. Because we wanted to know if few dimensions would be sufficient for a large, broad spectrum of endpoints, we used the same method to examine the intrinsic dimensionality of Azureus.

PCA is a linear transformation from one coordinate system to a new, orthogonal coordinate system. The new system is chosen such that each subsequent axis captures the maximum possible remaining variance in projections from points in the old system to points in the new: the first new axis captures the most variance, the second less, and so on. While an input system of k elements will produce an output system also of k elements, often only the first several dimensions of the output system will summarize all or part of the same distance information of the original set of points. Singular values are a result of the PCA transformation: each new axis has a corresponding singular value that describes the amount of variance captured by this axis. Thus, if a singular value is very small or zero, this suggests that this axis is unnecessary in describing the variance in a particular data set.

Figure 5: Intercontinental Latency Distributions illustrate why a Euclidean distance metric works for network coordinates on the Internet: messages from Asia to Europe (and from Europe to Asia) go through North America.

Because PCA requires a full matrix, we first used the following two techniques to fill in the remaining 9% of the Azureus matrix and the missing 0.4% of the MIT matrix. We filled half of the missing Azureus values with the King technique [14] (King fails in certain cases, e.g., when the endpoint cannot be resolved). We interpolated the remaining values in both matrices by embedding each matrix and extracting the missing values.

We use a scree plot to illustrate how much variance each new singular value is capturing, which in turn hints at the inherent dimensionality of the underlying data set. The independent variables of a scree plot are the singular values, sorted by their magnitude; the dependent variables are their corresponding magnitudes. At the point where the magnitude of the singular values becomes zero or nearly zero, the relative importance of this and subsequent singular values (i.e., dimensions) is low. Up to this point, these dimensions are necessary to capture the values in the original input matrix, which in this case is made up of inter-node latency values.
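The scree values fall directly out of a singular value decomposition; a minimal sketch, assuming a dense NumPy latency matrix d:

```python
import numpy as np

def scree_values(d, k=12):
    """First k singular values of the column-centered latency matrix,
    normalized by the largest, ready for a scree plot."""
    centered = d - d.mean(axis=0)                  # center columns, as PCA requires
    s = np.linalg.svd(centered, compute_uv=False)  # sorted, descending
    return s[:k] / s[0]
```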

We show the normalized singular values for the King, PlanetLab, and Azureus data sets in Figure 4. For comparison, we created synthetic 5d and 10d systems each containing 250 random points in a unit hypercube and found their singular values. As one would expect, the synthetic 5d and 10d data sets show a sharp knee soon after 5 and 10 singular values, respectively. In contrast, the bulk of the inter-node latency information from the two Internet-based data sets requires very few dimensions. Azureus, in particular, is dominated by a single dimension, and MIT King by two. However, the next several dimensions remain significant for the few nodes that need to navigate around the clusters of nodes that have found good positions. In the data, this is shown by the continued relevance of singular values when compared to the synthetic data sets. To lower the error for these nodes, we find 4–5 dimensions is appropriate for Internet-scale network coordinates. While the previous two characteristics, round trip times and violations of the triangle inequality, suggest that the Azureus latency distribution will experience higher error than MIT King, its intrinsic dimensionality does not appear to be an additional impediment.

3.5 Intercontinental Latency Distributions

While the Azureus data set is clearly of low dimensionality, a more concrete way to examine the “flatness” of this large-scale network is to look at its intercontinental latency distribution. In a way, it is surprising that embedding latencies found on a globe (the Earth) into a Euclidean space works at all. If messages could be routed in any direction on the Earth’s surface, using a Euclidean metric would be a poor choice. Previous work on spherical coordinates, however, found they had significantly larger error than Euclidean ones [9]. Anecdotal evidence suggested that the main reason why the Internet embeds into a low dimensional Euclidean space is because the world is flat: traffic between Asia and Europe flows through North America [9].

An examination of our Azureus data set confirms that this traffic flow is indeed the case. We mapped the IP addresses in the data set to countries through their autonomous system record and, in turn, mapped these countries to continents. As Figure 5 illustrates, no messages from Asia to Europe were faster than those from Asia to North America; the same holds in the other direction. All paths between Asia and Europe appear to travel in a line across two oceans. This trend continues until the speed of the connection to ISPs or other coarse delays begin to dominate.


This flatness suggests why hyperbolic coordinates [27] also work well: North America maps to the center of the hyperbolic space. Thus, because the distribution of latencies is “flat” – at least at a high level – using a Euclidean metric is sufficient. In the future, new direct transmission lines between Europe and Asia may change the Internet’s shape, perhaps driving a shift to spherical coordinates.

4 Taming Live Coordinate Systems

From our experience tuning a network coordinate system on PlanetLab, we developed two techniques that lead to more stable and accurate coordinates on a small “live” system [17]. The Azureus and Harvard teams worked together to integrate these techniques into the Azureus code. After confirming that these techniques worked as expected, we found and resolved a new problem: skewed neighbor sets. This problem particularly disrupts large-scale, live coordinate systems like Azureus that rely solely on other application communication for maintenance; this zero-maintenance property has been suggested as a goal for coordinate systems [9]. Through experimentation with these techniques in simulation and periodic measurement of the live system, we arrived at coordinates that are not perfect, but are a satisfactory start. We include a review of the techniques we developed as part of our previous research (§ 4.1) and describe our new technique, neighbor decay (§ 4.2).

4.1 Latency and Update Filters

In previous work, we developed two simple filters that had distinct beneficial effects on a coordinate system running on PlanetLab [17]. The first type, which we call a latency filter, takes the stream of latency measurements from a remote node and turns these into an expected latency value. For a stream of measurements between nodes i and j, the goal of the latency filter is to summarize the measurements, providing a current and stable description of the expected latency between i and j. Two main considerations affect the value E[rtt(i, j)]. First, anomalous measurements, sometimes several orders-of-magnitude larger than the baseline, would appear in the stream of measurements. For example, we would measure a round-trip time of 1000ms when typical measurements were 200ms. Although we were using application-level UDP measurements, we found these anomalies also occurred with ICMP. Second, the expected value could not be fixed at a single value. Due to congestion and BGP changes, the underlying latency between pairs of nodes changes. We found that using a simple, short, moving median worked as a latency filter, compensating for both anomalous measurements and plateau shifts.
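A minimal sketch of such a moving-median latency filter; the window size of five samples is an assumption (the paper says only "short"):

```python
from collections import deque
from statistics import median

class LatencyFilter:
    """Moving median over recent RTT samples: absorbs one-off spikes
    while still tracking plateau shifts in the underlying latency."""

    def __init__(self, window=5):          # window size is assumed
        self.samples = deque(maxlen=window)

    def update(self, rtt_ms):
        """Record a raw measurement; return the expected RTT E[rtt(i, j)]."""
        self.samples.append(rtt_ms)
        return median(self.samples)
```

A single 1000 ms anomaly amid 200 ms samples leaves the median at 200 ms, while several consecutive samples at a new plateau shift it to the new level.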

The second type of filter we developed on PlanetLab focuses on making coordinates more stable, not more accurate. These update filters tackle a problem shared across many types of applications that use network coordinates: discerning when a coordinate has changed “enough” to potentially necessitate an application-level reaction (e.g., a service migration). In an early application we developed that used network coordinates [24], we found it was hard for the application to immediately determine if it should react to coordinate updates, which were occurring several times per minute. A single threshold (“react if moved more than 50ms”) did not work for all nodes because the volume through which each coordinate moved was node-dependent. We developed a generic filtering technique to allow applications to easily determine when to update coordinates. Applications that find all updates useful can bypass the filters.

Update filters make the distinction between constantly evolving “system-level” coordinates and stable “application-level” coordinates, providing a barrier between these two: system-level coordinates fine tune the coordinate further with each measurement, while application-level coordinates change only when the underlying coordinate has undergone a significant migration to a new location relative to other coordinates. In our previous work, we examined several heuristics for distinguishing between a system-level coordinate that was moving around a single point (not requiring application-level notification) and one that had migrated to a new location (potentially requiring application activity). We found heuristics that compare windows of previous system-level coordinates to one another, especially those that augment this comparison with distances to other nodes in the system, perform well. Applications can tune how much these windows may differ before being notified.
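One plausible rendition of the window-comparison heuristic, with the window size and threshold chosen for illustration rather than taken from the paper:

```python
import math

def centroid(coords):
    """Component-wise mean of a list of coordinate vectors."""
    n = len(coords)
    return [sum(c[d] for c in coords) / n for d in range(len(coords[0]))]

class UpdateFilter:
    """Expose a stable application-level coordinate that changes only
    when successive windows of system-level coordinates differ enough."""

    def __init__(self, window=10, threshold_ms=50.0):  # assumed values
        self.window, self.threshold = window, threshold_ms
        self.prev, self.cur = None, []
        self.app_coord = None

    def observe(self, sys_coord):
        """Feed one system-level update; return the application-level coordinate."""
        self.cur.append(sys_coord)
        if len(self.cur) == self.window:
            c = centroid(self.cur)
            if self.prev is None or math.dist(c, self.prev) > self.threshold:
                self.app_coord = c   # significant migration: notify the application
            self.prev, self.cur = c, []
        return self.app_coord
```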

4.2 Neighbor Decay

Researchers have posited that a network coordinate subsystem could become a useful component of numerous large-scale distributed applications, particularly if it could perform its job passively, that is, without generating any extra traffic. In our Azureus implementation, this passivity was forced upon us: we had no control over the selection of which nodes we gossipped with or when we gossipped with them, because the information necessary for a coordinate update was piggybacked on to other application-level messages, e.g., DHT routing table maintenance. Due to this passivity and to churn, nodes did not have fixed sets of neighbors with which they could expect regular exchanges. In fact, nodes would frequently receive 1–3 updates from a remote node as that node was being tested for entry into the routing table and then never hear from that node again. The net effect of these limited exchanges was that each node’s “working set” was much smaller than the number of nodes with which it actually communicated. Nodes were having blips of communication with many nodes, but constant communication with few. The goal of neighbor decay is to expand the size of the working set, which in turn improves accuracy.

A standard, gossip-based coordinate update involves taking new information from a single remote node and optimizing the local coordinate with respect to that node. If some set of remote nodes is sampled at approximately the same frequency, a node’s coordinate will become optimized with respect to these remote coordinates (which are in turn performing the same process with their neighbors). However, if some remote nodes are sampled at a far greater frequency than others, the local coordinate optimization process will become skewed toward these nodes. In the theoretical limit, the result would be the same, but in practice, these skewed updates – a problem that could be expected in any passive implementation – slow the global optimization process.

Our solution to the problem of skewed neighbor updates is simple. Instead of refining our coordinate with respect to the remote node from which we just received new information, we refine it with respect to all nodes from which we have recently received an update. To normalize the sum of the forces of this recent neighbor set, we scale the force of each neighbor by its age: older information receives less weight. This allows nodes that we hear from only a few times to have a lasting, smooth effect on our coordinate. Algorithmically, we set the effect of a neighbor j on the aggregate force F to be:

    F = F + F_j × (a_max − a_j) / Σ (a_max − a)

where a_j is the age of our knowledge of j and a_max is the age of the oldest neighbor.

This use of an expanded neighbor set that decays slowly over time has two main benefits. First, because the force from each update is effectively sliced up and distributed over time, nodes’ coordinates do not jump to locations where they have high error with respect to other members of the neighbor set. Second, by keeping track of recent, but not old, neighbors, neighbor decay acts to increase the effective size of the neighbor set, which in turn leads to higher global accuracy. In our implementation, nodes expired from the recent neighbor set after 30 minutes.
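A sketch of the aggregation step. Each neighbor record is assumed to hold a coordinate, the filtered RTT, and the time it was last heard from; the per-neighbor spring force follows Line 6 of Figure 1:

```python
import math
import time

EXPIRY_S = 30 * 60   # recent neighbors expire after 30 minutes

def vivaldi_force(xi, xj, lij):
    """Spring force on x_i from one neighbor (cf. Figure 1, Line 6)."""
    dist = math.dist(xi, xj)
    if dist == 0:
        return [0.0] * len(xi)
    u = [(b - a) / dist for a, b in zip(xi, xj)]   # unit vector x_i -> x_j
    return [(dist - lij) * c for c in u]

def decayed_force(xi, neighbors, now=None):
    """Aggregate spring force on coordinate xi over the recent-neighbor
    set, weighting each neighbor by the freshness of its information.

    neighbors: list of (coord, rtt_ms, last_heard_unix) tuples.
    """
    now = time.time() if now is None else now
    recent = [(c, r, now - t) for (c, r, t) in neighbors if now - t < EXPIRY_S]
    force = [0.0] * len(xi)
    if not recent:
        return force
    a_max = max(age for (_, _, age) in recent)
    total = sum(a_max - age for (_, _, age) in recent)
    for coord, rtt, age in recent:
        # oldest neighbor gets weight 0; fall back to uniform if all ages tie
        w = (a_max - age) / total if total > 0 else 1.0 / len(recent)
        fj = vivaldi_force(xi, coord, rtt)
        force = [f + w * d for f, d in zip(force, fj)]
    return force
```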

Note the distinct effects of neighbor decay from both latency and update filters. Latency filters generate a current, expected round trip time to a remote node, and update filters prevent system-level coordinate updates from spuriously affecting application behavior. Neighbor decay, in contrast, handles the problem of skewed updates that can occur when network coordinates are maintained as a passive subsystem. It allows the smooth incorporation of information from a wider range of neighbors, particularly in a system where contact between nodes is highly transient. In simulation, we confirmed that neighbor decay substantially increased stability and moderately improved continuous relative error.

5 Measuring Coordinate Systems

In this section, we review metrics used to evaluate coordinate systems and other latency services.

Relative Error. Relative error, the most basic and intuitive measure of accuracy, is the difference between the expected and actual latencies between two nodes:

    e = | ‖x_i − x_j‖ − l_ij | / l_ij

Relative error comes in three forms: global, continuous, and neighbor. Global relative error is the accuracy from the viewpoint of an omniscient external viewer: at one instant, the metric is computed for all links. With the simulations that use a latency matrix, this is what we compute because we do indeed have this viewpoint. Continuous error is what a node computes on-the-fly as it receives new observations from remote nodes. This error is added to a statistic, such as an EWMA, as in Vivaldi’s confidence. Two disadvantages to continuous error are (a) a single measurement may result in a large change in value and (b) it can become skewed by a handful of remote nodes if the “working set” of active gossip is small. Instead of continuous error, we use neighbor error as a proxy for global error when live nodes are performing the computation themselves, e.g., within live Azureus clients. Neighbor error is the distribution of relative errors for a set of recently contacted nodes. With a large number of neighbors, neighbor error generally provides a close approximation to global error.

Stability. Stable coordinates are particularly important when a coordinate change triggers application activity. In our distributed streaming query system, for example, a coordinate change could initiate a cascade of events, culminating in one or more heavyweight process migrations [24]. If the system’s coordinates have not changed significantly, there is no reason to begin this process. A stable coordinate system is one in which coordinates are not changing over time, assuming that the network itself is unchanging. We use the rate of coordinate change

    s = Σ ‖Δx_i‖ / t

to quantify stability. The units for stability are ms/sec.

Descriptions of and results from other metrics are included in the technical report version of this paper [16].
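Both metrics are easy to compute from a node's logs; a sketch with illustrative names:

```python
import math

def neighbor_relative_errors(xi, neighbors):
    """Relative error of node i's coordinate against its recent
    neighbor set; neighbors is a list of (coord, measured_rtt_ms) pairs."""
    return [abs(math.dist(xi, xj) - rtt) / rtt for (xj, rtt) in neighbors]

def stability(coord_history, elapsed_s):
    """Rate of coordinate change (ms/sec) over successive coordinates
    of a single node, assuming the network itself is unchanging."""
    moved = sum(math.dist(a, b)
                for a, b in zip(coord_history, coord_history[1:]))
    return moved / elapsed_s
```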

6 Internet-Scale Network Coordinates

Using a latency matrix can only tell part of the story of an Internet coordinate system. It helps describe the network’s characteristics, e.g., its intrinsic dimensionality, but misses out on problems that may occur only in a running system, such as churn, changes in latencies over time, and measurement anomalies. We used three distinct methods to understand the online performance of Azureus’ coordinates: (a) on PlanetLab, we ran instrumented Azureus clients that recorded the entirety of their coordinate-related behavior (§ 6.2), (b) we crawled approximately ten thousand Azureus clients that internally tracked the performance of their coordinates using statistics we inserted into the Azureus code (§ 6.3), and (c) we ran a benchmark to determine the effectiveness of the coordinates on application-level decisions (§ 6.4).

Figure 6: Because height had a major, positive impact on Azureus in simulation, we returned it to the 4D+H version.

6.1 Refining Azureus’ Coordinates

Because updates to the CVS tree could take weeks to proliferate to a majority of users, changing single variables or techniques was not feasible. Instead we relied on simulations and on-going measurement to guide the roll-out of two major coordinate versions.

Azureus’ coordinates originally used two dimensions, height, and none of the three filtering techniques we described in Section 4. We call this version 2D+H. To create version 5D, we incorporated the two techniques from our previous research, latency and update filters, into the code. Based on our on-going PlanetLab coordinate service, which did not use height and reliably exhibited low error, we also dropped height and added three more dimensions. Unfortunately, removing height proved to be a mistake. Through simulations of the Azureus latency matrix (see Figure 6), we realized we could expect a substantial improvement in accuracy by converting the last dimension of the 5D implementation to height without changing the gossip packet structure. We also found the highly skewed neighbor sets slowed convergence and developed the neighbor decay technique to compensate. We combined these changes and rolled out version 4D+H.

6.2 PlanetLab Snapshots

We took snapshots of each version by running clients on approximately 220 PlanetLab nodes. Each snapshot lasted for at least three days, and logged updates with approximately 10,000 Azureus nodes. We collected a snapshot for each of the three versions in March, July, and September 2006, respectively. Note that these instrumented clients never stored or transferred any content that travels over the Azureus network.

Figure 7: The combination of filtering, neighbor decay, and height leads to substantially more accurate coordinates on PlanetLab nodes participating in the Azureus network coordinate system. Comparing 2D+H to 4D+H, the data show a 43% improvement in relative error and a four orders-of-magnitude improvement in stability.

We compare data gathered from the different versions in Figure 7. Because the data are aggregated across roughly the same source PlanetLab nodes, the three snapshots provide a reasonable, though imperfect, way to isolate the effects of the different techniques. In all cases, we find 4D+H is more accurate and stable than both the original 2D+H and our initial rollout of 5D.

Our first revision had mixed results. This data, together with simulations run with and without height, conveys that the removal of height damaged accuracy more than the filters aided it. In retrospect, given the Azureus round trip time distribution (see § 3.2), in which 7.6% of the node pairs exhibit round trip times ≥ 1 second, it is not surprising that using height helped many nodes find a low error coordinate. In addition, given that two dimensions are enough to capture much of Azureus’ inherent dimensionality, it is also not surprising that the addition of three dimensions did not radically improve accuracy. Although the 5D coordinates are less accurate, they are more than 2½ orders-of-magnitude more stable because the latency filters prevent anomalous measurements from reaching the update algorithm.

Figure 8: Reality does not live up to expectations: a comparison of probed statistics from live Azureus nodes to those from simulation suggests that accuracy could be improved by as much as 45%. Section 7 explores the major remaining impediments.

Our second change was more successful. The introduction of neighbor decay and the re-introduction of height in 4D+H create a much more accurate coordinate space than either of the previous two snapshots. This increase in accuracy occurs because neighbor decay enables nodes to triangulate their coordinates with a larger fraction of the network (and their neighbors are doing the same) and because height supplies the numerous nodes on DSL and cable lines with the additional abstract distance over which all their physical communication must travel.

We first evaluated neighbor decay in simulation. To confirm its continued effectiveness in a live system, we performed an experiment where we monitored the convergence of a node with and without neighbor decay enabled as part of the 4D+H coordinate system. In an average of three trials, we found neighbor decay improved median accuracy by 35%, 40%, and 54% at the 15, 30, and 60 minute marks, respectively.

6.3 End-Host Live Coordinates

The logs from our Azureus clients running on PlanetLab nodes provide a detailed view of a narrow slice of the system. To obtain a picture of the broader system, we inserted online statistics collection into the Azureus CVS tree. Using its recent neighbor set, each node computed its neighbor error and stability statistics on demand when probed. We present results from Azureus end-hosts running version 4D+H.

Figure 8 “live (all)” illustrates the data from a crawl of 9477 end-hosts. We exclude live nodes with fewer than 10% of the maximum 512 neighbors because their metrics are skewed to a very small percentage of the network. The data show that the bulk of the Azureus system experiences accuracy similar to clients running on PlanetLab. However, the error on the greater Azureus network has a long tail: at the 95th percentile, its accuracy is 76% worse. As we discuss in Section 7.1, we conjecture that the high rate of churn causes much of this difference in the tail.

Figure 9: By choosing paths that are small detours in the logical space but lower latency, network coordinates improve lookup delay in Azureus’s DHT.

In order to hint at the exigencies caused by running “in the wild” as opposed to safely in the lab, we compared the statistics from live Azureus nodes to those from our simulated embeddings of the Azureus latency matrix. In Figure 8, we compare live and simulated relative error. The data show a significant gap between live and simulated performance. (Prior work using the same simulator found simulations of PlanetLab mirrored live results [17].) The medians of the relative error distributions are 26% and 14% for live and simulated coordinates, respectively, a difference of 45%.

The data suggest that network coordinates have been partially tamed, but can be made substantially more accurate, and, therefore, more useful for distributed applications that would like to make cheap, quick decisions between providers of the same service. We show how the current level of accuracy affects these anycast decisions in the following section.

6.4 Application-Level Performance

Accuracy and stability metrics capture application-independent, low-level behavior. To understand how Internet-scale coordinate systems can affect application-level behavior, we also examined how Azureus uses them to make higher-level, anycast decisions in one of its common tasks: DHT key lookup. Azureus performs this operation for each tracker announcement, torrent rating lookup and publish, and NAT traversal rendezvous lookup and publish (for tunnelling through NATs).

We modified an Azureus client so that it used network coordinates to optimize lookup delay. Our experiment to evaluate the change in lookup delay first stored a set of keys in the DHT, then looked up each key using four distinct node selection methods, recording the time for the lookup operation. For each key, we ran the methods in random order.

Each method selects one node from a small set, i.e., it is performing an anycast: all choices will make logical progress toward the target, but some have lower latency than others. Azureus uses Kademlia, which defines the logical distance between two DHT keys as the exclusive-or of their bits [19]. Starting with the logically nearest known nodes to the target: XOR picks the logically nearest node, 2D+H picks the node whose latency as predicted by the 2D+H coordinates is smallest, 4D+H picks the lowest latency node as predicted by the 4D+H coordinates, and RANDOM picks randomly from the set. Each node contacted returns its neighbors that are logically close to the target. This repeats until either a node storing the key is found or the lookup fails. Because Azureus performs DHT lookups iteratively, we were able to experiment with the lookup algorithm through code updates on only a single node.
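A sketch of the 4D+H selection rule, assuming each candidate exposes a coordinate whose last component is its height (the Candidate record and its fields are illustrative):

```python
import math
from dataclasses import dataclass

@dataclass
class Candidate:
    node_id: int
    coord: list          # Euclidean components followed by height

def predicted_rtt(a, b):
    """Height-model distance: Euclidean part plus both nodes' heights."""
    return math.dist(a[:-1], b[:-1]) + a[-1] + b[-1]

def pick_next_hop(candidates, my_coord):
    """Among nodes that all make logical progress toward the target,
    choose the one whose coordinates predict the lowest RTT."""
    return min(candidates, key=lambda c: predicted_rtt(my_coord, c.coord))
```

XOR would instead take the candidate minimizing node_id ^ target, and RANDOM a uniform choice.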

We plot the distribution of delays from storing 250 keys and performing 2500 lookups in Figure 9. Compared to the XOR method, which always chooses the nearest logical node, the data show that 4D+H reduces lookup delay by 33% at the 80th percentile. It is 12% faster than the early version of the coordinates, 2D+H, also at the 80th percentile. Because no latency prediction information is currently returned to the caller, the optimization only affects the selection of the first hop. In addition, we were not able to predict latencies to 34% of nodes due to version incompatibilities. Both of these factors suggest these improvements are conservative. We excluded lookups that timed out due to dropped UDP messages to avoid dependence on a particular timeout handling mechanism. These data show that using network coordinates can provide a substantial improvement to an application-level process.

7 Barriers to Accuracy

In this section, we examine five primary causes of the remaining difference between the current live accuracy and what appears to be achievable based on simulation results. The five barriers are: churn, drift, intrinsic error, corruption, and latency variance. We present techniques that address the first three barriers and non-malicious corruption. However, malicious corruption and latency variance remain unsolved; indeed, the latter requires a fundamentally new approach to latency prediction. Based on our simulation and PlanetLab results and on monitoring Azureus over time, we have added the techniques that address churn, drift, and non-malicious corruption to the Azureus code. While preliminary experiments suggest they function as expected, we have not yet fully quantified their effects and do not include results for them here.

7.1 Churn

Distributed network coordinate algorithms traditionally consider churn as part of their network model. Researchers ask the question: given an existing, stable system, how quickly can a new node find a stable, accurate coordinate? Unfortunately, implicit in this question is the assumption that the existing system has converged, and this assumption breaks down in many large-scale distributed systems, including Azureus. As Figure 10 illustrates, Azureus follows the long-tailed lifetime distribution typical of peer-to-peer systems [4]. (Azureus clients track uptime using an internal, query-able statistic.)

Figure 10: Azureus nodes follow a typical peer-to-peer lifetime distribution curve. With 78% of its nodes in the system for less than one hour, it is difficult to incorporate the steady stream of newcomers with coordinates starting at the origin.

Because coordinate updates were on the order of tens of seconds or sometimes minutes apart, nodes often did not have much time to settle into a stable position before they exited the system. Using the data from our crawl of the live network, we separated nodes into ones that had been in the system for an hour or more and those that had not. We plot the relative error experienced by these two groups in Figure 11. The data confirm that these short-lived nodes, which make up the majority of the system, are substantially less accurate than long-lived ones.

Figure 11: Azureus nodes that have been in the system for longer periods have more accurate coordinates. This suggests that churn may hurt convergence of Internet-scale coordinate systems.

We considered three potential solutions to the problem of sustaining a coordinate system under high churn rates. First, nodes could perform a rapid initial triangulation process before shifting to a lower update rate. However, adjusting the gossip rate over time has two problems: (a) “passive” (i.e. maintenance-free) coordinate systems have no control over gossip and (b) in an “active” system, it would be a new, complex knob. Second, we considered “greedy optimization,” where instead of just stepping once through the update process, nodes would repeat until a (local) minimum had been reached with respect to the currently known neighbors. Unfortunately, we found that this form of optimization does not work well until many neighbors are known, which is not the case early in a node’s lifetime. Finally, we found a solution that is both extremely simple and had positive results in simulation: instead of starting from scratch when restarting a client, have it begin where it left off. We performed an experiment where we varied the amount of churn in simulation and toggled whether or not nodes “remembered” their coordinate on re-entry. In Figure 12, we show the results of this experiment. We found that when nodes started at the origin on re-entry, they had a deleterious effect not only on themselves, but on overall system convergence. In contrast, with this simple technique, accuracy remained about the same as when there was no churn. While this technique assumes limited drift (see next section), it appears to be a promising start to resolving the noxious effect of churn on live coordinate systems.

Figure 12: Coordinate systems that experience high churn rates and do not allow nodes to “remember” their previous coordinates have trouble converging.

Table 1: Small amounts of gravity limit drift without preventing coordinates from migrating to low-error positions.

    Gravity’s ρ    Migration    Error
    2^6            8 ms         25%
    2^8            17 ms        10%
    2^10           74 ms        10%
    2^12           163 ms       10%
    None           179 ms       10%

7.2 Drift

Monitoring our PlanetLab-based coordinate service over several months revealed that coordinates migrated in a fairly constant direction: the centroid of the coordinates did not move in a “random walk,” but instead drifted constantly and repeatedly in a vector away from the origin. This was surprising because our previous study, based on a shorter, three-day trace, had not exhibited this pattern [17].

While coordinates are meant to provide relative distance information, absolute coordinates matter too. One problem with drift is that applications that use them often need to make assumptions on maximum distances away from the “true” origin. For example, one could use Hilbert functions to map coordinates into a single dimension [5]. This requires an a priori estimate of the maximum volume the coordinates may fill up. Mapping functions like Hilbert require that the current centroid not drift from the origin without bound. Drift also limits the amount of time that cached coordinates remain useful [13].

A “strawman” solution to drift would be to continuouslyredefine the origin as the centroid of the systems coordi-nate. Unfortunately, this would require accurate statisti-cal sampling of the coordinate distribution and a reliablemechanism to advertise the current centroid. Our solu-tion to drift is to apply a polynomially-increasing gravityto coordinates as they become farther away from the trueorigin. Gravity −→G is a force vector applied to the node’scoordinate −→xi after each update:

$$\vec{G} = \left(\frac{\|\vec{x}_i\|}{\rho}\right)^2 \times u(\vec{x}_i)$$

where $u(\vec{x}_i)$ is a unit vector directing the pull back toward the origin, and $\rho$ tunes $\vec{G}$ so that its pull is a small fraction of the expected diameter of the network. Hyperbolic coordinates could use a similar equation to compute gravity.
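A sketch of applying gravity after each coordinate update, following the formula above. The class is illustrative, not the deployed implementation, and ρ would be tuned to the expected network diameter as the text describes.

```java
// Sketch: pull a coordinate back toward the origin with a force whose
// magnitude is (||x_i|| / rho)^2, per the gravity equation above.
public class Gravity {
    // rho tunes the pull to a small fraction of the network diameter.
    private final double rho;

    public Gravity(double rho) { this.rho = rho; }

    // Apply gravity in place to coordinate x_i after a coordinate update.
    public void apply(double[] xi) {
        double norm = 0.0;
        for (double v : xi) norm += v * v;
        norm = Math.sqrt(norm);
        if (norm == 0.0) return; // nothing to pull at the origin
        double magnitude = (norm / rho) * (norm / rho);
        for (int d = 0; d < xi.length; d++) {
            // u(x_i) is the unit vector from x_i toward the origin,
            // so each component moves back toward zero.
            xi[d] -= magnitude * (xi[d] / norm);
        }
    }
}
```

Because the pull grows quadratically with distance, it is negligible near the origin (coordinates migrate freely to low-error positions) but dominates far from it, which matches the behavior reported in Table 1.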

Drift does not occur in simulation if one is using a latency matrix and updating nodes randomly, because this form of simulation does not capture time-dependent RTT variability. Instead, we used a 24-hour trace of our PlanetLab service to simulate the effect of gravity; we show the effect of different strengths of gravity in Table 1. The data show that this simple technique keeps the coordinate centroid highly stationary without affecting accuracy.

To confirm the effect of gravity on a live system, we added it to our ongoing PlanetLab service, which had ≈ 300 participants. In Figure 13, we compare drift before and after adding gravity over two 18-day periods. The data show that gravity effectively eliminates drift. In addition, it did not reduce accuracy, which, in both cases, had a median of about 10%. While gravity does not actively limit rotation, we did not observe a rate greater than one full rotation per three days. Determining the cause of drift is beyond the scope of this work.


[Figure 13 plot: Migration of Centroid of all Coordinates (ms) vs. Days (0 to 18); curves for "No Gravity" and "With Gravity".]

Figure 13: With gravity, coordinates did not drift away from their original origin as they had done before.

7.3 Intrinsic Error

Violations of the triangle inequality occur more frequently and to a greater extent on Azureus than either on PlanetLab or for sets of DNS servers (see § 3.3). We found, perhaps surprisingly, that removing a small number of the worst violators causes a large improvement in global accuracy. Not only do the violations these nodes take part in damage their own coordinates, but the damage they cause continues to reverberate throughout the system.

We performed an experiment where we removed a small percentage of the nodes with the largest triangle violations from the Azureus latency matrix and compared this to removing a random subset of nodes of the same size. We then computed a system of coordinates and found the relative error of each link. As Figure 14 illustrates, removing only the worst 0.5 percent of nodes leads to a 20 percent improvement in global accuracy. These data parallel results from theoretical work that showed how to decrease embedding distortion by allowing a small fraction of distances to be arbitrarily distorted [2]. These results show that if a mechanism could prevent these nodes from affecting the rest of the system, it would improve overall accuracy. Two example mechanisms for node self-detection and removal from the coordinate system are: (a) directly evolving an estimate of the extent of a node's violations by asking neighbors for latencies to other neighbors, and (b) determining whether a node is subject to traffic shaping (based on the modality of its latency distribution), and therefore a major cause of triangle violations. Preliminary experiments with self-exclusion based on bimodality tests show an improvement in accuracy of 8% at the 95th percentile.
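A sketch of mechanism (a): a node estimates its own triangle-violation severity from neighbor-reported latencies and could self-exclude above a threshold. The interfaces and scoring are illustrative, not the paper's exact procedure.

```java
// Sketch: estimate how badly this node violates the triangle inequality.
// For neighbors i and j, the measured latency l(me, j) should satisfy
// l(me, j) <= l(me, i) + l(i, j), where l(i, j) is reported by i.
// We accumulate the relative overshoot across neighbor pairs.
import java.util.List;

public class ViolationEstimator {
    public interface Neighbor {
        double latencyToMe();                 // measured l(me, i)
        double reportedLatencyTo(Neighbor j); // i's report of l(i, j)
    }

    // Returns the mean relative triangle violation over neighbor pairs;
    // a node scoring above some tuned threshold would exclude itself.
    public static double violationScore(List<Neighbor> neighbors) {
        double total = 0.0;
        int pairs = 0;
        for (Neighbor i : neighbors) {
            for (Neighbor j : neighbors) {
                if (i == j) continue;
                double bound = i.latencyToMe() + i.reportedLatencyTo(j);
                double direct = j.latencyToMe();
                if (direct > bound) total += (direct - bound) / direct;
                pairs++;
            }
        }
        return pairs == 0 ? 0.0 : total / pairs;
    }
}
```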

7.4 Corruption and Versioning

An unavoidable fact of running a large system where users can choose when to upgrade is that not everyone is running the same version. One of the problems we found with our original deployments was that about 13% of the remote coordinates received during gossip were at the origin; that is, [0]^d.

[Figure 14 plot: Relative Error (95th Pctl.) vs. Nodes Removed (0% to 8%); curves for "Random Subset" and "Worst Violators".]

Figure 14: Removing only a small percentage of nodes with the worst triangle violations has a large effect on global accuracy.

After much discussion (Is that incredible churn rate possible? Do nodes behind firewalls never update their coordinates?), we realized that this problem was due to a portion of the network running an old code version. In fact, during one crawl of the Azureus network, we found that only about 44% of the ≈ 9000 clients crawled were using the current version. While not very exciting, realizing this fact allowed us to compensate for it both in the coordinate update process and in active statistics collection through the explicit handling of different versions within the code.
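One plausible form this compensation could take is a defensive check in the gossip path, sketched below: coordinates from stale versions, or coordinates still pinned at the origin, are dropped before they enter the update process. The message fields and version threshold are illustrative assumptions, not the actual Azureus protocol.

```java
// Sketch: reject remote coordinates that cannot be trusted, either
// because the sender runs an old protocol version or because the
// coordinate is still [0]^d (an uninitialized, stale-version artifact).
public class GossipFilter {
    private static final int MIN_VERSION = 2; // illustrative threshold

    public static boolean accept(int senderVersion, double[] coord) {
        if (senderVersion < MIN_VERSION) return false;
        for (double v : coord) {
            if (v != 0.0) return true; // any nonzero component: usable
        }
        return false; // coordinate is exactly the origin; skip update
    }
}
```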

Kaafar et al. have begun investigating the more interesting side of the problem of coordinate corruption: malicious behavior [15]. They divide attacks into four classes: disorder, isolation, free-riding, and landmark control. While we did not see any evidence of intentionally corrupt messages, it would be trivial to install a client, or a set of clients, that responded with random values, for example (just as the MPAA runs clients with spurious content advertisements to squelch piracy). As Internet-scale coordinate systems come into wider use, they will need to grapple with both oblivious and malicious corruption.

7.5 Latency Variance

The prior "barriers to accuracy" paint a rosy picture; most problems have a fairly simple solution that practitioners can use to build more accurate, live coordinate systems. The existence of wide variation in latency measurements between the same pair of nodes over a short period of time is a harder problem with broad ramifications. If variances are very large, what does it actually mean to "predict" the latency from one node to another? Using the data from our longest snapshot (5D), we determined the standard deviation of latency between each pair of nodes. We found that round trip times varied by a median of 183ms. We show this distribution in Figure 15. This spread affects other latency prediction systems as well. A reactive measurement service, such as Meridian, will be more error-prone or have higher overhead if small numbers of pings do not sufficiently measure the latency to a high-variance target.


[Figure 16 plots: (a) CDF of Round Trip Time (ms) from planetlab2.iis.sinica.edu.tw to 12.226.X.X, curves ICMP (2623), Raw (10), Filter (10); (b) CDF of Round Trip Time (ms) from planetlab14.millennium.berkeley.edu to 84.18.X.X, curves ICMP (2630), Raw (7), Filter (7); plus Round Trip Time (ms) vs. Time (hours) for each pair, curves ICMP, Raw, Filter.]

Figure 16: A comparison of round trip times between two sets of node pairs using ICMP, raw application-level measurements, and filtered measurements. Pair (a) exhibits some variance, but shows a consistent baseline. With pair (b), the variance is so large that assigning this node a coordinate (or putting it into a consistent Meridian ring) is bound to be an error-prone process. The number in parentheses in the legend is the number of round trip time measurements in the cumulative distribution function.

[Figure 15 plot: CDF vs. Standard Deviation of Round Trip Time (per link, ms), on a log scale from 2^2 to 2^14.]

Figure 15: When round trip times vary by a median of 183ms, what does it mean to summarize a latency prediction with a single value?

In fact, coordinate systems may be in a better position to address this problem because they can retain histories of inter-node behavior.

As reviewed in Section 4.1, we developed latency filters in previous work. They act as a low-pass filter: anomalies are ignored while a baseline signal passes through. Additionally, they adapt to shifts in the baseline caused by, for example, BGP route changes. These filters assign a link a single value that conveys the expected latency of the link. While we found these simple filters worked well on PlanetLab, describing a link with a single value is not appropriate given the enormous variance we observe on some of Azureus' links.
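A minimal sketch of such a low-pass latency filter, here a moving-percentile filter over a sliding window; the window size and percentile are illustrative knobs, not the exact values used in the deployed system.

```java
// Sketch: a moving-percentile low-pass filter. Spikes are ignored
// because a low percentile of the recent window tracks the baseline,
// yet a sustained shift (e.g., after a BGP route change) moves the
// whole window and thus the output.
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

public class LatencyFilter {
    private final int window;        // samples retained, e.g. 16
    private final double percentile; // e.g. 0.25 tracks the baseline
    private final Deque<Double> samples = new ArrayDeque<>();

    public LatencyFilter(int window, double percentile) {
        this.window = window;
        this.percentile = percentile;
    }

    // Feed one raw RTT sample; returns the filtered link estimate.
    public double update(double rttMillis) {
        samples.addLast(rttMillis);
        if (samples.size() > window) samples.removeFirst();
        double[] sorted =
            samples.stream().mapToDouble(Double::doubleValue).toArray();
        Arrays.sort(sorted);
        int idx = (int) Math.floor(percentile * (sorted.length - 1));
        return sorted[idx];
    }
}
```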

We ran an experiment where we compared ICMP, filtered, and raw latency measurements that were taken at the same time. To determine which destination nodes to use, we started Azureus on three PlanetLab nodes and chose five ping-able neighbors after a twenty-minute start-up period. We then let Azureus continue to run normally for six hours while simultaneously measuring the latency to these nodes with ping. We plot the data in Figure 16. Figure 16 (a) illustrates a pair similar to our PlanetLab observations: there was raw application-level and ICMP variance, but a consistent baseline that could be described with a single value. In contrast, Figure 16 (b) portrays a high variance pair: while the filter does approximate the median round trip time, it is difficult to say, at any point in time, what the latency is between this pair.

The dual problems of high latency variance and of modifying algorithms to cope with it are not limited to network coordinate systems. Latency and anycast services deployed "in the wild" need to address this problem. While there may exist methods to incorporate this variance into coordinate systems, either through "uncertainty" in the latency filters or in the coordinates themselves, resolving this problem is beyond the scope of this paper.

8 Related Work

Early work on latency prediction services focused on reducing the intractability of all-pairs measurements through clustering. Based on the assumption that nodes in the same cluster would have similar latencies to nodes in another cluster, researchers examined how to create accurate clusters and how to minimize inter- and intra-cluster measurement overhead.


Francis et al. created clusters based on IP address prefixes, but found that prediction error was heavily dependent on the initial choice of representatives [12]. Chen et al. addressed this problem through the automatic formation of clusters and representatives; they found that the cluster size and, more generally, the amenability of the network to clustering had a large effect on accuracy [6]. Ratnasamy et al. proposed a hybrid approach: nodes that are similar distances away from fixed landmarks place themselves into the same cluster; they also found error was highly dependent on the number of bins [25]. Because all of this clustering involves measurement, and lower network layers are already performing much of this measurement, Nakao et al. proposed reducing overhead by tapping into this existing information; unfortunately, this requires a change in the interface of Internet routers [20].

While this research and the work on network coordinates that grew out of it focus on generalized latency prediction (maintaining an infrastructure that works well for most queries), a separate body of work has focused more directly on the problem of finding the nearest of many replicated services. In direct response to an application-level request, Meridian finds the nearest overlay node (i.e., one running Meridian) to an arbitrary point in the Internet through a set of pings that progress logarithmically closer to the target [31]. Freedman et al. developed OASIS, a distributed service explicitly designed to help clients find and choose a "good" server out of many [13]. Building on Meridian, OASIS primarily focuses on network locality, but also incorporates liveness and load. OASIS employs a reliable core of hosts to map clients to nearby servers, which are assumed to be long-lived. Note the distinct purposes of these anycast services from those of network coordinates: Meridian and OASIS are designed for the case where contact with the service will be frequent and long-lived enough to outweigh the high upfront cost of finding the best service. With their current levels of accuracy (good but not perfect) and maintenance (zero), network coordinates fall to the other side of the trade-off: short-lived, cheap decisions for which finding the exact answer is not worthwhile, but repeatedly finding a good answer leads to aggregate savings. While Meridian and OASIS are inherently reactive, acting in response to a query, they too could be more tightly integrated with an application, using its messages to dampen ring maintenance, for example.

8.1 Network Coordinates

There exist two main classes of algorithms for calculating coordinates: landmark-based schemes, in which overlay nodes use a fixed number of landmark nodes to calculate their coordinates, and simulation-based schemes, which are decentralized and calculate coordinates by modeling nodes as entities in a physical system.

Landmark-based. In GNP [21], nodes contact multiple landmark nodes to triangulate their coordinates. The drawbacks of this approach are that the accuracy of the coordinates depends on the choice of landmark nodes and that landmark nodes may become a bottleneck. Lighthouses [23] addresses this by supporting multiple independent sets of landmarks with their own coordinate systems. These local coordinates map into a global coordinate system. PIC [8] does not use explicit landmarks, instead incorporating measurements to any node using a simplex optimization algorithm to obtain an up-to-date coordinate. These landmark-based schemes require a reasonably stable infrastructure and, to the best of our knowledge, have not been adopted for widespread use.

Simulation-based. Vivaldi [9] and Big Bang Simulation [26] determine coordinates using spring-relaxation and force-field simulation, respectively. In both, nodes attract and repel each other according to network distance measurements. The low-energy state of the physical system corresponds to the coordinates with minimum error.
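A sketch of the spring-relaxation intuition behind Vivaldi: each measurement moves a node along the spring between it and its neighbor, scaled by a small step size. This simplified form omits Vivaldi's confidence-weighted adaptive timestep and per-node error tracking; see [9] for the full algorithm.

```java
// Sketch: one simplified Vivaldi step. The spring between us and the
// neighbor is stretched (or compressed) by the difference between the
// measured RTT and the current coordinate distance; we move a small
// step along it toward the spring's rest length.
public class VivaldiStep {
    private static final double STEP = 0.05; // fixed step size (delta)

    public static void update(double[] self, double[] neighbor, double rtt) {
        double dist = 0.0;
        for (int d = 0; d < self.length; d++) {
            double diff = self[d] - neighbor[d];
            dist += diff * diff;
        }
        dist = Math.sqrt(dist);
        if (dist == 0.0) return; // coincident coordinates: skip here
                                 // (Vivaldi picks a random direction)
        double error = rtt - dist; // positive: too close, push apart
        for (int d = 0; d < self.length; d++) {
            double unit = (self[d] - neighbor[d]) / dist;
            self[d] += STEP * error * unit;
        }
    }
}
```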

de Launois et al. propose a different method for stabilizing coordinates: asymptotically dampening the effect of each new Vivaldi measurement [10]. While this factor does mitigate oscillations in a fixed network, it prevents the algorithm from adapting to changing network conditions.

9 Conclusion

We have demonstrated that network coordinates in the wild do behave somewhat differently than do tame coordinates on PlanetLab or in simulation. Fortunately, even these wild coordinates can be tamed. Our analysis of a large, Internet-scale coordinate system has convinced us to join the camp of network coordinate supporters. While the initial network coordinate implementation illustrated some of the problems that critics often cite, we found that simple but effective techniques overcame nearly all of these issues. In Azureus, network coordinates provide a simple and efficient mechanism for anycast, as part of DHT lookups, and may soon be used to optimize streaming media. In addition to providing a wealth of data and analysis from a live, large-scale deployment, we have deployed and evaluated six techniques that improve the accuracy and/or stability of network coordinate systems: latency filters, update filters, neighbor decay, coordinate memory, gravity, and violator exclusion. Together, these yield efficient, accurate, and stable network coordinates in the million-node Azureus network. In the future, we plan to add the remaining techniques to the Azureus code and monitor their effectiveness.

10 Acknowledgements

We wish to thank Peter Pietzuch for early discussions on the myriad potential reasons for the discrepancy between simulated and real network coordinates,


Olivier Chalouhi of Aelitis for allowing us to tinker with the Azureus source code, and Michael Parker of UCLA for putting us in touch with Aelitis and for porting our implementation to Java. In addition, we wish to thank our anonymous reviewers and our shepherd, Emin Gün Sirer, who provided extremely detailed constructive criticism.

References

[1] Azureus BitTorrent Client. http://azureus.sourceforge.net/.
[2] Y. Bartal, N. Linial, M. Mendel, and A. Naor. On Metric Ramsey-Type Phenomena. In Proc. of Symposium on Theory of Computing, San Diego, CA, 2003.

[3] R. Bindal, P. Cao, W. Chan, J. Medved, G. Suwala, T. Bates, and A. Zhang. Improving Traffic Locality in BitTorrent via Biased Neighbor Selection. In Proc. of International Conference on Distributed Computing Systems, Lisbon, Portugal, July 2006.
[4] F. Bustamante and Y. Qiao. Friendships that Last: Peer Lifespan and its Role in P2P Protocols. In Proc. of International Workshop on Web Content Caching and Distribution, Hawthorne, NY, October 2003.
[5] A. R. Butz. Alternative Algorithm for Hilbert's Space-Filling Curve. IEEE Transactions on Computers, pages 424–426, April 1971.
[6] Y. Chen, K. H. Lim, R. H. Katz, and C. Overton. On the Stability of Network Distance Estimation. SIGMETRICS Performance Evaluation Review, 30(2), 2002.
[7] B. Cohen. Incentives Build Robustness in BitTorrent. In Proc. of Workshop on Economics of Peer-to-Peer Systems, Berkeley, CA, June 2003.
[8] M. Costa, M. Castro, A. Rowstron, and P. Key. PIC: Practical Internet Coordinates for Distance Estimation. In Proc. of International Conference on Distributed Computing Systems, Tokyo, Japan, March 2004.
[9] F. Dabek, R. Cox, F. Kaashoek, and R. Morris. Vivaldi: A Decentralized Network Coordinate System. In Proc. of SIGCOMM, Aug. 2004.
[10] C. de Launois, S. Uhlig, and O. Bonaventure. A Stable and Distributed Network Coordinate System. Technical report, Universite Catholique de Louvain, December 2004.
[11] J. Dilley, B. Maggs, J. Parikh, H. Prokop, R. Sitaraman, and B. Weihl. Globally Distributed Content Delivery. IEEE Internet Computing, September/October 2002.
[12] P. Francis, S. Jamin, C. Jin, Y. Jin, D. Raz, Y. Shavitt, and L. Zhang. IDMaps: A Global Internet Host Distance Estimation Service. IEEE/ACM Trans. Networking, 9(5), 2001.
[13] M. Freedman, K. Lakshminarayanan, and D. Mazieres. OASIS: Anycast for Any Service. In Proc. of the Symposium on Networked Systems Design and Implementation, San Jose, CA, May 2006.
[14] K. P. Gummadi, S. Saroiu, and S. D. Gribble. King: Estimating Latency between Arbitrary Internet End Hosts. In Proc. of Internet Measurement Workshop, Marseille, France, Nov. 2002.
[15] M. A. Kaafar, L. Mathy, T. Turletti, and W. Dabbous. Virtual Networks under Attack: Disrupting Internet Coordinate Systems. In Proc. of Second CoNext Conference, Lisbon, Portugal, 2006.
[16] J. Ledlie, P. Gardner, and M. Seltzer. Network Coordinates in the Wild. Technical Report TR-20-06, Harvard University, Cambridge, MA, Oct. 2006.
[17] J. Ledlie, P. Pietzuch, and M. Seltzer. Stable and Accurate Network Coordinates. In Proc. of International Conference on Distributed Computing Systems, Lisbon, Portugal, July 2006.

[18] E. K. Lua, T. Griffin, M. Pias, H. Zheng, and J. Crowcroft. On the Accuracy of Embeddings for Internet Coordinate Systems. In Proc. of Internet Measurement Conference, Berkeley, CA, Oct. 2005.
[19] P. Maymounkov and D. Mazieres. Kademlia: A Peer-to-Peer Information System Based on the XOR Metric. In Proc. of International Workshop on Peer-to-Peer Systems, Cambridge, MA, March 2002.
[20] A. Nakao, L. Peterson, and A. Bavier. A Routing Underlay for Overlay Networks. In Proc. of SIGCOMM, Karlsruhe, Germany, Aug. 2003.
[21] T. S. E. Ng and H. Zhang. Predicting Internet Network Distance with Coordinates-Based Approaches. In Proc. of INFOCOM, June 2002.
[22] L. Peterson, T. Anderson, D. Culler, and T. Roscoe. A Blueprint for Introducing Disruptive Technology into the Internet. In Proc. of Workshop on Hot Topics in Networks, Princeton, New Jersey, October 2002.
[23] M. Pias, J. Crowcroft, S. Wilbur, T. Harris, and S. Bhatti. Lighthouses for Scalable Distributed Location. In Proc. of International Workshop on Peer-to-Peer Systems, Berkeley, CA, February 2003.
[24] P. Pietzuch, J. Ledlie, J. Shneidman, M. Welsh, M. Seltzer, and M. Roussopoulos. Network-Aware Operator Placement for Stream-Processing Systems. In Proc. of International Conference on Data Engineering, Atlanta, GA, April 2006.
[25] S. Ratnasamy, P. Francis, M. Handley, B. Karp, and S. Shenker. Topology-Aware Overlay Construction and Server Selection. In Proc. of INFOCOM, June 2002.
[26] Y. Shavitt and T. Tankel. Big-Bang Simulation for Embedding Network Distances in Euclidean Space. In Proc. of INFOCOM, San Francisco, CA, June 2003.
[27] Y. Shavitt and T. Tankel. On the Curvature of the Internet and its Usage for Overlay Construction and Distance Estimation. In Proc. of INFOCOM, Hong Kong, June 2004.
[28] N. Spring, D. Wetherall, and T. Anderson. Scriptroute: A Public Internet Measurement Facility. In Proc. of the Symposium on Internet Technologies and Systems, Seattle, WA, Mar. 2003.
[29] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. In Proc. of SIGCOMM, San Diego, CA, Aug. 2001.
[30] L. Tang and M. Crovella. Virtual Landmarks for the Internet. In Proc. of Internet Measurement Conference, Oct. 2003.
[31] B. Wong, A. Slivkins, and E. G. Sirer. Meridian: A Lightweight Network Location Service without Virtual Coordinates. In Proc. of SIGCOMM, Aug. 2005.
[32] R. Zhang, Y. C. Hu, X. Lin, and S. Fahmy. A Hierarchical Approach to Internet Distance Prediction. In Proc. of International Conference on Distributed Computing Systems, Lisbon, Portugal, July 2006.
[33] H. Zheng, E. K. Lua, M. Pias, and T. G. Griffin. Internet Routing Policies and Round-Trip-Times. In Proc. of Passive and Active Measurement Workshop, Mar. 2005.


[Figure 17 plots: CDFs of Relative Rank Loss and Relative Application-Level Penalty; top row compares 4d+h, 5d, and 2d+h on PlanetLab, bottom row compares Simulation, Live (All), and Live (PlanetLab).]

Figure 17: On PlanetLab (top), comparing 2D+H to 4D+H, the data show a 20% improvement in RRL and a 15% improvement in RALP. The bottom two graphs show how these metrics changed on the wider set of Azureus end-hosts as compared to simulation.

A Appendix: Other Coordinate Metrics

Relative Rank Loss. Lua et al. proposed error metrics that can capture application accuracy better than relative error [18]. Relative rank loss (RRL) determines how well a network coordinate scheme preserves the relative ordering of all pairs of neighbors. RRL is the percentage of rankings that are incorrect. A ranking for nodes x, i, and j is incorrect if:

$$(d_{xi} > d_{xj} \land l_{xi} < l_{xj}) \lor (d_{xi} < d_{xj} \land l_{xi} > l_{xj})$$

where $d$ is coordinate distance and $l$ is latency. This metric is important for applications that make decisions dependent on the relative ordering of nodes, for example, when updating routing table entries.

Relative Application-Level Penalty. In related work [17], we extended RRL to capture the latency penalty that an application would experience if it relied on coordinates instead of direct latency measurements for all pairs of nodes. For each incorrectly ranked pair i, j for a node x, where x picks j even though i is closer, we sum up the application-experienced penalty $l_{xj}/l_{xi}$. This metric, relative application-level penalty (RALP), conveys the cumulative latency penalty for pairwise anycast decisions.
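A sketch computing both metrics for one node x directly from the definitions above, where d[i] and l[i] are the coordinate distance and measured latency from x to neighbor i. Normalizing RALP by the number of ordered pairs is our assumption; the text specifies only the summed penalty.

```java
// Sketch: compute RRL and RALP for a node x from the definitions.
// d[i] is coordinate distance from x to neighbor i; l[i] is measured
// latency. RRL is the fraction of misordered pairs; RALP accrues the
// penalty l_xj / l_xi whenever x would pick the farther node j over
// the truly closer node i.
public class RankMetrics {
    public static double rrl(double[] d, double[] l) {
        int wrong = 0, pairs = 0;
        for (int i = 0; i < d.length; i++) {
            for (int j = i + 1; j < d.length; j++) {
                if ((d[i] > d[j] && l[i] < l[j]) ||
                    (d[i] < d[j] && l[i] > l[j])) wrong++;
                pairs++;
            }
        }
        return pairs == 0 ? 0.0 : (double) wrong / pairs;
    }

    public static double ralp(double[] d, double[] l) {
        double penalty = 0.0;
        int pairs = 0;
        for (int i = 0; i < d.length; i++) {
            for (int j = 0; j < d.length; j++) {
                if (i == j) continue;
                // x picks j (smaller coordinate distance) although i is
                // actually closer in latency: accrue the relative penalty.
                if (d[j] < d[i] && l[j] > l[i]) penalty += l[j] / l[i];
                pairs++;
            }
        }
        return pairs == 0 ? 0.0 : penalty / pairs;
    }
}
```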

We show these additional metrics collected on PlanetLab (see § 6.2) and on Azureus' end hosts (see § 6.3) in Figure 17.