TULIP
TRILATERATION UTILITY FOR LOCATING IP ADDRESSES – A DELAY BASED SOLUTION FOR IP GEOLOCATION
Project Team
Dr. Les Cottrell [SLAC, USA] Umar Kalim [NIIT, PAK]
Shahryar Khan [NIIT, PAK] Faran Javed [NIIT, PAK]
ABSTRACT
We are currently witnessing tremendous expansion in network deployment. As
companies have realized the cost benefits and productivity gains of network
technology, they have added new networks and expanded existing ones almost as
rapidly as new network technologies and products are introduced. The problems
associated with this expansion affect both day-to-day network operation
management and strategic network growth planning, and each new network technology
requires its own set of experts. An urgent need has therefore arisen for automated network
management (including what is typically called network capacity planning) integrated
across diverse environments.
In such an increasingly connected world, geolocation of a target host based on
its name or address can, for example, provide critical information for security, content
delivery and Internet traffic studies. Knowledge of the physical location of a user
with an assigned IP address is currently exploited in applications ranging from credit
card fraud protection to online advertising. However, most industrial approaches to
assigning IP addresses or address ranges to a geographic location rely on manually
maintained databases, which may contain wrong or outdated information.
One cannot assume the location of a host is given by its name (e.g. its Top
Level Domain); for example, server hosts associated with a web site may have proxies
overseas. While the approach is not useful for all IP addresses (tunnel endpoints or mobile
nodes, for example), most IP addresses can be traced automatically to their location
with an inaccuracy of several hundred kilometers. This might appear high at first,
but given that, for example, credit card fraud protection only needs to know the
country in which the user is currently located, it is a tolerable inaccuracy. Several
studies have developed different mechanisms for automatic geolocation: using a
set of servers with known locations to triangulate the IP address, using location
information provided by the target, or using topology hints in the router naming scheme.
We use delay measurements to dynamically locate a target host from a set of
reference hosts (landmarks) that are reliably accessible and whose locations are well
known and constant, and can thereby triangulate a large fraction of the world's IP hosts.
The delay measurements provide rough estimates of the distances between the landmarks
and the target, which are used to geolocate the target. Multilateration, a concept used in
[bamba], is then applied to pinpoint the target host.
Delay increases with geographical distance, but we may observe additive delay
for a number of reasons, such as a greater number of hops, congested paths, indirect
routing, etc. The first step is to convert the delay measurements to distance estimates
based on the speed of light in fiber, perform multilateration, and avoid miscalculations
caused by the over-estimation that the additive delay introduces. The more challenging
part is obtaining “good” landmarks that provide good connectivity to a large fraction of
the world. Although CBG [bamba] provides a fairly accurate estimate for sites in the
developed world, especially in North America and Europe, the level of accuracy drops
dramatically as we move outside these regions. The aim is to obtain better landmark
coverage, and an improved algorithm to cater for the increased delays in the less
developed regions.
Using location data from about 600 PingER hosts in over 100 countries, we
shall validate the accuracy and applicability of TULIP, and also use it to identify
misplaced hosts in the PingER database.
Chapter 1
INTRODUCTION
Computers are fulfilling an increasingly diversified set of tasks in our society.
They take on many key jobs and assist us in managing numerous tasks in our
daily life; in one form or another, they process and provide us with information. Many
researchers agree that the information age is rapidly replacing the industrial era.
There have been three major stages in human history. The first was the
agrarian stage, whose economy depended upon agricultural activity with little
knowledge input. The industrial revolution, which is now coming to an end, was the
first major change; the economic success of that era was based on mass production and
product quality. The third era is the age of information technology and services. In this
era, information and knowledge are becoming the focus of economic activity across the
entire world. This phase has changed economic, political and cultural values on a
global scale, forcing state-controlled economies towards free economies oriented to
abundant supply and diversity. Global information systems, such as the Internet, are no
longer just pathways for digital data; rather, they generate and process information to
create knowledge for commercial activity, education, business and research. For the
first time in the history of humankind, non-material resources such as software and
information services have become the new raw materials and the real wealth of a
knowledge-based society.
Now that we have discussed the new era, namely Information age, we need to
explore what new opportunities and challenges are emerging in the near future.
Information and communication are two of the most important challenges for the
success of every enterprise. While today nearly every organization uses a substantial
number of computers and communication tools (telephones, fax, and personal
handheld devices), they are often still isolated. While managers today are able to use
the newest applications, many departments still do not communicate and much needed
information cannot be readily accessed.
To overcome these obstacles to the effective use of information technology,
computer networks are necessary. They are a new kind (one might call it a new
paradigm) of organization of computer systems, produced by the need to merge
computers and communications. At the same time they are the means to converge the
two areas: the unnecessary distinction between tools to process and store information
and tools to collect and transport information can disappear. Computer networks can
break down the barriers between information held on several (not only computer)
systems. Only with the help of computer networks can a borderless communication
and information environment be built.
Networks can only be used to their maximum efficiency if we can manage
them effectively and measure their states. There is an increasing need for high-
performance monitoring to set expectations, provide planning and trouble-shooting
information, and provide steering for applications. To quantify and help bridge the
Digital Divide, enable world-wide collaborations, and reach out to scientists
worldwide, it is imperative to continue and extend monitoring coverage to all countries
with HENP programs and significant scientific enterprises. The PingER project was
initiated to meet these requirements.
PingER uses ICMP to gather measurement data and also considers the physical
location of the host. For example, a host in Australia may give a constant delay of
400 ms when measured from SLAC; then, suddenly, the delay increases from 400 ms
to 600 ms. It appears that the network condition has got worse, but a detailed manual
analysis shows that there is nothing wrong with the network: the host has moved
to some other geographical location.
1.1 Need for a Dynamic Automated IP geolocation Tool
The Internet has become a collection of resources meant to appeal to a large
general audience. Although this multitude of information has been a great boon, it also
has diluted the importance of geographically localized information. Offering the ability
for Internet users to garner information based on geographic location can decrease
search times and increase visibility of local establishments. Similarly, user
communities and chat-rooms can be enhanced through knowing the locations (and
therefore, local times, weather conditions and news events) of their members as they
roam the globe. It is possible to provide user services in applications and Web sites
without the need for users to carry GPS receivers or even to know where they
themselves are.
Geolocation by IP address is the technique of determining a user's geographic latitude,
longitude and, by inference, city, region and nation by comparing the user's public
Internet IP address with known locations of other electronically neighboring servers
and routers.
IP addresses are not static: they can be reassigned, relocated within the same
provider, or forwarded using mechanisms like VPNs or other tunnels. A database
of geolocation information about IPs therefore has to be updated and
maintained on a regular basis. Due to the large and increasing number of IP
addresses in use, this is a task that is nearly impossible to maintain manually.
The most popular approach to geolocation is maintaining a manually created
database, like "GeoIP" from MaxMind [14]; a stripped-down version of their
commercially available database is given out for free with several tools and APIs, and
is quite popular within the open source community. Several other free libraries like
[13] exist, too. As already described above, this is a rather
problematic approach: employing paid workers to maintain such a database is very costly.
Using community-created databases like [13], or databases gathered through Web 2.0
communities like [16], minimizes the costs but increases the risk of accidentally or
intentionally entered wrong data.
In recent years many efforts have been made at IP geolocation. Most of them
are log based and are hence prone to becoming outdated. Very few are based on dynamic
measurements. Below we discuss a few that use delay measurements for
distance estimation and are hence in accordance with our proposed objectives.
1.1.1- Bamba
CBG (Constraint Based Geolocation) transforms delay measurements into
geographic distance constraints, and then uses multilateration to infer the geolocation
of the target host. CBG is able to assign a confidence region to each location
estimate, which allows a location-aware application to assess whether the estimate
is sufficiently accurate for its needs.
1.1.2- NetGeo
A project by CAIDA (the Cooperative Association for Internet Data Analysis).
The project is quite a few years old. A database is maintained that contains the
geographical latitude and longitude of ASNs (Autonomous System Numbers).
1.1.3- ICS (Internet Coordinate System)
These schemes work by using a small set of nodes called “landmarks” to compute
the coordinates of other hosts. The principal component analysis (PCA) technique is
used to extract topological information from delay measurements between
beacon hosts. Based on PCA, a transformation method is devised that projects the
distance data space into a new coordinate system of (much) smaller dimension. The
transformation retains as much topological information as possible, yet enables end
hosts to easily determine their locations in the coordinate system.
1.1.4- GeoIP
This technique uses end users' input. It acquires data from various sites that
require users to register for service, and based on this data it estimates the
location of an ASN. This technique is ideal for e-commerce applications but fails for
intermediate and backbone routers, mainly because routers belonging
to one AS may be in different countries. Hence for scientific study GeoIP is not
the ideal tool; for example, if we are to find the geographical location of a bottleneck,
it may be an intermediate router.
Chapter 2
DELAY BASED IP GEOLOCATION
In this chapter we shall discuss the issues involved in delay based IP geolocation.
4.1 On Delay Based IP Geolocation
The location of an IP address can be determined using two methodologies: “ping”, by
using minimum round trip time estimates and triangulating the results to obtain an
approximate location, and “whois”. We are exploring the first technique, also called
delay based IP geolocation. The main idea is to use RTT measurements from
multiple sites (landmarks) to triangulate the position of a remote host that is being
pinged. The basic tool is a script that takes RTT measurements from
landmarks at known latitudes and longitudes to a remote host (at an unknown location)
specified by the user, and figures out the latitude and longitude of the remote host.
TULIP (Trilateration Utility for Locating IP addresses) has been developed to utilize
PingER data in order to verify IP locations. Using TULIP, PingER would be able to
detect anomalies like:
1. A set of measured RTTs to a given host that do not agree on where the remote
host is located.
2. Measurements that do agree on where a remote host is located, but where that
location differs from the "known" latitude and longitude of the host in the database.
As part of this, a ping utility has been put together and is being distributed to known
places around the world, e.g. PingER monitoring hosts, which then serve as
landmark sites.
Given this infrastructure, a web accessible tool has been designed where the user
enters the name or IP address of the host whose location is to be determined. The tool
then gets the landmark sites to make RTT measurements, collects these measurements,
analyzes the data, presents the results and, on demand, makes traceroutes to the remote
host from selected landmark sites. The difficulty of getting the ping tool deployed in
enough places can be addressed by obtaining RTT estimates from reverse traceroute
servers around the world. While implementing such a technique, several factors
have to be taken into account:
• Bandwidth Links can have capacity bandwidths that differ by many orders of
magnitude (e.g. 1 Mbit/s versus 10 Gbit/s). For a 1 Mbit/s link, the time to
clock a 100 byte packet onto the link is about 0.8 ms. This is roughly the time
for light to propagate through a 100 mile (~161 km) fiber¹. It is thus not
possible to have a single conversion factor from RTT to distance that works for
all links at all distances.
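The 0.8 ms figure can be checked with a short calculation (a sketch; the propagation speed of ~200,000 km/s, about 2/3 of the speed of light, is an assumption consistent with the rule of thumb used later for delay-to-distance conversion):

```java
public class DelayCheck {
    // Time to clock a packet onto a link, in milliseconds.
    static double serializationMs(int packetBytes, double linkBitsPerSec) {
        return packetBytes * 8 / linkBitsPerSec * 1000.0;
    }

    // One-way light propagation time through fiber, in milliseconds,
    // assuming ~200,000 km/s (about 2/3 c) in fiber.
    static double propagationMs(double km) {
        return km / 200_000.0 * 1000.0;
    }

    public static void main(String[] args) {
        // 100-byte packet on a 1 Mbit/s link: 0.8 ms to serialize.
        System.out.printf("serialization: %.3f ms%n", serializationMs(100, 1e6));
        // Light through 161 km (~100 miles) of fiber: roughly the same time.
        System.out.printf("propagation:   %.3f ms%n", propagationMs(161));
    }
}
```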
• Queuing delay can have major effects on the RTT measurement. Queuing
delays are critically affected by cross-traffic: increased cross-traffic is likely to
increase queuing and thus RTTs. We therefore use the minimum RTT of multiple
measurements, with the understanding that the minimum possible RTT is for
packets that see no queuing. However, if queuing persists for the duration
of the measurement, then the minimum RTT will include a queuing component
and will not accurately reflect the distance.
¹ A canonical average delay per distance for packets traversing well provisioned parts of the Internet is about a factor of two slower than this (i.e. 1 ms per 100 km) due to non great circle routes, multiple hops, etc.
• Routing policy The path a packet takes over the network may change
depending on link/router availability and routing policies. Extra hops in the
path will add router and clocking delays, different hops will have
different delays, and different links will have different capacities and distances.
All of this will change the minimum RTT when a route changes. Further, the
peering may be such that the route between a landmark and a selected host is
not direct. For example, the route from one landmark in San Jose to SLAC (20
miles away) goes via New York. Thus one may need to
compare the results from a cluster of landmarks that are close together and
choose the minimum RTT.
• Bottleneck bandwidth is the available bandwidth of the link in the path that at
the measurement moment has the lowest available bandwidth. This may be the
link with the lowest static capacity or it may be a heavily loaded link with the
currently minimum available bandwidth (~ link capacity – link utilization). The
link utilization changes as a function of time so the bottleneck magnitude may
vary and its location may move from link to link.
• Mis-configuration of network devices such as routers and switches can cause
unexpected delays. First of all mis-configurations can cause a router or link to
fail. Secondly the alternative route may be poorly chosen (e.g. have much
lower capacity or much longer delays).
• Traffic conditions may vary enormously as a function of time. For example,
on Monday mornings typically there is a surge in traffic as people come to
work; also there is usually much more traffic during peak work hours compared
to weekends and nights. Such increased traffic often leads to congestion,
increased queuing and thus increases in RTT.
Links in Canada, the USA, Europe and Japan are generally of high capacity, e.g.
100 Mbit/s or 1 Gbit/s end host network interfaces, 1 or 10 Gbit/s backbone links,
and >= 100 Mbit/s border router links. The routes also tend to be well defined with
optimum backup paths. Thus the distances derived from the minimum RTT values in
those regions are generally accurate.
4.2 Delay to Distance Conversion
TULIP uses solely delay measurements to estimate the distance of a site from each
landmark. Many factors have an impact on the end-to-end delay, the dominant one
being that digital information travels in fiber at about 0.6 times the speed of light
in vacuum [1]. This gives us a very convenient rule of 1 ms of RTT per 100 km of
cable. Other factors affect RTT measurements and may cause an
additive delay: data packets may follow a circuitous path, many links follow
highways and railways, etc. [3]. There is also a per-hop processing and/or queuing
delay due to cross traffic, summing to a few milliseconds [3]. Since the
maximum delay along a path is variable and cannot be predicted, we use
the rule mentioned in [bamba] to estimate the geographical distance. Hence we observe
only additive distortion when estimating the distance.
To minimize this additive distortion we further optimized the alpha value. We
took a tighter upper bound based on the geographical locations of the landmarks, and a
suitable value of alpha was selected for each landmark. An automated script to
determine alpha values was written and packaged with TULIP. The methodology was
as follows:
Figure 12: Alpha optimization
A graph of minimum RTT against distance in km is plotted. The red dots represent
the actual distances plotted against the observed minimum RTT. The blue line
represents the distance estimated using alpha = 100; the large over-estimation is
clear. The green line represents the tightest upper bound, obtained by
determining the equation of this line.
Using the slope-intercept form we have

    Y = mX + b

Here b = 0, since the line passes through the origin. The line should lie above the
point with the highest value of the ratio

    distance in km / min RTT in ms

and this ratio becomes the slope of the line. Mathematically, drawing a line from the
origin to the above mentioned point with coordinates (x1, y1):

    m = (Y2 - Y1) / (X2 - X1)

Putting X2 and Y2 equal to 0 (the origin), we have

    m = Y1 / X1

where, as stated earlier, x1 is the minRTT and y1 is the distance in km. The equation
of the line becomes

    Y = m × X, i.e. distance = m × minRTT

I.e. m (the slope of the line) is the alpha.
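The slope computation above can be sketched as follows (a minimal sketch with hypothetical sample values; zero min-RTT samples, which occasionally appear in the historical data, are discarded since they make the ratio undefined):

```java
public class AlphaEstimator {
    // Tightest upper bound: the largest distance/minRTT ratio over all
    // landmark-target pairs becomes the slope (alpha) of the bounding
    // line through the origin. Pairs with a zero min RTT are discarded.
    static double tightestAlpha(double[] distanceKm, double[] minRttMs) {
        double alpha = 0.0;
        for (int i = 0; i < distanceKm.length; i++) {
            if (minRttMs[i] <= 0.0) continue;   // skip invalid zero-RTT samples
            alpha = Math.max(alpha, distanceKm[i] / minRttMs[i]);
        }
        return alpha;
    }

    public static void main(String[] args) {
        double[] d   = { 500.0, 1200.0, 3000.0, 150.0 };   // hypothetical distances (km)
        double[] rtt = {  12.0,   30.0,   55.0,   0.0 };   // hypothetical min RTTs (ms); last is invalid
        // Largest ratio is 3000/55 ~ 54.5 km/ms
        System.out.println(tightestAlpha(d, rtt));
    }
}
```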
4.3 Using PingER historical data for alpha optimization
To further optimize the alpha we use data gathered by PingER monitoring sites.
Location information about each site is retrieved from rss.xml.
As before, a scatter plot of distance versus delay is drawn and the tightest
upper bound is computed as the best alpha. When collecting historical data, some sites
with a geographical distance not equal to zero gave a minRTT of 0.000, which makes
alpha computation impossible; all such zero values were discarded.
Chapter 3
SYSTEM ARCHITECTURE
In this chapter we discuss the system architecture of TULIP. TULIP mainly
consists of three sub-modules; in the last part of this chapter the integration of the
modules and future design concerns are discussed.
TULIP is not only an application but also offers a comprehensive API so that other
programs may use its geolocation algorithm. This API is being used by GridOS, which
is being developed at NIIT in collaboration with UWE.
Figure 13: Block Diagram of TULIP Modules
TULIP mainly consists of three modules:
1- Probing Engine
2- Core
3- Visualization
5.1 Probing Engine
This module is responsible for making all the measurements. It issues ping and
traceroute requests to all landmarks, collects the data from these landmarks, parses
the raw output, and populates the tables which are further processed for calculation.
5.2 Core
This module carries out all the calculation and analysis. The trilateration
algorithm, alpha correction and all other statistical calculations are carried out by this
module.
5.3 Visualization
This module plots the results of the core visually. It is responsible for
displaying statistical calculations as graphs and for plotting the landmarks and the host
on the globe after their positions have been calculated by the core.
Chapter 4
IP GEOLOCATION ALGORITHM
In the previous chapter we presented the overall system architecture and
explained the main modules of TULIP. In this chapter we discuss the
algorithm created for IP geolocation.
6.1 The Location Algorithm
MinRTT = propagation delay + extra delay (due to circuitous routes)

    ΔT_measured = ΔT + ΔT0

Pseudo-distance:

    PD = ΔT_measured · α

Actual distance:

    D = ΔT · α

so

    PD = (ΔT + ΔT0) · α
    PD = D + ΔT0 · α ........................................... (1)

where:
D = actual distance from the landmark
c = speed of light
α = X · c, i.e. the speed of digital information in fiber optic cable
X = the fraction of c at which digital information travels in fiber optic cable
ΔT = actual propagation delay along the great circle route/path
ΔT0 = the extra delay causing overestimation
PD = pseudo-distance
Graphically:

Figure 14: Solving the system in the 2D plane

H: host
L1: landmark 1
L2: landmark 2
L3: landmark 3
    D1 = sqrt((X_L1 - X_h)² + (Y_L1 - Y_h)²) ................... (2)

From (1) and (2):

    PD1 = sqrt((X_L1 - X_h)² + (Y_L1 - Y_h)²) + α · ΔT0 ........ (A)

Similarly for the other two landmarks:

    PD2 = sqrt((X_L2 - X_h)² + (Y_L2 - Y_h)²) + α · ΔT0 ........ (B)
    PD3 = sqrt((X_L3 - X_h)² + (Y_L3 - Y_h)²) + α · ΔT0 ........ (C)
We need to linearize (A), (B) and (C) to solve them. Using the Taylor series:

    f(x) = Σ_{n=0..∞} f⁽ⁿ⁾(a) (x - a)ⁿ / n!

    f(x) = f(x0) + f'(x0)(x - x0)/1! + f''(x0)(x - x0)²/2! + ...

Considering only the first-order part:

    f(x) = f(x0) + f'(x0)(x - x0)

Putting (x - x0) = Δx:

    f(x) = f(x0) + f'(x0) · Δx ................................. (3)
Hence, to compute the true value of X an initial estimate x0 is required; this is
obtained by simple trilateration. We know that

    H_x = X_est + ΔX ........................................... (D)
    H_y = Y_est + ΔY ........................................... (D)

Also

    DE_i,est = sqrt((L_ix - X_est)² + (L_iy - Y_est)²) ......... (4)

From (3) and (4):

    PD_i = DE_i,est + d(DE_i,est)/dX · ΔX + d(DE_i,est)/dY · ΔY + α · ΔT0

After carrying out the partial differentiation:

    PD_i = DE_i,est + ((X_est - X_Li)/DE_i,est) · ΔX + ((Y_est - Y_Li)/DE_i,est) · ΔY + α · ΔT0

Now we need to solve for (ΔX, ΔY, ΔT0):
    [ PD1 - DE_1,est ]   [ (X_est - X_L1)/DE_1,est   (Y_est - Y_L1)/DE_1,est   α ]   [ ΔX  ]
    [ PD2 - DE_2,est ] = [ (X_est - X_L2)/DE_2,est   (Y_est - Y_L2)/DE_2,est   α ] × [ ΔY  ]
    [ PD3 - DE_3,est ]   [ (X_est - X_L3)/DE_3,est   (Y_est - Y_L3)/DE_3,est   α ]   [ ΔT0 ]

The solution of this system is substituted into (D) to obtain new estimates:
(H_x, H_y) becomes the new estimated position, and the process is repeated until the
corrections become negligible.
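A sketch of this iterative solution in the plane is given below (hypothetical coordinates and a hypothetical alpha; the real tool works with geographic coordinates). Each pass linearizes equations (A)-(C) around the current estimate and solves the 3×3 system for (ΔX, ΔY, ΔT0):

```java
public class Multilateration {
    // Iteratively refine the host position from 3 landmarks.
    // lx, ly: landmark coordinates; pd: pseudo-distances; (x0, y0): initial estimate.
    // Each pass solves the linearized system
    //   PDi - DEi = ((x-XLi)/DEi)*dX + ((y-YLi)/DEi)*dY + alpha*dT0
    static double[] locate(double[] lx, double[] ly, double[] pd,
                           double alpha, double x0, double y0, int iters) {
        double x = x0, y = y0;
        for (int k = 0; k < iters; k++) {
            double[][] a = new double[3][3];
            double[] b = new double[3];
            for (int i = 0; i < 3; i++) {
                double de = Math.hypot(x - lx[i], y - ly[i]);   // DE_i,est
                a[i][0] = (x - lx[i]) / de;
                a[i][1] = (y - ly[i]) / de;
                a[i][2] = alpha;                                 // d(PD)/d(dT0)
                b[i] = pd[i] - de;
            }
            double[] d = solve3x3(a, b);                         // (dX, dY, dT0)
            x += d[0];
            y += d[1];
        }
        return new double[]{x, y};
    }

    // Cramer's rule for a 3x3 linear system.
    static double[] solve3x3(double[][] a, double[] b) {
        double det = det3(a[0][0], a[0][1], a[0][2], a[1][0], a[1][1], a[1][2], a[2][0], a[2][1], a[2][2]);
        double dx  = det3(b[0], a[0][1], a[0][2], b[1], a[1][1], a[1][2], b[2], a[2][1], a[2][2]);
        double dy  = det3(a[0][0], b[0], a[0][2], a[1][0], b[1], a[1][2], a[2][0], b[2], a[2][2]);
        double dt  = det3(a[0][0], a[0][1], b[0], a[1][0], a[1][1], b[1], a[2][0], a[2][1], b[2]);
        return new double[]{dx / det, dy / det, dt / det};
    }

    static double det3(double a, double b, double c, double d, double e, double f,
                       double g, double h, double i) {
        return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g);
    }

    public static void main(String[] args) {
        // Hypothetical planar example: host at (30, 40), common extra delay dT0 = 2 ms.
        double alpha = 50.0;                                     // km/ms
        double[] lx = {0, 100, 0}, ly = {0, 0, 100};
        double hx = 30, hy = 40, dT0 = 2.0;
        double[] pd = new double[3];
        for (int i = 0; i < 3; i++)
            pd[i] = Math.hypot(hx - lx[i], hy - ly[i]) + alpha * dT0;
        double[] est = locate(lx, ly, pd, alpha, 50, 50, 10);
        System.out.printf("(%.2f, %.2f)%n", est[0], est[1]);     // close to (30, 40)
    }
}
```

Note that the common extra-delay term ΔT0 is re-estimated from scratch on every pass; only the position accumulates across iterations.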
6.2 Confidence Estimates
If we can calculate the error in the alpha estimate, we can calculate the error in
distance and hence in the location estimate. A rough calculation of the error in alpha
for North America is as follows: for each point calculate alpha = distance/minRTT,
then calculate the median and inter-quartile range (IQR) of the alphas. In the following
case study we obtained median = 46.61 and IQR = 15.31. This is illustrated by the
graphs below.
We also looked at how alpha varies with distance. It appears that using a single alpha
may fail for small RTTs (distances). This is probably because the minRTT is dependent
on router delays and on where networks peer, so the routes may be very different from
the Great Circle Distance (GCD). Looking at the data of alpha vs. minRTT from SLAC,
there appear to be two regimes separated at ~8 ms: above 8 ms alpha is ~50 km/ms,
below it alpha is more variable and closer to ~10 km/ms (see the alpha-vs-rtt tab).
One might be able to use a power law to represent the full range with somewhat
less accuracy. For this data the median alpha is ~46.5 km/ms and the IQR ~15.6 km/ms,
i.e. IQR/median ~ 33%, or ~ ±16%.
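The median/IQR calculation can be sketched as follows (hypothetical alpha samples; the linear-interpolation percentile used here is one of several common conventions):

```java
import java.util.Arrays;

public class AlphaStats {
    // Linear-interpolation percentile over a sorted array (p in [0, 1]).
    static double percentile(double[] sorted, double p) {
        double idx = p * (sorted.length - 1);
        int lo = (int) Math.floor(idx), hi = (int) Math.ceil(idx);
        return sorted[lo] + (sorted[hi] - sorted[lo]) * (idx - lo);
    }

    public static void main(String[] args) {
        // Hypothetical per-point alpha = distance/minRTT samples, in km/ms.
        double[] alphas = {35.2, 41.0, 44.7, 46.6, 48.1, 52.3, 60.9};
        Arrays.sort(alphas);
        double median = percentile(alphas, 0.50);
        double iqr = percentile(alphas, 0.75) - percentile(alphas, 0.25);
        System.out.printf("median=%.2f IQR=%.2f%n", median, iqr);
    }
}
```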
Figure 15: Alpha vs. distance from SLAC (log-log scale; power-law fit y = 3.3609x^0.3301, R² = 0.567)
The above graph shows how alpha varies with distance. Using a single alpha may fail
for small RTTs (distances). This is probably since the min_RTT is dependent on router
delays and where networks peer, so the routes may be very different from the Great
Circle Distance (GCD), and the min_RTT may be only partially dependent on
distance.
Figure 16: Alpha vs. min_RTT from SLAC (log-log scale; power-law fit y = 14.026x^0.2593, R² = 0.1861)
Looking at the graph above of alpha vs. min_RTT from SLAC, there appear to be two
regimes separated at ~8 ms: above 8 ms alpha is ~50 km/ms, below it alpha is more
variable and closer to ~10 km/ms. One can use a smooth power function to predict
alpha given the min_RTT, as shown in the figure. However it is as easy, and probably
more accurate, to use the two values.
Figure 17: Histogram of min RTT from SLAC, TRIUMF, BNL and Florida (ms) to Canadian, US and Mexican remote hosts (frequency and cumulative %)
6.3 Landmark Tiering Analysis
Another approach towards effective landmark selection is a tiering
approach. First we probe a target host with a few selected level-zero
landmarks; once the tier has been decided, we can use the level-one and level-two
landmarks of that tier to increase accuracy. Following is a case study of the feasibility
of the tiering approach for the North American region.
The purpose of this study is to investigate the effectiveness of tiering for
TULIP (i.e. we have a set of primary tier-0 landmarks which narrow the
target location down to a particular region, and then a denser set of secondary tier-1
landmarks in the discovered region that can be used to get more accurate results). The
use of tiering should enable us to reduce the network traffic (the number of landmarks
pinging a target) while retaining the accuracy of using all landmarks.
Currently we are looking into two regions where we hope to use tiering, i.e.
North America and Europe. These two regions were chosen because we have a
relatively large number of landmarks in them. Landmarks in all other regions
are considered tier-0. Thus TULIP will try to locate targets in two passes. The
first uses only tier-0 landmarks. If the target then appears to be in North America or
Europe, a second pass is made using all the landmarks in that region; otherwise
the target location is determined from the first pass. The purpose of the current study is
to determine whether tiering is effective and how it should be used.
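The two-pass scheme above can be sketched as follows (the region bounding boxes and the Geolocator hook are illustrative assumptions, standing in for the real region test and the multilateration step described earlier):

```java
import java.util.List;
import java.util.Map;

public class TieredLocator {
    // Hypothetical hook standing in for the multilateration step; returns (lon, lat).
    interface Geolocator {
        double[] multilaterate(List<String> landmarks, String target);
    }

    // Pass 1: coarse fix using only tier-0 landmarks.
    // Pass 2: if the coarse fix falls in a region with a denser tier-1 set,
    //         refine using that set; otherwise keep the pass-1 result.
    static double[] locate(String target, List<String> tier0,
                           Map<String, List<String>> tier1ByRegion, Geolocator geo) {
        double[] coarse = geo.multilaterate(tier0, target);
        List<String> dense = tier1ByRegion.get(regionOf(coarse));
        if (dense == null) return coarse;               // no tier-1 set: keep pass 1
        return geo.multilaterate(dense, target);        // pass 2: regional refinement
    }

    // Crude region classifier on (lon, lat); the boundaries are illustrative only.
    static String regionOf(double[] p) {
        double lon = p[0], lat = p[1];
        if (lat > 15 && lat < 75 && lon > -170 && lon < -50) return "NorthAmerica";
        if (lat > 35 && lat < 72 && lon > -10 && lon < 40)   return "Europe";
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(regionOf(new double[]{-122.0, 37.4}));   // NorthAmerica
    }
}
```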
We selected 4 landmarks in North America for our study (SLAC, BNL,
Ampath-Florida, and TRIUMF-Canada). They were chosen to be reliable and to be
located roughly at the 4 corners of the region. We pinged a set of target sites at known
geographical locations in the US, Canada and Mexico; the coordinates of the sites
were obtained from the PingER database. We also calculated the physical distances
between the target sites and the landmarks. The figure below shows the distribution. It
appeared that all hosts were within 80 ms of one of the 4 monitors, and 98.84%
were within 70 ms.
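The physical distances between sites can be computed as great circle distances with the haversine formula; a sketch with approximate SLAC and BNL coordinates (assumed values):

```java
public class GreatCircle {
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle (haversine) distance between two (lat, lon) points, in degrees.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double p1 = Math.toRadians(lat1), p2 = Math.toRadians(lat2);
        double dp = Math.toRadians(lat2 - lat1), dl = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dp / 2) * Math.sin(dp / 2)
                 + Math.cos(p1) * Math.cos(p2) * Math.sin(dl / 2) * Math.sin(dl / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // Approximate SLAC (37.42N, 122.20W) to BNL (40.87N, 72.87W): roughly 4,200 km.
        System.out.printf("%.0f km%n", distanceKm(37.42, -122.20, 40.87, -72.87));
    }
}
```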
Figure 18: Histogram of MinRTT for Canada, US and Mexico
Figure 19: MinRTT and distance as seen from SLAC
Figure 20: MinRTT and distance as seen from BNL
Figure 21: MinRTT and distance as seen from Florida
Figure 22: MinRTT and distance as seen from Canada
As we can see from the first graph, some sites are replicated (e.g. DNS root name
servers, Yahoo), others are not at a permanent location (e.g. supercomp moves
annually with the location of the Supercomputing show) or use strange routing (e.g.
the site labeled 134.79 is a SLAC host located elsewhere, though the routing goes
through SLAC). We ignore such hosts in our results, though we note that this analysis
has clearly pointed out such anomalies. It is also seen that some values for Mexico are
aberrant, for example a minRTT > 200 ms for a Mexican site from Florida. It is
therefore better not to include Mexico in this region, which will consequently comprise
Canada and the US only.
Figure 23: Alpha vs. distance as seen from SLAC
The above graph shows how alpha varies with distance. Using a single alpha may fail
for small RTTs (distances). This is probably since the min_RTT is dependent on router
delays and where networks peer, so the routes may be very different from the Great
Circle Distance (GCD), and the min_RTT may be only partially dependent on
distance.
Figure 24: Alpha vs. minRTT as seen from SLAC
Figure 25: Histogram of alpha from SLAC to North America
Looking at the graphs above of alpha vs. min_RTT from SLAC, there appear to be
two regimes separated at ~8 ms. Above 8 ms alpha is ~50 km/ms; below it alpha is
more variable and closer to ~10 km/ms. One can use a smooth power function to
predict alpha given the min_RTT, as shown in the figure. However it is as easy, and
probably more accurate, to use the two values.
We also calculated the median alpha from SLAC, which is ~46.5 km/ms, with IQR
~15.6 km/ms, i.e. IQR/median ~ 33% or ~ ±16%.
TULIP uses the maximum alpha value in order to avoid underestimation; this is ~
62.98 km/ms in our case. TULIP also calculates alpha from PingER historical data
for all the world's regions, though in this study we only used the North American PingER
sites. Looking at the frequency histogram of alpha from the SLAC landmark to the North
American target sites, this appears to be a reasonable maximum value.
Chapter 5
IMPLEMENTATION
7.1 INTRODUCTION
In this chapter the implementation details of TULIP are discussed, from the
logical design to class details and screenshots. The system architecture of TULIP
leads directly to its implementation.
7.2 TOOLS USED
The tools used in implementing TULIP are:
1. The Java programming language
2. Java Web Start
3. The JFreeChart API for graphs
4. The Perl scripting language
7.2.1 Programming Language Java
The Java language from Sun Microsystems is a general-purpose, object-oriented
language. Java is well known for its client/server role, as a language for writing
“applets” that are downloaded from a server and executed locally. It also works well
for everything from traditional single-computer applications to complex distributed
applications. Following are a few features of Java that make it the best choice of
implementation language.
A. Simple
B. Object-Oriented
C. Distributed
D. Robust
E. Secure
F. Architecture-Neutral
G. Portable
H. Interpreted
I. High-Performance
J. Multithreaded
K. Dynamic
7.2.2 Java Web Start
Java Web Start is a software technology that encompasses the portability of
applets, the maintainability of Servlets and Java Server Pages (JSP) technology, and
the simplicity of markup languages such as XML and HTML. It is a Java-based
application that allows full-featured Java 2 client applications to be launched,
deployed, and updated from a standard Web server. Upon launching Java Web Start
for the first time, the user may download new client applications from the Web;
thereafter these applications may be launched either through a link on a Web page or (in Windows) from desktop icons or the Start menu. Applications initialize quickly under Java Web Start, are cached on the client machine, and can be launched offline. Additionally, because Java Web Start is built on Java 2 technology, it
inherits the Java platform's complete security architecture. Since Java Web Start is, in
itself, a Java application, the software is platform independent and can be supported on
any client system that supports the Java 2 platform. Java Web Start performs an update
automatically when a client application is launched, downloading the latest code from
the Web while simultaneously loading the application from a previous cache (provided
that a cache exists). Java Web Start also provides a Java Application Manager utility,
allowing end-users to organize their Java applications as well as providing a variety of
options, such as clearing the cache of downloaded applications, specifying the use of
multiple JREs, and setting HTTP proxies.
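Deployment via Java Web Start is driven by a JNLP descriptor hosted on the web server. A minimal sketch is shown below; the codebase URL, file names and main class are hypothetical, not taken from the TULIP sources:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<jnlp spec="1.0+" codebase="http://example.org/tulip" href="tulip.jnlp">
  <information>
    <title>TULIP</title>
    <vendor>SLAC / NIIT</vendor>
    <offline-allowed/>
  </information>
  <security>
    <!-- needed if the client runs ping/traceroute and writes local logs -->
    <all-permissions/>
  </security>
  <resources>
    <j2se version="1.4+"/>
    <jar href="tulip.jar"/>
  </resources>
  <application-desc main-class="tulip.Main"/>
</jnlp>
```

With this file on the server, each client launch checks for a newer `tulip.jar` and falls back to the cached copy when offline.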
5.2.3 JFreeChart API for Graphs
JFreeChart is a popular open source Java charting library that can generate
most common chart types, including pie, bar, line, and Gantt charts. In addition, the
JFreeChart API supports many interactive features, such as tool tips and zooming.
JFreeChart provides an excellent choice for developers who need to add charts to
Swing- or Web-based applications.
5.2.4 Perl Scripting Language
Practical Extraction and Reporting Language is an open source server side
programming language extensively used for web scripts and to process data passed via
the Common Gateway Interface from HTML forms etc. Perl scripts are not embedded
within HTML pages and do not download to the web browser but reside on the server.
They execute by being triggered from commands within HTML pages or other scripts
and may produce HTML output that is sent to the web browser. Programs written in Perl are called Perl scripts, whereas perl refers to the interpreter that executes them. Perl is implemented as an interpreted (not compiled) language, so a Perl script tends to use more CPU time than a corresponding C program; on the other hand, computers keep getting faster, and writing something in Perl instead of C tends to save development time.
5.3 PACKAGES
tulip
tulip.core
tulip.core.pingEngine
tulip.core.LocationAlgo
tulip.core.traceEngine
tulip.GUI
tulip.stats
tulip.util
tulip.util.logger
Figure 24: Packages
5.4 CLASS DIAGRAMS
5.4.1 Class Diagram of the GUI Package
Figure 25: GUI Package
5.4.2 Class Diagram of the Util Package
[Class diagram: package tulip.util.log with classes LocalLogger, Static_Log, ServerLogger, ServerPath and Log_Text]
Figure 26: Util Package
5.4.3 Class Diagram of the Core Package
[Class diagram: package tulip.core with classes GetHostInfo, GetNetGeoData, Octant, RSSParser and Mon (fields site, minRTT, maxRTT, avgRTT; methods Delay(), Alpha(), Confidence(), lat(), lon()); sub-package tulip.core.TraceEngine with classes GetTraceData, TraceThread and TraceParser]
Figure 27: Core Package
5.4.4 Class Diagram of the Stats Package
[Class diagram: package tulip.stats with classes BarChart, AvgPlot, MinFixAnalysis, PlotAllStats, MaxPlot, MinPlot and LogPlot]
Figure 28: Stats Package
5.5 SCREEN SHOTS
Figure 29: Alpha Optimization
Figure 30: TULIP Main Frame
Figure 31: Various Plots to aid in analysis
Figure 32: TULIP Visualization. Plots the entire scenario on the globe
Figure 33: Log File Analysis to help decide which host to block.
Figure 34: Updating the site configuration file from the TULIP client.
5.6 TULIP Configuration Files
TULIP requires a few external configuration files and a set of scripts to manage them. These scripts and files are deployed on a central web server. The files include:
1. Site List – configuration file containing data about all landmarks.
2. LOG – TULIP access logs used to identify malicious hosts.
3. Block List – a list of blocked hosts.
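The text does not specify the Site List format, so as an illustration we assume one landmark per line as a simple comma-separated record of hostname, latitude and longitude (all names below are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class SiteList {
    // One landmark entry; field names are assumptions, not TULIP's actual schema.
    record Landmark(String host, double lat, double lon) {}

    // Parses the assumed "host,lat,lon" format, skipping blanks and comments.
    static List<Landmark> parse(List<String> lines) {
        List<Landmark> out = new ArrayList<>();
        for (String line : lines) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            String[] f = line.split(",");
            out.add(new Landmark(f[0].trim(),
                    Double.parseDouble(f[1].trim()),
                    Double.parseDouble(f[2].trim())));
        }
        return out;
    }
}
```

Keeping the file on a central web server lets every client fetch the same landmark set at start-up, so updates (Figure 34) propagate without redeploying the client.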
5.7 TULIP Security Concerns
TULIP could be used to launch a DoS attack on a host by making several landmarks probe the host simultaneously. TULIP ensures that no client is allowed to do so, using the block list and LOG file analysis: if a client makes too many attempts at a host, it is blocked, and the TULIP client also prevents continuous attempts on the same host.
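The block-list check can be sketched as an in-memory attempt counter; this is a hypothetical simplification of the LOG-plus-block-list mechanism described above (the real implementation analyzes server-side log files):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ProbeGuard {
    private final int maxAttempts;
    private final Map<String, Integer> attempts = new HashMap<>();
    private final Set<String> blocked = new HashSet<>();

    ProbeGuard(int maxAttempts) { this.maxAttempts = maxAttempts; }

    // Returns true if the probe is allowed; once a client exceeds the
    // threshold for one target, that client-target pair is blocked.
    boolean allowProbe(String clientIp, String targetIp) {
        String key = clientIp + "->" + targetIp;
        if (blocked.contains(key)) return false;
        int n = attempts.merge(key, 1, Integer::sum); // count this attempt
        if (n > maxAttempts) {
            blocked.add(key); // add to the block list
            return false;
        }
        return true;
    }
}
```

A persistent variant would write the `blocked` set to the central Block List file so that all landmarks refuse the offending client.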
Chapter 6
PERFORMANCE EVALUATION AND RESULTS
6.1 INTRODUCTION
This section describes the performance evaluation and testing carried out to validate the functionality, accuracy and performance of TULIP's IP geolocation. TULIP has been tested as a whole (black-box testing), and the individual modules were also tested in isolation. Results were drawn from the testing of each module.
6.2 TEST BED
The aim is to get better coverage of landmarks and an improved algorithm to cater for the increased delays in less developed regions. Using location data from about 600 PingER hosts in over 100 countries, we validate the accuracy and applicability of TULIP, and also use it to identify misplaced hosts in the PingER database.
6.3 PERFORMANCE EVALUATION
The following graphs show the distances between PingER location data and the estimates of various geolocation approaches. These graphs allow us to compare the agreement between PingER and the different geolocation techniques, giving a measure of the accuracy of each approach while also pointing out errors in the PingER database.
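The distances plotted below are great-circle distances between two latitude/longitude pairs. A standard way to compute them is the haversine formula (class and method names here are illustrative, not from the TULIP sources):

```java
public class GeoDistance {
    static final double EARTH_RADIUS_KM = 6371.0; // mean Earth radius

    // Great-circle distance (km) between two lat/lon points, via haversine;
    // used to compare a PingER location with an approach's estimate.
    static double km(double lat1, double lon1, double lat2, double lon2) {
        double p1 = Math.toRadians(lat1), p2 = Math.toRadians(lat2);
        double dp = Math.toRadians(lat2 - lat1), dl = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dp / 2) * Math.sin(dp / 2)
                 + Math.cos(p1) * Math.cos(p2) * Math.sin(dl / 2) * Math.sin(dl / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // one degree of longitude at the equator is roughly 111 km
        System.out.println(km(0, 0, 0, 1));
    }
}
```

Each point in the graphs below is then simply `km(pingerLat, pingerLon, estimateLat, estimateLon)` for one host.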
[Plot: Dist(Gutheri, HostInfo) for each site; x-axis: site (0–180), y-axis: distance (km), 0–18,000]
Figure 35: Host Info
Location data for almost 170 hosts was retrieved from HostInfo. Almost 140 hosts agreed with the PingER database; for 37 hosts the distance was huge, ranging from 500 km to 10,000 km. The main reason is the log-based nature of HostInfo.
[Plot: Distance(Gutheri, GeoIP) for each site; x-axis: site (0–400), y-axis: distance (km), 0–18,000]
Figure 36: Geo IP
Location data for almost 380 hosts was retrieved from GeoIP, the most widely used geolocation solution. 253 hosts agreed with the PingER database; the remaining distances ranged from 500 km to 10,000 km. As above, the main reason for these disagreements is the log-based nature of the approach.
[Plot: per-host comparison of Distance GeoIP, Distance TULIP and Distance Host Info; x-axis: individual PingER hosts, y-axis: distance (km), 0–18,000]
Figure 37: TULIP Results
The above graph compares the TULIP, GeoIP and HostInfo results. HostInfo and GeoIP have already been discussed. TULIP shows slight disagreements at the beginning; these are due to small errors in the landmark location information maintained in the TULIP configuration file. The peak differences, ranging from 5,000 km to 13,000 km, cover 34 sites; these were verified as errors in the PingER database and were corrected. For the remaining hosts, the location estimates can only be improved by careful landmark selection, either through the confidence-estimates approach or the tiering approach. It has been consistently observed that nearby landmarks may overestimate slightly, causing an error in the location estimate, whereas distant landmarks can make very accurate estimates. If we can identify these accurate distant landmarks, TULIP's accuracy can be further increased.
[Plot: cumulative distribution of TULIP location errors; x-axis: distance (km), 0–20,000; y-axis: cumulative distribution, 0–100%; number of hosts = 285]
Figure 38: Cumulative Distribution Function of Distances
Chapter 7
CONCLUSIONS
The Internet has become a collection of resources meant to appeal to a large
general audience. Although this multitude of information has been a great boon, it also
has diluted the importance of geographically localized information. Offering the ability
for Internet users to garner information based on geographic location can decrease
search times and increase visibility of local establishments. Geolocation by IP address
is the technique of determining a user's geographic latitude, longitude and, by
inference, city, region and nation by comparing the user's public Internet IP address
with known locations of other electronically neighboring servers and routers.
In recent years many efforts have been made at IP geolocation. Most of them are log based and are hence prone to becoming outdated; very few are based on dynamic measurements.
We use delay measurements to dynamically locate a target host from a set of reference hosts (landmarks) that are reliably accessible and whose locations are well known and constant, allowing us to triangulate a large fraction of the world's IP hosts. The delay measurements provide rough estimates of the distances between the landmarks and the target, which are used to geolocate the target. Multilateration, a concept used in [2], is then applied to obtain an estimate of the target host's position. Iterative correction is then applied to this estimate. The approach has been thoroughly tested by applying it to 600 PingER nodes. Since our measurements are dynamic, we can obtain results only for online nodes; at the time of the above-mentioned tests, 285 nodes were responding. The accuracy can be increased further by improving our landmark selection scheme.
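The multilateration step can be sketched in two dimensions. This is a simplification under stated assumptions (planar geometry, exactly three landmarks, exact distance estimates), not TULIP's actual spherical implementation:

```java
public class Multilateration {
    // Landmark i is at (x[i], y[i]) with estimated distance d[i] to the
    // target. Subtracting the first circle equation
    //   (tx - x0)^2 + (ty - y0)^2 = d0^2
    // from the other two cancels the quadratic terms, leaving a 2x2 linear
    // system A * [tx, ty]^T = b, solved here by Cramer's rule.
    static double[] locate(double[] x, double[] y, double[] d) {
        double a11 = 2 * (x[1] - x[0]), a12 = 2 * (y[1] - y[0]);
        double a21 = 2 * (x[2] - x[0]), a22 = 2 * (y[2] - y[0]);
        double b1 = d[0]*d[0] - d[1]*d[1] + x[1]*x[1] - x[0]*x[0] + y[1]*y[1] - y[0]*y[0];
        double b2 = d[0]*d[0] - d[2]*d[2] + x[2]*x[2] - x[0]*x[0] + y[2]*y[2] - y[0]*y[0];
        double det = a11 * a22 - a12 * a21;
        return new double[] { (b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det };
    }
}
```

With noisy, over-estimated distances (the alpha-max bound), the circles do not intersect in a point; real systems instead intersect the constraint regions or minimize the residual over many landmarks, which is where iterative correction and landmark selection matter.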
REFERENCES
[1] R. Percacci and A. Vespignani, "Scale-free behavior of the Internet global performance," The European Physical Journal B – Condensed Matter, vol. 32, no. 4, pp. 411–414, Apr. 2003.
[2] B. Gueye, A. Ziviani, M. Crovella, and S. Fdida, "Constraint-Based Geolocation of Internet Hosts."
[3] C. Bovy, H.T. Mertodimedjo, G. Hooghiemstra, H. Uijterwaal, P. Van Mieghem,
Proceedings of the PAM 2002 Conference, Fort Collins, Colorado, 25-26 March 2002
[4] S. Srinivasan and E. Zegura, "An Empirical Evaluation of Landmark Placement on Internet Coordinate Schemes," Networking and Telecommunications Group, College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA. Email: {sridhar,ewz}@cc.gatech.edu.
[5] A Network Positioning System for the Internet, T. S. Eugene Ng, Rice University,
Hui Zhang, Carnegie Mellon University.
[6] E. Katz-Bassett, J. P. John, A. Krishnamurthy, D. Wetherall, T. Anderson, and Y. Chawathe, "IP Geolocation Using Delay and Topology Measurements."
[7] Google AdWords (2007). Regional and local targeting. Available from: https://adwords.google.com/select/targeting.html.
[8] Davis, C. (2007). DNS LOC: Geo-enabling the Domain Name System. Available from: http://www.ckdhr.com/dns-loc/.
[9] FCC (2007). FCC consumer advisory: VoIP and 911 service. Available from: http://www.fcc.gov/cgb/consumerfacts/voip911.html.
[10] Govindan, R. and Tangmunarunkit, H. (2000). Heuristics for internet map discovery. In IEEE INFOCOM 2000, pages 1371–1380, Tel Aviv, Israel. IEEE. Available from: citeseer.ist.psu.edu/govindan00heuristics.html.
[11] Gueye, B., Ziviani, A., Crovella, M., and Fdida, S. (2004). Constraint-based geolocation of internet hosts. In IMC '04: Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, pages 288–293, New York, NY, USA. ACM Press. Available from: http://portal.acm.org/citation.cfm?id=1028828.
[12] Katz-Bassett, E., John, J. P., Krishnamurthy, A., Wetherall, D., Anderson, T., and Chawathe, Y. (2006). Towards IP geolocation using delay and topology measurements. In IMC '06: Proceedings of the 6th ACM SIGCOMM conference on Internet measurement, pages 71–84, New York, NY, USA. ACM Press. Available from: http://portal.acm.org/citation.cfm?id=1177090.
[13] Mack, T. (2007). Open geo coordinates database (OpenGeoDB). Available from: sourceforge.net/projects/opengeodb/.
[14] MaxMind (2007). MaxMind LLC GeoIP database. Available from: http://www.maxmind.com/app/ip-location.
[15] MaxMind (2007). MaxMind minFraud whitepaper. Available from: http://www.maxmind.com/MaxMind_minFraud_Overview.pdf.
[16] Padmanabhan, V. N. and Subramanian, L. (2001). An investigation of geographic
mapping techniques for internet hosts. Proceedings of SIGCOMM'2001, page 13.
Available from: citeseer.ist.psu.edu/padmanabhan01investigation.html.
[16] Plazes (2007). Plazes website. Available from: http://www.plazes.com.
[17] Quova (2007). Compliance. Available from:
http://www.quova.com/page.php?id=9.
[18] Spring, N., Mahajan, R., and Wetherall, D. (2002). Measuring isp topologies with
rocketfuel. Available from: citeseer.ist.psu.edu/spring02measuring.html.
[19] http://www.advanced.org/IPPM/archive/0246.html
[20] P. Bahl and V.N. Padmanabhan. RADAR: An In-Building RF-Based User
Location and Tracking System. IEEE Infocom, March 2000.
[21] G. Ballintijn, M. van Steen, and A.S. Tanenbaum. Characterizing Internet
Performance to Support Wide-area Application Development. Operating Systems
Review, 34(4):41-47, October 2000.
[22] L. S. Brakmo, S. W. O'Malley, and L. L. Peterson. TCP Vegas: New Techniques
for Congestion Detection and Avoidance. ACM SIGCOMM, August 1994.
[23] K. Cheverst, N. Davies, K. Mitchell, and A. Friday. Experiences of Developing
and Deploying a Context-Aware Tourist Guide: The GUIDE project. ACM Mobicom,
August 2000.
[24] R. Droms, Dynamic Host Configuration Protocol, RFC-1531, IETF, October 1993.
[25] P. Enge and P. Misra, The Global Positioning System, Proc. of the IEEE, January
1999.
[26] R. Govindan and H. Tangmunarunkit. Heuristics for Internet Map Discovery.
IEEE Infocom, March 2000.
[27] K. Harrenstien, M. Stahl, E. Feinler, NICKNAME/ WHOIS, RFC-954, IETF,
October 1985.
[28] Andy Harter and Andy Hopper. A Distributed Location System for the Active Office. IEEE Network, Vol. 8, No. 1, January 1994.
[29] A. Harter, A. Hopper, P. Steggles, A. Ward, and P. Webster, The Anatomy of a
Context-Aware Application, ACM Mobicom, August 1999.
[30] V. Jacobson, Traceroute software, 1989, ftp://ftp.ee.lbl.gov/traceroute.tar.gz
[31] B. Krishnamurthy and J. Wang. On Network Aware Clustering of Web Clients.
ACM SIGCOMM, August 2000.
[32] J.Y. Li et al. A Scalable Location Service for Geographic Ad-Hoc Routing. ACM
Mobicom, August 2000.
[33] D. Moore et al. Where in the World is netgeo.caida.org? INET 2000, June 2000.
[34] R. Periakaruppan, E. Nemeth. GTrace – A Graphical Traceroute Tool. Usenix LISA, Nov. 1999.
[35] U. Raz. How to find a host's geographic location. (private.org)
http://www.private.org.il/IP2geo.html
[36] D. C. Vixie, P. Goodwin and T. Dickinson. A Means for Expressing Location
Information in the Domain Name System, RFC-1876, IETF, January 1996.
[37] BBNPlanet publicly available route server, telnet://ner-routes.bbnplanet.net.
[38] BCentral. http://www.bcentral.com/
[39] Digital Island Inc. http://www.digitalisland.com/
[40] DoubleClick, http://www.doubleclick.com/
[41] Internet2. http://www.internet2.org/
[42] IP to Latitude/Longitude Server, University of Illinois http://cello.cs.uiuc.edu/cgi-
bin/slamm/ip2ll
[43] List of Airport Codes. http://www.mapping.com/airportcodes.html
[44] MERIT Network, http://www.merit.edu/
[45] NeoTrace, A Graphical Traceroute Tool
http://www.neoworx.com/products/neotrace/default.asp
[46] U.S. Gazetteer, U.S. Census Bureau, http://www.census.gov/cgi-bin/gazetteer.
[47] VisualRoute, Visualware Inc., http://www.visualroute.com/