TULIP
TRILATERATION UTILITY FOR LOCATING IP ADDRESSES – A DELAY BASED SOLUTION FOR IP GEOLOCATION
Project Team
Dr. Les Cottrell [SLAC, USA] Umar Kalim [NIIT, PAK]
Shahryar Khan [NIIT, PAK] Faran Javed [NIIT, PAK]
ABSTRACT
We are currently witnessing tremendous expansion in network deployment. As
companies have realized the cost benefits and productivity gains of network
technology, they have added new networks and expanded existing ones almost as
rapidly as new network technologies and products are introduced. The problems
associated with this expansion affect both day-to-day network operation
management and strategic network growth planning, and each new network technology
requires its own set of experts. An urgent need has therefore arisen for automated network
management (including what is typically called network capacity planning) integrated
across diverse environments.
In such an increasingly connected world, geolocation of a target host based on
its name or address can, for example, provide critical information for security, content
delivery and Internet traffic studies. Knowledge of the physical location of a user
with an assigned IP address is currently exploited in applications ranging from credit
card fraud protection to online advertising. However, most industrial approaches to
assigning IP addresses or address ranges to a geographic location rely on manually
maintained databases, which may contain wrong or outdated information.
One cannot assume the location of a host is given by its name (e.g. its Top
Level Domain); for example, server hosts associated with a web site may have proxies
overseas. While the approach is not useful for all IP addresses (tunnel endpoints or mobile
nodes, for example), most IP addresses can be traced automatically to their location
with an inaccuracy of several hundred kilometers. This might appear high at first,
but given that, for example, credit card fraud protection only needs to know the
country in which the user is currently located, it is a tolerable inaccuracy. Several
studies have developed different mechanisms for automatic geolocation: using a
set of servers with known locations to triangulate the IP address, using location
information provided by the target, or using topology hints in the router naming scheme.
We use delay measurements to dynamically locate a target host from a set of
reference hosts (landmarks) that are reliably accessible and whose locations are well
known and constant, and can thereby triangulate a large fraction of the world's IP hosts.
The delay measurements provide rough estimates of the distances between the landmarks
and the target, which are used to geolocate the target. Multilateration, a concept used in
[bamba], is then applied to pinpoint the target host.
Delay increases with geographical distance, but we may observe additive delay
for a number of reasons, such as a greater number of hops, congested paths, indirect
routing, etc. The first step is to convert the delay measurements to distance estimates
based on the speed of light in fiber, perform multilateration, and avoid miscalculations
caused by the over-estimation that the additive delay introduces. The more challenging
part is obtaining “good” landmarks that provide good connectivity to a large fraction of
the world. Although CBG [bamba] provides a fairly accurate estimate for sites in the
developed world, especially in North America and Europe, the level of accuracy drops
dramatically as we move outside these regions. The aim is to obtain better landmark
coverage, and an improved algorithm to cater for the increased delays in the less
developed regions.
Using location data from about 600 PingER hosts in over 100 countries, we
shall validate the accuracy and applicability of TULIP, and also use it to identify
misplaced hosts in the PingER database.
Chapter 1
INTRODUCTION
Computers are fulfilling an increasingly diversified set of tasks in our society.
They take on many key jobs and assist us in managing numerous tasks in our
daily life; in one form or another, they process and provide us with information. Many
researchers agree that the information age is rapidly replacing the industrial era.
There have been three major stages in human history. The first was the
agrarian stage, whose economy depended upon agricultural activity with little
knowledge input. The industrial revolution, which is now coming to an end, was the
first major change; the economic success of that era was based on mass production and
product quality. The third era is the age of information technology and services. In this
era, information and knowledge are becoming the focus of economic activity across the
entire world. This phase has changed economic, political and cultural values on a
global scale, forcing state-controlled economies towards free economies oriented to
abundant supply and diversity. Global information systems, such as the Internet, are no
longer just pathways for digital data; rather, they generate and process information to
create knowledge for commercial activity, education, business and research. For the
first time in the history of humankind, non-material resources such as software and
information services have become the new raw materials and the real wealth of a
knowledge-based society.
Now that we have discussed the new era, namely Information age, we need to
explore what new opportunities and challenges are emerging in the near future.
Information and communication are two of the most important challenges for the
success of every enterprise. While today nearly every organization uses a substantial
number of computers and communication tools (telephones, fax, and personal
handheld devices), they are often still isolated. While managers today are able to use
the newest applications, many departments still do not communicate and much needed
information cannot be readily accessed.
To overcome these obstacles to the effective use of information technology,
computer networks are necessary. They are a new kind (one might call it a new
paradigm) of organization of computer systems, produced by the need to merge
computers and communications. At the same time they are the means to converge the
two areas: the unnecessary distinction between tools to process and store information
and tools to collect and transport information can disappear. Computer networks can
break down the barriers between information held on several (not only computer)
systems. Only with the help of computer networks can a borderless communication
and information environment be built.
Networks can only be used to their maximum efficiency if we can manage
them effectively and measure their states. There is an increasing need for high-
performance monitoring to set expectations, provide planning and trouble-shooting
information, and provide steering for applications. To quantify and help bridge the
Digital Divide, enable world-wide collaborations, and reach out to scientists
worldwide, it is imperative to continue and extend monitoring coverage to all countries
with HENP programs and significant scientific enterprises. The PingER project was
initiated to meet these requirements.
PingER uses ICMP to gather measurement data and also considers the physical
location of the host. For example, a host in Australia may give a constant delay of
400 ms when measured from SLAC; then, suddenly, the delay increases from 400 ms
to 600 ms. It appears that the network condition has got worse, but a detailed manual
analysis shows that there is nothing wrong with the network: the host has moved
to some other geographical location.
1.1 Need for a Dynamic Automated IP geolocation Tool
The Internet has become a collection of resources meant to appeal to a large
general audience. Although this multitude of information has been a great boon, it also
has diluted the importance of geographically localized information. Offering the ability
for Internet users to garner information based on geographic location can decrease
search times and increase visibility of local establishments. Similarly, user
communities and chat-rooms can be enhanced through knowing the locations (and
therefore, local times, weather conditions and news events) of their members as they
roam the globe. It is possible to provide user services in applications and Web sites
without the need for users to carry GPS receivers or even to know where they
themselves are.
Geolocation by IP address is the technique of determining a user's geographic latitude,
longitude and, by inference, city, region and nation by comparing the user's public
Internet IP address with known locations of other electronically neighboring servers
and routers.
IP addresses are not static: they can be reassigned, relocated within the same
provider, or forwarded using mechanisms like VPNs or other tunnels. A database
of geolocation information about IPs therefore has to be updated and
maintained on a regular basis. Due to the large and increasing number of IP
addresses in use, this is a task that is nearly impossible to maintain manually.
The most popular approach to geolocation is maintaining a manually created
database, like "GeoIP" from MaxMind [14]; a stripped-down version of their
commercially available database is given out for free with several tools and APIs, and
is quite popular within the open source community. Several other free libraries like
[13] exist, too. As already described above, this is a rather
problematic approach: employing paid workers to maintain such a database is very costly.
Using community-created databases like [13], or databases gathered through Web 2.0
communities like [16], minimizes the costs but increases the risk of accidentally or
intentionally entered wrong data.
In recent years many efforts have been made at IP geolocation. Most of them
are log based and are hence prone to becoming outdated. Very few are based on dynamic
measurements. Below we discuss a few that use delay measurements for
distance estimation and are hence in accordance with our proposed objectives.
1.1.1- Bamba
CBG (Constraint Based Geolocation) transforms delay measurements into
geographic distance constraints, and then uses multilateration to infer the geolocation
of the target host. CBG is able to assign a confidence region to each location
estimate, which allows a location-aware application to assess whether the estimate
is sufficiently accurate for its needs.
1.1.2- NetGeo
A project by CAIDA (the Cooperative Association for Internet Data Analysis).
The project is quite a few years old. A database is maintained that contains the
geographical latitude and longitude of ASNs (Autonomous System Numbers).
1.1.3- ICS (Internet Coordinate System)
These schemes work by using a small set of nodes called “landmarks” to compute
the coordinates of other hosts. The principal component analysis (PCA) technique is
used to extract topological information from delay measurements between
beacon hosts. Based on PCA, a transformation method is devised that projects the
distance data space into a new coordinate system of (much) smaller dimension. The
transformation retains as much topological information as possible, yet enables end
hosts to easily determine their locations in the coordinate system.
1.1.4- GeoIP
This technique uses end users' input. It acquires data from various sites that
require users to register for service, and based on this data it estimates the
location of an ASN. This technique is ideal for e-commerce applications but fails for
intermediate and backbone routers, mainly because routers belonging
to one AS may be in different countries. Hence for scientific study GeoIP is not
the ideal tool; for example, if we are to find the geographical location of a bottleneck,
it may be an intermediate router.
Chapter 2
DELAY BASED IP GEOLOCATION
In this chapter we shall discuss the issues involved in delay based IP geolocation.
4.1 On Delay Based IP Geolocation
The location of an IP address can be determined using two methodologies: “ping”, by
using minimum round trip time estimates and triangulating the results to obtain an
approximate location, and “whois”. We are exploring the first technique, also called
delay based IP geolocation. The main idea is to use RTT measurements from
multiple sites (landmarks) to triangulate the position of a remote host that is being
pinged. The basic tool is a script that takes RTT measurements from
landmarks at known latitudes and longitudes to a remote host (at an unknown location)
specified by the user, and figures out the latitude and longitude of the remote host.
TULIP (Trilateration Utility for Locating IP addresses) has been developed to utilize
PingER data in order to verify IP locations. Using TULIP, PingER would be able to
detect anomalies like:
1. A set of measured RTTs to a given host that do not agree on where the remote
host is located.
2. Measurements that do agree on where a remote host is located, but where that
location differs from the "known" latitude and longitude of the host in the database.
As part of this, a ping utility has been put together and is being distributed to known
places around the world, e.g. PingER monitoring hosts, which then serve as
landmark sites.
Given this infrastructure, a web accessible tool has been designed where the user
enters the name or IP address of the host whose location is to be determined. The tool
then gets the landmark sites to make RTT measurements, collects these measurements,
analyzes the data, presents the results and, on demand, makes traceroutes to the remote
host from selected landmark sites. The difficulty of getting the ping tool deployed in
enough places can be addressed by obtaining RTT estimates from reverse traceroute
servers around the world. While implementing such a technique, several factors
have to be taken into account:
• Bandwidth Links can have capacity bandwidths that differ by many orders of
magnitude (e.g. 1 Mbit/s versus 10 Gbit/s). For a 1 Mbit/s link, the time to
clock a 100 byte packet onto the link is about 0.8 ms. This is roughly the time
for light to propagate through a 100 mile (~161 km) fiber¹. It is thus not
possible to have a single conversion factor from RTT to distance that works for
all links at all distances.
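The 0.8 ms figure can be checked with a short calculation (a sketch; the propagation speed of ~200,000 km/s, about 2/3 of the speed of light, is an assumption consistent with the rule of thumb used later for delay-to-distance conversion):

```java
public class DelayCheck {
    // Time to clock a packet onto a link, in milliseconds.
    static double serializationMs(int packetBytes, double linkBitsPerSec) {
        return packetBytes * 8 / linkBitsPerSec * 1000.0;
    }

    // One-way light propagation time through fiber, in milliseconds,
    // assuming ~200,000 km/s (about 2/3 c) in fiber.
    static double propagationMs(double km) {
        return km / 200_000.0 * 1000.0;
    }

    public static void main(String[] args) {
        // 100-byte packet on a 1 Mbit/s link: 0.8 ms to serialize.
        System.out.printf("serialization: %.3f ms%n", serializationMs(100, 1e6));
        // Light through 161 km (~100 miles) of fiber: roughly the same time.
        System.out.printf("propagation:   %.3f ms%n", propagationMs(161));
    }
}
```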
• Queuing delay can have major effects on the RTT measurement. Queuing
delays are critically affected by cross-traffic: increased cross-traffic is likely to
increase queuing and thus RTTs. We therefore use the minimum RTT of multiple
measurements, with the understanding that the minimum possible RTT is for
packets that see no queuing. However, if queuing persists for the duration
of the measurement, then the minimum RTT will include a queuing component
and will not accurately reflect the distance.
¹ A canonical average delay per distance for packets traversing well provisioned parts of the Internet is about a factor of two slower than this (i.e. 1 ms per 100 km) due to non great circle routes, multiple hops, etc.
• Routing policy The path a packet takes over the network may change
depending on link/router availability and routing policies. Extra hops in the
path will add router and clocking delays, different hops will have
different delays, and different links will have different capacities and distances.
All of this will change the minimum RTT when a route changes. Further, the
peering may be such that the route between a landmark and a selected host is
not direct. For example, the route from one landmark in San Jose to SLAC (20
miles away) goes via New York. Thus one may need to
compare the results from a cluster of landmarks that are close together and
choose the minimum RTT.
• Bottleneck bandwidth is the available bandwidth of the link in the path that at
the measurement moment has the lowest available bandwidth. This may be the
link with the lowest static capacity or it may be a heavily loaded link with the
currently minimum available bandwidth (~ link capacity – link utilization). The
link utilization changes as a function of time so the bottleneck magnitude may
vary and its location may move from link to link.
• Mis-configuration of network devices such as routers and switches can cause
unexpected delays. First of all mis-configurations can cause a router or link to
fail. Secondly the alternative route may be poorly chosen (e.g. have much
lower capacity or much longer delays).
• Traffic conditions may vary enormously as a function of time. For example,
on Monday mornings typically there is a surge in traffic as people come to
work; also there is usually much more traffic during peak work hours compared
to weekends and nights. Such increased traffic often leads to congestion,
increased queuing and thus increases in RTT.
Links in Canada, the USA, Europe and Japan are generally of high capacity, e.g.
100 Mbit/s or 1 Gbit/s end host network interfaces, 1 or 10 Gbit/s backbone links,
and >= 100 Mbit/s border router links. The routes also tend to be well defined with
optimum backup paths. Thus the distances derived from the minimum RTT values in
those regions are generally accurate.
4.2 Delay to Distance Conversion
TULIP uses solely delay measurements to estimate the distance of a site from each
landmark. Many factors have an impact on the end-to-end delay, the dominant one
being that digital information travels in fiber at about 0.6 times the speed of light
in vacuum [1]. This gives us a very convenient rule of 1 ms of RTT per 100 km of
cable. Other factors affect RTT measurements and may cause an
additive delay: data packets may follow a circuitous path, many links follow
highways and railways, etc. [3]. There is also a per-hop processing and/or queuing
delay due to cross traffic, summing to a few milliseconds [3]. Since the
maximum delay along a path is variable and cannot be predicted, we use
the rule mentioned in [bamba] to estimate the geographical distance. Hence we observe
only additive distortion when estimating the distance.
To minimize this additive distortion we further optimized the alpha value. We
took a tighter upper bound based on the geographical locations of the landmarks, and a
suitable value of alpha was selected for each landmark. An automated script to
determine alpha values was written and packaged with TULIP. The methodology was
as follows:
Figure 12: Alpha optimization
A graph of minimum RTT against distance in km is plotted. The red dots represent
the actual distances plotted against the observed minimum RTT. The blue line
represents the distance estimated using alpha = 100; the large over-estimation is
clear. The green line represents the tightest upper bound, obtained by
determining the equation of this line.
Using the slope-intercept form we have

    Y = mX + b

Here b = 0, since the line passes through the origin. The line should lie above the
point with the highest value of the ratio

    distance in km / min RTT in ms

and this ratio becomes the slope of the line. Mathematically, drawing a line from the
origin to the above mentioned point with coordinates (x1, y1):

    m = (Y2 - Y1) / (X2 - X1)

Putting X2 and Y2 equal to 0 (the origin), we have

    m = Y1 / X1

where, as stated earlier, x1 is the minRTT and y1 is the distance in km. The equation
of the line becomes

    Y = m × X, i.e. distance = m × minRTT

I.e. m (the slope of the line) is the alpha.
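The slope computation above can be sketched as follows (a minimal sketch with hypothetical sample values; zero min-RTT samples, which occasionally appear in the historical data, are discarded since they make the ratio undefined):

```java
public class AlphaEstimator {
    // Tightest upper bound: the largest distance/minRTT ratio over all
    // landmark-target pairs becomes the slope (alpha) of the bounding
    // line through the origin. Pairs with a zero min RTT are discarded.
    static double tightestAlpha(double[] distanceKm, double[] minRttMs) {
        double alpha = 0.0;
        for (int i = 0; i < distanceKm.length; i++) {
            if (minRttMs[i] <= 0.0) continue;   // skip invalid zero-RTT samples
            alpha = Math.max(alpha, distanceKm[i] / minRttMs[i]);
        }
        return alpha;
    }

    public static void main(String[] args) {
        double[] d   = { 500.0, 1200.0, 3000.0, 150.0 };   // hypothetical distances (km)
        double[] rtt = {  12.0,   30.0,   55.0,   0.0 };   // hypothetical min RTTs (ms); last is invalid
        // Largest ratio is 3000/55 ~ 54.5 km/ms
        System.out.println(tightestAlpha(d, rtt));
    }
}
```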
4.3 Using PingER historical data for alpha optimization
To further optimize the alpha we use data gathered by PingER monitoring sites.
Location information about each site is retrieved from rss.xml.
As before, a scatter plot of distance versus delay is drawn and the tightest
upper bound is computed as the best alpha. When collecting historical data, some sites
with a geographical distance not equal to zero gave a minRTT of 0.000, which makes
alpha computation impossible; all such zero values were discarded.
Chapter 3
SYSTEM ARCHITECTURE
In this chapter we discuss the system architecture of TULIP. TULIP mainly
consists of three sub-modules; in the last part of this chapter the integration of the
modules and future design concerns are discussed.
TULIP is not only an application but also offers a comprehensive API so that other
programs may use its geolocation algorithm. This API is being used by GridOS, which
is being developed at NIIT in collaboration with UWE.
Figure 13: Block Diagram of TULIP Modules
TULIP mainly consists of three modules:
1- Probing Engine
2- Core
3- Visualization
5.1 Probing Engine
This module is responsible for making all the measurements. It issues ping and
traceroute requests to all landmarks, collects the data from these landmarks, parses
the raw output, and populates the tables which are further processed for calculation.
5.2 Core
This module carries out all the calculation and analysis. The trilateration
algorithm, alpha correction and all other statistical calculations are carried out by this
module.
5.3 Visualization
This module plots the results of the core visually. It is responsible for
displaying statistical calculations as graphs and for plotting the landmarks and the host
on the globe after their positions have been calculated by the core.
Chapter 4
IP GEOLOCATION ALGORITHM
In the previous chapter we presented the overall system architecture and
explained the main modules of TULIP. In this chapter we discuss the
algorithm created for IP geolocation.
6.1 The Location Algorithm
MinRTT = propagation delay + extra delay (due to circuitous routes)

    ΔT_measured = ΔT + ΔT0

Pseudo-distance:

    PD = ΔT_measured · α

Actual distance:

    D = ΔT · α

so

    PD = (ΔT + ΔT0) · α
    PD = D + ΔT0 · α ........................................... (1)

where:
D = actual distance from the landmark
c = speed of light
α = X · c, i.e. the speed of digital information in fiber optic cable
X = the fraction of c at which digital information travels in fiber optic cable
ΔT = actual propagation delay along the great circle route/path
ΔT0 = the extra delay causing overestimation
PD = pseudo-distance
Graphically:

Figure 14: Solving the system in the 2D plane

H: host
L1: landmark 1
L2: landmark 2
L3: landmark 3
    D1 = sqrt((X_L1 - X_h)² + (Y_L1 - Y_h)²) ................... (2)

From (1) and (2):

    PD1 = sqrt((X_L1 - X_h)² + (Y_L1 - Y_h)²) + α · ΔT0 ........ (A)

Similarly for the other two landmarks:

    PD2 = sqrt((X_L2 - X_h)² + (Y_L2 - Y_h)²) + α · ΔT0 ........ (B)
    PD3 = sqrt((X_L3 - X_h)² + (Y_L3 - Y_h)²) + α · ΔT0 ........ (C)
We need to linearize (A), (B) and (C) to solve them. Using the Taylor series:

    f(x) = Σ_{n=0..∞} f⁽ⁿ⁾(a) (x - a)ⁿ / n!

    f(x) = f(x0) + f'(x0)(x - x0)/1! + f''(x0)(x - x0)²/2! + ...

Considering only the first-order part:

    f(x) = f(x0) + f'(x0)(x - x0)

Putting (x - x0) = Δx:

    f(x) = f(x0) + f'(x0) · Δx ................................. (3)
Hence, to compute the true value of X an initial estimate x0 is required; this is
obtained by simple trilateration. We know that

    H_x = X_est + ΔX ........................................... (D)
    H_y = Y_est + ΔY ........................................... (D)

Also

    DE_i,est = sqrt((L_ix - X_est)² + (L_iy - Y_est)²) ......... (4)

From (3) and (4):

    PD_i = DE_i,est + d(DE_i,est)/dX · ΔX + d(DE_i,est)/dY · ΔY + α · ΔT0

After carrying out the partial differentiation:

    PD_i = DE_i,est + ((X_est - X_Li)/DE_i,est) · ΔX + ((Y_est - Y_Li)/DE_i,est) · ΔY + α · ΔT0

Now we need to solve for (ΔX, ΔY, ΔT0):
    [ PD1 - DE_1,est ]   [ (X_est - X_L1)/DE_1,est   (Y_est - Y_L1)/DE_1,est   α ]   [ ΔX  ]
    [ PD2 - DE_2,est ] = [ (X_est - X_L2)/DE_2,est   (Y_est - Y_L2)/DE_2,est   α ] × [ ΔY  ]
    [ PD3 - DE_3,est ]   [ (X_est - X_L3)/DE_3,est   (Y_est - Y_L3)/DE_3,est   α ]   [ ΔT0 ]

The solution of this system is substituted into (D) to obtain new estimates:
(H_x, H_y) becomes the new estimated position, and the process is repeated until the
corrections become negligible.
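A sketch of this iterative solution in the plane is given below (hypothetical coordinates and a hypothetical alpha; the real tool works with geographic coordinates). Each pass linearizes equations (A)-(C) around the current estimate and solves the 3×3 system for (ΔX, ΔY, ΔT0):

```java
public class Multilateration {
    // Iteratively refine the host position from 3 landmarks.
    // lx, ly: landmark coordinates; pd: pseudo-distances; (x0, y0): initial estimate.
    // Each pass solves the linearized system
    //   PDi - DEi = ((x-XLi)/DEi)*dX + ((y-YLi)/DEi)*dY + alpha*dT0
    static double[] locate(double[] lx, double[] ly, double[] pd,
                           double alpha, double x0, double y0, int iters) {
        double x = x0, y = y0;
        for (int k = 0; k < iters; k++) {
            double[][] a = new double[3][3];
            double[] b = new double[3];
            for (int i = 0; i < 3; i++) {
                double de = Math.hypot(x - lx[i], y - ly[i]);   // DE_i,est
                a[i][0] = (x - lx[i]) / de;
                a[i][1] = (y - ly[i]) / de;
                a[i][2] = alpha;                                 // d(PD)/d(dT0)
                b[i] = pd[i] - de;
            }
            double[] d = solve3x3(a, b);                         // (dX, dY, dT0)
            x += d[0];
            y += d[1];
        }
        return new double[]{x, y};
    }

    // Cramer's rule for a 3x3 linear system.
    static double[] solve3x3(double[][] a, double[] b) {
        double det = det3(a[0][0], a[0][1], a[0][2], a[1][0], a[1][1], a[1][2], a[2][0], a[2][1], a[2][2]);
        double dx  = det3(b[0], a[0][1], a[0][2], b[1], a[1][1], a[1][2], b[2], a[2][1], a[2][2]);
        double dy  = det3(a[0][0], b[0], a[0][2], a[1][0], b[1], a[1][2], a[2][0], b[2], a[2][2]);
        double dt  = det3(a[0][0], a[0][1], b[0], a[1][0], a[1][1], b[1], a[2][0], a[2][1], b[2]);
        return new double[]{dx / det, dy / det, dt / det};
    }

    static double det3(double a, double b, double c, double d, double e, double f,
                       double g, double h, double i) {
        return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g);
    }

    public static void main(String[] args) {
        // Hypothetical planar example: host at (30, 40), common extra delay dT0 = 2 ms.
        double alpha = 50.0;                                     // km/ms
        double[] lx = {0, 100, 0}, ly = {0, 0, 100};
        double hx = 30, hy = 40, dT0 = 2.0;
        double[] pd = new double[3];
        for (int i = 0; i < 3; i++)
            pd[i] = Math.hypot(hx - lx[i], hy - ly[i]) + alpha * dT0;
        double[] est = locate(lx, ly, pd, alpha, 50, 50, 10);
        System.out.printf("(%.2f, %.2f)%n", est[0], est[1]);     // close to (30, 40)
    }
}
```

Note that the common extra-delay term ΔT0 is re-estimated from scratch on every pass; only the position accumulates across iterations.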
6.2 Confidence Estimates
If we can calculate the error in the alpha estimate, we can calculate the error in
distance and hence in the location estimate. A rough calculation of the error in alpha
for North America is as follows: for each point calculate alpha = distance/minRTT,
then calculate the median and inter-quartile range (IQR) of the alphas. In the following
case study we obtained median = 46.61 and IQR = 15.31. This is illustrated by the
graphs below.
We also looked at how alpha varies with distance. It appears that using a single alpha
may fail for small RTTs (distances). This is probably because the minRTT is dependent
on router delays and on where networks peer, so the routes may be very different from
the Great Circle Distance (GCD). Looking at the data of alpha vs. minRTT from SLAC,
there appear to be two regimes separated at ~8 ms: above 8 ms alpha is ~50 km/ms,
below it alpha is more variable and closer to ~10 km/ms (see the alpha-vs-rtt tab).
One might be able to use a power law to represent the full range with somewhat
less accuracy. For this data the median alpha is ~46.5 km/ms and the IQR ~15.6 km/ms,
i.e. IQR/median ~ 33%, or ~ ±16%.
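The median/IQR calculation can be sketched as follows (hypothetical alpha samples; the linear-interpolation percentile used here is one of several common conventions):

```java
import java.util.Arrays;

public class AlphaStats {
    // Linear-interpolation percentile over a sorted array (p in [0, 1]).
    static double percentile(double[] sorted, double p) {
        double idx = p * (sorted.length - 1);
        int lo = (int) Math.floor(idx), hi = (int) Math.ceil(idx);
        return sorted[lo] + (sorted[hi] - sorted[lo]) * (idx - lo);
    }

    public static void main(String[] args) {
        // Hypothetical per-point alpha = distance/minRTT samples, in km/ms.
        double[] alphas = {35.2, 41.0, 44.7, 46.6, 48.1, 52.3, 60.9};
        Arrays.sort(alphas);
        double median = percentile(alphas, 0.50);
        double iqr = percentile(alphas, 0.75) - percentile(alphas, 0.25);
        System.out.printf("median=%.2f IQR=%.2f%n", median, iqr);
    }
}
```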
Figure 15: Alpha vs. distance from SLAC (log-log scale; power-law fit y = 3.3609x^0.3301, R² = 0.567)
The above graph shows how alpha varies with distance. Using a single alpha may fail
for small RTTs (distances). This is probably since the min_RTT is dependent on router
delays and where networks peer, so the routes may be very different from the Great
Circle Distance (GCD), and the min_RTT may be only partially dependent on
distance.
Figure 16: Alpha vs. min_RTT from SLAC (log-log scale; power-law fit y = 14.026x^0.2593, R² = 0.1861)
Looking at the graph above of alpha vs. min_RTT from SLAC, there appear to be two
regimes separated at ~8 ms: above 8 ms alpha is ~50 km/ms, below it alpha is more
variable and closer to ~10 km/ms. One can use a smooth power function to predict
alpha given the min_RTT, as shown in the figure. However it is as easy, and probably
more accurate, to use the two values.
Figure 17: Histogram of min RTT from SLAC, TRIUMF, BNL and Florida (ms) to Canadian, US and Mexican remote hosts (frequency and cumulative %)
6.3 Landmark Tiering Analysis
Another approach towards effective landmark selection is a tiering
approach. First we probe a target host with a few selected level-zero
landmarks; once the tier has been decided, we can use the level-one and level-two
landmarks of that tier to increase accuracy. Following is a case study of the feasibility
of the tiering approach for the North American region.
The purpose of this study is to investigate the effectiveness of tiering for
TULIP (i.e. we have a set of primary tier-0 landmarks which narrow the
target location down to a particular region, and then a denser set of secondary tier-1
landmarks in the discovered region that can be used to get more accurate results). The
use of tiering should enable us to reduce the network traffic (the number of landmarks
pinging a target) while retaining the accuracy of using all landmarks.
Currently we are looking into two regions where we hope to use tiering, i.e.
North America and Europe. These two regions were chosen because we have a
relatively large number of landmarks in them. Landmarks in all other regions
are considered tier-0. Thus TULIP will try to locate targets in two passes. The
first uses only tier-0 landmarks. If the target then appears to be in North America or
Europe, a second pass is made using all the landmarks in that region; otherwise
the target location is determined from the first pass. The purpose of the current study is
to determine whether tiering is effective and how it should be used.
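The two-pass scheme above can be sketched as follows (the region bounding boxes and the Geolocator hook are illustrative assumptions, standing in for the real region test and the multilateration step described earlier):

```java
import java.util.List;
import java.util.Map;

public class TieredLocator {
    // Hypothetical hook standing in for the multilateration step; returns (lon, lat).
    interface Geolocator {
        double[] multilaterate(List<String> landmarks, String target);
    }

    // Pass 1: coarse fix using only tier-0 landmarks.
    // Pass 2: if the coarse fix falls in a region with a denser tier-1 set,
    //         refine using that set; otherwise keep the pass-1 result.
    static double[] locate(String target, List<String> tier0,
                           Map<String, List<String>> tier1ByRegion, Geolocator geo) {
        double[] coarse = geo.multilaterate(tier0, target);
        List<String> dense = tier1ByRegion.get(regionOf(coarse));
        if (dense == null) return coarse;               // no tier-1 set: keep pass 1
        return geo.multilaterate(dense, target);        // pass 2: regional refinement
    }

    // Crude region classifier on (lon, lat); the boundaries are illustrative only.
    static String regionOf(double[] p) {
        double lon = p[0], lat = p[1];
        if (lat > 15 && lat < 75 && lon > -170 && lon < -50) return "NorthAmerica";
        if (lat > 35 && lat < 72 && lon > -10 && lon < 40)   return "Europe";
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(regionOf(new double[]{-122.0, 37.4}));   // NorthAmerica
    }
}
```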
We selected 4 landmarks in North America for our study (SLAC, BNL,
Ampath-Florida, and TRIUMF-Canada). They were chosen to be reliable and to be
located roughly at the 4 corners of the region. We pinged a set of target sites at known
geographical locations in the US, Canada and Mexico; the coordinates of the sites
were obtained from the PingER database. We also calculated the physical distances
between the target sites and the landmarks. The figure below shows the distribution. It
appeared that all hosts were within 80 ms of one of the 4 monitors, and 98.84%
were within 70 ms.
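The physical distances between sites can be computed as great circle distances with the haversine formula; a sketch with approximate SLAC and BNL coordinates (assumed values):

```java
public class GreatCircle {
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle (haversine) distance between two (lat, lon) points, in degrees.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double p1 = Math.toRadians(lat1), p2 = Math.toRadians(lat2);
        double dp = Math.toRadians(lat2 - lat1), dl = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dp / 2) * Math.sin(dp / 2)
                 + Math.cos(p1) * Math.cos(p2) * Math.sin(dl / 2) * Math.sin(dl / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // Approximate SLAC (37.42N, 122.20W) to BNL (40.87N, 72.87W): roughly 4,200 km.
        System.out.printf("%.0f km%n", distanceKm(37.42, -122.20, 40.87, -72.87));
    }
}
```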
Figure 18: Histogram of MinRTT for Canada, US and Mexico
Figure 19: MinRTT and distance as seen from SLAC
Figure 20: MinRTT and distance as seen from BNL
Figure 21: MinRTT and distance as seen from Florida
Figure 22: MinRTT and distance as seen from Canada
As we can see from the first graph, some sites are replicated (e.g. DNS root name
servers, Yahoo), others are not at a permanent location (e.g. supercomp moves
annually with the location of the Supercomputing show) or use strange routing (e.g.
the site labeled 134.79 is a SLAC host located elsewhere, though the routing goes
through SLAC). We ignore such hosts in our results, though we note that this analysis
has clearly pointed out such anomalies. It is also seen that some values for Mexico are
aberrant, for example a minRTT > 200 ms for a Mexican site from Florida. It is
therefore better not to include Mexico in this region, which will consequently comprise
Canada and the US only.
Figure 23: Alpha vs. distance as seen from SLAC
The above graph shows how alpha varies with distance. Using a single alpha may fail
for small RTTs (distances). This is probably since the min_RTT is dependent on router
delays and where networks peer, so the routes may be very different from the Great
Circle Distance (GCD), and the min_RTT may be only partially dependent on
distance.
Figure 24: Alpha vs. minRTT as seen from SLAC
Figure 25: Histogram of alpha from SLAC to North America
Looking at the graphs above of alpha vs. min_RTT from SLAC, there appear to be
two regimes separated at ~8 ms. Above 8 ms alpha is ~50 km/ms; below it alpha is
more variable and closer to ~10 km/ms. One can use a smooth power function to
predict alpha given the min_RTT, as shown in the figure. However it is as easy, and
probably more accurate, to use the two values.
We also calculated the median alpha from SLAC, which is ~46.5 km/ms, with IQR
~15.6 km/ms, i.e. IQR/median ~ 33% or ~ ±16%.
TULIP uses the maximum alpha value in order to avoid underestimation; this is ~
62.98 km/ms in our case. TULIP also calculates alpha from PingER historical data
for all the world's regions, though in this study we only used the North American PingER
sites. Looking at the frequency histogram of alpha from the SLAC landmark to the North
American target sites, this appears to be a reasonable maximum value.
Chapter 5
IMPLEMENTATION
7.1 INTRODUCTION
In this chapter the implementation details of TULIP are discussed, from the
logical design to class details and screenshots. The system architecture of TULIP
leads directly to its implementation.
7.2 TOOLS USED
The tools used in implementing TULIP are:
1. The Java programming language
2. Java Web Start
3. The JFreeChart API for graphs
4. The Perl scripting language
7.2.1 Programming Language Java
The Java language from Sun Microsystems is a general-purpose, object-oriented
language. Java is well known for its client/server role, as a language for writing
“applets” that are downloaded from a server and executed locally. It also works well
for everything from traditional single-computer applications to complex distributed
applications. Following are a few features of Java that make it the best choice of
implementation language.
A. Simple
B. Object-Oriented
C. Distributed
D. Robust
E. Secure
F. Architecture-Neutral
G. Portable
H. Interpreted
I. High-Performance
J. Multithreaded
K. Dynamic
7.2.2 Java Web Start
Java Web Start is a software technology that encompasses the portability of
applets, the maintainability of Servlets and Java Server Pages (JSP) technology, and
the simplicity of markup languages such as XML and HTML. It is a Java-based
application that allows full-featured Java 2 client applications to be launched,
deployed, and updated from a standard Web server. Upon launching Java Web Start
for the first time, the user may download new client applications from the Web;
thereafter these applications may be launched either through a link on a Web page or (in Windows) from desktop icons or the Start menu. Applications initialize quickly under Java Web Start, are cached on the client machine, and can be launched offline. Additionally, because Java Web Start is built on Java 2 technology, it
inherits the Java platform's complete security architecture. Since Java Web Start is, in
itself, a Java application, the software is platform independent and can be supported on
any client system that supports the Java 2 platform. Java Web Start performs an update
automatically when a client application is launched, downloading the latest code from
the Web while simultaneously loading the application from a previous cache (provided
that a cache exists). Java Web Start also provides a Java Application Manager utility,
allowing end-users to organize their Java applications as well as providing a variety of
options, such as clearing the cache of downloaded applications, specifying the use of
multiple JREs, and setting HTTP proxies.
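Deployment via Java Web Start is driven by a JNLP descriptor hosted on the web server. A minimal sketch is shown below; the codebase URL, file names and main class are hypothetical, not taken from the TULIP sources:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<jnlp spec="1.0+" codebase="http://example.org/tulip" href="tulip.jnlp">
  <information>
    <title>TULIP</title>
    <vendor>SLAC / NIIT</vendor>
    <offline-allowed/>
  </information>
  <security>
    <!-- needed if the client runs ping/traceroute and writes local logs -->
    <all-permissions/>
  </security>
  <resources>
    <j2se version="1.4+"/>
    <jar href="tulip.jar"/>
  </resources>
  <application-desc main-class="tulip.Main"/>
</jnlp>
```

With this file on the server, each client launch checks for a newer `tulip.jar` and falls back to the cached copy when offline.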
5.2.3 JFreeChart API for Graphs
JFreeChart is a popular open source Java charting library that can generate
most common chart types, including pie, bar, line, and Gantt charts. In addition, the
JFreeChart API supports many interactive features, such as tool tips and zooming.
JFreeChart provides an excellent choice for developers who need to add charts to
Swing- or Web-based applications.
5.2.4 Perl Scripting Language
Practical Extraction and Reporting Language is an open source server side
programming language extensively used for web scripts and to process data passed via
the Common Gateway Interface from HTML forms etc. Perl scripts are not embedded
within HTML pages and do not download to the web browser but reside on the server.
They execute by being triggered from commands within HTML pages or other scripts
and may produce HTML output that is sent to the web browser. Programs written in Perl are called Perl scripts, whereas perl refers to the interpreter that executes them. Perl is implemented as an interpreted (not compiled) language, so a Perl script tends to use more CPU time than a corresponding C program; on the other hand, computers keep getting faster, and writing something in Perl instead of C tends to save development time.
5.3 PACKAGES
tulip
tulip.core
tulip.core.pingEngine
tulip.core.LocationAlgo
tulip.core.traceEngine
tulip.GUI
tulip.stats
tulip.util
tulip.util.logger
Figure 24: Packages
5.4 CLASS DIAGRAMS
5.4.1 Class Diagram of the GUI Package
Figure 25: GUI Package
5.4.2 Class Diagram of the Util Package
[Class diagram: package tulip.util.log with classes LocalLogger, Static_Log, ServerLogger, ServerPath and Log_Text]
Figure 26: Util Package
5.4.3 Class Diagram of the Core Package
[Class diagram: package tulip.core with classes GetHostInfo, GetNetGeoData, Octant, RSSParser and Mon (fields site, minRTT, maxRTT, avgRTT; methods Delay(), Alpha(), Confidence(), lat(), lon()); sub-package tulip.core.TraceEngine with classes GetTraceData, TraceThread and TraceParser]
Figure 27: Core Package
5.4.4 Class Diagram of the Stats Package
[Class diagram: package tulip.stats with classes BarChart, AvgPlot, MinFixAnalysis, PlotAllStats, MaxPlot, MinPlot and LogPlot]
Figure 28: Stats Package
5.5 SCREEN SHOTS
Figure 29: Alpha Optimization
Figure 30: TULIP Main Frame
Figure 31: Various Plots to aid in analysis
Figure 32: TULIP Visualization. Plots the entire scenario on the globe
Figure 33: Log File Analysis to help decide which host to block.
Figure 34: Updating the site configuration file from the TULIP client.
5.6 TULIP Configuration Files
TULIP requires a few external configuration files and a set of scripts to manage them. These scripts and files are deployed on a central web server. The files include:
1. Site List – configuration file containing data about all landmarks.
2. LOG – TULIP access logs used to identify malicious hosts.
3. Block List – a list of blocked hosts.
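The text does not specify the Site List format, so as an illustration we assume one landmark per line as a simple comma-separated record of hostname, latitude and longitude (all names below are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class SiteList {
    // One landmark entry; field names are assumptions, not TULIP's actual schema.
    record Landmark(String host, double lat, double lon) {}

    // Parses the assumed "host,lat,lon" format, skipping blanks and comments.
    static List<Landmark> parse(List<String> lines) {
        List<Landmark> out = new ArrayList<>();
        for (String line : lines) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            String[] f = line.split(",");
            out.add(new Landmark(f[0].trim(),
                    Double.parseDouble(f[1].trim()),
                    Double.parseDouble(f[2].trim())));
        }
        return out;
    }
}
```

Keeping the file on a central web server lets every client fetch the same landmark set at start-up, so updates (Figure 34) propagate without redeploying the client.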
5.7 TULIP Security Concerns
TULIP could be used to launch a DoS attack on a host by making several landmarks probe the host simultaneously. TULIP ensures that no client is allowed to do so, using the block list and LOG file analysis: if a client makes too many attempts at a host, it is blocked, and the TULIP client also prevents continuous attempts on the same host.
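The block-list check can be sketched as an in-memory attempt counter; this is a hypothetical simplification of the LOG-plus-block-list mechanism described above (the real implementation analyzes server-side log files):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ProbeGuard {
    private final int maxAttempts;
    private final Map<String, Integer> attempts = new HashMap<>();
    private final Set<String> blocked = new HashSet<>();

    ProbeGuard(int maxAttempts) { this.maxAttempts = maxAttempts; }

    // Returns true if the probe is allowed; once a client exceeds the
    // threshold for one target, that client-target pair is blocked.
    boolean allowProbe(String clientIp, String targetIp) {
        String key = clientIp + "->" + targetIp;
        if (blocked.contains(key)) return false;
        int n = attempts.merge(key, 1, Integer::sum); // count this attempt
        if (n > maxAttempts) {
            blocked.add(key); // add to the block list
            return false;
        }
        return true;
    }
}
```

A persistent variant would write the `blocked` set to the central Block List file so that all landmarks refuse the offending client.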
Chapter 6
PERFORMANCE EVALUATION AND RESULTS
6.1 INTRODUCTION
This section describes the performance evaluation and testing carried out to validate the functionality, accuracy and performance of TULIP's IP geolocation. TULIP has been tested as a whole (black-box testing), and the individual modules were also tested in isolation. Results were drawn from the testing of each module.
6.2 TEST BED
The aim is to get better coverage of landmarks and an improved algorithm to cater for the increased delays in less developed regions. Using location data from about 600 PingER hosts in over 100 countries, we validate the accuracy and applicability of TULIP, and also use it to identify misplaced hosts in the PingER database.
6.3 PERFORMANCE EVALUATION
The following graphs show the distances between PingER location data and the estimates of various geolocation approaches. These graphs allow us to compare the agreement between PingER and the different geolocation techniques, giving a measure of the accuracy of each approach while also pointing out errors in the PingER database.
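The distances plotted below are great-circle distances between two latitude/longitude pairs. A standard way to compute them is the haversine formula (class and method names here are illustrative, not from the TULIP sources):

```java
public class GeoDistance {
    static final double EARTH_RADIUS_KM = 6371.0; // mean Earth radius

    // Great-circle distance (km) between two lat/lon points, via haversine;
    // used to compare a PingER location with an approach's estimate.
    static double km(double lat1, double lon1, double lat2, double lon2) {
        double p1 = Math.toRadians(lat1), p2 = Math.toRadians(lat2);
        double dp = Math.toRadians(lat2 - lat1), dl = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dp / 2) * Math.sin(dp / 2)
                 + Math.cos(p1) * Math.cos(p2) * Math.sin(dl / 2) * Math.sin(dl / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // one degree of longitude at the equator is roughly 111 km
        System.out.println(km(0, 0, 0, 1));
    }
}
```

Each point in the graphs below is then simply `km(pingerLat, pingerLon, estimateLat, estimateLon)` for one host.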
[Plot: Dist(Gutheri, HostInfo) for each site; x-axis: site (0–180), y-axis: distance (km), 0–18,000]
Figure 35: Host Info
Location data for almost 170 hosts was retrieved from HostInfo. Almost 140 hosts agreed with the PingER database; for 37 hosts the distance was huge, ranging from 500 km to 10,000 km. The main reason is the log-based nature of HostInfo.
[Plot: Distance(Gutheri, GeoIP) for each site; x-axis: site (0–400), y-axis: distance (km), 0–18,000]
Figure 36: Geo IP
Location data for almost 380 hosts was retrieved from GeoIP, the most widely used geolocation solution. 253 hosts agreed with the PingER database; the remaining distances ranged from 500 km to 10,000 km. As above, the main reason for these disagreements is the log-based nature of the approach.
[Plot: per-host comparison of Distance GeoIP, Distance TULIP and Distance Host Info; x-axis: individual PingER hosts, y-axis: distance (km), 0–18,000]
Figure 37: TULIP Results
The above graph compares the TULIP, GeoIP and HostInfo results. HostInfo and GeoIP have already been discussed. TULIP shows slight disagreements at the beginning; these are due to small errors in the landmark location information maintained in the TULIP configuration file. The peak differences, ranging from 5,000 km to 13,000 km, cover 34 sites; these were verified as errors in the PingER database and were corrected. For the remaining hosts, the location estimates can only be improved by careful landmark selection, either through the confidence-estimates approach or the tiering approach. It has been consistently observed that nearby landmarks may overestimate slightly, causing an error in the location estimate, whereas distant landmarks can make very accurate estimates. If we can identify these accurate distant landmarks, TULIP's accuracy can be further increased.
[Plot: cumulative distribution of TULIP location errors; x-axis: distance (km), 0–20,000; y-axis: cumulative distribution, 0–100%; number of hosts = 285]
Figure 38: Cumulative Distribution Function of Distances
Chapter 7
CONCLUSIONS
The Internet has become a collection of resources meant to appeal to a large
general audience. Although this multitude of information has been a great boon, it also
has diluted the importance of geographically localized information. Offering the ability
for Internet users to garner information based on geographic location can decrease
search times and increase visibility of local establishments. Geolocation by IP address
is the technique of determining a user's geographic latitude, longitude and, by
inference, city, region and nation by comparing the user's public Internet IP address
with known locations of other electronically neighboring servers and routers.
In recent years many efforts have been made at IP geolocation. Most of them are log based and are hence prone to becoming outdated; very few are based on dynamic measurements.
We use delay measurements to dynamically locate a target host from a set of reference hosts (landmarks) that are reliably accessible and whose locations are well known and constant, allowing us to triangulate a large fraction of the world's IP hosts. The delay measurements provide rough estimates of the distances between the landmarks and the target, which are used to geolocate the target. Multilateration, a concept used in [2], is then applied to obtain an estimate of the target host's position. Iterative correction is then applied to this estimate. The approach has been thoroughly tested by applying it to 600 PingER nodes. Since our measurements are dynamic, we can obtain results only for online nodes; at the time of the above-mentioned tests, 285 nodes were responding. The accuracy can be increased further by improving our landmark selection scheme.
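The multilateration step can be sketched in two dimensions. This is a simplification under stated assumptions (planar geometry, exactly three landmarks, exact distance estimates), not TULIP's actual spherical implementation:

```java
public class Multilateration {
    // Landmark i is at (x[i], y[i]) with estimated distance d[i] to the
    // target. Subtracting the first circle equation
    //   (tx - x0)^2 + (ty - y0)^2 = d0^2
    // from the other two cancels the quadratic terms, leaving a 2x2 linear
    // system A * [tx, ty]^T = b, solved here by Cramer's rule.
    static double[] locate(double[] x, double[] y, double[] d) {
        double a11 = 2 * (x[1] - x[0]), a12 = 2 * (y[1] - y[0]);
        double a21 = 2 * (x[2] - x[0]), a22 = 2 * (y[2] - y[0]);
        double b1 = d[0]*d[0] - d[1]*d[1] + x[1]*x[1] - x[0]*x[0] + y[1]*y[1] - y[0]*y[0];
        double b2 = d[0]*d[0] - d[2]*d[2] + x[2]*x[2] - x[0]*x[0] + y[2]*y[2] - y[0]*y[0];
        double det = a11 * a22 - a12 * a21;
        return new double[] { (b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det };
    }
}
```

With noisy, over-estimated distances (the alpha-max bound), the circles do not intersect in a point; real systems instead intersect the constraint regions or minimize the residual over many landmarks, which is where iterative correction and landmark selection matter.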
REFERENCES
[1] R. Percacci and A. Vespignani, "Scale-free behavior of the Internet global performance," The European Physical Journal B – Condensed Matter, vol. 32, no. 4, pp. 411–414, Apr. 2003.
[2] B. Gueye, A. Ziviani, M. Crovella, and S. Fdida, "Constraint-Based Geolocation of Internet Hosts."
[3] C. Bovy, H.T. Mertodimedjo, G. Hooghiemstra, H. Uijterwaal, P. Van Mieghem,
Proceedings of the PAM 2002 Conference, Fort Collins, Colorado, 25-26 March 2002
[4] S. Srinivasan and E. Zegura, "An Empirical Evaluation of Landmark Placement on Internet Coordinate Schemes," Networking and Telecommunications Group, College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA. Email: {sridhar,ewz}@cc.gatech.edu.
[5] A Network Positioning System for the Internet, T. S. Eugene Ng, Rice University,
Hui Zhang, Carnegie Mellon University.
[6] E. Katz-Bassett, J. P. John, A. Krishnamurthy, D. Wetherall, T. Anderson, and Y. Chawathe, "IP Geolocation Using Delay and Topology Measurements."
[7] Google AdWords (2007). Regional and local targeting. Available from: https://adwords.google.com/select/targeting.html.
[8] Davis, C. (2007). DNS LOC: Geo-enabling the Domain Name System. Available from: http://www.ckdhr.com/dns-loc/.
[9] FCC (2007). FCC consumer advisory: VoIP and 911 service. Available from: http://www.fcc.gov/cgb/consumerfacts/voip911.html.
[10] Govindan, R. and Tangmunarunkit, H. (2000). Heuristics for internet map discovery. In IEEE INFOCOM 2000, pages 1371–1380, Tel Aviv, Israel. IEEE. Available from: citeseer.ist.psu.edu/govindan00heuristics.html.
[11] Gueye, B., Ziviani, A., Crovella, M., and Fdida, S. (2004). Constraint-based geolocation of internet hosts. In IMC '04: Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, pages 288–293, New York, NY, USA. ACM Press. Available from: http://portal.acm.org/citation.cfm?id=1028828.
[12] Katz-Bassett, E., John, J. P., Krishnamurthy, A., Wetherall, D., Anderson, T., and Chawathe, Y. (2006). Towards IP geolocation using delay and topology measurements. In IMC '06: Proceedings of the 6th ACM SIGCOMM conference on Internet measurement, pages 71–84, New York, NY, USA. ACM Press. Available from: http://portal.acm.org/citation.cfm?id=1177090.
[13] Mack, T. (2007). Open geo coordinates database (OpenGeoDB). Available from: sourceforge.net/projects/opengeodb/.
[14] MaxMind (2007). MaxMind LLC GeoIP database. Available from: http://www.maxmind.com/app/ip-location.
[15] MaxMind (2007). MaxMind minFraud whitepaper. Available from: http://www.maxmind.com/MaxMind_minFraud_Overview.pdf.
[16] Padmanabhan, V. N. and Subramanian, L. (2001). An investigation of geographic
mapping techniques for internet hosts. Proceedings of SIGCOMM'2001, page 13.
Available from: citeseer.ist.psu.edu/padmanabhan01investigation.html.
[16] Plazes (2007). Plazes website. Available from: http://www.plazes.com.
[17] Quova (2007). Compliance. Available from:
http://www.quova.com/page.php?id=9.
[18] Spring, N., Mahajan, R., and Wetherall, D. (2002). Measuring isp topologies with
rocketfuel. Available from: citeseer.ist.psu.edu/spring02measuring.html.
[19] http://www.advanced.org/IPPM/archive/0246.html
[20] P. Bahl and V.N. Padmanabhan. RADAR: An In-Building RF-Based User
Location and Tracking System. IEEE Infocom, March 2000.
[21] G. Ballintijn, M. van Steen, and A.S. Tanenbaum. Characterizing Internet
Performance to Support Wide-area Application Development. Operating Systems
Review, 34(4):41-47, October 2000.
[22] L. S. Brakmo, S. W. O'Malley, and L. L. Peterson. TCP Vegas: New Techniques
for Congestion Detection and Avoidance. ACM SIGCOMM, August 1994.
[23] K. Cheverst, N. Davies, K. Mitchell, and A. Friday. Experiences of Developing
and Deploying a Context-Aware Tourist Guide: The GUIDE project. ACM Mobicom,
August 2000.
[24] R. Droms, Dynamic Host Configuration Protocol, RFC-1531, IETF, October 1993.
[25] P. Enge and P. Misra, The Global Positioning System, Proc. of the IEEE, January
1999.
[26] R. Govindan and H. Tangmunarunkit. Heuristics for Internet Map Discovery.
IEEE Infocom, March 2000.
[27] K. Harrenstien, M. Stahl, E. Feinler, NICKNAME/ WHOIS, RFC-954, IETF,
October 1985.
[28] Andy Harter and Andy Hopper. A Distributed Location System for the Active Office. IEEE Network, Vol. 8, No. 1, January 1994.
[29] A. Harter, A. Hopper, P. Steggles, A. Ward, and P. Webster, The Anatomy of a
Context-Aware Application, ACM Mobicom, August 1999.
[30] V. Jacobson, Traceroute software, 1989, ftp://ftp.ee.lbl.gov/traceroute.tar.gz
[31] B. Krishnamurthy and J. Wang. On Network Aware Clustering of Web Clients.
ACM SIGCOMM, August 2000.
[32] J.Y. Li et al. A Scalable Location Service for Geographic Ad-Hoc Routing. ACM
Mobicom, August 2000.
[33] D. Moore et al. Where in the World is netgeo.caida.org? INET 2000, June 2000.
[34] R. Periakaruppan, E. Nemeth. GTrace – A Graphical Traceroute Tool. Usenix LISA, Nov. 1999.
[35] U. Raz. How to find a host's geographic location. (private.org)
http://www.private.org.il/IP2geo.html
[36] D. C. Vixie, P. Goodwin and T. Dickinson. A Means for Expressing Location
Information in the Domain Name System, RFC-1876, IETF, January 1996.
[37] BBNPlanet publicly available route server, telnet://ner-routes.bbnplanet.net.
[38] BCentral. http://www.bcentral.com/
[39] Digital Island Inc. http://www.digitalisland.com/
[40] DoubleClick, http://www.doubleclick.com/
[41] Internet2. http://www.internet2.org/
[42] IP to Latitude/Longitude Server, University of Illinois http://cello.cs.uiuc.edu/cgi-
bin/slamm/ip2ll
[43] List of Airport Codes. http://www.mapping.com/airportcodes.html
[44] MERIT Network, http://www.merit.edu/
[45] NeoTrace, A Graphical Traceroute Tool
http://www.neoworx.com/products/neotrace/default.asp
[46] U.S. Gazetteer, U.S. Census Bureau, http://www.census.gov/cgi-bin/gazetteer.
[47] VisualRoute, Visualware Inc., http://www.visualroute.com/