Carleton University BCS Honours Project
Peer-to-Peer File Sharing Network Optimisation
Darryl Edward Payne Student: 266137
Supervisor: Dr. Tony White, Computer Science Wednesday, March 31, 2004
Abstract
Peer-to-peer file sharing networks (P2PFSN) appeared after most of the other “killer apps” on the Internet had permanently affixed themselves to our lives. Nevertheless, they are now one of the most common methods of publishing content on the Internet. There are two aspects to these networks: searching for wanted files, and downloading content. This paper focuses on content searching, but touches on content downloading where appropriate. The purpose of this paper is to examine how a mature network, Gnutella, functions, and to consider several relatively simple changes to it which will hopefully create noticeable improvements in its usability. Modifications made to the Gnutella network as part of this project are implemented by extending Phex, an existing open-source client written in the Java programming language. Specific changes to the client include the addition of caching of mirrors for local content, and modifications to the order in which connections to hosts on the network are attempted.
Acknowledgements
I’d like to thank the entire development team at Momentous.ca, past and present, for their support during my preparation of this report. Each one contributed their opinions and ideas, and they all deserve some credit. Specifically, they are Ryan North for his superior combination of grammar and technical knowledge, Roy Hooper for his understanding of the inner workings of the Internet, Amanda Shiga for too many things to list here, Tony Hooper for his engineering point of view, Ben Levac for offering advice that only someone with a view of the world that contrasts with my own can, Norm Ritchie for understanding that I can’t come home after a long day at work and work some more, Kelvin Osborn for his support while I was away from work, Taryn Naidu simply for his presence, Mel Tayler for her unannounced but pleasant visit during my last hours of work on this report, Sheri Adamson for her support ever since first year, Georgiana Badea for her never-ending good humor, and finally Magaly Obas and Nikki Melki, whom I certainly don’t see enough of these days. I’d also like to thank Dr. Tony White for his comments, criticisms, and ideas, without which this report would not be complete. Finally, thanks to my family, friends, fellow students, and everyone else who contributed in some way to my years at Carleton. They number far too many to name here.
Table of Contents
Part 1: P2PFSN: How They Work and Why They Exist
    1.1 Ancient Origins
    1.2 Napster
    1.3 Second Generation Clients
Part 2 – The Internet vs. Peer-to-Peer Networks
    2.1 – Real-world results
    2.2 – Specific Problems
Part 3 – Gnutella
    3.1 - Introduction
    3.2 - The Ultrapeer System
    3.3 – Existing Optimisations
Part 4 – Strategies – Network Layout
    4.1 – P2P networks are built on top of TCP/IP
    4.2 – Current Algorithm and Immediate Goals for Improvement
    4.3 – Swarm Intelligence
    4.4 – Implementation
Part 5 – Caching Strategies
    5.1 The Purpose of Caching in a Peer-to-Peer System
    5.2 – Content Mirrors
    5.3 – Implementation
Part 6: Results
    6.1 – Results of Network Host Cache Changes
    6.2 – Results of File Mirror Site Changes
Part 7: Conclusions and Suggestions for Future Work
Appendix – Contents of Included Disc
List of Figures
FIGURE 1: “HYBRID” PEER-TO-PEER NETWORK LAYOUT SUCH AS NAPSTER
FIGURE 2: EXAMPLE OF PEER-TO-PEER SEARCH TREE – DEPTH OF 3, 30 HOSTS REACHED
FIGURE 3- NETWORK LAYOUT WITH GNUTELLA'S ULTRAPEER SYSTEM
FIGURE 4: NETWORK LAYERS (OSI MODEL)
List of Tables
TABLE 1: MAXIMUM NUMBER OF HOSTS REACHED AT SEARCH DEPTH
TABLE 2 - GNUTELLA PROTOCOL 0.6 MESSAGES
TABLE 3 – SOME POSSIBLE HEURISTICS AVAILABLE FOR GNUTELLA HOSTS
TABLE 4 - MIRRORS AND FILE POPULARITY
Part 1: P2PFSN: How They Work and Why They Exist
1.1 Ancient Origins
Before understanding how peer-to-peer networks work, it’s
important to understand how and why they came to exist.
The Internet was designed and built as a network of peers.
However, if we look at the initial set of widely used
protocols which evolved from it – all still in widespread
use - we see they are based on the same networking pattern:
HTTP (Hypertext Transfer Protocol) for the World Wide Web;
SMTP (Simple Mail Transfer Protocol) for the sending and
distribution of e-mail;
IMAP (Internet Message Access Protocol) and POP (Post
Office Protocol) for the management of received e-mail;
FTP (File Transfer Protocol), which is still among the most
used file sharing systems;
and finally IRC (Internet Relay Chat) for real-time text
messaging (“chat”).
All these protocols (from the average Internet user’s
perspective) appear to be based on the client-server
pattern, where all users are considered clients and all
messages are a two way communication between clients and
the server – or at least pass through the server before
reaching another client. The clients only see each other
if the server tells them there are other clients connected
to it (directly, as part of the protocol specification, in
the case of IRC; or indirectly, through scripts using the
protocol, as in dynamic content through HTTP.)
However, if we look deeper into the systems behind
the servers utilizing some of these protocols, we can begin
to see the roots of Peer-to-Peer file sharing. To see
this, we’ll look more closely at what happens when a user
sends an email. Let’s take the example of Jim (whose
address is [email protected]), who is writing to his
friend Sue ([email protected]). Jim opens up his email
client, and writes the email. When it’s ready to send, his
email client connects to his local SMTP server,
smtp.jimsdomain.com, and sends it a copy of his message,
addressed to [email protected]. smtp.jimsdomain.com will
then connect to smtp.suesdomain.com and ask it to deliver
the message to [email protected]. Alternately, if
smtp.suesdomain.com is not the final destination for an
email addressed to [email protected], but rather
emailtosue.suesdomain.com is, smtp.suesdomain.com can
either choose to forward the email to the correct server
transparently, or notify smtp.jimsdomain.com of the correct
peer it should connect to. The worldwide network of SMTP
servers, of which smtp.jimsdomain.com and
smtp.suesdomain.com are only two, forms a connected
graph of peers (every peer on the network is able to reach,
either directly or indirectly, every other peer), and
can be looked at as one of the first worldwide peer-to-peer
networks. Some of the principles used within SMTP, such as
message forwarding and redirecting between peers, are good
starting points in our exploration of peer-to-peer
networks.
1.2 Napster
[Figure 1: diagram showing Users 1 to 6, server1.napster.com and server2.napster.com, and their searchable databases; the legend distinguishes search paths from download paths.]
FIGURE 1: “Hybrid” peer-to-peer network layout such as Napster

It’s impossible to talk about peer-to-peer networks without
mentioning Napster, whose rapid rise to notoriety and even
more rapid demise were catalysts encouraging widespread use
of peer-to-peer file sharing, and are possibly the main
reasons that peer-to-peer file sharing clients are as
popular on the Internet as they are today. However, it is
important to make the distinction that Napster was not a
peer-to-peer network as we know them today, but rather a
hybrid system* in which a massive number of peers were each
connected to one of several detached central servers at any
time. The servers controlled the connections of clients and
their searching capabilities, while the peers themselves
shared the actual content.
* See Figure 1
Napster appeared on the Internet in early 1999, the idea
of a young programmer, Shawn Fanning.
Napster was the first online file sharing system where no
files were stored at a central server; rather, they
were distributed amongst the actual users of the system.
From the user’s point of view, where Napster differed from
previous ways of accessing files on the Internet was the
sheer quantity of uncensored content that was easily
accessible. Unfortunately, the variety of its content, and
particularly its uncensored content, quickly brought it
problems. In December 1999, the RIAA sued Napster for
copyright infringement, and by July 2001 it had been shut
down completely. However, by the time this happened,
Napster’s user base had reached a magnitude which required
an immense amount of system and bandwidth resources to
support its server-based system.
1.3 Second Generation Clients
In my discussion of Napster so far, I’ve briefly mentioned
the two categories of problems which would influence and
support the development of a new generation of peer-to-peer
file sharing networks.
Firstly, the technical requirements of Napster’s system
demanded a huge cash investment to create and maintain
enough servers for its growing user base, which quickly
grew from hundreds of thousands to tens of millions of
users; this cost could not be sustained in the long
run. More immediately important was that the servers
themselves were separate entities: users could only search
among other users connected to the same server, which hid a
wealth of results from their searches.
Secondly, political difficulties caused setbacks to the
system which became increasingly difficult, and ultimately
impossible, to overcome. Here the server-based
design of the system was its downfall: with a single
point of failure at the servers, shutting down the system
became as simple as flipping a switch, or more specifically,
forcing Napster through legal avenues to flip that switch.
Once these difficulties had finally shut down the entire
system, it didn’t take long for Napster’s users to notice
its disappearance and look for a replacement. The OpenNap
server was one of the first entries into the game. OpenNap
was open-source Napster-compatible server software, which
allowed anyone to run their own Napster server, and enabled
existing Napster clients to function with minimal
modifications. However, these servers too were shut down one
by one, or became so overloaded with clients and searches
that obtaining a connection was nearly impossible, and
searching was impossibly slow once connected. OpenNap servers
and clients are still in widespread use, as servers come and
go; however, they are not nearly as popular as Napster
itself was in its prime.
The reason for this is not that the user base for online
file sharing has declined, but that a new generation of
peer-to-peer clients soon appeared: networks that used
peer-to-peer networking not only for distributing files, but
also as a way to search for content. Gnutella and WinMX
were among the first of these to gain popularity.
[Figure 2: search tree diagram rooted at the client performing the search.]
FIGURE 2: Example of peer-to-peer search tree – depth of 3, 30 hosts reached

TABLE 1: Maximum number of hosts reached at search depth

Search Depth   Clients Reached     Search Depth   Clients Reached
1              5                   8              488,280
2              30                  9              2,441,405
3              155                 10             12,207,030
4              780                 11             61,035,155
5              3,905               12             305,175,780
6              19,530              13             1,525,878,905
7              97,655              14             7,629,394,530

This second generation of clients began on a simple
premise: by eliminating the central server, the significant
technical and political problems with Napster’s client
could be overcome. Perhaps more importantly, with every
peer on the network considered an equal, every client would
be able to search the content of every other. The
theoretical reach of this network was staggering. The
reasoning is this: if each client on the network broadcasts
a message to five others, then after a depth of one, five
clients are reached. After a depth of two (each of those
five clients broadcasts or forwards the message to five
others), this becomes thirty clients†. At a depth of
fourteen, over seven billion clients have been sent the
message‡, which represents a client for every person in the
world. If each of these messages takes one second to send,
it should require only twenty-eight seconds of real time (14
seconds for the message to reach all clients, 14 seconds
for any responses to travel back up the network to the
client who initiated the search) before the entire world
has been asked the question and given their chance to
answer. However, as we will see in Part 2, in reality the
Internet is far from the homogeneous, limitless network
of equal clients that these equations
assume it to be.
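
The figures in Table 1 follow a geometric series: with a fan-out of five, the number of clients reached at depth d is 5 + 5^2 + ... + 5^d. The short Python sketch below reproduces those totals and the idealised round-trip time. The function names are illustrative only, and the model assumes, as the text does, that every forwarded message reaches a client that has not been reached before.

```python
def hosts_reached(depth, fanout=5):
    """Total hosts contacted after `depth` hops when every host
    forwards the message to `fanout` hosts not reached before."""
    return sum(fanout ** level for level in range(1, depth + 1))


def round_trip_seconds(depth, seconds_per_hop=1.0):
    """Time for a query to travel `depth` hops out and a reply to return."""
    return 2 * depth * seconds_per_hop


if __name__ == "__main__":
    for depth in (1, 2, 3, 14):
        print(depth, hosts_reached(depth))
    print("round trip at depth 14:", round_trip_seconds(14), "seconds")
```

Changing `fanout` shows how sensitive the theoretical reach is to the number of connections each client maintains.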
Part 2 – The Internet vs. Peer-to-Peer Networks
2.1 – Real-world results
The real-world networks made up of this second
generation of peer-to-peer clients were successful in
providing a new, decentralized method of allowing clients
to search each other’s shared files. However, the
theoretical abilities of the network were based on an ideal
world, and in reality the network has not reached its full
potential either in size or in the level of service it
provides its clients.
† See Figure 2
‡ See Table 1 for the theoretical number of clients reachable as the search tree reaches higher depths
The average number of users connected to the Gnutella
network at any one time is in the range of 200,000. This
means that all peers should be able to reach all other
peers in 8 hops or fewer, and nearly half of them in 7 or
fewer§. In reality, the 14 hops that should only be needed
in a network roughly 38,000 times the size are not enough to
traverse the longest path connecting two hosts on the
existing network.
The reason for this is fairly simple: Without a central
server to oversee the network topology, a peer on the
network connects to other peers based only on availability
– this results in a network whose interconnecting paths are
generally randomly placed. The solution to this problem
is, however, not as simple. Before we even consider a
solution to this problem, we need to examine some of the
problems we will face: problems which every application
existing on a system as large and diverse as the Internet
will inevitably come across, and which current peer-to-peer
networks only partially overcome.
§ See TABLE 1
2.2 – Specific Problems
Firewalls are one of the Internet’s most common and
reliable security features, and thus the problems they
cause are the most difficult to get through. There are two
types of firewalls – those that block incoming connections
and those that block outgoing connections. The most common
of these is the type that blocks incoming connections, and
often all incoming connections are prevented on all ports.
If all incoming ports are blocked, a peer cannot accept
incoming connections – his only method of entering the
network is to actively look for ways in, and the part he
can play in the network is severely limited. The main
reason for this is that if two clients cannot accept
incoming connections, there is no way for them to connect
to each other directly, and they cannot trade files.
Bandwidth limitations are a major concern on the network as
well, and the numbers here are easy to see. If there are
200,000 clients on the network, every client sends a 50
byte search message every five minutes, and every message
reaches every other client, each client would be
maintaining an average bandwidth usage of about 33KB/sec both
ways. Clients on a dialup connection have at most
7KB/sec of download bandwidth and 4.2KB/sec of upload
bandwidth, which would very quickly be consumed, leaving
little bandwidth for the reason the client is on the network:
to download content. The problem becomes even more
complicated when we consider that most dialup, ADSL, and cable
modem Internet connections have greater download capacity
than upload capacity. This means that clients approaching their
maximum download capacity will not be able to forward every
message they receive. This can create many possible dead
ends on the network, where messages would simply stop being
forwarded.
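
The bandwidth estimate above can be checked with a few lines of arithmetic. This Python sketch assumes the worst case, in which every search message is flooded to every client, and takes 1 KB as 1,000 bytes. The function and constant names are illustrative.

```python
def flood_load_bytes_per_sec(clients, message_bytes, interval_sec):
    """Average traffic seen by one client if every client's periodic
    search message is flooded to every other client."""
    return clients * message_bytes / interval_sec


# Figures taken from the text: 200,000 clients, 50-byte queries, one
# query per client every five minutes.
load = flood_load_bytes_per_sec(200_000, 50, 5 * 60)

DIALUP_DOWN_BYTES_PER_SEC = 7_000   # 7KB/sec download, from the text
DIALUP_UP_BYTES_PER_SEC = 4_200     # 4.2KB/sec upload, from the text

if __name__ == "__main__":
    print("query traffic per client: %.1f KB/sec" % (load / 1000))
    print("times dialup download capacity: %.1f" % (load / DIALUP_DOWN_BYTES_PER_SEC))
    print("times dialup upload capacity: %.1f" % (load / DIALUP_UP_BYTES_PER_SEC))
```

Under these assumptions the query traffic alone is several times what a dialup client can receive, before any file transfer takes place.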
The network layout of a peer-to-peer network is also
something that can easily spiral out of control. In an
ideal world, at each depth of the search tree, only clients
which have not been contacted before would be reached.
However, often the same packet reaches a host twice along
two different paths. When this happens, another dead end
is reached when the packet reaches a host the second time –
the host remembers that it has reached him before and
discards it.
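
A minimal sketch of this duplicate handling, in Python: each host remembers the identifiers of messages it has already seen and silently drops repeats. The class, its fields, and the use of a plain string as the message identifier are assumptions made for illustration, and the recursive calls stand in for what would really be concurrent network traffic.

```python
class Host:
    """A host that forwards each message once and drops duplicates."""

    def __init__(self, name):
        self.name = name
        self.neighbours = []
        self.seen = set()            # identifiers of messages already handled
        self.duplicates_dropped = 0

    def receive(self, message_id, sender=None):
        if message_id in self.seen:
            # Dead end: the message already reached this host by another path.
            self.duplicates_dropped += 1
            return
        self.seen.add(message_id)
        for neighbour in self.neighbours:
            if neighbour is not sender:
                neighbour.receive(message_id, sender=self)


def connect(a, b):
    a.neighbours.append(b)
    b.neighbours.append(a)


if __name__ == "__main__":
    a, b, c = Host("a"), Host("b"), Host("c")
    connect(a, b)
    connect(a, c)
    connect(b, c)                    # a triangle: two paths between any pair
    a.receive("query-1")
    for host in (a, b, c):
        print(host.name, "dropped", host.duplicates_dropped, "duplicate(s)")
```

Even in this three-host triangle, two of the four messages sent are wasted on hosts that have already seen the query.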
Part 3 – Gnutella
3.1 - Introduction
Gnutella is a Peer-to-Peer system with a fairly simple
protocol which is built on top of existing standards such
as the HTTP protocol and the XML formatting standard, and
has had open-source clients nearly from its introduction.
It is by far the most used protocol with these
qualifications; only the FastTrack network with its Kazaa
client is more widely used, and it is closed and proprietary
in nature**. As its downloading portion masquerades as an
extension to HTTP, users can access files distributed on
the Gnutella network through many firewalls, and even
through some proxy servers, for greater compatibility with
firewalled systems. Its XML-based protocol allows for
standard extensions which can be used or ignored at the
individual client’s discretion.
** Although Kazaa is known to use some of the same existing standards as Gnutella, it is a commercial project, so what is known about it comes only from reverse engineering – something that is especially difficult because its messages are encrypted - not from documentation or source code released by its creators.
TABLE 2 - Gnutella Protocol 0.6 Messages

Message Name   Description and Purpose

Ping           Broadcasts your existence to the network, and requests
               PONG responses to locate other peers on the network.

Pong           Response to a Ping message. Message is routed back to the
               peer who initiated the Ping message. Peers seeing this
               message may (in fact, are expected to) cache the host IP
               address and ID which are sent within the message in their
               host cache.

Query          Initiates a search on the network. This message includes
               some search text or the unique identifier of a specific
               file (normally a SHA1 signature) the requesting party is
               looking for.

Query Hit      Response to a Query message. This message includes the
               host ID and connection information for the party which has
               the file, as well as the unique identifier (SHA1
               signature) of the file. It may also include, as part of
               the extension block of the message, a list of peers which
               are known to have the same file.

Push           Used to request a file from a client which is behind a
               firewall. The message is routed back through the path
               which the Query Hit message traveled. When the message
               reaches its destination, the client serving the file will
               initiate a connection to the client that wishes to
               download the file and send it.

Bye            Optional message sent before a client disconnects from
               another. The message can include the reason why the client
               is disconnecting.

Gnutella is also a very developed protocol, supporting
standardized extensions to its messages and already
including many optimisations to its network, the most
significant of which relate to the Ultrapeer system.
These extensions and existing optimisations will allow us
to examine and test some possible further optimisations on
an already mature network. Table 2, above, outlines the
basic messages a client broadcasts or forwards over the
Gnutella network, enabling clients to search for and request
content. The complete specification for the Gnutella
protocol can be found in the Gnutella 0.6 RFC (Klingberg and
Manfredi, 2002).
3.2 - The Ultrapeer System
[Figure 3: diagram showing four Ultrapeers and sixteen Leaf nodes.]
FIGURE 3- Network Layout with Gnutella's Ultrapeer System
The Ultrapeer system, whose layout is shown in Fig. 3, is
at the heart of Gnutella. It was designed to alleviate
bandwidth usage on dial-up clients, minimize the effects of
firewalls on network layout, as well as add some stability
to the network. Clients using the Ultrapeer system
designate themselves either an Ultrapeer or a Leaf, and
connections by either are only initiated to
Ultrapeers. What this means is that leafs are only
connected to Ultrapeers and not to other leafs; they therefore
do not see traffic as it travels through the network, and
only receive requests as an endpoint on the network.
The benefits to the leafs in this system are immediately
obvious – their bandwidth usage is minimized, as they never
have to forward a message on through the network. The
benefits to the overall network are also easy to see if
we consider that clients can only become Ultrapeers if they
meet a certain set of qualifications. A client must
generally be connected to the network for more than 2 hours
before it can become an Ultrapeer. What this means to
the network is that there is a certain amount of stability
to its layout: clients that are only connected briefly, for
example to do a single search, cannot instantly become an
integral part of the network.
The number of hosts reached by a query is also directly
increased by this system. As Ultrapeers have prescribed
minimum bandwidth and resource requirements, additional
connections can be made to each Ultrapeer. If each
Ultrapeer is connected to 6 other Ultrapeers and 10 leafs,
and each of these leafs has a chance to respond to each
query, the number of hosts reached at each node of the
query tree becomes 11 instead of 1. A search through 7
levels of Ultrapeers can thus reach 1,074,205 hosts (11 ×
97,655, the depth-7 figure from Table 1).
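
The 1,074,205 figure equals 11 times the depth-7 entry of Table 1 (97,655), which suggests a forwarding fan-out of five per Ultrapeer. One reading, which is an assumption here and not something the text states, is that of an Ultrapeer's six connections, one is the connection the query arrived on. A Python sketch of that calculation, with illustrative names:

```python
def ultrapeer_reach(levels, forward_fanout=5, leafs_per_ultrapeer=10):
    """Hosts reached by a query that travels through `levels` levels of
    Ultrapeers, where each Ultrapeer also serves its own leafs."""
    ultrapeers = sum(forward_fanout ** level for level in range(1, levels + 1))
    # Each Ultrapeer reached contributes itself plus its leafs.
    return ultrapeers * (1 + leafs_per_ultrapeer)


if __name__ == "__main__":
    print(ultrapeer_reach(7))
```

Setting `leafs_per_ultrapeer=0` gives back the plain peer-to-peer totals of Table 1, which makes the elevenfold gain from the leafs easy to see.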
3.3 – Existing Optimisations
The stability that the Ultrapeer system added to the
Gnutella network eased the introduction of a multitude of
optimisations. The primary Ultrapeer related optimisation
is meant to be a near-optimal minimisation of traffic to
the leaf nodes. When a leaf connects to an Ultrapeer, it
transfers an index of search terms related to its files to
the Ultrapeer, so the Ultrapeer can check this index before
forwarding any queries to the leaf. This index of search
terms is generally simply a list of words related to the
files the leaf is sharing; the Ultrapeer will only forward
queries containing one of these words. In this way, nearly
all irrelevant traffic will never reach a leaf node. Of
course, until a leaf node has completed the transfer of its
index to the Ultrapeer, it will still receive all queries.
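
The filtering step can be sketched as follows in Python. This is a simplified illustration of the idea, not the wire format or data structure used by any real client. The class and method names are hypothetical, and the index is modelled as a plain set of lower-case words.

```python
class Ultrapeer:
    """Forwards a query to a leaf only if the leaf's index matches it."""

    def __init__(self):
        # leaf name -> set of keywords, or None while the index is pending
        self.leaf_index = {}

    def connect_leaf(self, leaf):
        self.leaf_index[leaf] = None          # index not yet transferred

    def receive_index(self, leaf, keywords):
        self.leaf_index[leaf] = {word.lower() for word in keywords}

    def leafs_for_query(self, query):
        words = set(query.lower().split())
        targets = []
        for leaf, index in self.leaf_index.items():
            # A leaf with no index yet still receives every query.
            if index is None or words & index:
                targets.append(leaf)
        return targets


if __name__ == "__main__":
    up = Ultrapeer()
    up.connect_leaf("leaf-1")
    up.connect_leaf("leaf-2")
    up.connect_leaf("leaf-3")
    up.receive_index("leaf-1", ["linux", "iso"])
    up.receive_index("leaf-2", ["lecture", "notes"])
    print(up.leafs_for_query("Linux kernel"))   # leaf-3 has sent no index yet
```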
Many optimisations have been proposed which may not yet be
part of the Gnutella standard, but may still be implemented
and used in some production clients. Some of these include
using a “random walker” search algorithm and caching of
QUERYHIT messages.
The random walker search algorithm contrasts with the
standard “flooding” search algorithm, where every host
forwards the search message to every host it is connected
to. Instead, a set of random walker messages is sent out
onto the network, and each host forwards a walker to only
one other host. This creates a drastic reduction in traffic,
and because of this each message is less likely to reach a
dead end, thereby allowing the search to possibly travel
further out on the network. However, the sheer number of
hosts that the flooding search pattern reaches is not
achieved. More about random walkers can be read in “Search
and Replication in Unstructured Peer-to-Peer Networks” [Lv et
al. 2002].
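
The trade-off between the two search styles can be seen in a toy simulation. The Python sketch below is an illustration under simplifying assumptions (a static graph, no TTL bookkeeping per message, and hosts that may send a flooded message back to the host it came from). It is not the algorithm from the cited paper, and all names are hypothetical.

```python
import random


def flood(graph, start, ttl):
    """Return (hosts reached, messages sent) for a flooding search."""
    seen = {start}
    frontier = [start]
    messages = 0
    for _ in range(ttl):
        next_frontier = []
        for host in frontier:
            for neighbour in graph[host]:
                messages += 1
                if neighbour not in seen:      # duplicates are dropped
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return len(seen) - 1, messages


def random_walk(graph, start, walkers, steps, rng):
    """Return (hosts reached, messages sent) for `walkers` random walkers."""
    seen = set()
    messages = 0
    for _ in range(walkers):
        host = start
        for _ in range(steps):
            host = rng.choice(graph[host])     # forward to one neighbour only
            messages += 1
            seen.add(host)
    seen.discard(start)
    return len(seen), messages


if __name__ == "__main__":
    # Six hosts, each connected to every other.
    graph = {i: [j for j in range(6) if j != i] for i in range(6)}
    print("flooding:", flood(graph, 0, ttl=2))
    print("walkers: ", random_walk(graph, 0, walkers=2, steps=3,
                                   rng=random.Random(1)))
```

The walker search sends exactly `walkers * steps` messages regardless of how densely the network is connected, whereas the flooding cost grows with every extra connection.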
QUERYHIT message caching allows a host to return a QUERYHIT
message for a file he is not serving. If a host sees a
search term twice in succession, and has cached QUERYHIT
messages associated with the first search, he can
optionally return a hit for this file rather than
forwarding the search on. The downside of this technique
is a danger with any caching technique: The network on the
other side of the host may have changed in between
searches, and the hit he is returning may no longer exist
while others may have since appeared.
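
A minimal Python sketch of such a cache follows. The expiry time is an assumption added here as one simple way to limit the staleness problem just described, and is not something the text attributes to Gnutella. The class name, parameters, and hit format are hypothetical.

```python
import time


class QueryHitCache:
    """Remembers hits seen for a search so they can be returned later."""

    def __init__(self, max_age_sec=300.0, clock=time.monotonic):
        self.max_age_sec = max_age_sec   # hypothetical staleness limit
        self.clock = clock
        self.entries = {}                # search text -> (time cached, hits)

    def store(self, search_text, hits):
        self.entries[search_text.lower()] = (self.clock(), list(hits))

    def lookup(self, search_text):
        """Return cached hits, or None if the search must be forwarded."""
        key = search_text.lower()
        entry = self.entries.get(key)
        if entry is None:
            return None
        cached_at, hits = entry
        if self.clock() - cached_at > self.max_age_sec:
            del self.entries[key]        # too old: the network may have changed
            return None
        return hits


if __name__ == "__main__":
    cache = QueryHitCache(max_age_sec=60)
    cache.store("thesis draft", ["host-a", "host-b"])
    print(cache.lookup("Thesis Draft"))
    print(cache.lookup("something else"))
```

A shorter `max_age_sec` trades fewer stale hits for more forwarded searches.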
Part 4 – Strategies – Network Layout
4.1 – P2P networks are built on top of TCP/IP
[Figure 4: the seven OSI layers (1 - Physical, 2 - Data Link, 3 - Network, 4 - Transport, 5 - Session, 6 - Presentation, 7 - Application), annotated with the Gnutella P2P protocol, TCP, and IP.]
FIGURE 4: Network Layers (OSI Model)
TCP is a connection-based packet delivery protocol
operating a layer below most communication protocols on the
Internet. As P2P network searching is inherently forgiving
of dropped packets and network dead ends (our goal here is
not to reach every host on the network, only to reach as
many as possible in an acceptable amount of time), the
specifics of TCP are of little interest to the operation of
peer-to-peer file searching in general. However, IP is the
routed protocol operating a layer below TCP, at layer 3 of
the OSI model††. The routing path of packets along the
network is minimized by several routing table optimisation
algorithms (routing protocols) which may be of interest to
us. At the very least, we can use some of the information
contained within the routing tables IP uses to help us in
the organization of our network. This warrants closer
examination.
The reason for the distinction between IP being a routed
protocol, and the routing protocols behind it, is the
routing table. Routing of IP packets is not strictly
determined at transmission time, but rather they travel
along a pre-determined path, directed by a routing table at
each router. These routing tables are the output of one of
several algorithms which collectively map the routes across
the Internet, and adapt to include new systems, networks,
and routers as the Internet itself changes shape.
Before we can see how these algorithms can help
peer-to-peer networks, we need to see that a routing table
is to a router on the Internet what a list of connected
peers is to an Ultrapeer on Gnutella. Naturally, while there
are similarities, there are also differences, the biggest
being that IP routing’s goal is to map the shortest path
between any two nodes, while the goal of optimising a P2P
network’s layout is, for our purposes, to reach the maximum
number of relevant peers with the 7 hops the TTL‡‡ on
Gnutella’s messages gives us. Still, the goal of both
systems is similar: to build a network without a central
authority that does the best job it can at routing
messages to all peers.
†† See Fig. 4 for the OSI model and the layers and protocols worth noting
The way in which this is accomplished for IP on the
Internet is through a jumble of information exchanged
between routers, through protocols such as RIP (Routing
Information Protocol), IGRP (Interior Gateway Routing
Protocol), and BGP (Border Gateway Protocol). These
protocols and the systems behind them share the goal of
maintaining the routing tables at each router, which route
IP packets along the correct path to reach their
destination; however, the methods they use, and consequently
their success in a given situation, vary. I will briefly
discuss the basic techniques used by each of these routing
protocols and how these techniques can be applied
to peer-to-peer searching.
‡‡ TTL refers to time to live; a number associated with a Gnutella packet which is decremented every time the packet reaches a host (makes a “hop” on the network). When the TTL reaches 0, the packet stops being forwarded. Most Gnutella packets have a TTL of 7 which means that the seventh host they reach will drop the packet.
The Routing Information Protocol is among the oldest of the
Internet’s routing protocols – it was designed for a past
incarnation of the Internet which was a much smaller system
than it is today. Routers supporting RIP would broadcast
their routing table to their neighbours – a routing table
entry consisting of a subnet and the number of hops from
that router to the destination. The neighbours would then
add appropriate entries to their own routing table, adding
one hop to each entry stored. They would then rebroadcast
their own routing table to their neighbours, and so on. If
a router received two entries for the same subnet, it would
favour the path which was shortest: the one with the fewest
router hops. This system was simple and elegant enough;
however, it did not scale well: routing tables were
transmitted at 30-second intervals and would eventually
become too large to deal with. As every router shared its
entire routing table, routers using RIP would eventually
have a routing table entry for every system on the
Internet. It was also particularly susceptible to routing
loops – packets would too often get lost in the maze of
routers and reach the same router more than once. This
resulted in longer than necessary transmission times, or on
occasion packets not reaching their destination at all.
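
The table-merging step described above is the core of a distance-vector protocol and can be sketched in a few lines of Python. This is an illustration of the idea only: it omits RIP's timers, its maximum hop count, and its loop-avoidance measures, and the function and variable names are hypothetical.

```python
def merge_advertisement(table, neighbour, advertised):
    """Merge a neighbour's advertised routes into our own routing table.

    table:      subnet -> (hop count, next hop)
    advertised: subnet -> hop count as seen from the neighbour
    Returns True if our table changed and should be rebroadcast.
    """
    changed = False
    for subnet, hops in advertised.items():
        candidate = hops + 1                    # one extra hop to reach the neighbour
        if subnet not in table or candidate < table[subnet][0]:
            table[subnet] = (candidate, neighbour)
            changed = True
    return changed


if __name__ == "__main__":
    table = {"10.0.0.0/8": (0, None)}           # directly attached subnet
    merge_advertisement(table, "router-b", {"192.168.1.0/24": 2})
    merge_advertisement(table, "router-c", {"192.168.1.0/24": 1})
    print(table["192.168.1.0/24"])              # the shorter path via router-c wins
```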
RIP’s successors, such as IGRP and OSPF (Open Shortest
Path First), extended RIP’s abilities by using metrics other
than hop count to decide a packet’s route when more than one
was available, such as bandwidth, load, delay, and
reliability. OSPF also cut
down on the transmission of routing tables by only sharing
immediately interesting information with its neighbours: a
router’s abilities were not broadcast over the entire
network, so each router only knew the next immediate step
in a packet’s route, not the full route to the router that
handles it. The Border Gateway Protocol (BGP) eventually
became the standard for communication between Autonomous
Systems (AS) on the Internet; however, it is not of much
interest to us, as it depends on having an authority or
central server to assign AS numbers and to divide the
network into distinct Autonomous Systems, whereas we must
maintain the P2P network’s lack of a central authority.
The purpose of this discussion has been to examine the way
that IP builds its routing tables, with the hope that some
of its algorithms will help our Gnutella client build its
routing table, or choose its direct connections. RIP
builds its routing table so that each packet is forwarded to
the router that will reach the packet’s destination in the
fewest router hops. Other routing protocols extend this by
measuring the cost of forwarding a packet on through a
specific path, cost being a heuristic built from known
information about the router or path, reflecting both the
likelihood the path will be successful in delivering the
packet to its destination, and the speed at which it will
do so.
In Gnutella, the destination of our QUERY packets is
everyone, or “as many clients as possible”. So while the
value of a direct connection can be determined by some of
the same heuristics as in IP routing (the reliability of a
peer and the cost of sending a message to that peer), we also
need to consider the number of hosts reached by sending a message
to that peer. The way in which we calculate the cost and
reliability of a connection will thus also differ, and we
need to re-consider all the information available to us to
come up with appropriate heuristics.
4.2 – Current Algorithm and Immediate Goals for Improvement
Phex, in its current incarnation, has a single goal in
choosing which host to connect to on the network: get on
the network as quickly as possible by connecting to the
most reliable hosts. It first chooses hosts it has
connected to before, and is quick to discard hosts who
denied its last connection. The only other parameter it
uses to place the hosts in the order it will try them is a
number representing the average daily uptime of the host –
information which most hosts on the network provide
when they respond to a “PING” message. These responses are
Phex’s sole method of harvesting hosts.
Phex keeps a cache of 1000 hosts it has seen on the
Gnutella network, and once this limit has been reached,
discards the oldest hosts to whom its last connection
attempt failed. This again is consistent with its strategy
of maintaining a list of the most reliable hosts, although
it would not prevent a host that has accepted many
connections, but happened to reject the last one, from being
dropped from the list before it has outlived its usefulness.
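The behaviour described in this section can be summarised in a short Python sketch. It is not taken from the Phex source (which is Java); the names `Host`, `try_order` and `add_host` are hypothetical, and the exact precedence of the sort key is an assumption based on the description above.

```python
from dataclasses import dataclass


@dataclass
class Host:
    address: str
    first_seen: float              # timestamp when the host entered the cache
    connected_before: bool = False
    last_attempt_failed: bool = False
    avg_daily_uptime: int = 0      # as reported in responses to PING


def try_order(hosts):
    """Hosts in the order they would be tried: known-good first, then by uptime."""
    return sorted(
        hosts,
        key=lambda h: (h.last_attempt_failed,      # hosts that refused us go last
                       not h.connected_before,     # previously used hosts first
                       -h.avg_daily_uptime),       # then longest uptime
    )


def add_host(cache, host, limit=1000):
    """Add a host; when over the limit, drop the oldest host whose last attempt failed."""
    cache.append(host)
    if len(cache) > limit:
        failed = [h for h in cache if h.last_attempt_failed]
        # Falling back to the oldest host overall is an assumption of this sketch.
        victim = min(failed or cache, key=lambda h: h.first_seen)
        cache.remove(victim)
```

The sketch also shows the weakness noted above: a single failed attempt is enough to make a host a candidate for eviction, however many earlier connections succeeded.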
Our immediate goals to improve this strategy should result
in the following enhancements: First of all, we want to
connect to the hosts which will enable us to contribute the
most to the network as a whole – we want to position
ourselves in the network so the bandwidth and resources we
have to offer do not go unused. Secondly, we want to
connect to the hosts which will give us as a user the best
experience possible – provide the most applicable search
results quickly, and allow us to download the files we were
looking for speedily. It is worth noting that if the
entire network works towards the first goal, the second
goal will be at least partially fulfilled. Thirdly, we
don’t want to lose sight of the existing goal, which is to
join and become a productive member of the network as
rapidly as possible, by immediately connecting to hosts
which are likely to accept our connections. Finally, we
want to do all this without introducing significant – if
any – additional message traffic into the network.
4.3 – Swarm Intelligence
Swarm intelligence is a way of forming self-organizing
systems where each of the individuals in the system follows
a simple set of behaviour rules, their direct goal as an
individual differing from, and often not immediately
recognizable as being even related to, the goal of the
collective system. A simple example of such a system is a
virtual system of ants. The virtual ants exist in a world
made up of a small grid. Each spot on the grid is either
empty, or contains a piece of food. Each ant in the system
follows a simple set of rules: Move around randomly. If
you are not carrying any food, and come across a piece of
food, pick it up. If you are carrying a piece of food, and
come across a piece of food, drop what you are carrying.
So the goal of the individual ants is to pick up food and
drop it near other pieces of food. The goal of the system
of ants, however, and what does indeed eventually occur, is
to collect all the food into a single pile.
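The ant rules above are simple enough to simulate directly. The following Python sketch is one possible reading of them: the grid wraps around at its edges, a carrying ant that bumps into food drops its load on the cell it is standing on, and it keeps carrying if that cell is already occupied. Those details, and all names and default parameters, are assumptions of the example rather than part of the original description.

```python
import random


def simulate_ants(size=20, food=60, ants=10, steps=20000, seed=1):
    """Run the three ant rules on a wrap-around grid; return (grid, carried)."""
    rng = random.Random(seed)
    cells = [(x, y) for x in range(size) for y in range(size)]
    grid = {cell: False for cell in cells}          # True where food lies
    for cell in rng.sample(cells, food):
        grid[cell] = True
    positions = [rng.choice(cells) for _ in range(ants)]
    carrying = [False] * ants

    for _ in range(steps):
        for i in range(ants):
            x, y = positions[i]
            dx, dy = rng.choice([(0, 1), (0, -1), (1, 0), (-1, 0)])
            target = ((x + dx) % size, (y + dy) % size)   # rule 1: move randomly
            if not grid[target]:
                positions[i] = target
            elif not carrying[i]:
                grid[target] = False                      # rule 2: pick the food up
                carrying[i] = True
                positions[i] = target
            elif not grid[(x, y)]:
                grid[(x, y)] = True                       # rule 3: drop next to the food
                carrying[i] = False
    return grid, sum(carrying)
```

No ant knows anything about piles; any clustering that appears in the returned grid comes only from repeated application of the local rules. Food is never created or destroyed, so the number of pieces on the grid plus the number being carried always equals the starting amount.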
If we want to make use of this idea in attempting to lend
some organization to Gnutella’s network layout, and to our
position in the Gnutella network, we need to find some
available information on which we can base the order in
which we connect to known hosts, and thus choose the best
available hosts to which we forward messages. The numbers
or heuristics that we base this decision on should both
provide us immediate improved performance as a user, as
well as – assuming all the hosts on the network use similar
heuristics – improve the overall structure of the network.
There are three types of information available to us:
information that is already being stored and used,
information that is already available during normal use of
the network – it just needs to be stored (passively
available), and information that we need to do some
additional processing to gather (actively available). The
numbers can generally be assigned to one of three
categories: TCP/IP related statistics, Gnutella Network
related statistics, and Gnutella Host Content related
statistics. Finally, each heuristic can help us decide a
host’s usefulness in one of four instances – helping us
decide which hosts are closest on the TCP/IP network, which
are closest on the Gnutella network, which are the most
reliable, and which hosts we can most easily discard when
our host cache becomes full.
TABLE 3 – Some possible heuristics available for Gnutella Hosts

| Heuristic | Availability | Type/Source | Usefulness |
|---|---|---|---|
| Date first seen on Gnutella | Passive | Gnutella | Reliability |
| Date most recently seen on Gnutella | Passive | Gnutella | Reliability |
| Average Daily Uptime | Already Used | Gnutella | Reliability |
| Number of failed direct connections | Passive | TCP/IP | Reliability and Discardability |
| Number of successful direct connections | Passive | TCP/IP | Reliability and Discardability |
| Date of Last failed direct connection | Already Used | TCP/IP | Reliability and Discardability |
| Date of Last successful direct connection | Already Used | TCP/IP | Reliability and Discardability |
| Was last direct connection successful? | Already Used | TCP/IP | Reliability and Discardability |
| Time to establish direct connection - most recent | Passive | TCP/IP | TCP/IP Layout |
| Time to establish direct connection - average | Passive | TCP/IP | TCP/IP Layout |
| ICMP Ping time | Active | TCP/IP | TCP/IP Layout |
| Trace route Number of Router Hops | Active | TCP/IP | TCP/IP Layout |
| Last Gnutella response time | Passive | Gnutella | Gnutella Layout |
| Total number of Shared files | Passive or Active | Gnutella Content | Gnutella Layout |
| Total size of Shared files | Passive or Active | Gnutella Content | Gnutella Layout |
| Total number of Interesting Search Results | Passive or Active | Gnutella Content | Gnutella Layout |
| Total number of mirrors at this host | Passive or Active | Gnutella Content | Gnutella Layout |

Table 3 shows only a small subset of all the information we
can gather about hosts in our host cache, as well as how
this information could be useful to us. As a general rule,
we want to connect to hosts that are interesting to us.
Interesting can mean they have a wide variety of content
available or specific content that we as a user are
interested in. It can also mean that they are close to us,
or we have a particularly fast connection to them on the
TCP/IP network. They become even more interesting for a
direct connection if they are noticeably distant on the
Gnutella network – establishing a direct connection to them
would enable all responses to be received in the time shown
by the TCP/IP response time, rather than the current
Gnutella response time.
This particular idea seems like a good candidate for a rule
used by an individual in a network governed by swarm
intelligence. To see this, we will ignore for a moment all
the other factors, and start as a host wishing to gain
access to the network. First, we connect to one host, ask
it which host is furthest away from it on the Gnutella
network, and then connect to the host it specifies. This
gives us access to two completely separate parts of the
network, as well as giving the first host we connected to
access to hosts one step further away from the furthest
host it knows about. If every host on the network follows
this set of rules, it should follow that each host will be
doing its part to bridge parts of the network together, and
an organized pattern will begin to emerge.
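The effect of this rule can be illustrated on a toy network in Python. Gnutella has no message for asking a host which peer is furthest from it, so this sketch simply computes the answer from a global view of the graph with a breadth-first search; the function names and the five-host chain are hypothetical.

```python
from collections import deque


def hop_distances(graph, start):
    """Breadth-first search: hop count from start to every reachable host."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in graph[node]:
            if peer not in dist:
                dist[peer] = dist[node] + 1
                queue.append(peer)
    return dist


def join(graph, newcomer, first_host):
    """Connect to first_host, then to the host furthest from first_host."""
    dist = hop_distances(graph, first_host)
    furthest = max(dist, key=dist.get)
    graph[newcomer] = {first_host, furthest}
    graph[first_host].add(newcomer)
    graph[furthest].add(newcomer)
    return furthest


# A chain of five hosts: a - b - c - d - e
network = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
           "d": {"c", "e"}, "e": {"d"}}
before = hop_distances(network, "a")["e"]
bridged_to = join(network, "n", "a")
after = hop_distances(network, "a")["e"]
```

In this example the newcomer bridges the two ends of the chain, so host a's distance to host e falls from four hops to two.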
In reality, however, there are other factors to consider.
Firstly, we are connecting to more than 2 hosts – after the
first two hosts, we need to decide ourselves which host we
are having the most trouble reaching, or which we know to
be the furthest away from us through each host we’re
connected to. Secondly, it is very likely that that host
will deny our connection, either because it’s already full,
or it is behind a firewall. Thirdly, we want to consider
some of the other heuristics available to us in deciding
which host is best for us to connect to. Finally, we must
consider that the first host we connect to will not have
this factor available for us to consider.
So, in consideration of all these factors, we can propose
the following scheme for ordering our host cache:
The hosts we connect to first should be hosts which are
reliable – we want to get a leg into the network as quickly
as possible, so the user can start searching. We should
then look for two kinds of hosts:
Firstly, hosts which we are having trouble reaching. The
last hosts to return a PONG or QUERYHIT in response to a
PING or QUERY message are assumed to be the hosts furthest
away from us on the Gnutella network.
Secondly, hosts which are interesting. If a host is a
mirror for a lot of our files, it may have other files we
want. If a host returns a significant number of
interesting results for our searches, we might as
well connect to it directly. An added bonus is that if it is
using the same heuristic to connect to peers, interesting
peers will eventually clump together into interconnected
cliques.
In addition, we need to keep track of hosts which are no
longer of any use to us. These are the ones we will drop
from our list first when more interesting hosts come along.
This decision needs to take all three of the above
heuristics into account so that we can maintain a queue
which contains enough interesting hosts in all three
categories.
4.4 – Implementation
Phex maintains a sorted tree to hold its host cache. The
tree currently uses a comparator which bases its decisions
on reliability: it chooses first the host with the best
average daily uptime, or the one we last connected to
successfully most recently, and to which our last connection
did not fail. We will extend this to maintain at least two
other trees: one containing hosts in the order we will
discard them, the other containing hosts in order of how
interested we are in connecting to them directly. The
existing tree, by contrast, contains hosts in order of how
likely we are to connect to them successfully.
In addition, we will need to be able to view the contents
of the cache in real time. For this purpose a tab has been
added to the Phex user interface with a table detailing all
the hosts currently in the cache, along with their various
available statistics. The table is initially sorted in the
order the hosts will be tried.
The three heuristics will be inter-dependent. If one
heuristic believes two hosts to be equal then it will
fall back to the next most appropriate one. Specifically, the
order in which hosts will be discarded will be first based
on the number of unsuccessful connection attempts (higher
is worse), and the length of time since a successful
connection has been made to that host (longer is worse).
If these two heuristics are equal, the algorithm will first
discard hosts which are less interesting. Similarly, if
all hosts are equally reliable (generally this means the
client has already tried all hosts it has reliability
statistics for), it will immediately try the most
interesting ones.
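The fall-back behaviour between heuristics maps naturally onto tuple comparison, as the following Python sketch shows. It is an illustration only, not the comparator code added to Phex (which is Java); the field names are hypothetical, and treating the two reliability measures as strictly ordered, rather than combined in some weighted way, is an assumption.

```python
from dataclasses import dataclass


@dataclass
class HostStats:
    address: str
    failed_attempts: int          # unsuccessful direct connections (higher is worse)
    secs_since_success: float     # time since last successful connection (longer is worse)
    interest: int                 # e.g. interesting search results seen (higher is better)


def discard_key(h):
    """Sort key: the first host in sorted order is the first to be discarded."""
    return (-h.failed_attempts, -h.secs_since_success, h.interest)


def connect_key(h):
    """Sort key: the most reliable host first, with interest breaking ties."""
    return (h.failed_attempts, h.secs_since_success, -h.interest)
```

Sorting a list of hosts with `sorted(hosts, key=discard_key)` compares the reliability fields first and only consults the interest value when they are equal, which is the tie-breaking order described above.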
Part 5 – Caching Strategies
5.1 – The Purpose of Caching in a Peer-to-Peer System
Even with a fully optimized network layout, the most
commonly used TTL on Gnutella, at 7, is only enough to
reach a network of just over 97,000 users. With the
current estimated number of hosts in the Gnutella network
being over 200,000 it is not just unlikely, but impossible
for any message to reach all the hosts on the network.
Caching, occurring at various points throughout the
network, can allow a host to see relevant contents of a
host more than 7 hops distant, without being able to see
the host itself.
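The figure of just over 97,000 can be reproduced under one simple assumption, which is not stated in this section: an ideal tree in which every host, starting with the originator, passes the message on to five hosts that have not yet seen it. The Python sketch below does the arithmetic; the fan-out of five is that assumption, not a value taken from the Gnutella specification.

```python
def hosts_reached(fan_out, ttl):
    """Hosts covered when every host passes a message on to fan_out new hosts.

    Counts the originating host (hop 0) plus every host reached at
    hops 1..ttl in a perfectly non-overlapping tree.
    """
    return sum(fan_out ** hop for hop in range(ttl + 1))


reach = hosts_reached(5, 7)   # 1 + 5 + 25 + ... + 5**7
```

Under that assumption a TTL of 7 covers 97,656 hosts, fewer than half of an estimated 200,000. Real networks do worse, since connections overlap and many hosts receive the same message more than once.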
The Gnutella network supports several forms of caching.
The Ultrapeer system is designed around a caching system
with a different goal: to eliminate needless traffic on
the network. Thus, each Ultrapeer has a small cache of
recent searches, and knows whether or not connected leaves
or peers responded to those searches. If the same search
reaches the Ultrapeer a second time, it does not forward
the search on to branches which it knows do not respond to
that search.
5.2 – Content Mirrors
Towards the goal of this project, we will examine a form of
caching meant to increase the number of search results,
specifically the number of full and partial mirrors
available for a specific result.
File mirrors are an idea that began on the Internet with
FTP. Systems would periodically make a copy, or “mirror”
of a site, at a separate location so downloads of the
site’s files could be split between the two servers. In a
peer-to-peer network, this idea can grow from having two
mirrors for a file, to having hundreds or thousands. When
more than one user is sharing a file, the downloader can
take his pick from all available sources and download from
the least busy, fastest, closest, or most reliable source.
Mirrors become even more useful if a client is capable of
simultaneously downloading separate parts of a file from
different sources, and rebuilding the file once all parts
have been received, an idea which peer-to-peer systems such
as BitTorrent use to its fullest extent.§§
§§ See Part 7 for more information on BitTorrent and how its logic could be applied to more traditional peer-to-peer networks.
These mirrors or alternate download locations can be most
effectively stored, and most easily integrated into the
network, by being cached at all peers who have that file.
The reasons for this include the fact that a peer with the
file is more likely to see other peers with the file,
either directly or indirectly – as I will discuss in the
remainder of this section.
There are four ways a peer can passively (without sending
or receiving any extra messages) locate other peers with
that file on the network, as well as a couple which will
require minimal extensions to existing messages. The first
of these passive methods is the most obvious: When the
peer receives a download request for the file, and
successfully sends the file to a peer - assuming the
receiving peer will immediately share that file - he knows
that peer has that file, and can add it to his alternate
list. If the receiving client supports sharing partial
files, this can be done even sooner: As soon as the
serving client starts sending the receiving client data,
the receiving client immediately becomes a partial mirror
for the file. This, in theory, should be a very effective
method of constantly growing a list of mirrors.
The second of these methods guarantees that all files
downloaded off the network will start with at least one
good initial alternate – this is simply the other direction
of the first method. When a peer finishes receiving a
file, he should immediately store the peer he received the
file from as a mirror. In many cases, more than one peer
would have served parts of the file, in which case we can
seed our mirror list with every peer who served us a part
of the file, or even every peer in our download candidates
list whether we used them or not.
These two strategies alone, if implemented on all clients,
should quickly provide nearly every file on the network
with at least one mirror location, and are fairly
straightforward. The third method I will discuss requires
an additional action on the part of the user – which is
likely, but not guaranteed to happen in the course of the
user’s interaction with the client. If the user has a
certain type of file in his library, it is likely that he
will search for similar files, and that a file he has in
his library will appear in the results of this search.
When this happens, the client software can notice a file in
the results is already locally available, and rather than
offer it for download to the user again, store the sources
for the file as alternates for the local copy. While this
is not guaranteed to provide results, as it requires both
an action on the part of the user, and search results which
are not guaranteed to be found, this has the potential to
be the best constant source of alternate mirrors as the
results of a search habitually provide many more results
than the number of peers which a client has uploaded to or
downloaded from.
The final method of using existing information on the
network to obtain mirror sites is passive searching.
Because each peer on the network routes messages to other
peers, a client will see hundreds of messages each minute
as they pass through the network. Some of these messages
represent query results directed at other peers, and some
of these query results will be for files which are in a
client’s library. If the client, before passing the
message on through the network, checks if the file is
available locally and if so stores the host in the QUERYHIT
message that is currently serving the file, the client can
slowly accumulate mirrors simply by being an observer (even
while only an idle participant) in the network. This will
likely yield real-world results only for popular files;
however, as it is only an extension of earlier ideas, it is
worth implementing.
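The four passive methods share one shape: an event the client already sees names a host, and the host is recorded if the file concerned is in the local library. A Python sketch of that shape follows. The class and method names are hypothetical and do not correspond to Phex classes, and files are identified here by an opaque signature string.

```python
from collections import defaultdict


class MirrorTracker:
    """Collects alternate locations for locally shared files from four passive events."""

    def __init__(self, shared_signatures):
        self.shared = set(shared_signatures)   # signatures of files in our library
        self.mirrors = defaultdict(set)        # signature -> set of host addresses

    def _add(self, signature, host):
        # Only files we share ourselves are worth tracking.
        if signature in self.shared:
            self.mirrors[signature].add(host)

    def on_upload(self, signature, receiver):
        """Method 1: a peer we served now holds (part of) the file."""
        self._add(signature, receiver)

    def on_download_complete(self, signature, sources):
        """Method 2: the peers that served us become our first mirrors."""
        self.shared.add(signature)
        for host in sources:
            self._add(signature, host)

    def on_search_result(self, signature, host):
        """Method 3: one of our own searches returned a file we already have."""
        self._add(signature, host)

    def on_routed_queryhit(self, signature, host):
        """Method 4: a QUERYHIT passing through us names a file we already have."""
        self._add(signature, host)
```

None of these hooks sends anything: each reacts to traffic the client handles in any case, which is what keeps the methods passive.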
The methods described so far do not require any additional
messages to be sent across the Gnutella network. Indeed,
even messages already being sent do not require any
modification at all: only existing data already being
shared across the network is used. The obvious
progression of this idea is sharing of mirror lists between
peers: when a new source is added to a client’s list, the
client could contact all the sources it knows about, and tell them
about this new source. In this way, each client on the
network could always have a complete list of every peer on
the network sharing each file. However, it is easy to see
that the amount of additional data this scheme would pass
over the network could quickly be overwhelming – and we may
show that it is unnecessary if we can maintain a large
enough list of mirrors using the passive methods already
described.
Actively searching for alternate download sources could
also be an option worth considering: When the client is
idling on the network, it would at intervals go through its
library searching for new alternates and storing them in
its list. Along the same lines, the client could ask its
existing alternate sources if they know of any new
alternate sources at intervals. This may be a better
option, simply because the number of messages passed could
be throttled based on the number of sources available – if
a client already knows of a reasonable number of sources
for a file, it does not need to actively look for more.
5.3 – Implementation
Phex has an existing nearly complete implementation of the
standard Gnutella extension for alternate locations. It
has no existing methods of acquiring these alternate
locations implemented, although it does have the ability to
store them and send them as an extension to the QUERYHIT
message. Many Gnutella clients have the ability to process
these mirrors when sent as part of the QUERYHIT message, so
this extension to Phex should be immediately worthwhile.
All four passive methods of acquiring mirrors will be
implemented. When a download is partially completed, and
the file is added to Phex’s shared files list, all good
download candidates for the file will be immediately
stored. When a part of an upload is completed, the
receiving client will be immediately stored as a mirror.
Finally, all QUERYHIT messages will be processed as soon as
they are seen, and checked for a match with the signatures
of all files we are currently sharing. Mirrors will be
cached to a maximum count of 100***, with older mirrors
being taken off the list as newer ones appear.
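A bounded list that drops its oldest entry can be sketched in Python with an ordered dictionary. This illustrates the eviction policy only and is not the Phex implementation; the class name is hypothetical, and moving a mirror that is seen again to the newest position is an assumption of the sketch.

```python
from collections import OrderedDict


class BoundedMirrorList:
    """Keeps at most `limit` mirrors for one file, dropping the oldest first."""

    def __init__(self, limit=100):
        self.limit = limit
        self._hosts = OrderedDict()   # insertion order doubles as age order

    def add(self, host):
        # Seeing a known mirror again makes it the newest entry.
        self._hosts.pop(host, None)
        self._hosts[host] = True
        while len(self._hosts) > self.limit:
            self._hosts.popitem(last=False)   # evict the oldest mirror

    def hosts(self):
        """Mirrors from oldest to newest."""
        return list(self._hosts)
```

With a limit of 100, a popular file's list turns over continually, while an unpopular file's list may never reach the limit at all.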
The Phex user interface will also be modified to show the
user how many mirrors it is storing for each file he is
sharing. This will also help us gauge the success of the
system, as this number should continually grow as the
client operates. It also is a good indication of the
overall popularity of files a user is sharing, as we will
see in section 6.2.
*** 100 was chosen simply as an arbitrary number used so the size of this cache does not grow unnecessarily large; for the more popular files on the network, it is not unreasonable to expect to see thousands of mirrors for one file within a matter of hours, for example current #1 hit songs in mp3 format. We are operating under the assumption that there is no reason for one host to know over 100 mirrors for a file at this stage; for a further examination of this cache size see part 7.
Part 6: Results
6.1 – Results of Network Host Cache Changes
The first thing that was immediately apparent once the Phex
client was displaying the real-time status of its host
cache was that there were a few issues with the existing
system. The first was a minor annoyance – as soon as a
host was used, it disappeared from the host cache. It is,
of course, necessary to temporarily displace it from the
top of the host cache so it is not attempted twice, however
if we remove every host we connect to from the cache, we
immediately eliminate some of the best candidates we may
have for the next time the client connects to the network.
Also, since some of the statistics we collect can only be
found by attempting a direct connection, such as TCP/IP
response time, this became an important thing to fix.
It seemed prudent to not move the host’s position at all in
the cache until information about the host became
available, so the function to get the next host in the
cache simply skips over anything the client is already
connected to. Thus, the real-time display of the cache
shows hosts it is connected to highlighted in blue at the
top – although these may move down as more information
about these hosts becomes available and more hosts get
added.
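The change to host selection amounts to skipping over connected hosts rather than removing them. A minimal Python sketch follows, with a hypothetical function name and plain address strings standing in for Phex's host objects.

```python
def next_host(ordered_cache, connected):
    """Return the first cached host we are not already connected to, or None.

    ordered_cache -- host addresses in the order they should be tried
    connected     -- set of addresses with a live connection
    The cache itself is left untouched, so connected hosts keep their place.
    """
    for host in ordered_cache:
        if host not in connected:
            return host
    return None
```

Because nothing is removed, a host that served well in this session is still near the top of the cache the next time the client starts.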
The Gnutella response time of a host is not persistent – it
gets reset every time the program restarts. This enables
us to first connect to the most reliable hosts,
disregarding this piece of information, and as it gets
collected, hosts with a higher Gnutella response time move
to the top of the cache. It is also necessary for a more
fundamental reason – the relative location of that host on
the network will change as your position in the network
changes, so the number becomes less valid as soon as you
connect to or disconnect from a host.
Most other heuristics are persistent, however. TCP/IP
networks are relatively stable so related information need
not be updated so often. A host’s content is, for our
purposes, assumed to be fairly constant, although it will
increase over time and a user can remove files from what he
is sharing. The total amount of content shared is updated
each time we connect to a host; however, the interest rating
of a host is more constant than the actual content it
carries: while the content itself may change, the general
theme of the content can be assumed to be stable.
Testing for this portion of the project was only done with
a single client. The modifications were successful in that
the program is noticeably quicker to gain an initial
connection into the network than before, thanks to some
stability added to the host cache. It does attempt to
connect to hosts from which it is slow to receive responses
over the Gnutella network: often the first ten hosts in
the cache will have a Gnutella response time of over 30
seconds. If all other assumptions are correct, this both
enables us to see parts of the network which were out of
reach to us before, and shows promise of lending
some organization to the network layout if all clients were
to use a similar scheme.
6.2 – Results of File Mirror Site Changes
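Table 4 - Mirrors and File Popularity

| File ID | # of times file was searched | # of times file was uploaded | # of mirrors known after 7 days |
|---|---|---|---|
| A | 878 | 64 | 100 |
| B | 205 | 52 | 100 |
| C | 154 | 41 | 100 |
| D | 432 | 20 | 99 |
| E | 444 | 27 | 87 |
| F | 435 | 0 | 23 |
| G | 192 | 5 | 22 |
| H | 298 | 0 | 18 |
| I | 461 | 1 | 4 |
| J | 20 | 1 | 3 |
| K | 20 | 2 | 3 |
| L | 24 | 2 | 3 |
| M | 19 | 1 | 3 |
| N | 149 | 1 | 2 |
| O | 25 | 2 | 2 |
| P | 22 | 2 | 2 |
| Q | 651 | 1 | 1 |
| R | 195 | 1 | 1 |
| S | 145 | 1 | 1 |
| T | 130 | 1 | 1 |
| U | 128 | 1 | 1 |
| V | 30 | 1 | 1 |
| W | 11 | 1 | 1 |
| X | 8 | 0 | 1 |
| Y | 8 | 0 | 1 |
| Z | 8 | 0 | 1 |
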
The success of the system in acquiring and caching mirrors
for local content varied somewhat by file. Overall, however,
it did gather mirrors for every file it downloaded and
uploaded, and additional mirrors were collected with
varying degrees of success (generally, only for the most
popular files). Table 4 shows the status of the mirror
cache at the end of the 7 day test period. As expected,
each file that was initially acquired from the network (all
files except Q, R, and S) has at least one mirror: The
host that the file was initially downloaded from. Files Q,
R, and S were files our host was seeding onto the network –
they could not have been previously in any other host’s
library. Each was downloaded by another peer once, and
thus has that peer as its only current mirror.
As the table shows, the client had no trouble caching mirrors for
the more popular files. File A in Table 4, which was a
current number one hit single, gained 64 hosts from
uploading alone; while the client only stored the most recent 100
mirrors for the file, thousands of different hosts with
this file were seen over the 7 day period. Files B and C,
which were also commonly uploaded, likewise saw mirrors numbering in
the hundreds. The less popular files fared less well,
most gathering one to four mirrors. A longer stay on the
network may have driven more sharing of these files, and
therefore more mirrors would be found. It would be, for
example, interesting to examine in several months’ time how
many mirrors could be found for the files we seeded.
Overall, uploads provided 40% of the mirrors, the initial
download of the file provided 9% of the mirrors, and 51%
came from search results: either active searches, where the
file came up again in a user’s searches, or passive searches,
where we found a mirror by watching messages pass
through the network. These results generally reflect what
we expected to see, with an initial small number of mirrors
coming from the download of a file, and the majority
coming through passive monitoring of network activity
afterwards.
Part 7: Conclusions and Suggestions for Future Work
The results we achieved with this project barely scratch the
surface of the optimisations that could be implemented on
peer to peer file searching networks such as Gnutella. We
showed, on an existing public network, that it is possible
- without burdening the network with any more messages or
data - to harvest mirrors for files in your library. We
also showed that, again without sending any additional data
over the network, it is possible to build and calculate
heuristics to be more selective about which hosts on the
network to connect to. The full potential of both of these
techniques, however, would require some further changes to
the client, as well as changes to other aspects of the
network which we will study briefly.
It is worth examining further here the effects of the size
of the mirror cache on the validity of mirrors. As with
any caching technique, care must be taken to ensure the data
does not become stale. Specifically, in terms of our
mirror caching techniques, it would not be productive to be
spreading a list of invalid mirrors for a file. It also
may be possible to carefully prune the stored mirror lists
to realize a more reliable data cache.
With the cache size we used, there are two cases we see in
the real world: popular files where the cache is replaced
with new information regularly, and less popular files
where the cache is fairly static – in fact, in these cases
mirrors could remain in the cache indefinitely since the
limit of 100 hosts will never be reached – 100 different
peers will never want that file. Because of the constantly
changing structure of the Gnutella network, a large portion
of these mirrors will only be valid for a short time after
they are seen. Others will only be valid for a short
period of time per day, or per week: whenever the user
starts up their Gnutella peer software.
In the first case - the extreme being file A, but files B
and C also falling into this category – the cache is
constantly being updated with new information. This case
is the more interesting one to examine; we end up with a
mirror cache similar in many ways to the host cache we use
to connect into the Gnutella network, thus some of the same
techniques can be considered. Rather than simply deleting
the oldest host in the cache, the least reliable one could
be deleted instead. Reliability could, as with the host
cache, be calculated based on number of times this host has
been seen, whether a partial or complete file was seen, how
recently this host has been seen, or whether you have
actually downloaded from this host. The precise heuristic
which decides when to discard a host could be some weighted
combination of these factors. In the second case, to
further ensure the utility of cached data, the host could
be discarded before the cache size limit is reached if this
heuristic reached a specific value.
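One way such a weighted heuristic might look is sketched below in Python. Nothing here was implemented or measured in this project: the weights are placeholders chosen only to make the example concrete, and the names `MirrorStats`, `reliability` and `prune` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MirrorStats:
    host: str
    times_seen: int
    has_complete_file: bool
    secs_since_seen: float
    downloaded_from: bool


def reliability(m, w_seen=1.0, w_complete=2.0, w_age=1.0, w_used=3.0):
    """Weighted score; higher means more worth keeping. Weights are placeholders."""
    age_hours = m.secs_since_seen / 3600.0
    return (w_seen * m.times_seen
            + w_complete * m.has_complete_file
            + w_used * m.downloaded_from
            - w_age * age_hours)


def prune(mirrors, limit=100, min_score=None):
    """Keep the `limit` most reliable mirrors, optionally dropping low scorers early."""
    kept = sorted(mirrors, key=reliability, reverse=True)[:limit]
    if min_score is not None:
        kept = [m for m in kept if reliability(m) >= min_score]
    return kept
```

The `limit` argument covers the first case, where a popular file's cache is full, and `min_score` covers the second, where stale mirrors for an unpopular file are dropped before the limit is ever reached.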
Mirrors become especially useful to a client in a peer to
peer network when the client supports multi-source
downloads, and when files become larger than the average 4-
minute mp3-compressed song. Because of the capricious
nature of peer to peer networks, it is difficult to find a
peer willing to share the entirety of a file larger than
this, and generally also difficult to find peers with that
file who are not already uploading it and thus not able to
upload to you as well. The BitTorrent
engine, whose techniques are described in detail by its
creator Bram Cohen [Cohen, 2003], addresses this problem and
offers a solution that works in many situations.
The techniques Cohen uses are based on partial file
sharing. When a user has downloaded only part of a file,
he starts to share it on the network. This way, if two
users are downloading the file, the entire file needs to be
provided only once to the combination of the two users. If
user one has the first half of the file, and user two has
the second, they can then send the part of the file the
other user needs. Cohen’s work is based on a much larger
system, where hundreds or thousands of users can be
involved, each having different and overlapping parts of a
single file. The BitTorrent system is designed to provide
each user with optimal download performance, as well as
attempting to ensure that a complete distributed copy of
the file is available for as long as possible, so that when
peers sharing the complete file disappear from the network,
users will not be left with a file they cannot complete.
This addresses the problem of peers who are
serving files leaving the network before all peers who are
downloading files have completed their downloads. If a
user must upload a file to be able to download it, they
will be much more reliable as a source than a user who has
no incentive to continue sharing a file. By integrating a
system such as this with Gnutella – the basis for which is
already there based on the content mirrors this project
experimented with – the capability of the system would be
52
further enhanced, allowing it to be a reliable source for
larger files.
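The two-user exchange described above can be sketched in
Java as follows. This is only an illustration of partial
file sharing under simplifying assumptions: the class
PartialFilePeer and its methods are hypothetical, are not
part of Phex or BitTorrent, and model neither the network
transfer nor the upload incentives Cohen describes.

```java
import java.util.BitSet;

/** Hypothetical model of a peer that shares the pieces it already holds. */
class PartialFilePeer {
    private final int pieceCount;
    private final BitSet have; // bit i is set when piece i has been downloaded

    PartialFilePeer(int pieceCount) {
        this.pieceCount = pieceCount;
        this.have = new BitSet(pieceCount);
    }

    /** Records that this peer now holds the given piece. */
    void addPiece(int index) {
        if (index < 0 || index >= pieceCount) {
            throw new IllegalArgumentException("no such piece: " + index);
        }
        have.set(index);
    }

    boolean isComplete() {
        return have.cardinality() == pieceCount;
    }

    /** Pieces this peer holds that the other peer still lacks. */
    BitSet piecesUsefulTo(PartialFilePeer other) {
        BitSet useful = (BitSet) have.clone();
        useful.andNot(other.have);
        return useful;
    }

    /** Each peer sends the other what it is missing; returns pieces moved. */
    static int exchange(PartialFilePeer a, PartialFilePeer b) {
        BitSet aToB = a.piecesUsefulTo(b);
        BitSet bToA = b.piecesUsefulTo(a);
        b.have.or(aToB);
        a.have.or(bToA);
        return aToB.cardinality() + bToA.cardinality();
    }
}
```

With a four-piece file, a peer holding pieces 0 and 1 and a
peer holding pieces 2 and 3 both become complete after one
call to exchange, even though no single source ever
uploaded the whole file to either of them.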
Changes to the host cache management of the Phex client
also yielded some interesting results. Hosts with long
response times proved plentiful on the network; however, no
numbers were generated as to what effect giving priority to
these hosts actually has on the network. This model is
better suited to evaluation in a simulation: real-world
results may not be obvious until most clients on the
network are using such an algorithm, a difficult
proposition considering the wide variety of clients in use
and hosts numbering in the hundreds of thousands.
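As a rough illustration of ordering a host cache by
response time, the following Java sketch sorts cached hosts
before connection attempts are made. The names CachedHost
and HostCacheOrdering are hypothetical and do not reflect
the actual Phex classes; the direction of the ordering is
left as a parameter, since only the priority given to hosts
with long response times is discussed here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical host cache entry; not the Phex data structure. */
class CachedHost {
    final String address;
    final long responseTimeMillis; // measured response time of the host

    CachedHost(String address, long responseTimeMillis) {
        this.address = address;
        this.responseTimeMillis = responseTimeMillis;
    }
}

class HostCacheOrdering {
    /**
     * Returns a copy of the cache in the order connections would be tried.
     * longestFirst = true gives priority to hosts with long response times.
     */
    static List<CachedHost> order(List<CachedHost> hosts, boolean longestFirst) {
        Comparator<CachedHost> byResponse =
                Comparator.comparingLong(h -> h.responseTimeMillis);
        if (longestFirst) {
            byResponse = byResponse.reversed();
        }
        List<CachedHost> sorted = new ArrayList<>(hosts); // input is left untouched
        sorted.sort(byResponse);
        return sorted;
    }
}
```

Keeping the direction as a flag makes it easy to compare
both policies in a simulation of the kind suggested above.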
Content-based heuristics to choose a host to connect to
were only minimally implemented, and only used as a
secondary statistic on which to order the host cache.
These could prove interesting to a single client on the
network, as it could seek out hosts with similar interests
and connect to them directly. Again, however, this concept
would be better examined in a simulation, as the eventual
goal of this heuristic is the formation of sub-networks or
sub-graphs of clients with similar content, something that
would again require a large group of hosts on the network.
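One way a content-based score could act as a secondary
ordering statistic is sketched below in Java. Everything in
it is an assumption made for illustration: the keyword sets,
the Jaccard overlap measure, and the names ContentAwareHost
and ContentHeuristic are hypothetical, and the heuristic
actually implemented in the modified client may differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical cache entry that also records keywords seen from a host. */
class ContentAwareHost {
    final String address;
    final long responseTimeMillis;
    final Set<String> sharedKeywords;

    ContentAwareHost(String address, long responseTimeMillis, Set<String> sharedKeywords) {
        this.address = address;
        this.responseTimeMillis = responseTimeMillis;
        this.sharedKeywords = sharedKeywords;
    }
}

class ContentHeuristic {
    /** Jaccard overlap of two keyword sets: 0.0 = disjoint, 1.0 = identical. */
    static double overlap(Set<String> local, Set<String> remote) {
        if (local.isEmpty() && remote.isEmpty()) {
            return 0.0; // no evidence of shared interests
        }
        Set<String> common = new HashSet<>(local);
        common.retainAll(remote);
        Set<String> union = new HashSet<>(local);
        union.addAll(remote);
        return (double) common.size() / union.size();
    }

    /** Orders by the primary statistic, breaking ties by content overlap. */
    static List<ContentAwareHost> order(List<ContentAwareHost> hosts,
                                        Set<String> localKeywords,
                                        Comparator<ContentAwareHost> primary) {
        Comparator<ContentAwareHost> secondary = Comparator
                .comparingDouble((ContentAwareHost h) -> overlap(localKeywords, h.sharedKeywords))
                .reversed(); // higher overlap is tried first
        List<ContentAwareHost> sorted = new ArrayList<>(hosts);
        sorted.sort(primary.thenComparing(secondary));
        return sorted;
    }
}
```

Because the overlap is consulted only when the primary
statistic ties, it influences the connection order without
overriding it, which matches its role as a secondary
statistic.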
References
Cohen, B. (2003) Incentives Build Robustness in BitTorrent, http://www.bitconjurer.org/BitTorrent/bittorrentecon.pdf

Klingberg, T., Manfredi, R. (2002) Gnutella 0.6, http://rfc-gnutella.sourceforge.net/src/rfc-0_6-draft.html

Lv, Q., Cao, P., Cohen, E., Li, K., Shenker, S. (2002) Search and Replication in Unstructured Peer-to-Peer Networks, http://www.cs.princeton.edu/~qlv/download/searchp2p_full.pdf
Appendix – Contents of Included Disc

The included disc contains a soft copy of this document, as well as the source code to the modified Phex client. The Phex client that was modified was acquired from Phex CVS on SourceForge (www.sf.net), thus the changes made could be submitted to the source repository with ease.

Modified files are marked as changed through CVS; if the reader is interested in examining the source code modifications, using the latest version of Eclipse (www.eclipse.org) to open the project will provide the best results. Included in the Eclipse project is the CVS information, so modified files will be marked using Eclipse’s built-in CVS support. Building the client can be done using Eclipse once the project is opened. Executing the project is most easily done by using the Run command from within Eclipse as well.

Source code is included in the source/ folder; documents are in the documents/ folder; Eclipse is available in the eclipse/ folder; related documents are included in the references/ folder.