Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Preview:

Citation preview

Mapping the Gnutella Network

Presented By:

Tony Young

M.Math Candidate

October 7th, 2004

Outline

Introduction Gnutella in Depth The Crawler Analysis of Network Summary and Improvements Paper Review

Outline

Introduction Gnutella in Depth The Crawler Analysis of Network Summary and Improvements Paper Review

Introduction

Peer to peer systems have recently exploded onto the internet scene

Two main contributing factors: Low cost and high availability of resources

(computing and storage) Increased network connectivity (proliferation

of “always on” connections)

Introduction Peer systems build a virtual topology (overlay)

with its own routing mechanisms The topology of the overlay and routing protocols

directly affects Performance: Number of physical hops to send a

message through virtual overlay Reliability: Will a message actually reach the other end Scalability: Can other nodes be added while keeping

performance good Anonymity: Can we protect the identity of nodes in the

network

Introduction Gnutella is studied in depth and analysis is

performed to determine how the overlay affects the four characteristics previously mentioned

Started by capturing the network topology and behaviour

Performed a macroscopic analysis of the network to evaluate costs and benefits

Investigated possible improvements

Introduction

Two questions drive analysis What is the connectivity structure of

Gnutella? How well does the Gnutella overlay map to

the actual network topology?

Introduction

Connectivity Structure Networks as diverse as natural networks

usually have a few well connected nodes and many poorly connected nodes

• I.e. Power Law Networks We will see Gnutella is not a pure power law

network, but still has good fault tolerance and is less resistant to DoS attacks

Introduction

Overlay Topology Important for ISP’s: overlays that don’t map

closely to the physical topology adds additional stress on the infrastructure and costs ISP’s more money

Scalability is directly linked to efficient use of network resources

Outline

Introduction Gnutella in Depth The Crawler Analysis of Network Summary and Improvements

Gnutella in Depth Gnutella is an open protocol It is decentralized and unstructured Allows group membership and searching of available

files for download Gnutella should operate in a dynamic environment where

hosts can join/leave at any time Gnutella should experience good performance and

scalability External attacks should not cause data loss or

performance degradation Users seeking or providing unpopular material should stay

anonymous

Gnutella in Depth

Gnutella nodes are called “servents” (SERVer-cliENTS) Provide a client-side interface to allow

searching of file base Provide server-side storage, routing and

response to network messages and requests

Gnutella in Depth

To connect, a node contacts an “always on” host (I.e. gnutella.com) and sends a PING

Node replies with a PONG and forwards the PING on to other nodes in the network who reply with PONG messages and forward the PING on PING stops after TTL hops

Gnutella in Depth

To find files, users submit QUERY messages to other nodes Messages are broadcast to all neighbours

who forward them on to other neighbours, etc. for TTL hops

QUERY RESPONSE messages are returned to the querying node

Gnutella in Depth

To download a file, nodes send GET and PUSH messages to individual hosts holding a file I.e. transfer requests and transfers are routed

directly between communicating hosts, and not back-propagated

Gnutella in Depth Messaging protocol has three important

features TTL and “hops passed” fields are attached to each

message Randomly generated message ID is attached to each

message Each node keeps track of recently routed messages

to prevent re-broadcasting and to implement back-propagation

Gnutella in Depth

PING message contains the host address and name, number of files and size of data store

PONG message contains the same information from the host that received the PING

Gnutella in Depth PING messages propagate until TTL has

expired Hop count incremented at each servent

receiving the PING Message propagates until hop count = TTL

PONG messages are back-propagated (I.e. sent on the reverse path that the original message followed) to the host initiating the PING

Gnutella in Depth

QUERY messages are sent the same way as a PING message Nodes check the search string requested

against the names of their locally stored files QUERY RESPONSE messages are

back-propagated to the querying node and include information necessary to download the file

Outline

Introduction Gnutella in Depth The Crawler Analysis of Network Summary and Improvements Paper Review

The Crawler In order to conduct the network tests, a crawler

was developed to gather information about the virtual topology

Crawler starts with a list of active nodes and sends a PING message to each of them

PONG messages are received and the IP, port, number of stored files and size of archive are stored in a table

PING propagates to other nodes and PONG back propagates to crawler

The Crawler A sequential version of the crawler was initially

developed I.e. send a PING with an empirically determined

optimal TTL to a set of nodes; resend to the nodes where the PING stops, etc.

Proved to be very slow: 50 hours to collect data from a 4 000 node network

Slowness means two things: Not scalable: Will get slower as we add more nodes Does not give an accurate network snapshot: network

changes drastically over 50 hours!

The Crawler A distributed crawler was developed next

Client-Server architecture Server maintains node list and creates a network

graph Clients receive a list of nodes to contact and discover

neighbours for Decided to use only 50 clients at once

Reduces invasiveness of search and consumption of network resources

Reduced crawling time to a couple of hours for a large initial list and a network of 30 000 nodes

The Crawler Network membership is defined as follows

A node is a member of the network if the crawler is able to connect to it

A node might be excluded from network membership if it was reported as active by a server or other node, but the crawler could not contact it

• This might happen if nodes go offline before the crawler can contact them

Outline Introduction Gnutella in Depth The Crawler Analysis of Network

Growth Trends Traffic Estimates Connectivity and Reliability Overlay vs. Topology

Summary and Improvements Paper Review

Analysis of Network

Data was collected over a 6 month period Data shows:

Overhead traffic is reducing Traffic volume is a significant barrier to

growth

Growth Trends

Size of network is growing rapidly Largest connected component in November

2000 had 2 063 neighbours Largest connected component in May 2001

had 48 195 neighbours! Number of neighbours for the largest

connected component has grown 25 times!

Growth Trends

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Growth Trends

Despite the explosive growth, most nodes are not connected long

Successive crawls of the network found: 40% of nodes leave the network in less than

4 hours 25% of nodes are alive for more than 24

hours

Traffic Estimates A modified version of the crawler

recorded traffic generated across one randomly chosen link 36% of total traffic (in bytes) is user

generated QUERY messages 55% is group membership (PING/PONG)

messages 9% is non-standard or malformed messages N.B. File transfer traffic is excluded

Traffic Estimates

After June 2001 (when new Gnutella implementation was released) 92% of total traffic (in bytes) was QUERY

messages 8% is group membership (PING/PONG)

messages N.B. File transfer traffic is excluded

Traffic Estimates 95% of all nodes are reachable within 7 hops.

Thus, each message typically uses a TTL = 7 Most links are expected to support similar amounts

of traffic for these reasons

As verified empirically, the total Gnutella generated traffic is proportional to the number of connections in the network However, the average number of connections per

node stays relatively constant as the network grows

Traffic Estimates

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Traffic Estimates

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Traffic Estimates

The total traffic estimate for the Gnutella network is 1 Gbps I.e. 170 000 connections for a 50 000 node

network times 6 kbps per connection This is approximately 330 TB/month! Excluding file transfers!

Traffic Estimates

This total is 1.7% of the total internet traffic in US backbones in December 2000

This volume of traffic is believed to be an obstacle to further growth

The underlying network topology must be used more efficiently to allow scaling and wider deployment

Connectivity and Reliability Note: Nodes decide locally:

How many connections to support When to add or drop a connection

Recent research shows that many natural systems organize themselves into “power law networks” I.e. networks where a few nodes are well

connected and most nodes have very few connections

Connectivity and Reliability Power law networks:

Number of nodes with L links (connections) is proportional to L-k where k is system-dependent

Resilient to losing many poorly connected nodes

Falls apart quickly if only a few well connected nodes are lost

Extremely robust to random failures, but vulnerable to targeted attacks

Connectivity and Reliability Power law networks appear as a linear

system on a log-log plot Data for December 2000 shows that early

Gnutella networks were power law Data for March 2001 shows that later

Gnutella networks are a mixture• There are a constant number of nodes with fewer

than 10 links• Above 10 links, nodes follow a power law

structure

Connectivity and Reliability

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Connectivity and Reliability

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Connectivity and Reliability

Why did the distribution change? Two possible reasons:

About 20% of Gnutella users have modem connections - DSL and up can support more connections

Gnutella users run as many connections as their network can support - perception is that more connections = better query results

Connectivity and Reliability

Does the change in distribution affect reliability? Yes! Preserves resilience to random failures Makes network less dependent on well

connected nodes and hence less prone to DoS attacks

Overlay vs. Topology

Peer systems change the way bandwidth is used on the internet Servers are at the edge of the network now,

and peers are constantly downloading Most ISP’s use flat-rate billing

Peer systems may break this model!

Overlay vs. Topology Due to the amount of traffic peer systems

generate, efficient use of resources is important The greater the mismatch between the

overlay and the physical network topology, the more messages need to be transmitted to route information from A to B

This means more stress on the network resources

Overlay vs. Topology

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Communication from A to all other nodes requires one message over the D - E link

Overlay vs. Topology

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Communication from A to all other nodes requires six messages over the D - E link

Overlay vs. Topology

How well does Gnutella map to the topology? Assume that domain names are roughly

evident of the hierarchy of the internet Check how well generated traffic maps to the

cluster of domain names found by the crawler

Overlay vs. Topology

After analysis of 10 overlays, it was found that Gnutella nodes often connect to peers outside of their respective domains

Thus, it appears that Gnutella does not make efficient use of the underlying topology

Outline

Introduction Gnutella in Depth The Crawler Analysis of Network Summary and Improvements Paper Review

Summary and Improvements Gnutella has a multimodal connectivity

distribution that is partially constant and partially power law Network is resilient to random failures Network is harder to attack by malicious parties, but

not immune to DoS attacks

Gnutella makes little effort to ward off attackers E.g. topology, connectivity and traffic information is

easy to obtain and can be used to plan attacks

Summary and Improvements Gnutella’s traffic volume is a significant

fraction of all internet traffic Makes the future growth of the network

reliant on efficient use of the topology Gnutella’s overlay does not match the

network topology very well This increases quite substantially the number

of messages and the amount of network traffic generated

Summary and Improvements

Necessary improvements Make efforts to hide overlay and connectivity

information (encryption?) Match overlay more closely with topology Limits to growth must be solved first and fast

at the rate that Gnutella is growing

Summary and Improvements

Suggested Improvements Exploit locality of files and query distribution

(I.e. caching and localized queries) Replace query flooding strategy with

something more efficient (I.e. superpeer routing and group communication)

Outline

Introduction Gnutella in Depth The Crawler Analysis of Network Summary and Improvements Paper Review

Paper Review Organization

Some discussions of the Gnutella architecture and protocols were scattered throughout the paper

Should have combined everything into a more logical order inside the protocol section

Writing Style Generally very good. Some missing words

and poor grammar

Paper Review Novel Ideas

Presented a qualitative and quantitative analysis of the Gnutella network, and some important points for P2P as a whole

Content Some backing information was missing Some claims were made without supporting

evidence, or just referring the reader to another paper

Questions?

Recommended