58
Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Embed Size (px)

Citation preview

Page 1: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Mapping the Gnutella Network

Presented By:

Tony Young

M.Math Candidate

October 7th, 2004

Page 2: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Outline

Introduction Gnutella in Depth The Crawler Analysis of Network Summary and Improvements Paper Review

Page 3: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Outline

Introduction Gnutella in Depth The Crawler Analysis of Network Summary and Improvements Paper Review

Page 4: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Introduction

Peer to peer systems have recently exploded onto the internet scene

Two main contributing factors: Low cost and high availability of resources

(computing and storage) Increased network connectivity (proliferation

of “always on” connections)

Page 5: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Introduction Peer systems build a virtual topology (overlay)

with its own routing mechanisms The topology of the overlay and routing protocols

directly affects Performance: Number of physical hops to send a

message through virtual overlay Reliability: Will a message actually reach the other end Scalability: Can other nodes be added while keeping

performance good Anonymity: Can we protect the identity of nodes in the

network

Page 6: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Introduction Gnutella is studied in depth and analysis is

performed to determine how the overlay affects the four characteristics previously mentioned

Started by capturing the network topology and behaviour

Performed a macroscopic analysis of the network to evaluate costs and benefits

Investigated possible improvements

Page 7: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Introduction

Two questions drive analysis What is the connectivity structure of

Gnutella? How well does the Gnutella overlay map to

the actual network topology?

Page 8: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Introduction

Connectivity Structure Networks as diverse as natural networks

usually have a few well connected nodes and many poorly connected nodes

• I.e. Power Law Networks We will see Gnutella is not a pure power law

network, but still has good fault tolerance and is less resistant to DoS attacks

Page 9: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Introduction

Overlay Topology Important for ISP’s: overlays that don’t map

closely to the physical topology adds additional stress on the infrastructure and costs ISP’s more money

Scalability is directly linked to efficient use of network resources

Page 10: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Outline

Introduction Gnutella in Depth The Crawler Analysis of Network Summary and Improvements

Page 11: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Gnutella in Depth Gnutella is an open protocol It is decentralized and unstructured Allows group membership and searching of available

files for download Gnutella should operate in a dynamic environment where

hosts can join/leave at any time Gnutella should experience good performance and

scalability External attacks should not cause data loss or

performance degradation Users seeking or providing unpopular material should stay

anonymous

Page 12: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Gnutella in Depth

Gnutella nodes are called “servents” (SERVer-cliENTS) Provide a client-side interface to allow

searching of file base Provide server-side storage, routing and

response to network messages and requests

Page 13: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Gnutella in Depth

To connect, a node contacts an “always on” host (I.e. gnutella.com) and sends a PING

Node replies with a PONG and forwards the PING on to other nodes in the network who reply with PONG messages and forward the PING on PING stops after TTL hops

Page 14: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Gnutella in Depth

To find files, users submit QUERY messages to other nodes Messages are broadcast to all neighbours

who forward them on to other neighbours, etc. for TTL hops

QUERY RESPONSE messages are returned to the querying node

Page 15: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Gnutella in Depth

To download a file, nodes send GET and PUSH messages to individual hosts holding a file I.e. transfer requests and transfers are routed

directly between communicating hosts, and not back-propagated

Page 16: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Gnutella in Depth Messaging protocol has three important

features TTL and “hops passed” fields are attached to each

message Randomly generated message ID is attached to each

message Each node keeps track of recently routed messages

to prevent re-broadcasting and to implement back-propagation

Page 17: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Gnutella in Depth

PING message contains the host address and name, number of files and size of data store

PONG message contains the same information from the host that received the PING

Page 18: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Gnutella in Depth PING messages propagate until TTL has

expired Hop count incremented at each servent

receiving the PING Message propagates until hop count = TTL

PONG messages are back-propagated (I.e. sent on the reverse path that the original message followed) to the host initiating the PING

Page 19: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Gnutella in Depth

QUERY messages are sent the same way as a PING message Nodes check the search string requested

against the names of their locally stored files QUERY RESPONSE messages are

back-propagated to the querying node and include information necessary to download the file

Page 20: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Outline

Introduction Gnutella in Depth The Crawler Analysis of Network Summary and Improvements Paper Review

Page 21: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

The Crawler In order to conduct the network tests, a crawler

was developed to gather information about the virtual topology

Crawler starts with a list of active nodes and sends a PING message to each of them

PONG messages are received and the IP, port, number of stored files and size of archive are stored in a table

PING propagates to other nodes and PONG back propagates to crawler

Page 22: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

The Crawler A sequential version of the crawler was initially

developed I.e. send a PING with an empirically determined

optimal TTL to a set of nodes; resend to the nodes where the PING stops, etc.

Proved to be very slow: 50 hours to collect data from a 4 000 node network

Slowness means two things: Not scalable: Will get slower as we add more nodes Does not give an accurate network snapshot: network

changes drastically over 50 hours!

Page 23: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

The Crawler A distributed crawler was developed next

Client-Server architecture Server maintains node list and creates a network

graph Clients receive a list of nodes to contact and discover

neighbours for Decided to use only 50 clients at once

Reduces invasiveness of search and consumption of network resources

Reduced crawling time to a couple of hours for a large initial list and a network of 30 000 nodes

Page 24: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

The Crawler Network membership is defined as follows

A node is a member of the network if the crawler is able to connect to it

A node might be excluded from network membership if it was reported as active by a server or other node, but the crawler could not contact it

• This might happen if nodes go offline before the crawler can contact them

Page 25: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Outline Introduction Gnutella in Depth The Crawler Analysis of Network

Growth Trends Traffic Estimates Connectivity and Reliability Overlay vs. Topology

Summary and Improvements Paper Review

Page 26: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Analysis of Network

Data was collected over a 6 month period Data shows:

Overhead traffic is reducing Traffic volume is a significant barrier to

growth

Page 27: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Growth Trends

Size of network is growing rapidly Largest connected component in November

2000 had 2 063 neighbours Largest connected component in May 2001

had 48 195 neighbours! Number of neighbours for the largest

connected component has grown 25 times!

Page 28: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Growth Trends

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Page 29: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Growth Trends

Despite the explosive growth, most nodes are not connected long

Successive crawls of the network found: 40% of nodes leave the network in less than

4 hours 25% of nodes are alive for more than 24

hours

Page 30: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Traffic Estimates A modified version of the crawler

recorded traffic generated across one randomly chosen link 36% of total traffic (in bytes) is user

generated QUERY messages 55% is group membership (PING/PONG)

messages 9% is non-standard or malformed messages N.B. File transfer traffic is excluded

Page 31: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Traffic Estimates

After June 2001 (when new Gnutella implementation was released) 92% of total traffic (in bytes) was QUERY

messages 8% is group membership (PING/PONG)

messages N.B. File transfer traffic is excluded

Page 32: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Traffic Estimates 95% of all nodes are reachable within 7 hops.

Thus, each message typically uses a TTL = 7 Most links are expected to support similar amounts

of traffic for these reasons

As verified empirically, the total Gnutella generated traffic is proportional to the number of connections in the network However, the average number of connections per

node stays relatively constant as the network grows

Page 33: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Traffic Estimates

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Page 34: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Traffic Estimates

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Page 35: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Traffic Estimates

The total traffic estimate for the Gnutella network is 1 Gbps I.e. 170 000 connections for a 50 000 node

network times 6 kbps per connection This is approximately 330 TB/month! Excluding file transfers!

Page 36: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Traffic Estimates

This total is 1.7% of the total internet traffic in US backbones in December 2000

This volume of traffic is believed to be an obstacle to further growth

The underlying network topology must be used more efficiently to allow scaling and wider deployment

Page 37: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Connectivity and Reliability Note: Nodes decide locally:

How many connections to support When to add or drop a connection

Recent research shows that many natural systems organize themselves into “power law networks” I.e. networks where a few nodes are well

connected and most nodes have very few connections

Page 38: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Connectivity and Reliability Power law networks:

Number of nodes with L links (connections) is proportional to L-k where k is system-dependent

Resilient to losing many poorly connected nodes

Falls apart quickly if only a few well connected nodes are lost

Extremely robust to random failures, but vulnerable to targeted attacks

Page 39: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Connectivity and Reliability Power law networks appear as a linear

system on a log-log plot Data for December 2000 shows that early

Gnutella networks were power law Data for March 2001 shows that later

Gnutella networks are a mixture• There are a constant number of nodes with fewer

than 10 links• Above 10 links, nodes follow a power law

structure

Page 40: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Connectivity and Reliability

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Page 41: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Connectivity and Reliability

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Page 42: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Connectivity and Reliability

Why did the distribution change? Two possible reasons:

About 20% of Gnutella users have modem connections - DSL and up can support more connections

Gnutella users run as many connections as their network can support - perception is that more connections = better query results

Page 43: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Connectivity and Reliability

Does the change in distribution affect reliability? Yes! Preserves resilience to random failures Makes network less dependent on well

connected nodes and hence less prone to DoS attacks

Page 44: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Overlay vs. Topology

Peer systems change the way bandwidth is used on the internet Servers are at the edge of the network now,

and peers are constantly downloading Most ISP’s use flat-rate billing

Peer systems may break this model!

Page 45: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Overlay vs. Topology Due to the amount of traffic peer systems

generate, efficient use of resources is important The greater the mismatch between the

overlay and the physical network topology, the more messages need to be transmitted to route information from A to B

This means more stress on the network resources

Page 46: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Overlay vs. Topology

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Communication from A to all other nodes requires one message over the D - E link

Page 47: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Overlay vs. Topology

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Communication from A to all other nodes requires six messages over the D - E link

Page 48: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Overlay vs. Topology

How well does Gnutella map to the topology? Assume that domain names are roughly

evident of the hierarchy of the internet Check how well generated traffic maps to the

cluster of domain names found by the crawler

Page 49: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Overlay vs. Topology

After analysis of 10 overlays, it was found that Gnutella nodes often connect to peers outside of their respective domains

Thus, it appears that Gnutella does not make efficient use of the underlying topology

Page 50: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Outline

Introduction Gnutella in Depth The Crawler Analysis of Network Summary and Improvements Paper Review

Page 51: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Summary and Improvements Gnutella has a multimodal connectivity

distribution that is partially constant and partially power law Network is resilient to random failures Network is harder to attack by malicious parties, but

not immune to DoS attacks

Gnutella makes little effort to ward off attackers E.g. topology, connectivity and traffic information is

easy to obtain and can be used to plan attacks

Page 52: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Summary and Improvements Gnutella’s traffic volume is a significant

fraction of all internet traffic Makes the future growth of the network

reliant on efficient use of the topology Gnutella’s overlay does not match the

network topology very well This increases quite substantially the number

of messages and the amount of network traffic generated

Page 53: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Summary and Improvements

Necessary improvements Make efforts to hide overlay and connectivity

information (encryption?) Match overlay more closely with topology Limits to growth must be solved first and fast

at the rate that Gnutella is growing

Page 54: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Summary and Improvements

Suggested Improvements Exploit locality of files and query distribution

(I.e. caching and localized queries) Replace query flooding strategy with

something more efficient (I.e. superpeer routing and group communication)

Page 55: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Outline

Introduction Gnutella in Depth The Crawler Analysis of Network Summary and Improvements Paper Review

Page 56: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Paper Review Organization

Some discussions of the Gnutella architecture and protocols were scattered throughout the paper

Should have combined everything into a more logical order inside the protocol section

Writing Style Generally very good. Some missing words

and poor grammar

Page 57: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Paper Review Novel Ideas

Presented a qualitative and quantitative analysis of the Gnutella network, and some important points for P2P as a whole

Content Some backing information was missing Some claims were made without supporting

evidence, or just referring the reader to another paper

Page 58: Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004

Questions?