I. Alvarez-Hamelin (LPT, France) L. Dall’Asta (LPT, France)

Traceroute-like exploration of unknown networks: a statistical

analysis

A. Barrat, LPT, Université Paris-Sud, FranceI. Alvarez-Hamelin (LPT, France)L. Dall’Asta (LPT, France)A. Vázquez (Notre-Dame University, USA)A. Vespignani (LPT, France)

cond-mat/0406404

http://www.th.u-psud.fr/page_perso/Barrat

Plan of the talk

• Apparent complexity of Internet’s structure

• Problem of sampling biases

• Model for traceroute

• Theoretical approach of the traceroute mapping process

• Numerical results

• Conclusions

•Multi-probe reconstruction (router-level)•Use of BGP tables for the Autonomous System level (domains)

Graph representation

different granularities

Internet representation

Many projects (CAIDA, NLANR, RIPE, IPM, PingER...)

=>Large-scale visualizations

TOPOLOGICAL ANALYSIS by STATISTICAL TOOLS and GRAPH THEORY !

Main topological characteristics

•Small world

Distribution of Shortest paths (# hops) between two nodes


•Small world : captured by Erdös-Renyi graphs

Poisson distribution

<k> = p N

With probability p an edge is established among couple of vertices


•Small world•Large clustering: different neighbours of a node will likely know each other

1

2

3

n

Higher probability to be connected

=>graph models with large clustering, e.g. Watts-Strogatz 1998


•Small world•Large clustering•Dynamical network•Broad connectivity distributions

•also observed in many other contexts (from biological to social networks)•huge activity of modeling


Broad connectivity distributions:obtained from mapping projects

Govindan et al 2002

Result of a sampling: is this reliable ?

Sampling biases

Internet mapping:

Sampling is incompleteLateral connectivity is missed (edges are underestimated)Finite size sample

=> spanning tree

Sampling biases

● Vertices and edges best sampled in the proximity of sources

● Bad estimation of some topological properties

Statistical properties of the sampled graph may sharply differ from the original one

Lakhina et al. 2002

Clauset & Moore 2004

De Los Rios & Petermann 2004

Guillaume & Latapy 2004

Bad sampling

?

Evaluating sampling biases

Real graph G=(V,E) (known, with given properties )

Sampled graph G’=(V’,E’)

simulated sampling

Analysis of G’, comparison with G

Model for traceroute

G=(V, E)

Sources Targets

First approximation: union of shortest paths

NB: Unique Shortest Path

Model for traceroute

G’=(V’, E’)

First approximation: union of shortest paths

Very simple model, but: allows for some analyticaland numerical understanding

More formally...

G = (V, E): sparse undirected graph with a set of

NS sources S = { i1, i2, …, iNS }

NT targets T = {j1, j2, …, jNT }

randomly placed.

The sampled graph G’=(V’,E’) is obtained by considering the union of all the traceroute-like paths connecting source-target pairs.

PARAMETERS: ,N

NSS ,

N

NTT

N

NN TS (probing effort)

Usually NS=O(1), T=O(1)

For each set = { S,T }, the indicator function that a given edge (i, j) belongs to the sampled graph is

ml

N

s

mlij

N

tmjliij

S T

ts1

),(

1

11

with

0

1),( mlij

if (i,j) path between l,m

otherwise,S

N

sii N

S

s

1

T

N

tij N

T

t

1

Analysis of the mapping process

Averaging over all the possible realizations of the set = { S,T },

ml

mlijTS

ml

N

s

N

t

mlijmjliij

S T

ts)1(111 ),(

1 1

),(

WE HAVE NEGLECTED CORRELATIONS !!

• usually S T << 1

•

Mean-field statistical analysis

ij > 1 – exp(- S T bij )

betweenness

Betweenness centrality b

for each pair of nodes (l,m) in the graph, there are

lm shortest paths between l and m

ijlm shortest paths going through ij

bij is the sum of ijlm

/ lm over all pairs (l,m)

Similar concept: node betweenness bi

ij

kij: large betweenness

jk: small betweenness

Also: flow of information if each individual sends a message to all other individuals

Consequences of the analysis

ij > 1 – exp(- S T bij )

•Smallest betweenness bij=2 => <ij > 2 S T i

j

•Largest betweenness bij=O(N2) => <ij > 1

(i.e. i and j have to be source and target to discover i-j)

Results for the vertices

ij > 1 – exp(- bij/N)

<i > 1 – (1-T) exp( - bi/N ) (discovery probability)

N*k/Nk 1 – exp( - b(k)/N ) (discovery frequency)

<k*>/k ( 1 + b(k)/N )/k (discovered connectivity)

Summary:

• discovery probability strongly related with the centrality ;

• vertex discovery favored by a finite density of targets ;

• accuracy increased by increasing probing effort .

Numerical simulations

1. Homogeneous graphs:

• peaked distributions of k and b

• narrow range of betweenness

11,max ebb

(ex: ER random graphs)

=> Good sampling expected only for high probing effort


• broad distributions of k and b

P(k) ~ k-3

• large range of available values

1

k

1. Homogeneous graphs

2. Heavy-tailed graphs

(ex: Scale-free BA model)

=> Expected: Hubs well-sampled independently of

(b(k) ~k

Homogeneous graphs

Homogeneously pretty badly sampled


Hubs are well discovered

Numerical simulationsScale-free graphs

No heavy-tail behaviour except....

• heavy-tailed P*(k) only for NS = 1 (cf Clauset and Moore 2004)

• cut-off around <k> => large, unrealistic <k> needed

• bad sampling of P(k)

Numerical simulationsHomogeneous graphs

NS=1

• good sampling, especially of the heavy-tail;

• almost independent of NS ;

• slight bending for low degree (less central) nodes => bad evaluation of the exponents.

Scale-free graphs


Summary

Analytical approach to a traceroute-like sampling process

Link with the topological properties, in particular the betweenness

Usual random graphs more “difficult” to sample than heavy-tails

Heavy-tails well sampled

Bias yielding a sampled scale-free network from a homogeneous network:only in few cases

<k> has to be unrealistically large

Heavy tails properties are a genuine feature of the Internet

however

Quantitative analysis might be strongly biased (wrong exponents...)

•Optimized strategies:•separate influence of T, S

•location of sources, targets•investigation of other networks

•Results on redundancy issues•Massive deployment traceroute@home

Perspectives

•The internet is a weighted networks bandwidth, traffic, efficiency, routers capacity

•Data are scarse and on limited scale

•Interaction among topology and traffic

and...

cf also Guillaume and Latapy 2004

Documents

I. Alvarez-Hamelin (LPT, France) L. Dall’Asta (LPT, France)