Modeling the webgraph evolution

Journal of Computational Science 2 (2011) 67–79

Contents lists available at ScienceDirect

Journal of Computational Science

journal homepage: www.elsevier.com/locate/jocs

Roberto da Silva a, Luciana S. Buriol a,∗, Leila Ribeiro a, Fernando L. Dotti b

a Instituto de Informática - Universidade Federal do Rio Grande do Sul, Av. Bento Gonçalves, 9500, 91.509-900 Porto Alegre, Brazil
b Faculdade de Informática - Pontifícia Universidade Católica do Rio Grande do Sul, Av. Ipiranga, 6681, 90.619-900 Porto Alegre, Brazil

Article info

Article history:
Received 17 June 2010
Received in revised form 2 November 2010
Accepted 15 November 2010
Available online 22 December 2010

Keywords:
Webgraph evolution
Graph grammars
Graph models

Abstract

The impact of the web as a source of information and services is growing continuously, and consequently the importance of appropriate design of algorithms for the web has increased. Such algorithms depend both on the web structure and on how it evolves over time. The webgraph is formed by the link structure of webpages. It is well known that this graph is sparse, huge and dynamic, i.e., it changes over time. A challenge related to this topic is how to model the webgraph evolution taking into account its characteristics.

As the main contribution of this work, we propose a new approach to model webgraph evolution. Three different models are defined. The last presented model describes webgraph evolution in a more faithful way, as can be seen from the analysis of the generated graphs. Nevertheless, the other models are also interesting on their own, and could be used when only some characteristics of the generated graphs are of interest. Moreover, the stepwise construction and comparison of the models is useful to make explicit how the generated graphs are affected by changes of the rules governing the proposed evolution models. All proposed models are based on node and arc insertions according to the preferential attachment. The probabilistic algorithms derived from the models were implemented and their resulting graphs analysed and compared with results from the literature. This study has revealed that the models were able to generate synthetic webgraphs with many important characteristics found in real-world webgraphs. As a second contribution of the paper, we introduce the use of graph grammars as a specification language to describe the webgraph evolution. Graph grammars are a formalism that allows the description of graph transformations in a natural, graphical and precise way, providing a good basis for understanding, reasoning about and comparing specifications.

© 2010 Elsevier B.V. All rights reserved.

∗ Corresponding author. Tel.: +55 51 3308 6827; fax: +55 51 3308 7308.
E-mail addresses: [email protected] (R. da Silva), [email protected] (L.S. Buriol), [email protected] (L. Ribeiro), [email protected] (F.L. Dotti).

1877-7503/$ – see front matter © 2010 Elsevier B.V. All rights reserved.
doi:10.1016/j.jocs.2010.11.003

1. Introduction

The webgraph is the graph generated from the link structure of the web pages. In this graph, each node represents a web page and each arc is a hyperlink from one page to another. The webgraph is considered a massive data set due to its large size. In July 2005, a study by Gulli and Signorini [22] showed that the webgraph measured about 11.5 billion indexable pages. In June 2010, the indexed web was estimated to have at least 55 billion pages [1]. A search engine is an information retrieval system that, given some criteria about an item of interest, searches the webgraph for the items that match these criteria. In this context, the criteria for the search are usually referred to as the search query. The use of search engines is becoming more and more common for the most diverse purposes, such as work, leisure, study, and research in general. A good and fast answer to a query in a search engine is thus becoming of extreme relevance for most people. Since the development of efficient search engines is highly dependent on the structure of the webgraph, the study of the graph properties of the webgraph gained much attention in the last decade. Moreover, not only the structure at one moment is of importance, but also the way the webgraph evolves over time, since the webgraph is a highly dynamic structure. Thus, finding the rules that govern webgraph evolution is a great research challenge.

From a graph theory point of view, the study of the webgraph is a challenge, not only because of its size, but also because of the diversity of substructures it has. Exploring properties of very large graphs, such as the webgraph, is not a trivial task. In the last decade much effort has been made to compute properties of this graph in order to know more about its structure.

One of the first properties detected in the webgraph is the power law distribution of the in-degree of nodes. Under this distribution, the probability that a node u is linked by exactly k other nodes is

\[ \Pr[\,|IN(u)| = k\,] \sim \frac{1}{k^{\alpha}} \quad \text{for } k \in \mathbb{Z} \qquad (1) \]


where IN(u) denotes the set of incoming links of node u (its size is the in-degree of u). Usually in webgraphs we have $\alpha \approx 2.1$ [9,12]. In [31], four real-world webgraphs were analysed and the $\alpha$ values found were 1.9, 1.7, 2.2, and 1.6. Surprisingly, power law distributions are found when analyzing other properties of webgraphs. In 2000, Broder et al. [9] showed that the in-degree and out-degree of nodes of the webgraph follow the power law distribution (even though a more recent work [31]). In 2002, Pandurangan et al. [27] observed that the pagerank also follows the power law distribution. In 2004, Donato et al. [12] reported the power law on the distribution of the strongly connected components and other components of the topological structure of webgraphs. Based on some of these empirical observations in real-world webgraphs, models for generating synthetic webgraphs were proposed in the last few years. In all of them, the power law on the in-degree distribution is the first property to be verified.

1.1. Models to describe the webgraph evolution

The study of synthetic graphs was first formalized about five decades ago with the research of Erdős and Rényi [19,20] on random graphs. The main characteristic of those graphs was the Poisson distribution of the node degrees. This distribution was also found in subsequent studies by Bollobás [5] and later by Watts and Strogatz [32].

In 1999, Barabási and Albert [4] observed that the in-degree and out-degree of a crawl of webpages presented the power law distribution, instead of the Poisson distribution expected to be found in random graphs. Considering the power law distribution, they proposed the evolving network model for generating synthetic graphs. In this model, a new node is inserted at every time step. The newest node inserted in the graph links to previously inserted nodes according to the preferential attachment, i.e., the probability that an arc points to a node i is proportional to |IN(i)|.
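The preferential attachment rule described above can be illustrated with a minimal Python sketch. It is not the implementation of [4]; the function name `grow_preferential`, the single-node seed graph and the `+ 1` smoothing of the weights (so that nodes with in-degree zero can still be chosen) are assumptions made only for this illustration.

```python
import random

def grow_preferential(n_nodes, arcs_per_node, seed=0):
    """Grow a graph one node per step; targets are drawn by in-degree.

    Assumptions (not from the paper): the seed graph is a single node,
    and weights are in-degree + 1 so that unlinked nodes stay reachable.
    Repeated targets (multi-arcs) are allowed for simplicity.
    """
    rng = random.Random(seed)
    in_degree = [0]              # in_degree[i] is |IN(i)| for node i
    arcs = []                    # list of (source, target) pairs
    for new in range(1, n_nodes):
        weights = [d + 1 for d in in_degree]
        targets = rng.choices(range(new), weights=weights, k=arcs_per_node)
        for target in targets:
            arcs.append((new, target))
            in_degree[target] += 1
        in_degree.append(0)      # the new node starts with no incoming arcs
    return arcs, in_degree
```

Every arc in this sketch goes from a newer node to an older one, so the resulting graph is acyclic, which is one of the limitations of early models discussed later in this section.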

Also in 1999, Kumar et al. [23] proposed the copying model, in which the graph grows according to a determined probability of linking a node already linked by another node, i.e., copying the target of an already existing arc. Two variations of this idea were proposed. The linear model inserts one node at each step and links this node with probability $\alpha$ to K randomly selected pages in the graph, and with probability $(1 - \alpha)$ it copies the target node of one existing arc. They also proposed a variation of this model that inserts one node at a time, but arcs are inserted in a different fashion: the number of arcs inserted at each step, as well as the target nodes of these arcs, are defined considering functions proposed by the model. In the literature, the first model is referred to as the copying model.
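One possible reading of the linear copying model is sketched below in Python. The description above leaves some details open, so this sketch makes the random-versus-copy decision independently for each of the K arcs of a new node; that choice, the two-node seed graph, and the names `grow_copying` and `p_random` (standing for the probability called $\alpha$ in the text) are assumptions for illustration, not the definition given in [23].

```python
import random

def grow_copying(n_nodes, k, p_random, seed=0):
    """Sketch of a copying-style growth process.

    Assumptions (not from the paper): the seed graph has nodes 0 and 1
    joined by the arc (1, 0), and each of the k arcs of a new node is
    decided independently.
    """
    rng = random.Random(seed)
    arcs = [(1, 0)]
    for new in range(2, n_nodes):
        for _ in range(k):
            if rng.random() < p_random:
                # Link to a page chosen uniformly among existing nodes.
                target = rng.randrange(new)
            else:
                # Copy the target node of one existing arc.
                target = rng.choice(arcs)[1]
            arcs.append((new, target))
    return arcs
```

Copying the target of a random existing arc selects a node with probability proportional to its in-degree, which is why this process also favours pages that are already well linked.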

Another interesting model is the network growth model proposed by Pennock et al. [28]. As in the two models discussed above, one node is inserted at a time, together with a fixed number of arcs. In this model, the target node of each new arc is chosen among the set N of previously created nodes, and the arc does not necessarily need to link the latest created node. Thus, the model allows the generation of webgraphs with disconnected components (nodes or subgraphs), as well as cyclic graphs. The probability $P_i$ that a new arc connects a node i is given by

\[ P_i = \alpha \, \frac{|IN(i)|}{2mt} + (1 - \alpha) \, \frac{1}{m_0 + t} \]

where $m_0$ is the number of nodes of the initial graph, t is the number of steps, 2mt is the total connectivity at time t, and $m_0 + t$ is the current number of nodes.
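As a small worked illustration, the expression for $P_i$ can be evaluated directly. The Python function below simply transcribes the formula as stated above; the function name and the reading of m as the number of arcs inserted per step are assumptions for this sketch.

```python
def pennock_probability(in_degree_i, alpha, m, t, m0):
    """Evaluate P_i = alpha * |IN(i)| / (2 m t) + (1 - alpha) / (m0 + t).

    m is taken here to be the number of arcs inserted per step
    (an assumption; the text only states that 2mt is the total
    connectivity at time t).
    """
    preferential = alpha * in_degree_i / (2 * m * t)
    uniform = (1 - alpha) / (m0 + t)
    return preferential + uniform
```

With `alpha = 1` only the preferential term remains, and with `alpha = 0` every one of the `m0 + t` nodes is equally likely, so the parameter interpolates between the two behaviours.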

The webgraph can also be created as a multi-layer structure, in which each single layer can be constructed considering one, or a mix, of the models already proposed [24].

A few other models were proposed, basically introducing modifications of the ones listed above. We refer to the works [2,18,26] as examples. A survey on models for generating webgraphs can be found in [6], and in [7] a more detailed work is presented as a book. In [14] the authors implemented, analysed and compared graphs generated by the main models. Another branch of studies in this area considers not only insertions, but also deletions of items during the process of creation of the graph [11,21]. The models proposed in this paper do not consider deletions, but they could easily be extended by applying edge and node deletions with some probability. However, we chose to keep our models simpler in order to study and understand them without deletions.

The work on the generation of synthetic webgraphs leads to several contributions:

• it makes explicit the rules that govern the webgraph growth, as well as their parameters;

• it allows the prediction of the webgraph behavior in time and may even suggest unexpected properties of today's web;

• it aids the design and analysis of algorithms that deal with webgraphs.

Although there are already many models to describe web graphs and/or their evolution, none of the models is completely satisfactory, since most of them generate non-cyclic graphs (after the generation according to the rules, some models have a step to randomly introduce arcs to create cycles) or use a fixed number for the out-degree of nodes (as discussed before, experiments have shown that out-degrees should also follow a power law distribution). Moreover, it is difficult to compare the models because sometimes the rules that govern the webgraph creation are not clearly and explicitly described, and are typically presented using quite different notations.

1.2. Our contribution

The contribution of this paper is twofold:

1. We propose and discuss three new models to describe the evolution of webgraphs, aiming to maintain the macro and microscopic properties of the webgraph. The models range from a simpler one to a more complex and faithful one, making explicit the impact of the different rules governing each evolution model on the corresponding generated webgraphs. Our models are presented in a computational rather than an analytical way, and therefore we obtain concrete webgraphs by simulation of the models;

2. We propose the use of a formal, intuitive and graphical language, namely Graph Grammars (explained in Section 2), to specify the graph transformations that take place on the webgraph during its evolution. This language is precise and enables a simple definition for each model (given by one or two rules).

Our approach takes into consideration properties that were observed in real webgraphs and creates rules that generate graphs with these properties. We present three models based on the same fundamental idea that grasps an evolving and dynamic aspect: every time a node becomes a target of an arc, it becomes a more important webpage, that is, the probability that other arcs point to this node increases. Thus, the models generate graphs according to the preferential attachment. Note that the aim of the paper is to generate graphs that have properties that are observed in the webgraph, and thus it is natural to consider that these properties influence the generation process.

The main characteristics of the mechanism of webgraph growth of our approach are:

1. Start point: Since we want to be able to estimate the impact, in terms of statistical properties, of applying rules to govern the growth of a given graph, we choose an initial graph with known basic properties, perform a series of modifications, and analyze the result in comparison to the starting point.

2. Potential method: To build synthetic graphs that show certain properties, we use the notion of potential attributes. Potential attributes are configured for each node of the synthetic graph. The graph creation and growth processes are designed to build graphs that approximate the defined potential attributes. More concretely, for each node, in- and out-degree potential attributes are defined when the node is created. During the growth process, the probability of linking nodes takes into consideration the respective in- and out-degree potential values, which are updated during the evolution when new nodes and arcs are inserted. We used in- and out-degree potential attributes according to the power law distribution, but other distributions could be considered by configuring the respective potential attributes as desired.

3. Evolution process: To describe the webgraph evolution, three possibilities are considered:

• Fixed-Outdegree Model: Each time we add a node i, k arcs with source i are added. The target of each arc is a randomly selected node j with probability $\Omega^{in}_{j}/\Omega_{MAXin}$, where $\Omega^{in}_{j}$ is the potential in-degree of node j and $\Omega_{MAXin}$ is the maximum among all in-degree potentials (potential method). This approach gives rise to graphs where nodes have constant out-degree k;
• Variable-Outdegree Model: Here we consider the same sampling process, but the number of arcs inserted together with a node is not fixed a priori;
• Controlled-Insertion Model: Alternatively, we consider a model in which nodes and arcs are inserted independently, giving rise to graphs where both in- and out-degrees follow power laws.

The proposed approaches are specified using graph grammars. It will become evident that using this description language makes it very easy to understand and compare different approaches. Each model is defined by rules, and thus it is clear which changes are made from one evolution model to the other. By analyzing the generated graphs, we can understand the impact of such changes. In software engineering, it is a common practice to build a specification of the system before implementing it. A specification is more abstract than the implementation, being more suitable than code for understanding and validation against the requirements, as well as for other kinds of analysis. The use of visual specification languages is widespread, since it has been recognized that they help the development of complex systems, and can be easily understood even by non-specialists. Graph grammars are specially appealing in this context, providing a formal specification language that matches the abstraction level used by designers of web graph evolution models. Moreover, a formal specification serves as a non-ambiguous basis for code generation. We have implemented algorithms to generate large synthetic web graphs according to these proposed models. The characteristics of the generated graphs were then analysed and compared with real-world webgraphs.

The paper is organized as follows: in Section 2 we present graph grammars. Next, in Section 3, we present the different models proposed in this paper to simulate the webgraph evolution, describing each one using graph grammars. In Section 4 an implementation for generating an initial graph, as well as an implementation of the Controlled-Insertion Model, are presented. In Section 5 we analyse the graphs generated by our approaches. Finally, in Section 6 we conclude with further considerations.

Fig. 1. Examples of graphs.

2. The specification language graph grammars

Graph grammars have originated from the concept of formal grammars on strings by substituting strings by graphs [17,29]. Methods, techniques, and results for graph grammars have been studied since then, and applied in a variety of fields in computer science such as formal language theory, pattern recognition, fractal generation, software engineering, concurrent and distributed system modeling, database design and theory, among others (see, e.g. [15]). In particular, graph grammars are very well suited to the specification of concurrent and distributed systems: a (distributed) state of the system can naturally be represented by a graph and rules (where the left- and right-hand sides are graphs) describe possible state changes. The behavior of the system is then described via applications of these rules to graphs describing the actual states of a system. Graph grammars are appealing as a specification formalism because they are formal, they are based on simple but powerful concepts to describe behavior, and, at the same time, they have a nice graphical layout that helps non-theoreticians understand a graph grammar specification.

Now we will review the main concepts that are used in the graph grammar approach. The description below is based on a stochastic definition of graph grammars, called Stochastic Graph Grammars [25]. Here we extend this formalism with the notion of (uniformly distributed) random variable and with a notion of global and local variables that will be used to distinguish variables with different roles that may appear in rules. The resulting formalism is well suited to describe various simulation models, like the web graph evolution models presented here.

2.1. Graph

We use attributed graphs, that is, graphs in which nodes and/or arcs may be attributed with values. We may use different kinds of attributes for nodes/arcs. To specify the possible values, we use abstract data types (values are elements of suitable algebras, specified using algebraic specifications [16]). This allows the use of variables and operations on data types in the definitions of the rules that specify the behavior of a system. Thus, in our language, each graph consists of nodes, arcs and sets of variables and values. In this paper, the data types used as attributes in graphs will be real numbers and lists of real numbers (and corresponding operations). A special kind of variable, denoted by $rnd_i$, for some natural number i, will be used to represent a random number uniformly distributed in the interval (0, 1]. Moreover, we will distinguish between global and local variables. Global variables are part of any graph in a specification, whereas local variables occur only in the definition of rules (the roles of these different kinds of variables will be explained in items Rule and Match below). Examples of graphs are depicted in Fig. 1. Attribute values appear close to corresponding nodes. Graph G1 has one node attributed with value 5; graph G2 has 3 nodes attributed with 5, 1 and with the list [2,6]. These graphs have no variables. Graph G3 has 2 nodes and 2 variables: one node is attributed with value 5, the other with value 1. (Global) variables with names n and $\Omega^{in}$ are not used as attributes (but belong to the set of variables associated to graph G3), and currently have values 4 and list [2,3], respectively.

Fig. 2. Rule example.

2.2. Rule

A graph rule r : L → R consists of graphs L (left-hand side) and R (right-hand side), together with a partial graph morphism r mapping arcs and nodes of L to arcs and nodes of R in a compatible way. Compatibility here means that whenever an arc $a_L$ is mapped to an arc $a_R$ then the source (target) node of $a_L$ must be mapped to the source (target) node of $a_R$. However, compatibility is not required on attributes of nodes: a node can be mapped to another one with a different attribute value. This means that a rule may change the attributes of a node. The operational interpretation of a rule r : L → R is: items in L which do not have an image in R are deleted; items in L which are mapped to R are preserved; items in R which do not have a pre-image in L are created.

An important characteristic of rules is that they usually contain variables and operations of the corresponding data types as attributes.1 Variables in rules may be local or global and, in any case, have no associated value (because these variables will be instantiated only when the rule is applied). For example, rule Example in Fig. 2 models the creation of a new node (attributed with n′) connected to an existing one (attributed with v). This rule involves 5 variables: n, $\Omega_{MAXin}$ and $\Omega^{in}$ are global, whereas v and the random variable $rnd_1$ are local. The types of these variables are described in the dashed box on the left. The conditions written above the rule arrow must be satisfied for the rule to be applied, and the equations below the rule arrow represent relations between the variables of the left- and right-hand sides of the rule (we use the convention that primed variables denote the state of the variable after the execution of the rule; if the value does not change, we use unprimed variables also in the right-hand side). Rule Example may be applied if (i) there is in the current state graph a node attributed with value v, (ii) value v is between 1 and the current value of variable n, and (iii) the current value of $\Omega_{MAXin}$ is such that $rnd_1 \ast \Omega_{MAXin} < \Omega^{in}_{v}$, for some value of $rnd_1$ ($\Omega^{in}_{v}$ is the v-th element of list $\Omega^{in}$). The effect is that a node attributed with n′ is created, variables $\Omega_{MAXin}$ and n are updated to $\Omega'_{MAXin}$ and n′ according to the equations of the rule, and the value of the n′-th position of $\Omega^{in}$ is set to 5.

Additionally, in stochastic graph grammars, a rate may be associated to a rule to guide the frequency of its application. If we have 2 rules, one with rate 1 and another with rate 7, the second will be applied on average 7 times more often than the first. In the graphical representation, if no rate is explicitly shown, rate 1 is assumed.
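The effect of rates can be illustrated with a short Python sketch that picks the next rule with probability proportional to its rate. This is a simplification made for illustration: it assumes every rule currently has a match, and the names `pick_rule` and `application_probabilities` are hypothetical.

```python
import random

def application_probabilities(rates):
    """Map each rule name to rate / (sum of all rates)."""
    total = sum(rates.values())
    return {name: rate / total for name, rate in rates.items()}

def pick_rule(rates, rng):
    """Draw one rule name with probability proportional to its rate."""
    names = list(rates)
    weights = [rates[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Two rules with rates 1 and 7, as in the example above.
rates = {"rule_a": 1, "rule_b": 7}
probabilities = application_probabilities(rates)
chosen = pick_rule(rates, random.Random(0))
```

Here `probabilities` assigns 1/8 to the first rule and 7/8 to the second, which matches the statement that the second rule is applied on average 7 times more often.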

2.3. Match

The operational behavior of a system described by a graph grammar is described by applying the rules of the grammar to actual graphs. The first step to be able to apply a rule r to a graph G is to know whether this rule is applicable, that is, whether there exists an image of the left-hand side of the rule in G. This occurrence is called a match. A match consists of a mapping of the graphical part (vertices and edges) of the left-hand side of the rule to a corresponding graphical part of graph G, together with an assignment of values to all (global and local) variables of the rule. All equations of the rule must be satisfied by this assignment of concrete values of attributes in G to variables of the rule r; otherwise, this rule application is not possible. Note that, since any graph with respect to the same specification will have the same set of global variables, the match must relate each global variable to its current value in graph G. For the local variables, there might be many possible choices (many different assignments may satisfy all equations). For example, considering the assignment of values 2 and 0.5 to variables v and $rnd_1$, respectively (denoted by v ↦ 2 and $rnd_1$ ↦ 0.5), the mapping m : L → G in Fig. 3 is a match, since there is an image of the graphical part of L in G (depicted by the dotted arrow), and the assignment of variables of L to corresponding values in G respects the equations of the rule (we omitted the mapping for global variables, since there is only one way to map global variables to the corresponding values in graph G). If we consider the same rule r and a different choice for any of the local variables, it might not be possible to apply this rule (for example, if we choose $rnd_1$ ↦ 1, the condition of the rule would be false, preventing its application). There are also other choices that might allow the application of the rule, but would lead to different results (for example, mapping •v to •1 would lead to a graph in which •4 is connected to •1).

1 That is why it is necessary to define the values used in the graph grammar as abstract data types. Formally, the algebras used in rules are term-algebras of the corresponding data type specifications.

2.4. Rule application

The application of a rule to an actual graph, called a derivation step, is possible if there is a match of the left-hand side of this rule into the actual graph. The result of the application of a rule r : L → R to a graph G is obtained by the following steps:

1. Add to G everything that is created by the rule (items that are in the right-hand side R of the rule but not in the left-hand side L).

2. Delete from the result of (1) everything that shall be deleted by the rule (items that are in the left-hand side L of the rule but not in the right-hand side R).

3. Delete dangling edges. This step is necessary because some vertices deleted in step 2 may have had incoming and/or outgoing edges, and these must be deleted so that the result is again a graph. This implicit deletion of edges is a garbage collection procedure. Intuitively, one may compare this phenomenon to removing a page that is linked by other pages.2

An example of rule application can be found in Fig. 3 (intermediate steps to construct the resulting graph H are not shown).

2 This step will not be necessary in this paper, since we will not consider deletion of nodes.
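The three steps of a derivation step can be sketched in Python on a plain set-based graph. This is only an illustration under strong simplifications: the match is assumed to be already known, items are given by concrete identifiers, attributes and variables are ignored, and the function name `apply_rule` and its parameters are hypothetical.

```python
def apply_rule(nodes, arcs, created_nodes=(), created_arcs=(),
               deleted_nodes=(), deleted_arcs=()):
    """Apply one derivation step to a graph given as (nodes, arcs).

    nodes is a set of identifiers and arcs a set of (source, target)
    pairs. The created/deleted arguments stand for the items of R not
    in L and of L not in R, already instantiated by a match.
    """
    # Step 1: add everything that is created by the rule.
    new_nodes = set(nodes) | set(created_nodes)
    new_arcs = set(arcs) | set(created_arcs)
    # Step 2: delete everything that shall be deleted by the rule.
    new_nodes -= set(deleted_nodes)
    new_arcs -= set(deleted_arcs)
    # Step 3: delete dangling arcs whose source or target is gone.
    new_arcs = {(s, t) for (s, t) in new_arcs
                if s in new_nodes and t in new_nodes}
    return new_nodes, new_arcs

# Creating node 4 linked to node 2, as in a node-insertion rule.
grown = apply_rule({1, 2, 3}, {(1, 2), (2, 3)},
                   created_nodes=[4], created_arcs=[(4, 2)])
# Deleting node 2 also removes the arcs attached to it (step 3).
shrunk = apply_rule({1, 2, 3}, {(1, 2), (2, 3)}, deleted_nodes=[2])
```

In the second call both arcs become dangling once node 2 is removed, so the result has nodes 1 and 3 and no arcs.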

Fig. 3. Match and derivation.

3. Models to describe the webgraph evolution

To describe the evolution of webgraphs using graph grammars, we must define the initial graph and the rules that govern this evolution. Naturally, one could start the graph generation from scratch, but it would take longer to reach graph stability than starting the procedure from a non-empty initial graph. Moreover, starting from a graph with known properties one can analyze the impact of different evolution models. In Section 4.1, an implementation for generating initial graphs will be discussed. In this section we present three new models for generating synthetic graphs.

3.1. Notation

Nodes are identified by natural numbers. The following variables and representations are adopted:

• $n_{init}$: number of nodes of the initial graph;
• $n_{max}$: number of nodes of the final graph;
• IN(u): set of incoming arcs to node u;
• OUT(u): set of outgoing arcs from node u;
• $\Omega$: list of potential in- and out-degrees of nodes. $\Omega^{in}_{i}$ and $\Omega^{out}_{i}$ denote the potential in-degree and out-degree of node i, respectively. We use function incrInOut($\Omega$, $pos_{in}$, $pos_{out}$) to increment by one the in-degree of the $pos_{in}$-th element of the list as well as the out-degree of the $pos_{out}$-th element of the list;
• $\Omega_{MAXin}$: maximum in-degree in $\Omega$;
• $sum_{in}(\Omega)$: this function returns the sum of all potential in-degrees in $\Omega$; $sum_{out}$ is defined analogously;
• rnd: variable chosen randomly in (0, 1].
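The notation above can be mirrored by a small Python sketch. The representation of $\Omega$ as a list of `[in, out]` pairs, the 1-based positions, the choice to return a new list instead of updating in place, and the Python names are all assumptions made for illustration.

```python
def incr_in_out(omega, pos_in, pos_out):
    """Sketch of incrInOut: return a copy of omega with the potential
    in-degree at position pos_in and the potential out-degree at
    position pos_out incremented by one (positions are 1-based)."""
    updated = [list(pair) for pair in omega]
    updated[pos_in - 1][0] += 1
    updated[pos_out - 1][1] += 1
    return updated

def sum_in(omega):
    """Sum of all potential in-degrees in omega."""
    return sum(pair[0] for pair in omega)

def sum_out(omega):
    """Sum of all potential out-degrees in omega."""
    return sum(pair[1] for pair in omega)

def max_in(omega):
    """Maximum potential in-degree in omega (the value called MAXin)."""
    return max(pair[0] for pair in omega)

omega = [[2, 1], [3, 0], [1, 4]]
omega_after = incr_in_out(omega, pos_in=2, pos_out=1)
```

After the call, the second node has potential in-degree 4 and the first node has potential out-degree 2, while the original list is left unchanged.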

Given an initial graph, we present three new models for the graph evolution that preserve the power law distribution for in- and out-degrees. These three models are based on the potential method. Our motivation to consider a given degree distribution when generating a graph is that this characteristic is commonly found in webgraphs. Thus, we assumed that this characteristic is present in webgraphs, and proposed simple models. One of them, the third, has many characteristics expected to be found in webgraphs. The first two models are not very realistic, like many previous models proposed in the literature, since they generate acyclic graphs with no islands. However, it is interesting (i) to observe that even graphs generated by these models exhibit some important webgraph properties, and (ii) to understand how changes in the models affect properties of generated graphs.

3.2. Fixed-Outdegree Model

The Fixed-Outdegree Model inserts, at each time step, a node with a fixed number K of outgoing links. A node is inserted with in-degree zero, which corresponds to realistic situations, where newly created pages cannot be the target of any already existing link. The value of the constant out-degree was set in accordance with values found in real-world graphs. For example, in the Altavista crawl [9] a value K ∼ 7 was observed. As technology evolves, and computers and the Internet become faster, the number of links per page tends to increase, as can be observed in [31]. However, this value is a parameter of the model, and it can be adapted if necessary.

Fig. 4 presents the graph grammar rules for the three proposed models. Other models presented in the literature, as well as the initial graph generation presented later in this paper, could also be described via graph grammars. However, we chose only to describe the new models via graph grammars in this paper, since they represent the major contribution of this work.

The model can be described by the rule on the top of Fig. 4. Each node $n_i$ (created in the i-th step) is attributed with its potential in-degree $\Omega^{in}_{n_i}$. The variables used in this rule are the current number of nodes n; the current maximum potential in-degree $\Omega_{MAXin}$; the list of potential in-degrees $\Omega^{in}$; node identifiers $n_1$ to $n_7$ and uniformly distributed random variables $rnd_1, \ldots, rnd_7, rnd_{in} \in (0, 1]$. Note that $n_1$ to $n_7$ are variables; each of them will be assigned a node number when the rule is applied. A potential match for this rule in graph G maps n, $\Omega_{MAXin}$ and $\Omega^{in}$ to their current values, chooses 7 nodes of G ($n_1$ to $n_7$) to be the target nodes of the outgoing links of the created node n′ and chooses values for the corresponding random variables. If all equations above the arrow of the rule are satisfied for these chosen values, this potential match is actually a match and the rule can be applied. Since there are already n nodes in the graph, the next one to be created will have the identifier n′ = n + 1, and the other global variables are updated according to the equations below the arrow of the rule.

Fig. 4. A graphical description of the three proposed models: Fixed-Outdegree Model (on the top), Variable-Outdegree Model (in the middle), and Controlled-Insertion Model (on the bottom).

Each one of the seven target nodes is selected according to the preferential attachment by the following procedure: given a random variable $rnd_i \in (0, 1]$, nodes are analysed sequentially (considering the increasing order of their labels), starting from a random node. Node $n_i$ is selected if it is the first one found that satisfies

\[ \frac{\Omega^{in}_{n_i}}{\Omega_{MAXin}} \ge rnd_i \]
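The sequential selection procedure can be sketched in Python as follows. The wrap-around to the first node after the last label is reached, the 0-based node indices and the function name `select_target` are assumptions; the text only states that the scan starts from a random node and follows increasing labels.

```python
def select_target(omega_in, omega_max_in, rnd_i, start):
    """Return the first node, scanning from `start` in increasing label
    order, whose ratio omega_in[node] / omega_max_in is >= rnd_i.

    Nodes are indexed from 0 here, and the scan wraps around to node 0
    after the last node (both are assumptions of this sketch).
    """
    n = len(omega_in)
    for offset in range(n):
        node = (start + offset) % n
        if omega_in[node] / omega_max_in >= rnd_i:
            return node
    return None  # only reachable if omega_max_in exceeds every potential

potentials = [2, 5, 1, 3]
target = select_target(potentials, max(potentials), rnd_i=0.5, start=2)
```

In this call node 2 is rejected (ratio 0.2) and node 3 is accepted (ratio 0.6). When `omega_max_in` is the true maximum and `rnd_i` lies in (0, 1], the node holding the maximum always passes the test, so a target is always found.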

According to this model, each newly inserted node n′ receives a value $\Omega^{in}_{n'}$ that represents the potential this node has to be linked. Considering the evolving graph, when a new node is inserted at step i (that is, now the graph has $n_i$ nodes, and $n_i$ is also the identifier of the last included node), the potential is calculated as follows. A continuous approximation of a power law such that

$$c \int_{1}^{n_i} x^{-\alpha_{in}} \, dx = 1$$

leads to $c = (1 - \alpha_{in})/(n_i^{1-\alpha_{in}} - 1)$. From this approximation, the potential in-degree of the newly generated node can be determined by generating a random variable k according to

$$\frac{1 - \alpha_{in}}{n_i^{1-\alpha_{in}} - 1} \int_{1}^{k} x^{-\alpha_{in}} \, dx = rnd_i$$

where $rnd_i \in (0, 1]$ is a uniformly distributed random variable. The potential in-degree of $n_i$, inserted in the i-th step, is thus calculated as

$$\Omega^{in}_{n_i} = k = \left[1 + rnd_i \left(n_i^{1-\alpha_{in}} - 1\right)\right]^{1/(1-\alpha_{in})} \qquad (2)$$
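Eq. (2) is an inverse-transform sample from the continuous power law on $[1, n_i]$. A minimal Python sketch, assuming $\alpha_{in} \neq 1$ (the function and parameter names are illustrative, not taken from the paper):

```python
def potential_in_degree(n_i, alpha_in, rnd):
    """Evaluate Eq. (2): invert the power-law CDF on [1, n_i] at rnd in (0, 1]."""
    exponent = 1.0 - alpha_in  # must be non-zero, i.e. alpha_in != 1
    return (1.0 + rnd * (n_i ** exponent - 1.0)) ** (1.0 / exponent)


# rnd close to 0 yields a potential close to 1; rnd = 1 yields n_i itself.
for rnd in (0.001, 0.5, 0.999, 1.0):
    print(rnd, potential_in_degree(1000, 2.1, rnd))
```

With $\alpha_{in} > 1$, most draws stay close to 1 and only values of rnd near 1 produce large potentials, which is the heavy-tailed behavior the model relies on.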

By a proper evolution of the webgraph, it is expected that, after a suitable number of evolution steps, $|IN(n_i)| \sim \Omega^{in}_{n_i}$, with this proximity depending on a stochastic noise.

The procedure is repeated until the generated graph reaches the termination criterion, that is, when n nodes are inserted. These conditions for generating nodes guarantee the preservation of the power law of the initial graph.

3.3. Variable-Outdegree Model

As stated before, we also considered a more sophisticated model in which the nodes have variable out-degree. This is depicted in the middle model in Fig. 4. As in the previously presented model, to each inserted node $n_i$ a potential $\Omega^{in}_{n_i}$ is assigned, representing the probability of node $n_i$ being linked. The probability is calculated according to the power law distribution. In the Variable-Outdegree Model, $\Omega^{out}_{n_i}$ is calculated in the same way as $\Omega^{in}_{n_i}$, but using $\alpha_{out}$ instead of $\alpha_{in}$. Thus, each newly inserted node $n_i$ links to $\Omega^{out}_{n_i}$ previously inserted nodes, each one selected according to its potential in-degree $\Omega^{in}$. Here, we use another approximation for the preferential attachment: a node $n_k$ is selected as target if it is the first node (considering the order in which nodes were created) that satisfies the following inequality:

$$\frac{\sum_{j=1}^{n_k} \Omega^{in}_{j}}{sum_{in}(\Omega)} \geq rnd_i$$

In the rule variable outdegree, function $min_k\,P$ calculates the minimum k that satisfies property P.
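The cumulative selection rule can be sketched as below. The sketch assumes that $sum_{in}(\Omega)$ denotes the sum of the potential in-degrees of all existing nodes, which the text does not state explicitly, and it uses 0-based node indices; `select_by_cumulative_potential` is a hypothetical name.

```python
from bisect import bisect_left
from itertools import accumulate


def select_by_cumulative_potential(potentials, rnd):
    """Return the smallest index k whose cumulative potential, divided by the
    total potential, is at least rnd (rnd in (0, 1])."""
    cumulative = list(accumulate(potentials))
    total = cumulative[-1]  # assumption: this plays the role of sum_in(Omega)
    # cumulative[k] / total >= rnd  is equivalent to  cumulative[k] >= rnd * total
    return bisect_left(cumulative, rnd * total)


potentials = [1, 2, 3, 4]  # illustrative values; cumulative sums are 1, 3, 6, 10
print([select_by_cumulative_potential(potentials, r) for r in (0.05, 0.25, 0.5, 1.0)])
```

A node is chosen with probability proportional to its potential, so this is the usual roulette-wheel form of preferential attachment; the binary search only replaces the sequential scan described in the text.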

The left-hand side of the rule has as many nodes as the potential out-degree generated for node n′ = n + 1. If there is a bound on the possible out-degree, there is a finite number of rules (one rule for each possible out-degree) that can be applied. Because we generate finite graphs, there must be a bound on the possible out-degree. However, for the implementation, we do not need to know this bound a priori.

3.4. Controlled-Insertion Model

Both previously proposed models insert arcs having as source the last inserted node. However, in this fashion, the generated graph has no cycles. This kind of graph does not represent real-world webgraphs. Graphs with no cycles are also generated by other well-known models, such as [3,23].

The Controlled-Insertion Model generates graphs that allow the presence of cycles. The procedure is composed of independent node and arc insertions. The arc insertion rule is applied K times more often than the node insertion rule. K was set to seven in our experiments. When a node is inserted, it receives in- and out-potential attributes. Arcs are inserted linking nodes according to their potentials. This is described by rules InsertNode and InsertArc of Fig. 4.

Graphs generated by this model are directed graphs that may be disconnected and may have nodes with no incoming/outgoing arcs, which is more realistic than the graphs generated by the previously presented models.

4. Implementations of the models

In this section, we show how the models were implemented to generate actual graphs. An analysis of the generated graphs is presented in the next section. First, we define a procedure to generate initial graphs, and then a procedure that describes the evolution according to the Controlled-Insertion Model. The implementation of the Fixed- and Variable-Outdegree Models is analogous to this one, and is therefore omitted.

4.1. Initial graph

The initial graph may be an existing or a synthetically obtained graph. If desired, one could also choose an empty initial graph. Since one of the basic characteristics of webgraphs is the power law distribution of in- and out-degrees, our initial graphs, having $n_{init}$ nodes, are generated according to the discrete and normalized power law distribution.

Initially, the number of nodes with in-degree k, denoted by $numIN_k$, is calculated as

$$numIN_k = \frac{n_{init}\, k^{-\alpha_{in}}}{\sum_{i=1}^{n_{init}} i^{-\alpha_{in}}} \qquad (3)$$

where $\alpha_{in}$ is set to a constant value. Since $numIN_k$ is an integer, the value given by Eq. (3) is rounded up or down. In our experiments we used $\alpha_{in} = 2.1$, a value observed in analyses of real-world webgraphs [14]. Thus, for each k = 1, 2, ..., $n_{init}$, we randomly choose $numIN_k$ nodes and set their in-degrees to k. When this procedure finishes, we end up with a graph with $n_{init}$ nodes respecting the power law.
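As an illustration of Eq. (3), the following Python sketch computes the number of nodes assigned to each in-degree. It rounds to the nearest integer, which is one possible reading of "rounded up or down"; the function name `in_degree_counts` is hypothetical.

```python
def in_degree_counts(n_init, alpha_in):
    """Evaluate Eq. (3) for k = 1..n_init, rounding to the nearest integer."""
    norm = sum(i ** -alpha_in for i in range(1, n_init + 1))
    return {k: round(n_init * k ** -alpha_in / norm) for k in range(1, n_init + 1)}


counts = in_degree_counts(1000, 2.1)
print(counts[1], counts[2], counts[10])
print(sum(counts.values()))  # may differ slightly from n_init because of rounding
```

Because of the rounding, the counts need not add up exactly to $n_{init}$, so an implementation has to decide how to treat the few remaining nodes.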

Fig. 5 presents the pseudo-code of the procedure to generate the initial graph. Given the parameters α, $n_{init}$, ε, and K (the number of outgoing links inserted per node), procedure InitGraph( ) generates the initial graph, whose in-degrees follow a power law distribution. For each value of k (line 2), $numIN_k$, the number of nodes with in-degree k, is calculated using function calculateNk( ) (this function implements Eq. (3)). In lines 4–12, $numIN_k$ nodes are selected at random by function selectNode( ), one by one. For each node u selected, k outgoing links (line 7) are generated linking random nodes (random() generates a random number in (0, 1]). Finally, in line 14, the function returns the generated graph G := {$N_{init}$, E′}.

For simplicity, some details were omitted in the pseudo-code. For example, self-loops and parallel arcs (arcs with the same endpoints) are not generated, and the tests for avoiding these arcs are not presented. We decided to avoid self-loops and parallel arcs, since the library [13] we use for computing properties of graphs removes parallel arcs.

Fig. 5. Pseudo-code of the InitialGraph procedure that generates an initial graph.

Fig. 6. An implementation of the Controlled-Insertion Model.

4.2. Evolution according to the Controlled-Insertion Model

In Fig. 6 we present the pseudo-code that implements this model. The procedure receives as input the initial graph ($G_{init}$), a value of α for the in-degree ($\alpha_{in}$), a value of α for the out-degree ($\alpha_{out}$), the final number of nodes ($n_{max}$), and the number of inserted links for each inserted node (K). This procedure creates $n_{max} - n_{init}$ nodes and K times this number of arcs. In lines 2–4 of the algorithm, the in- and out-potentials of each newly generated node are calculated, corresponding to the rule InsertNode from Fig. 4. For each node, the loop in lines 5–14 inserts K arcs, updating the in- and out-potentials of the corresponding nodes in lines 11–12, as described by rule InsertArc from Fig. 4. This algorithm is a straightforward implementation of the actions described by the rules of the Controlled-Insertion Model in Fig. 4. Although the model specifies that arcs might be inserted in parallel, the resulting graph is the same when it is generated by a sequential implementation, such as the one presented in Fig. 6. This is due to the fact that no deletion is handled by the model.

The algorithm ControlledInsertion runs in linear time in the worst and best cases. Denoting by n the value of ($n_{max} - n_{init} + 1$), and since K is in fact a constant, the time complexity of the algorithm is $\Theta(n)$.
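The overall shape of the procedure described above can be sketched in Python. This is not the pseudo-code of Fig. 6, which is not reproduced here, but a simplified illustration under several assumptions: the initial graph consists of `n_init` nodes without arcs, sources are drawn with probability proportional to their out-potential and targets proportional to their in-potential, potentials are not updated after an arc insertion, and self-loops and parallel arcs are simply skipped. All function and variable names are hypothetical.

```python
import random


def power_law_potential(n, alpha, rnd):
    """Potential drawn by inverting the continuous power-law CDF on [1, n]."""
    e = 1.0 - alpha
    return (1.0 + rnd * (n ** e - 1.0)) ** (1.0 / e)


def controlled_insertion(n_init, n_max, alpha_in, alpha_out, K, seed=0):
    rng = random.Random(seed)
    pot_in, pot_out, arcs = [], [], set()

    def insert_node():
        n = len(pot_in) + 1  # identifier of the node being created
        # 1.0 - rng.random() is uniform in (0, 1], as required for rnd.
        pot_in.append(power_law_potential(n, alpha_in, 1.0 - rng.random()))
        pot_out.append(power_law_potential(n, alpha_out, 1.0 - rng.random()))

    for _ in range(n_init):  # assumption: arcless initial graph
        insert_node()
    while len(pot_in) < n_max:
        insert_node()
        nodes = range(len(pot_in))
        for _ in range(K):  # K arc insertions per node insertion
            source = rng.choices(nodes, weights=pot_out)[0]
            target = rng.choices(nodes, weights=pot_in)[0]
            if source != target:  # skip self-loops; the set drops parallel arcs
                arcs.add((source, target))
    return len(pot_in), arcs


n, arcs = controlled_insertion(20, 100, alpha_in=2.1, alpha_out=2.7, K=7, seed=1)
print(n, len(arcs))
```

Since any existing node can be the source of a new arc, arcs can point back to older nodes and close cycles. Note that `rng.choices` with explicit weights costs time proportional to the number of nodes per draw, so this sketch does not reproduce the linear running time stated in the text.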

5. Experiments and statistics

This section presents experimental results conducted with the aim of analyzing the graphs generated by the implementations of the models presented in the previous sections. The next subsection describes the chosen metrics, while the following subsections discuss each proposed model in terms of these metrics. The last subsection discusses complementary analyses relating some metrics and general considerations on the observed values.

5.1. Metrics

For each of the proposed models we analyze the following metrics of the generated graphs:

• in-degree and out-degree distribution: this analysis observes whether the in- and out-degrees obtained follow the power law;

• topological structure: Broder et al. [9] identified and quantified the topological structure of webgraphs. Webgraphs are constituted by isolated components. Typically, the largest connected component has 90% of the nodes, while the smaller components are called Islands. The topology of the largest component is further analysed. The nodes of the largest component are classified as SCC, IN, OUT, and Tendrils and Tubes.
- The SCC corresponds to the largest strongly connected component, i.e., there is a directed path between each pair of nodes in the SCC;
- The set IN corresponds to the nodes that reach all nodes in the SCC;
- The set OUT corresponds to the nodes reached by all nodes in the SCC;
- The remaining nodes comprise the sets Tendrils and Tubes. Tendrils is the set of nodes with a path incoming to OUT, or outgoing from IN. A particular type of tendril is called a tube if it leaves IN and enters OUT.

• pagerank distribution: another important measure computed for webgraphs is the distribution of the pagerank values. The pagerank algorithm [8] calculates a value for each page based on the link structure of the graph. Pages with more incoming links, as well as pages linked by pages with high pagerank, tend to have a higher pagerank value.

The analysis of the graphs was mainly performed making use of the COSIN Graph Library, a freely available library for computing properties of large graphs [13].

5.2. Experiments performed with graphs generated according to the Fixed-Outdegree Model

5.2.1. In- and out-degree

This model has fixed out-degree. Thus, the out-degree is a characteristic of the model enforced by construction, and not an observed metric. In the experiments presented in this paper the out-degree we used is $k_{out} = 7$, i.e., each new node links to exactly seven previously generated nodes. The value K = 7 is the average degree of a webgraph node reported in [23], observed in a real crawl of 200M pages analysed in [12], and observed also in the Altavista crawl [9].

Fig. 7 presents the in-degree distribution for the graphs generated using the Fixed-Outdegree Model. The continuous line represents the power law of the initial graph (of 50,000 nodes). The power law was calculated for three different graphs evolved from this initial graph, using values $k_{out}$ = 7, 14, and 28, respectively. We observed only a slight difference in the value of α of the obtained power laws. Also, the top of the curves deviates from the power law behavior, and the deviation increases with the value of $k_{out}$.

Fig. 7. Power law behavior of synthetic web graphs with 100,000 nodes, evolved from 50,000 nodes, considering three different values of $k_{out}$.

We also performed experiments varying the value of $\alpha_{in}$ for the in-degree analysis. We tested four values, $\alpha_{in}$ = 1.1, 2.1, 3.1, and 4.1. Fig. 8 presents the in-degree distribution for the initial graph with 40,000 nodes (left), and for the evolved graphs with 80,000 nodes (right).

Fig. 8. Influence of $\alpha_{in}$ in the in-degree distribution of synthetic webgraphs. The initial graphs have 40,000 nodes (left side), and they were evolved to 80,000 nodes (right side).

Surprisingly, all curves intersect at a common point, $k_{in} \approx 16$, as can be observed in Fig. 8 (right). This means that, for graphs generated by the fixed potential graph model, the probability of a node having in-degree 16 is the same for all synthetically generated graphs, considering the values of α we tested.

Once a graph is generated we measure the exponent of the power law ($\alpha_{in}$); see Table 1. The first seven lines present the value of $\alpha_{in}$ for real-world webgraphs, whose details are presented in Table 2. The last three lines of the table correspond to graphs evolved following the model from 25,000 to 50,000 nodes, from 50,000 to 100,000 nodes, and from 100,000 to 200,000 nodes, respectively. In the graph generation $\alpha_{in}$ was set to 2.1.

The pagerank was also measured for the synthetic graphs. However, the plots are very similar to the ones of graphs generated by the Variable-Outdegree Model, and thus we chose to present and analyze the pagerank results only in the next subsection. The same holds for the graph components analysis. These similarities were expected, since none of these models generates cyclic graphs.

Table 1. The power law exponent for real-world and synthetic graphs.

G1 → G2                        α_in
Notre Dame 1999 [3]            2.1
Altavista 1999 [9]             2.1
WebBase 2001 [12]              2.1
WebBase 2001 [31]              1.9
WebBase 2003 [31]              2.2
UK Graph 2002 [31]             1.7
IT Graph 2004 [31]             1.6
2.5 × 10^4 → 5.0 × 10^4        1.95 ± 0.04
5.0 × 10^4 → 1.0 × 10^5        2.06 ± 0.03
1.0 × 10^5 → 2.0 × 10^5        2.10 ± 0.02

Table 2. Real-world webgraphs. The columns stand for the graph origin (Graph), the year it was crawled (Year), the paper that reports the data (Paper), the number of nodes (#Nodes), and the number of links (#Links). The numbers of nodes and links are in millions.

Graph                 Year    Paper    #Nodes    #Links
Notre Dame domain     1999    [4]      0.3       1.8
Altavista             1999    [9]      203       1466
WebBase               2001    [12]     200       1400
WebBase               2001    [31]     81        752
WebBase               2003    [31]     49        1185
UK Graph              2002    [31]     18        292
IT Graph              2004    [31]     41        1136
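The text does not state how the power law exponents reported in Table 1 were fitted. One simple possibility, shown here purely as an assumption, is a least-squares fit of a straight line to the degree histogram on log-log axes; the function name `estimate_exponent` is hypothetical.

```python
import math
from collections import Counter


def estimate_exponent(degrees):
    """Estimate alpha in count(k) ~ k^(-alpha) by a least-squares line fit
    on the log-log degree histogram (degrees equal to 0 are ignored)."""
    hist = Counter(d for d in degrees if d > 0)
    xs = [math.log(k) for k in hist]
    ys = [math.log(c) for c in hist.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope


# Synthetic check: the number of nodes with degree k is exactly 1024 / k^2.
degrees = [k for k in (1, 2, 4, 8, 16, 32) for _ in range(1024 // (k * k))]
print(estimate_exponent(degrees))
```

A plain histogram fit is sensitive to the noisy tail of the distribution, so binned or maximum-likelihood estimators may be preferable on real data.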

5.3. Experiments performed with graphs generated according to the Variable-Outdegree Model

5.3.1. In- and out-degree

Fig. 9 presents the in- and out-degree distributions of synthetic graphs of 300,000 nodes obtained with the Variable-Outdegree Model. Only results for two graphs are presented, for the sake of clarity of the figure.

The in-degree distribution clearly follows the power law distribution. The graph evolved from 10,000 to 300,000 nodes has an out-degree following a power law distribution, while the graph evolved from 200,000 to 300,000 nodes does not. This is expected, since the initial graph has a greater influence on the result in the latter case. Initial graphs were generated with in-degree following the power law distribution, but this was not true for the out-degree. The final graph generated from an initial graph with 200,000 nodes has most of its nodes from the initial graph.

Fig. 9. In-degree distribution (left) and out-degree distribution (right) for graphs with 300,000 nodes, evolved from 10,000 and 200,000 nodes. The graphs were generated according to the Variable-Outdegree Model.

5.3.2. Topological structure

Now we present the size of the topological components for the graphs generated using the Variable-Outdegree Model. Table 3 shows the percentage of nodes within each component for these graphs. Columns represent graph components, while lines represent different graphs. The synthetic graphs were generated with 300,000 nodes, evolving from initial graphs with 10,000, 50,000, 100,000, and 200,000 nodes. No node generated by the rules belongs to the SCC component, since the model does not allow cycles during the graph evolution phase. But among the nodes of the initial graph, it is possible to have cycles. This is why the size of the SCC (in percentage of nodes) increases with the size of the initial graph.

Table 3. Size of graph components of different graphs generated by the Variable-Outdegree Model (values are given in %).

G1 → G2                      SCC      IN       OUT      Tendrils    Tubes    Islands
1.0 × 10^4 → 3.0 × 10^5      2.21     87.07    1.11     6.56        3.04     0.00
5.0 × 10^4 → 3.0 × 10^5      11.45    70.60    5.19     9.31        3.44     0.00
1.0 × 10^5 → 3.0 × 10^5      23.17    55.21    10.13    8.77        2.71     0.00
2.0 × 10^5 → 3.0 × 10^5      47.35    27.22    19.27    5.27        0.88     0.00

5.3.3. Pagerank distribution

Fig. 10 presents a plot with the distribution of the pagerank values for the synthetic graphs generated following the Variable-Outdegree Model. As already observed in real-world webgraphs, the pagerank in the generated synthetic graphs also follows a power law distribution.

Fig. 10. Pagerank distribution for graphs generated using the Variable-Outdegree Model. The plot corresponds to two graphs of 300,000 nodes evolved from initial graphs with 10,000 and 200,000 nodes. [Plot: number of nodes versus pagerank, logarithmic axes; series: 10,000 to 300,000 nodes and 200,000 to 300,000 nodes.]

5.4. Experiments performed with graphs generated according to the Controlled-Insertion Model

In this section we explore characteristics of graphs generated according to the Controlled-Insertion Model. Ten graphs of each size were generated. The plots show the results for one graph chosen at random among the ten, since very similar plots were observed for all of them. Moreover, since these plots represent points that were discretized between the largest and smallest values found, an average among distributions can generate non-representative plots. We present a table with the sizes of the graph components containing the average over the ten graphs, as well as the standard deviation (in parentheses).

5.4.1. In- and out-degree

Fig. 11 presents the in-degree and out-degree distributions of graphs with 300,000 nodes. For the in-degree distribution analysis, the graphs were evolved from initial graphs of 10,000 and 200,000 nodes. For the sake of clarity, the in-degree distributions of graphs evolved from 50,000 and 100,000 nodes were omitted from the figure. They present power laws similar to the ones plotted in the figure.

The out-degree distribution has a different behavior. Since the initial graphs were not generated respecting a power law distribution for the out-degree, plots from graphs evolved from smaller initial graphs come closer to a power law with $\alpha_{out} = 2.7$. For example, the graph evolved from 200,000 to 300,000 nodes does not present a power law, since the initial graph corresponds to 66% of the total size of the graph, and the initial graph was not generated with out-degree following a power law. On the other hand, the graph evolved from 10,000 to 300,000 nodes presents a power law out-degree distribution, since the model considers a power law for the out-degree, and the initial graph corresponds to only about 3% of the evolved graph.

5.4.2. Topological structure

Table 4 presents the percentage of nodes within each component for graphs generated according to the Controlled-Insertion Model. For the sake of comparison, we report these same dimensions for six real-world webgraphs. Together with the identification of each graph, we added the year in which it was analysed, as well as a reference to the paper that reported the presented results. In Table 4, since Tubes represents less than 0.2% of the nodes for all graphs, Tendrils and Tubes are presented together. Moreover, the results for the generated graphs represent an average over 10 generated graphs, and the standard deviation is presented in parentheses.

From these results, we can observe that the larger the initial graph, the larger the size of the SCC and OUT components, and the smaller the size of the remaining sets. This happens due to the proportion of the final graph that originally comes from the initial graph. The graph evolved from an initial graph of 10,000 nodes is the synthetic graph in which 97% of the nodes and arcs were inserted by the model. Thus, comparing it with the Altavista graph, one can observe that both present similar proportions, with the exception of the Islands component. One of the reasons for this component not being so large in real-world webgraphs is that many islands may not be reached by web crawlers, and so they do not count towards the webgraph size. In synthetic graphs one can expect a larger size of this component.

The synthetic graphs have component sizes very similar to the ones presented by the first two real-world graphs. In particular, observe the results for the synthetic graph evolved from an initial graph of 10,000 nodes, which is the graph with the largest number of nodes inserted according to the model. However, the sizes for the synthetic graphs differ considerably from the results reported for the last three real-world webgraphs. In particular, the IN components of these real-world graphs are very small. It is completely understandable that in real-world webgraphs the observed IN set is small, which does not necessarily mean that the real IN set is small. When crawling the webgraph, a set of seeds is used as starting points for the search. It is well known that not all webpages are collected by crawlers. Most of the pages that are not collected are expected to belong to the IN set. This happens because pages in IN are not pointed to by any page in the sets SCC or OUT, and thus, if they are not used as seeds, or pointed to by another page belonging to the set IN, they will never be reached. Thus, we expect that the size of the components can vary in real-world graphs depending on the crawling policy adopted. In particular, the size of the set IN depends considerably on the seeds used.

Fig. 11. In-degree distribution (left) and out-degree distribution (right) for graphs with 300,000 nodes, evolved from 10,000 and 200,000 nodes (in-degree), and from 10,000, 50,000, 100,000 and 200,000 nodes (out-degree). Graphs were generated according to the Controlled-Insertion Model. [Plots: number of nodes versus in-degree (left) and out-degree (right), logarithmic axes.]

5.4.3. Pagerank distribution

Fig. 12 presents a plot with the distribution of the pagerank values for the synthetic graphs generated following the Controlled-Insertion Model. As already observed in real-world webgraphs, the pagerank in the generated synthetic graphs also follows a power law distribution.
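For readers who want to reproduce a pagerank distribution such as the one discussed above, the values can be computed by power iteration. The sketch below is a generic textbook formulation, not the COSIN library routine used in the paper; the damping factor 0.85, the fixed number of iterations, and the uniform redistribution of the rank of nodes without outgoing arcs are assumptions, and the function name is hypothetical.

```python
def pagerank(n, arcs, damping=0.85, iterations=50):
    """Power-iteration pagerank for nodes 0..n-1 and directed arcs (u, v)."""
    out_links = [[] for _ in range(n)]
    for u, v in arcs:
        out_links[u].append(v)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Rank held by nodes without outgoing arcs is spread over all nodes.
        dangling = sum(rank[u] for u in range(n) if not out_links[u])
        new_rank = [(1.0 - damping) / n + damping * dangling / n] * n
        for u in range(n):
            if out_links[u]:
                share = damping * rank[u] / len(out_links[u])
                for v in out_links[u]:
                    new_rank[v] += share
        rank = new_rank
    return rank


# Two pages linking to a third one: the linked page gets the highest value.
print(pagerank(3, [(1, 0), (2, 0)]))
```

A histogram of the returned values on logarithmic axes gives the kind of plot shown in Figs. 10 and 12.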

Table 4. Relative size of graph components of different graphs (values in %). The first six rows present results for real-world graphs ("–" stands for data not given, and the percentages correspond to the number of nodes in each component). The remaining rows show results for the generated graphs, representing an average over 10 graphs (the standard deviation is presented after ±).

G1 → G2                      SCC             IN              OUT             Tendrils and tubes    Islands
Altavista 1999 [9]           28              21              21              11                    9
WebBase 2001 [12]            33              11              39              11                    4
WebBase 2001 [31]            56.46           17.24           17.94           –                     –
WebBase 2003 [31]            85.87           2.28            11.26           –                     –
UK 2002 [31]                 65.28           1.69            31.88           –                     –
IT 2004 [31]                 72.30           0.03            27.64           –                     –
1.0 × 10^4 → 3.0 × 10^5      27.79 ± 0.09    17.58 ± 0.53    18.42 ± 0.49    4.67 ± 0.00           31.62 ± 0.20
5.0 × 10^4 → 3.0 × 10^5      32.50 ± 0.70    14.36 ± 0.98    20.37 ± 0.13    4.39 ± 0.03           28.28 ± 0.52
1.0 × 10^5 → 3.0 × 10^5      37.70 ± 0.30    10.66 ± 0.18    22.96 ± 0.18    3.99 ± 0.01           24.61 ± 0.08
2.0 × 10^5 → 3.0 × 10^5      52.86 ± 0.11    4.39 ± 0.02     24.86 ± 0.02    1.82 ± 0.01           15.99 ± 0.04

Fig. 12. Pagerank distribution for graphs generated using the Controlled-Insertion Model. The plot corresponds to two graphs of 300,000 nodes evolved from initial graphs with 10,000 and 200,000 nodes. [Plot: number of nodes versus pagerank, logarithmic axes; series: 10,000 to 300,000 nodes and 200,000 to 300,000 nodes.]

5.4.4. Complementary analysis

Finally, we calculated the Pearson correlation between some measures. As reported in Donato et al. [12], in some real-world webgraphs there was no correlation between in-degree and out-degree, nor between pagerank and out-degree. Accordingly, the highest correlation values found in our graphs were 0.04 and 0.02, for the in-degree/out-degree and out-degree/pagerank correlations, respectively. However, Donato et al. [12] report a low correlation between in-degree and pagerank, whereas we found a correlation ≥0.58. In [30], Fortunato et al. show that there is a high correlation between in-degree and pagerank. Furthermore, they present high linear correlation values for the four real-world graphs analysed. A high correlation was also found by Buriol et al. [10] for the pagerank computed for the English Wikigraph, the graph generated from the link structure of English articles from Wikipedia.

6. Conclusions and summaries

In this work, we proposed three new models for evolving synthetic webgraphs. In the Variable-Outdegree Model, a variable number of outgoing arcs is assigned to each new node. A potential is assigned to each node, representing the probability of the node being linked, according to the preferential attachment. The Fixed-Outdegree Model is analogous, but the number of outgoing arcs inserted when a node is created is fixed. In the Controlled-Insertion Model, nodes and arcs are inserted independently, but arcs are inserted at a higher rate. To each node, a potential of linking to other nodes and a potential of being linked by other nodes are attributed. The potentials are calculated according to power law distributions.

We describe the specification of our graph models through graph grammars, allowing an easier understanding of the rules that govern the graph generation process. We implemented these specifications, generated a set of graphs, and computed characteristics commonly observed in real-world webgraphs. While graphs generated by the Fixed-Outdegree and Variable-Outdegree Models have no cycles (with the exception of the initial graph), in the Controlled-Insertion Model cycles are naturally generated. In previous models proposed in the literature, the generated graphs also have no cycles; in some models a percentage of arcs was inserted randomly (not following the model) with the aim of creating cycles.

We performed a set of experiments considering the graphs generated by our models, measuring in- and out-degree distributions, pagerank distribution, topological structure of the graph, and correlation between some measures. The experiments showed that the models generate graphs that have many of the expected characteristics found in webgraphs. The out-degree, in-degree and pagerank distributions, as well as the correlation values, are similar in all three models. However, the bow-tie structure found in graphs generated by the Controlled-Insertion Model is more similar to the one found in real-world webgraphs.

Another interesting observation is that none of the graphs generated by the models has any bipartite subgraph $K_{i,j}$ with i ≥ 3 and j ≥ 3. The literature reports that real-world webgraphs contain a significant number of $K_{i,j}$. Thus, although our third model generated webgraphs with most of the characteristics commonly found in webgraphs, further research is necessary to propose models that are even more realistic. A closer study of how deletion affects webgraph evolution is also an interesting topic of research.

Finally, we would like to remark that a recent work by Serrano et al. [31] observed, analyzing in detail four real-world webgraphs, an exponential out-degree distribution, differently from the power law out-degree distribution adopted in our work. However, the seminal works [3,9], as well as more recent works that analysed real-world webgraphs, [12] for example, observed a power law distribution for the out-degree. Thus, we assumed a power law distribution for the out-degree. Furthermore, the exponential out-degree distribution can be implemented with a simple change in the third model. It would consider a new potential determined by an exponential distribution of out-degrees, i.e.,

$$\Pr(\text{out-degree} = x) = \frac{\alpha_{out}}{1 - e^{-\alpha_{out} n_i}} \exp(-\alpha_{out} x),$$

resulting in

$$\Omega^{out}_{n_i} = k = -\frac{1}{\alpha_{out}} \ln\left[1 - rnd \cdot \left(1 - e^{-\alpha_{out} n_i}\right)\right].$$

Graphs generated considering this modification can be analysed and compared with the results presented in this paper.
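The exponential potential above is again an inverse-transform sample, this time from an exponential distribution truncated to $[0, n_i]$. A minimal Python sketch follows; the function name is hypothetical, and the guard for a vanishing logarithm argument is an implementation detail added here, not part of the paper.

```python
import math


def exponential_out_potential(n_i, alpha_out, rnd):
    """Invert the CDF of the exponential distribution truncated to [0, n_i]
    at rnd in (0, 1]."""
    arg = 1.0 - rnd * (1.0 - math.exp(-alpha_out * n_i))
    if arg <= 0.0:
        # exp(-alpha_out * n_i) underflowed to 0 and rnd == 1: the exact
        # result is the upper bound n_i.
        return float(n_i)
    return -math.log(arg) / alpha_out


for rnd in (0.001, 0.5, 1.0):
    print(rnd, exponential_out_potential(10, 0.5, rnd))
```

Setting rnd = 1 recovers the upper bound $n_i$, which is a quick consistency check of the formula against the normalization of the distribution.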

References

[1] www.worldwidewebsize.com (accessed on June 28th 2010).

[2] W. Aiello, F. Chung, L. Lu, A random graph model for massive graphs, in: Annual ACM Symposium on the Theory of Computing, 2000, pp. 171–180.

[3] R. Albert, H. Jeong, A.-L. Barabasi, The diameter of the world wide web, Nature 401 (1999) 130–131.

[4] A. Barabasi, R. Albert, Emergence of scaling in random networks, Science 286 (1999) 509.


respectively. His Ph.D. was obtained in 1997 from the Technical University of Berlin, Germany. He is currently an associate professor at Pontifical Catholic University of Rio Grande do Sul (PUCRS), Brazil. His main interests are

R. da Silva et al. / Journal of Com

[5] B. Bollobás, Random Graphs, Academic Press, 1985.[6] A. Bonato, A survey on models of the web graph, Computer Networks (2004)

159–172.[7] A. Bonato, A Course on the Web Graph, American Mathematical Society, 2008.[8] S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine,

in: Computer Networks and ISDN Systems, 1998, pp. 107–117.[9] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, S. Stata, A.

Tomkins, J. Wiener, Graph structure in the web, Computer Networks 33 (June)(2000) 309–320.

10] L. Buriol, C. Castillo, D. Donato, S. Leonardi, S. Millozzi, Temporal evolution of thewikigraph, in: Proceedings of the Web Intelligence Conference, IEEE CS Press,2006.

11] C. Cooper, A. Frieze, Random deletion in a scale free random graph process,Internet Mathematics 1 (4) (2004) 463–483.

12] D. Donato, L. Laura, S. Leonardi, S. Millozzi, Large scale properties of the web-graph, European Physical Journal B 38 (2004) 239–243.

13] D. Donato, L. Laura, S. Leonardi, S. Millozzi, A library of softwaretools for performing measures on large networks, Deliverable D13 ofthe FET Open Project IST-2001-33555 “COevolution and Self-organizationin Dynamic Networks (COSIN)”, 2004, http://www.dis.uniroma1.it/cosin/publications/deliverableD13.pdf.

14] D. Donato, L. Laura, S. Leonardi, S. Millozzi, Simulating the webgraph: a com-parative analysis of models, Computing Systems Science & Engineering 6 (6)(2004) 84–89.

15] H. Ehrig, G. Engels, H.-J. Kreowski, G. Rozenberg, Handbook of Graph Grammarsand Computing by Graph Transformation: Vol. 2 Applications, Languages, andTools, World Scientific, 1999.

16] H. Ehrig, B. Mahr, Fundamentals of Algebraic Specifications 1: Equations andInitial Semantics: Vol. 6 of EACTS Monographs on Theoretical Computer Sci-ence, Springer, Berlin, 1985.

17] H. Ehrig, M. Pfender, H. Schneider, Graph grammars: an algebraic approach,in: Proceedings of the 14th Annual Symposium on Switching and AutomataTheory, IEEE CS Press, 1973, pp. 167–180.

18] D. Eppstein, J. Wang, A steady state model for graph power laws, in: 2nd Inter-national Workshop on Web Dynamics, 2002.

19] P. Erdös, A. Rényi, On random graphs, Publicationes Mathematicae 6 (1959)290–297.

20] P. Erdös, A. Rényi, The Evolution of Random Graphs, vol. 5, Mathematical Insti-tute of the Hungarian Academy of Sciences, 1960, pp. 17–61.

21] A. Flaxman, A. Frieze, J. Vera, Adversarial deletion in a scale free random graphprocess, in: Proceedings of the 16th Annual ACM-SIAM Symposium on DiscreteAlgorithms, 2005, pp. 287–292.

22] A. Gulli, A. Signorini, The indexable web is more than 11.5 billion pages, in:WWW’05: Special Interest Tracks and Posters of the 14th International Con-ference on World Wide Web, ACM Press, New York, NY, USA, 2005, pp. 902–903.

23] R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, Trawling the web for emerg-ing cyber communities, in: Proceedings of the 8th WWW Conference, 1999, pp.403–416.

24] L. Laura, S. Leonardi, G. Caldarelli, P.D.L. Rios, A multi-layer model for the web-graph, in: On-line Proceedings of the 2nd International Workshop on WebDynamics, 2002.

25] O. Mendizabal, F. Dotti, L. Ribeiro, Stochastic graph transformation systems,Electronic Notes in Theoretical Computer Science 184 (2007) 151–170.

26] C. Palmer, J. Steffan, Generating network topologies that obey power laws, in:Proceedings of the GLOBECOM, 2000.

[27] G. Pandurangan, P. Raghavan, E. Upfal, Using pagerank to characterize web structure, in: 8th Annual International Computing and Combinatorics Conference (COCOON), 2002.

[28] D. Pennock, G. Flake, S. Lawrence, E. Glover, C. Giles, Winners don’t take all: characterizing the competition for links on the web, Proceedings of the National Academy of Sciences of the United States of America 99 (8) (2002) 5207–5211.

[29] G. Rozenberg (Ed.), Handbook of Graph Grammars and Computing by Graph Transformation. Volume 1: Foundations, World Scientific, 1997.

[30] A. Flammini, F. Menczer, S. Fortunato, M. Boguná, On local estimations of pagerank: a mean field approach, Internet Mathematics 4 (2) (2007) 245–266.

[31] M. Serrano, A. Maguitman, M. Boguná, S. Fortunato, A. Vespignani, Decoding the structure of the www: a comparative analysis of web crawls, ACM Transactions on the Web 1 (2) (2007) (paper 10).

[32] D. Watts, S. Strogatz, Collective dynamics of ‘small-world’ networks, Nature 393 (1998) 440–442.

Roberto da Silva was born in Mauá, São Paulo, Brazil, on December 16, 1973. He received the B.S. and Ph.D. degrees in physics from the University of São Paulo (USP) in 1998 and 2002, respectively. From 1998 to 2002, he worked in statistical mechanics, mathematical physics, and computational physics, where most of his papers are published in indexed journals. Since 2003, he has been a professor at the Institute of Informatics of the Federal University of Rio Grande do Sul, Porto Alegre, Brazil. His research interests include computational and analytical modeling of physically and biologically motivated systems.

Luciana S. Buriol received her bachelor's degree in computer science in 1998 from the Federal University of Santa Maria, Brazil. Her master's and Ph.D. degrees were obtained in 2000 and 2003 from the State University of Campinas (UNICAMP), Brazil. In 2001 and 2002 she spent 15 months as a visiting scholar at AT&T Labs Research, USA. In 2004 and 2005 she was a postdoc at the University of Rome. Since 2006 she has been an associate professor in computer science at the Federal University of Rio Grande do Sul (UFRGS), Brazil. Her main research interests are in optimization and algorithms.

Leila Ribeiro received her bachelor's and master's degrees in computer science from the Federal University of Rio Grande do Sul (UFRGS), Brazil, in 1988 and 1991, respectively. Her Ph.D. was obtained in 1996 from the Technical University of Berlin, Germany. She is currently an associate professor at UFRGS. Her main interests are software specification and verification, modeling and analysis of complex systems, and bioinformatics.

Fernando Luís Dotti received his bachelor and master degrees in computer science from the Federal University of Rio Grande do Sul (UFRGS), Brazil, in 1988 and 1992,

specification and analysis of distributed and fault-tolerant systems and protocols.