19
1 Network Properties of Folksonomies Ciro Cattuto c,b Christoph Schmitz a Andrea Baldassarri b Vito D. P. Servedio b,c Vittorio Loreto b,c Andreas Hotho a Miranda Grahl a Gerd Stumme a a Knowledge & Data Engineering Group Dept. of Mathematics and Computer Science Univ. of Kassel Wilhelmsh¨ oher Allee 73 D-34121 Kassel, Germany {lastname}@cs.uni-kassel.de b Dipartimento di Fisica, Universit` a di Roma “La Sapienza” P.le A. Moro, 2, I-00185 Roma, Italy {firstname.lastname}@roma1.infn.it c Museo Storico della Fisica e Centro Studi e Ricerche Enrico Fermi Compendio Viminale I-00184 Roma, Italy Social resource sharing systems like YouTube and del.icio.us have acquired a large number of users within the last few years. They provide rich resources for data analysis, information retrieval, and knowledge discov- ery applications. A first step towards this end is to gain better insights into content and structure of these systems. In this paper, we will analyse the main net- work characteristics of two of these systems. We con- sider their underlying data structures – so-called folk- sonomies – as tri-partite hypergraphs, and adapt clas- sical network measures like characteristic path length and clustering coefficient to them. Subsequently, we introduce a network of tag co- occurrence and investigate some of its statistical prop- erties, focusing on correlations in node connectivity and pointing out features that reflect emergent seman- tics within the folksonomy. We show that simple sta- tistical indicators unambiguously spot non-social be- havior such as spam. 1. Introduction A new family of so-called “Web 2.0” applica- tions is currently emerging on the Web. These in- clude user-centric publishing and knowledge man- agement platforms like Wikis, Blogs, and social resource sharing systems. In this paper, we fo- cus on resource sharing systems, which all use the same kind of lightweight knowledge representa- tion, called folksonomy. The word ‘folksonomy’ is a blend of the words ‘taxonomy’ and ‘folk’, and stands for conceptual structures created by the people. Resource sharing systems, such as YouTube 1 or del.icio.us, 2 have acquired large numbers of users (from discussions on the del.icio.us mailing list, one can approximate the number of users on del.icio.us to be several hundreds of thousands) within less than three years. The reason for their immediate success is the fact that no specific skills are needed for participating, and that these tools yield immediate benefit for each individual user (e.g. organizing ones bookmarks in a browser- independent, persistent fashion) without too much overhead. Large numbers of users have created huge amounts of information within a very short period of time. In this paper, we will investigate the growing network structure of folksonomies over time from different viewpoints, using two datasets from run- ning systems as examples. Firstly, we investigate the network structure of folksonomies much on the same line as the devel- opments in the research area of complex networks. To that end, we will adapt measures for so-called “small world networks” which have been used on a wide variety of graphs in recent years, to the particular tripartite structure of folksonomies and show that folksonomies do indeed exhibit a small world structure. Secondly, beyond the analysis of the whole hy- pergraph, we also consider specific projections of it by narrowing the scope and focusing on partic- ular features of the structure. We analyze in par- ticular the tag co-occurrence network and study its properties. This is a weighted network where 1 http://www.youtube.com/ 2 http://del.icio.us AI Communications ISSN 0921-7126, IOS Press. All rights reserved

Network Properties of Folksonomies › hotho › pub › 2007 › aicomm...Cattuto et al. / Network Properties of Folksonomies 3 Fig. 1. BibSonomy displays bookmarks and BibTEX based

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Network Properties of Folksonomies › hotho › pub › 2007 › aicomm...Cattuto et al. / Network Properties of Folksonomies 3 Fig. 1. BibSonomy displays bookmarks and BibTEX based

1

Network Properties of Folksonomies

Ciro Cattuto c,b Christoph Schmitz a

Andrea Baldassarri b Vito D. P. Servedio b,c

Vittorio Loreto b,c Andreas Hotho a

Miranda Grahl a Gerd Stumme a

a Knowledge & Data Engineering GroupDept. of Mathematics and Computer ScienceUniv. of KasselWilhelmshoher Allee 73D-34121 Kassel, Germany{lastname}@cs.uni-kassel.deb Dipartimento di Fisica, Universita di Roma “LaSapienza”P.le A. Moro, 2,I-00185 Roma, Italy{firstname.lastname}@roma1.infn.itc Museo Storico della Fisica e Centro Studi eRicerche Enrico FermiCompendio ViminaleI-00184 Roma, Italy

Social resource sharing systems like YouTube anddel.icio.us have acquired a large number of users withinthe last few years. They provide rich resources for dataanalysis, information retrieval, and knowledge discov-ery applications. A first step towards this end is togain better insights into content and structure of thesesystems. In this paper, we will analyse the main net-work characteristics of two of these systems. We con-sider their underlying data structures – so-called folk-sonomies – as tri-partite hypergraphs, and adapt clas-sical network measures like characteristic path lengthand clustering coefficient to them.

Subsequently, we introduce a network of tag co-occurrence and investigate some of its statistical prop-erties, focusing on correlations in node connectivityand pointing out features that reflect emergent seman-tics within the folksonomy. We show that simple sta-tistical indicators unambiguously spot non-social be-havior such as spam.

1. Introduction

A new family of so-called “Web 2.0” applica-tions is currently emerging on the Web. These in-

clude user-centric publishing and knowledge man-agement platforms like Wikis, Blogs, and socialresource sharing systems. In this paper, we fo-cus on resource sharing systems, which all use thesame kind of lightweight knowledge representa-tion, called folksonomy. The word ‘folksonomy’ isa blend of the words ‘taxonomy’ and ‘folk’, andstands for conceptual structures created by thepeople.

Resource sharing systems, such as YouTube1

or del.icio.us,2 have acquired large numbers ofusers (from discussions on the del.icio.us mailinglist, one can approximate the number of users ondel.icio.us to be several hundreds of thousands)within less than three years. The reason for theirimmediate success is the fact that no specific skillsare needed for participating, and that these toolsyield immediate benefit for each individual user(e.g. organizing ones bookmarks in a browser-independent, persistent fashion) without too muchoverhead. Large numbers of users have createdhuge amounts of information within a very shortperiod of time.

In this paper, we will investigate the growingnetwork structure of folksonomies over time fromdifferent viewpoints, using two datasets from run-ning systems as examples.

Firstly, we investigate the network structure offolksonomies much on the same line as the devel-opments in the research area of complex networks.To that end, we will adapt measures for so-called“small world networks” which have been used ona wide variety of graphs in recent years, to theparticular tripartite structure of folksonomies andshow that folksonomies do indeed exhibit a smallworld structure.

Secondly, beyond the analysis of the whole hy-pergraph, we also consider specific projections ofit by narrowing the scope and focusing on partic-ular features of the structure. We analyze in par-ticular the tag co-occurrence network and studyits properties. This is a weighted network where

1http://www.youtube.com/2http://del.icio.us

AI CommunicationsISSN 0921-7126, IOS Press. All rights reserved

Page 2: Network Properties of Folksonomies › hotho › pub › 2007 › aicomm...Cattuto et al. / Network Properties of Folksonomies 3 Fig. 1. BibSonomy displays bookmarks and BibTEX based

2 Cattuto et al. / Network Properties of Folksonomies

each tag is a node and links are drawn between apair of tags whenever the two tags co-occur in thesame post and the weight is given by the numberof different posts where that pair appears. This tagco-occurrence network can be used to get insightsinto the tagging behaviour of users and to detectanomalies, e. g. those inflicted by spammers.

The remainder of the paper is structured asfollows: In Section 2, we discuss related work.Section 3 introduces two large scale folksonomydatasets which our analyses will be based on. Sec-tion 4 introduces quantitative measures for thenetwork properties for the tripartite structure ofa folksonomies. Section 5 examines a projection ofthe tripartite graph by studying the structure ofthe tag co-occurrence network. Finally in Section 6we draw some conclusions and highlight open is-sues.

2. Related Work

2.1. Folksonomies and Folksonomy Mining

As the field of folksonomies is a young one, thereare relatively few scientific publications about thistopic. Refs. [19,9] provide a general overview offolksonomies, their structure, and provide some in-sights into their dynamics.

More recently, particular aspects of folksonomieshave been elaborated in more detail, e.g. rankingof contents [12], discovering trends in the taggingbehaviour of users [7,13], or learning taxonomicrelations from tags [10,27,26,20,14].

2.2. Small World Networks

The graph-theoretic notions of Section 4 are de-rived from those developed in an emerging area ofresearch which has been called “the new scienceof networks” [23], using concepts from social net-work analysis, graph theory, as well as statisticalphysics; see [23] for an overview.

In particular, the notions of clustering coeffi-cient and characteristic path length as indicatorsfor small world networks have been introduced byWatts and Strogatz [31]; for particular kinds ofnetworks, such as bipartite [18] or weighted [2]graphs, variants of those measures have been de-vised. To the best of our knowledge, no versionsof these measures for tripartite hypergraphs such

as folksonomies, or hypergraphs in general, havebeen proposed previously.

Networks related to folksonomy, in line withother different human based social or technologi-cal networks, possess a lot of other peculiar char-acteristics. The most striking of them is proba-bly the observation that the degree of nodes, i. e.,the number of links connected to a node, followsa fat tailed distribution index of a complex inter-action between human agents [29]. Work has beendone also on the complex network of WikiPedia [4]where links also possess a specific direction.

The notion of a small world has been introducedin a seminal paper by Milgram [21]. Milgram triedto verify in a practical experiment that, with ahigh probability, any two given persons within theUnited States would be connected through a rel-atively short chain of mutual acquaintances. Re-cently, the term “small world” has been definedmore precisely as a network having a small charac-teristic path length comparable to that of a (reg-ular or Erdos) random graph, while at the sametime exhibiting a large degree of clustering [30](which a random graph does not). These networksshow some interesting properties: while nodes aretypically located in densely-knit clusters, there arestill long-range connections to other parts of thenetwork, so that information can spread quickly.At the same time, the networks are robust againstrandom node failures. Since the coining of theterm “small world”, many networks, including so-cial and biological as well as man-made, engineeredones, have been shown to exhibit small-world prop-erties. We will show in the remainder of this paperthat folksonomies have a small world structure.

3. Folksonomy Datasets

In this section, we will introduce the formal no-tation used in the remainder of the paper, as wellas the two large scale data sets that we will discussin the following sections.

3.1. Folksonomy Notation

In the following, we briefly recapitulate the for-mal notation for folksonomies introduced in [12],which we will use in the remainder of the paper.3

A folksonomy is a tuple F := (U, T,R, Y ) where

3We use the simplified version without personomies orhierarchical relations between tags here.

Page 3: Network Properties of Folksonomies › hotho › pub › 2007 › aicomm...Cattuto et al. / Network Properties of Folksonomies 3 Fig. 1. BibSonomy displays bookmarks and BibTEX based

Cattuto et al. / Network Properties of Folksonomies 3

Fig. 1. BibSonomy displays bookmarks and BibTEX based bibliographic references simultaneously.

– U , T , and R are finite sets, whose elementsare called users, tags and resources, resp., and

– Y is a ternary relation between them, i. e.,Y ⊆ U × T ×R, called tag assignments (TASfor short).

Another view on this kind of data is that ofa 3-regular, tripartite hypergraph, in which thenode set is partitioned into three disjoint sets:V = T ∪U ∪R, and every hyperedge {t, u, r} con-sists of exactly one tag, one user, and one resource.In Formal Concept Analysis [8], such data struc-tures are called triadic context [17].

Sometimes it is convenient to consider all tagassignments of a given user to a given resource.We call this aggregation of TAS of a user u to aresource r a post P (u, r) := {(t, u, r) ∈ Y | t ∈ T}.

3.2. del.icio.us Dataset

For our experiments, we collected data fromthe del.ico.us system in the following way. Ini-

tially we used wget starting from the start pageof del.ico.us to obtain nearly 6, 900 users and 700tags as a starting set. Out of this dataset we ex-tracted all users and resources (i. e., del.icio.us’MD5-hashed URLs). From July 27 to 30, 2005, wedownloaded in a recursive manner user pages toget new resources, and resource pages to get newusers. Furthermore we monitored the del.icio.usstart page to gather additional users and resources.This way we collected a list of several thousandusernames which we used for accessing the first10, 000 resources each user had tagged. From thecollected data we finally took the user files to ex-tract resources, tags, dates, descriptions, extendeddescriptions, and the corresponding username.

We obtained a folksonomy with |U | = 75, 242users, |T | = 533, 191 tags and |R| = 3, 158, 297 re-sources, related by in total |Y | = 17, 362, 212 tagassignments. In addition, we generated monthlydumps from the timestamps associated with posts,so that 14 snapshots in monthly intervals from

Page 4: Network Properties of Folksonomies › hotho › pub › 2007 › aicomm...Cattuto et al. / Network Properties of Folksonomies 3 Fig. 1. BibSonomy displays bookmarks and BibTEX based

4 Cattuto et al. / Network Properties of Folksonomies

Fig. 2. detail showing a single bookmark post

Fig. 3. detail showing a single publication post

June 15th, 2004 through July 15th, 2005 are avail-able.

3.3. BibSonomy Dataset

As some of the authors are involved in the folk-sonomy site BibSonomy [11],4 a second datasetfrom that system could be obtained directly froma database dump.

BibSonomy allows users to share bookmarks(i. e., URLs) as well as publication references. Thedata model of the publication part is based onBibTEX [24], a popular literature management sys-tem for LATEX [16].

A typical list of posts is depicted in Figure 1which shows bookmark and publication posts con-taining the tag web. The page is divided into fourparts: the header (showing information such as thecurrent page and path, navigation links and searchboxes), two lists of posts – one for bookmarks andone for publications – each sorted by date in de-scending order, and a list of tags related to theposts. This scheme holds for all pages showingposts and allows for navigation in all dimensionsof the folksonomy.

A detailed view of one bookmark post from thelist in Figure 1 can be seen in Figure 2. The firstline shows in bold the title of the bookmark whichhas the URL of the bookmark as underlying hy-perlink. The second line shows an optional descrip-tion the user can assign to every post. The last twolines belong together and show detailed informa-tion: first, all the tags the user has assigned to this

4http://www.bibsonomy.org

post (web, service, tutorial, guidelines and api),second, the user name of that user (hotho) followedby a note, how many users tagged that specificresource. These parts have underlying hyperlinks,leading to the corresponding tag pages of the user(/user/hotho/web, /user/hotho/service, . . . ),the users page (/user/hotho) and a page show-ing all four posts (i. e., the one of user hotho andthose of the three other users) of this resource(/url/$r$). The last part shows the posting dateand time followed by links for actions the user cando with this post – depending on if this is his own(edit, delete) or another user’s post (copy).

The structure of a publication post displayed inBibSonomy is very similar, as seen in Figure 3. Thefirst line shows again the title of the post, whichequals the title of the publication in BibTEX. It hasan underlying link leading to a page which showsdetailed information on that post. This line is fol-lowed by the authors or editors of the publication,as well as journal or book title and the year. Thenext lines show the tags assigned to this post bythe user, whose user name comes next followed bya note how many people tagged this publication.As described for bookmark posts, these parts linkto the respective pages. After the date and timethe user posted this entry follow the actions theuser can do, which in this case include picking theentry for later download, copying it, accessing theURL of the entry or viewing the BibTEX sourcecode.

As with the del.icio.us dataset, we created adump of the system, and calculated monthly snap-shots, based on the timestamps. This resultedin 20 datasets. The most recent one, from July31st, 2006, contains data from |U | = 428 users,|T | = 13, 108 tags, |R| = 47, 538 resources, con-nected by |Y | = 161, 438 tag assignments.

4. Small Worlds in Three-Mode-Networks

As expected, the tagging behavior in del.icio.usdisplays a fat-tailed distribution: Fig. 4 shows thefraction of tags, users, and resources, respectively,occurring in a given number of TAS. We observethat the probability distributions for tags and re-sources display a rather clean power-law tail, whilethe distribution for users features a different be-havior for small frequencies. This suggests the exis-

Page 5: Network Properties of Folksonomies › hotho › pub › 2007 › aicomm...Cattuto et al. / Network Properties of Folksonomies 3 Fig. 1. BibSonomy displays bookmarks and BibTEX based

Cattuto et al. / Network Properties of Folksonomies 5

100

101

102

103

104

105

106

f

10-10

10-8

10-6

10-4

10-2

100

P(f)

tagsusersresources

Fig. 4. Probability distribution of the frequency of occur-rence of tags, users and resources in del.icio.us. For tags(black dots), the abscissa of each point corresponds to agiven frequency, and its ordinate is the fraction of tags thatoccur with the selected frequency. The same holds for users(green dots) and resources (red dots).

tence of two classes of users, with very active users(less than a thousandth of the total) following adifferent scaling as compared to the vast majorityof less active users.

We will now investigate whether folksonomiesfeature small-world properties. To this end, we willdefine the notions of characteristic path lengthand clustering coefficient in tripartite hypergraphssuch as folksonomies, and apply these to thedatasets introduced in Section 3.

4.1. Characteristic Path Length

The characteristic path length of a graph [30]describes the average length of a shortest path be-tween two random nodes in the graph. If the char-acteristic path length is small, few hops will beneccessary, on average, to get from a particularnode in the graph to any other node.

As folksonomies are triadic structures of (tag,user, resource) assignments, the user interface ofsuch a folksonomy system will typically allow theuser to jump from a given tag to (a) any resourceassociated with that tag, or to (b) any user whouses that tag, and vice versa for users and re-sources. Thus, the effort of getting from one nodein the folksonomy to another can be measured bycounting the hyperedges in shortest paths between

the two. Here a path is defined as a sequence of hy-peredges such that each hyperedge shares at leastthe user or the resource or the tag with the follow-ing hyperedge.

More precisely, let v1, v2 ∈ T ∪ U ∪ R betwo nodes in the folksonomy, and (t0, u0, r0), . . . ,(tn, un, rn) a minimal sequence of TAS such that,for all k with 0 ≤ k < n, (tk = tk+1) ∨ (uk =uk+1) ∨ (rk = rk+1), and v1 ∈ {t0, u0, r0}, v2 ∈{tn, un, rn}. Then we call d(v1, v2) := n the dis-tance of v1 and v2. We compute path lengthswithin connected components only.

Following Watts [30], we define dv as the meanof d(v, u) over all u ∈ (T ∪ U ∪R)− {v}, and callthe median of the dv over all v ∈ T ∪ U ∪ R thecharacteristic path length L of the folksonomy.

In Section 4.3, we will analyse the character-istic path length on our datasets. As comput-ing the characteristic path length is prohibitive-ly expensive for graphs of the size encounteredhere, we sampled 200 nodes randomly from eachgraph and computed the path lengths from each ofthose nodes to all others in the folksonomy usingbreadth-first search.

4.2. Clustering Coefficients

Clustering or transitivity in a network meansthat two neighbors of a given node are likely tobe directly connected as well, thus indicating thatthe network is locally dense around each node. Tomeasure the amount of clustering around a givennode v, Watts [30] has defined a clustering coeffi-cient γv (for normal, non-hyper-graphs). The clus-tering coefficient of a graph is γv averaged over allnodes v.

Watts [30, p. 33] defines the clustering coefficientγv as follows (Γv = Γ(v) denotes the neighborhoodof v):

Hence γv is simply the net fraction of those pos-sible edges that actually occur in the real Γv. Interms of a social-network analogy, γv is the de-gree to which a person’s acquaintances are ac-quainted with each other and so measures thecliquishness of v’s friendship network. Equiva-lently, γv is the probability that two vertices inΓ(v) will be connected.

Note that Watts combines two aspects which arenot equivalent in the case of three-mode data. Thefirst one is: how many of the possible edges around

Page 6: Network Properties of Folksonomies › hotho › pub › 2007 › aicomm...Cattuto et al. / Network Properties of Folksonomies 3 Fig. 1. BibSonomy displays bookmarks and BibTEX based

6 Cattuto et al. / Network Properties of Folksonomies

a node do actually occur, i. e., does the neighbor-hood of the given vertex approach a clique? Thesecond aspect is that of neighbors of a given nodebeing connected themselves.

Following the two motivations of Watts, wethus define two different clustering coefficients forthree-mode data:

Cliquishness: From this point of view, the cluster-ing coefficient of a node is high iff many of thepossible edges in its neighborhood are present.More formally: Consider a resource r. Thenthe following tags Tr and users Ur are con-nected to r: Tr = {t ∈ T | ∃u : (t, u, r) ∈ Y },Ur = {u ∈ U | ∃t : (t, u, r) ∈ Y }. Further-more, let tur := {(t, u) ∈ T × U | (t, u, r) ∈Y }, i. e., the (tag, user) pairs occurring withr.If the neighborhood of r was maximallycliquish, all of the pairs from Tr × Ur wouldoccur in tur. So we define the clustering coef-ficient γcl(r) as:

γcl(r) =|tur|

|Tr| · |Ur|∈ [0, 1] (1)

i. e., the fraction of possible pairs present inthe neighborhood. A high γcl(r) would indi-cate, for example, that many of the users re-lated to a resource r assign overlapping setsof tags to it.The same definition of γcl stated here for re-sources can be made symmetrically for tagsand users.

Connectedness (Transitivity): The other point ofview follows the notion that the clusteringaround a node is high iff many nodes inthe neighborhood of the node were connectedeven if that node was not present.In the case of folksonomies: consider a re-source r. Let tur := {(t, u) ∈ T×U | (t, u, r) ∈Y ∧∃r 6= r : (t, u, r) ∈ Y }, i. e., the (tag, user)pairs from tur that also occur with some otherresource than r. Then we define:

γco(r) :=|tur||tur|

∈ [0, 1] (2)

i. e., the fraction of r’s neighbor pairs thatwould remain connected if r were deleted. γco

indicates to what extent the surroundings ofthe resource r contain “singleton” combina-tions (user, tag) that only occur once.

Again, the definition works the same for tagsand users, and the clustering coefficients forthe whole folksonomy are defined as the arith-metic mean over the nodes.

The following example demonstrates that theclustering coefficients γcl and γco do indeed cap-ture different characteristics of the graph and arenot intrinsically related. One might suspect thatthere is a systematic connection between the two,such as γcl(r) < γcl(s) ⇒ γco(r) < γco(s) for nodesr, s ∈ T ∪ U ∪ R, or similarly, on the level of thewhole folksonomy, γco(F) < γco(G) ⇒ γcl(F) <γcl(G).

However, this is not the case: consider a folk-sonomy F with tag assignments Y1 = {(t1, u2, r2),(t1, u1, r1), (t1, u1, r2), (t1, u2, r1), (t1, u3, r3), (t2,u3, r3), (t2, u4, r4)}. Here we have γcl(t1) ≈ 0.556 >γcl(t2) = 0.5, but γco(t1) = 0.2 < γco(t2) = 0.5.

Also, there is no monotonic connection whenconsidering the folksonomy as a whole. For thewhole folksonomy F, we have γcl(F) ≈ 0.906,γco(F) ≈ 0.470.

Considering a second folksonomy G with tag as-signments Y2 = {(t1, u1, r1), (t1, u1, r3), (t1, u2, r2),(t1, u3, r2), (t2, u1, r2), (t2, u2, r1), (t2, u2, r2),(t2, u2, r3), (t3, u1, r2), (t3, u2, r2)}, we see thatγcl(G) = 0.642, γco(G) = 0.669, thus γcl(F) >γcl(G) while γco(F) < γco(G).

4.3. Experiments

4.3.1. SetupIn order to check whether our observed folkson-

omy graphs exhibit small world characteristics, wecompared the characteristic path lengths and clus-tering coefficients with random graphs of a sizeequal in all dimensions T , U , and R as well as Yto the respective folksonomy under consideration.

Two kinds of random graphs are used for com-parison:

Binomial: These graphs are generated similar toan Erdos random graph G(n, M) [3], wheren is the number of nodes and M is thenumber of edges. Adapting the constructionof G(n, M) to the structure of folksonomies,T,U,R are taken as nodes from the observedfolksonomies, and |Y | many hyperedges arethen created by picking the three endpoints ofeach edge from uniform distributions over T ,U , and R, respectively, leading to a binomialdistribution of degrees over the nodes.

Page 7: Network Properties of Folksonomies › hotho › pub › 2007 › aicomm...Cattuto et al. / Network Properties of Folksonomies 3 Fig. 1. BibSonomy displays bookmarks and BibTEX based

Cattuto et al. / Network Properties of Folksonomies 7

Permuted: These graphs are created by usingT,U,R from the observed folksonomy. Thetagging relation Y is created by taking theTAS from the original graph and permutingeach dimension of Y independently (using aKnuth Shuffle [15]), thus creating a randomgraph with the same degree sequence as theobserved folksonomy.

As stated above, the computation of the charac-teristic path length is prohibitively expensive forgraphs of our size. As for the del.icio.us and Bib-Sonomy datasets, we sampled 200 nodes randomlyfrom each graph and computed the path lengthsfrom each of those nodes to all others in the folk-sonomy.

Although we did not take any specific measuresto keep the graphs connected, almost all nodes liein a giant connected component for all data sets.The largest number of nodes disconnected from thegiant component we encountered were 351 out of1,539,326 nodes (.02% of the nodes) for the actualdel.icio.us data in month 9, and 88 out of 58,879(.15% of the nodes) for the BibSonomy data inmonth 19. The random graphs showed even fewerdisconnected nodes, the maximum numbers being18 nodes out of 951,513 (.002%, del.icio.us, month7, permuted) and 6 out of 60,984 (.01%, BibSon-omy, month20, permuted).

For all experiments involving randomness (i. e.,those on the random graphs as well as the samplingfor characteristic path lengths), 20 runs were per-formed to ensure consistency. The presented val-ues are the arithmetic means over the runs; thedeviations across the runs were negligible in all ex-periments.

4.3.2. First ObservationsFigures 5–7 show the results for the clustering

coefficients and the characteristic path lengths forboth datasets, plotted against the number |Y | oftag assignments for the respective monthly snap-shots.

Both folksonomy datasets under considerationexhibit the small world characteristics as definedat the beginning of this section. Their clusteringcoefficients are extremely high, while the charac-teristic path lengths are comparable to (BibSon-omy) or even considerably lower (del.icio.us) thanthose of the binomial random graphs.

del.icio.us. In the del.icio.us dataset (Figures 6and 7, right hand sides), it can be seen that bothclustering coefficients are extremely high at about0.86, much higher than those for the permuted andbinomial random graphs. This could be an indica-tion of coherence in the tagging behaviour: if, forexample, a given set of tags is attached to a certainkind of resources, users do so consistently.

On the other hand, the characteristic pathlengths (Figure 5, right) are considerably smallerthan for the random binomial graphs, though notas small as for the permuted setting. The compar-ison with the random binomial graph shows thesmall world behavior of the human tagging activ-ity. Our interpretation of the comparison with thepermuted setting is that the latter maintains thestructural features of the human tagging behavior,while introducing additional links between person-omies of otherwise unrelated users; leading themthus out of their ‘caveman world’ [30].

Interestingly, the path length has remained al-most constant at about 3.5 while the number ofnodes has grown about twentyfold in the observa-tion period. As explained in Section 4.1, in practicethis means that on average, every user, tag, or re-source within del.icio.us can be reached within 3.5mouse clicks from any given del.icio.us page. Thismight help to explain why the concept of serendip-itous discovery [19] of contents plays such a largerole in the folksonomy community – even if thefolksonomy grows to millions of nodes, everythingin it is still reachable within few hyperlinks.

BibSonomy. As the BibSonomy system is ratheryoung, it contains roughly two orders of magni-tude fewer tags, users, resources, and TAS thanthe del.icio.us dataset.

On the other hand, the values show the sametendencies as in the del.icio.us experiments.

Figures 6 and 7 (left) show that clustering isextremely high at γcl ≈ 0.96 and γco ≈ 0.93 – evenmore so than in the del.icio.us data.

At the same time, Figure 5 shows that the char-acteristic path lengths are somewhat larger, but atleast comparable to those of the binomial graph.

There is considerably more fluctuation in thevalues measured for BibSonomy due to the factthat the system started only briefly before our ob-servation period. Thus, in that smaller folkson-omy, small changes, such as the appearance of anew user with a somewhat different behaviour, had

Page 8: Network Properties of Folksonomies › hotho › pub › 2007 › aicomm...Cattuto et al. / Network Properties of Folksonomies 3 Fig. 1. BibSonomy displays bookmarks and BibTEX based

8 Cattuto et al. / Network Properties of Folksonomies

215

216

217

218

Number of tag assignments

2

2.2

2.4

2.6

2.8

3

3.2

3.4

3.6

3.8

4

L

BibSonomyPermutedBinomial

219

220

221

222

223

224

Number of tag assignments

2

2.2

2.4

2.6

2.8

3

3.2

3.4

3.6

3.8

4

L

del.icio.usPermutedBinomial

Fig. 5. Characteristic path length for the BibSonomy folksonomy (left) and the del.icio.us folksonomy (right), compared withthe corresponding random graphs: permuted and binomial (see text). The measure is repeated following the network growthand shown as a function of the number of tagging events. Similar graphs have been obtained as a function of the number ofnodes of the networks (not shown). Note how the charateristic path length takes quite similar low values, typical of smallworld networks, for all graphs.

215

216

217

218

Number of tag assignments

0

0.2

0.4

0.6

0.8

1

γcl

BibsonomyPermutedBinomial

219

220

221

222

223

224

Number of tag assignments

0

0.2

0.4

0.6

0.8

1

γcl

del.icio.usPermutedBinomial

Fig. 6. Cliquishness of the BibSonomy folksonomy (left) and the del.icio.us folksonomy (right), compared with the correspond-ing random graphs: permuted and binomial (see text). The measure is repeated following the network growth and shown asa function of the number of tagging events. Similar graphs have been obtained as a function of the number of nodes of thenetworks (not shown). The cliquishness for the folksonomy networks takes quite high values, higher than the correspondingrandom graph (permuted and binomial).

more impact on the values measured in our exper-iments.

Furthermore, many BibSonomy users are earlyadopters of the system, many of which know eachother personally, work in the same field of inter-est, and have previous experience with folksonomysystems. This might also account for the very highamount of clustering.

4.4. Characteristic Path Length for Tags

Figure 5 demonstrated that the characteristicpath length L of the two folksonomies under con-

sideration grows comparably to that of the respec-tive “binomial” random folksonomies. As the num-ber of resources |R| dominates the numbers of tags|T | and users |U | by almost one and two ordersof magnitude, resp., L is heavily influenced by thecharacteristic path length for resources.

In order to get an insight into the behaviour oftags in that respect, we computed the character-istic path length as described in 4.1, but this timetaking only the values dt for tags t ∈ T into ac-count for L.

Figure 8 shows the growth of L for tags in theBibSonomy and del.icio.us folksonomies. Interest-

Page 9: Network Properties of Folksonomies › hotho › pub › 2007 › aicomm...Cattuto et al. / Network Properties of Folksonomies 3 Fig. 1. BibSonomy displays bookmarks and BibTEX based

Cattuto et al. / Network Properties of Folksonomies 9

215

216

217

218

Number of tag assignments

0

0.2

0.4

0.6

0.8

1

γco bibsonomy

PermutedBinomial

219

220

221

222

223

224

Number of tag assignments

0

0.2

0.4

0.6

0.8

1

γco

del.icio.usPermutedBinomial

Fig. 7. Connectedness/Transitivity of the BibSonomy folksonomy (left) and the del.icio.us folksonomy (right), compared withthe corresponding random graphs: permuted and binomial (see text). The measure is repeated following the network growthand shown as a function of the number of tagging events. Similar graphs have been obtained as a function of the number ofnodes of the networks (not shown). As in the case of cliquishness, the values of connectedness/transitivity are very high forthe folksonomy networks, at odds with the corresponding random graphs (permutated and binomial).

215

216

217

Number of Tag Assignments

2

2.2

2.4

2.6

2.8

3

3.2

3.4

3.6

3.8

4

L

BibSonomyBinomialPermuted

219

220

221

222

223

224

Number of Tag Assignments

2

2.2

2.4

2.6

2.8

3

3.2

3.4

3.6

3.8

4

L

del.icio.usBinomialPermuted

Fig. 8. Characteristic path length L considering only tags in BibSonomy (left) and del.icio.us (right). The figures show thatthe characteristic path length restricted to tags is almost constant starting from a very early point in the growth of thefolksonomy.

ingly, the average path length for tags in the Bib-Sonomy dataset is much larger than that for therandom folksonomies and rises to about 3.5 to 3.6very early in the life of BibSonomy, but then re-mains almost constant. In the del.icio.us folkson-omy, which is considerably larger than the latterone, the characteristic path length for tags still re-mains almost the same at about 3.7.

Our interpretation is that even a small numberof early folksonomy users introduces a considerableamount of idiosyncratic vocabulary, large parts ofwhich are rather distant from the rest of the folk-sonomy. Interestingly, even in the larger del.icio.usfolksonomy, the average tag is still farther awayfrom the rest of the folksonomy at L ≈ 3.7 as op-

posed to the L ≈ 3.5 from Figure 5 which is largelydominated by resources. This is surprising, as theaverage tag occurs in about 9 times as many tagassignments as the average resource.

4.5. A Closer Look on del.icio.us

We will conclude this section by a closer look onhow the characteristic path length, the cliquish-ness, and the connectedness are distributed overthe users, tags, and resources in del.icio.us.

To this end, we have computed the co-occurencegraphs for the three dimensions users, tags, andresources. More formally, the co-occurence graphfor the tags has the set T of tags as vertices; and

Page 10: Network Properties of Folksonomies › hotho › pub › 2007 › aicomm...Cattuto et al. / Network Properties of Folksonomies 3 Fig. 1. BibSonomy displays bookmarks and BibTEX based

10 Cattuto et al. / Network Properties of Folksonomies

two tags t1 and t2 are connected by an undirectededge, iff there is at least one resource r and oneuser u such that (u, t1, r), (u, t2, r) ∈ Y . The char-acteristic path length and clustering coefficients ofthe (non-hyper) co-occurrence graphs are shown inthe left diagrams of Figures 9 and 10. The char-acteristic path length was approximated by takinga 200-node sample, and for the clustering coeffi-cient the approximation from [25] was used with aprecision of ε = 10−3 and a probability of 0.99.

The left diagram of Figure 9 shows the char-acteristic path lengths of the three co-ocurrencegraphs.5 The result is as expected: the set of re-sources is almost an order of magnitude larger thanthe set of tags, which is about the same ratio largerthan the set of users. The larger graphs have highercharacteristic path lengths.

The right diagram of Figure 9 shows the differ-ent contributions of the tags, users, and resourcesto the del.icio.us curve in the right diagrams ofFigures 5 and 8. For computing the values, therandom nodes have been drawn only from the re-spective classes. The low path length for the usernodes indicates that personomies (defined as theset of TAS associated with a single given user)are a structural element in a folksonomy: Con-sider the extreme case that all personomies arecompletely disjoint. Then the users are the centralnodes in their connected component (which equalstheir personomy), and have thus shorter charac-teristic path lengths in average.

The characteristic path lengths of the tags andresources in the right diagram are reversed com-pared to the left diagram. This is likely to be dueto the fact that users tend to invent new, personaltags – which are further away from the core of thefolksonomy – whereas there is less divergence ofthe URLs to be included in the system.

Figure 10 shows the clustering coefficient ofdel.icio.us for the three co-occurrence graphs (left)and connectedness of the hypergraph by dimen-sion (right). Both diagrams show that the neigh-borhoods around tags and resources are denserthan around users. This is likely to stem from thefact that users usually have different interests. Aninteresting observation is that the user curve de-creases over time in the left diagram, while it in-creases in the right one. Both effects result from

5Note that the three values are measured in three differ-ent graphs.

220 221 222 223 224

Number of Tag Assignments

0

0.2

0.4

0.6

0.8

1

γclResourcesTagsUsers

Fig. 11. Cliquishness of del.icio.us for the three dimensionsin the hypergraph. The diagram shows that the cliquishnessfor tags and resources is high – indicating that if, e.g., aresource is given certain tags and tagged by certain users,many of the possible combinations of those tags and usersare likely to occur. On the other hand, the cliquishnessfor users is considerably lower, indicating different fields ofinterest for each user.

the increasing number of neighbors over time. Theclustering coefficient decreases because less andless neighbors are connected to each other whenthe neighborhood increases. γco on the hand in-creases over time, as it becomes – with increasingneighborhoods – more likely that, for a given TAS,another user has assigned exactly the same tag tothe same resource. This indicates that, althougheach user invents new, personal tags, a form of con-sensus grows with time on the vocabulary associ-ated to the same resource, i.e. a common semanticsemerges.

Figure 11 shows that the cliquishness for tagsand resources is high – indicating that if, e.g., aresource is given certain tags and tagged by cer-tain users, many of the possible combinations ofthose tags and users are likely to occur, i. e., thereis a natural set of tags which seem appropriatefor a given resource, and vice versa, for a giventag, the users using that tag agree to a large ex-tent on which resources should be tagged with it.On the other hand, the cliquishness for users isconsiderably lower. This demonstrates that, otherthan tags and resources, users typically have dif-ferent fields of interest and thus are connected toelements of the other dimensions which will notnecessarily occur in many of the possible combina-tions.

Page 11: Network Properties of Folksonomies › hotho › pub › 2007 › aicomm...Cattuto et al. / Network Properties of Folksonomies 3 Fig. 1. BibSonomy displays bookmarks and BibTEX based

Cattuto et al. / Network Properties of Folksonomies 11

219

220

221

222

223

224

Number of Tag Assignments

2

2.2

2.4

2.6

2.8

3

3.2

3.4

3.6

3.8

4

L

ResourcesTagsUsers

220

221

222

223

224

Number of Tag Assignments

2

2.2

2.4

2.6

2.8

3

3.2

3.4

3.6

3.8

4

L

ResourcesTagsUsers

Fig. 9. Characteristic path lengths of del.icio.us in the three coocurrence graphs (left) and in the hypergraph for the threedimensions (right). The left diagram shows the characteristic path lengths of the three coocurrence graphs in one diagram.The difference reflect the different sizes of the co-occurrence graphs; the larger graphs have higher characteristic path length.The right diagram shows the different contributions of the tags, users, and resources to the del.icio.us curve in the rightdiagrams of Figures 5 and 8. For computing the values, the random nodes have been drawn only from the respective class.The low char. path lengths of the user nodes reflect the structural importance of the personomies in the hyper-graph. Thechar. path lengths of the tags and resources are reversed compared to the left diagram; indicating that tags tend to be moreindependent from the core folksonomy structure than resources.

220

221

222

223

224

Number of Tag Assignments

0

0.2

0.4

0.6

0.8

1

γ

ResourcesTagsUsers

219

220

221

222

223

224

Number of Tag Assignments

0

0.2

0.4

0.6

0.8

1

γco

ResourcesTagsUsers

Fig. 10. Clustering coefficient of del.icio.us for the three co-occurrence graphs (left) and connectedness of the hypergraph bydimension (right). Both diagrams show that the neighborhoods around tags and resources are denser than around users. Theuser curve in the left diagram decreases over time, while it increases in the right one.

5. Networks of Tag Co-occurrence

In order to investigate the emergent semanticproperties of the folksonomy, we focus on the rela-tions of co-occurrence among tags. Since the pro-cess of tagging is inclusive [9], and large overlapoften exists among resources marked with differenttags, the relations of co-occurrence among tags ex-pose the semantic aspects underling collaborativetagging, such as homonymy, synonymy, hierarchi-cal relations among tags and so on.

The simplest way to study tag co-occurrence atthe global level is to define a network of tags, where

two tags i and j are linked if there exists a postwhere they have been associated by a user with thesame resource. A link weight for two nodes i andj(6= i) can be introduced and defined as the num-ber of posts where they appear together. Formally,we define W (i, j), i. e., the set of posts where i andj appear together, as

W (i, j) := {(u, r) ∈ U ×R |

[(i, u, r) ∈ Y ] ∧ [(j, u, r) ∈ Y ]} , (3)

and define the link weight w(i, j) := |W (i, j)|. Thisis thus a weighted version of the co-occurrence

Page 12: Network Properties of Folksonomies › hotho › pub › 2007 › aicomm...Cattuto et al. / Network Properties of Folksonomies 3 Fig. 1. BibSonomy displays bookmarks and BibTEX based

12 Cattuto et al. / Network Properties of Folksonomies

graph defined in Section 4.5.The above link strength defines on T × T a

symmetric similarity matrix which is analogous tothe usual adjacency matrix in graph theory. Thestrength st of a node t is defined as [1]

si :=∑j 6=i

w(i, j) . (4)

5.1. Cumulative probability distribution of nodestrength

A first statistical characterization of the net-work of tags is afforded by the cumulative proba-bility distribution P>(s), defined as the probabil-ity of observing a strength in excess of s. Thesedistributions are displayed for del.icio.us and Bib-Sonomy in Figs. 12 and 13, respectively. This is astandard measure in complex network theory andplays the same role of the degree distribution inunweighted networks. We observe that P>(s) is afat-tailed distribution for both folksonomies: thisis related to a lack of characteristic scale for nodestrengths and is one of the typical fingerprintsof an underlying complex dynamics of interactinghuman agents [28,29]. A coarse indicator such asP>(s), despite its simplicity, is able to point outanomalous activity (i. e., spam) within the inves-tigated folksonomies, as discussed in the captionsof Figs. 12 and 13. Quite interestingly, on filteringout these undesired (and probably automaticallygenerated) contributions, the probability distribu-tions for del.icio.us and BibSonomy become rathersimilar, even though the two systems under studyare dramatically different in terms of user base,size and age.

Uncovering the detailed “microscopic” mecha-nism responsible for the observed distribution is adaunting task. A simple way to identify the con-tribution of semantics – and in general of humanactivity – to those distributions consists in de-stroying semantics altogether by randomly shuf-fling tags among TAS entries. In the tripartitegraph view of the folksonomy, this corresponds tointroducing a random permutation of the set oftags T , biunivocally mapping each tag t ∈ T intoa corresponding tag t′. Correspondingly, each hy-peredge (t, u, r) is mapped into a new hyperedge(t′, u, r). Each post in the original folksonomy cor-responds to a new post with the same number of

tags, but now the co-occurrence relations are com-pletely different.6

In Figs. 12 and 13 we show that by perform-ing this shuffling operation (blue dots) the distri-bution is only marginally affected. Far from beingobvious, this shows that the global frequencies oftags – and not their co-occurrence relations – arethe main factors shaping the distribution P>(s).In other words, the fat-tailed nature of P>(s) is in-duced by the distribution of tag frequencies, whichhas been known to be fat-tailed [9,6], in analogyto Zipf’s law (also observed in human languages).

In order to probe deeper into the structure ofthe co-occurrence network and recognize the con-tribution of semantics, we need to compute observ-ables more sensitive to correlations and to the lo-cal structure of the network. To this end, a usefulquantity studied in complex networks is the near-est neighbor connectivity. Given a node i, we de-fine its average nearest-neighbor strength as:

Snn(i) =1ki

ki∑j=1

sj , (5)

where ki denotes the number of links with non-zeroweight connected to node i, and where sj denotesthe strength of node j (see Eq. 4). The conceptof nearest neighbor needs to be clarified here. Inprinciple all nodes are connected to each other ina weighted graph, but in this particular contextwe ignore the existence of those links that haveweight zero. Consequently, we consider two nodesas nearest neighbors, iff there exists a link withnon-zero weight connecting them.

The relation of the average nearest neighborstrengths Snn(i) to the node strengths si providesinformation on correlations among the strengthof nodes and therefore is also known in litera-ture as node nearest-neighbor strength correlationSnn(s) [1]. When referred to unweighted networks,i. e., where all existing links have unit strength,Snn is able to discriminate between technologicalnetworks, where Snn and s are negatively corre-lated, and social networks, where, on the contrary,Snn(s) displays an increasing behavior. These twonetworks with opposite behaviors are commonlyreferred to as disassortative and assortative mixingnetworks, respectively [22,5].

6In difference to the permuted graph introduced in Sec-tion 4.3.1, we shuffle the tags only.

Page 13: Network Properties of Folksonomies › hotho › pub › 2007 › aicomm...Cattuto et al. / Network Properties of Folksonomies 3 Fig. 1. BibSonomy displays bookmarks and BibTEX based

Cattuto et al. / Network Properties of Folksonomies 13

Fig. 12. Cumulative strength distribution for the network of co-occurrence of tags in del.icio.us. P>(s) is the probability ofhaving a node with strength in excess of s. Red dots correspond to the whole co-occurrence network. The two steps indicatedby arrows correspond to an excess of links with a specific weight and can be related to spamming activity. Excluding fromthe analysis all posts with more than 50 tags removes the steps (green dots). Shuffling the tags contained in posts (blue dots)does not affect significantly the cumulated weight distribution. This proves that such a distribution is uniquely determinedby tag frequencies within the folksonomy, and not by the semantics of co-occurrence.

Fig. 13. Cumulative strength distribution P>(s) for the network of co-occurrence of tags in BibSonomy (see also Fig. 12). Reddots correspond to the whole co-occurrence network. The irregular behavior for high strengths can be linked to spammingactivity: identified spam in BibSonomy consists of posts with a large number of tags, as well as a large number of posts withexactly 10 tags, injected by a small group of spammers. Both types of spam were identified by inspecting the distributionof the number of tags per post. Excluding the above posts from the analysis (green dots), the distribution becomes smoothand similar to the filtered one observed for del.icio.us. Similarly, shuffling the tags contained in posts (blue dots) has a smalleffect on the cumulated weight distribution.

Page 14: Network Properties of Folksonomies › hotho › pub › 2007 › aicomm...Cattuto et al. / Network Properties of Folksonomies 3 Fig. 1. BibSonomy displays bookmarks and BibTEX based

14 Cattuto et al. / Network Properties of Folksonomies

Fig. 14. Average nearest-neighbor strength Snn of nodes (tags) in relation to the node (tag) strengths s, in del.icio.us. Reddots correspond to the whole co-occurrence network. Assortative behavior is observed for low values of the strength s, whiledisassortative behavior is visible for high values of s. A few clusters (indicated by arrows) stand out from the main cloudof data points. As in Fig. 12, such anomalies correspond to spamming activity and can be removed by filtering out postscontaining an excessive number of tags (green dots). In this case, shuffling the tags (blue points) affects dramatically thedistribution of data points: this happens because the average nearest-neighbor strength of nodes is able to probe the localstructure of the network of co-occurrence beyond the pure frequency effects, and is sensitive to patterns of co-occurrenceinduced by semantics.

Fig. 15. Average nearest-neighbor strength Snn of nodes (tags) in relation to the node (tag) strengths s, in BibSonomy.Red dots correspond to the whole co-occurrence network. The scatter plot is qualitatively very similar to the one reportedin Fig. 14 for del.icio.us: assortative behavior is observed for low values of the strength s, while disassortative behavior isvisible for high values of s. Again, a few clusters (indicated by arrows) stand out from the main cloud of data points andtheir presence can be linked to spamming activity. They disappear when we filter out posts containing an excessive numberof tags (green dots). Shuffling the tags (blue dots) has the same effect as in Fig. 14, and the same observations apply.

Page 15: Network Properties of Folksonomies › hotho › pub › 2007 › aicomm...Cattuto et al. / Network Properties of Folksonomies 3 Fig. 1. BibSonomy displays bookmarks and BibTEX based

Cattuto et al. / Network Properties of Folksonomies 15

Figs. 14 and 15 display our results for del.icio.usand BibSonomy, respectively. In the figures, eachdot corresponds to a node of the network (i. e., atag), with its strength s as the abscissa and the av-erage strength of its neighbors Snn as the ordinate.Both quantities span several orders of magnitude,hence we use a logarithmic scale along both axesto display the global features of the scatter plot.This is related to the fat-tailed behavior observedfor the strength distribution P>(s), which is in factrecovered by projecting the data points along thes-axis and computing the cumulative distribution.

The broad range of the points in the scatterplots can also be studied considering the distribu-tion of Snn values corresponding to a given fixed svalue. In Figure 16, we show two examples for thecase of the del.icio.us data plotted in Figure 14.Incidentally, some interesting features, not simplyrelated to the strength distribution, can be ob-served in the distribution shapes (more details inthe figure caption).

In the scatter plots, the anomalous activity suchas spam is more clearly detectable, and its con-tribution appears in the form of foreign clusters(indicated by arrows) that clearly stand out fromthe otherwise smooth cloud of data points, a factthat reflects the anomalous nature of their connec-tions with the rest of the network. Excluding spamfrom the analysis, those clusters disappear alto-gether (green dots). The general shape of the cloudof data points remains unchanged, even though,in the case of BibSonomy, it shifts down towardslower strengths. This happens because BibSonomyis a smaller system and spam removal has a moresignificant global impact on the network and thestrengths of its nodes.

Overall, the plots for del.icio.us and BibSon-omy look quite similar, and this suggests that thefeatures we report here are generally representa-tive of collaborative tagging systems. An assor-tative region (Snn roughly increasing with s) isobserved for low values of the strength s, whiledisassortative behavior (Snn decreasing with s) isvisible for high values of s. As we have alreadydone for the probability distribution P>(s), wecan highlight the contribution of semantics by ran-domly shuffling tags in TAS entries (blue dots inFig. 14 and 15). In this case, shuffling the tags(blue points) affects dramatically the distributionof data points: this happens because the aver-age nearest-neighbor strength of nodes is able to

probe the local structure of the network of co-occurrence beyond the pure frequency effects, andis sensitive to patterns of co-occurrence inducedby semantics. Interestingly, the main effect seemsto be the disappearance of points in the assor-tative (low strengh) region of the plot, possiblyidentifying this region as the one exposing se-mantically relevant connections between tags. No-tice, for example, the disappearance of a wholecloud of points at the top-left of Fig. 15: thosepoints represent nodes (tags) with low strengththat are attached preferentially to nodes of highstrength. Similarly, in Fig. 14, the highly popu-lated region with s roughly ranging between 10 anda few thousands also disappears when tag shuf-fling is applied. Those data points also representlow-strength nodes (tags) preferentially connectedwith higher-strength nodes (tags). Such propertiesare commonly found in hierarchically organizednetworks, and could be related to an underlyinghierarchical organization of tags [10].

5.2. Spam detection and characterization

From a semantic point of view, spam contami-nates the system. Therefore the removal of spam isof primary importance if the system is to be stud-ied from a semantic point of view. A crude wayto remove spam is to cut off all posts that containmore than a certain arbitrary (but large) numberof TAS, i. e., number of associated tags. The draw-back of this fast method is that also semantic valu-able posts with large number of tags are filteredout. An elegant alternative way to proceed in spamremoval is to consider the structure of the quan-tity Snn(si). As already mentioned, most spam inthe system is easily spotted by looking at the scat-tered plot in Fig. 14. In that picture, three discretefeatures can be ascribed to spam. In order to bet-ter understand the structure of spam, we plot thesame scatter plot of Fig. 14, but raising the singlelink weights to a power γ ∈ {1.0, 0.8, 0.2}. The cor-responding scatter plots are shown in Fig. 17. Thevalue γ = 1 trivially reproduces the plot of Fig. 14,while the value γ = 0 (not shown) would repro-duce the unweighted analogous topological quan-tity of Knn(k). As the value of γ decreases, thespam spots displays two different behaviors. Thespots indicated with number 1 and 2 remain onthe Snn = s line, while spot 3 merges and dis-appears into the main cloud. This suggests that

Page 16: Network Properties of Folksonomies › hotho › pub › 2007 › aicomm...Cattuto et al. / Network Properties of Folksonomies 3 Fig. 1. BibSonomy displays bookmarks and BibTEX based

16 Cattuto et al. / Network Properties of Folksonomies

100

102

104

106

Snn

100

101

102

103

104

100

102

104

106

Snn

100

101

102

103

rawfilteredfiltered + shuffle

s ~ 1000

s = 10

Fig. 16. Histograms of the average nearest-neighbor strength Snn of nodes having a value of node strength s, in del.icio.us.The data are extracted from the scatter plot of Fig. 14 for two values of the strength, i. e., s = 10 (upper panel) and s = 1000(lower panel). The two values of the strength are chosen in order to show more clearly the whole distribution of scatteredpoints in two qualitatively different regions of Fig. 14. The two regions are only slightly affected by spam. In the upper panel(s = 10) a broad distribution appears, characterized by a most probable high value of Snn and a broad algebraic tail forsmaller values. Interestingly, shuffling the tags dramatically changes the distribution and the tail disappears, indicating thatthe tail behavior is not a trivial frequency effect. In the lower panel (s = 1000) the tail is narrower, but the shuffling proceduregives qualitatively the same result.

spots 1 and 2 are posts made of a large numberof TAS with quite uncommon tags in the system,such that these posts can be considered as cliquesrather decoupled from the system. In fact, an iso-lated clique would lie always on the Snn = s andKnn = k lines. The comet like shape of those spotsis due to the presence of more common tags thatare present in the system. Spot number 3 is in factstrictly bound to the rest of the system; this is thereason why it disappears as the value of γ tends to0. A close inspection of spots 1, 2 and 3 is providedin the caption of Fig. 17.

5.3. Strength as a function of degree

We could ask to which extent the strength of anode in the network is important with respect tothe topological degree of a node. We recall thatthe strength of a node is the sum of the weightsof the link connecting it to the network, i. e., thenumber of its total co-occurrences, while the de-gree of a node is the number of different nodes co-occurring with it. The quantity that can be ana-lyzed is si(ki), i. e., the relation of the strengths

of the nodes to their topological degrees. The re-lation between strength and degree has been al-ready studied in the literature for some weightednetworks and dependencies of the type s(k) ≈ kand s(k) ≈ k1.5 have been measured in the airtransportation network and network of article co-authorship, respectively [1].

The corresponding scatter plot is shown inFig. 18. A linear behavior of the relation(ki, si)i is a tag would indicate that these two sta-tistical quantities are equivalent. Rare tags thatappear only once (i. e., in one post) in the wholesystem must have s(k) = k by definition, whilewe expect the strength to become more importantfor the most frequent tags, which occur in a lot ofposts. An inspection of Fig. 18 shows three differ-ent regimes: for very low degrees s(k) ≈ k, then anextended crossover region followed by an asymp-totic regime above k ≈ 500 featuring a pronouncedpower s(k) & k1.5. In Fig. 18 we also show the s(k)behavior in the case of the shuffled system, as al-ready described in the previous section. From thelatter it is clear that the regimes of s(k) ≈ k ands(k) ≈ k1.5 are to be ascribed to mere frequency

Page 17: Network Properties of Folksonomies › hotho › pub › 2007 › aicomm...Cattuto et al. / Network Properties of Folksonomies 3 Fig. 1. BibSonomy displays bookmarks and BibTEX based

Cattuto et al. / Network Properties of Folksonomies 17

Fig. 17. Average nearest-neighbor strength Snn of nodes (tags) as a function of the node (tag) strength s, in del.icio.us. HereSnn is computed on a network obtained from the one of Fig. 14 by raising the weights of all edges to the power γ. The caseγ = 1 corresponds exactly to the data shows in Fig. 14 (black dots). On decreasing the value of γ (red and cyan dots) onecan progressively lower the importance of weights, making contact with the unweighted network for γ = 0 (all weights equalto 1). This analysis allows us to further probe the nature of the spam activity corresponding to the discrete clusters visiblein figure (labeled as 1, 2 and 3). Clusters 1 and 2 lie along the Snn = s line (blue) and changes in the value of γ don’taffect their position. This means that weights play no role in their presence, i. e., clusters 1 and 2 correspond to structureswhose edges have all the same weights. Isolated cliques have this property, and indeed direct inspection of the data showsthat cluster 1 corresponds to a single post with over 2000 tags. Since those tags are not common in the rest of the system,this large post induces a clique-like structure in the Snn vs s plot. The change in shape that occurs when γ is varied is dueto the weak linking of the clique-like structure to the rest of the tag network. Cluster 2, similarly, corresponds to a set postswith about 100 tags each. For this kind of spam, the tags involved have low global frequency, and the overlap between tagsbelonging to different posts is low, so that each post induces its own clique-like structure. Cluster 3, conversely, correspondsto a set of posts using common tags. This creates a structure that is better linked to the main component of the network.Because of this, as γ is changed, cluster 3 tracks the displacement of the main cloud visible in the plot. For this kind of spam,the overlap between posts is almost total, and this is responsible for high weights of the corresponding links, i. e., its raisedposition with respect to the line Snn = s.

effects. The crossover region, which seems to fea-ture an exponent slightly larger than 1, is not yetunderstood and deserves further studies.

6. Summary and Outlook

6.1. Conclusion

In this paper, we have analyzed the networkstructure of the folksonomies of two social resourcesharing systems, del.icio.us and BibSonomy. Weobserved that the tripartite hypergraphs of theirfolksonomies are highly connected and that the rel-ative path lengths are relatively low, facilitatingthus the “serendipitous discovery” of interestingcontents and users.

We subsequently introduced a weighted net-work of tags where link strengths are based on

the frequencies of tag co-occurrence, and studiedthe weight distributions and connectivity correla-tions among nodes in this network. Our evidenceis compatible with the existence of complex, pos-sibly hierarchical structures in the network of tagco-occurrence. Our analysis and experiments evi-dence the statistical signature of the emergence ofa shared semantics in the metadata system, anar-chically negotiated by users.

Our experiments hint that spam – which be-comes an increasing nuisance in social resourcesharing systems – systematically shows up in theconnectivity correlation properties of the weightedtag co-occurrence network.

6.2. Future Work

(Semi-)automatic Spam Detection. At the mo-ment, spam handling in BibSonomy is mostly done

Page 18: Network Properties of Folksonomies › hotho › pub › 2007 › aicomm...Cattuto et al. / Network Properties of Folksonomies 3 Fig. 1. BibSonomy displays bookmarks and BibTEX based

18 Cattuto et al. / Network Properties of Folksonomies

100

101

102

103

104

105

k

100

102

104

106

s(k)

filtered + shuffledfiltered

~ k

~ k3/2

Fig. 18. Node strength S as a function of node degree k in the co-occurrence network of tags, for del.icio.us. The underlyingnetwork data are the same of Fig. 14. A two-slope regime is visible for the spam-filtered data (red dots) as well as for theshuffled data (blue dots). In the latter case, a linear regime is visible for low values of k, while a different power-law regimeis visible for high k (solid lines).

by manual inspection and removal of offendingcontent.

In a follow-up paper, we will turn our observa-tions about spamming anomalies in the connectiv-ity of tags into a spam detection mechanism forfolksonomies. Using the techniques from Section 5,support for the administrators can be provided todetect spamming activities.

Identification of Communities. As the resultsfrom Section 4 suggest that the folksonomy con-sists of densely-connected communities, a secondline of research that we are currently pursuing andthat will benefit from the observations in this pa-per is the detection of communities.

This can be used, for example, to make thosecommunities explicit which already exist intrinsi-cally in a folksonomy, e. g. to provide user rec-ommendations and support new users in browsingand exploring the system.

Acknowledgement This research has been partlysupported by the TAGora project funded by theFuture and Emerging Technologies program (IST-FET) of the European Commission under the EURD contract IST-034721. The information pro-vided is the sole responsibility of the authors anddoes not reflect the Commission’s opinion. TheCommission is not responsible for any use thatmay be made of data appearing in this publication.

References

[1] A. Barrat, M. Barthelemy, R. Pastor-Satorras, andA. Vespignani. The architecture of complex weightednetworks. PNAS, 101(11):3747–3752, 2004.

[2] Marc Barthelemy, Alain Barrat, Romualdo Pastor-Satorras, and Alessandro Vespignani. Velocity and hi-erarchical spread of epidemic outbreaks in scale-freenetworks. Physical Review Letters, 92:178701, 2004.

[3] B. Bollobas. Random Graphs. Cambridge UniversityPress, 2001.

[4] A. Capocci, V. D. P. Servedio, F. Colaiori, L. S. Bu-riol, D. Donato, S. Leonardi, and G. Caldarelli. Pref-erential attachment in the growth of social networks:The internet encyclopedia wikipedia. Phys. Rev. E,74:036116, 2006.

[5] Andrea Capocci and Francesca Colaiori. Mixing prop-erties of growing networks and the simpson’s paradox.Phys. Rev. E, 74:026122, 2006.

[6] Ciro Cattuto, Vittorio Loreto, and Luciano Pietronero.Semiotic dynamics and collaborative tagging. Proceed-ings of the National Academy of Sciences (PNAS),104:1461–1464, 2007.

[7] M. Dubinko, R. Kumar, J. Magnani, J. Novak,P. Raghavan, and A. Tomkins. Visualizing tags overtime. In Proceedings of the 15th International WWWConference, May 2006.

[8] B. Ganter and R. Wille. Formal Concept Analysis:Mathematical foundations. Springer, 1999.

[9] Scott Golder and Bernardo A. Huberman. The struc-ture of collaborative tagging systems. Journal of In-formation Science, 32(2):198–208, April 2006.

Page 19: Network Properties of Folksonomies › hotho › pub › 2007 › aicomm...Cattuto et al. / Network Properties of Folksonomies 3 Fig. 1. BibSonomy displays bookmarks and BibTEX based

Cattuto et al. / Network Properties of Folksonomies 19

[10] Paul Heymann and Hector Garcia-Molina. Collabo-rative creation of communal hierarchical taxonomiesin social tagging systems. Technical Report 2006-10,Computer Science Department, April 2006.

[11] Andreas Hotho, Robert Jaschke, Christoph Schmitz,and Gerd Stumme. BibSonomy: A social bookmarkand publication sharing system. In Aldo de Moor, Si-mon Polovina, and Harry Delugach, editors, Proceed-ings of the Conceptual Structures Tool Interoperabil-ity Workshop at the 14th International Conference onConceptual Structures, Aalborg, Denmark, July 2006.Aalborg University Press.

[12] Andreas Hotho, Robert Jaschke, Christoph Schmitz,and Gerd Stumme. Information retrieval in folk-sonomies: Search and ranking. In Proceedings of the3rd European Semantic Web Conference, Budva, Mon-tenegro, June 2006.

[13] Andreas Hotho, Robert Jaschke, Christoph Schmitz,and Gerd Stumme. Trend detection in folksonomies.In Prof. First International Conference on Seman-tics And Digital Media Technology (SAMT), Athens,Greece, dec 2006.

[14] Robert Jaschke, Andreas Hotho, Christoph Schmitz,Bernhard Ganter, and Gerd Stumme. Trias - an al-gorithm for mining iceberg tri-lattices. Hong Kong,December 2006. (to appear).

[15] Donald E. Knuth. The Art of Computer Programming,Volume II: Seminumerical Algorithms, 2nd Edition.Addison-Wesley, 1981.

[16] Leslie Lamport. LaTeX: A Document Preparation Sys-tem. Addison-Wesley, 1986.

[17] F. Lehmann and R. Wille. A triadic approach to formalconcept analysis. In G. Ellis, R. Levinson, W. Rich,and J. F. Sowa, editors, Conceptual Structures: Ap-plications, Implementation and Theory, volume 954of Lecture Notes in Computer Science, pages 32–43.Springer, 1995.

[18] Pedro G. Lind, Marta C. Gonzalez, and Hans J. Her-rmann. Cycles and clustering in bipartite networks.Phys. Rev. E, 72(5), nov 2005.

[19] Adam Mathes. Folksonomies – Coop-erative Classification and CommunicationThrough Shared Metadata, December 2004.http://www.adammathes.com/academic/computer-mediated-communication/folksonomies.html.

[20] Peter Mika. Ontologies Are Us: A Unified Model of So-cial Networks and Semantics. In Yolanda Gil, EnricoMotta, V. Richard Benjamins, and Mark A. Musen,editors, ISWC 2005, volume 3729 of LNCS, pages522–536, Berlin Heidelberg, November 2005. Springer-Verlag.

[21] Stanley Milgram. The small world problem. Psychol-ogy Today, 67(1):61–67, 1967.

[22] M. E. J. Newman. Assortative mixing in networks.Physical Review Letters, 89:208701, 2002.

[23] Mark Newman, Albert-Laszlo Barabasi, and Duncan J.Watts, editors. The Structure and Dynamics of Net-works. Princeton University Press, Princeton, NJ,USA, 2006.

[24] Oren Patashnik. BibTeXing, 1988. (Included in theBibTEX distribution).

[25] Thomas Schank and Dorothea Wagner. Approximat-ing clustering coefficient and transitivity. Journalof Graph Algorithms and Applications, 9(2):265–275,2005.

[26] Christoph Schmitz, Andreas Hotho, Robert Jaschke,and Gerd Stumme. Mining association rules in folk-sonomies. In Data Science and Classification: Proc. ofthe 10th IFCS Conf., Ljubljana, Slovenia, July 2006.

[27] Patrick Schmitz. Inducing ontology from Flickrtags. In Collaborative Web Tagging Workshop atWWW2006, Edinburgh, Scotland, May 2006.

[28] A. Vazquez, J. Gama Oliveira, Z. Dezso, K. I. Goh,I. Kondor, and A. L. Barabasi. Modeling bursts andheavy tails in human dynamics. Physical Review E,73:036127, 2006.

[29] Alexei Vazquez. Exact results for the Barabasimodel of human dynamics. Physical Review Letters,95:248701, 2005.

[30] Duncan J. Watts. Small Worlds – The Dynamics ofNetworks between Order and Randomneess. PrincetonUniversity Press, Princeton, New Jersey, 1999.

[31] Duncan J. Watts and Steven Strogatz. Collective dy-namics of ’small-world’ networks. Nature, 393:440–442,June 1998.