Semantic Web Wiki: Social Network Analysis of Page Editing

Nenad Tomašev, Dunja Mladenić

Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia e-mail: [email protected], [email protected]

Abstract. This paper presents experiments in applying social network analysis to data about the editing of a semantic media wiki. As a number of users edit the same wiki pages, one can view them as a social network of people interacting via wiki pages. We propose representing the wiki editing log files as a graph, either of users that are connected if they edit the same pages or of pages that are connected if they are edited by the same users. We apply social network analysis to such graphs in order to provide some insights into the activity of the editors of semantic wiki pages.

Keywords. Social network analysis, wiki, semantic web, collaborative editing.

1. Introduction

Social network analysis [3] enables analysis of connections between actors in a social network. Strictly speaking, the connections between the actors are based on some social interactions, but they can also be of a different nature.

In this paper we address the problem of indirect interactions between users who edit the same Wiki pages. The available data is given in change logs that record the history of changes to Wiki pages. These change logs are publicly available from the Wiki [2] under the changes history.

Our goal is to provide some insights into the changes of wiki pages by presenting the change logs as two orthogonal graphs. The first is a graph of users that are connected if they have changed the same page. The second is a graph of pages that are connected if the same user has changed them. In addition, we define a bipartite graph connecting users and Wiki pages. Even though the data we are using can be seen as logs of Web pages as used in Web mining [5], we approach it from a different perspective. The resulting visual representations of the change logs clearly show several dimensions of the users’ activity, including the most active users, the grouping of users based on editing the same pages, the most central users and the most frequently edited pages.

The rest of this paper is structured as follows. Section 2 describes the data and its preprocessing, Section 3 provides the results of social network analysis on the described data, and Section 4 concludes with some discussion.

2. Data Preprocessing

The change log of the semantic Wiki pages, as of the end of 2008, was given in XML form as a sequence of revisions for each page in the Wiki. Each revision contains a timestamp and either a username with an internal ID code or the IP of the person performing the page revision. Consecutive changes in page content were also included in the file, but we discarded them in the preprocessing phase as they are not used in our further analysis. There was a total of 36,078 page edits. Most of those edits (75.5%) were made by registered users. We identified 617 registered users and 2,512 different IPs of anonymous (non-registered) users. For the purposes of further analysis, each unique IP was considered a separate user, giving us a total of 3,129 users. We also merged some user IDs during data cleaning, since some people used several usernames when logging in. For instance, MaxVölkel and Max Völkel most likely refer to the same person, as name collisions of this sort are unlikely in such a small user community.
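The extraction step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names mimic a MediaWiki-style export and may differ from the actual file, the sample data is invented, and `normalize_user` is a hypothetical merge rule (ignore whitespace and letter case), not the cleaning procedure actually used.

```python
import xml.etree.ElementTree as ET

# Toy change log. Element names follow a MediaWiki-style export (an assumption).
SAMPLE_XML = """
<mediawiki>
  <page>
    <title>Sandbox</title>
    <revision>
      <timestamp>2008-11-02T10:15:00Z</timestamp>
      <contributor><username>Max Völkel</username><id>7</id></contributor>
    </revision>
    <revision>
      <timestamp>2008-11-03T08:00:00Z</timestamp>
      <contributor><username>MaxVölkel</username><id>19</id></contributor>
    </revision>
    <revision>
      <timestamp>2008-11-04T12:30:00Z</timestamp>
      <contributor><ip>192.0.2.17</ip></contributor>
    </revision>
  </page>
</mediawiki>
"""

def normalize_user(name):
    """Hypothetical merge rule: ignore whitespace and letter case."""
    return "".join(name.split()).casefold()

def extract_edits(xml_text):
    """Return (page title, user key, is_registered) for every revision."""
    edits = []
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        title = page.findtext("title")
        for rev in page.iter("revision"):
            username = rev.findtext("contributor/username")
            ip = rev.findtext("contributor/ip")
            if username is not None:
                edits.append((title, normalize_user(username), True))
            elif ip is not None:
                # Each distinct IP is treated as a separate anonymous user.
                edits.append((title, ip, False))
    return edits

edits = extract_edits(SAMPLE_XML)
users = {user for _, user, _ in edits}
registered_share = sum(reg for _, _, reg in edits) / len(edits)
```

Page content is never read here, which mirrors the decision to discard it during preprocessing.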

The most “collaborative” users can be seen in Figure 1. By collaborative we mean that they frequently (at least 5 times) edit the same pages as (at least 3) other users. In our visualization, the size of the circle reflects the number of changes the user has made on all the pages, while the color shows whether the user is registered (red) or anonymous (blue). We can see that the most active users are registered (the top right part of Figure 1) and that a few of them are exceptionally active compared to the other users (e.g., the most active user has made 6,980 page edits, the second most active 2,982, while the 10th most active made 589 page edits).

Figure 1: A subset of users that have edited the same pages as at least three other users (min. vertex degree = 3) and share at least five pages with them (min. edge weight = 5).
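The two thresholds in the caption can be read as a filter on a weighted user graph. The sketch below is one possible interpretation, not the procedure used in Pajek: it assumes that edges are thresholded first and that vertex degree is then counted once in the thresholded graph. The function name and the toy weights are illustrative.

```python
def collaborative_subgraph(edge_weights, min_weight=5, min_degree=3):
    """Keep edges with weight >= min_weight, then nodes with degree >= min_degree.

    edge_weights maps frozenset({u, v}) -> number of shared pages.
    The degree filter is applied once, to the thresholded graph (an assumption).
    """
    strong = {e: w for e, w in edge_weights.items() if w >= min_weight}
    degree = {}
    for edge in strong:
        for node in edge:
            degree[node] = degree.get(node, 0) + 1
    keep = {n for n, d in degree.items() if d >= min_degree}
    # Keep only edges whose both endpoints survived the degree filter.
    kept_edges = {e: w for e, w in strong.items() if e <= keep}
    return keep, kept_edges

# Invented toy data: four users who all share five pages, plus a weak link to E.
toy = {frozenset(pair): 5 for pair in
       [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")]}
toy[frozenset(("A", "E"))] = 2
keep, kept_edges = collaborative_subgraph(toy)
```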

After a brief inspection, we determined that there were many very similar pages in the Wiki, for instance: WikiSpammer, WikiSpammers, Talk:WikiSpammer. Since the same user, or a group of users with similar knowledge and skills, is likely to edit all of such very similar pages, we deemed it useful to first group such pages into one concept and form an aggregate concept representation of the data.

When combining the pages we used additional information that we have on the data, namely that the pages represent concepts of an ontology on the semantic web, represented as a Semantic Wiki. However, in the presented work no information about the structure of the original concept ontology was used. The concepts were grouped based on the syntactic similarity of their names, i.e. page titles, and the shortest title was used to label the aggregated concept. Short words that were too common in the concept names were added to an exception list in order to keep the maximum group size reasonable. Some of these words include: Property, Talk, Category, Template, Concept, Paper, User Talk, Wiki, File, Semantic Web, etc. The number of concepts was thus reduced from the original 7,369 to 5,500.
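The text does not specify the similarity measure, so the following is only a sketch of one way such grouping could work. The key function `title_key` (lowercased tokens, exception words removed, a trailing plural "s" stripped) is a hypothetical stand-in for the actual syntactic similarity criterion, and the exception set is a subset of the words listed above.

```python
import re
from collections import defaultdict

# Subset of the exception list from the text, lowercased for matching.
EXCEPTION_WORDS = {"property", "talk", "category", "template", "concept",
                   "paper", "user", "wiki", "file", "semantic", "web"}

def title_key(title):
    """Hypothetical similarity key for a page title."""
    tokens = re.findall(r"\w+", title.lower())
    kept = [t for t in tokens if t not in EXCEPTION_WORDS]
    # Crude plural handling: drop a trailing "s" from longer tokens.
    kept = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in kept]
    # Fall back to the whole title when every token is an exception word.
    return tuple(kept) if kept else (title.lower(),)

def group_titles(titles):
    """Map each group label (the shortest title in the group) to its members."""
    buckets = defaultdict(list)
    for title in titles:
        buckets[title_key(title)].append(title)
    return {min(members, key=len): sorted(members) for members in buckets.values()}

titles = ["WikiSpammer", "WikiSpammers", "Talk:WikiSpammer", "Sandbox", "Help"]
groups = group_titles(titles)
```

With this key the three WikiSpammer pages from the example above fall into one group labeled by the shortest title, while Sandbox and Help stay on their own.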

3. Searching for relations

After the preprocessing, three different graphs were extracted from the data: a graph of users, a graph of concepts and a combined bipartite graph of concepts and users. These were given as input to the social network analysis system Pajek [1], which was used here for data analysis, including the graph visualizations provided in Figure 1 and Figure 2. The system includes several social network analysis methods and is one of the commonly used systems in research on graph visualization and social network analysis, for instance in the analysis of co-authorship of research publications [4].

3.1. User Graph

As noted in Section 2, the initial dilemma in mapping the users declared in the XML schema to graph nodes in the network model of the community concerned anonymous revisions. Such revisions accounted for 24.5% of the total number of page edits, so simply disregarding them would not have been beneficial. Even though the average number of revisions per anonymous user was only 3.5, the maximum was 251, which is not negligible. The overall most active user was Patrick, who performed 6,980 page edits in the Wiki, accounting for 19.3% of the total.

The user graph was formed by connecting the users who contributed to the same aggregated concepts, i.e. who had edited pages belonging to the same group of pages. Each edge was assigned a weight corresponding to the number of different concept groups edited by both of its users.
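The construction just described amounts to counting, for every pair of users, the concept groups they have in common. A minimal sketch follows, assuming the edits have already been reduced to a mapping from each aggregated concept to its set of editors. The function name and the toy mapping are illustrative.

```python
from collections import Counter
from itertools import combinations

def build_user_graph(concept_editors):
    """concept_editors maps an aggregated concept -> set of users who edited it.

    Returns a Counter mapping frozenset({u, v}) -> number of shared concepts.
    """
    weights = Counter()
    for editors in concept_editors.values():
        # Every pair of editors of the same concept gains one unit of weight.
        for u, v in combinations(sorted(editors), 2):
            weights[frozenset((u, v))] += 1
    return weights

# Invented toy data using user and concept names that appear in the text.
concept_editors = {
    "Help": {"Patrick", "Denny", "Skierpage"},
    "Sandbox": {"Patrick", "Denny"},
    "Ontology learning": {"Dmanzano"},  # single editor, contributes no edge
}
user_graph = build_user_graph(concept_editors)
```

Swapping the roles of users and concepts in the input mapping yields a weighted concept graph in the same way.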

The user graph has 86,401 edges, which is less than 1% of the maximum possible number of edges. Hence, it can be considered a sparse graph. Only about 5% of these edges have a weight of more than one. About 9.7% of the nodes in the graph are isolated: the pages edited by these users were not edited by any other user. Most of them were pages about the users themselves (so most of these users did not participate in building the wiki beyond making an entry about themselves or their coworkers). However, there were also some exceptions, such as the concepts Ontology learning, RELAX SEO Services and Category:Czech person. These were only edited by the users Dmanzano, Webmissile and Tom, respectively.

In the user graph there are 30 connected components having at least two nodes, but the biggest component comprises most of the users (84%). Average distance among reachable pairs in the graph is 2.23.
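To make these statistics concrete, the sketch below computes connected components and the average distance among reachable pairs with a breadth-first search from every node. It is an illustration on invented toy data, not the routine Pajek uses, and running one search per node would be slow on a large graph.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def components_and_avg_distance(adj):
    """Return (list of components, mean distance over reachable pairs)."""
    seen, components = set(), []
    total, pairs = 0, 0
    for node in adj:
        dist = bfs_distances(adj, node)
        total += sum(dist.values())
        pairs += len(dist) - 1  # reachable nodes other than the source
        if node not in seen:
            components.append(set(dist))
            seen |= set(dist)
    avg = total / pairs if pairs else 0.0
    return components, avg

adj = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B"},  # path A-B-C
    "D": {"E"}, "E": {"D"},                    # separate pair
    "F": set(),                                # isolated user
}
components, avg_distance = components_and_avg_distance(adj)
nontrivial = [c for c in components if len(c) >= 2]
```

Each unordered pair is visited twice (once from each endpoint), which leaves the mean unchanged.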

Figure 2: Graph of the most collaborative users (min. edge weight = 4, min. vertex degree = 3) showing an isolated group of four anonymous users actively working on a group of 9 wiki pages (top right).

Centrality was calculated for all the users and is presented in Figure 3. It is immediately apparent that there is a small number of very active users in the center of the network and many less central users.
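For readers unfamiliar with these measures, here is a small self-contained sketch of degree centrality and betweenness on an unweighted, undirected graph. The betweenness routine follows Brandes' algorithm and returns unnormalized scores. It is an illustration only and makes no claim about the exact variants or normalizations Pajek applies. The star-shaped toy graph is invented.

```python
from collections import deque

def degree_centrality(adj):
    """Degree divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    return {v: len(nbs) / (n - 1) for v, nbs in adj.items()}

def betweenness(adj):
    """Brandes' algorithm for unweighted, undirected graphs (unnormalized)."""
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = dict.fromkeys(adj, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted from both endpoints.
    return {v: b / 2 for v, b in score.items()}

# Star graph: one hub connected to three otherwise unconnected users.
adj = {"H": {"A", "B", "C"}, "A": {"H"}, "B": {"H"}, "C": {"H"}}
dc = degree_centrality(adj)
bc = betweenness(adj)
```

In the star graph the hub lies on the shortest path between every pair of leaves, so it has the highest score on both measures.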

Figure 3: Distribution of degree centrality among the Semantic Web Wiki users.

Table 1: Top five users according to some importance criteria from social network analysis.

Rank  Number of revisions  Degree centrality  Betweenness
1     Patrick              Markus Krötzsch    Markus Krötzsch
2     Markus Krötzsch      Patrick            Patrick
3     Denny                MovGPO             Denny
4     Skierpage            Denny              MovGPO
5     Joris Gillis         Knud               WikiSym

The betweenness centralization achieved in the user network is 0.15. It is interesting to compare the top five users according to these different criteria. Such a comparison is given in Table 1. The users Patrick, Markus Krötzsch and Denny occur at the top of all three lists, with some difference in their ranking. As can be seen in Table 1, while Patrick has the highest number of revisions, Markus Krötzsch has the highest degree centrality and betweenness. Even though these user lists (not just the top 5) are somewhat similar, they do show that the activity of a contributor does not directly reflect his/her collaborative potential.

Figure 4: Concept graph showing only those concepts having at least 10 mutual revising users with some other concepts.

3.2. Concept Graph

In the same way in which we have observed the similarities between various users, it is possible to determine similar concepts, or rather, in this case, similar groups of concepts. Hence, a weighted concept graph was constructed, with each aggregate concept represented as a node and edge weights corresponding to the number of common users that have edited concepts belonging to both concept groups.

Even though the approach is the same as in the case of user activity analysis, the concept graph displays some different properties. Unlike the user graph, the concept graph is dense. It has 5,600 vertices but approximately 2.5 million edges, which is 15.95% of the number of edges in a complete graph with the same number of vertices. The average distance between reachable pairs is thus smaller than in the user graph, though the diameter of the largest component is the same, six. There are 47 connected components that contain at least two vertices, but the largest component comprises 91.6% of the total number of vertices.
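The quoted density can be checked with a short calculation. The sketch assumes a simple undirected graph, so the maximum number of edges is n(n - 1)/2, and it uses the rounded edge count given in the text, so the result is approximate as well.

```python
def density(num_vertices, num_edges):
    """Share of realized edges in a simple undirected graph."""
    max_edges = num_vertices * (num_vertices - 1) // 2
    return num_edges / max_edges

# Figures quoted in the text. The edge count is only approximate there.
concept_density = density(5600, 2_500_000)
print(f"{concept_density:.2%}")
```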

By observing strongly related concepts, it becomes apparent that there are not many concepts that are both edited by a larger group of users. After making a cut in the graph (see Figure 4) by removing all those aggregate concepts that do not have at least 10 common revisers, only 35 concepts remain, most of which are fairly general (such as Math, Time, Help, Relation, Programming, Germany, Sandbox). The concept Sandbox, for instance, represents a page pointing to a test server where users can test their queries and the appearance of their content prior to adding it to the main wiki. This might indicate a lack of cooperation between larger user groups, even though we have already seen that there are pairs of users exhibiting great similarity in their wiki editing effort. On the other hand, it might also be a sign of good partitioning of work, because there is no need for many people to participate in editing the same page, under the assumption that the editing is done by domain experts who do not need others to correct their contributions to the wiki.

Another approach to searching for strong relations between concepts is to determine line islands in the concept graph. A line island is a subgraph induced by a set of vertices such that the relations (edge weights) between the vertices in the island are stronger than the ones between those vertices and the rest of the vertices in the original graph. As it turns out, the graph of Semantic Web wiki aggregate concepts has 9 such islands, comprising 1.2% of the entire graph. The three most interesting islands are shown in Figure 5. In the bottom left corner one can see the following concepts: Category:AlkaliEarthMetalsGroup, Category:TransitionMetalsGroup, Category:Actinides, Category:NonMetalsGroup, Category:AlkaliMetalsGroup and a few more chemistry-related concepts. They form an island of their own, though they are apparently connected with the largest island, composed of the most common aggregate concepts: Help, Semantic wiki, Media wiki, Main page, Sandbox, SPARQL, Template:Person, etc. On the right side are five isolated concepts related to KWTR (Knowledge Web Technology Roadmap), more specifically: KWTR:web data integration, KWTR:Workflow systems, KWTR:web information system architecture, KWTR:web personalization, KWTR:Ontologies and Ontology languages.
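The island idea can be illustrated with a simplified stand-in: remove all edges at or below a weight threshold and take the connected components of what remains. Inside each component, vertices are linked through edges heavier than the threshold, while every edge leaving it is at most as heavy as the threshold. This is not the islands algorithm implemented in Pajek, which also bounds island sizes. The threshold, the function names and the toy weights are assumptions.

```python
def cut_components(edge_weights, threshold):
    """Connected components of the graph restricted to edges heavier than threshold."""
    adj = {}
    for edge, w in edge_weights.items():
        if w > threshold:
            u, v = tuple(edge)
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components

def is_island(vertices, edge_weights, threshold):
    """Check that every edge leaving `vertices` weighs no more than the threshold."""
    return all(w <= threshold
               for edge, w in edge_weights.items()
               if len(edge & vertices) == 1)

# Invented weights on concept names taken from the text.
edge_weights = {
    frozenset(("Actinides", "NonMetalsGroup")): 9,
    frozenset(("NonMetalsGroup", "AlkaliMetalsGroup")): 8,
    frozenset(("Help", "Sandbox")): 12,
    frozenset(("Actinides", "Help")): 2,  # weak link between the two groups
}
islands = cut_components(edge_weights, threshold=5)
```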

Figure 5: Concept graph showing some strongly correlated concept groups.

Information about trends in wiki page editing comes not only from examining the most frequently edited and strongly correlated concepts, but also from analyzing the rarely edited ones.

Figure 6: Graph of the users and the most frequently edited concepts (over 60 edits per concept).

Surprisingly, 4,400 aggregate concepts (78.6% of the merged concept graph) have been edited fewer than 6 times during wiki development. This means that most of the concepts in the ontology have not been edited many times, nor have they been edited by many people (no more than 5 in any case, and certainly far fewer on average). In this reduced graph, the following concepts have the greatest centrality: Property:Rights, Property:Source, Template:Esoteric, Rdfs:subClassOf. The number of components (not counting isolated vertices) in this graph is significantly larger than in the previous case (101 components), since the commonly edited concepts, which connect many of the rarely edited concepts into larger components, have been removed.

3.3. Users and concepts: a two-mode network

The bipartite graph of the users and the aggregated concepts they are editing (see Figure 6) shows the most active users and the concepts they are editing. For instance, Markus Krötzsch has edited MainPage 100 times, TemplatePerson 67 times, Help 152 times and MediaWiki 125 times.

Figure 7: Core (2, 24) of bipartite user-concept graph clearly showing the most frequently edited aggregate concepts: MediaWiki, Main Page, Help, Relation, State, Group.
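A two-mode core such as the one in Figure 7 can be approximated by iterative pruning of the bipartite graph. The sketch below assumes that the first number bounds the degree of users and the second the degree of concepts. Which set each bound in "Core (2, 24)" refers to is not stated in the text. Edit counts are ignored, the function name is illustrative and the toy data is invented.

```python
def two_mode_core(edits, min_user_degree, min_concept_degree):
    """Iteratively prune a bipartite user-concept graph.

    edits is a set of (user, concept) pairs. Users must keep at least
    min_user_degree concepts, concepts at least min_concept_degree users.
    """
    edges = set(edits)
    while True:
        user_deg, concept_deg = {}, {}
        for user, concept in edges:
            user_deg[user] = user_deg.get(user, 0) + 1
            concept_deg[concept] = concept_deg.get(concept, 0) + 1
        kept = {(u, c) for u, c in edges
                if user_deg[u] >= min_user_degree
                and concept_deg[c] >= min_concept_degree}
        if kept == edges:  # nothing left to remove
            return edges
        edges = kept

# Invented toy data with small thresholds so the pruning is easy to follow.
edits = {
    ("u1", "Help"), ("u2", "Help"), ("u3", "Help"), ("u4", "Help"),
    ("u1", "MediaWiki"), ("u2", "MediaWiki"), ("u3", "MediaWiki"),
    ("u1", "Sandbox"),
}
core = two_mode_core(edits, min_user_degree=2, min_concept_degree=3)
```

Here u4 (one concept) and Sandbox (one editor) are pruned, leaving three users who each edit both Help and MediaWiki.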

Checking all the concepts and users shows that two of the most active users, Patrick and Skierpage, have edited 359 concepts in common. One of the more illustrative cores of this bipartite graph is shown in Figure 7.

4. Conclusions

Wiki is a product of combined user effort and collaboration. Some aspects of such collaboration can be seen by analyzing page edit logs via application of social network analysis and graph visualization techniques.

Graphs representing relations of Semantic Web wiki page editing (user-concept bipartite graph, similar concepts graph, similar users graph) have been constructed. Some concept merging was performed, to emphasize relations between concepts that are not obviously similar.

It was determined that there are a number of users who make unique contributions to the wiki, since no other user edits the same pages, even though this was most commonly the case with the wiki pages representing the users themselves. Also, there are a number of very active users, a small group responsible for most of the current page edits. The importance of each of those users was determined according to several criteria. Areas of interest/expertise for such users have been detected and displayed in visual form (as a graph cut).

Anonymous users (those for which only the IP was known) have been compared to the ones having a proper username. This comparison has revealed that even though the named users are on average much more active, there has been an amount of editing by the anonymous users which is not negligible and that there are some IPs displaying a higher activity than many users possessing a username.

Both most frequently edited and least edited representative concepts have been determined, as well as groups of similar concepts based on several criteria (islands, cores, cuts). The most edited group of concepts is Help.

User and concept graphs display some similarities, but also a significant difference in density.

In the end, we can conclude that the performed analysis has led to some interesting insights into Semantic Web wiki development and that such an approach might allow for better management of the process if applied at various points in time.

Acknowledgements

This work was supported by the Slovenian Research Agency and the IST Programme of the European Community under NeOn Lifecycle Support for Networked Ontologies (IST-4-027595-IP) and PASCAL2 Network of Excellence (ICT-NoE-2008). This publication only reflects the authors' views. Thanks to Markus Krötzsch and Peter Haase for providing the data.

References

[1] V. Batagelj, A. Mrvar, Pajek: Program for Analysis and Evaluation of Large Networks, 2003.

[2] Semantic Web: http://semanticweb.org/wiki/Main_Page

[3] S. Wasserman, K. Faust, Social Network Analysis: Methods and Applications, United Kingdom: Cambridge University Press, 1994.

[4] S. Sabo, M. Grčar, D. A. Fabjan, P. Ljubi, N. Lavrač, Exploratory analysis of the ILPNet2 social network, In Proceedings of the 10th International multi-conference Information society, SiKDD2007, 2007.

[5] S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan-Kaufmann Publishers, 2002.