DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"

Workflow: (1) Identifying semantic concepts: selection of CNs. (2) Data collection and preprocessing: extraction of CN, LN, IWL, GN, page size, and access-log data. (3) Calculation of REPk, REPv, REPt, and REL. (4) Analysis of data.

Eric Tessenow,1 Mirko Kämpf,2 and Jan W. Kantelhardt 2

Abstract

Since the numbers of hypertext pages and hyperlinks in the WWW have been continuously growing for more than 20 years, the problem of finding relevant content has become increasingly important. We have developed and evaluated techniques for a time-dependent characterization of the global and local relevance of WWW pages based on document length, number of links, and cross-correlations in user-access time series. We focus on content and user activity in selected groups of Wikipedia articles as a first application mainly because of data availability. Our goal is the assignment of ranking values to a hypertext page (node). The values shall cover static properties of the node and its neighbourhood (context) as well as dynamic properties derived from its page-view rates that depend on underlying communication processes. We show in several examples how this goal can be achieved.

1 Institute of Communications Studies, University of Leeds, LS2 9JT, Leeds, United Kingdom

2 Institut für Physik, Martin-Luther-Universität Halle-Wittenberg, 06099 Halle (Saale), Germany

Motivation

Since many aspects have to be taken into account in the analysis of global social networks, it is challenging to compare data collections and obtain results from their analysis. We therefore require a framework that is robust and at the same time flexible, and that enables interdisciplinary research, as scientists from different fields look at different parts of a data set. Our work suggests a methodology for comparable measurements of a node's relevance in local graphs defined by the node's local neighbourhood, while considering local link structure, text volume, user access activity, and editorial activity.

This enables a qualitative and also an efficient quantitative analysis of parts of a global social network without having to explore and analyze the whole graph.

In order to identify and compare different communication processes on multiple channels, one has to quantify the influence of the environment in which an individual process is embedded, e.g. for different topics and different regions on Earth. We study usage patterns and the embedding of content in one of the largest public and open content networks, Wikipedia.

Information Flow in Correlation Networks

Outlook

Local Representation Indexes: REPk,v and REPa,e(t)

Data Sets & Processing

SOE 6.1

References

[1] Kämpf M., Tessenow E., Kantelhardt J.W., Context Sensitive and Time Resolved Relevance in Complex Networks. Unpublished (in preparation, 2014).

[2] Kämpf M., Tismer S., Kantelhardt J.W., Muchnik L., Fluctuations in Wikipedia access-rate and edit-event data. Physica A, 391: 6101-6111 (2012).

[3] Kämpf M., Kantelhardt J.W., Muchnik L., From time series to co-evolving functional networks: dynamics of the complex system ‘Wikipedia’. Proc. Europ. Conf. Complex Syst. (2012).

[4] Schreck B., Kämpf M., Kantelhardt J.W., Motzkau H., Comparing the usage of global and local Wikipedias with focus on Swedish Wikipedia. arXiv:1308.1776 (2013).

[5] Kämpf M., Kantelhardt J.W., Hadoop.TS: large-scale time-series processing. International Journal of Computer Applications (IJCA), 74: 17 (2013), DOI: 10.5120/12974-0233.

[6] Segev E., Mapping the international: Global and local salience and news-links between countries in popular news sites worldwide. Int. Journal for Internet Science, 5: 48-71 (2010).

Contact

We compare different media types, in particular channels that push information to consumers (TV news, radio news, Twitter and Facebook communication), as opposed to pull media such as Wikipedia, forums, or blackboard websites, from which consumers pull data on demand.

We evaluate how properties of different network types, e.g. social, content, and communication networks, influence each other, and whether such couplings depend more on the content or more on the way information is offered and spread.

Finally, we are interested in the question: To what extent and how can automated tools influence the communication processes?

Relevance Indexes: RELv and RELa(t)

We measure characteristic static and dynamic properties of a Wikipedia page based on I.) its node degree k, II.) its average text volume v, and III.) its access-rate or edit-rate time series (a(t), e(t)) in order to determine and quantify the level of representation in a semantic or lingual context.

I.) Node degree

II.) Average text volume

III.) Time-dependent access rate a(t)

We measure the time-dependent (temporal) relevance of a Wikipedia page during a time period for access rates (a,b) of the central node CN (black), the group IWL (green), the local neighbourhood (LN, blue), and the global neighbourhood (GN, red).

a) Relevance Index: shows the level of attraction of a topic, e.g. for a Wikipedia page in one selected language. It compares the user interest in pages in the selected language and average values for pages with the same content for all other languages.
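The formula behind this index is not reproduced in the text. As a minimal sketch only, and assuming that "user interest" is the access rate and that the comparison is a simple ratio (both assumptions made here for illustration, not the poster's definition), the idea could be expressed as follows. The function name `relevance_ratio` and the array layout are hypothetical.

```python
import numpy as np

def relevance_ratio(a_cn, a_iwl):
    """Ratio of the CN access rate to the mean access rate of the IWL pages.

    a_cn  : access-rate series of the central node, shape (T,)
    a_iwl : access-rate series of the same topic in other languages,
            shape (n_languages, T)
    Assumption: a plain ratio stands in for the (unspecified) index.
    """
    a_cn = np.asarray(a_cn, dtype=float)
    a_iwl = np.asarray(a_iwl, dtype=float)
    mean_other = a_iwl.mean(axis=0)  # average over the other languages
    # Time steps without any access in the other languages are undefined.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(mean_other > 0, a_cn / mean_other, np.nan)
```

Values above 1 would then indicate more interest in the selected language than in the average of the other languages at that time step.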

Fig. 1: Definition of partial data sets (local networks).

Fig. 2: Comparison of local network structures with identical nodes based on (a) direct links and (b) functional link strengths derived from access activity.

We calculate the time-dependent link strengths correlation by:
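The expression itself is not preserved in this text. As a hedged illustration of the general idea, the sketch below uses the zero-lag Pearson cross-correlation of two access-rate series inside a sliding window; the choice of zero lag, the window handling, and the name `link_strength` are assumptions, not the authors' definition.

```python
import numpy as np

def link_strength(a_i, a_j, window, step=1):
    """Zero-lag Pearson correlation of two series in each sliding window.

    Returns one value per window position; NaN where a series is
    constant inside the window (correlation undefined).
    """
    a_i = np.asarray(a_i, dtype=float)
    a_j = np.asarray(a_j, dtype=float)
    if a_i.shape != a_j.shape:
        raise ValueError("series must have equal length")
    strengths = []
    for start in range(0, len(a_i) - window + 1, step):
        x = a_i[start:start + window]
        y = a_j[start:start + window]
        sx, sy = x.std(), y.std()
        if sx == 0.0 or sy == 0.0:
            strengths.append(np.nan)
        else:
            cov = np.mean((x - x.mean()) * (y - y.mean()))
            strengths.append(float(cov / (sx * sy)))
    return np.array(strengths)
```

Evaluating this for all page pairs in a group and averaging per window would give a time-resolved average link strength of the kind discussed for the functional networks.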

Fig. 3: Comparison of static representation indexes for two semantic concepts (data sets 1 and 2).

Fig. 4: Comparison of two local page networks with an assumed higher global relevance (left) and with higher local relevance (right).

Fig. 5: Distribution of dynamic link strengths for statically linked pages (a,d), for pages within groups LN (blue) and GN (red) (b,e), and for pages in different groups (c,f). Lines show results for real data and are compared with results from randomly shuffled data series (filled areas).

Average values and maximum values of the distribution function vary over time. Hence, we cannot define a simple threshold to identify relevant links. However, the distributions differ significantly for real data and surrogate data in (a,d,e).

In the presence of extreme events in access time series (bottom row) we find a significant increase in cross-correlation based link strengths for page pairs in the local and global neighbourhoods.
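The comparison with randomly shuffled series described for Fig. 5 can be sketched as follows. This is a minimal illustration under stated assumptions: independent random permutation of both series is used as the surrogate, the zero-lag Pearson correlation is used as the link strength, and `shuffled_surrogate_strengths` is a hypothetical name.

```python
import numpy as np

def shuffled_surrogate_strengths(a_i, a_j, n_surrogates=200, seed=0):
    """Correlation of the real pair and of randomly shuffled copies.

    Shuffling destroys the temporal order, so the surrogate values
    show which correlations arise by chance alone.
    """
    rng = np.random.default_rng(seed)
    a_i = np.asarray(a_i, dtype=float)
    a_j = np.asarray(a_j, dtype=float)
    real = float(np.corrcoef(a_i, a_j)[0, 1])
    surrogate = np.empty(n_surrogates)
    for n in range(n_surrogates):
        s_i = rng.permutation(a_i)  # shuffled copy, original untouched
        s_j = rng.permutation(a_j)
        surrogate[n] = np.corrcoef(s_i, s_j)[0, 1]
    return real, surrogate
```

A real value far outside the surrogate distribution suggests a link that is unlikely to be a chance correlation; since the distributions vary over time, such a comparison would be repeated per time window rather than against one fixed threshold.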

Diagram labels: raw data set (large-scale data management); partial data set (preparation); result data set; Communication Process; Modelling and Analysis of Complex Systems.

Definition of data sets (Fig. 1)

a) Central node (CN), all directly linked nodes in the same language (local neighbourhood, LN), all nodes regarding the same topic in other languages (linked by inter-wiki links, IWL), and all nodes linked to nodes in the IWL group (global neighbourhood, GN).

b) The CN group and the IWL group are the core of the local network for one topic. Both neighbourhoods, local (LN) and global (GN) form the hull of the local network.
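The group definitions above can be sketched in a few lines. The data layout (pages as `(language, title)` tuples, two dictionaries of links) and the function name `build_local_network` are assumptions chosen for illustration; they do not describe the authors' actual processing.

```python
def build_local_network(cn, links, interwiki):
    """Split the neighbourhood of a central node into CN, LN, IWL, GN.

    cn        : central node, e.g. ("de", "Berlin")
    links     : dict page -> set of pages it links to in the same language
    interwiki : dict page -> set of pages on the same topic in other languages
    """
    ln = set(links.get(cn, ())) - {cn}        # local neighbourhood
    iwl = set(interwiki.get(cn, ())) - {cn}   # same topic, other languages
    gn = set()
    for page in iwl:                          # pages linked from IWL pages
        gn |= set(links.get(page, ()))
    gn -= iwl | {cn}
    return {
        "CN": {cn}, "LN": ln, "IWL": iwl, "GN": gn,
        "core": {cn} | iwl,                   # CN and IWL form the core
        "hull": ln | gn,                      # LN and GN form the hull
    }

# Toy example with invented link data
links = {
    ("de", "Berlin"): {("de", "Spree"), ("de", "Deutschland")},
    ("en", "Berlin"): {("en", "Spree"), ("en", "Germany")},
}
interwiki = {("de", "Berlin"): {("en", "Berlin")}}
network = build_local_network(("de", "Berlin"), links, interwiki)
```

Keeping core and hull as separate sets makes it straightforward to compute group-wise quantities (for example, average link strength within LN or within GN) later on.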

Data sets for preliminary results and method tests

We address three data sets with differently chosen CNs (Wikipedia pages):

(1) Four German cities (Berlin, Heidelberg, Bad Harzburg, Sulingen) and two British cities (Oxford, Birmingham);

(2) ‘United States of America’, ‘Germany’, the ‘President of the United States Barack Obama’, and the ‘Federal Chancellor Angela Merkel’, each in German and English;

(3) Selected CNs with predominantly local and global relevance: Erfurt rampage and Illuminati book – both already used in a previous study of the fluctuations in Wikipedia access-rate time series [2] – and four times three pairs of CNs within the categories: minorities, cities, politicians, and meals.

Comparison of static link network and dynamic correlation networks (Fig. 2)

a) Direct Wikipedia links between all nodes in the groups CN, LN, IWL, and GN.

b) Functional link strengths calculated from user access-rate time series.

Illuminati (book) Erfurt rampage


This work was supported by:

Acknowledgement

Global and local relevance seem to be characteristic properties of a page. In (c) the local relevance decreases (blue dashed line) and in (d) it is similar for all languages. To compare L.REL and G.REL we show the cross-correlation for sliding windows of different sizes in (e) and (f).

CN: Erfurt rampage (language: de); panels: Jan-Feb 2009 and Mar-Apr 2009.

Fig. 6: Time-resolved average link strength for local functional networks around two selected CNs (see Fig. 4).

Fig. 6a) shows a significant change in the average cross-correlation for pages in group GN (area A). At the same time, the correlation in group LN drops significantly. In Fig. 6b) one can see that a decreasing local correlation is not necessarily related to a change in global correlations. This way one might be able to distinguish between local and global relevance as well.

We visualize the relevance of semantic concepts for specific regions, while we take the natural density of speakers and topic-specific relevance of languages within a specific region into account.

Such a language-dependent visualization will help to distinctly identify global and local trends for a semantic concept in a specific continent, country, or region based on public data sources and social communication and content networks like Wikipedia. Facebook, Google+, Twitter, or even internal systems used in global enterprises can be analyzed this way as well, even in a multilingual environment.

Fig. 7: Collaboration networks for pages regarding the same topic in different languages (central large violet nodes) show inhomogeneous structure with clusters of multiple sizes. Connections between editor clusters are “automatic editing tools (robots)”. Do such robots influence the spread of information?

Context Sensitive and Time Resolved Relevance of Wikipedia Articles