Identifying semantic concepts
Selection of CNs
Data collection
Preprocessing
Extraction of CN, LN, IWL, GN, page-size and access-log data
Calculation of REPk, REPv, REPt and REL
Analysis of data
Eric Tessenow,1 Mirko Kämpf,2 and Jan W. Kantelhardt 2
Abstract
Since the numbers of hypertext pages and hyperlinks in the WWW have been continuously growing for more than 20 years, the problem of finding relevant content has become increasingly important. We have developed and evaluated techniques for a time-dependent characterization of the global and local relevance of WWW pages based on document length, number of links, and cross-correlations in user-access time series. We focus on content and user activity in selected groups of Wikipedia articles as a first application, mainly because of data availability. Our goal is the assignment of ranking values to a hypertext page (node). The values shall cover static properties of the node and its neighbourhood (context) as well as dynamic properties derived from its page-view rates that depend on underlying communication processes. We show in several examples how this goal can be achieved.
1 Institute of Communications Studies, University of Leeds, LS2 9JT, Leeds, United Kingdom
2 Institut für Physik, Martin-Luther-Universität Halle-Wittenberg, 06099 Halle (Saale), Germany
Motivation
Since many aspects have to be taken into account in the analysis of global social networks, it is challenging to compare data collections and to obtain results from their analysis. We therefore require a framework that is robust and at the same time flexible, and that enables interdisciplinary research, as scientists from different fields look at different parts of a data set. Our work suggests a methodology for comparable measurements of a node's relevance in local graphs defined by the node's local neighbourhood, while considering local link structure, text volume, user access activity and editorial activity.
This enables a qualitative and also an efficient quantitative analysis of parts of a global social network without having to explore and analyze the whole graph.
In order to identify and compare different communication processes on multiple channels, one has to quantify the influence of the environment in which an individual process is embedded, e.g. for different topics and different regions on earth. We study usage patterns and the embedding of content in one of the largest public and open content networks, Wikipedia.
Information Flow in Correlation Networks
Outlook
Local Representation Indexes: REPk,v and REPa,e(t)
Data Sets & Processing
SOE 6.1
References
[1] Kämpf M., Tessenow E., Kantelhardt J.W., Context Sensitive and Time Resolved Relevance in Complex Networks. Unpublished (in preparation, 2014).
[2] Kämpf M., Tismer S., Kantelhardt J.W., Muchnik L., Fluctuations in Wikipedia access-rate and edit-event data. Physica A 391: 6101-6111 (2012).
[3] Kämpf M., Kantelhardt J.W., Muchnik L., From time series to co-evolving functional networks: dynamics of the complex system ‘Wikipedia’. Proc. Europ. Conf. Complex Syst. (2012).
[4] Schreck B., Kämpf M., Kantelhardt J.W., Motzkau H., Comparing the usage of global and local Wikipedias with focus on Swedish Wikipedia. arXiv:1308.1776 (2013).
[5] Kämpf M., Kantelhardt J.W., Hadoop.TS: large-scale time-series processing. International Journal of Computer Applications (IJCA) 74: 17 (2013), DOI: 10.5120/12974-0233.
[6] Segev E., Mapping the international: Global and local salience and news-links between countries in popular news sites worldwide. Int. Journal for Internet Science 5: 48-71 (2010).
Contact
We compare different media types, in particular channels that push information to consumers (TV news, radio news, Twitter and Facebook communication) as opposed to pull media like Wikipedia, forum or blackboard websites, from which customers pull data on demand.
We evaluate how properties of different network types, e.g. social, content, and communication networks, influence each other and whether such couplings depend more on the content or more on the way information is offered and spread.
Finally, we are interested in the question: To what extent and how can automated tools influence the communication processes?
Relevance Indexes: RELv and RELa(t)
We measure characteristic static and dynamic properties of a Wikipedia page based on I.) its node degree k, II.) its average text volume v, and III.) its access-rate or edit-rate time series (a(t), e(t)) in order to determine and quantify the level of representation in a semantic or lingual context.
I.) Node degree
II.) Average text volume
III.) Time-dependent access rate a(t)
We measure the time-dependent or temporal relevance of a Wikipedia page during a time period for access rates (a,b) of the central node CN (black), the group IWL (green), the local neighbourhood (LN, blue) and the global neighbourhood (GN, red).
a) Relevance Index: shows the level of attraction of a topic, e.g. for a Wikipedia page in one selected language. It compares the user interest in pages in the selected language with average values for pages with the same content in all other languages.
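The comparison described above can be illustrated with a minimal sketch. The poster does not state the formula for the relevance index, so the ratio used here (access rate in the selected language divided by the mean access rate over all other language versions) is only an assumption, and the function name `relevance_ratio` is hypothetical.

```python
import numpy as np


def relevance_ratio(access_selected, access_others):
    """Ratio of access rates: selected language vs. mean of other languages.

    ASSUMPTION: a plain ratio is used for illustration only; the poster
    does not give the actual definition of the relevance index.

    access_selected: 1-D sequence, access rate a(t) of the page in the
                     selected language.
    access_others:   2-D sequence of shape (n_languages, n_times), access
                     rates of the same topic in the other languages.
    """
    selected = np.asarray(access_selected, dtype=float)
    others = np.asarray(access_others, dtype=float)
    mean_others = others.mean(axis=0)
    # Where the other languages show no activity the ratio is undefined,
    # so NaN is returned for that time step.
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = selected / mean_others
    return np.where(mean_others > 0, ratio, np.nan)
```

Under this assumption, values above 1 would indicate stronger interest in the selected language than in the average other language version at that time step.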
Fig. 1: Definition of partial data sets (local networks)
Fig. 2: Comparison of local network structures with identical nodes based on (a) direct links and (b) functional link strengths derived from access activity.
We calculate the time-dependent, correlation-based link strengths by:
Fig. 3: Comparison of static representation indexes for two semantic concepts (data sets 1 and 2).
Fig. 4: Comparison of two local page networks with an assumed higher global relevance (left) and with higher local relevance (right).
Fig. 5: Distribution of dynamic link strengths for statically linked pages (a,d), for pages within groups LN (blue) and GN (red) (b,e), and for pages in different groups (c,f). Lines show results for real data and are compared with results from randomly shuffled data series (filled areas).
Average values and maximum values of the distribution function vary over time. Hence, we cannot define a simple threshold to identify relevant links. However, the distributions differ significantly for real data and surrogate data in (a,d,e).
In the presence of extreme events in access time series (bottom row) we find a significant increase in cross-correlation based link strengths for page pairs in the local and global neighbourhoods.
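The comparison between real and randomly shuffled series can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the link strength is taken to be the Pearson correlation coefficient, both series are shuffled independently, and the function names and the default of 200 surrogates are hypothetical.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    return float(np.corrcoef(x, y)[0, 1])


def surrogate_strengths(a_i, a_j, n_surrogates=200, seed=0):
    """Link strengths obtained after randomly shuffling both series.

    Shuffling destroys the temporal order, so the resulting values show
    which link strengths can arise by chance alone.
    """
    rng = np.random.default_rng(seed)
    a_i = np.asarray(a_i, dtype=float)
    a_j = np.asarray(a_j, dtype=float)
    return np.array([
        pearson(rng.permutation(a_i), rng.permutation(a_j))
        for _ in range(n_surrogates)
    ])


def compare_with_surrogates(a_i, a_j, n_surrogates=200, seed=0):
    """Return the real link strength and the fraction of surrogates whose
    absolute strength is at least as large as the real one."""
    real = pearson(np.asarray(a_i, dtype=float), np.asarray(a_j, dtype=float))
    surrogates = surrogate_strengths(a_i, a_j, n_surrogates, seed)
    fraction = float(np.mean(np.abs(surrogates) >= abs(real)))
    return real, fraction
```

A small fraction suggests that the observed link strength is unlikely to be a chance effect. Because the distributions vary over time, such a comparison would have to be repeated per time window rather than against one fixed threshold.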
RAW data set: large-scale data management
Partial data set: preparation
Result data set
Communication Process
Modelling and Analysis of Complex Systems
Definition of data sets (Fig. 1)
a) Central node (CN), all directly linked nodes in the same language (local neighbourhood, LN), all nodes regarding the same topic in other languages (linked by inter-wiki links, IWL), and all nodes linked to nodes in the IWL group (global neighbourhood, GN).
b) The CN group and the IWL group are the core of the local network for one topic. Both neighbourhoods, local (LN) and global (GN), form the hull of the local network.
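The group definitions above can be sketched in a few lines. The two input dictionaries, the function name `build_local_network`, and the page identifiers with language prefixes are hypothetical; the sketch also assumes that "linked" means outgoing page links and that the four groups are kept disjoint.

```python
def build_local_network(cn, page_links, interwiki_links):
    """Return the node groups CN, LN, IWL and GN around a central node.

    cn:              identifier of the central node, e.g. "de:Berlin".
    page_links:      dict mapping a page to the set of pages it links to
                     within the same language (ASSUMPTION: outgoing links).
    interwiki_links: dict mapping a page to the set of pages on the same
                     topic in other languages.
    """
    # Local neighbourhood: pages directly linked from the central node.
    ln = set(page_links.get(cn, ())) - {cn}
    # Inter-wiki group: the same topic in other languages.
    iwl = set(interwiki_links.get(cn, ())) - {cn}
    # Global neighbourhood: pages linked from any page in the IWL group.
    gn = set()
    for page in iwl:
        gn |= set(page_links.get(page, ()))
    # ASSUMPTION: groups are disjoint, so core and LN nodes are removed.
    gn -= iwl | ln | {cn}
    return {"CN": {cn}, "LN": ln, "IWL": iwl, "GN": gn}


def core_and_hull(groups):
    """Core = CN and IWL, hull = LN and GN, as defined in item b)."""
    return groups["CN"] | groups["IWL"], groups["LN"] | groups["GN"]
```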
Data sets for preliminary results and method tests
We address three data sets with differently chosen CNs (Wikipedia pages):
(1) Four German cities (Berlin, Heidelberg, Bad Harzburg, Sulingen) and two British cities (Oxford, Birmingham);
(2) ’United States of America’, ’Germany’, the ’President of the United States Barack Obama’, and the ’Federal Chancellor Angela Merkel’ in German and English language;
(3) Selected CNs with predominantly local and global relevance: Erfurt rampage and Illuminati book – both already used in a previous study of the fluctuations in Wikipedia access-rate time series [2] – and four times three pairs of CNs within the categories: minorities, cities, politicians, and meals.
Comparison of static link network and dynamic correlation networks (Fig. 2)
a) Direct Wikipedia links between all nodes in the groups CN, LN, IWL, and GN.
b) Functional link strengths calculated from user access-rate time series.
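One way to obtain a time-resolved functional link strength from two access-rate series is a sliding-window cross-correlation. The following is a minimal sketch that assumes the Pearson coefficient at zero time lag as the link strength; the function name and the `window` and `step` parameters are hypothetical and are not taken from the poster.

```python
import numpy as np


def windowed_link_strength(a_i, a_j, window, step):
    """Pearson correlation of two access-rate series in sliding windows.

    ASSUMPTION: zero-lag Pearson correlation stands in for the link
    strength; the actual definition used by the authors may differ.

    Returns one value per window; NaN where a window is constant and the
    correlation is therefore undefined.
    """
    a_i = np.asarray(a_i, dtype=float)
    a_j = np.asarray(a_j, dtype=float)
    if a_i.shape != a_j.shape:
        raise ValueError("both series must have the same length")
    strengths = []
    for start in range(0, len(a_i) - window + 1, step):
        x = a_i[start:start + window]
        y = a_j[start:start + window]
        if x.std() == 0 or y.std() == 0:
            strengths.append(np.nan)
        else:
            strengths.append(np.corrcoef(x, y)[0, 1])
    return np.array(strengths)
```

Applying this function to every pair of nodes in the groups CN, LN, IWL and GN would yield one weighted network per time window, which can then be compared with the static link network.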
Illuminati (book) Erfurt rampage
This work was supported by:
Acknowledgement
Global and local relevance seem to be characteristic properties of a page. In (c) the local relevance decreases (blue dashed line) and in (d) it is similar for all languages. To compare L.REL and G.REL we show the cross-correlation for sliding windows of different sizes in (e) and (f).
CN: Erfurt rampage (language: de)
Jan-Feb 2009
Mar-Apr 2009
Fig. 6: Time resolved average link strength for local functional networks around two selected CNs (see Fig. 4).
Fig. 6a) shows a significant change in the average cross-correlation for pages in group GN (area A). At the same time the correlation in group LN drops significantly. In Fig. 6b) one can see that a decreasing local correlation is not necessarily related to a change in global correlations. This way one might be able to distinguish between local and global relevance as well.
We visualize the relevance of semantic concepts for specific regions, taking into account the natural density of speakers and the topic-specific relevance of languages within a specific region.
Such a language-dependent visualization will help to distinctly identify global and local trends for a semantic concept in a specific continent, country, or region based on public data sources. Social communication and content networks like Wikipedia, but also Facebook, Google+, Twitter, or even internal systems used in global enterprises, can be analyzed this way, even in a multilingual environment.
Fig. 7: Collaboration networks for pages regarding the same topic in different languages (central large violet nodes) show inhomogeneous structure with clusters of multiple sizes. Connections between editor-clusters are “automatic editing tools (robots)”. Do such robots influence the spread of information?
Context Sensitive and Time Resolved Relevance of Wikipedia Articles