The Troll under the Bridge: Data Management for Huge Web Science Mediabases

Zinayida Petrushyna, Ralf Klamma, RWTH Aachen University. Workshop “Digital Social Networks”, Munich, September 12, 2008.

Page 1: The Troll under the Bridge: Data Management for Huge Web Science Mediabases

Lehrstuhl Informatik V (Informationssysteme)

Prof. Dr. M. Jarke

I5-RK-0808-1

CUELC

Zinayida Petrushyna, Ralf Klamma

RWTH Aachen University

Workshop “Digital social networks”, Munich

September 12, 2008

The Troll under the Bridge: Data Management for Huge Web Science Mediabases

Page 2: The Troll under the Bridge: Data Management for Huge Web Science Mediabases

Agenda

Motivation & Problem definition

Data Management for Web Science
– Crawling: Watchers
– Analysis: Patterns
– Visualization: Graphs

Conclusion

Outlook

Page 3: The Troll under the Bridge: Data Management for Huge Web Science Mediabases

Data Management issues in Web Science

Interoperable formats
– XML based – Wikis, RSS Feeds, Microformat
– SQL based – Deep Web
– Text based – Websites, Forums

Non-continuous analysis
– Crawling vs. Dumps
– Special purpose vs. General purpose

Aggregation level is not possible to achieve
– Data warehouses
– Theoretical considerations of agency – Actor network theory

Page 4: The Troll under the Bridge: Data Management for Huge Web Science Mediabases

Data Model for the Web 2.0

Latour: On Recalling ANT, 1999

Page 5: The Troll under the Bridge: Data Management for Huge Web Science Mediabases

Mediabase

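A Mediabase is a six-tuple graph M = (A, R, …, L)

[Formula partly lost in extraction. Recoverable components: a set A, a set R, a set L, a mapping … : A → L, a mapping … : R → L, and a mapping … : R → [0, 1].]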
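As a rough illustration of the six-tuple idea, the sketch below stores three sets and three mappings in one Python container. The class name `Mediabase`, its field names, and the choice to model each element of R as a (source, target) pair of actors are assumptions made for this example; the slide does not spell them out.

```python
from dataclasses import dataclass, field


@dataclass
class Mediabase:
    """Hypothetical container for the six components of a Mediabase."""
    actors: set = field(default_factory=set)             # A
    relations: set = field(default_factory=set)          # R, assumed (source, target) pairs
    labels: set = field(default_factory=set)             # L
    actor_label: dict = field(default_factory=dict)      # mapping A -> L
    relation_label: dict = field(default_factory=dict)   # mapping R -> L
    relation_weight: dict = field(default_factory=dict)  # mapping R -> [0, 1]

    def add_actor(self, actor, label):
        """Register an actor together with its label."""
        self.actors.add(actor)
        self.labels.add(label)
        self.actor_label[actor] = label

    def add_relation(self, source, target, label, weight):
        """Register a labelled, weighted relation between two known actors."""
        if source not in self.actors or target not in self.actors:
            raise KeyError("both endpoints must be known actors")
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weight must lie in [0, 1]")
        edge = (source, target)
        self.relations.add(edge)
        self.labels.add(label)
        self.relation_label[edge] = label
        self.relation_weight[edge] = weight


mb = Mediabase()
mb.add_actor("alice", "Agent")
mb.add_actor("thread-1", "Artefact")
mb.add_relation("alice", "thread-1", "performs", 0.8)
```

Keeping the labels and weights in separate dictionaries mirrors the separate mappings on the slide; a graph library could serve equally well.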

Page 6: The Troll under the Bridge: Data Management for Huge Web Science Mediabases

Actors in the Mediabase

A: Medium, Artefact, Process, Agent, Network

Medium: Mailing lists, Newsletter, Newsgroup, Feed, Web-site, Blog, Podcast, Chat room, Wiki, Forum, Social bookmarking site, Folksonomy

Artefact: Message, E-mail, Index, Comment, RSS Entry, Transaction, Host, Feedback, Conversation, Burst, Blog entry, Thread, Executions, Tag, Trackback, Review, URL, Rating, Multimedia, Ranking, Reference

Process: Acquisition, Search, Monitoring, Retrieval, Transcription, Addressing

Agent: Administrator, Member, Lurker, Reviewer, Dead, Answering person, Questioner, Troll, Spammer, Conversationalist, Expert
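One way to make this taxonomy usable in code is a lookup from a concrete label to its actor category. The sketch below is only an illustration: the names `ACTOR_TYPES` and `actor_type` are hypothetical, and the dictionary holds just a subset of the members listed on the slide.

```python
# Subset of the actor taxonomy from the slide; names of the dict and
# function are assumptions made for this example.
ACTOR_TYPES = {
    "Medium": {"Newsgroup", "Blog", "Wiki", "Forum", "Feed"},
    "Artefact": {"Message", "Thread", "Comment", "Tag", "URL"},
    "Process": {"Acquisition", "Search", "Monitoring", "Retrieval"},
    "Agent": {"Administrator", "Member", "Lurker", "Troll", "Spammer", "Expert"},
}


def actor_type(label):
    """Return the category a label belongs to, or None if it is unknown."""
    for category, members in ACTOR_TYPES.items():
        if label in members:
            return category
    return None
```

For example, `actor_type("Troll")` yields `"Agent"`, while an unlisted label yields `None`.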

Page 7: The Troll under the Bridge: Data Management for Huge Web Science Mediabases

Crawling Technologies

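Mix of dumps (Wikis) and special purpose crawlers:

[Formulas partly lost in extraction; the operators between the components did not survive. Recoverable components per line:]

W: Media, Artefact
MW: Mailing list, Message, Thread, Index
I: Media, Artefact, Process, Agent
G: Media, Artefact, Process, Agent, Network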

Page 8: The Troll under the Bridge: Data Management for Huge Web Science Mediabases

Trolls under the Bridge

What is a disturbance, e.g. a troll?
– Sensing an incompatibility between espoused theories and theories-in-use

Disturbances are starting points of learning processes
– Disturbances disturb, prevent … but they also create reflection

Disturbances are hard to detect or to forecast

Page 9: The Troll under the Bridge: Data Management for Huge Web Science Mediabases

Complex Troll Pattern in Basic Notation

[Formula garbled in extraction. The slide lists nine numbered clauses; subscripts and operators did not survive. Recoverable fragments:]

1) Thread : Artefact
2) thread, Thread (remainder lost)
3) Medium, Artefact : v(th, …) stored on
4) Message : Artefact
5) Process : P
6) Agent : Ag
7) Agent, Process, Artefact : Ag Member, P Authoring, v(th, …) performs
8) Agent, Process, Artefact : Ag Member, P Authoring, v(msg, …) performs
9) Thread, Process, Artefact : P, v(…, …) performs, v(…, …) performs, v(…, …) postedIn

Page 10: The Troll under the Bridge: Data Management for Huge Web Science Mediabases

Complex Troll Pattern in Basic Notation

[Formula garbled in extraction. This slide continues with clause 10). Recoverable symbols: troll : Ag, creator, author, messageThread, msg, minPosts, t.]

Page 11: The Troll under the Bridge: Data Management for Huge Web Science Mediabases

Pattern Language

Variables
– simple variables (troll, thread), properties (thread.author) and set variables (v1,…,vn).

Operations
– Arithmetic (+, -, *, /)
– Aggregate (SUM, COUNT, AVERAGE)
– Logical (&, |, ~, FORALL and EXISTS)
– Comparison (=, !=, >, <).

Rules for variable binding
– Simple variables – pattern parameters, actors or set variables
– Properties – actor properties or relations
– Set variables – actors

Interpreted by a finite state automaton
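To give a feel for the quantifier and aggregate operations, the sketch below maps EXISTS, FORALL and COUNT onto plain Python functions over a collection standing in for a set variable. This is not the finite state automaton mentioned on the slide; the function signatures and the dictionary-based message records are assumptions made for the example.

```python
def EXISTS(items, predicate):
    """True if at least one item satisfies the predicate."""
    return any(predicate(item) for item in items)


def FORALL(items, predicate):
    """True if every item satisfies the predicate."""
    return all(predicate(item) for item in items)


def COUNT(items, predicate):
    """Number of items satisfying the predicate."""
    return sum(1 for item in items if predicate(item))


# Hypothetical message records with the properties used in the slides.
messages = [
    {"author": "zed", "posted": "t1"},
    {"author": "zed", "posted": "t1"},
    {"author": "amy", "posted": "t1"},
]

zed_posts = COUNT(messages, lambda m: m["author"] == "zed")
someone_else = EXISTS(messages, lambda m: m["author"] != "zed")
all_in_t1 = FORALL(messages, lambda m: m["posted"] == "t1")
```

Comparison and logical operators (=, !=, >, <, &, |, ~) correspond directly to Python's `==`, `!=`, `>`, `<`, `and`, `or` and `not` inside the predicates.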

Page 12: The Troll under the Bridge: Data Management for Huge Web Science Mediabases

Pattern Language for PALADIN: Example Troll

Troll Pattern: This pattern tries to discover the cases when a troll exists in a digital social network. A troll in the network is considered a disturbance.

Disturbance:
(EXISTS [medium | medium.affordance = threadArtefact]) &
(EXISTS [troll |
  (EXISTS [thread | (thread.author = troll) &
    (COUNT [message | (message.author = troll) & (message.posted = thread)]) > minPosts]) &
  (~EXISTS [thread1, message1 | (thread1.author != troll) &
    (message1.author = troll) & (message1.posted = thread1)])])

Forces: medium; troll; network; member; thread; message; url

Force Relations: neighbour(troll, member); own thread(troll, thread)

Solution: No attention must be paid to the discussions started by the troll.

Rationale: The troll needs attention to continue its activities. If no attention is paid, he/she will stop participating in the discussions.

Pattern Relations: Associates Spammer pattern.
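The Disturbance expression of the troll pattern can be read as a predicate over threads and messages. The sketch below is one possible Python reading, not the PALADIN implementation: the `Thread` and `Message` records and the function `find_trolls` are hypothetical, and the first clause (a medium that affords threads) is assumed to hold already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thread:
    id: str
    author: str


@dataclass(frozen=True)
class Message:
    author: str
    posted: str  # id of the thread the message was posted in


def find_trolls(threads, messages, min_posts):
    """Return agents matching the troll disturbance for the given minPosts."""
    thread_author = {t.id: t.author for t in threads}
    agents = {t.author for t in threads} | {m.author for m in messages}
    trolls = set()
    for agent in agents:
        # EXISTS thread authored by the agent with COUNT(own messages) > minPosts
        floods_own_thread = any(
            sum(1 for m in messages
                if m.author == agent and m.posted == t.id) > min_posts
            for t in threads if t.author == agent
        )
        # ~EXISTS message by the agent in a thread authored by someone else
        posts_elsewhere = any(
            m.author == agent
            and m.posted in thread_author
            and thread_author[m.posted] != agent
            for m in messages
        )
        if floods_own_thread and not posts_elsewhere:
            trolls.add(agent)
    return trolls


threads = [Thread("t1", "zed"), Thread("t2", "amy")]
messages = [
    Message("zed", "t1"), Message("zed", "t1"), Message("zed", "t1"),
    Message("amy", "t1"), Message("amy", "t2"), Message("amy", "t2"),
]
suspects = find_trolls(threads, messages, min_posts=2)
```

In this toy data only "zed" qualifies: three posts in an own thread exceed `min_posts`, and none appear in threads started by others.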

Page 13: The Troll under the Bridge: Data Management for Huge Web Science Mediabases

Pattern Discovery Process

[Diagram. Boxes: Pattern; Pattern Template; Pattern Template Instance; Pattern Instance; Digital Social Network. Fields shown in the boxes: Description, Disturbance, Disturbance Instances, Variables, Pattern Parameters, Forces, Force Relations, Solution, Rationale, Dependencies, Pattern Relations.]

1. Set pattern parameters

2. Instantiate disturbances

3. Evaluate disturbances

4a. Change Pattern Parameters

4b. Apply Pattern Solution
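Steps 1 to 4b can be read as a simple loop: set parameters, instantiate and evaluate disturbances, then either adjust the parameters or hand the result over for the pattern solution. The sketch below illustrates that reading only. The function `discover`, the toy detector, and the acceptance rule ("between one and two hits") are assumptions; the slide does not say how disturbances are evaluated.

```python
def discover(detect, parameter_candidates, accept):
    """Try parameter settings in order until the evaluation accepts the result."""
    for params in parameter_candidates:   # 1. set parameters (4a. on later rounds)
        instances = detect(**params)      # 2. instantiate disturbances
        if accept(instances):             # 3. evaluate disturbances
            return params, instances      # 4b. caller applies the pattern solution
    return None, set()


# Toy network: number of posts per agent (hypothetical data).
post_counts = {"ann": 2, "bob": 9, "cy": 5, "dee": 1}


def detect_heavy_posters(min_posts):
    """Toy detector: agents with more than min_posts posts."""
    return {agent for agent, n in post_counts.items() if n > min_posts}


chosen, found = discover(
    detect_heavy_posters,
    [{"min_posts": 1}, {"min_posts": 4}, {"min_posts": 8}],
    accept=lambda instances: 0 < len(instances) <= 2,
)
```

With this data the first setting returns three agents and is rejected, so the loop moves on to `min_posts=4`, which is accepted.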

Page 14: The Troll under the Bridge: Data Management for Huge Web Science Mediabases

Visualization

Page 15: The Troll under the Bridge: Data Management for Huge Web Science Mediabases

Conclusions and Outlook

Homogeneous data management
Pattern language for disturbance analysis
Graph-based visualization

Data uncertainty and inconsistent data
Goals and intentions of analysts
Dynamic Mediabase visualization