Challenges of Building Web Observatories

Invited Talk at WebSci workshop on Building Web Observatories

Steffen Staab, staab@uni-koblenz.de (WeST)

Vote for free Web Science MOOC!

Do you want more free Web Science education on the Web?

Vote for our course at https://moocfellowship.org/ now!

Web Science & Technologies

University of Koblenz ▪ Landau, Germany

The Challenges of Building Interoperable Web Observatories

http://wow.west.webobservatory.org/

Steffen Staab

What to observe?

[Figure. Labels: Produce, Consume; Cognition, Emotion, Behavior, Socialisation, Knowledge; Observable micro-interactions in the Web; Apps, Protocols, Data & Information, Governance; WWW; Observable macro-effects in the Web]

Why to observe?

• Understanding
• Collecting
• Describing
• Analyzing
• Modeling
• Predicting
• Repeating!

What to observe?

[Figure repeated from the earlier "What to observe?" slide, now annotated with two observation methods: Web Crawling and Usage Logging]

Challenges – Data Collection Issues

Legal and/or ethical
  Crawling
    • May be disallowed by provider
  Usage logging
    • Privacy of individuals
    • Even if it is allowed....
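
One place a provider can state that crawling is disallowed is its robots.txt file. The sketch below checks a URL against such rules with Python's standard `urllib.robotparser`. The robots.txt text, the bot name `observatory-bot`, and the example.org URLs are made up for illustration, and passing this check does not settle the legal or ethical questions raised on this slide.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse rules from text we already hold, so no network access happens here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for path in ("/public/page", "/private/profile"):
        url = "http://example.org" + path
        print(url, allowed_by_robots(EXAMPLE_ROBOTS, "observatory-bot", url))
```

In a real crawler the robots.txt text would be fetched from the provider first; the provider's terms of service may impose further limits that this check cannot see.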

Challenges – Data Collection Issues

Crawling
  What does it mean to crawl a heavily interactive site?
  Incomplete data
    • Unreachability
    • Timeouts
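
Incomplete data is easier to reason about when the crawl records why a page is missing. A minimal sketch, assuming a crawler that only needs to distinguish "ok", "timeout", and "unreachable"; the function names and the three labels are illustrative choices, not part of the talk.

```python
from typing import Callable, Dict, Iterable
from urllib.request import urlopen


def fetch_with_timeout(url: str, timeout: float = 5.0) -> bytes:
    """Default fetcher: download url, giving up after timeout seconds."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl_outcomes(urls: Iterable[str],
                   fetch: Callable[[str], bytes] = fetch_with_timeout) -> Dict[str, str]:
    """Fetch each URL and record 'ok', 'timeout', or 'unreachable' for it."""
    outcomes: Dict[str, str] = {}
    for url in urls:
        try:
            fetch(url)
            outcomes[url] = "ok"
        except TimeoutError:
            outcomes[url] = "timeout"
        except OSError:
            # Coarse bucket: urllib's URLError/HTTPError and connection
            # errors are all OSError subclasses and end up here.
            outcomes[url] = "unreachable"
    return outcomes
```

Publishing such an outcome table next to the crawled data lets others see which parts of the view are missing and why. The `fetch` parameter is injectable so the logic can be exercised without network access.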

Challenges – Data Collection Issues

Crawling
  What does it mean to crawl a heavily interactive site?
  Incomplete data
  Where to start?
    • We cannot observe everything!
      – Even just for data size!
      – What appear to be the most fruitful starting points?

Challenges – Data Collection Issues

Crawling
  What does it mean to crawl a heavily interactive site?
  Incomplete data
  Where to start?
  Where to stop?
    • Each crawl is a view
      – Twitter
        » Tweet
        » URL
        » Web Page
        » Subweb
        » Followers
        » Followers' Followers
        » ...
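
"Where to start" and "where to stop" can be made explicit as a seed and a depth limit: each choice yields a different view of the same link structure. A minimal sketch using a breadth-first traversal over an in-memory graph; the `LINKS` dictionary is a hypothetical toy structure that only echoes the chain on the slide.

```python
from collections import deque
from typing import Dict, List, Set


def crawl_view(links: Dict[str, List[str]], seed: str, max_depth: int) -> Set[str]:
    """Return every node reachable from seed in at most max_depth hops."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # stopping rule: do not expand beyond the depth limit
        for neighbour in links.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen


# Hypothetical toy link structure, for illustration only.
LINKS = {
    "tweet": ["url"],
    "url": ["web page"],
    "web page": ["subweb"],
    "followers": ["followers' followers"],
}

if __name__ == "__main__":
    print(sorted(crawl_view(LINKS, "tweet", 1)))
    print(sorted(crawl_view(LINKS, "tweet", 3)))
    print(sorted(crawl_view(LINKS, "followers", 1)))
```

Recording the seed and the depth limit together with the data tells a later reader which view they are looking at.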

Challenges – Data Collection Issues

Crawling
  What does it mean to crawl a heavily interactive site?
  Incomplete data
  Where to start?
  Where to stop?
  Synchronous vs asynchronous
    • Strictly speaking, only asynchronous crawling is possible
      – But in [Dellschaft&Staab] we targeted the construction of models for streams of tags

Challenges – Data Publishing Issues

Legal and/or ethical
Example issues
  AOL query log
  Netflix challenge
  Delicious
    http://www.tagora-project.eu/data/
  Twitter
    Collecting, but no sharing
      • SocialSensor project

Challenges – Data Publishing Issues

Technical/modelling issues
  Generic format, e.g. RDF
  Format ready for digestion by a certain software, e.g. for Matlab processing
  Openness to other data
    E.g. references to DBpedia/Wikipedia
  Accuracy of publishing
    http://me.org showed "..."
    http://me.org showed "..."@2013-05-01:0900CEST
    http://me.org showed "..."@2013-05-01:0900CEST called from IP 193.99.144.85 using browser...version...history...
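
The three levels of accuracy above can be modelled as one record whose context fields are optional. A minimal sketch; the class name `Observation`, its field names, and the browser string are assumptions made for illustration, and the slide's timestamp is written here in ISO 8601 form (CEST is UTC+2).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """What a URL showed, plus as much observation context as is known."""
    url: str
    content: str
    observed_at: Optional[str] = None  # ISO 8601 timestamp with UTC offset
    client_ip: Optional[str] = None    # address the request was made from
    browser: Optional[str] = None      # hypothetical free-text browser label

    def describe(self) -> str:
        """Render the observation at the level of detail that is available."""
        text = f'{self.url} showed "{self.content}"'
        if self.observed_at:
            text += f" @ {self.observed_at}"
        if self.client_ip:
            text += f" called from IP {self.client_ip}"
        if self.browser:
            text += f" using browser {self.browser}"
        return text


if __name__ == "__main__":
    print(Observation("http://me.org", "...").describe())
    print(Observation("http://me.org", "...", "2013-05-01T09:00:00+02:00").describe())
    print(Observation("http://me.org", "...", "2013-05-01T09:00:00+02:00",
                      "193.99.144.85", "ExampleBrowser 1.0").describe())
```

The more context a record carries, the more repeatable the observation becomes, but fields such as the client IP also raise the privacy concerns discussed under data collection.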

Sharing Software

Software
  For crawling or usage logging
  Rather than sharing the data, share the code for observing
    Example: code for crawling Twitter in a certain way

Issues
  Limited repeatability
  Disturbance liability ("Störerhaftung") – at least in Germany
    • If you provide source code for crawling, e.g., Facebook, then Facebook can sue you even if you do not crawl Facebook yourself

Why to observe?

• Understanding
• Collecting
• Describing
• Analyzing
• Modeling
• Predicting
• Repeating!

In spite of all this....

WEB OBSERVATORY WIKI

Ongoing discussion

What to do about sharing Web Science datasets?

Let's do simple things first
  Collect pointers!
  Publish whatever you can publish – others will reuse
  Make it more archival

In a way that makes it easy to expand to handle more complex issues
  Semantic Wiki!

Web Observatory Wiki

• Main goals:
  • Registry of Web Science datasets
  • Compiled by Web Observatory participants – YOU!

• Minor goals:
  • Semantically store all information about datasets
  • Make it
    • Explorable
    • Queryable
    • Reusable

Quick Facts - 1

Semantic MediaWiki + Forms Extension
URL: http://wow.west.webobservatory.org/

Main classes:         Examples:
Dataset_Repository    KONECT
Dataset               Slashdot Zoo
Organization          WeST

Quick Facts - 2

Semantic MediaWiki + Forms Extension
URL: http://wow.west.webobservatory.org/

Class hierarchy example:    Attributes:
Dataset                     Dublin Core + size, license, URL, …
  Network                   Node count
    Social Network          …

Semantic Exploration by Views

Semantic Forms: Providing Data

[Figure: RDF graph. Resources: ko:konect, ko:slashdot-zoo, ko:twitter. Classes: wow:dataset-repository, wow:dataset, wow:network, wow:social-network. Properties: wow:contains, wow:size, wow:network-volume. Literals: 1944, 120000000. Edge labels: rdf:type, rdfs:subClassOf, rdfs:domain, rdfs:range]

Schema (Excerpt)
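
To illustrate what "queryable" can mean for such a schema, the sketch below stores a few triples as plain Python tuples and answers a class query by following rdfs:subClassOf. The triples are an assumed reading of the figure, based on the class hierarchy Dataset > Network > Social Network and the examples KONECT and Slashdot Zoo from the Quick Facts slides; the exact edges of the original graph, including where the literal values attach, are not recoverable from this transcript. The helper names are illustrative, and this is not how Semantic MediaWiki itself answers queries.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

# Assumed reading of the schema excerpt; not a verified copy of the figure.
TRIPLES = {
    ("wow:social-network", SUBCLASS_OF, "wow:network"),
    ("wow:network", SUBCLASS_OF, "wow:dataset"),
    ("ko:konect", RDF_TYPE, "wow:dataset-repository"),
    ("ko:slashdot-zoo", RDF_TYPE, "wow:social-network"),
    ("ko:konect", "wow:contains", "ko:slashdot-zoo"),
}


def superclasses(cls, triples):
    """Return cls plus every class reachable from it via rdfs:subClassOf."""
    found = {cls}
    stack = [cls]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if s == current and p == SUBCLASS_OF and o not in found:
                found.add(o)
                stack.append(o)
    return found


def instances_of(cls, triples):
    """Return resources whose asserted type is cls or one of its subclasses."""
    return {s for s, p, o in triples
            if p == RDF_TYPE and cls in superclasses(o, triples)}


if __name__ == "__main__":
    print(instances_of("wow:dataset", TRIPLES))
    print(instances_of("wow:dataset-repository", TRIPLES))
```

Under these assumed triples, asking for all instances of wow:dataset also finds ko:slashdot-zoo, although it is only typed as a social network. That is the kind of inference a class hierarchy is meant to enable.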

Discussion & Q&A

Access to wiki
  Current model:
    • Edits allowed by IPs and users
    • Everyone can be blocked, including IPs

Contribute:
  Content
  Modeling requirements
  ...
  Let us know!

Sanity Check

Understanding

Collecting (to some extent: commodity service)

Describing (WOW)

Analyzing

Modeling

Predicting

Repeating!

So far ad hoc – needs much more:
• Experience
• Guidelines
• Processing workflow
• Executable code shares (on big data!)
• ...

What else do we need?

Vote at: https://moocfellowship.org/