Linked Data for Digital Humanities SW4SH 2015 Victor de Boer With input from Christophe Guéret, Serge ter Braake, Niels Ockeloen, Antske Fokkens, Dirk Roorda, Lora Aroyo, Johan Oomen, Oana Inel, Jan Wielemaker


Victor de Boer

Web & Media Group, CS, VU University Amsterdam
Netherlands Institute for Sound and Vision

Cultural Heritage

Digital History

Linked Data for Development

Digital History

Sub-discipline of digital humanities

Buzzword that people are embracing and at the same time already getting tired of.

“ ..the use of digital media and computational analytics for furthering historical practice, presentation, analysis, or research” --Wikipedia

Digital History

Part of the historian's effort moves from physical archives to digital ones

Cross-domain collaboration

Img:www.doaks.org, www.dkrz.de

Data-driven research

Fig: Christophe Guéret

“That is great. I would love that…

…but my research questions are slightly different.”

Img:Monty Python

Better

Enter your (national) research infrastructure here

Fig: C. Guéret

Aging

Data Tool

C. Guéret based on http://redmonk.com/jgovernor/2007/04/05/why-applciations-are-like-fish-and-data-is-like0wine/

Even better

Do not bake the data into the tool; treat the data as an end product.
Build tools on top of the data.
Make sure others can do so as well.

Fig: C. Guéret

Framework: generic solutions, developed with historians

1. Preprocess, Clean, Model, Link, Enrich data in a collaboration with domain experts

2. Access heterogeneous datasets in a convenient way to get an intuition of the character and anomalies of the (linked) data;

3. Perform arbitrary queries to retrieve results relevant to their research questions;

4. Verify the veracity of query results, by following provenance links to original material

5. Retrieve and analyze the data with their tool of preference;

6. Republish and share results

Linked Data for Digital History
• Represent heterogeneous datasets with their own data models
– In one data format (RDF)
– Link what can be linked to integrate at project level (and beyond)
– Keep specificity of original data
• Links to other sources: re-use knowledge
• Common-sense or very specific
• Digital hermeneutics
• Allow multiple levels of semantic enrichment
– normalization
– through Named Graphs
– Provenance

• Linked Data is the (technically) best way to publish and share your research data

Some examples

Dutch Ships and Sailors

The problem: ((maritime) historical) data is not integrated
• Researchers’ data is “lost”
– In different physical locations
– In different file formats
– In different semantic structures
• In a workshop, we identified 25+ maritime historical datasets.
– http://dutchshipsandsailors.nl
• We do not want to force one monolithic data model for integration

The solution: Linked Open Data
• Represent heterogeneous datasets with their own data models
– In one data format (RDF)
– Link what can be linked to integrate at project level (and beyond)
– Keep specificity of original data
• Links to other sources: re-use knowledge
• Allow multiple levels of semantic enrichment / normalization
– through Named Graphs
– Provenance

Dutch Ships and Sailors in the Linked Open Data Cloud

What we did
1. Model four maritime historical datasets as RDF
– Noordelijke Monsterrollen Database [J. Leinenga]
– Generale Zeemonsterrollen [M. van Rossum]
– Dutch Asiatic Shipping
– VOC Opvarenden
2. Link to each other (based on ships, ship types, ranks, geography, …)
– Models and links evaluated by domain experts
3. Publish as Linked Open Data
4. Show how this data cloud can lead to new types of integrated research questions

Modeling in collaboration with historians

[Figure: RDF graph of one GZMVOC record and its link to a DAS voyage. Labels in the figure:]
Resources: gzmvoc:telling-1046-De_Berkel, gzmvoc:schip-1046-De_Berkel, gzmvoc:type-Ship, das:voyage-1918_61, __bnode_1
Classes: dss:Record, gzmvoc:Telling, dss:Ship, gzmvoc:Schip, dss:ShipType, gzmvoc:Scheepstype, das:Voyage
Properties: gzmvoc:aziatischeBemanning, dss:has_ship, gzmvoc:schip, rdfs:label, dss:scheepsnaam, gzmvoc:scheepsnaam, dss:has_shiptype, gzmvoc:has_shiptype, gzmvoc:scheepstype, dss:azRegistratieKop, gzmvoc:azAantalMatrozen, gzmvoc:telling, gzmvoc:heeft DAS heenreis
Literals: "1046", “Schip”, “De Berkel”, “21”, “Moorse mattroosen”

DIFFERENT but LINKED DATAMODELS

Modelling principles

Model each dataset as directly as possible
– Only “syntactical” transformation to RDF
– No normalization
– Reusability
– Transparency, trust

Normalize and link in a second stage; store in separate RDF Named Graphs
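The two-stage principle can be illustrated with a minimal Python sketch. It is not the project's conversion code. It uses plain tuples instead of an RDF library, and the graph names, the exact triples, and the normalized identifier `dss:ship-De_Berkel` are assumptions made for illustration, loosely following the De Berkel example.

```python
from collections import defaultdict

# A quad store: graph name -> set of (subject, predicate, object) triples.
# Graph names and the normalized URI below are illustrative assumptions.
store = defaultdict(set)

def add(graph, s, p, o):
    """Add one triple to the given named graph."""
    store[graph].add((s, p, o))

# Stage 1: direct, purely syntactic conversion of the source record.
add("graph:gzmvoc-original",
    "gzmvoc:schip-1046-De_Berkel", "gzmvoc:scheepsnaam", "De Berkel")

# Stage 2: normalization and links live in a separate named graph,
# so they can be inspected, replaced or ignored without touching stage 1.
add("graph:gzmvoc-links",
    "gzmvoc:schip-1046-De_Berkel", "skos:exactMatch", "dss:ship-De_Berkel")

def triples(graphs):
    """Return the union of the triples in the selected named graphs."""
    result = set()
    for g in graphs:
        result |= store[g]
    return result

original_only = triples(["graph:gzmvoc-original"])
with_links = triples(["graph:gzmvoc-original", "graph:gzmvoc-links"])
```

A researcher who distrusts the normalization can work with `original_only`, while `with_links` gives the integrated view.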

Links to Historical Newspapers

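[HARLINGEN, 24 October.] …gestrand. Tevens is het berigt ontvangen, dat het hier behoorende schoonerschip Transit, kapitein Schaap, in de Noordzee is gezonken, nadat het achterschip was weggeslagen; een ligtmatroos verloor daarbij het leven. Mede zijn hier drie vreemde schepen met meer en minder zware averij binnengeloopen.

(English: …stranded. Word has also been received that the schooner Transit of this port, Captain Schaap, has sunk in the North Sea after its stern was knocked away; an ordinary seaman lost his life. Three foreign ships have also come in here with more or less heavy damage.)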

- Andrea Bravo Balado

[Figure: the Dutch Ships and Sailors data cloud]
Datasets: DAS, GZMVOC, MDB, VOCOPV Begunstigden, VOCOPV Soldijboeken, VOCOPV Opvarenden
Vocabularies: PROV, AAT, foaf
Link types: owl:sameAs, dss:hasKBLink, rdfs:subClassOf, rdfs:subPropertyOf, dss:DAS link, skos:exactMatch

ClioPatria Triplestore

Data live at Huygens Institute for Dutch History

http://dutchshipsandsailors.nl/data
~30 million triples

Dev. server: http://semanticweb.cs.vu.nl/dss

Purl.org URIs redirect to live server w/ content negotiation
SPARQL endpoint
Web interface
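As a hedged illustration of how a client might talk to a SPARQL endpoint with content negotiation, the sketch below builds (but does not send) a SPARQL Protocol GET request using only the Python standard library. The endpoint URL is a placeholder, since the slides do not give the endpoint path, and the query is a generic one rather than a query from the project.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request

# Hypothetical endpoint URL; the actual endpoint path is not given in the slides.
ENDPOINT = "http://example.org/sparql"

# A generic query that returns ten arbitrary triples.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""

def build_sparql_request(endpoint, query):
    """Build (but do not send) a SPARQL Protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Content negotiation: ask for SPARQL JSON results via the Accept header.
    return Request(url, headers={"Accept": "application/sparql-results+json"})

request = build_sparql_request(ENDPOINT, QUERY)
```

Sending the request would be a matter of passing `request` to `urllib.request.urlopen` and decoding the JSON response; that step is left out so the sketch runs without network access.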

Search, browse and query

• SPARQL for R package

Data analysis and visualisation

Dutch Ships and Sailors

• Linked Data principles are a great fit to digital history requirements
– Heterogeneous models/datasets, light-weight reusable integration
– Multiple levels of normalisation, through separate named graphs
– SW Provenance matches Historical Provenance
• Watch out when you sail your Schooner into the North Sea

DIGITAL HUMANITIES RESEARCHERS

Media researcher Lars Arve Røssland of the University of Bergen. (Photo: Andreas R. Graven)

EXPLORATIVE SEARCH

Digital Hermeneutics: The combination of digital (Web) technology and theory of interpretation

Builds on AGORA PROJECT

Slide: Lora Aroyo

DATA: OPENIMAGES.EU

Open videos, Netherlands Institute for Sound and Vision
~3000, mostly news broadcasts
Descriptions

DATA: DELPHER.NL

Scans of radio bulletins (hand annotated)
1937 – 1984
1.5 million, OCR’ed and run through named entity recognition (NER)

ENTITY EXTRACTION

CROWDTRUTH.ORG

ENTITY EXTRACTION

EVENTS CROWDSOURCING AND LINKING TO CONCEPTS THROUGH CROWDTRUTH.ORG

SEGMENTATION & KEYFRAMES

LINKING EVENTS AND CONCEPTS TO KEYFRAMES

SIMPLE EVENT MODEL (SEM), OPENANNOTATION (OA) AND SKOS

DIVE:MEDIA OBJECT

SEM:EVENT

SEM:PLACE

SEM:TIME

SEM:ACTOR

SKOS:CONCEPT

OA:ANNOTATION

LINKS TO EUROPEANA (MULTILINGUAL)

LINKS TO DBPEDIA
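How events, annotations and media objects might hang together can be sketched in a few lines of Python. This is an assumption about the general shape of such a model, not the DIVE data itself: all `ex:` identifiers are hypothetical, while the property names `sem:hasActor`, `sem:hasPlace`, `sem:hasTime`, `oa:hasBody` and `oa:hasTarget` are taken from the SEM and Open Annotation vocabularies.

```python
# All "ex:" identifiers are hypothetical; sem:* and oa:* property names
# come from the Simple Event Model and Open Annotation vocabularies.
triples = [
    ("ex:event-1", "rdf:type", "sem:Event"),
    ("ex:event-1", "sem:hasActor", "ex:actor-1"),
    ("ex:event-1", "sem:hasPlace", "ex:place-1"),
    ("ex:event-1", "sem:hasTime", "ex:time-1"),
    ("ex:annotation-1", "rdf:type", "oa:Annotation"),
    ("ex:annotation-1", "oa:hasTarget", "ex:media-object-1"),
    ("ex:annotation-1", "oa:hasBody", "ex:event-1"),
]

def events_for(media_object):
    """Return the events attached to a media object through annotations."""
    annotations = {s for s, p, o in triples
                   if p == "oa:hasTarget" and o == media_object}
    return sorted(o for s, p, o in triples
                  if p == "oa:hasBody" and s in annotations)

def neighbours(event):
    """Return the actor, place and time from which browsing can continue."""
    return {p: o for s, p, o in triples
            if s == event and p.startswith("sem:has")}

events = events_for("ex:media-object-1")
```

Exploration then alternates between the two functions: from a media object to its events, and from an event to the actors, places and times that lead to further objects.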

DIGITAL SUBMARINE UI

https://www.flickr.com/photos/benjcarson/245171885

https://www.flickr.com/photos/mibuchat/2774251415

INFINITY OF EXPLORATION

DEMO

DIVE.BEELDENGELUID.NL

DIVE

Data conversion pipeline includes crowdsourcing, text analysis

Again provenance

Generic browsing doesn’t have to be boring

BiographyNet

Biographynet.nl

Starting point

Starting Point: Biography Portal of the Netherlands; www.biografischportaal.nl

125,000 short biographical descriptions with limited metadata from 23 Dutch biographical dictionaries (~76,000 individuals)

What kinds of historical questions can be answered with these data with the help of computational methods?


Methods developed in collaboration

Prosopography

Biographynet conversion

Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duitse… (Johan Rudolph Thorbecke was born in Zwolle on 14 January 1798 and comes from a half-German…)

Linked Data for BiographyNet

[Figure: Linked Data model for one person in BiographyNet. Labels in the figure:]
Person: Thorbecke
Biographical Description (NNBW): Provenance Meta Data; Person Meta Data (“Thorbecke”); Biography Parts; Event: Birth, 1798
Biographical Description (Enrichment, NLP Tool): Person Meta Data; Event: Birth, Zwolle, 1798-01-14
Input text: Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duitse…
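To make the enrichment step concrete, here is a minimal Python sketch that derives a structured birth event from the example sentence and records where it came from. It is a stand-in, not the project's NLP tool: a single regular expression covers only this one phrasing, and the field names, the tool label `regex-sketch` and the use of "NNBW" as the source label are assumptions.

```python
import re

# Dutch month names mapped to month numbers.
MONTHS = {"januari": 1, "februari": 2, "maart": 3, "april": 4, "mei": 5,
          "juni": 6, "juli": 7, "augustus": 8, "september": 9,
          "oktober": 10, "november": 11, "december": 12}

# Matches only the phrasing "werd in <year> geboren op <day> <month> in <place>".
BIRTH = re.compile(
    r"werd in (?P<year>\d{4}) geboren op (?P<day>\d{1,2}) "
    r"(?P<month>[a-z]+) in (?P<place>[A-Z]\w+)"
)

def extract_birth(text, source):
    """Return a birth event dict with provenance, or None if nothing matches."""
    m = BIRTH.search(text)
    if m is None or m.group("month") not in MONTHS:
        return None
    date = "%04d-%02d-%02d" % (
        int(m.group("year")), MONTHS[m.group("month")], int(m.group("day")))
    return {
        "type": "Birth",
        "date": date,
        "place": m.group("place"),
        # Provenance: which source, which tool, and which character span.
        "provenance": {"source": source, "tool": "regex-sketch",
                       "span": m.span()},
    }

text = ("Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari "
        "in Zwolle en komt uit een half-Duitse")
event = extract_birth(text, source="NNBW")
```

Keeping the character span with the event is what allows a historian to click through from an extracted fact to the exact passage it was derived from.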


Provenance in Biographynet

Goals: ensure the credibility of the demonstrator, evaluate its performance, and improve the academic status of the tool

Information involved: sources, but also NER input data, etc.
Processes involved: all steps in enrichment, aggregation…
People involved: who was responsible for pipeline, tool, …

Includes P-PLAN:* allows for comparing the actual activity and its input/output with the original plan and its variables

*Daniel Garijo, Yolanda Gil; http://www.opmw.org/model/p-plan

Interface for historians (mockup)


Biographynet

Data sustainably stored at historical research institute

Structured data accessible as LOD

Provenance through deep linking into enrichment

Generic browse / compare functionality for broad historic research (prosopography)

Re-usability

Verrijkt Koninkrijk

Het Koninkrijk der Nederlanden in de Tweede Wereldoorlog (The Kingdom of the Netherlands in the Second World War)
History of German-occupied Dutch society (1940-1945)
Published 1969 - 1991 in 14 volumes, 30 parts, 18,000 pages

1. Digitization

2. Open Data

3. Enriched access with Linked Data

Verrijkt Koninkrijk

Step 1: Lou de Jong’s “Het Koninkrijk” was digitized and made available in a reusable format

Step 2: Named Entity Recognition and consolidation of the back-of-the-book index provide structured vocabularies with links into the text

country, collection, doc-type, volume, chapter, section, sub-section, paragraph

Back-of-the-Book index Named Entities

Verrijkt Koninkrijk

Step 3: Enrichment with Linked Data makes new ways of interaction and analysis possible

Back-of-the-Book index Named Entities

[Figure labels: niod:Blitzkrieg; niod:oai_wo2_niod_nl_rec_102045; dct:subject; http://resolver.verrijktkoninkrijk.nl/nl.vk.d.reg.4.1386; botb:Blitzkrieg; skos:exactMatch]
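The kind of interaction such links enable can be sketched in Python: starting from a back-of-the-book index term, follow `skos:exactMatch` to the matching archive concept and collect the records that have it as `dct:subject`. The identifiers come from the figure, but the direction of the two edges is an assumption, and the plain tuple list stands in for a real triple store.

```python
# Identifiers are taken from the figure; which resource is the subject and
# which the object of each link is an assumption made for illustration.
TRIPLES = [
    ("niod:oai_wo2_niod_nl_rec_102045", "dct:subject", "niod:Blitzkrieg"),
    ("botb:Blitzkrieg", "skos:exactMatch", "niod:Blitzkrieg"),
]

def objects(subject, predicate):
    """Return all objects of triples with the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]

def subjects(predicate, obj):
    """Return all subjects of triples with the given predicate and object."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]

def related_records(index_term):
    """Follow skos:exactMatch from an index term to archive records."""
    records = []
    for concept in objects(index_term, "skos:exactMatch"):
        records.extend(subjects("dct:subject", concept))
    return records

found = related_records("botb:Blitzkrieg")
```

In a SPARQL setting the same two-hop traversal would be a single graph pattern; the point here is only that one alignment link connects the book's index to an external collection.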


National-Socialist: 29%
Social-Democrat: 21%
Protestant: 13%
Liberal: 12%
R-Catholic: 12%
Communist: 8%
Jewish: 5%

http://semanticweb.cs.vu.nl/verrijktkoninkrijk/

http://search.loedejongdigitaal.nl/

Results are links to paragraphs

re-usability

Reuse in comparative interface

Quantifying Historical Perspectives on WWII
http://qhp.science.uva.nl/

Verrijkt Koninkrijk

Textual data in digital repositories

Structured data accessible as LOD

Provenance

Simple tooling for primary research project

Re-usability

Shebanq

http://shebanq.ancient-data.org/

Nanopublications in digital history

Cf. http://www.slideshare.net/schambers3/nanopublications-in-the-arts-and-humanities

Shebanq

Source material shared in sustainable repo

Queries and metadata shared among researchers

Working on Linked Data model based on OpenAnnotation

Historical tool / data criticism

What happens between question/query and answer?

What happened to the original data?

Can we make the complex computer processes understandable for ‘lay’ people?

A detailed and understandable (these two match poorly) description of the process of preprocessing and enrichments

A detailed description of the way the data is visualized and the choices made in the design

Provenance

Historical tool criticism…

Willingness from historians to invest the time to learn about computer processes (at least the basic principles)

Possibilities for education at universities to bridge the gap between computer science and humanities studies and make tool criticism an integral part of students’ curricula

“Why do we still teach history students to decipher 17th-century handwriting, but not SQL?”

Thank you!

Victor de Boer


@victordeboer

Backup slides

MultimediaN E-Culture project (2006)

Museums have increasingly nice websites
But: most of them are driven by stand-alone collection databases

Data is isolated, both syntactically and semantically

If users can do cross-collection search, the individual collections become more valuable!

Semantic Search

E-Culture data cloud

Vocabulary alignment

“Easel-pieces”

RMA concept “Schilderij”

RMA is the thesaurus of Rijksmuseum

AAT artefact type “Easel Piece” “Painting”

AAT is Getty’s Art & Architecture Thesaurus

http://e-culture.multimedian.nl/