Linked Data for Digital Humanities
SW4SH 2015
Victor de Boer
With input from Christophe Guéret, Serge ter Braake, Niels Ockeloen, Antske Fokkens, Dirk Roorda, Lora Aroyo, Johan Oomen, Oana Inel, Jan Wielemaker
Victor de Boer
Web & Media Group, CS, VU University Amsterdam
Netherlands Institute for Sound and Vision
Cultural Heritage
Digital History
Linked Data for Development
Digital History
Sub-discipline of digital humanities
Buzzword that people are embracing and at the same time already getting tired of.
“…the use of digital media and computational analytics for furthering historical practice, presentation, analysis, or research” --Wikipedia
Digital History
Part of the historian's effort has moved from physical archives to digital ones
Cross-domain collaboration
Img:www.doaks.org, www.dkrz.de
Tools and visualisations
http://armstrongdigitalhistory.org/, http://www.vcdh.virginia.edu/courses/fall07/hius401-f/, http://digitalhistory.unl.edu/essays/thomasessay.php, http://www.philipvickersfithian.com/2013/05/gender-in-stacks-on-managing-small.html
“That is great. I would love that…
…but my research questions are slightly different.”
Img:Monty Python
Aging
Data Tool
C. Guéret based on http://redmonk.com/jgovernor/2007/04/05/why-applciations-are-like-fish-and-data-is-like0wine/
Even better
Do not bake the data into the tool; treat data as an end product.
Build tools on top of the data.
Make sure others can do so as well.
Fig: C. Guéret
Framework: generic solutions with historians
1. Preprocess, clean, model, link, and enrich data in collaboration with domain experts.
2. Access heterogeneous datasets in a convenient way to get an intuition of the character and anomalies of the (linked) data.
3. Perform arbitrary queries to retrieve results relevant to their research questions.
4. Verify the veracity of query results by following provenance links to the original material.
5. Retrieve and analyze the data with the tool of their preference.
6. Republish and share results.
Linked Data for Digital History
• Represent heterogeneous datasets with their own data models
– In one data format (RDF)
– Link what can be linked to integrate at project level (and beyond)
– Keep specificity of original data
• Links to other sources: re-use knowledge
• Common-sense or very specific
• Digital hermeneutics
• Allow multiple levels of semantic enrichment / normalization
– through Named Graphs
– Provenance
• Linked Data is the (technically) best way to publish and share your research data
The Problem: ((Maritime) historical) data is not integrated
• Researchers’ data is “lost”
– In different physical locations
– In different file formats
– In different semantic structures
• In a workshop, we identified 25+ maritime historical datasets.
– http://dutchshipsandsailors.nl
• We do not want to force one monolithic data model for integration
The solution: Linked Open Data
• Represent heterogeneous datasets with their own data models
– In one data format (RDF)
– Link what can be linked to integrate at project level (and beyond)
– Keep specificity of original data
• Links to other sources: re-use knowledge
• Allow multiple levels of semantic enrichment / normalization
– through Named Graphs
– Provenance
Dutch Ships and Sailors in the Linked Open Data Cloud
What we did
1. Model four maritime historical datasets as RDF
– Noordelijke Monsterrollen Database [J. Leinenga]
– Generale Zeemonsterrollen [M. van Rossum]
– Dutch Asiatic Shipping
– VOC Opvarenden
2. Link to each other (based on ships, ship types, ranks, geography,…)
– Models and links evaluated by domain experts
3. Publish as Linked Open Data
4. Show how this data cloud can lead to new types of integrated research questions
Modeling in collaboration with historians
[Figure: RDF graph of one GZMVOC record and its links. Classes, paired in the figure as dataset-specific class and dss: class: gzmvoc:Telling (dss:Record), gzmvoc:Schip (dss:Ship), gzmvoc:Scheepstype (dss:ShipType), das:Voyage (dss:Record). Instances: gzmvoc:telling-1046-De_Berkel, gzmvoc:schip-1046-De_Berkel, gzmvoc:type-Ship, das:voyage-1918_61, and a blank node (__bnode_1) for gzmvoc:aziatischeBemanning. Properties: gzmvoc:schip (dss:has_ship), gzmvoc:scheepsnaam (dss:scheepsnaam, rdfs:label), gzmvoc:has_shiptype (dss:has_shiptype), gzmvoc:scheepstype, gzmvoc:telling, gzmvoc:azAantalMatrozen, dss:azRegistratieKop, gzmvoc:heeft DAS heenreis. Literals: "1046", “Schip”, “De Berkel”, “21”, “Moorse mattroosen”.]
DIFFERENT but LINKED DATAMODELS
Modelling principles
Model each dataset as directly as possible
Only “syntactical” transformation to RDF
No normalization
Reusability
Transparency, trust
Normalize and link in a second stage; store the results in separate RDF Named Graphs
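The two-stage principle above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: `QuadStore`, the graph names, and the `ex:linkedVoyage` predicate are hypothetical, and a real setup would use a triple store with named-graph support. The point it shows is that the directly converted data and the later links live in separate graphs, so a user can choose which layers to trust.

```python
from collections import defaultdict

# Hypothetical graph names; the slides do not give the real named-graph URIs.
RAW_GRAPH = "graph:gzmvoc-raw"
LINKS_GRAPH = "graph:gzmvoc-links"


class QuadStore:
    """Tiny in-memory stand-in for a triple store with named graphs."""

    def __init__(self):
        self._graphs = defaultdict(set)

    def add(self, graph, subject, predicate, obj):
        """Store one triple inside the given named graph."""
        self._graphs[graph].add((subject, predicate, obj))

    def triples(self, graphs):
        """Return the union of the triples in the selected named graphs."""
        result = set()
        for name in graphs:
            result |= self._graphs.get(name, set())
        return result


store = QuadStore()

# Stage 1: direct, "syntactical" conversion of a source record.
store.add(RAW_GRAPH, "gzmvoc:schip-1046-De_Berkel", "gzmvoc:scheepsnaam", "De Berkel")

# Stage 2: a link added later, kept in its own graph.
# "ex:linkedVoyage" is a made-up predicate used only for this sketch.
store.add(LINKS_GRAPH, "gzmvoc:telling-1046-De_Berkel", "ex:linkedVoyage", "das:voyage-1918_61")

raw_only = store.triples([RAW_GRAPH])
with_links = store.triples([RAW_GRAPH, LINKS_GRAPH])
print(len(raw_only), len(with_links))
```

Querying only `RAW_GRAPH` gives the data as the original dataset stated it; adding `LINKS_GRAPH` opts in to the second-stage enrichment.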
Links to Historical Newspapers
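[HARLINGEN, 24 October.] …gestrand. Tevens is het berigt ontvangen, dat het hier behoorende schoonerschip Transit, kapitein Schaap, in de Noordzee is gezonken, nadat het achterschip was weggeslagen; een ligtmatroos verloor daarbij het leven. Mede zijn hier drie vreemde schepen met meer en minder zware averij binnengeloopen.
(English: [HARLINGEN, 24 October.] …stranded. Word has also been received that the schooner Transit of this port, Captain Schaap, has sunk in the North Sea after its stern was knocked away; an ordinary seaman lost his life. Three foreign ships have also put in here with more or less heavy damage.)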
- Andrea Bravo Balado
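[Figure: link diagram of the Dutch Ships and Sailors data cloud. Datasets: DAS, GZMVOC, MDB, VOCOPVBegunstigden, VOCOPVSoldijboeken, VOCOPVOpvarenden. External vocabularies: PROV, AAT, foaf. Link types: owl:sameAs, dss:hasKBLink, rdfs:subClassOf, rdfs:subPropertyOf, dss:DAS link, skos:exactMatch.]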
ClioPatria Triplestore
Data live at Huygens Institute for Dutch History
http://dutchshipsandsailors.nl/data
~30 Million triples
Dev. Server http://semanticweb.cs.vu.nl/dss
Purl.org URIs redirect to live server w/ content negotiation
SPARQL endpoint
Web interface
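As a rough sketch of how a client might talk to such a SPARQL endpoint, the snippet below builds (but does not send) an HTTP GET request in the style of the SPARQL 1.1 Protocol. The endpoint path `/sparql/` is an assumption, since the slides give only the server base URL, and the query is a generic placeholder rather than one from the project.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the endpoint path is /sparql/ under the development server URL.
ENDPOINT = "http://semanticweb.cs.vu.nl/dss/sparql/"

# Generic placeholder query: fetch any ten triples.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_sparql_request(endpoint, query):
    """Build a GET request that passes the query as a URL parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    # The Accept header asks the server for JSON results (content negotiation).
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
# Sending it would be urllib.request.urlopen(request); that needs network
# access and a server that is still online, so it is left out here.
```

The same Accept-header mechanism is what lets one URI return either a web page or RDF, depending on what the client asks for.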
Dutch Ships and Sailors
• Linked Data principles are a great fit to digital history requirements
– Heterogeneous models/datasets, light-weight reusable integration
– Multiple levels of normalisation, through separate named graphs
– SW Provenance matches Historical Provenance
• Watch out when you sail your Schooner into the North Sea
DIGITAL HUMANITIES RESEARCHERS
Img: Media researcher Lars Arve Røssland of the University of Bergen. (Photo: Andreas R. Graven)
EXPLORATIVE SEARCH
Digital Hermeneutics: The combination of digital (Web) technology and theory of interpretation
DATA: OPENIMAGES.EU
Open videos, Netherlands Institute for Sound and Vision
~3000, mostly news broadcasts
Descriptions
ENTITY EXTRACTION
CROWDTRUTH.ORG
ENTITY EXTRACTION
EVENTS CROWDSOURCING AND LINKING TO CONCEPTS THROUGH CROWDTRUTH.ORG
SEGMENTATION & KEYFRAMES
LINKING EVENTS AND CONCEPTS TO KEYFRAMES
SIMPLE EVENT MODEL (SEM), OPENANNOTATION (OA) AND SKOS
DIVE:MEDIA OBJECT
SEM:EVENT
SEM:PLACE
SEM:TIME
SEM:ACTOR
SKOS:CONCEPT
OA:ANNOTATION
LINKS TO EUROPEANA (MULTILINGUAL)
LINKS TO DBPEDIA
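To make the combination of SEM, OpenAnnotation and SKOS more concrete, here is a small sketch that produces N-Triples-style statements for one media object. The SEM, OA and SKOS terms used (`sem:hasActor`, `sem:hasPlace`, `sem:hasTime`, `oa:hasTarget`, `oa:hasBody`) are real vocabulary terms, but how DIVE actually wires them together is not stated on the slides, so the structure below is an assumption. The `http://example.org/dive/` namespace, the `MediaObject` class URI, and all instance names are hypothetical.

```python
SEM = "http://semanticweb.cs.vu.nl/2009/11/sem/"
OA = "http://www.w3.org/ns/oa#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/dive/"  # hypothetical namespace for this sketch


def describe_event(media, annotation, event, actor, place, time, concept):
    """Return triples that attach an event and a concept to a media object."""
    return [
        (media, RDF_TYPE, EX + "MediaObject"),
        (event, RDF_TYPE, SEM + "Event"),
        (actor, RDF_TYPE, SEM + "Actor"),
        (place, RDF_TYPE, SEM + "Place"),
        (time, RDF_TYPE, SEM + "Time"),
        (concept, RDF_TYPE, SKOS + "Concept"),
        # The event is described by who, where and when.
        (event, SEM + "hasActor", actor),
        (event, SEM + "hasPlace", place),
        (event, SEM + "hasTime", time),
        # The annotation ties the event and the concept to the media object.
        (annotation, RDF_TYPE, OA + "Annotation"),
        (annotation, OA + "hasTarget", media),
        (annotation, OA + "hasBody", event),
        (annotation, OA + "hasBody", concept),
    ]


def to_ntriples(triples):
    """Serialize URI-only triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


triples = describe_event(
    media=EX + "video/1",
    annotation=EX + "annotation/1",
    event=EX + "event/1",
    actor=EX + "actor/1",
    place=EX + "place/1",
    time=EX + "time/1",
    concept=EX + "concept/1",
)
print(to_ntriples(triples))
```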
DIGITAL SUBMARINE UI
https://www.flickr.com/photos/benjcarson/245171885
https://www.flickr.com/photos/mibuchat/2774251415
INFINITY OF EXPLORATION
DIVE
Data conversion pipeline includes crowdsourcing, text analysis
Again provenance
Generic browsing doesn’t have to be boring
Starting point
Starting Point: Biography Portal of the Netherlands; www.biografischportaal.nl
125,000 short biographical descriptions with limited metadata from 23 Dutch biographical dictionaries (~76,000 individuals)
What kind of historical questions can be answered with these data with the help of computational methods?
Biographynet.nl
Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duitse…
(English: Johan Rudolph Thorbecke was born in 1798 on 14 January in Zwolle and comes from a half-German…)
Linked Data for BiographyNet
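[Figure: BiographyNet data model for the person Thorbecke. One Biographical Description from the source NNBW carries Provenance Meta Data, Person Meta Data (name “Thorbecke”; event: birth, 1798) and Biography Parts (the text “Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duitse…”). A second Biographical Description, an enrichment produced by an NLP tool, carries Person Meta Data with a birth event: Zwolle, 1798-01-14.]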
Provenance in Biographynet
Ensure credibility of the demonstrator, to evaluate its performance and to improve the academic status of the tool
Information involved: Sources, but also: NER input data, etc.
Processes involved: All steps in enrichment, aggregation…
People involved: Who was responsible for pipeline, tool,
Includes P-PLAN*: allows for comparing the actual activity and its input/output with the original plan and its variables
* Daniel Garijo, Yolanda Gil; http://www.opmw.org/model/p-plan
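The idea of comparing an executed activity with its plan can be sketched as follows. This is only an illustration of the comparison, assuming a plan made of steps with input and output variables. It does not use the P-PLAN ontology itself, and the step and variable names are hypothetical, not taken from the BiographyNet pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """A planned step with the variables it should consume and produce."""
    name: str
    inputs: frozenset
    outputs: frozenset


@dataclass(frozen=True)
class Activity:
    """An executed activity, tied to the plan step it corresponds to."""
    corresponds_to_step: str
    used: frozenset
    generated: frozenset


def compare(plan, trace):
    """Report, per planned step, whether execution matched the plan."""
    executed = {activity.corresponds_to_step: activity for activity in trace}
    report = {}
    for step in plan:
        activity = executed.get(step.name)
        if activity is None:
            report[step.name] = "not executed"
        elif activity.used != step.inputs or activity.generated != step.outputs:
            report[step.name] = "deviates from plan"
        else:
            report[step.name] = "as planned"
    return report


# Hypothetical two-step enrichment plan.
plan = [
    Step("named_entity_recognition", frozenset({"biography_text"}), frozenset({"entities"})),
    Step("aggregation", frozenset({"entities"}), frozenset({"person_metadata"})),
]
# Hypothetical trace in which only the first step actually ran.
trace = [
    Activity("named_entity_recognition", frozenset({"biography_text"}), frozenset({"entities"})),
]

report = compare(plan, trace)
print(report)
```

A report like this is what lets a historian see which enrichment steps stand behind a given value and which planned steps never ran.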
Biographynet
Data sustainably stored at historical research institute
Structured data accessible as LOD
Provenance through deep linking into enrichment
Generic browse / compare functionality for broad historic research (prosopography)
Re-usability
Het Koninkrijk der Nederlanden in de Tweede Wereldoorlog (The Kingdom of the Netherlands in the Second World War)
History of German-occupied Dutch society (1940-1945)
Published 1969-1991 in 14 volumes, 30 parts, 18,000 pages
1. Digitization
2. Open Data
3. Enriched access with Linked Data
Verrijkt Koninkrijk
Step 1: Lou de Jong’s “Het Koninkrijk” was digitized and made available in a reusable format
Step 2: Named Entity Recognition and consolidation of the back-of-the-book index provide structured vocabularies with links into the text
country, collection, doc-type, volume, chapter, section, sub-section, paragraph
Back-of-the-Book index Named Entities
Verrijkt Koninkrijk
Step 3: Enrichment with Linked Data makes new ways of interaction and analysis possible
Back-of-the-Book index Named Entities
[Figure: linked-data example. Resources: niod:Blitzkrieg, niod:oai_wo2_niod_nl_rec_102045, botb:Blitzkrieg, http://resolver.verrijktkoninkrijk.nl/nl.vk.d.reg.4.1386. Properties: dct:subject, skos:exactMatch.]
[Figure: pie chart]
National-Socialist: 29%
Social-Democrat: 21%
Protestant: 13%
Liberal: 12%
R-Catholic: 12%
Communist: 8%
Jewish: 5%
http://semanticweb.cs.vu.nl/verrijktkoninkrijk/
http://search.loedejongdigitaal.nl/
Reuse in comparative interface
Quantifying Historical Perspectives on WWII
http://qhp.science.uva.nl/
Verrijkt Koninkrijk
Textual data in digital repositories
Structured data accessible as LOD
Provenance
Simple tooling for primary research project
Re-usability
Shebanq
http://shebanq.ancient-data.org/
Nanopublications in digital history
Cf. http://www.slideshare.net/schambers3/nanopublications-in-the-arts-and-humanities
Shebanq
Source material shared in sustainable repo
Queries and metadata shared among researchers
Working on Linked Data model based on OpenAnnotation
Historical tool / data criticism
What happens between question/query and answer?
What happened to the original data?
Can we make the complex computer processes understandable for ‘lay’ people?
A detailed and understandable (these two match poorly) description of the process of preprocessing and enrichments
A detailed description of the way the data is visualized and the choices made in the design
Provenance
Historical tool criticism… willingness from historians to invest the time to learn about computer processes (at least the basic principles)
Possibilities for education at universities to bridge the gap between computer science and humanities studies and make tool criticism an integral part of students’ curricula
“Why do we still teach history students to decipher 17th-century handwriting, but not SQL?”
MultimediaN E-Culture project (2006)
Museums have increasingly nice websites
But: most of them are driven by stand-alone collection databases
Data is isolated, both syntactically and semantically
If users can do cross-collection search, the individual collections become more valuable!
Semantic Search
Vocabulary alignment
“Easel-pieces”
RMA concept “Schilderij”
RMA is the thesaurus of Rijksmuseum
AAT artefact type “Easel Piece” “Painting”
AAT is Getty’s Art & Architecture Thesaurus
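A minimal sketch of why vocabulary alignment enables cross-collection search: once the RMA concept “Schilderij” is aligned with the AAT concept “Painting”, a query for either term can be expanded to both. The concept identifiers, the collections, and the object names below are hypothetical, and the alignment is treated as symmetric and transitive, as skos:exactMatch is defined in SKOS; the slide does not say which mapping relation the project used.

```python
# Alignment from the slide, with made-up identifiers for the two concepts.
ALIGNMENTS = [("rma:Schilderij", "aat:Painting")]

# Hypothetical collections: each object is indexed with its own thesaurus.
COLLECTIONS = {
    "rijksmuseum": [("object-1", "rma:Schilderij")],
    "other_museum": [("object-2", "aat:Painting"), ("object-3", "aat:Sculpture")],
}


def equivalents(concept, alignments):
    """Collect every concept reachable through the alignment pairs."""
    seen = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for left, right in alignments:
            # Follow each alignment in both directions.
            for source, target in ((left, right), (right, left)):
                if source == current and target not in seen:
                    seen.add(target)
                    frontier.append(target)
    return seen


def search(concept):
    """Find objects in all collections indexed with an equivalent concept."""
    targets = equivalents(concept, ALIGNMENTS)
    return sorted(
        (collection, obj)
        for collection, items in COLLECTIONS.items()
        for obj, indexed_with in items
        if indexed_with in targets
    )


results = search("rma:Schilderij")
print(results)
```

Without the alignment, the same search would return only the Rijksmuseum object; with it, the other collection's painting is found as well.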