Building a knowledge graph of the Belgian War Press

  • Published on
    19-Mar-2017


Transcript

  • Let's talk Linked Data session, Open Belgium 2017. Brecht Van de Vyvere | @brechtvdv

    Building a knowledge graph of the Belgian War Press

  • Can I easily link historic papers with other data sources?

  • Agenda

    hetarchief.be
    Knowledge graph
    5-star Open Data plan
    Adding context
    Linked Data as a Service
    Future work

  • Dataset

  • hetarchief.be: News from the Great War

    Newspapers 1914 - 1918
    10+ content partners
    Begin 2015: site launched
    Functionality: search by keyword, map with place of publication, collections

    1k titles

    55k newspapers

    300k pages

  • Human-readable interface

  • Policy

    1. Metadata: no restrictions, CC0

    2. OCR, documents: pictures, short stories; uncertain copyright status; no license or terms of use that minimises restrictions for re-use; disclaimer

  • hetarchief.be

    One of the biggest databases online
    No raw data?
    Title; description; OCR from ALTO; date created; owner; IDs (carrier, Abraham, VIAA); URL image

  • 5-star Open Data Plan

  • First 3 stars: open license, structured, non-proprietary

    Diagram labels: VIAA DB; VIAA API (IDs, metadata); NodeJS transform; CSV

    http://github.com/viaacode/hetarchief2lod

  • Step 4: URIs for everything

    Map VIAA's internal ID to a URI: http://data.viaa.be/noid/{id}

    Use ontologies: BBC Creative Work Ontology, schema.org, Hydra collections
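The ID-to-URI mapping can be sketched in a few lines. This is only an illustration of the URI pattern on the slide, not the code from the hetarchief2lod repository; the function name `noid_to_uri` and the percent-encoding step are assumptions.

```python
from urllib.parse import quote

# URI pattern taken from the slide: http://data.viaa.be/noid/{id}
BASE = "http://data.viaa.be/noid/"


def noid_to_uri(noid: str) -> str:
    """Fill the {id} slot of the pattern with an internal identifier.

    Hypothetical helper: percent-encoding is an assumption, added so that
    unexpected characters cannot break the URI path.
    """
    if not noid:
        raise ValueError("empty identifier")
    return BASE + quote(noid, safe="")


print(noid_to_uri("tm71v5c76q_191510XX"))
```

Identifiers made of letters, digits and underscores, like the one used later in the deck, pass through unchanged.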

  • Knowledge graph: semantic network, concepts, relations

    Linked Data: URIs, RDF

  • rdf:type

  • 5-star: link to other sources

    ABRAHAM: catalogue of newspapers in Belgium

    owl:sameAs

  • Basic information triples

    http://data.viaa.be/noid/tm71v5c76q_191510XX
      rdf:type cwork:CreativeWork
      cwork:title "L'illustration"
      cwork:dateCreated "1915-10-XX"
      cwork:content "On dit que c'est notre imagination et."
      schema:copyrightHolder UGENT
      schema:inLanguage "en"
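As a rough illustration of how these basic triples could be serialised, the sketch below writes them as N-Triples lines using only the standard library. The helper names are hypothetical, every object other than the class is treated as a plain literal (an assumption, including for the copyright holder), and the namespace IRIs are the usual ones for the BBC Creative Work ontology and schema.org.

```python
# Namespaces for the vocabularies named on the slides.
CWORK = "http://www.bbc.co.uk/ontologies/creativework/"
SCHEMA = "http://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def iri(value: str) -> str:
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return "<" + value + ">"


def literal(value: str) -> str:
    """Serialise a plain string literal, escaping the characters that matter."""
    escaped = (value.replace("\\", "\\\\").replace('"', '\\"')
                    .replace("\n", "\\n").replace("\r", "\\r"))
    return '"' + escaped + '"'


def basic_triples(uri, title, date_created, content, holder, language):
    """Return the basic information triples for one newspaper as N-Triples lines."""
    subject = iri(uri)
    pairs = [
        (RDF_TYPE, iri(CWORK + "CreativeWork")),
        (CWORK + "title", literal(title)),
        (CWORK + "dateCreated", literal(date_created)),  # kept as text: "XX" is not a valid date
        (CWORK + "content", literal(content)),
        (SCHEMA + "copyrightHolder", literal(holder)),   # assumption: holder as a literal
        (SCHEMA + "inLanguage", literal(language)),
    ]
    return [f"{subject} {iri(p)} {o} ." for p, o in pairs]


for line in basic_triples(
    "http://data.viaa.be/noid/tm71v5c76q_191510XX",
    "L'illustration", "1915-10-XX",
    "On dit que c'est notre imagination et.", "UGENT", "en",
):
    print(line)
```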

  • http://data.viaa.be/noid/tm71v5c76q

    http://data.viaa.be/noid/tm71v5c76q_191804xx_0003

    http://data.viaa.be/noid/tm71v5c76q_191804xx_0002

    http://data.viaa.be/noid/tm71v5c76q_191804xx_0001

    Hydra links shown in the diagram: memberOf, first, last, previous/next; totalItems: 3
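The paging structure in this diagram can be sketched as follows. The dictionary keys simply reuse the labels visible on the slide (memberOf, first, last, previous, next, totalItems); the function name and the dict layout are assumptions, and no claim is made about the exact Hydra terms used in the published data.

```python
def paging_links(collection_uri, page_uris):
    """Compute navigation links for an ordered list of page URIs.

    Hypothetical helper: returns a dict keyed by URI, each value holding
    the link labels shown on the slide.
    """
    if not page_uris:
        return {collection_uri: {"totalItems": 0}}
    links = {
        collection_uri: {
            "totalItems": len(page_uris),
            "first": page_uris[0],
            "last": page_uris[-1],
        }
    }
    for i, page in enumerate(page_uris):
        entry = {"memberOf": collection_uri}
        if i > 0:
            entry["previous"] = page_uris[i - 1]
        if i < len(page_uris) - 1:
            entry["next"] = page_uris[i + 1]
        links[page] = entry
    return links


collection = "http://data.viaa.be/noid/tm71v5c76q"
pages = [f"{collection}_191804xx_{n:04d}" for n in range(1, 4)]
for uri, entry in paging_links(collection, pages).items():
    print(uri, entry)
```

The first page has no previous link and the last page has no next link, which matches the three-page example in the diagram.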

  • Problems

    Node limited to 1.7 GB memory
    OCR too big
    Turtle file: 475 MB max (32k newspapers)
    Compressed to HDT: 388 MB
    Basic triples with HDT: 54k newspapers, 8.2 MB

  • Adding context

  • Connect with other data sources

    Cfr. Europeana, delpher.nl, lab.kbresearch.nl

  • Stanford NER

    4 types: Location, Organisation, Person and Other
    Train your model: golden corpus
    Write code that fits your needs

    SPARQL query that matches strings: REPERTOIRE des COMMUNES et des PRINCIPAUX HAMEAUX de la ci-devant Belgique

    Difficult to find cultural APIs (cfr. InFlandersField list of names, Abraham catalogue)
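The string-matching idea can be illustrated without a SPARQL endpoint. The sketch below swaps in a plain Python gazetteer lookup: it scans OCR text for place names from a name-to-URI table and returns the hits. The gazetteer entries and `example.org` URIs are hypothetical placeholders, not data from the repertoire mentioned on the slide.

```python
import re


def match_places(text, gazetteer):
    """Find gazetteer names in text, case-insensitively and on word boundaries.

    Returns (offset, matched text, URI) tuples sorted by offset.
    """
    hits = []
    for name, uri in gazetteer.items():
        # Lookarounds keep "Gent" from matching inside "Gentenaars".
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.group(0), uri))
    return sorted(hits)


# Hypothetical gazetteer entries for illustration only.
gazetteer = {
    "Ieper": "http://example.org/place/ieper",
    "Gent": "http://example.org/place/gent",
}
for hit in match_places("Nieuws uit Ieper en GENT, de Gentenaars wachten.", gazetteer):
    print(hit)
```

Exact matching like this is sensitive to OCR errors, which is one reason the deck moves on to trained NER models.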

  • DBpedia Spotlight

    Proof of concept
    Models for all languages (nl, en, fr, de)

    NL/FR/EN/DE trained model

    DBpedia matcher

    Stanford NER

  • Results?

    Filter on OCR quality; e.g.

  • DBpedia Spotlight

    Running your own endpoint is easy:

    java -Xmx8G -jar dbpedia-spotlight-0.7.1.jar nl http://localhost:2223/nl/rest

    Or with Docker:

    docker build -f Dockerfile -t dutch_spotlight .
    docker run -i -p 2223:80 dutch_spotlight spotlight.sh
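Once such an endpoint is running, a client call might look like the sketch below. It assumes the standard DBpedia Spotlight `annotate` resource under the endpoint URL from the slide, form-encoded `text` and `confidence` parameters, and a JSON response with a `Resources` list; the function names and the default confidence of 0.5 are illustrative choices.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Local endpoint started with the command on the slide.
ENDPOINT = "http://localhost:2223/nl/rest"


def build_annotate_request(text, confidence=0.5, endpoint=ENDPOINT):
    """Build a POST request for the annotate resource (assumed path: /annotate)."""
    body = urlencode({"text": text, "confidence": confidence}).encode("utf-8")
    return Request(
        endpoint + "/annotate",
        data=body,
        headers={
            "Accept": "application/json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


def extract_entities(response_json):
    """Pull (surface form, DBpedia URI) pairs out of a Spotlight JSON response."""
    return [(r["@surfaceForm"], r["@URI"]) for r in response_json.get("Resources", [])]


def annotate(text, confidence=0.5):
    """Send text to the local endpoint; requires the server to be running."""
    with urlopen(build_annotate_request(text, confidence)) as response:
        return extract_entities(json.load(response))
```

Calling `annotate("...")` on a page of OCR text would return the recognised entities with their DBpedia URIs, which can then be attached to the newspaper page as tags.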

  • Linked Data as a Service

    Allow federated queries
    Low server cost
    Be reliable

    Triple Pattern Fragments: a Low-cost Knowledge Graph Interface for the Web

  • Linked Data Fragments querying

    VIAA is part of the family!

    http://data.viaa.be/ldf
    https://query.wikidata.org/bigdata/ldf
    http://data.linkeddatafragments.org/linkedgeodata
    http://data.linkeddatafragments.org/dbpedia2014

    Your browser: client-side algorithm, GET fragments
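A client asks a Triple Pattern Fragments interface for one triple pattern at a time. The sketch below builds such a fragment URL; it assumes the `subject`, `predicate` and `object` query parameters commonly exposed by Linked Data Fragments servers. A real client should read the parameter names from the hypermedia controls in the response instead of hard-coding them.

```python
from urllib.parse import urlencode


def fragment_url(interface, subject=None, predicate=None, obj=None):
    """Build the URL of the fragment for one triple pattern.

    None means "variable": that position is left out of the query string.
    Parameter names are an assumption based on common TPF server defaults.
    """
    candidates = (("subject", subject), ("predicate", predicate), ("object", obj))
    params = {key: value for key, value in candidates if value is not None}
    return interface + ("?" + urlencode(params) if params else "")


# All owl:sameAs links published at the VIAA interface named on the slide.
print(fragment_url("http://data.viaa.be/ldf",
                   predicate="http://www.w3.org/2002/07/owl#sameAs"))
```

Because each request is a simple, cacheable GET, the server stays cheap to run while the joins happen in the client.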

  • Demo time!

  • Demo

    Retrieve all newspaper titles:

    SELECT DISTINCT ?title
    WHERE { ?paper ?title }

  • Demo

    Retrieve more info from corresponding DBpedia URI:

    SELECT ?label ?comment
    WHERE {
      ?tag .
      ?db owl:sameAs ?tag .
      ?db rdfs:label ?label .
      ?db rdfs:comment ?comment
    }
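To show what the client-side algorithm does with a query like this, here is a toy nested-loop join over two in-memory triple lists, one standing in for the archive and one for DBpedia. All data, the `ex:` names and the `ex:mentions` predicate are hypothetical; a real client fetches matching triples page by page from each interface instead of scanning Python lists.

```python
SAME_AS = "owl:sameAs"
LABEL = "rdfs:label"


def match(triples, s=None, p=None, o=None):
    """Yield triples matching a pattern; None acts as a variable."""
    for t in triples:
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
            yield t


def labels_for_tags(archive, dbpedia, mention_predicate):
    """Join archive tags with DBpedia labels through owl:sameAs links."""
    results = []
    for _, _, tag in match(archive, p=mention_predicate):
        for db, _, _ in match(dbpedia, p=SAME_AS, o=tag):
            for _, _, label in match(dbpedia, s=db, p=LABEL):
                results.append((tag, label))
    return results


# Hypothetical toy data for illustration only.
archive = [("ex:page1", "ex:mentions", "ex:tag/Ieper")]
dbpedia = [
    ("dbr:Ieper", SAME_AS, "ex:tag/Ieper"),
    ("dbr:Ieper", LABEL, "Ieper"),
]
print(labels_for_tags(archive, dbpedia, "ex:mentions"))
```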

  • Battle of the Somme

    Pages with military leaders from the Battle of the Somme mentioned + thumbnail:

    SELECT ?paper ?o ?thumbnail
    WHERE {
      ?o .
      ?paper ?ctag .
      ?o owl:sameAs ?ctag .
      ?o ?thumbnail .
    }

  • Frontpainters

    Semi-automatic generation of collections, e.g. about frontpainters:

    SELECT ?newspaper ?artist ?tag ?hetarchief
    WHERE {
      ?artist dc:subject .
      ?artist owl:sameAs ?tag .
      ?newspaper ?tag .
      ?newspaper ?hetarchief
    }

  • Conclusion

    Extra search method for our researchers
    NER versus OCR: enhanced findability
    Adding extra information (cfr. Abraham) requires effort; we need more TPF interfaces

  • Future work

    Dereferenceable URIs: http://data.viaa.be/noid/{id}
    Content negotiation: HTML, JSON, RDF
    Save location with OLR
    Suggestions are welcome!

    http://data.viaa.be/noid/
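Content negotiation for these URIs could work roughly as sketched below: the server inspects the `Accept` header and picks HTML, JSON or RDF. This is a simplified illustration, not a full implementation of HTTP content negotiation; choosing `text/turtle` as the RDF media type and HTML as the default are assumptions.

```python
# Representations named on the slide; Turtle as the RDF syntax is an assumption.
SUPPORTED = ["text/html", "application/json", "text/turtle"]


def negotiate(accept_header, supported=SUPPORTED):
    """Pick the best supported media type for an Accept header, or None."""
    if not accept_header.strip():
        return supported[0]  # no preference stated: default to HTML
    preferences = []
    for position, part in enumerate(accept_header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media, quality = fields[0].lower(), 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0
        preferences.append((-quality, position, media))
    # Highest q first; ties keep the order in which the client listed them.
    for negative_quality, _, media in sorted(preferences):
        if negative_quality == 0:
            continue  # q=0 marks a type as not acceptable
        if media in supported:
            return media
        if media == "*/*":
            return supported[0]
        if media.endswith("/*"):
            for candidate in supported:
                if candidate.startswith(media[:-1]):
                    return candidate
    return None


print(negotiate("application/json;q=0.5, text/turtle"))
```

A browser would receive the HTML page, while a Linked Data client asking for `text/turtle` would get the triples for the same URI.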

  • Q&A

    Brecht Van de Vyvere | @brechtvdv
