Industrialized Linked Data

Dave Reynolds, Epimorphics Ltd @der42

Context: public sector Linked Data

Linked Data journey ...

explore

what is linked data?

what use is it for us?


explore



self-describing

carries semantics with it

annotate and explain

data in context

...

Integration

comparable

slice and dice

web API

...


explore



what’s involved?


explore pilot

data model publish convert apply

Photo of The Thinker © dSeneste.dk@flickr CC BY


explore pilot routine?

Great pilot but ...

can we reduce the time and cost?

how do we handle changes and updates?

how can we make the published data easier to use?

How do we make Linked Data “business as usual”?

Example case study: Environment Agency

monitoring of bathing water quality

static pilot: historic annual assessments

live pilot: weekly assessments

operational system: additional data feeds, live update, integrated API, data explorer

From pilot to practice

reduce modelling costs: patterns, reuse

handling change and update: patterns, publication process

automation: conversion, publication

embed in the business process: use internally as well as externally

publish once, use many

data platform

dive 1

Reduce costs - modelling

1. Don’t do it

map source data into isomorphic RDF, synthesize URIs

loses some of the value proposition

2. Reuse existing ontologies intact or mix-and-match

best solution when available

W3C GLD work on vocabularies – people, organizations, datasets ...

3. Reusable vocabulary patterns

example:

Data cube plus reference URI sets

adaptable to broad range of data – environmental, statistical, financial ...

Reusable patterns: Data cube

Much public sector data has regularities

sets of measures

observations, forecasts, budgets, assessments, estimates, statistics ...

organized along some dimensions

region, agency, time, category, cost centre ...

interpreted according to attributes

units, multipliers, status

[Figure: example cube with measure "spend", laid out along the dimensions time, cost centre and objective code; cell values are shown in $k and carry a status attribute of provisional or final]

Data cube vocabulary

Data cube pattern

Pattern, not a fixed ontology

customize by selecting measures, dimensions and attributes

originated in publishing of statistics

applied to environment measurements, weather forecasts, budgets and spend, quality assessments, regional demographics ...

Supports reuse

widely reusable URI sets – geography, time periods, agencies, units

organization-wide sets

modelling often only requires small increments on top of core pattern and reusable components

opens door for reusable visualization tools

standardization through W3C GLD
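The pattern above can be sketched in a few lines of Python. This is only an illustration of the measures / dimensions / attributes split, not the Data Cube vocabulary itself; the class, the function names, the dimension labels and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One cell of a cube: where it sits, what was measured, how to read it."""
    dimensions: tuple       # (name, value) pairs that locate the cell
    measures: tuple         # (name, value) pairs holding the observed values
    attributes: tuple = ()  # (name, value) pairs such as unit or status


def observation(dimensions, measures, attributes=None):
    """Build an Observation from plain dicts."""
    return Observation(
        tuple(sorted(dimensions.items())),
        tuple(sorted(measures.items())),
        tuple(sorted((attributes or {}).items())),
    )


def cube_slice(observations, **fixed):
    """Return the observations whose dimensions match every fixed value."""
    return [
        o for o in observations
        if all(dict(o.dimensions).get(k) == v for k, v in fixed.items())
    ]


# Hypothetical spend cube: dimensions are year and cost centre.
cube = [
    observation({"year": 2010, "cost_centre": "A"}, {"spend": 120},
                {"unit": "USD thousand", "status": "final"}),
    observation({"year": 2011, "cost_centre": "A"}, {"spend": 130},
                {"unit": "USD thousand", "status": "final"}),
    observation({"year": 2011, "cost_centre": "B"}, {"spend": 9},
                {"unit": "USD thousand", "status": "provisional"}),
]

spend_2011 = cube_slice(cube, year=2011)
```

A new dataset then only needs its own measures; the dimension values (years, cost centres, locations) can come from shared reference sets.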

Application to case study

Data Cubes for water quality measurement

in-season weekly assessments

end of season annual assessments

dimensions:

time intervals – UK reference time service

location - reference URI set for bathing waters and sample pts

cubes can reuse these dimensions

just need to define specific measures




dive 2

Handling change

critical challenge

most initial pilots choose a snapshot dataset

and go stale, fast

understanding the nature of data updates and how to handle them is critical to successful scaling to business as usual

types of change

new data related to different time period

corrections to data

entities change

properties

identity

Modelling change 1. Individual data items relate to new time period

Pattern: n-ary relation

observation resource relates value to time period and other context

use Data Cube dimensions for this

History or latest?

latest is non-monotonic but helpful for many practical uses

materialize (SPARQL Update), implement in query, implement in API

choice whether to keep history as well: water quality v. weather forecasts

[Figure: bathing water http://environment.data.gov.uk/id/bathing-water/ukk1202-36000 (Clevedon Beach) with three observations, each linked to it by bwq:bathingWater and carrying a bwq:sampleYear and a bwq:classification: http://reference.data.gov.uk/id/year/2009 Higher, http://reference.data.gov.uk/id/year/2010 Minimum, http://reference.data.gov.uk/id/year/2011 Higher]
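The "history or latest" choice can be illustrated with a small Python sketch that keeps the full history and derives the latest observation per bathing water, which is the kind of view a query or API layer might implement. The dictionary layout and function name are assumptions; the sample values follow the Clevedon Beach example.

```python
CLEVEDON = "http://environment.data.gov.uk/id/bathing-water/ukk1202-36000"

# Full history: one observation resource per sample year (n-ary relation).
history = [
    {"bathingWater": CLEVEDON, "sampleYear": 2009, "classification": "Higher"},
    {"bathingWater": CLEVEDON, "sampleYear": 2010, "classification": "Minimum"},
    {"bathingWater": CLEVEDON, "sampleYear": 2011, "classification": "Higher"},
]


def latest_by_entity(observations):
    """Return {bathing water: observation}, keeping the most recent year."""
    latest = {}
    for obs in observations:
        current = latest.get(obs["bathingWater"])
        if current is None or obs["sampleYear"] > current["sampleYear"]:
            latest[obs["bathingWater"]] = obs
    return latest


latest = latest_by_entity(history)
```

The derived view is non-monotonic: adding a 2012 observation changes the answer, while the history list only ever grows.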

Modelling change 2. Corrections

patterns

silent change (!)

explicit replacement

API level hides replaced values but SPARQL query can retrieve & trace

explicit change event

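[Figure: Clevedon Beach (http://environment.data.gov.uk/id/bathing-water/ukk1202-36000) with two assessments for http://reference.data.gov.uk/id/year/2011, each linked by bwq:bathingWater and bwq:sampleYear. The original (classification: Minimum, status: replaced, reason: reanalysis) and its replacement (classification: Higher) are connected by dct:isReplacedBy / dct:replaces. In the explicit change event variant, an analysis event points to them with ev:before and ev:after and records ev:occuredOn and ev:agent]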
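A minimal Python sketch of the explicit replacement pattern follows, assuming each assessment carries an optional link to its replacement. The ids, field names and helper functions are hypothetical; the values follow the example in the figure.

```python
assessments = [
    {"id": "a1", "year": 2011, "classification": "Minimum",
     "status": "replaced", "reason": "reanalysis", "isReplacedBy": "a2"},
    {"id": "a2", "year": 2011, "classification": "Higher",
     "isReplacedBy": None},
]


def current_values(items):
    """API-style view: hide any assessment that has been replaced."""
    return [a for a in items if a.get("isReplacedBy") is None]


def trace(items, start_id):
    """Follow replacement links from start_id to the current value."""
    by_id = {a["id"]: a for a in items}
    chain, seen = [], set()
    current = start_id
    # The 'seen' set guards against accidental cycles in the links.
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(by_id[current])
        current = by_id[current].get("isReplacedBy")
    return chain


visible = current_values(assessments)
audit_trail = trace(assessments, "a1")
```

Routine consumers only see `visible`; anyone who needs to audit a correction can still walk the chain.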

Modelling change 3. Mutation

Infrequent change of properties, essential identity remains

e.g. renaming a school, adding another building

routine accesses see property value, not function of time

patterns

in place update

named graphs: current graph + graphs for each previous state + meta-graph

explicit versioning with open periods

Modelling change 3. Mutation

explicit versioning with open periods

find right version by query on validity interval

simplify use through

non-monotonic “latest value” link

API to implement query filters automatically

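[Figure: an endurant (the bathing water) with two versions attached via dct:hasVersion. Version “Clevedon Beach” has a dct:valid interval with time:intervalStarts 2003 and time:intervalFinishes 2011. Version “Clevedon Sands” has a dct:valid interval with time:intervalStarts 2011 and no finish (open period)]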
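Finding the right version by validity interval can be sketched as below. The slide only gives years, so the exact dates are assumptions, as is the choice of half-open intervals (start included, finish excluded); the field and function names are hypothetical.

```python
from datetime import date

versions = [
    {"label": "Clevedon Beach",
     "starts": date(2003, 1, 1), "finishes": date(2011, 1, 1)},
    {"label": "Clevedon Sands",
     "starts": date(2011, 1, 1), "finishes": None},  # open period
]


def version_at(items, when):
    """Return the version whose validity interval contains `when`, if any."""
    for v in items:
        still_open = v["finishes"] is None
        if v["starts"] <= when and (still_open or when < v["finishes"]):
            return v
    return None


def latest_version(items):
    """The 'latest value' shortcut: the version whose interval is open."""
    return next((v for v in items if v["finishes"] is None), None)
```

An API layer could call `latest_version` by default and only expose `version_at` to clients that ask for a point in time.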


Application to case study

weekly and annual samples

use Data Cube pattern (n-ary relation)

withdrawn samples: replacement pattern (no explicit change event)

Data Cube slice for “latest valid assessment”

generated by a SPARQL Update query

API gives easy access to the latest valid values

linked data following or raw SPARQL query allows drilling into changes

changes to bathing water profile

versioning pattern

bathing water entity points to latest profile (SPARQL Update again)




dive 3

Automation

Transform and publish data feed increments

transformation engine service

reusable mappings, low cost to adapt to new feeds

linking to reference data

publication service that supports non-monotonic changes

[Figure: architecture. Data increments (csv) go to the transform service, which is driven by xform specs and draws on reference data and the reconciliation service; its output goes to the publication service, which updates the replicated publication servers]

Transformation service

declarative specification of transform

single service supports a range of transformations

easy to adapt transformation to new feeds and modelling changes

R2RML – RDB to RDF Mapping Language

specify mapping from database tables to RDF triples

W3C candidate recommendation

D2RML

R2RML extension to treat CSV feed as a database table

Small D2RML example

    :dataSource a dr:CSVDataSource ;
        rdfs:label "dataSource" .

    # Subject map: mint one bathing water URI per CSV row from the EUBWID2 column
    :bathingWaterTermMap a dr:SubjectMap ;
        dr:template "http://environment.data.gov.uk/id/bathing-water/{EUBWID2}" ;
        dr:class def-bw:BathingWater .

    :bathingWaterMap
        dr:logicalTable :dataSource ;
        dr:subjectMap :bathingWaterTermMap ;
        # English label taken from the description_english column
        dr:predicateObjectMap [
            dr:predicate rdfs:label ;
            dr:objectMap [ dr:column "description_english" ; dr:language "en" ]
        ] ;
        # Typed identifier notation taken from the EUBWID2 column
        dr:predicateObjectMap [
            dr:predicate def-bw:eubwidNotation ;
            dr:objectMap [ dr:column "EUBWID2" ; dr:datatype def-bw:eubwid ]
        ] .
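To show what such a mapping does to a feed, here is a rough Python sketch that applies the same template and column mappings to CSV rows. It is not the transformation engine; the CSV layout, function names and the tuple representation of triples are assumptions, and it skips the IRI-safe encoding of template values that a real R2RML processor performs.

```python
import csv
import io
import re

TEMPLATE = "http://environment.data.gov.uk/id/bathing-water/{EUBWID2}"


def expand(template, row):
    """Replace each {COLUMN} placeholder with that column's value."""
    return re.sub(r"\{([^}]+)\}", lambda m: row[m.group(1)], template)


def rows_to_triples(csv_text):
    """Yield (subject, predicate, object) tuples for each CSV row."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = expand(TEMPLATE, row)
        yield (subject, "rdf:type", "def-bw:BathingWater")
        # Literal with a language tag, held as (value, language)
        yield (subject, "rdfs:label", (row["description_english"], "en"))
        # Typed literal, held as (value, datatype)
        yield (subject, "def-bw:eubwidNotation",
               (row["EUBWID2"], "def-bw:eubwid"))


# Hypothetical one-row feed increment.
SAMPLE = "EUBWID2,description_english\nukk1202-36000,Clevedon Beach\n"
triples = list(rows_to_triples(SAMPLE))
```

Because the mapping is data rather than code, adapting to a new feed means writing a new template and column list, not a new converter.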

Using patterns

problems with verbosity, increases reuse costs

extend to support modelling patterns

Data Cube

specify mapping to observation with measures and dimensions

engine generates Data Set and Data Structure Definition automatically

D2RML cube map example

    :dataCubeMap a dr:DataCubeMap ;
        rr:logicalTable "dataSource" ;
        dr:datasetIRI "http://example.org/datacube1"^^xsd:anyURI ;
        dr:dsdIRI "http://example.org/myDsd"^^xsd:anyURI ;
        dr:observationMap [
            rr:subjectMap [
                rr:termType rr:IRI ;
                rr:template "http://example.org/observation/{PLACE}/{DATE}" ] ;
            rr:componentMap [
                dr:componentType qb:measure ;
                rr:predicate aq:concentration ;
                rr:objectMap [ rr:column "NO2" ; rr:datatype xsd:decimal ; ]
            ] ;
            ...

Slide annotations:

Define how measure value is to be represented

Instances will automatically link to base Data Set

Implies an entry in the Data Structure Definition which is auto-generated

But what about linking?

connect observations to reference data

a core value of linked data

R2RML has Term Maps to create values

constants and templates

extend to allow maps based on other data sources

Lookup map

lookup resource in a store, fetch predicate

Reconcile

specify lookup in a remote service

use Google Refine reconciliation API
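A lookup map can be pictured as the small Python function below: given a value from the feed, find the matching resource in a reference store and return either its URI or one of its properties. The store is modelled as a list of tuples and every name here is hypothetical; a reconcile map would make a similar call to a remote service instead of a local store.

```python
BW = "http://environment.data.gov.uk/id/bathing-water/ukk1202-36000"

# Hypothetical reference data held as (subject, predicate, object) tuples.
REFERENCE = [
    (BW, "rdfs:label", "Clevedon Beach"),
    (BW, "def-bw:eubwidNotation", "ukk1202-36000"),
]


def lookup_map(store, key_predicate, key_value, fetch_predicate=None):
    """Find the resource whose key_predicate equals key_value.

    Returns the resource URI, or the value of fetch_predicate on that
    resource when one is given. Returns None when there is no match or
    the match is ambiguous.
    """
    matches = {s for (s, p, o) in store
               if p == key_predicate and o == key_value}
    if len(matches) != 1:
        return None
    subject = next(iter(matches))
    if fetch_predicate is None:
        return subject
    for (s, p, o) in store:
        if s == subject and p == fetch_predicate:
            return o
    return None
```

Returning None for ambiguous matches is a design choice in this sketch: a wrong link is usually worse than a missing one.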







Publication service

goals

cope with non-monotonic effects of change representation

so replication is robust and cheap (=> make it idempotent)

solution

SPARQL Update

publish transformed increment as a simple INSERT DATA

then run SPARQL Update script for non-monotonic links

dct:isReplacedBy links

latest value slices

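Sample update script

    # Remove existing "latest" links
    # (WHERE pattern missing from the slide; assumed to mirror the DELETE template)
    DELETE {
      ?bw bwq:latestComplianceAssessment ?o .
    } WHERE {
      ?bw bwq:latestComplianceAssessment ?o .
    } ;

    # Link each bathing water to its observation in the most recent yearly slice
    # (INSERT template missing from the slide; assumed to be the same link)
    INSERT {
      ?bw bwq:latestComplianceAssessment ?o .
    } WHERE {
      {
        ?slice a bwq:ComplianceByYearSlice ; bwq:sampleYear [ interval:ordinalYear ?year ] .
        # A slice is the latest if no other slice has a later year
        OPTIONAL {
          ?slice2 a bwq:ComplianceByYearSlice ; bwq:sampleYear [ interval:ordinalYear ?year2 ] .
          FILTER (?year2 > ?year)
        }
        FILTER ( !bound(?slice2) )
      }
      ?slice qb:observation ?o .
      ?o bwq:bathingWater ?bw .
    }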
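Why this two-step publication is idempotent can be seen in a small Python model: the store is a set of triples, the increment is added by set union, and the "latest" links are always dropped and rebuilt from the data. This is a simplified sketch, not the publication service; it attaches the sample year directly to each observation rather than going through slices, and the short identifiers are hypothetical.

```python
LATEST = "bwq:latestComplianceAssessment"


def publish(store, increment):
    """Apply an increment, then rebuild the non-monotonic 'latest' links."""
    # Step 1: plain insert of the transformed increment (set union).
    store = set(store) | set(increment)
    # Step 2a: drop every existing 'latest' link.
    store = {t for t in store if t[1] != LATEST}
    # Step 2b: re-link each bathing water to observations of the newest year.
    years = {s: o for (s, p, o) in store if p == "bwq:sampleYear"}
    if years:
        newest = max(years.values())
        for (obs, p, bw) in list(store):
            if p == "bwq:bathingWater" and years.get(obs) == newest:
                store.add((bw, LATEST, obs))
    return store


inc_2010 = {("obs/2010", "bwq:sampleYear", 2010),
            ("obs/2010", "bwq:bathingWater", "bw/1")}
inc_2011 = {("obs/2011", "bwq:sampleYear", 2011),
            ("obs/2011", "bwq:bathingWater", "bw/1")}

after_2010 = publish(set(), inc_2010)
after_2011 = publish(after_2010, inc_2011)
replayed = publish(after_2011, inc_2011)  # same increment delivered twice
```

Replaying an increment leaves the store unchanged, so a replica that is unsure whether it received an update can simply apply it again.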








Application to case study

Update server

transforms based on scripts (earlier scripting utility)


distributed publication via SPARQL Update

extensible range of data sets:

annual assessments

in-season assessments

bathing water profile

features (e.g. pollution sources)

reference data




dive 4

Embed in business process

embedding is critical to ensure data is kept up to date

in turn needs usage

=> lower barrier to use

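[Figure: two cycles. Vicious cycle: data not used, so investment is hard to justify, so data goes stale. Virtuous cycle: invest, giving rich, up to date data, which drives internal use and external use]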

Lowering barrier to use

simple REST APIs

use Linked Data API specification

rich query without learning SPARQL

easy consumption as JSON, XML

gets developers used to data and data model
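The idea of "rich query without learning SPARQL" can be sketched as a function that turns URL query parameters into filters and paging over records. This is a toy model only: the URL path, the record layout and the function name are hypothetical, and the parameter names are loosely modelled on the Linked Data API convention of underscore-prefixed control parameters.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical records, as a JSON-friendly API might return them.
records = [
    {"sampleYear": 2009, "classification": "Higher"},
    {"sampleYear": 2010, "classification": "Minimum"},
    {"sampleYear": 2011, "classification": "Higher"},
]


def query(items, url):
    """Filter and page `items` according to the URL's query string."""
    params = dict(parse_qsl(urlsplit(url).query))
    # Underscore-prefixed parameters control paging, not filtering.
    page_size = int(params.pop("_pageSize", 10))
    page = int(params.pop("_page", 0))
    # Every remaining parameter is an equality filter on a property.
    hits = [r for r in items
            if all(str(r.get(k)) == v for k, v in params.items())]
    start = page * page_size
    return hits[start:start + page_size]


higher = query(records, "/assessments?classification=Higher")
second_page = query(
    records, "/assessments?classification=Higher&_pageSize=1&_page=1")
```

A developer can explore the data with nothing more than a browser and a URL, then move to SPARQL only if the simple filters run out.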

[Figure: transform service feeds the publication service, with the LD API layered on top of the published data]


Application to case study

embedded in process for weekly/daily updates

infrastructure to automate conversion and publishing

API plus extensive developer documentation

third party and in-house applications built over API


information products as applications over a data platform, usable externally as well as internally

The next stage

grow range of data publications and uses

range of reference data and sets brings new challenges

discover reference terms and models to reuse

discover datasets to use for application

discover models and links between sets

needs a coordination or registry service

story for another day ...

Conclusions

illustrated how public sector users of linked data are moving from static pilots to operational systems

keys are:

reduce modelling costs through patterns and reuse

design for continuous update

automation of publication using declarative mappings and SPARQL Update

lower barrier to use through API design and documentation

embed in organization’s process so the data is used and useful

Acknowledgements

Only possible thanks to many smart colleagues: Stuart Williams, Andy Seaborne, Ian Dickinson, Brian McBride, Chris Dollin plus Alex Coley and team from the Environment Agency
