Linked Data Quality Assessment – daQ and Luzzu
Jeremy Debattista
University of Bonn
Presentation at the Ontology Engineering Group (UPM)
… who am I?
• B.Sc (Hons) in Computer Science – University of Malta
– Thesis: Collaborative Editing and Expert Finding
• M.App Sc in Computer Science – DERI, National University of Ireland, Galway
– Thesis: Ontology-based rules for User-Controlled Support in Ubiquitous Environments
• PhD Candidate – University of Bonn
… my PhD – the big picture
• Work related to Data Quality (in LD)
– representing quality metadata (daQ)
– assessing data quality (Luzzu)
– identifying new metrics from standard vocabularies (like PROV-O)
… the need for Quality Metadata
• Convincing data consumers to use our published data
• Filtering datasets
• Poor Quality Perspective – Big Data Veracity
… the daQ vocabulary
• Metadata as Named Graphs
• Usage of abstract class concept
• Metric assessment as Observations
• Preserving Provenance information
… daQ Applications
• daQ validator – validates quality metric schemas extending daQ (will be online soon)
– e.g. checking that each dimension is in exactly one category…
• Luzzu – next slides
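The "exactly one category" check mentioned for the validator can be illustrated with a minimal sketch. It assumes the schema has already been reduced to plain (category, dimension) pairs; the function name and the example category and dimension names are hypothetical and are not taken from the daQ validator itself.

```python
from collections import defaultdict


def dimensions_not_in_exactly_one_category(dimensions, category_links):
    """Return {dimension: sorted list of categories} for each violation.

    dimensions:     iterable of dimension names declared in the schema
    category_links: iterable of (category, dimension) pairs
    Both inputs are an assumed, simplified view of a metric schema.
    """
    owners = defaultdict(set)
    for category, dimension in category_links:
        owners[dimension].add(category)
    # A dimension is valid only if exactly one category claims it.
    return {d: sorted(owners[d]) for d in dimensions if len(owners[d]) != 1}


# Illustrative schema: one valid dimension, one shared, one orphaned.
dimensions = ["Availability", "Conciseness", "Licensing"]
category_links = [
    ("Accessibility", "Availability"),
    ("Intrinsic", "Conciseness"),
    ("Representational", "Conciseness"),
]
violations = dimensions_not_in_exactly_one_category(dimensions, category_links)
```

Here `violations` lists "Conciseness" (claimed by two categories) and "Licensing" (claimed by none), while "Availability" passes.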
… Luzzu – QA Framework
• A comprehensive QA framework
– assesses LD quality using user-provided metrics (we have a number of LOD metrics already) in a scalable manner
– provides queryable metadata (daQ)
– provides quality reports which can be used for cleaning
• Java-based with Maven integration
• http://eis-bonn.github.io/Luzzu
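To make the idea of user-provided metrics concrete, here is a rough sketch of a metric that sees one triple at a time and keeps only small running state. Luzzu itself is Java-based; this Python version only illustrates the pattern, and the class names, method names, and the example metric are assumptions rather than Luzzu's API.

```python
from abc import ABC, abstractmethod


class QualityMetric(ABC):
    """Hypothetical metric interface: fed one triple at a time."""

    @abstractmethod
    def compute(self, triple):
        """Update internal state from a (subject, predicate, object) tuple."""

    @abstractmethod
    def metric_value(self):
        """Return the metric value for everything seen so far."""


class ExternalLinkRatio(QualityMetric):
    """Illustrative metric: share of triples whose object is an IRI
    outside the dataset's own namespace."""

    def __init__(self, base):
        self.base = base
        self.total = 0
        self.external = 0

    def compute(self, triple):
        _, _, obj = triple
        self.total += 1
        is_iri = obj.startswith(("http://", "https://"))
        if is_iri and not obj.startswith(self.base):
            self.external += 1

    def metric_value(self):
        return self.external / self.total if self.total else 0.0


def assess(triples, metrics):
    """Single pass over the data; every metric sees every triple."""
    for triple in triples:
        for metric in metrics:
            metric.compute(triple)
    return {type(m).__name__: m.metric_value() for m in metrics}


triples = [
    ("http://example.org/a", "http://example.org/p", "http://example.org/b"),
    ("http://example.org/a", "http://www.w3.org/2002/07/owl#sameAs",
     "http://other.example.com/a"),
    ("http://example.org/a", "http://example.org/name", "Alice"),
    ("http://example.org/b", "http://example.org/name", "Bob"),
]
report = assess(triples, [ExternalLinkRatio("http://example.org/")])
```

Because each metric only updates counters per triple, the dataset never has to be held in memory, which is one plausible reading of "in a scalable manner".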
… what’s missing in Luzzu
• Make Luzzu work better on Big Data Platforms
– We already have a Spark processor
– How can metrics be scaled on different cores? Something like map-reduce maybe?
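One way to picture the map-reduce question is a metric whose state is a set of counters: each partition is counted independently (map), and the partial counts are summed (reduce). The sketch below is only an illustration of that idea under stated assumptions; the function names and the example metric are hypothetical, and it runs sequentially rather than on Spark or multiple cores.

```python
from collections import Counter
from functools import reduce


def partition(items, n):
    """Split a list into n interleaved partitions."""
    return [items[i::n] for i in range(n)]


def count_partition(triples):
    """Map step: partial counts for one partition."""
    counts = Counter()
    for _, _, obj in triples:
        counts["total"] += 1
        if obj.startswith(("http://", "https://")):
            counts["iri_objects"] += 1
    return counts


def merge(a, b):
    """Reduce step: counters merge by addition, in any order."""
    return a + b


def metric_from_counts(counts):
    """Final value: share of triples whose object is an IRI."""
    return counts["iri_objects"] / counts["total"] if counts["total"] else 0.0


triples = [
    ("s1", "p", "http://example.org/o1"),
    ("s2", "p", "a literal"),
    ("s3", "p", "http://example.org/o3"),
    ("s4", "p", "another literal"),
    ("s5", "p", "http://example.org/o5"),
    ("s6", "p", "a third literal"),
]
partials = map(count_partition, partition(triples, 3))
value = metric_from_counts(reduce(merge, partials, Counter()))
```

The built-in `map` could be swapped for a worker pool's map without changing the metric, since the merge step is associative and commutative. Metrics whose state cannot be merged this way would need a different decomposition.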
… data quality lifecycle
… quality metrics
• Traditional naïve way
• Probabilistic Techniques (A paper was presented at ESWC this year)
… probabilistic technique hypothesis
Probabilistic approximation techniques would:
(H1) drastically reduce computation time
(H2) give close-to-accurate results
… probabilistic techniques used
• Reservoir Sampling
– Dereferenceability
– Links to External Data Providers
• Bloom Filters
– Extensional Conciseness
• Clustering Coefficient Estimation
– Clustering Coefficient of a Network
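Two of the techniques named above, reservoir sampling and Bloom filters, can be sketched in a few lines. This is a generic textbook illustration, not the implementation used in Luzzu or in the ESWC paper; the function names, the bit-array size, and the number of hash functions are assumptions chosen for readability.

```python
import hashlib
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown
    length using one pass and O(k) memory (Algorithm R)."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


class BloomFilter:
    """Approximate set: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def count_probable_duplicates(items, bloom):
    """Count items the filter reports as already seen. The count can
    overestimate (false positives) but never underestimate."""
    duplicates = 0
    for item in items:
        if item in bloom:
            duplicates += 1
        else:
            bloom.add(item)
    return duplicates


sample = reservoir_sample(range(1000), 10, random.Random(42))
uris = ["ex:a", "ex:b", "ex:a", "ex:c", "ex:b"]
probable_duplicates = count_probable_duplicates(uris, BloomFilter())
```

A sample like this lets an expensive per-resource check, such as dereferencing a URI, run on a fixed number of resources instead of all of them. The Bloom filter replaces an exact set of seen items with a fixed-size bit array, which is the kind of trade-off a duplicate-based metric can exploit.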
… some results
• Precision: approx. 75%; Time Saved: > 2 Orders of Magnitude
• Precision: 100%; Time Saved: > 2 Orders of Magnitude
… some results
• Precision: approx. 97%; Time Saved: > 3 Orders of Magnitude
… some results
• Precision: approx. 95%; Time Saved: > 1 Order of Magnitude
… what am I working on
• Large-scale (Data Web scale) evaluation – journal paper
– assessing the quality of LOD Cloud datasets
• daQ (Journal Paper)
… what do we do at Bonn
• Open Government Data – Publishing and Consumption
– Data Value Chains, Value Creation, Budgeting
• Portal for publication and consumption of open data
– Lowering of semantic data to shallower domain-specific formats (RDB, CSV, etc.)
• RDF Visualisations and Recommendations
• Dataset Change Detection
• Collaborative Authoring and Open Educational Content
• Low-threshold agile methodology for collaborative vocabulary development
• Mapping of AutomationML to RDF
… some tools
• http://purl.org/net/exconquer/
• http://purl.org/net/dsaas
• http://slidewiki.org
• http://eis.iai.uni-bonn.de/Projects/LinkDaViz.html