Semantically Enhanced Quality Assurance in the
JURION Business Use Case
Dimitris Kontokostas, Christian Mader, Christian Dirschl, Katja Eck, Michael Leuthold,
Jens Lehmann, Sebastian Hellmann
ESWC 2016
Overview
● Wolters Kluwer overview
● Use Case Tools
● Challenges
● Solutions
● Evaluation
● Future Work
Wolters Kluwer
Wolters Kluwer provides solutions to customers in over 170 countries and content in at least a dozen languages, focusing on the legal, tax, finance and health industries.
Wolters Kluwer Transformation
Wolters Kluwer Transformation
Quality
WKD in the LOD2 Project
WKD in the ALIGNED Project
RDF in the publishing industry
Use Case Tools
● TDDD: Test Driven (Data) Development
○ Methodology, definitions & tools
● SPARQL
● Reusable unit tests for
○ vocabularies
○ datasets
○ applications
● Test auto-generators
○ OWL
○ IBM Shapes
○ DSP (Dublin Core Description Set Profiles)
○ W3C Shapes (in progress)
● Open source (Apache license)
● Stable tool, used in many research & industrial settings
http://rdfunit.aksw.org
https://www.poolparty.biz
● Commercial product developed by Semantic Web Company
● Collaborative thesaurus development
○ From scratch / by extraction of terms from a document corpus
● Compliance with the 5-star Open Data principles (RDF & SKOS)
● Automatically retrieve potential additional concepts for inclusion into the thesauri by querying SPARQL endpoints (e.g. DBpedia)
● Identify and link to related resources from local / remote projects
● Simple ontology editing (rdf:type, rdfs:subClassOf, rdfs:domain/range, ...)
● Automated quality assurance mechanisms
○ Conformance to SKOS or a custom schema
○ Enforcement level of some quality metrics can be configured by the user so that it is, e.g., possible to get an alert if a circular hierarchical relation is detected
○ Check a taxonomy “as a whole” against a set of potential quality violations
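The circular-hierarchy alert listed above can be illustrated with a minimal sketch. It assumes the thesaurus is available as a plain mapping from each concept to its broader concepts (skos:broader). The concept names and the function `find_broader_cycle` are hypothetical and are not part of PoolParty.

```python
from typing import Dict, List, Optional

WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored


def find_broader_cycle(broader: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle in the broader-relation graph, or None if there is none."""
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(concept: str) -> Optional[List[str]]:
        state[concept] = GREY
        path.append(concept)
        for parent in broader.get(concept, []):
            if state.get(parent, WHITE) == GREY:
                # The parent is already on the current path: a cycle is closed.
                return path[path.index(parent):] + [parent]
            if state.get(parent, WHITE) == WHITE:
                cycle = visit(parent)
                if cycle:
                    return cycle
        path.pop()
        state[concept] = BLACK
        return None

    for concept in list(broader):
        if state.get(concept, WHITE) == WHITE:
            cycle = visit(concept)
            if cycle:
                return cycle
    return None


# Hypothetical thesaurus fragment with a deliberately circular hierarchy.
thesaurus = {
    "ex:LabourLaw": ["ex:Law"],
    "ex:Law": ["ex:LegalDomain"],
    "ex:LegalDomain": ["ex:LabourLaw"],
}
cycle = find_broader_cycle(thesaurus)
```

The recursive depth-first search is chosen for brevity; a very deep hierarchy would call for an iterative variant.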
Challenges
Metadata RDF Conversion Verification
Existing Infrastructure
● Platform Content Interface (PCI) ontology
○ proprietary schema that describes legal documents and metadata in OWL
● PCI revisions => verify data conforms to PCI
● Proprietary SOAP-based validation service
○ Package-based validation => error detection is hard
○ Asynchronous & complex web service => hard to use
○ Network dependency => potentially unstable
Metadata RDF Conversion Verification
Continuous, high-quality triplification of semi-structured data is a common problem in the information industry. Schema changes and enhancements are routine tasks, but ensuring data quality is still very often a purely manual effort. Any automation will therefore support many real-life use cases in different domains.
Goal: Test cases should be created automatically from the schema and run on a regular basis against the data that needs to be transformed. The errors detected lead to refinements of the XSLT scripts and sometimes also to schema changes, which in turn produce new automatically created test cases.
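The idea of deriving test cases from the schema can be sketched as follows. This is an illustration under stated assumptions, not RDFUnit's actual generator: the schema is reduced to a list of triples, only rdfs:domain axioms are handled, subclass inference is ignored, prefix declarations are omitted from the query, and the names `pci:hasCourt` and `pci:CourtDecision` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def domain_test_query(prop: str, cls: str) -> str:
    """SPARQL query selecting resources that use `prop` but are not typed as `cls`."""
    return (
        "SELECT DISTINCT ?s WHERE { "
        f"?s {prop} ?o . "
        f"FILTER NOT EXISTS {{ ?s rdf:type {cls} }} }}"
    )


def generate_tests(schema: List[Triple]) -> Dict[str, str]:
    """Create one test query per rdfs:domain axiom found in the schema."""
    return {
        f"domain-{prop}": domain_test_query(prop, cls)
        for prop, pred, cls in schema
        if pred == "rdfs:domain"
    }


def domain_violations(data: List[Triple], prop: str, cls: str) -> Set[str]:
    """Pure-Python equivalent of the generated query, for illustration only."""
    typed = {s for s, p, o in data if p == "rdf:type" and o == cls}
    return {s for s, p, o in data if p == prop and s not in typed}


# Hypothetical schema axiom and converted data.
schema = [("pci:hasCourt", "rdfs:domain", "pci:CourtDecision")]
data = [
    ("ex:doc1", "rdf:type", "pci:CourtDecision"),
    ("ex:doc1", "pci:hasCourt", "ex:court1"),
    ("ex:doc2", "pci:hasCourt", "ex:court2"),  # untyped subject: a violation
]
tests = generate_tests(schema)
violations = domain_violations(data, "pci:hasCourt", "pci:CourtDecision")
```

Because the tests are derived from the schema, a schema revision regenerates them without manual work, which is the feedback loop the goal describes.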
RDFUnit / JUnit Integration
Quality Control in Thesaurus Management
● WKD develops multiple controlled vocabularies for annotating documents (e.g., court decisions, labour law, ...) using PoolParty
● Vocabularies are interconnected with each other
● Consistency and quality must be ensured over all vocabularies
● Various quality issues, e.g.,
○ Duplicates
○ Links to deprecated (deleted) concepts
○ Unresolvable links
● Up to now curated manually in the deployed system; errors regularly reach production versions
Quality Control in Thesaurus Management
The creation and maintenance of knowledge models is gaining importance in the Web of Data. These tasks are increasingly being executed by SMEs whose expertise lies in the domain rather than in knowledge modelling or IT as such. Better automatic support for these processes will therefore directly help achieve quality and efficiency gains.
● Automated quality checks over multiple vocabularies
● Improved notifications: email on changes performed by users
● Additional statistics on, e.g., vocabulary dependencies, changes, etc.
Vocabulary link validation (PoolParty)
● Uses project metadata to identify linked vocabularies
● Link is invalid if target concept is either deprecated or deleted
● Creates a report for human curators
● Vocabulary repair still manual process
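The validation rule above can be sketched in a few lines. The sketch assumes that the status of every existing target concept is available in a dictionary and that a concept missing from it has been deleted; the `Link` type, the status strings, and the example URIs are hypothetical and do not reflect PoolParty's internal data model.

```python
from typing import Dict, List, NamedTuple


class Link(NamedTuple):
    source: str  # concept in the vocabulary being checked
    target: str  # concept in a linked vocabulary


def invalid_links(links: List[Link], target_status: Dict[str, str]) -> List[Dict[str, str]]:
    """Report links whose target concept is deprecated or no longer exists."""
    report = []
    for link in links:
        # Assumption: a target without a status entry has been deleted.
        status = target_status.get(link.target, "deleted")
        if status in ("deprecated", "deleted"):
            report.append({"source": link.source, "target": link.target, "reason": status})
    return report


links = [
    Link("voc1:c1", "voc2:c10"),
    Link("voc1:c2", "voc2:c11"),
    Link("voc1:c3", "voc2:c12"),
]
target_status = {"voc2:c10": "active", "voc2:c11": "deprecated"}
report = invalid_links(links, target_status)
```

The function only produces the report for human curators; repairing the vocabulary stays a manual step, as stated above.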
Quality Control in Thesaurus Management
Results & Evaluation
The analysis is based on measured metrics and the qualitative feedback of experts and users.
Participants of the evaluation study were selected from WKD staff in the fields of software development and data development. There were seven participants in total: four involved in the expert evaluation and three content experts involved in the usability/interview evaluation.
● Productivity
● Quality
● Agility
Productivity (RDFUnit)
● Total time for quality checks and error detection
● The time needed for manual interaction
What we measured:
● 1 ms to 50 ms per single test (depending on the document / ontology size)
○ as close to real-time as possible, currently a couple of minutes
● Quality checks can be triggered by manual execution, but they are always verified automatically by the CI build system
● A total of 44,000 tests with a total duration of 11 minutes
○ may scale up easily when parallelized or clustered
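As a quick consistency check on the figures above, the average time per test can be computed from the totals. The parallel estimate is an idealized assumption (even split across workers, no overhead) and not a measured result.

```python
total_tests = 44_000
total_seconds = 11 * 60

# Average duration of one test in milliseconds.
avg_ms = total_seconds * 1000 / total_tests


def ideal_runtime_minutes(workers: int) -> float:
    """Runtime if tests were split evenly across workers with no overhead (assumption)."""
    return total_seconds / workers / 60


four_worker_estimate = ideal_runtime_minutes(4)
```

The average of 15 ms per test lies within the measured range of 1 ms to 50 ms per single test.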
Quality (RDFUnit)
What kind of errors can be detected and is categorization possible?
● Experts concluded that it is helpful to spot errors introduced by changes, since issues spotted in this way can be assumed to point to actual errors whose causes can be identified and addressed.
● Successful tests are less significant: we are not yet able to evaluate whether and how the measurements taken correspond to target measures, and passing tests do not point to concrete errors.
○ Coverage & other metrics needed
Agility (RDFUnit)
… time to include new requirements
● Including new constraints or adapting existing constraints works by adding new reference documents to the input dataset to make the test environment as representative as possible.
● The process of generating tests and testing is fully automated and adapts very easily to changed parameters.
● Adding more documents to the input dataset increases the total runtime
Productivity (PoolParty)
● The number of checked links
● The number of violations
● The total time
What we measured:
The presentation of the results was well understood. In general, the tool was received well by the experts, which was reflected by their feedback in the interviews.
Quality (PoolParty)
● No links were falsely reported as broken
● Prototype still lacks some usability
Agility (PoolParty)
… integration, configuration time and extension
● Very useful for getting an overview
● In some cases it is desirable to limit the link lookups and adapt the way links to external datasets are detected
○ Use custom base URI or regular expression-based techniques
● Re-configuration is possible but recompiling the application might be needed
○ Plans to delegate this process to UnifiedViews
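The base-URI and regular-expression techniques mentioned above could look roughly like this. The function `external_links`, its parameters, and the example URIs are hypothetical; the sketch only shows how the two filters might be combined, not how the prototype is configured.

```python
import re
from typing import Iterable, List


def external_links(targets: Iterable[str], local_base: str, include: str = r".*") -> List[str]:
    """Return link targets outside `local_base` whose URI matches the `include` regex."""
    pattern = re.compile(include)
    return [
        uri for uri in targets
        if not uri.startswith(local_base) and pattern.match(uri)
    ]


targets = [
    "http://vocab.example.org/labour/c1",
    "http://dbpedia.org/resource/Labour_law",
    "http://other.example.com/x",
]
# Limit lookups to DBpedia only.
dbpedia_only = external_links(targets, "http://vocab.example.org/", r"http://dbpedia\.org/")
# Default: every target outside the local base URI counts as external.
all_external = external_links(targets, "http://vocab.example.org/")
```

Passing both values as parameters rather than constants is one way to avoid the recompilation noted above when the configuration changes.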
Future Work
● Error analysis (statistics, time to fix an issue, regressions)
● Test coverage and better metrics
● Improve the UI of the Link Validation tool
● Provide more advanced settings
● Inter-repository Link Validation
Thank You!
Questions?
(You might want to) take a look at…
RDF and XML Interoperability W3C Community Group
https://www.w3.org/community/rax/