NLP Data Cleansing Based on Linguistic Ontology Constraints

DESCRIPTION

Slides for the following paper: NLP Data Cleansing Based on Linguistic Ontology Constraints. Abstract: Linked Data comprises an unprecedented volume of structured data on the Web and is adopted by an increasing number of domains. However, the varying quality of published data forms a barrier for further adoption, especially for Linked Data consumers. In this paper, we extend a previously developed methodology of Linked Data quality assessment, which is inspired by test-driven software development. Specifically, we enrich it with ontological support and different levels of result reporting and describe how the method is applied in the Natural Language Processing (NLP) area. Compared to other domains, such as biology, NLP is a late Linked Data adopter. However, it has seen a steep rise of activity in the creation of data and ontologies. NLP data quality assessment has become an important need for NLP datasets. In our study, we analysed 11 datasets using the lemon and NIF vocabularies in 277 test cases and point out common quality issues.

Page 1: NLP Data Cleansing Based on Linguistic Ontology Constraints

NLP Data Cleansing Based on Linguistic Ontology Constraints

Dimitris Kontokostas (1,3), Martin Brümmer (1), Sebastian Hellmann (1,3), Jens Lehmann (1), Lazaros Ioannidis (2)

(1) AKSW, University of Leipzig

(2) Aristotle University of Thessaloniki

(3) DBpedia Association

2014-05-27

Kontokostas et al. (ESWC2014) NLP Data Cleansing 2014-05-27 1 / 33

Page 2: NLP Data Cleansing Based on Linguistic Ontology Constraints

LOD Cloud (2011)

Page 4: NLP Data Cleansing Based on Linguistic Ontology Constraints

Linguistic Communities

Page 5: NLP Data Cleansing Based on Linguistic Ontology Constraints

Linguistic workshops & conferences

Page 7: NLP Data Cleansing Based on Linguistic Ontology Constraints

Linguistic LOD Cloud (LLOD Cloud)

Page 8: NLP Data Cleansing Based on Linguistic Ontology Constraints

Problem definition

Linguistic (related) Data

Purpose-Driven definition

Increasing Data, ontologies & vocabularies

Newcomers → hard to understand the ontologies / follow updates

Validation is essential

Many different pipelines (parsing, annotation, disambiguation, etc.)

Errors are propagated

Partially provided by maintainers (incomplete)

Focus on Lemon & NIF (proof of concept)

Page 9: NLP Data Cleansing Based on Linguistic Ontology Constraints

Lemon - Lexicon Model for Ontologies

Models lexicon and machine-readable dictionaries

RDF-native form

Linguistically sound structure (LMF)

Separation of the lexicon and ontology layers

Linking to data categories → arbitrarily complex linguistic description

Principle of least power: the less expressive the language, the more reusable the data.

http://lemon-model.net/

Page 10: NLP Data Cleansing Based on Linguistic Ontology Constraints

Lemon - Example

:lexicon a lemon:Lexicon ;
    lemon:entry :Pizza , :Tortilla .

:Pizza a lemon:LexicalEntry ;
    lemon:sense [ lemon:reference <http://dbpedia.org/resource/Pizza> ] .

:Tortilla a lemon:LexicalEntry ;
    lemon:sense [ lemon:reference <http://dbpedia.org/resource/Tortilla> ] .

Page 11: NLP Data Cleansing Based on Linguistic Ontology Constraints

Lemon - Example (Correct)

:lexicon a lemon:Lexicon ;
    lemon:language "en" ;
    lemon:entry :Pizza , :Tortilla .

:Pizza a lemon:LexicalEntry ;
    lemon:canonicalForm [ lemon:writtenRep "Pizza"@en ] ;
    lemon:sense [ lemon:reference <http://dbpedia.org/resource/Pizza> ] .

:Tortilla a lemon:LexicalEntry ;
    lemon:canonicalForm [ lemon:writtenRep "Tortilla"@en ] ;
    lemon:sense [ lemon:reference <http://dbpedia.org/resource/Tortilla> ] .

Page 12: NLP Data Cleansing Based on Linguistic Ontology Constraints

NIF - NLP Interchange Format

RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations

In a nutshell:

Logical formalisation of strings and annotations

Builds on existing standards, e.g. RDF, LAF/GrAF, RFC 5147

Reuse of RDF tool stack

Decreases development cost for integration

Integrated in:

DBpedia Spotlight, Stanford Core NLP, OpenNLP, RDFace, Validator, CoNLL converter, ...

Page 13: NLP Data Cleansing Based on Linguistic Ontology Constraints

NIF - Overview

Page 14: NLP Data Cleansing Based on Linguistic Ontology Constraints

NIF - Example

<http://abc.com/doc#char=0,17>
    a nif:Context ;
    a nif:RFC147String ;
    nif:beginIndex "0" ;
    nif:endIndex "17" ;
    nif:isString "My dog likes pizza" .

<http://abc.com/doc#char=2,7>
    a nif:RFC5147String ;
    nif:anchorOf "dog" ;
    nif:referenceContext <http://abc.com/doc#char=0,17> ;
    itsrdf:taClassRef dbo:Animal .

Page 15: NLP Data Cleansing Based on Linguistic Ontology Constraints

NIF - Example (Correct)

<http://abc.com/doc#char=0,18>
    a nif:Context ;
    a nif:RFC5147String ;
    nif:beginIndex "0"^^xsd:nonNegativeInteger ;
    nif:endIndex "18"^^xsd:nonNegativeInteger ;
    nif:isString "My dog likes pizza"^^xsd:string .

<http://abc.com/doc#char=2,7>
    a nif:RFC5147String ;
    nif:beginIndex "2"^^xsd:nonNegativeInteger ;
    nif:endIndex "7"^^xsd:nonNegativeInteger ;
    nif:anchorOf "dog"^^xsd:string ;
    nif:referenceContext <http://abc.com/doc#char=0,18> ;
    itsrdf:taClassRef dbo:Animal .

Page 16: NLP Data Cleansing Based on Linguistic Ontology Constraints

Maintainer validation

Lemon

Python script

24 tests for structural criteria

too slow on big datasets

not good reporting

NIF

SPARQL queries

11 tests for common errors

not complete

Page 17: NLP Data Cleansing Based on Linguistic Ontology Constraints

Built on previous work

Test-driven evaluation of linked data quality. Dimitris Kontokostas, Patrick Westphal, Sören Auer, Sebastian Hellmann, Jens Lehmann, Roland Cornelissen, and Amrapali J. Zaveri in WWW 2014.

Horizontal, multi-domain data quality assessment

Massive detection of errors for five large-scale LOD data sets

291 vocabularies, independent of their domain or purpose

New contributions:

Relation to OWL reasoners

Test Driven Data Engineering Ontology

Domain-specific validation

Quickly improving existing validation options provided by maintainers

Page 18: NLP Data Cleansing Based on Linguistic Ontology Constraints

Test-Driven Data Development Methodology

Test case: a data constraint that involves one or more triples

Test suite: a set of test cases for testing a dataset

Status: Success, Fail, Timeout (complexity) or Error (e.g. network)

Fail: Error, warning or notice

RDF: basis for both data and schema

Unified model facilitates automatic test case generation

SPARQL serves as the test case definition language

Page 19: NLP Data Cleansing Based on Linguistic Ontology Constraints

Example test case

A nif:RFC5147String should never have a nif:beginIndex greater than nif:endIndex

Test cases are written in SPARQL

SELECT ?s WHERE {
    ?s nif:beginIndex ?v1 .
    ?s nif:endIndex ?v2 .
    FILTER ( ?v1 > ?v2 )
}

We query for errors

Success: Query returns empty result set

Fail: Query returns results

Every result we get is a violation instance

Timeout / Error: needs further investigation on SPARQL engine capabilities, query syntax or query complexity
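
The success/fail rule above can be illustrated without a SPARQL engine. The following is a minimal Python sketch, not RDFUnit code: it applies the same begin/end constraint to an in-memory dictionary. The function names and the sample data are hypothetical.

```python
# Illustrative only: mimics the SPARQL test case on plain Python data.
# Each subject maps to its (nif:beginIndex, nif:endIndex) pair; values are made up.
strings = {
    "http://abc.com/doc#char=0,18": (0, 18),
    "http://abc.com/doc#char=9,4": (9, 4),  # begin > end: a violation
}


def find_violations(data):
    """Return the subjects whose begin index is greater than the end index."""
    return sorted(s for s, (begin, end) in data.items() if begin > end)


def status(violations):
    """An empty result set means Success; any returned row means Fail."""
    return "Success" if not violations else "Fail"


violations = find_violations(strings)
print(status(violations), violations)
```

Every element of `violations` corresponds to one violation instance in the sense used on this slide.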

Page 20: NLP Data Cleansing Based on Linguistic Ontology Constraints

Patterns & Bindings

Data Quality Test Patterns (DQTP)

abstract patterns, which can be further refined into concrete data quality test cases using test pattern bindings

Existing library of 20 patterns

SELECT ?s WHERE {
    ?s %%P1%% ?v1 .
    ?s %%P2%% ?v2 .
    FILTER ( ?v1 %%OP%% ?v2 )
}

Bindings

mapping of variables to valid pattern replacement

P1 => nif:beginIndex
P2 => nif:endIndex
OP => >

Resulting test case:

SELECT ?s WHERE {
    ?s nif:beginIndex ?v1 .
    ?s nif:endIndex ?v2 .
    FILTER ( ?v1 > ?v2 )
}
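
How a binding turns a pattern into a concrete test case can be sketched as plain placeholder substitution. The Python below is an illustration under that assumption; the `instantiate` helper is hypothetical and is not taken from RDFUnit.

```python
import re

# The pattern from the slide, with %%NAME%% placeholders.
PATTERN = (
    "SELECT ?s WHERE {\n"
    "    ?s %%P1%% ?v1 .\n"
    "    ?s %%P2%% ?v2 .\n"
    "    FILTER ( ?v1 %%OP%% ?v2 )\n"
    "}"
)


def instantiate(pattern, bindings):
    """Replace every %%NAME%% placeholder with its bound value."""
    def substitute(match):
        name = match.group(1)
        if name not in bindings:
            raise KeyError(f"no binding for placeholder {name}")
        return bindings[name]

    return re.sub(r"%%(\w+)%%", substitute, pattern)


query = instantiate(
    PATTERN, {"P1": "nif:beginIndex", "P2": "nif:endIndex", "OP": ">"}
)
print(query)
```

Raising on an unbound placeholder is a design choice of this sketch: a half-instantiated query would otherwise fail later, at the SPARQL endpoint, with a less helpful error.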

Page 21: NLP Data Cleansing Based on Linguistic Ontology Constraints

Test Auto Generators (TAGs)

RDF(s) & OWL (partial) support

Query schema for supported axioms

SELECT DISTINCT ?T1 ?T2 WHERE {
    ?T1 owl:disjointWith ?T2 .
}

For every result a binding to a pattern is generated & a test case instantiated

Supported axioms at the moment:

RDFS: domain & range

OWL: minCardinality, maxCardinality, cardinality, functionalProperty, InverseFunctionalProperty, disjointClass, propertyDisjointWith, AsymmetricProperty and deprecated
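
The generation step can be sketched as follows: each row returned by the schema query becomes one binding and therefore one test case. This Python sketch rests on assumptions: the disjointness pattern shown (a resource typed with both classes is a violation) and the `ex:` class names are hypothetical stand-ins, not the pattern library's actual content.

```python
# Hypothetical pattern for owl:disjointWith: a resource typed with both
# classes is reported as a violation.
DISJOINT_PATTERN = (
    "SELECT DISTINCT ?s WHERE {{\n"
    "    ?s a {t1} .\n"
    "    ?s a {t2} .\n"
    "}}"
)


def generate_disjoint_tests(schema_results):
    """Turn each (T1, T2) row of the schema query into one concrete test query."""
    return [DISJOINT_PATTERN.format(t1=t1, t2=t2) for t1, t2 in schema_results]


# Made-up rows standing in for the result of the schema query.
rows = [("ex:Word", "ex:Phrase"), ("ex:Noun", "ex:Verb")]
tests = generate_disjoint_tests(rows)
print(len(tests))
print(tests[0])
```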

Page 22: NLP Data Cleansing Based on Linguistic Ontology Constraints

Test Case Elicitation Workflow

Page 23: NLP Data Cleansing Based on Linguistic Ontology Constraints

TD(D)D vs Reasoners

SPARQL test cases detect a subset of validation errors detectable by an OWL reasoner. Limited by:

SPARQL endpoint reasoning support

limitations of the OWL-to-SPARQL translation

SPARQL test cases detect validation errors not expressible in OWL

OWL reasoning is often not feasible on large datasets.

Datasets are already deployed and accessible via SPARQL endpoints

Pattern library: a more user-friendly approach for building validation rules compared to modelling OWL axioms.

requires familiarity

non-common validations require manual SPARQL test cases

Page 24: NLP Data Cleansing Based on Linguistic Ontology Constraints

Data Engineering Ontology

Input / Output entirely in RDF

Model the methodology in OWL

test suites, test cases, patterns, auto generators

Strict to serve as a validation layer

Four different levels of error reporting

simple test case report (success, fail) / enriched with counts

violation instance reporting / enriched with annotations

Reuse dcterms, prov, spin, rlog

Page 25: NLP Data Cleansing Based on Linguistic Ontology Constraints

Data Engineering Ontology - Definition & Generation

Page 26: NLP Data Cleansing Based on Linguistic Ontology Constraints

Data Engineering Ontology - Result Representation

Page 27: NLP Data Cleansing Based on Linguistic Ontology Constraints

Lemon & NIF Test case elicitation

RDFUnit Suite implements our methodology

Run on Lemon & NIF ontologies

TAGs could not yet handle some complex owl:Restrictions

owl:unionOf, owl:allValuesFrom, owl:someValuesFrom, owl:hasSelf and some rdfs:subPropertyOf cases

Manual test cases for constraints not captured in OWL.

Total Domain Range Datatype Card. Disj. Func. I. Func. Manual

Lemon 182 40 34 1 29 64 3 1 10

NIF 96 42 24 4 6 10 10

Page 28: NLP Data Cleansing Based on Linguistic Ontology Constraints

Example of manual Lemon test case

lemon:narrower denotes that one sense of a word is narrower than the other and must never be symmetric or contain cycles.

SELECT DISTINCT ?s WHERE {
    ?s lemon:narrower+ ?narrower .
    ?narrower lemon:narrower+ ?s .
}

lemon:language must not have a language tag (RDF 1.1 to the rescue)

SELECT DISTINCT ?s WHERE {
    ?s lemon:language ?v1 .
    FILTER ( lang(?v1) != "" )
}
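
The cycle constraint on lemon:narrower amounts to asking which subjects can reach themselves over one or more narrower links, which is what the `+` property path expresses. The Python sketch below shows that idea on a plain edge list; the sense identifiers and the `nodes_on_cycles` helper are hypothetical.

```python
def nodes_on_cycles(edges):
    """Return the subjects that can reach themselves via one or more edges."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, set()).add(target)

    def reachable(start):
        # Depth-first traversal collecting every node reachable from start.
        seen, stack = set(), list(graph.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, ()))
        return seen

    return sorted(node for node in graph if node in reachable(node))


# Made-up narrower links: a <-> b is symmetric (a cycle), c -> d is fine.
narrower = [
    (":sense_a", ":sense_b"),
    (":sense_b", ":sense_a"),
    (":sense_c", ":sense_d"),
]
print(nodes_on_cycles(narrower))
```

A symmetric pair is simply a cycle of length two, so one check covers both conditions named on the slide.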

Page 29: NLP Data Cleansing Based on Linguistic Ontology Constraints

Example of manual NIF test case

Ensure that nif:beginIndex and nif:endIndex are correct

SELECT DISTINCT ?s WHERE {
    ?s nif:anchorOf ?anchorOf ;
       nif:beginIndex ?beginIndex ;
       nif:endIndex ?endIndex ;
       nif:referenceContext [ nif:isString ?referenceString ] .
    # SPARQL SUBSTR is 1-based, NIF offsets are 0-based, hence the + 1
    BIND ( SUBSTR( ?referenceString,
                   ?beginIndex + 1,
                   (?endIndex - ?beginIndex) ) AS ?test ) .
    FILTER ( str(?test) != str(?anchorOf) ) .
}
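
The same offset check can be tried out in Python. This is a sketch under one stated assumption: offsets are zero-based and the end index is exclusive, which matches Python slicing. The `check_anchor` helper and the sample offsets are hypothetical.

```python
def check_anchor(reference_string, begin_index, end_index, anchor_of):
    """Return True when the offsets select exactly the anchored text.

    Assumes zero-based offsets with an exclusive end index (Python slicing).
    """
    return reference_string[begin_index:end_index] == anchor_of


text = "My dog likes pizza"
print(check_anchor(text, 13, 18, "pizza"))  # offsets cover "pizza"
print(check_anchor(text, 12, 18, "pizza"))  # offsets also cover the leading space
```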

Page 30: NLP Data Cleansing Based on Linguistic Ontology Constraints

Evaluation Datasets

Name | Description | Ontology | Type

lemon datasets

LemonUby Wiktionary EN | Conversion of the English Wiktionary into UBY-LMF model | lemon, UBY-LMF | Dictionary
LemonUby Wiktionary DE | Conversion of the German Wiktionary into UBY-LMF model | lemon, UBY-LMF | Dictionary
LemonUby Wordnet | Conversion of the Princeton WordNet 3.0 into UBY-LMF model | lemon, UBY-LMF | WordNet
DBpedia Wiktionary | Conversion of the English Wiktionary into lemon | lemon | Dictionary
QHL | Multilingual translation graph from more than 50 lexicons | lemon | Dictionary

NIF datasets

Wikilinks | Sample of 60976 randomly selected phrases linked to Wikipedia articles | NIF | NER
DBpedia Spotlight dataset | 58 manually NE annotated natural language sentences | NIF | NER
KORE 50 evaluation dataset | 50 NE annotated natural language sentences from the AIDA corpus | NIF | NER
News-100 | 100 manually annotated German news articles | NIF | NER
RSS-500 | 500 manually annotated sentences from 1,457 RSS feeds | NIF | NER
Reuters-128 | 128 news articles manually curated | NIF | NER

Page 31: NLP Data Cleansing Based on Linguistic Ontology Constraints

Evaluation results

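Dataset | Size | SC | FL | TO | ER | Auto Errors | Man Errors | MWarn | MInfo
WiktDBp | 60M | 177 | 5 | - | - | 3.746.103 | 7.521.791 | - | 3.582.837
WktEN | 8M | 168 | 14 | - | - | 752.018 | 394.766 | - | 633.270
WktDE | 2M | 170 | 12 | - | - | 273.109 | 66.268 | - | 155.598
Wordnet | 4M | 166 | 16 | - | - | 257.228 | 36 | - | 257.204
QHL | 3M | 170 | 11 | - | 1 | 433.118 | 538.933 | - | 538.016
Wikilinks | 0.6M | 91 | 4 | - | 1 | 141.528 | 21.246 | - | -
News-100 | 13K | 91 | 2 | - | 3 | 3.510 | - | - | -
RSS-500 | 10K | 91 | 2 | - | 3 | 3.000 | - | - | -
Reuters-128 | 7K | 91 | 2 | - | 3 | 2.016 | - | - | -
Spotlight | 3K | 92 | 3 | - | 1 | 662 | 68 | - | -
KORE50 | 2K | 89 | 6 | - | 1 | 301 | 55 | - | -

SC, FL, TO, ER: number of test cases with status Success, Fail, Timeout, Error. Auto Errors: violations from automatically generated test cases. Man Errors, MWarn, MInfo: errors, warnings and infos from manual test cases.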
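
As a quick consistency check on the table, the four status counts of each dataset should add up to the size of the test suite run against it: 182 test cases for Lemon and 96 for NIF, as given in the test case elicitation table. The Python sketch below copies the counts from the table and reads "-" as 0; the variable names are hypothetical.

```python
# (SC, FL, TO, ER) per dataset, copied from the results table; "-" is read as 0.
RESULTS = {
    "WiktDBp": (177, 5, 0, 0),
    "WktEN": (168, 14, 0, 0),
    "WktDE": (170, 12, 0, 0),
    "Wordnet": (166, 16, 0, 0),
    "QHL": (170, 11, 0, 1),
    "Wikilinks": (91, 4, 0, 1),
    "News-100": (91, 2, 0, 3),
    "RSS-500": (91, 2, 0, 3),
    "Reuters-128": (91, 2, 0, 3),
    "Spotlight": (92, 3, 0, 1),
    "KORE50": (89, 6, 0, 1),
}
LEMON_DATASETS = {"WiktDBp", "WktEN", "WktDE", "Wordnet", "QHL"}
SUITE_SIZE = {"lemon": 182, "NIF": 96}  # totals from the elicitation table


def expected_size(name):
    """Return the suite size that applies to the given dataset."""
    return SUITE_SIZE["lemon" if name in LEMON_DATASETS else "NIF"]


# Datasets whose status counts do not add up to the suite size.
mismatches = [
    name for name, counts in RESULTS.items() if sum(counts) != expected_size(name)
]
print(mismatches)
```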

Page 32: NLP Data Cleansing Based on Linguistic Ontology Constraints

Conclusion

Extended a previously introduced methodology for test-driven quality assessment

Data engineering ontology

Devised 277 test cases for NLP datasets using the Lemon and NIF vocabularies

Revealed a substantial number of errors for Lemon & NIF datasets

Future directions

extend the test cases to more NLP ontologies (MARL, NERD, ITSRDF)

automatic dependencies between test cases

wrap RDFUnit for NLP services (integrated in NIF)

Page 33: NLP Data Cleansing Based on Linguistic Ontology Constraints

Thank you!

Dimitris Kontokostas

With kind support of

John McCrae (Lemon model)

http://rdfunit.aksw.org

http://github.com/AKSW/RDFUnit

#eswc2014kontokostas
