LLOD and corpora
Ch. Chiarcos, ACoLi CO, 2016, July 22
If you have a Un*x environment (Linux, BSD, Mac, Cygwin):
- make sure you have Java installed and that wifi works
- lab members: svn checkout /intern/incubator/conll-rdf
- others: download and unzip http://acoli.informatik.uni-frankfurt.de/tmp/conll-rdf-prerelease.zip
Others:
- find a neighbor who does
TODO
1. Linguistic Linked Open Data
and interoperability challenges wrt. corpora
2. State of the Art: NIF & POWLA
... and why neither is used in NLP nor corpus linguistics
3. Yet another format: CoNLL-RDF
Breaking the usage barrier ?
4. Working with CoNLL-RDF
LLOD and corpora
Linked Open Data
RDF, RDF vocabularies, Linking
Resource Description Framework (RDF)
• W3C standard (1999)
– generic data model: directed labeled graph
• nodes, edges, labels
– originally developed to provide metadata about resources
• e.g., journals in a bookstore and eBooks in an online shop
– resources are unambiguously identified in the web of data by Uniform Resource Identifiers (URIs)
4
Resource Description Framework (RDF)
• a (labeled directed multi-) graph
– nodes („RDF resources“)
• anything we want to provide information about
– edges („RDF properties“)
• assigns a source node („subject“) a target node („object“) or a value („literal“)
– nodes and edges are unambiguously identified
• Uniform Resource Identifiers (URIs), e.g., URLs
5
RDF
thesaurus:bronstijd rdf:type thesaurus:period.
[Graph: thesaurus:bronstijd --rdf:type--> thesaurus:period]
(the concept) „bronstijd“ is an (instance of the concept) „period“
6
RDF
thesaurus:bronstijd rdf:type thesaurus:period.
[Graph: thesaurus:bronstijd --rdf:type--> thesaurus:period]
rdf:type is an abbreviation for the URI <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
this URI could be opened in a browser: resolvable URIs may provide further information
7
RDF
thesaurus:bronstijd rdf:type thesaurus:period.
thesaurus:bronstijd skos:prefLabel "Bronzezeit"@de.
[Graph: thesaurus:bronstijd --rdf:type--> thesaurus:period; thesaurus:bronstijd --skos:prefLabel--> „Bronzezeit“@de]
in German (de), the preferred label for (the concept) „bronstijd“ is „Bronzezeit“
8
RDF
thesaurus:bronstijd rdf:type thesaurus:period.
thesaurus:bronstijd skos:prefLabel "Bronzezeit"@de.
thesaurus:bronstijd skos:prefLabel "bronze age"@en.
[Graph: thesaurus:bronstijd --rdf:type--> thesaurus:period; thesaurus:bronstijd --skos:prefLabel--> „Bronzezeit“@de, „bronze age“@en]
in English (en), the preferred label for (the concept) „bronstijd“ is „bronze age“
9
RDF
thesaurus:bronstijd rdf:type thesaurus:period.
thesaurus:bronstijd skos:prefLabel "Bronzezeit"@de.
thesaurus:bronstijd skos:prefLabel "bronze age"@en.
thesaurus:bronstijd skos:prefLabel "bronstijd"@nl .
[Graph: thesaurus:bronstijd --rdf:type--> thesaurus:period; thesaurus:bronstijd --skos:prefLabel--> „Bronzezeit“@de, „bronze age“@en, „bronstijd“@nl]
in Dutch (nl), the preferred label for (the concept) „bronstijd“ is „bronstijd“
10
RDF
thesaurus:bronstijd rdf:type thesaurus:period.
thesaurus:bronstijd skos:prefLabel "Bronzezeit"@de.
thesaurus:bronstijd skos:prefLabel "bronze age"@en.
thesaurus:bronstijd skos:prefLabel "bronstijd"@nl .
graphical notation
[Graph: thesaurus:bronstijd --rdf:type--> thesaurus:period; thesaurus:bronstijd --skos:prefLabel--> „Bronzezeit“@de, „bronze age“@en, „bronstijd“@nl]
triple notation (Turtle): the four statements listed above
11
RDF Querying
thesaurus:bronstijd rdf:type thesaurus:period.
thesaurus:bronstijd skos:prefLabel "Bronzezeit"@de.
thesaurus:bronstijd skos:prefLabel "bronze age"@en.
thesaurus:bronstijd skos:prefLabel "bronstijd"@nl .
SPARQL
triple notation (Turtle)
12
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?de ?en
WHERE {
?a skos:prefLabel ?de.
?a skos:prefLabel ?en.
FILTER(langMatches(lang(?de), "de"))
FILTER(langMatches(lang(?en), "en"))
}
SPARQL: „SQL meets Turtle“
RDF Querying
thesaurus:bronstijd rdf:type thesaurus:period.
thesaurus:bronstijd skos:prefLabel "Bronzezeit"@de.
thesaurus:bronstijd skos:prefLabel "bronze age"@en.
thesaurus:bronstijd skos:prefLabel "bronstijd"@nl .
SPARQL
triple notation (Turtle)
13
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?de ?en
WHERE {
?a skos:prefLabel ?de.
?a skos:prefLabel ?en.
FILTER(langMatches(lang(?de), "de"))
FILTER(langMatches(lang(?en), "en"))
}
triples with variables
FILTERS with XPath-like functions
RDF Querying
graphical notation
[Graph: thesaurus:bronstijd --rdf:type--> thesaurus:period; thesaurus:bronstijd --skos:prefLabel--> „Bronzezeit“@de, „bronze age“@en, „bronstijd“@nl]
triple notation (Turtle)
14
thesaurus:bronstijd rdf:type thesaurus:period.
thesaurus:bronstijd skos:prefLabel "Bronzezeit"@de.
thesaurus:bronstijd skos:prefLabel "bronze age"@en.
thesaurus:bronstijd skos:prefLabel "bronstijd"@nl .
=> the query result is a table with two columns (?de, ?en)
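How the two triple patterns and the two filters combine can be sketched without any RDF library. The following is only an illustration in plain Python, not how a SPARQL engine is implemented: the tuple-based "store" and the function names are assumptions, and the language check uses simple equality, whereas langMatches also accepts subtags such as "de-AT".

```python
# Toy triple store: the four triples from the slide as (subject, predicate, object).
# Language-tagged literals are modelled as (text, lang) pairs (an assumption of this sketch).
TRIPLES = [
    ("thesaurus:bronstijd", "rdf:type", "thesaurus:period"),
    ("thesaurus:bronstijd", "skos:prefLabel", ("Bronzezeit", "de")),
    ("thesaurus:bronstijd", "skos:prefLabel", ("bronze age", "en")),
    ("thesaurus:bronstijd", "skos:prefLabel", ("bronstijd", "nl")),
]


def pref_labels(triples):
    """Yield (subject, text, lang) for every skos:prefLabel triple."""
    for s, p, o in triples:
        if p == "skos:prefLabel" and isinstance(o, tuple):
            text, lang = o
            yield s, text, lang


def select_de_en(triples):
    """Mimic SELECT ?de ?en WHERE { ?a skos:prefLabel ?de . ?a skos:prefLabel ?en . FILTER ... }."""
    labels = list(pref_labels(triples))
    rows = []
    for a1, de, lang_de in labels:
        for a2, en, lang_en in labels:
            # Both patterns share the variable ?a, so the subjects must be identical;
            # the two FILTERs restrict the languages of ?de and ?en.
            if a1 == a2 and lang_de == "de" and lang_en == "en":
                rows.append((de, en))
    return rows


rows = select_de_en(TRIPLES)
for de, en in rows:
    print(de, en, sep="\t")
```

Each row of the result pairs a German label with an English label of the same concept, which is why the output has exactly two columns.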
RDF
RDF graphs can represent arbitrarily complex structures, can be freely extended with additional nodes, links (edges pointing to external resources), etc.
[Graph: thesaurus:bronstijd --rdf:type--> thesaurus:period; thesaurus:bronstijd --skos:prefLabel--> „Bronzezeit“@de, „bronze age“@en, „bronstijd“@nl; thesaurus:period --rdfs:subClassOf--> thesaurus:arch_concept]
„period“ is a subclass of „arch_concept“: (every) „period“ is an „arch_concept“
15
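What the subclass statement licenses can be made explicit with a small sketch: every instance of „period“ can also be treated as an instance of „arch_concept“. The dictionaries and function names below are assumptions made for illustration; RDF stores and reasoners provide this kind of inference themselves.

```python
def superclasses(cls, subclass_of):
    """Return all (transitive) superclasses of cls by following rdfs:subClassOf edges."""
    seen = set()
    todo = [cls]
    while todo:
        current = todo.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                todo.append(parent)
    return seen


def infer_types(types, subclass_of):
    """For every resource, add the superclasses of its asserted classes."""
    result = {}
    for resource, classes in types.items():
        inferred = set(classes)
        for cls in classes:
            inferred |= superclasses(cls, subclass_of)
        result[resource] = inferred
    return result


# The two relevant statements from the slide, as plain dictionaries.
SUBCLASS_OF = {"thesaurus:period": ["thesaurus:arch_concept"]}
TYPES = {"thesaurus:bronstijd": ["thesaurus:period"]}

inferred = infer_types(TYPES, SUBCLASS_OF)
print(sorted(inferred["thesaurus:bronstijd"]))
```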
RDF vocabularies
Specialized vocabularies for different kinds of information do exist and can (and should) be re-used
=> uniform information representation
[Graph: thesaurus:bronstijd --rdf:type--> thesaurus:period; thesaurus:bronstijd --skos:prefLabel--> „Bronzezeit“@de, „bronze age“@en, „bronstijd“@nl; thesaurus:period --rdfs:subClassOf--> thesaurus:arch_concept]
RDF and RDF Schema (RDFS): basic vocabulary for taxonomies
16
RDF vocabularies
[Graph: thesaurus:bronstijd --rdf:type--> thesaurus:period; thesaurus:bronstijd --skos:prefLabel--> „Bronzezeit“@de, „bronze age“@en, „bronstijd“@nl; thesaurus:period --rdfs:subClassOf--> thesaurus:arch_concept]
Additional vocabularies (SKOS, OWL, etc.): extended vocabulary for taxonomies and ontologies
Specialized vocabularies for different kinds of information do exist and can (and should) be re-used
17
RDF vocabularies
[Graph: thesaurus:bronstijd --rdf:type--> thesaurus:period; thesaurus:bronstijd --skos:prefLabel--> „Bronzezeit“@de, „bronze age“@en, „bronstijd“@nl; thesaurus:period --rdfs:subClassOf--> thesaurus:arch_concept]
thesaurus: toy vocabulary for archeology
Specialized vocabularies for different kinds of information do exist
Domain-specific vocabularies can be specified as required
18
RDF vocabularies
• graphical browsers and editors
e.g.
– http://labs.sparna.fr/skos-play (SKOS)
– http://protege.stanford.edu/ (RDF, OWL, SKOS)
19
RDF vocabularies
• graphical browsers and editors do exist …
… as well as
• databases,
• W3C-standardized query language,
• APIs,
• reasoners,
• etc.
20
RDF vocabularies
• RDF resources and edges are defined by URIs
– as a URL, they may be accessed remotely
re-usable vocabularies & knowledge bases
– link to community-maintained data types/terms instead of defining your own
• „Linked Open Data Cloud“ (LOD)
– a great variety of resources that are linked with each other and released under an open license
21
Linked Data
• Linked Data
– rules of best practice for publishing data on the web
• use URIs as names for things (1)
• if they can be resolved via HTTP (2)
• and provide information as RDF* (3)
• and they include links to other URIs (4)
– links to external URIs allow us to retrieve more information from these sites
then, this is Linked Data (informal)
http://www.w3.org/DesignIssues/LinkedData.html
22
Linked Data
• Linked Data
– rules of best practice for publishing data on the web
=> Information integration
– Structural interoperability
• comparable formats and protocols to access data
=> the same query language for different data sets
23
Linked Data
• Linked Data
– rules of best practice for publishing data on the web
=> Information integration
– Structural interoperability
– Conceptual interoperability
• develop and (re-)use shared vocabularies for equivalent concepts
=> the same query on different data sets
24
Linked Data
• Linked Data
– rules of best practice for publishing data on the web
=> Information integration
– Structural interoperability
– Conceptual interoperability
– Federation
• data published on the web – with a query interface (SPARQL end point)
=> a single query to query different datasets simultaneously
25
Linked Open Data (LOD)
26
LOD cloud: Aug 2014
http://lod-cloud.net/
[Figure: LOD cloud diagram; labelled domains: bibliography, life sciences, social networks, geo information, government data, media, linguistic LOD]
27
Linking Corpora …
• … with other schemes
• … with terminology repositories
• … with lexical-semantic networks
• corpora as Linked Open Data
=> network effects
Structurally Interoperable Language Resources
NLP Interchange Format (NIF):
Interoperability for NLP pipelines
29
Linked Data Corpus Creation with NIF
(NIF Reference Card)
Best practices to follow for the generation of Linked Data text corpora, using the NLP Interchange Format (NIF).
Target audience
Corpus creators and users seeking to make corpora interoperable and to publish them as linked data. Basic knowledge of RDF is mandatory for conversion. Basic knowledge of linked data and web server access is needed for publication.
Scope
Conversion of existing corpora into RDF using NIF, as well as creation of linked data corpora from textual data.
30
Website: http://site.nlp2rdf.org Github: http://github.com/nlp2rdf Example corpus: http://brown.nlp2rdf.org
31
Core concepts
Corpus
We understand a corpus as a collection of documents. Documents contain text, represented as strings
of characters and annotations that provide more information about these strings. NIF provides a way to
identify strings using URIs and annotate them using an ontology.
String identification via URI:
Strings are identified using a URI scheme consisting of: the prefix of the corpus URI; the character
indices of beginning and end of the string; and a scheme identifier between document URI and string
position identifier. Character indices in NIF are counted offset based, starting at zero before the first
character and counting the gaps between the characters until after the last character of the referenced
string: http://example.org/corpus/document#offset4_10
This URI scheme is valid for text/plain. Other mime types may require different URI schemes.
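The offset counting described above can be checked with a few lines of Python. This is only a sketch: the function name is hypothetical, the `#offset_B_E` fragment form and the document URI are taken from the NIF example shown later in these slides, and the example sentence with its trailing period is an assumption chosen so that the whole text spans offsets 0 to 32.

```python
def nif_offset_uri(document_uri, text, substring, start_from=0):
    """Build an offset-based string URI for the first occurrence of substring in text.

    Offsets count the gaps between characters, starting at 0 before the first
    character, so they coincide with Python slice indices: text[begin:end].
    """
    begin = text.index(substring, start_from)  # raises ValueError if substring is absent
    end = begin + len(substring)
    return f"{document_uri}#offset_{begin}_{end}", begin, end


TEXT = "The Semantic Web is a good idea."  # assumed example sentence
DOC = "http://example.org/sem"

uri, begin, end = nif_offset_uri(DOC, TEXT, "Semantic Web")
context_uri, _, _ = nif_offset_uri(DOC, TEXT, TEXT)  # the whole text, i.e. the context
print(uri)
print(context_uri)
```

Because the offsets behave like slice indices, `TEXT[begin:end]` gives back exactly the annotated string, which is a convenient sanity check when generating such URIs.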
String annotation
After assigning URIs to meaningful strings of the corpus, these URIs can be annotated using the NIF
core ontology.
32
Example
String URIs
◦ Strings are the basis of analysis
◦ nif:anchorOf => string value
◦ offset-based
@base <http://example.org/prefix>
<#char=3,12>
<http://example.org/prefix#char=3,12>
Context
◦ Contains document text in nif:isString
◦ nif:beginIndex is always 0
◦ Strings refer to Context with
nif:referenceContext
33
Example
Pre-defined categories
◦ Word, Phrase, Sentence, Paragraph
◦ rdfs:subClassOf String
◦ hierarchy => subString
Pre-defined properties
◦ head, lemma, stem, posTag, …
oriented towards industrial applications
primarily used for Entity Linking (taIdentRef)
limited in scalability
limited support of dependency annotations, no support for relational semantics
34
Example
Information Integration
◦ Annotations are attached to strings
◦ Implicit unification of divergent annotations
◦ If different tools annotate the same string, their annotations refer to the same URI
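The implicit unification can be illustrated as a plain union of triples: since both tools derive the same URI from the same offsets, their annotations end up on one node. The tool outputs below are made-up examples (the property names and values follow the NIF example shown later in these slides), and the merge function is a hypothetical helper, not part of any NIF tooling.

```python
from collections import defaultdict


def merge_annotations(*tool_outputs):
    """Union the triples of several tools, grouped by string URI and property."""
    merged = defaultdict(lambda: defaultdict(set))
    for output in tool_outputs:
        for uri, prop, value in output:
            merged[uri][prop].add(value)
    return merged


STRING_URI = "http://example.org/sem#offset_4_16"

# Hypothetical output of a POS tagger and of an entity linker for the same string.
tagger_output = [(STRING_URI, "nif:oliaLink", "http://purl.org/olia/penn.owl#NNP")]
linker_output = [(STRING_URI, "itsrdf:taIdentRef", "http://dbpedia.org/resource/Semantic_Web")]

merged = merge_annotations(tagger_output, linker_output)
for prop, values in sorted(merged[STRING_URI].items()):
    print(prop, sorted(values))
```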
35
NIF Core ontology
Rich vocabulary
• OWL-based
• redundant properties
• transitive
• inverse
• before ~
previousWord
• NIF is used in
– EU projects
– community projects (e.g., Apache Stanbol)
– industrial applications
• Why not in NLP and linguistics?
– ignorance
“Standard data formats … I'm not sure these are important: if someone can use a parser, they can probably also write a Python wrapper”
Mark Johnson (2012), Computational Linguistics. Where do we go from here?, invited plenary talk at the 50th Annual Meeting of the ACL, Jeju
37
NIF and computational linguistics
• NIF is used in
– EU projects
– community projects (e.g., Apache Stanbol)
– industrial applications
• Why not in NLP and linguistics?
– ignorance
– readability: human-readable?
38
NIF and computational linguistics
NIF
<http://example.org/sem#offset_4_16>
    a nif:String , nif:Phrase , nif:OffsetBasedString ;
    nif:anchorOf "Semantic Web"@en ;
    nif:beginIndex "4"^^xsd:int ;
    nif:endIndex "16"^^xsd:int ;
    nif:oliaLink <http://purl.org/olia/penn.owl#NNP> ;
    itsrdf:taIdentRef <http://dbpedia.org/resource/Semantic_Web> ;
    nif:referenceContext <http://example.org/sem#offset_0_32> .
CoNLL
The           DT   _
Semantic Web  NNP  http://dbpedia.org/resource/Semantic_Web
is            VB   _
a             DT   _
good          JJ   _
idea          NN   _
• NIF is used in
– EU projects
– community projects (e.g., Apache Stanbol)
– industrial applications
• Why not in NLP and linguistics?
– ignorance
– readability
– limited expressivity
• non-String annotations / empty strings?
• non-phrasal MWEs?
• hard-wired properties and concepts
– limited to morphosyntax, NER and dependency syntax
39
NIF and computational linguistics
• NIF is used in
– EU projects
– community projects (e.g., Apache Stanbol)
– industrial applications
• Why not in NLP and linguistics?
– ignorance
– readability
– limited expressivity
– established formats for „basic“ annotations
• Do we get anything that CoNLL-TSV doesn't give us yet?
40
NIF and computational linguistics
• NIF is used in
– EU projects
– community projects (e.g., Apache Stanbol)
– industrial applications
• Why not in NLP and linguistics?
– ignorance
– readability
– limited expressivity
– established formats for „basic“ annotations
– developed with a focus on Semantic Web applications
• neither linguistics nor NLP
41
NIF and computational linguistics
• TELIX
– motivation similar to NIF
– RDFa annotations, added to XML documents
• popularity decreases with the popularity of XML
• OpenAnnotation
– intended for expressing metadata over content elements in HTML
• limited support for linguistic annotations
42
NIF Alternatives
• POWLA
– RDF/OWL reconstruction of an XML standoff format
• capable of representing any annotation faithfully
– comes from a small and specialized sub-community in NLP and linguistics (multi-layer corpora, discourse annotation)
• as unreadable as the original XML standoff format
– but easier to process
43
NIF Alternatives
CoNLL-RDF
Yet another corpus formalism
• use off-the-shelf technologies for
– data storage (RDF triple/quad stores),
– querying (SPARQL),
– manipulation (SPARQL Update), and
– access (SPARQL end points)
• structurally interoperable with
– NLP output (NIF) and other RDF-based corpus formats,
– dictionaries (lemon), and
– terminology bases
45
Technical motivations for corpora in RDF
=> flexible information integration
• RDF-based formalism
– rather than OWL-based (NIF, POWLA)
– minimal: no transitive and inverse properties
– generic: no pre-defined categories or properties
• grounded in an established and widely used format family in the field
– tab-separated values: CoNLL format family
• comfortable
– import from and export to human-readable representations
46
CoNLL-RDF
47
48
49
CoNLL-X format (2006)
50
Tab-Separated Values
51
CoNLL-2009 additions: SRL
52
CoNLL-U: Universal Dependencies
53
Tool-specific CoNLL variants: SENNA
• comments begin with #
• sentences separated by empty line
• one word per line
• annotations separated by tab (=> columns)
• empty cells are left empty or contain -, _ or O
• HEAD points to another word in the same sentence – „foreign key“, identified by sentential position/ID
• all other annotations assign string values
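These conventions are enough to read most CoNLL variants with a few lines of code. The reader below is a minimal sketch in plain Python; the function name, the column labels, and the sample data are assumptions for illustration and are not taken from a real corpus.

```python
def read_conll(lines, columns):
    """Split CoNLL-style lines into sentences; each word becomes a dict column -> value."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):      # comments begin with #
            continue
        if not line.strip():          # an empty line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        cells = line.split("\t")      # one word per line, annotations separated by tab
        current.append(dict(zip(columns, cells)))
    if current:                       # last sentence without a trailing empty line
        sentences.append(current)
    return sentences


SAMPLE = (
    "# toy example, not taken from a real corpus\n"
    "1\tThe\tDT\t2\n"
    "2\tidea\tNN\t0\n"
    "\n"
    "1\tYes\tUH\t0\n"
)

sentences = read_conll(SAMPLE.splitlines(), ["ID", "WORD", "POS", "HEAD"])
print(len(sentences), "sentences")
```

HEAD stays a string here; it only becomes meaningful once it is resolved against the ID column of the same sentence.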
54
CoNLL formats
• generate URIs
– word: base URI + „s“ + sentence nr + „.“ + word ID
– sentence: word with ID „0“ (CoNLL root)
– base URI should refer to the original document in a corpus, e.g.,
http://cormand.huma-num.fr/01npogotiginin_kokorobola.dis.html#
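The URI scheme can be written down directly. The helper names and the base URI `http://example.org/doc#` below are hypothetical; the concatenation follows the rule as stated on the slide, and the actual converter may format its URIs differently.

```python
def word_uri(base_uri, sentence_nr, word_id):
    """base URI + "s" + sentence nr + "." + word ID, as described on the slide."""
    return f"{base_uri}s{sentence_nr}.{word_id}"


def sentence_uri(base_uri, sentence_nr):
    """The sentence is represented by the word with ID 0 (the CoNLL root)."""
    return word_uri(base_uri, sentence_nr, 0)


BASE = "http://example.org/doc#"  # placeholder; should point to the original document

print(word_uri(BASE, 1, 2))
print(sentence_uri(BASE, 1))
```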
55
CoNLL-RDF
• generate URIs
– word: base URI + „s“ + sentence nr + „.“ + word ID
• supply labels for all columns
datatype property in conll namespace
http://ufal.mff.cuni.cz/conll2009-st/task-description.html#<COLNAME>
or
conll:<COLNAME>
56
CoNLL-RDF
• generate URIs
– word: base URI + „s“ + sentence nr + „.“ + word ID
• supply labels for all columns
datatype property in conll namespace
• special treatment of HEAD
– object property pointing to head URI
57
CoNLL-RDF
• generate URIs
– word: base URI + „s“ + sentence nr + „.“ + word ID
• supply labels for all columns
datatype property in conll namespace
• special treatment of HEAD
– object property pointing to head URI
• minimal use of NIF vocabulary
– nif:Word, nif:nextWord, nif:nextSentence
58
CoNLL-RDF
• generate URIs
– word: base URI + „s“ + sentence nr + „.“ + word ID
• supply labels for all columns
datatype property in conll namespace
• special treatment of HEAD
– object property pointing to head URI
• minimal use of NIF vocabulary
• conventionally formatted to resemble CoNLL
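Taken together, the rules above amount to a very small conversion procedure. The following Python sketch is not the Java converter from the repository: the function names, the base URI, and the sample rows are assumptions, prefix declarations are omitted from the output, and only empty cells and `_` are skipped (since `-` and `O` can be meaningful values in some columns).

```python
def turtle_literal(value):
    """Quote a string as a Turtle literal (escaping backslashes and double quotes)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def sentence_to_triples(base_uri, sentence_nr, rows):
    """Turn one sentence (a list of dicts column -> value) into (s, p, o) triples."""
    def uri(word_id):
        return f"<{base_uri}s{sentence_nr}.{word_id}>"

    triples = []
    for i, row in enumerate(rows):
        subj = uri(row["ID"])
        triples.append((subj, "a", "nif:Word"))
        for column, value in row.items():
            if column == "HEAD":
                # special treatment: object property pointing to the URI of the head
                triples.append((subj, "conll:HEAD", uri(value)))
            elif value not in ("", "_"):
                # every other column: datatype property in the conll namespace
                triples.append((subj, f"conll:{column}", turtle_literal(value)))
        if i + 1 < len(rows):
            triples.append((subj, "nif:nextWord", uri(rows[i + 1]["ID"])))
    return triples


ROWS = [
    {"ID": "1", "WORD": "The", "POS": "DT", "HEAD": "2"},
    {"ID": "2", "WORD": "idea", "POS": "NN", "HEAD": "0"},
]

triples = sentence_to_triples("http://example.org/doc#", 1, ROWS)
for s, p, o in triples:
    print(f"{s} {p} {o} .")
```

Printing one statement per line keeps the sketch short; grouping all properties of a word on a single line with `;` would bring the output closer to the CoNLL-like layout mentioned above.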
59
CoNLL-RDF
60
CoNLL vs. CoNLL-RDF
...
• three simple java programs
svn:/intern/incubator/conll-rdf/
– CoNLL2RDF basic converter
read CoNLL file, write Turtle
61
CoNLL-RDF processing
• three simple java programs
svn:/intern/incubator/conll-rdf/
– CoNLL2RDF basic converter
– CoNLLStreamExtractor graph manipulation
read CoNLL file sentence by sentence, for every sentence, apply SPARQL Update statements, write Turtle
62
CoNLL-RDF processing
• three simple java programs
svn:/intern/incubator/conll-rdf/
– CoNLL2RDF basic converter
– CoNLLStreamExtractor graph manipulation
– CoNLLRDFFormatter basic visualization
read CoNLL-RDF, output: CoNLL-like visualization
• special treatment of dependencies
• limited to conll namespace
• coloring on Unix shells
• imposes some naming conventions
63
CoNLL-RDF processing
• three simple java programs
svn:/intern/incubator/conll-rdf/
– CoNLL2RDF basic converter
– CoNLLStreamExtractor graph manipulation
– CoNLLRDFFormatter basic visualization
– anything beyond this is handled by SPARQL update queries
• modular pipeline (examples in *.sh)
• re-usable modules (given appropriate documentation)
– maybe hard to read, but easy to write your own!
64
CoNLL-RDF processing
An example
65
Requires a Unix shell
Please look into the code!
66
An example: acoli-example.sh
67
An example: acoli-example.sh
read one or several files; if you have one file only, say FILE.conll, write „cat FILE.conll | \“ instead
68
An example: acoli-example.sh
call CoNLLStreamExtractor (run.sh just adds the classpath)
base URI (can be any URL, etc.)
column labels in the order they occur in the CoNLL file
69
An example: acoli-example.sh
apply some formatting
$* means that command-line arguments are passed on to CoNLLRDFFormatter, but these are not necessary
70
An example: acoli-example.sh
apply some formatting
$* means that command-line arguments are passed on to CoNLLRDFFormatter, but these are not necessary
Please, try for yourself
71
Manipulating the data
Between reading and writing, manipulations can be applied to the graph.
-u => what follows are SPARQL Update queries
these queries are read from the files given as arguments
optionally, a number of iterations can be supplied in {0}
• For more complicated manipulations, we also require insertions
– Use INSERT { ... } after DELETE { ... } and before WHERE
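The effect of a combined DELETE/INSERT update, and of repeating it, can be mimicked on a set of triples. This is a conceptual sketch only: the function names are hypothetical, the "update" is a Python function rather than a SPARQL file, and stopping early once the graph no longer changes is an assumption of this sketch, not documented behaviour of CoNLLStreamExtractor.

```python
def apply_update(graph, update, iterations=1):
    """Apply update up to `iterations` times; update(graph) returns (to_delete, to_insert)."""
    for _ in range(iterations):
        to_delete, to_insert = update(graph)
        # As in SPARQL Update: matches are computed first, then deleted, then inserted.
        new_graph = (graph - to_delete) | to_insert
        if new_graph == graph:  # assumption: nothing changed, so stop early
            break
        graph = new_graph
    return graph


def remove_ids(graph):
    """Hypothetical update in the spirit of a remove-ID query: drop all conll:ID triples."""
    return {t for t in graph if t[1] == "conll:ID"}, set()


GRAPH = {
    ("<http://example.org/doc#s1.1>", "conll:ID", '"1"'),
    ("<http://example.org/doc#s1.1>", "conll:WORD", '"The"'),
}

result = apply_update(GRAPH, remove_ids, iterations=3)
print(sorted(result))
```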
72
sparql/remove-ID.sparql
• This is from another pipeline, see shift-reduce-example.sh
73
shift-reduce/initialize-SHIFT.sparql
• Given the UD annotations
– write a query that extracts the main verb (predicate) of a sentence
– add this information to the graph: ?verb a conll:predicate .
– add this query to acoli-example.sh and run it
74
Task: Write a SPARQL update query
use this as a template.
Example, extended
75
Linking with OLiA
Linking Corpora OLiA
(Schmidt et al. 2006, Chiarcos 2008, 2010, 2012)
[Figure: OLiA architecture.
Annotation Models: Penn, Brown, Susanne etc. (English); STTS, TIGER, Connexor, TüBa-D/Z (annotation models for German); EAGLES (11 European languages); MULTEXT/East (15 (mostly) Eastern European languages).
OLiA Reference Model.
External Reference Models (Terminology Repositories): GOLD, ISOcat (morphosyntax), OntoTag (morphosyntax), TDS ontology.]
Linking: given a POWLA individual i,
if the annotations of i match the OLiA annotation model specs,
then declare i an instance of the corresponding OLiA class
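The linking rule is essentially a lookup followed by a type assertion. In the sketch below, the tag-to-class table is a hand-made placeholder and not the actual OLiA annotation model (which is an OWL ontology and can match on more than a single tag); the class names, individuals, and the function name are assumptions for illustration.

```python
def link_to_olia(annotations, annotation_model):
    """For every (individual, tag) pair whose tag is known, emit an rdf:type triple."""
    links = []
    for individual, tag in annotations:
        olia_class = annotation_model.get(tag)
        if olia_class is not None:  # unknown tags are simply left unlinked
            links.append((individual, "rdf:type", olia_class))
    return links


# Placeholder mapping for a few Penn-style tags (illustrative only).
TOY_ANNOTATION_MODEL = {
    "DT": "olia:Determiner",
    "NNP": "olia:ProperNoun",
}

ANNOTATIONS = [
    ("<http://example.org/doc#s1.1>", "DT"),
    ("<http://example.org/doc#s1.2>", "NNP"),
    ("<http://example.org/doc#s1.3>", "XYZ"),  # not covered by the toy model
]

links = link_to_olia(ANNOTATIONS, TOY_ANNOTATION_MODEL)
for s, p, o in links:
    print(s, p, o, ".")
```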
77
Integrating external information
link English POS / dependencies with OLiA
78
sparql/link-penn-POS-simple.sparql
79
Task II: Extend this such that ?a is also an instance of superclasses of ?concept
Example, further extended
80
Consult an online dictionary
• SPARQL permits accessing endpoints and web services on the web
– in the WHERE part of the query
SERVICE <URL> { ... }
– e.g. DBnary, an RDF edition of various Wiktionaries
• http://kaiko.getalp.org/sparql
81
Getting German glosses for English text
82
Exploring an end point
There seems to be a German DBnary
What datasets are in here ?
83
Exploring an end point
lemon:LexicalEntry may be what we're looking for
What does the German dictionary contain?
84
Exploring an end point
Examples for LexicalEntries
Ok, let's look into „Abenteuer“
85
Exploring an end point
Examples for LexicalEntries
Indeed, this contains dbnary:isTranslationOf
86
Exploring an end point
Check an example translation
targetLanguage => language
dbnary:writtenForm => String
87
Explore an end point
etc., until we have pairs of strings from DBnary
88
Explore an end point
89
Adding German translations sparql/gloss-en-to-de-DBnary.sparql
90
Sample output
• hints:
{ ... } UNION { ... } => logical or
FILTER(NOT EXISTS { ... }) => negation
91
Task III: return a word-by-word German „translation“ (and nothing else)