LLOD and corpora

Ch. Chiarcos, ACoLi CO, 2016, July 22

If you have a Un*x environment (Linux, BSD, Mac, Cygwin):

- make sure you have Java installed and wifi works

- lab members: svn checkout /intern/incubator/conll-rdf

- others: download and unzip http://acoli.informatik.uni-frankfurt.de/tmp/conll-rdf-prerelease.zip

others:

- find a neighbor who does

TODO


1. Linguistic Linked Open Data

and interoperability challenges wrt. corpora

2. State of the Art: NIF & POWLA

... and why neither is used in NLP nor corpus linguistics

3. Yet another format: CoNLL-RDF

Breaking the usage barrier ?

4. Working with CoNLL-RDF

LLOD and corpora


Linked Open Data

RDF, RDF vocabularies, Linking


Resource Description Framework (RDF)

• W3C standard (1999)

– generic data model: directed labeled graph

• nodes, edges, labels

– originally developed to provide metadata about resources

• e.g., journals in a bookstore and eBooks in an online shop

– resources are unambiguously identified in the web of data by Uniform Resource Identifiers (URIs)


Resource Description Framework (RDF)

• a (labeled directed multi-) graph

– nodes („RDF resources“)

• anything we want to provide information about

– edges („RDF properties“)

• each edge assigns a source node („subject“) a target node („object“) or a value („literal“)

– nodes and edges are unambiguously identified

• Uniform Resource Identifiers (URIs), e.g., URLs
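The graph model described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `http://example.org/thesaurus#` namespace, the `Literal` helper and the `objects` function are hypothetical names, not part of any RDF library.

```python
from collections import namedtuple

# Hypothetical helper: a literal value with an optional language tag.
Literal = namedtuple("Literal", ["value", "lang"])

EX = "http://example.org/thesaurus#"  # made-up namespace for illustration
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"

# Each edge of the graph is one (subject, predicate, object) triple.
graph = [
    (EX + "bronstijd", RDF_TYPE, EX + "period"),                  # object is a node
    (EX + "bronstijd", PREF_LABEL, Literal("Bronzezeit", "de")),  # object is a literal
]

def objects(graph, subject, predicate):
    """Return all objects that `subject` points to via `predicate`."""
    return [o for s, p, o in graph if s == subject and p == predicate]

print(objects(graph, EX + "bronstijd", RDF_TYPE))
```

Nodes and edges are both plain URI strings here, which mirrors the point that everything in the graph is identified by a URI.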


RDF

thesaurus:bronstijd rdf:type thesaurus:period.

[Graph: node thesaurus:bronstijd with an rdf:type edge to node thesaurus:period]

(the concept) „bronstijd“ is an (instance of concept) „period“

RDF

thesaurus:bronstijd rdf:type thesaurus:period.

rdf:type is an abbreviation for the URI <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>

The URI could be opened in a browser; resolvable URIs may provide further information.

RDF

thesaurus:bronstijd rdf:type thesaurus:period.

thesaurus:bronstijd skos:prefLabel "Bronzezeit"@de.

in German (de), the preferred label for (the concept) „bronstijd“ is „Bronzezeit“

RDF

thesaurus:bronstijd rdf:type thesaurus:period.

thesaurus:bronstijd skos:prefLabel "Bronzezeit"@de.

thesaurus:bronstijd skos:prefLabel "bronze age"@en.

in English (en), the preferred label for (the concept) „bronstijd“ is „bronze age“

RDF

thesaurus:bronstijd rdf:type thesaurus:period.

thesaurus:bronstijd skos:prefLabel "Bronzezeit"@de.

thesaurus:bronstijd skos:prefLabel "bronze age"@en.

thesaurus:bronstijd skos:prefLabel "bronstijd"@nl .

in Dutch (nl), the preferred label for (the concept) „bronstijd“ is „bronstijd“

RDF

thesaurus:bronstijd rdf:type thesaurus:period.

thesaurus:bronstijd skos:prefLabel "Bronzezeit"@de.

thesaurus:bronstijd skos:prefLabel "bronze age"@en.

thesaurus:bronstijd skos:prefLabel "bronstijd"@nl .

[Graph in graphical notation: thesaurus:bronstijd with an rdf:type edge to thesaurus:period and skos:prefLabel edges to „Bronzezeit“@de, „bronze age“@en and „bronstijd“@nl]

The statements above show the same graph in triple notation (Turtle).

RDF Querying

thesaurus:bronstijd rdf:type thesaurus:period.

thesaurus:bronstijd skos:prefLabel "Bronzezeit"@de.

thesaurus:bronstijd skos:prefLabel "bronze age"@en.

thesaurus:bronstijd skos:prefLabel "bronstijd"@nl .

triple notation (Turtle)

SPARQL: „SQL meets Turtle“

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?de ?en
WHERE {
  ?a skos:prefLabel ?de.
  ?a skos:prefLabel ?en.
  FILTER(langMatches(lang(?de), "de"))
  FILTER(langMatches(lang(?en), "en"))
}

RDF Querying

The WHERE clause of the query above consists of triples with variables.

FILTERs use XPath-like functions.

RDF Querying

[Graph in graphical notation: thesaurus:bronstijd with an rdf:type edge to thesaurus:period and skos:prefLabel edges to „Bronzezeit“@de, „bronze age“@en and „bronstijd“@nl]

triple notation (Turtle):

thesaurus:bronstijd rdf:type thesaurus:period.

thesaurus:bronstijd skos:prefLabel "Bronzezeit"@de.

thesaurus:bronstijd skos:prefLabel "bronze age"@en.

thesaurus:bronstijd skos:prefLabel "bronstijd"@nl .

=> list with two cols
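To make the two-column result concrete, the following Python sketch emulates what the SPARQL query does on the three label triples. It is an illustration only, not how a SPARQL engine works internally; the tuple encoding of literals and the function name `select_de_en` are assumptions, and plain equality on the language tag is a simplification of `langMatches` (which also accepts subtags such as `de-AT`).

```python
PREF_LABEL = "skos:prefLabel"

# (subject, predicate, (text, language tag)), mirroring the Turtle statements.
triples = [
    ("thesaurus:bronstijd", PREF_LABEL, ("Bronzezeit", "de")),
    ("thesaurus:bronstijd", PREF_LABEL, ("bronze age", "en")),
    ("thesaurus:bronstijd", PREF_LABEL, ("bronstijd", "nl")),
]

def select_de_en(triples):
    """Emulate SELECT ?de ?en with two patterns sharing ?a and two language filters."""
    rows = []
    for a1, p1, (de, lang1) in triples:        # pattern: ?a skos:prefLabel ?de
        for a2, p2, (en, lang2) in triples:    # pattern: ?a skos:prefLabel ?en
            same_subject = a1 == a2            # both patterns bind the same ?a
            both_labels = p1 == PREF_LABEL and p2 == PREF_LABEL
            if same_subject and both_labels and lang1 == "de" and lang2 == "en":
                rows.append((de, en))
    return rows

print(select_de_en(triples))  # [('Bronzezeit', 'bronze age')]
```

Each result row pairs the German and the English label of the same concept; the Dutch label is filtered out.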


RDF

RDF graphs can represent arbitrarily complex structures, can be freely extended with additional nodes, links (edges pointing to external resources), etc.

[Graph: the thesaurus:bronstijd graph with its rdf:type and skos:prefLabel edges, extended with an rdfs:subClassOf edge from thesaurus:period to thesaurus:arch_concept]

(every) „period“ is an „arch_concept“: „period“ is a subclass of „arch_concept“
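What the subclass statement buys us can be sketched as follows: anything typed as a „period“ is implicitly also an „arch_concept“. The dictionaries and the `all_types` function are hypothetical and only handle a single superclass per class; a real RDFS reasoner is more general.

```python
# Toy encoding of the two statements from the graph (prefixed names kept as strings).
TYPE_OF = {"thesaurus:bronstijd": "thesaurus:period"}
SUBCLASS_OF = {"thesaurus:period": "thesaurus:arch_concept"}

def all_types(resource):
    """Follow rdf:type once, then rdfs:subClassOf upwards, collecting every class."""
    types = []
    cls = TYPE_OF.get(resource)
    while cls is not None and cls not in types:  # the membership check guards against cycles
        types.append(cls)
        cls = SUBCLASS_OF.get(cls)
    return types

print(all_types("thesaurus:bronstijd"))
```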


RDF vocabularies

Specialized vocabularies for different kinds of information do exist and can (and should) be re-used

=> uniform information representation

RDF and RDF Schema (RDFS): basic vocabulary for taxonomies

RDF vocabularies

Additional vocabularies (SKOS, OWL, etc.): extended vocabulary for taxonomies and ontologies

RDF vocabularies

thesaurus: toy vocabulary for archeology

Domain-specific vocabularies can be specified as required

RDF vocabularies

• graphical browsers and editors

e.g.

– http://labs.sparna.fr/skos-play (SKOS)

– http://protege.stanford.edu/ (RDF, OWL, SKOS)


RDF vocabularies

• graphical browsers and editors do exist …

… as well as

• databases,

• W3C-standardized query language,

• APIs,

• reasoners,

• etc.

RDF vocabularies

• RDF resources and edges are defined by URIs

– as a URL, they may be accessed remotely

re-usable vocabularies & knowledge bases

– link to community-maintained data types/terms instead of defining your own

• „Linked Open Data Cloud“ (LOD)

– a great variety of resources that are linked with each other and released under an open license


Linked Data

• Linked Data

– rules of best practice for publishing data on the web

• use URIs as names for things (1)

• if they can be resolved via HTTP (2)

• and provide information as RDF* (3)

• and they include links to other URIs (4)

– links to external URIs allow us to retrieve more information from these sites

then, this is Linked Data (informal)

http://www.w3.org/DesignIssues/LinkedData.html

Linked Data

• Linked Data

– rules of best practice for publishing data on the web

=> Information integration

– Structural interoperability

• comparable formats and protocols to access data

=> the same query language for different data sets


Linked Data

• Linked Data

– rules of best practice for publishing data on the web

=> Information integration

– Structural interoperability

– Conceptual interoperability

• develop and (re-)use shared vocabularies for equivalent concepts

=> the same query on different data sets


Linked Data

• Linked Data

– rules of best practice for publishing data on the web

=> Information integration

– Structural interoperability

– Conceptual interoperability

– Federation

• data published on the web – with a query interface (SPARQL end point)

=> a single query to query different datasets simultaneously


Linked Open Data (LOD)


LOD cloud: Aug 2014

http://lod-cloud.net/

[Figure: LOD cloud diagram with labeled domains: bibliography, life sciences, social networks, geo information, government data, media, linguistic LOD]

Linking Corpora …

• … with other schemes

• … with terminology repositories

• … with lexical-semantic networks

• corpora as Linked Open Data

=> network effects


Structurally Interoperable Language Resources

NLP Interchange Format (NIF):

Interoperability for NLP pipelines


Linked Data Corpus Creation with NIF

(NIF Reference Card)

Best practices to follow for the generation of Linked Data text corpora, using the NLP Interchange Format (NIF).

Target audience

Corpus creators and users seeking to make corpora interoperable and to publish them as linked data. Basic knowledge of RDF is mandatory for conversion. Basic knowledge of linked data and web server access is needed for publication.

Scope

Conversion of existing corpora into RDF using NIF, as well as creation of linked data corpora from textual data.

Website: http://site.nlp2rdf.org Github: http://github.com/nlp2rdf Example corpus: http://brown.nlp2rdf.org

Core concepts

Corpus

We understand a corpus as a collection of documents. Documents contain text, represented as strings of characters, and annotations that provide more information about these strings. NIF provides a way to identify strings using URIs and annotate them using an ontology.

String identification via URI:

Strings are identified using a URI scheme consisting of: the prefix of the corpus URI; the character indices of beginning and end of the string; and a scheme identifier between document URI and string position identifier. Character indices in NIF are offset-based: counting starts at zero before the first character and counts the gaps between the characters until after the last character of the referenced string: http://example.org/corpus/document#offset4_10

This URI scheme is valid for text/plain. Other mime types may require different URI schemes.
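The offset counting can be illustrated with a short Python sketch. The function name `offset_uri` and the sample sentence are assumptions; the `#offset_<begin>_<end>` shape follows the NIF example used later in these slides, and the exact scheme identifier should be checked against the NIF specification.

```python
def offset_uri(document_uri, text, substring):
    """Build an offset-based string URI for the first occurrence of `substring`."""
    begin = text.index(substring)   # number of characters before the string (zero-based gap)
    end = begin + len(substring)    # gap directly after the last character of the string
    return f"{document_uri}#offset_{begin}_{end}"

text = "The Semantic Web is a good idea"
print(offset_uri("http://example.org/sem", text, "Semantic Web"))
# http://example.org/sem#offset_4_16
```

Because offsets count gaps rather than characters, the end index minus the begin index equals the length of the string.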

String annotation

After assigning URIs to meaningful strings of the corpus, these URIs can be annotated using the NIF core ontology.

Example

String URIs

◦ Strings are the basis of analysis

◦ nif:anchorOf => string value

◦ offset-based

@base <http://example.org/prefix>

<#char=3,12>

<http://example.org/prefix#char=3,12>

Context

◦ Contains document text in nif:isString

◦ nif:beginIndex is always 0

◦ Strings refer to Context with nif:referenceContext


Example

Pre-defined categories

◦ Word, Phrase, Sentence, Paragraph

◦ rdfs:subClassOf String

◦ hierarchy => subString

Pre-defined properties

◦ head, lemma, stem, posTag, …

oriented towards industrial applications

primarily used for Entity Linking (taIdentRef)

limited in scalability

limited support of dependency annotations, no support for relational semantics


Example

Information Integration

◦ Annotations are attached to strings

◦ Implicit unification of divergent annotations

◦ If different tools annotate the same string, this refers to the same URI


NIF Core ontology

Rich vocabulary

• OWL-based

• redundant properties

• transitive

• inverse

• before ~ previousWord

NIF and computational linguistics

• NIF is used in

– EU projects

– community projects (e.g., Apache Stanbol)

– industrial applications

• Why not in NLP and linguistics?

– ignorance

“Standard data formats … I'm not sure these are important: if someone can use a parser, they can probably also write a Python wrapper”

Mark Johnson (2012), Computational Linguistics. Where do we go from here?, invited plenary talk at the 50th Annual Meeting of the ACL, Jeju

NIF and computational linguistics

• Why not in NLP and linguistics?

– readability: human-readable?

NIF

<http://example.org/sem#offset_4_16>
    a nif:String , nif:Phrase , nif:OffsetBasedString ;
    nif:anchorOf "Semantic Web"@en ;
    nif:beginIndex "4"^^xsd:int ;
    nif:endIndex "16"^^xsd:int ;
    nif:oliaLink <http://purl.org/olia/penn.owl#NNP> ;
    itsrdf:taIdentRef <http://dbpedia.org/resource/Semantic_Web> ;
    nif:referenceContext <http://example.org/sem#offset_0_32> .

CoNLL

The            DT    _
Semantic Web   NNP   http://dbpedia.org/resource/Semantic_Web
is             VB    _
a              DT    _
good           JJ    _
idea           NN    _

NIF and computational linguistics

• Why not in NLP and linguistics?

– limited expressivity

• non-String annotations / empty strings?

• non-phrasal MWEs?

• hard-wired properties and concepts

– limited to morphosyntax, NER and dependency syntax

NIF and computational linguistics

• Why not in NLP and linguistics?

– established formats for „basic“ annotations

• Do we get anything that CoNLL-TSV doesn't give us yet?

NIF and computational linguistics

• Why not in NLP and linguistics?

– developed with a focus on Semantic Web applications

• neither linguistics nor NLP

NIF Alternatives

• TELIX

– motivation similar to that of NIF

– RDFa annotations, added to XML documents

• popularity decreases with the popularity of XML

• OpenAnnotation

– intended for expressing metadata over content elements in HTML

• limited support for linguistic annotations

NIF Alternatives

• POWLA

– RDF/OWL reconstruction of an XML standoff format

• capable of representing any annotation faithfully

– comes from a small and specialized sub-community in NLP and linguistics (multi-layer corpora, discourse annotation)

• as unreadable as the original XML standoff format

– but easier to process

CoNLL-RDF

Yet another corpus formalism


Technical motivations for corpora in RDF

• use off-the-shelf technologies for

– data storage (RDF triple/quad stores),

– querying (SPARQL),

– manipulation (SPARQL Update), and

– access (SPARQL end points)

• structurally interoperable with

– NLP output (NIF) and other RDF-based corpus formats,

– dictionaries (lemon), and

– terminology bases

=> flexible information integration

CoNLL-RDF

• RDF-based formalism

– rather than OWL-based (NIF, POWLA)

– minimal: no transitive and inverse properties

– generic: no pre-defined categories or properties

• grounded in established and widely used formats in the field

– tab-separated values: CoNLL format family

• comfortable

– import and export to human-readable representations

CoNLL-X format (2006)

Tab-Separated Values

CoNLL-2009 additions: SRL

CoNLL-U: Universal Dependencies

Tool-specific CoNLL variants: SENNA

CoNLL formats

• comments begin with #

• sentences separated by empty line

• one word per line

• annotations separated by tab (=> columns)

• empty columns are left empty or contain -, _ or O

• HEAD points to another word in the same sentence

– „foreign key“, identified by sentential position/ID

• all other annotations assign string values
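The conventions listed for the CoNLL formats are simple enough to read with a few lines of Python. The sketch below is a minimal reader under those conventions only; the function name `read_conll` and the four-column toy data are assumptions and do not correspond to any particular CoNLL dialect.

```python
def read_conll(lines):
    """Split CoNLL-style lines into sentences; each sentence is a list of column lists."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):          # comments begin with #
            continue
        if not line.strip():              # an empty line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split("\t"))  # one word per line, tab-separated annotations
    if current:                           # last sentence may lack a trailing empty line
        sentences.append(current)
    return sentences

sample = [
    "# toy data, not taken from a real corpus",
    "1\tThe\tDT\t2",
    "2\tidea\tNN\t0",
    "",
    "1\tGood\tJJ\t0",
]
for sentence in read_conll(sample):
    print(sentence)
```

All values stay strings at this point; only HEAD needs special treatment later, since it refers to another word in the same sentence.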

CoNLL-RDF

• generate URIs

– word: base URI + „s“ + sentence nr + „.“ + word ID

– sentence: word with ID „0“ (CoNLL root)

– base URI should refer to the original document in a corpus, e.g.,

http://cormand.huma-num.fr/01npogotiginin_kokorobola.dis.html#
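The URI scheme for words and sentences amounts to plain string concatenation, sketched here in Python. The function names and the `http://example.org/doc#` base URI are hypothetical; the actual converter may format URIs differently.

```python
def word_uri(base_uri, sentence_nr, word_id):
    """word: base URI + "s" + sentence nr + "." + word ID"""
    return f"{base_uri}s{sentence_nr}.{word_id}"

def sentence_uri(base_uri, sentence_nr):
    """The sentence is represented by the word with ID 0 (the CoNLL root)."""
    return word_uri(base_uri, sentence_nr, 0)

base = "http://example.org/doc#"  # made-up base URI for illustration
print(word_uri(base, 1, 3))       # http://example.org/doc#s1.3
print(sentence_uri(base, 1))      # http://example.org/doc#s1.0
```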

CoNLL-RDF

• supply labels for all columns

– datatype property in conll namespace

http://ufal.mff.cuni.cz/conll2009-st/task-description.html#<COLNAME>

or

conll:<COLNAME>

CoNLL-RDF

• special treatment of HEAD

– object property pointing to head URI

CoNLL-RDF

• minimal use of NIF vocabulary

– nif:Word, nif:nextWord, nif:nextSentence

CoNLL-RDF

• conventionally formatted to resemble CoNLL
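Taken together, the conversion rules (word URIs, one conll: datatype property per column, HEAD as an object property, nif:Word and nif:nextWord) can be sketched in Python. This is an illustration of the rules only, not the output of the CoNLL2RDF tool: the function name, the column labels ID/WORD/POS/HEAD and the base URI are assumptions, literals are not escaped, and only empty cells and `_` are skipped.

```python
def sentence_to_turtle(base_uri, sentence_nr, columns, rows):
    """Render one parsed CoNLL sentence as Turtle-like statements (illustrative only)."""
    def uri(word_id):
        # word URI: base URI + "s" + sentence nr + "." + word ID
        return f"<{base_uri}s{sentence_nr}.{word_id}>"

    words = [dict(zip(columns, row)) for row in rows]
    statements = []
    for i, word in enumerate(words):
        props = ["a nif:Word"]
        for name, value in word.items():
            if value in ("", "_"):                        # skip empty columns
                continue
            if name == "HEAD":
                props.append(f"conll:HEAD {uri(value)}")  # object property to the head URI
            else:
                props.append(f'conll:{name} "{value}"')   # datatype property, string value
        if i + 1 < len(words):                            # link to the following word
            props.append(f"nif:nextWord {uri(words[i + 1]['ID'])}")
        statements.append(f"{uri(word['ID'])} " + " ; ".join(props) + " .")
    return statements

columns = ["ID", "WORD", "POS", "HEAD"]                   # hypothetical column labels
rows = [["1", "The", "DT", "2"], ["2", "idea", "NN", "0"]]
for statement in sentence_to_turtle("http://example.org/doc#", 1, columns, rows):
    print(statement)
```

Writing one statement per word keeps the output close to the one-word-per-line layout of CoNLL.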

CoNLL vs. CoNLL-RDF

...

CoNLL-RDF processing

• three simple Java programs

svn:/intern/incubator/conll-rdf/

– CoNLL2RDF: basic converter

read CoNLL file, write Turtle

CoNLL-RDF processing

– CoNLLStreamExtractor: graph manipulation

read CoNLL file sentence by sentence; for every sentence, apply SPARQL Update statements; write Turtle

CoNLL-RDF processing

– CoNLLRDFFormatter: basic visualization

read CoNLL-RDF, output: CoNLL-like visualization

• special treatment of dependencies

• limited to conll namespace

• coloring on Unix shells

• imposes some naming conventions

CoNLL-RDF processing

– anything beyond this is handled by SPARQL Update queries

• modular pipeline (examples in *.sh)

• re-usable modules (given appropriate documentation)

– maybe hard to read, but easy to write your own!

An example


Requires a Unix shell

Please look into the code!


An example: acoli-example.sh


An example: acoli-example.sh

read one or several files

if you have one file only, say FILE.conll, write „cat FILE.conll | \“ instead


An example: acoli-example.sh

call CoNLLStreamExtractor (run.sh just adds the classpath)

base URI (can be any URL, etc.)

column labels in the order they occur in the CoNLL file


An example: acoli-example.sh

do some formatting (layout)

$* means that command-line arguments are passed on to CoNLLRDFFormatter, but these are not necessary

An example: acoli-example.sh

do some layouting

$* means that command-line arguments are passed to CoNLLRDFFormatter but these are not necessary

Please, try for yourself
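
A minimal sketch of what such a pipeline script might look like, pieced together from the annotations above. The tiny input file, the base URI and the column labels are placeholders chosen for illustration; this is not the content of the actual acoli-example.sh.

```shell
#!/bin/bash
# Sketch only: FILE.conll, the base URI and the column labels are assumptions.

# Hypothetical two-token sentence with four tab-separated columns.
printf '1\tHello\thello\tINTJ\n2\tworld\tworld\tNOUN\n\n' > FILE.conll

BASE_URI="http://example.org/corpus#"  # placeholder base URI
LABELS="ID WORD LEMMA UPOS"            # one label per column, in file order

if [ -x ./run.sh ]; then
  # run.sh adds the Java classpath; $* forwards optional arguments to the formatter.
  # $LABELS is left unquoted on purpose so that each label becomes its own argument.
  cat FILE.conll \
    | ./run.sh CoNLLStreamExtractor "$BASE_URI" $LABELS \
    | ./run.sh CoNLLRDFFormatter $*
else
  echo "run.sh not found; start this from the unpacked conll-rdf directory" >&2
fi
```

With several input files, `cat FILE.conll` would be replaced by a `cat` over all of them.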


Manipulating the data

Between reading and writing, manipulations can be applied to the graph.

-u => the arguments that follow are SPARQL Update queries

– these queries are read from the files given as arguments

– optionally, a number of iterations can be supplied in {0}


• For more complicated manipulations, we also require insertions

– Within a single update, INSERT { ... } goes after DELETE { ... } and before WHERE


sparql/remove-ID.sparql
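
The shipped sparql/remove-ID.sparql is not reproduced here. As a sketch of what an update of this kind could look like, the script below writes two hypothetical files: one that only deletes, and one that deletes and inserts in the clause order of SPARQL 1.1 Update (DELETE, then INSERT, then WHERE). The conll: namespace, the property conll:ID (derived from the column label ID) and the invented property conll:ORIG_ID are assumptions.

```shell
#!/bin/bash
# Sketch only: file names, namespace and properties are assumptions.
mkdir -p sparql

# Deletion only: drop the ID column from every word.
cat > sparql/drop-ID.sparql <<'EOF'
PREFIX conll: <http://ufal.mff.cuni.cz/conll2009-st/task-description.html#>
DELETE { ?word conll:ID ?id }
WHERE  { ?word conll:ID ?id }
EOF

# Deletion plus insertion: keep the value under another (invented) property.
cat > sparql/rename-ID.sparql <<'EOF'
PREFIX conll: <http://ufal.mff.cuni.cz/conll2009-st/task-description.html#>
DELETE { ?word conll:ID ?id }
INSERT { ?word conll:ORIG_ID ?id }
WHERE  { ?word conll:ID ?id }
EOF
```

Such a file would then be named after -u in the CoNLLStreamExtractor call, following the column labels.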


• This is from another pipeline, see shift-reduce-example.sh


shift-reduce/initialize-SHIFT.sparql


• Given the UD annotations

– write a query that extracts the main verb (predicate) of a sentence

– add this information to the graph: ?verb a conll:predicate .

– add this query to acoli-example.sh and run it


Task: Write a SPARQL update query

use this as a template.
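
One possible solution sketch for this task, not the reference solution. It assumes that the dependency label column was named EDGE when calling CoNLLStreamExtractor, that the UD root relation is labelled "root", and that the conll: prefix resolves to the namespace shown; the file name is hypothetical.

```shell
#!/bin/bash
# Sketch only: column label EDGE, label "root" and the namespace are assumptions.
mkdir -p sparql

cat > sparql/mark-predicate.sparql <<'EOF'
PREFIX conll: <http://ufal.mff.cuni.cz/conll2009-st/task-description.html#>
# The root of the dependency tree is taken to be the main predicate.
INSERT { ?verb a conll:predicate }
WHERE  { ?verb conll:EDGE "root" }
EOF
```

In UD the root is not always a verb (for example with nominal predicates), so an extra triple pattern on the POS column may be needed if only verbs should be marked.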


Example, extended


Linking with OLiA


Linking Corpora with OLiA

(Schmidt et al. 2006, Chiarcos 2008, 2010, 2012)

[Figure: OLiA architecture]

• Annotation Models

– EAGLES (11 European languages)

– MULTEXT/East (15 (mostly) Eastern European languages)

– English: Penn, Brown, Susanne etc.

– German: STTS, TIGER, Connexor, TüBa-D/Z

• OLiA Reference Model

• External Reference Models (Terminology Repositories)

– GOLD

– ISOcat (morphosyntax)

– OntoTag (morphosyntax)

– TDS ontology

Linking: given a POWLA individual i, if the annotations of i match the specifications of an OLiA annotation model, then declare i an instance of the corresponding OLiA class


Integrating external information

link English POS / dependencies with OLiA


sparql/link-penn-POS-simple.sparql


Task II: Extend this so that ?a is also an instance of the superclasses of ?concept
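
A sketch of how such a linking query and its extension to superclasses might be written; it is not the shipped sparql/link-penn-POS-simple.sparql. Assumptions: the POS column is labelled POS, the OLiA annotation model for the Penn tagset and its linking to the reference model are already loaded into the graph, and tags are attached to OLiA individuals via hasTag in the namespace shown. The file name is hypothetical.

```shell
#!/bin/bash
# Sketch only: namespaces, property names and the file name are assumptions.
mkdir -p sparql

cat > sparql/link-POS-with-superclasses.sparql <<'EOF'
PREFIX conll: <http://ufal.mff.cuni.cz/conll2009-st/task-description.html#>
PREFIX olias: <http://purl.org/olia/system.owl#>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
INSERT { ?a a ?super }
WHERE {
  ?a conll:POS ?pos .
  ?tag olias:hasTag ?pos ;
       a ?concept .
  # Zero or more subClassOf steps: ?super is ?concept itself or any superclass.
  ?concept rdfs:subClassOf* ?super .
  # Skip anonymous classes such as OWL restrictions.
  FILTER(isIRI(?super))
}
EOF
```

The property path rdfs:subClassOf* is what turns the simple linking into the extended one; without it, ?a would only receive the directly asserted class.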


Example, further extended


Consult an online dictionary


• SPARQL permits accessing endpoints and web services on the web

– in the WHERE part of the query

SERVICE <URL> { ... }

– e.g. DBnary, an RDF edition of various Wiktionaries

• http://kaiko.getalp.org/sparql
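
To illustrate where the SERVICE block sits, the following sketch writes a small federated query to a hypothetical file. The lemon: namespace is an assumption, and the inner LIMIT is there only to keep the remote request small.

```shell
#!/bin/bash
# Sketch only: file name and lemon: namespace are assumptions.
mkdir -p sparql

cat > sparql/peek-dbnary.sparql <<'EOF'
PREFIX lemon: <http://lemon-model.net/lemon#>
SELECT ?entry
WHERE {
  # Everything inside SERVICE is evaluated by the remote endpoint.
  SERVICE <http://kaiko.getalp.org/sparql> {
    SELECT ?entry WHERE { ?entry a lemon:LexicalEntry } LIMIT 10
  }
}
EOF
```

Any engine that supports SPARQL 1.1 Federated Query should be able to evaluate such a query, provided the endpoint is reachable.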


Getting German glosses for English text


Exploring an endpoint

There seems to be a German DBnary

What datasets are in here?


lemon:LexicalEntry may be what we're looking for

What does the German dictionary contain?
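
The exploration steps can also be scripted. The sketch below defines a small helper that sends a query file to the endpoint over the SPARQL protocol with curl, and writes two generic exploration queries (named graphs, instantiated classes). The helper name, directory and file names are hypothetical, and both queries may be slow or truncated on a large endpoint.

```shell
#!/bin/bash
# Sketch only: helper name and file names are assumptions.
ENDPOINT="http://kaiko.getalp.org/sparql"

# Send a query file to the endpoint (HTTP GET), asking for CSV results.
ask() {
  curl -sS -G -H 'Accept: text/csv' --data-urlencode "query@$1" "$ENDPOINT"
}

mkdir -p explore

# Which named graphs (datasets) does the endpoint expose?
cat > explore/graphs.sparql <<'EOF'
SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o } } LIMIT 50
EOF

# Which classes are instantiated, and how often?
cat > explore/classes.sparql <<'EOF'
SELECT ?type (COUNT(?x) AS ?n)
WHERE { ?x a ?type }
GROUP BY ?type
ORDER BY DESC(?n)
LIMIT 20
EOF
```

Nothing is sent until the helper is called, e.g. `ask explore/graphs.sparql`.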


Examples for LexicalEntries

OK, let's look into "Abenteuer"


Indeed, this contains dbnary:isTranslationOf


Check an example translation

targetLanguage => language

dbnary:writtenForm => String


etc., until we have pairs of strings from DBnary
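
A sketch of a query that could yield such string pairs, combining the properties found during exploration. Everything beyond those property names is an assumption: the dbnary: and lemon: namespaces, the use of lemon:canonicalForm/lemon:writtenRep for the source string, the lexvo URI for German, and the file name.

```shell
#!/bin/bash
# Sketch only: namespaces, the language URI and the file name are assumptions.
mkdir -p explore

cat > explore/en-de-pairs.sparql <<'EOF'
PREFIX dbnary: <http://kaiko.getalp.org/dbnary#>
PREFIX lemon:  <http://lemon-model.net/lemon#>
SELECT ?english ?german
WHERE {
  ?translation dbnary:isTranslationOf ?entry ;
               dbnary:targetLanguage <http://lexvo.org/id/iso639-3/deu> ;
               dbnary:writtenForm ?german .
  ?entry lemon:canonicalForm/lemon:writtenRep ?english .
  FILTER(lang(?english) = "en")
}
LIMIT 20
EOF
```

The query can be pasted into the endpoint's web form or sent over HTTP; the LIMIT keeps the result small while the pattern is still being checked.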


Adding German translations: sparql/gloss-en-to-de-DBnary.sparql


Sample output


• hints:

– { ... } UNION { ... } => logical or

– FILTER(NOT EXISTS { ... }) => negation


Task III: return a word-by-word German "translation" (and nothing else)
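
The two hinted constructs can be seen together in the sketch below, which is only an illustration of the syntax and not a solution to the task. It gives every token that has a WORD or a LEMMA but no gloss yet the placeholder "_". The property conll:GLOSS is invented for this example, and the column labels WORD and LEMMA and the conll: namespace are assumptions.

```shell
#!/bin/bash
# Sketch only: conll:GLOSS is invented; labels and namespace are assumptions.
mkdir -p sparql

cat > sparql/fill-missing-gloss.sparql <<'EOF'
PREFIX conll: <http://ufal.mff.cuni.cz/conll2009-st/task-description.html#>
INSERT { ?w conll:GLOSS "_" }
WHERE {
  # logical or: the token has a WORD or a LEMMA
  { ?w conll:WORD ?any } UNION { ?w conll:LEMMA ?any }
  # negation: no gloss has been added so far
  FILTER(NOT EXISTS { ?w conll:GLOSS ?g })
}
EOF
```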