Querying GrAF data in linguistic analysis

Peter Bouda

Centro Interdisciplinar de Documentação Linguística e Social

[email protected]

Overview

Existing infrastructure and workflows

GrAF

GrAF and TEI

Poio API

Queries in Poio API

Queries in GrAF API

Fieldwork

Photos

Existing Infrastructure

LD tools and standards

Elan: EAF, MPEG, WAV

Toolbox: TXT, XML, WAV

Arbil: IMDI/CMDI (Component MetaData Infrastructure)

Praat: XML, WAV

...

No standards for tier hierarchies, tier names or annotation schemes

Efforts in ISOcat

Interlinear Glossed Text

GrAF

GrAF: Graph Annotation Framework

ISO 24612: Language resource management - Linguistic annotation framework (LAF)

Started as stand-off version of XCES

API and representation as data structures, not a file format

GrAF/XML as XML representation

Used for the Manually Annotated Sub-Corpus (MASC) of the American National Corpus (ANC)

Nodes, edges, regions, annotations, feature structures
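The entities listed above can be pictured as a handful of small data structures. The following is only a minimal sketch under simplifying assumptions: the class and field names are illustrative and are not the API of the GrAF Java or Python libraries, regions are reduced to character offsets, and feature structures are reduced to flat dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """A span of the primary data (assumption: plain character offsets)."""
    id: str
    start: int
    end: int


@dataclass
class Annotation:
    """A label plus a flat feature structure (feature name -> value)."""
    label: str
    features: dict = field(default_factory=dict)


@dataclass
class Node:
    """A graph node; it may link to regions and carry annotations."""
    id: str
    regions: list = field(default_factory=list)
    annotations: list = field(default_factory=list)


@dataclass
class Edge:
    """A directed edge between two nodes; edges can be annotated too."""
    id: str
    from_node: Node
    to_node: Node
    annotations: list = field(default_factory=list)


def covered_text(node, text):
    """Return the stretches of primary data that a node's regions point to."""
    return [text[r.start:r.end] for r in node.regions]


primary = "the dog barks"

# A token node anchored in the text, and a phrase node that dominates it.
token = Node("n1", regions=[Region("r1", 4, 7)],
             annotations=[Annotation("pos", {"category": "noun"})])
phrase = Node("n2", annotations=[Annotation("phrase", {"type": "NP"})])
edge = Edge("e1", phrase, token)

print(covered_text(token, primary))
```

Only nodes that are linked to regions touch the primary data; all other structure lives in edges between nodes.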

GrAF entities

GrAF structure

GrAF-XML

TEI and GrAF

Schemata for GrAF created with TEI Roma

Customized version of TEI P5 schema

ODD: One Document Does it all

GrAF is not TEI compliant

Share data types and feature structures of annotations

TEI has stand-off variant, uses XPointer/XLink

Primary data has to be XML

Why we use GrAF

No inline markup

Radical stand-off approach

Easier to share and manage data

Preferred solution to archive cultural heritage

Ideal for sparse annotations

Existing code: Java and Python

API vs. XQuery

The beauty of annotation graphs
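The stand-off idea behind these points can be illustrated with a few lines of Python. This is a sketch with invented data, not GrAF itself: the layer names and the offset tuples are assumptions chosen for illustration. The primary text is never modified, each annotation layer is stored separately and only points into the text, and a sparse layer simply has fewer entries.

```python
# Primary data stays untouched; it could equally live in its own file.
primary = "Hinuq is spoken in Dagestan."

# Each layer only stores offsets into the primary data (hypothetical layers).
layers = {
    "tokens": [(0, 5), (6, 8), (9, 15), (16, 18), (19, 27)],
    # A sparse layer: only two stretches of the text are annotated.
    "named_entities": [(0, 5, "language"), (19, 27, "place")],
}


def resolve(span, text):
    """Look up the text a stand-off span refers to."""
    start, end = span[0], span[1]
    return text[start:end]


tokens = [resolve(span, primary) for span in layers["tokens"]]
entities = [(resolve(span, primary), span[2])
            for span in layers["named_entities"]]

print(tokens)
print(entities)
```

Because layers are independent, one of them can be shared, replaced or archived without touching the others or the primary data.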

Poio API

Think of GrAF as an assembly language for linguistic annotation; then Poio API is a library to map to and from higher-level languages

Subset of GrAF to represent tier based annotation

Filters and filter chains for search

Plugin mechanism for file formats

Mapping semantics: tiers and annotations to nodes and edges

Efforts to map between TEI and GrAF

Retro-digitized dictionary data at University of Marburg are published as GrAF files

We want to publish as TEI
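The mapping from tiers and annotations to nodes and edges mentioned above can be sketched roughly as follows. This is a simplified illustration, not Poio API's actual mapping: the input dictionary, the node ID scheme and the function `tiers_to_graph` are all hypothetical. The assumption is that every annotation on a tier becomes a node and every parent-child relation between tiers becomes an edge.

```python
# Hypothetical tier-based input: one utterance, its words, and a
# part-of-speech annotation that depends on each word.
tier_data = {
    "utterance": "they follow me",
    "words": [
        {"word": "they", "pos": "pro"},
        {"word": "follow", "pos": "v"},
        {"word": "me", "pos": "pro"},
    ],
}


def tiers_to_graph(data):
    """Turn tier annotations into nodes and tier dependencies into edges."""
    nodes = {}   # node id -> (tier name, annotation value)
    edges = []   # (parent node id, child node id)
    root = "utterance/0"
    nodes[root] = ("utterance", data["utterance"])
    for i, word in enumerate(data["words"]):
        word_id = "word/{}".format(i)
        pos_id = "pos/{}".format(i)
        nodes[word_id] = ("word", word["word"])
        nodes[pos_id] = ("pos", word["pos"])
        edges.append((root, word_id))     # utterance tier -> word tier
        edges.append((word_id, pos_id))   # word tier -> pos tier
    return nodes, edges


nodes, edges = tiers_to_graph(tier_data)
for parent, child in edges:
    print(nodes[parent], "->", nodes[child])
```

A file format plugin would then only need to deliver the tier hierarchy and the annotations; the graph construction stays the same for every format.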

Queries in GrAF API

All queries are in-memory

Users can load parts of the full graph

Annotation graph to network conversion

Python library networkx

Example: Semantic similarity
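As a rough illustration of the conversion step, the sketch below builds a small network from head word and translation pairs and compares two head words by the translations they share. Several things are assumptions here: the word pairs are invented, the network is a plain dictionary of neighbour sets so that the example has no dependencies, and Jaccard overlap is only one simple similarity measure, not necessarily the one used in the talk.

```python
from collections import defaultdict

# Hypothetical (head word, translation) pairs, as they might be read from
# the edges of a dictionary annotation graph.
pairs = [
    ("casa", "house"), ("casa", "home"),
    ("lar", "home"), ("lar", "hearth"),
    ("rio", "river"),
]


def build_network(pairs):
    """Build an undirected network as a mapping from node to neighbours."""
    neighbours = defaultdict(set)
    for head, translation in pairs:
        neighbours[("head", head)].add(("translation", translation))
        neighbours[("translation", translation)].add(("head", head))
    return neighbours


def jaccard(network, a, b):
    """Share of translations that two head words have in common."""
    na = network.get(("head", a), set())
    nb = network.get(("head", b), set())
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


network = build_network(pairs)
print(jaccard(network, "casa", "lar"))
print(jaccard(network, "casa", "rio"))
```

With networkx installed, the same pairs could be loaded into a `networkx.Graph` through `add_edges_from`, which gives access to the library's graph algorithms.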

Queries in GrAF API

for (node_id, node) in graf_graph.nodes.items():
    if node_id.endswith("entry"):
        for e in node.out_edges:
            if e.annotations.get_first().label == "head" or \
                    e.annotations.get_first().label == "translation":
                features = e.to_node.annotations.get_first().features
                substr = features.get_value("substring")
                [...]

Queries in Poio API

Example: Word order in Hinuq

Queries in Poio API

import collections

# from_excel is taken over from the slides; its import is not shown there
ag = from_excel("data/Hinuq2.csv")
clause_unit_nodes = ag.nodes_for_tier("clause_id")

verbs = [ 'COP', 'cop', 'SAY', 'say', 'v.tr', 'v.intr', 'v.aff' ]
others = [ 'A', 'S', 'P', 'EXP', 'STIM' ]
search_terms = verbs + others

# Maps each word order (a tuple of labels) to its number of occurrences
word_orders = collections.defaultdict(int)

for parent_node in clause_unit_nodes:
    word_order = []
    for word_n in parent_node.iter_children():
        a_list = ag.annotations_for_tier("grammatical_relation", word_n)
        if len(a_list) > 0:
            a_value = ag.annotation_value_for_annotation(a_list[0])
            if a_value in search_terms:
                # All verb labels are collapsed into a single 'V'
                if a_value in verbs:
                    word_order.append('V')
                else:
                    word_order.append(a_value)
    # One count per clause unit
    word_orders[tuple(word_order)] += 1
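The counting logic can be tried out without the Hinuq data. The sketch below is a self-contained variant under stated assumptions: the clause units are invented and reduced to plain lists of grammatical-relation labels, and the helper `word_order` is hypothetical. It then lists the word orders by frequency.

```python
from collections import Counter

VERBS = {"COP", "cop", "SAY", "say", "v.tr", "v.intr", "v.aff"}
OTHERS = {"A", "S", "P", "EXP", "STIM"}

# Hypothetical clause units: the grammatical-relation labels of the words
# in order; None stands for a word without such an annotation.
clauses = [
    ["A", "P", "v.tr"],
    ["S", None, "v.intr"],
    ["A", "P", "v.tr"],
    ["P", "A", "other", "v.tr"],
]


def word_order(labels):
    """Reduce a clause to its word order, collapsing all verbs into 'V'."""
    order = []
    for label in labels:
        if label in VERBS:
            order.append("V")
        elif label in OTHERS:
            order.append(label)
    return tuple(order)


counts = Counter(word_order(clause) for clause in clauses)

# Most frequent word orders first
for order, n in counts.most_common():
    print(" ".join(order), n)
```

Labels outside the search terms, and words without an annotation, are skipped, so they do not affect the resulting word order.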

Filters and filter chains

import poioapi.annotationgraph
import poioapi.data

# Load an ELAN file and select the first tier hierarchy
ag = poioapi.annotationgraph.AnnotationGraph()
ag.from_elan("elan-example3.eaf")
ag.structure_type_handler = poioapi.data.DataStructureType(ag.tier_hierarchies[0])

# Build a filter with one search pattern per tier and append it to the chain
af = poioapi.annotationgraph.AnnotationGraphFilter(ag)
af.set_filter_for_tier("words..W-Words", "follow")
af.set_filter_for_tier("part_of_speech..W-POS", r"\bpro\b")
ag.append_filter(af)

print("Filtered root nodes:")
print(ag.filtered_node_ids)

# The same search terms, passed as a dict
search_terms = {
    "words..W-Words": "follow",
    "part_of_speech..W-POS": r"\bpro\b"
}
af = ag.create_filter_for_dict(search_terms)
ag.append_filter(af)
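How a filter chain narrows down a set of root units can be sketched without Poio API. The code below is an illustration only and does not reproduce the library's implementation: the units, the tier names and the helpers `make_filter` and `apply_chain` are hypothetical. It assumes that a unit passes a filter when every tier pattern matches, and that each filter in the chain only sees the units that passed the previous one.

```python
import re

# Hypothetical root units: tier name -> annotation values as one string.
units = {
    "u1": {"words": "they follow me", "pos": "pro v pro"},
    "u2": {"words": "dogs follow cats", "pos": "n v n"},
    "u3": {"words": "she sleeps", "pos": "pro v"},
}


def make_filter(patterns):
    """Create a filter from a mapping of tier name -> regular expression."""
    compiled = {tier: re.compile(p) for tier, p in patterns.items()}

    def matches(unit):
        # Assumption: all tier patterns have to match (logical AND).
        return all(rx.search(unit.get(tier, ""))
                   for tier, rx in compiled.items())

    return matches


def apply_chain(units, chain):
    """Apply the filters one after another; return the remaining unit IDs."""
    ids = list(units)
    for unit_filter in chain:
        ids = [i for i in ids if unit_filter(units[i])]
    return ids


chain = [make_filter({"words": "follow"}), make_filter({"pos": r"\bpro\b"})]
print(apply_chain(units, chain))
```

Chaining two single-tier filters gives the same result here as one filter with both patterns; keeping them separate makes it possible to remove the last filter again and return to the previous result set.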

Poio Analyzer

Developed for and with Prof. Johannes Helmbrecht, University of Regensburg

How to query the corpus in order to write a descriptive grammar?

Started with a list of requirements

Need to publish and archive queries and results

Thank you for your attention!

[email protected]

Links

Clarin curation project: http://de.clarin.eu/en/discipline-specific-working-groups/wg-3-linguistic-fieldwork-anthropology-language-typology/curation-project-1.html

Poio: http://media.cidles.eu/poio/

GrAF: http://www.xces.org/ns/GrAF/1.0/