18
Digital Editions & Language Resources Portal Workshop - Save the data, 2. 12. 2014, Wien Matej Ďurčo ICLTT/ACDH, ÖAW matej.durco @oeaw.ac.at

Digital Editions & Language Resources Portal Workshop - Save the data, 2. 12. 2014, Wien Matej Ďurčo ICLTT/ ACDH, ÖAW [email protected]

Embed Size (px)

Citation preview

Digital Editions &Language Resources Portal

Workshop - Save the data, 2. 12. 2014, WienMatej ĎurčoICLTT/ACDH, Ö[email protected]

What kind of data?

2

TEXT

http://www

Dictionaries• Persian – English Dictionary• German – Russian Dictionary• Dictionary of Bavarian dialects in Austria

Cooperation with Austrian dictionary and the dictionary of German variantsFull word-form corpus-based lexical database of German

Databases e.g. prosopographic, bibliographic data, …

Audio – speech recordings (project Tunico)

3

What kind of data?

Sources: plain text, images (need OCR), Word documents (need conversion), audio (need transcribing), digitally born - web! (needs cleaning)Multi-level enrichment:Structural markup, linguistic / semantic annotation (stand-off)

Linking:• Combining lexicographic material with information from

corpora (encoding in TEI)• semantic representation of lexicographic resources in

RDF

Audio with aligned transcription

Complexity, Formats

4

XML TEI

qualitative vs. quantitative K. Kraus „Die Fackel“ (1899 – 1936)~ 22.500 pages, ~ 6 mio. tokensAAC ~ 500 mio. tokens + facsimiles 40TB!AMC ~ 8 billion tokens in over 35 mio. articles of recent journalistic texts (complete newspapers & magazines in Austria over last 20 years!) – 100 GB23 000 entries prosopographic databaseA number of smaller editions/corpora5 – 50 works/resources, rich annotation, < 100.000 tMultiple dictionaries with a few thousand entries

Size?

5

Bibliographic informationencoded as teiHeaderCMDI – Metadata Infrastructure used within CLARINAllows for flexible „profiles“ specific to the type of resource and project/context- Lexical Resource- TextCorpus- Collection- teiHeader (emulated in CMDI)

Metadata

6

Varying combinations of:full-text searchsemantic search (search for persons, places, search by categories and classifications) full-view (e.g. text and facsimile of individual pages)specialized visualizations (temporal, spatial, graph)raw data available for downloadstable references to resources and resource fragments

BUT before publication: collaborative editing VRE !

Requirements on online availability

7

Publishing framework: corpus_shell Repository for digital objects (Fedora-based)Viennese Lexicographic Editor Collaborative environment for lexicographic work

oXygen, XML-database eXist Apache Solr, Sketch Engine (/NoSke), DDC for fast advanced (linguistic) search capabilitiesMost recently: Language Resources Portal

Solutions

8

„under construction“but many real services already available

Stable organisational structure has been set upgeneral assembly, board of directors, national coordinators, thematic committees, …

Network of Centres(real ones with computing and storage – not virtual)• Certification process (centre assessment)• Typ: A (infrastructure), B (LRT data/services), C

(metadata)• currently 14 centres certified (+ 4 pending)• Coordinated through SCCTC

Standing Committee on CLARIN Technical Centres9

CLARINEuropean Research Infrastructures

Federated IdentityAAI, Single-Sign-on

Persistent IdentifierCMDI – Component Metadata Infrastructureflexible framework for creation and publication of metadata

FCS – Federated Content Searchdistributed system for searching in the content of the resources (corpora, …)

Fostering the use of standardsCLARIN Standards Committee (SCS)

10

InfrastructureCLARIN

modular framework for publishing a wide range of language resources designed to operate in a distributed and heterogeneous environment

distributed setup FCS-basedintegration with CLARIN metadata infrastructure(reusing) specialized resource viewers for specific types of resourcesmultiple implementations (php, perl, XQuery)cooperation/integration with SADEScalable Architecture for Digital Editions, BBAW, Berlin

open source (code on github)11

corpus_shellPublishing Framework

12

vicavcorpus_shell instances

13

ABaC:uscorpus_shell instances

Dictionary Server• Open and freely available software that can be readily

distributed (MySQL, PHP)• Integrated with corpus_shell (FCS as common protocol)• Connected to the clients through a REST-style web

service

Vienna Lexicographic EditorThe corresponding client

DictGate• a service for hosting lexicographic data • for smaller lexicographic projects

14

Lexicography suite

Viennese Lexicographic Editor (VLE)• XML editor specialized for editing lexicographic data• Generic – support for any (XML) format (LMF, TEI, TBX, RDF)• Making use of cognate technologies (XSLT, XPath, XSD)• Various editing modes, configurable keyboard layouts• Optimised corpus-dictionary interface• On-the-fly data visualisations

15

Lexicography suite

First Austrian node in the network of CLARIN CentresDSA (Data Seal of Approval)and CLARIN Centre B status April 2014Language Resources PortalMission: National depositing and publishing servicefor digital language resources

Toolscorpus_shell, lexicographic suite, …

Infrastructure Services - „Knowledge Hub“mostly about metadata (under development)

16

CLARIN Centre Viennaclarin.oeaw.ac.at

17

CLARIN Centre Viennaclarin.oeaw.ac.at

Matej ĎurčoAustrian Centre for Digital Humanities

Österreichische Akademie der [email protected]

Thank you!