Describing and Discovering Language Resources David Illsley, Ewan Klein, Steve Renals School of Informatics University of Edinburgh


Overview

• Goals: availability and interoperability
• Service oriented architecture and workflow
• NLP Components
• Service description and discovery
• NLP and the Grid

What are Language Resources?

• Language Resources (LRs) of two kinds:
• Static resources:
– corpora (text, speech, multimodal)
– lexicons, terminologies, ontologies
– grammars, declarative rule-sets
• Processing resources:
– segmenters, tokenizers, zoners, taggers, entity classifiers, chunkers, parsers, …

Goals

• Maximize availability of static LRs for automatic processing

• Maximize interoperability of processing LRs

LRs on the WWW, 1

• Can use the WWW to locate corpora
• Example: OLAC (Open Language Archives Community)
– Provides query interface to search for corpora across multiple repositories
– Requires standard metadata record for harvesting
– Does not provide access to corpora

LRs on the WWW, 2

• Can use the WWW to directly search corpora
• Many examples
• BNC Online Search
– words (with regular expressions)
– tag strings

• Typically search is limited (expressiveness, number of results)

LRs on the WWW, 3

• Can use the WWW to download tools
• Some tools offer a demo web interface
• No interoperability:

– you cannot take the output of one web-interfaced tool and feed it as input to another tool

LRs on the WWW, 4

• Challenges for accessing static LRs for automatic processing:
– licensing restrictions
– file (or database) structure
– data format
– data transfer
• What about processing LRs?
– can download,
– but not execute in an interoperable manner

Web Services (WS)

• WS is a self-contained software resource
• Can be located and invoked across the web:
– identified by a URL
– public interfaces and bindings are defined and described using XML
• Other applications interact with it in a prescribed manner
– XML-based messages conveyed by internet protocols (e.g. HTTP)

• Web services can be composed into complex, distributed applications

Service Oriented Architecture (SOA)

[Figure: the SOA triangle. Roles: Service Requester (client), Service Provider (service, service description), Discovery Agencies (service descriptions). Operations: publish, locate, interact. Source: Berners-Lee]

Web Service: Key Ideas

• Interaction with Web Services is both described by and conducted using XML documents exchanged over the internet
• SOAP protocol
– describes the form of messages and how to process them
– a way of representing Remote Procedure Calls over HTTP
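
To make the SOAP bullet concrete, here is a minimal sketch in Python that wraps a remote procedure call as a SOAP 1.1 envelope. The envelope namespace is the standard SOAP 1.1 one; the `tokenize` operation, its `text` parameter and the `urn:example:nlp` namespace are hypothetical names chosen for illustration, not taken from the talk.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
NLP_NS = "urn:example:nlp"  # hypothetical service namespace (assumption)


def build_request(operation: str, params: dict) -> bytes:
    """Wrap a remote procedure call as a SOAP 1.1 envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # The call itself is one element in the Body, with one child per argument.
    call = ET.SubElement(body, f"{{{NLP_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{NLP_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


request = build_request("tokenize", {"text": "Colourless green ideas sleep."})
```

The resulting bytes would typically be sent as the body of an HTTP POST to the service URL.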

The Appeal of Web Services

• A means of building distributed systems
• virtualization: not dependent on any one programming language, OS, development environment

• based on well-understood underlying protocols

• components can be developed independently

• decentralized (apart from DNS)

NLP Services

• Fairly easy to wrap legacy code as web services

• Allows us to deploy tools across the web as part of a larger application

• Corpora can also be deployed as services

• Helps with availability and interoperability

• But still many challenges

Building NLP Applications

• Many NLP applications involve relatively few ‘conceptual’ components:
– tokenizers, taggers, named entity recognizers, parsers, etc.
– often different versions of the same components
– much repeated (and messy) labour in wiring the components together to interoperate

Issues in Component Approach

• Granularity
– What is appropriate ‘grain size’ of functionality?
    • Too fine: heavy overheads in communication, lose ease of use
    • Too gross: loss of flexibility
    • Hierarchical decomposition is possible
• Compatibility
– informational, functional, formal

Linguistic Annotation

• Makes information in raw text explicit:
– Classification of words and phrases
– Detection of structural relationships
– Annotation with general and domain-specific semantic labels

• Usually proceeds from more concrete to more abstract

• Earlier stages of annotation feed into the later stages

• Assumed that annotation is represented as XML

Idealized View

Compatible NLP Services: Substitution

[Figure: a tokenize, POS tag, parse pipeline with alternative POS tag services shown for the middle stage]

Compatible NLP Services: Sequencing

[Figure: tokenize, POS tag and parse services chained in sequence]
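
Substitution and sequencing can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: services are plain local functions, and the two taggers (`tag_all_nouns`, `tag_by_suffix`) are toy stand-ins invented here, not real tools.

```python
from functools import reduce
from typing import Callable, List, Tuple

Token = Tuple[str, str]  # (word, POS tag); the tag is "" until a tagger runs


def tokenize(text: str) -> List[Token]:
    return [(word, "") for word in text.split()]


def tag_all_nouns(tokens: List[Token]) -> List[Token]:
    """Toy tagger A: labels every word NN."""
    return [(word, "NN") for word, _ in tokens]


def tag_by_suffix(tokens: List[Token]) -> List[Token]:
    """Toy tagger B: same interface, different behaviour."""
    return [(word, "VBG" if word.endswith("ing") else "NN") for word, _ in tokens]


def sequence(*stages: Callable) -> Callable:
    """Sequencing: feed each stage's output to the next stage."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)


# Substitution: either tagger fits the same slot in the pipeline.
pipeline_a = sequence(tokenize, tag_all_nouns)
pipeline_b = sequence(tokenize, tag_by_suffix)
```

Both pipelines accept the same input and return the same type of output, which is what makes the taggers interchangeable.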

WSDL File

• XML document, usually on same machine as server
• Describes everything involved in calling a web service:
– The service URL and namespace
– The type of web service
– List of available functions
– Arguments for each function
– Data type of each argument
– Return value of each function and data type of each return value

Processor Input and Output Types

• Composition of NL processors constrained by input and output types

• Candidates for types?
• WSDL provides simple data types:
– strings, integers, booleans
– not expressive enough

• Can we build on notion of metadata for LRs?

IMDI Catalogue Specification

Catalogue.Title                     Arabic Treebank
Catalogue.Subject-Language          ara
Catalogue.Content-Type              written
Catalogue.Format.Text               UTF-8
Catalogue.Smallest Annotation Unit  word
Catalogue.Publisher                 LDC
Catalogue.Size                      266 Mb

LR Metadata Standards

• Advantages
– consistency
– software knows what to expect
– can be designed according to agreed principles
• Challenges
– no generally agreed ontology for LRs
– hard to get agreement (and who gets to decide?)
– categorizations of LRs influenced by favourite linguistic theory

• Other people are addressing this issue

What’s missing: tool metadata

• What kind of metadata would enable us to ensure tool interoperability?

• Neither OLAC nor IMDI provides an answer.

Discovering Resources

• Who cares about discovering LRs?
– researchers who are searching for LRs that meet specific research criteria
– information providers
– teachers, journalists, casual browsers
– …

• Current focus: automatic discovery by software agents

Service Description & Discovery

• What LRs can be discovered depends on how the LRs are described.

• How LRs are described depends on the requirements for discovery.

• Composability:
– If an agent (human or software) has already selected component P, what other components Q can provide well-formed input to P?

– Query for all Q such that Q’s output type is compatible with P’s input type
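
The composability query can be sketched as follows. This is a minimal illustration that assumes each processor is described by the set of annotation layers it requires and the set it adds; the `Processor` record, the layer names and the `providers_for` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Processor:
    name: str
    requires: FrozenSet[str]  # annotation layers needed on the input
    adds: FrozenSet[str]      # annotation layers added to the output

    @property
    def output(self) -> FrozenSet[str]:
        # Output keeps the input layers and adds the new ones.
        return self.requires | self.adds


def providers_for(p: Processor, registry: List[Processor]) -> List[Processor]:
    """All Q whose output carries every annotation layer that P requires."""
    return [q for q in registry if q is not p and p.requires <= q.output]


tokenizer = Processor("tokenizer", frozenset(), frozenset({"word"}))
tagger = Processor("tagger", frozenset({"word"}), frozenset({"pos"}))
parser = Processor("parser", frozenset({"word", "pos"}), frozenset({"syntax"}))
registry = [tokenizer, tagger, parser]
```

Here "compatible" is read as set inclusion over annotation layers; a richer description language would replace that test with subsumption over types.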

Some Versions of BNC

name: British National Corpus, Version 1.0
type: text
size: 2866 MB

name: British National Corpus, Version 1.0, marked up in XML
type: text
size: 815 MB

name: British National Corpus, Version 1.0, parsed with Charniak parser
type: text
size: 419 MB

name: British National Corpus, Version 1.0, parsed with IMS parser
type: text
size: 2088 MB

name: British National Corpus, Version 1.0, parsed with Minipar
type: text
size: 448 MB

Corpus Request Scenario

• Agent A requests corpus C with property [key = val].

• If C with [key = val] exists, serve it to A.
• Otherwise,
– find processor P such that output of P(C) satisfies [key = val]
– apply P to C
– serve result to A
– store result for future requests
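
The request scenario amounts to "serve if present, otherwise generate and cache". A rough sketch, assuming a toy in-memory store and a single hypothetical processor; the key layout, the `annotation` property and the convention that the raw corpus is stored under `annotation = none` are all assumptions made for the example.

```python
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[str, str, str]  # (corpus name, property key, property value)

# Stored corpora, indexed by the property they satisfy (toy in-memory store).
store: Dict[Key, str] = {("C", "annotation", "none"): "dogs bark"}

# Hypothetical processors, indexed by the property their output satisfies.
processors: Dict[Tuple[str, str], Callable[[str], str]] = {
    ("annotation", "word"): lambda raw: " ".join(f"<w>{t}</w>" for t in raw.split()),
}


def request(corpus: str, key: str, val: str) -> Optional[str]:
    if (corpus, key, val) in store:           # C with [key = val] exists
        return store[(corpus, key, val)]
    process = processors.get((key, val))      # find P whose output satisfies it
    if process is None:
        return None                           # nothing can produce the request
    result = process(store[(corpus, "annotation", "none")])  # apply P to C
    store[(corpus, key, val)] = result        # store result for future requests
    return result
```

A second request for the same property is then answered from the store without running the processor again.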

Service Description

• Standard approach
– WSDL: describes service inputs/outputs in terms of simple data types
– Doesn’t support semantically-based service discovery
• Alternatives from Semantic Web

– inputs and outputs specified in an ontology language

– OWL and RDF both possible

NLP as Document Annotation

• NL Processor
– takes a partially annotated document as input
– yields a more richly annotated document as output

Tagging as document annotation

• Part of Speech Tagger
– takes in a document with markup of words
– yields a document with additional markup of part of speech
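
A small sketch of tagging as document annotation, assuming annotation is represented as XML. The element names `s` and `w`, the `pos` attribute and the three-word lexicon are assumptions made for the example; a real tagger would replace the lookup.

```python
import xml.etree.ElementTree as ET

# Toy lexicon standing in for a real tagger model (assumption).
LEXICON = {"the": "DT", "dog": "NN", "barks": "VBZ"}


def pos_tag(document: ET.Element) -> ET.Element:
    """Add a pos attribute to every <w> element, leaving other markup intact."""
    for word in document.iter("w"):
        word.set("pos", LEXICON.get(word.text, "UNK"))
    return document


doc = ET.fromstring("<s><w>the</w><w>dog</w><w>barks</w></s>")
tagged = ET.tostring(pos_tag(doc), encoding="unicode")
```

The input already carries word markup; the output is the same document with one more layer of annotation.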

Document Class

NB This is just corpus metadata!

Subsumption over the Document class

Subsumption over Processors
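
One possible reading of subsumption over documents and processors, sketched with Python classes. The class hierarchy and the rule in `subsumes` (accept at least the same inputs, promise at least as specific an output) are assumptions made for illustration; the talk's own ontology is not reproduced here.

```python
from dataclasses import dataclass


class Document:
    """Any document, annotated or not."""


class TokenizedDocument(Document):
    """Document with word markup."""


class TaggedDocument(TokenizedDocument):
    """Tokenized document with part-of-speech markup."""


class ParsedDocument(TaggedDocument):
    """Tagged document with syntactic markup."""


@dataclass(frozen=True)
class ProcessorType:
    input: type
    output: type


def subsumes(general: ProcessorType, specific: ProcessorType) -> bool:
    """True if `specific` can stand in wherever `general` is requested."""
    accepts_enough = issubclass(general.input, specific.input)
    promises_enough = issubclass(specific.output, general.output)
    return accepts_enough and promises_enough


tagger_type = ProcessorType(TokenizedDocument, TaggedDocument)
parser_type = ProcessorType(Document, ParsedDocument)
```

Under this hierarchy a request for a tagger is satisfied by `parser_type`, but not the other way round.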

Grid & NLP

• Parallelism
– distribute processes over many machines
– use parallel algorithms within process
– redundancy and fault tolerance
• Distributed data
– multiple corpora
– distributed annotation of single corpus
• Distributed processing pipeline
– different components hosted at different sites
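
Distributed annotation of a single corpus can be approximated on one machine with a worker pool. This is only a sketch: threads stand in for grid nodes, and the `tokenize` function is a hypothetical stand-in for a call to a remote processing service.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def tokenize(document: str) -> List[str]:
    """Stand-in for a call to a remote processing service."""
    return document.split()


corpus = ["dogs bark", "cats sleep all day", "birds sing"]

# Each document is handled by a separate worker; map() preserves input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    tokenized = list(pool.map(tokenize, corpus))
```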

Implementation

• Based on Globus Toolkit 3.2 middleware
• Corpus Services and Transformation Services provide interfaces for corpora and tools
• Service Data Elements describe properties of services
– properties are aggregated by Index Service, can be queried by clients
• Index Service extended by Model Service
– provides richer description of services using RDF triples

• Backward chaining used to construct pipelines that will produce a requested resource
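
Backward chaining over service descriptions can be sketched as below. The sketch assumes each service is described by the annotation layers it requires and the single layer it produces; the `SERVICES` table and layer names are hypothetical, and a plain dictionary stands in for the RDF triples held by the Model Service.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

# Hypothetical service descriptions: name -> (required layers, produced layer).
SERVICES: Dict[str, Tuple[FrozenSet[str], str]] = {
    "tokenizer": (frozenset(), "word"),
    "tagger": (frozenset({"word"}), "pos"),
    "parser": (frozenset({"word", "pos"}), "syntax"),
}


def plan(goal: str, available: FrozenSet[str],
         seen: FrozenSet[str] = frozenset()) -> Optional[List[str]]:
    """Work backwards from the requested layer to the layers already available."""
    if goal in available:
        return []
    if goal in seen:  # guard against circular descriptions
        return None
    for name, (requires, produces) in SERVICES.items():
        if produces != goal:
            continue
        steps: List[str] = []
        for needed in sorted(requires):
            sub = plan(needed, available, seen | {goal})
            if sub is None:
                break  # this service cannot be satisfied; try the next one
            steps += [s for s in sub if s not in steps]
        else:
            return steps + [name]
    return None
```

Calling `plan("syntax", frozenset())` yields a pipeline that starts from raw text, while passing `frozenset({"word"})` skips the tokenizer because that layer is already present.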

Summary

• Corpus query
– for user, no obvious distinction between raw and processed data
• Corpus service
– either provide existing resource, or generate it
• Need to have metadata for tools which allows automatic composition
• Metadata needs to allow subsumption matching
– using shared controlled vocabulary
