Describing and Discovering Language Resources
David Illsley, Ewan Klein, Steve Renals
School of Informatics, University of Edinburgh
Overview
• Goals: availability and interoperability
• Service oriented architecture and workflow
• NLP Components
• Service description and discovery
• NLP and the Grid
What are Language Resources?
• Language Resources (LRs) are of two kinds:
• Static resources:
– corpora (text, speech, multimodal)
– lexicons, terminologies, ontologies
– grammars, declarative rule-sets
• Processing resources:
– segmenters, tokenizers, zoners, taggers, entity classifiers, chunkers, parsers, …
Goals
• Maximize availability of static LRs for automatic processing
• Maximize interoperability of processing LRs
LRs on the WWW, 1
• Can use the WWW to locate corpora
• Example: OLAC (Open Language Archive Community)
– Provides query interface to search for corpora across multiple repositories
– Requires standard metadata record for harvesting
– Does not provide access to corpora
LRs on the WWW, 2
• Can use the WWW to directly search corpora
• Many examples
• BNC Online Search
– words (with regular expressions)
– tag strings
• Typically search is limited (expressiveness, number of results)
LRs on the WWW, 3
• Can use the WWW to download tools
• Some tools offer a demo web interface
• No interoperability:
– you cannot take the output of one web-interfaced tool and feed it as input to another tool
LRs on the WWW, 4
• Challenges for accessing static LRs for automatic processing:
– licensing restrictions
– file (or database) structure
– data format
– data transfer
• What about processing LRs?
– can download,
– but not execute in an interoperable manner
Web Services (WS)
• A WS is a self-contained software resource
• Can be located and invoked across the web:
– identified by a URL
– public interfaces and bindings are defined and described using XML
• Other applications interact with it in a prescribed manner
– XML-based messages conveyed by internet protocols (e.g. HTTP)
• Web services can be composed into complex, distributed applications
Service Oriented Architecture (SOA)
[Figure: SOA diagram on the WWW. Roles: Service Requester (client), Service Provider (service), Discovery Agencies (service description). Operations: publish, locate, interact. Source: Berners-Lee]
Web Service: Key Ideas
• Interaction with Web Services is both described by and conducted using XML documents exchanged over the internet
• SOAP protocol
– describes the form of messages and how to process them
– a way of representing Remote Procedure Calls over HTTP
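
As a rough illustration of a Remote Procedure Call represented as an XML message, the sketch below builds and reads back a SOAP 1.1 envelope using only the Python standard library. The operation name `tokenize`, the `text` argument and the `http://example.org/nlp` namespace are assumptions made for the example; they do not come from any real service, and the HTTP transport is left out.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
NLP_NS = "http://example.org/nlp"  # hypothetical service namespace (assumption)


def build_request(operation: str, text: str) -> bytes:
    """Wrap a procedure call and its single argument in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{NLP_NS}}}{operation}")
    ET.SubElement(call, f"{{{NLP_NS}}}text").text = text
    return ET.tostring(envelope, encoding="utf-8")


def read_request(message: bytes) -> tuple[str, str]:
    """Recover the operation name and argument, as a receiving service would."""
    root = ET.fromstring(message)
    call = root.find(f"{{{SOAP_ENV}}}Body")[0]  # first child of Body is the call
    operation = call.tag.split("}", 1)[1]       # strip the "{namespace}" prefix
    return operation, call.find(f"{{{NLP_NS}}}text").text


message = build_request("tokenize", "Colourless green ideas sleep furiously.")
print(message.decode("utf-8"))
print(read_request(message))
```

In a deployed service the same bytes would be sent as the body of an HTTP POST; here the two functions simply stand in for the requester and the provider.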
The Appeal of Web Services
• A means of building distributed systems
• virtualization: not dependent on any one programming language, OS, development environment
• based on well-understood underlying protocols
• components can be developed independently
• decentralized (apart from DNS)
NLP Services
• Fairly easy to wrap legacy code as web services
• Allows us to deploy tools across the web as part of a larger application
• Corpora can also be deployed as services
• Helps with availability and interoperability
• But still many challenges
Building NLP Applications
• Many NLP applications involve relatively few ‘conceptual’ components:
– tokenizers, taggers, named entity recognizers, parsers, etc.
– often different versions of the same components
– much repeated (and messy) labour in wiring the components together to interoperate
Issues in Component Approach
• Granularity
– What is appropriate ‘grain size’ of functionality?
• Too fine: heavy overheads in communication, lose ease of use
• Too gross: loss of flexibility
• Hierarchical decomposition is possible
• Compatibility
– informational, functional, formal
Linguistic Annotation
• Makes information in raw text explicit:
– Classification of words and phrases
– Detection of structural relationships
– Annotation with general and domain-specific semantic labels
• Usually proceeds from more concrete to more abstract
• Earlier stages of annotation feed into the later stages
• Assumed that annotation is represented as XML
Idealized View
Compatible NLP Services: Substitution
[Figure: pipeline tokenize → POS tag → parse, shown with alternative POS tag components]
Compatible NLP Services: Sequencing
[Figure: tokenize, POS tag and parse components composed in sequence]
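
Substitution and sequencing can be sketched with ordinary function composition. In this minimal example a document is modelled as a dict of annotation layers, and the two taggers and the `parse` stand-in are toy components invented for illustration; all names are assumptions, not the authors' services.

```python
from typing import Callable

Document = dict  # assumption: a document is a dict of annotation layers
Processor = Callable[[Document], Document]


def tokenize(doc: Document) -> Document:
    return {**doc, "tokens": doc["text"].split()}


def tag_by_suffix(doc: Document) -> Document:
    """Toy tagger: words ending in -ing are VBG, everything else NN."""
    tags = ["VBG" if tok.endswith("ing") else "NN" for tok in doc["tokens"]]
    return {**doc, "pos": tags}


def tag_all_nouns(doc: Document) -> Document:
    """Alternative toy tagger with the same input and output layers."""
    return {**doc, "pos": ["NN"] * len(doc["tokens"])}


def parse(doc: Document) -> Document:
    """Stand-in for a parser: pairs each token with its tag."""
    return {**doc, "parse": list(zip(doc["tokens"], doc["pos"]))}


def sequence(*steps: Processor) -> Processor:
    """Compose processors so that each one consumes the previous one's output."""
    def run(doc: Document) -> Document:
        for step in steps:
            doc = step(doc)
        return doc
    return run


# Sequencing: tokenize, then POS tag, then parse.
pipeline_a = sequence(tokenize, tag_by_suffix, parse)
# Substitution: swap in a different tagger without touching its neighbours.
pipeline_b = sequence(tokenize, tag_all_nouns, parse)

print(pipeline_a({"text": "dogs barking"})["parse"])
print(pipeline_b({"text": "dogs barking"})["parse"])
```

The two taggers are interchangeable only because they read and write the same layers, which is the compatibility condition the slides are concerned with.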
WSDL File
• XML document, usually on same machine as server
• Describes everything involved in calling a web service:
– The service URL and namespace
– The type of web service
– List of available functions
– Arguments for each function
– Data type of each argument
– Return value of each function and data type of each return value
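
To make the list above concrete, the following sketch reads a small WSDL 1.1 fragment and lists each function with the names and data types of its arguments and return value. The `TaggerService` description is a made-up example (an assumption, not a real service), and only the `message` and `portType` sections are handled.

```python
import xml.etree.ElementTree as ET

WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}  # WSDL 1.1 namespace

# Hypothetical service description, invented for this example.
WSDL_TEXT = """<?xml version="1.0"?>
<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:tns="http://example.org/nlp"
             xmlns:xsd="http://www.w3.org/2001/XMLSchema"
             name="TaggerService"
             targetNamespace="http://example.org/nlp">
  <message name="TagRequest"><part name="text" type="xsd:string"/></message>
  <message name="TagResponse"><part name="tagged" type="xsd:string"/></message>
  <portType name="TaggerPort">
    <operation name="tag">
      <input message="tns:TagRequest"/>
      <output message="tns:TagResponse"/>
    </operation>
  </portType>
</definitions>
"""


def local_name(qname: str) -> str:
    """Drop the namespace prefix from a reference such as 'tns:TagRequest'."""
    return qname.split(":", 1)[-1]


def describe(wsdl_text: str) -> dict:
    """Map each operation name to the (name, type) pairs of its input and output."""
    root = ET.fromstring(wsdl_text)
    parts = {
        msg.get("name"): [(p.get("name"), p.get("type"))
                          for p in msg.findall("wsdl:part", WSDL_NS)]
        for msg in root.findall("wsdl:message", WSDL_NS)
    }
    operations = {}
    for op in root.findall("wsdl:portType/wsdl:operation", WSDL_NS):
        request = op.find("wsdl:input", WSDL_NS).get("message")
        response = op.find("wsdl:output", WSDL_NS).get("message")
        operations[op.get("name")] = {
            "input": parts[local_name(request)],
            "output": parts[local_name(response)],
        }
    return operations


print(describe(WSDL_TEXT))
```

Note that both the argument and the return value are typed only as `xsd:string`: the description says nothing about whether the string holds raw text, tokenized text or tagged text.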
Processor Input and Output Types
• Composition of NL processors constrained by input and output types
• Candidates for types?
• WSDL provides simple data types:
– strings, integers, booleans
– not expressive enough
• Can we build on notion of metadata for LRs?
IMDI Catalogue Specification
Catalogue.Title                     Arabic Treebank
Catalogue.Subject-Language          ara
Catalogue.Content-Type              written
Catalogue.Format.Text               UTF-8
Catalogue.Smallest Annotation Unit  word
Catalogue.Publisher                 LDC
Catalogue.Size                      266 Mb
LR Metadata Standards
• Advantages
– consistency
– software knows what to expect
– can be designed according to agreed principles
• Challenges
– no generally agreed ontology for LRs
– hard to get agreement (and who gets to decide?)
– categorizations of LRs influenced by favourite linguistic theory
• Other people are addressing this issue
What’s missing: tool metadata
• What kind of metadata would enable us to ensure tool interoperability?
• Neither OLAC nor IMDI provides an answer.
Discovering Resources
• Who cares about discovering LRs?
– researchers who are searching for LRs that meet specific research criteria
– information providers
– teachers, journalists, casual browsers
– …
• Current focus: automatic discovery by software agents
Service Description & Discovery
• What LRs can be discovered depends on how the LRs are described.
• How LRs are described depends on the requirements for discovery.
• Composability:
– If an agent (human or software) has already selected component P, what other components Q can provide well-formed input to P?
– Query for all Q such that Q’s output type is compatible with P’s input type
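
The composability query can be sketched as a filter over a registry of service descriptions. Here a type is modelled, as a simplifying assumption, as the set of annotation layers a document carries, and Q's output counts as compatible with P's input when it contains every layer P requires. The registry contents are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    requires: frozenset  # annotation layers the service expects in its input
    provides: frozenset  # annotation layers present in its output


def feeders(p: Service, registry: list) -> list:
    """All Q in the registry whose output type is compatible with P's input type."""
    return [q for q in registry if q is not p and p.requires <= q.provides]


# Hypothetical registry entries.
tokenizer = Service("tokenizer", frozenset({"text"}), frozenset({"text", "tokens"}))
tagger = Service("tagger", frozenset({"tokens"}), frozenset({"text", "tokens", "pos"}))
parser = Service("parser", frozenset({"tokens", "pos"}),
                 frozenset({"text", "tokens", "pos", "parse"}))
registry = [tokenizer, tagger, parser]

# Which components can provide well-formed input to the parser?
print([q.name for q in feeders(parser, registry)])
```

Set inclusion is only one possible reading of "compatible"; an ontology-based description would replace the `<=` test with a subsumption check.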
Some Versions of BNC
name: British National Corpus, Version 1.0
type: text
size: 2866 MB

name: British National Corpus, Version 1.0, marked up in XML
type: text
size: 815 MB

name: British National Corpus, Version 1.0, parsed with Charniak parser
type: text
size: 419 MB

name: British National Corpus, Version 1.0, parsed with IMS parser
type: text
size: 2088 MB

name: British National Corpus, Version 1.0, parsed with Minipar
type: text
size: 448 MB
Corpus Request Scenario
• Agent A requests corpus C with property [key = val].
• If C with [key = val] exists, serve it to A.
• Otherwise,
– find processor P such that output of P(C) satisfies [key = val]
– apply P to C
– serve result to A
– store result for future requests
Service Description
• Standard approach
– WSDL: describes service inputs/outputs in terms of simple data types
– Doesn’t support semantically-based service discovery
• Alternatives from Semantic Web
– inputs and outputs specified in an ontology language
– OWL and RDF both possible
NLP as Document Annotation
• NL Processor
– takes a partially annotated document as input
– yields a more richly annotated document as output
Tagging as document annotation
• Part of Speech Tagger
– takes in a document with markup of words
– yields a document with additional markup of part of speech
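
A minimal sketch of tagging as document annotation, assuming the XML representation mentioned earlier: the input marks up words as `<w>` elements and the output is the same document with a `pos` attribute added to each word. The element and attribute names and the one-line tagging rule are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def pos_tag(document: str) -> str:
    """Return the input document with part-of-speech markup added to each word."""
    root = ET.fromstring(document)
    for word in root.iter("w"):
        # Toy rule standing in for a real tagger (assumption).
        word.set("pos", "VBG" if word.text.endswith("ing") else "NN")
    return ET.tostring(root, encoding="unicode")


tokenized = "<s><w>dogs</w> <w>barking</w></s>"
print(pos_tag(tokenized))
```

The existing word markup is left untouched; the processor only adds information, so any component that understood the input can still read the output.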
Document Class
NB This is just corpus metadata!
Subsumption over the Document class
Subsumption over Processors
Grid & NLP
• Parallelism
– distribute processes over many machines
– use parallel algorithms within process
– redundancy and fault tolerance
• Distributed data
– multiple corpora
– distributed annotation of single corpus
• Distributed processing pipeline
– different components hosted at different sites
Implementation
• Based on Globus Toolkit 3.2 middleware
• Corpus Services and Transformation Services provide interfaces for corpora and tools
• Service Data Elements describe properties of services
– properties are aggregated by Index Service, can be queried by clients
• Index Service extended by Model Service
– provides richer description of services using RDF triples
• Backward chaining used to construct pipelines that will produce a requested resource
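
Backward chaining over service descriptions can be sketched as follows: start from the requested resource, find a service that produces it, and recursively plan for each of that service's inputs until only available resources remain. This is a generic sketch, not the Globus-based implementation; the service table and resource names are assumptions.

```python
def plan(goal, available, services, _seen=frozenset()):
    """Return an ordered list of service names producing `goal`, or None if impossible.

    `services` maps a service name to (list of required inputs, produced output).
    """
    if goal in available:
        return []                      # nothing to do: the resource already exists
    if goal in _seen:
        return None                    # avoid looping on circular dependencies
    for name, (needs, produces) in services.items():
        if produces != goal:
            continue
        steps = []
        for need in needs:             # chain backwards through each input
            sub = plan(need, available, services, _seen | {goal})
            if sub is None:
                break                  # this service cannot be supplied; try another
            steps += [s for s in sub if s not in steps]
        else:
            return steps + [name]
    return None


# Hypothetical service descriptions.
SERVICES = {
    "tokenizer": (["raw"], "tokens"),
    "tagger": (["tokens"], "pos"),
    "parser": (["tokens", "pos"], "parse"),
}

print(plan("parse", {"raw"}, SERVICES))
print(plan("parse", {"tokens"}, SERVICES))
```

When a tokenized corpus is already available the planner skips the tokenizer, which is the behaviour wanted from a corpus service that reuses stored results.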
Summary
• Corpus query
– for user, no obvious distinction between raw and processed data
• Corpus service
– either provide existing resource, or generate it
• Need to have metadata for tools which allows automatic composition
• Metadata needs to allow subsumption matching
– using shared controlled vocabulary
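
Subsumption matching over a shared controlled vocabulary can be sketched with a small class hierarchy: a request for a general document class is satisfied by any service whose output belongs to that class or to a more specific one. The vocabulary and service table below are invented for illustration and are not the authors' ontology.

```python
# Hypothetical controlled vocabulary: each document class names its parent class.
PARENT = {
    "parsed-document": "pos-tagged-document",
    "pos-tagged-document": "tokenized-document",
    "tokenized-document": "document",
    "document": None,
}


def subsumes(general: str, specific: str) -> bool:
    """True if `specific` is the same class as `general` or one of its descendants."""
    while specific is not None:
        if specific == general:
            return True
        specific = PARENT[specific]  # climb one level in the hierarchy
    return False


def matching_services(required: str, services: dict) -> list:
    """Names of services whose output class is subsumed by the required class."""
    return sorted(name for name, output in services.items()
                  if subsumes(required, output))


# Hypothetical services, described by the class of document they output.
SERVICE_OUTPUTS = {
    "tokenizer": "tokenized-document",
    "tagger": "pos-tagged-document",
    "parser": "parsed-document",
}

print(matching_services("pos-tagged-document", SERVICE_OUTPUTS))
```

An exact string match on the class name would miss the parser here, even though a parsed document carries all the markup a tagged document does.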