SProUT: Shallow Processing with Unification and Typed Feature Structures

Jakub Piskorski, Language Technology Lab, DFKI GmbH

Warszawa, 6.10.2003


Information Extraction

Input text:

Munich, February 18, 1997, Siemens AG and The General Electric Company (GEC), London, have merged their UK private communication systems and networks activities to form a new company, Siemens GEC Communication Systems Limited.

Extracted template: JOINT-VENTURE FOUNDATION EVENT

VENTURE: Siemens GEC Communication Systems Limited
PARTNERS: Siemens AG, The General Electric
TIME: February 18 1997
LOCATION: Munich
PRODUCT/SERVICE: communication systems, networks activities


Finite-State based approaches

SPPC - pure finite-state based STP, small number of basic predicates

SMES – predicates inspect arbitrary properties of the input tokens/fragments

FASTUS – uses CPSL (Common Pattern Specification Language)

GATE – uses JAPE (Java Annotation Patterns Engine)


Motivation for SProUT

One System for Multilingual and Domain Adaptive Shallow Text Processing

Trade-off between efficiency and expressiveness

Modularity

Flexible integration of different processing modules

Portability

Industrial standards


Credits

SProUT is joint work by: Witold Drożdżyński, Ulrich Krieger, Jakub Piskorski, Ulrich Schäfer, Feiyu Xu


SProUT Architecture

[Figure: SProUT architecture diagram. Labels: Grammar Development Environment; Online Processing; XTDL Grammar; Regular Compiler; Extended Optimized Finite-State Network; XTDL Interpreter; Finite-State Machine Toolkit; JTFS; Lexical Resources; Linguistic Processing Resources; Input Data (stream of text items); Structured Output Data.]


Core Components – FSM Toolkit

Finite-state Machine Toolkit for building, combining, and optimizing finite-state devices

Finite-state Machine model: FSA, WFSA, FST, WFST

Arbitrary real-valued semirings

Some new crucial STP-relevant operations (e.g., incremental construction of minimal deterministic FSAs)

Various memory models

Functionality similar to AT&T tools


Core Components – Regular Compiler

Definition and configuration via XML

Unicode compatible

Extensible set of circa 20 operations

Scanner definitions vs. general regular expressions

Biasing optimization process

Various ways of handling ambiguities

Direct database connection for flexible pattern-based transformation of linguistic resources into optimized FS representation

Regular expressions over TFSs (SProUT) with restrictions


Core Components – Typed Feature Structure Package

Java implementation of TFSs

Efficient unification operations

Dynamic extension of the type hierarchy

Other operations: subsumption checking, deep copying, path selection, feature iteration, and various printers
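The two central operations named above, unification and subsumption checking, can be sketched in a few lines. This is a toy illustration, not the JTFS implementation: feature structures are nested Python dicts, atomic types are strings, the `PARENT` table is a hypothetical miniature type hierarchy, and coreferences are not modelled.

```python
# Hypothetical toy type hierarchy: child -> parent ("top" is the root).
PARENT = {
    "case": "top", "number": "top",
    "dat": "case", "nom": "case",
    "sing": "number", "plural": "number",
}

def is_subtype(sub, sup):
    """True if `sub` equals `sup` or lies below it in the hierarchy."""
    while sub is not None:
        if sub == sup:
            return True
        sub = PARENT.get(sub)
    return False

def unify(a, b):
    """Unify two feature structures; return None on failure.

    Atoms are strings (type names), complex structures are dicts.
    Coreferences (shared tags) are not modelled in this sketch.
    """
    if isinstance(a, str) and isinstance(b, str):
        if is_subtype(a, b):
            return a              # a is the more specific type
        if is_subtype(b, a):
            return b
        return None               # incompatible types
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feat, val in b.items():
            if feat in result:
                merged = unify(result[feat], val)
                if merged is None:
                    return None
                result[feat] = merged
            else:
                result[feat] = val
        return result
    return None                   # atom vs. complex structure

def subsumes(general, specific):
    """True if `general` carries no information missing from `specific`."""
    if isinstance(general, str) and isinstance(specific, str):
        return is_subtype(specific, general)
    if isinstance(general, dict) and isinstance(specific, dict):
        return all(f in specific and subsumes(v, specific[f])
                   for f, v in general.items())
    return False
```

For example, unifying an adjective with underspecified `CASE case` against `CASE dat` yields the more specific `dat`, while `dat` against `nom` fails.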


XTDL Formalism

Combines typed feature structures (TFS) and regular expressions, including coreferences and functional application

XTDL grammar rules – production part on LHS, and output description on RHS

TDL used for establishment of a type hierarchy of linguistic entities

morph := sign & [POS atom,
                 STEM atom,
                 INFL infl]

[Figure: type hierarchy rooted at *top*. Levels as shown: atom, *avm*, *rule*; tense, sign, infl, index-avm; present, token, morph, lang, tokentype; de, en, separator, url.]


XTDL Formalism

A set of standard regular operators:

- concatenation
- optionality ?
- disjunction |
- Kleene star *
- Kleene plus +
- n-fold repetition {n}
- m-n span repetition {m,n}

Unidirectional coreference under Kleene star (and restricted iteration)

[POS Det, ...] ([POS Adj, ..., RELN %LIST])* [POS Noun, ...] -> [..., RELN %LIST]


XTDL Formalism

loc-pp :>
    morph & [POS Prep & #preposition,
             INFL [CASE #1, NUMBER #2, GENDER #3]]
    morph & [POS Determiner,
             INFL [CASE #1, NUMBER #2, GENDER #3]] ?
    morph & [POS Adjective,
             INFL [CASE #1, NUMBER #2, GENDER #3]] *
    gazetteer & [TYPE general-location,
                 SURFACE #location]
    -> [CAT location-pp,
        PREP #preposition,
        LOCATION #location].


XTDL Interpreter

1. Matching of regular patterns using unifiability (LHS)

2. LHS Pattern instance creation

3. Unification of the rule instance and matched input

Longest match strategy

Ambiguities allowed

Interpreter generates TFSs as output (cascaded architecture)
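The longest-match strategy with ambiguities allowed can be illustrated as follows. This is only a sketch under assumptions: candidate matches are given as hypothetical `(rule, start, end)` triples, spans are non-empty, and scanning resumes right after the matched span; the slides do not state how the interpreter continues after a match.

```python
def longest_matches(candidates):
    """Keep every candidate with maximal span length (ties are all kept)."""
    if not candidates:
        return []
    best = max(end - start for _, start, end in candidates)
    return [c for c in candidates if c[2] - c[1] == best]

def scan(num_items, match_at):
    """Left-to-right scan over the input items.

    `match_at(pos)` is a hypothetical callback returning all
    (rule, start, end) triples for rules matching at position `pos`.
    """
    results, pos = [], 0
    while pos < num_items:
        winners = longest_matches(match_at(pos))
        if winners:
            results.extend(winners)              # ambiguities are kept
            pos = max(winners[0][2], pos + 1)    # continue after the span
        else:
            pos += 1                             # nothing matched; skip one item
    return results
```

Two rules that both cover the same longest span are both reported, which is how an ambiguity surfaces in this sketch.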


XTDL Interpreter

Matched input sequence “im sonnigen Rom” (in sunny Rome)

rule & [IN < morph & [SURFACE im,
                      STEM im,
                      POS Prep,
                      INFL infl & [CASE nom, NUMBER plural, GENDER fem]],
             morph & [SURFACE sonnigen,
                      STEM sonnig,
                      POS Adjective,
                      INFL infl & [CASE case, NUMBER number, GENDER gender]],
             gazetteer & [SURFACE Rom,
                          TYPE location-general] >]


XTDL Interpreter

Rule with an instantiated pattern on the LHS

rule & [IN < morph & [POS Prep & #5,
                      INFL infl & [CASE #1, NUMBER #2, GENDER #3]],
             morph & [POS Adjective,
                      INFL infl & [CASE #1, NUMBER #2, GENDER #3]],
             gazetteer & [SURFACE #4,
                          TYPE location-general] >,
        OUT phrase & [CAT np-location,
                      PREP #5,
                      LOCATION #4]]


XTDL formalism

Unified result

rule & [IN < morph & [SURFACE im,
                      STEM im,
                      POS Prep & #5,
                      INFL infl & [CASE dat & #1, NUMBER sing & #2, GENDER neut & #3]],
             morph & [SURFACE sonnigen,
                      STEM sonnig,
                      POS Adjective,
                      INFL infl & [CASE #1, NUMBER #2, GENDER #3]],
             gazetteer & [SURFACE Rom & #4,
                          TYPE location-general] >,
        OUT phrase & [CAT np-location,
                      PREP #5,
                      LOCATION #4]]


Linguistic Processing Resources

Tokenization

Gazetteer

Extended Gazetteer

Morphology

Sentence Splitter

Reference Matcher


Tokenization

Text segmentation into tokens

Fine-grained token classification (ca. 30 types)

complex_compound_first_capital : AT&T-Chief

Token postsegmentation

‘<a,b>’ -> ‘<’ ‘a’ ‘,’ ‘b’ ‘>’

Token subclassification information

contains_position_sufix: AT&T-Chief

[Figure: token feature structure with attributes START (25), END (34), MAIN and SUB; the subclassification entries carry LANG (english, german), DOM (any) and SPEC values.]
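Token postsegmentation, as in the ‘<a,b>’ example, can be approximated with a single regular expression. This is a minimal sketch with a hypothetical token pattern; it does not reproduce the SProUT tokenizer's roughly 30 token classes, and it would also split tokens such as AT&T-Chief, which the real tokenizer keeps together as one classified token.

```python
import re

# Hypothetical token pattern: a run of word characters, or one other
# non-space character (punctuation and brackets become separate tokens).
TOKEN = re.compile(r"\w+|[^\w\s]")

def postsegment(token):
    """Split a raw token such as '<a,b>' into finer-grained pieces."""
    return TOKEN.findall(token)
```

Calling `postsegment("<a,b>")` returns the five pieces `<`, `a`, `,`, `b`, `>` in order.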


Gazetteer/Extended Gazetteer

For storing static named entities (e.g., locations) or keywords (e.g., company designators, month names)

Extended Gazetteer allows for associating entries with a list of arbitrary attribute-value pairs (and uses path compression)

...
Warsaw | gaz_type:city | concept:Warsaw
Warszawa | gaz_type:city | concept:Warsaw
Varsovie | gaz_type:city | concept:Warsaw
...

Case-sensitive/insensitive mode

Unicode compatibility
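The extended gazetteer entries shown above map a surface form to attribute-value pairs. A minimal lookup sketch follows; it assumes the `surface | attr:value | ...` entry syntax from the example and stores entries in a plain dictionary, whereas the slides describe a finite-state representation with path compression. The class and method names are hypothetical.

```python
class Gazetteer:
    """Toy extended gazetteer: surface form -> list of attribute dicts."""

    def __init__(self, case_sensitive=True):
        self.case_sensitive = case_sensitive
        self.entries = {}

    def _key(self, surface):
        # casefold() gives Unicode-aware case-insensitive comparison.
        return surface if self.case_sensitive else surface.casefold()

    def add(self, line):
        """Parse an entry such as 'Warsaw | gaz_type:city | concept:Warsaw'."""
        surface, *pairs = [part.strip() for part in line.split("|")]
        attrs = dict(pair.split(":", 1) for pair in pairs)
        self.entries.setdefault(self._key(surface), []).append(attrs)

    def lookup(self, surface):
        """Return all attribute dicts stored for `surface` (possibly none)."""
        return self.entries.get(self._key(surface), [])
```

With the three entries above loaded, all of Warsaw, Warszawa and Varsovie resolve to the same `concept:Warsaw`, which is the point of the extra attributes.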


Morphology

Full-form lexica obtained from ‘compactified’ MMORPH:

English: 200,000 entries
German: 830,000 entries + Shallow Compound Recognition
French: 225,000 entries
Spanish: 570,000 entries
Italian: 330,000 entries
Dutch: ? entries (under development)

Asian Languages:

Chinese – Shanxi
Japanese – Chasen

Other:

Czech – 600,000 entries + HMM-based Part-of-Speech Tagging
Polish – 120,000 lexemes (Morfeusz)
Lithuanian – Lemouklis
Russian – under acquisition

Compactification of available full-form lexica

External components implemented as server


Morphology

Compound Recognition & Segmentation for German

Segmentation ambiguity: “Wein” + “sorten” (wine types) vs. “Wein” + “s” + “orten” (wine places)

Bracketing ambiguity: “Biergartenfest” as [Bier [garten fest]] vs. [[Bier garten] fest]

Example: „Autoradiozubehör“ (car radio equipment)

Next: adaptation for processing Dutch compounds
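The segmentation ambiguity in the wine example can be reproduced by enumerating all ways to cover a word with lexicon stems and optional linking elements. This is a sketch of the general idea only, with a hypothetical three-word lexicon and a single linking element "s"; it is not the shallow compound recognition component itself, and it returns flat segmentations without the bracketing structure.

```python
def segment(word, lexicon, linkers=("s",)):
    """Return all splits of `word` into lexicon stems and linking elements.

    Linking elements are marked with a leading '+' in the output.
    """
    word = word.lower()
    results = []

    def extend(pos, parts):
        if pos == len(word):
            results.append(parts)
            return
        # Try every lexicon stem that starts at the current position.
        for end in range(pos + 1, len(word) + 1):
            piece = word[pos:end]
            if piece in lexicon:
                extend(end, parts + [piece])
        # A linking element may only follow a stem and must not end the word.
        if parts and parts[-1] in lexicon:
            for link in linkers:
                if word.startswith(link, pos) and pos + len(link) < len(word):
                    extend(pos + len(link), parts + ["+" + link])

    extend(0, [])
    return results
```

With the lexicon `{"wein", "sorten", "orten"}`, the word "Weinsorten" yields both readings: `wein` + `sorten`, and `wein` + linking `s` + `orten`.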


System Description Language

Construction of a concrete system instance via definition of a regular expression of module specifications

Operators over modules:

- sequence of M1 and M2: the output of M1 serves as the input to M2
- quasi-parallel computation of independent modules M1 and M2
- M*: fixpoint computation

All linguistic modules must implement a specific Java interface

Automatic compilation of system description into a single JAVA class


System Description Language

(M1 M2)(input)

    M1.clearState();
    M1.setInput(input);
    M1.setOutput(M1.computeOutput(M1.getInput()));
    M2.clearState();
    M2.setInput(mediateSeq(M1, M2));
    M2.setOutput(M2.computeOutput(M2.getInput()));
    return M2.getOutput();

(M*)(input)

    M.clearState();
    M.setInput(input);
    M.setOutput(mediateFix(M));
    return M.getOutput();
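The semantics of the two expansions above, sequencing and fixpoint iteration, can be sketched with plain function composition. This is a simplified illustration: modules are modelled as Python functions rather than objects with state, the names `seq` and `fix` are hypothetical, and the mediator functions (`mediateSeq`, `mediateFix`) of the generated Java code are not modelled. The `max_steps` bound is an added safeguard, not something the slides mention.

```python
def seq(m1, m2):
    """Sequence: the output of m1 serves as the input to m2."""
    return lambda data: m2(m1(data))

def fix(m, max_steps=1000):
    """Fixpoint: apply m repeatedly until the output stops changing."""
    def run(data):
        for _ in range(max_steps):
            result = m(data)
            if result == data:
                return result
            data = result
        raise RuntimeError("no fixpoint reached")
    return run
```

For instance, `fix` applied to a function that collapses double blanks keeps running until only single blanks remain, and `seq(str.strip, str.upper)` first trims and then upper-cases its input.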


Optimization of Grammar processing

Problem: TFSs are treated as symbolic values by the FSM Toolkit

Sorting outgoing transitions from selected states (transition hierarchy under subsumption)

- flat trees for bad-style grammars

Extending transition hierarchy via additional nodes

[TOP]

[TOKEN]  [MORPH stem: ‘Prof.’]  [GAZETTEER type: X]
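The idea of a transition hierarchy under subsumption is that a failed test against a general label rules out every more specific label below it. The following sketch illustrates this under simplifying assumptions: labels are flat dictionaries, matching is done by subsumption rather than by unifiability, and all function names are hypothetical.

```python
def subsumes(general, specific):
    """Flat feature structures as dicts: `general` subsumes `specific`
    if every feature-value pair of `general` also occurs in `specific`."""
    return all(specific.get(f) == v for f, v in general.items())

def insert(siblings, node):
    """Place `node` below the first sibling that subsumes it."""
    for other in siblings:
        if subsumes(other["label"], node["label"]):
            insert(other["children"], node)
            return
    siblings.append(node)

def build_hierarchy(labels):
    """Arrange transition labels in a forest ordered by subsumption."""
    # Labels with fewer features are more general; place them first.
    roots = []
    for label in sorted(labels, key=len):
        insert(roots, {"label": label, "children": []})
    return roots

def candidates(roots, item):
    """Collect labels matching `item`, pruning subtrees whose root fails."""
    found = []
    for node in roots:
        if subsumes(node["label"], item):
            found.append(node["label"])
            found.extend(candidates(node["children"], item))
    return found
```

If a grammar uses only very specific labels, every label ends up directly under the root and nothing can be pruned, which is the "flat tree" case noted above; adding intermediate nodes such as a bare morph label restores the pruning.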


Optimization of Grammar processing

Input: text of 32 520 words (157 080 characters, 22 pages) and an English grammar for NE (circa 700 transitions from the initial state)

Run-time behaviour with Tokenizer/Gazetteer/Morphology:

before: overall 17.7 seconds, candidate pattern selection 11.6 seconds

now: overall 13.2 seconds, candidate pattern selection 6.9 seconds
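The size of the improvement follows directly from the reported timings. The short calculation below uses only the four figures above; the helper name is hypothetical.

```python
def relative_reduction(before, after):
    """Fraction of the original run time that was saved."""
    return (before - after) / before

overall = relative_reduction(17.7, 13.2)      # roughly a quarter
selection = relative_reduction(11.6, 6.9)     # roughly two fifths

# Time spent outside candidate pattern selection, before and after.
rest_before = 17.7 - 11.6
rest_after = 13.2 - 6.9
```

Overall run time drops by about 25%, candidate pattern selection by about 41%, while the time spent outside pattern selection stays nearly the same (about 6.1 versus 6.3 seconds), so the gain comes from the selection step.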


Optimization of Grammar processing

Using restrictions during compilation of XTDL grammars into FS-format

’Determinization under subsumption’ -> Approximation

’Expansion’ techniques for highly recursive grammars


Adapting SProUT to processing Polish

Tokenization – trivial

Morphology – integration of Morfeusz (Marcin Woliński)

Part-of-speech Disambiguation - ?

Gazetteer - several strategies:

- list all inflectional variants with additional morphological information
- interplay between gazetteer and morphology
- component for guessing morphological information of unknown words

Grammar Adaptation

- provide additional information to control inflection by using STEM attribute instead of SURFACE


Future Work

Further work concerning optimization of grammar processing

Various search strategies

Additional linguistic processing resources

Adapting to processing new languages

Real data testing: large grammars and real-world texts

Utilization in research and industrial projects


Examples – Simple grammar for person names

;; dummy rule for title
title :/ gazetteer & [SURFACE #title, GTYPE gaz_title] -> #title.

;; dummy rule for position
position :/ gazetteer & [SURFACE #position, GTYPE gaz_position] -> #position.

;; dummy rule for complex position, e.g. "Direktor und CEO"
complex_position :/
    (gazetteer & [GTYPE gaz_position, SURFACE #pos1]
     token & [SURFACE "und"]
     gazetteer & [GTYPE gaz_position, SURFACE #pos2])
    -> #position, where #position = Append(#pos1," ","und"," ",#pos2).


Examples – Simple grammar for person names

;; dummy rule for given name
given_name :/ gazetteer & [SURFACE #name, GTYPE gaz_given_name] -> #name.

;; dummy rule for name-suffix such as "Jr."
name_suffix :/
    (token & [ SURFACE ","] ?)
    token & [ SURFACE "Jr" & #suffix ] | token & [ SURFACE "jr" & #suffix ]
    (token & [ SURFACE "." ] ?)
    -> #suffix.

;; dummy rule for initial "M." and middle name
initial :/
    (gazetteer & [GTYPE gaz_initial, SURFACE #initial]
     token & [SURFACE "."] ?)
    -> #middle, where #middle = Append(#initial, ".").


Examples – Simple grammar for person names

;; dummy rule for infix like "van", "van der"
infix :/ gazetteer & [GTYPE gaz_name_infix, SURFACE #infix] -> #infix.

;; dummy rule for last name
last_name :/
    token & [TYPE first_capital_word, SURFACE #name]
    | token & [TYPE mixed_word_first_capital, SURFACE #name]
    | token & [TYPE word_with_hyphen_first_capital, SURFACE #name]
    | token & [TYPE word_with_apostrophee_first_capital, SURFACE #name]
    -> #name.

;; dummy rule for last name with infix
last_name_with_infix :/
    @seek(infix) & #infix
    @seek(last_name) & #last_name
    -> #last, where #last=Append(#infix," ",#last_name).


Examples – Simple grammar for person names

;; rule for person names, example: Direktor und CTO Prof. Dr. hab. Witold P. van der Berg, Jr.
person :>
    ((@seek(position) & #position | @seek(complex_position) & #position) token & [TYPE comma] ?)?
    @seek(title) & #title ?
    (@seek(given_name) & #given_name (@seek(given_name) & #given_name_extra ?)
     | (@seek(initial) & #given_name))
    @seek(initial) & #middle1 ?
    @seek(initial) & #middle2 ?
    (@seek(last_name) & #last_name | @seek(last_name_with_infix) & #last_name)
    @seek(name_suffix) & #suffix ?
    -> ne-person & [GIVEN_NAME #first_name,
                    TITLE #title,
                    SURNAME #last_name,
                    P-POSITION #position,
                    NAME-SUFFIX #suffix],
       where #first_name = ConcWithBlanks(#given_name,#given_name_extra,#middle1,#middle2).
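The functional operators used in these rules, Append and ConcWithBlanks, build output strings from matched pieces. Their exact semantics are not given on the slides; the sketch below shows one plausible reading, assuming that Append is plain concatenation and that ConcWithBlanks joins its bound arguments with single blanks while skipping optional parts that did not match (represented here as `None`).

```python
def append(*parts):
    """Assumed behaviour of Append: plain string concatenation."""
    return "".join(parts)

def conc_with_blanks(*parts):
    """Assumed behaviour of ConcWithBlanks: join the bound arguments with
    single blanks; unmatched optional parts (None or empty) are skipped."""
    return " ".join(p for p in parts if p)
```

Under these assumptions, a name with a given name and one middle initial would produce `conc_with_blanks("Witold", None, "P.", None)`, giving "Witold P." as the GIVEN_NAME value.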


Examples – Embedding rules

simple_noun_phrase :> .................
    -> phrase & [CAT np,
                 SURFACE #info,
                 AGR [N #n,
                      C #c,
                      G #g]], where #info=..........

simple_event :> @seek(person) & #person
    morph & [POS verb, STEM #action]
    @seek(simple_noun_phrase) & [SURFACE #info]
    -> [PERSON #person, ACTION #action, OBJECT #info].