
Preparation and Launch of a Large-scale Action for Quality Translation Technology D 4.4.1 An inventory of existing language tools


Deliverable 4.4.1

v.1.0

An inventory of existing language tools

Author(s): Kanella Pouli, Juli Bakagianni, Stelios Piperidis

Dissemination Level: Public

Date: 19.07.2013

This work is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).



Grant agreement no.: 296347
Project acronym: QTLaunchPad
Project full title: Preparation and Launch of a Large-scale Action for Quality Translation Technology
Funding scheme: Coordination and Support Action
Coordinator: Prof. Hans Uszkoreit (DFKI)
Start date, duration: 1 July 2012, 24 months
Distribution: Public
Contractual date of delivery: June 2013
Actual date of delivery: 19 July 2013
Deliverable number: D4.4.1
Deliverable title: An inventory of existing language tools
Type: Report
Status and version: Pre-Final
Number of pages: 30
Contributing partners: DFKI, DCU
WP Leader: ILSP
Task Leader: ILSP
Authors: Kanella Pouli, Juli Bakagianni, Stelios Piperidis
EC project officer: Aleksandra Wesolowska

The partners in QTLaunchPad are:

Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Germany
Dublin City University (DCU), Ireland
Institute for Language and Speech Processing, R.C. “Athena” (ILSP/ATHENA RC), Greece
The University of Sheffield (USFD), United Kingdom

For copies of reports, updates on project activities and other QTLaunchPad-related information, contact:

DFKI GmbH
QTLaunchPad
Dr. Aljoscha Burchardt
[email protected]
Alt-Moabit 91c
10559 Berlin, Germany
Phone: +49 (30) 23895-1838
Fax: +49 (30) 23895-1810

Copies of reports and other material can also be accessed via http://www.qt21.eu/launchpad

© 2013, The Individual Authors


Table of Contents

1 Executive summary
2 Introduction
3 The inventory
3.1 Targeted tools and services
3.2 Results
4 Conclusions



1 Executive summary

This report presents the current results of an ongoing survey, and the resulting inventory, of language processing tools needed for machine translation (MT) research and development within the framework of the QTLaunchPad project. It is the outcome of Task 4.4 “Identification and acquisition of existing tools”. To provide the support mechanisms needed for documenting, sharing, searching and retrieving MT-related tools, a QTLaunchPad-dedicated META-SHARE node/repository (http://qt21.metashare.ilsp.gr) has been set up and initially populated with MT-related language processing tools and/or their metadata-based descriptions. At this stage, the tools cover processing and annotation at the levels of tokenisation and sentence splitting, POS tagging and lemmatisation, syntactic parsing, named entity recognition and text alignment, for English, German, Greek and Portuguese.

2 Introduction

This report presents the current results of an ongoing survey and the resulting inventory of

language processing tools1 necessary for the purposes of machine translation research and

development, within the framework of the QTLaunchPad project. It is derived as a result of

Task 4.4 “Identification and acquisition of existing tools”.

In the framework of QTLaunchPad, language resources encompass monolingual and multilingual data sets, both structured (e.g. lexica, terminological databases, thesauri) and unstructured (e.g. raw text corpora), as well as language processing tools such as tokenizers and sentence splitters, POS taggers and lemmatizers, parsers, NE recognizers, etc.

To provide the support mechanisms needed for documenting, sharing, searching and retrieving MT-related tools, QTLaunchPad builds upon and extends the META-SHARE infrastructure (www.meta-share.eu, www.meta-share.org). To this end, a QTLaunchPad-dedicated META-SHARE node/repository (http://qt21.metashare.ilsp.gr) has been set up and populated with MT-related language processing tools and/or their metadata-based descriptions. The targeted tools come from two sources:

I. tools already available in the META-SHARE network, especially those under permissive terms allowing at least research use, as well as tools already widely used in MT research;

II. tools coming from the QTLaunchPad consortium partners and, in the immediate future, tools coming from the partnership that QTLaunchPad is mobilizing towards the QT21 action.

1 The identification and acquisition of data sets is the objective of deliverable 4.1.1.

The current version of the inventory of the QTLaunchPad-dedicated META-SHARE repository focuses on item I and partially on item II. The survey and the creation of the inventory took place in the period October 2012 - June 2013, coinciding with META-SHARE version 3.0 and the respective population cycles, which ran almost continuously until June 2013, with a peak in April 2013. The population of the QTLaunchPad-dedicated repository (and of the respective inventory) is considered ongoing throughout the duration of the project, and this report provides only a preliminary overview of the available tools.

To meet the goals of Task 4.4, the following decisions and actions were taken:

• The META-SHARE distributed repository network has been thoroughly searched according to existing metadata descriptions; taking into account the diverse application scenarios, the search criteria were appropriately extended (e.g. by including additional tools commonly used to process and manage data sets).

• Tools are mainly, but not exclusively, targeted towards the predefined QTLaunchPad languages, namely German (DE), Greek (EL), English (EN) and Portuguese (PT).

• A META-SHARE compliant repository and the respective inventory have been set up, serving the needs of present data storage and access and of future enrichment from other sources.

• The documentation of tools follows the META-SHARE metadata model.

• All tools with permissive terms of use have been replicated on the QTLaunchPad-dedicated META-SHARE repository.
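The selection logic described in these bullets can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ToolRecord`, its fields, the `research_use_allowed` flag and the sample tool names are hypothetical and are not part of the META-SHARE metadata model or of any META-SHARE software.

```python
from dataclasses import dataclass, field

# The four predefined project languages named in the list above.
PROJECT_LANGUAGES = {"DE", "EL", "EN", "PT"}


@dataclass
class ToolRecord:
    """Hypothetical, simplified stand-in for a tool's metadata description."""
    name: str
    languages: set = field(default_factory=set)
    research_use_allowed: bool = False  # assumed flag for "permissive terms of use"


def covers_project_language(record):
    """True if the tool supports at least one of the project languages."""
    return bool(record.languages & PROJECT_LANGUAGES)


def select_for_replication(records):
    """Keep only tools whose terms of use permit replication on the repository."""
    return [r for r in records if r.research_use_allowed]


# Invented sample records, for illustration only.
records = [
    ToolRecord("tool-a", {"EN", "FR"}, research_use_allowed=True),
    ToolRecord("tool-b", {"HU"}, research_use_allowed=True),
    ToolRecord("tool-c", {"DE"}, research_use_allowed=False),
]

replicated = select_for_replication(records)
print([r.name for r in replicated])
print([r.name for r in replicated if covers_project_language(r)])
```

Language coverage is kept as a separate check rather than a hard filter, since the tools are targeted mainly, but not exclusively, at the project languages.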

The material gathered during this survey has been filtered to select tools of interest for the purposes of the project, and it has been used to initiate the population of the QTLaunchPad repository at http://qt21.metashare.ilsp.gr/. For the time being, access to this repository is limited to the project consortium. In the immediate future, the members of the QTLP and QT21 communities will be able to register and not only access and download the tools but also modify them and their descriptions, as well as add new ones, thus enriching the repository and inventory and keeping them up to date. The population of the repository is and will remain ongoing, with the aim of acting as a reference point that includes all tools that are available, relevant and fit for MT research and development, while respecting all legal (and possibly other) restrictions and preferences.



3 The inventory

3.1 Targeted tools and services

For the purposes of QTLaunchPad, the main language processing tools of interest have

been predefined as follows:

• Tokenizers

• Sentence Splitters

• POS Taggers

• Lemmatizers

• Syntactic Parsers

• Named Entity (NE) Recognizers

• Parallel Text Aligners

Details about these tools, their main functions, the possible combinations, integration in

workflows and generic interfaces are also available in deliverables D3.1.1 and D3.2.1 of the

project. An overview of the datasets for the languages of the project is available in D4.1.1.
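As a rough illustration of what the first two categories in the list above do, the following Python sketch splits a text into sentences and then into tokens using regular expressions. It is a deliberately naive example, not the method of any tool in the inventory: real sentence splitters and tokenizers handle abbreviations, numbers, clitics and language-specific conventions that these patterns ignore.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "Tools are listed per category. Suites are presented separately!"
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

In a processing workflow, the output of such a step would typically be passed on to the later categories in the list (POS tagging, lemmatization, parsing).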

3.2 Results

Tools are listed per category, while collections and suites of tools (e.g. the suites provided by the Stanford NLP Group, LX-Center, etc.) and toolkits for NLP usage are presented separately. In addition, tools already implemented as web services or integrated in workflows are marked. The web services presented in this report are mainly provided by Soaplab, which is “a tool that can automatically generate and deploy Web Services on top of existing command-line analysis programs. It is especially well suited for applications with well described input and output parameters, such as EMBOSS (a package of Open Source software for sequence analysis). Soaplab allows integration of many applications within a single programming interface. Soaplab can also interoperate with other Web Services and can create Web Services on top of existing web resources (e.g. extracting data from a third-party web page and providing its data as a Web Service) - a sub-project called Gowlab.”2

2 Soaplab2. Last modified: August 10th, 2010, http://soaplab.sourceforge.net/soaplab2/


Soaplab has already been used in the PANACEA3 project, which provides 157 web services (PANACEA Registry) and 74 shared workflows (PANACEA myExperiment) through its web portals.
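The general idea behind exposing a command-line analysis program through a uniform programming interface can be sketched as follows. This is not Soaplab's API and does not reproduce how Soaplab generates web services; it only shows, under that assumption, a thin Python wrapper that sends text to a program on standard input and returns its standard output. `run_cli_tool` and `UPPERCASE_TOOL` are hypothetical names, and the "tool" is a stand-in that upper-cases its input.

```python
import subprocess
import sys


def run_cli_tool(command, text):
    """Send text to a command-line program on stdin and return what it writes to stdout."""
    result = subprocess.run(
        command,
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the program exits with an error
    )
    return result.stdout


# Hypothetical stand-in for an analysis program: it upper-cases its input.
UPPERCASE_TOOL = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

print(run_cli_tool(UPPERCASE_TOOL, "sentence splitter"))
```

Because the wrapper only assumes text in and text out, any program with well described input and output parameters could be substituted for the stand-in command.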

Sentence Splitters

Thirteen (13) sentence splitters and detectors have been identified, 8 of which are available through META-SHARE. Their distribution across the four project languages is: a) 3 for German, b) 2 for Greek, c) 9 for English, and d) 3 for Portuguese.

Table 1: Sentence splitters and detectors for the four project languages

Source Name DE EL EN PT Other lang. LD4 Input Output OS5

MS6 huntoken √ HU NO

Media type: Text Resource type: Language Description Modality: Written Language

Media type: Text Resource type: Lexical Conceptual Resource Modality: Written Language

Linux

MS

U-Compare Named Entity Recognition service

Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: English Character encoding: UTF-8

OS-Independent

MS U-Compare Species Disambiguation Service

√ Yes

OS-Independent

Univ. of Edinburgh Europarl tools √ √ √ √

IT, FR, SV

MS

Tokenizing, Tagging, Lemmatizing and Chunking free running texts

√ RO, FR

Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: Romanian, English, French Character encoding: UTF-8

Media type: Text Resource type: Corpus Modality: Written Language Language: Romanian, English, French Character encoding: UTF-8 Annotation type: Lemmatization, Morphosyntactic Annotation - Pos Tagging, Segmentation, Syntactic Annotation - Shallow Parsing Annotation format: text output with one token per line and annotations separated by tab Tagset: http://nl.ijs.si/ME/V3/msd/html/ Segmentation level: Sentence, Word

OS-Independent

3 http://www.panacea-lr.eu/
4 LD stands for Language Dependent.
5 OS stands for Operating System.
6 MS stands for MetaShare.

Ruhr-Universität Bochum

Tokenizer - sentence splitter for German

√

MS UIMA/U-Compare OpenNLP Sentence Detector

√ Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: English

Media type: Text Resource type: Corpus Modality: Written Language Language: English Annotation type: Structural Annotation Segmentation level: Sentence

OS-Independent

MS U-Compare Cafetiere English Sentence Detector

√ Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: English Character encoding: UTF-8 Segmentation level: Sentence

Media type: Text Resource type: Corpus Modality: Written Language Language: English Annotation type: Structural Annotation Segmentation level: Sentence

OS-Independent

MS UIMA/U-Compare GENIA Sentence Detector

√ Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: English

Media type: Text Resource type: Corpus Modality: Written Language Language: English Annotation type: Structural Annotation Segmentation level: Sentence

OS-Independent

MS SENTER √ Yes

Media type: Text Modality: Written Language Language: Portuguese

Media type: Text Modality: Written Language Language: Portuguese Segmentation level: Sentence

Windows

ILSP/ SL7 Sentence splitter and tokenizer for Greek text

√ Yes

text text

OS-Independent (Web service)

UPF/ SL freeling3_sentence_splitter √ √ GL, IT, RU, SP, CY, AST, CA

Linguatec/SL

LT-SentenceSplitter √ √

7 SL stands for SoapLAB web services which appear in grey coloured cells.

Tokenizers

Nineteen (19) tokenizers have been identified, 12 of which are available through META-SHARE. Their distribution across the four project languages is: a) 4 for German, b) 2 for Greek, c) 15 for English, and d) 6 for Portuguese.

Table 2: Tokenizers for the four project languages

Source Name DE EL EN PT Other lang. LD Input Output OS

MS

UIMA/U-Compare Aper-tium Morpholog-ical Analyser

√ √

EU, CA, GL, SP

Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: Eng-lish, Spanish, Portuguese, Basque, Galici-an, Catalan

Media type: Text Resource type: Corpus Modality: Written Lan-guage Language: English, Spanish, Portuguese, Basque, Galician, Cata-lan Annotation type: Lem-matization, Morphosyn-tactic Annotation - Pos Tagging, Segmentation Segmentation level: Word

MS huntoken √ HU NO

Media type: Text Resource type: Language De-scription Modality: Written Language

Media type: Text Resource type: Lexical Conceptual Resource Modality: Written Lan-guage

Linux

MS TextPro √ IT Yes

Media type: Text Language: Eng-lish, Italian

Media type: Text Resource type: Lexical Conceptual Re-source Modality: Written Lan-guage Language: English, Italian Annotation type: Lem-matization, Morphosyn-tactic Annotation - Pos Tagging, Semantic Annotation - Named Entities Segmentation level: Sentence, Word

Linux, Mac-OS

MS

U-Compare Apertium Part-of-Speech Tagging Workflow

√ √

SP, CA, GL, EU

Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: English, Spanish; Castilian, Portuguese, Catalan; Valencian, Galician, Basque Character encoding: UTF-8

Media type: Text Resource type: Corpus Modality: Written Language Language: English, Spanish; Castilian, Portuguese, Catalan; Valencian, Galician, Basque Character encoding: UTF-8

OS-Independent

MS U-Compare Lemmatisation service

√ FR, RO

Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: Eng-lish, French, Romanian Character en-coding: UTF - 8

Media type: Text Resource type: Corpus Modality: Written Lan-guage Language: English, French, Romanian Character encoding: UTF - 8

OS-Independ-ent

Univ. of Edinburgh

Europarl tools √ √ √ √ IT, FR, SV

MS

Tokenizing, Tagging, Lem-matizing and Chunking free running texts

√ RO, FR

Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: Ro-manian, English, French Character en-coding: UTF - 8

Media type: Text Resource type: Corpus Modality: Written Lan-guage Language: Romanian, English, French Character encoding: UTF - 8 Annotation type: Lem-matization, Morphosyn-tactic Annotation - Pos Tagging, Segmenta-tion, Syntactic Annota-tion - Shallow Parsing Annotation format: text output with one token per line and annota-tions separated by tab Tagset: http://nl.ijs.si/ME/V3/msd/html/ Segmentation level: Sentence, Word

OS-Independ-ent

Ruhr-Universi-tät Bo-chum

Tokenizer- sen-tence splitter for German

√

ILSP/ SL Sentence splitter and tokenizer for Greek text

√ Yes

text text

OS-Independ-ent (Web service)

MS U-Compare Tokenisation service

√ Yes

Media type: Text Resource type: Corpus Modality: Writ-ten Language Language: Por-tuguese Character en-coding: UTF - 8 Annotation for-mat: XML

Media type: Text Resource type: Corpus Modality: Written Lan-guage Language: Portuguese Character encoding: UTF - 8 Annotation type: Seg-mentation Annotation format: XML Segmentation level: Sentence, Word

OS-Independ-ent

MS IULA tokenizer Web Service √ CA, SP Yes

Media type: Text Modality: Written Language Language: Spanish, Catalan, English Mime type: text/plain Character encoding: UTF-8

Media type: Text Resource type: Corpus Modality: Written Language Mime type: text/xml Character encoding: UTF-8 Annotation type: Segmentation

OS-Independent

MS JTok Tokenizer √ √ Yes

Media type: Text Language: German, English

Media type: Text Language: German, English Character encoding: UTF - 8 Annotation type: Seg-mentation Annotation format: XML Segmentation level: Paragraph, Sentence, Word

OS-Independ-ent

MS

UIMA/U-Compare GENIA Tokenis-er (GENIA Tag-ger)

√ Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: Eng-lish Annotation type: Segmentation Segmentation level: Sentence

Media type: Text Resource type: Corpus Modality: Written Lan-guage Language: English Annotation type: Struc-tural Annotation Segmentation level: Word

OS-Independ-ent

MS

UIMA/U-Compare OpenNLP To-kenizer

√ Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: Eng-lish Annotation type: Structural Anno-tation Segmentation level: Sentence

Media type: Text Resource type: Corpus Modality: Written Lan-guage Language: English Annotation type: Struc-tural Annotation Segmentation level: Word

OS-Independ-ent

MS LX-Tokenizer √ Yes

Media type: Text Resource type: Corpus Modality: Written Language

Media type: Text Resource type: Corpus Modality: Written Lan-guage Segmentation level: Word

Linux

Stanford Univ.

Stanford Eng-lishTokenizer

√

UPF/ SL freeling3_tokenizer √ √

CA, GL, IT, RU, SP, CY, AST

Lin-guatec/ SL

LT-Tokeniser √ √

UPF/ University of Cambridge

tpc_rasp √ string or text file

POS taggers

Thirty-three (33) taggers have been identified, 20 of which are available through META-SHARE. Their distribution across the four project languages is: a) 7 for German, b) 3 for Greek, c) 20 for English, and d) 8 for Portuguese.

Table 3: POS taggers for the four project languages

Source Name DE EL EN PT Other lang. LD Input Output OS

MS EngGram Con-straint Grammar Parser for English

√ Yes

Media type: Text Language: English

MS GerGram Constraint Grammar Parser for German

√ Yes

Media type: Text Language: German


MS

PALAVRAS Con-straint Grammar Parsers for Portu-guese

√ Yes

Media type: Text Language: Portu-guese


MS UIMA/U-Compare Apertium Morpho-logical Analyser

√ √

EU, CA, GL, SP

Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: English, Spanish, Portu-guese, Basque, Galician, Catalan

Media type: Text Resource type: Cor-pus Modality: Written Language Language: English, Spanish, Portuguese, Basque, Galician, Catalan Annotation type: Lemmatization, Mor-phosyntactic Annota-tion - Pos Tagging, Segmentation Segmentation level: Word

MS FORMA Yes

Media type: Text Resource type: Corpus Modality: Written Language Annotation type: Other Annotation format: text/plain Segmentation level: Word

Media type: Text Resource type: Cor-pus Modality: Written Language Annotation type: Lemmatization, Mor-phosyntactic Annota-tion - Pos Tagging Segmentation level: Word

Linux, Mac-Os, Windows

MS hunpos NO

Media type: Text Resource type: Language Description Modality: Written Language

MS

MBT – Memory-Based Tagger-Generator and Tagger

NO

Media type: Text Resource type: Corpus Modality: Spoken Language, Written Language Character encoding: UTF - 8 Annotation type: Morphosyntactic Annotation - Pos Tagging Annotation format: .txt Segmentation level: Word

Media type: Text Resource type: Cor-pus Modality: Spoken Language, Written Language Character encoding: UTF - 8 Annotation type: Morphosyntactic Annotation - Pos Tagging Segmentation level: Word

Unix

MS YamCha: Yet An-other Multipurpose CHunk Annotator

NO

Media type: Text Resource type: Corpus Modality: Written Language Annotation type: Morphosyntactic Annotation - Pos Tagging Annotation format: Plain text in column format. Segmentation level: Word

Media type: Text Resource type: Cor-pus Modality: Written Language Segmentation level: Word

Linux, Mac-OS, Windows

MS CombiTagger NO

Media type: Text Resource type: Corpus Modality: Written Language Character encoding: UTF - 8

Media type: Text Resource type: Cor-pus Modality: Written Language Character encoding: UTF - 8


MS U-Compare Aperti-um Part-of-Speech Tagging Workflow

√ √

SP, CA, GL, EU

Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: English, Spanish; Castilian, Portuguese, Cata-lan; Valencian, Galician, Basque Character encoding: UTF - 8

Media type: Text Resource type: Cor-pus Modality: Written Language Language: English, Spanish; Castilian, Portuguese, Catalan; Valencian, Galician, Basque Character encoding: UTF - 8

OS-Inde-pendent

MS U-Compare Lem-matisation service √ FR,

RO

Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: English, French, Romanian Character encoding: UTF - 8

Media type: Text Resource type: Cor-pus Modality: Written Language Language: English, French, Romanian Character encoding: UTF - 8

OS-Inde-pendent

MS U-Compare Syntactic Parsing Service √

Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: English Character encoding: UTF-8

Media type: Text Resource type: Corpus Modality: Written Language Language: English Character encoding: UTF-8

OS-Independent

MS

Tokenizing, Tag-ging, Lemmatizing and Chunking free running texts

√ RO, FR

Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: Romani-an, English, French Character encoding: UTF - 8

Media type: Text Resource type: Cor-pus Modality: Written Language Language: Romani-an, English, French Character encoding: UTF - 8 Annotation type: Lemmatization, Mor-phosyntactic Annota-tion - Pos Tagging, Segmentation, Syn-tactic Annotation - Shallow Parsing Annotation format: text output with one token per line and annotations separat-ed by tab Tagset: http://nl.ijs.si/ME/V3/msd/html/ Segmentation level: Sentence, Word

OS-Inde-pendent

TreeTagger [ adaptable to any language if a lexi-con and a manually tagged training corpus are availa-ble]

√ √ √ √ NO text text

OS-Inde-pendent

MS UIMA/U-Compare Apertium POS Tag-ger

√ √

EU, CA, GL, SP

Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: English, Spanish, Portu-guese, Catalan, Galician, Basque Annotation type: Structural Annota-tion Segmentation level: Word

Media type: Text Resource type: Cor-pus Modality: Written Language Language: English, Spanish, Portuguese, Galician, Catalan, Basque Annotation type: Morphosyntactic Annotation - Pos Tagging Segmentation level: Word

OS-Inde-pendent

MS UIMA/U-Compare OpenNLP POS Tagger

√ Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: English Annotation type: Segmentation Segmentation level: Word

Media type: Text Resource type: Cor-pus Modality: Written Language Language: English Annotation type: Morphosyntactic Annotation - Pos Tagging Segmentation level: Word

OS-Inde-pendent

MS ACOPOST - A Collection of POS Taggers Yes

Media type: Text Resource type: Corpus Modality: Written Language

Media type: Text Resource type: Corpus Modality: Written Language

MS LX-Tagger √ Yes


Media type: Text Resource type: Cor-pus Modality: Written Language Segmentation level: Word

Linux

ILSP/ SL

ILSP Feature-based multi-tiered POS Tagger

√ Yes

XCES document with sentence and token boundaries recognised

XCES document with POS tags assigned to each token

OS-Inde-pendent (Web service)

Stan-ford Univ.

Stanford POS Tag-ger √ √

FR, AR, ZH

MS STEPP Tagger √ Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: English Character encoding: UTF-8 Annotation type: Structural Annotation Segmentation level: Sentence, Word

Media type: Text Resource type: Cor-pus Modality: Written Language Language: English Character encoding: UTF - 8 Annotation type: Morphosyntactic Annotation - Pos Tagging Segmentation level: Word

Unix

Univer-sität des Saar-landes

TnT √ √ SV NO

ILSP/ SL

FBT part-of-speech tagger available as a web service

√

MS PANTERA NO

Media type: Text Resource type: Language Descrip-tion Modality: Written Language

Media type: Text Resource type: Lexi-cal Conceptual Re-source Modality: Written Language

Linux

MS GENIA Tagger √ Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: English Character encoding: UTF - 8 Annotation type: Structural Annota-tion Segmentation level: Sentence, Word

Media type: Text Resource type: Cor-pus Modality: Written Language Language: English Character encoding: UTF - 8 Annotation type: Lemmatization, Mor-phosyntactic Annota-tion - Pos Tagging, Semantic Annotation - Named Entities, Syntactic Annotation - Shallow Parsing Segmentation level: Phrase, Word, Word

Unix



Group

http://phra-sys.net/

Qtag (v.1.0) √ √ NO

UPF/ SL freeling3_tagging √ √

GL, IT, RU, SP, CY, AST, CA

plain text

word, lemma, tag, probability, word-char-start and word-char-end all tab sepa-rated

Lin-guatec/SL

LT-POS-Defaulter √ √ file with a list of words

file with list of words, each with textform and POS + probability For German and English

Lin-guatec/SL

LT-Decomposer √ file containing a wordlist

file containing: input-form - lemma - POS - decomposition

UPF/ SL iula_tagger √

SP, CA

Obligatory Inputs: -Plain text (or txt file) -Language (lang) es: Spanish ca: Catalan en: English Optional Inputs -Encoding UTF-8 (default) or ISO-8859-1 -keeptags If your text have sgml tags and you will preserve it, mark this option

Output_form: treetagger (default) or iulact

UPF/ SL berkeley_tagger √ √ FR UPF/ Univer-sity of Cam-bridge


UPF/ SL iula_preprocess √

CA, SP

Obligatory Inputs: -Plain text (or txt file) -Language (lang) es: Spanish ca: Catalan en: English Optional Inputs -Encoding UTF-8 (default) or ISO-8859-1

In addition, hunmorph and Mmorph perform morphological analysis for a number of languages, and morpha (Minnen, G., J. Carroll and D. Pearce) performs morphological analysis for English.
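Several taggers in Table 3 return tab-separated output with one token per line; for freeling3_tagging the listed fields are word, lemma, tag, probability, word-char-start and word-char-end. A minimal Python sketch for reading such output is given below. It assumes that the fields appear in exactly that order and that the probability and offsets are numeric; the sample lines and tag values are invented for illustration and are not actual output of the service.

```python
from typing import NamedTuple


class TaggedToken(NamedTuple):
    word: str
    lemma: str
    tag: str
    probability: float
    char_start: int
    char_end: int


def parse_tagged_line(line):
    """Parse one tab-separated line; the field order is an assumption."""
    word, lemma, tag, prob, start, end = line.rstrip("\n").split("\t")
    return TaggedToken(word, lemma, tag, float(prob), int(start), int(end))


def parse_tagged_output(text):
    """Parse a whole output, skipping blank lines (e.g. between sentences)."""
    return [parse_tagged_line(line) for line in text.splitlines() if line.strip()]


# Invented sample in the assumed layout.
sample = "The\tthe\tDT\t1\t0\t3\ncats\tcat\tNNS\t0.98\t4\t8\n"
for token in parse_tagged_output(sample):
    print(token.word, token.lemma, token.tag)
```

A line with a different number of fields raises a `ValueError`, which makes deviations from the assumed layout visible instead of silently misaligning columns.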

Lemmatizers



Eleven (11) lemmatizers have been identified, 4 of them available through META-SHARE,

while six (6) are provided as a service (SoapLab). Their distribution for the four project lan-

guages is: a) 3 for German, b) 1 for Greek, c) 7 for English, and d) 2 for Portuguese.

Table 4: Lemmatizers for the four project languages

Source Name DE EL EN PT LD Input Output OS

MS PELCRA EN Lemmatizer √

Media type: Text Resource type: Lexical Conceptual Resource Language: English variety: en-gb Mime type: text/plain Character encoding: UTF - 8 Segmentation level: Word

Media type: Text Resource type: Lexical Con-ceptual Resource Language: English — variety: en-gb Mime type: text/plain Character encoding: UTF - 8 Annotation type: Morphosyn-tactic Annotation - Pos Tag-ging Annotation format: tags Tagset: http://www.natcorp.ox.ac.uk/docs/c5spec.html Segmentation level: Word

OS-Inde-pen-dent

MS Lemmatizer for Portu-guese

√

Media type: Text Resource type: Corpus Modality: Written Language Language: Portuguese

Media type: Text Resource type: Corpus Modality: Written Language Language: Portuguese

OS-Inde-pen-dent

ILSP/SL ILSP Lemmatizer √

XCES document with POS-tagged tokens

XCES document with lemmas assigned to each token

OS-Inde-pen-dent

MS FORMA

Media type: Text Resource type: Corpus Modality: Written Language Annotation type: Other Annotation format: text/plain Segmentation level: Word

Media type: Text Resource type: Corpus Modality: Written Language Annotation type: Lemmatiza-tion, Morphosyntactic Annota-tion - Pos Tagging Segmentation level: Word

Linux, Mac-OS, Win-dows

MS U-Compare Lemmatisa-tion service

√ Yes



OS-Inde-pen-dent

UPF/SL freeling3_tagging

√ √ plain text word, lemma, tag, probability, word-char-start and word-char-end all tab separated

Lin-guatec/SL

LT-POS-Defaulter √ √ file with a list of words

file with list of words, each with textform and POS + probability For German and English

Lin-guatec/SL

LT-Decomposer √ file containing a wordlist file containing: inputform -

lemma - POS - decomposition

Lin-guatec/SL

LT-Lemmatiser √ √

UPF/ University of Cam-bridge

tpc_rasp √ can be string or text file

UPF/SL iula_preprocess

√

Obligatory Inputs: -Plain text (or txt file) -Language (lang) es: Spanish ca: Catalan en: English Optional Inputs -Encoding UTF-8 (default) or ISO-8859-1

Parsers

Twenty-three (23) parsers have been identified, 12 of which are available through META-SHARE. Their distribution across the four project languages is: a) 3 for German, b) 3 for Greek, c) 12 for English, and d) 6 for Portuguese.

Table 5: Parsers for the four project languages

Source Name DE EL EN PT Other lang. LD Input Output OS

MS EngGram Con-straint Grammar Parser for English

√ Yes

Media type: Text Language: English

MS

GerGram Con-straint Grammar Parser for Ger-man

√ Yes



MS

PALAVRAS Con-straint Grammar Parsers for Portu-guese

√ Yes



Leonel F. de Alencar

Donatus Parsing Tools for Portu-guese

√

MS U-Compare Syn-tactic Parsing Service

√ Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: English Character encoding: UTF - 8

Media type: Text Resource type: Corpus Modality: Written Language Language: English Character encod-ing: UTF - 8

OS-Inde-pendent

MS MSTParser Yes

Media type: Text Resource type: Corpus Modality: Written Language Annotation type: Other Segmentation level: Word



MS Dizer √ Yes


Media type: Text Resource type: Language Descrip-tion Language: Portu-guese Annotation type: Discourse Annota-tion

OS-Inde-pendent, Linux, Mac-OS, Other, Unix, Windows

MS

MaltParser (MaltOptimizer, for automatic optimi-zation)

√ SV, FR

Yes

Media type: Text Language: Swe-dish, English, French




MS Enju parser √ Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: English Character encoding: UTF-8

Media type: Text Resource type: Corpus Modality: Written Language Language: English

OS-Independent

MS Lexicalized Parsing √ RO

Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: Romanian, English Character encoding: UTF-8

Media type: Text Resource type: Corpus Modality: Written Language Language: Romanian, English Character encoding: UTF-8 Annotation type: Lemmatization, Morphosyntactic Annotation - Pos Tagging, Segmentation, Syntactic Annotation - Shallow Parsing, Syntacticosemantic Annotation - Links Annotation format: text output with one token per line and annotations separated by tab Tagset: http://nl.ijs.si/ME/V3/msd/html/ Segmentation level: Sentence, Word

OS-Independent

OpenNLP CCG Library

CCG parser

MS VISL - multilingual dependency parser

Yes

University of Lisbon LXDepParser √

University of Alberta MINIPAR √

Yes

Linux, Solaris or Windows 95/98

ILSP/ SL

ILSP Dependency parser √

Yes

text text

OS-Independent (Web service)

Stanford Univ.

The Stanford Parser: A statistical parser

√ √ √ ZH, AR, IT, BG

NO text text

OS-Independent

MS CSTParser √ Yes


Media type: Text Resource type: Language Description Language: Portuguese Annotation type: Discourse Annotation

OS-Independent, Linux, Mac-OS, Other, Unix, Windows

MS Spejd NO


Media type: Text Resource type: Lexical Conceptual Resource Modality: Written Language

Linux, Windows

Presemt Phrase Model Generator Module (PMG)

√ √ √ ZH, IT, NO

NO text text

OS-Independent

UPF/SL freeling3_dependency √

AST, CA, GL, SP

Plain text Freeling output format, XML, XML CQP ready

UPF/SL berkeley_parser √ √ FR UPF/ University of Cambridge


UPF/SL bohnet_parser √ SP

In addition, there are 9 chunking tools, 7 available through META-SHARE and 2 in the form of web services.

Source Name DE

EL

EN

PT

Other lang.

LD Input Output OS

MS

MBT – Memory-Based Tagger-Generator and Tagger

NO

Media type: Text Resource type: Corpus Modality: Spoken Language, Written Language Character encoding: UTF-8 Annotation type: Morphosyntactic Annotation – Pos Tagging Annotation format: .txt Segmentation level: Word

Media type: Text Resource type: Corpus Modality: Spoken Language, Written Language Character encoding: UTF-8 Annotation type: Morphosyntactic Annotation – Pos Tagging Segmentation level: Word

Unix

MS

Shallow Processing with Unification and Typed Feature Structures

√ √

FR, IT, Dutch, SP, PO, CS, ZH, JA

NO


Media type: Text Language: English, Italian

Media type: Text Resource type: Lexical Conceptual Resource Modality: Written Language Language: English, Italian Annotation type: Lemmatization, Morphosyntactic Annotation – Pos Tagging, Semantic Annotation – Named Entities Segmentation level: Sentence, Word

Linux, Mac-OS



MS

YamCha: Yet Another Multipurpose Chunk Annotator

NO

Media type: Text Resource type: Corpus Modality: Written Language Annotation type: Morphosyntactic Annotation – Pos Tagging Annotation format: Plain text in column format. Segmentation level: Word

Media type: Text Resource type: Corpus Modality: Written Language Segmentation level: Word


MS

Tokenizing, Tagging, Lemmatizing and Chunking free running texts

√ RO, FR

Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: Romanian, English, French Character encoding: UTF-8

Media type: Text Resource type: Corpus Modality: Written Language Language: Romanian, English, French Character encoding: UTF-8 Annotation type: Lemmatization, Morphosyntactic Annotation – Pos Tagging, Segmentation, Syntactic Annotation – Shallow Parsing Annotation format: text output with one token per line and annotations separated by tab Tagset: http://nl.ijs.si/ME/V3/msd/html/ Segmentation level: Sentence, Word

OS-Independent

ILSP/ SL ILSP Chunker √

Yes

XCES document with POS-tagged and lemmatized tokens

standoff document with chunk annotations

OS-Independent

MS LX-Chunker √ NO

Media type: Text Resource type: Corpus Modality: Written Language Language: Portuguese Mime type: text/plain Character encoding: ISO-8859-1

Media type: Text Resource type: Corpus Modality: Written Language Language: Portuguese Mime type: text/plain Annotation type: Segmentation Segmentation level: Sentence

Linux

MS hunner NO

Media type: Text Resource type: Language Description Modality: Written Language

UPF/ SL

freeling3_parsed √

AST, CA, GL, SP

Plain text Freeling output format, XML, XML CQP ready

NE Recognizers

Thirteen (13) NE recognizers have been identified, 8 of which are available through META-SHARE. Their distribution for the project languages is as follows: a) 2 for German, b) 1 for Greek, c) 5 for English, and d) 2 for Portuguese.



Table 6: NE Recognizers for the four project languages

Source Name DE

EL

EN

PT

Other lang.

LD Input Output OS

MS NERanka: Named Entity Recognition and Annotation Tool

NO Media type: Text Media type: Text Windows

MS

U-Compare E-txt2DB: Giving structure to unstructured data

√ NO

Media type: Text Resource type: Lexical Conceptual Resource Modality: Written Language Mime type: txt Character encoding: UTF-8

Media type: Text Resource type: Lexical Conceptual Resource Language: English Character encoding: UTF-8 Annotation type: Semantic Annotation - Named Entities

OS-Independent

STANFORD

Stanford Named Entity Recognizer (NER)

√ √

Universität Heidelberg

German Named Entity Recognition (NER)

√

University of Lisbon

LXNer √

ILSP/ SL ilsp_nerc √

MS

MBT – Memory-Based Tagger-Generator and Tagger

NO

Media type: Text Resource type: Corpus Modality: Spoken Language, Written Language Character encoding: UTF-8 Annotation type: Morphosyntactic Annotation - Pos Tagging Annotation format: .txt Segmentation level: Word

Media type: Text Resource type: Corpus Modality: Spoken Language, Written Language Character encoding: UTF-8 Annotation type: Morphosyntactic Annotation - Pos Tagging Segmentation level: Word

Unix


MS YamCha: Yet Another Multipurpose CHunk Annotator

NO

Media type: Text Resource type: Corpus Modality: Written Language Annotation type: Morphosyntactic Annotation - Pos Tagging Annotation format: Plain text in column format. Segmentation level: Word

Media type: Text Resource type: Corpus Modality: Written Language Segmentation level: Word


MS U-Compare Named Entity Recognition service

Yes

Media type: Text Resource type: Corpus Modality: Written Language Language: English Character encoding: UTF-8

Media type: Text Resource type: Corpus Modality: Written Language Language: English Character encoding: UTF-8

OS-Independent

MS U-Compare Species Disambiguation Service

√ Yes



OS-Independent

MS hunner NO


UPF/ SL anonymizer √ √

AST, CA, GL, IT, RU, SP, CY

Aligners

Thirteen (13) aligners have been identified, 4 of which are available through META-SHARE. The following table presents text aligners with information on language pairs and level of alignment (sentence, phrase or word).

Table 7: Aligners for the four project languages

Source Name DE

EL

EN

PT

Other lang.

LD Input Output OS

MS Coral Corpus Aligner (bilingual corpora alignment)

NO


Media type: Text Resource type: Language Description Modality: Written Language

OS-Independent

MS Lingua-Align (Syntactic tree alignment, word alignment)

NO Media type: Text

Robert C. Moore Moore's aligner

Johns Hopkins University (CLSP/JHU)

GIZA++

Linux, Irix and SUNOS systems

PRESEMT Presemt Phrase Aligner Module (PAM) (word and phrase alignment)

√ √ √ ZH, IT, NO

NO text text

OS-Independent

MS hunalign (sentence alignment) NO

Media type: Text Resource type: Language Description Modality: Written Language

Media type: Text Resource type: Lexical Conceptual Resource Modality: Written Language

Linux, Windows

MS

ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora

NO

Media type: Text Resource type: Corpus Modality: Written Language Character encoding: UTF-8

Media type: Text Resource type: Corpus, Lexical Conceptual Resource Modality: Written Language Character encoding: UTF-8 Annotation type: Alignment, Translation Segmentation level: Sentence, Word Group

Windows

Europarl (Univ. of Edinburgh)

Europarl tools √ √ √ √ IT, FR, SV, etc

DCU/SL gma √ √

ES, FR, IT

DCU/SL bsa

NO

DCU/SL anymalign

NO

Tokenised one-sentence-per-line text for source and target languages.

phrase table in Moses format

DCU/SL/ OpenMaTrEx

chunk_aligner √ √ √

ES, FR, GA, IT, CS, CA

DCU/SL berkeley_aligner √ √ FR

Toolkits, Platforms & Workbenches

In addition to text engineering platforms such as GATE (http://gate.ac.uk/), which are well known and widely used for many languages, the following table presents suites of tools (toolkits, workbenches, web services and platforms) consisting of various components, from lemmatizers and POS taggers up to semantic role labellers.



Table 8: Collections of tools (platforms, workbenches, toolkits and web services)

Source Name DE

EL

EN

PT

LD Input Output OS

MS U-Compare Platform √

NO


Media type: Text Resource type: Corpus Modality: Written Language Annotation type: Discourse Annotation, Lemmatization, Morphosyntactic Annotation - Pos Tagging, Semantic Annotation, Semantic Annotation - Certainty Level, Semantic Annotation - Entity Mentions, Semantic Annotation - Events, Semantic Annotation - Named Entities, Semantic Annotation - Semantic Classes, Semantic Annotation - Semantic Relations, Semantic Annotation - Semantic Roles, Semantic Annotation - Word Senses, Stemming, Syntactic Annotation - Shallow Parsing, Syntactic Annotation - Subcategorization Frames, Syntacticosemantic Annotation - Links

OS-Independent

MS U-Compare Workbench

NO

Media type: Text Resource type: Corpus Modality: Written Language Annotation type: Discourse Annotation, Lemmatization, Morphosyntactic Annotation - Pos Tagging, Semantic Annotation, Semantic Annotation - Certainty Level, Semantic Annotation - Entity Mentions, Semantic Annotation - Events, Semantic Annotation - Named Entities, Semantic Annotation - Polarity

Media type: Text Resource type: Corpus Modality: Written Language Annotation type: Discourse Annotation, Lemmatization, Morphosyntactic Annotation - Pos Tagging, Semantic Annotation, Semantic Annotation - Certainty Level, Semantic Annotation - Entity Mentions, Semantic Annotation - Events, Semantic Annotation - Named Entities, Semantic Annotation - Polarity, Semantic Annotation - Semantic Classes, Semantic Annotation - Semantic Relations, Semantic Annotation - Semantic Roles, Semantic Annotation - Word Senses, Syntactic Annotation - Shallow Parsing, Syntactic Annotation - Subcategorization Frames, Syntactic Annotation - Treebanks, Translation

OS-Independent

MS Uplug √ √ √ NO

Media type: Text Resource type: Corpus Modality: Written Language Language: Catalan; Valencian, Czech, Danish, English, French, Hungarian, German, Portuguese, Russian, Slovenian, Spanish; Castilian, Swedish





MS LX-Service √ Yes


Media type: Text OS-Independent

Stanford Univ.

Stanford CoreNLP √

Yes

The Apache Software Foundation

OpenNLP

MS Heart of Gold √ NO

MS

Shallow Processing with Unification and Typed Feature Structures

√ √ NO

ILSP/ SL

ILSP web services √

Input is either plain text or an XCES document with text segmented in paragraphs.

The output by default is an XCES document.

Additional Tools

Several other tools that might prove useful are listed below.

Table 9: Additional tools which can be used for processing a corpus.

Source Name Short Description

MS IULA paradigma Web Service

As a general rule, IULA SOAP web services accept input data either as 'direct string' data or as URL. Output results are given both as 'direct string' data and as URL. For large outputs, the 'direct string' option is disabled.

MS PELCRA Language Detector

The PELCRA language detector is a Java tool for detecting the language of an arbitrary stretch of text developed by the PELCRA team at the University of Łódź, available under the GPL licence. The first version of this tool only supports binary classification scenarios in which one wants to detect one of two possible languages. A model for distinguishing between Polish and English is provided with the software. The default language detector provided in the package uses a binary support vector machine classifier implementation.

MS Hunspell

Hunspell is the spell checker of LibreOffice, OpenOffice.org, Mozilla Firefox 3 & Thunderbird, Google Chrome, and it is also used by proprietary software packages, like Mac OS X, InDesign, memoQ, Opera and SDL Trados. Main features: Extended support for language peculiarities; Unicode character encoding, compounding and complex morphology. Improved suggestion using n-gram similarity, rule and dictionary based pronunciation data. Morphological analysis, stemming and generation. C++ library under GPL/LGPL/MPL tri-license. Interfaces and ports: Enchant (generic spelling library from the Abiword project), XSpell (Mac OS X port; Hunspell has been part of OS X since version 10.6 (Snow Leopard), so it is now enough to place the Hunspell dictionary files into ~/Library/Spelling or /Library/Spelling for spell checking), Delphi, Java (JNA, JNI), Perl, .NET, Python, Ruby, UNO, RichEdit.

MS jExSLI java Extremely Simple Language Identifier is a text language identifier. An initial list of languages contains 20 commonly used languages and can be easily extended.

MS Moses a factored phrase-based, hierarchical and syntax decoder for statistical machine translation



MS LetsMT! Resource Repository Software

A software package for building and maintaining resources for statistical machine translation such as parallel corpora and monolingual corpora

MS MaltOptimizer A system for automatic optimization of MaltParser

MS Collocation and Term Extractor

CollTerm is a language independent tool for collocation and term extraction. It collects collocation and term candidates based on five different co-occurrence measures for multiword units or on distributional differences from a large representative corpus by applying TF-IDF. The language dependent part consists of a stop-word list and a list of MWU MSD patterns, which can also be coded as regular expressions. The first version of this application is available as an integral part of ACCURAT Toolkit (http://www.accurat-project.eu/index.php?p=accurat-toolkit).

MS

ComLinToo: The Computational Linguistics Toolset

The Computational Linguistics Toolset is a set of tools for computational linguistics. It contains re-usable code for cleaning, splitting, refining, and taking samples from corpora (ICE, Penn, and a native one), for tagging them using the TnT-tagger, for doing permutation statistics on N-grams, etc.

SRI (and Johns Hopkins University)

SRILM - The SRI Language Modeling Toolkit

SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation.

MS TextMatch

TextMatch is a web service that combines language independent ways of computing the similarity between two documents using powerful linguistic tools, and provides different measures of document similarity. It compares documents in different formats (such as Microsoft document formats, OpenDocument Format, Portable Document Format, Electronic Publication Format, HyperText Markup Language, Rich Text Format, Text formats). TextMatch recognizes the language of the document using a two-stage language detection system. Specific language tokenizers, lemmatizers and other analyzers are utilized for English, Bulgarian, German, French, and Russian. A language independent comparison algorithm is used if one of the uploaded documents is in another language.

MS Treat

Treat is a toolkit for natural language processing and computational linguistics in Ruby. It provides support for tasks such as document retrieval, text chunking, segmentation and tokenization, natural language parsing, part-of-speech tagging, keyword extraction and named entity recognition.

MS

U-Compare Paragraph-Breaking Service

Web service created by exporting UIMA-based workflow from the U-Compare text mining system. Functionality: Identifies paragraphs in plain text. Tools in workflow: MLRS Paragraph Splitter (University of Malta). The licence provided covers the web service only. Tools used to create the workflow may have their own licences.

MS Blacklist Classifier A language identifier for closely related languages

MS Language Identifier

SOAP Web Service which identifies the language of a written text. It can identify up to 57 languages.

http://cwb.sourceforge.net/

IMS Corpus Workbench (CWB), (Version 3.0)

Corpus querying system

Stanford Univ.

Stanford Classifier

A machine learning classifier, directed at text categorization. A conditional loglinear classifier (a.k.a. a maximum entropy or multiclass logistic regression model).

Stanford Univ.

Tregex, Tsurgeon, and Semgrex

A Tgrep2-style utility for matching patterns in trees and a tree-transformation utility built on top of this matching language. A similar utility for matching patterns in dependency graphs.

Stanford Univ.

Stanford TokensRegex A tool for matching regular expressions over tokens.

Stanford Univ.

Stanford Temporal Tagger (SUTime)

A rule-based temporal tagger for English text. Online SUTime demo

MS GNU Aspell GNU Aspell is a Free and Open Source spell checker designed to eventually replace Ispell. Used either as a library or as an independent spell checker. Aspell can also easily check documents in UTF-8 without having to use a special dictionary.

MS Bibliša: Aligned Collection Search Tool

This tool is a web application for search of digital libraries of articles from bilingual e-journals in the form of TMX documents, as well as for development of new bilingual lexical resources based on this search. It is based on previously developed components for LeXimir (work station for lexical resources) and VebRanka (web query expansion tool) and uses various lexical resources: Wordnets, e-dictionaries and terminological lists.

MS Docent A document-Level Local Search Decoder for Phrase-Based SMT

MS TREFL – Translation Reference Library

TREFL is a portable, multifunctional database management application for Windows, having the combined characteristics of both a Translation Memory System (bilingual databases, fuzzy matching, concordance, alignment, importing and exporting translation memories, etc.) and those of an Internet/Desktop Search Engine (searching, like with Google search, all these words, this exact phrase, I’m feeling Lucky, etc.), plus some elements of semantic search. It is intended to be used as a simple, versatile, portable, effective and customizable reading, writing and translation aid tool capable of managing very large databases.

MS NERosetta

NERosetta is a multiuser web application to facilitate retrieval and comparison of named entities in single or parallel texts. The main named entity categorization is realized according to the Quaero annotation recommendation and provides a user with approximately 50 different search options. Registered users have an extra possibility to share annotated resources (in XML format) and annotation schemas for the set of languages as well as to manage their own resources and schemas. The initial version supports Stanford NER 3 and Stanford NER 7.



4 Conclusions

The tools and web services together with the monolingual, multilingual and lexical/conceptual datasets offer a good starting point for the machine translation research, development and evaluation infrastructure. The following figures summarise the counts of all the tools identified per language and type (some tools are language independent and can cover additional languages).

[Figures 1-4: charts of tool counts per type. Figure 1: Tools for processing German. Figure 2: Tools for processing Greek. Figure 3: Tools for processing English. Figure 4: Tools for processing Portuguese.]

Currently we are experimenting with providing a mechanism within the QT21 repository for processing datasets with appropriate natural language processing tools, provided as SOAP web services. The idea is that for each stored resource (monolingual or parallel corpora) the extra option of “process” is integrated in the QT21 repository, besides the current options of downloading a dataset and editing its metadata-based description. When the user selects “process”, a list of processing tools for each relevant task will be provided (for the given language and resource type). As soon as the user selects a tool, the server will invoke a service that will send the corpus to the specific tool web service for processing. Because of the long processing time due to the large size of most of the resources, the system will inform the user about the requested job via the messaging service of the platform. When the processing has been completed, the new (annotated) resource will be automatically stored in the repository, the inventory will be automatically updated, and the user will be informed. If the user, for any reason, requests to process a resource with a specific tool, and this resource has already been processed by the specific tool, then the system will just forward the user to the processed resource that has been created in the repository.

A first implementation of such functionality is under development. Initially, it will provide processing tools for Greek and English, compatible in terms of input/output with XML CES, and selected from the tools and services developed in the framework of PANACEA (http://panacea-lr.eu).
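The planned "process" workflow can be illustrated with a minimal Python sketch. It is not the QT21 implementation: the Repository class, its fields, the identifier scheme for derived resources and the toy_tokenizer function are all hypothetical, and a plain Python callable stands in for the call to a SOAP tool web service. The sketch only shows the control flow described in this section: notify the user when a job starts, store the annotated resource when it finishes, and forward the user to the existing result when the same resource has already been processed with the same tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Repository:
    """Hypothetical in-memory stand-in for the repository (not the real QT21 API)."""
    # resource id -> resource content
    resources: Dict[str, str] = field(default_factory=dict)
    # (resource id, tool name) -> id of the derived, annotated resource
    processed: Dict[Tuple[str, str], str] = field(default_factory=dict)
    # stand-in for the messaging service of the platform
    messages: List[str] = field(default_factory=list)

    def notify(self, text: str) -> None:
        self.messages.append(text)

    def process(self, resource_id: str, tool_name: str,
                tool: Callable[[str], str]) -> str:
        key = (resource_id, tool_name)
        # Already processed with this tool: forward to the stored result.
        if key in self.processed:
            return self.processed[key]
        self.notify(f"job started: {tool_name} on {resource_id}")
        # In the real system this step would call the tool's web service.
        annotated = tool(self.resources[resource_id])
        new_id = f"{resource_id}+{tool_name}"  # assumed naming scheme
        self.resources[new_id] = annotated     # store the new resource
        self.processed[key] = new_id           # update the inventory
        self.notify(f"job finished: {new_id}")
        return new_id


def toy_tokenizer(text: str) -> str:
    """Hypothetical tool: one whitespace-separated token per line."""
    return "\n".join(text.split())


repo = Repository(resources={"corpus-1": "a small corpus"})
first = repo.process("corpus-1", "toy_tokenizer", toy_tokenizer)
second = repo.process("corpus-1", "toy_tokenizer", toy_tokenizer)
print(first, second == first, len(repo.messages))
```

The second call returns the identifier created by the first one and sends no further messages, which mirrors the forwarding behaviour described for already processed resources. A real deployment would run the job asynchronously, given the long processing times mentioned above.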
