Preparation and Launch of a Large-scale Action for Quality Translation Technology D 4.4.1 An inventory of existing language tools
Deliverable 4.4.1
v.1.0
An inventory of existing language tools
Author(s): Kanella Pouli, Juli Bakagianni, Stelios Piperidis
Dissemination Level: Public
Date: 19.07.2013
This work is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
Grant agreement no.: 296347
Project acronym: QTLaunchPad
Project full title: Preparation and Launch of a Large-scale Action for Quality Translation Technology
Funding scheme: Coordination and Support Action
Coordinator: Prof. Hans Uszkoreit (DFKI)
Start date, duration: 1 July 2012, 24 months
Distribution: Public
Contractual date of delivery: June 2013
Actual date of delivery: 19 July 2013
Deliverable number: D4.4.1
Deliverable title: An inventory of existing language tools
Type: Report
Status and version: Pre-Final
Number of pages: 30
Contributing partners: DFKI, DCU
WP Leader: ILSP
Task Leader: ILSP
Authors: Kanella Pouli, Juli Bakagianni, Stelios Piperidis
EC project officer: Aleksandra Wesolowska

The partners in QTLaunchPad are:
• Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Germany
• Dublin City University (DCU), Ireland
• Institute for Language and Speech Processing, R.C. “Athena” (ILSP/ATHENA RC), Greece
• The University of Sheffield (USFD), United Kingdom
For copies of reports, updates on project activities and other QTLaunchPad-related information, contact:
DFKI GmbH
QTLaunchPad
Dr. Aljoscha Burchardt
[email protected]
Alt-Moabit 91c
10559 Berlin, Germany
Phone: +49 (30) 23895-1838
Fax: +49 (30) 23895-1810
Copies of reports and other material can also be accessed via http://www.qt21.eu/launchpad
© 2013, The Individual Authors
Table of Contents
1 Executive summary
2 Introduction
3 The inventory
3.1 Targeted tools and services
3.2 Results
4 Conclusions
1 Executive summary
This report presents the current results of an ongoing survey, and the resulting inventory, of
language processing tools needed for machine translation research and
development within the framework of the QTLaunchPad project. It is the outcome of
Task 4.4 “Identification and acquisition of existing tools”. To provide the support
mechanisms needed for documentation, sharing, search and retrieval of MT-related tools,
a QTLaunchPad-dedicated META-SHARE node/repository (http://qt21.metashare.ilsp.gr)
has been set up and initially populated with MT-related language processing tools and/or
their metadata-based descriptions. At this stage the tools cover processing and annotation at
the levels of tokenisation and sentence splitting, POS tagging and lemmatisation, syntactic
parsing, named entity recognition and text alignment, for English, German, Greek and
Portuguese.
2 Introduction
This report presents the current results of an ongoing survey, and the resulting inventory, of
language processing tools¹ needed for machine translation research and
development within the framework of the QTLaunchPad project. It is the outcome of
Task 4.4 “Identification and acquisition of existing tools”.
In the framework of QTLaunchPad, language resources encompass monolingual and
multilingual data sets, structured (i.e. lexica, terminological databases, thesauri) and unstructured
(i.e. raw text corpora), as well as language processing tools such as tokenizers and
sentence splitters, POS taggers and lemmatizers, parsers, NE recognizers, etc.
To provide the support mechanisms needed for documentation, sharing, search and
retrieval of MT-related tools, QTLaunchPad builds upon and extends the META-SHARE
infrastructure (www.meta-share.eu, www.meta-share.org). To this end, a
QTLaunchPad-dedicated META-SHARE node/repository (http://qt21.metashare.ilsp.gr) has been set up and
populated with MT-related language processing tools and/or their metadata-based
descriptions. The targeted tools come from two sources:
I. tools already available in the META-SHARE network, especially those under
permissive terms, allowing at least research use, as well as tools already widely
used in MT research;
II. tools coming from the QTLaunchPad consortium partners, and in the immediate
future tools coming from the partnership that QTLaunchPad is mobilizing towards the
QT21 action.
1 The identification and acquisition of data sets is the objective of deliverable 4.1.1.
The current version of the inventory of the QTLaunchPad-dedicated META-SHARE
repository focuses on item I and partially on item II. The survey and the creation of the inventory took
place between October 2012 and June 2013, coinciding with META-SHARE version 3.0 and
the respective population cycles, which continued almost without interruption until June 2013,
with a peak in April 2013. The population of the QTLaunchPad-dedicated repository
(and of the respective inventory) is considered ongoing throughout the duration of the project,
and this report provides only a preliminary overview of the available tools.
In order to meet the goals of task 4.4 the following decisions and actions were taken:
• The META-SHARE distributed repository network has been thoroughly searched
according to existing metadata descriptions; taking into account the diverse application
scenarios, the search criteria were appropriately extended (e.g. by including
additional tools commonly used to process and manage data sets).
• Tools are mainly, but not exclusively, targeted towards the predefined QTLaunchPad
languages, namely German (DE), Greek (EL), English (EN) and Portuguese (PT).
• A META-SHARE compliant repository and the respective inventory have been set up,
serving the needs for present data storage, access and future enrichment from other
sources.
• The documentation of tools follows the META-SHARE metadata model.
• All tools with permissive terms of use have been replicated on the
QTLaunchPad-dedicated META-SHARE repository.
The material gathered during this survey has been filtered to select tools of interest for the
purposes of the project, and it has been used to initiate the population of the QTLaunchPad
repository at http://qt21.metashare.ilsp.gr/. For the time being, access to this repository is
limited to the project consortium. In the immediate future, members of the QTLP and
QT21 communities will be able to register and not only access and download the tools but also
modify them and their descriptions, as well as add new ones, thus enriching the
repository and inventory and keeping them up to date. Populating the repository is, and will
remain, an ongoing process. The repository aims to act as a reference point that includes all tools
that are available, relevant and fit for MT research and development, while respecting all legal
(and possibly other) restrictions and preferences.
3 The inventory
3.1 Targeted tools and services
For the purposes of QTLaunchPad, the main language processing tools of interest have
been predefined as follows:
• Tokenizers
• Sentence Splitters
• POS Taggers
• Lemmatizers
• Syntactic Parsers
• Named Entity (NE) Recognizers
• Parallel Text Aligners
Details about these tools, their main functions, the possible combinations, integration in
workflows and generic interfaces are also available in deliverables D3.1.1 and D3.2.1 of the
project. An overview of the datasets for the languages of the project is available in D4.1.1.
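As a rough illustration of how the first two tool categories are typically chained, the sketch below passes raw text through a naive sentence splitter and then a naive tokenizer. It is not the implementation of any tool in this inventory; the function names and the regular-expression rules are assumptions chosen for brevity, and real tools handle abbreviations, numbers and language-specific conventions that these rules ignore.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by
    whitespace and an uppercase letter (illustrative rule only)."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Naive tokenizer: words (optionally with internal hyphens or
    apostrophes) and single punctuation marks become tokens."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

def pipeline(text):
    """Chain the two steps: one list of tokens per detected sentence."""
    return [tokenize(sentence) for sentence in split_sentences(text)]

if __name__ == "__main__":
    for tokens in pipeline("Tools are listed. They are free!"):
        print(tokens)
```

Later stages such as POS tagging or lemmatisation would consume the per-sentence token lists produced here.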
3.2 Results
Tools are listed per category, while collections and suites of tools (e.g. the suites provided by
Stanford NLP Group, LX-Center etc.) or toolkits for NLP usage are presented separately. In
addition, tools already implemented in the form of web services or integrated in workflows
are marked. The web services presented in this report are mainly provided by Soaplab,
which is “a tool that can automatically generate and deploy Web Services on top of existing
command-line analysis programs. It is especially well suited for applications with well
described input and output parameters, such as EMBOSS (a package of Open Source
software for sequence analysis). Soaplab allows integration of many applications within a single
programming interface. Soaplab can also interoperate with other Web Services and can
create Web Services on top of existing web resources (e.g. extracting data from a third-party
web page and providing its data as a Web Service) - a sub-project called Gowlab.”²
2 Soaplab2. Last modified: August 10th, 2010, http://soaplab.sourceforge.net/soaplab2/
Soaplab has already been used in the PANACEA³ project, which provides through its Web
Portals 157 web services (PANACEA Registry) and 74 shared workflows (PANACEA
myExperiment).
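The idea of exposing a command-line program behind a uniform programmatic interface can be sketched in a few lines. This is not Soaplab's API and does not deploy a web service; it only shows the wrapping step that such a service would build on. The function name `run_cli_tool` and the stand-in "tool" (a one-line Python command that upper-cases its input) are hypothetical.

```python
import subprocess
import sys

def run_cli_tool(command, text, timeout=30):
    """Send `text` to a command-line program on stdin and return its stdout.

    `command` is the argument list of the wrapped program. A non-zero exit
    status raises subprocess.CalledProcessError.
    """
    result = subprocess.run(
        command,
        input=text,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout

# Hypothetical stand-in for a real analysis program: upper-cases stdin.
UPPERCASE_TOOL = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

if __name__ == "__main__":
    print(run_cli_tool(UPPERCASE_TOOL, "sentence splitting"))
```

A service layer would add a network endpoint and parameter descriptions on top of such a wrapper; those parts are deliberately left out here.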
Sentence Splitters
Thirteen (13) sentence splitters and detectors have been identified, 8 of which are available
through META-SHARE. Their distribution for the four project languages is: a) 3 for German,
b) 2 for Greek, c) 9 for English, and d) 3 for Portuguese.
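Per-language distributions such as the one above can be recomputed from a machine-readable copy of the inventory. The sketch below assumes a hypothetical record layout (a `name` and a list of `languages` codes per tool); it is not the META-SHARE metadata schema, and the three sample records are only a small excerpt, so the resulting counts differ from the totals reported here.

```python
from collections import Counter

PROJECT_LANGUAGES = ("DE", "EL", "EN", "PT")

def language_distribution(inventory):
    """Count, per project language, how many tools list that language."""
    counts = Counter()
    for tool in inventory:
        for lang in tool["languages"]:
            if lang in PROJECT_LANGUAGES:
                counts[lang] += 1
    # Report every project language, including those with zero tools.
    return {lang: counts[lang] for lang in PROJECT_LANGUAGES}

# Small excerpt of sentence splitters, in the assumed record layout.
SAMPLE_INVENTORY = [
    {"name": "Europarl tools", "languages": ["DE", "EL", "EN", "PT", "IT", "FR", "SV"]},
    {"name": "SENTER", "languages": ["PT"]},
    {"name": "Tokenizer-sentence splitter for German", "languages": ["DE"]},
]

if __name__ == "__main__":
    print(language_distribution(SAMPLE_INVENTORY))
```

For the three sample records this yields 2 tools for German, 1 for Greek, 1 for English and 2 for Portuguese.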
Table 1: Sentence splitters and detectors for the four project languages
| Source | Name | Project lang. (DE/EL/EN/PT) | Other lang. | LD⁴ | Input | Output | OS⁵ |
| MS⁶ | huntoken | EN | HU | No | Media type: Text; Resource type: Language Description; Modality: Written Language | Media type: Text; Resource type: Lexical Conceptual Resource; Modality: Written Language | Linux |
| MS | U-Compare Named Entity Recognition service | | | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English; Character encoding: UTF-8 | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English; Character encoding: UTF-8 | OS-independent |
| MS | U-Compare Species Disambiguation Service | EN | | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English; Character encoding: UTF-8 | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English; Character encoding: UTF-8 | OS-independent |
| Univ. of Edinburgh | Europarl tools | DE, EL, EN, PT | IT, FR, SV | | | | |
| MS | Tokenizing, Tagging, Lemmatizing and Chunking free running texts | EN | RO, FR | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: Romanian, English, French; Character encoding: UTF-8 | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: Romanian, English, French; Character encoding: UTF-8; Annotation type: Lemmatization, Morphosyntactic Annotation - Pos Tagging, Segmentation, Syntactic Annotation - Shallow Parsing; Annotation format: text output with one token per line and annotations separated by tab; Tagset: http://nl.ijs.si/ME/V3/msd/html/; Segmentation level: Sentence, Word | OS-independent |
| Ruhr-Universität Bochum | Tokenizer-sentence splitter for German | DE | | | | | |
| MS | UIMA/U-Compare OpenNLP Sentence Detector | EN | | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English; Annotation type: Structural Annotation; Segmentation level: Sentence | OS-independent |
| MS | U-Compare Cafetiere English Sentence Detector | EN | | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English; Character encoding: UTF-8; Segmentation level: Sentence | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English; Annotation type: Structural Annotation; Segmentation level: Sentence | OS-independent |
| MS | UIMA/U-Compare GENIA Sentence Detector | EN | | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English; Annotation type: Structural Annotation; Segmentation level: Sentence | OS-independent |
| MS | SENTER | PT | | Yes | Media type: Text; Modality: Written Language; Language: Portuguese | Media type: Text; Modality: Written Language; Language: Portuguese; Segmentation level: Sentence | Windows |
| ILSP/SL⁷ | Sentence splitter and tokenizer for Greek text | EL | | Yes | text | text | OS-independent (Web service) |
| UPF/SL | freeling3_sentence_splitter | EN, PT | GL, IT, RU, SP, CY, AST, CA | | | | |
| Linguatec/SL | LT-SentenceSplitter | DE, EN | | | | | |

3 http://www.panacea-lr.eu/
4 LD stands for Language Dependent.
5 OS stands for Operating System.
6 MS stands for MetaShare.
7 SL stands for SoapLab web services, which appear in grey coloured cells.
Tokenizers
Nineteen (19) tokenizers have been identified, 12 of which are available through META-SHARE.
Their distribution for the four project languages is: a) 4 for German, b) 2 for Greek, c) 15 for
English, and d) 6 for Portuguese.
Table 2: Tokenizers for the four project languages
| Source | Name | Project lang. (DE/EL/EN/PT) | Other lang. | LD | Input | Output | OS |
| MS | UIMA/U-Compare Apertium Morphological Analyser | EN, PT | EU, CA, GL, SP | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English, Spanish, Portuguese, Basque, Galician, Catalan | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English, Spanish, Portuguese, Basque, Galician, Catalan; Annotation type: Lemmatization, Morphosyntactic Annotation - Pos Tagging, Segmentation; Segmentation level: Word | |
| MS | huntoken | EN | HU | No | Media type: Text; Resource type: Language Description; Modality: Written Language | Media type: Text; Resource type: Lexical Conceptual Resource; Modality: Written Language | Linux |
| MS | TextPro | EN | IT | Yes | Media type: Text; Language: English, Italian | Media type: Text; Resource type: Lexical Conceptual Resource; Modality: Written Language; Language: English, Italian; Annotation type: Lemmatization, Morphosyntactic Annotation - Pos Tagging, Semantic Annotation - Named Entities; Segmentation level: Sentence, Word | Linux, Mac-OS |
| MS | U-Compare Apertium Part-of-Speech Tagging Workflow | EN, PT | SP, CA, GL, EU | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English, Spanish (Castilian), Portuguese, Catalan (Valencian), Galician, Basque; Character encoding: UTF-8 | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English, Spanish (Castilian), Portuguese, Catalan (Valencian), Galician, Basque; Character encoding: UTF-8 | OS-independent |
| MS | U-Compare Lemmatisation service | EN | FR, RO | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English, French, Romanian; Character encoding: UTF-8 | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English, French, Romanian; Character encoding: UTF-8 | OS-independent |
| Univ. of Edinburgh | Europarl tools | DE, EL, EN, PT | IT, FR, SV | | | | |
| MS | Tokenizing, Tagging, Lemmatizing and Chunking free running texts | EN | RO, FR | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: Romanian, English, French; Character encoding: UTF-8 | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: Romanian, English, French; Character encoding: UTF-8; Annotation type: Lemmatization, Morphosyntactic Annotation - Pos Tagging, Segmentation, Syntactic Annotation - Shallow Parsing; Annotation format: text output with one token per line and annotations separated by tab; Tagset: http://nl.ijs.si/ME/V3/msd/html/; Segmentation level: Sentence, Word | OS-independent |
| Ruhr-Universität Bochum | Tokenizer-sentence splitter for German | DE | | | | | |
| ILSP/SL | Sentence splitter and tokenizer for Greek text | EL | | Yes | text | text | OS-independent (Web service) |
| MS | U-Compare Tokenisation service | PT | | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: Portuguese; Character encoding: UTF-8; Annotation format: XML | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: Portuguese; Character encoding: UTF-8; Annotation type: Segmentation; Annotation format: XML; Segmentation level: Sentence, Word | OS-independent |
| MS | IULA tokenizer Web Service | EN | CA, SP | Yes | Media type: Text; Modality: Written Language; Language: Spanish, Catalan, English; Mime type: text/plain; Character encoding: UTF-8 | Media type: Text; Resource type: Corpus; Modality: Written Language; Mime type: text/xml; Character encoding: UTF-8; Annotation type: Segmentation | OS-independent |
| MS | JTok Tokenizer | DE, EN | | Yes | Media type: Text; Language: German, English | Media type: Text; Language: German, English; Character encoding: UTF-8; Annotation type: Segmentation; Annotation format: XML; Segmentation level: Paragraph, Sentence, Word | OS-independent |
| MS | UIMA/U-Compare GENIA Tokeniser (GENIA Tagger) | EN | | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English; Annotation type: Segmentation; Segmentation level: Sentence | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English; Annotation type: Structural Annotation; Segmentation level: Word | OS-independent |
| MS | UIMA/U-Compare OpenNLP Tokenizer | EN | | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English; Annotation type: Structural Annotation; Segmentation level: Sentence | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English; Annotation type: Structural Annotation; Segmentation level: Word | OS-independent |
| MS | LX-Tokenizer | PT | | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language | Media type: Text; Resource type: Corpus; Modality: Written Language; Segmentation level: Word | Linux |
| Stanford Univ. | Stanford EnglishTokenizer | EN | | | | | |
| UPF/SL | freeling3_tokenizer | EN, PT | CA, GL, IT, RU, SP, CY, AST | | | | |
| Linguatec/SL | LT-Tokeniser | DE, EN | | | | | |
| UPF/University of Cambridge | tpc_rasp | EN | | | string or text file | | |
POS taggers
Thirty-three (33) taggers have been identified, 20 of which are available through
META-SHARE. Their distribution for the four project languages is: a) 7 for German, b) 3 for Greek,
c) 20 for English, and d) 8 for Portuguese.
Table 3: POS taggers for the four project languages
| Source | Name | Project lang. (DE/EL/EN/PT) | Other lang. | LD | Input | Output | OS |
| MS | EngGram Constraint Grammar Parser for English | EN | | Yes | Media type: Text; Language: English | | |
| MS | GerGram Constraint Grammar Parser for German | DE | | Yes | Media type: Text; Language: German | Media type: Text; Language: German | |
| MS | PALAVRAS Constraint Grammar Parsers for Portuguese | PT | | Yes | Media type: Text; Language: Portuguese | Media type: Text; Language: Portuguese | |
| MS | UIMA/U-Compare Apertium Morphological Analyser | EN, PT | EU, CA, GL, SP | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English, Spanish, Portuguese, Basque, Galician, Catalan | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English, Spanish, Portuguese, Basque, Galician, Catalan; Annotation type: Lemmatization, Morphosyntactic Annotation - Pos Tagging, Segmentation; Segmentation level: Word | |
| MS | FORMA | | | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Annotation type: Other; Annotation format: text/plain; Segmentation level: Word | Media type: Text; Resource type: Corpus; Modality: Written Language; Annotation type: Lemmatization, Morphosyntactic Annotation - Pos Tagging; Segmentation level: Word | Linux, Mac-OS, Windows |
| MS | hunpos | | | No | Media type: Text; Resource type: Language Description; Modality: Written Language | | |
| MS | MBT – Memory-Based Tagger-Generator and Tagger | | | No | Media type: Text; Resource type: Corpus; Modality: Spoken Language, Written Language; Character encoding: UTF-8; Annotation type: Morphosyntactic Annotation - Pos Tagging; Annotation format: .txt; Segmentation level: Word | Media type: Text; Resource type: Corpus; Modality: Spoken Language, Written Language; Character encoding: UTF-8; Annotation type: Morphosyntactic Annotation - Pos Tagging; Segmentation level: Word | Unix |
| MS | YamCha: Yet Another Multipurpose CHunk Annotator | | | No | Media type: Text; Resource type: Corpus; Modality: Written Language; Annotation type: Morphosyntactic Annotation - Pos Tagging; Annotation format: Plain text in column format; Segmentation level: Word | Media type: Text; Resource type: Corpus; Modality: Written Language; Segmentation level: Word | Linux, Mac-OS, Windows |
| MS | CombiTagger | | | No | Media type: Text; Resource type: Corpus; Modality: Written Language; Character encoding: UTF-8 | Media type: Text; Resource type: Corpus; Modality: Written Language; Character encoding: UTF-8 | Linux, Mac-OS, Windows |
| MS | U-Compare Apertium Part-of-Speech Tagging Workflow | EN, PT | SP, CA, GL, EU | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English, Spanish (Castilian), Portuguese, Catalan (Valencian), Galician, Basque; Character encoding: UTF-8 | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English, Spanish (Castilian), Portuguese, Catalan (Valencian), Galician, Basque; Character encoding: UTF-8 | OS-independent |
| MS | U-Compare Lemmatisation service | EN | FR, RO | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English, French, Romanian; Character encoding: UTF-8 | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English, French, Romanian; Character encoding: UTF-8 | OS-independent |
| MS | U-Compare Syntactic Parsing Service | EN | | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English; Character encoding: UTF-8 | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English; Character encoding: UTF-8 | OS-independent |
| MS | Tokenizing, Tagging, Lemmatizing and Chunking free running texts | EN | RO, FR | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: Romanian, English, French; Character encoding: UTF-8 | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: Romanian, English, French; Character encoding: UTF-8; Annotation type: Lemmatization, Morphosyntactic Annotation - Pos Tagging, Segmentation, Syntactic Annotation - Shallow Parsing; Annotation format: text output with one token per line and annotations separated by tab; Tagset: http://nl.ijs.si/ME/V3/msd/html/; Segmentation level: Sentence, Word | OS-independent |
| | TreeTagger [adaptable to any language if a lexicon and a manually tagged training corpus are available] | DE, EL, EN, PT | | No | text | text | OS-independent |
| MS | UIMA/U-Compare Apertium POS Tagger | EN, PT | EU, CA, GL, SP | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English, Spanish, Portuguese, Catalan, Galician, Basque; Annotation type: Structural Annotation; Segmentation level: Word | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English, Spanish, Portuguese, Galician, Catalan, Basque; Annotation type: Morphosyntactic Annotation - Pos Tagging; Segmentation level: Word | OS-independent |
| MS | UIMA/U-Compare OpenNLP POS Tagger | EN | | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English; Annotation type: Segmentation; Segmentation level: Word | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English; Annotation type: Morphosyntactic Annotation - Pos Tagging; Segmentation level: Word | OS-independent |
| MS | ACOPOST - A Collection of POS Taggers | | | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language | Media type: Text; Resource type: Corpus; Modality: Written Language | |
| MS | LX-Tagger | PT | | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language | Media type: Text; Resource type: Corpus; Modality: Written Language; Segmentation level: Word | Linux |
| ILSP/SL | ILSP Feature-based multi-tiered POS Tagger | EL | | Yes | XCES document with sentence and token boundaries recognised | XCES document with POS tags assigned to each token | OS-independent (Web service) |
| Stanford Univ. | Stanford POS Tagger | DE, EN | FR, AR, ZH | | | | |
| MS | STEPP Tagger | EN | | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English; Character encoding: UTF-8; Annotation type: Structural Annotation; Segmentation level: Sentence, Word | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English; Character encoding: UTF-8; Annotation type: Morphosyntactic Annotation - Pos Tagging; Segmentation level: Word | Unix |
| Universität des Saarlandes | TnT | DE, EN | SV | No | | | |
| ILSP/SL | FBT part-of-speech tagger available as a web service | EL | | | | | |
| MS | PANTERA | | | No | Media type: Text; Resource type: Language Description; Modality: Written Language | Media type: Text; Resource type: Lexical Conceptual Resource; Modality: Written Language | Linux |
| MS | GENIA Tagger | EN | | Yes | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English; Character encoding: UTF-8; Annotation type: Structural Annotation; Segmentation level: Sentence, Word | Media type: Text; Resource type: Corpus; Modality: Written Language; Language: English; Character encoding: UTF-8; Annotation type: Lemmatization, Morphosyntactic Annotation - Pos Tagging, Semantic Annotation - Named Entities, Syntactic Annotation - Shallow Parsing; Segmentation level: Phrase, Word | Unix |
| Group, http://phrasys.net/ | Qtag (v.1.0) | √ √ | | No | | | |
| UPF/SL | freeling3_tagging | EN, PT | GL, IT, RU, SP, CY, AST, CA | | plain text | word, lemma, tag, probability, word-char-start and word-char-end, all tab separated | |
| Linguatec/SL | LT-POS-Defaulter | DE, EN | | | file with a list of words | file with list of words, each with textform and POS + probability (for German and English) | |
| Linguatec/SL | LT-Decomposer | DE | | | file containing a wordlist | file containing: input-form - lemma - POS - decomposition | |
| UPF/SL | iula_tagger | EN | SP, CA | | Obligatory inputs: plain text (or txt file); language (lang): es (Spanish), ca (Catalan), en (English). Optional inputs: encoding: UTF-8 (default) or ISO-8859-1; keeptags: mark this option if your text has SGML tags and you want to preserve them | Output_form: treetagger (default) or iulact | |
| UPF/SL | berkeley_tagger | √ √ | FR | | | | |
| UPF/University of Cambridge | tpc_rasp | EN | | | string or text file | | |
| UPF/SL | iula_preprocess | EN | CA, SP | | Obligatory inputs: plain text (or txt file); language (lang): es (Spanish), ca (Catalan), en (English). Optional inputs: encoding: UTF-8 (default) or ISO-8859-1 | | |
In addition, hunmorph and Mmorph perform morphological analysis for a number of
languages, and morpha (Minnen, G., J. Carroll and D. Pearce) performs morphological analysis
for English.
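Several taggers in Table 3 emit one token per line with tab-separated annotations; the output listed for freeling3_tagging, for example, is word, lemma, tag, probability, word-char-start and word-char-end. The sketch below reads such a six-column listing into typed records. It is an assumption-laden illustration rather than a client for any of the listed services: the `Token` type, the convention that a blank line ends a sentence, and the sample values are all hypothetical.

```python
from typing import List, NamedTuple

class Token(NamedTuple):
    word: str
    lemma: str
    tag: str
    probability: float
    start: int  # character offset where the word starts
    end: int    # character offset where the word ends

def parse_tagged_output(output: str) -> List[List[Token]]:
    """Parse tab-separated tagger output into a list of sentences.

    Assumes six columns per line and a blank line between sentences.
    """
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in output.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        word, lemma, tag, prob, start, end = line.split("\t")
        current.append(Token(word, lemma, tag, float(prob), int(start), int(end)))
    if current:
        sentences.append(current)
    return sentences

# Invented sample in the assumed layout (values are illustrative only).
SAMPLE_OUTPUT = "The\tthe\tDT\t1.0\t0\t3\ncat\tcat\tNN\t0.9\t4\t7\n\nIt\tit\tPRP\t1\t8\t10\n"

if __name__ == "__main__":
    for sentence in parse_tagged_output(SAMPLE_OUTPUT):
        print([(t.word, t.lemma, t.tag) for t in sentence])
```

Lines with a different number of columns raise a `ValueError`, which makes deviations from the assumed layout visible instead of silently misaligning fields.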
Lemmatizers
Eleven (11) lemmatizers have been identified, 4 of which are available through META-SHARE,
while six (6) are provided as a service (SoapLab). Their distribution for the four project
languages is: a) 3 for German, b) 1 for Greek, c) 7 for English, and d) 2 for Portuguese.
Table 4: Lemmatizers for the four project languages
Source Name DE
EL
EN
PT
LD Input Output OS
MS PELCRA EN Lemmatizer √
Media type: Text Resource type: Lexical Conceptual Resource Language: English variety: en-gb Mime type: text/plain Character encoding: UTF - 8 Segmentation level: Word
Media type: Text Resource type: Lexical Con-ceptual Resource Language: English — variety: en-gb Mime type: text/plain Character encoding: UTF - 8 Annotation type: Morphosyn-tactic Annotation - Pos Tag-ging Annotation format: tags Tagset: http://www.natcorp.ox.ac.uk/docs/c5spec.html Segmentation level: Word
OS-Inde-pen-dent
MS Lemmatizer for Portu-guese
√
Media type: Text Resource type: Corpus Modality: Written Language Language: Portuguese
Media type: Text Resource type: Corpus Modality: Written Language Language: Portuguese
OS-Inde-pen-dent
ILSP/SL ILSP Lem-matizer √ XCES document with with
POS-tagged tokens XCES document with lemmas assigned to each token
OS-Inde-pen-dent
MS FORMA
Media type: Text Resource type: Corpus Modality: Written Language Annotation type: Other Annotation format: text/plain Segmentation level: Word
Media type: Text Resource type: Corpus Modality: Written Language Annotation type: Lemmatiza-tion, Morphosyntactic Annota-tion - Pos Tagging Segmentation level: Word
Linux, Mac-OS, Win-dows
MS U-Compare Lemmatisa-tion service
√ Yes
Media type: Text Resource type: Corpus Modality: Written Language Language: English, French, Romanian Character encoding: UTF - 8
Media type: Text Resource type: Corpus Modality: Written Language Language: English, French, Romanian Character encoding: UTF - 8
OS-Inde-pen-dent
UPF/SL freeling3_tagging
√ √ plain text word, lemma, tag, probability, word-char-start and word-char-end all tab separated
Lin-guatec/SL
LT-POS-Defaulter √ √ file with a list of words
file with list of words, each with textform and POS + probability For German and English
Lin-guatec/SL
LT-Decomposer √ file containing a wordlist file containing: inputform -
lemma - POS - decomposition
Lin-guatec/SL
LT-Lemmatiser √ √
UPF/ University of Cam-bridge
tpc_rasp √ can be string or text file
UPF/SL iu-la_preprocess
√
Obligatory Inputs: -Plain text (or txt file) -Language (lang) es: Spanish ca: Catalan
en: English Option Inputs -Encoding UTF-8 (default) or ISO-8859-1
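The freeling3_tagging entry above describes its output as tab-separated fields: word, lemma, tag, probability, word-char-start and word-char-end. A minimal sketch of reading such output is given below. The function and class names are hypothetical, the sample line is invented for illustration, and the treatment of blank lines as separators is an assumption rather than documented behaviour of the service.

```python
from typing import List, NamedTuple


class TaggedToken(NamedTuple):
    """One output row, in the field order listed in the inventory table."""
    word: str
    lemma: str
    tag: str
    probability: float
    char_start: int
    char_end: int


def parse_tagged_output(text: str) -> List[TaggedToken]:
    """Parse tab-separated tagger output into TaggedToken records.

    Assumption: blank lines carry no token and are skipped.
    """
    tokens = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 6:
            raise ValueError(f"line {line_no}: expected 6 fields, got {len(fields)}")
        word, lemma, tag, prob, start, end = fields
        tokens.append(TaggedToken(word, lemma, tag, float(prob), int(start), int(end)))
    return tokens


# Invented sample line, for illustration only.
sample = "gatos\tgato\tNCMP000\t0.98\t4\t9\n"
for token in parse_tagged_output(sample):
    print(token.word, token.lemma, token.tag)
```

Keeping the character offsets alongside each token makes it possible to map annotations back onto the original plain text.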
Parsers
Twenty-three (23) parsers have been identified, 12 of which are available through META-SHARE. Their distribution for the four project languages is: a) 3 for German, b) 3 for Greek, c) 12 for English, and d) 6 for Portuguese.
Table 5: Parsers for the four project languages
Source Name DE
EL
EN
PT
Other lang.
LD Input Output OS
MS EngGram Constraint Grammar Parser for English
√ Yes
Media type: Text Language: English
MS GerGram Constraint Grammar Parser for German
√ Yes
Media type: Text Language: German
Media type: Text Language: German
MS PALAVRAS Constraint Grammar Parsers for Portuguese
√ Yes
Media type: Text Language: Portuguese
Media type: Text Language: Portuguese
Leonel F. de Alencar Donatus Parsing Tools for Portuguese
√
MS U-Compare Syntactic Parsing Service
√ Yes
Media type: Text Resource type: Corpus Modality: Written Language Language: English Character encoding: UTF-8
Media type: Text Resource type: Corpus Modality: Written Language Language: English Character encoding: UTF-8
OS-Independent
MS MSTParser Yes
Media type: Text Resource type: Corpus Modality: Written Language Annotation type: Other Segmentation level: Word
Media type: Text Resource type: Corpus Modality: Written Language
Linux, Mac-OS, Windows
MS Dizer √ Yes
Media type: Text Language: Portuguese
Media type: Text Resource type: Language Description Language: Portuguese Annotation type: Discourse Annotation
OS-Independent, Linux, Mac-OS, Other, Unix, Windows
MS MaltParser (MaltOptimizer, for automatic optimization)
√ SV, FR
Yes
Media type: Text Language: Swedish, English, French
Linux, Mac-OS, Windows
MS Enju parser √ Yes
Media type: Text Resource type: Corpus Modality: Written Language Language: English Character encoding: UTF-8
Media type: Text Resource type: Corpus Modality: Written Language Language: English
OS-Independent
MS Lexicalized Parsing √ RO
Yes
Media type: Text Resource type: Corpus Modality: Written Language Language: Romanian, English Character encoding: UTF-8
Media type: Text Resource type: Corpus Modality: Written Language Language: Romanian, English Character encoding: UTF-8 Annotation type: Lemmatization, Morphosyntactic Annotation - Pos Tagging, Segmentation, Syntactic Annotation - Shallow Parsing, Syntacticosemantic Annotation - Links Annotation format: text output with one token per line and annotations separated by tab Tagset: http://nl.ijs.si/ME/V3/msd/html/ Segmentation level: Sentence, Word
OS-Independent
OpenNLP CCG Library
CCG parser
MS VISL - multilingual dependency parser
Yes
University of Lisbon LXDepParser √
University of Alberta MINIPAR √
Yes
Linux, Solaris or Windows 95/98
ILSP/SL
ILSP Dependency parser √
Yes
text text
OS-Independent (Web service)
Stanford Univ.
The Stanford Parser: A statistical parser
√ √ √ ZH, AR, IT, BG
NO text text
OS-Independent
MS CSTParser √ Yes
Media type: Text Language: Portuguese
Media type: Text Resource type: Language Description Language: Portuguese Annotation type: Discourse Annotation
OS-Independent, Linux, Mac-OS, Other, Unix, Windows
MS Spejd NO
Media type: Text Resource type: Language Description Modality: Written Language
Media type: Text Resource type: Lexical Conceptual Resource Modality: Written Language
Linux, Windows
Presemt Phrase Model Generator Module (PMG)
√ √ √ ZH, IT, NO
NO text text
OS-Independent
UPF/SL freeling3_dependency √
AST, CA, GL, SP
Plain text Freeling output format, XML, XML CQP ready
UPF/SL berkeley_parser √ √ FR
UPF/University of Cambridge tpc_rasp √ string or text file
UPF/SL bohnet_parser √ SP
In addition, there are 9 chunking tools, 7 available through META-SHARE and 2 in the form
of web services.
Source Name DE
EL
EN
PT
Other lang.
LD Input Output OS
MS MBT – Memory-Based Tagger-Generator and Tagger
NO
Media type: Text Resource type: Corpus Modality: Spoken Language, Written Language Character encoding: UTF-8 Annotation type: Morphosyntactic Annotation - Pos Tagging Annotation format: .txt Segmentation level: Word
Media type: Text Resource type: Corpus Modality: Spoken Language, Written Language Character encoding: UTF-8 Annotation type: Morphosyntactic Annotation - Pos Tagging Segmentation level: Word
Unix
MS Shallow Processing with Unification and Typed Feature Structures
√ √
FR, IT, Dutch, SP, PO, CS, ZH, JA
NO
MS TextPro √ IT Yes
Media type: Text Language: English, Italian
Media type: Text Resource type: Lexical Conceptual Resource Modality: Written Language Language: English, Italian Annotation type: Lemmatization, Morphosyntactic Annotation - Pos Tagging, Semantic Annotation - Named Entities Segmentation level: Sentence, Word
Linux, Mac-OS
MS YamCha: Yet Another Multipurpose Chunk Annotator
NO
Media type: Text Resource type: Corpus Modality: Written Language Annotation type: Morphosyntactic Annotation - Pos Tagging Annotation format: Plain text in column format. Segmentation level: Word
Media type: Text Resource type: Corpus Modality: Written Language Segmentation level: Word
Linux, Mac-OS, Windows
MS Tokenizing, Tagging, Lemmatizing and Chunking free running texts
√ RO, FR
Yes
Media type: Text Resource type: Corpus Modality: Written Language Language: Romanian, English, French Character encoding: UTF-8
Media type: Text Resource type: Corpus Modality: Written Language Language: Romanian, English, French Character encoding: UTF-8 Annotation type: Lemmatization, Morphosyntactic Annotation - Pos Tagging, Segmentation, Syntactic Annotation - Shallow Parsing Annotation format: text output with one token per line and annotations separated by tab Tagset: http://nl.ijs.si/ME/V3/msd/html/ Segmentation level: Sentence, Word
OS-Independent
ILSP/SL ILSP Chunker √
Yes
XCES document with POS-tagged and lemmatized tokens
standoff document with chunk annotations
OS-Independent
MS LX-Chunker √ NO
Media type: Text Resource type: Corpus Modality: Written Language Language: Portuguese Mime type: text/plain Character encoding: ISO-8859-1
Media type: Text Resource type: Corpus Modality: Written Language Language: Portuguese Mime type: text/plain Annotation type: Segmentation Segmentation level: Sentence
Linux
MS hunner NO
Media type: Text Resource type: Language Description Modality: Written Language
UPF/SL
freeling3_parsed √
AST, CA, GL, SP
Plain text Freeling output format, XML, XML CQP ready
NE Recognizers
Thirteen (13) NE recognizers have been identified, 8 of which are available through META-SHARE. Their distribution for the project languages is as follows: a) 2 for German, b) 1 for Greek, c) 5 for English, and d) 2 for Portuguese.
Table 6: NE Recognizers for the four project languages
Source Name DE
EL
EN
PT
Other-lang.
LD Input Output OS
MS NERanka: Named Entity Recognition and Annotation Tool
NO Media type: Text Media type: Text Windows
MS U-Compare E-txt2DB: Giving structure to unstructured data
√ NO
Media type: Text Resource type: Lexical Conceptual Resource Modality: Written Language Mime type: txt Character encoding: UTF-8
Media type: Text Resource type: Lexical Conceptual Resource Language: English Character encoding: UTF-8 Annotation type: Semantic Annotation - Named Entities
OS-Independent
STANFORD
Stanford Named Entity Recognizer (NER)
√ √
Universität Heidelberg
German Named Entity Recognition (NER)
√
University of Lisbon
LXNer √
ILSP/SL ilsp_nerc √
MS MBT – Memory-Based Tagger-Generator and Tagger
NO
Media type: Text Resource type: Corpus Modality: Spoken Language, Written Language Character encoding: UTF-8 Annotation type: Morphosyntactic Annotation - Pos Tagging Annotation format: .txt Segmentation level: Word
Media type: Text Resource type: Corpus Modality: Spoken Language, Written Language Character encoding: UTF-8 Annotation type: Morphosyntactic Annotation - Pos Tagging Segmentation level: Word
Unix
MS TextPro √ IT Yes
MS YamCha: Yet Another Multipurpose CHunk Annotator
NO
Media type: Text Resource type: Corpus Modality: Written Language Annotation type: Morphosyntactic Annotation - Pos Tagging Annotation format: Plain text in column format. Segmentation level: Word
Media type: Text Resource type: Corpus Modality: Written Language Segmentation level: Word
Linux, Mac-OS, Windows
MS U-Compare Named Entity Recognition service
Yes
Media type: Text Resource type: Corpus Modality: Written Language Language: English Character encoding: UTF-8
Media type: Text Resource type: Corpus Modality: Written Language Language: English Character encoding: UTF-8
OS-Independent
MS U-Compare Species Disambiguation Service
√ Yes
Media type: Text Resource type: Corpus Modality: Written Language Language: English Character encoding: UTF-8
Media type: Text Resource type: Corpus Modality: Written Language Language: English Character encoding: UTF-8
OS-Independent
MS hunner NO
Media type: Text Resource type: Language Description Modality: Written Language
UPF/SL anonymizer √ √
AST, CA, GL, IT, RU, SP, CY
Aligners
Thirteen (13) aligners have been identified, 4 of which are available through META-SHARE. The following table presents text aligners with information on language pairs and level of alignment (sentence, phrase or word).
Table 7: Aligners for the four project languages
Source Name DE
EL
EN
PT
Other lang.
LD Input Output OS
MS Coral Corpus Aligner (bilingual corpora alignment)
NO
Media type: Text Resource type: Language Description Modality: Written Language
Media type: Text Resource type: Language Description Modality: Written Language
OS-Independent
MS Lingua-Align (Syntactic tree alignment, word alignment)
NO Media type: Text
Robert C. Moore Moore's aligner
Johns Hopkins University (CLSP/JHU)
GIZA++
Linux, Irix and SUNOS systems
PRESEMT Presemt Phrase Aligner Module (PAM) (word and phrase alignment)
√ √ √ ZH, IT, NO
NO text text
OS-Independent
MS hunalign (sentence alignment) NO
Media type: Text Resource type: Language Description Modality: Written Language
Media type: Text Resource type: Lexical Conceptual Resource Modality: Written Language
Linux, Windows
MS ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora
NO
Media type: Text Resource type: Corpus Modality: Written Language Character encoding: UTF-8
Media type: Text Resource type: Corpus, Lexical Conceptual Resource Modality: Written Language Character encoding: UTF-8 Annotation type: Alignment, Translation Segmentation level: Sentence, Word Group
Windows
Europarl (Univ. of Edinburgh)
Europarl tools √ √ √ √ IT, FR, SV, etc
DCU/SL gma √ √
ES, FR, IT
DCU/SL bsa
NO
DCU/SL anymalign
NO
Tokenised one-sentence-per-line text for source and target languages.
phrase table in Moses format
DCU/SL/OpenMaTrEx
chunk_aligner √ √ √
ES, FR, GA, IT, CS, CA
DCU/SL berkeley_aligner √ √ FR
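The anymalign entry above lists its output as a phrase table in Moses format. As an illustration, the sketch below reads one line of such a table, assuming the usual Moses layout in which fields are separated by `|||` and the third field holds whitespace-separated scores. The function name and the sample line are hypothetical, and any further fields (such as alignments or counts) are simply ignored here.

```python
from typing import List, Tuple


def parse_phrase_table_line(line: str) -> Tuple[str, str, List[float]]:
    """Split one phrase-table line into source phrase, target phrase and scores.

    Assumption: fields are delimited by '|||' and the first three fields are
    source phrase, target phrase and scores; extra fields are ignored.
    """
    fields = [field.strip() for field in line.rstrip("\n").split("|||")]
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}")
    source, target = fields[0], fields[1]
    scores = [float(value) for value in fields[2].split()]
    return source, target, scores


# Invented sample line, for illustration only.
example = "das Haus ||| the house ||| 0.5 0.25 0.5 0.25"
print(parse_phrase_table_line(example))
```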
Toolkits, Platforms & Workbenches
In addition to text engineering platforms that are well known and widely used for many languages, such as GATE (http://gate.ac.uk/), the following table presents suites of tools (toolkits, workbenches, web services and platforms) consisting of various components, ranging from lemmatizers and POS taggers up to semantic role labellers.
Table 8: Collections of tools (platforms, workbenches, toolkits and web services)
Source Name DE
EL
EN
PT
LD Input Output OS
MS U-Compare Platform √
NO
Media type: Text Resource type: Corpus Modality: Written Language
Media type: Text Resource type: Corpus Modality: Written Language Annotation type: Discourse Annotation, Lemmatization, Morphosyntactic Annotation - Pos Tagging, Semantic Annotation, Semantic Annotation - Certainty Level, Semantic Annotation - Entity Mentions, Semantic Annotation - Events, Semantic Annotation - Named Entities, Semantic Annotation - Semantic Classes, Semantic Annotation - Semantic Relations, Semantic Annotation - Semantic Roles, Semantic Annotation - Word Senses, Stemming, Syntactic Annotation - Shallow Parsing, Syntactic Annotation - Subcategorization Frames, Syntacticosemantic Annotation - Links
OS-Independent
MS U-Compare Workbench
NO
Media type: Text Resource type: Corpus Modality: Written Language Annotation type: Discourse Annotation, Lemmatization, Morphosyntactic Annotation - Pos Tagging, Semantic Annotation, Semantic Annotation - Certainty Level, Semantic Annotation - Entity Mentions, Semantic Annotation - Events, Semantic Annotation - Named Entities, Semantic Annotation - Polarity
Media type: Text Resource type: Corpus Modality: Written Language Annotation type: Discourse Annotation, Lemmatization, Morphosyntactic Annotation - Pos Tagging, Semantic Annotation, Semantic Annotation - Certainty Level, Semantic Annotation - Entity Mentions, Semantic Annotation - Events, Semantic Annotation - Named Entities, Semantic Annotation - Polarity, Semantic Annotation - Semantic Classes, Semantic Annotation - Semantic Relations, Semantic Annotation - Semantic Roles, Semantic Annotation - Word Senses, Syntactic Annotation - Shallow Parsing, Syntactic Annotation - Subcategorization Frames, Syntactic Annotation - Treebanks, Translation
OS-Independent
MS Uplug √ √ √ NO
Media type: Text Resource type: Corpus Modality: Written Language Language: Catalan; Valencian, Czech, Danish, English, French, Hungarian, German, Portuguese, Russian, Slovenian, Spanish; Castilian, Swedish
Media type: Text Resource type: Corpus Modality: Written Language
Linux, Mac-OS, Windows
MS LX-Service √ Yes
Media type: Text Resource type: Corpus Modality: Written Language
Media type: Text
OS-Independent
Stanford Univ.
Stanford CoreNLP √
Yes
The Apache Software Foundation
OPEN NLP
MS Heart of Gold √ NO
MS Shallow Processing with Unification and Typed Feature Structures
√ √ NO
ILSP/SL
ILSP web services √
Input is either plain text or an XCES document with text segmented in paragraphs.
The output by default is an XCES document.
Additional Tools
Several other tools that might prove useful are listed below.
Table 9: Additional tools which can be used for processing a corpus.
Source Name Short Description
MS IULA paradigma Web Service
As a general rule, IULA SOAP web services accept input data either as 'direct string' data or as URL. Output results are given both as 'direct string' data and as URL. For large outputs, the 'direct string' option is disabled.
MS PELCRA Language Detector
The PELCRA language detector is a Java tool for detecting the language of an arbitrary stretch of text developed by the PELCRA team at the University of Łódź, available under the GPL licence. The first version of this tool only supports binary classification scenarios in which one wants to detect one of two possible languages. A model for distinguishing between Polish and English is provided with the software. The default language detector provided in the package uses a binary support vector machine classifier implementation.
MS Hunspell
Hunspell is the spell checker of LibreOffice, OpenOffice.org, Mozilla Firefox 3 & Thunderbird, Google Chrome, and it is also used by proprietary software packages, like Mac OS X, InDesign, memoQ, Opera and SDL Trados. Main features: Extended support for language peculiarities; Unicode character encoding, compounding and complex morphology. Improved suggestion using n-gram similarity, rule and dictionary based pronunciation data. Morphological analysis, stemming and generation. C++ library under GPL/LGPL/MPL tri-license. Interfaces and ports: Enchant (Generic spelling library from the Abiword project), XSpell (Mac OS X port, but Hunspell is part of the OS X from version 10.6 (Snow Leopard), and now it is enough to place the Hunspell dictionary files into ~/Library/Spelling or /Library/Spelling for spell checking), Delphi, Java (JNA, JNI), Perl, .NET, Python, Ruby, UNO, RichEdit.
MS jExSLI java Extremely Simple Language Identifier is a text language identifier. An initial list of languages contains 20 commonly used languages and can be easily extended.
MS Moses a factored phrase-based, hierarchical and syntax decoder for statistical machine translation
MS LetsMT! Resource Repository Software
A software package for building and maintaining resources for statistical machine translation such as parallel corpora and monolingual corpora
MS MaltOptimizer A system for automatic optimization of MaltParser
MS Collocation and Term Extractor
CollTerm is a language independent tool for collocation and term extraction. It collects collocation and term candidates based on five different co-occurrence measures for multiword units or distributional differences from a large representative corpus by application of TF-IDF. The language dependent part consists of a stop-word list and a list of MWU MSD-patterns, which can also be coded with regular expressions. The first version of this application is available as an integral part of ACCURAT Toolkit (http://www.accurat-project.eu/index.php?p=accurat-toolkit).
MS ComLinToo: The Computational Linguistics Toolset
The Computational Linguistics Toolset is a set of tools for computational linguistics. It contains re-usable code for cleaning, splitting, refining, and taking samples from corpora (ICE, Penn, and a native one), for tagging them using the TnT-tagger, for doing permutation statistics on N-grams, etc.
SRI (and Johns Hopkins University)
SRILM - The SRI Language Modeling Toolkit
SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation.
MS TextMatch
TextMatch is a web service that combines language independent ways of computing the similarity between two documents using powerful linguistic tools, and provides different measures of document similarity. It compares documents in different formats (such as Microsoft document formats, OpenDocument Format, Portable Document Format, Electronic Publication Format, HyperText Markup Language, Rich Text Format, Text formats). TextMatch recognizes the language of the document using a two-stage language detection system. Specific language tokenizers, lemmatizers and other analyzers are utilized for English, Bulgarian, German, French, and Russian. A language independent comparison algorithm is used if one of the uploaded documents is in another language.
MS Treat
Treat is a toolkit for natural language processing and computational linguistics in Ruby. It provides support for tasks such as document retrieval, text chunking, segmentation and tokenization, natural language parsing, part-of-speech tagging, keyword extraction and named entity recognition.
MS U-Compare Paragraph-Breaking Service
Web service created by exporting UIMA-based workflow from the U-Compare text mining system. Functionality: Identifies paragraphs in plain text. Tools in workflow: MLRS Paragraph Splitter (University of Malta). The licence provided covers the web service only. Tools used to create the workflow may have their own licences.
MS Blacklist Classifier A language identifier for closely related languages
MS Language Identifier
SOAP Web Service which identifies the language of a written text. It can identify up to 57 languages.
http://cwb.sourceforge.net/
IMS Corpus Workbench (CWB), (Version 3.0)
Corpus querying system
Stanford Univ.
Stanford Classifier
A machine learning classifier, directed at text categorization. A conditional loglinear classifier (a.k.a. a maximum entropy or multiclass logistic regression model).
Stanford Univ.
Tregex, Tsurgeon, and Semgrex
A Tgrep2-style utility for matching patterns in trees and a tree-transformation utility built on top of this matching language. A similar utility for matching patterns in dependency graphs.
Stanford Univ.
Stanford TokensRegex A tool for matching regular expressions over tokens.
Stanford Univ.
Stanford Temporal Tagger (SUTime)
A rule-based temporal tagger for English text.
MS GNU Aspell GNU Aspell is a Free and Open Source spell checker designed to eventually replace Ispell. Used either as a library or as an independent spell checker. Aspell can also easily check documents in UTF-8 without having to use a special dictionary.
MS Bibliša: Aligned Collection Search Tool
This tool is a web application for searching digital libraries of articles from bilingual e-journals in the form of TMX documents, as well as for developing new bilingual lexical resources based on this search. It is based on previously developed components for LeXimir (work station for lexical resources) and VebRanka (web query expansion tool) and uses various lexical resources: Wordnets, e-dictionaries and terminological lists.
MS Docent A document-Level Local Search Decoder for Phrase-Based SMT
MS TREFL – Translation Reference Library
TREFL is a portable, multifunctional database management application for Windows, having the combined characteristics of both a Translation Memory System (bilingual databases, fuzzy matching, concordance, alignment, importing and exporting translation memories, etc.) and those of an Internet/Desktop Search Engine (searching, like with Google search, all these words, this exact phrase, I’m feeling Lucky, etc.), plus some elements of semantic search. It is intended to be used as a simple, versatile, portable, effective and customizable reading, writing and translation aid tool capable of managing very large databases.
MS NERosetta
NERosetta is a multiuser web application to facilitate retrieval and comparison of named entities in a single or parallel texts. The main named entity categorization is realized according to the Quaero annotation recommendation and provides a user with approximately 50 different search options. Registered users have an extra possibility to share annotated resources (in XML format) and annotation schemas for the set of languages as well as to manage their own resources and schemas. The initial version supports Stanford NER 3 and Stanford NER 7.
4 Conclusions
The tools and web services, together with the monolingual, multilingual and lexical/conceptual datasets, offer a good starting point for the machine translation research, development and evaluation infrastructure. The following figures summarise the counts of all the tools identified per language and type⁸.
Figure 1: Tools for processing German
Figure 2: Tools for processing Greek
Figure 3: Tools for processing English
Figure 4: Tools for processing Portuguese
⁸ Some tools are language independent and can cover additional languages.
Currently we are experimenting with providing a mechanism within the QT21 repository for processing datasets with appropriate natural language processing tools, provided as SOAP
web services. The idea is that, for each stored resource (monolingual or parallel corpus), an extra “process” option is integrated in the QT21 repository, alongside the current options of downloading a dataset and editing its metadata-based description. When the user selects “process”, a list of processing tools for each relevant task will be provided (for the given language and resource type). As soon as the user selects a tool, the server will invoke a service that will send the corpus to the specific tool's web service for processing. Because processing takes a long time given the large size of most of the resources, the system will inform the user about the requested job via the messaging service of the platform. When the processing has been completed, the new (annotated) resource will be automatically stored in the repository, the inventory will be automatically updated, and the user will be informed. If the user requests to process a resource with a specific tool, and this resource has already been processed by that tool, then the system will simply forward the user to the processed resource that already exists in the repository.
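The intended behaviour described above can be illustrated with a small sketch. It is not the QT21 implementation: the class, method and identifier names are hypothetical, the tool is modelled as a plain Python function instead of a SOAP web service, and the job runs synchronously, whereas the real system would run long jobs in the background and notify the user through the platform's messaging service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Repository:
    """Toy stand-in for a resource repository with a 'process' option."""
    resources: Dict[str, str] = field(default_factory=dict)               # resource id -> content
    processed: Dict[Tuple[str, str], str] = field(default_factory=dict)   # (resource id, tool) -> new id
    messages: List[str] = field(default_factory=list)                     # notifications to the user

    def process(self, resource_id: str, tool_name: str, tool: Callable[[str], str]) -> str:
        """Run a tool on a stored resource and return the id of the annotated resource."""
        key = (resource_id, tool_name)
        # Already processed with this tool: forward to the existing result.
        if key in self.processed:
            return self.processed[key]
        self.messages.append(f"Job started: {tool_name} on {resource_id}")
        annotated = tool(self.resources[resource_id])
        # Store the new (annotated) resource and update the inventory.
        new_id = f"{resource_id}.{tool_name}"
        self.resources[new_id] = annotated
        self.processed[key] = new_id
        self.messages.append(f"Job finished: {new_id} stored in the repository")
        return new_id


def toy_tokenizer(text: str) -> str:
    """Placeholder tool: one whitespace-separated token per line."""
    return "\n".join(text.split())


repo = Repository(resources={"corpus-1": "a small example corpus"})
first = repo.process("corpus-1", "tokenizer", toy_tokenizer)
second = repo.process("corpus-1", "tokenizer", toy_tokenizer)  # forwarded, not re-run
print(first == second, len(repo.messages))
```

Keying the lookup on the pair of resource and tool is what lets a repeated request be forwarded to the stored result instead of triggering a second long-running job.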
A first implementation of such functionality is under development. Initially, it will provide processing tools for Greek and English, compatible in terms of input/output with XML CES, and selected from the tools and services developed in the framework of PANACEA (http://panacea-lr.eu).