Vancouver, 14 years ago…
On November 18th 2000, inspired by a previous presentation from Richard Ishida, Steven Forth organized a meeting discussing localization-specific tags in XML. (Some of the ideas are outlined in this article in Multilingual of Oct/Nov 2000: http://goo.gl/tziuj4)
This led to the Internationalization Tag Set (ITS): ITS 1.0 in 2007 and ITS 2.0 in 2013
Some ideas take time to mature
Often, you have to plant seeds early to reap the benefits later. You can use ITS in HTML5 today because of the seeds planted 14 years ago.
So, let’s continue planting a few more seeds for tomorrow’s standards…
New context
Many more Web services available (e.g. NLP-related functions, text analytics, MT, terminology, etc.)
Simple searches for Web services on the http://www.programmableweb.com/ site:
– 145+ entries for “translation”
– And 45+ for “text analytics”
New context (cont’)
Semantic Web is finally reaching our industry (with more linguistic resources and linked data available)
– Can help with disambiguation, terminology, context information, quality measurements, etc.
– Linked open data, RDF, SPARQL, etc.
New context (cont’)
• Applications for mobile platforms
• Internet of Things
• Crowd-sourcing
• Component-based workflows
Evolution in both what is localized and how it is localized
New needs and requirements
• Not just “files”
• Finer granularity: unit, segment, fragment
• Various serializations (i.e. “file formats”)
• Not just serialization, need abstract concepts
• Annotation mechanism
• For different development environments (desktop, servers, web applications, etc.)
ITS 2.0
• Internationalization Tag Set
– Defines concepts (the data categories)
– Annotation-oriented
• Can be used in various file formats (HTML5, DITA, DocBook, other XML, etc.)
• Good way to carry text analysis results
• More on ITS later…
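As a rough illustration of ITS 2.0 carrying text analysis results in HTML5, the sketch below collects elements that have `its-*` attributes or the HTML5 `translate` attribute. The sample markup, the `ItsCollector` class and the `collect_its` helper are assumptions made for this example; the parser is deliberately minimal and does not handle void elements or inherited annotations.

```python
from html.parser import HTMLParser

# Hypothetical sample: an HTML5 fragment with ITS 2.0 local annotations.
SAMPLE = (
    '<p>Welcome to <span its-ta-ident-ref="http://dbpedia.org/resource/Vancouver" '
    'its-ta-confidence="0.9" '
    'its-annotators-ref="text-analysis|http://example.org/tools/ta">Vancouver</span>, '
    'home of <span translate="no">Acorn</span>.</p>'
)


class ItsCollector(HTMLParser):
    """Collect (tag, text, annotations) for elements with ITS-related attributes."""

    def __init__(self):
        super().__init__()
        self.found = []   # finished annotated elements
        self._open = []   # stack of (tag, annotations, text parts)

    def handle_starttag(self, tag, attrs):
        its = {k: v for k, v in attrs if k == "translate" or k.startswith("its-")}
        self._open.append((tag, its, []))

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for _tag, _its, parts in self._open:
            parts.append(data)

    def handle_endtag(self, tag):
        # Simplification: assumes well-formed input with no void elements.
        if self._open:
            open_tag, its, parts = self._open.pop()
            if its:
                self.found.append((open_tag, "".join(parts), its))


def collect_its(html):
    parser = ItsCollector()
    parser.feed(html)
    parser.close()
    return parser.found


for tag, text, annotations in collect_its(SAMPLE):
    print(tag, repr(text), annotations)
```

A real consumer would also apply ITS global rules and inheritance, which this sketch leaves out.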
XLIFF 2
• Became an OASIS Standard on August 5th 2014
• Made of a Core + modules:
– Flexibility
– Can evolve and adapt
• Can integrate with other standards like ITS (e.g. good support for annotations)
• Many processing requirements pave the way to design an object model
Object Model and API
• Defined objects with an API (like HTML DOM). See http://opentag.com/data/xliffomapi/
• XLIFF (the file format) is then just one of the possible serializations of the object model.
• Allows for different granularities (e.g. unit, segment, fragment)
• Tools can work based on the object model rather than the file format
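To make the unit / segment / fragment granularity concrete, here is a minimal sketch of what such an object model could look like. It is not the API published at the URL above: the class names `Unit`, `Segment` and `Fragment`, their fields, and the annotation key are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Fragment:
    """Smallest piece of content; may carry annotations (e.g. ITS data)."""
    text: str
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class Segment:
    """A source/target pair, each side made of fragments."""
    source: List[Fragment]
    target: List[Fragment] = field(default_factory=list)

    def source_text(self) -> str:
        return "".join(fragment.text for fragment in self.source)


@dataclass
class Unit:
    """A translatable unit holding one or more segments."""
    id: str
    segments: List[Segment] = field(default_factory=list)

    def fragments(self) -> Iterator[Fragment]:
        # Walk the source content at the finest granularity.
        for segment in self.segments:
            yield from segment.source


unit = Unit("u1", [
    Segment([Fragment("Hello "),
             Fragment("Vancouver", {"taIdentRef": "http://dbpedia.org/resource/Vancouver"})]),
    Segment([Fragment(" Welcome.")]),
])

print(unit.segments[0].source_text())
print([fragment.text for fragment in unit.fragments()])
```

A tool written against objects like these never touches the file format; reading and writing XLIFF (or anything else) becomes a separate concern.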
JSON serialization
• Small chunks of data as Web service payloads
• Compact, simple, small footprint
• JSON reading/writing can be automated in many programming languages
• Works well with Web services, with document-oriented DB (like MongoDB), etc.
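A small sketch of the idea, assuming a made-up JSON shape for a single unit. The field names (`id`, `segments`, `source`, `target`) are illustrative only and do not come from any published JSON serialization of XLIFF.

```python
import json

# Hypothetical unit-level payload; the field names are assumptions.
unit = {
    "id": "u1",
    "segments": [
        {"source": "Hello world.", "target": "Bonjour le monde."},
        {"source": "Second sentence.", "target": None},
    ],
}


def to_payload(data):
    """Serialize compactly, keeping non-ASCII characters readable."""
    return json.dumps(data, ensure_ascii=False, separators=(",", ":"))


def from_payload(payload):
    return json.loads(payload)


def untranslated(data):
    """Return the source text of segments that have no target yet."""
    return [seg["source"] for seg in data["segments"] if not seg.get("target")]


payload = to_payload(unit)
print(payload)
print(untranslated(from_payload(payload)))
```

The same dictionary could be stored as-is in a document-oriented database, which is part of the appeal of this kind of payload.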
Other standardization efforts
• Translation Kits (e.g. TIPP)
• Translation Web service (e.g. TAUS Translation API)
• Quality information (e.g. MQM by QT Launchpad, TAUS DQF, etc.)
Application (Acorn)
[Diagram: the Acorn application and the services it connects to through the XOM API: SimpleTM, Open-Calais, Yahoo Analyzer, DBpedia Spotlight, TAAS, TAUS Translation (XOM API and TAUS-T API) and Translation Server (XOM API and TAUS-T API)]
Demonstration
► Direct exchange
► Re-using patterns and components ► ITS annotations, XLIFF modules
► JSON payload ► Translation API
Language Technology Integration
• Language Technology is Statistical
– Few yes-no answers
• ITS 2.0 addresses language technology integration
– Machine Translation Confidence Score
– Terminology Annotation with Confidence Score
– Automated Text Analysis for Word Sense Disambiguation with Confidence Score
– All need annotator engine identification
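The sketch below shows one way a consumer might read ITS 2.0 MT confidence scores together with the engine identification from XML. The sample document, the engine IRI and the helper names are assumptions; for brevity the `annotatorsRef` value is read from the root element only, whereas real ITS processing resolves it per element through inheritance.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"

# Hypothetical document: two MT-produced segments with confidence scores.
SAMPLE = f'''<doc xmlns:its="{ITS_NS}" its:version="2.0"
     its:annotatorsRef="mt-confidence|http://example.org/engines/mt-1">
  <seg its:mtConfidence="0.92">Some translated sentence.</seg>
  <seg its:mtConfidence="0.41">Another translated sentence.</seg>
</doc>'''


def parse_annotators(value):
    """Split an annotatorsRef value into {data category: tool IRI}."""
    pairs = (item.split("|", 1) for item in value.split())
    return {category: iri for category, iri in pairs}


def low_confidence_segments(xml_text, threshold):
    """Return (text, score, engine IRI) for segments scoring below threshold."""
    root = ET.fromstring(xml_text)
    annotators = parse_annotators(root.get(f"{{{ITS_NS}}}annotatorsRef", ""))
    engine = annotators.get("mt-confidence")
    results = []
    for element in root.iter():
        score = element.get(f"{{{ITS_NS}}}mtConfidence")
        if score is not None and float(score) < threshold:
            results.append((element.text, float(score), engine))
    return results


print(low_confidence_segments(SAMPLE, threshold=0.5))
```

Because scores are only comparable within one engine, keeping the engine IRI next to each score is what makes a threshold like this meaningful.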
Challenges in Using LT
• LT is statistical
• Quality is limited by distance between training data and job at hand
• Training Data is the Key Asset for LT
• For L10n, it's Translation Memories and Term Bases
• Interoperability Challenges for Training Data
– Discover
– Select
– Curate
– Share/Pool/Sell
– Measure Impact on Productivity
How far does TMX get us?
• TM as key source of parallel text for MT
– TMX as input for on-demand MT training, e.g. KantanMT
– TMX used in assessing the marginal benefit of MT over TM leverage, e.g. TMTPrime
• Need to care about Provenance and Quality of TM for LT training
– Avoid MT reuse
– Loss of QA annotations
– Split sentences
– Domain and terminology annotation
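As an illustration of the "avoid MT reuse" point, the sketch below extracts sentence pairs from a TMX document and skips translation units whose `creationid` marks them as machine translated. The sample TMX and the `"MT!"` marker are assumptions: TMX itself does not define how MT output is flagged, so a real pipeline would need to know each tool's convention.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical TMX sample; the second unit is flagged as MT output.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="demo" creationtoolversion="1" segtype="sentence"
          o-tmf="demo" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu creationid="translator-a">
      <tuv xml:lang="en"><seg>Save the file.</seg></tuv>
      <tuv xml:lang="fr"><seg>Enregistrez le fichier.</seg></tuv>
    </tu>
    <tu creationid="MT!">
      <tuv xml:lang="en"><seg>Close the window.</seg></tuv>
      <tuv xml:lang="fr"><seg>Fermez la fenêtre.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def human_pairs(tmx_text, src, tgt, mt_ids=frozenset({"MT!"})):
    """Return (source, target) pairs, skipping units flagged as MT output."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        if tu.get("creationid") in mt_ids:
            continue  # assumed convention for MT-produced units
        texts = {}
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also picks up text inside inline elements.
                texts[tuv.get(XML_LANG)] = "".join(seg.itertext())
        if src in texts and tgt in texts:
            pairs.append((texts[src], texts[tgt]))
    return pairs


print(human_pairs(SAMPLE_TMX, "en", "fr"))
```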
How Far Does TBX Get Us?
• Can force term translation in MT engines
– From Glossary in XLIFF or TBX in TIPP package
• MT engines struggle with morphologies
– Need rich lexical annotation
• Word Sense Disambiguation requires (lots of) annotated training corpora
• Challenge: Open Integrated Data Management of parallel text and lexically rich term bases
Linguistic Linked Data
• Open Data on the Web: W3C Semantic Web standards for data published on Web
– Fine-grained inter-linking of data “cells”: URL
– Extensible meta-data: Resource Description Framework (RDF)
– Standard Query API: SPARQL
• LIDER Project:
– Stakeholder needs for language technology and resources
– Best practices and guidelines to apply linked data
• Open data vocabularies
– Lexical-conceptual data: LEMON vocabulary
– Resource meta-data: attribution, licensing, provenance, etc.
Linguistic Linked Data for Lexical Conceptual Graphs
[Diagram: lexical-conceptual graph around the Spanish entry “red”. Recoverable labels: Form (singular [RED], plural [REDES], each with phonetic form and number); written form “red”, gender feminine; Sense, with an equivalent written form “malla”, an image, and an es-en translation between “red” and “network”.]
Linguistic Linked Open Data Cloud
[Figure: Linguistic Linked Open Data cloud diagram]
• http://linguistics.okfn.org/resources/llod/
Linked Licensed Linguistic Data
• Existing W3C Data Standards
– Data Catalogue (DCAT)
– Provenance
• Under development
– Open Annotation
– CSV linked data
• Language Specific
– Licensing
– OntoLex
– TBX-to-Ontolex
– Language Resource Meta-Data
– Publishing Best Practice
– XLIFF-to-Linked Data
– TMX-to-Linked Data
• Localization Web = Decentralised Annotated Global Translation Memory and Term Base
• Terms and translations become linkable resources
• Meta-data from L10n tool chain adds value
• Use in training Machine Translation and Text Analytics
FALCON Project
Tool Chain
• Website translation
• Translation Management
• Terminology Management
Language Technology
• Machine Translation
• Term Identification
Linked Data
• Parallel Text
• Terms: Lexical-conceptual
XLIFF + ITS 2.0
Words as Resources on the Web
The Web of Content
http://www.ex.org/obama_en.html
Barack Obama is the 44th president of the United States of America. He was first elected in 2008.
http://www.ex.org/obama_es.html
Barack Obama es el 44º presidente de los Estados Unidos de América. Ha fue electo primera vez en 2008.
The Localization Web
Translation Data
http://data.ex.org/String_0001 (Derived From http://www.ex.org/obama_en.html)
Text: “Barack Obama is the 44th president of the United States of America.” Lang: en
http://data.ex.org/String_0002 (Derived From http://www.ex.org/obama_es.html, Translated From http://data.ex.org/String_0001)
Text: “Barack Obama es el 44º presidente de los Estados Unidos de América.” Lang: es TranslatedBy: Google Translate
Terminology Data
Term: “United States of America.” Lang: en
Term: “Estados Unidos de América.” Lang: es
Translation Of
http://babelnet.org/345621
http://babelnet.org/57835
Encyclopaedic Data
http://Dbpedia.org/Page/Barak_Obama
Topic: Barack Obama Lang: en BirthDate: 1961-08-04 Spouse: Michelle Obama Residence: White House
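The translation data on this slide can be sketched as RDF-style triples. The example below uses plain Python and N-Triples output rather than an RDF library; the vocabulary namespace `http://data.ex.org/vocab#` and the property names (`derivedFrom`, `text`, `translatedFrom`, `translatedBy`) are invented for illustration and are not taken from any published vocabulary.

```python
EX = "http://data.ex.org/"
VOCAB = "http://data.ex.org/vocab#"  # hypothetical vocabulary namespace


def iri(value):
    return f"<{value}>"


def lit(text, lang):
    """Format a language-tagged literal with basic N-Triples escaping."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


S1 = EX + "String_0001"
S2 = EX + "String_0002"

# (subject IRI, predicate IRI, already formatted object term)
TRIPLES = [
    (S1, VOCAB + "derivedFrom", iri("http://www.ex.org/obama_en.html")),
    (S1, VOCAB + "text",
     lit("Barack Obama is the 44th president of the United States of America.", "en")),
    (S2, VOCAB + "derivedFrom", iri("http://www.ex.org/obama_es.html")),
    (S2, VOCAB + "text",
     lit("Barack Obama es el 44º presidente de los Estados Unidos de América.", "es")),
    (S2, VOCAB + "translatedFrom", iri(S1)),
    (S2, VOCAB + "translatedBy", '"Google Translate"'),
]


def to_ntriples(triples):
    """Serialize the triples, one N-Triples statement per line."""
    return "\n".join(f"{iri(s)} {iri(p)} {o} ." for s, p, o in triples)


def objects(triples, subject, predicate):
    """Very small stand-in for a query: match on subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


print(to_ntriples(TRIPLES))
print(objects(TRIPLES, S2, VOCAB + "translatedFrom"))
```

In practice the `objects` lookup would be a SPARQL query against a triple store; the point here is only that each string becomes an addressable resource that other data can link to.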
Closing the Loop
• Active Curation: Systematic harvesting of LT-ready TM and TB from localization tool chain
– Prioritize segments for postediting and input to incremental MT retraining
– Target postedits to extract target terms and new morphologies
• Postediting Instrumentation:
– Postedit time and resource use (terms, concordance) vs. automation of MT metrics
– iOmegaT: instrumented open source CAT tool
Get Engaged!
• W3C Linked Data for Language Technology Community Group
– http://www.w3.org/community/ld4lt/
• W3C Best Practice in Multilingual Linked Data Community Group
– http://www.w3.org/community/bpmlod/
• W3C Ontolex Community Group
– http://www.w3.org/community/ontolex/
• ITS Interest Group
– http://www.w3.org/International/its/wiki/Main_Page