Open standards in use in localisation - an engineering approach

Andrés Vega, LRC XIII Localisation4All, Dublin, Ireland

2nd October 2008



Page 2: About the Author - Andrés Vega

8+ years of experience as a Localisation Engineer with Tek Translation International.

Specializing in complex project engineering with special focus on CMS, encodings and complex scripts.

Previous work as a programming languages teacher: OO programming, C and Java.

Background in Chemistry and Healthcare.

Page 3: Agenda

Why Standards?

Unicode

OpenType Fonts

XML

CMS

TMX

XLIFF

TBX and SRX

Final thoughts and Q&A

Page 4: Why Standards?

Allow faster technology development

Assembling standard components

Concentrating effort on specialisation

Increase competence, focused on features (not compatibility)

Facilitate inter-operability

Open standards allow information to be shared

(Not locked on proprietary standards)

Complementary tools may be developed

Choose tool/resource for each job

Guarantee future compatibility

Provide conformance validation mechanisms

Standard verification serves as QA procedure

Page 5: Unicode

Challenges
  • Too many character sets: three great ‘families’ (ANSI, DBCS, BiDi), three application types
  • Multilingual data (storage, display, processing)
  • Cross-platform and character set inter-conversion issues

What Unicode is
  • Universal character encoding standard by the Unicode Consortium
  • 21-bit character set with 3 main encoding forms (UTF-32, UTF-16, UTF-8)
  • Not just the character set
  • Character properties (Name, Category, Casing, Decomposition, …)
  • Annexes, Technical Reports (Comparison, Sorting, Hyphenation, …)

What Unicode is not
  • Glyph repertoire: glyphs provided are examples, not canonical!
  • Unicode alone does not provide language support!
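A minimal Python sketch of the two points above: one code point has a different size in each encoding form, and Unicode also ships character properties. The helper name `describe` and the sample characters are illustrative choices, not part of the original slides.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Return a few Unicode properties and the encoded size of one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        "decomposition": unicodedata.decomposition(ch),
        # The "-le" codecs are used so that no byte order mark is counted.
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_bytes": len(ch.encode("utf-16-le")),
        "utf32_bytes": len(ch.encode("utf-32-le")),
    }

# ASCII letter, accented Latin letter, BMP ideograph, supplementary-plane ideograph
for ch in ("A", "\u00e9", "\u4e2d", "\U00020000"):
    print(describe(ch))
```

The last sample lies outside the Basic Multilingual Plane, so it needs a surrogate pair (4 bytes) in UTF-16.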

Page 6: Unicode (Benefits and Issues)

Unicode benefits
  • One vendor-neutral encoding standard for all languages
  • Stable, but it keeps evolving
  • Multilingual rendering/storage/transfer (no conversion, no corruption)
  • Unified content processes (globalized, Web enabled)
  • Internationalisation
  • Easy conversion from/to/between legacy codepages

Issues or drawbacks with Unicode
  • Size (ANSI: 1 byte, DBCS: 1-2 bytes, UTF-8: 1-4 bytes, UTF-16: 2 or 4 bytes)
  • UniHan related (font dependence, ‘Gaiji’ and variants)
  • Inconsistencies in implementation choices across scripts
  • Several ways to represent the same character (pre-composed, or base character plus combining marks)

Implementation issues
  • Script enabling requires: Input, Display, Storage, Retrieval, Output
  • Bidirectional support, complex scripts issues

Implementation status
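The pre-composed character issue can be illustrated with Unicode normalization. This is a sketch only: the helper `same_text` and the choice of NFC as the comparison form are assumptions made for the example, not a recommendation taken from the slides.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
combining = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT, two code points

def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(precomposed == combining)           # raw comparison of code points
print(same_text(precomposed, combining))  # comparison after normalization
```

A translation memory or search index that skips this step can miss matches between strings that look identical on screen.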

Page 7: Unicode (Transition Issues)

Transition issues
  • Mixed content: legacy and UTF-8 (FrameMaker)
  • Localisation tools, filters, etc. not fully adapted or tested
  • Example: style names containing extended characters
  • New filter for FrameMaker 8: English names are OK (UTF-8 = ASCII)
  • German designed file: filter does not accept UTF-8 style names
  • Backwards conversions: Unicode version saved as non-Unicode version

[Figure: mixed content when an FM7 file is imported into FM8 + update and filtered. Labels: FM7; FM8 + update; import old version; filter; English seen OK; corrupted vars & template variables; corrupts ANSI. File states shown: ANSI content, ANSI variables, ANSI template; TTX; UTF-8 content, ANSI variables, ANSI template (twice); UTF-8 content, corrupt variables, ANSI template.]
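The kind of corruption that mixed legacy and UTF-8 content produces can be reproduced in a few lines. This sketch assumes Windows-1252 (`cp1252`) as the "ANSI" code page and uses a hypothetical helper, `decode_legacy_or_utf8`; it does not describe how FrameMaker or any filter actually works.

```python
text = "Gr\u00f6\u00dfe"  # German "Größe", as it might appear in a style or variable name

# Writing as UTF-8 and reading back as a legacy code page corrupts extended characters.
utf8_bytes = text.encode("utf-8")
misread = utf8_bytes.decode("cp1252")
print(misread)

# ASCII-only English text survives, because UTF-8 and ASCII share the same bytes.
print("Size".encode("utf-8").decode("cp1252"))

def decode_legacy_or_utf8(raw: bytes, legacy: str = "cp1252") -> str:
    """Try strict UTF-8 first; fall back to the assumed legacy code page."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(legacy)

print(decode_legacy_or_utf8(text.encode("cp1252")))
print(decode_legacy_or_utf8(text.encode("utf-8")))
```

The fallback is a heuristic: some legacy byte sequences are also valid UTF-8, so the encoding of each part of a file should be tracked explicitly wherever possible.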

Page 8: Unicode Workflow

Pre-Unicode Workflow (FrameMaker)

[Figure: stages labelled Files to localize, File Preparation, Translation & Review, Back Conversion, DTP and Merge. The English FrameMaker file with design fonts is split into one RTF and font set per code page (Western, CE, Cyrillic, Turkish, Greek, Baltic); each RTF is translated and reviewed, converted back to FM with its own font (Design, CE, Cyrillic, Turkish, Greek, Baltic) and merged into a multilingual target document with several ANSI fonts.]

Character corruption risks in all orange (middle 3 groups) steps

Final document presents issues in TOC and index generation and in searches

Unicode Workflow

[Figure: English FrameMaker file with design fonts; UTF-16 TTX and fonts; UTF-8 XML; multilingual document and design fonts.]

• UTF-8 FM with original design fonts

Page 9: OpenType fonts

Challenges

Two font families (TrueType and PostScript), two font technologies

Inter-platform issues

Benefits of OpenType

Support large character sets (Unicode, multiscript)

Glyph variants supported: Solves Unicode UniHan ambiguities

Supports advanced typography

Font embedding control

Features

Can contain either TrueType or PostScript (CFF) outline data

Glyph substitution

Glyph positioning

Script and language information

Page 10: XML

eXtensible Markup Language (meta-language for markup languages)

Used to define, share and validate information (data and structure)

An XML document contains
  • XML declaration: <?xml version='1.1' encoding='UTF-8' standalone='yes'?>
  • Document Type declaration(s): <!DOCTYPE root SYSTEM "rootDTD.dtd">
  • Elements: <element attribute="value">Content</element> or <element/>
  • Other: comments, entities/NCRs, processing instructions, conditional sections

Specific syntax (well-formed XML)
  • Only one root element
  • Tags in nested open/close pairs: <tag> </tag>
  • Element names obey certain conventions
  • Elements may contain attributes

DTD (valid XML)
  • Defines rules on structure, valid tags and attributes and valid data
  • Guarantees reliable data exchange between different systems
  • Can be included in each XML, but is normally external
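The well-formedness rules listed above can be checked with any XML parser. Below is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `is_well_formed` and the sample strings are illustrative. Note that this parser only checks well-formedness: validating against a DTD requires a validating parser, which is outside the scope of this sketch.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = "<doc><title lang='en'>Hello</title><br/></doc>"
bad_nesting = "<doc><b><i>text</b></i></doc>"   # tags closed in the wrong order
two_roots = "<a/><b/>"                           # more than one root element

for sample in (good, bad_nesting, two_roots):
    print(is_well_formed(sample), sample)
```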

Page 11: XML (Benefits)

Benefits

Simple (XML is plain text) but can embed any content type

Platform independent, Unicode encoded

Content is easily validated cross-platform: data transfer is safer

Structured (defines structural relationships within data)

Open, extensible and well-supported standard

Metadata and version control capable

Format independent

Powerful data transformation tools (XSL): Multiple outputs

Page 12: XML (Localisation benefits and issues)

Localisation benefits
  • Structured: content detached & merged (updates handling)
  • XML support easily implemented in localisation processes/tools
  • Easy validation against the DTD
  • Extensible: XML-based localisation standards (XLIFF, TMX, TBX, ...)
  • Metadata (source/target version control, updates, element status)
  • Format independent
  • Single-sourcing (localized once, published into many formats)
  • Source content and formatting changes are not interdependent
  • Content localisation and proofreading before formatting (DTP)

Issues
  • Transition needs to be well planned and performed
  • Segmentation issues (DTD needs to be multilingual aware)

Page 13: CMS

What are Content Management Systems?
  • Set of tools configured around a data repository (database)
  • Designed to manage information in small meaningful bits
  • Information is isolated from format
  • Have workflow capabilities, version control and change tracking
  • Store localized content layers (as other alternative content layers)

General benefits
  • Granularity (no redundancy)
  • Reuse (content reuse and multi output)
  • Improved quality and consistency
  • Single-source and multi-publishing
  • Easy rebranding/reformatting
  • Metadata info and version control
  • Workflow and automation

Localisation benefits
  • Workflow status control features
  • Localisation of updates via content deltas: improved time-to-market
  • Localisation independent from output format (better matching)

Page 14: CMS (Issues)

Issues

Authoring for reuse (topic model, single-source, cross-reference)

Segmentation issues

LF characters (0x0A): no validation, so they cause segmentation issues

Localisation readiness

CMS must be multilingual enabled (storage, I/O, processing)

Localisation workflow support

Strong version control and version rollback

Capability to export up-to-date paired TM content

Integration with LQA tools

Do not expect ROI to increase in the short run (DTP is still needed!)

[Figure: LF handling between the CMS, translation in XML and Quark. Problem: LF not visible, broken segmentation; LF also formats lists. Solution: remove meaningless LF, export the remaining ones as tags. Workaround: LF converted to tag, meaningful tags internal.]
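A rough sketch of the "remove meaningless LF, export the rest as tags" idea. Everything specific here is an assumption made for illustration: the placeholder tag name `<lf/>`, the function name, and the rule that leading, trailing and repeated line feeds carry no meaning. A real export would need rules agreed with the content owners.

```python
import re

LF_TAG = "<lf/>"  # hypothetical placeholder tag, not taken from any real CMS export

def export_line_feeds(text: str) -> str:
    """Drop line feeds assumed to be layout noise and make the rest visible as tags."""
    # Assumption: line feeds and spaces at the very start or end are meaningless.
    text = text.strip("\n ")
    # Assumption: whitespace around a line feed, and repeated line feeds, collapse to one.
    text = re.sub(r"[ \t]*\n[ \t\n]*", "\n", text)
    # Remaining line feeds are treated as meaningful (e.g. list formatting).
    return text.replace("\n", LF_TAG)

print(export_line_feeds("\nItem one \n\n  Item two\n"))
```

Once the line feed is a visible tag, the translation tool can protect it like any other inline code instead of silently splitting the segment.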

Page 15: CMS Localisation Workflow

[Figure: workflow across three lanes (Client, Tek, Client Validators), with XML exchanged between the client CMS and Tek. Steps shown:]
  • Client CMS: select only delta content
  • Translation (TTX format)
  • Revision (TTX format)
  • Prepared for proofreading (colour-coded RTF format)
  • Content validation in tracked-changes RTF
  • Insertion of validation changes (TTX & TMs)
  • Full document in XML
  • Preprocessing of XML
  • Import to FrameMaker
  • DTP in FrameMaker
  • Layout & consistency validation in PDF file
  • Delivery in FrameMaker
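The "select only delta content" step can be sketched as a comparison between the content units of the last localised version and the current one. The unit ids, the dictionary layout and the use of SHA-256 fingerprints are assumptions for the example; an actual CMS would expose its own change tracking.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Stable fingerprint of a content unit, so old text need not be kept in full."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def select_delta(previous: dict, current: dict) -> list:
    """Return ids of content units that are new or changed since the last localised version."""
    old = {uid: fingerprint(text) for uid, text in previous.items()}
    return [uid for uid, text in current.items() if old.get(uid) != fingerprint(text)]

previous = {"t1": "Install the device.", "t2": "Press Start."}
current = {"t1": "Install the device.", "t2": "Press the Start button.", "t3": "Restart."}

print(select_delta(previous, current))
```

Only the units returned here would be sent for translation; unchanged units keep their existing localised layer.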

Page 16: TMX

What is TMX?

Translation Memory eXchange

Standard by LISA (Localization Industry Standards Association)

Provides a standard method for TM data description

XML-compliant (validated against its TMX DTD)

Uses other ISO standards for date, time, lang, country

Consists of

Container format specification

Translation unit elements <tu>

Optional format description elements (font change,...)

Subflows (footnotes, index entries)

Low-level meta-markup format for segment content

Segment element <seg>
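To make the container structure concrete, here is a minimal sketch that builds a one-unit TMX document with Python's standard library. The function name `build_tmx`, the tool name `example-script` and the header values are placeholders chosen for the example; the element names (`tmx`, `header`, `body`, `tu`, `tuv`, `seg`) follow TMX 1.4.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # serialised as xml:lang

def build_tmx(pairs, srclang="en-US", tgtlang="es"):
    """Build a TMX tree from (source, target) segment pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",   # placeholder tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "none",
        "adminlang": "en-US",
        "srclang": srclang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per pair
        for lang, text in ((srclang, source), (tgtlang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx

tmx = build_tmx([("Translatable content.", "Contenido traducible.")])
print(ET.tostring(tmx, encoding="unicode"))
```

Inline formatting elements and subflows are left out; handling them consistently is where most tool-to-tool differences appear.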

Page 17: TMX (Benefits and Drawbacks)

Benefits
  • Transfer TM assets across tools/vendors
  • Provides clients with control over their translated assets
  • Non-proprietary and vendor neutral
  • Can be integrated with LQA tools
  • Provides translators/vendors with freedom of tool choice
  • Specialized tools share TM assets
  • Tools may be outdated, assets will not
  • Facilitates work distribution/outsourcing

Issues
  • Tag handling
  • TMX DTD cannot validate inline codes
  • TMX compliance level
  • Segmentation issues

Page 18: XLIFF

XML Localisation Interchange File Format

Standard maintained by the OASIS XLIFF Technical Committee

Tool-neutral XML-based standard localisation resource container format

To store/transfer/manipulate localizable content, context and other info

Has Built-in support for CAT tools and related standards (TBX, TMX)

Features:

Translation suggestions (TM, Glossary, MT) to approve or edit

Metadata: Translate, notes, context info, version

Hierarchical data structures

Abstraction of formatting and inline codes:

Structural formatting stored in the skeleton file

Inline formatting can be dealt with in two ways

Replaced by g (paired) and x (isolated) tags (OpenTag style)

Encapsulated into bpt, ept (paired), it or ph (isolated) tags

Page 19: XLIFF (Description)

Separates localizable and non-localizable content
  • Non-localisable: skeleton (separate or embedded)
  • Localizable: 'file' elements with header (metadata) and body

Body can contain 'trans-unit' and 'bin-unit' elements

Each trans-unit can have

unique id, resource id, resource type, translate yes/no
<trans-unit id="abc123" resname="resourceID" restype="string" translate="yes">

Translatable content source and language
  <source xml:lang="en-US">Translatable content.</source>

Currently validated translation
  <target xml:lang="es" state="needs-review-translation">Traducción.</target>

alt-trans translation suggestion(s)
  <alt-trans match-quality="100%" tool="TM">
    <source>Translatable content.</source>
    <target xml:lang="es">Contenido traducible.</target>
  </alt-trans>

</trans-unit> (closing tag)
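As an illustration, the trans-unit shown above can be read with Python's standard XML parser. This is a sketch under simplifying assumptions: the fragment is parsed on its own and without a namespace, whereas a complete XLIFF 1.2 file wraps units in `xliff`, `file` and `body` elements and declares the XLIFF namespace. The function name `summarise` is illustrative.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # how ElementTree names xml:lang

SAMPLE = """<trans-unit id="abc123" resname="resourceID" restype="string" translate="yes">
  <source xml:lang="en-US">Translatable content.</source>
  <target xml:lang="es" state="needs-review-translation">Traducci\u00f3n.</target>
  <alt-trans match-quality="100%" tool="TM">
    <source>Translatable content.</source>
    <target xml:lang="es">Contenido traducible.</target>
  </alt-trans>
</trans-unit>"""

def summarise(unit_xml: str) -> dict:
    """Extract the main fields of a single trans-unit fragment."""
    unit = ET.fromstring(unit_xml)
    target = unit.find("target")  # direct child only, not the one inside alt-trans
    return {
        "id": unit.get("id"),
        "translate": unit.get("translate") == "yes",
        "source": unit.findtext("source"),
        "target": target.text,
        "target_lang": target.get(XML_LANG),
        "state": target.get("state"),
        "suggestions": [
            (alt.get("match-quality"), alt.findtext("target"))
            for alt in unit.findall("alt-trans")
        ],
    }

print(summarise(SAMPLE))
```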

Page 20: XLIFF (Benefits and Drawbacks)

Benefits: for the translation process
  • One common format on which to translate
  • Control on translatable/non-translatable content
  • Better information handling (context, notes, metadata)
  • Better TM matching due to formatting abstraction
  • Concurrent tool processing visible at review stage
  • Support for all localisation phases
  • Supports metrics info on each trans-unit

Benefits: for localisation tool developers
  • Common platform for tool developers to write to
  • Easy adoption of new formats (new filters to XLIFF)
  • All generic XML processing benefits

Drawbacks
  • Conversion tools needed into XLIFF and back
  • Many XLIFF features are not implemented by most tools
  • Segmentation is inherent to XLIFF file generation
  • As opposed to tailored tools, WYSIWYG is difficult to attain

Page 21: XLIFF Workflow

No XLIFF Scenario

[Figure, "Many Formats!": file formats .xml, .mif, .dll, .rc, .htm, .rtf, .resx; tools: SGML Editor, Software Editor; roles: Translator A, Reviewer A, Translator B, Reviewer B.]

XLIFF Scenario

[Figure, "Many Filters!": the same file formats, tools and roles, connected through XLIFF.]

Page 22: Other LISA standards: TBX, SRX

TBX

What is TBX?
  • Term Base eXchange standard by LISA
  • XML based, vendor-neutral, open standard

Benefits
  • Better control of terminology (source consistency)
  • Reduced glossarisation effort (localisation phase)
  • Platform and tool independent glossaries (global consistency)

Current status
  • TBX Basic (lighter approach)
  • TBX Checker

SRX

What is SRX?
  • Segmentation Rules eXchange format
  • Describes how localisation tools segment text for processing

Benefits
  • Standardises segmentation process (avoids segmentation issues)
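SRX describes segmentation as an ordered list of break and no-break rules, each with a pattern for the text before and after a candidate position. The sketch below imitates that model in plain Python; it does not read SRX files, and the two rules and the function name `segment` are invented for the example.

```python
import re

# (is_break, before_pattern, after_pattern); at each position the first matching rule wins.
RULES = [
    (False, r"\b(?:Mr|Dr|e\.g|i\.e)\.", r"\s"),  # exception: no break after these abbreviations
    (True, r"[.?!]", r"\s+"),                    # break after sentence-final punctuation
]

def segment(text: str, rules=RULES) -> list:
    """Split text into segments using ordered break/no-break rules."""
    compiled = [(brk, re.compile(before + r"$"), re.compile(after))
                for brk, before, after in rules]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            # A rule applies when its patterns match on both sides of the position.
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # first matching rule decides; later rules are ignored
    segments.append(text[start:])
    return [s.strip() for s in segments if s.strip()]

print(segment("Dr. Smith arrived. He sat down."))
```

Two tools that share the same rule list split text at the same places, which is what keeps translation memory matches stable across tools.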

Page 23: Final Thoughts

Unicode: Use always. If a tool does not support it, convert at the end stage

XML: Powerful for single-source, multi-output requirements

CMS: Costly. Depends on volume. First consider an XML model, then migrate

TMX: Use for safe tool-to-tool TM transfer, especially software into doc

XLIFF: Not fully implemented. Good alternative for Java or Web content. Use it to unify side processes (LQA)

TBX: Use to exchange glossary info. Good for clients

SRX: Very much needed, but lacks implementation.

Page 24: About Tek

About Tek: Multilingual translation and localisation business solutions designed to meet the needs of Life Sciences, IT and Manufacturing

• Since 1961
• Over 65 languages
• Expert Resources and Service
• Located in US, Spain, Brazil, China, Ireland, UK, Denmark
• Tek OneWorld Platform for your language & industry needs
• Business Intelligence
• Language Quality Solutions
• Open Connectivity, WW Collaboration
• Scalability
• Simplification and standardisation
• ISO 9001:2000 certification
• Follow-the-sun
• Solutions-based approach for best business value

Page 25: Thank You

Thank You

Q & A

Andrés Vega Muñoz, Localisation Engineer

Tek Translation International

Email: [email protected]

www.tektrans.com