Supporting the Authoring Process with Linguistic Software Melanie Siegel

DESCRIPTION

Day:1 _ Melani_Siegel_1345-1430

Page 1: Supporting the authoring process with linguistic software

Supporting the Authoring Process with Linguistic Software

Melanie Siegel

Page 2: Supporting the authoring process with linguistic software

The authoring process – and where it needs support

challenges for correctness
• time pressure
• non-native writing
• not enough capacity for careful proofreading

automatic support possibilities
• spell checking
• grammar checking

Page 3: Supporting the authoring process with linguistic software

The authoring process – and where it needs support

challenges for understandability and readability
• authors are experts in the subject and the language – users often are not

automatic support possibilities
• style checking

Page 4: Supporting the authoring process with linguistic software

The authoring process – and where it needs support

challenges for consistency and corporate wording
• guidelines for corporate wording exist – in a large document on the shelf
• terminology lists exist – in an Excel sheet somewhere in the file system
• distributed writing

automatic support possibilities
• terminology checking
• sentence clustering

Page 5: Supporting the authoring process with linguistic software

The authoring process – and where it needs support

challenges for translatability
• authors write without having the translation process in mind
• lexical, syntactic and semantic ambiguity
• translation costs depend on translation memory matches

automatic support possibilities
• style checking
• terminology checking

Page 6: Supporting the authoring process with linguistic software

NLP that is needed for authoring support

• tokenization
• POS tagging
• morphology
• dictionary
• error dictionary

Page 7: Supporting the authoring process with linguistic software

tokenization

• Close the door of our XYZ car.
  token classes: capital word, lower word, space, dot_EOS

• 花子が本を読んだ。
  花子 が 本 を 読ん だ 。
  token classes: Kanji, Hiragana, dot_EOS

based on rules and lists of abbreviations
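
The token classes on this slide can be approximated with a small rule-based tokenizer. This is a minimal sketch, not the implementation behind the slides: the class names follow the slide, while the regular expressions, the `ABBREVIATIONS` list, and the `tokenize` function are assumptions made for illustration.

```python
import re

# Hypothetical abbreviation list; a real system would load a much larger one.
ABBREVIATIONS = {"e.g.", "i.e.", "etc."}

# Ordered token classes: names follow the slide, patterns are assumptions.
TOKEN_CLASSES = [
    ("space", r"\s+"),
    ("abbreviation", "|".join(re.escape(a) for a in sorted(ABBREVIATIONS))),
    ("capital word", r"[A-Z][A-Za-z]*"),
    ("lower word", r"[a-z]+"),
    ("Kanji", r"[\u4e00-\u9fff]+"),
    ("Hiragana", r"[\u3040-\u309f]+"),
    ("dot_EOS", r"[.。]"),   # simplification: every remaining full stop ends a sentence
    ("other", r"."),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<g{i}>{pattern})" for i, (_, pattern) in enumerate(TOKEN_CLASSES))
)

def tokenize(text):
    """Return (token, class) pairs, dropping whitespace tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        name = TOKEN_CLASSES[int(match.lastgroup[1:])][0]
        if name != "space":
            tokens.append((match.group(), name))
    return tokens

print(tokenize("Close the door of our XYZ car."))
print(tokenize("花子が本を読んだ。"))
```

Splitting on script changes alone yields 読 / んだ for the Japanese sentence. The segmentation 読ん / だ shown on the slide needs a morphological dictionary, which this sketch does not have.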

Page 8: Supporting the authoring process with linguistic software

POS tagging

Close  the  door  of    our   XYZ  car.
V      DET  N     PREP  PRON  NE   N

• XML and attribute-value structures
• statistical methods
• large dictionaries
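
As a rough illustration of dictionary-based tagging with XML output, the sketch below looks up each token in a lexicon and serializes the result as attribute-value pairs. The mini-lexicon, the fallback heuristic, and the element and attribute names (`sentence`, `token`, `pos`) are assumptions. A real tagger would add statistical disambiguation, for example to decide whether "close" is a verb or an adjective.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-lexicon; the tags are the ones shown on the slide.
LEXICON = {
    "close": "V", "the": "DET", "door": "N", "of": "PREP",
    "our": "PRON", "car": "N",
}

def tag(tokens):
    """Tag tokens by lexicon lookup with a crude fallback for unknown words."""
    tagged = []
    for token in tokens:
        pos = LEXICON.get(token.lower())
        if pos is None:
            # Assumed fallback: unknown all-uppercase words are named entities,
            # every other unknown word is treated as a noun.
            pos = "NE" if token.isupper() else "N"
        tagged.append((token, pos))
    return tagged

def to_xml(tagged):
    """Serialize (token, tag) pairs as XML with the tag stored in an attribute."""
    sentence = ET.Element("sentence")
    for token, pos in tagged:
        ET.SubElement(sentence, "token", pos=pos).text = token
    return ET.tostring(sentence, encoding="unicode")

print(to_xml(tag(["Close", "the", "door", "of", "our", "XYZ", "car"])))
```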

Page 9: Supporting the authoring process with linguistic software

morphology

• Close the door of our XYZ car.

Close: Lemma: close; Tense: present_imp; Person: third; Number: singular
car: Lemma: car; Number: singular; Case: nominative_accusative

based on dictionaries, rules for inflection and derivation

Page 10: Supporting the authoring process with linguistic software

dictionary

• words unknown to the standard NLP system

http://wiki.openoffice.org/wiki/Documentation/

Page 11: Supporting the authoring process with linguistic software

spelling

language analysis
• words are defined in a dictionary
• anything not in the dictionary is an error
• high recall, low precision (depending on the domain)

error analysis
• errors are defined
• unknown words that are not defined as errors are term candidates
• based on words and rules
• consider terminology
• high precision, recall is dependent on data work

Page 12: Supporting the authoring process with linguistic software

error dictionary

• stylesheet → style sheet
• begginning → beginning
• beleive → believe
• definately → definitely
• gotta → have to
• hided → hid|hidden|hides
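
An error dictionary of this kind can be read as a lookup table from a known error to its suggestions. The sketch below assumes a simple three-way decision: a defined error is flagged with suggestions, a known word passes, and an unknown word that is not a defined error becomes a term candidate. The error entries come from the slide. The tiny `KNOWN_WORDS` list and the `check_word` function are hypothetical.

```python
# Entries from the slide; each error maps to one or more suggestions.
ERROR_DICTIONARY = {
    "stylesheet": ["style sheet"],
    "begginning": ["beginning"],
    "beleive": ["believe"],
    "definately": ["definitely"],
    "gotta": ["have to"],
    "hided": ["hid", "hidden", "hides"],
}

# Hypothetical stand-in for a full dictionary of known words.
KNOWN_WORDS = {"the", "a", "in", "key", "is"}

def check_word(word):
    """Classify a word as 'error' (with suggestions), 'ok', or 'term candidate'."""
    key = word.lower()
    if key in ERROR_DICTIONARY:
        return ("error", ERROR_DICTIONARY[key])
    if key in KNOWN_WORDS:
        return ("ok", [])
    # Unknown and not a defined error: hand over to terminology work.
    return ("term candidate", [])

for word in ["definately", "The", "SWASSNet"]:
    print(word, check_word(word))
```

Because only defined errors are flagged, false alarms on product names are avoided, at the price of missing any misspelling that nobody has entered yet.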

Page 13: Supporting the authoring process with linguistic software

why work on terminology?

• avoid false alarms in spelling
• consistency
• less ambiguity
• translatability
• corporate wording

ultimate goal: 1 term - 1 meaning - 1 translation

Page 14: Supporting the authoring process with linguistic software

reality: variants

• web server – web-server
• upload protection – upload-protection
• timeout – time out
• Reset – ReSet
• sub station – sub-station

Page 15: Supporting the authoring process with linguistic software

term variants

– orthographic variants
  - hyphen, blank, case: term bank, termbank
– semi-orthographic variants
  - number: 6-digit, six-digit
  - trademark: MyCompany™, MyCompany
– syntactic variants
  - preposition: oil level, level of oil
  - gerund/noun: call center, calling center
– synonyms
  - “classical”: vehicle, car
– language-specific variants (e.g. Fugenelemente DE, Katakana JA)

Page 16: Supporting the authoring process with linguistic software

how to get consistent terminology

• author/company defines the term bank
• list of deprecated terms
  deprecated term: vehicle
  approved term: car
• list of approved terms, automatic identification of variants
  approved term: SWASSNet User
  deprecated term: SWASSNet user, SWASS-Net User
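
Automatic identification of orthographic variants can be sketched by mapping every term to a normalized key and comparing keys. The following is an illustrative assumption about how such a check might work, not the method used in the tool. `normalize_term` and `find_variants` are hypothetical names. Only hyphen, blank, and case variants are covered; number, trademark, and syntactic variants and synonyms would need further rules or an explicit list.

```python
import re

def normalize_term(term):
    """Collapse case, hyphens, and blanks so orthographic variants compare equal."""
    return re.sub(r"[-\s]+", "", term).lower()

def find_variants(approved_terms, candidates):
    """Map each approved term to the candidates that are orthographic variants of it."""
    by_key = {normalize_term(term): term for term in approved_terms}
    variants = {term: [] for term in approved_terms}
    for candidate in candidates:
        approved = by_key.get(normalize_term(candidate))
        # A candidate identical to the approved term is not a variant.
        if approved is not None and candidate != approved:
            variants[approved].append(candidate)
    return variants

found = find_variants(
    ["SWASSNet User"],
    ["SWASSNet user", "SWASS-Net User", "SWASSNet User", "web server"],
)
print(found)
```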

Page 17: Supporting the authoring process with linguistic software

terminology and spelling

Page 18: Supporting the authoring process with linguistic software

terminology and spelling

Page 19: Supporting the authoring process with linguistic software

NLP for terminology

• NLP methods for term extraction
  – corpus analysis (morphology, POS, NER)
  – information extraction (potential product names)
  – ontologies (e.g. semantic groups)

• NLP methods for setting up a term database
  – morphology (finding the base form)
  – POS

• NLP methods for term checking
  – variants
  – similar words
  – inflection

Page 20: Supporting the authoring process with linguistic software

approaches to grammar checking

descriptive grammar
• definition of correct grammar
• e.g. HPSG, LFG, chunk grammar, statistical grammars
• anything that's not analyzable must be a grammar error
• preconditions:
  • grammar with large coverage
  • large dictionaries
  • robust, but not too robust parsing
  • efficient parsing methods
• high recall, low precision

error grammar
• implementation of grammar errors
• preconditions:
  • work with error corpora
  • error grammar with a high number of error types
• depth of analysis varies with the type of error to be described
• high precision, recall is based on the number of rules

Page 21: Supporting the authoring process with linguistic software

grammar rules, examples

• subject-verb agreement:
  – Check if instructions are programmed in such a way that a scan never finish.
  – When the operations is completed, the return to home completes.

Page 22: Supporting the authoring process with linguistic software

grammar rules, examples

• a/an distinction:
  – a isolating transformer
  – an program

• wrong verb form:
  – it cannot communicates with them
  – IP can be automatically get
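
The a/an rule lends itself to a small error-grammar style check. The sketch below is a spelling-based heuristic under stated assumptions: the choice really depends on pronunciation, so the two exception lists are hypothetical and far from complete, and the function names are made up for illustration.

```python
import re

# Assumed exception lists: spelling and pronunciation disagree for these words.
VOWEL_SOUND_EXCEPTIONS = {"hour", "honest", "heir"}                # take "an"
CONSONANT_SOUND_EXCEPTIONS = {"user", "unit", "one", "european"}   # take "a"

def expected_article(word):
    """Guess the indefinite article for the word that follows it."""
    w = word.lower()
    if w in VOWEL_SOUND_EXCEPTIONS:
        return "an"
    if w in CONSONANT_SOUND_EXCEPTIONS:
        return "a"
    return "an" if w[0] in "aeiou" else "a"

def check_articles(sentence):
    """Return (found, expected, next_word) for each wrong a/an in the sentence."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    errors = []
    for article, word in zip(tokens, tokens[1:]):
        if article.lower() in ("a", "an") and article.lower() != expected_article(word):
            errors.append((article, expected_article(word), word))
    return errors

print(check_articles("Connect a isolating transformer."))
print(check_articles("Start an program."))
```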

Page 23: Supporting the authoring process with linguistic software

example grammar rule*

• write_words_together

    @can ::= [ TOK "^(can)$"
               MORPH.READING.MCAT "^Verb$" ];

    /* Examples: The application can not start.
                 The application can tomorrow not start. */
    TRIGGER(80) == @can^1 [@adv]* 'not'^2
    -> ($can, $not)
    -> { mark: $can, $not;
         suggest: $can -> '', $not -> 'cannot';
       }

    /* Example: Branch circuits can not only minimize system damage
                but can interrupt the flow of fault current */
    NEG_EV(40) == $can 'not' 'only' @verbInf []* 'but';

* implemented in Acrolinx
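
The interplay of trigger and negative evidence in the rule above can be mimicked in a few lines of Python. This is a sketch of the idea only, not of the Acrolinx rule engine: the `ADVERBS` set stands in for the `@adv` token class, the numeric scores (80 and 40) are ignored, and negative evidence is treated as a plain veto, which is an assumption.

```python
import re

# Assumed stand-in for the @adv token class; a real rule would use POS tags.
ADVERBS = {"tomorrow", "often", "usually", "still"}

def check_write_words_together(sentence):
    """Return the token positions of 'can' and 'not' if they should be 'cannot'."""
    tokens = re.findall(r"\w+", sentence.lower())
    for i, token in enumerate(tokens):
        if token != "can":
            continue
        j = i + 1
        while j < len(tokens) and tokens[j] in ADVERBS:
            j += 1  # skip adverbs between 'can' and 'not'
        if j >= len(tokens) or tokens[j] != "not":
            continue
        # Negative evidence: 'can not only ... but' is a correct construction.
        if j + 1 < len(tokens) and tokens[j + 1] == "only" and "but" in tokens[j + 2:]:
            continue
        return (i, j)
    return None

print(check_write_words_together("The application can not start."))
print(check_write_words_together("The application can tomorrow not start."))
print(check_write_words_together(
    "Branch circuits can not only minimize system damage "
    "but can interrupt the flow of fault current"
))
```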

Page 24: Supporting the authoring process with linguistic software

style – controlled language

• controlled languages
  • AeroSpace and Defence Industries Association of Europe (ASD): ASD-STE100 (Simplified Technical English)
  • Caterpillar Technical English (CTE)
• disadvantages:
  • very restrictive
  • low acceptance among users

Page 25: Supporting the authoring process with linguistic software

style – moderately controlled language

• rules define errors (like grammar rules)
• rules (and instructional information) are defined by authors
• implementation in authoring support systems
• high acceptance
• good usability

Page 26: Supporting the authoring process with linguistic software

style guidelines

• different for different usages
  – text type (e.g., press release vs. technical documentation)
  – domain (e.g., software vs. machines)
  – readers (e.g., end users vs. service personnel)
  – authors (e.g., Germans tend to write long sentences)

Page 27: Supporting the authoring process with linguistic software

style rule examples*: best practice

• avoid_latin_expressions
• avoid_modal_verbs
• avoid_passive
• avoid_split_infinitives
• avoid_subjunctive
• use_serial_comma
• use_comma_after_introductory_phrase
• spell_out_numerals

* style rules implemented in Acrolinx

Page 28: Supporting the authoring process with linguistic software

style rule examples: company

• use_units_consistently
• abbreviate_currency
• COMPANY_trademark
• do_not_refer_to_COMPANY_intranet
• add_tag_to_UI_string
• avoid_trademark_as_noun
• avoid_articles_in_title

Page 29: Supporting the authoring process with linguistic software

style rule examples: MT pre-editing

• avoid_nested_sentences
• avoid_ing_words
• keep_two_verb_parts_together
• avoid_parenthetical_expressions

dependent on the MT system and language pair

Page 30: Supporting the authoring process with linguistic software

automatic suggestions for style rules

– replacement of words or phrases
– replacement using the correct writing with uppercase or lowercase
– replacement of words using the correct inflection
– generation of whole sentences (e.g. passive to active) requires semantic analysis and generation and is therefore not (yet) possible

Page 31: Supporting the authoring process with linguistic software

example style rule*

• avoid_future_tense

    /* Example: ".. It will be necessary .." */
    TRIGGER(80) == @will^1 [-@comma]* @verbInf^2
    -> ($will, $verbInf)
    -> { mark: $will, $verbInf; }

    /* Example: ".. The router services will be offered in the future .." */
    NEG_EV(40) == $will []* @in @det @time;

* implemented in Acrolinx

Page 32: Supporting the authoring process with linguistic software

consistent phrasing

• Use the same phrase for the same meaning.
• Examples:
  – Congratulations on acquiring your new wearable digital audio player
  – Congratulations, you have acquired your new wearable digital audio player!
  – Dear Customer, congratulations on purchasing the new wearable digital audio player!

Page 33: Supporting the authoring process with linguistic software

Acrolinx intelligent reuse™

Acrolinx server: Terminology, Intelligent Reuse, Grammar & Spelling, Writing Standards

[Diagram labels: Content / Translation repository; micro-clustering; Clusters; redundancy and quality filters; review and release; Reuse Repository]

Example sentence groups from the diagram:
• the cat sat on the mat | The dog sat on the rug | The elk sat on the moss | The moose sat on the elk
• the cat sat on the carpet | The cat slept on the sofa
• Fish swam in the blue water | The fish swam in the green water | The fish swam in the red sea.
• the cat sat on the mat | this is a sentence you can’t read
• the cat sat on the mat | Another small test snippet | the cat sat on the mat | This is the same as the other one. | the cat sat on the mat
• the cat sat on the malt | The cat ate on the mat | the cat sat on the doormat
• the cat sat on the mat. | The cat sat on the mat | the cat sat on the mat
• the cat sat on the mat | More useless data points
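
The slide does not say how the micro-clustering is computed. As a stand-in, the sketch below groups near-duplicate sentences greedily by character-level similarity, using `difflib.SequenceMatcher` from the Python standard library. The threshold of 0.8, the normalization, and the function names are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def normalize_sentence(sentence):
    """Lowercase and strip final punctuation so trivial differences do not matter."""
    return sentence.lower().strip().rstrip(".!?")

def cluster_sentences(sentences, threshold=0.8):
    """Greedily group sentences that are similar to a cluster's first member."""
    clusters = []
    for sentence in sentences:
        for cluster in clusters:
            ratio = SequenceMatcher(
                None, normalize_sentence(cluster[0]), normalize_sentence(sentence)
            ).ratio()
            if ratio >= threshold:
                cluster.append(sentence)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append([sentence])
    return clusters

for cluster in cluster_sentences([
    "the cat sat on the mat",
    "The cat sat on the mat.",
    "the cat sat on the malt",
    "Fish swam in the blue water",
]):
    print(cluster)
```

A cluster with several members points to phrasing that could be unified. Comparing only against the first member keeps the sketch simple but makes the result depend on input order.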

Page 34: Supporting the authoring process with linguistic software

DEMO

Page 35: Supporting the authoring process with linguistic software

checking OpenOffice documentation

Page 36: Supporting the authoring process with linguistic software

correctness

Page 37: Supporting the authoring process with linguistic software

understandability

Page 38: Supporting the authoring process with linguistic software

consistency

Page 39: Supporting the authoring process with linguistic software

consistency

Page 40: Supporting the authoring process with linguistic software

translatability

Page 41: Supporting the authoring process with linguistic software

summary

• The authoring process is challenging
  – correctness
  – consistency
  – understandability
  – translatability

• It can be effectively supported by NLP-enhanced tools

Page 42: Supporting the authoring process with linguistic software

Thank you!

Melanie Siegel
Hochschule Darmstadt – University of Applied Sciences

[email protected]