Summary of Multilingual Natural Language Processing Applications: From Theory to Practice

MULTILINGUAL NATURAL LANGUAGE PROCESSING APPLICATIONS: FROM THEORY TO PRACTICE

OCTOBER 2017

Mashael Alduwais

OVERVIEW

Multilingual Natural Language Processing Applications: From Theory to Practice

Edited by Daniel M. Bikel and Imed Zitouni

IBM Press, 2012

Two Parts:

I. Theory: 7 chapters

II. Practice: 9 chapters

10/30/2017 MASHAEL ALDUWAIS 2

ABOUT THE AUTHORS

Daniel M. Bikel

Current Position: Research Scientist @ Google

Previous: LinkedIn, Google, IBM

Education: Harvard University, University of Pennsylvania

Interest: Syntax/parsing, information extraction, multilingual systems, NLP systems design, machine learning toolkits, language modeling.

Imed Zitouni

Current Position: Principal Researcher @ Microsoft

Previous: IBM, Bell-Labs, DIALOCA

Education: Université Henri Poincaré, Nancy

Interest: natural language processing, language modeling, spoken dialog systems, speech recognition, and machine learning.


BOOK CONTENT

Part I: Theory

Chapter 1 Finding the Structure of Words

Chapter 2 Finding the Structure of Documents

Chapter 3 Syntax

Chapter 4 Semantic Parsing

Chapter 5 Language Modeling

Chapter 6 Recognizing Textual Entailment

Chapter 7 Multilingual Sentiment and Subjectivity Analysis

Part II: Practice

Chapter 8 Entity Detection and Tracking

Chapter 9 Relations and Events

Chapter 10 Machine Translation

Chapter 11 Multilingual Information Retrieval

Chapter 12 Multilingual Automatic Summarization

Chapter 13 Question Answering

Chapter 14 Distillation

Chapter 15 Spoken Dialog Systems

Chapter 16 Combining Natural Language Processing Engines


CHAPTER 1. FINDING THE STRUCTURE OF WORDS

Morphological parsing: discovery of word structure

Tokens: words

In Arabic, certain tokens (called clitics) are concatenated in writing with the preceding or the following ones, possibly changing their forms as well.

Lexemes: the concept behind a linguistic form and the set of alternative forms that can express it.

Lexical categories: verbs, nouns, adjectives, conjunctions, particles, or other parts of speech.

Inflection example: turning a singular form into a plural.

Morphemes: structural components of word form (segments or morphs). Ex: dis-agree-ment-s

Typology: divides languages into groups by characterizing the prevalent morphological phenomena in those languages. Ex: Isolating, Synthetic, Agglutinative, Fusional.


Issues and Challenges:

Irregularity: word forms that are not described by a prototypical linguistic model.

Ambiguity: word forms that can be understood in multiple ways out of the context of their discourse.

Productivity: Is the inventory of words in a language finite, or is it unlimited?

Morphological Models:

Dictionary Lookup

Finite-State Morphology

Unification-Based Morphology

Functional Morphology
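
As a minimal sketch of the first model in the list above, dictionary lookup can be combined with naive suffix stripping. The toy lexicon, the suffix rules, and the function name `analyze` are assumptions made for illustration; they are not taken from the book, and real analyzers cover far more phenomena.

```python
# Hypothetical toy lexicon mapping known word forms to (lemma, tag).
LEXICON = {
    "agree": ("agree", "VERB"),
    "agreement": ("agreement", "NOUN"),
    "mice": ("mouse", "NOUN+PL"),  # irregular form stored explicitly
}

# Illustrative suffix rules: (suffix, tag appended when the suffix is stripped).
SUFFIXES = [("s", "+PL"), ("ed", "+PAST")]


def analyze(word):
    """Return all (lemma, tag) analyses found by lookup, then by suffix stripping."""
    word = word.lower()
    analyses = []
    if word in LEXICON:
        analyses.append(LEXICON[word])
    for suffix, tag in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem in LEXICON:
            lemma, base_tag = LEXICON[stem]
            analyses.append((lemma, base_tag + tag))
    return analyses
```

Calling `analyze("agreements")` finds the stem "agreement" after stripping "s", while `analyze("agreed")` returns nothing because the stem "agre" is not in the lexicon, which is the kind of gap that finite-state models are designed to close.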


CHAPTER 2. FINDING THE STRUCTURE OF DOCUMENTS

Some (NLP) tasks use sentences as the basic processing unit:

Parsing, machine translation, automatic speech recognition (ASR) systems, and semantic role labeling

Sentence boundary detection (sentence segmentation): Automatically segmenting a sequence of word tokens into sentence units.

Topic segmentation (discourse or text segmentation): Automatically dividing a stream of text or speech into topically homogeneous blocks.

A boundary classification problem:

Depending on the type of input (i.e., text versus speech), different features may be used.

Performance has improved by exploiting very high-dimensional feature sets.
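
The boundary classification view above can be sketched for text input as follows. A hand-written rule stands in for a trained classifier here; the abbreviation list, the capitalization feature, and the function names are assumptions for illustration only.

```python
# Hypothetical abbreviation list; a real system would learn such cues from data.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}


def is_boundary(token, next_token):
    """Classify the position after `token` as a sentence boundary or not."""
    if not token.endswith((".", "!", "?")):
        return False
    if token.lower() in ABBREVIATIONS:
        return False
    # Feature: the following token is absent or starts with an uppercase letter.
    return next_token is None or next_token[0].isupper()


def split_sentences(text):
    """Segment whitespace-separated tokens into sentence strings."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if is_boundary(token, next_token):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

For speech input the same decision would rely on different features (for example pauses), which this text-only sketch does not model.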


CHAPTER 3. SYNTAX

Syntactic parsing (syntax analysis): discover the various predicate-argument dependencies that may exist in a sentence.

Parse natural language text to provide syntactic trees.

Recursively partition the words in the sentence into individual phrases such as verb phrases or noun phrases.

Used for text-to-speech, machine translation, summarization, and paraphrasing applications.


Treebanks:

A collection of sentences where each sentence is provided with a complete syntactic analysis (an annotated text corpus).

The syntactic analysis for each sentence has been judged by a human expert.

A style book or set of annotation guidelines is typically written before the annotation process to ensure a consistent scheme of annotation throughout the treebank.

Two main approaches to constructing treebanks: dependency graphs and phrase structure.

Challenges:

Ambiguity: choosing from an exponentially large number of alternative analyses.

Language issues: tokenization, case, encoding, word segmentation and morphology.
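
To give a sense of the scale behind the ambiguity challenge above, the number of unlabeled binary bracketings of an n-word sentence is the Catalan number C(n-1). This sketch counts tree shapes only; it ignores phrase labels and any grammar constraints, so it is an illustration of growth rather than a count of real parses. The function name is an assumption.

```python
from math import comb


def count_binary_bracketings(n_words):
    """Number of distinct binary tree shapes over a sentence of n_words words."""
    if n_words < 1:
        raise ValueError("a sentence needs at least one word")
    n = n_words - 1
    # Closed form of the Catalan number C_n.
    return comb(2 * n, n) // (n + 1)


for length in (3, 5, 10, 20):
    print(length, count_binary_bracketings(length))
```

A parser therefore cannot enumerate candidate analyses one by one for sentences of realistic length and needs dynamic programming or pruning instead.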


CHAPTER 4. SEMANTIC PARSING

Semantic parsing: identifying meaning chunks contained in an information signal in an attempt to transform it into some data structure that can be manipulated by a computer to perform higher level tasks.

Two types of representations:

Deep semantic parsing: taking natural language input and transforming it into a meaning representation. Domain-dependent.

Problem: reusability of the representation across domains is very limited.

Shallow semantic parsing: deals with the four main aspects of language: structural ambiguity, word sense, entity and event recognition, and predicate argument structure recognition. General-purpose.

Problem: difficult to construct a general-purpose ontology.



A semantic theory should be able to:

1. Explain sentences having ambiguous meanings. For example, it should account for the fact that the word "bill" in the sentence "The bill is large" is ambiguous in the sense that it could represent money or the beak of a bird.

2. Resolve the ambiguities of words in context. For example, if the same sentence is extended to form "The bill is large but need not be paid", then the theory should be able to disambiguate the monetary meaning of "bill".

3. Identify meaningless but syntactically well-formed sentences, such as the famous example by Chomsky: "Colorless green ideas sleep furiously."

4. Identify syntactically or transformationally unrelated paraphrases of a concept having the same semantic content.



Semantic parsing can be considered as part of semantic interpretation.

Requirements for Semantic Interpretation:

Structural Ambiguity: transforming a sentence into its underlying syntactic representation.

Word Sense: the same word type can carry different meanings in different contexts.

Ex: She nailed the loose arm of the chair with a hammer. vs. She went to the beauty salon to get a manicure.

Entity and Event Resolution: named entity recognition and coreference resolution.

Predicate-Argument Structure: identifying the participants (entities) in these events.

Can be defined as the identification of who did what to whom, when, where, why, and how.

Meaning Representation: build a semantic representation that can then be manipulated by algorithms to various application ends (called deep representation). A domain-specific approach.
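
The "who did what to whom" view of predicate-argument structure can be pictured as a small data structure. The class name `PredicateFrame` and the role labels (agent, patient, instrument) are hypothetical choices for this sketch and are not the labels of any particular annotation scheme.

```python
from dataclasses import dataclass, field


@dataclass
class PredicateFrame:
    """One predicate with its arguments, keyed by an illustrative role name."""
    predicate: str
    arguments: dict = field(default_factory=dict)  # role name -> text span

    def who_did_what_to_whom(self):
        agent = self.arguments.get("agent", "someone")
        patient = self.arguments.get("patient", "something")
        return f"{agent} {self.predicate} {patient}"


# Hand-built frame for the hammer sentence used as an example on this slide.
frame = PredicateFrame(
    predicate="nailed",
    arguments={
        "agent": "She",
        "patient": "the loose arm of the chair",
        "instrument": "with a hammer",
    },
)
print(frame.who_did_what_to_whom())
```

A shallow semantic parser would fill such frames automatically; here the frame is written by hand purely to show the target representation.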


CHAPTER 5. LANGUAGE MODELING

A statistical model that assigns a probability to a sentence.

Specifies the a priori probability of a particular word sequence in the language of interest.

Given an alphabet or inventory of units Σ and a sequence W = w1 w2 ... wt ∈ Σ*, a language model can be used to compute the probability of W based on parameters previously estimated from a training set.

Language models are usually combined with other models in applications such as speech recognition and machine translation.

A standard tool in information retrieval, spell correction, summarization, authorship identification, and document classification.



n-Gram Models: assume that all previous words except for the (n − 1) words directly preceding the current word are irrelevant for predicting the current word, or, alternatively, that they are equivalent.
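
Written out for a sequence W = w1 ... wt, the n-gram assumption replaces the full history in the chain rule with the last (n − 1) words. The notation below follows the symbols already used on this slide.

```latex
P(W) = \prod_{i=1}^{t} P(w_i \mid w_1, \ldots, w_{i-1})
\;\approx\; \prod_{i=1}^{t} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
```

For a bigram model (n = 2), each factor reduces to P(w_i | w_{i-1}).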

Evaluation criteria: coverage rate, perplexity.
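
Perplexity, one of the evaluation criteria above, can be illustrated with a small bigram model. Add-one smoothing, the boundary markers `<s>` and `</s>`, and all function names are assumptions chosen to keep the sketch short; the sketch also assumes that test sentences contain only words seen in training.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Collect history and bigram counts, adding sentence boundary markers."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return histories, bigrams, vocab


def bigram_prob(prev, word, histories, bigrams, vocab):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (histories[prev] + len(vocab))


def perplexity(sentence, histories, bigrams, vocab):
    """Exponentiated average negative log probability per predicted token."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = sum(
        math.log(bigram_prob(prev, word, histories, bigrams, vocab))
        for prev, word in zip(tokens, tokens[1:])
    )
    return math.exp(-log_prob / (len(tokens) - 1))
```

Trained on "the cat sat" and "the dog sat", this model assigns a lower perplexity to "the cat sat" than to the scrambled "sat cat the", which is the behavior the criterion is meant to capture.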

Language Model Adaptation: designing and tuning a language model such that it performs well on a new test set for which little equivalent training data is available.

Methods: Mixture language models, topic-dependent language model, trigger models.



Types of Language Models (other than n-gram language models):

Class-Based Language Models

Variable-Length Language Models

Discriminative Language Models

Syntax-Based Language Models

MaxEnt Language Models

Factored Language Models

Bayesian Topic-Based Language Models

Neural Network Language Models



Language Modeling Problems:

Language-Specific Modeling Problems:

In Arabic, decomposition may be required. Integrating morphological information into the language model is helpful for modeling dialectal Arabic.

Spoken versus Written Languages:

Many of the world’s 6,900 languages are spoken languages, that is, languages without a writing system (dialects).

For such languages, the only way of obtaining language model training data is to manually transcribe the language or dialect. This is a costly and time-consuming process because it involves (i) the development of a writing standard, (ii) training native speakers to use the writing system consistently and accurately, and (iii) the actual transcription effort. For languages that do have a written form but few resources, the text that can be obtained (e.g., from the web) will need to be normalized, which can also be a laborious process.


CHAPTER 6. RECOGNIZING TEXTUAL ENTAILMENT

Textual entailment is defined as a directional relationship between pairs of text expressions, denoted by T, the entailing text, and H, the entailed hypothesis. We say that T entails H if the meaning of H can be inferred from the meaning of T, as would typically be interpreted by people.
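
The directional nature of the T-entails-H relation can be seen even in a crude word-overlap baseline. This is a deliberately naive sketch, not the approach described in the book: the function name and the 0.8 threshold are assumptions, and the heuristic is easily fooled by negation or word order.

```python
def overlap_entails(text, hypothesis, threshold=0.8):
    """Guess that T entails H when most hypothesis words also occur in T."""
    text_words = set(text.lower().split())
    hyp_words = hypothesis.lower().split()
    if not hyp_words:
        return True  # an empty hypothesis is treated as trivially entailed
    covered = sum(1 for word in hyp_words if word in text_words)
    return covered / len(hyp_words) >= threshold


t = "John bought a red car yesterday"
h = "John bought a car"
print(overlap_entails(t, h))  # T covers every word of H
print(overlap_entails(h, t))  # swapping the roles lowers the coverage
```

Swapping T and H changes the answer, which mirrors the point that entailment is not a symmetric similarity measure.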

Applications of Textual Entailment Solutions:

Summarization

Exhaustive Search for Relations

Question Answering

Machine Translation


CHAPTER 7. MULTILINGUAL SENTIMENT AND SUBJECTIVITY ANALYSIS

Subjectivity classification: labels text as either subjective or objective.

Sentiment classification: classifies subjective text as either positive, negative, or neutral. Used in automatic expressive text-to-speech synthesis, tracking sentiment timelines in online forums and news, and mining opinions from product reviews.

Tools: two main types:

I. Rule-based systems: relying on manually or semi-automatically constructed lexicons. Ex: OpinionFinder.

II. Machine learning classifiers: trained on opinion-annotated corpora. Ex: Wiebe, Bruce, and O’Hara.

Corpora: subjectivity and sentiment annotated corpora used to train automatic classifiers, and as resources to extract opinion mining lexicons.



Lexicons:

OpinionFinder: contains 6,856 unique entries, out of which 990 are multiword expressions. Each entry is also associated with a polarity label, indicating whether the corresponding word or phrase is positive, negative, or neutral.

General Inquirer: a dictionary of about 10,000 words grouped into about 180 categories, which have been widely used for content analysis. It includes semantic classes (e.g., animate, human), verb classes (e.g., negatives, becoming verbs), cognitive orientation classes (e.g., causal, knowing, perception), and others. Two of the largest categories in the General Inquirer are the valence classes, which form a lexicon of 1,915 positive words and 2,291 negative words.

SentiWordNet: built on top of WordNet; it assigns each WordNet synset a score triplet (positive, negative, and objective), indicating the strength of each of these three properties for the words in the synset.



Word- and Phrase-Level Annotations: three main directions:

i. manual annotations, which involve human judgment of selected words and phrases,

ii. automatic annotations based on knowledge sources such as dictionaries,

iii. automatic annotations based on information derived from corpora.

Sentence-Level Annotations: corpus annotations are often required either as an end goal for various text-processing applications (e.g., mining opinions from the Web, classification of reviews into positive and negative), or as an intermediate step toward building automatic subjectivity and sentiment classifiers. Two methods:

i. dictionary-based, consisting of rule-based classifiers relying on lexicons,

ii. corpus-based, consisting of machine learning classifiers trained on preexisting annotated data.

Document-Level Annotations: applications, such as review classification or web opinion mining, often require corpus-level annotations of subjectivity and polarity.
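
The dictionary-based method for sentence-level annotation can be sketched as a rule-based classifier over a polarity lexicon. The six-word lexicon below is a made-up stand-in for resources such as those listed earlier, and the decision rule (sum of word polarities) is an assumption of this sketch.

```python
# Hypothetical miniature polarity lexicon: +1 positive, -1 negative.
POLARITY = {
    "good": 1, "great": 1, "excellent": 1,
    "bad": -1, "poor": -1, "terrible": -1,
}


def classify_sentence(sentence):
    """Label a sentence as objective, positive, negative, or neutral."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    hits = [POLARITY[w] for w in words if w in POLARITY]
    if not hits:
        return "objective"  # no opinion words found in the lexicon
    score = sum(hits)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # subjective, but the polarities cancel out
```

The two-step structure follows the slides: subjectivity is decided first (any lexicon hit), then polarity is assigned to the subjective sentences.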


CHAPTER 8. ENTITY DETECTION AND TRACKING

Mention detection: Detecting the boundary of a mention and optionally identifying the semantic type (e.g., PERSON or ORGANIZATION) and other attributes (e.g., named, nominal, or pronominal).

Closely related to named entity recognition.

Mentions: any instances of textual references to objects or abstractions, which can be either named (e.g., John Mayor), nominal (e.g., the president), or pronominal (e.g., she, it).

Can be formulated as a classification problem by assigning a label to each token in the text.
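
One common way to realize the per-token labeling above is a BIO scheme (B = begins a mention, I = inside a mention, O = outside). The slides do not name a labeling scheme, so BIO is an assumption here, and the labels in the example are written by hand rather than predicted by a model.

```python
def bio_to_mentions(tokens, labels):
    """Convert per-token BIO labels into (mention text, semantic type) pairs."""
    mentions, current, current_type = [], [], None
    for token, label in zip(tokens, labels):
        starts_new = label.startswith("B-") or (
            label.startswith("I-") and label[2:] != current_type
        )
        if starts_new:
            if current:
                mentions.append((" ".join(current), current_type))
            current, current_type = [token], label[2:]
        elif label.startswith("I-"):
            current.append(token)  # continue the open mention
        else:  # "O": outside any mention
            if current:
                mentions.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        mentions.append((" ".join(current), current_type))
    return mentions


tokens = ["John", "Mayor", "met", "the", "president", "."]
labels = ["B-PERSON", "I-PERSON", "O", "B-PERSON", "I-PERSON", "O"]
print(bio_to_mentions(tokens, labels))
```

A trained classifier would supply `labels`; the decoding step shown here is what turns token-level decisions back into mention boundaries.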

Coreference resolution: Clustering mentions referring to the same entity into equivalence classes.

Machine learning-based approaches: learn a model from training data that assigns a score to a pair of mentions indicating the likelihood that the two mentions refer to the same entity. Mentions are then clustered into entities on the basis of mention-pair scores.
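
The clustering step above can be sketched by linking every mention pair whose score passes a threshold and taking the connected groups as entities. The pair scores below are invented numbers standing in for a learned model's output, and linking by transitive closure is only one of several possible clustering strategies.

```python
def cluster_mentions(mentions, pair_scores, threshold=0.5):
    """Group mentions into entities by linking pairs scored at or above threshold."""
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for (a, b), score in pair_scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())


mentions = ["John Mayor", "the president", "he", "IBM"]
scores = {  # hypothetical mention-pair scores
    ("John Mayor", "the president"): 0.9,
    ("the president", "he"): 0.7,
    ("he", "IBM"): 0.1,
}
print(cluster_mentions(mentions, scores))
```

Because links are transitive, "John Mayor" and "he" end up in the same entity even though that pair was never scored directly.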


CHAPTER 9. RELATIONS AND EVENTS

Relation Extraction Systems: systems capable of finding semantic relations among entities.

Relation extraction can be considered as a multiclass classification problem, with several classes of features including structural, lexical, entity-based, syntactic, and semantic.

Relation Extraction Types:

Extracting relations typically associated with lexical ontologies, such as meronymy, hyponymy, and troponymy.

Extracting relations between verbs that are similar in nature, such as detecting that verb1 expresses the same concept as verb2 but in a stronger fashion.

Finding enablement, that is, detecting that the action expressed by verb1 is a prerequisite for the action expressed by verb2.

Identifying general semantic links between potentially heterogeneous entities, such as employment relations between people and companies, cause of death relations between diseases and people, or ownership of one entity (such as a company) by another.



Relation types in the National Institute of Standards and Technology (NIST) ACE evaluations:

PHYS (physical): A spatial relation denoting that a person is located at or near a facility, or a location.

PART-WHOLE: A spatial relation denoting that a facility, a location, or a GPE (geopolitical entity) is a part of another facility.

PER-SOC (personal-social): Personal-social relations capture links between people. Relations can be business-related or family-based.

ORG-AFF (organization-affiliation): This type of relation pertains to connections between persons and organizations. A person could be employed by an organization or could be a member.

GEN-AFF (general-affiliation): citizenship, residence in a country, religious affiliation, and ethnicity.

ART (artifact): A relation between a user, inventor, or manufacturer and the artifact itself.

METONYMY: A relation between two different aspects of the same underlying entity.



Events: an event denotes any change of state in the world that is described using natural language text.

Event extraction: the use of any algorithm to extract a structured representation of that change of state, crucially including the entities involved.


CHAPTER 10. MACHINE TRANSLATION

Machine translation: converting text in one language into another while preserving its meaning.

Research started in the 1940s. The most profound change, the move to statistical methods, can be dated back to work at IBM in 1988.

Statistical Machine Translation:

Using large corpora of translated texts, typically many millions of words.

Learn the rules of translation from corpora and provide the basis for a decoding algorithm that finds the best translation for a given input sentence.

Machine translation is being integrated into various applications: crosslingual information retrieval, speech translation, and tools for translators.



Word Alignment: Learning translation rules from a parallel corpus. Unsupervised learning problem.

A word-aligned parallel corpus allows the estimation of phrase-based and tree-based models and other approaches.
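
As one concrete instance of learning word translations from a parallel corpus without supervision, the sketch below runs expectation-maximization in the style of IBM Model 1. The slides do not name a specific alignment model, so this choice is an assumption; the NULL word is omitted for brevity, and the three-sentence German-English corpus is a toy example.

```python
from collections import defaultdict


def ibm_model1(parallel_corpus, iterations=10):
    """Estimate t(target_word | source_word) with EM; keys are (target, source)."""
    target_vocab = {t for _, tgt in parallel_corpus for t in tgt}
    t_prob = defaultdict(lambda: 1.0 / len(target_vocab))  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in parallel_corpus:
            for t in tgt:
                # E-step: spread one count for t over the source words.
                norm = sum(t_prob[(t, s)] for s in src)
                for s in src:
                    frac = t_prob[(t, s)] / norm
                    count[(t, s)] += frac
                    total[s] += frac
        # M-step: renormalize the expected counts for each source word.
        for (t, s), c in count.items():
            t_prob[(t, s)] = c / total[s]
    return dict(t_prob)


corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
table = ibm_model1(corpus)
print(round(table[("house", "haus")], 3), round(table[("the", "haus")], 3))
```

Although no sentence pair says which word translates which, co-occurrence across sentences pushes probability toward "house" for "haus" and away from "the".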

Evaluation:

Human Assessment: ask human judges whether the output constitutes a correct translation. Is it fluent? Is the translation adequate?

Automatic Evaluation Metrics: run similarity measures between the MT output and the reference translations, counting matches, insertions, and deletions. Evaluation campaigns for the metrics themselves let metric developers compete for the highest correlation with human judges.
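
Counting matches, insertions, and deletions between MT output and a reference amounts to a word-level edit distance. The sketch below also counts substitutions and normalizes by reference length, which gives a word error rate; that particular metric is chosen here for illustration and is not named in the slides. A non-empty reference is assumed.

```python
def word_edit_distance(hypothesis, reference):
    """Minimum number of word insertions, deletions, and substitutions."""
    hyp, ref = hypothesis.split(), reference.split()
    # dist[i][j]: edits needed to turn the first i hypothesis words
    # into the first j reference words.
    dist = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dist[i][0] = i
    for j in range(len(ref) + 1):
        dist[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # drop a hypothesis word
                dist[i][j - 1] + 1,         # add a missing reference word
                dist[i - 1][j - 1] + cost,  # match or substitute
            )
    return dist[-1][-1]


def word_error_rate(hypothesis, reference):
    """Edit distance normalized by the reference length (assumed non-empty)."""
    return word_edit_distance(hypothesis, reference) / len(reference.split())
```

Metrics of this kind are cheap to compute, which is why their correlation with human judgments is itself the subject of evaluation campaigns.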



Current Research:

The development of models that more closely mirror linguistic understanding of language,

The application of novel machine learning methods to the estimation problem of learning translation rules from the data, and

The attempts to exploit various types of data sources, which are often not in the desired domain or may not be even proper sentence-by-sentence translations at all.



Linguistic Challenges:

Lexical Choice: a word sense disambiguation problem. n-gram language models try to capture local context information, which is very useful for making the right lexical choice.

Morphology: when translating into morphologically rich languages, it is often not clear from the local context which morphological variant to choose.

Word Order: To define which of the entities mentioned in the sentence is the subject and which are the objects and what their roles are, languages such as English use word order.

Future Directions:

The estimation of parameter values in MT models.

Syntactic models

Using comparable or purely monolingual data instead of parallel data.

Integrating statistical machine translation into other information processing applications.


CHAPTER 11. MULTILINGUAL INFORMATION RETRIEVAL

Importance:

Improvements in machine translation (MT) have fostered the development of effective multilingual retrieval systems.

The growing number of non-English Internet users and non-English content on the Web.

Advent of Web 2.0 technologies.

Crosslingual information retrieval (CLIR):

Retrieving documents relevant to a given query in some language (query language) from a collection of documents in some other language (collection language).

Approaches: Translation-Based Approaches, Inter-lingual Document Representations.

Multilingual information retrieval (MLIR):

Involves corpora containing documents written in different languages.

MLIR requires different index organization and relevance computation strategies than CLIR.


Evaluation:

Metrics: Relevance Assessments, precision and recall.

Evaluation Campaigns: Text REtrieval Conference (TREC), Cross-Language Evaluation Forum (CLEF), NII Test Collection for IR Systems (NTCIR), Forum for Information Retrieval Evaluation (FIRE).

Parallel Corpora: JRC-Acquis, Multext Dataset, Canadian Hansards, Europarl.

Tools, Software, and Resources:

Preprocessing: Content Analysis Toolkit (Tika), Snowball Stemmer, HTML Parser, BananaSplit.

IR Frameworks: Lucene, Terrier and Lemur.

Evaluation: trec_eval.
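
The precision and recall metrics listed under Evaluation can be computed from a system's retrieved set and the relevance assessments. The document identifiers and the function name below are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: four documents returned, three judged relevant overall.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(p, r)
```

In a crosslingual setting the computation is unchanged; only the relevance assessments must be made on documents in the collection language.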


CHAPTER 12. MULTILINGUAL AUTOMATIC SUMMARIZATION

In multilingual summarization, texts written in multiple languages are used by summarization systems.

Types of summary:

An informative summary is a compressed version of the original covering the most important facts reported in the input text(s) (e.g., summary of a journal article).

An indicative summary covers topics in the input text without providing further details (e.g., keywords for scientific papers).

An evaluative summary gives an opinion on the input text most often by comparing it to similar documents.

An elaborative summary can provide more details of parts of a large document or the document linked to by the current document to help navigation through large documents or linked collections such as Wikipedia.


Crosslingual summarization: the input documents are spread out over multiple source languages, and the resulting summary is presented in one (or more) target languages.

Requires the integration of multiple source documents coming from different languages.

Named entities are often transcribed differently in different languages (coreference resolution).

Languages encode number and gender agreement differently; for example, English lacks grammatical gender (anaphora resolution).

Evaluation:

Extrinsic evaluations measure the usefulness of summaries by measuring how much they can help in performing another information-processing task.

Intrinsic evaluations measure and reflect summary quality and can be used in various stages in a summarization development cycle.


Summarization systems are divided into three stages:

1. For the analysis stage, summarization systems may represent the text in the form of a graph. This may be a linguistically motivated discourse tree or a matrix representation based on sentence-to-sentence similarity.

2. The transformation process can be carried out via graph-based algorithms such as PageRank or by machine learning–based classifiers that learn to classify sentences according to their relevancy.

3. For the realization of the summary, multilingual approaches have to face many language-dependent challenges such as tokenization, anaphoric expressions, and discourse structure.
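
The first two stages above can be sketched together: sentences become nodes of a similarity graph, and a simplified PageRank-style iteration scores them. Jaccard word overlap as the similarity measure, the damping value, and the function names are assumptions of this sketch; it also skips the usual treatment of nodes with no outgoing weight and assumes at least one sentence.

```python
def sentence_similarity(a, b):
    """Jaccard overlap between the word sets of two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences by a PageRank-style iteration over the similarity matrix."""
    n = len(sentences)
    sim = [
        [sentence_similarity(a, b) if i != j else 0.0 for j, b in enumerate(sentences)]
        for i, a in enumerate(sentences)
    ]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                scores[j] * sim[j][i] / out_weight[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(sentences, k=1):
    """Return the k best-scored sentences in their original order."""
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Sentences that share vocabulary with many others accumulate score, while an off-topic sentence receives only the baseline term and is left out of the extract.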


CHAPTER 13. QUESTION ANSWERING

QA: Retrieve answers to user questions from information sources.

Follows a pipeline layout consisting of components for:

1. Transforming questions into search engine queries,

2. Retrieving related text using existing IR systems,

3. Extracting and scoring candidate answers.

Questions are classified with regard to their expected answer:

Factoid questions, which ask for concise answers such as named entities (e.g., What is the capital of Turkey?).

List questions, which seek lists of such factoid answers (e.g., Which countries are in NATO?).

Attempts have been made to tackle questions with complex answers, such as:

Definitional questions requesting information on a given topic, including biographies for people (e.g., Who is Albert Einstein?).

Relationship questions (e.g., What is the relationship between the Taliban and Al-Qaeda?).

Opinion questions (e.g., What do people like about IKEA?).





Future Directions:

Reliable confidence estimates for the top answers.

Crosslingual QA systems that translate answers back to the language in which the question was asked.

General-purpose QA algorithms and techniques that can be adapted rapidly to new tasks and achieve high performance across different domains.

QA systems that provide complex answers.

How and why questions seeking explanations or justifications

Yes–no questions requiring a system to determine whether the combined knowledge in the available information sources entails a hypothesis.

Deeper NLP techniques to find answers in sources that lack semantic redundancy.

QA systems that support user interactions and information sources in different languages.


CHAPTER 14. DISTILLATION

Distillation queries can be complex and require complex answers. For example: Describe the reactions of <COUNTRY> to <EVENT>.

The Rosetta Consortium Distillation System: built as part of the GALE program. The system is designed to answer distillation queries run against a large corpus composed of text documents and audio recordings in multiple languages: English, Arabic, and Mandarin. Text sources are assumed to belong to two main categories: structured and unstructured.

Three Stages:

Document preparation: recordings are transcribed, and text and transcripts in foreign languages are translated into English. Tokenization, part-of-speech (POS) tagging, parsing, mention detection, and semantic role labeling are then performed, relying on maximum entropy (MaxEnt) models.

Indexing: documents are indexed using an open source search engine, Lucene.

Query answering: takes as input a GALE-style query, and returns a list of main snippets with associated supporting snippets and citations, sorted in decreasing order of relevance to the query. The architecture of the system consists of five stages: query preprocessing, document retrieval, snippet filtering, snippet processing, and planning.


Challenges:

The lack of publicly available corpora for measuring the progress of the field.

The difficulty and cost of evaluating the outputs of distillation systems due to the lack of automatic metrics.


CHAPTER 15. SPOKEN DIALOG SYSTEMS

A spoken dialog system is a complex machine that manages goal-oriented user interactions.

Functional architecture:

Speech recognition and understanding module: assigns one or more semantic tags to each speech input.

Speech generation module: a rule-based grammar is used, which encodes both the syntax and semantics of possible utterances.

Dialog manager: uses a finite-state machine approach by explicitly encoding the whole interaction into what is generally known as call-flow.
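
A finite-state dialog manager of the kind described above can be sketched as a transition table from (state, semantic tag) to the next state. The banking call-flow, the state names, and the tags are all hypothetical; in a deployed system the tags would come from the speech understanding module.

```python
# Hypothetical call-flow: state -> {semantic tag: next state}.
CALL_FLOW = {
    "greeting": {"check_balance": "ask_account", "goodbye": "end"},
    "ask_account": {"account_number": "report_balance"},
    "report_balance": {"check_balance": "ask_account", "goodbye": "end"},
}


def step(state, semantic_tag):
    """Advance the dialog; an unexpected tag keeps the current state (re-prompt)."""
    return CALL_FLOW.get(state, {}).get(semantic_tag, state)


def run_dialog(tags, start="greeting"):
    """Return the sequence of states visited for a sequence of semantic tags."""
    state = start
    visited = [state]
    for tag in tags:
        state = step(state, tag)
        visited.append(state)
    return visited


print(run_dialog(["check_balance", "account_number", "goodbye"]))
```

Encoding the whole interaction explicitly makes the behavior predictable, at the cost of having to anticipate every path through the call-flow in advance.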


CHAPTER 16. COMBINING NATURAL LANGUAGE PROCESSING ENGINES

Many engines are now attaining accuracy sufficient to enable combining them to serve more complex tasks than were possible before.

Example applications: semantic search, enterprise reporting and other business intelligence, question answering, medical-abstract mining, and crosslingual search, audio/video search and cataloging, speech-to-speech translation, and foreign broadcast news analysis.

Applications like these share many common engines, such as speaker identification, speech-to-text, text tokenization, grammatical parsing, named entity detection, coreference analysis, part-of-speech labeling, and translation.

Aggregation poses several challenges: heterogeneous computing environments, remote operation, data formats, and exception handling.


Desired Attributes of Architectures for Aggregating Speech and NLP Engines:

Flexible, Distributed Componentization.

Computational Efficiency.

Data-Manipulation Capabilities.

Robust Processing.

Frameworks that support integration into more complex applications:

UIMA: Unstructured Information Management Architecture

GATE: General Architecture for Text Engineering

InfoSphere Streams



THANK YOU FOR YOUR TIME AND ATTENTION