© Copyright 2013 ABBYY Confidential
NLP PLATFORM FOR EU-LINGUAL DIGITAL SINGLE MARKET
Alexander Rylov
LTi Summit 2013
Market fragmentation
By domains By languages
WHY SHOULD LT VENDORS SHARE THEIR RESOURCES?
● Many LT vendors have their own LT
● LTs are focused on a particular domain or language(s)
● Resources are critical for enabling such technologies
● If they share, vendors may lose their competitive advantage
Technology capabilities and restrictions
● Language specific = language centric = limited by language
● Difficulties:
  ● Controlled links
  ● Anaphora
  ● Long-distance links
  ● Ellipsis
● Ontologies, dictionaries, and statistics are trained on a limited set of data, so they cover only a limited variety of meaning representations; reaching 40% recall is sometimes considered good (NER, US DoD track)
WHAT IS BIG DATA…
● Multilingual
● Covers more than one domain
● 85–90% is in unstructured text documents
● The linguistic expression of the same meaning varies in countless ways
A FUNDAMENTAL NATURAL LANGUAGE TECHNOLOGY IS REQUIRED, SCALABLE BY DOMAINS AND LANGUAGES
ABBYY Compreno as a proposal
● Interlingua approach:
  ● The semantic model is based on a universal, language-independent representation of both lexis and grammar
● Working languages:
  ● Russian, English: at the stage of terminological and collocation expansion
  ● German: full prototype (lexis, syntax) is completed; at the stage of main lexis expansion (from core to periphery)
  ● French: full prototype is completed (tested on a controlled MT task)
  ● Chinese: lexical system prototype is completed (a challenging task never carried out before)
● It has been shown that Compreno is a scalable technology that can be used for any language
[Diagram: Universal Semantic Hierarchy; statistics and machine learning; syntactic and semantic analysis]
Complete syntactic and semantic analysis
The bank was located at the bank of the river; it was closed.
The complete analysis helps overcome linguistic problems in the text, if any.
Compreno current achievements
Russian syntax analysis, 2011

System      Precision  Recall  F
Compreno    0.95       0.98    0.97
System 2    0.93       0.98    0.96
System 3    0.90       0.98    0.94
System 4    0.89       0.95    0.92
System 5    0.86       0.98    0.92
System 6    0.86       0.86    0.86
System 7    0.79       0.98    0.87

Fact extraction, 2013 (three pairwise comparisons)

Comparison  System     Precision  Recall  F-measure  ABBYY advantage
1           Compreno   0.95       0.93    0.94       14%
            System 1   0.95       0.70    0.81
2           Compreno   0.96       0.84    0.90       32%
            System 2   0.98       0.44    0.61
3           Compreno   0.92       0.92    0.92       10%
            System 3   0.92       0.74    0.82
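
The slide does not say how the F-measure is computed. A minimal sketch, assuming it is the balanced F1 score (the harmonic mean of precision and recall), reproduces the six fact-extraction F values from the reported precision and recall. The function name `f1` and the dictionary layout are illustrative only. A few F values in the Russian syntax results differ from this formula by 0.01, presumably because the published precision and recall are themselves rounded.

```python
def f1(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the fact extraction results above.
fact_extraction = {
    "Compreno vs System 1": ((0.95, 0.93), (0.95, 0.70)),
    "Compreno vs System 2": ((0.96, 0.84), (0.98, 0.44)),
    "Compreno vs System 3": ((0.92, 0.92), (0.92, 0.74)),
}

for label, (compreno, other) in fact_extraction.items():
    print(label, round(f1(*compreno), 2), round(f1(*other), 2))
```

Because the harmonic mean is pulled toward the lower of its two inputs, a system with high precision but low recall (0.98 and 0.44 in the second comparison) still ends up with a low F-measure.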
Applications
● Big data analytics: analysis of facts, extraction of objects
● Intelligence, eDiscovery (any kind)
● Search by meaning rather than by concepts
● Natural-language dialogue systems
● Translation
A few facts about Compreno
● 18 years of development
● About 350 people involved
● More than 2000 man-years
Barriers to wide implementation
● At least 3 years per language
● At least 30 linguists per language
● At least 12M € per language
● Then support and improvement
EU project idea
● Describe ALL EU languages
● Describe major domains: healthcare, law, government, major industries
● ABBYY commitment:
  ● Methodology, management, instruments
EU BENEFITS – CREATE A SINGLE DIGITAL LT MARKET
● Operate not with a language but with a universal model of it (interlingual approach)
● Describe one domain in one language, then apply it in all other languages
● A platform for LT vendors to create solutions and products that are easily scalable by languages and domains
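
The idea of describing a domain once and applying it in every language can be illustrated with a toy sketch. Everything in it is hypothetical: the lexicons, the semantic class names, and the rule are invented for illustration and are not Compreno data or ABBYY's implementation, which relies on full syntactic and semantic analysis rather than word lookup. The sketch only shows why a rule written against language-independent classes needs no per-language rewrite.

```python
from typing import Dict, List, Set

# Hypothetical language-specific lexicons: surface word -> universal semantic class.
LEXICONS: Dict[str, Dict[str, str]] = {
    "en": {"physician": "DOCTOR", "prescribes": "PRESCRIBE", "aspirin": "DRUG"},
    "de": {"arzt": "DOCTOR", "verschreibt": "PRESCRIBE", "aspirin": "DRUG"},
}

# A hypothetical healthcare rule, written once against semantic classes, not words.
PRESCRIPTION_RULE: Set[str] = {"DOCTOR", "PRESCRIBE", "DRUG"}


def to_classes(sentence: str, lang: str) -> List[str]:
    """Map each known word of the sentence to its language-independent class."""
    lexicon = LEXICONS[lang]
    return [lexicon[word] for word in sentence.lower().split() if word in lexicon]


def matches_rule(sentence: str, lang: str, rule: Set[str]) -> bool:
    """True if every semantic class required by the rule occurs in the sentence."""
    return rule.issubset(to_classes(sentence, lang))


print(matches_rule("The physician prescribes aspirin", "en", PRESCRIPTION_RULE))
print(matches_rule("Der Arzt verschreibt Aspirin", "de", PRESCRIPTION_RULE))
```

Adding a language in this sketch means adding one lexicon; the domain rule itself stays unchanged.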