ainl-conferences
DESCRIPTION
The talk describes the part of the ABBYY Compreno technology that is used to develop domain-oriented systems for extracting information from text. It discusses how the core information extraction mechanism works, as well as the OntoDPS (Ontological Data Preparation System) tooling environment, which allows that mechanism to be tuned to specific domains. The core mechanism makes it possible to use the results of full semantic-syntactic analysis of a text for information extraction and to apply a production system of extraction rules to them. The rule system is assembled within OntoDPS. It is inseparably tied to the ontology of the domain for which the extraction system is being built. The talk pays special attention to modularity and encapsulation. It shows how the declarative nature of the extraction rules makes it possible to reuse them flexibly across information extraction systems. Automated testing of the resulting systems is also discussed. The emphasis is mostly on architectural and technological decisions. Specific ontological and linguistic questions are barely touched on. To cover details of that kind, a demonstration of the internals and operation of a specific system for extracting information from news texts is planned for the demo session of the AINL 2014 conference.
ABBYY InfoExtractor: technology of producing domain oriented information extraction systems
Starostin A.S.
NLP-technologies
Rule-based: technologies that rely on hand-written language rules applicable to a particular task.
Statistics-based: technologies that rely on machine learning over large text corpora, either labeled or parallel.
Hybrid: technologies that combine several approaches, for example rule-based plus statistics-based.
Model-based: technologies built on universal (complete) language modeling.
ABBYY Compreno
Universal Semantic Hierarchy
• It’s a tree
• Intermediate nodes represent semantic classes (concepts)
• Leaves represent lexical classes
• Concrete lexemes are linked to lexical classes
• All nodes are labeled with grammatical and semantic information (a set of grammemes and a set of semantemes)
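The structure described on this slide can be sketched as a small tree type. This is only an illustration under stated assumptions: the class `Node`, the helper `lexical_classes_for`, and every class name, grammeme, and semanteme below are hypothetical and are not taken from the actual Compreno hierarchy.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity comparison avoids recursing through parent links
class Node:
    """A node of the hierarchy: a semantic class or, at a leaf, a lexical class."""
    name: str
    grammemes: set[str] = field(default_factory=set)
    semantemes: set[str] = field(default_factory=set)
    lexemes: list[str] = field(default_factory=list)  # filled only for lexical classes
    parent: Optional["Node"] = None
    children: list["Node"] = field(default_factory=list)

    def add_child(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child

    def is_lexical_class(self) -> bool:
        # In this sketch a leaf is treated as a lexical class.
        return not self.children

    def ancestors(self) -> list[str]:
        """Names of the enclosing semantic classes, nearest first."""
        names, node = [], self.parent
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names


def lexical_classes_for(root: Node, lexeme: str) -> list[str]:
    """Return the names of all nodes to which the given lexeme is linked."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if lexeme in node.lexemes:
            found.append(node.name)
        stack.extend(node.children)
    return found


# Hypothetical fragment of a hierarchy (names invented for illustration).
root = Node("ENTITY")
org = root.add_child(Node("ORGANIZATION", semantemes={"Collective"}))
company = org.add_child(
    Node("COMPANY", grammemes={"Noun"}, lexemes=["company", "firm"])
)
```

Because every lexeme hangs off a leaf, a lookup such as `lexical_classes_for(root, "firm")` leads to a lexical class, and `ancestors()` then gives the chain of semantic classes above it.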
Syntactic-Semantic tree
Google sold Motorola to Lenovo for $2.91 billion.
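The tree diagram for this sentence did not survive in the text, so the following is a rough sketch of what such a tree might contain rather than the parser's real output. The type `ParseNode`, the semantic class names, and the role labels (`Seller`, `Goods`, `Buyer`, `Price`) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ParseNode:
    """A node of a simplified syntactic-semantic tree."""
    text: str
    semantic_class: str       # hypothetical semantic class name
    role: str = "Predicate"   # hypothetical semantic role relative to the parent
    children: list["ParseNode"] = field(default_factory=list)

    def child_by_role(self, role: str) -> Optional["ParseNode"]:
        for child in self.children:
            if child.role == role:
                return child
        return None


def render(node: ParseNode, depth: int = 0) -> list[str]:
    """Return an indented, line-per-node view of the tree."""
    lines = ["  " * depth + f"{node.role}: {node.text} [{node.semantic_class}]"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# "Google sold Motorola to Lenovo for $2.91 billion."
tree = ParseNode("sold", "TO_SELL", children=[
    ParseNode("Google", "ORGANIZATION", "Seller"),
    ParseNode("Motorola", "ORGANIZATION", "Goods"),
    ParseNode("Lenovo", "ORGANIZATION", "Buyer"),
    ParseNode("$2.91 billion", "MONEY", "Price"),
])

print("\n".join(render(tree)))
```

The point of the sketch is that each participant is attached to the predicate by a semantic role, not only by a surface syntactic relation, which is what later extraction rules can key on.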
OWL ontology
RDF graph
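As a minimal sketch of how extracted information can be held as an RDF-style graph, the example sentence used earlier in the deck could be stored as a set of subject-predicate-object triples. The `ex:` prefix, the class `ex:Sale`, and the property names are hypothetical and do not come from the ontology shown in the talk; plain tuples stand in for a real RDF library.

```python
Triple = tuple[str, str, str]

# Hypothetical triples for "Google sold Motorola to Lenovo for $2.91 billion."
graph: set[Triple] = {
    ("ex:sale1", "rdf:type", "ex:Sale"),
    ("ex:sale1", "ex:seller", "ex:Google"),
    ("ex:sale1", "ex:goods", "ex:Motorola"),
    ("ex:sale1", "ex:buyer", "ex:Lenovo"),
    ("ex:sale1", "ex:priceUSD", "2910000000"),
    ("ex:Google", "rdf:type", "ex:Organization"),
    ("ex:Motorola", "rdf:type", "ex:Organization"),
    ("ex:Lenovo", "rdf:type", "ex:Organization"),
}


def objects(g: set[Triple], subject: str, predicate: str) -> list[str]:
    """All objects o such that (subject, predicate, o) is in the graph."""
    return sorted(o for s, p, o in g if s == subject and p == predicate)


def instances_of(g: set[Triple], cls: str) -> list[str]:
    """All subjects typed with the given class."""
    return sorted(s for s, p, o in g if p == "rdf:type" and o == cls)
```

For example, `objects(graph, "ex:sale1", "ex:buyer")` looks up the buyer of the sale, and `instances_of(graph, "ex:Organization")` lists every organization mentioned.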
IE development factory
Extraction algorithm
Parse subtree interpretation rules
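The rule syntax shown on this slide is not preserved in the text. The general idea of a declarative rule that matches a parse subtree and emits ontology statements can nevertheless be sketched as follows. Everything here is an assumption for illustration: the `Rule` type, the dictionary-based tree, the role names, and the statement format are hypothetical and are not the OntoDPS rule language.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """Declarative rule: when a node has `trigger_class`, create an object of
    `concept` and fill its properties from the node's children by role."""
    trigger_class: str
    concept: str
    role_to_property: dict[str, str]


def apply_rules(root: dict, rules: list[Rule]) -> list[tuple[str, str, str]]:
    """Walk the tree and collect (object, property, value) statements."""
    statements, counter, stack = [], 0, [root]
    while stack:
        node = stack.pop()
        for rule in rules:
            if node["cls"] != rule.trigger_class:
                continue
            counter += 1
            obj = f"{rule.concept}_{counter}"
            statements.append((obj, "type", rule.concept))
            for child in node.get("children", []):
                prop = rule.role_to_property.get(child["role"])
                if prop is not None:
                    statements.append((obj, prop, child["text"]))
        stack.extend(node.get("children", []))
    return statements


# Hypothetical parse of "Google sold Motorola to Lenovo for $2.91 billion."
parse = {"text": "sold", "cls": "TO_SELL", "children": [
    {"text": "Google", "cls": "ORGANIZATION", "role": "Seller"},
    {"text": "Motorola", "cls": "ORGANIZATION", "role": "Goods"},
    {"text": "Lenovo", "cls": "ORGANIZATION", "role": "Buyer"},
    {"text": "$2.91 billion", "cls": "MONEY", "role": "Price"},
]}

sale_rule = Rule("TO_SELL", "Sale", {
    "Seller": "seller", "Goods": "goods", "Buyer": "buyer", "Price": "price",
})

statements = apply_rules(parse, [sale_rule])
```

Keeping the rule as data, separate from the engine that applies it, is what makes the reuse described in the abstract plausible: the same rule object can be shipped in a library and applied in several projects.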
Identification rules
Type of statements
IE system production
• Design
  Input: customer needs (informal), text examples (marked up or not)
  Output: OWL ontology in which every object is well documented
• Development
  Input: well-documented OWL ontology, marked-up text examples
  Output: production system of rules
• Testing
  Nightly testing (marked-up corpora)
  Reclamations (pointed-out error examples)
All three activities take place within one framework, called OntoDPS.
IE system design
IE system design (marked up text example)
IE system development: libraries
IE system development: projects
IE system development: reuse and customization
• Adding new items to dictionaries
• Adding new instances to ontologies
• Reuse of libraries and rules
• Complex rule customization
IE system testing: nightly testing
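The deck does not say which metrics the nightly runs report. A common way to test an extraction system against a marked-up corpus is to compare the extracted statements with the annotated ones and compute precision and recall, which the sketch below does. The function `score` and the statement format are assumptions, not the OntoDPS test harness.

```python
Statement = tuple[str, str, str]


def score(expected: set[Statement], extracted: set[Statement]) -> dict:
    """Compare extracted statements with the gold markup of one document."""
    true_positives = len(expected & extracted)
    # By convention in this sketch, an empty side yields a perfect score.
    precision = true_positives / len(extracted) if extracted else 1.0
    recall = true_positives / len(expected) if expected else 1.0
    return {
        "precision": precision,
        "recall": recall,
        "missing": sorted(expected - extracted),    # in the markup, not extracted
        "spurious": sorted(extracted - expected),   # extracted, not in the markup
    }


# Hypothetical gold markup and system output for one document.
gold = {
    ("sale1", "seller", "Google"),
    ("sale1", "buyer", "Lenovo"),
    ("sale1", "goods", "Motorola"),
}
output = {
    ("sale1", "seller", "Google"),
    ("sale1", "buyer", "Lenovo"),
    ("sale1", "goods", "Google"),
}

report = score(gold, output)
```

The `missing` and `spurious` lists are what a developer would inspect after a failed run, and a single pointed-out error example can be added to the corpus in the same statement format.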
Thank you! Questions?