M-CAST: Multilingual Content Aggregation System based on TRUST Search Engine
Borys Czerniejewski
Sebastian Lisek
Infovide S.A. (PL)
The Project
eContent project #22249
project start: 1 January 2005
project end: 31 December 2006
Expected Results
M-CAST results:
multilingual, full-text search engine (server version)
Internet portals deployed in two libraries
content aggregation facility
business plan + IPRs/royalties fixed
dissemination
Previous Projects
TRUST – Multilingual Semantic and Cognitive Search Engine for Text Retrieval Using Semantic Technologies (IST-1999-56416)
ICONS – Intelligent Content Management System (IST-2001-32429)
Users & Business Case
digital (Internet) libraries
press agencies
(press) publishers
operators of scientific databases
big companies (multinationals)
Large Full-Text Data Collections
Languages
Czech
English
French*
Italian*
Polish*
Portuguese*
Consortium
• Coordinator, integrator
– Infovide-Matrix S.A., Poland
• Language Technology
– TiP sp. z o.o., Katowice, Poland
– Synapse Développement SARL, Toulouse, France
– Priberam Informática Lda., Lisbon, Portugal
– Expert System S.p.A., Modena, Italy
– Vysoká Škola Ekonomická v Praze, Prague, Czech Republic
• Users
– The Nicholas Copernicus Provincial and Municipal Library, Toruń, Poland (operator of the Polish Internet Library)
– Národní Knihovna České Republiky, Prague, Czech Republic
Architecture objectives: end user perspective
• Performance – response time < 1 s
• Usability – users should easily learn to operate the system, prepare inputs for it, and interpret its outputs
• Availability – high availability, 24x7
Architecture objectives: customer perspective
• Security
– the M-CAST system should be able to manage, protect, and distribute sensitive information
– copyrights
• Interoperability – the M-CAST system should be able to use information exchanged with various systems (resources)
• Scalability – the M-CAST architecture should be easy to modify to fit performance and volume requirements
Architecture objectives: producer perspective
• Time span – the architecture and technology should be in use in 2007
• Portability – the M-CAST system should be easy to transfer from one hardware or software environment to another
• Flexibility – the M-CAST architecture should be easy to modify for use in applications or environments other than those for which it was specifically designed
Architectural decision: SOA
Service-oriented architecture is an approach to loosely coupled, protocol-independent, standards-based distributed computing in which software resources available on the network are treated as services.
Internal view - architecture
[Diagram labels: End users (user, administrator); External systems; library portal; M-CAST; Presentation layer; Linguistic Processor; Integration layer; Resources; library catalog system; digitalized resources]
Internal view: resources
[Diagram labels: M-CAST; Resource]
• Metadata
– Protocol: OAI-PMH
– Format: Qualified Dublin Core
• Data
– Protocols: ftp, http
– Formats: txt, html, pdf, rtf
Architectural decision: OAI-PMH & Dublin Core
http://www.library.edu/oaipmh/OAIDataProvider?verb=GetRecord&identifier=30843
…<dc:title>Vměnj Křesťanské aneb Přjprawa k dobré Smrti...</dc:title><dc:subject xsi:type="dcterms:UDC">09</dc:subject>…
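A minimal sketch of how a harvester might build such a GetRecord request and read the Dublin Core fields from the response. The base URL and identifier are taken from the slide. The `metadataPrefix=oai_dc` argument is an addition (the OAI-PMH specification requires a `metadataPrefix` for GetRecord), and the `<record>` wrapper with its namespace declarations is an assumption made only so that the fragment parses. The function names are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "http://www.library.edu/oaipmh/OAIDataProvider"  # from the slide

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}


def get_record_url(identifier, metadata_prefix="oai_dc"):
    """Build a GetRecord URL; the metadata prefix value is an assumption."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{BASE_URL}?{query}"


# Fragment from the slide, wrapped in an assumed root element so it is well-formed.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:dcterms="http://purl.org/dc/terms/"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:title>Vměnj Křesťanské aneb Přjprawa k dobré Smrti...</dc:title>
  <dc:subject xsi:type="dcterms:UDC">09</dc:subject>
</record>"""


def parse_record(xml_text):
    """Return the title and the UDC-typed subjects of one record."""
    root = ET.fromstring(xml_text)
    type_attr = f"{{{NS['xsi']}}}type"
    udc = [
        s.text
        for s in root.findall("dc:subject", NS)
        if s.get(type_attr) == "dcterms:UDC"
    ]
    return {"title": root.findtext("dc:title", namespaces=NS), "udc": udc}
```

Calling `parse_record(SAMPLE)` yields the title and the UDC code of the record, which is the piece of metadata the UDC filtering step relies on.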
UDC Filtering
• a polysemic word: ball
– sense 1: ROUND OBJECT. Any object in the shape of a sphere, especially one used as a toy by children or in various sports such as tennis and football
– sense 2: DANCE. A large formal occasion where people dance
• two texts:
– D1: talks about football. UDC: 796
– D2: Cinderella. UDC: 793
• a question: "Where did the ball take place?"
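The filtering idea can be sketched as follows, assuming the question has already been disambiguated to one sense of "ball" and that each sense maps to a set of UDC class prefixes. The sense-to-UDC table and the function name are hypothetical. The class numbers follow the UDC main classes 796 (sport) and 793 (social entertainments, dance).

```python
# Hypothetical mapping from word senses to UDC class prefixes.
SENSE_TO_UDC = {
    "ROUND_OBJECT": ("796",),  # UDC 796: sport, games
    "DANCE": ("793",),         # UDC 793: social entertainments, dance
}

# The two example texts, each tagged with a UDC class.
DOCS = [
    {"id": "D1", "topic": "football", "udc": "796"},
    {"id": "D2", "topic": "Cinderella", "udc": "793"},
]


def filter_by_sense(docs, sense):
    """Keep only documents whose UDC code matches the given word sense."""
    prefixes = SENSE_TO_UDC[sense]
    return [d for d in docs if d["udc"].startswith(prefixes)]


# "Where did the ball take place?" points to the DANCE sense,
# so only documents classified under dance remain candidates.
candidates = filter_by_sense(DOCS, "DANCE")
```

Matching on prefixes rather than exact codes lets a finer class such as 796.332 still fall under its main class.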
Architectural decision: OAI-PMH & Dublin Core
…
<dc:creator>Bellarmino, Robert Francesco Romolo</dc:creator>
<dc:description>Při strahovském exempláři B Z VIII 28 rukopiná poznámka: Jacobus Colens S.J. …</dc:description>
<dc:format>text/html</dc:format>
<rdf:Seq xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <dc:identifier xsi:type="dcterms:URI">http://www.manuscriptorium.com?id=1184206</dc:identifier>
  <dc:identifier xsi:type="dcterms:URI">http://www.manuscriptorium.com?id=1184207</dc:identifier>
</rdf:Seq>
<rdf:Seq xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <dc:source xsi:type="dcterms:URI">http://www.manuscriptorium.com?id_source=1184206</dc:source>
  <dc:source xsi:type="dcterms:URI">http://www.manuscriptorium.com?id_source=1184206</dc:source>
</rdf:Seq>
Architecture: Linguistic Processor
[Diagram labels: Language modules (PL, FR, PT, CZ, IT, EN); Indexation engine; Query engine; Language recognizer; Document types converters; Index; Documents]
Linguistic Processor
Documents
• Resources: General Ontology, Derived Forms dictionary, Taxonomy of types of questions
• Indexation: cutting into blocks, spelling correction, parsing, conceptual analysis, anaphora resolution
• Indexes: Keywords index, Named entities index, Heads of derivation index, Concepts index, Areas index, Questions-answers types index
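As a rough illustration of two of the items above (cutting documents into blocks and building a keywords index), here is a minimal sketch. It assumes fixed-size blocks of sentences and plain lowercase tokens. The slides do not describe how the processor actually chooses block boundaries or performs parsing and conceptual analysis, so those steps are left out, and all names are hypothetical.

```python
import re
from collections import defaultdict


def cut_blocks(text, sentences_per_block=2):
    """Split text into sentences, then group them into fixed-size blocks."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_block])
        for i in range(0, len(sentences), sentences_per_block)
    ]


def build_keyword_index(docs, sentences_per_block=2):
    """Map each lowercase token to the (doc_id, block_no) pairs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for block_no, block in enumerate(cut_blocks(text, sentences_per_block)):
            for token in re.findall(r"\w+", block.lower()):
                index[token].add((doc_id, block_no))
    return index
```

Indexing at block level rather than document level means a later query step can return the passage that contains the keywords, not just the document.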
Question
Question processing
• Spelling correction
• Parsing
• Conceptual analysis
• Extraction of keywords
• Type of the question
• Translation if multilingual
• Search into the index
• Synonyms + converses
• Selection of blocks
• Ordering blocks
• Extraction of blocks
Answer extraction
• Spelling correction
• Parsing
• Conceptual analysis
• Type of the answer
• Keywords of the block
• Anaphora resolution
• Detection of metaphors
• Selection of sentence(s)
• Sorting of sentences
• Coherence, justification
• Extraction of answer(s)
Answer(s)
M-CAST search network
[Diagram labels: End users (user); M-CAST Network; M-CAST (Documents, Index); M-CAST 1 (Documents, Index); M-CAST 2 (Documents, Index); M-CAST n (Documents, Index)]
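One way to read the search network diagram is that a query is sent to every M-CAST node and the per-node results are merged into a single ranked list. The sketch below assumes that reading. The node interface, the scoring by keyword overlap, and all names are hypothetical, and nodes are queried in-process rather than over the network.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One M-CAST installation holding its own documents (hypothetical)."""
    name: str
    documents: dict  # doc_id -> text

    def search(self, query):
        """Score each document by the number of query terms it contains."""
        terms = set(query.lower().split())
        hits = []
        for doc_id, text in self.documents.items():
            score = len(terms & set(text.lower().split()))
            if score:
                hits.append((score, self.name, doc_id))
        return hits


def network_search(nodes, query, limit=10):
    """Query every node and merge the hits, best score first."""
    merged = [hit for node in nodes for hit in node.search(query)]
    merged.sort(key=lambda h: (-h[0], h[1], h[2]))
    return merged[:limit]
```

A real deployment would also need to make scores comparable across nodes that index different collections, which this sketch sidesteps by using the same scoring function everywhere.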