M-CAST: Multilingual Content Aggregation System based on TRUST Search Engine
Borys Czerniejewski, Sebastian Lisek, Infovide S.A. (PL)

Page 1

M-CAST: Multilingual Content Aggregation System based on TRUST Search Engine

Borys Czerniejewski

Sebastian Lisek

Infovide S.A. (PL)

Page 2

The Project

eContent project #22249

project start: 1 January 2005

project end: 31 December 2006

M-CAST

Page 3

M-CAST Results

Expected Results

• multilingual, full-text search engine (server version)
• Internet portals deployed in two libraries
• content aggregation facility
• business plan + IPRs/royalties fixed
• dissemination

Previous Projects

• TRUST – Multilingual Semantic and Cognitive Search Engine for Text Retrieval Using Semantic Technologies (IST-1999-56416)
• ICONS – Intelligent Content Management System (IST-2001-32429)

Page 4

Users & Business Case

Large Full-Text Data Collections

• digital (Internet) libraries
• press agencies
• (press) publishers
• operators of scientific databases
• big companies (multinationals)

Languages

• Czech
• English
• French*
• Italian*
• Polish*
• Portuguese*

Page 5

Consortium

Language Technology

• TiP sp. z o.o., Katowice, Poland
• Synapse Développement SARL, Toulouse, France
• Priberam Informática Lda., Lisbon, Portugal
• Expert System S.p.A., Modena, Italy
• Vysoká Škola Ekonomická v Praze, Prague, Czech Republic

Coordinator, integrator

• Infovide-Matrix S.A., Poland

Users

• The Nicholas Copernicus Provincial and Municipal Library, Toruń, Poland (operator of the Polish Internet Library)
• Národní Knihovna České Republiky, Prague, Czech Republic

Page 6

Architecture objectives: end user perspective

• Performance – >1s response time
• Usability – a user should be able to learn with ease how to operate the system, prepare inputs for it, and interpret its outputs
• Availability – high availability, 24x7

Page 7

Architecture objectives: customer perspective

• Security
  – the M-CAST system should be able to manage, protect, and distribute sensitive information
  – copyrights
• Interoperability – the M-CAST system should be able to use information exchanged with various systems (resources)
• Scalability – the M-CAST architecture should be easy to modify to fit performance and volume requirements

Page 8

Architecture objectives: producer perspective

• Time span – the architecture and technology should be in use in 2007
• Portability – the M-CAST system should be easy to transfer from one hardware or software environment to another
• Flexibility – the M-CAST architecture should be easy to modify for use in applications or environments other than those for which it was specifically designed

Page 9

Architectural decision: SOA

Service-oriented architecture is an approach to loosely coupled, protocol-independent, standards-based distributed computing where software resources available on the network are considered as services.
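The loose coupling described above can be illustrated with a minimal sketch. It is not the M-CAST design: `SearchService`, `InMemorySearch` and `Portal` are hypothetical names, and the in-process call stands in for whatever network protocol a real service would use. The point is only that the consumer depends on a service contract, not on a concrete implementation.

```python
from typing import Protocol


class SearchService(Protocol):
    """Hypothetical service contract: the only thing a consumer knows."""

    def search(self, query: str) -> list[str]: ...


class InMemorySearch:
    """One possible provider; a remote provider could replace it unchanged."""

    def __init__(self, documents: dict[str, str]) -> None:
        self._documents = documents

    def search(self, query: str) -> list[str]:
        q = query.lower()
        # Plain substring match, only to keep the sketch self-contained.
        return sorted(
            doc_id for doc_id, text in self._documents.items() if q in text.lower()
        )


class Portal:
    """Consumer (e.g. a library portal) coupled only to the contract."""

    def __init__(self, service: SearchService) -> None:
        self._service = service

    def render_results(self, query: str) -> str:
        hits = self._service.search(query)
        return f"{len(hits)} result(s): " + ", ".join(hits)
```

Swapping `InMemorySearch` for another provider requires no change in `Portal`, which is the property the SOA decision is after.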

Page 10

Internal view - architecture

[Figure: layered architecture diagram. Labels: Resources; library catalog system; digitalized resources; Integration layer; M-CAST; Linguistic Processor; Presentation layer; library portal; M-CAST end users (user, administrator); External systems.]

Page 11

Internal view - resources

[Figure: M-CAST connected to several Resource boxes.]

• Metadata
  – Protocol: OAI-PMH
  – Formats: Qualified Dublin Core
• Data
  – Protocol: ftp, http
  – Formats: txt, html, pdf, rtf

Page 12

Architectural decision: OAI-PMH & Dublin Core

http://www.library.edu/oaipmh/OAIDataProvider?verb=GetRecord&identifier=30843

…
<dc:title>Vměnj Křesťanské aneb Přjprawa k dobré Smrti...</dc:title>
<dc:subject xsi:type="dcterms:UDC">09</dc:subject>
…
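A minimal sketch of how a harvester might build such a GetRecord request and read the title and UDC subject from the returned Dublin Core fields. The OAI-PMH protocol requires a `metadataPrefix` argument for GetRecord, which the URL on the slide omits; the default `oai_dc` used here is an assumption, since the prefix for Qualified Dublin Core depends on the repository. The `<record>` wrapper in `SAMPLE` is hypothetical and only makes the fragment parseable; no network call is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def build_get_record_url(base_url, identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord request URL."""
    query = urlencode(
        {"verb": "GetRecord", "identifier": identifier, "metadataPrefix": metadata_prefix}
    )
    return f"{base_url}?{query}"


def extract_title_and_udc(record_xml):
    """Return (title, [UDC codes]) from a Dublin Core record fragment."""
    root = ET.fromstring(record_xml)
    title = root.findtext(f".//{{{DC_NS}}}title")
    udc_codes = [
        el.text
        for el in root.iter(f"{{{DC_NS}}}subject")
        if el.get(f"{{{XSI_NS}}}type") == "dcterms:UDC"
    ]
    return title, udc_codes


# Hypothetical wrapper element around the two fields shown on the slide.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:title>Vměnj Křesťanské aneb Přjprawa k dobré Smrti...</dc:title>
  <dc:subject xsi:type="dcterms:UDC">09</dc:subject>
</record>"""

url = build_get_record_url("http://www.library.edu/oaipmh/OAIDataProvider", "30843")
title, udc_codes = extract_title_and_udc(SAMPLE)
```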

UDC Filtering

• a polysemic word: ball
  – sense 1: ROUND OBJECT. Any object in the shape of a sphere, especially one used as a toy by children or in various sports such as tennis and football
  – sense 2: DANCE. A large formal occasion where people dance
• two texts:
  – D1: a text about football. UDC: 796
  – D2: Cinderella. UDC: 793
• a question: "Where did the ball take place?"
• the question uses sense 2, so filtering on the UDC class should keep D2 and discard D1
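The filtering idea can be sketched as follows, assuming the standard UDC classes 796 (sport) and 793 (social entertainments, dance) and assuming that an earlier analysis step has already decided which sense of "ball" the question uses. `Document`, `SENSE_TO_UDC` and `filter_by_sense` are hypothetical names, not part of M-CAST.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    summary: str
    udc: str  # UDC class taken from the record's dc:subject


# Hypothetical mapping from word sense to UDC class prefixes.
SENSE_TO_UDC = {
    "ROUND OBJECT": ("796",),  # sport, games
    "DANCE": ("793",),         # social entertainments, dance
}


def filter_by_sense(documents, sense):
    """Keep only documents whose UDC class matches the given sense."""
    prefixes = SENSE_TO_UDC[sense]
    return [d for d in documents if d.udc.startswith(prefixes)]


documents = [
    Document("D1", "a text about football", "796"),
    Document("D2", "Cinderella", "793"),
]

# "Where did the ball take place?" is assumed to resolve to the DANCE sense.
candidates = filter_by_sense(documents, "DANCE")
```

Only the candidates that survive the filter would then be passed to the full-text search, which avoids matching "ball" in sports texts.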

Page 13

Architectural decision: OAI-PMH & Dublin Core

…
<dc:creator>Bellarmino, Robert Francesco Romolo</dc:creator>
<dc:description>Při strahovském exempláři B Z VIII 28 rukopiná poznámka: Jacobus Colens S.J. …</dc:description>
<dc:format>text/html</dc:format>
<rdf:Seq xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <dc:identifier xsi:type="dcterms:URI">http://www.manuscriptorium.com?id=1184206</dc:identifier>
  <dc:identifier xsi:type="dcterms:URI">http://www.manuscriptorium.com?id=1184207</dc:identifier>
</rdf:Seq>
<rdf:Seq xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <dc:source xsi:type="dcterms:URI">http://www.manuscriptorium.com?id_source=1184206</dc:source>
  <dc:source xsi:type="dcterms:URI">http://www.manuscriptorium.com?id_source=1184206</dc:source>
</rdf:Seq>

Page 14

Architecture: Linguistic Processor

[Figure: Linguistic Processor components.]

• Language modules: PL, FR, PT, CZ, IT, EN
• Indexation engine
• Query engine
• Language recognizer
• Document types converters
• Index
• Documents
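One way the language recognizer and the per-language modules could fit together is sketched below. This is an illustration under strong assumptions: the recognizer is a toy stop-word counter, the stop-word lists are tiny samples, and the "language module" only tokenizes and drops stop words. None of these names or techniques come from the M-CAST components themselves.

```python
# Tiny illustrative stop-word samples; real modules would use full lexicons.
STOPWORDS = {
    "EN": {"the", "and", "of", "is", "where", "did"},
    "FR": {"le", "la", "et", "est", "où", "les"},
    "PL": {"i", "w", "nie", "jest", "się", "gdzie"},
}


def recognize_language(text):
    """Return the language whose stop words occur most often, or None."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def make_module(lang):
    """Build a stand-in language module: tokenize and drop stop words."""
    stop = STOPWORDS[lang]

    def analyze(text):
        words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
        return [w for w in words if w and w not in stop]

    return analyze


LANGUAGE_MODULES = {lang: make_module(lang) for lang in STOPWORDS}


def process(text):
    """Recognize the language, then dispatch to the matching module."""
    lang = recognize_language(text)
    if lang is None:
        raise ValueError("language not recognized")
    return lang, LANGUAGE_MODULES[lang](text)
```

Keeping the modules behind a registry keyed by language code means a new language can be added without touching the indexation or query engine.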

Page 15

Linguistic Processor

[Figure: processing flow from Documents and Question to Answer(s).]

Linguistic resources

• General ontology
• Derived forms dictionary
• Taxonomy of types of questions

Indexation (input: Documents)

• Cutting blocks
• Spelling correction
• Parsing
• Conceptual analysis
• Anaphora resolution

Indexes

• Keywords index
• Named entities index
• Heads of derivation index
• Concepts index
• Areas index
• Questions-answers types index

Question processing (input: Question)

• Spelling correction
• Parsing
• Conceptual analysis
• Extraction of keywords
• Type of the question
• Translation if multilingual

Search into the index

• Synonyms + converses
• Selection of blocks
• Ordering blocks
• Extraction of blocks

Answer extraction (output: Answer(s))

• Spelling correction
• Parsing
• Conceptual analysis
• Type of the answer
• Keywords of the block
• Anaphora resolution
• Detection of metaphors
• Selection of sentence(s)
• Sorting of sentences
• Coherence, justification
• Extraction of answer(s)
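A heavily simplified sketch of the block-based part of this flow: documents are cut into blocks, a keywords index maps each keyword to the blocks containing it, and a question selects and orders blocks by keyword overlap. Spelling correction, parsing, conceptual analysis, anaphora resolution and answer extraction are all omitted, and every function name and the "longer than three characters" keyword rule are assumptions made for the example.

```python
from collections import defaultdict


def cut_blocks(text, sentences_per_block=2):
    """Cut a document into blocks of a few sentences each."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [
        ". ".join(sentences[i:i + sentences_per_block]) + "."
        for i in range(0, len(sentences), sentences_per_block)
    ]


def keywords(text):
    """Crude keyword extraction: lowercased words longer than 3 characters."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return {w for w in words if len(w) > 3}


def build_index(documents):
    """Return (keyword -> block keys, block key -> block text)."""
    index = defaultdict(set)
    blocks = {}
    for doc_id, text in documents.items():
        for number, block in enumerate(cut_blocks(text)):
            blocks[(doc_id, number)] = block
            for keyword in keywords(block):
                index[keyword].add((doc_id, number))
    return index, blocks


def search(question, index, blocks, top=3):
    """Select blocks sharing keywords with the question, best first."""
    scores = defaultdict(int)
    for keyword in keywords(question):
        for key in index.get(keyword, ()):
            scores[key] += 1
    ordered = sorted(scores, key=lambda key: (-scores[key], key))
    return [(key, blocks[key]) for key in ordered[:top]]


documents = {
    "d1": "The team won the football match. The ball crossed the line twice.",
    "d2": "Cinderella went to the royal ball. The ball took place at the palace. "
          "She lost a slipper.",
}
index, blocks = build_index(documents)
results = search("Where did the ball take place?", index, blocks)
```

Working on blocks rather than whole documents keeps the later answer-extraction steps focused on a few sentences at a time.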

Page 16

M-CAST's search network

[Figure: search network diagram. Labels: End users (user); M-CAST Network; nodes M-CAST, M-CAST 1, M-CAST 2, …, M-CAST n, each with its own Documents and Index.]
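The slide does not say how queries travel through the network, so the following is only a sketch of one common approach: the node that receives the query forwards it to every other node in parallel and merges the partial results by score. `Node`, `federated_search` and the term-count scoring are hypothetical, and the merge assumes that scores from different nodes are comparable.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


class Node:
    """Stand-in for one M-CAST installation with its own documents."""

    def __init__(self, name, documents):
        self.name = name
        self.documents = documents

    def search(self, query):
        """Return (score, node name, doc id) for every matching document."""
        terms = query.lower().split()
        hits = []
        for doc_id, text in self.documents.items():
            words = text.lower().split()
            score = sum(words.count(term) for term in terms)
            if score:
                hits.append((score, self.name, doc_id))
        return hits


def federated_search(nodes, query, limit=5):
    """Query all nodes in parallel and keep the best-scoring hits."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        partial_results = list(pool.map(lambda node: node.search(query), nodes))
    merged = [hit for hits in partial_results for hit in hits]
    return heapq.nlargest(limit, merged)


nodes = [
    Node("M-CAST 1", {"a": "polish library catalog",
                      "b": "digital library of polish library texts"}),
    Node("M-CAST 2", {"c": "czech national library"}),
]
hits = federated_search(nodes, "library")
```

Each node keeps its own documents and index, so adding a library to the network only means adding one more entry to `nodes`.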

Page 17

Thank you!

Borys Czerniejewski

Sebastian Lisek

Infovide S.A. (PL)

Information: [email protected]