An Architecture for Online Information Integration on Concurrent Resource Access on a Z39.50 Environment
Michalis Sfakakis¹ and Sarantos Kapidakis²
¹ National Documentation Centre / National Hellenic Research Foundation
² Laboratory on Digital Libraries and Electronic Publishing, Archive and Library Sciences Department / Ionian University
7th European Conference on Digital Libraries
17-22 August 2003, Trondheim, Norway
Presentation Summary
Main Contributions
Resource Access in a Network Environment (models, characteristics, issues, implementations)
Proposed Architecture (goal, critical points, characteristics, benefits)
Technical Details of the Proposed Architecture
Conclusions
Future Research
Main Contributions
Analysis of problems (in a networked environment) for:
• Concurrent resource access via parallel search
• Information integration
Proposal of an architecture for these problems:
• Able to improve online information integration
• Taking into account the restrictions imposed by the:
    Network environment
    Z39.50 information retrieval protocol
Resource Access in Union Catalogues
Give access to library content from one central point
Functional requirements:
• Consistent searching & indexing
• Consolidation of records (information integration)
• Performance & management
… conformance of the current implementation models:
• Centralized (the vast majority of the current implementations): conforms well to all functional requirements
• Distributed (current approaches – virtual union catalogues): conformance to the functional requirements varies
Why Virtual Union Catalogues (VUC)
Why Centralized → Distributed:
Local autonomy and control of the participating systems
Retention of the specific resource characteristics
User ability to dynamically define their own collections of resources
Vast and increasing number of available resources
Pre-requirements for VUC
Ensure systems interoperability, derived from the implementation of international metadata standards and information retrieval protocols
Provide information integration (indicated by user studies)
Achieve acceptable performance from the systems which emulate the union catalogue
Support parallel searching
Have adequate network performance
Is it possible to implement VUC now?
Depends on:
Current technology and network improvements
Existence and wide acceptance of metadata standards (e.g. DC, MARC, MODS, etc.)
Wide acceptance of the Z39.50 information retrieval protocol and its associated profiles
Requirements for Information Integration
Information integration (consolidation of records) is a two-step process:
• Identification of the duplicate records
• Presentation: creation of a union record or, according to the Z39.50 duplicate detection model, the clustering of records into ‘equivalence classes’ and the selection of a representative record
Its effectiveness & quality are affected by the:
• Differences in semantic models and formats of the metadata
• Metadata quality (i.e. specificity, completeness of fields, syntactic correctness, and consistency as implemented by authority files)
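The two-step process on this slide can be sketched in a few lines of Python. This is only an illustration: the match key (normalized title plus year), the record fields, the sample values and the "most non-empty fields" rule for choosing a representative are all assumptions, not the authors' method or anything prescribed by Z39.50.

```python
from collections import defaultdict

def match_key(record):
    """Hypothetical match key: alphanumeric lower-cased title plus year."""
    title = "".join(ch for ch in record.get("title", "").lower() if ch.isalnum())
    return (title, record.get("year"))

def cluster(records):
    """Step 1: group records sharing a match key into equivalence classes."""
    classes = defaultdict(list)
    for rec in records:
        classes[match_key(rec)].append(rec)
    return list(classes.values())

def representative(equivalence_class):
    """Step 2: pick the record with the most non-empty fields (assumed rule)."""
    return max(equivalence_class, key=lambda r: sum(1 for v in r.values() if v))

# Made-up sample records from two hypothetical sources "A" and "B".
records = [
    {"title": "Digital Libraries", "year": 2003, "source": "A"},
    {"title": "digital libraries.", "year": 2003, "source": "B", "isbn": "0000000000"},
    {"title": "Union Catalogues", "year": 2001, "source": "A"},
]
classes = cluster(records)
reps = [representative(c) for c in classes]
```

A key this crude shows why metadata quality matters: any difference in title wording or a missing year puts true duplicates into separate classes.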
Methods for Information Integration
Depending on the challenge:
• High-quality duplicate detection and merging on large amounts of data, offline, without hard time restrictions
    Development of centralized union catalogues, or creation of collections by harvesting techniques
• Good de-duplication quality on medium to small amounts of data, online, presenting the results to the user in an acceptable response time
    Development of virtual union catalogues
Z39.50 Information Retrieval Protocol
A complicated, stateful, client/server protocol, widely used in the area of libraries
For every session (Z-association) a server:
• Holds a search history (at least the last query)
• During the session the client can request data from any result set included in the search history
• The search history stays alive during the session
• The session can be abruptly terminated by the server (timeout), on ‘lack of activity’
    The timeout period is server dependent
Depending on the implementation level, a server could implement, in a number of variations, the:
• Sort service
• Duplicate detection service
Summary of VUC Implementation Issues
Network dependent:
• Network links performance & availability
Protocol dependent:
• Interoperability level (e.g. supported services and their implementation variations)
• Timeout period and session reactivation
Participating systems dependent:
• Performance, availability, extensibility, metadata encoding and semantics
De-duplication complexity & expensiveness:
• Highly affected by the different semantic models & formats, quality, completeness, consistency and the amount of the metadata
Overall system performance
Current VUC Implementations
Server side:
• The majority support the basic services (e.g. Init, Search, Present, Scan)
• A small number support the sort service
• A minority support the duplicate detection service
Client side:
• Has to deal with heterogeneity in the received result data
• Must overcome timeout issues, avoiding session reactivation
• Has to de-duplicate incoming results, even if no individual server reply contains duplicates
• The majority of the implementations do not perform any integration, due to performance issues
• Primitive duplicate detection approaches, based on some coded data (e.g. ISBN, ISSN, LC number, etc.)
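A duplicate check based on coded data, of the kind the last bullet calls primitive, might look like the sketch below. The record layout and the function names are hypothetical, and the ISBN values are sample data.

```python
def normalize_isbn(raw):
    """Drop hyphens and spaces, upper-case a possible 'x' check character."""
    return "".join(ch for ch in raw if ch.isalnum()).upper()

def dedupe_by_isbn(records):
    """Keep the first record per normalized ISBN; pass through records without one."""
    seen = set()
    unique = []
    for rec in records:
        isbn = rec.get("isbn")
        key = normalize_isbn(isbn) if isbn else None
        if key is not None and key in seen:
            continue  # same coded value already seen: treat as duplicate
        if key is not None:
            seen.add(key)
        unique.append(rec)  # no ISBN means no evidence of duplication
    return unique

sample = [
    {"title": "Record from server 1", "isbn": "0-306-40615-2"},
    {"title": "Record from server 2", "isbn": "0306406152"},
    {"title": "Record without coded data"},
]
unique = dedupe_by_isbn(sample)
```

The approach is cheap, but it can never match records that lack the coded field, which is one reason it only counts as a primitive form of integration.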
User – VUC System Interactions
The user:
Defines the desired collection of resources
Sends a search request, specifying the desired number of records (Presentation Set) to display each time
After receiving a Presentation Set, may or may not request subsequent Presentation Sets
[Figure: User Interaction with the Virtual Union Catalogue System, which accesses Resources 1…j, j+1…k and l+1…r, each group behind a Z39.50 Server]
Goal of the Proposed Architecture
To improve information integration in online access of a distributed system, which:
Accesses resources concurrently via the network
Applies good-quality duplicate detection procedures online (so that each record located in multiple resources is presented only once)
Critical Points of the Proposed Architecture
We have to deal with:
Performance of the network links and the availability of the resources
Complexity and expensiveness of the duplicate detection algorithms, especially for large numbers of records
Extraction of the Presentation Set in a reasonable response time
Characteristics of the Proposed Architecture
What we do:
We do not apply the duplicate detection algorithms in one shot – the duplicate detection process is applied to each received set of data, comparing it against the previously processed results
Incremental comparison and elimination of the duplicates in every Presentation Set – the processed results are sorted and do not contain duplicates
Usage of the sort or duplicate detection service, when supported
While the user is reading the results, the system prepares the next few sets of unique records
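The incremental idea can be illustrated with a small Python sketch in which each arriving batch is compared only against the records already accepted as unique. The class name, the key function and the sample records are assumptions; sorting of the processed results is left out for brevity.

```python
class IncrementalDeduplicator:
    """Holds the processed (unique) results and filters each new batch against them."""

    def __init__(self, key):
        self.key = key      # function mapping a record to its match key (assumed)
        self.unique = {}    # match key -> first record seen with that key

    def add_batch(self, batch):
        """Return only the records of this batch that were not seen before."""
        fresh = []
        for rec in batch:
            k = self.key(rec)
            if k not in self.unique:
                self.unique[k] = rec
                fresh.append(rec)
        return fresh

dedup = IncrementalDeduplicator(key=lambda r: r["title"].lower())
first = dedup.add_batch([{"title": "A"}, {"title": "B"}, {"title": "a"}])
second = dedup.add_batch([{"title": "B"}, {"title": "C"}])
```

Because every batch is small, the cost per step stays low even when the complete result sets on the servers are large.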
Benefits of the Proposed Architecture
Avoid downloading large amounts of data over the network and unnecessarily loading the servers
Apply the duplicate detection algorithm to a small number of records – especially in the first steps
Every record is compared against a processed set during de-duplication
We exploit the time during which the user is reading the presented data, without exhausting the system resources
Overview of the Proposed Architecture
Modules: Request Interface, Data Integrator, Resource Communicator
Components: Data Provider, Local Result Set Manager, De-duplicator, Data Presenter
Interaction is accomplished by messages or synchronous data transmissions
[Figure: architecture overview. Labels: User Interaction; Request Interface; Data Integrator (Data Provider, Input Queue, Local Result Set Manager, De-duplicator, Local Result Set, Presentation Set, Data Presenter, Output Queue); Resource Communicator; Profiles of the Z39.50 Servers; Z39.50 Servers for Resources 1…j, j+1…k and l+1…r]
Modules of the Proposed Architecture
The Request Interface: Receives every user request (search or present), dispatches it to the appropriate modules, and waits for the Presentation Set
The Resource Communicator: Accesses the resources and supplies the data for the integration
The Data Integrator: Receives the data sets, performs the information integration and manages the unique records so that they are ready for presentation
Components of the Proposed Architecture
The Local Result Set Manager: Holds and arranges (e.g. sorts) the de-duplicated records and prepares the Presentation Set
The Data Provider: Receives data from the Resource Communicator module and sends the records one at a time for further processing
The De-duplicator(s): Receives a record from the Local Result Set Manager and compares it with all the unique records in the Local Result Set
The Data Presenter: Dispatches the request for data received from the Request Interface to the Local Result Set Manager and returns the next unique records for presentation
Accomplishing a search request – Module Interactions
1. The Request Interface requests p records from the Data Integrator and waits for (at most p) records
2. The Request Interface also forwards the search request, including the number p, to the Resource Communicator and continues monitoring for user requests
3. The Resource Communicator waits for messages from the Request Interface and, when it receives a new search request, concurrently starts the following sequence of steps for every server:
    3.1. Translates the search request into the appropriate message format for the server, sends it and waits for the reply
    3.2. Adds up the number of hits from all the replies and sends the total to the Request Interface
    3.3. If the server supports either the duplicate detection or the sort service, invokes it after the server's initial response to the search request
    3.4. Requests a number of records (e.g. p) from every server that replied to its last request
    3.5. Sends the arrived data to the Data Integrator
    3.6. Waits for further commands, but if there is no communication with the server for a period close to its timeout, the procedure jumps to step 3.4
4. The Data Integrator de-duplicates part of the received data, prepares the set of unique records and, when p records are found, sends them to the Request Interface
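The concurrent part of step 3 can be sketched with Python's standard thread pool. `FakeServer` is a hypothetical in-memory stand-in for a Z39.50 target, not a real client API; a real system would use a Z39.50 client library and would also handle step 3.3 and the timeouts of step 3.6, which this sketch omits.

```python
from concurrent.futures import ThreadPoolExecutor

class FakeServer:
    """Hypothetical stand-in for a Z39.50 server holding a list of titles."""

    def __init__(self, name, records):
        self.name = name
        self.records = records
        self.result_set = []

    def search(self, query):
        """Build the server-side result set and return the number of hits."""
        self.result_set = [r for r in self.records if query in r.lower()]
        return len(self.result_set)

    def present(self, start, count):
        """Return `count` records of the result set, starting at `start`."""
        return self.result_set[start:start + count]

def search_one(server, query, p):
    hits = server.search(query)           # search and wait for the reply
    return hits, server.present(0, p)     # request the first p records

def search_all(servers, query, p):
    """Run search_one on every server in parallel and merge the outcomes."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        results = list(pool.map(lambda s: search_one(s, query, p), servers))
    total_hits = sum(hits for hits, _ in results)           # summed hit count
    batch = [rec for _, recs in results for rec in recs]    # data for the integrator
    return total_hits, batch

servers = [
    FakeServer("s1", ["Digital Libraries", "Union Catalogues"]),
    FakeServer("s2", ["digital libraries", "Digital Archives"]),
]
total_hits, batch = search_all(servers, "digital", 1)
```

The summed hit count here is 3 although only two distinct titles match across both servers, so a total computed this way overestimates the number of unique records.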
Module Interactions: Comments & Clarifications
All modules work in parallel
The number of records requested from each server could vary, depending on the server's performance and timeout, the network links and the Result Set size
For the sake of overall system performance, the Resource Communicator detects when a server is down, using the Profiles of the Z39.50 Servers, and continues the interaction with the other modules
The calculated number of hits is not the actual one (it is the sum over all servers, before de-duplication)
To avoid session reactivation, imposed by the server timeout, the Resource Communicator could request data from any server at any time
A threshold value triggers the Data Integrator to send a ‘request data’ message to the Resource Communicator
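One way to decide when to contact a server "for a period close to its timeout" is to compare its idle time with a fraction of its timeout. The function below is a hypothetical helper; the 0.8 safety margin and the dictionary-based bookkeeping are assumptions, with the per-server timeouts imagined as coming from the Profiles of the Z39.50 Servers.

```python
def servers_needing_keepalive(last_contact, timeouts, now, margin=0.8):
    """Return the servers whose idle time has reached `margin` of their timeout.

    last_contact: server name -> time of the last exchange (seconds)
    timeouts:     server name -> server-dependent timeout period (seconds)
    """
    return [
        name
        for name, contacted_at in last_contact.items()
        if now - contacted_at >= margin * timeouts[name]
    ]

last_contact = {"a": 0.0, "b": 50.0}
timeouts = {"a": 60.0, "b": 120.0}
due = servers_needing_keepalive(last_contact, timeouts, now=50.0)
```

For each server returned, the Resource Communicator could issue its next request for records early, which both keeps the session alive and prefetches data that is likely to be needed.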
Accomplishing a search request – Component Interactions
1. The Data Provider starts to transfer data, possibly rearranging them. If the number of records it holds is less than a threshold (e.g. 5p), the Data Provider sends a ‘request data’ message to the Resource Communicator
2. While the Local Result Set Manager has fewer unique records than a threshold (e.g. 3p), it tries to read from the Data Provider and, for every record found, calls the De-duplicator to compare the record:
    2.1. The De-duplicator compares the record with the records in the Local Result Set and then sends the results back to the Local Result Set Manager
    2.2. The Local Result Set Manager receives the results from the duplicate detection process and arranges the record into the Local Result Set
    2.3. If the number of new unique records in the Local Result Set becomes p, it copies the p new unique records into the Presentation Set and activates the Data Presenter
3. When the Presentation Set is filled with (the p) records, the Data Presenter component dispatches the records to the Request Interface module and waits to receive the next ‘request data’ message from it. If the component does not receive any request during its predefined timeout period, it terminates the system
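The interplay of the two thresholds can be simulated in a single thread. This is a simplified sketch: the function name, the batch iterator standing in for the Resource Communicator, and the exact-match key are all assumptions. Unlike the design above, it returns the first p records only after the manager's threshold is reached, instead of handing them to the Data Presenter as soon as p are available.

```python
from collections import deque

def fill_presentation_set(source_batches, p, key, provider_factor=5, manager_factor=3):
    """Return (presentation set, number of 'request data' calls that returned data)."""
    batches = iter(source_batches)   # stands in for the Resource Communicator
    provider = deque()               # Data Provider buffer
    local_result_set = {}            # unique records, keyed by match key
    requests = 0
    exhausted = False
    while len(local_result_set) < manager_factor * p:
        # Data Provider: below its threshold (e.g. 5p), ask for more data.
        if len(provider) < provider_factor * p and not exhausted:
            batch = next(batches, None)
            if batch is None:
                exhausted = True
            else:
                requests += 1
                provider.extend(batch)
        if not provider:
            if exhausted:
                break                # nothing left anywhere
            continue
        # Local Result Set Manager: read one record and de-duplicate it.
        rec = provider.popleft()
        local_result_set.setdefault(key(rec), rec)
    unique = list(local_result_set.values())
    return unique[:p], requests

presentation, requests = fill_presentation_set(
    [[1, 2, 2, 3], [3, 4, 5], [5, 6, 7, 8]], p=2, key=lambda r: r
)
```

With p = 2 the manager stops once it holds 6 unique records, so records 7 and 8 stay in the Data Provider buffer for a later request.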
Component Interactions: Comments & Clarifications
The combination of the threshold values in the Data Provider & Local Result Set Manager controls the ‘request data’ activity from the Resource Communicator
The Local Result Set Manager keeps two orderings for the unique records in order to:
• Improve the performance of the De-duplicator
• Present the stored records and facilitate easy access to them
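The slides do not say what the two orderings are. One plausible reading, sketched below purely as an assumption, is a hash index on the match key for fast duplicate lookups plus a sorted list of keys for presentation. The class and its title-based key are hypothetical.

```python
import bisect

class LocalResultSet:
    """Unique records kept under two orderings (an assumed design)."""

    def __init__(self):
        self.by_key = {}        # ordering 1: hash lookup for the De-duplicator
        self.sorted_keys = []   # ordering 2: sorted keys for presentation

    def add(self, record):
        """Insert the record if its key is new; return True when it was added."""
        key = record["title"].lower()
        if key in self.by_key:
            return False
        self.by_key[key] = record
        bisect.insort(self.sorted_keys, key)   # keep presentation order sorted
        return True

    def page(self, start, count):
        """Return `count` records in presentation order, starting at `start`."""
        return [self.by_key[k] for k in self.sorted_keys[start:start + count]]

lrs = LocalResultSet()
added = [lrs.add({"title": t}) for t in ["Zeta", "alpha", "ALPHA"]]
titles = [r["title"] for r in lrs.page(0, 2)]
```

The dictionary keeps each duplicate check at roughly constant cost, while the sorted list lets a Presentation Set be cut out as a simple slice.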
Conclusions
The online de-duplication process for resources accessed concurrently in a network environment:
• Is a requirement identified by user studies
• Is challenged by a number of issues relevant to:
    Performance of the participating servers
    Their network links
    The complexity and the expensiveness of the duplicate detection algorithms
These issues make any approach to applying information integration inefficient:
• In online environments
• Especially when large amounts of data must be processed
In our proposed system:
• We do not try to integrate all the results from all the resources at once
• We attack this problem by:
    Retrieving a small number of records, regardless of whether the servers provide de-duplicated or sorted results
    Applying the de-duplication process to small amounts of sorted records
    Creating a presentation set of unique records to display to the user
    Exploiting the time during which the user is reading the presented data, without wasting the system resources
Future Research
To better approximate the number of records satisfying the search request
To derive priorities for the servers and their resources
To select or adapt a good de-duplication algorithm for different record completeness and different provision of records by the servers
To optimize the number of requested records from a server
To implement the system and evaluate its performance