Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)


1

Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

Alon Kadury

2

Content

– Reminders
– History
– OAI overview
– Technical introduction
– Conclusions
– Demonstrations
– Resources

3

Definition - A Digital Library is a:

1. Collection of digital objects

2. Collection of knowledge structures

3. Collection of library services

4. Domain/Focus/Topic

5. Quality Control

6. Preservation/Persistence

4

Types of DLs

– Single Digital Library (SDL), also called stand-alone or self-contained

– Federated Digital Library (FDL), also called confederated or distributed

– Harvested Digital Library (HDL)

5

Single Digital Library (SDL)

– A regular DL
– Self-contained material:
  – purchased
  – scanned/digitized
– Usually localized

6

Federated Digital Library (FDL)

– Contains many autonomous libraries
– Usually heterogeneous repositories
– Connected via network
– Forms a virtual distributed library
– Transparent user interface

The major problem is interoperability.

7

Harvested Digital Library (HDL)

– Does not contain data, just metadata
– Objects harvested into summaries
– Regular DL characteristics:
  – fine granularity
  – rich library services
  – high quality control
  – annotated

8

History

As the Web evolved, the number of Web sites and search engines increased.
A similar process happened with e-prints and digital libraries.

The growth in the number of DLs led to the development of the OAI-PMH protocol, as we are about to see.

9

History - Problems

The development of e-prints and digital libraries led to several problems, such as:

Many user interfaces - Each DL offered its own Web interface for depositing articles and for end-user searches.
The result: End users found it difficult to work across archives without learning several different interfaces.

10

History - Problems

Different query syntaxes - Each SDL had its own search syntax.
The result: It was difficult for users to keep track of the search syntax of each SDL, and difficult to create an FDL that could query many SDLs.

Many metadata formats - SDL metadata could be kept in any format the SDL wanted.
The result: FDLs had a hard time, because they had to know the format of every SDL they harvested.

11

History – Possible solutions

These problems led researchers to recognise the need for a single search interface to all archives: the Universal Pre-print Service (UPS).

Two possible approaches to building the UPS were considered:

12

History – Solution 1

Cross-searching multiple archives:
In this approach a client sends requests to several servers and then combines the data.
The client and the servers work with a known and agreed protocol (for example Z39.50).

However, studies showed that this is not the preferred approach for distributed searching across large numbers of nodes, mainly because of problems such as knowing which collections to search and performance issues.

13

History – Solution 2

Harvesting metadata into a ‘Central Server’:
This approach harvests the metadata and stores it on a central server, on which searches are made.

The idea was demonstrated at a convention held in Santa Fe, NM, on October 21-22, 1999.

UPS was soon renamed the Open Archives Initiative (OAI): http://www.openarchives.org/

More reading: http://www.dlib.org/dlib/february00/02contents.html

14

OAI overview - definitions

Let's start with a few definitions:
– Interoperability
– Open Archives Initiative (OAI)
– Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

15

OAI overview - definitions

What is Interoperability?
Interoperability refers to the ability of two or more systems to interact with one another and exchange data according to a prescribed method in order to achieve predictable results.

16

OAI overview - definitions

In order to exchange data we need to agree on things like:
– request formats
– result formats
– transport protocols (HTTP vs FTP vs …)
– metadata formats (DC vs MARC vs …)
– usage rights (who can do what with the records)

We need someone to organize it and “set the rules”.

17

OAI overview - definitions

Who will organize it?
The Open Archives Initiative:

“The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content.” (http://www.openarchives.org/organization/index.html)

18

OAI overview - definitions

What will the interoperability standard be called?

Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

19

OAI overview - Key players

When talking about OAI-PMH we see three main players:

1. Data Providers

2. Service Providers

3. The protocol (OAI-PMH)

20

OAI overview - Data Provider

Data Provider:
– Handles deposit/publishing of resources in the archive.
– Exposes metadata about resources in the archive (using the OAI-PMH protocol/interface).
– May support any metadata format, but must support Dublin Core (DC).
– Offers free access to the archive (at least to the metadata).
– A network-accessible server able to process OAI-PMH requests correctly is often called a Repository.

21

OAI overview - Service Provider

Service Provider:
– Harvests metadata from data providers and uses it to offer a single user interface across all harvested metadata.
– May enrich the metadata.
– Offers (value-added) services on the basis of the metadata.
– A client application issuing OAI-PMH requests is often referred to as a Harvester.

22

OAI overview - Providers

                     Data Provider               Service Provider
End user interface   Might have user interface   Has user interface
Contains             Items & metadata            Metadata only
OAI interface        must                        ?
Offers data from     Its own resources           Harvests metadata from data providers

23

OAI overview - Providers

[Figure: Data Providers, each with an input interface, a native harvesting interface and a native end-user interface, alongside a Service Provider with a native harvesting interface and a native end-user interface. The native end-user interface of a Data Provider is optional (e.g., RePEc).]

24

OAI overview - Providers

[Figure: Data providers connected to Service providers; harvesting is based on OAI-PMH.]

25

OAI overview - Model

[Figure: layered model]

Layer 1: Web
Layer 2: SDL, SDL, SDL
(OAI-PMH connects Layer 2 and Layer 3)
Layer 3: Service Provider - FDL/HDL
Layer 4: Web interfaces

26

Technical introduction

Since the days of the Santa Fe convention the protocol has gone through several versions.

Version 2.0 is the latest and is considered stable.
The technical introduction refers to this version.

27

Tech’ - protocol versions

            Santa Fe convention   OAI-PMH v.1.0/1.1        OAI-PMH v.2.0
nature      experimental          experimental             stable
verbs       Dienst                OAI-PMH                  OAI-PMH
requests    HTTP GET/POST         HTTP GET/POST            HTTP GET/POST
responses   XML                   XML                      XML
transport   HTTP                  HTTP                     HTTP
metadata    OAMS                  unqualified Dublin Core  unqualified Dublin Core
about       eprints               document-like objects    resources
model       metadata harvesting   metadata harvesting      metadata harvesting

28

Tech’ - request & response

The requests of the protocol are HTTP based. The response contents of the protocol are XML based.

Question: why?

Answer:
– Simple protocol based on existing standards, which allows rapid development & effortless implementation.
– Systems can be deployed in a variety of configurations.
– Low-barrier interoperability specification.
– Internet/firewall friendly.

29

Tech’ - request & response

There are six request types, which are called verbs.

The request type and additional information are passed as parameters using the HTTP POST or GET methods.

[Figure: the Harvester (Service Provider, offering a “Service”) sends requests (based on HTTP) to the Repository (Data Provider, holding metadata and documents) and receives metadata (encoded in XML).]
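To make the idea of "verb plus arguments as HTTP parameters" concrete, here is a minimal sketch that builds a GET URL with Python's standard library. The helper name `build_request_url` is an assumption made for this example, and the base URL simply reuses the `archive.org/oai-script` placeholder from the slides; neither is prescribed by the protocol.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs named in the slides.
VERBS = ("Identify", "ListMetadataFormats", "ListSets",
         "ListIdentifiers", "ListRecords", "GetRecord")


def build_request_url(base_url, verb, **arguments):
    """Return a GET URL that carries the verb and its arguments as query parameters.

    Hypothetical helper for illustration; it does not contact any server.
    """
    if verb not in VERBS:
        raise ValueError(f"unknown verb: {verb}")
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Placeholder base URL modelled on the slides' examples.
url = build_request_url("http://archive.org/oai-script", "ListRecords",
                        metadataPrefix="oai_dc", set="biology")
print(url)
```

Note that `from` is a reserved word in Python, so a `from` argument would have to be passed through a dictionary (`**{"from": "2002-12-01"}`) rather than as a plain keyword.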

30

Let's see a demonstration of how we can create an FDL, and then we will look at what happens behind the scenes.

Demo

31

Tech’ – more definitions

[Figure: a Service Provider (Harvester) sends requests to several Data Providers (Repositories) holding e-prints, images, an OPAC, a museum collection and an archive, and receives their responses.]

Requests:
– Identify
– ListMetadataFormats
– ListSets
– ListIdentifiers
– ListRecords
– GetRecord

Responses:
– General information
– Metadata formats
– Set structure
– Record identifier
– Metadata

32

Tech’ – Request Types

Six different request types:
1. Identify
2. ListMetadataFormats
3. ListSets
4. ListIdentifiers
5. ListRecords
6. GetRecord

A Harvester does not have to use all types.
A Repository must implement all request types fully (all required and optional arguments for each of the requests).

33

Tech’ - Request Type: Identify

function: retrieve description and general information about an archive.

example: archive.org/oai-script?verb=Identify

parameters: none

errors / exceptions:
– badArgument, e.g. archive.org/oai-script?verb=Identify&set=biology

34

Tech’ - Request Type: Identify

Response format (# = number of occurrences: 1 = exactly one, + = one or more, * = zero or more)

Element            Example                              #
repositoryName     My Archive                           1
baseURL            http://archive.org/oai               1
protocolVersion    2.0                                  1
earliestDatestamp  1999-01-01                           1
deletedRecord      no, transient, persistent            1
granularity        YYYY-MM-DD, YYYY-MM-DDThh:mm:ssZ     1
adminEmail         oai-admin@archive.org                +
compression        deflate, compress, …                 *
description        oai-identifier, eprints, friends, …  *
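As a rough sketch of how a harvester might read these elements, the code below parses a hand-written Identify response with Python's `xml.etree.ElementTree`. The sample XML is an illustration assembled from the example values in the table above (the `responseDate` is made up), not output captured from a real repository, and `parse_identify` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Illustrative response body built from the table's example values.
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-12-01T00:00:00Z</responseDate>
  <request verb="Identify">http://archive.org/oai</request>
  <Identify>
    <repositoryName>My Archive</repositoryName>
    <baseURL>http://archive.org/oai</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>oai-admin@archive.org</adminEmail>
    <earliestDatestamp>1999-01-01</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
  </Identify>
</OAI-PMH>"""


def parse_identify(xml_text):
    """Collect the Identify elements into a dictionary (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    identify = root.find("oai:Identify", OAI_NS)
    if identify is None:
        raise ValueError("no Identify element in response")
    info = {}
    # Elements that occur exactly once.
    for name in ("repositoryName", "baseURL", "protocolVersion",
                 "earliestDatestamp", "deletedRecord", "granularity"):
        info[name] = identify.findtext(f"oai:{name}", namespaces=OAI_NS)
    # Elements that may repeat (or, for compression, be absent).
    for name in ("adminEmail", "compression"):
        info[name] = [e.text for e in identify.findall(f"oai:{name}", OAI_NS)]
    return info


info = parse_identify(SAMPLE_IDENTIFY)
print(info["repositoryName"], info["granularity"])
```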

35

Tech’ - Request Type: Identify

Response in XML format: http://cs1.ist.psu.edu/cgi-bin/oai.cgi?verb=Identify

36

Tech’ - Request Type: ListMetadataFormats

function: retrieve available metadata formats from an archive.
Remember that each archive must implement at least DC.

example: archive.org/oai-script?verb=ListMetadataFormats

parameters: identifier (optional)

errors / exceptions:
– badArgument
– idDoesNotExist, e.g. archive.org/oai-script?verb=ListMetadataFormats&identifier=really-wrong-identifier
– noMetadataFormats

37

Tech’ - Request Type: ListMetadataFormats

Response in XML format: http://cs1.ist.psu.edu/cgi-bin/oai.cgi?verb=ListMetadataFormats

38

Tech’ - Request Type: ListSets

Q: What are Sets?
A: Sets are a logical partitioning of a repository.

Q: Why use sets?
A: Sets are intended to enable selective harvesting.

Data providers don’t have to define sets.
Sets are not strictly hierarchical.

39

Tech’ - Request Type: ListSets

function: retrieve the set structure of a repository

example: archive.org/oai-script?verb=ListSets

parameters: resumptionToken (exclusive)

errors / exceptions:
– badArgument
– badResumptionToken, e.g. archive.org/oai-script?verb=ListSets&resumptionToken=any-wrong-token
– noSetHierarchy

40

Tech’ - Request Type: ListSets

Response in XML format: http://theses.lub.lu.se/oai-service/xerxes/?verb=ListSets

41

Tech’ - Request Type: ListIdentifiers

function: abbreviated form of ListRecords, retrieving only headers

example: archive.org/oai-script?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2002-12-01

parameters:
– from (optional)
– until (optional)
– metadataPrefix (required)
– set (optional)
– resumptionToken (exclusive)

errors / exceptions:
– badArgument, e.g. …&from=2002-12-01-13:45:00
– badResumptionToken
– cannotDisseminateFormat
– noRecordsMatch
– noSetHierarchy
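The required/optional/exclusive rules listed for each verb can be expressed as a small lookup table. The following is a minimal sketch of how a repository might decide when to answer `badArgument`; the function name `check_arguments` and the table layout are assumptions for this example. `badVerb` is the protocol's error code for an unknown verb, which the slides do not list. The datestamp check only tests the shape of the two granularities from the Identify slide, not whether the date is a real calendar date.

```python
import re

# Argument rules taken from the request-type slides.
ARGUMENT_RULES = {
    "Identify": {"required": set(), "optional": set(), "exclusive": None},
    "ListMetadataFormats": {"required": set(), "optional": {"identifier"},
                            "exclusive": None},
    "ListSets": {"required": set(), "optional": set(),
                 "exclusive": "resumptionToken"},
    "ListIdentifiers": {"required": {"metadataPrefix"},
                        "optional": {"from", "until", "set"},
                        "exclusive": "resumptionToken"},
    "ListRecords": {"required": {"metadataPrefix"},
                    "optional": {"from", "until", "set"},
                    "exclusive": "resumptionToken"},
    "GetRecord": {"required": {"identifier", "metadataPrefix"},
                  "optional": set(), "exclusive": None},
}

# YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ (shape only).
DATESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}Z)?")


def check_arguments(verb, args):
    """Return an error code string, or None if the arguments look legal."""
    rules = ARGUMENT_RULES.get(verb)
    if rules is None:
        return "badVerb"
    names = set(args)
    exclusive = rules["exclusive"]
    if exclusive in names:
        # An exclusive argument must be the only argument besides the verb.
        return None if names == {exclusive} else "badArgument"
    if not rules["required"] <= names:
        return "badArgument"  # a required argument is missing
    if names - rules["required"] - rules["optional"]:
        return "badArgument"  # an argument this verb does not accept
    for name in ("from", "until"):
        if name in args and not DATESTAMP.fullmatch(args[name]):
            return "badArgument"  # malformed datestamp
    return None


print(check_arguments("ListIdentifiers",
                      {"metadataPrefix": "oai_dc", "from": "2002-12-01-13:45:00"}))
```

The call at the end reproduces the slide's malformed `from` example and yields `badArgument`.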

42

Tech’ - Request Type: ListIdentifiers

Response in XML format: http://theses.lub.lu.se/oai-service/xerxes/?verb=ListIdentifiers&metadataPrefix=oai_dc

43

Tech’ - Request Type: ListRecords

function: harvest records from a repository

example: archive.org/oai-script?verb=ListRecords&metadataPrefix=oai_dc&set=biology

parameters:
– from (optional)
– until (optional)
– metadataPrefix (required)
– set (optional)
– resumptionToken (exclusive)

errors / exceptions:
– badArgument
– badResumptionToken
– cannotDisseminateFormat
– noRecordsMatch
– noSetHierarchy

44

Tech’ - Request Type: GetRecord

function: retrieve an individual metadata record from a repository

example: archive.org/oai-script?verb=GetRecord&identifier=oai:HUBerlin.de:3000218&metadataPrefix=oai_dc

parameters:
– identifier (required)
– metadataPrefix (required)

errors / exceptions:
– badArgument
– cannotDisseminateFormat
– idDoesNotExist
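Errors such as `idDoesNotExist` come back inside the XML response as `error` elements with a `code` attribute. The sketch below shows one way a harvester could pick them out; the sample response is hand-written for illustration (its date, URL and message text are made up) and `extract_errors` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Illustrative error response, not captured from a real repository.
SAMPLE_ERROR = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-12-01T00:00:00Z</responseDate>
  <request>http://archive.org/oai-script</request>
  <error code="idDoesNotExist">No matching identifier</error>
</OAI-PMH>"""


def extract_errors(xml_text):
    """Return a list of (code, message) pairs; empty if the response has no errors."""
    root = ET.fromstring(xml_text)
    return [(e.get("code"), (e.text or "").strip())
            for e in root.findall("oai:error", OAI_NS)]


errors = extract_errors(SAMPLE_ERROR)
print(errors)
```

A harvester would typically check this list before looking for the verb-specific payload.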

45

Tech’ - Records, items & DC, or setting the record straight

[Figure: a resource (the example is David) corresponds to an item, which stands for all available metadata about David (item = identifier). The item is disseminated as metadata records in individual formats: Dublin Core metadata, MARC metadata, SPECTRUM metadata.]

46

Tech’ - Records, items & DC

A record consists of:

1. Header (mandatory)
– identifier (1)
– datestamp (1)
– setSpec elements (*)
– status attribute for deleted item (?)

2. Metadata (mandatory)
– XML encoded metadata with root tag, namespace
– repositories must support Dublin Core

3. About (optional)
– rights statements
– provenance statements
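The three-part record structure can be seen in a small parsing sketch. The record below is hand-written for illustration: it reuses the identifier from the GetRecord example, while the datestamp, set and title are made up. `parse_record` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Illustrative record with a header and a metadata part, and no about part.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:HUBerlin.de:3000218</identifier>
    <datestamp>2002-12-01</datestamp>
    <setSpec>biology</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An example title</dc:title>
    </oai_dc:dc>
  </metadata>
</record>"""


def parse_record(xml_text):
    """Summarise the header, metadata and about parts of one record."""
    record = ET.fromstring(xml_text)
    header = record.find("oai:header", OAI_NS)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=OAI_NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=OAI_NS),
        # setSpec may appear zero or more times.
        "sets": [s.text for s in header.findall("oai:setSpec", OAI_NS)],
        # The optional status attribute marks a deleted item.
        "deleted": header.get("status") == "deleted",
        "has_metadata": record.find("oai:metadata", OAI_NS) is not None,
        "about_count": len(record.findall("oai:about", OAI_NS)),
    }


summary = parse_record(SAMPLE_RECORD)
print(summary)
```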

47

Tech’ - Records, items & DC

OAI-PMH supports dissemination of multiple metadata formats from a repository.

Properties of metadata formats:
– id string to specify the format (metadataPrefix)
– metadata schema URL (XML schema to test validity)
– XML namespace URI (global identifier for the metadata format)

Repositories must be able to disseminate unqualified DC.
Arbitrary metadata formats can be defined and transported via the OAI-PMH.
Returned metadata must comply with the XML namespace specification.

48

Tech’ - Records, items & DC

As mentioned before, the minimum standard is unqualified Dublin Core (http://dublincore.org/).

The Dublin Core Metadata Element Set contains 15 elements.
All elements are optional.
All elements may be repeated.

The Dublin Core Metadata Element Set:

Title        Contributor  Source
Creator      Date         Language
Subject      Type         Relation
Description  Format       Coverage
Publisher    Identifier   Rights
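Because every Dublin Core element is optional and repeatable, a natural in-memory shape is a mapping from element name to a list of values. The sketch below reads an `oai_dc` metadata block into such a mapping. The sample XML (title and author names) is made up for illustration, and `read_dublin_core` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15 unqualified Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = ("title", "creator", "subject", "description", "publisher",
               "contributor", "date", "type", "format", "identifier",
               "source", "language", "relation", "coverage", "rights")

# Illustrative metadata block; the values are invented.
SAMPLE_OAI_DC = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example title</dc:title>
  <dc:creator>First Author</dc:creator>
  <dc:creator>Second Author</dc:creator>
</oai_dc:dc>"""


def read_dublin_core(xml_text):
    """Map each of the 15 element names to a (possibly empty) list of values."""
    root = ET.fromstring(xml_text)
    fields = {name: [] for name in DC_ELEMENTS}
    for child in root:
        # ElementTree tags look like "{namespace}localname".
        namespace, _, local = child.tag[1:].partition("}")
        if namespace == DC_NS and local in fields:
            fields[local].append((child.text or "").strip())
    return fields


fields = read_dublin_core(SAMPLE_OAI_DC)
print(fields["creator"])
```

Absent elements simply come back as empty lists, so downstream code never has to special-case a missing field.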

49

Tech’ - Records, items & DC

Response in XML format: http://cs1.ist.psu.edu/cgi-bin/oai.cgi?verb=GetRecord&identifier=oai:CiteSeerPSU:1&metadataPrefix=oai_dc

50

Tech’ - Flow control

Some of the requests can generate a very long response (for example, think about asking CiteSeer or the Library of Congress to list ALL their records using the ListRecords verb).

To avoid long responses that would overload the server, a flow control mechanism was added to the protocol.

Splitting long responses into shorter ones is solely the server's responsibility; the client has no control over the length of the responses.

51

Tech’ - Flow control

The flow control mechanism is referred to as the “resumption token”. The server splits a long response into shorter ones and attaches a token to the end of each part; the client passes that token in its next request to get the next part.

52

Tech’ - Flow control

Exchange between the Harvester (Service Provider) and the Repository (Data Provider):

Harvester: “want to have all your records”
  archive.org/oai?verb=ListRecords&metadataPrefix=oai_dc

Repository: “have 267, but give you only 100”
  100 records + resumptionToken “anyID1”

Harvester: “want more of this”
  archive.org/oai?verb=ListRecords&resumptionToken=anyID1

Repository: “have 267, give you another 100”
  100 records + resumptionToken “anyID2”

Harvester: “want more of this”
  archive.org/oai?verb=ListRecords&resumptionToken=anyID2

Repository: “have 267, give you my last 67”
  67 records + resumptionToken “”
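The exchange above boils down to a loop on the harvester side: keep requesting until the repository returns an empty token. The sketch below simulates it without any network access. `fake_repository` is a stand-in written for this example, and encoding the next offset in the token is purely an assumption; real repositories issue opaque tokens of their own choosing.

```python
def fake_repository(arguments, records, batch_size=100):
    """Stand-in for a repository: return (batch, resumption_token).

    The token here is just the next offset as a string (an assumption for
    the simulation); an empty token means the list is complete.
    """
    token = arguments.get("resumptionToken")
    start = int(token) if token else 0
    batch = records[start:start + batch_size]
    next_start = start + batch_size
    next_token = str(next_start) if next_start < len(records) else ""
    return batch, next_token


def harvest_all(send_request, metadata_prefix="oai_dc"):
    """Issue ListRecords requests until the repository returns an empty token."""
    arguments = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    harvested = []
    while True:
        batch, token = send_request(arguments)
        harvested.extend(batch)
        if not token:
            break
        # resumptionToken is exclusive: it replaces all other arguments.
        arguments = {"verb": "ListRecords", "resumptionToken": token}
    return harvested


# Simulate the 267-record repository from the slide.
records = [f"record-{i}" for i in range(267)]
batch_sizes = []


def send_request(arguments):
    batch, token = fake_repository(arguments, records)
    batch_sizes.append(len(batch))
    return batch, token


harvested = harvest_all(send_request)
print(batch_sizes)  # [100, 100, 67]
```

The harvester never chooses the batch size; it only echoes back whatever token it was given, which matches the point that flow control is the server's responsibility.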

53

Conclusions and future use

We saw that the increasing number of digital libraries caused problems for the different DL types:
– FDLs and HDLs had to overcome various obstacles in order to federate or harvest data from SDLs, for example different metadata formats and different query formats.
– Users had to learn the different user interface that each SDL offered.

54

Conclusions and future use

Looking at the OAI-PMH, it seemed that putting the protocol to use would eliminate those problems.
Service providers can reduce the number of different user interfaces the user needs to handle, and federating or harvesting becomes much easier with a common standard.
However…

55

Conclusions and future use

When the protocol is put to use in a digital library environment, the lack of strict rules may cause new problems or make the old ones reappear in another form.

Let's take CiteSeer as an example.
It contains 723140 records and its metadata size is around 1GB.
If one wanted to harvest only the CiteSeer records dealing with a specific topic, how could that be done efficiently?

56

Conclusions and future use

Since searching within the metadata is done on the harvester side, the harvester cannot ask CiteSeer to return only records dealing with "network computation", for example.

Remember the sets? Could they be used to harvest only part of the information instead of handling a gigabyte of data?

The answer is no, since CiteSeer contains only one set.
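To illustrate what "searching is done on the harvester side" means in practice, here is a small sketch: the harvester must first download every record and only then filter locally. The record dictionaries and their contents are invented for the example, and `filter_by_topic` is a hypothetical helper name.

```python
def filter_by_topic(records, topic):
    """Keep records whose title or subject values mention the topic.

    This runs on the harvester after all records have been downloaded;
    the protocol offers no way to push such a query to the repository.
    """
    topic = topic.lower()
    return [record for record in records
            if any(topic in value.lower()
                   for value in record.get("subject", []) + record.get("title", []))]


# Invented records standing in for a fully harvested repository.
harvested = [
    {"title": ["Network computation in sensor fields"], "subject": ["networks"]},
    {"title": ["A survey of digital libraries"], "subject": ["digital libraries"]},
    {"title": ["Scheduling"], "subject": ["Network computation"]},
]

matches = filter_by_topic(harvested, "network computation")
print(len(matches))  # 2
```

With a single set, the only server-side narrowing left to the harvester is by datestamp (`from` / `until`), which does not help with topics.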

57

Conclusions and future use

DC might also be too low a barrier, which leads more and more SDLs to create their own metadata formats in addition to supporting DC (CiteSeer, for example, supports two formats).

Nevertheless, OAI-PMH is increasingly becoming a standard in digital libraries and is making a large contribution to DLs. From the looks of it, it’s here to stay.

58

What's next

Riddle:

– Improving harvesting and creation of HDLs.

– Composition of HDLs.

59

What's next

[Figure: layered model]

Layer 1: Web
Layer 2: SDL, SDL, SDL
(OAI-PMH connects Layer 2 and Layer 3)
Layer 3: HDL
Layer 4: CHDL
Layer 5: Web interfaces

60

Demonstration

– Independent queries.
– Repository Explorer: http://re.cs.uct.ac.za/
– OAIster (FDL): http://oaister.umdl.umich.edu/o/oaister/
– Scirus (FDL): http://www.scirus.com/srsapp/
– Riddle demo: http://riddle.dynalias.com:20055/riddle.html

61

Resources

– OAI official site: http://www.openarchives.org/
– Protocol specification: http://www.openarchives.org/OAI/openarchivesprotocol.html
– General mailing list: http://www.openarchives.org/mailman/listinfo/OAI-general/
– Implementers mailing list: http://www.openarchives.org/mailman/listinfo/OAI-implementers/
– Presentation on which this presentation was based: http://www.oaforum.org/otherfiles/lisb_tutorial.ppt
– Z39.50: http://www.loc.gov/z3950/agency/

62

Questions

63

The end