Towards semantic interoperability of cultural information systems: making ontologies work. Magister thesis submitted to the Philosophische Fakultät of the Universität zu Köln by Robert Kummer



Contents

1 Introduction

2 Establishing digital scholarship
  2.1 Functional requirements of digital scholarship
  2.2 Implementing digital scholarship
  2.3 The interoperability challenge

3 A web of linked cultural heritage data
  3.1 Conceptual and technical requirements
  3.2 Identification and representation of resources
  3.3 Semantic Web tools

4 Standards for semantic interoperability
  4.1 Managing archaeological objects
  4.2 Linking to bibliographic information
  4.3 Linking to other forms of knowledge organisation

5 Dealing with heterogeneity
  5.1 Levels of heterogeneity
  5.2 Heterogeneity on the schema level
    5.2.1 Uniform representation of data models
    5.2.2 Mapping data models
  5.3 Heterogeneity on the entity level
    5.3.1 Data extraction and data quality problems
    5.3.2 Entity identification and record linkage
  5.4 Implementing an overall mapping workflow

6 Knowledge visualization for the Semantic Web
  6.1 Paradigms for visualizing linked data
  6.2 Faceted browsing using Longwell

7 Conclusion


Chapter 1

Introduction

Recently, new terms have emerged to describe an IT and social infrastructure that should facilitate seamless digital scholarly work. It is usually referred to as Cyberinfrastructure, a term coined by the US National Science Foundation. Many endeavors have been made to approach such a Cyberinfrastructure. Most of them share the same objective with only slight variations: because existing data sources are fragmented and spread all over the world, some scientific questions cannot be answered today. The main objective therefore has to be to identify, describe and implement elements of an infrastructure that enable scholars to better exploit digital resources [28, 45, 59]. This infrastructure will provide unified access to data sources and offer services that add value to their underlying cultural heritage content.

One step towards an integrated Cyberinfrastructure for cultural heritage is to bring data objects together syntactically and to mediate semantically between different data models. State-of-the-art research suggests establishing metadata harvesting in addition to crafting software agents that are aware of ontologies. Conceptual reference models like the CIDOC CRM help to mediate between different data models and provide a blueprint for building software that “understands” cultural heritage data [10]. But semantic integration (processing data meaningfully) and syntactic integration (bringing data to a common place) are just one step towards seamless interoperability of cultural heritage information systems. New ideas originating from Semantic Web research and well-established concepts from the world of (digital) libraries may contribute important ideas for a digital work environment for scientists.

However, today, many different information systems with different methodical approaches can be found in the field of historical cultural research; each one is designed according to a specific scientific question and perspective, using specialized terminology and a certain national language. This could be seen as a rather productive situation, but the experience of using information systems for historical cultural research could be greatly enhanced by creating a common platform for information retrieval. In a joint effort, two parties from classics and archaeology intend to formulate a research program for achieving the goals mentioned above. These parties are the Perseus Project and Arachne [9, 8, 20]. The Perseus Project is a digital library currently hosted at Tufts University. It provides humanities resources in digital form with a focus on Classics, but also covers early modern and even more recent material. Arachne is the central database for archaeological objects of the German Archaeological Institute (DAI) and the Research Archive for Ancient Sculpture (FA) at the University of Cologne [11, 21]. DAI and FA joined their efforts in developing Arachne as a free tool for archaeological internet research.

The goal of this thesis is to document the course of this project, i.e. the efforts to gain first experience with building a system that syntactically and semantically integrates data in an international and therefore multilingual environment. It also reports on issues that were encountered during the project and reflects on possible ways to resolve them. It turns out that conceptually mapping data models is not the greatest challenge; the greater challenges are extracting data of appropriate quality and identifying multiple digital surrogates that refer to the same entity in a multilingual environment.

The project is designed to contribute to the said efforts to establish a digital infrastructure for scientific research in the cultural heritage area. First, to place the project in its greater context, a model of digital scholarship is crafted and functional requirements are discussed that will have to be implemented during the process of software development. Second, the peculiarities of sharing data between Perseus and Arachne are introduced: how the collections complement each other and where the mutual benefits lie. Third, state-of-the-art concepts and tools are discussed that help with integrating heterogeneous data from multiple sources; most of them originate from current Semantic Web research. Fourth, the reader is given a closer look at standards for digitally representing cultural heritage data, commonly known as (Networked) Knowledge Organisation Systems and Services (NKOS).1 Within the context of this project, the CIDOC CRM was used as a common data model for sharing metadata. The main part discusses the forms of heterogeneity that were encountered while implementing a mapping workflow. The main issues are explained and possible ways to resolve the problems are suggested. Finally, paradigms for visually presenting integrated data objects to users are explored. Longwell, a Semantic Web browser, was used to index and display the data that was mapped to the CIDOC CRM.

1 NKOS discusses the requirements for enabling knowledge organization systems as network services (http://nkos.slis.kent.edu/).

With only one person doing the analysis of both data models, the mapping, and the implementation of fundamental software tools, the overall software architecture had to remain rather lean. Therefore, higher-level programming languages were avoided and most of the presented workflow relies on shell scripts that mostly combine basic tools that come with the UNIX operating system, and on style-sheet processing for data extraction and mapping. The mapping results were documented in a simple text file that can be found in the Appendix and were implemented using regular expressions and XSLT style-sheets. Although the infrastructure discussed is suitable for all areas of scientific research, this thesis focuses on the cultural heritage area.


Chapter 2

Establishing digital scholarship

This section aims at sounding out the intellectual requirements that facilitate a scholarly workflow as defined above. Although software development is still a very young discipline compared to the study of ancient Greek and Roman texts, it has always intended to better facilitate certain tasks that a person needs to perform. Therefore, functional requirements are defined first, and software developers then build a specific software tool around these agreed and formulated requirements. This involves a lot of communication between domain experts and software developers before the first line of code can be written. In the larger context of this project, functional requirements can be deduced from the traditional scientific workflow, especially within the subjects that deal with culture and history. In the following section, first, a model of digital scholarship is used to help identify and describe the tasks that could be supported by proper integration of cultural heritage information systems. Second, interoperability on the level of data objects presupposes that interoperability of ideas and concepts is also established on more abstract levels. Therefore, several related ideas are discussed and evaluated. Finally, these concepts are applied to the particular project that this thesis reports on.

2.1 Functional requirements of digital scholarship

Large amounts of cultural heritage information have already been migrated to various digital media in recent years. Additionally, the importance of peer-reviewed Open Access material is increasingly recognized within the scientific community. Consequently, a lot of work goes into reflecting on the architecture that currently facilitates scholarly communication and on how it could be transformed in reaction to the new opportunities that arise in a digital environment. One core argument is that new knowledge which has been discovered with the help of taxpayers’ money should not be given to large publishing houses so that libraries have to pay for it again while prices become prohibitive. Making scientific results available at every subsequent intellectual processing stage is a first important step. Adding digital services to the data that is publicly available on the World Wide Web would add even more value to the underlying content.

But which services do scientists need for research in the cultural heritage area? Figure 2.1 introduces a layered logical model of digital scholarship that transcends the components of the aforementioned Cyberinfrastructure. The uppermost layer suggests the need to distinguish objects of the perceptible world from their digital surrogates in one or more digital library collections. These surrogates consist of critical editions of ancient literary texts, archaeological surveys of individual sites or even catalogues of physical artifacts. Scientists create these surrogates as a result of their everyday work, by digitizing material that has been published in traditional form or, in the future, by directly publishing in digital form. Beneath the primary sources, digital libraries should also host secondary sources like reference works that capture the results of a longer research process, as well as monographs and research papers that present new and original ideas. In a digital library, secondary sources should be linked to primary research material to facilitate advanced services.

Figure 2.1: Model of digital scholarship.


The model differentiates between three further layers. While pursuing scientific research, scholars need to refer to texts, parts of texts, archaeological objects, and abstract things. If a digital library provides a stable and unambiguous identifier for each relevant instance, a scientist could use this identifier to refer to, for example, an archaeological object. This reference would be more accurate than in traditional scholarly works. In combination with a resolving service, the identifier could be used to obtain one or more digital representations of, for example, a passage of an ancient text. A digital representation could be a scanned image or the result of OCR. This is reflected by the third layer of the model. However, in certain cases scholars do not want to refer to an instance but to a specific entity, let's say to one of the smaller Alexandrias that were built to honor Alexander the Great, not to the one in Egypt. By grouping instances that have been identified as referring to the same entity in the perceptible world, scholars can refer not only to all digital surrogates that have been digitized so far but also to the one entity that they stand for.
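The interplay of stable identifiers, a resolving service and the grouping of instances into entities can be illustrated with a minimal sketch in Python. The identifiers, entity keys and file names below are invented for illustration and do not stem from Perseus or Arachne; an actual resolver would presumably be offered as a network service rather than as a local function.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One digital surrogate with a stable identifier (all values are invented)."""
    identifier: str         # stable, unambiguous identifier of the instance
    entity: str             # key of the real-world entity the instance refers to
    representations: tuple  # e.g. a scanned image and the result of OCR


# Hypothetical records: two surrogates of one of the smaller Alexandrias
# and one surrogate of Alexandria in Egypt.
INSTANCES = [
    Instance("urn:example:text:1", "alexandria-minor", ("page-scan.png", "ocr.txt")),
    Instance("urn:example:object:7", "alexandria-minor", ("photo.jpg",)),
    Instance("urn:example:text:2", "alexandria-egypt", ("page-scan-2.png",)),
]


def resolve(identifier, instances=INSTANCES):
    """Resolving service: map an identifier to its digital representations."""
    for instance in instances:
        if instance.identifier == identifier:
            return list(instance.representations)
    raise KeyError(identifier)


def group_by_entity(instances=INSTANCES):
    """Group the identifiers of all instances that refer to the same entity."""
    groups = defaultdict(list)
    for instance in instances:
        groups[instance.entity].append(instance.identifier)
    return dict(groups)
```

A scholar who cites "urn:example:object:7" refers to one instance, whereas the group stored under "alexandria-minor" stands for the entity itself and for all of its surrogates.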

The model therefore emphasizes two layers between surrogates of primary sources within the digital collection and secondary sources. Secondary sources refer both to named entities that are derived from grouping instances and to the instances themselves that represent the object in the “real” world. A long-term project objective will be to populate both layers with metadata about instances and entities that conforms to the CIDOC CRM and other standards. The third layer represents the world of quotations and the fourth layer represents the world of authority documents. This view suggests that annotating objects together with referring to instances and entities are fundamental functions of digital scholarship, especially in the humanities, since arguments have to be connected to their evidence in primary sources. The latter is reflected by the bottom layer of the model [1].

After defining the functional requirements that could leverage digital scientific research, software components have to be described as parts of a logical architecture that is able to provide services meeting these requirements. Snow et al. have described a layered logical model that assists archaeological research, consisting of storage management, a web service interface layer and portal software [56]. They also state that in the absence of a new generation of cybertools archaeological research will remain impoverished. Archaeology concentrates on exploring the evolution of culture, growth in population and the interaction of cultures. Research in these areas depends on finding meaningful links between different findings. This in turn depends on being able to access distributed data sources hosting heterogeneous data objects.

Because data nowadays is mainly held in separate silos administered by individuals, museums and governmental institutions, finding those connections is difficult. Both classification and terminology vary, and GIS databases in particular are composed of records that have been accumulated on paper. In addition, there is a voluminous amount of unpublished gray literature with embedded images, maps and photographs. But the problems do not lie in access alone; the data is also represented differently internally.

They further state that, due to political boundaries, archaeology will remain a mosaic of provincial efforts in the future as well. This is one of the main motivations to build an integrated framework with customizable access points to methods and data that would help to overcome the current state of fragmentation. Against that background, interoperability is not only a technical goal but also a social project. Sharing design strategies can promote effective cooperation both on the level of human collaboration and on the level of electronic interaction. Especially because archaeological research deals with cultural heritage data, sustainability has to be established. Therefore, all host institutions should remain in control of their data. Digital libraries and the services offered should be made publicly available to researchers and whole organizations to store their data.

Figure 2.2 introduces a possible logical infrastructure consisting of data providers and service providers that process the data to offer advanced services on the raw data objects. Additionally, authority naming services will contribute information that can be exploited by software components or by the end user. Both Perseus and Arachne would form repositories that expose well-curated data objects and exhaustive metadata to the web community, possibly by using institutional repository software that will be introduced later. The repository software should implement a protocol such as OAI-PMH that is suitable for the dissemination of huge amounts of data. Institutional repositories (IRs) often also offer advanced services for scalability and durability of the data objects.

Authority naming services will provide specialized structured information on entities of the Greco-Roman world that cannot be covered by gazetteer services like the Getty Thesaurus of Geographic Names [35]. These services host knowledge that has been created by scientists over time and can be used by them to refer unambiguously to a specific entity. They should be rich in variants and languages to help with information retrieval, entity identification and translation of metadata.
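Why a naming service should be rich in variants and languages can be shown with a small lookup sketch. The authority records, entity keys and the normalization rule below are assumptions made for illustration only; they do not describe the Getty thesaurus or any existing service, and matching in practice would need to handle far more than diacritics and letter case.

```python
import unicodedata

# Hypothetical authority records: entity key -> name variants in several languages.
AUTHORITY = {
    "entity:athens": ["Athens", "Athen", "Athènes", "Athenae"],
    "entity:cologne": ["Cologne", "Köln", "Colonia Agrippina"],
}


def normalize(name):
    """Strip diacritics and letter case so that spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()


def build_index(authority):
    """Map every normalized name variant to the entity it denotes."""
    index = {}
    for entity, variants in authority.items():
        for variant in variants:
            index[normalize(variant)] = entity
    return index


def identify(name, index):
    """Return the entity key for a name, or None if no variant matches."""
    return index.get(normalize(name))


INDEX = build_index(AUTHORITY)
```

With such an index, a German record mentioning "Köln" and an English record mentioning "Cologne" are assigned to the same entity, which is the precondition for merging metadata across languages.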

The figure also demonstrates how one service provider (indexing) can become a data provider for a second service provider (search and image browsing). Service providers harvest data from institutional repositories to offer advanced services for that data. These could be either services that process large sets of data objects, like statistical analysis and indexing, or services that focus on single data objects to deliver representations like images in multiple formats. The figure shows an indexing service that consults authority naming services to perform entity identification for data merging. A second service obtains processed data from the indexing service to offer searching and browsing facilities.

Figure 2.2: Overall system architecture.

End users are equipped with pieces of software commonly called agents. The term “agent” is used here in a very broad sense for software that performs complex tasks. A software agent could be a web browser or more specialized software that can be controlled directly by a user. Either the tool (by configuration) or the user (for example by typing the address of a web page into the browser address field) has knowledge of the service provider and knows how to connect to and use the service. The agent can also run at a remote site and be controlled by the user through a browser. All these pieces of software offer useful services to the user, such as compilations of images of data objects, information about unambiguously identified entities, or metadata of the data objects themselves.

From a logical perspective it does not matter where the services live and where the data is stored as long as they are scalable, reliable, and accessible. Lately a new buzzword has emerged: distributed or grid computing. Although the term is used for a lot of things, it describes some requirements that are valuable for interoperability. Commonly the term is used to refer to different forms of distributed computing, a method of digital information processing that uses a logical layer to run different parts of a computer program simultaneously on distributed machines to gain performance.

However, processing “soft” cultural heritage data will lead to scalability issues. A grid infrastructure could help to exploit the resources of many separate computers that are connected by a network. A grid should be able to solve large-scale computing problems by virtualizing resources using a logical layer that mediates between resource consumers and resource providers. For example, large numbers of distributed physical hard drives could be logically connected to form one large volume that hosts huge amounts of image data. Additionally, and completely transparently for the user, this disk array could be plugged into a preservation system. This system would ensure that all data objects are stored redundantly and will be preserved over time.

The infrastructure described would be suitable for building high-level services that manage complex workflows without having to accept multiple media discontinuities (in German “Medienbrüche”). A new form of work environment could support scientists by offering a tool that supports a complex workflow, from targeted information search through compiling and arranging thoughts and ideas into chains of argumentation to online publishing. This agent would be able to use a set of services to support the key workflow steps. The German TextGrid project is one of the larger efforts to achieve this goal, focusing on the field of literary studies [28].

2.2 Implementing digital scholarship

But what is already out there? The following paragraphs deal with how the requirements introduced in the preceding section should be implemented. To approach this challenge it helps to have a look at paradigms that have emerged lately, especially the notions of Digital Libraries and Institutional Repositories. One fundamental paradigm to keep in mind is a process-oriented view of the overall infrastructure. Leveraging the interoperability capabilities from the metadata level to the resource level means supporting scholarly workflows like publication, citation and archiving of resources, not just information retrieval. An effective Cyberinfrastructure will provide functions for discovery, reference, dissemination, aggregation and other forms of reuse and exchange of resources while preserving intellectual property rights.

Today, many Web resources are dynamically created by scripting languages like PHP or by servlet technology. These can be considered part of what is commonly referred to as the Deep Web. Typically this data is managed in relational databases and compiled in a certain way to provide useful presentations to the human user. From the perspective of the Semantic Web, this approach has the disadvantage that crawling services like Google will not have access to this kind of content; it remains invisible. Only human beings can reveal the contents by operating a front-end in a certain way. The Semantic Web approach aims to link resources that conceptually belong together. But to be able to do this, all resources need to be publicly available to a certain community. The concept of an Institutional Repository is a step in this direction. Arachne currently approaches this problem by creating a sitemap that helps search engine bots to find objects that are buried within the architecture.1
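How a sitemap exposes dynamically generated record pages to crawlers can be sketched in a few lines of Python. The sketch assumes the public Sitemaps XML format with its `urlset`, `url` and `loc` elements; the record URLs under example.org are invented, and nothing is implied about how Arachne actually generates its sitemap.

```python
import xml.etree.ElementTree as ET

# Namespace of the public Sitemaps XML format.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Serialize a list of page URLs as a minimal sitemap document."""
    ET.register_namespace("", SITEMAP_NS)  # emit the namespace as the default one
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in urls:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical record pages, one per database identifier.
record_urls = [f"https://example.org/object/{n}" for n in (1, 2, 3)]
sitemap_xml = build_sitemap(record_urls)
```

In a real setting the list of URLs would be produced by a query over all object identifiers in the database, so that every record page becomes reachable without operating the search front-end.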

In a digital age, the primary function of institutions hosting cultural heritage material is to publicly offer data and services to their audience. Digital library is an emerging term that describes a set of software that can fulfill this task. The term has been used in many different ways in the past. Digital libraries hold collections of digital objects and provide means to rapidly access material in digital form. Additionally, the digital form facilitates new services on that data. While traditional libraries focused on the document as the most granular item needing to be accessed, digital libraries can also focus on the content itself. The content is either created digitally or digitized, for example by scanning and applying OCR software.2

A digital library has at its core some sort of institutional repository software like Fedora or DSpace [12, 3, 58]. Institutional repository software provides methods for collecting, preserving and disseminating the intellectual output of an institution, particularly a research institution. Institutional repositories also help to achieve interoperability of resources across institutions by providing programming interfaces that help with disseminating and federating items of the collections. They can also be used for implementing common services associated with digital libraries.

Since 2006 the Mellon Foundation has been funding an initiative that will develop specifications allowing distributed repositories to share digital objects [33]. In this context digital objects are considered units of scholarly communication, as opposed to the traditional definition, in which a scientific publication in printed form is one unit of scholarly discourse.

Fedora is an institutional repository that aims at building the foundation for digital libraries [12, 40, 3, 14]. Although the models that are developed by this initiative seem to be very ambitious, they point in a direction that is productive for the further development of digital scholarship in classics and archaeology. From this point of view, in archaeology, a set of metadata and images about an archaeological find can be considered a unit of scholarly communication. This bundle could be aligned with scientific annotations, leading away from traditional scholarly publication in this domain.

1 Google offers a set of tools for webmasters that facilitate the indexing of dynamically created content at https://www.google.com/webmasters/tools/docs/de/about.html.

2 One remarkable project for the cultural heritage area is the OCRopus project, which aims at covering pluggable layout and character recognition as well as statistical language modeling and multilingualism. In a later project phase, OCRopus is intended to recognize handwritten documents. More information can be found at http://code.google.com/p/ocropus/.


Fedora stands for Flexible Extensible Digital Object Repository Architecture. Modern digital libraries are supposed to host a large variety of heterogeneous digital objects. During the life-cycle of a digital object a number of management tasks like data creation, organization and dissemination have to be carried out. Fedora tries to reduce costs by providing a set of features that standardize these management tasks. According to the Fedora digital object model, a unit of information consists of one or more data streams. Each data stream could be another representation of a text or an image in a different resolution. Metadata that is associated with a digital object is stored as a separate data stream; multiple metadata formats, images and other data can be associated with one object using this mechanism. Fine-grained access-control policies for the management and access interfaces provide a security architecture. Internally, all data objects together with their data streams are serialized as XML files on a hard disk. This better supports complex tasks associated, for example, with digital preservation. Fedora therefore is one approach to providing a technical foundation for digital library software.
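The idea of a digital object that bundles several data streams and is serialized as XML can be approximated with a small data structure. This is a conceptual sketch only: the class names, attribute names and XML element names are invented for illustration and are neither Fedora's API nor its actual serialization format.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class DataStream:
    """One representation or metadata record attached to a digital object."""
    stream_id: str
    mime_type: str
    location: str


@dataclass
class DigitalObject:
    """A unit of information consisting of one or more data streams."""
    uri: str
    streams: list = field(default_factory=list)

    def add_stream(self, stream):
        self.streams.append(stream)

    def to_xml(self):
        """Serialize the object and its streams (invented element names)."""
        root = ET.Element("digitalObject", {"uri": self.uri})
        for stream in self.streams:
            ET.SubElement(root, "dataStream", {
                "id": stream.stream_id,
                "mimeType": stream.mime_type,
                "location": stream.location,
            })
        return ET.tostring(root, encoding="unicode")


# Hypothetical archaeological find with two image resolutions and one metadata stream.
find = DigitalObject("info:example/object/1")
find.add_stream(DataStream("IMG_HIGH", "image/tiff", "images/1_high.tif"))
find.add_stream(DataStream("IMG_LOW", "image/jpeg", "images/1_low.jpg"))
find.add_stream(DataStream("DC", "text/xml", "metadata/1_dc.xml"))
find_xml = find.to_xml()
```

Keeping metadata as just another stream is what allows several metadata formats to coexist on one object, for instance a Dublin Core record next to a CIDOC CRM description.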

Fedora also implements a couple of features that are of interest for providing Semantic Web services. Any type of relation that is expressed within the metadata of an object is indexed and can be queried using Semantic Web query languages like SPARQL.3 All data streams of digital objects can be associated with behavior for dynamic content delivery (for example image manipulation services or metadata crosswalks). Additionally, the management and access APIs (REST and SOAP) facilitate integration into different application environments. Furthermore, each digital object is associated with a unique URI during the ingest process, and a history of all modifications is stored together with the digital object. This enables references to a specific version of a digital object. Fedora supports the dissemination of all data streams, including metadata that is associated with any digital object of the managed collection, by implementing the OAI Protocol for Metadata Harvesting (OAI-PMH). This is a protocol developed by the Open Archives Initiative and used to collect metadata descriptions of resources for (indexing) services that need to use metadata from many sources [39, 13].

Rooted in the e-print community and well known in the context of Open Access,4 the OAI Protocol for Metadata Harvesting is based on a client-server architecture. Harvesting clients request data from repositories called “Data Providers”. “Service Providers” can then use this data to offer advanced services like indexing or other forms of advanced organization of that data. The metadata to be transported over a network can be in any format that can be serialized as XML and on which a certain community has agreed. Unqualified Dublin Core always has to be provided as well in order to facilitate a basic layer of interoperability. OAI-PMH claims to be one enabling infrastructure element for supporting new forms of scholarly communication.

3 SPARQL is a W3C recommendation published at http://www.w3.org/TR/rdf-sparql-query/.

4 The Budapest Open Access Initiative (http://www.soros.org/openaccess/) is recognized as the first visible development of Open Access.
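The harvesting step of OAI-PMH can be sketched from the perspective of a service provider. The sketch uses the protocol's `ListRecords` verb with the `oai_dc` metadata prefix for unqualified Dublin Core; the base URL, the record identifier and the hand-written sample response are assumptions for illustration, and no network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces used by OAI-PMH responses carrying unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the request a harvester sends to a data provider."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


# Hand-written, heavily trimmed response used instead of a live repository.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Statue of a youth</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def extract_titles(response_xml):
    """Return a mapping from record identifier to Dublin Core title."""
    root = ET.fromstring(response_xml)
    titles = {}
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        titles[identifier] = record.findtext(".//dc:title", namespaces=NS)
    return titles


request_url = list_records_url("https://example.org/oai")
harvested_titles = extract_titles(SAMPLE_RESPONSE)
```

A complete harvester would additionally follow resumption tokens for large result sets and handle protocol errors, both of which are omitted here.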

Digital libraries should provide multiple access methods to their collections as well as advanced services on the hosted content. For federated digital libraries, two fundamental search paradigms exist: distributed searching and searching an index of previously harvested metadata. Both ways of dealing with federated information systems face fundamental problems on the server side as well as on the client side. However, a harvesting approach is more appropriate for the purposes of this project. To exploit the full power of the CIDOC CRM, resources from different places have to be linked extensively. The processing steps that are required for this task will be performed much better if data objects are accumulated in one place.

Distributed searching involves a software component that is aware of a set of associated databases. Search criteria are encoded using a standardized client-server query protocol such as Z39.50.5 The associated information systems translate the query into an internal format and modify the results so that they conform to the standard. The results are then sent back to the querying component, which merges them. This approach delegates the indexing work to each connected database. Thus, the computing effort for index generation and searching is distributed. Since the search results have to be transferred back to the issuing query service, the network traffic during searching is higher. Control over how the index is created and how search results are weighted remains with each federated database system.
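The division of labor in distributed searching can be made concrete with a toy model in which every database answers the query itself and a broker only merges the hits. The class, the scoring rule and the sample records are invented for illustration; the sketch does not implement Z39.50 and replaces the network round trip with a plain method call.

```python
class Database:
    """Stand-in for one federated database with its own index and ranking."""

    def __init__(self, name, records):
        self.name = name
        self.records = records  # local identifier -> descriptive text

    def search(self, term):
        """Answer a query locally; the score is a simple term frequency."""
        hits = []
        for record_id, text in self.records.items():
            score = text.lower().split().count(term.lower())
            if score:
                hits.append({"source": self.name, "id": record_id, "score": score})
        return hits


def distributed_search(term, databases):
    """Send the query to every database and merge the returned hits."""
    merged = []
    for database in databases:
        merged.extend(database.search(term))  # one network round trip per source
    merged.sort(key=lambda hit: (-hit["score"], hit["source"], hit["id"]))
    return merged


# Hypothetical sources with a handful of records each.
db_a = Database("db-a", {"a1": "marble statue of athena", "a2": "bronze coin"})
db_b = Database("db-b", {"b1": "statue base with statue fragment", "b2": "iliad book one"})
results = distributed_search("statue", [db_a, db_b])
```

The broker has no influence on how each source computes its score, which mirrors the observation that control over indexing and weighting stays with the federated systems.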

Searching previously harvested metadata is basically supported by the OAI Protocol for Metadata Harvesting (OAI-PMH). In this scenario, a service provider harvests data from multiple associated data providers in advance and then builds a local index. This approach bears the disadvantage that all indexing work has to be done locally. But since the harvesting is done in advance, less network traffic is involved while the actual query is performed. In fact, indexing can be an ongoing process in this case. This approach does not delegate indexing work to the federated library institutions, and full control over how the index is technically created remains with the querying software system.
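The harvesting approach can be illustrated with a minimal sketch. It assumes the standard OAI-PMH request parameters (verb=ListRecords, metadataPrefix=oai_dc) and the OAI-PMH and Dublin Core XML namespaces; the base URL, the identifiers and the sample records are invented for illustration, and no network access is performed. The inverted index is a deliberately simple stand-in for whatever index a real service provider would build.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from urllib.parse import urlencode

# XML namespaces used by OAI-PMH responses that carry Dublin Core records.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample response; a real harvester would fetch this over HTTP.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Marble statue of Athena</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:example.org:2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Bronze statue of Zeus</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the harvesting request URL for a (hypothetical) data provider."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def parse_records(xml_text):
    """Return a mapping from record identifier to its concatenated titles."""
    root = ET.fromstring(xml_text)
    records = {}
    for rec in root.findall(".//oai:record", NS):
        ident = rec.findtext("oai:header/oai:identifier", namespaces=NS)
        titles = [t.text for t in rec.findall(".//dc:title", NS)]
        records[ident] = " ".join(titles)
    return records


def build_index(records):
    """Build a local inverted index: lower-cased token -> set of identifiers."""
    index = defaultdict(set)
    for ident, text in records.items():
        for token in text.lower().split():
            index[token].add(ident)
    return index


def search(index, term):
    """Query the local index only; no data provider is contacted."""
    return sorted(index.get(term.lower(), set()))


harvested = parse_records(SAMPLE_RESPONSE)
local_index = build_index(harvested)
```

Once the records are harvested, search(local_index, "statue") is answered entirely from the local index, which reflects the trade-off described above: the indexing work is done locally, but no network traffic arises at query time.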

2.3 The interoperability challenge

Cultural heritage databases use the specialized terminology of their respective domain of research in a certain national language. Moreover, the terminology and standards used may vary even within a single domain. Given that the information in each of the databases is of interest to a large community of people, efforts have to be made to overcome the current data integration problems that are caused by the described heterogeneity. Against this background it seems reasonable that the CIDOC CRM, delivering a set of standardized terms and properties, could serve as a basis for dealing with this heterogeneity.

5 Meanwhile, the Z39.50 standard is 20 years old and currently maintained by the Library of Congress (http://www.loc.gov/z3950/agency/).

Many projects have started experimenting with the CIDOC CRM for describing cultural heritage data in general and archaeological data in particular [18]. Integration of different cultural heritage vocabularies and descriptive systems is an ongoing research challenge in projects like BRICKS, EPOCH, SCULPTEUR, and IUGO.6 But currently only a few implementations exist that try to bridge the gap between more than one language and several data models at the same time. To overcome the lack of experience with implementing the CIDOC CRM as an intellectual concept and as a software system, Perseus and Arachne want to establish a robust implementation of a mapping workflow in the long term. This thesis reports on the launch of this collaboration by creating a prototype that explores the mapping of both databases to a shared metadata format.

Together, Perseus and Arachne host hundreds of texts, thousands of art objects, bibliographic records and large lists of named entities, especially of places and people [8, 9, 20]. Both project partners expect many benefits from integrating their collections by using open standards. Arachne hosts data about approximately 100,000 objects of antiquity and, in addition, over 200,000 images of these objects in a connected image repository.7 The Perseus Project comprises 6,000 well-described art and archaeology objects and additionally 36,000 images, but also approximately eight million words of Greek and Latin text encoded in TEI [34]. First, the integration of records on art and archaeology would provide a larger source of information to users, accessible through a common and multilingual interface. Second, to facilitate serious digital scholarly research, advanced services regarding those collections should be provided. Users may be interested in browsing passages of Pausanias’ Description of Greece (a text that is part of the Perseus digital collection) that refer to objects in Arachne. Or they may want to consult, for example, Smith’s Dictionary of Greek and Roman Antiquities, which is accessible online at Perseus, to rapidly acquire more information about a specific record in Arachne with just one or two mouse clicks. In a nutshell, data integration in this context would mean linking the Greek and Latin collections in Perseus to the Greco-Roman material in Arachne.

6 The BRICKS project (http://www.brickscommunity.org/) uses the CIDOC CRM for a software component that manages archaeological finds. EPOCH (http://www.epoch-net.org/) wants to develop a tool that maps from other metadata standards to the CIDOC CRM. The already completed SCULPTEUR project (http://sculpteur.it-innovation.soton.ac.uk/auth/login.jsp) used the CIDOC CRM as an internal data model for data integration among several European institutions. IUGO (http://iugo.ilrt.bris.ac.uk/) exploits Semantic Web tools to help locate informally related content of conferences.

7 August 2007; the numbers are constantly growing, since digitization projects are ongoing.

Galuzzi points out that museums traditionally present art objects with little context and according to specific curatorial decisions [25]. In the course of the change from analogue to digital media formats, he sees a chance to break with the traditional ways of documentation and information. One challenge of introducing reference models and ontologies like the CIDOC CRM is the re-contextualization of those objects by connecting them to other objects of the same or a different kind, such as ancient texts. This approach makes it possible to emphasize “conceptual similarities” among objects of classics and archaeology. It not only allows the user to find conceptually related objects, but also to navigate from one object to another by means of qualified links. The aim, therefore, must not be to imitate traditional forms of documentation in digital form, but to find new paradigms of data processing and presentation. Arachne and Perseus host unique but conceptually related data objects that could be linked meaningfully.

However, data is currently processed in completely different ways within each database; each institution has designed its own software to deal with its specialized data model. Both databases process data of considerable heterogeneity: sculptures, vases and entire buildings with their hierarchical arrangement, and of course large amounts of textual data. It does not seem reasonable or feasible to change the internal data models of all participating database systems. Therefore, an abstract mapping agent that can be configured to match each internal data model would certainly be a more rational approach. Such a mapping agent would have to be aware of both database schemas in order to translate data to a shared, structured vocabulary of terms. It has been argued that the belief in easily building such a mapping agent is naïve [57]. Therefore, one goal of the project was to estimate the feasibility of an abstract but adaptable mapping component.
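The idea of one abstract mapping component that is configured per database can be sketched as follows. This is only an illustration under stated assumptions: the internal field names, the shared terms and the sample records are hypothetical and are not taken from the actual Arachne or Perseus schemas, and a real mapping to the CIDOC CRM would produce a graph of entities rather than a flat dictionary.

```python
# Hypothetical configurations: internal field name -> term of the shared vocabulary.
ARACHNE_CONFIG = {
    "kurzbeschreibung": "title",
    "material": "material",
    "fundort": "find_place",
}
PERSEUS_CONFIG = {
    "name": "title",
    "medium": "material",
    "findspot": "find_place",
}


def map_record(record, config):
    """Translate one internal record into the shared vocabulary.

    Fields without a configured counterpart and empty values are skipped,
    so that incomplete records can still be mapped.
    """
    mapped = {}
    for field, value in record.items():
        shared_term = config.get(field)
        if shared_term is not None and value not in (None, ""):
            mapped[shared_term] = value
    return mapped


# The same function serves both databases; only the configuration differs.
arachne_record = {"kurzbeschreibung": "Statue", "fundort": "Rom", "intern": "x"}
perseus_record = {"name": "Statue", "medium": "marble", "findspot": ""}

mapped_arachne = map_record(arachne_record, ARACHNE_CONFIG)
mapped_perseus = map_record(perseus_record, PERSEUS_CONFIG)
```

Keeping the schema knowledge in configuration data rather than in code means that a change in one database schema only requires a change to the corresponding configuration.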

But how should this mapping agent be designed? In software engineering, flexibility is often described in terms of modularity, adaptability and maintainability. It is interesting that all three properties deal with the reduction of complexity. All become especially problematic when dealing with information systems that host and process cultural heritage data. In this context, information systems have to cope with rather complex and non-uniform, sometimes incomplete, sets of data. In addition, in cultural heritage research, functional requirements tend to evolve rapidly while information systems are in use by historians. As the understanding of the subject increases, new questions and requirements arise. A flexible information system must therefore be able to advance at the same pace as scientific methodology develops. This should be considered as early as the design phase, and since databases change, mapping components need to reflect this [4].


Chapter 3

A web of linked cultural heritage data

The issues described so far are saturated with concepts and ideas that are currently discussed under the notion of the “Semantic Web.” After the discussion of the intellectual requirements of digital scholarship and the presentation of means for implementation, this section identifies and describes state-of-the-art developments relating to the current World Wide Web. These are meant to facilitate new and better ways of scholarly communication. Although often criticized, current Semantic Web research articulates new and interesting ideas on how to deal with data that is to be published on the Internet. Also, a fruitful discussion is emerging on how to identify, describe, and retrieve Web resources in future interoperability environments. These means of identifying and retrieving resources could be the glue that ties distributed data together. However, the Semantic Web lacks larger integrated software solutions and is, today, mostly tool-based. Some of these tools greatly helped in exploring the usability of the Semantic Web for the interoperability prototype that was developed in the course of the project. In this section, the foundations of Semantic Web technology are first described on the basis of the traditional, and admittedly imprecise, Semantic Web “layer cake.” Then, to emphasize the importance of identification and representation, the latest information on this topic is presented and discussed. Finally, the Semantic Web tools that were used during the project are introduced.

3.1 Conceptual and technical requirements

“Insofar as the researcher critically assesses, modifies, or discards his idea, our methodological analysis could also be understood as a rational reconstruction of the corresponding psychological processes of thought [49].”

The World Wide Web Consortium defines the term Semantic Web in a surprisingly simple way: “The Semantic Web is a web of data.”1 Even more surprising is that this describes a condition that does not exist in the humanities today: a web of data. Why? A great deal of the cultural heritage data that is currently accessible on the Internet is controlled by software written for small and specialized audiences and tailored to a specific purpose. Furthermore, archaeological data is currently collected at a coarse level of granularity, as sets of documents. Today’s cultural heritage web can therefore be described as a web of linked documents, not as a web of linked data.

The “Web of Data,” as the “Semantic Web” should consequently be called, comprises all activities aimed at overcoming today’s unsatisfactory state. For that purpose, formal languages and software components need to be developed that deal with two aspects of data integration. Syntactic data integration physically combines data from different sources by accumulating data objects in one place, for example a central database. Semantic integration builds upon this foundation by ensuring that the data is interpreted and processed in a consistent way, namely as intended by the originator of the data. By this means, data from different sources can be combined and queried better than before.

Scientists who want to solve a scientific problem need a phase of creative thinking to collect ideas and materials that contribute to resolving it. Thus, they need to juggle a lot of information in their minds at once in order to study all aspects of the issue exhaustively. This is what Semantic Web technology is designed for: it is supposed to knock down the boundaries between different “silos” of information.2 The Semantic Web thus aims at allowing scientists to connect information in a seamless and networked way without the need to translate and transform between multiple media formats. Figure 3.1 shows the components of the Semantic Web.3 The layered depiction captures the notion that there are several levels, each of which builds upon a lower one.

Figure 3.1: Semantic Web layer cake.

The Unicode standard (ISO/IEC Standard 10646) assigns a distinct number to each letter (more generally, each character), independent of the platform (operating system), language or program that uses Unicode.4 Major IT companies have adopted Unicode, and other standards and technologies such as XML and Java support it. The concept of the Semantic Web builds upon Unicode characters for expressing strings. In order to interact with resources on the Internet, the Uniform Resource Identifier (URI) was introduced. A URI is a character string that unambiguously names or identifies material or abstract things of the “real” world, provided that there is a digital surrogate available.

1 This quote was taken from the basic introductory material about the Semantic Web that can be found at http://www.w3.org/2001/sw/.

2 Tim Berners-Lee expressed this idea in an interview that was published at http://www.businessweek.com/technology/content/apr2007/tc20070409_961951.htm.

3 The image was taken from http://www.w3.org/2001/09/06-ecdl/swlevels.gif.

4 A basic introduction to Unicode can be retrieved at http://www.unicode.org/standard/WhatIsUnicode.html.

URIs can be divided into two subcategories, Uniform Resource Names (URNs) and Uniform Resource Locators (URLs). While URLs are URIs that provide some additional information on how the reference to a resource can be resolved to an actual object, URNs only provide a unique name for a resource without information about where an agent can get a representation such as an image. An example of the latter are DOIs (Digital Object Identifiers).5 For DOIs, the DOI website itself provides a resolver that does not directly deliver HTML but redirects to a URL that can be resolved to an HTML page. However, many URLs are simply URIs that have a well-known resolver mechanism, the global Domain Name System. This system is so well established that it seems to be totally transparent.
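The distinction between names and locators can be made concrete with a small sketch that inspects the scheme of a URI. The set of schemes treated as locators is an assumption made for this illustration, and the ISBN-based URN is just a sample value; the Arachne URL is the hypothetical one used later in this chapter.

```python
from urllib.parse import urlparse

# Assumption for this sketch: these schemes imply a well-known way of
# locating and retrieving a representation.
LOCATOR_SCHEMES = {"http", "https", "ftp"}


def classify(uri):
    """Classify a URI as 'URL', 'URN' or 'other URI' by its scheme."""
    scheme = urlparse(uri).scheme.lower()
    if scheme in LOCATOR_SCHEMES:
        return "URL"
    if scheme == "urn":
        return "URN"
    return "other URI"


kind_of_locator = classify("http://arachne.org/object/30014")
kind_of_name = classify("urn:isbn:0451450523")
```

A name classified as a URN still needs a separate resolver before an agent can obtain a representation, whereas a URL can be dereferenced directly.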

5 If a user types in the DOI 10.1007/978-0-387-34347-1_6 at http://www.doi.org/, it will be resolved to the URL http://www.springerlink.com/content/h3800073756x7872/. This in turn delivers an HTML page with more information on the paper and a few more browsing facilities.

Metadata is data about data that is used to facilitate the understanding, use, and management of data. In the context of a digital library, a data object could be a digitized text. Metadata for this text would include, for example, information about the author, the publisher, or the number of pages. The Extensible Markup Language (XML) with its hierarchical structure can be used to attach such descriptive data to the data objects themselves. XML defines a basic syntax that can be used to structure documents on the Web.6 But XML does not provide any means to make assertions about the semantics of a document or its parts. XML Schema is a language that constrains the structure of an XML document and augments the XML standard with additional typing facilities.7 Whether data is considered self-contained or data about data depends on the context. One could imagine cases where metadata is the object of research. In this event, metadata about metadata would be highly valuable.
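As a minimal sketch of how the hierarchical structure of XML can attach metadata to a data object, the following example builds a small record for a digitized text. The element names (text, metadata, author, publisher, pages, body) and all values are invented for illustration and do not follow any particular metadata standard.

```python
import xml.etree.ElementTree as ET


def build_record(identifier, author, publisher, pages):
    """Wrap a digitized text and its descriptive metadata in one XML tree."""
    record = ET.Element("text", attrib={"id": identifier})
    metadata = ET.SubElement(record, "metadata")
    ET.SubElement(metadata, "author").text = author
    ET.SubElement(metadata, "publisher").text = publisher
    ET.SubElement(metadata, "pages").text = str(pages)
    # The data object itself sits next to the data that describes it.
    ET.SubElement(record, "body").text = "Full text goes here."
    return record


record = build_record("text-1", "Example Author", "Example Press", 120)
xml_string = ET.tostring(record, encoding="unicode")
```

The markup fixes the syntax and the nesting, but nothing in it tells a program what "author" means; that is the gap addressed by the higher layers discussed in this section.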

The lower layers of the model basically deal with questions of syntax, while the higher layers are concerned with interpreting the “meaning” of data. The term “semantics” has been used for a lot of things and has never been well defined. Moreover, there is no agreement on how the term relates to the concept of the Semantic Web. As mentioned earlier, the Semantic Web community has lately come to prefer the term “Web of Data” over “Semantic Web.” It can be said that the notion of “semantics” itself refers to the meaning that is expressed in some form of representation of information, for example in natural or formal language (metadata). Uschold states that the notion of “real world semantics” as defined by Ouksel and Sheth best captures the role of “semantics” in the orbit of the Semantic Web [60, 47]. According to this definition, objects within a model are mapped onto the perceptible world. Uschold then introduces a semantic continuum. According to his model, information can be encoded at different levels of detail, ranging from implicit, through explicit and informal, and explicit and formal for human processing, to explicit and formal for machine processing. Although the far right end of this continuum has not been reached today, there is a lot of value in encoding meaning explicitly and formally for human processing. This helps software developers to write software that is able to process a certain kind of shared data. In the end, the objective will be to build software that dynamically and autonomously resolves, by reasoning over concepts, the meaning of the data objects it encounters.

6 XML is a markup standard derived from SGML (ISO 8879). More information about XML can be found at http://www.w3.org/XML/.

7 XML Schema is a W3C standard and has been published at http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/structures.html.

To make explicit assertions about the semantics of a data object, a hierarchical markup language is insufficient. This is where higher-level standards like RDF and the notion of “ontologies” come in. Gruber defines a formal ontology as a designed artifact that was constructed for a specific purpose and is “evaluated against objective design criteria [29].” The meaning of “ontology” is a matter of controversy in the artificial intelligence field because the term also has a long tradition in philosophical discourse, where it alludes to the notion of existence. It has often been confused with epistemology, which refers to knowledge and the theory of cognition. In the context of knowledge sharing and reuse, an ontology can be defined as a specification of a conceptualization. Thus, an ontology is a description (like a formal specification of a computer program) of the concepts and relations within a domain that an agent (again a computer program) or a set of agents can evaluate to process data. By restricting the vocabulary used to express what is the case in a specific domain, ontologies facilitate interoperability between multiple pieces of software.8

The Resource Description Framework (RDF) mentioned above is another language. It defines a simple data model to describe resources and the relations that can exist between them. RDF provides elementary semantic concepts like objects and relations and can be expressed in XML, but also in other notations like Notation3 [30]. In RDF, information is represented as triples. A triple is an assertion that comprises a subject, a predicate, and an object. RDF Schema (RDFS) builds on top of RDF by providing a vocabulary for grouping objects into classes and for constraining the relations that may exist between class instances. Thus, RDFS is to RDF what XML Schema is to XML. It augments the semantics of RDF with hierarchical generalization and the definition of properties. It has enough semantic power to describe simple ontologies [5]. Since version 4.2 of the CIDOC CRM has been published as RDFS, both Perseus and Arachne data were exported to RDF and evaluated against the published RDFS document [31].
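The triple model and the hierarchical generalization added by RDFS can be sketched without any RDF library. In the following illustration, triples are plain Python tuples, and the resource and class names with the prefix ex: are hypothetical; they are not identifiers from the CIDOC CRM. Only the subclass part of RDFS is imitated, as a simplification.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

# Each triple is an assertion of the form (subject, predicate, object).
triples = {
    ("ex:object1", RDF_TYPE, "ex:Statue"),
    ("ex:Statue", SUBCLASS_OF, "ex:Sculpture"),
    ("ex:Sculpture", SUBCLASS_OF, "ex:ArtObject"),
    ("ex:object1", "ex:depicts", "ex:Athena"),
}


def superclasses(cls, triples):
    """Collect all direct and indirect superclasses of a class."""
    found = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for subject, predicate, obj in triples:
            if subject == current and predicate == SUBCLASS_OF and obj not in found:
                found.add(obj)
                stack.append(obj)
    return found


def types_of(resource, triples):
    """Return the asserted types of a resource plus the inferred ones."""
    direct = {o for s, p, o in triples if s == resource and p == RDF_TYPE}
    inferred = set(direct)
    for cls in direct:
        inferred |= superclasses(cls, triples)
    return inferred


all_types = types_of("ex:object1", triples)
```

Although only the type ex:Statue is asserted for ex:object1, a query for all instances of ex:ArtObject can find it, which is the kind of generalization the RDFS class hierarchy makes possible.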

OWL is a language that reaches beyond the abilities of RDFS, for example by defining further language elements to describe relations between classes (“disjointness”), cardinality restrictions (“exactly one”), equality, richer typing of properties, characteristics of properties (“symmetry”), and enumerated classes [42]. The concept of the Semantic Web includes three additional layers that have not been addressed extensively until now: Logic, Proof, and Trust. These three upper layers deal with advanced concepts that are irrelevant for the description of the CIDOC CRM. Therefore, they will not be dealt with further in this thesis.

It has been argued that the Semantic Web endeavor is too expensive and that nobody would be willing or even able to produce enough content to create sufficient uptake. Shadbolt et al. explain that “uptake is about reaching the point where serendipitous reuse of data, your own and others’, becomes possible [54].” They go on to say that, today, most projects lack this viral uptake. In most cases there are no stable URIs for objects, so the predicted revolution has not taken place yet. What is needed are small communities that have a pressing need for new technology. Could the cultural heritage sector be such a community?

Viral uptake would create a network effect. In information technology, the term “network effect” was coined by Metcalfe, the co-inventor of Ethernet [43].9 He argued that the cost of network cards is proportional to the number of cards installed, whereas the value of the network is proportional to the square of the number of users. These users can share access to expensive resources like storage. Transferred to the linked data idea, users could then share access to metadata about a uniquely identified resource that has already been annotated by others. A critical mass has to be reached to make the system useful for all users, because the value obtained from the infrastructure has to be greater than or equal to the price paid for establishing the building blocks of the overall system. A reasonable strategy could be to build a system that delivers value to users even without exploiting network effects. As the number of users increases, the system becomes more valuable to everybody. The scalability of these solutions can be enhanced almost without limit by introducing a peer-to-peer principle instead of hosting all data as a monolithic block on one server. In any case, by sharing unique identifiers everybody can add metadata to a specific entity and share it with the community.

8 Pidcock tries to clarify the distinction between a vocabulary, a taxonomy, a thesaurus, an ontology, and a meta-model at http://www.metamodel.com/article.php?story=20030115211223271.

9 For more information on applying Metcalfe’s law to the Semantic Web, refer to http://blogs.sun.com/bblfish/entry/rdf_and_metcalf_s_law.
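The critical mass argument can be written down as a small calculation. The sketch assumes, following the description of Metcalfe's argument above, a cost that grows linearly with the number of users and a value that grows with its square; the two coefficients are hypothetical and only serve to make the example computable.

```python
import math


def network_cost(users, cost_per_user):
    """Cost grows linearly with the number of users."""
    return cost_per_user * users


def network_value(users, value_coefficient):
    """Value grows with the square of the number of users."""
    return value_coefficient * users ** 2


def critical_mass(cost_per_user, value_coefficient):
    """Smallest number of users for which value >= cost.

    k * n**2 >= c * n holds for n > 0 exactly when n >= c / k.
    """
    return math.ceil(cost_per_user / value_coefficient)


# Hypothetical coefficients: 100 cost units per user, 0.5 value units
# per squared user.
threshold = critical_mass(100.0, 0.5)
```

Below the threshold the infrastructure costs more than it returns, which is why a system that already delivers some value to a single user has a better chance of reaching that point.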

3.2 Identification and representation of resources

Currently, every archaeologist can access the Arachne database to conduct research and to choose from a vast amount of information. It is also possible to cite Arachne as a source by mentioning the unique Arachne serial number together with some information that disambiguates the serial number within Arachne, for example information that distinguishes buildings from topographic entities. This enables the reader of a publication to write down the serial number and direct a browser to the Arachne website. After logging in, the reader can use the serial number to access the same information that the author obtained some time ago. This is one way to retrace the methodical approach that was used to compile the results in a publication. However, this traditional approach has a number of shortcomings and is complicated and time-consuming.

To be able to talk about a specific subject area that has an Internet representation, each object on the Web should be identified by a stable URI. This URI can then be used to reference the entity, for example for annotation purposes, or to resolve a digital representation of this resource (in Fedora terms, a data stream). Many web servers also support content negotiation. By exploiting this functionality, a software agent can state its preference regarding the representation of a Web resource. The web server can then deliver a representation in HTML, a machine-readable representation in RDF/XML, or a couple of images for the resource.

By using the traditional HTTP URL scheme for naming a Web resource, most Web-enabled programs will be able to rapidly retrieve a representation of the resource. An archaeological object in Arachne could, for example, be named by the URL http://arachne.org/object/30014. By exploiting the mechanism of content negotiation, a software agent could retrieve an RDF/XML representation and discover that there are multiple images connected to this resource. As a second step, the agent retrieves one image by dereferencing the URL http://arachne.org/images/482199 and indicating that compressed JPEG is the preferred format. When talking to the Apache HTTP server [23], for example, the agent would indicate this by including the string Accept: image/jpeg; q=1.0, application/rdf+xml; q=0.5, text/html; q=0.1 in the header of the request.10 By transmitting this string together with the request, the user agent expresses that for this request it prefers an image over a representation in RDF/XML. The remaining option, if all else fails, is a representation in HTML.
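How a server might evaluate such an Accept header can be sketched as follows. This is a simplified illustration and not the algorithm of the Apache HTTP server: wildcards such as */* and media type parameters other than q are ignored, and the set of available representations is an assumption of the example.

```python
def parse_accept(header):
    """Parse an Accept header value into (media type, quality) pairs."""
    preferences = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0]
        quality = 1.0  # default quality when no q parameter is given
        for parameter in fields[1:]:
            name, _, value = parameter.partition("=")
            if name.strip() == "q":
                quality = float(value)
        preferences.append((media_type, quality))
    return preferences


def choose_representation(header, available):
    """Pick the available media type with the highest quality value."""
    ranked = sorted(parse_accept(header), key=lambda pref: pref[1], reverse=True)
    for media_type, quality in ranked:
        if quality > 0 and media_type in available:
            return media_type
    return None


ACCEPT = "image/jpeg; q=1.0, application/rdf+xml; q=0.5, text/html; q=0.1"

# Hypothetical resource that has no image representation.
chosen = choose_representation(ACCEPT, {"application/rdf+xml", "text/html"})
```

With the header from the text, a resource that offers an image is served as image/jpeg, while a resource without an image falls back to RDF/XML and only then to HTML.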

Listing 3.1 demonstrates the process of content negotiation that can direct a client to the appropriate representation of a specific Web resource. In this particular example, the client tries to retrieve the URL http://dbpedia.org/resource/Berlin and indicates that it prefers an HTML page as the result. The server responds with a 303 message and provides another URL, which most browsers automatically retrieve in order to display the corresponding HTML page. This process is transparent to the user.11

Listing 3.1: The client requests an HTML representation.

    Krabat:~ rokummer$ telnet dbpedia.org 80
    Trying 160.45.137.85...
    Connected to dbpedia.org.
    Escape character is '^]'.
    GET /resource/Berlin HTTP/1.1
    Host: dbpedia.org
    Accept: text/html

    HTTP/1.1 303 See Other
    Date: Tue, 14 Aug 2007 12:05:12 GMT
    Server: Apache-Coyote/1.1
    Location: http://dbpedia.org/page/Berlin
    Content-Length: 0
    Content-Type: text/plain

    Connection closed by foreign host.

Listing 3.2 shows the client requesting the HTML page to which it was redirected. After the header fields, the HTML code follows, beginning with the html element.

Listing 3.2: The client retrieves the HTML representation.

    Krabat:~ rokummer$ telnet dbpedia.org 80
    Trying 160.45.137.85...
    Connected to dbpedia.org.
    Escape character is '^]'.
    GET /page/Berlin HTTP/1.1
    Host: dbpedia.org
    Accept: text/html

    HTTP/1.1 200 OK
    Date: Wed, 15 Aug 2007 13:36:36 GMT
    Server: Apache-Coyote/1.1
    Cache-Control: no-cache
    Pragma: no-cache
    Content-Type: text/html; charset=utf-8
    Transfer-Encoding: chunked

    5b4
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>

10 Apache supports content negotiation according to the HTTP/1.1 standard. More information on Apache content negotiation can be found at http://httpd.apache.org/docs/2.3/content-negotiation.html.

11 This example is inspired by the document “How to publish Linked Data on the Web” at http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/.

Listing 3.3 shows the client indicating that RDF/XML is the preferred representation. The server again responds with a 303 redirect, but this time to a URL that points to RDF/XML data.

Listing 3.3: The client requests an RDF representation.

    Krabat:~ rokummer$ telnet dbpedia.org 80
    Trying 160.45.137.85...
    Connected to dbpedia.org.
    Escape character is '^]'.
    GET /resource/Berlin HTTP/1.1
    Host: dbpedia.org
    Accept: application/rdf+xml

    HTTP/1.1 303 See Other
    Date: Tue, 14 Aug 2007 12:05:50 GMT
    Server: Apache-Coyote/1.1
    Location: http://dbpedia.openlinksw.com:8890/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=DESCRIBE+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FBerlin%3E
    Content-Length: 0
    Content-Type: text/plain

    Connection closed by foreign host.
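The server side of the redirects shown in the listings can be imitated with a small lookup function. This is a sketch under stated assumptions, not a description of how DBpedia is implemented: the table of representations and the example.org URLs are hypothetical, and the Accept value is matched literally instead of being parsed with quality values.

```python
# Hypothetical table: resource path -> {media type: URL of the document}.
REPRESENTATIONS = {
    "/resource/Berlin": {
        "text/html": "http://example.org/page/Berlin",
        "application/rdf+xml": "http://example.org/data/Berlin",
    }
}


def respond(path, accept):
    """Return (status code, headers) for a request to a resource URI.

    The resource itself is never delivered; the client is redirected with
    303 See Other to a document that describes it in the requested format.
    """
    variants = REPRESENTATIONS.get(path)
    if variants is None:
        return 404, {}
    location = variants.get(accept.strip())
    if location is None:
        return 406, {}  # no acceptable representation available
    return 303, {"Location": location}


html_response = respond("/resource/Berlin", "text/html")
rdf_response = respond("/resource/Berlin", "application/rdf+xml")
```

The same resource URI thus leads to different documents depending on the Accept header, while the identifier that is cited and linked stays stable.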

An alternative to addressing resources with HTTP URLs is to use a generic URI and to provide a service that resolves this URI to an appropriate representation. Whilst the use of URLs exploits existing technology, using generic URIs entails building resolving services. This is complex and cost-intensive but useful in some domains. A sample URL that resolves a URI is http://some.resolver.org/resolve?uri=arachne:objekt:4711&type=application/rdf+xml. Here the content negotiation part is visibly encoded within the URL. There will be a more in-depth description of this mechanism in section 4.2.

3.3 Semantic Web tools

Even if there is enough data represented in a way that can be easily exchanged and shared, there is still a need for software that is able to process the data. These software components are so-called agents. They serve to process Semantic Web data and to provide communication channels for resolving problems collaboratively, with one or more agents for each task. Many tools are evolving in the field of the Semantic Web. In fact, the number of tools that are supposed to deal with the technologies described has grown so fast that the W3C could not cope with the upsurge and decided to create a community-driven portal to keep track of the domain.12 Since most of these toolkits deal with and depend on RDF, RDF was chosen for implementing the mapping. Unfortunately, most of these tools come with little or no documentation, and there is little experience of how they deal with large amounts of data.

Shopping agents are degenerate examples of Semantic Web applications. On behalf of their users, they fulfill the fundamental task of comparing prices from disparate and heterogeneous but semantically related sources. They are degenerate because usually none of the sources has published its vocabulary. Shopping agents therefore usually need to scrape the information from multiple HTML pages. This results in additional work for software developers, since they always have to deal with individual data models. There is no format that everybody has agreed on, and a lot of semantics has to be hardwired into the agent software. Each time one of the participating vendors changes the appearance of its web page, the agent software needs to be adapted.

Throughout the project, multiple tools served to provide a better understanding of Semantic Web concepts and methods. The following describes the software components used. Protégé was helpful for approaching modeling techniques for ontologies, including the CIDOC CRM. The user gets an impression of what the RDF markup could look like if it were produced by an automated mapping algorithm. Strengths and drawbacks of different modeling approaches became visible after data objects had been created manually in the CIDOC CRM schema.13 The next tool, Jena, is a Java framework that supports the development of Semantic Web applications. It provides a programming environment for RDF, RDFS and OWL and includes a rule-based inference engine. Jena is open source and a result of the development efforts of the HP Labs Semantic Web Programme. There are a couple of frameworks available for Java and other programming languages, but Jena, currently with 11 developers and 24,600 downloads, appears to be one of the more active projects within the open source community.14 Eyeball is a part of the Jena framework that checks RDF models for common problems and is used within the project to check the CIDOC CRM markup before it is processed further by other software components. Among other things, it checks for unknown predicates and classes, bad namespaces, and ill-formed URIs. The Redland RDF Libraries provide a couple of command line tools that are useful for counting triples and reformatting RDF code. In this particular case, they were used to count the triples that were generated during the mapping efforts.15
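The kind of check described for unknown predicates and classes can be illustrated with a few lines of code. The sketch below is not Eyeball's implementation and does not use its API; it only imitates the idea on triples represented as Python tuples, and all names with the prefix ex: are hypothetical.

```python
def check_triples(triples, known_predicates, known_classes):
    """Report predicates and classes that the given vocabulary does not define."""
    problems = []
    for subject, predicate, obj in triples:
        if predicate == "rdf:type":
            if obj not in known_classes:
                problems.append("unknown class: " + obj)
        elif predicate not in known_predicates:
            problems.append("unknown predicate: " + predicate)
    return problems


# Hypothetical vocabulary and mapping output.
KNOWN_PREDICATES = {"ex:has_title", "ex:depicts"}
KNOWN_CLASSES = {"ex:Statue", "ex:Vase"}

generated = [
    ("ex:object1", "rdf:type", "ex:Statue"),
    ("ex:object1", "ex:has_title", "Statue of Athena"),
    ("ex:object2", "rdf:type", "ex:Stattue"),  # misspelled class name
    ("ex:object2", "ex:has_titel", "Vase"),    # misspelled predicate
]

problems = check_triples(generated, KNOWN_PREDICATES, KNOWN_CLASSES)
triple_count = len(generated)
```

Running such a check directly after the mapping step catches misspelled vocabulary terms before they reach other software components, and the triple count gives a quick plausibility check on the size of the output.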

12 The W3C maintains a wiki-style list of Semantic Web tools at http://esw.w3.org/topic/SemanticWebTools.
13 http://protege.stanford.edu/.
14 http://jena.sourceforge.net/.
15 http://librdf.org/.


Chapter 4

Standards for semantic interoperability

Many cultural historians are happy to conduct scientific research without having to think about formalized and shared conceptualizations. Developing formalized ontologies for easier exchange of knowledge involves more time and effort than doing things intuitively. The issue of building awareness of the advantages that ensue from using standards for digital representation of cultural heritage data still needs to be addressed. Formalizing knowledge with standardized systems not only allows it to be transferred and displayed over network connections, but also allows it to be enriched with annotations and behaviors like searching and browsing. However, Semantic Web concepts are not yet widely understood and accepted within the cultural heritage area, which currently limits the CRM’s potential.

Common conceptual models like the CIDOC CRM can be used in many ways. Guarino categorizes different uses of ontologies along a temporal and a structural dimension [52]. Along the temporal dimension, ontologies can be used at development time and at run time. At development time, ontologies can serve as a common language for software developers and domain experts. In this scenario, an ontology helps to model domain concepts as software components. By using standard vocabularies, the software usually achieves a higher degree of interoperability. Information systems that are ontology-aware use ontologies at run time. Some software agents recognize data that they encounter as being encoded according to a certain ontology. From a structural point of view, an ontology can be used at different levels of an application program or even pervade the whole information system: the database component, the application component, and the user interface.

Due to the respective focus of each project partner, the project focuses on material objects, ancient Greek and Latin texts, and the contexts that these can be linked to. To establish interoperability, multiple standards have to work together to cover the needs of a specific domain. While the CIDOC CRM was developed to represent information about objects, especially those managed by museums, a new version of Functional Requirements for Bibliographic Records (FRBR), FRBRoo, is being developed as an ontology aligned with the CIDOC CRM [17]. As an entity-relationship model, FRBR provides the means to accurately describe bibliographic information in a digital world. FRBRoo provides the means to express the IFLA FRBR data model with the same mechanisms and notations provided by the CIDOC CRM. The CIDOC CRM and FRBR harmonization, especially when extended with the Canonical Text Services protocol [50], will allow collections to integrate complex textual materials with extensive metadata about objects. The following sections introduce these standards.

Thus, the concept of the CIDOC CRM itself relies heavily on other forms of shared infrastructure and standards. Gazetteers, other domain-specific naming authorities, and controlled vocabularies provide the means for referencing and describing things and objects that form the context of material and textual objects. These registries still have to be developed and published so that a wide audience will be able to use these vocabularies by referencing entities and contributing to the content. Furthermore, service registries will connect all participating data providers and play a major role in data discovery.

4.1 Managing archaeological objects

“We have the vision of a global semantic network model, a fusion of relevant knowledge from all museum sources, abstracted from their context of creation and units of documentation under a common conceptual model. The network should, however, not replace the qualities of good scholarly text. Rather it should maintain links to related primary textual sources to enable their discovery under relevant criteria [15].”

Many standards have emerged that facilitate the representation of cultural heritage data, like the Getty Categories for the Descriptions of Works of Art or the Art Museum Image Consortium that operated until 2005 [2, 7]. In 2006, the CIDOC Conceptual Reference Model became the official standard ISO 21127:2006. The CIDOC CRM comprises definitions arranged as a structured vocabulary that were developed over a period of ten years by the CIDOC Documentation Standards Group. This group falls within the International Committee for Documentation (ICOM-CIDOC) of the International Council of Museums (ICOM). The CIDOC CRM provides a blueprint to describe cultural heritage and museum information. Therefore the CIDOC CRM will have a major role within the integration efforts of this project. It can help to analyze the data structures of the participating information systems and to identify common information contents.


Technically speaking, the CIDOC CRM is a hierarchy of 84 classes defining concepts that are commonly referred to in museum documentation practice. Each class describes a set of objects that share common features. A further 141 so-called properties define semantic relations between these conceptual classes. Thus, the CRM builds a foundation for semantic interoperability in the cultural heritage area [10]. Figure 4.1 shows a schematic overview of the most important concepts and the relations that can exist between them, according to the model.1 By adopting these concepts of formal semantics, the CIDOC CRM is well prepared to play a role in the development of the Semantic Web.

Figure 4.1: Conceptual overview of the CIDOC CRM.

The CIDOC CRM does not intend to prescribe how a certain community should document objects, even though it could serve as a guideline for good documentation practice. The goal is to facilitate a read-only integration of data, materially or virtually. While creating the CIDOC CRM, two design choices have been made to further enhance and facilitate data integration and to keep the whole vocabulary to a manageable size. First, as the result of a pragmatic approach to ontology design, the CIDOC CRM is property centric. By providing a large set of properties, richer semantics can be expressed than by using fine-grained hierarchies of classes as thesauri would do. Classes were thus only introduced to form the domains and ranges for properties. While an attribute is applicable to only one class instance, a relation always concerns two instances. Thus, the CIDOC CRM helps with modeling objects within their context instead of attaching isolated attributes. Second, it has been argued that explicitly including events in ontologies results in models that facilitate better integration of cultural contents [15]. Thus, the CIDOC CRM proposes events that tie objects and their contexts together. Figure 4.1 demonstrates how events link physical things, conceptual objects, places, timeframes, and actors. It goes without saying that a data structure that conforms to this paradigm is more difficult to create than a flat attachment of values to a data object.

1 The figure follows [16].

An ancient sculpture, for example, would be modeled as an instance of the class E24 Physical Man-Made Thing, a class that “comprises all persistent physical items that are purposely created by human activity [10].” It came into existence by an activity that in turn is an instance of the class E12 Production, “this class comprises activities that are designed to, and succeed in, creating one or more new items.” Both instances are connected by the property P108B was produced by, a property that “identifies the Physical Man-Made Thing that came into existence as a result of a Production Event.”
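The sculpture example can be written down as three subject-predicate-object statements. The following is only a minimal sketch: both namespaces and the local identifiers are hypothetical placeholders, the class and property names are simply the labels quoted above turned into URI fragments, and the serialization handles URI-only triples.

```python
# Minimal sketch: a sculpture and its production event as plain triples.
# CRM and DATA are hypothetical placeholder namespaces, not official URIs.
CRM = "http://example.org/crm/"
DATA = "http://example.org/data/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def crm_triples(object_id: str, event_id: str) -> list[tuple[str, str, str]]:
    """Return the three triples that tie an object to its production event."""
    thing = DATA + object_id
    event = DATA + event_id
    return [
        (thing, RDF_TYPE, CRM + "E24_Physical_Man-Made_Thing"),
        (event, RDF_TYPE, CRM + "E12_Production"),
        (thing, CRM + "P108B_was_produced_by", event),
    ]


def to_ntriples(triples) -> str:
    """Serialize URI-only triples in N-Triples syntax, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


if __name__ == "__main__":
    print(to_ntriples(crm_triples("sculpture/2389", "production/2389")))
```

The event is a resource of its own, which is what later allows places, dates and actors to be attached to the production rather than to the object.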

Data from different sources which follow this scheme can be processed more consistently, even if different sources deliver contradictory information. Unlike Dublin Core, the CIDOC CRM focuses on the cultural heritage domain and adds a class and property hierarchy to its vocabulary definitions. Additionally, attribute assignments can be linked to events so that the same attribute can be assigned twice with different values as a result of different measurement events. This situation is common when dealing with soft historical data. Assigning database objects to well-defined classes also facilitates searching for common objects that originate from different data sources.
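How event-linked attribute assignments keep contradictory values side by side can be illustrated with a small sketch. The record layout, the object identifier, the source names and the measured values are all invented for illustration; the CIDOC CRM itself expresses this with its own classes and properties rather than with the structure shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One measurement event: which source measured what, with which result."""
    object_id: str
    dimension: str
    value_cm: float
    source: str


def values_by_source(measurements, object_id, dimension):
    """Collect every recorded value for one attribute, keyed by its source."""
    return {
        m.source: m.value_cm
        for m in measurements
        if m.object_id == object_id and m.dimension == dimension
    }


# Two hypothetical catalogues disagree about the height of the same statue.
measurements = [
    Measurement("statue-1", "height", 182.0, "catalogue A"),
    Measurement("statue-1", "height", 185.5, "catalogue B"),
]
heights = values_by_source(measurements, "statue-1", "height")
```

Neither value overwrites the other, so an integrating system can present both and leave the judgment to the scholar.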

If certain communities find that class concepts like E24 are too broad in scope, more detailed classes can be agreed on, for example, in order to distinguish vases and buildings that both fall within the category E24. This is usually done by exploiting the extension mechanism of the CIDOC CRM. Certainly, a simple mapping of these concepts to E24 Physical Man-Made Thing would be unsatisfactory because information would get lost. Therefore, the CRM offers means to refine its high-level concepts by using the class E55 Type. This class can be used to attach a thesaurus-like hierarchy of terms to the standard data model. Because the extensions through E55 Type are community specific and not covered by the standard CIDOC CRM, they have to be documented and published as authority documents. Only once this has been done is seamless and automatic processing of the data assured.

The CRM offers two mechanisms to create more granularity for describing museum objects. One approach would be to define subclasses of the built-in CIDOC CRM classes like, for example, A1 Sculpture → is a → E24 Physical Man-Made Thing or A2 Building → is a → E24 Physical Man-Made Thing. Interestingly, the same can be done with properties. The other mechanism is to use a type hierarchy that can be constructed by using the class E55 Type. The class E55 Type is treated as universal and specific at the same time. This bears the advantage that a type can be discussed as an element of scholarly discourse (E83 Type Creation → P135 created type → E55 Type).
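The two extension mechanisms can be contrasted in a short sketch. The prefixed names are written as plain strings, the `ex:` identifiers are hypothetical, and the use of the CRM property P2 has type to attach the term is an assumption that should be checked against the CRM definition; no inference is performed.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"
E24 = "crm:E24_Physical_Man-Made_Thing"


def as_subclass(item, local_class):
    """Mechanism 1: declare a community-specific subclass of E24 and use it."""
    return [
        (local_class, SUBCLASS_OF, E24),
        (item, RDF_TYPE, local_class),
    ]


def as_typed_instance(item, type_term):
    """Mechanism 2: keep the standard class and attach a term of class E55 Type."""
    return [
        (item, RDF_TYPE, E24),
        (type_term, RDF_TYPE, "crm:E55_Type"),
        (item, "crm:P2_has_type", type_term),  # assumed property, see lead-in
    ]


def standard_class_visible(triples, item):
    """True if the item is stated to be an E24 directly, without any inference."""
    return (item, RDF_TYPE, E24) in triples
```

With the first mechanism a consumer has to follow the subclass statement to learn that the item is an E24; with the second the standard class is stated directly and only the refinement is community specific.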

But in some situations this approach seems to be too complicated, and the creation of subclasses or the use of publicly available and more specialized ontologies seems to be more feasible. This approach has the advantage that those ontologies are already published and often well documented. Defining subclasses and exploiting the type hierarchy has the disadvantage that both extension mechanisms are not covered by the standard, so that other information systems cannot exploit them “out of the box.” Hand-crafted extensions have to be documented and published so that others can easily retrieve the information and build their software accordingly. In any case, the CIDOC CRM becomes more powerful if it is used in connection with other ontologies like SKOS for attaching thesauri, which will be looked at in more detail below.

All properties of the CRM have definite domains and ranges that belong to the vocabulary itself. The CRM offers classes for describing people, places and bibliographic entities. This may seem as if the ontology claimed the authority not only for describing museum objects but also for covering most of their contexts. However, it does not seem useful to treat the CRM as an all-in-one device suitable for each and every purpose. Additionally, the CRM is an upper-level ontology and therefore cannot and does not intend to cover the peculiarities of each cultural heritage domain. Although it does provide an extension mechanism, variations and specializations have to be documented and published (preferably in a formal language). For each object, a specific unambiguous URI needs to be assigned. This information does not include hints on how that URI could be resolved into a human- or machine-readable representation. For example, the URI of an image does not include information on how to decode and display it. One could argue that this does not belong to the scope of the CRM and needs to be addressed on other layers like content negotiation as described above.

4.2 Linking to bibliographic information

The CIDOC CRM mainly concentrates on describing material cultural heritage and museum objects. But the value of this information source can be increased by linking the material objects to other sources of information like gazetteers or bigger bibliographic databases. Information about archaeological objects in Arachne, for example, is commonly drawn from publications. Thus, many objects are connected to bibliographic information and to other forms of structured vocabularies like material descriptions. The following paragraphs focus on standards for describing bibliographic objects, especially ancient Greek and Latin texts, and on how they could be linked by using the CRM vocabulary and structure.

FRBR is a conceptual entity-relationship model developed and maintained by the International Federation of Library Associations and Institutions [32]. In the model, entities are classified as “products of intellectual endeavor” (group 1), “custodianship of entities” (group 2), and “subjects” (group 3), with a strong emphasis on the first group. Within this first group, entities are classified as Works, Expressions, Manifestations, and Items. Works are defined as specific products of intellectual effort (Moby Dick), Expressions form realizations of this intellectual effort (German translation of Moby Dick), Manifestations are “the physical embodiment of an expression of a work” (btb edition of the German translation of Moby Dick), and Items form “a single exemplar of a manifestation” (my copy of the btb edition of the German translation of Moby Dick). The second group comprises concepts like person and corporate body which hold custodianship of entities belonging to the first group. The third group consists of concepts, objects, events, and places that appear as subjects of the first two groups. All entities can be connected by defining relationships that assist the user to “navigate” the web of information that is formed by a bibliography, catalogue, or bibliographic database. Finally, four user tasks are defined: find, identify, select, and obtain. Information systems should implement them as behavior to enable users to perform any of them on any entity or relationship.
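The group 1 hierarchy can be made concrete with the Moby Dick example from the text. This is a simplified sketch with invented class and field names; it models only the chain from Item up to Work and none of the other FRBR groups or relationships.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Work:
    title: str


@dataclass(frozen=True)
class Expression:
    work: Work
    description: str


@dataclass(frozen=True)
class Manifestation:
    expression: Expression
    description: str


@dataclass(frozen=True)
class Item:
    manifestation: Manifestation
    description: str


def chain(item: Item) -> list[str]:
    """Walk from a single copy up to the abstract work."""
    manifestation = item.manifestation
    expression = manifestation.expression
    return [
        item.description,
        manifestation.description,
        expression.description,
        expression.work.title,
    ]


moby_dick = Work("Moby Dick")
german = Expression(moby_dick, "German translation")
btb = Manifestation(german, "btb edition")
my_copy = Item(btb, "my copy")
```

Calling `chain(my_copy)` walks the four levels in the order in which the text introduces them in reverse, from the physical copy to the work.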

FRBR is a foundational data model that enables digital libraries to better provide the most basic functions to their user community. Users need to be able to identify multiple instantiations of primary texts. Therefore, the Perseus Project needed to know precisely how many editions, translations, and commentaries of canonical works such as the Iliad or Suetonius’ Lives of the Caesars are in its collections. The object hierarchy of Work, Expression and Manifestation served to encode the Iliad as a general work, its multiple translations and editions, treated as subclasses of Expressions, and multiple instantiations of these publications, for example page images and OCR or XML transcriptions, represented as Manifestations. On this basis, services can be built that help fulfill the required user tasks. Perseus successfully proved FRBR to be useful for its collections by implementing a FRBR catalog of its bibliographic assets [44].

FRBR was developed by librarians to support traditional user tasks in the world of digital libraries. Within this standard, the most granular layer of representing intellectual efforts is the “Item,” which means a single copy, a book that I can hold in my hands. However, FRBR alone is inadequate for classical studies. For decades scholars have been using elaborate citation schemes with unique identifiers to refer to particular chunks of text. A citation like “Il. 3.44” means book 3, line 44 of Homer’s Iliad. Canonical citation schemes like this generally point to the same text passage in different editions or translations of a specific ancient Greek or Roman text. This way of referring to a specific text passage was developed centuries ago and remains a useful instrument today. For these purposes the Canonical Text Services (CTS) have been developed [50]. They comprise a protocol and interface to facilitate sophisticated referencing and resolving of text passages. CTS implements a subset of the FRBR hierarchy and adds some extensions both upwards and downwards. Figure 4.2 demonstrates that, upwards, Textgroups can group a set of Work objects. Downwards, a citation mechanism has been added that is not necessary in the library domain but vital for classical studies.

Figure 4.2: Canonical Text Services compared to FRBR

Unlike archaeological finds, ancient texts form a relatively constant set of documents. Therefore assigning a URI with a global namespace has immediate network ramifications. Archaeological finds are “produced” in larger quantities all over the world, and therefore, each object should be equipped with a local URI by adding namespace information. If there is a canon of well-known monuments that scientists regularly refer to, a global URI should be assigned to them as well, perhaps automatically by using entity identification systems. The URI urn:greekLit:tlg0012.tlg001:1.10, for example, is a reference to line 10 of book 1 of the Iliad. This URI can be resolved by a resolving service like http://katoptron.holycross.edu/texttools/textbrowser/index.html?service=fucts&urn=urn:greekLit:tlg0012.tlg001:1.10. The resolver accepts two variables: service=fucts is used to select the Furman University resolving service (there is more than one) and urn=urn:greekLit:tlg0012.tlg001:1.10 represents the URI to resolve.
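The structure of such a reference and of the resolver request can be sketched as follows. The field names (`namespace`, `textgroup`, `work`, `passage`) are an interpretation of the single example given above and not taken from the CTS specification; only URNs of exactly this four-part, numeric-passage shape are handled, and no request is actually sent.

```python
from typing import NamedTuple
from urllib.parse import urlencode

# Resolver address as quoted in the text.
RESOLVER = "http://katoptron.holycross.edu/texttools/textbrowser/index.html"


class PassageRef(NamedTuple):
    namespace: str
    textgroup: str
    work: str
    passage: tuple[int, ...]


def parse_urn(urn: str) -> PassageRef:
    """Split a reference such as urn:greekLit:tlg0012.tlg001:1.10 into parts."""
    parts = urn.split(":")
    if len(parts) != 4 or parts[0] != "urn":
        raise ValueError(f"unexpected URN layout: {urn!r}")
    _, namespace, work_part, passage_part = parts
    textgroup, work = work_part.split(".", 1)
    passage = tuple(int(n) for n in passage_part.split("."))
    return PassageRef(namespace, textgroup, work, passage)


def resolver_url(urn: str, service: str = "fucts") -> str:
    """Build the request URL with the two variables the resolver accepts."""
    # safe=":" keeps the colons of the URN readable instead of percent-encoding them.
    return RESOLVER + "?" + urlencode({"service": service, "urn": urn}, safe=":")


ref = parse_urn("urn:greekLit:tlg0012.tlg001:1.10")
book, line = ref.passage  # book 1, line 10
```

Because the passage component is independent of any particular edition, the same tuple can address the corresponding lines in every edition or translation that a service holds.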

As mentioned above, whilst the CIDOC CRM was developed to formalize information about objects, especially those managed by museums, a new version of FRBR, FRBRoo, provides the means to express the IFLA FRBR data model with the same mechanisms and notations provided by the CIDOC CRM. For the purpose of this project this is a major breakthrough. It provides the first third-party with an integrated data model for textual and art and archaeological collections, both of which have undergone several years of development. Figure 4.3 shows how database records can be linked to ancient texts.

Figure 4.3: Linking CIDOC CRM and FRBRoo.


4.3 Linking to other forms of knowledge organisation

As outlined above, the value of digital surrogates of archaeological objects and bibliographic items can be enhanced by establishing qualified and machine-actionable links. These links can be exploited by software agents to provide services such as navigation, searching, and the like. This section deals with broadening the scope by adding other contexts that are relevant to archaeological data and ancient Greek and Latin texts. For decades, libraries have linked bibliographic data with knowledge organized as thesauri or other forms of structured vocabularies. These forms of knowledge organization could also be useful in providing contextual knowledge for archaeological data.

The amount of information that is available on the Web grows exponentially through users contributing structured and unstructured content. Although some people would find it helpful if this content were published as PDF documents, this adheres to the traditional document-based approach of publishing and is not beneficial to the idea behind the Semantic Web. The data that is to be published most likely contains some content that follows certain structural principles to express knowledge about a specific subject or domain. Formalized systems of knowledge organization like controlled vocabularies, taxonomies or thesauri can encapsulate that structure and should therefore be used to publish this content using Semantic Web languages. Vocabularies include Dublin Core for simple cross-domain information resource description, Friend of a Friend (FOAF) for machine-readable personal profiles, Description Of A Project (DOAP) for describing open-source projects, and the Simple Knowledge Organization System (SKOS), which is described in greater detail below [62].2

Within the family of formal languages, one appears to be helpful for the purposes of the project: the Simple Knowledge Organization System (SKOS). The SKOS language is defined using the terms of RDF and RDFS because its main purpose is to facilitate publishing any type of structured vocabulary on the Web, including thesauri, classification schemes, taxonomies, and subject-heading systems. Additionally, SKOS provides means for multilingual labeling of resources. This could facilitate publishing the concept schemes mentioned above in more than one language.
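Multilingual labeling in SKOS can be illustrated by rendering a single concept in Turtle syntax. The concept URIs and the marble example are hypothetical and do not come from an existing thesaurus; the helper below is a plain string formatter written for this sketch, not part of any RDF library, and it does not escape quotation marks inside labels.

```python
# Namespace declaration for the SKOS vocabulary.
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."


def skos_concept(uri, labels, broader=None):
    """Render one skos:Concept with language-tagged preferred labels as Turtle."""
    lines = [f"<{uri}> a skos:Concept ;"]
    for lang, text in sorted(labels.items()):
        lines.append(f'    skos:prefLabel "{text}"@{lang} ;')
    if broader:
        lines.append(f"    skos:broader <{broader}> ;")
    # Close the statement: replace the trailing ";" of the last line with ".".
    lines[-1] = lines[-1].rstrip(" ;") + " ."
    return "\n".join(lines)


if __name__ == "__main__":
    print(SKOS_PREFIX)
    print(skos_concept(
        "http://example.org/material/marble",
        {"en": "marble", "de": "Marmor"},
        broader="http://example.org/material/stone",
    ))
```

One concept carries one preferred label per language, so a bilingual thesaurus maps onto this pattern without duplicating concepts.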

Figure 4.4 shows how the ZENON thesaurus that provides access to bibliographic information could be linked to archaeological objects that are stored in Arachne. The German Archaeological Institute maintains three bilingual thesauri (English and German) in total to provide additional access methods to bibliographic assets. By harmonising the bibliographic databases of ZENON, Arachne, and Perseus, the bilingual thesauri could be exploited as additional multilingual access methods for the integrated metadata of Perseus and Arachne. Access methods that proved to be useful for one system can also be useful for other systems within the same domain. The current project would be enhanced by multilingual translations. SKOS, for example, could be a more elaborate language to express the ZENON thesaurus.

2 For further information about FOAF and DOAP, refer to http://www.foaf-project.org/ and http://usefulinc.com/doap/.

Figure 4.4: From A to Z, Arachne and ZENON.

The approach described above would also contribute to building large networked systems of organized and structured knowledge. Structured vocabularies that have been carefully compiled in the course of many research projects should not be isolated in individual software systems. This holds true not only for gazetteer systems but also for all forms of structured vocabularies such as lists of material descriptions of archaeological objects. The Arachne database contains a total of 800 different descriptions for materials, categorized in a hierarchical manner, which could each be equipped with a unique identifier. This information could also be exploited by other research projects. If a large section of the community used unique identifiers that are provided with each object, tangible and beneficial network effects would ensue. The linked data approach enables one to look up each Web resource, thereby making useful data available to the general public. Current projects that intend to contribute to the linked data paradigm are DBpedia, DBLP, and GeoNames.3

3 The DBpedia project draws on structured information from Wikipedia to make it available for browsing, semantic searching and harvesting (http://dbpedia.org/). The long-standing DBLP also provides linked data about bibliographic publications in the information technology area (http://www4.wiwiss.fu-berlin.de/dblp/). GeoNames is an ambitious gazetteer project that provides multilingual RDF for each geographic entity that it hosts (http://www.geonames.org/).


Chapter 5

Dealing with heterogeneity

Information integration needs to occur on several levels. Besides the syntactical integration that deals with physically accumulating data objects in one place, there are at least two more levels of semantic integration. The CIDOC CRM is designed to provide assistance in finding conceptual agreements among multiple cultural heritage information systems. But to use the classes and properties provided by the CIDOC CRM, the source data needs to fulfill certain quality criteria. Although the requisite steps cannot always be distinguished clearly, heterogeneity occurs at several levels during the process of data integration, namely at the schema and the instance level. This section takes a look at different forms of heterogeneity in these areas and how to deal with them.

5.1 Levels of heterogeneity

Most cultural heritage databases have, due to a strong commercial influence, been built upon relational data models managed by a corresponding database management system. However, relational databases are not suitable for rich semantic modeling and force software developers entrusted with this task to switch to other components of the overall information system. This invariably leads to an undocumented distribution of composition algorithms among several software components. If one component of the software system is changed, this in turn changes the meaning of the managed data objects themselves. This is one factor that needs to be considered when mapping internal data to a shared metadata format [37].

On top of the internal data model (storage and internal representation of factual knowledge), the application logic (first layer of interpretation) retrieves and recombines data for the graphical user interface (second layer of interpretation, interpretation of the data model). The layouts that display the information to the user in multiple views and the user’s implicit knowledge (third layer of interpretation) about the information system, including implicit conventions, have to be taken into account. The implementation of the CIDOC CRM impacts all levels, and an abstract mapping component needs to be aware of all of them to preserve every level of meaning presented.

It is questionable at the very least whether the complex and highly structured data contained in one information system (context I) can be transferred to another information system (context II) without loss of information (mind the structure!) unless the process of composition is preserved. Most current databases in the humanities do not meet halfway because their proprietary semantic modeling does not rely on standards. In future projects, it might be advisable to build awareness of shared data models like the CIDOC CRM into cultural heritage database systems. Since it appears to be too complex to map the whole data structure to a shared data model, it is important to identify those parts of the data model that are most important and valuable for integration. It is by no means practicable to map each detail of each data model; a reasonable level of detail has to be determined.

5.2 Heterogeneity on the schema level

Database management systems provide tools to impose constraints on managed data objects. This prevents the data from becoming inconsistent. Relational databases, for example, provide tables that contain records to model a small section of the world in a digital environment. Database records can be linked by defining relations between tables. Additionally, certain products provide tools to enforce the ACID paradigm (atomicity, consistency, isolation, durability). Therefore, by looking at the database schema, a human being or a software program can get an impression of how the data is structured. Unfortunately, data objects often contain additional formal and even informal structure that is not covered by the database schema. This section explores these forms of formal and informal heterogeneity on the database schema level and how to deal with them in the context of information integration.

5.2.1 Uniform representation of data models

To make data integration efforts more consistent and streamlined, a uniform syntax for the representation of different data models needed to be found. Therefore, the data models of Perseus and Arachne were exported to XML. The Extensible Markup Language provides a syntax that is able to represent heterogeneous data models while simultaneously providing uniform methods to process the exported dataset. This data can then be processed by using Extensible Stylesheet Language Transformations (XSLT), a language that can be used to reorganize XML documents.


Neel Smith at Holy Cross University, in collaboration with The Center of Hellenic Studies, is developing a network service called the Collection service that can expose different data structures as well as data objects as XML code [55]. A Collection service exposes sets of objects to network discovery and searches. Perseus already exposes its dataset using the Collection service. Since it could not handle the large amount of data that is hosted by Arachne, the MySQL Query Browser was used to export data objects to XML.1 In the future, however, both Perseus and Arachne could expose their data with this service. Listing 5.1 shows the output for one specific data object of the Perseus database. The 102-field data model has been reduced to some significant fields.

Listing 5.1: The Collection XML Code.

    <?xml version="1.0" encoding="utf-8"?>
    <QueryCollection>
      <request>
        <CollectionID>Artifact</CollectionID>
        <QueryCollectionXPath>/Artifact[id='2389']</QueryCollectionXPath>
        <ConfigFile>/usr/local/tomcat/webapps/collservice/buildServiceConfig.xml</ConfigFile>
      </request>
      <results>
        <SculptureArtifact id="2389">
          <authorityName>New York 30.11.33</authorityName>
          <name>New York 30.11.3</name>
          <type>Sculpture</type>
          <style>High Classical</style>
          <formStyleDescription>&lt;P&gt;The stele is crowned by a
            broad epistyle supporting a shallow, plain
            pediment.&lt;/P&gt;&lt;P&gt;Frel attributed this fine piece,
            and another in the Kerameikos (&lt;rs
            type="sculpture"&gt;Athens, Kerameikos P 1130&lt;/rs&gt;) to
            the work of the same sculptor, perhaps his so-called Dexileos
            sculptor. Clairmont agrees that the this piece and that in
            the Kerameikos should be attributed to the same hand, and
            this theory gains support from Herberts observations of the
            stylistic similarities between the two monuments and their
            inscriptions.</formStyleDescription>
          <title>Fragmentary stele of woman</title>
          <sculptureType>Stele, relief-decorated</sculptureType>
        </SculptureArtifact>
      </results>
    </QueryCollection>
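A response of the kind shown in Listing 5.1 can be read with a few lines of standard-library code. The sketch below works on a shortened copy of the listing (the XML declaration and the long description field are left out); the function name and the selection of fields are choices made for this illustration and are not part of the Collection service.

```python
import xml.etree.ElementTree as ET

# Shortened copy of Listing 5.1.
SAMPLE = """
<QueryCollection>
  <request>
    <CollectionID>Artifact</CollectionID>
  </request>
  <results>
    <SculptureArtifact id="2389">
      <name>New York 30.11.3</name>
      <type>Sculpture</type>
      <style>High Classical</style>
      <title>Fragmentary stele of woman</title>
      <sculptureType>Stele, relief-decorated</sculptureType>
    </SculptureArtifact>
  </results>
</QueryCollection>
"""


def extract_fields(xml_text, wanted=("name", "type", "style", "title", "sculptureType")):
    """Return one dictionary per result object, holding the selected fields."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for artifact in root.findall("./results/*"):
        record = {"id": artifact.get("id")}
        for field in wanted:
            # findtext returns None when the field is absent from the record.
            record[field] = artifact.findtext(field)
        records.append(record)
    return records


records = extract_fields(SAMPLE)
```

Each record is a flat dictionary, which is a convenient intermediate form before any mapping to a shared schema is attempted.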

5.2.2 Mapping data models

To gain a better understanding of the vocabulary definition and the structure used by the CIDOC CRM, experimental mappings of Perseus’ and Arachne’s archaeological artifacts have been compiled. The following section reports on the overall methodology, the workflow, and the issues and challenges identified during this process. Once a unified representation of data models has been found, the structure of the internal data model can be transformed into the new structure. This transformation results in a different data model while preserving the semantics. At the same time, however, the new structure needs to conform to a data model that all parties have agreed on. In the scope of the project described, the XML code needs to be processed in a way that results in markup that is compatible with the CIDOC CRM.

¹ http://www.mysql.de/products/tools/query-browser/

The need to map internal data to common conceptualizations is a challenge that many cultural heritage projects currently face. Consequently, multiple research projects have started to study the feasibility of abstract mapping software that can be adapted to the needs of individual databases. Within the EPOCH network, a mapping tool called the Archive Mapper for Archaeology (AMA) is being developed.² The AMA tool is meant to enable mapping other well-known standards to the CIDOC CRM. Unfortunately, this approach presupposes that the internal data model of a cultural heritage (CH) database follows a certain standard, which will not be the case within most institutions. Another open-source software framework for building digital libraries, Building Resources for Integrated Cultural Knowledge Services (BRICKS), includes an “Archaeological Pillar” with the implementation of the “Finds Identifier”.³ This software includes a mapping tool that is based on XSLT. Both mapping tools function well if the databases either deliver data objects of a certain quality or adhere to a certain standard.

As mentioned above, the CIDOC CRM defines a data model that predominantly focuses on events. However, the current documentation practice of Perseus and Arachne is not geared towards explicitly recording information about events. Nevertheless, whenever data is recorded about archaeological objects, this at least implicitly entails various events. For each attribute that was assigned to a specific data object, there is at least the assignment event itself. For each date of creation that is attached to a data object, there must have been a creation or production event. Current documentation practice ignores these events but implicitly records information about them, and this information needs to be extracted.

Kondylakis et al. introduced a mapping language for information integration [36]. It claims to cover the most frequent occurrences of heterogeneity and introduces a specific formalism that can be visualized. Figure 5.1 shows the application of this language in the context of the current project. The mapping language comprises the introduction of intermediate nodes, the contraction and extraction of compounds, the nesting of formerly parallel structures, the re-use of instances for different mappings, and conditional mapping. The latter addresses cases where the mapping of one field depends on the value of another field.

The first rule of Figure 5.1 is rather straightforward but demonstrates how mapping is performed. Each record of the Perseus art and archaeology table is mapped to the CRM concept E24 Physical Man-Made Thing, the field name authorityName maps to the property P47 is identified by, and the field value itself finally maps to the class E42 Object Identifier. This is an example of a simple one-to-one mapping operation. The second mapping rule models the period in which the archaeological artifact was crafted. In CRM terms, this involves a production event; consequently, mapping rules two and three show how an E12 Production Event is introduced to express the creation date of an artifact. Finally, rule four explains how style information can be mapped; this rule uses the CRM class E17 Type Assignment as an intermediate node and then attaches further information. This is an example of changing a parallel structure to a nested one.

Figure 5.1: Graphical representation of the mapping process.

² http://www.epoch-net.org/index.php?option=com_content&task=view&id=74&Itemid=120

³ There are no extensive publications about this particular mapping tool, but a brief introduction can be found at http://dev.brickscommunity.org/Archaeological_Sites

This mapping language presupposes data of a certain quality: it covers frequent but simple mapping problems for high-quality data sources and is far from able to deal with the special characteristics of the data models involved here. For the mapping project, these peculiarities are dealt with in the section on data quality.

These rules stem from a thorough analysis of the Perseus data model. Listing 5.2 illustrates what a semi-formal documentation of this analysis process might look like. The first step involved finding a set of fields that together need to be mapped to a different set of fields with a certain structure. Then, to help identify the meaning of a particular field, some representative sample values were extracted and documented; where a database field had several hundred values, only a sample was taken. After consulting the CRM definition document and matching the vocabulary definitions, a first mapping proposal was made and then elaborated iteratively. Finally, the overall process was commented on, and problems were documented for reconsideration in the next mapping iteration. In the listing, parentheses represent a constant value to be inserted and curly brackets represent the value of a specific database field.

Listing 5.2: Semi-formal mapping documentation.

    Affected fields:
    "style", "formStyleDescription"

    Sample values "style":
    "Archaistic", "Early Hellenistic", "High Classical", "High Classical]"

    Scope note "style":
    The epoch to which the style of the described artifact belongs.

    Sample values "formStyleDescription":
    "<P>A descendant of the riders of the Parthenon frieze.</P>",
    "<P>Beazley notes that the style is East Greek.</P>",
    "<P>Eclectic work, with a late Hellenistic male body, and a feminine head type</P>"

    Scope note "formStyleDescription":
    Provides more full-text information on the style assignment.

    Proposed mapping:
    P41B.was_classified_by
    E17.Type_Assignment (Style assignment of){authorityName}
    P42F.assigned
    E55.Type (style)
    P2F.has_type
    E55.Type {style}
    P3F.has_note
    E62.String (formStyleDescription)

    Other notes: The field "formStyleDescription" sometimes contains invalid XML data. Maybe a
    heuristic data cleaning tool could solve this problem. Additionally the field style
    contains misspelled words and additional characters.
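The notation of Listing 5.2 can be read mechanically. The following Python sketch is one possible reading of it, not the procedure that was actually used in the project: parentheses are expanded to their constant text and curly brackets are replaced by the value of the named field. The function name `expand` and the reduced record are assumptions made for illustration; the field values are taken from Listing 5.1.

```python
import re

# One record, reduced to the fields used in Listing 5.2 (values from Listing 5.1).
RECORD = {
    "authorityName": "New York 30.11.33",
    "style": "High Classical",
}

# Matches either "(constant)" or "{fieldName}" in a value pattern.
TOKEN = re.compile(r"\((?P<const>[^)]*)\)|\{(?P<field>[^}]*)\}")


def expand(pattern, record):
    """Expand a value pattern written in the notation of Listing 5.2.

    "(text)" inserts the constant text, "{name}" inserts the value of the
    database field called name. The parts are joined with single spaces.
    """
    parts = []
    for match in TOKEN.finditer(pattern):
        if match.group("const") is not None:
            parts.append(match.group("const").strip())
        else:
            parts.append(record[match.group("field").strip()])
    return " ".join(parts)


label = expand("(Style assignment of){authorityName}", RECORD)
style = expand("{style}", RECORD)
print(label)
print(style)
```

Keeping the patterns as plain strings means that a mapping proposal can be revised between iterations without touching the code that evaluates it.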

Figure 5.2 shows a UML diagram of the Perseus data model and Figure 5.3 an entity relationship diagram of the Arachne data model. The figures demonstrate how different modeling approaches result in very different data models. While Perseus preferred a clean model with just a few tables based on inheritance, Arachne focused on tight and explicit contextualization of archaeological objects, forsaking the clarity of the data model. The latter approach is based on the assumption that, for archaeological objects, the information lies not in the metadata alone but also in the qualified links to other objects.

Figure 5.2: The Perseus art and archaeology database UML diagram.

As previously mentioned, the Perseus data model relies heavily on inheritance. Therefore, as a test case, we decided to start by mapping those database fields that are relevant for all objects (that is, the fields of the class AtomicArtifact). While the mapping was being made, the problems that arose with specific fields were enumerated. These included semantic dependence, semi-structured and unstructured content, and dirty data. In general, Perseus’ and Arachne’s fields are designed for human viewing, not for machine processing. This results in low granularity of database fields and poses a challenge to extracting more granular data objects.

The processing instructions of XSLT were deemed suitable for covering the cases of heterogeneity that appeared within the Perseus data model. Thus, XSL Transformations were used to implement the mapping rules that were elaborated and documented as set out in Listing 5.2. To comply with current developments in the Semantic Web field, the data was transformed to RDF/XML. This enables most Semantic Web tools to process the data, as shown in section 6.2. Listing 5.3 shows an excerpt of the XSLT style-sheet that was used for mapping. The RDF wrapper is inserted in line 9. Since all objects stored in the database are mapped to E24 Physical Man-Made Thing, this element is inserted in line 10. After that, further templates are called (lines 18 and 19).

Figure 5.3: The Arachne ERD diagram.

Listing 5.3: Mapping implementation as XSLT style-sheet.

     1  <?xml version="1.0" encoding="ISO-8859-1"?>
     2  <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     3      xmlns:dc="http://purl.org/dc/elements/1.1/"
     4      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     5      xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
     6      xmlns:crm="http://cidoc.ics.forth.gr/rdfs/cidoc_v4.2.rdfs#" version="2.0" xml:lang="en">
     7    <xsl:output encoding="UTF-8" indent="yes"/>
     8    <xsl:template match="QueryCollection">
     9      <rdf:RDF>
    10        <crm:E24.Physical_Man-Made_Thing>
    11          <xsl:attribute name="rdf:about">
    12            <xsl:text>http://perseus.tufts.edu/artifact/</xsl:text>
    13            <xsl:value-of select="replace(//name, '[ &quot;;\[\]\+&lt;&gt;]', '_')"/>
    14          </xsl:attribute>
    15          <xsl:variable name="artifactID" select="//SculptureArtifact/@id"/>
    16          <xsl:variable name="queryLink"
    17              select="concat('http://134.95.113.200:8080/exist/xquery/artifact2img.xql?artifact=',$artifactID)"/>
    18          <xsl:apply-templates select="//style"/>
    19          <xsl:apply-templates select="document($queryLink)"/>
    20        </crm:E24.Physical_Man-Made_Thing>
    21      </rdf:RDF>
    22    </xsl:template>
    23    <xsl:template match="style">
    24      <xsl:if test="string-length()">
    25        <crm:P41B.was_classified_by>
    26          <crm:E17.Type_Assignment>
    27            <xsl:attribute name="rdf:about">
    28              <xsl:text>http://perseus.tufts.edu/assessment/</xsl:text>
    29              <xsl:value-of
    30                  select="replace(//authorityName, '[ &quot;;\[\]\+&lt;&gt;]', '_')"/>
    31            </xsl:attribute>
    32            <dc:title>
    33              <xsl:text>Style assignment of </xsl:text>
    34              <xsl:value-of select="//title"/>
    35            </dc:title>
    36            <crm:P42F.assigned>
    37              <crm:E55.Type>
    38                <xsl:attribute name="rdf:about">
    39                  <xsl:text>http://perseus.tufts.edu/styleType/</xsl:text>
    40                  <xsl:value-of select="replace(., '[ &quot;;\[\]\+&lt;&gt;]', '_')"/>
    41                </xsl:attribute>
    42                <dc:title>
    43                  <xsl:value-of select="."/>
    44                </dc:title>
    45                <crm:P2F.has_type>
    46                  <crm:E55.Type>
    47                    <xsl:attribute name="rdf:about">
    48                      <xsl:text>http://perseus.tufts.edu/styleType</xsl:text>
    49                    </xsl:attribute>
    50                    <dc:title>
    51                      <xsl:text>style</xsl:text>
    52                    </dc:title>
    53                  </crm:E55.Type>
    54                </crm:P2F.has_type>
    55              </crm:E55.Type>
    56            </crm:P42F.assigned>
    57            <xsl:apply-templates select="//formStyleDescription"/>
    58          </crm:E17.Type_Assignment>
    59        </crm:P41B.was_classified_by>
    60        <crm:P2F.has_type>
    61          <xsl:attribute name="rdf:resource">
    62            <xsl:text>http://perseus.tufts.edu/styleType/</xsl:text>
    63            <xsl:value-of select="replace(., '[ &quot;;\[\]\+&lt;&gt;]', '_')"/>
    64          </xsl:attribute>
    65        </crm:P2F.has_type>
    66      </xsl:if>
    67    </xsl:template>
    68    <xsl:template match="ref">
    69      <xsl:if test="string-length()">
    70        <crm:P138B.has_representation>
    71          <crm:E36.Visual_Item>
    72            <xsl:attribute name="rdf:about">
    73              <xsl:text>http://repository01.lib.tufts.edu:8080/fedora/get/tufts:perseus.image.</xsl:text>
    74              <xsl:value-of select="."/>
    75              <xsl:text>/Thumbnail.png</xsl:text>
    76            </xsl:attribute>
    77            <dc:title>
    78              <xsl:value-of select="."/>
    79            </dc:title>
    80            <dc:type>
    81              <xsl:text>image/jpeg</xsl:text>
    82            </dc:type>
    83          </crm:E36.Visual_Item>
    84        </crm:P138B.has_representation>
    85      </xsl:if>
    86    </xsl:template>
    87  </xsl:stylesheet>

As mentioned above, Perseus and Arachne, like most cultural heritage databases, do not explicitly record information about events in their data models. Therefore, implicit events had to be extracted to make use of all the concepts that the CRM has to offer. The template call in line 18 evaluates the Perseus field style; the template itself (lines 23ff) connects the E55 Type to an E17 Type Assignment. In line 57, another template is triggered that evaluates the Perseus field formStyleDescription to further specify the style assignment. This is the implementation for mapping a parallel structure to a nested one that clarifies the relationships among the fields.


On the Semantic Web, a Uniform Resource Identifier (URI) needs to be attached to each Web resource, or the resource needs to be defined as anonymous (blank node). If, during the mapping process, a node has been created that may need to be referred to, a URI has been assigned, even to events. Otherwise, an anonymous node has been created. Lines 11ff demonstrate the construction of unique and unambiguous URIs for a Web resource. In this example, the string http://perseus.tufts.edu/artifact/ has been concatenated with the value of the Perseus field name. A unique namespace identifying the Perseus art and artifact database has thus been combined with a unique identifier that points to a specific artifact. As a consequence, this forms a URI that is unique worldwide. The replace function in line 13 has been included to guarantee that the URI does not become malformed due to forbidden characters. An HTTP URL has been chosen to make it possible to provide a representation of each object in multiple data formats. However, this has not been implemented yet. Additionally, lines 32ff demonstrate how a human-readable string is provided along with the URI, which is meant for machine processing. Many software tools are aware of Dublin Core tags and can determine which portion of the information is meant for display, including the Longwell browser that is used within the present project to display mapped data objects.
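The construction of URIs described here can be mirrored outside XSLT. The following Python sketch uses the same character class as the replace call in line 13 of Listing 5.3; the function name `mint_uri` is an assumption made for this example.

```python
import re

# Same character class as the replace() call in line 13 of Listing 5.3:
# space, double quote, semicolon, square brackets, plus sign, angle brackets.
FORBIDDEN = re.compile(r'[ ";\[\]+<>]')


def mint_uri(namespace, local_name):
    """Concatenate a namespace with a local name, replacing forbidden characters by "_"."""
    return namespace + FORBIDDEN.sub("_", local_name)


artifact_uri = mint_uri("http://perseus.tufts.edu/artifact/", "New York 30.11.3")
style_uri = mint_uri("http://perseus.tufts.edu/styleType/", "High Classical")
print(artifact_uri)
print(style_uri)
```

The two results correspond to the identifiers in lines 6 and 37 of Listing 5.4. Replacing a fixed set of characters is a simplification; a more general solution could percent-encode the local name, for instance with `urllib.parse.quote`.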

The Perseus project hosts images of art and archaeology objects within a Fedora repository and also maintains an index that resolves an object identification number to one or more images depicting that object. For mapping purposes, this index has been ingested into an eXist database. Lines 16 and 17 construct a URL to an XQuery service that is called in line 19. The result document of this query is then evaluated by the template in lines 68ff.

Listing 5.4 shows the result of the mapping process as RDF/XML that can be validated against the RDFS definition document of the CIDOC CRM. Each mapped object has been equipped with a unique identifier and, additionally, a Dublin Core tag has been attached for human readability. The same has been done for events such as the one defined in line 34. Different attribute assignments could result in different conclusions about the style of an artifact; because each assignment is modelled as an event of its own, the metadata does not become contradictory. This helps with merging information during data integration: contradicting statements about an object can coexist.

Listing 5.4: The artifact as RDF/XML that conforms to the CIDOC CRM.

     1  <?xml version="1.0" encoding="utf-8"?>
     2  <rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/"
     3      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     4      xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
     5      xmlns:crm="http://cidoc.ics.forth.gr/rdfs/cidoc_v4.2.rdfs#">
     6    <crm:E24.Physical_Man-Made_Thing rdf:about="http://perseus.tufts.edu/artifact/New_York_30.11.3">
     7      <crm:P47F.is_identified_by>
     8        <crm:E42.Object_Identifier rdf:about="http://perseus.tufts.edu/identifiers/2389">
     9          <dc:title>2389</dc:title>
    10        </crm:E42.Object_Identifier>
    11      </crm:P47F.is_identified_by>
    12      <crm:P47F.is_identified_by>
    13        <crm:E42.Object_Identifier rdf:about="http://perseus.tufts.edu/identifiers/New_York_30.11.33">
    14          <dc:title>New York 30.11.33</dc:title>
    15        </crm:E42.Object_Identifier>
    16      </crm:P47F.is_identified_by>
    17      <crm:P48F.has_preferred_identifier rdf:resource="http://perseus.tufts.edu/identifiers/New_York_30.11.33"/>
    18      <crm:P102F.has_title>
    19        <crm:E35.Title rdf:about="http://perseus.tufts.edu/title/Fragmentary_stele_of_woman">
    20          <dc:title>Fragmentary stele of woman</dc:title>
    21        </crm:E35.Title>
    22      </crm:P102F.has_title>
    23      <crm:P2F.has_type>
    24        <crm:E55.Type rdf:about="http://perseus.tufts.edu/artifactType/Sculpture">
    25          <dc:title>Sculpture</dc:title>
    26          <crm:P2F.has_type>
    27            <crm:E55.Type rdf:about="http://perseus.tufts.edu/artifactType">
    28              <dc:title>type of artifact</dc:title>
    29            </crm:E55.Type>
    30          </crm:P2F.has_type>
    31        </crm:E55.Type>
    32      </crm:P2F.has_type>
    33      <crm:P41B.was_classified_by>
    34        <crm:E17.Type_Assignment rdf:about="http://perseus.tufts.edu/assessment/New_York_30.11.33">
    35          <dc:title>Style assignment of New York 30.11.33</dc:title>
    36          <crm:P42F.assigned>
    37            <crm:E55.Type rdf:about="http://perseus.tufts.edu/styleType/High_Classical">
    38              <dc:title>High Classical</dc:title>
    39              <crm:P2F.has_type>
    40                <crm:E55.Type rdf:about="http://perseus.tufts.edu/styleType">
    41                  <dc:title>style</dc:title>
    42                </crm:E55.Type>
    43              </crm:P2F.has_type>
    44            </crm:E55.Type>
    45          </crm:P42F.assigned>
    46          <crm:P3F.has_note>The stele is crowned by a broad epistyle
    47            supporting a shallow, plain pediment. Frel attributed this
    48            fine piece, and another in the Kerameikos (Athens,
    49            Kerameikos P 1130) to the work of the same sculptor,
    50            perhaps his so-called Dexileos sculptor. Clairmont agrees
    51            that the this piece and that in the Kerameikos should be
    52            attributed to the same hand, and this theory gains support
    53            from Herberts observations of the stylistic similarities
    54            between the two monuments and their
    55            inscriptions.</crm:P3F.has_note>
    56        </crm:E17.Type_Assignment>
    57      </crm:P41B.was_classified_by>
    58      <crm:P2F.has_type rdf:resource="http://perseus.tufts.edu/styleType/High_Classical"/>
    59      <crm:P138B.has_representation>
    60        <crm:E36.Visual_Item rdf:about="http://repository01.lib.tufts.edu:8080/fedora/get/tufts:perseus.image.79887/Thumbnail.png">
    61          <dc:title>79887</dc:title>
    62          <dc:type>image/jpeg</dc:type>
    63        </crm:E36.Visual_Item>
    64      </crm:P138B.has_representation>
    65    </crm:E24.Physical_Man-Made_Thing>
    66  </rdf:RDF>

In some cases, the CIDOC CRM insists on assigning identifiers to identifiers. It provides a class called E42 Object Identifier that is meant for attaching additional identifiers, such as museum inventory numbers, to archaeological objects. But each identifier needs its own URI, to which the identifier string is then attached. This leads to an inflation of identifiers that is typical for the Semantic Web. However, the generated markup can now be transported to a place where it can be used by many Semantic Web toolkits for display or more advanced processing, for example a triple store system with a faceted browser component such as Longwell.


5.3 Heterogeneity on the entity level

Mapping data to a common schema usually involves not only multiple database schemas but also multiple conventions for naming the things that are stored in the databases. It has been mentioned before that all databases use their individual means to represent data objects internally, and usually use several processing levels to mediate between internal representation and end-user experience. For mapping purposes, the best case is a normalized database with granular data objects. This makes it possible to map each data object to a shared data model one to one. Unfortunately, many cultural heritage databases, including both Perseus and Arachne, do not provide normalized data of sufficient quality. Therefore, additional intermediate steps need to be introduced to raise data quality to the level that mapping requires. This is done either by simply dropping data or by cleaning it through pre-processing.

5.3.1 Data extraction and data quality problems

Data cleaning is concerned with correcting anomalies in different data sources, especially in the context of information integration. Within the integration project of Perseus and Arachne, data is considered to be of high quality if it is suitable for mapping to the CIDOC CRM; only then does the quality of each data object satisfy its purpose. But the data has been entered into the databases manually, and only few constraints had been formulated. Therefore, database records certainly contain inconsistent data. To address this issue, Galhardas et al. introduced AJAX, an extensible tool for data cleaning [24]. In the course of their research, common problems with extracting data from legacy databases have been identified, such as schema-level and instance-level quality problems.

Some issues on the schema level are commonly avoided by database management systems. Wrong data types and missing data, for example, can be avoided by introducing strict types and declaring database fields as NOT NULL. However, issues that deal with the meaning of the database content cannot easily be avoided by database management systems. Someone could, for example, store a term that is too general for the scope of a specific database field, or that does not belong in that field at all. On the instance level, there are problems that concern either single records or multiple records. Someone could, for example, use a dummy value to circumvent the NOT NULL constraint, or misspell values. Duplicated or contradicting records result in inconsistent data that is difficult to find and browse. Another example is the use of inconsistent units for measurements.

In the course of the project, different problems were encountered that prevented a one-to-one mapping. Some of them have been classified as quality problems; others have been resolved by introducing simple pre-processing steps. On the schema level, for example, fields with overlapping meaning were found that needed to be merged. There were also database fields that did not use the NULL value consistently. On the instance level, values with inconsistent spelling and invalid structure were found (for example, bibliographic entries with nonuniform or missing separators). Listing 5.2 shows that the field style, for example, contains misspelled words. Two fundamental approaches have been chosen to obtain a set of clean database records for mapping. First, some of the identified inconsistencies could be resolved; in such cases an algorithm could fix the problem, for example by using regular expressions. Second, if the first approach failed or turned out to be too complex and therefore too expensive, the respective chunk of data was ignored and did not take part in the mapping process. In this case, the data that resides within the database should be corrected manually at a later stage. The following deals with further examples of fields that could not be mapped one-to-one.
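Before turning to these examples, the two approaches can be sketched in Python. The sketch assumes a small controlled list of accepted style terms, taken here from the sample values in Listing 5.2, and a hypothetical function `clean_style`; it illustrates the idea and is not the cleaning procedure that was used in the project. Values that can be repaired with a regular expression are repaired, all others are returned as `None` so that the caller can leave them out of the mapping.

```python
import re

# Assumption: a small controlled list of accepted style terms (from the
# sample values in Listing 5.2); a real list would be much longer.
KNOWN_STYLES = {"Archaistic", "Early Hellenistic", "High Classical"}

# Stray brackets and whitespace at the edges of a value, e.g. "High Classical]".
EDGE_NOISE = re.compile(r"^[\s\[\]]+|[\s\[\]]+$")


def clean_style(value):
    """Return a cleaned style term, or None if the value should be skipped."""
    candidate = EDGE_NOISE.sub("", value)
    candidate = re.sub(r"\s+", " ", candidate)   # collapse repeated whitespace
    if candidate in KNOWN_STYLES:
        return candidate
    return None                                  # left for manual correction


for raw in ["High Classical]", " Archaistic", "Hgih Classical"]:
    print(repr(raw), "->", clean_style(raw))
```

The misspelled third value is deliberately not repaired: guessing the intended term would require a heuristic of its own, which corresponds to the case that was judged too expensive above.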

First, some fields are involved in semantic dependencies: the field startMod qualifies the field startData (a terminus post quem) by stating that its value is an estimate. In many cases these functional dependencies cause the same field to be mapped to a different CIDOC CRM concept. Conditional mapping has to be applied here and can be implemented by using the XSLT <xsl:if test="expression"> tag. Second, there are fields with structured content: the field sourcesUsed has an internal structure that is not covered by the database schema. TEI markup was used to enumerate multiple bibliographic entries. Although this results in a database field with internal structure, the information can still be extracted by processing the TEI tags with an XML parser. Unfortunately, for some records, the XML markup that formed the internal structure was defective. The current mapping implementation ignores those cases; future implementations should try to fix this issue, for example by using heuristics. Third, there are fields with unstructured content: the field subjectDescription contains full text that includes references to modern and ancient people. It contains valuable information that cannot easily be mapped. It has been suggested by the CIDOC CRM community that authority lists for people and other named entities could help to draw more information from unstructured full-text descriptions [26].
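The second case, a field with embedded markup that is sometimes defective, can be sketched as follows. The element name `bibl`, the function name `extract_entries` and the sample strings are assumptions made for illustration; the actual structure of the sourcesUsed field is not documented here. The field is assumed to hold sibling elements without a common root, so a synthetic root element is added before parsing.

```python
import xml.etree.ElementTree as ET


def extract_entries(field_value, tag="bibl"):
    """Return the text of every <tag> element in a database field, or None.

    A synthetic <root> is wrapped around the field value before parsing.
    None signals defective markup; the caller can then skip the field.
    """
    try:
        root = ET.fromstring("<root>" + field_value + "</root>")
    except ET.ParseError:
        return None
    return ["".join(el.itertext()).strip() for el in root.iter(tag)]


good = "<bibl>Boschung 1993</bibl><bibl>second entry</bibl>"
bad = "<bibl>Boschung 1993<bibl>second entry</bibl>"   # first element never closed

print(extract_entries(good))
print(extract_entries(bad))
```

Returning `None` instead of raising an exception mirrors the behaviour described above, where records with defective markup are ignored for the time being and left for later correction.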

Figure 5.4 demonstrates how bibliographic entities have been extracted and mapped to fields of an intermediate data model. Within the scope of the project prototype, this is done by using regular expressions, but in future versions an XML parser should recognize the structure and do the mapping automatically. Most of the issues described above deal with informal or formal structure within database fields. Since this structure is not described by the database model, the legacy software resolves it at levels above the database component. A software developer designing a mapping component needs to understand these algorithms to produce an acceptable mapping.


Figure 5.4: Demonstration of the mapping workflow with a trivial pre-processing step.

In future versions, the mapping component should be able to process various kinds of commonly appearing heterogeneity. Some databases store lists of values delimited by special characters. Regular expressions could be configured and used to extract data with regular structure that is not covered by the database schema. XML markup within database fields is generally used to express structural coherence. This should be extracted by plugging in XML parsers. Furthermore, cultural heritage databases extensively use unstructured free text to attach relevant information that does not fit in other fields. Text parsers that can exploit authority naming services and that are aware of structured vocabularies could be used to extract relevant information. Internally, cultural heritage databases should invest in reorganizing and structuring their own data models and in enhancing data quality to better facilitate the re-use of data. If both a suite of mapping tools and databases with explicit formal structure were available, mapping would be much easier.
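A configurable extraction step for delimited value lists might look like the following Python sketch. The function name `split_values`, the default delimiters and the sample string are assumptions; which characters a given database uses as separators would have to be configured per field.

```python
import re


def split_values(field_value, delimiters=";|"):
    """Split a database field that stores several values in one string.

    `delimiters` lists the separator characters. Empty parts and the
    whitespace around each value are dropped.
    """
    pattern = "[" + re.escape(delimiters) + "]"
    return [part.strip() for part in re.split(pattern, field_value) if part.strip()]


values = split_values("Athens; Kerameikos | Aricia ;")
print(values)
```

Because the delimiters are passed as a parameter, the same function could be reused for different fields and databases by changing only the configuration.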

5.3.2 Entity Identification and record linkage

While the complexity of internal data models poses difficulties for conversion to the CIDOC CRM, there are even deeper challenges. The collections of Perseus and Arachne overlap for a number of objects, and for these there is a need for data analysis and fusion. Beyond the challenge of integrating heterogeneous data models and establishing a certain data quality lies the problem of heterogeneous data records. There is a need to identify semantically corresponding records, that is, two or more digital surrogates referring to the same entity in the world. This problem is usually referred to as entity identification or record linkage and has been described as a difficult and resource-consuming challenge [63].


Figure 5.5 shows two database records that have been mapped to the CIDOC CRM, one from Perseus and the other from Arachne. These two records are surrogates of the same archaeological object in the “real” world. The example demonstrates different forms of heterogeneity on the entity level. First, the problem of language becomes apparent. To describe the exact same object, Perseus uses “bust” while Arachne uses “Portraitkopf”, an English and a German word that are closely related. The same things are named differently because of permitted variation that results from different languages or accepted customs within a domain. Machine translation currently focuses on full-text data, but many hours of human effort went into structuring data by entering information into databases. This structured data could readily be exploited to help translate metadata into other languages and thereby establish better access to information. Second, there is a need to match “H 44 cm” in Arachne with “H .0433m” in Perseus, two comparable but not equivalent figures for the height of the bust. Third, there are different spellings of the place name “Aricia” / “Ariccia” in each record. Both records provide additional information that a gazetteer system could use to resolve both variants into a unique identifier like tgn,7007011 in the Getty Thesaurus of Geographic Names. Fourth, a system should also be able to establish that “Boschung 1993” and “D. Boschung, Die Bildnisse des Augustus, Das römische Herrscherbild I 2 (1993) 146 Nr. 80 Taf. 119, 3” most probably refer to the same bibliographic item.

Figure 5.5: Approaches to entity identification.


The name “Augustus” is the same in German and English, but none of the data presented in Figure 5.5 unambiguously indicates that this is “Augustus, Emperor of Rome, 63 B.C.-14 A.D.” with the unique identifier lccn:n79-33006. These are common problems that most cultural heritage databases have to cope with. By using either algorithm-based approaches (for measurable units) or the help of authorities (for names of places or people), entities could be identified by matching their properties, as shown in red in Figure 5.5. If a globally established identifier like a museum catalog number is already available, the process is straightforward; names and bibliographic references will be more complicated, though.
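The algorithm-based side of this matching can be sketched in a few lines of Python. The fragment normalizes height statements to metres, scores the similarity of two place-name spellings, and checks whether the tokens of a short citation occur in a full one. All function names, the unit table, and the assumed “H value unit” notation are illustrative assumptions; the sketch does not decide what tolerance or threshold would count as a match, and it does not replace a gazetteer or authority service.

```python
import re
from difflib import SequenceMatcher

# Assumed unit table; extend as needed for other notations.
UNIT_TO_METRES = {"mm": 0.001, "cm": 0.01, "m": 1.0}
HEIGHT = re.compile(r"H\s*(\d*\.?\d+)\s*(mm|cm|m)\b")


def height_in_metres(text):
    """Return the height stated in `text` in metres, or None if absent."""
    match = HEIGHT.search(text)
    if match is None:
        return None
    return float(match.group(1)) * UNIT_TO_METRES[match.group(2)]


def name_similarity(a, b):
    """Similarity ratio between 0 and 1 for two spellings of a name."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def short_citation_matches(short, full):
    """True if every token of the short citation occurs in the full one."""
    full_tokens = set(re.findall(r"\w+", full.lower()))
    return all(tok in full_tokens for tok in re.findall(r"\w+", short.lower()))


if __name__ == "__main__":
    print(height_in_metres("H 44 cm"), height_in_metres("H .0433m"))
    print(name_similarity("Aricia", "Ariccia"))
    print(short_citation_matches(
        "Boschung 1993",
        "D. Boschung, Die Bildnisse des Augustus, "
        "Das römische Herrscherbild I 2 (1993) 146 Nr. 80 Taf. 119, 3",
    ))
```

Once both heights are expressed in the same unit they can be compared numerically; whether two figures are close enough to support a link remains a policy decision for the linking system.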

Scholars have done entity identification and record linkage (for example for prosopographical research) for hundreds of years. They have created lists of names and attached thesauri or indices, a valuable resource for research in the field of cultural history. A scientist who extracts a name from a text wants to identify exactly the historical person that is referred to. This work has been done by humans for ages, and today we have the opportunity to do it in a machine-actionable way. It has to be managed in a way that makes the structured information publicly available and offers functions like searching and browsing. Authority naming services are pieces of software capable of fulfilling this task.

Authority naming lists contain a sufficient amount of information to establish a given author (entity) as unique while excluding information that, although perhaps interesting, does not contribute to this objective. The Functional Requirements and Numbering of Authority Records (FRANAR) build upon the Functional Requirements for Bibliographic Records (FRBR) and define several user tasks associated with authority services. According to this document, an authority system should support the user in identifying an entity, that is, in distinguishing between two or more entities with similar characteristics. Furthermore, it should be able to contextualize an entity by providing relationships between entities. These could be relationships between two or more persons, between a person and a corporate body, etc. [48].

It has already been said that record linkage is about finding database records (digital surrogates) that refer to the same entity in the non-digital world. They do not bear enough data to explicitly link them to a unique “real-world” entity and cannot be matched by trivial string comparison. The task of record linkage results in linked data, i.e. data that is somehow marked as belonging together, for example by assigning a common identifier. In historical research, record linkage was popular in the 1980s, when computers could help to study data sets like census records or parish registers. Today, it could contribute to linking large heterogeneous sets of data for Semantic Web purposes. Several approaches to record linkage have been developed since then, ranging from rule-based approaches to probabilistic methods like Naive Bayes algorithms. However, semi-automatic normalization of data to a common format is still an important task that certainly results in higher accuracy of machine-driven record linkage.
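Marking records as belonging together by assigning a common identifier can be illustrated with a small union-find sketch. It assumes that some earlier matching step (rule-based or probabilistic) has already produced pairs of record identifiers judged to refer to the same entity. The identifiers such as `arachne:1` are hypothetical, as is the function name `link_records`.

```python
def link_records(record_ids, matched_pairs):
    """Map each record id to a shared cluster identifier.

    `matched_pairs` is assumed to come from an earlier matching step.
    Records linked directly or transitively receive the same identifier.
    """
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Follow parent links to the cluster root, shortening the path.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller root so the result is deterministic.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    return {rid: find(rid) for rid in record_ids}


if __name__ == "__main__":
    ids = ["arachne:1", "perseus:7", "perseus:9"]
    print(link_records(ids, [("perseus:7", "arachne:1")]))
```

Because links are transitive in this sketch, a single false match can merge two unrelated clusters, which is one reason why prior normalization of the data matters.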

The above calls for identifying and collaborating with established naming authorities, and for enriching established vocabularies with more granular entities from the Greco-Roman world. The German Archaeological Institute, for example, hosts a vast amount of granular information about place names that is not yet exploited and that should be made public. Once published as a web-accessible resource, the entities can be connected to already established authorities like TGN or GeoNames, which currently cannot deliver information with appropriate granularity for the study of the ancient Greco-Roman world.

5.4 Implementing an overall mapping workflow

After introducing the tesserae that form the mapping process, this section concentrates on how they are tied together. Figure 5.6 shows how an overall mapping workflow should be implemented to contribute to a system that establishes interoperability. Although the model tries to separate the different steps of the mapping process, some steps cannot be clearly separated and need to interact. Entity linking, for example, depends on indexing.

Figure 5.6: The overall interoperability workflow.

First, both data models had to be represented in a uniform way for further processing. Perseus’ data has been exported to XML by using the Collection service. Since the Collection service could not handle the amount of data generated during an export of all of Arachne’s data, the MySQL Query Browser was used to export Arachne’s object, literature, and images tables. This export created more than 80,000 files, one for each data object, that had to be distributed over a hierarchical directory structure. Thus, there are definitely scalability issues in the mapping phase already. The export step has resulted in a one-to-one XML representation of the data models.
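One way to distribute such a large number of files over a directory hierarchy is to derive the directory names from the leading digits of each record identifier. The following Python sketch shows this idea; the two-level layout, the zero-padding, and the function name `shard_path` are assumptions and do not describe the layout actually used for the export.

```python
from pathlib import PurePosixPath


def shard_path(record_id, width=2, levels=2, suffix=".xml"):
    """Build a relative path such as 01/23/0123.xml for record 123.

    The id is zero-padded so that every record has at least
    `width * levels` characters to derive directory names from.
    """
    padded = str(record_id).zfill(width * levels)
    parts = [padded[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(*parts) / (padded + suffix)


if __name__ == "__main__":
    print(shard_path(123))
    print(shard_path(80000))
```

With numeric identifiers up to roughly 80,000, this layout places only a handful of files in each leaf directory, which avoids very large flat directories.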


The next step aimed at cleaning the resulting data-set. For building the mapping prototype, the Unix sed command has been used with various regular expressions for extracting bibliographic entities from different fields. Some XML code for bibliographic entries was not valid and had to be dropped until a tool is available that uses heuristics to fix the broken XML markup. Since Arachne maintains its own bibliographic database, extraction of bibliographic entities was easier on this side. In future versions, it would be good to experiment with professional data cleaning tools to extract more data from fields with informal internal structure. This step again resulted in an XML representation, but this time as an intermediate data model that can be processed more easily.
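The decision to drop broken markup can be sketched as a well-formedness check. The prototype used sed; the Python fragment below is only an illustration of the same filtering idea with the standard library XML parser, and the element names in the example (`bibl`, `author`) are hypothetical.

```python
import xml.etree.ElementTree as ET


def partition_well_formed(fragments):
    """Separate XML fragments into well-formed and broken ones.

    Broken fragments are kept in a second list so that a later repair
    tool could still process them instead of losing them silently.
    """
    good, broken = [], []
    for fragment in fragments:
        try:
            ET.fromstring(fragment)
        except ET.ParseError:
            broken.append(fragment)
        else:
            good.append(fragment)
    return good, broken


if __name__ == "__main__":
    good, broken = partition_well_formed([
        "<bibl><author>Boschung</author></bibl>",
        "<bibl><author>Boschung</bibl>",
    ])
    print(len(good), "well-formed,", len(broken), "broken")
```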

According to the mapping documentation, an XSLT style-sheet has been crafted that implements the mapping rules described. By processing each XML file with this style-sheet, the intermediate data model has then been mapped to RDF/XML conforming to the CIDOC CRM. Additionally, the “Eyeball” tool described earlier was used to validate the resulting RDF code against the published CRM RDF definition file. This mapping step also involved assigning unique identifiers to each material or conceptual object that was created during the mapping process, in accordance with RFC 3986.⁴
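Assigning identifiers that are syntactically valid under RFC 3986 mainly requires percent-encoding any characters of a local key that are not allowed in a path segment. A minimal Python sketch follows; the base URI `http://example.org/crm/`, the three-part path scheme, and the function name `mint_uri` are placeholders for illustration, not the identifiers used in the prototype.

```python
from urllib.parse import quote

# Placeholder namespace, assumed for illustration only.
BASE = "http://example.org/crm/"


def mint_uri(collection, kind, local_id, base=BASE):
    """Build a URI from a collection name, an object kind, and a local key.

    Each part is percent-encoded as a single path segment, so slashes or
    spaces inside a database key cannot change the path structure.
    """
    segments = [quote(str(part), safe="") for part in (collection, kind, local_id)]
    return base + "/".join(segments)


if __name__ == "__main__":
    print(mint_uri("arachne", "object", "Portraitkopf 80/3"))
```

Deriving the URI deterministically from the database key means that repeated mapping runs produce the same identifier for the same record.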

Having mapped each database record to a single RDF/XML file, the data has been prepared for merging. At this point in the process, all data objects have been cleaned for proper record linkage. The current implementation relies on a simple mechanism that copies the resulting files to a common directory. Thereafter, the RDF information has been merged by ingesting the files into the Longwell browser. Longwell ingests all RDF files and uses the “Lucene” search engine to connect objects that bear the same identifiers.⁵ This mechanism is useful because it accumulates everything that has been said about a specific entity, even if the information is distributed among different physical files. Currently, this is the only form of record linkage that has been achieved; the prototype does not perform multilingual entity identification, and the infrastructure that would make this step feasible is still missing.
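The effect of merging on shared identifiers can be illustrated without Longwell or Lucene. The sketch below represents statements as plain (subject, predicate, object) tuples and groups them by subject; it is a simplified stand-in for what an RDF store does, and the identifiers `ex:bust1` and `ex:other` are invented for the example.

```python
from collections import defaultdict


def merge_descriptions(*triple_sets):
    """Group statements from several sources by their subject identifier.

    Statements are only accumulated when the subjects are literally
    identical, which mirrors identifier-based merging and nothing more.
    """
    merged = defaultdict(set)
    for triples in triple_sets:
        for subject, predicate, obj in triples:
            merged[subject].add((predicate, obj))
    return dict(merged)


if __name__ == "__main__":
    perseus = [("ex:bust1", "crm:P3F_has_note", "bust")]
    arachne = [
        ("ex:bust1", "crm:P3F_has_note", "Portraitkopf"),
        ("ex:other", "crm:P3F_has_note", "relief"),
    ]
    for subject, statements in merge_descriptions(perseus, arachne).items():
        print(subject, sorted(statements))
```

The sketch also makes the limitation visible: two records about the same object are merged only if an earlier step has already given them the same identifier.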

Longwell was also used to visually present the results of the mapping process for debugging purposes. In this case, both indexing and presentation were achieved by ingesting the data into the Longwell browser software. The next chapter gives a more in-depth introduction to how Longwell has been configured to display cultural heritage data objects.

The mapping workflow presented was chosen to gain experience with the application of Semantic Web concepts to cultural heritage data and to explore the issues that are connected with it. The overall mapping process definitely needs more automation by implementing means to publish and harvest, index and present the data. Once this automation has been established, further steps need to be introduced. These include multilingual record linkage and the interaction with authority naming services for better linking of data objects and for accumulating multilingual metadata. This, in turn, would better facilitate services like cross-language information retrieval in very specialized domains like classics and archaeology. On the conceptual level, the mapping should be enhanced iteratively by including more database fields and extracting more information.

⁴ The full text of the Request for Comments can be found at http://tools.ietf.org/html/rfc3986.

⁵ http://lucene.apache.org/java/docs/.


Chapter 6

Knowledge visualization for the Semantic Web

The visualization of Semantic Web data poses an interesting challenge to software developers. Data structures of almost unlimited complexity need to be presented to users who usually are not aware of the underlying concepts of information representation. The CIDOC CRM, for example, promotes the consideration of events when modeling cultural heritage data. It has been argued that this approach facilitates better data integration and is therefore necessary. Although this method of describing data may be useful and logical, users probably will not immediately agree that it is necessary. This assumption is backed by the observation of current documentation practice, where events apparently are not considered necessary and are therefore not explicitly documented. This chapter explores means to process and visualize data that resulted from prior information integration. First, a survey of paradigms for visualizing Semantic Web data is undertaken. Then, the Longwell browser that was used to index and display the RDF/XML is introduced. Longwell has also been useful for exploring scalability issues with Semantic Web data.

6.1 Paradigms for visualizing linked data

When it comes to presenting data to the user, let's say a scientist engaged in cultural heritage research, a fundamental conflict has to be resolved. Maeda states that “simplicity is about subtracting the obvious, and adding the meaningful” [41]. RDF, however, facilitates the formulation of amazingly complex data models in which huge amounts of interlinked data objects can reside. How do we extract what is meaningful and useful for the end user? Visualization in information technology has always aimed at making explicit the coherence that would remain implicit without the application of smart algorithms.


Geroimenko et al. pioneered the area of visualization of Semantic Web data. They propose the extensive use of SVG and X3D to implement different visualization paradigms.¹ The identified application fields range from creating distributed user interfaces, through illustrating complex networks (for example citation networks), to sophisticated models for knowledge visualization by using dynamic SVG charts. One topic particularly interesting for archaeological research is the use of SVG and XSLT to display geo-referenced data on interactive maps [27]. Different user communities need to figure out which processing and visualization paradigms they require to create useful data presentations. The most straightforward approach to presenting Semantic Web data would be to generate a textual presentation of web resources that may be linked by the well-known HTTP link mechanism. By using a simple XSLT transformation, an RDF/XML document can be converted to an HTML file including links to other data objects. This strategy is pursued by browsers like Disco or the well-known Tabulator.²
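The textual presentation strategy can be sketched briefly. The text describes an XSLT transformation; the fragment below swaps in Python's standard library XML parser to show the same idea and is not how Disco or Tabulator work internally. It only handles flat `rdf:Description` elements whose properties are either literals or `rdf:resource` references, and the function name `rdf_to_html` is an assumption.

```python
import xml.etree.ElementTree as ET
from html import escape

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def rdf_to_html(rdf_xml):
    """Render the properties of each rdf:Description as an HTML list.

    Properties with an rdf:resource attribute become hyperlinks; all
    other properties are rendered as escaped literal text.
    """
    root = ET.fromstring(rdf_xml)
    items = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        for prop in description:
            label = escape(prop.tag.split("}")[-1])
            target = prop.get(f"{{{RDF_NS}}}resource")
            if target is not None:
                target = escape(target, quote=True)
                items.append(f'<li>{label}: <a href="{target}">{target}</a></li>')
            else:
                items.append(f"<li>{label}: {escape(prop.text or '')}</li>")
    return "<ul>" + "".join(items) + "</ul>"


if __name__ == "__main__":
    sample = (
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
        'xmlns:ex="http://example.org/terms#">'
        '<rdf:Description rdf:about="http://example.org/object/1">'
        "<ex:label>bust</ex:label>"
        '<ex:depicts rdf:resource="http://example.org/person/augustus"/>'
        "</rdf:Description></rdf:RDF>"
    )
    print(rdf_to_html(sample))
```

RDF/XML permits many equivalent serializations of the same graph, so a production transformation would more likely operate on parsed triples than on the XML tree.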

But there are more sophisticated approaches to displaying RDF data. Since RDF is based on the idea that all data can internally be represented as a graph, there is an almost unlimited number of ways to visualize the data. Robertson created several examples of visualizing cultural heritage data within the scope of his Historical Event Markup and Linking Project (HEML). Historical events can be displayed either on a map to emphasize the spatial element of an event or in a timeline to emphasize the temporal dimension. This particular example is interesting because the HEML language can easily be translated into CIDOC CRM and vice versa [53]. Other projects display data objects as nodes of a graph to emphasize the relation an object has to its surrounding contexts.

6.2 Faceted browsing using Longwell

The main objective of the project at hand has been to map two specialized data models to a data model that conforms to the CIDOC Conceptual Reference Model. This data can be shared and processed within multiple contexts. However, this does not guarantee the existence of tools that can process the data in a way that is meaningful and contributes to solving a specific scientific problem. One fundamental step towards this objective has been to visualize the data as soon as possible to get a better impression of how Semantic Web tools could deal with the data to be ingested.

¹ SVG is maintained at the W3C (http://www.w3.org/TR/SVG/). The X3D standard is defined at http://www.web3d.org/x3d/specifications/x3d_specification.html.

² A representation of “Berlin” in Disco as HTML can be retrieved at http://dbpedia.org/page/Berlin, the same resource as RDF/XML at http://dbpedia.org/data/Berlin. The Tabulator browser is available at http://www.w3.org/2005/ajar/tab.


Longwell is a Semantic Web browser using the faceted classification paradigm that was first introduced by the Flamenco Project at the University of California, Berkeley [46, 19]. This paradigm assigns each item a set of category terms from one or more facets. A facet is a set of categories; for example, archaeological artefacts could be classified under a facet “material” with categories such as “marble” and “bronze.” Unfortunately, Flamenco has its own proprietary data model and mark-up format. Therefore, it is not able to ingest RDF metadata that is published on the World Wide Web. Longwell is a web-based Semantic Web browser that runs standalone or within a Java servlet container like Apache Tomcat [22].
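The core of faceted browsing, counting the categories of a facet and narrowing the item set by selecting categories, can be shown in a few lines of Python. The sample items, the facet names, and the function names are invented for this sketch; it does not reflect how Flamenco or Longwell store or index their data.

```python
from collections import Counter

# Hypothetical sample items, each classified under two facets.
ITEMS = [
    {"id": "obj1", "material": "marble", "type": "bust"},
    {"id": "obj2", "material": "bronze", "type": "statuette"},
    {"id": "obj3", "material": "marble", "type": "relief"},
]


def facet_counts(items, facet):
    """Count how many items fall under each category of one facet."""
    return Counter(item[facet] for item in items if facet in item)


def select(items, **constraints):
    """Keep only the items that match every selected facet category."""
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in constraints.items())
    ]


if __name__ == "__main__":
    print(facet_counts(ITEMS, "material"))
    marble = select(ITEMS, material="marble")
    print([item["id"] for item in marble])
    print(facet_counts(marble, "type"))
```

After each selection the counts are recomputed on the remaining items, which is what lets a user see in advance how many results each further refinement would leave.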

Longwell is highly configurable in how it presents data to the user. The Fresnel³ display vocabulary can be used to change the appearance of items that are displayed within the browser [61]. Currently, most RDF browsers rely on their individual methods to approach two issues: selecting what information of an RDF graph will be displayed and how the data will be formatted. Fresnel can be used to facilitate concept-oriented browsing by explicitly displaying links to related objects. Listing 6.1 shows an abbreviated example of how the Fresnel language was used to tailor the output.

Listing 6.1: Fresnel configuration code in Notation3 (N3).

    @prefix fresnel: <http://www.w3.org/2004/09/fresnel#> .
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix facets: <http://simile.mit.edu/2006/01/ontologies/fresnel-facets#> .
    @prefix crm: <http://cidoc.ics.forth.gr/rdfs/cidoc_v4.2.rdfs#> .

    @prefix : <#> .

    :facets a facets:FacetSet ;
        facets:types facets:allTypes ;
        facets:facets ( rdf:type ) .

    :cidocFactets rdf:type facets:FacetSet ;
        facets:types ( crm:E24_Physical_Man-Made_Thing ) ;
        facets:facets (
            crm:P67B_is_referred_to_by
            crm:P53F_has_former_or_current_location
            crm:P44F_has_condition
            crm:P46B_forms_part_of
            crm:P103F_was_intended_for
            crm:P45F_consists_of
        ) .

    :cidocObjectLens rdf:type fresnel:Lens ;
        fresnel:purpose fresnel:defaultLens ;
        fresnel:classLensDomain crm:E24_Physical_Man-Made_Thing ;
        fresnel:showProperties (
            crm:P3F_has_note
            crm:P103F_was_intended_for
            crm:P53F_has_former_or_current_location
            crm:P44F_has_condition
            crm:P45F_consists_of
            crm:P67B_is_referred_to_by
            crm:P46B_forms_part_of
            crm:P138B_has_representation
        ) ;
        fresnel:group :gr .

    :cidocOjectImageFormat rdf:type fresnel:Format ;
        fresnel:propertyFormatDomain crm:P138B_has_representation ;
        fresnel:value fresnel:image ;
        fresnel:label "Images" ;
        fresnel:group :gr .

    :gr rdf:type fresnel:Group ;
        fresnel:label "CIDOC CRM standard group" ;
        fresnel:stylesheetLink <http://pentheus.perseus.tufts.edu/crm.css> .

³ The term “Fresnel” refers to the French physicist Augustin-Jean Fresnel, who constructed a special type of faceted lens for lighthouses.

Figures 6.1 and 6.2 exemplify how the Fresnel language can be used to change the appearance of data objects within the browser, including the display of images.

Figure 6.1: The Longwell Semantic Web browser, unconfigured.

Longwell is an example of using the underlying data model to control the user interface component of an application. In this example the CIDOC CRM is used both for internal information representation and for external user interface generation. If the underlying data model is changed or extended, the graphical user interface component will automatically reflect these changes without any additional effort.

Khurso and Tjoa found that Longwell, compared to other browsing and visualization tools, is one of the more scalable tools [38]. According to their experiments, Longwell is able to handle more than 500,000 triples. Currently, about 40 fields of Perseus’ 6,000 database records, plus one link to the picture database, are mapped to RDF/XML. This results in about 401,000 RDF triples, an amount of data that was indexed within a couple of hours. The data-set could be browsed with good performance. For Arachne, however, ten fields of the main object table with links to geographic entities and bibliographic information have been mapped, resulting in about 2,402,000 triples for about 60,000 archaeological objects, 6,000 bibliographic entries, and 5,000 records with place information. This amount of data could not be ingested into the Longwell browser; the ingestion process was stopped after 109 hours of computing time on a Mac Pro (3.0 GHz Quad-Core Intel Xeon 5300, 2 GB main memory). Performance experiments with a native in-memory store turned out to be promising.

Figure 6.2: The Longwell Semantic Web browser, configured with Fresnel.

But there are other alternatives that should be explored as well. A larger integration project for archaeological data would easily reach a magnitude of more than 30 million RDF triples. Portwin and Parvatikar state that the Jena API scaled up to 200 million triples during their project [51].⁴ This number of triples is enough for a small cultural domain but not enough for huge amounts of data worldwide.

Unfortunately, Longwell does not support any inferencing on the underlying ontology. Even when the CIDOC CRM definitions were ingested together with Perseus’ and Arachne’s metadata, no links from data objects to their defining classes were discovered and indexed. As a result, Longwell completely ignores the concepts of generalization and inheritance. For example, Longwell does not allow for displaying all persistent physical items and non-material products of human activity by selecting E71 Man-Made Stuff, the class under which they are subsumed. Instead, the user has to formulate a concatenated query that includes both classes. This prevents users from exploiting some of the most fundamental advantages of ontologies and thesauri.
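The missing inference step amounts to expanding a selected class to all of its direct and indirect subclasses before filtering. The following Python sketch shows such an expansion on a small, simplified excerpt of a class hierarchy. The excerpt (with E24 and E28 below E71, and E22 below E24) is written down here for illustration and should be checked against the CRM definition in use; the item typing and function names are likewise assumptions.

```python
# Simplified, illustrative excerpt of a subclass hierarchy:
# child class -> list of direct superclasses.
SUBCLASS_OF = {
    "E24": ["E71"],
    "E28": ["E71"],
    "E22": ["E24"],
}


def subclasses_of(cls, subclass_of=SUBCLASS_OF):
    """Return `cls` together with all direct and indirect subclasses."""
    result = {cls}
    changed = True
    while changed:
        changed = False
        for child, parents in subclass_of.items():
            if child not in result and any(p in result for p in parents):
                result.add(child)
                changed = True
    return result


def instances_of(cls, typed_items, subclass_of=SUBCLASS_OF):
    """Select items whose asserted class is `cls` or one of its subclasses."""
    wanted = subclasses_of(cls, subclass_of)
    return sorted(item for item, item_cls in typed_items.items() if item_cls in wanted)


if __name__ == "__main__":
    items = {"bust": "E22", "essay": "E28", "place": "E53"}
    print(instances_of("E71", items))
```

With this expansion, selecting the superclass returns the instances of both subclasses in one step, which is the behavior the concatenated query has to emulate by hand.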

⁴ At http://www.mkbergman.com/?p=227, M. K. Bergman states that 250 million triples is currently the “High Water Mark” for Semantic Web data.


Chapter 7

Conclusion

After evaluating functional requirements of digital scholarship, some building blocks of a future Cyberinfrastructure have been introduced. A distinction has been made that conceptually separates instances from entities. To conduct serious research, scientists need to refer to instances within primary sources to give evidence for their argumentation. They also need to make unambiguous assertions about, for example, historical places and persons. Thus, there is a need for a system that enables scientists to refer to specific entities. A complex software architecture including authority naming services and institutional repositories that build upon Semantic Web concepts could provide the functionality needed. Additionally, standards that facilitate networked knowledge organization systems have been examined. To better understand different conceptual and physical elements of the overall architecture, a basic mapping workflow has been established, ranging from data extraction, through cleaning and mapping, to visual presentation in the Longwell Semantic Web browser. Most current data models cannot instantly deliver the data in a way that can be processed for Semantic Web purposes. Common problems comprise dirty and unstructured data that could not be easily extracted. Additionally, many Semantic Web concepts are still not well understood and are complicated to implement using state-of-the-art web-server technology.

The Perseus art and archaeology database contains approximately 6,000 data objects. Each object is described by a subset of altogether 102 database fields. Some of these fields are administrative and only used internally, so that 94 fields qualify for mapping. Since the database hosts a high diversity of objects ranging from coins to buildings, only 34 fields were found to be relevant for all data objects. Therefore, the mapping experiment started with mapping those fields to the CIDOC CRM. Three fields contained structured bibliographic entities that could be easily extracted by trivial pre-processing. Four fields, however, contained mostly unstructured text with valuable information about places and people that has not been extracted. The mapping workflow certainly needs better automation and options for plugging in data quality and cleaning tools. The compilation of further and better mapping rules as well as pre-processing components will be an iterative and ongoing endeavor. The experience we gained with mapping Perseus’ data will help to better map the more than 100,000 data objects of Arachne. As a test case, the most important fields of three central Arachne tables (objekt, literatur, ort) have also been mapped to the CIDOC CRM.

Some problems with extracting data from both databases originate from fields with an implicit internal structure. For bibliographic information, the structure could be automatically discovered and items extracted. Because of poor data quality, some information had to be dropped and could not be mapped to the CIDOC CRM. The application of tools that can fix common data quality problems would result in a more comprehensive mapping result. A couple of tools are freely available in the public domain and commercial solutions also exist. But most problems require domain-specific knowledge and could probably be handled better by specialized software. However, investing in internal data cleaning and re-organization of data models would help with mapping cultural heritage databases to the CIDOC CRM.

Introduction of multilingual record-linkage tools could assist with automatically linking data objects that belong together. Perseus and Arachne have slightly overlapping collections. In this context, digital surrogates that refer to the same entity should be linked. This would result in accumulating multilingual metadata for these objects. Even if cross-language information retrieval tools are introduced, all metadata should internally be available in a certain language, for example, English.

Record-linkage is dependent on entity identification. If two bundles of metadata can be identified as referring to the same entity, the records can be linked as belonging together. The objective has to be not only to identify that a specific string refers to a person or a place, but also to which specific place or person. This will be carried out by assigning a common global identifier. The overall aim will be to automatically link data objects that conceptually belong together.

Entity-identification will become more powerful if it is done with the assistance of authority naming services. Hooking text parsers up to these services could extract information about people and places from full-text descriptions. However, in the course of the project, record-linkage could only be established on a very low level. Future research should concentrate on this area. Advanced record-linkage applications seem to be promising for contributing linked Semantic Web data.

For the time being, Perseus’ data has been published for harvesting at http://athena.perseus.tufts.edu/collection in three different representations (RDF/XML, HTML and Collection service XML). This data-set should be ingested into institutional repository software that is able to handle huge amounts of data. This could be the Fedora institutional repository software or a simple triple store with another publishing component. Each data object should be equipped with a persistent identifier. Fedora has the advantage of delivering many data management tools and facilitating long-term preservation. For publishing the data to a large audience, Fedora implements the OAI Protocol for Metadata Harvesting. Large repositories that facilitate discovery of RDF data are emerging.¹

Through the development of new and flexible ways of knowledge representation, historical cultural scientists will be enabled to refer to, access, and manage vast amounts of densely linked data objects as surrogates for existing cultural heritage objects. The most obvious benefit of putting granular data online is providing rapid and economic access not only to documents but also to granular metadata and large knowledge organization systems. To realize this vision, scientists will need to encode their documentation in a way that can be processed by machines and reused by other scientists. Moreover, the vast amount of material that has been published traditionally, in print or even handwritten, should be digitized in a way that contributes to the linked data idea. Named entity identification systems in addition to full-text parsers could be adapted to this task.

Although the Semantic Web is obviously an emerging field, current frameworks and browsers leave many issues unaddressed. Each tool is targeted at a certain display paradigm and provides only limited scalability, being suitable for research in the lab but not for a large production environment. Because Longwell separates data and display, it provides a promising paradigm for future research, provided that it will be able to overcome current scalability issues. Frameworks like the Jena API also show promise concerning scalability issues. However, individual communities need to choose a suitable display paradigm that provides useful access and contributes to their research objectives.

All tools that have been described so far could also be applied to publications that exist in digital form, either to link archaeological objects and ancient texts to secondary sources or to automatically create data objects in bulk. A fruitful area of research surely will be the development of tools that provide “Intelligent Information Access” for digital libraries. These are “technologies that make use of human knowledge or human-like intelligence to provide effective and efficient access to large, distributed, heterogeneous and multilingual (and at this time mainly [but not only] text-based) information resources and to satisfy users’ information needs” [6].

¹ http://pingthesemanticweb.com/ is a service which acts as a concentrator for multiple contributing repositories.


Bibliography

[1] A. Babeu, D. Bamman, G. Crane, R. Kummer, and G. Weaver. Named entity identification and cyberinfrastructure. In Proceedings of the 11th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2007), to appear, pages 259–270. Springer Verlag, September 2007.

[2] M. Baca and P. Harpring. Categories for the Description of Works of Art. http://www.getty.edu/research/conducting_research/standards/cdwa/, August 2006.

[3] J. Bekaert, X. Liu, H. Van de Sompel, C. Lagoze, S. Payette, and S. Warner. Pathways core: a data model for cross-repository services. In JCDL ’06: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital Libraries, pages 368–368, New York, NY, USA, 2006. ACM Press.

[4] O. Boonstra, L. Breure, and P. Doorn. Past, present and future of historical information science. Historical Social Research / Historische Sozialforschung, 29(2):4–131, 2004.

[5] Dan Brickley and R.V. Guha. RDF Vocabulary Description Language 1.0: RDF Schema. http://www.w3.org/TR/rdf-schema/, February 2004.

[6] J. Chen, F. Li, and C. Xuan. A preliminary Analysis of the Use of resources in intelligent information access research. In Proceedings 69th Annual Meeting of the American Society for Information Science and Technology (ASIST), volume 43, 2006.

[7] Art Museum Image Consortium. AMICO Data Specification. http://www.amico.org/AMICOlibrary/dataspec.html, 2004.

[8] G. Crane, D. Bamman, L. Cerrato, A. Jones, D. Mimno, A. Packel, D. Sculley, and G. Weaver. Beyond digital incunabula: Modeling the next generation of digital libraries. In Proceedings of the 10th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2006), volume 4172 of Lecture Notes in Computer Science. Springer, 2006.

[9] G. Crane, C. E. Wulfman, L. M. Cerrato, A. Mahoney, T. L. Milbank, D. Mimno, J. A. Rydberg-Cox, D. A. Smith, and C. York. Towards a cultural heritage digital library. In Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2003, pages 75–86, Houston, TX, June 2003.

[10] N. Crofts, M. Doerr, T. Gill, S. Stead, and M. Stiff. Definition of the CIDOC object-oriented conceptual reference model. Technical report, The CIDOC CRM Special Interest Group, 2005.

[11] DAI. Deutsches Archäologisches Institut. http://www.dainst.org, August 2007.

[12] H. Van de Sompel, C. Lagoze, J. Bekaert, X. Liu, S. Payette, and S. Warner. An Interoperable Fabric for Scholarly Value Chains. D-Lib Magazine, 12(10), October 2006.

[13] H. Van de Sompel, M. L. Nelson, C. Lagoze, and S. Warner. Resource Harvesting within the OAI-PMH Framework. D-Lib Magazine, 10(12), 2004.

[14] H. Van de Sompel, S. Payette, J. Erickson, C. Lagoze, and S. Warner. Rethinking Scholarly Communication. Building the System that Scholars Deserve. D-Lib Magazine, 10(9), September 2004.

[15] M. Doerr. The CIDOC conceptual reference module [sic!]: An ontological approach to semantic interoperability of metadata. AI Mag, 24(3):75–92, 2003.

[16] M. Doerr. The CIDOC CRM, a Standard for the Integration of Cultural Information. http://cidoc.ics.forth.gr/docs/crm_for_gothenburg.ppt, November 2005.

[17] M. Doerr and P. LeBoeuf. FRBR object-oriented definition and mapping to FRBR-ER. http://cidoc.ics.forth.gr/docs/frbr_oo/frbr_docs/FRBR_oo_V0.8.1c.pdf, May 2007.

[18] EPOCH. A Survey of Documentation Standards in the Archaeological and Museum Community. http://hdl.handle.net/2313/91, October 2006.

[19] Flamenco. The Flamenco Search Interface Project. http://flamenco.berkeley.edu/index.html, May 2007.


[20] R. Förtsch. ARACHNE - Datenbank und kulturelle Archive des Forschungsarchivs für Antike Plastik Köln und des Deutschen Archäologischen Instituts. http://arachne.uni-koeln.de/inhalt_text.html, August 2007.

[21] R. Förtsch. Forschungsarchiv für Antike Plastik. http://www.klassarchaeologie.uni-koeln.de/abteilungen/mar/forber.htm, August 2007.

[22] The Apache Software Foundation. Apache Tomcat. http://tomcat.apache.org/, May 2007.

[23] The Apache Software Foundation. The Apache HTTP Server Project. http://httpd.apache.org/, August 2007.

[24] H. Galhardas, D. Florescu, D. Shasha, and E. Simon. AJAX: An Extensible Data Cleaning Tool. In SIGMOD ’00: Proceedings of the 2000 ACM SIGMOD international conference on Management of data, page 590, New York, NY, USA, 2000. ACM Press.

[25] P. Galuzzi. The virtual museum of the Future. In Semantic Web for scientific and cultural organisations: results of some early experiments, June 2003.

[26] M. Genereux and F. Niccolucci. Extraction and mapping of CIDOC-CRM encodings from texts and other digital formats. In The 7th International Symposium on Virtual Reality, Archaeology and Cultural Heritage (VAST), Nicosia, Cyprus, 2006.

[27] V. Geroimenko and C. Chen. Visualizing Information Using SVG and X3D. XML Based Technologies for the XML Based Web. Springer, London [et al.], 2nd edition, 2004.

[28] P. Gietz, A. Aschenbrenner, S. Büdenbender, F. Jannidis, M. W. Küster, C. Ludwig, W. Pempe, T. Vitt, W. Wegstein, and A. Zielinski. TextGrid and eHumanities. In E-SCIENCE ’06: Proceedings of the Second IEEE International Conference on e-Science and Grid Computing, pages 133–141, Washington, DC, USA, 2006. IEEE Computer Society.

[29] T. R. Gruber. Towards Principles for the Design of Ontologies Used for Knowledge Sharing. In N. Guarino and R. Poli, editors, Formal Ontology in Conceptual Analysis and Knowledge Representation, Deventer, The Netherlands, 1993. Kluwer Academic Publishers.


[30] I. Herman, R. Swick, and D. Brickley. Resource Description Framework (RDF) / W3C Semantic Web Activity. http://www.w3.org/RDF/, January 2007.

[31] ICS-FORTH. Partial Definition of the CIDOC Conceptual Reference Model version 4.2 in RDF. http://cidoc.ics.forth.gr/rdfs/cidoc_v4.2.rdfs, June 2005.

[32] IFLA Study Group on the Functional Requirements for Bibliographic Records. Functional Requirements for Bibliographic Records: Final Report, volume 19 of UBCIM Publications-New Series. K. G. Saur, München, 1998.

[33] Open Archives Initiative. Open Archives Initiative Protocol - Object Reuse and Exchange. http://www.openarchives.org/ore/, August 2007.

[34] The Text Encoding Initiative. TEI: Yesterday’s information tomorrow. http://www.tei-c.org/, August 2007.

[35] Getty Institute. The Getty Thesaurus of Geographic Names Online. http://www.getty.edu/research/conducting_research/vocabularies/tgn/index.html, August 2007.

[36] H. Kondylakis, M. Doerr, and D. Plexousakis. Mapping Language for Information Integration. Technical report, ICS-FORTH, December 2006.

[37] R. Kummer. Integrating Data from The Perseus Project and Arachne using the CIDOC CRM: An Examination from a Software Developer’s Perspective. In Exploring the Limits of Global Models for Integration and Use of Historical and Scientific Information - ICS Forth Workshop, Heraklion, Crete, October 2006. ICS-Forth.

[38] S. Kushro and A. Tjoa. Fulfilling the Needs of a Metadata Creator and Analyst – An Investigation of RDF Browsing and Visualization Tools. Canadian Semantic Web, pages 81–101, 2006.

[39] C. Lagoze and H. Van de Sompel. The Open Archives Initiative: Building a Low-Barrier Interoperability Framework. In ACM/IEEE Joint Conference on Digital Libraries, pages 54–62, 2001.

[40] C. Lagoze, S. Payette, E. Shin, and C. Wilper. Fedora: An Architecture for Complex Objects and their Relationships. http://arxiv.org/abs/cs.DL/0501012, August 2005.

[41] J. Maeda. The Laws of Simplicity (Simplicity: Design, Technology, Business, Life). The MIT Press, August 2006.


[42] D. L. McGuinness and F. van Harmelen. OWL Web Ontology Language Overview. http://www.w3.org/TR/owl-features/, February 2004.

[43] B. Metcalfe. Metcalfe’s Law: A Network Becomes More Valuable as it Reaches More Users. InfoWorld, October 1995.

[44] D. Mimno, G. Crane, and A. Jones. Hierarchical catalog records: Implementing a FRBR catalog. D-Lib Magazine, 11(10), 2005.

[45] American Council of Learned Societies. Our Cultural Commonwealth: The final report of the ACLS Commission on Cyberinfrastructure for the Humanities and Social Sciences. http://www.acls.org/cyberinfrastructure/, December 2006.

[46] Massachusetts Institute of Technology. Longwell. http://simile.mit.edu/wiki/Longwell, August 2007.

[47] A. M. Ouksel and A. P. Sheth. Semantic Interoperability in Global Information Systems: A Brief Introduction to the Research Area and the Special Section. SIGMOD Record, 28(1):5–12, 1999.

[48] G. Patton. FRANAR: A Conceptual Model for Authority Data. Cataloging & Classification Quarterly, 38(3/4):91–104, November 2004.

[49] K. Popper. Logik der Forschung. Springer, 1935.

[50] D. Porter, W. Du. Casse, J. W. Jaromczyk, N. Moore, R. Scaife, and J. Mitchell. Creating CTS Collections. In Digital Humanities, pages 269–74, 2006.

[51] K. Portwin and P. Parvatikar. Scaling Jena in a commercial environment: The Ingenta MetaStore Project. In 2006 Jena User Conference, 2006.

[52] N. Guarino. Formal Ontology and Information Systems. In N. Guarino, editor, Proceedings of the 1st International Conference on Formal Ontologies in Information Systems, FOIS’98, 1998.

[53] B. Robertson. The Historical Event Markup and Linking Project. http://www.heml.org/heml-cocoon/, August 2007.

[54] N. Shadbolt, T. Berners-Lee, and W. Hall. The Semantic Web Revisited. Intelligent Systems, IEEE [see also IEEE Intelligent Systems and Their Applications], 21(3):96–101, 2006.


[55] N. Smith. Collection services. http://chs75.harvard.edu/projects/diginc/techpub/collections, August 2007.

[56] D. R. Snow, M. Gehagan, C. L. Giles, K. G. Hirth, G. R. Milner, P. Mitra, and J. Z. Wang. Cybertools and Archaeology. Science, 311(5763):958–959, February 2006.

[57] R. Stein, J. Gottschewski, R. Heuchert, A. Ermert, M. Hagedorn-Saupe, H.-J. Hansen, C. Saro, R. Scheffel, and G. Schulte-Dornberg. Das CIDOC Conceptual Reference Model: Eine Hilfe für den Datenaustausch? Bericht der AG Datenaustausch Fachgruppe Dokumentation im Deutschen Museumsbund. Mitteilungen und Berichte aus dem Institut für Museumskunde, 2005.

[58] R. Tansley, M. Bass, D. Stuve, M. Branschofsky, D. Chudnov, G. Mcclellan, and M. Smith. The DSpace Institutional Digital Repository System: Current Functionality. In JCDL ’03: Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries, pages 87–97, Washington, DC, USA, 2003. IEEE Computer Society.

[59] Research Councils UK. About the UK e-Science Programme. http://www.rcuk.ac.uk/escience/default.htm, August 2007.

[60] M. Uschold. Where Are the Semantics in the Semantic Web? AI Magazine, 24(3):25–36, 2003.

[61] W3C. Fresnel – Display Vocabulary for RDF. http://www.w3.org/2005/04/fresnel-info/, November 2006.

[62] W3C. Simple Knowledge Organisation Systems (SKOS) - Home Page. http://www.w3.org/2004/02/skos/, June 2007.

[63] H. Zhao and S. Ram. Entity identification for heterogeneous database integration: a multiple classifier system approach and empirical evaluation. Inf. Syst., 30(2):119–132, 2005.


Appendix

A DVD containing the directories below is attached to this thesis.

• /collections/ contains all mapped artifacts of Perseus’ and Arachne’s art and artifact databases.

• /docs/ contains the draft of the complete mapping documentation and this thesis as a PDF document.

• /style-sheets/ contains the mapping implementations as style-sheets.

• /scripts/ contains the shell scripts that were used.

A running Longwell browser has been installed at http://athena.perseus.tufts.edu:8080/.
