Bootstrapping the semantic web (Ontology population & Annotation) Semantic Web - Spring 2008 Computer Engineering Department Sharif University of Technology

Outline

• Introduction
• What is an Annotation?
• Semantic annotations
• Semi-automatic annotation
• Proxy-based vs. browser-based annotations
• Annotation tools
• Annotation in OWL
• Semantically interlinked information
• Issues concerning semantic annotation
• OntoAnnotate
• Ontology population
• Concept vs instance based annotation
• Ontosophie
• CREAM
• Annotea

Introduction

• A lot of information exists on the Web in the form of text.

• It is almost impossible to produce Semantic Web versions of it manually.

• Annotation aims to resolve this issue.

• Creating metadata by annotating documents is one of the major techniques for putting machine-understandable data on the Web.

• The techniques involve NLP, ML, IR, DM, …

Introduction (Cont’d)

• The World Wide Web is the richest repository of information, whose semantics are oriented to humans rather than to machines. The enrichment of the Web with semantic annotations (metadata) is fundamental for the accomplishment of the Semantic Web, and is currently performed manually, or semi-automatically.

What is an Annotation?

• Annotations are comments, notes, explanations, or other types of external remarks that can be attached to a Web document or a selected part of the document.

• As they are external, it is possible to annotate any Web document independently, without needing to edit that document.

• From the technical point of view, annotations are usually seen as metadata, as they give additional information about an existing piece of data.

• Annotations can be stored locally or in one or more annotation servers.

An annotation has many properties including:

• Physical location: is the annotation stored in a local file system or in an annotation server

• Scope: is the annotation associated to a whole document or just to a fragment of this document

• Annotation type: 'Annotation', 'Comment', 'Query', ...

Semantic annotations

• Semantic annotations associate information with specific entities within the domain of interest, aiming to facilitate a semantic-based interpretation of content by restricting their formal models of interpretation through ontologies.

• Domain entities are represented as instances of concepts in ontologies.

• A domain ontology captures knowledge in a static way, as it is a snapshot of knowledge concerning a specific domain from a particular point of view (conceptualization), in a specific time-period.

Ontologies and semantic annotations

• Ontological structures give additional value to semantic annotations.

• They allow for additional possibilities on the resulting semantic annotations, such as inferencing or conceptual navigation.

• Reference to a commonly agreed set of concepts constitutes an additional value through its normative function.

• An ontology directs the attention of the annotator to a predefined choice of semantic structures and gives some guidance about what and how items residing in the documents may be annotated.

Annotation…

• Adding metadata to existing web pages in an efficient and flexible manner that takes advantage of the rich possibilities offered by RDF, RDF-Schema, OWL, ….

Semi-automatic annotation

Different kinds of semi-automatic annotation mechanisms:

• Wrapper Generation: Especially when annotating web pages that mainly consist of HTML tables, one may annotate the first row of the table and automatically enumerate over the remaining rows.

• Pattern Matching: Regularity of word expressions may be captured by regular-expression-based patterns. For example, the pattern {word}*{GmbH} is a generic pattern for company names in German and thus successfully recognizes instances of the class COMPANY of the ontology. Patterns are stored with the concepts of the domain ontology.

• Information Extraction: The most complex mechanism for semi-automatic annotation is full-fledged ontology-based information extraction based on a shallow text processing strategy.
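
The pattern-matching mechanism can be sketched in a few lines of Python. This is only an illustration: the pattern store, the function name and the exact regular expression are assumptions, and the expression requires at least one word before "GmbH" (slightly stricter than a literal reading of {word}*{GmbH}).

```python
import re

# Hypothetical pattern store: ontology concept -> regular expression.
# The COMPANY pattern approximates {word}*{GmbH}: one or more capitalized
# words followed by the legal form "GmbH".
CONCEPT_PATTERNS = {
    "COMPANY": re.compile(r"\b(?:[A-Z][\w&.-]*\s+)+GmbH\b"),
}

def find_instances(text):
    """Return (concept, matched text) pairs for every pattern hit in text."""
    hits = []
    for concept, pattern in CONCEPT_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((concept, match.group(0)))
    return hits

print(find_instances("Kontakt: Ontoprise GmbH, Karlsruhe."))
```

Keeping the patterns keyed by concept mirrors the idea that patterns are stored with the concepts of the domain ontology.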

Proxy-based vs. browser-based

• [9] introduces a framework for categorizing annotation tools, distinguishing between a proxy-based and a browser-based approach.

• The proxy-based approach stores and merges the annotation and therefore preprocesses the annotated documents so that they are viewable in a standard web browser.

• Within the browser-based approach the browser is modified to merge the document with the annotation data just prior to presenting the content to the user.

What are some good annotation tools?

• SMORE - A tool that allows us to markup documents in RDF using web ontologies. It is simple and easy to use. Main limitation of this tool is that it only accepts literal values for the subject and object of a triple (except for an rdf:type statement).

• Protégé OWL Plugin - This extension of Protégé enables us to edit OWL individuals for Semantic Web markup. The tool eases the task of creating OWL markup by the visualization of the ontologies and also some convenient features.

• OntoMat-Annotizer - Another good annotation tool. Very user friendly, but only supports DAML+OIL currently. OWL version is under way.

• Some other possible candidates may be found at http://annotation.semanticweb.org/tools/ and http://www.semwebcentral.org/assessment/report?type=category&category=Annotation

Annotation in OWL

OWL Full does not put any constraints on annotations in an ontology. OWL DL allows annotations on classes, properties, individuals and ontology headers, but only under the following conditions:

• The sets of object properties, datatype properties, annotation properties and ontology properties must be mutually disjoint. Thus, in OWL DL dc:creator cannot be at the same time a datatype property and an annotation property.

• Annotation properties must have an explicit typing triple of the form: AnnotationPropertyID rdf:type owl:AnnotationProperty .

• Annotation properties must not be used in property axioms. Thus, in OWL DL one cannot define subproperties or domain/range constraints for annotation properties.

• The object of an annotation property must be either a data literal, a URI reference, or an individual.

• Five annotation properties are predefined by OWL, namely:
  • owl:versionInfo
  • rdfs:label
  • rdfs:comment
  • rdfs:seeAlso
  • rdfs:isDefinedBy

• Here is an example of legal use of an annotation property in OWL DL:

  <owl:AnnotationProperty rdf:about="&dc;creator"/>

  <owl:Class rdf:about="#MusicalWork">
    <rdfs:label>Musical work</rdfs:label>
    <dc:creator>N.N.</dc:creator>
  </owl:Class>
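
The first OWL DL condition (mutually disjoint sets of property kinds) is easy to check mechanically. The following is a minimal sketch, assuming the declared properties have already been collected into plain Python sets; the function name and the sample property names other than dc:creator and rdfs:label are hypothetical.

```python
from itertools import combinations

def overlapping_properties(property_sets):
    """Return {(kind_a, kind_b): shared names} for every pair of kinds that overlap."""
    conflicts = {}
    for (kind_a, names_a), (kind_b, names_b) in combinations(sorted(property_sets.items()), 2):
        shared = set(names_a) & set(names_b)
        if shared:
            conflicts[(kind_a, kind_b)] = shared
    return conflicts

# Hypothetical ontology in which dc:creator is declared both as an
# annotation property and as a datatype property (not allowed in OWL DL).
declared = {
    "annotation": {"dc:creator", "rdfs:label"},
    "datatype": {"dc:creator", "ex:title"},
    "object": {"ex:composedBy"},
    "ontology": {"owl:imports"},
}
print(overlapping_properties(declared))
```

An empty result means the disjointness condition holds; it says nothing about the other conditions listed above.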

Semantically interlinked information

• In light of the Semantic Web, what intelligent agents crave are web pages and items on web pages that are not only described in isolation from each other, but that are also semantically interlinked.

Semantically interlinked information (cont’d)

• A Community Web Portal could present all its knowledge, taking great advantage of semantic structures: personalization by semantic bookmarks (“Fred is interested in RDF research”), conceptual browsing, or the derivation of implicit knowledge (e.g., if John works in a project, which is about XML then he knows something about XML), have been some of the features that thrived by having semantically interlinked information.

• Intelligent agents may profit from semantically interlinked information on the Web in the future.
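
The derivation of implicit knowledge in the John/XML example can be sketched as a single rule over subject-predicate-object triples. The predicate names worksIn, isAbout and knowsAbout are assumptions chosen for this illustration, not terms from a particular ontology.

```python
def infer_knows_about(facts):
    """Derive (person, 'knowsAbout', topic) from worksIn and isAbout facts."""
    works_in = {(s, o) for s, p, o in facts if p == "worksIn"}
    is_about = {(s, o) for s, p, o in facts if p == "isAbout"}
    derived = {
        (person, "knowsAbout", topic)
        for person, project in works_in
        for subject, topic in is_about
        if subject == project
    }
    # Only report knowledge that was not stated explicitly.
    return derived - set(facts)

facts = {
    ("John", "worksIn", "ProjectX"),
    ("ProjectX", "isAbout", "XML"),
}
print(infer_knows_about(facts))
```

The rule only fires because both facts refer to the same identifier for the project, which is exactly what semantic interlinking provides.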

Ontologies and semantic annotations (cont’d)

• To address semantic interlinkage between document items, an ontology-based annotation tool must address the issue of object identity and its management across many documents.

• Ontologies may have elaborate definitions of concepts. When their meaning changes, when old concepts need to be erased, or when new concepts come up, the ontology changes. Because updating previous annotations is generally too expensive, one must deal with change management of ontologies in relation to their corresponding annotations.

• Redundant annotations, which stem from duplicate pages on the web or from annotation work done by fellow annotators, must be prevented.

Issues concerning semantic annotation

• The semantic annotation task does not adhere to a strict template structure. Rather it needs to follow the structure given by schema definitions that may vary with, e.g., domain and purpose. Semantic annotations need to be congruent with ontology definitions.

• Semantically interlinked metadata is labor-intensive and expensive to produce. Duplicate annotation must be avoided. It is important not to start from scratch when annotating sources, but to build on others' efforts (in particular their creation of IDs).

• There is a multitude of schema descriptions (ontologies) that change over time to reflect changes in the world. Manual re-annotation of old web pages seems practically infeasible. An annotation framework that handles ontology creation, mappings and versioning is needed.

Issues concerning semantic annotation (cont’d)

• Purely manual annotation is very expensive. It is necessary to help the human annotator with this task. Support for automatic or semi-automatic semantic annotation of web pages is needed.

• There is a lack of experience in creating semantically interlinked metadata for web pages. It is not clear how human annotators perform overall and it is unclear what can be assumed as a baseline for the machine agent.

• Though there are corresponding investigations for only indexing documents, a corresponding richer assignment of interlinked metadata that takes advantage of the object structures of RDF is lacking.

OntoAnnotate

• The OntoAnnotate tool makes the relationship between particular ontologies and their parts, i.e. concepts and properties, explicit.

• OntoAnnotate presents to the user an interface that dynamically adapts to the given ontology.

• OntoAnnotate relies on RDF and RDF Schema.

• It takes semantically interlinked information into account.

OntoAnnotate (cont’d)

The semantic meaning of the objects and the text passages is given by four semantic categories:

• Object identification: New objects are created by asserting the existence of an object with a unique identifier.

• Object–class relationships: Each object is assigned to a class of objects by the human annotator.

• Object–attribute relationships: Each object may be related to attribute values by an attribute. Each attribute value is either a text passage chosen by highlighting or a string typed in by the annotator.

• Object–object relationships: Each object may be related to all existing objects (including itself) via an (object) relation.
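
As a rough illustration, the four categories map naturally onto a small record type that can be flattened into RDF-style triples. The class, field and identifier names below are hypothetical and are not OntoAnnotate's internal model.

```python
from dataclasses import dataclass, field

@dataclass
class AnnotatedObject:
    identifier: str                                 # object identification
    class_name: str                                 # object-class relationship
    attributes: dict = field(default_factory=dict)  # object-attribute relationships
    relations: list = field(default_factory=list)   # object-object: (relation, target id)

def to_triples(obj):
    """Flatten one annotated object into (subject, predicate, object) triples."""
    triples = [(obj.identifier, "rdf:type", obj.class_name)]
    triples += [(obj.identifier, attr, f'"{value}"') for attr, value in obj.attributes.items()]
    triples += [(obj.identifier, rel, target) for rel, target in obj.relations]
    return triples

fred = AnnotatedObject(
    identifier="ex:person1",
    class_name="ex:Researcher",
    attributes={"ex:name": "Fred"},              # e.g. a highlighted text passage
    relations=[("ex:worksIn", "ex:project1")],   # link to another object
)
for triple in to_triples(fred):
    print(triple)
```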


Two requirements:

• 1) The annotation inference server needs to maintain object identifiers during the annotation process (adding objects to and querying objects from the annotation inference server).

• 2) A crawler needs to gather relevant object identifiers for the start of the annotation: start a focused crawl of RDF facts, covering the document and annotation server but also relevant parts of the Web, which provides the annotation inference server with an initial set of object identifiers, categories, attributes and relations. The metadata provided by other annotators may be used as the starting point to which one may contribute additional data.

OntoAnnotate, Ontology refinement

When an existing class definition is refined, there are three possibilities for the objects that belong to this class:

• The objects stay in the class and, hence, the semantic meaning of the annotations is extended by additional semantic constraints;

• The objects are categorized to belong only to the superclasses of the re-defined class and, hence the semantic meaning of the annotations is reduced by cutting away semantic constraints;

• The objects are moved to another class.

Along similar lines, other cases of ontology revisions are treated.


• OntoAnnotate uses the URI to detect the reencounter of previously annotated documents and highlights annotations in the old document for the user.

• The user may decide to ignore or even delete the old annotations and create new metadata, he may augment existing data, or he may just be satisfied with what has been annotated before.

• In order to recognize that a document has been annotated before, but now appears under a different URI, OntoAnnotate searches in the document management system computing similarity with existing documents by document vector models.
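
Detecting a previously annotated document under a new URI can be sketched with term-frequency vectors and cosine similarity. This is a simplified stand-in for the document vector models mentioned above; the tokenization, the function names and the 0.9 threshold are assumptions.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Bag-of-words term frequencies for a document."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def find_reencounter(new_text, stored_docs, threshold=0.9):
    """Return the URI of the most similar stored document, or None if none is close enough."""
    new_vec = term_vector(new_text)
    best_uri, best_score = None, 0.0
    for uri, text in stored_docs.items():
        score = cosine_similarity(new_vec, term_vector(text))
        if score > best_score:
            best_uri, best_score = uri, score
    return best_uri if best_score >= threshold else None

stored = {
    "http://example.org/a": "Fred works on RDF research",
    "http://example.org/b": "laptop processor prices",
}
print(find_reencounter("Fred works on RDF research", stored))
```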

Concept vs instance based annotation

• Semantically annotating a corpus is divided into concept and instance based approaches.

• The concept based approach, which aims to discover new instances beyond those that exist in the ontology, employs information extraction techniques.

• The instance based approach concerns the recognition of all the instances that exist in the ontology and appear in the corpus. A more sophisticated extension of this method usually uses disambiguation techniques to support the correct sense attribution of an ontological instance according to the ontology used.

Ontology population

• Ontology population: identify new instances for concepts of a domain ontology and add them into it.

• Ontology enrichment: acquire a non-taxonomic relationship between instances that captures their different lexicalizations, avoiding the existence of duplicate ontological instances.

Ontology population (cont’d)

• Ontologies are widely used for capturing and organizing knowledge of a particular domain of interest.

• This knowledge is usually evolvable and therefore an ontology maintenance process is required to keep the ontological knowledge up-to-date.

• [3] proposed an incremental ontology maintenance methodology which exploits ontology population and enrichment methods to enhance the knowledge captured by the instances of the ontology and their various lexicalizations.

• Due to changes concerning knowledge-related requirements a domain ontology might contain incomplete or out-of-date knowledge regarding its instances.

• For example, an ontology that has been constructed for the domain “laptop descriptions” last year will miss the latest processor types used in laptops.

• Moreover, the different surface appearances (lexicalizations) of an instance restrict the knowledge a domain ontology intends to capture. For example, not knowing that the instance “Intel Pentium 3” can appear as “P III” is a serious knowledge gap.

• Maintaining ontological knowledge through population and enrichment is a time-consuming, error prone and labor-intensive task when performed manually. Ontology learning can facilitate the process by using machine learning methods to obtain knowledge from data.

Incremental Ontology Population and Enrichment

Incremental Ontology Population and Enrichment iterates through four stages:

• Ontology-based Semantic Annotation. The instances of the domain ontology are used to semantically annotate a domain-specific corpus in an automatic way (instance based approach). In this stage disambiguation techniques are used exploiting knowledge captured in the domain ontology.

The semantic annotation of the corpus is currently performed by a string matching technique that is biased to select the maximum spanning annotated lexical expression for each instance.

• Knowledge Discovery. An information extraction module is employed in this stage to locate new ontological instances. The module is trained, using machine learning methods, on the annotated corpus of the previous stage.

• Knowledge Refinement. A compression-based clustering algorithm is employed in this stage for identifying lexicographic variants of each instance supporting the ontology enrichment.

• Validation and Insertion. A domain expert validates the candidate instances that have been added in the ontology.
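
The string matching of the first stage, biased towards the maximum spanning expression, can be sketched as a greedy longest-match scan. The lexicon layout and instance identifiers are hypothetical, and this sketch ignores word boundaries and disambiguation.

```python
def annotate(text, lexicon):
    """Annotate text with ontology instances, preferring the longest match.

    lexicon maps a lower-case surface form to an instance identifier.
    Returns a list of (start, end, instance id) spans over text.
    """
    lowered = text.lower()
    forms = sorted(lexicon, key=len, reverse=True)  # try longest forms first
    spans = []
    pos = 0
    while pos < len(lowered):
        for form in forms:
            if lowered.startswith(form, pos):
                spans.append((pos, pos + len(form), lexicon[form]))
                pos += len(form)
                break
        else:
            pos += 1  # no instance starts here
    return spans

lexicon = {
    "intel pentium": "ex:Pentium",
    "intel pentium 3": "ex:Pentium3",
}
print(annotate("An Intel Pentium 3 laptop", lexicon))
```

Because longer surface forms are tried first, "Intel Pentium 3" is annotated as one instance and is not split into "Intel Pentium" plus a stray "3".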

Figure 3:

Ontosophie

• Information Extraction (IE) is a technology that can help an ontology expert during the ontology population and maintenance process.

• It can be viewed as the task of pulling predefined entities, such as name of visitor, location, date, etc., from texts to fill predefined slots in classes.

• Ontosophie [2] is an annotation tool based on IE, Machine Learning (ML) and Natural Language Processing (NLP).

• The system, as with most IE systems, uses partial parsing to recognize syntactic constructions. This has the advantage of high speed and robustness, which is necessary when applying IE to a large set of documents.

Ontosophie (cont’d)

• Ontosophie [2] offers:

• 1) identification of key entities in text articles that could participate in ontology population with instances,

• 2) identification of the most probable classes for the population based on newly introduced confidence values, and

• 3) semi-automatic population of an ontology with those instances.

The system goes through the following phases during its life cycle.

• Annotation: In order to let the system learn extraction rules it has to be provided with a set of examples since it is based on supervised learning. In our case this is a set of documents (plain text or HTML) annotated with XML tags and assigned to one of the predefined classes within the ontology O.

• Learning: The learning phase consists of two steps, which are described in the following.

• Natural language processing – NLP

Ontosophie uses shallow parsing to recognize syntactic constructs without generating a complete parse tree for each sentence. The shallow parsing has the advantages of higher speed and robustness. In particular, Ontosophie uses the Marmot NLP system.

• Generating extraction rules

This phase makes use of Crystal, a conceptual dictionary induction system. Crystal derives a dictionary of concept nodes – extraction rules, from a training corpus. It is based on the specific-to-general algorithm.

• Assigning rule confidence values to extracted rules

Pre-selecting/recommending those rules that are strongly believed to be correct. Therefore recall is kept high while trying to achieve a high precision for the automatic pre-selection.

Extraction and ontology population

• The task of this phase is to extract appropriate entities from an article and feed the newly constructed instances into a given ontology O. The document is pre-processed with Marmot prior to the extraction itself.

• The extraction is run class by class. Firstly, a set of extraction rules for only one specific class from the ontology is taken and only those rules are used for the extraction.

• The step is then repeated for all the classes within the ontology and thus for each class the system gets a couple of entities.
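
The class-by-class extraction with confidence-based pre-selection can be sketched as follows. Regular expressions stand in for the learned extraction rules (Crystal's concept nodes are richer than this), and the rule set, the confidence values and the 0.8 threshold are all hypothetical.

```python
import re
from dataclasses import dataclass

@dataclass
class Rule:
    slot: str
    pattern: str        # stand-in for a learned extraction rule
    confidence: float   # hypothetical rule confidence in [0, 1]

def extract_by_class(text, rules_by_class, threshold=0.8):
    """Run each class's rules in turn; flag extractions from confident rules."""
    results = {}
    for class_name, rules in rules_by_class.items():
        entities = []
        for rule in rules:
            for match in re.finditer(rule.pattern, text):
                entities.append({
                    "slot": rule.slot,
                    "value": match.group(1),
                    "confidence": rule.confidence,
                    "preselected": rule.confidence >= threshold,
                })
        results[class_name] = entities
    return results

rules = {
    "Visit": [
        Rule("visitor", r"visit by (\w+ \w+)", 0.9),
        Rule("location", r"at the (\w+)", 0.5),
    ],
}
article = "A visit by John Smith took place at the university."
print(extract_by_class(article, rules))
```

Extractions that are not pre-selected are still returned, so the user can accept or reject them; this keeps the population step semi-automatic.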

CREAM

• The CREAM [4] - CREAting Metadata for the Semantic Web – annotation framework contains in particular methods for:

• Manual annotation: The transformation of existing syntactic resources (viz. documents) into interlinked knowledge structures that represent relevant underlying information.

• Authoring of documents: In addition to the annotation of existing documents the authoring mode lets authors create metadata—almost for free—while putting together the content of a page.

• Semi-automatic annotation: Annotation based on information extraction that is trainable for a specific domain.

• Deep annotation: The treatment of dynamic Web documents by annotation of the underlying database when the database owner is cooperatively participating in the Semantic Web.

Annotation

Authoring

Annotea

• Annotea [8] is a system for creating and publishing shareable annotations of Web documents.

• Built on HTTP, RDF, and XML, Annotea provides an interoperable protocol suitable for implementation within Web browsers to permit users to attach data to Web pages such that other users may, at their choice, see the attached data when they later browse the same pages.

• The Annotea protocol works without modifying the original document; that is, there is no requirement that the user have write access to the Web page being annotated.

• The Annotea protocol is suitable both for annotation data that is expected to be primarily for human viewing as well as annotation data that is expected to be consumed by other application programs, such as automatic classification tools, search engines, and workflow applications.
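
To give a feel for what such an external annotation looks like, the sketch below builds a small RDF/XML description with Python's standard library. It assumes the Annotea annotation namespace http://www.w3.org/2000/10/annotation-ns# with its annotates and body properties; the function name and the example URLs are hypothetical, and sending the result to an annotation server over HTTP is left out.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ANN = "http://www.w3.org/2000/10/annotation-ns#"   # assumed Annotea schema namespace
DC = "http://purl.org/dc/elements/1.1/"

def build_annotation(page_url, body_url, creator):
    """Return an RDF/XML string describing one annotation of page_url."""
    for prefix, uri in (("rdf", RDF), ("a", ANN), ("dc", DC)):
        ET.register_namespace(prefix, uri)
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF}}}Description")
    ET.SubElement(desc, f"{{{RDF}}}type", {f"{{{RDF}}}resource": ANN + "Annotation"})
    # The annotated page is only referenced, never modified.
    ET.SubElement(desc, f"{{{ANN}}}annotates", {f"{{{RDF}}}resource": page_url})
    ET.SubElement(desc, f"{{{ANN}}}body", {f"{{{RDF}}}resource": body_url})
    ET.SubElement(desc, f"{{{DC}}}creator").text = creator
    return ET.tostring(root, encoding="unicode")

doc = build_annotation(
    "http://example.org/page.html",
    "http://example.org/notes/1.html",
    "Fred",
)
print(doc)
```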

EMMA

• EMMA, an Extensible MultiModal Annotation markup language, is intended for use by systems that provide semantic interpretations for a variety of inputs, including but not necessarily limited to, speech, natural language text, GUI and ink input.

• It is expected that this markup will be used primarily as a standard data interchange format between the components of a multimodal system; in particular, it will normally be automatically generated by interpretation components to represent the semantics of users' inputs, not directly authored by developers.

• The language is focused on annotating single inputs from users, which may be either from a single mode or a composite input combining information from multiple modes, as opposed to information that might have been collected over multiple turns of a dialog.

• The language provides a set of elements and attributes that are focused on enabling annotations on user inputs and interpretations of those inputs.

Example

• The system is uncertain whether the user meant "flights from Boston to Denver" or "flights from Austin to Denver".
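
A consumer of such an uncertain result might simply pick the interpretation with the highest confidence. The sketch below assumes the EMMA namespace http://www.w3.org/2003/04/emma with emma:one-of, emma:interpretation and emma:confidence; the confidence values and the origin/destination elements are illustrative application markup, not prescribed by the slides.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Illustrative EMMA document with two competing interpretations.
SAMPLE = """
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="r1">
    <emma:interpretation id="int1" emma:confidence="0.75">
      <origin>Boston</origin>
      <destination>Denver</destination>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.68">
      <origin>Austin</origin>
      <destination>Denver</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
"""

def best_interpretation(emma_xml):
    """Return the slots of the interpretation with the highest confidence."""
    root = ET.fromstring(emma_xml.strip())
    candidates = root.iter(f"{{{EMMA_NS}}}interpretation")
    best = max(candidates, key=lambda el: float(el.get(f"{{{EMMA_NS}}}confidence", "0")))
    return {child.tag: child.text for child in best}

print(best_interpretation(SAMPLE))
```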

On deep annotation

• Nowadays, a large percentage of Web pages are not static documents. On the contrary, the majority of Web pages are dynamic.

• For dynamic web pages (e.g. ones that are generated from the database that contains a catalogue of books) it does not seem to be useful to manually annotate every single page. Rather one wants to “annotate the database” in order to reuse it for one’s own Semantic Web purposes.

On deep annotation (cont’d)

• [5] describes a framework of metadata creation when web pages are generated from a database and the database owner is cooperatively participating in the Semantic Web.

• This leads us to the definition of ontology mapping rules by manual semantic annotation and the usage of the mapping rules and of web services for semantic queries.

• In order to create metadata, the framework combines the presentation layer with the data description layer — in contrast to “conventional” annotation, which remains at the presentation layer.

• Therefore, it refers to the framework as deep annotation.

Deep annotation is particularly valid because:

• web pages generated from databases outnumber static web pages,

• annotation of web pages may be a very intuitive way to create semantic data from a database, and

• data from databases should not be materialized as RDF files; it should remain where it can be handled most efficiently, in its databases.

On deep annotation (cont’d)

• Three major requirements must be met:

• A server-side web page markup that defines the relationship between the database and the web page content.

• An annotation tool to actually let the user utilize information proper, information structures and information context for creating mappings.

• Components that let the user investigate the constructed mappings, and query the serving database.

THE PROCESS OF DEEP ANNOTATION

Input: A Web site driven by an underlying relational database.

• Step 1: The database owner produces server-side web page markup according to the information structures of the database.

Result: Web site with server-side markup.

• Step 2: The annotator produces client-side annotations conforming to the client ontology and the server-side markup.

Result: Mapping rules between database and client ontology.

• Step 3: The annotator publishes the client ontology (if not already done before) and the mapping rules derived from annotations

Result: The annotator’s ontology and mapping rules are available on the Web.

• Step 4: The querying party loads second party’s ontology and mapping rules and uses them to query the database via the web service API.

Result: Results retrieved from database by querying party.

• The annotator might annotate an organization entry from ontoweb.org according to his own ontology. Then, he may use the ontology and mapping to instantiate his own syndication services by regularly querying for all recent entries whose titles match his list of topics.
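
Step 4 can be sketched as translating a client-ontology property into a database query by way of the published mapping rules. This is a simplified illustration: it queries an in-memory SQLite database directly, whereas the framework goes through a web service API, and the mapping format, table and column names are hypothetical.

```python
import sqlite3

# Hypothetical mapping rules: client ontology property -> (table, column).
MAPPING = {
    "Organization.name": ("org", "title"),
    "Organization.homepage": ("org", "url"),
}

def query_property(conn, prop):
    """Answer a query phrased in client-ontology terms using the mapping rules."""
    table, column = MAPPING[prop]
    # Identifiers come from the fixed mapping above, never from user input.
    rows = conn.execute(f"SELECT {column} FROM {table} ORDER BY {column}").fetchall()
    return [row[0] for row in rows]

# Stand-in for the database behind the annotated web site.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE org (title TEXT, url TEXT)")
conn.executemany("INSERT INTO org VALUES (?, ?)", [
    ("AIFB", "http://example.org/aifb"),
    ("KMi", "http://example.org/kmi"),
])
print(query_property(conn, "Organization.name"))
```

The data stays in the database; only the mapping rules and the ontology need to be published.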

Figure 4:

Amaya

• Amaya [6] is a web-browser that acts both as an editor and as a browser.

• It has been designed at W3C with the primary purpose of being a testbed for experimenting and demonstrating new languages, protocols and formats for the Web.

• Amaya is the primary browser/editor for the annotation approach in [9]. The annotation data itself is exchanged in RDF/XML form to give other clients access to the annotation database.

• Currently, however, it does not provide comprehensive support with annotation inference server and crawling.

Main view

SVG example

An annotation

References

• [1] S. Staab, A. Maedche, S. Handschuh, An Annotation Framework for the Semantic Web, In: S. Ishizaki (ed.), Proc. of The First International Workshop on MultiMedia Annotation, January 30-31, 2001, Tokyo, Japan.

• [2] D. Celjuska, M. Vargas-Vera, Ontosophie: A Semi-Automatic System for Ontology Population from Text, Tech Report kmi-04-19, Knowledge Media Institute (KMi), 2004.

• [3] A. G. Valarakos, G. Paliouras, V. Karkaletsis, G. A. Vouros, Enhancing the Ontological Knowledge through Ontology Population and Enrichment, LNCS 3257, 2004.

• [4] S. Handschuh, S. Staab, CREAM: CREAting Metadata for the Semantic Web, Elsevier Computer Networks 42, pp. 579–598, 2003.

• [5] S. Handschuh, S. Staab, R. Volz, On Deep Annotation, WWW2003, May 20–24, 2003, Budapest, Hungary.

• [6] http://www.w3.org/amaya

• [7] http://www.w3.org/TR/emma/

• [8] http://www.w3.org/2001/annotea/

• [9] M.-R. Koivunen, D. Brickley, J. Kahan, E. Prud'Hommeaux, R. R. Swick, The W3C Collaborative Web Annotation Project ... or how to have fun while building an RDF infrastructure, 2000.

http://www.sigmatics.co.jp/mma2001/
