TAXONOMY-BASED ANNOTATION OF XML DOCUMENTS Application to e-Learning Resources Nicolas Spyratos University of Paris-South France Joint work with B. Gueye

TAXONOMY-BASED ANNOTATION OF XML DOCUMENTS

Application to e-Learning Resources

Nicolas Spyratos

University of Paris-South, France

Joint work with B. Gueye and Ph. Rigaux

overview

the context of our work

digital libraries

document composition

document indexing

document registration

automatic annotation

application to XML documents

concluding remarks

the context

SeLeNe (Self e-Learning Networks)

• IST accompanying measures

• ended in January 2004

Delos

• Network of Excellence on Digital Libraries

• started in January 2004

digital libraries

a digital library serves a network of providers willing to share their documents with other providers and/or consumers (collectively called “users”)

each document resides at the local repository of its provider, so all providers’ repositories, collectively, can be seen as a database of documents spread all over the network

the digital library acts as a mediator, indexing all shareable documents so that users can access them transparently

typically, a user can compose new documents from those available through the library

digital libraries (continued)

the digital library indexes two types of documents, atomic or composite, and provides a number of services to support the composition of new documents and their use

in what follows we shall see

• how documents are composed

• how they are indexed by the library

• how they are registered in the library

in doing so, we deal with document identifiers and document descriptions

we do not deal with document content

document composition

a document is seen as an identifier d (e.g., a URI) associated with a set of other documents, d1, … , dn, called its parts :

parts(d) = {d1, … , dn}

if parts(d) = ∅ then d is called atomic else d is called composite

components of a document : if d is atomic then comp(d) = ∅ else comp(d) = parts(d) ∪ comp(d1) ∪ … ∪ comp(dn)

a document can be represented as a graph : its composition graph

[figure: composition graph of a document, with nodes numbered 1 to 7]

• we assume that no document can be a component of itself, so the composition graph of a document d is a directed acyclic graph with d as the only root

• the set of parts is unordered
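The recursive definition of comp(d) can be sketched in a few lines of Python. This is only an illustration: the identifiers "d1" to "d6" and the `parts` dictionary are hypothetical, and atomic documents are simply those with no entry (or an empty entry) in `parts`.

```python
def comp(d, parts):
    """Return all components of d: its parts plus, recursively, their components."""
    result = set()
    for p in parts.get(d, ()):
        result.add(p)
        result |= comp(p, parts)  # terminates because the composition graph is acyclic
    return result


# Hypothetical composition: d1 is built from d2 and d3, which share the part d5.
parts = {
    "d1": {"d2", "d3"},
    "d2": {"d4", "d5"},
    "d3": {"d5", "d6"},
}

components_of_d1 = comp("d1", parts)  # every document reachable from d1
components_of_d4 = comp("d4", parts)  # d4 is atomic, so this is the empty set
```

Because parts are unordered, plain sets are enough here; a shared part such as "d5" appears only once in the result.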

document indexing

documents are indexed using a taxonomy

a taxonomy is a pair (T, ≤) where

• T is a set of keywords or terms, called the terminology

• ≤ is a reflexive and transitive relation over T, called subsumption

in practice, most taxonomies are trees

defining a taxonomy is not an easy task

several standard taxonomies of topics already exist today : ACM-CCS, IEEE LOM, Open Directory

a digital library operates on one (or more) taxonomies to which all users adhere
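As a rough sketch, a tree-shaped taxonomy can be stored as a child-to-parent map, and subsumption checked by walking up towards the root. The term names and the parent links below are assumptions chosen for illustration, and `subsumed_by` is a hypothetical helper name.

```python
# Assumed tree-shaped taxonomy: each term maps to its parent; the root has no entry.
PARENT = {
    "Languages": "Programming",
    "Algorithms": "Programming",
    "OOL": "Languages",
    "Java": "OOL",
    "C++": "OOL",
    "Sorting": "Algorithms",
    "QuickSort": "Sorting",
}


def ancestors_or_self(t):
    """Yield t, then its parent, grandparent, and so on up to the root."""
    while t is not None:
        yield t
        t = PARENT.get(t)


def subsumed_by(t, u):
    """True iff t ≤ u, i.e. u is t itself or one of its ancestors."""
    return u in ancestors_or_self(t)
```

Walking the parent chain gives reflexivity (every term is on its own chain) and transitivity (an ancestor of an ancestor is on the chain) for free.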

example of taxonomy (fragment of the ACM Computing Classification Scheme)

[figure: taxonomy tree rooted at Programming; node labels : Theory, Languages, Algorithms, OOL, C++, Java, JSP, JavaBean, Sorting, Merge, Quick, Bubble]


document registration

registration relies on document description

a document is described along various dimensions :

language, author, year, editor, content, etc.

in this work we focus on content (or topic) description

(also called document annotation)

a description is a set of terms from the library taxonomy

document registration (continued)

a document d with description D is registered as follows :

for each term t in D, a pair (t, d) is stored in the library

the pairs (t, d), for all registered documents d, constitute the so-called library catalogue

formally, a catalogue over a taxonomy (T, ≤) is a set of pairs (t, d)

where t is a term of T and d is an identifier from a fixed set

(here, the set of all URIs)

example of a catalogue

the catalogue taxonomy allows for browsing and querying

[figure: a catalogue, with taxonomy terms a to h on top and documents 1 to 8 below]

document registration (continued)

the basic question is : who provides the description for document registration?

the answer to this question depends on whether the document is atomic or composite :

• if the document is atomic then the description must be provided by the author (and can be any set of terms that the author chooses from the library taxonomy)

• if the document is composite then the author description should be “augmented” by a set of terms implied by the descriptions of the document’s parts (as the parts may have been created by different authors!)

providing an algorithm for generating this implied description automatically is one of the main objectives of this work

implied description

should be reduced : no term should be subsumed by any other term

{QuickSort, Java, OOL} reduces to {QuickSort, Java}

should express what all the parts have in common

should be as near to all parts’ descriptions as possible

i.e. should be the l.u.b. of these descriptions w.r.t. some ordering

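[figure: composite document 3 with parts 1 and 2; document 1 is described by {QuickSort, Java}, document 2 by {BubbleSort, C++, Theory}, and the implied description of 3 is {Sorting, OOL}]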
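The reduction step can be sketched as follows, reading the slide's example as "drop every term that subsumes another term of the set". The parent links in `PARENT` are an assumed fragment of the taxonomy, and `leq` and `reduce_terms` are hypothetical helper names.

```python
# Assumed fragment of the taxonomy, stored as child -> parent links.
PARENT = {
    "QuickSort": "Sorting",
    "Sorting": "Algorithms",
    "Algorithms": "Programming",
    "Java": "OOL",
    "OOL": "Languages",
    "Languages": "Programming",
}


def leq(t, u):
    """True iff t ≤ u, i.e. u is t or an ancestor of t."""
    while t is not None:
        if t == u:
            return True
        t = PARENT.get(t)
    return False


def reduce_terms(description):
    """Keep only the terms that do not subsume another term of the description."""
    terms = set(description)
    return {t for t in terms if not any(s != t and leq(s, t) for s in terms)}


reduced = reduce_terms({"QuickSort", "Java", "OOL"})  # OOL subsumes Java and is dropped
```

After reduction no two remaining terms are comparable, which is the property the partial order on descriptions relies on.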

computing the implied description

D ⊑ D’ iff for each t’ ∈ D’ there is t ∈ D such that t ≤ t’

the relation ⊑ is a partial order over reduced descriptions

every set of reduced descriptions { D1 , D2 , …, Dn } has a l.u.b. in ⊑

computed as follows :

compute P = D1 × D2 × … × Dn

for each tuple Tk = <t1k, t2k, …, tnk> in P, compute Lk = lub {t1k, t2k, …, tnk} in ≤

let D = { L1, L2, …, Lm }, where m = |P|

return reduce(D)

an example

D1 = {QuickSort, Java}, D2 = {BubbleSort, C++}

P = { T1 = <QuickSort, BubbleSort>,

T2 = <QuickSort, C++>,

T3 = <Java, BubbleSort>,

T4 = <Java, C++> }

L1 = Sorting, L2 = Programming, L3 = Programming, L4 = OOL

D = {Sorting, Programming, Programming, OOL}

reduce (D) = {Sorting, OOL}
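The computation above can be replayed with a short Python sketch. It assumes a tree-shaped taxonomy, so that the lub of a set of terms is their lowest common ancestor; the parent links in `PARENT` are an assumed reading of the example taxonomy, the sorting term is spelled "Sorting" here, and all function names are hypothetical.

```python
from itertools import product

# Assumed tree-shaped taxonomy, stored as child -> parent links.
PARENT = {
    "Languages": "Programming", "Algorithms": "Programming",
    "OOL": "Languages", "Java": "OOL", "C++": "OOL",
    "Sorting": "Algorithms", "QuickSort": "Sorting", "BubbleSort": "Sorting",
}


def chain(t):
    """Return [t, parent(t), ..., root]."""
    result = [t]
    while result[-1] in PARENT:
        result.append(PARENT[result[-1]])
    return result


def leq(t, u):
    """True iff t ≤ u, i.e. u is t or an ancestor of t."""
    return u in chain(t)


def lub(terms):
    """Least upper bound in a tree: the lowest ancestor common to all terms."""
    terms = list(terms)
    for candidate in chain(terms[0]):
        if all(leq(t, candidate) for t in terms):
            return candidate
    raise ValueError("terms have no common ancestor")


def reduce_terms(description):
    """Drop every term that subsumes another term of the description."""
    terms = set(description)
    return {t for t in terms if not any(s != t and leq(s, t) for s in terms)}


def implied_description(descriptions):
    """lub of each tuple of the Cartesian product, then reduction."""
    return reduce_terms(lub(tup) for tup in product(*descriptions))


D1 = {"QuickSort", "Java"}
D2 = {"BubbleSort", "C++"}
result = implied_description([D1, D2])
```

The Cartesian product has |D1| × … × |Dn| tuples, so the cost grows quickly with the number of parts; reduced descriptions keep each factor small.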


document registration (continued)

to register a document d do :

1/ compute the registration description of d :

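if d is atomic then RDescr(d) := reduce (ADescr(d))

else RDescr(d) := reduce (ADescr(d) ∪ IDescr(D1 , D2 , …, Dn))

where ADescr(d) is the author description of d and IDescr(D1 , D2 , …, Dn) is the description implied by the descriptions D1 , D2 , …, Dn of the parts of d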

2/ for each term t in the registration description of d do:

insert a pair (t, d) in the library catalogue
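The two registration steps can be sketched end to end as follows. This is a minimal sketch under several assumptions: the taxonomy is a tree given as hypothetical child-to-parent links, the catalogue is an in-memory set of (term, document) pairs, and `register` receives the registration descriptions of the parts explicitly.

```python
from itertools import product

# Assumed tree-shaped taxonomy, stored as child -> parent links.
PARENT = {
    "Theory": "Programming", "Languages": "Programming", "Algorithms": "Programming",
    "OOL": "Languages", "Java": "OOL", "C++": "OOL",
    "Sorting": "Algorithms", "QuickSort": "Sorting", "BubbleSort": "Sorting",
}


def chain(t):
    """Return [t, parent(t), ..., root]."""
    result = [t]
    while result[-1] in PARENT:
        result.append(PARENT[result[-1]])
    return result


def leq(t, u):
    return u in chain(t)


def lub(terms):
    """Lowest common ancestor of the given terms."""
    return next(c for c in chain(terms[0]) if all(leq(t, c) for t in terms))


def reduce_terms(description):
    terms = set(description)
    return {t for t in terms if not any(s != t and leq(s, t) for s in terms)}


def implied(part_descriptions):
    return reduce_terms(lub(tup) for tup in product(*part_descriptions))


def register(d, author_descr, part_descriptions, catalogue):
    """Step 1: compute the registration description; step 2: store its (t, d) pairs."""
    if not part_descriptions:  # atomic document
        rdescr = reduce_terms(author_descr)
    else:  # composite document
        rdescr = reduce_terms(set(author_descr) | implied(part_descriptions))
    catalogue.update((t, d) for t in rdescr)
    return rdescr


catalogue = set()
r1 = register("d1", {"QuickSort", "Java"}, [], catalogue)
r2 = register("d2", {"BubbleSort", "C++", "Theory"}, [], catalogue)
r3 = register("d3", set(), [r1, r2], catalogue)  # composite of d1 and d2
```

Returning the registration description from `register` makes it easy to feed the descriptions of already registered parts into the registration of a composite.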

other library services

searching for relevant documents (querying)

removal of a document

description modification

notification of users (following registration/removal/modification)

document materialization (table of contents and index)

personalization

all these and other services rely on registration descriptions

querying service

a query is a boolean combination of terms :

q ::= t | q1 ∧ q2 | q1 ∨ q2 | q1 ∧ ¬q2 | ε

its answer is defined recursively as follows :

ans(t) = Ext(t) ∪ Ext(t1) ∪ … ∪ Ext(tn)

where t1, …, tn are the immediate successors of t

ans(q) :

if q = t then ans(t)

else begin

if q = q1 ∧ q2 then ans(q) = ans(q1) ∩ ans(q2);

if q = q1 ∨ q2 then ans(q) = ans(q1) ∪ ans(q2);

if q = q1 ∧ ¬q2 then ans(q) = ans(q1) \ ans(q2)

end

ans(ε) = ∅
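Query evaluation can be sketched as a small recursive function. Several points are assumptions rather than part of the slides: Ext(t) is taken to be the set of documents catalogued directly under t, the answer for a term recurses through its successors so that it reaches every term subsumed by t (a slightly broader reading than the formula, which lists only immediate successors), queries are encoded as nested tuples, and `None` stands for an empty query with an empty answer. `CHILDREN` and `EXT` are hypothetical sample data.

```python
# Assumed taxonomy (term -> immediate successors) and catalogue extensions.
CHILDREN = {
    "Programming": ["Languages", "Algorithms"],
    "Languages": ["OOL"],
    "OOL": ["Java", "C++"],
    "Algorithms": ["Sorting"],
    "Sorting": ["QuickSort", "BubbleSort"],
}
EXT = {
    "QuickSort": {"d1"}, "Java": {"d1"},
    "BubbleSort": {"d2"}, "C++": {"d2"},
    "Sorting": {"d3"}, "OOL": {"d3"},
}


def ans_term(t):
    """Documents catalogued under t or under any term below t."""
    docs = set(EXT.get(t, ()))
    for child in CHILDREN.get(t, ()):
        docs |= ans_term(child)
    return docs


def ans(q):
    """Evaluate a query: a term (str), None, or a tuple (op, q1, q2)."""
    if q is None:  # assumed encoding of the empty query
        return set()
    if isinstance(q, str):
        return ans_term(q)
    op, q1, q2 = q
    if op == "and":
        return ans(q1) & ans(q2)
    if op == "or":
        return ans(q1) | ans(q2)
    if op == "andnot":
        return ans(q1) - ans(q2)
    raise ValueError(f"unknown operator: {op}")


sorting_not_java = ans(("andnot", "Sorting", "Java"))
```

Negation only appears as set difference against another query, so every answer stays a subset of the catalogued documents.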

application to XML documents

Our model can easily be instantiated in an XML framework

• XML documents have a hierarchical structure

• XML documents can be combined to form larger, composite documents

XML is now a popular language to represent, exchange and integrate text-based information

• representative application: distributed e-learning repositories

Case study: annotating DocBook documents

Our XML documents are valid w.r.t. the DocBook DTD

• DocBook is a popular DTD in (electronic) publishing

• Well designed to represent structured textbooks

Choosing a specific DTD facilitates the extraction and the composition of parts

• Any other DTD adapted to e-learning documents could have been chosen
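To illustrate how parts can be extracted from such a document, the sketch below derives the parts relation from a DocBook-like fragment with Python's standard `xml.etree.ElementTree`. The sample content, the choice of `book`, `chapter` and `section` as the structural elements, and the use of the `id` attribute as document identifier are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

# Hypothetical DocBook-like fragment (no DTD validation is performed here).
DOCBOOK = """
<book id="b1">
  <title>Programming</title>
  <chapter id="c1">
    <title>Sorting</title>
    <section id="s1"><title>QuickSort</title><para>text</para></section>
    <section id="s2"><title>BubbleSort</title><para>text</para></section>
  </chapter>
  <chapter id="c2"><title>Java</title><para>text</para></chapter>
</book>
"""

# Elements treated as documents of the model (an assumption of this sketch).
STRUCTURAL = {"book", "chapter", "section"}


def parts_of(root):
    """Map each structural element's id to the ids of its structural children."""
    parts = {}
    for elem in root.iter():
        if elem.tag in STRUCTURAL:
            parts[elem.get("id")] = {
                child.get("id") for child in elem if child.tag in STRUCTURAL
            }
    return parts


parts = parts_of(ET.fromstring(DOCBOOK))
atomic = {d for d, p in parts.items() if not p}  # elements with no structural children
```

Elements with an empty set of parts play the role of atomic documents, and every other structural element is a composite whose implied annotation can be derived once its children are annotated.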

the XAnnot prototype

a graphical interface to browse DocBook documents through their hierarchical structure (chapters, sections, subsections, etc.)

an interactive tool to annotate nodes by selecting terms from the taxonomy

an implementation of our algorithm : the implied annotation is computed for a document as soon as all its parts have been annotated

ongoing work

continuing the development of the prototype

• implementation of a set of core services

• experimentation (University of French Polynesia)

personalization

• document materialization

• local taxonomies, P2P configuration (in collaboration with CNR-Pisa)

• ranking of query answers (in collaboration with ICS-FORTH)

