TAXONOMY-BASED ANNOTATION OF XML DOCUMENTS
Application to e-Learning Resources
Nicolas Spyratos
University of Paris-South, France
Joint work with B. Gueye and Ph. Rigaux
overview
the context of our work
digital libraries
document composition
document indexing
document registration
automatic annotation
application to XML documents
concluding remarks
the context
SeLeNe (Self e-Learning Networks)
• IST accompanying measures
• ended in January 2004
Delos
• Network of Excellence on Digital Libraries
• started in January 2004
digital libraries
a digital library serves a network of providers willing to share their documents with other providers and/or consumers (collectively called “users”)
each document resides at the local repository of its provider, so all providers’ repositories, collectively, can be seen as a database of documents spread all over the network
the digital library acts as a mediator, indexing all shareable documents so that users can access them transparently
typically, a user can compose new documents from those available through the library
digital libraries (continued)
the digital library indexes two types of documents, atomic or composite, and provides a number of services to support the composition of new documents and their use
in what follows we shall see
• how documents are composed
• how they are indexed by the library
• how they are registered in the library
in doing so, we deal with document identifiers and document descriptions
we do not deal with document content
document composition
a document is seen as an identifier d (e.g., a URI) associated with a set of other documents, d1, …, dn, called its parts:
parts(d) = {d1, …, dn}
if parts(d) = ∅ then d is called atomic, else d is called composite
components of a document: if d is atomic then comp(d) = ∅, else comp(d) = parts(d) ∪ comp(d1) ∪ … ∪ comp(dn)
a document can be represented as a graph : its composition graph
[figure: composition graph of a document, with nodes numbered 1 to 7]
• we assume that no document can be a component of itself, so the composition graph of a document d is a directed acyclic graph with d as the only root
• the set of parts is unordered
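The definitions of parts(d) and comp(d) above can be sketched in a few lines of Python. This is only an illustration: the table PARTS and the identifiers d1 to d6 are hypothetical and are not the nodes of the figure.

```python
from typing import Dict, Set

# Hypothetical composition table: identifier -> set of part identifiers.
# An identifier that is absent from the table is treated as atomic.
PARTS: Dict[str, Set[str]] = {
    "d1": {"d2", "d3"},
    "d2": {"d4", "d5"},
    "d3": {"d5", "d6"},  # d5 is shared, so the graph is a DAG, not a tree
}

def parts(d: str) -> Set[str]:
    """Return the (unordered) set of parts of d."""
    return PARTS.get(d, set())

def is_atomic(d: str) -> bool:
    """A document is atomic iff it has no parts."""
    return not parts(d)

def comp(d: str) -> Set[str]:
    """comp(d) = parts(d) ∪ comp(d1) ∪ … ∪ comp(dn); empty if d is atomic."""
    result = set(parts(d))
    for p in parts(d):
        # The recursion terminates because the composition graph is acyclic.
        result |= comp(p)
    return result
```

Here comp("d1") collects every document reachable from d1, while comp of an atomic document is the empty set.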
document indexing
documents are indexed using a taxonomy
a taxonomy is a pair (T, ≤) where
• T is a set of keywords or terms, called the terminology
• ≤ is a reflexive and transitive relation over T, called subsumption
in practice, most taxonomies are trees
defining a taxonomy is not an easy task
several standard taxonomies of topics already exist today: ACM-CCS, IEEE LOM, Open Directory
a digital library operates on one (or more) taxonomies to which all users adhere
example of taxonomy (fragment of the ACM Computing Classification Scheme)
[figure: taxonomy tree rooted at Programming, with the terms Theory, Languages, Algorithms, OOL, C++, Java, JSP, JavaBean, Sorting, Merge, Quick, Bubble]
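A tree-shaped taxonomy can be held as a child-to-parent map, with subsumption tested by walking up to the root. The sketch below is an assumption-laden illustration: the parent links are our reading of the example (chosen to agree with the worked example later in the talk, which uses the names Sort, QuickSort and BubbleSort), and JSP and JavaBean are left out because their position in the tree is not certain.

```python
from typing import Dict, Optional

# Child -> parent links; None marks the root. Assumed layout, see lead-in.
PARENT: Dict[str, Optional[str]] = {
    "Programming": None,
    "Theory": "Programming",
    "Languages": "Programming",
    "Algorithms": "Programming",
    "OOL": "Languages",
    "C++": "OOL",
    "Java": "OOL",
    "Sort": "Algorithms",
    "MergeSort": "Sort",
    "QuickSort": "Sort",
    "BubbleSort": "Sort",
}

def subsumed_by(t: str, u: str) -> bool:
    """True iff t ≤ u, i.e. u is t itself or one of the ancestors of t."""
    node: Optional[str] = t
    while node is not None:
        if node == u:
            return True
        node = PARENT[node]
    return False
```

The relation is reflexive (every term is subsumed by itself) and transitive (ancestors of ancestors are reached by the same walk).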
document registration
registration relies on document description
a document is described along various dimensions
language, author, year, editor, content, etc
in this work we focus on content (or topic) description
(also called document annotation)
a description is a set of terms from the library taxonomy
document registration(continued)
a document d with description D is registered as follows: for each term t in D, a pair (t, d) is stored in the library
the pairs (t, d), for all registered documents d, constitute the so-called library catalogue
formally, a catalogue over a taxonomy (T, ≤) is a set of pairs (t, d)
where t is a term of T and d is an identifier from a fixed set
(here, the set of all URIs)
example of a catalogue
the catalogue taxonomy allows for browsing and querying
[figure: a catalogue linking the taxonomy terms a to h to the documents 1 to 8]
document registration (continued)
the basic question is : who provides the description for document registration?
the answer to this question depends on whether the document is atomic or composite:
• if the document is atomic then the description must be provided by the author (and can be any set of terms that the author chooses from the library taxonomy)
• if the document is composite then the author description should be “augmented” by a set of terms implied by the descriptions of the document’s parts (as the parts may have been created by different authors!)
providing an algorithm for generating this implied description automatically is one of the main objectives of this work
implied description
should be reduced : no term should be subsumed by any other term
{QuickSort, Java, OOL} reduces to {QuickSort, Java}
should express what all the parts have in common
should be as near to all parts’ descriptions as possible
i.e. should be the l.u.b. of these descriptions w.r.t. some ordering
[figure: composite document 3, described by {Sorting, OOL}, with parts 1, described by {QuickSort, Java}, and 2, described by {BubbleSort, C++, Theory}]
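The reduction step can be illustrated as follows. This is a minimal sketch, assuming a tree taxonomy stored as a hypothetical child-to-parent map; following the example {QuickSort, Java, OOL}, it keeps the most specific terms and drops any term that subsumes another term of the set.

```python
from typing import Dict, Optional, Set

# Hypothetical fragment of the taxonomy: child -> parent, None for the root.
PARENT: Dict[str, Optional[str]] = {
    "Programming": None,
    "Languages": "Programming",
    "Algorithms": "Programming",
    "OOL": "Languages",
    "Java": "OOL",
    "Sort": "Algorithms",
    "QuickSort": "Sort",
}

def subsumed_by(t: str, u: str) -> bool:
    """True iff t ≤ u (u is t or an ancestor of t)."""
    node: Optional[str] = t
    while node is not None:
        if node == u:
            return True
        node = PARENT[node]
    return False

def reduce_description(D: Set[str]) -> Set[str]:
    """Drop every term that subsumes another, more specific term of D."""
    return {t for t in D if not any(s != t and subsumed_by(s, t) for s in D)}
```

After reduction no two terms of the result are comparable under ≤.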
computing the implied description
D ⊑ D’ iff for each t’ ∈ D’ there is t ∈ D such that t ≤ t’
the relation ⊑ is a partial order over reduced descriptions
every set of reduced descriptions {D1, D2, …, Dn} has a l.u.b. in ⊑, computed as follows:
compute P = D1 × D2 × … × Dn
for each tuple Tk = <t1^k, t2^k, …, tn^k> in P, compute Lk = lub{t1^k, t2^k, …, tn^k} in ≤
let D = {L1, L2, …, Lm}, where m = |P|
return reduce(D)
an example
D1 = {QuickSort, Java}, D2 = {BubbleSort, C++}
P = { T1, T2, T3, T4 } where
T1 = <QuickSort, BubbleSort>
T2 = <QuickSort, C++>
T3 = <Java, BubbleSort>
T4 = <Java, C++>
L1 = Sort, L2 = Programming, L3 = Programming, L4 = OOL
D = {Sort, Programming, Programming, OOL}
reduce (D) = {Sort, OOL}
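The computation above can be replayed in Python. The sketch assumes a tree taxonomy (so that the lub of a tuple of terms is their lowest common ancestor) and a hypothetical child-to-parent map; the function names are illustrative, not those of the prototype.

```python
from itertools import product
from typing import Dict, Iterable, List, Optional, Set

# Hypothetical fragment of the taxonomy: child -> parent, None for the root.
PARENT: Dict[str, Optional[str]] = {
    "Programming": None, "Theory": "Programming",
    "Languages": "Programming", "Algorithms": "Programming",
    "OOL": "Languages", "C++": "OOL", "Java": "OOL",
    "Sort": "Algorithms", "QuickSort": "Sort", "BubbleSort": "Sort",
}

def subsumed_by(t: str, u: str) -> bool:
    """True iff t ≤ u (u is t or an ancestor of t)."""
    node: Optional[str] = t
    while node is not None:
        if node == u:
            return True
        node = PARENT[node]
    return False

def lub(terms: Iterable[str]) -> str:
    """Lowest term that subsumes every term of a non-empty tuple."""
    first, *rest = terms
    node: Optional[str] = first
    while node is not None:
        if all(subsumed_by(t, node) for t in rest):
            return node
        node = PARENT[node]
    raise ValueError("terms have no common ancestor")

def reduce_description(D: Set[str]) -> Set[str]:
    """Drop every term that subsumes another, more specific term of D."""
    return {t for t in D if not any(s != t and subsumed_by(s, t) for s in D)}

def implied_description(descriptions: List[Set[str]]) -> Set[str]:
    """reduce({lub(Tk) : Tk in D1 × … × Dn})."""
    lubs = {lub(tup) for tup in product(*descriptions)}
    return reduce_description(lubs)
```

Note that the cartesian product has |D1| · … · |Dn| tuples, so the cost grows quickly with the number of parts and the size of their descriptions.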
document registration (continued)
to register a document d do :
1/ compute the registration description of d :
if d is atomic then RDescr(d) := reduce (ADescr(d))
else RDescr(d) := reduce (ADescr(d) ∪ IDescr(D1, D2, …, Dn))
2/ for each term t in the registration description of d do:
insert a pair (t, d) in the library catalogue
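The two registration steps can be put together as below. This is a sketch under stated assumptions: the catalogue is a plain Python set of (term, document) pairs, the taxonomy is a hypothetical child-to-parent map, the author description is combined with the implied description by set union, and the descriptions passed for the parts are taken to be their registration descriptions.

```python
from itertools import product
from typing import Dict, List, Optional, Set, Tuple

PARENT: Dict[str, Optional[str]] = {  # hypothetical taxonomy fragment
    "Programming": None, "Theory": "Programming",
    "Languages": "Programming", "Algorithms": "Programming",
    "OOL": "Languages", "C++": "OOL", "Java": "OOL",
    "Sort": "Algorithms", "QuickSort": "Sort", "BubbleSort": "Sort",
}

def subsumed_by(t: str, u: str) -> bool:
    node: Optional[str] = t
    while node is not None:
        if node == u:
            return True
        node = PARENT[node]
    return False

def lub(terms: Tuple[str, ...]) -> str:
    node: Optional[str] = terms[0]
    while node is not None:
        if all(subsumed_by(t, node) for t in terms[1:]):
            return node
        node = PARENT[node]
    raise ValueError("terms have no common ancestor")

def reduce_description(D: Set[str]) -> Set[str]:
    return {t for t in D if not any(s != t and subsumed_by(s, t) for s in D)}

def implied_description(descriptions: List[Set[str]]) -> Set[str]:
    return reduce_description({lub(tup) for tup in product(*descriptions)})

def register(catalogue: Set[Tuple[str, str]], d: str,
             author_descr: Set[str], part_descrs: List[Set[str]]) -> Set[str]:
    """Compute RDescr(d) and insert one pair (t, d) per term t."""
    if not part_descrs:                       # step 1, atomic case
        rdescr = reduce_description(set(author_descr))
    else:                                     # step 1, composite case
        rdescr = reduce_description(
            set(author_descr) | implied_description(part_descrs))
    for t in rdescr:                          # step 2
        catalogue.add((t, d))
    return rdescr
```

A possible usage, with hypothetical identifiers: register two atomic documents, then a composite one built from them.

```python
catalogue: Set[Tuple[str, str]] = set()
r1 = register(catalogue, "d1", {"QuickSort", "Java"}, [])
r2 = register(catalogue, "d2", {"BubbleSort", "C++"}, [])
r3 = register(catalogue, "d3", {"Theory"}, [r1, r2])
```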
other library services
searching for relevant documents (querying)
removal of a document
description modification
notification of users (following registration/removal/modification)
document materialization (table of contents and index)
personalization
all these and other services rely on registration descriptions
querying service
a query is a boolean combination of terms :
q ::= t | q1 ∧ q2 | q1 ∨ q2 | q1 ∧ ¬q2 | ε
its answer is defined recursively as follows :
ans(t) = Ext(t) ∪ Ext(t1) ∪ … ∪ Ext(tn)
where t1, …, tn are the immediate successors of t
ans(q) :
if q = t then ans(t)
else begin
if q = q1 ∧ q2 then ans(q) = ans(q1) ∩ ans(q2);
if q = q1 ∨ q2 then ans(q) = ans(q1) ∪ ans(q2);
if q = q1 ∧ ¬q2 then ans(q) = ans(q1) \ ans(q2)
end
ans(ε) = ∅
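Query evaluation can be sketched over a small in-memory catalogue. Several things here are assumptions rather than part of the talk: Ext(t) is taken to be the set of documents d with a pair (t, d) in the catalogue, queries are encoded as nested tuples, the answer to a term recurses into its successors (which coincides with the definition above when the successors are leaves, as in this example), and the empty query is taken to return the empty set.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical taxonomy fragment: term -> immediate successors.
CHILDREN: Dict[str, List[str]] = {
    "OOL": ["C++", "Java"],
    "Sort": ["QuickSort", "BubbleSort"],
}

# Hypothetical catalogue of (term, document) pairs.
CATALOGUE: Set[Tuple[str, str]] = {
    ("Java", "d1"), ("QuickSort", "d1"),
    ("C++", "d2"), ("BubbleSort", "d2"),
    ("OOL", "d3"), ("Sort", "d3"),
}

def ext(t: str) -> Set[str]:
    """Documents registered directly under term t (assumed meaning of Ext)."""
    return {d for (term, d) in CATALOGUE if term == t}

def ans_term(t: str) -> Set[str]:
    """Documents under t or under any successor of t."""
    result = ext(t)
    for child in CHILDREN.get(t, []):
        result |= ans_term(child)
    return result

def ans(q: tuple) -> Set[str]:
    """Evaluate a query encoded as ('term', t), ('and', q1, q2),
    ('or', q1, q2), ('andnot', q1, q2) or ('empty',)."""
    op = q[0]
    if op == "term":
        return ans_term(q[1])
    if op == "and":
        return ans(q[1]) & ans(q[2])
    if op == "or":
        return ans(q[1]) | ans(q[2])
    if op == "andnot":
        return ans(q[1]) - ans(q[2])
    if op == "empty":
        return set()
    raise ValueError(f"unknown query operator: {op!r}")
```

For instance, ans(("andnot", ("term", "OOL"), ("term", "Java"))) returns the documents on object-oriented languages that are not registered under Java.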
application to XML documents
Our model can easily be instantiated in an XML framework
• XML documents have a hierarchical structure
• XML documents can be combined to form larger, composite documents
XML is now a popular language to represent, exchange and integrate text-based information
• representative application: distributed e-learning repositories
Case study: annotating DocBook documents
Our XML documents are valid w.r.t. the DocBook DTD
• DocBook is a popular DTD in the area of (electronic) publishing
• Well designed to represent structured textbooks
Choosing a specific DTD facilitates the extraction and the composition of parts
• Any other DTD adapted to e-learning documents could have been chosen
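To illustrate how a fixed DTD eases the extraction of parts, the sketch below reads a tiny DocBook-like fragment with Python's standard xml.etree.ElementTree and builds the parts relation from the book, chapter and section elements. The fragment, the id values and the choice of these three elements as the structural ones are assumptions made for the example; this is not the XAnnot code.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Set

# Hypothetical DocBook-like fragment; every structural element carries an id.
DOC = """
<book id="b1">
  <chapter id="c1">
    <title>Sorting</title>
    <section id="s1"><para>Quicksort in Java</para></section>
    <section id="s2"><para>Bubble sort in C++</para></section>
  </chapter>
  <chapter id="c2"><para>Conclusion</para></chapter>
</book>
"""

# Elements treated as documents (assumption: titles and paragraphs are content).
STRUCTURAL = {"book", "chapter", "section"}

def extract_parts(root: ET.Element) -> Dict[str, Set[str]]:
    """Map each structural element id to the ids of its structural children."""
    parts: Dict[str, Set[str]] = {}
    for elem in root.iter():
        if elem.tag in STRUCTURAL:
            parts[elem.get("id")] = {
                child.get("id") for child in elem if child.tag in STRUCTURAL
            }
    return parts

PARTS = extract_parts(ET.fromstring(DOC))
```

Elements mapped to an empty set play the role of atomic documents; the others are composite and would receive an implied annotation once all their parts are annotated.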
the XAnnot prototype
a graphical interface to browse DocBook documents through their hierarchical structure (chapter – sections - subsections – etc)
an interactive tool to annotate nodes by selecting terms from the taxonomy
an implementation of our algorithm: the implied annotation is computed for a document as soon as all its parts have been annotated
ongoing work
continuing the development of the prototype
• implementation of a set of core services
• experimentation (University of French Polynesia)
personalization
• document materialization
• local taxonomies, P2P configuration (in collaboration with CNR-Pisa)
• ranking of query answers (in collaboration with ICS-FORTH)