Tracking changes through EARMARK: a theoretical perspective and an implementation
Silvio Peroni, Francesco Poggi, Fabio Vitali
1st International Workshop on Document Changes: Modeling, Detection, Storage and Visualization (http://diff.cs.unibo.it/dchanges2013/) @ DocEng 2013, Florence, Italy - September 10, 2013
License: http://creativecommons.org/licenses/by-sa/3.0

Tracking Changes through EARMARK: a Theoretical Perspective and an Implementation


DESCRIPTION

The Extremely Annotational RDF Markup, a.k.a. EARMARK, is an OWL 2 DL ontology that defines document meta-markup. It is an ontologically precise definition of markup that instantiates the structure of a text document as an independent OWL document outside of the text string it annotates, and through appropriate OWL and SWRL characterizations it can define organizations such as trees or graphs and can be used to generate validity constraints. In this paper we present an extension of EARMARK that allows us to describe how markup documents evolve in time, which complies with concepts expressed in the Functional Requirements for Bibliographic Records (FRBR).


Outline

• Documents and their changes in time through FRBR

• Change tracking and provenance data of XML documents

• Defining multi-hierarchy documents through EARMARK

• EARMARK Changes Ontology and its application

• Querying and byte-counting

• Conclusions


Documents do change in time

• Any text produced by a creative act
  ✦ starts from a particular draft made by someone at a certain time
  ✦ is then modified through consecutive revisions
  ✦ may end up being forked into different variants
  ✦ may be modified by additional editorial activities such as typo-fixing, shortening, restructuring, etc.

• Importance of keeping track of changes
  ✦ Computer Science: to show how programming code or computational models evolve throughout the natural lifecycle of software development
  ✦ Philology: to tell the way in which variant copies of the same book overlap in time and content
  ✦ Scientific Publishing: to understand the extent and quality of the modifications that drive the final acceptance or rejection of a paper
  ✦ etc.


About (textual) documents and their changes

A document is more than the string it is composed of

• Alice produces a hand-written document on a piece of paper, composed by the string “Hello world.”
• Bob produces a digital document through a word-processor, composed by the string “Hello world.”
  ✦ different documents, same string
• Charles modifies Bob’s document, producing a new version composed by the string “Hey, hello world!”
  ✦ to a certain extent the same document, but different strings

A document refers to different strings in time


Introducing FRBR for change tracking

• “Functional Requirements for Bibliographic Records (FRBR) is a conceptual entity-relationship model developed by the International Federation of Library Associations and Institutions (IFLA) that relates user tasks of retrieval and access in online library catalogues and bibliographic databases from a user’s perspective.” (from http://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records)

• According to FRBR, document is an overloaded word, and is better replaced by four different concepts, called respectively
  ✦ Work, coupled with the concept of identity
  ✦ Expression, to record the evolution in time
  ✦ Manifestation, which specifies the form and format
  ✦ Item, identifying the concrete object

• FRBR Work layer: the abstract conceptualisation of a document
  ✦ Work: a distinct intellectual or artistic creation

• FRBR Expression layer: the string of characters constituting the content of a document
  ✦ Expression: the intellectual or artistic realisation of a work in the form of alpha-numeric, musical, or choreographic notation, sound, image, object, movement, etc., or any combination of such forms
  ✦ Expressions never change in time

• Example, along the time axis: Bob’s “Hello world.” and Charles’ “Hey, hello world!” are two Expressions, each a realisation of the same Work; the second is a revision of the first


What happens to XML-based markup documents?

• Which changes do we have to keep track of among those that occur in XML documents, and which markup elements, after edits and document changes, are
  ✦ directly affected
  ✦ hierarchically affected
  ✦ completely unaffected

• Research questions:
  ✦ When a markup element E within a document version V1 changes in some way, e.g. by adding something to the text it contains, thereby generating document version V2, are the two instances of E (E1 in V1 and E2 in V2) to be considered actually the same element?
  ✦ If the aforementioned instances of E are to be considered different, is the difference meant to be propagated also to their ancestor elements?


Applying FRBR to XML markup
Elements and text nodes as FRBR Expressions

• Alice’s version:
    <section> <p>Some <em>interesting</em> content.</p> <p>It was written by me.</p> </section>

• Bob’s version (Alice’s version revised by Bob):
    <section> <p>Some <em>very interesting</em> content.</p> <p>It was written by me.</p> </section>

• Charles’ version (Bob’s version revised by Charles):
    <section> <p>Some <em>very interesting</em> content.</p> <p>It was written by me.</p> </section>

[Diagram: the elements (section, p, em) and text nodes of each version are drawn as Expressions linked by “has part” and “revision of” relations; the labels NEW, INS and DEL mark the nodes that are new in a version, the inserted text “very” and the deleted markup.]


Who, What, How and When

• An important part of change tracking operations involves keeping track of provenance information
  ✦ who made the modification
  ✦ what was modified
  ✦ how it was modified
  ✦ when it was modified

• How do we keep track of all these data in practice?
  ✦ XML-based languages use workarounds to implement overlapping markup
  ✦ Other solutions?

• Provenance data in the running example:
  ✦ Version made by Alice: June 19, 2013, at 13:45
  ✦ Version made by Bob: July 3, 2013, at 04:15
  ✦ Text (“very”) inserted by Bob: July 3, 2013, at 04:15
  ✦ Version made by Charles: July 5, 2013, at 03:33
  ✦ Markup deleted by Charles: July 5, 2013, at 03:33


From theory to practice through EARMARK

The Extremely Annotational RDF Markup (EARMARK) is at the same time a markup meta-language and an ontology of (document) markup

  ✦ More expressive than XML: it allows markup structures to be organised as graphs
  ✦ It makes it easy to associate annotations, such as change tracking information, with document items: since an EARMARK document is a set of OWL assertions, all the markup items and text nodes are individuals of particular classes, each identified by an IRI
  ✦ Many tools are available:
    - a Java API
    - frameworks to convert XML documents into EARMARK ones and vice versa

More information at http://palindrom.es/phd/research/earmark


Example of EARMARK document
Linearised using Turtle

Text: Some interesting content. It was written by me.
Markup: a section containing two p elements; the first p contains an em

    # Textual content of the document
    :content a earmark:StringDocuverse ;
        earmark:hasContent "Some interesting content. It was written by me."^^xsd:string .

    # String 'Some '
    :r1 a earmark:PointerRange ;
        earmark:refersTo :content ;
        earmark:begins "0"^^xsd:nonNegativeInteger ;
        earmark:ends "5"^^xsd:nonNegativeInteger .

    # String 'interesting'
    :r2 a earmark:PointerRange ...
    # String ' content.'
    :r3 a earmark:PointerRange ...
    # String 'It was written by me.'
    :r4 a earmark:PointerRange ...

    # Element 'section'
    :section a earmark:Element ;
        earmark:hasGeneralIdentifier "section"^^xsd:string ;
        co:firstItem [ a co:ListItem ;
            co:itemContent :p1 ;
            co:nextItem [ a co:ListItem ;
                co:itemContent :p2 ] ] .

    # First element 'p'
    :p1 a earmark:Element ;
        earmark:hasGeneralIdentifier "p"^^xsd:string ;
        co:firstItem [ a co:ListItem ;
            co:itemContent :r1 ;
            co:nextItem [ a co:ListItem ;
                co:itemContent :em ;
                co:nextItem [ a co:ListItem ;
                    co:itemContent :r3 ] ] ] .

    ... and similarly for the other markup elements

Full Turtle source of the document available at http://www.essepuntato.it/2013/dchanges
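The way a pointer range selects part of a docuverse can be sketched in a few lines of Python. This is only an illustration, not the EARMARK Java API: the `PointerRange` class below is a hypothetical stand-in, it assumes that `begins` and `ends` behave like half-open character offsets (which is consistent with 'Some ' spanning 0 to 5), and the offsets for r2, r3 and r4 are computed here because the slide elides them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PointerRange:
    """Hypothetical stand-in for earmark:PointerRange over a string docuverse."""
    content: str  # the docuverse text the range refers to
    begins: int   # offset of the first character (inclusive)
    ends: int     # offset just past the last character (exclusive)

    def text(self) -> str:
        # Slice the docuverse to obtain the string the range points at.
        return self.content[self.begins:self.ends]


CONTENT = "Some interesting content. It was written by me."

r1 = PointerRange(CONTENT, 0, 5)    # offsets given on the slide
# The offsets below are NOT on the slide; they are computed for illustration.
r2 = PointerRange(CONTENT, 5, 16)   # 'interesting'
r3 = PointerRange(CONTENT, 16, 25)  # ' content.'
r4 = PointerRange(CONTENT, 26, 47)  # 'It was written by me.'

for name, rng in [("r1", r1), ("r2", r2), ("r3", r3), ("r4", r4)]:
    print(name, repr(rng.text()))
```

Because the ranges only point into the docuverse, several markup hierarchies can refer to the same characters without copying them.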


EARMARK Changes Ontology
Extending EARMARK to manage change tracking information

• The EARMARK Changes Ontology (EChO) extends the EARMARK ontology and includes the OWL 2 DL implementation of FRBR (http://purl.org/spar/frbr) and the Provenance Ontology (http://www.w3.org/ns/prov#), so as to keep track of all the changes and provenance data related to different versions of the same document. It uses:
  ✦ the EARMARK items (docuverses, ranges and markup items) to model the structure of the different document versions and to store them all within a single EARMARK document
  ✦ frbr:revisionOf to indicate that a markup item is a revision of another
  ✦ prov:wasDerivedFrom to indicate that a range is actually derived from another one defined in a previous version of the document
  ✦ prov:wasGeneratedBy (coupled with instances of echo:VersionCreation and echo:ItemInsertion) and prov:generatedAtTime to indicate that a particular markup item, a range or a whole document version has been created at a certain time
  ✦ prov:wasInvalidatedBy (coupled with instances of echo:VersionRemoval and echo:ItemDeletion) and prov:invalidatedAtTime to indicate that a particular markup item, a range or a whole document version has been deleted at a certain time
  ✦ prov:wasAssociatedWith to indicate the agent involved in the activity of generation/invalidation of a certain item


Version creation

• Who: Alice

• What: document version (implicitly identified by the document element of the markup document :section)

• How: creation

• When: June 19, 2013 at 13:45

    :section
        # Provenance information
        prov:wasGeneratedBy :creation-by-alice ;
        prov:generatedAtTime "2013-06-19T13:45:00Z"^^xsd:dateTime .

    # Activity of creation of a new version
    :creation-by-alice a echo:VersionCreation ;
        prov:wasAssociatedWith :alice .
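The mapping from who, what, how and when to the four statements above can be sketched as a small Python helper. This is a minimal sketch under stated assumptions: `version_creation` is a hypothetical function (it is not part of any EARMARK tool), triples are represented as plain tuples of prefixed names, and the timestamp is assumed to be serialised in UTC.

```python
from datetime import datetime, timezone


def version_creation(item, activity, agent, when):
    """Return the provenance triples for the creation of a document version.

    item:     what was created (e.g. ':section')
    activity: how, i.e. the echo:VersionCreation activity
    agent:    who performed the activity
    when:     timezone-aware datetime of the creation
    """
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return [
        (item, "prov:wasGeneratedBy", activity),
        (item, "prov:generatedAtTime", f'"{stamp}"^^xsd:dateTime'),
        (activity, "rdf:type", "echo:VersionCreation"),
        (activity, "prov:wasAssociatedWith", agent),
    ]


triples = version_creation(
    ":section",
    ":creation-by-alice",
    ":alice",
    datetime(2013, 6, 19, 13, 45, tzinfo=timezone.utc),
)
for s, p, o in triples:
    print(s, p, o, ".")
```

The same shape would serve for insertions and deletions by swapping the activity class and the prov properties.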


Revision, insertion and deletion

• Bob’s revision of Alice’s version was made on July 3, 2013, at 04:15, and concerns only the insertion of the string “very ” as the first textual node of the element em

• Charles’ revision of Bob’s version was made on July 5, 2013, at 03:33, and deletes the second p of Bob’s section

Revision:

    # Element 'section' by Bob
    :section-by-bob a earmark:Element ;
        earmark:hasGeneralIdentifier "section"^^xsd:string ;
        co:firstItem [ a co:ListItem ;
            co:itemContent :p1-by-bob ;
            co:nextItem [ a co:ListItem ;
                co:itemContent :p2 ] ] ;
        # relation with previous version
        frbr:revisionOf :section ;
        # provenance information
        prov:wasGeneratedBy :creation-by-bob ;
        prov:generatedAtTime "2013-07-03T04:15:00Z"^^xsd:dateTime .

    :creation-by-bob a echo:VersionCreation ;
        prov:wasAssociatedWith :bob .

Insertion:

    # New content of the document
    :content-by-bob a earmark:StringDocuverse ;
        earmark:hasContent "very "^^xsd:string .

    # New string 'very '
    :r5 a earmark:PointerRange ;
        earmark:refersTo :content-by-bob ;
        earmark:begins "0"^^xsd:nonNegativeInteger ;
        earmark:ends "5"^^xsd:nonNegativeInteger ;
        # provenance information
        prov:wasGeneratedBy :insertion-by-bob ;
        prov:generatedAtTime "2013-07-03T04:15:00Z"^^xsd:dateTime .

    :insertion-by-bob a echo:ItemInsertion ;
        prov:wasAssociatedWith :bob .

Deletion:

    # Second element 'p' of Bob's 'section'
    :p2
        prov:wasInvalidatedBy :deletion-by-charles ;
        prov:invalidatedAtTime "2013-07-05T03:33:00Z"^^xsd:dateTime .

    :deletion-by-charles a echo:ItemDeletion ;
        prov:wasAssociatedWith :charles .


Splitting ranges up

• Daniel’s revision of Alice’s version, in which Daniel substitutes the string “me” in the second p with his name (i.e. the string “Daniel”)

• In EARMARK, this string substitution (i.e. a deletion plus an insertion) is made possible by defining four new ranges

• We use prov:wasDerivedFrom statements between ranges to describe (at an abstract level) a more complex scenario of overlapping markup between the two versions
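One possible reading of the four new ranges can be sketched in Python. The slide does not list the ranges, so the split below is an assumption: three ranges carved out of the original range for 'It was written by me.' (the text before "me", the deleted "me", and the text after it) plus one range over a new docuverse holding "Daniel". The function name and the offsets are illustrative only.

```python
def split_for_substitution(content, begins, ends, old):
    """Split the half-open range [begins, ends) around the first occurrence
    of `old` inside it, returning (before, removed, after) offset pairs."""
    start = content.index(old, begins, ends)  # raises ValueError if absent
    stop = start + len(old)
    return (begins, start), (start, stop), (stop, ends)


content = "Some interesting content. It was written by me."
new_content = "Daniel"  # assumed new docuverse for the inserted string

# Range of the second paragraph's text (offsets computed for illustration).
before, removed, after = split_for_substitution(content, 26, 47, "me")
inserted = (0, len(new_content))  # range over the new docuverse

# Daniel's paragraph reuses the untouched ranges and adds the new one.
daniel_text = (
    content[before[0]:before[1]]
    + new_content[inserted[0]:inserted[1]]
    + content[after[0]:after[1]]
)
print(before, removed, after, inserted)
print(daniel_text)
```

Under this reading, the `before` and `after` ranges would carry prov:wasDerivedFrom links to the original range, while `removed` would be the one invalidated by Daniel's deletion.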


Querying the history of changes

• Since EARMARK is defined by means of Semantic Web technologies, we can use already implemented standards such as SPARQL 1.1 to query over the change tracking history of a certain EARMARK document

✦ Return a new EARMARK document that contains only Bob’s version

    CONSTRUCT {
        ?other ?p ?o .
        ?version ?pv ?ov .
        ?docuverse ?pd ?od
    } WHERE {
        {
            SELECT DISTINCT ?other ?version WHERE {
                {
                    SELECT DISTINCT ?version WHERE {
                        ?version a earmark:Element ;
                            prov:wasGeneratedBy ?activity .
                        ?activity a echo:VersionCreation ;
                            prov:wasAssociatedWith :bob
                    }
                }
                ?other (^co:itemContent?/^co:item)+ ?version
            }
        }
        ?version ?pv ?ov .
        ?other ?p ?o .
        OPTIONAL {
            ?other a earmark:PointerRange ;
                earmark:refersTo ?docuverse .
            ?docuverse ?pd ?od
        }
    }

✦ Select the textual content of all paragraphs removed by Charles

    SELECT DISTINCT ?range WHERE {
        ?p a earmark:Element ;
            earmark:hasGeneralIdentifier "p"^^xsd:string ;
            co:item/co:itemContent ?range ;
            prov:wasInvalidatedBy ?activity .
        ?range a earmark:PointerRange .
        ?activity a echo:ItemDeletion ;
            prov:wasAssociatedWith :charles
    }
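To make the matching logic of the second query concrete, here is a plain-Python sketch that evaluates the same pattern over a toy set of triples. It is not a SPARQL engine and does not use any RDF library; the triple set is invented for illustration, the blank-node names are hypothetical, and co:item is asserted directly rather than inferred from co:firstItem and co:nextItem.

```python
# Toy data: two paragraphs, only :p2 was deleted by Charles.
TRIPLES = {
    (":p1", "rdf:type", "earmark:Element"),
    (":p1", "earmark:hasGeneralIdentifier", "p"),
    (":p1", "co:item", "_:i1"),
    ("_:i1", "co:itemContent", ":r1"),
    (":r1", "rdf:type", "earmark:PointerRange"),
    (":p2", "rdf:type", "earmark:Element"),
    (":p2", "earmark:hasGeneralIdentifier", "p"),
    (":p2", "co:item", "_:i2"),
    ("_:i2", "co:itemContent", ":r4"),
    (":r4", "rdf:type", "earmark:PointerRange"),
    (":p2", "prov:wasInvalidatedBy", ":deletion-by-charles"),
    (":deletion-by-charles", "rdf:type", "echo:ItemDeletion"),
    (":deletion-by-charles", "prov:wasAssociatedWith", ":charles"),
}


def has(s, p, o):
    return (s, p, o) in TRIPLES


def objects(s, p):
    return {o for (s2, p2, o) in TRIPLES if s2 == s and p2 == p}


def ranges_removed_by(agent):
    """Ranges directly contained in 'p' elements deleted by `agent`."""
    found = set()
    for (elem, pred, name) in TRIPLES:
        if pred != "earmark:hasGeneralIdentifier" or name != "p":
            continue
        if not has(elem, "rdf:type", "earmark:Element"):
            continue
        deleted = any(
            has(act, "rdf:type", "echo:ItemDeletion")
            and has(act, "prov:wasAssociatedWith", agent)
            for act in objects(elem, "prov:wasInvalidatedBy")
        )
        if not deleted:
            continue
        # Equivalent of the property path co:item/co:itemContent.
        for item in objects(elem, "co:item"):
            for rng in objects(item, "co:itemContent"):
                if has(rng, "rdf:type", "earmark:PointerRange"):
                    found.add(rng)
    return sorted(found)


print(ranges_removed_by(":charles"))
```

In practice a SPARQL 1.1 engine would evaluate the query directly; the sketch only spells out which triples have to be present for a range to be returned.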


Byte-counting EARMARK documents

• We used two documents
  ✦ The first document is composed of seven different versions, named after the “Seven Dwarfs” for recognizability and obtained by applying very common edits made by three authors
  ✦ The second document is composed of seven different versions, named after the weekdays and created by seven different authors when editing a very simple document

• We compared the size in bytes of consecutive versions of such documents according to the OpenDocument (ODT) and Office Open XML (OOXML) formats, and to EARMARK linearised in six different formats: Turtle, RDF/XML, OWL/XML, N-Triples, HDT and Manchester Syntax

• Findings (from the chart):
  ✦ Turtle seems the best linearisation format for EARMARK
  ✦ HDT (compressed) performs similarly to ODT and OOXML
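The measurement itself, comparing the byte size of consecutive versions, can be sketched as follows. This is a minimal illustration, assuming each version is available as a serialised string encoded in UTF-8; the two sample strings are the placeholder texts from the earlier slides, not the documents used in the experiment, and the function names are hypothetical.

```python
def byte_size(serialisation: str) -> int:
    """Size in bytes of a serialised document, assuming UTF-8 encoding."""
    return len(serialisation.encode("utf-8"))


def growth(versions):
    """Byte-size difference between each version and the previous one."""
    sizes = [byte_size(v) for v in versions]
    return [later - earlier for earlier, later in zip(sizes, sizes[1:])]


# Placeholder versions only; real input would be the Turtle, RDF/XML,
# ODT, etc. serialisations of each document version.
versions = ["Hello world.", "Hey, hello world!"]
print([byte_size(v) for v in versions])
print(growth(versions))
```

Running the same two functions over every format of every version gives the per-format growth curves that the comparison is based on.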


Conclusions

• In this paper we presented a theoretical approach to track document changes based on FRBR and provenance data

• We proposed one possible implementation of it through EARMARK, a Semantic Web-aware meta-markup language that enables the definition of multiple overlapping markup hierarchies representing different versions of the same document

• We highlighted the main advantages and drawbacks in terms of querying and storing such EARMARK documents

• In the future we plan to extend EChO so as to enable the description of additional change tracking operations (e.g. swap, update)

• We also plan to experiment with translation mechanisms that convert EARMARK documents with change tracking information into XML formats, e.g. ODT and OOXML


Thanks for your attention
