The Web as infrastructure for scholarly research and communication


Keynote presented at IDCC13, Amsterdam, The Netherlands, January 16 2013.


Wanderer above the Sea of Fog – Caspar David Friedrich (1818) http://en.wikipedia.org/wiki/Wanderer_above_the_Sea_of_Fog

@hvdsomp #idcc13

Herbert Van de Sompel

IDCC 2013, Amsterdam, The Netherlands, January 16 2013


The Scholarly Record is Changing

•  The scholarly record is extending with a wide range of non-traditional assets emerging from eScience and eHumanities
   •  e.g. datasets, software, ontologies, workflows, online debate, slides, blogs, videos, etc.

•  Many of these non-traditional assets:
   •  Have a wide range of relationships with and dependencies on other assets – grouping assets
   •  Are becoming increasingly dynamic, and do not have the sense of fixity that traditional assets such as journal articles or books have – versioning assets


grouping assets

versioning assets


discovering assets


1999

•  OAI was a heroic effort to fundamentally transform scholarly communication
   •  By promoting communication via preprints, non-peer-reviewed papers

•  The OAI took a technical approach to achieve the goal
   •  Make preprints easier to discover, access – Protocol for Metadata Harvesting
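As a rough illustration of the harvesting approach: an OAI-PMH request is an ordinary HTTP GET against a repository's baseURL, with a verb parameter and its arguments. The sketch below only builds such request URLs and does not contact any server. The base URL and the record identifier are hypothetical placeholders, not taken from the talk.

```python
from urllib.parse import urlencode


def oai_pmh_url(base_url, verb, **params):
    """Build an OAI-PMH request URL: baseURL + '?' + verb and its arguments."""
    query = {"verb": verb}
    query.update(params)
    return base_url + "?" + urlencode(query)


# Hypothetical repository baseURL and record identifier, for illustration only.
BASE_URL = "http://repository.example.org/oai"

# Harvest all records in Dublin Core.
list_url = oai_pmh_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")

# Fetch a single record by its identifier.
record_url = oai_pmh_url(
    BASE_URL,
    "GetRecord",
    identifier="oai:repository.example.org:1234",
    metadataPrefix="oai_dc",
)

print(list_url)
print(record_url)
```

Note that the record identifier travels as a query argument rather than being dereferenced directly, which is the design choice the annotations on the following slide comment on.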


HTTP GET on record identifier

An HTTP link

Don’t trust HTTP

Just another HTTP baseURL


grouping assets

versioning assets


2007

•  OAI-ORE observation: Scholarly assets are rapidly becoming compound, consisting of multiple resources with various:
   •  Relationships
   •  Interdependencies

•  How to convey this compound-ness in an interoperable manner so that applications can access, consume such assets?
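One way to picture the OAI-ORE answer is as a small set of RDF statements: a Resource Map describes an Aggregation, and the Aggregation lists the resources it aggregates. The sketch below emits such statements as N-Triples using plain Python. The ORE vocabulary terms are from the OAI-ORE specification; all example.org URIs and the helper names are hypothetical.

```python
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def resource_map_triples(rem_uri, aggregation_uri, aggregated_uris):
    """Return (subject, predicate, object) triples for a minimal Resource Map."""
    triples = [
        (rem_uri, RDF_TYPE, ORE + "ResourceMap"),
        (rem_uri, ORE + "describes", aggregation_uri),
        (aggregation_uri, RDF_TYPE, ORE + "Aggregation"),
    ]
    # One ore:aggregates statement per constituent resource.
    for uri in aggregated_uris:
        triples.append((aggregation_uri, ORE + "aggregates", uri))
    return triples


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples lines."""
    return "\n".join("<%s> <%s> <%s> ." % t for t in triples)


# Hypothetical compound asset: a paper with its dataset and slides.
triples = resource_map_triples(
    "http://example.org/rem/1",
    "http://example.org/aggregation/1",
    [
        "http://example.org/paper.pdf",
        "http://example.org/dataset.csv",
        "http://example.org/slides.ppt",
    ],
)
print(to_ntriples(triples))
```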


See e.g. http://www.ctwatch.org/quarterly/articles/2007/08/interoperability-for-the-discovery-use-and-re-use-of-units-of-scholarly-communication/8/index.html


grouping assets

versioning assets


2009

•  Memento is about the Web and time:
   •  Resources evolve over time
   •  Only the current representation is available from a resource’s URI
   •  How to seamlessly access prior representations, if they exist?

•  Memento looks at this problem for the Web, in general

Digital Preservation Award 2010


•  Memento has potential consequences for scholarly communication

•  Observation: Scholarly assets are becoming increasingly dynamic, and do not have the sense of fixity that traditional assets such as journal articles or books have
   •  Even traditional assets are becoming increasingly dynamic and dependent on other assets, which may themselves be dynamic

2009


Scientific Workflows, Services, Data, Workflow Engines

Carole Goble, JCDL 2012 Keynote https://dl.dropbox.com/u/617206/JCDL2012keynoteGoble.ppt


From The Version of Record to A Version of the Record

•  The ever-evolving nature of some assets challenges the notion of fixity as “forever frozen” and begs considering the notion of the “state of the scholarly record at a specific moment in time”

•  It will become essential to be able to determine what the state of related and interdependent assets was at certain moments in time


Two Perspectives on Memento

URI-M - http://web.archive.org/web/20010911203610/http://www.cnn.com/

Web Archive

URI-R - http://www.cnn.com/


Two Perspectives on Memento

URI-M - http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333

CMS

URI-R - http://en.wikipedia.org/wiki/September_11_attacks


•  How to get to the time-specific resources from the generic resource?

•  Memento addresses the problem in a resource-centric way:
   •  Resource, URI, state, representation, link, content negotiation

2009


Today Select Date Sep 12 2010 Sep 16 2010

From BL Archive

Access Versions via the original URI and datetime
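Access via the original URI and a datetime can be sketched as content negotiation in the time dimension: the client asks a TimeGate for the original resource and sends the desired datetime in an Accept-Datetime request header. The sketch below only constructs the request and does not send it. The TimeGate address, and the convention of appending the original URI to it, are assumptions for illustration.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_headers(when):
    """Return request headers asking for a resource's state around `when`.

    `when` must be a timezone-aware datetime; it is sent as an HTTP-date in GMT.
    """
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": http_date}


def timegate_request(timegate_prefix, original_uri, when):
    """Pair a TimeGate URL for the original URI with the datetime headers."""
    return timegate_prefix + original_uri, accept_datetime_headers(when)


# Hypothetical TimeGate; the original URI is the one used in the slides.
TIMEGATE_PREFIX = "http://timegate.example.org/timegate/"

url, headers = timegate_request(
    TIMEGATE_PREFIX,
    "http://www.cnn.com/",
    datetime(2010, 9, 12, tzinfo=timezone.utc),
)
print(url)
print(headers)
```

A TimeGate would typically answer with a redirect to the archived copy (the URI-M) whose datetime is closest to the one requested.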



•  Is it possible to reconstruct the Web-based scholarly record as it was at a certain point in time?

•  Consider a special case: Given a paper, can one see the referenced materials as they were at the time of publication of the paper?
   •  ti: Time of publication
   •  Relationship: Cited resources

Recreating a Version of the Record


Published September 15 2004


Domain Gone


Archived copy December 5 2003


Current version


Archived copy December 11 2004


Resource gone


Archived copy December 5 2003


Resource gone


Archived copy unavailable


•  Papers from arXiv: 400,000 papers => 144,000 unique URIs
•  Papers from UNT ETD repository: 3,600 papers => 18,000 URIs
•  Referenced URIs of established scholarly repositories removed (e.g. http://dx.doi.org), i.e. focusing in on the periphery of the scholarly record

•  Study looks into:
   •  Does the referenced resource still exist?
   •  Are there archived versions of the referenced resource?
      •  From around the time of publication of the citing paper?

•  Study does not look into dynamic aspects:
   •  If the referenced resource still exists, is its content the same as at ti?
   •  Does an archived version have the same content as at ti?

Pilot Study at Scale with Memento

Sanderson, R., Phillips, M., and Van de Sompel, H. (2011) Analyzing the Persistence of Referenced Web Resources with Memento. Open Repositories 2011; arXiv preprint arXiv:1105.3459; http://arxiv.org/abs/1105.3459
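The question "is there an archived version from around the time of publication?" can be sketched as a nearest-datetime check. The code below is not the study's actual method: the function names and the window sizes are assumptions chosen for illustration, and the dates are sample values borrowed from the walkthrough slides.

```python
from datetime import datetime, timedelta


def closest_memento(memento_datetimes, t_i):
    """Return the archived datetime closest to publication time t_i, or None."""
    if not memento_datetimes:
        return None
    return min(memento_datetimes, key=lambda d: abs(d - t_i))


def archived_around(memento_datetimes, t_i, window):
    """True if some archived copy falls within `window` of t_i."""
    closest = closest_memento(memento_datetimes, t_i)
    return closest is not None and abs(closest - t_i) <= window


# Sample values: publication date and two archived-copy dates.
t_i = datetime(2004, 9, 15)
mementos = [datetime(2003, 12, 5), datetime(2004, 12, 11)]

closest = closest_memento(mementos, t_i)
within_30_days = archived_around(mementos, t_i, timedelta(days=30))
within_90_days = archived_around(mementos, t_i, timedelta(days=90))
print(closest, within_30_days, within_90_days)
```

How wide the window should be is a judgment call: a narrow window gives more confidence that the archived copy reflects what the author saw, at the cost of fewer matches.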


UNT


arXiv


The Good News ™

•  Despite there not being a pro-active effort to archive those resources, a considerable number were archived

o  Because they had HTTP URIs and hence were archived as part of ongoing web archiving processes

o  In The Wild archiving comes for free with the web infrastructure

•  Resources that now return 404 exist in web archives, and Memento can access them via their original HTTP URI

o  Does that make an HTTP URI a PID?


The Bad News ™

•  Many resources were not archived

•  For many resources there were no archival versions around ti


Automatic Creation of Archival Snapshots

•  There is a need for a more pro-active approach to archive dynamic, interdependent assets, e.g.:

o  Web Archives as infrastructure
o  Use CMS, wikis, datawikis with solid versioning mechanisms
o  Archiving linked context at the time of publication
o  Archive at the moment of use (social interaction, downloading, annotating, etc.)
o  Delineate which resources are considered in/out of a scholarly asset (OAI-ORE) to understand what needs archiving


discovering assets


2012

•  ResourceSync is about allowing 3rd party systems and applications to remain synchronized with a server’s evolving resources.

•  Many use cases:
   •  Mirroring repository content
   •  Aggregating content
   •  Replicating datasets
   •  Exposing content to archives
   •  Keeping linked data applications that leverage remote data up-to-date

•  Differing needs regarding:
   •  Coverage
   •  Accuracy
   •  Latency


ResourceSync Approach

•  Resource centric; it’s all about the URI (again)

•  Introduces a set of modular capabilities that a server can implement to allow 3rd parties to remain in sync with its resources. Recurrently publish:

o  Resource Lists
o  Change Lists
o  Resource Dumps
o  Change Dumps

•  All capabilities based on the Sitemap document formats and extensions thereof

o  Existing Sitemaps are off-the-shelf compliant
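Since a plain Sitemap can already serve as a Resource List, a baseline synchronization step can be sketched as: parse the Sitemap, then compare it with what the client already holds. The sketch below uses only the standard Sitemap elements (loc, lastmod); it does not cover the ResourceSync extensions, and the function names, sample document and example.org URIs are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_resource_list(xml_text):
    """Map each resource URI (<loc>) to its <lastmod> value, or None if absent."""
    root = ET.fromstring(xml_text)
    resources = {}
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        resources[loc.strip()] = lastmod.strip() if lastmod else None
    return resources


def diff_resources(local, remote):
    """Compare URI -> lastmod maps; return (created, updated, deleted) URI lists."""
    created = sorted(set(remote) - set(local))
    deleted = sorted(set(local) - set(remote))
    updated = sorted(u for u in set(local) & set(remote) if local[u] != remote[u])
    return created, updated, deleted


# Hypothetical Resource List published by a server.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.org/res1</loc><lastmod>2013-01-02T13:00:00Z</lastmod></url>
  <url><loc>http://example.org/res3</loc><lastmod>2013-01-03T09:00:00Z</lastmod></url>
</urlset>"""

# Hypothetical state the client holds from an earlier synchronization.
local_state = {
    "http://example.org/res1": "2013-01-01T00:00:00Z",
    "http://example.org/res2": "2013-01-01T00:00:00Z",
}

created, updated, deleted = diff_resources(local_state, parse_resource_list(SAMPLE))
print(created, updated, deleted)
```

Re-reading a full list on every pass is simple but costly for large collections, which is where the Change List capability comes in.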


ResourceSync Capabilities


2012

•  Beta spec end 01/2013
   •  http://www.openarchives.org/rs/

•  Feedback
   •  mailto:resourcesync@googlegroups.com

•  Papers in D-Lib Magazine
   •  http://dx.doi.org/10.1045/september2012-vandesompel
   •  http://dx.doi.org/10.1045/january2013-klein

•  Paper in Ariadne
   •  http://www.ariadne.ac.uk/issue70/lewis-et-al


1998 - 2013


a stack of journals or a bunch of PDF files

a network of interconnected assets and actors

1998 - 2013


Conclusion

•  OAI-ORE, Memento, ResourceSync illustrate the potential of leveraging the Web infrastructure for scholarly communication

•  This suggests that other special requirements of scholarly communication (certification, archiving, persistence, trust, annotation, metrics, …) may be addressable in an interoperable manner by leveraging the Web infrastructure

•  Wins:
   •  Long Term Sustainability: Reuse of infrastructure (network, software, platforms, standards, etc.) that the entire world depends on
   •  Integration of scholarly discourse with other Web-based discourse

