Digital Preservation Research at Old Dominion University, Justin F. Brunelle, The MITRE Corporation, Old Dominion University (And hopefully MITRE, soon)

Digital Preservation - ODU

DESCRIPTION

This is the slide deck of the presentation given to the RRAC national group meeting on 10-20-2010. It is a summary of the research efforts in Digital Preservation at ODU.

Page 1: Digital Preservation - ODU

Digital Preservation Research at Old Dominion University

Justin F. Brunelle

The MITRE Corporation

Old Dominion University

(And hopefully MITRE, soon)

Page 2: Digital Preservation - ODU

Why are we listening?

• Overview of the problem

• BRIEF introduction to ODU WSDL group research

• Memento

• I’ll be skipping around, so don’t hesitate to interrupt me

Page 3: Digital Preservation - ODU

Digital Preservation

• Using the past Web

– Focus of our research

• Temporal Browsing

– Sessions in the past

• Recovering Lost Pages

– Is it really gone?

• 404s

– How to fix broken links?

Page 4: Digital Preservation - ODU

Change on the Web

1. Same URI maps to same or very similar content at a later time (time A: U1 → C1; time B: U1 → C1)

2. Same URI maps to different content at a later time (time A: U1 → C1; time B: U1 → C2)

3. Different URI maps to same or very similar content at the same or at a later time (time A: U1 → C1; time B: U2 → C1, U1 → 404)

4. The content cannot be found at any URI (time A: U1 → C1; time B: U1 → ??)
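The four kinds of change on this slide can be told apart mechanically once there is a snapshot of the Web at time B. The sketch below is only an illustration: the names `Change` and `classify_change` are hypothetical, "same or very similar" is simplified to exact string equality, and a URI that no longer resolves (for example a 404) is modelled as `None`.

```python
from enum import Enum
from typing import Mapping, Optional


class Change(Enum):
    SAME_URI_SAME_CONTENT = 1
    SAME_URI_DIFFERENT_CONTENT = 2
    DIFFERENT_URI_SAME_CONTENT = 3
    CONTENT_NOT_FOUND = 4


def classify_change(uri: str, content_then: str,
                    web_now: Mapping[str, Optional[str]]) -> Change:
    """Classify what happened to (uri, content_then) by time B.

    web_now maps each URI to its content at time B, or to None when
    dereferencing it fails. Equality stands in for "very similar".
    """
    current = web_now.get(uri)
    if current == content_then:
        return Change.SAME_URI_SAME_CONTENT
    # Assumption: if the content moved, report that even when the old
    # URI now serves something else.
    if any(other != uri and body == content_then
           for other, body in web_now.items()):
        return Change.DIFFERENT_URI_SAME_CONTENT
    if current is not None:
        return Change.SAME_URI_DIFFERENT_CONTENT
    return Change.CONTENT_NOT_FOUND


print(classify_change("U1", "C1", {"U1": None, "U2": "C1"}))
```

A real comparison would need a similarity measure rather than equality, which is exactly where "very similar" becomes a research question.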

Page 5: Digital Preservation - ODU

Time to Talk About Saving Everything?

Dinner for one or two costs more than 1TB disk

Wikis have popularized versioning

Cool URIs (http://www.w3.org/Provider/Style/URI.html) are widely adopted, e.g.:

http://news.yahoo.com/s/ap/20100920/ap_on_el_se/us_alaska_senate

http://d.yimg.com/a/p/ap/20100918/capt.67567dbc0a874b689f0b4a5c392f379c-67567dbc0a874b689f0b4a5c392f379c-0.jpg

http://d.yimg.com/a/p/afp/20100918/thumb.photo_1284846332993-1-0.jpg

Also related projects with cool URI / permalink focus:

http://www.citability.org/

http://data.gov/

http://data.gov.uk/

Page 6: Digital Preservation - ODU

Fortress Model

• Get a lot of money

• Buy lots of storage

• Hire lots of people

• “Look upon my archive ye Mighty, and despair!”

Page 7: Digital Preservation - ODU

Alternate Methods

• Lazy Preservation (McCown)

– “How much preservation do I get if I do absolutely nothing?”

• Just-In-Time Preservation (Klein)

– Wait for it to disappear, then find a “good ’nuff” version

• Shared Infrastructure Preservation

– Push content to sites that might preserve it (arXiv.org, IA, WebCite…)

• Server Enhanced Preservation

– Create archival-ready resources

Page 8: Digital Preservation - ODU

And Soon…

• Social Preservation

– Preserving resources using 3rd party Web Services

– Repository for OAI-ORE ReMs

– Social network feel

– Lazy-esque, server-side reconstruction

Page 9: Digital Preservation - ODU

But I digress…

• Few years away…

• Preliminary research

• And now back to the prior research…

Page 10: Digital Preservation - ODU

Web Infrastructure (McCown, 2007)

Page 11: Digital Preservation - ODU

WayBack Machine

http://web.archive.org/web/*/http://www.thecribs.com/

http://mementoproxy.cs.odu.edu/aggr/timemap/link/http://www.thecribs.com/

From these we can create time-based:

• indexes

• IDF values

• PageRank

Page 12: Digital Preservation - ODU

Batch Recovery For Sites

http://warrick.cs.odu.edu/

Free limo rides for life?!

Page 13: Digital Preservation - ODU

Reconstruction Diagram

added 20%

identical 50%

changed 33%

missing 17%
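The four figures in the diagram can be computed by comparing the original site with the reconstructed one, resource by resource. The sketch below is an assumption-laden illustration, not the Warrick implementation: `reconstruction_summary` is a hypothetical name, each site is modelled as a mapping from URI to content, and "changed" means any difference in content. The identical, changed and missing shares are taken over the original resources (on the slide they sum to 100%); the slide does not state the denominator for "added", so taking it over the recovered resources is a guess.

```python
from typing import Dict, Mapping


def reconstruction_summary(original: Mapping[str, str],
                           recovered: Mapping[str, str]) -> Dict[str, float]:
    """Compare an original site with its reconstruction.

    Both arguments map URI -> content and must be non-empty.
    """
    if not original or not recovered:
        raise ValueError("both sites need at least one resource")

    identical = sum(1 for uri, body in original.items()
                    if uri in recovered and recovered[uri] == body)
    changed = sum(1 for uri, body in original.items()
                  if uri in recovered and recovered[uri] != body)
    missing = sum(1 for uri in original if uri not in recovered)
    added = sum(1 for uri in recovered if uri not in original)

    return {
        "identical": identical / len(original),
        "changed": changed / len(original),
        "missing": missing / len(original),
        # Assumed denominator: resources in the reconstruction.
        "added": added / len(recovered),
    }


original = {"A": "a", "B": "b", "C": "c", "D": "d", "E": "e", "F": "f"}
recovered = {"A": "a", "B": "b2", "C": "c2", "D": "d", "E": "e", "G": "g"}
summary = reconstruction_summary(original, recovered)
print({name: round(share, 2) for name, share in summary.items()})
```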

Page 14: Digital Preservation - ODU

Real-Time Recovery for URIs

Synchronicity - www.cs.odu.edu/~mklein/

Page 15: Digital Preservation - ODU

Memento wants to make navigating the Web’s Past Easy

http://www.mementoweb.org

http://groups.google.com/group/memento-dev

Page 16: Digital Preservation - ODU

What are you talking about?

• Uniform Resource Identifier (URI) ~= URL

• Resource:

– <HTML>

• Representation

Page 17: Digital Preservation - ODU

W3C Web Architecture: Resource – URI – Representation

• URI identifies Resource

• Representation represents Resource

• Dereferencing the URI yields the Representation

Page 18: Digital Preservation - ODU

W3C Web Architecture: Resource – URI – Representation

• URI identifies Resource

• Representation 1 and Representation 2 each represent the Resource

• Dereferencing the URI with content negotiation selects one Representation

Page 19: Digital Preservation - ODU

Resources

Page 20: Digital Preservation - ODU

Resources have Representations

Page 21: Digital Preservation - ODU

Resources have Representations that Change over Time

Page 22: Digital Preservation - ODU

Only the Current Representation is Available from a Resource

Page 23: Digital Preservation - ODU

Old Representations are Lost Forever

Page 24: Digital Preservation - ODU

Finding Archived Resources

Go to http://www.archive.org/ and search for http://cnn.com

On http://web.archive.org/web/*/http://cnn.com, select desired datetime

Page 25: Digital Preservation - ODU

Archived Resources

http://web.archive.org/web/20010911203610/http://www.cnn.com/
archived resource for http://cnn.com (Sep 11 2001, 20:36:10 UTC)

http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333
archived resource for http://en.wikipedia.org/wiki/September_11_attacks (Dec 20 2001, 4:51:00 UTC)
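The Wayback Machine URI on this slide carries its capture time as a 14-digit stamp, which is how 20010911203610 reads as Sep 11 2001, 20:36:10 UTC. A minimal sketch of pulling the two parts apart follows; `split_wayback_uri` is a hypothetical helper, and it assumes the exact `http://web.archive.org/web/<stamp>/<original>` shape shown here.

```python
from datetime import datetime, timezone
from typing import Tuple

WAYBACK_PREFIX = "http://web.archive.org/web/"


def split_wayback_uri(uri: str) -> Tuple[datetime, str]:
    """Return (capture datetime in UTC, original URI) for a Wayback URI."""
    if not uri.startswith(WAYBACK_PREFIX):
        raise ValueError("not a Wayback Machine URI of the expected form")
    # Everything up to the first slash after the prefix is the stamp.
    stamp, _, original = uri[len(WAYBACK_PREFIX):].partition("/")
    captured = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return captured.replace(tzinfo=timezone.utc), original


captured, original = split_wayback_uri(
    "http://web.archive.org/web/20010911203610/http://www.cnn.com/")
print(captured.isoformat(), original)
```

The wildcard form (`/web/*/`) used for browsing all captures is not a timestamp, so this helper rejects it with a `ValueError`.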

Page 26: Digital Preservation - ODU

Navigating Archived Resources

http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333
archived resource for http://en.wikipedia.org/wiki/September_11_attacks (Dec 20 2001, 4:51:00 UTC)

The "Pentagon" link in the archived page leads to http://en.wikipedia.org/wiki/The_Pentagon (current)

Page 27: Digital Preservation - ODU

Current and Past Web are Not Integrated

• Current and Past Web are based on the same technology.

• But going from Current to Past Web is a matter of (manual) discovery.

• Memento wants to make going from Current to Past Web an (HTTP) protocol matter.

• Memento wants to integrate Current and Past Web.

Page 28: Digital Preservation - ODU

One Memento HTTP Navigation

Page 29: Digital Preservation - ODU

Memento HTTP Flow

1. HEAD R, Accept-Datetime → Link G

2. GET G, Accept-Datetime → 302 M, Vary, TCN, Link R,B,M

3. GET M, Accept-Datetime → 200, Content-Datetime, Link R,B,M

Page 30: Digital Preservation - ODU

One Memento HTTP Navigation

Scenario

• cnn.com includes Link to TimeGate at Internet Archive

• URI-R on one server, URI-G & URI-M on another
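For this scenario, the three-step Memento flow (HEAD on URI-R, GET on URI-G, GET on URI-M) can be sketched as a small client. This is an illustration under assumptions, not the Memento reference client: `negotiate_memento`, `find_timegate` and `fake_send` are hypothetical names, the transport is passed in as a function so that no network access is needed, and the stand-in transport simply replays the header values that appear in the request and response slides of this deck.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# A transport takes (method, uri, request headers) and
# returns (status code, response headers).
Response = Tuple[int, Dict[str, str]]
Transport = Callable[[str, str, Dict[str, str]], Response]


def find_timegate(link_header: str) -> Optional[str]:
    """Return the URI tagged rel="timegate" in a Link header, if any."""
    match = re.search(r'<([^>]+)>\s*;\s*rel="timegate"', link_header)
    return match.group(1) if match else None


def negotiate_memento(uri_r: str, accept_datetime: str,
                      send: Transport) -> Optional[Tuple[str, Optional[str]]]:
    """Return (URI-M, Content-Datetime), or None if a step fails."""
    request_headers = {"Accept-Datetime": accept_datetime}

    # Step 1: HEAD on URI-R; the response should carry a Link to URI-G.
    _, headers = send("HEAD", uri_r, request_headers)
    uri_g = find_timegate(headers.get("Link", ""))
    if uri_g is None:
        return None

    # Step 2: GET on URI-G; a 302 points at URI-M through Location.
    status, headers = send("GET", uri_g, request_headers)
    uri_m = headers.get("Location")
    if status != 302 or uri_m is None:
        return None

    # Step 3: GET on URI-M; Content-Datetime reports the archived datetime.
    status, headers = send("GET", uri_m, request_headers)
    if status != 200:
        return None
    return uri_m, headers.get("Content-Datetime")


def fake_send(method: str, uri: str, headers: Dict[str, str]) -> Response:
    """Stand-in transport replaying the responses shown in the slides."""
    if uri == "http://cnn.com/":
        return 200, {"Link": '<http://web.archive.org/web/timegate/'
                             'http://cnn.com>; rel="timegate"'}
    if uri == "http://web.archive.org/web/timegate/http://cnn.com":
        return 302, {"Location": "http://web.archive.org/web/"
                                 "20010911203610/http://www.cnn.com"}
    return 200, {"Content-Datetime": "Tue, 11 Sep 2001 20:36:10 GMT"}


result = negotiate_memento("http://cnn.com/",
                           "Tue, 11 Sep 2001 20:35:00 GMT", fake_send)
print(result)
```

The header names follow the 2010 slides. The protocol was later standardized as RFC 7089, which reports the archived datetime in `Memento-Datetime` instead of `Content-Datetime`, so a client for current archives would need adjusting.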

Page 32: Digital Preservation - ODU

Memento HTTP Flow: URI-R

HEAD R, Accept-Datetime

HEAD http://cnn.com/ HTTP/1.1
Host: cnn.com
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close
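The Accept-Datetime value in this request uses the usual HTTP date layout. One way to produce it from a Python `datetime` is the standard library's `email.utils.format_datetime`; the wrapper name `accept_datetime_header` is hypothetical, and the sketch insists on a timezone-aware input so that the conversion to GMT is unambiguous.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(moment: datetime) -> str:
    """Format an aware datetime as an HTTP date string in GMT."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    # format_datetime(usegmt=True) requires tzinfo to be timezone.utc.
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


wanted = datetime(2001, 9, 11, 20, 35, 0, tzinfo=timezone.utc)
print("Accept-Datetime:", accept_datetime_header(wanted))
```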

Page 34: Digital Preservation - ODU

Memento HTTP Flow: Success – URI-R

Link G

HTTP/1.1 200 OK
Date: Thu, 21 Jan 2010 00:02:12 GMT
Server: Apache
Link: <http://web.archive.org/web/timegate/http://cnn.com>; rel="timegate"
Content-Length: 255
Connection: close
Content-Type: text/html; charset=iso-8859-1

Page 36: Digital Preservation - ODU

Memento HTTP Flow: URI-G

GET G, Accept-Datetime

GET http://web.archive.org/web/timegate/http://cnn.com HTTP/1.1
Host: web.archive.org
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close
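On receiving this request, the TimeGate has to pick one archived copy for the requested datetime. The slides do not spell out the selection rule, so the sketch below assumes the simplest one, the capture closest in time; `closest_memento` is a hypothetical name, and the three capture times are the ones that appear in the responses elsewhere in this deck.

```python
from datetime import datetime, timezone
from typing import Iterable


def closest_memento(requested: datetime,
                    captures: Iterable[datetime]) -> datetime:
    """Return the capture datetime nearest to the requested datetime."""
    return min(captures, key=lambda captured: abs(captured - requested))


utc = timezone.utc
requested = datetime(2001, 9, 11, 20, 35, 0, tzinfo=utc)
captures = [
    datetime(2001, 9, 11, 20, 30, 51, tzinfo=utc),
    datetime(2001, 9, 11, 20, 36, 10, tzinfo=utc),
    datetime(2001, 9, 11, 20, 47, 33, tzinfo=utc),
]
chosen = closest_memento(requested, captures)
print(chosen.strftime("%Y%m%d%H%M%S"))
```

Under that assumption the 20:36:10 capture wins, since it is 70 seconds from the requested time while the 20:30:51 capture is 249 seconds away.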

Page 38: Digital Preservation - ODU

Memento HTTP Flow: Success – URI-G

302 M, Vary, Link R,B,M

HTTP/1.1 302 Found
Date: Thu, 21 Jan 2010 00:06:50 GMT
Server: Apache
TCN: choice
Vary: negotiate, accept-datetime
Location: http://web.archive.org/web/20010911203610/http://www.cnn.com
Link: <http://cnn.com/>; rel="original",
 <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
 <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Fri, 15 Sep 2000 11:28:26 GMT",
 <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
 <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
 <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
Content-Length: 0
Connection: close
Content-Type: text/plain; charset=UTF-8
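A Link header like the one in this response cannot be split on commas, because each datetime value contains a comma of its own. The sketch below shows one way to read it with regular expressions. `parse_link_header` is a hypothetical helper that assumes every parameter value is wrapped in straight double quotes; it is not a complete Web Linking parser.

```python
import re
from typing import Dict, List

# One link: <target> followed by zero or more ; name="value" parameters.
_LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
_PARAM = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_header(value: str) -> List[Dict[str, str]]:
    """Return one dict per link, with its target under the key "uri"."""
    links = []
    for target, params in _LINK.findall(value):
        entry = {"uri": target}
        entry.update(_PARAM.findall(params))
        links.append(entry)
    return links


header = (
    '<http://cnn.com/>; rel="original", '
    '<http://web.archive.org/web/timebundle/http://cnn.com/>; '
    'rel="timebundle", '
    '<http://web.archive.org/web/20080708093433/http://www.cnn.com>; '
    'rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT"'
)
links = parse_link_header(header)
by_rel = {link["rel"]: link for link in links}
print(by_rel["last-memento"]["datetime"])
```

Indexing the result by `rel` makes it easy to step to the first, last, previous or next memento from any response.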

Page 40: Digital Preservation - ODU

Memento HTTP Flow: URI-M

GET M, Accept-Datetime

GET http://web.archive.org/web/20010911203610/http://www.cnn.com HTTP/1.1
Host: web.archive.org
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close

Page 42: Digital Preservation - ODU

Memento HTTP Flow: Success – URI-M

200, Content-Datetime, Link R,B,M

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
X-Archive-Orig-Accept-Ranges: bytes
…
Content-Type: text/html;charset=utf-8
Content-Length: 23364
Date: Thu, 21 Jan 2010 00:09:40 GMT
Content-Datetime: Tue, 11 Sep 2001 20:36:10 GMT
Link: <http://cnn.com/>; rel="original",
 <http://web.archive.org/web/timebundle/http://cnn.com/>; rel="timebundle",
 <http://web.archive.org/web/20000915112826/http://www.cnn.com>; rel="first-memento"; datetime="Fri, 15 Sep 2000 11:28:26 GMT",
 <http://web.archive.org/web/20080708093433/http://www.cnn.com>; rel="last-memento"; datetime="Tue, 08 Jul 2008 09:34:33 GMT",
 <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="prev-memento"; datetime="Tue, 11 Sep 2001 20:30:51 GMT",
 <http://web.archive.org/web/20010911203610/http://www.cnn.com>; rel="next-memento"; datetime="Tue, 11 Sep 2001 20:47:33 GMT"
Connection: close
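The Content-Datetime in this response (20:36:10) is not the datetime that was requested (20:35:00); the archive returned the copy it had. A client that cares how far off the memento is can compare the two header values. The sketch below uses the standard library's `email.utils.parsedate_to_datetime`; the function name `memento_offset` is hypothetical.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def memento_offset(accept_datetime: str, content_datetime: str) -> timedelta:
    """Return how much later (positive) or earlier (negative) the
    archived copy is than the requested datetime."""
    requested = parsedate_to_datetime(accept_datetime)
    archived = parsedate_to_datetime(content_datetime)
    return archived - requested


offset = memento_offset("Tue, 11 Sep 2001 20:35:00 GMT",
                        "Tue, 11 Sep 2001 20:36:10 GMT")
print(offset.total_seconds())
```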

Page 43: Digital Preservation - ODU

What does it all mean?

• Cutting edge technology

• Existing Infrastructure

• Redefining Web surfing

• MAJOR “real world” implications

Page 44: Digital Preservation - ODU

Closing Thoughts

Preservation not for privileged priesthood

http://doi.acm.org/10.1145/1592761.1592794

http://booktwo.org/notebook/wikipedia-historiography/

No more hoary stories about format obsolescence:

http://blog.dshr.org/2010/09/reinforcing-my-point.html

Don't desiccate resources; leave them on the web

Endless metadata is not preservation…

Archiving as branded service, not infrastructure

http://blog.dshr.org/2010/06/jcdl-2010-keynote.html

Page 45: Digital Preservation - ODU

Acknowledgements

• Slides borrowed from:

• Dr. Michael L. Nelson:

– http://www.slideshare.net/phonedude/my-point-of-view-michael-l-nelson-web-archiving-cooperative

– http://www.slideshare.net/phonedude/review-of-web-archiving

– http://www.slideshare.net/phonedude/memento-time-travel-for-the-web

• Martin Klein:

– http://www.slideshare.net/phonedude/synchronicity-justintime-discovery-of-lost-web-pages