The Management of a Website’s Historical Resources David Chao College of Business San Francisco State University


The Management of a Website’s Historical Resources

David Chao

College of Business

San Francisco State University

Introduction

• An organization’s websites change constantly to reflect the dynamic nature of its environment. These changes affect website structure, content, and the supporting technologies.

Types of Change

• Website structure:
  – Causing web pages’ URLs to change

• Website content:
  – Changes to web pages:
    • Insertions, deletions, modifications
  – Changes to content databases

• Technology

What are a website’s historical resources?

• Outdated URLs

• Outdated web pages:
  – Web page snapshots
  – Content database snapshots
  – Deleted web pages

• Replaced technologies

The Objective of Managing Historical Resources

• The major objective of the management of historical resources is to satisfy users’ needs for historical information by enabling the website to recreate or retrieve web page snapshots.
  – A web page snapshot is the state of a web page at a specific point in time.

Factors Affecting the State of a Web Page

• Content factors:
  – Web page code
  – The state of internal resources it references:
    • Images, style sheets, components, script files, databases, etc.
  – The state of external resources it references:
    • External resources are files not managed by the website but can be referenced in creating the website’s contents.

• Environment factors:
  – Website host environment variables:
    • System clock
  – Web technologies implemented on the server side as well as on the client side

Levels of Web Page Snapshot

• Level 1 snapshot: A level 1 (web document) snapshot is the state of the web document code at snaptime.
  – Creating level 1 snapshots enables a website to trace the changes to the web document code over time.

• Level 2 snapshot: A level 2 snapshot is a level 1 snapshot with the additional requirement that all the internal resources it references are at least level 1 snapshots at the same snaptime.
  – Referencing database snapshots

• Level 3 snapshot: A level 3 snapshot is a level 2 snapshot with the additional requirement that all the external resources it references are at least level 2 snapshots at the snaptime.
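
The level definitions above can be read as a recursive check. The following is a minimal sketch, assuming a hypothetical `Snapshot` record in which each referenced resource is either another `Snapshot` or `None` when nothing was captured. The class, field, and function names are illustrative and not part of the original design.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Snapshot:
    """Hypothetical record of one captured resource (names are illustrative)."""
    name: str
    snaptime: int                       # e.g. 2 stands for T2
    internal: List[Optional["Snapshot"]] = field(default_factory=list)
    external: List[Optional["Snapshot"]] = field(default_factory=list)


def snapshot_level(snap: Optional[Snapshot]) -> int:
    """Return 0 (no snapshot), 1, 2, or 3 following the level definitions."""
    if snap is None:
        return 0

    def ok(resource: Optional[Snapshot], minimum: int) -> bool:
        # A referenced resource qualifies if it was captured at the same
        # snaptime and itself reaches the required minimum level.
        return (resource is not None
                and resource.snaptime == snap.snaptime
                and snapshot_level(resource) >= minimum)

    if not all(ok(r, 1) for r in snap.internal):
        return 1          # only the document code is preserved
    if not all(ok(r, 2) for r in snap.external):
        return 2          # internal resources preserved, external ones not
    return 3


css = Snapshot("style.css", snaptime=2)
feed_old = Snapshot("partner-feed", snaptime=1)    # captured at another time
page = Snapshot("P3", snaptime=2, internal=[css], external=[feed_old])
print(snapshot_level(page))
```

In this sketch a resource that references nothing satisfies every requirement vacuously, so a standalone file such as `style.css` counts as level 3. That is a modeling choice of the sketch, not a statement from the source.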

Enforcing Environment Factors

• (1) Plus 0: If neither environment factor is enforced.

• (2) Plus 1: If the host variables are reset to the snapshot time.

• (3) Plus 2: If web technologies are compatible with the technologies at the snapshot time.

• (4) Plus 3: If both factors are enforced.
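
The four cases above amount to a function of two yes/no conditions. The sketch below encodes them in Python. The function names are hypothetical, and the combined "Level n Plus m" label is an assumption about how a content level and an environment suffix might be written together.

```python
def plus_code(host_vars_reset: bool, tech_compatible: bool) -> int:
    """Map the two environment factors to the Plus 0..3 suffix."""
    if host_vars_reset and tech_compatible:
        return 3          # both factors enforced
    if tech_compatible:
        return 2          # only the web technologies match the snapshot time
    if host_vars_reset:
        return 1          # only the host variables are reset
    return 0              # neither factor enforced


def label(level: int, host_vars_reset: bool, tech_compatible: bool) -> str:
    # Pairing a content level with a Plus suffix is an assumed notation.
    return f"Level {level} Plus {plus_code(host_vars_reset, tech_compatible)}"


print(label(2, host_vars_reset=True, tech_compatible=False))
```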

Possible Levels of Snapshot States

Schemes for Tracking Changes

• Scheme for tracking website structure changes and web page code changes:
  – A logging and archiving scheme

• Scheme for tracking content database changes.

Design of a Logging and Archiving Scheme for Tracking Website Changes

• The log, named TemporalURLLog, has five fields: URL, PublishDate, DocExpireDate, URLExpireDate, and NewURL.

• Archived documents are saved in the Archive using URL + PublishDate as the file name.
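
As an illustration, one log row and its archive file name could be modeled as follows. This is a minimal sketch: the Python field names mirror the five log fields, `None` stands for Null, and the literal "+" separator in the file name is an assumption, since the source only states that the name is built from URL and PublishDate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogEntry:
    """One row of TemporalURLLog; None stands for Null."""
    url: str
    publish_date: str
    doc_expire_date: Optional[str] = None
    url_expire_date: Optional[str] = None
    new_url: Optional[str] = None


def archive_name(entry: LogEntry) -> str:
    # Assumed naming convention: URL and PublishDate joined by "+".
    return f"{entry.url}+{entry.publish_date}"


entry = LogEntry(url="P3", publish_date="T2", doc_expire_date="T3")
print(archive_name(entry))
```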

Impacts of Website Changes to Historical Links and Archive

| Time | Website Changes | Current Web Pages | Historical Links Generated | Snapshots in Archive |
|------|-----------------|-------------------|----------------------------|----------------------|
| T0 | | P1, P2, P3 | None | None |
| T1 | P1 renamed to P4; P5 is added | P2, P3, P4, P5 | P1+T0 | |
| T2 | P2 is deleted; P3 is modified | P3, P4, P5 | P2+T0, P3+T0 | P2+T0, P3+T0 |
| T3 | P3, P4, P5 are modified; P1, P6 are added | P1, P3, P4, P5, P6 | P3+T2, P4+T1, P5+T1 | P3+T2, P4+T1, P5+T1 |
| T4 | P3 is deleted; P4 is renamed to P8; P5 is renamed to P7; a new page P3 is added | P1, P3, P6, P7, P8 | P3+T3, P4+T3, P5+T3 | P3+T3 |

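The contents of TemporalURLLog

| URL | PublishDate | DocExpireDate | URLExpireDate | NewURL |
|-----|-------------|---------------|---------------|--------|
| P1 | T0 | Null | T1 | P4 |
| P2 | T0 | T2 | T2 | Null |
| P3 | T0 | T2 | Null | Null |
| P4 | T1 | T3 | Null | Null |
| P4 | T3 | Null | T4 | P8 |
| P5 | T1 | T3 | Null | Null |
| P3 | T2 | T3 | Null | Null |
| P3 | T3 | T4 | T4 | Null |
| P5 | T3 | Null | T4 | P7 |
| P6 | T3 | Null | Null | Null |
| P1 | T3 | Null | Null | Null |
| P8 | T4 | Null | Null | Null |
| P7 | T4 | Null | Null | Null |
| P3 | T4 | Null | Null | Null |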
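
The log contents above can be reproduced by replaying the website changes from T0 to T4. The sketch below is one possible reading of the scheme, not the author's implementation: the helper names (`publish`, `modify`, `delete`, `rename`) are hypothetical, `None` stands for Null, and the rules are inferred from the rows themselves. A modification or deletion expires the document and archives it, while a rename expires only the URL.

```python
from typing import List, Optional

# Each row: [URL, PublishDate, DocExpireDate, URLExpireDate, NewURL]
Row = List[Optional[str]]
log: List[Row] = []
archive: List[str] = []


def current(url: str) -> Row:
    """Return the live row for a URL (document and URL both unexpired)."""
    return next(r for r in log if r[0] == url and r[2] is None and r[3] is None)


def publish(url: str, t: str) -> None:
    log.append([url, t, None, None, None])


def expire_doc(url: str, t: str) -> Row:
    """Expire the current document and archive it as URL+PublishDate."""
    row = current(url)
    row[2] = t
    archive.append(f"{row[0]}+{row[1]}")
    return row


def modify(url: str, t: str) -> None:
    expire_doc(url, t)
    publish(url, t)          # the new version starts a new row


def delete(url: str, t: str) -> None:
    expire_doc(url, t)[3] = t    # the URL expires together with the document


def rename(url: str, new_url: str, t: str) -> None:
    row = current(url)
    row[3], row[4] = t, new_url  # document unchanged, so nothing is archived
    publish(new_url, t)


for page in ("P1", "P2", "P3"):
    publish(page, "T0")
rename("P1", "P4", "T1")
publish("P5", "T1")
delete("P2", "T2")
modify("P3", "T2")
for page in ("P3", "P4", "P5"):
    modify(page, "T3")
publish("P1", "T3")
publish("P6", "T3")
delete("P3", "T4")
rename("P4", "P8", "T4")
rename("P5", "P7", "T4")
publish("P3", "T4")

for row in log:
    print(row)
print(archive)
```

The replay produces the same rows as the log listing, in a different order, and the archive list matches the "Snapshots in Archive" entries of the impact table.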

Examples of Using the Log

• Retrieve a snapshot of a current web page:

• Retrieve a deleted page:

• Retrieve the snapshot of a deleted web page:
  – The snapshot of P3 at T2 is in the Archive: P3+T2.

• Retrieve the current web page of an outdated URL:
  – The old URL P5 has been renamed to P7. If users submit a request for P5, it can be traced to P7.

• Retrieve the web page previously associated with a current link:
  – The historical link P1 was renamed to P4 and later to P8, and the current link P1 points to a new web page. If the current web page associated with P1 is not what the users need, the request can be redirected to P8.

• Determine whether an invalid URL ever existed:
  – The URL P12 has never existed in the website.
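
These lookups can be sketched as small queries over the log. The code below is illustrative only: it hard-codes the TemporalURLLog rows with T0..T4 written as the integers 0..4 so that times can be compared, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

# (URL, PublishDate, DocExpireDate, URLExpireDate, NewURL); T0..T4 as 0..4
Row = Tuple[str, int, Optional[int], Optional[int], Optional[str]]
LOG: List[Row] = [
    ("P1", 0, None, 1, "P4"), ("P2", 0, 2, 2, None), ("P3", 0, 2, None, None),
    ("P4", 1, 3, None, None), ("P4", 3, None, 4, "P8"), ("P5", 1, 3, None, None),
    ("P3", 2, 3, None, None), ("P3", 3, 4, 4, None), ("P5", 3, None, 4, "P7"),
    ("P6", 3, None, None, None), ("P1", 3, None, None, None),
    ("P8", 4, None, None, None), ("P7", 4, None, None, None),
    ("P3", 4, None, None, None),
]


def ever_existed(url: str) -> bool:
    """A URL existed at some point if it appears in any log row."""
    return any(row[0] == url for row in LOG)


def snapshot_name(url: str, t: int) -> Optional[str]:
    """Archive name of the expired version of `url` that was live at time t."""
    for u, pub, doc_exp, _, _ in LOG:
        if u == url and doc_exp is not None and pub <= t < doc_exp:
            return f"{u}+T{pub}"
    return None   # the version live at t is not an archived (expired) document


def current_url(url: str, since: int) -> Optional[str]:
    """Follow NewURL links from `url` as it stood at time `since`."""
    t = since
    while True:
        renames = [r for r in LOG
                   if r[0] == url and r[4] is not None
                   and r[3] is not None and r[3] > t]
        if not renames:
            live = any(r[0] == url and r[2] is None and r[3] is None for r in LOG)
            return url if live else None
        first = min(renames, key=lambda r: r[3])
        url, t = first[4], first[3]


print(snapshot_name("P3", 2))     # snapshot of P3 at T2
print(current_url("P5", 1))       # outdated URL P5 traced forward
print(current_url("P1", 0))       # historical link P1 traced through P4
print(ever_existed("P12"))
```

Passing the time at which the user knew the link (`since`) is what separates the historical P1, which now lives at P8, from the page published under P1 at T3.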

Tracking Changes to Content Databases

• A web page may use content databases:
  – (1) as a source for querying.
  – (2) as storage for contents of placeholders on a web page.

Database Snapshot Management

• Defining snapshots:
    CREATE SNAPSHOT snapshotname
    AS query
    AS OF snaptime

• Refreshing snapshots:
    REFRESH SNAPSHOT snapshotname
    AS OF new snaptime
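
The two statements above are written in the slide's own notation and are not assumed to run on any particular database product. As a rough illustration of the idea, the sketch below emulates "AS OF" in Python's built-in `sqlite3` module by keeping a history table with validity intervals and materializing the rows that were current at the snaptime. The table names, column names, and the `create_snapshot` helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_history (
        id INTEGER,
        price REAL,
        valid_from INTEGER,   -- time the row version became current
        valid_to INTEGER      -- NULL while the version is still current
    );
    INSERT INTO product_history VALUES
        (1, 10.0, 0, 2),
        (1, 12.0, 2, NULL),
        (2, 5.0, 1, NULL);
""")


def create_snapshot(name: str, snaptime: int) -> None:
    """Materialize the rows that were current at snaptime into table `name`."""
    # `name` is interpolated into the SQL text, so it must come from
    # trusted code, never from user input.
    conn.execute(f"DROP TABLE IF EXISTS {name}")
    conn.execute(f"CREATE TABLE {name} (id INTEGER, price REAL)")
    conn.execute(
        f"INSERT INTO {name} SELECT id, price FROM product_history "
        "WHERE valid_from <= ? AND (valid_to IS NULL OR valid_to > ?)",
        (snaptime, snaptime),
    )


def read_snapshot(name: str) -> list:
    return conn.execute(f"SELECT id, price FROM {name} ORDER BY id").fetchall()


create_snapshot("price_snap", 1)      # define the snapshot as of time 1
as_of_1 = read_snapshot("price_snap")
create_snapshot("price_snap", 3)      # refreshing = rebuilding at a new snaptime
as_of_3 = read_snapshot("price_snap")
print(as_of_1, as_of_3)
```

Rebuilding the table on refresh is the simplest choice for a sketch. A real snapshot management system would more likely apply only the changes made since the previous snaptime.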

Issues in Tracking Changes to Content Databases

• The content databases may exist in many formats:
  – XML, delimited text files, etc.
  – Not all content databases are supported by a snapshot management system.

• The website may not have authority over the management of the content databases.

• A web page may retrieve data from many databases.

• There is no single way to design content databases.

Tracking Content Database Changes Using Log – An Example

• Assuming:
  – One content database supports many web pages.
  – Each page contains many placeholders.

• Log design:
  – PageID + PlaceHolderID + Content + Update Flag + Time Stamp
    • PageID is (URL + page publish time)
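
Under this log design, the placeholder contents of a page at a given time can be rebuilt by replaying the log rows up to that time. The sketch below shows one way to do so. The flag values "I", "U", and "D" (insert, update, delete), the sample rows, and the function name are assumptions made for illustration, since the source does not list the flag values.

```python
from typing import Dict, List, Tuple

# (PageID, PlaceHolderID, Content, UpdateFlag, TimeStamp)
# PageID is (URL, page publish time); T3, T4, ... are written as 3, 4, ...
PageID = Tuple[str, int]
LogRow = Tuple[PageID, str, str, str, int]

content_log: List[LogRow] = [
    (("P6", 3), "headline", "Spring sale", "I", 3),
    (("P6", 3), "banner", "Free shipping", "I", 3),
    (("P6", 3), "headline", "Summer sale", "U", 4),
    (("P6", 3), "banner", "", "D", 5),
]


def placeholders_at(page_id: PageID, t: int) -> Dict[str, str]:
    """Replay the log up to time t to rebuild a page's placeholder contents."""
    state: Dict[str, str] = {}
    for pid, holder, content, flag, stamp in sorted(content_log, key=lambda r: r[4]):
        if pid != page_id or stamp > t:
            continue
        if flag == "D":
            state.pop(holder, None)   # placeholder content removed
        else:
            state[holder] = content   # insert or update
    return state


print(placeholders_at(("P6", 3), 3))
print(placeholders_at(("P6", 3), 4))
print(placeholders_at(("P6", 3), 5))
```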

Working with the TemporalURLLog

• Because a web page’s URL may change, the content database log needs the support of the TemporalURLLog to track URL changes.

• Example:

Delivering Historical Resources to Users

• A website consists of:
  – (1) a current website where current web pages are published.
  – (2) a historical website where historical resources are stored and accessed.

• A typical web server serves requests for current web pages only and is inadequate to serve a request for historical information.

The Design of a Web Page Snapshot Management System

Summary

• We developed a scheme to track changes to website structure, web pages and files referenced by web pages, and a second scheme to track changes to content databases so that the website is capable of creating Level 2 snapshots.