The Management of a Website’s Historical Resources
David Chao
College of Business
San Francisco State University
Introduction
• An organization’s websites change constantly to reflect its dynamic environment. These changes affect website structure, content, and the supporting technologies.
Types of Change
• Website structure:
  – Causes web pages’ URLs to change
• Website content:
  – Changes to web pages:
    • Insertions, deletions, modifications
  – Changes to content databases
• Technology
What are a website’s historical resources?
• Outdated URLs
• Outdated web pages:
  – Web page snapshots
  – Content database snapshots
  – Deleted web pages
• Replaced technologies
The Objective of Managing Historical Resources
• The major objective of managing historical resources is to satisfy users’ needs for historical information by enabling the website to recreate or retrieve web page snapshots.
  – A web page snapshot is the state of a web page at a specific point in time.
Factors Affecting The State Of A Web Page
• Content factors:
  – Web page code
  – The state of internal resources it references:
    • Images, style sheets, components, script files, databases, etc.
  – The state of external resources it references:
    • External resources are files that are not managed by the website but can be referenced in creating the website’s contents.
• Environment factors:
  – Website host environment variables:
    • System clock
  – Web technologies implemented on the server side as well as on the client side
Levels of Web Page Snapshot
• Level 1 snapshot: A level 1 snapshot (web document snapshot) is the state of the web document code at snaptime.
  – Creating level 1 snapshots enables a website to trace the changes to the web document code over time.
• Level 2 snapshot: A level 2 snapshot is a level 1 snapshot with the additional requirement that all the internal resources it references are at least level 1 snapshots at the same snaptime.
  – Referencing database snapshots
• Level 3 snapshot: A level 3 snapshot is a level 2 snapshot with the additional requirement that all the external resources it references are at least level 2 snapshots at the snaptime.
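The three levels can be read as a recursive check over a document and the resources it references. The following is a minimal sketch of that reading, not a definitive implementation. The `Resource` class, its field names, and the convention that level 0 means "no snapshot at this snaptime" are assumptions made for illustration, and the sketch assumes references do not form cycles.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Resource:
    """A web document or a resource it references (hypothetical model)."""
    name: str
    captured_at: Optional[str] = None  # snaptime at which its code/state was captured
    internal: List["Resource"] = field(default_factory=list)  # resources managed by the website
    external: List["Resource"] = field(default_factory=list)  # resources not managed by the website


def snapshot_level(res: Resource, snaptime: str) -> int:
    """Return 0 (no snapshot), 1, 2, or 3 for `res` at `snaptime`."""
    # Level 1: the document's own code was captured at snaptime.
    if res.captured_at != snaptime:
        return 0
    # Level 2: every internal resource is at least a level 1 snapshot.
    if not all(snapshot_level(r, snaptime) >= 1 for r in res.internal):
        return 1
    # Level 3: every external resource is at least a level 2 snapshot.
    if not all(snapshot_level(r, snaptime) >= 2 for r in res.external):
        return 2
    return 3
```

A resource that references nothing and was captured at the snaptime satisfies all three conditions vacuously, so the sketch reports it as level 3.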
Enforcing Environment Factors Page
• (1) Plus 0: If neither environment factor is enforced.
• (2) Plus 1: If the host variables are reset to the snapshot time.
• (3) Plus 2: If web technologies are compatible with the technologies at the snapshot time.
• (4) Plus 3: If both factors are enforced.
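The four "Plus" grades depend only on which of the two environment factors are enforced. A small sketch of that mapping follows. The function name and parameter names are assumptions introduced here, not part of the original scheme.

```python
def plus_grade(host_variables_reset: bool, technologies_compatible: bool) -> int:
    """Map the two environment factors to the Plus 0..3 grade.

    host_variables_reset: host variables (e.g. system clock) are reset to the snapshot time.
    technologies_compatible: web technologies are compatible with those at the snapshot time.
    """
    if host_variables_reset and technologies_compatible:
        return 3  # both factors enforced
    if technologies_compatible:
        return 2  # only technology compatibility enforced
    if host_variables_reset:
        return 1  # only host variables enforced
    return 0      # neither factor enforced
```

Combined with a snapshot level, this gives labels such as "level 2, Plus 1" for a level 2 snapshot replayed with only the host variables reset.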
Schemes for Tracking Changes
• Scheme for tracking website structure changes and web page code changes:
  – A logging and archiving scheme
• Scheme for tracking content database changes
Design of a Logging and Archiving Scheme for Tracking Website Changes
• The log, named TemporalURLLog, has five fields: URL, PublishDate, DocExpireDate, URLExpireDate, and NewURL.
• Archived documents are saved in the Archive using URL + PublishDate as the file name.
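One way to picture a TemporalURLLog record and its archive file name is sketched below. The five fields come from the log design. The class name, the use of `None` for Null, the "+" separator in the file name, and the rule that a record has an archived copy once DocExpireDate is set are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TemporalURLLogEntry:
    url: str
    publish_date: str
    doc_expire_date: Optional[str] = None  # set when the document is modified or deleted
    url_expire_date: Optional[str] = None  # set when the URL is renamed or deleted
    new_url: Optional[str] = None          # set when the URL is renamed


def archive_file_name(entry: TemporalURLLogEntry) -> str:
    """File name of the archived document: URL + PublishDate."""
    return f"{entry.url}+{entry.publish_date}"


def has_archived_copy(entry: TemporalURLLogEntry) -> bool:
    """Assumption: a document version is archived once it has expired."""
    return entry.doc_expire_date is not None
```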
Impacts of Website Changes on Historical Links and Archive
Time | Website Changes | Current Web Pages | Historical Links Generated | Snapshots in Archive
T0 | | P1, P2, P3 | None | None
T1 | P1 renamed to P4; P5 is added | P2, P3, P4, P5 | P1+T0 |
T2 | P2 is deleted; P3 is modified | P3, P4, P5 | P2+T0, P3+T0 | P2+T0, P3+T0
T3 | P3, P4, P5 are modified; P1, P6 are added | P1, P3, P4, P5, P6 | P3+T2, P4+T1, P5+T1 | P3+T2, P4+T1, P5+T1
T4 | P3 is deleted; P4 is renamed to P8; P5 is renamed to P7; a new page P3 is added | P1, P3, P6, P7, P8 | P3+T3, P4+T3, P5+T3 | P3+T3
The contents of TemporalURLLog
URL PublishDate DocExpireDate URLExpireDate NewURL
P1 T0 Null T1 P4
P2 T0 T2 T2 Null
P3 T0 T2 Null Null
P4 T1 T3 Null Null
P4 T3 Null T4 P8
P5 T1 T3 Null Null
P3 T2 T3 Null Null
P3 T3 T4 T4 Null
P5 T3 Null T4 P7
P6 T3 Null Null Null
P1 T3 Null Null Null
P8 T4 Null Null Null
P7 T4 Null Null Null
P3 T4 Null Null Null
Examples of Using the Log
• Retrieve a snapshot of a current web page:
• Retrieve a deleted page:
• Retrieve the snapshot of a deleted web page:
  – The snapshot of P3 at T2 is in the Archive: P3+T2.
• Retrieve the current web page of an outdated URL:
  – An old URL P5 has been renamed to P7. If users submit a request for P5, it can be traced to P7.
• Retrieve the web page previously associated with a current link:
  – The page behind the historical link P1 has since been renamed and is now at P8, while the current link P1 points to a new web page. If the current web page associated with P1 is not what the users need, the request can be redirected to P8.
• Determine whether an invalid URL ever existed:
  – A URL P12 has never existed in the website.
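The lookups above can be sketched as plain functions over the log rows. This is an illustrative sketch under stated assumptions, not the author's implementation. The function names are hypothetical, times T0..T4 are written as the integers 0..4, Null is `None`, and a URL that expires without a NewURL is treated as a deleted page.

```python
from typing import List, NamedTuple, Optional


class LogEntry(NamedTuple):
    url: str
    publish: int
    doc_expire: Optional[int]
    url_expire: Optional[int]
    new_url: Optional[str]


# Rows of TemporalURLLog, with T0..T4 written as 0..4.
LOG: List[LogEntry] = [
    LogEntry("P1", 0, None, 1, "P4"), LogEntry("P2", 0, 2, 2, None),
    LogEntry("P3", 0, 2, None, None), LogEntry("P4", 1, 3, None, None),
    LogEntry("P4", 3, None, 4, "P8"), LogEntry("P5", 1, 3, None, None),
    LogEntry("P3", 2, 3, None, None), LogEntry("P3", 3, 4, 4, None),
    LogEntry("P5", 3, None, 4, "P7"), LogEntry("P6", 3, None, None, None),
    LogEntry("P1", 3, None, None, None), LogEntry("P8", 4, None, None, None),
    LogEntry("P7", 4, None, None, None), LogEntry("P3", 4, None, None, None),
]


def ever_existed(url: str) -> bool:
    """True if the URL appears anywhere in the log."""
    return any(e.url == url for e in LOG)


def archived_snapshot(url: str, time: int) -> Optional[str]:
    """Archive file name of the expired version of `url` that was live at `time`."""
    for e in LOG:
        if e.url == url and e.doc_expire is not None and e.publish <= time < e.doc_expire:
            return f"{e.url}+T{e.publish}"
    return None  # no archived version covers that time


def resolve_current(url: str, since: int) -> Optional[str]:
    """Follow renames of the page that lived at `url` at time `since`.

    Returns its current URL, or None if the URL never existed or the page was deleted.
    """
    if not ever_existed(url):
        return None
    current, time = url, since
    while True:
        # URL-expiry events for this URL that happen after `time`.
        expiries = [e for e in LOG
                    if e.url == current and e.url_expire is not None and e.url_expire > time]
        if not expiries:
            return current            # URL is still live
        first = min(expiries, key=lambda e: e.url_expire)
        if first.new_url is None:
            return None               # URL expired without a rename: page deleted
        current, time = first.new_url, first.url_expire
```

Under these assumptions, `resolve_current("P5", 1)` traces P5 to P7, `resolve_current("P1", 0)` follows P1 through P4 to P8, and `ever_existed("P12")` is false.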
Tracking Changes to Content Databases
• A web page may use content databases:
  – (1) as a source for querying.
  – (2) as storage for the contents of placeholders on a web page.
Database Snapshot Management
• Defining snapshots:
    CREATE SNAPSHOT snapshotname
    AS query
    AS OF snaptime
• Refreshing snapshots:
    REFRESH SNAPSHOT snapshotname
    AS OF new snaptime
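Many database engines do not offer these statements directly. The sketch below emulates their intent in Python with the standard `sqlite3` module: it stores the defining query and re-runs it with a new snaptime on refresh. The `SnapshotStore` class, its method names, and the convention that the query filters on a `:snaptime` named parameter are all assumptions for illustration. It presumes the underlying table keeps versioned rows that such a query can select from.

```python
import sqlite3
from typing import Dict, List, Tuple


class SnapshotStore:
    """Hypothetical in-memory emulation of CREATE SNAPSHOT / REFRESH SNAPSHOT."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self._queries: Dict[str, str] = {}                  # snapshot name -> defining query
        self._data: Dict[str, Tuple[object, List[tuple]]] = {}  # name -> (snaptime, rows)

    def create_snapshot(self, name: str, query: str, snaptime: object) -> None:
        """CREATE SNAPSHOT name AS query AS OF snaptime."""
        rows = self.conn.execute(query, {"snaptime": snaptime}).fetchall()
        self._queries[name] = query
        self._data[name] = (snaptime, rows)

    def refresh_snapshot(self, name: str, new_snaptime: object) -> None:
        """REFRESH SNAPSHOT name AS OF new_snaptime: re-run the stored query."""
        self.create_snapshot(name, self._queries[name], new_snaptime)

    def rows(self, name: str) -> List[tuple]:
        """Rows captured by the snapshot."""
        return self._data[name][1]
```

Keeping the defining query alongside the rows is what makes a refresh possible without the caller restating the query.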
Issues in Tracking Changes to Content Databases
• The content databases may exist in many formats:
  – XML, delimited text files, etc.
  – Not all content databases are supported by a snapshot management system.
• The website may not have authority over the management of the content databases.
• A web page may retrieve data from many databases.
• There is no single way to design content databases.
Tracking Content Database Changes Using Log – An Example
• Assuming:
  – One content database supports many web pages.
  – Each page contains many placeholders.
• Log design:
  – PageID + PlaceHolderID + Content + Update Flag + Time Stamp
    • PageID is (URL + Page publish time)
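With a log of this shape, the placeholder contents of a page at a given time can be rebuilt by replaying the entries in time order. The sketch below illustrates the idea. The field names follow the log design, but the meaning of the update flag ("I" insert, "U" update, "D" delete), the integer timestamps, and the function name are assumptions, since the original does not define them.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class PlaceholderLogEntry(NamedTuple):
    page_id: Tuple[str, int]  # (URL, page publish time)
    placeholder_id: str
    content: str
    update_flag: str          # assumed values: "I" insert, "U" update, "D" delete
    timestamp: int


def placeholders_at(log: Iterable[PlaceholderLogEntry],
                    page_id: Tuple[str, int], time: int) -> Dict[str, str]:
    """Replay the log up to `time` and return placeholder contents for one page."""
    state: Dict[str, str] = {}
    for e in sorted(log, key=lambda entry: entry.timestamp):
        if e.page_id != page_id or e.timestamp > time:
            continue
        if e.update_flag == "D":
            state.pop(e.placeholder_id, None)      # placeholder content removed
        else:
            state[e.placeholder_id] = e.content    # insert or update
    return state
```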
Working with the TemporalURLLog
• Because a web page’s URL may change, the content database log needs the support of the TemporalURLLog to track URL changes.
• Example:
Delivering Historical Resources to Users
• A website consists of:
  – (1) a current website where current web pages are published.
  – (2) a historical website where historical resources are stored and accessed.
• A typical web server serves requests for current web pages only and is inadequate to serve a request for historical information.