Archiving Web Content
CE Course #285
Sheraton NY Hotel & Towers
Sunday, June 8, 2003

Introductions

Meet today’s panelists

Today’s Panelists

Barry Abisch - The Journal News
Olivia Kobelt - Christian Science Monitor
Mark Stencel - Washingtonpost.com
Janine Yagielski - CNN.com

Agenda

Introductions & session overview
Technology
Workflow and processes
Brainstorming
Break
Brainstorming recap
Role of the librarian
Building the business case
Closing comments & session evaluation

Technology

Panelist: Janine Yagielski

Technology Overview

What can be archived?
Preparing content to be archived
Storing and serving archived content
Searching archived content

What can be archived?

Overview of file formats (handout)
Dynamic and static content
Archiving presentation as well as content
Archiving secondary information about online content (traffic information)
Challenges of changing technologies

Overview of File Formats (handout)

Text formats
Image/graphic file formats
Video formats
Other definitions

Static and Dynamic Content

Static Content: Content that, once posted, does not change.

Example: Simple story or information page

Dynamic Content: Constantly changing content

Example 1: Weather data, Stock Prices

Example 2: Election Results, Sports Scores (fixed end point)

Static and Dynamic Content

Hybrid: Changes occasionally but does not have a predictable updating schedule or end point

Example: Top story with multiple and significant updates

Example: Home page or section page

Archiving Presentation and Content

CNN.com has built an internal system to archive some presentation

Home Page, US, World, Politics, International Edition

One week of pages
Every 30 minutes
Perl script
Size of archive: 55.4 MB
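The schedule on this slide (five pages, one snapshot every 30 minutes, one week kept) can be sketched in a few lines. This is an illustrative Python sketch, not the Perl script the slide refers to; the page labels, file-naming scheme, and function names are all assumptions.

```python
from datetime import datetime, timedelta

# Figures from the slide: five pages, a snapshot every 30 minutes, kept one week.
PAGES = ["home", "us", "world", "politics", "international-edition"]  # hypothetical labels
INTERVAL = timedelta(minutes=30)
RETENTION = timedelta(days=7)


def snapshot_name(page: str, taken_at: datetime) -> str:
    """Build a sortable file name such as 'world-20030608-0930.html' (assumed scheme)."""
    return f"{page}-{taken_at:%Y%m%d-%H%M}.html"


def is_expired(taken_at: datetime, now: datetime) -> bool:
    """A snapshot is purged once it is older than the retention window."""
    return now - taken_at > RETENTION


def snapshots_retained() -> int:
    """Number of files on disk once the one-week window is full."""
    per_page = RETENTION // INTERVAL  # 336 snapshots per page
    return per_page * len(PAGES)
```

With these figures the window holds 1,680 files. If the 55.4 MB covers exactly those files (the slide does not say), that is roughly 33 KB per snapshot.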

Archiving Secondary Information about Online Content (traffic)

CNN.com has an extensive Webstats reporting system that parses and archives information from Web server logs.

Simple statistics: Page Views, hits (back to 1996)

Advanced statistics: Unique users, time spent, IP address, OS, browser

Real Time Monitor: tracks click-through rates of links
Home and US pages
One week of info on links
Tracks average and peak for links
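As an illustration of the difference between hits and page views, the sketch below counts both from Web server log lines. It assumes Common Log Format and a simple rule for what counts as a "page"; neither is stated on the slide, and the Webstats system itself is not shown here. Counting distinct hosts is only a rough stand-in for unique users.

```python
import re
from collections import Counter

# Assumed format: host ident user [time] "METHOD path PROTO" status bytes
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)[^"]*" (\d{3}) \S+'
)
PAGE_SUFFIXES = (".html", ".htm", ".shtml", "/")  # assumption: what counts as a page


def summarize(lines):
    """Return (hits, page_views_by_path, distinct_hosts) for an iterable of log lines."""
    hits = 0
    views = Counter()
    hosts = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip malformed lines
        host, path, status = match.groups()
        hits += 1  # every logged request is a hit
        hosts.add(host)
        page = path.split("?")[0]
        if status == "200" and page.endswith(PAGE_SUFFIXES):
            views[page] += 1  # only successful page requests count as views
    return hits, views, len(hosts)
```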

Challenges of Changing Technology

Interdependencies of the Web make it difficult to maintain old content when optimizing for new content.

Examples: .shtml pages, Vivo video, some Shockwave, other antiquated multimedia technology based on plug-ins

Preparing Content to be Archived

Directory Structure/Database
Key to consistency and automation in subject-specific archives.
Example: cnn.com/2003/WORLD/meast/06/02/sprj.nitop.political.council/

Slug conventions
Provide an additional method of automating archiving.
Examples: sprj; sprj.nitop; .ap
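A consistent directory structure lets a script recover the date, section, and slug from the URL alone. The Python sketch below assumes the path layout seen in the example URL (year, section, subsection, month, day, slug); the class and function names are hypothetical, and the slug check is a plain prefix/suffix match rather than CNN.com's actual rule.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class StoryPath:
    published: date
    section: str
    subsection: str
    slug: str


def parse_story_path(url: str) -> StoryPath:
    """Split 'host/YYYY/SECTION/subsection/MM/DD/slug/' into its parts."""
    without_scheme = url.split("://", 1)[-1]
    parts = [p for p in without_scheme.split("/") if p]
    # parts[0] is the host; the next six fields follow the example URL's layout.
    year, section, subsection, month, day, slug = parts[1:7]
    return StoryPath(date(int(year), int(month), int(day)), section, subsection, slug)


def slug_matches(slug: str, prefix: str = "", suffix: str = "") -> bool:
    """Check a slug against a convention such as prefix 'sprj.nitop' or suffix '.ap'."""
    return slug.startswith(prefix) and slug.endswith(suffix)
```

For example, `parse_story_path("cnn.com/2003/WORLD/meast/06/02/sprj.nitop.political.council/")` yields a story dated 2003-06-02 in section WORLD, subsection meast, which an archiving job could then file by subject.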

Preparing Content to be Archived

Content Management System
Imposes and uses directory structure to prepare content for publication, syndication and in some cases archiving and searching

Metadata in stories on publish
<meta name="DESCRIPTION" content="A U.S. soldier was killed and five were wounded early Thursday in the Iraqi city of Fallujah, the U.S. Central Command announced -- the latest casualties in the city, which has become a center of resistance.">
<meta name="AUTHOR" content="">
<meta name="SECTION" content="WORLD">
<meta name="SUBSECTION" content="meast">
<meta name="DATE" content="2003-06-05 05:22:20">
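Meta tags like these can be read back out of a published page when building an archive index. The sketch below uses Python's standard `html.parser` module; the `MetaCollector` class and `extract_meta` function are hypothetical names for illustration, not part of any CNN.com tool.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        fields = dict(attrs)
        name = fields.get("name")
        if name:
            # Normalize the key to upper case; an empty content stays an empty string.
            self.meta[name.upper()] = fields.get("content") or ""


def extract_meta(html: str) -> dict:
    """Return the name/content pairs found in the page's meta tags."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.meta
```

Run on the tags above, this would give `SECTION` as `WORLD`, `SUBSECTION` as `meast`, and an empty `AUTHOR`, which is enough to file a story by section and date without opening the body text.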

Preparing Content to be Archived

XML (Extensible Markup Language)
CNN.com produces an XML file with every story for site search. We also produce XML feeds of story headlines and other data sent to syndication partners.

Metadata and XML for Multimedia
CNN.com is looking into ways to insert metadata and produce XML feeds of non-traditional stories. Currently there is only an internal, manual process for archiving the location and subject of interactive (pop-up) content.
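To make the idea of one XML file per story concrete, here is a minimal Python sketch using the standard `xml.etree.ElementTree` module. The element names and the choice of fields are assumptions for illustration; the slides do not describe CNN.com's actual XML schema.

```python
import xml.etree.ElementTree as ET


def story_to_xml(headline: str, url: str, section: str, published: str) -> str:
    """Serialize one story's search fields as a small XML document."""
    story = ET.Element("story")  # element names here are hypothetical
    ET.SubElement(story, "headline").text = headline
    ET.SubElement(story, "url").text = url
    ET.SubElement(story, "section").text = section
    ET.SubElement(story, "published").text = published
    # ElementTree escapes characters such as & and < in text automatically.
    return ET.tostring(story, encoding="unicode")
```

A search service or syndication partner could then pick up these files on a schedule and parse them with any XML library.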

Storing and Serving Archived Content

Simple storage of content
Content servers
Burn to CD
Web servers (internal and external even if not served)
Tape backup

Serving to internal users
Image query
Directory browsing on the inside Web servers
Content purged from the outside site is still available (AP, partner stories)
Limited space on internal Web server (36 GB)

Serving to All External Users

All unique URLs published on CNN.com from the launch of the site are still available, unless there was an editorial decision to remove or redirect a URL.

CNN video is hosted by AOL. Because of changes in hosting and in the capacity of video servers, not all previous video streams are available.

Serving to All External Users

Web servers/NFS Server

Hardware: Sun and Intel (running Linux)

Cost: $10,000-$15,000 (Sun), $5,000 (Intel)

Capacity: Storage capacity expanded by adding additional hard drives. Serving capacity varies by content.
HTML: 25K hits/minute
Images, style sheets: 60-70K hits/minute

Video Servers

Hardware: Reconfigured, video-dedicated Web server

Cost: $1,500-$3,000

Capacity: Depends on length and size of video and disk space

Serving to Select Users

Registration
E-mail newsletters
New e-mail alerts
Backend Oracle database
JSPs dynamically served

Subscription Video
Real Networks handles CNN.com’s subscriber authentication

Searching Archived Content

Searching for internal users

Limited functionality for internal materials. Graphics image search. New publishing tools will incorporate a search of external content.

Searching for external users

Site Search: Run by AOL. CMS produces and publishes (restricted by IP) XML files for every story. At set intervals AOL picks up the XML files and uses them to produce CNN.com’s internal search results.

Web Search: Powered by Google. Sponsored links from Overture. Both sets of results are returned to CNN.com in XML feeds and published on a CNN.com template.

Video/multimedia search: Exploring

Workflow

Panelist: Olivia Kobelt

Workflow Overview

Types of web content – what do we archive?
Archiving old content
Internal vs. external archive
Making corrections/fixes
Searchability
Current workflow
Systems we use
Future vision

Brainstorming!

Break!

Be back in 15 minutes!

Brainstorming Recap!

Legal compliance vs. business user or need
Copyright – can you archive someone else’s content, partner content?
Talking to IT about what the requirements are
How do you approach gathering user requirements?
Who are users?
What are retention criteria? (date, size of files, originals/drafts/versioning, exclude search, business value)
Hierarchy starting at bottom with knowledge, corporate, business use/reuse, compliance, vital records
How to capture and keep the hybrid web pages?
What software applications are available?
Microfilm archiving?
What tools are available to automate the archival process?
Where do we begin? Seeking advice in relation to storage, retrieval, technology, etc.
What type of information/literature is available on the topic of archiving web data?
Selling the idea to management
Archiving “how it looked”
How did we do it? Examples of how a project was done.
Measure what people are trying to find in older files
Managing the customer service side of it

Role of the Librarian

Panelist: Barry Abisch

Librarian Role Overview

You are the expert.
What do you need? What do readers need?
A news Web site has as much in common with a library as it does with a newspaper.
Become familiar with your newspaper’s Web site.
If it is politically correct, insist that you be consulted on all matters relating to both archiving and searching.
If you can't insist, at least offer your services. Odds are, your online editor will welcome the offer.

Building A Business Case

Panelist: Mark Stencel

Business Case Overview

What’s worth saving
Making money
Indirect revenue
Costs and challenges
Getting credit

Business Case

Does It Pay To Save?

Key points:
Your news organization can profit from its archive of original online content
Making money isn’t always profitable (your business case should account for the cost of doing business, not just revenue)

Original Content

Breaking news stories
Standing text (FAQs, online guides and primers)
Video/Audio
Photo Galleries
E-mail Newsletters
Interactive Discussions/Chats
Databases (listings, scores)

Making Money

Sponsorships (e.g., local visitor guides)
Resale (paid archives; research services, such as LexisNexis, Factiva; online reprint rights)

Note: Few good models for selling non-text content (video, audio, galleries)

Indirect Revenue

Promotion (can archived content attract more online users or even print or online subscribers?)

Registration (will users provide valuable e-mail addresses or other personal information in exchange for access to content?)

Costs and Challenges

Do systems, processes, equipment or personnel cost more than you can make?

Rights Management (which content do you have legal rights to use, re-use, or re-sell online)

Content Management (publishing systems and file/directory management for keeping track of where your content is)

Costs and Challenges (cont’d.)

Fulfillment and Customer Service (supporting services you provide to the public or to partners)

Revenue Shares (accounting for your partner’s shares)

Coordinating With Parents or Siblings (do your plans fit in or conflict with the overall business goals/strategies of your chain?)

Costs and Challenges (cont’d.)

Hosting (server space, streaming)
Un-hosting (time and effort to delete or de-link content; automatically deleting content vs. selectively maintaining content)

Get Credit!

Make sure your department gets credit for any revenue it generates, not just the bill for the cost of providing money-making content and services.

Questions & Answers

Closing remarks

Please complete an evaluation form.

Suggested Resources

"The Archival Black Hole" by Scott Kirsner, Editor & Publisher, 9/19/98
"Archiving the Internet" by Brewster Kahle, Scientific American, 11-4-96
"Nothing But Net, Preserving the Internet, 1 Terabyte at a Time" by Bill Barnes, Slate.msn.com
"It Was Here a Minute Ago!": Archiving the Net by Susan E. Feldman, Searcher: The Magazine for Database Professionals
"SCC systems archiving billions of bytes at newspapers", Newspapers & Technology, March 2000

http://www.archive.org