Documents, Text Editors, Text Retrieval, and Web Pages Class 3 LBSC 690 Information Technology


Agenda

• Questions

• Unix Survival Guide

• Document Creation (Word Processing and HTML)

• Document Retrieval

• Project Overview

Unix Survival Guide

• WAM account

• Directory structure (mkdir, cd, .., /)

• How much space is used (du, ls -l)

• Eliminating unneeded files (rm)

• Managing mail (pine, attachments)

• Moving files (mv, cp, ftp)

• Editing files (pico, more)

• Web anywhere (lynx)
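For readers more comfortable in a scripting language, the file-management commands above can be mirrored in Python. This is only a sketch: it works in a throwaway temporary directory, the file names (`draft.html`, `backup.html`, `pub/index.html`) are made up for illustration, and `disk_usage` only approximates what `du` reports, because it sums file sizes in bytes rather than disk blocks.

```python
import os
import shutil
import tempfile


def disk_usage(path):
    """Sum the sizes (in bytes) of all files under path, roughly like `du`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


# Work in a throwaway directory so nothing real is touched.
home = tempfile.mkdtemp()
pub = os.path.join(home, "pub")
os.mkdir(pub)                                           # mkdir pub

draft = os.path.join(home, "draft.html")
with open(draft, "w") as f:                             # stand-in for editing in pico
    f.write("<title>Test</title>")

shutil.copy(draft, os.path.join(home, "backup.html"))   # cp draft.html backup.html
shutil.move(draft, os.path.join(pub, "index.html"))     # mv draft.html pub/index.html

used_before = disk_usage(home)                          # du
os.remove(os.path.join(home, "backup.html"))            # rm backup.html
used_after = disk_usage(home)
print("bytes used before and after rm:", used_before, used_after)

shutil.rmtree(home)                                     # clean up the sandbox
```

Removing the backup copy halves the space used, which is the point of the "eliminating unneeded files" bullet.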

Document Creation

• Editors

• Word Processors

• Desktop Publishing

• Structured Documents

• HTML/SGML/XML

Editors (Text Editing vs. Word Processing)

• Purpose
  – Create and modify ASCII text

• Examples
  – pico, axe, and emacs on WAM

• Advantages
  – Compatible with virtually everything (VT-100)

• Disadvantages
  – Limited format control, sometimes no mouse

Word Processors

• Purpose
  – Create documents intended for human readers

• Examples
  – Microsoft Word and WordPerfect in OWL

• Advantages
  – Good format control
  – WYSIWYG (“What You See Is What You Get”)

• Disadvantages
  – No (universal) standard interchange format

Desktop Publishing

• Purpose
  – Produce documents for wide (paper) distribution

• Examples
  – Adobe PageMaker in the WAM labs

• Advantages
  – Allows very detailed layout control

• Disadvantages
  – Requires fairly extensive user expertise

Structured Documents

• Purpose
  – Specify the logical structure of documents

• Examples
  – email, HTML, LaTeX, SGML/XML

• Advantages
  – Allows easy reformatting for different displays

• Disadvantages
  – Hard to read unless “rendered” before viewing
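The "easy reformatting" advantage can be illustrated with a small sketch. Assume a document is stored as a list of (kind, text) pairs that say what each piece is, not how it looks. The element kinds (`title`, `heading`, `item`) and both renderers are invented for this example and are not part of any standard.

```python
import html

# Logical structure only: each element says what it is, not how it looks.
document = [
    ("title", "Class 3"),
    ("heading", "Agenda"),
    ("item", "Questions"),
    ("item", "Document Creation"),
]


def render_text(doc):
    """Render for a plain character display such as a VT-100 terminal."""
    lines = []
    for kind, text in doc:
        if kind == "title":
            lines.append(text.upper())
        elif kind == "heading":
            lines.append(text)
            lines.append("-" * len(text))
        elif kind == "item":
            lines.append("* " + text)
    return "\n".join(lines)


def render_html(doc):
    """Render the same structure as HTML for a graphical browser."""
    tags = {"title": "h1", "heading": "h2", "item": "p"}
    return "\n".join(
        f"<{tags[kind]}>{html.escape(text)}</{tags[kind]}>" for kind, text in doc
    )


print(render_text(document))
print(render_html(document))
```

The same stored document yields two different presentations. Neither raw form is pleasant to read until one of the renderers has been applied, which is the disadvantage listed above.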

HyperText Markup Language (HTML)

• Purpose
  – Structured document language for web pages

• Advantages
  – Adapts easily to different display capabilities
  – Widely available rendering software (browsers)

• Disadvantages
  – Direct control over layout is limited
  – The HTML “standard” is still evolving

First Steps in HTML

• Find a web page you like

• Select “Document Source” in “View” menu

• Compare HTML code with rendered version
  – Observe how to achieve each effect

• Select “Save As” in “File” menu

• FTP the file to ~/../pub/ on WAM

• Edit the file using pico

• View the result at http://www.wam.umd.edu/~userid/filename

HTML Document Structure

• Markup tags (open and close) bracket content
  – <tag> … </tag>

• Title shows up in the Web browser’s frame

• Headers show up in the page itself

• For each link, specify the URL and link text
  – <a href="URL">link text</a>

• Inline graphics can replace the link text
  – <img src="oard.jpg">
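To see how a program reads this structure, here is a minimal sketch using Python's standard `html.parser` module. The sample page is invented for illustration: "My Home Page" and "Welcome" are placeholders, and only the URL pattern and `oard.jpg` come from the slides. The `StructureReader` class is a hypothetical helper, and it handles only the tags discussed above.

```python
from html.parser import HTMLParser

# Illustrative page; the title and header text are made up.
PAGE = """<html>
<head><title>My Home Page</title></head>
<body>
<h1>Welcome</h1>
<a href="http://www.wam.umd.edu/~userid/filename">link text</a>
<img src="oard.jpg">
</body>
</html>"""


class StructureReader(HTMLParser):
    """Collect the title, headers, links, and inline images of a page."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # tags opened but not yet closed
        self.title = ""
        self.headers = []
        self.links = []       # [URL, link text] pairs
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self.images.append(attrs.get("src"))
            return            # <img> has no closing tag, so do not track it
        if tag == "a":
            self.links.append([attrs.get("href"), ""])
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to the matching open tag.
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if not self.open_tags:
            return
        current = self.open_tags[-1]
        if current == "title":
            self.title += data
        elif current == "h1":
            self.headers.append(data)
        elif current == "a":
            self.links[-1][1] += data


reader = StructureReader()
reader.feed(PAGE)
print("Title: ", reader.title)
print("Headers:", reader.headers)
print("Links: ", reader.links)
print("Images:", reader.images)
```

A browser does far more than this, but the sketch shows the key idea: content is found by the tags that bracket it, and attributes such as `href` and `src` carry the URL and the image file name.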

Designing Web Pages

• Key design issues:
  – Content: What do you want to publish?
  – Style: How do you want to present it?
  – Syntax: How can you achieve that presentation?

• Sources of information
  – Online tutorials (Yahoo points to lots of these)
  – Technical materials (e.g., the HTML 3.0 spec)

Style Guidelines

• Design for generic browsers
  – And test on every version you wish to support

• Provide appropriate access points
  – User needs and navigation strategies differ

• Design useful navigational aids
  – A web search may lead to the middle of a site

• Include some indication of currency
  – Date of last update, “new” icons, etc.

HTML Editors

• Goal is to create web pages, not learn HTML!

• Several are available
  – In Explorer, “Edit-Page” for FrontPage Express
  – In Netscape, “File-Edit Page” for Composer

• You may still need to edit the HTML file
  – Some editors use browser-specific features
  – Some HTML features may be missing entirely
  – File names may be butchered by FTP

SGML/XML

• Generalized Markup Languages
  – SGML - Standard Generalized Markup Language (for paper documents)
  – XML - eXtensible Markup Language (for Web documents) (see W3C)

• These allow people to design
  – DTDs - Document type definitions

• A document also needs:
  – DSSSL - Document Style Semantics and Specification Language
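As a rough illustration of designing your own markup, the sketch below parses a small XML document with Python's standard `xml.etree.ElementTree` module. The tag names (`syllabus`, `session`, `topic`) are made up for this example and do not come from any published DTD. ElementTree does not validate against a DTD, so `check_structure` is a hand-written stand-in for a single rule that a DTD might express.

```python
import xml.etree.ElementTree as ET

# Hypothetical document type: the tag and attribute names are invented here.
SYLLABUS = """<syllabus course="LBSC 690">
  <session number="3">
    <topic>Document Creation</topic>
    <topic>Document Retrieval</topic>
  </session>
</syllabus>"""


def check_structure(root):
    """Stand-in for one DTD rule: every session must contain a topic."""
    return all(
        session.find("topic") is not None for session in root.iter("session")
    )


root = ET.fromstring(SYLLABUS)
topics = [topic.text for topic in root.iter("topic")]
print(root.get("course"), topics, check_structure(root))
```

Nothing in the markup says how a topic should look on screen. That is the job of a separate stylesheet, which is where DSSSL comes in.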

Document Retrieval

• Making documents is often easier than finding them!

• Hypertext vs. Cataloging vs. Searching
  – Yahoo vs. AltaVista

• Lots of applications
  – Chasing down citations in papers you read
  – Web search engines
  – Managing your personal files

• Two basic approaches to searching
  – Explicit queries (“information retrieval”)
  – “Watch what I do” (“adaptive filtering”)

Ways of Searching for Text

• Controlled vocabulary
  – Manual indexing based on named concepts

• Free text
  – Characterize documents by the words they contain

• Social filtering
  – Exchange and interpret personal ratings

“Exact Match” Retrieval

• Find all documents with some characteristic
  – Indexed as “Presidents -- United States”
  – Containing the words “Clinton” and “Peso”
  – Read by my boss

• A set of documents is returned
  – Each is as likely to be useful as any other
  – Usually listed in date or alphabetical order
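The "containing the words" case can be sketched with a small inverted index. The three documents below are invented for illustration, and the function names are hypothetical. This is one simple way to do Boolean AND matching, not necessarily how any particular search engine works.

```python
import re
from collections import defaultdict

# Made-up documents for illustration.
DOCS = {
    "doc1": "Clinton backs loan package to stabilize the peso",
    "doc2": "Clinton visits Maryland",
    "doc3": "Peso falls against the dollar",
}


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def exact_match(index, words):
    """Return the documents containing every query word, in alphabetical order."""
    sets = [index.get(word.lower(), set()) for word in words]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


index = build_index(DOCS)
print(exact_match(index, ["Clinton", "Peso"]))
```

The result is a plain set: a document either matches or it does not, and nothing in the output says which match is more useful. That limitation motivates ranked retrieval.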

Ranked Retrieval

• Put most useful documents near top of a list
  – Put possibly useful documents lower in the list

• No need to exclude any documents
  – Just list those least likely to be useful last

• Two basic techniques
  – Similarity-based
  – Probability-based

Similarity-Based Retrieval

• Assume “most useful” = most similar to query

• Lots of clues to meaning
  – Repeated words are good cues to meaning
  – Rarely used words make searches more selective

• Easily combined
  – Compute a “weight” for each term
  – Add up the weights for query terms in a document
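One common way to turn these two clues into a weight is term frequency times inverse document frequency (tf-idf). The slides do not name a specific formula, so treat the weighting below as an assumption. The four documents are invented for illustration.

```python
import math
import re
from collections import Counter

# Made-up documents for illustration.
DOCS = {
    "doc1": "peso crisis deepens as the peso falls",
    "doc2": "clinton speaks about the economy",
    "doc3": "clinton proposes peso rescue plan",
    "doc4": "clinton meets congress",
}


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(docs, query):
    """Score every document by summing tf * idf over the query terms."""
    n = len(docs)
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for words in counts.values():
        df.update(words.keys())

    scores = {}
    for doc_id, words in counts.items():
        score = 0.0
        for term in set(tokenize(query)):
            if term in words:
                tf = words[term]                # repeated words count more
                idf = math.log(n / df[term])    # rarer words count more
                score += tf * idf
        scores[doc_id] = score

    # Highest score first; ties are broken by document name.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


for doc_id, score in rank(DOCS, "clinton peso"):
    print(doc_id, round(score, 3))
```

In this toy collection "peso" appears in fewer documents than "clinton", so it carries more weight. As a result doc1, which repeats "peso", outranks doc3 even though doc3 contains both query words. No document is excluded. The weakest matches simply appear last.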

Project Overview

• Goal: Solve a practical problem
  – One which is fairly complex

• You choose the technology
  – Make a set of web pages (a web “site”)
  – Make a database (optional for summer 690)
  – Do something else that is equally complex
    • Multimedia presentation, Java program, …

• Suggest two-person groups

Web Projects

• Have significant content! (see “What is a Book” web site under CLIS Dean’s Award)

• Multiple access points
  – Taxonomy, search engine, map, etc.

• Be creative (in a useful way)! For example:
  – Choose a novel application
  – Engage the user with an interactive approach
  – Adopt an innovative organization
  – Implement a creative layout

Database Projects (very ambitious for Summer 690)

• Your focus should be on scalability
  – What if the IRS decided to use your database?

• The user interface is important
  – Designed to be used without taking 690 first!

• Include enough content to allow testing
  – But focus on organization, not on content

• The same creativity issues as web projects

Project Timeline and Deliverables (summer 690)

• Project specification (1-2 pages)

• Should include User Manual (FAQ) and Test Plan components

• Project demonstrations last week of class
  – Scheduled individually
  – All two/three team members get the same grade