Document Computing
Technologies for Managing Electronic Document Collections
Ross Wilkinson ... [et al.]
Circulation Counter [RES3H] ZA4080 .D63 1998
Chapter 1
Document Lifecycle
What is a document?
A document records a message from people to people.
Characteristics of a document
• Content
• Structure
• Metadata
Metadata
• A message has a context, which is important for understanding the message.
• A document contains not only the contents of a message, but also some information about the document, e.g. author, date, recipients.
• We call such information the metadata of the document.
Why Document Management?
• It is hard to find documents.
• It is hard to organize documents.
• It is hard to control documents.
• Metadata helps document management.
Benefits of Document Management
• Location-independent delivery of documents upon demand
• Controlled access to documents
• A record of the life of a document
• Better re-use of documents
Chapter 2
Electronic Document Description
Document Content
• Simplest type of content – unformatted text
• Text retrieval system based on search by keywords
• E.g. Windows Desktop Search
• Optical character recognition (OCR) systems
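Keyword search over unformatted text can be illustrated with a small inverted index. The following is a minimal sketch in Python, not how Windows Desktop Search or any particular product works; the function names `build_index` and `search` and the second sample document are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            # Strip common punctuation so "design." matches "design".
            index[word.strip(".,?!:;")].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every keyword in the query."""
    words = [w.lower() for w in query.split()]
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical two-document collection; memo1 reuses the memo text from these notes.
docs = {
    "memo1": "I have finished Stage B of the design. Could you take a look at it?",
    "memo2": "Stage A of the design is under review.",
}
index = build_index(docs)
print(search(index, "stage design"))
print(search(index, "finished"))
```

A query is treated as a conjunction of keywords, so a document matches only if it contains all of them.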
Document Structure
• Even unformatted text has some structures, e.g. lines, words, images, etc.
• A document may have elaborate structures.
• Two levels of structure:
  – Logical structure
  – Presentational structure
Logical structures
• Example:
TO: John D.
FROM: Kate M.
DATE: 7/8/98
I have finished Stage B of the design. Could you take a look at it?
• Simple logical structure: lines of text
• A logical structure of a memo: (see next slide)
A logical structure for a memo
Memo
  Head
    Sender
    Receiver
    Date
  Body
    Paragraph
Presentational Structure
• A different presentational structure for the same memo
John D., 7/8/98
I have finished Stage B of the design. Could you take a look at it?
Kate M.
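The separation between logical and presentational structure can be sketched in code: one logical memo, two renderings. This is an illustrative Python sketch only; the class `Memo` and the functions `render_header_style` and `render_letter_style` are names assumed here, not part of any document management system.

```python
from dataclasses import dataclass


@dataclass
class Memo:
    """Logical structure: a head (sender, receiver, date) and a body of paragraphs."""
    sender: str
    receiver: str
    date: str
    paragraphs: list


def render_header_style(memo):
    """Presentation 1: TO/FROM/DATE header lines followed by the body."""
    lines = [f"TO: {memo.receiver}", f"FROM: {memo.sender}", f"DATE: {memo.date}"]
    lines.extend(memo.paragraphs)
    return "\n".join(lines)


def render_letter_style(memo):
    """Presentation 2: receiver and date first, sender as a closing line."""
    lines = [f"{memo.receiver}, {memo.date}"]
    lines.extend(memo.paragraphs)
    lines.append(memo.sender)
    return "\n".join(lines)


memo = Memo(
    sender="Kate M.",
    receiver="John D.",
    date="7/8/98",
    paragraphs=["I have finished Stage B of the design. Could you take a look at it?"],
)
print(render_header_style(memo))
print()
print(render_letter_style(memo))
```

Both renderings draw on the same logical fields, so changing the presentation does not require touching the content.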
Presentation medium
• The content of the same document can be presented in different media with different presentational structures:
• E.g. a PDF file vs. an online Web page
Metadata
• Generally, we need metadata to capture:
  – Registration information
  – Usage information
  – Structural properties
  – Contextual information
  – Content description
  – Historical information
The Dublin Core metadata set
• Title
• Creator
• Subject
• Description
• Publisher
• Contributors
• Date
• Type
• Format: e.g. HTML, PDF
• Identifier: e.g. URI
• Source
• Language
• Relation
• Coverage: duration
• Rights: e.g. copyright
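One common way to attach Dublin Core metadata to a Web page is through `META` tags whose names carry a `DC.` prefix. The sketch below, in Python, turns a small metadata record into such tags. The function name `dublin_core_meta` and the choice of sample fields are assumptions made for illustration; the values are taken from the memo example in these notes.

```python
from html import escape


def dublin_core_meta(record):
    """Build one META tag per Dublin Core element in the record."""
    return [
        f'<META NAME="DC.{escape(name, quote=True)}" CONTENT="{escape(value, quote=True)}">'
        for name, value in record.items()
    ]


# Hypothetical record describing the memo used throughout these notes.
record = {
    "Title": "Memo",
    "Creator": "Kate M",
    "Date": "7/8/1998",
    "Format": "HTML",
}
for tag in dublin_core_meta(record):
    print(tag)
```

Escaping the values keeps the generated tags well formed even if a title or creator name contains quotation marks or angle brackets.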
Document Description Language (DDL)
• For use by document management systems
• E.g. RTF, PostScript, SGML
• DDL support:
  – Language support, media support, transparency, structure, link support, metadata support
• Other DDL characteristics:
  – Document creation, import conversion, export transformation, update, presentation quality, presentation flexibility, etc.
Examples of DDLs
• ASCII (American Standard Code for Information Interchange)
• Unicode
• ASCII and Unicode offer very limited support
• Rich Text Format
• TeX and LaTeX
• SGML, HTML, XML
• PostScript, PDF
Rich Text Format (RTF)
• Developed by Microsoft
• For interchange between Microsoft Word and other software
• Main purposes:
  – Preserve information in Word (blocks of text)
• Example: next slide
{\rtf1\adeflang1025\ansi\ansicpg1252\uc2\adeff0\deff0\stshfdbch13\stshfloch0\stshfhich0\stshfbi0\deflang2057\deflangfe1028{\fonttbl{\f0\froman\fcharset0\fprq2{\*\panose 02020603050405020304}Times New Roman
…
{\title John D}{\author Dr. Yeung}{\operator Dr. Yeung}{\creatim\yr2008\mo3\dy18\hr15\min24}{\revtim\yr2008\mo3\dy18\hr15\min25}{\version1}{\edmins1}{\nofpages1}{\nofwords14}{\nofchars81}{\*\company Lingnan University}{\nofcharsws94}
…
\ltrch\fcs0 \insrsid1782868\charrsid1782868 \hich\af0\dbch\af13\loch\f0 John D., 7/8/98
\par \hich\af0\dbch\af13\loch\f0 I have finished Stage B of the design. Could you take a look at it?
\par
\par \hich\af0\dbch\af13\loch\f0 Kate M\hich\af0\dbch\af13\loch\f0 .
\par }\pard \ltrpar\ql \li0\ri0\widctlpar\wrapdefault\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0 {\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid4811147
\par }}
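The RTF listing above carries its metadata in groups such as `{\title ...}` and `{\author ...}`. As a rough illustration, the Python sketch below pulls a few of those groups out with a regular expression. It is not an RTF parser and makes the simplifying assumption that the groups contain no nested braces; the names `INFO_GROUP` and `extract_info` are hypothetical.

```python
import re

# Matches simple groups such as {\title John D}; nested groups are not handled.
INFO_GROUP = re.compile(r"\{\\(title|author|operator) ([^{}]*)\}")


def extract_info(rtf_text):
    """Return a dict of the title, author and operator groups found in the text."""
    return dict(INFO_GROUP.findall(rtf_text))


# Fragment copied from the RTF example in these notes.
sample = (
    r"{\title John D}{\author Dr. Yeung}{\operator Dr. Yeung}"
    r"{\creatim\yr2008\mo3\dy18\hr15\min24}{\*\company Lingnan University}"
)
print(extract_info(sample))
```

The `\company` group is skipped because it is written with the `\*` prefix, which the pattern deliberately does not match.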
TeX and LaTeX
• TeX created by Donald Knuth
• TeX is a typesetting system.
• LaTeX was created by Leslie Lamport on top of TeX.
• LaTeX uses markup constructs to separate logical description from presentation.
• LaTeX example: see next slide
\documentclass{article}
\usepackage{times}
\pagestyle{empty}
\begin{document}
\title{Sample Document}
\author{W. L. Yeung\\Department of Computing and Decision Sciences\\Lingnan University, Hong Kong\\[email protected]}
\maketitle
\section{Introduction}
…
\section{Conclusion}
…
\end{document}
SGML
• Standard Generalized Markup Language
• To describe a document in SGML, we need:
  – An SGML declaration
  – A document type definition (DTD)
  – A document instance
• An SGML declaration specifies which characters are used in the DTD. Normally a default is used.
SGML (cont.)
• A document type definition (DTD) defines the rules for forming a class of documents, i.e. the grammar of a document class.
• The building blocks of SGML documents are elements.
• A DTD for the memo document: next slide.
<!-- DTD for office memo -->
<!-- ELEMENT CONTENT -- >
<!ELEMENT memo - - (head, body, close?) >
<!ELEMENT head O O (to & from & date) >
<!ELEMENT to - - (#PCDATA) >
<!ELEMENT from - - (#PCDATA) >
<!ELEMENT date - - (#PCDATA) >
<!ELEMENT body - - (par+) >
<!ELEMENT par - - (#PCDATA) >
<!ELEMENT close - - (#PCDATA) >
<!-- ELEMENT NAME VALUE DEFAULT -- >
<!ATTLIST memo status (con|pub) pub >
<!ATTLIST par id id #IMPLIED >
DTD
• An element definition gives the name of the element, then the rules for building that element.
• Elements can contain other elements.
• Terminal (basic) elements often consist of parsed character data (#PCDATA) or unparsed character data (CDATA).
The memo in SGML
<MEMO>
<TO> John D </TO>
<FROM> Kate M </FROM>
<DATE> 7/8/1998 </DATE>
<BODY>
<PAR>
I have finished Stage B of the design.
</PAR>
</BODY>
</MEMO>
HTML
• Hypertext Markup Language
• For World Wide Web (WWW) documents
• Conforms to an SGML DTD
• HTML is presentation oriented: instructions (tags) are inserted into a document for presentation effects
• The DTD for HTML is available on http://www.w3.org/TR/html401/sgml/dtd.html
The memo in HTML
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<HTML>
<HEAD>
<TITLE>Memo</TITLE>
<META NAME="DC.AUTHOR" CONTENT="Kate M">
<META NAME="DC.DATE" CONTENT="7/8/1998">
</HEAD>
<BODY>
<H1>Memo</H1>
<P>I have finished Stage B of the <A HREF="/team3/design2">design</A>.</P>
</BODY>
</HTML>
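Metadata embedded in an HTML page can be read back by a program. The following Python sketch collects the `META` name/content pairs using the standard library's `html.parser` module. The class name `MetaCollector` and the shortened sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect NAME/CONTENT pairs from META tags as the page is parsed."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names but keeps attribute values.
        if tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes and "content" in attributes:
                self.meta[attributes["name"]] = attributes["content"]


# Shortened version of the memo page, written with straight quotes.
page = (
    '<HTML><HEAD><TITLE>Memo</TITLE>'
    '<META NAME="DC.AUTHOR" CONTENT="Kate M">'
    '<META NAME="DC.DATE" CONTENT="7/8/1998">'
    '</HEAD><BODY><H1>Memo</H1></BODY></HTML>'
)
collector = MetaCollector()
collector.feed(page)
print(collector.meta)
```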
XML
• Extensible Markup Language
• Three basic definitions:
  – XML for representing data and documents
  – XLink and XPointer for representing inter-document linking
  – XSL for representing presentation
• XML is a near-subset of SGML
XML (Cont.)
• Two classes of XML documents:
  – Valid XML documents: documents that conform to a specific supplied DTD
  – Well-formed documents: only satisfy a simple default grammar, without conforming to a specific DTD
• XML has become the cornerstone of electronic commerce as it allows businesses to exchange electronic documents according to some standard formats based on XML.
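The difference between well-formed and valid documents can be made concrete with a short check. The Python sketch below tests well-formedness only, using the standard library's `xml.etree.ElementTree`, which does not validate against a DTD; checking validity would need a validating parser. The function name `is_well_formed` and the lowercase XML version of the memo are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as XML, i.e. satisfies the default grammar."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical XML rendering of the memo used in these notes.
good = (
    "<memo><to>John D</to><from>Kate M</from><date>7/8/1998</date>"
    "<body><par>I have finished Stage B of the design.</par></body></memo>"
)
# The closing tag does not match the opening tag, so this is not well formed.
bad = "<memo><to>John D</from></memo>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

A document that passes this check may still be invalid with respect to a particular DTD, for example if it omits a required element.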
PostScript
• Developed by Adobe
• For representing documents that are to be printed (mainly on laser printers)
• A page description language optimized for printing text, images, graphics.
Portable Document Format (PDF)
• Developed by Adobe
• A page description language for representing text, graphics and images
• A PDF file contains presentation information on pages, annotations, links, fonts, etc.
• Supports delivery of electronic documents exactly as they would appear in printed form.
• Not designed for editing or document format exchange.