Document Computing Technologies for Managing Electronic Document Collections Ross Wilkinson ... [et al.] Circulation Counter [RES3H] ZA4080 .D63 1998


Document Computing

Technologies for Managing Electronic Document Collections

Ross Wilkinson ... [et al.]

Circulation Counter  [RES3H]  ZA4080 .D63 1998  


Chapter 1

Document Lifecycle


What is a document?

A document records a message from people to people.


Characteristics of a document

• Content

• Structure

• Metadata


Metadata

• A message has a context, which is important for understanding the message.

• A document contains not only the contents of a message, but also some information about the document, e.g. author, date, recipients.

• We call such information the metadata of the document.



Why Document Management?

• It is hard to find documents.

• It is hard to organize documents.

• It is hard to control documents.

• Metadata helps document management.


Benefits of Document Management

• Location-independent delivery of documents upon demand

• Controlled access to documents

• A record of the life of a document

• Better re-use of documents


Chapter 2

Electronic Document Description


Document Content

• Simplest type of content – unformatted text

• Text retrieval system based on search by keywords

• E.g. Windows Desktop Search

• Optical character recognition (OCR) system
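Keyword-based text retrieval can be illustrated with a small inverted index. This is a minimal sketch, not how Windows Desktop Search or any particular product works; the function names and the two sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every keyword in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical two-document collection of unformatted text.
documents = {
    "memo1": "I have finished Stage B of the design.",
    "memo2": "Stage C of the design starts next week.",
}
index = build_index(documents)
print(sorted(search(index, "design stage")))
print(sorted(search(index, "finished")))
```

The index is built once and each query only intersects small sets, which is why keyword search scales to large collections of plain text.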


Document Structure

• Even unformatted text has some structure, e.g. lines, words, images.

• A document may have elaborate structures.

• Two levels of structure:

– Logical structure

– Presentational structure


Logical structures

• Example:

TO: John D.

FROM: Kate M.

DATE: 7/8/98

I have finished Stage B of the design. Could you take a look at it?

• Simple logical structure: lines of text

• A logical structure of a memo: (see next slide)


A logical structure for a memo

Memo
  – Head
      – Sender
      – Receiver
      – Date
  – Body
      – Paragraph


Presentational Structure

• A different presentational structure for the same memo

John D., 7/8/98

I have finished Stage B of the design. Could you take a look at it?

Kate M.
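The difference between logical and presentational structure can be sketched in a few lines: one logical memo, two renderings. The class and function names below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Memo:
    """Logical structure: a head (sender, receiver, date) and a body."""
    sender: str
    receiver: str
    date: str
    paragraphs: List[str]


def render_form(memo):
    """Presentation 1: labelled TO/FROM/DATE lines, then the body."""
    lines = [
        f"TO: {memo.receiver}",
        f"FROM: {memo.sender}",
        f"DATE: {memo.date}",
        "",
    ]
    lines.extend(memo.paragraphs)
    return "\n".join(lines)


def render_letter(memo):
    """Presentation 2: receiver and date first, sender as a signature."""
    lines = [f"{memo.receiver}, {memo.date}", ""]
    lines.extend(memo.paragraphs)
    lines.extend(["", memo.sender])
    return "\n".join(lines)


memo = Memo(
    sender="Kate M.",
    receiver="John D.",
    date="7/8/98",
    paragraphs=["I have finished Stage B of the design. Could you take a look at it?"],
)
print(render_form(memo))
print()
print(render_letter(memo))
```

Both outputs are produced from the same `Memo` object, so changing the layout never touches the content.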


Presentation medium

• The content of the same document can be presented in different media with different presentational structures:

• E.g. a PDF file vs. an online Web page


Metadata

• Generally, we need metadata to capture:

– Registration information

– Usage information

– Structural properties

– Contextual information

– Content description

– Historical information


The Dublin Core metadata set

• Title

• Creator

• Subject

• Description

• Publisher

• Contributor

• Date

• Type

• Format: e.g. HTML, PDF

• Identifier: e.g. URI

• Source

• Language

• Relation

• Coverage: duration

• Rights: e.g. copyright
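As an illustration of how the Dublin Core elements listed above might be attached to a Web page, the following sketch turns a metadata record into HTML `meta` tags. It assumes the common `DC.<element>` naming convention for `meta` names; the function name and the sample record are hypothetical.

```python
from html import escape

# The fifteen Dublin Core elements, in lower case.
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dublin_core_meta(metadata):
    """Return one <meta> tag per Dublin Core element in the record."""
    tags = []
    for name, value in metadata.items():
        if name not in DUBLIN_CORE_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        # Escape the value so quotes and angle brackets stay inside the attribute.
        tags.append(f'<meta name="DC.{name}" content="{escape(value, quote=True)}">')
    return "\n".join(tags)


# Hypothetical record for the memo used in these slides.
record = {"title": "Memo", "creator": "Kate M", "date": "7/8/1998", "format": "HTML"}
print(dublin_core_meta(record))
```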


Document Description Language (DDL)

• For use by document management systems

• E.g. RTF, PostScript, SGML

• DDL support:

– Language support, media support, transparency, structure, link support, metadata support

• Other DDL characteristics:

– Document creation, import conversion, export transformation, update, presentation quality, presentation flexibility, etc.


Examples of DDLs

• ASCII (American Standard Code for Information Interchange)

• Unicode

• ASCII and Unicode offer very limited support

• Rich Text Format

• TeX and LaTeX

• SGML, HTML, XML

• PostScript, PDF


Rich Text Format (RTF)

• Developed by Microsoft

• For interchange between Microsoft Word and other software

• Main purposes:

– Preserve information in Word (blocks of text)

• Example: next slide


{\rtf1\adeflang1025\ansi\ansicpg1252\uc2\adeff0\deff0\stshfdbch13\stshfloch0\stshfhich0\stshfbi0\deflang2057\deflangfe1028{\fonttbl{\f0\froman\fcharset0\fprq2{\*\panose 02020603050405020304}Times New Roman

{\title John D}{\author Dr. Yeung}{\operator Dr. Yeung}{\creatim\yr2008\mo3\dy18\hr15\min24}{\revtim\yr2008\mo3\dy18\hr15\min25}{\version1}{\edmins1}{\nofpages1}{\nofwords14}{\nofchars81}{\*\company Lingnan University}{\nofcharsws94}

\ltrch\fcs0 \insrsid1782868\charrsid1782868 \hich\af0\dbch\af13\loch\f0 John D., 7/8/98

\par \hich\af0\dbch\af13\loch\f0 I have finished Stage B of the design. Could you take a look at it?

\par

\par \hich\af0\dbch\af13\loch\f0 Kate M\hich\af0\dbch\af13\loch\f0 .

\par }\pard \ltrpar\ql \li0\ri0\widctlpar\wrapdefault\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0 {\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid4811147

\par }}
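The RTF excerpt above carries metadata (title, author, creation time) alongside the text. As a rough sketch, such fields can be pulled out with regular expressions. This is not a full RTF parser: it assumes flat groups of the form `{\keyword value}` as in the excerpt, and the function names are hypothetical.

```python
import re

# A fragment of the information group from the RTF excerpt.
RTF_INFO = (
    r"{\title John D}{\author Dr. Yeung}{\operator Dr. Yeung}"
    r"{\creatim\yr2008\mo3\dy18\hr15\min24}"
)


def rtf_info_field(rtf, keyword):
    """Return the text of a flat {\\keyword value} group, or None if absent."""
    pattern = r"\{\\%s ([^{}]*)\}" % re.escape(keyword)
    match = re.search(pattern, rtf)
    return match.group(1) if match else None


def rtf_creation_date(rtf):
    """Return (year, month, day) from the \\creatim group, or None if absent."""
    match = re.search(r"\{\\creatim\\yr(\d+)\\mo(\d+)\\dy(\d+)", rtf)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


print(rtf_info_field(RTF_INFO, "title"))
print(rtf_info_field(RTF_INFO, "author"))
print(rtf_creation_date(RTF_INFO))
```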


TeX and LaTeX

• TeX was created by Donald Knuth.

• TeX is a typesetting system.

• LaTeX was created by Leslie Lamport on top of TeX.

• LaTeX uses markup constructs to separate logical description from presentation.

• LaTeX example: see next slide



\documentclass{article}

\usepackage{times}

\pagestyle{empty}

\begin{document}

\title{Sample Document}

\author{W. L. Yeung\\Department of Computing and Decision Sciences\\Lingnan University, Hong Kong\\{[email protected]}}

\maketitle

\section{Introduction}

\section{Conclusion}

\end{document}


SGML

• Standard Generalized Markup Language

• To describe a document in SGML, we need:

– An SGML declaration

– A document type definition (DTD)

– A document instance

• An SGML declaration specifies which characters are used in the DTD. Normally a default is used.


SGML (cont.)

• A document type definition (DTD) defines the rules for forming a class of documents, i.e. the grammar of a document class.

• The building blocks of SGML documents are elements.

• A DTD for the memo document: next slide.


<!-- DTD for office memo -->

<!-- ELEMENT CONTENT -- >

<!ELEMENT memo - - (head, body, close?) >

<!ELEMENT head O O (to & from & date) >

<!ELEMENT to - - (#PCDATA) >

<!ELEMENT from - - (#PCDATA) >

<!ELEMENT date - - (#PCDATA) >

<!ELEMENT body - - (par+) >

<!ELEMENT par - - (#PCDATA) >

<!ELEMENT close - - (#PCDATA) >

<!-- ELEMENT NAME VALUE DEFAULT -- >

<!ATTLIST memo status (con|pub) pub >

<!ATTLIST par id id #IMPLIED >


DTD

• An element definition gives the name of the element, then the rules for building that element.

• Elements can contain other elements.

• Terminal (basic) elements often consist of parsed character data (“#PCDATA”) or unparsed character data (“CDATA”).
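The rules in an element definition behave much like regular expressions over the names of child elements. The sketch below checks two content models from the memo DTD, `(head, body, close?)` and `(to & from & date)`. It is an illustration only, not an SGML parser, and the table and function names are hypothetical.

```python
import re

# Sequence models written as regular expressions over child names.
# Each child name is followed by one space so that names cannot run together.
SEQUENCE_MODELS = {
    "memo": r"(head )(body )(close )?",  # (head, body, close?)
}

# "&" models: every listed child exactly once, in any order.
ALL_MODELS = {
    "head": {"to", "from", "date"},  # (to & from & date)
}


def children_allowed(element, children):
    """Return True if the list of child names satisfies the element's model."""
    if element in SEQUENCE_MODELS:
        sequence = "".join(name + " " for name in children)
        return re.fullmatch(SEQUENCE_MODELS[element], sequence) is not None
    if element in ALL_MODELS:
        required = ALL_MODELS[element]
        return len(children) == len(required) and set(children) == required
    raise KeyError(f"no content model for element: {element}")


print(children_allowed("memo", ["head", "body"]))
print(children_allowed("memo", ["body", "head"]))
print(children_allowed("head", ["date", "to", "from"]))
```

The first and third calls describe children that satisfy their models; the second puts `body` before `head`, which the sequence model rejects.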


The memo in SGML

<MEMO>

<TO> John D </TO>

<FROM> Kate M </FROM>

<DATE> 7/8/1998 </DATE>

<BODY>

<PAR>

I have finished Stage B of the design.

</PAR>

</BODY>

</MEMO>


HTML

• Hypertext Markup Language

• For World Wide Web (WWW) documents

• Conforms to an SGML DTD

• HTML is presentation oriented: instructions (tags) are inserted into a document to produce presentation effects

• The DTD for HTML is available at http://www.w3.org/TR/html401/sgml/dtd.html


The memo in HTML

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<HTML>
<HEAD>
<TITLE>Memo</TITLE>
<META NAME="DC.AUTHOR" CONTENT="Kate M">
<META NAME="DC.DATE" CONTENT="7/8/1998">
</HEAD>
<BODY>
<H1>Memo</H1>
<P>I have finished Stage B of the <A HREF="/team3/design2">design</A>.</P>
</BODY>
</HTML>
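A document management system can read the metadata back out of such a page. The sketch below uses Python's standard `html.parser` module to collect the `META` tags whose names start with `DC.`; the class name is hypothetical, and the embedded page is a cleaned-up copy of the memo with plain ASCII quotes.

```python
from html.parser import HTMLParser


class DublinCoreExtractor(HTMLParser):
    """Collect the content of every <meta> tag whose name starts with "DC."."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.upper().startswith("DC."):
            self.metadata[name.upper()] = attributes.get("content")


MEMO_HTML = """<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<HTML>
<HEAD>
<TITLE>Memo</TITLE>
<META NAME="DC.AUTHOR" CONTENT="Kate M">
<META NAME="DC.DATE" CONTENT="7/8/1998">
</HEAD>
<BODY>
<H1>Memo</H1>
<P>I have finished Stage B of the <A HREF="/team3/design2">design</A>.</P>
</BODY>
</HTML>"""

extractor = DublinCoreExtractor()
extractor.feed(MEMO_HTML)
print(extractor.metadata)
```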


XML

• Extensible Markup Language

• Three basic definitions:

– XML for representing data and documents

– XLink and XPointer for representing inter-document linking

– XSL for representing presentation

• XML is a near-subset of SGML


XML (Cont.)

• Two classes of XML documents:

– Valid XML documents: documents that conform to a specific supplied DTD

– Well-formed documents: only satisfy a simple default grammar, without conforming to a specific DTD

• XML has become the cornerstone of electronic commerce as it allows businesses to exchange electronic documents according to some standard formats based on XML.
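Well-formedness can be checked without any DTD. The sketch below uses `xml.etree.ElementTree` from the Python standard library, which parses XML but does not validate against a DTD, so it only answers the well-formedness question. The function name and the sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, without checking any DTD."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# The memo written as XML: every start tag has a matching end tag.
GOOD_MEMO = (
    "<memo><to>John D</to><from>Kate M</from><date>7/8/1998</date>"
    "<body><par>I have finished Stage B of the design.</par></body></memo>"
)

# Mismatched tags: <to> is closed by </from>.
BAD_MEMO = "<memo><to>John D</from></memo>"

print(is_well_formed(GOOD_MEMO))
print(is_well_formed(BAD_MEMO))
```

Checking validity would additionally require a validating parser and the DTD itself, which this sketch deliberately leaves out.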


PostScript

• Developed by Adobe

• For representing documents that are to be printed (mainly on laser printers)

• A page description language optimized for printing text, images, graphics.


Portable Document Format (PDF)

• Developed by Adobe

• A page description language for representing text, graphics and images

• A PDF file contains presentation information on pages, annotations, links, fonts, etc.

• Supports delivery of electronic documents exactly as they would appear in printed form.

• Not designed for editing or document format exchange.