How to build a digital library

  • Published on
    24-Mar-2016


  • How to Build a Digital Library

  • The Morgan Kaufmann Series in Multimedia Information and Systems

    Series Editor, Edward A. Fox, Virginia Polytechnic University

    How to Build a Digital Library
    Ian H. Witten and David Bainbridge

    Digital Watermarking
    Ingemar J. Cox, Matthew L. Miller, and Jeffrey A. Bloom

    Readings in Multimedia Computing and Networking
    Edited by Kevin Jeffay and HongJiang Zhang

    Introduction to Data Compression, Second Edition
    Khalid Sayood

    Multimedia Servers: Applications, Environments, and Design
    Dinkar Sitaram and Asit Dan

    Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition
    Ian H. Witten, Alistair Moffat, and Timothy C. Bell

    Digital Compression for Multimedia: Principles and Standards
    Jerry D. Gibson, Toby Berger, Tom Lookabaugh, Dave Lindbergh, and Richard L. Baker

    Practical Digital Libraries: Books, Bytes, and Bucks
    Michael Lesk

    Readings in Information Retrieval
    Edited by Karen Sparck Jones and Peter Willett

  • Documents are the digital library's building blocks. It is time to step down from our high-level discussion of digital libraries (what they are, how they are organized, and what they look like) to nitty-gritty details of how to represent the documents they contain. To do a thorough job we will have to descend even further and look at the representation of the characters that make up textual documents and the fonts in which those characters are portrayed. For audio, images and video we examine the interplay between signal quantization, sampling rate and internal redundancy that underlies multimedia representations.


    How to Build a Digital Library

    Ian H. Witten

    Computer Science Department University of Waikato

    David Bainbridge

    Computer Science Department University of Waikato

  • Publishing Director: Diane D. Cerra
    Assistant Publishing Services Manager: Edward Wade
    Senior Developmental Editor: Marilyn Uffner Alan
    Editorial Assistant: Mona Buehler
    Project Management: Yonie Overton
    Cover Design: Frances Baca Design
    Text Design: Mark Ong, Side by Side Studios
    Composition: Susan Riley, Side by Side Studios
    Copyeditor: Carol Leyba
    Proofreader: Ken DellaPenta
    Indexer: Steve Rath
    Printer: The Maple-Vail Book Manufacturing Group

    Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

    Morgan Kaufmann Publishers
    An imprint of Elsevier Science
    340 Pine Street, Sixth Floor
    San Francisco, CA 94104-3205
    www.mkp.com

    © 2003 by Elsevier Science (USA)
    All rights reserved.
    Printed in the United States of America

    07 06 05 04 03 5 4 3 2 1

    No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, or otherwise) without the prior written permission of the publisher.

    Library of Congress Control Number: 2002107327
    ISBN: 1-55860-790-0

    This book is printed on acid-free paper.

  • Contents

    List of figures

    List of tables

    Foreword by Edward A. Fox

    Preface

    1. Orientation: The world of digital libraries
        Example One: Supporting human development
        Example Two: Pushing on the frontiers of science
        Example Three: Preserving a traditional culture
        Example Four: Exploring popular music
        The scope of digital libraries

    1.1 Libraries and digital libraries

    1.2 The changing face of libraries
        In the beginning
        The information explosion
        The Alexandrian principle
        Early technodreams
        The library catalog
        The changing nature of books

    1.3 Digital libraries in developing countries
        Disseminating humanitarian information
        Disaster relief
        Preserving indigenous culture
        Locally produced information
        The technological infrastructure

    1.4 The Greenstone software

    1.5 The pen is mighty: Wield it wisely
        Copyright
        Collecting from the Web
        Illegal and harmful material
        Cultural sensitivity

    1.6 Notes and sources

    2. Preliminaries: Sorting out the ingredients

    2.1 Sources of material
        Ideology
        Converting an existing library
        Building a new collection
        Virtual libraries

    2.2 Bibliographic organization
        Objectives of a bibliographic system
        Bibliographic entities

    2.3 Modes of access

    2.4 Digitizing documents
        Scanning
        Optical character recognition
        Interactive OCR
        Page handling
        Planning an image digitization project
        Inside an OCR shop
        An example project

    2.5 Notes and sources

    3. Presentation: User interfaces

    3.1 Presenting documents
        Hierarchically structured documents
        Plain, unstructured text documents
        Page images
        Page images and extracted text
        Audio and photographic images
        Video
        Music
        Foreign languages

    3.2 Presenting metadata

    3.3 Searching
        Types of query
        Case-folding and stemming
        Phrase searching
        Different query interfaces

    3.4 Browsing
        Browsing alphabetical lists
        Ordering lists of words in Chinese
        Browsing by date
        Hierarchical classification structures

    3.5 Phrase browsing
        A phrase browsing interface
        Key phrases

    3.6 Browsing using extracted metadata
        Acronyms
        Language identification

    3.7 Notes and sources
        Collections
        Metadata
        Searching
        Browsing

    4. Documents: The raw material

    4.1 Representing characters
        Unicode
        The Unicode character set
        Composite and combining characters
        Unicode character encodings
        Hindi and related scripts
        Using Unicode in a digital library

    4.2 Representing documents
        Plain text
        Indexing
        Word segmentation

    4.3 Page description languages: PostScript and PDF
        PostScript
        Fonts
        Text extraction
        Using PostScript in a digital library
        Portable Document Format: PDF
        PDF and PostScript

    4.4 Word-processor documents
        Rich Text Format
        Native Word formats
        LaTeX format

    4.5 Representing images
        Lossless image compression: GIF and PNG
        Lossy image compression: JPEG
        Progressive refinement

    4.6 Representing audio and video
        Multimedia compression: MPEG
        MPEG video
        MPEG audio
        Mixing media
        Other multimedia formats
        Using multimedia in a digital library

    4.7 Notes and sources

    5. Markup and metadata: Elements of organization

    5.1 Hypertext markup language: HTML
        Basic HTML
        Using HTML in a digital library

    5.2 Extensible markup language: XML
        Development of markup and stylesheet languages
        The XML metalanguage
        Parsing XML
        Using XML in a digital library

    5.3 Presenting marked-up documents
        Cascading style sheets: CSS
        Extensible stylesheet language: XSL

    5.4 Bibliographic metadata
        MARC
        Dublin Core
        BibTeX
        Refer

    5.5 Metadata for images and multimedia
        Image metadata: TIFF
        Multimedia metadata: MPEG-7

    5.6 Extracting metadata
        Extracting document metadata
        Generic entity extraction
        Bibliographic references
        Language identification
        Acronym extraction
        Key-phrase extraction
        Phrase hierarchies

    5.7 Notes and sources

    6. Construction: Building collections with Greenstone

    6.1 Why Greenstone?
        What it does
        How to use it

    6.2 Using the Collector
        Creating a new collection
        Working with existing collections
        Document formats

    6.3 Building collections manually: A walkthrough
        Getting started
        Making a framework for the collection
        Importing the documents
        Building the indexes
        Installing the collection

    6.4 Importing and building
        Files and directories
        Object identifiers
        Plug-ins
        The import process
        The build process

    6.5 Greenstone archive documents
        Document metadata
        Inside the documents

    6.6 Collection configuration file
        Default configuration file
        Subcollections and supercollections

    6.7 Getting the most out of your documents
        Plug-ins
        Classifiers
        Format statements

    6.8 Building collections graphically

    6.9 Notes and sources

    7. Delivery: How Greenstone works

    7.1 Processes and protocols
        Processes
        The null protocol implementation
        The Corba protocol implementation

    7.2 Preliminaries
        The macro language
        The collection information database

    7.3 Responding to user requests
        Performing a search
        Retrieving a document
        Browsing a hierarchical classifier
        Generating the home page
        Using the protocol
        Actions

    7.4 Operational aspects
        Configuring the receptionist
        Configuring the site

    7.5 Notes and sources

    8. Interoperability: Standards and protocols

    8.1 More markup
        Names
        Links
        Types

    8.2 Resource description
        Collection-level metadata

    8.3 Document exchange
        Open eBook

    8.4 Query languages
        Common command language
        XML Query

    8.5 Protocols
        Z39.50
        Supporting the Z39.50 protocol
        The Open Archives Initiative
        Supporting the OAI protocol

    8.6 Research protocols
        Dienst
        Simple digital library interoperability protocol
        Translating between protocols
        Discussion

    8.7 Notes and sources

    9. Visions: Future, past, and present

    9.1 Libraries of the future
        Today's visions
        Tomorrow's visions
        Working inside the digital library

    9.2 Preserving the past
        The problem of preservation
        A tale of preservation in the digital era
        The digital dark ages
        Preservation strategies

    9.3 Generalized documents: A challenge for the present
        Digital libraries of music
        Other media
        Generalized documents in Greenstone
        Digital libraries for oral cultures

    9.4 Notes and sources

    Appendix: Installing and operating Greenstone

    Glossary

    References

    Index

    About the authors

  • Figures

    Figure 1.1 Kataayi's information and communication center.
    Figure 1.2 The Zia Pueblo village.
    Figure 1.3 The New York Public Library.
    Figure 1.4 Rubbing from a stele in Xi'an.
    Figure 1.5 A page of the original Trinity College Library catalog.
    Figure 1.6 The Bibliothèque Nationale de France.
    Figure 1.7 Artist's conception of the Memex, Bush's automated library.
    Figure 1.8 Part of a page from the Book of Kells.
    Figure 1.9 Pages from a palm-leaf manuscript in Thanjavur, India.
    Figure 1.10 Maori toki or ceremonial adze, emblem of the Greenstone project.
    Figure 2.1 Scanning and optical character recognition.
    Figure 2.2 (a) Document image containing different types of data; (b) the document image segmented into different regions.
    Figure 2.3 (a) Double-page spread of a Maori newspaper; (b) enlarged version; (c) OCR text.
    Figure 3.1 Finding a quotation in Alice's Adventures in Wonderland.
    Figure 3.2 Different-looking digital libraries: (a) Kids' Digital Library; (b) School Journal Digital Library.
    Figure 3.3 Village-Level Brickmaking: (a) the book; (b) the chapter on Moulding; (c, d) some of the pages.
    Figure 3.4 Alice's Adventures in Wonderland.
    Figure 3.5 A story from the School Journal collection: (a) Never Shout at a Draft Horse!; (b) with search term highlighted (mock-up).
    Figure 3.6 A historic Maori newspaper: (a) page image; (b) extracted text.
    Figure 3.7 Listening to a tape from the Oral History collection.
    Figure 3.8 Finding Auld Lang Syne in a digital music library.
    Figure 3.9 Foreign-language collections: (a) French; (b) Portuguese interface to an English collection.
    Figure 3.10 Documents from two Chinese collections: (a) rubbings of Tang poetry; (b) classic literature.
    Figure 3.11 An Arabic collection: (a) a document; (b) searching.
    Figure 3.12 Bibliography display.
    Figure 3.13 Metadata examples: (a) bibliography record retrieved from the Library of Congress; (b) description of a BBC television program.
    Figure 3.14 Searching for a quotation: (a) query page; (b) query response.
    Figure 3.15 Choosing search preferences.
    Figure 3.16 Large-query search interface.
    Figure 3.17 Query with history.
    Figure 3.18 Form search: (a) simple; (b) advanced.
    Figure 3.19 Browsing an alphabetical list of titles: (a) plain list; (b) with A-Z tags.
    Figure 3.20...
