Developing Digital Libraries

  • 7/29/2019 Developing Digital Libraries

    1/37

    Developing Digital Libraries:

    General Strategies

    Sukhdev Singh

    Based on From Paper to Collection by Dr. Michel Loots, Dan Camarzan and Ian H. Witten. Human Info NGO, Belgium; Simple Words, Romania; University of Waikato, New Zealand.

    INTRODUCTION

    Typical steps that have to be implemented are:

    i. Selecting the documents to be included
    ii. Securing copyright permission to use these documents in the digital library
    iii. Scanning and OCR of the hard-copy documents that are not available in digital form, to obtain a clean digital version
    iv. Converting all documents to a format that integrates text and images
    v. Tagging the chapters, paragraphs and images of the digital documents
    vi. Organising the collection into an optimally structured digital library
    vii. Building the digital library using the Greenstone software
    viii. Printing and distributing the collection on CD-ROM and/or distributing it over the Internet

    Answer the following before jumping in:

    What is the goal of your collection?
    What is your target group?
    How big is it: local, regional, or global?
    How many documents are you making available?
    How many pages?
    How much graphics content?

    Does the material split into parts that will be consulted by a limited audience and parts that need to be disseminated widely?
    Are the documents already available electronically?
    If so, in which formats? (Note incidentally that PDF files are not automatically equivalent to digital full-text form, as they often contain only page images.)
    What is the copyright status of the

    Who owns the copyright?
    Are there other organizations with the same target audience?
    Are you willing to collaborate with other groups?
    What budget is available for the whole project?
    What human resources are available (in person-months) for co-ordination, editing, scanning and programming?
    How many computers are available for this

    How many CD-ROMs do you want to distribute?

    Will they be free, or for sale?

    SCANNERS AND SCANNING

    The first step in converting paper documents into a digital library collection is to obtain images of all pages of all publications in digital format.

    Scanners

    Scanners are available in all price ranges, and all shapes and sizes. They range from Rs. 5000 for flat-bed scanners to upwards of Rs. 25,00,000 for large industrial scanners from manufacturers such as Bell & Howell.

    The output format of a scanned page is a computer file that is usually stored in TIFF or Bitmap format. TIFF with Group IV compression is the best format to use for black-and-white pages. An average page scanned and converted to this format occupies only about 50 KB, compared to perhaps 2 MB for the equivalent page in uncompressed Bitmap form.
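
The difference in file size adds up quickly over a whole collection. The sketch below estimates total storage from the two per-page figures quoted above; the collection size of 10,000 pages is a hypothetical example, and the conversion assumes 1 MB = 1024 KB.

```python
# Per-page sizes quoted in the text (approximate averages).
TIFF_KB_PER_PAGE = 50            # compressed TIFF
BITMAP_KB_PER_PAGE = 2 * 1024    # about 2 MB uncompressed, taking 1 MB = 1024 KB


def storage_mb(pages: int, kb_per_page: int) -> float:
    """Return the estimated storage in megabytes for a number of scanned pages."""
    return pages * kb_per_page / 1024


if __name__ == "__main__":
    pages = 10_000  # hypothetical collection size
    print(f"Compressed TIFF: {storage_mb(pages, TIFF_KB_PER_PAGE):,.0f} MB")
    print(f"Bitmap:          {storage_mb(pages, BITMAP_KB_PER_PAGE):,.0f} MB")
```

Real pages vary a great deal with print density and resolution, so the result is only a planning figure for disk space.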

    Low-cost flat-bed scanner

    Low-cost flat-bed units are the cheapest and most widely available type of scanner. There are many brands: HP, Agfa, Acer, etc. Prices range from Rs. 5,000 to Rs. 15,000. Both black-and-white and color images can be scanned.

    The low price allows each computer to have its own scanner.

    In practice, rates exceeding twelve pages per hour are rarely achieved.

    These scanners are useful only for small jobs with limited numbers of pages: no more than 200 to 400 pages a month on a regular basis, or one-time jobs of up to 1000 or 2000 pages.

    Low-end scanner with sheet feeder

    Low-end scanners with sheet feeders typically cost between Rs. 25,000 and Rs. 60,000. Ten to fifty pages can be inserted, scanned and processed at once, so the operator does not have to attend constantly to the machine. This increases capacity to 150 to 200 pages per day. These scanners are more robust and have a longer lifespan before repair, usually in the range of 30,000 to 50,000 pages.

    A disadvantage is that only one side of the page is scanned at a time: the stack of pages must be reversed and rescanned in order to obtain an image of both sides. This often causes trouble, because sheet feeders are never entirely reliable and pages sometimes jam. These scanners are useful for up to 1500 to 3000 pages a month.
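
Scanning fronts and backs in two passes leaves the image files out of reading order. A minimal sketch of how the two passes might be merged is shown below. It assumes that flipping the stack makes the back sides come out starting with the last sheet; the function name and the file names in the usage note are hypothetical.

```python
from typing import List


def interleave_simplex_passes(fronts: List[str], backs_reversed: List[str]) -> List[str]:
    """Merge two single-sided passes into reading order.

    fronts:          front sides in order (sheet 1, sheet 2, ...)
    backs_reversed:  back sides as scanned after flipping the stack,
                     assumed to start with the last sheet
    """
    if len(fronts) != len(backs_reversed):
        # A jammed or skipped sheet shows up as a count mismatch.
        raise ValueError("front and back passes have different page counts")
    backs = list(reversed(backs_reversed))
    ordered = []
    for front, back in zip(fronts, backs):
        ordered.extend([front, back])
    return ordered
```

For example, fronts `["p1", "p3", "p5"]` and backs `["p6", "p4", "p2"]` are merged into `p1` through `p6`. The count check is a cheap way to catch a blocked sheet before the files are renamed.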

    Color scanners

    Any scanning operation invariably involves some color images, so a color scanner will always be required. Generally speaking, less than 5% of any publication contains color images, plus the cover. Thus a low-cost flat-bed scanner as described above suffices. It is advisable to select one capable of scanning at up to 600 dpi resolution.

    Professional duplex scanners

    Professional scanners are reliable, heavy-duty machines capable of processing a large volume of pages, typically from 2000 to 10,000 pages per day. They have an automatic sheet-feeder tray system that processes batches of about 50 to 200 pages. The best and fastest are duplex machines that scan both sides of the page at once.

    Professional duplex scanners require a powerful computer with a hard disk of at least 10 to 20 GB. Prices range from Rs. 2,50,000 to Rs. 25,00,000. For example, the Canon DR-6020 duplex scanner costs Rs. 2,50,000 and works with double-sided documents. It has a capacity of about 2000 pages per day and a lifespan of 600,000 to 800,000 pages. Bell & Howell and Fujitsu scanners range from Rs. 5,00,000 to Rs. 25,00,000 and have a lifespan of many millions of pages.
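
The throughput figures quoted for the three scanner classes can be turned into a rough estimate of how long a job will take. The sketch below is only an illustration: the daily rates come from the text, the 8-hour working day used for the flat-bed figure is an assumption, and the function name is hypothetical.

```python
import math
from typing import Tuple

# Daily throughput (slowest, fastest) from the text; the flat-bed figure assumes
# an 8-hour day at the quoted ceiling of twelve pages per hour.
PAGES_PER_DAY = {
    "flat-bed": (12 * 8, 12 * 8),
    "sheet feeder": (150, 200),
    "professional duplex": (2000, 10_000),
}


def working_days(pages: int, scanner: str) -> Tuple[int, int]:
    """Return (best case, worst case) working days needed to scan `pages` pages."""
    slow, fast = PAGES_PER_DAY[scanner]
    return math.ceil(pages / fast), math.ceil(pages / slow)


if __name__ == "__main__":
    for name in PAGES_PER_DAY:
        best, worst = working_days(20_000, name)
        print(f"{name}: {best} to {worst} working days")
```

A 20,000-page job that occupies a flat-bed scanner for most of a year takes a professional duplex machine a couple of weeks at most, which is the kind of gap that drives the purchase or outsourcing decision.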

    Scanning programs

    Every scanner comes with its own software, which must be installed on the computer that manages the scanner. Some scanners also have an interface card that needs to be installed in the computer to speed up the scanning operation.

    Preparing the documents

    Before being scanned, documents must be properly prepared. Dusty documents must be cleaned, humid documents dried, clips removed, pages unfolded. The spine of each book should be removed by cutting it off, straight and precisely. Books provided by libraries must often be rebound, and if so you should be particularly careful when removing spines in order to facilitate smooth rebinding.

    The scanning process

    Using software provided with the scanner, a digital image of each paper page is scanned and transformed into a Bitmap or TIFF image.

    Quality control

    First divide the material into batches with similar paper and print qualities. Perform OCR tests on a sample from the first batch to determine the optimal settings. Then scan all material in this batch before proceeding to the next one.

    Filename conventions

    Follow a suitable file naming convention.
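
The slide does not prescribe a particular convention. As one possible illustration, the sketch below builds and parses names of the form `<document id>_p<4-digit page number>.tif`. The pattern, the function names and the sample document id are all hypothetical.

```python
import re

# Hypothetical convention: <document id>_p<4-digit page number>.tif
_NAME_RE = re.compile(r"(?P<doc>[a-z0-9-]+)_p(?P<page>\d{4})\.tif")


def page_filename(doc_id: str, page: int) -> str:
    """Build the file name for one scanned page."""
    if not re.fullmatch(r"[a-z0-9-]+", doc_id):
        raise ValueError("doc_id may contain only lowercase letters, digits and hyphens")
    if not 1 <= page <= 9999:
        raise ValueError("page must be between 1 and 9999")
    return f"{doc_id}_p{page:04d}.tif"


def parse_page_filename(name: str) -> tuple:
    """Return (doc_id, page) for a name that follows the convention."""
    match = _NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid page file name: {name!r}")
    return match.group("doc"), int(match.group("page"))
```

Calling `page_filename("soil-manual", 7)` gives `soil-manual_p0007.tif`. Zero-padding the page number keeps an alphabetical directory listing in page order, which helps when batches are checked or passed on to OCR.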

    Productivity and resources

    You should not underestimate the magnitude of the scanning operation, and particularly the OCR process that follows. It is best to consider scanning and OCR as completely separate activities. The optimal choice from an economic and practical point of view should be made individually for each one.

    Scanning costs

    An important decision is whether to invest in scanning equipment and perform all scanning oneself, or outsource it to a scanning company. The main considerations are:

    pressure of time for the scanning job;
    total number of pages;
    salary costs of those who perform the scanning.
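
These considerations can be compared with a simple calculation. The sketch below is a rough model, not a costing method from the source: the scanner price and daily capacity are the sheet-feeder figures quoted earlier, while the daily salary and the per-page outsourcing fee are made-up placeholders to be replaced with real quotes.

```python
import math


def in_house_cost(pages, equipment_cost, pages_per_day, daily_salary):
    """Equipment purchase plus operator salary for the working days needed."""
    days = math.ceil(pages / pages_per_day)
    return equipment_cost + days * daily_salary


def outsourced_cost(pages, price_per_page):
    """Flat per-page fee charged by a scanning company."""
    return pages * price_per_page


if __name__ == "__main__":
    pages = 30_000
    # Rs. 25,000 sheet-feeder scanner at 150 pages/day (figures from the text);
    # the salary and the per-page fee are hypothetical placeholders.
    own = in_house_cost(pages, equipment_cost=25_000, pages_per_day=150, daily_salary=500)
    hired = outsourced_cost(pages, price_per_page=3)
    print(f"in-house: Rs. {own:,}   outsourced: Rs. {hired:,}")
```

The model leaves out time pressure: the in-house option above takes 200 working days, which may rule it out regardless of cost.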

    OCR: OPTICAL CHARACTER RECOGNITION

    The following steps are involved in converting paper documents to computer form:

    1. scanning;
    2. page layout analysis;
    3. recognition;
    4. scanning images and tables.

    Many good OCR programs are on the market, with prices ranging from Rs. 5000 to Rs. 20,000. Examples, among many others, are:

    Read-Iris (http://www.readiris.com/)
    Omnipage (http://www.omnipage.com/)
    Fine-Reader (http://www.finereader.com/)

    A choice must be made between undertaking the scanning and OCR in-house or outsourcing it to a commercial organization. To do it in-house requires a scanner, an OCR software program, OCR skill development, and a quality-conscious, highly motivated workforce.

    The OCR process

    The OCR process differs from one OCR program to another, and each one requires a considerable amount of learning. The program's manual will explain this process in detail. Four points deserve particular attention: quality control, tables, images, and specialized material such as formulas, foreign characters, etc.

    Quality control

    Quality control involves four checks.

    The first is performed at the same time as OCR. Every OCR program has a built-in spell-checker that highlights every suspect letter.

    The second is a general check of the text once the OCR process is finished.

    The third is a spelling check using Microsoft Word.

    Finally, the completed document should be checked by an independent person who samples the complete book and checks for errors, problems with tables and images, tagging, and the general look of the resulting text.
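
The spell-checking passes can be supplemented with a simple automated filter over the OCR output. The sketch below is an illustration only and does not reproduce how any particular OCR program works: it flags words that mix letters and digits, a common OCR symptom, and words missing from a supplied word list. The function name and the tiny word list in the usage note are hypothetical.

```python
import re


def suspect_words(text, known_words):
    """Return words that look like OCR errors, in order of first appearance.

    A word is suspect if it mixes letters and digits (e.g. "c0py")
    or if it is purely alphabetic and not in `known_words` (lowercase).
    """
    seen = set()
    flagged = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        lower = token.lower()
        if lower in seen:
            continue  # report each distinct word only once
        seen.add(lower)
        mixed = any(c.isdigit() for c in token) and any(c.isalpha() for c in token)
        if mixed or (token.isalpha() and lower not in known_words):
            flagged.append(token)
    return flagged
```

With the word list `{"the", "scanned", "page", "is", "stored"}`, the text `"The scanned pagc is st0red"` yields `["pagc", "st0red"]`. A filter like this only narrows down where a human should look; it does not replace the independent check of the finished document.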

    Tables

    Images

    Specialized material

    Productivity and resources

    Young people between 18 and 25 are best suited for this job. Because the first hours are always the most productive, the work should either be organized on a part-time basis or only the most motivated and concentrated people should be selected for full-time work.

    Two-thirds of people tend to quit or get bored after about three to five weeks. This translates into poorer quality and low productivity in the last weeks. A regular supply of work is needed to justify the necessary training, to maintain concentration, and to keep spirits high.

    28 Pages / Day
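
The figure of 28 pages per day can be used for rough staffing estimates. The slide does not say what the figure covers, so the sketch below assumes it is the OCR output of one operator per working day; the function names are hypothetical.

```python
import math

PAGES_PER_PERSON_DAY = 28  # figure quoted above; assumed to be per operator


def ocr_person_days(pages: int) -> int:
    """Person-days of OCR work needed for a given number of pages."""
    return math.ceil(pages / PAGES_PER_PERSON_DAY)


def calendar_days(pages: int, operators: int) -> int:
    """Working days needed when the work is shared evenly among operators."""
    return math.ceil(ocr_person_days(pages) / operators)
```

Under that assumption, a 2800-page job needs 100 person-days, or 25 working days with four operators. That is already close to the three to five weeks after which many operators lose motivation.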

    Alternatives to OCR

    Manual retyping

    Image files

    Combining scanning and OCR

    CREATING AN ELECTRONIC COLLECTION

    Meta Data

    Tagging document files