Planning a Digital Library
How to Build a Digital Library
Ian H. Witten and David Bainbridge

Planning a Digital Library
- Responsibilities
- Technology to be used
  - Greenstone, DSpace, Fedora, Eprints
- Metadata standard to be used
  - Dublin Core, METS, etc.
- Types of access
- Retrospective or born digital?
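
To make the metadata choice concrete, here is a minimal sketch of a Dublin Core record built with Python's standard library. The `record` wrapper element and the `build_dc_record` helper are assumptions made for illustration; only the `dc` namespace URI and the element names (`title`, `creator`, `type`) come from the Dublin Core element set itself.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(fields):
    """Serialize a dict of Dublin Core element names to lists of values."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, values in fields.items():
        for value in values:
            element = ET.SubElement(record, f"{{{DC_NS}}}{name}")
            element.text = value
    return ET.tostring(record, encoding="unicode")


xml_text = build_dc_record({
    "title": ["How to Build a Digital Library"],
    "creator": ["Ian H. Witten", "David Bainbridge"],
    "type": ["Text"],
})
print(xml_text)
```

Repeating an element, as with `creator` here, is how Dublin Core records multiple values for the same property.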

Responsibilities
- Legal issues
  - Distributing information carries responsibilities
  - Copyright
- Social issues
  - Respect customs of the community
  - Both source and use communities
- Ethical issues

Ideology
- Ideology: a clear conception of what you plan to achieve with the collection of information
- Ideology of a collection: purpose, objectives, principles
  - Guide what is to be included in the collection
  - Placed in the introduction to the digital library

Document versus Work
- Work
  - The disembodied content of a message
  - Pure information
- Document
  - Traditional library: a physical object that embodies the work
  - Digital library: a particular electronic encoding of a work
- How are distinctions made between different manifestations of a single work?
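
One way to picture the distinction is as a data structure in which a single work owns several documents. The sketch below is only an illustration: the class names, field names and sample values are all hypothetical, not taken from the book or from any particular digital library system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    """One concrete electronic encoding of a work (hypothetical fields)."""
    identifier: str
    file_format: str  # e.g. "PDF" or "HTML"
    edition: str      # version, release or revision label


@dataclass
class Work:
    """The abstract content, independent of any particular encoding."""
    title: str
    author: str
    documents: List[Document] = field(default_factory=list)

    def add_document(self, doc: Document) -> None:
        self.documents.append(doc)

    def formats(self) -> List[str]:
        # Distinct formats in which this work is available, sorted.
        return sorted({d.file_format for d in self.documents})


work = Work(title="Example Title", author="Example Author")
work.add_document(Document("doc-001", "PDF", "1st edition"))
work.add_document(Document("doc-002", "HTML", "1st edition"))
work.add_document(Document("doc-003", "PDF", "2nd edition"))
print(work.formats())
```

Here the three documents are manifestations of one work, told apart by format and edition.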

Converting an Existing Library
- Digitizing an existing paper-based collection is the most expensive kind of project
- Consider whether it is worth the effort and expense
  - 16th Century Mexican Library
  - Incunabula
  - Broadsides

Advantages of Digital Libraries
- Easier to access remotely than conventional libraries
- Powerful search and browsing
- Easier to add additional services
- Easier to organize and reorganize
- Easier to maintain?
- Easier to preserve?
- Does your collection have these advantages?

Questions to Address
- Will the digital library coexist with an existing physical one?
- What is the collection’s growth rate?
- How dynamic is the collection?
- Should you consider outsourcing the whole digital library operation?
- Could user needs be satisfied in alternative ways?

Prioritizing Materials
- Special collections and unique materials
  - Rare books and manuscripts
- High-use items
  - Research and teaching materials
- Low-use items

Criteria for Digital Conversion
- Intellectual content
  - Scholarly value
  - Desire to enhance access to information
  - Funding available
- Educational value
  - Classroom support
  - Background reading
  - Distance education
- Institutional
  - Resource sharing
  - Promote strengths of an institution
- Reduce handling of fragile originals
- Cost and space savings

Building a New Collection
- New material
  - The copyright holder may be the best one to create a digital collection
- Metadata
  - Where will it come from?

Bibliographic Entities
- Documents
- Works
  - Distinction between document and work
- Editions
  - Electronic documents use terms such as version, release and revision
- Authors
  - Authority control: standardized names for authors
- Titles
  - Attributes of works

Bibliographic Entities
- Subjects
  - Two approaches to automatically assigning subjects:
    - Key-phrase extraction
    - Key-phrase assignment
  - Literary and artistic works
    - Style, form, content, genre
  - Library of Congress Subject Headings (LCSH)
    - Controlled vocabulary: 30,000 pages, 2,000,000 entries
    - Hierarchical relationship of broader and narrower topics
- Subject classifications
  - Traditional libraries have a linear arrangement
  - A digital collection can be rearranged at the click of a mouse
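
The two automatic approaches differ in where the phrases come from: extraction picks phrases out of the document's own text, while assignment picks headings from a controlled vocabulary. The following is a deliberately naive sketch of that difference. The stopword list, the frequency-based scoring and the tiny vocabulary are all assumptions made for illustration; real systems use far more sophisticated models.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def extract_key_phrases(text, top_n=3):
    """Key-phrase extraction: choose frequent word pairs from the text itself."""
    words = re.findall(r"[a-z]+", text.lower())
    pairs = [
        f"{first} {second}"
        for first, second in zip(words, words[1:])
        if first not in STOPWORDS and second not in STOPWORDS
    ]
    return [phrase for phrase, _ in Counter(pairs).most_common(top_n)]


def assign_key_phrases(text, vocabulary):
    """Key-phrase assignment: choose headings from a controlled vocabulary.

    `vocabulary` maps each heading to cue strings that trigger it.
    """
    lowered = text.lower()
    return [
        heading
        for heading, cues in vocabulary.items()
        if any(cue in lowered for cue in cues)
    ]


text = ("Digital libraries support full-text search. "
        "Digital libraries also preserve rare books.")

# Hypothetical controlled vocabulary for the example.
vocabulary = {
    "Digital libraries": ["digital librar"],
    "Rare books": ["rare book", "incunabula"],
    "Microforms": ["microfilm", "microfiche"],
}

print(extract_key_phrases(text, top_n=1))
print(assign_key_phrases(text, vocabulary))
```

Assignment can only return headings that already exist in the vocabulary, which is what keeps subject terms consistent across a collection.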

Digitizing Documents
- Digitization
  - The process of taking traditional library materials and converting them to electronic form
  - Allows storage and manipulation by a computer
  - The process is time-consuming and expensive

Stages of Digitization
- Scanning
  - Creates a digitized image of each page
  - Usually presented to the user
- Optical Character Recognition (OCR)
  - Creates an encoded representation of the textual content of the pages
  - Necessary for full-text indexing
  - Allows searching
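
To show why OCR text is a prerequisite for searching, here is a minimal sketch of a full-text (inverted) index over page text. The function names and sample pages are hypothetical, and the tokenizer is a simplifying assumption; production indexers handle stemming, ranking and much larger collections.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page numbers whose OCR text contains it."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_number)
    return index


def search(index, query):
    """Return the sorted page numbers that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)


# Hypothetical OCR output, keyed by page number.
pages = {
    1: "Rare books and manuscripts",
    2: "Rare broadsides",
    3: "Books for teaching",
}
index = build_index(pages)
print(search(index, "rare books"))
```

The scanned image alone supports none of this; the index is built entirely from the encoded text that OCR produces.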

Decisions in Scanning
- Black-and-white, grayscale or color
- Resolution
  - Number of pixels per linear unit
- Bits per pixel
  - Monochrome display: 16 or 256 levels of gray
  - Color display: up to 24 or 32 bpp
- Quality
  - Increases storage space and time to access
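
The storage cost of these choices follows from simple arithmetic: an uncompressed page image needs width times height in pixels, times bits per pixel, divided by 8 bytes. The sketch below works this out; the page size (8.5 by 11 inches) and the 300 dpi resolution are assumed example values, not figures from the slides.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Bytes needed to store one uncompressed page image."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed example: a letter-size page scanned at 300 dpi.
for label, bpp in [("black-and-white", 1), ("grayscale", 8), ("color", 24)]:
    size = uncompressed_bytes(8.5, 11, 300, bpp)
    print(f"{label:16s} {bpp:2d} bpp  {size:>10,d} bytes")
```

Under these assumptions a color page takes 24 times the space of a black-and-white one before compression, which is why bit depth and resolution are planning decisions rather than defaults.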

Optical Character Recognition
- Manual cleanup is necessary
- Less efficient than manual keying when accuracy drops below 95 percent
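
The 95 percent rule of thumb can be checked on a sample page. The sketch below reads the figure as a character accuracy threshold, which is an interpretation rather than something the slides define. Measuring accuracy as one minus edit distance over the length of a hand-keyed reference is likewise an assumed metric, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_text):
    """Fraction of the reference text that OCR reproduced correctly."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return max(0.0, 1 - edit_distance(reference, ocr_text) / len(reference))


def prefer_manual_keying(accuracy, threshold=0.95):
    """Apply the rule of thumb: below the threshold, keying by hand wins."""
    return accuracy < threshold


reference = "digital library"
ocr_text = "digita1 1ibrary"  # two characters misread as the digit 1
accuracy = character_accuracy(reference, ocr_text)
print(f"accuracy {accuracy:.3f}, key manually: {prefer_manual_keying(accuracy)}")
```

In practice the reference text would come from manually keying a small sample of pages, so the check itself stays cheap.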

Interactive OCR
- Optical character recognition should be done as an interactive process
- Acquisition
  - Input from scanner or read a file
- Cleanup
  - Filtering, deskewing and manual cleanup of unwanted areas
- Page analysis
  - Examine layout
- Recognition
  - The “OCR” part
- Checking
- Saving
  - Plain text, HTML, RTF, PDF, MS Word

Page Handling
- Unbinding
- Microfiche or microfilm
- Two most expensive parts
  - Handling the paper
  - OCR

Planning a Digitization Project
- Outsourcing
- Cost
  - $1 to $2 for scanning and OCR
- Quality control
- Verification
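
A rough budget range follows directly from the cost figure. The slide does not state the unit, so the sketch below assumes the $1 to $2 applies per page; that assumption, the function name and the 10,000-page collection are all illustrative.

```python
def cost_range(pages, low_per_page=1.0, high_per_page=2.0):
    """Return (low, high) cost estimates in dollars.

    Assumes the quoted $1 to $2 is a per-page rate, which the source
    does not state explicitly.
    """
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * low_per_page, pages * high_per_page


low, high = cost_range(10_000)  # hypothetical collection size
print(f"Estimated cost: ${low:,.0f} to ${high:,.0f}")
```

Quality control and verification add to this figure, so it is best treated as a lower bound when comparing outsourcing quotes.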