Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Digital Library Collections

There is a distinction between
    BUILDING collections
    DELIVERING information to users
Similar to the ‘compile-time’ versus ‘runtime’ distinction in computer programming
Information structures should usually be prepared in advance

Building a Collection

The Collector
    A subsystem that takes you step by step through building a simple collection
    Conceals details behind the scenes
First locate information on your computer or the Web
    Plain text, HTML, Word, PDF, email file, etc.

Plug-ins

Plug-ins are software modules that handle
    Format conversion
    Metadata extraction
Plug-ins promote extensibility

Greenstone Archive Format

XML-based file format
File format for:
    Documents
    Metadata

Collection Configuration File

Defines the structure of a collection
Governs how the collection is built
Specifies how the collection will appear to users

Greenstone Extended Capabilities

Extending the capabilities of Greenstone
    Plug-ins
        Handle different document and metadata formats
    Classifiers
        Handle different kinds of browsing structures
    Format statements and macros
        Govern the user interface content and appearance

Why Greenstone?

Benefits of Greenstone

General system for constructing and presenting digital collections
Handles millions of documents, text, images, audio, video
User interfaces identical in Web-based and CD-ROM versions
Installs on Windows and Linux
Access locally or remotely using a web browser

Organization of Collections

Each collection can be organized differently:
    Format of source documents
    Metadata
    Directory structure
    Document structure
    Searching and browsing services
    Presentation
    Auxiliary services

Variation of Source Format

Source documents can be supplied in:
    Plain text
    HTML
    PostScript
    PDF
    Word
    E-mail
    Other file types
    Images
    Video
    Audio

Variation of Metadata

Different types of metadata
Metadata can be supplied differently
    ‘Fields’ in MS Word
    <meta> tags in HTML
    Information coded into file names and directories
    Spreadsheet or other data file
    Explicit metadata format like MARC

Variation of Directory Structure

Collections can vary in the directory structure in which the information is located

Variation of Document Structure

Document structure
    Flat
    Divided sequentially into pages
    Hierarchical organization
        Title or other metadata available at each level

Variation of Services

Searching
    Metadata
    Indexes
    Hierarchical levels
Browsing
    Metadata
    Browser type

Variation of Presentation

Results can be presented to users in various ways:
    Format that target documents are shown in
    Search results page
    Metadata browsers
    Interface language

Variation of Auxiliary Services

A collection may require additional services
    User logging
    Etc.

Collection Configuration File

Allows variation
A digital library collection is made by
    Gathering raw material
    Designing the collection
    Putting design information about the structure and presentation of the collection in the collection configuration file

Front Page of Collection

Statement of collection’s purpose
Statement of collection’s coverage
Explanation of how collection is organized

Searching Involves Indexes

Searching is provided by indexes built from different parts of the documents
    Entire documents
    Paragraphs
    Titles
    Sections
    Section headings
    Figure captions

Indexes

Indexes can be created automatically using
    Documents
    Supporting files
Indexes can be rebuilt automatically
    New document in the same format becomes available
    Process can awake, check for new material, and rebuild the indexes
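
The "awake, check for new material, rebuild" cycle can be sketched in Python. This is only an illustration and not how Greenstone itself detects changes: the function names, the use of file modification times, and the rebuild callback are all assumptions made for the example.

```python
import os
import tempfile
from pathlib import Path


def new_material(import_dir: Path, last_build: float) -> list[Path]:
    """Return files under import_dir modified after the last build time."""
    return sorted(
        p for p in import_dir.rglob("*")
        if p.is_file() and p.stat().st_mtime > last_build
    )


def rebuild_if_needed(import_dir: Path, last_build: float, rebuild) -> bool:
    """Call rebuild(changed_files) only when new material has appeared."""
    changed = new_material(import_dir, last_build)
    if changed:
        rebuild(changed)
        return True
    return False


# Demo: one file older than the last build, one newer (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    old = root / "old.txt"
    new = root / "sub" / "new.txt"
    new.parent.mkdir()
    old.write_text("already indexed")
    new.write_text("just arrived")
    last_build = 1_000_000_000.0
    os.utime(old, (last_build - 60, last_build - 60))
    os.utime(new, (last_build + 60, last_build + 60))
    rebuilt = []  # stands in for the real index-building step
    did_rebuild = rebuild_if_needed(root, last_build, rebuilt.extend)
    changed_names = [p.name for p in rebuilt]
```

A scheduler could call rebuild_if_needed periodically and record the time of each successful build.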

Plug-ins for Indexing

Source documents are converted into standard XML form for indexing using plug-ins
Standard plug-ins process
    Plain text
    HTML
    Word
    PDF
    Usenet and email messages
New plug-ins can be written for other document types

Browsing Involves Lists

Browsing involves lists that can be examined by the user
    Authors
    Titles
    Dates
    Hierarchical classification structures

Classifier Modules

Modules called classifiers are used to create browsers and build browsing structures from metadata
    Scrollable lists
    Alphabetic selectors
    Dates
    Hierarchies
Programmers can write new classifiers to create novel browsing capabilities
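
To make the idea of a classifier concrete, here is a minimal Python sketch of an alphabetic selector built from Title metadata. It is not Greenstone's classifier interface (Greenstone classifiers are separate modules with their own API); the function name, the dictionary-based documents, and the single bucket for titles that do not start with a letter are assumptions for illustration.

```python
from collections import defaultdict


def alphabetic_classifier(documents, field="Title"):
    """Group documents into buckets keyed by the first letter of a metadata field."""
    buckets = defaultdict(list)
    for doc in documents:
        value = doc.get(field, "").strip()
        if not value:
            continue  # documents lacking the metadata are left out of this browser
        first = value[0].upper()
        # Assumption: everything that does not start with a letter shares one bucket.
        key = first if first.isalpha() else "0-9"
        buckets[key].append(value)
    return {key: sorted(values, key=str.lower) for key, values in sorted(buckets.items())}


docs = [
    {"Title": "Metadata Basics", "Author": "Lee"},
    {"Title": "digital libraries", "Author": "Ng"},
    {"Title": "Data Formats", "Author": "Ortiz"},
    {"Author": "Untitled Author"},
]
browser = alphabetic_classifier(docs)
```

Passing field="Author" would produce an author browser from the same documents.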

Search Terms

Search terms in Greenstone:
    Alphabetic characters
    Digits
Separated by white space
Punctuation acts as white space
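
These term rules can be sketched in a few lines of Python. This is an illustration rather than Greenstone's own tokenizer: the function name search_terms is hypothetical, and accepting any Unicode letter or digit as part of a term is an assumption.

```python
import re

# A term is a run of letters and digits; everything else separates terms.
TERM = re.compile(r"[^\W_]+")


def search_terms(query: str) -> list[str]:
    """Split a query into terms, treating punctuation like white space."""
    return TERM.findall(query)


example = search_terms("digital-library, 2nd ed.")
```

Because punctuation is treated as a separator, "digital-library" yields two terms.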

Two Types of Queries

Query for ALL of the words
    Boolean AND
Query for SOME of the words
    Ranked
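
The difference between the two query types can be shown with a small Python sketch. The scoring used for the ranked query (the number of distinct query words a document contains) is a deliberately simple stand-in, not Greenstone's ranking function, and the lower-casing of terms is also an assumption.

```python
import re


def terms(text):
    """Lower-cased terms: runs of letters and digits."""
    return [t.lower() for t in re.findall(r"[^\W_]+", text)]


def query_all(docs, query):
    """Boolean AND: documents containing every query word."""
    wanted = set(terms(query))
    return sorted(name for name, text in docs.items() if wanted <= set(terms(text)))


def query_some(docs, query):
    """Ranked: documents containing at least one query word, best match first."""
    wanted = set(terms(query))
    scored = []
    for name, text in docs.items():
        score = len(wanted & set(terms(text)))
        if score:
            scored.append((-score, name))  # negative score sorts best first
    return [name for _, name in sorted(scored)]


docs = {
    "a.txt": "Building a digital library collection",
    "b.txt": "Digital audio and video",
    "c.txt": "Library opening hours",
    "d.txt": "Plain text files",
}
all_hits = query_all(docs, "digital library")
some_hits = query_some(docs, "digital library")
```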

Indexes to Search

In most collections, you can choose different indexes to search
Examples:
    Author and title indexes
    Chapter and paragraph indexes
Usually the full matching document is returned regardless of index searched
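
A rough Python sketch of searching different indexes over the same documents follows. The index names, the in-memory dictionaries, and the function names are hypothetical; Greenstone builds its indexes on disk with its own indexing software. The point illustrated is that whichever index matches, the whole document is returned.

```python
import re
from collections import defaultdict


def words(text):
    return {w.lower() for w in re.findall(r"[^\W_]+", text)}


def build_indexes(documents):
    """Build one inverted index per document part: 'title' and 'paragraph'."""
    indexes = {"title": defaultdict(set), "paragraph": defaultdict(set)}
    for doc_id, doc in documents.items():
        for w in words(doc["title"]):
            indexes["title"][w].add(doc_id)
        for para in doc["paragraphs"]:
            for w in words(para):
                indexes["paragraph"][w].add(doc_id)
    return indexes


def search(documents, indexes, index_name, word):
    """Look a word up in the chosen index and return the full matching documents."""
    hits = indexes[index_name].get(word.lower(), set())
    return [documents[doc_id] for doc_id in sorted(hits)]


documents = {
    "d1": {"title": "Greenstone plug-ins",
           "paragraphs": ["Plug-ins convert formats.", "They extract metadata."]},
    "d2": {"title": "Browsing metadata",
           "paragraphs": ["Classifiers build lists."]},
}
indexes = build_indexes(documents)
title_hits = search(documents, indexes, "title", "metadata")
para_hits = search(documents, indexes, "paragraph", "metadata")
```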

Preferences Page

Allows advanced control over search operation:
    Case-folding and stemming
    Advanced query mode where users specify Boolean operators
    Large-query interface
    Display search history

Preferences Page

Specify subcollections to be included in searches
Specify presentation language
Customize interface
    Textual vs. standard interface
    Suppress navigation bar
    Suppress alert system

Using the Collector

The Greenstone Collector

Easiest way to build a simple collection
The Collector allows you to:
    Create a new collection
    Modify or add to an existing collection
    Delete a collection

Starting the Collector

Click the Collector link from the default Greenstone home page
Log in
    When Greenstone is installed, an account called admin is set up with a password chosen during installation
The Collector works through a standard web interface

Creating a New Collection

The Collector’s main purpose is to build a new collection
Structure of a collection is determined when the collection is set up
Simplest to copy the structure of an existing collection and then edit it

Collection Building Steps

1. Collection Information
2. Source Data
3. Configuration
4. Building
5. Viewing

Collection Building Steps

☐ Collection Information
☐ Source Data
☐ Configuration
☐ Building
☐ Viewing

1. Collection Information

Give the collection a name and provide associated information
    Title
        Short phrase used to identify the collection within the digital library
    Contact e-mail address
    Brief description
        Sets out the principles that govern what is included in the collection

Collection Building Steps

☑ Collection Information
☐ Source Data
☐ Configuration
☐ Building
☐ Viewing

2. Source Data

Specify the location of the sources
    Clone existing collection
        Specify on a pull-down menu the existing collection
    Create a completely new collection

2. Source Data

In the provided boxes, indicate where source documents are located
Specification of sources
    file://
    http://
    ftp://

file://

File name on the Greenstone server system
    That file will be included in collection
Directory name on the Greenstone server
    Everything in the folder and its subfolders will be included

http://

Web page
    The web page will be downloaded
    All pages it links to (and all pages they link to) that reside on the same site, below the URL, will also be downloaded
URL that leads to a list of files
    Everything in the folder and its subfolders will be included in collection

ftp://

File to be downloaded using FTP
Directory name on the FTP server
    Downloads everything in the folder and its subfolders
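
As an illustration of the three kinds of source specification (file://, http://, ftp://), the following Python sketch sorts a specification by its scheme. The function name classify_source and the example locations are hypothetical; the Collector's own handling of these prefixes is not shown in the slides.

```python
from urllib.parse import urlparse

SUPPORTED = {"file", "http", "ftp"}


def classify_source(spec: str) -> tuple[str, str]:
    """Return (scheme, location) for a file://, http:// or ftp:// specification."""
    parts = urlparse(spec)
    if parts.scheme not in SUPPORTED:
        raise ValueError(f"unsupported source specification: {spec!r}")
    # A file:// source is a path on the server; the others include a host name.
    location = parts.path if parts.scheme == "file" else parts.netloc + parts.path
    return parts.scheme, location


sources = [
    "file:///home/docs/reports",
    "http://example.org/papers/",
    "ftp://ftp.example.org/pub/a.pdf",
]
classified = [classify_source(s) for s in sources]
```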

Collection Building Steps

☑ Collection Information
☑ Source Data
☐ Configuration
☐ Building
☐ Viewing

3. Configuration

This step can be bypassed
Allows adjustment of configuration options
The construction and presentation of all collections are controlled by specifications in a special collection configuration file

Collection Building Steps

☑ Collection Information
☑ Source Data
☑ Configuration
☐ Building
☐ Viewing

4. Building

The computer does the work of the building process
Indexes are built:
    For browsing
    For searching
    Following specifications in the collection configuration file
Status line shows progress
Warnings shown if files can’t be found

Collection Building Steps

☑ Collection Information
☑ Source Data
☑ Configuration
☑ Building
☐ Viewing

5. Viewing

View the collection that has just been created
E-mail can be sent to the collection’s contact address
    Must enable by editing main.cfg configuration file

Working with Existing Collections

Add more material and rebuild the collection
Edit the configuration file to modify the collection’s structure
Delete the collection
Put the collection on CD-ROM

Adding Material to a Collection

Do not re-specify files that are already in the collection
    Files would be included twice
If the building process fails, the old version remains unchanged
Structure of collection can be changed
    Edit the configuration file
        May add plug-ins or an option to a plug-in

Plug-ins & Document Formats

Plug-ins are specified in the collection configuration file
File name determines document format
Widely used document formats:
    TEXTPlug
    HTMLPlug
    WORDPlug
    PDFPlug
    PSPlug
    EMAILPlug
    ZIPPlug

Text Files

TEXTPlug plug-in
    *.txt
    *.text
Plain text file
Title metadata based on the first line of the file
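
A minimal Python sketch of deriving Title metadata from the first line of a plain text file is shown below. It is not TEXTPlug's code: skipping leading blank lines and truncating to max_length characters are assumptions chosen for the example, and the function name is hypothetical.

```python
def title_from_text(text: str, max_length: int = 100) -> str:
    """Use the first non-empty line of a plain text file as its Title metadata."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            return line[:max_length]
    return ""  # an empty file yields no title


title = title_from_text("\n  Annual Report 2002  \nThe year in review...")
```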

HTML Files

HTMLPlug plug-in
    *.htm
    *.html
    *.shtml
    *.shm
    *.asp
    *.php
    *.cgi

HTML Files

HTMLPlug plug-in
    Imports HTML files
    Title metadata extracted from the HTML <title> tag
    Other HTML <meta> tag data can be extracted
    Parses and processes any links in the file
    Links to other files in the collection are trapped and replaced by references to the document
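
The extraction of Title metadata from <title> and of further metadata from <meta> tags can be sketched with Python's standard html.parser module. This is an illustration only, not HTMLPlug's implementation; the class and function names are hypothetical, and link processing is left out.

```python
from html.parser import HTMLParser


class MetadataExtractor(HTMLParser):
    """Collect the <title> text and name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.metadata[attrs["name"]] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
            # Collapse runs of white space inside the title.
            self.metadata["Title"] = " ".join("".join(self._title_parts).split())

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_metadata(html: str) -> dict:
    parser = MetadataExtractor()
    parser.feed(html)
    parser.close()
    return parser.metadata


page = (
    "<html><head><title> Greenstone\n Guide </title>"
    '<meta name="Author" content="I. Witten">'
    '<meta name="Keywords" content="digital library">'
    "</head><body><p>Hello</p></body></html>"
)
metadata = extract_metadata(page)
```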

HTML Files

file_is_url
    Optional switch within the HTML plug-in
    Causes URL metadata to be inserted into each document, based on the file-name convention that is adopted by the mirroring package. The collection uses this metadata to allow readers to refer to the original source material rather than a local copy

Microsoft Word Files

WORDPlug plug-in
    *.doc
Imports Microsoft Word documents
Greenstone uses independent programs to convert Word files to HTML
    Many variants on the Word format
    Older Word formats use a simple text string extraction

PDF Files

PDFPlug plug-in
    *.pdf
Imports PDF files
    Adobe’s Portable Document Format
Greenstone uses independent programs to convert PDF files to HTML

PostScript Files

PSPlug plug-in
    *.ps
Imports PostScript files
Works best when a standard conversion program is already installed on the computer
Uses simple text extraction algorithm if no conversion program is present

Email Files

EMAILPlug
    *.email
Imports files containing email
    Each source is checked for e-mail contents
Extracts metadata:
    Subject
    To
    From
    Date
Deals with common formats
    Netscape, Eudora, Unix mail readers
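
The four metadata fields listed above can be pulled out of a message with Python's standard email package, as in this sketch. It illustrates the idea only and does not reproduce EMAILPlug, which also has to cope with several mailbox formats; the function name and the sample message are made up.

```python
from email import message_from_string

FIELDS = ("Subject", "To", "From", "Date")


def email_metadata(raw: str) -> dict:
    """Return the Subject, To, From and Date headers that are present."""
    message = message_from_string(raw)
    return {field: message[field] for field in FIELDS if message[field] is not None}


raw_message = "\n".join([
    "From: alice@example.org",
    "To: bob@example.org",
    "Subject: Build finished",
    "Date: Mon, 03 Mar 2003 10:15:00 +0000",
    "",
    "The collection was rebuilt.",
])
meta = email_metadata(raw_message)
```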

Compressed & Archived Files

ZIPPlug plug-in
    *.zip
    *.tar
    *.gz
    *.z
    *.tgz
    *.bz
Relies on standard utility programs being present
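
Since the file name determines the document format, the plug-in slides above amount to a mapping from file extension to plug-in. The Python sketch below restates that mapping using the plug-in names and extensions from the slides. The lookup function is hypothetical, and the first-match rule is an assumption rather than a description of how Greenstone's plug-in pipeline works.

```python
from pathlib import PurePath

# Plug-in names and extensions as listed on the slides.
PLUGIN_EXTENSIONS = {
    "TEXTPlug": (".txt", ".text"),
    "HTMLPlug": (".htm", ".html", ".shtml", ".shm", ".asp", ".php", ".cgi"),
    "WORDPlug": (".doc",),
    "PDFPlug": (".pdf",),
    "PSPlug": (".ps",),
    "EMAILPlug": (".email",),
    "ZIPPlug": (".zip", ".tar", ".gz", ".z", ".tgz", ".bz"),
}


def plugin_for(filename, plugins=PLUGIN_EXTENSIONS):
    """Return the name of the first plug-in that claims the file's extension."""
    suffix = PurePath(filename).suffix.lower()
    for name, extensions in plugins.items():
        if suffix in extensions:
            return name
    return None  # no plug-in recognizes this file


chosen = {name: plugin_for(name) for name in ("notes.text", "report.PDF", "site.tar.gz", "README")}
```

Adding support for a new format then means adding one more entry to the mapping, which mirrors the extensibility argument made for plug-ins.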

Building Collections Manually

Building a Collection

Building a collection:
    The process of taking a set of documents and metadata information and creating all the indexes and data structures that support the searching, browsing, and viewing operations that the collection offers

Building a Collection

Four phases in building a collection
    Make
        Make a skeleton framework structure to contain the collection
    Import
        Import the documents and metadata, convert to a Greenstone standard form
    Build
        Build the required indexes and data structures
    Install
        Make the collection operational

Building Collections Manually

☐ Getting Started
☐ Making a framework for the collection
☐ Importing the documents
☐ Building the indexes
☐ Installing the collection

Getting Started

Locate the command prompt
Go to the directory where Greenstone was installed
    cd "C:\Program Files\gsdl"
Tell system where to find Greenstone files
    setup.bat
        Sets the variable GSDLHOME to the Greenstone home directory
To return later
    cd "%GSDLHOME%"

Building Collections Manually

☑ Getting Started
☐ Making a framework for the collection
☐ Importing the documents
☐ Building the indexes
☐ Installing the collection

Make a framework for the collection

Use the Perl program mkcol.pl to ‘make a collection’
Get description of usage and arguments
    perl -S mkcol.pl
    mkcol.pl
        May leave off first part if system recognizes that .pl files are associated with Perl

Make a framework for the collection

perl -S mkcol.pl -creator emailAddress collectionName

Page 69: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Make a framework for the collection

Examine the file structure
  cd "%GSDLHOME%\collect\collectionName"
List directory contents
  dir
Seven subdirectories are created:
  archives
  building
  etc (contains collect.cfg file)
  images
  import
  index
  perllib
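The directory check above can also be scripted. The following is a minimal sketch in Python (not part of Greenstone): the helper names `collection_dir` and `missing_subdirs` are hypothetical, and the only facts taken from the slide are the `collect` directory under GSDLHOME and the seven subdirectory names.

```python
import os
import tempfile

# The seven subdirectories listed on the slide.
EXPECTED_SUBDIRS = ["archives", "building", "etc", "images", "import", "index", "perllib"]


def collection_dir(gsdlhome, collection_name):
    """Return the collection's directory under the Greenstone home directory."""
    return os.path.join(gsdlhome, "collect", collection_name)


def missing_subdirs(path):
    """List the expected subdirectories that do not exist under path."""
    return [d for d in EXPECTED_SUBDIRS if not os.path.isdir(os.path.join(path, d))]


if __name__ == "__main__":
    # Demo against a throwaway directory standing in for GSDLHOME.
    with tempfile.TemporaryDirectory() as home:
        coll = collection_dir(home, "collectionName")
        for d in EXPECTED_SUBDIRS[:-1]:  # deliberately leave out perllib
            os.makedirs(os.path.join(coll, d))
        print("missing:", missing_subdirs(coll))
```

In real use the first argument would come from the GSDLHOME variable that setup.bat sets, for example `os.environ["GSDLHOME"]`.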

Page 70: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Make a framework for the collection

collect.cfg File
  emailAddress placed in the creator and maintainer lines
  collectionName placed in collection-meta lines
  Plug-ins are inserted

Page 71: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Building Collections Manually

☑ Getting Started
☑ Making a framework for the collection
☐ Importing the documents
☐ Building the indexes
☐ Installing the collection

Page 72: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Importing the documents

The collection's import directory should contain the source material
Drag the directory containing the source material into the import directory
You may drag several source directories and hierarchies

Page 73: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Importing the documents

The import process:
  Brings documents into the Greenstone system
  Standardizes document format (the way that metadata is specified)
  Standardizes the file structure (that contains the documents)

Page 74: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Importing the documents

To get a list of options for the import program:
  perl -S import.pl
The basic import command is:
  perl -S import.pl collectionName

Page 75: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Importing the documents

You may be in any directory when the import command is issued
  The software works by knowing the collection's name and the Greenstone home directory
Warnings may appear
  When files are found without corresponding plug-ins
  These files will be ignored

Page 76: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Building Collections Manually

☑ Getting Started
☑ Making a framework for the collection
☑ Importing the documents
☐ Building the indexes
☐ Installing the collection

Page 77: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Building the indexes

Use the program buildcol.pl

Page 78: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Building the indexes

Modify the collect.cfg file to customize the collection's appearance
  collectionname
    Web browsers receive this name as the title of the collection's front page
  collectionextra
    Description of the collection
    Appears under "About this collection" on the collection's home page
    Enter as a single line in the editor

Page 79: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Building the indexes

Modify the collect.cfg file to customize the collection's appearance
  iconcollection
    Give the collection an icon image
    Put the location of the image between quotes
    If absent, the collection's name will be used
    Use _httpprefix_ as a shorthand way of beginning any URL that points within the Greenstone file area
    Example: _httpprefix_/collect/collectionName/images/icon.gif
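The three settings above can be pictured as lines in collect.cfg. The sketch below, in Python, prints what such lines might look like. It assumes each setting is written as a `collectionmeta` line with the value in double quotes, which the slides imply but do not show; the function name and the sample values are hypothetical.

```python
def collectionmeta_lines(collection_dir_name, title, description, icon_file=None):
    """Build collect.cfg lines for collectionname, collectionextra and iconcollection.

    Assumption: each setting is a line of the form
        collectionmeta <name> "<value>"
    """
    # collectionextra must be entered as a single line, so collapse whitespace.
    one_line_description = " ".join(description.split())
    lines = [
        'collectionmeta collectionname "%s"' % title,
        'collectionmeta collectionextra "%s"' % one_line_description,
    ]
    if icon_file:
        # _httpprefix_ stands for the start of any URL inside the Greenstone file area.
        icon_url = "_httpprefix_/collect/%s/images/%s" % (collection_dir_name, icon_file)
        lines.append('collectionmeta iconcollection "%s"' % icon_url)
    return lines


if __name__ == "__main__":
    for line in collectionmeta_lines(
        "collectionName",
        "My collection",
        "A short description\nthat was typed on two lines.",
        icon_file="icon.gif",
    ):
        print(line)
```

When no icon file is given, no iconcollection line is produced, which matches the slide: the collection's name is used instead.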

Page 80: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Building the indexes

To get a list of options for the build program:
  perl -S buildcol.pl
The basic build command is:
  perl -S buildcol.pl collectionName

Page 81: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Building the indexes

The building process takes about a minute on small collections and can take much longer for very large collections
You may ignore most warning messages
Serious problems will cause the program to terminate

Page 82: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Building Collections Manually

☑ Getting Started
☑ Making a framework for the collection
☑ Importing the documents
☑ Building the indexes
☐ Installing the collection

Page 83: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Installing the collection

Building is done in the building directory
Collection must be moved to the index directory before users can see it
  Drag contents of the building directory to the index directory
  If index already contains files, remove them first
Forgetting to move the contents of building to index is a common mistake
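Because the move from building to index is easy to forget, it can be scripted. This is a minimal sketch in Python of the drag-and-drop step described above, not a Greenstone tool; the function name `install_collection` is hypothetical.

```python
import os
import shutil
import tempfile


def install_collection(collection_path):
    """Move the contents of building into index, clearing index first."""
    building = os.path.join(collection_path, "building")
    index = os.path.join(collection_path, "index")

    entries = os.listdir(building)
    if not entries:
        raise RuntimeError("building is empty; run the build step first")

    # If index already contains files, remove them first.
    if os.path.isdir(index):
        shutil.rmtree(index)
    os.makedirs(index)

    for entry in entries:
        shutil.move(os.path.join(building, entry), os.path.join(index, entry))
    return sorted(os.listdir(index))


if __name__ == "__main__":
    # Demo on a throwaway collection directory with placeholder file names.
    with tempfile.TemporaryDirectory() as coll:
        os.makedirs(os.path.join(coll, "building"))
        os.makedirs(os.path.join(coll, "index"))
        open(os.path.join(coll, "building", "new-file"), "w").close()
        open(os.path.join(coll, "index", "old-file"), "w").close()
        print(install_collection(coll))
```

The check for an empty building directory is a design choice of this sketch: it avoids wiping a working index when the build step has not been run.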

Page 84: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Installing the collection

To view the newly built collection:
  Restart Greenstone
    If using the Local Library version
  Reload Greenstone Home Page
    If using the Web version

Page 85: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Importing and Building

Page 86: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

General Information

Two Main Parts to Collection Building:
  Importing (import.pl)
  Building (buildcol.pl)

Page 87: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Files and Directories

Page 88: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Collection Specific Directories

GSDLHOME
  collect – all the digital library collections
    collectionName – directory of collection
      import – original source material
      archives – result of import process
      building – temporary, contents manually moved to index
      index – bulk of info served to users
        (import, archives and building can be deleted)
      etc – contains collect.cfg file
      images – icons used for the collection
      perllib – Perl programs specific to collection

Page 89: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Other Greenstone Directories

GSDLHOME
  lib – common software for both the collection server and receptionist
  bin – programs used for building process
    script – Perl programs used (mkcol.pl, import.pl, buildcol.pl)
  perllib – Perl modules
    plugins – Perl plugins
    classify – Perl classifiers
  cgi-bin – Greenstone runtime system (absent in Local Library version)
  src – source code in C++
    colservr – the collection server
    recpt – the receptionist

Page 90: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Other Greenstone Directories

GSDLHOME
  packages – source code for external software packages used by Greenstone
    (indexing and compression program, database manager program, etc.)
    (each package is stored in a directory of its own with a readme file)
  bin – executables
  mappings – Unicode translation tables
  etc – configuration files for the entire system, initialization and error logs, user authorization database
  images – user interface images and icons
  macros – small code fragments that drive the user interface
  tmp – temporary files
  docs – documentation for the system

Page 91: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Object Identifiers

Document's permanent name in the system
  Remain the same when collection rebuilt
  Assigned by the import process
  Stored as an attribute in the document archive file
  Character strings starting with the letters HASH (HASH0109d3850a6de440c4d1ca2)
  Used to name directory where archive file is stored
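One way to get an identifier that stays the same across rebuilds is to derive it from the document's content. The Python sketch below illustrates that idea only. The slides do not say how Greenstone computes its identifiers, so the use of MD5, the 24-character length and the function name `make_oid` are all assumptions of this example.

```python
import hashlib


def make_oid(content, length=24):
    """Derive a HASH-prefixed identifier from a document's bytes.

    Assumption: MD5 and the truncation length are illustrative choices,
    not Greenstone's documented algorithm.
    """
    digest = hashlib.md5(content).hexdigest()
    return "HASH" + digest[:length]


if __name__ == "__main__":
    first = make_oid(b"Freshwater Resources in Arid Lands")
    again = make_oid(b"Freshwater Resources in Arid Lands")
    other = make_oid(b"A different document")
    print(first == again)  # same content, same identifier
    print(first == other)  # different content, different identifier
```

Deriving the identifier from content rather than from an import counter is what makes it independent of the order in which documents are processed.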

Page 92: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Plug-Ins

Plug-ins do most of the work of the import process
Operate in the order in which they are listed in the collect.cfg file
  Input file is passed to each plug-in until one is found that can process it
  If there is no plug-in that can process a file, a warning is printed
Plug-ins determine the traversal of the subdirectory structure in the import directory
  RecPlug – processes directories, recurses through directory structures and passes the name through the plug-in list
  GAPlug – processes Greenstone Archive Format documents (in the archives directory structure)
  ArcPlug – used during building, processes list of document OIDs produced during import (list is stored in archives.inf file)
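The plug-in behaviour described above (try each plug-in in order, warn when none accepts a file, let a directory plug-in recurse) can be sketched in a few lines of Python. Greenstone's plug-ins are Perl modules, so this is an illustration of the control flow only; the class names `TextPlugSketch`, `RecPlugSketch` and `PluginList` are hypothetical.

```python
import os
import tempfile


class TextPlugSketch:
    """Hypothetical plug-in that accepts .txt files."""

    def can_process(self, path):
        return os.path.isfile(path) and path.endswith(".txt")

    def process(self, path, plugin_list):
        return [path]


class RecPlugSketch:
    """Stand-in for RecPlug: accepts directories and recurses into them."""

    def can_process(self, path):
        return os.path.isdir(path)

    def process(self, path, plugin_list):
        processed = []
        for name in sorted(os.listdir(path)):
            # Each entry goes back through the whole plug-in list.
            processed.extend(plugin_list.run(os.path.join(path, name)))
        return processed


class PluginList:
    """Passes a file to each plug-in in order until one can process it."""

    def __init__(self, plugins):
        self.plugins = plugins
        self.warnings = []

    def run(self, path):
        for plugin in self.plugins:
            if plugin.can_process(path):
                return plugin.process(path, self)
        self.warnings.append("no plug-in can process " + path)
        return []


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as import_dir:
        os.makedirs(os.path.join(import_dir, "sub"))
        for rel in ("a.txt", os.path.join("sub", "b.txt"), "c.xyz"):
            open(os.path.join(import_dir, rel), "w").close()
        plugins = PluginList([TextPlugSketch(), RecPlugSketch()])
        print(len(plugins.run(import_dir)), "processed;", len(plugins.warnings), "warning")
```

Files with no matching plug-in are skipped after a warning, which mirrors the import behaviour described earlier in the slides.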

Page 93: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

The Import Process

Page 94: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

The Import Process

Brings documents and metadata into the system in a standardized XML form
Original material placed in import directory
Import process transforms it to files in the archives directory
  The original material can be deleted
  Collection can be rebuilt from archive files
New material added to collection by placing it in import directory and re-executing the import process
  The new material finds its way into archives along with existing files
To keep the source form of collections
  Do not delete the archives
  "Source" form can be augmented and rebuilt later

Page 95: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

The Build Process

Page 96: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

The Build Process

Creates the indexes and data structures that make the collection operational
Indexes for the whole collection are built all at once
  Build process does not work incrementally
  Adding new material to archives requires that entire collection be rebuilt (by issuing buildcol.pl)
Most collections can be rebuilt overnight

Page 97: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Options for Import and Build

Page 98: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Additional Options for Import

Page 99: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Additional Options for Build

Page 100: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Options for Import and Build

To see options for any Greenstone script, type its name at the command prompt
Options for Import and Build help with debugging (see Table 6.5 on page 310):
  verbosity
  archivedir
  maxdocs
  collectdir
  out
  keepold
  debug
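A small helper can assemble these command lines when the import and build steps are driven from a script. The Python sketch below only builds the argument list and does not run anything. It assumes options are written as `-name value` before the collection name, in the same way as `-creator` in the mkcol.pl command; which options take a value is also an assumption here, so check the script's own usage message. The function name `greenstone_command` is hypothetical.

```python
def greenstone_command(script, collection, **options):
    """Build an argument list such as: perl -S import.pl -maxdocs 10 collectionName.

    Pass option=True for an option that takes no value.
    """
    command = ["perl", "-S", script]
    for name, value in options.items():
        command.append("-" + name)
        if value is not True:
            command.append(str(value))
    command.append(collection)
    return command


if __name__ == "__main__":
    # Import only a few documents with extra output, then build.
    print(" ".join(greenstone_command("import.pl", "collectionName", maxdocs=10, verbosity=3)))
    print(" ".join(greenstone_command("buildcol.pl", "collectionName")))
```

The resulting list can be handed to `subprocess.run` once GSDLHOME has been set up, since it is already split into separate arguments.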

Page 101: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Greenstone Archive Documents

Page 102: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Greenstone Archive Format

<!DOCTYPE GreenstoneArchive [
  <!ELEMENT Section (Description,Content,Section*)>
  <!ELEMENT Description (Metadata*)>
  <!ELEMENT Content (#PCDATA)>
  <!ELEMENT Metadata (#PCDATA)>
  <!ATTLIST Metadata name CDATA #REQUIRED>
]>
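The DTD above says that a Section holds a Description, a Content and any number of nested Sections. As a rough illustration, the Python sketch below walks such a structure and collects the Title metadata at each level. The sample document is invented for this example (only the first title comes from the slides), and the function name `section_titles` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample that follows the Section/Description/Content/Section* shape.
SAMPLE = """<Section>
  <Description>
    <Metadata name="Title">Freshwater Resources in Arid Lands</Metadata>
  </Description>
  <Content>Text of the top-level section</Content>
  <Section>
    <Description>
      <Metadata name="Title">A nested section</Metadata>
    </Description>
    <Content>Text of the nested section</Content>
  </Section>
</Section>"""


def section_titles(section, depth=0):
    """Return (depth, title) pairs for a Section and all nested Sections."""
    titles = []
    description = section.find("Description")
    for metadata in description.findall("Metadata"):
        if metadata.get("name") == "Title":
            titles.append((depth, metadata.text))
    # findall with a bare tag name returns direct children only.
    for child in section.findall("Section"):
        titles.extend(section_titles(child, depth + 1))
    return titles


if __name__ == "__main__":
    for depth, title in section_titles(ET.fromstring(SAMPLE)):
        print("  " * depth + title)
```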

Page 103: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Document Metadata

Metadata – descriptive information about author, title, date and keywords
  Stored with metadata name
  Stored at the beginning of the section
  Example:
    <Metadata name="Title">Freshwater Resources in Arid Lands</Metadata>

Page 104: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Document Metadata

Dublin Core – a metadata standard
New metadata types can be invented
Metadata can be assigned by an automatic process rather than manually entered

Page 105: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

The Dublin Core

Page 106: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Collection Configuration File

Page 107: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Collection Configuration File

Page 108: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Default Configuration File

Page 109: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Getting the Most Out of Your Documents

Page 110: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Basic Plug-In Options

Page 111: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Document Processing Plug-ins

Page 112: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Document Processing Plug-ins

Page 113: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Document Processing Plug-ins

Page 114: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Assigning Metadata from a File

XML Document Type Definition (DTD)
Example XML Metadata File

Page 115: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Document Type Definition (DTD)

<!DOCTYPE GreenstoneDirectoryMetadata [
  <!ELEMENT DirectoryMetadata (FileSet*)>
  <!ELEMENT FileSet (FileName+,Description)>
  <!ELEMENT FileName (#PCDATA)>
  <!ELEMENT Description (Metadata*)>
  <!ELEMENT Metadata (#PCDATA)>
  <!ATTLIST Metadata name CDATA #REQUIRED>
  <!ATTLIST Metadata mode (accumulate|override) "override">
]>

Page 116: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Example XML Metadata File

<?xml version="1.0" ?>
<!DOCTYPE GreenstoneDirectoryMetadata SYSTEM "http://greenstone.org/dtd/GreenstoneDirectoryMetadata/1.0/GreenstoneDirectoryMetadata.dtd">
<DirectoryMetadata>
  <FileSet>
    <FileName>nugget.*</FileName>
    <Description>
      <Metadata name="Title">Nugget Point Lighthouse</Metadata>
      <Metadata name="Place" mode="accumulate">Nugget Point</Metadata>
    </Description>
  </FileSet>
  <FileSet>
    <FileName>nugget-point-1.jpg</FileName>
    <Description>
      <Metadata name="Title">Nugget Point Lighthouse</Metadata>
      <Metadata name="Subject">Lighthouse</Metadata>
    </Description>
  </FileSet>
</DirectoryMetadata>
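To see how the FileSet entries combine for one file, the Python sketch below applies them in order. It rests on two assumptions that the slides do not state outright: that FileName is a regular expression matched against the whole file name, and that "override" replaces earlier values while "accumulate" adds to them. The function name `metadata_for` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# The example file from the slide, without the XML declaration and DOCTYPE.
METADATA_XML = """<DirectoryMetadata>
  <FileSet>
    <FileName>nugget.*</FileName>
    <Description>
      <Metadata name="Title">Nugget Point Lighthouse</Metadata>
      <Metadata name="Place" mode="accumulate">Nugget Point</Metadata>
    </Description>
  </FileSet>
  <FileSet>
    <FileName>nugget-point-1.jpg</FileName>
    <Description>
      <Metadata name="Title">Nugget Point Lighthouse</Metadata>
      <Metadata name="Subject">Lighthouse</Metadata>
    </Description>
  </FileSet>
</DirectoryMetadata>"""


def metadata_for(filename, root):
    """Collect metadata values for filename from every matching FileSet."""
    result = {}
    for fileset in root.findall("FileSet"):
        patterns = [fn.text for fn in fileset.findall("FileName")]
        if not any(re.fullmatch(p, filename) for p in patterns):
            continue
        for metadata in fileset.find("Description").findall("Metadata"):
            name = metadata.get("name")
            # The DTD default for mode is "override"; ElementTree does not
            # apply DTD defaults, so supply it here.
            if metadata.get("mode", "override") == "accumulate":
                result.setdefault(name, []).append(metadata.text)
            else:
                result[name] = [metadata.text]
    return result


if __name__ == "__main__":
    root = ET.fromstring(METADATA_XML)
    print(metadata_for("nugget-point-1.jpg", root))
    print(metadata_for("nugget-point-2.jpg", root))
```

Under these assumptions the first FileSet applies to every file whose name starts with "nugget", and the second adds Subject for one specific image.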

Page 117: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Tagging Document Files

<!--
<Section>
<Description>
<Metadata name="Title"> Realizing human rights for poor people: Strategies for achieving the international development targets </Metadata>
</Description>
-->
(text of section goes here)
<!--
</Section>
-->
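Because the section tags sit inside comments, they are invisible to a browser but easy to find with a program. The Python sketch below pulls the Title values out of such comments using regular expressions. It is an illustration only and says nothing about how Greenstone itself reads these tags; the function name `tagged_titles` is hypothetical.

```python
import re

# The tagged fragment from the slide.
TAGGED = """<!--
<Section>
<Description>
<Metadata name="Title"> Realizing human rights for poor
people: Strategies for achieving the international
development targets </Metadata>
</Description>
-->
(text of section goes here)
<!--
</Section>
-->"""


def tagged_titles(text):
    """Return the Title metadata values found inside comments."""
    titles = []
    for comment in re.findall(r"<!--(.*?)-->", text, flags=re.DOTALL):
        pairs = re.findall(
            r'<Metadata name="([^"]+)">(.*?)</Metadata>', comment, flags=re.DOTALL
        )
        for name, value in pairs:
            if name == "Title":
                # Collapse line breaks and extra spaces inside the value.
                titles.append(" ".join(value.split()))
    return titles


if __name__ == "__main__":
    print(tagged_titles(TAGGED))
```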

Page 118: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Classifiers

Page 119: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Format Statements

Page 120: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Format Statements

Page 121: Building collections with Greenstone How to Build a Digital Library Ian H. Witten and David Bainbridge

Examples of Format Strings