Building collections with Greenstone

How to Build a Digital Library
Ian H. Witten and David Bainbridge

Digital Library Collections

There is a distinction between
  BUILDING collections
  DELIVERING information to users
Similar to the ‘compile-time’ versus ‘runtime’ distinction in computer programming
Information structures should usually be prepared in advance

Building a Collection

The Collector
  A subsystem that takes you step by step through building a simple collection
  Conceals details behind the scenes
First locate information on your computer or the Web
  Plain text, HTML, Word, PDF, email file, etc.

Plug-ins

Plug-ins are software modules that handle
  Format conversion
  Metadata extraction
Plug-ins promote extensibility

Greenstone Archive Format

Greenstone Archive Format
  XML-based file format
  File format for:
    Documents
    Metadata

Collection Configuration File

Collection Configuration File
  Defines the structure of a collection
  Governs how the collection is built
  Specifies how the collection will appear to users

Greenstone Extended Capabilities

Extending the Capabilities of Greenstone
  Plug-ins
    Handle different document and metadata formats
  Classifiers
    Handle different kinds of browsing structures
  Format statements and Macros
    Govern the user interface content and appearance

Why Greenstone?

Benefits of Greenstone

General system for constructing and presenting digital collections
Handles millions of documents, text, images, audio, video
User interfaces identical in Web-based and CD-ROM versions
Installs on Windows and Linux
Access locally or remotely using web browser

Organization of Collections

Each collection can be organized differently:
  Format of source documents
  Metadata
  Directory structure
  Document structure
  Searching and browsing services
  Presentation
  Auxiliary services

Variation of Source Format

Source documents can be supplied in:
  Plain text
  HTML
  PostScript
  PDF
  Word
  E-mail
  Other file types
  Images
  Video
  Audio

Variation of Metadata

Different types of metadata
Metadata can be supplied differently
  ‘fields’ in MS Word
  <meta> tags in HTML
  Information coded into filename and directories
  Spreadsheet or other data file
  Explicit metadata format like MARC

Variation of Directory Structure

Collections can vary in the directory structure in which the information is located

Variation of Document Structure

Document structure
  Flat
  Divided sequentially into pages
  Hierarchical organization
Title or other metadata available at each level

Variation of Services

Searching
  Metadata
  Indexes
  Hierarchical levels
Browsing
  Metadata
  Browser type

Variation of Presentation

Results can be presented to users in various ways:
  Format that target documents are shown in
  Search results page
  Metadata browsers
  Interface language

Variation of Auxiliary Services

A collection may require additional services
  User logging
  Etc.

Collection Configuration File

Allows Variation
A digital library collection is made by
  Gathering raw material
  Designing the collection
  Putting design information about the structure and presentation of the collection in the Collection Configuration File

Front Page of Collection

Statement of collection’s purpose
Statement of collection’s coverage
Explanation of how collection is organized

Searching Involves Indexes

Searching is provided by indexes built from different parts of the documents
  Entire documents
  Paragraphs
  Titles
  Sections
  Section headings
  Figure captions

Indexes

Indexes can be created automatically using
  Documents
  Supporting files
Indexes can be rebuilt automatically
  New document in the same format becomes available
  Process can awake, check for new material, and rebuild the indexes

Plug-ins for Indexing

Source documents are converted into standard XML form for indexing using plug-ins
Standard plug-ins process
  Plain text
  HTML
  Word
  PDF
  Usenet and email messages
New plug-ins can be written for other document types

Browsing Involves Lists

Browsing involves lists that can be examined by the user
  Authors
  Titles
  Dates
  Hierarchical classification structures

Classifier Modules

Modules called classifiers are used to create browsers and build browsing structures from metadata
  Scrollable lists
  Alphabetic selectors
  Dates
  Hierarchies
Programmers can write new classifiers to create novel browsing capabilities

Search Terms

Search Terms in Greenstone:
  Alphabetic characters
  Digits
Separated by white space
Punctuation acts as white space
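
The search-term rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated on the slide, not Greenstone's own tokenizer; the function name search_terms is hypothetical, and restricting terms to ASCII letters and digits is an assumption, since the slide does not say how other scripts are handled.

```python
import re


def search_terms(query: str) -> list[str]:
    """Split a query into terms made only of letters and digits.

    White space and punctuation both act as separators.
    Assumption: only ASCII letters and digits count as term characters.
    """
    return re.findall(r"[A-Za-z0-9]+", query)


print(search_terms("digital-library, 2nd edition!"))
```

Under this rule a hyphenated word such as "digital-library" is searched as two separate terms.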

Two Types of Queries

Query for ALL of the words
  Boolean AND
Query for SOME of the words
  Ranked
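
The difference between the two query types can be illustrated with a toy inverted index in Python. This is a sketch under stated assumptions, not Greenstone's search engine: the function names are hypothetical, and ranking by the number of matching query words is an assumed stand-in, because the slides do not describe the ranking function Greenstone actually uses.

```python
import re


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each lower-cased term to the set of document ids containing it."""
    index: dict[str, set[str]] = {}
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index.setdefault(term, set()).add(doc_id)
    return index


def query_all(index: dict[str, set[str]], terms: list[str]) -> set[str]:
    """Boolean AND: documents that contain every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def query_some(index: dict[str, set[str]], terms: list[str]) -> list[str]:
    """Ranked: documents containing any term, best match first.

    Assumption: the score is simply the number of query terms matched.
    """
    scores: dict[str, int] = {}
    for t in terms:
        for doc_id in index.get(t.lower(), set()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    "a": "digital library software",
    "b": "library catalogue",
    "c": "audio archive",
}
index = build_index(docs)
print(query_all(index, ["digital", "library"]))
print(query_some(index, ["digital", "library"]))
```

In this toy data the ALL query returns only document "a", while the SOME query also returns "b", ranked below "a" because it matches one word instead of two.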

Indexes to Search

In most collections, you can choose different indexes to search
Examples:
  Author and title indexes
  Chapter and paragraph indexes
Usually the full matching document is returned regardless of index searched

Preferences Page

Preferences Page
  Allows advanced control over search operation:
    Case-folding and stemming
    Advanced query mode where users specify Boolean operators
    Large-query interface
    Display search history

Preferences Page

Preferences Page
  Specify subcollections to be included in searches
  Specify presentation language
  Customize interface
    Textual vs. standard interface
    Suppress navigation bar
    Suppress alert system

Using the Collector

The Greenstone Collector

Easiest way to build a simple collection
The Collector allows you to:
  Create a new collection
  Modify or add to an existing collection
  Delete a collection

Starting the Collector

Click the Collector link from the default Greenstone home page
Log in
  When Greenstone is installed, an account called admin is set up with a password chosen during installation
The Collector works through a standard web interface

Creating a New Collection

Collector’s main purpose is to build a new collection
Structure of a collection is determined when the collection is set up
Simplest to copy the structure of an existing collection and then edit

Collection Building Steps

1. Collection Information
2. Source Data
3. Configuration
4. Building
5. Viewing

Collection Building Steps

☐ Collection Information
☐ Source Data
☐ Configuration
☐ Building
☐ Viewing

1. Collection Information

Give the collection a name and provide associated information
  Title
    Short phrase used to identify the collection within the digital library
  Contact e-mail address
  Brief description
    Sets out the principles that govern what is included in the collection

Collection Building Steps

☑ Collection Information
☐ Source Data
☐ Configuration
☐ Building
☐ Viewing

2. Source Data

Specify the location of the sources
  Clone existing collection
    Specify on a pull-down menu the existing collection
  Create a completely new collection

2. Source Data

In the provided boxes, indicate where Source Documents are located
Specification of sources
  file://
  http://
  ftp://

file://

File name on the Greenstone server system
  That file will be included in collection
Directory name on the Greenstone server
  Everything in the folder and its subfolders will be included

http://

Web page
  The web page will be downloaded
  All pages it links to (and all pages they link to) that reside on the same site, below the URL, will also be downloaded
URL that leads to a list of files
  Everything in the folder and its subfolders will be included in collection
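
The "same site, below the URL" rule for http:// sources can be sketched as a small scope check in Python. This illustrates the rule as worded on the slide rather than Greenstone's actual mirroring code; the function name in_scope is hypothetical, and reading "below the URL" as "inside the directory of the starting URL" is an assumption.

```python
from urllib.parse import urljoin, urlparse


def in_scope(start_url: str, link: str) -> bool:
    """Return True if a link found on a page should also be downloaded.

    Assumption: "below the URL" means the link's path lies inside the
    directory that holds the starting URL, on the same host.
    """
    start = urlparse(start_url)
    target = urlparse(urljoin(start_url, link))  # resolve relative links
    base_dir = start.path.rsplit("/", 1)[0] + "/"
    return target.netloc == start.netloc and target.path.startswith(base_dir)


start = "http://example.org/docs/index.html"
print(in_scope(start, "guide/intro.html"))              # same site, below /docs/
print(in_scope(start, "/about.html"))                   # same site, above /docs/
print(in_scope(start, "http://other.org/docs/x.html"))  # different site
```

A crawler that follows only in-scope links stays inside the part of the site the collection builder pointed at.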

ftp://

File to be downloaded using FTP
Directory name on the FTP server
  Downloads everything in the folder and its subfolders

Collection Building Steps

☑ Collection Information
☑ Source Data
☐ Configuration
☐ Building
☐ Viewing

3. Configuration

This step can be bypassed
Allows adjustment of configuration options
The construction and presentation of all collections are controlled by specifications in a special collection configuration file

Collection Building Steps

☑ Collection Information
☑ Source Data
☑ Configuration
☐ Building
☐ Viewing

4. Building

The computer does the work of the building process
Indexes are built:
  For browsing
  For searching
  Following specifications in the collection configuration file
Status line shows progress
Warnings shown if files can’t be found

Collection Building Steps

☑ Collection Information
☑ Source Data
☑ Configuration
☑ Building
☐ Viewing

5. Viewing

View the collection that has just been created
E-mail can be sent to the collection’s contact address
  Must enable by editing main.cfg configuration file

Working with Existing Collections

Add more material and rebuild the collection
Edit the configuration file to modify the collection’s structure
Delete the collection
Put the collection on CD-ROM

Adding Material to a Collection

Do not re-specify files that are already in the collection
  Files would be included twice
If the building process fails, the old version remains unchanged
Structure of collection can be changed
  Edit the configuration file
  May add plug-ins or an option to a plug-in

Plug-ins & Document Formats

Plug-ins are specified in the collection configuration file
File name determines document format
Widely used document formats:
  TEXTPlug
  HTMLPlug
  WORDPlug
  PDFPlug
  PSPlug
  EMAILPlug
  ZIPPlug

Text Files

TEXTPlug Plug-In
  *.txt
  *.text
Plain text file
Title metadata based on the first line of the file

HTML Files

HTMLPlug Plug-In
  *.htm
  *.html
  .shtml
  .shm
  .asp
  .php
  .cgi

HTML Files

HTMLPlug Plug-In
  Imports HTML files
  Title metadata extracted from the HTML <title> tag
  Other HTML <meta> tag data can be extracted
  Parses and processes any links in the file
  Links to other files in the collection are trapped and replaced by references to the document
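
The kind of metadata extraction described for HTML files can be sketched with Python's standard html.parser module. This is only an illustration of pulling a title and <meta> values out of a page; it is not how HTMLPlug itself is written (Greenstone's plug-ins are Perl modules), and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TitleMetaParser(HTMLParser):
    """Collect the <title> text and any <meta name=... content=...> pairs."""

    def __init__(self) -> None:
        super().__init__()
        self.metadata: dict[str, str] = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attributes.get("name") and attributes.get("content") is not None:
            self.metadata[attributes["name"]] = attributes["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text inside <title> becomes the Title metadata.
        if self._in_title:
            self.metadata["Title"] = self.metadata.get("Title", "") + data


def extract_metadata(html: str) -> dict[str, str]:
    parser = TitleMetaParser()
    parser.feed(html)
    parser.close()
    return parser.metadata


page = (
    "<html><head><title>Annual Report</title>"
    '<meta name="Author" content="J. Smith"></head>'
    "<body><p>Text</p></body></html>"
)
print(extract_metadata(page))
```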

HTML Files

file_is_url
  Optional switch within the HTML plug-in
  Causes URL metadata to be inserted into each document, based on the file-name convention that is adopted by the mirroring package. The collection uses this metadata to allow readers to refer to the original source material rather than a local copy

Microsoft Word Files

WORDPlug Plug-In
  *.doc
Imports Microsoft Word documents
Greenstone uses independent programs to convert Word files to HTML
  Many variants on the Word format
  Older Word formats use a simple text string extraction

PDF Files

PDFPlug Plug-In
  *.pdf
Imports PDF Files
  Adobe’s Portable Document Format
Greenstone uses independent programs to convert PDF files to HTML

PostScript Files

PSPlug Plug-In
  *.ps
Imports PostScript Files
Works best when a standard conversion program is already installed on the computer
Uses simple text extraction algorithm if no conversion program is present

Email Files

EMAILPlug
  *.email
Imports files containing email
  Each source is checked for e-mail contents
Extracts metadata:
  Subject
  To
  From
  Date
Deals with common formats
  Netscape, Eudora, Unix mail readers
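
Extracting the four metadata fields listed above from a single message can be sketched with Python's standard email package. This only illustrates the idea; it is not EMAILPlug's implementation, the function name email_metadata is hypothetical, and the sample message is made up for the example.

```python
from email import message_from_string


def email_metadata(raw_message: str) -> dict[str, str]:
    """Return the Subject, To, From and Date headers that are present."""
    message = message_from_string(raw_message)
    wanted = ("Subject", "To", "From", "Date")
    return {name: message[name] for name in wanted if message[name] is not None}


raw = (
    "From: alice@example.org\n"
    "To: bob@example.org\n"
    "Subject: Collection update\n"
    "Date: Mon, 1 Jan 2001 10:00:00 +0000\n"
    "\n"
    "The new documents are ready.\n"
)
print(email_metadata(raw))
```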

Compressed & Archived Files

ZIPPlug Plug-In
  *.zip
  *.tar
  .gz
  *.z
  *.tgz
  *.bz
Relies on standard utility programs being present
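
The idea that the file name determines which plug-in handles a document can be sketched as a lookup table in Python. The plug-in names and extensions are the ones listed on the preceding slides; the matching logic and the function name plugin_for are assumptions made for illustration, since the slides do not show how Greenstone performs the match.

```python
import os

# Extensions per plug-in, as listed on the slides.
PLUGIN_EXTENSIONS = {
    "TEXTPlug": (".txt", ".text"),
    "HTMLPlug": (".htm", ".html", ".shtml", ".shm", ".asp", ".php", ".cgi"),
    "WORDPlug": (".doc",),
    "PDFPlug": (".pdf",),
    "PSPlug": (".ps",),
    "EMAILPlug": (".email",),
    "ZIPPlug": (".zip", ".tar", ".gz", ".z", ".tgz", ".bz"),
}


def plugin_for(filename: str):
    """Return the plug-in name for a file, or None if no plug-in claims it."""
    extension = os.path.splitext(filename.lower())[1]
    for plugin, extensions in PLUGIN_EXTENSIONS.items():
        if extension in extensions:
            return plugin
    return None


for name in ("report.PDF", "index.html", "mail.email", "photo.jpg"):
    print(name, "->", plugin_for(name))
```

A file with no matching plug-in, such as photo.jpg here, corresponds to the case where the import step issues a warning and ignores the file.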

Building Collections Manually

Building a Collection

Building a Collection:
  The process of taking a set of documents and metadata information and creating all the indexes and data structures that support the searching, browsing, and viewing operations that the collection offers

Building a Collection

Four Phases in Building a Collection
  Make
    Make a skeleton framework structure to contain the collection
  Import
    Import the documents and metadata, convert to a Greenstone standard form
  Build
    Build the required indexes and data structures
  Install
    Make the collection operational

Building Collections Manually

☐ Getting Started
☐ Making a framework for the collection
☐ Importing the documents
☐ Building the indexes
☐ Installing the collection

Getting Started

Locate the command prompt
Go to the directory where Greenstone was installed
  cd "C:\Program Files\gsdl"
Tell system where to find Greenstone files
  setup.bat
  Sets the variable GSDLHOME to the Greenstone home directory
To return later
  cd "%GSDLHOME%"

Building Collections Manually

☑ Getting Started
☐ Making a framework for the collection
☐ Importing the documents
☐ Building the indexes
☐ Installing the collection

Make a framework for the collection

Use the Perl program mkcol.pl to ‘make a collection’
Get description of usage and arguments
  perl -S mkcol.pl
  mkcol.pl
    May leave off first part if system recognizes that .pl files are associated with Perl

Make a framework for the collection

perl -S mkcol.pl -creator emailAddress collectionName

Make a framework for the collection

Examine the file structure
  cd "%GSDLHOME%\collect\collectionName"
List directory contents
  dir
Seven subdirectories are created:
  archives
  building
  etc (contains collect.cfg file)
  images
  import
  index
  perllib
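
A quick way to confirm that the framework was created is to check for the seven subdirectories listed above. The following Python sketch does that; the function name missing_subdirectories is hypothetical and the script is not part of Greenstone.

```python
from pathlib import Path

# The seven subdirectories listed on the slide.
EXPECTED_SUBDIRS = ("archives", "building", "etc", "images", "import", "index", "perllib")


def missing_subdirectories(collection_dir: str) -> list[str]:
    """Return the expected subdirectory names that are absent."""
    root = Path(collection_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

Called with the path of the new collection directory (the one under collect inside the Greenstone home directory), it returns an empty list when the framework is complete.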

Make a framework for the collection

collect.cfg File
  emailAddress placed in the creator and maintainer lines
  collectionName placed in collection-meta lines
  Plug-ins are inserted

Building Collections Manually

☑ Getting Started
☑ Making a framework for the collection
☐ Importing the documents
☐ Building the indexes
☐ Installing the collection

Importing the documents

The collection’s import directory should contain the source material
Drag the directory containing the source material into the import directory
You may drag several source directories and hierarchies

Importing the documents

The import process:
  Brings documents into the Greenstone system
  Standardizes document format
    (the way that metadata is specified)
  Standardizes the file structure
    (that contains the documents)

Importing the documents

To get a list of options for the import program:
  perl -S import.pl
The basic import command is:
  perl -S import.pl collectionName

Importing the documents

You may be in any directory when the import command is issued
  The software works by knowing the collection’s name and the Greenstone home directory
Warnings may appear
  When files are found without corresponding plug-ins
  These files will be ignored

Building Collections Manually

☑ Getting Started
☑ Making a framework for the collection
☑ Importing the documents
☐ Building the indexes
☐ Installing the collection

Building the indexes

Use the program buildcol.pl

Building the indexes

Modify collect.cfg file to customize the collection’s appearance
  collectionname
    Web browsers receive this name as the title of the collection’s front page
  collectionextra
    Description of the collection
    Appears under “About this collection” on the collection’s home page
    Enter as a single line in the editor

Building the indexes

Modify collect.cfg file to customize the collection’s appearance
  iconcollection
    Give the collection an icon image
    Put the location of the image between quotes
    If absent, the collection’s name will be used
    Use _httpprefix_ as a shorthand way of beginning any URL that points within the Greenstone file area
    Example:
      _httpprefix_/collect/collectionName/images/icon.gif

Building the indexes

To get a list of options for the build program:
  perl -S buildcol.pl
The basic build command is:
  perl -S buildcol.pl collectionName
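
The make, import and build commands shown on these slides can be assembled programmatically, for example when scripting a rebuild. The Python sketch below only constructs the argument lists and does not run anything; the helper names are hypothetical, and the options used are limited to those that appear on the slides.

```python
def mkcol_command(creator_email: str, collection_name: str) -> list[str]:
    """Argument list for: perl -S mkcol.pl -creator emailAddress collectionName"""
    return ["perl", "-S", "mkcol.pl", "-creator", creator_email, collection_name]


def import_command(collection_name: str) -> list[str]:
    """Argument list for: perl -S import.pl collectionName"""
    return ["perl", "-S", "import.pl", collection_name]


def buildcol_command(collection_name: str) -> list[str]:
    """Argument list for: perl -S buildcol.pl collectionName"""
    return ["perl", "-S", "buildcol.pl", collection_name]


for command in (
    mkcol_command("me@example.org", "demo"),
    import_command("demo"),
    buildcol_command("demo"),
):
    print(" ".join(command))
```

Each list could be handed to subprocess.run from a command prompt in which setup.bat has already been run, so that the Greenstone scripts can be found.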

Building the indexes

The building process takes about a minute on small collections and can take much longer for very large collections
You may ignore most warning messages
Serious problems will cause the program to terminate

Building Collections Manually

☑ Getting Started
☑ Making a framework for the collection
☑ Importing the documents
☑ Building the indexes
☐ Installing the collection

Installing the collection

Building is done in the building directory
Collection must be moved to the index directory before users can see it
  Drag contents of the building directory to the index directory
  If index already contains files, remove them first
Forgetting to move the contents of building to index is a common mistake
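
The install step described above (empty index, then move the contents of building into it) can also be scripted instead of dragged by hand. The following Python sketch is one way to do it under the directory layout given on the slides; the function name install_collection is hypothetical and this script is not shipped with Greenstone.

```python
import shutil
from pathlib import Path


def install_collection(collection_dir: str) -> None:
    """Replace the contents of index with the contents of building."""
    collection = Path(collection_dir)
    building = collection / "building"
    index = collection / "index"
    index.mkdir(exist_ok=True)

    # If index already contains files, remove them first.
    for old in index.iterdir():
        if old.is_dir():
            shutil.rmtree(old)
        else:
            old.unlink()

    # Move everything from building into index.
    for item in list(building.iterdir()):
        shutil.move(str(item), str(index / item.name))
```

It would be called with the path of the collection directory after the build command has finished successfully.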

Installing the collection

To view the newly built collection:
  Restart Greenstone
    If using the Local Library version
  Reload Greenstone Home Page
    If using the Web version

Importing and Building

General Information

Two Main Parts to Collection Building:
  Importing (import.pl)
  Building (buildcol.pl)

Files and Directories

Collection Specific Directories

GSDLHOME
  collect – all the digital library collections
    collectionName – directory of collection
      import – original source material
      archives – result of import process
      building – temporary, contents manually moved to index
      index – bulk of info served to users
        (import, archives and building can be deleted)
      etc – contains collect.cfg file
      images – icons used for the collection
      perllib – Perl programs specific to collection

Other Greenstone Directories
GSDLHOME
  lib – common software for both the collection server and receptionist
  bin – programs used for building process
    script – Perl programs used (mkcol.pl, import.pl, buildcol.pl)
  perllib – Perl modules
    plugins – Perl plugins
    classify – Perl classifiers
  cgi-bin – Greenstone runtime system (absent in Local Library version)
  src – source code in C++
    colservr – the collection server
    recpt – the receptionist

Other Greenstone Directories
GSDLHOME
  packages – source code for external software packages used by Greenstone
    (indexing and compression program, database manager program, etc.)
    (each package is stored in a directory of its own with a readme file)
  bin – executables
  mappings – Unicode translation tables
  etc – configuration files for the entire system, initialization and error logs, user authorization database
  images – user interface images and icons
  macros – small code fragments that drive the user interface
  tmp – temporary files
  docs – documentation for the system

Object Identifiers
Document's permanent name in the system
Remain the same when collection rebuilt
Assigned by the import process
Stored as an attribute in the document archive file
Character strings starting with the letters HASH (HASH0109d3850a6de440c4d1ca2)
Used to name directory where archive file is stored
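
One way to get identifiers that stay the same across rebuilds is to derive them from the document content. The sketch below illustrates the idea only. It assumes a SHA-256 digest truncated to 24 hex digits, which is not necessarily how Greenstone computes its identifiers, and `make_object_identifier` is a hypothetical name.

```python
import hashlib


def make_object_identifier(content: bytes, hex_digits: int = 24) -> str:
    """Return a stable identifier of the form 'HASH' + hex digits.

    Assumption: the identifier is derived from the document content;
    the digest and length used here are illustrative choices.
    """
    digest = hashlib.sha256(content).hexdigest()
    return "HASH" + digest[:hex_digits]


doc = b"Freshwater Resources in Arid Lands"
oid_first_import = make_object_identifier(doc)
oid_after_rebuild = make_object_identifier(doc)  # same content, same identifier
```

Because the identifier depends only on the content, importing the same document again yields the same name.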

Plug-Ins
Plug-ins do most of the work of the import process
Operate in the order in which they are listed in the collect.cfg file
  Input file is passed to each plug-in until one is found that can process it
  If there is no plug-in that can process a file, a warning is printed
Plug-ins determine the traversal of the subdirectory structure in the import directory

RecPlug – processes directories, recurses through directory structures and passes the name through the plug-in list

GAPlug – processes Greenstone Archive Format documents (in the archives directory structure)

ArcPlug – used during building, processes list of document OIDs produced during import (list is stored in archives.inf file)
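
The "first plug-in that can process it" rule can be sketched as a short dispatch loop. The code is an illustration in Python, not Greenstone's Perl implementation. The functions `rec_plugin`, `text_plugin`, `html_plugin` and `run_plugins` are hypothetical stand-ins, and the returned strings merely record which plug-in accepted each file.

```python
from pathlib import Path
from typing import List, Optional


def rec_plugin(path: Path, plugins: list) -> Optional[List[str]]:
    """Accept directories and pass each entry back through the plug-in list."""
    if not path.is_dir():
        return None
    results: List[str] = []
    for child in sorted(path.iterdir()):
        results.extend(run_plugins(child, plugins))
    return results


def text_plugin(path: Path, plugins: list) -> Optional[List[str]]:
    return [f"text:{path.name}"] if path.suffix == ".txt" else None


def html_plugin(path: Path, plugins: list) -> Optional[List[str]]:
    return [f"html:{path.name}"] if path.suffix in (".htm", ".html") else None


def run_plugins(path: Path, plugins: list) -> List[str]:
    """Offer the path to each plug-in in order; the first one to accept wins."""
    for plugin in plugins:
        result = plugin(path, plugins)
        if result is not None:
            return result
    print(f"WARNING: no plug-in can process {path}")
    return []
```

Calling `run_plugins(import_dir, [rec_plugin, text_plugin, html_plugin])` walks the whole directory tree, because the directory plug-in feeds every entry it finds back into the same list.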

The Import Process
Brings documents and metadata into the system in a standardized XML form
Original material placed in import directory
Import process transforms it to files in the archives directory
  The original material can be deleted
  Collection can be rebuilt from archive files
New material added to collection by placing it in import directory and re-executing the import process
  The new material finds its way into archives along with existing files
To keep the source form of collections
  Do not delete the archives
  "Source" form can be augmented and rebuilt later

The Build Process
Creates the indexes and data structures that make the collection operational
Indexes for the whole collection are built all at once
  Build process does not work incrementally
  Adding new material to archives requires that entire collection be rebuilt (by issuing buildcol.pl)
Most collections can be rebuilt overnight

Options for Import and Build

Additional Options for Import

Additional Options for Build

Options for Import and Build
To see options for any Greenstone script, type its name at the command prompt
Options for Import and Build help with debugging (see Table 6.5 on page 310):
  verbosity
  archivedir
  maxdocs
  collectdir
  out
  keepold
  debug
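
A small helper can assemble a command line from these options. This sketch makes several assumptions that should be checked against the scripts' own usage messages: that options take a single leading dash, that the collection name comes last, and that the scripts are run through `perl -S`. The helper name `build_command` and the collection name `demo` are hypothetical.

```python
def build_command(script: str, collection: str, **options) -> list[str]:
    """Compose an argument list for import.pl or buildcol.pl (not executed here)."""
    argv = ["perl", "-S", script]
    for name, value in options.items():
        if value is True:
            argv.append(f"-{name}")           # switch such as -debug or -keepold
        elif value is not False and value is not None:
            argv += [f"-{name}", str(value)]  # option with a value, such as -maxdocs 10
    argv.append(collection)
    return argv


# Import only a few documents while debugging a new collection.
import_cmd = build_command("import.pl", "demo", maxdocs=10, verbosity=3)
```

The resulting list can be handed to `subprocess.run` on a machine where Greenstone is installed.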

Greenstone Archive Documents

Greenstone Archive Format

<!DOCTYPE GreenstoneArchive [
  <!ELEMENT Section (Description,Content,Section*)>
  <!ELEMENT Description (Metadata*)>
  <!ELEMENT Content (#PCDATA)>
  <!ELEMENT Metadata (#PCDATA)>
  <!ATTLIST Metadata name CDATA #REQUIRED>
]>

Document Metadata
Metadata – descriptive information about author, title, date and keywords
  Stored with metadata name
  Stored at the beginning of the section
  Example:
    <Metadata name="Title">Freshwater Resources in Arid Lands</Metadata>
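
To see how a section with its metadata fits the Section, Description, Metadata and Content elements of the archive format DTD, the structure can be assembled with Python's standard `xml.etree.ElementTree`. This is a minimal sketch. The helper `make_section` is hypothetical, and the output omits the XML declaration and DOCTYPE that a real archive file would carry.

```python
import xml.etree.ElementTree as ET


def make_section(metadata: dict[str, str], text: str) -> ET.Element:
    """Build a <Section> whose metadata precedes its content."""
    section = ET.Element("Section")
    description = ET.SubElement(section, "Description")
    for name, value in metadata.items():
        item = ET.SubElement(description, "Metadata", name=name)
        item.text = value
    content = ET.SubElement(section, "Content")
    content.text = text
    return section


section = make_section(
    {"Title": "Freshwater Resources in Arid Lands"},
    "(text of section goes here)",
)
xml_text = ET.tostring(section, encoding="unicode")
```

Nested sections would be appended after the Content element, matching the `Section*` at the end of the content model.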

Document Metadata
Dublin Core – a metadata standard
New metadata types can be invented
Metadata can be assigned by an automatic process rather than manually entered

The Dublin Core

Collection Configuration File

Default Configuration File

Getting the Most Out of Your Documents

Basic Plug-In Options

Document Processing Plug-ins

Assigning Metadata from a File
XML Document Type Definition (DTD)
Example XML Metadata File

Document Type Definition (DTD)

<!DOCTYPE GreenstoneDirectoryMetadata [
  <!ELEMENT DirectoryMetadata (FileSet*)>
  <!ELEMENT FileSet (FileName+,Description)>
  <!ELEMENT FileName (#PCDATA)>
  <!ELEMENT Description (Metadata*)>
  <!ELEMENT Metadata (#PCDATA)>
  <!ATTLIST Metadata name CDATA #REQUIRED>
  <!ATTLIST Metadata mode (accumulate|override) "override">
]>

Example XML Metadata File
<?xml version="1.0" ?>
<!DOCTYPE GreenstoneDirectoryMetadata SYSTEM
  "http://greenstone.org/dtd/GreenstoneDirectoryMetadata/1.0/GreenstoneDirectoryMetadata.dtd">
<DirectoryMetadata>
  <FileSet>
    <FileName>nugget.*</FileName>
    <Description>
      <Metadata name="Title">Nugget Point Lighthouse</Metadata>
      <Metadata name="Place" mode="accumulate">Nugget Point</Metadata>
    </Description>
  </FileSet>
  <FileSet>
    <FileName>nugget-point-1.jpg</FileName>
    <Description>
      <Metadata name="Title">Nugget Point Lighthouse</Metadata>
      <Metadata name="Subject">Lighthouse</Metadata>
    </Description>
  </FileSet>
</DirectoryMetadata>
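
The effect of such a file on individual documents can be sketched as follows. The code assumes that each FileName is a regular expression matched against the whole file name and that FileSets are applied in document order. Both are assumptions made for illustration, and `metadata_for` is a hypothetical helper. The XML declaration and DOCTYPE are left out of the embedded sample, so the default mode of "override" is supplied in code.

```python
import re
import xml.etree.ElementTree as ET

METADATA_XML = """<DirectoryMetadata>
  <FileSet>
    <FileName>nugget.*</FileName>
    <Description>
      <Metadata name="Title">Nugget Point Lighthouse</Metadata>
      <Metadata name="Place" mode="accumulate">Nugget Point</Metadata>
    </Description>
  </FileSet>
  <FileSet>
    <FileName>nugget-point-1.jpg</FileName>
    <Description>
      <Metadata name="Title">Nugget Point Lighthouse</Metadata>
      <Metadata name="Subject">Lighthouse</Metadata>
    </Description>
  </FileSet>
</DirectoryMetadata>"""


def metadata_for(filename: str, xml_text: str) -> dict[str, list[str]]:
    """Collect the metadata that the matching FileSets assign to one file."""
    assigned: dict[str, list[str]] = {}
    for fileset in ET.fromstring(xml_text).iter("FileSet"):
        patterns = [el.text or "" for el in fileset.findall("FileName")]
        if not any(re.fullmatch(p, filename) for p in patterns):
            continue
        for item in fileset.findall("Description/Metadata"):
            name, value = item.get("name"), item.text or ""
            if item.get("mode", "override") == "accumulate":
                assigned.setdefault(name, []).append(value)  # add to existing values
            else:
                assigned[name] = [value]                     # replace existing values
    return assigned
```

With the sample above, `nugget-point-1.jpg` matches both FileSets and so also receives the Subject value, while any other file starting with `nugget` receives only Title and Place.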

Tagging Document Files
<!--
<Section>
  <Description>
    <Metadata name="Title">Realizing human rights for poor people: Strategies for achieving the international development targets</Metadata>
  </Description>
-->
(text of section goes here)
<!--
</Section>
-->

Classifiers

Format Statements

Examples of Format Strings
