Australian Newspapers Digitisation Program

Development of the Newspapers Content Management System

Rose Holley, ANDP Manager

ANPlan/ANDP Workshop, 28 November 2008

Australian Newspapers Digitisation Program Development of ...eprints.rclis.org/12630/1/NDP_Content_Mgt_Syst_Nov_2008.pdf · Australian Newspapers Digitisation Program Development



Requirements

• Manage, store and organise millions of digital newspaper pages behind the scenes.
• Manage the entire digitisation workflow from scanning to public delivery.


How?

• Current NLA Digital Content Management System cannot cope with volume of digital newspapers or complex structure of newspapers
• No ‘off the shelf’ product available that meets requirements
• Need the system now (March 2007)


Solution

• NLA team to develop a software solution
• Ensure the system uses open source software
• System to be standalone and not bolted into other systems
• Possibility of sharing system in future/providing as open source to other libraries


Software Development

• Agile method of development used
• Modules designed in stages as required
• Stage 1 – Receipt and checking of scanned images
• Stage 2 – Quality Assurance Modules
• Stage 3 – Sending/receiving items from OCR
• Stage 4 – System Administration and Statistics
• Stage 5 – Interface Design and Usability of System


Progress

• Software development March 2007 – June 2008
• First module in use May 2007
• CMS in use for 18 months
• CMS in final stages of completion (Jan – June 2009)
• Further development required to enable acceptance of contributors' content
• Simple user interface yet to be designed


Australian Newspapers CMS

• Screenshots of system follow and explanation of workflows.

Workflow Summary

• Preparing for Digitisation
• Creation of digital images
• Adding metadata and Quality Assurance
• Optical Character Recognition
• Quality Assurance
• Statistics and Admin

Preparing for Digitisation

• Identify title to be digitised
• Source master microfilm from owner
• Send master microfilm to scanning contractors
• Add title to Content Management System


CMS - Add Title


Microfilm converted to digital images


Image Reception

• Images received from scanning contractor on LTO2 Tape
• Tapes added to tape robot and extracted
• Reels automatically added to Content Management System
• Reel details are checked
• Images ingested into Content Management System


CMS - Check Reel Details


CMS - Ingest Reels


CMS - Tasks 1 and 2

• Task 1 – Add metadata (dates and page numbers)
• Supervisor reviews marked pages
• Task 2 – Define batches
• Task 2 – Resolve duplicates
• Task 2 – Create missing page targets


Identify title to be worked on


Identify reel


CMS - Adding Metadata

• Date and Page Sequence number added


Supervisor Review

• Supervisor reviews pages marked for attention


CMS - Define Batches

• Batches defined by date
• Each batch contains 2-3000 images
• Batches are automatically assigned a number
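
The batching rule on this slide could be sketched roughly as below. This is an illustration only, not the NLA implementation: the `Page` record, the `define_batches` function, the choice to keep all pages of one date in the same batch, and the use of 3000 as a hard cap are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from itertools import groupby


@dataclass(frozen=True)
class Page:
    """Hypothetical record for one scanned page image."""
    issue_date: date
    sequence: int


def define_batches(pages, max_images=3000, first_number=1):
    """Group pages into date-ordered batches of at most max_images images.

    Pages sharing an issue date stay together, so every batch ends on a
    date boundary. A single date with more than max_images pages would
    still form one oversized batch. Returns {batch_number: [pages]}.
    """
    ordered = sorted(pages, key=lambda p: (p.issue_date, p.sequence))
    batches = {}
    current = []
    number = first_number
    for _, group in groupby(ordered, key=lambda p: p.issue_date):
        day_pages = list(group)
        # Close the current batch if this date would push it past the cap.
        if current and len(current) + len(day_pages) > max_images:
            batches[number] = current
            number += 1  # batch numbers are assigned automatically
            current = []
        current.extend(day_pages)
    if current:
        batches[number] = current
    return batches
```

Calling `define_batches(pages)` on the pages of a reel returns the numbered batches in date order.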


CMS - Resolve Duplicates

• Duplicate pages compared and the best copy is selected

Missing Pages

• Missing page targets are generated
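
One way to find where missing page targets are needed is to look for gaps in the page sequence numbers added in Task 1. The sketch below assumes that the numbering for an issue starts at 1 and should be contiguous up to the highest number seen; the function name and that rule are hypothetical, and the slides do not say how the CMS detects gaps.

```python
def missing_page_targets(sequence_numbers):
    """Return the page sequence numbers absent between 1 and the highest seen."""
    present = set(sequence_numbers)
    if not present:
        return []
    # Every number in 1..max that was never recorded needs a target.
    return [n for n in range(1, max(present) + 1) if n not in present]
```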


Optical Character Recognition (OCR)

• Complete batches are added to a tape
• Tapes are generated and written
• Tapes sent to OCR contractor
• Contractor completes OCR processes
• OCR data (not images) is returned via FTP


CMS - Tapes Created

• Completed batches added to a tape


Optical Character Recognition (OCR) of pages and article zoning


OCR Data Reception (Automated process)

• OCR contractor advises NLA server that a batch has been completed
• NLA server downloads the batch
• Batch is ingested into Content Management System
• Checks are performed on data validity
• QA Derivatives are generated
• Articles may now be searched, but are not yet publicly accessible
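
The automated reception steps above can be read as a fixed sequence of stages that each batch passes through. The sketch below models that sequence. The stage names and helper functions are hypothetical, and it assumes the stages run strictly in the order listed.

```python
from enum import Enum


class BatchStage(Enum):
    """Hypothetical stage names mirroring the reception steps."""
    COMPLETION_ADVISED = 1
    DOWNLOADED = 2
    INGESTED = 3
    VALIDATED = 4
    QA_DERIVATIVES_GENERATED = 5


def next_stage(stage):
    """Return the stage that follows, or None after the last one."""
    stages = list(BatchStage)
    index = stages.index(stage)
    return stages[index + 1] if index + 1 < len(stages) else None


def searchable_in_cms(stage):
    """Articles are searchable in the CMS once reception has finished."""
    return stage is BatchStage.QA_DERIVATIVES_GENERATED


def run_reception(start=BatchStage.COMPLETION_ADVISED):
    """Walk a batch through every remaining reception stage, in order."""
    visited = [start]
    while (following := next_stage(visited[-1])) is not None:
        visited.append(following)
    return visited
```

Public access is deliberately not modelled here, because the slides tie it to the later QA result rather than to reception.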


CMS - Batch information


Quality Assurance (QA)

• A random sample of Issues and Articles are checked
• Volume and Issue number are checked for accuracy
• Sample articles are checked against agreed Quality Acceptance Criteria (QAC)
• Error rates calculated against QAC on the fly
• Supervisor checks final results
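
Random sampling with an error rate updated after each checked article could look roughly like this. The sketch is illustrative only: the function and class names are invented, and treating the QAC as a single maximum error rate is an assumption, since the slides do not give the actual criteria.

```python
import random


def draw_sample(article_ids, sample_size, seed=None):
    """Pick a random sample of articles to check, without replacement."""
    rng = random.Random(seed)
    ids = list(article_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


class RunningErrorRate:
    """Tracks the error rate as each sampled article is checked."""

    def __init__(self, max_error_rate):
        self.max_error_rate = max_error_rate  # hypothetical QAC threshold
        self.checked = 0
        self.errors = 0

    def record(self, has_error):
        """Register the outcome of checking one article."""
        self.checked += 1
        if has_error:
            self.errors += 1

    @property
    def rate(self):
        """Current error rate; 0.0 before anything has been checked."""
        return self.errors / self.checked if self.checked else 0.0

    def meets_qac(self):
        """True while the running error rate is within the threshold."""
        return self.rate <= self.max_error_rate
```

A QA operator's tool would call `record` once per checked article and read `rate` or `meets_qac()` at any point, which matches the "on the fly" calculation the slide describes.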


CMS - Selecting the batch


Volume & Issue Number Check


Article checked against QAC


Re-keyed fields checked for accuracy


Supervisor checks results (auto or manual accept/reject)


QA Results

• Automated email sent to supplier advising the result
• Emails for rejected batches include a summary of errors
• Summary of errors saved for all batches
• Accepted batches are immediately accessible in public search system


Batch History and details retained


Search or Browse articles within CMS


Statistics

• Stats for content received, QA’d and delivered to the public generated by the Content Management System
• (Stats for usage of public search system collected using Google Analytics)


CMS - Content Statistics


CMS - Work Statistics


Access

• Public access to digital newspapers is provided through Australian Newspapers Search and Delivery System
• Users can search or browse newspapers
• Search results can be refined using filters
• Users can browse by Newspaper title or Date.

http://ndpbeta.nla.gov.au/ndp/del/home