Managing the Digitization of Large Press Archives

DESCRIPTION

From the 2014 DLF Forum in Atlanta, GA.

Session Leaders: Bassem Elsayed, Bibliotheca Alexandrina; Ahmed Samir, Bibliotheca Alexandrina

Managing the digitization of press material is a challenge, not only in terms of quantity but also in terms of text and material quality, the design of the workflow system that organizes the operations, and the handling of metadata. This challenge has been the focus of the Bibliotheca Alexandrina’s digitization work during the past year in the course of its partnership with the Center for Economic, Judicial, and Social Study and Documentation (CEDEJ). With more than 800,000 pages of press articles to be digitally preserved and made publicly accessible, there was an inevitable need to design a workflow that can manage such a massive collection and handle its attributes proficiently. The endeavor required simultaneous work on four main aspects: data analysis of the collection, development of a digitization workflow for the collection at hand, implementation and installation of the software tools needed for metadata entry, and, finally, online publication of the digital archive for researchers and the public.

The presentation will demonstrate the workflow system being implemented to manage this massive press collection, which has yielded more than 400,000 pages to date. It will shed some light on the BA’s Digital Assets Factory (DAF), the nucleus upon which the digitization process of the CEDEJ collection has been built. Additionally, the presentation will discuss the tools implemented for ingesting data into the digitization process, from indexing through the creation of the batches that are ingested into the system. The outflow will also be discussed in terms of organizing and grouping multipart press clips, as well as the review, validation, and correction of the output. Finally, it will cover the challenges encountered in associating the accessible online archive with a powerful search engine that supports multidimensional search while maintaining a user-friendly navigation experience.

Page 1: Managing the Digitization of Large Press Archives
Page 2: Managing the Digitization of Large Press Archives

The New Library of Alexandria Overview

Bibliotheca Alexandrina (BA)  

Page 3: Managing the Digitization of Large Press Archives

  • Center of excellence in the production and dissemination of knowledge
  • Place of dialogue, learning and understanding between cultures and peoples

Page 4: Managing the Digitization of Large Press Archives

  • The World’s Window on Egypt
  • Egypt’s Window on the World
  • Instrument for Rising to the Challenges of the Digital Age
  • Center for Dialogue Between Peoples and Civilizations

Page 5: Managing the Digitization of Large Press Archives

Not just a Library of Books but rather a vast cultural and scientific complex

Page 6: Managing the Digitization of Large Press Archives

A library that can accommodate millions of books  

Page 7: Managing the Digitization of Large Press Archives

http://archive.bibalex.org

Page 8: Managing the Digitization of Large Press Archives

Page 9: Managing the Digitization of Large Press Archives
Page 10: Managing the Digitization of Large Press Archives
Page 11: Managing the Digitization of Large Press Archives
Page 12: Managing the Digitization of Large Press Archives
Page 13: Managing the Digitization of Large Press Archives
Page 14: Managing the Digitization of Large Press Archives

Page 15: Managing the Digitization of Large Press Archives

http://descegy.bibalex.org

Page 16: Managing the Digitization of Large Press Archives

http://lartarab.bibalex.org

Page 17: Managing the Digitization of Large Press Archives

More than 230,000 Arabic books are freely available online for Arabic readers worldwide

Page 18: Managing the Digitization of Large Press Archives

http://suezcanal.bibalex.org

Page 19: Managing the Digitization of Large Press Archives

Page 20: Managing the Digitization of Large Press Archives

http://naguib.bibalex.org/

Page 21: Managing the Digitization of Large Press Archives

http://nasser.bibalex.org

Page 22: Managing the Digitization of Large Press Archives

http://sadat.bibalex.org

Page 23: Managing the Digitization of Large Press Archives
Page 24: Managing the Digitization of Large Press Archives

  • Project Overview
  • Collection Overview
  • Data Representation
  • System Workflow
      - DAF (Digital Assets Factory)
      - Cataloguing
      - Website
          * Solr search engine
          * Article Viewer

Page 25: Managing the Digitization of Large Press Archives

Page 26: Managing the Digitization of Large Press Archives

  • The Centre for Economic, Judicial, and Social Study and Documentation (CEDEJ) collaborated with the Bibliotheca Alexandrina (BA) to digitize its massive archive of press articles
  • The project consists of multiple modules to:
      - Index the press archive collection
      - Control the data entry workflow
      - Digitize and process data
      - Catalogue and review articles
      - Publish the archive on the web

Page 27: Managing the Digitization of Large Press Archives

Page 28: Managing the Digitization of Large Press Archives

  • Package of press archive
      - 800,000+ press clips, varying between:
          * Press
          * Reports
      - 500+ publishers
      - 60,000+ writers and reporters
      - 200 different subjects
          * Economics, politics, social life, etc.
      - Archive languages:
          * Arabic, English and French
      - Date range: 1966 to 2009

Page 29: Managing the Digitization of Large Press Archives

  • Finished so far
      - 115,000 press clips, varying between:
          * Press
          * Reports
      - 200 publishers
      - 14,000 writers and reporters
      - 100 different subjects
          * Economics, politics, social life, etc.
      - Archive languages:
          * Arabic, English and French
      - Date range: 1966 to 2009

Page 30: Managing the Digitization of Large Press Archives

Page 31: Managing the Digitization of Large Press Archives

  • A list of the packaged press archive is submitted to Bibliotheca Alexandrina to be scanned and catalogued
  • The source of the data is a collection of boxes
  • Each box is organized in the following hierarchy:
      - Folder
      - File
      - Sub-File
      - Document
  • A document represents a single page of press
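The box hierarchy described on this slide can be pictured as a small nested data structure. The sketch below is only an illustration: the class and field names (`Box`, `folders`, `page_id`, and so on) are assumptions and do not come from the DAF or the BA's actual data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    """A single scanned page of press material (hypothetical model)."""
    page_id: str


@dataclass
class SubFile:
    name: str
    documents: List[Document] = field(default_factory=list)


@dataclass
class File:
    name: str
    sub_files: List[SubFile] = field(default_factory=list)


@dataclass
class Folder:
    name: str
    files: List[File] = field(default_factory=list)


@dataclass
class Box:
    name: str
    folders: List[Folder] = field(default_factory=list)

    def page_count(self) -> int:
        # Walk Box -> Folder -> File -> Sub-File and count the documents (pages).
        return sum(
            len(sub_file.documents)
            for folder in self.folders
            for file in folder.files
            for sub_file in file.sub_files
        )


# Illustrative box with one folder, one file, one sub-file and two pages.
box = Box(
    "box-001",
    [Folder("folder-A", [File("file-1", [SubFile("sub-1", [Document("p1"), Document("p2")])])])],
)
print(box.page_count())
```

Counting pages per box in this way is one simple check that could be run against the submitted list before scanning starts.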

Page 32: Managing the Digitization of Large Press Archives
Page 33: Managing the Digitization of Large Press Archives
Page 34: Managing the Digitization of Large Press Archives
Page 35: Managing the Digitization of Large Press Archives
Page 36: Managing the Digitization of Large Press Archives
Page 37: Managing the Digitization of Large Press Archives
Page 38: Managing the Digitization of Large Press Archives
Page 39: Managing the Digitization of Large Press Archives

Article Creation

Page 40: Managing the Digitization of Large Press Archives

Article Metadata

Page 41: Managing the Digitization of Large Press Archives

Lookups Management

Page 42: Managing the Digitization of Large Press Archives

Reports

Page 43: Managing the Digitization of Large Press Archives
Page 44: Managing the Digitization of Large Press Archives
Page 45: Managing the Digitization of Large Press Archives
Page 46: Managing the Digitization of Large Press Archives

  • The Solr search engine is based on the Apache Lucene project (v4.1)
  • The SolrNet API is used to connect to the Solr server
  • Features
      - Simple/advanced search
      - Results highlighting
      - Field autocomplete
      - Text search (Article Viewer)
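As a rough illustration of how a highlighted search could be requested from a Solr server, the sketch below builds a select URL using standard Solr query parameters (`q`, `hl`, `hl.fl`, `rows`, `wt`). The slides state that the project connects through SolrNet; Python is used here only for illustration, and the base URL, the core name `articles`, and the field name `content` are assumptions, not details from the presentation.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, text: str, rows: int = 10,
                     field: str = "content") -> str:
    """Build a Solr select URL for a phrase search with highlighting.

    `field` is a hypothetical name for the field holding the article text.
    """
    # Escape backslashes and double quotes so the phrase query stays well-formed.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'{field}:"{escaped}"',  # phrase query on the text field
        "hl": "true",                 # enable results highlighting
        "hl.fl": field,               # field to highlight
        "rows": rows,                 # number of results to return
        "wt": "json",                 # response format
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr core named "articles".
url = build_search_url("http://localhost:8983/solr/articles", "Suez Canal")
print(url)
```

Keeping the query construction in one function makes it easier to share the same parameters between the simple and advanced search forms.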

Page 47: Managing the Digitization of Large Press Archives
Page 48: Managing the Digitization of Large Press Archives
Page 49: Managing the Digitization of Large Press Archives
Page 50: Managing the Digitization of Large Press Archives
Page 51: Managing the Digitization of Large Press Archives
Page 52: Managing the Digitization of Large Press Archives
Page 53: Managing the Digitization of Large Press Archives
Page 54: Managing the Digitization of Large Press Archives

  • The article viewer is used for previewing articles
      - It is one of multiple viewers developed at BA
  • Architecture
      - Server side: RESTful services
      - Client side: JavaScript using JSONP
  • Features
      - Image preview
      - Metadata preview
      - Text selection
      - Searching/highlighting
      - Zooming options: fit width/height
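The JSONP pattern named in the architecture can be sketched on the server side as wrapping a JSON payload in a call to a client-supplied callback function. This is a minimal sketch in Python for illustration only; the function name `jsonp_response`, the callback name, and the payload fields are assumptions rather than the BA viewer's actual interface.

```python
import json
import re

# Accept only plain (optionally dotted) JavaScript identifiers as callback names,
# so that arbitrary script cannot be injected through the callback parameter.
_CALLBACK_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*(\.[A-Za-z_$][A-Za-z0-9_$]*)*")


def jsonp_response(callback: str, payload: dict) -> str:
    """Wrap a JSON payload in a call to the client-supplied callback."""
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError("invalid JSONP callback name")
    return f"{callback}({json.dumps(payload)});"


# Hypothetical metadata payload for an article.
body = jsonp_response("showMetadata", {"pageCount": 3})
print(body)
```

The client loads such a response through a script tag, which is what allows the JavaScript viewer to call services hosted on a different domain.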

Page 55: Managing the Digitization of Large Press Archives

  • Viewer web services
      - Metadata web service:
          * Retrieves article catalogue metadata
          * Returns technical information (width, height, page count, etc.)
      - Content web service:
          * Retrieves the image of each page in the article, scaled responsively to a custom width and height
          * Returns the selected text based on the area highlighted by the user
      - Search web service:
          * Performs the search in the content of the articles using the Solr engine APIs
          * Highlights the matching phrases in the article image
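One possible way to return the text under a highlighted area is to keep a bounding box for every recognized word and collect the words whose boxes overlap the selection rectangle. The sketch below assumes that word coordinates are available (for example from OCR output) and that words are stored in reading order; the slides do not describe how the content web service actually implements this, and all names here are hypothetical.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    """A recognized word and its bounding box in page-image pixels (assumed data)."""
    text: str
    x: int
    y: int
    width: int
    height: int


def overlaps(word: Word, left: int, top: int, right: int, bottom: int) -> bool:
    # Two axis-aligned rectangles overlap when they intersect on both axes.
    return (word.x < right and word.x + word.width > left
            and word.y < bottom and word.y + word.height > top)


def selected_text(words: List[Word], left: int, top: int,
                  right: int, bottom: int) -> str:
    """Join the words whose boxes overlap the highlighted rectangle."""
    return " ".join(w.text for w in words if overlaps(w, left, top, right, bottom))


# Illustrative page with two words on the first line and one on the second.
page_words = [
    Word("Suez", 10, 10, 40, 12),
    Word("Canal", 60, 10, 50, 12),
    Word("opens", 10, 40, 50, 12),
]
print(selected_text(page_words, 0, 0, 120, 30))
```

If the viewer scales the page image, the selection rectangle would first need to be converted back to the coordinates of the original image before this lookup.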

Page 56: Managing the Digitization of Large Press Archives
Page 57: Managing the Digitization of Large Press Archives
Page 58: Managing the Digitization of Large Press Archives