Digitisation of Newspapers
The South African Experience
Patricia Liebetrau
IFLA Newspaper Conference, New Delhi, 26-28 February 2010
Introduction
Durban … a multicultural city
Digital Innovation South Africa (DISA)
National collaborative initiative
Creating online resources for education, research and training
Make accessible online SA material of high socio-political value
Collated serial literature scattered across collections
Develop local expertise in use of advanced digital technologies
Set standards for digitisation initiatives in SA
DISA
Identify appropriate collections
Distributed digital production
Gateway to federated digital collections
Develop policies, strategies and guidelines in support of SA initiatives
Comply with international standards
Bridge digital gap between northern and southern hemispheres
http://www.disa.ukzn.ac.za
Campbell Collections @UKZN
Digital microfilm scanner
Obsolete technology
Preservation of microfilms
Newspapers and MSS on microfilm
Data transfer
Application to DISA
Digitising microfilm
Samples were tested using the following:
1 bit at 300dpi
1 bit at 400dpi
1 bit at 600dpi
8bit greyscale at 300dpi with thresholding at 128
8bit greyscale at 400dpi with thresholding at 128
8bit greyscale at 600dpi with thresholding at 128
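The thresholding step in the greyscale tests above can be sketched as follows. This is a minimal illustration, not the scanner software's actual routine: the function name and the rule that values at or above 128 become white are assumptions, since the slides give only the threshold value.

```python
THRESHOLD = 128  # threshold value quoted in the test settings


def threshold_greyscale(pixels, threshold=THRESHOLD):
    """Convert rows of 8-bit greyscale values (0-255) to 1-bit values.

    Assumption: values at or above the threshold become white (1)
    and values below it become black (0).
    """
    return [[1 if value >= threshold else 0 for value in row] for row in pixels]


# One row of sample pixels: two dark values, two light values.
bitonal = threshold_greyscale([[0, 127, 128, 255]])
```

A fixed threshold treats every page the same way, so a faded or unevenly exposed microfilm frame may lose faint text that a higher bit depth would have kept.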
Comparisons
Sample 1: Scanned on flat bed scanner at 300dpi 8bit greyscale from unbound original
Sample 2: Scanned using Minolta MS7000 microfilm scanner at 300dpi 8bit greyscale
– microfilm copy looks as though it was bound
One would have to conclude from this example that perhaps the microfilm was not captured correctly
OCR recognition
The rate of word return from OCR would clearly be far higher for the first of the previous two samples than for the second
Conclusion
Some microfilms are better than others: the resulting scan is only as good as the original microfilm
OCR’ed text
Big no to constitution as elections draw near 11,HOUSANDS of peo-ple have
rejected the Government's new constitution under which elections for In-dian
and coloured chambers of Parlia-nent are to take place n August.
Reports from around the country talk of feverish activity as the biggest issue
facing the country nears its climax." The elections, to be
held on the 22nd and - 28th of August, is seen as an issue which con-cerns
all South Africans. The African com-munity in particular is" leading the call for a boycott of the elec` tions.Mr. Popo
Molefe, the national secretary of the United Democratic Front (UDF),
said the centralissue was the 'denationalisation of the African people'.'We call on our peo-ole in Eldorado Park, Reiger Park, Acton-ville and Lenasia, to boycott the August elections.'We call on our peo- ple to refuse to bepartners in the crime of Apartheid against the majority of SouthAfricans.'
Indexing
Manual indexing!
Encoded using the international Text Encoding Initiative (TEI) guidelines, later mapped to the Dublin Core (DC) metadata element set
Metadata capture: publisher, place and date of publication at journal/newspaper level
Indexing of title, author and keywords at article level
XML based
Articles over several pages
English language
Capturing journal metadata
<teiheader type="journal" status="new" teiform="teiHeader">
  <filedesc teiform="fileDesc">
    <titlestmt teiform="titleStmt">
      <title teiform="title">Speak: the voice of the community</title>
      <title teiform="title">Volume 2 No 3</title>
    </titlestmt>
    <publicationstmt teiform="publicationStmt">
      <publisher teiform="publisher">DISA Digital Innovation of South Africa</publisher>
      <pubplace teiform="pubPlace">Durban, South Africa</pubplace>
      <date teiform="date">2002</date>
      <idno teiform="idno">1684.5188.002.003.Jul1984</idno>
    </publicationstmt>
    <sourcedesc default="no" teiform="sourceDesc">
      <biblfull default="no" teiform="biblFull">
        <titlestmt teiform="titleStmt">
          <title teiform="title">Speak: the voice of the community</title>
          <title teiform="title">Volume 2 No 3<date teiform="date">July 1984</date></title>
          <editor role="editor" teiform="editor"></editor>
        </titlestmt>
        <extent teiform="extent">16 pages</extent>
        <publicationstmt teiform="publicationStmt">
          <publisher teiform="publisher">Speak Community Newspaper Project</publisher>
          <pubplace teiform="pubPlace">Johannesburg</pubplace>
          <date teiform="date">July 1984</date>
        </publicationstmt>
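As an illustration of how a header like this might be mapped to Dublin Core, the sketch below reads a shortened, closed copy of the file-level fields and collects them under DC element names. The crosswalk table and function name are hypothetical; the slides state only that the TEI encoding was later mapped to DC, not which element went where.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the file-level part of the header, closed so it parses.
SAMPLE_HEADER = """
<teiheader type="journal">
  <filedesc>
    <titlestmt>
      <title>Speak: the voice of the community</title>
      <title>Volume 2 No 3</title>
    </titlestmt>
    <publicationstmt>
      <publisher>DISA Digital Innovation of South Africa</publisher>
      <pubplace>Durban, South Africa</pubplace>
      <date>2002</date>
      <idno>1684.5188.002.003.Jul1984</idno>
    </publicationstmt>
  </filedesc>
</teiheader>
"""

# Hypothetical crosswalk: path below <teiheader> -> Dublin Core element name.
TEI_TO_DC = {
    "filedesc/titlestmt/title": "title",
    "filedesc/publicationstmt/publisher": "publisher",
    "filedesc/publicationstmt/date": "date",
    "filedesc/publicationstmt/idno": "identifier",
}


def tei_header_to_dc(xml_text):
    """Return a dict of DC element name -> list of values found in the header."""
    root = ET.fromstring(xml_text)
    record = {}
    for path, dc_name in TEI_TO_DC.items():
        values = [
            element.text.strip()
            for element in root.findall(path)
            if element.text and element.text.strip()
        ]
        if values:
            record[dc_name] = values
    return record


dc_record = tei_header_to_dc(SAMPLE_HEADER)
```

Dublin Core elements are repeatable, so both title lines are kept as separate values instead of being merged into one string.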
Search and browse
Browsing facilities
– browse the text images
Searching facilities
– full text searching
– article title, author and keyword searching
– thesaurus
– acronyms
Readability and advanced searchability
Indexing results
Advanced searchability on all the encoded elements
By using terms from a thesaurus, language usage is standardised
Higher relevance of returned hits
Added intellectual input
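The way thesaurus terms standardise language usage can be sketched as a lookup from variant keywords and acronyms to preferred terms. The thesaurus entries and the function name below are hypothetical, chosen only for illustration; they are not taken from DISA's thesaurus.

```python
# Hypothetical thesaurus: variant or acronym (lower case) -> preferred term.
THESAURUS = {
    "udf": "United Democratic Front",
    "united democratic front": "United Democratic Front",
    "election": "Elections",
    "elections": "Elections",
    "polls": "Elections",
}


def normalise_keywords(keywords):
    """Map each keyword to its preferred term, dropping duplicates in order."""
    preferred = []
    for keyword in keywords:
        key = " ".join(keyword.lower().split())  # ignore case and extra spaces
        term = THESAURUS.get(key, keyword.strip())
        if term not in preferred:
            preferred.append(term)
    return preferred


index_terms = normalise_keywords(["UDF", "polls", "Elections", "Boycott"])
```

Because every variant resolves to a single preferred term, a search on that term returns all the articles indexed under any of its variants.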
However …
Human indexing is time and labour intensive
Training is required
Quality control is needed
Thesaurus management software is essential
Languages and translations
• African vernacular languages
• Translation challenges for a global context
• OCR challenges
• OCR training for African languages not yet developed
• Automated translation not yet possible
• Extraction of metadata useful
Language examples
Hindi Zulu
South African newspaper digitisation
Rich collections in the vernacular
Poor quality microfilms
Low OCR success rate on microfilm scans
Level of metadata complexity
Minimal manual indexing
Cost of staff time
Service on demand
Lack of national guidelines
Lack of national funding
Conclusions
Volume of newspapers and information
Value of digitisation
Rich source of South African social history
Vernacular
Teaching, learning and research value
Dedicated newspaper digitisation project
Overcome challenges!
Recommendations
National consultation
National support
Prioritisation
Role of publishers
DISA consultancy
Contact details
Patricia Liebetrau, Director, DISA
Email: [email protected]
URL: http://www.disa.ukzn.ac.za
This presentation is made available under a Creative Commons Attribution 2.0 South Africa license.