
Humanistinen tiedekunta

Senka Drobac and Pekka Kauppinen and Krister Lindén

Improving OCR of historical newspapers and journals published in Finland by adding Swedish training data



Motivation

•Corpus of historical newspapers and magazines that has been digitized by the National Library of Finland

•OCR was done with the commercial software ABBYY FineReader

•Character accuracy rate (CAR): ~ 90-91%


Figure from: Vesanto, Aleksi, et al. "A system for identifying and exploring text repetition in large historical document corpora." Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. No. 131. Linköping University Electronic Press, 2017.


Ocropy

•Decided to train models with Ocropy + post processing

•Ocropy:

• Open source, uses LSTM, line based

• Tools for preprocessing, segmentation, training, recognition, evaluation

• Above 98.5% CAR on 19th- and 20th-century German texts


OCR workflow (Ocropy)

Image -> Binarization (pre-processing) -> Binarized image -> Line segmentation -> Line images -> OCR (pre-trained model) -> Text (lines) -> Post-processing -> Text (lines)
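The first three stages of this workflow can be sketched as calls to the Ocropy command-line tools. This is a minimal sketch, not the authors' actual setup: the command names and flags follow the Ocropy README and should be checked against the installed version, the file names (`page.png`, `book`, `fin.pyrnn.gz`) are hypothetical, and the post-processing stage is not covered because the slides do not describe it.

```python
import subprocess


def ocropy_commands(page_image, workdir, model_path):
    """Build the Ocropy command lines for one page (nothing is executed here).

    Assumption: the glob patterns are passed through unexpanded, as in the
    quoted patterns of the Ocropy README, and expanded by the tools themselves.
    """
    return [
        # 1. Binarization (pre-processing): writes binarized pages into workdir.
        ["ocropus-nlbin", page_image, "-o", workdir],
        # 2. Line segmentation: cuts each binarized page into line images.
        ["ocropus-gpageseg", f"{workdir}/????.bin.png"],
        # 3. OCR: recognizes each line image with a trained model.
        ["ocropus-rpred", "-m", model_path, f"{workdir}/????/??????.bin.png"],
    ]


def run_workflow(page_image, workdir, model_path):
    """Run the stages in order and stop at the first failing command."""
    for cmd in ocropy_commands(page_image, workdir, model_path):
        subprocess.run(cmd, check=True)
```

Calling `run_workflow("page.png", "book", "fin.pyrnn.gz")` would require Ocropy to be installed and on the `PATH`; `ocropy_commands` alone can be used to inspect the commands first.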


Data

1771-1919

Languages: Finnish and Swedish

Typefaces: Fraktur and Antiqua


• Good quality

• Finnish Fraktur

• One column


• Good quality

• Swedish Antiqua

• Two columns


• Binarized image

• Difficult segmentation


• Binarized image

• Challenging segmentation

• Many different fonts on one page


• Both Finnish and Swedish on the same page


• Poor quality


Line examples - Fraktur

☛ För billigt pris: En kursläde i garden

Sananlennätinkonttori awoinna joka päiwä

-— Salama i s k i tiistai yönä klo

pitänyt tarpeellisena warata jonkunlaisen


Line examples – Antiqua

osakkaat kutsutaan täten varsinaiseen yhtiö-

nuksia määräämälleen rautatiease-

m stammanträda i nämnde kontors loka

Heines poetische Werke. I två band. 17 m.


Ocropy + post-processing results

•Finnish data sets:

• CAR: 93.5% - 94.83%

• After post-processing CAR: 93.68% - 95.21%

•It is better to randomly sample lines from the entire corpus than to train on all lines from 250 pages
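The sampling strategy in the last bullet can be sketched as follows. This is an illustration under assumptions, not the authors' sampling code: the corpus is represented as a hypothetical dictionary mapping page ids to lists of line ids, and the function names are made up for this example.

```python
import random


def sample_lines_from_corpus(pages, n_lines, seed=0):
    """Draw n_lines uniformly at random from all lines of all pages.

    pages: dict mapping a page id to the list of line ids on that page
    (a hypothetical representation of the corpus).
    """
    all_lines = [(page_id, line_id)
                 for page_id, line_ids in sorted(pages.items())
                 for line_id in line_ids]
    if n_lines > len(all_lines):
        raise ValueError("corpus has fewer lines than requested")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(all_lines, n_lines)


def all_lines_from_pages(pages, page_ids):
    """The contrasted strategy: every line from a fixed set of pages."""
    return [(page_id, line_id)
            for page_id in page_ids
            for line_id in pages[page_id]]


# Toy corpus: 100 pages with 20 lines each.
toy_corpus = {page: list(range(20)) for page in range(100)}
sampled = sample_lines_from_corpus(toy_corpus, 50)
page_based = all_lines_from_pages(toy_corpus, [0, 1, 2])
```

Sampling across the whole corpus spreads the training lines over many pages, and therefore over more fonts, print qualities and years, than taking every line from a small set of pages.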


• Lots of Swedish material -> add Swedish training data

Finnish: ~10 000 training lines (randomly picked), ~75% Fraktur, ~25% Antiqua

Swedish: ~3 300 training lines (randomly picked), ~50% Fraktur, ~50% Antiqua


Experiments

Test set      FIN MODEL       SWE MODEL
Fin-Fraktur   95.43 / 78.79   93.2 / 69.61
Fin-Antiqua   85.81 / 53.36   88.89 / 62.32
Swe-Fraktur   78.84 / 40.43   87.59 / 55.32
Swe-Antiqua   79.93 / 40.01   90.66 / 66.36

Test set      FIN + SWE 1     FIN + SWE 2     FIN + SWE 3
Fin-Fraktur   96.19 / 81.91   95.07 / 76.65   94.97 / 76.13
Fin-Antiqua   89.35 / 63.35   87.23 / 58.22   86.64 / 55.79
Swe-Fraktur   82.53 / 51.11   80.76 / 43.48   83.22 / 45.65
Swe-Antiqua   86.65 / 59.84   83.69 / 49.49   84.88 / 52.5

SWE 1: 840 lines

SWE 2: 1 680 lines

SWE 3: 3 360 lines

Results show CAR (%) / WAR (%)

Not enough Finnish Antiqua in training

FIN: 10 000 lines

Language is important!
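The slides report CAR and WAR without defining how they were computed. A minimal sketch, assuming the common definition accuracy = (N - edit distance) / N, where N is the number of characters (CAR) or whitespace-separated words (WAR) in the ground truth; the function names are illustrative and this is not the evaluation tool used for the tables.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy_rate(ref, hyp):
    """Accuracy in percent relative to the length of the ground truth."""
    if len(ref) == 0:
        raise ValueError("ground truth must not be empty")
    return 100.0 * (len(ref) - edit_distance(ref, hyp)) / len(ref)


def car(ref_text, ocr_text):
    """Character accuracy rate: compares the texts character by character."""
    return accuracy_rate(ref_text, ocr_text)


def war(ref_text, ocr_text):
    """Word accuracy rate: compares whitespace-separated words."""
    return accuracy_rate(ref_text.split(), ocr_text.split())
```

Under this definition a single wrong character costs one character in CAR but a whole word in WAR, which is consistent with the WAR figures in the tables being much lower than the CAR figures.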


Conclusions

•Need more Swedish data

•Need more Finnish Antiqua data

•Is it possible to train one model for everything?
