Humanistinen tiedekunta

Senka Drobac, Pekka Kauppinen and Krister Lindén

Improving OCR of historical newspapers and journals published in Finland by adding Swedish training data



Motivation

•Corpus of historical newspapers and magazines digitized by the National Library of Finland

•OCR was done with the commercial software ABBYY FineReader

•Character accuracy rate (CAR): ~90-91%


Figure from: Vesanto, Aleksi, et al. "A system for identifying and exploring text repetition in large historical document corpora." Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. No. 131. Linköping University Electronic Press, 2017.


Ocropy

•Decided to train models with Ocropy + post processing

•Ocropy:

• Open source, uses LSTM, line based

• Tools for preprocessing, segmentation, training, recognition, evaluation

• Above 98.5% CAR reported on 19th- and 20th-century German texts


OCR workflow (Ocropy)

Image -> Binarization (pre-processing) -> Binarized image -> Line segmentation -> Line images -> OCR (pre-trained model) -> Text (lines) -> Post-processing -> Text (lines)
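The first three stages of this workflow map onto Ocropy's command-line tools. The sketch below only builds the command lines for one page, assuming the standard tool names `ocropus-nlbin`, `ocropus-gpageseg` and `ocropus-rpred`; the file names `page.png`, `work` and `model.pyrnn.gz` are hypothetical placeholders, and the post-processing stage is left out because the slides do not say how it was implemented.

```python
import shlex
from pathlib import PurePosixPath


def ocropy_commands(page_image, workdir, model):
    """Return (stage, shell command) pairs for one page, in workflow order.

    All paths are illustrative assumptions; nothing is executed here.
    """
    work = PurePosixPath(workdir)
    steps = [
        # Binarization (pre-processing) of the scanned page image.
        ("binarization", ["ocropus-nlbin", str(page_image), "-o", str(work)]),
        # Line segmentation of the binarized page into line images.
        ("line segmentation", ["ocropus-gpageseg", str(work / "????.bin.png")]),
        # Recognition of every line image with a trained model.
        ("recognition",
         ["ocropus-rpred", "-m", str(model),
          str(work / "????" / "??????.bin.png")]),
    ]
    # shlex.join quotes the glob patterns, as in the Ocropy README examples.
    return [(stage, shlex.join(cmd)) for stage, cmd in steps]


for stage, command in ocropy_commands("page.png", "work", "model.pyrnn.gz"):
    print(f"{stage}: {command}")
```

Printing the commands instead of running them keeps the sketch independent of a local Ocropy installation; they can be pasted into a shell once the placeholder paths are replaced.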


Data

Years: 1771-1919

Languages: Finnish and Swedish

Typefaces: Fraktur and Antiqua


• Good quality

• Finnish Fraktur

• One column


• Good quality

• Swedish Antiqua

• Two columns


• Binarized image

• Difficult segmentation


• Binarized image

• Challenging segmentation

• Many different fonts on one page


• Both Finnish and Swedish on the same page


• Poor quality


Line examples - Fraktur

☛ För billigt pris: En kursläde i garden

Sananlennätinkonttori awoinna joka päiwä

-— Salama i s k i tiistai yönä klo

pitänyt tarpeellisena warata jonkunlaisen


Line examples – Antiqua

osakkaat kutsutaan täten varsinaiseen yhtiö-

nuksia määräämälleen rautatiease-

m stammanträda i nämnde kontors loka

Heines poetische Werke. I två band. 17 m.


Ocropy + post-processing results

•Finnish data sets:

• CAR: 93.5% - 94.83%

• After post-processing CAR: 93.68% - 95.21%

•Randomly sampling training lines from the entire corpus works better than training on all lines from 250 pages
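The two ways of selecting training lines compared here can be sketched as follows. This is a minimal illustration, assuming the corpus is available as a dictionary from page identifier to that page's list of lines; the function names and the toy corpus are hypothetical and do not come from the slides.

```python
import random


def sample_lines_from_corpus(pages, n_lines, seed=0):
    """Pool every line of every page, then draw n_lines uniformly at random."""
    pool = [(page_id, line) for page_id, lines in pages.items() for line in lines]
    return random.Random(seed).sample(pool, n_lines)


def all_lines_from_pages(pages, n_pages, seed=0):
    """Alternative: pick n_pages whole pages and keep every line on them."""
    chosen = random.Random(seed).sample(sorted(pages), n_pages)
    return [(page_id, line) for page_id in chosen for line in pages[page_id]]


# Toy corpus: 100 pages with 20 lines each (placeholder strings, not real data).
corpus = {f"page{p:03d}": [f"line {i}" for i in range(20)] for p in range(100)}

by_line = sample_lines_from_corpus(corpus, n_lines=200)
by_page = all_lines_from_pages(corpus, n_pages=10)

# Both selections contain 200 lines, but they cover different numbers of pages.
print(len({page for page, _ in by_line}), "pages covered by line sampling")
print(len({page for page, _ in by_page}), "pages covered by page sampling")
```

With the same line budget, sampling by line draws on many more pages, and so on more fonts, layouts and print qualities, which is one plausible reason for the better result.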


• Lots of Swedish material -> add Swedish training data

Finnish: ~10 000 training lines (randomly picked), ~75% Fraktur, ~25% Antiqua

Swedish: ~3 300 training lines (randomly picked), ~50% Fraktur, ~50% Antiqua
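A mixed training set of this kind can be assembled by adding a random subset of the Swedish lines to the full Finnish set. The sketch below assumes both sets are plain Python lists; `build_training_set` is a hypothetical helper, the line contents are placeholders, and the slides do not state how the Swedish subsets were actually drawn.

```python
import random


def build_training_set(finnish_lines, swedish_lines, n_swedish, seed=0):
    """Combine all Finnish lines with n_swedish randomly chosen Swedish lines."""
    rng = random.Random(seed)
    extra = rng.sample(swedish_lines, n_swedish)
    combined = list(finnish_lines) + extra
    rng.shuffle(combined)  # avoid presenting all Finnish lines before Swedish ones
    return combined


# Placeholder data with the approximate sizes given on the slide.
finnish = [f"fin-{i}" for i in range(10_000)]
swedish = [f"swe-{i}" for i in range(3_300)]

# 840 is one of the Swedish subset sizes used in the experiments.
mixed = build_training_set(finnish, swedish, n_swedish=840)
print(len(mixed))
```

Fixing the seed makes the subset reproducible, so that models trained with different amounts of Swedish data can be compared on the same draw.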


Experiments

Test set       FIN MODEL        SWE MODEL
Fin-Fraktur    95.43 / 78.79    93.2 / 69.61
Fin-Antiqua    85.81 / 53.36    88.89 / 62.32
Swe-Fraktur    78.84 / 40.43    87.59 / 55.32
Swe-Antiqua    79.93 / 40.01    90.66 / 66.36

Test set       FIN + SWE 1      FIN + SWE 2      FIN + SWE 3
Fin-Fraktur    96.19 / 81.91    95.07 / 76.65    94.97 / 76.13
Fin-Antiqua    89.35 / 63.35    87.23 / 58.22    86.64 / 55.79
Swe-Fraktur    82.53 / 51.11    80.76 / 43.48    83.22 / 45.65
Swe-Antiqua    86.65 / 59.84    83.69 / 49.49    84.88 / 52.5

SWE 1: 840 lines

SWE 2: 1 680 lines

SWE 3: 3 360 lines

Results show CAR (%) / WAR (%)

Not enough Finnish Antiqua in training

FIN: 10 000 lines

Language is important!
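The CAR and WAR figures above compare OCR output against ground-truth lines. The slides do not spell out the formula, so the sketch below assumes the common definition: accuracy is 100 × (1 − edit distance / reference length), computed over characters for CAR and over whitespace-separated words for WAR. The OCR line in the example is invented for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(ref, hyp):
    """Accuracy in percent: 100 * (1 - edits / reference length)."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * (1.0 - edit_distance(ref, hyp) / len(ref))


def car(ref_line, ocr_line):
    """Character accuracy rate of one OCR line against its ground truth."""
    return accuracy(ref_line, ocr_line)


def war(ref_line, ocr_line):
    """Word accuracy rate, using whitespace-separated tokens as words."""
    return accuracy(ref_line.split(), ocr_line.split())


truth = "Sananlennätinkonttori awoinna joka päiwä"
ocr = "Sananlennatinkonttori awoinna joka päiwa"  # two invented character errors

print(f"CAR {car(truth, ocr):.2f}")  # CAR 95.00
print(f"WAR {war(truth, ocr):.2f}")  # WAR 50.00
```

Two wrong characters out of 40 cost five points of CAR, but because they fall in two different words they halve the WAR of this four-word line. This is why the WAR columns in the tables sit so far below the CAR columns.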


Conclusions

•Need more Swedish data

•Need more Finnish Antiqua data

•Is it possible to train one model for everything?
