Page 1

Humanistinen tiedekunta

Senka Drobac and Pekka Kauppinen and Krister Lindén

Improving OCR of historical newspapers and journals published in Finland by adding Swedish training data

Page 2

Motivation

• Corpus of historical newspapers and magazines digitized by the National Library of Finland

• OCR was done with the commercial software ABBYY FineReader

• Character accuracy rate (CAR): ~90-91%
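The slides do not spell out how CAR is computed. A common definition, assumed here, is the share of ground-truth characters that survive after counting edit-distance errors. The minimal sketch below uses that definition; the function names and the OCR error in the example are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy_rate(ref, hyp):
    """Accuracy in percent: 100 * (len(ref) - errors) / len(ref)."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * (len(ref) - edit_distance(ref, hyp)) / len(ref)


# Ground-truth line taken from the Fraktur examples in these slides.
ground_truth = "pitänyt tarpeellisena warata jonkunlaisen"
# Hypothetical OCR output with one wrong character (s read as f).
ocr_output = "pitänyt tarpeellisena warata jonkunlaifen"

car = accuracy_rate(ground_truth, ocr_output)                   # character level
war = accuracy_rate(ground_truth.split(), ocr_output.split())   # word level
print(f"CAR: {car:.2f}%  WAR: {war:.2f}%")
```

Applying the same function to whitespace-separated tokens gives a word accuracy rate (WAR). A single wrong character costs one character but a whole word, which is why WAR figures are typically much lower than CAR figures.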

Page 3

Figure from: Vesanto, Aleksi, et al. "A system for identifying and exploring text repetition in large historical document corpora." Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. No. 131. Linköping University Electronic Press, 2017.

Page 4

Ocropy

• Decided to train models with Ocropy + post-processing

• Ocropy:

  • Open source, LSTM-based, works line by line

  • Tools for preprocessing, segmentation, training, recognition, and evaluation

  • Above 98.5% CAR on 19th- and 20th-century German texts

Page 5

OCR workflow (Ocropy)

Image -> Binarization (pre-processing) -> Binarized image -> Line segmentation -> Line images -> OCR (pre-trained model) -> Text (lines) -> Post-processing -> Text (lines)
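To make the first two stages of the workflow concrete, here is a toy sketch of binarization and line segmentation. It is not Ocropy's implementation. A fixed global threshold and a horizontal projection profile are swapped in as much simpler stand-ins, and all function names and the tiny test image are hypothetical.

```python
def binarize(gray, threshold=128):
    """Global threshold on a grayscale image given as rows of 0-255 values.

    Returns 1 for ink (dark pixels) and 0 for background.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def segment_lines(binary):
    """Find text lines as runs of consecutive rows that contain ink.

    Returns a list of (top, bottom) row indices, bottom exclusive.
    """
    lines = []
    start = None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                    # a text line begins
        elif not has_ink and start is not None:
            lines.append((start, y))     # the text line ends at a blank row
            start = None
    if start is not None:                # line touching the bottom edge
        lines.append((start, len(binary)))
    return lines


# Tiny hypothetical "page": two dark bands separated by a white row.
page = [
    [255, 255, 255, 255],
    [255,   0,  30, 255],
    [255,  20, 255, 255],
    [255, 255, 255, 255],
    [255, 255,  10, 255],
    [255, 255, 255, 255],
]

line_boxes = segment_lines(binarize(page))
print(line_boxes)
```

Each (top, bottom) pair would then be cropped into a line image and passed to the recognizer. Real newspaper pages need column detection and skew handling as well, which this sketch ignores.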

Page 6

Data

Years: 1771-1919

Languages: Finnish and Swedish

Typefaces: Fraktur and Antiqua

Page 7

• Good quality

• Finnish Fraktur

• One column

Page 8

• Good quality

• Swedish Antiqua

• Two columns

Page 9

• Binarized image

• Difficult segmentation

Page 10

• Binarized image

• Challenging segmentation

• Many different fonts on one page

Page 11

• Both Finnish and Swedish on the same page

Page 12

• Poor quality

Page 13

Line examples - Fraktur

☛ För billigt pris: En kursläde i garden

Sananlennätinkonttori awoinna joka päiwä

-— Salama i s k i tiistai yönä klo

pitänyt tarpeellisena warata jonkunlaisen

Page 14

Line examples - Antiqua

osakkaat kutsutaan täten varsinaiseen yhtiö-

nuksia määräämälleen rautatiease-

m stammanträda i nämnde kontors loka

Heines poetische Werke. I två band. 17 m.

Page 15

Ocropy + post-processing results

• Finnish data sets:

  • CAR: 93.5% - 94.83%

  • CAR after post-processing: 93.68% - 95.21%

• It is better to randomly sample lines from the entire corpus than to train on all lines from 250 pages
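The sampling strategy in the last bullet can be sketched as follows. The slides only state the finding, so the corpus size, the page identifiers, and the function name below are made up for illustration. The point is that a random sample of the same size draws on far more pages than a contiguous block of 250 pages.

```python
import random


def sample_lines(corpus, n_lines, seed=0):
    """Draw n_lines (page_id, line_no) pairs uniformly from the whole corpus.

    corpus maps a page id to the number of text lines on that page.
    """
    all_lines = [(page_id, line_no)
                 for page_id, n in corpus.items()
                 for line_no in range(n)]
    rng = random.Random(seed)          # fixed seed for a reproducible sample
    return rng.sample(all_lines, n_lines)


# Hypothetical corpus: 1000 pages with 40 lines each.
corpus = {"page_%04d" % i: 40 for i in range(1000)}

# Strategy A: random lines from the entire corpus.
sampled = sample_lines(corpus, 10000)

# Strategy B: every line from the first 250 pages (also 10 000 lines).
first_pages = [(page_id, line_no)
               for page_id in sorted(corpus)[:250]
               for line_no in range(corpus[page_id])]

pages_in_sample = len({page_id for page_id, _ in sampled})
pages_in_first = len({page_id for page_id, _ in first_pages})
print(pages_in_sample, pages_in_first)
```

Both sets have the same number of lines, but the random sample covers a much wider range of pages and therefore, plausibly, of fonts, print quality, and publication years.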

Page 16

• Lots of Swedish material -> add Swedish training data

Finnish: ~10 000 training lines (randomly picked); ~75% Fraktur, ~25% Antiqua

Swedish: ~3 300 training lines (randomly picked); ~50% Fraktur, ~50% Antiqua

Page 17

Experiments

Results show CAR (%) / WAR (%)

Test set       FIN model        SWE model
Fin-Fraktur    95.43 / 78.79    93.2 / 69.61
Fin-Antiqua    85.81 / 53.36    88.89 / 62.32
Swe-Fraktur    78.84 / 40.43    87.59 / 55.32
Swe-Antiqua    79.93 / 40.01    90.66 / 66.36

Test set       FIN + SWE 1      FIN + SWE 2      FIN + SWE 3
Fin-Fraktur    96.19 / 81.91    95.07 / 76.65    94.97 / 76.13
Fin-Antiqua    89.35 / 63.35    87.23 / 58.22    86.64 / 55.79
Swe-Fraktur    82.53 / 51.11    80.76 / 43.48    83.22 / 45.65
Swe-Antiqua    86.65 / 59.84    83.69 / 49.49    84.88 / 52.5

Training data: FIN: 10 000 lines; SWE 1: 840 lines; SWE 2: 1 680 lines; SWE 3: 3 360 lines

Not enough Finnish Antiqua in training

Language is important!
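One way to read the results is to pick, for each test set, the model with the highest CAR. The sketch below hard-codes the figures from this slide; the dictionary layout and variable names are only illustrative.

```python
# (CAR %, WAR %) per test set and model, copied from the slide.
results = {
    "Fin-Fraktur": {"FIN": (95.43, 78.79), "SWE": (93.2, 69.61),
                    "FIN + SWE 1": (96.19, 81.91),
                    "FIN + SWE 2": (95.07, 76.65),
                    "FIN + SWE 3": (94.97, 76.13)},
    "Fin-Antiqua": {"FIN": (85.81, 53.36), "SWE": (88.89, 62.32),
                    "FIN + SWE 1": (89.35, 63.35),
                    "FIN + SWE 2": (87.23, 58.22),
                    "FIN + SWE 3": (86.64, 55.79)},
    "Swe-Fraktur": {"FIN": (78.84, 40.43), "SWE": (87.59, 55.32),
                    "FIN + SWE 1": (82.53, 51.11),
                    "FIN + SWE 2": (80.76, 43.48),
                    "FIN + SWE 3": (83.22, 45.65)},
    "Swe-Antiqua": {"FIN": (79.93, 40.01), "SWE": (90.66, 66.36),
                    "FIN + SWE 1": (86.65, 59.84),
                    "FIN + SWE 2": (83.69, 49.49),
                    "FIN + SWE 3": (84.88, 52.5)},
}

# For each test set, the model whose CAR (first tuple element) is highest.
best_by_car = {
    test_set: max(models, key=lambda name: models[name][0])
    for test_set, models in results.items()
}

for test_set, model in best_by_car.items():
    car, war = results[test_set][model]
    print(f"{test_set}: {model} (CAR {car}, WAR {war})")
```

On these figures the Swedish-only model is best on both Swedish test sets, while FIN + SWE 1 is best on both Finnish test sets, which matches the slide's remark that language is important.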

Page 18

Conclusions

• Need more Swedish data

• Need more Finnish Antiqua data

• Is it possible to train one model for everything?