
Faculty of Arts (Humanistinen tiedekunta)

Senka Drobac and Pekka Kauppinen and Krister Lindén

Improving OCR of historical newspapers and journals published in Finland by adding Swedish training data

Motivation

•Corpus of historical newspapers and magazines that has been digitized by the National Library of Finland

•OCR was done with the commercial software ABBYY FineReader

•Character accuracy rate (CAR): ~ 90-91%

Figure from: Vesanto, Aleksi, et al. "A system for identifying and exploring text repetition in large historical document corpora." Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. No. 131. Linköping University Electronic Press, 2017.

Ocropy

•Decided to train models with Ocropy + post-processing

•Ocropy:

• Open source, uses LSTM, line-based

• Tools for preprocessing, segmentation, training, recognition, evaluation

• Above 98.5% CAR reported on 19th- and 20th-century German material

OCR workflow (Ocropy)

Binarization (pre-processing) -> binarized image -> line segmentation -> line images -> OCR (pre-trained model) -> text (lines) -> post-processing
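The workflow above maps roughly onto Ocropy's command-line tools. The sketch below only assembles the commands and does not run them. The tool names and flags follow the Ocropy README as best recalled and should be checked against the installed version (`--help`). The working directory `book`, the model file name, and the function name are hypothetical. The post-processing step is not specified in the slides, so it is left out.

```python
from typing import List


def build_ocropy_commands(page_image: str, model_path: str,
                          workdir: str = "book") -> List[List[str]]:
    """Return Ocropy commands for one page, in workflow order.

    Assumption: flags follow the Ocropy README; verify before use.
    """
    return [
        # 1. Binarization (pre-processing) of the scanned page.
        ["ocropus-nlbin", page_image, "-o", workdir],
        # 2. Line segmentation of the binarized page into line images.
        #    The README passes this pattern quoted, so it is kept literal.
        ["ocropus-gpageseg", f"{workdir}/????.bin.png"],
        # 3. OCR: recognize each line image with a trained LSTM model.
        ["ocropus-rpred", "-m", model_path,
         f"{workdir}/????/??????.bin.png"],
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for cmd in build_ocropy_commands("page.png", "fin-swe.pyrnn.gz"):
        print(" ".join(cmd))
```

Keeping the commands as plain lists makes it easy to inspect them first and pass each one to `subprocess.run` only after the flags have been confirmed.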

1771-1919

Languages: Finnish and Swedish

Typefaces: Fraktur and Antiqua

• Good quality

• Finnish Fraktur

• One column

• Good quality

• Swedish Antiqua

• Two columns

• Binarized image

• Difficult segmentation

• Binarized image

• Challenging segmentation

• Many different fonts on one page

• Both Finnish and Swedish on the same page

• Poor quality

Line examples - Fraktur

☛ För billigt pris: En kursläde i garden

Sananlennätinkonttori awoinna joka päiwä

-— Salama i s k i tiistai yönä klo

pitänyt tarpeellisena warata jonkunlaisen

Line examples - Antiqua

osakkaat kutsutaan täten varsinaiseen yhtiö-

nuksia määräämälleen rautatiease-

m stammanträda i nämnde kontors loka

Heines poetische Werke. I två band. 17 m.

Ocropy + post-processing results

•Finnish data sets:

• CAR: 93.5% - 94.83%

• After post-processing CAR: 93.68% - 95.21%

•It is better to randomly sample lines from the entire corpus than to train on all lines from 250 pages

• Lots of Swedish material -> add Swedish training data

Finnish:

~10 000 training lines

(randomly picked)

~75% Fraktur, ~25% Antiqua

Swedish:

~3 300 training lines

(randomly picked)

~50% Fraktur, ~50% Antiqua
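The sampling strategy described above (random lines from the whole corpus rather than every line from a fixed set of pages) can be sketched as follows. The corpus layout, a mapping from page id to the line indices on that page, and all function names are assumptions for illustration; the slides do not describe the actual tooling.

```python
import random
from typing import Dict, List, Tuple

Line = Tuple[str, int]  # (page id, line index on that page)


def sample_lines(corpus: Dict[str, List[int]], n: int,
                 seed: int = 0) -> List[Line]:
    """Draw n lines uniformly at random from all pages of the corpus."""
    all_lines = [(page, idx) for page, idxs in corpus.items() for idx in idxs]
    rng = random.Random(seed)
    return rng.sample(all_lines, min(n, len(all_lines)))


def lines_from_pages(corpus: Dict[str, List[int]], n_pages: int,
                     seed: int = 0) -> List[Line]:
    """Baseline: take every line from n_pages randomly chosen pages."""
    rng = random.Random(seed)
    pages = rng.sample(sorted(corpus), min(n_pages, len(corpus)))
    return [(page, idx) for page in pages for idx in corpus[page]]


if __name__ == "__main__":
    # Toy corpus: 1000 pages with 40 lines each (made-up numbers).
    corpus = {f"page{p:04d}": list(range(40)) for p in range(1000)}
    spread = sample_lines(corpus, 120)
    packed = lines_from_pages(corpus, 3)
    print(len({page for page, _ in spread}), "pages covered by line sampling")
    print(len({page for page, _ in packed}), "pages covered by page sampling")
```

For the same number of lines, line-level sampling touches many more pages, and so more fonts, layouts and print qualities, than page-level selection.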

Experiments

Test set       FIN MODEL        SWE MODEL
Fin-Fraktur    95.43 / 78.79    93.2 / 69.61
Fin-Antiqua    85.81 / 53.36    88.89 / 62.32
Swe-Fraktur    78.84 / 40.43    87.59 / 55.32
Swe-Antiqua    79.93 / 40.01    90.66 / 66.36

Test set       FIN + SWE 1      FIN + SWE 2      FIN + SWE 3
Fin-Fraktur    96.19 / 81.91    95.07 / 76.65    94.97 / 76.13
Fin-Antiqua    89.35 / 63.35    87.23 / 58.22    86.64 / 55.79
Swe-Fraktur    82.53 / 51.11    80.76 / 43.48    83.22 / 45.65
Swe-Antiqua    86.65 / 59.84    83.69 / 49.49    84.88 / 52.5

SWE 1: 840 lines

SWE 2: 1 680 lines

SWE 3: 3 360 lines

Results show CAR (%) / WAR (%)

Not enough Finnish Antiqua in training

FIN: 10 000 lines

Language is important!
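The tables above report character accuracy rate (CAR) and word accuracy rate (WAR). The slides do not give the exact formula. A common definition, assumed here, is 100 * (1 - edit distance / reference length), counted over characters for CAR and over whitespace-separated tokens for WAR. Ocropy's own evaluation tools may count errors differently, so this is a sketch and not the authors' evaluation code.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete from reference
                            curr[j - 1] + 1,      # insert into reference
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def accuracy(ref: Sequence, hyp: Sequence) -> float:
    """Accuracy in percent relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * (1.0 - edit_distance(ref, hyp) / len(ref))


def car(ref: str, hyp: str) -> float:
    """Character accuracy rate (assumed definition, see lead-in)."""
    return accuracy(ref, hyp)


def war(ref: str, hyp: str) -> float:
    """Word accuracy rate over whitespace-separated tokens."""
    return accuracy(ref.split(), hyp.split())


if __name__ == "__main__":
    # Ground truth from the Fraktur line examples; the OCR output with
    # one wrong character ("jota" for "joka") is invented for illustration.
    truth = "Sananlennätinkonttori awoinna joka päiwä"
    ocr = "Sananlennätinkonttori awoinna jota päiwä"
    print(f"CAR {car(truth, ocr):.2f} / WAR {war(truth, ocr):.2f}")
```

A single wrong character costs one whole word, which is why the WAR figures in the tables sit so far below the corresponding CAR figures.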

Conclusions

•Need more Swedish data

•Need more Finnish Antiqua data

•Is it possible to train one model for everything?
