
Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

Presentation of the paper "User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology" by Günter Mühlberger, Johannes Zelger, David Sagmeister and Albert Greinöcker at DATeCH 2014. #digidays

Page 1: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

Universität Innsbruck

Christoph-Probst-Platz, Innrain 52

6020 Innsbruck

http://info.uibk.ac.at

User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

Günter Mühlberger, Johannes Zelger

David Sagmeister, Albert Greinöcker

Universität Innsbruck / Höhere Technische Bundeslehranstalt Anichstraße, Innsbruck

Page 2: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

• Introduction

• Crowdsourcing approaches for OCR correction

• Our approach

• Evaluation

• Future work

Agenda

2

Page 3: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

Introduction

3

Page 4: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

• Digitisation of historical printed material
  – Google: billions of files, libraries: millions of files
  – It is still hard to get access to these files
• OCR quality
  – There are only few reliable figures on the accuracy of OCR on large-scale datasets
  – E.g. we do not know "how good the Google collection" is as a whole, or per language, per century, decade or year, per text type, etc.
• Tanner (2009)
  – Evaluated OCR accuracy on British newspapers
  – Differences per newspaper are stronger than per publishing date
  – Overall we are speaking about 10% to 40% Word Error Rate (WER), with an average of 22% WER for standard words and 31% for significant words
  – An evaluation done within the IMPACT project has shown similar figures

Digitisation and OCR quality

4

Page 5: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

• What does this mean for the end-user?
  – End-users are either searching a collection or reading an interesting item (which they may have found by searching).
  – But for reading a page/book they have the original image, so the full-text is much less important for them.
• If we take the figures from above:
  – End-users will miss e.g. 20% or 30% of all occurrences of a search term which would be interesting for them, simply because the OCR is wrong.
• Maybe acceptable to occasional users, but surely not for humanities researchers or family historians: they want to get "all relevant occurrences".
  – What is "relevant" is decided by the user; some may be interested just within a specific time period, or periodical, or collection of documents.
  – Note: not all words are frequent in all collections ("London" is rare in a Tyrolean newspaper collection, whereas it is frequent in a British newspaper collection).

End-users and OCR quality

5

Page 6: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

Crowdsourcing for OCR

6

Page 7: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

• OCR as an "ideal" field for crowdsourcing
  – Simple to realize: provide a link between image and text and let the user correct it
• Three (and a half) main approaches
  – reCAPTCHA
  – Australian National Library (Newspaper Digitization Project)
  – National Library of Finland (gamification)
  – IBM: CONCERT (Collaborative Correction Platform)

Approaches

7

Page 8: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

reCAPTCHA

8

Page 9: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

Australian National Library

9

Page 10: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

Australian National Library

10

Page 11: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

National Library of Finland: Digitalkoot

11

Page 12: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

IBM CONCERT (COoperative eNgine for Correction of ExtRacted Text)

12

Page 13: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

• OCR correction with the support of the crowd does work (but not always)!
• In the case of reCAPTCHA and Digitalkoot users have no influence on what they correct (de-motivating)
  – reCAPTCHA is successful due to the sheer number of interactions
• User-specific benefit is provided mainly by the approach of the Australian National Library
  – Users read the text carefully when editing
  – They find corrected words immediately after submitting the correct text
  – They can decide what to correct
• Power users vs. crowd users
  – A very small segment of all users carries out the actual work
  – Australia: the top 6 users corrected about 25% of the texts
  – Transcribe Bentham project: the top 7 users produced 70% of all transcripts

Conclusion

13

Page 14: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

Proposed approach

14

Page 15: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

• Let's combine searching and crowd-based correction!
• Provide users with a powerful instrument to correct exactly those words they are interested in (are searching for)
• Relieve users from actually editing words; let them just approve or reject the results of the OCR engine

15

Page 16: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

Search interface

16

Page 17: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

• The user has the chance to
  – select the edit distance (ED): 0-2 (see the sketch below)
  – display already approved words
  – search only within the index (without showing word snippets)
• In this way users can play around and
  – influence the recall of the system
  – see the index (which is very helpful to get an impression of the OCR errors)
  – see what has already been done

Search interface: Features

17
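
To make the edit-distance option more concrete, here is a minimal sketch (not the authors' actual code) of how such a search could be issued with Lucene, the engine named on the implementation slide. The index directory and the field name "content" are assumptions for illustration.

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class FuzzySnippetSearch {

    /** Runs a search with the user-selected edit distance (0, 1 or 2). */
    public static TopDocs search(String term, int editDistance) throws Exception {
        DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("ocr-index")));
        IndexSearcher searcher = new IndexSearcher(reader);

        // ED0 is a plain term query; ED1/ED2 use Lucene's fuzzy query.
        Query query = editDistance == 0
                ? new TermQuery(new Term("content", term))
                : new FuzzyQuery(new Term("content", term), editDistance);

        // First batch of word snippets (150 in the standard view).
        return searcher.search(query, 150);
    }
}
```

With editDistance = 2, a query for "Feuerwehr" would also retrieve OCR variants such as "Fenerwehr" or "Feuerweh", which is exactly the recall-oriented strategy described on the later slides.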

Page 18: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

Result page: Features

18

Page 19: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

• Users see the word snippets of their search
• Buttons
  – Select all as "false" or "correct"
    • Red: a word snippet does not represent the correct text
    • Green: a word snippet represents the correct text (match between search term and word snippet)
  – Deselect all
  – Reverse selection
  – Save
• Save (see the sketch below)
  – Green word snippets: the text is either approved (if it is the same as in the OCR text) or the wrong OCR text is corrected by the correct search term
  – Red word snippets: nothing is changed in the OCR text

Features

19
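
The save rule described above fits in a few lines. The sketch below only illustrates that rule; the WordSnippet class and the status values are hypothetical, not the authors' data model.

```java
public class SaveAction {

    enum Status { APPROVED, CORRECTED, UNCHANGED }

    static class WordSnippet {
        String ocrText;      // token as recognized by the OCR engine
        boolean markedGreen; // user judged the snippet to match the search term
    }

    /** Applies the save rule to one snippet for a given search term. */
    static Status save(WordSnippet snippet, String searchTerm) {
        if (!snippet.markedGreen) {
            return Status.UNCHANGED;      // red: nothing is changed in the OCR text
        }
        if (snippet.ocrText.equals(searchTerm)) {
            return Status.APPROVED;       // OCR text was already correct, so it is approved
        }
        snippet.ocrText = searchTerm;     // wrong OCR text is corrected by the search term
        return Status.CORRECTED;
    }
}
```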

Page 20: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

Result page (2)

20

Page 21: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

• Result sets (on the left-hand side)
  – 150 word snippets are currently shown in the standard view
  – Can be parametrized
  – Currently ordered by file path (another criterion could be word confidence)
• Index (on the right-hand side)
  – All index terms which are "behind" a fuzzy search are listed
  – The number of occurrences is shown for this result set (see the sketch below)
  – The user gets an overview of "which tokens are behind these snippets"
  – The user is able to decide quickly which tokens are "real" words

Additional features

21
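
As an illustration of the index view on the right-hand side, the following sketch counts how often each token occurs in the current result set, which is the information shown next to the index terms. It assumes the matched tokens of the snippets are already available as plain strings.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndexOverview {

    /** Counts each distinct token of the result set, most frequent first. */
    static Map<String, Long> tokenCounts(List<String> resultTokens) {
        Map<String, Long> counts = new HashMap<>();
        for (String token : resultTokens) {
            counts.merge(token, 1L, Long::sum);
        }
        // order by number of occurrences so the "real" words stand out
        Map<String, Long> ordered = new LinkedHashMap<>();
        counts.entrySet().stream()
              .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
              .forEach(e -> ordered.put(e.getKey(), e.getValue()));
        return ordered;
    }
}
```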

Page 22: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

• Improve precision
  – Search with ED0
  – All word snippets should display the search term
  – Those which do not are classical OCR errors
  – If they are selected they get the status "approved"
  – Those which are errors are currently just deselected (and not marked as false)
• Approvals are directly written into the ALTO file (see the sketch below)
  – Correction status: true, "approved"

Correction strategies (1)

22
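
A rough sketch of how such an approval could be written back into an ALTO file is shown below. ALTO stores each recognized word as a <String> element; the attribute names used here for the correction status are illustrative assumptions, since the slides only state that a status "approved" is recorded.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.File;
import java.util.Set;

public class AltoApproval {

    /** Marks every ALTO <String> element whose ID is in the approved set as approved. */
    static void approve(File altoFile, Set<String> approvedIds) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(altoFile);

        NodeList words = doc.getElementsByTagName("String");
        for (int i = 0; i < words.getLength(); i++) {
            Element word = (Element) words.item(i);
            if (approvedIds.contains(word.getAttribute("ID"))) {
                // hypothetical attributes mirroring the "correction status: true, approved" flag
                word.setAttribute("correctionStatus", "true");
                word.setAttribute("correctionType", "approved");
            }
        }
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(altoFile));
    }
}
```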

Page 23: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

Example 1: Search for "nelle"

23

Page 24: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

OCR errors

24

[Word snippet images: tokens recognized as "neue" and "nelle"]

Page 25: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

Select correct word images = green = approved

25

Page 26: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

• Search for a word with ED1 or ED2
  – The number of hits (and word snippets) increases significantly
  – Sometimes more, sometimes less, depending very much on the search string and its length
• Strategy
  – One may go through all word snippets and deselect wrong ones or select correct ones, but this takes some time and is boring
• But: due to ED2 many other correct words are included in the result set
• Therefore another correction strategy may be more interesting

Correction strategy (2): Improve recall

26

Page 27: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

• Recommended method
  – Go for all tokens representing "real words" which appear in the index on the right-hand side
  – By clicking on a word of the index an ED0 search is triggered
  – In many cases ED0 searches retrieve good results with just a few OCR errors, so approval is very simple and fast
• Once the "real words" are done, only those word snippets with "real" OCR errors of the search term remain – and these are our actual objective to correct

Correction strategies (3)

27

Page 28: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

Example: Search for "Feuerwehr" ED2

28

Page 29: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

"Feuerwehr" (fire brigade)

29

Index tokens behind the ED2 search (as shown on the slide):
Feuenvehr  Fenerwehr  Feuerwehr,  Feuermeh  Feuerwehr-  Feuerweh,  Feuerwehr.  Feuerwerk  Feuerweh  Feuerwehren  Feuerwehr-,  Feuerwehr^  Feuerweh?  Feuerwehr  Feuermehr  Feueràhr  Feuerwert  Feuerweihe  Fenerwchr

• Examples of erroneous words (marked red on the slide)
• These words are the "rest" which appears after having approved the "real" words (marked green)
• They will finally be replaced by the correct word
• In ALTO: correction status true, substitute: Feuerwehr

Page 30: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

Validating "real" words from the index

30

Page 31: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

• Those snippets which were approved in the previous steps are hidden from the user.
  – But users are able to see them if interested or if they want to do a final check
  – Overwriting is possible; the status has to be changed
• Therefore the final correction screen now shows, instead of 324 word snippets for "Feuerwehr" ED2, only those which were not approved before.

Repeated search for "Feuerwehr" ED2

31

Page 32: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

Finally the "real" OCR errors are replaced by the correct word

32

Page 33: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

• Test set
  – From the Europeana Newspapers project
  – 16,000 pages from the Tessmann Library; several million more are waiting to be indexed
  – METS/ALTO files
• Standard technology
  – Java, JavaScript (Ajax), Lucene
• Images are cropped on the fly (see the sketch below)
  – "Hardest" task: takes some seconds on a 4-core machine
  – The first batch of 150 snippets is done immediately, the second batch is preprocessed in the background
• A test set is available online
  – http://dbis-faxe.uibk.ac.at/Website%202.0/CorrectionServlet
  – Attention: not a stable link!

Implementation

33
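
The on-the-fly cropping mentioned above can be pictured with a small sketch: cut the word region given by the ALTO coordinates (HPOS/VPOS/WIDTH/HEIGHT) out of the scanned page image. The padding value and the plain Java imaging API are assumptions for illustration; the actual servlet is not shown on the slides.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;

public class SnippetCropper {

    /** Cuts a single word image out of the scanned page. */
    static BufferedImage cropWord(File pageImage, int hpos, int vpos,
                                  int width, int height) throws Exception {
        BufferedImage page = ImageIO.read(pageImage);
        int pad = 5; // a little context around the word
        int x = Math.max(0, hpos - pad);
        int y = Math.max(0, vpos - pad);
        int w = Math.min(page.getWidth() - x, width + 2 * pad);
        int h = Math.min(page.getHeight() - y, height + 2 * pad);
        return page.getSubimage(x, y, w, h);
    }
}
```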

Page 34: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

• Our method provides the chance to improve precision and recall of search terms in a rather quick and straightforward way.
• Fuzzy search makes it possible to increase the recall of search terms significantly and to "correct" erroneous terms quickly.
• No need to edit text – only typing a search term once and then clicking on the index terms for new searches.
• Snowball effect, since approved words are stored permanently and are reused for the next correction sessions as well.

Conclusion

34

Page 35: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

Evaluation

35

Page 36: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

• Currently there is not enough data to provide good figures on the evaluation of the tool – implementation in a real-world scenario will be necessary
• But: Doan, A. et al. 2011. Crowdsourcing systems on the World-Wide Web. Communications of the ACM.
• Four main criteria for crowdsourcing projects:
  (1) How to recruit and retain users?
  (2) What contributions can users make?
  (3) How to combine user contributions to solve the target problem?
  (4) How to evaluate users and their contributions?

Evaluation

36

Page 37: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

• Users are searching anyway!
• Those who are searching have a specific interest!
• Satisfaction will be higher if precision and especially recall are higher for noisy OCR text, so the motivation should be there
• Power users of the archive may be willing to contribute a good deal of their time to improve the full-text search, so the working power should be there
• Our tool piggybacks on the search interface – it can be integrated in a simple way (e.g. as an extra tab next to the search, which is performed anyway, so users may try out what is behind it)
• Searching the index gives the user useful insights and a learning curve (get to know your full-text archive!)

(1) How to recruit and retain users?

37

Page 38: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

• Contributions of users:
  – Improve precision
  – Improve recall by correcting OCR errors of search terms
  – All these words are significant and meaningful to a user
• Only a small portion of words is interesting!
  – Text contains a lot of words which are not meaningful or are very seldom part of a search
  – Austrian Newspapers Online: 50% of all full-text searches go for person names, 20% for geo-names, and only a small portion for keywords
  – This means that the corrections/approvals done by the user with our method are more valuable than corrections of running text
  – The total number of corrected words may not be very high, but these should be significant and relevant words

(2) What contributions can users make?

38

Page 39: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

• Storage of contributions – all contributions are stored in two ways:
  – The Lucene index is immediately updated so that the next search already benefits from approvals/corrections (see the sketch below)
  – Approvals/corrections are directly stored in the OCR XML files (in this case ALTO): words are either marked as "correction status true", "approved", or the new alternative of the word is included as well
• Main benefit for the next user
  – The next user will see which word snippets are already approved (shown in blue and gray) – in other words: the contributions are visible to everyone even though they are distributed among large amounts of text
  – This should give users the feeling that someone has already worked in this field as well

(3) How to combine user contributions to solve the target problem?

39
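
How an approval or correction could be pushed into the Lucene index right away is sketched below, under the assumption that every word snippet is indexed as its own Lucene document keyed by a unique "id" field; the slides do not describe the actual index layout.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class IndexUpdate {

    /** Replaces the indexed token of one word snippet with the approved/corrected text. */
    static void applyCorrection(IndexWriter writer, String snippetId,
                                String correctedText) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("id", snippetId, Field.Store.YES));
        doc.add(new TextField("content", correctedText, Field.Store.YES));
        doc.add(new StringField("status", "approved", Field.Store.YES));

        // updateDocument deletes the old snippet document with this id and adds the new one,
        // so the very next search already benefits from the contribution
        writer.updateDocument(new Term("id", snippetId), doc);
        writer.commit();
    }
}
```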

Page 40: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

• We have not tackled this field so far
• A strategy could be:
  – Randomly select approved or corrected words and provide them to other users for review
  – If specific users provided too many errors, a log file could be utilized to reset the correction status within the ALTO files

(4) How to evaluate users and their contributions?

40

Page 41: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

Future work

41

Page 42: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

• Improve the user interface
  – Allow marking word snippets also as "false"
• Release as an open-source package
  – Will be done during 2014
  – Java, Ajax, Lucene – only open-source components
• Implementation of the tool in a real-world scenario
• Include an edit distance that is more meaningful for OCR errors than the fuzzy search of Lucene (see the sketch below)
  – E.g. an ED larger than 2, but based on typical OCR confusions (c-e, etc.)
• Use the data for machine learning
  – For all word snippets, metadata such as title of the publication, size of the print, language, date of printing, etc. are available
  – Use it to discriminate "hard" cases by asking users to go for specific sets (which are selected automatically)

Further work and improvements

42
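
The OCR-aware edit distance mentioned in the list above could, for example, be a weighted Levenshtein distance in which typical OCR confusions are cheaper than arbitrary substitutions. The confusion pairs and weights below are purely illustrative, not values from the paper.

```java
import java.util.Set;

public class OcrEditDistance {

    // typical single-character OCR confusions, stored in one canonical direction
    private static final Set<String> CONFUSIONS = Set.of("ce", "nu", "il", "ft", "vy");

    private static double substitutionCost(char a, char b) {
        if (a == b) return 0.0;
        String pair = a < b ? "" + a + b : "" + b + a;
        return CONFUSIONS.contains(pair) ? 0.25 : 1.0; // OCR-typical swaps are cheap
    }

    /** Weighted Levenshtein distance between an index token and a search term. */
    static double distance(String s, String t) {
        double[][] d = new double[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                double sub = d[i - 1][j - 1] + substitutionCost(s.charAt(i - 1), t.charAt(j - 1));
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1.0, d[i][j - 1] + 1.0));
            }
        }
        return d[s.length()][t.length()];
    }
}
```

With such a distance, an OCR variant like "Fenerwehr" would rank closer to "Feuerwehr" than an unrelated word with the same plain edit distance.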

Page 43: Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology

Thank you for your attention!

43