IMPACT Final Conference - Michael Fuchs


ABBYY FineReader: IMPACT Improvements with Michael Fuchs from ABBYY Europe


IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

ABBYY & OCR Improvements for IMPACT

Michael Fuchs
Senior Product Marketing Manager
ABBYY Europe
fuchs@abbyy.com


Agenda

Who is ABBYY?
Company Overview
(Short) Product Overview
ABBYY Technology in the IMPACT project

OCR & Processing – IMPACT improvements
Binarisation, Segmentation, Recognition
Dictionary API, Export Formats

Lessons Learned, Pricing, Pre-Announcement, Q&A

ABBYY & OCR for IMPACT

ABBYY & IMPACT


ABBYY Group

Overview ABBYY Group
Founded in 1989 as BIT Software
More than 1000 employees in 14 offices worldwide
Headquarters/R&D in Moscow, Russia


ABBYY OCR Products – Usage View

OCR & Document Conversion

Desktop/Workgroup
Products: FineReader (Professional, Corporate, Site Licence Edition; note: no Gothic/Fraktur OCR!), PDF Transformer, FotoReader, ScreenshotReader
Users are: End Users, Companies, (Libraries)
User driven processing, ready to use

Server/Backend
Products: Recognition Server (Professional, Extended Edition)
Users are: Companies, Scan Service Providers, Libraries
Automated processing, ready to use

Callout: Gothic/Fraktur OCR & XML Export Support!

SDK/Integration
Products: FineReader Engines (Windows, Linux, Mac OS X, FreeBSD, Embedded Systems), Mobile OCR Engine (Android, Symbian, Linux, Windows, Windows Mobile, iOS)
Users are: Developers, Scan Service Providers, IMPACT Research
Automated processing, development needed


What (ABBYY) OCR can read...

Recognition Languages
Almost 200 OCR languages
34 languages with dictionary support and spell check
Alphabets: Cyrillic, Latin, Greek, Armenian, Hebrew, Thai
Chinese, Japanese, Korean (CJK) - 4 character sets (Chinese traditional and simplified, Japanese, Korean)
Arabic (Technical Preview in the SDK)

Font Types
Recognition of mixed font types (dot-matrix printer, typewriter, Gothic, etc.)
OCR-A
OCR-B
MICR (E13B)
CMC-7


IMPACT & ABBYY

ABBYY is the OCR technology provider for IMPACT members

ABBYY also improved the core technologies for the recognition of old documents in IMPACT, focus areas are/were:

Image pre-processing
Segmentation
Character recognition
Export

IMPACT members work with the Software Development Kit (SDK) FineReader Engine – not the desktop application

IMPACT focus is/was on research and not on setting up a production system ;o)

Improved technologies are/will be added to current/future products


Designed not to be OCRed


Why ABBYY? - OCR …

[Figure: original image (perfect quality :o) ) compared with Std. OCR* and ABBYY Fraktur OCR* output]

*Recognition Server 3.0 R1 – Gothic/Fraktur disabled and enabled

ABBYY “History” and Old Fonts Recognition
2003: FineReader XIX (V7 Technology), METAe result 2000-2003
2008: FineReader Engine 9.0 (Release 1), Pre-IMPACT “State of the Art”
2010: FineReader Engine 10, IMPACT Project Optimizations

ABBYY and Old European Fonts Accuracy Comparison:

ABBYY Technology Version 10 recognition of old European fonts:

25% more accurate than FRE 9.0
38% more accurate than FR XIX
Up to 98.2% on good quality images

[Figure: comparison across the 2003, 2008 and 2010 technology versions]

OCR Processing Steps & ABBYY Improvements for IMPACT

Processing Steps

Step 1. Scanning, Image Loading, Pre-Processing and Modification
Compensating image defects and making the document suited for automatic OCR

Step 2. Document Layout Analysis
Layout analysis, detection of document sections like text, images and barcodes

Step 3. (Optical) Character Recognition
Automatic recognition of characters, applying the selected recognition languages & dictionaries

Step 4. (optional) Verification - by operators or automated post-correction
Manual validation of suspicious characters and words

Step 5. Document Synthesis and Export
Generating an output document in the selected format
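
The five steps can be read as a simple pipeline in which verification is optional. The Python sketch below only illustrates that ordering. The `Document` class, the stage names and `run_pipeline` are hypothetical placeholders invented for this example; they are not ABBYY APIs and no real OCR takes place.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """Hypothetical container that records which steps were applied."""
    name: str
    history: List[str] = field(default_factory=list)


def make_stage(label: str) -> Callable[[Document], Document]:
    """Build a placeholder stage that only logs its label."""
    def stage(doc: Document) -> Document:
        doc.history.append(label)
        return doc
    return stage


# Placeholder stages, named after the five steps on the slide.
preprocess = make_stage("1 pre-processing")
analyse_layout = make_stage("2 layout analysis")
recognise = make_stage("3 character recognition")
verify = make_stage("4 verification")
export = make_stage("5 synthesis and export")


def run_pipeline(doc: Document, with_verification: bool = False) -> Document:
    """Apply the steps in slide order; step 4 runs only when requested."""
    stages = [preprocess, analyse_layout, recognise]
    if with_verification:
        stages.append(verify)
    stages.append(export)
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline(Document("newspaper_page"))
print(result.history)
```

Keeping verification as a switch mirrors the "(optional)" label of step 4: a mass-digitisation run can skip it, while a quality-critical run can enable it.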


Step 1: Image pre-processing


Image Loading, Pre-Processing and Modification
Intelligent background filtering
Adaptive Binarisation
General binarisation at the image level (one threshold for the whole page) cannot deliver good results for OCR
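
To see why a single page-wide threshold struggles, the following Python sketch compares it with a local-mean threshold on a tiny synthetic page whose background darkens from left to right. This is a generic textbook technique, shown here under the assumption that "adaptive" means thresholding relative to the local neighbourhood; it is not ABBYY's algorithm, and the function names, `radius` and `offset` are choices made for this example.

```python
def global_binarise(image):
    """Use one threshold (the mean grey value) for the whole image."""
    pixels = [p for row in image for p in row]
    threshold = sum(pixels) / len(pixels)
    return [[1 if p < threshold else 0 for p in row] for row in image]


def adaptive_binarise(image, radius=1, offset=15):
    """Mark a pixel as ink if it is clearly darker than its local mean."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            # Neighbourhood window, clipped at the image borders.
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            out_row.append(1 if image[y][x] < local_mean - offset else 0)
        result.append(out_row)
    return result


# Toy grey-scale page (0 = black, 255 = white). The background darkens from
# left to right; two "ink" pixels in the middle row are 60 levels darker
# than the background at their position.
background = [220, 200, 180, 160, 140, 120, 100, 80]
page = [background[:], background[:], background[:]]
page[1][1] -= 60
page[1][6] -= 60

global_result = global_binarise(page)
adaptive_result = adaptive_binarise(page)

print("global:  ", global_result[1])
print("adaptive:", adaptive_result[1])
```

On this toy page the global threshold turns the whole dark half of the background into ink, while the local threshold keeps only the two ink pixels.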


Step 1: Image pre-processing
New V10: Binarisation, Textured Background optimisations
[Figure, two examples: original scan, V9 binarisation, new V10 binarisation]

Step 1: Image pre-processing
New V10: Binarisation for the IMPACT project
[Figure: original, state of the art (V9), new (V10)]
No text from the other page!

Step 2: Document Layout Analysis

Analyze layout and find text, images, tables and barcodes

Step 2: Document Layout Analysis (old Newspapers)
Segmentation Improvements: Image/Text detection – Example 1/3
[Figure: V9 Technology vs. V10 Technology]
Part of the column was detected as an image

Step 2: Document Layout Analysis (old Newspapers)
Segmentation Improvements: Word Order Detection – Example 2/3
[Figure: V9 Technology vs. V10 Technology]
Fewer linear word order errors

Step 2: Document Layout Analysis (old Newspapers)
Segmentation Improvements: Lost text (no Detection) – Example 3/3
[Figure: V9 Technology vs. V10 Technology]
Less lost text

Step 2: Document Layout Analysis
Segmentation Improvements: IMPACT Results over time

Before IMPACT:
Overall segmentation improvements
● Better picture detection
● Better separators
● Better page layout reconstruction
Only a random set of old newspapers available

After IMPACT:
IMPACT Segmentation Ground Truth available
New (internal) DA model for historic newspapers
New segmentation evaluation methodology
Evaluation results on newspapers
● 40% fewer split/merge errors
● 25% less garbage and lost text

Step 3: Text/Character Recognition


Samples for Classifiers used in ABBYY technologies
After line detection, character recognition is applied with different classifiers:
Raster classifier
Contour classifier
Feature differentiating classifier
Structure classifier
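
The slide only names the classifier types. As a rough illustration of what a raster-style classifier does in general (comparing a glyph bitmap pixel by pixel with stored patterns), here is a minimal Python sketch. The templates, the scoring rule and the function names are assumptions made for this example and say nothing about how ABBYY's classifiers are actually built or combined.

```python
# Hypothetical 5x3 bitmap templates ("1" = ink, "0" = background).
TEMPLATES = {
    "I": ["010", "010", "010", "010", "010"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}


def overlap_score(glyph, template):
    """Fraction of pixels on which glyph and template agree."""
    total = 0
    matches = 0
    for glyph_row, template_row in zip(glyph, template):
        for g, t in zip(glyph_row, template_row):
            total += 1
            if g == t:
                matches += 1
    return matches / total


def classify(glyph):
    """Return the template label with the highest pixel agreement."""
    return max(TEMPLATES, key=lambda label: overlap_score(glyph, TEMPLATES[label]))


# A "T" with one noisy extra ink pixel in the fourth row.
noisy_t = ["111", "010", "010", "011", "010"]
print(classify(noisy_t), overlap_score(noisy_t, TEMPLATES["T"]))
```

A pure pixel comparison like this is sensitive to shifts and distortions, which is one plausible reason for applying several different classifiers to the same character.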

Step 3: Text/Character Recognition
Optimization and new Developments

Improved Gothic Classifiers
A significant amount of time was invested in gothic classifier training
The library selection of ground truth material (historical relevance) was used
New gothic graphemes were added

Results
Good quality images: 2.8% (total) error rate on the used test set, which is about a 20% improvement over the “state of the art” (V9) = almost comparable to modern documents
Bad quality images: 7% (total) error rate on the used test set, which is about a 30% improvement over the “state of the art” (V9)

Most of the improvements are available in current ABBYY products: ABBYY FineReader Engine 10 (SDK) & Recognition Server 3.0
Quality optimization will be continued in future releases and technology cycles
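
The slide gives the new error rates and the relative improvements, but not the V9 baselines. Assuming "X% improvement" means a relative reduction of the error rate (an interpretation, not something the slide states), the implied V9 error rates can be worked out as follows. The function name is made up for this sketch.

```python
def implied_baseline(new_error_rate, relative_improvement):
    """Error rate before the improvement, assuming a relative reduction."""
    return new_error_rate / (1.0 - relative_improvement)


# Figures from the slide: V10 error rate and improvement over V9.
good_v9 = implied_baseline(2.8, 0.20)
bad_v9 = implied_baseline(7.0, 0.30)

print(f"good quality images: V9 about {good_v9:.1f}% -> V10 2.8%")
print(f"bad quality images:  V9 about {bad_v9:.1f}% -> V10 7.0%")
```

Under that reading, V9 would have been at roughly 3.5% on good quality images and roughly 10% on bad quality images.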

Step 3: Text/Character Recognition
Optimization and new Developments
Old Slavonic as new OCR Language (New Development)
[Figure: before / now]

Quality Test Comparison: Binarisation & Recognition Improvements

Binarisation & Recognition Improvements

How to evaluate the recognition improvements of binarisation?

Binarisation & recognition quality go hand in hand!

-> # Errors = 100% with V9 binarisation & V9 recognition
-> # Errors = -5% with V9 binarisation & V10 recognition
-> # Errors = -11% with V10 binarisation & V9 recognition
-> # Errors = -15% with V10 binarisation & V10 recognition

[Figure axes: Binarisation, Recognition Technology]
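
One way to read these four numbers is to treat the V9/V9 error count as 100 and the other three as changes relative to it (an assumption; the slide does not spell this out). The Python sketch below then checks how the combined result compares with what two independent, multiplicative gains would give. Variable and function names are invented for the example.

```python
BASELINE = 100.0  # V9 binarisation & V9 recognition

# Relative changes in the number of errors, as given on the slide.
RECOGNITION_ONLY = -5.0    # V9 binarisation & V10 recognition
BINARISATION_ONLY = -11.0  # V10 binarisation & V9 recognition
BOTH_REPORTED = -15.0      # V10 binarisation & V10 recognition


def remaining_errors(change_percent):
    """Errors left, with the V9/V9 combination counted as 100."""
    return BASELINE * (1.0 + change_percent / 100.0)


# If the two gains acted independently, the remaining fractions multiply.
independent_estimate = (
    remaining_errors(RECOGNITION_ONLY)
    * remaining_errors(BINARISATION_ONLY)
    / BASELINE
)

print(f"reported remaining errors:  {remaining_errors(BOTH_REPORTED):.2f}")
print(f"independent-gains estimate: {independent_estimate:.2f}")
```

The estimate (about 84.55) is close to the reported 85, so under this reading the two improvements largely stack, with binarisation contributing the larger share.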

Step 3-5: Dictionaries & Export


Step 3 – 5: Other Optimizations

External Dictionary API Tuning
External Dictionary API was available in the FineReader Engine (SDK)
Support for any language, any time period
API was/is heavily used by IMPACT language partners to run quality tests

New ALTO XML Export Formats
FineReader Engine 10 R2, December 2010
Recognition Server 3.0, July 2011
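
ALTO is an XML format that stores recognised text together with its layout, so exported pages can be processed with any XML library. The Python sketch below reads the text lines and low-confidence words from a small ALTO fragment. The fragment is hand-written for this example, is not schema-validated and is not actual ABBYY output; the ALTO 2 namespace, the helper names and the 0.8 threshold are assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed ALTO 2 namespace; other ALTO versions use a different URI.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# Hand-written sample (hypothetical content, not real export output).
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Die" WC="0.98"/>
            <SP/>
            <String CONTENT="Zeitung" WC="0.61"/>
          </TextLine>
          <TextLine>
            <String CONTENT="von" WC="0.95"/>
            <SP/>
            <String CONTENT="heute" WC="0.90"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def read_lines(alto_xml):
    """Return the text of each TextLine, words joined by single spaces."""
    root = ET.fromstring(alto_xml)
    lines = []
    for text_line in root.iterfind(".//alto:TextLine", ALTO_NS):
        words = [s.get("CONTENT", "")
                 for s in text_line.iterfind("alto:String", ALTO_NS)]
        lines.append(" ".join(words))
    return lines


def low_confidence_words(alto_xml, threshold=0.8):
    """Return words whose WC (word confidence) is below the threshold."""
    root = ET.fromstring(alto_xml)
    return [s.get("CONTENT")
            for s in root.iterfind(".//alto:String", ALTO_NS)
            if float(s.get("WC", "1")) < threshold]


print(read_lines(SAMPLE))
print(low_confidence_words(SAMPLE))
```

Words flagged this way are natural candidates for the optional verification step or for checks against an external dictionary.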


Additional Notes


Further Information & Trial Versions
The ABBYY Gothic/Fraktur OCR Portal: www.frakturschrift.com

What IMPACT taught ABBYY about Libraries & Mass Digitisation projects…

The Reality
Masses of books/documents are available & already scanned
It is unclear if Antiqua and/or Gothic/Fraktur fonts are used in the documents
Pre-sorting is impossible; it would cost too much time and money

ABBYY Europe's Answer
Reduced the pricing for mixed “Old” + “Modern” font OCR projects
The pricing is now ready for “mass processing”

Examples: Recognition Server 3.0 with “Gothic” enabled
10,000 pages – 299 Euro – available online
500,000 pages* – 5,000 Euro = 1 Euro cent per page = ca. 2,000 books of 250 pages
Over 3 million pages* - ca. 0.52 Euro cent per page = 12,000 books at 1.25 € (250 pages)
Over 10 million pages* - ca. 40,000 books = ca. 0.5 € per book

... No more excuses for not OCRing :o)

* page size is A4, bigger formats are counted as multiple pages
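
The per-page and per-book figures follow from simple arithmetic with the slide's book size of 250 pages. The Python sketch below reproduces it; the function names are invented for this example, and since the slide marks its values with "ca.", computed results can differ slightly from the rounded figures quoted there.

```python
PAGES_PER_BOOK = 250  # book size used on the slide


def books_for(pages):
    """Number of whole 250-page books that fit into a page volume."""
    return pages // PAGES_PER_BOOK


def cents_per_page(total_euro, pages):
    """Price per page in Euro cents for a licence of the given size."""
    return total_euro * 100 / pages


def euro_per_book(page_price_cents):
    """Price per 250-page book in Euro for a given per-page price."""
    return page_price_cents * PAGES_PER_BOOK / 100


# 10,000 pages for 299 Euro
print(round(cents_per_page(299, 10_000), 2), "cent per page")

# 500,000 pages for 5,000 Euro
mid_tier = cents_per_page(5_000, 500_000)
print(round(mid_tier, 2), "cent per page,", books_for(500_000), "books,",
      round(euro_per_book(mid_tier), 2), "Euro per book")

# Over 3 million pages at ca. 0.52 cent per page
print(books_for(3_000_000), "books,",
      round(euro_per_book(0.52), 2), "Euro per book")

# Over 10 million pages
print(books_for(10_000_000), "books")
```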

Pre-Announcement: ABBYY Online OCR Services with Gothic/Fraktur

The ABBYY Gothic/Fraktur OCR Portal: finereader.abbyyonline.com
Historic OCR added just last week
Web GUI to upload documents and get results
Simple to use
Low volume, ad hoc usage
Instant results, quality evaluation
Pay as you go

ABBYY Online OCR SDK
OCR Service with API and XML Output
Runs on Windows Azure
Currently Closed Beta Test
Public Beta Test Q1/2012

Summary


The whole is greater than the sum of its parts

(Aristotle)


Thank you for your attention!

Questions?

Michael Fuchs
Senior Product Marketing Manager
ABBYY Europe
fuchs@abbyy.com
