Slides from the presentation of the paper "Document Representation Refinement for Precise Region Description" by Christian Clausner, Stefan Pletschacher and Apostolos Antonacopoulos.

Page 1: Datech2014-Session1-Document Representation Refinement for Precise Region Description

Document Representation Refinement for Precise Region Description

Christian Clausner, Stefan Pletschacher and Apostolos Antonacopoulos

PRImA Lab, School of Computing, Science and Engineering, University of Salford, United Kingdom

Document Page Regions

DATeCH 2014 2

Segmentation, Classification

• Region (block, zone): Connected area of a document image with content of a single specific type

• Examples: Text, graphic, table

Region Representation

• By geometric objects

– Bounding box

– Stack of rectangles

– Polygon

• By pixels

– Bitmap

– Run-length encoding
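
The difference between a pixel-based and a geometric representation can be sketched as follows. This is a minimal illustration only: the `(row, start, length)` run layout and the function names are assumptions made for this example, not a format defined in the slides.

```python
from typing import List, Tuple

Bitmap = List[List[int]]      # rows of 0/1 pixels
Run = Tuple[int, int, int]    # (row, first column, run length), assumed layout


def run_length_encode(bitmap: Bitmap) -> List[Run]:
    """Encode each horizontal run of foreground pixels as (row, start, length)."""
    runs: List[Run] = []
    for y, row in enumerate(bitmap):
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1
                runs.append((y, start, x - start))
            else:
                x += 1
    return runs


def bounding_box(runs: List[Run]) -> Tuple[int, int, int, int]:
    """Smallest axis-aligned box (left, top, right, bottom), inclusive."""
    left = min(start for _, start, _ in runs)
    right = max(start + length - 1 for _, start, length in runs)
    top = min(y for y, _, _ in runs)
    bottom = max(y for y, _, _ in runs)
    return left, top, right, bottom


# An L-shaped region: the run-length encoding describes its 5 pixels exactly,
# while the bounding box covers all 9 pixels of the 3 x 3 area.
l_shape = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
]
runs = run_length_encode(l_shape)
box = bounding_box(runs)
```

The box is the most compact description but also the least precise one for any region that is not rectangular.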

Need for Precise Region Descriptions

• Precise description is crucial for all but the most trivial document analysis and recognition applications

• For performance evaluation: The loss of quality introduced by imprecise regions can be bigger than the variation of accuracy of the actual recognition method

The Situation

• Trend to more precise descriptions, but…

• Output of state-of-the-art OCR systems:

– Stacks of rectangles (ABBYY FineReader Engine 11)

– Bounding boxes (Tesseract OCR 3.02)

• Popular formats for layout analysis and OCR results:

– ALTO XML (boxes, ellipses, polygons (region level only))

– FineReader XML (stacks of rectangles (region level only))

– PAGE XML (polygons for all levels)

– hOCR (boxes)

Refinement through Polygonal Fitting

• Applicable to regions that have child objects in the document model

• A typical object hierarchy contains regions, text lines, words and glyphs (characters)

• Idea: Tightly wrap a polygon around the child objects

Polygonal Fitting Approach

1. Create bitmasks for the child objects and transfer them to an empty bitmap

2. Fill the gaps between the child objects by a smearing approach

3. Optional: Exclude neighbour regions

4. Trace the contour of the foreground and create a polygon

1 – Transferring Child Objects to Bitmap

• Starting point: Polygonal object (e.g. text line, word, or glyph)

• Lossless conversion to a rectangle-based interval representation

• Transferring the rectangles to the target bitmap
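
One possible way to carry out this step is a scanline conversion, sketched below. The slides do not say how the conversion is implemented, so the even-odd scanline rule, the `(row, first, last)` interval layout and the names `polygon_to_intervals` and `paint` are assumptions for this example. For polygons whose edges run along the pixel grid, the intervals cover exactly the pixels enclosed by the polygon.

```python
import math
from typing import List, Tuple

Point = Tuple[int, int]
Interval = Tuple[int, int, int]  # (row, first column, last column), inclusive


def polygon_to_intervals(polygon: List[Point]) -> List[Interval]:
    """Scan-convert a polygon into per-row pixel intervals (even-odd rule).

    Vertices are pixel-grid corners; pixel (x, y) counts as inside when its
    centre (x + 0.5, y + 0.5) lies inside the polygon.
    """
    ys = [y for _, y in polygon]
    count = len(polygon)
    intervals: List[Interval] = []
    for row in range(min(ys), max(ys)):
        centre_y = row + 0.5
        crossings = []
        for i in range(count):
            (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % count]
            if (y1 <= centre_y) != (y2 <= centre_y):  # edge crosses the scanline
                crossings.append(x1 + (centre_y - y1) * (x2 - x1) / (y2 - y1))
        crossings.sort()
        for x_in, x_out in zip(crossings[0::2], crossings[1::2]):
            first = math.ceil(x_in - 0.5)
            last = math.ceil(x_out - 0.5) - 1
            if first <= last:
                intervals.append((row, first, last))
    return intervals


def paint(bitmap: List[List[int]], intervals: List[Interval]) -> None:
    """Transfer the one-pixel-high rectangles to the target bitmap."""
    for row, first, last in intervals:
        for x in range(first, last + 1):
            bitmap[row][x] = 1


# Hypothetical child object: an L-shaped text line outline.
text_line = [(0, 0), (4, 0), (4, 1), (2, 1), (2, 2), (0, 2)]
intervals = polygon_to_intervals(text_line)
target = [[0] * 5 for _ in range(3)]
paint(target, intervals)
```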

2 – Smearing Approach

• Goal: Connect all foreground components in the bitmap by filling the gaps in-between

1. Alternately fill horizontal and vertical gaps that are smaller than a dynamic threshold (the threshold is increased after each iteration)

2. If necessary, use diagonal smearing to connect remaining components
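
The alternating gap filling can be sketched roughly as below. The starting threshold, the increment of one pixel per iteration, the upper limit and the use of 4-connectivity are assumptions for this example; the slides do not give these details. The diagonal smearing fallback is left out.

```python
from typing import List

Bitmap = List[List[int]]


def fill_row_gaps(bitmap: Bitmap, max_gap: int) -> None:
    """Fill background gaps of at most max_gap pixels between foreground pixels in a row."""
    for row in bitmap:
        last_fg = None
        for x, value in enumerate(row):
            if value:
                if last_fg is not None and 0 < x - last_fg - 1 <= max_gap:
                    for gap_x in range(last_fg + 1, x):
                        row[gap_x] = 1
                last_fg = x


def transposed(bitmap: Bitmap) -> Bitmap:
    return [list(column) for column in zip(*bitmap)]


def count_components(bitmap: Bitmap) -> int:
    """Number of 4-connected foreground components (iterative flood fill)."""
    height, width = len(bitmap), len(bitmap[0])
    seen = [[False] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] and not seen[y][x]:
                count += 1
                seen[y][x] = True
                stack = [(x, y)]
                while stack:
                    cx, cy = stack.pop()
                    for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                        if 0 <= nx < width and 0 <= ny < height \
                                and bitmap[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((nx, ny))
    return count


def smear(bitmap: Bitmap, start_threshold: int = 1, max_threshold: int = 50) -> Bitmap:
    """Alternate horizontal and vertical gap filling with a growing threshold."""
    result = [row[:] for row in bitmap]
    threshold = start_threshold
    horizontal = True
    while count_components(result) > 1 and threshold <= max_threshold:
        if horizontal:
            fill_row_gaps(result, threshold)
        else:
            columns = transposed(result)
            fill_row_gaps(columns, threshold)   # vertical gaps are row gaps of the transpose
            result = transposed(columns)
        horizontal = not horizontal
        threshold += 1                          # assumed schedule: grow by one pixel
    return result


# Two words on one line and a shorter line below them.
words = [
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
]
smeared = smear(words)
```

Starting with a small threshold keeps the filled area tight: narrow gaps (between lines) are closed first, and wider gaps (between words) only once the threshold has grown far enough.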

3 – Subtraction of Neighbours

• Optional step to avoid overlap with adjacent regions

• Simply erase the corresponding pixels from the created bitmap
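
A minimal sketch of the erase operation, assuming that the region and its neighbours are available as bitmaps of identical size (the function name `subtract_neighbours` is made up for this example):

```python
from typing import List

Bitmap = List[List[int]]


def subtract_neighbours(region: Bitmap, neighbours: List[Bitmap]) -> Bitmap:
    """Return a copy of region in which every pixel occupied by a neighbour is cleared."""
    result = [row[:] for row in region]
    for neighbour in neighbours:
        for y, row in enumerate(neighbour):
            for x, value in enumerate(row):
                if value:
                    result[y][x] = 0
    return result


# The smeared region spills into the area of an adjacent region on the right.
smeared_region = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
neighbour = [
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
trimmed = subtract_neighbours(smeared_region, [neighbour])
```

The sketch does not check whether erasing pixels splits the region into several components.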

4 – Outline Tracing

• Trace the contour of the foreground component in the created bitmap

• Create polygon on-the-fly by adding points for each change of direction (corner)
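
One way to trace the outline is to walk along the pixel borders and emit a point at each corner, as in the sketch below. It follows the edges between foreground and background with the foreground on the right-hand side; this particular tracing strategy and the name `trace_outline` are assumptions, not necessarily the method used by the authors. Holes and additional components are ignored.

```python
from typing import Dict, List, Tuple

Bitmap = List[List[int]]
Point = Tuple[int, int]


def trace_outline(bitmap: Bitmap) -> List[Point]:
    """Trace the outer contour of the component that contains the top-left-most
    foreground pixel. Returns the corner points in clockwise order (y axis down)."""
    height, width = len(bitmap), len(bitmap[0])

    def is_fg(x: int, y: int) -> bool:
        return 0 <= x < width and 0 <= y < height and bitmap[y][x] == 1

    # Directed unit edges along pixel borders, foreground on the right-hand side.
    outgoing: Dict[Point, List[Point]] = {}
    for y in range(height):
        for x in range(width):
            if not is_fg(x, y):
                continue
            if not is_fg(x, y - 1):   # top border, heading east
                outgoing.setdefault((x, y), []).append((1, 0))
            if not is_fg(x + 1, y):   # right border, heading south
                outgoing.setdefault((x + 1, y), []).append((0, 1))
            if not is_fg(x, y + 1):   # bottom border, heading west
                outgoing.setdefault((x + 1, y + 1), []).append((-1, 0))
            if not is_fg(x - 1, y):   # left border, heading north
                outgoing.setdefault((x, y + 1), []).append((0, -1))
    if not outgoing:
        return []

    start = min(outgoing, key=lambda p: (p[1], p[0]))  # top-most, then left-most corner
    polygon: List[Point] = []
    position, direction, previous = start, outgoing[start][0], None
    while True:
        if direction != previous:
            polygon.append(position)  # change of direction: add a polygon point
        position = (position[0] + direction[0], position[1] + direction[1])
        previous = direction
        if position == start:
            return polygon
        candidates = outgoing[position]
        # Where two pixels touch only diagonally, turn right to stay on the same pixel.
        right_turn = (-direction[1], direction[0])
        direction = right_turn if right_turn in candidates else candidates[0]


region = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
]
outline = trace_outline(region)
```

Straight runs of border pixels add no points, so the polygon stays small even for large regions.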

Experiments

• Carried out on a dataset of contemporary documents consisting of scanned magazine and technical article pages

• Processed with Tesseract OCR 3.02 (open source)

• Exported to PAGE XML with and without refinement

[Figure: region outlines on a sample page, original (unrefined) versus refined]

Results

• Measurement of region overlaps (number and area)

                     Overlapping Regions    Overlap Area (Megapixel)
Original Outlines    621 (45.8%)            19.9
Refined Outlines     286 (21.1%)            2.5
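
A measurement of this kind could be computed along the following lines. The slides do not define the measure in detail, so the sketch assumes that a region counts as overlapping when it shares at least one pixel with another region, and that the overlap area is summed over all region pairs. The helper names and the sample boxes are made up for this example.

```python
from itertools import combinations
from typing import List, Set, Tuple

Pixel = Tuple[int, int]


def box_pixels(left: int, top: int, right: int, bottom: int) -> Set[Pixel]:
    """All pixels inside an inclusive bounding box."""
    return {(x, y) for x in range(left, right + 1) for y in range(top, bottom + 1)}


def overlap_statistics(regions: List[Set[Pixel]]) -> Tuple[int, int]:
    """Return (number of regions involved in an overlap, total overlap area in pixels)."""
    overlapping = set()
    overlap_area = 0
    for (i, a), (j, b) in combinations(enumerate(regions), 2):
        shared = len(a & b)
        if shared:
            overlapping.update((i, j))
            overlap_area += shared
    return len(overlapping), overlap_area


# Three hypothetical regions described by bounding boxes; the first two intersect.
original = [box_pixels(0, 0, 5, 3), box_pixels(4, 2, 9, 7), box_pixels(0, 9, 9, 9)]
count, area = overlap_statistics(original)
```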

Impact on Performance Evaluation

• Real-world scenario

• Measure the performance of Tesseract OCR engine

• Evaluation metrics of previous ICDAR page segmentation competitions

Average success rate using original outlines:   81.1%
Average success rate using refined outlines:    84.5%
Average improvement for all documents:          3.4%
Maximum improvement:                            22.9%

Conclusion

• Existing geometric region data can be significantly refined by fitting precise polygons around child objects

• Validity and impact on real-world scenarios have been shown

• Refinement in performance evaluation helps to eliminate problems that arise from insufficient geometric descriptions → Concentrate on real issues of OCR methods

• Positive effect on accuracy of presentation/repurposing systems (highlighting, cropping, article tracking, etc.)

• Approach used in Aletheia ground truth editor and result viewer (primaresearch.org/tools)
