An image based approach for content analysis in document collections

Reinhold Huber-Mörk & Alexander Schindler, AIT Austrian Institute of Technology GmbH, Department Safety & Security, Intelligent Vision Systems, Vienna, Austria

This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).


Reinhold Huber-Mörk of the Austrian Institute of Technology presented ‘An image based approach for content analysis in document collections’ at ISVC'13 (9th International Symposium on Visual Computing) in Rethymnon, Crete, Greece, on 31 July 2013. The talk covered the development of tools for library workflows for duplicate content detection and content verification of complex documents, accompanied by results of the work.


Page 1: An image based approach for content analysis in document collections

An image based approach for content analysis in document collections

Reinhold Huber-Mörk & Alexander Schindler AIT Austrian Institute of Technology GmbH Department Safety & Security Intelligent Vision Systems Vienna, Austria

This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).

Page 2: An image based approach for content analysis in document collections

Motivation

Large-scale book and newspaper scanning projects

Quality assurance: image quality, content preservation and detection of page duplication

Historical material: deprecated language & fonts, handwritten remarks, non-textual content, …

Museum collections: content is essential, different scans with same content

OCR often difficult

Content based approach = Image based

2 04.11.2013 This work was partially supported by the SCAPE Project.

The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).

Page 3: An image based approach for content analysis in document collections

Content duplication and verification


Page 4: An image based approach for content analysis in document collections

Approach


Page 5: An image based approach for content analysis in document collections

Local Features

All detections on a book front page (ordered by scale)


Page 6: An image based approach for content analysis in document collections

Local Features and Descriptors

Keypoints are detected at salient image regions

A keypoint is described by a descriptor (= vector of features)

Scale-Invariant Feature Transform - SIFT [Lowe, 2004]


Page 7: An image based approach for content analysis in document collections

Bag of Features (BoF)


Page 8: An image based approach for content analysis in document collections

Visual Vocabulary


Page 9: An image based approach for content analysis in document collections

Image comparison

Comparison of visual histograms – tf (“term frequency”) score

Spatial verification

Comparison of 3 schemes for spatial verification based on

Homography estimation

Visual word co-occurrence

Global descriptor properties statistics
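The histogram comparison step could be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes descriptors are assigned to the nearest visual word by Euclidean distance and that two tf histograms are compared by cosine similarity (the slide does not state the scoring formula). The function names and the tiny 2-D "descriptors" are hypothetical; real SIFT descriptors have 128 dimensions.

```python
import math
from collections import Counter

def quantize(descriptor, vocabulary):
    """Index of the visual word (cluster centre) nearest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda k: math.dist(descriptor, vocabulary[k]))

def tf_histogram(descriptors, vocabulary):
    """Relative frequency of each visual word among one image's descriptors."""
    counts = Counter(quantize(d, vocabulary) for d in descriptors)
    total = len(descriptors) or 1  # avoid division by zero for empty images
    return [counts[k] / total for k in range(len(vocabulary))]

def tf_score(hist_a, hist_b):
    """Cosine similarity of two tf histograms (assumed scoring, see lead-in)."""
    dot = sum(a * b for a, b in zip(hist_a, hist_b))
    norm = (math.sqrt(sum(a * a for a in hist_a))
            * math.sqrt(sum(b * b for b in hist_b)))
    return dot / norm if norm else 0.0

# Toy vocabulary of three visual words and three "pages" of descriptors.
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
page_a = [(0.1, 0.0), (0.9, 0.1), (0.0, 0.8), (0.1, 0.9)]
page_b = [(0.0, 0.1), (1.1, 0.0), (0.1, 1.0), (0.0, 1.1)]  # same word mix as page_a
page_c = [(0.9, 0.0), (1.0, 0.1), (1.1, 0.1), (0.8, 0.0)]  # only word 1

hist_a = tf_histogram(page_a, vocabulary)
hist_b = tf_histogram(page_b, vocabulary)
hist_c = tf_histogram(page_c, vocabulary)
print(tf_score(hist_a, hist_b), tf_score(hist_a, hist_c))
```

Pages a and b use the visual words in the same proportions and therefore score higher than the pair a and c. The histogram ignores where the words occur on the page, which is what the spatial verification schemes address.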


Page 10: An image based approach for content analysis in document collections

Spatial verification (1)

Homography estimation and mapping


Image pair → Descriptor matching → Estimation of affine transformation → Image overlay → Similarity estimation

Similarity measure MSSIM
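The similarity estimation on the overlaid image pair could look roughly like the sketch below. It is a simplified stand-in under stated assumptions: the two images are already aligned, equally sized, grey-valued in 0..255, and at least one block in size, and SSIM is averaged over non-overlapping square blocks, whereas the usual MSSIM formulation uses a sliding, Gaussian-weighted window. The function names and the block size are illustrative only.

```python
def ssim_block(x, y, dynamic_range=255.0):
    """SSIM index of two equally long lists of grey values."""
    n = len(x)
    c1 = (0.01 * dynamic_range) ** 2  # stabilising constants of the SSIM index
    c2 = (0.03 * dynamic_range) ** 2
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return (((2 * mx * my + c1) * (2 * cov + c2))
            / ((mx * mx + my * my + c1) * (vx + vy + c2)))

def mssim(img_a, img_b, block=4):
    """Mean SSIM over non-overlapping block x block tiles of two aligned images."""
    rows, cols = len(img_a), len(img_a[0])
    scores = []
    for r in range(0, rows - block + 1, block):
        for c in range(0, cols - block + 1, block):
            xa = [img_a[i][j] for i in range(r, r + block) for j in range(c, c + block)]
            xb = [img_b[i][j] for i in range(r, r + block) for j in range(c, c + block)]
            scores.append(ssim_block(xa, xb))
    return sum(scores) / len(scores)

# Toy 8x8 "scan" and its grey-value inversion.
page = [[(i * 8 + j) * 4 for j in range(8)] for i in range(8)]
inverted = [[255 - v for v in row] for row in page]

print(mssim(page, page), mssim(page, inverted))
```

An image compared with itself yields a score of 1 (up to rounding), while the inverted copy scores far lower because its local structure is anti-correlated with the original.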

Page 11: An image based approach for content analysis in document collections

Spatial verification (2)

A co-occurrence matrix of visual words counts the concurrent appearance of two visual words in a spatial neighbourhood

Comparison
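A co-occurrence matrix of this kind could be built as in the sketch below. The neighbourhood is assumed to be a fixed pixel radius and the comparison of two matrices is assumed to be a normalized histogram intersection; the slide specifies neither, so both choices and all names are illustrative. The pairwise loop is quadratic in the number of keypoints and is only meant for small examples.

```python
import math
from collections import Counter

def cooccurrence(keypoints, radius):
    """Count unordered visual-word pairs whose keypoints lie within `radius`.

    keypoints: list of (x, y, word_index).
    Returns a sparse matrix as a Counter keyed by (word_i, word_j) with i <= j.
    """
    counts = Counter()
    for a in range(len(keypoints)):
        xa, ya, wa = keypoints[a]
        for b in range(a + 1, len(keypoints)):
            xb, yb, wb = keypoints[b]
            if math.hypot(xa - xb, ya - yb) <= radius:
                counts[(min(wa, wb), max(wa, wb))] += 1
    return counts

def cooccurrence_similarity(c1, c2):
    """Histogram intersection of two co-occurrence matrices, scaled to [0, 1]."""
    shared = sum(min(c1[k], c2[k]) for k in c1.keys() & c2.keys())
    largest = max(sum(c1.values()), sum(c2.values()))
    return shared / largest if largest else 0.0

# Three toy pages that all contain the same visual words (0, 1, 2, 2).
page_a = [(0, 0, 0), (1, 0, 1), (10, 10, 2), (11, 10, 2)]
page_b = [(5, 5, 0), (6, 5, 1), (15, 15, 2), (16, 15, 2)]  # page_a shifted
page_c = [(0, 0, 0), (10, 10, 1), (1, 0, 2), (11, 10, 2)]  # words rearranged

ca, cb, cc = (cooccurrence(p, radius=2) for p in (page_a, page_b, page_c))
print(cooccurrence_similarity(ca, cb))  # 1.0
print(cooccurrence_similarity(ca, cc))  # 0.0
```

All three pages would produce identical bag-of-features histograms; only the co-occurrence comparison separates the rearranged page from the shifted one.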


Page 12: An image based approach for content analysis in document collections

Spatial verification (3a)

Global keypoint property statistics from position, orientation and scale

Spatial inhomogeneity: subdivision of images into a sequence of rectangular tiles [Schilcher et al., 2008]

Spatially uniformly distributed: h→0; spatially concentrated: h→1

Examples: h=0.13, h=0.21
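The formula of the cited inhomogeneity measure is not given on the slide, so the sketch below is explicitly not that measure. It is a hypothetical stand-in with the same limiting behaviour (0 for uniformly spread keypoints, 1 for keypoints concentrated in a single tile), defined here as one minus the normalized entropy of the tile occupancy. The 4x4 tiling and the function name are assumptions.

```python
import math

def spatial_inhomogeneity(points, width, height, tiles_x=4, tiles_y=4):
    """Stand-in inhomogeneity score in [0, 1] from tile occupancy entropy.

    points: list of (x, y) keypoint positions inside a width x height image.
    Assumes more than one tile in total.
    """
    counts = [0] * (tiles_x * tiles_y)
    for x, y in points:
        tx = min(int(x / width * tiles_x), tiles_x - 1)   # clamp the right edge
        ty = min(int(y / height * tiles_y), tiles_y - 1)  # clamp the bottom edge
        counts[ty * tiles_x + tx] += 1
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c)
    return 1.0 - entropy / math.log(len(counts))

# One keypoint in the centre of every tile of a 100 x 100 image.
uniform = [(12.5 + 25 * i, 12.5 + 25 * j) for i in range(4) for j in range(4)]
# All keypoints crowded into the top-left tile.
concentrated = [(1 + i, 2 + i) for i in range(16)]

print(spatial_inhomogeneity(uniform, 100, 100))
print(spatial_inhomogeneity(concentrated, 100, 100))
```

The first call evaluates to approximately 0 and the second to approximately 1, matching the two limits named on the slide.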

Page 13: An image based approach for content analysis in document collections

Spatial verification (3b)

Orientation uniformity: keypoint orientation angles sorted in ascending order [Rao, 1972]

Circular uniformly distributed: u→0; dominant orientation: u→1

Examples: u=0.63, u=0.76
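One way to obtain such a score from sorted angles is a spacing statistic in the spirit of Rao's spacing test: sort the orientations, take the arc lengths between neighbours, and sum their deviations from the arc length expected under perfect uniformity. The division by the largest attainable value, which maps the result to [0, 1], is an assumption made here to match the limits on the slide; the exact definition used by the authors is not given.

```python
def orientation_uniformity(angles_deg):
    """Normalized spacing statistic: 0 = evenly spread, 1 = single orientation."""
    n = len(angles_deg)
    if n < 2:
        raise ValueError("need at least two orientation angles")
    a = sorted(x % 360.0 for x in angles_deg)
    # Arc lengths between consecutive angles, including the wrap-around arc.
    spacings = [a[i + 1] - a[i] for i in range(n - 1)]
    spacings.append(360.0 - a[-1] + a[0])
    expected = 360.0 / n
    u_raw = 0.5 * sum(abs(t - expected) for t in spacings)
    # Largest possible value: all angles coincide (assumed normalization).
    return u_raw / (360.0 * (1.0 - 1.0 / n))

print(orientation_uniformity([0, 90, 180, 270]))  # 0.0
print(orientation_uniformity([10, 10, 10, 10]))   # 1.0
```

Evenly spaced orientations give 0 and identical orientations give 1; typical keypoint sets fall in between.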

Page 14: An image based approach for content analysis in document collections

Spatial verification (3c)

Size distribution: SIFT delivers a size estimate S for each keypoint. A normalized size s is obtained from

The variance of S or s is used as a feature

Page 15: An image based approach for content analysis in document collections

Combination of BoF with spatial verification

Derive shortlist L from tf matching

Combine tf with a spatial verification term, where the term is based on one of the 3 verification schemes:

Homography estimation

Visual word co-occurrence

Global descriptor properties statistics

Computational efficiency (book with 256 pages @ 72 DPI)

tf only: 31 sec.

Homography estimation: 449 sec.

Visual word co-occurrence: 451 sec.

Global descriptor properties statistics: 128 sec.
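The shortlist idea can be sketched as below. The combination formula is not preserved in this transcript, so the sketch assumes a simple product of the tf score and the verification score; the function name, the shortlist size and the toy scores are all hypothetical. The point illustrated is that the expensive verification runs only on the shortlisted candidates.

```python
def rerank(tf_scores, verify, shortlist_size=3):
    """Re-rank the best tf matches by a combined score.

    tf_scores: dict mapping candidate page -> tf score.
    verify:    callable returning a spatial verification score in [0, 1].
    The product tf * verification is an assumed combination rule.
    """
    shortlist = sorted(tf_scores, key=tf_scores.get, reverse=True)[:shortlist_size]
    combined = {page: tf_scores[page] * verify(page) for page in shortlist}
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Toy scores for one query page against four candidate pages.
tf_scores = {"p12": 0.90, "p47": 0.88, "p03": 0.40, "p99": 0.10}
spatial = {"p12": 0.2, "p47": 0.9, "p03": 0.5, "p99": 0.9}

calls = []  # records which candidates were actually verified

def verify(page):
    calls.append(page)
    return spatial[page]

ranking = rerank(tf_scores, verify)
print([page for page, _ in ranking])  # ['p47', 'p03', 'p12']
print(len(calls))                     # 3
```

In this toy case the best tf match, p12, drops to last place after verification, and p99 is never verified because it does not reach the shortlist.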

Page 16: An image based approach for content analysis in document collections

Results: combination of BoF with spatial verification (1)

The plot shows, for each book page, the maximum of the combined similarity measured against all other pages of the same book

Page 17: An image based approach for content analysis in document collections

Results: combination of BoF with spatial verification (2)

How to interpret such plots w.r.t. duplication

Page 18: An image based approach for content analysis in document collections

Results: combination of BoF with spatial verification (3)

Further work: content analysis w.r.t.

Sections of similar layout: main body, cover, index, …

Pages of unique content: cover, graphical art, …

Clustering of the 2D space spanned by maximum similarity and page index using the DBSCAN algorithm [Ester et al., 1996]

[Plot: maximum similarity vs. normalized page index]
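The clustering step could be sketched with a compact DBSCAN over points of the form (normalized page index, maximum similarity). This is a minimal textbook-style version for illustration; the `eps` and `min_pts` values and the toy points are assumptions, and in practice a library implementation such as scikit-learn's `DBSCAN` would be the more likely choice.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: one label per point, -1 = noise, clusters are 0, 1, ..."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)

    def neighbours(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = noise  # may later become a border point of a cluster
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point, keep expanding
                queue.extend(reachable)
    return labels

# Toy book: a unique cover, a main body of similar pages, and an index section.
pages = [
    (0.00, 0.10),                                           # cover
    (0.20, 0.80), (0.25, 0.82), (0.30, 0.81), (0.35, 0.79),  # main body
    (0.90, 0.50), (0.93, 0.52), (0.96, 0.51),               # index
]
labels = dbscan(pages, eps=0.12, min_pts=3)
print(labels)  # [-1, 0, 0, 0, 0, 1, 1, 1]
```

The cover page ends up as noise, i.e. a page of unique content, while the main body and the index form two separate clusters of similar layout.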

Page 19: An image based approach for content analysis in document collections

Results: combination of BoF with spatial verification (4)

[Plot: maximum similarity vs. normalized page index]

Page 20: An image based approach for content analysis in document collections

Results: duplicate detection


Manual vs. automatic detection

59 books, 34805 pages

53 books correctly processed

53/59 ≈ 90% correct

69 of 75 duplicate runs detected

69/75 ≈ 92% correct

Missing detections due to heavily mixed content

Page 21: An image based approach for content analysis in document collections

Results: content verification between two versions (scans) of book collections (1)


Pairs not matching

Pairs with low similarity (or low overlap)

Pairs with high similarity

Page 22: An image based approach for content analysis in document collections

Results: content verification between two versions (scans) of book collections (2)


rate=1 means content is identical

Nr.  Book/barcode name                #Pairs  Rate of matches
1    +Z13641740X_31525197396364410    546     0.9982
2    +Z13722110X_31525197396362478    18      1.0000
3    +Z136400800_31525197396361993    269     0.9888
4    +Z136408008_31525197396361942    291     0.9897
5    +Z136408409_31525197396362038    1       1.0000
6    +Z136409104_31525197396361681    182     0.9670
7    +Z136411408_31525197396362266    219     0.9954
8    +Z136415001_31525197396363522    360     0.9861
9    +Z136419900_31525197396360634    219     0.9954
10   +Z136428500_31525197396360351    249     0.9799
11   +Z136436004_31525197396361129    273     0.9853
12   +Z137116108_31525197396265632    589     0.9949
13   +Z137117708_31525197396287838    651     0.9969
14   +Z137118403_31525197396265776    505     0.9822
15   +Z137120100_31525197396265914    1231    0.9992
16   +Z137150402_31525197396389590    2       1.0000
17   +Z137219001_31525197396361518    664     0.9774
18   +Z150800609_31525197396361025    212     0.9858
19   +Z152471307_31525197396311214    443     0.9910
20   +Z152472403_31525197396313828    460     0.9913
21   +Z152472701_31525197396315698    859     0.9953
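To read the rates in absolute terms, the number of non-matching page pairs can be estimated from each row. The sketch assumes that the rate is the number of matching pairs divided by #Pairs, rounded to four decimals; the slide reports only the rate, so the mismatch counts below are derived values, not reported ones. Only a few rows from the table are used.

```python
def implied_mismatches(pairs, rate):
    """Estimated number of non-matching pairs, assuming rate = matches / pairs."""
    return round(pairs * (1.0 - rate))

# (book/barcode name, #pairs, rate of matches) taken from the table above.
rows = [
    ("+Z13641740X_31525197396364410", 546, 0.9982),
    ("+Z136400800_31525197396361993", 269, 0.9888),
    ("+Z136409104_31525197396361681", 182, 0.9670),
    ("+Z137120100_31525197396265914", 1231, 0.9992),
]

for name, pairs, rate in rows:
    missing = implied_mismatches(pairs, rate)
    # Consistency check: recomputing the rate from the estimate.
    recomputed = round((pairs - missing) / pairs, 4)
    print(name, missing, recomputed)
```

Under this assumption the four rows correspond to 1, 3, 6 and 1 non-matching pairs, and the recomputed rates agree with the tabulated ones.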

Page 23: An image based approach for content analysis in document collections

Conclusion


Tools for library workflows for duplicated content detection and content verification for complex documents

Keypoint detection and description = purely image based approach

Bag of visual words & spatial verification

Global descriptor statistics provide reasonably good and fast spatial verification

Further work: content clustering and classification of defects

Page 24: An image based approach for content analysis in document collections

AIT Austrian Institute of Technology: your ingenious partner

Reinhold Huber-Mörk