Reinhold Huber-Mörk of the Austrian Institute of Technology presented ‘An image based approach for content analysis in document collections’ at ISVC'13 (9th International Symposium on Visual Computing) in Rethymnon, Crete, Greece, on 31 July 2013. The talk covered the development of tools for library workflows for duplicate content detection and content verification for complex documents, accompanied by results of the work.
An image based approach for content analysis in document collections
Reinhold Huber-Mörk & Alexander Schindler
AIT Austrian Institute of Technology GmbH
Department Safety & Security, Intelligent Vision Systems
Vienna, Austria
This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
Motivation
Large-scale book and newspaper scanning projects
Quality assurance: image quality, content preservation and detection of page duplication
Historical material: deprecated language & fonts, handwritten remarks, non-textual content, ...
Museum collections: content is essential, different scans with same content
OCR often difficult
Content based approach = Image based
Content duplication and verification
Approach
Local Features
All detections on a book front page (ordered by scale)
Local Features and Descriptors
Keypoints are detected at salient image regions
Each keypoint is described by a descriptor (= vector of features)
Scale-Invariant Feature Transform - SIFT [Lowe, 2004]
Bag of Features (BoF)
[Figure: two plots, horizontal axis 0 to 120, vertical axis 0 to 0.2]
Visual Vocabulary
Image comparison
Comparison of visual histograms – tf (“term frequency”) score
Spatial verification
Comparison of 3 schemes for spatial verification based on
Homography estimation
Visual word co-occurrence
Global descriptor properties statistics
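The tf comparison of visual histograms can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy 2-D descriptors, the two-word vocabulary, the function names and the use of cosine similarity as the comparison score are all assumptions, since the slides name only the tf score itself.

```python
import math

def nearest_word(descriptor, vocabulary):
    """Index of the visual word closest to the descriptor (squared Euclidean distance)."""
    return min(
        range(len(vocabulary)),
        key=lambda k: sum((d - w) ** 2 for d, w in zip(descriptor, vocabulary[k])),
    )

def tf_histogram(descriptors, vocabulary):
    """Count how often each visual word occurs and normalise to term frequencies."""
    counts = [0] * len(vocabulary)
    for desc in descriptors:
        counts[nearest_word(desc, vocabulary)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

def cosine_similarity(h1, h2):
    """Assumed comparison score for two tf histograms (1.0 = same word distribution)."""
    dot = sum(a * b for a, b in zip(h1, h2))
    n1 = math.sqrt(sum(a * a for a in h1))
    n2 = math.sqrt(sum(b * b for b in h2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Toy data: real SIFT descriptors have 128 dimensions, two are used here for brevity.
vocabulary = [(0.0, 0.0), (10.0, 10.0)]
page_a = [(0.0, 1.0), (1.0, 0.0), (9.0, 9.0)]
page_b = [(10.0, 10.0)]
print(cosine_similarity(tf_histogram(page_a, vocabulary), tf_histogram(page_b, vocabulary)))
```

Because the histogram discards keypoint positions, two pages with similar word statistics but different layout can score high, which is what the spatial verification schemes are meant to address.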
Spatial verification (1)
Homography estimation and mapping
Processing steps: image pair, descriptor matching, estimation of affine transformation, image overlay, similarity estimation
Similarity measure: MSSIM
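The final similarity step can be illustrated with a simplified mean SSIM. The sketch below assumes the two grey-level images have already been aligned by the estimated transformation and have the same size. It averages SSIM over non-overlapping square windows, which is a simplification of the usual sliding Gaussian window; the function names and the window size are assumptions.

```python
def ssim_window(x, y, dynamic_range=255.0):
    """SSIM index of two equally long lists of pixel values."""
    n = len(x)
    c1 = (0.01 * dynamic_range) ** 2  # stabilising constants from the SSIM definition
    c2 = (0.03 * dynamic_range) ** 2
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx * mx + my * my + c1) * (vx + vy + c2))

def mssim(img_a, img_b, win=8):
    """Mean SSIM over non-overlapping win x win windows (images are lists of rows)."""
    rows, cols = len(img_a), len(img_a[0])
    scores = []
    for r in range(0, rows - win + 1, win):
        for c in range(0, cols - win + 1, win):
            xa = [img_a[i][j] for i in range(r, r + win) for j in range(c, c + win)]
            xb = [img_b[i][j] for i in range(r, r + win) for j in range(c, c + win)]
            scores.append(ssim_window(xa, xb))
    if not scores:
        raise ValueError("images are smaller than one window")
    return sum(scores) / len(scores)

# Toy example: a gradient image compared with itself and with its inverse.
image = [[(i * 8 + j) * 4 for j in range(8)] for i in range(8)]
inverse = [[255 - v for v in row] for row in image]
print(mssim(image, image), mssim(image, inverse))
```

A value near 1 indicates that the overlaid pages agree structurally; strongly differing content drives the value towards zero or below.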
Spatial verification (2)
A co-occurrence matrix of visual words counts how often two visual words appear together within a spatial neighbourhood
Comparison
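A co-occurrence matrix of this kind could be built as sketched below. The keypoint format `(x, y, word_index)`, the circular neighbourhood with a fixed radius and the normalised-intersection comparison are assumptions made for illustration; the slides do not specify the neighbourhood shape or how two matrices are compared.

```python
def cooccurrence_matrix(keypoints, n_words, radius):
    """Count pairs of visual words whose keypoints lie within `radius` of each other.

    keypoints: list of (x, y, word_index) tuples.
    Returns a symmetric n_words x n_words matrix of counts.
    """
    matrix = [[0] * n_words for _ in range(n_words)]
    r2 = radius * radius
    for i in range(len(keypoints)):
        xi, yi, wi = keypoints[i]
        for j in range(i + 1, len(keypoints)):
            xj, yj, wj = keypoints[j]
            if (xi - xj) ** 2 + (yi - yj) ** 2 <= r2:
                matrix[wi][wj] += 1
                if wi != wj:
                    matrix[wj][wi] += 1  # keep the matrix symmetric
    return matrix

def matrix_similarity(m1, m2):
    """Assumed comparison: intersection of counts, normalised to [0, 1]."""
    inter = sum(min(a, b) for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))
    total = max(sum(map(sum, m1)), sum(map(sum, m2)))
    return inter / total if total else 1.0

page = [(0, 0, 0), (1, 0, 1), (10, 10, 1)]
m = cooccurrence_matrix(page, n_words=2, radius=2)
print(m, matrix_similarity(m, m))
```

The pairwise loop is quadratic in the number of keypoints, which is acceptable for a sketch but would need a spatial index for full pages.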
Spatial verification (3a)
Global keypoint property statistics from position, orientation and scale
Spatial inhomogeneity: subdivision of images into a sequence of rectangular tiles [Schilcher et al., 2008]
Spatially uniformly distributed: h→0; spatially concentrated: h→1
Examples: h=0.13, h=0.21
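The exact inhomogeneity measure of [Schilcher et al., 2008] is not reproduced on the slides. The sketch below is a hypothetical stand-in that only shares the stated behaviour (0 for evenly spread keypoints, 1 for keypoints concentrated in one tile): it uses a single fixed grid and the normalised total variation distance between the tile occupancy and a uniform distribution. The grid size and function name are assumptions.

```python
def spatial_inhomogeneity(points, width, height, tiles_x=4, tiles_y=4):
    """Stand-in measure in [0, 1]: 0 = evenly spread over the tiles, 1 = all in one tile.

    points: list of (x, y) keypoint positions inside a width x height image.
    NOTE: this is not the measure from the cited paper, only an illustration.
    """
    n_tiles = tiles_x * tiles_y
    n = len(points)
    if n == 0 or n_tiles < 2:
        return 0.0
    counts = [0] * n_tiles
    for x, y in points:
        tx = min(int(x / width * tiles_x), tiles_x - 1)   # clamp points on the border
        ty = min(int(y / height * tiles_y), tiles_y - 1)
        counts[ty * tiles_x + tx] += 1
    uniform = 1.0 / n_tiles
    total_variation = 0.5 * sum(abs(c / n - uniform) for c in counts)
    return total_variation / (1.0 - uniform)  # divide by the maximum possible value

even = [(i + 0.5, j + 0.5) for i in range(4) for j in range(4)]
clumped = [(0.1, 0.1)] * 16
print(spatial_inhomogeneity(even, 4, 4), spatial_inhomogeneity(clumped, 4, 4))
```

Two scans of the same page should yield similar values, so a large difference in h can serve as a cheap hint that the content differs.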
Spatial verification (3b)
Orientation uniformity: keypoint orientation angles sorted in ascending order [Rao, 1972]
Circular uniformly distributed: u→0; dominant orientation: u→1
Examples: u=0.63, u=0.76
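A uniformity score based on Rao's spacing statistic can be sketched as follows. The statistic itself (half the summed absolute deviation of the arc lengths between sorted angles from the expected spacing 360/n) is standard; scaling it to the range [0, 1] by its maximum value is an assumption made here to match the stated limits, and the slides do not say how u is normalised.

```python
def rao_spacing_uniformity(angles_deg):
    """Normalised Rao spacing statistic: 0 = evenly spaced angles, 1 = all angles equal."""
    n = len(angles_deg)
    if n < 2:
        return 0.0  # a single angle carries no spacing information
    a = sorted(x % 360.0 for x in angles_deg)
    # Arc lengths between neighbouring angles, including the wrap-around arc.
    spacings = [a[i + 1] - a[i] for i in range(n - 1)]
    spacings.append(360.0 - a[-1] + a[0])
    expected = 360.0 / n
    u_stat = 0.5 * sum(abs(t - expected) for t in spacings)
    return u_stat / (360.0 * (1.0 - 1.0 / n))  # assumed normalisation to [0, 1]

print(rao_spacing_uniformity([0, 90, 180, 270]))   # evenly spaced
print(rao_spacing_uniformity([10, 10, 10, 10]))    # one dominant orientation
```

Text pages tend to produce a few dominant orientations, so their scores should lie closer to 1 than those of pages dominated by graphical content.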
Spatial verification (3c)
Size distribution: SIFT delivers a size estimate S for each keypoint. A normalized size s is obtained from S (formula not preserved in this transcript)
The variance of S or s is used as a feature
Combination of BoF with spatial verification
Derive shortlist L from tf matching
Combine tf with a spatial verification term (formula not preserved in this transcript); the verification term is based on one of the 3 verification schemes:
Homography estimation
Visual word co-occurrence
Global descriptor properties statistics
Computational efficiency (book with 256 pages @72 DPI)
tf only: 31 sec.
Homography estimation: 449 sec.
Visual word co-occurrence: 451 sec.
Global descriptor properties statistics: 128 sec.
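The shortlist-then-verify scheme can be sketched as below. The combination formula is not preserved in the transcript, so multiplying the tf score by the verification score is purely an assumption, as are the function name, the shortlist size and the toy scores.

```python
def rank_with_verification(tf_scores, verify, shortlist_size=10):
    """Re-rank the best tf matches with a spatial verification score.

    tf_scores: dict mapping candidate page -> tf similarity to the query page.
    verify:    callable mapping candidate page -> verification score in [0, 1].
    Returns (candidate, combined score) pairs, best first.
    """
    # Step 1: shortlist L from tf matching only (cheap).
    shortlist = sorted(tf_scores, key=tf_scores.get, reverse=True)[:shortlist_size]
    # Step 2: run the verification only on the shortlist (expensive).
    # ASSUMPTION: product combination; the original formula is not available.
    combined = {page: tf_scores[page] * verify(page) for page in shortlist}
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

tf = {"page_12": 0.9, "page_40": 0.8, "page_77": 0.1}
verification = {"page_12": 0.2, "page_40": 0.9, "page_77": 1.0}
print(rank_with_verification(tf, verification.get, shortlist_size=2))
```

Restricting verification to the shortlist keeps the number of expensive comparisons per page small, which matters given how much slower the verification schemes are than tf matching alone.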
Results: combination of BoF with spatial verification (1)
The plot shows, for each book page, the maximum combined similarity to all other pages of the same book
Results: combination of BoF with spatial verification (2)
How to interpret such plots w.r.t. duplication
Results: combination of BoF with spatial verification (3)
Further work: content analysis w.r.t.
Sections of similar layout: main body, cover, index, ...
Pages of unique content: cover, graphical art, ...
Clustering of the 2D space spanned by maximum similarity and page index using the DBSCAN algorithm [Ester et al., 1996]
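Clustering pages in the plane of normalized page index and maximum similarity can be illustrated with a minimal DBSCAN. The implementation below is a plain textbook version written for this example; the `eps` and `min_pts` values and the toy points are assumptions, not parameters reported in the talk.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point (0, 1, ... = cluster, -1 = noise)."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = noise  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point of the current cluster
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point, keep expanding
                queue.extend(reachable)
    return labels

# Toy (normalized page index, maximum similarity) pairs.
pages = [(0.0, 0.9), (0.05, 0.9), (0.1, 0.9), (0.5, 0.2), (0.9, 0.9), (0.95, 0.9), (1.0, 0.9)]
print(dbscan(pages, eps=0.12, min_pts=2))
```

In this toy data, runs of consecutive pages with similar scores form clusters, while the isolated low-similarity page is labelled as noise, which corresponds to a page of unique content.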
[Figure: maximum similarity versus normalized page index, both axes 0 to 1; annotations 132, 45 and 76]
Results: combination of BoF with spatial verification (4)
[Figure: maximum similarity versus normalized page index, both axes 0 to 1; annotations 132, 45 and 76]
Results: duplicate detection
Manual vs. automatic detection
59 books, 34805 pages
53 books correctly processed
53/59 ≈ 90% correct
69 of 75 duplicate runs detected
69/75 ≈ 92% correct
Missing detections are due to heavily mixed content
Results: content verification between two versions (scans) of book collections (1)
Pairs not matching
Pairs with low similarity (or low overlap)
Pairs with high similarity
Results: content verification between two versions (scans) of book collections (2)
rate=1 means content is identical

Book/barcode nr.  Book/Barcode name               #Pairs  Rate of matches
 1                +Z13641740X_31525197396364410      546  0.9982
 2                +Z13722110X_31525197396362478       18  1.0000
 3                +Z136400800_31525197396361993      269  0.9888
 4                +Z136408008_31525197396361942      291  0.9897
 5                +Z136408409_31525197396362038        1  1.0000
 6                +Z136409104_31525197396361681      182  0.9670
 7                +Z136411408_31525197396362266      219  0.9954
 8                +Z136415001_31525197396363522      360  0.9861
 9                +Z136419900_31525197396360634      219  0.9954
10                +Z136428500_31525197396360351      249  0.9799
11                +Z136436004_31525197396361129      273  0.9853
12                +Z137116108_31525197396265632      589  0.9949
13                +Z137117708_31525197396287838      651  0.9969
14                +Z137118403_31525197396265776      505  0.9822
15                +Z137120100_31525197396265914     1231  0.9992
16                +Z137150402_31525197396389590        2  1.0000
17                +Z137219001_31525197396361518      664  0.9774
18                +Z150800609_31525197396361025      212  0.9858
19                +Z152471307_31525197396311214      443  0.9910
20                +Z152472403_31525197396313828      460  0.9913
21                +Z152472701_31525197396315698      859  0.9953
Conclusion
Tools for library workflows for duplicated content detection and content verification for complex documents
Keypoint detection and description = purely image based approach
Bag of visual words & spatial verification
Global descriptor statistics provide reasonably good and fast spatial verification
Further work: content clustering and classification of defects
AIT Austrian Institute of Technology, your ingenious partner
Reinhold Huber-Mörk [email protected]