7
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050 051 052 053 054 055 056 057 058 059 060 061 062 063 064 065 066 067 068 069 070 071 072 073 074 075 076 077 078 079 080 081 082 083 084 085 086 087 088 089 090 091 092 093 094 095 096 097 098 099 100 101 102 103 104 105 106 107 108 109 110 111 112 113 CVWW #36 CVWW #36 CVWW 2013 Submission #36. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE. Logical Layout Recovery: approach for graphic-based features Aysylu Gabdulkhakova, Tamir Hassan, Walter G. Kropatsch Pattern Recognition and Image Processing Group Technische Universit¨ at Wien Favoritenstraße 9-11, 1040 Wien, Austria {aysylu,tam,krw}@prip.tuwien.ac.at Paper ID 36 Abstract. In contrast to the existing approaches for document analysis and understanding this paper represents a system that considers a logical role for graphic content in predominantly textual, born digital PDF documents. This work was inspired by the idea of using structural graphic objects in order to clarify the logical layout even of complex mostly graphic documents. Based on visual cognition, geometric features and spatial relations, the proposed statistical method distinguishes illustrative graphic objects from structural graphic objects. We performed evaluation on two document domains - newspapers and technical manuals - and found the results to be reliable. We propose using logical information about the graphic content to be a new step towards domain-independent document understanding systems. 1. Introduction A human reader can easily rediscover the logical structure of any document from text properties (typesetting conventions) and layout. In ambiguous cases a human can additionally follow the meaning of the text paragraphs. Document analysis and understanding systems presented in literature are focused on textual data in domain-specific documents (books, business letters, scientific papers, technical specifications). They determine a logical structure - heading, paragraphs, reading order - based on individual properties of a given document class. Graphic regions are detected as non-textual and provide no semantic information. The problem stems from a need to create a system capable for a broad class of documents and to reuse or repurpose the document content, represented by graphic objects. In natively digital PDF documents(PDF Normal or Formatted Text and Graphics) these objects are defined by the set of low-level primitives, such as lines, rectangles, curves and glyphs. Graphic objects either construct an illustrative region (non-structural elements) or distinguish logical blocks from each other (structural elements). We classify documents into three types according to the layout complexity: simple, mostly textual documents: logical layout is obvious for both human and system, e.g. scientific papers(Figure 1); predominantly textual documents: logical layout is evident for human, but difficult for system, e.g. newspaper page(Figure 2); complex, mostly non-textual documents: sophisticated for human and system, e.g. creative design of magazines(Figure 3). The proposed object-based approach addresses a problem of analysis and understanding of vector graphic objects that appear in predominantly textual natively digital PDF documents. Our heuristic rule-based method considers logical geometrical properties and mutual arrangement between the graphic and text objects. This approach enables grouping low-level primitives into higher-level logical blocks and finding structural graphic elements. The remainder of this paper is organized as follows: in Section 2 we provide an overview to state-of-the-art research related to the problem. 1

Logical Layout Recovery: approach for graphic-based features...000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Logical Layout Recovery: approach for graphic-based features...000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029

000001002003004005006007008009010011012013014015016017018019020021022023024025026027028029030031032033034035036037038039040041042043044045046047048049050051052053054055056

057058059060061062063064065066067068069070071072073074075076077078079080081082083084085086087088089090091092093094095096097098099100101102103104105106107108109110111112113

CVWW#36

CVWW#36

CVWW 2013 Submission #36. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.

Logical Layout Recovery: approach for graphic-based features

Aysylu Gabdulkhakova, Tamir Hassan, Walter G. KropatschPattern Recognition and Image Processing Group

Technische Universitat WienFavoritenstraße 9-11, 1040 Wien, Austria{aysylu,tam,krw}@prip.tuwien.ac.at

Paper ID 36

Abstract. In contrast to the existing approachesfor document analysis and understanding this paperrepresents a system that considers a logical rolefor graphic content in predominantly textual, borndigital PDF documents. This work was inspired bythe idea of using structural graphic objects in orderto clarify the logical layout even of complex mostlygraphic documents. Based on visual cognition,geometric features and spatial relations, theproposed statistical method distinguishes illustrativegraphic objects from structural graphic objects. Weperformed evaluation on two document domains- newspapers and technical manuals - and foundthe results to be reliable. We propose usinglogical information about the graphic content to bea new step towards domain-independent documentunderstanding systems.

1. Introduction

A human reader can easily rediscover the logicalstructure of any document from text properties(typesetting conventions) and layout. In ambiguouscases a human can additionally follow the meaningof the text paragraphs.

Document analysis and understanding systemspresented in literature are focused on textual data indomain-specific documents (books, business letters,scientific papers, technical specifications). Theydetermine a logical structure - heading, paragraphs,reading order - based on individual properties ofa given document class. Graphic regions aredetected as non-textual and provide no semanticinformation. The problem stems from a needto create a system capable for a broad class of

documents and to reuse or repurpose the documentcontent, represented by graphic objects. In nativelydigital PDF documents(PDF Normal or FormattedText and Graphics) these objects are defined by theset of low-level primitives, such as lines, rectangles,curves and glyphs. Graphic objects either constructan illustrative region (non-structural elements) ordistinguish logical blocks from each other (structuralelements).

We classify documents into three types accordingto the layout complexity:

• simple, mostly textual documents: logicallayout is obvious for both human and system,e.g. scientific papers(Figure 1);

• predominantly textual documents: logicallayout is evident for human, but difficult forsystem, e.g. newspaper page(Figure 2);

• complex, mostly non-textual documents:sophisticated for human and system, e.g.creative design of magazines(Figure 3).

The proposed object-based approach addressesa problem of analysis and understanding of vectorgraphic objects that appear in predominantly textualnatively digital PDF documents. Our heuristicrule-based method considers logical geometricalproperties and mutual arrangement between thegraphic and text objects. This approach enablesgrouping low-level primitives into higher-levellogical blocks and finding structural graphicelements.

The remainder of this paper is organized asfollows: in Section 2 we provide an overviewto state-of-the-art research related to the problem.

1

Page 2: Logical Layout Recovery: approach for graphic-based features...000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029

114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170

171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227

CVWW#36

CVWW#36

CVWW 2013 Submission #36. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.

Figure 1. Example of a mostly textual document withtrivial layout

INAUGURATA«EXA»

Bagnodifollaperlavetrinabrescianadellearmisportive

ROVATO.Lavittima èil giovane NicolaRubagadi Castrezzato

Inautocontrounpaloperdelavitaa29anni

ILFUTURO.Lamemoriael’impegnoditantiprotagonisti

Laveritàeilfilosottilechenonvaspezzato

f PAG17

BATTUTODALLASAMP

EnelgiornopiùtristeilBresciainterrompelasuacorsaaGenova

Tuttiassoltiancheinappelloa38annidall’esplosionecheprovocò8mortie100feriti

Alle parti civili restano le speseVeltroni:«Lepaghinoipartiti»

LOSPORT SOTTOCHOC.Un infarto haucciso il25enne Morosinidurante Pescara-Livorno

Morteincampo,ilcalciosiferma

PIAZZA LOGGIA 1974-2012. Confermatoilgiudizio di1˚gradoperTramonte,Maggi, Zorzie Delfino

Stragesenzacolpevoli

L’INTERVISTA

Castrezzatiequel28maggiosulpalco«Maigiovanidevonoricordare»

di MAURIZIO CATTANEO

L’automobiledellagiovanevittima diCastrezzato

I GIUDICI HANNO CONDANNATO tutte le parti civili al pagamento delle spese processuali. Una cifra chesarà modesta ma che suona comunque come una beffa per chi per tutti questi anni ha chiesto giustizia. Edi beffa parla Manlio Milani, presidente dell’Associazione famigliari delle vittime: «È ridicolo che sidebbano anche pagare le spese processuali». La proposta di Walter Veltroni: «Paghino i partiti». f PAG 5

Mario Pari

E’ un filo che ha subito un altrostrattoneecherischiadispezzar-si.Nondeveaccaderee lasperan-za è che non accadrà. Perchè in-siemealfilodellamemoriac’èan-che il legame che unisce tutti co-loro che sono impegnati nella ri-cerca della verità. f PAG 3 f PAG7

Vergogna. Dopo trentotto anni la stragediPiazza della Loggia resta senzacolpevoli. E l’atroce sconfitta non è soloper i parenti delle vittime che hannoperduto i propri cari, e per iquali

questa sentenza è come sale vivo su una feritaancora aperta. No, la sconfitta è per tutti noi. Lodiciamo senza enfasi o retorica ma quasisottovoce, con amara tristezza.Nessuno chiedeva vendetta. E nessuno mette in

croce i giudici che, dopo così lungo tempo, hannodeciso di non decidere nonostante l’imponentelavoro indiziario della Procura bresciana. E anchese la decisione di far pagare le spese processualialle parti civili suona come l’ultima beffa di unsistema sgangherato che, se nonriesce a faregiustizia, non pensa neppure a fare quel gestominimo di accollare i costi del processo allo Stato.Ma tant’è, così è finita ed in fondo ce loaspettavamo.Detto questo, ciò che davvero amareggia e che

sconcerta, è l’impunità finale per chi ha compiutoquelle orrende stragi. E dunque l’impossibilità difare chiarezza su un passaggio così drammaticoper il Paese. Un’Italia che ancora una volta nonpuò fare i conti con il proprio passato e dunquematurare, evolvere, progredire.In questo senso c’è un filo che lega i grandi

processi delle stragi con gli altrettanto grandiprocessi per mafia. Una costante tutta italiana.Degli omicidi di Falcone e Borsellino conosciamocerto i nomi dei manovali del crimine, ma quandosi è arrivati ad un passo dal «terzo livello» tutto siè fatto opaco. In Piazza della Loggia come aPalermo appena si sfiora l’ipotesi di intrecci trapolitica, servizi segreti e criminalità lavicendafinisce nei nebbiosi cassetti di una Giustizia che cimette38 anni per dire «ci arrendiamo».Questa seconda repubblica dei partiti pirateschi

- questa concezione del potere come privilegio enoncome servizio, questa sprezzante ideadell’impunità in ogni occasione, questamancanza di etica e dignità neppure nascosta masbandierata nei salotti come alla tv, questomedievale familismo da clan che sostituisce ilconfronto democratico - tutto questo fa partedella degenerazione di un sistema che fonda leradici anche nell’incapacità dell’Italia di fare lucesulle stagioni delle stragi e dunque i conti con lapropria storia.Cosa aspettarci nel futuro se si costruisce su basi

cosìpoco solide? Cosa raccontare ai nostri figliche stanno ereditando un Paese povero, confuso,e che ha perso quasi ogni riferimento?Riguardando quelle immagini sbiadite e lontanediBrescia e l’odierno Palazzo vien da pensare chelanostra generazione ha sbagliato molto. Troppo.

Ildisastroitalianonascedall’impunità

ILGOVERNOELATASSA

«Benzina,nuoviaumentisoloincasiestremi» f PAG12

DOMENICA15APRILE 2012 €1,00ANNO39. NUMERO104. www.bresciaoggi.it

Calcio sotto choc: il giocatore delLivoro Piermario Morosini, 25anni, si è accasciato in campo edè morto durante la partita di se-rie B col Pescara. La Federcalcioha deciso lo stop a tutte le partitedel fine settimana, a partire daMilan-Genoa che avrebbe dovu-togiocarsi ieri alle 18. f PAG 8-11 f PAG36-38

La verità giudiziaria sulla stragediPiazzaLoggiasiallontanasem-pre più, dopo tre inchieste a cari-coprimadineofascistibresciani,poi milanesi, infine ordinovistiveneti; e dopo undici sentenze, aquasi 38 anni da quella mattinadel 28 maggio 1974 in cui furonouccise otto persone e altre centorimasero feritedall’esplosionediuna bomba nel corso di una ma-

nifestazioneantifascistapromos-sa dai sindacati dopo giorni ditensioni e attentati. Ieri i giudicidella Corte d’assise d’appello diBrescia,hannoassoltoCarlo Ma-ria Maggi, Delfo Zorzi, MaurizioTramonte e Francesco Delfino,per i quali i pm avevano chiestol’ergastolo,mentreperPinoRau-ti la Procura non aveva fatto ap-pello. f PAG2, 3, 4, 5 e 7

ILMINISTROELEPOLEMICHE

Lavoro,Forneroattacca«Riformaoacasa» f PAG13

BRESCIAOGGI+

GENTEA solo ¤ 1,20

SABATO IN EDICOLA

Post

eItalia

ne

S.p

.a.-

Sp

ed

.in

a.p

.-D

.L.3

53/2

003

(conv.

inL.2

7/0

2/2

004

n.4

6)a

rt.1

,com

ma

1,D

CB

Bre

scia

9HRLFTB*bgiaae+[M\A\E\L\F

Idisperatisoccorsi

sulcampodiPescara

aMorosini.Nellafotoasinistra,

igiocatoridelBresciasconfitti

Ungiovanedi29anni,NicolaRu-baga di Castrezzato, è morto ieriall’alba a Rovato. La sua auto èuscita di strada, forse per un col-po di sonno o per la pioggia, fi-nendoaddossoauncavodiaccia-iochesostenevaunpalodellaTe-lecom; quindi si è ribaltata ed haconcluso la sua corsa contro unplatano. Inutili i soccorsi: il ra-gazzo è morto poco dopo l’arrivoall’ospedale di Chiari. f PAG24

NUOVO

IO12014

[email protected] - www.bsifiere.com IO12017

AL MERCATO DEL VOSTRO PAESE ARRIVA

• Lunedì Rovato• Mercoledì Rodengo S.• Giovedì Roncadelle• Venerdì Cologne• Sabato Erbusco

Patatine, polpette, verdure, alette di pollo,bastoncini di pollo, petali di cipolle,

olive ascolane, mozzarellinee tante altre specialità

TUTTI I GIORNITANTETANTE NUOVE GOLOSITà

Per prenotazioni: 347 4189369

ROSTICCERIA DI OGNI TIPO

POLLERIA FRESCA A PREZZISUPER SUPER ECONOMICI

SALAMELLE FRESCHE GOLOSISSIME

LA GASTRONOMIA METELLI

Figure 2. Example of a predominantly textual documentwith non-trivial layout

Figure 3. Example of a complex ambiguous document

Section 3 and Section 4 describe two contributionsof this paper: proposed method and evaluation toolrespectively. Discussion of the obtained results,conclusion and direction for the future work is givenin Section 5.

2. Related work

State-of-the-art methods provide various solutionsto the problem of logical structure discovery. Mostof them take into account image-based features anddeal with document images rather than electronicdocuments. As soon as our system processes natively

digital PDF documents in this Section we will focuson approaches that use object-based properties.

Anjewierden [1] created a system, AIDAS, thatincrementally builds logical blocks and determinestheir role using shallow grammars. This approachis limited by the field of technical manuals.Chao and Fan [3] developed a method forinformation extraction in scientific papers. It usesboth object-based and image-based approaches forestablishing logical vector graphic entities. Dejeanand Meunier [4] present a system that appliesXY-cut-based algorithm for dividing the givendocument into logical blocks. Bloechle et al. [2]present an object-based system, Dolores(DocumentLogical Restructuring), for restructuring textualand graphical content. The idea is based onusing artificial neural networks which are trainedto recognize the logical layout of the newspapers.Hassan [7] developed a system, PDF ExtractionToolkit1, that particularly includes an object-basedbottom-up method for extracting logical text blocks,vector-graphics objects(lines, curves, rectangles) andbitmaps. The GUI shows the rectangular boundingboxes of the detected component groups, althoughthe primitives themselves may not be rectangular inshape.

The publications mentioned above do not describethe processing of graphical primitives in sufficientdetail. However, they do not appear to distinguishbetween structural and non-structural objects. Webelieve that structural elements is a powerful featureof the documents, that, as well as human, documentunderstanding systems can use for logical structurerecovery.

This idea was roughly implemented by Gaoet.al. [6]. Their image-based method aims extractingstructural information from PDF book documents.In one of the processing steps they find separationlines, that visually distinguish different parts ofthe documents. In contrast, our approach isoriented for predominantly textual documents, suchas newspapers, that provide rich variety of logicallayout and different types of structural elements -lines, rectangles, bitmaps.

We decided to advance the PDF Extraction Toolkitfrom the point of analysis and understanding ofvector graphics. The description of the inventedmethods can be found in [5], as well as in the nextSection.

1pdfXtk: http://pdfxtk.sourceforge.net

2

Page 3: Logical Layout Recovery: approach for graphic-based features...000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029

228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284

285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341

CVWW#36

CVWW#36

CVWW 2013 Submission #36. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.

Parsing PDF

instructions

(see [2], [3])

initial

processing

(see [2], [3])

grouping

algorithms

(see [2], [3])

grouping

algorithms (see

Section 4.1)

structural

element

searching (see

Section 4.2)

set of text

segments

set of graphic

primitives

set of higher-level

graphic objects

set of higher-level

text segments

Document analysisDocument

understanding

Natively

digital

PDF page

set of

candidates

for being a

structural

element

Output

graphic

regions,

structural

elements

text regions

graphic regions

initial

processing

(see [2], [3])

Figure 4. Stages of the whole algorithm for processing natively digital PDF page

3. Methodology

A system performs three tasks in order torepresent a natively digital PDF document as aset of text regions, graphic regions and structuralelements. First, the page content is extracted fromPDF instructions and transformed into Java-objectprimitives. Page location of these primitives isdefined by their bounding box coordinates in 2DCartesian space. In pdfXtk we store the followingtypes of primitives: line segments, rectangles,bitmaps, text segments (2-3 character-long textblock).

Next, the grouping rules are applied in order toobtain higher-level graphic and text objects. In thefinal processing phase, we determine which of thegraphical objects represent structural elements, asopposed to graphic regions. The diagram of thewhole algorithm can be seen in Figure 4.

The remainder of this section describes ourgrouping rules and our methods for determiningwhether a vector object is structural. The task ofprocessing text blocks is not addressed in this paper(see [ [8, 7]] for a description).

3.1. Grouping algorithms

The grouping rules that we have devisedtake into account geometrical properties of thegraphic objects as well as their mutual spatialarrangement. We classify these rules into twocategories: intersection-based and distance-based.When applied in combination with each other, theyenable higher-level objects to be constructed, whichusually correspond to distinct logical objects in thedocument’s structure.

3.1.1 Based on intersections

Lines. When the intersection between two lines isestablished, we group them together. For the nextline, we check the intersection with each member

of the group. The special case is when two linesconstruct a solid line, i.e. visually they are perceivedas one. In this case we merge these line segments andno group is created.

Rectangles. This method is applicable not onlyto rectangles, but also to bitmap objects and complexfigures (in the latter case, the bounding box is used).It is based on the assumption that two structuralrectangle objects on the page are unlikely to intersect.Specifically, when the topmost coordinate of onerectangle is less than the bottommost coordinate ofthe other or when the leftmost coordinate of onerectangle is greater than the rightmost coordinate ofthe other.

Often advertisements or other separated contentis enclosed in structural rectangles, which are veryclose to each other or even overlap. Hence, beforemerging such elements using the above rule, wecheck whether they enclose other objects. If yes, thenthe given pair of rectangles is not grouped.

Line and Rectangle. In predominantly textualdocuments, two types of intersection between lineand rectangle objects can occur:

1. structural line intersects the rectangular object;

2. line segment intersects the rectangular object,both are part of a graphic region.

In order to avoid overmerging, three actions aresequentially performed:

a) the width ratio or height ratio, whichever is thelarger, is compared to the given threshold;

b) we determine the type of intersection or,more precisely, the mutual arrangement of theintersecting objects (see Figure 5);

c) the ratio between both parts of the line, split atthe intersection point, is compared to a giventhreshold.

3

Page 4: Logical Layout Recovery: approach for graphic-based features...000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029

342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398

399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455

CVWW#36

CVWW#36

CVWW 2013 Submission #36. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.

Figure 5. The grouping method for rectangles, based onthe sizes and mutual arrangement of the given objects

Line and Text. There are a variety of ways inwhich line segments and text fragments can intersecteach other. In our research we focused on four casesthat commonly occur in newspapers:

a) line segment underscores text block;

b) line segment crosses the word;

c) line segments form the axes and text blocksrepresent labels (as in diagram);

d) text block is surrounded by lines, whichdistinguish it from the other part of a document.

Cases a) and b) can be distinguished from eachother by the distance between their centres projectedon the Y-axis: if the line is closer to the centre ofthe text bounding box than to its border, then casea) applies; otherwise case b). Cases c) and d) canalso be distinguished by the distance between theircentres, but projected on the X-axis: if the line istouching or intersecting the text bounding box, thencase c) applies; otherwise case d).

Rectangle and Text. As in previous paragraphthere are several possibilities of intersection betweenthe given objects. More precise:

a) rectangle encloses text fragment;

b) rectangle intersects text fragment;

c) rectangle slightly touches the text fragment.

The last case occurs in tight layouts, wherethe rectangular bounding boxes of neighboringcomponents often slightly overlap each other.

3.1.2 Based on distance

Lines. Here the distance-based rules consider thepossibility of dashed lines. Line segments representsmall objects with the distance between them lessthan or equal to the element size.

Rectangles.Two rectangles are considered as asingle object if the distance between their centres is

Figure 6. The grouping method for rectangles, based onthe sizes and mutual arrangement of the given objects

less than a given threshold. This threshold dependson two parameters: a granularity-level coefficientand the widths or heights of the rectangles. It iscalculated by multiplying the first parameter with thesum of the second.

The granularity-level coefficient depends on asize ratio of the given rectangles and is dividedinto 3 types: high (0.55), normal (0.6) and low(0.65). These numerical values were obtainedexperimentally. Next, the algorithm continues bydetecting one of nine cases of mutual arrangementbetween two rectangles (as illustrated in Figure 6).If the current rectangle is located in area 4 or 6towards the rectangle being considered (green andblue rectangles respectively), the threshold distanceuses the sum of both widths. For cases 2 or 8(yellow and blue rectangles), the threshold distancedepends on their heights. For the remaining cases1, 3, 7, 9 (red and blue rectangles), two thresholddistances are counted using the widths and heights.Finally, the threshold distance is compared to thedistance between the centres of the given objectsprojected on the appropriate axis. In cases 1, 3,7 and 9 the comparison is conducted on both axeswith the corresponding thresholds. It is worth notingthat our system represents composite objects bytheir rectangular bounding box. In such a manner,graphic glyphs that form parts of logos, newspaperheadings etc. are introduced as rectangular objects.A vivid example of the above glyphs is the headingof newspapers such as The Sydney Morning Herald,International Herald Tribune, etc. (Figure 7). Here,low-level primitives are positioned sequentially inone row/column. In order to detect this case, we refer

4

Page 5: Logical Layout Recovery: approach for graphic-based features...000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029

456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512

513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569

CVWW#36

CVWW#36

CVWW 2013 Submission #36. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.

Figure 7. Newspaper heading representation in pdfXtk

to the golden ratio font rules [2].

Errors can occur with advertisements that have thesame size and are close to each other (Section 3.1.1,Rectangle and Rectangle). These advertisementsdiffer from glyphs as they also contain text.Moreover, the advertising boxes are usually filledwith objects as large as at least one third of their size,whereas glyphs have a negligibly small filled area.

3.2. Finding structural elements

Line. Generally a structural line occurs as ahorizontal/vertical line or rectangle that looks likea line, which does not intersect other non-structuralobjects (strokes, lines, rectangles, merged graphicregions, text fragments) and is not enclosed in anygraphic region.

In Section 3.1.1, paragraph Line and Text, weconsidered four cases in which line and text objectscan intersect each other. Case a) is a special case,where the line is neither structural nor illustrative, butrather an integral part of the formatting of the text. Incase b) the line is likely to form part of an illustrativeregion. In cases c) and d) there is a pair of identicalgroups of straight lines with different semantics: incase c) these lines form part of a chart, whereas incase d) the lines are used as a ”barrier” and servethe purpose of separating the text paragraph from theremaining page content. Thereby we conclude thatc) is an example of non-structural lines and d) is anexample of structural lines. In order to detect thecorrect variant, the closeness of line segments andtext fragments is taken into account.

Rectangle. Generally, a structural rectangleoccurs does not intersect other graphic primitives andregions, but can enclose them. The special case isthat of ”stand-alone” rectangle objects, which oftenlook like and serve the same logical function as lines.Therefore, in Section 3.1.1, paragraph Rectangleand Text, case a), the rectangle can represent thebounding box of textual content and thus is morelikely to be structural than in cases b) and c).

4. Evaluation

For this reason we created an interactiveevaluation tool and performed estimation of theobtained results on a bitmap level using similaritymeasures from [9].

The developed system performs several tasks:ground truth image generation, resultant imagegeneration and comparison of the above images.As an input it takes a binary image of a givenPDF document at a fixed resolution 72dpi, whichis sufficient for this purpose. Ground truth imagegeneration is obtained by manual marking on abinary image the appropriate logical graphic regions.Resultant image is produced by automatical mappingXML file of the algorithm output to the binaryimage. In a given two segmentation images -ground truth and resultant - each logical type ofa graphic object(structural line, structural rectangle,illustrative region) is represented by a specific color.Correspondence between two images is establishedvia pixel-by-pixel color-value comparison. Interfaceof the system is shown in Figure 8.

In order to define the measures that determinethe nature of overlapping between the regions, weborrow ideas from the approach proposed in [9]:

• correct detection regions are mainlyoverlapping;

• partial detection some overlapping detected,but not sufficient as in the first case;

• over-segmentation a single object in the groundtruth is detected as two separate segments;

• under-segmentation - two segments in theground truth are erroneously merged in thealgorithms output;

• incorrect detection the types of region inground truth and result of the algorithm aredifferent (structural rectangle, structural line orgraphic region);

• false positives the region is marked by thealgorithm, but does not occur in the groundtruth;

• missed objects the region is marked in theground truth, but has not been detected by thealgorithm.

5

Page 6: Logical Layout Recovery: approach for graphic-based features...000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029

570571572573574575576577578579580581582583584585586587588589590591592593594595596597598599600601602603604605606607608609610611612613614615616617618619620621622623624625626

627628629630631632633634635636637638639640641642643644645646647648649650651652653654655656657658659660661662663664665666667668669670671672673674675676677678679680681682683

CVWW#36

CVWW#36

CVWW 2013 Submission #36. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.

4.1. Experimental results

Graphic-understanding approach was tested onpredominantly textual electronic PDF documents oftwo domains: newspapers and technical manuals.Precisely, we took 100 pages from 10 differentEuropean newspapers taken from 15-17 April 2012,namely Nuovo Quotidiano di Rimini, El Mundo delSiglo XXI, China Daily, Il Tirreno, Die Tageszeitung,Le Monde, L’Eco di Bergamo, Aripaev, InternationalHerald Tribune, Bresciaoggi; 30 pages from first8 different technical manuals obtained by using apopular search engine. The results of our evaluationare given in Table 1 and Table 2 correspondingly.

4.2. Discussion

Choosing two particular document classes iscaused by the purpose of testing approach onvarious data. Newspapers provide a rich layoutvariety and sparse complex vector-graphic objects,such as drawings. Technical manuals, vice versa,are represented by a simple logical layout andmostly include sophisticated figures. The algorithmdemonstrates a high performance on both types ofpredominantly textual, natively digital PDF pages.

Important drawback for the proposed algorithmis that PDF is a descendant of a PostScript pagedescription language. Limited number of renderinginstruction causes an unexpected set of underlyingoperator structures even for a simple page layout.Thus such documents are easily perceived by human,but confuses algorithms that work directly on theoperator level.

5. Conclusion and further work

A new approach for analysis and understandingof vector graphic content in natively digital PDFdocuments is described in this paper. It includesalgorithms for grouping graphic primitives intohigher-level logical blocks and for understandingtheir logical role in a document. The efficiency andreliability of the system was tested on newspapersand technical manuals and achieved good results.Choosing these document domains was caused bythe need to prove that the provided heuristic rulesperform well not only for a specific logical layoutpages, but also for the technical figures and schemes.

The goal of this paper is to address a problemof studying the properties of graphical content indocuments, as it is a powerful tool for dividing a pageinto logical blocks. The current implementation is

oriented to predominantly textual documents. Forthe future work we propose to extend our set ofheuristic rules by considering bitmap elements ortext elements to be structural. This information canbe further used to analyse complex mostly graphicdocuments such as magazines.

Acknowledgements

This work was funded in part by the AustrianFederal Ministry of Transport, Innovation andTechnology (Grant No. 829602).

References[1] A. Anjewierden. Aidas: Incremental logical

structure discovery in pdf documents. In DocumentAnalysis and Recognition, 2001. Proceedings. SixthInternational Conference on, pages 374–378. IEEE,2001. 2

[2] J. Bloechle, M. Rigamonti, K. Hadjar, D. Lalanne,and R. Ingold. Xcdf: a canonical and structureddocument format. Document Analysis Systems VII,pages 141–152, 2006. 2

[3] H. Chao and J. Fan. Layout and content extractionfor pdf documents. In DAS 2004: Proceedings ofthe International Workshop on Document AnalysisSystems, pages 213–224, 2004. 2

[4] H. Dejean and J. Meunier. A system for convertingpdf documents into structured xml format. DocumentAnalysis Systems VII, pages 129–140, 2006. 2

[5] A. Gabdulkhakova and T. Hassan. Documentunderstanding of graphical content in natively digitalpdf documents. In Proceedings of the 2012 ACMsymposium on Document engineering, pages 137–140. ACM, 2012. 2

[6] L. Gao, Z. Tang, X. Lin, Y. Liu, R. Qiu, andY. Wang. Structure extraction from pdf-based bookdocuments. In Proceedings of the 11th annualinternational ACM/IEEE joint conference on Digitallibraries, pages 11–20. ACM, 2011. 2

[7] T. Hassan. Object-level document analysis of pdffiles. In Proceedings of the 9th ACM symposium onDocument engineering, DocEng ’09, pages 47–55,New York, NY, USA, 2009. ACM. 2, 3

[8] T. Hassan. User-Guided Information Extraction fromPrint-Oriented Documents. PhD thesis, Citeseer,2010. 3

[9] A. Shahab, F. Shafait, T. Kieninger, and A. Dengel.An open approach towards the benchmarking of tablestructure recognition systems. In Proceedings ofthe 9th IAPR International Workshop on DocumentAnalysis Systems, DAS ’10, pages 113–120, NewYork, NY, USA, 2010. ACM. 5

6

Page 7: Logical Layout Recovery: approach for graphic-based features...000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029

684685686687688689690691692693694695696697698699700701702703704705706707708709710711712713714715716717718719720721722723724725726727728729730731732733734735736737738739740

741742743744745746747748749750751752753754755756757758759760761762763764765766767768769770771772773774775776777778779780781782783784785786787788789790791792793794795796797

CVWW#36

CVWW#36

CVWW 2013 Submission #36. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.

Figure 8. Interface of the evaluation tool

Total Retrieved Correctdetection

Incorrectdetections

Partialdetections

Oversegment.

Undersegment.

Falsepositives

Missedobjects

Structurallines

847 875 710(86%) 93(11%) 0 0 28(3%) 122 30

Structuralrectangles

464 377 291(69%) 119(28%) 1(<1%) 11(2%) 4(<1%) 68 42

Graphicregions

364 670 291(82%) 55(15%) 2(<1%) 105(29%) 32(9%) 228 11

Table 1. Evaluation results on newspapers

Total Retrieved Correctdetection

Incorrectdetections

Partialdetections

Oversegment.

Undersegment.

Falsepositives

Missedobjects

Structurallines

92 101 85(92%) 17(18%) 0 0 0 1 0

Structuralrectangles

13 21 12(92%) 3(23%) 1(7%) 1(7%) 1(7%) 8 1

Graphicregions

66 80 61(94%) 6(9%) 1(<2%) 13(20%) 3(4%) 1 1

Table 2. Evaluation results on technical manuals

7