
HAL Id: hal-01681114
https://hal.archives-ouvertes.fr/hal-01681114

Submitted on 6 Jun 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Transfer Learning for Structures Spotting in Unlabeled Handwritten Documents using Randomly Generated Documents

Geoffrey Roman-Jimenez, Christian Viard-Gaudin, Adeline Granet, Harold Mouchère

To cite this version:
Geoffrey Roman-Jimenez, Christian Viard-Gaudin, Adeline Granet, Harold Mouchère. Transfer Learning for Structures Spotting in Unlabeled Handwritten Documents using Randomly Generated Documents. International Conference on Pattern Recognition Applications and Methods, Jan 2018, Madeira, Portugal. 10.5220/0006598204170425. hal-01681114


Transfer Learning for Structures Spotting in Unlabeled Handwritten Documents Using Randomly Generated Documents

Geoffrey Roman-Jimenez1, Christian Viard-Gaudin1, Adeline Granet1 and Harold Mouchère1

1University of Nantes, LS2N, UMR 6004, F-44100
[email protected], [email protected]

Keywords: Handwriting recognition, image generation, digit detection, deep neural networks, knowledge transfer.

Abstract: Despite recent achievements in handwritten text recognition due to major advances in deep neural networks, historical handwritten document analysis is still a challenging problem because of the requirement of large annotated training databases. In this context, knowledge transfer of neural networks pre-trained on already available labeled data could allow us to process new collections of documents. In this study, we focus on localization of structures at the word level, distinguishing words from numbers, in unlabeled handwritten documents. We base our approach on a transductive transfer learning paradigm using a deep convolutional neural network pre-trained on artificial labeled images randomly generated with stroke, word and number patches. We designed our model to predict a mask of the structure positions at the pixel level, directly from the pixel values. The model has been trained using 100,000 generated images. The classification performance of our model was assessed using randomly generated images coming from a different set of images of words and digits. At the pixel level, the averaged accuracy of the proposed structure detection system reaches 96.1%. We evaluated the transfer capability of our model on two datasets of real handwritten documents unseen during training. Results show that our model is able to distinguish most "digit" structures from "word" structures while avoiding the other various structures present in the documents, showing the good transferability of the system to real documents.

1 INTRODUCTION

This work is part of the CIRESFI project 1. The CIRESFI project aims to improve our knowledge about the Italian theater in France during the 18th century. The researchers in human and social sciences intend to reveal the cultural adaptation of the Italian actors in France. Whereas the political situation was against them, they succeeded in establishing themselves as an official and reputed institution. So, this project aims at providing efficient tools to access the different kinds of information included in a set of financial records of the Italian comedy dating from the 18th century, with more than 28,000 pages (Luca, 2011)(Cethefi, 2016). Information retrieval within a large collection of handwritten historical documents is a very complicated task, specifically when no ground-truth training dataset is available. As a first functionality, we focus on the detection

1Project ANR-14-CE31-0017 "Contrainte et Integration: pour une Reevaluation des Spectacles Forains et Italiens sous l'Ancien Regime"

of the structure of these financial documents. In this paper, we concentrate on the localization of all the number areas, distinguished from word or stroke structures, which are one of the main constituents of these financial records of the Italian Comedy (Cethefi, 2016). Later works could extract meaningful areas and columns based on these typed localizations.

Neural network-based deep learning has recently achieved great advances in the pattern recognition area, including classification, regression and clustering (LeCun et al., 2015)(Schmidhuber, 2015). In text recognition, recent works have shown that deep neural network (DNN) models can tackle the automatic detection of specific objects in handwritten documents (Moysset et al., 2016)(Butt et al., 2016). However, since DNN models base their learning on the data, most applications need large amounts of labeled data to train the models. In text recognition of handwritten documents, training data and test data are generally taken from the same corpus of documents, respecting the same text arrangement, same handwriting style and/or similar paper type. Unfortunately,


handwritten documents are rarely labeled with ground truth and manual annotation is very time-consuming. When the labels of the target data are not available, a transductive transfer learning strategy (Pan and Yang, 2010) can be considered to learn the detection task by training a model with annotated data from another database. Of course, the training dataset should present enough structural variability and representativeness with respect to the target dataset to both learn the detection task and ensure its transferability. Thus, the challenge here is to learn a generic detection task without learning the specific characteristics and patterns present in the training dataset. Once trained on this training dataset, the system should be able to retrieve in the target dataset the same kind of patterns.

In this paper, we propose to bridge the gap between a non-labeled dataset and a labeled dataset by randomly generating artificial images containing known patterns at known locations but with variable layouts and backgrounds. The strategy is based on a random placement of labeled patches of word and number structures on a background image from the target database. Clearly, the main advantage of such a strategy is that it allows generating as many images as necessary, can be adjusted to the target task, and ensures structural variability (Delalandre et al., 2010). Note that this generation strategy is quite close to (Kieu et al., 2013), where the authors create full pages of realistic documents. Furthermore, we show that a fully convolutional network can be trained to solve a digit/word detection task.

The paper is structured as follows: Section 2 introduces the notations, definitions and objectives of this work; Section 3 presents the image databases used for the generation of artificial documents and the real handwritten documents for which we want to detect number/word structures; Section 4 presents the architecture of the neural network trained for our pixel-wise digit spotting problem; Section 5 presents how we evaluated the classification performance of our model; Section 6 presents the results obtained with our model considering the artificially generated images and a qualitative evaluation on real handwritten mails and historical documents that were not used for training.

2 NOTATIONS, DEFINITIONS AND OBJECTIVES

In this section we introduce some notations and definitions used in this paper.

Let $X = \{x_1, x_2, ..., x_N\}$, with $x_i \in \mathbb{R}$, be the image of $N$ pixels of a given handwritten document. $S = \{s_1, s_2, ..., s_N\}$, with $s_i = (s_i^0\; s_i^1\; s_i^2)^T$ and $s_i^k \in \{0,1\}$, corresponds to the pixel-wise classification map of one-hot vectors indicating to which class each pixel of $X$ belongs. In our context, $k = 0$ corresponds to the class "background", $k = 1$ is the class "number" and $k = 2$ is the class "word". The goal of our model is to build the map $Y = \{y_1, y_2, ..., y_N\}$, with $y_i = (y_i^0\; y_i^1\; y_i^2)^T$ and $y_i^k \in [0,1]$, as the closest estimation of $S$, from the image $X$.

Using a set of $M$ images $\mathcal{X} = \{X_1, X_2, ..., X_M\}$, the corresponding set of structure maps $\mathcal{S}$ and a neural network model with parameters $\Theta$, we aim to learn a transformation function $T(\mathcal{X}, \Theta) = \mathcal{Y}$ by finding the parameters $\Theta^\star$ that minimize the cost function $C$,

$$\Theta^\star = \arg\min_{\Theta} C(T(\mathcal{X}, \Theta), \mathcal{S}). \quad (1)$$

As the targeted $Y = T(X, \Theta)$ should be the closest estimation of $S$, we define the cost function $C$ as the weighted cross entropy between $Y$ and $S$:

$$C(Y, S) = -\sum_{i}^{N} \sum_{k=0}^{2} \frac{1}{p_i^k} \left( s_i^k \cdot \log(y_i^k) \right), \quad (2)$$

where the weighting coefficient $p_i^k$ is the probability of the pixel $x_i$ to belong to the class $k$. This is expected to readjust the weight of the error depending on the proportion of each class within each image $X$. Given an image $X$, we computed $p_i^k$ as the proportion of the pixels belonging to the class $k$ within the image $X$.
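For illustration, Eq. (2) can be written in a few lines of NumPy. This is a minimal sketch with array shapes and a function name of our own choosing, not the released implementation:

```python
import numpy as np

def weighted_cross_entropy(Y, S, eps=1e-8):
    # Y: predicted map of shape (N, 3), each row a softmax output (y_i^0, y_i^1, y_i^2)
    # S: ground-truth map of shape (N, 3), each row a one-hot vector (s_i^0, s_i^1, s_i^2)
    p = S.mean(axis=0)                 # p^k: proportion of pixels of class k within the image
    w = 1.0 / np.maximum(p, eps)       # weighting coefficient 1 / p^k (guarded against empty classes)
    return -np.sum(w * S * np.log(Y + eps))   # sum over pixels i and classes k
```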

Considering a set of unlabeled images $\mathcal{X}^u$, the set of maps of one-hot vectors $\mathcal{S}^u$ is not available to learn the transformation $T^u(X^u, \Theta) = Y^u$. The objective of transfer learning is thus to learn a transformation $T^l(X^l, \Theta) = Y^l$ using a set of labeled images $\mathcal{X}^l$ that is similar enough to $\mathcal{X}^u$ to ensure that $T^l(X^u, \Theta) \approx Y^u$.

3 DATABASES

3.1 Real images

Three databases were considered in this study: the IReste Online Offline database (IROnOff) (Viard-Gaudin et al., 1999), the handwritten mails of the RIMES database (Augustin et al., 2006) and the unlabeled register accounts of the Italian Comedy (RECITAL) (Cethefi, 2016). RECITAL images were scanned by the BNF (National French Library)2 at 400 dpi.

2As an example, the register numbered 41 is available here: http://catalogue.bnf.fr/ark:/12148/cb42447323f


We used the IROnOff database to construct $\mathcal{X}^l$ and $\mathcal{S}^l$ to learn the transformation $T^l$. The RECITAL database corresponds to the set of images $\mathcal{X}^u$ for which we want to perform automatic structure spotting.

The IROnOff database contains a total of 61,291 images (300 dpi) of isolated handwritten characters, digits and words, with their corresponding transcriptions. We randomly separated the IROnOff database into two groups: the training set Dtrain corresponding to 67% of the total dataset (40,861 images) and the test set Dtest corresponding to 33% of the dataset (20,432 images). To evaluate the model during learning, we picked 20% of Dtrain as the validation set Dvalid (8,172 images).

Besides, we used the set of patches from the RIMES database to quantitatively evaluate the retrieval capability of our classifier. The RIMES database is a set of 1,500 images of labeled paragraphs from handwritten mails (Grosicki and El-Abed, 2011), with a total of 66,979 patches of words and numbers present in the documents. In RIMES, patches with numbers correspond to isolated numbers and identification strings using digits (such as license plates). A total of 918 patches contain digits.

3.2 Random image generation

We used images of Dtrain, Dvalid and Dtest as patches to generate 1536x1536 images and created the sets of artificial images $\mathcal{X}^l_{train}$, $\mathcal{X}^l_{valid}$ and $\mathcal{X}^l_{test}$ with their associated pixel-wise one-hot vector maps $\mathcal{S}^l_{train}$, $\mathcal{S}^l_{valid}$ and $\mathcal{S}^l_{test}$.

In order to generate images that are comparable with historical documents, we selected a total of 97 images from the RECITAL dataset without any handwriting, and used them as backgrounds. We also manually segmented 50 handwritten strokes (accolades, separators, . . . ) from the RECITAL dataset for adding various structures, distinct from the word/number patches.

The procedure to generate an image is defined by Algorithm 1. The first part of the procedure randomly creates a grid G in a 1536x1536 area using a randomly selected background image. Secondly, for each cell of the created grid, a type of patch is selected among {empty, word, number}. If the type is not empty, a patch of the corresponding type is selected, randomly scaled, and placed in the current cell. Note that the random scaling is constrained to keep the patch within the size of the cell. The corresponding ground-truth map S is built at this stage using the type of cell and its position and size. Then, a random number of strokes are placed on the created document, after random scaling and rotation. Finally, a Gaussian noise with a random signal-to-noise ratio (from 10 to 100) is added to the artificial document.

Algorithm 1: Artificial document generation.

procedure ARTIDOC
  # Inputs
  D: dataset of word/number patches
  B: dataset of background images
  S: dataset of stroke patches
  MinH: minimum height of patches
  MinW: minimum width of patches
  (SizeH, SizeW): size of the generated image

  # Random layout
  b ← random image in B
  A ← random area (SizeH, SizeW) in b
  W ← U(1, SizeW / MinW)
  H ← U(1, SizeH / MinH)
  C ← grid (W × H) of A

  # Random patch placement
  for each cell c ∈ C do
    t ← random type of patch ∈ {∅, word, number}
    d ← random patch ∈ D of type t
    d′ ← random scaling of d (keeping it in c)
    A(c) ← random placement of d′ in c

  # Random stroke placement
  nStroke ← U(0, W × H)
  for n ← 1, 2, ..., nStroke do
    s ← random stroke ∈ S
    s ← random scaling and rotation of s
    A ← random placement of s in A

  # Add noise
  SNR ← U(10, 100)
  σ²noise ← σ²signal · 10^(−SNR/10)
  Artidoc ← A + N(0, σ²noise)

  # Output
  return Artidoc

Note:
U(α, β): discrete uniform distribution (from α to β)
N(µ, σ²): normal distribution with mean µ and variance σ²
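A compact NumPy transcription of Algorithm 1 is sketched below for illustration. It assumes grayscale arrays in which ink corresponds to dark values; the helper names, the nearest-neighbour resizing, the use of a minimum when pasting patches and the omission of stroke rotation are simplifications of ours, not the released generator (available at the repository given in the text):

```python
import numpy as np

def nn_resize(img, h, w):
    # Nearest-neighbour resize of a 2D array (illustrative helper)
    rows = np.arange(h) * img.shape[0] // h
    cols = np.arange(w) * img.shape[1] // w
    return img[rows][:, cols]

def artidoc(patches, backgrounds, strokes, min_h=64, min_w=64,
            size_h=1536, size_w=1536, rng=None):
    # patches: {1: [number patch arrays], 2: [word patch arrays]}; class 0 is background
    rng = rng or np.random.default_rng()

    # Random background area of size (size_h, size_w); backgrounds are assumed larger
    b = backgrounds[rng.integers(len(backgrounds))]
    y0 = rng.integers(0, b.shape[0] - size_h + 1)
    x0 = rng.integers(0, b.shape[1] - size_w + 1)
    A = b[y0:y0 + size_h, x0:x0 + size_w].astype(float).copy()
    labels = np.zeros((size_h, size_w), dtype=np.uint8)   # ground-truth map S

    # Random grid of W x H cells
    W = int(rng.integers(1, size_w // min_w + 1))
    H = int(rng.integers(1, size_h // min_h + 1))
    cell_h, cell_w = size_h // H, size_w // W

    # One empty/number/word patch per cell, randomly scaled and placed inside the cell
    for i in range(H):
        for j in range(W):
            t = int(rng.integers(0, 3))            # 0: empty, 1: number, 2: word
            if t == 0:
                continue
            d = patches[t][rng.integers(len(patches[t]))]
            scale = rng.uniform(0.5, 1.0)
            h = max(1, min(int(d.shape[0] * scale), cell_h))
            w = max(1, min(int(d.shape[1] * scale), cell_w))
            d = nn_resize(d, h, w)
            dy = i * cell_h + rng.integers(0, cell_h - h + 1)
            dx = j * cell_w + rng.integers(0, cell_w - w + 1)
            A[dy:dy + h, dx:dx + w] = np.minimum(A[dy:dy + h, dx:dx + w], d)
            labels[dy:dy + h, dx:dx + w] = t

    # Random stroke placement (rotation omitted in this sketch)
    for _ in range(int(rng.integers(0, W * H + 1))):
        s = strokes[rng.integers(len(strokes))]
        h, w = min(s.shape[0], size_h), min(s.shape[1], size_w)
        dy = rng.integers(0, size_h - h + 1)
        dx = rng.integers(0, size_w - w + 1)
        A[dy:dy + h, dx:dx + w] = np.minimum(A[dy:dy + h, dx:dx + w], s[:h, :w])

    # Additive Gaussian noise with a random SNR drawn in [10, 100]
    snr = rng.integers(10, 101)
    sigma2_noise = A.var() * 10 ** (-snr / 10)
    return A + rng.normal(0.0, np.sqrt(sigma2_noise), A.shape), labels
```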

Note that the number of generated cells c in the grid is chosen to generate sizes of numbers/words in the same range as in the RECITAL images (this is an approximation, as numbers and words of different sizes are present in each original page).


Fig. 1 shows three examples of artificially generated documents. Source code for the artificial document generator is available at https://github.com/GeoTrouvetout/CIRESFI.

A total of 100,000 images were generated on-the-fly for $\mathcal{X}^l_{train}$ to train our model, 100,000 images for $\mathcal{X}^l_{valid}$ to validate the model during training and 10,000 images for $\mathcal{X}^l_{test}$ to test the model. It means that each artificial image is used only once.

4 NEURAL NETWORK MODELING AND TRAINING

The model is based on a fully-convolutional neural network (FCNN) that produces a pixel-wise classification of every pixel of the input image into three classes: background, number and word. A graphical representation of the architecture of the trained FCNN is presented in Fig. 2 and detailed in TABLE 1. Since it is fully convolutional, this model is adaptable to any input image size, so that it can be applied to the mails from the RIMES database and to the unlabeled documents from the RECITAL database $\mathcal{X}^u$. A 5-level pyramid representation of the input image was used as input to enlarge the receptive field of each feature map and ensure that the network can handle recognition of various sizes of structures. Roughly, the network is composed of two parts: feature extraction and structure map construction. The feature extraction part is composed of 5 layers of convolution (kernel size of 5x5 with zero-padding) linked with 2x2 max-pooling, leading to a middle layer shape of 48x48x256. The digit mask construction part is composed of 6 layers of transposed convolution (Dumoulin and Visin, 2016) (kernel size of 5x5 and padding with half of the filter size on both sides) and two 2x2 upscale layers (upscaling by repetition), reconstructing a 384x384x3 tensor finally upscaled to 1536x1536x3. The rectified linear unit (ReLU) function was chosen as the output nonlinearity of each layer, with the exception of the output layer, which uses a softmax function. Originally proposed by Nair and Hinton in (Nair and Hinton, 2010), the ReLU function is defined as ReLU(x) = max(0, x). Besides not requiring input normalization, the ReLU function has been shown to speed up the training of FCNN models (Nair and Hinton, 2010)(Krizhevsky et al., 2012). The softmax function performs an exponential normalization of each one-hot vector of the produced map, thus $\forall i,\; y_i^0 + y_i^1 + y_i^2 = 1$, and is defined by

$$\mathrm{Softmax}(x)_j = \frac{e^{x_j}}{\sum_k e^{x_k}}, \quad \text{with } k = 0, 1, 2.$$

The model was built and trained using the Python library Theano (Bergstra et al., 2010) and the deep learning wrapper Lasagne (Dieleman et al., 2015).

We trained the model by minimizing the objective function C (equation 2) using the Adam stochastic gradient-based algorithm described in (Kingma and Ba, 2014).
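The original model was written with Theano and Lasagne; purely for illustration, the architecture of TABLE 1 can be re-expressed in PyTorch as follows. This is a sketch of ours that follows the layer shapes of the table (reading the 48x48 to 96x96 row as an upsampling step), not the authors' released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructureSpottingFCNN(nn.Module):
    """PyTorch sketch of the FCNN of TABLE 1 (illustrative, not the original Theano/Lasagne model)."""

    def __init__(self, in_channels=5, n_classes=3):
        super().__init__()
        def conv(cin, cout):
            # 5x5 convolution with zero-padding, followed by ReLU
            return nn.Sequential(nn.Conv2d(cin, cout, 5, padding=2), nn.ReLU())
        # Feature extraction: five 5x5 convolutions, each followed by 2x2 max-pooling
        self.enc = nn.ModuleList([conv(in_channels, 32), conv(32, 32),
                                  conv(32, 64), conv(64, 128), conv(128, 256)])
        # Structure map construction: 5x5 transposed convolutions and nearest-neighbour upscales
        self.up1 = nn.ConvTranspose2d(256, 128, 5, padding=2)
        self.up2 = nn.ConvTranspose2d(128, 64, 5, padding=2)
        self.up3 = nn.ConvTranspose2d(64, 32, 5, padding=2)
        self.dec4 = nn.ConvTranspose2d(32, 16, 5, padding=2)
        self.dec5 = nn.ConvTranspose2d(16, 8, 5, padding=2)
        self.out = nn.Conv2d(8, n_classes, 5, padding=2)

    def forward(self, x):                                        # x: (batch, 5, 1536, 1536) pyramid Py(X)
        for layer in self.enc:
            x = F.max_pool2d(layer(x), 2)                        # 1536 -> 768 -> 384 -> 192 -> 96 -> 48
        x = F.interpolate(F.relu(self.up1(x)), scale_factor=2)   # 48 -> 96, 128 maps
        x = F.interpolate(F.relu(self.up2(x)), scale_factor=2)   # 96 -> 192, 64 maps
        x = F.interpolate(F.relu(self.up3(x)), scale_factor=2)   # 192 -> 384, 32 maps
        x = F.relu(self.dec4(x))                                 # 384x384x16
        x = F.relu(self.dec5(x))                                 # 384x384x8
        x = F.interpolate(self.out(x), scale_factor=4)           # 384 -> 1536, 3 maps
        return torch.softmax(x, dim=1)                           # per-pixel class probabilities
```

In this sketch, training would amount to minimizing the weighted cross entropy of Eq. (2) with an Adam optimizer, mirroring the procedure described above.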

Symmetrically to the sets of input images $\mathcal{X}^l_{train}$, $\mathcal{X}^l_{valid}$ and $\mathcal{X}^l_{test}$, the sets of output maps of our network are denoted $\mathcal{Y}^l_{train}$, $\mathcal{Y}^l_{valid}$ and $\mathcal{Y}^l_{test}$. Note that the direct output of the network is $Y = \{y_1, y_2, ..., y_N\}$ with $y_i = (y_i^0\; y_i^1\; y_i^2)^T$ and $y_i^k \in [0,1]$. To evaluate the classification performance of our model with binary outputs, we computed the resulting classification map $\hat{S} = \{\hat{s}_1, \hat{s}_2, ..., \hat{s}_N\}$ with

$$\hat{s}_i = (\hat{s}_i^0\; \hat{s}_i^1\; \hat{s}_i^2)^T, \quad \hat{s}_i^k = \begin{cases} 1 & \text{if } \max(y_i) = y_i^k \\ 0 & \text{else} \end{cases}.$$
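In NumPy terms (with the same flattened (N, 3) layout as above), this binarization is a per-pixel argmax; the function name is ours:

```python
import numpy as np

def binarize_map(Y):
    # Y: softmax output map of shape (N, 3); returns the one-hot map S_hat of the same shape
    S_hat = np.zeros_like(Y)
    S_hat[np.arange(len(Y)), Y.argmax(axis=1)] = 1
    return S_hat
```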

Table 1: Architecture of our model based on a convolutional neural network.

Layer type                   | Filter size | Output layer shape | Activation function
Input image                  | ///         | 1536x1536          | ///
Py(X)                        | ///         | 1536x1536x5        | ///
Convolution + maxPool (2x2)  | 5x5x32      | 768x768x32         | ReLU
Convolution + maxPool (2x2)  | 5x5x32      | 384x384x32         | ReLU
Convolution + maxPool (2x2)  | 5x5x64      | 192x192x64         | ReLU
Convolution + maxPool (2x2)  | 5x5x128     | 96x96x128          | ReLU
Convolution + maxPool (2x2)  | 5x5x256     | 48x48x256          | ReLU
Convolution + maxPool (2x2)  | 5x5x128     | 96x96x128          | ReLU
Trans. Conv. + upscale (2x2) | 5x5x64      | 192x192x64         | ReLU
Trans. Conv. + upscale (2x2) | 5x5x32      | 384x384x32         | ReLU
Trans. Conv.                 | 5x5x16      | 384x384x16         | ReLU
Trans. Conv.                 | 5x5x8       | 384x384x8          | ReLU
Conv. + upscale (4x4)        | 5x5x3       | 1536x1536x3        | Softmax
Output map                   | ///         | 1536x1536x3        | ///

Note: ReLU corresponds to the Rectified Linear Unit function defined by ReLU(x) = max(0, x). Softmax corresponds to the normalized exponential function defined as $\mathrm{Softmax}(x)_j = e^{x_j} / \sum_k e^{x_k}$ with k = 0, 1, 2. Transposed convolution performs the backward pass of a normal convolution, as described in (Dumoulin and Visin, 2016).


Figure 1: Three examples of generated images with randomly selected background, word and digit patches from the IROnOff database. (a), (b) and (c) present three examples of 1536x1536 artificial documents generated using Algorithm 1. (d), (e) and (f) present the target classification maps (bounding boxes of the numbers and words) superposed on the corresponding artificial documents. (g), (h) and (i) are the corresponding outputs of the detection system.

5 EVALUATION

We evaluated the performance of the pixel-wise structure-spotting localization carried out by our model in three phases: firstly, by considering the set of artificially generated images $\mathcal{X}^l_{test}$ and the corresponding set of target classification maps $\mathcal{S}^l_{test}$; secondly, by evaluating the detection of numbers among the $N_R$ patches of the RIMES database, $R = \{R_1, R_2, ..., R_{N_R}\}$, and their associated states $B = \{b_1, b_2, ..., b_{N_R}\}$, $b_i \in \{0,1\}$, indicating the presence or not of a number.

Given an output map $\hat{S}$ (estimated segmentation) and a ground-truth structure map $S$, we computed the precision, recall and accuracy of the structure classification at the pixel level. Because the images contain significantly more background than numbers or words, we also computed the Matthews correlation coefficient (Matthews, 1975), extended for multiclass classification by J. Gorodkin in (Gorodkin, 2004),


Figure 2: Graphical representation of the architecture of the fully-convolutional neural network. Py(X) corresponds to the pyramid representation of the input image X with 5 levels of resolution. A filter size of 5x5 was used for both convolution and transposed convolution layers. Each convolution layer is associated with a max-pooling layer of 2x2. The two first transposed convolution layers are associated with a nearest-neighbor upscale layer of 2x2.

which has the advantage of taking into account the balance between the numbers of pixels belonging to the classes. Note that the Matthews correlation coefficient ranges from -1 to +1, with a value of 1 representing a perfect classification, 0 corresponding to a random classification and -1 indicating total disagreement between the prediction and the ground truth.

We describe below how we computed these classification performance metrics, namely precision (PRE), recall (REC), accuracy (ACC) and Matthews correlation coefficient (MCC):

$$\mathrm{PRE}_k = \frac{C_{kk}}{\sum_{l} C_{lk}}, \quad \mathrm{REC}_k = \frac{C_{kk}}{\sum_{l} C_{kl}}, \quad \mathrm{ACC} = \frac{\sum_{k} C_{kk}}{\sum_{k,l} C_{kl}},$$

$$\mathrm{mPRE} = \frac{1}{K} \sum_{k} \mathrm{PRE}_k, \quad \mathrm{mREC} = \frac{1}{K} \sum_{k} \mathrm{REC}_k,$$

$$\mathrm{MCC} = \frac{\sum_{k,l,m} C_{kk} C_{lm} - C_{kl} C_{mk}}{\sqrt{\sum_{k} \left( \sum_{l} C_{kl} \right) \Big( \sum_{\substack{k',l' \\ k' \neq k}} C_{k'l'} \Big)} \cdot \sqrt{\sum_{k} \left( \sum_{l} C_{lk} \right) \Big( \sum_{\substack{k',l' \\ k' \neq k}} C_{l'k'} \Big)}}, \quad (3)$$

where $K = 3$, $k, l, m \in \{0, 1, 2\}$ and $C$ is the 3x3 multiclass confusion matrix for one image. Note that mPRE and mREC correspond to the averaged precision and recall computed for each class versus the others.
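These measures can be computed directly from the 3x3 confusion matrix. The sketch below (our own helper, with rows indexing the actual class and columns the predicted class, as in TABLE 2) uses the usual compact form of Gorodkin's multiclass coefficient, which is algebraically equivalent to the MCC of Eq. (3):

```python
import numpy as np

def pixel_metrics(C):
    # C: 3x3 confusion matrix, rows = actual class, columns = predicted class
    C = C.astype(float)
    diag, s = np.diag(C), C.sum()
    rec = diag / C.sum(axis=1)          # REC_k = C_kk / sum_l C_kl
    pre = diag / C.sum(axis=0)          # PRE_k = C_kk / sum_l C_lk
    acc = diag.sum() / s                # ACC

    # Multiclass MCC (Gorodkin, 2004), compact form
    t, p = C.sum(axis=1), C.sum(axis=0)              # per-class actual / predicted totals
    num = diag.sum() * s - t @ p
    den = np.sqrt(s ** 2 - t @ t) * np.sqrt(s ** 2 - p @ p)
    mcc = num / den if den > 0 else 0.0
    return {"mPRE": pre.mean(), "mREC": rec.mean(), "ACC": acc, "MCC": mcc,
            "PRE": pre, "REC": rec}
```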

To focus the performance measures on handwriting, and to evaluate the structure retrieval more precisely, we also computed these measures by weighting each classification with the ink amount of the pixels. Considering a pixel $x_i$ of an image $X$, its ink amount $\tau_i$ is computed as the normalized pixel intensity $\tau_i = \frac{x_i - \min(X)}{\max(X) - \min(X)}$. With this definition, $\tau \in [0,1]$, $\tau = 0$ corresponds to a white pixel, and $\tau = 1$ corresponds to a black pixel.
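Under this definition, the signal-weighted measures only change how the confusion matrix is accumulated: each pixel contributes its ink amount τ instead of a count of one. A minimal sketch, assuming the image X is flattened to shape (N,) with ink as high values, as in the definition above:

```python
import numpy as np

def ink_weighted_confusion(X, S, S_hat, n_classes=3):
    # X: flattened image of shape (N,); S, S_hat: one-hot maps of shape (N, 3)
    tau = (X - X.min()) / (X.max() - X.min())     # ink amount, 0 = white pixel, 1 = black pixel
    C = np.zeros((n_classes, n_classes))
    np.add.at(C, (S.argmax(axis=1), S_hat.argmax(axis=1)), tau)
    return C
```

The measures of Eq. (3) are then obtained by passing this weighted matrix to the same helper as above.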

Besides, we also computed precision, recall and accuracy to evaluate the digit detection capacity of our classifier at the word level, using the RIMES word patches $R$. The presence or not of a digit in a patch (set $B$) was stated when the area formed by the pixels classified in the class "number" was equal to or greater than 25 pixels.
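At the word level, the decision rule therefore reduces to an area threshold on the predicted "number" class; a short sketch with our own function name:

```python
def contains_number(S_hat_patch, min_area=25):
    # S_hat_patch: one-hot classification map of a patch, shape (H, W, 3)
    # The patch is said to contain a number if at least min_area pixels fall in class 1 ("number")
    return (S_hat_patch.argmax(axis=-1) == 1).sum() >= min_area
```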

6 RESULTS

6.1 FCNN training

Fig. 3 shows the evolution of the losses $C(\mathcal{Y}^l_{train}, \mathcal{S}^l_{train})$ and $C(\mathcal{Y}^l_{valid}, \mathcal{S}^l_{valid})$ during the training of our model. The concomitant diminution of the losses on both Dtrain and Dvalid shows that the variability of the random artificial documents prevents our model from overfitting. Because of the trending cost reduction, we chose to keep the last computed weights. Fig. 1(g), Fig. 1(h) and Fig. 1(i) present three output examples of the resulting system.

Figure 3: Evolution of the losses $C(Y_{train}, S_{train})$ and $C(Y_{valid}, S_{valid})$ obtained during the training of the FCNN model.


6.2 Structure-spotting evaluation on artificial images

TABLE 2 presents the confusion matrix obtained on the set of artificial images $\mathcal{S}^l_{test}$. Note that the numbers of occurrences in each cell of the confusion matrix are presented as averaged percentages of pixels among the images of $\mathcal{S}^l_{test}$. This shows the imbalance between the numbers of occurrences for each class. Globally, we can see that classes 1 (number) and 2 (word) are over-estimated (e.g. 2.72% of pixels are classified as numbers but only 1.5% of pixels are really numbers). This effect is due to the balanced cost function (eq. 2), which prevents the background class from over-estimation. TABLE 3 summarizes the averaged classification results computed on $\mathcal{S}^l_{test}$. The averaged recall, precision and accuracy reach 97.4%, 69.9% and 96.6% respectively. The averaged Matthews correlation coefficient was 0.738, showing a robust global performance of the system. The standard deviations are computed over the different images of the test set $\mathcal{Y}^l_{test}$. Considering each class versus the others, we can observe poor precision despite good recall for word and number structures. This can be explained by the fact that among pixels correctly detected as word/number structures, numerous neighboring background pixels are included in these classes as well. The measurements using the signal-weighted classification increase the precision for the 3 classes. It means that the low precision is mostly due to white pixels being detected. This can be observed in Fig. 1(g), Fig. 1(h) and Fig. 1(i).

Table 2: Averaged confusion matrix obtained on artificial images of the test set. The numbers of occurrences are presented as averaged percentages of pixels over the images of $\mathcal{S}^l_{test}$.

                      Predicted class
                   0         1        2        Total
Actual   0      92.82%     1.2%     1.76%     95.78%
class    1       0.01%     1.47%    0.02%      1.50%
         2       0.04%     0.05%    2.64%      2.73%
Total           92.87%     2.72%    4.42%

6.3 Transfer learning evaluation on real handwritten documents

6.3.1 Word level evaluation

We measured the number detection performance on the set of patches of the RIMES database $R$. With the

Table 3: Averaged classification performances on artificial images (test set $\mathcal{S}^l_{test}$).

Measure | Pixel classification | Signal-weighted pixel classification
mREC    | 0.974 ± 0.034        | 0.975 ± 0.034
mPRE    | 0.699 ± 0.055        | 0.785 ± 0.061
ACC     | 0.969 ± 0.020        | 0.969 ± 0.019
MCC     | 0.738 ± 0.057        | 0.816 ± 0.058
REC0    | 0.969 ± 0.021        | 0.966 ± 0.022
REC1    | 0.985 ± 0.058        | 0.981 ± 0.056
REC2    | 0.969 ± 0.083        | 0.957 ± 0.082
PRE0    | 0.999 ± 0.002        | 0.999 ± 0.002
PRE1    | 0.518 ± 0.114        | 0.661 ± 0.127
PRE2    | 0.579 ± 0.107        | 0.693 ± 0.103

66,797 patches of $R$, the Matthews correlation coefficient was 0.634 and the recall, precision and accuracy were 80%, 75.6% and 82.5% respectively. It means that 80% of the patches including digits are retrieved, but only 75.6% of the detected images really contain digits. Noting that no pre-processing (binarization, denoising, reshaping, etc.) was applied to the patches, these results show that the system can handle all types of data and distinguish number structures in handwritten documents. Fig. 4 shows some examples of well-classified and misclassified images.

Figure 4: Examples of classified word images from RIMES. Images (a), (b) and (c) are well-classified and (d), (e) and (f) are misclassified. The ground truth is specified below each image: (a) with number, (b) with number, (c) without number, (d) with number, (e) with number, (f) without number.

6.3.2 Page level qualitative evaluation

We qualitatively analyzed the transferability of our model by considering RIMES paragraphs and complete RECITAL pages. Note that none of these documents were seen during training.


Fig. 5 shows six examples of structures spotting on real handwritten documents: three mails of the RIMES database and three RECITAL documents.

Concerning the RIMES mails, we can observe that all of the word structures were well retrieved. However, in Fig. 5(b), we can note that some number structures (especially the "20" and "2007" within the first sentence) were partially missed by the network. This could be due to the cursive writing style of these structures, rarely present within the patches containing numbers used during training.

The RECITAL documents are more challenging because of the natural noise included in the background, the different sizes of characters within the same page, and mainly the writing style, which is completely different from the modern IROnOff dataset. Although most of the numbers were retrieved, we can observe that calligraphic letters, present at the top of the pages, are often detected as digits. This can be explained by the fact that the IROnOff database does not contain calligraphic writing. Thus, our model did not learn to distinguish such large structures from digits. Also, we can observe that, at the pixel level, some number structures starting with "10" are missed by the network and classified as words. We think that the model cannot differentiate "10" structures from "lo", "la" and "le" structures, often present in IROnOff word patches.

7 CONCLUSION

In this paper, we proposed a method for word/number structure spotting in handwritten documents based on a fully-convolutional neural network trained on artificial documents.

Since the targeted database of documents was not annotated, we tackled this as a transfer learning problem. We proposed to train the model on randomly generated labeled images, built with number/word patches of the IROnOff database and page backgrounds from the RECITAL database.

The performance of our model was assessed as a pixel-wise classification of the structures within a set of artificial documents. On artificial data, the system was able to correctly classify 96.9% of the pixels. This shows that our model performs robust pixel-wise classification on artificial data.

We also evaluated the model as a digit detector by using the word/number patches of the RIMES database. The model detected 80% of the patches containing digits. Thus, the learned model is shown to be transferable to real handwritten documents, without any preprocessing.

On unlabeled historical RECITAL documents, our model detected most of the word/number structures. However, we showed that the model hardly distinguishes word from number structures in calligraphic writing. Also, the model seems to miss a few number structures, mainly due to the confusion between "1" and "l" structures.

Despite these limitations, considering the high variability of the RECITAL documents in terms of shape, structure and handwriting style, we observed a good transferability of our model.

To improve the classification performance of our model on unlabeled data, future work will focus on adding more variability of handwriting styles and structures in the generation of artificial documents. Then, this classification map will be embedded in a larger document analysis system.

Source code for the artificial document generator and for the structure detection system is available at https://github.com/GeoTrouvetout/CIRESFI.

REFERENCES

Augustin, E., Brodin, J.-M., Carre, M., Geoffrois, E., Grosicki, E., and Preteux, F. (2006). RIMES evaluation campaign for handwritten mail processing. In Proc. of the Workshop on Frontiers in Handwriting Recognition, number 1.

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: A CPU and GPU math compiler in Python. In Proc. 9th Python in Science Conf., pages 1–7.

Butt, U. M., Ahmad, S., Shafait, F., Nansen, C., Mian, A. S., and Malik, M. I. (2016). Automatic signature segmentation using hyper-spectral imaging. In Frontiers in Handwriting Recognition (ICFHR), 2016 15th International Conference on, pages 19–24. IEEE.

Cethefi, T. (2016). Project ANR-14-CE31-0017 "Contrainte et integration : pour une reevaluation des spectacles forains et italiens sous l'Ancien Regime".

Delalandre, M., Valveny, E., Pridmore, T., and Karatzas, D. (2010). Generation of synthetic documents for performance evaluation of symbol recognition & spotting systems. International Journal on Document Analysis and Recognition, 13(3):187–207.

Dieleman, S., Schlüter, J., Raffel, C., Olson, E., Sønderby, S. K., Nouri, D., Maturana, D., Thoma, M., Battenberg, E., Kelly, J., Fauw, J. D., Heilman, M., de Almeida, D. M., McFee, B., Weideman, H., Takács, G., de Rivaz, P., Crall, J., Sanders, G., Rasul, K., Liu, C., French, G., and Degrave, J. (2015). Lasagne: First release.

Dumoulin, V. and Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285.


Figure 5: Examples of the structures spotting performed by our model on real handwritten documents. (a), (b) and (c): unlabeled handwritten mails of the RIMES database superposed with their corresponding classification maps. (d), (e) and (f): historical handwritten documents of the RECITAL database superposed with their corresponding classification maps. Blue, green and red pixels correspond to the classes background, number and word, respectively.

Gorodkin, J. (2004). Comparing two k-category assignments by a k-category correlation coefficient. Computational Biology and Chemistry, 28(5):367–374.

Grosicki, E. and El-Abed, H. (2011). ICDAR 2011: French handwriting recognition competition. In Proc. of ICDAR, pages 1459–1463.

Kieu, V. C., Journet, N., Visani, M., Mullot, R., and Domenger, J.-P. (2013). Semi-synthetic document image generation using texture mapping on scanned 3D document shapes. In The Twelfth International Conference on Document Analysis and Recognition, United States.

Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.

Luca, E. D. (2011). Le Repertoire de la Comedie-Italienne (1716-1762).

Matthews, B. W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure, 405(2):442–451.

Moysset, B., Louradour, J., Kermorvant, C., and Wolf, C. (2016). Learning text-line localization with shared and local regression neural networks. In Frontiers in Handwriting Recognition (ICFHR), 2016 15th International Conference on, pages 1–6. IEEE.

Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814.

Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61:85–117.

Viard-Gaudin, C., Lallican, P. M., Knerr, S., and Binter, P. (1999). The IRESTE on/off (IRONOFF) dual handwriting database. In Document Analysis and Recognition, 1999. ICDAR'99. Proceedings of the Fifth International Conference on, pages 455–458. IEEE.