
Document Binarization with Automatic Parameter Tuning

Nicholas R. Howe

Received: date / Accepted: date

Abstract

Document analysis systems often begin with binarization as a first processing stage. Although numerous techniques for binarization have been proposed, the results produced can vary in quality and often prove sensitive to the settings of one or more control parameters. This paper examines a promising approach to binarization based upon simple principles, and shows that its success depends most significantly upon the values of two key parameters. It further describes an automatic technique for setting these parameters in a manner that tunes them to the individual image, yielding a final binarization algorithm that can cut total error by one-third with respect to the baseline version. The results of this method advance the state of the art on recent benchmarks.

1 Introduction

Physical documents in the real world exhibit a multitude of colors and shadings, and the appearance of even a single document may vary widely depending upon factors such as lighting, viewing angle, etc. Digital reproductions of physical documents usually make use of a 24-bit color representation, or perhaps 8-bit grayscale. Such formats omit some of the information present in the original, but retain more than enough for most applications. Certain document processing schemes go much further: they retain only a single bit per document pixel. Binary document representations are very useful despite their low fidelity because the information lost in binarization is more or less unrelated to the symbolic content of the document. Most documents are produced using monochromatic ink on paper, and their meaning is embodied solely in the distribution of the ink, a pattern that the binary document explicitly represents.

Of course, inferring the correct binarization of a document from its color or grayscale representation can prove challenging. Physical degradation of the document, adverse lighting or imaging conditions, and limitations on resolution can all conspire to obscure the original pattern.¹ To meet this challenge, researchers have proposed many algorithms to binarize document images. Indeed, intense activity in this field has made the Document Image Binarization Contest (DIBCO) series one of the most popular competitions within the document analysis research community, attracting seventeen entries in its most recent iteration [19]. Nevertheless, the results from these contests reveal that there remains room for improvement in the quality of automatic binarization.

This paper makes two related contributions to the study of automatic document binarization. First, it describes a method that can achieve excellent results on a wide variety of challenging document images. Second, as a means to these results it investigates a method to automatically determine the best parameter values suited to each particular image. Automatic parameter estimation has received little research attention to date but offers two advantages when done well: it makes the method easier to employ, since the user doesn't need to worry about finding the best parameter settings, and it also improves the average result, because each image can use its own ideal value instead of a single compromise setting that works adequately but suboptimally for all images.

¹ Some ambiguity remains over how to define the ground truth binarization. For instance, should an unintended ink spill made by a scribe be included? For the record, this paper stipulates that the ground truth should identify the set of all pixels at least 50% covered with ink during the production of the document. In practice, the ground truth used on most benchmark datasets probably does not correspond perfectly with this ideal, but represents the imperfect judgment of a human expert [4].

1.1 Prior Work

Before considering new approaches, it is worth surveying the history and development of binarization algorithms. Because most documents are printed in dark ink on light paper, many classic methods rely on thresholding the image intensity. Under ideal circumstances, as noted by Otsu [17], a simple global threshold suffices to distinguish markings from the background. In more difficult cases the overall intensity may vary over the document, so that a single global threshold may simultaneously prove too high in some areas and too low in others. Thus both Niblack [16] and Sauvola and Pietikäinen [21] have proposed locally adaptive techniques that adjust the threshold used according to the mean and variance in some local region around each pixel. Although generally more accurate than Otsu for document images with varying intensities, locally adaptive thresholding can sometimes fail catastrophically in areas of low variance, motivating hierarchical approaches such as the Decompose algorithm [7]. Adaptive thresholding also presents the user with additional parameters to set, relating to the size of the local region and to the relative importance of the local mean vs. its deviation as the local threshold is determined.

The winners of recent DIBCO events go beyond locally adaptive thresholding. They achieve their results partially by using thresholds, but most also perform modeling of the ink and background classes to enable more accurate individual pixel classification. Lu et al. [14] model the document background via an iterative polynomial smoothing, and then choose local thresholds based on detected text stroke edges. This method won DIBCO 2009, and a revised version tied as a winner of H-DIBCO 2010 (handwritten documents only). The second winner in 2010 was a method described by Bar-Yosef et al. [3] that grows foreground regions iteratively based upon local modeling of the foreground and background within a 7×7 pixel window. Lelore and Bouchara [13] won DIBCO 2011 with a technique that first uses coarse thresholding to partition pixels into three groups: ink, background, and unknown. Models describe the ink and background clusters, and guide decisions on the unknown pixels. The method also employs a higher resolution grid generated from the original image via linear interpolation.

Several other ideas proposed in recent years provide motivation for the base algorithm employed in this paper. A number of projects have explored the use of Markov random fields (MRF) for binarization [12, 18, 15, 11]. Inspired by the human visual system, some work has employed center-surround filters to identify areas that are darker than their local environment, and thus likely to be ink [23]. These play a role similar to that of the Laplacian filter used herein. Finally, various approaches have used edge detection to improve binarization results [10, 20]. Although these ideas have been explored separately, the specific combination used in this work appears particularly effective.

Automatic estimation of appropriate parameter settings for document binarization has received limited attention to date. Gatos et al. describe a parameter-free method that relies on detailed modeling of the document background [9]. Dawoud describes a method based upon the cross section sequence graph, which combines results at multiple threshold levels into a single binarization [8]. Badekas and Papamarkos [2] describe a method inspired by work on edge detection. Their technique first produces binarizations over a range of parameter settings, and estimates a ground truth via chi-square analysis on the cumulative votes of the multiple binarizations at each pixel. It then iteratively narrows the parameter range until unable to do so further, giving the final estimated best setting. Although the method described herein also begins by computing binarizations over a range of parameter values, it differs by not attempting to estimate a ground truth, and requires no further iteration to choose the ideal parameter value.


2 Approach

The base approach to binarization used in this paper was introduced recently [11] and rests on three mutually supporting strategies. First, it defines the target binarization as a labeling on pixels that minimizes a global energy function inspired by a Markov random field model. Second, in formulating the data-fidelity term of this energy it relies on the Laplacian of the image intensity to distinguish ink from background. This grants a crucial invariance to differences in contrast and overall intensity. Third, it incorporates edge discontinuities into the smoothness term of the global energy function, biasing ink boundaries to align with edges and allowing a stronger smoothness incentive over the rest of the image. The paragraphs below explain each of these points in greater detail.

The global energy function operates on binarizations B, which label each pixel indexed by (i, j) as either ink or background, B_ij ∈ {0, 1}. The energy takes on a typical additive form, with terms representing the fidelity of a particular labeling when compared to the intensity data, and other terms representing the smoothness or regularity of the solution. In particular, the energy includes a cost L^0_ij or L^1_ij capturing how well the label B_ij chosen for each pixel matches its appearance, and irregularity costs C^h_ij and C^v_ij for each pixel whose label differs respectively from those of its horizontal or vertical neighbor.

\[ E_I(B) = \sum_{i=0}^{m}\sum_{j=0}^{n}\left[L^0_{ij}(1-B_{ij}) + L^1_{ij}B_{ij}\right] + \sum_{i=0}^{m-1}\sum_{j=0}^{n} C^h_{ij}\,(B_{ij} \neq B_{i+1,j}) + \sum_{i=0}^{m}\sum_{j=0}^{n-1} C^v_{ij}\,(B_{ij} \neq B_{i,j+1}) \qquad (1) \]

Assume that the Boolean expressions above evaluate to either 0 or 1 in the usual manner according to their truth value. With this energy the optimal binarization will tend to conform to the intensity contours while smoothing over small irregularities resulting from noise sources. The degree of smoothing relative to data fidelity will depend on the relative magnitudes of L^b_ij to C^h_ij and C^v_ij.
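As a concrete illustration (not the author's reference implementation), the sketch below evaluates the energy of Equation 1 for a candidate labeling, given precomputed label costs and mismatch penalties; all function and variable names are hypothetical.

```python
import numpy as np

def labeling_energy(B, L0, L1, Ch, Cv):
    """Evaluate Equation 1 for a binary labeling B (1 = ink, 0 = background).

    L0, L1 : per-pixel label costs (same shape as B)
    Ch, Cv : horizontal / vertical neighbor mismatch penalties
    """
    # Data-fidelity term: cost of the label actually chosen at each pixel.
    data = np.sum(L0 * (1 - B) + L1 * B)
    # Smoothness terms: pay Ch/Cv wherever a pixel's label differs from
    # the label of its neighbor along each axis.
    horiz = np.sum(Ch[:-1, :] * (B[:-1, :] != B[1:, :]))
    vert = np.sum(Cv[:, :-1] * (B[:, :-1] != B[:, 1:]))
    return data + horiz + vert
```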

The label costs L^0_ij and L^1_ij should be invariant to the local image illumination, and thus are taken from the Laplacian of the image intensity:

\[ L^0_{ij} = \nabla^2 I_{ij} \qquad (2) \]

\[ L^1_{ij} = -\nabla^2 I_{ij} \qquad (3) \]

Intuitively this will tend to separate ink from background because the Laplacian measures the divergence of the gradient. It will thus be positive at intensity valleys (ink) and negative at intensity peaks or plateaus (background). The data terms of the energy function become a summation of the label-signed Laplacian over all pixels in the image. For a particular component of ink or background, Green's theorem tells us that the summed Laplacian over all its pixels is mathematically equivalent to the gradient flux across its boundary. In other words, the energy contribution of each component is determined solely by what happens at its boundary. This makes intuitive sense but can cause trouble for components that intersect the image edges: such regions can occasionally be mislabeled because their entire natural boundary is not visible, and thus their true energy contribution can only be estimated. In practice, this causes trouble occasionally for noisy background areas that are largely isolated from the rest of the document background by ink markings. Several potential solutions exist. For example, one could simply fix L^1_ij to a large negative value for all pixels (i, j) on the image border, under the assumption that all ink is framed by background regions. Rather than committing to such a strong assumption, this work adopts a more conservative strategy, looking for bright outlier pixels and applying a fixed constant L^1_ij to them. This still ensures that large background regions will receive the proper label, while not preventing identification of ink pixels on the image boundary. To be precise, modify L^1_ij for any pixels more than two standard deviations σ^r_ij brighter than the local mean μ^r_ij, as computed over nearby pixels weighted by a Gaussian of radius r. This may be viewed as an extremely conservative application of locally adaptive thresholding, where only pixels most certain to be background are labeled as such. In the equation below, φ will take on a large negative value.

\[ L^1_{ij} = \begin{cases} -\nabla^2 I_{ij} & I_{ij} \le \mu^r_{ij} + 2\sigma^r_{ij} \\ \phi & I_{ij} > \mu^r_{ij} + 2\sigma^r_{ij} \end{cases} \qquad (4) \]
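The label costs can be assembled along the lines of Equations 2–4; the sketch below is one plausible rendering using SciPy's Laplacian and Gaussian filters as stand-ins for the paper's operators, with r and φ set to the values quoted later in Section 2.1.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, laplace

def label_costs(I, r=20, phi=-500.0):
    """Label costs following Equations 2-4 (illustrative sketch).

    I   : grayscale image as a float array in the range 0-255
    r   : radius of the Gaussian used for local mean / deviation
    phi : large negative cost pinning bright outliers to background
    """
    lap = laplace(I)          # discrete Laplacian of the intensity
    L0 = lap                  # cost of labeling a pixel background
    L1 = -lap                 # cost of labeling a pixel ink
    # Gaussian-weighted local mean and standard deviation (Equation 4).
    mu = gaussian_filter(I, r)
    sigma = np.sqrt(np.maximum(gaussian_filter(I ** 2, r) - mu ** 2, 0))
    # Pixels more than two deviations brighter than their surroundings are
    # almost certainly background; force their ink cost to phi.
    L1 = np.where(I > mu + 2 * sigma, phi, L1)
    return L0, L1
```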

The neighbor mismatch penalties C^h_ij and C^v_ij offer the opportunity to employ the third strategy mentioned above, incorporating the Canny edge map. (Recall that Canny [6] first smooths the image with a Gaussian filter of small radius σ_E, then finds edges at local directed maxima of the image gradient, choosing to retain only those edges selected using a hysteresis procedure with two thresholds t_hi and t_lo.) The algorithm sets the mismatch penalties to a uniform value c everywhere except between pixels where a Canny edge has identified a likely discontinuity. To be more specific, Canny identifies pixels as edges, while Equation 1 requires locating discontinuities in the connections between pairs of pixels. To address this discrepancy, the formulation below zeros out the discontinuity penalty between Canny edge pixels and their brighter neighbors, effectively choosing to include the Canny edge pixels within the inked area. The opposite choice would also be self-consistent, but would inhibit detection of single-pixel-width strokes.

\[ C^h_{ij} = \begin{cases} 0 & \text{if } E_{ij} \wedge (I_{ij} < I_{i+1,j}) \\ 0 & \text{if } E_{i+1,j} \wedge (I_{ij} \ge I_{i+1,j}) \\ c & \text{otherwise} \end{cases} \qquad (5) \]

\[ C^v_{ij} = \begin{cases} 0 & \text{if } E_{ij} \wedge (I_{ij} < I_{i,j+1}) \\ 0 & \text{if } E_{i,j+1} \wedge (I_{ij} \ge I_{i,j+1}) \\ c & \text{otherwise} \end{cases} \qquad (6) \]
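A minimal sketch of Equations 5–6, assuming scikit-image's Canny detector supplies the edge map E; threshold scaling conventions may differ from the paper's, and the wrap-around at image borders introduced by np.roll is ignored here for brevity.

```python
import numpy as np
from skimage.feature import canny

def mismatch_penalties(I, c, t_hi, t_lo, sigma_e):
    """Neighbor mismatch penalties following Equations 5-6 (sketch)."""
    # Canny edge map; note that skimage's threshold conventions may need
    # rescaling to match the normalized thresholds used in the paper.
    E = canny(I, sigma=sigma_e, low_threshold=t_lo, high_threshold=t_hi)
    Ch = np.full(I.shape, float(c))
    Cv = np.full(I.shape, float(c))
    # Neighbors along each axis (borders wrap; a real implementation
    # would treat the last row/column separately).
    nbr0, nbr1 = np.roll(I, -1, axis=0), np.roll(I, -1, axis=1)
    E0, E1 = np.roll(E, -1, axis=0), np.roll(E, -1, axis=1)
    # Zero the penalty between an edge pixel and its brighter neighbor,
    # so the ink boundary is free to pass just outside the stroke.
    Ch[(E & (I < nbr0)) | (E0 & (I >= nbr0))] = 0
    Cv[(E & (I < nbr1)) | (E1 & (I >= nbr1))] = 0
    return Ch, Cv
```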

The choice of constant discontinuity penalties everywhere except at edges deserves a note. One might imagine using a penalty that varies continuously according to the similarity in intensity between the neighbors. Empirically this approach seems less successful, perhaps because it actually gives little guidance about the best precise location of the ink-background transition: the intensity differences tend to be fairly large everywhere within a few pixels of the actual boundary, and thus it becomes too easy to choose the wrong location.

2.1 Parameters

The algorithm described above includes six free parameters, of which only two strongly influence the binarization outcome and require varying settings for different images. The four less important parameters can be set to a constant with negligible consequences for all images tested, either because they do not strongly influence the result or because the best value appears not to fluctuate much for different images. Thus finding automatic values for the two important parameters suffices to create a method that can be applied as a "black box" with little or no parameter testing required.

Two of the less important parameters appear in Equation 4: r and φ. Of these, r must be large enough to encompass at least a few background pixels, and thus should be set to some value larger than the expected stroke width. φ can be any sufficiently negative value. No attempt is made to optimize these parameters, and the experiments all use r = 20 and φ = −500, for images with grayscale intensity in the range from 0 to 255. The remaining less important parameters are two of the three that control the Canny edge detection algorithm, namely the lower of the two edge detection thresholds t_lo and the radius of smoothing applied prior to edge detection σ_E. Bad values for these parameters can certainly hurt binarization quality, but fortunately the experiments will show that there exists a uniform setting that works on all test images with near-optimal results.

The two important parameters that remain interact with each other to determine the final binarization result. Most crucial is c, whose magnitude determines the relative balance between data fidelity and regularization. The high Canny edge detection threshold t_hi matters most when c is large, because the detection or nondetection of an edge can determine whether an entire ink component appears or disappears in the minimal energy solution. The high Canny threshold plays a particularly important role in images that exhibit ink bleeding through from the opposite side of the paper: since the bled ink tends to have weaker edges, a well-chosen threshold can eliminate most of the false ink components.

Early experiments with this binarization scheme, and indeed for most approaches to binarization, have relied on choosing parameter values that work reasonably well across a wide range of images. This can be achieved by searching for the best values on a training set of images similar to the ones that will be used for testing, and yields good results [11]. However, even globally optimal parameter values embody some compromise on individual images, and tuning the settings to the image can offer significant gains. The next section explains how to do this.

2.2 Automatic Parameter Tuning

To understand how the crucial c parameter can be set automatically, it is worthwhile to examine the behavior of the base binarization algorithm under a range of different settings. Figure 1 shows the binarization results on a small image patch for a logarithmically increasing set of values. When c is very low, the binarization looks like a simple sign operator on the Laplacian, with many small noisy components. As c increases the noise components become more consolidated and many disappear. For a range of middle values of c, the binarization stays fairly stable, with only small changes as c increases. At the highest values, the result becomes unstable again as large ink components disappear, and occasionally join or have voids filled in. As c goes to infinity, the binarization will tend toward the majority label in each edge-isolated component.

The region of stable results at middle levels of c is significant, and turns out to show up consistently on ink-and-paper documents, for reasons that can be explained. To visualize the phenomenon, define the normalized binarization instability ξ_ν(c) in terms of Δ(B^c, B^{νc}), the number of pixels that change labels between two values of c differing by a factor of ν.

\[ \Delta(B, B') = \sum_{i=0}^{m}\sum_{j=0}^{n} (B_{ij} \neq B'_{ij}) \qquad (7) \]

\[ \xi_\nu(c) = \frac{\Delta(B^c, B^{\nu c})}{mn\,(\ln(\nu c) - \ln(c))} \qquad (8) \]

In the equation above, B^c_ij is the binarization label at pixel (i, j) using mismatch penalty c, and the Boolean operator ≠ converts to 0 or 1 in the customary fashion. As Figure 2 shows, the general shape of the instability curve is an intrinsic property of the image that does not depend on the exact value of ν chosen; thus in most cases ξ can drop its subscript. Histograms provide a useful analogy: although the details vary slightly, a histogram made using a sufficient number of samples from the same distribution always takes on the general appearance of that distribution for any reasonable choice of bin size. Indeed, the instability plots may be seen as histograms of pixel label transition events with respect to c, with logarithmically increasing bin sizes.
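The instability curve of Equations 7–8 reduces to a few lines given any base binarizer; the `binarize` callable below is a hypothetical stand-in for the graph-cut method of Section 2.

```python
import numpy as np

def instability_curve(binarize, I, c_values):
    """Normalized instability xi (Equation 8) between successive values of c.

    binarize : callable returning a boolean labeling for a given c
    c_values : increasing, logarithmically spaced candidate values of c
    """
    labelings = [binarize(I, c) for c in c_values]
    mn = I.size
    xi = []
    for B_lo, B_hi, c_lo, c_hi in zip(labelings, labelings[1:],
                                      c_values, c_values[1:]):
        delta = np.sum(B_lo != B_hi)                              # Equation 7
        xi.append(delta / (mn * (np.log(c_hi) - np.log(c_lo))))   # Equation 8
    return np.array(xi)
```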

Figure 3 shows as solid lines the instability plots for several document images. Although the stable zone differs somewhat in appearance and location in each case, its nearly universal presence is striking. Yet in hindsight, the observation of such a stable zone at middling c values should not come as a surprise. The twin peaks on either side arise from understandable mechanisms that apply for all document images, and thus the absence of a stable zone for any particular image would arise only from exceptional circumstances.

The c parameter controls the incentive for neighboring patches to share the same label despite a data fidelity term indicating otherwise. The left peak in the instability curve appears at low levels of c, as it becomes large enough to overcome small noise fluctuations and label most large homogeneous foreground and background regions correctly. On the other hand, the right peak appears when c becomes large enough to force a uniform label even across regions of different underlying ground truth. Unless the noise amplitude exceeds the contrast between foreground and background, a zone of stability will normally separate the two. Furthermore, this stability zone coincides with binarizations that adhere to the underlying image structure while ignoring noise. Figure 3 shows as dotted lines the binarization error (based on the F-measure, as defined in Equation 14 below), which tends to reach a minimum at or near the lowest point in the instability curve. Computing the instability curve does not require access to the ground truth, yet observing the low point of the stable region reveals a value of c that typically achieves low ground-truth binarization error.


Figure 1: Binarization results for different values of c (left to right, top to bottom: 5.00, 7.33, 10.7, 15.7, 23.1, 33.8, 49.6, 72.7, 106, 156, 229, 336) on one portion of a document image. A range of stability for intermediate c values corresponds to the most accurate result.

Figure 2: Binarization instability measured with varying step factor in c and normalized by the log of the step factor. The shape (and thus the location of the minimum) depends very little on the step factor chosen.

The link between the instability minimum and the binarization error minimum is at heart merely an empirical observation, but the philosophical considerations just given provide justification for the belief that it will apply to most document images. The algorithms developed below rest upon two related hypotheses, which the experiments in Section 3 largely bear out: First, the two instability peaks described above will reliably appear and can be consistently identified. Second, the point of maximum stability between the identified peaks will yield near-minimal binarization error. The truth of these hypotheses will depend upon the circumstances which produced a given document image. For example, some documents include multiple types of markings at different contrast levels, such as those where ink bleeds through from the reverse side. These may muddy the situation by generating additional instability peaks, but in most cases the extra peaks don't alter the result greatly because they tend to be more diffuse and thus less prominent. Figure 5 below describes a few more difficult cases, including unusual images where the optimal c varies widely across the page. Nevertheless the experiments indicate that such problems remain mostly limited to unusual situations.

Figure 3: Binarization instability and error vs. c for the DIBCO 2009 and H-DIBCO 2010 images. The solid curve shows the fraction of the image area whose label changes between successive values of c. A vertical line identifies the value of c chosen by Algorithm 1 using this curve. The broken curve shows the error (difference from 1.0) of the F-measure of the binarization. The horizontal axis shows c on a log scale. Images are given left to right in the order of Table 1.

The parameter-setting technique for c, summarized as Algorithm 1, computes curves like those in Figure 3 and chooses c at the minimum point between the two peaks on either side. Minor parameter choices introduced here are the logarithmic separation between c values, the range of values to scan, and the amount of smoothing applied to the instability curve. Choosing the logarithmic separation is somewhat equivalent to choosing a histogram bin size; as Figure 2 demonstrates, the exact step factor chosen does not matter so long as it is not so large as to obscure the important minima and maxima, nor so small that there will often be no change observed from one value to the next. The experiments herein use 4 points per octave, or approximately ν = 1.19. The other two choices may also be made in ways that affect the outcome minimally if at all. The range of c values scanned should be large enough to include any likely optimum, and the smoothing should even out small fluctuations in the curve without altering the overall shape significantly. The experiments in Section 3 use a unit Gaussian for smoothing, and values of c ranging from 20 to 5120.

Algorithm 1: Tune c to image I given t_hi, t_lo, and σ_E

  for i = 0 to 28 do
    c_i ← 40 · 2^(i/4)
    B_i ← Binarize(I, c_i, t_hi, t_lo, σ_E)
  end for
  for i = 0 to 27 do
    D_i ← Δ(B_i, B_{i+1})
  end for
  G_0, G_1, ..., G_5 ← 0.02, 0.13, 0.35, 0.35, 0.13, 0.02   {Gaussian of radius 1}
  D_{−3}, D_{−2}, D_{−1} ← 0, 0, 0
  D_{28}, D_{29}, D_{30} ← 0, 0, 0
  for i = 0 to 28 do
    D′_i ← Σ_{j=0..5} D_{i+j−3} · G_j
  end for
  {D′ is now a smoothed version of D}
  q, r, s ← arg max_{q<r<s} (D′_q − 2 D′_r + D′_s)
  c ← c_r
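A compact Python rendering of Algorithm 1 under the same assumptions (a hypothetical `binarize` callable for the base method); the peak–valley–peak search maximizes the criterion given above.

```python
import numpy as np

GAUSS = np.array([0.02, 0.13, 0.35, 0.35, 0.13, 0.02])  # radius-1 Gaussian

def tune_c(binarize, I, t_hi, t_lo, sigma_e):
    """Choose c at the stable valley between the two instability peaks."""
    cs = 40.0 * 2.0 ** (np.arange(29) / 4.0)     # log-spaced candidates for c
    Bs = [binarize(I, c, t_hi, t_lo, sigma_e) for c in cs]
    D = np.array([np.sum(Bs[i] != Bs[i + 1]) for i in range(len(cs) - 1)],
                 dtype=float)
    D = np.concatenate(([0, 0, 0], D, [0, 0, 0]))          # zero padding
    Dp = np.convolve(D, GAUSS, mode='valid')               # smoothed instability
    # Find the valley r flanked by peaks q < r < s maximizing D'_q - 2D'_r + D'_s.
    best, best_r = -np.inf, 1
    for r in range(1, len(Dp) - 1):
        q = int(np.argmax(Dp[:r]))
        s = r + 1 + int(np.argmax(Dp[r + 1:]))
        score = Dp[q] - 2 * Dp[r] + Dp[s]
        if score > best:
            best, best_r = score, r
    return cs[best_r], Bs[best_r]
```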

Picking c only partially solves the parameter-setting problem, since the high Canny threshold remains to be chosen. This upper threshold on edge detection rewards careful tuning, because properly chosen it can weed out the more diffuse edges of ink bleeding through from the back side of the document, while retaining the sharper edges of ink shapes on the front side. Fortunately, the same stability criterion that guided the choice for c also serves to pick a good value within the likely range for t_hi. Algorithm 2 details the exact technique used in the experiments. Note that c and t_hi are set sequentially, rather than at the same time. The algorithm computes the best value of c for each possible t_hi, and then chooses the (c, t_hi) pair with greatest stability with respect to changes in t_hi. This simplifies each step in the process and makes it easier to find a near-optimum solution.

Algorithm 2: Tune t_hi to image I given t_lo and σ_E

  for i = 0 to 10 do
    τ_i ← 0.15 + 0.05i
    c_i ← TuneC(I, τ_i, t_lo, σ_E)   {Algorithm 1}
    B_i ← Binarize(I, c_i, τ_i, t_lo, σ_E)
  end for
  for i = 0 to 9 do
    D_i ← Δ(B_i, B_{i+1})
  end for
  G_0, G_1, ..., G_5 ← 0.02, 0.13, 0.35, 0.35, 0.13, 0.02   {Gaussian of radius 1}
  D_{−3}, D_{−2}, D_{−1} ← D_2, D_1, D_0
  D_{10}, D_{11}, D_{12} ← D_9, D_8, D_7
  for i = 0 to 10 do
    D′_i ← Σ_{j=0..5} D_{i+j−3} · G_j
  end for
  {D′ is now a smoothed version of D}
  k ← arg min_{j=0..10} D′_j
  t_hi ← τ_k
  c ← c_k
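Algorithm 2 then reduces to a loop over candidate t_hi values; this sketch reuses the hypothetical `tune_c` function and `GAUSS` kernel from the previous listing.

```python
import numpy as np

def tune_c_and_thi(binarize, I, t_lo, sigma_e):
    """Algorithm 2 (sketch): pick t_hi, and its c, at the point of greatest stability."""
    taus = 0.15 + 0.05 * np.arange(11)                 # candidate t_hi values
    results = [tune_c(binarize, I, t, t_lo, sigma_e) for t in taus]
    cs = [c for c, _ in results]
    Bs = [B for _, B in results]
    D = np.array([np.sum(Bs[i] != Bs[i + 1]) for i in range(10)], dtype=float)
    # Reflective padding, then the same radius-1 Gaussian smoothing as before.
    D = np.concatenate((D[2::-1], D, D[:-4:-1]))
    Dp = np.convolve(D, GAUSS, mode='valid')
    k = int(np.argmin(Dp))
    return cs[k], taus[k]
```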

Arguments given previously in this section suggested reasons why a stability criterion might work well for choosing c. It is less clear why the same approach works for choosing an edge intensity threshold. Nevertheless a zone of stability with respect to both c and t_hi does seem to distinguish the best binarizations, as illustrated in Figure 4 for several representative images. Although the location of the best parameter setting varies widely between the images, each case displays a striking qualitative correlation between the binarization error and the two-way stability, measured using the simple unnormalized formula below.

\[ \zeta_{\nu,\tau}(c, t_{hi}) = \frac{1}{4}\left[\Delta(B^{c,t_{hi}}, B^{\nu c,\,t_{hi}}) + \Delta(B^{c,t_{hi}}, B^{c/\nu,\,t_{hi}}) + \Delta(B^{c,t_{hi}}, B^{c,\,t_{hi}-\tau}) + \Delta(B^{c,t_{hi}}, B^{c,\,t_{hi}+\tau})\right] \qquad (9) \]

Other DIBCO images not shown behave similarly. Full explanation of this phenomenon lies beyond the scope of this paper, which merely observes and seeks to exploit the pattern.

Figure 4 (panels, left to right: H2009-T1, H2009-T2, P2009-T1, P2009-T2): Binarization instability (left in pair) and error (right in pair) visualized with respect to varying values of c (vertical axis; logarithmic range from 20 to 5120) and t_hi (horizontal axis; linear range from 0.15 to 0.65) for four representative images. Despite the very different patterns on display, the parameter neighborhoods with lower error coincide in each instance with higher stability (both indicated by lighter shades). The converse does not always hold: the second image shows two areas of stability, with only the more central one corresponding to low error. The second, more peripheral stability zone is rejected by Algorithm 1.

2.3 Computational Efficiency

Completing Algorithm 2 requires 363 trial binarization computations (33 candidates for c times 11 candidates for t_hi) and thus runs much more slowly than a single binarization with static parameter settings. However, two modifications to the algorithm can substantially mitigate the speed difference.

The first takes advantage of the fact that the binarization result does not usually change substantially between successive values of c. Data structures built to minimize Equation 1 for one value of c can be modified and reused with a similar c value, achieving noticeable economies. This strategy, introduced by Boykov and Kolmogorov [5], minimizes Equation 1 by finding the minimum cut on a graph derived from the image. The method grows search trees from both the source and the sink of the image graph to help find augmenting paths, and it turns out that the trees grown for one value of c can be largely reused at nearby values to realize significant time savings. Empirical tests of this implementation show that it can speed computation by a factor of fifteen or more. In other words, it computes all 33 trial binarizations required for one iteration of Algorithm 1 in little more than the time that it would normally take to compute just two or three.
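For reference, the sketch below shows one way Equation 1 can be mapped onto a single min-cut using the PyMaxflow package. It performs a fresh computation each time and omits the search-tree reuse described above; the exact call conventions for per-edge weights should be treated as an approximation rather than a definitive implementation.

```python
import numpy as np
import maxflow  # PyMaxflow

def minimize_energy(L0, L1, Ch, Cv):
    """Minimize Equation 1 with a single min-cut (illustrative sketch)."""
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(L0.shape)
    # Pairwise terms: Ch between each pixel and the next pixel along axis 0,
    # Cv between each pixel and the next pixel along axis 1 (Equation 1).
    g.add_grid_edges(nodes, weights=Ch,
                     structure=np.array([[0, 0, 0], [0, 0, 0], [0, 1, 0]]),
                     symmetric=True)
    g.add_grid_edges(nodes, weights=Cv,
                     structure=np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]]),
                     symmetric=True)
    # Unary terms: shift both costs per pixel so capacities are non-negative;
    # this changes the total energy only by a constant.
    shift = np.minimum(L0, L1)
    g.add_grid_tedges(nodes, L1 - shift, L0 - shift)
    g.maxflow()
    return g.get_grid_segments(nodes)   # True where the pixel is labeled ink
```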

The second modification relies on a heuristic trick, and stems from the observation that the presence or absence of ink bleed-through significantly influences the optimal value of t_hi. Images showing bleed-through typically require a larger t_hi to avoid identifying the spurious bled ink components as real. By contrast, in images without bleed-through the algorithm achieves its best results with a lower t_hi, because this allows it to detect fainter edges. Thus it appears that an explicitly bimodal algorithm allowing only two possible values {τ_1, τ_2} for t_hi might still perform well as compared with Algorithm 2, which tests eleven. The experimental results below confirm this hypothesis.

Detecting the presence or absence of bleed-through in a document is a nontrivial task in the general case, so it might seem impossible to determine which of even two candidate t_hi values to use. Fortunately, an extension of the stability criterion provides a simple heuristic that does fairly well empirically. In addition to the two candidate values, Algorithm 3 computes the binarization using their mean value τ_0 = (τ_1 + τ_2)/2. This midpoint binarization can be compared to the other two, and the closer is judged the more stable and thus selected for the final result.

Algorithm 3 computes results for only three t_hi values and thus runs more than three times as fast as Algorithm 2. The metaparameters τ_1 and τ_2 are chosen for best performance on a training set; analysis of the DIBCO 2009 and H-DIBCO 2010 image sets indicates that values of 0.25 and 0.50 respectively achieve the best results across the full set. (Using only 23 of the 24 images in cross-fold training occasionally yields other values.) Employing a training set to choose τ_1 and τ_2 diminishes the automatic nature of the full algorithm, but represents a tradeoff made in this particular algorithm for the sake of computation speed.

Algorithm 3: Pick t_hi from {τ_1, τ_2} for image I given τ_1, τ_2, t_lo, and σ_E

  τ_0 ← (τ_1 + τ_2)/2
  c_0 ← TuneC(I, τ_0, t_lo, σ_E)
  c_1 ← TuneC(I, τ_1, t_lo, σ_E)
  c_2 ← TuneC(I, τ_2, t_lo, σ_E)
  B_0 ← Binarize(I, c_0, τ_0, t_lo, σ_E)
  B_1 ← Binarize(I, c_1, τ_1, t_lo, σ_E)
  B_2 ← Binarize(I, c_2, τ_2, t_lo, σ_E)
  D_1 ← Δ(B_0, B_1)
  D_2 ← Δ(B_0, B_2)
  if D_1 < D_2 then
    t_hi ← τ_1
  else
    t_hi ← τ_2
  end if
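A direct rendering of Algorithm 3, again reusing the hypothetical `tune_c` sketch from above.

```python
import numpy as np

def pick_thi(binarize, I, tau1, tau2, t_lo, sigma_e):
    """Algorithm 3 (sketch): choose between two candidate t_hi values by stability."""
    tau0 = (tau1 + tau2) / 2.0
    # Tune c for the midpoint and for each candidate threshold.
    c0, B0 = tune_c(binarize, I, tau0, t_lo, sigma_e)
    c1, B1 = tune_c(binarize, I, tau1, t_lo, sigma_e)
    c2, B2 = tune_c(binarize, I, tau2, t_lo, sigma_e)
    # The candidate whose result lies closer to the midpoint result is the
    # more stable choice.
    d1 = np.sum(B0 != B1)
    d2 = np.sum(B0 != B2)
    return (tau1, c1, B1) if d1 < d2 else (tau2, c2, B2)
```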

These two innovations in combination mean that a result originally requiring 363 trial binarizations to compute may be reached in the time required for just eight or nine. This is still slower than using static parameter values, but investing the extra time may be worthwhile for more accurate results. Alternately, in situations where a number of similar documents must be binarized, the parameter tuning may be run for a few trial pages to find appropriate values, which are then set statically for the remainder of the set.

All the algorithms display a more or less linear dependence of computation time on the number of pixels in the image. Executing Algorithm 2 takes an average of 892 seconds per megapixel on a 2.4 GHz Xeon processor running as a single thread (without parallelism). By contrast, Algorithm 3 runs in 18.1 seconds per megapixel under the same conditions, while a single execution of the base algorithm under static parameters takes 2.12 seconds per megapixel.

2.4 Further Algorithmic Variants

For best results, many binarization methods compute an initial labeling using some base technique and then apply one or more postprocessing algorithms to improve it. For example, Su et al. [22] remove components of three pixels or fewer. More complicated modeling and classification algorithms can identify and remove noise components while retaining real ink [1]. The unknown pixel classification of Lelore and Bouchara [13] may also be viewed as post-processing of a sort.

Similar techniques can be applied to the results from the method described herein. Indeed, both the three-pixel component filter and a more complex classification-based approach reminiscent of Agrawal & Doermann have been tested informally and found to reduce binarization error for the DIBCO test images, relative to ground truth. Unfortunately, gains in training can come at the cost of a loss of generality in practice. This suspicion is borne out by the DIBCO 2011 contest results. The two algorithms that ranked in first and second place according to the contest scoring methodology perform very well on most of the test images, but fail severely on one or two examples (PR6 and PR7). One might view this as a form of overfitting the problem: these algorithms are specialized to do very well on images that match the expectations of the designers, but cannot handle the full range of images that might be encountered "in the wild". This paper aims to develop binarization techniques that work automatically and reliably on as wide a range of images as possible. Thus the experiments eschew post-processing heuristics of the type described above, because they might decrease error on a given test set at the cost of generality. Such tricks are nevertheless worth mentioning because they may prove useful whenever the images to be binarized are amenable.


3 Experimental Results

The DIBCO events have provided a platform for side-by-side comparison of different binarization results to a ground truth created by a team of human referees. Three contests have been held to date: two included both handwritten and printed document images, and one consisted of just handwritten documents. Images from the first two contests were used to develop the algorithms described in this paper, with the third set held out for post-hoc testing.

A comprehensive survey of the parameter space has been prepared for this paper, providing the context necessary to evaluate the effectiveness of the automatic parameter tuning algorithms. It samples the 4-dimensional parameter space on a regular grid, surrounding and including the region found to produce the best binarizations in prior work [11]. The specific values tested for each parameter are:

\[ t_{hi} \in T_{hi} \equiv \{0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65\} \qquad (10) \]

\[ t_{lo} \in T_{lo} \equiv \{0, 0.05, 0.10, 0.15, 0.20\} \qquad (11) \]

\[ \sigma_E \in S_E \equiv \{0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9\} \qquad (12) \]

\[ c \in C \equiv \{40 \cdot 2^{i/4} \mid i = 0..28\} \qquad (13) \]

Results from the parameter survey can answer several interesting questions. They reveal both the best possible performance with a fixed set of parameters over all images (static parameters), and the best possible performance in hindsight using the optimal parameter settings for each individual image (ideal parameters). The difference between these two values is the space where parameter-tuning algorithms operate: they cannot aspire to do better than the ideal, although they may certainly do worse than the best static setting. If the difference between the static and ideal numbers is small, then parameter tuning may not be worthwhile, because any possible gains are limited.
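Given a table of survey scores (a hypothetical container keyed by parameter setting), the static and ideal numbers discussed here are simple aggregates of the same grid:

```python
import numpy as np

def static_vs_oracle(score):
    """Compare the best static setting against per-image ideal settings.

    score : dict mapping (t_hi, t_lo, sigma_e, c) -> array of per-image
            F-measures from the parameter survey (hypothetical container).
    """
    settings = list(score.keys())
    table = np.stack([score[s] for s in settings])   # settings x images
    static = table.mean(axis=1).max()                # best single setting overall
    oracle = table.max(axis=0).mean()                # best setting per image
    return static, oracle
```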

A number of different statistics can measure and compare the effectiveness of binarization algorithms. The DIBCO events compute some half-dozen numbers, with four used to determine the winners. The initial experiments herein focus on the F-measure, defined in Equation 14, because its interpretation is simple and it serves as a proxy for the other measures of binarization quality. An F-measure of 100% represents perfection. In this paper the term error refers to the difference between the observed F-measure and a perfect 100% score.

\[ F = \frac{2 \cdot R \cdot P}{R + P} \qquad (14) \]

where the recall R and precision P are defined in terms of the number of true positive pixels N_TP, the number of false positive pixels N_FP, and the number of false negative pixels N_FN in the binarization B as compared to ground truth G.

\[ R = \frac{N_{TP}}{N_{TP} + N_{FN}} \qquad (15) \]

\[ P = \frac{N_{TP}}{N_{TP} + N_{FP}} \qquad (16) \]
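The F-measure of Equations 14–16 can be computed directly from the two binary images:

```python
import numpy as np

def f_measure(B, G):
    """F-measure (Equation 14) of binarization B against ground truth G,
    both boolean arrays with True marking ink pixels."""
    tp = np.sum(B & G)           # true positives
    fp = np.sum(B & ~G)          # false positives
    fn = np.sum(~B & G)          # false negatives
    recall = tp / (tp + fn)      # Equation 15
    precision = tp / (tp + fp)   # Equation 16
    return 2 * recall * precision / (recall + precision)
```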

Table 1 summarizes binarization results on the DIBCO 2009 and H-DIBCO 2010 images for various parameter-tuning algorithms. The static parameters for the experiments represented in this table are determined in leave-one-out style. In other words, for the ith image, the computation uses whatever static parameter setting would have given the best mean result on the other 23 images. Table 2 shows the range of parameter values chosen via this technique for the various algorithms. (Note that even algorithms with static parameter settings may show a range in Table 2 if the static values computed differ for various leave-one-out training folds.)

The second column of the table gives the F-measure when all parameters are static; this is the baseline algorithm [11]. The third column gives the result achieved with an oracle that knows the ideal values of c and t_hi for each image; this represents the best possible outcome achievable by tuning the two parameters, but is not realizable in practice without prior access to ground truth. The ideal parameter settings reduce the overall error by more than one-third compared to static parameters, showing that tuning methods have significant potential if they can come close to this ideal. The table does not show results from an oracle that optimizes all four parameters. If available, such an oracle could achieve a mean F-measure of 95.3% across all 24 images, hardly more than the two-parameter oracle. This confirms the conjecture that tuning all four parameters offers little reward in exchange for the risk.

The remaining columns show results for the various parameter-tuning strategies described in Section 2. Column four tunes c using Algorithm 1 and employs static settings for t_hi, t_lo, and σ_E. It does somewhat better than the all-static settings, but still shows room for improvement. Column five tunes both c and t_hi using Algorithm 2 and employs static settings for t_lo and σ_E. This method achieves 83% of the level of improvement made by the oracle. The final column shows the results of the two-state tuning for t_hi described in Algorithm 3. This method does nearly as well as the fully tuned t_hi, despite its much faster computation time.

Table 3 shows the results for the DIBCO 2011 images, used as a hands-off test set. Unlike the previous table, these are not trained leave-one-out style. To ensure an unbiased test, the algorithms were fixed before any results were examined, and static parameter values are set using the first two sets of contest images. The numbers here look more diverse because some of the new images are more extreme and therefore more challenging than the training images. However, the general picture remains the same: tuning c gives better results than static parameter settings, and tuning both c and t_hi improves the binarization quality still further for most images. A few exceptions lower the overall average, particularly H2011-6. Close inspection of this case reveals that the stability curve for t_hi on this image contains two basins of stability, and the minimization criterion chooses the wrong one by a slight margin. The two-state tuning does not suffer from this problem, because it selects t_hi from a more constrained set of choices. In hindsight this may be an unforeseen practical advantage of Algorithm 3.

Despite the success of the tuning methods described, the best F-measure achieved by tuning still lies farther from the oracular best for the DIBCO 2011 images than in the experiments with earlier images. This reflects the diversity of the test set. Some of the images represent adverse examples for the stability criterion advocated in this paper; the two worst appear in Figure 5.

Table 1: F-Measure for DIBCO 2009 and H-DIBCO 2010 Images

Image      Static   Oracle   Alg. 1   Alg. 2   Alg. 3
H2009-T1   96.4     96.6     96.6     96.1     96.6
H2009-T2   93.3     93.4     92.5     93.1     93.1
H2009-1    92.7     95.9     95.4     95.8     95.8
H2009-2    90.1     96.4     88.8     96.2     95.7
H2009-3    94.7     95.6     92.0     94.8     95.1
H2009-4    92.1     94.7     93.9     94.6     94.5
H2009-5    86.5     92.7     91.9     92.3     92.6
H2010-1    93.7     96.2     95.0     95.4     96.0
H2010-2    89.0     96.1     95.4     95.7     95.4
H2010-3    94.4     94.7     93.3     94.7     94.7
H2010-4    92.9     94.4     93.5     93.8     93.9
H2010-5    92.9     96.5     96.1     96.5     96.4
H2010-6    90.9     91.2     90.4     90.9     90.9
H2010-7    95.0     95.2     94.8     95.0     95.1
H2010-8    92.6     93.7     92.2     93.5     93.4
H2010-9    92.6     93.8     92.9     92.1     93.5
H2010-10   88.8     92.7     90.8     92.5     87.2
P2009-T1   88.5     97.4     96.7     97.3     97.2
P2009-T2   98.0     98.5     98.1     98.4     98.5
P2009-1    93.9     94.3     90.4     94.0     94.2
P2009-2    96.8     96.9     96.8     96.8     96.9
P2009-3    97.6     98.4     98.2     98.2     98.3
P2009-4    93.5     93.8     92.8     92.9     92.8
P2009-5    88.6     91.5     84.7     91.2     84.3

Hand       92.3     94.7     93.3     94.3     94.1
Print      93.9     95.8     93.9     95.6     94.6
All        92.7     95.0     93.5     94.7     94.3

Table 2: Parameter Ranges for Table 1

Parameter   Static   Oracle      Tune c      Tune t_hi   Dual
c           160      47.6–1280   56.6–3044   80–1076     67.3–3044
t_hi        0.4      0.15–0.65   0.2–0.3     0.15–0.6    0.25 or 0.5
t_lo        0.1      0–0.2       0.1–0.15    0.1         0.1–0.15
σ_E         0.6      0.3–0.8     0.5–0.6     0.6         0.5–0.6


Table 3: F-Measure for DIBCO 2011 Images

Image     Static   Oracle   Tune c   Tune t_hi   Dual
H2011-1   77.3     93.9     88.2     90.2        90.0
H2011-2   97.4     97.5     96.6     97.3        97.4
H2011-3   93.2     95.2     94.9     94.5        93.5
H2011-4   92.5     92.7     92.5     92.0        92.0
H2011-5   92.4     97.1     96.2     97.1        96.2
H2011-6   88.2     94.4     91.1     60.6        92.1
H2011-7   87.6     92.5     92.3     89.6        89.8
H2011-8   95.3     95.7     94.9     95.3        95.4
P2011-1   93.0     95.3     94.3     95.1        94.7
P2011-2   77.6     90.8     73.5     71.5        72.3
P2011-3   95.6     96.6     96.3     96.4        96.4
P2011-4   95.0     95.2     94.5     95.2        95.1
P2011-5   94.8     94.9     93.7     94.2        94.6
P2011-6   66.7     93.1     89.9     86.0        86.3
P2011-7   90.2     93.3     89.7     93.1        93.1
P2011-8   90.3     92.2     88.5     91.3        91.5

Hand      90.5     94.9     93.3     89.6        93.3
Print     87.9     93.9     90.0     90.3        90.5
All       89.2     94.4     91.7     90.0        91.9


3.1 Comparison with Other Algorithms

Results available for the DIBCO 2011 image set make possible a quantitative comparison between the algorithms developed herein and other recent work. The DIBCO 2011 contest used three primary criteria in addition to the F-measure. Of these, the peak signal-to-noise ratio (PSNR) correlates fairly strongly with the F-measure. If B is the binarization and G the ground truth binary image,

\[ \mathrm{PSNR} = -10 \log_{10}\!\left(\frac{\Delta(B, G)}{mn}\right) \qquad (17) \]

The misclassification penalty metric (MPM) penalizes errors according to their distance from the ink boundary. Define D_ij as the distance of pixel (i, j) from this boundary in the ground truth.

\[ \mathrm{MPM} = \frac{\sum_{i=0}^{m}\sum_{j=0}^{n} D_{ij}\,(B_{ij} \neq G_{ij})}{2\sum_{i=0}^{m}\sum_{j=0}^{n} D_{ij}\,(1 - G_{ij})} \qquad (18) \]

Figure 5 (left: H2011-1; right: P2011-2, detail): The DIBCO 2011 test images contain several challenging examples. Both these images have unusual stability curves. At left, the high-amplitude noise on the right side generates instability at values of c that are optimal for the rest of the page. At right, the stability criterion chooses a c value that preserves the bleed-through ink.


The distance reciprocal distortion metric (DRD) attempts to account for human perceptual salience by weighting errors according to pixel values in a local 5×5 neighborhood, using a weight matrix W computed as the normalized inverse distance.

\[ \mathrm{DRD} = \frac{1}{N_8} \sum_{i=0}^{m}\sum_{j=0}^{n} \Gamma_{ij}\,(B_{ij} \neq G_{ij}) \qquad (19) \]

\[ \Gamma_{ij} = \sum_{h=-2}^{2}\sum_{k=-2}^{2} W_{hk}\,(B_{ij} \neq G_{i+h,j+k}) \qquad (20) \]

Here N_8 is computed by dividing G into 8 × 8 blocks and counting the number that are non-uniform. Note that better binarizations will have higher F-measure and PSNR, but lower MPM and DRD. For further details on these metrics, please see the DIBCO 2011 contest report [19].
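Sketches of the PSNR and MPM computations follow (DRD is omitted for brevity). The boundary-distance construction for D_ij here, an erosion-based contour followed by a Euclidean distance transform, is one plausible reading of the definition and may differ in detail from the official contest tools.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def psnr(B, G):
    """PSNR per Equation 17 (higher is better); B and G are boolean arrays."""
    mse = np.mean(B != G)
    return -10 * np.log10(mse)

def mpm(B, G):
    """Misclassification penalty metric per Equation 18 (lower is better)."""
    # D: distance of each pixel from the ground-truth ink boundary.
    boundary = G ^ binary_erosion(G)
    D = distance_transform_edt(~boundary)
    num = np.sum(D * (B != G))
    den = 2 * np.sum(D * (~G))
    return num / den
```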

Algorithm 3 is similar to entry #11, which was actually submitted to the DIBCO 2011 contest. That entry was based on a less comprehensive survey of the parameter space and thus differs in the setting of the static parameter values and in the range of c values tested. Entry #11 nevertheless had the highest mean F-measure across all the images, the highest mean peak signal-to-noise ratio, the best mean misclassification penalty metric, and the fifth-best distance reciprocal distortion metric. Strangely, despite these successes, entry #11 ranked only third in the official DIBCO 2011 contest results, and the two other entries that placed first and second had mean scores over the full image set that were uniformly worse on all four measures. The scoring methodology of the contest chose the winner based upon summed ranks over all images on various quality measures, rather than directly on the mean metric scores. The two higher-rated methods ranked well on most images, and while they showed extremely poor performance on a few of the printed documents, this ultimately had limited effect on the final score [19]. With rank-based scoring, the comparison of one method relative to another is affected by the performance (and presence or absence) of all the other methods in the contest.

Table 4 compares the results of the algorithms developed in this paper with other recent methods on the DIBCO 2011 image set. The first section shows selected entries from the contest (those that beat Entry #11 on at least one measure), the second section shows other recent algorithms², and the third section shows the algorithms described in this paper, trained using the DIBCO 2009 and H-DIBCO 2010 images for a fair comparison. The table gives the mean result for each of the four metrics averaged across all the test images, and also the rank score total used to determine the DIBCO 2011 winner. For algorithms not entered in the contest, the rank score is simulated by treating the algorithm in question as an additional entrant, indicated with an asterisk in the table. Of course, adding a contestant changes the rank scores of the actual entrants, making them worse whenever the new entry does better on an image. The modified scores of existing entries are not shown in the table, but the information is accounted for in the parenthetical overall rank shown in the rightmost column. As the table reveals, either of Algorithms 2 or 3 would have won the contest on rank scoring, and also easily place first in mean score on each of the four quality measures as compared with all previous algorithms.

4 Conclusion

Tuning parameters offers both risks and potential rewards. Choosing the parameter values that perform best for a given image can lower binarization error by substantial amounts as compared with a static setting suitable for generic images. On the other hand, poorly chosen parameter values can sabotage the result: the potential losses in binarization quality generally dwarf the potential gains. Thus one must be careful to tune only where there is reasonable confidence of success.

This paper has introduced a stability heuristic criterion that helps to choose suitable parameter values for individual images. The approach hypothesizes that good parameter values are marked by low variability in the binarization solution with respect to changes in the parameter values. This criterion leads to an algorithm that successfully picks good c values on all the images tested.

² The numbers for Lelore & Bouchara exclude image PR6 because its results were not available.


Table 4: Algorithms compared by scores on the DIBCO 2011 image set.

Method        F      PSNR   MPM (×10⁻²)   DRD (×10⁻³)   Ranks
Entry 11      88.7   17.8   5.36          8.67          429 (3)
Entry 10      80.9   16.1   104.48        64.42         309 (1)
Entry 8       85.2   17.2   15.66         9.07          346 (2)
Entry 3       85.1   16.4   5.88          8.09          649 (12)
Entry 4       85.2   16.6   6.28          4.43          489 (5)
Entry 6       83.6   16.7   8.08          4.57          470 (4)
Entry 14      78.0   14.9   7.62          7.35          835 (17)
Gatos [9]     84.3   16.3   6.39          6.08          663 (11)*
Lelore [12]   56.2   12.5   2.97          13.30         811 (17)*
Lu [14]       79.7   15.5   39.67         21.47         579 (8)*
Su [22]       87.8   17.7   5.38          4.65          435 (3)*
Static        89.2   18.2   9.05          5.76          438 (3)*
Alg. 1        91.7   19.2   4.43          3.40          322 (2)*
Alg. 2        90.0   19.1   3.89          3.78          323 (1)*
Alg. 3        91.7   19.3   3.87          3.48          317 (1)*

A similar heuristic applied to the choice of t_hi also chooses good parameter values in most cases, although serious failures were observed with this technique for a handful of cases. The most successful method tested uses the stability criterion to choose c, and selects t_hi from a constrained set of two possible values. This approach delivers a substantial fraction of the maximum gain possible from tuning both parameters, and can be computed at about one eighth the speed of a simple binarization with static parameters. Reference code for the technique will be available on the author's web site.

The parameter survey results make it clear that for many images the tuning algorithms given herein come close to maximizing the potential of the base binarization algorithm. Further improvements in result quality will likely have to come through development of new base algorithms. Some of these new algorithms may be application-specific, whereas this work has striven for broad applicability. Whether the stability criterion that has proven so useful here will apply to other approaches remains to be seen, and is a topic for future work. In any case, tuning parameters with the algorithms described herein substantially advances the current state of the art in document binarization, as evidenced by comparative results on the DIBCO 2011 test images.

Acknowledgment

The author thanks those who kindly shared results or implementation details of their algorithms for comparison purposes, including Basilis Gatos, Su Bolan, and Frederic Bouchara.

References

[1] Agrawal, M., Doermann, D.: Stroke-like pattern noise removal in binary document images. In: International Conference on Document Analysis and Recognition, pp. 17–21 (2011)

[2] Badekas, E., Papamarkos, N.: Estimation of proper parameter values for document binarization. International Journal of Robotics and Automation 24(1), 66–78 (2009)

[3] Bar-Yosef, I., Beckman, I., Kedem, K., Dinstein, I.: Binarization, character extraction, and writer identification of historical Hebrew calligraphy documents. Int. J. Doc. Anal. Recognit. 9(2), 89–99 (2007)

[4] Barney-Smith, E.: An analysis of binarization ground truthing. In: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pp. 27–33. Boston (2010)

[5] Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. on Pattern Analysis and Machine Intelligence 26(9), 1124–1137 (2004)

[6] Canny, J.: A computational approach to edge detection. IEEE Trans. on Pattern Analysis and Machine Intelligence 8(6), 679–698 (1986)

[7] Chen, Y., Leedham, G.: Decompose algorithm for thresholding degraded historical document images. IEE Proceedings - Vision, Image and Signal Processing 152(6), 702–714 (2005)

[8] Dawoud, A.: Iterative cross section sequence graph for handwritten character segmentation. IEEE Transactions on Image Processing 16(8), 2150–2154 (2007)

[9] Gatos, B., Pratikakis, I., Perantonis, S.J.: Adaptive degraded document image binarization. Pattern Recognition 39(3), 317–327 (2006)

[10] Gatos, B., Pratikakis, I., Perantonis, S.J.: Improved document image binarization by using a combination of multiple binarization techniques and adapted edge information. In: International Conference on Pattern Recognition, pp. 1–4 (2008)

[11] Howe, N.: A Laplacian energy for document binarization. In: International Conference on Document Analysis and Recognition, pp. 6–10 (2011)

[12] Lelore, T., Bouchara, F.: Document image binarization using Markov field model. In: International Conference on Document Analysis and Recognition, pp. 551–555 (2009)

[13] Lelore, T., Bouchara, F.: Super-resolved binarization of text based on FAIR algorithm. In: International Conference on Document Analysis and Recognition, pp. 839–843 (2011)

[14] Lu, S., Su, B., Tan, C.L.: Document image binarization using background estimation and stroke edges. International Journal on Document Analysis and Recognition 13(4), 303–314 (2010)

[15] Mishra, A., Alahari, K., Jawahar, C.V.: An MRF model for binarization of natural scene text. In: International Conference on Document Analysis and Recognition (2011)

[16] Niblack, W.: An Introduction to Digital Image Processing. Prentice-Hall, Englewood Cliffs, New Jersey (1986)

[17] Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. on Systems, Man, and Cybernetics 9(1), 62–66 (1979)

[18] Peng, X., Setlur, S., Govindaraju, V., Sitaram, R.: Markov random field based binarization for hand-held devices captured document images. In: Proceedings of the Seventh Indian Conference on Computer Vision, Graphics and Image Processing, pp. 71–76 (2010)

[19] Pratikakis, I., Gatos, B., Ntirogiannis, K.: ICDAR 2011 document image binarization contest (DIBCO 2011). In: International Conference on Document Analysis and Recognition, pp. 1506–1510 (2011)

[20] Ramírez-Ortegón, M.A., Tapia, E., Ramírez-Ramírez, L.L., Rojas, R., Cuevas, E.: Transition pixel: A concept for binarization based on edge detection and gray-intensity histograms. Pattern Recognition, pp. 1233–1243 (2010)

[21] Sauvola, J., Pietikäinen, M.: Adaptive document image binarization. Pattern Recognition 33(2), 225–236 (2000)

[22] Su, B., Lu, S., Tan, C.L.: Binarization of historical document images using the local maximum and minimum. In: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pp. 159–166 (2010)

[23] Vonikakis, V., Andreadis, I., Papamarkos, N., Gasteratos, A.: Adaptive document binarization: A human vision approach. In: 2nd International Conference on Computer Vision Theory and Applications, pp. 104–110. Barcelona (2007)
