
Page 1: Boosted Decision Trees for Word Recognition in Handwritten Document Retrieval

Howe, N.R., Rath, T.M. and Manmatha, R.
Department of Computer Science, University of Massachusetts
SIGIR 2005, published by ACM, New York

Presented by Gianmaria Silvello
Information Management Research Group (IMS), Department of Information Engineering, University of Padua, Italy
Applied Functional Analysis, 5 February 2009, Padova, Italy

Page 2: Outline

• Introduction to recognition and retrieval of handwritten documents
• Classification algorithm: AdaBoost and decision trees
• Classification experiments
• Language models for retrieval
• Conclusions

Page 3: Introduction

• Recognition and retrieval of off-line handwritten documents based upon word classification
• Decision trees with normalized pixels as features form the base classifier for AdaBoost
• Addresses the problem of a skewed distribution of class frequencies
• Experiments are run on the GW20 and GW100 corpora
• Retrieval is done using a language model over the recognized words

Page 4: Introduction (cont.)

• The main goal is to offer access to the world's historical handwritten documents
• Handwriting recognition often works on limited vocabularies (e.g., postal addresses)
• Historical documents add complexity due to ink bleeding and dirt on the paper
• Pixels of the normalized word image at multiple scales (image pyramids) are used as features
• An innovative procedure for creating additional training data is proposed

Page 5: The Boosting Approach

• Boosting is a classification technique that makes its prediction via the weighted vote of a diverse set of base classifiers, each of which has been trained on a different weighting of the training data
• AdaBoost trains successive versions of its base classifier, focusing on hard-to-classify examples
• It can use a simple base classifier, but stronger base classifiers get better results

Pages 6–14: AdaBoost in brief

• Introduced in 1995 by Freund and Schapire in "A decision-theoretic generalization of on-line learning and an application to boosting", Journal of Computer and System Sciences, 55(1):119-139, 1997.
• These slides step through the binary case of the algorithm:
  • All weights are initially set equally
  • Find a weak hypothesis ht appropriate for the distribution Dt
  • The error εt measures the goodness of the hypothesis
  • AdaBoost chooses the parameter αt that measures the importance assigned to ht; αt ≥ 0 if εt ≤ 1/2
  • Dt is updated → the weights of misclassified examples are increased → the learner concentrates on hard examples
  • The final hypothesis H is a weighted majority vote over the T weak hypotheses, where αt is the weight assigned to ht

Reference: Freund, Y. and Schapire, R. E. "A Short Introduction to Boosting", Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, 1999.
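The algorithm figure stepped through on these slides did not survive extraction; for reference, this is binary AdaBoost as given in the cited Freund and Schapire tutorial:

  Given (x1, y1), ..., (xm, ym) with yi ∈ {-1, +1}:
    D1(i) = 1/m for all i
    For t = 1, ..., T:
      • Train the weak learner on weighting Dt, obtaining ht
      • εt = Pr_{i~Dt}[ht(xi) ≠ yi]
      • αt = (1/2) ln((1 - εt) / εt)
      • D_{t+1}(i) = Dt(i) · exp(-αt yi ht(xi)) / Zt   (Zt normalizes Dt+1)
    Output H(x) = sign(Σt αt ht(x))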

Page 15: AdaBoost in brief (cont.)

• Schapire, R. E. and Singer, Y. "Improved boosting algorithms using confidence-rated predictions", Machine Learning, 37(3):297-336, 1999, shows how AdaBoost can handle weak hypotheses that output real values.
• For an example x, ht outputs ht(x) ∈ R, whose sign is the predicted label (-1 or +1) and whose magnitude |ht(x)| gives the measure of confidence in the prediction.
• AdaBoost.M1 is the extension to the multi-class case → it is adequate only when the weak learner is strong enough to achieve accuracy of at least 50%.
• Extensions AdaBoost.MH and AdaBoost.MR reduce the multi-class problem to a larger binary problem.

Page 16: Choices and Problems

• The recognition process uses values sampled directly from the word image at varying resolutions
• The choice is to segment the page into word images, not letters:
  • recognizing letters becomes a limiting step
  • segmenting individual word images is easier, and turns recognition into an image classification problem
• Skewed distribution of class frequencies (Zipfian distribution) and paucity of training data for most word classes

Pages 17–18: Classification Algorithm

• Handwritten words belonging to a single class have a similar (but not identical) ink distribution
• The position of individual features within the word shifts from example to example
• The pixel representation contains information about word identity that can be amplified by boosting:
  • clearer areas contain more reliable features
  • blurred areas indicate inconsistency

[Figure: composite image of 21 examples of the word "Instructions".] Straightforward use of raw pixels is ineffective.

Page 19: Common framework

• Pixels are used as features for word image classification
• Each word image is mapped onto a common pixel grid
• Images are scaled and translated so that the baseline spans (0,0) to (1,0) → resampling each image onto a common grid then yields a common pixel representation
• Words vary greatly in horizontal and vertical extent → a single grid fine enough for all words would produce astronomically large feature vectors
• Solution: a pyramid approach
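A minimal sketch of this normalization step, assuming a grayscale word image held in a NumPy array; the function name, background value, and cropping policy are illustrative, not the paper's exact procedure:

```python
import numpy as np
from scipy.ndimage import zoom

BACKGROUND = 1.0  # single default value for grid cells outside the word image

def normalize_to_grid(word_img, size=32):
    """Map a grayscale word image onto a standard grid (e.g. Phi_0, 32x32).

    The image is scaled uniformly so its width spans the grid
    (baseline from (0,0) to (1,0)); rows above and below the word
    are filled with the default background value.
    """
    h, w = word_img.shape
    scaled = zoom(word_img, size / w, order=1)   # uniform scale keeps aspect ratio
    sh, sw = scaled.shape
    sw = min(sw, size)                           # guard against rounding overshoot
    grid = np.full((size, size), BACKGROUND, dtype=float)
    top = max((size - sh) // 2, 0)               # center vertically in the grid
    src_top = max((sh - size) // 2, 0)           # crop very tall words around center
    rows = min(sh, size)
    grid[top:top + rows, :sw] = scaled[src_top:src_top + rows, :sw]
    return grid
```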

Pages 20–22: Pyramid Approach

• Define a family of standard grids → the base grid Φ0 covers ([0,1], [-0.5, 0.5]), broken into 32×32 px
• Each refined grid covers the same square region at double the resolution (64×64 px and up)
• The grids form a tree in which each cell of Φk has 4 children in Φk+1
✓ The standardized image usually doesn't cover the full vertical extent of the grid → the portions above and below its edges may be represented by a single default value
✓ Data need only be stored for Φk with resolution up to that of the reference image
• This square area captures all the detail of interest for most words
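A sketch of building the grid family, reusing the hypothetical normalize_to_grid helper above; each level simply doubles the resolution, so cell (i, j) of Φk corresponds to the 2×2 block (2i..2i+1, 2j..2j+1) of children in Φk+1:

```python
def build_pyramid(word_img, levels=3):
    """Return [Phi_0, Phi_1, ...]: the word resampled at 32, 64, 128, ... px."""
    return [normalize_to_grid(word_img, size=32 * 2 ** k)
            for k in range(levels)]
```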

Page 23: Boosting and Decision Trees

• Word image recognition has many potential classes → to use AdaBoost, a base classifier with at least 50% accuracy is needed
• Decision trees are the foremost option:
  • well understood
  • in practice, they achieve arbitrary accuracy on the training data
• At each node, the training examples are split into 2 subgroups by comparing the value of a chosen pixel in each to a chosen threshold
• Growth of a tree branch stops when the contained subset is dominated by a majority class

If growth continues until each leaf holds a single training example, 100% training accuracy is reached → the tree overfits and must be pruned by removing statistically unreliable branches.

Pages 24–25: C4.5

• C4.5 provides the algorithm for building the decision tree, with some modifications designed to support the grid-pyramid data structure
• C4.5 builds decision trees from a set of training data using the concept of information entropy
• It uses the fact that each attribute of the data can be used to make a decision that splits the data into smaller subsets

The training data is a set S = s1, s2, ..., sn of already classified samples, where each sample si = x1, x2, ..., xm is a vector of feature values xj. The training data is augmented with a vector C = c1, c2, ..., cn, where ci represents the class that sample si belongs to.

Reference: Quinlan, J. R. "C4.5: Programs for Machine Learning". Morgan Kaufmann, 1993.
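The entropy-based split criterion itself is standard; a minimal sketch of scoring one candidate pixel/threshold split by information gain:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(values, labels, threshold):
    """Gain from splitting samples on `values <= threshold`.

    `values` holds one feature (here: one pixel) per sample.
    """
    left = labels[values <= threshold]
    right = labels[values > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0                      # degenerate split: no information
    n = len(labels)
    children = (len(left) / n) * entropy(left) \
             + (len(right) / n) * entropy(right)
    return entropy(labels) - children
```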

Page 26: C4.5 for the image pyramid

• At each node, a feature (i.e., a pixel location) and a threshold value must be chosen as the split criterion → an exhaustive search over all pyramid levels is not feasible
• Only Φ0 is exhaustively examined → the location and threshold offering the greatest information gain are retained
• The search proceeds selectively to that location's children in Φ1, from there to the children of the best of those locations, and so on until the maximum resolution available is reached
• The grid level, location, and threshold with the highest information gain become the decision criterion for the node
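A sketch of this coarse-to-fine search; best_split_at is a hypothetical helper assumed to return (gain, threshold) for the pixel at (i, j) of grid Φ_level, e.g. by scanning candidate thresholds with information_gain above:

```python
def best_split(best_split_at, levels, base=32):
    """Coarse-to-fine split search over the grid pyramid.

    Exhaustive on Phi_0; at each finer level, only the 4 children
    of the best location from the level above are examined.
    """
    best_gain, best = -1.0, None
    for i in range(base):                       # exhaustive pass over Phi_0
        for j in range(base):
            gain, thr = best_split_at(0, i, j)
            if gain > best_gain:
                best_gain, best, cur = gain, (0, i, j, thr), (i, j)
    for level in range(1, levels):              # selective descent
        i, j = cur
        child_gain, child = -1.0, None
        for ci in (2 * i, 2 * i + 1):
            for cj in (2 * j, 2 * j + 1):
                gain, thr = best_split_at(level, ci, cj)
                if gain > child_gain:
                    child_gain, child = gain, (ci, cj, thr)
        ci, cj, thr = child
        cur = (ci, cj)                          # keep descending from best child
        if child_gain > best_gain:              # but remember the overall winner
            best_gain, best = child_gain, (level, ci, cj, thr)
    return best                                 # (level, i, j, threshold)
```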

Page 27: Boosting

• Single trees do not generalize well on handwritten word images
(1) The base classifier is initially trained on the (uniformly weighted) training data
(2) AdaBoost raises the weights of misclassified examples → forcing the base classifier to work harder on them
(3) After many rounds of boosting, the weighted vote classifies the training set perfectly and shows good generalization to unseen examples
(4) In practice, after a certain number of rounds (here: 200) the results stop improving significantly
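This recipe can be sketched with off-the-shelf tools; the following is not the authors' implementation (they used their own C4.5 variant over the pyramid), just the same boosted-tree idea in scikit-learn, with placeholder feature files:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# X: flattened normalized-grid pixels per word image; y: word-class labels.
# File names are placeholders for whatever feature extraction produced.
X = np.load("word_pixels.npy")
y = np.load("word_labels.npy")

clf = AdaBoostClassifier(
    # Fully grown tree as the strong base learner; this keyword is
    # `base_estimator` in scikit-learn versions before 1.2.
    estimator=DecisionTreeClassifier(),
    n_estimators=200,    # the slides report no significant gain past ~200 rounds
    algorithm="SAMME",   # multi-class AdaBoost variant
)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```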

Page 28: Supplementary Training Examples

• Problem: the paucity of training examples for many classes makes generalization difficult
• Zipf's law → few examples for most words
• 57% of the words appear only once in the test collection
• Solution: generate new training examples for low-frequency classes via stochastic distortion of the available examples
• This improves overall word classification accuracy

Page 29: Supplementary Training Examples (cont.)

• Sample from the original image using a grid of points whose positions have been perturbed from a uniform lattice
• Nearby points are perturbed by similar amounts
• The new image is a smooth distortion of the old one
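A sketch of such a distortion: draw random displacements, smooth them so that nearby lattice points move together, and resample the original at the perturbed positions. The parameter values are illustrative, not the paper's:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def distort(img, strength=2.0, smoothness=6.0, rng=None):
    """Return a smoothly distorted copy of a grayscale word image."""
    rng = rng or np.random.default_rng()
    h, w = img.shape
    # Random per-pixel displacements, smoothed so the warp is coherent.
    dy = gaussian_filter(rng.standard_normal((h, w)), smoothness) * strength
    dx = gaussian_filter(rng.standard_normal((h, w)), smoothness) * strength
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Resample the original image at the perturbed lattice positions.
    return map_coordinates(img, [yy + dy, xx + dx], order=1, mode="nearest")
```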

Pages 30–34: Classification Experiments

• Test collections: GW20 (previously used) and GW100 (non-overlapping with GW20)
  • written by multiple hands
  • manually segmented to extract images of individual words (4856 in GW20, 21324 in GW100)
  • all images labeled with their ASCII equivalent

GW20 experiments (19 pages for training, 1 for testing) compare:
• a single decision tree → standard C4.5, grown to completion and then pruned
• AdaBoost with a decision tree as base learner
• AdaBoost with a decision tree plus synthetic training data

No experiments were run with AdaBoost and a simple base classifier, because 50% accuracy cannot be achieved.

GW100: lower performance, due to more out-of-vocabulary words and lower image quality.
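The slides only state the 19/1 page split; assuming the held-out page is rotated over the corpus (not stated here), the evaluation loop would look roughly like this, with `pages` a hypothetical list of per-page (features, labels) pairs:

```python
import numpy as np

def leave_one_page_out(pages, make_classifier):
    """Average word-classification accuracy over per-page holdouts."""
    accs = []
    for k in range(len(pages)):
        X_test, y_test = pages[k]
        X_train = np.concatenate([p[0] for i, p in enumerate(pages) if i != k])
        y_train = np.concatenate([p[1] for i, p in enumerate(pages) if i != k])
        clf = make_classifier().fit(X_train, y_train)
        accs.append(np.mean(clf.predict(X_test) == y_test))
    return float(np.mean(accs))
```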

Page 35: Retrieval

• Language-modeling approach to retrieval
  • Ref: Ponte, J. and Croft, W.B. "A language modeling approach to Information Retrieval", SIGIR 1998, 275-281
• Uses the query-likelihood formulation, where documents are ranked according to P(Q|D)
• AdaBoost provides classifications rather than probabilities → only the most likely label for each word image is preserved
• One approach is to set the term probabilities equal to their frequencies in each recognized document → but many words may be misclassified
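For reference, query-likelihood scoring in its standard form; the smoothing method is not specified on these slides, so the sketch assumes Dirichlet smoothing against a background collection model:

```python
import math
from collections import Counter

def log_p_query(query_terms, doc_terms, collection_tf, collection_len, mu=1000.0):
    """log P(Q|D) with Dirichlet smoothing over the recognized document."""
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        # Floor at count 1 to avoid log(0); assumes query terms occur
        # somewhere in the collection.
        p_c = max(collection_tf.get(t, 0), 1) / collection_len
        score += math.log((tf[t] + mu * p_c) / (dl + mu))
    return score
```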

Page 36: Retrieval: Regularization Schema

• A regularization scheme based upon classification rank information
• Hypothesis: rank information may be more reliable than the actual probabilities → the top-ranked term is very important, the next few moderately important, and so on
• Probabilities are inferred from the rank-ordered output of the AdaBoost classification algorithm → the top n classes are ranked by score
• A probability is associated with each class by fitting a Zipfian distribution to the ranked classes
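A sketch of that rank-to-probability step, assuming a simple 1/rank Zipfian fit (the exponent and the cutoff n are not given on the slides):

```python
def zipf_probs(ranked_classes, n=5):
    """Turn the top-n ranked classes for one word position into a
    probability distribution that decays Zipf-style with rank."""
    top = ranked_classes[:n]
    weights = [1.0 / (r + 1) for r in range(len(top))]   # 1, 1/2, 1/3, ...
    z = sum(weights)
    return {c: w / z for c, w in zip(top, weights)}

# e.g. zipf_probs(["the", "she", "they"])
# -> {'the': 0.545..., 'she': 0.272..., 'they': 0.181...}
```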

Pages 37–38: Retrieval: Regularization Schema (cont.)

• Instead of one possible word per position, a document now contains a probability distribution at each position
• Tests run on Lemur with the query-likelihood ranking method
• Because of the limited size of GW20, line retrieval is performed:
  • relevant = a line containing all query terms
  • stop words are removed
• GW100 allows for full-page retrieval, with GW20 as training data

Page 39: Conclusions

• Learning algorithms are typically not designed to deal with training data that exhibit a highly skewed distribution of class frequencies
• The methodology described does not always work well, because the synthetic training data are not truly independent of the originals
• Performance is good on GW20
• The problem is more challenging on GW100 (larger data set, more noise) → using soft classification decisions can improve the results for shorter queries