Document Analysis Techniques for Automatic Electoral Document Processing: A Survey

J. Ignacio Toledo, Jordi Cucurull, Jordi Puiggalí, Alicia Fornes and Josep Llados

VoteID 2015

4 September 2015

Introduction Preprocessing OMR ICR HWR Conclusions

Contents

1 Introduction

2 Preprocessing
    Binarization
    Skew Correction

3 Optical Mark Recognition

4 Intelligent Character Recognition

5 Handwriting Recognition

6 Conclusions

Document Analysis Techniques for Automatic Electoral Document Processing: A Survey J. Ignacio Toledo et al.

Introduction

Why paper voting?

Legal reasons: In some countries, introducing electronic voting would require legal changes
Tradition: People are used to paper voting
User interface: The average citizen is an expert at using pen and paper
A first step: An automated process can be a first step towards improving voter privacy and the verifiability of paper-based elections by adapting techniques from electronic voting

Global Threshold Binarization

Otsu’s Method
Exhaustive search for a global threshold value that minimizes the intra-class variance of background and foreground
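The exhaustive search can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation; it assumes an 8-bit grayscale image, and the function name `otsu_threshold` is hypothetical.

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the threshold t (0..254) minimizing the weighted intra-class variance.

    Pixels with value <= t form one class, pixels > t the other.
    Assumes `gray` is a uint8 image.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256, dtype=np.float64)
    best_t, best_score = 0, np.inf
    for t in range(255):  # exhaustive search over all candidate thresholds
        lo, hi = slice(0, t + 1), slice(t + 1, 256)
        w0, w1 = hist[lo].sum(), hist[hi].sum()
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so this threshold is not usable
        m0 = (hist[lo] * levels[lo]).sum() / w0
        m1 = (hist[hi] * levels[hi]).sum() / w1
        # Sum of squared deviations inside each class (w0*var0 + w1*var1).
        within = (hist[lo] * (levels[lo] - m0) ** 2).sum() \
               + (hist[hi] * (levels[hi] - m1) ** 2).sum()
        if within < best_score:
            best_t, best_score = t, within
    return best_t
```

A binary image with dark ink as foreground would then be obtained as `gray <= otsu_threshold(gray)`.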

Local Threshold Binarization

Sauvola’s Method
Determine an optimal threshold value for each pixel, depending on its neighborhood
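As a rough sketch, the per-pixel threshold in Sauvola's method is usually written as T = m · (1 + k · (s / R − 1)), where m and s are the mean and standard deviation of the neighborhood. The window size, `k` and `r` below are assumed values, not settings taken from the slides, and `sauvola_binarize` is a hypothetical name.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sauvola_binarize(gray, window=15, k=0.2, r=128.0):
    """Return a boolean mask (True = dark foreground) using a local threshold.

    `window` must be odd; k and r are assumed, commonly used values.
    """
    pad = window // 2
    padded = np.pad(gray.astype(np.float64), pad, mode="reflect")
    # One (window x window) view centred on every pixel of the input image.
    windows = sliding_window_view(padded, (window, window))
    mean = windows.mean(axis=(-2, -1))
    std = windows.std(axis=(-2, -1))
    threshold = mean * (1.0 + k * (std / r - 1.0))
    return gray < threshold
```

This direct version is easy to read but slow on full-page scans; integral images are a common way to compute the local mean and deviation faster.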

Vertical Projection Based Skew Correction

Compute the vertical projection histogram of the image at different rotation angles
(e.g. from -5 to 5 degrees, at 0.25 degree resolution)

The image with the highest standard deviation of the vertical projection histogram has the right orientation
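A minimal sketch of this search follows, under two assumptions: the projection histogram counts foreground pixels per row, and instead of resampling the image at each angle, the coordinates of the foreground pixels are rotated, which gives the same row profile. The function name and its defaults are illustrative only.

```python
import numpy as np

def estimate_skew_by_projection(binary, max_angle=5.0, step=0.25):
    """Return the candidate angle (degrees) whose row profile has the highest std.

    A positive result means text lines descend to the right in image
    coordinates (y axis pointing down).
    """
    ys, xs = np.nonzero(binary)
    h, w = binary.shape
    diag = int(np.ceil(np.hypot(h, w)))
    best_angle, best_std = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step / 2, step):
        t = np.deg2rad(angle)
        # Row coordinate of every foreground pixel after undoing a skew of `angle`.
        rows = ys * np.cos(t) - xs * np.sin(t)
        # Fixed bins (1 pixel wide) so the std is comparable across angles.
        hist, _ = np.histogram(rows, bins=2 * diag, range=(-diag, diag))
        score = hist.std()
        if score > best_std:
            best_angle, best_std = float(angle), score
    return best_angle
```

When the lines are horizontal, ink concentrates in a few rows and the profile becomes very peaked, which is what the standard deviation measures.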

Hough Transform Based Skew Correction

The Hough Transform can detect lines in an image
We want to detect the skew angle of this ballot so we can correct it
The most common angle of the near-horizontal lines will be the skew angle

As a preprocessing step, remove all foreground pixels that do not have horizontal neighbours
(Mathematical morphology erosion operation with a horizontal rectangle as the structuring element)
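The erosion step could look like the sketch below. The structuring element is a 1 × `width` horizontal rectangle; the width of 15 pixels is an assumed value and `erode_horizontal` is a hypothetical helper name.

```python
import numpy as np

def erode_horizontal(binary, width=15):
    """Binary erosion with a 1 x width horizontal rectangle (width must be odd).

    A pixel survives only if all pixels within width // 2 columns on
    both sides of it, in the same row, are foreground.
    """
    img = binary.astype(bool)
    half = width // 2
    # Pad left and right with background so borders are handled consistently.
    padded = np.pad(img, ((0, 0), (half, half)), constant_values=False)
    out = np.ones_like(img, dtype=bool)
    for dx in range(width):
        out &= padded[:, dx : dx + img.shape[1]]
    return out
```

Vertical strokes and isolated specks disappear, while long horizontal rules survive (shortened by `width // 2` pixels at each end), so they dominate the Hough votes.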

Compute the Hough Transform

Each line is a point in parameter space, described by its distance to the origin ρ and its angle θ
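A small voting sketch, assuming the usual normal form ρ = x·cos θ + y·sin θ, in which θ is the angle of the line's normal and a perfectly horizontal line has θ = 90°. Restricting the candidate angles to a band around 90° is an assumption made here to keep the accumulator small.

```python
import numpy as np

def hough_accumulator(binary, thetas_deg):
    """Vote in (rho, theta) space for every foreground pixel.

    Returns (acc, diag): acc[i, j] counts pixels on the line with
    rho = i - diag and theta = thetas_deg[j].
    """
    ys, xs = np.nonzero(binary)
    h, w = binary.shape
    diag = int(np.ceil(np.hypot(h, w)))  # largest possible |rho|
    acc = np.zeros((2 * diag + 1, len(thetas_deg)), dtype=np.int64)
    for j, t in enumerate(np.deg2rad(thetas_deg)):
        rho = np.round(xs * np.cos(t) + ys * np.sin(t)).astype(int) + diag
        acc[:, j] = np.bincount(rho, minlength=2 * diag + 1)
    return acc, diag
```

With `thetas_deg = np.arange(85.0, 95.25, 0.25)`, each strong cell of the accumulator corresponds to one near-horizontal line of the ballot.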

Threshold and Weighted Average

Discard the lines supported by few pixels and perform a weighted average over the desired interval
In our example the detected skew angle is 1.5370 degrees
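One possible reading of this step is sketched below. It assumes the accumulator has one column per candidate θ (in degrees, with 90° meaning horizontal) and uses the vote counts as weights; the slides do not state the exact weighting, so `min_votes`, `interval` and the function name are assumptions.

```python
import numpy as np

def skew_from_accumulator(acc, thetas_deg, min_votes, interval=(85.0, 95.0)):
    """Weighted average of theta over strong accumulator cells, as a skew angle.

    Cells with fewer than `min_votes` votes are discarded; the remaining
    votes weight each theta inside `interval`. Returns theta - 90 degrees.
    """
    thetas_deg = np.asarray(thetas_deg, dtype=float)
    in_range = (thetas_deg >= interval[0]) & (thetas_deg <= interval[1])
    strong = np.where(acc >= min_votes, acc, 0)[:, in_range]
    weights = strong.sum(axis=0).astype(float)
    if weights.sum() == 0:
        raise ValueError("no line reaches min_votes inside the interval")
    mean_theta = np.average(thetas_deg[in_range], weights=weights)
    return float(mean_theta - 90.0)
```

Averaging over several strong lines makes the estimate finer than the angular resolution of the accumulator itself.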

Ballots

Ballots have voting targets that voters should fill in
We have a database with the expected voting target coordinates
We want to decide whether a voting target has been marked by the voter

Optical Mark Recognition

Image Difference Based OMR

We have an empty ballot as our model.
It is unskewed and correctly thresholded

We have voted ballot images.
They are unskewed, correctly thresholded and aligned with the ballot model

Compute the difference between the ballot we are examining and the model.
The output is noisy due to small misalignments.

Remove foreground pixels not surrounded (4-connectivity) by neighbours
(Mathematical morphology erosion with a '+' kernel)
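The difference and clean-up steps might be combined as in the following sketch. It assumes boolean images with True for ink, takes the "difference" to mean ink present in the voted ballot but absent from the model, and uses a hypothetical `box` tuple and `min_pixels` threshold to decide whether a target is marked.

```python
import numpy as np

def erode_plus(binary):
    """Erosion with a '+' kernel: keep a pixel only if it and its 4 neighbours are set."""
    p = np.pad(binary.astype(bool), 1, constant_values=False)
    return (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
            & p[1:-1, :-2] & p[1:-1, 2:])

def is_marked(ballot, model, box, min_pixels=20):
    """Decide whether the voting target inside `box` has been marked.

    box = (top, left, bottom, right), in pixel coordinates of the model.
    """
    diff = ballot.astype(bool) & ~model.astype(bool)  # ink added by the voter
    cleaned = erode_plus(diff)  # removes thin residue caused by misalignment
    top, left, bottom, right = box
    return int(cleaned[top:bottom, left:right].sum()) >= min_pixels
```

Misalignment residue is typically only one or two pixels thick, so it does not survive the erosion, while a filled target does.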

Style Based OMR

Most common OMR approaches rely on the number of foreground pixels to detect a mark
However, some commonly accepted marks have a recognizable shape
To detect those specific shapes, we can train specialized classifiers

Ballot Statements

Intelligent Character Recognition

A particular case of image classification

We want to find out the class a character image belongs to
(model matching techniques do not work due to high intra-class variability)
We need:

1 Features to describe the image that are robust to intra-class variability but discriminative

2 Classifiers that can deal with high-dimensional "noisy" data and are robust to outliers

Feature Representation

Histogram of Oriented Gradients (HOG)

1 Compute the gradients of the image
2 Divide it into small spatial regions, called "cells"
3 For each cell:
    Accumulate a histogram of gradient magnitudes using a fixed number of predefined bins for the gradient angle
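The three steps above can be sketched as follows. This is a simplified illustration: it uses unsigned orientations (0 to 180 degrees), hard bin assignment and no block normalization, all of which are assumptions; full HOG descriptors usually interpolate between bins and normalize over blocks of cells. The cell size and bin count are also assumed values.

```python
import numpy as np

def hog_cells(gray, cell=8, bins=9):
    """Per-cell histograms of gradient magnitude, binned by gradient angle.

    Returns an array of shape (rows // cell, cols // cell, bins).
    """
    img = gray.astype(np.float64)
    gy, gx = np.gradient(img)                       # step 1: image gradients
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0    # unsigned orientation
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    n_cy, n_cx = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((n_cy, n_cx, bins))
    for cy in range(n_cy):                          # step 2: split into cells
        for cx in range(n_cx):
            sl = (slice(cy * cell, (cy + 1) * cell),
                  slice(cx * cell, (cx + 1) * cell))
            # Step 3: accumulate magnitudes into the bin of each pixel's angle.
            hist[cy, cx] = np.bincount(bin_idx[sl].ravel(),
                                       weights=mag[sl].ravel(),
                                       minlength=bins)
    return hist
```

Flattening the result with `hog_cells(img).ravel()` gives a fixed-length feature vector that can be fed to a classifier.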

Classifier

Support Vector Machines (SVM)

The best decision boundary can be found by:

Minimizing the classification error
Maximizing the distance to the closest points of each class (the margin)

To separate n different classes, one option is to learn n one-versus-all classifiers
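To make the one-versus-all idea concrete, here is a small sketch of a linear SVM trained by subgradient descent on the regularized hinge loss, wrapped in a one-versus-all scheme. It is not the solver used in the surveyed work; the hyperparameters `lam`, `lr` and `epochs` and all function names are assumptions chosen for a toy problem.

```python
import numpy as np

def train_linear_svm(x, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * ||w||^2 + mean(max(0, 1 - y * (x @ w + b))), y in {-1, +1}."""
    n, d = x.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        active = y * (x @ w + b) < 1  # samples inside the margin or misclassified
        grad_w = lam * w - (y[active, None] * x[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def train_one_vs_all(x, labels, n_classes, **kwargs):
    """Train one binary SVM per class: that class (+1) against all others (-1)."""
    return [train_linear_svm(x, np.where(labels == c, 1.0, -1.0), **kwargs)
            for c in range(n_classes)]

def predict(models, x):
    """Pick the class whose classifier gives the highest decision value."""
    scores = np.stack([x @ w + b for w, b in models], axis=1)
    return scores.argmax(axis=1)
```

The hinge term penalizes classification errors and margin violations, while the `lam` term favours a large margin, matching the two criteria listed above.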

Deep Convolutional Neural Networks

Biologically inspired
Learn robust and discriminative features
Perform a non-linear classification
Prone to overfitting
Best error rate (0.23% on MNIST)

Connected Handwriting in electoral documents

Some different examples:
Amounts in text fields in ballot statements
Write-in fields in ballots
Observations fields in ballot statements

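Why is cursive handwriting recognition hard?

Sayre’s Paradox: a cursive word cannot be recognized without being segmented into characters, and cannot be segmented without being recognized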

Feature extraction

Sliding Window

Features: Marti Features
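As an illustration of the sliding-window idea, the sketch below moves a one-pixel-wide window across a binarized text line and computes a few geometric features per column. It covers only a subset of the kind of features usually attributed to Marti and Bunke (ink density, centre of gravity, second-order moment, contour positions, transitions); the exact feature set and normalization used in the surveyed systems are not given in the slides, so treat this as an assumption.

```python
import numpy as np

def column_features(binary):
    """Return one 6-dimensional feature vector per image column.

    Features: ink fraction, centre of gravity, second-order moment,
    upper contour, lower contour (all normalized by the height) and
    the number of ink/background transitions. Empty columns stay zero.
    """
    img = binary.astype(bool)
    h, w = img.shape
    rows = np.arange(h, dtype=float)
    feats = np.zeros((w, 6))
    for x in range(w):  # the window slides left to right, one column at a time
        col = img[:, x]
        n = col.sum()
        if n == 0:
            continue
        ink = rows[col]
        feats[x] = [
            n / h,
            ink.mean() / h,
            (ink ** 2).mean() / h ** 2,
            ink.min() / h,
            ink.max() / h,
            np.count_nonzero(col[1:] != col[:-1]),
        ]
    return feats
```

The resulting sequence of vectors, one per column, is what a sequence model such as an HMM or a recurrent network consumes.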

Hidden Markov Models

At each timestep, an observation is generated by an unknown (hidden) state

State transition matrix
Emission probability associated with each state
Trained using the Baum-Welch algorithm
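The roles of the transition matrix and the emission probabilities can be seen in the forward algorithm, which computes the likelihood of an observation sequence and is one of the building blocks of Baum-Welch training. The sketch assumes discrete observation symbols for brevity; handwriting systems typically use continuous feature vectors, and Baum-Welch itself is not shown.

```python
import numpy as np

def forward_likelihood(start, trans, emit, observations):
    """Probability of an observation sequence under a discrete HMM.

    start[i]    : probability of starting in state i
    trans[i, j] : probability of moving from state i to state j
    emit[i, o]  : probability that state i emits symbol o
    """
    alpha = start * emit[:, observations[0]]
    for o in observations[1:]:
        # Propagate through the transitions, then weight by the emission.
        alpha = (alpha @ trans) * emit[:, o]
    return float(alpha.sum())
```

For long sequences the `alpha` values underflow, so practical implementations rescale them at each step or work with log probabilities.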

Bidirectional Long Short Term Memory Network

Sequence Classifier: BLSTM+CTC

Open-vocabulary handwriting recognition character error rate: 18%
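Two small pieces help in reading this result: how a CTC output is turned into a character sequence, and how the character error rate is computed. The sketch uses best-path (greedy) decoding, which is only one of several CTC decoding strategies, and defines the error rate as edit distance divided by reference length; the blank index of 0 and the function names are assumptions.

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse repeated labels, then drop blanks (best-path CTC decoding).

    `frame_labels` holds the most likely label index at each timestep.
    """
    decoded, previous = [], None
    for label in frame_labels:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded

def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        cur = [i]
        for j, item_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                        # deletion
                           cur[j - 1] + 1,                     # insertion
                           prev[j - 1] + (item_a != item_b)))  # substitution
        prev = cur
    return prev[-1]

def character_error_rate(hypothesis, reference):
    """Edit distance normalized by the length of the reference transcription."""
    return edit_distance(hypothesis, reference) / len(reference)
```

The blank label is what lets CTC emit the same character twice in a row: the repeated labels must be separated by at least one blank frame.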

Conclusions and Future Work

Preprocessing is important
OMR can do much more than just counting dark pixels
ICR error rates are at human level
Connected handwriting recognition can only be used in constrained scenarios
Future work: test in realistic election scenarios

Questions?
