Recognition of Malayalam Documents, IIIT Hyderabad

Input: Document image

Noise cleaning and Binarization

Skew Correction

Text & Graphics Segmentation

Line & Word Segmentation

Parsing (CC Analysis)

Feature Extraction

Classification Converter & Post-processing

Document Reconstruction

Output: Text / Unicode

Figure (diagram not recoverable; box labels only): Input; Convert to symbols; Reorder symbols; Render the word image (R); CC Analysis, labels the CCs; DP-based matching to align R and W; Feature Extraction. Files shown: MAP FILE, RULES FILE.

CC label sequences shown in the figure: 33 51 122 52 113 107 and 37 55 107 57 37 58 43 63 14
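The DP-based matching step named in the diagram aligns two sequences, R and W. The slides do not state the cost function, so the following is only a minimal sketch that assumes a plain unit-cost edit-distance alignment over label sequences; the real system may well compare component features instead. The function name `align_labels` and the toy sequences are illustrative, not taken from the slides.

```python
def align_labels(r, w):
    """Align two label sequences with a unit-cost edit distance (assumed cost model).

    Returns (distance, pairs); each pair is (r_item, w_item), with None for a gap.
    """
    n, m = len(r), len(w)
    # cost[i][j] = edit distance between r[:i] and w[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if r[i - 1] == w[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,  # match or substitute
                             cost[i - 1][j] + 1,        # item of r left unmatched
                             cost[i][j - 1] + 1)        # item of w left unmatched

    # Backtrack from the bottom-right cell to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if r[i - 1] == w[j - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                pairs.append((r[i - 1], w[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((r[i - 1], None))
            i -= 1
        else:
            pairs.append((None, w[j - 1]))
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs


# Toy example: the second sequence is missing one component.
distance, pairs = align_labels([1, 2, 3, 4], [1, 2, 4])
```

In this toy case the alignment pairs the extra component of the first sequence with a gap, which is the kind of mismatch an annotation step would need to flag.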

                                   Classifiers
Feature  Dim     MLP     KNN     ANN   SVM-1   SVM-2      NB     DTC
C.M       20   12.04    4.16    5.86   10.04    9.19   11.93    5.57
DFT       16    8.35    8.96    9.35    7.88    7.86   15.33   13.85
DCT       16    5.43    5.11    5.92    5.25    5.24    8.96    7.89
ZM        47    1.30    1.98    2.34    1.24    1.23    3.99    8.04
PCA      350    1.04    1.14    2.39    0.37    0.35    4.83    5.97
LDA      350    0.55    0.52    1.04    0.35    0.34    3.20    4.77
RP       350    0.33    0.50    0.74    0.34    0.34    3.12    8.04
DT       400    1.94    1.27    1.98    1.84    1.84    4.28    2.20
IMG      400    0.32    0.56    0.78    0.32    0.31    1.22    2.45

Error rate using CNN: 0.93

Error rates on Malayalam dataset.

Error rates of SVM-2 classifiers with varying number of features.

Accuracy of different classifiers vs. number of classes. Feature used: LDA.

Images from dataset

Feature    D-1     D-2     D-3   Blobs    Cuts   Shear
C.M       9.45    9.46   10.97   16.28   12.33   30.07
DFT       7.89    7.93    7.98   26.70    8.73   18.90
DCT       5.71    5.72    6.07   19.80    7.93   16.46
ZM        1.96    1.98    2.10    8.41    4.35   17.75
PCA       0.39    0.39    0.40    2.17    0.64    8.59
LDA       0.30    0.31    0.32    2.01    0.61    7.32
RP        0.48    0.67    1.04    3.61    0.71    6.75
DT        1.75    1.98    2.21   10.33    5.07   12.34
IMG       0.32    0.33    0.33    2.78    0.66    6.84

Accuracies of SVM-2 classifier when trained with 4 fonts and tested on the 5th font. S1: dataset without degradation; S2: dataset with degradation.

           Telugu (350 class)    English (72 class)
Features    20X20     40X40       20X20     40X40
C.M         20.78     12.32        7.25      6.48
DFT          8.45      5.48        2.04      1.12
DCT          9.67      2.71        2.14      1.04
ZM          15.71      6.71        5.37      3.31
PCA          4.62      2.93        0.86      0.46
LDA          2.56      1.67        0.29      0.23
RP           2.49      1.66        0.28      0.23
DT           3.48      3.17        0.98      0.87
IMG          3.18      2.84        0.28      0.23

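Figure (layout recovered from node labels): pairwise nodes (i,j) arranged in levels, with input x at the top and class labels at the bottom.

Level 1: 1,5
Level 2: 2,5  1,4
Level 3: 3,5  2,4  1,3
Level 4: 4,5  3,4  2,3  1,2
Class labels: 5 4 1 2 3

Sample x from class 4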
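The triangular arrangement of (i,j) nodes in this figure is consistent with a decision DAG over pairwise classifiers, where each node rules out one of its two classes and a sample reaches a single class after (number of classes - 1) evaluations. The slides do not spell this out, so the following is a sketch under that assumption. `pairwise_decide` and `toy_decide` are hypothetical stand-ins for trained binary classifiers.

```python
def ddag_classify(x, classes, pairwise_decide):
    """Classify x by walking a decision DAG of pairwise classifiers.

    pairwise_decide(x, a, b) is a hypothetical callable that returns a or b,
    whichever class it prefers for x. Returns (label, path of visited nodes).
    """
    remaining = list(classes)
    path = []
    while len(remaining) > 1:
        a, b = remaining[0], remaining[-1]  # current node compares first vs last
        path.append((a, b))
        if pairwise_decide(x, a, b) == a:
            remaining.pop()       # b is ruled out
        else:
            remaining.pop(0)      # a is ruled out
    return remaining[0], path


def toy_decide(x, a, b):
    # Toy pairwise rule: prefer the class whose index is closer to x.
    return a if abs(x - a) <= abs(x - b) else b


label, path = ddag_classify(4.0, [1, 2, 3, 4, 5], toy_decide)
```

With five classes the walk always visits four nodes, starting at (1,5), regardless of which class the sample belongs to.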

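\[
\text{Character Edit Distance (CER)} = \frac{\text{CharEditDistance}(C, O)}{|C|}
\]

\[
\text{Symbol Error Rate} = \frac{\text{No. of Misclassified and Recognizable Symbols}}{\text{Total No. of Recognizable Symbols}}
\]

\[
\text{Unicode Error Rate} = \frac{\text{No. of Misclassified Unicode}}{\text{Total No. of Unicode}}
\]

\[
\text{Word level Accuracy} = \frac{\text{No. of Correct Words}}{\text{Total No. of Words}}
\]

\[
\text{Word level Accuracy} = \frac{\text{No. of Correct or Recognizable Words}}{\text{Total No. of Recognizable Words}}
\]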
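As an illustration of how the character edit distance and word-level accuracy above could be computed, here is a minimal sketch. It assumes C is the correct (ground-truth) text and O is the OCR output, uses the standard Levenshtein distance with unit costs, and assumes the two word lists are already aligned one-to-one; none of these details is confirmed by the slides, and the function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        curr = [i]
        for j, item_b in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                        # deletion
                            curr[j - 1] + 1,                    # insertion
                            prev[j - 1] + (item_a != item_b)))  # substitution
        prev = curr
    return prev[-1]


def character_error_rate(correct, output):
    """CharEditDistance(C, O) / |C|, with C assumed to be the correct text."""
    if not correct:
        raise ValueError("correct text must be non-empty")
    return edit_distance(correct, output) / len(correct)


def word_accuracy(correct_words, output_words):
    """Fraction of correct words, assuming the two lists are aligned one-to-one."""
    if not correct_words or len(correct_words) != len(output_words):
        raise ValueError("word lists must be non-empty and of equal length")
    hits = sum(c == o for c, o in zip(correct_words, output_words))
    return hits / len(correct_words)


cer = character_error_rate("abcd", "abxd")            # one substitution in four characters
acc = word_accuracy(["a", "b", "c", "d"], ["a", "x", "c", "d"])
```

Because the distance is computed over generic sequences, the same function applies to lists of symbols or Unicode code points as well as to character strings.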

Sarada

Sanjayan

0.85%

Thiruttu

0.85%
