Datech2014 - Session 4 - Construction of Text Digitization System for Nôm Historical Texts

Construction of a Text Digitization System for Nôm Historical Documents

Truyen Van PHAN and Masaki NAKAGAWA

Tokyo University of Agriculture & Technology (TUAT), Japan

Construction of a Text Digitization System for Nôm Historical Documents May 20th, 2014

Outline

Introduction

What is Nôm?

Current situation and our motivation

What do we aim at?

Page Layout Analysis

Offline Recognition System

Generating Artificial Character Patterns

Building and Improving Large Set Character Recognition

Experiments and Results

GUI of Digitization System

Conclusion

Future Work



What is Nôm?

Nôm character

• 10th century ～ 20th century

• Based on Chinese characters

Vietnamese alphabet (Quốc Ngữ)

• 20th century ～ present

• Based on the Roman alphabet

2 categories of Nôm

• Borrowed character: Hán (classical Chinese)

• Invented character: native Nôm

[Figure (src: wikipedia): the sentence "My mother eats vegetarian food at the temple every Sunday" written in Quốc Ngữ and in Nôm, with borrowed and invented characters marked.]


Current situation and our motivation

Current situation of Nôm

Completely replaced by Quốc Ngữ.

Fewer than 100 scholars worldwide can read Nôm.

More than 90% of Nôm documents have not been translated into Quốc Ngữ.

Digitization Project of the Hán Nôm Special Collection

About 5,200 documents have been scanned.

Online access is provided to 1,907 documents with 133,495 pages.

http://nom.nlv.gov.vn/


What do we aim at?

Construct a digitization system that enables even people who are not proficient in Nôm to build a digital text library of Nôm documents.

Provide a set of document image processing methods: preprocessing, binarization, character segmentation.

Provide a character recognition system.

Provide a user interface that enables an operator to verify the results.

Lay a foundation for future research and development of digitization systems.


Overview of Our System

[Figure: block diagram of the system. Labels: Pattern Collection, Page Layout Analysis, Character Recognition, Document Digitization, Document Images, Preprocessing, Segmentation, Clustering, Grouping, Labeling, Artificial Pattern, Pattern, Normalization, Normalized Pattern, Feature Extraction, Training, Dictionary, Classification, OCR, Document Texts.]


Page Layout Analysis (1/2)

Preprocessing

Red Comment Removal

Black Margin Removal

Line and Noise Removal

Binarization

1 local thresholding method (Su’s)

16 global thresholding methods (Otsu’s, SIS,…)

Character Segmentation

Top-down method: RXY cut

Bottom-up method: Voronoi

Combined method: RXY cut + Voronoi
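As an illustration of the global thresholding step listed above, here is a minimal sketch of Otsu's method in plain Python. It is not the system's implementation: the function names `otsu_threshold` and `binarize`, the 0..255 gray range, and the convention that dark pixels are ink are assumptions made for this example.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0..255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0     # number of pixels with level <= t
    sum0 = 0   # sum of their gray levels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(image):
    """image: rows of 0..255 ints. Returns rows of 1 (ink, dark) / 0 (background)."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]
```

A single global threshold like this tends to struggle on unevenly stained pages, which is one reason a local method is listed alongside the global ones.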



Page Layout Analysis (2/2)

[Figure: processing flow from Document Image to Character Images through the stages Red Comment Removal, Black Margin Removal, Line and Noise Removal, Binarization, and Segmentation.]
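The top-down segmentation idea (recursive X-Y cut) can be sketched as follows. This is a simplified illustration, not the authors' RXY cut: the name `xy_cut`, the `min_gap` parameter, and the rule of cutting at the widest empty gap (columns first on ties, on the assumption that the text runs in vertical columns) are all choices made for this example.

```python
def xy_cut(img, min_gap=1):
    """Recursively split a binary image (rows of 0/1, 1 = ink) at empty rows or
    columns. Returns leaf boxes (top, bottom, left, right), bottom/right exclusive."""
    boxes = []

    def profile(t, b, l, r, axis):
        if axis == 0:  # ink count per row
            return [sum(img[y][l:r]) for y in range(t, b)]
        return [sum(img[y][x] for y in range(t, b)) for x in range(l, r)]

    def trim(prof):
        """Range [lo, hi) left after dropping empty ends, or None if all empty."""
        inked = [i for i, v in enumerate(prof) if v > 0]
        return (inked[0], inked[-1] + 1) if inked else None

    def widest_gap(prof):
        """Longest run of zeros in a trimmed profile as (start, length)."""
        best, start = (0, 0), None
        for i, v in enumerate(prof):
            if v == 0:
                if start is None:
                    start = i
                if i - start + 1 > best[1]:
                    best = (start, i - start + 1)
            else:
                start = None
        return best

    def split(t, b, l, r):
        rows = trim(profile(t, b, l, r, 0))
        if rows is None:
            return  # region contains no ink
        t, b = t + rows[0], t + rows[1]
        cols = trim(profile(t, b, l, r, 1))
        l, r = l + cols[0], l + cols[1]
        row_gap = widest_gap(profile(t, b, l, r, 0))
        col_gap = widest_gap(profile(t, b, l, r, 1))
        if max(row_gap[1], col_gap[1]) < min_gap:
            boxes.append((t, b, l, r))      # no valid cut: leaf box
        elif col_gap[1] >= row_gap[1]:      # vertical cut between columns
            x = l + col_gap[0]
            split(t, b, l, x)
            split(t, b, x + col_gap[1], r)
        else:                               # horizontal cut between rows
            y = t + row_gap[0]
            split(t, y, l, r)
            split(y + row_gap[1], b, l, r)

    if img and img[0]:
        split(0, len(img), 0, len(img[0]))
    return boxes
```

A pure projection-based cut like this fails when characters touch or overlap, which is where a bottom-up method or a combination of the two becomes useful.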


Offline Recognition System

Generate a database of artificial character patterns.

No Nôm character dataset with ground truth is available.

Build an offline recognition engine.

Use the MQDF2 recognition method.

Improve large-scale character recognition.

Use GLVQ and kd-tree in coarse classification.


Generating Artificial Patterns

From 27 CJKV fonts of Nôm, Japanese, and Chinese.

Use distortion models (linear: rotation, shear, shrink, …; and non-linear).

Generate 2 datasets:

Common 7,601 characters for segmented character recognition.

All 32,733 characters in Nôm fonts for recognized result verification.
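The linear distortion models named above (rotation, shear, shrink) can all be expressed as one affine transform of the glyph bitmap. The following is a minimal sketch under stated assumptions: the function name `affine_distort`, its parameters, the square 0/1 bitmap format, and nearest-neighbor sampling are hypothetical choices for illustration, not the generator actually used.

```python
import math


def affine_distort(glyph, angle_deg=0.0, shear=0.0, scale_x=1.0, scale_y=1.0):
    """Return a distorted copy of a square binary glyph (rows of 0/1).

    Forward model about the image center: p' = R(angle) * Sh(shear) * S(scale) * p.
    Each output pixel is filled by inverse-mapping to the source (nearest neighbor).
    """
    n = len(glyph)
    c = (n - 1) / 2.0
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    # Forward matrix M = R * Sh * S, with Sh = [[1, shear], [0, 1]].
    m00 = cos_a * scale_x
    m01 = (cos_a * shear - sin_a) * scale_y
    m10 = sin_a * scale_x
    m11 = (sin_a * shear + cos_a) * scale_y
    det = m00 * m11 - m01 * m10  # equals scale_x * scale_y, must be non-zero
    out = [[0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            dx, dy = x - c, y - c
            # Apply the inverse of M to the output offset to find the source pixel.
            src_x = (m11 * dx - m01 * dy) / det + c
            src_y = (-m10 * dx + m00 * dy) / det + c
            ix, iy = int(round(src_x)), int(round(src_y))
            if 0 <= ix < n and 0 <= iy < n:
                out[y][x] = glyph[iy][ix]
    return out
```

Sampling `angle_deg`, `shear`, and the scales from small random ranges for each font glyph would yield many pattern variants per character; the ranges themselves are not given in the slides.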

[Figure: example patterns, labeled "Nôm character" and "Human".]


Building Offline Recognition Engine

Normalization: Line Density Projection Interpolation (LDPI) → 64 x 64 image

Feature Extraction: Normalization-Cooperated Gradient Feature (NCGF) → 512 features

Feature Reduction: Fisher Linear Discriminant Analysis (FLDA) → 100 features

Coarse-to-fine Classification:

k-NN (k candidates) → MQDF2
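The coarse-to-fine step can be sketched as below: a cheap Euclidean search over class mean vectors selects k candidate classes, and only those are scored with the more expensive discriminant. The MQDF2 score follows its commonly cited form (principal axes with their eigenvalues, a constant `delta` replacing the minor eigenvalues). The dictionary layout of `model` and all function names are assumptions for this example, and the parameters are taken as already trained.

```python
import math


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def coarse_candidates(x, means, k):
    """Indices of the k classes whose mean vectors are nearest to x (Euclidean)."""
    order = sorted(range(len(means)), key=lambda i: sq_dist(x, means[i]))
    return order[:k]


def mqdf2(x, model):
    """MQDF2 distance of x to one class (smaller is better).

    model: dict with 'mean', 'eigvecs' (m principal axes), 'eigvals' (their
    variances) and 'delta' (constant replacing the minor eigenvalues).
    """
    diff = [p - q for p, q in zip(x, model["mean"])]
    d, m = len(x), len(model["eigvecs"])
    residual = sum(v * v for v in diff)  # squared norm outside the principal axes
    score = 0.0
    for vec, lam in zip(model["eigvecs"], model["eigvals"]):
        proj = sum(a * b for a, b in zip(vec, diff))
        score += proj * proj / lam + math.log(lam)
        residual -= proj * proj
    return score + residual / model["delta"] + (d - m) * math.log(model["delta"])


def classify(x, models, k):
    """Coarse k-NN over class means, then fine MQDF2 scoring of the candidates."""
    means = [mdl["mean"] for mdl in models]
    cands = coarse_candidates(x, means, k)
    return min(cands, key=lambda i: mqdf2(x, models[i]))
```

The fine stage can only pick a class that survived the coarse stage, so a small k trades accuracy for speed.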



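Improving Large Scale Character Recognition

Improving coarse classification:

Mean vector → learned prototype by GLVQ: accuracy

Ordered structure → space-partitioning structure of kd-tree: speed

Nearest-prototype search over the prototypes $w_1, w_2, \ldots, w_C$:

$g(x) = \min_{1 \le i \le C} \{\lVert x - w_i \rVert\}$, where $\lVert x - w_i \rVert$ is the Euclidean distance.

Ordered candidate list $c_1, c_2, \ldots, c_i, c_{i+1}, \ldots, c_k$: a prototype $w_j$ is placed between $c_i$ and $c_{i+1}$ when $d(x, c_i) < d(x, w_j) < d(x, c_{i+1})$.

Prototype learning: each prototype starts from the mean vector of its category and is updated by $w_i \leftarrow w_i \pm \alpha(t)\,(x - w_i)$.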

Generalized Learning Vector Quantization

src: wikipedia
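To make the prototype learning concrete, here is a minimal sketch of one GLVQ update in the usual formulation: the nearest prototype of the correct class is pulled toward the sample and the nearest prototype of any other class is pushed away, weighted by the relative distance measure mu = (d1 - d2) / (d1 + d2). The sigmoid loss, the learning rate `lr` (with constant factors absorbed into it), and the function names are assumptions; the slides do not give these details.

```python
import math


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def glvq_step(x, label, prototypes, proto_labels, lr=0.1):
    """Apply one GLVQ update in place and return mu (negative = correctly classified).

    Assumes at least one prototype with the sample's label and one with another
    label, and that x does not coincide with both nearest prototypes.
    """
    same = [i for i, l in enumerate(proto_labels) if l == label]
    other = [i for i, l in enumerate(proto_labels) if l != label]
    i1 = min(same, key=lambda i: sq_dist(x, prototypes[i]))
    i2 = min(other, key=lambda i: sq_dist(x, prototypes[i]))
    d1 = sq_dist(x, prototypes[i1])   # distance to nearest correct prototype
    d2 = sq_dist(x, prototypes[i2])   # distance to nearest wrong prototype
    mu = (d1 - d2) / (d1 + d2)
    f = 1.0 / (1.0 + math.exp(-mu))   # sigmoid loss f(mu)
    g = f * (1.0 - f)                 # its derivative f'(mu)
    s = (d1 + d2) ** 2
    w1, w2 = prototypes[i1], prototypes[i2]
    # Pull the correct prototype toward x, push the wrong one away.
    prototypes[i1] = [w + lr * g * (d2 / s) * (xi - w) for w, xi in zip(w1, x)]
    prototypes[i2] = [w - lr * g * (d1 / s) * (xi - w) for w, xi in zip(w2, x)]
    return mu
```

Starting every prototype at its class mean and looping `glvq_step` over the training patterns for several epochs corresponds to the "mean vector → learned prototype" replacement described above.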


Experiments

Datasets

TUAT HANDS Japanese character pattern databases (Nakayosi and Kuchibue)

J1_d: 2,965 JIS level-1 Kanji characters

J1&2_d: 6,355 JIS level-1 and level-2 Kanji characters

Artificial Nôm character pattern databases

NomS_d: 7,601 characters

NomL_d: 32,733 characters

Evaluation

Effects of GLVQ and/or kd-tree in large-scale character recognition.


Experimental Results (1/3)

Comparison of accuracy with and without prototype learning by GLVQ on J1_d and J1&2_d datasets.

Recognition rate (%) by candidate number k:

k            10     20     30     40     50     60     70     80     90     100
J1_d         97.20  97.29  97.32  97.34  97.35  97.35  97.35  97.36  97.36  97.36
J1_d_GLVQ    97.36  97.36  97.37  97.37  97.37  97.37  97.37  97.37  97.37  97.37
J1&2_d       96.63  96.77  96.82  96.84  96.85  96.86  96.86  96.87  96.87  96.87
J1&2_d_GLVQ  96.86  96.88  96.88  96.88  96.88  96.88  96.88  96.88  96.88  96.88

k-NN rate (top 1): J1_d 93.97%, J1_d_GLVQ 95.96%, J1&2_d 93.11%, J1&2_d_GLVQ 95.46%


Experimental Results (2/3)

Comparison of accuracy and speed with and without kd-tree on J1&2_d dataset.

[Figure: recognition rate (%) and speed (ms/char) against bound error ε (axis from 0.75 to 3.00) for k=10 and k=50. Plotted values, in order of increasing ε:]

Speed10 (ms/char)  0.190  0.153  0.124  0.101  0.079  0.068  0.058
Speed50 (ms/char)  0.284  0.238  0.188  0.154  0.130  0.113  0.097
Rate10 (%)         93.11  93.09  93.05  92.95  92.79  92.54  92.18
Rate50 (%)         93.11  93.11  93.09  93.05  92.98  92.86  92.69

Without kd-tree: 0.229 ms/char (k=10), 0.308 ms/char (k=50).

Marked points: k=10: rate -0.06, speed -0.105 ms/char (54% of the time without kd-tree); k=50: rate -0.06, speed -0.154 ms/char (50%).
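The trade-off controlled by the bound error ε can be illustrated with a small kd-tree written in plain Python. This is a sketch, not the engine's data structure: `build`, `knn`, and the node layout are hypothetical, and `eps` plays the role of ε in the standard approximate search rule, where a far branch is skipped when the splitting plane, scaled by (1 + eps), is no closer than the current k-th best distance.

```python
import heapq


def build(points, idxs=None, depth=0):
    """Build a kd-tree over points (equal-length tuples). A node is
    (index, axis, left, right); None is an empty subtree."""
    if idxs is None:
        idxs = list(range(len(points)))
    if not idxs:
        return None
    axis = depth % len(points[0])
    idxs = sorted(idxs, key=lambda i: points[i][axis])
    mid = len(idxs) // 2
    return (idxs[mid], axis,
            build(points, idxs[:mid], depth + 1),
            build(points, idxs[mid + 1:], depth + 1))


def knn(tree, points, q, k, eps=0.0):
    """Return (indices of about the k nearest points to q, nearest first,
    number of visited nodes). eps = 0 gives the exact result."""
    heap = []       # max-heap via negated squared distances: (-d2, index)
    visited = [0]

    def visit(node):
        if node is None:
            return
        idx, axis, left, right = node
        visited[0] += 1
        p = points[idx]
        d2 = sum((a - b) ** 2 for a, b in zip(p, q))
        if len(heap) < k:
            heapq.heappush(heap, (-d2, idx))
        elif d2 < -heap[0][0]:
            heapq.heapreplace(heap, (-d2, idx))
        diff = q[axis] - p[axis]
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near)
        # Descend into the far side only if the splitting plane is close enough.
        if len(heap) < k or (diff * (1.0 + eps)) ** 2 < -heap[0][0]:
            visit(far)

    visit(tree)
    return [i for _, i in sorted(heap, reverse=True)], visited[0]
```

With this rule the k-th returned distance is at most (1 + eps) times the true k-th nearest distance, and a larger `eps` usually means fewer visited nodes, which mirrors the falling speed and rate curves as ε grows.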



Summary

Settings: k=10, ε=2.25

Dataset  Categories  Dictionary  Evaluation       Original  With    With     With GLVQ    Change vs.
         No.         size (Mb)                    engine    GLVQ    kd-tree  and kd-tree  original
J1_d     2,965       6.5         Accuracy (%)     97.20     97.36   97.08    97.25        +0.05
                                 Speed (ms/char)  0.114     0.126   0.074    0.085        -25%
J1&2_d   6,355       13.9        Accuracy (%)     96.63     96.86   96.52    96.75        +0.12
                                 Speed (ms/char)  0.233     0.258   0.132    0.154        -34%
NomS_d   7,601       16.7        Accuracy (%)     98.58     98.61   98.58    98.61        +0.03
                                 Speed (ms/char)  0.258     0.275   0.134    0.137        -47%
NomL_d   32,733      71.7        Accuracy (%)     96.09     96.05   96.07    96.04        -0.05
                                 Speed (ms/char)  1.212     1.257   0.808    0.666        -45%

With GLVQ and kd-tree, the computational time is reduced while the recognition rate stays nearly the same.
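The final value in each row of the summary can be reproduced from the "Original engine" and "With GLVQ and kd-tree" figures: an absolute difference for accuracy and a relative change for speed. The short check below uses the numbers from the table; the dictionary layout and the function name `changes` are just for this example.

```python
# (original, with GLVQ and kd-tree) pairs taken from the summary table.
summary = {
    "J1_d":   {"accuracy": (97.20, 97.25), "speed": (0.114, 0.085)},
    "J1&2_d": {"accuracy": (96.63, 96.75), "speed": (0.233, 0.154)},
    "NomS_d": {"accuracy": (98.58, 98.61), "speed": (0.258, 0.137)},
    "NomL_d": {"accuracy": (96.09, 96.04), "speed": (1.212, 0.666)},
}


def changes(row):
    """Return (accuracy change in points, speed change in percent)."""
    acc_old, acc_new = row["accuracy"]
    spd_old, spd_new = row["speed"]
    acc_delta = round(acc_new - acc_old, 2)
    speed_pct = round((spd_new - spd_old) / spd_old * 100)
    return acc_delta, speed_pct


for name, row in summary.items():
    acc_delta, speed_pct = changes(row)
    print(f"{name}: accuracy {acc_delta:+.2f} points, time {speed_pct:+d}%")
```

The time per character drops by roughly a quarter to a half depending on the dataset, while accuracy moves by at most 0.12 points.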


GUI of Digitization System



Conclusion

Implemented a set of image processing methods (preprocessing, binarization, character segmentation).

Built a high-accuracy character recognition engine.

Obtained a recognition rate of about 97%.

Reduced computational time by about 1/3 while keeping nearly the same rate.

Developed a GUI for Nôm document digitization that enables an operator to verify the processed results of binarization, segmentation, and recognition.


Future Work

Improve page layout analysis to handle the many layouts of Nôm documents.

Improve segmentation

Line segmentation

Recognition-based character segmentation

Improve character recognition

Constrain the output with a word lexicon (use a Nôm dictionary).

Introduce the work and call attention to it.

Call for collaborative research.