7/8/2009

1

Arabic

Character Recognition

Professor Mohammed Zeki Khedher

Jordan University

[email protected]

19th May 2003

Contents

• Types of Documents

• Signature verification

• Language Classification

• On Line and Off line OCR

• Latin Character Recognition

• Printed and Handwritten

• Preprocessing:

– Line segmentation

– Word segmentation

– Thinning

• Segmentation

• Feature Extraction

• Neural Networks

Arabic Character Recognition using Approximate Stroke Sequence

• Arabic Optical Character Recognition

• Previous Work in Arabic Character Recognition

• Main characteristics of Arabic Writing

• Approximate Stroke Sequence String Matching

• Conclusions

Types of Documents

A page containing text, image and a table

Schema for Document Image Analysis

Level of processing (low to high), by document type (mostly-text, mostly-graphics)

• Pixels

– Mostly-text, preprocessing: representation; noise reduction; binarization; skew detection; zoning; character segmentation; script, language & font recognition; character scaling

– Mostly-graphics, preprocessing: representation; noise reduction; binarization; thinning; vectorization

• Primitives

– Mostly-text, glyph recognition: connected components; strokes; characters, diacritics, punctuation; words

– Mostly-graphics, primitive recognition: straight-line & curve segments; junctions and nodes; loops; characters

• Structures

– Mostly-text, text recognition: word segmentation; text line reconstruction; table analysis; morphological context; lexical context; syntax, semantics

– Mostly-graphics, structures recognition: text field; legends; label attribution; dimensions; graphics symbols; aerial and texture features; beautification (constraints)

• Documents

– Mostly-text, page layout analysis: text versus non-text; physical components analysis; logical components analysis; functional components (content tags); compression

– Mostly-graphics, interpretation: components recognition; connectivity analysis; CAD/GIS layer separation; database attribute extraction; compression

• Information retrieval (mostly-text): document classification retrieval; search; security, authentication, privacy

• Database, CAD, GIS interface (mostly-graphics): validation; search; update

A check (figure labels: codeline, amount and account fields, signature, postcode)

Signature verification

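Block diagram: the signature from the check passes through FEATURE EXTRACTION to give a feature vector; the reference signature from the REFERENCE DATA BASE passes through FEATURE EXTRACTION to give a second feature vector; a DISTANCE MEASURE compares the two feature vectors and outputs a distance.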
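The comparison stage of such a scheme might look like the minimal sketch below. The slide does not say which features or which distance measure are used, so the Euclidean distance, the acceptance threshold and the three-element feature vectors are all assumptions made for illustration.

```python
import math


def euclidean_distance(u, v):
    """Distance measure between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def verify_signature(candidate, references, threshold):
    """Accept the candidate if it lies within `threshold` of any reference.

    Returns (accepted, smallest_distance). The threshold rule is an
    assumption; the slide only shows that a distance is computed.
    """
    best = min(euclidean_distance(candidate, ref) for ref in references)
    return best <= threshold, best


# Hypothetical feature vectors standing in for the reference data base.
references = [[1.0, 0.5, 0.2], [1.1, 0.4, 0.25]]
accepted, distance = verify_signature([1.0, 0.5, 0.2], references, threshold=0.1)
```

In practice the threshold would be tuned per writer from several genuine reference signatures.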

Language Classification

Classical OCR Systems

Document image → Format Analysis → Character group image → Character Segmentation → Character image → Feature Extraction → Character properties → Classification → Character ID

Base Line Extraction

Handwritten sentence recognition

Word Segmentation Algorithm

Flow-chart steps:

• Compute initial grouping probabilities

• Update linking probabilities

• Group adjacent glyphs

• Make the adjustment that improves the joint probability most

• Compute the joint probability when each glyph pair is split and merged

Data passed between the steps:

• Glyph sequence within line

• Glyph pairs with linking probabilities

• Word partitions

• Labeled words

Two Samples From The set of 1000 images

An assignment strategy used in a postal delivery system

Reference lines separating three zones

upper zone

middle zone

lower zone

Total recognition scheme

Block diagram labels: Nonlinear normalization; Feature extraction; Classifiers A, B, C and D; Combination (three blocks); Discrimination (three blocks); Canonical variates; Common difference; Principal components; Difference; Principal components

Segmentation techniques

Training phase

Testing phase

Segmentation problems in machine printed text

Sample A’s (numbered 1 to 20)

Some pre-processing operations

• (a) The original image

• (b) Image after thinning

• (c) Image after dilation and scale normalization

Steps of the decomposition process

• (a) The original bitmap

• (b) The thinned image

• (c) The corrected polygonal

• (d) The decomposition into circular arcs

Algorithm of Hole Detection

STEP 1: Let C(i) be the number of strokes crossed by the horizontal scanning line Y = i. Scan from top to bottom until a row i1 is found such that C(i1) = 1 and C(i1+1) >= 2.

STEP 2: Continue scanning until a row i2 is found such that C(i2) >= 2 and C(i2+1) = 1. To guard against a broken stroke, continue scanning to check that C(i2+2) = 1, ..., C(i2+k) = 1, where k is a small integer (about 2).

STEP 3: Given i1 and i2, confirm the hole. Assume the internal white region at Y = i1+1 is [l1, l2], and let B(i1+1) = l2 − l1. Let D(i1) be the length of the internal black region at Y = i1. Similarly, we can get B(i2) and D(i2+1).
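The first two steps of the scan can be sketched as follows. This is a minimal illustration, assuming a binary image stored as a list of rows with 1 for black, assuming that a hole opens where the crossing count rises from 1 to 2 or more, and leaving out the confirmation in STEP 3. The names `count_runs` and `find_hole_candidate` are hypothetical.

```python
def count_runs(row):
    """C(i): number of black runs (strokes) crossed by a horizontal scan line."""
    runs, previous = 0, 0
    for pixel in row:
        if pixel and not previous:
            runs += 1
        previous = pixel
    return runs


def find_hole_candidate(image, k=2):
    """Return (i1, i2) bracketing a candidate hole, or None if none is found.

    `image` is a list of rows of 0/1 pixels (1 = black). Only the scanning
    steps are implemented; the width checks of STEP 3 are omitted.
    """
    c = [count_runs(row) for row in image]
    n = len(c)
    # STEP 1: first row with one stroke followed by a row with two or more.
    i1 = next((i for i in range(n - 1) if c[i] == 1 and c[i + 1] >= 2), None)
    if i1 is None:
        return None
    # STEP 2: a row with two or more strokes followed by up to k rows with a
    # single stroke; the look-ahead guards against a broken stroke. Near the
    # bottom edge fewer than k rows may be available, and those are accepted.
    for i2 in range(i1 + 1, n - 1):
        following = c[i2 + 1 : i2 + 1 + k]
        if c[i2] >= 2 and following and all(v == 1 for v in following):
            return i1, i2
    return None


# A small closed loop: the crossing counts per row are 1, 2, 2, 1, 1.
loop = [
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
]
print(find_hole_candidate(loop))  # (0, 2)
```

A plain vertical bar has a crossing count of 1 on every row, so the function returns None for it.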

Examples of Numeral Features

• (a) The principal (PA) and secondary (SA) axes

• (b) Number of black pixel blocks in each row and column

• (c) Position of holes

Algorithm of Contour Concavity Detection

Recovery of Drawing Order from Handwriting Images

Partitioning Handwritten Numeral Strings

Original string

Partitioned string

Connected Components

Dissection Technique for segmentation

Stroke Sequence Strings for “A”’s

Computing distance table

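Column headings are the stroke codes of one string; row headings are the stroke codes of the other. The bottom-right entry is the distance between the two strings.

          2   1   2   1   0   7   6   7   6
      0   2   4   6   8  10  12  14  16  18
  1   2   1   2   4   6   8  10  12  14  16
  1   4   3   1   3   4   6   8  10  12  14
  1   6   5   3   2   3   5   7   9  11  13
  1   8   7   5   4   2   4   6   8  10  12
  1  10   9   7   6   4   3   5   7   9  11
  5  12  11   9   8   6   5   5   6   8  10
  5  14  13  11  10   8   7   7   6   8   9
  6  16  15  13  12  10   9   8   7   7   8
  5  18  17  15  14  12  11  10   9   9   8
  5  20  19  17  16  14  13  12  11  11  10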
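The table above can be reproduced with a weighted edit-distance recurrence, sketched below under two assumptions that are inferred rather than stated on the slides: inserting or deleting a stroke costs 2 (read off the first row and column of the table), and the cost of substituting one direction code for another wraps around the 8 directions, so that it never exceeds 4. The function names are hypothetical.

```python
def direction_distance(x, y):
    """Difference between two 8-direction stroke codes, at most 4.

    The wrap-around branch (8 - d) is an assumption.
    """
    d = abs(x - y)
    return d if d <= 4 else 8 - d


def distance_table(a, b, gap=2):
    """Dynamic-programming table for approximate stroke-string matching.

    Columns follow string `a`, rows follow string `b`. `gap` is the assumed
    cost of inserting or deleting one stroke.
    """
    table = [[0] * (len(a) + 1) for _ in range(len(b) + 1)]
    for j in range(1, len(a) + 1):
        table[0][j] = gap * j
    for i in range(1, len(b) + 1):
        table[i][0] = gap * i
        for j in range(1, len(a) + 1):
            table[i][j] = min(
                table[i - 1][j - 1] + direction_distance(b[i - 1], a[j - 1]),
                table[i - 1][j] + gap,  # skip a stroke of b
                table[i][j - 1] + gap,  # skip a stroke of a
            )
    return table


a = [2, 1, 2, 1, 0, 7, 6, 7, 6]
b = [1, 1, 1, 1, 1, 5, 5, 6, 5, 5]
table = distance_table(a, b)
print(table[-1][-1])  # 10
```

Under these assumptions every row of the computed table agrees with the slide, and the final distance between the two strings is 10.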

Segmentation Methods

Tree diagram labels: analytic; holistic; recognition-based; Megabased; dissection; post-process; hidden Markov model; non-Markov; windowing; feature-based; dynamic program; Markov; hybrid

Segmentation Strategies

Classical approach: cutting the image into meaningful components with character-like properties (dissection)

Recognition-based segmentation: searching for components that match classes in the alphabet

Holistic methods: recognition of whole words

Recursive segmentation

Diagram labels: Input pattern; Windowed input; Matching prototype 1; Residue; Matching prototype 2

Location of the Blocks on the letter

Feature Extraction

Multiple segmentation hypotheses

Curve

Cups

Angular point

Curvature maxima & multiple-point

Simple loop

Anticlockwise orientation

Segmentation hypotheses into physical primitives

Graphemes

Relationship between sub-components and the background

(a) Isolated case

(b) Partially enclosed case

(c) Totally enclosed case

Skewing of Text

Learning 2D Shape Models

• Two training sets of left ventricle and cistern shapes from different patients were automatically divided into clusters

The shape learning method

Recognition of Mathematical Symbols

$a = c_0^2 + \sum_{n=0}^{\infty} c_n^2 / 2$

Block diagram of the Neural Classifier


Neural Network based Feature Extraction and Classification

MLP classifier

Extracted Coupled Feature Space

Input image

Recognition rates of letters for increasing training sets

Hidden Markov Model for Text Line

Hidden Markov Model

• (a) Training sample

• (b) Sequence of features

• (c) Hidden Markov model

• (d) Different segmentation features

Arabic Character Recognition

• On-line systems

• Off-line systems

• Arabic OCR

• Necessity of segmentation even for printed text

• Treatment of the sub-words rather than words

Main Characteristics of Arabic Writing

• Right to left

• Always cursive

• Change of character shape according to its location in the word

• Four different shapes

• 28 basic characters: 15 with dots, 13 without

• No fixed character width & No fixed size

Characters recognized by dots only

• Letters: ب ت ث ي ن

• Middle form: ـبـ ـتـ ـثـ ـيـ ـنـ

Characters with Hamza

• 4 characters which may take the secondary character “Hamzah ء”.

• Alif أ إ

• Waw ؤ

• Yaa ئ

• Kaf ك

Arabic Characters Different Forms

Words and sub-words

• رa`ل A word with 3 sub-words

• ر A sub-word with 1 character

• `a A sub-word with 2 characters

• ل A sub-word with 1 character
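Splitting a word into sub-words can be sketched as below. It relies on the fact that a small set of Arabic letters never joins to the letter that follows, so each of them closes a sub-word. This is a minimal sketch: it ignores diacritics and ligature code points, the example word رجال is chosen here for illustration rather than taken from the slide, and the function name is hypothetical.

```python
# Letters that never join to the following letter, so they end a sub-word.
NON_CONNECTING = set("اأإآدذرزوؤةى")
# The stand-alone hamza joins to neither neighbour.
HAMZA = "ء"


def split_subwords(word):
    """Split an Arabic word (without diacritics) into connected sub-words."""
    subwords, current = [], ""
    for letter in word:
        if letter == HAMZA:
            # Close whatever came before, then emit the hamza on its own.
            if current:
                subwords.append(current)
            subwords.append(letter)
            current = ""
            continue
        current += letter
        if letter in NON_CONNECTING:
            subwords.append(current)
            current = ""
    if current:
        subwords.append(current)
    return subwords


print(split_subwords("رجال"))  # ['ر', 'جا', 'ل']
```

The example yields three sub-words of 1, 2 and 1 characters, the same pattern as the word shown on the slide.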

Test Example

• Size: about 1.4MB

• 262,647 words

• 1,126,420 characters

• 4.3 characters per word

• 574,383 sub-words

• 2.2 sub-words per word
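The two averages follow directly from the counts listed above, as the short check below shows. The last figure, characters per sub-word, is derived here and does not appear on the slide; whether the character count includes spaces is not stated, so it should be read as approximate.

```python
words = 262_647
characters = 1_126_420
subwords = 574_383

chars_per_word = round(characters / words, 1)
subwords_per_word = round(subwords / words, 1)
chars_per_subword = round(characters / subwords, 1)

print(chars_per_word)     # 4.3
print(subwords_per_word)  # 2.2
print(chars_per_subword)  # 2.0
```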

Sub-words Shapes Statistics

Proposal for a New Procedure for Recognition of Arabic Characters

• Sub-words of 1 character (stand-alone form): recognise directly without any segmentation

• Sub-words of 2 characters:

– The first one is in the initial form

– The second one is in the final form

– Segmentation into two parts only

• Sub-words of more than two characters:

– The first one is in the initial form

– The last one is in the final form

– The rest are in the middle form

Examples

• One character: ن ق ع

• Two characters sub-words: ل`e fg`آ ig`j

• Three characters sub-words: ikj وlm ikj`ل

• Four characters sub-words: opqr opqrو

• Five characters sub-words:fkqTjا opqTm
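The rule that ties the length of a sub-word to the forms of its characters can be written down in a few lines. This is a minimal sketch, assuming the sub-word has already been isolated and its character count is known; the function names and the form labels are chosen here for illustration.

```python
def positional_forms(length):
    """Expected form of each character in a sub-word of the given length."""
    if length < 1:
        raise ValueError("a sub-word has at least one character")
    if length == 1:
        return ["stand-alone"]
    return ["initial"] + ["middle"] * (length - 2) + ["final"]


def cuts_needed(length):
    """Number of segmentation cuts needed to separate the characters."""
    return max(length - 1, 0)


print(positional_forms(1))  # ['stand-alone']
print(positional_forms(2))  # ['initial', 'final']
print(positional_forms(4))  # ['initial', 'middle', 'middle', 'final']
```

Knowing the expected form at each position lets a classifier compare each segment only against prototypes of that form, which is the simplification the proposal aims at.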

Proposed Procedure for Arabic Character Recognition

Approximate Stroke Sequence String Matching

• Individual distance $d_{i,j}$ between the i’th stroke in letter $a_1$ and the j’th stroke in letter $a_2$, where

$d_{i,j} = |a_1(i) - a_2(j)|$ if $|a_1(i) - a_2(j)| \le 4$

The 8-direction stroke convention
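A stroke can be mapped to one of eight direction codes by quantizing its angle into 45-degree sectors, as in the sketch below. The numbering used on the slide is in a figure that did not survive extraction, so the convention here (code 0 along the positive x axis, codes increasing counter-clockwise, y pointing up) is an assumption, and `direction_code` is a hypothetical name.

```python
import math


def direction_code(dx, dy):
    """Quantize a stroke vector (dx, dy) to a direction code from 0 to 7.

    Assumed convention: 0 points along +x and codes increase
    counter-clockwise in steps of 45 degrees.
    """
    if dx == 0 and dy == 0:
        raise ValueError("a stroke needs a non-zero displacement")
    angle = math.atan2(dy, dx)  # radians, counter-clockwise from +x
    return round(angle / (math.pi / 4)) % 8


# Codes for the eight principal directions, starting at +x.
steps = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]
print([direction_code(dx, dy) for dx, dy in steps])  # [0, 1, 2, 3, 4, 5, 6, 7]
```

A sequence of such codes along a thinned character gives the kind of stroke string that the matching procedure compares.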

String matching between the unknown character and character ح

String matching between the unknown character and character ع

(a) Character ح; (b) character ع; (c) unknown character to be matched with (a) and (b)

The letter ع handwritten by 48 different persons

Ghain at beginning

Tha at middle

Dhad at end

Hamza under Alif

Three dots

Numeral 4

Three characters together

Remarks about Arabic Language

The average Arabic word contains about 4.3 characters.

There is an average of 2.2 sub-words per word.

The basic block should be the sub-word rather than the word.

The size of a sub-word ranges from 1 to 8 characters.

Sub-words with a single character: stand-alone form.

Remarks about Arabic Language (continued)

Sub-words with two characters: a single-shot segmentation has to be made, dividing the sub-word into two characters. The first one is in the initial form and the second one in the final form.

Sub-words longer than 2 characters need to be segmented into three characters or more. The first is of initial form, the last of final form and the rest of middle form.

Conclusions

• On-line OCR is easier

• Off-line OCR for printed text is available for Latin characters

• OCR for handwriting is still in the research era

• Special applications of recognition of handwritten text are available, e.g. checks and postal delivery

• OCR for oriental languages is in the research era

• Research in Natural Language Processing aids OCR development

Conclusions (continued)

The design of an Arabic OCR system would be much simpler when these facts are taken into account.

Classification of sub-words according to the number of characters they contain still needs to be addressed.

Approximate stroke sequence string matching shows promising results.

Further refinement of the algorithm is needed for a better recognition rate.

The use of neural networks in segmentation is promising.