Tesseract Training for Khmer Language

by Kruy Vanna

• Download Tesseract from http://code.google.com/p/tesseract-ocr/downloads/list

• Here I chose the precompiled package, Tesseract-2.01.ext.tar.gz. (It is better to use a newer version, but since I do not have a compiler at hand right now, I’ll just use the precompiled one.)

• Extract it to any location.

• Download the English language data.

• Extract it and put it in the tessdata folder of your Tesseract folder.

• Download the Tesseract source package. What I need are the files in the configs and tessconfigs folders of tessdata. (The tesseract.exe package you downloaded does not have these.)

• Extract it somewhere and copy its tessdata folder into our previous Tesseract folder.
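Before training, it may help to confirm that the configs and tessconfigs folders really ended up inside tessdata. The following is a minimal Python sketch of such a check; the function name `missing_tessdata_dirs` and the "." placeholder path are illustrative assumptions, not part of Tesseract.

```python
from pathlib import Path

# Subfolders of tessdata that come from the source package, per the step above.
REQUIRED_SUBDIRS = ("configs", "tessconfigs")


def missing_tessdata_dirs(tesseract_dir):
    """Return the required tessdata subfolders that are not present."""
    tessdata = Path(tesseract_dir) / "tessdata"
    return [name for name in REQUIRED_SUBDIRS if not (tessdata / name).is_dir()]


if __name__ == "__main__":
    # "." is a placeholder; point this at your own Tesseract folder.
    print(missing_tessdata_dirs("."))
```

An empty list means both folders are in place.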

Now we can start training:

I train with this image. They say you should train with enough data, so every character should appear many times (I don’t know if I’m right).

Maybe each character should appear many times, but in different fonts?

• Make the box file. Go to the command line, set the current directory to your Tesseract folder, and run:

tesseract fontfile.tif fontfile batch.nochop makebox

• Got the file: fontfile.txt

• Renamed it to fontfile.box so that I can open it in Tessboxer (http://sites.google.com/site/spilkaondrej/)

Here I type the character into the “Letter” textbox, and the UTF-8 code is filled in automatically.
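Since every character is supposed to appear many times, it can be useful to count how often each label occurs in the edited box file. The sketch below is in Python and assumes each box-file line has the form "character left bottom right top", optionally followed by a page number; the helper name `count_box_labels` and the sample coordinates are made up for illustration.

```python
from collections import Counter


def count_box_labels(box_text):
    """Count how many boxes carry each label in box-file text.

    Assumes each non-empty line looks like
    "<character> <left> <bottom> <right> <top>", optionally followed
    by a page number.
    """
    counts = Counter()
    for line in box_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        counts[fields[0]] += 1
    return counts


if __name__ == "__main__":
    # Made-up coordinates, only to show the expected input shape.
    sample = "ខ 10 20 30 40\nង 35 20 55 40\nខ 60 20 80 40\n"
    for label, n in count_box_labels(sample).most_common():
        print(label, n)
```

To use it on the real data, read fontfile.box as UTF-8 text and pass the contents to `count_box_labels`.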

Making the feature file:

tesseract fontfile.tif junk nobatch box.train

I got this log:

read_variables_file:variable not found: textord_no_rejects
Tesseract Open Source OCR Engine
Image has 24 bits per pixel and size (746,387)
Resolution=96
APPLY_BOXES:
Boxes read from boxfile: 19
Initially labelled blobs: 17 in 3 rows
Box failures detected: 2
Duped blobs for rebalance: 2
"ច" has fewest samples: 5
Total unlabelled words: 1
Final labelled words: 19
Generating training data
TRAINING ... Font name = UnknownFont.
Generated training data for 19 blobs

Clustering

(You should change the current directory to “training” to use this command.)

mftraining fontfile.tr

Now I have the files I should have:

• inttemp

This is a binary file, so it is not human-readable.

• pffmtable

ខ 104
ង 93
ច 85

I don’t know what the numbers mean.

• I also got a file named “Microfeat”, but they say it is not used.

Another command:

cntraining fontfile.tr

Got this file: normproto

Compute the Character Set

unicharset_extractor fontfile.box

Got this file: unicharset

Dictionary Data

Created the “frequent_words_list” file. They said I must put at least one word in it, so I just put “ខងច” in it using Notepad.

Generate the frequent-words dictionary file using the command:

wordlist2dawg frequent_words_list freq-dawg

Got the file: freq-dawg

Created the “words_list” file with the content “ងចខ”.

Generate the word-list dictionary file using the command:

wordlist2dawg words_list word-dawg

Got the file: word-dawg

Created the “user-words” file. They say it is usually empty, so I keep it empty.
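The command-line steps so far can be collected in one place. The sketch below, in Python, only builds the command lines and writes the word-list files; it does not run anything. The helper names `write_word_list` and `training_commands` are hypothetical. It assumes that the word lists should be saved as UTF-8 and that cntraining reads the same .tr file as mftraining.

```python
from pathlib import Path


def write_word_list(path, words):
    """Write one word per line, saved as UTF-8 (an assumption for Khmer text)."""
    Path(path).write_text("".join(word + "\n" for word in words), encoding="utf-8")


def training_commands(base="fontfile"):
    """Return the training commands from this walkthrough, in order.

    After the first command, the resulting .txt file still has to be
    renamed to .box and corrected by hand before continuing.
    """
    return [
        ["tesseract", f"{base}.tif", base, "batch.nochop", "makebox"],
        ["tesseract", f"{base}.tif", "junk", "nobatch", "box.train"],
        ["mftraining", f"{base}.tr"],
        # Assumption: cntraining takes the same .tr file as mftraining.
        ["cntraining", f"{base}.tr"],
        ["unicharset_extractor", f"{base}.box"],
        ["wordlist2dawg", "frequent_words_list", "freq-dawg"],
        ["wordlist2dawg", "words_list", "word-dawg"],
    ]


if __name__ == "__main__":
    for command in training_commands():
        print(" ".join(command))
```

Each inner list could be passed to `subprocess.run` once the tools are on the PATH and the current directory is set as described for each step.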

The last file

The file “DangAmbigs” is created manually. Its purpose is to reduce ambiguity. For example, “m” can easily be confused with “rn” (r+n). Khmer characters may not have this kind of ambiguity (need to confirm), so I made it an empty file.

Putting it all together

Now I have all the files renamed to have the prefix “khm.” (khm is the ISO 639-2 code for Khmer, the language of Cambodia):

• khm.DangAmbigs

• khm.freq-dawg

• khm.inttemp

• khm.normproto

• khm.pffmtable

• khm.unicharset

• khm.user-words

• khm.word-dawg

All of these files should be put in the “tessdata” folder.

Now time to run the test!!!

I have this image: khmer.tif

I run it with the command:

tesseract khmer.tif output -l khm

I got output.txt with the content:

ខងច

Cheers!!!
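The renaming step from “Putting it all together” can also be scripted. This is a minimal Python sketch, assuming the eight trained files sit unprefixed in one folder; the function name `install_language_files` and its parameters are illustrative, and it copies the files rather than renaming them so the originals stay in place.

```python
import shutil
from pathlib import Path

# The eight files listed in "Putting it all together", without the prefix.
TRAINED_FILES = (
    "DangAmbigs", "freq-dawg", "inttemp", "normproto",
    "pffmtable", "unicharset", "user-words", "word-dawg",
)


def install_language_files(training_dir, tessdata_dir, lang="khm"):
    """Copy each trained file into tessdata as "<lang>.<name>"."""
    training_dir = Path(training_dir)
    tessdata_dir = Path(tessdata_dir)
    tessdata_dir.mkdir(parents=True, exist_ok=True)
    installed = []
    for name in TRAINED_FILES:
        target = tessdata_dir / f"{lang}.{name}"
        shutil.copyfile(training_dir / name, target)
        installed.append(target.name)
    return installed
```

A call such as `install_language_files("training", "tessdata")` returns the list of file names it created; both folder names here are placeholders for your own paths.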