
Tesseract Training_for Khmer Language_For Posting


This is a step-by-step tutorial on how to train Tesseract OCR. Here I train the Khmer language as an example.


by Kruy Vanna

• Download Tesseract from http://code.google.com/p/tesseract-ocr/downloads/list
• Here I chose the compiled one, Tesseract-2.01.ext.tar.gz. (It is better to use a newer version, but since I do not have a compiler at hand now, I’ll just use the compiled one.)
• Extract it to any location.
• Download the English language data.
• Extract it and put it in the tessdata folder of your Tesseract folder.


• Download the Tesseract source package. What I need are the files in the configs and tessconfigs folders of tessdata. (The tesseract.exe package you downloaded does not have these.)


• Extract it somewhere and copy its tessdata folder into our previous Tesseract folder.
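Before training, it can help to confirm that the copy worked. This is a minimal Python sketch, not part of the original tutorial: the function name is made up for illustration, and it only checks for the two subfolders named above.

```python
from pathlib import Path


def missing_tessdata_folders(tesseract_dir):
    """Return the tessdata subfolders named in the tutorial that are absent."""
    tessdata = Path(tesseract_dir) / "tessdata"
    required = ["configs", "tessconfigs"]
    # Keep only the folder names that do not exist under tessdata.
    return [name for name in required if not (tessdata / name).is_dir()]
```

Calling `missing_tessdata_folders(...)` with your own Tesseract folder should return an empty list once both folders have been copied.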

Now we can start training:

I train with this image. They say you should train with enough data, so every character should appear many times (I don’t know if I’m right).

Maybe each character should appear many times, but in different fonts?


• Make the box file. Go to the command line and set the current directory to your Tesseract folder:

    tesseract fontfile.tif fontfile batch.nochop makebox

• Got the file: fontfile.txt
• Renamed it to fontfile.box so that I can open it in Tessboxer (http://sites.google.com/site/spilkaondrej/)

Here I input the character in the “Letter” textbox and the UTF-8 code is filled in automatically.
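After the box file has been corrected, a quick count of the boxes per character shows whether every character appears often enough. The sketch below is a hedged Python illustration. It assumes each box-file line has the form `<char> <left> <bottom> <right> <top>` (any extra fields are ignored), and the function names and sample coordinates are made up.

```python
from collections import Counter


def parse_box_lines(lines):
    """Parse box-file lines into (char, left, bottom, right, top) tuples.

    Assumption: each line is "<char> <left> <bottom> <right> <top>",
    optionally followed by further fields, which are ignored.
    """
    boxes = []
    for line in lines:
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = (int(p) for p in parts[1:5])
        boxes.append((parts[0], left, bottom, right, top))
    return boxes


def samples_per_char(boxes):
    """Count how many boxes each character has."""
    return Counter(box[0] for box in boxes)


def read_box_file(path):
    """Read a box file as UTF-8, dropping a leading byte-order mark if any."""
    with open(path, encoding="utf-8-sig") as handle:
        return parse_box_lines(handle)


# Made-up sample lines, only to show the call pattern.
sample = ["ខ 10 20 40 60", "ង 50 20 80 60", "ខ 90 20 120 60"]
counts = samples_per_char(parse_box_lines(sample))
print(min(counts.items(), key=lambda item: item[1]))  # character with fewest boxes
```

For a real file, `samples_per_char(read_box_file("fontfile.box"))` gives the same kind of count.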


Making feature file:

    tesseract fontfile.tif junk nobatch box.train

I got this log:

    read_variables_file:variable not found: textord_no_rejects
    Tesseract Open Source OCR Engine
    Image has 24 bits per pixel and size (746,387)
    Resolution=96
    APPLY_BOXES:
    Boxes read from boxfile: 19
    Initially labelled blobs: 17 in 3 rows
    Box failures detected: 2
    Duped blobs for rebalance: 2
    "ច" has fewest samples: 5
    Total unlabelled words: 1
    Final labelled words: 19
    Generating training data
    TRAINING ... Font name = UnknownFont.
    Generated training data for 19 blobs

Clustering

(You should change the current directory to “training” to use the command.)

    mftraining fontfile.tr

Now I got the files I should have.
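The feature and clustering commands can also be run from a small script so that a failing step stops the sequence. This is a minimal Python sketch under assumptions: the function names are made up, the executables are assumed to be on the PATH, and the working directory is left as a parameter because the right folder depends on your installation.

```python
import subprocess


def feature_and_cluster_commands(base="fontfile"):
    """Return the two commands from this page as argument lists."""
    return [
        ["tesseract", f"{base}.tif", "junk", "nobatch", "box.train"],
        ["mftraining", f"{base}.tr"],
    ]


def run_steps(commands, cwd=None):
    """Run each command in order; raise CalledProcessError on the first failure."""
    for argv in commands:
        print("running:", " ".join(argv))
        subprocess.run(argv, cwd=cwd, check=True)
```

`run_steps(feature_and_cluster_commands("fontfile"))` would then run both steps for fontfile.tif in the current directory.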


The files I got:

• Inttemp: this is a binary file, so it is not human-readable.
• Pffmtable, with this content:

    ខ 104
    ង 93
    ច 85

  I don’t know what the numbers mean.
• Microfeat: I got this file too, but they say it’s not used.

Another command:

    cntraining fontfile.tr

Got this file: normproto

Compute the Character Set

    unicharset_extractor fontfile.box

Got this file: unicharset

Dictionary Data

Created the “frequent_words_list” file. They said I must put at least one word in it, so I just put “ខងច” in it using Notepad.

Generate the frequent dictionary file using the command:

    wordlist2dawg frequent_words_list freq-dawg

Got the file: freq-dawg

Created the “words_list” file with the content “ងចខ”.

Generate the word list dictionary file using the command:

    wordlist2dawg words_list word-dawg

Got the file: word-dawg

Created the “user-words” file. They say it’s usually empty, so I keep it empty.
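Notepad works, but the three dictionary input files can also be written from a script. The following is a minimal Python sketch, assuming UTF-8 text with one word per line; the function name is made up for illustration, and the words are the ones used in this tutorial.

```python
from pathlib import Path


def write_word_lists(folder, frequent_words, words):
    """Write the dictionary input files as UTF-8 text, one word per line."""
    folder = Path(folder)
    files = {
        "frequent_words_list": frequent_words,
        "words_list": words,
        "user-words": [],  # left empty, as in the tutorial
    }
    for name, entries in files.items():
        # newline="\n" keeps line endings identical on every platform.
        with open(folder / name, "w", encoding="utf-8", newline="\n") as handle:
            for word in entries:
                handle.write(word + "\n")
    return sorted(files)
```

`write_word_lists(".", ["ខងច"], ["ងចខ"])` creates the three files in the current directory, ready for the wordlist2dawg commands.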


The last file

The file “DangAmbigs” is manually generated. Its purpose is to reduce ambiguity. Ex. “m” can easily be confused with “rn” (r+n). Khmer characters may not have this kind of ambiguity (need to confirm), so I made it an empty file.

Putting it all together

Now I have all the files renamed to have the prefix “khm.” (khm is the ISO 639-2 code of the Cambodian language, Khmer):

• khm.DangAmbigs
• khm.freq-dawg
• khm.inttemp
• khm.normproto
• khm.pffmtable
• khm.unicharset
• khm.user-words
• khm.word-dawg

All of these files should be put in the “tessdata” folder.

Now time to run the test!!!

I have this image, khmer.tif.

I run it with the command:

    tesseract khmer.tif output -l khm

I got output.txt with the content:

    ខងច

Cheers!!!
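The renaming and copying in “Putting it all together” can be scripted instead of done by hand. This is a minimal Python sketch, not the author's method: the function name and the folder arguments are made up for illustration, and the file list is simply the eight names listed in this tutorial.

```python
import shutil
from pathlib import Path

# The eight training outputs listed in the tutorial, without the prefix.
TRAINED_FILES = [
    "DangAmbigs", "freq-dawg", "inttemp", "normproto",
    "pffmtable", "unicharset", "user-words", "word-dawg",
]


def install_language(work_dir, tessdata_dir, lang="khm"):
    """Copy each trained file into tessdata as "<lang>.<name>"."""
    work_dir, tessdata_dir = Path(work_dir), Path(tessdata_dir)
    installed = []
    for name in TRAINED_FILES:
        source = work_dir / name
        if not source.is_file():
            raise FileNotFoundError(f"missing training output: {source}")
        target = tessdata_dir / f"{lang}.{name}"
        shutil.copyfile(source, target)
        installed.append(target.name)
    return installed
```

Copying rather than renaming leaves the original outputs in place, so the training steps can be repeated without losing them.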