by Kruy Vanna
• Download Tesseract from http://code.google.com/p/tesseract-ocr/downloads/list
• Here I chose the precompiled build, Tesseract-2.01.ext.tar.gz. (A newer version would be better, but since I do not have a compiler at hand right now, I'll just use the compiled one.)
• Extract it to any location.
• Download the English language data.
• Extract it and put it in the tessdata folder of your Tesseract folder.
• Download the Tesseract source package. What I need are the files in the configs and tessconfigs folders under tessdata. (The tesseract.exe package you downloaded does not include these.)
• Extract it somewhere and copy its tessdata folder into our previous Tesseract folder.
Now we can start training:
I train with this image. They say you should train with enough data, so every character should appear many times (I don't know if I'm right).
Maybe each character should appear many times, but in different fonts?
• Make the box file. Go to the command line, set the current directory to your Tesseract folder, and run:
tesseract fontfile.tif fontfile batch.nochop makebox
• Got the file: fontfile.txt
• Renamed it to fontfile.box so that I can open it in Tessboxer (http://sites.google.com/site/spilkaondrej/)
Here I enter the character in the “Letter” textbox and the UTF-8 code is filled in automatically.
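As a quick check on the box file before training, the sketch below counts how many samples each character has. It assumes the Tesseract 2.x box layout of one symbol per line followed by four integers (left, bottom, right, top); the function names and the sample coordinates are made up for illustration and are not taken from the real fontfile.box.

```python
from collections import Counter


def parse_box_line(line):
    """Split one box-file line into (symbol, left, bottom, right, top).

    Assumes the layout: <symbol> <left> <bottom> <right> <top>.
    """
    parts = line.split()
    symbol = parts[0]
    left, bottom, right, top = (int(p) for p in parts[1:5])
    return symbol, left, bottom, right, top


def count_samples(box_text):
    """Count how many boxes each symbol has in the box-file text."""
    counts = Counter()
    for line in box_text.splitlines():
        if line.strip():  # skip blank lines
            counts[parse_box_line(line)[0]] += 1
    return counts


# Hypothetical box data; the coordinates are invented for this example.
sample = (
    "ខ 10 300 60 350\n"
    "ង 70 300 120 350\n"
    "ច 130 300 180 350\n"
    "ខ 10 200 60 250\n"
)

counts = count_samples(sample)
for symbol, n in counts.most_common():
    print(symbol, n)
```

To use it on a real file, read the text with `open("fontfile.box", encoding="utf-8").read()` and pass it to `count_samples`. A character with very few boxes is a hint that the training image needs more samples of it.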
Making the feature file ->
tesseract fontfile.tif junk nobatch box.train
Got this log:
read_variables_file:variable not found: textord_no_rejects
Tesseract Open Source OCR Engine
Image has 24 bits per pixel and size (746,387)
Resolution=96
APPLY_BOXES:
Boxes read from boxfile: 19
Initially labelled blobs: 17 in 3 rows
Box failures detected: 2
Duped blobs for rebalance: 2
"ច" has fewest samples: 5
Total unlabelled words: 1
Final labelled words: 19
Generating training data
TRAINING ... Font name = UnknownFont.
Generated training data for 19 blobs
Clustering
(You should change the current directory to “training” to use the command.)
mftraining fontfile.tr
Now I have the files I should have:
• Inttemp: this is a binary file, so it is not human-readable.
• Pffmtable
• Microfeat: I got this file too, but they say it's not used.
One of the generated text files looks like this:
ខ 104
ង 93
ច 85
I don't know what the numbers mean.
Another command:
cntraining fontfile.tr
Got this file: normproto
Compute the Character Set
unicharset_extractor fontfile.box
Got this file: unicharset
Dictionary Data
Created the “frequent_words_list” file. They said I must put at least one word in it, so I just put “ខងច” in it using Notepad.
Generate the frequent dictionary file using the command:
wordlist2dawg frequent_words_list freq-dawg
Got the file: freq-dawg
Created the “words_list” file with the content “ងចខ”.
Generate the word list dictionary file using the command:
wordlist2dawg words_list word-dawg
Got the file: word-dawg
Created the “user-words” file. They say it's usually empty, so I keep it empty.
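The three word-list files for the dictionary step can also be written from a script instead of Notepad. This is only a sketch: it assumes the tools expect plain UTF-8 text with one word per line, and the helper name `write_word_list` is made up for the example. It writes into a temporary folder so nothing in the Tesseract folder is touched.

```python
import tempfile
from pathlib import Path


def write_word_list(path, words):
    """Write one word per line as UTF-8, without a byte-order mark."""
    # newline="\n" keeps line endings the same on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for word in words:
            handle.write(word + "\n")


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    # Same file names and contents as in the steps above.
    write_word_list(work / "frequent_words_list", ["ខងច"])
    write_word_list(work / "words_list", ["ងចខ"])
    write_word_list(work / "user-words", [])  # kept empty

    sizes = {p.name: p.stat().st_size for p in sorted(work.iterdir())}
    print(sizes)
```

One reason to script this: some versions of Notepad put a byte-order mark at the start of a UTF-8 file. Whether that matters to wordlist2dawg is not something I have checked, so writing the files without one simply removes the question.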
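The last file
The “DangAmbigs” file is generated manually. Its purpose is to reduce ambiguity. Ex. “m” can easily be confused with “rn” (r+n).
Khmer characters may not have this kind of ambiguity (need to confirm), so I made it an empty file.
Putting it all together
Now I have all the files renamed to have the prefix “khm.” (khm is the ISO 639-2 code for Khmer, the language of Cambodia).
All of these files should be put in the “tessdata” folder: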
• khm.DangAmbigs
• khm.freq-dawg
• khm.inttemp
• khm.normproto
• khm.pffmtable
• khm.unicharset
• khm.user-words
• khm.word-dawg
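The renaming and copying can be done by hand, but here is a minimal sketch of the same step in Python. The function name `install_language_files` and the folder names are hypothetical; only the eight file names and the “khm.” prefix come from the list above. The demo runs on empty placeholder files in a temporary folder.

```python
import shutil
import tempfile
from pathlib import Path

# The eight training outputs, named as in the list above (without the prefix).
TRAINING_OUTPUTS = [
    "DangAmbigs", "freq-dawg", "inttemp", "normproto",
    "pffmtable", "unicharset", "user-words", "word-dawg",
]


def install_language_files(training_dir, tessdata_dir, lang="khm"):
    """Copy each training output into tessdata as '<lang>.<name>'."""
    training_dir = Path(training_dir)
    tessdata_dir = Path(tessdata_dir)
    tessdata_dir.mkdir(parents=True, exist_ok=True)
    installed = []
    for name in TRAINING_OUTPUTS:
        source = training_dir / name
        if not source.exists():
            raise FileNotFoundError(f"missing training output: {source}")
        target = tessdata_dir / f"{lang}.{name}"
        shutil.copyfile(source, target)
        installed.append(target.name)
    return installed


# Demo with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    training = Path(tmp) / "training"
    training.mkdir()
    for name in TRAINING_OUTPUTS:
        (training / name).write_bytes(b"")
    installed = install_language_files(training, Path(tmp) / "tessdata")
    print(installed)
```

Copying instead of renaming leaves the original outputs in place, which is handy if the training has to be repeated. The check for missing files is there because forgetting one of the eight is an easy mistake.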
Now it's time to run the test!!!
I have this image: khmer.tif
I run it with the command:
tesseract khmer.tif output -l khm
I got output.txt with the content:
ខងច
Cheers!!!