

Annals of Library Science and Documentation 41, 4; 1994: 121-126.

OPTICAL CHARACTER RECOGNITION: A BOON OR BANE

R.P.S. DHAKA*
KAMLESH ARORA**
NISHY P.**
Programme Management Division
INSDOC
New Delhi

Optical Character Recognition (OCR) has assumed new importance as a tool for converting paper-based information into electronic form. The results of various experiments conducted on a number of available OCR software packages to test their recognition power under different conditions are presented.

INTRODUCTION

The advent of modern information technology has opened a vast panorama of opportunities for information procurement, management and dissemination. The volume of information output has grown so large that it is practically impossible to handle and manage it manually. In the field of science and technology alone, about 10 million articles are generated every year in the world. If we take into account the output of, say, the previous 5-10 years when preparing a bibliography on a selected topic, the number of references that must be searched is mind-boggling.

Information technology has provided one part of the solution, namely the handling and management of generated information; the other part, i.e. electronic conversion of published information, is still a highly labour-intensive exercise. For the effective utilization of information technology, the published information must be reprocessed into a machine-readable format. The technology has facilitated the creation of large databanks or databases, where the published information is put in the required format to facilitate easy access and quick searching.

The main component in the process of converting published information into a machine-readable format is keying in large volumes of published textual or numerical data through computer keyboards. It is a most time-consuming and labour-intensive operation. In 1990, the Indian National Scientific Documentation Centre (INSDOC) started preparing computer-readable databases on a large scale and felt the need to avoid this manual keying-in operation, if possible. In the process a number of solutions were studied, out of which Optical Character Recognition technology appeared, prima facie, to be somewhat promising.

OPTICAL CHARACTER RECOGNITION (OCR)

A large number of language characters and numerals in various fonts, i.e., sizes and styles, are stored in the memory of a computer through a software programme. The computer is made to compare the characters of the published text with those stored in its memory. When a character of the published text matches identically with one stored in the computer memory, the computer converts the character into ASCII format, meaning that the computer has recognised the published character, which can then be reproduced on the computer screen and stored in its memory. When the character does not match exactly, the computer fails to recognise it and puts a junk character in its place on the screen. The process of recognition of the characters of a published text by a computer is known as Character Recognition, and as the characters are presented to the computer with the help of an optical scanner, the whole process is called Optical Character Recognition. This, however, requires that a dictionary of all the characters and numerals in different sizes and styles be built and stored in the computer. It therefore requires very large disk space for storing a huge dictionary of specific font information.

* Deputy Head  ** Scientist

Vol 41 No 4 December 1994

To solve this problem, a new kind of programme called 'Omnifont Technology' has been developed. The programme is based on the "Feature Extraction Technique" for recognising fonts irrespective of their size and style. The technique involves the recognition of a character based on its specific features. The character B will be recognised as 'B' as long as it remotely resembles B. Thus, an Omnifont software contains a database of shapes, i.e., lines and circles. The OCR software breaks a character down into its component parts and is not required to contain information on a large number of different fonts. It can recognise a character by its unique combination of shapes. Different packages use different algorithms for feature extraction, which is why the efficiency of different OCR packages varies widely. Some packages increase their efficiency by combining Omnifont Technology with a learning facility and a stored dictionary for correcting possible spelling mistakes caused by recognition errors.
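The contrast between exact matching against a stored font dictionary and feature-based recognition can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the tiny glyph bitmaps, the two features (number of enclosed holes, and whether the left edge is a full-height stroke), the '@' junk marker and the function names. It does not reproduce the algorithm of any package tested here.

```python
# Toy glyphs: '#' is ink, '.' is paper. These bitmaps are invented examples.
TEMPLATES = {
    "B": ("###.", "#..#", "###.", "#..#", "###."),
    "D": ("###.", "#..#", "#..#", "#..#", "###."),
    "O": (".##.", "#..#", "#..#", "#..#", ".##."),
    "L": ("#...", "#...", "#...", "#...", "####"),
}
# A 'B' drawn larger, as if set in a different font size.
B_OTHER = ("####.", "#...#", "####.", "#...#", "#...#", "####.")
JUNK = "@"  # hypothetical marker for an unrecognised character


def count_holes(glyph):
    """Count paper regions fully enclosed by ink (4-connected flood fill)."""
    rows, cols = len(glyph), len(glyph[0])
    seen = set()

    def flood(r, c):
        stack = [(r, c)]
        while stack:
            y, x = stack.pop()
            if (y, x) in seen or not (0 <= y < rows and 0 <= x < cols):
                continue
            if glyph[y][x] == "#":
                continue
            seen.add((y, x))
            stack.extend([(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)])

    # Paper reachable from the border is outside the character, not a hole.
    for r in range(rows):
        for c in range(cols):
            if r in (0, rows - 1) or c in (0, cols - 1):
                flood(r, c)
    holes = 0
    for r in range(rows):
        for c in range(cols):
            if glyph[r][c] == "." and (r, c) not in seen:
                holes += 1
                flood(r, c)
    return holes


def features(glyph):
    """Size-independent description: (holes, full-height left stroke)."""
    return count_holes(glyph), all(row[0] == "#" for row in glyph)


FEATURE_TABLE = {features(t): ch for ch, t in TEMPLATES.items()}


def recognise_by_template(glyph):
    """Exact matching: any difference from the stored bitmap gives junk."""
    for ch, template in TEMPLATES.items():
        if glyph == template:
            return ch
    return JUNK


def recognise_by_features(glyph):
    """Feature matching: look up the combination of shapes instead."""
    return FEATURE_TABLE.get(features(glyph), JUNK)


if __name__ == "__main__":
    print(recognise_by_template(B_OTHER), recognise_by_features(B_OTHER))
```

In this sketch the larger 'B' defeats exact matching because its bitmap differs from the stored one, while its two holes and full left stroke still identify it. A real package would need many more features to separate a full character set.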

OCR TECHNOLOGY

The OCR technology consists of the following hardware and software components:

1. Optical scanner with grey scale facility
2. PC-386 or higher with VGA colour monitor
3. OCR software
4. Windows software

The optical scanner scans the given text at a minimum resolution of 300 dots per inch (dpi). It produces a photo-image of the text on the screen. The software is then made to recognise the characters of the text and convert them into ASCII format. Good OCR performance requires good processing power in the computer. The computer displays the recognised characters on the screen, with the unrecognised ones shown as junk characters. The junk characters can then be replaced by the actual characters by correcting each one individually. Though the whole process of scanning a page, getting it read by the OCR software and reproducing it on the screen takes about 1.5-2 minutes, carrying out corrections for the unrecognised characters is quite a time-consuming and labour-intensive job.

The OCR software is the most important component of the entire process. The efficiency of the software depends on its producing the minimum number of unrecognised characters. A page of printed text generally contains about 500-600 words or 4000-4500 characters. Even a recognition error of 2% produces 80-90 junk characters in a page, and it takes considerable time to carry out so many corrections. In some cases, it takes less time to key in the whole page than to carry out the corrections.
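The arithmetic behind these figures, and the point at which re-keying a page becomes quicker than correcting it, can be sketched as follows. The page sizes and the 2% error rate come from the text; the correction time and keying speed are hypothetical operator figures chosen only to show the calculation, and the model ignores the 1.5-2 minutes of scanning and recognition time.

```python
# Figures from the text: characters per page and a 2% recognition error.
CHARS_PER_PAGE = (4000, 4500)


def junk_characters(chars_per_page: int, error_rate: float) -> int:
    """Expected number of unrecognised characters on one page."""
    return round(chars_per_page * error_rate)


def break_even_error_rate(seconds_per_correction: float,
                          keying_chars_per_second: float) -> float:
    """Error rate above which correcting takes longer than re-keying.

    Correction time = chars * rate * seconds_per_correction
    Keying time     = chars / keying_chars_per_second
    Setting the two equal, the page length cancels out.
    """
    return 1.0 / (seconds_per_correction * keying_chars_per_second)


if __name__ == "__main__":
    for chars in CHARS_PER_PAGE:
        print(chars, "characters at 2% error:",
              junk_characters(chars, 0.02), "junk characters")
    # Hypothetical operator speeds, NOT measured values from the study.
    rate = break_even_error_rate(seconds_per_correction=10.0,
                                 keying_chars_per_second=3.0)
    print(f"break-even error rate: {rate:.1%}")
```

Under these assumed speeds the break-even point falls between the 2% example above and the higher error rates reported below, which is consistent with the remark that re-keying is sometimes the faster option.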

A number of OCR software packages, viz., Read Star, WordScan, Omni Page, etc., have been floated in the market by various foreign firms. We carried out OCR experiments with most of these packages. Even the best ones, which claim nearly 100% accuracy in recognition, failed to recognise about 4-5% of the characters in a page. In several others the error rate was as high as 25-30%. Some of the samples are given as Exhibit 1. For obvious reasons, the names of the packages have not been mentioned in the Exhibit.

Exhibit 1: Comparison of character recognition power for different softwares

PC-FAX card was tried. It is not possible to send files
containing images etc. via cable or modem from one machine to
another. That is files that have been created either by optical
scanner using SCANGAL, or otherwise which contain matter apart
from the text in the form of images, graphics etc. can not be
transfered from one machine to another through computer
communication (electronic mail, LAN etc.). The option of FAX on

(Original for Software 1)

pc-FAX card was tried. It is not possible to send files
containinq imaqes etc. via cable or modem from one machine to
another. That is files that have been created either by optical
scanner usinq SCANGAL, or otherwise which contain matter apart
from the text in the form of images, graphics etc. can not be
transfered from one machine to another through computer
communication (electronic mail, LAN etc.) • The option of FAX on

(Output after character recognition from Software 1)

to mortality and morbidity--'. Previously reported observations regarding the cardiopulmonary functions and natural history of scoliosis clearly indicate that it produces changes in pulmonary functions related to underlying causes of disease, its severity and progression), •.

The extent of pulmonary dysfunction in patients with scoliosis is of concern to anaesthesiologists not only in deciding the anaesthetic technique, but also, and especially, in the immediate postoperative period when any pre-existing respiratory disability is exagerated in the short term

(Original for Software 2)

major contributors to mortality and morbidityi2. Previously
reported observations regarding the cardiopulmonary functions
and natural history of scoliosis cle:rly indicate that it
produces changes in pulmonary functions related to underlying
causes of disease, its severity and progression3, '.

The extent of pulmonary dysfunction in patients with scoliosis
is of concern to anaesthesiologists not only in deciding the
anaesthetic technique, but also, and especially, in the
immediate postoperative period when any pre-existing
respiratory disability is exagerated in the short term'

(Output after character recognition from Software 2)

A few software packages have a learning facility, wherein the software is made to memorise a character it failed to recognise so that it will recognise the character at its next occurrence. However, this facility does not provide much relief from the problem of errors in recognition. The errors depend on several factors other than the deficiencies of the software.

Among electronically composed texts, OCR has the most trouble recognising dot matrix characters. These characters are made up of a series of dots. The OCR packages either recognise each dot as a character or interpret the dots as apostrophes, periods or commas. Some packages include algorithms to deal with this but have not succeeded in completely solving the problem.

A number of errors in character recognition are caused by ligatures, i.e. two or more characters joined together. They are read by the software as a single unrecognised character. However, a package may be improved to handle combined characters by separating them. Several stylistic features, such as italic, boldface or underlined characters, also cause problems in recognition by OCR software.

ERROR FACTORS

A large number of recognition errors happen due to the poor print quality of the textual matter. Scanning is done at a resolution of 300 dpi or higher. The character is resolved into dots and a bit map of 300x300, i.e. 90,000 dots per square inch, is created. Whenever the print quality is not uniform, or characters are broken, or extra ink or a stray dot appears in the vicinity, the scanner creates an image of the character which is somewhat different from the standard one stored in computer memory; or a broken character is read as two characters, neither of which resembles the stored one. Once the two images do not match, a junk character appears on the screen. The learning mechanism of the software cannot help in such a situation, as the deformed printed character may not appear in the same form at its next occurrence.
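The bit map figure and the effect of a single stray ink dot can be checked with a few lines. The 3x4 bitmaps below are invented for illustration and are far smaller than a real 300 dpi character image; the function names are likewise assumptions.

```python
def dots_per_square_inch(dpi: int) -> int:
    """A scan at `dpi` dots per inch resolves dpi * dpi dots per square inch."""
    return dpi * dpi


def differing_dots(stored, scanned) -> int:
    """Count positions where two equal-sized bitmaps disagree."""
    return sum(a != b
               for stored_row, scanned_row in zip(stored, scanned)
               for a, b in zip(stored_row, scanned_row))


# '#' is ink, '.' is paper. A vertical stroke as stored, and the same
# stroke scanned with one extra speck of ink beside it.
STORED = (".#.", ".#.", ".#.", ".#.")
SPECKLED = (".#.", ".#.", ".##", ".#.")

if __name__ == "__main__":
    print(dots_per_square_inch(300))          # 90000, as stated in the text
    print(differing_dots(STORED, SPECKLED))   # one dot out of twelve differs
    print(STORED == SPECKLED)                 # exact comparison fails
```

One differing dot is enough to break an exact comparison, and because the speck is unlikely to recur in the same place, memorising the speckled image does not help with the next occurrence of the character.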

Poor quality paper used in printing also decreases the character recognition rate. On poor quality paper, the characters printed on the reverse side affect the scanning adversely and a good scanned image is not obtained. A bad scanned image inevitably produces more errors in character recognition.

Table: Character recognition rating for various types of printout

Sl. no.   Printout                       Recognition rate (in %)
1.        Laser printout                 99.7
2.        Electronic typewriter          99.7
3.        Dot matrix printout            97.4
4.        Photocopy of laser printout    92.1
5.        Facit typewriter output        44.1

Exhibit 2: Comparison of character recognition power for different types of printouts

To facilitate the transfer of documents containing
images/graphs/photographs etc. besides the text, the option of
PC-FAX card was tried. It is not possible to send files
containing images etc. via cable or modem from one machine to
another. That is files that have been created either by optical
scanner using SCANGAL, or otherwise which contain matter apart
from the text in the form of images, graphics etc. can not be

(Original photocopied laser printout)


To facilitatc the tra@sEcr ui @@c@ni@@!t@@ r@r,@l@@@.ni@gimages,grap@)s,photagra@hs etc. beside@ ts@c text, t@lC c@i"ltj.@n 0

pc-FAX card was tried. It is T@ot @)@)@@@i@l@ "i!@, f;;:!'l(: @ij-@escontaining images etc. via cable or modeM fro@ UIia s,a@ilise @o

another. That is files that have been createa either @? of''ci.calscanner using SCANGAL, or otherwise which contairi matt@r @partfrom the text in the form o@@ images, graphics ctc. c@n n3t be

(After character recognition from photocopied laser printout)

scanner using SCANGAL, or otherwise which contain matter apart
from the text in the form of images, graphics etc. can not be
transfered from one machine to another through computer
communication (electronic mail, LAN etc.). The option of FAX on
the other hand is too expensive for such kind of document
transfer. Hence it was endeavoured to devise a means where files
containing images could be transfered from system (one PC/AT or
XT to another). The option that was considered was PC-FAX card.

(Original dot matrix printout)

ssanner Llsink SCANGAL, or otherwise which contain matter apart
from the text in the form of images, graphics etc. can not be
transfered from on@ maehine to anQther through sornputer
eommunicati@n (electronic mail, LAN etc.). The option of FAX on
t'he other hand is too expensive for s:lch kind of document
transfer. Henc@ it was endeavoured to devise a means where files
containing images could be transfered from syst@rn (one PC/AT or
XT to anotherj. The option thaS was considered was PC-FAX card.

(After character recognition from dot matrix printout)


To facilitate the transfer of documents containing images, graphs, photographs
etc. besides the text, the option of PC-FAX card was tried. It is not possible
to send files containing images etc. via cable or modem from one machine to
another. That is files that have been created either by optical scanner
using SCANGAL, or otherwise which contain matter apart from the text in
the form of images, graphics etc. can not be transferred from one machine to
another through computer communication (electronic mail, LAN etc.). The

(Original facit typewriter output)

T@ @@eilit@te th@ tFan@@er @@ deo@@emts comt@i@in@ im@ges, grsp@s, P@@>Q bedde@ the ted>, the aptis@ @@ PG@PSI @@d wms tni@@ It i@ @,4 post@ @@@d @iae@ @o@t@ini@@ i@ag@@ 4c. @i@ eabl@ or @ed@@ @@am @n@ @Q@@n

@@DtheF@ Th@@ i@ @ile@ that hsQ@ be@n sr@@@@d d th@@ by @@tiESi @Eann@@in@ 8Ca@8aL, sr sther@i se 4@@h @e@t@i@ @att@r sp@r@ @ROB the t@@tthe @@@@ af i@@ges, ga@p@@ @@ eSQ @@n @ot b@ ta@n@@@rr@d @@a@ ene @@cso@ther ihRe8sh e@D@@t@a een@mn7 @@aie@ i u @et@e@@@@eEl , LaN d;@*.

(After character recognition from facit typewriter output)

A number of experiments were conducted on optical character recognition on various types of printed material. The character recognition ratings of various types of printout are presented in the Table. It was shown clearly that the better the print quality and paper, the fewer the recognition errors. Some of the samples are given in Exhibit 2.
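The article does not state how the recognition rate was calculated. One plausible way to obtain such a figure, shown here purely as an assumption, is to compare the OCR output with the original text character by character using the edit (Levenshtein) distance. The sample strings are a fragment of the Software 1 pair in Exhibit 1; the function names are hypothetical.

```python
def edit_distance(original: str, output: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `original` into `output`."""
    previous = list(range(len(output) + 1))
    for i, ch_orig in enumerate(original, 1):
        current = [i]
        for j, ch_out in enumerate(output, 1):
            current.append(min(
                previous[j] + 1,                         # delete ch_orig
                current[j - 1] + 1,                      # insert ch_out
                previous[j - 1] + (ch_orig != ch_out),   # substitute or keep
            ))
        previous = current
    return previous[-1]


def recognition_rate(original: str, output: str) -> float:
    """Percentage of original characters recovered (assumed definition)."""
    errors = edit_distance(original, output)
    return 100.0 * (1.0 - errors / len(original))


if __name__ == "__main__":
    original = "containing images etc. via cable or modem"
    output = "containinq imaqes etc. via cable or modem"
    print(edit_distance(original, output))
    print(f"{recognition_rate(original, output):.1f}%")
```

A rate computed on a short fragment like this is only indicative; a figure comparable with those in the Table would need whole pages of text for each type of printout.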

CONCLUSION

Optical character recognition has not yet succeeded in becoming an alternative to data input operations anywhere in the world. Therefore, keying-in of data continues in all the major database-creating organisations in India. Similar observations have been received from organisations abroad. Mr. John Rose, a UNESCO expert in Paris engaged in database creation, shared the same views and commented that the same OCR software gave different results on different texts. The best results were obtained on texts printed in Germany, where the print quality is better than in other countries. Therefore, OCR technology is not at present in a position to replace data input operations, as it depends not merely on software efficiency but also on several other factors which are beyond its control.

REFERENCES

1. STANOFORO (0) and ENGLOWSTEIN (H).Tame the paper tiger. Byte. 1991, April; 220-8.

2. KELLER (0). High speed OCR system have flexibility and accuracy. Document Image Automation. 13, 1; 1993; 28-9.
