Integrating a simple OCR in Alfresco

Angel Borroy
developer @ keensoft

OCR for the Enterprise

• Minimum license starting at 100,000 documents/year
• Dedicated server required
• Hard learning curve
  – Regular expressions
  – Templates and workflows
  – Proprietary integration

OCR for the Community

• Open Source
• No other server than Alfresco
• No learning curve: just drop off your documents in a folder and get searchable PDFs
• Every hosting OS is supported

Building a simple OCR Action

1 REPO AMP

• Content model (simple)
• Action
• Transformer

OCR Action: Key classes

<bean id="ocr-extract"
      class="es.keensoft.alfresco.ocr.OCRExtractAction"
      parent="action-executer" init-method="init">
    <property name="ocrTransformWorker" ref="transformer.worker.OCR" />
</bean>

<bean id="transformer.worker.OCR"
      class="es.keensoft.alfresco.ocr.OCRTransformWorker">
    <property name="serverOS" value="${ocr.server.os}" />
    <!-- contents of the two executer properties are elided on the slide -->
    <property name="executerWindows"> … </property>
    <property name="executerLinux"> … </property>
</bean>
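The wiring above suggests that the transform worker receives the `ocr.server.os` value together with two OS-specific executers and delegates to one of them. The following is a minimal Python sketch of that dispatch idea only; the addon itself is written in Java, and the names `OCRWorker`, `run`, `linux_executer` and `windows_executer` are hypothetical, not taken from its source.

```python
from typing import Callable, Dict

# Hypothetical stand-in for the "executerLinux" / "executerWindows" properties:
# a callable that takes (source_pdf, target_pdf) and returns a status string.
Executer = Callable[[str, str], str]


class OCRWorker:
    """Illustrative only: picks an executer based on the configured server OS."""

    def __init__(self, server_os: str, executers: Dict[str, Executer]) -> None:
        self.server_os = server_os.strip().lower()
        self.executers = executers

    def run(self, source_pdf: str, target_pdf: str) -> str:
        try:
            executer = self.executers[self.server_os]
        except KeyError:
            raise ValueError(f"unsupported ocr.server.os: {self.server_os!r}")
        return executer(source_pdf, target_pdf)


def linux_executer(source_pdf: str, target_pdf: str) -> str:
    # A real executer would launch a local OCR program here.
    return f"linux: {source_pdf} -> {target_pdf}"


def windows_executer(source_pdf: str, target_pdf: str) -> str:
    # A real executer would call a local OCR service here.
    return f"windows: {source_pdf} -> {target_pdf}"


EXECUTERS = {"linux": linux_executer, "windows": windows_executer}
worker = OCRWorker("linux", EXECUTERS)
print(worker.run("scan.pdf", "scan-ocr.pdf"))
```

Keeping the OS choice in one configuration value means the action and the rule stay identical on every platform; only the executer behind the worker changes.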

OCR Action: Configuration Linux

alfresco-global.properties

# local ocr program
ocr.command=/usr/local/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
# rotating, cleaning, languages…
ocr.extra.commands=-lang spa
ocr.server.os=linux
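To make the role of each property concrete, here is a hedged Python sketch that parses the Linux settings and assembles a pdfsandwich command line from them. The argument order (options first, then the output flag and target, then the source file) is an assumption based on pdfsandwich's usual `pdfsandwich [options] file.pdf` form; the addon may compose the command differently. `parse_properties` and `build_command` are hypothetical helper names.

```python
import shlex
from typing import Dict, List

PROPERTIES = """\
# local ocr program
ocr.command=/usr/local/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
# rotating, cleaning, languages...
ocr.extra.commands=-lang spa
ocr.server.os=linux
"""


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def build_command(props: Dict[str, str], source: str, target: str) -> List[str]:
    """Assumed order: program, extra options, output flag, target, source."""
    argv = [props["ocr.command"]]
    argv += shlex.split(props.get("ocr.extra.commands", ""))
    argv += [props["ocr.output.file.prefix.command"], target, source]
    return argv


props = parse_properties(PROPERTIES)
print(build_command(props, "scan.pdf", "scan-ocr.pdf"))
```

Under these assumptions, `ocr.extra.commands` is the place to pass rotation, cleaning and language options without touching the addon code.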

OCR Action: Configuration Windows

alfresco-global.properties

# local ocr service
ocr.url=http://localhost:60064/api/OCR
ocr.output.verbose=true
# rotating, cleaning, languages…
ocr.extra.commands=Spanish
ocr.server.os=windows

OCR Action: Rule configuration

Only apply for foreground

OCR Action: Rule configuration

SYNCHRONOUS

ASYNCHRONOUS
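An Alfresco rule can run its action in the foreground or in the background, which is the difference the two labels above point at. The Python sketch below only illustrates that trade-off; `fake_ocr` and `upload` are hypothetical names, and the short sleep stands in for a real OCR run.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=1)


def fake_ocr(name: str) -> str:
    """Stand-in for a slow OCR run."""
    time.sleep(0.05)
    return f"{name} (searchable)"


def upload(name: str, background: bool) -> "Future[str]":
    """Return after OCR has finished (synchronous) or right away (asynchronous)."""
    if background:
        # Asynchronous: the caller continues while OCR runs in a worker thread.
        return pool.submit(fake_ocr, name)
    # Synchronous: the caller waits here until OCR has finished.
    done: "Future[str]" = Future()
    done.set_result(fake_ocr(name))
    return done


sync_result = upload("invoice.pdf", background=False)
pending = upload("contract.pdf", background=True)
print(sync_result.result())
print(pending.result(timeout=5))
```

With synchronous execution the searchable PDF exists as soon as the upload returns, at the cost of a slower upload; with asynchronous execution the upload is fast and the OCR result appears later.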

OCR Action: Results

What else?

• Study different original documents
  – Existing (incorrect) text layer
  – Image resolution below 200 dpi
  – Landscape/portrait orientation
  – Paper size may change
• Plain OCR software is not enough

* Image coming from http://www.tobias-elze.de/pdfsandwich/
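The checks listed above can be expressed as a small pre-flight review of each page. The Python sketch below is illustrative only: `PageInfo`, `effective_dpi` and `review` are hypothetical names, and the page measurements are assumed to come from whatever PDF library is in use. The resolution is derived from the scanned image's pixel width and the page width in PDF points (72 points per inch).

```python
from dataclasses import dataclass
from typing import List

POINTS_PER_INCH = 72.0  # PDF user-space unit
MIN_DPI = 200.0         # threshold quoted on the slide


@dataclass
class PageInfo:
    width_pt: float       # page width in PDF points
    height_pt: float      # page height in PDF points
    image_width_px: int   # pixel width of the scanned image
    has_text_layer: bool  # True if the PDF already carries a text layer


def effective_dpi(page: PageInfo) -> float:
    """Pixels per inch of the scan when placed across the full page width."""
    return page.image_width_px / (page.width_pt / POINTS_PER_INCH)


def review(page: PageInfo) -> List[str]:
    """Collect the conditions that deserve attention before running OCR."""
    notes: List[str] = []
    if page.has_text_layer:
        notes.append("existing text layer")
    if effective_dpi(page) < MIN_DPI:
        notes.append("low resolution")
    notes.append("landscape" if page.width_pt > page.height_pt else "portrait")
    return notes


# An A4 portrait page (595 x 842 pt) scanned 1240 pixels wide is about 150 dpi.
a4_scan = PageInfo(595.0, 842.0, 1240, False)
print(round(effective_dpi(a4_scan)), review(a4_scan))
```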

OCR Software: Mac OS X

https://github.com/jbarlow83/OCRmyPDF

• Generates a searchable PDF/A file from a regular PDF
• Keeps the exact resolution of the original images
• Keeps file size about the same
• Deskews and/or cleans the image before performing OCR
• Uses Tesseract OCR engine
• Open Source and developed with Python 3
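As an illustration of the deskew and clean options mentioned above, this Python sketch builds an OCRmyPDF command line. The `-l`, `--deskew` and `--clean` options are part of the OCRmyPDF command-line interface; the helper name `ocrmypdf_argv` and the file names are hypothetical, and the command is only assembled here, not executed.

```python
from typing import List


def ocrmypdf_argv(source: str, target: str, language: str = "eng",
                  deskew: bool = True, clean: bool = False) -> List[str]:
    """Assemble an ocrmypdf invocation: options first, then input and output."""
    argv = ["ocrmypdf", "-l", language]
    if deskew:
        argv.append("--deskew")  # straighten skewed scans before OCR
    if clean:
        argv.append("--clean")   # clean the page image before OCR
    argv += [source, target]
    return argv


argv = ocrmypdf_argv("scan.pdf", "scan-ocr.pdf")
print(" ".join(argv))
```

The resulting list can be handed to `subprocess.run(argv, check=True)` on a machine where OCRmyPDF is installed.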

OCR Software: Linux

http://www.tobias-elze.de/pdfsandwich/

• Generates "sandwich" OCR PDF files
• Recognizes page layout (even for multicolumn)
• Uses unpaper, convert, gs and tesseract
• Open Source and developed using OCaml

OCR Software: Windows

https://github.com/Xandroid4Net/CommandLineOcr (non final)

• Windows.Media.Ocr
  – Microsoft API runnable in Windows 8 and Windows 2012
  – Native in Windows 10 and Windows 2016

OCR Software: Hosted services

https://ocr.space/OCRAPI
http://www.ocrwebservice.com/api/restguide
http://www.bitocr.com/documentation.html
…
https://cloud.google.com/vision/

Real world use case (1)

OS         Ubuntu 14.04 LTS
Version    Alfresco 5.0.d
OCR soft   pdfsandwich
Languages  eng+spa+cat+fra

Real world use case (2)

OS         Ubuntu 15.10
Version    Alfresco 5.0.d
OCR soft   OCRmyPDF
Language   eng

Real world use case (3)

OS         Windows Server 2012 R2
Version    Alfresco 5.1.e
OCR soft   Windows.Media.Ocr
Language   Spanish

Open Source OCR addon

https://github.com/keensoft/alfresco-simple-ocr

License                LGPL v3.0
State                  Production
Languages (interface)  English, Brazilian Portuguese, German and Spanish
Languages (OCR)        39/25

“No original Alfresco resources have been overwritten”
https://github.com/OrderOfTheBee/addons/wiki/Inclusion-criteria-overview

OCR: Recap

• Automatically generates searchable PDFs from image PDFs
• Open Source addon for Alfresco available
• Minimal configuration required
• Different Open Source Linux programs available
• Microsoft also provides the Windows.Media.Ocr library

Resources

GitHub   http://github.com/keensoft/alfresco-simple-ocr
Twitter  @AngelBorroy
Blog     http://www.keensoft.es/en/category/blog-en/
         http://angelborroy.wordpress.com
