Integrating a simple OCR in Alfresco Angel Borroy developer @ keensoft

Integrating a simple OCR in Alfresco · beecon.orderofthebee.net/2016/assets/data/files/20160125005/BeeC… · OCR for the Community • Open Source • No other server than Alfresco

Integrating a simple OCR in Alfresco

Angel Borroy, developer @ keensoft

OCR for the Enterprise

• Minimum license starting at 100,000 documents/year
• Dedicated server required
• Hard learning curve
  – Regular expressions
  – Templates and workflows
  – Proprietary integration

OCR for the Community

• Open Source
• No other server than Alfresco
• No learning curve, just drop off your documents in a folder and get searchable PDFs
• Every hosting OS is supported

Building a simple OCR Action

1 REPO AMP

• Content model (simple)
• Action
• Transformer

OCR Action: Key classes

<bean id="ocr-extract"
      class="es.keensoft.alfresco.ocr.OCRExtractAction"
      parent="action-executer" init-method="init">
    <property name="ocrTransformWorker" ref="transformer.worker.OCR" />
</bean>

<bean id="transformer.worker.OCR"
      class="es.keensoft.alfresco.ocr.OCRTransformWorker">
    <property name="serverOS" value="${ocr.server.os}" />
    <!-- The two executer definitions are abbreviated on the slide -->
    <property name="executerWindows">…</property>
    <property name="executerLinux">…</property>
</bean>
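The worker bean receives a `serverOS` value together with one executer per platform. As a rough illustration of that dispatch, here is a minimal sketch in Python (the add-on itself is Java). The class `OcrWorker` and the functions `run_linux` and `run_windows` are hypothetical stand-ins and are not taken from the add-on's source.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical executers: each one would run the platform-specific OCR step.
Executer = Callable[[str, str], str]


def run_linux(source: str, target: str) -> str:
    # Stand-in for launching a local OCR program.
    return f"linux: {source} -> {target}"


def run_windows(source: str, target: str) -> str:
    # Stand-in for calling a local OCR service.
    return f"windows: {source} -> {target}"


@dataclass
class OcrWorker:
    """Picks an executer according to the configured server OS."""

    server_os: str
    executers: Dict[str, Executer]

    def transform(self, source: str, target: str) -> str:
        key = self.server_os.strip().lower()
        if key not in self.executers:
            raise ValueError(f"Unsupported ocr.server.os value: {self.server_os!r}")
        return self.executers[key](source, target)


if __name__ == "__main__":
    worker = OcrWorker("linux", {"linux": run_linux, "windows": run_windows})
    print(worker.transform("scan.pdf", "scan_ocr.pdf"))
```

Keeping the platform choice in a single property means the rest of the action does not need to know which OCR backend is installed.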

OCR Action: Configuration Linux

alfresco-global.properties

# local ocr program
ocr.command=/usr/local/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
# rotating, cleaning, languages…
ocr.extra.commands=-lang spa
ocr.server.os=linux
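To make the role of each Linux property more concrete, the following Python sketch parses the properties shown on the slide and assembles a command line from them. The argument order (program, extra options, input file, output flag, output file) is an assumption made for illustration, and `parse_properties` and `build_command` are hypothetical helper names, not part of the add-on.

```python
import shlex

# Property values copied from the slide.
PROPERTIES = """\
# local ocr program
ocr.command=/usr/local/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
ocr.extra.commands=-lang spa
ocr.server.os=linux
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def build_command(props, source, target):
    """Assemble an argument list; the ordering here is an assumption."""
    return [
        props["ocr.command"],
        *shlex.split(props.get("ocr.extra.commands", "")),
        source,
        props["ocr.output.file.prefix.command"],
        target,
    ]


if __name__ == "__main__":
    command = build_command(parse_properties(PROPERTIES), "scan.pdf", "scan_ocr.pdf")
    print(" ".join(command))
```

Building an argument list rather than a single shell string avoids quoting problems when file names contain spaces.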

OCR Action: Configuration Windows

alfresco-global.properties

# local ocr service
ocr.url=http://localhost:60064/api/OCR
ocr.output.verbose=true
# rotating, cleaning, languages…
ocr.extra.commands=Spanish
ocr.server.os=windows

OCR Action: Rule configuration

Only apply for foreground

OCR Action: Rule configuration

SYNCHRONOUS

ASYNCHRONOUS

OCR Action: Results

What else?

• Study different original documents
  – Existing (incorrect) text layer
  – Image resolution below 200 dpi
  – Landscape / portrait orientation
  – Paper size may change
• Plain OCR software is not enough
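The 200 dpi figure mentioned above can be checked with simple arithmetic: a PDF page is measured in points (72 per inch), so the effective resolution of a full-page scan is its pixel width divided by the page width in inches. The sketch below is only an illustration of that calculation. The function names are hypothetical, and it assumes the scanned image covers the full page width.

```python
POINTS_PER_INCH = 72.0  # PDF user space unit: 1 point = 1/72 inch


def effective_dpi(pixel_width, page_width_points):
    """Resolution of an image stretched across the full page width."""
    if page_width_points <= 0:
        raise ValueError("page width must be positive")
    return pixel_width / (page_width_points / POINTS_PER_INCH)


def below_threshold(pixel_width, page_width_points, threshold_dpi=200.0):
    """True when the scan falls under the resolution named on the slide."""
    return effective_dpi(pixel_width, page_width_points) < threshold_dpi


if __name__ == "__main__":
    # A4 is roughly 595 points wide.
    for pixels in (1240, 2480):
        print(pixels, round(effective_dpi(pixels, 595)), below_threshold(pixels, 595))
```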

* Image coming from http://www.tobias-elze.de/pdfsandwich/

OCR Software: Mac OS X

https://github.com/jbarlow83/OCRmyPDF

• Generates a searchable PDF/A file from a regular PDF
• Keeps the exact resolution of the original images
• Keeps file size about the same
• Deskews and/or cleans the image before performing OCR
• Uses Tesseract OCR engine
• Open Source and developed with Python 3

OCR Software: Linux

http://www.tobias-elze.de/pdfsandwich/

• Generates "sandwich" OCR PDF files
• Recognizes page layout (even for multicolumn)
• Uses unpaper, convert, gs and tesseract
• Open Source and developed using OCaml
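Since pdfsandwich relies on several external programs, it can help to confirm they are on the `PATH` of the Alfresco host before configuring the action. The following Python sketch does that with the standard library. The tool list comes from the slide, while the helper name `missing_tools` is hypothetical.

```python
import shutil

# Programs named on the slide, plus pdfsandwich itself.
REQUIRED_TOOLS = ["pdfsandwich", "unpaper", "convert", "gs", "tesseract"]


def missing_tools(tools):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All OCR prerequisites found")
```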

OCR Software: Windows

https://github.com/Xandroid4Net/CommandLineOcr (non final)

• Windows.Media.Ocr
  – Microsoft API runnable in Windows 8 and Windows 2012
  – Native in Windows 10 and Windows 2016

OCR Software: Hosted services

https://ocr.space/OCRAPI
http://www.ocrwebservice.com/api/restguide
http://www.bitocr.com/documentation.html
…
https://cloud.google.com/vision/

Real world use case (1)

OS         Ubuntu 14.04 LTS
Version    Alfresco 5.0.d
OCR soft   pdfsandwich
Languages  eng+spa+cat+fra

Real world use case (2)

OS         Ubuntu 15.10
Version    Alfresco 5.0.d
OCR soft   OCRmyPDF
Language   eng

Real world use case (3)

OS         Windows Server 2012 R2
Version    Alfresco 5.1.e
OCR soft   Windows.Media.Ocr
Language   Spanish

Open Source OCR addon

https://github.com/keensoft/alfresco-simple-ocr

License                LGPL v3.0
State                  Production
Languages (interface)  English, Brazilian Portuguese, German and Spanish
Languages (OCR)        39 / 25

“No original Alfresco resources have been overwritten”
https://github.com/OrderOfTheBee/addons/wiki/Inclusion-criteria-overview

OCR: Recap

• Automatically generates searchable PDFs from image PDFs
• Open Source addon for Alfresco available
• Minimal configuration required
• Different Open Source Linux programs available
• Microsoft also provides the Windows.Media.Ocr library

Resources

GitHub   http://github.com/keensoft/alfresco-simple-ocr
Twitter  @AngelBorroy
Blog     http://www.keensoft.es/en/category/blog-en/
         http://angelborroy.wordpress.com

Integrating a simple OCR in Alfresco

Angel Borroy, developer @ keensoft