Integrating a simple OCR in Alfresco
Angel Borroy, developer @ keensoft
OCR for the Enterprise
• Minimum license starting at 100,000 documents/year
• Dedicated server required
• Steep learning curve
  – Regular expressions
  – Templates and workflows
  – Proprietary integration
OCR for the Community
• Open Source
• No other server than Alfresco
• No learning curve: just drop your documents in a folder and get searchable PDFs
• Every hosting OS is supported
Building a simple OCR Action
1 REPO AMP
• Content model (simple)
• Action
• Transformer
OCR Action: Key classes
<bean id="ocr-extract" class="es.keensoft.alfresco.ocr.OCRExtractAction"
      parent="action-executer" init-method="init">
    <property name="ocrTransformWorker" ref="transformer.worker.OCR" />
</bean>
<bean id="transformer.worker.OCR" class="es.keensoft.alfresco.ocr.OCRTransformWorker">
    <property name="serverOS" value="${ocr.server.os}" />
    <property name="executerWindows">
    <property name="executerLinux">
</bean>
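The worker bean above receives the server OS plus one executer per platform. A minimal sketch of that dispatch idea follows, in Python rather than the addon's Java. The class name `OCRTransformWorkerSketch`, the `transform` method and the executer signature are hypothetical and chosen for illustration only; the addon's real implementation may differ.

```python
from typing import Callable, Dict

# Hypothetical executer type: takes (source_pdf, target_pdf) and returns nothing.
Executer = Callable[[str, str], None]


class OCRTransformWorkerSketch:
    """Illustrative stand-in for the worker bean, not the addon's actual code."""

    def __init__(self, server_os: str, executer_linux: Executer,
                 executer_windows: Executer) -> None:
        # Mirrors the serverOS, executerLinux and executerWindows properties.
        self.server_os = server_os.strip().lower()
        self._executers: Dict[str, Executer] = {
            "linux": executer_linux,
            "windows": executer_windows,
        }

    def transform(self, source_pdf: str, target_pdf: str) -> None:
        # Pick the executer that matches the configured OS, or fail loudly.
        try:
            executer = self._executers[self.server_os]
        except KeyError:
            raise ValueError(
                f"Unsupported ocr.server.os value: {self.server_os!r}"
            ) from None
        executer(source_pdf, target_pdf)


# Usage with stub executers that only record which branch was taken.
calls = []
worker = OCRTransformWorkerSketch(
    server_os="linux",
    executer_linux=lambda src, dst: calls.append(("linux", src, dst)),
    executer_windows=lambda src, dst: calls.append(("windows", src, dst)),
)
worker.transform("scan.pdf", "scan-ocr.pdf")
```

Keeping the executers as injected callables matches the bean wiring shown on the slide: the worker only decides which one to call, and each executer hides how the OCR program or service is actually reached.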
OCR Action: Configuration Linux
alfresco-global.properties
# local ocr program
ocr.command=/usr/local/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
# rotating, cleaning, languages…
ocr.extra.commands=-lang spa
ocr.server.os=linux
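To make the role of each Linux property concrete, here is a rough sketch of how a command line could be assembled from them. The property names and values come from the slide; the helper names `parse_properties` and `build_command`, the file names, and the exact argument order are assumptions, not the addon's documented behavior.

```python
import shlex
from typing import Dict, List

LINUX_PROPERTIES = """\
# local ocr program
ocr.command=/usr/local/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
# rotating, cleaning, languages
ocr.extra.commands=-lang spa
ocr.server.os=linux
"""


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def build_command(props: Dict[str, str], source_pdf: str, target_pdf: str) -> List[str]:
    """Assemble the argument list (assumed order: options, output, then input)."""
    command = [props["ocr.command"]]
    # Extra options such as language, rotation or cleaning flags.
    command += shlex.split(props.get("ocr.extra.commands", ""))
    # The output prefix ("-o" here) precedes the target file name.
    command += [props["ocr.output.file.prefix.command"], target_pdf]
    command.append(source_pdf)
    return command


props = parse_properties(LINUX_PROPERTIES)
print(build_command(props, "scan.pdf", "scan_ocr.pdf"))
```

Passing the result as a list to `subprocess.run(command, check=True)` would avoid shell quoting problems with file names that contain spaces.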
OCR Action: Configuration Windows
alfresco-global.properties
# local ocr service
ocr.url=http://localhost:60064/api/OCR
ocr.output.verbose=true
# rotating, cleaning, languages…
ocr.extra.commands=Spanish
ocr.server.os=windows
OCR Action: Rule configuration
Only apply for foreground
OCR Action: Rule configuration
SYNCHRONOUS
ASYNCHRONOUS
OCR Action: Results
What else?
• Study different original documents
  – Existing (incorrect) text layer
  – Image resolution below 200 dpi
  – Landscape/portrait orientation
  – Paper size may change
• Plain OCR software is not enough
* Image coming from http://www.tobias-elze.de/pdfsandwich/
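The checks listed above can be expressed as a small pre-flight step run before OCR. The sketch below is illustrative only: the `ScannedPage` structure and the function names are hypothetical, and the page measurements would have to come from a PDF library that is not shown here. Only the 200 dpi threshold is taken from the slide.

```python
from dataclasses import dataclass
from typing import List

POINTS_PER_INCH = 72.0  # PDF page sizes are expressed in points
MIN_DPI = 200           # threshold mentioned on the slide


@dataclass
class ScannedPage:
    """Hypothetical summary of one scanned PDF page."""
    width_pt: float        # page width in PDF points
    height_pt: float       # page height in PDF points
    image_width_px: int    # pixel width of the embedded scan
    image_height_px: int   # pixel height of the embedded scan
    has_text_layer: bool   # True if the page already carries (possibly wrong) text


def effective_dpi(page: ScannedPage) -> float:
    """Resolution of the scan: pixels divided by the page size in inches."""
    horizontal = page.image_width_px / (page.width_pt / POINTS_PER_INCH)
    vertical = page.image_height_px / (page.height_pt / POINTS_PER_INCH)
    return min(horizontal, vertical)


def orientation(page: ScannedPage) -> str:
    return "landscape" if page.width_pt > page.height_pt else "portrait"


def preflight(page: ScannedPage) -> List[str]:
    """Return notes about conditions that may need handling before OCR."""
    notes: List[str] = []
    if page.has_text_layer:
        notes.append("existing text layer")
    if effective_dpi(page) < MIN_DPI:
        notes.append("resolution below 200 dpi")
    if orientation(page) == "landscape":
        notes.append("landscape orientation")
    return notes


# An A4 portrait page (595 x 842 pt) scanned at roughly 150 dpi.
low_res_page = ScannedPage(595, 842, 1240, 1754, has_text_layer=False)
print(preflight(low_res_page))
```

A page that returns notes could be routed to extra steps such as rotation, cleaning or removal of the old text layer before the OCR program runs.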
OCR Software: Mac OS X
https://github.com/jbarlow83/OCRmyPDF
• Generates a searchable PDF/A file from a regular PDF
• Keeps the exact resolution of the original images
• Keeps file size about the same
• Deskews and/or cleans the image before performing OCR
• Uses Tesseract OCR engine
• Open Source and developed with Python 3
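As an illustration of the features listed above, the following sketch builds an OCRmyPDF command line with deskewing, cleaning and PDF/A output enabled. The helper name `ocrmypdf_command` and the file names are hypothetical, and the sketch assumes the `ocrmypdf` executable is installed and on the PATH; the command is only built here, not executed.

```python
from typing import List, Sequence


def ocrmypdf_command(source_pdf: str, target_pdf: str,
                     languages: Sequence[str] = ("eng",),
                     deskew: bool = True, clean: bool = True) -> List[str]:
    """Build an ocrmypdf argument list: options first, then input and output."""
    command = ["ocrmypdf", "--output-type", "pdfa"]
    # Tesseract language codes are joined with "+", e.g. "eng+spa".
    command += ["-l", "+".join(languages)]
    if deskew:
        command.append("--deskew")
    if clean:
        command.append("--clean")
    command += [source_pdf, target_pdf]
    return command


command = ocrmypdf_command("scan.pdf", "scan_ocr.pdf", languages=("eng", "spa"))
print(" ".join(command))
# To actually run it: subprocess.run(command, check=True)
```

Note that `--clean` relies on the separate `unpaper` tool, so it can be switched off with `clean=False` on hosts where that tool is unavailable.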
OCR Software: Linux
http://www.tobias-elze.de/pdfsandwich/
• Generates "sandwich" OCR PDF files
• Recognizes page layout (even for multicolumn)
• Uses unpaper, convert, gs and tesseract
• Open Source and developed using OCaml
OCR Software: Windows
https://github.com/Xandroid4Net/CommandLineOcr (non final)
• Windows.Media.Ocr
  – Microsoft API runnable in Windows 8 and Windows 2012
  – Native in Windows 10 and Windows 2016
OCR Software: Hosted services
https://ocr.space/OCRAPI
http://www.ocrwebservice.com/api/restguide
http://www.bitocr.com/documentation.html
…
https://cloud.google.com/vision/
Real world use case (1)
OS          Ubuntu 14.04 LTS
Version     Alfresco 5.0.d
OCR soft    pdfsandwich
Languages   eng+spa+cat+fra
Real world use case (2)
OS          Ubuntu 15.10
Version     Alfresco 5.0.d
OCR soft    OCRmyPDF
Language    eng
Real world use case (3)
OS          Windows Server 2012 R2
Version     Alfresco 5.1.e
OCR soft    Windows.Media.Ocr
Language    Spanish
Open Source OCR addon
https://github.com/keensoft/alfresco-simple-ocr
License                 LGPL v3.0
State                   Production
Languages (interface)   English, Brazilian Portuguese, German and Spanish
Languages (OCR)         39/25
“No original Alfresco resources have been overwritten”
https://github.com/OrderOfTheBee/addons/wiki/Inclusion-criteria-overview
OCR: Recap
• Automatically generates searchable PDFs from image-only PDFs
• Open Source addon for Alfresco available
• Minimal configuration required
• Different Open Source Linux programs available
• Microsoft also provides the Windows.Media.Ocr library
Resources
GitHub    http://github.com/keensoft/alfresco-simple-ocr
Twitter   @AngelBorroy
Blog      http://www.keensoft.es/en/category/blog-en/
          http://angelborroy.wordpress.com