IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

Workflow Development for OCR (and beyond)


Workflow Development for OCR (and more) Creating and Communicating Digital Content Conference, 26-27 May 2011, Umea, Sweden.


Page 1: Workflow Development for OCR (and beyond)


Workflow Development for OCR (and beyond)

Clemens Neudecker, KB National Library of the Netherlands

Creating and Communicating Digital Content Conference

Umea, 26 May 2011

Page 2: Workflow Development for OCR (and beyond)


IMPACT – Improving access to text

Funded by the EC as part of the 7th Framework Programme

Coordinated by KB – National Library of the Netherlands

EU funding: € 12 100 000

26 partners: Libraries, Research Institutes, Industry Partners

Start date: 1 January 2008

Duration: 48 Months

2012: Centre of Competence

Project website: www.impact-project.eu

IMPACT blog: http://impactocr.wordpress.com/

Twitter: @impactocr, #impactproject

Join us on LinkedIn!

Page 3: Workflow Development for OCR (and beyond)


A familiar scene?

VVt Venetien den 1.Junij, Anno 1618.

DJgn i f paffato te S' aö'Jifeert mo?üen/bah .)etgi'uotbciraetail)i.r/JtmelchontDecht te /

sbnbe bele btr felbrr geiufttceert baer bnber eeniglje jprant o^fen/bie ftcb .met

beSpaenfcbeu enbeeemgljen bifet Cbeiiupcen berbonbru befe

Page 4: Workflow Development for OCR (and beyond)


OCR: A multitude of challenges…

I. OCR challenges (gothic fonts, bleed-through, warping, etc.)

Page 5: Workflow Development for OCR (and beyond)


OCR: A multitude of challenges…

II. Language challenges (spelling variants, inflection, and many more!)

Example: historical variants of the Dutch word ‘wereld’ (world):

werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
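One common way to handle this kind of variation is a lexicon lookup that maps historical spellings to a modern form. The sketch below is only an illustration and not IMPACT's lexicon tooling: the dictionary holds just a few of the variants listed above, and the names `VARIANT_LEXICON`, `normalise` and `expand_query` are assumptions made for this example.

```python
# Hypothetical mini-lexicon: historical variant -> modern form.
# Only a handful of the variants listed above are included.
VARIANT_LEXICON = {
    "werelt": "wereld",
    "weerelt": "wereld",
    "waereld": "wereld",
    "vvereld": "wereld",
    "werlt": "wereld",
}


def normalise(token: str) -> str:
    """Return the modern form for a known variant, else the token unchanged."""
    return VARIANT_LEXICON.get(token.lower(), token)


def expand_query(modern_form: str) -> set[str]:
    """Return the modern form plus every known variant that maps to it."""
    variants = {v for v, m in VARIANT_LEXICON.items() if m == modern_form}
    return {modern_form} | variants
```

Normalising at indexing time and expanding at query time are two alternatives; which one fits depends on whether the original spelling has to stay searchable.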

Page 6: Workflow Development for OCR (and beyond)


And a multitude of solutions!

22 different ‘tools’ from diverse WPs and developers: OCR (C++, C#), Image Processing & Lexica (DLL), Command Line Tools (Win/Linux), Java, Ruby, PHP, Perl, etc. + 3rd party software!

“One ring to rule them all...”

IMPACT Interoperability Framework (IIF)

Page 7: Workflow Development for OCR (and beyond)


Requirement: Interoperability Framework

Interoperability vs. integration

Web based vs. local installation/platform

Most important: flexible, scalable, user friendly

Java 6

Apache Axis2

Apache Tomcat

Apache Synapse (optional)

Taverna Workflow Engine

Page 8: Workflow Development for OCR (and beyond)


Generic Web Service Wrapper

Only requirement: Command Line Application

HTML form

Available on OPFlabs: https://github.com/openplanets/scape/tree/master/xa-toolwrapper

Minimise integration effort: developers can focus on their application and have to worry less about integration = higher quality software
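The core idea of wrapping a command line application can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual toolwrapper: `ToolDescription`, `run_tool` and the `{input}` placeholder are hypothetical names, the web service layer is left out entirely, and the Python interpreter stands in for a real tool.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ToolDescription:
    """Hypothetical description of a command line tool to be wrapped."""
    name: str
    command: list[str]  # argument template; "{input}" is substituted per call


def run_tool(tool: ToolDescription, input_value: str) -> str:
    """Fill in the template, run the tool, and return its standard output."""
    args = [part.replace("{input}", input_value) for part in tool.command]
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


# Stand-in "tool": the Python interpreter upper-casing its first argument.
upper = ToolDescription(
    name="upper",
    command=[sys.executable, "-c",
             "import sys; print(sys.argv[1].upper())", "{input}"],
)
```

Because the wrapper only needs a name and a command template, the tool itself stays unchanged, which is the point made on the slide about keeping integration effort low.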

Page 9: Workflow Development for OCR (and beyond)


Service Oriented Architecture

Java as programming language = platform independence

Standard Apache components = easy to maintain, well supported

Synapse as enterprise service bus = load balancing & fail over

HTTPS encryption & authentication = secure

Minimise deployment effort: scalability, hot deployment/update
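The fail-over behaviour attributed to the service bus can be illustrated in a few lines. The sketch below shows the general technique only and is not Synapse configuration or API: endpoints are modelled as plain callables, and `call_with_failover` is a hypothetical helper name.

```python
from typing import Callable, Sequence


def call_with_failover(endpoints: Sequence[Callable[[str], str]],
                       payload: str, start: int = 0) -> str:
    """Try endpoints in round-robin order from `start`; return the first success."""
    if not endpoints:
        raise ValueError("no endpoints configured")
    last_error = None
    for offset in range(len(endpoints)):
        endpoint = endpoints[(start + offset) % len(endpoints)]
        try:
            return endpoint(payload)
        except ConnectionError as error:
            last_error = error  # remember the failure and try the next endpoint
    raise ConnectionError("all endpoints failed") from last_error
```

Advancing `start` on each request spreads load across replicas, while the loop covers the case where one of them is down.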

Page 10: Workflow Development for OCR (and beyond)


Workflow development

OCR workflow = data pipeline

Building blocks = processing steps (nodes)

Integration = interaction between nodes (mashup)

Maximise usability
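The view of a workflow as a data pipeline of nodes can be sketched as function composition. This is an illustration only: the two nodes are trivial text-cleaning placeholders standing in for real processing steps, and `pipeline`, `strip_noise` and `collapse_spaces` are names assumed for this example.

```python
from typing import Callable, Iterable

Node = Callable[[str], str]  # one processing step: data in, data out


def pipeline(nodes: Iterable[Node]) -> Node:
    """Chain nodes so that each node's output is the next node's input."""
    steps = list(nodes)

    def run(data: str) -> str:
        for step in steps:
            data = step(data)
        return data

    return run


# Placeholder nodes standing in for real tools.
def strip_noise(text: str) -> str:
    return text.replace("~", "")


def collapse_spaces(text: str) -> str:
    return " ".join(text.split())


clean = pipeline([strip_noise, collapse_spaces])
```

Since the result of `pipeline` has the same signature as a single node, a finished pipeline can itself be used as a node in a larger one.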

Page 11: Workflow Development for OCR (and beyond)


Workflow management

Web 2.0 style registry: myExperiment

Local client: Taverna Workbench

Web client: project website

Page 12: Workflow Development for OCR (and beyond)


Workflow registry

Share resources and experience

Rate/tag/comment workflows

Organised in groups

Page 13: Workflow Development for OCR (and beyond)


Workflow modules

“Basic” workflows = each wraps exactly one software tool/web service

Documented inputs/outputs

Page 14: Workflow Development for OCR (and beyond)


Complex workflows

Tool/data pipeline

Easily derived from workflow modules

Task/goal oriented

Reusable
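How complex workflows can be derived from modules with documented inputs and outputs is sketched below. It is a simplified model and not Taverna's data model: `Module`, `compose` and the string port types are assumptions made for illustration, and the two modules are placeholders for real tools.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Module:
    """A basic workflow: wraps one tool and documents its input and output."""
    name: str
    input_type: str
    output_type: str
    run: Callable[[object], object]


def compose(*modules: Module) -> Module:
    """Build a complex workflow, checking that adjacent ports are compatible."""
    if not modules:
        raise ValueError("at least one module is required")
    for left, right in zip(modules, modules[1:]):
        if left.output_type != right.input_type:
            raise TypeError(
                f"{left.name} outputs {left.output_type}, "
                f"but {right.name} expects {right.input_type}"
            )

    def run_all(value: object) -> object:
        for module in modules:
            value = module.run(value)
        return value

    return Module(
        name=" -> ".join(m.name for m in modules),
        input_type=modules[0].input_type,
        output_type=modules[-1].output_type,
        run=run_all,
    )


# Placeholder modules; real ones would call image processing or OCR services.
tokenise = Module("tokenise", "text", "tokens", lambda text: text.split())
count = Module("count", "tokens", "number", lambda tokens: len(tokens))
workflow = compose(tokenise, count)
```

The composed workflow is again a `Module`, so it can be shared and reused as a building block in the same way as the basic ones.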

Page 15: Workflow Development for OCR (and beyond)


Local client: Taverna Workbench

http://www.taverna.org.uk/

Background: BioSciences

Developed and maintained by myGrid, UK

Available for Windows/Linux/OSX and as open source

Funding secured until 2014

Page 16: Workflow Development for OCR (and beyond)


Web client: Taverna Server/Workflow Parser

SOAP/REST API

Remote execution of workflows (webapp)

Page 17: Workflow Development for OCR (and beyond)


Use case: Workflows for Evaluation

Tool A vs Tool B (Tool A(v1) vs Tool A(v2))

Workflow X (Tool A + Tool B) vs Workflow Y (Tool A + Tool C)

Workflow X vs previously digitised material

Users identify optimal workflow for source material/project
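A comparison of this kind needs a score per tool or workflow. The sketch below assumes that ground-truth text is available and uses character accuracy based on Levenshtein edit distance, a widely used OCR measure; the slides do not say which metric the project applies, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """1 minus edit distance per ground-truth character, floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))


def best_candidate(outputs: dict[str, str], ground_truth: str) -> str:
    """Return the name of the tool or workflow with the highest accuracy."""
    return max(outputs,
               key=lambda name: character_accuracy(outputs[name], ground_truth))
```

Running every candidate on the same sample pages and ranking them with `best_candidate` mirrors the Tool A vs Tool B and Workflow X vs Workflow Y comparisons listed above.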

Page 18: Workflow Development for OCR (and beyond)


Other examples

Workflows for Digitisation: IMPACT

Workflows for Linguistic Analysis: CLARIN

Workflows for Preservation: SCAPE

Interface for automatic storage of results, based on DAV, realised as a workflow module (native beanshell support)

And there are many more…

Page 19: Workflow Development for OCR (and beyond)


Benefits & Outlook

Modular

Transparent

Expandable

Scalable

Platform independent

User friendly

Growing interest in workflow management in the cultural heritage (CH) sector

Easy to set up, deploy, free (open source)

Domain independent

Page 20: Workflow Development for OCR (and beyond)


Thank you! Questions?