Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

SCAPE

Sven Schlarb, Austrian National Library

Keeping Control: Scalable Preservation Environments for Identification and Characterisation
Guimarães, Portugal, 07/12/2012

Large scale preservation workflows with Taverna


What do you mean by "workflow"?

• Data flow rather than control flow
• (Semi-)automated data processing pipeline
• Defined inputs and outputs
• Modular and reusable processing units
• Easy to deploy, execute, and share

Modularise complex preservation tasks

• Assumption: complex preservation tasks can be separated into processing steps
• Together the steps represent the automated processing pipeline

Migrate → Characterise → Quality Assurance → Ingest
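The separation into steps can be sketched as a chain of functions, where the output of one step is the input of the next. This is a minimal illustration only: the four step functions and the record fields are hypothetical placeholders, not SCAPE components.

```python
from functools import reduce

# Hypothetical stand-ins for the four processing steps. Each one takes a
# record (a dict) and returns an updated copy.
def migrate(record):
    return {**record, "format": "jp2"}

def characterise(record):
    return {**record, "characterised": True}

def quality_assurance(record):
    return {**record, "qa_passed": record["format"] == "jp2"}

def ingest(record):
    return {**record, "ingested": record["qa_passed"]}

PIPELINE = [migrate, characterise, quality_assurance, ingest]

def run_pipeline(record, steps=PIPELINE):
    """Feed the output of each step into the next one."""
    return reduce(lambda rec, step: step(rec), steps, record)

result = run_pipeline({"id": "Z119585409/00000001", "format": "tiff"})
```

Because every step has the same input and output shape, steps can be reordered, replaced, or reused in other pipelines.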


Experimental workflow development

• Easy to execute a workflow on standard platforms from anywhere
• Experimental data available online or downloadable
• Reproducible experiment results
• Workflow development as a community activity

Taverna

• Workflow language and computational model for creating composite data-intensive processing chains

• Developed since 2004 as a tool for life scientists and bio-informaticians by myGrid, University of Manchester, UK

• Available for Windows/Linux/OSX and as open source (LGPL)


SCUFL/T2FLOW/SCUFL2

• Alternative to other workflow description languages, such as the Business Process Execution Language (BPEL)

• SCUFL2 is Taverna's new workflow specification language (Taverna 3), workflow bundle format, and Java API

• SCUFL2 will replace the t2flow format (which replaced the SCUFL format)

• Adopts Linked Data technology


Creating workflows using Taverna

• Users interactively build data processing pipelines
• Set of nodes represents data processing elements
• Nodes are connected by directed edges and the workflow itself is a directed graph
• Nodes can have multiple inputs and outputs
• Workflows can contain other (embedded) workflows
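The directed-graph view of a workflow can be illustrated with Python's standard `graphlib` module (Python 3.9 or later). The node names below are hypothetical, and the sketch only shows how a valid execution order follows from the edges. It is not how Taverna schedules processors.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each node maps to the set of nodes it depends on.
workflow = {
    "read_input": set(),
    "identify": {"read_input"},
    "extract_metadata": {"read_input"},
    "report": {"identify", "extract_metadata"},
}

def execution_order(graph):
    """Return one valid order in which the nodes can be executed."""
    return list(TopologicalSorter(graph).static_order())

order = execution_order(workflow)
```

Nodes without a dependency between them, such as `identify` and `extract_metadata`, may appear in either order and could run in parallel.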


Processors

• Web service clients (SOAP/REST)
• Local scripts (R and Beanshell languages)
• Remote shell script invocations via ssh (Tool)
• XML splitters - XSLT (interoperability!)

List handling: Implicit iteration over multiple inputs

• A "single value" input port (list depth 0) that receives a list processes its values iteratively (foreach)
• A flat value list has list depth 1
• List depth > 1 for tree structures
• Multiple input ports with lists are combined as cross product or dot product
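The difference between the two combination strategies can be shown with plain Python lists. This is a sketch of the semantics only, with made-up port values, and not Taverna's implementation.

```python
from itertools import product

# Hypothetical values arriving at two input ports.
files = ["a.jp2", "b.jp2"]
tools = ["exiftool", "tika"]

# Cross product: every value of one port is combined with every value of the other.
cross = list(product(files, tools))

# Dot product: values are paired by position, so both lists need the same length.
dot = list(zip(files, tools))
```

With two lists of length 2, the cross product triggers four invocations of the processor and the dot product two.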


Example: Tika Preservation Component

• Input: "file"
• Processor: Tika web service (SOAP)
• Output: MIME type

Workflow development and execution

• Local development: Taverna Workbench

Workflow registry

• Web 2.0 style registry: myExperiment

Remote Workflow Execution

• Web client using REST API of Taverna Server

Hadoop

• Open source implementation of MapReduce (Dean & Ghemawat, Google, 2004)

• Hadoop = MapReduce + HDFS
• HDFS: Distributed file system, data stored in 64MB (default) blocks
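As a worked example of what the block size means, the number of blocks for a file can be computed as below. The sketch assumes that MB and GB are meant as binary units (2^20 and 2^30 bytes). The 997 GB figure is the uncompressed sequence file size reported later in these slides.

```python
BLOCK_SIZE = 64 * 1024 ** 2  # 64 MB default block size in bytes (binary units assumed)

def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of HDFS blocks needed for a file of the given size (ceiling division)."""
    return -(-file_size_bytes // block_size)

one_gb_blocks = block_count(1024 ** 3)               # 16 blocks
sequence_file_blocks = block_count(997 * 1024 ** 3)  # 15952 blocks
```

Each of these blocks is a unit that the name node can place on a different data node.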


Hadoop

• Job tracker (master) manages job execution on task trackers (workers)

• Each machine is configured to dedicate processing cores to MapReduce tasks (each core is a worker)

• Name node manages HDFS, i.e. distribution of data blocks on data nodes


Hadoop job building blocks

Map/reduce application (JAR)

• Job configuration: Set or overwrite configuration parameters.
• Map method: Create intermediate key/value pair output
• Reduce method: Aggregate intermediate key/value pair output from map
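How the map and reduce methods interact can be imitated in a few lines of plain Python. This is an in-memory sketch without Hadoop; `run_job`, `map_extension`, and `reduce_count` are illustrative names, and the file names are made up.

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    """Minimal in-memory imitation of a MapReduce job (no Hadoop involved)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):  # map: emit intermediate key/value pairs
            groups[key].append(value)       # shuffle: group the values by key
    # reduce: aggregate the grouped values of each key
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def map_extension(path):
    """Emit (file extension, 1) for one input path."""
    yield path.rsplit(".", 1)[-1], 1

def reduce_count(key, values):
    """Sum the counts emitted for one key."""
    return sum(values)

counts = run_job(["a.jp2", "b.html", "c.jp2"], map_extension, reduce_count)
```

In a real job the grouping step is done by the framework across the cluster; only the map and reduce methods are written by the developer.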


Large scale execution environment

[Figure: Apache Tomcat web application with Taverna Server (REST API), Hadoop Jobtracker, file server, and cluster]

Example: Characterisation on a large document collection

• Using "Tool" service, remote ssh execution
• Orchestration of Hadoop jobs (Hadoop Streaming API, Hadoop Map/Reduce, and Hive)
• Available on myExperiment: http://www.myexperiment.org/workflows/3105
• See blog post: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna

Reading image metadata

Create a text file containing the JPEG2000 input file paths and read the image metadata using Exiftool via the Hadoop Streaming API.

Input file paths (find):

/NAS/Z119585409/00000001.jp2
/NAS/Z119585409/00000002.jp2
/NAS/Z119585409/00000003.jp2
…
/NAS/Z117655409/00000001.jp2
/NAS/Z117655409/00000002.jp2
/NAS/Z117655409/00000003.jp2
…
/NAS/Z119585987/00000001.jp2
/NAS/Z119585987/00000002.jp2
/NAS/Z119585987/00000003.jp2
…
/NAS/Z119584539/00000001.jp2
/NAS/Z119584539/00000002.jp2
/NAS/Z119584539/00000003.jp2
…
/NAS/Z119599879/00000001.jp2
/NAS/Z119589879/00000002.jp2
/NAS/Z119589879/00000003.jp2
...
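Creating such a path list can be sketched in Python with `os.walk`. The slides only name `find` for this step, so the functions below are an assumed equivalent and not the Jp2PathCreator component itself.

```python
import os

def list_paths(root, extension=".jp2"):
    """Yield the paths of all files below root that end with extension."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit sub-directories in a stable order
        for name in sorted(filenames):
            if name.endswith(extension):
                yield os.path.join(dirpath, name)

def write_path_list(root, target, extension=".jp2"):
    """Write one path per line to target and return the number of paths written."""
    count = 0
    with open(target, "w", encoding="utf-8") as out:
        for path in list_paths(root, extension):
            out.write(path + "\n")
            count += 1
    return count
```

A text file with one path per line is a convenient job input, because each line can be handed to a mapper as one record.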

Workflow components: Jp2PathCreator, HadoopStreamingExiftoolRead

• Reading files from NAS
• Data sizes: 1.4 GB, 1.2 GB
• Runtime: ~ 5 h + ~ 38 h = ~ 43 h
• Collection: 60,000 books, 24 million pages

Output (page identifier and image width):

Z119585409/00000001 2345
Z119585409/00000002 2340
Z119585409/00000003 2543
…
Z117655409/00000001 2300
Z117655409/00000002 2300
Z117655409/00000003 2345
…
Z119585987/00000001 2300
Z119585987/00000002 2340
Z119585987/00000003 2432
…
Z119584539/00000001 5205
Z119584539/00000002 2310
Z119584539/00000003 2134
…
Z119599879/00000001 2312
Z119589879/00000002 2300
Z119589879/00000003 2300
...
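A Hadoop Streaming mapper reads input records line by line and writes tab-separated key/value pairs. The sketch below shows what such a mapper could look like for this step. It is an assumption about the approach and not the HadoopStreamingExiftoolRead component: the key layout is inferred from the sample output, and the Exiftool call (`-s3 -ImageWidth`) is one possible way to obtain the width.

```python
import subprocess

def key_for(path):
    """Turn /NAS/Z119585409/00000001.jp2 into Z119585409/00000001."""
    parts = path.strip().split("/")
    return parts[-2] + "/" + parts[-1].rsplit(".", 1)[0]

def exiftool_width(path):
    """Assumed Exiftool call: -s3 prints only the value of the requested tag."""
    result = subprocess.run(
        ["exiftool", "-s3", "-ImageWidth", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def run_mapper(lines, out, read_width=exiftool_width):
    """Read one path per line and write one key<TAB>width line per path."""
    for line in lines:
        path = line.strip()
        if path:  # skip empty lines
            out.write(f"{key_for(path)}\t{read_width(path)}\n")
```

Used as a streaming mapper, the script would call `run_mapper(sys.stdin, sys.stdout)`. The `read_width` parameter exists so the logic can be tried without Exiftool installed.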


SequenceFile creation

Create a text file containing the HTML input file paths and create one sequence file with the complete file content in HDFS.

Input file paths (find):

/NAS/Z119585409/00000707.html
/NAS/Z119585409/00000708.html
/NAS/Z119585409/00000709.html
…
/NAS/Z138682341/00000707.html
/NAS/Z138682341/00000708.html
/NAS/Z138682341/00000709.html
…
/NAS/Z178791257/00000707.html
/NAS/Z178791257/00000708.html
/NAS/Z178791257/00000709.html
…
/NAS/Z967985409/00000707.html
/NAS/Z967985409/00000708.html
/NAS/Z967985409/00000709.html
…
/NAS/Z196545409/00000707.html
/NAS/Z196545409/00000708.html
/NAS/Z196545409/00000709.html
...

Workflow components: HtmlPathCreator, SequenceFileCreator

• Reading files from NAS
• Data sizes: 1.4 GB, 997 GB (uncompressed)
• Runtime: ~ 5 h + ~ 24 h = ~ 29 h
• Collection: 60,000 books, 24 million pages

Sequence file keys:

Z119585409/00000707
Z119585409/00000708
Z119585409/00000709
Z119585409/00000710
Z119585409/00000711
Z119585409/00000712
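The idea behind a sequence file is to pack many small files into one large container of key/value records, so that the cluster reads a few big files instead of millions of small ones. The sketch below illustrates the principle with length-prefixed records. It does not produce the actual Hadoop SequenceFile format, and the function names are illustrative.

```python
import struct

HEADER = ">II"  # two unsigned 32-bit integers: key length and content length

def write_records(out, records):
    """Append (key, content) pairs to a binary stream as length-prefixed records."""
    for key, content in records:
        key_bytes = key.encode("utf-8")
        out.write(struct.pack(HEADER, len(key_bytes), len(content)))
        out.write(key_bytes)
        out.write(content)

def read_records(source):
    """Yield (key, content) pairs from a stream written by write_records."""
    header_size = struct.calcsize(HEADER)
    while True:
        header = source.read(header_size)
        if len(header) < header_size:
            break  # end of the container
        key_len, content_len = struct.unpack(HEADER, header)
        key = source.read(key_len).decode("utf-8")
        yield key, source.read(content_len)
```

The key plays the role of the page identifier, and the value holds the complete content of one HTML file.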


HTML Parsing

Execute a Hadoop MapReduce job on the sequence file created before in order to calculate the average paragraph block width.

Workflow component: HadoopAvBlockWidthMapReduce (input: SequenceFile, output: Textfile)

• Runtime: ~ 6 h
• Collection: 60,000 books, 24 million pages

Input keys:

Z119585409/00000001
Z119585409/00000002
Z119585409/00000003
Z119585409/00000004
Z119585409/00000005
...

Map output:

Z119585409/00000001 2100
Z119585409/00000001 2200
Z119585409/00000001 2300
Z119585409/00000001 2400
Z119585409/00000002 2100
Z119585409/00000002 2200
Z119585409/00000002 2300
Z119585409/00000002 2400
Z119585409/00000003 2100
Z119585409/00000003 2200
Z119585409/00000003 2300
Z119585409/00000003 2400
Z119585409/00000004 2100
Z119585409/00000004 2200
Z119585409/00000004 2300
Z119585409/00000004 2400
Z119585409/00000005 2100
Z119585409/00000005 2200
Z119585409/00000005 2300
Z119585409/00000005 2400

Reduce output:

Z119585409/00000001 2250
Z119585409/00000002 2250
Z119585409/00000003 2250
Z119585409/00000004 2250
Z119585409/00000005 2250
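The averaging step can be reproduced in plain Python with the sample values from this slide. The sketch assumes that the block widths of each page have already been parsed from the HTML, which is not shown in the slides, and it uses integer division because the sample values are whole numbers.

```python
from collections import defaultdict

def map_widths(page_id, block_widths):
    """Map: emit one (page_id, width) pair per paragraph block."""
    for width in block_widths:
        yield page_id, width

def reduce_average(page_id, widths):
    """Reduce: average all widths emitted for one page (integer division)."""
    return page_id, sum(widths) // len(widths)

# Sample values from the slide; the parsing of the HTML files is assumed.
pages = {
    "Z119585409/00000001": [2100, 2200, 2300, 2400],
    "Z119585409/00000002": [2100, 2200, 2300, 2400],
}

grouped = defaultdict(list)
for page_id, block_widths in pages.items():
    for key, width in map_widths(page_id, block_widths):
        grouped[key].append(width)  # group the map output by page identifier

averages = dict(reduce_average(key, widths) for key, widths in grouped.items())
```

For each page the four widths sum to 9000, which gives the average of 2250 shown in the reduce output.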


Analytic Queries

Create Hive tables and load the generated data into the Hive database.

Workflow components: HiveLoadExifData & HiveLoadHocrData

• Runtime: ~ 6 h
• Collection: 60,000 books, 24 million pages

htmlwidth:

Z119585409/00000001 1870
Z119585409/00000002 2100
Z119585409/00000003 2015
Z119585409/00000004 1350
Z119585409/00000005 1700

jp2width:

Z119585409/00000001 2250
Z119585409/00000002 2150
Z119585409/00000003 2125
Z119585409/00000004 2125
Z119585409/00000005 2250

CREATE TABLE jp2width (jid STRING, jwidth INT)

CREATE TABLE htmlwidth (hid STRING, hwidth INT)

Analytic Queries

Do a simple Hive query in order to test if the database has been created successfully.

Workflow component: HiveSelect

• Runtime: ~ 6 h
• Collection: 60,000 books, 24 million pages
• Tables: htmlwidth, jp2width

select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid
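The join can be tried out without a Hive installation by using Python's built-in `sqlite3` module. This is a stand-in for illustration: SQLite is not Hive, the column types are adapted to SQLite, and the assignment of the sample values to the two tables is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jp2width (jid TEXT, jwidth INTEGER)")
conn.execute("CREATE TABLE htmlwidth (hid TEXT, hwidth INTEGER)")

# Sample rows; which values belong to which table is assumed.
conn.executemany("INSERT INTO jp2width VALUES (?, ?)", [
    ("Z119585409/00000001", 2250),
    ("Z119585409/00000002", 2150),
])
conn.executemany("INSERT INTO htmlwidth VALUES (?, ?)", [
    ("Z119585409/00000001", 1870),
    ("Z119585409/00000002", 2100),
])

# Same join as in the Hive query, with an ORDER BY for a stable result.
rows = conn.execute(
    "SELECT jid, jwidth, hwidth FROM jp2width "
    "INNER JOIN htmlwidth ON jid = hid ORDER BY jid"
).fetchall()
conn.close()
```

Each result row puts the image width and the average HTML block width of the same page side by side, which is what makes the two measurements comparable.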


Example: Web Archiving


Hands on – Virtual machine

• Hadoop 0.20.2+923.421 in pseudo-distributed configuration
• Chromium web browser with Hadoop admin links
• Taverna Workbench 2.3.0
• NetBeans IDE 7.1.2
• SampleHadoopCommand.txt (executable Hadoop command for DEMO1)
• Latest patches

Hands on – VM setup

• Unpack scape4youTraining.tar.gz
• VirtualBox: Machine => Add => Browse to folder => select VBOX file
• VM instance login:
  • user: scape
  • pw: scape123

Hands on – Demo1

• Using Hadoop for analysing ARC files
• Located at: /example/sampleIN/ (HDFS)
• Execution via command in: SampleHadoopCommand.txt (on Desktop)
• Result can then be found at: /example/sample_OUT/

Hands on – Demo2

• Using Taverna for analysing ARC files
• Workflow: /home/scape/scanARC/scanARC_TIKA.t2flow
• ADD FILE LOCATION (not add value!!)
• Input: /home/scape/scanARC/input/ONBSample.txt
• Result: ~/scanARC/outputCSV/fullTIKAReport.csv

• See ~/scanARC/outputGraphics/ graphicsTIKA/tika-

Technology
