Sven Schlarb Austrian National Library Keeping Control: Scalable Preservation Environments for Identification and Characterisation Guimarães, Portugal, 07/12/2012 Large scale preservation workflows with Taverna

Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

DESCRIPTION

Sven Schlarb of the Austrian National Library gave this introduction to large scale preservation workflows with Taverna at the first SCAPE Training event, ‘Keeping Control: Scalable Preservation Environments for Identification and Characterisation’, in Guimarães, Portugal on 6-7 December 2012.

Page 1: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

SCAPE

Sven Schlarb Austrian National Library

Keeping Control: Scalable Preservation Environments for Identification and Characterisation Guimarães, Portugal, 07/12/2012

Large scale preservation workflows with Taverna

Page 2: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

What do you mean by "Workflow"?

• Data flow rather than control flow
• (Semi-)Automated data processing pipeline
• Defined inputs and outputs
• Modular and reusable processing units
• Easy to deploy, execute, and share

Page 3: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Modularise complex preservation tasks

• Assuming that complex preservation tasks can be separated into processing steps

• Together the steps represent the automated processing pipeline

Example processing steps: Migrate, Characterise, Quality Assurance, Ingest

Page 4: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Experimental workflow development

• Easy to execute a workflow on standard platforms from anywhere
• Experimental data available online or downloadable
• Reproducible experiment results
• Workflow development as a community activity

Page 5: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Taverna

• Workflow language and computational model for creating composite data-intensive processing chains

• Developed since 2004 as a tool for life scientists and bio-informaticians by myGrid, University of Manchester, UK

• Available for Windows/Linux/OSX and as open source (LGPL)

Page 6: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

SCUFL/T2FLOW/SCUFL2

• Alternative to other workflow description languages, such as the Business Process Execution Language (BPEL)

• SCUFL2 is Taverna's new workflow specification language (Taverna 3), workflow bundle format, and Java API

• SCUFL2 will replace the t2flow format (which replaced the SCUFL format)

• Adopts Linked Data technology

Page 7: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Creating workflows using Taverna

• Users interactively build data processing pipelines
• Set of nodes represents data processing elements
• Nodes are connected by directed edges and the workflow itself is a directed graph
• Nodes can have multiple inputs and outputs
• Workflows can contain other (embedded) workflows

Page 8: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Processors

• Web service clients (SOAP/REST)
• Local scripts (R and Beanshell languages)
• Remote shell script invocations via ssh (Tool)
• XML splitters - XSLT (interoperability!)

Page 9: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

List handling: Implicit iteration over multiple inputs

• A "single value" input port (list depth 0) processes values iteratively (foreach)
• A flat value list has list depth 1
• List depth > 1 for tree structures
• Multiple input ports with lists are combined as cross product or dot product
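
The difference between the two combination strategies can be illustrated outside Taverna. The following is a minimal plain-Python sketch, not Taverna code: the function `characterise` and the port values are hypothetical, and the result is simplified to a flat list.

```python
from itertools import product


def characterise(path, tool):
    """Hypothetical single-value processor: one path and one tool name per call."""
    return f"{tool}({path})"


files = ["a.jp2", "b.jp2"]    # list of depth 1 arriving at the first input port
tools = ["tika", "exiftool"]  # list of depth 1 arriving at the second input port

# Cross product: every value of one port is paired with every value of the other.
cross = [characterise(f, t) for f, t in product(files, tools)]

# Dot product: values are paired position by position.
dot = [characterise(f, t) for f, t in zip(files, tools)]

print(cross)
print(dot)
```

With two lists of length two, the cross product invokes the processor four times and the dot product twice.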

Page 10: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Example: Tika Preservation Component

• Input: „file“

• Processor: Tika web service (SOAP)

• Output: Mime-Type

Page 11: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Workflow development and execution

• Local development: Taverna Workbench

Page 12: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Workflow registry

• Web 2.0 style registry: myExperiment

Page 13: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Remote Workflow Execution

• Web client using REST API of Taverna Server

Page 14: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Hadoop

• Open source implementation of MapReduce (Dean & Ghemawat, Google, 2004)

• Hadoop = MapReduce + HDFS
• HDFS: Distributed file system, data stored in 64 MB (default) blocks
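
As a rough illustration of what the block size means for a single file, the sketch below computes how many blocks a file of a given size occupies. The helper name `hdfs_block_count` is hypothetical, and replication is ignored.

```python
def hdfs_block_count(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """Number of blocks needed to store a file (the last block may be partly filled)."""
    if file_size_bytes < 0:
        raise ValueError("file size must not be negative")
    # Ceiling division without floating point.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


MiB = 1024 * 1024
print(hdfs_block_count(200 * MiB))   # 3 full blocks plus one holding the remaining 8 MiB
print(hdfs_block_count(1024 * MiB))  # an exact multiple of the block size
```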

Page 15: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Hadoop

• Job tracker (master) manages job execution on task trackers (workers)

• Each machine is configured to dedicate processing cores to MapReduce tasks (each core is a worker)

• Name node manages HDFS, i.e. distribution of data blocks on data nodes

Page 16: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Hadoop job building blocks

Map/reduce application (JAR):

• Job configuration: Set or overwrite configuration parameters.
• Map method: Create intermediate key/value pair output
• Reduce method: Aggregate intermediate key/value pair output from map
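
The interplay of the map and reduce methods can be sketched in a few lines of plain Python. This is an in-memory illustration only, not the Hadoop Java API: `run_mapreduce`, `map_mime`, `reduce_count`, and the sample records are hypothetical names and data.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group the intermediate pairs by key, then reduce, all in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


def map_mime(record):
    """Map: emit one intermediate (MIME type, 1) pair per identified file."""
    _path, mime_type = record
    yield mime_type, 1


def reduce_count(_key, values):
    """Reduce: aggregate all intermediate values that share one key."""
    return sum(values)


identified = [("a.html", "text/html"), ("b.jp2", "image/jp2"), ("c.html", "text/html")]
counts = run_mapreduce(identified, map_mime, reduce_count)
print(counts)
```

On a cluster the grouping step is done by the framework between the map and reduce phases; here a dictionary stands in for it.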

Page 17: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Cluster

Page 18: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Large scale execution environment

Components shown in the diagram: Apache Tomcat web application, Taverna Server (REST API), Hadoop Jobtracker, file server, cluster

Page 19: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Example: Characterisation on a large document collection

• Using "Tool" service, remote ssh execution
• Orchestration of Hadoop jobs (Hadoop Streaming API, Hadoop Map/Reduce, and Hive)
• Available on myExperiment: http://www.myexperiment.org/workflows/3105
• See blog post: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna

Page 20: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Create a text file containing the JPEG2000 input file paths and read the image metadata using Exiftool via the Hadoop Streaming API.

Page 21: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Reading image metadata

Jp2PathCreator (find): text file with the JPEG2000 input file paths, 1.4 GB

/NAS/Z119585409/00000001.jp2 /NAS/Z119585409/00000002.jp2 /NAS/Z119585409/00000003.jp2 …
/NAS/Z117655409/00000001.jp2 /NAS/Z117655409/00000002.jp2 /NAS/Z117655409/00000003.jp2 …
/NAS/Z119585987/00000001.jp2 /NAS/Z119585987/00000002.jp2 /NAS/Z119585987/00000003.jp2 …
/NAS/Z119584539/00000001.jp2 /NAS/Z119584539/00000002.jp2 /NAS/Z119584539/00000003.jp2 …
/NAS/Z119599879/00000001.jp2 /NAS/Z119589879/00000002.jp2 /NAS/Z119589879/00000003.jp2 ...

HadoopStreamingExiftoolRead (reading files from NAS): output with one value per page, 1.2 GB

Z119585409/00000001 2345 Z119585409/00000002 2340 Z119585409/00000003 2543 …
Z117655409/00000001 2300 Z117655409/00000002 2300 Z117655409/00000003 2345 …
Z119585987/00000001 2300 Z119585987/00000002 2340 Z119585987/00000003 2432 …
Z119584539/00000001 5205 Z119584539/00000002 2310 Z119584539/00000003 2134 …
Z119599879/00000001 2312 Z119589879/00000002 2300 Z119589879/00000003 2300 ...

Runtime: ~ 5 h + ~ 38 h = ~ 43 h for 60,000 books (24 million pages)
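
A mapper for this kind of Hadoop Streaming step could look roughly like the sketch below. It is not the code used in the workflow: the function names are hypothetical, the assumption is that the value of interest is the ExifTool `ImageWidth` tag, and the tab separator follows the Hadoop Streaming default for key/value lines.

```python
import io
import subprocess


def path_to_key(path):
    """Turn '/NAS/Z119585409/00000001.jp2' into 'Z119585409/00000001'."""
    parts = path.strip().split("/")
    book, page = parts[-2], parts[-1]
    return f"{book}/{page.rsplit('.', 1)[0]}"


def read_width(path):
    """Ask exiftool for the ImageWidth tag; -s3 prints only the tag value."""
    result = subprocess.run(
        ["exiftool", "-s3", "-ImageWidth", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def run_mapper(lines, out, width_reader=read_width):
    """Read one file path per line and write one 'key<TAB>width' line per path."""
    for line in lines:
        path = line.strip()
        if path:
            out.write(f"{path_to_key(path)}\t{width_reader(path)}\n")


# Dry run with a stand-in reader, so that neither exiftool nor the NAS is needed.
buf = io.StringIO()
run_mapper(["/NAS/Z119585409/00000001.jp2\n"], buf, width_reader=lambda p: "2345")
print(buf.getvalue(), end="")
```

In an actual streaming job the mapper script would call `run_mapper(sys.stdin, sys.stdout)`, so that Hadoop feeds it the lines of the path file.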

Page 22: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Create text file containing HTML input file paths and create one sequence file with the complete file content in HDFS.

Page 23: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

SequenceFile creation

HtmlPathCreator (find): text file with the HTML input file paths, 1.4 GB

/NAS/Z119585409/00000707.html /NAS/Z119585409/00000708.html /NAS/Z119585409/00000709.html …
/NAS/Z138682341/00000707.html /NAS/Z138682341/00000708.html /NAS/Z138682341/00000709.html …
/NAS/Z178791257/00000707.html /NAS/Z178791257/00000708.html /NAS/Z178791257/00000709.html …
/NAS/Z967985409/00000707.html /NAS/Z967985409/00000708.html /NAS/Z967985409/00000709.html …
/NAS/Z196545409/00000707.html /NAS/Z196545409/00000708.html /NAS/Z196545409/00000709.html ...

SequenceFileCreator (reading files from NAS): one sequence file, 997 GB (uncompressed), with one entry per page

Z119585409/00000707
Z119585409/00000708
Z119585409/00000709
Z119585409/00000710
Z119585409/00000711
Z119585409/00000712

Runtime: ~ 5 h + ~ 24 h = ~ 29 h for 60,000 books (24 million pages)

Page 24: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Execute Hadoop MapReduce job using the sequence file created before in order to calculate the average paragraph block width.

Page 25: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

HTML Parsing

HadoopAvBlockWidthMapReduce (input: SequenceFile, output: text file)

Input entries:

Z119585409/00000001
Z119585409/00000002
Z119585409/00000003
Z119585409/00000004
Z119585409/00000005 ...

Map output:

Z119585409/00000001 2100 Z119585409/00000001 2200 Z119585409/00000001 2300 Z119585409/00000001 2400
Z119585409/00000002 2100 Z119585409/00000002 2200 Z119585409/00000002 2300 Z119585409/00000002 2400
Z119585409/00000003 2100 Z119585409/00000003 2200 Z119585409/00000003 2300 Z119585409/00000003 2400
Z119585409/00000004 2100 Z119585409/00000004 2200 Z119585409/00000004 2300 Z119585409/00000004 2400
Z119585409/00000005 2100 Z119585409/00000005 2200 Z119585409/00000005 2300 Z119585409/00000005 2400

Reduce output:

Z119585409/00000001 2250
Z119585409/00000002 2250
Z119585409/00000003 2250
Z119585409/00000004 2250
Z119585409/00000005 2250

Runtime: ~ 6 h for 60,000 books (24 million pages)
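
The averaging done in the reduce step can be sketched as follows. This is an illustrative Python reducer in the style of Hadoop Streaming, not the Java job from the workflow: it assumes tab-separated `key<TAB>width` lines that arrive sorted by key, and it rounds the average down to an integer.

```python
from itertools import groupby


def parse(line):
    """Split one 'key<TAB>width' line into (key, int width)."""
    key, value = line.rstrip("\n").split("\t")
    return key, int(value)


def reduce_average(sorted_lines):
    """Yield (key, average width) for lines that are already sorted by key."""
    pairs = (parse(line) for line in sorted_lines if line.strip())
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        widths = [width for _, width in group]
        yield key, sum(widths) // len(widths)  # integer average (assumption)


map_output = [
    "Z119585409/00000001\t2100\n",
    "Z119585409/00000001\t2200\n",
    "Z119585409/00000001\t2300\n",
    "Z119585409/00000001\t2400\n",
    "Z119585409/00000002\t2100\n",
    "Z119585409/00000002\t2200\n",
    "Z119585409/00000002\t2300\n",
    "Z119585409/00000002\t2400\n",
]
averages = list(reduce_average(map_output))
for key, avg in averages:
    print(f"{key}\t{avg}")
```

The four block widths per page (2100, 2200, 2300, 2400) average to 2250, which matches the reduce output shown on the slide.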

Page 26: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Create the Hive tables and load the generated data into the Hive database.

Page 27: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Analytic Queries

HiveLoadExifData & HiveLoadHocrData

Runtime: ~ 6 h for 60,000 books (24 million pages)

Tables: htmlwidth, jp2width

Sample rows (id, width):

Z119585409/00000001 1870
Z119585409/00000002 2100
Z119585409/00000003 2015
Z119585409/00000004 1350
Z119585409/00000005 1700

Z119585409/00000001 2250
Z119585409/00000002 2150
Z119585409/00000003 2125
Z119585409/00000004 2125
Z119585409/00000005 2250

CREATE TABLE jp2width (jid STRING, jwidth INT)

CREATE TABLE htmlwidth (hid STRING, hwidth INT)

Page 28: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Analytic Queries

HiveSelect

Runtime: ~ 6 h for 60,000 books (24 million pages)

Tables: htmlwidth, jp2width

select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid
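
The effect of the join can be tried out without a Hive installation. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Hive, so the column types differ (TEXT/INTEGER instead of STRING/INT), and the inserted rows are illustrative values only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jp2width (jid TEXT, jwidth INTEGER)")
conn.execute("CREATE TABLE htmlwidth (hid TEXT, hwidth INTEGER)")

# Illustrative rows: page 2 has no HTML width, page 3 has no JP2 width.
conn.executemany(
    "INSERT INTO jp2width VALUES (?, ?)",
    [("Z119585409/00000001", 2250), ("Z119585409/00000002", 2150)],
)
conn.executemany(
    "INSERT INTO htmlwidth VALUES (?, ?)",
    [("Z119585409/00000001", 1870), ("Z119585409/00000003", 2015)],
)

# Same join as on the slide; ORDER BY is added only to make the output stable.
rows = conn.execute(
    "SELECT jid, jwidth, hwidth FROM jp2width "
    "INNER JOIN htmlwidth ON jid = hid ORDER BY jid"
).fetchall()
print(rows)
conn.close()
```

Because it is an inner join, only pages that appear in both tables are returned, each with its image width and its HTML block width side by side.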

Page 29: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Run a simple Hive query in order to test whether the database has been created successfully.

Page 30: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Example: Web Archiving

Page 31: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Hands on – Virtual machine

• 0.20.2+923.421 Pseudo-distributed Hadoop configuration
• Chromium web browser with Hadoop Admin Links
• Taverna Workbench 2.3.0
• NetBeans IDE 7.1.2
• SampleHadoopCommand.txt (executable Hadoop command for DEMO1)
• Latest patches

Page 32: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Hands on – VM setup

• Unpack scape4youTraining.tar.gz
• VirtualBox: Machine => Add => Browse to folder => select VBOX file
• VM instance login:
  • user: scape
  • pw: scape123

Page 33: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Hands on – Demo1

• Using Hadoop for analysing ARC files
• Located at: /example/sampleIN/ (HDFS)
• Execution via command in: SampleHadoopCommand.txt (on Desktop)
• Result can then be found at: /example/sample_OUT/

Page 34: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Hands on – Demo2

• Using Taverna for analysing ARC files
• Workflow: /home/scape/scanARC/scanARC_TIKA.t2flow
• ADD FILE LOCATION (not add value!!)
• Input: /home/scape/scanARC/input/ONBSample.txt
• Result: ~/scanARC/outputCSV/fullTIKAReport.csv

• See ~/scanARC/outputGraphics/ graphicsTIKA/tika-